Graph Data Science WORST Practices

Worst Practices and Gotchas
Graph Analytics team
Neo4j

About the presenter
Mats Rydberg
mats@neo4j.org
Software Engineer, Neo4j
Team Lead Graph Analytics

Overview
Objectives: Learn the most common issues and how to avoid them.

“Neo4j crashes when I put the JAR in the plugin directory”
Either:
● You’re running the wrong version of Neo4j
○ GDS 1.0 and 1.1 support Neo4j 3.5.9+, not 4.0
○ GDS 1.2 supports Neo4j 4.0.x, not 3.5
● You’ve installed graph algorithms and GDS together -- but they’re
not compatible
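
The version rules above can be written down as a small check. This is only a sketch of the compatibility matrix stated on this slide; the function names `parse_version` and `gds_supports` are made up for illustration and are not part of GDS or Neo4j.

```python
def parse_version(text):
    """Turn a version string such as '3.5.14' into the tuple (3, 5, 14)."""
    return tuple(int(part) for part in text.split("."))


def gds_supports(gds_version, neo4j_version):
    """Return True if the slide's matrix lists this pairing as supported.

    Hypothetical helper: it encodes only the three GDS versions named on
    the slide and raises for anything else.
    """
    gds = parse_version(gds_version)[:2]
    neo = parse_version(neo4j_version)
    if gds in ((1, 0), (1, 1)):
        # GDS 1.0 and 1.1: Neo4j 3.5.9 or later within 3.5, not 4.0
        return neo[:2] == (3, 5) and neo >= (3, 5, 9)
    if gds == (1, 2):
        # GDS 1.2: any Neo4j 4.0.x, not 3.5
        return neo[:2] == (4, 0)
    raise ValueError(f"GDS {gds_version} is not covered by this sketch")


print(gds_supports("1.1", "3.5.14"))  # supported pairing
print(gds_supports("1.2", "3.5.14"))  # unsupported pairing
```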
Common Mistakes

“I can't find feature X”
Make sure you're using the latest GDS version. Upgrade using Neo4j
Desktop or from the Download Center:
https://neo4j.com/download-center/#algorithms
Also make sure you're reading the corresponding version of the
documentation: https://neo4j.com/docs/graph-data-science/1.1/

“My algorithm ran but returned all 0s”
or
“My algorithm ran but all the results are the same”
or
“My algorithm ran but every node is in its own community”
or
“My algorithm ran but the shortest path is infinity”
These may result from having no connections between nodes.
Check the node and relationship projections -- are there edges
between your nodes?
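
A quick way to picture this check: count how many relationships survive the projection and how many nodes end up with no edges at all. The sketch below works on plain Python lists, not on a GDS graph; `projection_summary` is a hypothetical helper. It assumes that a relationship is only kept when both of its end nodes are part of the node projection.

```python
def projection_summary(node_ids, relationships):
    """Summarise a projection given as node ids and (source, target) pairs."""
    nodes = set(node_ids)
    connected = set()
    kept = 0
    for source, target in relationships:
        # A relationship is dropped if either end is outside the node projection.
        if source in nodes and target in nodes:
            kept += 1
            connected.add(source)
            connected.add(target)
    return {
        "nodes": len(nodes),
        "relationships": kept,
        "isolated": len(nodes - connected),
    }


# Person nodes 1-3, but every relationship points at a Movie node (10, 11)
# that was left out of the node projection.
people = [1, 2, 3]
acted_in = [(1, 10), (2, 11), (3, 10)]
print(projection_summary(people, acted_in))
```

If `relationships` comes back as 0, or `isolated` equals `nodes`, the symptoms listed on this slide are expected.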

“My algorithm ran, but all the nodes are in the same community”
Your graph may be too densely connected … but all hope is not lost!
1) Try a different algorithm -- if WCC finds a single community, that
doesn’t mean LPA will as well.
2) Try using weights (for Louvain, LPA) or thresholds (for WCC)
3) If you’re using Louvain, check the intermediate communities
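
To see why a threshold helps WCC, here is a toy union-find version in plain Python. It is a sketch and not the library's implementation; it assumes that only relationships whose weight is strictly above the threshold are considered, and it expects comparable node ids such as integers.

```python
def wcc(nodes, weighted_rels, threshold=None):
    """Map each node to a component id (the smallest node id in its component)."""
    parent = {n: n for n in nodes}

    def find(n):
        # Follow parent pointers to the root, halving the path on the way.
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n

    for source, target, weight in weighted_rels:
        if threshold is not None and weight <= threshold:
            continue  # too weak: ignore this relationship
        root_s, root_t = find(source), find(target)
        if root_s != root_t:
            # Keep the smaller id as the root so component ids are stable.
            parent[max(root_s, root_t)] = min(root_s, root_t)
    return {n: find(n) for n in nodes}


nodes = [0, 1, 2, 3, 4, 5]
rels = [(0, 1, 0.9), (1, 2, 0.8), (2, 3, 0.1), (3, 4, 0.7), (4, 5, 0.9)]
print(wcc(nodes, rels))                 # one component
print(wcc(nodes, rels, threshold=0.5))  # the weak 2-3 link is cut: two components
```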

“The algorithm ran, so it must be right… right?”
We’ve built in some guardrails - for example, you can’t run an algo
that’s incompatible with the direction of your graph - but the library
isn’t foolproof.
- You could project a bunch of different node labels as if they’re the
same thing
- You could treat any number as if it’s a weight or a seed property
- You could run a weighted algorithm on unweighted relationships with
default settings
- You could set weird values for tolerance, damping factor, iterations...

“I ran my algorithm twice and got different results”
Yes, for two reasons:
1) Some algorithms are stochastic: they use a heuristic that is
non-deterministic.
2) Concurrency may cause non-deterministic results because we
can’t control the order in which threads are scheduled.
It doesn’t mean the results are wrong!
- Use seeding so you keep the results from the first run
- Know which algorithms are stochastic and choose appropriately
- Check whether an algorithm converged on an answer

“I get different results when I run my algorithm with different X”
Yes - that’s intended behavior.
In particular, be mindful of:
- orientation: 'UNDIRECTED' will double all your edges
- aggregation: 'SINGLE' will deduplicate parallel edges
- maxIterations, tolerance control when an algorithm stops
- nodeLabels, relationshipTypes control what parts of your
projected graph are used in the algorithm
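
The effect of orientation and aggregation is easy to reproduce on a toy edge list. The function `project` below is a hypothetical stand-in written for this example, not a GDS call; it only mimics the two settings as described in the bullets above.

```python
def project(rels, orientation="NATURAL", aggregation="NONE"):
    """Return the list of (source, target) pairs an algorithm would see."""
    if orientation == "NATURAL":
        projected = list(rels)
    elif orientation == "REVERSE":
        projected = [(target, source) for source, target in rels]
    elif orientation == "UNDIRECTED":
        # Every relationship is visible from both of its end nodes.
        projected = []
        for source, target in rels:
            projected.append((source, target))
            projected.append((target, source))
    else:
        raise ValueError(f"unknown orientation: {orientation}")

    if aggregation == "SINGLE":
        # Collapse parallel relationships, keeping first-seen order.
        projected = list(dict.fromkeys(projected))
    elif aggregation != "NONE":
        raise ValueError(f"aggregation not covered by this sketch: {aggregation}")
    return projected


rels = [(1, 2), (1, 2), (2, 3)]  # note the parallel 1->2 relationship
print(len(project(rels)))                             # 3
print(len(project(rels, orientation="UNDIRECTED")))   # 6
print(project(rels, aggregation="SINGLE"))            # [(1, 2), (2, 3)]
```

Any degree-based result changes between these three projections, even though the stored data is identical.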

“This algorithm has literally been running for three months”
Some of the algorithms are really slow by their nature, specifically:
- Betweenness Centrality
- Node Similarity, or any of the alpha similarity algorithms
1) Check progress in the debug log
2) Break up the problem -- run WCC and execute on individual
components, or within individual communities
3) Set topK, topN, degreeCutoff for similarity
4) Use an approximation method:
gds.alpha.betweenness.sampled, gds.alpha.ml.ann
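
To illustrate point 3, here is a toy Jaccard similarity in plain Python with two of the pruning knobs. It is a simplified sketch, not the library's algorithm; the parameter names `top_k` and `degree_cutoff` only mirror `topK` and `degreeCutoff`, and it assumes that `degree_cutoff` is a lower bound on the number of neighbors a node needs to be compared at all.

```python
from itertools import combinations


def node_similarity(neighbors, top_k=10, degree_cutoff=1):
    """Return, per node, its top_k most similar nodes as (score, other) pairs."""
    # Nodes with too few neighbors are skipped entirely.
    candidates = {
        node: set(targets)
        for node, targets in neighbors.items()
        if len(set(targets)) >= degree_cutoff
    }
    scores = {node: [] for node in candidates}
    for a, b in combinations(sorted(candidates), 2):
        shared = candidates[a] & candidates[b]
        if not shared:
            continue  # nothing in common, no score
        jaccard = len(shared) / len(candidates[a] | candidates[b])
        scores[a].append((jaccard, b))
        scores[b].append((jaccard, a))
    # Keep only the best top_k scores per node (ties broken by node name).
    return {
        node: sorted(pairs, key=lambda p: (-p[0], p[1]))[:top_k]
        for node, pairs in scores.items()
    }


likes = {
    "alice": {"a", "b", "c"},
    "bob": {"a", "b"},
    "carol": {"c"},
    "dave": {"a", "b", "c", "d"},
}
print(node_similarity(likes, top_k=1, degree_cutoff=2))
```

With `degree_cutoff=2`, carol drops out before any comparison is made, and `top_k=1` keeps the result at one row per remaining node.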

“My algorithm didn't run because of the memory guard”
There's a feature to protect your database from crashing. Sometimes
it stops an algorithm from running. This can be disabled with the
sudo configuration parameter.

“My JVM went out of memory!”
The memory footprint can grow during a workload, for example when
using the .mutate execution mode.
To free up memory, you could
• Drop unused graphs
• Remove unused properties or relationships
Some algorithms can also be configured to use less memory.

… so how do I avoid this?

Configuration

Do
- Use memory estimation to find out about memory requirements
- Configure Neo4j to use as much heap as possible
  (dbms.memory.heap.max_size)
- Run algorithms on a single instance or read replica

Don't
- Configure Neo4j to use as much page cache as possible
  (dbms.memory.pagecache.size)
- Run algorithms on a core member of a cluster

Projections

Do
- Only load nodes and relationships that you plan to use
- Only load necessary properties
- Avoid redundant relationship projections (natural + reverse ==
  undirected)
- Consider aggregating parallel relationships

Don't
- Use '*' when creating in-memory graphs
- Use Cypher projections in production

Catalog

Do
- Use the catalog if multiple algorithms are run on the same graph
- Drop graphs that are not needed any more
- One large in-memory graph is better than multiple small ones

Don't
- Use the catalog for one-time algorithm executions
- Update your underlying data in Neo4j without refreshing your
  graph projection

Algorithm Execution

Do
- Use seeding if possible
- Use threshold if possible
- Use tolerance if possible
- Try different concurrency settings

Don't
- Run on anonymous graphs in production

Single User Mode

Do
- Try to run only one workload at a time
- Avoid writing into properties used in production

Don't
- Alter the graph in the same transaction that loads the graph
- Run algorithms on operational systems

Caveats - Single User Mode

We assume single user mode
- Catalog is partitioned by user
- Algos will grab as many resources as they can
- Concurrency can be controlled by the concurrency parameter
- One user could e.g. remove a graph while the algo is running

Caveats - Transactionality

Loading
- One transaction per loader thread
- Changes made in the transaction that calls the procedure are not
visible to the loader
- Loading the same graph twice can result in different graphs
Write back
- Happens in batches of 10k-100k elements
- Node values are written in parallel - one Tx per Thread and batch
- Relationship values are written single-threaded
- Rollback is not possible once a Batch-Tx has been committed
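
The write-back caveat is worth seeing in miniature: when each batch is its own transaction, a failure halfway through leaves the earlier batches in place. The sketch below simulates this in plain Python; `write_in_batches`, `flaky_commit` and the in-memory `store` are all made up for illustration and do not reflect how GDS writes internally.

```python
def write_in_batches(values, batch_size, commit):
    """Commit values batch by batch; return how many values were committed."""
    committed = 0
    for start in range(0, len(values), batch_size):
        batch = values[start:start + batch_size]
        commit(batch)  # one "transaction" per batch; may raise
        committed += len(batch)
    return committed


store = []  # stands in for the database


def flaky_commit(batch):
    """Simulated commit that fails once 20,000 values have been stored."""
    if len(store) >= 20_000:
        raise RuntimeError("simulated failure in the third batch")
    store.extend(batch)


try:
    write_in_batches(list(range(25_000)), 10_000, flaky_commit)
except RuntimeError as error:
    print(error)

# The first two batches were already committed and are not rolled back.
print(len(store))
```

After a failed write, the results can be partially written, so check what was stored before you rerun.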

Thank you!
Find us at
https://github.com/neo4j/graph-data-science
