This document discusses common mistakes and issues users encounter when using Neo4j's Graph Data Science (GDS) library. It provides tips to avoid errors such as incompatible GDS and Neo4j versions, empty or identical algorithm outputs, stochastic versus deterministic algorithms, and running out of memory. The document recommends best practices for configuration, projections, the catalog, algorithm execution, and avoiding conflicts in single user mode.
4. “Neo4j crashes when I put the JAR in the plugin directory”
Either:
● You’re running the wrong version of Neo4j
○ GDS 1.0 and 1.1 support Neo4j 3.5.9+, not 4.0
○ GDS 1.2 supports Neo4j 4.0.x, not 3.5
● You’ve installed the Graph Algorithms library and GDS together, but they’re not compatible
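The version rules above can be written down as a small lookup. This is a minimal Python sketch, not part of GDS: the function names `parse_version` and `is_compatible` are made up for illustration, and the table only covers the GDS releases named on this slide.

```python
def parse_version(text):
    """Turn a dotted version string such as '3.5.14' into a tuple of ints."""
    return tuple(int(part) for part in text.split("."))


def is_compatible(gds_version, neo4j_version):
    """Apply the compatibility rules listed on the slide (GDS 1.0 to 1.2 only)."""
    gds = parse_version(gds_version)[:2]
    neo4j = parse_version(neo4j_version)
    if gds in ((1, 0), (1, 1)):
        # GDS 1.0 and 1.1: Neo4j 3.5.9 or later in the 3.5 line, not 4.0.
        return neo4j[:2] == (3, 5) and neo4j >= (3, 5, 9)
    if gds == (1, 2):
        # GDS 1.2: Neo4j 4.0.x, not 3.5.
        return neo4j[:2] == (4, 0)
    raise ValueError("GDS version not covered by this sketch: " + gds_version)


for gds, neo4j in [("1.1", "3.5.14"), ("1.1", "4.0.3"), ("1.2", "4.0.3")]:
    print(gds, neo4j, is_compatible(gds, neo4j))
```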
Common Mistakes
5. “I can't find feature X”
Make sure you're using the latest GDS version. Upgrade using Neo4j
Desktop or from the Download Center:
https://neo4j.com/download-center/#algorithms
Also make sure you're reading the corresponding version of the
documentation: https://neo4j.com/docs/graph-data-science/1.1/
6. “My algorithm ran but returned all 0s”
or
“My algorithm ran but all the results are the same”
or
“My algorithm ran but every node is in its own community”
or
“My algorithm ran but the shortest path is infinity”
These may result from having no connections between nodes.
Check the node and relationship projections -- are there edges
between your nodes?
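To see why an empty relationship projection produces these symptoms, here is a minimal plain-Python sketch. It does not use GDS: `weakly_connected_components` and `out_degree` are simplified stand-ins written for this illustration, and the node names are invented.

```python
def weakly_connected_components(nodes, relationships):
    """Union-find over the relationships; returns node -> component id."""
    parent = {node: node for node in nodes}

    def find(node):
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path halving
            node = parent[node]
        return node

    for source, target in relationships:
        root_source, root_target = find(source), find(target)
        if root_source != root_target:
            parent[root_target] = root_source
    return {node: find(node) for node in nodes}


def out_degree(nodes, relationships):
    """Simplest possible centrality: number of outgoing relationships."""
    degree = {node: 0 for node in nodes}
    for source, _ in relationships:
        degree[source] += 1
    return degree


nodes = ["a", "b", "c", "d"]
empty_projection = []  # e.g. the relationship type was misspelled

components = weakly_connected_components(nodes, empty_projection)
scores = out_degree(nodes, empty_projection)
print(len(set(components.values())), "components for", len(nodes), "nodes")
print(scores)
```

With no relationships, every node ends up in its own component and every score is identical, which matches the symptoms listed above.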
7. “My algorithm ran, but all the nodes are in the same community”
Your graph may be too densely connected … but all hope is not lost!
1) Try a different algorithm -- if WCC finds a single community, that
doesn’t mean LPA will as well.
2) Try using weights (for Louvain, LPA) or thresholds (for WCC)
3) If you’re using Louvain, check the intermediate communities
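The threshold idea for WCC can be sketched in plain Python. This is an illustration only, assuming that a relationship counts when its weight is strictly above the threshold; `components_with_threshold` and the example weights are invented for this sketch and are not GDS code.

```python
def components_with_threshold(nodes, weighted_relationships, threshold=None):
    """Union-find that ignores relationships whose weight is not above threshold."""
    parent = {node: node for node in nodes}

    def find(node):
        while parent[node] != node:
            parent[node] = parent[parent[node]]
            node = parent[node]
        return node

    for source, target, weight in weighted_relationships:
        if threshold is not None and weight <= threshold:
            continue  # too weak to count as a connection
        root_source, root_target = find(source), find(target)
        if root_source != root_target:
            parent[root_target] = root_source

    groups = {}
    for node in nodes:
        groups.setdefault(find(node), set()).add(node)
    return list(groups.values())


nodes = ["a", "b", "c", "d", "e", "f"]
relationships = [
    ("a", "b", 0.9), ("b", "c", 0.8), ("a", "c", 0.9),  # tight group 1
    ("d", "e", 0.9), ("e", "f", 0.7), ("d", "f", 0.8),  # tight group 2
    ("c", "d", 0.1),                                     # weak bridge
]

unfiltered = components_with_threshold(nodes, relationships)
filtered = components_with_threshold(nodes, relationships, threshold=0.5)
print(len(unfiltered), "component without a threshold")
print(len(filtered), "components with threshold 0.5")
```

Without a threshold the weak bridge merges everything into one community; with it, the two tightly connected groups separate.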
8. “The algorithm ran, so it must be right… right?”
We’ve built in some guardrails - for example, you can’t run an algo
that’s incompatible with the direction of your graph - but the library
isn’t foolproof.
- You could project a bunch of different node labels as if they’re the
same thing
- You could treat any number as if it’s a weight or a seed property
- You could run a weighted algorithm on unweighted relationships with
default settings
- You could set weird values for tolerance, damping factor, iterations...
9. “I ran my algorithm twice and got different results”
Yes, for two reasons:
1) A number of the algorithms are stochastic, meaning they use a
heuristic that is non-deterministic.
2) Concurrency may cause non-deterministic results because we
can’t control the order in which threads are scheduled.
It doesn’t mean the results are wrong!
- Use seeding so you keep the results from the first run
- Know which algorithms are stochastic and choose appropriately
- Check if an algorithm converged on an answer
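The effect of seeding and convergence can be illustrated with a toy label propagation in plain Python. This is a simplified sketch, not the GDS implementation: `label_propagation`, its parameters, and the example graph are assumptions made for the illustration. Nodes are visited in random order and ties are broken randomly, so unseeded runs may differ between random generators.

```python
import random


def label_propagation(neighbors, seed_labels=None, rng=None, max_iterations=20):
    """Asynchronous label propagation; returns (labels, converged)."""
    rng = rng or random.Random()
    # Without seeding, every node starts in its own community.
    labels = dict(seed_labels) if seed_labels else {node: node for node in neighbors}
    for _ in range(max_iterations):
        order = list(neighbors)
        rng.shuffle(order)  # the visiting order is one source of randomness
        changed = False
        for node in order:
            counts = {}
            for neighbor in neighbors[node]:
                label = labels[neighbor]
                counts[label] = counts.get(label, 0) + 1
            if not counts:
                continue
            best = max(counts.values())
            candidates = sorted(l for l, c in counts.items() if c == best)
            if labels[node] in candidates:
                continue  # keep the current label when it is already a best choice
            labels[node] = rng.choice(candidates)  # random tie-breaking
            changed = True
        if not changed:
            return labels, True  # a full pass without changes: converged
    return labels, False


# Two triangles joined by a single bridge (2-3).
neighbors = {
    0: [1, 2], 1: [0, 2], 2: [0, 1, 3],
    3: [2, 4, 5], 4: [3, 5], 5: [3, 4],
}

first, first_converged = label_propagation(neighbors, rng=random.Random(1))
# Seed the second run with the labels of the first, using a different generator.
second, second_converged = label_propagation(
    neighbors, seed_labels=first, rng=random.Random(2)
)
print(first_converged, second_converged, first == second)
```

Because the first run converged, its labels are a fixed point of the update rule, so the seeded second run reproduces them regardless of the random generator.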
10. “I get different results when I run my algorithm with different X”
Yes - that’s intended behavior.
In particular, be mindful of:
- orientation: 'UNDIRECTED' will double all your edges
- aggregation: 'SINGLE' will deduplicate parallel edges
- maxIterations, tolerance control when an algorithm stops
- nodeLabels, relationshipTypes control what parts of your
projected graph are used in the algorithm
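The first two settings can be mimicked on a plain list of relationships. This Python sketch is an illustration of the idea only; the `project` helper is hypothetical and does not reproduce how GDS stores projected graphs.

```python
def project(relationships, orientation="NATURAL", aggregation="NONE"):
    """Return the relationship list an algorithm would see after projection."""
    if orientation == "NATURAL":
        projected = list(relationships)
    elif orientation == "REVERSE":
        projected = [(target, source) for source, target in relationships]
    elif orientation == "UNDIRECTED":
        # Each relationship is kept once in each direction.
        projected = list(relationships) + [
            (target, source) for source, target in relationships
        ]
    else:
        raise ValueError("unknown orientation: " + orientation)
    if aggregation == "SINGLE":
        # Keep one relationship per (source, target) pair, preserving order.
        projected = list(dict.fromkeys(projected))
    return projected


# Two parallel a->b relationships and one b->c relationship.
relationships = [("a", "b"), ("a", "b"), ("b", "c")]

natural = project(relationships)
undirected = project(relationships, orientation="UNDIRECTED")
single = project(relationships, aggregation="SINGLE")
undirected_single = project(
    relationships, orientation="UNDIRECTED", aggregation="SINGLE"
)
print(len(natural), len(undirected), len(single), len(undirected_single))
```

Any algorithm that counts relationships, such as a degree-based score, will see different inputs across these four projections, which is why the results differ.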
11. “This algorithm has literally been running for three months”
Some of the algorithms are really slow by their nature, specifically:
- Betweenness Centrality
- Node Similarity, or any of the alpha similarity algorithms
1) Check progress in the debug log
2) Break up the problem -- run WCC and execute on individual
components, or within individual communities
3) Set topK, topN, degreeCutoff for similarity
4) Use an approximation method:
gds.alpha.betweenness.sampled, gds.alpha.ml.ann
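How a degree cutoff and a top-K limit shrink a similarity computation can be sketched in plain Python. This is a toy Jaccard similarity, not the GDS implementation; `node_similarity`, its parameter names, and the sample data are assumptions for illustration. Unlike an optimized implementation, this sketch still compares every eligible pair; the cutoff reduces how many nodes are eligible and `top_k` limits how many results are kept per node.

```python
import heapq
from itertools import combinations


def node_similarity(neighbor_sets, top_k=10, degree_cutoff=1):
    """Jaccard similarity per node, keeping only the top_k most similar nodes."""
    # Nodes with fewer neighbors than the cutoff are skipped entirely.
    eligible = {
        node: targets
        for node, targets in neighbor_sets.items()
        if len(targets) >= degree_cutoff
    }
    scores = {node: [] for node in eligible}
    for first, second in combinations(eligible, 2):
        shared = len(eligible[first] & eligible[second])
        if shared == 0:
            continue
        score = shared / len(eligible[first] | eligible[second])
        scores[first].append((score, second))
        scores[second].append((score, first))
    # Keep only the top_k highest-scoring pairs for each node.
    return {node: heapq.nlargest(top_k, pairs) for node, pairs in scores.items()}


neighbor_sets = {
    "alice": {"x", "y", "z"},
    "bob": {"x", "y"},
    "carol": {"y", "z"},
    "dave": {"x"},
}

result = node_similarity(neighbor_sets, top_k=1, degree_cutoff=2)
for node, pairs in result.items():
    print(node, pairs)
```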
12. “My algorithm didn't run because of the memory guard”
The memory guard protects your database from crashing by blocking
operations that are estimated to need more memory than is available.
Sometimes it stops an algorithm from running. It can be bypassed by
setting the sudo configuration parameter to true.
13. “My JVM went out of memory!”
The memory footprint can grow during a workload, for example when
using the .mutate execution mode.
To free up memory, you could
• Drop unused graphs
• Remove unused properties or relationships
Some algorithms can also be configured to use less memory.
15. Configuration
Do:
- Use memory estimation to find out about memory requirements
- Configure Neo4j to use as much heap as possible (dbms.memory.heap.max_size)
- Run algorithms on a single instance or read replica
Don't:
- Configure Neo4j to use as much page cache as possible (dbms.memory.pagecache.size)
- Run algorithms on a core member of a cluster
16. Projections
Do:
- Only load nodes and relationships that you plan to use
- Only load necessary properties
- Avoid redundant relationship projections (natural + reverse == undirected)
- Consider aggregating parallel relationships
Don't:
- Use '*' when creating in-memory graphs
- Use Cypher projections in production
17. Catalog
Do:
- Use the catalog if multiple algorithms are run on the same graph
- Drop graphs that are not needed any more
- One large in-memory graph is better than multiple small ones
Don't:
- Use the catalog for one-time algorithm executions
- Update your underlying data in Neo4j without refreshing your graph projection
18. Algorithm Execution
Do:
- Use seeding if possible
- Use threshold if possible
- Use tolerance if possible
- Try different concurrency settings
Don't:
- Run on anonymous graphs in production
19. Single User Mode
Do:
- Try to run only one workload at a time
- Avoid writing into properties used in production
Don't:
- Alter the graph in the same transaction that loads the graph
- Run algorithms on operational systems
20. Caveats - Single User Mode
We assume single user mode:
- Catalog is partitioned by user
- Algos will grab as many resources as they can
- Concurrency can be controlled by the concurrency parameter
- One user could e.g. remove a graph while the algo is running
21. Caveats - Transactionality
Loading
- One transaction per loader thread
- Changes made in the transaction that calls the procedure are not visible to the loader
- Loading the same graph twice can result in different graphs
Write back
- Happens in batches of 10k-100k elements
- Node values are written in parallel: one transaction per thread and batch
- Relationship values are written single-threaded
- Rollback is not possible once a batch transaction has been committed
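The consequence of batched write-back can be shown with a small plain-Python simulation. It is a sketch under stated assumptions: the dictionary `store` stands in for the database, `write_back` and `fail_on_batch` are invented for the illustration, and the batch size is tiny compared with the 10k-100k elements mentioned above.

```python
def write_back(values, batch_size, store, fail_on_batch=None):
    """Write values into store one batch at a time; each batch is its own commit."""
    items = list(values.items())
    committed = 0
    for index, start in enumerate(range(0, len(items), batch_size)):
        if index == fail_on_batch:
            # Simulated failure: batches committed earlier are NOT undone.
            raise RuntimeError("write failed in batch %d" % index)
        batch = items[start:start + batch_size]
        store.update(batch)  # "commit" this batch
        committed += 1
    return committed


scores = {node_id: node_id * 0.1 for node_id in range(10)}
store = {}

try:
    write_back(scores, batch_size=4, store=store, fail_on_batch=2)
except RuntimeError as error:
    print(error)

# The first two batches (8 values) stay written even though the call failed.
print(len(store), "of", len(scores), "values were written")
```

A failed write can therefore leave a property only partially populated, which is one reason to avoid writing into properties that production queries depend on.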