The document provides an overview of the internal workings of Neo4j. It describes how the graph data is stored on disk as linked lists of fixed size records and how two levels of caching work - a low-level filesystem cache and a high-level object cache that stores node and relationship data in a structure optimized for traversals. It also explains how traversals are implemented using relationship expanders and evaluators to iteratively expand paths through the graph, and how Cypher builds on this but uses graph pattern matching rather than the full traversal system.
KubeConEU24-Monitoring Kubernetes and Cloud Spend with OpenCost
An overview of Neo4j Internals
1. An overview of
Neo4j Internals
tobias@neotechnology.com
Tobias Lindaaker twitter: @thobe, #neo4j (@neo4j)
web: neo4j.org neotechnology.com
Hacker @ Neo Technology my web: thobe.org
Monday, May 21, 2012
2. Outline
This is a rough structure of
how the pieces of Neo4j fit
together.
This talk will not cover
how disks/fs works, we
just assume it does. Traversals Core API Cypher
Nor will it cover the “Core
API”, you are assumed to
know it.
Node/Relationship
Thread local diffs
Object cache
FS Cache HA
Record files Transaction log
Disk(s)
2
Monday, May 21, 2012
3. Outline
Traversals Core API Cypher
Node/Relationship
Thread local diffs
Object cache
FS Cache HA
Let’s start at the
bottom: the on disk Record files Transaction log
storage file layout.
Disk(s)
3
Monday, May 21, 2012
4. Simple sample graph. It all boils down to
linked lists of fixed size records on disk.
Your graph on disk Properties are stored as a linked list of
property records, each holding key+value.
Each node/relationship references its first
property record.
The Nodes also reference the first node in its
relationship chain.
Each Relationship references its start and
Name: Alistair end node.
Age: 34 KNOWS It also references the prev/next relationship
record for the start/end node respectively
Name: Tobias
Age: 27
Nationality: Swedish
KNOWS
KNOWS
KNOWS
Name: Ian
Age: 42
Name: Jim
Age: 37
KNOWS Stuff: good
4
Monday, May 21, 2012
5. Simple sample graph. It all boils down to
linked lists of fixed size records on disk.
Your graph on disk Properties are stored as a linked list of
property records, each holding key+value.
Each node/relationship references its first
Name property record.
The Nodes also reference the first node in its
Alistair relationship chain.
Each Relationship references its start and
end node. Name
KNOWS It also references the prev/next relationship
record for the start/end node respectively Tobias
Age
34
Age
27
Nationality
KNOWS Swedish
KNOWS
KNOWS
Name
Jim
Age
Name 37
Stuff
Ian KNOWS
Age good
42 4
Monday, May 21, 2012
6. Simple sample graph. It all boils down to
linked lists of fixed size records on disk.
Your graph on disk Properties are stored as a linked list of
property records, each holding key+value.
Each node/relationship references its first
Name property record.
The Nodes also reference the first node in its
Alistair relationship chain.
Each Relationship references its start and
end node. Name
KNOWS It also references the prev/next relationship
record for the start/end node respectively Tobias
Age
34
Age
27
Nationality
KNOWS Swedish
KNOWS
KNOWS
Name
Jim
Age
Name 37
Stuff
Ian KNOWS
Age good
42 4
Monday, May 21, 2012
7. Simple sample graph. It all boils down to
linked lists of fixed size records on disk.
Your graph on disk Properties are stored as a linked list of
property records, each holding key+value.
Each node/relationship references its first
Name
SP EP property record.
The Nodes also reference the first node in its
Alistair relationship chain.
SN EN Each Relationship references its start and
end node. Name
KNOWS It also references the prev/next relationship
record for the start/end node respectively Tobias
Age
34
Age
27
SP EP
SP EP
SN EN Nationality
SN EN
KNOWS Swedish
KNOWS SP EP
SN EN
KNOWS
Name
SP EP Jim
Age
SN EN
Name 37
Stuff
Ian KNOWS
Age good
42 4
Monday, May 21, 2012
8. Simple sample graph. It all boils down to
linked lists of fixed size records on disk.
Your graph on disk Properties are stored as a linked list of
property records, each holding key+value.
Each node/relationship references its first
Name
SP EP property record.
The Nodes also reference the first node in its
Alistair relationship chain.
SN EN Each Relationship references its start and
end node. Name
KNOWS It also references the prev/next relationship
record for the start/end node respectively Tobias
Age
34
Age
27
SP EP
SP EP
SN EN Nationality
SN EN
KNOWS Swedish
KNOWS SP EP
SN EN
KNOWS
Name
SP EP Jim
Age
SN EN
Name 37
Stuff
Ian KNOWS
Age good
42 4
Monday, May 21, 2012
9. Simple sample graph. It all boils down to
linked lists of fixed size records on disk.
Your graph on disk Properties are stored as a linked list of
property records, each holding key+value.
Each node/relationship references its first
Name
SP EP property record.
The Nodes also reference the first node in its
Alistair relationship chain.
SN EN Each Relationship references its start and
end node. Name
KNOWS It also references the prev/next relationship
record for the start/end node respectively Tobias
Age
34
Age
27
SP EP
SP EP
SN EN Nationality
SN EN
KNOWS Swedish
KNOWS SP EP
SN EN
KNOWS
Name
SP EP Jim
Age
SN EN
Name 37
Stuff
Ian KNOWS
Age good
42 4
Monday, May 21, 2012
10. Simple sample graph. It all boils down to
linked lists of fixed size records on disk.
Your graph on disk Properties are stored as a linked list of
property records, each holding key+value.
Each node/relationship references its first
Name
SP EP property record.
The Nodes also reference the first node in its
Alistair relationship chain.
SN EN Each Relationship references its start and
end node. Name
KNOWS It also references the prev/next relationship
record for the start/end node respectively Tobias
Age
34
Age
27
SP EP
SP EP
SN EN Nationality
SN EN
KNOWS Swedish
KNOWS SP EP
SN EN
KNOWS
Name
SP EP Jim
Age
SN EN
Name 37
Stuff
Ian KNOWS
Age good
42 4
Monday, May 21, 2012
11. Store files
๏Node store
๏Relationship store
• Relationship type store
๏Property store
• Property key store
• (long) String store Short string and
array values are
inlined in the
•
property store, long
values are stored in
(long) Array store separate store files.
5
Monday, May 21, 2012
12. Neo4j Storage Record Layout
Node (9 bytes)
inUse nextRelId nextPropId
1 5 9
Relationship (33 bytes)
inUse firstNode secondNode relationshipType firstPrevRelId firstNextRelId secondPrevRelId secondNextRelId nextPropId
1 5 9 13 17 21 25 29 33
Relationship Type (5 bytes)
inUse typeBlockId
1 5
Property (33 bytes)
inUse type keyIndexId propBlock nextPropId
1 3 5 29 33
Property Index (9 bytes)
inUse propCount keyBlockId
1 5 9
Dynamic Store (125 bytes)
inUse next data
1 5
NeoStore (5 bytes)
inUse datum
1 5
Monday, May 21, 2012
13. Outline
Traversals Core API Cypher
Next: The t wo levels of cache Node/Relationship
in Neo4j. Thread local diffs
The low level FS Cache for the Object cache
record files.
And the high level Object
cache storing a structure
more optimized for traversal.
FS Cache HA
Record files Transaction log
Disk(s)
7
Monday, May 21, 2012
14. The caches
๏ Filesystem cache:
• Caches regions store file intofiles sized regions)
(divides each
of the store
equally
• The cache holds a fixed number of regions for each file
• Regions are evicted based on ahit in non-cached region)
(hit count vs. miss count, i.e.
LFU-like policy
• Default implementation of regions uses OS mmap
๏ Node/Relationship cache
• Cache a version more optimized for traversals
8
Monday, May 21, 2012
15. What we put in cache
ID Relationship ID refs
in: R1 R2 ... Rn The structure of the elements in the high level
type 1 object cache.
out R1 R2 ... Rn
On disk most of the information is contained
in: R1 R2 R3 ... Rn in the relationship records, with the nodes just
type 2 referencing their first relationship. In the
out R1 ... Rn
Node cache this is turned around: the nodes hold
references to all its relationships. The
... (grouped by type) relationships are simple, only holding its
properties.
The relationships for each node is grouped by
RelationshipType to allow fast traversal of a
specific type.
Key 1 Key 2 ... Key n
All references (dotted arrows) are by ID, and
traversals do indirect lookup through the cache.
Val 1
Val 2
ID start end type Val n
Relationship
Key 1 Key 2 ... Key n
Val 1
Val 2
Val n
9
Monday, May 21, 2012
16. Outline
So how do traversals work...
Traversals Core API Cypher
Node/Relationship
Thread local diffs
Object cache
FS Cache HA
Record files Transaction log
Disk(s)
10
Monday, May 21, 2012
17. Traversals - how do they work?
๏ RelationshipExpanders: given (a path to) a node, returns The surface
layer, the you
Relationships to continue traversing from that node interact with.
๏ Evaluators: given (a path to) a node, returns whether to:
• Continue traversing on that branch (i.e. expand) or not
• Include (the path to) the node in the result set or not
๏ Then a projection to Path, Node, or Relationship applied to
each Path in the result set.
... but also:
๏ Uniqueness level: policy for when it is ok to revisit a node
that has already been visited
๏ Implemented on top of the Core API
11
Monday, May 21, 2012
18. More on Traversals
๏ Fetch node data from cache - non-blocking access This is what happens
•
under the hood.
If not in cache, retrieve from storage, into cache
‣If region is in FS cache: blocking but short duration access
‣If region is outside FS cache: blocking slower access
๏ Get relationships from cached node
• If not fetched, retrieve from storage, by following chains
๏ Expand relationship(s) to end up on next node(s)
• The relationship knows the node, no need to fetch it yet
๏ Evaluate
• possibly emitting a Path into the result set
๏ Repeat 12
Monday, May 21, 2012
19. Outline
How is Cypher different?
Traversals Core API Cypher and how dowes it work?
Node/Relationship
Thread local diffs
Object cache
FS Cache HA
Record files Transaction log
Disk(s)
13
Monday, May 21, 2012
20. Cypher - Just convenient traversal descriptions?
๏ Builds on the same infrastructure as Traversals - Expanders
• but not on the full Traversal system
๏ Uses graph pattern matching for traversing the graph
• Recursive MATCH x-->y, backtracking z-->a-->b, z-->b
START x=...
matching with
x-->z, y-->z,
Red: pattern graph
Blue: actual graph
Green: start node
Purple: matches
14
Monday, May 21, 2012
21. Cypher - Just convenient traversal descriptions?
๏ Builds on the same infrastructure as Traversals - Expanders
• but not on the full Traversal system
๏ Uses graph pattern matching for traversing the graph
• Recursive MATCH x-->y, backtracking z-->a-->b, z-->b
START x=...
matching with
x-->z, y-->z,
Red: pattern graph
Blue: actual graph
Green: start node
Purple: matches
14
Monday, May 21, 2012
22. Cypher - Just convenient traversal descriptions?
๏ Builds on the same infrastructure as Traversals - Expanders
• but not on the full Traversal system
๏ Uses graph pattern matching for traversing the graph
• Recursive MATCH x-->y, backtracking z-->a-->b, z-->b
START x=...
matching with
x-->z, y-->z,
Red: pattern graph
Blue: actual graph
Green: start node
Purple: matches
14
Monday, May 21, 2012
23. Cypher - Just convenient traversal descriptions?
๏ Builds on the same infrastructure as Traversals - Expanders
• but not on the full Traversal system
๏ Uses graph pattern matching for traversing the graph
• Recursive MATCH x-->y, backtracking z-->a-->b, z-->b
START x=...
matching with
x-->z, y-->z,
Red: pattern graph
Blue: actual graph
Green: start node
Purple: matches
14
Monday, May 21, 2012
24. Cypher - Just convenient traversal descriptions?
๏ Builds on the same infrastructure as Traversals - Expanders
• but not on the full Traversal system
๏ Uses graph pattern matching for traversing the graph
• Recursive MATCH x-->y, backtracking z-->a-->b, z-->b
START x=...
matching with
x-->z, y-->z,
Red: pattern graph
Blue: actual graph
Green: start node
Purple: matches
14
Monday, May 21, 2012
25. Cypher - Just convenient traversal descriptions?
๏ Builds on the same infrastructure as Traversals - Expanders
• but not on the full Traversal system
๏ Uses graph pattern matching for traversing the graph
• Recursive MATCH x-->y, backtracking z-->a-->b, z-->b
START x=...
matching with
x-->z, y-->z,
Red: pattern graph
Blue: actual graph
Green: start node
Purple: matches
14
Monday, May 21, 2012
26. Cypher - Just convenient traversal descriptions?
๏ Builds on the same infrastructure as Traversals - Expanders
• but not on the full Traversal system
๏ Uses graph pattern matching for traversing the graph
• Recursive MATCH x-->y, backtracking z-->a-->b, z-->b
START x=...
matching with
x-->z, y-->z,
Red: pattern graph
Blue: actual graph
Green: start node
Purple: matches
14
Monday, May 21, 2012
27. Cypher - Just convenient traversal descriptions?
๏ Builds on the same infrastructure as Traversals - Expanders
• but not on the full Traversal system
๏ Uses graph pattern matching for traversing the graph
• Recursive MATCH x-->y, backtracking z-->a-->b, z-->b
START x=...
matching with
x-->z, y-->z,
Red: pattern graph
Blue: actual graph
Green: start node
Purple: matches
14
Monday, May 21, 2012
28. Cypher - Just convenient traversal descriptions?
๏ Builds on the same infrastructure as Traversals - Expanders
• but not on the full Traversal system
๏ Uses graph pattern matching for traversing the graph
• Recursive MATCH x-->y, backtracking z-->a-->b, z-->b
START x=...
matching with
x-->z, y-->z,
Red: pattern graph
Blue: actual graph
Green: start node
Purple: matches
14
Monday, May 21, 2012
29. What about gremlin?
๏ gremlin is a third party language, built by Marko Rodriguez of
Tinkerpop (a group of people who like to hack on graphs)
๏ Originally based on the idea of using xpath to describe traversals:
./HAS_CART/CONTAINS_ITEM/PURCHASED/PURCHASED
but bastardized to distinguish between nodes and relationships:
./outE[label=HAS_CART]/inV Traversals are close
/outE[label=CONTAINS_ITEM]/inV to xpath, which is
why xpath like
/inE[label=PURCHASED]/outV descriptions of
traversals seemed
/outE[label=PURCHASED]/inV like a good idea.
๏ xpath is not complete enough to express full algorithms, it needs a
host language, gremlin originally defined its own.
This changed Groovy as a more complete host language and
abandoned xpath in favor of method chaining
[ replace ‘/’ with ‘.’ ] 15
Monday, May 21, 2012
30. Gremlin compared to Cypher
๏start me=node:people(name={myname})
match me-[:HAS_CART]->cart-[:CONTAINS_ITEM]->item
item<-[:PURCHASED]-user-[:PURCHASED]->recommendation
return recommendation
๏ Cypher is declarative, describes what data to get - its shape
๏ Gremlin is imperative, prescribes how to get the data
๏ Cypher has more opportunities for optimization by the engine
๏ Gremlin can implement pagerank, Cypher can’t (yet?)
16
Monday, May 21, 2012
31. Outline
Traversals Core API Cypher Transactions involve t wo
parts:
The (thread local) changes
being done by an active
transaction,
Node/Relationship and the transaction replay log
Thread local diffs
Object cache for recovery.
FS Cache HA
Record files Transaction log
Disk(s)
17
Monday, May 21, 2012
32. Transaction Isolation
๏ Mutating operations are not written when performed
๏ They are stored in a thread confined transaction state object
๏ This prevents other threads from seeing uncommitted changes
from the transactions of other threads
๏ When Transaction.finish() is invoked the transaction is either
committed or rolled back
๏ Rollback is simple: discard the transaction state object
18
Monday, May 21, 2012
33. Transactions & Durability
๏ Commit is:
• Changes made in the transaction are collected as commands
• Commands are sorted to get predictable update order
‣This prevents concurrent readers from seeing inconsistent
data when the changes are applied to the store
• Write changes (in sorted order) to the transaction log
• Mark the transaction as committed in the log
• Apply the changes (in sorted order) to the store files
19
Monday, May 21, 2012
34. Recovery
๏ Transaction commands dictate state, they don’t modify state
• i.e. SET property "count" to 5
• rather than ADD 1 to property "count"
๏ Thus: Applying the same command twice yields the same state
๏ Recovery simply replays all transactions since the last safe point
๏ If tx A mutates node1.name, then tx B also mutates
node1.name that doesn’t matter, because the database is not
recovered until all transactions have been replayed
20
Monday, May 21, 2012
35. Outline
Traversals Core API Cypher
Node/Relationship
Thread local diffs
Object cache
FS Cache HA
Record files Transaction log High Availability in
Neo4j builds on top of
the transaction replay
Disk(s)
21
Monday, May 21, 2012
36. Outline The transaction logs are
shared bet ween all instances
in an High Availability setup,
all other parts operate on the
local data just like in the
standalone case.
Traversals Core API Cypher
Local Node/Relationship
Thread local diffs
Object cache
FS Cache HA
Record files Transaction log Shared
Disk(s)
22
Monday, May 21, 2012
37. • HA - the parts to it:
๏ Based on streaming transactions between servers
๏ All transactions are committed through the master
• Then (eventually) applied to the slaves
• Eventuality synchronizationupdate intervalby interaction
or when
defined by the
is mandated
๏ When writing to a slave:
• Locks coordinated through the master
• Transaction data buffered on theget a txid
applied first on the master to
slave
then applied with the same txid on the slave
23
Monday, May 21, 2012
38. Creating new Nodes and Relationships
๏ New Nodes/Relationships don’t need locks, so they don’t need a
transaction synced with master until the transaction commit
๏ They do need an ID that is unique and equal among all instances
๏ Each instance allocates IDs in blocks from the master, then assigns
them to new Nodes/Relationships locally
• This batch allocation can be seen in (Enterprise) 1000 as
Node/Relationship counts jumping in steps of
WebAdmin
24
Monday, May 21, 2012
39. HA synchronization points
๏ Transactions are uniquely identified by monotonically increasing ID
๏ All Requests from slave send the current latest txid on that slave
๏ Responses from master send back a stream of transactions that
have occurred since then, along with the actual result
๏ Transaction is replayed just like when committed / recovered
๏ Nodes/Relationships touched during this application are evicted
from cache to avoid cache staleness
๏ Transaction commands are only sorted when created, before
stored/transmitted, thus consistency is preserved during all
application phases
25
Monday, May 21, 2012
40. Locking semantics of HA
๏ To be granted a lock the slave must have the latest version of the
Node/Relationship it is trying to lock
• This ensures consistency
• The implementation of “Latest version ofentireNode/
Relationship” is “Latest version of the
the
graph”
• The slave must thus sync transactions from the master
26
Monday, May 21, 2012
41. Master election
๏ Each instance communicates/coordinates:
• its latest transaction id (including the master id for that tx)
• the id forclock value for when the txid was written
that instance
• (logical)
๏ Election chooses:
1. The instance with highest txid
2. IF multiple: The instance that was master for that tx
3. IF unavailable: The instance with the lowest clock value
4. IF multiple: The instance with the lowest id
๏ Election happens when the current master cannot be reached
•
Any instance can choose to re-elect
•
Each instance runs the election protocol individually
•
Notify others when election chooses new master
๏ When elected, the new master broadcasts to all instances, 27
forcing them to bind to the new master
Monday, May 21, 2012
42. Thank you for listening!
tobias@neotechnology.com
Tobias Lindaaker twitter: @thobe, #neo4j (@neo4j)
web: neo4j.org neotechnology.com
Hacker @ Neo Technology my web: thobe.org
Monday, May 21, 2012