SlideShare a Scribd company logo
1 of 42
Download to read offline
An overview of
                       Neo4j Internals


                                   tobias@neotechnology.com
 Tobias Lindaaker                  twitter: @thobe, #neo4j (@neo4j)
                                   web: neo4j.org neotechnology.com
 Hacker @ Neo Technology           my web: thobe.org




Monday, May 21, 2012
Outline
       This is a rough structure of
       how the pieces of Neo4j fit
       together.

       This talk will not cover
       how disks/fs works, we
       just assume it does.           Traversals      Core API       Cypher
       Nor will it cover the “Core
       API”, you are assumed to
       know it.
                                      Node/Relationship
                                                            Thread local diffs
                                        Object cache

                                          FS Cache                         HA


                                        Record files          Transaction log


                                                      Disk(s)



                                                                                 2

Monday, May 21, 2012
Outline

                             Traversals      Core API       Cypher


                             Node/Relationship
                                                   Thread local diffs
                               Object cache

                                 FS Cache                         HA

      Let’s start at the
      bottom: the on disk      Record files          Transaction log
      storage file layout.



                                             Disk(s)



                                                                        3

Monday, May 21, 2012
Simple sample graph. It all boils down to
                                                linked lists of fixed size records on disk.

       Your graph on disk                       Properties are stored as a linked list of
                                                property records, each holding key+value.
                                                Each node/relationship references its first
                                                property record.
                                                The Nodes also reference the first node in its
                                                relationship chain.
                                                Each Relationship references its start and
                   Name: Alistair               end node.
                   Age: 34             KNOWS    It also references the prev/next relationship
                                                record for the start/end node respectively


                                                                   Name: Tobias
                                                                   Age: 27
                                                                   Nationality: Swedish



            KNOWS
                                        KNOWS


                                                                      KNOWS
              Name: Ian
              Age: 42


                                                           Name: Jim
                                                           Age: 37
                                    KNOWS                  Stuff: good

                                                                                                 4

Monday, May 21, 2012
Simple sample graph. It all boils down to
                                         linked lists of fixed size records on disk.

       Your graph on disk                Properties are stored as a linked list of
                                         property records, each holding key+value.
                                         Each node/relationship references its first
        Name                             property record.
                                         The Nodes also reference the first node in its
       Alistair                          relationship chain.
                                         Each Relationship references its start and
                                         end node.                                        Name
                                KNOWS    It also references the prev/next relationship
                                         record for the start/end node respectively       Tobias
      Age
       34
                                                                                                        Age
                                                                                                            27


                                                                                          Nationality

             KNOWS                                                                        Swedish
                                 KNOWS


                                                               KNOWS

                                                                               Name
                                                                                 Jim
                                                                                                            Age
     Name                                                                                                   37
                                                                                 Stuff
       Ian                   KNOWS
                       Age                                                      good

                       42                                                                               4

Monday, May 21, 2012
Simple sample graph. It all boils down to
                                         linked lists of fixed size records on disk.

       Your graph on disk                Properties are stored as a linked list of
                                         property records, each holding key+value.
                                         Each node/relationship references its first
        Name                             property record.
                                         The Nodes also reference the first node in its
       Alistair                          relationship chain.
                                         Each Relationship references its start and
                                         end node.                                        Name
                                KNOWS    It also references the prev/next relationship
                                         record for the start/end node respectively       Tobias
      Age
       34
                                                                                                        Age
                                                                                                            27


                                                                                          Nationality

             KNOWS                                                                        Swedish
                                 KNOWS


                                                               KNOWS

                                                                               Name
                                                                                 Jim
                                                                                                            Age
     Name                                                                                                   37
                                                                                 Stuff
       Ian                   KNOWS
                       Age                                                      good

                       42                                                                               4

Monday, May 21, 2012
Simple sample graph. It all boils down to
                                                   linked lists of fixed size records on disk.

       Your graph on disk                          Properties are stored as a linked list of
                                                   property records, each holding key+value.
                                                   Each node/relationship references its first
        Name
                                       SP    EP    property record.
                                                   The Nodes also reference the first node in its
       Alistair                                    relationship chain.
                                       SN    EN    Each Relationship references its start and
                                                   end node.                                        Name
                                       KNOWS       It also references the prev/next relationship
                                                   record for the start/end node respectively       Tobias
      Age
       34
                                                                                                                  Age
                                                                                                                      27
             SP        EP
                                        SP    EP
             SN        EN                                                                           Nationality
                                        SN    EN
             KNOWS                                                                                  Swedish
                                        KNOWS                            SP       EP
                                                                         SN       EN
                                                                         KNOWS

                                                                                         Name
                                  SP   EP                                                  Jim
                                                                                                                      Age
                                  SN   EN
     Name                                                                                                             37
                                                                                           Stuff
       Ian                        KNOWS
                            Age                                                           good

                            42                                                                                    4

Monday, May 21, 2012
Simple sample graph. It all boils down to
                                                   linked lists of fixed size records on disk.

       Your graph on disk                          Properties are stored as a linked list of
                                                   property records, each holding key+value.
                                                   Each node/relationship references its first
        Name
                                       SP    EP    property record.
                                                   The Nodes also reference the first node in its
       Alistair                                    relationship chain.
                                       SN    EN    Each Relationship references its start and
                                                   end node.                                        Name
                                       KNOWS       It also references the prev/next relationship
                                                   record for the start/end node respectively       Tobias
      Age
       34
                                                                                                                  Age
                                                                                                                      27
             SP        EP
                                        SP    EP
             SN        EN                                                                           Nationality
                                        SN    EN
             KNOWS                                                                                  Swedish
                                        KNOWS                            SP       EP
                                                                         SN       EN
                                                                         KNOWS

                                                                                         Name
                                  SP   EP                                                  Jim
                                                                                                                      Age
                                  SN   EN
     Name                                                                                                             37
                                                                                           Stuff
       Ian                        KNOWS
                            Age                                                           good

                            42                                                                                    4

Monday, May 21, 2012
Simple sample graph. It all boils down to
                                                   linked lists of fixed size records on disk.

       Your graph on disk                          Properties are stored as a linked list of
                                                   property records, each holding key+value.
                                                   Each node/relationship references its first
        Name
                                       SP    EP    property record.
                                                   The Nodes also reference the first node in its
       Alistair                                    relationship chain.
                                       SN    EN    Each Relationship references its start and
                                                   end node.                                        Name
                                       KNOWS       It also references the prev/next relationship
                                                   record for the start/end node respectively       Tobias
      Age
       34
                                                                                                                  Age
                                                                                                                      27
             SP        EP
                                        SP    EP
             SN        EN                                                                           Nationality
                                        SN    EN
             KNOWS                                                                                  Swedish
                                        KNOWS                            SP       EP
                                                                         SN       EN
                                                                         KNOWS

                                                                                         Name
                                  SP   EP                                                  Jim
                                                                                                                      Age
                                  SN   EN
     Name                                                                                                             37
                                                                                           Stuff
       Ian                        KNOWS
                            Age                                                           good

                            42                                                                                    4

Monday, May 21, 2012
Simple sample graph. It all boils down to
                                                   linked lists of fixed size records on disk.

       Your graph on disk                          Properties are stored as a linked list of
                                                   property records, each holding key+value.
                                                   Each node/relationship references its first
        Name
                                       SP    EP    property record.
                                                   The Nodes also reference the first node in its
       Alistair                                    relationship chain.
                                       SN    EN    Each Relationship references its start and
                                                   end node.                                        Name
                                       KNOWS       It also references the prev/next relationship
                                                   record for the start/end node respectively       Tobias
      Age
       34
                                                                                                                  Age
                                                                                                                      27
             SP        EP
                                        SP    EP
             SN        EN                                                                           Nationality
                                        SN    EN
             KNOWS                                                                                  Swedish
                                        KNOWS                            SP       EP
                                                                         SN       EN
                                                                         KNOWS

                                                                                         Name
                                  SP   EP                                                  Jim
                                                                                                                      Age
                                  SN   EN
     Name                                                                                                             37
                                                                                           Stuff
       Ian                        KNOWS
                            Age                                                           good

                            42                                                                                    4

Monday, May 21, 2012
Store files
                       ๏Node store
                       ๏Relationship store
                         •   Relationship type store
                       ๏Property store
                         •   Property key store

                         • (long) String store    Short string and
                                                  array values are
                                                  inlined in the




                         •
                                                  property store, long
                                                  values are stored in

                           (long) Array store     separate store files.


                                                                          5

Monday, May 21, 2012
Neo4j Storage Record Layout
Node (9 bytes)
inUse nextRelId                          nextPropId


     1                               5                9



Relationship (33 bytes)
inUse firstNode                          secondNode       relationshipType    firstPrevRelId    firstNextRelId    secondPrevRelId    secondNextRelId    nextPropId


     1                               5                9                      13                17                21                 25                 29            33



Relationship Type (5 bytes)
inUse typeBlockId


     1                               5



Property (33 bytes)
inUse type              keyIndexId       propBlock                                                                                                      nextPropId


     1              3                5                                                                                                                 29            33



Property Index (9 bytes)
inUse propCount                          keyBlockId


     1                               5                9



Dynamic Store (125 bytes)
inUse next                               data


     1                               5



NeoStore (5 bytes)
inUse datum


     1                               5

Monday, May 21, 2012
Outline

                                      Traversals      Core API       Cypher


     Next: The t wo levels of cache   Node/Relationship
     in Neo4j.                                              Thread local diffs
     The low level FS Cache for the     Object cache
     record files.
     And the high level Object
     cache storing a structure
     more optimized for traversal.
                                          FS Cache                         HA


                                        Record files          Transaction log


                                                      Disk(s)



                                                                                 7

Monday, May 21, 2012
The caches
        ๏ Filesystem cache:
                • Caches regions store file intofiles sized regions)
                    (divides each
                                   of the store
                                                  equally

                • The cache holds a fixed number of regions for each file
                • Regions are evicted based on ahit in non-cached region)
                    (hit count vs. miss count, i.e.
                                                    LFU-like policy


                • Default implementation of regions uses OS mmap
        ๏ Node/Relationship cache
                • Cache a version more optimized for traversals
                                                                            8

Monday, May 21, 2012
What we put in cache
                       ID             Relationship ID refs
                                      in:    R1       R2    ...   Rn               The structure of the elements in the high level
                       type 1                                                      object cache.
                                      out    R1       R2    ...   Rn
                                                                                   On disk most of the information is contained
                                      in:    R1       R2    R3     ...    Rn       in the relationship records, with the nodes just
                       type 2                                                      referencing their first relationship. In the
                                      out    R1       ...   Rn
              Node                                                                 cache this is turned around: the nodes hold
                                                                                   references to all its relationships. The
                        ...            (grouped by type)                           relationships are simple, only holding its
                                                                                   properties.

                                                                                   The relationships for each node is grouped by
                                                                                   RelationshipType to allow fast traversal of a
                                                                                   specific type.
                       Key 1                Key 2           ...          Key n
                                                                                   All references (dotted arrows) are by ID, and
                                                                                   traversals do indirect lookup through the cache.
                              Val 1



                                              Val 2




                       ID         start               end         type     Val n

        Relationship
                       Key 1                Key 2           ...          Key n
                              Val 1



                                              Val 2




                                                                           Val n




                                                                                                                       9

Monday, May 21, 2012
Outline

     So how do traversals work...
                                    Traversals      Core API       Cypher


                                    Node/Relationship
                                                          Thread local diffs
                                      Object cache

                                        FS Cache                         HA


                                      Record files          Transaction log


                                                    Disk(s)



                                                                               10

Monday, May 21, 2012
Traversals - how do they work?
       ๏ RelationshipExpanders: given (a path to) a node, returns           The surface
                                                                            layer, the you
                 Relationships to continue traversing from that node        interact with.


       ๏ Evaluators: given (a path to) a node, returns whether to:
                • Continue traversing on that branch (i.e. expand) or not
                • Include (the path to) the node in the result set or not
       ๏ Then a projection to Path, Node, or Relationship applied to
                 each Path in the result set.
            ... but also:
       ๏ Uniqueness level: policy for when it is ok to revisit a node
                 that has already been visited
       ๏ Implemented on top of the Core API
                                                                            11

Monday, May 21, 2012
More on Traversals
        ๏ Fetch node data from cache - non-blocking access                  This is what happens


                •
                                                                            under the hood.
                       If not in cache, retrieve from storage, into cache
                       ‣If region is in FS cache: blocking but short duration access
                       ‣If region is outside FS cache: blocking slower access
        ๏ Get relationships from cached node
                • If not fetched, retrieve from storage, by following chains
        ๏ Expand relationship(s) to end up on next node(s)
                • The relationship knows the node, no need to fetch it yet
        ๏ Evaluate
                •      possibly emitting a Path into the result set
        ๏ Repeat                                                                 12

Monday, May 21, 2012
Outline

                                                                  How is Cypher different?
                       Traversals      Core API       Cypher      and how dowes it work?



                       Node/Relationship
                                             Thread local diffs
                         Object cache

                           FS Cache                         HA


                         Record files          Transaction log


                                       Disk(s)



                                                                              13

Monday, May 21, 2012
Cypher - Just convenient traversal descriptions?
       ๏ Builds on the same infrastructure as Traversals - Expanders
                • but not on the full Traversal system
       ๏ Uses graph pattern matching for traversing the graph
              • Recursive MATCH x-->y, backtracking z-->a-->b, z-->b
            START x=...
                          matching with
                                        x-->z, y-->z,


                                                                Red: pattern graph
                                                                Blue: actual graph
                                                                Green: start node
                                                                Purple: matches




                                                                          14

Monday, May 21, 2012
Cypher - Just convenient traversal descriptions?
       ๏ Builds on the same infrastructure as Traversals - Expanders
                • but not on the full Traversal system
       ๏ Uses graph pattern matching for traversing the graph
              • Recursive MATCH x-->y, backtracking z-->a-->b, z-->b
            START x=...
                          matching with
                                        x-->z, y-->z,


                                                                Red: pattern graph
                                                                Blue: actual graph
                                                                Green: start node
                                                                Purple: matches




                                                                          14

Monday, May 21, 2012
Cypher - Just convenient traversal descriptions?
       ๏ Builds on the same infrastructure as Traversals - Expanders
                • but not on the full Traversal system
       ๏ Uses graph pattern matching for traversing the graph
              • Recursive MATCH x-->y, backtracking z-->a-->b, z-->b
            START x=...
                          matching with
                                        x-->z, y-->z,


                                                                Red: pattern graph
                                                                Blue: actual graph
                                                                Green: start node
                                                                Purple: matches




                                                                          14

Monday, May 21, 2012
Cypher - Just convenient traversal descriptions?
       ๏ Builds on the same infrastructure as Traversals - Expanders
                • but not on the full Traversal system
       ๏ Uses graph pattern matching for traversing the graph
              • Recursive MATCH x-->y, backtracking z-->a-->b, z-->b
            START x=...
                          matching with
                                        x-->z, y-->z,


                                                                Red: pattern graph
                                                                Blue: actual graph
                                                                Green: start node
                                                                Purple: matches




                                                                          14

Monday, May 21, 2012
Cypher - Just convenient traversal descriptions?
       ๏ Builds on the same infrastructure as Traversals - Expanders
                • but not on the full Traversal system
       ๏ Uses graph pattern matching for traversing the graph
              • Recursive MATCH x-->y, backtracking z-->a-->b, z-->b
            START x=...
                          matching with
                                        x-->z, y-->z,


                                                                Red: pattern graph
                                                                Blue: actual graph
                                                                Green: start node
                                                                Purple: matches




                                                                          14

Monday, May 21, 2012
Cypher - Just convenient traversal descriptions?
       ๏ Builds on the same infrastructure as Traversals - Expanders
                • but not on the full Traversal system
       ๏ Uses graph pattern matching for traversing the graph
              • Recursive MATCH x-->y, backtracking z-->a-->b, z-->b
            START x=...
                          matching with
                                        x-->z, y-->z,


                                                                Red: pattern graph
                                                                Blue: actual graph
                                                                Green: start node
                                                                Purple: matches




                                                                          14

Monday, May 21, 2012
Cypher - Just convenient traversal descriptions?
       ๏ Builds on the same infrastructure as Traversals - Expanders
                • but not on the full Traversal system
       ๏ Uses graph pattern matching for traversing the graph
              • Recursive MATCH x-->y, backtracking z-->a-->b, z-->b
            START x=...
                          matching with
                                        x-->z, y-->z,


                                                                Red: pattern graph
                                                                Blue: actual graph
                                                                Green: start node
                                                                Purple: matches




                                                                          14

Monday, May 21, 2012
Cypher - Just convenient traversal descriptions?
       ๏ Builds on the same infrastructure as Traversals - Expanders
                • but not on the full Traversal system
       ๏ Uses graph pattern matching for traversing the graph
              • Recursive MATCH x-->y, backtracking z-->a-->b, z-->b
            START x=...
                          matching with
                                        x-->z, y-->z,


                                                                Red: pattern graph
                                                                Blue: actual graph
                                                                Green: start node
                                                                Purple: matches




                                                                          14

Monday, May 21, 2012
Cypher - Just convenient traversal descriptions?
       ๏ Builds on the same infrastructure as Traversals - Expanders
                • but not on the full Traversal system
       ๏ Uses graph pattern matching for traversing the graph
              • Recursive MATCH x-->y, backtracking z-->a-->b, z-->b
            START x=...
                          matching with
                                        x-->z, y-->z,


                                                                Red: pattern graph
                                                                Blue: actual graph
                                                                Green: start node
                                                                Purple: matches




                                                                          14

Monday, May 21, 2012
What about gremlin?
        ๏ gremlin is a third party language, built by Marko Rodriguez of
                   Tinkerpop (a group of people who like to hack on graphs)
        ๏ Originally based on the idea of using xpath to describe traversals:
                   ./HAS_CART/CONTAINS_ITEM/PURCHASED/PURCHASED
                   but bastardized to distinguish between nodes and relationships:
                   ./outE[label=HAS_CART]/inV                     Traversals are close
                    /outE[label=CONTAINS_ITEM]/inV to xpath, which is
                                                                  why xpath like
                    /inE[label=PURCHASED]/outV                    descriptions of
                                                                  traversals seemed
                    /outE[label=PURCHASED]/inV                    like a good idea.

        ๏ xpath is not complete enough to express full algorithms, it needs a
                   host language, gremlin originally defined its own.
                   This changed Groovy as a more complete host language and
                   abandoned xpath in favor of method chaining
                   [ replace ‘/’ with ‘.’ ]                              15

Monday, May 21, 2012
Gremlin compared to Cypher
        ๏start me=node:people(name={myname})
              match me-[:HAS_CART]->cart-[:CONTAINS_ITEM]->item
              item<-[:PURCHASED]-user-[:PURCHASED]->recommendation
              return recommendation

        ๏ Cypher is declarative, describes what data to get - its shape
        ๏ Gremlin is imperative, prescribes how to get the data
        ๏ Cypher has more opportunities for optimization by the engine
        ๏ Gremlin can implement pagerank, Cypher can’t (yet?)



                                                                     16

Monday, May 21, 2012
Outline

                       Traversals      Core API       Cypher      Transactions involve t wo
                                                                  parts:
                                                                  The (thread local) changes
                                                                  being done by an active
                                                                  transaction,
                       Node/Relationship                          and the transaction replay log
                                             Thread local diffs
                         Object cache                             for recovery.



                           FS Cache                         HA


                         Record files          Transaction log


                                       Disk(s)



                                                                               17

Monday, May 21, 2012
Transaction Isolation
        ๏ Mutating operations are not written when performed
        ๏ They are stored in a thread confined transaction state object
        ๏ This prevents other threads from seeing uncommitted changes
                   from the transactions of other threads
        ๏ When Transaction.finish() is invoked the transaction is either
                   committed or rolled back
        ๏ Rollback is simple: discard the transaction state object



                                                                          18

Monday, May 21, 2012
Transactions & Durability
        ๏ Commit is:
                • Changes made in the transaction are collected as commands
                • Commands are sorted to get predictable update order
                       ‣This prevents concurrent readers from seeing inconsistent
                        data when the changes are applied to the store
                • Write changes (in sorted order) to the transaction log
                • Mark the transaction as committed in the log
                • Apply the changes (in sorted order) to the store files
                                                                              19

Monday, May 21, 2012
Recovery
        ๏ Transaction commands dictate state, they don’t modify state
                • i.e. SET property "count" to 5
                • rather than ADD 1 to property "count"
        ๏ Thus: Applying the same command twice yields the same state
        ๏ Recovery simply replays all transactions since the last safe point
        ๏ If tx A mutates node1.name, then tx B also mutates
                   node1.name that doesn’t matter, because the database is not
                   recovered until all transactions have been replayed


                                                                         20

Monday, May 21, 2012
Outline

                       Traversals      Core API       Cypher


                       Node/Relationship
                                             Thread local diffs
                         Object cache

                           FS Cache                         HA


                         Record files          Transaction log     High Availability in
                                                                  Neo4j builds on top of
                                                                  the transaction replay


                                       Disk(s)



                                                                              21

Monday, May 21, 2012
Outline                       The transaction logs are
                                                                          shared bet ween all instances
                                                                          in an High Availability setup,
                                                                          all other parts operate on the
                                                                          local data just like in the
                                                                          standalone case.

                               Traversals      Core API       Cypher


                       Local   Node/Relationship
                                                     Thread local diffs
                                 Object cache

                                   FS Cache                         HA


                                 Record files          Transaction log     Shared

                                               Disk(s)



                                                                                       22

Monday, May 21, 2012
• HA - the parts to it:
        ๏ Based on streaming transactions between servers
        ๏ All transactions are committed through the master
                • Then (eventually) applied to the slaves
                • Eventuality synchronizationupdate intervalby interaction
                    or when
                              defined by the
                                              is mandated
        ๏ When writing to a slave:
                • Locks coordinated through the master
                • Transaction data buffered on theget a txid
                    applied first on the master to
                                                   slave

                       then applied with the same txid on the slave
                                                                             23

Monday, May 21, 2012
Creating new Nodes and Relationships
        ๏ New Nodes/Relationships don’t need locks, so they don’t need a
                   transaction synced with master until the transaction commit
        ๏ They do need an ID that is unique and equal among all instances
        ๏ Each instance allocates IDs in blocks from the master, then assigns
                   them to new Nodes/Relationships locally

                • This batch allocation can be seen in (Enterprise) 1000 as
                    Node/Relationship counts jumping in steps of
                                                                    WebAdmin




                                                                           24

Monday, May 21, 2012
HA synchronization points
        ๏ Transactions are uniquely identified by monotonically increasing ID
        ๏ All Requests from slave send the current latest txid on that slave
        ๏ Responses from master send back a stream of transactions that
                   have occurred since then, along with the actual result
        ๏ Transaction is replayed just like when committed / recovered
        ๏ Nodes/Relationships touched during this application are evicted
                   from cache to avoid cache staleness
        ๏ Transaction commands are only sorted when created, before
                   stored/transmitted, thus consistency is preserved during all
                   application phases
                                                                              25

Monday, May 21, 2012
Locking semantics of HA
        ๏ To be granted a lock the slave must have the latest version of the
                   Node/Relationship it is trying to lock

                • This ensures consistency
                • The implementation of “Latest version ofentireNode/
                    Relationship” is “Latest version of the
                                                            the
                                                                 graph”

                • The slave must thus sync transactions from the master


                                                                          26

Monday, May 21, 2012
Master election
        ๏ Each instance communicates/coordinates:
              • its latest transaction id (including the master id for that tx)
              • the id forclock value for when the txid was written
                            that instance
              • (logical)
        ๏    Election chooses:
           1. The instance with highest txid
           2. IF multiple: The instance that was master for that tx
           3. IF unavailable: The instance with the lowest clock value
           4. IF multiple: The instance with the lowest id
        ๏ Election happens when the current master cannot be reached
                •
              Any instance can choose to re-elect
                •
              Each instance runs the election protocol individually
                •
              Notify others when election chooses new master
        ๏ When elected, the new master broadcasts to all instances, 27
             forcing them to bind to the new master
Monday, May 21, 2012
Thank you for listening!

                                    tobias@neotechnology.com
 Tobias Lindaaker                   twitter: @thobe, #neo4j (@neo4j)
                                    web: neo4j.org neotechnology.com
 Hacker @ Neo Technology            my web: thobe.org




Monday, May 21, 2012

More Related Content

What's hot

Neo4j Fundamentals
Neo4j FundamentalsNeo4j Fundamentals
Neo4j FundamentalsMax De Marzi
 
Parquet performance tuning: the missing guide
Parquet performance tuning: the missing guideParquet performance tuning: the missing guide
Parquet performance tuning: the missing guideRyan Blue
 
Introduction to Redis
Introduction to RedisIntroduction to Redis
Introduction to RedisDvir Volk
 
Apache Iceberg: An Architectural Look Under the Covers
Apache Iceberg: An Architectural Look Under the CoversApache Iceberg: An Architectural Look Under the Covers
Apache Iceberg: An Architectural Look Under the CoversScyllaDB
 
Apache Spark Data Source V2 with Wenchen Fan and Gengliang Wang
Apache Spark Data Source V2 with Wenchen Fan and Gengliang WangApache Spark Data Source V2 with Wenchen Fan and Gengliang Wang
Apache Spark Data Source V2 with Wenchen Fan and Gengliang WangDatabricks
 
Reading The Source Code of Presto
Reading The Source Code of PrestoReading The Source Code of Presto
Reading The Source Code of PrestoTaro L. Saito
 
Introduction to Spark Internals
Introduction to Spark InternalsIntroduction to Spark Internals
Introduction to Spark InternalsPietro Michiardi
 
Leveraging Neo4j With Apache Spark
Leveraging Neo4j With Apache SparkLeveraging Neo4j With Apache Spark
Leveraging Neo4j With Apache SparkNeo4j
 
MongoDB vs. Postgres Benchmarks
MongoDB vs. Postgres Benchmarks MongoDB vs. Postgres Benchmarks
MongoDB vs. Postgres Benchmarks EDB
 
A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...
A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...
A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...Databricks
 
What is in a Lucene index?
What is in a Lucene index?What is in a Lucene index?
What is in a Lucene index?lucenerevolution
 
Introduction to Graph Databases
Introduction to Graph DatabasesIntroduction to Graph Databases
Introduction to Graph DatabasesMax De Marzi
 
Spark shuffle introduction
Spark shuffle introductionSpark shuffle introduction
Spark shuffle introductioncolorant
 
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...Databricks
 
MongoDB WiredTiger Internals
MongoDB WiredTiger InternalsMongoDB WiredTiger Internals
MongoDB WiredTiger InternalsNorberto Leite
 
Big Data in Real-Time at Twitter
Big Data in Real-Time at TwitterBig Data in Real-Time at Twitter
Big Data in Real-Time at Twitternkallen
 
A simple introduction to redis
A simple introduction to redisA simple introduction to redis
A simple introduction to redisZhichao Liang
 
Introduction to memcached
Introduction to memcachedIntroduction to memcached
Introduction to memcachedJurriaan Persyn
 

What's hot (20)

Neo4j Fundamentals
Neo4j FundamentalsNeo4j Fundamentals
Neo4j Fundamentals
 
Parquet performance tuning: the missing guide
Parquet performance tuning: the missing guideParquet performance tuning: the missing guide
Parquet performance tuning: the missing guide
 
Introduction to Redis
Introduction to RedisIntroduction to Redis
Introduction to Redis
 
Apache Iceberg: An Architectural Look Under the Covers
Apache Iceberg: An Architectural Look Under the CoversApache Iceberg: An Architectural Look Under the Covers
Apache Iceberg: An Architectural Look Under the Covers
 
Apache Spark Data Source V2 with Wenchen Fan and Gengliang Wang
Apache Spark Data Source V2 with Wenchen Fan and Gengliang WangApache Spark Data Source V2 with Wenchen Fan and Gengliang Wang
Apache Spark Data Source V2 with Wenchen Fan and Gengliang Wang
 
NoSQL databases
NoSQL databasesNoSQL databases
NoSQL databases
 
Reading The Source Code of Presto
Reading The Source Code of PrestoReading The Source Code of Presto
Reading The Source Code of Presto
 
Introduction to Spark Internals
Introduction to Spark InternalsIntroduction to Spark Internals
Introduction to Spark Internals
 
Leveraging Neo4j With Apache Spark
Leveraging Neo4j With Apache SparkLeveraging Neo4j With Apache Spark
Leveraging Neo4j With Apache Spark
 
MongoDB vs. Postgres Benchmarks
MongoDB vs. Postgres Benchmarks MongoDB vs. Postgres Benchmarks
MongoDB vs. Postgres Benchmarks
 
Apache hive introduction
Apache hive introductionApache hive introduction
Apache hive introduction
 
A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...
A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...
A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...
 
What is in a Lucene index?
What is in a Lucene index?What is in a Lucene index?
What is in a Lucene index?
 
Introduction to Graph Databases
Introduction to Graph DatabasesIntroduction to Graph Databases
Introduction to Graph Databases
 
Spark shuffle introduction
Spark shuffle introductionSpark shuffle introduction
Spark shuffle introduction
 
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
 
MongoDB WiredTiger Internals
MongoDB WiredTiger InternalsMongoDB WiredTiger Internals
MongoDB WiredTiger Internals
 
Big Data in Real-Time at Twitter
Big Data in Real-Time at TwitterBig Data in Real-Time at Twitter
Big Data in Real-Time at Twitter
 
A simple introduction to redis
A simple introduction to redisA simple introduction to redis
A simple introduction to redis
 
Introduction to memcached
Introduction to memcachedIntroduction to memcached
Introduction to memcached
 

More from Tobias Lindaaker

Choosing the right NOSQL database
Choosing the right NOSQL databaseChoosing the right NOSQL database
Choosing the right NOSQL databaseTobias Lindaaker
 
[JavaOne 2011] Models for Concurrent Programming
[JavaOne 2011] Models for Concurrent Programming[JavaOne 2011] Models for Concurrent Programming
[JavaOne 2011] Models for Concurrent ProgrammingTobias Lindaaker
 
Django and Neo4j - Domain modeling that kicks ass
Django and Neo4j - Domain modeling that kicks assDjango and Neo4j - Domain modeling that kicks ass
Django and Neo4j - Domain modeling that kicks assTobias Lindaaker
 
Persistent graphs in Python with Neo4j
Persistent graphs in Python with Neo4jPersistent graphs in Python with Neo4j
Persistent graphs in Python with Neo4jTobias Lindaaker
 
A Better Python for the JVM
A Better Python for the JVMA Better Python for the JVM
A Better Python for the JVMTobias Lindaaker
 
A Better Python for the JVM
A Better Python for the JVMA Better Python for the JVM
A Better Python for the JVMTobias Lindaaker
 
Exploiting Concurrency with Dynamic Languages
Exploiting Concurrency with Dynamic LanguagesExploiting Concurrency with Dynamic Languages
Exploiting Concurrency with Dynamic LanguagesTobias Lindaaker
 

More from Tobias Lindaaker (9)

NOSQL Overview
NOSQL OverviewNOSQL Overview
NOSQL Overview
 
JDK Power Tools
JDK Power ToolsJDK Power Tools
JDK Power Tools
 
Choosing the right NOSQL database
Choosing the right NOSQL databaseChoosing the right NOSQL database
Choosing the right NOSQL database
 
[JavaOne 2011] Models for Concurrent Programming
[JavaOne 2011] Models for Concurrent Programming[JavaOne 2011] Models for Concurrent Programming
[JavaOne 2011] Models for Concurrent Programming
 
Django and Neo4j - Domain modeling that kicks ass
Django and Neo4j - Domain modeling that kicks assDjango and Neo4j - Domain modeling that kicks ass
Django and Neo4j - Domain modeling that kicks ass
 
Persistent graphs in Python with Neo4j
Persistent graphs in Python with Neo4jPersistent graphs in Python with Neo4j
Persistent graphs in Python with Neo4j
 
A Better Python for the JVM
A Better Python for the JVMA Better Python for the JVM
A Better Python for the JVM
 
A Better Python for the JVM
A Better Python for the JVMA Better Python for the JVM
A Better Python for the JVM
 
Exploiting Concurrency with Dynamic Languages
Exploiting Concurrency with Dynamic LanguagesExploiting Concurrency with Dynamic Languages
Exploiting Concurrency with Dynamic Languages
 

Recently uploaded

Basic Building Blocks of Internet of Things.
Basic Building Blocks of Internet of Things.Basic Building Blocks of Internet of Things.
Basic Building Blocks of Internet of Things.YounusS2
 
Machine Learning Model Validation (Aijun Zhang 2024).pdf
Machine Learning Model Validation (Aijun Zhang 2024).pdfMachine Learning Model Validation (Aijun Zhang 2024).pdf
Machine Learning Model Validation (Aijun Zhang 2024).pdfAijun Zhang
 
Connector Corner: Extending LLM automation use cases with UiPath GenAI connec...
Connector Corner: Extending LLM automation use cases with UiPath GenAI connec...Connector Corner: Extending LLM automation use cases with UiPath GenAI connec...
Connector Corner: Extending LLM automation use cases with UiPath GenAI connec...DianaGray10
 
Igniting Next Level Productivity with AI-Infused Data Integration Workflows
Igniting Next Level Productivity with AI-Infused Data Integration WorkflowsIgniting Next Level Productivity with AI-Infused Data Integration Workflows
Igniting Next Level Productivity with AI-Infused Data Integration WorkflowsSafe Software
 
UiPath Community: AI for UiPath Automation Developers
UiPath Community: AI for UiPath Automation DevelopersUiPath Community: AI for UiPath Automation Developers
UiPath Community: AI for UiPath Automation DevelopersUiPathCommunity
 
Designing A Time bound resource download URL
Designing A Time bound resource download URLDesigning A Time bound resource download URL
Designing A Time bound resource download URLRuncy Oommen
 
Comparing Sidecar-less Service Mesh from Cilium and Istio
Comparing Sidecar-less Service Mesh from Cilium and IstioComparing Sidecar-less Service Mesh from Cilium and Istio
Comparing Sidecar-less Service Mesh from Cilium and IstioChristian Posta
 
UiPath Studio Web workshop series - Day 7
UiPath Studio Web workshop series - Day 7UiPath Studio Web workshop series - Day 7
UiPath Studio Web workshop series - Day 7DianaGray10
 
Computer 10: Lesson 10 - Online Crimes and Hazards
Computer 10: Lesson 10 - Online Crimes and HazardsComputer 10: Lesson 10 - Online Crimes and Hazards
Computer 10: Lesson 10 - Online Crimes and HazardsSeth Reyes
 
Videogame localization & technology_ how to enhance the power of translation.pdf
Videogame localization & technology_ how to enhance the power of translation.pdfVideogame localization & technology_ how to enhance the power of translation.pdf
Videogame localization & technology_ how to enhance the power of translation.pdfinfogdgmi
 
IESVE Software for Florida Code Compliance Using ASHRAE 90.1-2019
IESVE Software for Florida Code Compliance Using ASHRAE 90.1-2019IESVE Software for Florida Code Compliance Using ASHRAE 90.1-2019
IESVE Software for Florida Code Compliance Using ASHRAE 90.1-2019IES VE
 
Empowering Africa's Next Generation: The AI Leadership Blueprint
Empowering Africa's Next Generation: The AI Leadership BlueprintEmpowering Africa's Next Generation: The AI Leadership Blueprint
Empowering Africa's Next Generation: The AI Leadership BlueprintMahmoud Rabie
 
ADOPTING WEB 3 FOR YOUR BUSINESS: A STEP-BY-STEP GUIDE
ADOPTING WEB 3 FOR YOUR BUSINESS: A STEP-BY-STEP GUIDEADOPTING WEB 3 FOR YOUR BUSINESS: A STEP-BY-STEP GUIDE
ADOPTING WEB 3 FOR YOUR BUSINESS: A STEP-BY-STEP GUIDELiveplex
 
Building Your Own AI Instance (TBLC AI )
Building Your Own AI Instance (TBLC AI )Building Your Own AI Instance (TBLC AI )
Building Your Own AI Instance (TBLC AI )Brian Pichman
 
20230202 - Introduction to tis-py
20230202 - Introduction to tis-py20230202 - Introduction to tis-py
20230202 - Introduction to tis-pyJamie (Taka) Wang
 
The Data Metaverse: Unpacking the Roles, Use Cases, and Tech Trends in Data a...
The Data Metaverse: Unpacking the Roles, Use Cases, and Tech Trends in Data a...The Data Metaverse: Unpacking the Roles, Use Cases, and Tech Trends in Data a...
The Data Metaverse: Unpacking the Roles, Use Cases, and Tech Trends in Data a...Aggregage
 
Using IESVE for Loads, Sizing and Heat Pump Modeling to Achieve Decarbonization
Using IESVE for Loads, Sizing and Heat Pump Modeling to Achieve DecarbonizationUsing IESVE for Loads, Sizing and Heat Pump Modeling to Achieve Decarbonization
Using IESVE for Loads, Sizing and Heat Pump Modeling to Achieve DecarbonizationIES VE
 
UiPath Studio Web workshop series - Day 6
UiPath Studio Web workshop series - Day 6UiPath Studio Web workshop series - Day 6
UiPath Studio Web workshop series - Day 6DianaGray10
 
KubeConEU24-Monitoring Kubernetes and Cloud Spend with OpenCost
KubeConEU24-Monitoring Kubernetes and Cloud Spend with OpenCostKubeConEU24-Monitoring Kubernetes and Cloud Spend with OpenCost
KubeConEU24-Monitoring Kubernetes and Cloud Spend with OpenCostMatt Ray
 

Recently uploaded (20)

Basic Building Blocks of Internet of Things.
Basic Building Blocks of Internet of Things.Basic Building Blocks of Internet of Things.
Basic Building Blocks of Internet of Things.
 
Machine Learning Model Validation (Aijun Zhang 2024).pdf
Machine Learning Model Validation (Aijun Zhang 2024).pdfMachine Learning Model Validation (Aijun Zhang 2024).pdf
Machine Learning Model Validation (Aijun Zhang 2024).pdf
 
Connector Corner: Extending LLM automation use cases with UiPath GenAI connec...
Connector Corner: Extending LLM automation use cases with UiPath GenAI connec...Connector Corner: Extending LLM automation use cases with UiPath GenAI connec...
Connector Corner: Extending LLM automation use cases with UiPath GenAI connec...
 
Igniting Next Level Productivity with AI-Infused Data Integration Workflows
Igniting Next Level Productivity with AI-Infused Data Integration WorkflowsIgniting Next Level Productivity with AI-Infused Data Integration Workflows
Igniting Next Level Productivity with AI-Infused Data Integration Workflows
 
UiPath Community: AI for UiPath Automation Developers
UiPath Community: AI for UiPath Automation DevelopersUiPath Community: AI for UiPath Automation Developers
UiPath Community: AI for UiPath Automation Developers
 
Designing A Time bound resource download URL
Designing A Time bound resource download URLDesigning A Time bound resource download URL
Designing A Time bound resource download URL
 
Comparing Sidecar-less Service Mesh from Cilium and Istio
Comparing Sidecar-less Service Mesh from Cilium and IstioComparing Sidecar-less Service Mesh from Cilium and Istio
Comparing Sidecar-less Service Mesh from Cilium and Istio
 
UiPath Studio Web workshop series - Day 7
UiPath Studio Web workshop series - Day 7UiPath Studio Web workshop series - Day 7
UiPath Studio Web workshop series - Day 7
 
Computer 10: Lesson 10 - Online Crimes and Hazards
Computer 10: Lesson 10 - Online Crimes and HazardsComputer 10: Lesson 10 - Online Crimes and Hazards
Computer 10: Lesson 10 - Online Crimes and Hazards
 
Videogame localization & technology_ how to enhance the power of translation.pdf
Videogame localization & technology_ how to enhance the power of translation.pdfVideogame localization & technology_ how to enhance the power of translation.pdf
Videogame localization & technology_ how to enhance the power of translation.pdf
 
IESVE Software for Florida Code Compliance Using ASHRAE 90.1-2019
IESVE Software for Florida Code Compliance Using ASHRAE 90.1-2019IESVE Software for Florida Code Compliance Using ASHRAE 90.1-2019
IESVE Software for Florida Code Compliance Using ASHRAE 90.1-2019
 
Empowering Africa's Next Generation: The AI Leadership Blueprint
Empowering Africa's Next Generation: The AI Leadership BlueprintEmpowering Africa's Next Generation: The AI Leadership Blueprint
Empowering Africa's Next Generation: The AI Leadership Blueprint
 
ADOPTING WEB 3 FOR YOUR BUSINESS: A STEP-BY-STEP GUIDE
ADOPTING WEB 3 FOR YOUR BUSINESS: A STEP-BY-STEP GUIDEADOPTING WEB 3 FOR YOUR BUSINESS: A STEP-BY-STEP GUIDE
ADOPTING WEB 3 FOR YOUR BUSINESS: A STEP-BY-STEP GUIDE
 
20230104 - machine vision
20230104 - machine vision20230104 - machine vision
20230104 - machine vision
 
Building Your Own AI Instance (TBLC AI )
Building Your Own AI Instance (TBLC AI )Building Your Own AI Instance (TBLC AI )
Building Your Own AI Instance (TBLC AI )
 
20230202 - Introduction to tis-py
20230202 - Introduction to tis-py20230202 - Introduction to tis-py
20230202 - Introduction to tis-py
 
The Data Metaverse: Unpacking the Roles, Use Cases, and Tech Trends in Data a...
The Data Metaverse: Unpacking the Roles, Use Cases, and Tech Trends in Data a...The Data Metaverse: Unpacking the Roles, Use Cases, and Tech Trends in Data a...
The Data Metaverse: Unpacking the Roles, Use Cases, and Tech Trends in Data a...
 
Using IESVE for Loads, Sizing and Heat Pump Modeling to Achieve Decarbonization
Using IESVE for Loads, Sizing and Heat Pump Modeling to Achieve DecarbonizationUsing IESVE for Loads, Sizing and Heat Pump Modeling to Achieve Decarbonization
Using IESVE for Loads, Sizing and Heat Pump Modeling to Achieve Decarbonization
 
UiPath Studio Web workshop series - Day 6
UiPath Studio Web workshop series - Day 6UiPath Studio Web workshop series - Day 6
UiPath Studio Web workshop series - Day 6
 
KubeConEU24-Monitoring Kubernetes and Cloud Spend with OpenCost
KubeConEU24-Monitoring Kubernetes and Cloud Spend with OpenCostKubeConEU24-Monitoring Kubernetes and Cloud Spend with OpenCost
KubeConEU24-Monitoring Kubernetes and Cloud Spend with OpenCost
 

An overview of Neo4j Internals

  • 1. An overview of Neo4j Internals tobias@neotechnology.com Tobias Lindaaker twitter: @thobe, #neo4j (@neo4j) web: neo4j.org neotechnology.com Hacker @ Neo Technology my web: thobe.org Monday, May 21, 2012
  • 2. Outline This is a rough structure of how the pieces of Neo4j fit together. This talk will not cover how disks/fs works, we just assume it does. Traversals Core API Cypher Nor will it cover the “Core API”, you are assumed to know it. Node/Relationship Thread local diffs Object cache FS Cache HA Record files Transaction log Disk(s) 2 Monday, May 21, 2012
  • 3. Outline Traversals Core API Cypher Node/Relationship Thread local diffs Object cache FS Cache HA Let’s start at the bottom: the on disk Record files Transaction log storage file layout. Disk(s) 3 Monday, May 21, 2012
  • 4. Simple sample graph. It all boils down to linked lists of fixed size records on disk. Your graph on disk Properties are stored as a linked list of property records, each holding key+value. Each node/relationship references its first property record. The Nodes also reference the first node in its relationship chain. Each Relationship references its start and Name: Alistair end node. Age: 34 KNOWS It also references the prev/next relationship record for the start/end node respectively Name: Tobias Age: 27 Nationality: Swedish KNOWS KNOWS KNOWS Name: Ian Age: 42 Name: Jim Age: 37 KNOWS Stuff: good 4 Monday, May 21, 2012
  • 5. Simple sample graph. It all boils down to linked lists of fixed size records on disk. Your graph on disk Properties are stored as a linked list of property records, each holding key+value. Each node/relationship references its first Name property record. The Nodes also reference the first node in its Alistair relationship chain. Each Relationship references its start and end node. Name KNOWS It also references the prev/next relationship record for the start/end node respectively Tobias Age 34 Age 27 Nationality KNOWS Swedish KNOWS KNOWS Name Jim Age Name 37 Stuff Ian KNOWS Age good 42 4 Monday, May 21, 2012
  • 6. Simple sample graph. It all boils down to linked lists of fixed size records on disk. Your graph on disk Properties are stored as a linked list of property records, each holding key+value. Each node/relationship references its first Name property record. The Nodes also reference the first node in its Alistair relationship chain. Each Relationship references its start and end node. Name KNOWS It also references the prev/next relationship record for the start/end node respectively Tobias Age 34 Age 27 Nationality KNOWS Swedish KNOWS KNOWS Name Jim Age Name 37 Stuff Ian KNOWS Age good 42 4 Monday, May 21, 2012
  • 7. Simple sample graph. It all boils down to linked lists of fixed size records on disk. Your graph on disk Properties are stored as a linked list of property records, each holding key+value. Each node/relationship references its first Name SP EP property record. The Nodes also reference the first node in its Alistair relationship chain. SN EN Each Relationship references its start and end node. Name KNOWS It also references the prev/next relationship record for the start/end node respectively Tobias Age 34 Age 27 SP EP SP EP SN EN Nationality SN EN KNOWS Swedish KNOWS SP EP SN EN KNOWS Name SP EP Jim Age SN EN Name 37 Stuff Ian KNOWS Age good 42 4 Monday, May 21, 2012
  • 8. Simple sample graph. It all boils down to linked lists of fixed size records on disk. Your graph on disk Properties are stored as a linked list of property records, each holding key+value. Each node/relationship references its first Name SP EP property record. The Nodes also reference the first node in its Alistair relationship chain. SN EN Each Relationship references its start and end node. Name KNOWS It also references the prev/next relationship record for the start/end node respectively Tobias Age 34 Age 27 SP EP SP EP SN EN Nationality SN EN KNOWS Swedish KNOWS SP EP SN EN KNOWS Name SP EP Jim Age SN EN Name 37 Stuff Ian KNOWS Age good 42 4 Monday, May 21, 2012
  • 9. Simple sample graph. It all boils down to linked lists of fixed size records on disk. Your graph on disk Properties are stored as a linked list of property records, each holding key+value. Each node/relationship references its first Name SP EP property record. The Nodes also reference the first node in its Alistair relationship chain. SN EN Each Relationship references its start and end node. Name KNOWS It also references the prev/next relationship record for the start/end node respectively Tobias Age 34 Age 27 SP EP SP EP SN EN Nationality SN EN KNOWS Swedish KNOWS SP EP SN EN KNOWS Name SP EP Jim Age SN EN Name 37 Stuff Ian KNOWS Age good 42 4 Monday, May 21, 2012
  • 10. Simple sample graph. It all boils down to linked lists of fixed size records on disk. Your graph on disk Properties are stored as a linked list of property records, each holding key+value. Each node/relationship references its first Name SP EP property record. The Nodes also reference the first node in its Alistair relationship chain. SN EN Each Relationship references its start and end node. Name KNOWS It also references the prev/next relationship record for the start/end node respectively Tobias Age 34 Age 27 SP EP SP EP SN EN Nationality SN EN KNOWS Swedish KNOWS SP EP SN EN KNOWS Name SP EP Jim Age SN EN Name 37 Stuff Ian KNOWS Age good 42 4 Monday, May 21, 2012
  • 11. Store files ๏Node store ๏Relationship store • Relationship type store ๏Property store • Property key store • (long) String store Short string and array values are inlined in the • property store, long values are stored in (long) Array store separate store files. 5 Monday, May 21, 2012
  • 12. Neo4j Storage Record Layout Node (9 bytes) inUse nextRelId nextPropId 1 5 9 Relationship (33 bytes) inUse firstNode secondNode relationshipType firstPrevRelId firstNextRelId secondPrevRelId secondNextRelId nextPropId 1 5 9 13 17 21 25 29 33 Relationship Type (5 bytes) inUse typeBlockId 1 5 Property (33 bytes) inUse type keyIndexId propBlock nextPropId 1 3 5 29 33 Property Index (9 bytes) inUse propCount keyBlockId 1 5 9 Dynamic Store (125 bytes) inUse next data 1 5 NeoStore (5 bytes) inUse datum 1 5 Monday, May 21, 2012
  • 13. Outline Traversals Core API Cypher Next: The t wo levels of cache Node/Relationship in Neo4j. Thread local diffs The low level FS Cache for the Object cache record files. And the high level Object cache storing a structure more optimized for traversal. FS Cache HA Record files Transaction log Disk(s) 7 Monday, May 21, 2012
  • 14. The caches ๏ Filesystem cache: • Caches regions store file intofiles sized regions) (divides each of the store equally • The cache holds a fixed number of regions for each file • Regions are evicted based on ahit in non-cached region) (hit count vs. miss count, i.e. LFU-like policy • Default implementation of regions uses OS mmap ๏ Node/Relationship cache • Cache a version more optimized for traversals 8 Monday, May 21, 2012
  • 15. What we put in cache ID Relationship ID refs in: R1 R2 ... Rn The structure of the elements in the high level type 1 object cache. out R1 R2 ... Rn On disk most of the information is contained in: R1 R2 R3 ... Rn in the relationship records, with the nodes just type 2 referencing their first relationship. In the out R1 ... Rn Node cache this is turned around: the nodes hold references to all its relationships. The ... (grouped by type) relationships are simple, only holding its properties. The relationships for each node is grouped by RelationshipType to allow fast traversal of a specific type. Key 1 Key 2 ... Key n All references (dotted arrows) are by ID, and traversals do indirect lookup through the cache. Val 1 Val 2 ID start end type Val n Relationship Key 1 Key 2 ... Key n Val 1 Val 2 Val n 9 Monday, May 21, 2012
  • 16. Outline So how do traversals work... Traversals Core API Cypher Node/Relationship Thread local diffs Object cache FS Cache HA Record files Transaction log Disk(s) 10 Monday, May 21, 2012
  • 17. Traversals - how do they work? ๏ RelationshipExpanders: given (a path to) a node, returns The surface layer, the you Relationships to continue traversing from that node interact with. ๏ Evaluators: given (a path to) a node, returns whether to: • Continue traversing on that branch (i.e. expand) or not • Include (the path to) the node in the result set or not ๏ Then a projection to Path, Node, or Relationship applied to each Path in the result set. ... but also: ๏ Uniqueness level: policy for when it is ok to revisit a node that has already been visited ๏ Implemented on top of the Core API 11 Monday, May 21, 2012
  • 18. More on Traversals ๏ Fetch node data from cache - non-blocking access This is what happens • under the hood. If not in cache, retrieve from storage, into cache ‣If region is in FS cache: blocking but short duration access ‣If region is outside FS cache: blocking slower access ๏ Get relationships from cached node • If not fetched, retrieve from storage, by following chains ๏ Expand relationship(s) to end up on next node(s) • The relationship knows the node, no need to fetch it yet ๏ Evaluate • possibly emitting a Path into the result set ๏ Repeat 12 Monday, May 21, 2012
  • 19. Outline How is Cypher different? Traversals Core API Cypher and how dowes it work? Node/Relationship Thread local diffs Object cache FS Cache HA Record files Transaction log Disk(s) 13 Monday, May 21, 2012
  • 20. Cypher - Just convenient traversal descriptions? ๏ Builds on the same infrastructure as Traversals - Expanders • but not on the full Traversal system ๏ Uses graph pattern matching for traversing the graph • Recursive MATCH x-->y, backtracking z-->a-->b, z-->b START x=... matching with x-->z, y-->z, Red: pattern graph Blue: actual graph Green: start node Purple: matches 14 Monday, May 21, 2012
  • 21. Cypher - Just convenient traversal descriptions? ๏ Builds on the same infrastructure as Traversals - Expanders • but not on the full Traversal system ๏ Uses graph pattern matching for traversing the graph • Recursive MATCH x-->y, backtracking z-->a-->b, z-->b START x=... matching with x-->z, y-->z, Red: pattern graph Blue: actual graph Green: start node Purple: matches 14 Monday, May 21, 2012
  • 22. Cypher - Just convenient traversal descriptions? ๏ Builds on the same infrastructure as Traversals - Expanders • but not on the full Traversal system ๏ Uses graph pattern matching for traversing the graph • Recursive MATCH x-->y, backtracking z-->a-->b, z-->b START x=... matching with x-->z, y-->z, Red: pattern graph Blue: actual graph Green: start node Purple: matches 14 Monday, May 21, 2012
  • 23. Cypher - Just convenient traversal descriptions? ๏ Builds on the same infrastructure as Traversals - Expanders • but not on the full Traversal system ๏ Uses graph pattern matching for traversing the graph • Recursive MATCH x-->y, backtracking z-->a-->b, z-->b START x=... matching with x-->z, y-->z, Red: pattern graph Blue: actual graph Green: start node Purple: matches 14 Monday, May 21, 2012
  • 24. Cypher - Just convenient traversal descriptions? ๏ Builds on the same infrastructure as Traversals - Expanders • but not on the full Traversal system ๏ Uses graph pattern matching for traversing the graph • Recursive MATCH x-->y, backtracking z-->a-->b, z-->b START x=... matching with x-->z, y-->z, Red: pattern graph Blue: actual graph Green: start node Purple: matches 14 Monday, May 21, 2012
  • 25. Cypher - Just convenient traversal descriptions? ๏ Builds on the same infrastructure as Traversals - Expanders • but not on the full Traversal system ๏ Uses graph pattern matching for traversing the graph • Recursive MATCH x-->y, backtracking z-->a-->b, z-->b START x=... matching with x-->z, y-->z, Red: pattern graph Blue: actual graph Green: start node Purple: matches 14 Monday, May 21, 2012
  • 26. Cypher - Just convenient traversal descriptions? ๏ Builds on the same infrastructure as Traversals - Expanders • but not on the full Traversal system ๏ Uses graph pattern matching for traversing the graph • Recursive MATCH x-->y, backtracking z-->a-->b, z-->b START x=... matching with x-->z, y-->z, Red: pattern graph Blue: actual graph Green: start node Purple: matches 14 Monday, May 21, 2012
  • 27. Cypher - Just convenient traversal descriptions? ๏ Builds on the same infrastructure as Traversals - Expanders • but not on the full Traversal system ๏ Uses graph pattern matching for traversing the graph • Recursive MATCH x-->y, backtracking z-->a-->b, z-->b START x=... matching with x-->z, y-->z, Red: pattern graph Blue: actual graph Green: start node Purple: matches 14 Monday, May 21, 2012
  • 28. Cypher - Just convenient traversal descriptions? ๏ Builds on the same infrastructure as Traversals - Expanders • but not on the full Traversal system ๏ Uses graph pattern matching for traversing the graph • Recursive MATCH x-->y, backtracking z-->a-->b, z-->b START x=... matching with x-->z, y-->z, Red: pattern graph Blue: actual graph Green: start node Purple: matches 14 Monday, May 21, 2012
  • 29. What about gremlin? ๏ gremlin is a third party language, built by Marko Rodriguez of Tinkerpop (a group of people who like to hack on graphs) ๏ Originally based on the idea of using xpath to describe traversals: ./HAS_CART/CONTAINS_ITEM/PURCHASED/PURCHASED but bastardized to distinguish between nodes and relationships: ./outE[label=HAS_CART]/inV Traversals are close /outE[label=CONTAINS_ITEM]/inV to xpath, which is why xpath like /inE[label=PURCHASED]/outV descriptions of traversals seemed /outE[label=PURCHASED]/inV like a good idea. ๏ xpath is not complete enough to express full algorithms, it needs a host language, gremlin originally defined its own. This changed Groovy as a more complete host language and abandoned xpath in favor of method chaining [ replace ‘/’ with ‘.’ ] 15 Monday, May 21, 2012
  • 30. Gremlin compared to Cypher ๏start me=node:people(name={myname}) match me-[:HAS_CART]->cart-[:CONTAINS_ITEM]->item item<-[:PURCHASED]-user-[:PURCHASED]->recommendation return recommendation ๏ Cypher is declarative, describes what data to get - its shape ๏ Gremlin is imperative, prescribes how to get the data ๏ Cypher has more opportunities for optimization by the engine ๏ Gremlin can implement pagerank, Cypher can’t (yet?) 16 Monday, May 21, 2012
  • 31. Outline Traversals Core API Cypher Transactions involve t wo parts: The (thread local) changes being done by an active transaction, Node/Relationship and the transaction replay log Thread local diffs Object cache for recovery. FS Cache HA Record files Transaction log Disk(s) 17 Monday, May 21, 2012
  • 32. Transaction Isolation ๏ Mutating operations are not written when performed ๏ They are stored in a thread confined transaction state object ๏ This prevents other threads from seeing uncommitted changes from the transactions of other threads ๏ When Transaction.finish() is invoked the transaction is either committed or rolled back ๏ Rollback is simple: discard the transaction state object 18 Monday, May 21, 2012
  • 33. Transactions & Durability ๏ Commit is: • Changes made in the transaction are collected as commands • Commands are sorted to get predictable update order ‣This prevents concurrent readers from seeing inconsistent data when the changes are applied to the store • Write changes (in sorted order) to the transaction log • Mark the transaction as committed in the log • Apply the changes (in sorted order) to the store files 19 Monday, May 21, 2012
  • 34. Recovery ๏ Transaction commands dictate state, they don’t modify state • i.e. SET property "count" to 5 • rather than ADD 1 to property "count" ๏ Thus: Applying the same command twice yields the same state ๏ Recovery simply replays all transactions since the last safe point ๏ If tx A mutates node1.name, then tx B also mutates node1.name that doesn’t matter, because the database is not recovered until all transactions have been replayed 20 Monday, May 21, 2012
  • 35. Outline Traversals Core API Cypher Node/Relationship Thread local diffs Object cache FS Cache HA Record files Transaction log High Availability in Neo4j builds on top of the transaction replay Disk(s) 21 Monday, May 21, 2012
  • 36. Outline The transaction logs are shared bet ween all instances in an High Availability setup, all other parts operate on the local data just like in the standalone case. Traversals Core API Cypher Local Node/Relationship Thread local diffs Object cache FS Cache HA Record files Transaction log Shared Disk(s) 22 Monday, May 21, 2012
  • 37. • HA - the parts to it: ๏ Based on streaming transactions between servers ๏ All transactions are committed through the master • Then (eventually) applied to the slaves • Eventuality synchronizationupdate intervalby interaction or when defined by the is mandated ๏ When writing to a slave: • Locks coordinated through the master • Transaction data buffered on theget a txid applied first on the master to slave then applied with the same txid on the slave 23 Monday, May 21, 2012
  • 38. Creating new Nodes and Relationships ๏ New Nodes/Relationships don’t need locks, so they don’t need a transaction synced with master until the transaction commit ๏ They do need an ID that is unique and equal among all instances ๏ Each instance allocates IDs in blocks from the master, then assigns them to new Nodes/Relationships locally • This batch allocation can be seen in (Enterprise) 1000 as Node/Relationship counts jumping in steps of WebAdmin 24 Monday, May 21, 2012
  • 39. HA synchronization points ๏ Transactions are uniquely identified by monotonically increasing ID ๏ All Requests from slave send the current latest txid on that slave ๏ Responses from master send back a stream of transactions that have occurred since then, along with the actual result ๏ Transaction is replayed just like when committed / recovered ๏ Nodes/Relationships touched during this application are evicted from cache to avoid cache staleness ๏ Transaction commands are only sorted when created, before stored/transmitted, thus consistency is preserved during all application phases 25 Monday, May 21, 2012
  • 40. Locking semantics of HA ๏ To be granted a lock the slave must have the latest version of the Node/Relationship it is trying to lock • This ensures consistency • The implementation of “Latest version ofentireNode/ Relationship” is “Latest version of the the graph” • The slave must thus sync transactions from the master 26 Monday, May 21, 2012
  • 41. Master election ๏ Each instance communicates/coordinates: • its latest transaction id (including the master id for that tx) • the id forclock value for when the txid was written that instance • (logical) ๏ Election chooses: 1. The instance with highest txid 2. IF multiple: The instance that was master for that tx 3. IF unavailable: The instance with the lowest clock value 4. IF multiple: The instance with the lowest id ๏ Election happens when the current master cannot be reached • Any instance can choose to re-elect • Each instance runs the election protocol individually • Notify others when election chooses new master ๏ When elected, the new master broadcasts to all instances, 27 forcing them to bind to the new master Monday, May 21, 2012
  • 42. Thank you for listening! tobias@neotechnology.com Tobias Lindaaker twitter: @thobe, #neo4j (@neo4j) web: neo4j.org neotechnology.com Hacker @ Neo Technology my web: thobe.org Monday, May 21, 2012