This document discusses graph databases and their use in computational biology. It introduces Neo4j and TitanDB as graph database options and describes how biological interaction networks and pathways can be modeled as graphs. Key advantages of graph databases over relational databases are also summarized, such as increased speed for graph queries and simpler programming. The document provides an overview of Neo4j and TitanDB, including their core abstractions, interfaces, and advantages/limitations for storing large biological network data. Examples are given of loading Reactome pathway data into Neo4j and performing graph queries.
3. Why even bother?
● ~1 GB of raw data from Reactome
● ~300 MB of data from UniProt / GO /
Ensembl / … mappings
● => this is well over the conventional 1024 MB
JVM heap limit => heap crash
● ~ 15 minutes to load
● Nightmare to visualize and debug
9. Core abstractions
● Objects:
– Nodes (Vertexes)
– Relationships between nodes (Edges)
– Properties for Vertexes and Edges
● Operations:
– Immediate relations
– Traversals
● Get the shortest path from j to k
● Get the path with least weight from j to k, ...
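A minimal sketch of these abstractions in plain Python, not tied to Neo4j or TitanDB: nodes and edges both carry property dictionaries, and the two traversals listed above are done with breadth-first search (fewest hops) and Dijkstra's algorithm (least total weight). The class name `PropertyGraph`, its methods, and the toy node names are assumptions made for illustration.

```python
import heapq
from collections import defaultdict, deque


class PropertyGraph:
    """Toy in-memory property graph (illustrative only)."""

    def __init__(self):
        self.nodes = {}               # node id -> properties
        self.adj = defaultdict(dict)  # node id -> {neighbour id: edge properties}

    def add_node(self, node_id, **props):
        self.nodes[node_id] = props

    def add_edge(self, a, b, **props):
        # Undirected edge; both directions share the same property dict.
        self.adj[a][b] = props
        self.adj[b][a] = props

    def neighbours(self, node_id):
        """Immediate relations of a node."""
        return sorted(self.adj[node_id])

    def shortest_path(self, start, goal):
        """Path with the fewest edges (breadth-first search)."""
        previous = {start: None}
        queue = deque([start])
        while queue:
            current = queue.popleft()
            if current == goal:
                break
            for nxt in self.adj[current]:
                if nxt not in previous:
                    previous[nxt] = current
                    queue.append(nxt)
        return self._rebuild(previous, goal)

    def lightest_path(self, start, goal, weight="weight"):
        """Path with the least total edge weight (Dijkstra)."""
        best = {start: 0.0}
        previous = {start: None}
        heap = [(0.0, start)]
        while heap:
            cost, current = heapq.heappop(heap)
            if current == goal:
                break
            if cost > best[current]:
                continue  # stale heap entry
            for nxt, props in self.adj[current].items():
                new_cost = cost + props.get(weight, 1.0)
                if new_cost < best.get(nxt, float("inf")):
                    best[nxt] = new_cost
                    previous[nxt] = current
                    heapq.heappush(heap, (new_cost, nxt))
        return self._rebuild(previous, goal)

    @staticmethod
    def _rebuild(previous, goal):
        if goal not in previous:
            return None  # goal unreachable
        path = []
        while goal is not None:
            path.append(goal)
            goal = previous[goal]
        return path[::-1]


g = PropertyGraph()
for name in ("j", "a", "b", "k"):
    g.add_node(name, kind="protein")
g.add_edge("j", "k", weight=10.0)  # direct but "heavy"
g.add_edge("j", "a", weight=1.0)
g.add_edge("a", "b", weight=1.0)
g.add_edge("b", "k", weight=1.0)

hops = g.shortest_path("j", "k")   # fewest edges: j -> k
light = g.lightest_path("j", "k")  # least weight: j -> a -> b -> k
```

A graph database persists and indexes these same structures; the point of the sketch is only that both traversals follow adjacency lists and never scan unrelated nodes.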
11. Main advantages promised
● Increased speed for graph-type applications
– Avoid “join” on 10M rows to get ~20 “related”
elements
– Traversals
● Simplified programming
– Java objects
– Xml / rdf / owl
– Schema alterations
12. Main advantages promised
● Ease of deployment / maintenance:
– Scalability
– Complexity
– Modifications
– Schema migrations
13. neo4j
● Started in 2003
● Schema-free
● ACID transactions
● Reasonably scalable, reasonably replicatable
● 10 000 open source projects, 1000 commercial
customers
19. neo4j
● Started in 2003
● Schema-free
● ACID transactions
● 10 000 open source projects, 1000 commercial
customers
● 100 % open source
● Master-slave replication
● AGPL 3 license: if you are open source, it is free,
even the support
● Plus a graphical interface => debugging!
21. Deployment Demo
● cd to specific DB location (better as a special
user)
● ./neo4j start
● ./neo4j stop
● => Serves localhost:7474
● 40 000 files => mainly indexes / user accesses
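After `./neo4j start`, a quick way to confirm that something is answering on localhost:7474 is an HTTP probe. The sketch below uses only the Python standard library; the function name `server_is_up` is hypothetical, and it only checks that an HTTP server responds, not that the database itself is healthy.

```python
import urllib.error
import urllib.request


def server_is_up(url="http://localhost:7474/", timeout=2.0):
    """Return True if something answers HTTP at `url`, False otherwise."""
    try:
        with urllib.request.urlopen(url, timeout=timeout):
            return True
    except urllib.error.HTTPError:
        # The server answered, just not with a 2xx status (e.g. auth required).
        return True
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, DNS failure, ...
        return False


status = server_is_up()
print("Neo4j web interface reachable:", status)
```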
22. Under the hood
● Java & JVM
● Split in two
– In-RAM “pre-heated” vs. whole on-HDD
● Scalability:
– 32 G nodes / 32 G relationships / 64 G properties
– 1 M traversals / sec, independent of graph size
● Lucene index: instant search
30. neo4j-specific
● Lucene index in the backend
– Exact indexing => constant-time retrieval
– Full-text indexing => searching partial names and
adding the missing links
● SRC = SRC_HUMAN = SRC1
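The difference between the two index types can be sketched without Lucene: a dictionary gives exact lookup in constant time, while a partial-name search matches fragments and can then be used to propose alias links such as SRC = SRC_HUMAN = SRC1. This is plain Python standing in for the index, not the Neo4j or Lucene API; the node ids and helper names are hypothetical, and the linear substring scan is a simplification of what a real full-text index does.

```python
# Toy stand-in for the Lucene-backed index (not the Neo4j API).
names = {
    "SRC": 1,        # gene name -> hypothetical node id
    "SRC_HUMAN": 2,  # UniProt entry name
    "SRC1": 3,       # alternative name
    "TP53": 4,
}


def exact_lookup(name):
    """Exact index: one hash lookup, independent of index size."""
    return names.get(name)


def partial_lookup(fragment):
    """Full-text-like search: every indexed name containing the fragment."""
    fragment = fragment.upper()
    return sorted(n for n in names if fragment in n)


def propose_links(fragment):
    """Pairs of node ids that a curator might link as aliases."""
    hits = [names[n] for n in partial_lookup(fragment)]
    return [(a, b) for i, a in enumerate(hits) for b in hits[i + 1:]]


exact = exact_lookup("SRC_HUMAN")
partial = partial_lookup("src")
links = propose_links("src")
```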
31. Demo3
● Constant node retrieval time / internode
connection distance time
● Performing the partial search
● Adding missing links
● Neo4j server vs. local database
● Performing simple Gremlin queries
33. Use Case:
● Existent map of correlations:
[Diagram; node labels: Protein Domain, Domain Type, Protein, function]
● Wanted map of correlations:
[Diagram; node labels: Protein Domain, Domain Type, Protein, function]
35. Use Case
● SQL Python / SQLAlchemy:
– Create new table
– Add ForeignKeys, Primary key, indexes, ...
– Add the table to the data model,
– Create functions for access/update,
– ...
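To make the relational steps concrete, here is a sketch that uses the standard-library `sqlite3` module instead of SQLAlchemy, so that it runs without extra dependencies. The table and column names (`protein`, `domain_type`, `protein_domain_type`) are assumptions derived from the diagram labels, not the author's schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")

# Pre-existing tables (assumed).
conn.executescript("""
CREATE TABLE protein     (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE domain_type (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
""")

# New table with primary key, foreign keys and an index,
# needed just to record one new kind of correlation.
conn.executescript("""
CREATE TABLE protein_domain_type (
    protein_id     INTEGER NOT NULL REFERENCES protein(id),
    domain_type_id INTEGER NOT NULL REFERENCES domain_type(id),
    PRIMARY KEY (protein_id, domain_type_id)
);
CREATE INDEX idx_pdt_domain_type ON protein_domain_type(domain_type_id);
""")


# Access/update functions that the data model now has to carry.
def link(protein_id, domain_type_id):
    conn.execute(
        "INSERT OR IGNORE INTO protein_domain_type VALUES (?, ?)",
        (protein_id, domain_type_id),
    )


def domain_types_of(protein_name):
    rows = conn.execute(
        """SELECT dt.name FROM protein p
           JOIN protein_domain_type pdt ON pdt.protein_id = p.id
           JOIN domain_type dt ON dt.id = pdt.domain_type_id
           WHERE p.name = ? ORDER BY dt.name""",
        (protein_name,),
    )
    return [r[0] for r in rows]


conn.execute("INSERT INTO protein VALUES (1, 'SRC_HUMAN')")
conn.executemany("INSERT INTO domain_type VALUES (?, ?)",
                 [(1, "SH2"), (2, "SH3"), (3, "Kinase")])
for dt_id in (1, 2, 3):
    link(1, dt_id)

result = domain_types_of("SRC_HUMAN")
```

In a property graph the same change would be one new relationship type between nodes that already exist, with no new table and no migration.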
37. Use case 2
● In the human proteome, find all chemical groups A and B separated
by less than x Å
– Database structure:
● Suppose all the proteins are connected to a “Type node”
● Each protein is linked to its domains, each domain is linked to its amino acids,
each amino acid is linked to its chemical groups and ultimately atoms
● Chemical groups have an assigned distance between them and the groups they are
close to
– Algorithm
● Select a protein of interest
● Get all of its chemical groups: 1000 (a.a.) * 3 (ch.gr./a.a.)
● Filter out all of the relations longer than x: 1000 * 3 * 100 (possible contacts per ch.gr.)
● Recover the proteins: 1000 * 3 * 100 * 2
● With 1 M traversals per second => 0.6 sec to execute the query
– With TitanDB plus ElasticSearch and geo-queries (all within a circle of radius
x), higher speeds are possible
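The 0.6 sec estimate follows directly from the counts in the algorithm. A worked version of the arithmetic, with variable names chosen here for readability (the numbers are the ones given above):

```python
residues_per_protein = 1000        # amino acids in the protein of interest
groups_per_residue = 3             # chemical groups per amino acid
contacts_per_group = 100           # possible contacts per chemical group
recover_factor = 2                 # factor applied when recovering the proteins
traversals_per_second = 1_000_000  # quoted traversal rate

groups = residues_per_protein * groups_per_residue  # 3 000
contacts = groups * contacts_per_group              # 300 000
total_traversals = contacts * recover_factor        # 600 000
seconds = total_traversals / traversals_per_second  # 0.6

print(groups, contacts, total_traversals, seconds)
```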
38. Limitations
● Node Number:
– 32 Giga Nodes / Edges is a lot on servers
● ~100 TB of data
● 1 Unix partition
● 40 000 ++ simultaneously opened files (Indexes+users)
– 32 Giga Edges is relatively small in biology
● ~43 M nodes in UniProt alone
● GO x UNIPROT x EMBL x GeneNames x Interaction
Maps x Localisations x names & Accesses ....
● All potentially druggable molecules, all protein atoms, all
atom-atom interactions
39. Limitations
● Absence of parallelism/distribution
– One process at a time:
● 1 traversal at a time
● ACID => database locks
● Though master-slave distribution is available
– Single partition
● Replication
● 100 TB + RAID!?
● Though full support for AWS and VMs
40. Limitations
● Bulbs: Python over Gremlin scripts
– Gremlin → Groovy → JVM → do what you want
=> SQL-style (Gremlin) injections
– Request sanitization needed
● Hashes of the queries without variables
● Pre-filtering before query referral to server
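One way to read the two sanitization bullets: keep a whitelist of hashes of known query templates (the query text without its variables) and validate the variables separately before anything is sent to the server. The sketch below is plain Python with hypothetical names (`ALLOWED`, `prepare`, `is_rejected`); it does not call Bulbs or a Gremlin server, and the Gremlin template is only an illustrative string.

```python
import hashlib
import re


def template_hash(template):
    return hashlib.sha256(template.encode("utf-8")).hexdigest()


# Query templates reviewed in advance; variables appear only as named parameters.
FIND_BY_NAME = "g.V('name', name)"
ALLOWED = {template_hash(FIND_BY_NAME)}

# Pre-filter: parameter values restricted to a conservative character set.
SAFE_VALUE = re.compile(r"[A-Za-z0-9_.:-]{1,64}")


def prepare(template, params):
    """Return (template, params) if both pass the checks, else raise ValueError."""
    if template_hash(template) not in ALLOWED:
        raise ValueError("unknown query template")
    for key, value in params.items():
        if not SAFE_VALUE.fullmatch(str(value)):
            raise ValueError(f"rejected value for parameter {key!r}")
    return template, dict(params)


def is_rejected(template, params):
    try:
        prepare(template, params)
    except ValueError:
        return True
    return False


ok = prepare(FIND_BY_NAME, {"name": "SRC_HUMAN"})
injected_template = is_rejected("g.V('name', name); System.exit(0)", {"name": "x"})
injected_value = is_rejected(FIND_BY_NAME, {"name": "x'); System.exit(0); //"})
```

The template and the parameters stay separate all the way to the server call, so user input is never concatenated into the script text.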
41. Limitations
● Bulk insert not natively implemented in Bulbs:
– Insertion rate ~10 nodes/sec
– Naive Python binding tests:
● ~60 msec for ACID compliance (HDD write)
● ~1.8 msec/node for cold insertion routines
(HDD sequential write)
● ~0.3 msec/node for hot insertion routines (RAM buffer)
– 500 - 1500 nodes/sec if in batches of 1000
● 6 h to fill the database up to theoretical limit
– github.com/chefjerome/graphalchemy implements efficient
flush based on Bulbs (alpha and thus unstable right now)
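Why batching helps can be seen with a rough cost model built from the timings above, assuming each batch pays the ACID commit cost once and each node pays a fixed insertion cost. This is a simplification introduced here, not a measurement; the function name `nodes_per_second` is hypothetical.

```python
COMMIT_COST_S = 0.060      # ~60 msec per ACID commit (HDD write)
COLD_NODE_COST_S = 0.0018  # ~1.8 msec/node, cold insertion
HOT_NODE_COST_S = 0.0003   # ~0.3 msec/node, hot insertion (RAM buffer)


def nodes_per_second(batch_size, per_node_cost_s):
    """Throughput if each batch of `batch_size` nodes pays one commit."""
    return batch_size / (COMMIT_COST_S + batch_size * per_node_cost_s)


one_by_one = nodes_per_second(1, COLD_NODE_COST_S)
batched_cold = nodes_per_second(1000, COLD_NODE_COST_S)
batched_hot = nodes_per_second(1000, HOT_NODE_COST_S)
print(round(one_by_one), round(batched_cold), round(batched_hot))
```

Under this model, one commit per node gives about 16 nodes/sec, the same order of magnitude as the ~10 nodes/sec observed, and cold batches of 1000 give about 540 nodes/sec, near the low end of the measured 500 - 1500 range. The hot-write figure comes out above the measured range, so it is best read as an optimistic bound of the simplified model.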
44. TitanDB
● HBase / Cassandra / BerkeleyDB as storage backend
● Lucene / ElasticSearch as indexing backend
● Served over Rexster server
● Full distribution
● > 500 simultaneous connections (5000 is still stable)
● Automatic replication (Hadoop)
● Multiple simultaneous queries
● The sky is the only limit for storage quantities
=> TitanDB / HBase is stable up to 5 PB in production
49. Neo4j for bioinformatics:
parsing and curating Reactome.org
● Reality of Reactome.org:
– Main connected component: ~22 000 entities, but 6 others
with >100 elements
– Presence of generic classes : groups of objects
– Proteins = mix between proteins, domains, groups,
groups of domains…
– 15 000 proteins, 5000 UNIPROT references
– 156 genes, 56 RNA molecules => translation /
transcription regulation is not well described
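The picture of one main connected component plus a few smaller ones can be checked with a component-size count. A minimal sketch in plain Python over a toy edge list; the entity names are placeholders, not Reactome identifiers, and the function name `component_sizes` is hypothetical.

```python
from collections import defaultdict


def component_sizes(edges):
    """Sizes of the connected components of an undirected graph, largest first."""
    adj = defaultdict(set)
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    seen = set()
    sizes = []
    for start in adj:
        if start in seen:
            continue
        seen.add(start)
        stack = [start]
        size = 0
        while stack:  # depth-first walk of one component
            node = stack.pop()
            size += 1
            for nxt in adj[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        sizes.append(size)
    return sorted(sizes, reverse=True)


toy_edges = [("A", "B"), ("B", "C"), ("C", "D"),  # main component
             ("X", "Y"),                           # small island
             ("P", "Q"), ("Q", "R")]               # another island
sizes = component_sizes(toy_edges)
```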
52. Neo4j for bioinformatics:
parsing and curating Reactome.org
● Completed with HINT protein-protein interactions
from the Yu lab at Cornell
● Re-indexed:
– SwissProt protein names
– Full names from SwissProt
– Gene Names
– KEGG, GO, EMBL, ChEBI cross-references
– PDB implemented, not re-run
54. Conclusion
● Systems biology is more about graphs than
about systems of tables
● Graph Databases are awesome
● Neo4j is terrific
● TitanDB is cool
● You should definitely pick one of them, load
Reactome.org dataset or whatever you are
interested in and play with it.
56. Thanks
Pr. Philip Bourne
Pr. Bart Deplanke
Cedric Merlot
Li Xie
Spencer Bliven
Jiang Wang
Julia Ponomarenko
Cole Christie
Andreas Prlić
Lilia Iakoucheva
Editor's Notes
Why join millions of rows if only 10 relationships are interesting? What to do if we want traversals?