Graph databases for computational biology

Graph databases in computational biology: Neo4j and TitanDB
Andrei Kucharavy 23/08/2013

Why even bother?
● Rigid structure of interactions = Interactome
● Knowledge access structure = GO
● Those are graphs

Why even bother?
● ~1 GB of raw data from Reactome
● ~300 MB of data from UniProt / GO / ENSEMBL / … mappings
● => this is way over the conventional 1024 MB JVM limit => heap crash
● ~ 15 minutes to load
● Nightmare to visualize and debug

Relational Databases
Intro to neo4j presentation – jexp @ slideshare

Graph databases
Intro to neo4j presentation – jexp @ slideshare

Core abstractions
● Objects:
– Nodes (Vertexes)
– Relationships between nodes (Edges)
– Properties for Vertexes and Edges

[Figure: example graph of nodes Node1, Node2 and Node3, with properties attached to both nodes and relationships]

● Operations:
– Immediate relations
– Traversals
● Get the shortest path from j to k
● Get the path with least weight from j to k, ...
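
The objects and operations listed above can be sketched in a few lines of plain Python. This is only an in-memory illustration, not how Neo4j or TitanDB store data; the class name PropertyGraph, the node ids and the "weight" property are assumptions made up for the example. Shortest path is done with breadth-first search, least-weight path with Dijkstra's algorithm.

```python
from collections import deque
import heapq


class PropertyGraph:
    """Minimal in-memory property graph (illustrative, not a database)."""

    def __init__(self):
        self.nodes = {}  # node id -> dict of node properties
        self.edges = {}  # node id -> list of (neighbour id, edge properties)

    def add_node(self, node_id, **props):
        self.nodes[node_id] = props
        self.edges.setdefault(node_id, [])

    def add_edge(self, src, dst, **props):
        # Assumption: relationships are treated as undirected in this sketch.
        self.edges[src].append((dst, props))
        self.edges[dst].append((src, props))

    def neighbours(self, node_id):
        """Immediate relations of a node."""
        return [dst for dst, _ in self.edges[node_id]]

    def shortest_path(self, start, goal):
        """Path with the fewest hops (breadth-first search), or None."""
        previous = {start: None}
        queue = deque([start])
        while queue:
            current = queue.popleft()
            if current == goal:
                path = []
                while current is not None:
                    path.append(current)
                    current = previous[current]
                return path[::-1]
            for nxt in self.neighbours(current):
                if nxt not in previous:
                    previous[nxt] = current
                    queue.append(nxt)
        return None

    def lightest_path(self, start, goal, weight="weight"):
        """Path with the least total edge weight (Dijkstra), or None."""
        best = {start: 0.0}
        heap = [(0.0, start, [start])]
        while heap:
            cost, current, path = heapq.heappop(heap)
            if current == goal:
                return cost, path
            if cost > best.get(current, float("inf")):
                continue  # stale heap entry
            for nxt, props in self.edges[current]:
                new_cost = cost + props[weight]
                if new_cost < best.get(nxt, float("inf")):
                    best[nxt] = new_cost
                    heapq.heappush(heap, (new_cost, nxt, path + [nxt]))
        return None


g = PropertyGraph()
for name in ("j", "a", "b", "k"):
    g.add_node(name, kind="protein")
g.add_edge("j", "k", weight=10)  # direct but heavy
g.add_edge("j", "a", weight=1)
g.add_edge("a", "b", weight=1)
g.add_edge("b", "k", weight=1)

fewest_hops = g.shortest_path("j", "k")
least_weight = g.lightest_path("j", "k")
```

In this toy graph the two traversals disagree: the direct j-k edge wins on hop count, while the three-edge detour wins on total weight.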

Main advantages promised
● Increased speed for graph-type applications
– Avoid “join” on 10M rows to get ~20 “related” elements
– Traversals
● Simplified programming
– Java objects
– Xml / rdf / owl
– Schema alterations

Main advantages promised
● Ease of deployment / maintenance:
– Scalability
– Complexity
– Modifications
– Schema migrations

neo4j
● Started in 2003
● Schema-free
● ACID transactions
● Reasonably scalable, reasonably replicable
● 10 000 open source projects, 1000 commercial customers
● 100 % open source: https://github.com/neo4j/neo4j
● Master-slave replication
● AGPL 3 license: if you are open source, it is free, even the support
● Plus graphical interface => debug!

Deployment Demo
● cd to the specific DB location (better as a special user)
● ./neo4j start
● ./neo4j stop
● => Serves localhost:7474
● 40 000 files => mainly indexes / user accesses

Under the hood
● Java & JVM
● Split in two
– In-RAM “pre-heated” vs. whole on HDD
● Scalability:
– 32 G nodes / 32 G relations / 64 G properties
– 1 M traversals / sec, independent of graph size
● Lucene index: instant search

Interfaces
● Two-fold interface:
– REST server
– Local instance
● Specific query language: Cypher
● Interoperability: support for the TinkerPop stack

What is Gremlin
● Domain-specific graph language
● Built atop Groovy
– JVM
– Dynamically evaluated
– ~ scripting in Java
● Core = Java
– Java
– Scala / Clojure
– JPype / Jython / JRuby
● Supported by most graph databases

Interfaces
● Native bindings:
– Java
– Python, PHP, Ruby / Rails, node.js, .Net
– Scala, Clojure, Haskell, ...
● My stack:
– Native Python and Python through bulbs and REST

Python + Bulbs + REST + neo4j
● Bulbs = Pythonic wrapper for Gremlin
● Portability (Blueprints + Rexster)
– Titan DB (will be discussed later on)
– Bitsy
– InfiniteGraph
– Sqrrl
– ArangoDB
● Class inheritance and DDT:
– Java-like class inheritance

Demo 2
● Datatype declaration
● GraphDB connection and declaration
● Fill-in
● Graphical Interface

neo4j-specific
● Lucene index in the backend
– Exact indexing => constant-time retrieval
– Full-text indexing => searching partial names and
adding the missing links
● SRC = SRC_HUMAN = SRC1
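
The difference between the two kinds of lookup can be illustrated with a small stand-in written in plain Python. This is not Lucene and does not reflect its internals: the NameIndex class is hypothetical, the exact index is just a dictionary, and the partial search is a simple prefix scan over name tokens. It only shows why a partial search for "src" surfaces SRC, SRC_HUMAN and SRC1 as candidates for a missing link.

```python
from collections import defaultdict


class NameIndex:
    """Toy stand-in for an exact index plus a partial-name index."""

    def __init__(self):
        self.exact = {}                 # full name -> node id
        self.tokens = defaultdict(set)  # lower-cased token -> node ids

    def add(self, node_id, name):
        self.exact[name] = node_id
        # Split "SRC_HUMAN" into the tokens "src" and "human".
        for token in name.lower().replace("_", " ").split():
            self.tokens[token].add(node_id)

    def get(self, name):
        """Exact lookup: one dictionary access, independent of index size."""
        return self.exact.get(name)

    def search(self, fragment):
        """Partial lookup: every node with a token starting with the fragment."""
        fragment = fragment.lower()
        hits = set()
        for token, node_ids in self.tokens.items():
            if token.startswith(fragment):
                hits |= node_ids
        return hits


index = NameIndex()
index.add(1, "SRC")
index.add(2, "SRC_HUMAN")
index.add(3, "SRC1")

exact_hit = index.get("SRC_HUMAN")
candidates = index.search("src")
```

The node ids returned by the partial search are the candidates between which the missing "same entity" links could then be added.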

Demo3
● Constant node retrieval time / internode connection distance time
● Performing the partial search
● Adding missing links
● Neo4j server vs. local database
● Performing simple Gremlin queries

Use Case:
● Existing map of correlations:
[Figure: diagram relating Protein, Protein Domain, Domain Type and function]
● Wanted map of correlations:
[Figure: second diagram of the same four elements]

Use Case
● SQL Python / SQLAlchemy:
– Create new table
– Add ForeignKeys, Primary key, indexes, ...
– Add the table to the data model,
– Create functions for access/update,
– ...
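
To give a feel for the steps above, here is a minimal sketch using the standard-library sqlite3 module rather than SQLAlchemy. Everything in it is an assumption for illustration: the table and column names, the choice of a direct protein to domain type link as the new correlation, and the sample rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")

# Pre-existing part of the (hypothetical) data model.
conn.executescript("""
    CREATE TABLE protein     (id INTEGER PRIMARY KEY, name TEXT NOT NULL UNIQUE);
    CREATE TABLE domain_type (id INTEGER PRIMARY KEY, name TEXT NOT NULL UNIQUE);
""")

# Steps 1 and 2: a new table for the new correlation, with keys and an index.
conn.executescript("""
    CREATE TABLE protein_domain_type (
        protein_id     INTEGER NOT NULL REFERENCES protein(id),
        domain_type_id INTEGER NOT NULL REFERENCES domain_type(id),
        PRIMARY KEY (protein_id, domain_type_id)
    );
    CREATE INDEX idx_pdt_domain_type ON protein_domain_type (domain_type_id);
""")


# Steps 3 and 4: the data model needs new access and update functions.
def link(protein_id, domain_type_id):
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT INTO protein_domain_type VALUES (?, ?)",
            (protein_id, domain_type_id),
        )


def domain_types_of(protein_name):
    rows = conn.execute(
        """
        SELECT dt.name
        FROM protein p
        JOIN protein_domain_type pdt ON pdt.protein_id = p.id
        JOIN domain_type dt ON dt.id = pdt.domain_type_id
        WHERE p.name = ?
        ORDER BY dt.name
        """,
        (protein_name,),
    )
    return [name for (name,) in rows]


# Illustrative sample data.
with conn:
    conn.execute("INSERT INTO protein (id, name) VALUES (1, 'SRC_HUMAN')")
    conn.executemany(
        "INSERT INTO domain_type (id, name) VALUES (?, ?)",
        [(1, "SH2"), (2, "SH3")],
    )
link(1, 1)
link(1, 2)
```

In a property graph the same change would amount to creating relationships of a new type between existing nodes, with no table to create or migrate.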

Use Case
● Bulbs / Neo4j => Live demo

Use case 2
● In the human proteome, find all chemical groups A and B separated by less than x Å
– Database structure:
● Suppose all the proteins are connected to a “Type node”
● Each protein is linked to its domains, each domain to its amino acids, each amino acid to its chemical groups and ultimately atoms
● Chemical groups have an assigned distance to the groups they are close to
– Algorithm:
● Select a protein of interest
● Get all of its chemical groups: 1000 (a.a.) * 3 (ch. gr. / a.a.)
● Filter out all of the relations longer than x: 1000 * 3 * 100 (possible contacts per ch. gr.)
● Recover the proteins: 1000 * 3 * 100 * 2
● With 1M traversals per second => 0.6 sec. to execute the query
– With TitanDB, ElasticSearch and geo-queries (all within a circle of radius x), higher speeds are possible
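
The 0.6 sec. estimate follows directly from the figures on the slide. The short calculation below just multiplies them out; the constant names are made up for the example, and the final factor of 2 is taken from the slide as is.

```python
AMINO_ACIDS_PER_PROTEIN = 1000      # from the slide
GROUPS_PER_AMINO_ACID = 3           # from the slide
CONTACTS_PER_GROUP = 100            # possible contacts per chemical group
PROTEIN_RECOVERY_FACTOR = 2         # the slide's final "* 2"
TRAVERSALS_PER_SECOND = 1_000_000   # advertised traversal rate


def estimated_traversals():
    """Number of traversals needed for one protein of interest."""
    return (AMINO_ACIDS_PER_PROTEIN
            * GROUPS_PER_AMINO_ACID
            * CONTACTS_PER_GROUP
            * PROTEIN_RECOVERY_FACTOR)


def estimated_query_seconds():
    return estimated_traversals() / TRAVERSALS_PER_SECOND


traversals = estimated_traversals()
seconds = estimated_query_seconds()
```

That is 600 000 traversals per protein of interest, so the estimate scales linearly with the number of proteins queried.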

Limitations
● Node Number:
– 32 Giga nodes / edges is a lot on servers
● ~100 TB of data
● 1 Unix partition
● 40 000+ simultaneously opened files (indexes + users)
– 32 Giga edges is relatively small in biology
● ~43 M nodes in UniProt alone
● GO x UniProt x EMBL x GeneNames x interaction maps x localisations x names & accesses ...
● All potentially druggable molecules, all protein atoms, all atom-atom interactions

Limitations
● Absence of parallelism/distribution
– One process at a time:
● 1 traversal at a time
● ACID => database locks
● Though master-slave distribution is available
– Single partition
● Replication
● 100 TB + RAID!?
● Though full support for AWS and VMs

Limitations
● Bulbs: Python over Gremlin scripts
– Gremlin → Groovy → JVM → do what you want => SQL-style (Gremlin) injections
– Request sanitization needed
● Hashes of the queries without variables
● Pre-filtering before the query is referred to the server
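
One way to read the two countermeasures above is sketched below in plain Python. It is an assumption-laden illustration, not part of Bulbs or of any server API: prepare_query, the allow-list and the value pattern are hypothetical, and the Gremlin template string is only a placeholder that has not been checked against a server. The idea is that the script text never contains user input, its hash must be on an allow-list, and the variables are pre-filtered and passed separately.

```python
import hashlib
import re


def _digest(script):
    return hashlib.sha256(script.encode("utf-8")).hexdigest()


# Hypothetical template; the variable "name" is bound separately.
LOOKUP_BY_NAME = "g.V('name', name)"

# Hashes of the approved queries, without their variables.
APPROVED_HASHES = {_digest(LOOKUP_BY_NAME)}

# Pre-filter: values may only contain characters expected in identifiers.
SAFE_VALUE = re.compile(r"[A-Za-z0-9_\-]{1,64}")


def prepare_query(template, params):
    """Return a request payload, or raise ValueError if anything looks unsafe."""
    if _digest(template) not in APPROVED_HASHES:
        raise ValueError("script template is not on the allow-list")
    for key, value in params.items():
        if not SAFE_VALUE.fullmatch(str(value)):
            raise ValueError("unsafe value for parameter %r" % key)
    # The script and its parameters travel separately; no string concatenation.
    return {"script": template, "params": dict(params)}


request = prepare_query(LOOKUP_BY_NAME, {"name": "SRC_HUMAN"})
```

A value such as "x'); g.clear(); //" is rejected by the pre-filter, and a modified script is rejected because its hash is unknown.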

Limitations
● Bulk insert not natively implemented in Bulbs:
– Insertion rate ~10 nodes/sec
– Naive Python binding tests:
● ~60 msec for ACID compliance (HDD write)
● ~1.8 msec/node for cold insertion routines (HDD sequential write)
● ~0.3 msec/node for hot write insertion routines (RAM buffer)
– 500 - 1500 nodes/sec if inserted in packages of 1000
● 6 h to fill the database up to the theoretical limit
– github.com/chefjerome/graphalchemy implements an efficient flush based on Bulbs (alpha and thus unstable right now)
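
The effect of batching can be estimated from the timings above with a very simple cost model. The model itself is an assumption, not something measured in the talk: it supposes that the ~60 msec ACID cost is paid once per transaction and that the per-node cost is added for every node in the batch.

```python
COMMIT_MS = 60.0          # ACID compliance cost per transaction (from the slide)
COLD_MS_PER_NODE = 1.8    # cold insertion, HDD sequential write (from the slide)
HOT_MS_PER_NODE = 0.3     # hot insertion, RAM buffer (from the slide)


def nodes_per_second(batch_size, ms_per_node, commit_ms=COMMIT_MS):
    """Throughput if one commit is paid per batch of `batch_size` nodes."""
    batch_ms = commit_ms + batch_size * ms_per_node
    return batch_size / (batch_ms / 1000.0)


one_by_one = nodes_per_second(1, COLD_MS_PER_NODE)
batched_cold = nodes_per_second(1000, COLD_MS_PER_NODE)
batched_hot = nodes_per_second(1000, HOT_MS_PER_NODE)
```

Under this model, committing every node separately gives about 16 nodes/sec, the same order of magnitude as the ~10 nodes/sec observed, while batches of 1000 give roughly 540 nodes/sec cold and 2800 nodes/sec hot, which brackets the measured 500 - 1500 nodes/sec from above.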

TitanDB
● HBase / Cassandra / BerkeleyDB as storage backend
● Lucene / ElasticSearch as indexing backend
● Served over Rexster server
● Full distribution
● > 500 simultaneous connections (5000 is still stable)
● Automatic replication (Hadoop)
● Multiple simultaneous queries
● Sky is the only limit for storage quantities
=> TitanDB / HBase is stable up to 5 PB in production

Neo4j for bioinformatics:
parsing and curating Reactome.org
● Reactome.org structure:
– BioPAX: XML / RDF / OWL
– Physical entities:
● Proteins, small molecules, Complexes, RNA, DNA
● Fragments of physical entities
– Interaction:
● Degradation / polymerisation / Biochemical reactions
● Molecular interaction
● Genetic interaction
– Pathways, Genes, Post-translational modifications...

● Reality of Reactome.org:
– Main connected component: ~22 000 entities, but 6 others with > 100 elements
– Presence of generic classes: groups of objects
– Proteins = mix of proteins, domains, groups, groups of domains…
– 15 000 proteins, 5000 UniProt references
– 156 genes, 56 RNA molecules => translation / transcription regulation is not well described
– Heavily comment-based: case of SRC

● Completed with HINT protein-protein interactions from the Yu lab at Cornell
● Re-indexed:
– SwissProt protein names
– Full names from SwissProt
– Gene Names
– KEGG, GO, EMBL, ChEBI cross-references
– PDB implemented, not re-run

● Example of pathway Parsing

Conclusion
● Systems biology is more about graphs than about systems of tables
● Graph databases are awesome
● Neo4j is terrific
● TitanDB is cool
● You should definitely pick one of them, load the Reactome.org dataset or whatever you are interested in, and play with it.

Thanks
Pr. Philip Bourne
Pr. Bart Deplanke
Cedric Merlot
Li Xie
Spencer Bliven
Jiang Wang
Julia Ponomarenko
Cole Christie
Andreas Prlic
Lilia Iakoucheva
