2. But who’s this guy talking here?
I am currently working as a bioinformatics consultant/developer/researcher at
Oh no sequences!
Oh no what !?
We are the R&D group at Era7 Bioinformatics.
We like bioinformatics, cloud computing, NGS, category theory, bacterial
genomics…
Well, lots of things.
What about Era7 Bioinformatics?
Era7 Bioinformatics is a bioinformatics company specializing in sequence analysis,
knowledge management and sequencing data interpretation.
Our area of expertise revolves around biological sequence analysis, particularly
Next Generation Sequencing data management and analysis.
www.ohnosequences.com www.bio4j.com
3. In bioinformatics we have highly interconnected, overlapping knowledge spread
across different DBs
4. However, in most cases all this data is modeled in relational databases,
and sometimes just as plain CSV files.
As the amount and diversity of data grows, domain models
become crazily complicated!
5. With a relational paradigm, the double implication
Entity <=> Table
does not go both ways.
You get ‘auxiliary’ tables that have no relationship with the small
piece of reality you are modeling.
You need ‘artificial’ IDs only for connecting entities (and these are mixed
with IDs that somehow live in reality).
Entity-relationship models are cool but in the end you always have to
deal with ‘raw’ tables plus SQL.
Integrating/incorporating new knowledge into already existing
databases is hard, and sometimes not even possible without changing
the domain model.
6. Life in general and biology in particular are probably not 100% like a graph…
but one thing’s sure, they are not a set of tables!
8. Neo4j is a high-performance NoSQL graph database with all
the features of a mature and robust database.
The programmer works with an object-oriented, flexible
network structure rather than with strict and static tables
All the benefits of a fully transactional, enterprise-strength
database.
For many applications, Neo4j offers performance
improvements on the order of 1000x or more compared to
relational DBs.
9. What’s Bio4j?
Bio4j is a graph-based bioinformatics DB including most of the data
available in:
- Uniprot KB (Swiss-Prot + Trembl)
- NCBI Taxonomy
- Gene Ontology (GO)
- RefSeq
- UniRef (50, 90, 100)
- Enzyme DB
10. What’s Bio4j?
It provides a completely new and powerful framework
for protein related information querying and
management.
Since it relies on a high-performance graph engine, data
is stored in a way that semantically represents its own
structure
11. What’s Bio4j?
Bio4j uses Neo4j technology, a "high-performance graph
engine with all the features of a mature and robust
database".
Because it is built on Neo4j and comes with its own API, Bio4j is also
very scalable: anyone can easily incorporate their own data and make
the most of it.
12. What’s Bio4j?
Everything in Bio4j is open source!
Released under AGPLv3
13. Bio4j in numbers
The current version (0.7) includes:
Relationships: 530.642.683
Nodes: 76.071.411
Relationship types: 139
Node types: 38
14. Let’s dig a bit into Bio4j’s structure…
Data sources and their relationships:
16. The Graph DB model: representation
Core abstractions:
Nodes
Relationships between nodes
Properties on both
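As a rough illustration of these three abstractions, here is a minimal in-memory sketch in plain Java. It is not the Neo4j or Bio4j API: the names PropertyGraphSketch, connect and neighbours, as well as the property values, are made up for the example.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Minimal in-memory property graph: nodes, typed relationships, properties on both. */
final class PropertyGraphSketch {

    /** A node is a bag of properties plus its outgoing relationships. */
    static final class Node {
        final Map<String, Object> properties = new HashMap<>();
        final List<Relationship> outgoing = new ArrayList<>();
    }

    /** A relationship has a type, a direction (start -> end) and its own properties. */
    static final class Relationship {
        final String type;
        final Node start;
        final Node end;
        final Map<String, Object> properties = new HashMap<>();

        Relationship(String type, Node start, Node end) {
            this.type = type;
            this.start = start;
            this.end = end;
        }
    }

    /** Creates a relationship and registers it on its start node. */
    static Relationship connect(Node start, String type, Node end) {
        Relationship relationship = new Relationship(type, start, end);
        start.outgoing.add(relationship);
        return relationship;
    }

    /** Follows the outgoing relationships of the given type and returns their end nodes. */
    static List<Node> neighbours(Node start, String type) {
        List<Node> result = new ArrayList<>();
        for (Relationship relationship : start.outgoing) {
            if (relationship.type.equals(type)) {
                result.add(relationship.end);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Node protein = new Node();
        protein.properties.put("accession", "P12345");

        Node goTerm = new Node();
        goTerm.properties.put("id", "GO:placeholder"); // placeholder, not a real annotation

        Relationship annotation = connect(protein, "PROTEIN_GO_ANNOTATION", goTerm);
        annotation.properties.put("note", "illustrative property"); // properties live on edges too

        System.out.println(neighbours(protein, "PROTEIN_GO_ANNOTATION").size());
    }
}
```

The point of the sketch is only that an association is a first-class object with its own type and properties, rather than a row in an auxiliary table.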
17. How are things modeled?
Couldn’t be simpler!
Entities -> Nodes
Associations / Relationships -> Edges
18. Some examples of nodes would be:
GO term
Protein
Genome Element
and relationships:
Protein -[PROTEIN_GO_ANNOTATION]-> GO term
19. We have developed Bio4jExplorer, a tool meant to serve both as a reference
manual and as a first contact with the Bio4j domain model.
Bio4jExplorer allows you to:
• Navigate through all nodes and relationships
• Access the javadocs of any node or relationship
• Graphically explore the neighborhood of a node/relationship
• Look up the indexes that may serve as an entry point for a node
• Check incoming/outgoing relationships of a specific node
• Check start/end nodes of a specific relationship
20. Entry points and indexing
There are two kinds of entry points for the graph:
1) Auxiliary relationships going from the reference node, e.g.
   - CELLULAR_COMPONENT: leads to the root of the GO cellular component
     sub-ontology
   - MAIN_DATASET: leads to both main datasets, Swiss-Prot and Trembl
2) Node indexing
   There are two types of node indexes:
   - Exact: only exact values are considered hits
   - Fulltext: regular expressions can be used
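To make the difference between the two index types concrete, here is a toy sketch in plain Java. It only contrasts the matching behaviour (identical value versus pattern); it does not use or reproduce the Neo4j index API, and the class name IndexLookupSketch, the indexed values and the node labels are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Pattern;

/** Toy index contrasting exact lookups with pattern-based lookups (not the Neo4j index API). */
final class IndexLookupSketch {

    // Indexed value -> label of the node it points to; TreeMap keeps a stable iteration order.
    private final Map<String, String> nodeByValue = new TreeMap<>();

    void put(String indexedValue, String nodeLabel) {
        nodeByValue.put(indexedValue, nodeLabel);
    }

    /** Exact lookup: only an identical value is a hit; returns null when there is none. */
    String exact(String value) {
        return nodeByValue.get(value);
    }

    /** Pattern lookup: every indexed value that fully matches the regular expression is a hit. */
    List<String> matching(String regex) {
        Pattern pattern = Pattern.compile(regex);
        List<String> hits = new ArrayList<>();
        for (Map.Entry<String, String> entry : nodeByValue.entrySet()) {
            if (pattern.matcher(entry.getKey()).matches()) {
                hits.add(entry.getValue());
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        IndexLookupSketch index = new IndexLookupSketch();
        // Hypothetical entries, for illustration only.
        index.put("Aspartate aminotransferase", "node-1");
        index.put("Alanine aminotransferase", "node-2");
        index.put("Hexokinase", "node-3");

        System.out.println(index.exact("Aspartate aminotransferase")); // exact hit
        System.out.println(index.exact("aminotransferase"));           // no exact hit
        System.out.println(index.matching(".*aminotransferase"));      // two pattern hits
    }
}
```

An exact index suits identifiers such as accession numbers, while a pattern-friendly index is more useful for free-text fields such as names.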
21. Retrieving protein info (Bio4jModel Java API)
// Create the manager and the node retriever
Bio4jManager manager = new Bio4jManager("/mybio4jdb");
NodeRetriever nR = new NodeRetriever(manager);
ProteinNode protein = nR.getProteinNodeByAccession("P12345");
// Get more related info
List<InterproNode> interpros = protein.getInterpro();
OrganismNode organism = protein.getOrganism();
List<GoTermNode> goAnnotations = protein.getGOAnnotations();
List<ArticleNode> articles = protein.getArticleCitations();
for (ArticleNode article : articles) {
    System.out.println(article.getPubmedId());
}
// Don't forget to shut down the manager
manager.shutDown();
22. Querying Bio4j with Cypher
Getting a keyword by its ID
START k=node:keyword_id_index(keyword_id_index = "KW-0181")
return k.name, k.id
Finding circuits/simple cycles of length 3 where at least one protein is from the
Swiss-Prot dataset:
START d=node:dataset_name_index(dataset_name_index = "Swiss-Prot")
MATCH d <-[r:PROTEIN_DATASET]- p,
circuit = (p) -[:PROTEIN_PROTEIN_INTERACTION]-> (p2)
    -[:PROTEIN_PROTEIN_INTERACTION]-> (p3)
    -[:PROTEIN_PROTEIN_INTERACTION]-> (p)
return p.accession, p2.accession, p3.accession
Check this blog post for more info, and our Bio4j Cypher cheat sheet
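For readers less familiar with Cypher, the circuit pattern above can be spelled out as three nested hops. The following is a plain Java sketch over an in-memory adjacency map, not a query against Bio4j. The class name InteractionCircuitSketch, the protein labels A to D, and the requirement that the three proteins be distinct are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Finds directed circuits p -> p2 -> p3 -> p in an in-memory interaction map. */
final class InteractionCircuitSketch {

    /**
     * Returns every triple (p, p2, p3) such that p -> p2 -> p3 -> p,
     * where p belongs to startProteins and the three proteins are distinct.
     */
    static List<List<String>> circuitsOfLength3(Map<String, Set<String>> interactsWith,
                                                Set<String> startProteins) {
        List<List<String>> circuits = new ArrayList<>();
        for (String p : startProteins) {
            for (String p2 : interactsWith.getOrDefault(p, Set.of())) {
                if (p2.equals(p)) {
                    continue; // skip self-interactions
                }
                for (String p3 : interactsWith.getOrDefault(p2, Set.of())) {
                    if (p3.equals(p) || p3.equals(p2)) {
                        continue; // keep the three proteins distinct
                    }
                    // The circuit closes only if p3 interacts back with p.
                    if (interactsWith.getOrDefault(p3, Set.of()).contains(p)) {
                        circuits.add(List.of(p, p2, p3));
                    }
                }
            }
        }
        return circuits;
    }

    public static void main(String[] args) {
        // Hypothetical interaction network: A -> B -> C -> A, plus a dead end A -> D.
        Map<String, Set<String>> interactsWith = Map.of(
                "A", Set.of("B", "D"),
                "B", Set.of("C"),
                "C", Set.of("A"));
        // Stand-in for "proteins belonging to the Swiss-Prot dataset".
        Set<String> startProteins = Set.of("A");

        System.out.println(circuitsOfLength3(interactsWith, startProteins));
    }
}
```

In the graph database the same traversal is expressed declaratively and runs over the stored relationships, so no adjacency map has to be built by hand.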
23. Gremlin: a graph traversal language
Get protein by its accession number and return its full name
gremlin> g.idx('protein_accession_index')[['protein_accession_index':'P12345']].full_name
==> Aspartate aminotransferase, mitochondrial
Get proteins (accessions) associated with an interpro motif (limited to 4 results)
gremlin> g.idx('interpro_id_index')[['interpro_id_index':'IPR023306']].inE('PROTEIN_INTERPRO').outV.accession[0..3]
==> E2GK26
==> G3PMS4
==> G3Q865
==> G3PIL8
Check our Bio4j Gremlin cheat sheet
24. REST Server
You can also query/navigate through Bio4j with the Neo4j REST API!
The default representation is JSON, both for responses and for data sent with
POST/PUT requests
Get a protein by its accession number (Q9UR66):
http://server_url:7474/db/data/index/node/protein_accession_index/protein_accession_index/Q9UR66
Get outgoing relationships for protein Q9UR66:
http://server_url:7474/db/data/node/Q9UR66_node_id/relationships/out
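The two URLs above follow a fixed pattern, so they are easy to build programmatically. Here is a small Java sketch that assembles them and prepares a GET request asking for JSON, using only the JDK (java.net.http, Java 11+). The class name Bio4jRestSketch, the base URL http://localhost:7474 and the node id 42 are placeholders; the request is built but deliberately not sent, since no server is assumed to be running.

```java
import java.net.URI;
import java.net.http.HttpRequest;

/** Builds the REST URLs shown on the slide and a JSON GET request for them. */
final class Bio4jRestSketch {

    /** Index lookup URL: the index name and the key are both protein_accession_index. */
    static URI proteinByAccession(String baseUrl, String accession) {
        String index = "protein_accession_index";
        return URI.create(baseUrl + "/db/data/index/node/" + index + "/" + index + "/" + accession);
    }

    /** URL listing the outgoing relationships of the node with the given id. */
    static URI outgoingRelationships(String baseUrl, long nodeId) {
        return URI.create(baseUrl + "/db/data/node/" + nodeId + "/relationships/out");
    }

    /** Prepares (but does not send) a GET request asking for a JSON representation. */
    static HttpRequest jsonGet(URI uri) {
        return HttpRequest.newBuilder(uri)
                .header("Accept", "application/json")
                .GET()
                .build();
    }

    public static void main(String[] args) {
        String baseUrl = "http://localhost:7474"; // placeholder for server_url:7474
        HttpRequest lookup = jsonGet(proteinByAccession(baseUrl, "Q9UR66"));
        System.out.println(lookup.method() + " " + lookup.uri());

        long nodeId = 42L; // placeholder: the real id comes from the lookup response
        System.out.println(outgoingRelationships(baseUrl, nodeId));
    }
}
```

Sending the request with java.net.http.HttpClient and parsing the JSON body is left out; any HTTP client and JSON library would do.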
25. Visualizations (1) REST Server Data Browser
Navigate through Bio4j data in real time!
27. Visualizations (3) Bio4j + Gephi
Get really cool graph visualizations using Bio4j and the Gephi visualization and
exploration platform
28. Bio4j + Cloud
We use AWS (Amazon Web Services) everywhere we can around Bio4j, giving
us the following benefits:
Interoperability and data distribution
Releases are available as public EBS snapshots, so AWS users can create
ready-to-use Bio4j DB volumes and attach them to their instances in just
a few seconds.
CloudFormation templates:
- Basic Bio4j DB Instance
- Bio4j REST Server Instance
Backup and Storage using S3 (Simple Storage Service)
We use S3 both for backup (indirectly through the EBS snapshots) and
storage (directly storing RefSeq sequences as independent S3 files)
29. Why would I use Bio4j ?
Massive access to protein/genome/taxonomy… related information
Integration of your own DBs/resources around common information
Development of services tailored to your needs built around Bio4j
Network analysis
Visualizations
And many others I haven’t thought of myself…
If you have something in mind for which Bio4j might be useful, please let us know so we
can all see how it could help you meet your needs! ;)
30. Community
Bio4j has a fast-growing internet presence:
- Twitter: check @bio4j for updates
- Blog: go to http://blog.bio4j.com
- Mailing list: ask any question you may have on our list.
- LinkedIn: check the Bio4j group
- GitHub issues: don’t be shy! Open a new issue if you think
something’s going wrong.
31. OK, but why start all this?
Were you so bored…?!
It all started somehow around our need for massive access to protein GO
(Gene Ontology) annotations.
At that point I had to develop my own MySQL DB based on the official
GO SQL database, and problems started from the beginning:
I went crazy ‘deciphering’ how to extract Uniprot protein annotations
from the official GO table schema
Uniprot and GO official protein annotations were not always consistent
Populating my own DB took a really long time due to all the joins and
subqueries needed to get and store the protein annotations.
Soon enough we also needed massive access to basic
protein information.
32. These processes had to be automated for BG7, our bacterial genome annotation
system designed specifically for NGS data.
The available Uniprot web services were too limited:
- Slow
- Limited number of queries
- Too little information available
So I downloaded the whole Uniprot DB in XML format
(Swiss-Prot + Trembl)
and started to have some fun with it!
33. BG7 algorithm
1) Selection of the specific reference protein set
2) Prediction of possible genes by BLAST similarity
3) Gene definition: merging compatible similarity regions, detecting start and stop
4) Solving overlapped predicted genes
5) RNA prediction by BLAST similarity
6) Final annotation and complete deliverables. Quality control.
www.era7bioinformatics.com
34. We got used to having massive direct access to all this protein related
information…
So why not add the other resources we needed in most projects,
which were now becoming a sort of bottleneck
compared to those already included in Bio4j?
Then we incorporated:
- Isoform sequences
- Protein interactions and features
- Uniref 50, 90, and 100
- RefSeq
- NCBI Taxonomy
- Enzyme Expasy DB
35. Bio4j + MG7 + 48 Blast XML files (~1GB each)
Some numbers:
• 157 639 502 nodes
• 742 615 705 relationships
• 632 832 045 properties
• 148 relationship types
• 44 node types
And it works just fine!
37. What’s MG7?
MG7 lets you choose which parameters set the thresholds for filtering
the BLAST hits:
i. E-value
ii. Identity and query coverage
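As an illustration of what such threshold filtering amounts to, here is a small Java sketch with one filter per criterion. It is not MG7 code: the class BlastHitFilterSketch, the Hit fields, the example hits and the threshold values are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

/** Filters BLAST hits either by E-value or by identity plus query coverage. */
final class BlastHitFilterSketch {

    /** The handful of per-hit values the two criteria need. */
    static final class Hit {
        final String subjectId;
        final double eValue;
        final double identityPercent;
        final double queryCoveragePercent;

        Hit(String subjectId, double eValue, double identityPercent, double queryCoveragePercent) {
            this.subjectId = subjectId;
            this.eValue = eValue;
            this.identityPercent = identityPercent;
            this.queryCoveragePercent = queryCoveragePercent;
        }
    }

    /** Criterion (i): keep hits whose E-value does not exceed the threshold. */
    static List<Hit> byEValue(List<Hit> hits, double maxEValue) {
        List<Hit> kept = new ArrayList<>();
        for (Hit hit : hits) {
            if (hit.eValue <= maxEValue) {
                kept.add(hit);
            }
        }
        return kept;
    }

    /** Criterion (ii): keep hits that reach both the identity and the query coverage thresholds. */
    static List<Hit> byIdentityAndCoverage(List<Hit> hits, double minIdentityPercent,
                                           double minQueryCoveragePercent) {
        List<Hit> kept = new ArrayList<>();
        for (Hit hit : hits) {
            if (hit.identityPercent >= minIdentityPercent
                    && hit.queryCoveragePercent >= minQueryCoveragePercent) {
                kept.add(hit);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Made-up hits, for illustration only.
        List<Hit> hits = List.of(
                new Hit("subject-1", 1e-30, 98.0, 95.0),
                new Hit("subject-2", 1e-3, 99.0, 40.0),
                new Hit("subject-3", 5.0, 60.0, 90.0));

        System.out.println(byEValue(hits, 1e-5).size());
        System.out.println(byIdentityAndCoverage(hits, 90.0, 80.0).size());
    }
}
```

Keeping the two criteria as separate functions mirrors the choice offered above; they could also be chained if both should apply.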
It can export the results of the analysis to different data formats, such as:
• XML
• CSV
• GEXF (Graph Exchange XML Format)
It also provides heat maps and graph visualizations, plus a user-friendly
interface, MG7Viewer, that gives access to the alignment responsible for each
functional or taxonomic read assignment and displays the frequencies in the
taxonomic tree.
41. Mining Bio4j data
Finding topological patterns in Protein-Protein
Interaction networks
42. Finding the lowest common ancestor of a set of NCBI
taxonomy nodes with Bio4j
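The idea behind a lowest common ancestor query can be sketched without a database: walk from each node up to the root and keep the deepest node that every walk shares. The Java below is one simple way to do that over a child -> parent map. It is not the Bio4j implementation, and the class name, the taxon labels and the assumption that the root has no parent entry are all made up for the example.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.HashSet;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Lowest common ancestor of a set of nodes in a tree given as a child -> parent map. */
final class LowestCommonAncestorSketch {

    /** Path from a node up to the root, the node itself first. Assumes the root has no parent entry. */
    static List<String> pathToRoot(Map<String, String> parentOf, String node) {
        List<String> path = new ArrayList<>();
        for (String current = node; current != null; current = parentOf.get(current)) {
            path.add(current);
        }
        return path;
    }

    /** Returns the deepest node that is an ancestor of (or equal to) every given node, or null. */
    static String lowestCommonAncestor(Map<String, String> parentOf, Collection<String> nodes) {
        Iterator<String> iterator = nodes.iterator();
        if (!iterator.hasNext()) {
            throw new IllegalArgumentException("at least one node is required");
        }
        // Candidates start as the first node's path, ordered from deepest to root.
        List<String> candidates = pathToRoot(parentOf, iterator.next());
        while (iterator.hasNext()) {
            Set<String> ancestors = new HashSet<>(pathToRoot(parentOf, iterator.next()));
            candidates.retainAll(ancestors); // drops non-shared nodes, keeps the order
        }
        return candidates.isEmpty() ? null : candidates.get(0);
    }

    public static void main(String[] args) {
        // Hypothetical toy taxonomy, not real NCBI data.
        Map<String, String> parentOf = Map.of(
                "phylumA", "root",
                "phylumB", "root",
                "genusA1", "phylumA",
                "speciesA1a", "genusA1",
                "speciesA1b", "genusA1",
                "speciesB1", "phylumB");

        System.out.println(lowestCommonAncestor(parentOf, List.of("speciesA1a", "speciesA1b")));
        System.out.println(lowestCommonAncestor(parentOf, List.of("speciesA1a", "speciesB1")));
    }
}
```

In a graph database the same walk follows parent relationships between taxonomy nodes directly, so no separate parent map needs to be loaded.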
43. Future directions (1)
Gene flux tool
New tool for bacterial comparative genomics: massive tracing of vertical and
horizontal gene flux between genome elements based on the analysis of the
similarity between their proteins. It would analyze similarity relationships that could
be fixed to a 90% or 100% similarity threshold.
Pathways tool
Data from Metacyc is going to be included in Bio4j. This data would make it possible
to dissect the metabolic pathways in which a genome element, organism or community
(metagenomic samples) is involved. Gephi could be used to represent the
metabolic pathways for each of them.
44. Future directions (2)
Detector of common annotations in gene clusters
Many biological problems involve searching for annotations shared by a set of genes.
Some examples:
- a set of overexpressed genes
- a set of proteins with local structural similarities (WIP)
- a set of genes bearing SNPs in cancer samples
- a set of exclusive genes in a pathogenic bacterial strain
The detection of common annotations can help in the inference of important functional
connections.
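At its simplest, detecting common annotations means counting how many genes of the set carry each annotation and keeping the frequent ones. Here is a minimal Java sketch of that idea. It is an illustration under stated assumptions, not the planned tool: the class name, the gene and annotation labels and the minGenes cut-off are hypothetical, and the genes in the set are assumed to be distinct.

```java
import java.util.Collection;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

/** Counts annotations across a set of genes and reports the ones shared by many of them. */
final class CommonAnnotationSketch {

    /** For each annotation, the number of genes in the set that carry it. */
    static Map<String, Integer> annotationCounts(Map<String, Set<String>> annotationsOf,
                                                 Collection<String> genes) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String gene : genes) {
            // Genes without any known annotation simply contribute nothing.
            for (String annotation : annotationsOf.getOrDefault(gene, Set.of())) {
                counts.merge(annotation, 1, Integer::sum);
            }
        }
        return counts;
    }

    /** Annotations carried by at least minGenes genes of the set. */
    static Set<String> commonAnnotations(Map<String, Set<String>> annotationsOf,
                                         Collection<String> genes, int minGenes) {
        Set<String> common = new TreeSet<>();
        for (Map.Entry<String, Integer> entry : annotationCounts(annotationsOf, genes).entrySet()) {
            if (entry.getValue() >= minGenes) {
                common.add(entry.getKey());
            }
        }
        return common;
    }

    public static void main(String[] args) {
        // Hypothetical genes and annotation labels, for illustration only.
        Map<String, Set<String>> annotationsOf = Map.of(
                "gene1", Set.of("annotationA", "annotationB"),
                "gene2", Set.of("annotationA"),
                "gene3", Set.of("annotationA", "annotationC"));
        List<String> overexpressed = List.of("gene1", "gene2", "gene3");

        System.out.println(annotationCounts(annotationsOf, overexpressed));
        System.out.println(commonAnnotations(annotationsOf, overexpressed, 3));
    }
}
```

A real detector would also need to judge whether a shared annotation is more frequent than expected by chance, which this sketch does not attempt.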
45. That’s it!
Thanks for your time ;)