Neo4j and Bioinformatics




www.ohnosequences.com                   www.bio4j.com
But who’s this guy talking here?
     I am currently working as a bioinformatics consultant/developer/researcher at
     Oh no sequences!


     Oh no what !?
     We are the R&D group at Era7 Bioinformatics.
     We like bioinformatics, cloud computing, NGS, category theory, bacterial
     genomics…
     well, lots of things.


     What about Era7 Bioinformatics?
     Era7 Bioinformatics is a bioinformatics company specializing in sequence analysis,
     knowledge management and sequencing data interpretation.
     Our expertise centers on biological sequence analysis, particularly
     Next Generation Sequencing data management and analysis.




In bioinformatics we have highly interconnected, overlapping knowledge spread
    across different DBs




However, in most cases all this data is modeled in relational databases,
        and sometimes just as plain CSV files

               As the amount and diversity of data grows, domain models
               become crazily complicated!




With a relational paradigm, the double implication

                              Entity ⇔ Table

         does not go both ways.


              You get ‘auxiliary’ tables that have no relationship with the small
              piece of reality you are modeling.


              You need ‘artificial’ IDs just to connect entities (and these get mixed
              with IDs that actually exist in the real world)


              Entity-relationship models are cool but in the end you always have to
              deal with ‘raw’ tables plus SQL.


              Integrating/incorporating new knowledge into already existing
              databases is hard, and sometimes impossible without changing
              the domain model




Life in general and biology in particular are probably not 100% like a graph…




                                but one thing’s sure, they are not a set of tables!



NoSQL data models




Neo4j is a high-performance NoSQL graph database with all
           the features of a mature and robust database.


           The programmer works with an object-oriented, flexible
           network structure rather than with strict and static tables


           All the benefits of a fully transactional, enterprise-strength
           database.


           For many applications, Neo4j offers performance
           improvements on the order of 1000x or more compared to
           relational DBs.




What’s Bio4j?


     Bio4j is a graph-based bioinformatics DB that includes most of the data
     available in:

        UniProtKB (Swiss-Prot + TrEMBL)   NCBI Taxonomy

        Gene Ontology (GO)                RefSeq

        UniRef (50, 90, 100)              Enzyme DB




What’s Bio4j?

     It provides a completely new and powerful framework
     for querying and managing protein-related
     information.


     Since it relies on a high-performance graph engine, data
     is stored in a way that semantically represents its own
     structure




What’s Bio4j?

     Bio4j uses Neo4j technology, a "high-performance graph
     engine with all the features of a mature and robust
     database".

     Because it is built on Neo4j and comes with its own API,
     Bio4j is also very scalable, and anyone can easily
     incorporate their own data and make the most
     of it.



What’s Bio4j?


                        Everything in Bio4j is open source !



       released under AGPLv3




Bio4j in numbers


     The current version (0.7) includes:



             Relationships: 530,642,683

             Nodes: 76,071,411

             Relationship types: 139

             Node types: 38




Let’s dig a bit into the structure of Bio4j…


               Data sources and their relationships:




Bio4j domain model




The Graph DB model: representation


          Core abstractions:

             Nodes

             Relationships between nodes

             Properties on both




How are things modeled?




                            Couldn’t be simpler!




                 Entities           Associations / Relationships




                  Nodes                        Edges




Some examples of nodes would be:


                                      GO term
                  Protein
                                                         Genome Element




     and relationships:




                            Protein  -[PROTEIN_GO_ANNOTATION]->  GO term
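
     The node and relationship examples above can be modeled roughly as follows. This is
     only a minimal in-memory sketch in plain Java and does not use the Neo4j or Bio4j APIs:
     the class names (PropertyGraphSketch, Node, Relationship), the GO term id and the
     "evidence" property are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory model of the property graph abstractions; not the Neo4j API.
class PropertyGraphSketch {

    // A node carries an arbitrary set of key/value properties.
    static class Node {
        final Map<String, Object> properties = new HashMap<>();
        final List<Relationship> outgoing = new ArrayList<>();

        Node set(String key, Object value) {
            properties.put(key, value);
            return this;
        }
    }

    // A relationship is typed and directed, and carries properties of its own.
    static class Relationship {
        final String type;
        final Node start;
        final Node end;
        final Map<String, Object> properties = new HashMap<>();

        Relationship(String type, Node start, Node end) {
            this.type = type;
            this.start = start;
            this.end = end;
        }
    }

    // Connects two nodes and registers the relationship on the start node.
    static Relationship connect(Node start, String type, Node end) {
        Relationship rel = new Relationship(type, start, end);
        start.outgoing.add(rel);
        return rel;
    }

    public static void main(String[] args) {
        Node protein = new Node().set("accession", "P12345");
        Node goTerm = new Node().set("id", "GO:0005739"); // sample id chosen for illustration
        Relationship rel = connect(protein, "PROTEIN_GO_ANNOTATION", goTerm);
        rel.properties.put("evidence", "IEA"); // hypothetical property name and value

        for (Relationship r : protein.outgoing) {
            System.out.println(protein.properties.get("accession")
                    + " -[" + r.type + "]-> " + r.end.properties.get("id"));
        }
    }
}
```

     The point of the sketch is that both the node and the relationship hold properties,
     and that navigation simply follows the relationships attached to a node.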




We have developed a tool, Bio4jExplorer, intended to serve both as a reference manual
    and as a first point of contact with the Bio4j domain model



     Bio4jExplorer allows you to:

     • Navigate through all nodes and relationships


     • Access the javadocs of any node or relationship


     • Graphically explore the neighborhood of a node/relationship


     • Look up the indexes that may serve as an entry point for a node


     • Check incoming/outgoing relationships of a specific node


     • Check start/end nodes of a specific relationship




Entry points and indexing

         There are two kinds of entry points for the graph:



               Auxiliary relationships going from the reference node, e.g.

                 - CELLULAR_COMPONENT: leads to the root of GO cellular component
                 sub-ontology

                 - MAIN_DATASET: leads to both main datasets: Swiss-Prot and TrEMBL


               Node indexing

               There are two types of node indexes:

                 - Exact: Only exact values are considered hits

                 - Fulltext: Regular expressions can be used




Retrieving protein info (Bio4jModel Java API)

     // Create the manager and the node retriever
     Bio4jManager manager = new Bio4jManager("/mybio4jdb");
     NodeRetriever nR = new NodeRetriever(manager);

     // Look up a protein by its accession
     ProteinNode protein = nR.getProteinNodeByAccession("P12345");


     Getting more related info...

     List<InterproNode> interpros = protein.getInterpro();
     OrganismNode organism = protein.getOrganism();
     List<GoTermNode> goAnnotations = protein.getGOAnnotations();

     List<ArticleNode> articles = protein.getArticleCitations();

     // Print the PubMed id of every article citing the protein
     for (ArticleNode article : articles) {
         System.out.println(article.getPubmedId());
     }

     // Don't forget to shut down the manager
     manager.shutDown();




Querying Bio4j with Cypher


     Getting a keyword by its ID

     START k=node:keyword_id_index(keyword_id_index = "KW-0181")
     return k.name, k.id


     Finding circuits/simple cycles of length 3 where at least one protein is from the
     Swiss-Prot dataset:

     START d=node:dataset_name_index(dataset_name_index = "Swiss-Prot")
     MATCH d <-[r:PROTEIN_DATASET]- p,
           circuit = (p) -[:PROTEIN_PROTEIN_INTERACTION]-> (p2)
                         -[:PROTEIN_PROTEIN_INTERACTION]-> (p3)
                         -[:PROTEIN_PROTEIN_INTERACTION]-> (p)
     return p.accession, p2.accession, p3.accession


              Check this blog post for more info, and our Bio4j Cypher cheat sheet
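
     To make the pattern in the cycle query more concrete, here is a minimal plain-Java
     sketch of the same idea: start from a set of anchor proteins and follow three directed
     interaction edges back to the start. It is not how Cypher or Neo4j evaluates the
     query; the class name TriangleSketch, the method names and the accessions P1..P4 are
     hypothetical, and the sketch additionally requires the three nodes to be distinct.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical in-memory sketch; the real query runs inside Neo4j.
class TriangleSketch {

    // Adjacency list of a directed graph: source accession -> target accessions.
    private final Map<String, Set<String>> out = new HashMap<>();

    void addInteraction(String from, String to) {
        out.computeIfAbsent(from, k -> new LinkedHashSet<>()).add(to);
    }

    private Set<String> targets(String node) {
        return out.getOrDefault(node, Collections.<String>emptySet());
    }

    // Returns every directed cycle p -> p2 -> p3 -> p that starts at one of the
    // anchor nodes and visits three distinct nodes.
    List<List<String>> cyclesOfLengthThree(Set<String> anchors) {
        List<List<String>> cycles = new ArrayList<>();
        for (String p : anchors) {
            for (String p2 : targets(p)) {
                if (p2.equals(p)) {
                    continue; // skip self-loops
                }
                for (String p3 : targets(p2)) {
                    if (p3.equals(p) || p3.equals(p2)) {
                        continue; // keep the three nodes distinct
                    }
                    if (targets(p3).contains(p)) {
                        cycles.add(Arrays.asList(p, p2, p3));
                    }
                }
            }
        }
        return cycles;
    }

    public static void main(String[] args) {
        TriangleSketch graph = new TriangleSketch();
        graph.addInteraction("P1", "P2");
        graph.addInteraction("P2", "P3");
        graph.addInteraction("P3", "P1");
        graph.addInteraction("P3", "P4");

        // Pretend only P1 belongs to the anchor dataset.
        Set<String> anchors = new LinkedHashSet<>(Arrays.asList("P1"));
        System.out.println(graph.cyclesOfLengthThree(anchors)); // prints [[P1, P2, P3]]
    }
}
```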




Gremlin: a graph traversal language


     Get protein by its accession number and return its full name

     gremlin> g.idx('protein_accession_index')[['protein_accession_index':'P12345']].full_name
     ==> Aspartate aminotransferase, mitochondrial


     Get the accessions of proteins associated with an InterPro motif (limited to 4 results)
     gremlin> g.idx('interpro_id_index')[['interpro_id_index':'IPR023306']].inE('PROTEIN_INTERPRO').outV.accession[0..3]
     ==> E2GK26
     ==> G3PMS4
     ==> G3Q865
     ==> G3PIL8


            Check our Bio4j Gremlin cheat sheet




REST Server


     You can also query/navigate through Bio4j with the Neo4j REST API !

     The default representation is JSON, both for responses and for data sent with
     POST/PUT requests.


     Get a protein by its accession number (Q9UR66):

     http://server_url:7474/db/data/index/node/protein_accession_index/protein_accession_index/Q9UR66


     Get the outgoing relationships of protein Q9UR66:

     http://server_url:7474/db/data/node/Q9UR66_node_id/relationships/out
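
     The two URL patterns above can be assembled programmatically. The following is a
     minimal sketch that only builds the strings shown on this slide; the class and method
     names are hypothetical, the host is the same placeholder, the node id 42 is made up,
     and no escaping is applied, which is assumed to be acceptable for plain alphanumeric
     accessions.

```java
// Hypothetical helper that reproduces the URL patterns from the slide.
class Bio4jRestUrlSketch {

    // Index lookup: the index name also appears as the key, as in the slide.
    static String proteinByAccession(String baseUrl, String accession) {
        return baseUrl
                + "/db/data/index/node/protein_accession_index/protein_accession_index/"
                + accession;
    }

    // Outgoing relationships of a node identified by its internal node id.
    static String outgoingRelationships(String baseUrl, long nodeId) {
        return baseUrl + "/db/data/node/" + nodeId + "/relationships/out";
    }

    public static void main(String[] args) {
        String base = "http://server_url:7474"; // placeholder host, as on the slide
        System.out.println(proteinByAccession(base, "Q9UR66"));
        System.out.println(outgoingRelationships(base, 42L)); // 42 is a made-up node id
    }
}
```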




Visualizations (1): REST Server Data Browser


      Navigate through Bio4j data in real time !




Visualizations (2): Bio4j GO Tools




Visualizations (3): Bio4j + Gephi

      Get really cool graph visualizations using Bio4j and the Gephi visualization and
      exploration platform




Bio4j + Cloud

     We use AWS (Amazon Web Services) everywhere we can around Bio4j, giving
     us the following benefits:


          Interoperability and data distribution

           Releases are available as public EBS snapshots, so AWS users can create
           Bio4j DB volumes that are 100% ready to use and attach them to their
           instances in just a few seconds.

           CloudFormation templates:

             - Basic Bio4j DB Instance

             - Bio4j REST Server Instance


           Backup and Storage using S3 (Simple Storage Service)

           We use S3 both for backup (indirectly through the EBS snapshots) and
           storage (directly storing RefSeq sequences as independent S3 files)



Why would I use Bio4j ?


    Massive access to protein-, genome- and taxonomy-related information


    Integration of your own DBs/resources around common information


    Development of services tailored to your needs built around Bio4j


    Network analysis


    Visualizations


    Besides many others I cannot think of myself…
    If you have something in mind for which Bio4j might be useful, please let us know so we
    can all see how it could help you meet your needs! ;)




Community

     Bio4j has a fast-growing internet presence:



            - Twitter: check @bio4j for updates

            - Blog: go to http://blog.bio4j.com

            - Mailing list: ask any questions you may have on our list.

            - LinkedIn: check the Bio4j group

            - GitHub issues: don’t be shy! Open a new issue if you think
                             something’s going wrong.




OK, but why start all this?
   Were you so bored…?!

    It all started somehow around our need for massive access to protein GO
    (Gene Ontology) annotations.

     At that point I had to build my own MySQL DB based on the official
     GO SQL database, and problems started right away:


          I went crazy ‘deciphering’ how to extract UniProt protein annotations
          from the official GO table schema

          The official UniProt and GO protein annotations were not always consistent


          Populating my own DB took really long because of all the joins and
          subqueries needed to get and store the protein annotations.

          Soon enough we also needed massive access to basic
          protein information.




These processes had to be automated for BG7, our bacterial genome annotation system
  (specifically designed for NGS data)



              The available UniProt web services were too limited:

                - Slow

                - Limits on the number of queries

                - Too little information available




                  So I downloaded the whole UniProt DB in XML format
                  (Swiss-Prot + TrEMBL)

                  and started to have some fun with it !




BG7 algorithm


   1   • Selection of the specific reference protein set

   2   • Prediction of possible genes by BLAST similarity

   3   • Gene definition: merging compatible similarity regions, detecting start and stop

   4   • Solving overlapped predicted genes

   5   • RNA prediction by BLAST similarity

   6   • Final annotation and complete deliverables. Quality control.
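
   The slide does not say what makes two similarity regions "compatible" in step 3, so the
   following is only a generic interval-merging sketch under a loudly stated assumption:
   two regions are merged when they overlap or lie within maxGap bases of each other.
   The class name RegionMergeSketch, the Region type and the maxGap parameter are all
   hypothetical and are not taken from BG7.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Generic interval merging; NOT the actual BG7 compatibility rule.
class RegionMergeSketch {

    // A similarity region on a contig, with inclusive coordinates.
    static class Region {
        final int start;
        final int end;

        Region(int start, int end) {
            this.start = start;
            this.end = end;
        }

        @Override
        public String toString() {
            return start + ".." + end;
        }
    }

    // Merges regions that overlap or are at most maxGap positions apart.
    static List<Region> merge(List<Region> regions, int maxGap) {
        List<Region> sorted = new ArrayList<>(regions);
        sorted.sort(Comparator.comparingInt((Region r) -> r.start));

        List<Region> merged = new ArrayList<>();
        for (Region r : sorted) {
            if (!merged.isEmpty()) {
                Region last = merged.get(merged.size() - 1);
                if (r.start <= last.end + maxGap) {
                    // Extend the previous region instead of starting a new one.
                    merged.set(merged.size() - 1,
                            new Region(last.start, Math.max(last.end, r.end)));
                    continue;
                }
            }
            merged.add(r);
        }
        return merged;
    }

    public static void main(String[] args) {
        List<Region> hits = Arrays.asList(
                new Region(180, 300), new Region(100, 200), new Region(500, 600));
        System.out.println(merge(hits, 0)); // prints [100..300, 500..600]
    }
}
```

   A real implementation would presumably also take the strand, the reading frame and the
   reference protein into account before merging.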




www.era7bioinformatics.com
We got used to having massive direct access to all this protein related
      information…


           So why not add other resources that we needed quite often in most
           projects and that were becoming a sort of bottleneck
           compared to those already included in Bio4j?

       Then we incorporated:
            -   Isoform sequences

            -   Protein interactions and features

            -   UniRef 50, 90, and 100

            -   RefSeq

            -   NCBI Taxonomy

            -   ExPASy Enzyme DB




Bio4j + MG7 + 48 BLAST XML files (~1 GB each)


     Some numbers:

                •   157 639 502 nodes

                •   742 615 705 relationships

                •   632 832 045 properties

                •   148 relationship types

                •   44 node types


             And it works just fine!


MG7 domain model




What’s MG7?

     MG7 lets you choose different parameters to set the
     thresholds for filtering the BLAST hits:

     i.    E-value
     ii.   Identity and query coverage
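
     As an illustration of this kind of threshold filtering, here is a minimal plain-Java
     sketch. It is not MG7 code: the class BlastHitFilterSketch, the Hit fields and the
     sample values are hypothetical, and applying all three thresholds together is an
     assumption (the slide does not say whether the two criteria are alternatives).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical threshold filter for BLAST hits; not the MG7 implementation.
class BlastHitFilterSketch {

    static class Hit {
        final String subjectId;
        final double eValue;
        final double identityPct;      // percent identity, 0..100
        final double queryCoveragePct; // percent of the query covered, 0..100

        Hit(String subjectId, double eValue, double identityPct, double queryCoveragePct) {
            this.subjectId = subjectId;
            this.eValue = eValue;
            this.identityPct = identityPct;
            this.queryCoveragePct = queryCoveragePct;
        }
    }

    // Keeps the hits whose E-value is low enough and whose identity and
    // query coverage are high enough.
    static List<Hit> filter(List<Hit> hits, double maxEValue,
                            double minIdentityPct, double minQueryCoveragePct) {
        List<Hit> kept = new ArrayList<>();
        for (Hit h : hits) {
            if (h.eValue <= maxEValue
                    && h.identityPct >= minIdentityPct
                    && h.queryCoveragePct >= minQueryCoveragePct) {
                kept.add(h);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<Hit> hits = Arrays.asList(
                new Hit("subjectA", 1e-30, 98.5, 95.0),  // made-up values
                new Hit("subjectB", 1e-3, 99.0, 90.0),   // fails the E-value threshold
                new Hit("subjectC", 1e-20, 70.0, 85.0)); // fails the identity threshold

        for (Hit h : filter(hits, 1e-5, 90.0, 80.0)) {
            System.out.println(h.subjectId); // prints subjectA
        }
    }
}
```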


     It can export the results of the analysis to different data formats such as:
     • XML
     • CSV
     • GEXF (Graph Exchange XML Format)

     It also provides heat maps and graph visualizations, along with MG7Viewer, a
     user-friendly interface that gives access to the alignment responsible for each
     functional or taxonomic read assignment and displays the frequencies in the
     taxonomic tree.




Heat-map Viz




Graph Viz




MG7 Viewer




Mining Bio4j data

      Finding topological patterns in Protein-Protein
                  Interaction networks




Finding the lowest common ancestor of a set of NCBI
                taxonomy nodes with Bio4j
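
     The slide only names the problem, so here is one possible way to compute it, as a
     minimal plain-Java sketch over a child-to-parent map rather than over the Bio4j graph.
     The class TaxonomyLcaSketch, its methods and the heavily simplified toy lineage (most
     ranks are skipped) are assumptions for illustration only.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.HashMap;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: lowest common ancestor in a tree given as child -> parent.
class TaxonomyLcaSketch {

    // Returns the node followed by all its ancestors, ending at the root.
    static List<String> pathToRoot(Map<String, String> parentOf, String node) {
        List<String> path = new ArrayList<>();
        for (String current = node; current != null; current = parentOf.get(current)) {
            path.add(current);
        }
        return path;
    }

    // Intersects the root paths of all nodes; the first survivor on the first
    // node's path is the deepest common ancestor. Returns null if there is none.
    static String lowestCommonAncestor(Map<String, String> parentOf, Collection<String> nodes) {
        Set<String> candidates = null;
        for (String node : nodes) {
            List<String> path = pathToRoot(parentOf, node);
            if (candidates == null) {
                candidates = new LinkedHashSet<>(path); // keeps bottom-up order
            } else {
                candidates.retainAll(new HashSet<>(path));
            }
        }
        if (candidates == null || candidates.isEmpty()) {
            return null;
        }
        return candidates.iterator().next();
    }

    // Toy lineage with most ranks skipped; for illustration only.
    static Map<String, String> toyTaxonomy() {
        Map<String, String> parentOf = new HashMap<>();
        parentOf.put("Bacteria", "root");
        parentOf.put("Proteobacteria", "Bacteria");
        parentOf.put("Firmicutes", "Bacteria");
        parentOf.put("Escherichia coli", "Proteobacteria");
        parentOf.put("Salmonella enterica", "Proteobacteria");
        parentOf.put("Bacillus subtilis", "Firmicutes");
        return parentOf;
    }

    public static void main(String[] args) {
        Map<String, String> taxonomy = toyTaxonomy();
        System.out.println(lowestCommonAncestor(taxonomy,
                Arrays.asList("Escherichia coli", "Salmonella enterica"))); // prints Proteobacteria
        System.out.println(lowestCommonAncestor(taxonomy,
                Arrays.asList("Escherichia coli", "Bacillus subtilis")));   // prints Bacteria
    }
}
```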




Future directions (1)


    Gene flux tool

    New tool for bacterial comparative genomics: massive tracing of vertical and
    horizontal gene flux between genome elements, based on the analysis of the
    similarity between their proteins. It would analyze similarity relationships that could
    be fixed at a 90% or 100% similarity threshold.



    Pathways tool
    Data from MetaCyc is going to be included in Bio4j. This data would make it possible to dissect
    the metabolic pathways in which a genome element, organism or community
    (metagenomic samples) is involved. Gephi could be used to represent the
    metabolic pathways for each of them.




Future directions (2)


    Detector of common annotations in gene clusters

    Many biological problems are related to the search for common annotations in a set of genes.
    Some examples:

       - a set of overexpressed genes
       - a set of proteins with local structural similarities (WIP)
       - a set of genes bearing SNPs in cancer samples
       - a set of exclusive genes in a pathogenic bacterial strain

    The detection of common annotations can help in the inference of important functional
    connections.




That’s it !


                        Thanks for
                        your time ;)





More Related Content

What's hot

Neo4j Graph Platform Overview, Kurt Freytag, Neo4j
Neo4j Graph Platform Overview, Kurt Freytag, Neo4jNeo4j Graph Platform Overview, Kurt Freytag, Neo4j
Neo4j Graph Platform Overview, Kurt Freytag, Neo4jNeo4j
 
Knowledge Graphs and Generative AI
Knowledge Graphs and Generative AIKnowledge Graphs and Generative AI
Knowledge Graphs and Generative AINeo4j
 
Elsevier: Empowering Knowledge Discovery in Research with Graphs
Elsevier: Empowering Knowledge Discovery in Research with GraphsElsevier: Empowering Knowledge Discovery in Research with Graphs
Elsevier: Empowering Knowledge Discovery in Research with GraphsNeo4j
 
검색엔진이 데이터를 다루는 법 김종민
검색엔진이 데이터를 다루는 법 김종민검색엔진이 데이터를 다루는 법 김종민
검색엔진이 데이터를 다루는 법 김종민종민 김
 
Apache Kafka in the Healthcare Industry
Apache Kafka in the Healthcare IndustryApache Kafka in the Healthcare Industry
Apache Kafka in the Healthcare IndustryKai Wähner
 
The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...
The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...
The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...Databricks
 
An Introduction to NOSQL, Graph Databases and Neo4j
An Introduction to NOSQL, Graph Databases and Neo4jAn Introduction to NOSQL, Graph Databases and Neo4j
An Introduction to NOSQL, Graph Databases and Neo4jDebanjan Mahata
 
Architect’s Open-Source Guide for a Data Mesh Architecture
Architect’s Open-Source Guide for a Data Mesh ArchitectureArchitect’s Open-Source Guide for a Data Mesh Architecture
Architect’s Open-Source Guide for a Data Mesh ArchitectureDatabricks
 
Building Biomedical Knowledge Graphs for In-Silico Drug Discovery
Building Biomedical Knowledge Graphs for In-Silico Drug DiscoveryBuilding Biomedical Knowledge Graphs for In-Silico Drug Discovery
Building Biomedical Knowledge Graphs for In-Silico Drug DiscoveryVaticle
 
Livre Blanc : comprendre les data-lakes
Livre Blanc : comprendre les data-lakesLivre Blanc : comprendre les data-lakes
Livre Blanc : comprendre les data-lakesConverteo
 
Intro to Vertex AI, unified MLOps platform for Data Scientists & ML Engineers
Intro to Vertex AI, unified MLOps platform for Data Scientists & ML EngineersIntro to Vertex AI, unified MLOps platform for Data Scientists & ML Engineers
Intro to Vertex AI, unified MLOps platform for Data Scientists & ML EngineersDaniel Zivkovic
 
Lecture 6: Infrastructure & Tooling (Full Stack Deep Learning - Spring 2021)
Lecture 6: Infrastructure & Tooling (Full Stack Deep Learning - Spring 2021)Lecture 6: Infrastructure & Tooling (Full Stack Deep Learning - Spring 2021)
Lecture 6: Infrastructure & Tooling (Full Stack Deep Learning - Spring 2021)Sergey Karayev
 
Neo4j Demo: Using Knowledge Graphs to Classify Diabetes Patients (GlaxoSmithK...
Neo4j Demo: Using Knowledge Graphs to Classify Diabetes Patients (GlaxoSmithK...Neo4j Demo: Using Knowledge Graphs to Classify Diabetes Patients (GlaxoSmithK...
Neo4j Demo: Using Knowledge Graphs to Classify Diabetes Patients (GlaxoSmithK...Neo4j
 
AstraZeneca at Neo4j GraphSummit London 14Nov23.pptx
AstraZeneca at Neo4j GraphSummit London 14Nov23.pptxAstraZeneca at Neo4j GraphSummit London 14Nov23.pptx
AstraZeneca at Neo4j GraphSummit London 14Nov23.pptxNeo4j
 
Enabling product personalisation using Apache Kafka, Apache Pinot and Trino w...
Enabling product personalisation using Apache Kafka, Apache Pinot and Trino w...Enabling product personalisation using Apache Kafka, Apache Pinot and Trino w...
Enabling product personalisation using Apache Kafka, Apache Pinot and Trino w...HostedbyConfluent
 
Grokking TechTalk #35: Efficient spellchecking
Grokking TechTalk #35: Efficient spellcheckingGrokking TechTalk #35: Efficient spellchecking
Grokking TechTalk #35: Efficient spellcheckingGrokking VN
 
Neo4j Graph Use Cases, Bruno Ungermann, Neo4j
Neo4j Graph Use Cases, Bruno Ungermann, Neo4jNeo4j Graph Use Cases, Bruno Ungermann, Neo4j
Neo4j Graph Use Cases, Bruno Ungermann, Neo4jNeo4j
 
From a hack to Data Mesh (Devoxx 2022)
From a hack to Data Mesh (Devoxx 2022)From a hack to Data Mesh (Devoxx 2022)
From a hack to Data Mesh (Devoxx 2022)Simon Maurin
 
Apache Kafka and the Data Mesh | Michael Noll, Confluent
Apache Kafka and the Data Mesh | Michael Noll, ConfluentApache Kafka and the Data Mesh | Michael Noll, Confluent
Apache Kafka and the Data Mesh | Michael Noll, ConfluentHostedbyConfluent
 

What's hot (20)

Neo4j Graph Platform Overview, Kurt Freytag, Neo4j
Neo4j Graph Platform Overview, Kurt Freytag, Neo4jNeo4j Graph Platform Overview, Kurt Freytag, Neo4j
Neo4j Graph Platform Overview, Kurt Freytag, Neo4j
 
Knowledge Graphs and Generative AI
Knowledge Graphs and Generative AIKnowledge Graphs and Generative AI
Knowledge Graphs and Generative AI
 
Elsevier: Empowering Knowledge Discovery in Research with Graphs
Elsevier: Empowering Knowledge Discovery in Research with GraphsElsevier: Empowering Knowledge Discovery in Research with Graphs
Elsevier: Empowering Knowledge Discovery in Research with Graphs
 
검색엔진이 데이터를 다루는 법 김종민
검색엔진이 데이터를 다루는 법 김종민검색엔진이 데이터를 다루는 법 김종민
검색엔진이 데이터를 다루는 법 김종민
 
Apache Kafka in the Healthcare Industry
Apache Kafka in the Healthcare IndustryApache Kafka in the Healthcare Industry
Apache Kafka in the Healthcare Industry
 
The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...
The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...
The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...
 
An Introduction to NOSQL, Graph Databases and Neo4j
An Introduction to NOSQL, Graph Databases and Neo4jAn Introduction to NOSQL, Graph Databases and Neo4j
An Introduction to NOSQL, Graph Databases and Neo4j
 
Architect’s Open-Source Guide for a Data Mesh Architecture
Architect’s Open-Source Guide for a Data Mesh ArchitectureArchitect’s Open-Source Guide for a Data Mesh Architecture
Architect’s Open-Source Guide for a Data Mesh Architecture
 
Building Biomedical Knowledge Graphs for In-Silico Drug Discovery
Building Biomedical Knowledge Graphs for In-Silico Drug DiscoveryBuilding Biomedical Knowledge Graphs for In-Silico Drug Discovery
Building Biomedical Knowledge Graphs for In-Silico Drug Discovery
 
MongoDB
MongoDBMongoDB
MongoDB
 
Livre Blanc : comprendre les data-lakes
Livre Blanc : comprendre les data-lakesLivre Blanc : comprendre les data-lakes
Livre Blanc : comprendre les data-lakes
 
Intro to Vertex AI, unified MLOps platform for Data Scientists & ML Engineers
Intro to Vertex AI, unified MLOps platform for Data Scientists & ML EngineersIntro to Vertex AI, unified MLOps platform for Data Scientists & ML Engineers
Intro to Vertex AI, unified MLOps platform for Data Scientists & ML Engineers
 
Lecture 6: Infrastructure & Tooling (Full Stack Deep Learning - Spring 2021)
Lecture 6: Infrastructure & Tooling (Full Stack Deep Learning - Spring 2021)Lecture 6: Infrastructure & Tooling (Full Stack Deep Learning - Spring 2021)
Lecture 6: Infrastructure & Tooling (Full Stack Deep Learning - Spring 2021)
 
Neo4j Demo: Using Knowledge Graphs to Classify Diabetes Patients (GlaxoSmithK...
Neo4j Demo: Using Knowledge Graphs to Classify Diabetes Patients (GlaxoSmithK...Neo4j Demo: Using Knowledge Graphs to Classify Diabetes Patients (GlaxoSmithK...
Neo4j Demo: Using Knowledge Graphs to Classify Diabetes Patients (GlaxoSmithK...
 
AstraZeneca at Neo4j GraphSummit London 14Nov23.pptx
AstraZeneca at Neo4j GraphSummit London 14Nov23.pptxAstraZeneca at Neo4j GraphSummit London 14Nov23.pptx
AstraZeneca at Neo4j GraphSummit London 14Nov23.pptx
 
Enabling product personalisation using Apache Kafka, Apache Pinot and Trino w...
Enabling product personalisation using Apache Kafka, Apache Pinot and Trino w...Enabling product personalisation using Apache Kafka, Apache Pinot and Trino w...
Enabling product personalisation using Apache Kafka, Apache Pinot and Trino w...
 
Grokking TechTalk #35: Efficient spellchecking
Grokking TechTalk #35: Efficient spellcheckingGrokking TechTalk #35: Efficient spellchecking
Grokking TechTalk #35: Efficient spellchecking
 
Neo4j Graph Use Cases, Bruno Ungermann, Neo4j
Neo4j Graph Use Cases, Bruno Ungermann, Neo4jNeo4j Graph Use Cases, Bruno Ungermann, Neo4j
Neo4j Graph Use Cases, Bruno Ungermann, Neo4j
 
From a hack to Data Mesh (Devoxx 2022)
From a hack to Data Mesh (Devoxx 2022)From a hack to Data Mesh (Devoxx 2022)
From a hack to Data Mesh (Devoxx 2022)
 
Apache Kafka and the Data Mesh | Michael Noll, Confluent
Apache Kafka and the Data Mesh | Michael Noll, ConfluentApache Kafka and the Data Mesh | Michael Noll, Confluent
Apache Kafka and the Data Mesh | Michael Noll, Confluent
 

Viewers also liked

The power of graphs to analyze biological data
The power of graphs to analyze biological dataThe power of graphs to analyze biological data
The power of graphs to analyze biological datadatablend
 
Graph DB + Bioinformatics: Bio4j, recent applications and future directions
Graph DB + Bioinformatics:  Bio4j, recent applications and future directions Graph DB + Bioinformatics:  Bio4j, recent applications and future directions
Graph DB + Bioinformatics: Bio4j, recent applications and future directions Pablo Pareja Tobes
 
FluxGraph: a time-machine for your graphs
FluxGraph: a time-machine for your graphsFluxGraph: a time-machine for your graphs
FluxGraph: a time-machine for your graphsdatablend
 
Managing Genetic Ancestry at Scale with Neo4j and Kafka - StampedeCon 2015
Managing Genetic Ancestry at Scale with Neo4j and Kafka - StampedeCon 2015Managing Genetic Ancestry at Scale with Neo4j and Kafka - StampedeCon 2015
Managing Genetic Ancestry at Scale with Neo4j and Kafka - StampedeCon 2015StampedeCon
 
Building a repository of biomedical ontologies with Neo4j
Building a repository of biomedical ontologies with Neo4jBuilding a repository of biomedical ontologies with Neo4j
Building a repository of biomedical ontologies with Neo4jSimon Jupp
 
Arakawa_Glanguage_BOSC2009
Arakawa_Glanguage_BOSC2009Arakawa_Glanguage_BOSC2009
Arakawa_Glanguage_BOSC2009bosc
 
Bio4j: A pioneer graph based database for the integration of biological Big Data
Bio4j: A pioneer graph based database for the integration of biological Big DataBio4j: A pioneer graph based database for the integration of biological Big Data
Bio4j: A pioneer graph based database for the integration of biological Big DataPablo Pareja Tobes
 
Graph databases in computational bioloby: case of neo4j and TitanDB
Graph databases in computational bioloby: case of neo4j and TitanDBGraph databases in computational bioloby: case of neo4j and TitanDB
Graph databases in computational bioloby: case of neo4j and TitanDBAndrei KUCHARAVY
 
GraphTalks - Semantisches Produktdatenmanagement, Dr. Andreas Weber
GraphTalks - Semantisches Produktdatenmanagement, Dr. Andreas WeberGraphTalks - Semantisches Produktdatenmanagement, Dr. Andreas Weber
GraphTalks - Semantisches Produktdatenmanagement, Dr. Andreas WeberNeo4j
 
Jonathan Eisen: Phylogenetic approaches to the analysis of genomes and metage...
Jonathan Eisen: Phylogenetic approaches to the analysis of genomes and metage...Jonathan Eisen: Phylogenetic approaches to the analysis of genomes and metage...
Jonathan Eisen: Phylogenetic approaches to the analysis of genomes and metage...Jonathan Eisen
 
Procter Vamsas Bosc2009
Procter Vamsas Bosc2009Procter Vamsas Bosc2009
Procter Vamsas Bosc2009bosc
 
Jonathan Eisen talk for #SCS2012 at #ISMB "Networks in genomics and bioinfor...
Jonathan Eisen talk for #SCS2012 at #ISMB  "Networks in genomics and bioinfor...Jonathan Eisen talk for #SCS2012 at #ISMB  "Networks in genomics and bioinfor...
Jonathan Eisen talk for #SCS2012 at #ISMB "Networks in genomics and bioinfor...Jonathan Eisen
 
OBF Address at BOSC 2012
OBF Address at BOSC 2012OBF Address at BOSC 2012
OBF Address at BOSC 2012Hilmar Lapp
 
Chamberlain PhD Thesis
Chamberlain PhD ThesisChamberlain PhD Thesis
Chamberlain PhD Thesisschamber
 
VIZBI 2014 - Visualizing Genomic Variation
VIZBI 2014 - Visualizing Genomic VariationVIZBI 2014 - Visualizing Genomic Variation
VIZBI 2014 - Visualizing Genomic VariationJan Aerts
 
Bio::Phylo - phyloinformatic analysis using perl
Bio::Phylo - phyloinformatic analysis using perlBio::Phylo - phyloinformatic analysis using perl
Bio::Phylo - phyloinformatic analysis using perlRutger Vos
 
The role of cost in yeast gene expression
The role of cost in yeast gene expressionThe role of cost in yeast gene expression
The role of cost in yeast gene expressionMichael Barton
 
Tetrahymena genome project 2003 presentation by Jonathan Eisen
Tetrahymena genome project 2003 presentation by Jonathan EisenTetrahymena genome project 2003 presentation by Jonathan Eisen
Tetrahymena genome project 2003 presentation by Jonathan EisenJonathan Eisen
 

Neo4j and bioinformatics

  • 2. But who’s this guy talking here? I am currently working as a bioinformatics consultant/developer/researcher at Oh no sequences! Oh no what!? We are the R&D group at Era7 Bioinformatics. We like bioinformatics, cloud computing, NGS, category theory, bacterial genomics… well, lots of things. What about Era7 Bioinformatics? Era7 Bioinformatics is a bioinformatics company specialized in sequence analysis, knowledge management and sequencing data interpretation. Our area of expertise revolves around biological sequence analysis, particularly Next Generation Sequencing data management and analysis. www.ohnosequences.com www.bio4j.com
  • 3. In bioinformatics we have highly interconnected, overlapping knowledge spread throughout different DBs.
  • 4. However, in most cases all this data is modeled in relational databases, and sometimes even just as plain CSV files. As the amount and diversity of data grows, domain models become crazily complicated!
  • 5. With a relational paradigm, the double implication Entity ⇔ Table does not go both ways:
      - You get ‘auxiliary’ tables that have no relationship with the small piece of reality you are modeling.
      - You need ‘artificial’ IDs only for connecting entities (and these are mixed with IDs that somehow live in reality).
      - Entity-relationship models are cool, but in the end you always have to deal with ‘raw’ tables plus SQL.
      - Integrating new knowledge into already existing databases is hard, and sometimes not even possible without changing the domain model.
  • 6. Life in general and biology in particular are probably not 100% like a graph… but one thing’s sure, they are not a set of tables!
  • 8. Neo4j is a high-performance NoSQL graph database with all the features of a mature and robust database.
      - The programmer works with an object-oriented, flexible network structure rather than with strict and static tables.
      - All the benefits of a fully transactional, enterprise-strength database.
      - For many applications, Neo4j offers performance improvements on the order of 1000x or more compared to relational DBs.
  • 9. What’s Bio4j? Bio4j is a graph-based bioinformatics DB including most of the data available in:
      - UniProt KB (Swiss-Prot + TrEMBL)
      - NCBI Taxonomy
      - Gene Ontology (GO)
      - RefSeq
      - UniRef (50, 90, 100)
      - Enzyme DB
  • 10. What’s Bio4j? It provides a completely new and powerful framework for querying and managing protein-related information. Since it relies on a high-performance graph engine, data is stored in a way that semantically represents its own structure.
  • 11. What’s Bio4j? Bio4j uses Neo4j technology, a "high-performance graph engine with all the features of a mature and robust database". Thanks both to being based on Neo4j and to the API provided, Bio4j is also very scalable, allowing anyone to easily incorporate their own data and make the best of it.
  • 12. What’s Bio4j? Everything in Bio4j is open source, released under AGPLv3!
  • 13. Bio4j in numbers. The current version (0.7) includes:
      - Relationships: 530,642,683
      - Nodes: 76,071,411
      - Relationship types: 139
      - Node types: 38
  • 14. Let’s dig a bit into the Bio4j structure… Data sources and their relationships.
  • 16. The Graph DB model: representation. Core abstractions:
      - Nodes
      - Relationships between nodes
      - Properties on both
  • 17. How are things modeled? Couldn’t be simpler!
      - Entities → Nodes
      - Associations / Relationships → Edges
  • 18. Some examples of nodes would be: GO term, Protein, Genome Element. And of relationships: Protein -[PROTEIN_GO_ANNOTATION]-> GO term.
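The node/relationship/property model described on the last three slides can be sketched in plain Java. This is only an illustration under stated assumptions, not the Neo4j or Bio4j API: the names `PropertyGraphSketch`, `Node`, `Relationship`, `connect` and `neighbours` are made up for the example, and the GO identifier is a sample value. Only the `PROTEIN_GO_ANNOTATION` relationship type and the accession P12345 appear in the slides.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative in-memory property graph; NOT the Neo4j or Bio4j API.
final class PropertyGraphSketch {

    // A node has a type and arbitrary key/value properties.
    static final class Node {
        final String type;
        final Map<String, Object> properties = new HashMap<>();
        final List<Relationship> outgoing = new ArrayList<>();

        Node(String type) {
            this.type = type;
        }
    }

    // A relationship is typed and directed, and can carry properties too.
    static final class Relationship {
        final String type;
        final Node start;
        final Node end;
        final Map<String, Object> properties = new HashMap<>();

        Relationship(String type, Node start, Node end) {
            this.type = type;
            this.start = start;
            this.end = end;
        }
    }

    // Creates a directed relationship and registers it on its start node.
    static Relationship connect(Node start, String type, Node end) {
        Relationship r = new Relationship(type, start, end);
        start.outgoing.add(r);
        return r;
    }

    // Follows outgoing relationships of one type and collects their end nodes.
    static List<Node> neighbours(Node start, String relationshipType) {
        List<Node> result = new ArrayList<>();
        for (Relationship r : start.outgoing) {
            if (r.type.equals(relationshipType)) {
                result.add(r.end);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Node protein = new Node("Protein");
        protein.properties.put("accession", "P12345");

        Node goTerm = new Node("GoTerm");
        goTerm.properties.put("id", "GO:0005739"); // sample value, for illustration only

        connect(protein, "PROTEIN_GO_ANNOTATION", goTerm);

        for (Node n : neighbours(protein, "PROTEIN_GO_ANNOTATION")) {
            System.out.println(n.type + " " + n.properties.get("id"));
        }
    }
}
```

The point of the sketch is that a traversal just follows references from one node to the next, with no join table or artificial ID in between.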
  • 19. We have developed a tool, Bio4jExplorer, intended both as a reference manual and as a first contact with the Bio4j domain model. Bio4jExplorer allows you to:
      • Navigate through all nodes and relationships
      • Access the javadocs of any node or relationship
      • Graphically explore the neighborhood of a node/relationship
      • Look up the indexes that may serve as an entry point for a node
      • Check incoming/outgoing relationships of a specific node
      • Check start/end nodes of a specific relationship
  • 20. Entry points and indexing. There are two kinds of entry points for the graph:
      - Auxiliary relationships going from the reference node, e.g.
          - CELLULAR_COMPONENT: leads to the root of the GO cellular component sub-ontology
          - MAIN_DATASET: leads to both main datasets, Swiss-Prot and TrEMBL
      - Node indexing. There are two types of node indexes:
          - Exact: only exact values are considered hits
          - Fulltext: regular expressions can be used
  • 21. Retrieving protein info (Bio4jModel Java API)
      // Create the manager and the node retriever
      Bio4jManager manager = new Bio4jManager("/mybio4jdb");
      NodeRetriever nR = new NodeRetriever(manager);
      ProteinNode protein = nR.getProteinNodeByAccession("P12345");

      // Getting more related info...
      List<InterproNode> interpros = protein.getInterpro();
      OrganismNode organism = protein.getOrganism();
      List<GoTermNode> goAnnotations = protein.getGOAnnotations();
      List<ArticleNode> articles = protein.getArticleCitations();
      for (ArticleNode article : articles) {
          System.out.println(article.getPubmedId());
      }

      // Don't forget to close the manager
      manager.shutDown();
  • 22. Querying Bio4j with Cypher
      Getting a keyword by its ID:
          START k=node:keyword_id_index(keyword_id_index = "KW-0181")
          return k.name, k.id
      Finding circuits/simple cycles of length 3 where at least one protein is from the Swiss-Prot dataset:
          START d=node:dataset_name_index(dataset_name_index = "Swiss-Prot")
          MATCH d <-[r:PROTEIN_DATASET]- p,
                circuit = (p) -[:PROTEIN_PROTEIN_INTERACTION]-> (p2) -[:PROTEIN_PROTEIN_INTERACTION]-> (p3) -[:PROTEIN_PROTEIN_INTERACTION]-> (p)
          return p.accession, p2.accession, p3.accession
      Check this blog post for more info, and our Bio4j Cypher cheat sheet.
  • 23. Gremlin: a graph traversal language
      Get a protein by its accession number and return its full name:
          gremlin> g.idx('protein_accession_index')[['protein_accession_index':'P12345']].full_name
          ==> Aspartate aminotransferase, mitochondrial
      Get the proteins (accessions) associated with an InterPro motif (limited to 4 results):
          gremlin> g.idx('interpro_id_index')[['interpro_id_index':'IPR023306']].inE('PROTEIN_INTERPRO').outV.accession[0..3]
          ==> E2GK26
          ==> G3PMS4
          ==> G3Q865
          ==> G3PIL8
      Check our Bio4j Gremlin cheat sheet.
  • 24. REST Server. You can also query/navigate through Bio4j with the Neo4j REST API! The default representation is JSON, both for responses and for data sent with POST/PUT requests.
      Get a protein by its accession number (Q9UR66):
          http://server_url:7474/db/data/index/node/protein_accession_index/protein_accession_index/Q9UR66
      Get the outgoing relationships for protein Q9UR66:
          http://server_url:7474/db/data/node/Q9UR66_node_id/relationships/out
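The two request URLs on the REST slide follow a fixed pattern, which the following Java sketch makes explicit. It only builds the URLs and sends nothing. The paths and the `protein_accession_index` name are copied from the slide; the class and method names, the `localhost` host and the node id 42 are assumptions made for illustration.

```java
import java.net.URI;

// Builds the two request URLs shown on the slide; it does not send any request.
final class Bio4jRestUrls {

    // Index lookup: /db/data/index/node/{index}/{key}/{value}
    static URI proteinByAccession(String baseUrl, String accession) {
        // On the slide the index name and the indexed key are the same string.
        String index = "protein_accession_index";
        return URI.create(baseUrl + "/db/data/index/node/" + index + "/" + index + "/" + accession);
    }

    // Outgoing relationships of a node: /db/data/node/{id}/relationships/out
    static URI outgoingRelationships(String baseUrl, long nodeId) {
        return URI.create(baseUrl + "/db/data/node/" + nodeId + "/relationships/out");
    }

    public static void main(String[] args) {
        String baseUrl = "http://localhost:7474"; // assumed host; the slide writes server_url
        System.out.println(proteinByAccession(baseUrl, "Q9UR66"));
        System.out.println(outgoingRelationships(baseUrl, 42L)); // 42 is a made-up node id
    }
}
```

In practice the node id for the second call would be read from the JSON returned by the first one.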
  • 25. Visualizations (1): REST Server Data Browser. Navigate through Bio4j data in real time!
  • 26. Visualizations (2): Bio4j GO Tools
  • 27. Visualizations (3): Bio4j + Gephi. Get really cool graph visualizations using Bio4j and the Gephi visualization and exploration platform.
  • 28. Bio4j + Cloud. We use AWS (Amazon Web Services) everywhere we can around Bio4j, which gives us the following benefits:
      - Interoperability and data distribution: releases are available as public EBS snapshots, so AWS users can create volumes holding a 100% ready Bio4j DB and attach them to their instances in just a few seconds.
      - CloudFormation templates: Basic Bio4j DB Instance and Bio4j REST Server Instance.
      - Backup and storage using S3 (Simple Storage Service): we use S3 both for backup (indirectly, through the EBS snapshots) and for storage (directly, storing RefSeq sequences as independent S3 files).
  • 29. Why would I use Bio4j?
      - Massive access to protein/genome/taxonomy… related information
      - Integration of your own DBs/resources around common information
      - Development of services tailored to your needs built around Bio4j
      - Network analysis
      - Visualizations
      - Besides many others I cannot think of myself…
      If you have something in mind for which Bio4j might be useful, please let us know so we can all see how it could help you meet your needs! ;)
  • 30. Community. Bio4j has a fast-growing internet presence:
      - Twitter: check @bio4j for updates
      - Blog: go to http://blog.bio4j.com
      - Mailing list: ask any question you may have in our list
      - LinkedIn: check the Bio4j group
      - GitHub issues: don’t be shy! Open a new issue if you think something’s going wrong.
  • 31. OK, but why start all this? Were you so bored…?! It all started somehow around our need for massive access to protein GO (Gene Ontology) annotations. At that point I had to develop my own MySQL DB based on the official GO SQL database, and problems started from the beginning:
      - I went crazy ‘deciphering’ how to extract UniProt protein annotations from the official GO table schema.
      - UniProt and official GO protein annotations were not always consistent.
      - Populating my own DB took really long due to all the joins and subqueries needed to get and store the protein annotations.
      Soon enough we also needed massive access to basic protein information.
  • 32. These processes had to be automated for BG7, our bacterial genome annotation system (specifically designed for NGS data). The UniProt web services available were too limited:
      - Slow
      - Limited number of queries
      - Too little information available
      So I downloaded the whole UniProt DB in XML format (Swiss-Prot + TrEMBL) and started to have some fun with it!
  • 33. BG7 algorithm
      1. Selection of the specific reference protein set
      2. Prediction of possible genes by BLAST similarity
      3. Gene definition: merging compatible similarity regions, detecting start and stop
      4. Solving overlapped predicted genes
      5. RNA prediction by BLAST similarity
      6. Final annotation and complete deliverables. Quality control.
      www.era7bioinformatics.com
  • 34. We got used to having massive direct access to all this protein-related information… So why not add other resources that we needed quite often in most projects, and which were now becoming a sort of bottleneck compared to those already included in Bio4j? So we incorporated:
      - Isoform sequences
      - Protein interactions and features
      - UniRef 50, 90, and 100
      - RefSeq
      - NCBI Taxonomy
      - Enzyme Expasy DB
  • 35. Bio4j + MG7 + 48 BLAST XML files (~1 GB each). Some numbers: 157 639 502 nodes; 742 615 705 relationships; 632 832 045 properties; 148 relationship types; 44 node types. And it works just fine!
  • 37. What’s MG7? MG7 lets you choose different parameters to set the thresholds for filtering the BLAST hits: (i) E-value; (ii) identity and query coverage. It can export the results of the analysis to different data formats, such as XML, CSV and Gexf (Graph exchange XML format). It also provides the user with heat maps and graph visualizations, together with a user-friendly interface, MG7Viewer, that gives access to the alignment responsible for each functional or taxonomic read assignment and displays the frequencies in the taxonomic tree.
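As a rough illustration of threshold-based hit filtering, here is a minimal sketch. It is not MG7 code: the `BlastHit` type, its field names and the default threshold values are assumptions made for the example. Only the three criteria (E-value, identity, query coverage) come from the slide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BlastHit:
    """One BLAST hit for a read (hypothetical structure, not MG7's)."""
    read_id: str
    subject_id: str
    evalue: float
    identity: float        # percent identity, 0-100
    query_coverage: float  # percent of the read covered, 0-100


def filter_hits(hits, max_evalue=1e-5, min_identity=0.0, min_query_coverage=0.0):
    """Keep the hits that pass all three thresholds.

    The default values are placeholders for the sketch, not MG7 defaults.
    """
    return [
        hit for hit in hits
        if hit.evalue <= max_evalue
        and hit.identity >= min_identity
        and hit.query_coverage >= min_query_coverage
    ]


example_hits = [
    BlastHit("read1", "protA", 1e-30, 98.0, 95.0),
    BlastHit("read1", "protB", 1e-3, 99.0, 99.0),   # fails the E-value threshold
    BlastHit("read2", "protC", 1e-12, 60.0, 90.0),  # fails the identity threshold
    BlastHit("read3", "protD", 1e-20, 92.0, 40.0),  # fails the coverage threshold
]
kept_hits = filter_hits(example_hits, max_evalue=1e-5,
                        min_identity=90.0, min_query_coverage=80.0)
```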
  • 41. Mining Bio4j data Finding topological patterns in Protein-Protein Interaction networks
  • 42. Finding the lowest common ancestor of a set of NCBI taxonomy nodes with Bio4j
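The idea behind the lowest common ancestor (LCA) can be sketched without a graph database. The code below does not use the Bio4j or Neo4j API: it assumes the taxonomy is available as a plain child-to-parent dictionary, and the tiny tree is a simplified toy lineage, not real NCBI Taxonomy data. In Bio4j the same walk would follow parent relationships between taxon nodes.

```python
def path_to_root(node, parent):
    """Return the list [node, parent(node), ..., root]."""
    path = [node]
    while parent.get(node) is not None:
        node = parent[node]
        path.append(node)
    return path


def lowest_common_ancestor(nodes, parent):
    """Return the deepest taxon that is an ancestor of every node.

    A node counts as its own ancestor. `parent` maps each taxon to its
    parent taxon, with the root mapped to None (an assumed representation).
    """
    nodes = list(nodes)
    if not nodes:
        raise ValueError("at least one node is required")
    # Candidates are ordered from the first node upwards to the root,
    # so the first surviving candidate is the lowest common ancestor.
    candidates = path_to_root(nodes[0], parent)
    for node in nodes[1:]:
        ancestors = set(path_to_root(node, parent))
        candidates = [c for c in candidates if c in ancestors]
    if not candidates:
        raise ValueError("the nodes do not share an ancestor")
    return candidates[0]


# Toy taxonomy (simplified lineage, for illustration only).
toy_parent = {
    "root": None,
    "Bacteria": "root",
    "Proteobacteria": "Bacteria",
    "Firmicutes": "Bacteria",
    "Escherichia coli": "Proteobacteria",
    "Salmonella enterica": "Proteobacteria",
    "Bacillus subtilis": "Firmicutes",
}
lca_two = lowest_common_ancestor(
    ["Escherichia coli", "Salmonella enterica"], toy_parent)
lca_three = lowest_common_ancestor(
    ["Escherichia coli", "Salmonella enterica", "Bacillus subtilis"], toy_parent)
```

In a metagenomics setting such as MG7, this kind of computation is useful when a read has hits in several taxa and has to be assigned to a single node of the taxonomy tree.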
  • 43. Future directions (1). Gene flux tool: a new tool for bacterial comparative genomics, for massive tracing of vertical and horizontal gene flux between genome elements based on the analysis of the similarity between their proteins. It would analyze similarity relationships that could be fixed at a 90% or 100% similarity threshold. Pathways tool: data from Metacyc is going to be included in Bio4j. This data would make it possible to dissect the metabolic pathways in which a genome element, organism or community (metagenomic samples) is involved. Gephi could be used to represent the metabolic pathways for each of them.
  • 44. Future directions (2). Detector of common annotations in gene clusters: many biological problems involve searching for common annotations in a set of genes. Some examples: a set of overexpressed genes; a set of proteins with local structural similarities (WIP); a set of genes bearing SNPs in cancer samples; a set of genes exclusive to a pathogenic bacterial strain. Detecting common annotations can help infer important functional connections.
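One simple way to picture such a detector is to count, for each annotation, the fraction of genes in the set that carry it. The sketch below is only an assumption about how this could be done, not the planned tool: the input is a plain dictionary from gene name to annotation labels, and the gene names and labels are made up for the example.

```python
from collections import Counter


def common_annotations(gene_annotations, min_fraction=1.0):
    """Return {annotation: fraction of genes carrying it}.

    Only annotations present in at least `min_fraction` of the genes are
    returned. `gene_annotations` maps a gene name to an iterable of
    annotation labels (hypothetical input format).
    """
    total = len(gene_annotations)
    if total == 0:
        return {}
    # set() makes sure an annotation is counted once per gene.
    counts = Counter(
        annotation
        for annotations in gene_annotations.values()
        for annotation in set(annotations)
    )
    return {
        annotation: count / total
        for annotation, count in counts.items()
        if count / total >= min_fraction
    }


# Made-up gene set with made-up annotation labels.
example_genes = {
    "geneA": ["transport", "membrane"],
    "geneB": ["transport", "membrane", "kinase"],
    "geneC": ["transport"],
}
shared_by_all = common_annotations(example_genes)
shared_by_most = common_annotations(example_genes, min_fraction=0.6)
```

A real detector would also have to ask whether an annotation is more frequent in the set than expected by chance. This sketch only reports raw frequencies.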
  • 45. That’s it! Thanks for your time ;)