Behind the scenes of KnetMiner
Marco Brandizi
marco.brandizi@rothamsted.ac.uk
Bioinformatics Group Training, 27/11/2018
Find these slides at:
https://www.slideshare.net/mbrandizi
<concept>
<id>1</id>
<pid>Q75WV3</pid>
<description/>
<elementOf>
<idRef>UNIPROTKB-SwissProt</idRef>
</elementOf>
<ofType>
<idRef>Protein</idRef>
</ofType>
<evidences>
<evidence>
<idRef>IMPD</idRef>
</evidence>
</evidences>
<conames>
<concept_name>
<name>Probable trehalose-phosphate phosphatase 1</name>
<isPreferred>true</isPreferred>
</concept_name>
…
<cc>
<id>Protein</id>
<fullname>Protein</fullname>
<description>
A protein is comprised of one or more Polypeptides
and potentially other molecules.
</description>
<specialisationOf>
<idRef>MolCmplx</idRef>
</specialisationOf>
</cc>
<relation>
<fromConcept>1</fromConcept>
<toConcept>3</toConcept>
<ofType>
<idRef>participates_in</idRef>
</ofType>
<evidences>
<evidence>
<idRef>ECO:0000316</idRef>
</evidence>
</evidences>
<relgds/>
</relation>
<concept>
<id>3</id>
<pid>GO:0009651</pid>
<description>response to salt stress</description>
<ofType><idRef>BioProc</idRef></ofType>
<coaccessions>
<concept_accession>
<accession>GO:0009651</accession>
<elementOf><idRef>GO</idRef></elementOf>
<ambiguous>false</ambiguous>
</concept_accession>
</coaccessions>
</concept>
The OXL format
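To make the structure of the excerpt above more concrete, here is a minimal Python sketch that reads a cut-down version of it. The `<oxl_fragment>` wrapper element and the `read_fragment` helper are assumptions introduced only for this illustration; a real OXL file has its own root element and many more fields.

```python
import xml.etree.ElementTree as ET

# Hypothetical wrapper: <oxl_fragment> is NOT part of OXL, it only makes the
# two-concept, one-relation excerpt well-formed for this sketch.
OXL_FRAGMENT = """
<oxl_fragment>
  <concept>
    <id>1</id>
    <pid>Q75WV3</pid>
    <ofType><idRef>Protein</idRef></ofType>
  </concept>
  <concept>
    <id>3</id>
    <pid>GO:0009651</pid>
    <ofType><idRef>BioProc</idRef></ofType>
  </concept>
  <relation>
    <fromConcept>1</fromConcept>
    <toConcept>3</toConcept>
    <ofType><idRef>participates_in</idRef></ofType>
  </relation>
</oxl_fragment>
"""


def read_fragment(xml_text):
    """Return (concepts, relations) found in an OXL-like fragment."""
    root = ET.fromstring(xml_text)
    # concept id -> (public id, concept type)
    concepts = {
        c.findtext("id"): (c.findtext("pid"), c.findtext("ofType/idRef"))
        for c in root.iter("concept")
    }
    # (source concept id, relation type, target concept id)
    relations = [
        (r.findtext("fromConcept"), r.findtext("ofType/idRef"), r.findtext("toConcept"))
        for r in root.iter("relation")
    ]
    return concepts, relations


concepts, relations = read_fragment(OXL_FRAGMENT)
for src, rel_type, dst in relations:
    # Resolve the numeric ids back to the public ids of the two concepts
    print(concepts[src][0], rel_type, concepts[dst][0])
```

The point of the sketch is that relations refer to concepts by internal numeric id, so any consumer has to resolve those ids before the data becomes readable.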
The Ondex Integrator
And the Command Line Version
But it Needs some Pre-Processing Too
Why Change?
https://funnyjunk.com/Reinvent+the+wheel/funny-pictures/5665443/
• Graph databases have emerged
• Having expressive query languages (eg, SPARQL, Cypher)
• Having low memory footprint (and possibly scalability over clusters/clouds)
• More stable APIs and implementations
• Data Standards, Machine-Readable Data, FAIR Principles, etc.
• Useful in input: standardised data, less custom ELT to do, useful tools and techniques (e.g., SPARQL CONSTRUCT, scripting with JSON)
• Useful in output: applications based on APIs/micro-services, query languages, machine-readable & standardised data
• New apps can be either ours or 3rd parties'
• Ondex issues
• Getting old (and older with Java >8)
• All data must be in memory
• Not exactly high quality code
Property Graphs
The Cypher Query/DML Language
Proteins->Reactions->Pathways:
// chain of paths, node selection via property (exploits indices)
MATCH (prot:Protein) - [csby:consumed_by] -> (:Reaction) - [:part_of] -> (pway:Path { title: 'apoptosis' })
// further conditions, not always so performant
WHERE prot.name =~ '(?i)^DNA.+'
// Usual projection and post-selection operators
RETURN prot.name, pway
// Relations can have properties
ORDER BY csby.pvalue
LIMIT 1000
Proteins->Reactions->Pathways:
// Single-path (or same-direction branching) easy to write
MATCH (prot:Protein) - [:produced_by|consumed_by] -> (:Reaction)
- [:part_of*1..3] -> (pway:Path)
RETURN ID(prot), ID(pway) LIMIT 1000
// Very compact forms available, depending on the data
MATCH (prot:Protein) -- (pway:Path) RETURN pway
Cypher as Semantic Motif Language
Exercise 1: Try Cypher
• Go to http://babvs48.rothamsted.ac.uk:7476/browser
• Use neo4j/test as credentials
• Try the query:
• MATCH (prot:Protein) - [prot2react:cs_by|pd_by] - (react:Reaction)
- [react2path:part_of] -> (pway:Path)
WHERE pway.prefName CONTAINS 'acyl carrier protein metabolism'
RETURN * LIMIT 10
• And explore the graphical result
• What do you think you’ve found?
• What do you have in () and in []?
• What’s the meaning of the ‘|’ operator?
• cs_by and pd_by are shortcuts for ‘consumed by’ and ‘produced by’
• What’s the difference between -[]-> and -[]- ?
• More help about Cypher at: https://neo4j.com/developer/cypher-query-language
Exercise 1: Solution
• You should see something like the figure
• Which shows the ACP pathway at the centre, a member reaction and proteins consumed/produced by the latter
• (name:Label) matches nodes (label is a synonym of type), [name:Type] matches relations
• [r:R1|R2] matches relations of either type R1 or R2
• (src:Label1)-[r:R]->(dst:Label2) matches relations of type R going from nodes of type Label1 to nodes of type Label2
• (n1)-[:R]-(n2) matches both directions, so both n1->r1->n2 and n2->r2->n1
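The matching rules listed in this solution can be mimicked on a toy in-memory graph. The sketch below is plain Python, not Cypher and not the Neo4j API; the node ids, the `NODES`/`RELS` structures and the `match` helper are all invented for illustration, and the undirected case is simplified (it assumes the two labels differ).

```python
# Toy property graph: node id -> label, plus typed, directed relations.
# All ids are invented for illustration; they are not KnetMiner data.
NODES = {"p1": "Protein", "p2": "Protein", "r1": "Reaction", "w1": "Path"}
RELS = [
    ("p1", "cs_by", "r1"),    # p1 is consumed by r1
    ("r1", "pd_by", "p2"),    # stored in the opposite direction on purpose
    ("r1", "part_of", "w1"),
]


def match(src_label, rel_types, dst_label, directed=True):
    """Mimic (a:src_label)-[:T1|T2]->(b:dst_label); with directed=False, mimic -[]-."""
    hits = []
    for src, rel_type, dst in RELS:
        if rel_type not in rel_types:
            continue  # the '|' operator: any of the listed types is accepted
        if NODES[src] == src_label and NODES[dst] == dst_label:
            hits.append((src, rel_type, dst))
        elif not directed and NODES[dst] == src_label and NODES[src] == dst_label:
            # Undirected pattern: also accept the relation when traversed backwards
            hits.append((dst, rel_type, src))
    return hits


directed_hits = match("Protein", {"cs_by", "pd_by"}, "Reaction")
undirected_hits = match("Protein", {"cs_by", "pd_by"}, "Reaction", directed=False)
print(directed_hits)
print(undirected_hits)
```

With the arrow, only the relation stored as protein-to-reaction is found; without it, the relation stored as reaction-to-protein is returned as well, which is why the exercise query uses the undirected form for cs_by|pd_by.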
Exercise 2: Write Your Own Cypher
• Using the same browser, find:
• genes,
• which encode proteins,
• which are mentioned by articles that contain ‘ZmPEAMT1’ in the title
• Hints
• Use the node labels: Gene, Protein, Publication
• Use the relation types: enc (meaning ‘encodes’), pub_in (meaning ‘published in’, or ‘mentioned in’)
• Use the attribute AbstractHeader (meaning ‘publication title’)
• Use the filter operator CONTAINS, as in the previous exercise
• More info about the KnetMiner node/relation types in the left column of the Neo4j browser, and on the following slides
Exercise 2: Solution
• MATCH (gene:Gene)-[enc:enc]->(prot:Protein)-[xref:pub_in]->(article:Publication)
WHERE article.AbstractHeader CONTAINS 'ZmPEAMT1'
RETURN * LIMIT 10
• Your solution might be a variant of this
But how to Encode Data? The Semantic Web Way
@prefix bkr: <http://www.ondex.org/bioknet/resources/> .
@prefix bk: <http://www.ondex.org/bioknet/terms/> .
@prefix bka: <http://www.ondex.org/bioknet/terms/attributes/> .
bkr:TOB1 a bk:Protein ;
  bk:participates_in <http://www.wikipathways.org/id1> ;
  bk:prefName "TOB1" ;
  bk:published_in bkr:23236473 .
But how to Encode Data? The Semantic Web Way
prefix bk: <http://www.ondex.org/bioknet/terms/>
prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>
select distinct ?prot ?comp
where {
  ?prot a bk:Protein;
    rdfs:label ?protLabel.
  filter ( contains ( ?protLabel, 'TOB1' ) ).
  ?enz bk:activated_by ?prot.
  ?enz bk:activated_by ?comp.
  ?comp rdfs:label ?compLabel.
}
LIMIT 1000
Querying KnetMiner with SPARQL
prefix bk: <http://www.ondex.org/bioknet/terms/>
select distinct ?prot ?pway
where {
  { # Branch 1
    ?prot bk:pd_by|bk:cs_by ?react.
    ?prot a bk:Protein.
    ?react a bk:Reaction.
    ?react bk:part_of ?pway.
    ?pway a bk:Path.
  }
  union { # Branch 2
    ?prot ^bk:ac_by|bk:is_a ?enz.
    ?prot a bk:Protein.
    ?enz a bk:Enzyme.
    { # Branch 2.1
      ?enz bk:ac_by|bk:in_by ?comp.
      ?comp a bk:Compound.
      ?comp bk:cs_by|bk:pd_by ?trns.
      ?trns a bk:Transport
    } union {
      # Branch 2.2
      ?enz ^bk:ca_by ?trns.
      ?comp a bk:Compound.
      ?trns a bk:Transport
    }
    ?trns bk:part_of ?pway.
    ?pway a bk:Path.
  }
} LIMIT 1000
So, Why Both?
And more
Neo4J, Cypher DBs, Graph DBs vs Semantic Web/Triple Stores
Data exchange format
• Graph DBs: - No official one, just Cypher, support for GraphML, RDF; +/- Focus on backing applications
• Triple stores: + Focus on data sharing standards
Data model
• Graph DBs: + Relations with properties; - Metadata/schemas/ontologies management
• Triple stores: - Relations cannot have properties (reification required); + Metadata/schemas/ontologies as first-class citizens and standardised OWL
Performance
• Graph DBs: + Complex graph traversals
• Triple stores: + Comparable in most cases
Query language
• Graph DBs: + Cypher is easier (eg, compact, implicit elems)?; - Expressivity issues (unions); - No standard QL (but efforts in progress, eg, OpenCypher)
• Triple stores: - SPARQL is harder? (URIs, namespaces, verbosity); + SPARQL more expressive
Standardisation, openness
• Graph DBs: +/- (TinkerPop is open, Neo4J isn’t); + Commercial support; + More alive and up to date (e.g., support for Hadoop, nice Neo4j browser, easy installation)
• Triple stores: + Natively open, many open implementations; - Instability and many short-lived prototypes; - Advancements seem to be slowing down; + Some nice open and commercial browsers (LODEStar, …
Scalability, big data
• Graph DBs: +/- Commercial support for clustering/clouds for Neo4J; + Open support in TinkerPop
• Triple stores: + Load balancing/cluster solutions, commercial cloud support (eg, GraphDB); + SPARQL over TinkerPop (via SAIL interface)
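The "reification required" entry in the comparison above can be made concrete with a small Python sketch. It takes one relation that carries properties, as a property graph would store it, and rewrites it as plain subject/predicate/object triples. The `bk:Relation`, `bk:relTypeRef`, `bk:relFrom`, `bk:relTo` and `bka:` names are the ones used elsewhere in these slides; the `reify` helper, the dictionary layout and the relation identifier are assumptions made up for the illustration.

```python
# Property-graph style: the relation itself carries the properties.
# The values are illustrative only.
edge = {
    "from": "TOB1",
    "type": "is_annotated_by",
    "to": "GO_0003714",
    "props": {"Score": 0.05, "EVIDENCE": "text mining tool"},
}


def reify(edge, rel_id):
    """Turn one edge with properties into plain (subject, predicate, object) triples."""
    # The relation becomes a resource of its own, pointing at its type and endpoints
    triples = [
        (rel_id, "rdf:type", "bk:Relation"),
        (rel_id, "bk:relTypeRef", "bk:" + edge["type"]),
        (rel_id, "bk:relFrom", "bkr:" + edge["from"]),
        (rel_id, "bk:relTo", "obo:" + edge["to"]),
    ]
    # Each edge property is attached to that new resource
    for name, value in edge["props"].items():
        triples.append((rel_id, "bka:" + name, value))
    return triples


# Hypothetical identifier for the reified relation
triples = reify(edge, "bkr:rel_TOB1_GO_0003714")
for t in triples:
    print(t)
```

One edge with two properties turns into six triples, which is the verbosity cost the table refers to; in exchange, the relation becomes an addressable resource that other statements can point at.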
So, the New Architecture
Why Should I Bother?
• As data consumer
• Querying data via Cypher (or SPARQL)
• In particular, define new semantic motifs to find gene-related entities
• Knowing our BioKNO ontology/schema (TODO)
• In future, querying data via API/Cypher, getting back JSON/BioKNO
• As data producer (for KnetMiner)
• Scripting with RDF/SPARQL/etc to integrate data sources (and produce KnetMiner data sets)
• Querying multiple SPARQL endpoints to produce data sets and/or integrate our KnetMiner data with other RDF/SPARQL sources
Exercise 3: Playing with RDF
• Study Bio-KNO examples at https://github.com/Rothamsted/bioknet-onto
• What is the meaning of ‘a’? What are the classes (ie, types) used in example 1?
• Which property types (ie, relations) link proteins, pathways and protein accessions?
• According to example 2, is a ‘CCR4-NOT core complex’ a part of ‘intracellular part’?
• In example 3, why do we need more than “bkr:TOB1 bk:published_in bkr:20068231” to represent all details about the publication mentioning TOB1?
• How would you relate TOB1 to the GO term ‘transcription corepressor activity’ (accession 0003714)?
• Hint: use bk:is_annotated_by
• How would you state that the link was created by the ‘text mining tool’ and has a confidence score of 0.05?
• Hint: use the attributes bka:EVIDENCE and bka:Score
• Possibly use further documentation:
• A quick tutorial about RDF and Turtle syntax: https://ai.ia.agh.edu.pl/wiki/_media/pl:dydaktyka:semweb:quick-tutorial-rdf-turtle.pdf
• BioKNO Ontology Reference:
• http://www.marcobrandizi.info/files/bkn-owldoc/bioknet/index.html (core)
• http://www.marcobrandizi.info/files/bkn-owldoc/bk_ondex/index.html (entities used in KnetMiner/Ondex)
Exercise 3: Solution
• ‘a’ is a shortcut for the URI rdf:type, which is the standard property to state that an entity is instance of a class
• So, you can find the classes used in the example by looking at the target of the ‘a’ predicate: bk:Path, bk:Protein, bk:Accession
• is a ‘CCR4-NOT core complex’ a part of ‘intracellular part’?
• The question aims at highlighting a feature of graph data, that is: automatic reasoning
• ‘CCR4-NOT core complex’ is only explicitly stated as being part of ‘CCR4-NOT complex’ (follow the bk:part_of relation and the URIs it refers to)
• So, using only the declared data in the example, a computer cannot ‘know’ that CCR4-NOT complex is also part of ‘intracellular part’
• However, graph systems are able to work with rules like: ?x bk:part_of ?y, ?y bk:is_a ?z => ?x part_of ?z
• This rule can be applied to ?x := obo:GO_0030014, ?y := obo:GO_0030015, ?z := obo:GO_0044424
• and logically infer that obo:GO_0030014 part_of obo:GO_0044424
• These additional statements can be used in queries, eg: searching for all things that are part of intracellular part would return CCR4-NOT core complex in the results, even if this is not explicitly declared in the original data
• The rationale for this conclusion is that anything that is part of something that is a core complex is also part of something that is an intracellular part, because every core complex is also an intracellular part (as per is_a)
• In the example 3, why do we need more than: “bkr:TOB1 bk:published_in bkr:20068231” to represent all details about the publication mentioning TOB1?
• Because you need to provide a context for the usually binary relation, ie, you need to tell what its confidence score is and the evidence to justify the statement
• Compare this with the Neo4j equivalent
• How would you relate TOB1 to the GO term ‘transcription corepressor activity’ (accession 0003714)?
• bkr:TOB1 bk:is_annotated_by obo:GO_0003714.
• How would you state that the link was created by the ‘text mining tool’ and has a confidence score of 0.05?
• You need to add:
bkr:citation_TOB1_15489334 a bk:Relation ;
  bk:relTypeRef bk:is_annotated_by ;
  bk:relFrom bkr:TOB1 ;
  bk:relTo obo:GO_0003714 ;
  bka:Score 0.05 ;
  bka:EVIDENCE "text mining tool" .
• bka:EVIDENCE is an attribute, and it’s an alternative simplified form to represent evidence in KnetMiner (just a string, rather than a resource having multiple attributes).
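The inference rule quoted in this solution (?x part_of ?y, ?y is_a ?z => ?x part_of ?z) can be sketched in a few lines of Python. This is only an illustration of the idea, not how a triple store or reasoner is implemented; the `infer_part_of` helper is invented here, and the two declared statements simply reuse the variable bindings given in the solution.

```python
def infer_part_of(triples):
    """Apply: ?x part_of ?y, ?y is_a ?z => ?x part_of ?z, until nothing new appears."""
    known = set(triples)
    while True:
        # Join every part_of statement with every is_a statement sharing ?y
        new = {
            (x, "part_of", z)
            for (x, p1, y) in known if p1 == "part_of"
            for (y2, p2, z) in known if p2 == "is_a" and y2 == y
        } - known
        if not new:
            return known
        known |= new


# Declared statements, using the bindings quoted in the solution above
declared = [
    ("obo:GO_0030014", "part_of", "obo:GO_0030015"),
    ("obo:GO_0030015", "is_a", "obo:GO_0044424"),
]

inferred = infer_part_of(declared)
for t in sorted(inferred):
    print(t)
```

After the rule has run, a lookup for everything that is part of obo:GO_0044424 finds obo:GO_0030014, although that statement was never declared. The loop repeats until no new statement appears, so longer is_a chains are covered too.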
Exercise 4: Data Integration based on RDF
• Study the example at https://github.com/Rothamsted/bioknet-onto/tree/master/examples/bmp_reg_human, which builds a KnetMiner network in RDF format (following the BioKNO ontology)
• using two tools: the SPARQL CONSTRUCT construct (https://www.futurelearn.com/courses/linked-data/0/steps/16104) to perform RDF-to-RDF transformations
• and SPARQL CONSTRUCT coupled with the TARQL tool (http://tarql.github.io/) to transform CSV/table data into RDF
• Look at the transformation https://github.com/Rothamsted/bioknet-onto/blob/master/examples/bmp_reg_human/cvt_bpax.sparql, which transforms the BioPAX RDF data into our BioKNO
• What is happening? Look at it before the next question
• Sketch a schema of the BioPAX graph that is matched by the WHERE clause and the one built by the CONSTRUCT block. Is the new graph smaller or bigger?
• How would you add the fact that bp:BioChemicalReaction instances participate in pathways (bk:Pathway)?
• You can play with the data generated in this example at http://marcobrandizi.info:8890/sparql
• See example queries at: https://github.com/Rothamsted/bioknet-onto/tree/master/examples/bmp_reg_human/queries
Exercise 4: Solution
• The CONSTRUCT statement (which is part of the SPARQL query language) takes chains of protein/reaction/pathway expressed in the BioPAX format (note the use of the bp: namespace) and builds chains of protein/pathway in BioKNO format.
• So, it maps one format to another (an alternative would be to do so in data queries, see queries/pw_commons_fed.sparql)
• and generates a simplified representation (many KnetMiner data sets do so, since the data explorations we aim at serving don’t need certain details)
• How would you add the fact that bp:BioChemicalReaction instances participate in pathways (bk:Pathway)?
• In the CONSTRUCT block you’d have:
?comp bk:participates_in ?path.
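What a CONSTRUCT query does here can be imitated in plain Python: match a pattern in the source triples (the WHERE part) and emit new, simpler triples (the CONSTRUCT part). The sketch below does not use SPARQL or the real cvt_bpax.sparql; the input triples, the `bp:participantOf` and `bp:stepOf` property names and the `construct_protein_pathway` helper are placeholders invented for the illustration, not the real BioPAX vocabulary.

```python
# Invented BioPAX-like input: protein -> reaction -> pathway chains.
# The bp: property names are placeholders, not the real BioPAX vocabulary.
SOURCE = [
    ("ex:prot1", "bp:participantOf", "ex:react1"),
    ("ex:react1", "bp:stepOf", "ex:path1"),
    ("ex:prot2", "bp:participantOf", "ex:react1"),
]


def construct_protein_pathway(triples):
    """WHERE: protein -> reaction -> pathway. CONSTRUCT: protein bk:participates_in pathway."""
    # Index the reaction -> pathway half of the pattern
    step_of = {(s, o) for s, p, o in triples if p == "bp:stepOf"}
    built = set()
    for prot, p, react in triples:
        if p != "bp:participantOf":
            continue
        # Join on the shared reaction, then drop it from the output
        for react2, path in step_of:
            if react2 == react:
                built.add((prot, "bk:participates_in", path))
    return sorted(built)


target = construct_protein_pathway(SOURCE)
for t in target:
    print(t)
```

The reaction is used only to join the two halves of the pattern and does not appear in the output, which is the simplification the solution describes.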
Thanks!
• Even more material:
• On graph databases, standards, KnetMiner new backend:
• https://www.slideshare.net/mbrandizi/behind-the-scenes-of-knetminer-towards-standardised-and-interoperable-knowledge-graphs
• https://doi.org/10.1515/jib-2018-0023
• On Semantic Web, Linked Data, RDF, SPARQL, etc:
• https://prezi.com/hbxhz0kesfnn/sod-2014-presentations-summary
• https://goo.gl/bfF1hu
• https://www.nature.com/articles/nbt1139
• https://www.researchgate.net/publication/221024668_Ontologies_Come_of_Age
• http://mowl-power.cs.man.ac.uk/protegeowltutorial/resources/ProtegeOWLTutorialP4_v1_3.pdf

Conjugation, transduction and transformationConjugation, transduction and transformation
Conjugation, transduction and transformation
 
Pests of mustard_Identification_Management_Dr.UPR.pdf
Pests of mustard_Identification_Management_Dr.UPR.pdfPests of mustard_Identification_Management_Dr.UPR.pdf
Pests of mustard_Identification_Management_Dr.UPR.pdf
 
GBSN - Microbiology (Unit 2)
GBSN - Microbiology (Unit 2)GBSN - Microbiology (Unit 2)
GBSN - Microbiology (Unit 2)
 
Site Acceptance Test .
Site Acceptance Test                    .Site Acceptance Test                    .
Site Acceptance Test .
 
STS-UNIT 4 CLIMATE CHANGE POWERPOINT PRESENTATION
STS-UNIT 4 CLIMATE CHANGE POWERPOINT PRESENTATIONSTS-UNIT 4 CLIMATE CHANGE POWERPOINT PRESENTATION
STS-UNIT 4 CLIMATE CHANGE POWERPOINT PRESENTATION
 
Bacterial Identification and Classifications
Bacterial Identification and ClassificationsBacterial Identification and Classifications
Bacterial Identification and Classifications
 
GBSN - Microbiology (Unit 1)
GBSN - Microbiology (Unit 1)GBSN - Microbiology (Unit 1)
GBSN - Microbiology (Unit 1)
 
Pulmonary drug delivery system M.pharm -2nd sem P'ceutics
Pulmonary drug delivery system M.pharm -2nd sem P'ceuticsPulmonary drug delivery system M.pharm -2nd sem P'ceutics
Pulmonary drug delivery system M.pharm -2nd sem P'ceutics
 
dkNET Webinar "Texera: A Scalable Cloud Computing Platform for Sharing Data a...
dkNET Webinar "Texera: A Scalable Cloud Computing Platform for Sharing Data a...dkNET Webinar "Texera: A Scalable Cloud Computing Platform for Sharing Data a...
dkNET Webinar "Texera: A Scalable Cloud Computing Platform for Sharing Data a...
 
Locating and isolating a gene, FISH, GISH, Chromosome walking and jumping, te...
Locating and isolating a gene, FISH, GISH, Chromosome walking and jumping, te...Locating and isolating a gene, FISH, GISH, Chromosome walking and jumping, te...
Locating and isolating a gene, FISH, GISH, Chromosome walking and jumping, te...
 
Biogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune Waterworlds
Biogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune WaterworldsBiogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune Waterworlds
Biogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune Waterworlds
 
High Class Escorts in Hyderabad ₹7.5k Pick Up & Drop With Cash Payment 969456...
High Class Escorts in Hyderabad ₹7.5k Pick Up & Drop With Cash Payment 969456...High Class Escorts in Hyderabad ₹7.5k Pick Up & Drop With Cash Payment 969456...
High Class Escorts in Hyderabad ₹7.5k Pick Up & Drop With Cash Payment 969456...
 
Dopamine neurotransmitter determination using graphite sheet- graphene nano-s...
Dopamine neurotransmitter determination using graphite sheet- graphene nano-s...Dopamine neurotransmitter determination using graphite sheet- graphene nano-s...
Dopamine neurotransmitter determination using graphite sheet- graphene nano-s...
 
❤Jammu Kashmir Call Girls 8617697112 Personal Whatsapp Number 💦✅.
❤Jammu Kashmir Call Girls 8617697112 Personal Whatsapp Number 💦✅.❤Jammu Kashmir Call Girls 8617697112 Personal Whatsapp Number 💦✅.
❤Jammu Kashmir Call Girls 8617697112 Personal Whatsapp Number 💦✅.
 

Knetminer Backend Training, Nov 2018

  • 1. Behind the scenes of KnetMiner Marco Brandizi marco.brandizi@rothamsted.ac.uk Bioinformatics Group Training, 27/11/2018 Find these slides at: https://www.slideshare.net/mbrandizi
  • 2. Behind the scenes of KnetMiner
  • 3. Behind the scenes of KnetMiner
  • 4. The OXL format
    <concept>
      <id>1</id>
      <pid>Q75WV3</pid>
      <description/>
      <elementOf><idRef>UNIPROTKB-SwissProt</idRef></elementOf>
      <ofType><idRef>Protein</idRef></ofType>
      <evidences>
        <evidence><idRef>IMPD</idRef></evidence>
      </evidences>
      <conames>
        <concept_name>
          <name>Probable trehalose-phosphate phosphatase 1</name>
          <isPreferred>true</isPreferred>
        </concept_name>
        …
    <cc>
      <id>Protein</id>
      <fullname>Protein</fullname>
      <description>
        A protein is comprised of one or more Polypeptides
        and potentially other molecules.
      </description>
      <specialisationOf><idRef>MolCmplx</idRef></specialisationOf>
    </cc>
    <relation>
      <fromConcept>1</fromConcept>
      <toConcept>3</toConcept>
      <ofType><idRef>participates_in</idRef></ofType>
      <evidences>
        <evidence><idRef>ECO:0000316</idRef></evidence>
      </evidences>
      <relgds/>
    </relation>
    <concept>
      <id>3</id>
      <pid>GO:0009651</pid>
      <description>response to salt stress</description>
      <ofType><idRef>BioProc</idRef></ofType>
      <coaccessions>
        <concept_accession>
          <accession>GO:0009651</accession>
          <elementOf><idRef>GO</idRef></elementOf>
          <ambiguous>false</ambiguous>
        </concept_accession>
      </coaccessions>
    </concept>
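    To make the concept/relation structure of the OXL fragment more concrete, here is a minimal Python sketch that reads a trimmed version of it with the standard library. The `<ondexdata>` wrapper element and the `load_graph` helper are assumptions made for illustration; real OXL files use their own root element, XML namespaces and many more fields, so treat this as a reading aid rather than an OXL parser.

```python
import xml.etree.ElementTree as ET

# Trimmed OXL-like fragment. The <ondexdata> root is a hypothetical wrapper
# added only so that the snippet is well-formed XML.
OXL = """
<ondexdata>
  <concept>
    <id>1</id>
    <pid>Q75WV3</pid>
    <ofType><idRef>Protein</idRef></ofType>
  </concept>
  <concept>
    <id>3</id>
    <pid>GO:0009651</pid>
    <description>response to salt stress</description>
    <ofType><idRef>BioProc</idRef></ofType>
  </concept>
  <relation>
    <fromConcept>1</fromConcept>
    <toConcept>3</toConcept>
    <ofType><idRef>participates_in</idRef></ofType>
  </relation>
</ondexdata>
"""


def load_graph(oxl_text):
    """Return ({concept id: (pid, type)}, [(from id, relation type, to id)])."""
    root = ET.fromstring(oxl_text)
    concepts = {
        c.findtext("id"): (c.findtext("pid"), c.findtext("ofType/idRef"))
        for c in root.iter("concept")
    }
    relations = [
        (r.findtext("fromConcept"), r.findtext("ofType/idRef"), r.findtext("toConcept"))
        for r in root.iter("relation")
    ]
    return concepts, relations


concepts, relations = load_graph(OXL)
for src, rel_type, dst in relations:
    # Relations refer to concepts by their numeric id, so resolve both ends
    print(concepts[src], rel_type, concepts[dst])
```

    The point of the sketch is that relations only carry concept ids, so every consumer has to join them back to the concept records, which is one reason graph query languages are attractive here.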
  • 6. And the Command Line Version
  • 7. But it Needs some Pre-Processing Too
  • 9. Why Changing?
    • Graph databases have emerged
      • Having expressive query languages (e.g., SPARQL, Cypher)
      • Having low memory footprint (and possibly scalability over clusters/clouds)
      • More stable APIs and implementations
    • Data Standards, Machine-Readable Data, FAIR Principles, etc.
      • Useful in Input: standardised data, less custom ELT to do, useful tools and techniques (e.g., SPARQL CONSTRUCT, scripting with JSON)
      • Useful in Output: applications based on APIs/micro-services, query languages, machine readable & standardised data.
      • New apps can be either ours or 3rd parties
    • Ondex issues
      • Getting old (and older with Java >8)
      • All data must be in memory
      • Not exactly high quality code
  • 11. The Cypher Query/DML Language
    Proteins->Reactions->Pathways:
      // chain of paths, node selection via property (exploits indices)
      MATCH (prot:Protein) - [csby:consumed_by] -> (:Reaction)
        - [:part_of] -> (pway:Path{ title: 'apoptosis' })
      // further conditions, not always so performant
      WHERE prot.name =~ '(?i)^DNA.+'
      // Usual projection and post-selection operators
      RETURN prot.name, pway
      // Relations can have properties
      ORDER BY csby.pvalue
      LIMIT 1000
    Proteins->Reactions->Pathways:
      // Single-path (or same-direction branching) easy to write
      MATCH (prot:Protein) - [:produced_by|consumed_by] -> (:Reaction)
        - [:part_of*1..3] -> (pway:Path)
      RETURN ID(prot), ID(pway) LIMIT 1000
      // Very compact forms available, depending on the data
      MATCH (prot:Protein) -- (pway:Path) RETURN pway
  • 12. Cypher as Semantic Motif Language
  • 13. Cypher as Semantic Motif Language
  • 14. Exercise 1: Try Cypher
    • Go to http://babvs48.rothamsted.ac.uk:7476/browser
    • Use neo4j/test as credentials
    • Try the query:
      MATCH (prot:Protein) - [prot2react:cs_by|pd_by] - (react:Reaction)
        - [react2path:part_of] -> (pway:Path)
      WHERE pway.prefName CONTAINS 'acyl carrier protein metabolism'
      RETURN * LIMIT 10
    • And explore the graphical result
    • What do you think you’ve found?
    • What do you have in () and in []?
    • What’s the meaning of the ‘|’ operator?
      • cs_by and pd_by are shortcuts for ‘consumed by’ and ‘produced by’
    • What’s the difference between -[]-> and -[]- ?
    • More help about Cypher at: https://neo4j.com/developer/cypher-query-language
  • 15. Exercise 1: Solution
    • You should see something like the figure
      • Which shows the ACP pathway at the centre, a member reaction and proteins consumed/produced by the latter
    • (name:Label) matches nodes (a label is a synonym of type), [name:Type] matches relations
    • [r:R1|R2] matches relations of either type R1 or R2
    • (src:Label1)-[r:R]->(dst:Label2) matches relations of type R going from nodes of type Label1 to nodes of type Label2
    • (n1)-[:R]-(n2) matches both directions, so both (n1)-[:R]->(n2) and (n2)-[:R]->(n1)
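    The difference between directed and undirected patterns, and the meaning of `|`, can be sketched in plain Python over a toy edge list. This is only an illustration of the matching semantics, not how Neo4j evaluates queries; the node names (`protA`, `react1`, ...) and the `neighbours` helper are made up for the example.

```python
# Toy edges as (source, relation type, target); names are hypothetical
EDGES = [
    ("protA", "cs_by", "react1"),    # protA is consumed by react1
    ("protB", "pd_by", "react1"),    # protB is produced by react1
    ("react1", "part_of", "pathway1"),
]


def neighbours(edges, node, rel_types, directed=True):
    """Nodes reachable from `node` via any of `rel_types`.

    directed=True mimics (node)-[:R1|R2]->(x): only outgoing edges count.
    directed=False mimics (node)-[:R1|R2]-(x): both directions count.
    """
    found = set()
    for src, rel, dst in edges:
        if rel not in rel_types:
            continue
        if src == node:
            found.add(dst)
        elif not directed and dst == node:
            found.add(src)
    return found


# (react1)-[:cs_by|pd_by]->(x): nothing, the edges point towards react1
print(neighbours(EDGES, "react1", {"cs_by", "pd_by"}))
# (react1)-[:cs_by|pd_by]-(x): both proteins, direction is ignored
print(sorted(neighbours(EDGES, "react1", {"cs_by", "pd_by"}, directed=False)))
# (react1)-[:part_of]->(x): the pathway
print(neighbours(EDGES, "react1", {"part_of"}))
```

    This is why the exercise query uses an undirected pattern between proteins and reactions: it does not need to assume which way the cs_by and pd_by relations were stored.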
  • 16. Exercise 2: Write Your Own Cypher
    • Using the same browser, find:
      • genes,
      • which encode proteins,
      • which are mentioned in articles that contain ‘ZmPEAMT1’ in the title
    • Hints
      • Use the node labels: Gene, Protein, Publication
      • Use the relation types: enc (meaning ‘encodes’), pub_in (meaning ‘published in’, or ‘mentioned in’)
      • Use the attribute AbstractHeader (meaning ‘publication title’)
      • Use the filter operator CONTAINS, as in the previous exercise
    • More info about the KnetMiner node/relation types on the left column in the Neo4j browser, and on the following slides
  • 17. Exercise 2: Solution • MATCH (gene:Gene)-[enc:enc]->(prot:Protein)-[xref:pub_in]->(article:Publication) WHERE article.AbstractHeader CONTAINS 'ZmPEAMT1' RETURN * LIMIT 10 • Your solution might be a variant of this
  • 18. But how to Encode Data? The Semantic Web Way
  • 19. But how to Encode Data? The Semantic Web Way
    @prefix bkr: <http://www.ondex.org/bioknet/resources/> .
    @prefix bk: <http://www.ondex.org/bioknet/terms/> .
    @prefix bka: <http://www.ondex.org/bioknet/terms/attributes/> .

    bkr:TOB1 a bk:Protein ;
      bk:participates_in <http://www.wikipathways.org/id1> ;
      bk:prefName "TOB1" ;
      bk:published_in bkr:23236473 .
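    As a reading aid for the Turtle above, the following Python sketch shows what the prefix and `a` shorthands stand for: each prefixed name expands to a full URI, and each `;`-separated predicate/object pair becomes one triple about the same subject. The `expand` helper is hypothetical and handles only prefixed names, not literals or full `<...>` URIs, so it is not a Turtle parser.

```python
# Prefix table copied from the slide, plus the standard rdf: namespace
PREFIXES = {
    "bkr": "http://www.ondex.org/bioknet/resources/",
    "bk": "http://www.ondex.org/bioknet/terms/",
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
}


def expand(term):
    """Expand a prefixed name such as 'bk:Protein' into a full URI."""
    if term == "a":  # Turtle shorthand for rdf:type
        term = "rdf:type"
    prefix, local = term.split(":", 1)
    return PREFIXES[prefix] + local


SUBJECT = "bkr:TOB1"
# Two of the predicate/object pairs that follow the subject on the slide
PREDICATE_OBJECTS = [
    ("a", "bk:Protein"),
    ("bk:published_in", "bkr:23236473"),
]

triples = [(expand(SUBJECT), expand(p), expand(o)) for p, o in PREDICATE_OBJECTS]
for triple in triples:
    print(triple)
```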
  • 20. But how to Encode Data? The Semantic Web Way
  • 21. But how to Encode Data? The Semantic Web Way
  • 22. Querying KnetMiner with SPARQL
    select distinct ?prot ?comp
    where {
      ?prot a kb:Protein;
        rdfs:label ?protLabel.
      filter ( contains ( ?protLabel, 'TOB1' ) ).
      ?enz kb:activated_by ?prot.
      ?enz kb:activated_by ?comp.
      ?comp rdfs:label ?compLabel.
    }
    LIMIT 1000
  • 23. Querying KnetMiner with SPARQL
    select distinct ?prot ?pway
    where {
      { # Branch 1
        ?prot kb:pd_by|kb:cs_by ?react.
        ?prot a kb:Protein.
        ?react a kb:Reaction.
        ?react kb:part_of ?pway.
        ?pway a kb:Path.
      }
      union
      { # Branch 2
        ?prot ^kb:ac_by|kb:is_a ?enz.
        ?prot a kb:Protein.
        ?enz a kb:Enzyme.
        { # Branch 2.1
          ?enz kb:ac_by|kb:in_by ?comp.
          ?comp a kb:Compound.
          ?comp kb:cs_by|kb:pd_by ?trns.
          ?trns a kb:Transport
        }
        union
        { # Branch 2.2
          ?enz ^kb:ca_by ?trns.
          ?comp a kb:Compound.
          ?trns a kb:Transport
        }
        ?trns kb:part_of ?pway.
        ?pway a kb:Path.
      }
    }
    LIMIT 1000
  • 25. And more
    • Data xchg format
      • Neo4J, Cypher DBs, Graph DBs:
        - No official one, just Cypher, Support for GraphML, RDF
        +/- Focus on backing applications
      • Semantic Web/Triple Stores:
        + Focus on data sharing standards
    • Data model
      • Neo4J, Cypher DBs, Graph DBs:
        + Relations with properties
        - Metadata/schemas/ontologies management
      • Semantic Web/Triple Stores:
        - Relations cannot have properties (reification required)
        + Metadata/schemas/ontologies as first citizen and standardised OWL
    • Performance
      • Neo4J, Cypher DBs, Graph DBs:
        + Complex graph traversals
      • Semantic Web/Triple Stores:
        + Comparable in most cases
    • Query Language
      • Neo4J, Cypher DBs, Graph DBs:
        + Cypher is easier (eg, compact, implicit elems)?
        - Expressivity issues (unions)
        - No standard QL (but efforts in progress, eg, OpenCypher)
      • Semantic Web/Triple Stores:
        - SPARQL is harder? (URIs, namespaces, verbosity)
        + SPARQL more expressive
    • Standardisation, openness
      • Neo4J, Cypher DBs, Graph DBs:
        +/- (TinkerPop is open, Neo4J isn’t)
        + Commercial support
        + More alive and up-to-date (e.g., support for Hadoop, nice Neo4j browser, easy installation)
      • Semantic Web/Triple Stores:
        + Natively open, many open implementations
        - Instability and many short-lived prototypes
        - Advancement seems to be slowing down
        + Some nice open and commercial browser (LODEStar,
    • Scalability, big data
      • Neo4J, Cypher DBs, Graph DBs:
        +/- Commercial support to clustering/clouds for Neo4J
        + Open support in TinkerPop
      • Semantic Web/Triple Stores:
        + Load Balancing/Cluster solutions, Commercial Cloud support (eg GraphDB)
        + SPARQL over TinkerPop (via SAIL interface)
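    The "reification required" point in the data model comparison can be illustrated with a small Python sketch: a property-graph edge carries its properties directly, while in RDF the same edge becomes a resource of its own with several triples. The bk:Relation, bk:relTypeRef, bk:relFrom and bk:relTo terms are the ones used in this deck's Exercise 3 solution; the `reify` helper and the relation identifier are hypothetical.

```python
def reify(rel_id, src, rel_type, dst, props):
    """Turn one property-graph edge into a list of RDF-style triples."""
    triples = [
        (rel_id, "rdf:type", "bk:Relation"),
        (rel_id, "bk:relTypeRef", rel_type),
        (rel_id, "bk:relFrom", src),
        (rel_id, "bk:relTo", dst),
    ]
    # Each edge property becomes a statement about the relation resource
    triples += [(rel_id, name, value) for name, value in sorted(props.items())]
    return triples


# In a property graph this is a single edge with two properties
edge_triples = reify(
    "bkr:rel_TOB1_GO_0003714",  # hypothetical identifier for the relation
    "bkr:TOB1",
    "bk:is_annotated_by",
    "obo:GO_0003714",
    {"bka:Score": 0.05, "bka:EVIDENCE": "text mining tool"},
)
for triple in edge_triples:
    print(triple)
```

    One edge with two properties turns into six triples, which is the verbosity cost the comparison refers to.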
  • 26. So, the New Architecture
  • 27. Why Should I Bother? • As data consumer • Querying data via Cypher (or SPARQL) • In particular, define new semantic motifs to find gene-related entities • Knowing our BioKNO ontology/schema (TODO) • In future, querying data via API/Cypher, getting back JSON/BioKNO • As data producer (for KnetMiner) • Scripting with RDF/SPARQL/etc to integrate data sources (and produce KnetMiner data sets) • Querying multiple SPARQL endpoints to produce data sets and/or integrate our KnetMiner data with other RDF/SPARQL sources
  • 28. Exercise 3: Playing with RDF
    • Study BioKNO examples at https://github.com/Rothamsted/bioknet-onto
    • What is the meaning of ‘a’? What are the classes (ie, types) used in example 1?
    • Which property types (ie, relations) link proteins, pathways and protein accessions?
    • According to example 2, is a ‘CCR4-NOT core complex’ a part of ‘intracellular part’?
    • In example 3, why do we need more than: “bkr:TOB1 bk:published_in bkr:20068231” to represent all details about the publication mentioning TOB1?
    • How would you relate TOB1 to the GO term ‘transcription corepressor activity’ (accession 0003714)?
      • Hint: use bk:is_annotated_by
    • How would you state that the link was created by the ‘text mining tool’ and has a confidence score of 0.05?
      • Hint: use the attributes bka:EVIDENCE and bka:Score
    • Possibly use further documentation:
      • A quick tutorial about RDF and Turtle syntax: https://ai.ia.agh.edu.pl/wiki/_media/pl:dydaktyka:semweb:quick-tutorial-rdf-turtle.pdf
      • BioKNO Ontology Reference:
        • http://www.marcobrandizi.info/files/bkn-owldoc/bioknet/index.html (core)
        • http://www.marcobrandizi.info/files/bkn-owldoc/bk_ondex/index.html (entities used in KnetMiner/Ondex)
  • 29. Exercise 3: Solution
    • ‘a’ is a shortcut for the URI rdf:type, which is the standard property to state that an entity is an instance of a class
      • So, you can find the classes used in the example by looking at the target of the ‘a’ predicate: bk:Path, bk:Protein, bk:Accession
    • Is a ‘CCR4-NOT core complex’ a part of ‘intracellular part’?
      • The question aims at highlighting a feature of graph data, that is: automatic reasoning
      • ‘CCR4-NOT core complex’ is only explicitly stated as being part of ‘CCR4-NOT complex’ (follow the bk:part_of relation and the URIs it refers to)
      • So, using only the declared data in the example, a computer cannot ‘know’ that CCR4-NOT core complex is also part of ‘intracellular part’
      • However, graph systems are able to work with rules like: ?x bk:part_of ?y, ?y bk:is_a ?z => ?x bk:part_of ?z
      • This rule can be applied to ?x := obo:GO_0030014, ?y := obo:GO_0030015, ?z := obo:GO_0044424
      • and logically infer that obo:GO_0030014 bk:part_of obo:GO_0044424
      • These additional statements can be used in queries, eg: searching for all things that are part of intracellular part would return CCR4-NOT core complex in the results, even if this is not explicitly declared in the original data
      • The rationale for this conclusion is that anything that is part of a CCR4-NOT complex is also part of an intracellular part, because every CCR4-NOT complex is also an intracellular part (as per is_a)
    • In example 3, why do we need more than: “bkr:TOB1 bk:published_in bkr:20068231” to represent all details about the publication mentioning TOB1?
      • Because you need to provide a context for the usually binary relation, ie, you need to tell what its confidence score is and the evidence to justify the statement
      • Compare this with the Neo4j equivalent
    • How would you relate TOB1 to the GO term ‘transcription corepressor activity’ (accession 0003714)?
      • bkr:TOB1 bk:is_annotated_by obo:GO_0003714.
    • How would you state that the link was created by the ‘text mining tool’ and has a confidence score of 0.05?
      • You need to add:
        bkr:citation_TOB1_15489334 a bk:Relation ;
          bk:relTypeRef bk:is_annotated_by ;
          bk:relFrom bkr:TOB1 ;
          bk:relTo obo:GO_0003714 ;
          bka:Score 0.05 ;
          bka:EVIDENCE "text mining tool" .
      • bka:EVIDENCE is an attribute, and it’s an alternative simplified form to represent evidence in KnetMiner (just a string, rather than a resource having multiple attributes).
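    The inference rule from the solution (?x part_of ?y, ?y is_a ?z => ?x part_of ?z) can be sketched as a small fixed-point loop in Python. This is a toy illustration of rule-based reasoning over triples, not how a triple store or OWL reasoner is implemented; the `infer_part_of` helper is hypothetical and the entities are written with their names from the slide rather than their GO URIs.

```python
def infer_part_of(triples):
    """Apply: x part_of y, y is_a z  =>  x part_of z, until nothing new appears."""
    inferred = set(triples)
    changed = True
    while changed:
        changed = False
        part_of = [(s, o) for s, p, o in inferred if p == "part_of"]
        is_a = [(s, o) for s, p, o in inferred if p == "is_a"]
        for x, y in part_of:
            for y2, z in is_a:
                new = (x, "part_of", z)
                if y == y2 and new not in inferred:
                    inferred.add(new)
                    changed = True
    return inferred


# Only these two statements are declared explicitly
DECLARED = {
    ("CCR4-NOT core complex", "part_of", "CCR4-NOT complex"),
    ("CCR4-NOT complex", "is_a", "intracellular part"),
}

closure = infer_part_of(DECLARED)
# Everything that is (explicitly or implicitly) part of 'intracellular part'
print(sorted(s for s, p, o in closure if p == "part_of" and o == "intracellular part"))
```

    A query for parts of ‘intracellular part’ over the declared data alone finds nothing, while the same query over the inferred set finds the core complex.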
  • 30. Exercise 4: Data Integration based on RDF
    • Study the example at https://github.com/Rothamsted/bioknet-onto/tree/master/examples/bmp_reg_human, which builds a KnetMiner network in RDF format (and following the BioKNO ontology)
      • using two tools: the SPARQL CONSTRUCT construct (https://www.futurelearn.com/courses/linked-data/0/steps/16104) to perform RDF-to-RDF transformations
      • and SPARQL CONSTRUCT coupled with the TARQL tool (http://tarql.github.io/) to transform CSV/table data into RDF
    • Look at the transformation https://github.com/Rothamsted/bioknet-onto/blob/master/examples/bmp_reg_human/cvt_bpax.sparql, which transforms the BioPAX RDF data into our BioKNO
      • What is happening? Look at it before the next question
      • Sketch a schema of the BioPAX graph that is matched by the WHERE clause and the one built by the CONSTRUCT block. Is the new graph smaller or bigger?
      • How would you add the fact that bp:BioChemicalReaction instances participate in pathways (bk:Pathway)?
    • You can play with the data generated in this example at http://marcobrandizi.info:8890/sparql
      • See example queries at: https://github.com/Rothamsted/bioknet-onto/tree/master/examples/bmp_reg_human/queries
  • 31. Exercise 4: Solution
    • The CONSTRUCT statement (which is part of the SPARQL query language) takes chains of protein/reaction/pathway expressed in the BioPAX format (note the use of the bp: namespace) and builds chains of protein/pathway in BioKNO format.
      • So, it maps one format to another (an alternative would be to do so in data queries, see queries/pw_commons_fed.sparql)
      • and generates a simplified representation (many KnetMiner data sets do so; the data explorations we aim at serving don’t need certain details)
    • How would you add the fact that bp:BioChemicalReaction instances participate in pathways (bk:Pathway)?
      • In the CONSTRUCT block you’d have: ?comp bk:participates_in ?path.
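    The idea behind a CONSTRUCT-based conversion (match a pattern in the source graph, emit new triples in the target vocabulary) can be sketched in Python as follows. This is not the logic of cvt_bpax.sparql: the input predicates bp:left and bp:pathwayComponent, the node names and the `construct_bioknet` helper are assumptions chosen for the example, and the real script may match different BioPAX properties.

```python
# Toy BioPAX-like input: a reaction has a protein on its left side,
# and a pathway has that reaction as a component (names are hypothetical)
BIOPAX_LIKE = [
    ("react1", "bp:left", "protA"),
    ("react2", "bp:left", "protB"),
    ("path1", "bp:pathwayComponent", "react1"),
]


def construct_bioknet(triples):
    """Mimic WHERE { ?path bp:pathwayComponent ?react. ?react bp:left ?prot }
    CONSTRUCT { ?prot bk:participates_in ?path }."""
    out = set()
    for react, p1, prot in triples:
        if p1 != "bp:left":
            continue
        for path, p2, react2 in triples:
            if p2 == "bp:pathwayComponent" and react2 == react:
                # The intermediate reaction node is dropped in the output
                out.add((prot, "bk:participates_in", path))
    return out


result = construct_bioknet(BIOPAX_LIKE)
print(sorted(result))
```

    The output graph is smaller than the input: only protein/pathway chains that fully match the pattern survive (react2 has no pathway, so protB is not emitted), and the reaction nodes disappear unless the CONSTRUCT block explicitly adds them back.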
  • 32. Thanks! • Even more material: • On graph databases, standards, KnetMiner new backend: • https://www.slideshare.net/mbrandizi/behind-the-scenes-of-knetminer-towards-standardised-and-interoperable-knowledge-graphs • https://doi.org/10.1515/jib-2018-0023 • On Semantic Web, Linked Data, RDF, SPARQL, etc: • https://prezi.com/hbxhz0kesfnn/sod-2014-presentations-summary • https://goo.gl/bfF1hu • https://www.nature.com/articles/nbt1139 • https://www.researchgate.net/publication/221024668_Ontologies_Come_of_Age • http://mowl-power.cs.man.ac.uk/protegeowltutorial/resources/ProtegeOWLTutorialP4_v1_3.pdf