This workshop aims to bring together practitioners of all levels and from a variety of research areas (agronomy, plant biology, food and life sciences, etc.) to compare best practices, points of view and projects about producing and consuming data in the agrifood field.
As with digital data in general, current trends in this arena include the integration of "traditional" semantics-based approaches (e.g., ontologies, RDF-based linked data) with lightweight schemas (e.g., Bioschemas/schema.org), the use of JSON-based APIs, and the development of data lakes and knowledge graphs based on NoSQL technologies, including property graph databases (e.g., Neo4j, TinkerPop/Gremlin).
Workshop participants will get an opportunity to discuss how those approaches and technologies are being used in the agrifood field, for the purpose of realising the FAIR data principles and making data sharing a powerful tool for research, industry and socio-economic investigation. In particular, we will propose an interactive session outlining how participant-proposed datasets can be encoded with Bioschemas or similar approaches.
AgriFood Data, Models, Standards, Tools, Use Cases
1. Open Data in Agrifood and Life Sciences: Models, Standards, Tools, Use Cases
Paris, 17/9/2019
Marco Brandizi <marco.brandizi@rothamsted.ac.uk>
Keywan Hassani-Pak <keywan.hassani-pak@rothamsted.ac.uk>
Find these slides on SlideShare
2. Why this workshop (ideally)
• Focus on sharing machine-readable data for agrifood and related areas
(eg, weather)
• See (examples of) where we currently are, current trends, etc.
• Share experiences, best practices, solutions etc
• (At least) outline some common efforts (eg, to have a common schema)
3. The new oil (*)
https://lod-cloud.net/
https://goo.gl/n4m5xL
https://www.economist.com/node/21521548
(*) or the old mess?
5. What do we want to get by data?
What are the genes involved in yellow rust and the proteins they encode?
In which pathways are they involved?
What publications and field trials exist as evidence?
6. How can we get it? => FAIR+
• Data need to be raw; PDF- or HTML-only web sites are not very good (Accessible, Reusable)
• Datasets need meta-data (Findable)
• Which should be FAIR too, in particular interoperable
• Common formats, schemas and ontologies (Interoperable)
• RDF in the linked data world
• OWL in the Semantic Web world (lightweight schemas ever more popular)
• JSON, APIs, JSON-Schema elsewhere
• Common identifiers (Interoperable)
• URIs in the linked data world (related to F, I too), accessions, code lists elsewhere
• Common query language(s) (several FAIR principles affected)
• SPARQL in the linked data world. A plethora of competing QLs elsewhere (eg, GraphQL, Cypher, SQL-like)
• Proper licences, preferably open (Reusable)
• Ideally, text translated to and published as FAIR+ data
• Data should be of good quality, which should be measurable and supported by evidence (Reusable and Useful)
• metrics, automated tests, frequent-enough updates (report publish dates, prod and version dependencies), completeness
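As a concrete illustration of the lightweight, schema.org-style approach mentioned above, a dataset description might be published as JSON-LD along these lines (a minimal sketch; all names and URLs here are hypothetical examples, not real resources):

```json
{
  "@context": "https://schema.org",
  "@type": "Dataset",
  "@id": "https://example.org/datasets/wheat-field-trial-2019",
  "name": "Wheat field trial 2019 (hypothetical example)",
  "description": "Yield and yellow-rust phenotype scores for a wheat field trial.",
  "license": "https://creativecommons.org/licenses/by/4.0/",
  "keywords": [ "wheat", "yellow rust", "field trial" ],
  "distribution": {
    "@type": "DataDownload",
    "encodingFormat": "text/csv",
    "contentUrl": "https://example.org/datasets/wheat-field-trial-2019.csv"
  }
}
```

Even this small record touches several FAIR aspects: name, description and keywords support findability, the shared schema.org vocabulary supports interoperability, and the explicit licence and raw CSV download support reusability.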
7. How to implement it?
In the beginning, there was the Semantic Web, then Linked Data, then…
8. How to implement it?
We’re still in Babel, still with the same issues
14. The KnetMiner use case
• Green: Ondex plug-ins
• rdf2neo is a generic, non-Ondex-specific RDF→Neo4j conversion tool
• Brandizi et al, IB-2018
(https://dx.doi.org/10.1515%2Fjib-2018-0023)
• Brandizi et al, SWAT4LS-2018
(https://doi.org/10.6084/m9.figshare.7314323.v1)
15. The KnetMiner use case
Cypher examples:
MATCH
  // branching via '|'
  (prot:Protein) - [:produced_by|consumed_by] -> (:Reaction)
  // variable-length chains
  - [:part_of*1..3] -> (pway:Path)
RETURN
  prot.name, pway LIMIT 1000
// Very compact forms are available:
MATCH (prot:Protein) -- (pway:Path) RETURN pway
• RDF + OWL used as a standardised modelling/representation language
(see BioKNO ontology: github.com/Rothamsted/bioknet-onto)
• SPARQL available too, both having pros/cons
(see our benchmark: github.com/Rothamsted/graphdb-benchmarks)
• Cypher being used for “Semantic Motif” queries, linking genes to entities of interest
(work in progress)
26. Conclusions?
• Actually, the questions above are offered to you: where are we going? Where should we go? Personally, I only have my two cents to offer
• We have many FAIRification efforts, mostly based on custom formats, APIs, downloads
• We’re still missing integration, interoperability, standards
• For the purpose of queries like the one shown above (show me genes linked to phenotypes, known knowledge, experimental evidence, etc.)
• It used to be the focus of Semantic Web and linked data
• Other approaches have become popular (JSON, APIs, NoSQL)
• Only recently have they started addressing the same old problems (e.g., GraphQL, JSON-Schema)
• Schematisation has become lightweight, even in LOD (e.g., schema.org or SHACL vs OWL or OBO ontologies)
• Though “true” ontologies are still important in life sciences, mostly for annotations
27. Acknowledgements
• Keywan Hassani-Pak, KnetMiner Team Leader
• Chris Rawlings, Head of Computational & Analytical Sciences
• Ajit Singh, Software Engineer
• Joseph Hearnshaw, Software Engineer
• Sandeep Amberkar, Bioinformatician, Data Curator
• Alice Minotto, Earlham Inst, hosting providers
• Monika Mistry, Master Student, Data Curator
• William Brown, IT admin
• Madhu Donepudi, Richard Holland, ext. contractors, developers
28. Interactive Session: Proposals
• Model your data of interest with the AgriSchemas approach
• And/or review what we have already drafted and discuss it
• Experiment with the KnetMiner SPARQL/Neo4j endpoints (which include an experimental import from GXA)
• A closer look at the KnetMiner ELT pipeline, from external sources to XML/OXL, RDF, Neo4j
29. Playing with the KnetMiner endpoints
• e.g., using queries at the endpoint, find proteins related to "oxygenic photosynthesis" and related publications
• Use https://github.com/Rothamsted/bioknet-onto for info about the data model
• Also, see the figure: https://github.com/Rothamsted/graphdb-benchmarks/blob/master/results/ara_knet_pattern.png
• Suggestion: use the relation bk:pub_in, relating bk:Protein to bk:Publication
• Provisional endpoint: http://marcobrandizi.info:9090/lodestar/sparql
• Solution: https://gist.github.com/marco-brandizi/d51f823a879630f46b5ba582f1450a3c
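A possible starting point for this exercise is a query along these lines (a sketch only: the bk: namespace URI and the bk:prefName property are assumptions, so check the BioKNO ontology and the linked figure for the actual names before running it):

```sparql
PREFIX bk: <http://knetminer.org/data/rdf/terms/biokno/>

SELECT ?protein ?name ?publication
WHERE {
  # Proteins with a preferred name, linked to publications via bk:pub_in
  ?protein a bk:Protein ;
           bk:prefName ?name ;
           bk:pub_in ?publication .
  # Loose string match on the name; the exercise looks for
  # "oxygenic photosynthesis"-related proteins
  FILTER ( CONTAINS ( LCASE ( STR ( ?name ) ), "photosynthesis" ) )
}
LIMIT 100
```

Compare your result with the solution gist above, which reflects the actual data model.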
30. Playing with the KnetMiner endpoints
• Using Cypher and the Neo4j endpoint, explore the components (part_of relation) of the path (Path node type) about the pathway titled (prefName) "chlorophyll a biosynthesis I"
• Provisional endpoint: http://marcobrandizi.info:7474 (ib2019/ib2019)
• Solution: https://gist.github.com/marco-brandizi/7b37d815e2dd539361e76d5817a5d99c
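A minimal Cypher sketch for this exercise might look as follows (the prefName property is assumed from the exercise text; relationship direction may need adjusting against the actual graph, so treat this as a starting point, not the solution):

```cypher
// Components linked via part_of to the pathway with the given preferred name
MATCH (component) - [:part_of] -> (pw:Path { prefName: "chlorophyll a biosynthesis I" })
RETURN component
```

Again, the solution gist above shows the query that matches the real data model.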
Editor's Notes
SW-based knowledge bases still much used by big players, but behind the scenes (eg, to build Google snippets)
Ground-based AI (machine learning, neural networks, etc) used to sort out the mess in modelling (preferred to symbolic/formal approaches, like OWL)
Commercial/general-purpose world is focused more on big data
Not so much on modelling and standardisation
Details for the downloader, or the second part
We’re deploying our own SPARQL endpoint, where wheat and arabidopsis datasets are merged
We can play with it via the LODEStar browser
Data from different sources are merged together in the RDF coming from URI resolution
The LODEStar browser can show that, but also resolve the URI (via content negotiation)