"On chemical structures, substances, nanomaterials and measurements"
Nina Jeliazkova, Ideaconsult
This talk highlights how, coming from the realm of data modelling and data analysis, I came to recognize the fundamental role of measurements. Besides retaining data provenance, a focus on measurements offers insight into how we can go beyond chemical structures and address the challenges of representing the identity of chemical substances and nanomaterials (with examples from the latest developments of the AMBIT web services and the OpenTox API). Finally, supporting the vision of a distributed, open, web-like approach to recording subtle experimental details is essential, not only for the chemists and biologists in the labs, but for all of us using, modelling, storing and querying the data.
Presented July 14, 2014 in Cambridge, UK
Defining the Future for Open Notebook Science – A Memorial Symposium Celebrating the Work of Jean-Claude Bradley
http://inmemoriamjcb.wikispaces.com/Jean-Claude+Bradley+Memorial+Symposium
2. Sharing experience about:
OpenTox API and beyond
Chemical structures
Substance identity
Experimental data
challenges
Protocols
Nanomaterials
Final thoughts
I D E A C O N S U L T L T D . 2
CONTENT
• EC FP7 2008-2011 OpenTox
• Distributed framework for predictive toxicology
• Building blocks: data, chemical structures, algorithms and models
• Build models, apply models, validate models, access and query data in various ways
• Tech: REST API, RDF
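A minimal Python sketch of how "build models, apply models, access data" can map onto REST resources. It only describes the HTTP requests and never sends them. The host `https://example.org`, the resource identifiers and the helper names are hypothetical; the `dataset_uri` and `prediction_feature` form parameters are assumed to follow the OpenTox API convention and should be checked against the published specification.

```python
from dataclasses import dataclass
from typing import Tuple
from urllib.parse import urlencode

BASE = "https://example.org"  # hypothetical service root, not a real deployment
FORM = (("Content-Type", "application/x-www-form-urlencoded"),)


@dataclass(frozen=True)
class Request:
    """Plain description of an HTTP request; this sketch never sends it."""
    method: str
    url: str
    body: str = ""
    headers: Tuple[Tuple[str, str], ...] = ()


def build_model(algorithm_id: str, dataset_uri: str, target_uri: str) -> Request:
    """Build a model: POST a training dataset URI to an algorithm resource."""
    form = urlencode({"dataset_uri": dataset_uri,
                      "prediction_feature": target_uri})  # assumed parameter names
    return Request("POST", f"{BASE}/algorithm/{algorithm_id}", form, FORM)


def apply_model(model_id: str, dataset_uri: str) -> Request:
    """Apply a model: POST the dataset to be predicted to a model resource."""
    form = urlencode({"dataset_uri": dataset_uri})
    return Request("POST", f"{BASE}/model/{model_id}", form, FORM)


def get_dataset_as_rdf(dataset_id: str) -> Request:
    """Access data: GET a dataset, asking for an RDF representation."""
    return Request("GET", f"{BASE}/dataset/{dataset_id}", "",
                   (("Accept", "application/rdf+xml"),))


# Illustrative identifiers only.
training = build_model("example-learner", f"{BASE}/dataset/1", f"{BASE}/feature/2")
prediction = apply_model("42", f"{BASE}/dataset/3")
dataset = get_dataset_as_rdf("3")
```

Because every building block is addressed by a URI, the output of one step (a model URI, a dataset URI) can be passed as the input of the next, possibly to a service hosted elsewhere.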
4. PREDICTIONS
31 May 2013: the REACH deadline for registering substances [100 to 1000 tonnes per year]
http://ToxPredict.net access statistics
5. • AMBIT REST web services
OpenTox Application Programming Interface (API)
Dataset web services
Chemical search, data pooling, structure QA
Computational web services
Descriptor calculation, machine learning, structure optimisation, tautomers
Web applications using AMBIT REST web services
New 2014: Embeddable JS widgets
AMBIT http://ambit.sf.net
6. DATA CURATION EXAMPLE (DIISONONYL PHTHALATE)
7. DATA CURATION EXAMPLE (RN 25155-25-3)
8. DATA CURATION EXAMPLE (RN 25155-25-3)
European Chemicals Agency registration dossier
9. SUBSTANCE IDENTITY IN REACH
• Guidance for identification and naming of substances under REACH and CLP (118 pages)
• Substance characterization
“During the first 5 months of 2009, around 450 enquiries were received by ECHA, 23% of which were rejected on the grounds that the dossiers were incomplete (e.g. missing spectral data) or the substance identity had not been sufficiently described.”
http://echa.europa.eu/documents/10162/13643/substance_id_en.pdf
10. “Only a limited number of tools are capable to provide easily accessible data on substance identity, composition together with chemical structures and high quality and detailed endpoint data”
SUBSTANCE IDENTITY/COMPOSITION
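One way to picture why a substance is more than a single chemical structure is a small data-model sketch in Python. It assumes the REACH notion of a composition made of constituents, impurities and additives; the class names, fields and the example values are hypothetical and invented for illustration, not taken from AMBIT or from any dossier.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Component:
    role: str                      # "constituent", "impurity" or "additive"
    name: str
    smiles: Optional[str] = None   # None when no single structure can be drawn
    typical_percent: Optional[float] = None


@dataclass
class Substance:
    name: str
    composition: List[Component] = field(default_factory=list)

    def structures(self) -> List[str]:
        """All chemical structures linked to this substance, in listed order."""
        return [c.smiles for c in self.composition if c.smiles]

    def by_role(self, role: str) -> List[Component]:
        """Components playing the given role in the composition."""
        return [c for c in self.composition if c.role == role]


# Invented example: one main constituent, one impurity, one unresolved residue.
example = Substance("example technical-grade ethanol", [
    Component("constituent", "ethanol", "CCO", 95.0),
    Component("impurity", "water", "O", 4.5),
    Component("impurity", "unidentified residue", None, 0.5),
])
```

A structure search then has to go through `structures()`, while endpoint data stays attached to the substance as a whole; the component without a SMILES string is kept rather than dropped, since it is still part of the identity.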
11. SUBSTANCE ENDPOINT DATA
OECD Harmonized templates
Well-defined XML schema for > 100 endpoints
Experimental protocols: OECD Guidelines
BioPortal ontologies coverage of OECD guidelines: None
12. PROTOCOLS, SOP, INVESTIGATIONS, STUDY, ASSAYS
SEP
COACH
Towards the replacement of in vivo repeated dose systemic toxicity testing
SEURAT-1 ~ 70 research groups from European Universities, Public Research Institutes and Companies (more than 30% SMEs)
http://www.seurat-1.eu/
http://toxbank.net/
FP7 Projects
13. GOALS
Prediction of repeated dose toxicity
Shared repository of know-how and experimental results from SEURAT-1 research activities and relevant public sources
Examples include:
Protocol describing a method for long-term maintenance of functional hepatocytes
Results from a repeated dose 14-day transcriptomics study using acetaminophen and iPS-derived hepatocytes
TECHNOLOGICAL SOLUTIONS
• REST Web services API
• Protocol service
• Investigation service
• RDF data model
• ISA-TAB & ontologies
• ISA-TAB converted to RDF
• Stored in a triple store
• Chemical search (AMBIT)
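The "ISA-TAB converted to RDF" step can be illustrated with a minimal Python sketch that turns tab-delimited rows into subject-predicate-object triples. This is not the ToxBank converter: the namespaces under `http://example.org/`, the helper names and the sample row are hypothetical, and only the column headers follow the ISA-TAB naming style.

```python
import csv
import io
import re

VOCAB = "http://example.org/vocab#"   # hypothetical vocabulary namespace
DATA = "http://example.org/sample/"   # hypothetical data namespace


def slug(text):
    """Turn a column header or sample name into a URI-safe fragment."""
    return re.sub(r"[^A-Za-z0-9]+", "_", text).strip("_")


def literal(value):
    """Quote a cell value as a plain N-Triples string literal."""
    return '"' + value.replace("\\", "\\\\").replace('"', '\\"') + '"'


def rows_to_triples(tsv_text, key_column="Sample Name"):
    """One subject per row; every other non-empty cell becomes one triple."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    triples = []
    for row in reader:
        subject = f"<{DATA}{slug(row[key_column])}>"
        for column, value in row.items():
            if column == key_column or not value:
                continue
            triples.append((subject, f"<{VOCAB}{slug(column)}>", literal(value)))
    return triples


def to_ntriples(triples):
    """Serialise triples one per line, ready to load into a triple store."""
    return "\n".join(f"{s} {p} {o} ." for s, p, o in triples)


# Invented sample row using ISA-TAB style headers.
ISA_LIKE = (
    "Sample Name\tProtocol REF\tFactor Value[dose]\n"
    "sample-1\thepatocyte maintenance\t10\n"
)
triples = rows_to_triples(ISA_LIKE)
```

A real conversion would map columns to ontology terms rather than to ad hoc predicates, which is where the "ISA-TAB & ontologies" item comes in.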
TOXBANK DATA WAREHOUSE
Challenges:
• Diverse data types
• Changing research protocols
• Data formatting: time consuming
• Data sharing - little incentive
14. FP7 ENANOMAPPER PROJECT
• Develop an ontology and database unifying information about nanomaterial safety (in humans and the environment)
• Cover the full lifecycle from manufacturing to environmental decay or accumulation
• Pan-European project, 7 partners
• Ontology growth through community and re-use
15. NANOINFORMATICS CHALLENGES
• nanoSMILES
• nanoInChI
• Nanomaterial identity: only through characterisation with a multitude of experimental methods
• Reproducibility of experiments; standards
• Description of experiments (protocols, experimental details)
• Models: structure-based cheminformatics doesn’t really work
• Common database? NO!
But Yes! for an integrated search across databases! (requirements analysis feedback)
Nanomaterial “unique” challenge of identification?
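The idea of an integrated search across separate databases, rather than one common database, can be sketched in a few lines of Python. The sources here are in-memory stand-ins with invented records; the function names and record fields are hypothetical, and a real setup would wrap each database's own query interface behind the same callable signature.

```python
from typing import Callable, Dict, List, Tuple

Record = Dict[str, str]
SearchFn = Callable[[str], List[Record]]


def integrated_search(query: str,
                      sources: Dict[str, SearchFn]) -> Tuple[List[Record], Dict[str, str]]:
    """Ask every source, tag each hit with its origin, tolerate failures."""
    hits: List[Record] = []
    errors: Dict[str, str] = {}
    for name, search in sources.items():
        try:
            results = search(query)
        except Exception as error:  # one unreachable database must not break the search
            errors[name] = str(error)
            continue
        hits.extend({**record, "source": name} for record in results)
    return hits, errors


def make_source(records: List[Record]) -> SearchFn:
    """In-memory stand-in for a remote database (illustration only)."""
    def search(query: str) -> List[Record]:
        needle = query.lower()
        return [r for r in records if needle in r["name"].lower()]
    return search


def offline(query: str) -> List[Record]:
    raise ConnectionError("database unreachable")


SOURCES = {
    "db_a": make_source([{"name": "Titanium dioxide NM", "id": "A-1"}]),
    "db_b": make_source([{"name": "titanium dioxide, rutile", "id": "B-7"},
                         {"name": "Silver NM", "id": "B-9"}]),
    "db_c": offline,
}
hits, errors = integrated_search("titanium dioxide", SOURCES)
```

Each database keeps its own storage and curation; only the query is shared, and every hit carries its source so that provenance is not lost when results are merged.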
16. NANOMATERIAL ENDPOINT DATA
• Same data model as for substances (ISA-TAB inspired)
• NM-specific measurement protocols
• Ontology support: under development in eNanoMapper WP2 (Janna Hastings, Egon Willighagen)
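A minimal Python sketch of a measurement-centred data model that works the same way for a substance or a nanomaterial. The class and field names are hypothetical and are not the AMBIT or eNanoMapper schema; the numeric values are invented purely to show the shape of the data.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Measurement:
    endpoint: str                 # what was measured, e.g. "particle size"
    protocol: str                 # how it was measured: method or guideline
    value: float
    unit: str
    conditions: Dict[str, str] = field(default_factory=dict)  # e.g. medium, pH


@dataclass
class Material:
    name: str
    measurements: List[Measurement] = field(default_factory=list)

    def characterisation(self) -> Dict[str, List[Measurement]]:
        """Group measurements by endpoint, keeping protocol and conditions."""
        grouped = defaultdict(list)
        for m in self.measurements:
            grouped[m.endpoint].append(m)
        return dict(grouped)


# Invented values for illustration only.
nm = Material("example TiO2 nanomaterial", [
    Measurement("particle size", "TEM", 21.0, "nm"),
    Measurement("particle size", "DLS", 85.0, "nm", {"medium": "water"}),
    Measurement("zeta potential", "ELS", -30.0, "mV", {"pH": "7"}),
])
profile = nm.characterisation()
```

The two "particle size" entries differ because the methods differ, which is why the protocol and conditions are stored with every value instead of a single number per endpoint.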
18. LESSONS LEARNED
Which is more difficult?
1. Succeeding in implementing a “moving target” API with a distributed team of developers.
2. Succeeding in bringing together several wet lab teams to use a common tool/format for preparing and sharing experimental data.
1. OpenTox: partners succeeded in creating 5 independent implementations of the OpenTox API through “rough consensus and running code”; most services are online and still in use 3 years after the OpenTox project’s completion; the API is being used and extended in related projects.
2. In ToxBank we’ve resorted to taking the role of “data managers” in the SEURAT-1 cluster, a setup typical of most EU data projects.
19. WHY ARE DATA FORMATTING AND SHARING SO DIFFICULT?
Thoughts about the technology aspects, not about the incentives to share
• Data format: the more flexible the format, the more difficult the data preparation;
• Tools typically need to understand both the data modelling and the experimental setup;
• Preparing and sharing data requires additional effort, which is typically not within the scope of research projects;
• The typical setup is “data managers” or “Excel templates”
Compare with the ease of sharing, liking and tagging pictures on social networks; liking and tagging essentially create semantic knowledge!
20. GUESS THE AUTHOR
“This proposal concerns the management of general information about experiments at ???. It discusses the problems of loss of information about complex evolving systems and derives a solution based on a ???”
21. TIM BERNERS-LEE, 1989
“This proposal concerns the management of general information about accelerators and experiments at CERN. It discusses the problems of loss of information about complex evolving systems and derives a solution based on a distributed hypertext system.”
http://www.w3.org/History/1989/proposal.html
Non-Centralisation
Information systems start small and grow. They also start isolated and then merge. A new system must allow existing systems to be linked together without requiring any central control or coordination.
22. FINAL THOUGHTS
• Help researchers organize their own data locally;
• The cost of entering/recording data should be low;
• Easy-to-use tools;
• Formats: understandable, or hidden behind user-friendly tools;
• Non-centralisation;
• Added value:
“The data-sharing environment must invite collaboration as well as facilitate it. Stakeholders have broad interests that go beyond retrieving existing data — they want to discover materials and forecast enhanced products”
http://www.nature.com/news/technology-sharing-data-in-materials-science-1.14224