A set of ideas on the use of artificial intelligence for data curation, presented at the Pharma-IT conference (London, 2017) in the artificial intelligence track.
It begins with a broad discussion of the semantic web, knowledge representation, machine learning and artificial intelligence. It then focuses on how a "data curation" problem can be framed and hints at some possible examples.
1. AI for Data Curation
Yes, can we?
Andrea Splendiani, AD, Information Systems
London
September 28, 2017
NIBR Informatics
2. Business or Operating Unit/Franchise or Department
Agenda
1. Focus: metadata and reference data
2. Knowledge Engineering and AI
3. Data curation: a use case for AI?
4. Ideas and experiences
5. Conclusions
Public2
What we do, in context
Some considerations at 10000ft
Holistic view on a process (1000ft)
Details
Reflections at 10000ft
3. Focus: metadata and reference data
1. What:
– Annotation of datasets
– Standards
– Ontologies
– Reference information
2. Why:
– Support analysis
– Support search and query answering
– Support extraction
– Building knowledge networks / information discovery and inference
3. Where
– Typically in research
4. Knowledge Engineering and AI
Can Artificial Intelligence solve biology? (a stopper)
• 10 years ago: AI approaches to Systems Biology
• Ontology-based knowledge bases (Semantic Web)
• ANN/Fuzzy systems even older
5. Knowledge Engineering and AI
Can Artificial Intelligence solve biology? (taken seriously)
• Now: AI and ML are at the height of the hype
• Interest in Life Sciences industries
6. Knowledge Engineering and AI
• What helped the resurgence of ML?
– Massive data available
– Massive computational power available
– Few technical improvements
– Success stories (Deep learning)
• Do these also apply to Ontology/Sem-Web based systems?
– UniProt: 5.7B triples in 2009, 30+B triples in 2017
– EBI RDF Platform (2015)
– Wikidata (2014?)
Source: https://tools.wmflabs.org/wikidata-todo/stats.php
7. Knowledge Engineering and AI
• The way information is represented has implications for what is built on it (e.g.: analytics, data mining)
– Networks: are parallel executions in AND or in OR?
– Annotations: explicit mention of negative information
8. Knowledge Engineering and AI
• Metadata is important in a data-centric world (and in at least part of ML applications)
• Knowledge representation matters, beyond metadata (examples: AND/OR in pathways, NOT in annotations…)
• We are starting to have large, distributed knowledge bases
– Is there a role for AI systems based on logic/KR?
– Can we combine symbolic and sub-symbolic reasoning?
– Is this already happening?
9. Data curation
• Annotation
• Metadata
• Standards
• Model
• Literature
• Databases
• …
Source BioCuration 2017 Abstracts via wordscloud.com
10. An example: public data curation
https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM701607
https://www.ebi.ac.uk/rdf/services/describe?uri=http%3A%2F%2Frdf.ebi.ac.uk%2Fresource%2Fbiosamples%2Fsample%2FSAMEA1189935
11. An example: public data curation (data view)
https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM701607
Property | Value | Ontology | Bio-Characteristic?
Sample_source_name | WT6 biological rep 1, Affy processing batch 2 | EFO_0000001 |
Organism | Mus musculus | EFO_0000001, NCBITaxon_10090 |
strain | 129S6/Sv/Ev | EFO_0000001 | Bio
genotype | wild type | EFO_0000001, EFO_0005168 | Bio
Sex | male | EFO_0000001, EFO_0001266, PATO_0000384 |
age | 6 weeks old | EFO_0000001 | Bio
https://www.ebi.ac.uk/rdf/services/describe?uri=http%3A%2F%2Frdf.ebi.ac.uk%2Fresource%2Fbiosamples%2Fsample%2FSAMEA1189935
13. An example: public data curation
https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM701607
https://www.ebi.ac.uk/rdf/services/describe?uri=http%3A%2F%2Frdf.ebi.ac.uk%2Fresource%2Fbiosamples%2Fsample%2FSAMEA1189935
Supports:
• Aggregation
• Analysis
• Search
• Link discovery
• “Machine learning”
14. Can we use AI for Data Curation?
Why?
– Data curation is an intellectually demanding and time-consuming activity
– Given the increasing role and amount of data, curation risks becoming a bottleneck
Example of exponential growth in data
15. AI for data curation: characteristics and constraints
• Can we automate data curation?
• Difficult:
– Missing data
– Discretion (e.g.: level of granularity)
• Looks reasonable:
– Repetition
– Consistency
– Data/distance evaluations (clustering/attractors)
• We need to combine human aspects and machine-processable aspects
16. AI for data curation
Framing the problem: what
• Should this value be normalized?
• Meaning. E.g.: is “age” the same as “years”?
• Confidence: is this information true?
• The need. E.g.: is this required information? When?
• Is this a valid identifier?
Example, extract from NCBI GEO GSM701607
17. AI for data curation
Framing the problem: how
We consider curation activities as functions in a “curation space” that is exemplified via a “curation record”
Validation state (Confidence): Valid | Valid | Valid
Curation goal (The need): Required | Required | Required | Required | Required
Semantic type¹ (Meaning): Identifier about Sample | ID² about Organism | Name about Organism | Name about Gender | Identifier about Gender | Description about Age | Age Unit about Age
Field Name (the “location” in the source): ID | taxID | Organism | Gender | age
Value: GSM701607 | 10090 | Mus Musculus | 6 weeks old
¹ All semantic types are expressed via an ontology (here presented as a simplified definition)
² Identifiers also require a domain specification
Example, extract from NCBI GEO GSM701607 (only a subset of fields from the previous slide are considered)
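A curation record of this kind could be sketched as a small data structure. The following is a minimal illustration only: the class names, the `missing` helper and the pairing of field names with semantic types are assumptions made for this example. Curation goal and validation state are left unset because the slide text does not say which columns they belong to.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class SemanticType:
    """Simplified semantic type: a kind of statement about an entity."""
    kind: str    # e.g. "Identifier", "Name", "Description"
    about: str   # e.g. "Sample", "Organism", "Age"


@dataclass
class CurationField:
    """One column of a curation record."""
    semantic_type: SemanticType           # the meaning
    field_name: Optional[str] = None      # the "location" in the source, if any
    value: Optional[str] = None           # the raw or curated value
    goal: Optional[str] = None            # the need, e.g. "Required"
    validation: Optional[str] = None      # the confidence, e.g. "Valid"


# Hypothetical encoding of the GSM701607 extract (column pairing assumed).
record: List[CurationField] = [
    CurationField(SemanticType("Identifier", "Sample"), "ID", "GSM701607"),
    CurationField(SemanticType("Identifier", "Organism"), "taxID", "10090"),
    CurationField(SemanticType("Name", "Organism"), "Organism", "Mus Musculus"),
    CurationField(SemanticType("Name", "Gender"), "Gender"),
    CurationField(SemanticType("Identifier", "Gender")),
    CurationField(SemanticType("Description", "Age"), "age", "6 weeks old"),
    CurationField(SemanticType("Age Unit", "Age")),
]


def missing(fields: List[CurationField]) -> List[CurationField]:
    """Columns that still have no value: candidates for curation."""
    return [f for f in fields if f.value is None]
```

Calling `missing(record)` returns the columns that a curation step would still need to fill.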
18. AI and data curation
Using a record to modularize curation processes
• Different classes of operations
– Schema mapping (assign a type)
– Standard setting (assign a goal)
– Validation (setting a validation value)
Validation state: Valid | Valid | Valid
Curation goal: Required | Required | Required | Required
Semantic type: Identifier about Sample | Name about Gender | Identifier about Gender | Description about Age | Age Unit about Age
Field Name: ID | Gender | age
Value: GSM701607 | 6 weeks old
Validation state: Valid | Valid
Curation goal: Required
Semantic type: Identifier about Sample | Name about Organism | Name about Gender
Field Name: ID | Organism | Gender
Value: GSM701607 | Mus Musculus
19. AI and data curation
Using a record to modularize curation processes
• Different classes of operations
– Normalization (filling a column)
– Enrichment (adding a column)
Validation state: Valid | Valid
Curation goal: Required | Required
Semantic type: Identifier about Sample | Name about Gender | Identifier about Gender
Field Name: ID | Gender
Value: GSM701607 | male
Validation state: Valid | Valid
Curation goal: Required | Required
Semantic type: Identifier about Sample | Name about Gender | Identifier about Gender
Field Name: ID | Gender
Value: GSM701607 | male | PATO:0000384
Validation state: Valid | Valid
Curation goal: Required | Required
Semantic type: Identifier about Sample | ID² about Organism | Name about Organism | Description about Age
Field Name: ID | taxID | Organism | age
Value: GSM701607 | 10090 | Mus Musculus | 6 weeks old
Validation state: Valid | Valid
Curation goal: Required | Required
Semantic type: Identifier about Sample | ID² about Organism | Name about Organism | Description about Age | Identifier about Sample
Field Name: ID | taxID | Organism | age | EBI ref.
Value: GSM701607 | 10090 | Mus Musculus | 6 weeks old | SAMEA1189935
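The two classes of operations above can be sketched as functions that take a record and return a new one. This is an illustrative sketch, not the implementation behind the slides: the dictionary layout, the function names and the two lookup tables are assumptions, although the pairs they contain (male and PATO:0000384, GSM701607 and SAMEA1189935) come from the example records.

```python
from copy import deepcopy

# Assumed lookup tables; a real system would query an ontology or a
# cross-reference service instead of hard-coded dictionaries.
GENDER_TERMS = {"male": "PATO:0000384"}
SAMPLE_XREFS = {"GSM701607": "SAMEA1189935"}


def normalize_gender(record):
    """Normalization: fill the existing but empty gender identifier column."""
    out = deepcopy(record)
    name = out["Gender"]["value"]
    if out["GenderID"]["value"] is None and name in GENDER_TERMS:
        out["GenderID"]["value"] = GENDER_TERMS[name]
    return out


def enrich_with_ebi_ref(record):
    """Enrichment: add a new column holding a cross-reference."""
    out = deepcopy(record)
    ref = SAMPLE_XREFS.get(out["ID"]["value"])
    if ref is not None:
        out["EBI ref."] = {"type": "Identifier about Sample", "value": ref}
    return out


record = {
    "ID": {"type": "Identifier about Sample", "value": "GSM701607"},
    "Gender": {"type": "Name about Gender", "value": "male"},
    "GenderID": {"type": "Identifier about Gender", "value": None},
}

normalized = normalize_gender(record)
enriched = enrich_with_ebi_ref(normalized)
```

Both functions copy their input, so the original record is left untouched and each step can be inspected or replayed on its own.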
20. Big picture
Quantity/Quality tradeoff
Plot: Quality/validity against Time/cost
• Is the optimal trade-off the same for all data?
• Can this change for the same data over time and across use cases?
• Can we embed a “cost function” in curation processes?
21. Big picture
(Meta) data evolution, immutability
Initial condition: organism name present, missing ID
Entity: 1234 | Information: V1 | Meta-Info: V1
Initial condition: identifier extracted, not verified
Entity: 1234 | Information: V2 | Meta-Info: V2
Identifier extracted and verified
Entity: 1234 | Information: V2 | Meta-Info: V3
Validation state: Valid | Valid
Curation goal: Required | Required
Semantic type: Identifier about Sample | Name about Gender | Identifier about Gender
Field Name: ID | Gender
Value: GSM701607 | male
Validation state: Valid | Valid
Curation goal: Required | Required
Semantic type: Identifier about Sample | Name about Gender | Identifier about Gender
Field Name: ID | Gender
Value: GSM701607 | male | PATO:0000384
Validation state: Valid | Valid | Valid
Curation goal: Required | Required
Semantic type: Identifier about Sample | Name about Gender | Identifier about Gender
Field Name: ID | Gender
Value: GSM701607 | male | PATO:0000384
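One way to read the three states above is as immutable snapshots of the same entity, where a change to the values creates a new information version and any change (values or validation) creates a new meta-information version. The sketch below follows that reading; the `Snapshot` class, the `revise` function and the versioning policy are assumptions made for illustration.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Snapshot:
    """Immutable state of one entity at one point in its curation history."""
    entity: str
    info_version: int
    meta_version: int
    values: tuple       # sorted (field, value) pairs
    validation: tuple   # sorted (field, state) pairs


def revise(prev, values=None, validation=None):
    """Return a new snapshot, bumping only the versions that changed."""
    new_values = prev.values if values is None else tuple(sorted(values.items()))
    new_validation = (prev.validation if validation is None
                      else tuple(sorted(validation.items())))
    info_changed = new_values != prev.values
    # Assumed policy: new information always implies new meta-information.
    meta_changed = info_changed or new_validation != prev.validation
    return replace(
        prev,
        values=new_values,
        validation=new_validation,
        info_version=prev.info_version + int(info_changed),
        meta_version=prev.meta_version + int(meta_changed),
    )


v1 = Snapshot("1234", 1, 1,
              values=(("Gender", "male"), ("ID", "GSM701607")),
              validation=(("Gender", "Valid"), ("ID", "Valid")))
# Identifier extracted, not yet verified: the information changes.
v2 = revise(v1, values={"ID": "GSM701607", "Gender": "male",
                        "GenderID": "PATO:0000384"})
# Identifier verified: only the meta-information changes.
v3 = revise(v2, validation={"ID": "Valid", "Gender": "Valid",
                            "GenderID": "Valid"})
history = [v1, v2, v3]
```

Because snapshots are frozen, earlier states stay available for audit while later states carry the curated values.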
23. Data and metadata transformations (deterministic actions + extractors)
• Curation processes can be expressed (by curators) in terms of rules
• Rules embed “atomic operations”, e.g.: extractors, transformations,…
• Simple rules go a very long way…
<ruleConfig method="Extract">
  <param name="setType" value="UNIT"/>
  <param name="setAmbiguous" value="true"/>
  <param name="setFullMatch" value="false"/>
  <param name="setResultInJson" value="false"/>
  <param name="setSimpleJson" value="false"/>
  <param name="setText">
    <ruleConfig method="GetCell">
      <param name="setAttr" value="AgeDescription"/>
      <param name="setBase" value="XCF_1"/>
    </ruleConfig>
  </param>
</ruleConfig>
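The rule above nests a cell lookup inside an extractor. A rough Python equivalent is sketched here to make that composition concrete. The semantics of the parameters are not documented in the slides, so the reading of `setFullMatch` and `setAmbiguous` below is an assumption, as are the function names and the unit vocabulary.

```python
import re

# Assumed vocabulary of age units; not taken from the slides.
UNITS = ("day", "week", "month", "year")
UNIT_PATTERN = re.compile(r"\b(" + "|".join(UNITS) + r")s?\b", re.IGNORECASE)


def get_cell(row, attr):
    """Counterpart of GetCell: read one attribute of a source row."""
    return row.get(attr, "")


def extract_unit(text, full_match=False, ambiguous=True):
    """Counterpart of Extract with type UNIT.

    full_match: the whole text must be a unit (assumed meaning).
    ambiguous: if False, return nothing when several units are found
    (assumed meaning).
    """
    if full_match:
        match = UNIT_PATTERN.fullmatch(text.strip())
        found = [match.group(1).lower()] if match else []
    else:
        found = [m.group(1).lower() for m in UNIT_PATTERN.finditer(text)]
    unique = sorted(set(found))
    if not ambiguous and len(unique) > 1:
        return []
    return unique


row = {"AgeDescription": "6 weeks old"}
units = extract_unit(get_cell(row, "AgeDescription"),
                     full_match=False, ambiguous=True)
```

For the example row, `units` holds the single unit found in the age description.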
24. Abstract rules and meta-rules
• Rules can rely on abstraction/inference to be more generic
• They can also be used to produce meta-information
Example rules (pseudo-syntax)
• Compute missing identifier: If (E.X.type=“Identifier” ^ E.X.Goal=“Required” ^ E.X.Value=“” ^ exists (E.Y: E.Y.type.about=E.X.type.about ^ E.Y.type=“Description” ^ E.Y.Value!=“”)) then E.X.Value=extract(isAbout(E.Y.type), E.Y.Value)
• Set a curation goal: If subClassOf(E.OrganismID.Value, NCBI_40674) then E.GenderID.Goal=“Required”
• Assert validity on condition: If one identifier is unambiguously extracted from a species name, then Validation State=Valid
Validation state: Valid | Valid
Curation goal: Required | Required | Required
Semantic type: Identifier about Sample | ID about Organism | Name about Organism | Name about Gender | Identifier about Gender
Field Name (the “location” in the source): ID | taxID | Organism | Gender
Value: GSM701607 | 10090 | Mus Musculus
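The three example rules can be sketched as plain functions over a record. This is an illustration under several assumptions: the dictionary layout, the function names, the toy lookup tables and the abbreviated taxonomy are all hypothetical, a Name is treated like a Description, and the first and third rules are merged so that a value is only set when extraction is unambiguous.

```python
# Toy reference data, for illustration only.
SPECIES_TO_TAXID = {"mus musculus": ["10090"]}
PARENT = {"10090": "40674"}   # abbreviated lineage, intermediate ranks omitted
DESCRIPTIVE_KINDS = {"Description", "Name"}   # assumption: names count too


def extract(about, text):
    """Return candidate identifiers for text; only organisms are covered."""
    if about == "Organism":
        return SPECIES_TO_TAXID.get(text.strip().lower(), [])
    return []


def sub_class_of(taxid, ancestor):
    """Walk the parent links until the ancestor is found or the chain ends."""
    while taxid is not None:
        if taxid == ancestor:
            return True
        taxid = PARENT.get(taxid)
    return False


def compute_missing_identifiers(E):
    """Rules 1 and 3: fill a required, empty identifier from a description."""
    for x in E.values():
        if (x["kind"] == "Identifier" and x.get("goal") == "Required"
                and not x.get("value")):
            for y in E.values():
                if (y["about"] == x["about"] and y["kind"] in DESCRIPTIVE_KINDS
                        and y.get("value")):
                    candidates = extract(x["about"], y["value"])
                    if len(candidates) == 1:      # unambiguous extraction
                        x["value"] = candidates[0]
                        x["validation"] = "Valid"
                    break


def set_gender_goal(E):
    """Rule 2: a gender identifier is required for organisms under NCBI 40674."""
    if sub_class_of(E["OrganismID"].get("value"), "40674"):
        E["GenderID"]["goal"] = "Required"


E = {
    "OrganismID": {"kind": "Identifier", "about": "Organism",
                   "goal": "Required", "value": ""},
    "Organism": {"kind": "Name", "about": "Organism", "value": "Mus Musculus"},
    "GenderID": {"kind": "Identifier", "about": "Gender", "value": ""},
}
compute_missing_identifiers(E)
set_gender_goal(E)
```

The order matters in this sketch: the curation goal for the gender identifier can only be set once the organism identifier has been computed.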
25. “Approximate” transformations
• Some transformations cannot (easily) be expressed in terms of rules
– Complex and ad hoc relations
– Discretionary elements
• Examples:
– Entity de-duplication
  – Whether two homonymous author mentions refer to the same author or not is a complex function of an extended range of the author’s features (where they work, contact information, subject of study,…)
– Schema mapping
  – Determining the meaning of an attribute (e.g.: time) is a complex function of the values this attribute takes, as well as other parameters (is this a duration, a time point, or an execution timestamp?)
  – Is “Sample tracking number” to be mapped to “Tracking number” or to “Identifier”?
26. Implementation of de-duplication and schema mapping via Tamr
• One approach that we have chosen to provide approximate schema-mapping and de-duplication functions is via Tamr (tamr.com)
• Tamr is a data unification platform that combines machine learning with human expertise.
– E.g.: to support schema mapping, Tamr combines several features:
  – Data distribution
  – Property names
  – Property metadata
– It learns how to compose such functions via machine learning, through an iterative process where human experts can provide input and improve predictions
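To give a feel for how features such as property names and data distribution might be combined into a mapping suggestion with a confidence score, here is a deliberately simple sketch. It is not Tamr's API or algorithm: the function names, the fixed weights (a learning system would fit them from curator feedback) and the sample values are assumptions for illustration.

```python
from difflib import SequenceMatcher


def name_similarity(a, b):
    """Similarity of two property names, between 0 and 1."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def value_overlap(xs, ys):
    """Jaccard overlap of observed values, a crude proxy for data distribution."""
    xs, ys = set(xs), set(ys)
    union = xs | ys
    return len(xs & ys) / len(union) if union else 0.0


def suggest_mappings(source_name, source_values, targets,
                     w_name=0.5, w_values=0.5):
    """Rank target attributes by a weighted score, best candidate first."""
    scored = [
        (w_name * name_similarity(source_name, target)
         + w_values * value_overlap(source_values, values), target)
        for target, values in targets.items()
    ]
    return sorted(scored, reverse=True)


# Hypothetical attributes and values.
targets = {
    "Tracking number": ["TN-001", "TN-003"],
    "Identifier": ["GSM701607"],
}
suggestions = suggest_mappings("Sample tracking number",
                               ["TN-001", "TN-002"], targets)
```

In a human-in-the-loop setting, the curator would confirm or correct the top suggestion, and those decisions would become training data for the weights.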
27. Schema-mapping (Tamr)
Users are offered a range of potential mappings, each with a confidence score. They can confirm them or suggest different mappings. New predictions are routinely provided as more input is accumulated.
User interface for curators showing potential attribute matches
28. Entity de-duplication (Tamr)
User interface for curators showing potential duplicates
Users are shown a set of potential duplicates with a confidence score. They can accept or reject such suggestions, thus providing training data and iteratively refining predictions.
29. Entity de-duplication (Tamr)
Details of the implementation of the deduplication process (courtesy of Tamr)
30. Re-introducing logic
• Can we predict (or suggest) the association between parameters and entities in a template?
– An ontology models the “real world”: entities, qualities, processes
– Parameters are annotated with axioms based on this ontology
– Inference provides multiple classifications of parameters, as well as possible/necessary associations between parameters and entities.
• Can this work?
31. Re-introducing logic
Extract from an ontology representing entities and qualities
Example of axiomatic mapping between a parameter and an ontology of entities and qualities
Deductions for parameter ReportID:
must refer to: Report, Document, Descriptive Entity, Concrete Entity, Entity, Information Entity, Immaterial Entity
may refer to: Report, InternalReport
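Deductions of this shape can be sketched with a simple class hierarchy: what a parameter must refer to is its annotated class plus all superclasses, and what it may refer to is that class plus its subclasses. The sketch below is an assumption-laden illustration. The ontology extract is not reproduced in the text, so the parent links are guessed only so that the output matches the lists above, and a real system would use an OWL reasoner rather than this traversal.

```python
# Guessed superclass links (assumption; the slide's ontology is not shown).
SUPER = {
    "InternalReport": ["Report"],
    "Report": ["Document"],
    "Document": ["Descriptive Entity", "Information Entity"],
    "Descriptive Entity": ["Concrete Entity"],
    "Information Entity": ["Immaterial Entity"],
    "Concrete Entity": ["Entity"],
    "Immaterial Entity": ["Entity"],
    "Entity": [],
}

# Assumed axiom: the parameter ReportID refers to some Report.
PARAMETER_AXIOMS = {"ReportID": "Report"}


def ancestors(cls):
    """All direct and indirect superclasses of cls."""
    seen = set()
    stack = [cls]
    while stack:
        for parent in SUPER.get(stack.pop(), []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def must_refer_to(parameter):
    """The annotated class and everything it is necessarily an instance of."""
    cls = PARAMETER_AXIOMS[parameter]
    return {cls} | ancestors(cls)


def may_refer_to(parameter):
    """The annotated class and any more specific class it could be."""
    cls = PARAMETER_AXIOMS[parameter]
    return {cls} | {c for c in SUPER if cls in ancestors(c)}
```

With these guessed links, `must_refer_to("ReportID")` and `may_refer_to("ReportID")` reproduce the two lists given for ReportID.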
32. Exploring automatic ontology matching
• 26 submissions. Algorithms covering structural approaches, axiomatic mappings and use of background knowledge
• Phenotype track sponsored by the Pistoia Alliance Ontologies Mapping Project
• Evaluation results for Phenotype track submitted to Journal of Biomedical Semantics
http://oaei.ontologymatching.org/2016
33. Business or Operating Unit/Franchise or Department
Conclusions: On rules, standards
and data ethnography
• Data Curation: “AI” may help (not limited to ML)
– Formal knowledge representation is part of the goal
• The need for explanations
– We need to define (document) a process
– We have theorems for proofs: can we do without ?
– Is there a role for “ML” GURUs?
• The “human side” of data
– Data normalization is based on assumptions (e.g.: what can be
considered same, what not): there is a cultural side to this.
– Would we accept an AI “editor” ?
Public33
34. Acknowledgments
• NIBR
• Daniel Cronenberger
• Ming Fang
• Frederic Sutter
• Anosha Siripala
• Fabien Pernot
• Jean Marc von Allmen
• Martin Petracchi
• Dorothy Reilly
• Pierre Parisot
• Therese Vachon
• Tamr.com
• Pistoia Alliance Ontology Matching Project team