AI for Data Curation
Yes, can we?
Andrea Splendiani, AD, Information Systems
London
September 28, 2017
NIBR Informatics
Business or Operating Unit/Franchise or Department
Agenda
1. Focus: metadata and reference data (what we do, in context)
2. Knowledge Engineering and AI (some considerations at 10000ft)
3. Data curation: a use case for AI? (holistic view on a process, at 1000ft)
4. Ideas and experiences (details)
5. Conclusions (reflections at 10000ft)
Focus: metadata and reference data
1. What:
– Annotation of datasets
– Standards
– Ontologies
– Reference information
2. Why:
– Support analysis
– Support search and query answering
– Support extraction
– Building knowledge networks / information discovery and inference
3. Where
– Typically in research
Knowledge Engineering and AI
Can Artificial Intelligence solve biology? (a stopper)
• 10 years ago: AI approaches to Systems Biology
• Ontology-based knowledge bases (Semantic Web)
• ANN/fuzzy systems are even older
Knowledge Engineering and AI
Can Artificial Intelligence solve biology? (taken seriously)
• Now: AI and ML are at the height of the hype
• Interest in Life Sciences industries
Knowledge Engineering and AI
• What helped the resurgence of ML?
– Massive data available
– Massive computational power available
– A few technical improvements
– Success stories (deep learning)
• Do these also apply to ontology/Semantic Web based systems?
– UniProt: 5.7B triples in 2009, 30+B triples in 2017
– EBI RDF Platform (2015)
– Wikidata (2014?)
Source: https://tools.wmflabs.org/wikidata-todo/stats.php
Knowledge Engineering and AI
• The way information is represented has implications for what is built on it (e.g.: analytics, data mining)
– Networks: are parallel branches to be read as AND or as OR?
– Annotations: is negative information mentioned explicitly?
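A minimal Python sketch of the two representation issues above. All names here (the `Annotation` tuple, the `negated` flag, the gene and term labels, the `join` argument) are hypothetical and chosen for illustration only; they are not taken from any specific knowledge base.

```python
from typing import NamedTuple


class Annotation(NamedTuple):
    entity: str    # annotated entity (hypothetical name)
    term: str      # ontology term label (hypothetical)
    negated: bool  # True when the source explicitly states "NOT"


ANNOTATIONS = [
    Annotation("geneA", "kinase activity", False),
    Annotation("geneB", "kinase activity", True),  # explicitly NOT a kinase
    Annotation("geneC", "transport", False),
]


def naive_lookup(term):
    """Ignores the qualifier: every mention counts as a positive annotation."""
    return sorted(a.entity for a in ANNOTATIONS if a.term == term)


def qualifier_aware_lookup(term):
    """Separates asserted, explicitly negated and simply unknown entities."""
    asserted = sorted(a.entity for a in ANNOTATIONS if a.term == term and not a.negated)
    negated = sorted(a.entity for a in ANNOTATIONS if a.term == term and a.negated)
    unknown = sorted({a.entity for a in ANNOTATIONS} - set(asserted) - set(negated))
    return {"asserted": asserted, "negated": negated, "unknown": unknown}


def target_active(upstream_active, join):
    """Parallel inputs into one node: the reading of the join changes the result."""
    return all(upstream_active) if join == "AND" else any(upstream_active)
```

A consumer that drops the qualifier would report geneB as a kinase, and the same network gives different answers depending on whether parallel inputs are read as AND or as OR.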
Knowledge Engineering and AI
• Metadata is important in a data-centric world (and in at least part of ML applications)
• Knowledge representation matters, beyond metadata (examples: AND/OR in pathways, NOT in annotations…)
• We are starting to have large, distributed knowledge bases
– Is there a role for AI systems based on logic/KR?
– Can we combine symbolic and sub-symbolic reasoning?
– Is this already happening?
Data curation
• Annotation
• Metadata
• Standards
• Model
• Literature
• Databases
• …
Source: BioCuration 2017 abstracts, via wordscloud.com
An example: public data curation
https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM701607
https://www.ebi.ac.uk/rdf/services/describe?uri=http%3A%2F%2Frdf.ebi.ac.uk%2Fresource%2Fbiosamples%2Fsample%2FSAMEA1189935
An example: public data curation (data view)
https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM701607
Property | Value | Ontology | Bio-Characteristic?
Sample_source_name | WT6 biological rep 1, Affy processing batch 2 | EFO_0000001 |
Organism | Mus musculus | EFO_0000001, NCBITaxon_10090 |
strain | 129S6/Sv/Ev | EFO_0000001 | Bio
genotype | wild type | EFO_0000001, EFO_0005168 | Bio
Sex | male | EFO_0000001, EFO_0001266, PATO_0000384 |
age | 6 weeks old | EFO_0000001 | Bio
https://www.ebi.ac.uk/rdf/services/describe?uri=http%3A%2F%2Frdf.ebi.ac.uk%2Fresource%2Fbiosamples%2Fsample%2FSAMEA1189935
An example: public data curation
https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM701607
https://www.ebi.ac.uk/rdf/services/describe?uri=http%3A%2F%2Frdf.ebi.ac.uk%2Fresource%2Fbiosamples%2Fsample%2FSAMEA1189935
Supports:
• Aggregation
• Analysis
• Search
• Link discovery
• “Machine learning”
Can we use AI for Data Curation?
Why?
– Data curation is an intellectually intensive and time-consuming activity
– Given the increasing role and amount of data, curation risks becoming a bottleneck
Figure: example of exponential growth in data
AI for data curation: characteristics and constraints
• Can we automate data curation?
• Difficult:
– Missing data
– Discretion (e.g.: level of granularity)
• Looks reasonable:
– Repetition
– Consistency
– Data/distance evaluations (clustering/attractors)
• We need to combine the human aspects and the machine-processable aspects
AI for data curation
Framing the problem: what
• Should this value be normalized?
• Meaning. E.g.: is “age” the same as “years”?
• Confidence: is this information true?
• The need. E.g.: is this required information? When?
• Is this a valid identifier?
Example: extract from NCBI GEO GSM701607
AI for data curation
Framing the problem: how
We consider curation activities as functions in a “curation space” that is exemplified via a “curation record”
Validation state (Confidence): Valid | Valid | Valid
Curation goal (The need): Required | Required | Required | Required | Required
Semantic type [1] (Meaning): Identifier about Sample | ID [2] about Organism | Name about Organism | Name about Gender | Identifier about Gender | Description about Age | Age Unit about Age
Field Name (the “location” in the source): ID | taxID | Organism | Gender | age
Value: GSM701607 | 10090 | Mus musculus | 6 weeks old
[1] All semantic types are expressed via an ontology (here presented as a simplified definition)
[2] Identifiers also require a domain specification
Example: extract from NCBI GEO GSM701607 (only a subset of the fields from the previous slide is considered)
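One way to picture the curation record is as a list of cells, one per semantic type. The sketch below is an assumption for illustration: the class name `CurationCell` and its attribute names are hypothetical, and the goal and validation values are left unset because their column alignment is not recoverable from the text.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CurationCell:
    semantic_type: str                # meaning, e.g. "Identifier about Sample"
    field_name: Optional[str] = None  # "location" in the source, if any
    value: Optional[str] = None       # value found in the source, if any
    goal: Optional[str] = None        # the need, e.g. "Required"
    validation: Optional[str] = None  # confidence, e.g. "Valid"


# Subset of NCBI GEO GSM701607 as listed in the record above.
record = [
    CurationCell("Identifier about Sample", "ID", "GSM701607"),
    CurationCell("ID about Organism", "taxID", "10090"),
    CurationCell("Name about Organism", "Organism", "Mus musculus"),
    CurationCell("Name about Gender", "Gender"),
    CurationCell("Identifier about Gender"),
    CurationCell("Description about Age", "age", "6 weeks old"),
    CurationCell("Age Unit about Age"),
]


def missing_values(cells):
    """Semantic types that still have no value: candidates for curation."""
    return [c.semantic_type for c in cells if c.value is None]
```

With this shape, each curation activity becomes a function that reads some cells and writes others.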
AI and data curation
Using a record to modularize curation processes
• Different classes of operations
– Schema mapping (assign a type)
– Standard setting (assign a goal)
– Validation (setting a validation value)

Validation state: Valid | Valid | Valid
Curation goal: Required | Required | Required | Required
Semantic type: Identifier about Sample | Name about Gender | Identifier about Gender | Description about Age | Age Unit about Age
Field Name: ID | Gender | age
Value: GSM701607 | 6 weeks old

Validation state: Valid | Valid
Curation goal: Required
Semantic type: Identifier about Sample | Name about Organism | Name about Gender
Field Name: ID | Organism | Gender
Value: GSM701607 | Mus musculus
AI and data curation
Using a record to modularize curation processes
• Different classes of operations
– Normalization (filling a column)
– Enrichment (adding a column)

Normalization, before:
Validation state: Valid | Valid
Curation goal: Required | Required
Semantic type: Identifier about Sample | Name about Gender | Identifier about Gender
Field Name: ID | Gender
Value: GSM701607 | male

Normalization, after:
Validation state: Valid | Valid
Curation goal: Required | Required
Semantic type: Identifier about Sample | Name about Gender | Identifier about Gender
Field Name: ID | Gender
Value: GSM701607 | male | PATO:0000384

Enrichment, before:
Validation state: Valid | Valid
Curation goal: Required | Required
Semantic type: Identifier about Sample | ID about Organism | Name about Organism | Description about Age
Field Name: ID | taxID | Organism | age
Value: GSM701607 | 10090 | Mus musculus | 6 weeks old

Enrichment, after:
Validation state: Valid | Valid
Curation goal: Required | Required
Semantic type: Identifier about Sample | ID about Organism | Name about Organism | Description about Age | Identifier about Sample
Field Name: ID | taxID | Organism | age | EBI ref.
Value: GSM701607 | 10090 | Mus musculus | 6 weeks old | SAMEA1189935
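The classes of operations can be sketched as small functions over a record, each returning a new record rather than changing the old one. This is an illustrative sketch only: the dictionary keys, the function names and the `FIELD_TYPES` mapping are hypothetical, not the actual implementation.

```python
def cell(semantic_type, field=None, value=None, goal=None, validation=None):
    return {"type": semantic_type, "field": field, "value": value,
            "goal": goal, "validation": validation}


def update(record, semantic_type, **changes):
    """Return a copy of the record with the first cell of that type changed."""
    out, done = [], False
    for c in record:
        if not done and c["type"] == semantic_type:
            c, done = {**c, **changes}, True
        out.append(c)
    if not done:
        raise KeyError(semantic_type)
    return out


# Hypothetical field-to-type mapping; in practice this is what schema mapping produces.
FIELD_TYPES = {"ID": "Identifier about Sample", "Gender": "Name about Gender"}


def map_schema(source_row):
    """Schema mapping: assign a semantic type to each source field."""
    return [cell(FIELD_TYPES[f], field=f, value=v) for f, v in source_row.items()]


def set_goal(record, semantic_type, goal):
    """Standard setting: assign a curation goal."""
    return update(record, semantic_type, goal=goal)


def validate(record, semantic_type, state):
    """Validation: set a validation value."""
    return update(record, semantic_type, validation=state)


def normalize(record, source_type, target_type, lookup):
    """Normalization: fill an existing, empty column from another column."""
    source = next(c for c in record if c["type"] == source_type)
    return update(record, target_type, value=lookup[source["value"]])


def enrich(record, new_cell):
    """Enrichment: add a column that was not in the source."""
    return record + [new_cell]


before = map_schema({"ID": "GSM701607", "Gender": "male"}) + [cell("Identifier about Gender")]
after = normalize(before, "Name about Gender", "Identifier about Gender",
                  {"male": "PATO:0000384"})
```

Keeping every operation a pure function makes it easy to chain them and to compare the record before and after each step.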
Big picture
Quantity/quality trade-off
Figure axes: quality/validity against time/cost
• Is the optimal trade-off the same for all data?
• Can this change for the same data over time and across use cases?
• Can we embed a “cost function” in curation processes?
Big picture
(Meta)data evolution, immutability

1. Initial condition: organism name present, missing ID
Entity: 1234 | Information: V1 | Meta-Info: V1
Validation state: Valid | Valid
Curation goal: Required | Required
Semantic type: Identifier about Sample | Name about Gender | Identifier about Gender
Field Name: ID | Gender
Value: GSM701607 | male

2. Initial condition: identifier extracted, not verified
Entity: 1234 | Information: V2 | Meta-Info: V2
Validation state: Valid | Valid
Curation goal: Required | Required
Semantic type: Identifier about Sample | Name about Gender | Identifier about Gender
Field Name: ID | Gender
Value: GSM701607 | male | PATO:0000384

3. Identifier extracted and verified
Entity: 1234 | Information: V2 | Meta-Info: V3
Validation state: Valid | Valid | Valid
Curation goal: Required | Required
Semantic type: Identifier about Sample | Name about Gender | Identifier about Gender
Field Name: ID | Gender
Value: GSM701607 | male | PATO:0000384
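The three states can be sketched with immutable snapshots, where information and meta-information carry separate version counters. The class and function names below are hypothetical and only illustrate the versioning idea under that assumption.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class Snapshot:
    entity: int
    info_version: int         # advances when the information (a value) changes
    meta_version: int         # advances when the meta-information changes
    gender_id: Optional[str]  # the identifier being curated
    gender_id_valid: bool     # validation state of that identifier


def extract_identifier(s, identifier):
    """A new value appears: information and meta-information both advance."""
    return replace(s, gender_id=identifier,
                   info_version=s.info_version + 1,
                   meta_version=s.meta_version + 1)


def verify_identifier(s):
    """Only the validation state changes: the value stays at the same version."""
    return replace(s, gender_id_valid=True, meta_version=s.meta_version + 1)


v1 = Snapshot(entity=1234, info_version=1, meta_version=1,
              gender_id=None, gender_id_valid=False)
v2 = extract_identifier(v1, "PATO:0000384")
v3 = verify_identifier(v2)
history = [v1, v2, v3]  # earlier snapshots are never modified
```

Because snapshots are frozen, every earlier state of the record remains available for audit or for re-running curation with different rules.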
Ideas and experiences
Some details
Data and metadata transformations (deterministic actions + extractors)
• Curation processes can be expressed (by curators) in terms of rules
• Rules embed “atomic operations”, e.g.: extractors, transformations, …
• Simple rules go a very long way…
<!-- Extract a unit from the text returned by the nested GetCell rule -->
<ruleConfig method="Extract">
  <param name="setType" value="UNIT"/>
  <param name="setAmbiguous" value="true"/>
  <param name="setFullMatch" value="false"/>
  <param name="setResultInJson" value="false"/>
  <param name="setSimpleJson" value="false"/>
  <param name="setText">
    <!-- Read the AgeDescription attribute of the base record XCF_1 -->
    <ruleConfig method="GetCell">
      <param name="setAttr" value="AgeDescription"/>
      <param name="setBase" value="XCF_1"/>
    </ruleConfig>
  </param>
</ruleConfig>
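To give an idea of what a rule like the one above might do, here is a rough Python sketch of a "get cell, then extract a unit" pipeline. It is not the actual rule engine: the function names, the `UNIT_SYNONYMS` vocabulary and the reading of `setFullMatch` as "the whole text must be a unit" are assumptions.

```python
import re

# Hypothetical unit vocabulary; a real extractor would draw on an ontology.
UNIT_SYNONYMS = {
    "day": "day", "days": "day",
    "week": "week", "weeks": "week", "wk": "week", "wks": "week",
    "month": "month", "months": "month",
    "year": "year", "years": "year", "yr": "year", "yrs": "year",
}


def get_cell(row, attr):
    """Return the text stored under an attribute, or an empty string."""
    return row.get(attr, "")


def extract_units(text, full_match=False):
    """Return the distinct normalized units mentioned in the text.

    With full_match=True the whole text must itself be a unit.
    More than one result means the extraction is ambiguous.
    """
    if full_match:
        unit = UNIT_SYNONYMS.get(text.strip().lower())
        return [unit] if unit else []
    found = []
    for token in re.findall(r"[a-z]+", text.lower()):
        unit = UNIT_SYNONYMS.get(token)
        if unit and unit not in found:
            found.append(unit)
    return found


row = {"AgeDescription": "6 weeks old"}
units = extract_units(get_cell(row, "AgeDescription"))
```

Nesting `get_cell` inside `extract_units` mirrors the way the nested `GetCell` rule supplies the text parameter of the outer `Extract` rule.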
Abstract rules and meta-rules
• Rules can rely on abstraction/inference to be more generic
• They can also be used to produce meta-information
Example rules (pseudo-syntax)
• Compute missing identifier: if (E.X.type=“Identifier” ^ E.X.Goal=“Required” ^ E.X.Value=“” ^ exists (E.Y: E.Y.type.about=E.X.type.about ^ E.Y.type=“Description” ^ E.Y.Value!=“”)) then E.X.Value=extract(isAbout(E.Y.type), E.Y.Value)
• Set a curation goal: if subClassOf(E.OrganismID.Value, NCBI_40674) then E.GenderID.Goal=“Required”
• Assert validity on condition: if one identifier is unambiguously extracted from a species name, then Validation State=Valid
Validation state: Valid | Valid
Curation goal: Required | Required | Required
Semantic type: Identifier about Sample | ID about Organism | Name about Organism | Name about Gender | Identifier about Gender
Field Name (the “location” in the source): ID | taxID | Organism | Gender
Value: GSM701607 | 10090 | Mus musculus
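The example rules can be sketched in Python as follows. This is a simplified illustration, not the rule engine itself: the lookup tables stand in for ontology and taxonomy services, the taxonomy is flattened to one level, and the cell layout and function names are hypothetical.

```python
# Hypothetical lookup tables standing in for ontology services.
SPECIES_TO_TAXID = {"mus musculus": "10090", "homo sapiens": "9606"}
# Simplified: real lineages have many intermediate ranks. 40674 = Mammalia.
TAXON_PARENTS = {"10090": "40674", "9606": "40674", "40674": None}
# One extractor per kind of thing a description can be about.
EXTRACTORS = {"Organism": lambda text: SPECIES_TO_TAXID.get(text.strip().lower())}


def cell(type_, about, value="", goal=None, validation=None):
    return {"type": type_, "about": about, "value": value,
            "goal": goal, "validation": validation}


def sub_class_of(taxid, ancestor):
    """Walk up the (simplified) taxonomy until the ancestor or the root is reached."""
    while taxid is not None:
        if taxid == ancestor:
            return True
        taxid = TAXON_PARENTS.get(taxid)
    return False


def compute_missing_identifiers(record):
    """Rule 1 and rule 3: fill required empty identifiers, mark them valid.

    Modifies the record in place and returns it.
    """
    for x in record:
        if x["type"] == "Identifier" and x["goal"] == "Required" and x["value"] == "":
            for y in record:
                if (y["about"] == x["about"] and y["type"] == "Description"
                        and y["value"] != "" and x["about"] in EXTRACTORS):
                    found = EXTRACTORS[x["about"]](y["value"])
                    if found:
                        x["value"] = found
                        x["validation"] = "Valid"  # a single, unambiguous match
                    break
    return record


def set_gender_goal(record):
    """Rule 2: for mammals, a gender identifier becomes required."""
    organism = next(c for c in record
                    if c["type"] == "Identifier" and c["about"] == "Organism")
    if sub_class_of(organism["value"], "40674"):
        for c in record:
            if c["type"] == "Identifier" and c["about"] == "Gender":
                c["goal"] = "Required"
    return record
```

Running the first rule before the second shows how rules chain: the organism identifier computed by one rule is what lets the other decide that gender is required.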
“Approximate” transformations
• Some transformations cannot (easily) be expressed in terms of rules
– Complex and ad hoc relations
– Discretionary elements
• Examples:
– Entity de-duplication: whether two mentions of homonymous authors refer to the same author or not is a complex function of an extended range of the authors’ features (where they work, contact information, subject of study, …)
– Schema mapping: determining the meaning of an attribute (e.g.: time) is a complex function of the values this attribute takes, as well as of other parameters (is this a duration, a time point, or an execution timestamp?). Is “Sample tracking number” to be mapped to “Tracking number” or to “Identifier”?
Implementation of de-duplication and schema mapping via Tamr
• One approach that we have chosen to provide approximate schema-mapping and de-duplication functions is via Tamr (tamr.com)
• Tamr is a data unification platform that combines machine learning with human expertise.
– E.g.: to support schema mapping, Tamr combines several features: data distribution, property names, property metadata
– It learns how to compose such functions via machine learning, through an iterative process where human experts can provide input and improve predictions
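The idea of combining several features into one mapping score can be sketched in plain Python. This is not Tamr's API or algorithm: the two features, the fixed weights and all function names are assumptions, and a learning system would fit the weights from curator feedback instead of fixing them.

```python
from difflib import SequenceMatcher


def name_similarity(a, b):
    """Similarity of two property names, between 0 and 1."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def value_overlap(values_a, values_b):
    """Jaccard overlap of observed values: a crude 'data distribution' feature."""
    a, b = set(values_a), set(values_b)
    return len(a & b) / len(a | b) if a | b else 0.0


def score(source, target, weights=(0.5, 0.5)):
    """Weighted combination of the features (weights are hypothetical)."""
    w_name, w_values = weights
    return (w_name * name_similarity(source["name"], target["name"])
            + w_values * value_overlap(source["values"], target["values"]))


def suggest_mappings(source, targets, weights=(0.5, 0.5)):
    """Rank target attributes for one source attribute, best first."""
    ranked = [(score(source, t, weights), t["name"]) for t in targets]
    return sorted(ranked, reverse=True)


source = {"name": "Sample tracking number", "values": ["TRK-001", "TRK-002"]}
targets = [
    {"name": "Identifier", "values": ["GSM701607"]},
    {"name": "Tracking number", "values": ["TRK-001", "TRK-003"]},
]
suggestions = suggest_mappings(source, targets)
```

The ranked list plays the role of the suggestions a curator would confirm or correct.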
Schema mapping (Tamr)
Users are offered a range of potential mappings, each with a confidence score. They can confirm them or suggest different mappings. New predictions are routinely provided as more input is accumulated.
Figure: user interface for curators showing potential attribute matches
Entity de-duplication (Tamr)
Users are shown a set of potential duplicates with a confidence score. They can accept or reject such suggestions, thus providing training data and iteratively refining predictions.
Figure: user interface for curators showing potential duplicates
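The accept/reject loop can be illustrated with a deliberately crude stand-in for machine learning: score candidate pairs, collect curator decisions, and move the decision threshold accordingly. Nothing here reflects Tamr's implementation; the feature list, the averaging and the threshold update are assumptions made for the sketch.

```python
from difflib import SequenceMatcher
from itertools import combinations

FEATURES = ("name", "affiliation", "email")  # hypothetical author features


def similarity(a, b):
    """Average string similarity over the features, between 0 and 1."""
    return sum(SequenceMatcher(None, a[f].lower(), b[f].lower()).ratio()
               for f in FEATURES) / len(FEATURES)


def candidate_pairs(records, threshold):
    """Pairs scoring at or above the threshold, best first."""
    pairs = [(similarity(a, b), a["id"], b["id"]) for a, b in combinations(records, 2)]
    return sorted((p for p in pairs if p[0] >= threshold), reverse=True)


def refine_threshold(feedback):
    """feedback: list of (score, accepted) decisions made by curators.

    Returns the midpoint between the lowest accepted and the highest
    rejected score, a simple substitute for retraining a model.
    """
    accepted = [s for s, ok in feedback if ok]
    rejected = [s for s, ok in feedback if not ok]
    if not accepted or not rejected:
        raise ValueError("need at least one accepted and one rejected pair")
    return (min(accepted) + max(rejected)) / 2


authors = [
    {"id": "a1", "name": "J. Smith", "affiliation": "NIBR Basel", "email": "j.smith@example.org"},
    {"id": "a2", "name": "J. Smith", "affiliation": "NIBR Basel", "email": "j.smith@example.org"},
    {"id": "a3", "name": "J. Smith", "affiliation": "University of Tokyo", "email": "jsmith@tokyo.example"},
]
duplicates = candidate_pairs(authors, threshold=0.9)
```

Here the two identical records are proposed as duplicates, while the homonymous author at a different institution falls below the threshold.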
Entity de-duplication (Tamr)
Figure: details of the implementation of the de-duplication process (courtesy of Tamr)
Re-introducing logic
• Can we predict (or suggest) the association between parameters and entities in a template?
– An ontology models the “real world”: entities, qualities, processes
– Parameters are annotated with axioms based on this ontology
– Inference provides multiple classifications of parameters, as well as possible/necessary associations between parameters and entities.
• Can this work?
Re-introducing logic
Figure: extract from an ontology representing entities and qualities
Figure: example of axiomatic mapping between a parameter and an entity-and-qualities ontology
Deductions for parameter ReportID:
must refer to: Report, Document, Descriptive Entity, Concrete Entity, Entity, Information Entity, Immaterial Entity
may refer to: Report, InternalReport
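One plausible reading of these deductions is that a parameter annotated as referring to a class must also refer to every superclass of it, and may refer to any of its subclasses. The sketch below works under that assumption; the class hierarchy is hypothetical, built only so that it reproduces the lists above, since the actual ontology is not given in the text.

```python
# Hypothetical hierarchy (class -> direct superclasses); not the real ontology.
PARENTS = {
    "InternalReport": ["Report"],
    "Report": ["Document"],
    "Document": ["Descriptive Entity", "Information Entity"],
    "Descriptive Entity": ["Concrete Entity"],
    "Information Entity": ["Immaterial Entity"],
    "Concrete Entity": ["Entity"],
    "Immaterial Entity": ["Entity"],
    "Entity": [],
}


def ancestors_or_self(cls):
    """The class itself plus everything reachable by following superclass links."""
    seen, stack = set(), [cls]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(PARENTS.get(current, []))
    return seen


def descendants_or_self(cls):
    """The class itself plus every class that has it among its ancestors."""
    return {c for c in PARENTS if cls in ancestors_or_self(c)}


def deductions(target_class):
    """Necessary and possible referents of a parameter annotated with target_class."""
    return {"must refer to": ancestors_or_self(target_class),
            "may refer to": descendants_or_self(target_class)}


report_id = deductions("Report")
```

A description-logic reasoner would derive the same two sets from the axioms, along with any classifications that follow from property restrictions, which this sketch does not model.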
Exploring automatic ontology matching
• 26 submissions. Algorithms covering structural approaches, axiomatic mappings and use of background knowledge
• Phenotype track sponsored by the Pistoia Alliance Ontologies Mapping Project
• Evaluation results for Phenotype track submitted to Journal of Biomedical Semantics
http://oaei.ontologymatching.org/2016
Conclusions: On rules, standards and data ethnography
• Data curation: “AI” may help (not limited to ML)
– Formal knowledge representation is part of the goal
• The need for explanations
– We need to define (document) a process
– We have theorems for proofs: can we do without?
– Is there a role for “ML” gurus?
• The “human side” of data
– Data normalization is based on assumptions (e.g.: what can be considered the same and what cannot): there is a cultural side to this.
– Would we accept an AI “editor”?
Acknowledgments
• NIBR
• Daniel Cronenberger
• Ming Fang
• Frederic Sutter
• Anosha Siripala
• Fabien Pernot
• Jean Marc von Allmen
• Martin Petracchi
• Dorothy Reilly
• Pierre Parisot
• Therese Vachon
• Tamr.com
• Pistoia Alliance Ontology Matching Project team
Thank you

Physiochemical properties of nanomaterials and its nanotoxicity.pptxPhysiochemical properties of nanomaterials and its nanotoxicity.pptx
Physiochemical properties of nanomaterials and its nanotoxicity.pptxAArockiyaNisha
 
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...Sérgio Sacani
 
Zoology 4th semester series (krishna).pdf
Zoology 4th semester series (krishna).pdfZoology 4th semester series (krishna).pdf
Zoology 4th semester series (krishna).pdfSumit Kumar yadav
 
Hire 💕 9907093804 Hooghly Call Girls Service Call Girls Agency
Hire 💕 9907093804 Hooghly Call Girls Service Call Girls AgencyHire 💕 9907093804 Hooghly Call Girls Service Call Girls Agency
Hire 💕 9907093804 Hooghly Call Girls Service Call Girls AgencySheetal Arora
 
GBSN - Microbiology (Unit 1)
GBSN - Microbiology (Unit 1)GBSN - Microbiology (Unit 1)
GBSN - Microbiology (Unit 1)Areesha Ahmad
 
Recombinant DNA technology (Immunological screening)
Recombinant DNA technology (Immunological screening)Recombinant DNA technology (Immunological screening)
Recombinant DNA technology (Immunological screening)PraveenaKalaiselvan1
 
fundamental of entomology all in one topics of entomology
fundamental of entomology all in one topics of entomologyfundamental of entomology all in one topics of entomology
fundamental of entomology all in one topics of entomologyDrAnita Sharma
 
DIFFERENCE IN BACK CROSS AND TEST CROSS
DIFFERENCE IN  BACK CROSS AND TEST CROSSDIFFERENCE IN  BACK CROSS AND TEST CROSS
DIFFERENCE IN BACK CROSS AND TEST CROSSLeenakshiTyagi
 

Recently uploaded (20)

Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdfPests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
 
Nightside clouds and disequilibrium chemistry on the hot Jupiter WASP-43b
Nightside clouds and disequilibrium chemistry on the hot Jupiter WASP-43bNightside clouds and disequilibrium chemistry on the hot Jupiter WASP-43b
Nightside clouds and disequilibrium chemistry on the hot Jupiter WASP-43b
 
Disentangling the origin of chemical differences using GHOST
Disentangling the origin of chemical differences using GHOSTDisentangling the origin of chemical differences using GHOST
Disentangling the origin of chemical differences using GHOST
 
Biopesticide (2).pptx .This slides helps to know the different types of biop...
Biopesticide (2).pptx  .This slides helps to know the different types of biop...Biopesticide (2).pptx  .This slides helps to know the different types of biop...
Biopesticide (2).pptx .This slides helps to know the different types of biop...
 
Botany krishna series 2nd semester Only Mcq type questions
Botany krishna series 2nd semester Only Mcq type questionsBotany krishna series 2nd semester Only Mcq type questions
Botany krishna series 2nd semester Only Mcq type questions
 
❤Jammu Kashmir Call Girls 8617697112 Personal Whatsapp Number 💦✅.
❤Jammu Kashmir Call Girls 8617697112 Personal Whatsapp Number 💦✅.❤Jammu Kashmir Call Girls 8617697112 Personal Whatsapp Number 💦✅.
❤Jammu Kashmir Call Girls 8617697112 Personal Whatsapp Number 💦✅.
 
Botany 4th semester file By Sumit Kumar yadav.pdf
Botany 4th semester file By Sumit Kumar yadav.pdfBotany 4th semester file By Sumit Kumar yadav.pdf
Botany 4th semester file By Sumit Kumar yadav.pdf
 
GBSN - Biochemistry (Unit 1)
GBSN - Biochemistry (Unit 1)GBSN - Biochemistry (Unit 1)
GBSN - Biochemistry (Unit 1)
 
Presentation Vikram Lander by Vedansh Gupta.pptx
Presentation Vikram Lander by Vedansh Gupta.pptxPresentation Vikram Lander by Vedansh Gupta.pptx
Presentation Vikram Lander by Vedansh Gupta.pptx
 
CALL ON ➥8923113531 🔝Call Girls Kesar Bagh Lucknow best Night Fun service 🪡
CALL ON ➥8923113531 🔝Call Girls Kesar Bagh Lucknow best Night Fun service  🪡CALL ON ➥8923113531 🔝Call Girls Kesar Bagh Lucknow best Night Fun service  🪡
CALL ON ➥8923113531 🔝Call Girls Kesar Bagh Lucknow best Night Fun service 🪡
 
Isotopic evidence of long-lived volcanism on Io
Isotopic evidence of long-lived volcanism on IoIsotopic evidence of long-lived volcanism on Io
Isotopic evidence of long-lived volcanism on Io
 
Physiochemical properties of nanomaterials and its nanotoxicity.pptx
Physiochemical properties of nanomaterials and its nanotoxicity.pptxPhysiochemical properties of nanomaterials and its nanotoxicity.pptx
Physiochemical properties of nanomaterials and its nanotoxicity.pptx
 
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...
 
Zoology 4th semester series (krishna).pdf
Zoology 4th semester series (krishna).pdfZoology 4th semester series (krishna).pdf
Zoology 4th semester series (krishna).pdf
 
Hire 💕 9907093804 Hooghly Call Girls Service Call Girls Agency
Hire 💕 9907093804 Hooghly Call Girls Service Call Girls AgencyHire 💕 9907093804 Hooghly Call Girls Service Call Girls Agency
Hire 💕 9907093804 Hooghly Call Girls Service Call Girls Agency
 
The Philosophy of Science
The Philosophy of ScienceThe Philosophy of Science
The Philosophy of Science
 
GBSN - Microbiology (Unit 1)
GBSN - Microbiology (Unit 1)GBSN - Microbiology (Unit 1)
GBSN - Microbiology (Unit 1)
 
Recombinant DNA technology (Immunological screening)
Recombinant DNA technology (Immunological screening)Recombinant DNA technology (Immunological screening)
Recombinant DNA technology (Immunological screening)
 
fundamental of entomology all in one topics of entomology
fundamental of entomology all in one topics of entomologyfundamental of entomology all in one topics of entomology
fundamental of entomology all in one topics of entomology
 
DIFFERENCE IN BACK CROSS AND TEST CROSS
DIFFERENCE IN  BACK CROSS AND TEST CROSSDIFFERENCE IN  BACK CROSS AND TEST CROSS
DIFFERENCE IN BACK CROSS AND TEST CROSS
 

Artificial Intelligence in Data Curation

  • 1. AI for Data Curation Yes, can we? Andrea Splendiani, AD, Information Systems London September 28, 2017 NIBR Informatics
  • 2. Agenda 1. Focus: metadata and reference data (what we do, in context) 2. Knowledge Engineering and AI (some considerations at 10000ft) 3. Data curation: a use case for AI? (holistic view on a process, 1000ft) 4. Ideas and experiences (details) 5. Conclusions (reflections at 10000ft)
  • 3. Focus: metadata and reference data 1. What: – Annotation of datasets – Standards – Ontologies – Reference information 2. Why: – Support analysis – Support search and query answering – Support extraction – Building knowledge networks / information discovery and inference 3. Where: – Typically in research
  • 4. Knowledge Engineering and AI: Can Artificial Intelligence solve biology? (a stopper) • 10 years ago: AI approaches to Systems Biology • Ontology-based knowledge bases (Semantic Web) • ANN/Fuzzy systems even older
  • 5. Knowledge Engineering and AI: Can Artificial Intelligence solve biology? (taken seriously) • Now: AI and ML are at the center of the hype • Interest from Life Sciences industries
  • 6. Knowledge Engineering and AI • What helped the resurgence of ML? – Massive data available – Massive computational power available – Few technical improvements – Success stories (Deep learning) • Do these also apply to Ontology/Sem-Web based systems? – UniProt: 5.7B triples in 2009, 30+B triples in 2017 – EBI RDF Platform (2015) – Wikidata (2014?) Source: https://tools.wmflabs.org/wikidata-todo/stats.php
  • 7. Knowledge Engineering and AI • The way information is represented has implications for what is built on it (e.g.: analytics, data mining) – Networks: are parallel branches executed in AND or in OR? – Annotations: is negative information mentioned explicitly?
  • 8. Knowledge Engineering and AI • Metadata is important in a data-centric world (and for at least some ML applications) • Knowledge representation matters, beyond metadata (examples: AND/OR in pathways, NOT in annotations…) • We are starting to have large, distributed knowledge bases – Is there a role for AI systems based on logic/KR? – Can we combine symbolic and sub-symbolic reasoning? – Is this already happening?
  • 9. Data curation • Annotation • Metadata • Standards • Model • Literature • Databases • … Source: BioCuration 2017 abstracts via wordscloud.com
  • 10. An example: public data curation https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM701607 https://www.ebi.ac.uk/rdf/services/describe?uri=http%3A%2F%2Frdf.ebi.ac.uk%2Fresource%2Fbiosamples%2Fsample%2FSAMEA1189935
  • 11. An example: public data curation (data view) https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM701607 https://www.ebi.ac.uk/rdf/services/describe?uri=http%3A%2F%2Frdf.ebi.ac.uk%2Fresource%2Fbiosamples%2Fsample%2FSAMEA1189935
      Property | Value | Ontology | Bio-characteristic?
      Sample_source_name | WT6 biological rep 1, Affy processing batch 2 | EFO_0000001 |
      Organism | Mus musculus | EFO_0000001, NCBITaxon_10090 |
      strain | 129S6/Sv/Ev | EFO_0000001 | Bio
      genotype | wild type | EFO_0000001, EFO_0005168 | Bio
      Sex | male | EFO_0000001, EFO_0001266, PATO_0000384 |
      age | 6 weeks old | EFO_0000001 | Bio
  • 12. An example: public data curation (data view)
      Property | Value | Ontology | Bio-characteristic?
      Sample_source_name | WT6 biological rep 1, Affy processing batch 2 | EFO_0000001 |
      Organism | Mus musculus | EFO_0000001, NCBITaxon_10090 |
      strain | 129S6/Sv/Ev | EFO_0000001 | Bio
      genotype | wild type | EFO_0000001, EFO_0005168 | Bio
      Sex | male | EFO_0000001, EFO_0001266, PATO_0000384 |
      age | 6 weeks old | EFO_0000001 | Bio
  • 13. An example: public data curation https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM701607 https://www.ebi.ac.uk/rdf/services/describe?uri=http%3A%2F%2Frdf.ebi.ac.uk%2Fresource%2Fbiosamples%2Fsample%2FSAMEA1189935 Supports: • Aggregation • Analysis • Search • Link discovery • “Machine learning”
  • 14. Can we use AI for Data Curation? Why? – Data curation is an intellectually intensive and time-consuming activity – Given the increasing role and amount of data, curation risks becoming a bottleneck (Figure: example of exponential growth in data)
  • 15. AI for data curation: characteristics and constraints • Can we automate data curation? • Difficult: – Missing data – Discretion (e.g.: level of granularity) • Looks reasonable: – Repetition – Consistency – Data/distance evaluations (clustering/attractors) • We need to combine human aspects and machine-processable aspects
  • 16. AI for data curation. Framing the problem: what • Should this value be normalized? • Meaning, e.g.: is “age” the same as “years”? • Confidence: is this information true? • The need, e.g.: is this required information, and when? • Is this a valid identifier? Example: extract from NCBI GEO GSM701607
  • 17. AI for data curation. Framing the problem: how. We consider curation activities as functions in a “curation space” that is exemplified via a “curation record”. Example: extract from NCBI GEO GSM701607 (only a subset of fields from the previous slide is considered)
      Validation state (Confidence): Valid Valid Valid
      Curation goal (The need): Required Required Required Required Required
      Semantic type [1] (Meaning): Identifier about Sample, ID [2] about Organism, Name about Organism, Name about Gender, Identifier about Gender, Description about Age, Age Age Unit about Age
      Field Name (the “location” in the source): ID, taxID, Organism, Gender, age
      Value: GSM701607, 10090, Mus Musculus, 6 weeks old
      [1] All semantic types are expressed via an ontology (here presented as a simplified definition)
      [2] Identifiers also require a domain specification
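The curation record on slide 17 can be sketched as a small data structure. This is a minimal illustration, not the authors' implementation: the `Cell` class, its attribute names and the two helper functions are assumptions, and the way goals and validation states are assigned to individual fields below is chosen for the example, since the flattened slide does not preserve the column alignment.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Cell:
    """One column of a curation record (hypothetical structure)."""
    field_name: str                   # the "location" in the source
    value: Optional[str]              # None when the value is missing
    semantic_type: str                # simplified stand-in for an ontology term
    goal: Optional[str] = None        # curation goal, e.g. "Required"
    validation: Optional[str] = None  # validation state, e.g. "Valid"


# Illustrative record for GSM701607; goal/validation placement is assumed.
record = [
    Cell("ID", "GSM701607", "Identifier about Sample", "Required", "Valid"),
    Cell("taxID", "10090", "ID about Organism", "Required", "Valid"),
    Cell("Organism", "Mus musculus", "Name about Organism", "Required", "Valid"),
    Cell("Gender", None, "Name about Gender", "Required"),
    Cell("age", "6 weeks old", "Description about Age", "Required"),
]


def missing_required(cells: List[Cell]) -> List[str]:
    """Fields that the curation goal requires but that have no value yet."""
    return [c.field_name for c in cells if c.goal == "Required" and not c.value]


def unvalidated(cells: List[Cell]) -> List[str]:
    """Fields that carry a value which has not been marked as valid."""
    return [c.field_name for c in cells if c.value and c.validation != "Valid"]
```

Keeping value, meaning, need and confidence side by side is what lets each curation activity be described as a function that reads some rows of the record and writes others.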
  • 18. AI and data curation: using a record to modularize curation processes • Different classes of operations – Schema mapping (assign a type) – Standard setting (assign a goal) – Validation (setting a validation value) (Figure: example curation records for GSM701607 with fields such as ID, Organism, Gender and age, showing the semantic type, curation goal and validation state rows that these operations set)
  • 19. AI and data curation: using a record to modularize curation processes • Different classes of operations – Normalization (filling a column). Example: for GSM701607 the Gender name is “male” and the “Identifier about Gender” column is filled with PATO:0000384 – Enrichment (adding a column). Example: a record with ID GSM701607, taxID 10090, Organism Mus Musculus and age “6 weeks old” gains a new column “EBI ref.” (semantic type: Identifier about Sample) with value SAMEA1189935
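The two operation classes on slide 19 can be sketched as pure functions that return a new record. This is only an illustration under assumptions: the dictionary layout, the `GenderID` key and the `GENDER_TERMS` lookup are hypothetical, and only the male to PATO:0000384 pair and the SAMEA1189935 reference come from the slides.

```python
from copy import deepcopy

# Hypothetical lookup table; only this single pair appears in the slides.
GENDER_TERMS = {"male": "PATO:0000384"}


def normalize_gender(record):
    """Normalization: fill the existing 'Identifier about Gender' column."""
    out = deepcopy(record)  # leave the input record untouched
    name = out["Gender"]["value"]
    out["GenderID"]["value"] = GENDER_TERMS.get(name.lower()) if name else None
    return out


def enrich(record, field_name, semantic_type, value):
    """Enrichment: add a new column to the record."""
    out = deepcopy(record)
    out[field_name] = {"type": semantic_type, "value": value}
    return out


record = {
    "ID": {"type": "Identifier about Sample", "value": "GSM701607"},
    "Gender": {"type": "Name about Gender", "value": "male"},
    "GenderID": {"type": "Identifier about Gender", "value": None},
}

normalized = normalize_gender(record)
enriched = enrich(normalized, "EBI ref.", "Identifier about Sample", "SAMEA1189935")
```

Normalization changes the content of a column that already exists, while enrichment changes the shape of the record; keeping them as separate functions makes that difference explicit.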
  • 20. Big picture: quantity/quality trade-off (Figure: quality/validity plotted against time/cost) • Is the optimal trade-off the same for all data? • Can this change for the same data over time and across use cases? • Can we embed a “cost function” in curation processes?
  • 21. Big picture: (meta)data evolution, immutability
      Stage 1. Initial condition: organism name present, missing ID. Entity: 1234, Information: V1, Meta-Info: V1. Record: ID GSM701607, Gender “male”.
      Stage 2. Identifier extracted, not verified. Entity: 1234, Information: V2, Meta-Info: V2. Record: ID GSM701607, Gender “male”, Identifier about Gender PATO:0000384.
      Stage 3. Identifier extracted and verified. Entity: 1234, Information: V2, Meta-Info: V3. Record: same values as stage 2, with the extracted identifier now also marked Valid.
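The three stages on slide 21 suggest an append-only history in which information and meta-information are versioned separately. The following is a minimal sketch of that idea; the `Version` class, the `revise` helper and the rule that the information version only increases when values change are assumptions made for illustration.

```python
from dataclasses import dataclass
from types import MappingProxyType


@dataclass(frozen=True)
class Version:
    """An immutable snapshot of one entity (hypothetical structure)."""
    entity: str
    information: int              # version of the data values
    meta_info: int                # version of the metadata
    values: MappingProxyType      # read-only view of field values
    validation: MappingProxyType  # read-only view of validation states


def revise(prev, values=None, validation=None):
    """Return a new snapshot; the previous one is never modified."""
    new_values = {**prev.values, **(values or {})}
    new_validation = {**prev.validation, **(validation or {})}
    changed = new_values != dict(prev.values)
    return Version(
        prev.entity,
        prev.information + (1 if changed else 0),  # bump only when values change
        prev.meta_info + 1,                        # every revision is a metadata event
        MappingProxyType(new_values),
        MappingProxyType(new_validation),
    )


v1 = Version("1234", 1, 1,
             MappingProxyType({"ID": "GSM701607", "Gender": "male"}),
             MappingProxyType({"ID": "Valid", "Gender": "Valid"}))
v2 = revise(v1, values={"GenderID": "PATO:0000384"})  # extracted, not verified
v3 = revise(v2, validation={"GenderID": "Valid"})     # extracted and verified
history = [v1, v2, v3]
```

With these assumptions the history reproduces the V1/V1, V2/V2, V2/V3 progression shown on the slide, and earlier states remain available for audit.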
  • 23. Data and metadata transformations (deterministic actions + extractors) • Curation processes can be expressed (by curators) in terms of rules • Rules embed “atomic operations”, e.g.: extractors, transformations,… • Simple rules go a very long way… Example rule configuration (excerpt as shown on the slide):
      <ruleConfig method="Extract">
        <param name="setType" value="UNIT"/>
        <param name="setAmbiguous" value="true"/>
        <param name="setFullMatch" value="false"/>
        <param name="setResultInJson" value="false"/>
        <param name="setSimpleJson" value="false"/>
        <param name="setText">
          <ruleConfig method="GetCell">
            <param name="setAttr" value="AgeDescription"/>
            <param name="setBase" value="XCF_1"/>
          </ruleConfig>
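To make the idea of an “atomic operation” concrete, here is a small Python sketch of an extractor applied to a cell. It is not the rule engine behind the XML above, whose semantics are not described in the slides; the function names, the regular expression and the `full_match` flag are assumptions that only loosely mirror the `GetCell`, `Extract` and `setFullMatch` names in the configuration.

```python
import re

# Assumed pattern: a number followed by a time unit, e.g. "6 weeks".
_AGE = re.compile(r"(\d+(?:\.\d+)?)\s*(day|week|month|year)s?\b", re.IGNORECASE)


def get_cell(record, attr):
    """Read one attribute from a record, returning '' when it is absent."""
    return record.get(attr, "")


def extract_unit(text, full_match=False):
    """Extract (amount, unit) from free text, or None if nothing is found.

    With full_match=True the whole text must be exactly 'amount unit'.
    """
    text = text.strip()
    match = _AGE.fullmatch(text) if full_match else _AGE.search(text)
    if match is None:
        return None
    return float(match.group(1)), match.group(2).lower()


record = {"AgeDescription": "6 weeks old"}
result = extract_unit(get_cell(record, "AgeDescription"))
```

Composing a cell reader with an extractor in this way corresponds to the nesting of the two `ruleConfig` elements in the excerpt.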
  • 24. Abstract rules and meta-rules • Rules can rely on abstraction/inference to be more generic • They can also be used to produce meta-information. Example rules (pseudo-syntax) • Compute missing identifier: If (E.X.type=“Identifier” ^ E.X.Goal=“Required” ^ E.X.Value=“” ^ exists (E.Y: E.Y.type.about=E.X.type.about and E.Y.type=“Description” and E.Y.Value!=“”)) then E.X.Value=extract(isAbout(E.Y.type), E.Y.value) • Set a curation goal: If subClassOf(E.OrganismID.Value, NCBI_40674), then E.GenderID.Goal=“Required” • Assert validity on condition: If one identifier is unambiguously extracted from a species name, then Validation State=Valid
      Example record:
      Validation state: Valid Valid
      Curation goal: Required Required Required
      Semantic type: Identifier about Sample, ID about Organism, Name about Organism, Name about Gender, Identifier about Gender
      Field Name (the “location” in the source): ID, taxID, Organism, Gender
      Value: GSM701607, 10090, Mus Musculus
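The three example rules can be sketched in Python as functions over a record. This is a toy reading of the pseudo-syntax under several assumptions: the lookup tables are invented stand-ins for real extractors and for the NCBI taxonomy (the direct edge from 10090 to 40674 skips all intermediate ranks), the record layout is hypothetical, and the first rule is relaxed to accept a Name as well as a Description as its source.

```python
# Toy lookup tables (assumptions for illustration only).
EXTRACTORS = {
    "Organism": {"mus musculus": ["10090"]},
    "Gender": {"male": ["PATO:0000384"]},
}
TAXON_PARENT = {"10090": "40674"}  # toy shortcut; the real lineage is much longer


def cell(kind, about, value=None, goal=None):
    return {"kind": kind, "about": about, "value": value,
            "goal": goal, "validation": None}


def sub_class_of(taxid, ancestor):
    """Walk up the toy taxonomy until the ancestor or the root is reached."""
    while taxid is not None:
        if taxid == ancestor:
            return True
        taxid = TAXON_PARENT.get(taxid)
    return False


def compute_missing_identifiers(record):
    """Rules 1 and 3: fill required identifiers, mark unambiguous ones Valid."""
    for x in record.values():
        if x["kind"] != "Identifier" or x["goal"] != "Required" or x["value"]:
            continue
        for y in record.values():
            if (y is not x and y["about"] == x["about"]
                    and y["kind"] in ("Name", "Description") and y["value"]):
                found = EXTRACTORS.get(x["about"], {}).get(y["value"].lower(), [])
                if len(found) == 1:  # exactly one candidate: unambiguous
                    x["value"] = found[0]
                    x["validation"] = "Valid"
                break


def set_gender_goal(record):
    """Rule 2: a gender identifier is required for mammals (NCBI taxon 40674)."""
    if sub_class_of(record["taxID"]["value"], "40674"):
        record["GenderID"]["goal"] = "Required"


record = {
    "taxID": cell("Identifier", "Organism", goal="Required"),
    "Organism": cell("Name", "Organism", "Mus musculus", "Required"),
    "Gender": cell("Name", "Gender", "male"),
    "GenderID": cell("Identifier", "Gender"),
}
compute_missing_identifiers(record)  # fills taxID from the organism name
set_gender_goal(record)              # meta-rule: sets a new curation goal
compute_missing_identifiers(record)  # the new goal triggers a further extraction
```

The second call to `compute_missing_identifiers` shows how a meta-rule that sets a goal can cause an ordinary rule to fire on the next pass.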
  • 25. “Approximate” transformations • Some transformations cannot (easily) be expressed in terms of rules – Complex and ad hoc relations – Discretionary elements • Examples: – Entity de-duplication: whether two mentions of homonymous authors refer to the same author is a complex function of an extended range of the author’s features (where they work, contact information, subject of study,…) – Schema mapping: determining the meaning of an attribute (e.g.: time) is a complex function of the values this attribute takes, as well as other parameters (is this a duration, a time point, or an execution timestamp?). Is “Sample tracking number” to be mapped to “Tracking number” or to “Identifier”?
  • 26. Implementation of de-duplication and schema mapping via Tamr • One approach that we have chosen to provide approximate schema-mapping and de-duplication functions is via Tamr (tamr.com) • Tamr is a data unification platform that combines machine learning with human expertise. – E.g.: to support schema mapping, Tamr combines several features: – Data distribution – Property names – Property metadata – It learns how to compose such functions via machine learning, through an iterative process where human experts can provide input and improve predictions
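The general idea of combining several features into a mapping score can be illustrated with a few lines of Python. This is emphatically not Tamr's algorithm or API: it is a hand-weighted sketch in which the weights are fixed rather than learned, name similarity comes from the standard library's `difflib`, value overlap is a plain Jaccard index, and all attribute names and sample values are made up.

```python
from difflib import SequenceMatcher


def name_similarity(a, b):
    """String similarity of two property names, between 0 and 1."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def value_overlap(a, b):
    """Jaccard overlap of two sets of observed values, between 0 and 1."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0


def suggest_mappings(source_name, source_values, targets,
                     w_name=0.5, w_values=0.5):
    """Rank target attributes by a weighted mix of the two features."""
    scored = [
        (target, w_name * name_similarity(source_name, target)
                 + w_values * value_overlap(source_values, values))
        for target, values in targets.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Made-up sample data for the "Sample tracking number" question on slide 25.
targets = {
    "Tracking number": ["TN-002", "TN-003", "TN-004"],
    "Identifier": ["GSM701607", "GSM701608"],
}
suggestions = suggest_mappings(
    "Sample tracking number", ["TN-001", "TN-002", "TN-003"], targets)
```

In a learning system the fixed weights would be replaced by a model trained on the mappings that curators confirm or reject, which is the iterative loop described on the slide.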
  • 27. Schema-mapping (Tamr) Users are offered a range of potential mappings, each with a confidence score. They can confirm them or suggest different mappings. New predictions are routinely provided as more input is accumulated. (Figure: user interface for curators showing potential attribute matches)
  • 28. Entity de-duplication (Tamr) Users are shown a set of potential duplicates with a confidence score. They can accept or reject such suggestions, thus providing training data and iteratively refining predictions. (Figure: user interface for curators showing potential duplicates)
  • 29. Entity de-duplication (Tamr) Details of the implementation of the deduplication process (courtesy of Tamr)
  • 30. Re-introducing logic • Can we predict (or suggest) the association between parameters and entities in a template? – An ontology models the “real world”: entities, qualities, processes – Parameters are annotated with axioms based on this ontology – Inference provides multiple classifications of parameters, as well as possible/necessary associations between parameters and entities. • Can this work?
  • 31. Re-introducing logic (Figure: extract from an ontology representing entities and qualities) (Figure: example of an axiomatic mapping between a parameter and an ontology of entities and qualities) Deductions for parameter ReportID: must refer to: Report, Document, Descriptive Entity, Concrete Entity, Entity, Information Entity, Immaterial Entity; may refer to: Report, InternalReport
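One way to read the “must refer to” and “may refer to” deductions is as a walk up and down a class hierarchy: what a parameter must refer to is the asserted class plus all its superclasses, and what it may refer to is that class plus its subclasses. The sketch below illustrates this reading in Python without a reasoner; the class names come from the slide, but the subclass edges are guessed, since the ontology extract itself is only available as a figure.

```python
# Guessed subclass edges (assumption); only the class names are from the slide.
SUPERCLASSES = {
    "InternalReport": ["Report"],
    "Report": ["Document"],
    "Document": ["Descriptive Entity", "Information Entity"],
    "Descriptive Entity": ["Concrete Entity"],
    "Concrete Entity": ["Entity"],
    "Information Entity": ["Immaterial Entity"],
    "Immaterial Entity": ["Entity"],
    "Entity": [],
}


def ancestors(cls):
    """All direct and indirect superclasses of cls."""
    found, stack = set(), list(SUPERCLASSES.get(cls, []))
    while stack:
        current = stack.pop()
        if current not in found:
            found.add(current)
            stack.extend(SUPERCLASSES.get(current, []))
    return found


def descendants(cls):
    """All direct and indirect subclasses of cls."""
    return {c for c in SUPERCLASSES if cls in ancestors(c)}


def must_refer_to(cls):
    """Anything of this class is necessarily also of these classes."""
    return {cls} | ancestors(cls)


def may_refer_to(cls):
    """The parameter could refer to the class itself or a more specific one."""
    return {cls} | descendants(cls)


# Assume ReportID is annotated as referring to some Report.
must = must_refer_to("Report")
may = may_refer_to("Report")
```

Under the guessed hierarchy this reproduces the two lists shown for ReportID; a real implementation would obtain the same sets from an OWL reasoner over the annotated ontology.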
  • 32. Exploring automatic ontology matching • 26 submissions. Algorithms covering structural approaches, axiomatic mappings and use of background knowledge • Phenotype track sponsored by the Pistoia Alliance Ontologies Mapping Project • Evaluation results for Phenotype track submitted to Journal of Biomedical Semantics http://oaei.ontologymatching.org/2016
  • 33. Conclusions: on rules, standards and data ethnography • Data curation: “AI” may help (not limited to ML) – Formal knowledge representation is part of the goal • The need for explanations – We need to define (document) a process – We have theorems for proofs: can we do without? – Is there a role for “ML” gurus? • The “human side” of data – Data normalization is based on assumptions (e.g.: what can be considered the same and what cannot): there is a cultural side to this. – Would we accept an AI “editor”?
  • 34. Acknowledgments • NIBR • Daniel Cronenberger • Ming Fang • Frederic Sutter • Anosha Siripala • Fabien Pernot • Jean Marc von Allmen • Martin Petracchi • Dorothy Reilly • Pierre Parisot • Therese Vachon • Tamr.com • Pistoia Alliance Ontology Matching Project team