Diversity and Depth: Implementing AI across many long tail domains
Paul Groth | pgroth.com | @pgroth
Elsevier Labs
Thanks to Helena Deus, Tony Scerri, Sujit Pal, Corey Harper,
Ron Daniel, Brad Allen
IJCAI 2018 – Industry Day
Introducing Elsevier
Content Technology
Chemistry database
500m published experimental facts
User queries
13m monthly users on ScienceDirect
Books
35,000 published books
Drug Database
100% of drug information from
pharmaceutical companies updated daily
Research
16% of the world’s research data and
articles published by Elsevier
1,000
technologists employed by Elsevier
Machine learning
Over 1,000 predictive models trained on 1.5
billion electronic health care events
Machine reading
475m facts extracted from
ScienceDirect
Collaborative filtering:
1bn scientific articles added by 2.5m
researchers analyzed daily to generate over
250m article recommendations
Semantic Enhancement
Knowledge on 50m chemicals captured as 11B
facts
June 15, 2018
Bloom, N., Jones, C. I., Van Reenen, J., &
Webb, M. (2017). Are ideas getting harder to
find? (No. w23782). National Bureau of
Economic Research.
Slides: https://web.stanford.edu/~chadj/slides-ideas.pdf
INFORMATION OVERLOAD
IN PRACTICE
Gregory, K., Groth, P., Cousijn, H., Scharnhorst, A., & Wyatt, S. (2017).
Searching Data: A Review of Observational Data Retrieval Practices.
arXiv preprint arXiv:1707.06937.
Some observations from @gregory_km
survey & interviews :
• The needs and behaviors of specific user groups (e.g.
early career researchers, policy makers, students) are not
well documented.
• Participants require details about data collection and
handling
• Reconstructing data tables from journal articles,
using general search engines, and making direct data
requests are common.
K Gregory, H Cousijn, P Groth, A Scharnhorst, S Wyatt (2018).
Understanding Data Retrieval Practices: A Social Informatics Perspective.
arXiv preprint arXiv:1801.04971
PROVIDING ANSWERS FOR RESEARCHERS, DOCTORS AND
NURSES: ANSWERS NEED AI
My work is moving towards a new field; what should I know?
• Journal articles, reference works, profiles of researchers, funders &
institutions
• Recommendations of people to connect with, reading lists, topic pages
How should I treat my patient given her condition & history?
• Journal articles, reference works, medical guidelines, electronic health
records
• Treatment plan with alternatives personalized for the patient
How can I master the subject matter of the course I am taking?
• Course syllabus, reference works, course objectives, student history
• Quiz plan based on the student’s history and course objectives
RECOGNIZING DECISION GRAPHS IN MEDICAL CONTENT:
MOTIVATING USE CASE
• Clinical Key is Elsevier’s flagship medical
reference search product
• Clinicians prefer “answers” in the form of
tables or flowcharts
• Eliminates need to page through retrieved content to find
actionable information
• Clinical Key provides a sidebar section
displaying answers, but this feature
depends on very labor-intensive manual
curation
• Solution: automatically classify images in
medical content corpus at index time
• Benefits: lower cost and improved user
experience
“Curated Answers”
section displays medical
decision graphs
RECOGNIZING DECISION GRAPHS IN MEDICAL CONTENT:
SOLUTION
• Perfect fit for transfer learning approach
• Input to the classifier is an image and output is one of 8 classes:
Photo, Radiological, Data graphic, Illustration, Microscopy,
Flowchart, Electrophoresis, Medical decision graph
• Image dataset is augmented by producing variations of the training
images by rotating, flipping, transposing, jittering, etc.
• Reusing all but the last two Dense layers of a pre-trained model (VGG-CNN, available from Caffe’s “model zoo”)
• VGG-CNN was trained on ImageNet (14 million images from the Web, 1000 general topic classes, e.g., Cat, Airplane, House)
• Last layer is a multinomial logistic regression (or softmax) classifier
• Model trained on 10,167 images with a 70/30
train/test split
• Achieves 93% test set accuracy
• Evaluated image + caption text model but did not get a big performance
boost
• Searchable image base used to support training set
and model development
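A minimal sketch of the augmentation and the 70/30 split described above, in plain Python on images represented as nested lists. The function names, the seed, and the set of variations are illustrative assumptions, not the pipeline actually used; jittering and the VGG-CNN feature extraction are left out.

```python
import random


def rotate90(img):
    """Rotate a 2-D image (list of rows) 90 degrees clockwise."""
    return [list(row) for row in zip(*img[::-1])]


def flip_horizontal(img):
    """Mirror each row left to right."""
    return [row[::-1] for row in img]


def flip_vertical(img):
    """Reverse the order of the rows."""
    return [list(row) for row in img[::-1]]


def transpose(img):
    """Swap rows and columns."""
    return [list(row) for row in zip(*img)]


def augment(img):
    """Return the original image plus four simple variations of it."""
    return [img, rotate90(img), flip_horizontal(img), flip_vertical(img), transpose(img)]


def train_test_split(items, train_fraction=0.7, seed=0):
    """Shuffle reproducibly and cut into train and test parts (70/30 by default)."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    cut = int(round(len(shuffled) * train_fraction))
    return shuffled[:cut], shuffled[cut:]


# Hypothetical usage: split first, then augment only the training images,
# so that variations of a test image never leak into training.
train_ids, test_ids = train_test_split(range(10167))
```

Splitting before augmenting is a design choice of this sketch; the slides do not say in which order the two steps were applied.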
H-GRAPH KNOWLEDGE GRAPH
• Total concepts = 540,632
• 100+ person years of clinical expert knowledge
Open Information Extraction
• Knowledge bases are populated by scanning text and doing Information Extraction
• Most information extraction systems are looking for very specific things, like drug-drug interactions
• Best accuracy for that one kind of data, but misses out on all the other concepts and relations in the text
• For broad knowledge base, use Open Information Extraction that only uses some knowledge of grammar
• One weird trick for open information extraction …
• ReVerb*:
1. Find “relation phrases” starting with a verb and ending with a verb or preposition
2. Find noun phrases before and after the relation phrase
3. Discard relation phrases not used with multiple combinations of arguments.
Example sentence: In addition, brain scans were performed to exclude other causes of dementia.
* Fader et al. Identifying Relations for Open Information Extraction
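The three steps above can be sketched on a hand-tagged sentence. This is a simplified illustration, not the ReVerb implementation: the coarse part-of-speech tags are supplied by hand, the tag sets and function names are assumptions, and the noun phrase chunker only looks at adjacent determiners, adjectives, and nouns.

```python
from collections import defaultdict

REL_TAGS = {"VERB", "PART", "ADV", "ADP"}  # tokens allowed inside a relation phrase
REL_END = {"VERB", "ADP"}                  # a relation phrase must end with one of these
NP_TAGS = {"DET", "ADJ", "NOUN"}           # tokens allowed inside a noun phrase


def words(tagged, start, end):
    return " ".join(word for word, _ in tagged[start:end])


def extract_triples(tagged):
    """Steps 1 and 2: find relation phrases and the noun phrases around them."""
    triples = []
    i, n = 0, len(tagged)
    while i < n:
        if tagged[i][1] != "VERB":
            i += 1
            continue
        # Step 1: grow the phrase from the verb, then trim it so that it
        # ends with a verb or a preposition.
        end = i
        while end < n and tagged[end][1] in REL_TAGS:
            end += 1
        while tagged[end - 1][1] not in REL_END:
            end -= 1
        # Step 2: take the noun phrases directly before and after it.
        left = i
        while left > 0 and tagged[left - 1][1] in NP_TAGS:
            left -= 1
        right = end
        while right < n and tagged[right][1] in NP_TAGS:
            right += 1
        if left < i and right > end:
            triples.append((words(tagged, left, i), words(tagged, i, end), words(tagged, end, right)))
        i = end
    return triples


def filter_by_support(triples, min_pairs=2):
    """Step 3: keep relation phrases seen with several distinct argument pairs."""
    pairs_by_relation = defaultdict(set)
    for subj, rel, obj in triples:
        pairs_by_relation[rel].add((subj, obj))
    return [t for t in triples if len(pairs_by_relation[t[1]]) >= min_pairs]


SENTENCE = [
    ("brain", "NOUN"), ("scans", "NOUN"), ("were", "VERB"), ("performed", "VERB"),
    ("to", "PART"), ("exclude", "VERB"), ("other", "ADJ"), ("causes", "NOUN"),
    ("of", "ADP"), ("dementia", "NOUN"),
]
```

On the example sentence, `extract_triples(SENTENCE)` yields the single triple (brain scans, were performed to exclude, other causes); the simple chunker stops at "of", so "of dementia" is not attached to the object.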
ReVerb output
After ReVerb pulls out noun phrases, match them up to EMMeT concepts
Discard rare concepts, relations, or relations that are not used with many different concepts
Number of SD documents scanned: 14,000,000
Extracted ReVerb triples: 473,350,566
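A rough sketch of the matching and pruning step, assuming a simple exact-match lookup after normalization. The concept index, the identifiers such as "C1", and the threshold are hypothetical; the slides do not describe how noun phrases are actually resolved to EMMeT concepts.

```python
from collections import Counter


def normalize(phrase):
    """Lower-case and collapse whitespace before lookup."""
    return " ".join(phrase.lower().split())


def map_to_concepts(triples, concept_index):
    """Keep only triples whose subject and object both match a known concept."""
    mapped = []
    for subj, rel, obj in triples:
        s = concept_index.get(normalize(subj))
        o = concept_index.get(normalize(obj))
        if s is not None and o is not None:
            mapped.append((s, rel, o))
    return mapped


def drop_rare_concepts(mapped, min_count=2):
    """Discard triples that mention a concept seen fewer than min_count times."""
    counts = Counter()
    for s, _, o in mapped:
        counts[s] += 1
        counts[o] += 1
    return [t for t in mapped if counts[t[0]] >= min_count and counts[t[2]] >= min_count]
```

The same counting pattern can be applied to relations instead of concepts to drop rare relations.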
Universal schemas - Initialization
• Method to combine ‘facts’ found by
machine reading with stronger
assertions from ontology.
• Build ExR matrix with entity-pairs
as rows and relations as columns.
• Relation columns can come from
EMMeT, or from ReVerb
extractions.
• Cells contain 1.0 if that pair of
entities is connected by that
relation.
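The matrix construction above can be sketched in a few lines. The function name and the dense list-of-lists layout are assumptions for illustration; at the scale reported in these slides a sparse representation would be needed.

```python
def build_matrix(triples):
    """Build the ExR matrix: entity pairs as rows, relations as columns."""
    pairs = sorted({(subj, obj) for subj, _, obj in triples})
    relations = sorted({rel for _, rel, _ in triples})
    row_of = {pair: i for i, pair in enumerate(pairs)}
    col_of = {rel: j for j, rel in enumerate(relations)}
    matrix = [[0.0] * len(relations) for _ in pairs]
    for subj, rel, obj in triples:
        # A cell is 1.0 when that pair of entities is connected by that relation.
        matrix[row_of[(subj, obj)]][col_of[rel]] = 1.0
    return pairs, relations, matrix
```

Relations from the ontology and surface relations from text extraction simply become different columns of the same matrix.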
Universal schemas - Prediction
• Factorize matrix to ExK and KxR,
then recombine.
• “Learns” the correlations between
text relations and EMMeT relations,
in the context of pairs of objects.
• Find new triples to go into EMMeT
e.g., (glaucoma,
has_alternativeProcedure,
biofeedback)
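A toy sketch of the factorize-and-recombine step. It uses plain squared-error gradient descent over the full matrix, which is a simplification chosen for brevity and not necessarily the objective used in the work described here. The matrix, the column names in the comments, and all hyperparameters are invented for illustration.

```python
import random


def factorize(matrix, k, steps=2000, lr=0.05, seed=0):
    """Approximate matrix (E x R) by the product of an E x K and a K x R factor."""
    rng = random.Random(seed)
    n_rows, n_cols = len(matrix), len(matrix[0])
    rows = [[rng.uniform(0.05, 0.15) for _ in range(k)] for _ in range(n_rows)]  # E x K
    cols = [[rng.uniform(0.05, 0.15) for _ in range(n_cols)] for _ in range(k)]  # K x R
    for _ in range(steps):
        # Residual between the observed matrix and the current reconstruction.
        err = [[matrix[i][j] - sum(rows[i][f] * cols[f][j] for f in range(k))
                for j in range(n_cols)] for i in range(n_rows)]
        grad_rows = [[sum(err[i][j] * cols[f][j] for j in range(n_cols)) for f in range(k)]
                     for i in range(n_rows)]
        grad_cols = [[sum(err[i][j] * rows[i][f] for i in range(n_rows)) for j in range(n_cols)]
                     for f in range(k)]
        for i in range(n_rows):
            for f in range(k):
                rows[i][f] += lr * grad_rows[i][f]
        for f in range(k):
            for j in range(n_cols):
                cols[f][j] += lr * grad_cols[f][j]
    return rows, cols


def recombine(rows, cols):
    """Multiply the factors back together to score every cell, observed or not."""
    k = len(cols)
    return [[sum(r[f] * cols[f][j] for f in range(k)) for j in range(len(cols[0]))]
            for r in rows]


# Hypothetical columns: text "treats", ontology "has_treatment", text "causes".
MATRIX = [
    [1.0, 1.0, 0.0],  # pair 0: seen in text and in the ontology
    [1.0, 1.0, 0.0],  # pair 1: seen in text and in the ontology
    [1.0, 0.0, 0.0],  # pair 2: seen in text only
    [0.0, 0.0, 1.0],  # pair 3: unrelated text relation
]
```

With `rows, cols = factorize(MATRIX, k=1)`, the recombined matrix gives pair 2 a clearly higher score in the ontology column than pair 3, because the text relation co-occurs with the ontology relation in pairs 0 and 1. High-scoring cells that were not observed are the candidate new triples.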
[Pipeline diagram. Components shown: Content; Taxonomy; Open Information Extraction; Triple Extraction; Entity Resolution; Concept Resolution; Surface form relations; Structured relations; Matrix Construction; Universal schema; Matrix Factorization; Factorization model; Matrix Completion; Predicted relations; Curation; Knowledge graph. Counts shown: 14M SD articles; 475 M triples; 3.3 million relations; 49 M relations; ~15k -> 1M entries.]
Paul Groth, Sujit Pal, Darin McBeath, Brad Allen, Ron Daniel
“Applying Universal Schemas for Domain Specific Ontology Expansion”
5th Workshop on Automated Knowledge Base Construction (AKBC) 2016
Michael Lauruhn and Paul Groth. "Sources of Change for Modern Knowledge Organization Systems." Knowledge Organization 43, no. 8 (2016).
ONTOLOGY MAINTENANCE
• Pretty good F measure around 0.7
• Good enough as a recommender to experts
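For reference, the F measure is the harmonic mean of precision and recall. A small sketch with made-up counts; the slides report only the F measure, not the underlying precision, recall, or counts.

```python
def f_measure(true_pos, false_pos, false_neg):
    """Balanced F measure (F1) from raw counts."""
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts: 7 correct suggestions, 3 wrong ones, 3 missed relations.
example = f_measure(true_pos=7, false_pos=3, false_neg=3)
```

At this level roughly three in ten suggestions would be wrong in the hypothetical case, which is why the output is framed as a recommendation for experts to review rather than as an automatic update.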
TOPIC PAGES
[Screenshot of a topic page showing: Definition; Related terms; Relevant ranked snippets]
GOOD & BAD DEFINITIONS
Concept: Inferior Colliculus
Bad: By comparing activation obtained in an equivalent standard ( non-cardiac-gated ) fMRI experiment , Guimaraes and colleagues found that cardiac-gated activation maps yielded much greater activation in subcortical nuclei , such as the inferior colliculus .
Good: The inferior colliculus (IC) is part of the tectum of the midbrain (mesencephalon) comprising the quadrigeminal plate (Lamina quadrigemina). It is located caudal to the superior colliculus on the dorsal surface of the mesencephalon ( Figure 36.7 FIGURE 36.7Overview of the human brainstem; view from dorsal. The superior and inferior colliculi form the quadrigeminal plate. Parts of the cerebellum are removed.). The ventral border is formed by the lateral lemniscus. The inferior colliculus is the largest nucleus of the human auditory system. …
Concept: Purkinje cells
Bad: It is felt that the aminopyridines are likely to increase the excitability of the potassium channel-rich cerebellar Purkinje cells in the flocculus ( Etzion and Grossman , 2001 ) .
Good: Purkinje cells are the most salient cellular elements of the cerebellar cortex. They are arranged in a single row throughout the entire cerebellar cortex between the molecular (outer) layer and the granular (inner) layer. They are among the largest neurons and have a round perikaryon, classically described as shaped “like a chianti bottle,” with a highly branched dendritic tree shaped like a candelabrum and extending into the molecular layer where they are contacted by incoming systems of afferent fibers from granule neurons and the brainstem…
Concept: Olfactory Bulb
Bad: The most common sites used for induction of kindling include the amygdala, perforant path , dorsal hippocampus , olfactory bulb , and perirhinal cortex.
Good: The olfactory bulb is the first relay station of the central olfactory system in the vertebrate brain and contains in its superficial layer a few thousand glomeruli, spherical neuropils with sharp borders ( Figure 1 Figure 1Axonal projection pattern of olfactory sensory neurons to the glomeruli of the rodent olfactory bulb. The olfactory epithelium in rats and mice is divided into four zones (zones 1–4). A given odorant receptor is expressed by sensory neurons located within one zone of the epithelium. Individual olfactory sensory neurons express a single odorant receptor…
HOW - OVERVIEW
Content
Books, Articles, Ontologies ...
• Identification of concepts
• Disambiguation
• Domain/sub-domain identification
• Abbreviations, variants
• Gazetteering
• Identification and classification of text snippets around concepts
• Features building for concept/snippet pairs
• Lexical, syntactic, semantic, doc structure …
• Ranking concept snippet pairs
• Machine learning
• Hand made rules
• Similarities
• Deduplication
Technologies
NLP, ML
• Curation
• White list driven
• Black list
• Corrections/improvements
• Evaluation
• Gold set by domain
• Random set by domain
• By SMEs (Subject Matter Experts)
• Automation
• Content Enrichment Framework
• Taxonomy coverage extension
Knowledge Graph
Concepts, snippets, meta data, …
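To make the "features" and "hand made rules" steps concrete, here is a small sketch that scores concept/snippet pairs with three lexical features. The features, the weights, and the function names are invented for illustration; they are not the rules or the model used for the topic pages.

```python
import re

# Hypothetical hand-set weights for each feature.
WEIGHTS = {"starts_with_concept": 2.0, "concept_then_copula": 1.5, "has_citation": -1.0}


def definition_features(concept, snippet):
    """Lexical cues that a snippet reads like a definition of the concept."""
    text = " ".join(snippet.lower().split())
    name = re.escape(concept.lower())
    return {
        # The snippet opens with the concept, optionally preceded by "the".
        "starts_with_concept": re.match(r"(the\s+)?" + name + r"\b", text) is not None,
        # The concept (plus an optional abbreviation in brackets) is followed by is/are.
        "concept_then_copula": re.search(name + r"(\s*\([^)]*\))?\s+(is|are)\b", text) is not None,
        # An author-year citation suggests a passing mention in running text.
        "has_citation": re.search(r"\(\s*[a-z][^()]*,\s*\d{4}\s*\)", text) is not None,
    }


def score(concept, snippet):
    features = definition_features(concept, snippet)
    return sum(WEIGHTS[key] for key, present in features.items() if present)


def rank_snippets(concept, snippets):
    """Order candidate snippets from most to least definition-like."""
    return sorted(snippets, key=lambda s: score(concept, s), reverse=True)
```

Applied to the good and bad examples for "Purkinje cells" above, the good snippet triggers both positive features, while the bad one only triggers the citation penalty. A learned ranker would replace the fixed weights with ones fitted to the gold sets.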
SCIENTIFIC TEXT IS CHALLENGING
698 unique relation types – 400 relation types
Open Information Extraction on Scientific Text: An Evaluation.
Paul Groth, Mike Lauruhn, Antony Scerri and Ron Daniel, Jr..
To appear at COLING 2018
Augenstein, Isabelle, et al. "SemEval 2017 Task 10:
ScienceIE-Extracting Keyphrases and Relations from
Scientific Publications." Proceedings of the 11th
International Workshop on Semantic Evaluation
(SemEval-2017). 2017.
SCIENTIFIC TEXT IS CHALLENGING
THE CROWD ISN’T AN EXPERT
AMIRSYS
Burger and Beans – weakly supervised/joint embeddings
[Figure: hypersphere of joint embeddings, showing an image vector, the correct text vector, and an incorrect text vector]
Engilberge, Martin, Louis Chevallier, Patrick Pérez and Matthieu Cord. “Finding beans in burgers:
Deep semantic-visual embedding with localization.” CoRR abs/1804.01720 (2018)
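The picture of an image vector sitting closer to its correct text vector than to an incorrect one can be written as a margin-based triplet ranking loss on normalized vectors. This is a generic sketch of that idea; the margin value and function names are assumptions, and it is not taken from the cited paper's code.

```python
import math


def normalize(vector):
    """Project a vector onto the unit hypersphere."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def cosine(a, b):
    """Cosine similarity, i.e. the dot product of the normalized vectors."""
    return sum(x * y for x, y in zip(normalize(a), normalize(b)))


def triplet_loss(image, correct_text, incorrect_text, margin=0.2):
    """Zero once the correct text beats the incorrect one by at least the margin."""
    return max(0.0, margin - cosine(image, correct_text) + cosine(image, incorrect_text))
```

Training pushes this loss towards zero over many (image, correct text, incorrect text) triples, which pulls matching pairs together on the hypersphere and pushes mismatched pairs apart.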
APPLYING THE STATE OF THE ART
[Diagram labels: Image Vector, Text Vector]
1. ResNet152 (not ResNet50 as usual)
2. Had to “pre-warm” with ImageNet – separate model/task
3. From Weldon model (had to be ported to Python from Lua)
4. Had to find the right embeddings (K=620)
5. Had to find a library and stack many SRU
CONCLUSION
• Science and medicine are challenging domains for AI:
• long tailed, deep knowledge, constantly changing
• AI has the potential to change how we do scientific discovery and transition it into practice
• At Elsevier we are applying AI to build platforms that support health and science professionals
• Of course, we’re hiring
Paul Groth (@pgroth)
p.groth@elsevier.com
Elsevier Labs

More Related Content

What's hot

AI in translational medicine webinar
AI in translational medicine webinarAI in translational medicine webinar
AI in translational medicine webinarPistoia Alliance
 
Minimal viable-datareuse-czi
Minimal viable-datareuse-cziMinimal viable-datareuse-czi
Minimal viable-datareuse-cziPaul Groth
 
2020.04.07 automated molecular design and the bradshaw platform webinar
2020.04.07 automated molecular design and the bradshaw platform webinar2020.04.07 automated molecular design and the bradshaw platform webinar
2020.04.07 automated molecular design and the bradshaw platform webinarPistoia Alliance
 
Open interoperability standards, tools and services at EMBL-EBI
Open interoperability standards, tools and services at EMBL-EBIOpen interoperability standards, tools and services at EMBL-EBI
Open interoperability standards, tools and services at EMBL-EBIPistoia Alliance
 
Research Method EMBA chapter 12
Research Method EMBA chapter 12Research Method EMBA chapter 12
Research Method EMBA chapter 12Mazhar Poohlah
 
From Text to Data to the World: The Future of Knowledge Graphs
From Text to Data to the World: The Future of Knowledge GraphsFrom Text to Data to the World: The Future of Knowledge Graphs
From Text to Data to the World: The Future of Knowledge GraphsPaul Groth
 
Fairification experience clarifying the semantics of data matrices
Fairification experience clarifying the semantics of data matricesFairification experience clarifying the semantics of data matrices
Fairification experience clarifying the semantics of data matricesPistoia Alliance
 
BioSHaRE: The DataSHIELD Legal Analysis Template - Susan Wallace - University...
BioSHaRE: The DataSHIELD Legal Analysis Template - Susan Wallace - University...BioSHaRE: The DataSHIELD Legal Analysis Template - Susan Wallace - University...
BioSHaRE: The DataSHIELD Legal Analysis Template - Susan Wallace - University...Lisette Giepmans
 
Practical Drug Discovery using Explainable Artificial Intelligence
Practical Drug Discovery using Explainable Artificial IntelligencePractical Drug Discovery using Explainable Artificial Intelligence
Practical Drug Discovery using Explainable Artificial IntelligenceAl Dossetter
 
Thesis Presentation
Thesis PresentationThesis Presentation
Thesis Presentationnirvdrum
 
Towards open and reproducible neuroscience in the age of big data
Towards open and  reproducible neuroscience in the age of big dataTowards open and  reproducible neuroscience in the age of big data
Towards open and reproducible neuroscience in the age of big dataKrzysztof Gorgolewski
 
The comparative study of information retrieval models used in search engines
The comparative study of information retrieval models used in search enginesThe comparative study of information retrieval models used in search engines
The comparative study of information retrieval models used in search enginesfawad khan
 
Artificial Intelligence for Discovery
Artificial Intelligence for DiscoveryArtificial Intelligence for Discovery
Artificial Intelligence for DiscoveryDayOne
 
Some Questions About Your Data
Some Questions About Your DataSome Questions About Your Data
Some Questions About Your DataDamian T. Gordon
 
How to Start Doing Data Science
How to Start Doing Data ScienceHow to Start Doing Data Science
How to Start Doing Data ScienceAyodele Odubela
 

What's hot (20)

AI in translational medicine webinar
AI in translational medicine webinarAI in translational medicine webinar
AI in translational medicine webinar
 
Minimal viable-datareuse-czi
Minimal viable-datareuse-cziMinimal viable-datareuse-czi
Minimal viable-datareuse-czi
 
MPS webinar master deck
MPS webinar master deckMPS webinar master deck
MPS webinar master deck
 
2020.04.07 automated molecular design and the bradshaw platform webinar
2020.04.07 automated molecular design and the bradshaw platform webinar2020.04.07 automated molecular design and the bradshaw platform webinar
2020.04.07 automated molecular design and the bradshaw platform webinar
 
Open interoperability standards, tools and services at EMBL-EBI
Open interoperability standards, tools and services at EMBL-EBIOpen interoperability standards, tools and services at EMBL-EBI
Open interoperability standards, tools and services at EMBL-EBI
 
Data analysis
Data analysisData analysis
Data analysis
 
Research Method EMBA chapter 12
Research Method EMBA chapter 12Research Method EMBA chapter 12
Research Method EMBA chapter 12
 
From Text to Data to the World: The Future of Knowledge Graphs
From Text to Data to the World: The Future of Knowledge GraphsFrom Text to Data to the World: The Future of Knowledge Graphs
From Text to Data to the World: The Future of Knowledge Graphs
 
NLP Structured Data Investigation on Non-Text
NLP Structured Data Investigation on Non-TextNLP Structured Data Investigation on Non-Text
NLP Structured Data Investigation on Non-Text
 
Fairification experience clarifying the semantics of data matrices
Fairification experience clarifying the semantics of data matricesFairification experience clarifying the semantics of data matrices
Fairification experience clarifying the semantics of data matrices
 
BioSHaRE: The DataSHIELD Legal Analysis Template - Susan Wallace - University...
BioSHaRE: The DataSHIELD Legal Analysis Template - Susan Wallace - University...BioSHaRE: The DataSHIELD Legal Analysis Template - Susan Wallace - University...
BioSHaRE: The DataSHIELD Legal Analysis Template - Susan Wallace - University...
 
Streaming Outlier Analysis for Fun and Scalability
Streaming Outlier Analysis for Fun and Scalability Streaming Outlier Analysis for Fun and Scalability
Streaming Outlier Analysis for Fun and Scalability
 
Practical Drug Discovery using Explainable Artificial Intelligence
Practical Drug Discovery using Explainable Artificial IntelligencePractical Drug Discovery using Explainable Artificial Intelligence
Practical Drug Discovery using Explainable Artificial Intelligence
 
Thesis Presentation
Thesis PresentationThesis Presentation
Thesis Presentation
 
Towards open and reproducible neuroscience in the age of big data
Towards open and  reproducible neuroscience in the age of big dataTowards open and  reproducible neuroscience in the age of big data
Towards open and reproducible neuroscience in the age of big data
 
The comparative study of information retrieval models used in search engines
The comparative study of information retrieval models used in search enginesThe comparative study of information retrieval models used in search engines
The comparative study of information retrieval models used in search engines
 
Artificial Intelligence for Discovery
Artificial Intelligence for DiscoveryArtificial Intelligence for Discovery
Artificial Intelligence for Discovery
 
Using Knowledge Graph for Promoting Cognitive Computing
Using Knowledge Graph for Promoting Cognitive ComputingUsing Knowledge Graph for Promoting Cognitive Computing
Using Knowledge Graph for Promoting Cognitive Computing
 
Some Questions About Your Data
Some Questions About Your DataSome Questions About Your Data
Some Questions About Your Data
 
How to Start Doing Data Science
How to Start Doing Data ScienceHow to Start Doing Data Science
How to Start Doing Data Science
 

Similar to Diversity and Depth: Implementing AI across many long tail domains

Knowledge graph construction for research & medicine
Knowledge graph construction for research & medicineKnowledge graph construction for research & medicine
Knowledge graph construction for research & medicinePaul Groth
 
The Neuroscience Information Framework:The present and future of neuroscience...
The Neuroscience Information Framework:The present and future of neuroscience...The Neuroscience Information Framework:The present and future of neuroscience...
The Neuroscience Information Framework:The present and future of neuroscience...Neuroscience Information Framework
 
Effective search of bibliographic databases
Effective search of bibliographic databasesEffective search of bibliographic databases
Effective search of bibliographic databasesTarek Tawfik Amin
 
How to Conduct a Systematic Search
How to Conduct a Systematic SearchHow to Conduct a Systematic Search
How to Conduct a Systematic SearchRobin Featherstone
 
The real world of ontologies and phenotype representation: perspectives from...
The real world of ontologies and phenotype representation:  perspectives from...The real world of ontologies and phenotype representation:  perspectives from...
The real world of ontologies and phenotype representation: perspectives from...Maryann Martone
 
Sci Know Mine 2013: What can we learn from topic modeling on 350M academic do...
Sci Know Mine 2013: What can we learn from topic modeling on 350M academic do...Sci Know Mine 2013: What can we learn from topic modeling on 350M academic do...
Sci Know Mine 2013: What can we learn from topic modeling on 350M academic do...William Gunn
 
Research methodology
Research methodologyResearch methodology
Research methodologyTosif Ahmad
 
How do we know what we don't know?  Exploring the data and knowledge space th...
How do we know what we don't know?  Exploring the data and knowledge space th...How do we know what we don't know?  Exploring the data and knowledge space th...
How do we know what we don't know?  Exploring the data and knowledge space th...Maryann Martone
 
Program of Academic Excellence
Program of Academic ExcellenceProgram of Academic Excellence
Program of Academic ExcellenceDarrell W. Gunter
 
How Semantic Technology Helps Researchers
How Semantic Technology Helps ResearchersHow Semantic Technology Helps Researchers
How Semantic Technology Helps ResearchersDarrell W. Gunter
 
AAP/PSP Semantic Publishing Workshop
AAP/PSP Semantic Publishing  WorkshopAAP/PSP Semantic Publishing  Workshop
AAP/PSP Semantic Publishing WorkshopDarrell W. Gunter
 
Overview Write a 2–3-page assessment in which you respond to a ser.docx
Overview Write a 2–3-page assessment in which you respond to a ser.docxOverview Write a 2–3-page assessment in which you respond to a ser.docx
Overview Write a 2–3-page assessment in which you respond to a ser.docxkarlacauq0
 
and Practice in Nursing.docx
and Practice in Nursing.docxand Practice in Nursing.docx
and Practice in Nursing.docxwrite5
 
How to Conduct a Literature Review
How to Conduct a Literature ReviewHow to Conduct a Literature Review
How to Conduct a Literature ReviewRobin Featherstone
 
Open science in RIKEN-KI doctorial course on March 20, 2019
Open science in RIKEN-KI doctorial course on March 20, 2019Open science in RIKEN-KI doctorial course on March 20, 2019
Open science in RIKEN-KI doctorial course on March 20, 2019Takeya Kasukawa
 

Similar to Diversity and Depth: Implementing AI across many long tail domains (20)

Paul Groth
Paul GrothPaul Groth
Paul Groth
 
Knowledge graph construction for research & medicine
Knowledge graph construction for research & medicineKnowledge graph construction for research & medicine
Knowledge graph construction for research & medicine
 
The Neuroscience Information Framework:The present and future of neuroscience...
The Neuroscience Information Framework:The present and future of neuroscience...The Neuroscience Information Framework:The present and future of neuroscience...
The Neuroscience Information Framework:The present and future of neuroscience...
 
Effective search of bibliographic databases
Effective search of bibliographic databasesEffective search of bibliographic databases
Effective search of bibliographic databases
 
استخدام قاعدة المعلومات pubmed
استخدام قاعدة المعلومات pubmedاستخدام قاعدة المعلومات pubmed
استخدام قاعدة المعلومات pubmed
 
How to Conduct a Systematic Search
How to Conduct a Systematic SearchHow to Conduct a Systematic Search
How to Conduct a Systematic Search
 
The real world of ontologies and phenotype representation: perspectives from...
The real world of ontologies and phenotype representation:  perspectives from...The real world of ontologies and phenotype representation:  perspectives from...
The real world of ontologies and phenotype representation: perspectives from...
 
Sci Know Mine 2013: What can we learn from topic modeling on 350M academic do...
Sci Know Mine 2013: What can we learn from topic modeling on 350M academic do...Sci Know Mine 2013: What can we learn from topic modeling on 350M academic do...
Sci Know Mine 2013: What can we learn from topic modeling on 350M academic do...
 
Research methodology
Research methodologyResearch methodology
Research methodology
 
How do we know what we don't know?  Exploring the data and knowledge space th...
How do we know what we don't know?  Exploring the data and knowledge space th...How do we know what we don't know?  Exploring the data and knowledge space th...
How do we know what we don't know?  Exploring the data and knowledge space th...
 
Program of Academic Excellence
Program of Academic ExcellenceProgram of Academic Excellence
Program of Academic Excellence
 
MVilla IUI 2012 Lisbon
MVilla IUI 2012 LisbonMVilla IUI 2012 Lisbon
MVilla IUI 2012 Lisbon
 
How Semantic Technology Helps Researchers
How Semantic Technology Helps ResearchersHow Semantic Technology Helps Researchers
How Semantic Technology Helps Researchers
 
AAP/PSP Semantic Publishing Workshop
AAP/PSP Semantic Publishing  WorkshopAAP/PSP Semantic Publishing  Workshop
AAP/PSP Semantic Publishing Workshop
 
Martone grethe
Martone gretheMartone grethe
Martone grethe
 
Overview Write a 2–3-page assessment in which you respond to a ser.docx
Overview Write a 2–3-page assessment in which you respond to a ser.docxOverview Write a 2–3-page assessment in which you respond to a ser.docx
Overview Write a 2–3-page assessment in which you respond to a ser.docx
 
and Practice in Nursing.docx
and Practice in Nursing.docxand Practice in Nursing.docx
and Practice in Nursing.docx
 
How to Conduct a Literature Review
How to Conduct a Literature ReviewHow to Conduct a Literature Review
How to Conduct a Literature Review
 
Ontology
OntologyOntology
Ontology
 
Open science in RIKEN-KI doctorial course on March 20, 2019
Open science in RIKEN-KI doctorial course on March 20, 2019Open science in RIKEN-KI doctorial course on March 20, 2019
Open science in RIKEN-KI doctorial course on March 20, 2019
 

More from Paul Groth

Data Curation and Debugging for Data Centric AI
Data Curation and Debugging for Data Centric AIData Curation and Debugging for Data Centric AI
Data Curation and Debugging for Data Centric AIPaul Groth
 
Content + Signals: The value of the entire data estate for machine learning
Content + Signals: The value of the entire data estate for machine learningContent + Signals: The value of the entire data estate for machine learning
Content + Signals: The value of the entire data estate for machine learningPaul Groth
 
Data Communities - reusable data in and outside your organization.
Data Communities - reusable data in and outside your organization.Data Communities - reusable data in and outside your organization.
Data Communities - reusable data in and outside your organization.Paul Groth
 
Knowledge Graph Maintenance
Knowledge Graph MaintenanceKnowledge Graph Maintenance
Knowledge Graph MaintenancePaul Groth
 
Knowledge Graph Futures
Knowledge Graph FuturesKnowledge Graph Futures
Knowledge Graph FuturesPaul Groth
 
Knowledge Graph Maintenance
Knowledge Graph MaintenanceKnowledge Graph Maintenance
Knowledge Graph MaintenancePaul Groth
 
Thoughts on Knowledge Graphs & Deeper Provenance
Thoughts on Knowledge Graphs  & Deeper ProvenanceThoughts on Knowledge Graphs  & Deeper Provenance
Thoughts on Knowledge Graphs & Deeper ProvenancePaul Groth
 
Thinking About the Making of Data
Thinking About the Making of DataThinking About the Making of Data
Thinking About the Making of DataPaul Groth
 
End-to-End Learning for Answering Structured Queries Directly over Text
End-to-End Learning for  Answering Structured Queries Directly over Text End-to-End Learning for  Answering Structured Queries Directly over Text
End-to-End Learning for Answering Structured Queries Directly over Text Paul Groth
 
The Challenge of Deeper Knowledge Graphs for Science
The Challenge of Deeper Knowledge Graphs for ScienceThe Challenge of Deeper Knowledge Graphs for Science
The Challenge of Deeper Knowledge Graphs for SciencePaul Groth
 
More ways of symbol grounding for knowledge graphs?
More ways of symbol grounding for knowledge graphs?More ways of symbol grounding for knowledge graphs?
More ways of symbol grounding for knowledge graphs?Paul Groth
 
Progressive Provenance Capture Through Re-computation
Progressive Provenance Capture Through Re-computationProgressive Provenance Capture Through Re-computation
Progressive Provenance Capture Through Re-computationPaul Groth
 
The need for a transparent data supply chain
The need for a transparent data supply chainThe need for a transparent data supply chain
The need for a transparent data supply chainPaul Groth
 
The Roots: Linked data and the foundations of successful Agriculture Data
The Roots: Linked data and the foundations of successful Agriculture DataThe Roots: Linked data and the foundations of successful Agriculture Data
The Roots: Linked data and the foundations of successful Agriculture DataPaul Groth
 
Machines are people too
Machines are people tooMachines are people too
Machines are people tooPaul Groth
 
Are we finally ready for transclusion?*
Are we finally ready for transclusion?*Are we finally ready for transclusion?*
Are we finally ready for transclusion?*Paul Groth
 
Sources of Change in Modern Knowledge Organization Systems
Sources of Change in Modern Knowledge Organization SystemsSources of Change in Modern Knowledge Organization Systems
Sources of Change in Modern Knowledge Organization SystemsPaul Groth
 
Structured Data & the Future of Educational Material
Structured Data & the Future of Educational MaterialStructured Data & the Future of Educational Material
Structured Data & the Future of Educational MaterialPaul Groth
 
Research Data Sharing: A Basic Framework
Research Data Sharing: A Basic FrameworkResearch Data Sharing: A Basic Framework
Research Data Sharing: A Basic FrameworkPaul Groth
 
Data for Science: How Elsevier is using data science to empower researchers
Data for Science: How Elsevier is using data science to empower researchersData for Science: How Elsevier is using data science to empower researchers
Data for Science: How Elsevier is using data science to empower researchersPaul Groth
 

Diversity and Depth: Implementing AI across many long tail domains

  • 1. Diversity and Depth: Implementing AI across many long tail domains Paul Groth | pgroth.com | @pgroth Elsevier Labs Thanks to Helena Deus, Tony Scerri, Sujit Pal, Corey Harper, Ron Daniel, Brad Allen IJCAI 2018 – Industry Day
  • 2. Introducing Elsevier: content and technology. Chemistry database: 500m published experimental facts. User queries: 13m monthly users on ScienceDirect. Books: 35,000 published books. Drug database: 100% of drug information from pharmaceutical companies, updated daily. Research: 16% of the world’s research data and articles published by Elsevier. Technologists: 1,000 employed by Elsevier. Machine learning: over 1,000 predictive models trained on 1.5 billion electronic health care events. Machine reading: 475m facts extracted from ScienceDirect. Collaborative filtering: 1bn scientific articles added by 2.5m researchers, analyzed daily to generate over 250m article recommendations. Semantic enhancement: knowledge on 50m chemicals captured as 11B facts.
  • 3. June 15, 2018. Bloom, N., Jones, C. I., Van Reenen, J., & Webb, M. (2017). Are ideas getting harder to find? (No. w23782). National Bureau of Economic Research. Slides: https://web.stanford.edu/~chadj/slides-ideas.pdf
  • 5. IN PRACTICE Gregory, K., Groth, P., Cousijn, H., Scharnhorst, A., & Wyatt, S. (2017). Searching Data: A Review of Observational Data Retrieval Practices. arXiv preprint arXiv:1707.06937. Some observations from the @gregory_km survey and interviews: • The needs and behaviors of specific user groups (e.g. early career researchers, policy makers, students) are not well documented. • Participants require details about data collection and handling. • Reconstructing data tables from journal articles, using general search engines, and making direct data requests are common. Gregory, K., Cousijn, H., Groth, P., Scharnhorst, A., & Wyatt, S. (2018). Understanding Data Retrieval Practices: A Social Informatics Perspective. arXiv preprint arXiv:1801.04971.
  • 6. PROVIDING ANSWERS FOR RESEARCHERS, DOCTORS AND NURSES: ANSWERS NEED AI My work is moving towards a new field; what should I know? • Journal articles, reference works, profiles of researchers, funders & institutions • Recommendations of people to connect with, reading lists, topic pages How should I treat my patient given her condition & history? • Journal articles, reference works, medical guidelines, electronic health records • Treatment plan with alternatives personalized for the patient How can I master the subject matter of the course I am taking? • Course syllabus, reference works, course objectives, student history • Quiz plan based on the student’s history and course objectives
  • 8. RECOGNIZING DECISION GRAPHS IN MEDICAL CONTENT: MOTIVATING USE CASE • Clinical Key is Elsevier’s flagship medical reference search product • Clinicians prefer “answers” in the form of tables or flowcharts • Eliminates need to page through retrieved content to find actionable information • Clinical Key provides a sidebar section displaying answers, but this feature depends on very labor-intensive manual curation • Solution: automatically classify images in medical content corpus at index time • Benefits: lower cost and improved user experience (Figure: “Curated Answers” section displays medical decision graphs)
  • 9. RECOGNIZING DECISION GRAPHS IN MEDICAL CONTENT: SOLUTION • Perfect fit for a transfer learning approach • Input to the classifier is an image and output is one of 8 classes: Photo, Radiological, Data graphic, Illustration, Microscopy, Flowchart, Electrophoresis, Medical decision graph • Image dataset is augmented by producing variations of the training images by rotating, flipping, transposing, jittering, etc. • Reuses all but the last two Dense layers of a pre-trained model (VGG-CNN, available from Caffe’s “model zoo”) • VGG-CNN was trained on ImageNet (14 million images from the Web, 1000 general topic classes, e.g., Cat, Airplane, House) • Last layer is a multinomial logistic regression (or softmax) classifier • Model trained on 10,167 images with a 70/30 train/test split • Achieves 93% test set accuracy • Evaluated an image + caption text model but did not get a big performance boost • Searchable image base used to support training set and model development
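The augmentation and 70/30 split bullets on slide 9 can be illustrated with a minimal sketch. It is not the pipeline used in the talk: images are represented as plain nested lists, jittering is omitted, and the function names (`augment`, `train_test_split`) and the seed are assumptions made for the example. The pre-trained VGG-CNN layers and the softmax head are out of scope here.

```python
import random


def flip_horizontal(img):
    """Mirror each row (left-right flip)."""
    return [row[::-1] for row in img]


def flip_vertical(img):
    """Reverse the row order (top-bottom flip)."""
    return [row[:] for row in img[::-1]]


def transpose(img):
    """Swap rows and columns."""
    return [list(col) for col in zip(*img)]


def rotate90(img):
    """Rotate 90 degrees clockwise: reverse the rows, then transpose."""
    return [list(col) for col in zip(*img[::-1])]


def augment(img):
    """Return the original image plus four simple geometric variations."""
    return [img, flip_horizontal(img), flip_vertical(img),
            transpose(img), rotate90(img)]


def train_test_split(items, train_fraction=0.7, seed=0):
    """Shuffle a copy of the items and cut it at the given fraction."""
    shuffled = items[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(round(len(shuffled) * train_fraction))
    return shuffled[:cut], shuffled[cut:]


# Hypothetical usage: split first, then augment only the training images,
# so that variations of a test image never end up in the training set.
image_ids = list(range(10167))
train_ids, test_ids = train_test_split(image_ids)
variants = augment([[1, 2], [3, 4]])
```

If the 10,167 images on the slide are the full labeled set, a 70/30 split gives 7,117 training and 3,050 test images. Splitting before augmenting is a design choice of this sketch; the slide does not say in which order the two steps were done.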
  • 10. H-GRAPH KNOWLEDGE GRAPH • Total concepts = 540,632 • 100+ person-years of clinical expert knowledge
  • 11. Open Information Extraction • Knowledge bases are populated by scanning text and doing Information Extraction • Most information extraction systems look for very specific things, like drug-drug interactions • That gives the best accuracy for that one kind of data, but misses all the other concepts and relations in the text • For a broad knowledge base, use Open Information Extraction, which only uses some knowledge of grammar • One weird trick for open information extraction … • ReVerb*: 1. Find “relation phrases” starting with a verb and ending with a verb or preposition 2. Find noun phrases before and after the relation phrase 3. Discard relation phrases not used with multiple combinations of arguments. Example sentence: In addition, brain scans were performed to exclude other causes of dementia. * Fader et al. Identifying Relations for Open Information Extraction
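The three ReVerb steps on slide 11 can be sketched on the slide's example sentence. This is a simplified illustration and not the ReVerb implementation: the part-of-speech tags are assigned by hand, the tag names are assumptions, and the relation pattern is reduced to "a verb, then any verbs, adverbs or particles, then optionally one preposition" (the published pattern also allows nouns and adjectives inside the phrase).

```python
from collections import defaultdict

# Hand-tagged example sentence from the slide (coarse tags, assumed here).
SENTENCE = [
    ("In", "ADP"), ("addition", "NOUN"), (",", "PUNCT"),
    ("brain", "NOUN"), ("scans", "NOUN"),
    ("were", "VERB"), ("performed", "VERB"), ("to", "PRT"), ("exclude", "VERB"),
    ("other", "ADJ"), ("causes", "NOUN"), ("of", "ADP"), ("dementia", "NOUN"),
    (".", "PUNCT"),
]

NP_TAGS = {"DET", "ADJ", "NOUN"}       # tokens allowed inside a noun phrase
REL_MIDDLE = {"VERB", "ADV", "PRT"}    # tokens allowed after the first verb


def words(tokens):
    return " ".join(word for word, _ in tokens)


def extract_triples(tagged):
    """Steps 1 and 2: find relation phrases and their neighboring noun phrases."""
    triples, i, n = [], 0, len(tagged)
    while i < n:
        if tagged[i][1] != "VERB":
            i += 1
            continue
        # Step 1: extend the relation phrase to the right of the first verb.
        j = i + 1
        while j < n and tagged[j][1] in REL_MIDDLE:
            j += 1
        if j < n and tagged[j][1] == "ADP":
            j += 1
        # The phrase must end with a verb or a preposition; trim otherwise.
        while tagged[j - 1][1] not in {"VERB", "ADP"}:
            j -= 1
        # Step 2: noun phrase directly before and directly after the phrase.
        left = i
        while left > 0 and tagged[left - 1][1] in NP_TAGS:
            left -= 1
        right = j
        while right < n and tagged[right][1] in NP_TAGS:
            right += 1
        if left < i and j < right:
            triples.append((words(tagged[left:i]), words(tagged[i:j]),
                            words(tagged[j:right])))
        i = j
    return triples


def filter_by_support(triples, min_pairs=2):
    """Step 3: keep relation phrases seen with several distinct argument pairs."""
    pairs_per_relation = defaultdict(set)
    for arg1, rel, arg2 in triples:
        pairs_per_relation[rel].add((arg1, arg2))
    return [t for t in triples if len(pairs_per_relation[t[1]]) >= min_pairs]


triples = extract_triples(SENTENCE)
```

Step 3 only makes sense over a large corpus, which is why `filter_by_support` takes the pooled triples of many sentences; the `min_pairs` threshold is a hypothetical parameter, not a value from the talk.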
  • 12. ReVerb output • After ReVerb pulls out noun phrases, match them up to EMMeT concepts • Discard rare concepts, relations, or relations that are not used with many different concepts • Number of SD documents scanned: 14,000,000 • Extracted ReVerb triples: 473,350,566
  • 13. Universal schemas - Initialization • Method to combine ‘facts’ found by machine reading with stronger assertions from ontology. • Build ExR matrix with entity-pairs as rows and relations as columns. • Relation columns can come from EMMeT, or from ReVerb extractions. • Cells contain 1.0 if that pair of entities is connected by that relation.
  • 14. Universal schemas - Prediction • Factorize matrix to ExK and KxR, then recombine. • “Learns” the correlations between text relations and EMMeT relations, in the context of pairs of objects. • Find new triples to go into EMMeT, e.g., (glaucoma, has_alternativeProcedure, biofeedback)
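The initialization and prediction steps on slides 13 and 14 can be sketched end to end on a toy example. This is an illustration under stated assumptions, not the model from the talk: the triples are made up for the example (only the predicted glaucoma triple comes from the slide), and plain squared-error gradient descent is swapped in for whatever training objective was actually used. K, the learning rate, the step count and the seed are likewise hypothetical.

```python
import random

# Toy triples (hypothetical). "is treated with" and "causes" stand in for
# ReVerb surface relations; "has_alternativeProcedure" for an EMMeT relation.
TRIPLES = [
    ("migraine", "is treated with", "acupuncture"),
    ("migraine", "has_alternativeProcedure", "acupuncture"),
    ("insomnia", "is treated with", "meditation"),
    ("insomnia", "has_alternativeProcedure", "meditation"),
    ("glaucoma", "is treated with", "biofeedback"),
    ("smoking", "causes", "emphysema"),
]


def build_matrix(triples):
    """Rows are entity pairs, columns are relations, cells are 1.0 if observed."""
    pairs = sorted({(s, o) for s, _, o in triples})
    relations = sorted({r for _, r, _ in triples})
    row = {p: i for i, p in enumerate(pairs)}
    col = {r: j for j, r in enumerate(relations)}
    X = [[0.0] * len(relations) for _ in pairs]
    for s, r, o in triples:
        X[row[(s, o)]][col[r]] = 1.0
    return pairs, relations, X


def factorize(X, k=2, lr=0.05, steps=2000, seed=0):
    """Fit X ~ E * R^T with E of size ExK and R of size RxK by gradient descent."""
    rng = random.Random(seed)
    n, m = len(X), len(X[0])
    E = [[rng.uniform(-0.1, 0.1) for _ in range(k)] for _ in range(n)]
    R = [[rng.uniform(-0.1, 0.1) for _ in range(k)] for _ in range(m)]
    for _ in range(steps):
        # Residual of the current reconstruction.
        err = [[sum(E[i][f] * R[j][f] for f in range(k)) - X[i][j]
                for j in range(m)] for i in range(n)]
        # Simultaneous update of both factors from the same residual.
        new_E = [[E[i][f] - lr * sum(err[i][j] * R[j][f] for j in range(m))
                  for f in range(k)] for i in range(n)]
        new_R = [[R[j][f] - lr * sum(err[i][j] * E[i][f] for i in range(n))
                  for f in range(k)] for j in range(m)]
        E, R = new_E, new_R
    return E, R


def score(E, R, i, j):
    """Recombined cell value for entity pair i and relation j."""
    return sum(e * r for e, r in zip(E[i], R[j]))


pairs, relations, X = build_matrix(TRIPLES)
E, R = factorize(X)
target = relations.index("has_alternativeProcedure")
glaucoma = score(E, R, pairs.index(("glaucoma", "biofeedback")), target)
smoking = score(E, R, pairs.index(("smoking", "emphysema")), target)
```

Because "is treated with" co-occurs with `has_alternativeProcedure` for two other pairs, the unobserved cell for (glaucoma, biofeedback) is scored higher than the one for (smoking, emphysema). That is the sense in which the factorization "learns" correlations between text relations and EMMeT relations.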
  • 15. ONTOLOGY MAINTENANCE • Pretty good F measure, around 0.7 • Good enough as a recommender to experts. Pipeline diagram labels: Content, Universal schema, Surface form relations, Structured relations, Factorization model, Matrix Construction, Open Information Extraction, Entity Resolution, Matrix Factorization, Knowledge graph, Curation, Predicted relations, Matrix Completion, Taxonomy, Triple Extraction, Concept Resolution. Figures on the diagram: 14M SD articles; 475 M triples; 3.3 million relations; 49 M relations; ~15k -> 1M entries. References: Paul Groth, Sujit Pal, Darin McBeath, Brad Allen, Ron Daniel. “Applying Universal Schemas for Domain Specific Ontology Expansion.” 5th Workshop on Automated Knowledge Base Construction (AKBC), 2016. Michael Lauruhn and Paul Groth. “Sources of Change for Modern Knowledge Organization Systems.” Knowledge Organization 43, no. 8 (2016).
  • 18. GOOD & BAD DEFINITIONS (table columns: Concept, Bad, Good)
      Inferior Colliculus. Bad: By comparing activation obtained in an equivalent standard (non-cardiac-gated) fMRI experiment, Guimaraes and colleagues found that cardiac-gated activation maps yielded much greater activation in subcortical nuclei, such as the inferior colliculus. Good: The inferior colliculus (IC) is part of the tectum of the midbrain (mesencephalon) comprising the quadrigeminal plate (Lamina quadrigemina). It is located caudal to the superior colliculus on the dorsal surface of the mesencephalon (Figure 36.7: Overview of the human brainstem; view from dorsal. The superior and inferior colliculi form the quadrigeminal plate. Parts of the cerebellum are removed.). The ventral border is formed by the lateral lemniscus. The inferior colliculus is the largest nucleus of the human auditory system. …
      Purkinje cells. Bad: It is felt that the aminopyridines are likely to increase the excitability of the potassium channel-rich cerebellar Purkinje cells in the flocculus (Etzion and Grossman, 2001). Good: Purkinje cells are the most salient cellular elements of the cerebellar cortex. They are arranged in a single row throughout the entire cerebellar cortex between the molecular (outer) layer and the granular (inner) layer. They are among the largest neurons and have a round perikaryon, classically described as shaped “like a chianti bottle,” with a highly branched dendritic tree shaped like a candelabrum and extending into the molecular layer where they are contacted by incoming systems of afferent fibers from granule neurons and the brainstem…
      Olfactory Bulb. Bad: The most common sites used for induction of kindling include the amygdala, perforant path, dorsal hippocampus, olfactory bulb, and perirhinal cortex. Good: The olfactory bulb is the first relay station of the central olfactory system in the vertebrate brain and contains in its superficial layer a few thousand glomeruli, spherical neuropils with sharp borders (Figure 1: Axonal projection pattern of olfactory sensory neurons to the glomeruli of the rodent olfactory bulb. The olfactory epithelium in rats and mice is divided into four zones (zones 1–4). A given odorant receptor is expressed by sensory neurons located within one zone of the epithelium. Individual olfactory sensory neurons express a single odorant receptor…
  • 19. HOW - OVERVIEW Content: Books, Articles, Ontologies ... • Identification of concepts • Disambiguation • Domain/sub-domain identification • Abbreviations, variants • Gazetteering • Identification and classification of text snippets around concepts • Feature building for concept/snippet pairs • Lexical, syntactic, semantic, doc structure … • Ranking concept/snippet pairs • Machine learning • Hand-made rules • Similarities • Deduplication Technologies: NLP, ML • Curation • White list driven • Black list • Corrections/improvements • Evaluation • Gold set by domain • Random set by domain • By SMEs (Subject Matter Experts) • Automation • Content Enrichment Framework • Taxonomy coverage extension Knowledge Graph: Concepts, snippets, metadata, …
  • 20. SCIENTIFIC TEXT IS CHALLENGING 698 unique relation types – 400 relation types. Paul Groth, Mike Lauruhn, Antony Scerri and Ron Daniel, Jr. Open Information Extraction on Scientific Text: An Evaluation. To appear at COLING 2018.
  • 21. SCIENTIFIC TEXT IS CHALLENGING Augenstein, Isabelle, et al. "SemEval 2017 Task 10: ScienceIE - Extracting Keyphrases and Relations from Scientific Publications." Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017). 2017.
  • 22. THE CROWD ISN’T AN EXPERT
  • 24. Burger and Beans: weakly supervised/joint embeddings. Diagram labels: correct text vector, image vector, incorrect text vector, hypersphere of joint embeddings. Engilberge, Martin, Louis Chevallier, Patrick Pérez and Matthieu Cord. “Finding beans in burgers: Deep semantic-visual embedding with localization.” CoRR abs/1804.01720 (2018)
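The diagram on slide 24 (an image vector that should sit closer to the correct text vector than to an incorrect one, all on a hypersphere) corresponds to a triplet-style ranking objective. The sketch below is a generic hinge-based triplet loss on unit-normalized vectors. The margin value and function names are assumptions for illustration, and the cited paper's exact loss and negative-sampling scheme are not reproduced here.

```python
import math


def normalize(v):
    """Project a vector onto the unit hypersphere."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cosine(u, v):
    """Cosine similarity, i.e. the dot product of the normalized vectors."""
    return sum(a * b for a, b in zip(normalize(u), normalize(v)))


def triplet_loss(image_vec, correct_text_vec, incorrect_text_vec, margin=0.2):
    """Zero once the correct text beats the incorrect one by at least `margin`."""
    positive = cosine(image_vec, correct_text_vec)
    negative = cosine(image_vec, incorrect_text_vec)
    return max(0.0, margin - positive + negative)


# Image aligned with the correct caption and orthogonal to the wrong one.
well_separated = triplet_loss([2.0, 0.0], [3.0, 0.0], [0.0, 5.0])
# Same vectors with the captions swapped: the loss is now large.
confused = triplet_loss([2.0, 0.0], [0.0, 5.0], [3.0, 0.0])
```

Only the pairing of images with their captions is needed as supervision. The incorrect text vector can be any caption from another image, which is what makes the setup weakly supervised.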
  • 25. APPLYING THE STATE OF THE ART (diagram labels: Image Vector, Text Vector) 1. ResNet152 (not ResNet50 as usual) 2. Had to “pre-warm” with ImageNet, as a separate model/task 3. From the Weldon model (had to be ported to Python from Lua) 4. Had to find the right embeddings (K=620) 5. Had to find a library and stack many SRUs
  • 26. CONCLUSION • Science and medicine are challenging domains for AI: long tailed, deep knowledge, constantly changing • AI has the potential to change how we do scientific discovery and transition it into practice • At Elsevier we are applying AI to build platforms that support health and science professionals • Of course, we’re hiring. Paul Groth (@pgroth) p.groth@elsevier.com Elsevier Labs

Editor's Notes

  1. Work with DANS. Reviewed 400 papers; deep dive into 114.
  2. Using EMMeT, and some code and data we already had, he built a quick prototype and tested it. Performance (in terms of accuracy of predictions) was surprisingly high. Being unsupervised is very important because it means the construction of the rough underlying knowledge base is scalable and not limited by the availability of experts. Raw predictions are not good enough for fully automatic operation, but are plenty good enough to help taxonomy editors and other people do their job much faster.
  3. EHR and radiology datasets and reports are going to be just as messy. Flickr is relatively consistent when compared to real-world medical images; pre-processing is the challenge.
  4. We need to rely more on unsupervised than on supervised techniques. Burger and Beans is a weakly supervised approach, which lets us infer negatives by knowing what the positives are. Through word embeddings it can also learn synonyms and such.
  5. Joint embeddings. Could start “end-to-end”, but that was NOT the case for the authors. Complicated training regime: we learned that the authors were using a pre-trained ResNet and pre-trained word embeddings. Pre-warming step: include everything up to and including the pooling layer, then attach a different dense layer with softmax to do an image classification task. Let it learn the SRU. Had to add dropout on the conv and SRU layers; later, the third model added it to the affine layer. Global weight decay was another challenge; Keras does not support it, so we had to use regularizer functions, but we’re not sure that they behave the same way.
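The uncertainty in note 5 about weight decay versus regularizer functions can be made concrete with a small numeric sketch. It assumes an L2 penalty of (lam/2)·||w||² added to the loss and compares it with decay applied directly to the weights. The "scaled" update is a deliberately crude stand-in for an adaptive optimizer (it rescales each coordinate by the gradient magnitude) and is not Adam or any Keras optimizer. All names and values are hypothetical.

```python
def sgd_step_l2(w, grad, lr, lam):
    """Plain SGD on loss + (lam/2)*||w||^2: the penalty adds lam*w to the gradient."""
    return [wi - lr * (gi + lam * wi) for wi, gi in zip(w, grad)]


def sgd_step_weight_decay(w, grad, lr, lam):
    """Plain SGD with decay applied directly to the weights."""
    return [(1 - lr * lam) * wi - lr * gi for wi, gi in zip(w, grad)]


def scaled_step_l2(w, grad, lr, lam, eps=1e-8):
    """Adaptive-style step: the penalty gradient is rescaled with the data gradient."""
    total = [gi + lam * wi for wi, gi in zip(w, grad)]
    return [wi - lr * ti / (abs(ti) + eps) for wi, ti in zip(w, total)]


def scaled_step_weight_decay(w, grad, lr, lam, eps=1e-8):
    """Adaptive-style step with decay kept outside the rescaling."""
    return [(1 - lr * lam) * wi - lr * gi / (abs(gi) + eps)
            for wi, gi in zip(w, grad)]


w, grad, lr, lam = [1.0, -2.0], [0.5, 0.1], 0.1, 0.01

plain_l2 = sgd_step_l2(w, grad, lr, lam)
plain_wd = sgd_step_weight_decay(w, grad, lr, lam)
scaled_l2 = scaled_step_l2(w, grad, lr, lam)
scaled_wd = scaled_step_weight_decay(w, grad, lr, lam)
```

Under plain SGD the two formulations give the same update, so a regularizer can stand in for weight decay there. Once the optimizer rescales gradients per coordinate, the penalty term is rescaled too and the updates diverge, which is one plausible reason the two might not "behave the same way".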