Knowledge graph construction for research & medicine
1. KNOWLEDGE GRAPH CONSTRUCTION FOR RESEARCH & MEDICINE
Paul Groth (@pgroth)
pgroth.com
Disruptive Technology Director
Elsevier Labs (@elsevierlabs)
Connected Data London 2017
Contributions: Brad Allen, Pascal Coupet, Sujit Pal, Craig Stanley, Ron Daniel, Alex de Jong
2. Our customers are facing challenges in science and health
• Global research spend is growing every year. [1]
  Predicted spend: $1.9TN on research in 2016, 3.4% up from 2015
• Researchers lack the tools they need to be effective. [2]
  Studies: 70-80% of research asks the wrong questions or cannot be reproduced
• Life-saving drugs are expensive to develop. [3]
  $2.5BN median pharmaceutical spend per drug; 1/20 success rate of drugs
• Health providers cannot save lives without the best information. [4]
  Preventable medical errors: third largest cause of death in the US
  Heart Disease 611k; Cancer 585k; Medical Error 225k; Respiratory Illness 149k
Elsevier is in a unique position to make a contribution towards solving these challenges
Sources: [1] Industrial Research Institute [2] The Lancet [3] Tufts [4] World Health Organization
3. ELSEVIER’S BUSINESS: PROVIDING ANSWERS FOR
RESEARCHERS, DOCTORS AND NURSES
My work is moving towards a new field; what should I know?
• Journal articles, reference works, profiles of researchers, funders &
institutions
• Recommendations of people to connect with, reading lists, topic pages
How should I treat my patient given her condition & history?
• Journal articles, reference works, medical guidelines, electronic health
records
• Treatment plan with alternatives personalized for the patient
How can I master the subject matter of the course I am taking?
• Course syllabus, reference works, course objectives, student history
• Quiz plan based on the student’s history and course objectives
4. THE ROLE OF METADATA IN THE SECOND MACHINE AGE – DC-2016 / KØBENHAVN / 13 OCTOBER
ANSWERS ARE ABOUT THINGS, NOT JUST WORKS
Why shouldn’t a search on an author return
information about the author, including the
author’s works? Where was the author born,
when did she live, what is she known for? … All of
this is possible, but only if we can make some
fundamental changes in our approach to
bibliographic description. ... The challenge for us
lies in transforming what we can of our data into
interrelated “things” without overindulging that
metaphor.
Coyle, K. (2016). FRBR, before and after: a look at our
bibliographical models. Chicago: ALA Editions.
5. KNOWLEDGE GRAPHS DEFINED
• Knowledge graphs are “graph structured knowledge bases (KBs) which store factual information in form of relationships between entities” (Nickel, M., Murphy, K., Tresp, V. and Gabrilovich, E. (2015). A review of relational machine learning for knowledge graphs. arXiv:1503.00759v3)
• Knowledge graphs are metadata evolved beyond the focus on the work, linking people, concepts, things and events
• Knowledge graphs are focused on things to provide answers
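A minimal sketch of the "relationships between entities" idea, assuming a plain list of (subject, relation, object) triples. The relation names (`author_of`, `published_by`, `published_in`) and the helper functions are hypothetical; the facts come from the Coyle citation on the previous slide.

```python
from collections import defaultdict

# Toy facts taken from the Coyle (2016) citation; relation names are hypothetical.
TRIPLES = [
    ("Karen Coyle", "author_of", "FRBR, Before and After"),
    ("FRBR, Before and After", "published_by", "ALA Editions"),
    ("FRBR, Before and After", "published_in", "2016"),
]

def build_graph(triples):
    """Index triples by subject so facts about one entity can be looked up."""
    graph = defaultdict(list)
    for subject, relation, obj in triples:
        graph[subject].append((relation, obj))
    return graph

def describe(graph, entity):
    """Return every (relation, object) pair stored for an entity."""
    return list(graph.get(entity, []))

graph = build_graph(TRIPLES)
print(describe(graph, "Karen Coyle"))
print(describe(graph, "FRBR, Before and After"))
```

A search on the author can then return facts about the author as a thing, and follow the `author_of` edge to facts about the work.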
9. ELSEVIER’S KNOWLEDGE PLATFORM
Products: Reaxys, CK, Sherpath, Scopus, SD, ROS
Knowledge Graphs: Research, Healthcare, Life Sciences
Entity Hubs: Funder Hub, Article Hub, Profile Hub, Journal Hub, Institution Hub
Platforms & Shared Services: Content, Life Sciences, Search, Identity, Research
Data & Content Sources: Usage logs, Pathways, EHRs, Articles, Authors, Institutions, Syllabi, Citations, Chemicals, Books, Drugs, Funders
10. THE GROWTH OF SCIENCE COMPLICATES OUR EFFORTS
11. MORE DOMAINS & MORE SPECIFICITY
Some observations from the survey by @gregory_km:
1. The needs and behaviours of specific user groups (e.g. early career researchers, policy makers, students) are not well documented.
2. Background uses of observational data are better documented than foreground uses.
3. Reconstructing data tables from journal articles, using general search engines, and making direct data requests are common.
Gregory, K., Groth, P., Cousijn, H., Scharnhorst, A., & Wyatt, S. (2017). Searching Data: A Review of Observational Data Retrieval Practices. arXiv preprint arXiv:1707.06937.
15. HOW - OVERVIEW
Content: Books, Articles, Ontologies ...
Technologies: NLP, ML
• Identification of concepts
    • Disambiguation
    • Domain/sub-domain identification
    • Abbreviations, variants
    • Gazetteering
• Identification and classification of text snippets around concepts
• Features building for concept/snippet pairs
    • Lexical, syntactic, semantic, doc structure …
• Ranking concept snippet pairs
    • Machine learning
    • Hand made rules
    • Similarities
    • Deduplication
• Curation
    • White list driven
    • Black list
    • Corrections/improvements
• Evaluation
    • Gold set by domain
    • Random set by domain
    • By SMEs (Subject Matter Experts)
• Automation
    • Content Enrichment Framework
    • Taxonomy coverage extension
Knowledge Graph: Concepts, snippets, meta data, …
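The concept identification step could be sketched as a dictionary lookup (gazetteering) with abbreviations and variants mapped to one concept. This is an illustration under stated assumptions, not the pipeline from the talk: the `GAZETTEER` entries and the `find_concepts` helper are hypothetical, and disambiguation is left out entirely.

```python
import re

# Hypothetical gazetteer: lower-cased surface form -> preferred concept label.
GAZETTEER = {
    "inferior colliculus": "Inferior Colliculus",
    "ic": "Inferior Colliculus",        # abbreviation handled as a variant
    "purkinje cells": "Purkinje cells",
    "purkinje cell": "Purkinje cells",  # singular variant
    "olfactory bulb": "Olfactory Bulb",
}

def find_concepts(text, gazetteer=GAZETTEER):
    """Return (concept, start, end) for each whole-word gazetteer hit.

    Longer surface forms are matched first so that overlapping shorter
    variants do not produce duplicate hits on the same characters.
    """
    taken = [False] * len(text)
    hits = []
    for surface in sorted(gazetteer, key=len, reverse=True):
        pattern = r"\b" + re.escape(surface) + r"\b"
        for m in re.finditer(pattern, text, flags=re.IGNORECASE):
            if not any(taken[m.start():m.end()]):
                hits.append((gazetteer[surface], m.start(), m.end()))
                for i in range(m.start(), m.end()):
                    taken[i] = True
    return sorted(hits, key=lambda h: h[1])

text = "The inferior colliculus (IC) is part of the tectum of the midbrain."
for concept, start, end in find_concepts(text):
    print(concept, repr(text[start:end]))
```

Case-insensitive matching of a two-letter abbreviation such as "IC" will over-match in real text, which is one reason the slide lists disambiguation as its own step.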
16. Extension vocabularies by domains to provide coverage
Diagram labels: OmniScience, Neuroscience

Vocabulary                                              Number of Concepts   Number of Labels
OmniScience 01.16.11                                    45969                47421
OmniScience Neuroscience branch 21/11/2016              2356                 2455
OmniScience Extension Neuroscience branch 21/11/2016    23932                101276
17. Examples of good and bad snippets

Concept: Inferior Colliculus
Bad: By comparing activation obtained in an equivalent standard ( non-cardiac-gated ) fMRI experiment , Guimaraes and colleagues found that cardiac-gated activation maps yielded much greater activation in subcortical nuclei , such as the inferior colliculus .
Good: The inferior colliculus (IC) is part of the tectum of the midbrain (mesencephalon) comprising the quadrigeminal plate (Lamina quadrigemina). It is located caudal to the superior colliculus on the dorsal surface of the mesencephalon ( Figure 36.7 FIGURE 36.7 Overview of the human brainstem; view from dorsal. The superior and inferior colliculi form the quadrigeminal plate. Parts of the cerebellum are removed.). The ventral border is formed by the lateral lemniscus. The inferior colliculus is the largest nucleus of the human auditory system. …

Concept: Purkinje cells
Bad: It is felt that the aminopyridines are likely to increase the excitability of the potassium channel-rich cerebellar Purkinje cells in the flocculus ( Etzion and Grossman , 2001 ) .
Good: Purkinje cells are the most salient cellular elements of the cerebellar cortex. They are arranged in a single row throughout the entire cerebellar cortex between the molecular (outer) layer and the granular (inner) layer. They are among the largest neurons and have a round perikaryon, classically described as shaped “like a chianti bottle,” with a highly branched dendritic tree shaped like a candelabrum and extending into the molecular layer where they are contacted by incoming systems of afferent fibers from granule neurons and the brainstem…

Concept: Olfactory Bulb
Bad: The most common sites used for induction of kindling include the amygdala, perforant path , dorsal hippocampus , olfactory bulb , and perirhinal cortex.
Good: The olfactory bulb is the first relay station of the central olfactory system in the vertebrate brain and contains in its superficial layer a few thousand glomeruli, spherical neuropils with sharp borders ( Figure 1 Figure 1 Axonal projection pattern of olfactory sensory neurons to the glomeruli of the rodent olfactory bulb. The olfactory epithelium in rats and mice is divided into four zones (zones 1–4). A given odorant receptor is expressed by sensory neurons located within one zone of the epithelium. Individual olfactory sensory neurons express a single odorant receptor…
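One way to read the good/bad contrast: good snippets open with the concept and define it, bad snippets only mention it in passing. The sketch below turns that observation into a hand-made scoring rule. The features and weights are assumptions for illustration, not the features or ranking model used in the talk.

```python
import re

def snippet_features(concept, snippet):
    """Compute a few lexical features for a concept/snippet pair."""
    text = snippet.lower().strip()
    c = re.escape(concept.lower())
    # Does the snippet open with the concept (optionally preceded by "the")?
    starts = re.match(r"(the\s+)?" + c + r"\b", text) is not None
    # Is the opening a definition: "<concept> [(abbr)] is/are ..."?
    definitional = re.match(
        r"(the\s+)?" + c + r"\s*(\([^)]*\)\s*)?(is|are)\b", text
    ) is not None
    mentions = len(re.findall(c, text))
    return {"starts_with_concept": starts,
            "definitional": definitional,
            "mentions": mentions}

def score(concept, snippet):
    """Hypothetical hand-made rule: weights are arbitrary illustration values."""
    f = snippet_features(concept, snippet)
    return (2.0 * f["definitional"]
            + 1.0 * f["starts_with_concept"]
            + 0.5 * min(f["mentions"], 3))

good = ("The inferior colliculus (IC) is part of the tectum of the midbrain "
        "(mesencephalon) comprising the quadrigeminal plate.")
bad = ("Guimaraes and colleagues found that cardiac-gated activation maps "
       "yielded much greater activation in subcortical nuclei , such as the "
       "inferior colliculus .")
print(score("Inferior Colliculus", good), score("Inferior Colliculus", bad))
```

In a learned ranker the same features would be inputs to a model trained on a gold set rather than combined with fixed weights.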
19. One Weird Trick from Natural Language Processing (NLP)
• Knowledge bases are populated by scanning text and doing Information Extraction
• Most information extraction systems are looking for very specific things, like drug-drug interactions
• This gives the best accuracy for that one kind of data, but misses out on all the other concepts and relations in the text
• For a broad knowledge base, use Open Information Extraction, which only uses some knowledge of grammar
• The weird trick for open information extraction … a simple algorithm, known as ReVerb*:
1. Find “relation phrases” starting with a verb and ending with a verb or preposition
2. Find noun phrases before and after the relation phrase
3. Discard relation phrases not used with multiple combinations of arguments.
Example sentence: In addition, brain scans were performed to exclude other causes of dementia.
* Fader et al. Identifying Relations for Open Information Extraction (EMNLP 2011)
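The three steps can be sketched on a hand-tagged version of the example sentence. This is a simplified illustration, not the ReVerb implementation: the part-of-speech tags are supplied by hand (an assumption, in place of a real tagger), relation phrases are restricted to runs of verbs, adverbs and prepositions, and all function names are hypothetical.

```python
VERB = {"VB", "VBD", "VBG", "VBN", "VBP", "VBZ"}
PREP = {"IN", "TO", "RP"}
ADV = {"RB"}
INSIDE = VERB | PREP | ADV          # tags allowed inside a relation phrase
ENDING = VERB | PREP                # a relation phrase must end on one of these
NOUNISH = {"DT", "JJ", "NN", "NNS", "NNP", "NNPS"}

def relation_phrases(tagged):
    """Step 1: spans that start with a verb and end with a verb or preposition."""
    spans, i, n = [], 0, len(tagged)
    while i < n:
        if tagged[i][1] in VERB:
            j = i
            while j < n and tagged[j][1] in INSIDE:
                j += 1
            while tagged[j - 1][1] not in ENDING:   # trim trailing adverbs
                j -= 1
            spans.append((i, j))
            i = j
        else:
            i += 1
    return spans

def extract(tagged):
    """Step 2: attach the noun phrases directly before and after each span."""
    triples = []
    for start, end in relation_phrases(tagged):
        left = start
        while left > 0 and tagged[left - 1][1] in NOUNISH:
            left -= 1
        right = end
        while right < len(tagged) and tagged[right][1] in NOUNISH:
            right += 1
        if left < start and right > end:
            words = lambda span: " ".join(w for w, _ in span)
            triples.append((words(tagged[left:start]),
                            words(tagged[start:end]),
                            words(tagged[end:right])))
    return triples

def filter_by_support(triples, min_pairs=2):
    """Step 3: keep relation phrases seen with several distinct argument pairs."""
    pairs = {}
    for arg1, rel, arg2 in triples:
        pairs.setdefault(rel, set()).add((arg1, arg2))
    return [t for t in triples if len(pairs[t[1]]) >= min_pairs]

# Hand-assigned Penn Treebank style tags (assumption, no tagger is run here).
SENTENCE = [("In", "IN"), ("addition", "NN"), (",", ","), ("brain", "NN"),
            ("scans", "NNS"), ("were", "VBD"), ("performed", "VBN"),
            ("to", "TO"), ("exclude", "VB"), ("other", "JJ"),
            ("causes", "NNS"), ("of", "IN"), ("dementia", "NN"), (".", ".")]
print(extract(SENTENCE))
```

Step 3 only makes sense over a corpus: a relation phrase extracted from a single sentence has one argument pair and would be discarded until it recurs with other arguments.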
21. Universal schemas – Predict ‘missing’ KG facts
• Make a matrix:
    • columns for the relation phrases from ReVerb or the semantic relations from EMMeT
    • rows are the pairs of concepts linked by a relation
    • A ‘1.0’ in a cell if those concepts were linked by that relation
    • Outlined cells in the diagram are the ones initialized to 1.
• Factorize the matrix to E×K and K×R, then recombine.
• “Learns” the correlations between text relations and EMMeT relations, in the context of pairs of objects.
• Cells going from 0 to > 0 indicate potential facts.
• Find new triples to go into EMMeT, e.g. (glaucoma, has_alternativeProcedure, biofeedback)
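A toy illustration of the factorize-and-recombine idea, assuming a plain truncated factorization computed by power iteration. Universal schema models are normally trained differently (for example with a ranking loss), so this only shows why a low-rank reconstruction fills in zero cells. The matrix contents, the placeholder concept pairs and the relation phrase "can be managed with" are invented for the example.

```python
import math

def factorize(matrix, k, iters=200):
    """Return A (E x K) and B (K x R) such that A*B approximates matrix.

    Uses power iteration with deflation, i.e. a rank-k truncated SVD.
    """
    rows, cols = len(matrix), len(matrix[0])
    residual = [row[:] for row in matrix]
    A = [[0.0] * k for _ in range(rows)]
    B = [[0.0] * cols for _ in range(k)]
    for f in range(k):
        v = [1.0 / math.sqrt(cols)] * cols
        for _ in range(iters):
            u = [sum(residual[i][j] * v[j] for j in range(cols)) for i in range(rows)]
            w = [sum(residual[i][j] * u[i] for i in range(rows)) for j in range(cols)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm == 0.0:
                break
            v = [x / norm for x in w]
        u = [sum(residual[i][j] * v[j] for j in range(cols)) for i in range(rows)]
        for i in range(rows):
            A[i][f] = u[i]
            for j in range(cols):
                residual[i][j] -= u[i] * v[j]
        B[f] = v[:]
    return A, B

def recombine(A, B):
    """Multiply the factors back into a dense score matrix."""
    return [[sum(a * B[f][j] for f, a in enumerate(row)) for j in range(len(B[0]))]
            for row in A]

# Rows: concept pairs. Columns: one text relation, one EMMeT relation.
PAIRS = [("disease_1", "procedure_1"), ("disease_2", "procedure_2"),
         ("glaucoma", "biofeedback")]            # first two are placeholders
RELATIONS = ["text: can be managed with", "EMMeT: has_alternativeProcedure"]
MATRIX = [[1.0, 1.0],
          [1.0, 1.0],
          [1.0, 0.0]]   # the EMMeT fact for the third pair is "missing"

A, B = factorize(MATRIX, k=1)
scores = recombine(A, B)
candidates = [(PAIRS[i], RELATIONS[j], round(scores[i][j], 3))
              for i in range(len(MATRIX)) for j in range(len(RELATIONS))
              if MATRIX[i][j] == 0.0 and scores[i][j] > 0.3]
print(candidates)
```

K has to be smaller than the rank of the matrix: with K=2 this toy matrix is reproduced exactly and the zero cell stays at zero, so nothing new is predicted.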
23. Medical Graph – Statistical correlations at scale
Example nodes (ICD-10 codes):
• I65 Occlusion and stenosis of precerebral arteries
• G40 Epilepsy
• I61 Intracerebral hemorrhage
• C71 Malignant neoplasm of brain
Example edge label: has_successor, odds ratio: 1.12
has_successor criteria [1]:
• Correlation selected by predictive modeling algorithm
• No. of relations is higher than in mirrored relation
• p-value < 0.05
• Odds ratios balanced over all covariates.
Data: 5m patients, each with 6 years of longitudinal data
• Primary care
• Secondary care
• Drug prescriptions
• Other covariates
[1] Criteria based on: Jensen et al.: Temporal disease trajectories condensed from population-wide registry data covering 6.2 million patients. Nature Communications, 2014 Jun 24; 5:4022. doi: 10.1038/ncomms5022.
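A rough sketch of how two of the has_successor criteria (direction count and p-value) could be checked for one candidate edge. It uses an unadjusted odds ratio from a 2x2 table and a chi-square test, which is an assumption: the slide's odds ratios are balanced over all covariates and the edges are preselected by a predictive model, neither of which is reproduced here. All counts are invented.

```python
import math

def odds_ratio(a, b, c, d):
    """Unadjusted odds ratio from a 2x2 table.

    a: has A, later B      b: has A, no B
    c: no A,  later B      d: no A,  no B
    """
    return (a * d) / (b * c)

def chi2_p_value(a, b, c, d):
    """p-value of the chi-square test (1 degree of freedom) on the 2x2 table."""
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # Survival function of chi-square with 1 df: erfc(sqrt(x / 2)).
    return math.erfc(math.sqrt(chi2 / 2))

def keep_edge(a, b, c, d, n_forward, n_reverse, alpha=0.05):
    """Keep A -> B if it occurs more often than B -> A and is significant."""
    return n_forward > n_reverse and chi2_p_value(a, b, c, d) < alpha

# Invented counts for one candidate pair of diagnoses.
print(round(odds_ratio(30, 70, 10, 90), 2))
print(keep_edge(30, 70, 10, 90, n_forward=30, n_reverse=12))
```

With millions of patients and thousands of code pairs, a p-value threshold alone would admit many spurious edges, which may be why the slide combines it with the other criteria.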
24. Medical Graph in practice, patient 35: risk of depression
• 49 year old man
• Dx: overweight, diabetes, hypertension, anxiety disorder
• Has an absolute risk of 36% of developing depression within the next 4 years
26. Analysis Design
Predict 4-year long-term effects, balanced for all covariates
• Targets for prediction: ICD-coded diagnoses
• Only incident patients per diagnosis are considered, i.e. patients who were diagnosis-free in 2009-2010
• If these patients remain diagnosis-free in 2011-2014 (observation period), the target is coded 0, else 1
• Covariates: all ICD/ATC codes, age and sex, measured in 2010
Example: model to predict “I50 – Heart Failure”
• 2009-2010: I50-free patients (I50 −); covariates are measured
• 2011-2014: remaining I50-free patients (I50 −, coded as 0) and newly I50-diagnosed patients (I50 +, coded as 1)
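The labelling rule for the I50 example can be written down directly. The record layout (a dict from code to the set of years in which it was recorded) and the function names are assumptions made for this sketch.

```python
def label_patient(code_years, target="I50"):
    """Return 1, 0 or None for one patient.

    code_years maps an ICD/ATC code to the set of years it was recorded.
    None means the patient is excluded because the target diagnosis was
    already present in 2009-2010 (not an incident case).
    """
    years = code_years.get(target, set())
    if any(2009 <= y <= 2010 for y in years):
        return None
    return 1 if any(2011 <= y <= 2014 for y in years) else 0

def covariates_2010(code_years, target="I50"):
    """Codes recorded in 2010, used as covariates for the target."""
    return sorted(code for code, years in code_years.items()
                  if code != target and 2010 in years)

# Invented patient records for illustration.
patients = {
    "p1": {"I50": {2010}},                       # prevalent: excluded
    "p2": {"I10": {2010}, "I50": {2012}},        # incident case: 1
    "p3": {"E11": {2009, 2010}},                 # stays I50-free: 0
}
for pid, record in patients.items():
    print(pid, label_patient(record), covariates_2010(record))
```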
27. (A) integrate & clean
Research on anonymized claims data
Billing data flow from 60+ sickness funds:
• Primary care: visits & diagnoses
• Secondary care: visits, diagnoses & procedures
• Drug prescriptions: drug prescriptions
• Other data: further cooperations just started; will enable analysis of vital and laboratory parameters
Data integration & cleaning:
• Data cleaning
• Longitudinally linked & integrated for analytics
• Anonymized
Result: 6 million patients, 6 years, > 1.5b events
28. (B) mine & learn
Calculate statistics & build prediction models for ~1600 targets
Technology stack:
• Feature extraction
    For 3.8m patients:
    • age, gender
    • all diagnoses: ICD10-coded, 3 digits, i.e. 2054 codes
    • all medications: ATC-coded, 5 digits, i.e. 906 codes
    • death, hospitalization
    Results in: 6277 features
    • 1623 targets, 2011-2014
    • 2320 covariates, 2010
    • 2334 filter-columns, 2009-2010
• Data mining
    Calculate prevalence, incidence, mean age for all covariates (i.e. diseases and medications)
• Machine learning
    Predictive modelling for ~1600 targets
    • Linear classification model, resulting in odds ratios
    • Calculation of p-values
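How a linear classification model yields odds ratios: in a logistic regression, the exponentiated coefficient of a covariate is its odds ratio, adjusted for the other covariates in the model. The sketch below assumes logistic regression fitted by plain gradient descent (the slide does not name the model or the fitting method) and uses invented data with a single binary covariate. The p-value calculation is omitted.

```python
import math

def fit_logistic(X, y, lr=0.5, steps=5000):
    """Fit weights and intercept by batch gradient descent on the log-loss."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(steps):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            z = b + sum(wj * xj for wj, xj in zip(w, xi))
            err = 1.0 / (1.0 + math.exp(-z)) - yi   # predicted minus observed
            grad_b += err
            for j in range(d):
                grad_w[j] += err * xi[j]
        b -= lr * grad_b / n
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
    return w, b

# Invented data: covariate present (1) or absent (0) in 2010,
# target diagnosed (1) or not (0) in 2011-2014.
X = [[1], [1], [1], [1], [0], [0], [0], [0]]
y = [1, 1, 1, 0, 1, 0, 0, 0]

w, b = fit_logistic(X, y)
odds_ratios = [math.exp(wj) for wj in w]
print([round(o, 2) for o in odds_ratios])
```

With one binary covariate the fitted odds ratio matches the 2x2 table value, here (3·3)/(1·1) = 9. With many covariates the coefficients differ from the per-pair table values because each is estimated with the others held fixed.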
34. CONCLUSION
• Knowledge graphs are critical components for delivering customer value
• AI techniques such as machine learning and predictive modelling from data are key parts of knowledge graph construction
• This is particularly the case as the amount, speed and specificity of data and requirements accelerate
• Leveraging existing assets such as ontologies, data, and controlled vocabularies (i.e. connected data) has been key for Elsevier in the build-out of knowledge graphs
• How all this enables intelligent solutions is a topic for another talk
• Oh, and we are hiring
• Paul Groth p.groth@elsevier.com