Presentation at the IJCAI 2018 Industry Day
Elsevier serves researchers, doctors, and nurses. They have come to expect in their work environment the same AI-based services that they use in everyday life, e.g. recommendations, answer-driven search, and summarized information. However, providing these sorts of services over the plethora of low-resource domains that characterize science and medicine is a challenging proposition. (For example, most off-the-shelf NLP components are trained on newspaper corpora and exhibit much worse performance on scientific text.) Furthermore, the level of precision expected in these domains is quite high. In this talk, we overview our efforts to overcome this challenge through the application of four techniques: 1) unsupervised learning; 2) leveraging highly skilled but low-volume expert annotators; 3) designing annotation tasks for non-experts in expert domains; and 4) transfer learning. We conclude with a series of open issues for the AI community stemming from our experience.
Diversity and Depth: Implementing AI across many long tail domains
1. Diversity and Depth:
Implementing AI across
many long tail domains
Paul Groth | pgroth.com | @pgroth
Elsevier Labs
Thanks to Helena Deus, Tony Scerri, Sujit Pal, Corey Harper,
Ron Daniel, Brad Allen
IJCAI 2018 – Industry Day
2. Introducing Elsevier
Content Technology
Chemistry database
500m published experimental facts
User queries
13m monthly users on ScienceDirect
Books
35,000 published books
Drug Database
100% of drug information from
pharmaceutical companies updated daily
Research
16% of the world’s research data and
articles published by Elsevier
1,000
technologists employed by Elsevier
Machine learning
Over 1,000 predictive models trained on 1.5
billion electronic health care events
Machine reading
475m facts extracted from
ScienceDirect
Collaborative filtering:
1bn scientific articles added by 2.5m
researchers analyzed daily to generate over
250m article recommendations
Semantic Enhancement
Knowledge on 50m chemicals captured as 11B
facts
3. June 15, 2018
Bloom, N., Jones, C. I., Van Reenen, J., &
Webb, M. (2017). Are ideas getting harder to
find? (No. w23782). National Bureau of
Economic Research.
Slides: https://web.stanford.edu/~chadj/slides-ideas.pdf
5. IN PRACTICE
Gregory, K., Groth, P., Cousijn, H., Scharnhorst, A., & Wyatt, S. (2017).
Searching Data: A Review of Observational Data Retrieval Practices.
arXiv preprint arXiv:1707.06937.
Some observations from @gregory_km survey & interviews:
• The needs and behaviors of specific user groups (e.g.
early career researchers, policy makers, students) are not
well documented.
• Participants require details about data collection and handling.
• Reconstructing data tables from journal articles,
using general search engines, and making direct data
requests are common.
Gregory, K., Cousijn, H., Groth, P., Scharnhorst, A., & Wyatt, S. (2018).
Understanding Data Retrieval Practices: A Social Informatics Perspective.
arXiv preprint arXiv:1801.04971.
6. PROVIDING ANSWERS FOR RESEARCHERS, DOCTORS AND
NURSES: ANSWERS NEED AI
My work is moving towards a new field; what should I know?
• Journal articles, reference works, profiles of researchers, funders &
institutions
• Recommendations of people to connect with, reading lists, topic pages
How should I treat my patient given her condition & history?
• Journal articles, reference works, medical guidelines, electronic health
records
• Treatment plan with alternatives personalized for the patient
How can I master the subject matter of the course I am taking?
• Course syllabus, reference works, course objectives, student history
• Quiz plan based on the student’s history and course objectives
7.
8. RECOGNIZING DECISION GRAPHS IN MEDICAL CONTENT:
MOTIVATING USE CASE
• Clinical Key is Elsevier’s flagship medical
reference search product
• Clinicians prefer “answers” in the form of
tables or flowcharts
• Eliminates need to page through retrieved content to find
actionable information
• Clinical Key provides a sidebar section
displaying answers, but this feature
depends on very labor-intensive manual
curation
• Solution: automatically classify images in
medical content corpus at index time
• Benefits: lower cost and improved user
experience
“Curated Answers”
section displays medical
decision graphs
9. RECOGNIZING DECISION GRAPHS IN MEDICAL CONTENT:
SOLUTION
• Perfect fit for transfer learning approach
• Input to the classifier is an image and the output is one of 8 classes:
Photo, Radiological, Data graphic, Illustration, Microscopy,
Flowchart, Electrophoresis, Medical decision graph
• Image dataset is augmented by producing variations of the training
images by rotating, flipping, transposing, jittering, etc.
• Reusing all but the last two Dense layers of a pre-trained model (VGG-CNN, available from Caffe’s “model zoo”)
• VGG-CNN was trained on ImageNet (14 million images from the Web, 1000 general topic classes, e.g., Cat, Airplane, House)
• Last layer is a multinomial logistic regression (or softmax) classifier
• Model trained on 10,167 images with a 70/30
train/test split
• Achieves 93% test set accuracy
• Evaluated image + caption text model but did not get a big performance
boost
• Searchable image base used to support training set
and model development
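The recipe above (reuse frozen pre-trained layers, retrain only a softmax head, augment the training images) can be sketched in miniature. This is a hedged toy, not Elsevier's system: a fixed random ReLU projection stands in for the reused VGG-CNN layers, and synthetic Gaussian clusters stand in for the image corpus.

```python
import numpy as np

rng = np.random.default_rng(0)

def augment(img2d):
    # Dataset augmentation as on the slide: rotations, flips, transposes.
    return [np.rot90(img2d, k) for k in range(4)] + [np.fliplr(img2d), img2d.T]

# Toy stand-ins: 8 image classes, 64-dim inputs, 32-dim frozen features.
n_classes, in_dim, feat_dim = 8, 64, 32

# "Pre-trained" layers kept frozen: in the real system these are the reused
# VGG-CNN layers; here a fixed random ReLU projection plays that role.
W_frozen = rng.normal(size=(in_dim, feat_dim))

def features(x):
    f = np.maximum(x @ W_frozen, 0.0)
    return f / (np.linalg.norm(f, axis=1, keepdims=True) + 1e-9)

# Synthetic labelled "images": one well-separated cluster per class.
centers = rng.normal(size=(n_classes, in_dim)) * 3.0
X = np.vstack([c + rng.normal(size=(200, in_dim)) for c in centers])
y = np.repeat(np.arange(n_classes), 200)

# 70/30 train/test split, as on the slide.
idx = rng.permutation(len(X))
cut = int(0.7 * len(X))
tr, te = idx[:cut], idx[cut:]
F = features(X)

# Trainable head: multinomial logistic regression (softmax) on frozen features.
W = np.zeros((feat_dim, n_classes))
b = np.zeros(n_classes)
for _ in range(500):
    logits = F[tr] @ W + b
    logits -= logits.max(axis=1, keepdims=True)
    p = np.exp(logits)
    p /= p.sum(axis=1, keepdims=True)
    p[np.arange(len(tr)), y[tr]] -= 1.0          # gradient of cross-entropy
    W -= 0.5 * F[tr].T @ p / len(tr)
    b -= 0.5 * p.mean(axis=0)

acc = (np.argmax(F[te] @ W + b, axis=1) == y[te]).mean()
print(f"test accuracy: {acc:.2f}")
```

Only `W` and `b` are updated; everything behind `features` stays fixed, which is exactly what makes transfer learning cheap when labelled medical images are scarce.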
10. H-GRAPH KNOWLEDGE GRAPH
• Total concepts = 540,632
• 100+ person years of clinical expert knowledge
11.
Open Information Extraction
• Knowledge bases are populated by scanning text and doing Information Extraction
• Most information extraction systems are looking for very specific things, like drug-drug interactions
• Best accuracy for that one kind of data, but misses out on all the other concepts and relations in the text
• For a broad knowledge base, use Open Information Extraction, which uses only some knowledge of grammar
• One weird trick for open information extraction …
• ReVerb*:
1. Find “relation phrases” starting with a verb and ending with a verb or preposition
2. Find noun phrases before and after the relation phrase
3. Discard relation phrases not used with multiple combinations of arguments.
In addition, brain scans were performed to exclude
other causes of dementia.
* Fader, A., Soderland, S., & Etzioni, O. Identifying Relations for Open Information Extraction. EMNLP 2011.
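The three steps above can be sketched on a toy POS-tagged version of the example sentence. This is a simplification of ReVerb's actual syntactic constraint: the tags are supplied by hand (a real system runs a POS tagger), and step 3 (discarding rare relation phrases) needs corpus statistics, so it is omitted.

```python
import re

# Hand-tagged toy sentence (a real pipeline would run a POS tagger).
# Tag codes: N=noun, V=verb, P=preposition/inf. marker, J=adjective, D=determiner.
tokens = ["brain", "scans", "were", "performed", "to",
          "exclude", "other", "causes", "of", "dementia"]
tags = "NNVVPVJNPN"

def extract(tokens, tags):
    """Simplified ReVerb: a relation phrase starts with a verb and ends
    with a verb or preposition; its arguments are adjacent noun phrases."""
    triples = []
    for m in re.finditer(r"V+[JNDP]*[VP]", tags):             # step 1
        arg1 = re.search(r"[DJ]*N+$", tags[:m.start()])       # step 2: NP before
        arg2 = re.match(r"[DJ]*N+(?:P[DJ]*N+)*", tags[m.end():])  # NP after
        if arg1 and arg2:
            triples.append((
                " ".join(tokens[arg1.start():m.start()]),
                " ".join(tokens[m.start():m.end()]),
                " ".join(tokens[m.end():m.end() + arg2.end()]),
            ))
    return triples

print(extract(tokens, tags))
# [('brain scans', 'were performed to exclude', 'other causes of dementia')]
```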
12.
ReVerb output
After ReVerb pulls out noun phrases, match them up to EMMeT concepts
Discard rare concepts, relations, or relations that are not used with many different concepts
# ScienceDirect documents scanned: 14,000,000
Extracted ReVerb triples: 473,350,566
13.
Universal schemas - Initialization
• Method to combine ‘facts’ found by
machine reading with stronger
assertions from ontology.
• Build an E×R matrix with entity-pairs
as rows and relations as columns.
• Relation columns can come from
EMMeT, or from ReVerb
extractions.
• Cells contain 1.0 if that pair of
entities is connected by that
relation.
14.
Universal schemas - Prediction
• Factorize the matrix into E×K and K×R,
then recombine.
• “Learns” the correlations between
text relations and EMMeT relations,
in the context of pairs of objects.
• Find new triples to go into EMMeT
e.g., (glaucoma,
has_alternativeProcedure,
biofeedback)
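The universal-schema idea on these two slides can be sketched end to end with a tiny E×R matrix and a truncated SVD. Everything here is illustrative: the entity pairs, relation names, and the rank-1 factorization are placeholders for the real EMMeT/ReVerb matrix and learned embeddings.

```python
import numpy as np

# Toy E×R matrix: rows are entity pairs, columns are relations.
# First column: an ontology relation (EMMeT-style); the rest: textual
# relations from ReVerb. All names are made up for illustration.
pairs = [("aspirin", "headache"), ("ibuprofen", "pain"),
         ("propranolol", "migraine"), ("biofeedback", "glaucoma")]
relations = ["has_alternativeProcedure", "is used for", "helps with"]

X = np.array([[1., 1., 1.],
              [1., 1., 1.],
              [1., 1., 1.],
              [0., 1., 1.]])   # last pair: only textual evidence observed

# Factorize X ≈ E @ R (E×K times K×R, here K=1), then recombine.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 1
E = U[:, :k] * s[:k]     # entity-pair embeddings (E×K)
R = Vt[:k, :]            # relation embeddings    (K×R)
X_hat = E @ R

# The held-out ontology cell gets a high score because the textual
# relations it co-occurs with correlate with the ontology relation:
score = X_hat[3, 0]
print(f"predicted score for {pairs[3]} {relations[0]}: {score:.2f}")
```

The recombined matrix "learns" that pairs linked by "is used for" in text tend to also carry the ontology relation, so the missing cell is scored well above its observed value of 0; that is the new-triple prediction step described on the slide.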
18. GOOD & BAD DEFINITIONS

Inferior Colliculus
Bad: By comparing activation obtained in an equivalent standard (non-cardiac-gated) fMRI experiment, Guimaraes and colleagues found that cardiac-gated activation maps yielded much greater activation in subcortical nuclei, such as the inferior colliculus.
Good: The inferior colliculus (IC) is part of the tectum of the midbrain (mesencephalon) comprising the quadrigeminal plate (Lamina quadrigemina). It is located caudal to the superior colliculus on the dorsal surface of the mesencephalon (Figure 36.7: Overview of the human brainstem; view from dorsal. The superior and inferior colliculi form the quadrigeminal plate. Parts of the cerebellum are removed.). The ventral border is formed by the lateral lemniscus. The inferior colliculus is the largest nucleus of the human auditory system. …

Purkinje cells
Bad: It is felt that the aminopyridines are likely to increase the excitability of the potassium channel-rich cerebellar Purkinje cells in the flocculus (Etzion and Grossman, 2001).
Good: Purkinje cells are the most salient cellular elements of the cerebellar cortex. They are arranged in a single row throughout the entire cerebellar cortex between the molecular (outer) layer and the granular (inner) layer. They are among the largest neurons and have a round perikaryon, classically described as shaped “like a chianti bottle,” with a highly branched dendritic tree shaped like a candelabrum and extending into the molecular layer where they are contacted by incoming systems of afferent fibers from granule neurons and the brainstem…

Olfactory Bulb
Bad: The most common sites used for induction of kindling include the amygdala, perforant path, dorsal hippocampus, olfactory bulb, and perirhinal cortex.
Good: The olfactory bulb is the first relay station of the central olfactory system in the vertebrate brain and contains in its superficial layer a few thousand glomeruli, spherical neuropils with sharp borders (Figure 1: Axonal projection pattern of olfactory sensory neurons to the glomeruli of the rodent olfactory bulb. The olfactory epithelium in rats and mice is divided into four zones (zones 1–4). A given odorant receptor is expressed by sensory neurons located within one zone of the epithelium. Individual olfactory sensory neurons express a single odorant receptor…
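One simple signal separating the good snippets from the bad ones above is whether the snippet opens with the concept itself followed by a copula ("X is/are ..."). A minimal sketch of that one heuristic; the real ranker on the next slide combines many lexical, syntactic, semantic and document-structure features, so treat this as illustration only.

```python
import re

def looks_definitional(concept, snippet):
    """Crude heuristic: a definitional snippet tends to start with the
    concept (optionally preceded by 'The', optionally followed by a
    parenthesized abbreviation) and then 'is' or 'are'."""
    pattern = (r"^(?:the\s+)?" + re.escape(concept.lower())
               + r"\s*(?:\([^)]*\)\s*)?(?:is|are)\b")
    return re.match(pattern, snippet.lower()) is not None

good = ("The inferior colliculus (IC) is part of the tectum of the "
        "midbrain (mesencephalon).")
bad = ("By comparing activation obtained in an equivalent standard fMRI "
       "experiment, Guimaraes and colleagues found greater activation in "
       "subcortical nuclei, such as the inferior colliculus.")

print(looks_definitional("inferior colliculus", good))  # True
print(looks_definitional("inferior colliculus", bad))   # False
```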
19. HOW - OVERVIEW
Content: Books, Articles, Ontologies ...
• Identification of concepts
  • Disambiguation
  • Domain/sub-domain identification
  • Abbreviations, variants
  • Gazetteering
• Identification and classification of text snippets around concepts
• Feature building for concept/snippet pairs
  • Lexical, syntactic, semantic, doc structure …
• Ranking concept/snippet pairs
  • Machine learning
  • Hand-made rules
  • Similarities
• Deduplication
Technologies: NLP, ML
• Curation
  • White-list driven
  • Black list
  • Corrections/improvements
• Evaluation
  • Gold set by domain
  • Random set by domain
  • By SMEs (Subject Matter Experts)
• Automation
  • Content Enrichment Framework
  • Taxonomy coverage extension
Knowledge Graph: Concepts, snippets, metadata, …
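The concept-identification step above (gazetteering over abbreviations and variants) can be sketched as a greedy longest-match dictionary tagger. The concept IDs and variant lists below are made up for illustration; they are not real EMMeT entries.

```python
# Toy gazetteer: surface variants (including abbreviations) -> concept ID.
# IDs and variants are illustrative placeholders only.
GAZETTEER = {
    "inferior colliculus": "C0021", "ic": "C0021",
    "purkinje cells": "C0034", "purkinje cell": "C0034",
    "olfactory bulb": "C0055", "ob": "C0055",
}
MAX_LEN = max(len(k.split()) for k in GAZETTEER)

def tag(text):
    """Greedy longest-match concept tagging over a token stream."""
    tokens = text.lower().replace(",", " ").split()
    found, i = [], 0
    while i < len(tokens):
        for n in range(min(MAX_LEN, len(tokens) - i), 0, -1):
            cand = " ".join(tokens[i:i + n])
            if cand in GAZETTEER:
                found.append((cand, GAZETTEER[cand]))
                i += n
                break
        else:
            i += 1
    return found

print(tag("Lesions of the inferior colliculus and the olfactory bulb"))
# [('inferior colliculus', 'C0021'), ('olfactory bulb', 'C0055')]
```

Longest-match-first matters: it keeps "inferior colliculus" from being missed in favor of shorter overlapping entries; disambiguation of ambiguous short forms like "IC" is a separate step in the pipeline.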
20. SCIENTIFIC TEXT IS CHALLENGING
698 unique relation types – 400 relation types
Open Information Extraction on Scientific Text: An Evaluation.
Paul Groth, Mike Lauruhn, Antony Scerri and Ron Daniel, Jr.
To appear at COLING 2018
21.
Augenstein, Isabelle, et al. "SemEval 2017 Task 10:
ScienceIE-Extracting Keyphrases and Relations from
Scientific Publications." Proceedings of the 11th
International Workshop on Semantic Evaluation
(SemEval-2017). 2017.
SCIENTIFIC TEXT IS CHALLENGING
24. Burger and Beans – weakly supervised/joint embeddings
[Diagram: an image vector and its correct text vector embedded close together on a hypersphere of joint embeddings, with the incorrect text vector far away]
Engilberge, Martin, Louis Chevallier, Patrick Pérez and Matthieu Cord. “Finding beans in burgers:
Deep semantic-visual embedding with localization.” CoRR abs/1804.01720 (2018)
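The geometry in the figure corresponds to a triplet-style ranking loss on L2-normalized image and text vectors: pull the correct caption's vector toward the image, push an incorrect one away. A minimal numpy sketch of that loss; the paper itself uses a hard-negative variant computed over a batch, which this does not reproduce.

```python
import numpy as np

def l2norm(v):
    return v / np.linalg.norm(v)

def triplet_loss(img, txt_pos, txt_neg, margin=0.2):
    """Hinge ranking loss on cosine similarities of joint embeddings:
    the correct text should be closer to the image than the incorrect
    text by at least `margin`."""
    img, txt_pos, txt_neg = map(l2norm, (img, txt_pos, txt_neg))
    return max(0.0, margin - img @ txt_pos + img @ txt_neg)

img = np.array([1.0, 0.0, 0.0])
good = np.array([0.9, 0.1, 0.0])   # near the image on the hypersphere
bad  = np.array([0.0, 1.0, 0.0])   # unrelated caption

print(triplet_loss(img, good, bad))  # 0.0: correct text already closer
print(triplet_loss(img, bad, good))  # positive: mismatched pair is penalized
```

This is what makes the approach weakly supervised: only positive image-caption pairs are annotated, and every other caption in the batch can serve as an inferred negative.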
25. APPLYING THE STATE OF THE ART
[Diagram: pipeline producing an Image Vector and a Text Vector in the joint space]
1. ResNet152 (not ResNet50 as usual)
2. Had to “pre-warm” with ImageNet – separate model/task
3. From the Weldon model (had to be ported to Python from Lua)
4. Had to find the right embeddings (K=620)
5. Had to find a library and stack many SRUs
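Step 5 stacks many SRU layers on the text side. The SRU recurrence (Lei et al., 2017), in its original simple form, looks roughly like this in numpy; the dimensions and random weights are placeholders, and this sketch assumes input and hidden size are equal so the highway connection works.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sru_layer(xs, W, Wf, bf, Wr, br):
    """One SRU layer (Lei et al. 2017, simplified original form):
      c_t = f_t * c_{t-1} + (1 - f_t) * (W x_t)   # light sequential part
      h_t = r_t * tanh(c_t) + (1 - r_t) * x_t     # highway connection
    All matrix multiplies depend only on x_t, so they can be batched
    across timesteps, which is what makes SRUs fast compared to LSTMs."""
    c = np.zeros(W.shape[1])
    hs = []
    for x in xs:
        f = sigmoid(x @ Wf + bf)                  # forget gate
        r = sigmoid(x @ Wr + br)                  # reset/highway gate
        c = f * c + (1 - f) * (x @ W)
        hs.append(r * np.tanh(c) + (1 - r) * x)
    return np.array(hs)

rng = np.random.default_rng(0)
d = 8                                   # toy hidden size (= input size here)
xs = rng.normal(size=(5, d))            # a sequence of 5 word vectors
out = sru_layer(xs, rng.normal(size=(d, d)), rng.normal(size=(d, d)),
                np.zeros(d), rng.normal(size=(d, d)), np.zeros(d))
print(out.shape)  # (5, 8)
```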
26. • Science and medicine are challenging domains for AI:
• long tailed, deep knowledge, constantly changing
• AI has the potential to change how we do scientific discovery and
transition it into practice
• At Elsevier we are applying AI to build platforms to support health and
science professionals
• Of course, we’re hiring
CONCLUSION
Paul Groth (@pgroth)
p.groth@elsevier.com
Elsevier Labs
Editor's Notes
Work with DANS
Reviewed 400 papers; deep dive on 114
Using EMMeT, and some code and data we already had, he built a quick prototype and tested it. Performance (in terms of accuracy of predictions) was surprisingly high.
Unsupervised learning is very important because it means the construction of the rough underlying knowledge base is scalable and not limited by the availability of experts.
Raw predictions not good enough for fully automatic operation, but are plenty good enough to help taxonomy editors and other people do their job much faster.
EHR and radiology datasets and reports are going to be just as messy
Flickr is relatively consistent when compared to real-world medical images – a challenge for pre-processing
We need to rely on more unsupervised than supervised techniques. Burger and beans is a weakly supervised approach which lets us infer negatives by knowing what the positives are
Through word embeddings it can also learn synonyms and such
Joint embeddings
Could start “end-to-end”, but that was not the case for the authors
Complicated training regime: we learned that the authors were using a pre-trained ResNet and pre-trained word embeddings
Pre-warming step – include everything up to the pooling layer (inclusive), then attach a different dense layer with softmax to do an image classification task
Let it learn SRU
Had to add dropouts on the conv and SRU layers; later, the third model added one to the affine layer
Global weight decay was another challenge; Keras does not support it, so we had to use regularizer functions, but we’re not sure that they behave the same way