Literature-Based Discovery (LBD) refers to the process of uncovering hidden connections that are implicit in scientific literature. Numerous hypotheses generated from scientific literature have influenced innovations in diagnosis, treatment, prevention and overall public health. However, much of the existing research on discovering hidden connections among concepts has used distributional statistics and graph-theoretic measures to capture implicit associations. Such metrics do not explicitly capture the semantics of hidden connections. ...
While effective in some situations, the practice of relying on domain expertise, structured background knowledge and heuristics to complement distributional and graph-theoretic approaches has serious limitations. ..
This dissertation proposes an innovative context-driven, automatic subgraph creation method for finding hidden and complex associations among concepts, along multiple thematic dimensions. It outlines definitions for context and shared context, based on implicit and explicit (or formal) semantics, which compensate for deficiencies in statistical and graph-based metrics. It also eliminates the need for heuristics a priori. An evidence-based evaluation of the proposed framework showed that 8 out of 9 existing scientific discoveries could be recovered using this approach. Additionally, insights into the meaning of associations could be obtained using provenance provided by the system. In a statistical evaluation to determine the interestingness of the generated subgraphs, it was observed that an arbitrary association is mentioned in only approximately 4 articles in MEDLINE, on average. These results suggest that leveraging implicit and explicit context, as defined in this dissertation, is an advancement of the state-of-the-art in LBD research.
Ph.D. Committee: Drs. Amit Sheth (Advisor), TK Prasad, Michael Raymer,
Ramakanth Kavuluru (UKY), Thomas C. Rindflesch (NLM) and Varun Bhagwan (Yahoo! Labs)
Relevant Publications (more at: http://knoesis.wright.edu/students/delroy/)
D. Cameron, R. Kavuluru, T. C. Rindflesch, O. Bodenreider, A. P. Sheth, K. Thirunarayan. Leveraging Distributional Semantics for Domain Agnostic Literature-Based Discovery (under preparation)
D. Cameron, O. Bodenreider, H. Yalamanchili, T. Danh, S. Vallabhaneni, K. Thirunarayan, A. P. Sheth, T. C. Rindflesch. A Graph-based Recovery and Decomposition of Swanson’s Hypothesis using Semantic Predications. Journal of Biomedical Informatics (JBI13), 46(2): 238–251, 2013
D. Cameron, R. Kavuluru, O. Bodenreider, P. N. Mendes, A. P. Sheth, K. Thirunarayan. Semantic Predications for Complex Information Needs in Biomedical Literature. International Bioinformatics and Biomedical Conference (BIBM11), pp. 512–519, 2011 (acceptance rate=19.4%)
D. Cameron, P. N. Mendes, A. P. Sheth, V. Chan. Semantics-empowered Text Exploration for Knowledge Discovery. ACM Southeast Conference (ACMSE10), 14, 2010
Delroy Cameron's Dissertation Defense: A Context-Driven Subgraph Model for Literature-based Discovery
1. A CONTEXT-DRIVEN SUBGRAPH MODEL FOR
LITERATURE-BASED DISCOVERY
PH.D. DISSERTATION DEFENSE
DELROY CAMERON
AUGUST 18, 2014
PH.D. COMMITTEE
AMIT P. SHETH (ADVISOR)
KRISHNAPRASAD THIRUNARAYAN
MICHAEL RAYMER
RAMAKANTH KAVULURU (UKY)
THOMAS C. RINDFLESCH (NIH)
VARUN BHAGWAN (YAHOO! LABS)
All truths are easy to understand once they are discovered;
the point is to discover them. (Galileo Galilei, 1564–1642)
2. 2
Historical Perspectives
Walter Sutton
(1877 – 1916)
Theodor Boveri
(1862 – 1915)
Gregor Johann Mendel
(1822 – 1884)
Mendelian Laws of Inheritance
(1866)
Boveri-Sutton Chromosome Theory
(1903)
3. 3
Science of Making Discoveries
Discovery
Information Processing
System
What is promising?
4. 4
Thesis Statement
An information processing system that leverages rich representations
of textual content from scientific literature based on implicit and explicit
context can provide effective means for literature-based discovery.
5. 5
Motivation
1999: Rofecoxib TREATS Osteoarthritis
Merck & Co.
Increased risk of
Heart Attack
2002
2004
$254.3 million
Settlement
2005
Vioxx
Withdrawn
$4.85 billion
Settlement
Confirmed by
Clinical Trial
2007 2011
$950 million
Settlement
2013
$23 million
Settlement
7. 7
Literature-Based Discovery (LBD)
ABC Model
AnC Model
Context-Driven Subgraph Model
A B C
A B1 B2 Bi C
Source: Wikipedia - http://en.wikipedia.org/wiki/Don_R._Swanson
Keyword-based
Concept-based
Relations-based
1986 1996 2006 2011
ARROWSMITH v1
Term Frequency
1999
IRIDESCENT
Term Co-occurrence
2001
DAD
MetaMAP
UMLS
2003
Litlinker
MeSH, UMLS, Rules
Level of Support
Contribution #1
Context-Driven
Subgraph Model for LBD
SemBT
Semantic Predications
Level of Support
Discovery Browsing
Degree Centrality
Cooperative Reciprocity
Manual
2013
Manjal
UMLS, MeSH
Topic Profiles, TF-IDF
2004
Rajolink
MeSH, Rarity
BioSbKDS
UMLS Relations
MeSH
2005
BITOLA
UMLS, MeSH
Assoc. Rules,
Confidence
Graph-based
ACS (2004)
MeSH,
Hebbian Learning
A B C
CAUSES INHIBITS
A C
PRODUCES
INHIBITS
Discovery Patterns
Hybrid
ARROWSMITH v2
8 Features (2007)
Semantic MEDLINE
Summarization
Discovery Browsing
Epiphanet
Predications-based
Semantic Indexing
CoPub
Keywords, Mutual
Information
2010
Literature-based discovery refers to the use of papers and other academic publications
(the “literature”) to find new relationships between existing knowledge (the “discovery”).
Definition courtesy of Wikipedia: http://en.wikipedia.org/wiki/Literature-based_discovery
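The ABC model named on this slide can be illustrated with a small sketch. It assumes the literature has already been reduced to undirected concept pairs (for example, co-occurrences). The function name abc_candidates and the toy pairs are hypothetical and are not taken from any of the systems listed above. Given a start concept A, it returns each concept C that is connected to A only through intermediate concepts B.

```python
from collections import defaultdict


def abc_candidates(pairs, a):
    """Return {c: sorted linking b terms} for every concept c that shares at
    least one intermediate b with a but is never directly paired with a."""
    neighbors = defaultdict(set)
    for x, y in pairs:
        neighbors[x].add(y)
        neighbors[y].add(x)

    candidates = defaultdict(set)
    for b in neighbors[a]:
        for c in neighbors[b]:
            # Skip a itself and anything already directly linked to a.
            if c != a and c not in neighbors[a]:
                candidates[c].add(b)
    return {c: sorted(bs) for c, bs in candidates.items()}


# Toy pairs for illustration only (not extracted from MEDLINE).
pairs = [
    ("Dietary Fish Oils", "Platelet Aggregation"),
    ("Dietary Fish Oils", "Blood Viscosity"),
    ("Platelet Aggregation", "Raynaud Syndrome"),
    ("Blood Viscosity", "Raynaud Syndrome"),
]
result = abc_candidates(pairs, "Dietary Fish Oils")
print(result)
```

A keyword or concept-based system in this style reports only that A and C share intermediates; it says nothing about what the links mean, which is the gap the relations-based and subgraph-based approaches address.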
8. 8
Application: Raynaud Syndrome – Fish Oil
ISA
Prostaglandin I3
CONVERTS_TO
Dietary
Fish Oils
Platelet
Aggregation
DISRUPTS
ISA
DISRUPTS
DISRUPTS
Epoprostenol
DISRUPTS
ISA
STIMULATES
Prostaglandin
CONVERTS_TO
Raynaud
Syndrome
TREATS
CAUSES
D. Cameron, O. Bodenreider, H. Yalamanchili, T. Danh, S. Vallabhaneni, K. Thirunarayan, A. P. Sheth, T. C. Rindflesch. A Graph-based Recovery and Decomposition
of Swanson’s Hypothesis using Semantic Predications. Journal of Biomedical Informatics (JBI13). 46(2): 238–251, 2013.
Dietary
Fish Oils
Platelet
Aggregation
Raynaud
Syndrome
DISRUPTS CAUSES
Dietary
Fish Oils
Platelet
Aggregation
Raynaud
Syndrome
Keyword/
Concept
based
Relations
based
Subgraph
based
Inferred predicates
10. DISRUPTS
ISA
ISA
Dietary
Fish Oils
Platelet
Aggregation
DISRUPTS
Raynaud
Syndrome
CAUSES
Prostaglandins
CONVERTS_TO
Prostacyclin
(PGI2)
DISRUPTS
Prostaglandin I3
(PGI3) TREATS STIMULATES
Raynaud
Syndrome
Dietary
Fish Oils
Fatty Acid
Essential
Fatty Acid
Triglyceride
Lipid
ISA
DISRUPTS CAUSES
ISA
INHIBITS
AFFECTS
ISA
INHIBITS
Blood
Viscosity
Cellular
Activity
Blood
Physiology
Problem
How to automate this?
Tissue
Function
DISRUPTS
ISA
Dietary
Fish Oils
Prostaglandin I3
(PGI3)
Prostacyclin
(PGI2)
Raynaud
Syndrome
CAUSES Vasoconstriction INHIBITS
CONVERTS_TO
AFFECTS DISRUPTS
TREATS
13. 13
. . .
Subgraph Model
Predications
Graph (G)
Candidate
Graph (RG)
Subgraphs (SG)
No two contexts are the same
R(s,t)(c1) R(s,t)(c2) R(s,t)(ck)
R(s,t)
. . .
. . .
What is context?
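The step from the predications graph (G) to a candidate graph between a source s and a target t can be sketched as path enumeration over subject-predicate-object triples. This is only an illustration under stated assumptions: the function find_paths, the max_hops limit, and the toy triple list are hypothetical. The DISRUPTS and CAUSES triples follow the Fish Oil example on the earlier slide; the ISA triple is an invented dead end.

```python
from collections import defaultdict


def find_paths(predications, source, target, max_hops=3):
    """Enumerate simple directed paths from source to target. Each path is a
    list of (subject, predicate, object) triples, at most max_hops long."""
    outgoing = defaultdict(list)
    for subj, pred, obj in predications:
        outgoing[subj].append((subj, pred, obj))

    paths = []

    def extend(node, path, visited):
        if node == target and path:
            paths.append(list(path))
            return
        if len(path) == max_hops:
            return
        for triple in outgoing.get(node, []):
            nxt = triple[2]
            if nxt not in visited:  # keep paths simple (no repeated concepts)
                path.append(triple)
                extend(nxt, path, visited | {nxt})
                path.pop()

    extend(source, [], {source})
    return paths


# Toy predications graph G (illustrative only).
predications = [
    ("Dietary Fish Oils", "DISRUPTS", "Platelet Aggregation"),
    ("Platelet Aggregation", "CAUSES", "Raynaud Syndrome"),
    ("Dietary Fish Oils", "ISA", "Lipid"),
]
paths = find_paths(predications, "Dietary Fish Oils", "Raynaud Syndrome")
for path in paths:
    print(" -> ".join(f"{s} {p} {o}" for s, p, o in path))
```

The resulting set of paths plays the role of the candidate graph; grouping those paths into subgraphs requires a notion of context.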
15. 15
• Path Relatedness
• Semantic Predication Context
Context Distribution Assumption: The context of a semantic predication
can be expressed as the distribution of all MeSH descriptors associated
with all articles that contain it.
Semantic Underpinnings
Relational
Semantic
Summary
Textual
Semantic
Summary
Concept-Level
Semantic
Summary
Interchangeability Assumption: The concept-level and relational semantic
summary of a MEDLINE article are interchangeable.
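The Context Distribution Assumption on this slide can be sketched in a few lines. The record layout (a dict with "predications" and "mesh" keys), the function name predication_context, and the three toy articles are assumptions made for illustration, not the dissertation's data model.

```python
from collections import Counter


def predication_context(predication, articles):
    """Return the distribution of MeSH descriptors over all articles that
    contain the given predication, as {descriptor: relative frequency}."""
    counts = Counter()
    for article in articles:
        if predication in article["predications"]:
            counts.update(article["mesh"])
    total = sum(counts.values())
    if total == 0:
        return {}
    return {descriptor: n / total for descriptor, n in counts.items()}


predication = ("Dietary Fish Oils", "DISRUPTS", "Platelet Aggregation")

# Toy article records (illustrative only, not real MEDLINE entries).
articles = [
    {"id": "a1", "predications": {predication},
     "mesh": ["Fish Oils", "Platelet Aggregation"]},
    {"id": "a2", "predications": {predication},
     "mesh": ["Fish Oils", "Blood Viscosity"]},
    {"id": "a3", "predications": set(),
     "mesh": ["Raynaud Disease"]},
]
context = predication_context(predication, articles)
print(context)
```

Article a3 does not contain the predication, so its descriptor contributes nothing to the context.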
16. 16
Linguistic Underpinnings
Linguistic items with similar distributions have similar meanings
“You shall know a word
by the company it keeps”
– J. R. Firth 1957
Semantic Predications with shared contexts in their distributions are related
Distributional Semantics
Context-sensitive nature of meaning
18. 18
MeSH Hierarchy
MeSH Hierarchy
Automatic Subgraph Creation
m1 m2
m7 m8
m1 m7 m2 m8
m1 m5 m9 m8
Semantic Relatedness
of MeSH Context Vectors
m9 m1
m5 m8
Contribution #2
Context of a path
as a vector of
MeSH Descriptors
pi
pj
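A minimal sketch of Contribution #2, the context of a path as a vector of MeSH descriptors, follows. The slides compare descriptors through the MeSH hierarchy; that measure is not reproduced here. Cosine similarity over binarized vectors is swapped in purely as a stand-in, and the names path_context, binarize and cosine, along with the weights, are hypothetical. The descriptor labels m1, m2, m5, m7, m8 and m9 echo the placeholders on the slide.

```python
import math


def path_context(predication_contexts):
    """Sum the MeSH descriptor weights of every predication on a path."""
    total = {}
    for ctx in predication_contexts:
        for descriptor, weight in ctx.items():
            total[descriptor] = total.get(descriptor, 0.0) + weight
    return total


def binarize(vector):
    """Keep only presence or absence of each descriptor."""
    return {d: 1.0 for d, w in vector.items() if w > 0}


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(d, 0.0) for d, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Two toy paths, each made of two predication contexts (illustrative weights).
p_i = path_context([{"m1": 2.0, "m7": 1.0}, {"m2": 1.0, "m8": 3.0}])
p_j = path_context([{"m1": 1.0, "m5": 1.0}, {"m9": 1.0, "m8": 1.0}])
relatedness = cosine(binarize(p_i), binarize(p_j))
print(relatedness)
```

Here the two paths share m1 and m8 out of four descriptors each. A hierarchy-aware measure could also credit descriptors that are close in MeSH without being identical, which plain cosine cannot do.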
22. 22
Hierarchical Agglomerative Clustering
A C  A C  A C  A C  A C  A C  A C  A C
Iteration 1
Iteration n
. . .
Bucket Population
Bucket Merging
...
A C
A C
A C
A C
Path Relatedness Threshold
1. Bucket Population
2. Bucket Merging
3. Subgraph Ranking
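The three steps listed above can be sketched as follows. Everything in this block is an assumption made for illustration: paths are represented by sets of MeSH descriptors, Jaccard overlap stands in for path relatedness, a path joins the first bucket whose seed it is related to, buckets merge when any cross pair clears the threshold, and subgraphs are ranked by size. The dissertation's own choices may differ on each point.

```python
def jaccard(a, b):
    """Overlap of two descriptor sets (a stand-in for path relatedness)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def populate_buckets(paths, threshold):
    """Step 1: put each path into the first bucket whose seed (first member)
    is related to it above the threshold, else start a new bucket."""
    buckets = []
    for name, ctx in paths.items():
        for bucket in buckets:
            if jaccard(ctx, paths[bucket[0]]) >= threshold:
                bucket.append(name)
                break
        else:
            buckets.append([name])
    return buckets


def merge_buckets(buckets, paths, threshold):
    """Step 2: repeatedly merge two buckets when any cross-bucket pair of
    paths is related above the threshold."""
    buckets = [list(b) for b in buckets]
    merged = True
    while merged:
        merged = False
        for i in range(len(buckets)):
            for j in range(i + 1, len(buckets)):
                best = max(jaccard(paths[p], paths[q])
                           for p in buckets[i] for q in buckets[j])
                if best >= threshold:
                    buckets[i].extend(buckets.pop(j))
                    merged = True
                    break
            if merged:
                break
    return buckets


def rank_subgraphs(buckets):
    """Step 3: rank subgraphs, here simply by number of paths."""
    return sorted(buckets, key=len, reverse=True)


# Toy path contexts (hypothetical descriptors m1..m9).
paths = {
    "p1": {"m1", "m2", "m3"},
    "p2": {"m1", "m2", "m4"},
    "p3": {"m7", "m8"},
    "p4": {"m7", "m8", "m9"},
    "p5": {"m2", "m4", "m5"},
}
populated = populate_buckets(paths, threshold=0.5)
ranked = rank_subgraphs(merge_buckets(populated, paths, threshold=0.5))
print(populated)
print(ranked)
```

In this toy run p5 is unrelated to the seed p1 and starts its own bucket, but it is related to p2, so the merging step folds it into the first bucket.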
28. 28
Bucket Merging
Ba
Bb
Christopher D. Manning, Prabhakar Raghavan, Hinrich Schütze: Introduction to information retrieval. Cambridge University Press 2008,
ISBN 978-0-521-86571-5, pp. I-XXI, 1-482
Straggly Clusters Compact Clusters
Broad Clusters
47. 47
Predications-based Knowledge Exploration
Corpus
Predications Graph
Definitional Knowledge (UMLS + MeSH)
Provenance
Knowledge Abstraction
D. Cameron, R. Kavuluru, O. Bodenreider, P. N. Mendes, A. P. Sheth, K. Thirunarayan. Semantic Predications for Complex Information Needs in Biomedical Literature.
International Bioinformatics and Biomedical Conference (BIBM11). 512–519, 2011.
Contribution #4
Combining Assertional and
Definitional Knowledge
for Knowledge Exploration
48. 48
Levels of Contexts
A B C
Predication
Context
A B1 B2 Bi C
Path
Context
A B1 B2 B3 C
A B1 B2 C
Shared
Context
A C
PRODUCES
INHIBITS
Subgraph
Context
…
…
…
…
…
…
A C
A C
A C
…
Dimensions
54. 54
Levels of Semantic Representation
Keywords
Concepts
MeSH Descriptors
Semantic Predications
Ensemble of Features
Relationships
A B
Semantic Predication
PREDICATE
55. 55
Limitations
1. Manual Threshold
– MeSH Semantic Similarity
2. Path Relatedness Threshold
– Only Approximately Gaussian
3. Definition of Context
4. MEDLINE Querying
– Deep integration of Assertional/Definitional
5. Contradiction Detection
6. Statistical Evaluation
7. Scalability of Clustering Algorithm
8. Subgraph Labeling
56. 56
Take Away
• Future of Information Processing
– Rich Knowledge Representations
o Implicit, Formal, Powerful semantics
– Application to Literature-Based Discovery
57. 57
Conclusion
• Context-Driven Subgraph Model
– Manually create Complex Associations
– Automatic Subgraph Creation
o Novel definitions for Context and Shared Context
o Multiple Thematic Dimensions
– Predications-based Knowledge Exploration
o Predicates
o Highlighted MEDLINE sentences
– Knowledge Rediscovery
o 8 out of 9 existing scientific discoveries
58. 58
Publications
1. D. Cameron, R. Kavuluru, T. C. Rindflesch, O. Bodenreider, A. P. Sheth, K. Thirunarayan. Context-Driven Automatic Subgraph Creation for
Literature-Based Discovery (under preparation)
2. D. Cameron, A. P. Sheth, N. Jaykumar, G. Anand, K. Thirunarayan, G. A. Smith. A Hybrid Approach to Finding Relevant Social Media Content for
Domain Specific Information Needs. (submitted to the Journal of Web Semantics)
3. D. Cameron, O. Bodenreider, H. Yalamanchili, T. Danh, S. Vallabhaneni, K. Thirunarayan, A. P. Sheth, T. C. Rindflesch. A Graph-based Recovery
and Decomposition of Swanson’s Hypothesis using Semantic Predications. Journal of Biomedical Informatics (JBI13). 46(2): 238–251, 2013.
4. D. Cameron, G. A. Smith, R. Daniulaityte, A. P. Sheth, D. Dave, L. Chen, G. Anand, R. Carlson, K. Z. Watkins, R. Falck. PREDOSE: A Semantic Web
Platform for Drug Abuse Epidemiology using Social Media. Journal of Biomedical Informatics (JBI13). 46(6): 985–997, 2013.
5. R. Daniulaityte, R. Carlson, R. Falck, D. Cameron, S. Perera, L. Chen, A. P. Sheth. “I just wanted to tell you that Loperamide WILL WORK”: A Web-Based
Study of Extra-medical use of Loperamide. Journal of Drug and Alcohol Dependence (DAD13) 130(1–3): 241–244, 2013.
6. D. Cameron, V. Bhagwan, A. P. Sheth. Towards Comprehensive Longitudinal Healthcare Data Capture. International Workshop on Semantic Web
in Literature-Based Discovery (SWLBD12). 241–247, 2012.
7. R. Daniulaityte, R. Carlson, R. Falck, D. Cameron, S. Perera, L. Chen, A. P. Sheth. A Web-Based Study of Extra-medical use of Loperamide. The College on
Problems of Drug Dependence (CPDD12), 2012.
8. D. Cameron, R. Kavuluru, O. Bodenreider, P. N. Mendes, A. P. Sheth, K. Thirunarayan. Semantic Predications for Complex Information Needs in
Biomedical Literature. International Bioinformatics and Biomedical Conference (BIBM11). 512–519, 2011.
9. D. Cameron, B. Aleman-Meza, I. B. Arpinar, S. L. Decker, A. P. Sheth. A Taxonomy-based Model for Expertise Extrapolation. International
Conference on Semantic Computing (ICSC10). 333–340, 2010.
10. D. Cameron, P. N. Mendes, A. P. Sheth, V. Chan. Semantics-empowered Text Exploration for Knowledge Discovery. ACM Southeast Conference
(ACMSE10). 14, 2010.
11. C. Thomas, W. Wang, P. Mehra, D. Cameron, P. N. Mendes, A. P. Sheth. What Goes Around Comes Around – Improving Linked Open Data through On-
Demand Model Creation. Web Science Conference (WebSci10), 2010.
12. P. N. Mendes, P. Kapanipathi, D. Cameron, A. P. Sheth. Dynamic Associative Relationships on the Linked Data Web. Web Science Conference (WebSci10),
2010.
60. 60
Parting Words
“...some day the piecing together of dissociated knowledge will open up such
terrifying vistas of reality,...that we shall either go mad from the revelation or
flee from the deadly light into the peace and safety of a new dark age.”
– H. P. Lovecraft (The Call of Cthulhu, The Horror in Clay).
H. P. Lovecraft. The Call of Cthulhu. In S. T. Joshi, editor. The Call of Cthulhu and Other Weird Stories. Penguin Books Ltd., London, 1999
61. 61
Acknowledgements
• Olivier Bodenreider
• Marcelo Fiszman
• Mike Cairelli
• Swapna Abhyankar
• Drashti Dave
• Dongwook Shin
• Special Thanks
o Pavan
o Shreyansh
o Swapnil
o Nishita
• PREDOSE Team
o Nishita
o Gaurish
o Alan
o Revathy
62. 62
Ph.D. Committee Members
Amit P. Sheth
(Advisor)
T.K. Prasad Michael Raymer
Ramakanth Kavuluru Thomas C. Rindflesch Varun Bhagwan
Editor's Notes
Thank everyone for coming.
Feel free to ask questions
Explored the Research Question of: Characteristics of Inheritance of Traits across Generations of Peas
Gregor Johann Mendel – Debunked Blending Inheritance, Founder of Genetics, Pea Hybridization, 1866
EXPERIMENTATION
OBSERVATION
- Inheritance of traits across generations seemed to extend beyond the immediate parents in the lineage
EXPLANATION
- Inheritance of traits appears to be influenced by the presence of dominant and recessive factors, which split, then independently recombine
THEORY
- Law of Segregation
- Law of Independent assortment
Explored the Research Question of: The mechanism of Cell Division (cytology) in the embryos of Grasshoppers
Walter Sutton & Theodor Boveri – Cytology 1903, Genetic Inheritance, each cell split is equally likely – gives the causal mechanism for Mendel’s law
OBSERVED
- splitting of chromosomes in the cells of grasshoppers (meiosis)
EXPLAINED
- Mendel's laws of inheritance applied to chromosomes at the cellular level in living organisms
THEORIZED
- Chromosomes are the basis for genetic inheritance
Jorn Dyerberg & Hans Olaf Bang (1913–1994) – The Greenland Eskimo
OBSERVED
- Greenland Eskimos, no AMI
EXPLAINED
- diet rich in omega-3 fatty acids
THEORIZED
- marine oils can treat thrombosis, atherosclerosis, and AMI
LBD is now driven by digital data (in silico as opposed to in vivo)
Four activities involved in the science of making discoveries under the guidance of a Human
An information processing system that leverages rich representations of textual content from scientific literature based on implicit and explicit context can provide effective means for literature-based discovery. This has been convincingly demonstrated through rediscovery of several well-known associations (between biomedical concepts) and their substantiation using MEDLINE and the Medical Subject Headings (MeSH) vocabulary.
Vioxx Brand Name (Rofecoxib is a nonsteroidal anti-inflammatory drug - NSAID)
- stronger pain medication than Naproxen (Brand Name Aleve)
- easier on the stomach than Naproxen
2004 Merck’s Clinical Trial - proved risk of heart attack
Lawsuit by 50,000 patients
Vioxx (anti-inflammatory)
- stronger
- less severe side effects (easier on the stomach)
LBD is different from traditional research
Direct observations of the object of interest
Keyword-based – error prone due to absence of text normalization to standard concepts
Concept-based – (also Semantics-based, concepts but no explicit relationships)
Relations-based – (explicit relationships) but limited complexity, unable to capture causality, mechanisms of interaction
Graph-based - Giant Component, Clustering Coefficient, Geodesic, Centrality (betweenness, closeness)
Hybrid – combine machine learning, summarization with traditional LBD approaches
Rich representations
Personalization
Google Knowledge Graph
Human Activity Modeling
Mobile Applications/Advertising (get examples)
Two goals for automation:
Create subgraphs that capture complex associations
Along multiple thematic dimensions
Use of background knowledge to improve LBD
BKR
MeSH
Problem definition
In terms of path relatedness
Decomposed to semantic predication relatedness
To achieve this, we have studied characteristics of MEDLINE abstracts
Articles have properties/attributes
Provide various levels of abstraction of the full text
Given a way to represent context of a path, subgraphs can be automatically created in 6 steps
Frequency is the epiphenomenon of context
Compute Path Relatedness
Two Objectives
Binarize the vectors
Mean – weighted average of the points
Variance – average of the squared distances from the mean
Standard Deviation – square root of the Variance (What is normal, what is not)
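These notes suggest deriving the path relatedness threshold from the spread of relatedness scores. A minimal sketch, assuming the scores are roughly Gaussian and that "not normal" means more than k standard deviations above the mean, is shown below. The function name relatedness_threshold, the factor k, the unweighted mean, and the toy scores are all assumptions for illustration.

```python
import statistics


def relatedness_threshold(scores, k=1.0):
    """Return mean + k * standard deviation of the relatedness scores.
    Uses the population standard deviation and an unweighted mean."""
    mu = statistics.mean(scores)
    sigma = statistics.pstdev(scores)
    return mu + k * sigma


# Toy pairwise path relatedness scores (illustrative only).
scores = [0.1, 0.2, 0.2, 0.3, 0.7]
mu = statistics.mean(scores)
variance = statistics.pvariance(scores)
threshold = relatedness_threshold(scores, k=1.0)
unusual = [s for s in scores if s >= threshold]
print(mu, variance, threshold, unusual)
```

Because the score distribution is only approximately Gaussian, a cutoff chosen this way is a heuristic rather than a calibrated significance level.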
Single-link
Cluster if maximum similarity is above the threshold
Straggly Clusters
Complete-link
Cluster if minimum similarity is above threshold
Strict, compact clusters
Group-average
Cluster if the average similarity over all pairs, both intra-cluster and inter-cluster, is above the threshold
Well connected, but with broader clusters than complete-link
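The three merge criteria in these notes can be written as small functions over a pairwise similarity. This is a sketch under assumptions: the function names, the sim callback, and the toy similarity table are hypothetical. Group-average here averages over every pair in the merged cluster, excluding self-pairs.

```python
from itertools import combinations


def single_link(a, b, sim):
    """Similarity of the closest cross-cluster pair (tends to chain)."""
    return max(sim(x, y) for x in a for y in b)


def complete_link(a, b, sim):
    """Similarity of the farthest cross-cluster pair (strict, compact)."""
    return min(sim(x, y) for x in a for y in b)


def group_average(a, b, sim):
    """Average similarity over all pairs in the merged cluster, including
    pairs that were already in the same cluster."""
    members = list(a) + list(b)
    pairs = list(combinations(members, 2))
    return sum(sim(x, y) for x, y in pairs) / len(pairs)


# Toy symmetric similarity table (illustrative only).
table = {
    frozenset({"p1", "p2"}): 0.9,
    frozenset({"p1", "p3"}): 0.6,
    frozenset({"p2", "p3"}): 0.3,
}


def sim(x, y):
    return table[frozenset({x, y})]


bucket_a, bucket_b = ["p1", "p2"], ["p3"]
single = single_link(bucket_a, bucket_b, sim)
complete = complete_link(bucket_a, bucket_b, sim)
average = group_average(bucket_a, bucket_b, sim)
print(single, complete, average)
```

With a threshold of 0.5, single-link and group-average would merge these two buckets while complete-link would not, which matches the straggly, compact and broad cluster shapes described in the notes.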
Definitional Knowledge – Top-down
Assertional Knowledge – Bottom-up
Using both together is probably best.
Analogy
Google Knowledge Graph
IBM Human Activity Modeling
Yahoo Personalization
Biomedicine Literature-based Discovery
Mobile Applications