In this talk, we will cover how to extract entities from text using both rule-based and deep learning techniques. We will also cover how to use rule-based entity extraction to bootstrap a named entity recognition model. The other important aspect of this project we will cover is how to infer relationships between entities, and combine them with explicit relationships found in the source data sets. Although this talk is focused on the CORD-19 data set, the techniques covered are applicable to a wide variety of domains. This talk is for those who want to learn how to use NLP to explore relationships in text.
Using NLP to Explore Entity Relationships in COVID-19 Literature
1. Using Spark-NLP to build a Biomedical
Knowledge Graph
Or how to build a space telescope to not get lost in the darkness
2. About Us
Alex Thomas is a principal data scientist at Wisecube. He's
used natural language processing and machine learning with
clinical data, identity data, employer and jobseeker data, and
now biochemical data. Alex is also the author of Natural
Language Processing with Spark NLP.
Vishnu is the CTO and Founder of Wisecube AI and has over
two decades of experience building data science teams and
platforms. Vishnu is a big believer in graph based systems
and has extensive experience with various graph databases
including Neo4J, the original TitanDB release (now
JanusGraph) and more recently OrientDB and AWS Neptune.
3. About Wisecube AI
Wisecube AI helps
accelerate Biomedical R&D
by combining the power of
knowledge graphs, intelligent
applications and a low code
platform.
4. Biomedical data: The final frontier
The Biomedical Big (Data) Bang!
Too much dark data is being created for
scientists to comprehend
Insights are hidden
The curse of unstructured data
Not enough labeled data
Gathering Labels is expensive in Biomedical domains
5. Hubble Space Telescope: NLP and Knowledge
Graphs
1. Visualize: Allow users to gather high level
insights for biomedical research area
2. Explore: Discover connections and concepts
hidden in the text and structured data
3. Learn: Create representations of concepts
used to learn deep learning models for
prediction and experimentation.
9. Datasets
❏ CORD-19
❏ Text dataset of biomedical articles related to COVID-19
❏ From the Semantic Scholar team at the Allen Institute for AI
❏ Contains articles and metadata file
❏ More than ~200k articles mentioned in metadata
❏ ~90k with JSON files converted from PDFs
❏ ~70k with JSON files converted from PubMed Central
❏ Most of these overlap
❏ ~90k articles with JSON files in the dataset
https://www.semanticscholar.org/cord19
10. Datasets
❏ Infectious Disease Names
❏ List of ~500 disease names and synonyms
❏ Manually curated
❏ Added preferred names
❏ Wikipedia links
❏ Removed overly common or ambiguous terms
❏ ~10 hours work
❏ 1351 disease names
https://www.atsu.edu/faculty/chamberlain/website/diseases.htm
15. Text Processing
❏ The dataset is too large to process all in
memory
❏ Spark NLP
❏ Website: nlp.johnsnowlabs.com
❏ Book: Natural Language Processing with Spark NLP (by me, Alex)
❏ Amazon link: amzn.com/1492047767
❏ NLP library built on top of Spark MLlib
❏ Building our own pipeline
16. Text Processing
❏ Pipeline
❏ Sentence tokenizing - splitting text into sentences
❏ Tokenizing - splitting text into tokens
❏ Normalizing - lowercasing, removing non-alphabetics
❏ Stop word cleaning - removing common words
❏ Lemmatizing - reducing words to their dictionary entry
❏ E.g. symptoms → symptom, diagnoses → diagnosis
❏ Outputs
❏ Normalized tokens - used for entity extraction
❏ Lemmas - used for topic modeling
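The project itself runs these stages as Spark NLP annotators; as a dependency-free illustration only, the five steps above can be sketched in plain Python (the stop-word list and lemma dictionary below are tiny toy stand-ins, not the real resources):

```python
import re

STOP_WORDS = {"the", "is", "a", "of", "and", "are", "into"}   # toy sample
LEMMA_DICT = {"symptoms": "symptom", "diagnoses": "diagnosis"}  # toy sample

def process(text):
    """Run the pipeline stages, returning (normalized tokens, lemmas)
    per sentence."""
    # 1. Sentence tokenizing: split on sentence-final punctuation
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    normalized, lemmas = [], []
    for sent in sentences:
        # 2. Tokenizing: pull out alphabetic tokens
        tokens = re.findall(r"[A-Za-z]+", sent)
        # 3. Normalizing: lowercase (non-alphabetics already dropped)
        norm = [t.lower() for t in tokens]
        # 4. Stop word cleaning: remove common words
        norm = [t for t in norm if t not in STOP_WORDS]
        # 5. Lemmatizing: dictionary lookup back to the dictionary entry
        lem = [LEMMA_DICT.get(t, t) for t in norm]
        normalized.append(norm)
        lemmas.append(lem)
    return normalized, lemmas
```

The two outputs mirror the slide: normalized tokens feed entity extraction, lemmas feed topic modeling.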
17. Text Processing
❏ Identifying phrases (manual process)
❏ Run pipeline
❏ Analyze n-gram frequencies
❏ n-grams are sequences of tokens of length n
❏ Here n=2,3,4
❏ Identify “stop phrases”
❏ Formulaic statements related to copyright, links, etc.
❏ Identify phrases
❏ Add to tokenizer
❏ Repeat
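The n-gram frequency step above is a simple sliding-window count; a minimal sketch (function name and inputs are illustrative):

```python
from collections import Counter

def ngram_counts(token_lists, n):
    """Count n-grams (length-n token sequences) across documents."""
    counts = Counter()
    for tokens in token_lists:
        # slide a window of length n over each token list
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts
```

Running this for n = 2, 3, 4 and scanning the most frequent results surfaces both candidate phrases and the "stop phrases" to filter out.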
18. Text Processing
- Log-Mean TF.IDF of unigrams
- Log-Mean TF.IDF of bigrams
- Log-Mean TF.IDF of trigrams
- Log-Mean TF.IDF of 4-grams
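The log-mean TF.IDF scores charted above follow the formulas from the graph-building slide; a small numpy sketch with a made-up document-term count matrix:

```python
import numpy as np

# Toy doc-term count matrix (made-up counts): rows = docs, cols = n-grams
X = np.array([[2, 0, 1],
              [0, 3, 1],
              [1, 1, 0]], dtype=float)
N = X.shape[0]                   # number of docs
DF = (X > 0).sum(axis=0)         # document frequency per term
TF = np.log2(1 + X)              # log-scaled term frequency
IDF = np.log2(N / DF)            # inverse document frequency
TFIDF = TF * IDF
log_mean_tfidf = TFIDF.mean(axis=0)  # one score per n-gram
```

Ranking unigrams, bigrams, trigrams, and 4-grams by this score highlights the terms most characteristic of the corpus.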
20. Topic Modeling
❏ Topic modeling
❏ “You shall know a word by the company it keeps” - J. R. Firth 1957
❏ Clustering text data into topics
❏ Visualize diversity in corpus
❏ Analyze vocabulary
❏ Latent Dirichlet Allocation
❏ Documents are mixtures of topics
❏ Topics are mixtures of words
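The two bullets above are LDA's generative story; a sketch of just that story (not an inference algorithm), with a made-up vocabulary and hand-built topic distributions:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy vocabulary and two hand-built topics (illustrative only)
vocab = ["virus", "protein", "fever", "cough", "genome", "sequence"]
topics = np.array([
    [0.05, 0.05, 0.40, 0.40, 0.05, 0.05],  # a "symptoms" topic
    [0.30, 0.20, 0.02, 0.02, 0.26, 0.20],  # a "genomics" topic
])

def generate_doc(n_words, alpha=(1.0, 1.0)):
    """Documents are mixtures of topics; topics are mixtures of words."""
    theta = rng.dirichlet(alpha)              # this doc's topic mixture
    words = []
    for _ in range(n_words):
        z = rng.choice(len(topics), p=theta)      # pick a topic
        w = rng.choice(len(vocab), p=topics[z])   # pick a word from it
        words.append(vocab[w])
    return theta, words

theta, doc = generate_doc(10)
```

Fitting LDA inverts this process: given only the documents, it recovers the topic-word and document-topic mixtures.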
23. Entity Extraction
❏ Dictionary based extraction
❏ Aho-Corasick algorithm
❏ Requires dictionary / wordlist
❏ Model based extraction
❏ Deep learning common for recent models
❏ Conditional random fields were the norm until roughly five years ago
24. Entity Extraction
❏ Aho-Corasick algorithm
❏ Efficiently searches for a large number of patterns at once
❏ Pro: no training data needed, only a list of names and aliases
❏ Con: can only find entities in the alias list
❏ Con: does not use context
❏ APRIL is a protein name (UniProt)
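A minimal pure-Python sketch of the Aho-Corasick idea (the patterns and helper names here are illustrative; in practice you would use a tuned implementation):

```python
from collections import deque

def build_automaton(patterns):
    """Build a trie with failure links and merged output sets."""
    goto = [{}]    # goto[node][char] -> next node
    fail = [0]     # failure link per node
    out = [set()]  # patterns matched when this node is reached
    for pat in patterns:
        node = 0
        for ch in pat:
            if ch not in goto[node]:
                goto.append({})
                fail.append(0)
                out.append(set())
                goto[node][ch] = len(goto) - 1
            node = goto[node][ch]
        out[node].add(pat)
    # breadth-first pass to set failure links
    queue = deque(goto[0].values())
    while queue:
        u = queue.popleft()
        for ch, v in goto[u].items():
            queue.append(v)
            f = fail[u]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[v] = goto[f].get(ch, 0)
            out[v] |= out[fail[v]]   # inherit matches ending here
    return goto, fail, out

def find_entities(text, goto, fail, out):
    """Scan the text once, reporting (start_index, pattern) matches."""
    node, hits = 0, []
    for i, ch in enumerate(text):
        while node and ch not in goto[node]:
            node = fail[node]
        node = goto[node].get(ch, 0)
        for pat in out[node]:
            hits.append((i - len(pat) + 1, pat))
    return hits
```

One pass over the text finds every occurrence of every dictionary entry, which is what makes the approach practical for a 1,351-name disease list against ~90k articles.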
25. Entity Extraction
❏ Model-based approach
❏ Predict which tokens are part of a reference to an entity
❏ Pro: Identifies phrases based on context
❏ Pro: Can be tuned to different data sets
❏ Con: Requires training data
❏ Deep learning models require a lot of data
token      label
The        O
influenza  B-DISEASE
virus      I-DISEASE
is         O
divided    O
into       O
different  O
types      O
and        O
subtypes   O
26. Entity Extraction
❏ Bootstrapping
❏ Output the tokens with the entity labels
❏ BIO format
❏ Tokenization must be consistent
❏ CRF
❏ Fewer open-source implementations
❏ Deep Learning
❏ State of the art
❏ Requires large amounts of data
❏ Slow to run without GPU
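Bootstrapping turns dictionary matches into BIO training labels; a minimal sketch, assuming the matcher reports token-index spans (function name and the DISEASE label are illustrative):

```python
def bio_tags(tokens, matches):
    """matches: list of (start, end) token-index spans (end exclusive)
    from the dictionary matcher; returns one BIO label per token."""
    labels = ["O"] * len(tokens)
    for start, end in matches:
        labels[start] = "B-DISEASE"          # Begin entity
        for i in range(start + 1, end):
            labels[i] = "I-DISEASE"          # Inside entity
    return labels
```

Because the labels are expressed over token indices, the tokenization used here must match the one used to train the downstream CRF or deep learning model.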
28. Graph Building
❏ Heuristic vs Model
❏ Relationship extraction data sets are rare, compared to NER models
❏ Creating labels requires experts
❏ Heuristics with labels
❏ Stated relationships may span across multiple sentences
❏ Certain styles of language are excessively verbose
❏ Especially academic language
30. Graph Building
N = ...  # number of docs
M = ...  # number of unique entities
# N x M matrix, X[i, j] = number of times
# entity j occurs in doc i
X = ...
B = X > 0
# co-occurrence matrix
# C[i, j] = number of docs with
# entities i and j
# C[i, i] = number of docs with
# entity i
C = np.dot(B.T, B)
edge_ixs = np.argwhere(C > 0)
# keep each undirected edge once
edge_ixs = [(i, j) for i, j in edge_ixs if i < j]
# TF.IDF weights for entity-document counts
DF = B.sum(axis=0)
TF = np.log2(1 + X)
IDF = np.log2(N / DF)
TFIDF = np.multiply(TF, IDF)
32. Full pipeline
❏ Pre-calculate data for
LDA visualization
❏ Load articles + entities +
topics into search index
❏ Elasticsearch
❏ Load entities and edges
into graph database
❏ OrientDB, Amazon Neptune
34. Key Takeaways
Biomedical scientists are drowning in
dark data (unlabeled and/or
unstructured)
NLP is the lens that helps bring this
data into focus.
Knowledge Graphs are the Hubble
Space Telescopes of the Biomedical
Domain