In this talk, we will cover how to extract entities from text using both rule-based and deep learning techniques. We will also cover how to use rule-based entity extraction to bootstrap a named entity recognition model. The other important aspect of this project we will cover is how to infer relationships between entities, and combine them with explicit relationships found in the source data sets. Although this talk is focused on the CORD-19 data set, the techniques covered are applicable to a wide variety of domains. This talk is for those who want to learn how to use NLP to explore relationships in text.
Using NLP to Explore Entity Relationships in COVID-19 Literature
1. Using Spark-NLP to build a Biomedical
Knowledge Graph
Or how to build a space telescope to not get lost in the darkness
2. About Us
Alex Thomas is a principal data scientist at Wisecube. He's
used natural language processing and machine learning with
clinical data, identity data, employer and jobseeker data, and
now biochemical data. Alex is also the author of Natural
Language Processing with Spark NLP.
Vishnu is the CTO and Founder of Wisecube AI and has over
two decades of experience building data science teams and
platforms. Vishnu is a big believer in graph based systems
and has extensive experience with various graph databases
including Neo4J, the original TitanDB release (now
JanusGraph) and more recently OrientDB and AWS Neptune.
3. About Wisecube AI
Wisecube AI helps
accelerate Biomedical R&D
by combining the power of
knowledge graphs, intelligent
applications and a low code
platform.
4. Biomedical data: The final frontier
The Biomedical Big (Data) Bang!
Too much dark data is being created for
scientists to comprehend
Insights are hidden
The curse of unstructured data
Not enough labeled data
Gathering Labels is expensive in Biomedical domains
5. Hubble Space Telescope: NLP and Knowledge
Graphs
1. Visualize: Allow users to gather high level
insights for biomedical research area
2. Explore: Discover connections and concepts
hidden in the text and structured data
3. Learn: Create representations of concepts
used to learn deep learning models for
prediction and experimentation.
9. Datasets
❏ CORD-19
❏ Text dataset of biomedical articles related to COVID-19
❏ From the Semantic Scholar team at the Allen Institute for AI
❏ Contains articles and metadata file
❏ More than ~200k articles mentioned in metadata
❏ ~90k with JSON files converted from PDFs
❏ ~70k with JSON files converted from PubMed Central
❏ Most of these overlap
❏ ~90k articles with JSON files in the dataset
https://www.semanticscholar.org/cord19
10. Datasets
❏ Infectious Disease Names
❏ List of ~500 disease names and synonyms
❏ Manually curated
❏ Added preferred names
❏ Wikipedia links
❏ Removed overly common or ambiguous terms
❏ ~10 hours work
❏ 1351 disease names
https://www.atsu.edu/faculty/chamberlain/website/diseases.htm
15. Text Processing
❏ The dataset is too large to process all in
memory
❏ Spark NLP
❏ Website: nlp.johnsnowlabs.com
❏ Book: Natural Language Processing with Spark NLP (by me, Alex)
❏ Amazon link: amzn.com/1492047767
❏ NLP library built on top of Spark MLlib
❏ Building our own pipeline
16. Text Processing
❏ Pipeline
❏ Sentence tokenizing - splitting text into sentences
❏ Tokenizing - splitting text into tokens
❏ Normalizing - lowercasing, removing non-alphabetics
❏ Stop word cleaning - removing common words
❏ Lemmatizing - reducing words to their dictionary entry
❏ E.g. symptoms → symptom, diagnoses → diagnosis
❏ Outputs
❏ Normalized tokens - used for entity extraction
❏ Lemmas - used for topic modeling
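The project itself runs these stages as Spark NLP annotators; as a dependency-free illustration only, the five steps above can be sketched in plain Python (the stop-word list and lemma dictionary below are tiny toy stand-ins, not the real resources):

```python
import re

STOP_WORDS = {"the", "is", "a", "of", "and", "are", "into"}   # toy sample
LEMMA_DICT = {"symptoms": "symptom", "diagnoses": "diagnosis"}  # toy sample

def process(text):
    """Run the pipeline stages, returning (normalized tokens, lemmas)
    per sentence."""
    # 1. Sentence tokenizing: split on sentence-final punctuation
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    normalized, lemmas = [], []
    for sent in sentences:
        # 2. Tokenizing: pull out alphabetic tokens
        tokens = re.findall(r"[A-Za-z]+", sent)
        # 3. Normalizing: lowercase (non-alphabetics already dropped)
        norm = [t.lower() for t in tokens]
        # 4. Stop word cleaning: remove common words
        norm = [t for t in norm if t not in STOP_WORDS]
        # 5. Lemmatizing: dictionary lookup back to the dictionary entry
        lem = [LEMMA_DICT.get(t, t) for t in norm]
        normalized.append(norm)
        lemmas.append(lem)
    return normalized, lemmas
```

The two outputs mirror the slide: normalized tokens feed entity extraction, lemmas feed topic modeling.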
17. Text Processing
❏ Identifying phrases (manual process)
❏ Run pipeline
❏ Analyze n-gram frequencies
❏ n-grams are sequences of tokens of length n
❏ Here n=2,3,4
❏ Identify “stop phrases”
❏ Formulaic statements related to copyright, links, etc.
❏ Identify phrases
❏ Add to tokenizer
❏ Repeat
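The n-gram frequency step above is a simple sliding-window count; a minimal sketch (function name and inputs are illustrative):

```python
from collections import Counter

def ngram_counts(token_lists, n):
    """Count n-grams (length-n token sequences) across documents."""
    counts = Counter()
    for tokens in token_lists:
        # slide a window of length n over each token list
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts
```

Running this for n = 2, 3, 4 and scanning the most frequent results surfaces both candidate phrases and the "stop phrases" to filter out.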
18. Text Processing
- Log-Mean TF.IDF of unigrams
- Log-Mean TF.IDF of bigrams
- Log-Mean TF.IDF of trigrams
- Log-Mean TF.IDF of 4-grams
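The log-mean TF.IDF scores charted above follow the formulas from the graph-building slide; a small numpy sketch with a made-up document-term count matrix:

```python
import numpy as np

# Toy doc-term count matrix (made-up counts): rows = docs, cols = n-grams
X = np.array([[2, 0, 1],
              [0, 3, 1],
              [1, 1, 0]], dtype=float)
N = X.shape[0]                   # number of docs
DF = (X > 0).sum(axis=0)         # document frequency per term
TF = np.log2(1 + X)              # log-scaled term frequency
IDF = np.log2(N / DF)            # inverse document frequency
TFIDF = TF * IDF
log_mean_tfidf = TFIDF.mean(axis=0)  # one score per n-gram
```

Ranking unigrams, bigrams, trigrams, and 4-grams by this score highlights the terms most characteristic of the corpus.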
20. Topic Modeling
❏ Topic modeling
❏ “You shall know a word by the company it keeps” - J. R. Firth 1957
❏ Clustering text data into topics
❏ Visualize diversity in corpus
❏ Analyze vocabulary
❏ Latent Dirichlet Allocation
❏ Documents are mixtures of topics
❏ Topics are mixtures of words
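The two bullets above are LDA's generative story; a sketch of just that story (not an inference algorithm), with a made-up vocabulary and hand-built topic distributions:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy vocabulary and two hand-built topics (illustrative only)
vocab = ["virus", "protein", "fever", "cough", "genome", "sequence"]
topics = np.array([
    [0.05, 0.05, 0.40, 0.40, 0.05, 0.05],  # a "symptoms" topic
    [0.30, 0.20, 0.02, 0.02, 0.26, 0.20],  # a "genomics" topic
])

def generate_doc(n_words, alpha=(1.0, 1.0)):
    """Documents are mixtures of topics; topics are mixtures of words."""
    theta = rng.dirichlet(alpha)              # this doc's topic mixture
    words = []
    for _ in range(n_words):
        z = rng.choice(len(topics), p=theta)      # pick a topic
        w = rng.choice(len(vocab), p=topics[z])   # pick a word from it
        words.append(vocab[w])
    return theta, words

theta, doc = generate_doc(10)
```

Fitting LDA inverts this process: given only the documents, it recovers the topic-word and document-topic mixtures.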
23. Entity Extraction
❏ Dictionary based extraction
❏ Aho-Corasick algorithm
❏ Requires dictionary / wordlist
❏ Model based extraction
❏ Deep learning common for recent models
❏ Conditional random fields were the norm until roughly five years ago
24. Entity Extraction
❏ Aho-Corasick algorithm
❏ Efficiently searches for a large number of patterns at once
❏ Pro: no training data needed, only a list of names and aliases
❏ Con: can only find entities in the alias list
❏ Con: does not use context
❏ APRIL is a protein name (UniProt)
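A minimal pure-Python sketch of the Aho-Corasick idea (the patterns and helper names here are illustrative; in practice you would use a tuned implementation):

```python
from collections import deque

def build_automaton(patterns):
    """Build a trie with failure links and merged output sets."""
    goto = [{}]    # goto[node][char] -> next node
    fail = [0]     # failure link per node
    out = [set()]  # patterns matched when this node is reached
    for pat in patterns:
        node = 0
        for ch in pat:
            if ch not in goto[node]:
                goto.append({})
                fail.append(0)
                out.append(set())
                goto[node][ch] = len(goto) - 1
            node = goto[node][ch]
        out[node].add(pat)
    # breadth-first pass to set failure links
    queue = deque(goto[0].values())
    while queue:
        u = queue.popleft()
        for ch, v in goto[u].items():
            queue.append(v)
            f = fail[u]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[v] = goto[f].get(ch, 0)
            out[v] |= out[fail[v]]   # inherit matches ending here
    return goto, fail, out

def find_entities(text, goto, fail, out):
    """Scan the text once, reporting (start_index, pattern) matches."""
    node, hits = 0, []
    for i, ch in enumerate(text):
        while node and ch not in goto[node]:
            node = fail[node]
        node = goto[node].get(ch, 0)
        for pat in out[node]:
            hits.append((i - len(pat) + 1, pat))
    return hits
```

One pass over the text finds every occurrence of every dictionary entry, which is what makes the approach practical for a 1,351-name disease list against ~90k articles.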
25. Entity Extraction
❏ Model-based approach
❏ Predict which tokens are part of a reference to an entity
❏ Pro: Identifies phrases based on context
❏ Pro: Can be tuned to different data sets
❏ Con: Requires training data
❏ Deep learning models require a lot of data
token      label
The        O
influenza  B-DISEASE
virus      I-DISEASE
is         O
divided    O
into       O
different  O
types      O
and        O
subtypes   O
26. Entity Extraction
❏ Bootstrapping
❏ Output the tokens with the entity labels
❏ BIO format
❏ Tokenization must be consistent
❏ CRF
❏ Fewer open-source implementations
❏ Deep Learning
❏ State of the art
❏ Requires large amounts of data
❏ Slow to run without GPU
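Bootstrapping turns dictionary matches into BIO training labels; a minimal sketch, assuming the matcher reports token-index spans (function name and the DISEASE label are illustrative):

```python
def bio_tags(tokens, matches):
    """matches: list of (start, end) token-index spans (end exclusive)
    from the dictionary matcher; returns one BIO label per token."""
    labels = ["O"] * len(tokens)
    for start, end in matches:
        labels[start] = "B-DISEASE"          # Begin entity
        for i in range(start + 1, end):
            labels[i] = "I-DISEASE"          # Inside entity
    return labels
```

Because the labels are expressed over token indices, the tokenization used here must match the one used to train the downstream CRF or deep learning model.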
28. Graph Building
❏ Heuristic vs Model
❏ Relationship extraction data sets are rare, compared to NER models
❏ Creating labels requires experts
❏ Heuristics with labels
❏ Stated relationships may span across multiple sentences
❏ Certain styles of language are excessively verbose
❏ Especially academic language
30. Graph Building
N = ...  # number of docs
M = ...  # number of unique entities
# N x M matrix, X[i, j] = number of times
# entity j occurs in doc i
X = ...
B = X > 0
# co-occurrence matrix
# C[i, j] = number of docs with
# entities i and j
# C[i, i] = number of docs with
# entity i
C = np.dot(B.T, B)
edge_ixs = np.argwhere(C > 0)
# keep each undirected edge once
edge_ixs = [(i, j) for i, j in edge_ixs if i < j]
# TF.IDF weights for entity-document counts
DF = B.sum(axis=0)
TF = np.log2(1 + X)
IDF = np.log2(N / DF)
TFIDF = np.multiply(TF, IDF)
32. Full pipeline
❏ Pre-calculate data for
LDA visualization
❏ Load articles + entities +
topics into search index
❏ Elasticsearch
❏ Load entities and edges
into graph database
❏ OrientDB, Amazon Neptune
34. Key Takeaways
Biomedical scientists are drowning in
dark data (unlabeled and/or
unstructured)
NLP is the lens that helps bring this
data into focus.
Knowledge Graphs are the Hubble
Space Telescopes of the Biomedical
Domain