Using Spark-NLP to build a Biomedical
Knowledge Graph
Or how to build a space telescope to not get lost in the darkness
About Us
Alex Thomas is a principal data scientist at Wisecube. He's
used natural language processing and machine learning with
clinical data, identity data, employer and jobseeker data, and
now biochemical data. Alex is also the author of Natural
Language Processing with Spark NLP.
Vishnu is the CTO and Founder of Wisecube AI and has over
two decades of experience building data science teams and
platforms. Vishnu is a big believer in graph based systems
and has extensive experience with various graph databases
including Neo4J, the original TitanDB release (now
JanusGraph) and more recently OrientDB and AWS Neptune.
About Wisecube AI
Wisecube AI helps
accelerate Biomedical R&D
by combining the power of
knowledge graphs, intelligent
applications and a low code
Platform.
Biomedical data: The final frontier
Biomedical Big (data) Bang!
More dark data is being created than scientists can
comprehend
Insights are hidden
The curse of unstructured data
Not enough labeled data
Gathering Labels is expensive in Biomedical domains
Hubble Space Telescope: NLP and Knowledge
Graphs
1. Visualize: Allow users to gather high level
insights for biomedical research area
2. Explore: Discover connections and concepts
hidden in the text and structured data
3. Learn: Create representations of concepts
used to train deep learning models for
prediction and experimentation.
Building Hubble Orpheus
Orpheus Pipeline
Overview
● Datasets
○ Text Processing
○ Topic Modeling
○ Entity Extraction
○ Graph Building
Datasets
❏ CORD-19
❏ Text dataset of biomedical articles related to COVID-19
❏ From the Semantic Scholar team at the Allen Institute for AI
❏ Contains articles and metadata file
❏ More than 200k articles mentioned in metadata
❏ ~90k with JSON files converted from PDFs
❏ ~70k with JSON files converted from PubMed Central
❏ Most overlap
❏ ~90k articles with JSON files in the dataset
https://www.semanticscholar.org/cord19
Datasets
❏ Infectious Disease Names
❏ List of ~500 disease names and synonyms
❏ Manually curated
❏ Added preferred names
❏ Wikipedia links
❏ Removed overly common or ambiguous terms
❏ ~10 hours work
❏ 1351 disease names
https://www.atsu.edu/faculty/chamberlain/website/diseases.htm
Datasets
❏ ChEMBL subset
❏ Synonyms
❏ SMILES
❏ 13309 compounds
https://www.ebi.ac.uk/chembl/
Datasets
❏ UniProt subset
❏ Names
❏ Associated UniProt IDs
❏ 1,473,423 protein names
❏ ~40k usually found in CORD-19
https://www.uniprot.org/
CORD-19 Dataset
❏ 94483 documents
❏ 26147.6 average character length
❏ 20685 journals
Overview
✓ CORD-19
● Text Processing
○ Topic Modeling
○ Entity Extraction
○ Graph Building
Text Processing
❏ The dataset is too large to process all in
memory
❏ Spark NLP
❏ Website: nlp.johnsnowlabs.com
❏ Book: Natural Language Processing with Spark NLP (by me, Alex)
❏ Amazon link: amzn.com/1492047767
❏ NLP library built on top of Spark MLlib
❏ Building our own pipeline
Text Processing
❏ Pipeline
❏ Sentence tokenizing - splitting text into sentences
❏ Tokenizing - splitting text into tokens
❏ Normalizing - lowercasing, removing non-alphabetics
❏ Stop word cleaning - removing common words
❏ Lemmatizing - reducing words to their dictionary entry
❏ E.g. symptoms → symptom, diagnoses → diagnosis
❏ Outputs
❏ Normalized tokens - used for entity extraction
❏ Lemmas - used for topic modeling
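The pipeline steps above can be sketched in plain Python. This is a toy illustration only, not the Spark NLP pipeline used in the talk: the stop word list, the lemma dictionary, and every function name here are hypothetical stand-ins, and the sentence splitter is a naive punctuation rule.

```python
import re

# Toy resources, assumed for illustration only.
STOP_WORDS = {"the", "of", "and", "a", "in", "were", "with"}
LEMMAS = {"symptoms": "symptom", "diagnoses": "diagnosis"}

def split_sentences(text):
    # Naive rule: break after ., ! or ? when followed by whitespace.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def tokenize(sentence):
    # Split on whitespace.
    return sentence.split()

def normalize(tokens):
    # Lowercase, strip non-alphabetic characters, drop tokens left empty.
    cleaned = (re.sub(r"[^a-z]", "", t.lower()) for t in tokens)
    return [t for t in cleaned if t]

def remove_stop_words(tokens):
    return [t for t in tokens if t not in STOP_WORDS]

def lemmatize(tokens):
    # Dictionary lookup; unknown words pass through unchanged.
    return [LEMMAS.get(t, t) for t in tokens]

def process(text):
    # Returns the two outputs named on the slide, one list per sentence.
    normalized, lemmas = [], []
    for sentence in split_sentences(text):
        norm = normalize(tokenize(sentence))
        normalized.append(norm)
        lemmas.append(lemmatize(remove_stop_words(norm)))
    return normalized, lemmas

normalized, lemmas = process("The symptoms were mild. Diagnoses of COVID-19 increased!")
```

In this sketch the normalized tokens are taken before stop word cleaning and the lemmas after it, which is one reading of the step order on the slide.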
Text Processing
❏ Identifying phrases (manual process)
❏ Run pipeline
❏ Analyze n-gram frequencies
❏ n-grams are sequences of tokens of length n
❏ Here n=2,3,4
❏ Identify “stop phrases”
❏ Formulaic statements related to copyright, links, etc.
❏ Identify phrases
❏ Add to tokenizer
❏ Repeat
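The n-gram frequency step in the loop above can be sketched with the standard library. The two example documents are made up, and `ngram_counts` is a hypothetical helper name. In the talk this analysis runs over the Spark pipeline output rather than Python lists.

```python
from collections import Counter

def ngrams(tokens, n):
    # All contiguous token sequences of length n.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def ngram_counts(docs, sizes=(2, 3, 4)):
    # One Counter per n, accumulated over every tokenized document.
    counts = {n: Counter() for n in sizes}
    for tokens in docs:
        for n in sizes:
            counts[n].update(ngrams(tokens, n))
    return counts

# Made-up tokenized documents for illustration.
docs = [
    "all rights reserved severe acute respiratory syndrome".split(),
    "severe acute respiratory syndrome coronavirus all rights reserved".split(),
]
counts = ngram_counts(docs)
top_trigrams = counts[3].most_common(3)
```

Frequent n-grams such as "all rights reserved" are candidates for the stop phrase list, while ones such as "severe acute respiratory syndrome" are candidates for phrases to add to the tokenizer.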
Text Processing
- Log-Mean TF.IDF of unigrams
- Log-Mean TF.IDF of bigrams
- Log-Mean TF.IDF of trigrams
- Log-Mean TF.IDF of 4-grams
Overview
✓ CORD-19
✓ Text Processing
● Topic Modeling
○ Entity Extraction
○ Graph Building
Topic Modeling
❏ Topic modeling
❏ “You shall know a word by the company it keeps” - J. R. Firth 1957
❏ Clustering text data into topics
❏ Visualize diversity in corpus
❏ Analyze vocabulary
❏ Latent Dirichlet Allocation
❏ Documents are mixtures of topics
❏ Topics are mixtures of words
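The two "mixtures" statements can be read as a generative story: to write each word, first draw a topic from the document's mixture, then draw a word from that topic. The sketch below shows only that sampling step. It assumes fixed, made-up topics and mixture weights, and it leaves out the Dirichlet priors and the inference that Latent Dirichlet Allocation actually performs.

```python
import random

# Hypothetical topics: each is a probability distribution over words.
TOPICS = {
    "virology": {"virus": 0.5, "protein": 0.3, "genome": 0.2},
    "clinical": {"patient": 0.5, "symptom": 0.3, "treatment": 0.2},
}

def generate_document(topic_mixture, length, rng):
    # For each word: pick a topic from the document's mixture,
    # then pick a word from that topic's distribution.
    topic_names = list(topic_mixture)
    topic_weights = [topic_mixture[t] for t in topic_names]
    words = []
    for _ in range(length):
        topic = rng.choices(topic_names, weights=topic_weights)[0]
        vocab = TOPICS[topic]
        words.append(rng.choices(list(vocab), weights=list(vocab.values()))[0])
    return words

rng = random.Random(0)
doc = generate_document({"virology": 0.8, "clinical": 0.2}, length=10, rng=rng)
```

Topic modeling runs this story in reverse: given only the documents, it estimates the topics and each document's mixture.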
Topic Modeling
❏ pyLDAvis used for
visualization
Overview
✓ CORD-19
✓ Text Processing
✓ Topic Modeling
● Entity Extraction
○ Graph Building
Entity Extraction
❏ Dictionary based extraction
❏ Aho-Corasick algorithm
❏ Requires dictionary / wordlist
❏ Model based extraction
❏ Deep learning common for recent models
❏ Conditional random fields were common until about 5 years ago
Entity Extraction
❏ Aho-Corasick algorithm
❏ Efficiently search for large number of patterns
❏ Pro: no training data needed, only a list of names and aliases
❏ Con: can only find entities in the alias list
❏ Con: does not use context
❏ APRIL is a protein name (UniProt)
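A minimal character-level sketch of Aho-Corasick follows, to show how many patterns are searched in one pass over the text. The function names are hypothetical, and the sketch matches raw substrings without checking word boundaries, which a real extractor would likely need.

```python
from collections import deque

def build_automaton(patterns):
    # Trie stored as parallel lists: transitions, failure links, outputs.
    goto, fail, out = [{}], [0], [[]]
    for pattern in patterns:
        state = 0
        for ch in pattern:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].append(pattern)
    # Breadth-first pass: each failure link points to the state for the
    # longest proper suffix that is also a prefix of some pattern.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            # Inherit patterns that end at the failure state.
            out[nxt] = out[nxt] + out[fail[nxt]]
    return goto, fail, out

def find_all(text, automaton):
    # Single pass over the text; returns (start_index, pattern) pairs.
    goto, fail, out = automaton
    state, matches = 0, []
    for i, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for pattern in out[state]:
            matches.append((i - len(pattern) + 1, pattern))
    return matches

automaton = build_automaton(["influenza", "influenza virus", "april"])
hits = find_all("cases rose in april", automaton)
```

The example also illustrates the context problem: `hits` reports "april" even though the sentence refers to the month, not the protein.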
Entity Extraction
❏ Model-based approach
❏ Predict which tokens are part of a reference to an entity
❏ Pro: Identifies phrases based on context
❏ Pro: Can be tuned to different data sets
❏ Con: Requires training data
❏ Deep learning models require a lot of data
token       label
The         O
influenza   B-DISEASE
virus       I-DISEASE
is          O
divided     O
into        O
different   O
types       O
and         O
subtypes    O
Entity Extraction
❏ Bootstrapping
❏ Output the tokens with the entity labels
❏ BIO format
❏ Tokenization must be consistent
❏ CRF
❏ Fewer open-source implementations
❏ Deep Learning
❏ State of the art
❏ Requires large amounts of data
❏ Slow to run without GPU
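The bootstrapping idea, using dictionary matches to emit BIO-labeled tokens as training data for a model, can be sketched as below. This is an assumption-laden illustration: `bio_tags` is a hypothetical helper, the dictionary is made up, and the matcher is a simple greedy longest-match scan over tokens rather than Aho-Corasick.

```python
def bio_tags(tokens, dictionary):
    # dictionary maps a tuple of lowercased tokens to an entity type,
    # e.g. ("influenza", "virus") -> "DISEASE".
    max_len = max((len(name) for name in dictionary), default=0)
    lowered = [t.lower() for t in tokens]
    tags = ["O"] * len(tokens)
    i = 0
    while i < len(tokens):
        # Prefer the longest dictionary entry starting at position i.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            label = dictionary.get(tuple(lowered[i:i + n]))
            if label:
                tags[i] = "B-" + label
                for j in range(i + 1, i + n):
                    tags[j] = "I-" + label
                i += n
                break
        else:
            # No entry starts here; leave the token tagged "O".
            i += 1
    return list(zip(tokens, tags))

# Made-up dictionary for illustration.
dictionary = {("influenza", "virus"): "DISEASE", ("influenza",): "DISEASE"}
tokens = "The influenza virus is divided into different types and subtypes".split()
tagged = bio_tags(tokens, dictionary)
```

The tokens passed in should come from the same tokenizer the model will see later, which is the consistency requirement noted above.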
Overview
✓ CORD-19
✓ Text Processing
✓ Topic Modeling
✓ Entity Extraction
● Graph Building
Graph Building
❏ Heuristic vs Model
❏ Relationship extraction data sets are rare, compared to NER data sets
❏ Creating labels requires experts
❏ Heuristics with labels
❏ Stated relationships may span across multiple sentences
❏ Certain styles of language are excessively verbose
❏ Especially academic language
Graph Building
❏ Entity co-occurrence
❏ Context
❏ Document
❏ Sentence
❏ Weight
❏ Binary
❏ Co-occurrence count
❏ TF.IDF
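A small pure-Python sketch of the binary and count weights follows. The context can be a document or a sentence, depending on what each inner list represents. The entity lists are made up and `cooccurrence_counts` is a hypothetical helper name. TF.IDF weighting is not covered here.

```python
from collections import Counter
from itertools import combinations

def cooccurrence_counts(contexts):
    # contexts: one list of entity names per context (document or sentence).
    # Each unordered pair is counted once per context, however often
    # the entities repeat inside it.
    counts = Counter()
    for entities in contexts:
        for pair in combinations(sorted(set(entities)), 2):
            counts[pair] += 1
    return counts

# Made-up entity mentions, one list per document.
docs = [
    ["influenza", "oseltamivir", "influenza"],
    ["influenza", "neuraminidase"],
    ["influenza", "oseltamivir", "neuraminidase"],
]
counts = cooccurrence_counts(docs)   # co-occurrence count weight
binary_edges = set(counts)           # binary weight: edge present or not
```

Sorting the entities before pairing keeps each undirected edge under a single key.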
Graph Building
import numpy as np

N = ... # number of docs
M = ... # number of unique entities
# NxM matrix, X[i, j] = number of times
# entity j occurs in doc i
X = ...
# binary occurrence matrix; cast to int so the dot
# product below counts docs instead of OR-ing booleans
B = (X > 0).astype(int)
# co-occurrence matrix
# C[i, j] = number of docs with
# entities i and j
# C[i, i] = number of docs with
# entity i
C = np.dot(B.T, B)
# keep each undirected edge once (i < j)
edge_ixs = [(i, j) for i, j in np.argwhere(C > 0) if i < j]
# DF[j] = number of docs containing entity j
DF = B.sum(axis=0)
TF = np.log2(1 + X)
IDF = np.log2(N / DF)
TFIDF = np.multiply(TF, IDF)
Overview
✓ CORD-19
✓ Text Processing
✓ Topic Modeling
✓ Entity Extraction
✓ Graph Building
Full pipeline
❏ Pre-calculate data for
LDA visualization
❏ Load articles + entities +
topics into search index
❏ Elasticsearch
❏ Load entities and edges
into graph database
❏ OrientDB, Amazon Neptune
Demo
Key Takeaways
Biomedical Scientists are drowning in
Dark data (un-labeled and/or
unstructured)
NLP is the lens to help bring this data
into focus.
Knowledge Graphs are the Hubble
Space Telescopes of the Biomedical
Domain
Feedback
Your feedback is important to us.
Don’t forget to rate
and review the sessions.

Using NLP to Explore Entity Relationships in COVID-19 Literature

  • 1. Using Spark-NLP to build a Biomedical Knowledge Graph Or how to build a space telescope to not get lost in the darkness
  • 2. About Us Alex Thomas is a principal data scientist at Wisecube. He's used natural language processing and machine learning with clinical data, identity data, employer and jobseeker data, and now biochemical data. Alex is also the author of Natural Language Processing with Spark NLP. Vishnu is the CTO and Founder of Wisecube AI and has over two decades of experience building data science teams and platforms. Vishnu is a big believer in graph based systems and has extensive experience with various graph databases including Neo4J, the original TitanDB release (now JanusGraph) and more recently OrientDB and AWS Neptune.
  • 3. About Wisecube AI Wisecube AI helps accelerate Biomedical R&D by combining the power of knowledge graphs, intelligent applications and a low code Platform.
  • 4. Biomedical data: The final frontier Biomedical Big (data) Bang ! Too much dark data being created to comprehend by scientists Insights are hidden The curse of unstructured data Not enough labeled data Gathering Labels is expensive in Biomedical domains
  • 5. Hubble Space Telescope: NLP and Knowledge Graphs 1. Visualize: Allow users to gather high level insights for biomedical research area 2. Explore: Discover connections and concepts hidden in the text and structured data 3. Learn: Create representations of concepts used to learn deep learning models for prediction and experimentation.
  • 8. Overview ● Datasets ○ Text Processing ○ Topic Modeling ○ Entity Extraction ○ Graph Building
  • 9. Datasets ❏ CORD-19 ❏ Text dataset of biomedical articles related to COVID-19 ❏ From the Semantic Scholar team at the Allen Institute for AI ❏ Contains articles and metadata file ❏ More than ~200k articles mentioned in metadata ❏ ~90k with JSON files converted from PDFs ❏ ~70k with JSON files converted from PubMed Central ❏ Most overlap ❏ ~90k articles with JSON files in the dataset https://www.semanticscholar.org/cord19
  • 10. Datasets ❏ Infectious Disease Names ❏ List of ~500 disease names and synonyms ❏ Manually curated ❏ Added preferred names ❏ Wikipedia links ❏ Removed overly common or ambiguous terms ❏ ~10 hours work ❏ 1351 disease names https://www.atsu.edu/faculty/chamberlain/website/diseases.htm
  • 11. Datasets ❏ ChEMBL subset ❏ Synonyms ❏ SMILES ❏ 13309 compounds https://www.ebi.ac.uk/chembl/
  • 12. Datasets ❏ UniProt subset ❏ Names ❏ Associated UniProt IDs ❏ 1,473,423 protein names ❏ ~40k usually found in CORD-19 https://www.uniprot.org/
  • 13. CORD-19 Dataset ❏ 94483 documents ❏ 26147.6 average character length ❏ 20685 journals
  • 14. Overview ✓ CORD-19 ● Text Processing ○ Topic Modeling ○ Entity Extraction ○ Graph Building
  • 15. Text Processing ❏ The dataset is too large to process all in memory ❏ Spark NLP ❏ Website: nlp.johnsnowlabs.com ❏ Book: Natural Language Processing with Spark NLP (by me, Alex) ❏ Amazon link: amzn.com/1492047767 ❏ NLP library built on top of Spark MLlib ❏ Building our own pipeline
  • 16. Text Processing ❏ Pipeline ❏ Sentence tokenizing - splitting text into sentences ❏ Tokenizing - splitting text into tokens ❏ Normalizing - lowercasing, removing non-alphabetics ❏ Stop word cleaning - removing common words ❏ Lemmatizing - reducing words to their dictionary entry ❏ E.g. symptoms → symptom, diagnoses → diagnosis ❏ Outputs ❏ Normalized tokens - used for entity extraction ❏ Lemmas - used for topic modeling
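The pipeline stages listed on this slide can be sketched in plain Python. This is a toy illustration only, not the Spark NLP pipeline used in the talk: the stop word list, the lemma dictionary, and all function names are hypothetical, and the choice to emit normalized tokens before stop word removal is an assumption.

```python
import re

# Hypothetical, tiny resources for illustration only.
STOP_WORDS = {"the", "of", "and", "are", "a", "in", "is"}
LEMMAS = {"symptoms": "symptom", "diagnoses": "diagnosis"}  # examples from the slide


def split_sentences(text):
    """Sentence tokenizing: split after ., ! or ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def tokenize(sentence):
    """Tokenizing: whitespace split (real tokenizers are more careful)."""
    return sentence.split()


def normalize(tokens):
    """Normalizing: lowercase and remove non-alphabetic characters."""
    cleaned = (re.sub(r"[^a-z]", "", t.lower()) for t in tokens)
    return [t for t in cleaned if t]


def remove_stop_words(tokens):
    """Stop word cleaning: drop very common words."""
    return [t for t in tokens if t not in STOP_WORDS]


def lemmatize(tokens):
    """Lemmatizing: map each word to its dictionary entry when known."""
    return [LEMMAS.get(t, t) for t in tokens]


def process(text):
    """Return (normalized tokens, lemmas), one list per sentence."""
    normalized, lemmas = [], []
    for sentence in split_sentences(text):
        norm = normalize(tokenize(sentence))
        normalized.append(norm)                               # for entity extraction
        lemmas.append(lemmatize(remove_stop_words(norm)))     # for topic modeling
    return normalized, lemmas


normalized, lemmas = process("Symptoms of the flu are common. Diagnoses in 2020 rose!")
```

The two outputs mirror the slide: normalized tokens keep every word so that dictionary lookups still see full names, while lemmas are stripped down for topic modeling.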
  • 17. Text Processing ❏ Identifying phrases (manual process) ❏ Run pipeline ❏ Analyze n-gram frequencies ❏ n-grams are sequences of tokens of length n ❏ Here n=2,3,4 ❏ Identify “stop phrases” ❏ Formulaic statements related to copyright, links, etc. ❏ Identify phrases ❏ Add to tokenizer ❏ Repeat
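The n-gram frequency step can be sketched as below. This is a minimal illustration with made-up documents and hypothetical function names; the talk does not show its own counting code, and at CORD-19 scale the counting would run in Spark rather than in a single Python process.

```python
from collections import Counter


def ngrams(tokens, n):
    """All contiguous token sequences of length n."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_counts(docs, sizes=(2, 3, 4)):
    """Count n-grams across tokenized documents for each n in sizes."""
    counts = {n: Counter() for n in sizes}
    for tokens in docs:
        for n in sizes:
            counts[n].update(ngrams(tokens, n))
    return counts


# Two made-up tokenized documents sharing a formulaic phrase.
docs = [
    ["all", "rights", "reserved", "covid", "study"],
    ["all", "rights", "reserved", "no", "reuse"],
]
counts = ngram_counts(docs)
```

A trigram such as ("all", "rights", "reserved") that recurs across unrelated documents is a candidate stop phrase, whereas a recurring domain term is a candidate phrase to add to the tokenizer.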
  • 18. Text Processing - Log-Mean TF.IDF of unigrams - Log-Mean TF.IDF of bigrams - Log-Mean TF.IDF of trigrams - Log-Mean TF.IDF of 4-grams
  • 19. Overview ✓ CORD-19 ✓ Text Processing ● Topic Modeling ○ Entity Extraction ○ Graph Building
  • 20. Topic Modeling ❏ Topic modeling ❏ “You shall know a word by the company it keeps” - J. R. Firth 1957 ❏ Clustering text data into topics ❏ Visualize diversity in corpus ❏ Analyze vocabulary ❏ Latent Dirichlet Allocation ❏ Documents are mixtures of topics ❏ Topics are mixtures of words
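The two LDA bullets ("documents are mixtures of topics", "topics are mixtures of words") describe the generative story the model assumes. The sketch below samples one document from that story; it is not an LDA fitting routine, and the topics, word probabilities, and parameter values are all invented for illustration.

```python
import random


def dirichlet(rng, alpha, k):
    """Sample a k-dimensional symmetric Dirichlet by normalizing gamma draws."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(k)]
    total = sum(draws)
    return [d / total for d in draws]


def generate_document(rng, topics, alpha=0.5, length=20):
    """Sample a topic mixture for the document, then one topic and one word per position."""
    theta = dirichlet(rng, alpha, len(topics))  # the document's mixture of topics
    words = []
    for _ in range(length):
        topic = rng.choices(range(len(topics)), weights=theta)[0]
        vocab, probs = zip(*topics[topic].items())  # the topic's mixture of words
        words.append(rng.choices(vocab, weights=probs)[0])
    return theta, words


# Hypothetical topics: each maps words to probabilities.
topics = [
    {"virus": 0.5, "infection": 0.3, "host": 0.2},
    {"protein": 0.6, "binding": 0.3, "receptor": 0.1},
]
theta, words = generate_document(random.Random(0), topics)
```

Fitting LDA runs this story in reverse: given only the words, it estimates the per-document mixtures and the per-topic word distributions.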
  • 21. Topic Modeling ❏ pyLDAvis used for visualization
  • 22. Overview ✓ CORD-19 ✓ Text Processing ✓ Topic Modeling ● Entity Extraction ○ Graph Building
  • 23. Entity Extraction ❏ Dictionary based extraction ❏ Aho-Corasick algorithm ❏ Requires dictionary / wordlist ❏ Model based extraction ❏ Deep learning common for recent models ❏ Conditional random fields were common until about 5 years ago
  • 24. Entity Extraction ❏ Aho-Corasick algorithm ❏ Efficiently searches for a large number of patterns ❏ Pro: no training data needed, only a list of names and aliases ❏ Con: can only find entities in the alias list ❏ Con: does not use context ❏ APRIL is a protein name (UniProt)
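A compact pure-Python sketch of Aho-Corasick matching over token sequences follows. It is an illustration of the algorithm, not the implementation used in the talk; the alias list and sentence are made up, and matching on tokens rather than characters is an assumption chosen so that aliases only match whole words.

```python
from collections import deque


def build_automaton(patterns):
    """Build trie transitions, failure links, and output lists for the patterns."""
    goto, fail, out = [{}], [0], [[]]
    for pat in patterns:
        state = 0
        for sym in pat:
            if sym not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[state][sym] = len(goto) - 1
            state = goto[state][sym]
        out[state].append(pat)
    # Breadth-first pass: a state's failure link points to the longest
    # proper suffix of its path that is also a path in the trie.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for sym, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and sym not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(sym, 0)
            out[nxt] = out[nxt] + out[fail[nxt]]
    return goto, fail, out


def find_all(tokens, automaton):
    """Return (start_index, pattern) for every pattern occurrence in one pass."""
    goto, fail, out = automaton
    state, matches = 0, []
    for i, sym in enumerate(tokens):
        while state and sym not in goto[state]:
            state = fail[state]
        state = goto[state].get(sym, 0)
        for pat in out[state]:
            matches.append((i - len(pat) + 1, pat))
    return matches


# Hypothetical alias list: each alias is a tuple of normalized tokens.
aliases = [("april",), ("influenza", "virus"), ("influenza",)]
automaton = build_automaton(aliases)
tokens = "the april protein binds influenza virus in april".split()
matches = find_all(tokens, automaton)
```

The final "april" in the sentence is the month, yet it is matched just like the protein name, which is the "does not use context" drawback in miniature.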
  • 25. Entity Extraction ❏ Model-based approach ❏ Predict which tokens are part of a reference to an entity ❏ Pro: Identifies phrases based on context ❏ Pro: Can be tuned to different data sets ❏ Con: Requires training data ❏ Deep learning models require a lot of data ❏ Example (token/label): The/O influenza/B-DISEASE virus/I-DISEASE is/O divided/O into/O different/O types/O and/O subtypes/O
  • 26. Entity Extraction ❏ Bootstrapping ❏ Output the tokens with the entity labels ❏ BIO format ❏ Tokenization must be consistent ❏ CRF ❏ Fewer open-source implementations ❏ Deep Learning ❏ State of the art ❏ Requires large amounts of data ❏ Slow to run without GPU
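The bootstrapping step, turning dictionary matches into BIO-labeled tokens that a CRF or deep learning model can train on, might look like the sketch below. The span format (start, end, label) and the longest-match-wins rule for overlapping spans are assumptions made for illustration; the talk does not specify how overlaps were resolved.

```python
def bio_tags(tokens, spans):
    """Convert (start, end, label) spans (end exclusive) into one BIO tag per token.

    When spans overlap, the earliest and then longest span wins (an assumption).
    """
    tags = ["O"] * len(tokens)
    for start, end, label in sorted(spans, key=lambda s: (s[0], -(s[1] - s[0]))):
        if any(tag != "O" for tag in tags[start:end]):
            continue  # skip spans overlapping an already labeled span
        tags[start] = f"B-{label}"
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"
    return tags


tokens = "The influenza virus is divided into different types and subtypes".split()
# Two dictionary hits: "influenza virus" and the shorter "influenza".
spans = [(1, 2, "DISEASE"), (1, 3, "DISEASE")]
tags = bio_tags(tokens, spans)
for token, tag in zip(tokens, tags):
    print(token, tag)
```

Because the tags are aligned to token positions, the tokenizer that produced the dictionary matches must be the same one that feeds the model, which is why the slide insists that tokenization be consistent.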
  • 27. Overview ✓ CORD-19 ✓ Text Processing ✓ Topic Modeling ✓ Entity Extraction ● Graph Building
  • 28. Graph Building ❏ Heuristic vs Model ❏ Relationship extraction data sets are rare, compared to NER models ❏ Creating labels requires experts ❏ Heuristics with labels ❏ Stated relationships may span across multiple sentences ❏ Certain styles of language are excessively verbose ❏ Especially academic language
  • 29. Graph Building ❏ Entity co-occurrence ❏ Context ❏ Document ❏ Sentence ❏ Weight ❏ Binary ❏ Co-occurrence count ❏ TF.IDF
  • 30. Graph Building
      import numpy as np
      N = ...  # number of docs
      M = ...  # number of unique entities
      # NxM matrix, X[i, j] = number of times entity j occurs in doc i
      X = ...
      # cast to int so the dot product counts docs instead of OR-ing booleans
      B = (X > 0).astype(int)
      # co-occurrence matrix
      # C[i, j] = number of docs with entities i and j
      # C[i, i] = number of docs with entity i
      C = np.dot(B.T, B)
      edge_ixs = np.argwhere(C > 0)
      # keep each undirected edge once and drop self-loops
      edge_ixs = [(i, j) for i, j in edge_ixs if i < j]
      DF = B.sum(axis=0)
      TF = np.log2(1 + X)
      IDF = np.log2(N / DF)
      TFIDF = np.multiply(TF, IDF)
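A small worked instance of the document-level co-occurrence computation, using a made-up 3x3 count matrix, is sketched below. The data and the `edges` dictionary are illustrative assumptions. The boolean matrix is cast to integers because a dot product of boolean NumPy arrays yields booleans rather than counts. How per-entity TF.IDF scores are combined into a single edge weight is not stated in the slides, so the sketch stops at the TF.IDF matrix.

```python
import numpy as np

# Made-up counts: X[i, j] = number of times entity j occurs in doc i.
X = np.array([
    [2, 1, 0],
    [0, 3, 1],
    [1, 1, 0],
])
N, M = X.shape  # number of docs, number of unique entities

# Binary occurrence matrix, cast to int so the product counts documents.
B = (X > 0).astype(int)

# C[i, j] = number of docs containing both entity i and entity j;
# C[i, i] = number of docs containing entity i.
C = B.T @ B

# Keep each undirected edge once (i < j), weighted by co-occurrence count.
edges = {
    (int(i), int(j)): int(C[i, j])
    for i, j in np.argwhere(C > 0)
    if i < j
}

# TF.IDF of each entity in each document.
DF = B.sum(axis=0)        # number of docs containing each entity
TF = np.log2(1 + X)
IDF = np.log2(N / DF)     # assumes every entity occurs in at least one doc
TFIDF = TF * IDF          # IDF broadcasts across the rows of TF
```

Here entity 1 appears in every document, so its IDF is zero and it contributes no TF.IDF weight, while the binary and count weightings still connect it to both other entities.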
  • 31. Overview ✓ CORD-19 ✓ Text Processing ✓ Topic Modeling ✓ Entity Extraction ✓ Graph Building
  • 32. Full pipeline ❏ Pre-calculate data for LDA visualization ❏ Load articles + entities + topics into search index ❏ Elasticsearch ❏ Load entities and edges into graph database ❏ OrientDB, Amazon Neptune
  • 33. Demo
  • 34. Key Takeaways Biomedical scientists are drowning in dark data (unlabeled and/or unstructured). NLP is the lens that helps bring this data into focus. Knowledge graphs are the Hubble Space Telescopes of the biomedical domain.
  • 35. Feedback Your feedback is important to us. Don’t forget to rate and review the sessions.