Natural language processing is a key component in many data science systems that must understand or reason about text. Common use cases include question answering, summarization, sentiment analysis, natural language BI, language modeling, and disambiguation. Building such systems usually requires combining three types of software libraries: NLP annotation frameworks, machine learning frameworks, and deep learning frameworks. This talk introduces Spark NLP, the NLP library for Apache Spark. Spark NLP natively extends the Spark ML pipeline APIs, enabling zero-copy, distributed, combined NLP & ML pipelines that leverage all of Spark's built-in optimizations. Benchmarks and design best practices for building NLP, ML, and DL pipelines on Spark will be shared.
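To make the combined-pipeline claim concrete, here is a minimal sketch in Python (assuming the spark-nlp and pyspark packages are installed, and a hypothetical DataFrame training_df with "text" and "label" columns); it mixes Spark NLP annotators with Spark ML stages in a single Pipeline:

import sparknlp
from sparknlp.base import DocumentAssembler, Finisher
from sparknlp.annotator import Tokenizer, Normalizer
from pyspark.ml import Pipeline
from pyspark.ml.feature import CountVectorizer
from pyspark.ml.classification import LogisticRegression

spark = sparknlp.start()  # SparkSession with the Spark NLP jar attached

# Spark NLP stages: raw text -> document -> tokens -> normalized tokens
document = DocumentAssembler().setInputCol("text").setOutputCol("document")
tokenizer = Tokenizer().setInputCols(["document"]).setOutputCol("token")
normalizer = Normalizer().setInputCols(["token"]).setOutputCol("normalized")
finisher = Finisher().setInputCols(["normalized"]).setOutputCols(["tokens"])

# Spark ML stages consume the finished tokens with no copy off the cluster
vectorizer = CountVectorizer(inputCol="tokens", outputCol="features")
classifier = LogisticRegression(featuresCol="features", labelCol="label")

pipeline = Pipeline(stages=[document, tokenizer, normalizer, finisher,
                            vectorizer, classifier])
model = pipeline.fit(training_df)  # training_df is a hypothetical input DataFrame

Because every stage reads and writes Spark DataFrame columns, there is no serialization boundary between the NLP and ML halves of the pipeline.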
4. Spark NLP in Industry
NLP Industry Survey by Gradient Flow, an independent data science research & insights company, September 2020
Which NLP libraries does your organization use?
5. ! Tokenization
! Sentence Detector
! Stop Words Removal
! Normalizer
! Stemmer
! Lemmatizer
! NGrams
! Regex Matching
! Text Matching
! Chunking
! Date Matcher
! Part-of-speech tagging
! Dependency parsing
! Sentiment Detection (ML models)
! Spell Checker (ML and DL models)
! Word Embeddings
! BERT Embeddings
! ELMO Embeddings
! ALBERT Embeddings
! XLNet Embeddings
! Universal Sentence Encoder
! BERT Sentence Embeddings
! Sentence Embeddings
! Chunk Embeddings
! Unsupervised keyword extraction
! Language Detection & Identification
! Multi-class Text Classification
! Multi-label Text Classification
! Multi-class Sentiment Analysis
! Named entity recognition
! Easy TensorFlow integration
! Full integration with Spark ML functions
! 250+ pre-trained models in 46 languages
Spark NLP: Apache License 2.0
7. Vision of Spark NLP
Python, Scala, and Java
! ACCURACY
! PERFORMANCE
! SCALABILITY
8. ! “State of the art” means the best peer-reviewed academic results
! For example: the best F1 score on the CoNLL-2003 NER benchmark for a system in production
! Spark NLP uses Bi-LSTM + Char-CNN + CRF + Word Embeddings
Accuracy: State-of-the-art Models
Named Entity Recognition
9. ! The best F1 score on the CoNLL-2003 NER benchmark for a system in production, achieved by using Spark NLP
! A BERT Large model was used to train our Bi-LSTM + Char-CNN + CRF model
Accuracy: State-of-the-art Models
Named Entity Recognition
10. ! Everything must work right out of the box, meaning there are no extra steps to make something work
! All parameters are left at their defaults
! The CoNLL 2003 dataset is used in this benchmark: eng.train was used for training and eng.testa was used for evaluating the model
Accuracy: State-of-the-art Models
Named Entity Recognition
12. Transformers & Embeddings
Spark NLP: 50 Word Embeddings
! BERT
! Small BERT
! BioBERT
! CovidBERT
! ALBERT
! ELECTRA
! XLNet
! ELMO
! GloVe
13. Accuracy: State-of-the-art Models
Multi-class & Multi-label Text Classification
! Multi-class text classification to detect emotions, cyberbullying, fake news, spam, etc.
! Multi-label text classification to detect toxic comments, movie genres, etc.
! 90+ pretrained Word and Sentence Embeddings models
! Language-Agnostic BERT Sentence Embedding
! Universal Sentence Encoder as an input for text classification
15. Accuracy: State-of-the-art Models
Language Detection & Identification
! LanguageDetectorDL is a state-of-the-art TensorFlow/Keras model (usage sketch below)
! Uses the positions of the characters
! It is around 3 MB to 5 MB
! It has been trained on over 8 million Wikipedia pages
! It has between 97% and 99% accuracy for text longer than 140 characters
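A minimal usage sketch of the detector: calling pretrained() with no arguments is assumed to download the default multilingual model, and the column names are illustrative.

import sparknlp
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import LanguageDetectorDL
from pyspark.ml import Pipeline

spark = sparknlp.start()  # active session required to download pretrained models

document = DocumentAssembler().setInputCol("text").setOutputCol("document")
language_detector = LanguageDetectorDL.pretrained() \
    .setInputCols(["document"]) \
    .setOutputCol("language")

pipeline = Pipeline(stages=[document, language_detector])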
16. Accuracy: State-of-the-art Models
Context Spell Checker
! Ability to consider OCR-specific error patterns
! Ability to leverage the context
! Ability to preserve and even correct custom patterns
! Flexibility to incorporate your own custom patterns (usage sketch below)
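A minimal usage sketch of the pretrained checker ("spellcheck_dl" is an assumed English model name; the output column name is illustrative):

import sparknlp
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import Tokenizer, ContextSpellCheckerModel
from pyspark.ml import Pipeline

spark = sparknlp.start()  # active session required to download pretrained models

document = DocumentAssembler().setInputCol("text").setOutputCol("document")
tokenizer = Tokenizer().setInputCols(["document"]).setOutputCol("token")
spell_checker = ContextSpellCheckerModel.pretrained("spellcheck_dl", "en") \
    .setInputCols(["token"]) \
    .setOutputCol("corrected")

pipeline = Pipeline(stages=[document, tokenizer, spell_checker])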
18. Performance: BERT Embeddings
! Transformers are slow!
! They need GPUs
! Performance depends highly on max sequence length
Spark NLP 2.6:
! Improved memory consumption by 30%
! Improved performance by more than 70% with dynamic shapes
19. Performance: BERT Embeddings
! 24 new smaller models!
! Tiny BERT
! Mini BERT
! Small BERT
! Medium BERT
Example:
! BERT-Tiny is 24x smaller and 28x faster than BERT-Base (loading sketch below)
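A minimal sketch of swapping in one of the smaller models; "small_bert_L2_128" (2 layers, 128 dimensions) is an assumed model name, and the column names are illustrative.

import sparknlp
from sparknlp.annotator import BertEmbeddings

spark = sparknlp.start()  # active session required to download pretrained models

bert = BertEmbeddings.pretrained("small_bert_L2_128", "en") \
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("embeddings") \
    .setMaxSentenceLength(128)  # performance depends heavily on max sequence length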
20. Performance: Hardware
! Optimized builds of Spark NLP for both Intel and Nvidia
! Benchmark done on AWS: training named entity recognition in French for 80 epochs with a batch size of 512
! Intel outperformed Nvidia: Cascade Lake was 19% faster & 46% cheaper than a Tesla P-100
21. Scale: Distribution & Parallelism
! Zero code changes to scale a pipeline to any Spark cluster
! The only natively distributed open-source NLP library
! Spark provides execution planning, caching, serialization, and shuffling
! Caveats
! Speedup depends on what you actually do
! Spark configurations matter (see the configuration sketch below)
! Cluster tuning based on your data is advised
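A minimal sketch of the kind of session-level tuning this refers to; the master, memory sizes, and package version are illustrative assumptions to be adjusted for your own data and cluster, not recommended values.

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("spark-nlp-pipeline") \
    .master("yarn") \
    .config("spark.driver.memory", "16g") \
    .config("spark.executor.memory", "32g") \
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer") \
    .config("spark.kryoserializer.buffer.max", "2000M") \
    .config("spark.jars.packages", "com.johnsnowlabs.nlp:spark-nlp_2.11:2.6.0") \
    .getOrCreate()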
22. Scale: Distribution & Parallelism
Recognize Entity DL Pipeline
! Amazon full reviews: 15 million sentences and 255 million tokens
! Single node with 32G memory & 32 cores
! 10x workers with 32G memory & 16 cores
! The pipeline includes sentence detection, tokenization, word embeddings, and NER
NOTE:
! The single node is a dedicated Dell server
! The 10 nodes are in Databricks on AWS
23. Scale: Distribution & Parallelism
BERT Embeddings
! Amazon full reviews: 15 million sentences and 255 million tokens
! Single node with 64G memory & 32 cores
! 10x workers with 32G memory & 16 cores
! 128 max sequence length
NOTE:
! The single node is a dedicated Dell server
! The 10 nodes are in Databricks on AWS
24. Easy to Use
Python, Scala, and Java
! Pretrained pipelines
! Pretrained models
! Training your own models
25. Easy to Use
Pretrained Pipelines
! 90+ pretrained pipelines
! Full support for 13 languages
! Simple and easy to use (usage sketch below)
! Works online and offline
! Not flexible
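A minimal usage sketch; "explain_document_dl" is one of the downloadable English pipelines, and the printed "entities" key assumes that pipeline's fixed output columns.

import sparknlp
from sparknlp.pretrained import PretrainedPipeline

spark = sparknlp.start()  # active session required to download pretrained pipelines

pipeline = PretrainedPipeline("explain_document_dl", lang="en")
result = pipeline.annotate("Spark NLP ships pretrained pipelines for quick starts.")
print(result["entities"])  # pretrained pipelines expose a fixed, non-configurable set of stages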
26. Easy to Use
Pretrained Models
! 250+ pretrained models
! Supports 46 languages
! Works online and offline
! Flexible & customizable pipelines
! Caveat: some models depend on each other (composition sketch below)
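A minimal sketch of composing pretrained models into your own pipeline; the model names ("glove_100d", "ner_dl") are assumptions, and they illustrate the caveat above: the NER model must be paired with the embeddings it was trained on.

import sparknlp
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import SentenceDetector, Tokenizer, \
    WordEmbeddingsModel, NerDLModel, NerConverter
from pyspark.ml import Pipeline

spark = sparknlp.start()  # active session required to download pretrained models

document = DocumentAssembler().setInputCol("text").setOutputCol("document")
sentence = SentenceDetector().setInputCols(["document"]).setOutputCol("sentence")
tokenizer = Tokenizer().setInputCols(["sentence"]).setOutputCol("token")
embeddings = WordEmbeddingsModel.pretrained("glove_100d", "en") \
    .setInputCols(["sentence", "token"]).setOutputCol("embeddings")
ner = NerDLModel.pretrained("ner_dl", "en") \
    .setInputCols(["sentence", "token", "embeddings"]).setOutputCol("ner")
ner_converter = NerConverter() \
    .setInputCols(["sentence", "token", "ner"]).setOutputCol("entities")

pipeline = Pipeline(stages=[document, sentence, tokenizer,
                            embeddings, ner, ner_converter])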
27. Easy to Use
Train your own POS tagging models
! POS() accepts token-tag format (training sketch below)
! The POS tagger is based on the Averaged Perceptron algorithm
! Language-agnostic and supports any language
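A minimal training sketch; the corpus path is a placeholder and "|" is the assumed token-tag delimiter.

import sparknlp
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import SentenceDetector, Tokenizer, PerceptronApproach
from sparknlp.training import POS
from pyspark.ml import Pipeline

spark = sparknlp.start()

# Each training line looks like: The|DT quick|JJ brown|JJ fox|NN ...
training_data = POS().readDataset(spark, "path/to/pos_corpus.txt", delimiter="|")

document = DocumentAssembler().setInputCol("text").setOutputCol("document")
sentence = SentenceDetector().setInputCols(["document"]).setOutputCol("sentence")
tokenizer = Tokenizer().setInputCols(["sentence"]).setOutputCol("token")
pos_tagger = PerceptronApproach() \
    .setInputCols(["document", "token"]) \
    .setOutputCol("pos") \
    .setPosColumn("tags") \
    .setNIterations(5)

model = Pipeline(stages=[document, sentence, tokenizer, pos_tagger]).fit(training_data)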
28. Easy to Use
Train your own NER models
! CoNLL 2003 format as input (training sketch below)
! Accepts 50+ Word Embeddings models
! Train on CPU or GPU
! Extended metrics and evaluation
! Built-in validation split with metrics
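A minimal training sketch on CoNLL 2003 data; the file path and the "glove_100d" embeddings are assumptions, and any of the supported embeddings models can be swapped in.

import sparknlp
from sparknlp.training import CoNLL
from sparknlp.annotator import WordEmbeddingsModel, NerDLApproach
from pyspark.ml import Pipeline

spark = sparknlp.start()

# CoNLL 2003 format: one token per line with its POS and NER labels
training_data = CoNLL().readDataset(spark, "eng.train")

embeddings = WordEmbeddingsModel.pretrained("glove_100d", "en") \
    .setInputCols(["sentence", "token"]).setOutputCol("embeddings")

ner_approach = NerDLApproach() \
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setLabelColumn("label") \
    .setOutputCol("ner") \
    .setMaxEpochs(10) \
    .setValidationSplit(0.2) \
    .setEnableOutputLogs(True)  # built-in validation split, metrics written to the logs

ner_model = Pipeline(stages=[embeddings, ner_approach]).fit(training_data)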
29. Easy to Use
Train your own NER models
! BERT with 2 layers & 768 dimensions
! 16 minutes training
! 91% Micro F1 on Dev
! 90% conll_eval on Dev
! Full CoNLL 2003 training dataset
! Google Colab with GPU
30. Easy to Use
Train your own multi-class classifiers
! Supports up to 100 classes (training sketch below)
! Accepts 90+ Word & Sentence Embeddings models
! Train on CPU or GPU
! Extended metrics and evaluation
! Built-in validation split with metrics
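A minimal training sketch using Universal Sentence Encoder as the input embeddings; training_df and its "category" label column are assumptions (any DataFrame with a text column and a label column works).

import sparknlp
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import UniversalSentenceEncoder, ClassifierDLApproach
from pyspark.ml import Pipeline

spark = sparknlp.start()

document = DocumentAssembler().setInputCol("text").setOutputCol("document")
use = UniversalSentenceEncoder.pretrained() \
    .setInputCols(["document"]).setOutputCol("sentence_embeddings")
classifier = ClassifierDLApproach() \
    .setInputCols(["sentence_embeddings"]) \
    .setOutputCol("class") \
    .setLabelColumn("category") \
    .setMaxEpochs(10)

model = Pipeline(stages=[document, use, classifier]).fit(training_df)  # training_df is hypothetical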
31. Spark NLP
! 73 total releases
! A release every two weeks for the past 3 years
! A single unified library for all your NLP/NLU needs
! Active community on Slack & GitHub