Spark NLP: State of the Art Natural Language Processing at Scale

Natural language processing is a key component in many data science systems that must understand or reason about text. Common use cases include question answering, summarization, sentiment analysis, natural language BI, language modeling, and disambiguation. Building such systems usually requires combining three types of software libraries: NLP annotation frameworks, machine learning frameworks, and deep learning frameworks. This talk introduces the NLP library for Apache Spark. Spark NLP natively extends the Spark ML pipeline APIs, which enables zero-copy, distributed, combined NLP & ML pipelines that leverage all of Spark’s built-in optimizations. The talk also shares benchmarks and design best practices for building NLP, ML, and DL pipelines on Spark.

  1. Spark NLP: State of the Art Natural Language Processing at Scale (Maziyar Panahi, David Talby)
  2. Agenda: Introducing Spark NLP, Accuracy, Performance, Scale, Ease of Use
  3. Spark NLP: Apache License 2.0
  4. Spark NLP in Industry. Survey question: "Which NLP libraries does your organization use?" Source: NLP Industry Survey by Gradient Flow, an independent data science research & insights company, September 2020
  5. Spark NLP: Apache License 2.0
      - Tokenization
      - Sentence Detector
      - Stop Words Removal
      - Normalizer
      - Stemmer
      - Lemmatizer
      - NGrams
      - Regex Matching
      - Text Matching
      - Chunking
      - Date Matcher
      - Part-of-speech tagging
      - Dependency parsing
      - Sentiment Detection (ML models)
      - Spell Checker (ML and DL models)
      - Word Embeddings
      - BERT Embeddings
      - ELMO Embeddings
      - ALBERT Embeddings
      - XLNet Embeddings
      - Universal Sentence Encoder
      - BERT Sentence Embeddings
      - Sentence Embeddings
      - Chunk Embeddings
      - Unsupervised keywords extraction
      - Language Detection & Identification
      - Multi-class Text Classification
      - Multi-label Text Classification
      - Multi-class Sentiment Analysis
      - Named entity recognition
      - Easy TensorFlow integration
      - Full integration with Spark ML functions
      - 250+ pre-trained models in 46 languages
  6. Trusted By
  7. Vision of Spark NLP (Python, Scala, and Java)
      - Accuracy
      - Performance
      - Scalability
  8. Accuracy: State-of-the-art Models (Named Entity Recognition)
      - “State of the art” means the best peer-reviewed academic results
      - For example: best F1 score on the CoNLL-2003 NER benchmark for a system in production
      - Spark NLP uses Bi-LSTM + Char-CNN + CRF + Word Embeddings
  9. Accuracy: State-of-the-art Models (Named Entity Recognition)
      - Spark NLP achieves the best F1 score on the CoNLL-2003 NER benchmark for a system in production
      - A BERT Large model was used to train our Bi-LSTM + Char-CNN + CRF model
  10. Accuracy: State-of-the-art Models (Named Entity Recognition)
      - Everything must work right out of the box, meaning there are no extra steps to make something work
      - All parameters are left at their defaults
      - The CoNLL 2003 dataset is used in this benchmark: eng.train for training and eng.testa for evaluating the model
  11. Accuracy: State-of-the-art Models (Transformers & Embeddings)
  12. Transformers & Embeddings. Spark NLP: 50 Word Embeddings
      - BERT
      - Small BERT
      - BioBERT
      - CovidBERT
      - ALBERT
      - ELECTRA
      - XLNet
      - ELMO
      - GloVe
  13. Accuracy: State-of-the-art Models (Multi-class & Multi-label Text Classification)
      - Multi-class text classification to detect emotions, cyberbullying, fake news, spam, etc.
      - Multi-label text classification to detect toxic comments, movie genre, etc.
      - 90+ pretrained Word and Sentence Embeddings models
      - Language-Agnostic BERT Sentence Embedding
      - Universal Sentence Encoder as an input for text classification
  14. Accuracy: State-of-the-art Models (SentimentDL, ClassifierDL, and MultiClassifierDL)
      - Embeddings: BERT, Small BERT, BioBERT, CovidBERT, LaBSE, ALBERT, ELECTRA, XLNet, ELMO, Universal Sentence Encoder, GloVe
      - Dimensions: 100, 200, 128, 256, 300, 512, 768, 1024
      - 90+ Word & Sentence models: tfhub_use, tfhub_use_lg, glove_6B_100, glove_6B_300, glove_840B_300, bert_base_cased, bert_base_uncased, bert_large_cased, bert_large_uncased, bert_multi_uncased, electra_small_uncased, elmo, ...
      - Classes: 2 classes (positive/negative), 3 classes (0, 1, 2), 4 classes (Sports, Business, etc.), 5 classes (1.0, 2.0, 3.0, 4.0, 5.0), ... up to 100 classes
  15. Accuracy: State-of-the-art Models (Language Detection & Identification)
      - LanguageDetectorDL is a state-of-the-art TensorFlow/Keras model
      - Uses the positions of the characters
      - The model is around 3 MB to 5 MB
      - It has been trained on over 8 million Wikipedia pages
      - It has between 97% and 99% accuracy for text longer than 140 characters
  16. Accuracy: State-of-the-art Models (Context Spell Checker)
      - Ability to consider OCR-specific error patterns
      - Ability to leverage the context
      - Ability to preserve and even correct custom patterns
      - Flexibility to incorporate your own custom patterns
  17. Performance: Named Entity Recognition (Improved Speed & Accuracy)
  18. Performance: BERT Embeddings
      - Transformers are slow!
      - They need GPUs
      - Performance depends heavily on the max sequence length
      - Spark NLP 2.6:
          - Improves memory consumption by 30%
          - Improves performance by more than 70% with dynamic shape
  19. Performance: BERT Embeddings
      - 24 new smaller models!
      - Tiny BERT
      - Mini BERT
      - Small BERT
      - Medium BERT
      - Example: BERT-Tiny is 24x smaller and 28x faster than BERT-Base
  20. Performance: Hardware
      - Optimized builds of Spark NLP for both Intel and Nvidia
      - Benchmark done on AWS: training a French Named Entity Recognition model for 80 epochs with a batch size of 512
      - Intel outperformed Nvidia: Cascade Lake was 19% faster & 46% cheaper than Tesla P100
  21. Scale: Distribution & Parallelism
      - Zero code changes to scale a pipeline to any Spark cluster
      - Only natively distributed open-source NLP library
      - Spark provides execution planning, caching, serialization, and shuffling
      - Caveats:
          - Speedup depends on what you actually do
          - Spark configurations matter
          - Cluster tuning based on your data is advised
  22. Scale: Distribution & Parallelism (Recognize Entity DL Pipeline)
      - Amazon full reviews: 15 million sentences and 255 million tokens
      - Single node: 32G memory & 32 cores
      - 10x workers, each with 32G memory & 16 cores
      - The pipeline includes sentence detection, tokenization, word embeddings, and NER
      - Note: the single node is a dedicated Dell server; the 10 nodes are in Databricks on AWS
  23. Scale: Distribution & Parallelism (BERT Embeddings)
      - Amazon full reviews: 15 million sentences and 255 million tokens
      - Single node: 64G memory & 32 cores
      - 10x workers, each with 32G memory & 16 cores
      - 128 max sequence length
      - Note: the single node is a dedicated Dell server; the 10 nodes are in Databricks on AWS
  24. Easy to Use (Python, Scala, and Java)
      - Pretrained pipelines
      - Pretrained models
      - Training your own models
  25. Easy to Use: Pretrained Pipelines
      - 90+ pretrained pipelines
      - Full support for 13 languages
      - Simple and easy to use
      - Works online and offline
      - Not flexible
  26. Easy to Use: Pretrained Models
      - 250+ pretrained models
      - Supports 46 languages
      - Works online and offline
      - Flexible & customized pipelines
      - Caveat: some models depend on each other
  27. Easy to Use: Train your own POS tagging models
      - POS() accepts token-tag format
      - The POS tagger is based on the Averaged Perceptron algorithm
      - Language-agnostic: supports any language
  28. Easy to Use: Train your own NER models
      - CoNLL 2003 format as input
      - Accepts 50+ Word Embeddings models
      - Train on CPU or GPU
      - Extended metrics and evaluation
      - Built-in validation split with metrics
  29. Easy to Use: Train your own NER models
      - BERT with 2 layers & 768 dimensions
      - 16 minutes training
      - 91% Micro F1 on Dev
      - 90% conll_eval on Dev
      - Full CoNLL 2003 training dataset
      - Google Colab with GPU
  30. Easy to Use: Train your own multi-class classifiers
      - Supports up to 100 classes
      - Accepts 90+ Word & Sentence Embeddings models
      - Train on CPU or GPU
      - Extended metrics and evaluation
      - Built-in validation split with metrics
  31. Spark NLP
      - 73 total releases
      - A release every two weeks for the past 3 years
      - A single unified library for all your NLP/NLU needs
      - Active community on Slack & GitHub
  32. Thank You! Getting started, docs, videos, examples: nlp.johnsnowlabs.com. Community: spark-nlp.slack.com
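
Slides 8 to 10 and 29 report NER quality as an F1 score on CoNLL-2003. As a rough illustration of what such a number measures, the sketch below computes an entity-level micro F1 from BIO tag sequences in plain Python. It is not Spark NLP code and not the official conlleval script: the function names `extract_spans` and `micro_f1` are hypothetical, and the lenient handling of a stray `I-` tag (treated as the start of a new entity) is an assumption.

```python
from typing import List, Optional, Set, Tuple

# (start index, end index exclusive, entity type)
Span = Tuple[int, int, str]


def extract_spans(tags: List[str]) -> Set[Span]:
    """Collect entity spans from one sentence of BIO tags (B-PER, I-PER, O, ...)."""
    spans: Set[Span] = set()
    start: Optional[int] = None
    label: Optional[str] = None
    # A trailing "O" sentinel flushes a span that runs to the end of the sentence.
    for i, tag in enumerate(tags + ["O"]):
        continues = tag.startswith("I-") and label == tag[2:]
        if continues:
            continue
        if label is not None:
            spans.add((start, i, label))
        if tag.startswith("B-") or tag.startswith("I-"):
            # Assumption: a stray I- tag opens a new entity instead of being dropped.
            start, label = i, tag[2:]
        else:
            start, label = None, None
    return spans


def micro_f1(gold: List[List[str]], pred: List[List[str]]) -> float:
    """Micro-averaged F1 over exact-match entity spans across all sentences."""
    tp = fp = fn = 0
    for gold_tags, pred_tags in zip(gold, pred):
        g, p = extract_spans(gold_tags), extract_spans(pred_tags)
        tp += len(g & p)   # spans with identical boundaries and type
        fp += len(p - g)   # predicted spans that are not in the gold set
        fn += len(g - p)   # gold spans the prediction missed
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


gold = [["B-PER", "I-PER", "O", "B-LOC"]]
pred = [["B-PER", "I-PER", "O", "O"]]
score = micro_f1(gold, pred)  # one of two gold entities found, no false positives
```

Here precision is 1.0 and recall is 0.5, so the score is 2/3. An entity only counts as correct when both its boundaries and its type match exactly, which is why entity-level F1 is stricter than per-token accuracy.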

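Slide 27 mentions that POS training data uses a token-tag format. The sketch below shows one way such a line could be split into parallel token and tag lists in plain Python. The `|` delimiter, the sample sentence, and the function name `parse_tagged_line` are assumptions for illustration; the slides do not specify the delimiter, and this is not the Spark NLP reader itself.

```python
from typing import List, Tuple


def parse_tagged_line(line: str, delimiter: str = "|") -> Tuple[List[str], List[str]]:
    """Split a line such as 'Spark|NNP scales|VBZ' into tokens and POS tags."""
    tokens: List[str] = []
    tags: List[str] = []
    for item in line.split():
        # rpartition splits on the LAST delimiter, so a token that itself
        # contains the delimiter character keeps everything before the tag.
        token, sep, tag = item.rpartition(delimiter)
        if not sep or not token or not tag:
            raise ValueError(f"expected token{delimiter}tag, got {item!r}")
        tokens.append(token)
        tags.append(tag)
    return tokens, tags


tokens, tags = parse_tagged_line("Spark|NNP scales|VBZ well|RB")
```

Keeping tokens and tags as two aligned lists mirrors how a tagger is usually trained: each token position has exactly one label, and a malformed item fails loudly instead of silently shifting the alignment.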