NLP Project Full Cycle
Vsevolod Dyomkin
10/2016
A Bit about Me
* Lisp programmer
* 5+ years of NLP work at Grammarly
* Occasional lecturer
https://vseloved.github.io
Plan
* Overview of NLP
* NLP Data
* Common NLP problems
and approaches
* Example NLP application:
text language identification
What Is NLP?
Transforming free-form text
into structured data and back
Intersection of:
* Computational Linguistics
* CompSci & AI
* ML, Stats, Information Theory
Natural Language
* ambiguous
* noisy
* evolving
Roles
linguist [noun]
1. A specialist in linguistics
linguistics [noun]
1. The scientific study of
language.
NLP Data
Types of text data:
* structured
* semi-structured
* unstructured
“Data is ten times more
powerful than algorithms.”
-- Peter Norvig
The Unreasonable Effectiveness of Data.
http://youtu.be/yvDCzhbjYWs
Kinds of Data
* Dictionaries
* Databases/Ontologies
* Corpora
* Internet/user Data
Where to Get Data?
* Linguistic Data Consortium
http://www.ldc.upenn.edu/
* Common Crawl
* Wikimedia
* Wordnet
* APIs: Twitter, Wordnik, ...
* University sites &
the academic community:
Stanford, Oxford, CMU, ...
Create Your Own!
* Linguists
* Crowdsourcing
* By-product
-- Jonathan Zittrain
http://goo.gl/hs4qB
Classic NLP Problems
* Linguistically-motivated:
segmentation, tagging, parsing
* Analytical:
classification, sentiment analysis
* Transformation:
translation, correction, generation
* Conversation:
question answering, dialog
engineer [noun]
5. A person skilled in the
design and programming of
computer systems
Tokenization
Example:
This is a test that isn't so simple: 1.23.
"This" "is" "a" "test" "that" "is" "n't"
"so" "simple" ":" "1.23" "."
Issues:
* Finland’s capital -
Finland Finlands Finland’s
* what’re, I’m, isn’t -
what ’re, I ’m, is n’t
* Hewlett-Packard or Hewlett Packard
* San Francisco - one token or two?
* m.p.h., PhD.
Regular Expressions
Simplest regex: [^\s]+
More advanced regex:
\w+|[!"#$%&'*+,\./:;<=>?@^`~…\(\)⟨⟩{}\[\|\]‒–—―«»“”‘’-]
Even more advanced regex:
[+-]?[0-9](?:[0-9,\.]*[0-9])?
|[\w@](?:[\w'’`@-][\w']|[\w'][\w@'’`-])*[\w']?
|["#$%&*+,/:;<=>@^`~…\(\)⟨⟩{}\[\|\]‒–—―«»“”‘’']
|[\.!?]+
|-+
In fact, it works:
https://github.com/lang-uk/ner-uk/blob/master/doc/tokenization.md
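A minimal Python sketch of regex-based tokenization, for illustration only. The pattern is a simplified assumption, not the regex from the slides or from ner-uk, and unlike the slide example it keeps contractions such as "isn't" as one token.

```python
import re

# Alternatives are tried left to right, so numbers win over generic words.
TOKEN_RE = re.compile(r"""
    [+-]?[0-9](?:[0-9,.]*[0-9])?   # numbers such as 1.23 or 1,000
  | \w+(?:['’]\w+)*                # words, keeping inner apostrophes
  | [.!?]+                         # runs of sentence-final punctuation
  | [^\w\s]                        # any other single symbol
""", re.VERBOSE)

def tokenize(text):
    """Return the list of tokens found in text (hypothetical helper)."""
    return TOKEN_RE.findall(text)

tokens = tokenize("This is a test that isn't so simple: 1.23.")
```

Splitting "isn't" into "is" and "n't", as on the Tokenization slide, would need an extra rule; that is the kind of case where rule sets start to grow.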
Rule-based Approach
* easy to understand and
reason about
* can be arbitrarily precise
* iterative, can be used to
gather more data
Limitations:
* recall problems
* poor adaptability
Rule-based NLP tools
* SpamAssassin
* LanguageTool
* ELIZA
* GATE
researcher [noun]
1. One who researches
research [noun]
1. Diligent inquiry or
examination to seek or revise
facts, principles, theories,
applications, etc.; laborious
or continued search after
truth
Models
Statistical Approach
“Probability theory
is nothing but
common sense
reduced to calculation.”
-- Pierre-Simon Laplace
Language Models
Question: what is the probability of a
sequence of words/sentence?
Answer: Apply the chain rule
P(S) = P(w0) * P(w1|w0) * P(w2|w0 w1)
* P(w3|w0 w1 w2) * …
where S = w0 w1 w2 …
Ngrams
Apply Markov assumption: each word depends
only on N previous words (in practice
N=1..4, which results in bigrams to fivegrams,
because we also include the current word).
If N=2 (trigrams):
P(S) = P(w0) * P(w1|w0) * P(w2|w0 w1)
* P(w3|w1 w2) * …
By the definition of conditional probability:
P(w2|w0 w1) = P(w0 w1 w2) / P(w0 w1)
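The count-based estimate above can be sketched in Python as follows. This is a toy maximum-likelihood model under stated assumptions: the function names and the "<s>" padding symbol are hypothetical, n is the n-gram order (n=2 means one previous word), and there is no smoothing, so any unseen n-gram gives probability 0.

```python
from collections import Counter

def train_ngrams(sentences, n=2):
    """Count n-grams and their (n-1)-word contexts; n=2 gives bigrams."""
    ngrams, contexts = Counter(), Counter()
    for words in sentences:
        padded = ["<s>"] * (n - 1) + words
        for i in range(n - 1, len(padded)):
            context = tuple(padded[i - n + 1:i])
            ngrams[context + (padded[i],)] += 1
            contexts[context] += 1
    return ngrams, contexts

def sentence_prob(words, ngrams, contexts, n=2):
    """P(S) as a product of P(word | context) = count(ngram) / count(context)."""
    padded = ["<s>"] * (n - 1) + words
    prob = 1.0
    for i in range(n - 1, len(padded)):
        context = tuple(padded[i - n + 1:i])
        if contexts[context] == 0:
            return 0.0  # context never seen in training
        prob *= ngrams[context + (padded[i],)] / contexts[context]
    return prob

corpus = [["the", "cat", "sat"], ["the", "dog", "sat"]]
ngrams, contexts = train_ngrams(corpus, n=2)
p_seen = sentence_prob(["the", "cat", "sat"], ngrams, contexts, n=2)
p_unseen = sentence_prob(["the", "bird"], ngrams, contexts, n=2)
```

A practical model would add smoothing or back-off so that unseen n-grams do not zero out the whole sentence.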
Spam Filtering
A 2-class classification problem with a
bias towards minimizing FPs.
Default approach: rule-based (SpamAssassin)
Problems:
* scales poorly
* hard to reach arbitrary precision
* hard to rank the importance of
complex features?
Bag-of-words Model
* each word is a feature
* each word is independent of others
* position of the word in a sentence is irrelevant
Pros:
* simple
* fast
* scalable
Limitations:
* independence assumption doesn't hold
http://www.paulgraham.com/spam.html - A Plan for Spam
Initial results: recall: 92%, precision: 98.84%
Improved results: recall: 99.5%, precision: 99.97%
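As an illustration of the bag-of-words model described above, here is a minimal Python sketch. Whitespace splitting and lowercasing are simplifying assumptions, and bag_of_words is a hypothetical helper name.

```python
from collections import Counter

def bag_of_words(text):
    """Map a text to word counts; word order and position are discarded."""
    return Counter(text.lower().split())

bag = bag_of_words("Buy cheap pills buy now")
same_bag = bag_of_words("now buy pills cheap buy")
```

Both calls produce the same counts, which is exactly the "position is irrelevant" property, and also why the independence assumption loses information.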
Naive Bayes
Classifier
P(Y|X) = P(Y) * P(X|Y) / P(X)
select Y = argmax P(Y|X)
Naive step:
P(Y|X) ∝ P(Y) * prod(P(x|Y))
for all x in X
(P(X) is dropped because it's the
same for all Y)
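A small Python sketch of the classifier above, under a few assumptions not stated on the slide: features are words, probabilities are summed in log space to avoid underflow, and add-one smoothing handles unseen words. The class and the toy data are hypothetical.

```python
import math
from collections import Counter, defaultdict

class NaiveBayes:
    def __init__(self):
        self.class_counts = Counter()            # documents per class
        self.word_counts = defaultdict(Counter)  # word counts per class
        self.vocab = set()

    def train(self, examples):
        for words, label in examples:
            self.class_counts[label] += 1
            self.word_counts[label].update(words)
            self.vocab.update(words)

    def predict(self, words):
        total = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, count in self.class_counts.items():
            score = math.log(count / total)      # log P(Y)
            denom = sum(self.word_counts[label].values()) + len(self.vocab)
            for w in words:                      # + sum of log P(x|Y)
                score += math.log((self.word_counts[label][w] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label

nb = NaiveBayes()
nb.train([
    ("buy cheap pills now".split(), "spam"),
    ("cheap pills cheap".split(), "spam"),
    ("meeting at noon".split(), "ham"),
    ("lunch at noon tomorrow".split(), "ham"),
])
```

With this toy data, nb.predict("cheap pills".split()) selects "spam" and nb.predict("noon meeting".split()) selects "ham".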
Machine Learning
Approach
Dependency Parsing
nsubj(ate-2, They-1)
root(ROOT-0, ate-2)
det(pizza-4, the-3)
dobj(ate-2, pizza-4)
prep(ate-2, with-5)
pobj(with-5, anchovies-6)
https://honnibal.wordpress.com/2013/12/18/a-simple-fast-algorithm-for-natural-language-dependency-parsing/
Shift-reduce Parsing
Averaged Perceptron
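import random

def train(model, number_iter, examples):
    # examples: list of (features, true_tag) pairs
    for i in range(number_iter):
        for features, true_tag in examples:
            guess = model.predict(features)
            if guess != true_tag:
                # reward the correct tag, penalize the wrong guess
                for f in features:
                    model.weights[f][true_tag] += 1
                    model.weights[f][guess] -= 1
        # reshuffle so the next pass sees a different order
        random.shuffle(examples)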
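The training loop above assumes a model object with weights and predict, and it does not show the averaging step. A self-contained sketch of both follows. The AveragedPerceptron class, its naive averaging (summing a snapshot of all weights after every example) and the toy features are assumptions made for illustration; real implementations track update timestamps instead of taking snapshots.

```python
import random
from collections import defaultdict

class AveragedPerceptron:
    def __init__(self, classes):
        self.classes = sorted(classes)
        self.weights = defaultdict(lambda: defaultdict(float))
        self._totals = defaultdict(lambda: defaultdict(float))
        self._steps = 0

    def predict(self, features):
        scores = dict.fromkeys(self.classes, 0.0)
        for f in features:
            for tag, w in self.weights.get(f, {}).items():
                scores[tag] += w
        # ties are broken in favour of the first class in sorted order
        return max(self.classes, key=scores.get)

    def update(self, features, true_tag, guess):
        if guess != true_tag:
            for f in features:
                self.weights[f][true_tag] += 1
                self.weights[f][guess] -= 1
        # naive averaging: accumulate a snapshot of every weight
        self._steps += 1
        for f, per_tag in self.weights.items():
            for tag, w in per_tag.items():
                self._totals[f][tag] += w

    def average(self):
        """Replace each weight by its mean over all training steps."""
        for f, per_tag in self._totals.items():
            for tag, total in per_tag.items():
                self.weights[f][tag] = total / self._steps

def train(model, number_iter, examples):
    for _ in range(number_iter):
        for features, true_tag in examples:
            model.update(features, true_tag, model.predict(features))
        random.shuffle(examples)
    model.average()

examples = [(["suffix=ing"], "VERB"), (["suffix=ly"], "ADV"), (["suffix=tion"], "NOUN")]
model = AveragedPerceptron({"VERB", "ADV", "NOUN"})
train(model, 5, examples)
```

Averaging damps the effect of late updates, so the final weights depend less on the order of the last few examples.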
ML-based Parsing
The parser starts with an empty stack, and a buffer index at 0, with no
dependencies recorded. It chooses one of the valid actions, and applies it to
the state. It continues choosing actions and applying them until the stack is
empty and the buffer index is at the end of the input.
SHIFT = 0; RIGHT = 1; LEFT = 2
MOVES = [SHIFT, RIGHT, LEFT]

# init_deps, extract_features, score, get_valid_moves and transition
# are helpers defined elsewhere in the parser
def parse(words, tags):
    n = len(words)
    deps = init_deps(n)
    idx = 1
    stack = [0]
    while stack or idx < n:
        features = extract_features(words, tags, idx, n, stack, deps)
        scores = score(features)
        valid_moves = get_valid_moves(idx, n, len(stack))
        # pick the highest-scoring move among the valid ones
        next_move = max(valid_moves, key=lambda move: scores[move])
        idx = transition(next_move, idx, stack, deps)
    return tags, deps
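The parse loop relies on helpers that the slide does not show. Below is one possible sketch of the state transitions. Everything here is an assumption made for illustration: the heads list (heads[i] is the index of word i's head), a ROOT token placed at the end of the sentence, and a fixed move sequence standing in for the learned scorer.

```python
SHIFT, RIGHT, LEFT = 0, 1, 2

def get_valid_moves(idx, n, stack_depth):
    moves = []
    if idx < n:
        moves.append(SHIFT)   # there is still input to push
    if stack_depth >= 2:
        moves.append(RIGHT)   # needs two words on the stack
    if stack_depth >= 1 and idx < n:
        moves.append(LEFT)    # needs a stack word and a buffer word
    return moves

def transition(move, idx, stack, heads):
    if move == SHIFT:
        stack.append(idx)
        return idx + 1
    child = stack.pop()
    # RIGHT: the head is the new stack top; LEFT: the head is the next buffer word
    heads[child] = stack[-1] if move == RIGHT else idx
    return idx

def apply_moves(n, moves):
    """Replay a fixed move sequence from the same start state as parse()."""
    heads = [None] * n
    stack, idx = [0], 1
    for move in moves:
        assert move in get_valid_moves(idx, n, len(stack))
        idx = transition(move, idx, stack, heads)
    return heads

words = ["They", "ate", "pizza", "ROOT"]
heads = apply_moves(len(words), [LEFT, SHIFT, SHIFT, RIGHT, LEFT])
```

Here heads comes out as [1, 3, 1, None]: "They" and "pizza" attach to "ate", and "ate" attaches to ROOT. In the ML-based parser the move at each step is chosen by the model rather than given in advance.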
The Hierarchy of
ML Models
Linear:
* (Averaged) Perceptron
* Maximum Entropy / LogLinear / Logistic
Regression; Conditional Random Field
* SVM
Non-linear:
* Decision Trees, Random Forests, Boosted
Trees
* Artificial Neural networks
Semantics
Question: how to model relationships
between words?
Answer: build a graph
Wordnet
Freebase
DBPedia
Word Similarity
Next question: now, how do we measure those
relations?
* different Wordnet similarity measures
* PMI(x,y) = log(p(x,y) / (p(x) * p(y)))
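A sketch of how PMI could be computed from co-occurrence counts. The assumptions here are illustrative: two words co-occur when they appear in the same sentence, probabilities are relative frequencies over sentences, and pmi_table is a hypothetical helper.

```python
import math
from collections import Counter
from itertools import combinations

def pmi_table(sentences):
    """Return PMI for every word pair that co-occurs in at least one sentence."""
    word_counts, pair_counts = Counter(), Counter()
    for words in sentences:
        unique = sorted(set(words))          # sorted so pairs have a stable order
        word_counts.update(unique)
        pair_counts.update(combinations(unique, 2))
    n = len(sentences)
    return {
        (x, y): math.log((c / n) / ((word_counts[x] / n) * (word_counts[y] / n)))
        for (x, y), c in pair_counts.items()
    }

pmi = pmi_table([
    ["new", "york"],
    ["new", "york", "city"],
    ["old", "city"],
    ["old", "town"],
])
```

In this toy corpus "new" and "york" always appear together, so their PMI is log 2, while "city" and "new" co-occur exactly as often as chance predicts, giving a PMI of 0.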
Distributional
Semantics
Distributional hypothesis:
"You shall know a word by
the company it keeps"
--John Rupert Firth
Word representations:
* Explicit representation
Number of nonzero dimensions:
max:474234, min:3, mean:1595, median:415
* Dense representation (word2vec, GloVe, …)
* Hierarchical repr (Brown clusters)
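To make the explicit (sparse) representation concrete, here is a small sketch: each word is represented by counts of the words seen around it, and similarity is the cosine between those count vectors. The window size, function names and toy sentences are assumptions.

```python
import math
from collections import Counter, defaultdict

def context_vectors(sentences, window=2):
    """Map each word to a sparse vector of neighbouring-word counts."""
    vectors = defaultdict(Counter)
    for words in sentences:
        for i, w in enumerate(words):
            lo, hi = max(0, i - window), min(len(words), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[w][words[j]] += 1
    return vectors

def cosine(u, v):
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm = math.sqrt(sum(x * x for x in u.values())) * \
           math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

vecs = context_vectors([
    ["i", "drink", "tea"],
    ["i", "drink", "coffee"],
    ["i", "drive", "cars"],
])
sim_tea_coffee = cosine(vecs["tea"], vecs["coffee"])
sim_tea_cars = cosine(vecs["tea"], vecs["cars"])
```

"tea" and "coffee" share all their contexts and score higher than "tea" and "cars", which share only "i". Dense representations such as word2vec or GloVe compress the same kind of signal into a few hundred dimensions.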
Steps to Develop
an NLP System
* Translate real-world requirements
into a measurable goal
* Find a suitable level and
representation
* Find initial data for experiments
* Find and utilize existing tools and
frameworks where possible
* Set up and perform a proper
experiment (series of experiments)
* Optimize the system for production
Going into Prod
* NLP tasks are usually CPU-intensive
but stateless
* General-purpose NLP frameworks are
(mostly) not production-ready
* Don't trust research results
* Value pre- and post- processing
* Gather user feedback
Text Language
Identification
Not an unsolved problem:
* https://github.com/CLD2Owners/cld2 - C++
* https://github.com/saffsd/langid.py - Python
* https://github.com/shuyo/language-detection/ - Java
To read:
https://blog.twitter.com/2015/evaluating-language-identification-performance
http://blog.mikemccandless.com/2011/10/accuracy-and-performance-of-googles.html
http://lab.hypotheses.org/1083
http://labs.translated.net/language-identifier/
WILD Challenges
YALI WILD
* All of them use weak models
* Wanted to use Wiktionary —
150+ languages,
always evolving
* Wanted to do in Lisp
WILD Linguistics
* Scripts vs languages
http://www.omniglot.com/writing/langalph.htm
* Languages distribution
https://en.wikipedia.org/wiki/Languages_used_on_the_Internet#Content_languages_for_websites
* Frequency word lists
https://invokeit.wordpress.com/frequency-word-lists/
* Word segmentation?
WILD Data
Wikipedia data (instead of Wiktionary):
used abstracts, ~175 languages
- download & store
- process (SAX parsing)
- set up learning & test data sets
10,778,404 unique words
481,581 unique character trigrams
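A toy sketch of language identification with character trigrams, written in Python for consistency with the other examples. It is not the WILD implementation; the add-one smoothing, the space padding and the two tiny training texts are assumptions made for illustration.

```python
import math
from collections import Counter

def trigrams(text):
    """Character trigrams of the lowercased text, padded with spaces."""
    padded = f" {text.lower()} "
    return [padded[i:i + 3] for i in range(len(padded) - 2)]

def train(samples):
    """samples: dict mapping a language code to a training text."""
    return {lang: Counter(trigrams(text)) for lang, text in samples.items()}

def identify(text, profiles):
    """Pick the language whose trigram profile gives the highest log-probability."""
    vocab = set().union(*profiles.values())
    best_lang, best_score = None, -math.inf
    for lang, counts in profiles.items():
        total = sum(counts.values())
        score = sum(
            math.log((counts[t] + 1) / (total + len(vocab)))  # add-one smoothing
            for t in trigrams(text)
        )
        if score > best_score:
            best_lang, best_score = lang, score
    return best_lang

profiles = train({
    "en": "the quick brown fox jumps over the lazy dog and the cat",
    "de": "der schnelle braune fuchs springt über den faulen hund und die katze",
})
```

With these profiles, identify("the dog", profiles) returns "en" and identify("der hund", profiles) returns "de". A real system would train on far more text per language and combine trigram evidence with word-level and script-level features.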
WILD Engineering
* Initial model size ~1G -
script hacks & Huffman coding
to the rescue
* Model pruning
* Proper probability calculations
* Efficient testing
* Properly saving the model
* Library & public API

 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
 

NLP Project Full Cycle

  • 1. NLP Project Full Cycle Vsevolod Dyomkin 10/2016
  • 2. A Bit about Me * Lisp programmer * 5+ years of NLP work at Grammarly * Occasional lecturer https://vseloved.github.io
  • 3. Plan * Overview of NLP * NLP Data * Common NLP problems and approaches * Example NLP application: text language identification
  • 4. What Is NLP? Transforming free-form text into structured data and back
  • 5. What Is NLP? Transforming free-form text into structured data and back Intersection of: * Computational Linguistics * CompSci & AI * ML, Stats, Information Theory
  • 8. linguist [noun] 1. A specialist in linguistics
  • 9. linguist [noun] 1. A specialist in linguistics linguistics [noun] 1. The scientific study of language.
  • 12. NLP Data Types of text data: * structured * semi-structured * unstructured “Data is ten times more powerful than algorithms.” -- Peter Norvig The Unreasonable Effectiveness of Data. http://youtu.be/yvDCzhbjYWs
  • 13. Kinds of Data * Dictionaries * Databases/Ontologies * Corpora * Internet/user Data
  • 14. Where to Get Data? * Linguistic Data Consortium http://www.ldc.upenn.edu/ * Common Crawl * Wikimedia * Wordnet * APIs: Twitter, Wordnik, ... * University sites & the academic community: Stanford, Oxford, CMU, ...
  • 15. Create Your Own! * Linguists * Crowdsourcing * By-product -- Jonathan Zittrain http://goo.gl/hs4qB
  • 16. Classic NLP Problems * Linguistically-motivated: segmentation, tagging, parsing * Analytical: classification, sentiment analysis * Transformation: translation, correction, generation * Conversation: question answering, dialog
  • 17. engineer [noun] 5. A person skilled in the design and programming of computer systems
  • 18. Tokenization Example: This is a test that isn't so simple: 1.23. "This" "is" "a" "test" "that" "is" "n't" "so" "simple" ":" "1.23" "." Issues: * Finland’s capital - Finland Finlands Finland’s * what’re, I’m, isn’t - what ’re, I ’m, is n’t * Hewlett-Packard or Hewlett Packard * San Francisco - one token or two? * m.p.h., PhD.
  • 19. Regular Expressions Simplest regex: [^\s]+ More advanced regex: \w+|[!"#$%&'*+,./:;<=>?@^`~…(){}\[|\]⟨⟩‒–—―«»“”‘’-] Even more advanced regex: [+-]?[0-9](?:[0-9,.]*[0-9])?|[\w@](?:[\w'’`@-][\w']|[\w'][\w@'’`-])*[\w']?|["#$%&*+,/:;<=>@^`~…(){}\[|\]⟨⟩‒–—―«»“”‘’']|[.!?]+|-+ In fact, it works: https://github.com/lang-uk/ner-uk/blob/master/doc/tokenization.md
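To make the effect of these patterns concrete, here is a minimal sketch in Python. The two patterns are simplified stand-ins, not the exact expressions from the slide: `SIMPLE_RE` treats every non-word, non-space character as its own token, and `NUMBER_AWARE_RE` borrows only the number alternative from the "even more advanced" regex. The names `SIMPLE_RE`, `NUMBER_AWARE_RE` and `tokenize` are made up for this illustration.

```python
import re

# Assumption: simplified stand-ins for the slide's patterns, with \w and \s restored.
SIMPLE_RE = re.compile(r"\w+|[^\w\s]")
# Numbers such as 1.23 or 1,000 are tried first so they stay in one piece.
NUMBER_AWARE_RE = re.compile(r"[+-]?[0-9](?:[0-9,.]*[0-9])?|\w+|[^\w\s]")


def tokenize(text, pattern=SIMPLE_RE):
    """Return all non-overlapping matches of `pattern`, left to right."""
    return pattern.findall(text)


sentence = "This is a test that isn't so simple: 1.23."
print(tokenize(sentence))
print(tokenize(sentence, NUMBER_AWARE_RE))
```

The simple pattern breaks "isn't" into three tokens and "1.23" into three more, while the number-aware variant keeps "1.23" whole. Contractions still need the extra alternatives shown on the slide.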
  • 20. Rule-based Approach * easy to understand and reason about * can be arbitrarily precise * iterative, can be used to gather more data Limitations: * recall problems * poor adaptability
  • 21. Rule-based NLP tools * SpamAssassin * LanguageTool * ELIZA * GATE
  • 23. researcher [noun] 1. One who researches
  • 24. researcher [noun] 1. One who researches research [noun] 1. Diligent inquiry or examination to seek or revise facts, principles, theories, applications, etc.; laborious or continued search after truth
  • 26. Statistical Approach “Probability theory is nothing but common sense reduced to calculation.” -- Pierre-Simon Laplace
  • 27. Language Models Question: what is the probability of a sequence of words/sentence?
  • 28. Language Models Question: what is the probability of a sequence of words/sentence? Answer: Apply the chain rule P(S) = P(w0) * P(w1|w0) * P(w2|w0 w1) * P(w3|w0 w1 w2) * … where S = w0 w1 w2 …
  • 29. Ngrams Apply Markov assumption: each word depends only on N previous words (in practice N=1..4 which results in bigrams-fivegrams, because we include the current word also). If N=2: P(S) = P(w0) * P(w1|w0) * P(w2|w0 w1) * P(w3|w1 w2) * … By the definition of conditional probability: P(w2|w0 w1) = P(w0 w1 w2) / P(w0 w1)
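A minimal sketch of the simplest case, N=1 (bigrams), where each conditional probability is estimated as a ratio of counts. The toy corpus and the function names are assumptions made for illustration. No smoothing is applied, so any unseen bigram drives the whole product to zero.

```python
from collections import Counter


def train_bigram_model(sentences):
    """Count unigrams, bigrams, and how often each word starts a bigram."""
    unigrams, bigrams, contexts = Counter(), Counter(), Counter()
    for words in sentences:
        unigrams.update(words)
        bigrams.update(zip(words, words[1:]))
        contexts.update(words[:-1])
    return unigrams, bigrams, contexts


def sentence_probability(words, unigrams, bigrams, contexts):
    """P(S) = P(w0) * P(w1|w0) * P(w2|w1) * ... using raw count ratios."""
    prob = unigrams[words[0]] / sum(unigrams.values())
    for prev, cur in zip(words, words[1:]):
        if contexts[prev] == 0:
            return 0.0  # the previous word was never seen as a context
        prob *= bigrams[(prev, cur)] / contexts[prev]
    return prob


# Hypothetical toy corpus, already tokenized.
corpus = [["the", "cat", "sat"], ["the", "dog", "sat"], ["the", "cat", "ran"]]
model = train_bigram_model(corpus)
print(sentence_probability(["the", "cat", "sat"], *model))
```

For this corpus the result is 3/9 * 2/3 * 1/2 = 1/9. A production model would add smoothing or back-off to handle unseen n-grams.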
  • 30. Spam Filtering A 2-class classification problem with a bias towards minimizing FPs. Default approach: rule-based (SpamAssassin) Problems: * scales poorly * hard to reach arbitrary precision * hard to rank the importance of complex features?
  • 31. Bag-of-words Model * each word is a feature * each word is independent of others * position of the word in a sentence is irrelevant Pros: * simple * fast * scalable Limitations: * independence assumption doesn't hold
  • 32. Bag-of-words Model * each word is a feature * each word is independent of others * position of the word in a sentence is irrelevant Pros: * simple * fast * scalable Limitations: * independence assumption doesn't hold http://www.paulgraham.com/spam.html - A Plan for Spam Initial results: recall: 92%, precision: 98.84% Improved results: recall: 99.5%, precision: 99.97%
  • 33. Naive Bayes Classifier P(Y|X) = P(Y) * P(X|Y) / P(X) select Y = argmax P(Y|X) Naive step: P(Y|X) ∝ P(Y) * prod(P(x|Y)) for all x in X (P(X) is dropped because it's the same for all Y)
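The bag-of-words model and the naive step combine into a very small classifier. The sketch below is one possible reading, not the implementation behind the figures quoted above: it works in log space to avoid underflow and uses add-one smoothing, which is an assumption the slides do not mention. The training messages are invented.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(examples):
    """examples: iterable of (list_of_words, label) pairs."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for words, label in examples:
        label_counts[label] += 1
        word_counts[label].update(words)
        vocab.update(words)
    return label_counts, word_counts, vocab


def classify(words, label_counts, word_counts, vocab):
    """Return argmax over Y of log P(Y) + sum(log P(x|Y))."""
    total_docs = sum(label_counts.values())
    best_label, best_score = None, float("-inf")
    for label, n_docs in label_counts.items():
        score = math.log(n_docs / total_docs)
        n_words = sum(word_counts[label].values())
        for w in words:
            # Add-one smoothing (an assumption) keeps unseen words from zeroing the score.
            score += math.log((word_counts[label][w] + 1) / (n_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical training data.
examples = [
    ("buy cheap pills now".split(), "spam"),
    ("cheap pills cheap offer".split(), "spam"),
    ("meeting agenda for monday".split(), "ham"),
    ("lunch on monday".split(), "ham"),
]
nb = train_naive_bayes(examples)
print(classify("cheap pills".split(), *nb))
print(classify("monday meeting".split(), *nb))
```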
  • 35. Dependency Parsing nsubj(ate-2, They-1) root(ROOT-0, ate-2) det(pizza-4, the-3) dobj(ate-2, pizza-4) prep(ate-2, with-5) pobj(with-5, anchovies-6) https://honnibal.wordpress.com/2013/12/18/a-simple-fast-algorithm-for-natural-language-dependency-parsing/
  • 38. Averaged Perceptron
      import random

      def train(model, number_iter, examples):
          for i in range(number_iter):
              for features, true_tag in examples:
                  guess = model.predict(features)
                  if guess != true_tag:
                      # Reward the correct tag and penalize the wrong guess.
                      for f in features:
                          model.weights[f][true_tag] += 1
                          model.weights[f][guess] -= 1
              # Reshuffle between passes over the training data.
              random.shuffle(examples)
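The training loop on this slide relies on a `model` object that the slide does not show. Below is a minimal sketch of what such a model could look like; the `Perceptron` class, the feature strings and the tag set are assumptions for illustration. The weight averaging that gives the method its name is left out, so this is a plain perceptron.

```python
import random
from collections import defaultdict


class Perceptron:
    """Hypothetical minimal model: weights[feature][tag] -> float."""

    def __init__(self, tags):
        self.tags = list(tags)
        self.weights = defaultdict(lambda: defaultdict(float))

    def predict(self, features):
        scores = defaultdict(float)
        for f in features:
            for tag, weight in self.weights[f].items():
                scores[tag] += weight
        # Ties go to the tag listed first, which keeps the result deterministic.
        return max(self.tags, key=lambda tag: scores[tag])


def train(model, number_iter, examples):
    """Same update rule as on the slide, without averaging."""
    for _ in range(number_iter):
        for features, true_tag in examples:
            guess = model.predict(features)
            if guess != true_tag:
                for f in features:
                    model.weights[f][true_tag] += 1
                    model.weights[f][guess] -= 1
        random.shuffle(examples)


# Hypothetical toy data: one feature list per word.
examples = [
    (["w=run", "suffix=un"], "VERB"),
    (["w=dog"], "NOUN"),
    (["w=eat"], "VERB"),
    (["w=cat"], "NOUN"),
]
random.seed(0)
model = Perceptron(["NOUN", "VERB"])
train(model, 5, examples)
print([model.predict(features) for features, _ in examples])
```

An averaged version would also accumulate each weight over all updates and divide by the number of steps at the end, which usually generalizes better than the final weights alone.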
  • 39. ML-based Parsing The parser starts with an empty stack, and a buffer index at 0, with no dependencies recorded. It chooses one of the valid actions, and applies it to the state. It continues choosing actions and applying them until the stack is empty and the buffer index is at the end of the input.
      SHIFT = 0; RIGHT = 1; LEFT = 2
      MOVES = [SHIFT, RIGHT, LEFT]

      def parse(words, tags):
          n = len(words)
          deps = init_deps(n)
          idx = 1
          stack = [0]
          while stack or idx < n:
              features = extract_features(words, tags, idx, n, stack, deps)
              scores = score(features)
              # Only moves that are legal in the current state are considered.
              valid_moves = get_valid_moves(idx, n, len(stack))
              next_move = max(valid_moves, key=lambda move: scores[move])
              # Apply the move; it updates stack and deps and returns the new index.
              idx = transition(next_move, idx, stack, deps)
          return tags, deps
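The `transition` helper is not shown on the slide. The sketch below is one possible version, written in the style of the arc-hybrid system from the blog post linked on the Dependency Parsing slide. The dictionary `heads` (child index to head index) and the convention that ROOT is the last token are assumptions. The move sequence is hard-coded here, where the real parser would pick each move with the trained model.

```python
SHIFT, RIGHT, LEFT = 0, 1, 2


def transition(move, idx, stack, heads):
    """Apply one move and return the new buffer index."""
    if move == SHIFT:
        stack.append(idx)  # push the next buffer word onto the stack
        return idx + 1
    if move == RIGHT:
        child = stack.pop()  # top of stack attaches to the word below it
        heads[child] = stack[-1]
        return idx
    if move == LEFT:
        heads[stack.pop()] = idx  # top of stack attaches to the next buffer word
        return idx
    raise ValueError(f"unknown move: {move}")


# Assumption: ROOT is appended as the last token; word 0 starts on the stack.
words = ["They", "ate", "the", "pizza", "ROOT"]
stack, idx, heads = [0], 1, {}
for move in [LEFT, SHIFT, SHIFT, LEFT, SHIFT, RIGHT, LEFT]:
    idx = transition(move, idx, stack, heads)

for child, head in sorted(heads.items()):
    print(f"{words[child]} <- {words[head]}")
```

After the seven moves the stack is empty and every word has exactly one head: "They" and "pizza" attach to "ate", "the" attaches to "pizza", and "ate" attaches to ROOT.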
  • 40. The Hierarchy of ML Models Linear: * (Averaged) Perceptron * Maximum Entropy / LogLinear / Logistic Regression; Conditional Random Field * SVM Non-linear: * Decision Trees, Random Forests, Boosted Trees * Artificial Neural networks
  • 41. Semantics Question: how to model relationships between words?
  • 42. Semantics Question: how to model relationships between words? Answer: build a graph Wordnet Freebase DBPedia
  • 43. Word Similarity Next question: now, how do we measure those relations?
  • 44. Word Similarity Next question: now, how do we measure those relations? * different Wordnet similarity measures
  • 45. Word Similarity Next question: now, how do we measure those relations? * different Wordnet similarity measures * PMI(x,y) = log(p(x,y) / (p(x) * p(y)))
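As a worked example of the PMI formula, the sketch below estimates the probabilities from document-level co-occurrence: p(x) is the share of documents containing x, and p(x,y) the share containing both. This choice of co-occurrence window, the toy documents, and the function name `make_pmi` are all assumptions for illustration.

```python
import math
from collections import Counter
from itertools import combinations


def make_pmi(documents):
    """Return a pmi(x, y) function based on document-level co-occurrence."""
    n_docs = len(documents)
    word_df, pair_df = Counter(), Counter()
    for doc in documents:
        words = sorted(set(doc))
        word_df.update(words)
        pair_df.update(combinations(words, 2))  # pairs are stored in sorted order

    def pmi(x, y):
        p_xy = pair_df[tuple(sorted((x, y)))] / n_docs
        if p_xy == 0:
            return float("-inf")  # the two words never co-occur
        return math.log(p_xy / ((word_df[x] / n_docs) * (word_df[y] / n_docs)))

    return pmi


# Hypothetical toy documents.
docs = [["new", "york", "city"], ["new", "york", "times"], ["new", "car"], ["old", "car"]]
pmi = make_pmi(docs)
print(pmi("new", "york"))  # log(0.5 / (0.75 * 0.5)) = log(4/3), positive
print(pmi("new", "car"))   # log(0.25 / (0.75 * 0.5)) = log(2/3), negative
```

A positive value means the pair co-occurs more often than independence would predict, and a negative value means less often.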
  • 46. Distributional Semantics Distributional hypothesis: "You shall know a word by the company it keeps" --John Rupert Firth Word representations: * Explicit representation Number of nonzero dimensions: max:474234, min:3, mean:1595, median:415 * Dense representation (word2vec, GloVe, …) * Hierarchical repr (Brown clusters)
  • 47. Steps to Develop an NLP System * Translate real-world requirements into a measurable goal * Find a suitable level and representation * Find initial data for experiments * Find and utilize existing tools and frameworks where possible * Set up and perform a proper experiment (series of experiments) * Optimize the system for production
  • 48. Going into Prod * NLP tasks are usually CPU-intensive but stateless * General-purpose NLP frameworks are (mostly) not production-ready * Don't trust research results * Value pre- and post- processing * Gather user feedback
  • 49. Text Language Identification Not an unsolved problem: * https://github.com/CLD2Owners/cld2 - C++ * https://github.com/saffsd/langid.py - Python * https://github.com/shuyo/language-detection/ - Java To read: https://blog.twitter.com/2015/evaluating-language-identification-performance http://blog.mikemccandless.com/2011/10/accuracy-and-performance-of-googles.html http://lab.hypotheses.org/1083 http://labs.translated.net/language-identifier/
  • 52. YALI WILD * All of them use weak models * Wanted to use Wiktionary — 150+ languages, always evolving * Wanted to do in Lisp
  • 53. WILD Linguistics * Scripts vs languages http://www.omniglot.com/writing/langalph.htm * Languages distribution https://en.wikipedia.org/wiki/Languages_used_on_the_Internet#Content_languages_for_websites * Frequency word lists https://invokeit.wordpress.com/frequency-word-lists/ * Word segmentation?
  • 54. WILD Data Wiktionary Wikipedia data: used abstracts, ~175 languages - download & store - process (SAX parsing) - set up learning & test data sets 10,778,404 unique words 481,581 unique character trigrams
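The slides do not show how WILD turns words and character trigrams into a model, so the following is only a hypothetical toy sketch of trigram-based identification. It builds a trigram count profile per language from a few invented words and scores a text by add-one-smoothed log-probability. The function names and the tiny word lists are assumptions.

```python
import math
from collections import Counter


def char_trigrams(word):
    """Character trigrams of a word padded with one space on each side."""
    padded = f" {word} "
    return [padded[i:i + 3] for i in range(len(padded) - 2)]


def build_profile(words):
    """Trigram counts over a list of words from one language."""
    profile = Counter()
    for word in words:
        profile.update(char_trigrams(word.lower()))
    return profile


def identify(text, profiles):
    """Return the language whose profile gives the text the highest log-probability."""
    vocab_size = len(set().union(*profiles.values()))
    trigrams = [t for word in text.lower().split() for t in char_trigrams(word)]
    scores = {}
    for lang, profile in profiles.items():
        total = sum(profile.values())
        # Add-one smoothing (an assumption) handles trigrams unseen in training.
        scores[lang] = sum(
            math.log((profile[t] + 1) / (total + vocab_size)) for t in trigrams
        )
    return max(scores, key=scores.get)


# Hypothetical training words; real profiles would come from Wikipedia abstracts.
profiles = {
    "en": build_profile(["the", "and", "that", "with", "this", "have"]),
    "de": build_profile(["der", "und", "das", "nicht", "mit", "ich"]),
}
print(char_trigrams("cat"))
print(identify("this and that", profiles))
print(identify("das ist nicht", profiles))
```

With roughly 175 languages and hundreds of thousands of distinct trigrams, storing such profiles naively gets large, which is the kind of size problem the next slide addresses.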
  • 55. WILD Engineering * Initial model size ~1G - script hacks & Huffman coding to the rescue * Model pruning * Proper probability calculations * Efficient testing * Properly saving the model * Library & public API