Approaching (almost) any NLP problem
@abhi1thakur
AI is like an imaginary friend most enterprises claim to have these days
I like big data and I cannot lie
Agenda
➢ Not so much intro
➢ Where is NLP used
➢ Pre-processing
➢ Machine Learning Models
➢ Solving a problem
➢ Traditional approaches
➢ Deep Learning Models
➢ Muppets
Applications of natural language processing
➢ Translation
➢ Sentiment Classification
➢ Chatbots / VAs
➢ Autocomplete
➢ Entity Extraction
➢ Question Answering
➢ Review Rating Prediction
➢ Search Engine
➢ Speech to Text
➢ Topic Extraction
Pre-processing the text data
Raw: can u he.lp me with loan? 😊
(unintentional characters, abbreviations, symbols, emojis)
Cleaned: can you help me with loan?
Pre-processing the text data
➢ Removing weird spaces
➢ Tokenization
➢ Spelling correction
➢ Contraction mapping
➢ Stemming
➢ Emoji handling
➢ Stopwords handling
➢ Cleaning HTML
Pre-processing the text data: removing weird spaces
def remove_space(text):
    # collapse repeated or odd whitespace into single spaces
    text = text.strip()
    text = text.split()
    return " ".join(text)
Pre-processing the text data: tokenization
➢ Very important step
➢ Is not always about spaces
➢ Converts words into tokens
➢ Might be different for different languages
➢ Simplest is to use `word_tokenize` from NLTK
➢ Write your own ;)
Pre-processing the text data: tokenization with NLTK
import nltk
nltk.download('punkt')

from nltk.tokenize import word_tokenize

text = "hello, how are you?"
tokens = word_tokenize(text)
print(tokens)
# ['hello', ',', 'how', 'are', 'you', '?']
Pre-processing the text data: spelling correction
➢ Very very crucial step
➢ In chat: can u tel me abot new sim card pland?
➢ Most models without spelling correction will fail
➢ Peter Norvig’s spelling correction
➢ Make your own ;)
Pre-processing the text data: spelling correction with a character-level model
I need a new car insurance
I need aa new car insurance
I ned a new car insuraance
I needd a new carr insurance
I need a neew car insurance
I need a new car insurancee
(Figure: embeddings layer → stacked bidirectional char-LSTM → corrected output)
Pre-processing the text data: Norvig-style spelling correction
def edits1(word):
    letters = 'abcdefghijklmnopqrstuvwxyz'
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [L + R[1:] for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
    replaces = [L + c + R[1:] for L, R in splits if R for c in letters]
    inserts = [L + c + R for L, R in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

def edits2(word):
    return (e2 for e1 in edits1(word) for e2 in edits1(e1))
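The slides only show the edit generators; a minimal sketch of the rest of a Norvig-style corrector is below, assuming a word-frequency Counter WORDS built from some reference corpus (the corpus and the WORDS object are not part of the deck).

from collections import Counter

WORDS = Counter()  # e.g. word counts from a large reference corpus (not shown in the deck)

def P(word):
    # relative frequency of `word` in the reference corpus
    total = sum(WORDS.values())
    return WORDS[word] / total if total else 0.0

def known(words):
    # the subset of `words` that appears in the dictionary
    return set(w for w in words if w in WORDS)

def candidates(word):
    # prefer the word itself, then known words 1 edit away, then 2 edits away
    return known([word]) or known(edits1(word)) or known(edits2(word)) or [word]

def correction(word):
    # most probable spelling correction for `word`
    return max(candidates(word), key=P)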
Pre-processing the text data: contraction mapping
contraction = {
"'cause": 'because',
',cause': 'because',
';cause': 'because',
"ain't": 'am not',
'ain,t': 'am not',
'ain;t': 'am not',
'ain´t': 'am not',
'ain’t': 'am not',
"aren't": 'are not',
'aren,t': 'are not',
'aren;t': 'are not',
'aren´t': 'are not',
'aren’t': 'are not'
}
Pre-processing the text data: applying the contraction map
def mapping_replacer(x, dic):
    # replace every " word " key found in x with its mapped value
    for word in dic.keys():
        if " " + word + " " in x:
            x = x.replace(" " + word + " ", " " + dic[word] + " ")
    return x
Pre-processing the text data: stemming
➢ Reduces words to root form
➢ Why is stemming important?
➢ NLTK stemmers
Pre-processing the text data: stemming with NLTK
fishing, fished, fishes → all stem to "fish"
In [1]: from nltk.stem import SnowballStemmer
In [2]: s = SnowballStemmer('english')
In [3]: s.stem("fishing")
Out[3]: 'fish'
Pre-processing the text data: emoji handling
pip install emoji

import emoji
emojis = emoji.UNICODE_EMOJI  # dict of emoji characters (renamed to EMOJI_DATA in emoji >= 2.0)
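Depending on the task, emojis can be converted to text or dropped. A minimal sketch with the emoji package is below; the EMOJI_DATA attribute is from recent versions of the library, replacing the older UNICODE_EMOJI dict shown above.

import emoji

text = "can u help me with loan? 😊"

# option 1: replace emojis with their textual aliases
print(emoji.demojize(text))   # e.g. "can u help me with loan? :smiling_face_with_smiling_eyes:"

# option 2: drop emojis entirely (emoji >= 2.0 API)
no_emoji = "".join(ch for ch in text if ch not in emoji.EMOJI_DATA)
print(no_emoji)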
Pre-processing the text data: stopwords handling
Example: "I need new car insurance"
From most to least informative: car insurance, new, need, I
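A minimal sketch of stopword removal with NLTK's English stopword list, applied to the example above. Whether dropping stopwords helps is task-dependent; for duplicate detection they can still carry signal.

import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords

stop_words = set(stopwords.words('english'))

text = "I need new car insurance"
tokens = [w for w in text.lower().split() if w not in stop_words]
print(tokens)   # ['need', 'new', 'car', 'insurance']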
Pre-processing the text data: cleaning HTML
https://www.crummy.com/software/BeautifulSoup/bs4/doc/
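A minimal sketch of stripping HTML with BeautifulSoup, following the documentation linked above (the HTML snippet is made up for illustration).

from bs4 import BeautifulSoup

html = "<div><p>I need a <b>new</b> car insurance</p></div>"
soup = BeautifulSoup(html, "html.parser")
print(soup.get_text(separator=" ", strip=True))   # "I need a new car insurance"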
What kind of models to use?
➢ SVM
➢ Logistic Regression
➢ Gradient Boosting
➢ Neural Networks
Let’s look at a problem
Quora duplicate question identification
➢ ~ 13 million questions
➢ Many duplicate questions
➢ Cluster and join duplicates together
➢ Remove clutter
Non-duplicate questions
➢ Who should I address my cover letter to if I'm applying for a big company like Mozilla?
➢ Which car is better from safety view?""swift or grand i10"".My first priority is safety?
➢ How can I start an online shopping (e-commerce) website?
➢ Which web technology is best suitable for building a big E-Commerce website?
Duplicate questions
➢ How does Quora quickly mark questions as needing improvement?
➢ Why does Quora mark my questions as needing improvement/clarification before I have time to give it details? Literally within seconds…
➢ What practical applications might evolve from the discovery of the Higgs Boson?
➢ What are some practical benefits of discovery of the Higgs Boson?
Dataset
➢ 400,000+ pairs of questions
➢ Initially data was very skewed
➢ Negative sampling
➢ Noise exists (as usual)
Dataset
➢ 255045 negative samples (non-duplicates)
➢ 149306 positive samples (duplicates)
➢ ~37% positive samples
Dataset: basic exploration
➢ Average number of characters in question1: 59.57
➢ Minimum number of characters in question1: 1
➢ Maximum number of characters in question1: 623
➢ Average number of characters in question2: 60.14
➢ Minimum number of characters in question2: 1
➢ Maximum number of characters in question2: 1169
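A minimal sketch of how such statistics can be computed with pandas, assuming the question pairs are loaded in a DataFrame named data with columns question1 and question2.

q1_len = data['question1'].astype(str).str.len()
q2_len = data['question2'].astype(str).str.len()
print(q1_len.mean(), q1_len.min(), q1_len.max())
print(q2_len.mean(), q2_len.min(), q2_len.max())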
Basic feature engineering
➢ Length of question1
➢ Length of question2
➢ Difference in the two lengths
➢ Character length of question1 without spaces
➢ Character length of question2 without spaces
➢ Number of words in question1
➢ Number of words in question2
➢ Number of common words in question1 and question2
Basic feature engineering
data['len_q1'] = data.question1.apply(lambda x: len(str(x)))
data['len_q2'] = data.question2.apply(lambda x: len(str(x)))
data['diff_len'] = data.len_q1 - data.len_q2
# note: as written these count *unique* non-space characters per question
data['len_char_q1'] = data.question1.apply(lambda x: len(''.join(set(str(x).replace(' ', '')))))
data['len_char_q2'] = data.question2.apply(lambda x: len(''.join(set(str(x).replace(' ', '')))))
data['len_word_q1'] = data.question1.apply(lambda x: len(str(x).split()))
data['len_word_q2'] = data.question2.apply(lambda x: len(str(x).split()))
data['len_common_words'] = data.apply(
    lambda x: len(set(str(x['question1']).lower().split())
                  .intersection(set(str(x['question2']).lower().split()))),
    axis=1)
Basic modelling
Tabular data (basic features) split into training and validation sets, normalized
➢ Logistic Regression: 0.658
➢ XGB: 0.721
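A minimal sketch of this baseline, assuming the basic features computed earlier, a target column named is_duplicate, and illustrative (not tuned) model parameters.

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
import xgboost as xgb

feature_cols = ['len_q1', 'len_q2', 'diff_len', 'len_char_q1', 'len_char_q2',
                'len_word_q1', 'len_word_q2', 'len_common_words']
X = data[feature_cols].values
y = data['is_duplicate'].values

X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.2, random_state=42)

# normalization matters for logistic regression, less so for tree ensembles
scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)
X_valid_s = scaler.transform(X_valid)

lr = LogisticRegression()
lr.fit(X_train_s, y_train)
print("LR accuracy:", accuracy_score(y_valid, lr.predict(X_valid_s)))

xgb_clf = xgb.XGBClassifier()
xgb_clf.fit(X_train, y_train)
print("XGB accuracy:", accuracy_score(y_valid, xgb_clf.predict(X_valid)))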
Fuzzy features
➢ Also known as approximate string matching
➢ Number of “primitive” operations required to convert a string to an exact match
➢ Primitive operations:
○ Insertion
○ Deletion
○ Substitution
➢ Typically used for:
○ Spell checking
○ Plagiarism detection
○ DNA sequence matching
○ Spam filtering
Fuzzy features
➢ pip install fuzzywuzzy
➢ Uses Levenshtein distance
➢ QRatio
➢ WRatio
➢ Token set ratio
➢ Token sort ratio
➢ Partial token set ratio
➢ Partial token sort ratio
https://github.com/seatgeek/fuzzywuzzy
Fuzzy features
from fuzzywuzzy import fuzz

data['fuzz_qratio'] = data.apply(
    lambda x: fuzz.QRatio(str(x['question1']), str(x['question2'])), axis=1)
data['fuzz_WRatio'] = data.apply(
    lambda x: fuzz.WRatio(str(x['question1']), str(x['question2'])), axis=1)
data['fuzz_partial_ratio'] = data.apply(
    lambda x: fuzz.partial_ratio(str(x['question1']), str(x['question2'])), axis=1)
data['fuzz_partial_token_set_ratio'] = data.apply(
    lambda x: fuzz.partial_token_set_ratio(str(x['question1']), str(x['question2'])), axis=1)
Fuzzy features
data['fuzz_partial_token_sort_ratio'] = data.apply(
    lambda x: fuzz.partial_token_sort_ratio(str(x['question1']), str(x['question2'])), axis=1)
data['fuzz_token_set_ratio'] = data.apply(
    lambda x: fuzz.token_set_ratio(str(x['question1']), str(x['question2'])), axis=1)
data['fuzz_token_sort_ratio'] = data.apply(
    lambda x: fuzz.token_sort_ratio(str(x['question1']), str(x['question2'])), axis=1)
Improving models
Tabular data (basic + fuzzy features) split into training and validation sets, normalized
➢ Logistic Regression: 0.658 → 0.660
➢ XGB: 0.721 → 0.738
Can we improve it further?
Traditional handling of text data
➢ Hashing of words
➢ Count vectorization
➢ TF-IDF
➢ SVD
TF-IDF
Number of times a term t appears in a document
TF(t) = -------------------------------------------------------
Total number of terms in the document
Total number of documents
IDF(t) = LOG( ------------------------------------------------------- )
Number of documents with term t in it
TF-IDF(t) = TF(t) * IDF(t)
TF-IDF
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer(
    min_df=3,
    max_features=None,
    strip_accents='unicode',
    analyzer='word',
    token_pattern=r'\w{1,}',
    ngram_range=(1, 2),
    use_idf=1,
    smooth_idf=1,
    sublinear_tf=1,
    stop_words='english'
)
SVD
➢ Latent semantic analysis
➢ scikit-learn version of SVD
➢ 120 components
from sklearn import decomposition

svd = decomposition.TruncatedSVD(n_components=120)
xtrain_svd = svd.fit_transform(xtrain)
xtest_svd = svd.transform(xtest)
Simply using TF-IDF: method-1
Separate TF-IDF vectors for Question-1 and Question-2
➢ Logistic Regression: 0.777 (vs. 0.658 basic, 0.660 basic + fuzzy)
➢ XGB: 0.749 (vs. 0.721 basic, 0.738 basic + fuzzy)
Simply using TF-IDF: method-2
A single TF-IDF vectorizer over both questions
➢ Logistic Regression: 0.804 (vs. 0.658 basic, 0.660 basic + fuzzy)
➢ XGB: 0.748 (vs. 0.721 basic, 0.738 basic + fuzzy)
Simply using TF-IDF + SVD: method-1
Separate TF-IDF and separate SVD for each question
➢ Logistic Regression: 0.706 (vs. 0.658 basic, 0.660 basic + fuzzy)
➢ XGB: 0.763 (vs. 0.721 basic, 0.738 basic + fuzzy)
Simply using TF-IDF + SVD: method-2
Separate TF-IDF for each question, one SVD over both
➢ Logistic Regression: 0.700 (vs. 0.658 basic, 0.660 basic + fuzzy)
➢ XGB: 0.753 (vs. 0.721 basic, 0.738 basic + fuzzy)
Simply using TF-IDF + SVD: method-3
One TF-IDF over both questions, followed by SVD
➢ Logistic Regression: 0.714 (vs. 0.658 basic, 0.660 basic + fuzzy)
➢ XGB: 0.759 (vs. 0.721 basic, 0.738 basic + fuzzy)
Word embeddings
➢ A multi-dimensional vector for every word in the vocabulary
➢ Often give great insights
➢ Very popular in natural language processing tasks
➢ Google news vectors 300d
➢ GloVe
➢ FastText
Word embeddings
Every word gets a position in space, and directions are meaningful:
Berlin - Germany + France ~ Paris
Word embeddings
➢ Embeddings for words
➢ Embeddings for whole sentence
Word embeddings
import numpy as np

def sent2vec(s, model, stop_words, tokenizer):
    # average-style sentence embedding: sum word vectors, then L2-normalize
    words = str(s).lower()
    words = tokenizer(words)
    words = [w for w in words if w not in stop_words]
    words = [w for w in words if w.isalpha()]
    M = []
    for w in words:
        M.append(model[w])   # assumes every remaining word is in the embedding model
    M = np.array(M)
    v = M.sum(axis=0)
    return v / np.sqrt((v ** 2).sum())
Word embeddings features
Spatial distances:
➢ Euclidean
➢ Manhattan
➢ Cosine
➢ Canberra
➢ Minkowski
➢ Braycurtis
Statistical features: skew and kurtosis of each sentence vector
➢ Skew = 0 for a normal distribution
➢ Skew > 0: longer tail on the right
➢ Kurtosis: 4th central moment over the square of variance
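A minimal sketch of turning a pair of sent2vec vectors into the spatial-distance and skew/kurtosis features listed above, using scipy (the function and feature names are illustrative, not from the slides).

import numpy as np
from scipy.spatial import distance
from scipy.stats import skew, kurtosis

def embedding_pair_features(v1, v2):
    # v1, v2: sentence vectors from sent2vec for question1 and question2
    return {
        'cosine': distance.cosine(v1, v2),
        'euclidean': distance.euclidean(v1, v2),
        'manhattan': distance.cityblock(v1, v2),
        'canberra': distance.canberra(v1, v2),
        'minkowski': distance.minkowski(v1, v2, 3),
        'braycurtis': distance.braycurtis(v1, v2),
        'skew_q1': skew(v1), 'skew_q2': skew(v2),
        'kurtosis_q1': kurtosis(v1), 'kurtosis_q2': kurtosis(v2),
    }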
Kusner, M., Sun, Y., Kolkin, N. & Weinberger, K.. (2015). From Word Embeddings To Document Distances.
Word mover’s distance: WMD
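A minimal sketch of WMD between two tokenized questions with gensim, assuming the Google News word2vec binary mentioned earlier is available locally (older gensim versions also need the pyemd package for this call).

from gensim.models import KeyedVectors

w2v = KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)

q1 = "how can i start an online shopping website".split()
q2 = "how do i build an e-commerce site".split()
print(w2v.wmdistance(q1, q2))   # smaller distance -> more similar questions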
Results comparison
➢ Basic Features: LR 0.658, XGB 0.721
➢ Basic Features + Fuzzy Features: LR 0.660, XGB 0.738
➢ Basic + Fuzzy + Word2Vec Features: LR 0.676, XGB 0.766
➢ Word2Vec Features: LR X, XGB 0.78
➢ Basic + Fuzzy + Word2Vec Features + Full Word2Vec Vectors: LR X, XGB 0.814
➢ TFIDF + SVD (Best Combination): LR 0.804, XGB 0.763
What can deep learning do?
➢ Natural language processing
➢ Speech processing
➢ Computer vision
➢ And more and more
1-D CNN
➢ One dimensional convolutional layer
➢ Temporal convolution
➢ Simple to implement:
# illustrative loop (ignores padding and boundary handling)
for i in range(sample_length):
    y[i] = 0
    for j in range(kernel_length):
        y[i] += x[i - j] * h[j]
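For intuition, numpy's convolve computes the same sum, with explicit zero padding at the borders that the loop above glosses over.

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])   # input signal
h = np.array([0.5, 0.5])             # kernel
print(np.convolve(x, h))             # 0.5, 1.5, 2.5, 3.5, 2.0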
LSTM
➢ Long short term memory
➢ A type of RNN
➢ Used two LSTM layers
Embedding layers
➢ Simple layer
➢ Converts indexes to vectors
➢ [[4], [20]] -> [[0.25, 0.1], [0.6, -0.2]]
Time distributed dense layer
➢ TimeDistributed wrapper around dense layer
➢ TimeDistributed applies the layer to every temporal slice of input
➢ Followed by Lambda layer
➢ Implements “translation” layer used by Stephen Merity (keras snli model)
from keras.models import Sequential
from keras.layers import Embedding, TimeDistributed, Dense, Lambda
from keras import backend as K

model1 = Sequential()
model1.add(Embedding(len(word_index) + 1,
                     300,
                     weights=[embedding_matrix],
                     input_length=40,
                     trainable=False))
model1.add(TimeDistributed(Dense(300, activation='relu')))
model1.add(Lambda(lambda x: K.sum(x, axis=1), output_shape=(300,)))
Handling text data before training
from keras.preprocessing import text, sequence

tk = text.Tokenizer(nb_words=200000)   # Keras 1.x argument name (num_words in later versions)
max_len = 40

tk.fit_on_texts(list(data.question1.values) + list(data.question2.values.astype(str)))
x1 = tk.texts_to_sequences(data.question1.values)
x1 = sequence.pad_sequences(x1, maxlen=max_len)
x2 = tk.texts_to_sequences(data.question2.values.astype(str))
x2 = sequence.pad_sequences(x2, maxlen=max_len)
word_index = tk.word_index
Handling text data before training
import numpy as np
from tqdm import tqdm

embeddings_index = {}
f = open('glove.840B.300d.txt')
for line in tqdm(f):
    values = line.split()
    word = values[0]
    coefs = np.asarray(values[1:], dtype='float32')
    embeddings_index[word] = coefs
f.close()
Handling text data before training
embedding_matrix = np.zeros((len(word_index) + 1, 300))
for word, i in tqdm(word_index.items()):
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector
Basis of deep learning model
➢ Keras-snli model: https://github.com/Smerity/keras_snli
Creating the deep learning model
Final Deep Learning Model
Model 1 and Model 2
model1 = Sequential()
model1.add(Embedding(len(word_index) + 1,
                     300,
                     weights=[embedding_matrix],
                     input_length=40,
                     trainable=False))
model1.add(TimeDistributed(Dense(300, activation='relu')))
model1.add(Lambda(lambda x: K.sum(x, axis=1), output_shape=(300,)))

model2 = Sequential()
model2.add(Embedding(len(word_index) + 1,
                     300,
                     weights=[embedding_matrix],
                     input_length=40,
                     trainable=False))
model2.add(TimeDistributed(Dense(300, activation='relu')))
model2.add(Lambda(lambda x: K.sum(x, axis=1), output_shape=(300,)))
Final Deep Learning Model
Model 3 and Model 4
model3 = Sequential()
model3.add(Embedding(len(word_index) + 1,
                     300,
                     weights=[embedding_matrix],
                     input_length=40,
                     trainable=False))
# Keras 1.x argument names (filters/kernel_size/padding/strides in later versions)
model3.add(Convolution1D(nb_filter=nb_filter,
                         filter_length=filter_length,
                         border_mode='valid',
                         activation='relu',
                         subsample_length=1))
model3.add(Dropout(0.2))
.
.
.
model3.add(Dense(300))
model3.add(Dropout(0.2))
model3.add(BatchNormalization())
Final Deep Learning Model
Model 5 and Model 6
model5 = Sequential()
model5.add(Embedding(len(word_index) + 1, 300, input_length=40, dropout=0.2))
model5.add(LSTM(300, dropout_W=0.2, dropout_U=0.2))

model6 = Sequential()
model6.add(Embedding(len(word_index) + 1, 300, input_length=40, dropout=0.2))
model6.add(LSTM(300, dropout_W=0.2, dropout_U=0.2))
Final Deep Learning Model
Merged Model
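The merge itself is shown as a diagram in the deck; below is a minimal sketch in the same Keras 1.x style as the other snippets, reusing model1-model6 and the padded sequences x1/x2 from earlier (layer sizes, dropout and training settings are illustrative, not taken from the slides).

from keras.models import Sequential
from keras.layers import Merge, Dense, Dropout, BatchNormalization

merged_model = Sequential()
merged_model.add(Merge([model1, model2, model3, model4, model5, model6], mode='concat'))
merged_model.add(BatchNormalization())
merged_model.add(Dense(300, activation='relu'))
merged_model.add(Dropout(0.2))
merged_model.add(BatchNormalization())
merged_model.add(Dense(1, activation='sigmoid'))

merged_model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
# y holds the duplicate labels; each tower receives the padded sequence it expects
merged_model.fit([x1, x2, x1, x2, x1, x2], y=y, batch_size=384,
                 nb_epoch=10, validation_split=0.1, shuffle=True)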
Time to Train the DeepNet
➢ Total params: 174,913,917
➢ Trainable params: 60,172,917
➢ Non-trainable params: 114,741,000
➢ NVIDIA Titan X
Time to Train the DeepNet
➢ The deep network was trained on an NVIDIA Titan X; each epoch took approximately 300 seconds and the full training took 10-15 hours. The network achieved an accuracy of 0.848 (~0.85).
➢ The SOTA at that time was around 0.88. (Bi-MPM model)
Can we end without talking about the muppets?
Of course!
Just kidding, no!
BERT
➢ Based on transformer encoder
➢ Each encoder block has self-attention
➢ Encoder blocks: 12 or 24
➢ Feed forward hidden units: 768 or 1024
➢ Attention heads: 12 or 16
BERT encoder block
(Figure: an encoder block takes up to 512 input tokens and produces a vector of size 768 or 1024 for each of them)
How does BERT learn?
➢ BERT has a fixed vocab
➢ BERT has encoder blocks (transformer blocks)
➢ A word is masked and BERT tries to predict that word
➢ BERT training also tries to predict next sentence
➢ Combining the losses from the two objectives above, BERT learns
BERT tokenization
➢ [CLS] TOKENS [SEP]
➢ [CLS] TOKENS_A [SEP] TOKENS_B [SEP]
Example of tokenization:
hi, everyone! this is tokenization example
[CLS] hi , everyone ! this is token ##ization example [SEP]
https://github.com/huggingface/tokenizers
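A minimal sketch of the tokenization shown above, here using the Hugging Face transformers BertTokenizer (bert-base-uncased) rather than the lower-level tokenizers library linked on the slide.

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

print(tokenizer.tokenize("hi, everyone! this is tokenization example"))
# ['hi', ',', 'everyone', '!', 'this', 'is', 'token', '##ization', 'example']

# a question pair is encoded as [CLS] TOKENS_A [SEP] TOKENS_B [SEP]
ids = tokenizer.encode("how do i learn nlp?", "how can i learn nlp?")
print(tokenizer.convert_ids_to_tokens(ids))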
Approaching duplicate questions using BERT
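The original slides show this part as screenshots; below is a minimal sketch of the same idea with the Hugging Face transformers v4-style API: the two questions go in as one sequence and a 2-class head is fine-tuned on top (model name, max length and other settings here are illustrative).

import torch
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

enc = tokenizer(
    "How does Quora quickly mark questions as needing improvement?",
    "Why does Quora mark my questions as needing improvement?",
    truncation=True, padding="max_length", max_length=64, return_tensors="pt",
)
labels = torch.tensor([1])   # 1 = duplicate, 0 = not duplicate

outputs = model(**enc, labels=labels)
outputs.loss.backward()   # one fine-tuning step; an optimizer.step() would follow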
There is a lot more….
Maybe next time!
Few things to remember...
Fine-tuning often gives good results
➢ It is faster
➢ It is better (not always)
➢ Why reinvent the wheel?
Fine-tuning often gives good results
Bigger isn’t always better
A good model has some key ingredients...
➢ Sugar: understanding the data, exploring the data
➢ Spice: pre-processing, feature engineering, feature selection
➢ All the things that are nice: a good cross validation, a low error rate, a simple model or a combination of models, post-processing
➢ ... plus Chemical X
= A Good Machine Learning Model
➢ e-mail: abhishek4@gmail.com
➢ Linkedin: linkedin.com/in/abhi1thakur
➢ kaggle: kaggle.com/abhishek
➢ tweet me: @abhi1thakur
➢ YouTube: youtube.com/AbhishekThakurAbhi
Approaching (almost) any
machine learning problem:
the book will release in
Summer 2020.
Fill out the form here to be the
first one to know when it’s
ready to buy:
http://bit.ly/approachingalmost