Interested in learning about Natural Language Processing (NLP)? Are you using NLP for your SEO already and want to step it up a level? Join this session to get a crash course in NLP, from stemming and lemmatization to word embeddings and their applications for SEO. Paul Shapiro will break down NLP to explain how it uses machine learning to decipher and analyze human language in a way that is highly valuable for marketers and SEOs. Paul will also share specific examples in the Python programming language along the way, so you can either start using NLP for SEO right away or find new and more effective ways to use it.
3. @fighto
Assumptions & Prerequisites
• Familiarity with Python
• Familiarity with common data science libraries such as pandas and
NumPy
• Familiarity with Jupyter Notebooks (optional)
• But no prior knowledge of NLP
7. @fighto
“NLP is a way for computers to analyze, understand, and derive
meaning from human language in a smart and useful way. By
utilizing NLP, developers can organize and structure knowledge to
perform tasks such as automatic summarization, translation, named
entity recognition, relationship extraction, sentiment analysis, speech
recognition, and topic segmentation.”
https://blog.algorithmia.com/introduction-natural-language-processing-nlp/
15. @fighto
Text Pre-Processing
• Noise and Junk Removal/Cleanup
• Punctuation and Special Characters
• Stop Words
• Common Abbreviations
• Common Character Cases
• Etc.
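The cleanup steps listed above can be sketched in a few lines of standard-library Python. This is a minimal illustration, not the code used in the talk: the `preprocess` function, the tiny `STOP_WORDS` set, and the `ABBREVIATIONS` map are all hypothetical, and real projects usually rely on much longer stop-word lists from an NLP library.

```python
import re
import string

# Tiny illustrative stop-word list (an assumption; real lists are far longer).
STOP_WORDS = {"a", "an", "the", "is", "of", "and", "to", "in", "it"}

# Hypothetical abbreviation map, for illustration only.
ABBREVIATIONS = {"u.s.": "united states", "govt": "government"}

# Translation table that deletes every ASCII punctuation character.
PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def preprocess(text):
    """Return a list of cleaned, lowercased tokens with stop words removed."""
    # Noise removal: replace HTML-like tags with a space.
    text = re.sub(r"<[^>]+>", " ", text)
    tokens = []
    # Case normalization: lowercase before splitting on whitespace.
    for raw in text.lower().split():
        # Expand known abbreviations before punctuation is stripped.
        raw = ABBREVIATIONS.get(raw, raw)
        # Remove punctuation and special characters.
        cleaned = raw.translate(PUNCT_TABLE)
        tokens.extend(cleaned.split())
    # Drop stop words.
    return [t for t in tokens if t not in STOP_WORDS]


print(preprocess("<p>The U.S. economy is GREAT!</p>"))
```

The order matters here: abbreviations are expanded before punctuation is removed, otherwise "u.s." would collapse to "us" and no longer match the map.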
20. @fighto
Why Normalization, Text Analytics Ex
• Speeds up machine learning analysis
• Disambiguation
• Say there are 500 jokes in our corpus that mention “Donald Trump”
• 25 of those jokes include the word “economy”, 15 include the word
“economic”, and 10 mention “world economies”.
• All of these jokes have to do with both “economics” and “Donald Trump” but
would turn up as 3 distinct co-occurrences.
21. @fighto
Why Stemming and Pitfalls
• More basic method of reducing different forms of the same word to
a common base
• Stemming chops off the end of the word to accomplish this
• Faster method
• Results in terms that are not real words:
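To make the trade-off concrete, here is a deliberately naive suffix-stripping stemmer. It is a toy written for this example, not the Porter algorithm or any library's implementation, and the `SUFFIXES` list is an assumption chosen to fit the "economy" example from the normalization slide. It shows both the benefit (three word forms collapse into one term) and the pitfall (the resulting term is not a real word).

```python
from collections import Counter

# Suffixes to chop, checked in order (a hypothetical, very incomplete list).
SUFFIXES = ("ies", "ic", "es", "y", "s")


def naive_stem(word):
    """Chop the first matching suffix, keeping at least 3 characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


# Joke counts per word form, taken from the normalization example.
raw_counts = {"economy": 25, "economic": 15, "economies": 10}

stemmed_counts = Counter()
for word, count in raw_counts.items():
    stemmed_counts[naive_stem(word)] += count

print(stemmed_counts)
```

All three forms end up under the single stem `econom`, so the 50 jokes are counted together, at the cost of a term that no dictionary contains.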
23. @fighto
Why Lemmatization and Pitfalls
• More sophisticated method of reducing different forms of the same
word to a common base
• Lemmatization leverages vocabulary and grammar to infer the root
of a word
• Requires part-of-speech tagging
• Slower but more accurate method
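The role of part-of-speech tagging can be illustrated with a toy lookup-based lemmatizer. This is only a sketch: the `LEMMA_TABLE` vocabulary and the `lemmatize` function are hypothetical, and the tags are supplied by hand here, whereas a real lemmatizer draws on a large lexicon and a trained tagger.

```python
# Hypothetical mini-vocabulary keyed by (word, part-of-speech tag).
LEMMA_TABLE = {
    ("economies", "NOUN"): "economy",
    ("better", "ADJ"): "good",
    ("meeting", "VERB"): "meet",
    ("meeting", "NOUN"): "meeting",
    ("was", "VERB"): "be",
}


def lemmatize(word, pos):
    """Look up the lemma for a (word, tag) pair; fall back to the lowercased word."""
    word = word.lower()
    return LEMMA_TABLE.get((word, pos), word)


# The same surface form resolves differently depending on its tag.
print(lemmatize("meeting", "VERB"))
print(lemmatize("meeting", "NOUN"))
print(lemmatize("Economies", "NOUN"))
```

Unlike stemming, every result is a real word, but the answer for "meeting" is only right when the tag is right, which is why the method is slower and depends on tagging quality.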
25. @fighto
Information Extraction & Grouping
• Getting more context
• N-Grams
• Parts of Speech Tagging
• Chunking/Chinking
• Named Entity Recognition
• Word Embeddings
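N-grams, the first technique in the list above, are simple enough to build by hand. The sketch below assumes the text has already been tokenized; the `ngrams` helper is written for this example.

```python
def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = ["donald", "trump", "world", "economies"]
print(ngrams(tokens, 2))
print(ngrams(tokens, 3))
```

Bigrams and trigrams keep some word order, so "donald trump" survives as one unit instead of two unrelated tokens.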
32. @fighto
Statistical Feature Creation
• Leverage personal heuristics to create customized
numeric representations that you think could be
used by a machine learning model to make
predictions
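As a sketch of what such hand-made features might look like, the function below turns a piece of text into a few numbers. The specific features (length, word count, punctuation, all-caps words) are assumptions chosen for illustration, not the features used in the talk.

```python
import string


def handcrafted_features(text):
    """Compute a few simple numeric features from raw text."""
    words = text.split()
    n_words = len(words)
    return {
        "char_count": len(text),
        "word_count": n_words,
        # Guard against empty input to avoid dividing by zero.
        "avg_word_length": sum(len(w) for w in words) / n_words if n_words else 0.0,
        "punctuation_count": sum(ch in string.punctuation for ch in text),
        "uppercase_word_count": sum(w.isupper() for w in words),
    }


features = handcrafted_features("Why did the SEO cross the road?")
print(features)
```

Each dictionary can become one row of a feature table, with one column per heuristic.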
36. @fighto
Feature Normalization
• Box-Cox Power Transformations
• “A Box Cox transformation is a way to transform non-
normal dependent variables into a normal
shape. Normality is an important assumption for many
statistical techniques; if your data isn’t normal, applying
a Box-Cox means that you are able to run a broader
number of tests.”
https://www.statisticshowto.datasciencecentral.com/box-cox-transformation/
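For a fixed exponent, the Box-Cox transformation is a one-line formula: (y^λ - 1) / λ when λ is not zero, and ln(y) when λ is zero. The sketch below applies it with a λ chosen by hand, which is an assumption made to keep the example short; in practice λ is usually estimated from the data (for example by `scipy.stats.boxcox`).

```python
import math


def box_cox(values, lam):
    """Apply the Box-Cox power transformation for a given lambda."""
    # The transformation is only defined for strictly positive inputs.
    if any(v <= 0 for v in values):
        raise ValueError("Box-Cox requires strictly positive values")
    if lam == 0:
        return [math.log(v) for v in values]
    return [(v ** lam - 1) / lam for v in values]


# A right-skewed feature, e.g. hypothetical word counts per joke.
skewed = [1, 4, 9]
print(box_cox(skewed, 0.5))
print(box_cox(skewed, 0))
```

With λ = 0.5 the gaps between large values shrink relative to the gaps between small ones, which is how the transformation pulls in a long right tail.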
45. @fighto
Let’s Talk About TF-IDF for a Moment
• Count Vectorizer looked at how many times a term or n-gram
appeared in a joke and represented it as a non-negative integer
• TF-IDF creates a score that considers how many times a term
appears in a joke as well as how common it is across the entire
corpus of jokes.
• Rarer words are deemed to be more important because they can be used
to distinguish one joke from another.
• Higher TF-IDF value = rarer across the corpus (more distinctive)
• Lower TF-IDF value = more common across the corpus
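The scoring idea can be worked through by hand on a tiny corpus. The sketch below uses the plain textbook variant (term frequency times the log of inverse document frequency); the `tf_idf` function and the three mini "jokes" are made up for illustration, and library implementations such as scikit-learn's `TfidfVectorizer` add smoothing and normalization, so their numbers will differ.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: score} dict per tokenized document."""
    n_docs = len(docs)
    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))

    scores = []
    for tokens in docs:
        counts = Counter(tokens)
        total = len(tokens)
        scores.append({
            # tf = share of this document; idf = log(N / document frequency).
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores


jokes = [
    ["trump", "economy", "joke"],
    ["trump", "golf", "joke"],
    ["trump", "tweet"],
]
scores = tf_idf(jokes)
print(scores[0])
```

"trump" appears in every joke, so its score is zero: it cannot help tell one joke from another. "economy" appears in only one joke and gets the highest score in that joke.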
48. @fighto
Random Forest
[Figure: two example decision trees for the question “Will [Sports Team] win?”, each splitting on “Players' statistics are favorable?” and “Is the team they're playing historically better?”, with Yes/No branches.]
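A random forest combines the answers of many decision trees. The sketch below shows only that final voting step, using three hand-written trees over the two questions from the slide. Everything here is an assumption for illustration: the feature names, the branch layout of each tree, and the trees themselves are hypothetical, and in a real random forest each tree is learned from a random sample of the training data and features rather than written by hand.

```python
from collections import Counter


def tree_a(game):
    """Favorable stats and an opponent that is not historically better -> win."""
    if game["stats_favorable"]:
        return not game["opponent_historically_better"]
    return False


def tree_b(game):
    """A one-question tree that only looks at player statistics."""
    return game["stats_favorable"]


def tree_c(game):
    """A one-question tree that only looks at the opponent's history."""
    return not game["opponent_historically_better"]


def forest_predict(trees, game):
    """Return the majority vote of all trees (True means 'will win')."""
    votes = Counter(tree(game) for tree in trees)
    return votes.most_common(1)[0][0]


trees = [tree_a, tree_b, tree_c]
tough_game = {"stats_favorable": True, "opponent_historically_better": True}
easy_game = {"stats_favorable": True, "opponent_historically_better": False}
print(forest_predict(trees, tough_game))
print(forest_predict(trees, easy_game))
```

Using an odd number of trees avoids ties in a yes/no vote. Because each tree sees the problem differently, the combined vote tends to be less sensitive to the quirks of any single tree.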
52. @fighto
Having Done This Better
• Reduce overfitting
• Standardize features (mixing sparse and non-sparse data)
• Word embeddings for more context
• More sophisticated models
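Standardizing features, one of the improvements listed above, usually means rescaling each numeric column to zero mean and unit variance so that large-valued features do not dominate small-valued ones. The sketch below is a minimal z-score implementation on a plain list; the `standardize` function and the sample values are hypothetical.

```python
import statistics


def standardize(values):
    """Rescale values to zero mean and unit (population) standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    # A constant column carries no information; map it to zeros.
    if std == 0:
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]


word_counts = [2, 4, 6]
z_scores = standardize(word_counts)
print(z_scores)
```

Subtracting the mean turns zeros into non-zero values, so for sparse count features it is common to scale without centering in order to keep the data sparse.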
53. @fighto
More Applications for SEO
• Creating performant content (joke example extrapolated)
• Predicting natural link earning potential
• Natural language generation, writing bits of content
• Semantic content optimization
• Site architecture design and taxonomy
• User flow creation
• Keyword research
• Etc.
54. @fighto
How to Learn More, Resources
• https://web.stanford.edu/~jurafsky/slp3/
• https://www.kaggle.com/learn/overview
• https://towardsdatascience.com
• https://github.com/keon/awesome-nlp