SlideShare a Scribd company logo
1 of 55
INTRO TO TEXT MINING
USING TM, OPENNLP , &
TOPICMODELS
Ted Kwartler
O P E N
D A T A
S C I E N C E
C O N F E R E N C E_
BOSTON 2015
@opendatasci
Intro to text mining using tm, openNLP , & topicmodels
www.linkedin.com/in/edwardkwartler
@tkwartler
Agenda
What is
Text
Mining?
Keyword Scanning,
Cleaning
Freq Counts,
Dendograms
Wordclouds
Topic Modeling
openNLP
What is Text Mining?
What is
Text
Mining?
Keyword Scanning,
Cleaning
Freq Counts,
Dendograms
Wordclouds
Topic Modeling
openNLP
What is text mining?
What is Text
Mining?
•Extract new insights from text
•Let’s you drink from a fire hose
of information
•Language is hard; many
unsolved problems
•Unstructured
• Expression is individualistic
•Multi-language/cultural
implications
Before Text Mining
What is text mining?
What is Text
Mining?
After Text Mining
•Extract new insights from text
•Let’s you drink from a fire hose
of information
•Language is hard; many
unsolved problems
•Unstructured
• Expression is individualistic
•Multi-language/cultural
implications
Natural Language Text
Sources
Unorganized State
Insight or
Recommendation
Organized & Distilled
Corpora > Cleaned Corpus > Analysis
What is Text
Mining?
Text Mining Workflow
emails
articles
blogs
reviews
tweets
surveys
Bag of Words
What is Text
Mining?
Text Mining Approaches
“Lebron James hit a tough shot.”
tough
a
Semantic Using Syntactic Parsing
Sentence
Noun Phrase Verb Phrase
Lebron James
verb
a
article
Noun phrase
adjective
shothit tough
noun
What is Text
Mining?
Text Mining Approaches
Some Challenges in Text Mining
•Compound words (tokenization)
changes meaning
• “not bad” versus “bad”
•Disambiguation
• Sarcasm
• “I like it…NOT!”
• Cultural differences
• “It’s wicked good” (in Boston)
•I cooked waterfowl to eat.
•I cooked waterfowl belonging to her.
•I created the (clay?) duck and gave it to
her.
•Duck!!
“I made her duck.”
What is Text
Mining?
Text Sources
Text can be captured within the enterprise and elsewhere
•Books
•Electronic Docs (PDFs)
•Blogs
•Websites
•Social Media
•Customer Records
•Customer Service Notes
•Notes
•Emails
•…
The source and context of the medium
is important. It will have a lot of impact
on difficulty and data integrity.
What is Text
Mining?
Enough of me talking…let’s do it for real!
Scripts in this workshop follow a simple workflow
Set the Working Directory
Load Libraries
Make Custom Functions &
Specify Options
Read in Data & Pre-Process
Perform Analysis & Save
What is Text
Mining?
Enough of me talking…let’s do it for real!
Setup
Install R/R Studio
• http://cran.us.r-project.org/
• http://www.rstudio.com/products/rstudio/download/
Download scripts and data, save to a
desktop folder & unzip
https://goo.gl/CYfe33
Install Packages
• Run “1_Install_Packages.R” script
• An error may occur if the Java version doesn’t match
I have tested these scripts on R 3.1.2 with a Mac.
Some steps may be different in other environments.
Warning: Twitter Profanity
• Twitter demographics skew young and as a result
have profanity that appear in the examples.
• It’s the easiest place to get a lot of messy text
fast, if it is offensive feel free to talk to me and I
will work to get you other texts for use on your
own. No offense is intended.
#%@*!!!
Keyword Scanning, Cleaning & Freq Counts
What is
Text
Mining?
Keyword Scanning,
Cleaning
Freq Counts,
Dendograms
Wordclouds
Topic Modeling
openNLP
Keyword
scanning,
Cleaning &
Freq Counts
Open the “coffee.csv” to get familiar with the data structure
Setup1000 tweets mentioning “coffee”
“text$text”
is the vector of tweets that we are interested in.
All other attributes are auto
generated from the twitter API
Keyword
scanning,
Cleaning &
Freq Counts
2_Keyword_Scanning.R
Setup
grepl(“pattern”, searchable object, ignore.case=TRUE)
grepl returns a vector of T/F if the pattern is present at least once
grep(“pattern”, searchable object, ignore.case=TRUE)
grep returns the position of the pattern in the document
Basic R Unix Commands
“library(stringi)” Functions
stri_count(searchable object, fixed=“pattern”)
stri_count counts the number of patterns in a document
Keyword
scanning,
Cleaning &
Freq Counts
2_Keyword_Scanning.R
Setup
Let’s try to scan for other words in our other corpora.
1. Change the read.csv file to another corpus.
2. Change the search patterns to understand the result.
Examples
Keyword
scanning,
Cleaning &
Freq Counts
Remember this?
Setup
Natural Language Text
Sources
Unorganized State
Insight or
Recommendation
Organized & Distilled
Corpora > Cleaned Corpus > Analysis
emails
articles
blogs
reviews
tweets
surveys
Keyword
scanning,
Cleaning &
Freq Counts
R for our Cleaning Steps
Setup
emails
articles
blogs
reviews
tweets
Tomorrow I’m going to have a nice
glass of Chardonnay and wind down
with a good book in the corner of the
county :-)
1.Remove Punctuation
2.Remove extra white space
3.Remove Numbers
4.Make Lower Case
5.Remove “stop” words
tomorrow going nice glass
chardonnay wind down good book
corner county
Keyword
scanning,
Cleaning &
Freq Counts
3_Cleaning and Frequency Count.R
Setup
VCorpus(source)
VCorpus creates a corpus held in memory.
tm_map(corpus, function)
tm_map applies the transformations for the cleaning
“library(tm)” Functions
New Text Mining Concepts
Corpus- A collection of documents that analysis will be based on.
Stopwords – are common words that provide very little insight, often articles like “a”, “the”.
Customizing them is sometimes key in order to extract valuable insights.
tm_map(corpus, removePunctuation) - removes the punctuation from the documents
tm_map(corpus, stripWhitespace) - extra spaces, tabs are removed
tm_map(corpus, removeNumbers) - removes numbers
tm_map(corpus, tolower) – makes all case lower
tm_map(corpus, removeWords) - removes specific “stopwords”
getTransformations() will list all standard tm corpus transformations
We can apply standard R ones too. Sometimes it makes sense to perform all of these or a subset or
even other transformations not listed like “stemming”
Keyword
scanning,
Cleaning &
Freq Counts
3_Cleaning and Frequency Count.R
Setup
“custom.stopwords” combines vectors of words to remove from the corpus
“tryTolower”is poached to account for errors when making lowercase.
“clean.corpus” makes applying all transformations easier.
“custom.reader” keeps the meta data (tweet ID) with the original document
Keyword
scanning,
Cleaning &
Freq Counts
3_Cleaning and Frequency Count.R
Setup
Bag of Words means creating a Term Document Matrix or Document Term Matrix*
*Depends on analysis, both are transpositions of the other
Tweet1 Tweet
2
Tweet3 Tweet4 … Tweet_n
Term1 0 0 0 0 0 0
Term2 1 1 0 0 0 0
Term3 1 0 0 2 0 0
… 0 0 3 0 1 1
Term_n 0 0 0 1 1 0
Term1 Term2 Term3 … Term_n
Tweet1 0 1 1 0 0
Tweet2 0 1 0 0 0
Tweet3 0 0 0 3 0
… 0 0 0 1 1
Tweet_n 0 0 0 1 0
Document Term MatrixTerm Document Matrix
“as.matrix” makes the tm’s version of a matrix into a simpler version
These matrices are often very sparse and large therefore some special steps
may be needed and will be covered in subsequent scripts.
Dendograms & Wordclouds
What is
Text
Mining?
Keyword Scanning,
Cleaning
Freq Counts,
Dendograms
Wordclouds
Topic Modeling
openNLP
Freq Counts,
Dendograms
Wordclouds
4_Dendogram.R script builds on the matrices
SetupFirst let’s explore simple frequencies
•Adjust v>=30 to limit the number of words in the “d” dataframe. This will make the
visual look better.
•You can also write.csv(d,”frequent_terms.csv”) if you want to export the data frame.
Freq Counts,
Dendograms
Wordclouds
4_Dendogram.R script
Setup
“zombie” is an unexpected, highly
frequent term. Let’s explore it.
Freq Counts,
Dendograms
Wordclouds
4_Dendogram.R script
Setup
Next let’s explore word associations, similar to correlation
•Adjust 0.30 to get the terms that are associated .30 or more with the ‘zombie’ term.
•Make sure the 0.30s are the same for both in order to make the dataframe which is
plotted.
•Treating the terms as factors lets ggplot2 sort them for a cleaner look.
Freq Counts,
Dendograms
Wordclouds
4_Dendogram.R script
Setup
“geek” has the highest word
association with “zombie”.
Unapologetic thinker. Lifelong beer scholar.
Incurable zombie fanatic. Explorer. Tvaholic.
Alcohol geek. Avid tv buff. Friendly beer
aficionado. Coffee guru. Zombie junkie.
Freq Counts,
Dendograms
Wordclouds
Extracting Meaning using Dendograms
SetupDendograms are hierarchical clusters based on frequencies.
• Reduces information much like average is a reduction of many observations’
values
• Word clusters emerge often showing related terms
• Term frequency is used to construct the word cluster. Put another way, term A
and term B have similar frequencies in the matrix so they are considered a
cluster.
City Annual
Rainfall
Portland 43.5
Boston 43.8
New
Orleans
62.7
Boston
Portland
NewOrleans
44
63
Boston & Portland are a cluster at hieght 44.
You lose some of the exact rainfall amount
in order to cluster them.
Freq Counts,
Dendograms
Wordclouds
4_Dendogram.R script
Setup
Weird associations! Maybe a dendogram will help us more
New Text Mining Concept
Sparse- Term Document Matrices are often extremely sparse. This means that any document
(column) has mostly zero’s. Reducing the dimensions of these matrices is possible by specifying a
sparse cutoff parameter. Higher sparse parameter will bring in more terms.
Adjust the sparse parameter to have ~40 to 50 terms in order to have a visually
appealing dendogram.
Freq Counts,
Dendograms
Wordclouds
4_Dendogram.R script
Setup
Base Plot of a Dendogram
• Less visually appealing
• Clusters can be hard to
read given the different
heights
Freq Counts,
Dendograms
Wordclouds
4_Dendogram.R script
• Aesthetically my
choice is to have
colored clusters
and all terms at
the bottom.
Freq Counts,
Dendograms
Wordclouds
4_Dendogram.R script
Keyword
scanning,
Cleaning &
Freq Counts
4_Dendogram.R
Setup
Let’s review other corpora.
1. Change the read.csv file to another corpus.
2. Construct the ggplot frequency visual.
3. Construct the ggplot word association for a term you see in the frequency visual.
4. Create a dendogram with ~40 to 50 terms to see how it clusters.
Freq Counts,
Dendograms
Wordclouds
5_Simple_Wordcloud.R script
New Text Mining Concept
Tokenization- So far we have created single word n-grams. We can create multi word “tokens” like
bigrams, or trigrams with this line function. It is applied when making the term document matrix.
Freq Counts,
Dendograms
Wordclouds
5_Simple_Wordcloud.R script
To make a wordcloud we follow the previous steps and create a data frame with the
word and the frequency.
Freq Counts,
Dendograms
Wordclouds
5_Simple_Wordcloud.R script
Next we need to select the colors for the wordcloud.
Freq Counts,
Dendograms
Wordclouds
5_Simple_Wordcloud.R script
•Bigram Tokenization has
captured “marvin gaye”, “bottle
(of) chardonnay”, “chocolate
milkshake” etc.
•A word cloud is a frequency
visualization. The larger the
term (or bigram here) the more
frequent the term.
•You may get warnings if certain
tokens are to large to be plotted
in the graphics device.
Freq Counts,
Dendograms
Wordclouds
Types of Wordclouds
wordcloud( )
Single Corpus Multiple Corpora
commonality.cloud()
comparison.cloud( ) comparison.cloud( )
*this wc has 3 corpora
although the Venn
diagram only shows 2 as
an example.
Freq Counts,
Dendograms
Wordclouds
6_Other_Wordcloud.R
Bring in more than one corpora.
Apply each transformation separately…more in script.
Be sure to label the columns in the same order as was brought in.
Freq Counts,
Dendograms
Wordclouds
Commonality Cloud
•The tweets mentioning
“chardonnay” “beer”, and
“coffee” have these words in
common.
•Again size is related to
frequency.
•Not helpful in this example as
we already knew these were
“drinks” but in diverse corpora it
may be more helpful e.g.
political speeches.
Freq Counts,
Dendograms
Wordclouds
Commonality Cloud
•The tweets mentioning
“chardonnay” “beer”, and
“coffee” have these dissimilar
words.
•Again size is related to
frequency.
•Beer drinkers in this snapshot
are passionate (fanatics, geeks,
specialists) on various subjects
while Chardonnay drinkers
mention Marvin Gaye. Coffee
mentions mention morning cup
and work.
Freq Counts,
Dendograms
Wordclouds
Word Cloud Practice
•Try your hand at using tequila in
the word clouds
•Remember to adjust your stop
words!
Agenda
What is
Text
Mining?
Keyword Scanning,
Cleaning
Freq Counts,
Dendograms
Wordclouds
Topic Modeling
openNLP
Topic
Modeling
openNLP
7_Topic_Modeling_Sentiment.R
Setup
TREEMAP: multi dimensional representation of the corpus attributes.
• Color will represent quick, simple
sentiment
• Each tweet is a small square
• The area of the tweet’s square is
related to the number of terms in
the tweet, document length
• The larger grouping is based on
LDA Topic Modeling
End result is understanding
broad topics, their sentiment
and amount of the corpus
documents devoted to the
identified topic.
First, simple Sentiment Polarity
Zipf’s Law
0
100,000
200,000
300,000
400,000
500,000
600,000
700,000
800,000
900,000
Top 100 Word Usage from 3M Tweets
Many words in natural language but
there is steep decline in everyday
usage. Follows a predictable pattern.
Scoring
•I loathe BestBuy Service -1
•I love BestBuy Service. They
are the best. +2
•I like shopping at BestBuy but
hate traffic. 0
Surprise is a sentiment.
Hit by a bus! – Negative Polarity
Won the lottery!- Positive Polarity
R script scans for positive words, and negative
words as defined by MQPA Academic Lexicon
research. It adds positive words and subtracts
negative ones. The final score represents the
polarity of the social interaction.
In reality sentiment is more complex.
Plutchik’s WheelCountless Emoji
Kanjoya’s Experience Corpus
Second, LDA (Latent Dirichlet
Allocation) Topic Modeling
ExampleLDA
• Probability is assigned to each
document (tweet) for the specific
observed topics.
• A document can have varying
probabilities of topics simultaneously.
Each document is made up of mini
topics.
Technical Explanations:
http://en.wikipedia.org/wiki/Latent_Dirichlet_allocati
on
http://cs.brown.edu/courses/csci2950-
p/spring2010/lectures/2010-03-03_santhanam.pdf
*Full disclosure I am still learning this methodology
1.I like to watch basketball and football on TV.
2.I watched basketball and Shark Tank yesterday.
3.Open source analytics software is the best.
4.R is an open source software for analysis.
5.I use R for basketball analytics.
Topic A: 30% basketball, 20% football, 15%
watched, 10% TV…something to do with
watching sports
Topic B: 40% software, 10% open, 10%
source…10% analytics something to do with open
source analytics software
Corpus
LDA Topics
Documents
1.100% Topic A
2.100% Topic A
3.100% Topic B
4.100% Topic B
5.60% Topic B, 40% Topic A
Topic
Modeling
openNLP
7_Topic_Modeling_Sentiment.R
Setup
Open the script and let’s walk through it line by line because there are multiple
additions to the previous scripts
Can you try to make the same visual with the coffee corpus?
Topic
Modeling
openNLP
7_Topic_Modeling_Sentiment.R
Setup
Coffee Treemap
Topic
Modeling
openNLP
7_Topic_Modeling_Sentiment.R
Setup
Insurance Forum Treemap shows a wider topic sentiment and length range.
Topic
Modeling
openNLP
8_Open_Language_Processing.R
Setup
library(openNLP)
• https://rpubs.com/lmullen/nlp-chapter
• Uses Annotators to identify the specific item in the corpus. Then holds all
annotations in a plain text document. Think of auto-tagging words in a
document and saving the document terms paired with the tag.
• Documentation, examples are hard to come by!
• Grammatical or POS (Part of Speech) Tagging
• Sentence Tagging
• Word Tagging
• Named Entity Recognition
• Persons
• Locations
• Organizations
Annotations
Topic
Modeling
openNLP
8_Open_Language_Processing.R
Setup
The annotated plain text object
is large with a complex
structure. This function allows
us to extract the named
entities.
Topic
Modeling
openNLP
8_Open_Language_Processing.R
Setup
Questions?
@tkwartler
www.linkedin.com/in/edwardkwartler

More Related Content

Viewers also liked

Text Mining with R -- an Analysis of Twitter Data
Text Mining with R -- an Analysis of Twitter DataText Mining with R -- an Analysis of Twitter Data
Text Mining with R -- an Analysis of Twitter DataYanchang Zhao
 
Lecture 2: Computational Semantics
Lecture 2: Computational SemanticsLecture 2: Computational Semantics
Lecture 2: Computational SemanticsMarina Santini
 
Predictive Text Analytics
Predictive Text AnalyticsPredictive Text Analytics
Predictive Text AnalyticsSeth Grimes
 
Controlled Vocabularies and Text Mining - Use Cases at the Goettingen
Controlled Vocabularies and Text Mining - Use Cases at the Goettingen Controlled Vocabularies and Text Mining - Use Cases at the Goettingen
Controlled Vocabularies and Text Mining - Use Cases at the Goettingen Ralf Stockmann
 
Enabling Exploration Through Text Analytics
Enabling Exploration Through Text AnalyticsEnabling Exploration Through Text Analytics
Enabling Exploration Through Text AnalyticsDaniel Tunkelang
 
Social media Listening and Analytics: A brief Overview
Social media Listening and Analytics: A brief OverviewSocial media Listening and Analytics: A brief Overview
Social media Listening and Analytics: A brief OverviewSherin Daniel
 
Social Listening for the Travel & Hospitality Industry
Social Listening for the Travel & Hospitality IndustrySocial Listening for the Travel & Hospitality Industry
Social Listening for the Travel & Hospitality IndustryBrandwatch
 
Text Mining of Movie Reviews
Text Mining of Movie ReviewsText Mining of Movie Reviews
Text Mining of Movie ReviewsMaruthi Nataraj K
 
Gartner webinar social media analytics 23.10.2014
Gartner webinar social media analytics 23.10.2014Gartner webinar social media analytics 23.10.2014
Gartner webinar social media analytics 23.10.2014Irene Ventayol
 
Text Mining and Visualization
Text Mining and VisualizationText Mining and Visualization
Text Mining and VisualizationSeth Grimes
 
Social listening-insights-emetrics-presentation
Social listening-insights-emetrics-presentationSocial listening-insights-emetrics-presentation
Social listening-insights-emetrics-presentationPerformics
 
DMTM 2015 - 18 Text Mining Part 2
DMTM 2015 - 18 Text Mining Part 2DMTM 2015 - 18 Text Mining Part 2
DMTM 2015 - 18 Text Mining Part 2Pier Luca Lanzi
 
Text classification & sentiment analysis
Text classification & sentiment analysisText classification & sentiment analysis
Text classification & sentiment analysisM. Atif Qureshi
 
Business case for Big Data Analytics
Business case for Big Data AnalyticsBusiness case for Big Data Analytics
Business case for Big Data AnalyticsVijay Rao
 
Text Analytics Overview, 2011
Text Analytics Overview, 2011Text Analytics Overview, 2011
Text Analytics Overview, 2011Seth Grimes
 

Viewers also liked (19)

Text Mining with R -- an Analysis of Twitter Data
Text Mining with R -- an Analysis of Twitter DataText Mining with R -- an Analysis of Twitter Data
Text Mining with R -- an Analysis of Twitter Data
 
Lecture 2: Computational Semantics
Lecture 2: Computational SemanticsLecture 2: Computational Semantics
Lecture 2: Computational Semantics
 
Predictive Text Analytics
Predictive Text AnalyticsPredictive Text Analytics
Predictive Text Analytics
 
Controlled Vocabularies and Text Mining - Use Cases at the Goettingen
Controlled Vocabularies and Text Mining - Use Cases at the Goettingen Controlled Vocabularies and Text Mining - Use Cases at the Goettingen
Controlled Vocabularies and Text Mining - Use Cases at the Goettingen
 
Enabling Exploration Through Text Analytics
Enabling Exploration Through Text AnalyticsEnabling Exploration Through Text Analytics
Enabling Exploration Through Text Analytics
 
Social media Listening and Analytics: A brief Overview
Social media Listening and Analytics: A brief OverviewSocial media Listening and Analytics: A brief Overview
Social media Listening and Analytics: A brief Overview
 
Social Listening for the Travel & Hospitality Industry
Social Listening for the Travel & Hospitality IndustrySocial Listening for the Travel & Hospitality Industry
Social Listening for the Travel & Hospitality Industry
 
Selling Text Analytics to your boss
Selling Text Analytics to your bossSelling Text Analytics to your boss
Selling Text Analytics to your boss
 
Text mining
Text miningText mining
Text mining
 
Text Mining of Movie Reviews
Text Mining of Movie ReviewsText Mining of Movie Reviews
Text Mining of Movie Reviews
 
Gartner webinar social media analytics 23.10.2014
Gartner webinar social media analytics 23.10.2014Gartner webinar social media analytics 23.10.2014
Gartner webinar social media analytics 23.10.2014
 
Text Mining Analytics 101
Text Mining Analytics 101Text Mining Analytics 101
Text Mining Analytics 101
 
Text Mining and Visualization
Text Mining and VisualizationText Mining and Visualization
Text Mining and Visualization
 
Log Data Mining
Log Data MiningLog Data Mining
Log Data Mining
 
Social listening-insights-emetrics-presentation
Social listening-insights-emetrics-presentationSocial listening-insights-emetrics-presentation
Social listening-insights-emetrics-presentation
 
DMTM 2015 - 18 Text Mining Part 2
DMTM 2015 - 18 Text Mining Part 2DMTM 2015 - 18 Text Mining Part 2
DMTM 2015 - 18 Text Mining Part 2
 
Text classification & sentiment analysis
Text classification & sentiment analysisText classification & sentiment analysis
Text classification & sentiment analysis
 
Business case for Big Data Analytics
Business case for Big Data AnalyticsBusiness case for Big Data Analytics
Business case for Big Data Analytics
 
Text Analytics Overview, 2011
Text Analytics Overview, 2011Text Analytics Overview, 2011
Text Analytics Overview, 2011
 

More from odsc

Understanding the Chief Data Officer
Understanding the Chief Data Officer Understanding the Chief Data Officer
Understanding the Chief Data Officer odsc
 
Machine-In-The-Loop for Knowledge Discovery
Machine-In-The-Loop for Knowledge DiscoveryMachine-In-The-Loop for Knowledge Discovery
Machine-In-The-Loop for Knowledge Discoveryodsc
 
API Driven Development
API Driven Development API Driven Development
API Driven Development odsc
 
Mobile technology Usage by Humanitarian Programs: A Metadata Analysis
Mobile technology Usage by Humanitarian Programs: A Metadata AnalysisMobile technology Usage by Humanitarian Programs: A Metadata Analysis
Mobile technology Usage by Humanitarian Programs: A Metadata Analysisodsc
 
Productionizing Deep Learning From the Ground Up
Productionizing Deep Learning From the Ground UpProductionizing Deep Learning From the Ground Up
Productionizing Deep Learning From the Ground Upodsc
 
Big Data Infrastructure: Introduction to Hadoop with MapReduce, Pig, and Hive
Big Data Infrastructure: Introduction to Hadoop with MapReduce, Pig, and HiveBig Data Infrastructure: Introduction to Hadoop with MapReduce, Pig, and Hive
Big Data Infrastructure: Introduction to Hadoop with MapReduce, Pig, and Hiveodsc
 
Think Breadth, Not Depth
Think Breadth, Not DepthThink Breadth, Not Depth
Think Breadth, Not Depthodsc
 
Data Science at Dow Jones: Monetizing Data, News and Information
Data Science at Dow Jones: Monetizing Data, News and InformationData Science at Dow Jones: Monetizing Data, News and Information
Data Science at Dow Jones: Monetizing Data, News and Informationodsc
 
Spark, Python and Parquet
Spark, Python and Parquet Spark, Python and Parquet
Spark, Python and Parquet odsc
 
Building a Predictive Analytics Solution with Azure ML
Building a Predictive Analytics Solution with Azure MLBuilding a Predictive Analytics Solution with Azure ML
Building a Predictive Analytics Solution with Azure MLodsc
 
Beyond Names
Beyond NamesBeyond Names
Beyond Namesodsc
 
How Woman are Conquering the S&P 500
How Woman are Conquering the S&P 500How Woman are Conquering the S&P 500
How Woman are Conquering the S&P 500odsc
 
Domain Expertise and Unstructured Data
Domain Expertise and Unstructured DataDomain Expertise and Unstructured Data
Domain Expertise and Unstructured Dataodsc
 
Kaggle The Home of Data Science
Kaggle The Home of Data ScienceKaggle The Home of Data Science
Kaggle The Home of Data Scienceodsc
 
Open Source Tools & Data Science Competitions
Open Source Tools & Data Science Competitions Open Source Tools & Data Science Competitions
Open Source Tools & Data Science Competitions odsc
 
Machine Learning with scikit-learn
Machine Learning with scikit-learnMachine Learning with scikit-learn
Machine Learning with scikit-learnodsc
 
Bridging the Gap Between Data and Insight using Open-Source Tools
Bridging the Gap Between Data and Insight using Open-Source ToolsBridging the Gap Between Data and Insight using Open-Source Tools
Bridging the Gap Between Data and Insight using Open-Source Toolsodsc
 
Top 10 Signs of the Textpocalypse
Top 10 Signs of the TextpocalypseTop 10 Signs of the Textpocalypse
Top 10 Signs of the Textpocalypseodsc
 
The Art of Data Science
The Art of Data Science The Art of Data Science
The Art of Data Science odsc
 
Frontiers of Open Data Science Research
Frontiers of Open Data Science ResearchFrontiers of Open Data Science Research
Frontiers of Open Data Science Researchodsc
 

More from odsc (20)

Understanding the Chief Data Officer
Understanding the Chief Data Officer Understanding the Chief Data Officer
Understanding the Chief Data Officer
 
Machine-In-The-Loop for Knowledge Discovery
Machine-In-The-Loop for Knowledge DiscoveryMachine-In-The-Loop for Knowledge Discovery
Machine-In-The-Loop for Knowledge Discovery
 
API Driven Development
API Driven Development API Driven Development
API Driven Development
 
Mobile technology Usage by Humanitarian Programs: A Metadata Analysis
Mobile technology Usage by Humanitarian Programs: A Metadata AnalysisMobile technology Usage by Humanitarian Programs: A Metadata Analysis
Mobile technology Usage by Humanitarian Programs: A Metadata Analysis
 
Productionizing Deep Learning From the Ground Up
Productionizing Deep Learning From the Ground UpProductionizing Deep Learning From the Ground Up
Productionizing Deep Learning From the Ground Up
 
Big Data Infrastructure: Introduction to Hadoop with MapReduce, Pig, and Hive
Big Data Infrastructure: Introduction to Hadoop with MapReduce, Pig, and HiveBig Data Infrastructure: Introduction to Hadoop with MapReduce, Pig, and Hive
Big Data Infrastructure: Introduction to Hadoop with MapReduce, Pig, and Hive
 
Think Breadth, Not Depth
Think Breadth, Not DepthThink Breadth, Not Depth
Think Breadth, Not Depth
 
Data Science at Dow Jones: Monetizing Data, News and Information
Data Science at Dow Jones: Monetizing Data, News and InformationData Science at Dow Jones: Monetizing Data, News and Information
Data Science at Dow Jones: Monetizing Data, News and Information
 
Spark, Python and Parquet
Spark, Python and Parquet Spark, Python and Parquet
Spark, Python and Parquet
 
Building a Predictive Analytics Solution with Azure ML
Building a Predictive Analytics Solution with Azure MLBuilding a Predictive Analytics Solution with Azure ML
Building a Predictive Analytics Solution with Azure ML
 
Beyond Names
Beyond NamesBeyond Names
Beyond Names
 
How Woman are Conquering the S&P 500
How Woman are Conquering the S&P 500How Woman are Conquering the S&P 500
How Woman are Conquering the S&P 500
 
Domain Expertise and Unstructured Data
Domain Expertise and Unstructured DataDomain Expertise and Unstructured Data
Domain Expertise and Unstructured Data
 
Kaggle The Home of Data Science
Kaggle The Home of Data ScienceKaggle The Home of Data Science
Kaggle The Home of Data Science
 
Open Source Tools & Data Science Competitions
Open Source Tools & Data Science Competitions Open Source Tools & Data Science Competitions
Open Source Tools & Data Science Competitions
 
Machine Learning with scikit-learn
Machine Learning with scikit-learnMachine Learning with scikit-learn
Machine Learning with scikit-learn
 
Bridging the Gap Between Data and Insight using Open-Source Tools
Bridging the Gap Between Data and Insight using Open-Source ToolsBridging the Gap Between Data and Insight using Open-Source Tools
Bridging the Gap Between Data and Insight using Open-Source Tools
 
Top 10 Signs of the Textpocalypse
Top 10 Signs of the TextpocalypseTop 10 Signs of the Textpocalypse
Top 10 Signs of the Textpocalypse
 
The Art of Data Science
The Art of Data Science The Art of Data Science
The Art of Data Science
 
Frontiers of Open Data Science Research
Frontiers of Open Data Science ResearchFrontiers of Open Data Science Research
Frontiers of Open Data Science Research
 

Recently uploaded

Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebUiPathCommunity
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxhariprasad279825
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningLars Bell
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxNavinnSomaal
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenHervé Boutemy
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsMiki Katsuragi
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...Fwdays
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationSlibray Presentation
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxLoriGlavin3
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):comworks
 
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfHyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfPrecisely
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Manik S Magar
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfRankYa
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024Stephanie Beckett
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clashcharlottematthew16
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteDianaGray10
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionDilum Bandara
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr BaganFwdays
 

Recently uploaded (20)

Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio Web
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptx
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine Tuning
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptx
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache Maven
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering Tips
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):
 
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfHyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdf
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clash
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test Suite
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An Introduction
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 

Intro to Text Mining Using tm, openNLP and topicmodels

  • 1. INTRO TO TEXT MINING USING TM, OPENNLP , & TOPICMODELS Ted Kwartler O P E N D A T A S C I E N C E C O N F E R E N C E_ BOSTON 2015 @opendatasci
  • 2. Intro to text mining using tm, openNLP , & topicmodels www.linkedin.com/in/edwardkwartler @tkwartler
  • 3. Agenda What is Text Mining? Keyword Scanning, Cleaning Freq Counts, Dendograms Wordclouds Topic Modeling openNLP
  • 4. What is Text Mining? What is Text Mining? Keyword Scanning, Cleaning Freq Counts, Dendograms Wordclouds Topic Modeling openNLP
  • 5. What is text mining? What is Text Mining? •Extract new insights from text •Let’s you drink from a fire hose of information •Language is hard; many unsolved problems •Unstructured • Expression is individualistic •Multi-language/cultural implications Before Text Mining
  • 6. What is text mining? What is Text Mining? After Text Mining •Extract new insights from text •Let’s you drink from a fire hose of information •Language is hard; many unsolved problems •Unstructured • Expression is individualistic •Multi-language/cultural implications
  • 7. Natural Language Text Sources Unorganized State Insight or Recommendation Organized & Distilled Corpora > Cleaned Corpus > Analysis What is Text Mining? Text Mining Workflow emails articles blogs reviews tweets surveys
  • 8. Bag of Words What is Text Mining? Text Mining Approaches “Lebron James hit a tough shot.” tough a Semantic Using Syntactic Parsing Sentence Noun Phrase Verb Phrase Lebron James verb a article Noun phrase adjective shothit tough noun
  • 9. What is Text Mining? Text Mining Approaches Some Challenges in Text Mining •Compound words (tokenization) changes meaning • “not bad” versus “bad” •Disambiguation • Sarcasm • “I like it…NOT!” • Cultural differences • “It’s wicked good” (in Boston) •I cooked waterfowl to eat. •I cooked waterfowl belonging to her. •I created the (clay?) duck and gave it to her. •Duck!! “I made her duck.”
  • 10. What is Text Mining? Text Sources Text can be captured within the enterprise and elsewhere •Books •Electronic Docs (PDFs) •Blogs •Websites •Social Media •Customer Records •Customer Service Notes •Notes •Emails •… The source and context of the medium is important. It will have a lot of impact on difficulty and data integrity.
  • 11. What is Text Mining? Enough of me talking…let’s do it for real! Scripts in this workshop follow a simple workflow Set the Working Directory Load Libraries Make Custom Functions & Specify Options Read in Data & Pre-Process Perform Analysis & Save
  • 12. What is Text Mining? Enough of me talking…let’s do it for real! Setup Install R/R Studio • http://cran.us.r-project.org/ • http://www.rstudio.com/products/rstudio/download/ Download scripts and data, save to a desktop folder & unzip https://goo.gl/CYfe33 Install Packages • Run “1_Install_Packages.R” script • An error may occur if the Java version doesn’t match I have tested these scripts on R 3.1.2 with a Mac. Some steps may be different in other environments.
  • 13. Warning: Twitter Profanity • Twitter demographics skew young and as a result have profanity that appear in the examples. • It’s the easiest place to get a lot of messy text fast, if it is offensive feel free to talk to me and I will work to get you other texts for use on your own. No offense is intended. #%@*!!!
  • 14. Keyword Scanning, Cleaning & Freq Counts What is Text Mining? Keyword Scanning, Cleaning Freq Counts, Dendograms Wordclouds Topic Modeling openNLP
  • 15. Keyword scanning, Cleaning & Freq Counts Open the “coffee.csv” to get familiar with the data structure Setup1000 tweets mentioning “coffee” “text$text” is the vector of tweets that we are interested in. All other attributes are auto generated from the twitter API
  • 16. Keyword scanning, Cleaning & Freq Counts 2_Keyword_Scanning.R Setup grepl(“pattern”, searchable object, ignore.case=TRUE) grepl returns a vector of T/F if the pattern is present at least once grep(“pattern”, searchable object, ignore.case=TRUE) grep returns the position of the pattern in the document Basic R Unix Commands “library(stringi)” Functions stri_count(searchable object, fixed=“pattern”) stri_count counts the number of patterns in a document
  • 17. Keyword scanning, Cleaning & Freq Counts 2_Keyword_Scanning.R Setup Let’s try to scan for other words in our other corpora. 1. Change the read.csv file to another corpus. 2. Change the search patterns to understand the result. Examples
  • 18. Keyword scanning, Cleaning & Freq Counts Remember this? Setup Natural Language Text Sources Unorganized State Insight or Recommendation Organized & Distilled Corpora > Cleaned Corpus > Analysis emails articles blogs reviews tweets surveys
  • 19. Keyword scanning, Cleaning & Freq Counts R for our Cleaning Steps Setup emails articles blogs reviews tweets Tomorrow I’m going to have a nice glass of Chardonnay and wind down with a good book in the corner of the county :-) 1.Remove Punctuation 2.Remove extra white space 3.Remove Numbers 4.Make Lower Case 5.Remove “stop” words tomorrow going nice glass chardonnay wind down good book corner county
  • 20. Keyword scanning, Cleaning & Freq Counts 3_Cleaning and Frequency Count.R Setup VCorpus(source) VCorpus creates a corpus held in memory. tm_map(corpus, function) tm_map applies the transformations for the cleaning “library(tm)” Functions New Text Mining Concepts Corpus- A collection of documents that analysis will be based on. Stopwords – are common words that provide very little insight, often articles like “a”, “the”. Customizing them is sometimes key in order to extract valuable insights. tm_map(corpus, removePunctuation) - removes the punctuation from the documents tm_map(corpus, stripWhitespace) - extra spaces, tabs are removed tm_map(corpus, removeNumbers) - removes numbers tm_map(corpus, tolower) – makes all case lower tm_map(corpus, removeWords) - removes specific “stopwords” getTransformations() will list all standard tm corpus transformations We can apply standard R ones too. Sometimes it makes sense to perform all of these or a subset or even other transformations not listed like “stemming”
  • 21. Keyword scanning, Cleaning & Freq Counts 3_Cleaning and Frequency Count.R Setup “custom.stopwords” combines vectors of words to remove from the corpus “tryTolower”is poached to account for errors when making lowercase. “clean.corpus” makes applying all transformations easier. “custom.reader” keeps the meta data (tweet ID) with the original document
  • 22. Keyword scanning, Cleaning & Freq Counts 3_Cleaning and Frequency Count.R Setup Bag of Words means creating a Term Document Matrix or Document Term Matrix* *Depends on analysis, both are transpositions of the other Tweet1 Tweet 2 Tweet3 Tweet4 … Tweet_n Term1 0 0 0 0 0 0 Term2 1 1 0 0 0 0 Term3 1 0 0 2 0 0 … 0 0 3 0 1 1 Term_n 0 0 0 1 1 0 Term1 Term2 Term3 … Term_n Tweet1 0 1 1 0 0 Tweet2 0 1 0 0 0 Tweet3 0 0 0 3 0 … 0 0 0 1 1 Tweet_n 0 0 0 1 0 Document Term MatrixTerm Document Matrix “as.matrix” makes the tm’s version of a matrix into a simpler version These matrices are often very sparse and large therefore some special steps may be needed and will be covered in subsequent scripts.
  • 23. Dendograms & Wordclouds What is Text Mining? Keyword Scanning, Cleaning Freq Counts, Dendograms Wordclouds Topic Modeling openNLP
  • 24. Freq Counts, Dendograms Wordclouds 4_Dendogram.R script builds on the matrices SetupFirst let’s explore simple frequencies •Adjust v>=30 to limit the number of words in the “d” dataframe. This will make the visual look better. •You can also write.csv(d,”frequent_terms.csv”) if you want to export the data frame.
  • 25. Freq Counts, Dendograms Wordclouds 4_Dendogram.R script Setup “zombie” is an unexpected, highly frequent term. Let’s explore it.
  • 26. Freq Counts, Dendograms Wordclouds 4_Dendogram.R script Setup Next let’s explore word associations, similar to correlation •Adjust 0.30 to get the terms that are associated .30 or more with the ‘zombie’ term. •Make sure the 0.30s are the same for both in order to make the dataframe which is plotted. •Treating the terms as factors lets ggplot2 sort them for a cleaner look.
  • 27. Freq Counts, Dendograms Wordclouds 4_Dendogram.R script Setup “geek” has the highest word association with “zombie”. Unapologetic thinker. Lifelong beer scholar. Incurable zombie fanatic. Explorer. Tvaholic. Alcohol geek. Avid tv buff. Friendly beer aficionado. Coffee guru. Zombie junkie.
  • 28. Freq Counts, Dendograms Wordclouds Extracting Meaning using Dendograms SetupDendograms are hierarchical clusters based on frequencies. • Reduces information much like average is a reduction of many observations’ values • Word clusters emerge often showing related terms • Term frequency is used to construct the word cluster. Put another way, term A and term B have similar frequencies in the matrix so they are considered a cluster. City Annual Rainfall Portland 43.5 Boston 43.8 New Orleans 62.7 Boston Portland NewOrleans 44 63 Boston & Portland are a cluster at hieght 44. You lose some of the exact rainfall amount in order to cluster them.
  • 29. Freq Counts, Dendograms Wordclouds 4_Dendogram.R script Setup Weird associations! Maybe a dendogram will help us more New Text Mining Concept Sparse- Term Document Matrices are often extremely sparse. This means that any document (column) has mostly zero’s. Reducing the dimensions of these matrices is possible by specifying a sparse cutoff parameter. Higher sparse parameter will bring in more terms. Adjust the sparse parameter to have ~40 to 50 terms in order to have a visually appealing dendogram.
  • 30. Freq Counts, Dendograms Wordclouds 4_Dendogram.R script Setup Base Plot of a Dendogram • Less visually appealing • Clusters can be hard to read given the different heights
  • 31. Freq Counts, Dendograms Wordclouds 4_Dendogram.R script • Aesthetically my choice is to have colored clusters and all terms at the bottom.
  • 33. Keyword scanning, Cleaning & Freq Counts 4_Dendogram.R Setup Let’s review other corpora. 1. Change the read.csv file to another corpus. 2. Construct the ggplot frequency visual. 3. Construct the ggplot word association for a term you see in the frequency visual. 4. Create a dendogram with ~40 to 50 terms to see how it clusters.
  • 34. Freq Counts, Dendograms Wordclouds 5_Simple_Wordcloud.R script New Text Mining Concept Tokenization- So far we have created single word n-grams. We can create multi word “tokens” like bigrams, or trigrams with this line function. It is applied when making the term document matrix.
  • 35. Freq Counts, Dendograms Wordclouds 5_Simple_Wordcloud.R script To make a wordcloud we follow the previous steps and create a data frame with the word and the frequency.
  • 36. Freq Counts, Dendograms Wordclouds 5_Simple_Wordcloud.R script Next we need to select the colors for the wordcloud.
  • 37. Freq Counts, Dendograms Wordclouds 5_Simple_Wordcloud.R script •Bigram Tokenization has captured “marvin gaye”, “bottle (of) chardonnay”, “chocolate milkshake” etc. •A word cloud is a frequency visualization. The larger the term (or bigram here) the more frequent the term. •You may get warnings if certain tokens are to large to be plotted in the graphics device.
  • 38. Freq Counts, Dendograms Wordclouds Types of Wordclouds wordcloud( ) Single Corpus Multiple Corpora commonality.cloud() comparison.cloud( ) comparison.cloud( ) *this wc has 3 corpora although the Venn diagram only shows 2 as an example.
  • 39. Freq Counts, Dendograms Wordclouds 6_Other_Wordcloud.R Bring in more than one corpora. Apply each transformation separately…more in script. Be sure to label the columns in the same order as was brought in.
  • 40. Freq Counts, Dendograms Wordclouds Commonality Cloud •The tweets mentioning “chardonnay” “beer”, and “coffee” have these words in common. •Again size is related to frequency. •Not helpful in this example as we already knew these were “drinks” but in diverse corpora it may be more helpful e.g. political speeches.
  • 41. Freq Counts, Dendograms Wordclouds Commonality Cloud •The tweets mentioning “chardonnay” “beer”, and “coffee” have these dissimilar words. •Again size is related to frequency. •Beer drinkers in this snapshot are passionate (fanatics, geeks, specialists) on various subjects while Chardonnay drinkers mention Marvin Gaye. Coffee mentions mention morning cup and work.
  • 42. Freq Counts, Dendograms Wordclouds Word Cloud Practice •Try your hand at using tequila in the word clouds •Remember to adjust your stop words!
  • 43. Agenda What is Text Mining? Keyword Scanning, Cleaning Freq Counts, Dendograms Wordclouds Topic Modeling openNLP
  • 44. Topic Modeling openNLP 7_Topic_Modeling_Sentiment.R Setup TREEMAP: multi dimensional representation of the corpus attributes. • Color will represent quick, simple sentiment • Each tweet is a small square • The area of the tweet’s square is related to the number of terms in the tweet, document length • The larger grouping is based on LDA Topic Modeling End result is understanding broad topics, their sentiment and amount of the corpus documents devoted to the identified topic.
  • 45. First, simple Sentiment Polarity Zipf’s Law 0 100,000 200,000 300,000 400,000 500,000 600,000 700,000 800,000 900,000 Top 100 Word Usage from 3M Tweets Many words in natural language but there is steep decline in everyday usage. Follows a predictable pattern. Scoring •I loathe BestBuy Service -1 •I love BestBuy Service. They are the best. +2 •I like shopping at BestBuy but hate traffic. 0 Surprise is a sentiment. Hit by a bus! – Negative Polarity Won the lottery!- Positive Polarity R script scans for positive words, and negative words as defined by MQPA Academic Lexicon research. It adds positive words and subtracts negative ones. The final score represents the polarity of the social interaction.
  • 46. In reality sentiment is more complex. Plutchik’s WheelCountless Emoji
  • 48. Second, LDA (Latent Dirichlet Allocation) Topic Modeling ExampleLDA • Probability is assigned to each document (tweet) for the specific observed topics. • A document can have varying probabilities of topics simultaneously. Each document is made up of mini topics. Technical Explanations: http://en.wikipedia.org/wiki/Latent_Dirichlet_allocati on http://cs.brown.edu/courses/csci2950- p/spring2010/lectures/2010-03-03_santhanam.pdf *Full disclosure I am still learning this methodology 1.I like to watch basketball and football on TV. 2.I watched basketball and Shark Tank yesterday. 3.Open source analytics software is the best. 4.R is an open source software for analysis. 5.I use R for basketball analytics. Topic A: 30% basketball, 20% football, 15% watched, 10% TV…something to do with watching sports Topic B: 40% software, 10% open, 10% source…10% analytics something to do with open source analytics software Corpus LDA Topics Documents 1.100% Topic A 2.100% Topic A 3.100% Topic B 4.100% Topic B 5.60% Topic B, 40% Topic A
  • 49. Topic Modeling openNLP 7_Topic_Modeling_Sentiment.R Setup Open the script and let’s walk through it line by line because there are multiple additions to the previous scripts Can you try to make the same visual with the coffee corpus?
  • 52. Topic Modeling openNLP 8_Open_Language_Processing.R Setup library(openNLP) • https://rpubs.com/lmullen/nlp-chapter • Uses Annotators to identify the specific item in the corpus. Then holds all annotations in a plain text document. Think of auto-tagging words in a document and saving the document terms paired with the tag. • Documentation, examples are hard to come by! • Grammatical or POS (Part of Speech) Tagging • Sentence Tagging • Word Tagging • Named Entity Recognition • Persons • Locations • Organizations Annotations
  • 53. Topic Modeling openNLP 8_Open_Language_Processing.R Setup The annotated plain text object is large with a complex structure. This function allows us to extract the named entities.