2. Why do this?
•It’s a well-known business standard that
lets stakeholders, clients, and
decision makers easily see and
compare like and unlike elements.
3. Problems
•Even at their most numerical, they are typically
qualitative in nature.
•Many more are designed using a
“scorecard” approach.
•If a scorecard is used, it is limited to a
few points, with a qualitative assessment
defining the numeric measures.
•Distances and positioning are defined
by human judgment (a “human element”).
4. Perceptual Maps
•Perceptual mapping is
a diagrammatic technique used by
asset marketers that attempts to visually
display the perceptions of customers or
potential customers. [wikipedia]
•Perceptual maps enable you to find
opportunities in the market for a new product
or to identify potentially competitive products.
Attributed to http://npdbook.com/
7. Twitter Extraction
Pipeline: Twitter Extraction -> Tokenizing -> Stemming -> TFIDF -> Stopwords -> Word Count Matrix -> MDS Plotting
•Implementation
• The Streaming API was left running for 1 week, pulling data for a target group of fast food companies. The Twitter data collected is used to generate our feature set.
• A second set of data, collected without filters, serves as a control set that helps determine which words are key to food and which words are most important in the context of fast food.
8. Twitter Extraction
•Packages:
• The Twython library provides an easily accessible API wrapper that can be used with the Twitter Streaming API.
• Twython lets a user plug into Twitter and access data.
•Limitations
• The API didn’t seamlessly handle some parameters needed to filter Twitter data (e.g. language).
• The classic Unicode/ASCII conversion problems are rife in the Twitter dataset.
• The Firehose API was deprecated while this project was being worked on. Without the ability to filter the feed by language, and without Firehose access, I used the SwiftKey dataset instead.
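The two limitations above (language filtering and Unicode/ASCII conversion) can be worked around after collection. The following is a minimal sketch, not the code used for this project. It assumes tweets were saved one JSON object per line with `lang` and `text` fields; the helper names `to_ascii` and `english_texts` are hypothetical.

```python
import json
import unicodedata


def to_ascii(text):
    """Decompose accented characters, then drop anything that is not ASCII."""
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")


def english_texts(raw_lines, lang="en"):
    """Yield ASCII-folded text for tweets tagged with the requested language."""
    for line in raw_lines:
        line = line.strip()
        if not line:
            continue
        try:
            tweet = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip truncated or malformed records
        if not isinstance(tweet, dict):
            continue  # valid JSON, but not a tweet object
        if tweet.get("lang") == lang and "text" in tweet:
            yield to_ascii(tweet["text"])


# Made-up sample records, for illustration only.
sample = [
    '{"lang": "en", "text": "Caf\\u00e9 fries at lunch"}',
    '{"lang": "es", "text": "Quiero tacos"}',
    "not json",
]
cleaned = list(english_texts(sample))
print(cleaned)
```

Folding to ASCII is lossy, so it only makes sense when the downstream steps work on English word tokens, as they do here.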
9. Tokenizing Dataset
•Implementation
• Tokenizing is simple. There are a few ways to do it, but the easiest is to split the data: first on the newline character (“\n”) and then on the space character.
• The tokenization stage was also a convenient place to filter and format my tokens.
• At this stage I also deduplicate the data: if a line appears more than once, the repeats are removed. This helps manage spammers.
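The steps above (split by line, split into words, filter, format, deduplicate) can be sketched in a few lines of standard-library Python. This is an illustration under stated assumptions, not the original code: the function name `tokenize_lines` is hypothetical, and the filtering rules (dropping URLs and @mentions, stripping punctuation, lowercasing) are example choices.

```python
import string


def tokenize_lines(raw_text):
    """Split raw text into one token list per unique line."""
    seen_lines = set()
    documents = []
    for line in raw_text.split("\n"):
        line = line.strip().lower()
        if not line or line in seen_lines:
            continue  # skip blanks and exact repeats (likely spam)
        seen_lines.add(line)
        tokens = []
        for word in line.split():
            if word.startswith(("http://", "https://", "@")):
                continue  # drop links and mentions
            word = word.strip(string.punctuation)
            if word:
                tokens.append(word)
        documents.append(tokens)
    return documents


# Made-up sample input, for illustration only.
raw = (
    "Free fries today!\n"
    "Free fries today!\n"
    "@pal the new taco is great http://example.com/x\n"
)
docs = tokenize_lines(raw)
print(docs)
```

Deduplicating on the normalized line, rather than the raw one, also catches repeats that differ only in case or surrounding whitespace.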
10. Stemming Tokens
•Value Add of Stemming
• Stemming was first proposed in the late 1960s by Julie Beth Lovins and later refined by Martin Porter, whose algorithm has become the de facto standard for stemming.
• Stemming strips suffixes from words so that only the root (stem) remains (e.g. moved -> move).
• In the context of this analysis, it lets you compare the root words used about each of these restaurants.
•Implementation
• The newer version of the algorithm, Porter2, is readily available as a Python package.
• For this analysis I stemmed both my core food dataset and the general firehose corpus.
• For simplicity, and out of kindness to my RAM, I stored the stemmed output as a CSV.
http://www.eecis.udel.edu/~trnka/CISC889-11S/lectures/dan-porters.pdf
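To make the suffix-stripping idea concrete, here is a deliberately naive stemmer. It is not Porter or Porter2 and should not be used for real analysis; it only shows how different surface forms collapse onto one key. The suffix list, the `min_stem` threshold, and the name `toy_stem` are all assumptions made for this illustration.

```python
# Suffixes are tried in order; the first one that matches is removed.
SUFFIXES = ("ing", "ed", "es", "s")


def toy_stem(word, min_stem=3):
    """Strip one common English suffix, keeping at least min_stem characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word


def group_by_stem(words):
    """Map each stem to the list of original words that reduce to it."""
    groups = {}
    for word in words:
        groups.setdefault(toy_stem(word), []).append(word)
    return groups


groups = group_by_stem(["moved", "moving", "burgers", "burger", "bus"])
print(groups)
```

Note that this toy version reduces "moved" to "mov", whereas the slide's example shows "move". Real stemmers such as Porter's apply additional rules after stripping, which is one reason to use the packaged algorithm rather than a hand-rolled one.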
11. TFIDF: Term Frequency-Inverse Document Frequency
•Value Add of TFIDF
• Put simply, TFIDF scores how often a word occurs in a document (tweet) relative to how common it is across a series of documents, and provides a simple way to compare its importance to that of other words.
•Implementation
• TFIDF is fit to a larger set of firehose data. In this case the firehose data is broken apart into tweet documents covering any and all topics a Twitter user might be interested in.
• After creating the TFIDF model from the firehose data, I apply it to the set of data I have from the restaurants. The words with the highest scores are considered the “most important” in the context of food and restaurants.
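The fit-on-background, score-on-target idea can be sketched without any libraries. This is a hand-rolled illustration, not the scikit-learn pipeline used in the project. It assumes a smoothed inverse document frequency of the form ln((1 + N) / (1 + df)) + 1 and, as a design choice made here, gives words never seen in the background corpus the maximum weight. All names and sample data are hypothetical.

```python
import math
from collections import Counter


def fit_idf(background_docs):
    """Learn smoothed IDF weights from tokenized background documents."""
    n_docs = len(background_docs)
    doc_freq = Counter()
    for tokens in background_docs:
        doc_freq.update(set(tokens))  # count each word once per document
    idf = {
        term: math.log((1 + n_docs) / (1 + df)) + 1
        for term, df in doc_freq.items()
    }
    # Assumption: words absent from the background get the largest weight.
    default_idf = math.log(1 + n_docs) + 1
    return idf, default_idf


def score_terms(target_docs, idf, default_idf):
    """Weight raw term counts in the target corpus by background IDF."""
    term_freq = Counter(token for tokens in target_docs for token in tokens)
    return {
        term: count * idf.get(term, default_idf)
        for term, count in term_freq.items()
    }


# Made-up tokenized tweets, for illustration only.
background = [
    ["the", "game", "was", "great"],
    ["the", "weather", "is", "great"],
    ["the", "burger", "line"],
    ["love", "the", "new", "song"],
]
food = [["the", "burger", "was", "great"], ["burger", "and", "fries"]]

idf, default_idf = fit_idf(background)
scores = score_terms(food, idf, default_idf)
print(sorted(scores.items(), key=lambda pair: pair[1], reverse=True))
```

In this toy data "the" appears in every background document and ends up with the lowest score, while "burger" is frequent in the food set and rare in the background, so it ranks first.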
12. TFIDF: Term Frequency-Inverse Document Frequency
•Packages
• TFIDF is simple and easy to implement in scikit-learn.
• I pass my string document objects to the TFIDF function, supply a tokenizer suited to my text files, and run it.
• The output is a dictionary of words and their TFIDF scores, which needs to be read into a list of tuples and sorted.
• I then create a filter from the list sorted by TFIDF score and use it to remove all non-relevant terms from the food list, which I will use as features.
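The last two steps (sorting the score dictionary and using the top of the list as a filter) might look like the sketch below. The cutoff `k`, the function names, and the scores are hypothetical. The original does not say how many terms were kept.

```python
def top_terms(scores, k):
    """Return the k highest-scoring (word, score) tuples, best first."""
    ranked = sorted(scores.items(), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


def keep_relevant(tokens, allowed):
    """Drop every token that is not in the allowed feature set."""
    return [token for token in tokens if token in allowed]


# Made-up scores, for illustration only.
scores = {"burger": 3.8, "the": 1.0, "fries": 2.6, "great": 1.5}

ranked = top_terms(scores, k=2)
allowed = {term for term, _ in ranked}
filtered = keep_relevant(["the", "burger", "with", "fries"], allowed)
print(ranked, filtered)
```

Storing the allowed terms in a set keeps the membership test cheap when the filter is applied to every token in the corpus.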
13. Wordcount Matrix
•Implementation
1. First I create a dictionary object of all words (a defaultdict would work just as well).
2. I then create a set of all words and compare it to a separate list of restaurants.
3. For each restaurant in the list, I run a separate loop that increments the word counters stored for that restaurant.
4. Finally I save the word counts as a CSV that can be imported as a matrix.
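The four steps above can be sketched as follows. This assumes the tweets are already tokenized and grouped by restaurant, which the original does not spell out. The function names and the sample data are hypothetical.

```python
import csv
import io
from collections import defaultdict


def build_count_matrix(tweets_by_restaurant, vocabulary):
    """Count, per restaurant, how often each vocabulary word appears."""
    counts = {name: defaultdict(int) for name in tweets_by_restaurant}
    for name, documents in tweets_by_restaurant.items():
        for tokens in documents:
            for token in tokens:
                if token in vocabulary:
                    counts[name][token] += 1
    return counts


def write_matrix_csv(counts, vocabulary, handle):
    """Write one row per restaurant and one column per vocabulary word."""
    columns = sorted(vocabulary)
    writer = csv.writer(handle)
    writer.writerow(["restaurant"] + columns)
    for name in sorted(counts):
        writer.writerow([name] + [counts[name].get(word, 0) for word in columns])


# Made-up tokenized tweets, for illustration only.
tweets = {
    "restaurant_a": [["crunch", "taco"], ["crunch"]],
    "restaurant_b": [["chicken", "bacon"]],
}
vocabulary = {"crunch", "chicken", "bacon"}

counts = build_count_matrix(tweets, vocabulary)
buffer = io.StringIO()  # stands in for a file opened with newline=""
write_matrix_csv(counts, vocabulary, buffer)
print(buffer.getvalue())
```

Sorting both the restaurant names and the vocabulary gives a stable row and column order, which makes the CSV easier to load as a matrix in another tool.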
14. MDS / NMDS / CA (PCA)
•What is MDS and why is it used for perceptual mapping?
• MDS uses matrix operations to compute the distances between elements and plots them in a low-dimensional space while preserving the distances between all elements as closely as possible.
• Standard (metric) MDS is made to handle continuous variables. If you have ordinal or comparison data, then a non-metric MDS solution is necessary. A non-metric MDS positions your data elements according to how they rank relative to each other, rather than trying to reproduce the exact differences between them.
• MDS and PCA have entirely different goals and are studied separately. The goal of PCA is dimensionality reduction in support of factor analysis, while the goal of MDS is to simplify the visual inspection of elements and their relationships to other like elements.
• Another interesting use of this analysis is to find similarity between your measurements. For instance, if you are using MDS with demographic data, you would probably see that minivan owners and families have very similar vectors.
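To show what "plotting while preserving distances" means in practice, here is a small pure-Python sketch of metric MDS using the SMACOF iteration (repeated Guttman transforms). It is not the non-metric R analysis discussed later in the deck, and the word-count rows are made up. It assumes Python 3.8+ for `math.dist`.

```python
import math
import random


def distance_matrix(vectors):
    """Euclidean distance between every pair of row vectors."""
    return [[math.dist(a, b) for b in vectors] for a in vectors]


def stress(points, target):
    """Raw stress: squared error between embedded and target distances."""
    current = distance_matrix(points)
    n = len(points)
    return sum(
        (current[i][j] - target[i][j]) ** 2
        for i in range(n)
        for j in range(i + 1, n)
    )


def smacof_step(points, target):
    """One Guttman-transform update with unit weights."""
    n, dims = len(points), len(points[0])
    current = distance_matrix(points)
    updated = []
    for i in range(n):
        row = [0.0] * dims
        for j in range(n):
            if i == j or current[i][j] == 0.0:
                continue  # coincident points contribute nothing
            ratio = target[i][j] / current[i][j]
            for k in range(dims):
                row[k] += ratio * (points[i][k] - points[j][k])
        updated.append([value / n for value in row])
    return updated


def mds(target, dims=2, iterations=300, seed=0):
    """Embed items in `dims` dimensions from a target distance matrix."""
    rng = random.Random(seed)
    points = [[rng.uniform(-1, 1) for _ in range(dims)] for _ in target]
    for _ in range(iterations):
        points = smacof_step(points, target)
    return points


# Made-up word-count rows (one per restaurant), for illustration only.
counts = {
    "restaurant_a": [12, 1, 0, 3],
    "restaurant_b": [10, 2, 1, 4],
    "restaurant_c": [0, 9, 7, 1],
    "restaurant_d": [1, 8, 9, 0],
}
names = sorted(counts)
target = distance_matrix([counts[name] for name in names])

start_stress = stress(mds(target, iterations=0), target)  # random layout
coords = mds(target)
final_stress = stress(coords, target)
print(start_stress, final_stress)
for name, point in zip(names, coords):
    print(name, point)
```

The resulting 2D coordinates are what would be plotted on the perceptual map. Only the relative positions matter, since any rotation or reflection of the layout has the same stress.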
15. MDS / NMDS / CA (PCA)
•What is MDS and why is it used for perceptual mapping? (cont.)
• The easiest way to think about this is to use the concept of unidimensional scaling and apply it to a
multidimensional environment.
20. Analysis of the R NMDS
•What we can determine with this analysis
1. Wendy’s and Burger King have favorable offerings for chicken
and bacon.
2. Taco Bell doesn’t have a notable salad offering. McDonald’s is also
far from that term.
3. The price of Chipotle, Pizza Hut, and McDonald’s is frequently
referenced.
4. Taco Bell owns the term “crunch”.
•What to do next
1. Pull in additional data to find the relative profitability of these
firms and align them with our terms. If any “blue ocean” space
is seen, that could be a potential business opportunity.
21. Pain Points and Lessons Learned
• ASCII and Unicode conversion issues are a constant pain. It’s much easier to be overaggressive with casting; also make sure that all modules and classes specify the type of text being used.
• For long calculations it is best to use pickle to checkpoint the work done, so that completed processing is saved.
• Sometimes R is the right call, particularly when it comes to plotting.
• scikit-learn has almost all the functions needed, and it is easier to stay there than to hunt for other best-of-breed packages.
• LDA for topic modeling would be a great next step to reduce dimensionality.
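The pickle checkpointing lesson above can be captured in a small helper. This is a sketch under assumptions rather than the project's code: the helper name `checkpointed`, the file name, and the stand-in computation are hypothetical.

```python
import os
import pickle
import tempfile


def checkpointed(path, compute):
    """Return the result cached at `path`, or compute, save, and return it."""
    if os.path.exists(path):
        with open(path, "rb") as handle:
            return pickle.load(handle)
    result = compute()
    tmp_path = path + ".tmp"
    with open(tmp_path, "wb") as handle:
        pickle.dump(result, handle, protocol=pickle.HIGHEST_PROTOCOL)
    os.replace(tmp_path, path)  # rename only after a complete write
    return result


calls = []


def slow_word_counts():
    """Stand-in for a long-running step; records each real execution."""
    calls.append(1)
    return {"crunch": 2, "salad": 1}


with tempfile.TemporaryDirectory() as workdir:
    path = os.path.join(workdir, "word_counts.pkl")
    first = checkpointed(path, slow_word_counts)   # computes and saves
    second = checkpointed(path, slow_word_counts)  # loads from disk

print(first == second, len(calls))
```

Writing to a temporary file and renaming it means an interrupted run never leaves a half-written checkpoint behind. Pickle files should only be loaded when they come from a trusted source, such as your own earlier run.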