This presentation was delivered at Splunk's User Conference (conf2012). It covers social media data, how to index and use it with Splunk, and sentiment analysis.
3. "What is social data?"
Data generated from human activity on social networks.
4. "What is social data?"
Oh yeah, right: Twitter. But I work in IT… so, who cares, right?
5. Social Data Should be in Splunk!
• easy to analyze with fields
• easy to create realtime/historical dashboards and views
• easy to translate many word problems into questions

checkin : {
    badges : [],
    created : 1345093539,
    geolat : "41.7686007592",
    geolong : "-72.621648",
    mayor : {
        type : "nochange"
    },
    primarycategory : {
        fullpathname : "Food:Mexican Restaurants",
        iconurl : "https://foursquare.com/img/cat…mexican_32.png",
        id : "4bf58dd8d48988d1c1941735",
        nodename : "Mexican Restaurants"
    },
    timezone : "America/New_York",
    user : {
        gender : "female"
    },
    venue : { … }
}
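Because the event is structured JSON, the fields fall out without any regex work. A minimal Python sketch, where the event below is a hand-trimmed, hypothetical reconstruction of the checkin on the slide:

```python
import json

# Trimmed checkin event (values from the slide; the exact structure
# is a hypothetical reconstruction).
event = '''
{
  "checkin": {
    "created": 1345093539,
    "geolat": "41.7686007592",
    "geolong": "-72.621648",
    "primarycategory": {"nodename": "Mexican Restaurants"},
    "timezone": "America/New_York",
    "user": {"gender": "female"}
  }
}
'''

checkin = json.loads(event)["checkin"]
# Structured data means field access, not parsing.
lat, lon = float(checkin["geolat"]), float(checkin["geolong"])
category = checkin["primarycategory"]["nodename"]
print(category, lat, lon)
```

Inside Splunk the same extraction happens automatically (or via `spath`); the point is that every value is already a named field.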
6. "What is social data?"
Wilde, we just said we work in IT and don't care about Twitter!
7. "What is social data?"
Except when we search on the words "site" AND "is down".
9. "What is social data?"
Except when I search on the words "site" AND "is down".
IT and the brand collide at times.
10. Getting Social Data
Network: 3rd Parties
Method: Push / Pull
Frequency: Real-time / Scheduled
11. Best thing about Social Data?
It's almost always structured JSON!
12. What can you do with it?
Map Conversations
Analyze People
13. What can you do with it?
Enrich it with lookups
Track Olympians
14. Indexing the social mother lode
• A single stream of big data
• @itayNeeman's curl splitter scripted input (TBR)
• Multiple forwarders installed on a single server, streaming to multiple indexers
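Neither the curl splitter nor the forwarder setup is shown in the deck, but the fan-out idea can be sketched: deal events from one firehose round-robin to N downstream consumers. A hypothetical illustration (plain lists stand in for indexers), not the actual scripted input:

```python
import itertools

def split_stream(events, n_outputs):
    """Fan a single event stream out to n_outputs consumers, round-robin."""
    outputs = [[] for _ in range(n_outputs)]
    targets = itertools.cycle(outputs)  # 0, 1, ..., n-1, 0, 1, ...
    for event in events:
        next(targets).append(event)
    return outputs

shards = split_stream(["e1", "e2", "e3", "e4", "e5"], 2)
print(shards)  # ['e1', 'e3', 'e5'] and ['e2', 'e4']
```

In a real deployment the lists would be sockets to indexers, but the load-balancing logic is the same.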
15. Sir Bill, I believe the demos cometh…
…whoa.
16. The Double Rainbow
When it comes to "numbers", the search language rocks!
In social, what people "mean" matters. For that you'll need some new tools that understand words and language.
"…what does it mean?!"
17. Analyzing Sentiment
Extracting subjective linguistic information: opinions, attitudes, emotions, and perspectives.
20. Understanding brings…
• Empathy with customers and prospects
• Intelligent business and design decisions
21. Brand Perception Impacts Stock
In 2011, our friends at Netflix announced they would be increasing subscription prices. The feedback on Netflix's Facebook page was outrage, and the impact on its stock price was dramatic.
22. Sentiment complements and informs
"We analyze several surveys on consumer confidence and political opinion over the 2008 to 2009 period, and find they correlate to sentiment word frequencies in contemporaneous Twitter messages… …as high as 80%, and capture important large-scale trends. The results highlight the potential of text streams as a substitute and supplement for traditional polling."
From Tweets to Polls: Linking Text Sentiment to Public Opinion Time Series (CMU: O'Connor, Balasubramanyan, Routledge, and Smith, 2010)
24. Box Office Revenue Forecasting
"We use the chatter from Twitter.com to forecast box-office revenues for movies. We show that a simple model built from the rate at which tweets are created about particular topics can outperform market-based predictors. We further demonstrate how sentiments extracted from Twitter can be further utilized to improve the forecasting power of social media."
Asur and Huberman 2010
26. What's in a word?
Terms have many context-dependent meanings:
• They depend on the writer, the reader, and their relationship, history, goals and preferences.
• "unpredictable" is bad in general, but good in movie reviews.
• "jobs" data was affected by the iPhone release.
27. How are you feeling right now?
Plutchik's Wheel of Emotions
Ekman’s Six Basic Emotions
28. Sentiment analysis gone wrong
When Anne Hathaway is mentioned, it's almost always in a positive context, and as a result some trading algorithms seem to purchase Berkshire Hathaway. When she is mentioned in the news, the stock goes up.
30. Bags of Words and Phrases
Many sentiment words and expressions are not directly influenced by what is around them:
That was fun :)
But certainly they can be!
They said it would be wonderful, but they were wrong.
This "wonderful" movie turned out to be boring.
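The "bag" part fits in a few lines. A minimal sketch (not the talk's actual featurizer); note how the sarcastic example still produces a positive-looking "wonderful" feature, which is exactly the limitation the slide points out:

```python
import re
from collections import Counter

def bag_of_words(text):
    """Lowercase, tokenize, and count terms; all word order is discarded."""
    return Counter(re.findall(r"[a-z']+", text.lower()))

# The bag sees "wonderful" but loses the context that reverses it.
features = bag_of_words('This "wonderful" movie turned out to be boring.')
print(features["wonderful"], features["boring"])
```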
31. Human Engineering vs. Machine Learning
• Hand-built expert systems and parse rules
• Similarly, human-engineered lists of good and bad words (e.g., "good", "great", "bad", "awful")
• Natural Language Processing & Speech Understanding: statistical and data-driven
Sentiment analysis generally uses statistics and training sets.
32. Machine Learning Choices
Learning Type
• Supervised: + straightforward; – needs lots of training data
• Unsupervised: + no training data; – may not find what you want
• Semi-Supervised: + small initial training data; – requires interactive feedback
Algorithms
• Naïve Bayes: + simplest probabilistic classifier model; – assumes words are independent
• EM: + performs better, doesn't assume independence; – more complicated, and over-fitting is a problem
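A toy multinomial naïve Bayes with Laplace smoothing shows how little machinery the simplest choice needs. The training documents below are made-up stand-ins, not the trained models discussed later in the talk:

```python
import math
from collections import Counter, defaultdict

class NaiveBayes:
    """Multinomial naive Bayes: assumes words are independent given the class."""

    def fit(self, docs, labels):
        self.priors = Counter(labels)                # class frequencies
        self.counts = defaultdict(Counter)           # per-class word counts
        self.vocab = set()
        for words, label in zip(docs, labels):
            self.counts[label].update(words)
            self.vocab.update(words)
        return self

    def predict(self, words):
        def score(label):
            total = sum(self.counts[label].values())
            logp = math.log(self.priors[label] / sum(self.priors.values()))
            for w in words:
                # Laplace (+1) smoothing so unseen words don't zero a class.
                logp += math.log((self.counts[label][w] + 1) /
                                 (total + len(self.vocab)))
            return logp
        return max(self.priors, key=score)

nb = NaiveBayes().fit(
    [["great", "fun"], ["awful", "boring"], ["great", "movie"]],
    ["pos", "neg", "pos"])
print(nb.predict(["great"]))   # pos
print(nb.predict(["boring"]))  # neg
```

The independence assumption is visibly wrong for language, yet, as the talk notes later, this model usually lands within a few percent of far fancier ones.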
33. Supervised Learning
Labeled Training Data → Learn → Model
Labeled Test Data → Validate → Model
New Unlabeled Data → Model → Predict Labels → New Labeled Data
34. The Effect of Negation
"The food was not good"
Strategy: negate sentiment for all terms up to a breaking punctuation (i.e., comma or sentence end).
The negation effect is dependent on the term:
• Mild words negate about the same: not bad ≈ good
• Extreme words negate towards neutral: not horrible ≈ average
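The punctuation-bounded strategy can be sketched as a token transform. The negator list and the `NOT_` prefix are illustrative conventions, not the talk's implementation:

```python
import re

NEGATORS = {"not", "no", "never"}  # illustrative, far from complete

def mark_negation(text):
    """Prefix tokens after a negator with NOT_ until breaking punctuation."""
    out, negating = [], False
    for tok in re.findall(r"[\w']+|[.,!?;]", text.lower()):
        if tok in {".", ",", "!", "?", ";"}:
            negating = False            # punctuation ends the negation scope
        elif negating:
            out.append("NOT_" + tok)    # flipped-sentiment pseudo-token
        else:
            out.append(tok)
            if tok in NEGATORS:
                negating = True
    return out

print(mark_negation("The food was not good"))
```

A classifier then learns separate weights for `good` and `NOT_good`; per the slide, extreme words like `NOT_horrible` should land near neutral rather than at the positive extreme.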
35. Learning Bias
A common feature of online user-supplied reviews is that the positive reviews vastly outnumber the negative ones.
Movie reviews at IMDB: more occurrences of "bad" in 10-star reviews than in 2-star ones.
Normalize by accounting for relative frequencies.
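The normalization step can be illustrated with made-up counts: raw occurrences of "bad" peak in the much larger 10-star class, while per-review rates tell the real story. All numbers below are toy values, not IMDB data:

```python
# Toy class sizes and raw "bad" counts (hypothetical numbers).
reviews   = {"10-star": 900, "2-star": 100}
bad_count = {"10-star": 45,  "2-star": 30}

# Raw counts are dominated by the bigger class...
raw_winner = max(bad_count, key=bad_count.get)

# ...but the per-review rate corrects for class size.
rate = {k: bad_count[k] / reviews[k] for k in reviews}
normalized_winner = max(rate, key=rate.get)

print(raw_winner, normalized_winner)  # 10-star 2-star
```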
36. Sentiment in Social Media
Emoticons: :-) ;( :/
• Reliable measure of sentiment
• Simple regex can cover more than 95% of emoticons on Twitter
• Ignores complex emotions
Lengthening
• This talk is greeeeeat! David is the beeeeeeest! Ahhhhhhhhh!
• In English, 3 or more of the same char in a row doesn't occur, except for 7 obscure terms in the unix dict.
• Can indicate heightened emotion, but actual lengths are probably not meaningful.
• Useful to normalize because of how common they are (hiiii → hi)
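Both tricks fit in a couple of regexes. The emoticon pattern below is a rough illustration that certainly does not reach the 95% coverage the slide mentions; the lengthening normalizer collapses runs of 3+ identical characters:

```python
import re

# Rough sideways-emoticon pattern: eyes, optional nose, mouth.
EMOTICON = re.compile(r"[:;=8][-']?[)(/\\|DPpOo3]")

def squash_lengthening(word):
    """Collapse 3+ repeats of a character to one: hiiii -> hi."""
    return re.sub(r"(.)\1{2,}", r"\1", word)

print(bool(EMOTICON.search("That was fun :)")))          # True
print(squash_lengthening("greeeeeat"),
      squash_lengthening("hiiii"))                       # great hi
```

Since legitimate English words almost never triple a character, collapsing to a single occurrence loses nothing while folding "hiiii", "hiii", and "hiiiiii" into one token.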
37. Maybe it's not so hard?
"We are only interested in aggregate sentiment. A high error rate merely implies the sentiment detector is a noisy measurement instrument. With a fairly large number of measurements, these errors will cancel out relative to the quantity we are interested in estimating…"
From Tweets to Polls: Linking Text Sentiment to Public Opinion Time Series
39. Design Decisions
• Use supervised learning. Why? It doesn't require interactive feedback, and learners get about as good as they are going to get with only a few hundred or perhaps a few thousand documents.
• Use naïve Bayes. Why? Dirt simple and understandable. The difference between the best algorithms and a simple naïve Bayes is generally only a few percent.
40. Design Decisions
• Handle lengthening. Greeeat!
• Ignore negation. In the aggregate it won't matter much.
• Supply multiple trained models:
  • Movie reviews (using IMDB ratings)
  • Tweets (using emoticons to create training sets)
  • Please suggest more
41. Summary
• Sentiment analysis helps you understand your customers and marketplace.
• True sentiment analysis is hard.
• Aggregate sentiment analysis is easier but still very valuable.
• The simplest algorithms work almost as well as the most complex, given a few thousand training points.
• Splunk has a Sentiment App. Download it and give feedback.
• Integrate social data into your existing corporate data.
• Share your trained models with others.
42. Teh End
If you're reading this, start clapping. The talk is over.
"splunk now knows when you've been naughty or nice #sentiment"
"I actually learned something! Not."
"#splunk #sentiment niiice."
"keep-it-simple sentiment works #conf2012"
"Worst talk. Ever."
Golf clapping at #sentiment_talk