This document analyzes existing literature on using social media mining for market analysis through predictive analysis, community detection, and influence propagation. It discusses how social media data can be preprocessed and applied to predictive models to forecast trends. Community detection algorithms can identify online groups with similar interests based on sentiment analysis of opinions. Influence propagation methods aim to target influential users who can activate positive word-of-mouth marketing through their social connections. The document concludes that properly analyzed social media data has predictive power and, when applied to statistical models, can provide insights into customer requirements and help influence purchasing decisions.
2. Introduction
Problem Addressed: Three research areas in Social Media Mining
Predictive Power
Community Detection
Influence Propagation
Focus: Analyze the existing literature and find applications in Social Media for Knowledge Discovery for Market Analysis
3. Background
Fact 1: Facebook had over 1.55 billion active users as of November 2015 (extracted from Statistics Portal – November 2015)
Fact 2: All adults spend at least 2 hours a day on some form of social media network
4. Focus of Research
A rich source of data with human sentiment and behavior
Developed online relationships and groups
Online interactions where people voice their ideas
Understand customer satisfaction and changing customer requirements
Focused marketing campaigns for better results
Influencing consumer behavior effectively via influential users
5. Using Social Media to make Predictions
Progress So Far:
Human Intuition – Can’t be duplicated
Data Based Models – Inadequate data to represent
human cognitive process
SOLUTION: Use data available on social media for
predictive analysis.
6. Using Social Media to make Predictions
Progress So Far:
Yahoo Finance Message Board – Stock market
variability (Antweiler & Frank 2004)
Google Search Queries – Track disease outbreaks
(Ginsberg et al. 2009)
Amazon Reviews – Predicting product sales (Ghose &
Ipeirotis 2011)
7. General Framework for SMM for Predictions
Stage 1: Preprocessing
Social Media data are unstructured
Convert them into high quality structured data, suitable for data mining
Quality (Strong et al. 1997): Objectivity, Completeness, Sufficiency
Stage 2: Predictive Analysis
Develop a model to make accurate predictions on a new set of data (Harold 2013)
Methodologies: Market Models, Survey Models, Statistical Models
8. Data Preprocessing
Step | Problem | Solution
Data Cleaning | Missing values, noise, outliers | Substitution, regression
Data Integration | Entity identification, redundancy | Schema based entity identification, duplicate detection
Data Transformation | Data can’t be used straight away for mining | Generalize, attribute construction
Data Reduction | Large amounts of data require significant processing power | Data cube aggregation, attribute selection
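A minimal Python sketch of three of the preprocessing steps listed above (duplicate detection, substitution of missing values, attribute selection). The record layout, field names and the choice of mean substitution are assumptions made for illustration; the slides do not specify them.

```python
from statistics import mean

# Hypothetical raw records; the field names are assumptions for illustration.
raw = [
    {"user": "a", "text": "Great movie!", "followers": 120},
    {"user": "a", "text": "Great movie!", "followers": 120},  # duplicate
    {"user": "b", "text": "Boring plot", "followers": None},  # missing value
    {"user": "c", "text": "Loved it", "followers": 300},
]

def remove_duplicates(records):
    """Data integration: drop repeated (user, text) pairs, keeping the first."""
    seen, unique = set(), []
    for r in records:
        key = (r["user"], r["text"])
        if key not in seen:
            seen.add(key)
            unique.append(r)
    return unique

def substitute_missing(records, field):
    """Data cleaning: replace missing values with the mean of the known ones."""
    known = [r[field] for r in records if r[field] is not None]
    fill = mean(known)
    return [{**r, field: fill if r[field] is None else r[field]} for r in records]

def select_attributes(records, fields):
    """Data reduction: keep only the attributes needed for mining."""
    return [{f: r[f] for f in fields} for r in records]

clean = select_attributes(
    substitute_missing(remove_duplicates(raw), "followers"),
    ["text", "followers"],
)
```

Mean substitution is only one option; regression-based substitution, also named on the slide, would replace the `mean(known)` step with a fitted model.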
9. Application of Predictions in Market Analysis
Objective: How can the available knowledge be used to make predictions for market analysis, and how successful is it?
Microblogging (Twitter) is most popular
Focus: Twitter data for predicting box
office performance of movies
10. Application of Predictions in Market Analysis
Literature:
Asur & Huberman (2010) used correlation and
regression based models on Twitter data
Leskovec (2011) rectified imperfections which could arise
due to incomplete data
Vasu Jain (2013) used sentiment analysis for predictions
Gaikar & Marakarkandy (2015) introduced a framework for
using Twitter data for sentiment analysis and making
predictions
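As a rough illustration of the correlation and regression based modelling mentioned for Asur & Huberman (2010), the Python sketch below computes a Pearson correlation and a least-squares line. The variable names and the numbers are hypothetical and are not taken from any of the cited studies.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum(
        (x - mx) ** 2 for x in xs
    )
    return slope, my - slope * mx

# Hypothetical data: tweets per hour before release vs. opening revenue.
tweet_rate = [10.0, 20.0, 30.0, 40.0]
revenue = [25.0, 45.0, 65.0, 85.0]

r = pearson(tweet_rate, revenue)
slope, intercept = fit_line(tweet_rate, revenue)
predicted = slope * 50.0 + intercept  # forecast for an unseen tweet rate
```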
11. Application of Predictions in Market Analysis
Gaikar & Marakarkandy (2015)
Predict box office performance of a Bollywood movie as a hit, a flop or average
Predict the opening weekend revenue collection
12. Twitter for Predictions: Methodology
Module 1: Data Extraction
The most trending hashtag on Twitter and
related hashtags are extracted (HashTags.org)
Twitter4j API used to connect and extract
tweets from Twitter servers
Stored in MySQL database
Movie star ratings taken from Times Celebex
A complete set of most relevant data has been
extracted
13. Twitter for Predictions: Methodology
Module 2: Sentiment Analysis
14. Twitter for Predictions: Methodology
Module 3: Predictive Analysis
Predicting movie performance
Input: Sentiment score + Movie Star Rating
Process: Fuzzy Inference based model is
created
Output: Box office movie performance as Hit,
Flop or Average
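The slide does not give the membership functions or the rule base of the fuzzy inference model, so the following Python sketch invents both to show the general shape of such a model. The inputs are assumed to be normalised to [0, 1], the triangular sets and the rule table are hypothetical, and the output is simply the label of the most strongly activated rule rather than a defuzzified value.

```python
def tri(x, a, b, c):
    """Triangular membership function peaking at b, zero outside (a, c)."""
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x <= b else (c - x) / (c - b)

def fuzzify(x):
    # Hypothetical fuzzy sets over a score normalised to [0, 1].
    return {
        "low": tri(x, -0.5, 0.0, 0.5),
        "medium": tri(x, 0.0, 0.5, 1.0),
        "high": tri(x, 0.5, 1.0, 1.5),
    }

# Hypothetical rule base: (sentiment level, star rating level) -> outcome.
RULES = {
    ("high", "high"): "Hit",
    ("high", "medium"): "Hit",
    ("medium", "high"): "Hit",
    ("low", "low"): "Flop",
    ("low", "medium"): "Flop",
    ("medium", "low"): "Flop",
    ("medium", "medium"): "Average",
    ("high", "low"): "Average",
    ("low", "high"): "Average",
}

def predict(sentiment, rating):
    """Min for AND, max to aggregate rules, then pick the strongest outcome."""
    s, r = fuzzify(sentiment), fuzzify(rating)
    strength = {"Hit": 0.0, "Flop": 0.0, "Average": 0.0}
    for (s_level, r_level), outcome in RULES.items():
        activation = min(s[s_level], r[r_level])
        strength[outcome] = max(strength[outcome], activation)
    return max(strength, key=strength.get)
```

For example, `predict(0.9, 0.8)` activates the "high and high" rule most strongly and returns `"Hit"` under these assumed sets.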
15. Twitter for Predictions: Methodology
Module 3: Predictive Analysis
Predicting weekend collection
Input: Hype factor, Shows per day on all
screens, average full house collection
Process:
Output: Estimated opening weekend collection
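The speaker notes describe the hype factor as obtained from the number of distinct users tweeting and their average follower count, but not how the two are combined. The Python sketch below therefore stops at computing those two inputs; the tweet fields are hypothetical.

```python
# Hypothetical tweet records; only the fields needed here are shown.
tweets = [
    {"user": "u1", "followers": 100},
    {"user": "u2", "followers": 300},
    {"user": "u1", "followers": 100},  # second tweet by the same user
]

def hype_inputs(tweets):
    """Return (distinct users tweeting, average follower count per user)."""
    followers_by_user = {t["user"]: t["followers"] for t in tweets}
    distinct_users = len(followers_by_user)
    avg_followers = sum(followers_by_user.values()) / distinct_users
    return distinct_users, avg_followers

distinct_users, avg_followers = hype_inputs(tweets)
```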
16. Twitter for Predictions: Findings & Evaluation
10269 tweets for 14 movies released in a period of six months (relevant, complete, sufficient) were considered
Actor ratings in the month of release were considered
Predictions compared against the real ratings extracted from IMDb (near perfect predictions)
Mean Square Error used to evaluate the effectiveness of the predictive model (<7% error rate)
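For reference, mean square error averages the squared differences between actual and predicted values. A short Python sketch with made-up numbers (not the study's data):

```python
def mean_squared_error(actual, predicted):
    """Average of squared differences between paired values."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

# Hypothetical ratings, for illustration only.
actual = [7.0, 5.0, 8.0]
predicted = [6.0, 5.0, 10.0]
mse = mean_squared_error(actual, predicted)  # (1 + 0 + 4) / 3
```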
18. Twitter for Predictions: Applications
If the predicted revenue < budgeted revenue,
increase marketing and publicity efforts
Can determine the maximum allowable
promotional budget
Limitations:
Only two predictor variables used to predict
box office performance (sentiment score +
actor rating)
Use more variables
19. Using Social Media for Community Detection
60% of the American population chose social media as their first choice for information seeking (Scot et al. 2014)
Social relationships transferred to the internet
Online communities based on similar interests and
opinions have been created
Opinion based community detection can be used to
identify such online communities
20. Using Social Media for Community Detection
Literature:
Park & Cho (2012) identified online communities as an information source for apparel shopping
Dev (2014) proposed an algorithm for community detection in social media based on different interaction methods (no opinion mining)
Kavoura (2014) identified the impact of online communities for communication
Dinsoreanu & Potolea introduced a framework for opinion based community detection in social media
21. Community Detection: Methodology
Data Preparation:
Extracted user comments from blog posts and forums
A classification model for opinion mining was created using a set of labelled documents and the 5 grammar rules introduced by Turney (2002).
Extracted tokens (after filtering) are classified into positive and negative opinions using SVM and NB. A sentiment score is assigned to each token.
Tokens are stored in a structured format (includes the id, holder, opinion keyword, polarity score etc.)
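To make the classification step concrete, here is a small multinomial Naive Bayes classifier in plain Python. It covers only the NB half of the SVM and NB pair named above, uses add-one smoothing as an assumption, and is trained on a handful of invented token lists rather than the labelled documents from the study.

```python
from collections import Counter, defaultdict
from math import log

def train_nb(docs):
    """docs: list of (tokens, label). Returns the counts needed to classify."""
    label_counts = Counter(label for _, label in docs)
    word_counts = defaultdict(Counter)
    for tokens, label in docs:
        word_counts[label].update(tokens)
    vocab = {w for tokens, _ in docs for w in tokens}
    return label_counts, word_counts, vocab

def classify_nb(model, tokens):
    """Return the label with the highest log posterior for the tokens."""
    label_counts, word_counts, vocab = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, float("-inf")
    for label, n_docs in label_counts.items():
        total_words = sum(word_counts[label].values())
        score = log(n_docs / total_docs)  # log prior
        for w in tokens:
            # Add-one (Laplace) smoothing avoids zero probabilities.
            score += log((word_counts[label][w] + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical labelled opinions, for illustration only.
training = [
    (["good", "great", "love"], "positive"),
    (["great", "helpful"], "positive"),
    (["bad", "awful"], "negative"),
    (["bad", "useless", "hate"], "negative"),
]
model = train_nb(training)
```

Calling `classify_nb(model, ["great", "love"])` returns `"positive"` with this toy training set.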
22. Community Detection: Methodology
Opinion based Community Detection:
Identifying communities based on similar interests in multiple targets
Aggregate functions to represent the similarity of opinions in multiple targets
Similarity graphs based on Euclidean distance were drawn
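A sketch, in Python, of how a similarity graph built on Euclidean distance could yield communities. Each user is represented by a vector of polarity scores, one per target. The exponential similarity with a distance cutoff and the use of connected components as communities are assumptions for illustration; the framework's actual aggregate and similarity functions are not reproduced on the slides.

```python
from math import dist, exp  # math.dist requires Python 3.8+

# Hypothetical polarity scores per user for three targets.
opinions = {
    "u1": [0.9, 0.8, -0.7],
    "u2": [0.8, 0.9, -0.6],
    "u3": [-0.9, -0.8, 0.7],
    "u4": [-0.8, -0.7, 0.9],
}

def similarity(a, b, cutoff=1.0):
    """Assumed form: exp(-distance), forced to zero beyond the cutoff."""
    d = dist(a, b)
    return exp(-d) if d <= cutoff else 0.0

def similarity_graph(opinions, cutoff=1.0):
    """Connect two users whenever their similarity is non-zero."""
    users = sorted(opinions)
    graph = {u: set() for u in users}
    for i, u in enumerate(users):
        for v in users[i + 1:]:
            if similarity(opinions[u], opinions[v], cutoff) > 0.0:
                graph[u].add(v)
                graph[v].add(u)
    return graph

def communities(graph):
    """Treat each connected component of the graph as one community."""
    seen, result = set(), []
    for start in sorted(graph):
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(graph[node] - component)
        seen |= component
        result.append(sorted(component))
    return result

groups = communities(similarity_graph(opinions))
```

With these invented scores, users who agree on all three targets end up in the same group and the two opposing camps are separated.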
23. Community Detection: Methodology
Opinion based Community Detection:
Similarity Functions:
24. Community Detection: Findings & Evaluation
1000 labelled documents were used as the training set for NB and SVM
Near perfect classification of opinions can be obtained
A user generated data set was used to apply community detection algorithms
Findings:
Linear functions perform poorly when the number of targets increases
Exponential functions with cutoff perform best with increasing opinions
25. Community Detection: Limitations
A practical application of community detection was not conducted
Suggestion: The proposed framework can be applied in the pharmaceutical industry for online community detection
Background Literature:
“CyberRx” by Radar & Subhan (2013)
26. Community Detection: Potential Application
Aspect | CyberRx | New Approach
Data Collection | Forums and blogs using Google Alerts | Additional sources such as bulletin boards
Keywords Used | Formal names and language | More popular brand names and consumer driven language
Opinion Mining | Manual | Automated (SVM classifier)
Community Detection | Manual | Aggregated functions and similarity graphs
Findings | Two main communities: (1) side effects, medications; (2) changing medication | More specific communities can be identified
27. Community Detection: Potential Application
Knowledge such as,
Most prevalent diseases classified based on
geography and demography
Most popularly used brands of drugs
Competing alternatives for a given drug
Information on specifications, variations, duration and personal experience of side effects (both normal and abnormal)
28. Using Social Media for Influence Propagation
People influence each other via online interactions and
communications
Purchase decisions are heavily influenced by eWoM in
social media networks
34% of Twitter users post product related opinions at
least once a week (ROI Research Institute)
Objective: Target the most influential users on social media to activate a chain of influence driven by eWoM
29. Using Social Media for Influence Propagation
Literature:
Khobzi (2014) conducted a basic content based analysis on Facebook posts, to identify the connection between the sentiment and the popularity of the post
Kaiser et al. (2012) analyzed opinion formation and influential users based on data collected on iPhone reviews
Okazaki et al. (2014) explored the different types of customer engagement in social media networks and their impact on influence propagation
30. Influence Propagation: Methodology
Focus group: IKEA customers
Training set included 300 preprocessed Tweets
Classified manually based on customer emotional
status and content
Emotional Status: Satisfied, Dissatisfied, Neutral
Content: Information, Sharing, Opinion, Question, Reply
Trained NB, KNN, SVM classifiers
NB performed best
31. Influence Propagation: Application
New data set: 4000 tweets
Users were seen as nodes and tweets as their
relationships
Google’s PageRank algorithm was used to determine the relative importance of each user
Findings:
One satisfied user sharing information (positive eWoM)
Three dissatisfied users spreading negative opinions
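The ranking step can be sketched as a plain power-iteration PageRank in Python. The edge list is invented, and the convention that an edge points from the user who tweets at or retweets someone to that user is an assumption; the damping factor 0.85 is the commonly used default, not a value reported in the study.

```python
def pagerank(edges, damping=0.85, iterations=50):
    """Power-iteration PageRank over a list of directed (source, target) edges."""
    nodes = sorted({n for edge in edges for n in edge})
    out_links = {n: [] for n in nodes}
    for src, dst in edges:
        out_links[src].append(dst)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Rank held by nodes without outgoing links is spread evenly.
        dangling = sum(rank[node] for node in nodes if not out_links[node])
        new_rank = {
            node: (1.0 - damping) / n + damping * dangling / n for node in nodes
        }
        for src in nodes:
            for dst in out_links[src]:
                new_rank[dst] += damping * rank[src] / len(out_links[src])
        rank = new_rank
    return rank

# Hypothetical interactions: (user who tweets at or retweets, user addressed).
edges = [
    ("bob", "alice"),
    ("carol", "alice"),
    ("dave", "alice"),
    ("alice", "bob"),
]
ranks = pagerank(edges)
most_influential = max(ranks, key=ranks.get)
```

In this toy graph the user that most others point to receives the highest rank, which is the sense in which PageRank measures relative importance.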
32. Influence Propagation: Suggestions
Conclusion:
Influential Users can be identified
Different customer satisfaction levels are crucial
Suggestions:
Using celebrities and converting their followers into influence
makers.
Additional incentives could be provided to encourage
engagement in discussions
Closely monitor for dissatisfied customers online and
occasionally mediate in retweets, suggesting feasible solutions
and demonstrating commitment
33. Knowledge Discovery in SMM: Conclusions
Consolidates the potential knowledge areas in social media that could be exploited for market analysis: predictive power, community detection, and influence propagation.
Properly preprocessed social media data of acceptable quality, when applied to robust statistical models, could predict future market trends with considerable accuracy.
Social media has taken social relationships to the digital platform and has created opinion based communities online. These can be used to identify genuine consumer requirements.
34. Knowledge Discovery in SMM: Conclusions
People express their genuine consumer experiences on
social media networks which clearly influence
purchasing decisions of other potential consumers.
An efficient framework can identify influential users
online and trigger a chain of positive eWoM promoting
viral marketing.
My focus is to understand how the unstructured data in social media could be transformed into valuable knowledge via the application of social media mining techniques, and how it can be applied in a real world application for market analysis.
Attribute Construction : Create new attributes by combining many
The methodology, findings, limitations and suggestions are presented
Input: the populated DB file and the Keyword DB file
Process: A PLSA classifier implemented in MATLAB was used for sentiment classification; each word was compared with the keywords in the dictionary file
Output: A sentiment score was assigned to each tweet and a total score was taken considering all the tweets (cell2mat)
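The note above describes comparing each word with a keyword dictionary and summing per-tweet scores. The Python sketch below shows only that dictionary-matching idea; it is a plain lexicon lookup, not the PLSA classifier used in the study, and the keyword weights and tweets are invented.

```python
import re

# Hypothetical keyword dictionary: word -> polarity weight.
KEYWORDS = {
    "awesome": 1, "great": 1, "love": 1,
    "boring": -1, "bad": -1, "worst": -1,
}

def tweet_score(text):
    """Sum the weights of all dictionary keywords found in one tweet."""
    words = re.findall(r"[a-z']+", text.lower())
    return sum(KEYWORDS.get(w, 0) for w in words)

tweets = [
    "Awesome movie, love it!",
    "Boring plot",
    "Watching it tonight",
]
scores = [tweet_score(t) for t in tweets]  # one score per tweet
total_score = sum(scores)                  # total over all tweets
```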
Why Probabilistic Latent Semantic Analysis?
Why FIS?
Hype Factor was obtained using the number of distinct users tweeting and their average follower count
(director rating, producer rating, impact of promotional and production budgets)
Features of the data set: Number of key opinions, number of distinct users, degree of membership, maximum allowable similarity between two communities