Project report for a Twitter sentiment analysis in which tweets are collected with Apache Flume and analysed with Hive.
I intend to address the following questions:
How can raw tweets be used to find an audience’s perception of, or sentiment about, a person?
How can Hadoop be used to solve this problem?
How can Apache Hive be used to organize the final data in a tabular format and query it?
How can a data visualization tool be used to display the findings?
Analyze Twitter Data with Hortonworks Hadoop
Intermediate Project Report
Bharat Khanna
UNIVERSITY AT BUFFALO
11/18/2015
Sentiment Analysis of Mr. Narendra Modi’s Brand Image using Twitter Data
Summary: - I am doing a sentiment analysis of Mr. Narendra Modi’s brand image across
different nations using data from Twitter. To fetch the Twitter data, I am using Apache
Flume, which is open source and comes installed by default in the Hortonworks Sandbox
platform 1.3.
After the data is fetched from Twitter, it is loaded directly into HDFS (Hadoop Distributed
File System). This avoids the extra overhead of first storing the data on the local file
system and then transferring it to HDFS.
The data loaded into HDFS is still unstructured and not well suited to ad-hoc analysis, so I
will convert the JSON data to a tabular format and store it in Hive. I will also provide a
graphical user interface so that end users can run their own ad-hoc analyses.
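To illustrate what "converting JSON to a tabular format" means, here is a minimal Python sketch that flattens one JSON-encoded tweet into a flat row. It is only an illustration, not the Hive SerDe used in this project. The field names (id, created_at, text, user.screen_name, user.time_zone, retweet_count) are assumed from the usual Twitter JSON layout, and the sample tweet is made up.

```python
import json


def tweet_to_row(line):
    """Flatten one JSON-encoded tweet into a flat dict (one table row).

    Missing fields become None so that every row has the same columns.
    """
    tweet = json.loads(line)
    user = tweet.get("user") or {}  # nested object; may be absent
    return {
        "id": tweet.get("id"),
        "created_at": tweet.get("created_at"),
        "text": tweet.get("text"),
        "screen_name": user.get("screen_name"),
        "time_zone": user.get("time_zone"),
        "retweet_count": tweet.get("retweet_count", 0),
    }


# Hypothetical sample record, for illustration only.
sample = (
    '{"id": 1, "created_at": "Wed Nov 18 10:00:00 +0000 2015", '
    '"text": "Great speech", '
    '"user": {"screen_name": "alice", "time_zone": "Chennai"}, '
    '"retweet_count": 3}'
)
row = tweet_to_row(sample)
```

Each key of the returned dict corresponds to one column of the table, so nested fields such as the user's screen name can be queried directly.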
The next step uses a dictionary file to score the sentiment of each tweet: the number of
positive words is compared with the number of negative words, and the tweet is then
assigned a positive, negative, or neutral sentiment value. I downloaded the dictionary
file from the link below.
Click here for Dictionary
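The scoring rule described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the function name score_tweet is hypothetical, the tokenization (lowercase words split on non-letters) is my own simplification, and the tiny word sets stand in for the real dictionary file.

```python
import re


def score_tweet(text, positive_words, negative_words):
    """Label a tweet by comparing its positive and negative word counts."""
    # Simplified tokenization: lowercase runs of letters and apostrophes.
    words = re.findall(r"[a-z']+", text.lower())
    positives = sum(1 for word in words if word in positive_words)
    negatives = sum(1 for word in words if word in negative_words)
    if positives > negatives:
        return "positive"
    if negatives > positives:
        return "negative"
    return "neutral"  # equal counts, including no dictionary words at all


# Stand-in word sets; the real lists would come from the dictionary file.
positive_words = {"good", "great", "love"}
negative_words = {"bad", "poor", "hate"}

label_a = score_tweet("Great speech, love it", positive_words, negative_words)
label_b = score_tweet("Bad and poor, but good intent", positive_words, negative_words)
label_c = score_tweet("Flight lands at noon", positive_words, negative_words)
```

With these stand-in sets, the first tweet has two positive words and no negative ones, the second has two negative words against one positive, and the third contains no dictionary words.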
The last part of the project is to show the results of the sentiment analysis as
visualizations, for which I will use Tableau. I will connect Tableau to Hive using the
Hortonworks ODBC Driver, which I downloaded from the Hortonworks website (link in the
References section). I will show the results of the analysis as graphs and maps using
Tableau’s built-in VizQL server.
Data sets and Software:
Sentiment Data: - Sentiment data is unstructured data that represents the opinions,
emotions, and attitudes contained in sources such as social media posts, online blogs, and
product reviews.
Why use sentiment data: - Organizations use sentiment data to learn how people feel about
their products and how they can market those products more effectively.
How I fetched the Twitter data: - I created a Twitter app, configured flume.conf with the app
credentials, and ran Flume. All the steps for fetching data from Twitter using Apache Flume
are described in a YouTube video and a slide deck, linked below. I have also uploaded the
video to the UBlearns discussion forum of DC.
YouTube: - https://youtu.be/E1w5SkE7Cco
SlideShare: - http://www.slideshare.net/bharat3khanna/extracting-twitter-data-using-apache-flume
Source code for Flume-Snapshot.jar: - I downloaded the source code of Flume-Snapshot.jar from GitHub
and built the jar with Maven (mvn package) on the Hadoop cluster.
Click here for Flume Source Code
Size of Data: - Although there is no limit on the amount of data I can get from Twitter, for this
project I am going to do my analysis on approximately 100 MB of data.
Algorithms Used: - I am not using a MapReduce algorithm here, since I want to do the analysis on the
complete data and do not want to use aggregated measures. If I had used MapReduce, much of my
data would have been aggregated by the reducer. My source data is in JSON format, and I am using
Hive-serde.jar (SerDe stands for serializer and deserializer), which helps parse the JSON data into
Hive tables.
Source code for Hive-serde.jar: - I downloaded the source code of Hive-serde.jar from GitHub and built
the jar with Maven (mvn package) on the Hadoop cluster.
Click here for Hive-serde.jar source code
Analysis to be done on Twitter data: - I am going to do the following analyses using Hive and Tableau:
a) Maximum tweet count per user.
b) Count of retweets.
c) Geographic mapping of people’s sentiments towards Mr. Modi.
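The three analyses above amount to simple group-and-count operations. The Python sketch below shows their logic on a handful of made-up rows; in the project itself they would be run in Hive and Tableau. The column names (screen_name, is_retweet, country, sentiment) and the function names are hypothetical.

```python
from collections import Counter


def tweets_per_user(rows):
    """Count tweets per user; the largest count identifies the top user."""
    return Counter(row["screen_name"] for row in rows)


def count_retweets(rows):
    """Count the rows that are flagged as retweets."""
    return sum(1 for row in rows if row["is_retweet"])


def sentiment_by_country(rows):
    """Count tweets for each (country, sentiment) pair, ready for mapping."""
    return Counter((row["country"], row["sentiment"]) for row in rows)


# Made-up rows, for illustration only.
rows = [
    {"screen_name": "alice", "is_retweet": False, "country": "India", "sentiment": "positive"},
    {"screen_name": "bob", "is_retweet": True, "country": "USA", "sentiment": "neutral"},
    {"screen_name": "alice", "is_retweet": True, "country": "India", "sentiment": "positive"},
]

top_user, top_count = tweets_per_user(rows).most_common(1)[0]
retweets = count_retweets(rows)
by_country = sentiment_by_country(rows)
```

The per-country counts are the kind of table a mapping tool needs: one row per country and sentiment value, with the number of tweets as the measure.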
References: -
http://blog.cloudera.com/blog/2012/09/analyzing-twitter-data-with-hadoop
https://github.com/cloudera/cdh-twitter-example
https://www.cs.uic.edu/~liub/FBS/sentiment-analysis.html#lexicon
http://hortonworks.com/products/releases/hdp-1-3/#add_ons