Data Mining on Twitter

Data Mining and Analysis on Twitter

Company Proprietary and Confidential Copyright Info Goes Here Just Like
This

Professor
• Prof. Pascal Frossard

Project Supervisor
• Xiaowen Dong

Students
• Pulkit Goyal (twitter.com/pulkit110)
• Sapan Diwakar (twitter.com/diwakarsapan)

Contents

• Objective
• Twitter at a glance
• Modules
• Data Collection
• Visualization Results
• Community Detection
• Future Mentions on Twitter

Objective

• Large amounts of new data are created every minute on social networking sites.
– Difficult to obtain and interpret
– Collect data to allow for further analysis

• Identify online communities of users on Twitter

• Explore the reasons for user interactions as a step towards predicting future interactions

Twitter at a glance

• Micro-blogging platform, since March 2006
• Status updates
• 300 million users (June 2011)
• Giant chat room
• Instant messaging

Lingo

• Tweet - A message of 140 characters or less
• Retweet - A repeat of a tweet from somebody else
• Hashtag - A #term included in a tweet, used for tracking a topic
• Reply/Mention - A tweet that mentions another user
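
The terms above map onto simple text patterns. The following is a minimal sketch, assuming only the raw tweet text is available; the function name `parse_tweet`, the regular expressions, and the `RT @user` retweet prefix convention are illustrative assumptions, not part of the project described in these slides.

```python
import re

# Illustrative patterns; real usernames and hashtags have stricter rules.
MENTION_RE = re.compile(r"@(\w+)")
HASHTAG_RE = re.compile(r"#(\w+)")
RETWEET_RE = re.compile(r"^RT @\w+")  # informal "RT @user" convention (assumption)

def parse_tweet(text):
    """Return the mentions, hashtags and retweet flag found in a tweet's text."""
    if len(text) > 140:
        raise ValueError("a tweet is at most 140 characters")
    return {
        "mentions": [m.lower() for m in MENTION_RE.findall(text)],
        "hashtags": [h.lower() for h in HASHTAG_RE.findall(text)],
        "is_retweet": bool(RETWEET_RE.match(text)),
    }

info = parse_tweet("RT @pulkit110: Clustering users with #DataMining on #Twitter")
```

Lower-casing the extracted names is a design choice so that `#Twitter` and `#twitter` count as the same hashtag in later similarity computations.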

Modules

• Data Collection
– Set up a system to collect data based on some constraints

• Visualization
– Build visualizations based on the collected data
– Analyze the results

• Community Detection
– Identify communities of users on Twitter based on several different similarity measures

• Analysis of Future Mentions
– Identify factors for future mentions between users on Twitter.

Data Collection | Data based on location

• Collect data based on locations:
– London
– New York
– Paris
– San Francisco
– Mumbai

Objectives:
• Model the spread of interests
– Time
– Location
– Rate of information flow
• Identify future events
• Identify landmarks
• Model relationships among users
– Friendship/social connections
– Common interests
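
Location-based collection amounts to keeping only geo-tagged tweets whose coordinates fall inside a region of interest. The sketch below assumes a rectangular bounding box; the London coordinates are rough illustrative values and the `coordinates` field of the tweet dictionaries is a hypothetical format, neither taken from the project.

```python
# Approximate bounding box for London as (west, south, east, north) in degrees.
# Illustrative assumption, not a value used in the project.
LONDON_BOX = (-0.51, 51.28, 0.33, 51.69)

def in_box(lon, lat, box):
    """True if the point (lon, lat) lies inside the bounding box."""
    west, south, east, north = box
    return west <= lon <= east and south <= lat <= north

def filter_by_location(tweets, box):
    """Keep only geo-tagged tweets whose coordinates lie inside the box."""
    kept = []
    for tweet in tweets:
        coords = tweet.get("coordinates")  # hypothetical (lon, lat) field
        if coords is not None and in_box(coords[0], coords[1], box):
            kept.append(tweet)
    return kept

sample = [
    {"text": "At Big Ben", "coordinates": (-0.1246, 51.5007)},
    {"text": "Eiffel Tower", "coordinates": (2.2945, 48.8584)},
    {"text": "no location", "coordinates": None},
]
london_tweets = filter_by_location(sample, LONDON_BOX)
```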

Data Collection | Data based on topics

• Collect data based on keywords
– Apple (Tech)
– Manchester United (Soccer)

Objectives:
• Model the spread of interests
– Time
– Location
– Rate of information flow
• Identify future events

Data Collection | Data from a group of users

• Collect tweets from a "group of users"
– Group of around 25k users
– Created by a specified user
– Explicitly in reply to a status created by a specified user (pressed reply button)

Objectives:
• Model the spread of interests
– Time
– Location
– Rate of information flow
• Identify future events

[Figure: Overview of links we use to collect users]

Visualization Results | Streets of London

• Setup
– Geo-tagged tweets for one week (16 to 22 August 2011)
• 111,206 tweets

Visualization Results | Streets of London | 1 week

• Analysis
– High density of tweets from famous places/tourist attractions
– Clustering of tweets
– Content of tweets can be used to predict the place
– More tweets along the roads/streets

[Figure: map of tweets in London with labelled places: National Gallery, London Waterloo Rail, Big Ben, London Victoria Rail, Oval Cricket Ground, Greenwich]

Tweets in London | Aggregated by wards

[Figure: tweets in London aggregated by wards. Legend: No. of tweets in increasing order]

Tweets about a topic | Manchester United

• Setup
– Data for two weeks (27 Oct to 8 Nov 2011)
• Keywords
– "manchesterunited", "manchester united", "manchester utd", "man united", "manutd", "man utd", "manu", "mufc"
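
The slides do not say how keywords were matched against tweet text. One plausible approach, sketched here as an assumption, is case-insensitive matching on word boundaries, so that a short keyword such as "manu" does not fire on unrelated words like "manual". The helper names `build_matcher` and `mentions_topic` are hypothetical.

```python
import re

KEYWORDS = ["manchesterunited", "manchester united", "manchester utd",
            "man united", "manutd", "man utd", "manu", "mufc"]

def build_matcher(keywords):
    """Compile one case-insensitive pattern matching any keyword as a whole word or phrase."""
    # Longest keywords first so multi-word phrases are tried before their prefixes.
    ordered = sorted(keywords, key=len, reverse=True)
    pattern = r"\b(?:" + "|".join(re.escape(k) for k in ordered) + r")\b"
    return re.compile(pattern, re.IGNORECASE)

MATCHER = build_matcher(KEYWORDS)

def mentions_topic(text):
    """True if the tweet text contains at least one of the keywords."""
    return MATCHER.search(text) is not None
```

For example, `mentions_topic("Great win for Man Utd tonight")` matches, while `mentions_topic("Read the manual first")` does not.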

Visualization Results | Tweets About Manchester United

Analysis
• More tweets in and around Europe
• Manchester United plays in the English Premier League and has its home ground in Manchester
• High number of tweets from countries whose players play for Manchester United
• High popularity of Manchester United in Indonesia and Malaysia

Tweets about a topic | Apple

• Setup
– Data for two weeks (27 Oct to 8 Nov 2011)
• Keywords
– "apple", "mac", "macbook", "macbookair", "macbookpro", "os x", "osx", "osxlion", "ipod", "ipodshuffle", "ipodnano", "ipodclassic", "ipodtouch", "itunes", "iphone", "iphone3", "iphone3s", "iphone4", "iphone4s", "iphone5", "ios", "ios4", "ios5", "ipad", "ipad2", "ipad3"

Visualization Results | Tweets About Apple

Analysis
• High volume of tweets in the USA and Europe
• Popularity of Apple products in Europe and the USA
• Volume of data as compared to Manchester United
• 32k tweets (with geo-location) about Apple as opposed to 1.4k for Manchester United
• Interest in Apple is spread over the world, whereas for Manchester United it is limited to a few countries

Community Detection | Background

• Community
– A set of users having strong connections.
– Held together by some common interests of a large group of users.

• Similarity Measures
– Users’ Social Connection
– User Mentions
– Description Content Similarity
– Tweet Content Similarity
– Hash-Tag Similarity

• Algorithms for community detection
– Modularity Maximization Clustering
• Spectrum Based
• Greedy Bottom-up Fast Modularity Clustering
– Spectral Clustering
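
Modularity maximization scores a partition by how many edges fall inside communities compared with what the node degrees alone would predict. The sketch below computes the standard Newman-Girvan modularity Q for an undirected, unweighted graph; the toy graph (two 4-cliques joined by one edge) and all names are illustrative assumptions, not project data.

```python
def modularity(edges, community_of):
    """Newman-Girvan modularity Q of a partition of an undirected, unweighted graph."""
    m = len(edges)
    inside = {}   # community -> number of edges with both ends inside it
    degree = {}   # community -> sum of the degrees of its nodes
    for u, v in edges:
        cu, cv = community_of[u], community_of[v]
        degree[cu] = degree.get(cu, 0) + 1
        degree[cv] = degree.get(cv, 0) + 1
        if cu == cv:
            inside[cu] = inside.get(cu, 0) + 1
    # Q = sum over communities of (fraction of edges inside) - (fraction of edge ends)^2
    return sum(inside.get(c, 0) / m - (degree[c] / (2 * m)) ** 2 for c in degree)

def clique(nodes):
    """All edges of a complete graph on the given nodes."""
    return [(a, b) for i, a in enumerate(nodes) for b in nodes[i + 1:]]

# Toy graph: two 4-cliques joined by a single edge.
edges = clique([0, 1, 2, 3]) + clique([4, 5, 6, 7]) + [(3, 4)]
by_clique = {n: 0 if n < 4 else 1 for n in range(8)}
all_together = {n: 0 for n in range(8)}
q_split = modularity(edges, by_clique)       # partition along the cliques
q_single = modularity(edges, all_together)   # everything in one community
```

Splitting along the two cliques gives a clearly positive Q, while putting every node into one community gives Q = 0, which is why the clustering algorithms search for the partition that maximizes Q.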

Community Detection | Analysis on small dataset

• Experimental setup
– 501 users from three different lists on Twitter
• List ids 4293757, 12932674 and 33222959

– Tweets collected for 2 weeks
• 26th October 2011 to 7th November 2011

• Goal
– Recover ground truth clusters
– Evaluation based on NMI (normalized mutual information) and RI (Rand index)

• Similarity Measures used
– Users’ social connections
– User mentions
– Users’ Description content similarity
– Users’ Tweet content similarity

• Algorithms used
– Spectrum based Modularity Maximization
– Spectral Algorithm – Normalized Laplacian Matrix

[Figure: Spy plot for social connections with users ordered by the list to which they belong]
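
Both evaluation scores compare a predicted clustering against the ground-truth list membership. The sketch below is one common formulation: the Rand index as the fraction of user pairs on which the two labelings agree, and NMI normalized by the geometric mean of the two entropies. Several NMI normalizations exist and the slides do not say which was used, so that choice is an assumption.

```python
from collections import Counter
from itertools import combinations
from math import log, sqrt

def rand_index(truth, predicted):
    """Fraction of item pairs on which two labelings agree (same vs. different cluster)."""
    pairs = list(combinations(range(len(truth)), 2))
    agree = sum((truth[i] == truth[j]) == (predicted[i] == predicted[j])
                for i, j in pairs)
    return agree / len(pairs)

def entropy(labels):
    """Shannon entropy (natural log) of a labeling."""
    n = len(labels)
    return -sum(c / n * log(c / n) for c in Counter(labels).values())

def nmi(truth, predicted):
    """Mutual information divided by sqrt(H(truth) * H(predicted))."""
    n = len(truth)
    joint = Counter(zip(truth, predicted))
    ct, cp = Counter(truth), Counter(predicted)
    mi = sum(c / n * log(c * n / (ct[t] * cp[p])) for (t, p), c in joint.items())
    denom = sqrt(entropy(truth) * entropy(predicted))
    return mi / denom if denom > 0 else 0.0

truth = [0, 0, 1, 1, 2, 2]
relabelled = [2, 2, 0, 0, 1, 1]   # same clusters, different label names
coarse = [0, 0, 0, 1, 1, 1]       # a different, coarser clustering
```

Both scores ignore the label names themselves, so `relabelled` scores as a perfect match to `truth`, while `coarse` scores lower.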

Analysis on small dataset | Modularity Based Clustering

[Figures: ground truth clusters; clusters for spectrum based modularity maximization clustering on User Connections; clusters for spectrum based modularity maximization clustering on combined similarity measure; similarity matrix; modularity matrix]

Similarity measure     NMI      RI
User Connections       0.3868   0.7174
Mention                0.0130   0.3398
Tweet content          0.0074   0.3371
Description content    0.0780   0.5254
All combined           0.2500   0.6175

Analysis
• Social connections most dominating for clustering this group of users.
• Individual similarity measures perform inaccurately
• Combined similarity measures not as good as user connections alone
– Addition of low information content to user connections decreases accuracy.
• User behavior not consistent with ground truth.
– Post similar content
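
A common form of spectrum based modularity maximization is Newman's leading-eigenvector method: build the modularity matrix B with entries B_ij = A_ij - k_i k_j / 2m and split the nodes by the sign of the eigenvector of its largest eigenvalue. The sketch below performs a single two-way split in pure Python using power iteration; whether the project used exactly this variant is an assumption, and recovering three lists would need repeated bisection, which is not shown.

```python
def modularity_matrix(adj):
    """B[i][j] = A[i][j] - k_i * k_j / (2m) for an undirected graph."""
    n = len(adj)
    k = [sum(row) for row in adj]
    two_m = sum(k)
    return [[adj[i][j] - k[i] * k[j] / two_m for j in range(n)] for i in range(n)]

def leading_eigenvector(matrix, iterations=500):
    """Power iteration on (matrix + shift * I).

    The shift is at least the spectral radius (Gershgorin bound), so the
    dominant eigenvector of the shifted matrix belongs to the largest
    eigenvalue of the original matrix.
    """
    n = len(matrix)
    shift = max(sum(abs(x) for x in row) for row in matrix)
    vec = [i - (n - 1) / 2 for i in range(n)]  # arbitrary non-constant start
    for _ in range(iterations):
        new = [sum(matrix[i][j] * vec[j] for j in range(n)) + shift * vec[i]
               for i in range(n)]
        norm = sum(x * x for x in new) ** 0.5
        vec = [x / norm for x in new]
    return vec

def bisect(adj):
    """Split nodes into two groups by the sign of the leading eigenvector."""
    vec = leading_eigenvector(modularity_matrix(adj))
    return [0 if x >= 0 else 1 for x in vec]

# Toy graph (illustrative): two 4-cliques joined by the edge 3-4.
n = 8
adj = [[0] * n for _ in range(n)]
for group in ([0, 1, 2, 3], [4, 5, 6, 7]):
    for a in group:
        for b in group:
            if a != b:
                adj[a][b] = 1
adj[3][4] = adj[4][3] = 1
labels = bisect(adj)
```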

Analysis on small dataset | Laplacian Based Clustering

[Figures: ground truth clusters; clusters for Normalized Laplacian based spectral clustering on combined similarity measure; similarity matrix; symmetric normalized Laplacian matrix]

Similarity measure     NMI      RI
User Connections       0.0077   0.3374
Mention                0.0077   0.3374
Tweet content          0.0077   0.3374
Description content    0.0088   0.3381
All combined           0.2931   0.6472

Analysis
• Clustering on social connections fails.
– Laplacian based methods are sensitive to the presence of disconnected nodes.
• Individual similarity measures (including social connections) fail to reconstruct any cluster information.
• Combined similarity measures give results consistent with the modularity based approach.
– Addition of different information to the social connections makes the graph connected.
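
The sensitivity to disconnected nodes can be seen directly in the symmetric normalized Laplacian L = I - D^(-1/2) A D^(-1/2): a user with no connections has degree zero, so D^(-1/2) is undefined for that row. The sketch below uses small made-up matrices (not project data) to show that a graph with isolated users becomes connected once a second similarity matrix is added. Raising an error for isolated nodes is one possible design choice; other implementations zero out such rows instead.

```python
from collections import deque

def count_components(adj):
    """Number of connected components of an undirected graph (breadth-first search)."""
    n = len(adj)
    seen = [False] * n
    components = 0
    for start in range(n):
        if seen[start]:
            continue
        components += 1
        seen[start] = True
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for v in range(n):
                if adj[u][v] > 0 and not seen[v]:
                    seen[v] = True
                    queue.append(v)
    return components

def normalized_laplacian(adj):
    """L = I - D^(-1/2) A D^(-1/2); rejects graphs with degree-zero nodes."""
    n = len(adj)
    deg = [sum(row) for row in adj]
    if any(d == 0 for d in deg):
        raise ValueError("isolated node: D^(-1/2) is undefined")
    return [[(1.0 if i == j else 0.0) - adj[i][j] / (deg[i] * deg[j]) ** 0.5
             for j in range(n)] for i in range(n)]

# Made-up 4-user example: users 2 and 3 have no social connections.
connections = [[0, 1, 0, 0], [1, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]]
mentions = [[0, 0, 1, 0], [0, 0, 0, 1], [1, 0, 0, 0], [0, 1, 0, 0]]
combined = [[connections[i][j] + mentions[i][j] for j in range(4)] for i in range(4)]

parts_before = count_components(connections)   # isolated users form their own components
parts_after = count_components(combined)       # extra information links everyone
laplacian = normalized_laplacian(combined)
```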

Community Detection | Analysis on large dataset

• Experimental setup
– 11,273 users from the set of all users collected during data collection

– Tweets collected for 4 weeks
• 26th October 2011 to 22nd November 2011

• Similarity Measures used
– Users’ social connections
– User mentions
– Users’ Hashtag similarity
– Users’ Tweet content similarity

• Algorithm used
– Bottom-up Fast Modularity Clustering

[Figure: Spy plot for social connections]
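
Bottom-up modularity clustering starts with every user in its own community and repeatedly merges the pair of connected communities whose merge increases modularity the most. The sketch below is a naive version of that greedy idea for small unweighted graphs; the "fast" algorithm achieves the same merges with sparse data structures and heaps, which are omitted here. The function name and toy graph are illustrative assumptions.

```python
def greedy_modularity(edges):
    """Naive greedy agglomeration: merge the best pair until no merge improves Q."""
    m = len(edges)
    nodes = sorted({n for e in edges for n in e})
    members = {n: {n} for n in nodes}   # community id -> set of nodes
    owner = {n: n for n in nodes}       # node -> community id
    degree = {n: 0 for n in nodes}      # community id -> total degree
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    while True:
        between = {}                    # (ci, cj) -> number of edges between them
        for u, v in edges:
            cu, cv = owner[u], owner[v]
            if cu != cv:
                key = (min(cu, cv), max(cu, cv))
                between[key] = between.get(key, 0) + 1
        best_gain, best_pair = 0.0, None
        for (ci, cj), e_ij in between.items():
            # Modularity gain of merging ci and cj.
            gain = e_ij / m - degree[ci] * degree[cj] / (2 * m * m)
            if gain > best_gain:
                best_gain, best_pair = gain, (ci, cj)
        if best_pair is None:           # no merge increases modularity
            return sorted(sorted(c) for c in members.values())
        ci, cj = best_pair
        for n in members[cj]:
            owner[n] = ci
        members[ci] |= members.pop(cj)
        degree[ci] += degree.pop(cj)

def clique(nodes):
    """All edges of a complete graph on the given nodes."""
    return [(a, b) for i, a in enumerate(nodes) for b in nodes[i + 1:]]

# Toy graph (illustrative): two 4-cliques joined by the edge 3-4.
edges = clique([0, 1, 2, 3]) + clique([4, 5, 6, 7]) + [(3, 4)]
communities = greedy_modularity(edges)
```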

Analysis on large dataset | Clustering on Social Connections

[Figure: Spy plot for social connections with users ordered by the clusters that they are present in]
[Figure: Visualization of clustering results]

Analysis on large dataset | Clustering on Social Connections

[Figure: Visualization of clustering results]
[Tag cloud 1: Frequent keywords in tweets from cluster 2]
[Tag cloud 2: Frequent keywords in tweets from cluster 6]

Analysis
• The largest cluster (cluster 0) contains most of the users from the UK; they are mostly web developers/software developers and talk consistently about these topics.
• Users in cluster 2 talk mostly about technologies like ‘Google’, ‘server’, ‘SQL’ etc., as shown in tag cloud 1.
• Users in cluster 4 are from the same university in India, ‘IIIT Hyderabad’.
• Users in cluster 6 are football fans, as shown in tag cloud 2. Most of them support the Italian club Juventus.

Analysis on large dataset | Clustering on Combined matrices

[Figures: results for week 1, week 2, week 3 and week 4 on the combined matrix; results for only social connections]

Combined = Connection + Mention + Hashtag + Tweet

Analysis
• Using combined data leads to much finer clustering results as compared to clustering on social connections.
– Additional information allowed making division between users who weren’t tightly connected.
• Division into smaller clusters is consistent across different weeks
– Not due to some shifts of interests for a small period of time.
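
The combined matrix is the element-wise sum of the four user-by-user similarity matrices. A minimal sketch follows, with an optional weight per matrix; the 2-user matrices are made-up values, and the function name `combine` is hypothetical.

```python
def combine(matrices, weights=None):
    """Element-wise weighted sum of equally sized square similarity matrices."""
    if weights is None:
        weights = [1.0] * len(matrices)   # plain sum, as on the slide
    n = len(matrices[0])
    return [[sum(w * mat[i][j] for w, mat in zip(weights, matrices))
             for j in range(n)] for i in range(n)]

# Made-up similarity values between two users.
connection = [[0, 1], [1, 0]]
mention = [[0, 2], [2, 0]]
hashtag = [[0, 0.5], [0.5, 0]]
tweet = [[0, 0.25], [0.25, 0]]

combined = combine([connection, mention, hashtag, tweet])
```

With a plain sum, a matrix with a larger numeric range (for example raw mention counts) dominates the others, so rescaling each matrix before summing may be worth considering; the slides do not state whether this was done.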

Future Mentions | Reasons for mentions on Twitter

• Social Connections
– Users can see the tweets of their friends on their timeline and are therefore more likely to mention them in their future tweets.
– Mentions should occur only if two users share a ‘following’ or ‘being followed’ relationship.
• Past mentions
– Users who have mentioned each other often in the past are more likely to mention each other in the future.
– Past mentions suggest that the users may have had a conversation on Twitter, which means that they share a good relationship.
• Hashtag Similarity
– Hashtags are used to highlight important keywords in tweets and make it easy to find tweets or set trending topics on Twitter.
– If two users discuss the same topic/keyword (hashtag), they are more likely to mention each other in the future.
• Tweet Content Similarity
– Users can mention others if they find their tweets to be interesting.
– Highly similar tweet content means that there is a higher probability of a mention event between two users.
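
The slides do not specify how hashtag and tweet content similarity were computed. Two standard choices, shown here purely as assumptions, are the Jaccard index over the sets of hashtags two users have used and cosine similarity over their word counts.

```python
from collections import Counter
from math import sqrt

def jaccard(tags_a, tags_b):
    """Shared hashtags divided by all hashtags used by either user."""
    a, b = set(tags_a), set(tags_b)
    return len(a & b) / len(a | b) if a | b else 0.0

def cosine(words_a, words_b):
    """Cosine similarity between two bag-of-words count vectors."""
    ca, cb = Counter(words_a), Counter(words_b)
    dot = sum(ca[w] * cb[w] for w in ca)
    norm = sqrt(sum(v * v for v in ca.values())) * sqrt(sum(v * v for v in cb.values()))
    return dot / norm if norm else 0.0

# Made-up examples for two users.
hashtag_sim = jaccard(["mufc", "epl"], ["mufc", "ucl"])
tweet_sim = cosine("great goal by rooney".split(), "what a goal".split())
```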

Future Mentions | Correlation between features and future mentions

Weighted combination = 2*Mention + 5*Hashtag + Tweet Similarity

Correlation between features of week 1 as compared to mentions in week 2
W1/W2      Mention   Hash Tag   Tweet     Combined   Class
Mention    1         0.0528     0.003     0.919      0.1656
Hash Tag   0.0528    1          0.0031    0.4422     0.0565
Tweet      0.003     0.0031     1         0.0134     0.0272
Combined   0.919     0.4422     0.0134    1          0.1713
Class      0.1656    0.0565     0.0272    0.1713     1

Correlation between features of week 1, 2 and 3 as compared to mentions in week 4
W123/W4    Mention   Hash Tag   Tweet     Combined   Class
Mention    1         0.1428     0.0219    0.8912     0.1906
Hash Tag   0.1428    1          0.0193    0.5761     0.0861
Tweet      0.0219    0.0193     1         0.0343     -0.006
Combined   0.8912    0.5761     0.0343    1          0.1968
Class      0.1906    0.0861     -0.006    0.1968     1

Correlation between features of week 1 as compared to mentions in week 2, only for users of cluster 1
W1/W2      Mention   Hash Tag   Tweet     Combined   Class
Mention    1         0.0343     -0.0062   0.7492     0.1616
Hash Tag   0.0343    1          -0.0049   0.6876     0.2192
Tweet      -0.0062   -0.0049    1         -0.0001    -0.0116
Combined   0.7492    0.6876     -0.0001   1          0.2625
Class      0.1616    0.2192     -0.0116   0.2625     1

Analysis
• Past user mentions have a high correlation with mentions in the next week (0.1656 for W1/W2, the highest of the individual features).
• Combined similarity measure provides some increase in the correlation as compared to past mentions.
• We can improve accuracy by increasing the learning data.
• Correlation for only one cluster is very good.
– Only 1 week of learning data for cluster 1 outperforms 3 weeks of learning data for the complete set of users.
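
Each table entry is a correlation coefficient between two per-user-pair quantities. The sketch below assumes Pearson correlation and assumes "Class" is a 0/1 indicator of whether a mention occurred in the later week; the slides state neither explicitly. The weights in `combined_score` are the ones given on the slide; all feature values are made up for illustration.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

def combined_score(mention, hashtag, tweet):
    """Weighted combination from the slide: 2*Mention + 5*Hashtag + Tweet Similarity."""
    return 2 * mention + 5 * hashtag + tweet

# Made-up week-1 features for five user pairs and a 0/1 week-2 mention indicator.
mention = [3, 0, 1, 0, 2]
hashtag = [0.2, 0.0, 0.4, 0.1, 0.0]
tweet = [0.1, 0.3, 0.2, 0.0, 0.1]
mentioned_next_week = [1, 0, 1, 0, 1]

combined = [combined_score(m, h, t) for m, h, t in zip(mention, hashtag, tweet)]
r_mention = pearson(mention, mentioned_next_week)
r_combined = pearson(combined, mentioned_next_week)
```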

Future Work

• Landmark detection
– Tweets collected from different cities can be used to identify landmarks/places of interest in these cities.
• Identify future events
– Algorithms can be developed to identify future events with the help of tweets collected for different topics.
• Combined similarity measure for community detection
– Different weighted combinations of similarity measures such as mentions, tweet, hashtag, description and social connections can be used to improve clustering results.
• Future Mentions
– Causes of mentions such as past mentions and hashtag similarity can be used to predict future mentions.
