With the tremendous growth of social networks, the amount of new data created every minute on these networking sites has grown as well. The notion of community in this social networking world has attracted a lot of attention. Studying Twitter is useful for understanding how people use new communication technologies to form social connections and maintain existing ones. We analysed how geo-tagged tweets on Twitter can be used to identify useful user features and behavior as well as landmarks/places of interest. We also analysed several clustering algorithms and proposed different similarity measures to detect communities.
1. Data Mining and Analysis on Twitter
Company Proprietary and Confidential Copyright Info Goes Here Just Like
This
2. Professor
• Prof. Pascal Frossard
Project Supervisor
• Xiaowen Dong
Students
• Pulkit Goyal (twitter.com/pulkit110)
• Sapan Diwakar (twitter.com/diwakarsapan)
3. Contents
• Objective
• Twitter at a glance
• Modules
• Data Collection
• Visualization Results
• Community Detection
• Future Mentions on Twitter
4. Objective
• Large amount of new data created every minute on social
networking sites.
– Difficult to obtain and interpret
– Collect data to allow for further analysis
• Identify online communities of users on Twitter
• Explore reasons for user interactions as a step towards prediction of
future interactions
6. Twitter at a glance
• Micro-blogging platform
• Since March 2006
• Status updates
• 300 million users (June 2011)
• Giant chat room
• Instant messaging
7. Lingo
• Tweet - A message of 140 characters or less
• Retweet - Repeat a tweet from somebody else
• Hashtag - A #term included in a tweet, used for tracking topics
• Reply/Mention - Mentioning another user (@username) in a tweet
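The terms above can be illustrated with a small parsing sketch. The regular expressions are a simplification chosen for illustration (Twitter's own tokenization rules are more involved), and `parse_tweet` is a hypothetical helper, not part of the project's code.

```python
import re

# Simplified patterns: '#' or '@' followed by word characters.
# This is an approximation, not Twitter's official entity extraction.
HASHTAG_RE = re.compile(r"#(\w+)")
MENTION_RE = re.compile(r"@(\w+)")


def parse_tweet(text):
    """Return the hashtags, mentions and a retweet flag for a tweet's text."""
    return {
        "hashtags": [tag.lower() for tag in HASHTAG_RE.findall(text)],
        "mentions": [name.lower() for name in MENTION_RE.findall(text)],
        # Manual retweets were conventionally prefixed with "RT @user".
        "is_retweet": text.startswith("RT @"),
    }


example = parse_tweet("RT @pulkit110: Clustering #Twitter users with #modularity")
```

Lower-casing the extracted terms makes later comparisons between users case-insensitive.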
9. Modules
• Data Collection
– Set up a system to collect data based on some constraints
• Visualization
– Build visualizations based on the collected data
– Analyze the results
• Community Detection
– Identify communities of users on Twitter based on several different similarity measures
• Analysis of Future Mentions
– Identify factors for future mentions between users on Twitter.
11. Data Collection | Data based on location
• Collect data based on locations:
– London
– New York
– Paris
– San Francisco
– Mumbai
Objectives:
• Model the spread of interests
– Time
– Location
– Rate of information flow
• Identify future events
• Identify landmarks
• Model relationships among users
– Friendship/social connections
– Common interests
12. Data Collection | Data based on topics
• Collect data based on keywords
– Apple (Tech)
– Manchester United (Soccer)
Objectives:
• Model the spread of interests
– Time
– Location
– Rate of information flow
• Identify future events
• Identify landmarks
• Model relationships among users
– Friendship/social connections
– Common interests
13. Data Collection | Data from a group of users
• Collect tweets from a "group of users"
– Group of around 25k users
• Created by a specified user
• Explicitly in-reply-to a status created by a specified user (pressed reply button)
Objectives:
• Model the spread of interests
– Time
– Location
– Rate of information flow
• Identify future events
• Identify landmarks
• Model relationships among users
– Friendship/social connections
– Common interests
Figure: Overview of links we use to collect users
15. Visualization Results | Streets of London
• Setup
– Geo-tagged tweets for one week (16 to 22 August 2011)
• 111,206 tweets
16. Visualization Results | Streets of London | 1 week
• Analysis
– High density of tweets from famous places/tourist attractions
– Clustering of tweets
– Content of tweets can be used to predict the place
– More tweets along the roads/streets
Map labels: National Gallery, London Waterloo Rail, Big Ben, London Victoria Rail, Oval Cricket Ground, Greenwich
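One simple way to surface such high-density spots is to bin geo-tagged tweets into a regular latitude/longitude grid and count tweets per cell. The sketch below assumes that approach; the slides do not state how the density map was produced, and the cell size and coordinates are hypothetical.

```python
from collections import Counter
import math


def grid_cell(lat, lon, cell_deg=0.01):
    """Map a coordinate to the integer index of the grid cell containing it."""
    return (math.floor(lat / cell_deg), math.floor(lon / cell_deg))


def densest_cells(points, cell_deg=0.01, top=3):
    """Count tweets per grid cell and return the `top` most populated cells."""
    counts = Counter(grid_cell(lat, lon, cell_deg) for lat, lon in points)
    return counts.most_common(top)


# Hypothetical coordinates, not taken from the collected data.
points = [
    (51.5089, -0.1283), (51.5087, -0.1281), (51.5086, -0.1285),  # three nearby tweets
    (51.5031, -0.1132),                                          # one tweet elsewhere
]
top_cell, top_count = densest_cells(points, top=1)[0]
```

Cells with unusually high counts are candidates for landmarks; a smaller `cell_deg` gives finer resolution at the cost of sparser counts.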
18. Tweets in London | Aggregated by wards
Legend: number of tweets in increasing order
19. Tweets about a topic | Manchester United
• Setup
– Data for two weeks (27 Oct to 8 Nov 2011)
• Keywords
– "manchesterunited", "manchester united", "manchester utd", "man united",
"manutd", "man utd", "manu", "mufc"
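A keyword list like this can be turned into a local filter over tweet text. The sketch below is an assumption about how such matching could work, not a description of how the collection system or Twitter's API matched keywords; whole-word, case-insensitive matching is chosen here so that a short keyword such as "manu" does not fire on words like "manual".

```python
import re

KEYWORDS = [
    "manchesterunited", "manchester united", "manchester utd",
    "man united", "manutd", "man utd", "manu", "mufc",
]

# One alternation of all keywords, anchored on word boundaries (an assumption).
PATTERN = re.compile(
    r"\b(?:" + "|".join(re.escape(k) for k in KEYWORDS) + r")\b",
    re.IGNORECASE,
)


def matches_topic(text):
    """Return True if the text contains any of the keywords as whole words."""
    return PATTERN.search(text) is not None
```

For example, `matches_topic("Great win for Man Utd tonight")` is true, while `matches_topic("Read the manual first")` is false.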
20. Visualization Results | Tweets About Manchester United
Analysis
• More tweets in and around Europe
– Manchester United plays in the English Premier League and has its home ground in Manchester
• High number of tweets from countries whose players play for Manchester United
• High popularity of Manchester United in Indonesia and Malaysia
21. Tweets about a topic | Apple
• Setup
– Data for two weeks (27 Oct to 8 Nov 2011)
• Keywords
– "apple", "mac", "macbook", "macbookair", "macbookpro", "os x", "osx",
"osxlion", "ipod", "ipodshuffle", "ipodnano", "ipodclassic", "ipodtouch",
"itunes", "iphone", "iphone3", "iphone3s", "iphone4", "iphone4s",
"iphone5", "ios", "ios4", "ios5", "ipad", "ipad2", "ipad3"
22. Visualization Results | Tweets About Apple
Analysis
• High volume of tweets in the USA and Europe
– Popularity of Apple products in Europe and the USA
• Volume of data as compared to Manchester United
– 32k tweets (with geo-location) about Apple as opposed to 1.4k for Manchester United
– Interest in Apple is spread over the world, whereas for Manchester United it is limited to a few countries
24. Community Detection | Background
• Community
– A set of users having strong connections.
– Held together by some common interests of a large group of users.
• Similarity Measures
– Users’ Social Connection
– User Mentions
– Description Content Similarity
– Tweet Content Similarity
– Hash-Tag Similarity
• Algorithms for community detection
– Modularity Maximization Clustering
• Spectrum Based
• Greedy Bottom-up Fast Modularity Clustering
– Spectral Clustering
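Modularity maximization scores a partition by how many edges fall inside communities compared with what the node degrees alone would predict. As a minimal sketch, assuming an undirected, unweighted graph given as an edge list, the standard modularity Q = Σ_c [L_c/m − (d_c/2m)²] can be computed as follows (the function name and input format are illustrative).

```python
def modularity(edges, community_of):
    """Modularity Q of a partition of an undirected, unweighted graph.

    edges: list of (u, v) pairs, each undirected edge listed once.
    community_of: dict mapping every node to its community label.
    """
    m = len(edges)
    degree = {}
    internal = {}  # community -> number of edges with both ends inside it
    for u, v in edges:
        degree[u] = degree.get(u, 0) + 1
        degree[v] = degree.get(v, 0) + 1
        if community_of[u] == community_of[v]:
            c = community_of[u]
            internal[c] = internal.get(c, 0) + 1
    degree_sum = {}  # community -> sum of the degrees of its nodes
    for node, d in degree.items():
        c = community_of[node]
        degree_sum[c] = degree_sum.get(c, 0) + d
    return sum(
        internal.get(c, 0) / m - (degree_sum[c] / (2 * m)) ** 2
        for c in degree_sum
    )


# Two triangles joined by a single edge.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
q_split = modularity(edges, {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"})
q_single = modularity(edges, {n: "all" for n in range(6)})
```

Splitting the two triangles gives Q = 5/14, while putting every node in one community gives Q = 0, so the split is the better partition under this score.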
25. Community Detection | Analysis on small dataset
• Experimental setup
– 501 users from three different lists on Twitter
• List ids 4293757, 12932674 and 33222959
– Tweets collected for 2 weeks
• 26th October 2011 to 7th November 2011
• Goal
– Recover ground truth clusters
– Evaluation based on normalized mutual information (NMI) and Rand index (RI)
• Similarity Measures used
– Users’ social connections
– User mentions
– Users’ description content similarity
– Users’ tweet content similarity
• Algorithms used
– Spectrum based Modularity Maximization
– Spectral Algorithm – Normalized Laplacian Matrix
Figure: Spy plot for social connections with users ordered by the list to which they belong
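Both evaluation scores compare a predicted clustering with the ground truth lists. The sketch below shows one common way to compute them; the slides do not say which NMI normalization was used, so the geometric-mean normalization here is an assumption.

```python
from collections import Counter
from itertools import combinations
from math import log, sqrt


def rand_index(truth, pred):
    """Fraction of item pairs on which both labelings agree
    (both put the pair in the same cluster, or both separate it)."""
    pairs = list(combinations(range(len(truth)), 2))
    agree = sum((truth[i] == truth[j]) == (pred[i] == pred[j]) for i, j in pairs)
    return agree / len(pairs)


def entropy(labels):
    """Shannon entropy (natural log) of a labeling."""
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())


def nmi(truth, pred):
    """Mutual information normalized by sqrt(H(truth) * H(pred))."""
    n = len(truth)
    joint = Counter(zip(truth, pred))
    count_t, count_p = Counter(truth), Counter(pred)
    mi = sum(
        (c / n) * log(c * n / (count_t[t] * count_p[p]))
        for (t, p), c in joint.items()
    )
    denom = sqrt(entropy(truth) * entropy(pred))
    return mi / denom if denom > 0 else 0.0


truth = [0, 0, 1, 1]
relabelled = [1, 1, 0, 0]   # same grouping, different label names
unrelated = [0, 1, 0, 1]    # grouping independent of the truth
```

Both scores ignore the names of the cluster labels: `relabelled` scores 1 on both, while `unrelated` has NMI 0 and RI 1/3.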
26. Analysis on small dataset | Modularity Based Clustering
Figures:
• Ground truth clusters
• Clusters for spectrum based modularity maximization clustering on User Connections
• Clusters for spectrum based modularity maximization clustering on combined similarity measure
Pipeline: Similarity Matrix → Modularity Matrix

Similarity measure     NMI      RI
User Connections       0.3868   0.7174
Mention                0.0130   0.3398
Tweet content          0.0074   0.3371
Description content    0.0780   0.5254
All combined           0.2500   0.6175

Analysis
• Social connections are the most dominant measure for clustering this group of users.
• The other individual similarity measures perform poorly.
• Combined similarity measures are not as good as user connections alone.
– Addition of low information content to user connections decreases accuracy.
• User behavior is not consistent with the ground truth.
– Users post similar content
27. Analysis on small dataset | Laplacian Based Clustering
Figures:
• Ground truth clusters
• Clusters for Normalized Laplacian based spectral clustering on combined similarity measure
Pipeline: Similarity Matrix → Symmetric Normalized Laplacian Matrix

Similarity measure     NMI      RI
User Connections       0.0077   0.3374
Mention                0.0077   0.3374
Tweet content          0.0077   0.3374
Description content    0.0088   0.3381
All combined           0.2931   0.6472

Analysis
• Clustering on social connections fails.
– Laplacian based methods are sensitive to the presence of disconnected nodes.
• Individual similarity measures (including social connections) fail to reconstruct any cluster information.
• Combined similarity measures give results consistent with the modularity based approach.
– Addition of different information to the social connections makes the graph connected.
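For reference, the symmetric normalized Laplacian is usually defined as follows, with W the similarity matrix and D the diagonal degree matrix. The symbols are the standard textbook ones and are not taken from the slides.

```latex
L_{\mathrm{sym}} = I - D^{-1/2} \, W \, D^{-1/2},
\qquad
D_{ii} = d_i = \sum_{j} W_{ij}
```

A user with no connections has d_i = 0, so D^{-1/2} is not defined for that node without special handling, and every disconnected component adds a zero eigenvalue to the spectrum. This is one way to see why the method is sensitive to disconnected nodes; the slides do not state how such nodes were handled.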
28. Community Detection | Analysis on large dataset
• Experimental setup
– 11273 users from the set of all users collected during data collection
– Tweets collected for 4 weeks
• 26th October 2011 to 22nd November 2011
• Similarity Measures used
– Users’ social connections
– User mentions
– Users’ hashtag similarity
– Users’ tweet content similarity
• Algorithm used
– Bottom-up Fast Modularity Clustering
Figure: Spy plot for social connections
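The bottom-up idea can be sketched as a naive greedy agglomeration: start with every user in their own community and repeatedly merge the connected pair of communities that increases modularity the most. This is only an illustration of the principle on an unweighted edge list with integer node ids; it is not the implementation used in the project, and fast variants rely on more efficient data structures than the full rescan done here.

```python
from collections import defaultdict


def greedy_modularity(edges):
    """Naive greedy agglomeration by modularity gain.

    edges: list of (u, v) pairs with integer node ids, each edge listed once.
    Returns the communities as a sorted list of sorted node lists.
    """
    m = len(edges)
    members = {}            # community id -> set of nodes
    a = defaultdict(float)  # community id -> fraction of edge ends attached to it
    e = defaultdict(float)  # (ci, cj) with ci < cj -> edges between them / (2m)
    for u, v in edges:
        for node in (u, v):
            members.setdefault(node, {node})
            a[node] += 1 / (2 * m)
        if u != v:
            e[(min(u, v), max(u, v))] += 1 / (2 * m)

    while True:
        best_gain, best_pair = 0.0, None
        for (ci, cj), e_ij in e.items():
            gain = 2 * (e_ij - a[ci] * a[cj])  # change in Q if ci and cj merge
            if gain > best_gain:
                best_gain, best_pair = gain, (ci, cj)
        if best_pair is None:  # no merge improves modularity any further
            break
        ci, cj = best_pair
        members[ci] |= members.pop(cj)
        a[ci] += a.pop(cj)
        merged = defaultdict(float)
        for (x, y), w in e.items():
            x = ci if x == cj else x
            y = ci if y == cj else y
            if x != y:  # edges inside the merged community are dropped
                merged[(min(x, y), max(x, y))] += w
        e = merged
    return sorted(sorted(nodes) for nodes in members.values())


# Two triangles joined by a single edge.
communities = greedy_modularity(
    [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
)
```

On this toy graph the procedure stops with the two triangles as separate communities, because merging them across the single bridging edge would lower modularity.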
29. Analysis on large dataset | Clustering on Social Connections
Figures:
• Visualization of clustering results
• Spy plot for social connections with users ordered by the clusters that they are present in
30. Analysis on large dataset | Clustering on Social Connections
Figures:
• Visualization of clustering results
• Tag cloud 1: Frequent keywords in tweets from cluster 2
• Tag cloud 2: Frequent keywords in tweets from cluster 6
Analysis
• The largest cluster (cluster 0) contains most of the users from the UK; they are mostly web/software developers and tweet consistently about these topics.
• Users in cluster 2 talk mostly about technologies like ‘Google’, ‘server’, ‘SQL’ etc., as shown in tag cloud 1.
• Users in cluster 4 are from the same university in India, ‘IIIT Hyderabad’.
• Users in cluster 6 are football fans, as shown in tag cloud 2. Most of them support the Italian club Juventus.
31. Analysis on large dataset | Clustering on Combined matrices
Figures:
• Results for data from week 1
• Results for week 2
• Results for week 3
• Results for week 4
• Results for only social connections
Combined = Connection + Mention + Hashtag + Tweet
Analysis
• Using combined data leads to much finer clustering results as compared to clustering on social connections.
– The additional information allowed divisions between users who weren’t tightly connected.
• The division into smaller clusters is consistent across the different weeks.
– It is not due to shifts of interest over a small period of time.
33. Future Mentions | Reasons for mentions on Twitter
• Social Connections
– Users can see the tweets of their friends on their timeline and are therefore
more likely to mention them in their future tweets.
– Mentions should occur only if two users share a ‘following’ or ‘being
followed’ relationship.
• Past mentions
– Users who have mentioned each other often in the past are more likely to
mention each other in the future.
– Past mentions mean that the users might have had a conversation on
Twitter, which suggests that they share a good relationship.
• Hashtag Similarity
– Hashtags are used to highlight important keywords in tweets and make it
easy to find tweets or set trending topics on Twitter.
– If two users discuss the same topic/keyword (hashtag), they are
more likely to mention each other in the future.
• Tweet Content Similarity
– Users can mention others if they find their tweets interesting.
– Highly similar tweet content means that there is a higher probability of a
mention event between two users.
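As an illustration of the hashtag similarity feature, two users can be compared through the hashtags they used. The slides do not specify the similarity function, so cosine similarity over hashtag counts is an assumption here, and the user data is hypothetical.

```python
from collections import Counter
from math import sqrt


def cosine_similarity(counts_a, counts_b):
    """Cosine similarity between two hashtag-count vectors (0.0 if either is empty)."""
    dot = sum(counts_a[tag] * counts_b[tag] for tag in counts_a if tag in counts_b)
    norm_a = sqrt(sum(c * c for c in counts_a.values()))
    norm_b = sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical hashtag usage for three users.
alice = Counter(["mufc", "epl", "mufc"])
bob = Counter(["mufc", "juventus"])
carol = Counter(["sql", "server"])

sim_ab = cosine_similarity(alice, bob)
sim_ac = cosine_similarity(alice, carol)
```

The same function could be applied to word counts of tweets or profile descriptions to obtain the content-based similarities.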
34. Future Mentions | Correlation between features and future mentions
Weighted combination (Combined) = 2*Mention + 5*Hashtag + Tweet Similarity

Correlation between features of week 1 as compared to mentions in week 2
W1/W2      Mention   Hash Tag  Tweet     Combined  Class
Mention    1         0.0528    0.003     0.919     0.1656
Hash Tag   0.0528    1         0.0031    0.4422    0.0565
Tweet      0.003     0.0031    1         0.0134    0.0272
Combined   0.919     0.4422    0.0134    1         0.1713
Class      0.1656    0.0565    0.0272    0.1713    1

Correlation between features of weeks 1, 2 and 3 as compared to mentions in week 4
W123/W4    Mention   Hash Tag  Tweet     Combined  Class
Mention    1         0.1428    0.0219    0.8912    0.1906
Hash Tag   0.1428    1         0.0193    0.5761    0.0861
Tweet      0.0219    0.0193    1         0.0343    -0.006
Combined   0.8912    0.5761    0.0343    1         0.1968
Class      0.1906    0.0861    -0.006    0.1968    1

Correlation between features of week 1 as compared to mentions in week 2, only for users of cluster 1
W1/W2      Mention   Hash Tag  Tweet     Combined  Class
Mention    1         0.0343    -0.0062   0.7492    0.1616
Hash Tag   0.0343    1         -0.0049   0.6876    0.2192
Tweet      -0.0062   -0.0049   1         -0.0001   -0.0116
Combined   0.7492    0.6876    -0.0001   1         0.2625
Class      0.1616    0.2192    -0.0116   0.2625    1

Analysis
• Past user mentions have the highest correlation with mentions in the next week among the individual features for the complete set of users (0.1656).
• The combined similarity measure provides some increase in the correlation as compared to past mentions (0.1713 vs. 0.1656).
• We can improve accuracy by increasing the learning data (0.1968 with three weeks vs. 0.1713 with one week).
• Correlation is higher when restricted to a single cluster (0.2625 for the combined measure).
– With only 1 week of learning data, it outperforms 3 weeks of learning data for the complete set of users.
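The combined feature and its correlation with the class label can be sketched as follows. The weights 2, 5 and 1 come from the slide; the use of the Pearson coefficient is an assumption (the slides only say "correlation"), and the per-pair feature values are made up for illustration.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def combined_score(mention, hashtag, tweet):
    """Weighted combination from the slide: 2*Mention + 5*Hashtag + Tweet Similarity."""
    return 2 * mention + 5 * hashtag + tweet


# Hypothetical user pairs: (mention, hashtag, tweet, mentioned_in_target_week).
pairs = [
    (3, 0.4, 0.10, 1),
    (0, 0.0, 0.05, 0),
    (1, 0.2, 0.20, 1),
    (0, 0.1, 0.15, 0),
]
combined = [combined_score(m, h, t) for m, h, t, _ in pairs]
labels = [label for *_, label in pairs]
r = pearson(combined, labels)
```

Here `labels` plays the role of the Class column: 1 if the pair shares a mention in the target week, 0 otherwise.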
35. Future Work
• Landmark detection
– Tweets collected from different cities can be used to identify
landmarks/places of interest in these cities.
• Identify future events
– Algorithms can be developed to identify future events with the help of
tweets collected for different topics.
• Combined similarity measure for community detection
– Different weighted combinations of similarity measures like mentions,
tweet, hashtag, description and social connection etc. can be used to
improve clustering results.
• Future Mentions
– Causes of mentions like past mentions, hashtag similarity etc. can be
used to predict future mentions.
Only 0.5-1% of tweets contain a GPS location, but this is still enough because there are millions of tweets.
The organisation into groups should be such that similar objects belong to the same cluster whereas there is little or no similarity between objects that belong to different clusters.
Lists are a way of grouping users on Twitter. Users can follow lists to obtain updates from a group of users. The three lists used (ids 4293757, 12932674 and 33222959) are @prolificd/met, @rahulkalra_e/entrepreneurs and @8hasin/mildly-interesting respectively.
A reason for the bad performance of the similarity measures based on tweets, descriptions and mentions may be that the users in the three lists (@prolificd/met, @rahulkalra_e/entrepreneurs and @8hasin/mildly-interesting) are similar and generally post similar content on the web. This also means that user behavior is not consistent with the ground truth data.
Note that there is no special ordering enforced on the users here so we cannot immediately see some cluster structure in the network.
We can now observe a community structure in the graph, i.e. the users have more connections within their community than with users in other communities. Clusters are ordered by the number of users present in each cluster: red is the largest cluster, followed by green, blue, purple and cyan. The positions are just layout; the colors define the distribution of users into clusters. In fact, the top 4 communities in the graph cover more than 93% of the total nodes.
Uses connections, mentions, hashtags and tweet content. Uses weekly data.
If two users discuss the same topic/keyword (hashtag), they are more likely to see each other’s tweets and therefore more likely to share a mention relationship in the future. Tweet Content Similarity: Here we implicitly assume that users post about things they are interested in.