



  Data Mining and Analysis on
            Twitter




Company Proprietary and Confidential   Copyright Info Goes Here Just Like
This
Professor
                   • Prof. Pascal Frossard

Project Supervisor
                   • Xiaowen Dong

Students
                   • Pulkit Goyal (twitter.com/pulkit110)
                   • Sapan Diwakar (twitter.com/diwakarsapan)




Contents



                   •      Objective
                   •      Twitter at a glance
                   •      Modules
                   •      Data Collection
                   •      Visualization Results
                   •      Community Detection
                   •      Future Mentions on Twitter




Objective

                   • Large amount of new data created every minute on social
                     networking sites
                            – Difficult to obtain and interpret
                            – Collect data to allow for further analysis

                   • Identify online communities of users on Twitter

                   • Explore the reasons for user interactions as a step towards
                     predicting future interactions




Twitter at a glance

                   •      Micro-blogging platform
                   •      Since March 2006
                   •      300 million users (June 2011)
                   •      Status update
                   •      Instant messaging
                   •      Giant chat room




Lingo

                   •      Tweet - A message of 140 characters or fewer
                   •      Retweet - A repeat of a tweet from somebody else
                   •      Hashtag - A #term included in a tweet (used for tracking)
                   •      Reply/Mention - A tweet that mentions another user




Modules

        • Data Collection
                 – Set up a system to collect data based on some constraints

        • Visualization
                 – Build visualizations based on the collected data
                 – Analyze the results

        • Community Detection
                 – Identify communities of users on Twitter based on several different
                   similarity measures

        • Analysis of Future Mentions
                 – Identify factors for future mentions between users on Twitter




Data Collection | Data based on location

      • Collect data based on locations:
               –     London
               –     New York
               –     Paris
               –     San Francisco
               –     Mumbai

      Objectives:
      • Model the spread of interests
               –     Time
               –     Location
               –     Rate of information flow
      • Identify future events
      • Identify landmarks
      • Model relationships among users
               –     Friendship/social connections
               –     Common interests
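
Location-based collection can be sketched as a bounding-box test on geo-tagged tweets. This is only an illustration: the tweet layout (a dict with optional `lat` and `lon` keys), the `city_of` helper, and the rough bounding boxes are assumptions, not the project's actual collection code or coordinates.

```python
# Hypothetical, approximate bounding boxes: (min_lat, min_lon, max_lat, max_lon).
CITY_BOXES = {
    "London": (51.28, -0.51, 51.70, 0.33),
    "Paris": (48.80, 2.22, 48.91, 2.47),
}


def city_of(tweet, boxes=CITY_BOXES):
    """Return the first city whose box contains the tweet, or None."""
    lat, lon = tweet.get("lat"), tweet.get("lon")
    if lat is None or lon is None:
        return None  # tweet is not geo-tagged
    for city, (min_lat, min_lon, max_lat, max_lon) in boxes.items():
        if min_lat <= lat <= max_lat and min_lon <= lon <= max_lon:
            return city
    return None


# Made-up sample tweets (illustration only).
tweets = [
    {"text": "At the National Gallery", "lat": 51.5089, "lon": -0.1283},
    {"text": "Bonjour", "lat": 48.8566, "lon": 2.3522},
    {"text": "no location"},
]

# Group tweet texts by detected city; non-geo-tagged tweets fall under None.
by_city = {}
for t in tweets:
    by_city.setdefault(city_of(t), []).append(t["text"])
```

Tweets without coordinates are kept under the `None` key rather than dropped, so the share of geo-tagged data stays visible.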




Data Collection | Data based on topics

      • Collect data based on keywords
               –     Apple (Tech)
               –     Manchester United (Soccer)

      Objectives:
      • Model the spread of interests
               –     Time
               –     Location
               –     Rate of information flow
      • Identify future events
      • Identify landmarks
      • Model relationships among users
               –     Friendship/social connections
               –     Common interests
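
Keyword-based collection can be illustrated with a case-insensitive match against a keyword list. The keywords below are the Manchester United variants listed on the topic slide of this deck; the matching rule (whole words or phrases only) and the `matches_topic` helper are assumptions, since the deck does not say how matching was done.

```python
import re

KEYWORDS = ["manchesterunited", "manchester united", "manchester utd",
            "man united", "manutd", "man utd", "manu", "mufc"]

# One case-insensitive pattern; \b keeps "manu" from matching "manual".
PATTERN = re.compile(
    r"\b(?:" + "|".join(re.escape(k) for k in KEYWORDS) + r")\b",
    re.IGNORECASE,
)


def matches_topic(text):
    """True if the text contains any keyword as a whole word or phrase."""
    return PATTERN.search(text) is not None
```

Word boundaries are a deliberate choice here: a plain substring test on a short keyword such as "manu" would also accept unrelated words.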




Data Collection | Data from a group of users

      • Collect tweets from a "group of users"
               –     Group of around 25k users
               –     Tweets created by a specified user
               –     Tweets explicitly in reply to a status created by a
                     specified user (reply button pressed)

      Objectives:
      • Model the spread of interests
               –     Time
               –     Location
               –     Rate of information flow
      • Identify future events
      • Identify landmarks
      • Model relationships among users
               –     Friendship/social connections
               –     Common interests

      [Figure: Overview of links we use to collect users]
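
The two inclusion rules for the user group (tweets created by a member, and explicit replies to a member's status) can be written as a small predicate. The field names `user_id` and `in_reply_to_user_id` and the `belongs_to_group` helper are hypothetical; they only loosely mirror the shape of a tweet record.

```python
def belongs_to_group(tweet, group_ids):
    """True if the tweet was created by a group member or is an
    explicit reply to a status created by a group member."""
    created_by_member = tweet.get("user_id") in group_ids
    reply_to_member = tweet.get("in_reply_to_user_id") in group_ids
    return created_by_member or reply_to_member


# Made-up user ids and tweets (illustration only).
group_ids = {101, 102}
stream = [
    {"user_id": 101, "in_reply_to_user_id": None, "text": "hello"},
    {"user_id": 999, "in_reply_to_user_id": 102, "text": "@b thanks"},
    {"user_id": 999, "in_reply_to_user_id": None, "text": "unrelated"},
]
kept = [t["text"] for t in stream if belongs_to_group(t, group_ids)]
```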
Visualization Results | Streets of London




                   • Setup
                            – Geo-tagged tweets for one week (16 to 22 August 2011)
                                       •   111,206 tweets




Visualization Results | Streets of London | 1 week

• Analysis
      •     High density of tweets at famous places/tourist attractions
      •     Clustering of tweets
      •     Content of tweets can be used to predict the place
      •     More tweets along the roads/streets

[Figure: map of geo-tagged tweets in London. Labeled places: National Gallery,
London Waterloo Rail, Big Ben, London Victoria Rail, Oval Cricket Ground, Greenwich]
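
One simple way to surface the high-density spots noted in the analysis is to bin tweet coordinates into a regular grid and count tweets per cell. This is a sketch under assumptions: the cell size, the `density_grid` helper, and the sample points are made up for illustration, and the deck does not state how its density map was produced.

```python
from collections import Counter
from math import floor


def density_grid(points, cell_deg=0.01):
    """Count tweets per grid cell. A cell is identified by integer
    (row, col) indices obtained by flooring lat/lon to the cell size."""
    counts = Counter()
    for lat, lon in points:
        cell = (floor(lat / cell_deg), floor(lon / cell_deg))
        counts[cell] += 1
    return counts


# Made-up coordinates in central London (illustration only).
points = [
    (51.5089, -0.1283),
    (51.5087, -0.1281),
    (51.5031, -0.1132),
]
counts = density_grid(points)
densest_cell, densest_count = counts.most_common(1)[0]
```

Cells with the highest counts are candidates for landmarks; a finer `cell_deg` separates nearby places at the cost of sparser counts.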

Tweets in London | Aggregated by wards

[Figure: tweets in London aggregated by wards. Legend: number of tweets in increasing order]




Tweets about a topic | Manchester United

                   • Setup
                            – Data for two weeks (27 Oct to 8 Nov 2011)
                   • Keywords
                            – "manchesterunited", "manchester united", "manchester utd",
                              "man united", "manutd", "man utd", "manu", "mufc"




Visualization Results | Tweets About Manchester United

Analysis
       •     More tweets in and around Europe
                • Manchester United plays in the English Premier League and has its home ground in Manchester
                • High number of tweets from countries whose players play for Manchester United
       •     High popularity of Manchester United in Indonesia and Malaysia
Tweets about a topic | Apple



                   • Setup
                           – Data for two weeks (27 Oct to 8 Nov 2011)
                   •      Keywords
                            –     "apple", "mac", "macbook", "macbookair", "macbookpro", "os x", "osx",
                                  "osxlion", "ipod", "ipodshuffle", "ipodnano", "ipodclassic", "ipodtouch",
                                  "itunes", "iphone", "iphone3", "iphone3s", "iphone4", "iphone4s",
                                  "iphone5", "ios", "ios4", "ios5", "ipad", "ipad2", "ipad3"




Visualization Results | Tweets About Apple

Analysis
      •     High volume of tweets in the USA and Europe
                • Popularity of Apple products in Europe and the USA
      •     Volume of data as compared to Manchester United
                • 32k geo-located tweets about Apple as opposed to 1.4k for Manchester United
                • Interest in Apple is spread over the world, whereas interest in Manchester United is limited to a few countries
Community Detection | Background


                   •      Community
                            –     A set of users having strong connections.
                            –     Held together by some common interests of a large group of users.


                   •      Similarity Measures
                            –     Users’ Social Connection
                            –     User Mentions
                            –     Description Content Similarity
                            –     Tweet Content Similarity
                            –     Hash-Tag Similarity


                   •      Algorithms for community detection
                            –     Modularity Maximization Clustering
                                    • Spectrum Based
                                    • Greedy Bottom-up Fast Modularity Clustering
                            –     Spectral Clustering
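
Both modularity maximization variants listed above search for a partition with a high modularity score. As a minimal sketch, the function below computes Newman's modularity Q for a given partition of an undirected, unweighted graph. The `modularity` helper and the toy graph are illustrative assumptions; this only scores a partition and is not the clustering algorithm used in the project.

```python
def modularity(edges, community):
    """Newman modularity Q of an undirected, unweighted graph.

    edges: list of (u, v) pairs; community: dict mapping node -> label.
    Q = sum over communities c of  L_c / m - (d_c / (2 m))^2,
    where m is the number of edges, L_c the number of edges inside c,
    and d_c the total degree of the nodes in c.
    """
    m = len(edges)
    internal = {}  # edges with both endpoints in the community
    degree = {}    # total degree per community
    for u, v in edges:
        cu, cv = community[u], community[v]
        degree[cu] = degree.get(cu, 0) + 1
        degree[cv] = degree.get(cv, 0) + 1
        if cu == cv:
            internal[cu] = internal.get(cu, 0) + 1
    return sum(internal.get(c, 0) / m - (d / (2 * m)) ** 2
               for c, d in degree.items())


# Toy graph: two triangles joined by a single edge.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
two_groups = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
one_group = {n: "a" for n in range(6)}
q_split = modularity(edges, two_groups)
q_single = modularity(edges, one_group)
```

Splitting the toy graph into its two triangles scores higher than keeping all nodes in one community, which is the behaviour a modularity maximizer exploits.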




Community Detection | Analysis on small dataset

                   •      Experimental setup
                            –     501 users from three different lists on Twitter
                                       •   List ids 4293757, 12932674 and 33222959
                            –     Tweets collected for 2 weeks
                                       •   26 October 2011 to 7 November 2011

                   •      Goal
                            –     Recover ground truth clusters
                            –     Evaluation based on NMI (normalized mutual information) and RI (Rand index)

                   •      Similarity measures used
                            –     Users’ social connections
                            –     User mentions
                            –     Users’ description content similarity
                            –     Users’ tweet content similarity

                   •      Algorithms used
                            –     Spectrum based Modularity Maximization
                            –     Spectral Algorithm – Normalized Laplacian Matrix

[Figure: spy plot for social connections with users ordered by the list to which they belong]

Analysis on small dataset | Modularity Based Clustering

[Figures: ground truth clusters; clusters for spectrum based modularity maximization
clustering on user connections; clusters for spectrum based modularity maximization
clustering on combined similarity measure]

Results (modularity matrix)

       Similarity Matrix           NMI          RI
       User Connections           0.3868       0.7174
       Mention                    0.0130       0.3398
       Tweet content              0.0074       0.3371
       Description content        0.0780       0.5254
       All combined               0.2500       0.6175

Analysis
•   Social connections most dominating for clustering this group of users.
•   Individual similarity measures perform inaccurately
•   Combined similarity measures not as good as user connections alone
      •   Addition of low information content to user connections decreases accuracy.
•   User behavior not consistent with ground truth.
      •   Post similar content
Analysis on small dataset | Laplacian Based Clustering

[Figures: ground truth clusters; clusters for Normalized Laplacian based spectral
clustering on combined similarity measure]

Results (symmetric normalized Laplacian matrix)

       Similarity Matrix           NMI          RI
       User Connections           0.0077       0.3374
       Mention                    0.0077       0.3374
       Tweet content              0.0077       0.3374
       Description content        0.0088       0.3381
       All combined               0.2931       0.6472

Analysis
•   Clustering on social connections fails.
      •   Laplacian based methods are sensitive to the presence of disconnected nodes.
•   Individual similarity measures (including social connections) fail to reconstruct any
    cluster information.
•   Combined similarity measures give results consistent with the modularity based
    approach.
      •   Addition of different information to the social connections makes it connected.
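
The connectivity argument can be checked directly by counting connected components of each similarity graph and of their union. The sketch below uses made-up users and edges; `count_components` is an illustrative helper, not part of the project.

```python
def count_components(nodes, edges):
    """Number of connected components of an undirected graph,
    found by depth-first search from each unvisited node."""
    neighbors = {n: set() for n in nodes}
    for u, v in edges:
        neighbors[u].add(v)
        neighbors[v].add(u)
    seen, components = set(), 0
    for start in nodes:
        if start in seen:
            continue
        components += 1
        stack = [start]
        seen.add(start)
        while stack:
            node = stack.pop()
            for nxt in neighbors[node] - seen:
                seen.add(nxt)
                stack.append(nxt)
    return components


# Made-up example: each graph alone is disconnected, the union is not.
users = ["a", "b", "c", "d"]
follows = [("a", "b")]               # "c" and "d" are isolated
mentions = [("b", "c"), ("c", "d")]  # "a" is isolated
alone = count_components(users, follows)
merged = count_components(users, follows + mentions)
```

A graph with isolated nodes has several components, which is the situation the slide blames for the failure on individual measures; the union of edges from several measures can be connected even when each one alone is not.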
Community Detection | Analysis on large dataset

                   •      Experimental setup
                            –     11,273 users from the set of all users collected during data collection
                            –     Tweets collected for 4 weeks
                                       •   26 October 2011 to 22 November 2011

                   •      Similarity measures used
                            –     Users’ social connections
                            –     User mentions
                            –     Users’ hashtag similarity
                            –     Users’ tweet content similarity

                   •      Algorithm used
                            –     Bottom-up Fast Modularity Clustering

[Figure: spy plot for social connections]




Analysis on large dataset | Clustering on Social Connections

[Figure: spy plot for social connections with users ordered by the clusters that they are present in]
[Figure: visualization of clustering results]




Analysis on large dataset | Clustering on Social Connections

[Figure: visualization of clustering results]
[Figure: tag cloud 1, frequent keywords in tweets from cluster 2]
[Figure: tag cloud 2, frequent keywords in tweets from cluster 6]

Analysis
•      The largest cluster (cluster 0) contains most of the users from the UK. They are mostly web
       developers/software developers and tweet consistently about these topics.
•      Users in cluster 2 talk mostly about technologies like ‘Google’, ‘server’, ‘SQL’ etc., as shown in
       tag cloud 1.
•      Users in cluster 4 are from the same university in India, ‘IIIT Hyderabad’.
•      Users in cluster 6 are football fans, as shown in tag cloud 2. Most of them support the Italian club
       Juventus.
Analysis on large dataset | Clustering on Combined matrices

[Figures: results for data from week 1; results for week 2; results for week 3;
results for week 4; results for only social connections]

Combined = Connection + Mention + Hashtag + Tweet

Analysis
•   Using combined data leads to much finer clustering results as compared to
    clustering on social connections.
      •   Additional information allowed making division between users who
          weren’t tightly connected.
•   Division into smaller clusters is consistent across the different weeks
      •   Not due to some shifts of interests for a small period of time.
Future Mentions | Reasons for mentions on Twitter

                   •      Social Connections
                            –     Users see the tweets of their friends on their timeline and are therefore
                                  more likely to mention them in their future tweets.
                            –     Mentions should occur only if two users share a ‘following’ or ‘being
                                  followed’ relationship.
                   •      Past Mentions
                            –     Users who have mentioned each other often in the past are more likely to
                                  mention each other in the future.
                            –     Past mentions suggest that the users have had a conversation on
                                  Twitter, which suggests that they share a good relationship.
                   •      Hashtag Similarity
                            –     Hashtags are used to highlight important keywords in tweets and make it
                                  easy to find tweets or set trending topics on Twitter.
                            –     If two users discuss the same topic/keyword (hashtag), they are
                                  more likely to mention each other in the future.
                   •      Tweet Content Similarity
                            –     Users can mention others if they find their tweets interesting.
                            –     Highly similar tweet content means that there is a higher probability of a
                                  mention event between two users.
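
As an illustration of a hashtag similarity feature, the sketch below compares two users by the cosine similarity of their hashtag count vectors. The choice of cosine similarity is an assumption, since the deck does not say which measure was used; the user data and the `cosine_similarity` helper are made up.

```python
from collections import Counter
from math import sqrt


def cosine_similarity(a, b):
    """Cosine similarity between two sparse count vectors (Counters)."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    norm = norm_a * norm_b
    return dot / norm if norm else 0.0


# Made-up hashtag usage per user (illustration only).
alice = Counter(["#mufc", "#epl", "#mufc"])
bob = Counter(["#mufc", "#epl"])
carol = Counter(["#ios5"])

sim_ab = cosine_similarity(alice, bob)
sim_ac = cosine_similarity(alice, carol)
```

The same function can serve tweet content similarity if the Counters hold word counts instead of hashtag counts.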




Future Mentions | Correlation between features and future mentions

Weighted combination = 2*Mention + 5*Hashtag + Tweet Similarity

Correlation between features of week 1 as compared to mentions in week 2

    W1/W2        Mention     Hash Tag    Tweet       Combined    Class
    Mention      1           0.0528      0.003       0.919       0.1656
    Hash Tag     0.0528      1           0.0031      0.4422      0.0565
    Tweet        0.003       0.0031      1           0.0134      0.0272
    Combined     0.919       0.4422      0.0134      1           0.1713
    Class        0.1656      0.0565      0.0272      0.1713      1

Correlation between features of weeks 1, 2 and 3 as compared to mentions in week 4

    W123/W4      Mention     Hash Tag    Tweet       Combined    Class
    Mention      1           0.1428      0.0219      0.8912      0.1906
    Hash Tag     0.1428      1           0.0193      0.5761      0.0861
    Tweet        0.0219      0.0193      1           0.0343      -0.006
    Combined     0.8912      0.5761      0.0343      1           0.1968
    Class        0.1906      0.0861      -0.006      0.1968      1

Correlation between features of week 1 as compared to mentions in week 2, only for users of cluster 1

    W1/W2        Mention     Hash Tag    Tweet       Combined    Class
    Mention      1           0.0343      -0.0062     0.7492      0.1616
    Hash Tag     0.0343      1           -0.0049     0.6876      0.2192
    Tweet        -0.0062     -0.0049     1           -0.0001     -0.0116
    Combined     0.7492      0.6876      -0.0001     1           0.2625
    Class        0.1616      0.2192      -0.0116     0.2625      1

Analysis
•    Past user mentions have a high correlation with mentions in the next week.
•    The combined similarity measure provides some increase in the correlation as
     compared to past mentions.
•    We can improve accuracy by increasing the learning data.
•    Correlation for only one cluster is very good.
       •    1 week of learning data for this cluster outperforms 3 weeks of learning
            data for the complete set of users.
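
The correlation tables can be read with the following sketch in mind: build the weighted combination from the slide for each user pair and compute its Pearson correlation with the class label. The feature values are made up, and treating "Class" as a binary indicator of a mention in the following week is an assumption; the deck does not define it explicitly.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def combined(mention, hashtag, tweet):
    """Weighted combination from the slide: 2*Mention + 5*Hashtag + Tweet."""
    return [2 * m + 5 * h + t for m, h, t in zip(mention, hashtag, tweet)]


# Made-up feature values for four user pairs (illustration only).
mention = [3, 0, 1, 0]
hashtag = [0.2, 0.0, 0.5, 0.1]
tweet = [0.1, 0.3, 0.2, 0.0]
label = [1, 0, 1, 0]  # assumed: 1 if a mention occurred in the next week

score = combined(mention, hashtag, tweet)
r = pearson(score, label)
```

Each off-diagonal entry in the tables corresponds to one such `pearson` call on a pair of columns.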
Future Work

                   •      Landmark detection
                            –     Tweets collected from different cities can be used to identify
                                  landmarks/places of interest in these cities.
                   •      Identify future events
                            –     Algorithms can be developed to identify future events with the help of
                                  tweets collected for different topics.
                   •      Combined similarity measure for community detection
                            –     Different weighted combinations of similarity measures (mentions,
                                  tweets, hashtags, descriptions, social connections etc.) can be used to
                                  improve clustering results.
                   •      Future Mentions
                            –     Causes of mentions such as past mentions and hashtag similarity can be
                                  used to predict future mentions.




Data Mining on Twitter

  • 1. 1 Data Mining and Analysis on Twitter Company Proprietary and Confidential Copyright Info Goes Here Just Like This
  • 2. Professor 2 • Prof. Pascal Frossard Project Supervisor • Xiaowen Dong Students • Pulkit Goyal (twitter.com/pulkit110) • Sapan Diwakar (twitter.com/diwakarsapan) Company Proprietary and Confidential Copyright Info Goes Here Just Like This
  • 3. Contents 3 • Objective • Twitter at a glance • Modules • Data Collection • Visualization Results • Community Detection • Future Mentions on Twitter Company Proprietary and Confidential Copyright Info Goes Here Just Like This
  • 4. Objective 4 • Large amount of new data created every minute on social networking sites. – Difficult to obtain and interpret – Collect data to allow for further analysis • Identify online communities of users on Twitter • Explore reasons of user interactions as a step towards prediction of future interactions Company Proprietary and Confidential Copyright Info Goes Here Just Like This
  • 5. Contents 5 • Objective • Twitter at a glance • Modules • Data Collection • Visualization Results • Community Detection • Future Mentions on Twitter Company Proprietary and Confidential Copyright Info Goes Here Just Like This
  • 6. Twitter at a glance 6 Micro-blogging platform Since March 2006 Status Update 300 Million users (June, 2011) Giant Chat room Instant Messaging Company Proprietary and Confidential Copyright Info Goes Here Just Like This
  • 7. Lingo 7 • Tweet - A message of 140 characters or less • Retweet - Repeat a tweet from somebody else • Hashtag - Tweet that includes a #term (tracking) • Reply/Mention - Mentioning another user in a tweet Company Proprietary and Confidential Copyright Info Goes Here Just Like This
  • 8. Contents 8 • Objective • Twitter at a glance • Modules • Data Collection • Visualization Results • Community Detection • Future Mentions on Twitter Company Proprietary and Confidential Copyright Info Goes Here Just Like This
  • 9. Modules 9 • Data Collection – Setup system to collect data based on some constraints • Visualization – Build some visualizations based on the collected data – Analyze the results • Community Detection – Identify communities of users on Twitter based on several different similarty measures • Analysis of Future Mentions – Identify factors for future mentions between users on twitter. Company Proprietary and Confidential Copyright Info Goes Here Just Like This
  • 10. Contents 10 • Objective • Twitter at a glance • Modules • Data Collection • Visualization Results • Community Detection • Future Mentions on Twitter Company Proprietary and Confidential Copyright Info Goes Here Just Like This
  • 11. Data Collection | Data based on location 11 • Collect data based on locations: Objectives: – London • Model the spread of interests – New York • Time – Paris • Location – San Francisco • Rate of information flow – Mumbai • Identify future events • Identify landmarks • Model Relationships among users • Friendship/Social Connections • Common Interests Company Proprietary and Confidential Copyright Info Goes Here Just Like This
  • 12. Data Collection | Data based on topics 12 • Collect data based on keywords Objectives: – Apple (Tech) • Model the spread of interests – Manchester United (Soccer) • Time • Location • Rate of information flow • Identify future events • Identify landmarks • Model Relationships among users • Friendship/Social Connections • Common Interests Company Proprietary and Confidential Copyright Info Goes Here Just Like This
  • 13. Data Collection | Data from a group of users 13 • Collect tweets from a "group of users" Objectives: – Group of around 25k users • Model the spread of interests • Time • Created by a specified user • Location • Explicitly in-reply-to a status created by a • Rate of information flow specified user (pressed reply button) • Identify future events • Identify landmarks • Model Relationships among users • Friendship/Social Connections • Common Interests Overview of links we use to collect users Company Proprietary and Confidential Copyright Info Goes Here Just Like This
  • 14. Contents 14 • Objective • Twitter at a glance • Modules • Data Collection • Visualization Results • Community Detection • Future Mentions on Twitter Company Proprietary and Confidential Copyright Info Goes Here Just Like This
  • 15. Visualization Results | Streets of London 15 • Setup – Geo-tagged tweets for one week (16 to 22 August 2011) • 111,206 tweets Company Proprietary and Confidential Copyright Info Goes Here Just Like This
  • 16. Visualization Results | Streets of London | 1 week 16 • Analysis • High density of tweets from famous places/tourist attractions • Clustering of tweets • Content of tweets can be used to predict the place • More tweets along the roads/streets National Gallery London Waterloo Rail The Big Ben London Victoria Rail Oval Cricket Ground Greenwich Company Proprietary and Confidential Copyright Info Goes Here Just Like This
  • 17. 17 Company Proprietary and Confidential Copyright Info Goes Here Just Like This
  • 18. Tweets in London | Aggregated by wards 18 No. of tweets in increasing order Company Proprietary and Confidential Copyright Info Goes Here Just Like This
  • 19. Tweets about a topic| Manchester United 19 • Setup – Data for two weeks (27 Oct to 8 Nov 2011) • Keywords – "manchesterunited", "manchester united", "manchester utd", "man united", "manutd", "man utd", "manu", "mufc" Company Proprietary and Confidential Copyright Info Goes Here Just Like This
  • 20. Visualization Results | Tweets About 20 Manchester United Analysis • More tweets in and around Europe • Manchester United plays in the English Premiere League and has homeground in Manchester • High amount of tweets from countries whose players play for Manchester United • High popularity of Manchester United in Indonesia and Malaysia Company Proprietary and Confidential Copyright Info Goes Here Just Like This
  • 21. Tweets about a topic| Apple 21 • Setup – Data for two weeks (27 Oct to 8 Nov 2011) • Keywords – "apple", "mac", "macbook", "macbookair", "macbookpro", "os x", "osx", "osxlion", "ipod", "ipodshuffle", "ipodnano", "ipodclassic", "ipodtouch", "itunes", "iphone", "iphone3", "iphone3s", "iphone4", "iphone4s", "iphone5", "ios", "ios4", "ios5", "ipad", "ipad2", "ipad3" Company Proprietary and Confidential Copyright Info Goes Here Just Like This
  • 22. Visualization Results | Tweets About Apple 22 Analysis • High volume of tweets in USA and Europe • Popularity of apple products in Europe and USA • Volume of data as compared to Manchester United • 32k tweets (with Geo-Location) about Apple as opposed to 1.4k for Manchester United • Interest about Apple spread over the world whereas for Manchester United, it is limited to few countries Company Proprietary and Confidential Copyright Info Goes Here Just Like This
  • 23. Contents 23 • Twitter at a glance • Modules • Data Collection • Visualization Results • Community Detection • Future mentions on Twitter Company Proprietary and Confidential Copyright Info Goes Here Just Like This
  • 24. Community Detection| Background 24 • Community – A set of users having strong connections. – Held together by some common interests of a large group of users. • Similarity Measures – Users’ Social Connection – User Mentions – Description Content Similarity – Tweet Content Similarity – Hash-Tag Similarity • Algorithms for community detection – Modularity Maximization Clustering • Spectrum Based • Greedy Bottom-up Fast Modularity Clustering – Spectral Clustering Company Proprietary and Confidential Copyright Info Goes Here Just Like This
  • 25. Community Detection| Analysis on small dataset 25 • Experimental setup – 501 users from three different lists on twitter • List id 4293757, 12932674 and 33222959 – Tweets collected for 2 weeks • 26th October, 2011 to 7th November 2011 • Goal – Recover ground truth clusters – Evaluation based on NMI and RI • Similarity Measures used – Users’ social connections – User mentions – Users’ Description content similarity – Users’ Tweet content similarity Spy plot for Social connections with users ordered by the list to which they belong • Algorithms used – Spectrum based Modularity Maximization – Spectral Algorithm – Normalized Laplacian Matrix Company Proprietary and Confidential Copyright Info Goes Here Just Like This
  • 26. Analysis on small dataset | Modularity Based Clustering 26 Clusters for spectrum based Clusters for spectrum based Modularity Ground truth clusters modularity maximization clustering on maximization clustering on combined User Connections similarity measure Similarity Matrix Modularity Matrix Analysis • Social connections most dominating for NMI RI clustering this group of users. User Connections 0.3868 0.7174 • Individual similarity measures perform inaccurately Mention 0.0130 0.3398 • Combined similarity measures not as good Tweet content 0.0074 0.3371 as user connections alone • Addition of low information content to user Description content 0.0780 0.5254 connections decreases accuracy. • User behavior not consistent with ground All combined 0.2500 0.6175 truth. Company Proprietary and Confidential Copyright Info Goes Here Just Like • Post similar content This
  • 27. Analysis on small dataset | Laplacian Based Clustering 27 Clusters for Normalized Laplacian based spectral Ground truth clusters clustering on combined similarity measure Symmetric Normalized Analysis Similarity Matrix • Clustering on Social connections fails. Laplacian Matrix • Laplacian based methods are sensitive to NMI RI the presence of disconnected nodes. User Connections 0.0077 0.3374 • Individual similarity measures (including Mention 0.0077 0.3374 social connections) fail to reconstruct any cluster information. Tweet content 0.0077 0.3374 • Combined similarity measures gives results Description content 0.0088 0.3381 consistent with the modularity based approach. All combined 0.2931 0.6472 • Addition of different information to the Company Proprietary and Confidential Copyright Info Goes Here Just Like social connections makes it connected. This
  • 28. Community Detection| Analysis on large dataset 28 • Experimental setup – 11273 users from the set of all users collected during data-collection – Tweets collected for 4 weeks • 26th October, 2011 to 22nd November 2011 • Similarity Measures used – Users’ social connections – User mentions – Users’ Hash tag similarity – Users’ Tweet content similarity • Algorithm used – Bottom up Fast Modularity Clustering Spy plot for Social connections Company Proprietary and Confidential Copyright Info Goes Here Just Like This
  • 29. Analysis on large dataset| Clustering on Social Connections 29 Spy plot for social connections with Visualization of clustering results users ordered by the clusters that they are present in Company Proprietary and Confidential Copyright Info Goes Here Just Like This
  • 30. Analysis on large dataset | Clustering on Social Connections
      Figures: visualization of clustering results; tag cloud 1 (frequent keywords in tweets from cluster 2); tag cloud 2 (frequent keywords in tweets from cluster 6).
      Analysis
      • The largest cluster (cluster 0) contains most of the users from the UK. They are mostly web developers or software developers and tweet consistently about these topics.
      • Users in cluster 2 talk mostly about technologies such as 'Google', 'server' and 'SQL', as shown in tag cloud 1.
      • Users in cluster 4 are from the same university in India, IIIT Hyderabad.
      • Users in cluster 6 are football fans, as shown in tag cloud 2. Most of them support the Italian club Juventus.
  • 31. Analysis on large dataset | Clustering on Combined matrices
      Figures: results for only social connections; results for the combined matrix for week 1, week 2, week 3 and week 4.
      Combined = Connection + Mention + Hashtag + Tweet
      Analysis
      • Using combined data leads to much finer clustering results than clustering on social connections alone.
      • The additional information makes it possible to separate users who are not tightly connected.
      • The division into smaller clusters is consistent across the different weeks.
          – It is therefore not due to shifts of interest over a short period of time.
  • 32. Contents
      • Twitter at a glance
      • Modules
      • Data Collection
      • Visualization Results
      • Community Detection
      • Future mentions on Twitter
  • 33. Future Mentions | Reasons for mentions on Twitter
      • Social Connections
          – Users can see the tweets of their friends on their timeline and are therefore more likely to mention them in future tweets.
          – Mentions should occur only if two users share a 'following' or 'being followed' relationship.
      • Past Mentions
          – Users who have mentioned each other often in the past are more likely to mention each other in the future.
          – Past mentions suggest that the users have had a conversation on Twitter, which means that they share a good relationship.
      • Hashtag Similarity
          – Hashtags are used to highlight important keywords in tweets and make it easy to find tweets or set trending topics on Twitter.
          – If two users discuss the same topic or keyword (hashtag), they are more likely to mention each other in the future.
      • Tweet Content Similarity
          – Users can mention others if they find their tweets interesting.
          – Highly similar tweet content means a higher probability of a mention event between two users.
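The slides do not say how hashtag similarity between two users is measured. As a hedged illustration only, the sketch below uses cosine similarity between the users' hashtag count vectors; the function name and the example hashtags are hypothetical.

```python
from collections import Counter
from math import sqrt


def hashtag_cosine(tags_a, tags_b):
    """Cosine similarity between two users' hashtag usage counts (assumed measure)."""
    counts_a, counts_b = Counter(tags_a), Counter(tags_b)
    # Only hashtags used by both users contribute to the dot product.
    dot = sum(counts_a[t] * counts_b[t] for t in counts_a.keys() & counts_b.keys())
    norm_a = sqrt(sum(c * c for c in counts_a.values()))
    norm_b = sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a user with no hashtags is treated as dissimilar to everyone
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    user_a = ["#juventus", "#seriea", "#juventus"]  # made-up hashtags
    user_b = ["#juventus", "#sql"]
    print(hashtag_cosine(user_a, user_b))
```

The same function applies to tweet content if word counts are passed instead of hashtags, although a real pipeline would usually weight terms (for example with TF-IDF) first.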
  • 34. Future Mentions | Correlation between features and future mentions
      Weighted combination = 2*Mention + 5*Hashtag + Tweet Similarity
      Correlation between features of week 1 as compared to mentions in week 2
      W1/W2       Mention   Hash Tag   Tweet     Combined   Class
      Mention     1         0.0528     0.003     0.919      0.1656
      Hash Tag    0.0528    1          0.0031    0.4422     0.0565
      Tweet       0.003     0.0031     1         0.0134     0.0272
      Combined    0.919     0.4422     0.0134    1          0.1713
      Class       0.1656    0.0565     0.0272    0.1713     1
      Correlation between features of weeks 1, 2 and 3 as compared to mentions in week 4
      W123/W4     Mention   Hash Tag   Tweet     Combined   Class
      Mention     1         0.1428     0.0219    0.8912     0.1906
      Hash Tag    0.1428    1          0.0193    0.5761     0.0861
      Tweet       0.0219    0.0193     1         0.0343     -0.006
      Combined    0.8912    0.5761     0.0343    1          0.1968
      Class       0.1906    0.0861     -0.006    0.1968     1
      Correlation between features of week 1 as compared to mentions in week 2, only for users of cluster 1
      W1/W2       Mention   Hash Tag   Tweet     Combined   Class
      Mention     1         0.0343     -0.0062   0.7492     0.1616
      Hash Tag    0.0343    1          -0.0049   0.6876     0.2192
      Tweet       -0.0062   -0.0049    1         -0.0001    -0.0116
      Combined    0.7492    0.6876     -0.0001   1          0.2625
      Class       0.1616    0.2192     -0.0116   0.2625     1
      Analysis
      • On the complete set of users, past mentions correlate with next-week mentions more strongly than hashtag or tweet similarity do (0.1656 for W1/W2, 0.1906 for W123/W4).
      • The combined similarity measure provides some increase in the correlation compared to past mentions alone.
      • Accuracy can be improved by increasing the learning data.
      • The correlation for only one cluster is very good.
          – With only 1 week of learning data, it outperforms 3 weeks of learning data for the complete set of users (0.2625 against 0.1968).
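The tables above report correlations between each feature and the class (whether a mention occurs in the following period). A minimal sketch of that computation, assuming the Pearson correlation coefficient and the 2/5/1 weights of the weighted combination, is shown below. The per-pair feature values are invented for illustration and are not the project's data.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant feature carries no linear information
    return cov / sqrt(var_x * var_y)


def combined_score(mention, hashtag, tweet, weights=(2.0, 5.0, 1.0)):
    """Weighted combination 2*Mention + 5*Hashtag + Tweet Similarity, one value per user pair."""
    w_m, w_h, w_t = weights
    return [w_m * m + w_h * h + w_t * t for m, h, t in zip(mention, hashtag, tweet)]


if __name__ == "__main__":
    # One entry per user pair; all values below are made up.
    mention_w1 = [3, 0, 1, 0, 2]
    hashtag_w1 = [0.2, 0.0, 0.5, 0.1, 0.0]
    tweet_w1 = [0.1, 0.3, 0.2, 0.0, 0.4]
    mentioned_w2 = [1, 0, 1, 0, 1]  # class: did a mention occur in week 2?

    combined = combined_score(mention_w1, hashtag_w1, tweet_w1)
    for name, feature in [("Mention", mention_w1), ("Hash Tag", hashtag_w1),
                          ("Tweet", tweet_w1), ("Combined", combined)]:
        print(name, round(pearson(feature, mentioned_w2), 4))
```

Computing the correlation of each feature with every other feature in the same way fills in the off-diagonal entries of the tables.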
  • 35. Future Work
      • Landmark detection
          – Tweets collected from different cities can be used to identify landmarks and places of interest in these cities.
      • Identify future events
          – Algorithms can be developed to identify future events with the help of tweets collected for different topics.
      • Combined similarity measure for community detection
          – Different weighted combinations of similarity measures such as mentions, tweets, hashtags, descriptions and social connections can be used to improve clustering results.
      • Future mentions
          – Causes of mentions, such as past mentions and hashtag similarity, can be used to predict future mentions.

Editor's Notes

  1. Only 0.5-1% of tweets contain a GPS location, but this is still enough because there are millions of tweets.
  2. The organisation into groups should be such that similar objects belong to the same cluster whereas there is little or no similarity between objects that belong to different clusters.
  3. Lists are a way of grouping users on Twitter. Users can follow lists to obtain updates from a group of users. The lists used here are @prolificd/met, @rahulkalra_e/entrepreneurs and @8hasin/mildly-interesting.
  4. A reason for the poor performance of the similarity measures based on tweets, descriptions and mentions can be that the users in these groups (@prolificd/met, @rahulkalra_e/entrepreneurs and @8hasin/mildly-interesting) are similar and generally post similar content on the web. This also means that the user behaviours don't seem to be consistent with the ground truth data.
  5. Note that no special ordering is enforced on the users here, so we cannot immediately see any cluster structure in the network.
  6. We can now observe a community structure in the graph, i.e. users have more connections with other users within their community than with users in other communities. Clusters are ordered by the number of users present in each cluster: red is the largest cluster, followed by green, blue, purple and cyan. The positions are just layout; the colors define the distribution of users into clusters. In fact, the top 4 communities in the graph cover more than 93% of the total nodes.
  7. Uses connections, mentions, hashtags and tweet content. Weekly data is used.
  8. If two users discuss the same topic or keyword (hashtag), they are more likely to see each other's tweets and therefore more likely to share a mention relationship in the future. Tweet Content Similarity: here we implicitly assume that users also post about things they are interested in.