Social Data Mining

Toby Segaran
About Me




           http://kiwitobes.com
What is data mining?

   Implicit
   Unknown
   Useful
What is data?
Data-mining traditional uses
Why it’s important now




          Data




 All products are actually sold on Amazon
 Facebook      Google
For Social Insight


Home Prices   Blogs and News   Movie Data




 Fashion      Product Prices     Hotties
Blogs…
The Technorati Top 100
Getting the content
                The
                Six
                Degrees
                Hypothesis
                Experienced
                It
                Is
                When
                You
                Travel
Building a Word Matrix
Words         Kept          Count
The
Six           Six           3
Degrees       Degrees       3
Hypothesis    Hypothesis    1
Experienced   Experienced   5
It
Is
When
You
Travel        Travel        6
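A minimal sketch of the counting step, assuming the dropped words (The, It, Is, When, You) are filtered with a stop-word list. The slides do not say how the filtering was done, so `STOP_WORDS` and `word_counts` are hypothetical names used only for illustration.

```python
import re
from collections import Counter

# Hypothetical stop-word list; the slides do not state the actual filter.
STOP_WORDS = {"the", "it", "is", "when", "you"}

def word_counts(text):
    """Count the words in one blog's text, skipping the stop words."""
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(w for w in words if w not in STOP_WORDS)

# Made-up sample text, only to exercise the function.
counts = word_counts("The Six Degrees hypothesis: six degrees, when you travel")
print(counts.most_common())
```

Running this once per blog gives one row of the word matrix per blog.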
The Word Matrix
                  “china”   “kids”   “music”   “travel”   “yahoo”
Gothamist         0         3        3         3          0
GigaOM            6         0        1         4          2
QuickOnlineTips   0         2        2         0          12
O’Reilly Radar    1         0        3         6          4
Determining distance
                         “china”   “kids”   “music”      “yahoo”




     Gothamist           0         3        3            0



     GigaOM              6         0        1            2



     Quick Online Tips   0         2        2            12




Euclidean “as the crow flies”

GigaOM vs. Quick Online Tips:

    \sqrt{(6 - 0)^2 + (0 - 2)^2 + (1 - 2)^2 + (2 - 12)^2} = \sqrt{141} \approx 12
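The same distance can be checked in a few lines of Python. This is a small sketch using the GigaOM and Quick Online Tips rows from the table above; the function name `euclidean` is chosen here for illustration.

```python
import math

def euclidean(a, b):
    """Straight-line distance between two equal-length count vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Counts for "china", "kids", "music", "yahoo" from the slide.
gigaom = [6, 0, 1, 2]
quick_online_tips = [0, 2, 2, 12]

distance = euclidean(gigaom, quick_online_tips)
print(round(distance, 2))  # 11.87, i.e. about 12
```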
Hierarchical Clustering

 Find the two closest items
 Combine them into a single item
 Repeat…
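The three steps above can be sketched as follows. This is one possible reading, not necessarily the talk's implementation: it assumes Euclidean distance and assumes a merged item is represented by the average of its two members. The name `hcluster` and the nested-tuple tree format are illustrative choices.

```python
import math

def hcluster(rows, labels):
    """Merge the two closest items until one remains; returns a nested tuple tree."""
    # Each cluster is (tree, vector); a leaf's tree is just its label.
    clusters = [(label, list(row)) for label, row in zip(labels, rows)]
    while len(clusters) > 1:
        # Find the two closest items.
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = math.dist(clusters[i][1], clusters[j][1])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        (tree_i, vec_i), (tree_j, vec_j) = clusters[i], clusters[j]
        # Combine them into a single item (assumption: average the vectors).
        merged = ((tree_i, tree_j), [(a + b) / 2 for a, b in zip(vec_i, vec_j)])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters[0][0]

# Tiny made-up example: "a" and "b" are close, "c" is far away.
tree = hcluster([[0], [1], [10]], ["a", "b", "c"])
print(tree)  # ('c', ('a', 'b'))
```

The nested tuples record the merge order, which is the information a dendrogram draws.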
Hierarchical Algorithm
Dendrogram
Hierarchical Blog Clusters
Rotating the Matrix

   Words in a blog -> blogs containing each word


             Gothamist     GigaOM        Quick Onl
china        0             6             0
kids         3             0             2
music        3             1             2
Yahoo        0             2            12
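Rotating the matrix is a transpose: rows become columns. A short sketch with the counts from the table above; the variable names are illustrative.

```python
blogs = ["Gothamist", "GigaOM", "Quick Online Tips"]
words = ["china", "kids", "music", "yahoo"]

# One row per blog, one column per word.
by_blog = [
    [0, 3, 3, 0],
    [6, 0, 1, 2],
    [0, 2, 2, 12],
]

# zip(*rows) pairs up the n-th entry of every row, giving one row per word.
by_word = [list(column) for column in zip(*by_blog)]

for word, row in zip(words, by_word):
    print(word, row)
```

The same clustering code can then be run on `by_word` to group words instead of blogs.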
Hierarchical Word Clusters
K-Means Clustering

 Divides data into distinct clusters
 User determines how many
 Algorithm
   Start with arbitrary centroids
   Assign points to centroids
   Move the centroids
   Repeat
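A minimal sketch of the loop described above, under two assumptions that are not in the slides: the "arbitrary" starting centroids are simply the first k points, and "move the centroids" means moving each one to the mean of the points assigned to it. The name `kmeans` is illustrative.

```python
import math

def kmeans(points, k, max_iterations=100):
    """Return (assignment, centroids) for k clusters of the given points."""
    # Start with arbitrary centroids (assumption: the first k points).
    centroids = [list(p) for p in points[:k]]
    assignment = None
    for _ in range(max_iterations):
        # Assign each point to its nearest centroid.
        new_assignment = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # nothing changed, so the clusters are stable
        assignment = new_assignment
        # Move each centroid to the mean of its assigned points.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, centroids

# Made-up 2D points forming two obvious groups.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
assignment, centroids = kmeans(points, 2)
print(assignment)  # [0, 0, 0, 1, 1, 1]
```

Different starting centroids can give different clusters, so in practice the algorithm is often run several times.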
K-Means Algorithm
K-Means Results

1                           2

The Viral Garden            Wonkette
Copyblogger                 Gawker
Creating Passionate Users   Gothamist
Oilman                      Huffington Post
ProBlogger Blog Tips
Seth's Blog
2D Visualizations

 Instead of Clusters, a 2D Map
 Goals
   Preserve distances as much as possible
   Draw in two dimensions
 Dimension Reduction
   Principal Components Analysis
   Multidimensional Scaling
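One simple way to approximate multidimensional scaling is to start from random 2D positions and nudge each pair of points together or apart until their 2D distances are close to the target distances. The sketch below does this with plain gradient descent. It is an illustration under stated assumptions rather than the method used in the talk: `scale_down`, `stress`, the step size `rate`, and the iteration count are all choices made here, and `rate` may need to be lowered for larger datasets.

```python
import math
import random

def stress(coords, target):
    """Sum of squared differences between 2D distances and target distances."""
    n = len(coords)
    return sum(
        (math.dist(coords[i], coords[j]) - target[i][j]) ** 2
        for i in range(n) for j in range(i + 1, n)
    )

def scale_down(target, rate=0.05, iterations=1000, seed=0):
    """Place one 2D point per item so pairwise distances approximate target."""
    rng = random.Random(seed)
    n = len(target)
    coords = [[rng.random(), rng.random()] for _ in range(n)]
    for _ in range(iterations):
        moves = [[0.0, 0.0] for _ in range(n)]
        for i in range(n):
            for j in range(n):
                if i == j:
                    continue
                d = math.dist(coords[i], coords[j]) or 1e-9
                gap = d - target[i][j]  # positive: too far apart in 2D
                for k in range(2):
                    moves[i][k] += gap * (coords[i][k] - coords[j][k]) / d
        for i in range(n):
            for k in range(2):
                coords[i][k] -= rate * moves[i][k]
    return coords

# Distances between Gothamist, GigaOM, and Quick Online Tips,
# computed from the four-word counts on the "Determining distance" slide.
rows = [[0, 3, 3, 0], [6, 0, 1, 2], [0, 2, 2, 12]]
target = [[math.dist(a, b) for b in rows] for a in rows]
coords = scale_down(target)
```

Plotting `coords` gives the 2D map; `stress(coords, target)` measures how much distance information was lost.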
Multidimensional Scaling
Zillow
The Zillow API

 Allows querying by address
 Returns information about the property
   Bedrooms
   Bathrooms
   Zip Code
   Price Estimate
   Last Sale Price
A home price dataset

House   Zip     Bathrooms   Bedrooms   Built   Type      Price
A       02138   1.5         2          1847    Single    505296
B       02139   3.5         9          1916    Triplex   776378
C       02140   3.5         4          1894    Duplex    595027
D       02139   2.5         4          1854    Duplex    552213
E       02138   3.5         5          1909    Duplex    947528
F       02138   3.5         4          1930    Single    2107871

etc.
What can we learn?

 A made-up house's price
 How important is Zip Code?
 What are the important attributes?

 Can we do better than averages?
Introducing Regression Trees
A    B        Value
10   Circle   20
11   Square   22
22   Square   8
18   Circle   6
Minimizing deviation
         Standard deviation is the “spread” of results
         Try all possible divisions
         Choose the division that decreases deviation the most

A    B           Value
10   Circle      20
11   Square      22
22   Square      8
18   Circle      6

Initially
 Average = 14
 Standard Deviation = 8.2
Minimizing deviation

B = Circle
 Average = 13
 Standard Deviation = 9.9
B = Square
 Average = 15
 Standard Deviation = 9.9
Minimizing deviation

A > 18
 Average = 8
 Standard Deviation = 0
A <= 18
 Average = 16
 Standard Deviation = 8.7
Minimizing deviation

A > 11
 Average = 7
 Standard Deviation = 1.4
A <= 11
 Average = 21
 Standard Deviation = 1.4
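The search over divisions can be sketched in Python using the four-row table from these slides. The figures on the slides match the sample standard deviation, with a single-row group counted as 0. How the two groups' deviations are combined into one score is not stated, so the size-weighted average below is an assumption, and `spread`, `candidate_splits`, and `best_split` are names invented for this sketch.

```python
from statistics import stdev

# (A, B, Value) rows from the slides.
ROWS = [(10, "Circle", 20), (11, "Square", 22), (22, "Square", 8), (18, "Circle", 6)]

def spread(values):
    """Sample standard deviation; a single value has no spread."""
    return stdev(values) if len(values) > 1 else 0.0

def candidate_splits(rows):
    """Yield (label, test) for every threshold on A and every value of B."""
    for a in sorted({r[0] for r in rows}):
        yield f"A <= {a}", (lambda r, a=a: r[0] <= a)
    for b in sorted({r[1] for r in rows}):
        yield f"B = {b}", (lambda r, b=b: r[1] == b)

def best_split(rows):
    """Return (score, label) of the split with the lowest combined deviation."""
    best = None
    for label, test in candidate_splits(rows):
        left = [r[2] for r in rows if test(r)]
        right = [r[2] for r in rows if not test(r)]
        if not left or not right:
            continue  # not a real division
        # Assumption: weight each group's deviation by its size.
        score = (len(left) * spread(left) + len(right) * spread(right)) / len(rows)
        if best is None or score < best[0]:
            best = (score, label)
    return best

score, label = best_split(ROWS)
print(label, round(score, 1))  # A <= 11 1.4
```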
CART Algorithm

A    B        Value
10   Circle   20
11   Square   22
22   Square   8
18   Circle   6

Split on A <= 11:

A <= 11               A > 11
10   Circle   20      22   Square   8
11   Square   22      18   Circle   6
Zillow Results

Bathrooms > 3
    Zip: 02139?
        Zip: 02140?
        Bedrooms > 4?
    After 1903?
        Duplex?
        Triplex?
Just for Fun… Hot or Not
Variance dividers

[Bar chart. Vertical axis: 0 to 9. Categories: Northeast, South, Male, Female. Labels: Low Variance Split, High Variance Split.]
Supervised and Unsupervised
 Clustering methods are unsupervised
   There are no answers
   Methods just characterize the data
   Show interesting patterns
 Regression Trees are supervised
   “answers” are in the dataset
   Tree models predict answers
Personal Ads
The Analysis
         Five Cities




       W4M Personal Ads
Bayesian filter

If you listen to NPR, watch Hardball, and love the Red Sox, you may be the guy for me.

Please email me back.

I'm a professional with a grad school degree who has a sense of humor and loves the Sox.

Boston
Sox            0.4
Red            0.35
Grad           0.2
Professional   0.1
Humor          0.1
Bayesian filter

      P(C | W) = P(C & W) / P(W)

      P(C & W): how often do the word and the city appear together?
      P(W): how often does the word appear overall?

Rank these, and you have a list of the words most particular to a given city
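A small sketch of this calculation, assuming each ad is a (city, text) pair and a word is counted once per ad it appears in. The slides do not specify the counting scheme, and `city_word_scores` and the sample ads are made up for illustration. Because both probabilities share the same denominator (the number of ads), P(C | W) reduces to a ratio of counts.

```python
from collections import Counter, defaultdict

def city_word_scores(ads):
    """Return {city: {word: P(city | word)}} from (city, text) pairs."""
    together = defaultdict(Counter)  # city -> word -> ads containing both
    overall = Counter()              # word -> ads containing the word
    for city, text in ads:
        for word in set(text.lower().split()):
            together[city][word] += 1
            overall[word] += 1
    return {
        city: {word: n / overall[word] for word, n in counts.items()}
        for city, counts in together.items()
    }

# Made-up ads, only to exercise the function.
ads = [
    ("Boston", "love the sox"),
    ("Boston", "sox fan"),
    ("Chicago", "love the cubs"),
    ("Chicago", "sox"),
]
scores = city_word_scores(ads)

# Rank Boston's words by how particular they are to Boston.
ranked = sorted(scores["Boston"].items(), key=lambda item: item[1], reverse=True)
print(ranked)
```

With real data, words seen in only a handful of ads would likely need a minimum-count cutoff, since a word seen once scores 1.0 for its city.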
Results
New York         Boston           Chicago
Mets             Pink             Cubs
Lounges          Sox              Burbs
Offense          Poetry           Bears
Desires          Intellectually   Girlie
Musical          Punk             Insecure
Submissive       Appreciation     Cheat
Create           Exercise         Importance
Song             Winter           Blunt
Oral             Education        Mouth
Results
Los Angeles     San Francisco
Excellent       Tee
Vegas           Employment
Meaningful      Picnic
Star            STD
Lame            Tasting
Industry        Hikes
Heat            French
Fitness         .com
Entertainment   Kayaking
Latino          Cycling
Newsgroup Discussion
Overlapping themes
Themes in a document
Another word matrix
            Msg1   Msg2     Msg3          Msg4   Msg5

Gym          2      0          0           3      0

Calorie      0      2          1           1      3

Weigh        1      0          2           0      0

Carbs        0      3          0           0      2

Treadmill    1      0          0           2      0

                          Actual Matrix
Weights and features

              F1   F2   F3
Gym           0    1    2
Calorie       2    0    1
Weigh         2    2    1
Carbs         1    0    3
Treadmill     0    1    2

          Features Matrix

                 x

      Msg1   M2   M3   M4   M5
F1    1      0    2    3    0
F2    0      2    1    1    3
F3    1      0    2    0    0

          Weight Matrix
Matrix factorization
                F1    F2     F3
Gym               0    1     2                     Msg1      M2     M3   M4   M5
Calorie           2    0     1               F1        1        0   2    3    0

                                     x
Weigh             2    2     1               F2        0        2   1    1    3
Carbs             1    0     3               F3        1        0   2    0    0
Treadmill         0    1     2
                                                            Weight Matrix
          Features Matrix

                      Msg1   Msg2    Msg3         Msg4     Msg5
      Gym              1         3       3         0        1
      Calorie          0         2       4         1        3
      Weigh            2         3       1         0        1
      Carbs            0         1       1         0        2
      Treadmill        3         2       0         2        2


                             Current Guess
Matrix factorization
                F1    F2       F3
Gym               0   1        2                   Msg1    M2    M3   M4        M5
Calorie           2   0        1             F1     1      0     2     3        0

                                        x
Weigh             2   2        1             F2     0      2     1     1        3
Carbs             1   0        3             F3     1      0     2     0        0
Treadmill         0   1        2
                                                          Weight Matrix
          Features Matrix


                      Msg1   Msg2   Msg3   Msg4   Msg5
        Gym            1      3      3      0      1
        Calorie        0      2      4      1      3
        Weigh          2      3      1      0      1
        Carbs          0      1      1      0      2
        Treadmill      3      2      0      2      2

                             Current Guess

                      Msg1   Msg2   Msg3   Msg4   Msg5
        Gym            2      0      0      3      0
        Calorie        0      2      1      1      3
        Weigh          1      0      2      0      0
        Carbs          0      3      0      0      2
        Treadmill      1      0      0      2      0

                             Target Result
Matrix factorization
                F1    F2       F3
Gym               1   0        0                   Msg1    M2    M3   M4        M5
Calorie           0   1        1             F1     2      0     0     1        0

                                        x
Weigh             0   0        2             F2     0      2     0     1        3
Carbs             0   1        0             F3     1      0     1     0        0
Treadmill         1   0        0
                                                          Weight Matrix
          Features Matrix


                      Msg1   Msg2   Msg3   Msg4   Msg5
        Gym            2      0      0      3      0
        Calorie        0      2      1      1      3
        Weigh          1      0      2      0      0
        Carbs          0      3      0      0      2
        Treadmill      1      0      0      2      0

                             Current Guess

                      Msg1   Msg2   Msg3   Msg4   Msg5
        Gym            2      0      0      3      0
        Calorie        0      2      1      1      3
        Weigh          1      0      2      0      0
        Carbs          0      3      0      0      2
        Treadmill      1      0      0      2      0

                             Target Result
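The guess-and-improve loop shown on these slides can be sketched with multiplicative update rules, a standard way to compute a non-negative matrix factorization. The slides do not say which update rule the talk used, so treat this as one possible approach; `factorize`, `matmul`, `cost`, the random starting values, and the iteration count are assumptions made for this sketch. Following the slides, the features matrix is words by features and the weight matrix is features by messages.

```python
import random

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    return [list(col) for col in zip(*m)]

def cost(target, guess):
    """Sum of squared differences between the target and the current guess."""
    return sum((t - g) ** 2 for tr, gr in zip(target, guess) for t, g in zip(tr, gr))

def factorize(target, n_features, iterations=200, seed=0):
    """Find non-negative features and weights with features x weights ~ target."""
    rng = random.Random(seed)
    features = [[rng.random() for _ in range(n_features)] for _ in target]
    weights = [[rng.random() for _ in target[0]] for _ in range(n_features)]
    eps = 1e-9  # avoids division by zero
    for _ in range(iterations):
        # weights *= (F^T V) / (F^T F W), element by element
        ft = transpose(features)
        num = matmul(ft, target)
        den = matmul(matmul(ft, features), weights)
        weights = [[w * n / (d + eps) for w, n, d in zip(wr, nr, dr)]
                   for wr, nr, dr in zip(weights, num, den)]
        # features *= (V W^T) / (F W W^T), element by element
        wt = transpose(weights)
        num = matmul(target, wt)
        den = matmul(features, matmul(weights, wt))
        features = [[f * n / (d + eps) for f, n, d in zip(fr, nr, dr)]
                    for fr, nr, dr in zip(features, num, den)]
    return features, weights

# Target Result from the slides: rows are Gym, Calorie, Weigh, Carbs, Treadmill.
TARGET = [
    [2, 0, 0, 3, 0],
    [0, 2, 1, 1, 3],
    [1, 0, 2, 0, 0],
    [0, 3, 0, 0, 2],
    [1, 0, 0, 2, 0],
]
features, weights = factorize(TARGET, 3)
print(round(cost(TARGET, matmul(features, weights)), 3))
```

Each pass rescales the current guess so that the product moves closer to the target while every entry stays non-negative.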
Interpreting Features

                F1    F2   F3
Gym             1      0    0
Calorie         0      1    1
Weigh           0      0    2
Carbs           0      1    0
Treadmill       1      0    0

          Features Matrix

Theme 1      Theme 2      Theme 3
Gym          Calorie      Weigh
Treadmill    Carbs        Calorie

          Msg1        M2   M3   M4   M5
  F1        2         0    0    1    0
  F2        0         2    0    1    3
  F3        1         0    1    0    0

          Weight Matrix

Msg1         Msg2          Msg3 etc.
Theme 1      Theme 2       Theme 3
Theme 3
“Diet and body” themes

 Atkins, Induction, South, Beach, Carbs
 Calories, Weight, Fats, Protein, Cholesterol
 Chocolate, Black, Coffee, Olive, Broccoli
 Gym, Weights, Exercise, Running, Injured
 Cook, Recipe, Fried, Home
 Money, Organic, Want, Best
Side note: NMF for faces
Methods covered

 Regression trees
 Hierarchical clustering
 k-means clustering
 Multidimensional scaling
 Bayesian Classifier
 Non-negative Matrix Factorization
Other ideas

 Finance
   Analysts already drowning in info
   Stories sometimes broken on blogs
   Message boards show sentiment

   Extremely low signal-to-noise ratio
Other ideas

 Product problems/ideas
   Use support message boards
   Extract themes
   Understand recurring issues
   Learn what features people want
Other ideas

 Entertainment
   How much buzz is a movie generating?
   What psychographic profiles like this type of movie?

   Of interest to studios and media investors

More Related Content

Viewers also liked

Unsaturated polyester resin as a matrix
Unsaturated polyester resin  as a matrixUnsaturated polyester resin  as a matrix
Unsaturated polyester resin as a matrixlukkumanul
 
Statistics and Data Mining
Statistics and  Data MiningStatistics and  Data Mining
Statistics and Data MiningR A Akerkar
 
FAILURES OF AMALGAM RESTORATION
FAILURES OF AMALGAM RESTORATIONFAILURES OF AMALGAM RESTORATION
FAILURES OF AMALGAM RESTORATIONDR YASMIN MOIDIN
 
Strip Crowns Technique for Restoration of Primary Anterior Teeth: Case Report
Strip Crowns Technique for Restoration of Primary Anterior Teeth: Case ReportStrip Crowns Technique for Restoration of Primary Anterior Teeth: Case Report
Strip Crowns Technique for Restoration of Primary Anterior Teeth: Case ReportAbu-Hussein Muhamad
 
Introduction To Data Mining
Introduction To Data Mining   Introduction To Data Mining
Introduction To Data Mining Phi Jack
 
Contacts and Contours By Dr.Ruchir Kapur
Contacts and Contours By Dr.Ruchir KapurContacts and Contours By Dr.Ruchir Kapur
Contacts and Contours By Dr.Ruchir KapurDr.Ruchir kapur
 
Data Mining: an Introduction
Data Mining: an IntroductionData Mining: an Introduction
Data Mining: an IntroductionAli Abbasi
 
Data mining with big data
Data mining with big dataData mining with big data
Data mining with big datakk1718
 
An Introduction to Data Mining with R
An Introduction to Data Mining with RAn Introduction to Data Mining with R
An Introduction to Data Mining with RYanchang Zhao
 
Quick Tour of Text Mining
Quick Tour of Text MiningQuick Tour of Text Mining
Quick Tour of Text MiningYi-Shin Chen
 
Introduction to R for Data Mining (Feb 2013)
Introduction to R for Data Mining (Feb 2013)Introduction to R for Data Mining (Feb 2013)
Introduction to R for Data Mining (Feb 2013)Revolution Analytics
 
CONTACTS AND CONTOURS IN CONSERVATIVE DENTISTRY / rotary endodontic courses b...
CONTACTS AND CONTOURS IN CONSERVATIVE DENTISTRY / rotary endodontic courses b...CONTACTS AND CONTOURS IN CONSERVATIVE DENTISTRY / rotary endodontic courses b...
CONTACTS AND CONTOURS IN CONSERVATIVE DENTISTRY / rotary endodontic courses b...Indian dental academy
 
Matrices, retainers and wedges /certified fixed orthodontic courses by India...
Matrices, retainers and wedges  /certified fixed orthodontic courses by India...Matrices, retainers and wedges  /certified fixed orthodontic courses by India...
Matrices, retainers and wedges /certified fixed orthodontic courses by India...Indian dental academy
 
Amalgam &composite
Amalgam &compositeAmalgam &composite
Amalgam &compositeDrGhadooRa
 

Viewers also liked (20)

Unsaturated polyester resin as a matrix
Unsaturated polyester resin  as a matrixUnsaturated polyester resin  as a matrix
Unsaturated polyester resin as a matrix
 
Data Mining (Predict The Future)
Data Mining (Predict The Future)Data Mining (Predict The Future)
Data Mining (Predict The Future)
 
Statistics and Data Mining
Statistics and  Data MiningStatistics and  Data Mining
Statistics and Data Mining
 
Matricing
MatricingMatricing
Matricing
 
FAILURES OF AMALGAM RESTORATION
FAILURES OF AMALGAM RESTORATIONFAILURES OF AMALGAM RESTORATION
FAILURES OF AMALGAM RESTORATION
 
Strip Crowns Technique for Restoration of Primary Anterior Teeth: Case Report
Strip Crowns Technique for Restoration of Primary Anterior Teeth: Case ReportStrip Crowns Technique for Restoration of Primary Anterior Teeth: Case Report
Strip Crowns Technique for Restoration of Primary Anterior Teeth: Case Report
 
Introduction To Data Mining
Introduction To Data Mining   Introduction To Data Mining
Introduction To Data Mining
 
Contacts and Contours By Dr.Ruchir Kapur
Contacts and Contours By Dr.Ruchir KapurContacts and Contours By Dr.Ruchir Kapur
Contacts and Contours By Dr.Ruchir Kapur
 
Data Mining: an Introduction
Data Mining: an IntroductionData Mining: an Introduction
Data Mining: an Introduction
 
Data mining with big data
Data mining with big dataData mining with big data
Data mining with big data
 
An Introduction to Data Mining with R
An Introduction to Data Mining with RAn Introduction to Data Mining with R
An Introduction to Data Mining with R
 
Quick Tour of Text Mining
Quick Tour of Text MiningQuick Tour of Text Mining
Quick Tour of Text Mining
 
Gym Final Report.ppt
Gym Final Report.pptGym Final Report.ppt
Gym Final Report.ppt
 
Crowns
CrownsCrowns
Crowns
 
Introduction to R for Data Mining (Feb 2013)
Introduction to R for Data Mining (Feb 2013)Introduction to R for Data Mining (Feb 2013)
Introduction to R for Data Mining (Feb 2013)
 
CONTACTS AND CONTOURS IN CONSERVATIVE DENTISTRY / rotary endodontic courses b...
CONTACTS AND CONTOURS IN CONSERVATIVE DENTISTRY / rotary endodontic courses b...CONTACTS AND CONTOURS IN CONSERVATIVE DENTISTRY / rotary endodontic courses b...
CONTACTS AND CONTOURS IN CONSERVATIVE DENTISTRY / rotary endodontic courses b...
 
Analytics and Data Mining Industry Overview
Analytics and Data Mining Industry OverviewAnalytics and Data Mining Industry Overview
Analytics and Data Mining Industry Overview
 
Matrices, retainers and wedges /certified fixed orthodontic courses by India...
Matrices, retainers and wedges  /certified fixed orthodontic courses by India...Matrices, retainers and wedges  /certified fixed orthodontic courses by India...
Matrices, retainers and wedges /certified fixed orthodontic courses by India...
 
Amalgam &composite
Amalgam &compositeAmalgam &composite
Amalgam &composite
 
Polymer composites
Polymer compositesPolymer composites
Polymer composites
 

Similar to Mining Social Data for Fun and Insight

Evolutionary Algorithms and their Applications in Civil Engineering - 1
Evolutionary Algorithms and their Applications in Civil Engineering - 1Evolutionary Algorithms and their Applications in Civil Engineering - 1
Evolutionary Algorithms and their Applications in Civil Engineering - 1shreymodi
 
OOD - Object orientated design
OOD - Object orientated designOOD - Object orientated design
OOD - Object orientated designRuberto Paulo
 
Lecture 3 production and costs
Lecture 3 production and costsLecture 3 production and costs
Lecture 3 production and costsWONDE TOM
 
Algebra and Trigonometry 9th Edition Larson Solutions Manual
Algebra and Trigonometry 9th Edition Larson Solutions ManualAlgebra and Trigonometry 9th Edition Larson Solutions Manual
Algebra and Trigonometry 9th Edition Larson Solutions Manualkejeqadaqo
 
New automated techniques to validate and populate property valuations
New automated techniques to validate and populate property valuationsNew automated techniques to validate and populate property valuations
New automated techniques to validate and populate property valuationsRob Carroll
 

Similar to Mining Social Data for Fun and Insight (6)

Game Show Math
Game Show MathGame Show Math
Game Show Math
 
Evolutionary Algorithms and their Applications in Civil Engineering - 1
Evolutionary Algorithms and their Applications in Civil Engineering - 1Evolutionary Algorithms and their Applications in Civil Engineering - 1
Evolutionary Algorithms and their Applications in Civil Engineering - 1
 
OOD - Object orientated design
OOD - Object orientated designOOD - Object orientated design
OOD - Object orientated design
 
Lecture 3 production and costs
Lecture 3 production and costsLecture 3 production and costs
Lecture 3 production and costs
 
Algebra and Trigonometry 9th Edition Larson Solutions Manual
Algebra and Trigonometry 9th Edition Larson Solutions ManualAlgebra and Trigonometry 9th Edition Larson Solutions Manual
Algebra and Trigonometry 9th Edition Larson Solutions Manual
 
New automated techniques to validate and populate property valuations
New automated techniques to validate and populate property valuationsNew automated techniques to validate and populate property valuations
New automated techniques to validate and populate property valuations
 

More from adunne

Seedcamp Overview
Seedcamp OverviewSeedcamp Overview
Seedcamp Overviewadunne
 
Netvibes Preview
Netvibes PreviewNetvibes Preview
Netvibes Previewadunne
 
Community Practices: From Forums to Social Networks
Community Practices: From Forums to Social NetworksCommunity Practices: From Forums to Social Networks
Community Practices: From Forums to Social Networksadunne
 
Designing Tag Navigation
Designing Tag NavigationDesigning Tag Navigation
Designing Tag Navigationadunne
 
Social Commerce and Community
Social Commerce and CommunitySocial Commerce and Community
Social Commerce and Communityadunne
 
The Starfish and the Spider
The Starfish and the SpiderThe Starfish and the Spider
The Starfish and the Spideradunne
 
Ginger Preview
Ginger PreviewGinger Preview
Ginger Previewadunne
 
Add Powerful Full Text Search to Your Web App with Solr
Add Powerful Full Text Search to Your Web App with SolrAdd Powerful Full Text Search to Your Web App with Solr
Add Powerful Full Text Search to Your Web App with Solradunne
 
Web 2.0 Performance and Reliability: How to Run Large Web Apps
Web 2.0 Performance and Reliability: How to Run Large Web AppsWeb 2.0 Performance and Reliability: How to Run Large Web Apps
Web 2.0 Performance and Reliability: How to Run Large Web Appsadunne
 
The Impact of Mobile Web 2.0 on the Telecoms Industry
The Impact of Mobile Web 2.0 on the Telecoms IndustryThe Impact of Mobile Web 2.0 on the Telecoms Industry
The Impact of Mobile Web 2.0 on the Telecoms Industryadunne
 
Building Web 2.0: Next-Generation Data Centers
Building Web 2.0: Next-Generation Data CentersBuilding Web 2.0: Next-Generation Data Centers
Building Web 2.0: Next-Generation Data Centersadunne
 
Killing the Org Chart: Organizational, Cultural and Leadership Models on the ...
Killing the Org Chart: Organizational, Cultural and Leadership Models on the ...Killing the Org Chart: Organizational, Cultural and Leadership Models on the ...
Killing the Org Chart: Organizational, Cultural and Leadership Models on the ...adunne
 
Designing for a Web of Data
Designing for a Web of DataDesigning for a Web of Data
Designing for a Web of Dataadunne
 
Web 2.0 Performance and Reliability: How to Run Large Web Apps
Web 2.0 Performance and Reliability: How to Run Large Web AppsWeb 2.0 Performance and Reliability: How to Run Large Web Apps
Web 2.0 Performance and Reliability: How to Run Large Web Appsadunne
 
Disrupting the Platform: Harnessing social analytics and other musings on the...
Disrupting the Platform: Harnessing social analytics and other musings on the...Disrupting the Platform: Harnessing social analytics and other musings on the...
Disrupting the Platform: Harnessing social analytics and other musings on the...adunne
 
Your User's Privacy
Your User's PrivacyYour User's Privacy
Your User's Privacyadunne
 
Under the Hood: How Geonames Aggregates Over 35 Sources into One Data Set
Under the Hood: How Geonames Aggregates Over 35 Sources into One Data SetUnder the Hood: How Geonames Aggregates Over 35 Sources into One Data Set
Under the Hood: How Geonames Aggregates Over 35 Sources into One Data Setadunne
 
Scalable Web Architectures: Common Patterns and Approaches
Scalable Web Architectures: Common Patterns and ApproachesScalable Web Architectures: Common Patterns and Approaches
Scalable Web Architectures: Common Patterns and Approachesadunne
 
Trends in Search Engine Optimization and Search Engine Marketing
Trends in Search Engine Optimization and Search Engine MarketingTrends in Search Engine Optimization and Search Engine Marketing
Trends in Search Engine Optimization and Search Engine Marketingadunne
 
Wuala, P2P Online Storage
Wuala, P2P Online StorageWuala, P2P Online Storage
Wuala, P2P Online Storageadunne
 

More from adunne (20)

Seedcamp Overview
Seedcamp OverviewSeedcamp Overview
Seedcamp Overview
 
Netvibes Preview
Netvibes PreviewNetvibes Preview
Netvibes Preview
 
Community Practices: From Forums to Social Networks
Community Practices: From Forums to Social NetworksCommunity Practices: From Forums to Social Networks
Community Practices: From Forums to Social Networks
 
Designing Tag Navigation
Designing Tag NavigationDesigning Tag Navigation
Designing Tag Navigation
 
Social Commerce and Community
Social Commerce and CommunitySocial Commerce and Community
Social Commerce and Community
 
The Starfish and the Spider
The Starfish and the SpiderThe Starfish and the Spider
The Starfish and the Spider
 
Ginger Preview
Ginger PreviewGinger Preview
Ginger Preview
 
Add Powerful Full Text Search to Your Web App with Solr
Add Powerful Full Text Search to Your Web App with SolrAdd Powerful Full Text Search to Your Web App with Solr

Mining Social Data for Fun and Insight

  • 2. About Me http://kiwitobes.com
  • 3. What is data mining? Implicit Unknown Useful
  • 8. Why it’s important now All products are actually sold on Amazon
  • 9. Why it’s important now Facebook Google
  • 11. For Social Insight: Home Prices, Blogs and News, Movie Data, Fashion, Product Prices, Hotties
  • 14. Getting the content. Words: The, Six, Degrees, Hypothesis, Experienced, It, Is, When, You, Travel
  • 15. Building a Word Matrix. All words: The, Six, Degrees, Hypothesis, Experienced, It, Is, When, You, Travel. Words kept: Six, Degrees, Hypothesis, Experienced, Travel. Counts: Six 3; Degrees 3; Hypothesis 1; Experienced 5; Travel 6
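The counting step on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the speaker's code: the stop-word list, the tokenizing regex, and the function names `word_counts` and `build_matrix` are all assumptions made for the example.

```python
import re
from collections import Counter

# Hypothetical stop-word list; the slides do not say which words were dropped.
STOP_WORDS = {"the", "it", "is", "when", "you"}

def word_counts(text, stop_words=STOP_WORDS):
    """Count lowercase words in text, skipping stop words."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(w for w in words if w not in stop_words)

def build_matrix(docs):
    """Return (sorted vocabulary, {doc name: count row}) for a dict of name -> text."""
    counts = {name: word_counts(text) for name, text in docs.items()}
    vocab = sorted(set().union(*counts.values()))
    return vocab, {name: [c[w] for w in vocab] for name, c in counts.items()}

# Two made-up documents, only to show the shape of the result.
vocab, rows = build_matrix({
    "a": "The travel is travel",
    "b": "You travel six degrees",
})
print(vocab)
print(rows)
```

Each row has one count per vocabulary word, which is the layout the word matrix on the next slide uses.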
  • 16. The Word Matrix. Columns: “china”, “kids”, “music”, “yahoo”, “travel”. Gothamist: 0, 3, 3, 0, 3; GigaOM: 6, 0, 1, 2, 4; QuickOnlineTips: 0, 2, 2, 12, 0; O’Reilly Radar: 1, 0, 3, 4, 6
  • 17. Determining distance. Columns: “china”, “kids”, “music”, “yahoo”. Gothamist: 0, 3, 3, 0; GigaOM: 6, 0, 1, 2; Quick Online Tips: 0, 2, 2, 12. Euclidean (“as the crow flies”) distance between GigaOM and Quick Online Tips: $\sqrt{(6-0)^2 + (0-2)^2 + (1-2)^2 + (2-12)^2} \approx 12$
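As a quick check of the arithmetic on this slide, here is a small Python sketch of the Euclidean distance between two count rows. The function name `euclidean` is chosen for the example; the numbers are the GigaOM and Quick Online Tips rows from the slide.

```python
import math

def euclidean(a, b):
    """Straight-line distance between two equal-length count vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Counts for "china", "kids", "music", "yahoo" from the slide.
gigaom = [6, 0, 1, 2]
quick_online_tips = [0, 2, 2, 12]
distance = euclidean(gigaom, quick_online_tips)
print(round(distance, 2))  # sqrt(141), about 11.87
```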
  • 18. Hierarchical Clustering. Find the two closest items. Combine them into a single item. Repeat…
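The three steps above can be sketched as follows. This is an illustrative version under stated assumptions: distance is Euclidean, and a combined item is represented by the average of its two children, which is only one of several possible merge rules. The name `hcluster` and the nested-tuple output are choices made for the example. The data is the word matrix from slide 16.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def hcluster(rows):
    """Agglomerative clustering; returns a nested tuple describing the merge tree.

    rows maps a label to its vector. Each merged cluster is represented by the
    average of its two children (an assumption; other linkage rules exist).
    """
    clusters = [(label, list(vec)) for label, vec in rows.items()]
    while len(clusters) > 1:
        # Find the two closest items.
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = euclidean(clusters[i][1], clusters[j][1])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        (tree_i, vec_i), (tree_j, vec_j) = clusters[i], clusters[j]
        # Combine them into a single item.
        merged = ((tree_i, tree_j), [(x + y) / 2 for x, y in zip(vec_i, vec_j)])
        del clusters[j]  # delete j first, since j > i
        del clusters[i]
        clusters.append(merged)
    return clusters[0][0]

blogs = {
    "Gothamist": [0, 3, 3, 0, 3],
    "GigaOM": [6, 0, 1, 2, 4],
    "QuickOnlineTips": [0, 2, 2, 12, 0],
    "O'Reilly Radar": [1, 0, 3, 4, 6],
}
tree = hcluster(blogs)
print(tree)
```

The innermost pair in the printed tuple is the first merge, so the nesting can be read as a dendrogram from the inside out.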
  • 28. Rotating the Matrix. Words in a blog -> blogs containing each word. Columns: Gothamist, GigaOM, Quick Onl. china: 0, 6, 0; kids: 3, 0, 2; music: 3, 1, 2; Yahoo: 0, 2, 12
  • 30. K-Means Clustering. Divides data into distinct clusters; the user determines how many. Algorithm: start with arbitrary centroids, assign points to centroids, move the centroids, repeat.
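A minimal Python sketch of the loop described above. Assumptions made for the example: distance is Euclidean, the arbitrary starting centroids are k data points picked with a seeded random generator, and the loop stops when no point changes cluster. The function name `kmeans` and its parameters are illustrative.

```python
import math
import random

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def kmeans(points, k, iterations=100, seed=0):
    """Return a list of k clusters, each a list of indices into points."""
    rng = random.Random(seed)
    # Start with arbitrary centroids: here, k distinct data points.
    centroids = [list(p) for p in rng.sample(points, k)]
    assignment = None
    for _ in range(iterations):
        # Assign each point to its nearest centroid.
        new_assignment = [
            min(range(k), key=lambda c: euclidean(p, centroids[c])) for p in points
        ]
        if new_assignment == assignment:
            break  # nothing moved, so the clusters are stable
        assignment = new_assignment
        # Move each centroid to the mean of the points assigned to it.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return [[i for i, a in enumerate(assignment) if a == c] for c in range(k)]

# Made-up points forming two obvious groups.
points = [[0, 0], [0, 1], [10, 10], [10, 11]]
clusters = kmeans(points, 2)
print(clusters)
```

In general the result depends on the starting centroids, so running with several seeds and comparing the clusters is a reasonable precaution.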
  • 36. K-Means Results. Cluster 1: The Viral Garden, Copyblogger, Creating Passionate Users, Oilman, ProBlogger Blog Tips, Seth's Blog. Cluster 2: Wonkette, Gawker, Gothamist, Huffington Post
  • 37. 2D Visualizations. Instead of clusters, a 2D map. Goals: preserve distances as much as possible; draw in two dimensions. Dimension reduction: Principal Components Analysis, Multidimensional Scaling
  • 47. The Zillow API. Allows querying by address. Returns information about the property: bedrooms, bathrooms, zip code, price estimate, last sale price
  • 48. A home price dataset. Columns: House, Zip, Bathrooms, Bedrooms, Built, Type, Price. A: 02138, 1.5, 2, 1847, Single, 505296; B: 02139, 3.5, 9, 1916, Triplex, 776378; C: 02140, 3.5, 4, 1894, Duplex, 595027; D: 02139, 2.5, 4, 1854, Duplex, 552213; E: 02138, 3.5, 5, 1909, Duplex, 947528; F: 02138, 3.5, 4, 1930, Single, 2107871; etc.
  • 49. What can we learn? A made-up house's price. How important is Zip Code? What are the important attributes? Can we do better than averages?
  • 50. Introducing Regression Trees. Columns: A, B, Value. Rows: 10, Circle, 20; 11, Square, 22; 22, Square, 8; 18, Circle, 6
  • 52. Minimizing deviation. Standard deviation is the “spread” of results. Try all possible divisions. Choose the division that decreases deviation the most. Initially (A, B, Value): 10, Circle, 20; 11, Square, 22; 22, Square, 8; 18, Circle, 6. Average = 14, Standard Deviation = 8.2
  • 53. Minimizing deviation. Division on B, same four rows. B = Circle: Average = 13, Standard Deviation = 9.9. B = Square: Average = 15, Standard Deviation = 9.9
  • 54. Minimizing deviation. Division on A at 18, same four rows. A > 18: Average = 8, Standard Deviation = 0. A <= 18: Average = 16, Standard Deviation = 8.7
  • 55. Minimizing deviation. Division on A at 11, same four rows. A > 11: Average = 7, Standard Deviation = 1.4. A <= 11: Average = 21, Standard Deviation = 1.4
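The search over divisions shown on the last four slides can be sketched in Python. Assumptions made for the example: the spread is the sample standard deviation (which matches the 8.2, 9.9, 8.7 and 1.4 figures on the slides), a single row counts as spread 0, and the two groups are combined with a size-weighted average. The slides do not say how the groups' deviations are combined, so that scoring rule and the names `spread` and `best_split` are illustrative.

```python
import statistics

def spread(values):
    """Sample standard deviation; a single value has no spread."""
    return statistics.stdev(values) if len(values) > 1 else 0.0

def best_split(rows, target=-1):
    """Try every division of rows and return (score, column, value) for the best.

    Numeric columns are split as `<= value` vs `> value`; other columns as
    `== value` vs `!= value`. The score is the size-weighted average spread of
    the two groups (an assumed way to combine them).
    """
    best = None
    n_cols = len(rows[0])
    target = target % n_cols
    for col in range(n_cols):
        if col == target:
            continue
        for value in sorted({row[col] for row in rows}):
            if isinstance(value, (int, float)):
                left = [r[target] for r in rows if r[col] <= value]
                right = [r[target] for r in rows if r[col] > value]
            else:
                left = [r[target] for r in rows if r[col] == value]
                right = [r[target] for r in rows if r[col] != value]
            if not left or not right:
                continue  # not a real division
            score = (len(left) * spread(left) + len(right) * spread(right)) / len(rows)
            if best is None or score < best[0]:
                best = (score, col, value)
    return best

# The A, B, Value table from the slides.
rows = [(10, "Circle", 20), (11, "Square", 22), (22, "Square", 8), (18, "Circle", 6)]
score, col, value = best_split(rows)
print(col, value, round(score, 1))  # column 0 (A), split at 11, spread about 1.4
```

Applying `best_split` again to each of the two resulting groups, until a group is too small or too uniform to divide, is the recursive step that grows the tree.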
  • 56. CART Algorithm. Columns: A, B, Value. Rows: 10, Circle, 20; 11, Square, 22; 22, Square, 8; 18, Circle, 6
  • 58. CART Algorithm. The rows divided into two groups. First group: 10, Circle, 20; 11, Square, 22. Second group: 22, Square, 8; 18, Circle, 6
  • 60. Zillow Results. Tree nodes: Bathrooms > 3? Zip: 02139? After 1903? Zip: 02140? Bedrooms > 4? Duplex? Triplex?
  • 61. Just for Fun… Hot or Not
  • 62. Variance dividers [chart comparing Northeast/South and Male/Female divisions; labels: Low Variance Split, High Variance Split]
  • 63. Just for Fun… Hot or Not
  • 64. Supervised and Unsupervised. Clustering methods are unsupervised: there are no answers; methods just characterize the data and show interesting patterns. Regression trees are supervised: “answers” are in the dataset; tree models predict answers
  • 66. The Analysis Five Cities W4M Personal Ads
  • 67. Bayesian filter. Sample ad: “If you listen to NPR, watch Hardball, and love the Red Sox, you may be the guy for me. Please email me back. I'm a professional with a grad school degree who has a sense of humor and loves the Sox.” Word scores for Boston: Sox 0.4; Red 0.35; Grad 0.2; Professional 0.1; Humor 0.1
  • 68. Bayesian filter. P(C | W) = P(C & W) / P(W). P(C & W): how often do the word and the city appear together? P(W): how often does the word appear overall? Rank these, and you have a list of the words most particular to a given city
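The ranking described on this slide can be sketched as below. Assumptions made for the example: P(W) is estimated as the share of ads containing the word, P(C & W) as the share of ads from city C containing it, words are split on whitespace, and the four sample ads are invented for illustration. The function name `city_word_scores` and the `min_count` parameter are hypothetical.

```python
from collections import Counter, defaultdict

def city_word_scores(ads, min_count=1):
    """ads is a list of (city, text) pairs.

    Returns {city: [(word, P(city | word)), ...]}, sorted from most to least
    particular to that city. Words seen in fewer than min_count ads are skipped.
    """
    word_total = Counter()                # ads containing the word, any city
    word_in_city = defaultdict(Counter)   # ads containing the word, per city
    for city, text in ads:
        for word in set(text.lower().split()):
            word_total[word] += 1
            word_in_city[city][word] += 1
    ranked = {}
    for city, counts in word_in_city.items():
        # P(C | W) = P(C & W) / P(W); the total number of ads cancels out.
        scores = [
            (word, n / word_total[word])
            for word, n in counts.items()
            if word_total[word] >= min_count
        ]
        ranked[city] = sorted(scores, key=lambda ws: (-ws[1], ws[0]))
    return ranked

# Invented ads, only to show the shape of the result.
ads = [
    ("Boston", "love the sox"),
    ("Boston", "sox and poetry"),
    ("Chicago", "love the cubs"),
    ("Chicago", "cubs and bears and sox"),
]
ranked = city_word_scores(ads)
print(ranked["Boston"][:2])
print(ranked["Chicago"][:2])
```

A word that appears in only one ad always scores 1.0 for that ad's city, so on real data a higher `min_count` is likely needed to keep rare words from dominating the lists.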
  • 69. Results. New York: Mets, Lounges, Offense, Desires, Musical, Submissive, Create, Song, Oral. Boston: Pink, Sox, Poetry, Intellectually, Punk, Appreciation, Exercise, Winter, Education. Chicago: Cubs, Burbs, Bears, Girlie, Insecure, Cheat, Importance, Blunt, Mouth
  • 70. Results. Los Angeles: Excellent, Vegas, Meaningful, Star, Lame, Industry, Heat, Fitness, Entertainment, Latino. San Francisco: Tee, Employment, Picnic, STD, Tasting, Hikes, French, .com, Kayaking, Cycling
  • 73. Themes in a document
  • 74. Another word matrix. Actual Matrix (columns Msg1 to Msg5): Gym: 2, 0, 0, 3, 0; Calorie: 0, 2, 1, 1, 3; Weigh: 1, 0, 2, 0, 0; Carbs: 0, 3, 0, 0, 2; Treadmill: 1, 0, 0, 2, 0
  • 75. Weights and features. Features Matrix (columns F1, F2, F3): Gym: 0, 1, 2; Calorie: 2, 0, 1; Weigh: 2, 2, 1; Carbs: 1, 0, 3; Treadmill: 0, 1, 2. Multiplied by the Weight Matrix (columns Msg1 to Msg5): F1: 1, 0, 2, 3, 0; F2: 0, 2, 1, 1, 3; F3: 1, 0, 2, 0, 0
  • 76. Matrix factorization. Features Matrix (columns F1, F2, F3): Gym: 0, 1, 2; Calorie: 2, 0, 1; Weigh: 2, 2, 1; Carbs: 1, 0, 3; Treadmill: 0, 1, 2. Multiplied by the Weight Matrix (columns Msg1 to Msg5): F1: 1, 0, 2, 3, 0; F2: 0, 2, 1, 1, 3; F3: 1, 0, 2, 0, 0. Current Guess (columns Msg1 to Msg5): Gym: 1, 3, 3, 0, 1; Calorie: 0, 2, 4, 1, 3; Weigh: 2, 3, 1, 0, 1; Carbs: 0, 1, 1, 0, 2; Treadmill: 3, 2, 0, 2, 2
  • 77. Matrix factorization. Features Matrix and Weight Matrix as on the previous slide. Target Result (columns Msg1 to Msg5): Gym: 2, 0, 0, 3, 0; Calorie: 0, 2, 1, 1, 3; Weigh: 1, 0, 2, 0, 0; Carbs: 0, 3, 0, 0, 2; Treadmill: 1, 0, 0, 2, 0. Current Guess (columns Msg1 to Msg5): Gym: 1, 3, 3, 0, 1; Calorie: 0, 2, 4, 1, 3; Weigh: 2, 3, 1, 0, 1; Carbs: 0, 1, 1, 0, 2; Treadmill: 3, 2, 0, 2, 2
  • 78. Matrix factorization. Features Matrix (columns F1, F2, F3): Gym: 1, 0, 0; Calorie: 0, 1, 1; Weigh: 0, 0, 2; Carbs: 0, 1, 0; Treadmill: 1, 0, 0. Multiplied by the Weight Matrix (columns Msg1 to Msg5): F1: 2, 0, 0, 1, 0; F2: 0, 2, 0, 1, 3; F3: 1, 0, 1, 0, 0. Target Result and Current Guess now shown with the same values (columns Msg1 to Msg5): Gym: 2, 0, 0, 3, 0; Calorie: 0, 2, 1, 1, 3; Weigh: 1, 0, 2, 0, 0; Carbs: 0, 3, 0, 0, 2; Treadmill: 1, 0, 0, 2, 0
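The slides show the guess moving toward the target but not how the two matrices are adjusted. One standard way to do it for non-negative matrix factorization is the multiplicative update rule, sketched below in plain Python. This is a swapped-in technique and an assumption about what the talk used. The names `factorize`, `matmul`, `transpose` and `cost` are chosen for the example, and the starting values are random rather than the ones on the slides.

```python
import random

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    return [list(col) for col in zip(*m)]

def cost(v, guess):
    """Sum of squared differences between the target and the current guess."""
    return sum((x - y) ** 2 for rv, rg in zip(v, guess) for x, y in zip(rv, rg))

def factorize(v, k, iterations=200, seed=0, eps=1e-9):
    """Return (features, weights) whose product approximates v, all entries >= 0."""
    rng = random.Random(seed)
    rows, cols = len(v), len(v[0])
    w = [[rng.random() for _ in range(k)] for _ in range(rows)]  # features matrix
    h = [[rng.random() for _ in range(cols)] for _ in range(k)]  # weight matrix
    for _ in range(iterations):
        # Update weights: h *= (w^T v) / (w^T w h)
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[i][j] * num[i][j] / (den[i][j] + eps) for j in range(cols)]
             for i in range(k)]
        # Update features: w *= (v h^T) / (w h h^T)
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][j] * num[i][j] / (den[i][j] + eps) for j in range(k)]
             for i in range(rows)]
    return w, h

# The "Actual Matrix" from slide 74 (rows: Gym, Calorie, Weigh, Carbs, Treadmill).
target = [
    [2, 0, 0, 3, 0],
    [0, 2, 1, 1, 3],
    [1, 0, 2, 0, 0],
    [0, 3, 0, 0, 2],
    [1, 0, 0, 2, 0],
]
features, weights = factorize(target, 3)
print(cost(target, matmul(features, weights)))
```

Because the starting values are random, the learned features will generally differ from the tidy integers on the slide. What carries over is the shape of the result: one column of word strengths per feature and one row of per-message weights per feature.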
  • 79. Interpreting Features. Features Matrix (columns F1, F2, F3): Gym: 1, 0, 0; Calorie: 0, 1, 1; Weigh: 0, 0, 2; Carbs: 0, 1, 0; Treadmill: 1, 0, 0. Theme 1: Gym, Treadmill. Theme 2: Calorie, Carbs. Theme 3: Weigh, Calorie. Weight Matrix (columns Msg1 to Msg5): F1: 2, 0, 0, 1, 0; F2: 0, 2, 0, 1, 3; F3: 1, 0, 1, 0, 0. Msg1: Theme 1, Theme 3. Msg2: Theme 2. Msg3: Theme 3. etc.
  • 80. “Diet and body” themes Calories Weight Atkins Fats Induction Protein South Chocolate Cholesterol Beach Black Carbs Coffee Olive Gym Broccoli Weights Exercise Running Cook Injured Recipe Fried Home Money Organic Want Best
  • 81. Side note: NMF for faces
  • 82. Methods covered: regression trees, hierarchical clustering, k-means clustering, multidimensional scaling, Bayesian classifier, non-negative matrix factorization
  • 83. Other ideas: Finance. Analysts already drowning in info. Stories sometimes broken on blogs. Message boards show sentiment. Extremely low signal-to-noise ratio
  • 84. Other ideas: Product problems/ideas. Use support message boards. Extract themes. Understand recurring issues. Learn what features people want
  • 85. Other ideas: Entertainment. How much buzz is a movie generating? What psychographic profiles like this type of movie? Of interest to studios and media investors