Mammoth Scale Machine Learning

Speaker: Robin Anil, Apache Mahout PMC Member

OSCON '10, Portland, OR, July 2010
Quick Show of Hands
  Are you fascinated by ML?
  Have you used ML?
  Do you have Gigabytes and Terabytes of data to analyze?
  Do you have Hadoop or MapReduce experience?

  Thanks for the survey!
Little bit about me
  Apache Mahout PMC member
  An ML enthusiast
  Software Engineer @ Google
  Google Summer of Code Mentor
  Previous Life: Google Summer of Code student for 2 years.
Agenda
  Introducing Mahout
  Different classes of problems
  And their Mahout based solutions
  Basic data structure
  Usage examples
  Sneak peek at our Next Release
The Mission




To build a scalable machine learning library
Scale!
  Scale to large datasets
   -  Hadoop MapReduce implementations that scale linearly with
      data
   -  Fast sequential algorithms whose runtime doesn’t depend on
      the size of the data
   -  Goal: To be as fast as possible for any algorithm
  Scalable to support your business case
   -  Apache License, Version 2.0
  Scalable community
   -  Vibrant, responsive and diverse
   -  Come to the mailing list and find out more
Why a new Library
  Plenty of open source Machine Learning libraries either
   -  Lack community
   -  Lack scalability
   -  Lack documentation and examples
   -  Lack Apache licensing
   -  Are not well tested
   -  Are Research oriented
ML on Twitter
  Collection of tweets in the last hour
  Each tweet: up to 140 characters, or a stream of tokens
  We will keep using this example throughout this talk
What is Clustering
  Call it fuzzy grouping based on a notion of similarity
Mahout Clustering
  Plenty of Algorithms: K-Means,
   Fuzzy K-Means, Mean Shift,
   Canopy, Dirichlet

  Group similar looking objects

  Notion of similarity: Distance measure:
   -  Euclidean
   -  Cosine
   -  Tanimoto
   -  Manhattan
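
The four distance measures named above can be sketched in a few lines. This is a minimal illustration in plain Python, not Mahout's implementation; the function names are hypothetical, and the code assumes equal-length, non-zero vectors.

```python
import math

def euclidean(a, b):
    # Straight-line distance between two points.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    # Sum of absolute per-dimension differences.
    return sum(abs(x - y) for x, y in zip(a, b))

def cosine_distance(a, b):
    # 1 minus the cosine of the angle between the vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm

def tanimoto_distance(a, b):
    # 1 minus the Tanimoto coefficient: dot / (|a|^2 + |b|^2 - dot).
    dot = sum(x * y for x, y in zip(a, b))
    denom = sum(x * x for x in a) + sum(y * y for y in b) - dot
    return 1.0 - dot / denom
```

Cosine and Tanimoto ignore or dampen vector length, which tends to suit text vectors where document length varies; Euclidean and Manhattan are sensitive to magnitude.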
Clustering Tweets

“Identify tweets that are similar and group them”
Topic modeling
  Grouping similar or co-occurring features into a topic
   -  Topic “Lol Cat”:
      -  Cat
      -  Meow
      -  Purr
      -  Haz
      -  Cheeseburger
      -  Lol
Mahout Topic Modeling
  Algorithm: Latent Dirichlet Allocation
   -  Input a set of documents
   -  Output top K prominent topics and the
      features in each topic
Filtering Topics from Tweets

“Identify emerging topics in a collection of tweets”
Classification
  Predicting the type of a new object based on its features
  The types are predetermined




  [Images: Dog, Cat]
Mahout Classification
  Plenty of algorithms
   -  Naïve Bayes
   -  Complementary Naïve Bayes
   -  Random Forests
   -  Logistic Regression (Almost done)
   -  Support Vector Machines (patch ready)

  Learn a model from manually classified data
  Predict the class of a new object based on its
   features and the learned model
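
The learn-then-predict loop can be sketched with a tiny multinomial Naive Bayes classifier. This is an illustrative toy in plain Python, assuming add-one smoothing and pre-tokenized input; `train` and `predict` are hypothetical names, not Mahout's API.

```python
import math
from collections import Counter, defaultdict

def train(labeled_docs):
    """labeled_docs: list of (label, list_of_tokens) pairs."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for label, tokens in labeled_docs:
        class_counts[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, word_counts, vocab

def predict(model, tokens):
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, float("-inf")
    for label in sorted(class_counts):
        total_words = sum(word_counts[label].values())
        # Log prior for the class.
        score = math.log(class_counts[label] / total_docs)
        for tok in tokens:
            # Log likelihood with add-one (Laplace) smoothing.
            score += math.log((word_counts[label][tok] + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Made-up training data for illustration only.
model = train([
    ("oscon", ["great", "talk", "on", "mahout", "at", "the", "conference"]),
    ("oscon", ["open", "source", "conference", "in", "portland"]),
    ("other", ["my", "cat", "wants", "a", "cheeseburger"]),
    ("other", ["lol", "cat", "says", "meow"]),
])
```

With this model, `predict(model, ["mahout", "conference", "talk"])` picks the class whose training vocabulary best explains the tokens.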
Detect OSCON Tweets

“Tweets without #OSCON”

Use tweets mentioning #OSCON to train, and classify incoming tweets
Recommendations
  Predict what the user likes based on
   -  Their historical behavior
   -  Aggregate behavior of people similar to them
Mahout Recommenders
  Different types of recommenders
   -  User based
   -  Item based
  Full framework for storage, and for
   online and offline computation of recommendations
  Like clustering, there is a notion of similarity in users or items
   -  Cosine, Tanimoto, Pearson and LLR
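
A user-based recommender built on Tanimoto similarity might look like the sketch below. It is a simplified stand-in in plain Python, assuming each user's history is just a set of liked item IDs; the function names and the scoring rule (sum of similarities) are assumptions for illustration, not Mahout's recommender framework.

```python
def tanimoto(a, b):
    # Size of the intersection over size of the union of two sets.
    union = len(a | b)
    return len(a & b) / union if union else 0.0

def recommend(user, likes, top_n=3):
    """likes: {user: set_of_item_ids}. Returns up to top_n unseen items."""
    scores = {}
    for other, items in likes.items():
        if other == user:
            continue
        sim = tanimoto(likes[user], items)
        if sim == 0:
            continue  # Ignore users with nothing in common.
        for item in items - likes[user]:
            # Weight each unseen item by the similarity of the user who liked it.
            scores[item] = scores.get(item, 0.0) + sim
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [item for item, _ in ranked[:top_n]]

# Made-up data: which tweets each user liked.
likes = {
    "alice": {"t1", "t2", "t3"},
    "bob": {"t1", "t2", "t4"},
    "carol": {"t5", "t6"},
}
```

An item-based recommender follows the same pattern with the roles swapped: similarity is computed between items, based on the users they share.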
Recommended Tweets

“Discover interesting tweets without Re-Tweeting or Replying”
Frequent Pattern Mining
  Find interesting groups of items based on how they co-occur in a
   dataset
Mahout Parallel FPGrowth
  Identify the most commonly
   occurring patterns from
   -  Sales Transactions
      buy “Milk, eggs and bread”
   -  Query Logs
      ipad -> apple, tablet, iphone
   -  Spam Detection
      Yahoo! http://www.slideshare.net/hadoopusergroup/mail-antispam
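
To make "commonly occurring patterns" concrete, the sketch below counts item pairs that co-occur in at least a minimum number of transactions. Note that this is brute-force pair counting, not the FP-Growth algorithm; it only illustrates the kind of output a frequent pattern miner produces. The function name and data are hypothetical.

```python
from collections import Counter
from itertools import combinations

def frequent_pairs(transactions, min_support):
    """Return {(item_a, item_b): count} for pairs seen in >= min_support transactions."""
    counts = Counter()
    for items in transactions:
        # Sort and deduplicate so each pair has one canonical form.
        for pair in combinations(sorted(set(items)), 2):
            counts[pair] += 1
    return {pair: n for pair, n in counts.items() if n >= min_support}

# Made-up shopping baskets.
transactions = [
    ["milk", "eggs", "bread"],
    ["milk", "bread"],
    ["eggs", "bread", "butter"],
    ["milk", "eggs", "bread"],
]
```

Brute-force counting grows quickly with basket size, which is the problem tree-based approaches such as FP-Growth are designed to avoid.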
Frequent patterns in Tweets

“Identify groups of words that occur together”
                       Or
 “Identify related searches from search logs”
Mahout is Evolving
  Mapreduce enabled fitness functions for Genetic programming
   -  Integration with Watchmaker
   -  Solves: Travelling salesman, class discovery and many others
  Singular Value decomposition [SVD] of large matrices
   -  Reduce a large matrix into a smaller one by identifying the key
      rows and columns and discarding the others
   -  Mapreduce implementation of Lanczos algorithm
Vector
Representing Data as Vectors
  [Figure: the point (5, 3) plotted on X and Y axes, with X = 5 and Y = 3]
  The vector denoted by point (5, 3) is simply
   Array([5, 3]) or HashMap([0 => 5], [1 => 3])
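
The two representations can be sketched as a list (dense) and a dict keyed by dimension index (sparse). The helper names are hypothetical; this is plain Python for illustration, not Mahout's Vector classes.

```python
dense = [5.0, 3.0]          # Array-style: position is the dimension.
sparse = {0: 5.0, 1: 3.0}   # HashMap-style: index => value.

def to_sparse(dense_vector):
    # Store only the non-zero entries.
    return {i: v for i, v in enumerate(dense_vector) if v != 0.0}

def to_dense(sparse_vector, size):
    # Missing indices are implicitly zero.
    return [sparse_vector.get(i, 0.0) for i in range(size)]
```

The sparse form pays off when most dimensions are zero, as is typical for text, where each document uses only a small part of the vocabulary.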
Representing Vectors – The basics
  Now think 3, 4, 5, ….. n-dimensional
  Think of a document as a bag of words.
                “she sells sea shells on the sea shore”
  Now map them to integers
                                 she => 0
                                sells => 1
                                 sea => 2
                                and so on
  The resulting vector [1.0, 1.0, 2.0, … ]
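
The mapping from words to integers and the resulting count vector can be sketched as follows. This assumes simple whitespace tokenization and assigns indices in order of first appearance; `vectorize` is a hypothetical helper.

```python
def vectorize(text):
    """Return (dictionary, counts): word => index, and per-index word counts."""
    dictionary = {}
    counts = []
    for word in text.split():
        if word not in dictionary:
            # Assign the next free integer to a word seen for the first time.
            dictionary[word] = len(dictionary)
            counts.append(0.0)
        counts[dictionary[word]] += 1.0
    return dictionary, counts

dictionary, vector = vectorize("she sells sea shells on the sea shore")
```

Here "sea" maps to index 2 and appears twice, so the third entry of the vector is 2.0.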
Vectorizer tools
  Map/Reduce tools to convert text data to vectors
   -  Collate multiple words (n-grams), e.g. “San Francisco”
   -  Normalization
   -  Optimize for sequential or random access
   -  TF-IDF calculation
   -  Pruning
   -  Stop words removal
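
A few of these steps can be sketched in plain Python: n-gram generation, TF-IDF weighting, and L2 normalization. The TF-IDF variant used here (term count times log of N over document frequency) is one common choice and an assumption; Mahout's exact weighting may differ. All function names are hypothetical.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    # Join each run of n consecutive tokens into one feature.
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def tfidf(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))  # Document frequency: count each term once per doc.
    n_docs = len(docs)
    vectors = []
    for tokens in docs:
        tf = Counter(tokens)
        vectors.append({t: c * math.log(n_docs / df[t]) for t, c in tf.items()})
    return vectors

def l2_normalize(vec):
    # Scale the vector to unit Euclidean length; leave all-zero vectors unchanged.
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm else dict(vec)
```

Terms that appear in every document get a weight of zero under this formula, which has an effect similar to stop word removal.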
How to use mahout
  Command line launcher bin/mahout
  See the list of tools and algorithms by running bin/mahout
  Run any algorithm by its shortname:
   -  bin/mahout kmeans --help
  By default runs locally
  export HADOOP_HOME=/pathto/hadoop-0.20.2/
   -  Runs on the cluster configured as per the conf files in the
      hadoop directory
  Use driver classes to launch jobs:
   -  KMeansDriver.runJob(Path input, Path output …)
Clustering Walkthrough (tiny example)
  Input: set of text files in a directory
  Download Mahout and unzip
   -  mvn install
   -  bin/mahout seqdirectory -i <input> -o <seq-output>
   -  bin/mahout seq2sparse -i <seq-output> -o <vector-output>
   -  bin/mahout kmeans -i <vector-output> -c <cluster-temp> -o <cluster-output> -k 10 -cd 0.01 -x 20
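
What the k-means step does to the vectors can be sketched in plain Python. The parameters mirror the flags above on the assumption that -k is the number of clusters, -cd the convergence delta and -x the maximum number of iterations. This is a sequential toy, not Mahout's MapReduce job, and the function name is hypothetical. It needs Python 3.8+ for `math.dist`.

```python
import math

def kmeans(points, centroids, convergence_delta=0.01, max_iterations=20):
    """points: list of coordinate tuples; centroids: k initial centers."""
    centroids = [list(c) for c in centroids]
    for _ in range(max_iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for old, members in zip(centroids, clusters):
            if members:
                new_centroids.append([sum(dim) / len(members) for dim in zip(*members)])
            else:
                new_centroids.append(old)  # Keep an empty cluster's centroid in place.
        shift = max(math.dist(a, b) for a, b in zip(centroids, new_centroids))
        centroids = new_centroids
        if shift <= convergence_delta:
            break  # No centroid moved more than the convergence delta.
    return centroids

# Made-up 2-D points forming two obvious groups.
points = [(0, 0), (0, 2), (10, 10), (10, 12)]
result = kmeans(points, [(0, 0), (10, 10)])
```

The distance function is the pluggable part: swapping `math.dist` for a cosine or Tanimoto distance changes which points count as "near" without changing the loop.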
Clustering Walkthrough (a bit more)

  Use bigrams: -ng 2
  Prune low frequency: -s 10
  Normalize: -n 2

  Use a distance measure: -dm
   org.apache.mahout.common.distance.CosineDistanceMeasure
Clustering Walkthrough (viewing results)
  bin/mahout clusterdump \
     -s cluster-output/clusters-9/part-00000 \
     -d vector-output/dictionary.file-* \
     -dt sequencefile -n 5 -b 100


  Top terms in a typical cluster
         comic                      =>   9.793121272867376
         comics                     =>   6.115341078151356
         con                        =>   5.015090566692931
         sdcc                       =>   3.927590843402978
         webcomics                  =>   2.916910980686997
Mahout 0.4 (trunk)
  New breed of classifiers:
    -  Stochastic Gradient Descent (SGD)
    -  Pegasos SVM (Order of magnitude faster than SVM Perf)
    -  Lib Linear (Winner, ICML 2008)
  New Recommenders:
    -  Restricted Boltzmann Machine (RBM) based recommender
    -  SVD++ recommender
  New Clustering algorithms:
    -  Spectral Clustering
    -  K-Means++
  Full Hadoop 0.20 API compliance and performance
   improvements
Get Started
  http://mahout.apache.org
  dev@mahout.apache.org - Developer mailing list
  user@mahout.apache.org - User mailing list
  Check out the documentation and wiki for a quick start
  http://svn.apache.org/repos/asf/mahout/trunk/ - Browse code
Resources
  “Mahout in Action” Owen, Anil, Dunning, Friedman
   http://www.manning.com/owen




  “Taming Text” Ingersoll, Morton, Farris
   http://www.manning.com/ingersoll




  “Introducing Apache Mahout”
   http://www.ibm.com/developerworks/java/library/j-mahout/
Thanks to
  Apache Foundation
  Mahout Committers
  Google Summer of Code Organizers
  And Students
  OSCON
  Open source!
References
  news.google.com
  Cat http://www.flickr.com/photos/gattou/3178745634/
  Dog http://www.flickr.com/photos/30800139@N04/3879737638/
  Milk Eggs Bread
   http://www.flickr.com/photos/nauright/4792775946/
  Amazon Recommendations
  twitter
