This document summarizes a presentation on classifying data using the Mahout machine learning library. It begins with an overview of classification and Mahout. It then describes using Mahout for classification, including preparing a dataset on question tags, splitting the data into training and test sets, building a naive Bayes classifier model, and applying the model to classify new data. Code examples and commands are provided for each step.
2. Agenda
● Classification Overview
● Mahout Overview
○ Classification on Mahout
● Case Study with Demo
○ Problem Description
○ Working Environment
○ Data Preparation
○ ML Model Generation
3. Classification?
● Classifying examples into a given set of categories
● Supervised learning
○ Prepare data
○ Build classifier (train & test)
○ Apply classifier to new data
4. Mahout?
● Scalable machine learning library (can handle Big Data)
● Runs on Hadoop, with data stored in HDFS
● Classification, Clustering, Collaborative Filtering, etc.
5. Classification on Mahout?
Classification: classifying examples into a given set of categories
Mahout: scalable machine learning library that can handle big data
Classification on Mahout: classifying big data into a given set of categories
6. Case Study & Demo
Given a question with a title and a body, can we automatically generate tags for it?
Title: Where can I find the LaTeX3 manual?
Body: Few month ago I saw a big pdf-manual of all LaTeX3-packages and the new syntax. I think it was bigger than 300 pages. I can't find it on the web. Does anyone have a link?
Tags: Documentation, latex3, expl3
8. Working Environment
● Mac OS 10.9.1
● Eclipse 4.3.2
● Hadoop 1.2.1
● Mahout 0.9
● Source code available here.
9. Prerequisite (Where are you?)
● The input TSV file is at result > output-topfivetags.
● Your Terminal's working directory is “result”.
● The “hadoop” and “mahout” commands both work.
10. Prepare Data
1. Convert the TSV file to Hadoop sequence file format, using the tag as the category. (Run TSVToSeq.java)
This creates the output-tsvtoseq folder, which contains the chunk-0 file.
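To make the "tag as a category" idea concrete, here is a small sketch that previews the key/value pairs a converter could write. It is not TSVToSeq.java itself. It assumes a hypothetical three-column TSV layout (id, tag, text) and the /category/id key convention used in the Twitter examples listed in the references; the real input file may be laid out differently.

```shell
# Hypothetical column layout: id <TAB> tag <TAB> text (exactly three columns).
# Prints one "key <TAB> value" line per record, where the key is /tag/id.
preview_keys() {
  awk -F '\t' '{ printf "/%s/%s\t%s\n", $2, $1, $3 }' "$@"
}

# Example record fed on stdin; pass a file name instead to preview a real file.
printf '101\tlatex3\tWhere can I find the LaTeX3 manual?\n' | preview_keys
```

The function only prints text, so it is safe to run on the input file before doing the actual conversion.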
11. Prepare Data
2. Make a directory in HDFS and upload chunk-0 (the sequence file) to it.
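A minimal sketch of the upload step, assuming you are in the “result” directory. The HDFS directory name tagdata/seq is made up for illustration. The run helper only prints each command unless RUN=1 is set, so the commands can be checked before anything touches HDFS.

```shell
# Hypothetical names; adjust to your own layout.
HDFS_DIR="tagdata/seq"               # assumed HDFS target directory
LOCAL_SEQ="output-tsvtoseq/chunk-0"  # sequence file from the previous step

# Print the command; execute it only when RUN=1 is set.
run() {
  echo "+ $*"
  if [ "${RUN:-0}" = "1" ]; then "$@"; fi
}

run hadoop fs -mkdir "$HDFS_DIR"
run hadoop fs -put "$LOCAL_SEQ" "$HDFS_DIR"
```

Run the script with RUN=1 once the printed commands look right.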
17. Prepare Data
3. Split data into
a. Train set : to train model
b. Test set : to test model
18. mahout split
-i <input directory>
--trainingOutput <output dir to train>
--testOutput <output dir to test>
--randomSelectionPct <integer>
--overwrite
--sequenceFiles
-xm sequential
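As an illustration, the template above could be filled in as follows. All directory names are hypothetical, and the input is assumed to already hold the vectorized data. The value 20 for --randomSelectionPct is an example choice, meant to hold out roughly 20% of the examples for testing. Commands are printed only, unless RUN=1 is set.

```shell
# Print the command; execute it only when RUN=1 is set.
run() {
  echo "+ $*"
  if [ "${RUN:-0}" = "1" ]; then "$@"; fi
}

# Arguments: input dir, training output dir, test output dir, test percentage.
split_data() {
  run mahout split \
    -i "$1" \
    --trainingOutput "$2" \
    --testOutput "$3" \
    --randomSelectionPct "$4" \
    --overwrite --sequenceFiles -xm sequential
}

# Hypothetical HDFS paths.
split_data tagdata/vectors tagdata/train tagdata/test 20
```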
20. Build Classifier
1. Choose algorithm to use for classification
Available algorithms:
○ Naive Bayes
■ trainnb, testnb
■ org.apache.mahout.classifier.naivebayes
○ Hidden Markov Model
■ baumwelch, hmmpredict
■ org.apache.mahout.classifier.sequencelearning.hmm
○ Logistic Regression
■ trainlogistic, testlogistic
■ org.apache.mahout.classifier.sgd
○ Random Forest
■ ?
■ ?
21. Build Classifier (Naive Bayes)
2. Train & test model using train set
Should yield high accuracy
22. mahout trainnb
-i <dir to train vectors>
-el
-li <dir to put label index>
-o <dir to put model>
-ow
-c
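A sketch of the training call with the placeholders filled in. The paths are hypothetical. The flags are the ones shown above: -el extracts labels from the input keys, -li sets where the label index is written, -ow overwrites existing output, and -c trains the complementary variant. Commands are printed only, unless RUN=1 is set.

```shell
# Print the command; execute it only when RUN=1 is set.
run() {
  echo "+ $*"
  if [ "${RUN:-0}" = "1" ]; then "$@"; fi
}

# Arguments: training vectors dir, label index path, model output dir.
train_nb() {
  run mahout trainnb -i "$1" -el -li "$2" -o "$3" -ow -c
}

# Hypothetical HDFS paths.
train_nb tagdata/train tagdata/labelindex tagdata/model
```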
24. mahout testnb
-i <dir to train vectors>
-m <dir to model>
-l <dir to label index>
-ow
-o <output dir>
-c
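A matching sketch for testnb, again with hypothetical paths. The first call evaluates the model on the train set, as in this step. The same helper, pointed at the held-out test directory, gives the test-set accuracy. Commands are printed only, unless RUN=1 is set.

```shell
# Print the command; execute it only when RUN=1 is set.
run() {
  echo "+ $*"
  if [ "${RUN:-0}" = "1" ]; then "$@"; fi
}

# Arguments: vectors dir to evaluate, model dir, label index path, output dir.
test_nb() {
  run mahout testnb -i "$1" -m "$2" -l "$3" -ow -o "$4" -c
}

# Hypothetical HDFS paths.
test_nb tagdata/train tagdata/model tagdata/labelindex tagdata/result-train
test_nb tagdata/test  tagdata/model tagdata/labelindex tagdata/result-test
```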
26. Build Classifier (Naive Bayes)
3. Test model using test set
Check if the accuracy is satisfactory
28. Apply Classifier
What do you have at this point?
● model
● label index
You can start classifying new data! (Check this example)
29. References
● Using the Mahout Naive Bayes Classifier to automatically classify Twitter messages
● Using the Mahout Naive Bayes Classifier to automatically classify Twitter messages (part 2: distribute classification with hadoop)