Mahout classification presentation


This document summarizes a presentation on classifying data using the Mahout machine learning library. It begins with an overview of classification and Mahout. It then describes using Mahout for classification, including preparing a dataset on question tags, splitting the data into training and test sets, building a naive Bayes classifier model, and applying the model to classify new data. Code examples and commands are provided for each step.

Technology Education

Classification on Mahout
Naoki Nakatani
San Jose State University
CS185C Spring 2014

Agenda
● Classification Overview
● Mahout Overview
○ Classification on Mahout
● Case Study with Demo
○ Problem Description
○ Working Environment
○ Data Preparation
○ ML Model Generation

Classification?
● Classifying examples into a given set of categories
● Supervised learning
○ Prepare data
○ Build classifier (train & test)
○ Apply classifier to new data
http://www.ndm.net/opentext/images/stories/images/extraction_cmyk_thumb.jpg

Classification on Mahout?
● Classification: classifying examples into a given set of categories
● Mahout: scalable machine learning library that can handle big data
● Classification on Mahout: classifying big data into a given set of categories

Case Study & Demo
Given a question with a title and body, can we automatically generate tags for it?
Example question:
● Title: Where can I find the LaTeX3 manual?
● Body: Few month ago I saw a big pdf-manual of all LaTeX3-packages and the new syntax. I think it was bigger than 300 pages. I can't find it on the web. Does anyone have a link?
● Tags: Documentation, latex3, expl3

Dataset
File :
● TrainSmall.tsv
Fields :
● id, title, body, tags
Characteristics :
● Each question contains only one tag
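To make the record layout concrete, here is a minimal sketch of reading one row with the four fields listed above. The class name `Question` is hypothetical, and the sketch assumes plain tab-separated fields with no header row and no quoting, which may not match TrainSmall.tsv exactly.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** One row of the question dataset: id, title, body, tag (hypothetical helper class). */
final class Question {
    final String id;
    final String title;
    final String body;
    final String tag;

    Question(String id, String title, String body, String tag) {
        this.id = id;
        this.title = title;
        this.body = body;
        this.tag = tag;
    }

    /** Parses one tab-separated line; assumes exactly four unquoted fields. */
    static Question parse(String line) {
        // A limit of -1 keeps trailing empty fields, so a missing tag is detected.
        String[] f = line.split("\t", -1);
        if (f.length != 4) {
            throw new IllegalArgumentException("expected 4 fields, got " + f.length);
        }
        return new Question(f[0].trim(), f[1].trim(), f[2].trim(), f[3].trim());
    }

    /** Parses every non-blank line of the input. */
    static List<Question> parseAll(List<String> lines) {
        List<Question> out = new ArrayList<>();
        for (String line : lines) {
            if (!line.trim().isEmpty()) {
                out.add(parse(line));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Question> qs = parseAll(Arrays.asList(
            "1\tWhere can I find the LaTeX3 manual?\tDoes anyone have a link?\tlatex3"));
        System.out.println(qs.get(0).id + " -> " + qs.get(0).tag);
    }
}
```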

Working Environment
● Mac OS 10.9.1
● Eclipse 4.3.2
● Hadoop 1.2.1
● Mahout 0.9
● Source code available here.

Prerequisite (Where are you?)
● You have the input TSV file in result > output-topfivetags.
● You are in the “result” directory in Terminal.
● The “hadoop” and “mahout” commands are working.

Prepare Data
1. Convert the TSV file to Hadoop sequence file format, specifying the tag as the category. (Run TSVToSeq.java)
The output-tsvtoseq folder and the chunk-0 file are created.
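The source of TSVToSeq.java is not shown here, so the following is only a sketch of the key/value pairs such a converter would typically emit. It assumes the common Mahout naive Bayes convention in which the sequence-file key has the form /category/documentId, so that the label can later be extracted from the key (the `-el` option of trainnb). The class name `SeqRecord` and its methods are hypothetical; actually writing the pairs would use Hadoop's SequenceFile writer with Text keys and values, which is omitted so the sketch runs without Hadoop on the classpath.

```java
/** Builds the key/value strings for one sequence-file record (hypothetical helper). */
final class SeqRecord {

    /** Key in the assumed /category/documentId form, e.g. /latex3/42. */
    static String key(String tag, String id) {
        return "/" + tag + "/" + id;
    }

    /** Value is the text to classify; here the title and body joined by a space. */
    static String value(String title, String body) {
        return title + " " + body;
    }

    /** Recovers the category from a key: the segment between the first two slashes. */
    static String labelFromKey(String key) {
        // "/latex3/42".split("/") yields ["", "latex3", "42"]
        return key.split("/")[1];
    }

    public static void main(String[] args) {
        String k = key("latex3", "42");
        String v = value("Where can I find the LaTeX3 manual?", "Does anyone have a link?");
        System.out.println(k + "\t" + v);
        System.out.println("label = " + labelFromKey(k));
    }
}
```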

Prepare Data
1. Make a directory in HDFS and upload chunk-0 (the sequence file) to it.

Prepare Data
2. Transform questions into vectors. (mahout seq2sparse)

mahout seq2sparse -i <input directory> -o <output directory>
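To illustrate what "transforming questions into vectors" means, here is a small TF-IDF sketch in plain Java. It is not Mahout's implementation: the tokenizer, the class name `TfIdfSketch`, and the weighting formula tf × ln(N / df) are simplifying assumptions, and seq2sparse uses its own analyzer, dictionary, and weighting.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Toy TF-IDF vectorizer (illustrative only, not Mahout's seq2sparse). */
final class TfIdfSketch {

    /** Lowercases the text and splits on anything that is not a letter or digit. */
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase().split("[^a-z0-9]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    /** Returns one sparse vector (term -> weight) per document. */
    static List<Map<String, Double>> vectorize(List<String> docs) {
        List<Map<String, Integer>> termFreqs = new ArrayList<>();
        Map<String, Integer> docFreq = new HashMap<>();
        for (String doc : docs) {
            Map<String, Integer> tf = new HashMap<>();
            for (String t : tokenize(doc)) {
                tf.merge(t, 1, Integer::sum);
            }
            // Each distinct term counts once per document for document frequency.
            for (String t : tf.keySet()) {
                docFreq.merge(t, 1, Integer::sum);
            }
            termFreqs.add(tf);
        }
        int n = docs.size();
        List<Map<String, Double>> vectors = new ArrayList<>();
        for (Map<String, Integer> tf : termFreqs) {
            Map<String, Double> vec = new TreeMap<>();
            for (Map.Entry<String, Integer> e : tf.entrySet()) {
                double idf = Math.log((double) n / docFreq.get(e.getKey()));
                vec.put(e.getKey(), e.getValue() * idf);
            }
            vectors.add(vec);
        }
        return vectors;
    }

    public static void main(String[] args) {
        System.out.println(vectorize(Arrays.asList(
            "Where can I find the LaTeX3 manual?",
            "LaTeX3 syntax error in expl3 macro")));
    }
}
```

Terms that appear in every document get weight 0 under this formula, which is the intuition behind down-weighting words that do not help distinguish categories.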

Prepare Data
3. Split data into
a. Train set : to train model
b. Test set : to test model

mahout split
-i <input directory>
--trainingOutput <output dir to train>
--testOutput <output dir to test>
--randomSelectionPct <integer>
--overwrite
--sequenceFiles
-xm sequential
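The idea behind a percentage-based random split can be sketched in a few lines. This is an illustration, not what `mahout split` does internally: the class name `SplitSketch`, the fixed seed, and the per-item coin flip are assumptions, with `testPct` playing the role of the `--randomSelectionPct` value.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

/** Toy random train/test split (illustrative only). */
final class SplitSketch {

    /**
     * Sends each item to the test set with probability testPct/100,
     * otherwise to the train set. Returns a list holding [train, test].
     */
    static <T> List<List<T>> split(List<T> items, int testPct, long seed) {
        if (testPct < 0 || testPct > 100) {
            throw new IllegalArgumentException("testPct must be in 0..100");
        }
        Random rnd = new Random(seed);
        List<T> train = new ArrayList<>();
        List<T> test = new ArrayList<>();
        for (T item : items) {
            if (rnd.nextInt(100) < testPct) {
                test.add(item);
            } else {
                train.add(item);
            }
        }
        List<List<T>> parts = new ArrayList<>();
        parts.add(train);
        parts.add(test);
        return parts;
    }

    public static void main(String[] args) {
        List<Integer> items = new ArrayList<>();
        for (int i = 0; i < 100; i++) {
            items.add(i);
        }
        List<List<Integer>> parts = split(items, 20, 42L);
        System.out.println("train=" + parts.get(0).size() + " test=" + parts.get(1).size());
    }
}
```

Because each item is drawn independently, the test set is only approximately `testPct` percent of the data.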

Build Classifier
1. Choose algorithm to use for classification
Available algorithms:
○ Naive Bayes
■ trainnb, testnb
■ org.apache.mahout.classifier.naivebayes
○ Hidden Markov Model
■ baumwelch, hmmpredict
■ org.apache.mahout.classifier.sequencelearning.hmm
○ Logistic Regression
■ trainlogistic, testlogistic
■ org.apache.mahout.classifier.sgd
○ Random Forest
■ ?
■ ?

Build Classifier (Naive Bayes)
2. Train & test model using train set
Should yield high accuracy

mahout trainnb
-i <dir to train vectors>
-el
-li <dir to put label index>
-o <dir to put model>
-ow
-c

mahout testnb
-i <dir to train vectors>
-m <dir to model>
-l <dir to label index>
-ow
-o <output dir>
-c
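For intuition about what the trained model contains, here is a toy multinomial naive Bayes classifier with add-one (Laplace) smoothing. It is a sketch under stated assumptions, not Mahout's code: the class name `NaiveBayesSketch` is hypothetical, it works on raw token counts rather than TF-IDF vectors, and it implements the standard variant rather than the complementary variant selected by the `-c` flag above.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Toy multinomial naive Bayes with add-one smoothing (illustrative only). */
final class NaiveBayesSketch {
    private final Map<String, Integer> docsPerLabel = new HashMap<>();
    private final Map<String, Map<String, Integer>> termCounts = new HashMap<>();
    private final Map<String, Integer> termsPerLabel = new HashMap<>();
    private final Set<String> vocabulary = new HashSet<>();
    private int totalDocs = 0;

    /** Adds one labeled document to the model. */
    void train(String label, List<String> tokens) {
        totalDocs++;
        docsPerLabel.merge(label, 1, Integer::sum);
        Map<String, Integer> counts = termCounts.computeIfAbsent(label, k -> new HashMap<>());
        for (String t : tokens) {
            counts.merge(t, 1, Integer::sum);
            termsPerLabel.merge(label, 1, Integer::sum);
            vocabulary.add(t);
        }
    }

    /** Returns the label with the highest log-probability, or null if untrained. */
    String classify(List<String> tokens) {
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (String label : docsPerLabel.keySet()) {
            // log prior: share of training documents carrying this label
            double score = Math.log((double) docsPerLabel.get(label) / totalDocs);
            Map<String, Integer> counts = termCounts.get(label);
            int total = termsPerLabel.getOrDefault(label, 0);
            for (String t : tokens) {
                int c = counts.getOrDefault(t, 0);
                // add-one smoothing keeps unseen terms from zeroing the product
                score += Math.log((c + 1.0) / (total + vocabulary.size()));
            }
            if (score > bestScore) {
                bestScore = score;
                best = label;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        NaiveBayesSketch nb = new NaiveBayesSketch();
        nb.train("latex3", Arrays.asList("latex3", "manual", "pdf"));
        nb.train("expl3", Arrays.asList("expl3", "syntax", "macro"));
        System.out.println(nb.classify(Arrays.asList("manual", "pdf")));
    }
}
```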

Build Classifier (Naive Bayes)
3. Test model using test set
Check if the accuracy is satisfactory
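Checking accuracy comes down to comparing predicted labels with the true labels of the test set. The sketch below computes overall accuracy and a confusion matrix from two parallel label lists; the class name `EvalSketch` is hypothetical, and the output format of testnb itself is not reproduced here.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Toy evaluation helpers for a classifier (illustrative only). */
final class EvalSketch {

    /** Fraction of positions where the predicted label equals the actual label. */
    static double accuracy(List<String> actual, List<String> predicted) {
        if (actual.isEmpty() || actual.size() != predicted.size()) {
            throw new IllegalArgumentException("lists must be non-empty and the same length");
        }
        int correct = 0;
        for (int i = 0; i < actual.size(); i++) {
            if (actual.get(i).equals(predicted.get(i))) {
                correct++;
            }
        }
        return (double) correct / actual.size();
    }

    /** Confusion matrix as actual label -> (predicted label -> count). */
    static Map<String, Map<String, Integer>> confusion(List<String> actual, List<String> predicted) {
        if (actual.size() != predicted.size()) {
            throw new IllegalArgumentException("lists must be the same length");
        }
        Map<String, Map<String, Integer>> matrix = new TreeMap<>();
        for (int i = 0; i < actual.size(); i++) {
            matrix.computeIfAbsent(actual.get(i), k -> new TreeMap<>())
                  .merge(predicted.get(i), 1, Integer::sum);
        }
        return matrix;
    }

    public static void main(String[] args) {
        List<String> actual = Arrays.asList("latex3", "latex3", "expl3", "expl3");
        List<String> predicted = Arrays.asList("latex3", "expl3", "expl3", "expl3");
        System.out.println("accuracy = " + accuracy(actual, predicted));
        System.out.println(confusion(actual, predicted));
    }
}
```

A large gap between accuracy on the train set and on the test set suggests the model has memorized the training data rather than generalized.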

Apply Classifier
What do you have at this point?
● model
● label index
You can start classifying new data! (Check this example)

References
● Using the Mahout Naive Bayes Classifier to automatically classify Twitter messages
● Using the Mahout Naive Bayes Classifier to automatically classify Twitter messages (part 2: distribute classification with hadoop)

Mahout classification presentation

4. Mahout? ● Scalable machine learning library = Can handle Big Data ● Runs on HDFS ● Classification, Clustering, Collaborative Filtering, etc. http://www.robinanil.com/wp-content/uploads/2010/03/mahout-logo-200.png

12. hadoop fs -mkdir <directory>

13. hadoop fs -put <source> <destination>

30. Happy Machine Learning!
