SlideShare a Scribd company logo
1 of 30
Download to read offline
Classification on Mahout
Naoki Nakatani
San Jose State University
CS185C Spring 2014
Agenda
● Classification Overview
● Mahout Overview
○ Classification on Mahout
● Case Study with Demo
○ Problem Description
○ Working Environment
○ Data Preparation
○ ML Model Generation
Classification?
● Classifying examples into given set of categories
● Supervised learning
○ Prepare data
○ Build classifier (train & test)
○ Apply classifier to new data
http://www.ndm.net/opentext/images/stories/images/extraction_cmyk_thumb.jpg
Mahout?
● Scalable machine learning
library
= Can handle Big Data
● Runs on HDFS
● Classification, Clustering,
Collaborative Filtering , etc
http://www.robinanil.com/wp-content/uploads/2010/03/mahout-logo-200.png
Classification on Mahout?
Classifying examples into given set of categories
Scalable machine learning library that can handle big data
Classifying big data into given set of categories
Case Study & Demo
Given question with title and body, can we
automatically generate tags for it?
Where can I find the
LaTeX3 manual?
Few month ago I saw a big pdf-manual of all
LaTeX3-packages and the new syntax. I think
it was bigger than 300 pages. I can't find it on
the web.
Does anyone have a link?
Documentation
latex3
expl3
Dataset
File :
● TrainSmall.tsv
Fields :
● id, title, body, tags
Characteristics :
● Each question contains
only one tag
0
“----” , ”-----------” , “------------------------” , “--- --- --- ---
”
0
0
“----” , ”-----------” , “------------------------” , “--- --- --- ---
”
“----” , ”-----------” , “------------------------” , “--- --- --- ---
”
Working Environment
● Mac OS 10.9.1
● Eclipse 4.3.2
● Hadoop 1.2.1
● Mahout 0.9
● Source code available here.
Prerequisite (Where are you?)
● You have input tsv file at result > output-topfivetags.
● You are at “result” directory in Terminal.
● Command “hadoop” and “mahout” is working.
Prepare Data
1. Convert TSV file to Hadoop sequence file format.
Specify tag as a category. (Run TSVToSeq.java)
output-tsvtoseq folder
and chunk-0 file is
created.
Prepare Data
1. Make directory in HDFS and upload chunk-0 (sequence
file) to the folder.
hadoop fs -mkdir <directory>
hadoop fs -put <source> <destination>
Prepare Data
2. Transform questions into vectors. (mahout seq2sparse)
mahout seq2sparse -i <input directory> -o <output directory>
Prepare Data
3. Split data into
a. Train set : to train model
b. Test set : to test model
mahout split 
-i <input directory> 
--trainingOutput <output dir to train> 
--testOutput <output dir to test> 
--randomSelectionPct <integer> 
--overwrite 
--sequenceFiles 
-xm sequential
Build Classifier
1. Choose algorithm to use for classification
Available algorithms:
○ Naive Bayes
■ trainnb, testnb
■ org.apache.mahout.
classifier.naivebayes
○ Hidden Markov Model
■ baumwelch, hmmpredict
■ org.apache.mahout.
classifier.sequencelearning.
hmm
○ Logistic Regression
■ trainlogistic, testlogistic
■ org.apache.mahout.
classifier.sgd
○ Random Forest
■ ?
■ ?
2. Train & test model using train set
Should yield high accuracy
Build Classifier (Naive Bayes)
mahout trainnb 
-i <dir to train vectors> 
-el 
-li <dir to put label index> 
-o <dir to put model> 
-ow 
-c
mahout testnb 
-i <dir to train vectors> 
-m <dir to model> 
-l <dir to label index> 
-ow 
-o <output dir> 
-c
Build Classifier (Naive Bayes)
3. Test model using test set
Check if the accuracy is satisfactory
Apply Classifier
What do you have at this point?
● model
● label index
You can start classifying new data! (Check this example)
Model
Label Index
References
● Using the Mahout Naive Bayes Classifier to automatically classify Twitter
messages
● Using the Mahout Naive Bayes Classifier to automatically classify Twitter
messages (part 2: distribute classification with hadoop)
Happy Machine Learning!

More Related Content

What's hot

Intro to Mahout -- DC Hadoop
Intro to Mahout -- DC HadoopIntro to Mahout -- DC Hadoop
Intro to Mahout -- DC HadoopGrant Ingersoll
 
Tutorial Mahout - Recommendation
Tutorial Mahout - RecommendationTutorial Mahout - Recommendation
Tutorial Mahout - RecommendationCataldo Musto
 
Mahout Introduction BarCampDC
Mahout Introduction BarCampDCMahout Introduction BarCampDC
Mahout Introduction BarCampDCDrew Farris
 
Apache Mahout Architecture Overview
Apache Mahout Architecture OverviewApache Mahout Architecture Overview
Apache Mahout Architecture OverviewStefano Dalla Palma
 
Introduction to Collaborative Filtering with Apache Mahout
Introduction to Collaborative Filtering with Apache MahoutIntroduction to Collaborative Filtering with Apache Mahout
Introduction to Collaborative Filtering with Apache Mahoutsscdotopen
 
Logistic Regression using Mahout
Logistic Regression using MahoutLogistic Regression using Mahout
Logistic Regression using Mahouttanuvir
 
Apache Mahout 於電子商務的應用
Apache Mahout 於電子商務的應用Apache Mahout 於電子商務的應用
Apache Mahout 於電子商務的應用James Chen
 
Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms
Cassandra Summit 2014: Apache Spark - The SDK for All Big Data PlatformsCassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms
Cassandra Summit 2014: Apache Spark - The SDK for All Big Data PlatformsDataStax Academy
 
Apache Mahout
Apache MahoutApache Mahout
Apache MahoutAjit Koti
 
Apache Mahout Tutorial - Recommendation - 2013/2014
Apache Mahout Tutorial - Recommendation - 2013/2014 Apache Mahout Tutorial - Recommendation - 2013/2014
Apache Mahout Tutorial - Recommendation - 2013/2014 Cataldo Musto
 
Introduction to Apache Mahout
Introduction to Apache MahoutIntroduction to Apache Mahout
Introduction to Apache MahoutAman Adhikari
 
Next directions in Mahout's recommenders
Next directions in Mahout's recommendersNext directions in Mahout's recommenders
Next directions in Mahout's recommenderssscdotopen
 
Random forest using apache mahout
Random forest using apache mahoutRandom forest using apache mahout
Random forest using apache mahoutGaurav Kasliwal
 
OSCON: Apache Mahout - Mammoth Scale Machine Learning
OSCON: Apache Mahout - Mammoth Scale Machine LearningOSCON: Apache Mahout - Mammoth Scale Machine Learning
OSCON: Apache Mahout - Mammoth Scale Machine LearningRobin Anil
 
An Introduction to Supervised Machine Learning and Pattern Classification: Th...
An Introduction to Supervised Machine Learning and Pattern Classification: Th...An Introduction to Supervised Machine Learning and Pattern Classification: Th...
An Introduction to Supervised Machine Learning and Pattern Classification: Th...Sebastian Raschka
 

What's hot (20)

Intro to Mahout -- DC Hadoop
Intro to Mahout -- DC HadoopIntro to Mahout -- DC Hadoop
Intro to Mahout -- DC Hadoop
 
Tutorial Mahout - Recommendation
Tutorial Mahout - RecommendationTutorial Mahout - Recommendation
Tutorial Mahout - Recommendation
 
Mahout Introduction BarCampDC
Mahout Introduction BarCampDCMahout Introduction BarCampDC
Mahout Introduction BarCampDC
 
Apache Mahout Architecture Overview
Apache Mahout Architecture OverviewApache Mahout Architecture Overview
Apache Mahout Architecture Overview
 
Introduction to Collaborative Filtering with Apache Mahout
Introduction to Collaborative Filtering with Apache MahoutIntroduction to Collaborative Filtering with Apache Mahout
Introduction to Collaborative Filtering with Apache Mahout
 
Logistic Regression using Mahout
Logistic Regression using MahoutLogistic Regression using Mahout
Logistic Regression using Mahout
 
Intro to Apache Mahout
Intro to Apache MahoutIntro to Apache Mahout
Intro to Apache Mahout
 
Apache Mahout
Apache MahoutApache Mahout
Apache Mahout
 
Apache Mahout 於電子商務的應用
Apache Mahout 於電子商務的應用Apache Mahout 於電子商務的應用
Apache Mahout 於電子商務的應用
 
Apache mahout
Apache mahoutApache mahout
Apache mahout
 
Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms
Cassandra Summit 2014: Apache Spark - The SDK for All Big Data PlatformsCassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms
Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms
 
Apache Mahout
Apache MahoutApache Mahout
Apache Mahout
 
Apache Mahout Tutorial - Recommendation - 2013/2014
Apache Mahout Tutorial - Recommendation - 2013/2014 Apache Mahout Tutorial - Recommendation - 2013/2014
Apache Mahout Tutorial - Recommendation - 2013/2014
 
Mahout
MahoutMahout
Mahout
 
Introduction to Apache Mahout
Introduction to Apache MahoutIntroduction to Apache Mahout
Introduction to Apache Mahout
 
mahout introduction
mahout  introductionmahout  introduction
mahout introduction
 
Next directions in Mahout's recommenders
Next directions in Mahout's recommendersNext directions in Mahout's recommenders
Next directions in Mahout's recommenders
 
Random forest using apache mahout
Random forest using apache mahoutRandom forest using apache mahout
Random forest using apache mahout
 
OSCON: Apache Mahout - Mammoth Scale Machine Learning
OSCON: Apache Mahout - Mammoth Scale Machine LearningOSCON: Apache Mahout - Mammoth Scale Machine Learning
OSCON: Apache Mahout - Mammoth Scale Machine Learning
 
An Introduction to Supervised Machine Learning and Pattern Classification: Th...
An Introduction to Supervised Machine Learning and Pattern Classification: Th...An Introduction to Supervised Machine Learning and Pattern Classification: Th...
An Introduction to Supervised Machine Learning and Pattern Classification: Th...
 

Viewers also liked

Machine intelligente d’analyse financiere
Machine intelligente d’analyse financiereMachine intelligente d’analyse financiere
Machine intelligente d’analyse financiereSabrine MASTOURA
 
Machine learning, deep learning et search : à quand ces innovations dans nos ...
Machine learning, deep learning et search : à quand ces innovations dans nos ...Machine learning, deep learning et search : à quand ces innovations dans nos ...
Machine learning, deep learning et search : à quand ces innovations dans nos ...Antidot
 
Machine learning pour tous
Machine learning pour tousMachine learning pour tous
Machine learning pour tousDamien Seguy
 
Apprentissage Automatique et moteurs de recherche
Apprentissage Automatique et moteurs de rechercheApprentissage Automatique et moteurs de recherche
Apprentissage Automatique et moteurs de recherchePhilippe YONNET
 
Mix it2014 - Machine Learning et Régulation Numérique
Mix it2014 - Machine Learning et Régulation NumériqueMix it2014 - Machine Learning et Régulation Numérique
Mix it2014 - Machine Learning et Régulation NumériqueDidier Girard
 
Machine learning
Machine learningMachine learning
Machine learningebiznext
 
Introduction au Machine Learning
Introduction au Machine LearningIntroduction au Machine Learning
Introduction au Machine LearningMathieu Goeminne
 
Analyse financière
Analyse financièreAnalyse financière
Analyse financièreAbdo attar
 
Ia project Apprentissage Automatique
Ia project Apprentissage AutomatiqueIa project Apprentissage Automatique
Ia project Apprentissage AutomatiqueNizar Bechir
 
Cours Big Data Chap4 - Spark
Cours Big Data Chap4 - SparkCours Big Data Chap4 - Spark
Cours Big Data Chap4 - SparkAmal Abid
 
TP2 Big Data HBase
TP2 Big Data HBaseTP2 Big Data HBase
TP2 Big Data HBaseAmal Abid
 
Cours Big Data Chap1
Cours Big Data Chap1Cours Big Data Chap1
Cours Big Data Chap1Amal Abid
 

Viewers also liked (13)

Machine intelligente d’analyse financiere
Machine intelligente d’analyse financiereMachine intelligente d’analyse financiere
Machine intelligente d’analyse financiere
 
Machine learning, deep learning et search : à quand ces innovations dans nos ...
Machine learning, deep learning et search : à quand ces innovations dans nos ...Machine learning, deep learning et search : à quand ces innovations dans nos ...
Machine learning, deep learning et search : à quand ces innovations dans nos ...
 
Machine learning pour tous
Machine learning pour tousMachine learning pour tous
Machine learning pour tous
 
Apprentissage Automatique et moteurs de recherche
Apprentissage Automatique et moteurs de rechercheApprentissage Automatique et moteurs de recherche
Apprentissage Automatique et moteurs de recherche
 
Mix it2014 - Machine Learning et Régulation Numérique
Mix it2014 - Machine Learning et Régulation NumériqueMix it2014 - Machine Learning et Régulation Numérique
Mix it2014 - Machine Learning et Régulation Numérique
 
Mahout clustering
Mahout clusteringMahout clustering
Mahout clustering
 
Machine learning
Machine learningMachine learning
Machine learning
 
Introduction au Machine Learning
Introduction au Machine LearningIntroduction au Machine Learning
Introduction au Machine Learning
 
Analyse financière
Analyse financièreAnalyse financière
Analyse financière
 
Ia project Apprentissage Automatique
Ia project Apprentissage AutomatiqueIa project Apprentissage Automatique
Ia project Apprentissage Automatique
 
Cours Big Data Chap4 - Spark
Cours Big Data Chap4 - SparkCours Big Data Chap4 - Spark
Cours Big Data Chap4 - Spark
 
TP2 Big Data HBase
TP2 Big Data HBaseTP2 Big Data HBase
TP2 Big Data HBase
 
Cours Big Data Chap1
Cours Big Data Chap1Cours Big Data Chap1
Cours Big Data Chap1
 

Similar to Mahout classification presentation

Applying soft computing techniques to corporate mobile security systems
Applying soft computing techniques to corporate mobile security systemsApplying soft computing techniques to corporate mobile security systems
Applying soft computing techniques to corporate mobile security systemsPaloma De Las Cuevas
 
Ml pipelines with Apache spark and Apache beam - Ottawa Reactive meetup Augus...
Ml pipelines with Apache spark and Apache beam - Ottawa Reactive meetup Augus...Ml pipelines with Apache spark and Apache beam - Ottawa Reactive meetup Augus...
Ml pipelines with Apache spark and Apache beam - Ottawa Reactive meetup Augus...Holden Karau
 
Big Data Analytics using Mahout
Big Data Analytics using MahoutBig Data Analytics using Mahout
Big Data Analytics using MahoutIMC Institute
 
Fast and Reproducible Deep Learning
Fast and Reproducible Deep LearningFast and Reproducible Deep Learning
Fast and Reproducible Deep LearningGreg Gandenberger
 
StackNet Meta-Modelling framework
StackNet Meta-Modelling frameworkStackNet Meta-Modelling framework
StackNet Meta-Modelling frameworkSri Ambati
 
0_ML lab programs updated123 (3).1687933796093.doc
0_ML lab programs updated123 (3).1687933796093.doc0_ML lab programs updated123 (3).1687933796093.doc
0_ML lab programs updated123 (3).1687933796093.docRohanS38
 
Classification with Naive Bayes
Classification with Naive BayesClassification with Naive Bayes
Classification with Naive BayesJosh Patterson
 
Drupal 8: frontend development
Drupal 8: frontend developmentDrupal 8: frontend development
Drupal 8: frontend developmentsparkfabrik
 
Data scientist enablement dse 400 week 6 roadmap
Data scientist enablement   dse 400   week 6 roadmapData scientist enablement   dse 400   week 6 roadmap
Data scientist enablement dse 400 week 6 roadmapDr. Mohan K. Bavirisetty
 
Advanced WhizzML Workflows
Advanced WhizzML WorkflowsAdvanced WhizzML Workflows
Advanced WhizzML WorkflowsBigML, Inc
 
Lecture2 big data life cycle
Lecture2 big data life cycleLecture2 big data life cycle
Lecture2 big data life cyclehktripathy
 
Frameworks provide structure. The core objective of the Big Data Framework is...
Frameworks provide structure. The core objective of the Big Data Framework is...Frameworks provide structure. The core objective of the Big Data Framework is...
Frameworks provide structure. The core objective of the Big Data Framework is...RINUSATHYAN
 
Analysis using r
Analysis using rAnalysis using r
Analysis using rPriya Mohan
 
Python: Modules and Packages
Python: Modules and PackagesPython: Modules and Packages
Python: Modules and PackagesDamian T. Gordon
 
Accelerating Data Ingestion with Databricks Autoloader
Accelerating Data Ingestion with Databricks AutoloaderAccelerating Data Ingestion with Databricks Autoloader
Accelerating Data Ingestion with Databricks AutoloaderDatabricks
 
Hands On Intro to Node.js
Hands On Intro to Node.jsHands On Intro to Node.js
Hands On Intro to Node.jsChris Cowan
 
Start machine learning in 5 simple steps
Start machine learning in 5 simple stepsStart machine learning in 5 simple steps
Start machine learning in 5 simple stepsRenjith M P
 
Topic modeling using big data analytics
Topic modeling using big data analyticsTopic modeling using big data analytics
Topic modeling using big data analyticsFarheen Nilofer
 

Similar to Mahout classification presentation (20)

Applying soft computing techniques to corporate mobile security systems
Applying soft computing techniques to corporate mobile security systemsApplying soft computing techniques to corporate mobile security systems
Applying soft computing techniques to corporate mobile security systems
 
Ml pipelines with Apache spark and Apache beam - Ottawa Reactive meetup Augus...
Ml pipelines with Apache spark and Apache beam - Ottawa Reactive meetup Augus...Ml pipelines with Apache spark and Apache beam - Ottawa Reactive meetup Augus...
Ml pipelines with Apache spark and Apache beam - Ottawa Reactive meetup Augus...
 
Big Data Analytics using Mahout
Big Data Analytics using MahoutBig Data Analytics using Mahout
Big Data Analytics using Mahout
 
Fast and Reproducible Deep Learning
Fast and Reproducible Deep LearningFast and Reproducible Deep Learning
Fast and Reproducible Deep Learning
 
StackNet Meta-Modelling framework
StackNet Meta-Modelling frameworkStackNet Meta-Modelling framework
StackNet Meta-Modelling framework
 
0_ML lab programs updated123 (3).1687933796093.doc
0_ML lab programs updated123 (3).1687933796093.doc0_ML lab programs updated123 (3).1687933796093.doc
0_ML lab programs updated123 (3).1687933796093.doc
 
Classification with Naive Bayes
Classification with Naive BayesClassification with Naive Bayes
Classification with Naive Bayes
 
Drupal 8: frontend development
Drupal 8: frontend developmentDrupal 8: frontend development
Drupal 8: frontend development
 
NYC_2016_slides
NYC_2016_slidesNYC_2016_slides
NYC_2016_slides
 
Data scientist enablement dse 400 week 6 roadmap
Data scientist enablement   dse 400   week 6 roadmapData scientist enablement   dse 400   week 6 roadmap
Data scientist enablement dse 400 week 6 roadmap
 
2.5 lab1
2.5 lab12.5 lab1
2.5 lab1
 
Advanced WhizzML Workflows
Advanced WhizzML WorkflowsAdvanced WhizzML Workflows
Advanced WhizzML Workflows
 
Lecture2 big data life cycle
Lecture2 big data life cycleLecture2 big data life cycle
Lecture2 big data life cycle
 
Frameworks provide structure. The core objective of the Big Data Framework is...
Frameworks provide structure. The core objective of the Big Data Framework is...Frameworks provide structure. The core objective of the Big Data Framework is...
Frameworks provide structure. The core objective of the Big Data Framework is...
 
Analysis using r
Analysis using rAnalysis using r
Analysis using r
 
Python: Modules and Packages
Python: Modules and PackagesPython: Modules and Packages
Python: Modules and Packages
 
Accelerating Data Ingestion with Databricks Autoloader
Accelerating Data Ingestion with Databricks AutoloaderAccelerating Data Ingestion with Databricks Autoloader
Accelerating Data Ingestion with Databricks Autoloader
 
Hands On Intro to Node.js
Hands On Intro to Node.jsHands On Intro to Node.js
Hands On Intro to Node.js
 
Start machine learning in 5 simple steps
Start machine learning in 5 simple stepsStart machine learning in 5 simple steps
Start machine learning in 5 simple steps
 
Topic modeling using big data analytics
Topic modeling using big data analyticsTopic modeling using big data analytics
Topic modeling using big data analytics
 

Recently uploaded

Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxLoriGlavin3
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity PlanDatabarracks
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxLoriGlavin3
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationSlibray Presentation
 
unit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxunit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxBkGupta21
 
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfHyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfPrecisely
 
Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersRaghuram Pandurangan
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubKalema Edgar
 
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESSALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESmohitsingh558521
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxLoriGlavin3
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfAddepto
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxNavinnSomaal
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.Curtis Poe
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxLoriGlavin3
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsPixlogix Infotech
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Commit University
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 

Recently uploaded (20)

Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity Plan
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
 
unit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxunit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptx
 
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfHyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
 
Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information Developers
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding Club
 
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESSALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptx
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 

Mahout classification presentation

  • 1. Classification on Mahout Naoki Nakatani San Jose State University CS185C Spring 2014
  • 2. Agenda ● Classification Overview ● Mahout Overview ○ Classification on Mahout ● Case Study with Demo ○ Problem Description ○ Working Environment ○ Data Preparation ○ ML Model Generation
  • 3. Classification? ● Classifying examples into given set of categories ● Supervised learning ○ Prepare data ○ Build classifier (train & test) ○ Apply classifier to new data http://www.ndm.net/opentext/images/stories/images/extraction_cmyk_thumb.jpg
  • 4. Mahout? ● Scalable machine learning library = Can handle Big Data ● Runs on HDFS ● Classification, Clustering, Collaborative Filtering , etc http://www.robinanil.com/wp-content/uploads/2010/03/mahout-logo-200.png
  • 5. Classification on Mahout? Classifying examples into given set of categories Scalable machine learning library that can handle big data Classifying big data into given set of categories
  • 6. Case Study & Demo Given question with title and body, can we automatically generate tags for it? Where can I find the LaTeX3 manual? Few month ago I saw a big pdf-manual of all LaTeX3-packages and the new syntax. I think it was bigger than 300 pages. I can't find it on the web. Does anyone have a link? Documentation latex3 expl3
  • 7. Dataset File : ● TrainSmall.tsv Fields : ● id, title, body, tags Characteristics : ● Each question contains only one tag 0 “----” , ”-----------” , “------------------------” , “--- --- --- --- ” 0 0 “----” , ”-----------” , “------------------------” , “--- --- --- --- ” “----” , ”-----------” , “------------------------” , “--- --- --- --- ”
  • 8. Working Environment ● Mac OS 10.9.1 ● Eclipse 4.3.2 ● Hadoop 1.2.1 ● Mahout 0.9 ● Source code available here.
  • 9. Prerequisite (Where are you?) ● You have input tsv file at result > output-topfivetags. ● You are at “result” directory in Terminal. ● Command “hadoop” and “mahout” is working.
  • 10. Prepare Data 1. Convert TSV file to Hadoop sequence file format. Specify tag as a category. (Run TSVToSeq.java) output-tsvtoseq folder and chunk-0 file is created.
  • 11. Prepare Data 1. Make directory in HDFS and upload chunk-0 (sequence file) to the folder.
  • 12. hadoop fs -mkdir <directory>
  • 13. hadoop fs -put <source> <destination>
  • 14. Prepare Data 2. Transform questions into vectors. (mahout seq2sparse)
  • 15. mahout seq2sparse -i <input directory> -o <output directory>
  • 16.
  • 17. Prepare Data 3. Split data into a. Train set : to train model b. Test set : to test model
  • 18. mahout split -i <input directory> --trainingOutput <output dir to train> --testOutput <output dir to test> --randomSelectionPct <integer> --overwrite --sequenceFiles -xm sequential
  • 19.
  • 20. Build Classifier 1. Choose algorithm to use for classification Available algorithms: ○ Naive Bayes ■ trainnb, testnb ■ org.apache.mahout. classifier.naivebayes ○ Hidden Markov Model ■ baumwelch, hmmpredict ■ org.apache.mahout. classifier.sequencelearning. hmm ○ Logistic Regression ■ trainlogistic, testlogistic ■ org.apache.mahout. classifier.sgd ○ Random Forest ■ ? ■ ?
  • 21. 2. Train & test model using train set Should yield high accuracy Build Classifier (Naive Bayes)
  • 22. mahout trainnb -i <dir to train vectors> -el -li <dir to put label index> -o <dir to put model> -ow -c
  • 23.
  • 24. mahout testnb -i <dir to train vectors> -m <dir to model> -l <dir to label index> -ow -o <output dir> -c
  • 25.
  • 26. Build Classifier (Naive Bayes) 3. Test model using test set Check if the accuracy is satisfactory
  • 27.
  • 28. Apply Classifier What do you have at this point? ● model ● label index You can start classifying new data! (Check this example) Model Label Index
  • 29. References ● Using the Mahout Naive Bayes Classifier to automatically classify Twitter messages ● Using the Mahout Naive Bayes Classifier to automatically classify Twitter messages (part 2: distribute classification with hadoop)