This document summarizes a presentation on machine learning and Hadoop. It discusses the current state and future directions of machine learning on Hadoop platforms. In industrial machine learning, well-defined objectives are rare, predictive accuracy has limits, and systems must precede algorithms. Currently, Hadoop is used for data preparation, feature engineering, and some model fitting. Tools include Pig, Hive, Mahout, and new interfaces like Spark. The future includes YARN for running diverse jobs and improved machine learning libraries. The document calls for academic work on feature engineering languages and broader model selection ontologies.
Hadoop and Machine Learning
1. Machine Learning and Hadoop
Present and Future
Josh Wills, Tom Pierce, and Jeff Hammerbacher
Cloudera Data Science Team
December 17th, 2011
2. High Availability for Data Scientists
NIPS
Copyright 2011 Cloudera Inc. All rights reserved
3. Agenda
• Part 1: Industrial Machine Learning
• Part 2: Machine Learning and Hadoop
• State of the World
• Where Things Are Headed
• Part 3: Things Industry Needs From Academia
5. Delta One: Model Evaluation
• ML Systems Are One Piece of a Complex System
• Well-defined objective functions are the exception
• Multiple, often conflicting goals
• Weights are fuzzy and shift with business priorities
• Pareto optimization is the safest play
• Predictive Accuracy Is Only Useful Up to a Point
• Examples
• Computational advertising
• Friend recommendations on social networks
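When goals conflict and their weights shift, one option is to keep every candidate that no other candidate beats on all metrics at once. The sketch below illustrates that Pareto filter; the model names and the two metrics (click-through lift, revenue lift) are hypothetical values chosen for illustration, not figures from the talk.

```python
from typing import Dict, List, Tuple


def dominates(a: Tuple[float, ...], b: Tuple[float, ...]) -> bool:
    """True if a is at least as good as b on every metric and strictly better on one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_front(candidates: Dict[str, Tuple[float, ...]]) -> List[str]:
    """Return the names of candidates that no other candidate dominates."""
    return [
        name
        for name, score in candidates.items()
        if not any(
            dominates(other, score)
            for other_name, other in candidates.items()
            if other_name != name
        )
    ]


# Hypothetical candidates scored on (click-through lift, revenue lift); higher is better.
models = {
    "A": (0.10, 0.02),
    "B": (0.08, 0.05),
    "C": (0.07, 0.01),  # worse than A on both metrics
}
front = pareto_front(models)
```

Here `front` keeps A and B and drops C. Choosing between A and B is then a business decision rather than something the objective function settles.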
6. Delta Two: Systems Precede Algorithms
• Greenfield Projects Hardly Ever Happen
• (and don’t usually launch)
• Industrial Computational Infrastructure
• General-purpose
• Cheap
• Shared
• Constraints Drive Innovation
• Vowpal Wabbit Hashing Trick
• SETI @ Google
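The hashing trick mentioned above maps arbitrary feature names into a fixed number of buckets, so memory stays bounded without a feature dictionary. The following is a minimal sketch of the idea, not Vowpal Wabbit's implementation: it uses MD5 from the standard library as a stand-in hash, and the function name, bucket count, and sign bit are assumptions made for illustration.

```python
import hashlib
from typing import Dict, Iterable


def hash_features(tokens: Iterable[str], num_buckets: int = 2 ** 18) -> Dict[int, float]:
    """Map tokens to a sparse vector of num_buckets dimensions via hashing."""
    vec: Dict[int, float] = {}
    for tok in tokens:
        digest = hashlib.md5(tok.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "little")  # first 64 bits of the digest
        index = h % num_buckets
        # The top bit picks a sign, a common variant that lets collisions
        # partly cancel instead of always adding up.
        sign = -1.0 if (h >> 63) & 1 else 1.0
        vec[index] = vec.get(index, 0.0) + sign
    return vec


x = hash_features(["user=42", "country=MA", "hour=18"], num_buckets=1024)
```

Collisions are accepted as noise. In exchange, the weight vector has a fixed size that is known before any data is seen.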
7. Delta Three: Workflow
Practice Over Theory Blog
8. Delta Three: Workflow
• Optimize the Overall Process
• Model fitting is a small piece of the overall flow time
• Parallelize everything
• Better Features > Better Models
• Fast Model Deployment
• Common Feature Extraction Logic
• Servable Models
• Validation as Sanity Checking
• Deploy to a small subset of real data and evaluate
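One way to read "Common Feature Extraction Logic" is that batch training and live scoring call the same function, so the two paths cannot drift apart. The sketch below assumes a hypothetical event schema (`clicks`, `hour`) and a logistic scorer; none of these names come from the talk.

```python
import math
from typing import Dict, List, Sequence


def extract_features(event: Dict[str, float]) -> List[float]:
    """Single source of truth for features, used by both training and serving."""
    return [
        1.0,                                            # bias term
        math.log1p(event.get("clicks", 0.0)),           # dampened click count
        1.0 if event.get("hour", 0.0) >= 18 else 0.0,   # evening indicator
    ]


def build_training_matrix(events: Sequence[Dict[str, float]]) -> List[List[float]]:
    """Offline path: turn logged events into rows for model fitting."""
    return [extract_features(e) for e in events]


def score(event: Dict[str, float], weights: Sequence[float]) -> float:
    """Online path: score one live event with the same feature logic."""
    z = sum(w * xi for w, xi in zip(weights, extract_features(event)))
    return 1.0 / (1.0 + math.exp(-z))


rows = build_training_matrix([{"clicks": 0.0, "hour": 20.0}])
p = score({"clicks": 0.0, "hour": 20.0}, [0.0, 0.0, 0.0])
```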
9. Agenda
• Part 1: Industrial Machine Learning
• Part 2: Machine Learning and Hadoop
• State of the World
• Where Things Are Headed
• Part 3: Things Industry Needs From Academia
10. Hadoop: It’s Where The Data Is
11. Hadoop Platform: Substrate
• Commodity servers
• Open Compute
• Open source operating system
• Linux
• Open source configuration management
• Puppet
• Chef
• Coordination service
• ZooKeeper
12. Hadoop Platform: Storage
• Distributed schema-less storage
• HDFS
• Ceph
• Append-only storage formats and metadata
• Avro
• RCFile
• HCatalog
• Mutable key-value storage and metadata
• HBase
13. Hadoop Platform: Integration
• Tool Access
• FUSE
• JDBC
• ODBC
• Data Ingestion
• Flume
• Sqoop
14. ML and Hadoop: The State of the World
15. Computation: Plain Old MapReduce
• Great for:
• Data Preparation
• Feature Engineering
• Model Validation/Evaluation
• Works For Certain Model Fitting Problems
• Recommendation Systems
• Decision Trees (PLANET; Gradient Boosted Decision Trees)
• Not A Practical Option for Online Learning
• Way More Detail from the KDD 2011 Talk
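To make the feature-engineering case concrete, here is a small in-memory imitation of the map, shuffle, and reduce phases that counts actions per user. It is a sketch of the programming model only: it does not use the Hadoop API, and the record layout and function names are assumptions.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple

Record = Tuple[str, str]  # (user_id, action), a hypothetical log format


def map_phase(record: Record) -> Iterator[Tuple[str, int]]:
    """Emit one (key, 1) pair per log record."""
    user_id, action = record
    yield (f"{user_id}:{action}", 1)


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Group values by key, as the framework does between map and reduce."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Sum the counts for one key to produce a per-user feature."""
    return key, sum(values)


def run_job(records: Iterable[Record]) -> Dict[str, int]:
    mapped = (pair for rec in records for pair in map_phase(rec))
    return dict(reduce_phase(k, vs) for k, vs in shuffle(mapped).items())


features = run_job([("u1", "click"), ("u1", "click"), ("u2", "view")])
```

Each key is reduced independently, which is why this kind of aggregation parallelizes well. Online learning, by contrast, needs a model state that is updated record by record.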
16. Tools for Data Preparation/Feature Engineering
• Languages/Environments
• PigLatin
• HiveQL
• Need to deal with mismatch between offline/online feature generation
• Java/Scala APIs
• Crunch (Cloudera)
• Scoobi (NICTA)
• Cascading (Concurrent)
• Jaql (IBM)
17. Apache Mahout
• The starting place for MapReduce-based machine learning algorithms
• Not machine-learning-in-a-box
• Custom tweaks/modifications are the rule
• A disparate collection of algorithms for:
• Recommendations
• Clustering
• Classification
• Frequent Itemset Mining
18. Apache Mahout (cont.)
• Best Library: Taste Recommender
• Oldest project, most widely-deployed in production
• SVD implementation is particularly active
• Good Libraries: Online SGD
• Does not use MapReduce
• Vowpal Wabbit + AllReduce is faster, has L-BFGS option
• Roll Your Own Instead: Naïve Bayes
• Challenges
• “Secret sauce” effect
• Delta between Mahout and the cutting edge in ML
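For readers unfamiliar with online SGD, the sketch below shows the general technique for logistic regression: one weight update per example, with no pass over the full dataset. It is a generic illustration, not Mahout's or Vowpal Wabbit's code; the learning rate and the toy stream are assumptions.

```python
import math
from typing import Iterable, List, Sequence, Tuple


def sigmoid(z: float) -> float:
    z = max(-30.0, min(30.0, z))  # clamp to keep exp() well within range
    return 1.0 / (1.0 + math.exp(-z))


def sgd_update(weights: Sequence[float], x: Sequence[float], y: int,
               lr: float = 0.1) -> List[float]:
    """One stochastic gradient step on the logistic loss for a single example."""
    p = sigmoid(sum(w * xi for w, xi in zip(weights, x)))
    return [w - lr * (p - y) * xi for w, xi in zip(weights, x)]


def train_online(stream: Iterable[Tuple[Sequence[float], int]], dim: int,
                 lr: float = 0.1) -> List[float]:
    """Consume (features, label) pairs one at a time, updating as they arrive."""
    weights = [0.0] * dim
    for x, y in stream:
        weights = sgd_update(weights, x, y, lr)
    return weights


# Toy stream: the label is 1 when the second feature is positive.
stream = [([1.0, 1.0], 1), ([1.0, -1.0], 0)] * 50
w = train_online(stream, dim=2)
```

Because each step depends on the weights left by the previous one, the loop is inherently sequential, which is consistent with the slide's note that this library does not use MapReduce.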
19. More Machine Learning Interfaces for Hadoop
• Based on MapReduce
• SystemML (IBM)
• AllReduce (Vowpal Wabbit)
• No MapReduce
• Spark
• R-Based Systems (Augment MapReduce with R)
• Segue
• RHIPE
• RHadoop
• Ricardo (IBM)
20. ML and Hadoop: Where Things are Headed
21. MRv2 and YARN
• Eliminates JobTracker bottleneck
• Separate Resource Manager/Scheduler
• Each job gets its own master (the ApplicationMaster in YARN)
• Moves MapReduce into user-land
• Enables Hadoop clusters to run all sorts of jobs
• MPI (Hamster; MAPREDUCE-2911)
• Native BSP (Giraph)
• Spark
• AllReduce, GraphLab
22. Agenda
• Part 1: Industrial Machine Learning
• Part 2: Machine Learning and Hadoop
• State of the World
• Where Things Are Headed
• Part 3: Things Industry Needs From Academia
23. Machine Learning on Multivariate Time Series
• 1e5 writes/sec
• Positive events are relatively rare
• Feature extraction challenge
• May not be clear what the right time horizon is
• Tight SLAs
• Very high stakes
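When the right time horizon is unclear, a common workaround is to compute the same feature over several trailing windows and let the model weigh them. The sketch below counts events per horizon with binary search over sorted timestamps; the function name and the horizons are hypothetical.

```python
from bisect import bisect_right
from typing import Dict, Sequence


def window_counts(timestamps: Sequence[float], now: float,
                  horizons: Sequence[float]) -> Dict[float, int]:
    """Count events in the window (now - h, now] for each horizon h.

    timestamps must be sorted in ascending order.
    """
    upto_now = bisect_right(timestamps, now)  # number of events with t <= now
    return {
        h: upto_now - bisect_right(timestamps, now - h)  # drop events with t <= now - h
        for h in horizons
    }


# Event times in seconds; horizons of 1 s, 5 s, and 60 s.
counts = window_counts([1.0, 5.0, 9.0, 10.0], now=10.0, horizons=[1.0, 5.0, 60.0])
```

Each lookup is logarithmic in the number of stored events, which matters at high write rates and under tight SLAs.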
24. An Academic Language For Feature Engineering
• Feature extraction/selection is as important as model fitting
• e.g., hierarchical feature representation, impact on training time and experiment design, feature cost modeling, etc.
• Academic literature on this problem is sparse and dispersed across multiple fields
• NIPS 2003
• HCI, NLP, Information Retrieval, etc.
• We need a common language for talking about these problems across disciplines
25. A Broader Ontology For Model Selection
• Practical factors that enter into the “best” choice of model…
• Data arrival rate
• Data volume
• Scoring latency
• Model refresh time
• Robustness/reliability
• …in addition to the standard predictive power/simplicity tradeoffs
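One way to make these practical factors explicit is to record them next to accuracy and treat them as hard constraints before comparing predictive power. The sketch below is an assumption about how such a record could look; the field names, thresholds, and model profiles are hypothetical and do not come from the talk.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class ModelProfile:
    name: str
    accuracy: float            # offline predictive power, higher is better
    scoring_latency_ms: float  # time to score one example
    refresh_minutes: float     # time to retrain and redeploy
    max_events_per_sec: float  # arrival rate the training path can absorb


@dataclass(frozen=True)
class Requirements:
    max_latency_ms: float
    max_refresh_minutes: float
    arrival_rate: float


def feasible(m: ModelProfile, r: Requirements) -> bool:
    """A model qualifies only if it meets every operational constraint."""
    return (
        m.scoring_latency_ms <= r.max_latency_ms
        and m.refresh_minutes <= r.max_refresh_minutes
        and m.max_events_per_sec >= r.arrival_rate
    )


def choose(models: Sequence[ModelProfile], r: Requirements) -> Optional[ModelProfile]:
    """Pick the most accurate model among those that are feasible, if any."""
    candidates = [m for m in models if feasible(m, r)]
    return max(candidates, key=lambda m: m.accuracy) if candidates else None


profiles = [
    ModelProfile("boosted_trees", 0.90, 50.0, 240.0, 1e4),
    ModelProfile("logistic_regression", 0.85, 2.0, 10.0, 1e6),
]
best = choose(profiles, Requirements(max_latency_ms=10.0,
                                     max_refresh_minutes=60.0,
                                     arrival_rate=1e5))
```

In this made-up case the less accurate model wins because the more accurate one misses the latency, refresh, and arrival-rate requirements.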