SlideShare a Scribd company logo
1 of 48
Download to read offline
H2O.ai

Machine Intelligence
Scalable Data Science
and Deep Learning
with H2O
1
SF Bay ACM Chapter Meetup
eBay Whitman Campus, June 3 2015
Arno Candel, PhD

Chief Architect, H2O.ai
H2O.ai

Machine Intelligence
Who Am I?
Arno Candel
Chief Architect,
Physicist & Hacker at H2O.ai
PhD Physics, ETH Zurich 2005
10+ yrs Supercomputing (HPC)
6 yrs at SLAC (Stanford Lin. Accel.)
3.5 yrs Machine Learning
1.5 yrs at H2O.ai
Fortune Magazine
Big Data All Star 2014
Follow me @ArnoCandel 2
H2O.ai

Machine Intelligence
Outline
• Introduction
• H2O Deep Learning Architecture
• Live Demos:
Flow GUI - Airline Dataset
R - MNIST World Record + Anomaly Detection
Flow GUI - Higgs Boson Classification
Sparkling Water - Chicago Crime Prediction
Sparkling Water - Craigslist Jobs Word2Vec
iPython - CitiBike Demand Prediction
Scoring Engine - Million Songs Classification
• Outlook
3
H2O.ai

Machine Intelligence
In-Memory ML
Distributed
Open Source
APIs
4
Memory-Efficient Data Structures
Cutting-Edge Algorithms
Use all your Data (No Sampling)
Accuracy with Speed and Scale
Ownership of Methods - Apache V2
Easy to Deploy: Bare, Hadoop, Spark, etc.
Java, Scala, R, Python, JavaScript, JSON
NanoFast Scoring Engine (POJO)
H2O - Product Overview
H2O.ai

Machine Intelligence
25,000 commits / 3yrs
H2O World Conference 2014
Team Work @ H2O.ai
5
Join H2O World 2015!
H2O.ai

Machine Intelligence
103 634 2789
463 2,887 13,237
Companies
Users
Mar 2014 July 2014 Mar 2015
Active Users
150+
6
Strong Community & Growth
5/25/15 @kdnuggets t.co/4xSgleSIdY
H2O.ai

Machine Intelligence
7
HDFS
S3
SQL
NoSQL
Classification
Regression
Feature
Engineering
Distributed In-Memory
Map Reduce/Fork Join
Columnar Compression
GLM, Deep Learning
K-Means, PCA, NB,
Cox
Random Forest / GBM
Ensembles
Fast Modeling Engine
Streaming
Nano Fast Java Scoring Engines
(POJO code generation)
Matrix
Factorization Clustering
Munging
Unsupervised
Supervised
Accuracy with Speed and Scale
Most code is written
in-house from scratch
H2O.ai

Machine Intelligence
8
Ad Optimization (200% CPA Lift with H2O)
P2B Model Factory (60k models,
15x faster with H2O than before)
Fraud Detection (11% higher accuracy with
H2O Deep Learning - saves millions)
…and many large insurance and financial
services companies!
Real-time marketing (H2O is 10x faster than
anything else)
Actual Customer Use Cases
H2O.ai

Machine Intelligence
9
h2o.ai/download & Run!
H2O.ai

Machine Intelligence
10
Airline Data: Predict Delayed Departure
Predict dep. delay Y/N
116M rows
31 colums
12 GB CSV
4 GB compressed
20 years of domestic
airline flight data
H2O.ai

Machine Intelligence
11
Results in Seconds on Big Data
Logistic Regression: ~20s
elastic net, alpha=0.5, lambda=1.379e-4 (auto)
Deep Learning: ~70s
4 hidden ReLU layers of 20 neurons, 2 epochs
8-node EC2 cluster: 64 virtual cores, 1GbE
Year, Month, Sched.
Dep. Time have
non-linear impact
Chicago, Atlanta,
Dallas:

often delayed
All cores maxed out
+9% AUC
+--+++
H2O.ai

Machine Intelligence
Multi-layer feed-forward Neural Network

Trained with back-propagation (SGD, ADADELTA)
+ distributed processing for big data
(fine-grain in-memory MapReduce on distributed data)
+ multi-threaded speedup
(async fork/join worker threads operate at FORTRAN speeds)
+ smart algorithms for fast & accurate results
(automatic standardization, one-hot encoding of categoricals, missing value imputation, weight &
bias initialization, adaptive learning rate, momentum, dropout/l1/L2 regularization, grid search, 

N-fold cross-validation, checkpointing, load balancing, auto-tuning, model averaging, etc.)
= powerful tool for (un)supervised machine
learning on real-world data
12
H2O Deep Learning
all 320 cores maxed out
H2O.ai

Machine Intelligence
threads: async
13
H2O Deep Learning Architecture
K-V
K-V
HTTPD
HTTPD
nodes/JVMs: sync
communication
w
w w
w w w w
w1
w3 w2
w4
w2+w4
w1+w3
w* = (w1+w2+w3+w4)/4
map:

each node trains a copy of
the weights and biases with
(some* or all of) its local
data with asynchronous F/J
threads
initial model: weights and biases w
updated model: w*
H2O in-memory non-
blocking hash map:

K-V store
reduce:

model averaging: average weights
and biases from all nodes,
speedup is at least #nodes/
log(#rows)
http://arxiv.org/abs/1209.4129
Keep iterating over the data (“epochs”), score at user-given times
Query & display the
model via JSON, WWW
2
2 4
31
1
1
1
4
3 2
1 2
1
i
*auto-tuned (default) or user-
specified number of rows per
MapReduce iteration
Main Loop:
H2O.ai

Machine Intelligence
14
H2O Deep Learning beats MNIST
MNIST: Handwritten digits: 28^2=784 gray-scale pixel values
full run: 10 hours on 10-node cluster
2 hours on desktop gets to 0.9% test set error
Just supervised training
on original 60k/10k dataset:
No data augmentation
No distortions
No convolutions
No pre-training
No ensemble
0.83% test set error:

current world record
1-liner: call h2o.deeplearning() in R
H2O.ai

Machine Intelligence
15
Unsupervised Anomaly Detection
The good The bad The ugly
Try it yourself!
Auto-Encoder learns
“Compressed Identity”
H2O.ai

Machine Intelligence
16
Images courtesy CERN / LHC
Higgs

vs

Background
Large Hadron Collider: Largest experiment of mankind!
$13+ billion, 16.8 miles long, 120 MegaWatts, -456F, 1PB/day, etc.
Higgs boson discovery (July ’12) led to 2013 Nobel prize!
Higgs Boson - Classification Problem
H2O.ai

Machine Intelligence
17
UCI Higgs Dataset: 11M rows, 29 cols
C2-C22: 21 low-level features

(detector data)
7 high-level features

(physics formulae)
Assume we don’t know Physics…
H2O.ai

Machine Intelligence
18
? ? ?
Former CERN baseline for AUC: 0.733 and 0.816
H2O Algorithm low-level H2O AUC all features H2O AUC
Generalized Linear Model 0.596 0.684
Random Forest 0.764 0.840
Gradient Boosted Trees 0.753 0.839
Neural Net 1 hidden layer 0.760 0.830
H2O Deep Learning ?
add
derived
features
Deep Learning for Higgs Detection?
Q: Can Deep Learning learn Physics for us?
H2O.ai

Machine Intelligence
19
EC2 Demo Cluster: 8 nodes, 64 cores
H2O Deep Learning:
Expect good cluster
utilization :)
H2O.ai

Machine Intelligence
20
Deep DL model on

low-level features
only
train 10M rows
valid 500k rows
test 500k rows
H2O Deep Learning Higgs Demo
H2O: same results as Nature paper
Deep Learning just learned Particle Physics!
8 EC2 nodes:
AUC = 0.86 after 100 mins
AUC = 0.87+ overnight
H2O.ai

Machine Intelligence
21
http://www.slideshare.net/0xdata/crime-deeplearningkey
http://www.datanami.com/2015/05/07/what-police-can-learn-from-deep-learning/
H2O Deep Learning in the News
Alex, Michal,
et al.
H2O.ai

Machine Intelligence
22
Weather + Census + Crime Data
H2O.ai

Machine Intelligence 23
Spark + H2O = Sparkling Water
H2O.ai

Machine Intelligence
24
Sparkling Water Demo
Instructions at h2o.ai/download
H2O.ai

Machine Intelligence
25
Parse & Munge with H2O, Convert to RDD
H2O Parser: Robust & Fast
Simple Column Selection
H2O.ai

Machine Intelligence
26
Parse & Munge with H2O, Convert to RDD
Munging: Date Manipulations
Conversion to DataFrame
H2O.ai

Machine Intelligence
27
Join RDDs with SQL, Convert to H2O
Spark SQL Query Execution
Convert back to H2OFrame
Split into Train 80% / Test 20%
H2O.ai

Machine Intelligence
28
Build H2O Deep Learning Model
Train a H2O Deep Learning
Model on Data obtained by
Spark SQL Query
Predict whether Arrest will be
made with AUC of 0.90+
H2O.ai

Machine Intelligence
29
Visualize Results with Flow
Using Flow to interactively plot
Arrest Rate (blue)
vs
Relative Occurrence (red)
per crime type.
H2O.ai

Machine Intelligence
30
Word2Vec Demo: CraigsList Jobs
Task: Predict the job category from the Ad Title
Alex et al.
H2O.ai

Machine Intelligence
31
Load Data into Spark (ling Water)
14k jobs, 700kB
6 job classes
H2O.ai

Machine Intelligence
32
Data Cleaning
H2O.ai

Machine Intelligence
33
Spark Word2Vec Results into H2OFrame
Compute average Word2Vec vector
(length 100) for each job title
Convert Spark Word2Vec
Frame to H2OFrame
H2O.ai

Machine Intelligence
34
Run H2O GBM on Word2Vec Data
In Flow, split data 75%/25%, and train GBM classifier for 6-class problem
80% accuracy in 9 seconds!
(priors: ~16%)
H2O.ai

Machine Intelligence
35
Predict Rental Bike Demand in NYC
Cliff et al.
H2O.ai

Machine Intelligence
iPython
Notebook Demo
36
Group-By Aggregation
H2O.ai

Machine Intelligence
iPython
Notebook Demo
37
Model Building And Scoring
91% AUC baseline
H2O.ai

Machine Intelligence
38
Joining Bikes-Per-Day with Weather
H2O.ai

Machine Intelligence
39
Improved Models with Weather Data
93% AUC after joining
bike and weather data
H2O.ai

Machine Intelligence
40
Example: First GBM tree
Fast and easy path to Production (batch or real-time)!
POJO Scoring Engine
Standalone Java scoring code is auto-generated!
Note:
no heap allocation,
pure decision-making
H2O.ai

Machine Intelligence
More Info in H2O Booklets
https://leanpub.com/u/h2oai
http://learn.h2o.ai
41
H2O.ai

Machine Intelligence
42
Kaggle - H2O Starter Scripts
H2O.ai

Machine Intelligence
43
Kaggle - Active H2O Community
18k+ views in 2 weeks
H2O.ai

Machine Intelligence
44
Hyper-Parameter Tuning
93 numerical features
9 output classes
62k training set rows
144k test set rows
Ensemble of H2O DL + GBM => Top 10%
R script by DataGeek
H2O.ai

Machine Intelligence
45
Mastering Kaggle with H2O
DL + GBM
GBM
GBM + GLM
DRF + GLM
Stay tuned: Kaggle Master
@Mark_A_Landry recently
joined H2O as Competitive
Data Scientist!
www.meetup.com/Silicon-
Valley-Big-Data-Science/
events/222303884/
H2O.ai

Machine Intelligence
46
R’s data.table now in H2O!
H2O.ai

Machine Intelligence
Outlook - Algorithm Roadmap
• Ensembles (Erin LeDell et al.)
• Automated Hyper-Parameter Tuning
• Convolutional Layers for Deep Learning
• Natural Language Processing: tf-idf, Word2Vec, …
• Generalized Low Rank Models
• PCA, SVD, K-Means, Matrix Factorization
• Recommender Systems
And many more!
47
Public JIRAs - Join H2O!
H2O.ai

Machine Intelligence
Key Take-Aways
H2O is an open source predictive analytics
platform for data scientists and business analysts
who need scalable, fast and accurate machine
learning.
H2O Deep Learning is ready to take your
advanced analytics to the next level.
Try it on your data!
48
https://github.com/h2oai
H2O Google Group
http://h2o.ai
@h2oai
Thank You!

More Related Content

More from Sri Ambati

Automatic Model Documentation with H2O
Automatic Model Documentation with H2OAutomatic Model Documentation with H2O
Automatic Model Documentation with H2O
Sri Ambati
 
AI Solutions in Manufacturing
AI Solutions in ManufacturingAI Solutions in Manufacturing
AI Solutions in Manufacturing
Sri Ambati
 
ICLR 2020 Recap
ICLR 2020 RecapICLR 2020 Recap
ICLR 2020 Recap
Sri Ambati
 

More from Sri Ambati (20)

Risk Management for LLMs
Risk Management for LLMsRisk Management for LLMs
Risk Management for LLMs
 
Open-Source AI: Community is the Way
Open-Source AI: Community is the WayOpen-Source AI: Community is the Way
Open-Source AI: Community is the Way
 
Building Custom GenAI Apps at H2O
Building Custom GenAI Apps at H2OBuilding Custom GenAI Apps at H2O
Building Custom GenAI Apps at H2O
 
Applied Gen AI for the Finance Vertical
Applied Gen AI for the Finance Vertical Applied Gen AI for the Finance Vertical
Applied Gen AI for the Finance Vertical
 
Cutting Edge Tricks from LLM Papers
Cutting Edge Tricks from LLM PapersCutting Edge Tricks from LLM Papers
Cutting Edge Tricks from LLM Papers
 
Practitioner's Guide to LLMs: Exploring Use Cases and a Glimpse Beyond Curren...
Practitioner's Guide to LLMs: Exploring Use Cases and a Glimpse Beyond Curren...Practitioner's Guide to LLMs: Exploring Use Cases and a Glimpse Beyond Curren...
Practitioner's Guide to LLMs: Exploring Use Cases and a Glimpse Beyond Curren...
 
Open Source h2oGPT with Retrieval Augmented Generation (RAG), Web Search, and...
Open Source h2oGPT with Retrieval Augmented Generation (RAG), Web Search, and...Open Source h2oGPT with Retrieval Augmented Generation (RAG), Web Search, and...
Open Source h2oGPT with Retrieval Augmented Generation (RAG), Web Search, and...
 
KGM Mastering Classification and Regression with LLMs: Insights from Kaggle C...
KGM Mastering Classification and Regression with LLMs: Insights from Kaggle C...KGM Mastering Classification and Regression with LLMs: Insights from Kaggle C...
KGM Mastering Classification and Regression with LLMs: Insights from Kaggle C...
 
LLM Interpretability
LLM Interpretability LLM Interpretability
LLM Interpretability
 
Never Reply to an Email Again
Never Reply to an Email AgainNever Reply to an Email Again
Never Reply to an Email Again
 
Introducción al Aprendizaje Automatico con H2O-3 (1)
Introducción al Aprendizaje Automatico con H2O-3 (1)Introducción al Aprendizaje Automatico con H2O-3 (1)
Introducción al Aprendizaje Automatico con H2O-3 (1)
 
From Rapid Prototypes to an end-to-end Model Deployment: an AI Hedge Fund Use...
From Rapid Prototypes to an end-to-end Model Deployment: an AI Hedge Fund Use...From Rapid Prototypes to an end-to-end Model Deployment: an AI Hedge Fund Use...
From Rapid Prototypes to an end-to-end Model Deployment: an AI Hedge Fund Use...
 
AI Foundations Course Module 1 - Shifting to the Next Step in Your AI Transfo...
AI Foundations Course Module 1 - Shifting to the Next Step in Your AI Transfo...AI Foundations Course Module 1 - Shifting to the Next Step in Your AI Transfo...
AI Foundations Course Module 1 - Shifting to the Next Step in Your AI Transfo...
 
AI Foundations Course Module 1 - An AI Transformation Journey
AI Foundations Course Module 1 - An AI Transformation JourneyAI Foundations Course Module 1 - An AI Transformation Journey
AI Foundations Course Module 1 - An AI Transformation Journey
 
ML Model Deployment and Scoring on the Edge with Automatic ML & DF
ML Model Deployment and Scoring on the Edge with Automatic ML & DFML Model Deployment and Scoring on the Edge with Automatic ML & DF
ML Model Deployment and Scoring on the Edge with Automatic ML & DF
 
Scaling & Managing Production Deployments with H2O ModelOps
Scaling & Managing Production Deployments with H2O ModelOpsScaling & Managing Production Deployments with H2O ModelOps
Scaling & Managing Production Deployments with H2O ModelOps
 
Automatic Model Documentation with H2O
Automatic Model Documentation with H2OAutomatic Model Documentation with H2O
Automatic Model Documentation with H2O
 
Your AI Transformation
Your AI Transformation Your AI Transformation
Your AI Transformation
 
AI Solutions in Manufacturing
AI Solutions in ManufacturingAI Solutions in Manufacturing
AI Solutions in Manufacturing
 
ICLR 2020 Recap
ICLR 2020 RecapICLR 2020 Recap
ICLR 2020 Recap
 

Arnocandelscalabledatascienceanddeeplearningwithh2o_meetup_acm_ebay