A screencast with audio is available at: http://youtu.be/fdbQreQacIQ
In this talk, we put Deep Learning to the test on real-world data problems.
Data:
- Africa Soil Kaggle challenge: top (#1) position achieved with H2O Deep Learning
- Higgs binary classification dataset (10M rows, 29 cols)
- MNIST 10-class dataset
- Weather categorical dataset
- eBay text classification dataset (8500 cols, 500k rows, 467 classes)
- ECG heartbeat anomaly detection
- Powered by the open source machine learning software H2O.ai. Contributors welcome at: https://github.com/h2oai
- To view videos on H2O open source machine learning software, go to: https://www.youtube.com/user/0xdata
Deep Learning through Examples - Kaggle #1
1. Deep Learning through Examples
Arno Candel
0xdata, H2O.ai
Scalable In-Memory Machine Learning
Silicon Valley Big Data Science Meetup,
Vendavo, Mountain View, 9/11/14
2. Who am I?
@ArnoCandel
PhD in Computational Physics (2005) from ETH Zurich, Switzerland
6 years at SLAC - Accelerator Physics Modeling
2 years at Skytree, Inc - Machine Learning
9 months at 0xdata/H2O - Machine Learning
15 years in HPC/Supercomputing/Modeling
Named "2014 Big Data All-Star" by Fortune Magazine
3. H2O Deep Learning: Kaggle #1 rank (out of 413), 40 days left in the competition
Achieved with H2O Deep Learning from R!
[Leaderboard screenshot: positions #1 and #17]
@matlabulous (Jo-fai Chow, Blend it like a Bayesian!) says:
"I am 99.99999999999% sure that I can still go further with H2O."
4. Outline
Intro & Live Demo (10 mins)
Methods & Implementation (20 mins)
Results & Live Demos (25 mins)
- Higgs boson detection
- MNIST handwritten digits
- Text classification
Q & A (5 mins)
5. About H2O (aka 0xdata)
Java, Apache v2 Open Source
Join the www.h2o.ai/community!
#1 Java Machine Learning project on GitHub
6. Customer Demands for Practical Machine Learning
Requirements -> Value
In-Memory -> Fast (Interactive)
Distributed -> Big Data (No Sampling)
Open Source -> Ownership of Methods
API / SDK -> Extensibility
H2O was developed by 0xdata from scratch to meet these requirements.
7. H2O Integration
[Diagram: H2O is accessed via R, JSON, Scala, Python and Java APIs; it runs standalone, over YARN, or on Hadoop MRv1, reading data from HDFS.]
8. H2O Architecture
[Diagram: distributed in-memory K-V store with columnar compression and a memory manager; MapReduce-style Machine Learning Algorithms (e.g. Deep Learning); nano-fast Prediction/Scoring Engine; R Engine.]
9. H2O - The Killer App on Spark
http://databricks.com/blog/2014/06/30/sparkling-water-h20-spark.html
10. H2O Deep Learning on Spark

// Test if we can correctly learn A, B where Y = logistic(A + B*X)
test("deep learning log regression") {
  val nPoints = 10000
  val A = 2.0
  val B = -1.5

  // Generate training data
  val trainData = DeepLearningSuite.generateLogisticInput(A, B, nPoints, 42)
  // Create RDD from training data
  val trainRDD = sc.parallelize(trainData, 2)
  trainRDD.cache()

  import H2OContext._
  // Create H2O data frame (will be implicit in the future)
  val trainH2ORDD = toDataFrame(sc, trainRDD)
  // Create an H2O DeepLearning model
  val dlParams = new DeepLearningParameters()
  dlParams.source = trainH2ORDD
  dlParams.response = trainH2ORDD.lastVec()
  dlParams.classification = true
  val dl = new DeepLearning(dlParams)
  val dlModel = dl.train().get()

  // Score validation data
  val validationData = DeepLearningSuite.generateLogisticInput(A, B, nPoints, 17)
  val validationRDD = sc.parallelize(validationData, 2)
  val validationH2ORDD = toDataFrame(sc, validationRDD)
  val predictionH2OFrame = new DataFrame(dlModel.score(validationH2ORDD))('predict)
  val predictionRDD = toRDD[DoubleHolder](sc, predictionH2OFrame) // will be implicit in the future
  // Validate prediction
  validatePrediction(predictionRDD.collect().map(_.predict.getOrElse(Double.NaN)), validationData)
}

Brand-Sparkling-New Sneak Preview!
11. H2O R CRAN package
John Chambers (creator of the S language, R-core member) names the H2O R API among his top three promising R projects.
12. H2O + R = Happy Data Scientist
Machine Learning on Big Data with R:
Data resides on the H2O cluster!
13. Higgs Particle Discovery
Large Hadron Collider: largest experiment of mankind!
$13+ billion, 16.8 miles long, 120 MegaWatts, -456F, 1 PB/day, etc.
Higgs boson discovery (July '12) led to the 2013 Nobel Prize!
[Images: Higgs vs Background event displays, courtesy CERN / LHC]
http://arxiv.org/pdf/1402.4735v2.pdf
Machine Learning Meets Physics
Or rather: back to the roots (the WWW was invented at CERN in '89…)
14. Higgs: Binary Classification Problem
Current methods of choice for physicists:
- Boosted Decision Trees
- Neural networks with 1 hidden layer
BUT: Must first add derived high-level features (physics formulae)
HIGGS UCI Dataset: 21 low-level features AND 7 high-level derived features
Train: 10M rows, Test: 500k rows
Metric: AUC = Area under the ROC curve (range: 0.5…1, higher is better)

Algorithm                   | low-level H2O AUC | all features H2O AUC
Generalized Linear Model    | 0.596             | 0.684
Random Forest               | 0.764             | 0.840
Gradient Boosted Trees      | 0.753             | 0.839
Neural Net, 1 hidden layer  | 0.760             | 0.830
(adding the derived features improves every algorithm)
15. Higgs: Can Deep Learning Do Better?

Algorithm                   | low-level H2O AUC | all features H2O AUC
Generalized Linear Model    | 0.596             | 0.684
Random Forest               | 0.764             | 0.840
Gradient Boosted Trees      | 0.753             | 0.839
Neural Net, 1 hidden layer  | 0.760             | 0.830
Deep Learning               | ?                 | ?
<Your guess goes here>
Reference paper results: baseline 0.733
Let's build an H2O Deep Learning model and find out! (That was my last weekend.)
16. What is Deep Learning?
Wikipedia: Deep learning is a set of algorithms in machine learning that attempt to model high-level abstractions in data by using architectures composed of multiple non-linear transformations.
Example: input data (image) -> prediction (who is it?)
Facebook's DeepFace (Yann LeCun) recognises faces as well as humans.
17. What is NOT Deep
Linear models are not deep (by definition).
Neural nets with 1 hidden layer are not deep (only 1 layer - no feature hierarchy).
SVMs and kernel methods are not deep (2 layers: kernel + linear).
Classification trees are not deep (they operate on the original input space; no new features are generated).
18. Deep Learning is Trending
[Google Trends chart, 2009-2013: rising interest in deep learning]
Businesses are using Deep Learning techniques!
- Google Brain (Andrew Ng, Jeff Dean & Geoffrey Hinton)
- FBI FACE: $1 billion face recognition project
- Chinese search giant Baidu hires the man behind the "Google Brain" (Andrew Ng)
19. Deep Learning History
(slides by Yann LeCun, now at Facebook)
Deep Learning wins competitions AND makes humans, businesses and machines (cyborgs!?) smarter.
20. Deep Learning in H2O
1970s multi-layer feed-forward Neural Network
(supervised learning with stochastic gradient descent using back-propagation)
+ distributed processing for big data
(H2O in-memory MapReduce paradigm on distributed data)
+ multi-threaded speedup
(H2O Fork/Join worker threads update the model asynchronously)
+ smart algorithms for accuracy
(weight initialization, adaptive learning rate, momentum, dropout regularization,
L1/L2 regularization, grid search, checkpointing, auto-tuning, model averaging)
= Top-notch prediction engine!
21. Example Neural Network
"fully connected" directed graph of neurons
[Diagram: inputs age, income, employment feed the network; outputs are married / single; information flows from input to output]
Input layer: 3 neurons -> Hidden layer 1: 4 neurons -> Hidden layer 2: 3 neurons -> Output layer: 2 neurons
#connections: 3x4, 4x3, 3x2
22. Prediction: Forward Propagation
"neurons activate each other via weighted sums"
Inputs x_i (age, income, employment) activate the first hidden layer:
y_j = tanh(sum_i(x_i*u_ij) + b_j)
The second hidden layer:
z_k = tanh(sum_j(y_j*v_jk) + c_k)
The output layer produces per-class probabilities (married / single), with sum_l(p_l) = 1:
p_l = softmax(sum_k(z_k*w_kl) + d_l)
softmax(x_k) = exp(x_k) / sum_k(exp(x_k))
b_j, c_k, d_l: bias values (independent of inputs)
Activation function: tanh; alternative: x -> max(0,x) "rectifier"
p_l is a non-linear function of x_i: with enough layers, the network can approximate ANY function!
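To make the arithmetic concrete, here is a minimal NumPy sketch of this exact 3-4-3-2 forward pass. The weights, biases and input values are made-up illustration data, not from the talk:

import numpy as np

def softmax(x):
    e = np.exp(x - x.max())  # subtract max for numerical stability
    return e / e.sum()

rng = np.random.default_rng(0)
x = np.array([0.3, -1.2, 0.5])                # standardized age, income, employment
U, b = rng.normal(size=(3, 4)), np.zeros(4)   # input -> hidden layer 1
V, c = rng.normal(size=(4, 3)), np.zeros(3)   # hidden 1 -> hidden 2
W, d = rng.normal(size=(3, 2)), np.zeros(2)   # hidden 2 -> output

y = np.tanh(x @ U + b)   # y_j = tanh(sum_i(x_i*u_ij) + b_j)
z = np.tanh(y @ V + c)   # z_k = tanh(sum_j(y_j*v_jk) + c_k)
p = softmax(z @ W + d)   # p_l = softmax(sum_k(z_k*w_kl) + d_l)
print(p, p.sum())        # per-class probabilities, summing to 1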
23. Data Preparation & Initialization
Neural networks are sensitive to numerical noise and operate best in the linear regime (not saturated).
Automatic standardization of the input data x_i: mean = 0, stddev = 1
One-hot encode ("horizontalize") categorical variables with a dropped reference level, e.g.
{full-time, part-time, none, self-employed} ->
{0,1,0} = part-time, {0,0,0} = self-employed
Automatic initialization of weights:
Poor man's initialization: random weights w_kl
Default (better): uniform distribution in +/- sqrt(6 / (#units + #units_previous_layer))
(see the sketch below)
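A minimal NumPy sketch of these three steps: standardization, one-hot encoding with a dropped reference level, and the uniform weight initialization. Column names and values are made up for illustration:

import numpy as np

# Standardize numeric columns: mean 0, stddev 1
X = np.array([[25, 30000], [40, 80000], [55, 50000]], dtype=float)  # age, income
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# One-hot encode a categorical column, dropping one reference level
# ({0,0,0} then encodes the dropped level, here "self-employed")
levels = ["full-time", "part-time", "none"]
employment = ["part-time", "self-employed", "full-time"]
E = np.array([[1.0 if e == lvl else 0.0 for lvl in levels] for e in employment])

# Uniform weight initialization in +/- sqrt(6/(#units + #units_previous_layer))
def init_weights(fan_in, fan_out, rng=np.random.default_rng(42)):
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-limit, limit, size=(fan_in, fan_out))

W = init_weights(X_std.shape[1] + E.shape[1], 4)  # 5 inputs -> 4 hidden units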
24. Training: Update Weights & Biases
For each training row, we make a prediction and compare with the actual label (supervised learning):
predicted: 0.8 (married), 0.2 (single); actual: 1 (married), 0 (single)
Objective: minimize the prediction error (MSE or cross-entropy)
Mean Square Error = (0.2^2 + 0.2^2)/2 ("penalize differences per class")
Cross-entropy = -log(0.8) ("strongly penalize non-1-ness")
Stochastic Gradient Descent: update weights and biases via the gradient of the error (computed via back-propagation):
w <- w - rate * ∂E/∂w
[Sketch: error E as a function of w; the learning rate sets the step size downhill]
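For the concrete numbers above, a few lines of NumPy reproduce both objectives (a sketch; the 0.8/0.2 predictions are the slide's example):

import numpy as np

predicted = np.array([0.8, 0.2])   # P(married), P(single)
actual    = np.array([1.0, 0.0])   # true label: married

mse = np.mean((predicted - actual) ** 2)             # (0.2^2 + 0.2^2)/2 = 0.04
cross_entropy = -np.sum(actual * np.log(predicted))  # -log(0.8) ~ 0.223
print(mse, cross_entropy)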
25. Backward Propagation
How to compute ∂E/∂w_i for w_i <- w_i - rate * ∂E/∂w_i ?
Naive: for every i, evaluate E twice at (w_1,…,w_i±Δ,…,w_N)… Slow!
Backprop: compute ∂E/∂w_i via the chain rule, going backwards:
net = sum_i(w_i*x_i) + b
y = activation(net)
E = error(y)
∂E/∂w_i = ∂E/∂y * ∂y/∂net * ∂net/∂w_i
        = ∂(error(y))/∂y * ∂(activation(net))/∂net * x_i
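A minimal sketch comparing both routes for a single tanh neuron with squared error: the chain-rule gradient matches the slow finite-difference estimate. All values are made-up illustration data:

import numpy as np

x = np.array([0.5, -1.0, 2.0])   # inputs
w = np.array([0.1, 0.4, -0.2])   # weights
b, target, rate = 0.05, 1.0, 0.1

def E(w):
    y = np.tanh(np.dot(w, x) + b)      # forward: net, then activation
    return 0.5 * (y - target) ** 2     # squared error

# Backprop: dE/dw_i = dE/dy * dy/dnet * dnet/dw_i
net = np.dot(w, x) + b
y = np.tanh(net)
grad = (y - target) * (1 - np.tanh(net) ** 2) * x

# Naive alternative: evaluate E twice per weight at w_i +/- delta... slow!
delta = 1e-6
grad_fd = np.array([(E(w + delta * np.eye(3)[i]) - E(w - delta * np.eye(3)[i])) / (2 * delta)
                    for i in range(3)])
assert np.allclose(grad, grad_fd, atol=1e-6)

w = w - rate * grad   # one SGD step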
26. H2O Deep Learning Architecture
[Diagram: multiple nodes/JVMs, each with an HTTPD and the H2O atomic in-memory K-V store; nodes communicate synchronously, threads asynchronously]
initial model: weights and biases w_1
map: each node trains a copy of the weights and biases with (some* or all of) its local data, using asynchronous Fork/Join threads
reduce: model averaging: average the weights and biases from all nodes, e.g.
w* = (w_1 + w_2 + w_3 + w_4) / 4
updated model: w*
Speedup is at least #nodes/log(#rows) (arxiv:1209.4129v3)
Query & display the model via JSON, WWW
Keep iterating over the data ("epochs"), score from time to time
*auto-tuned (default) or user-specified number of points per MapReduce iteration
(a toy sketch of the map/reduce averaging step follows below)
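A toy NumPy sketch of the map/reduce model-averaging idea (not H2O's actual implementation): each "node" runs SGD on its shard of a linear model, then the weights are averaged.

import numpy as np

rng = np.random.default_rng(1)
true_w = np.array([2.0, -1.5])
X = rng.normal(size=(10000, 2))
y = X @ true_w + 0.1 * rng.normal(size=10000)

def sgd_on_shard(Xs, ys, w, rate=0.01):
    # map: train a local copy of the weights on this node's data
    for xi, yi in zip(Xs, ys):
        w = w - rate * (xi @ w - yi) * xi   # squared-error gradient step
    return w

w0 = np.zeros(2)                                 # initial model
shards = np.array_split(np.arange(len(y)), 4)    # 4 "nodes"
local = [sgd_on_shard(X[s], y[s], w0.copy()) for s in shards]
w_star = np.mean(local, axis=0)                  # reduce: model averaging
print(w_star)   # close to [2.0, -1.5]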
27. "Secret" Sauce to Higher Accuracy
Adaptive learning rate - ADADELTA (Google):
automatically set the learning rate for each neuron based on its training history.
Regularization:
L1 penalizes non-zero weights; L2 penalizes large weights; Dropout randomly ignores certain inputs.
Grid Search and Checkpointing:
run a grid search to scan many hyperparameters, then continue training the most promising model(s).
28. Detail: Adaptive Learning Rate
Compute the moving average of Δw_i^2 at time t for window length rho:
E[Δw_i^2]_t = rho * E[Δw_i^2]_{t-1} + (1-rho) * Δw_i^2
Compute the RMS of Δw_i at time t with smoothing epsilon:
RMS[Δw_i]_t = sqrt(E[Δw_i^2]_t + epsilon)
Adaptive acceleration / momentum: accumulate previous weight updates, but over a window of time.
Adaptive annealing / progress: gradient-dependent learning rate; the moving window prevents "freezing" (unlike ADAGRAD: no window).
Do the same for ∂E/∂w_i, then obtain the per-weight learning rate:
rate(w_i, t) = RMS[Δw_i]_{t-1} / RMS[∂E/∂w_i]_t
cf. the ADADELTA paper
(see the sketch below)
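A minimal sketch of the per-weight ADADELTA update built from these formulas. The rho and epsilon values are typical defaults, chosen here for illustration:

import numpy as np

rho, epsilon = 0.95, 1e-6
w = np.zeros(3)
Eg2 = np.zeros(3)   # moving average of squared gradients, E[g^2]
Edw2 = np.zeros(3)  # moving average of squared updates,  E[dw^2]

def adadelta_step(w, grad):
    global Eg2, Edw2
    Eg2 = rho * Eg2 + (1 - rho) * grad ** 2
    # per-weight rate: RMS[dw]_{t-1} / RMS[g]_t, applied to the gradient
    dw = -np.sqrt(Edw2 + epsilon) / np.sqrt(Eg2 + epsilon) * grad
    Edw2 = rho * Edw2 + (1 - rho) * dw ** 2
    return w + dw

# usage: one step with a made-up gradient
w = adadelta_step(w, np.array([0.1, -0.3, 0.05]))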
29. Detail: Dropout Regularization
Training:
For each hidden neuron, for each training sample, for each iteration,
ignore (zero out) a different random fraction p of input activations.
Testing:
Use all activations, but scale them by (1-p)
(to "simulate" the missing activations from training).
cf. Geoff Hinton's paper
(a sketch follows below)
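A minimal NumPy sketch of this train/test asymmetry (p is the drop fraction; the activations are made-up values):

import numpy as np

p = 0.5                                  # fraction of activations to drop
rng = np.random.default_rng(0)
activations = np.array([0.7, -0.2, 1.1, 0.4])

# Training: zero out a random fraction p (a fresh mask per sample and iteration)
mask = rng.random(activations.shape) >= p
train_out = activations * mask

# Testing: keep all activations, scaled by (1-p) to match the expected magnitude
test_out = activations * (1 - p)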
30. MNIST: Digits Classification
MNIST = digitized handwritten digits database (Yann LeCun)
Yann LeCun: "Yet another advice: don't get fooled by people who claim to have a solution to Artificial General Intelligence. Ask them what error rate they get on MNIST or ImageNet."
Data: 28x28 = 784 pixels with (gray-scale) values in 0…255
Train: 60,000 rows, 784 integer columns, 10 classes
Test: 10,000 rows, 784 integer columns, 10 classes
Standing world record: without distortions or convolutions, the best-ever published test set error rate is 0.83% (Microsoft).
Let's see how H2O does on the MNIST dataset!
31. H2O Deep Learning on MNIST: 0.87% test set error (so far)
Test set error: 1.5% after 10 mins, 1.0% after 1.5 hours, 0.87% after 4 hours
World-class results!
No pre-training, no distortions, no convolutions, no unsupervised training.
Running on 4 nodes with 16 cores each.
Frequent errors: confuses 2/7 and 4/9.
32. Weather Dataset
Predict "RainTomorrow" from Temperature, Humidity, Wind, Pressure, etc.
33. Live Demo: Weather Prediction
5-fold cross-validation; interactive ROC curve with real-time updates.
3 hidden Rectifier layers, Dropout, L1-penalty.
The 12.7% 5-fold cross-validation error is at least as good as the GBM/RF/GLM models.
34. Live Demo: Grid Search
How did I find those parameters? Grid Search!
(works for multiple hyperparameters at once)
Then continue training the best model (see the sketch below).
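A minimal sketch of the grid-search-then-continue idea. The train_and_score helper here is a toy stand-in (logistic regression via SGD on synthetic data), not an H2O call, and the hyperparameter values are illustrative:

import numpy as np
from itertools import product

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X @ rng.normal(size=5) > 0).astype(float)
X_tr, X_va, y_tr, y_va = X[:150], X[150:], y[:150], y[150:]

def train_and_score(l2, rate, epochs):
    # toy stand-in for a full model fit
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for xi, yi in zip(X_tr, y_tr):
            p = 1 / (1 + np.exp(-xi @ w))
            w -= rate * ((p - yi) * xi + l2 * w)
    p_va = 1 / (1 + np.exp(-X_va @ w))
    return np.mean((p_va > 0.5) != y_va)   # validation error

grid = {"l2": [1e-5, 1e-3], "rate": [1e-3, 1e-2], "epochs": [5, 20]}
results = [(train_and_score(*combo), combo) for combo in product(*grid.values())]
best_err, best_params = min(results)
# checkpointing idea: continue training the best model with more epochs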
35. Text Classification
Goal: predict the item from the seller's text description
"Vintage 18KT gold Rolex 2 Tone in great condition"
Data: binary word vector, e.g. 0,0,1,0,0,0,0,0,1,0,0,0,1,…,0 (1s for "gold", "vintage", "condition")
Train: 578,361 rows, 8,647 cols, 467 classes
Test: 64,263 rows, 8,647 cols, 143 classes
Let's see how H2O does on the eBay dataset!
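A minimal sketch of how such a binary word vector is built (the vocabulary here is a tiny illustrative stand-in for the 8,647 words in the real data):

vocab = ["vintage", "gold", "rolex", "condition", "silver"]
index = {w: i for i, w in enumerate(vocab)}

def to_binary_vector(description):
    vec = [0] * len(vocab)
    for word in description.lower().split():
        if word in index:
            vec[index[word]] = 1   # presence, not counts
    return vec

print(to_binary_vector("Vintage 18KT gold Rolex 2 Tone in great condition"))
# -> [1, 1, 1, 1, 0]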
36. Text Classification
Train: 578,361 rows, 8,647 cols, 467 classes
Test: 64,263 rows, 8,647 cols, 143 classes
Out-of-the-box: 11.6% test set error after 10 epochs!
Predicts the correct class (out of 143) 88.4% of the time!
Note 1: H2O's columnar-compressed in-memory store only needs 60 MB to store 5 billion values (a dense CSV needs 18 GB).
Note 2: No tuning was done (results are for illustration only).
37. Parallel Scalability
(for 64 epochs on MNIST, with the "0.87%" parameters; 4 cores per node, 1 epoch per node per MapReduce)
[Charts: Speedup (up to ~40x) and training time in minutes (down to 2.7 mins) vs. number of H2O nodes: 1, 2, 4, 8, 16, 32, 63]
38. Deep Learning Auto-Encoders for Anomaly Detection
Toy example: find the anomaly in ECG heartbeat data.
First, train a model on what's "normal": 20 time-series samples of 210 data points each.
Deep Auto-Encoder: learn the low-dimensional non-linear "structure" of the data that allows to reconstruct the original data.
Also works for categorical data!
39. Deep Learning Auto-Encoders for Anomaly Detection
Model of what's "normal" + test set with anomaly =>
the test set prediction is the reconstruction, which looks "normal";
the anomaly is found via its large reconstruction error!
(a sketch follows below)
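A minimal sketch of the reconstruction-error idea, using a linear auto-encoder (PCA-style) in place of a deep one; the "heartbeats" are synthetic sine waves with one injected anomaly:

import numpy as np

rng = np.random.default_rng(0)
t = np.linspace(0, 2 * np.pi, 210)
normal = np.array([np.sin(t + rng.normal(0, 0.1)) + 0.05 * rng.normal(size=210)
                   for _ in range(20)])                      # 20 "normal" beats

# Linear auto-encoder: project onto top-3 principal components and back
mean = normal.mean(axis=0)
U, S, Vt = np.linalg.svd(normal - mean, full_matrices=False)
encode_decode = lambda X: (X - mean) @ Vt[:3].T @ Vt[:3] + mean

test = np.vstack([normal[:5], np.sin(3 * t)[None, :]])       # last row: anomaly
errors = np.mean((test - encode_decode(test)) ** 2, axis=1)  # reconstruction MSE
print(errors)   # the anomalous beat has a much larger error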
40. H2O brings Deep Learning to R
R vignette with example R scripts: http://0xdata.com/h2o/algorithms/
All parameters are available from R…
41. POJO Model Export for Production Scoring
Plain Old Java code is auto-generated to take your H2O Deep Learning models into production!
42. Higgs Particle Discovery with H2O
How well did H2O Deep Learning do?
<Your guess goes here>
Reference paper results; any guesses for the AUC on low-level features?
AUC = 0.76 was the best for RF/GBM/NN (H2O).
Let's see how H2O did in the past 30 minutes!
43. H2O Steam: Scoring Platform
http://server:port/steam/index.html
Live Demo: Higgs dataset on a 10-node cluster.
Let's score all our H2O models and compare them!
44. Scoring Higgs Models in H2O Steam
Live Demo on a 10-node cluster:
<10 minutes runtime for all algos!
Better than the LHC baseline of AUC=0.73!
45. Higgs Particle Detection with H2O
HIGGS UCI Dataset: 21 low-level features AND 7 high-level derived features
Train: 10M rows, Test: 500k rows
*Nature paper: http://arxiv.org/pdf/1402.4735v2.pdf
Parameters (not heavily tuned), H2O running on 10 nodes:

Algorithm                     | Paper's l-l AUC | low-level H2O AUC | all features H2O AUC | Parameters
Generalized Linear Model      | -     | 0.596   | 0.684 | default, binomial
Random Forest                 | -     | 0.764   | 0.840 | 50 trees, max depth 50
Gradient Boosted Trees        | 0.73  | 0.753   | 0.839 | 50 trees, max depth 15
Neural Net, 1 layer           | 0.733 | 0.760   | 0.830 | 1x300 Rectifier, 100 epochs
Deep Learning, 3 hidden layers| 0.836 | 0.850   | -     | 3x1000 Rectifier, L2=1e-5, 40 epochs
Deep Learning, 4 hidden layers| 0.868 | 0.869   | -     | 4x500 Rectifier, L1=L2=1e-5, 300 epochs
Deep Learning, 6 hidden layers| 0.880 | running | -     | 6x500 Rectifier, L1=L2=1e-5

Deep Learning on low-level features alone beats everything else!
H2O preliminary results compare well with the paper's results* (TMVA & Theano).
46. Tips for H2O Deep Learning
General:
- More layers for more complex functions (exponentially more non-linearity).
- More neurons per layer to detect finer structure in the data ("memorizing").
- Add some regularization for less overfitting (lower validation set error).
Specifically:
- Do a grid search to get a feel for convergence, then continue training.
- Try Tanh/Rectifier; try max_w2=10…50, L1=1e-5…1e-3 and/or L2=1e-5…1e-3.
- Try Dropout (input: up to 20%, hidden: up to 50%) with a test/validation set. Input dropout is recommended for noisy high-dimensional input.
- Distributed: more training samples per iteration: faster, but less accuracy?
- With ADADELTA: try epsilon = 1e-4, 1e-6, 1e-8, 1e-10 and rho = 0.9, 0.95, 0.99.
- Without ADADELTA: try rate = 1e-4…1e-2, rate_annealing = 1e-5…1e-9, momentum_start = 0.5…0.9, momentum_stable = 0.99, momentum_ramp = 1/rate_annealing.
- Try balance_classes = true for datasets with large class imbalance.
- Enable force_load_balance for small datasets.
- Enable replicate_training_data if each node can hold all the data.
(a hedged usage sketch follows below)
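As a hedged illustration only, here is how several of these knobs map onto the present-day h2o Python package (the modern H2O-3 API, not the exact API from the time of this talk; the file path and column choices are placeholders):

import h2o
from h2o.estimators.deeplearning import H2ODeepLearningEstimator

h2o.init()
frame = h2o.import_file("path/to/train.csv")   # placeholder path
train, valid = frame.split_frame(ratios=[0.8])

model = H2ODeepLearningEstimator(
    hidden=[500, 500, 500],                       # 3 hidden layers
    activation="RectifierWithDropout",
    input_dropout_ratio=0.1,                      # input dropout: up to 20%
    hidden_dropout_ratios=[0.5, 0.5, 0.5],        # hidden dropout: up to 50%
    l1=1e-5, l2=1e-5, max_w2=10,
    adaptive_rate=True, rho=0.95, epsilon=1e-6,   # ADADELTA
    balance_classes=True,
    epochs=40,
)
model.train(x=frame.columns[:-1], y=frame.columns[-1],
            training_frame=train, validation_frame=valid)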
47. Extensions for H2O Deep Learning
- Vision: Convolutional & Pooling Layers (PUB-644)
- Anomaly Detection (PUB-806)
- Pre-Training: Stacked Auto-Encoders (PUB-1014)
- Faster Training: GPGPU support (PUB-1013)
- Language/Sequences: Recurrent Neural Networks
- Benchmark vs. other Deep Learning packages
- Investigate other optimization algorithms
Contribute to H2O! Add your own JIRA tickets!
48. Key Take-Aways
H2O is a distributed in-memory data science platform. It was designed for high-performance machine learning applications on big data.
H2O Deep Learning is ready to take your advanced analytics to the next level - try it on your data!
Join our community and Meetups!
https://github.com/h2oai
http://docs.h2o.ai
www.h2o.ai/community
@h2oai
Thank you!