A screencast with audio is available at: http://youtu.be/fdbQreQacIQ
In this talk, we put Deep Learning to the test on real-world data problems.
Data:
- Africa Soil Kaggle challenge: top (#1) position achieved with H2O Deep Learning
- Higgs binary classification dataset (10M rows, 29 cols)
- MNIST 10-class dataset
- Weather categorical dataset
- eBay text classification dataset (8500 cols, 500k rows, 467 classes)
- ECG heartbeat anomaly detection
- Powered by the open source machine learning software H2O.ai. Contributors welcome at: https://github.com/h2oai
- To view videos on H2O open source machine learning software, go to: https://www.youtube.com/user/0xdata
Deep Learning through Examples - Kaggle #1
1. Deep Learning through Examples
Arno Candel
0xdata, H2O.ai
Scalable In-Memory Machine Learning
Silicon Valley Big Data Science Meetup,
Vendavo, Mountain View, 9/11/14
2. Who am I?
@ArnoCandel
PhD in Computational Physics (2005) from ETH Zurich, Switzerland
6 years at SLAC - Accelerator Physics Modeling
2 years at Skytree, Inc - Machine Learning
9 months at 0xdata/H2O - Machine Learning
15 years in HPC/Supercomputing/Modeling
Named "2014 Big Data All-Star" by Fortune Magazine
3. H2O Deep Learning: Kaggle #1 rank (out of 413), 40 days left in the competition
Achieved with H2O Deep Learning from R!
[Leaderboard screenshot: positions #1 and #17]
@matlabulous (Jo-fai Chow, Blend it like a Bayesian!) says:
"I am 99.99999999999% sure that I can still go further with H2O."
4. Outline
Intro & Live Demo (10 mins)
Methods & Implementation (20 mins)
Results & Live Demos (25 mins)
- Higgs boson detection
- MNIST handwritten digits
- Text classification
Q & A (5 mins)
5. About H2O (aka 0xdata)
Java, Apache v2 Open Source
Join the www.h2o.ai/community!
#1 Java Machine Learning project on GitHub
6. Customer Demands for Practical Machine Learning
Requirements -> Value
In-Memory -> Fast (Interactive)
Distributed -> Big Data (No Sampling)
Open Source -> Ownership of Methods
API / SDK -> Extensibility
H2O was developed by 0xdata from scratch to meet these requirements.
7. H2O Integration
[Diagram: H2O is accessed via R, JSON, Scala, Python and Java APIs; it runs standalone, over YARN, or on Hadoop MRv1, reading data from HDFS.]
8. H2O Architecture
[Diagram: distributed in-memory K-V store with columnar compression and a memory manager; MapReduce-style Machine Learning Algorithms (e.g. Deep Learning); nano-fast Prediction/Scoring Engine; R Engine.]
9. H2O - The Killer App on Spark
http://databricks.com/blog/2014/06/30/sparkling-water-h20-spark.html
10. H2O Deep Learning on Spark

// Test if we can correctly learn A, B where Y = logistic(A + B*X)
test("deep learning log regression") {
  val nPoints = 10000
  val A = 2.0
  val B = -1.5

  // Generate training data
  val trainData = DeepLearningSuite.generateLogisticInput(A, B, nPoints, 42)
  // Create RDD from training data
  val trainRDD = sc.parallelize(trainData, 2)
  trainRDD.cache()

  import H2OContext._
  // Create H2O data frame (will be implicit in the future)
  val trainH2ORDD = toDataFrame(sc, trainRDD)
  // Create an H2O DeepLearning model
  val dlParams = new DeepLearningParameters()
  dlParams.source = trainH2ORDD
  dlParams.response = trainH2ORDD.lastVec()
  dlParams.classification = true
  val dl = new DeepLearning(dlParams)
  val dlModel = dl.train().get()

  // Score validation data
  val validationData = DeepLearningSuite.generateLogisticInput(A, B, nPoints, 17)
  val validationRDD = sc.parallelize(validationData, 2)
  val validationH2ORDD = toDataFrame(sc, validationRDD)
  val predictionH2OFrame = new DataFrame(dlModel.score(validationH2ORDD))('predict)
  val predictionRDD = toRDD[DoubleHolder](sc, predictionH2OFrame) // will be implicit in the future
  // Validate prediction
  validatePrediction(predictionRDD.collect().map(_.predict.getOrElse(Double.NaN)), validationData)
}

Brand-Sparkling-New Sneak Preview!
11. H2O R CRAN package
John Chambers (creator of the S language, R-core member) names the H2O R API among his top three promising R projects.
12. H2O + R = Happy Data Scientist
Machine Learning on Big Data with R:
Data resides on the H2O cluster!
13. Higgs Particle Discovery
Large Hadron Collider: largest experiment of mankind!
$13+ billion, 16.8 miles long, 120 MegaWatts, -456F, 1 PB/day, etc.
Higgs boson discovery (July '12) led to the 2013 Nobel Prize!
[Images: Higgs vs Background event displays, courtesy CERN / LHC]
http://arxiv.org/pdf/1402.4735v2.pdf
Machine Learning Meets Physics
Or rather: back to the roots (the WWW was invented at CERN in '89…)
14. Higgs: Binary Classification Problem
Current methods of choice for physicists:
- Boosted Decision Trees
- Neural networks with 1 hidden layer
BUT: Must first add derived high-level features (physics formulae)
HIGGS UCI Dataset: 21 low-level features AND 7 high-level derived features
Train: 10M rows, Test: 500k rows
Metric: AUC = Area under the ROC curve (range: 0.5…1, higher is better)

Algorithm                   | low-level H2O AUC | all features H2O AUC
Generalized Linear Model    | 0.596             | 0.684
Random Forest               | 0.764             | 0.840
Gradient Boosted Trees      | 0.753             | 0.839
Neural Net, 1 hidden layer  | 0.760             | 0.830
(adding the derived features improves every algorithm)
15. Higgs: Can Deep Learning Do Better?

Algorithm                   | low-level H2O AUC | all features H2O AUC
Generalized Linear Model    | 0.596             | 0.684
Random Forest               | 0.764             | 0.840
Gradient Boosted Trees      | 0.753             | 0.839
Neural Net, 1 hidden layer  | 0.760             | 0.830
Deep Learning               | ?                 | ?
<Your guess goes here>
Reference paper results: baseline 0.733
Let's build an H2O Deep Learning model and find out! (That was my last weekend.)
16. What is Deep Learning?
Wikipedia: Deep learning is a set of algorithms in machine learning that attempt to model high-level abstractions in data by using architectures composed of multiple non-linear transformations.
Example: input data (image) -> prediction (who is it?)
Facebook's DeepFace (Yann LeCun) recognises faces as well as humans.
17. What is NOT Deep
Linear models are not deep (by definition).
Neural nets with 1 hidden layer are not deep (only 1 layer - no feature hierarchy).
SVMs and kernel methods are not deep (2 layers: kernel + linear).
Classification trees are not deep (they operate on the original input space; no new features are generated).
18. Deep Learning is Trending
[Google Trends chart, 2009-2013: rising interest in deep learning]
Businesses are using Deep Learning techniques!
- Google Brain (Andrew Ng, Jeff Dean & Geoffrey Hinton)
- FBI FACE: $1 billion face recognition project
- Chinese search giant Baidu hires the man behind the "Google Brain" (Andrew Ng)
19. Deep Learning History
(slides by Yann LeCun, now at Facebook)
Deep Learning wins competitions AND makes humans, businesses and machines (cyborgs!?) smarter.
20. Deep Learning in H2O
1970s multi-layer feed-forward Neural Network
(supervised learning with stochastic gradient descent using back-propagation)
+ distributed processing for big data
(H2O in-memory MapReduce paradigm on distributed data)
+ multi-threaded speedup
(H2O Fork/Join worker threads update the model asynchronously)
+ smart algorithms for accuracy
(weight initialization, adaptive learning rate, momentum, dropout regularization,
L1/L2 regularization, grid search, checkpointing, auto-tuning, model averaging)
= Top-notch prediction engine!
21. Example Neural Network
"fully connected" directed graph of neurons
[Diagram: inputs age, income, employment feed the network; outputs are married / single; information flows from input to output]
Input layer: 3 neurons -> Hidden layer 1: 4 neurons -> Hidden layer 2: 3 neurons -> Output layer: 2 neurons
#connections: 3x4, 4x3, 3x2
22. Prediction: Forward Propagation
"neurons activate each other via weighted sums"
Inputs x_i (age, income, employment) activate the first hidden layer:
y_j = tanh(sum_i(x_i*u_ij) + b_j)
The second hidden layer:
z_k = tanh(sum_j(y_j*v_jk) + c_k)
The output layer produces per-class probabilities (married / single), with sum_l(p_l) = 1:
p_l = softmax(sum_k(z_k*w_kl) + d_l)
softmax(x_k) = exp(x_k) / sum_k(exp(x_k))
b_j, c_k, d_l: bias values (independent of inputs)
Activation function: tanh; alternative: x -> max(0,x) "rectifier"
p_l is a non-linear function of x_i: with enough layers, the network can approximate ANY function!
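To make the arithmetic concrete, here is a minimal NumPy sketch of this exact 3-4-3-2 forward pass. The weights, biases and input values are made-up illustration data, not from the talk:

import numpy as np

def softmax(x):
    e = np.exp(x - x.max())  # subtract max for numerical stability
    return e / e.sum()

rng = np.random.default_rng(0)
x = np.array([0.3, -1.2, 0.5])                # standardized age, income, employment
U, b = rng.normal(size=(3, 4)), np.zeros(4)   # input -> hidden layer 1
V, c = rng.normal(size=(4, 3)), np.zeros(3)   # hidden 1 -> hidden 2
W, d = rng.normal(size=(3, 2)), np.zeros(2)   # hidden 2 -> output

y = np.tanh(x @ U + b)   # y_j = tanh(sum_i(x_i*u_ij) + b_j)
z = np.tanh(y @ V + c)   # z_k = tanh(sum_j(y_j*v_jk) + c_k)
p = softmax(z @ W + d)   # p_l = softmax(sum_k(z_k*w_kl) + d_l)
print(p, p.sum())        # per-class probabilities, summing to 1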
23. Data Preparation & Initialization
Neural networks are sensitive to numerical noise and operate best in the linear regime (not saturated).
Automatic standardization of the input data x_i: mean = 0, stddev = 1
One-hot encode ("horizontalize") categorical variables with a dropped reference level, e.g.
{full-time, part-time, none, self-employed} ->
{0,1,0} = part-time, {0,0,0} = self-employed
Automatic initialization of weights:
Poor man's initialization: random weights w_kl
Default (better): uniform distribution in +/- sqrt(6 / (#units + #units_previous_layer))
(see the sketch below)
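A minimal NumPy sketch of these three steps: standardization, one-hot encoding with a dropped reference level, and the uniform weight initialization. Column names and values are made up for illustration:

import numpy as np

# Standardize numeric columns: mean 0, stddev 1
X = np.array([[25, 30000], [40, 80000], [55, 50000]], dtype=float)  # age, income
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# One-hot encode a categorical column, dropping one reference level
# ({0,0,0} then encodes the dropped level, here "self-employed")
levels = ["full-time", "part-time", "none"]
employment = ["part-time", "self-employed", "full-time"]
E = np.array([[1.0 if e == lvl else 0.0 for lvl in levels] for e in employment])

# Uniform weight initialization in +/- sqrt(6/(#units + #units_previous_layer))
def init_weights(fan_in, fan_out, rng=np.random.default_rng(42)):
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-limit, limit, size=(fan_in, fan_out))

W = init_weights(X_std.shape[1] + E.shape[1], 4)  # 5 inputs -> 4 hidden units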
24. Training: Update Weights & Biases
For each training row, we make a prediction and compare with the actual label (supervised learning):
predicted: 0.8 (married), 0.2 (single); actual: 1 (married), 0 (single)
Objective: minimize the prediction error (MSE or cross-entropy)
Mean Square Error = (0.2^2 + 0.2^2)/2 ("penalize differences per class")
Cross-entropy = -log(0.8) ("strongly penalize non-1-ness")
Stochastic Gradient Descent: update weights and biases via the gradient of the error (computed via back-propagation):
w <- w - rate * ∂E/∂w
[Sketch: error E as a function of w; the learning rate sets the step size downhill]
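For the concrete numbers above, a few lines of NumPy reproduce both objectives (a sketch; the 0.8/0.2 predictions are the slide's example):

import numpy as np

predicted = np.array([0.8, 0.2])   # P(married), P(single)
actual    = np.array([1.0, 0.0])   # true label: married

mse = np.mean((predicted - actual) ** 2)             # (0.2^2 + 0.2^2)/2 = 0.04
cross_entropy = -np.sum(actual * np.log(predicted))  # -log(0.8) ~ 0.223
print(mse, cross_entropy)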
25. Backward Propagation
How to compute ∂E/∂w_i for w_i <- w_i - rate * ∂E/∂w_i ?
Naive: for every i, evaluate E twice at (w_1,…,w_i±Δ,…,w_N)… Slow!
Backprop: compute ∂E/∂w_i via the chain rule, going backwards:
net = sum_i(w_i*x_i) + b
y = activation(net)
E = error(y)
∂E/∂w_i = ∂E/∂y * ∂y/∂net * ∂net/∂w_i
        = ∂(error(y))/∂y * ∂(activation(net))/∂net * x_i
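A minimal sketch comparing both routes for a single tanh neuron with squared error: the chain-rule gradient matches the slow finite-difference estimate. All values are made-up illustration data:

import numpy as np

x = np.array([0.5, -1.0, 2.0])   # inputs
w = np.array([0.1, 0.4, -0.2])   # weights
b, target, rate = 0.05, 1.0, 0.1

def E(w):
    y = np.tanh(np.dot(w, x) + b)      # forward: net, then activation
    return 0.5 * (y - target) ** 2     # squared error

# Backprop: dE/dw_i = dE/dy * dy/dnet * dnet/dw_i
net = np.dot(w, x) + b
y = np.tanh(net)
grad = (y - target) * (1 - np.tanh(net) ** 2) * x

# Naive alternative: evaluate E twice per weight at w_i +/- delta... slow!
delta = 1e-6
grad_fd = np.array([(E(w + delta * np.eye(3)[i]) - E(w - delta * np.eye(3)[i])) / (2 * delta)
                    for i in range(3)])
assert np.allclose(grad, grad_fd, atol=1e-6)

w = w - rate * grad   # one SGD step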
26. H2O Deep Learning Architecture
[Diagram: multiple nodes/JVMs, each with an HTTPD and the H2O atomic in-memory K-V store; nodes communicate synchronously, threads asynchronously]
initial model: weights and biases w_1
map: each node trains a copy of the weights and biases with (some* or all of) its local data, using asynchronous Fork/Join threads
reduce: model averaging: average the weights and biases from all nodes, e.g.
w* = (w_1 + w_2 + w_3 + w_4) / 4
updated model: w*
Speedup is at least #nodes/log(#rows) (arxiv:1209.4129v3)
Query & display the model via JSON, WWW
Keep iterating over the data ("epochs"), score from time to time
*auto-tuned (default) or user-specified number of points per MapReduce iteration
(a toy sketch of the map/reduce averaging step follows below)
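A toy NumPy sketch of the map/reduce model-averaging idea (not H2O's actual implementation): each "node" runs SGD on its shard of a linear model, then the weights are averaged.

import numpy as np

rng = np.random.default_rng(1)
true_w = np.array([2.0, -1.5])
X = rng.normal(size=(10000, 2))
y = X @ true_w + 0.1 * rng.normal(size=10000)

def sgd_on_shard(Xs, ys, w, rate=0.01):
    # map: train a local copy of the weights on this node's data
    for xi, yi in zip(Xs, ys):
        w = w - rate * (xi @ w - yi) * xi   # squared-error gradient step
    return w

w0 = np.zeros(2)                                 # initial model
shards = np.array_split(np.arange(len(y)), 4)    # 4 "nodes"
local = [sgd_on_shard(X[s], y[s], w0.copy()) for s in shards]
w_star = np.mean(local, axis=0)                  # reduce: model averaging
print(w_star)   # close to [2.0, -1.5]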
27. "Secret" Sauce to Higher Accuracy
Adaptive learning rate - ADADELTA (Google):
automatically set the learning rate for each neuron based on its training history.
Regularization:
L1 penalizes non-zero weights; L2 penalizes large weights; Dropout randomly ignores certain inputs.
Grid Search and Checkpointing:
run a grid search to scan many hyperparameters, then continue training the most promising model(s).
28. Detail: Adaptive Learning Rate
Compute the moving average of Δw_i^2 at time t for window length rho:
E[Δw_i^2]_t = rho * E[Δw_i^2]_{t-1} + (1-rho) * Δw_i^2
Compute the RMS of Δw_i at time t with smoothing epsilon:
RMS[Δw_i]_t = sqrt(E[Δw_i^2]_t + epsilon)
Adaptive acceleration / momentum: accumulate previous weight updates, but over a window of time.
Adaptive annealing / progress: gradient-dependent learning rate; the moving window prevents "freezing" (unlike ADAGRAD: no window).
Do the same for ∂E/∂w_i, then obtain the per-weight learning rate:
rate(w_i, t) = RMS[Δw_i]_{t-1} / RMS[∂E/∂w_i]_t
cf. the ADADELTA paper
(see the sketch below)
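A minimal sketch of the per-weight ADADELTA update built from these formulas. The rho and epsilon values are typical defaults, chosen here for illustration:

import numpy as np

rho, epsilon = 0.95, 1e-6
w = np.zeros(3)
Eg2 = np.zeros(3)   # moving average of squared gradients, E[g^2]
Edw2 = np.zeros(3)  # moving average of squared updates,  E[dw^2]

def adadelta_step(w, grad):
    global Eg2, Edw2
    Eg2 = rho * Eg2 + (1 - rho) * grad ** 2
    # per-weight rate: RMS[dw]_{t-1} / RMS[g]_t, applied to the gradient
    dw = -np.sqrt(Edw2 + epsilon) / np.sqrt(Eg2 + epsilon) * grad
    Edw2 = rho * Edw2 + (1 - rho) * dw ** 2
    return w + dw

# usage: one step with a made-up gradient
w = adadelta_step(w, np.array([0.1, -0.3, 0.05]))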
29. Detail: Dropout Regularization
Training:
For each hidden neuron, for each training sample, for each iteration,
ignore (zero out) a different random fraction p of input activations.
Testing:
Use all activations, but scale them by (1-p)
(to "simulate" the missing activations from training).
cf. Geoff Hinton's paper
(a sketch follows below)
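A minimal NumPy sketch of this train/test asymmetry (p is the drop fraction; the activations are made-up values):

import numpy as np

p = 0.5                                  # fraction of activations to drop
rng = np.random.default_rng(0)
activations = np.array([0.7, -0.2, 1.1, 0.4])

# Training: zero out a random fraction p (a fresh mask per sample and iteration)
mask = rng.random(activations.shape) >= p
train_out = activations * mask

# Testing: keep all activations, scaled by (1-p) to match the expected magnitude
test_out = activations * (1 - p)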
30. MNIST: Digits Classification
MNIST = digitized handwritten digits database (Yann LeCun)
Yann LeCun: "Yet another advice: don't get fooled by people who claim to have a solution to Artificial General Intelligence. Ask them what error rate they get on MNIST or ImageNet."
Data: 28x28 = 784 pixels with (gray-scale) values in 0…255
Train: 60,000 rows, 784 integer columns, 10 classes
Test: 10,000 rows, 784 integer columns, 10 classes
Standing world record: without distortions or convolutions, the best-ever published test set error rate is 0.83% (Microsoft).
Let's see how H2O does on the MNIST dataset!
31. H2O Deep Learning on MNIST: 0.87% test set error (so far)
Test set error: 1.5% after 10 mins, 1.0% after 1.5 hours, 0.87% after 4 hours
World-class results!
No pre-training, no distortions, no convolutions, no unsupervised training.
Running on 4 nodes with 16 cores each.
Frequent errors: confuses 2/7 and 4/9.
32. Weather Dataset
Predict "RainTomorrow" from Temperature, Humidity, Wind, Pressure, etc.
33. Live Demo: Weather Prediction
5-fold cross-validation; interactive ROC curve with real-time updates.
3 hidden Rectifier layers, Dropout, L1-penalty.
The 12.7% 5-fold cross-validation error is at least as good as the GBM/RF/GLM models.
34. Live Demo: Grid Search
How did I find those parameters? Grid Search!
(works for multiple hyperparameters at once)
Then continue training the best model (see the sketch below).
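A minimal sketch of the grid-search-then-continue idea. The train_and_score helper here is a toy stand-in (logistic regression via SGD on synthetic data), not an H2O call, and the hyperparameter values are illustrative:

import numpy as np
from itertools import product

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X @ rng.normal(size=5) > 0).astype(float)
X_tr, X_va, y_tr, y_va = X[:150], X[150:], y[:150], y[150:]

def train_and_score(l2, rate, epochs):
    # toy stand-in for a full model fit
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for xi, yi in zip(X_tr, y_tr):
            p = 1 / (1 + np.exp(-xi @ w))
            w -= rate * ((p - yi) * xi + l2 * w)
    p_va = 1 / (1 + np.exp(-X_va @ w))
    return np.mean((p_va > 0.5) != y_va)   # validation error

grid = {"l2": [1e-5, 1e-3], "rate": [1e-3, 1e-2], "epochs": [5, 20]}
results = [(train_and_score(*combo), combo) for combo in product(*grid.values())]
best_err, best_params = min(results)
# checkpointing idea: continue training the best model with more epochs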
35. Text Classification
Goal: predict the item from the seller's text description
"Vintage 18KT gold Rolex 2 Tone in great condition"
Data: binary word vector, e.g. 0,0,1,0,0,0,0,0,1,0,0,0,1,…,0 (1s for "gold", "vintage", "condition")
Train: 578,361 rows, 8,647 cols, 467 classes
Test: 64,263 rows, 8,647 cols, 143 classes
Let's see how H2O does on the eBay dataset!
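A minimal sketch of how such a binary word vector is built (the vocabulary here is a tiny illustrative stand-in for the 8,647 words in the real data):

vocab = ["vintage", "gold", "rolex", "condition", "silver"]
index = {w: i for i, w in enumerate(vocab)}

def to_binary_vector(description):
    vec = [0] * len(vocab)
    for word in description.lower().split():
        if word in index:
            vec[index[word]] = 1   # presence, not counts
    return vec

print(to_binary_vector("Vintage 18KT gold Rolex 2 Tone in great condition"))
# -> [1, 1, 1, 1, 0]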
36. Text Classification
Train: 578,361 rows, 8,647 cols, 467 classes
Test: 64,263 rows, 8,647 cols, 143 classes
Out-of-the-box: 11.6% test set error after 10 epochs!
Predicts the correct class (out of 143) 88.4% of the time!
Note 1: H2O's columnar-compressed in-memory store only needs 60 MB to store 5 billion values (a dense CSV needs 18 GB).
Note 2: No tuning was done (results are for illustration only).
37. Parallel Scalability
(for 64 epochs on MNIST, with the "0.87%" parameters; 4 cores per node, 1 epoch per node per MapReduce)
[Charts: Speedup (up to ~40x) and training time in minutes (down to 2.7 mins) vs. number of H2O nodes: 1, 2, 4, 8, 16, 32, 63]
38. Deep Learning Auto-Encoders for Anomaly Detection
Toy example: find the anomaly in ECG heartbeat data.
First, train a model on what's "normal": 20 time-series samples of 210 data points each.
Deep Auto-Encoder: learn the low-dimensional non-linear "structure" of the data that allows to reconstruct the original data.
Also works for categorical data!
39. Deep Learning Auto-Encoders for Anomaly Detection
Model of what's "normal" + test set with anomaly =>
the test set prediction is the reconstruction, which looks "normal";
the anomaly is found via its large reconstruction error!
(a sketch follows below)
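A minimal sketch of the reconstruction-error idea, using a linear auto-encoder (PCA-style) in place of a deep one; the "heartbeats" are synthetic sine waves with one injected anomaly:

import numpy as np

rng = np.random.default_rng(0)
t = np.linspace(0, 2 * np.pi, 210)
normal = np.array([np.sin(t + rng.normal(0, 0.1)) + 0.05 * rng.normal(size=210)
                   for _ in range(20)])                      # 20 "normal" beats

# Linear auto-encoder: project onto top-3 principal components and back
mean = normal.mean(axis=0)
U, S, Vt = np.linalg.svd(normal - mean, full_matrices=False)
encode_decode = lambda X: (X - mean) @ Vt[:3].T @ Vt[:3] + mean

test = np.vstack([normal[:5], np.sin(3 * t)[None, :]])       # last row: anomaly
errors = np.mean((test - encode_decode(test)) ** 2, axis=1)  # reconstruction MSE
print(errors)   # the anomalous beat has a much larger error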
40. H2O brings Deep Learning to R
R vignette with example R scripts: http://0xdata.com/h2o/algorithms/
All parameters are available from R…
41. POJO Model Export for Production Scoring
Plain Old Java code is auto-generated to take your H2O Deep Learning models into production!
42. Higgs Particle Discovery with H2O
How well did H2O Deep Learning do?
<Your guess goes here>
Reference paper results; any guesses for the AUC on low-level features?
AUC = 0.76 was the best for RF/GBM/NN (H2O).
Let's see how H2O did in the past 30 minutes!
43. H2O Steam: Scoring Platform
http://server:port/steam/index.html
Live Demo: Higgs dataset on a 10-node cluster.
Let's score all our H2O models and compare them!
44. Scoring Higgs Models in H2O Steam
Live Demo on a 10-node cluster:
<10 minutes runtime for all algos!
Better than the LHC baseline of AUC=0.73!
45. Higgs Particle Detection with H2O
HIGGS UCI Dataset: 21 low-level features AND 7 high-level derived features
Train: 10M rows, Test: 500k rows
*Nature paper: http://arxiv.org/pdf/1402.4735v2.pdf
Parameters (not heavily tuned), H2O running on 10 nodes:

Algorithm                     | Paper's l-l AUC | low-level H2O AUC | all features H2O AUC | Parameters
Generalized Linear Model      | -     | 0.596   | 0.684 | default, binomial
Random Forest                 | -     | 0.764   | 0.840 | 50 trees, max depth 50
Gradient Boosted Trees        | 0.73  | 0.753   | 0.839 | 50 trees, max depth 15
Neural Net, 1 layer           | 0.733 | 0.760   | 0.830 | 1x300 Rectifier, 100 epochs
Deep Learning, 3 hidden layers| 0.836 | 0.850   | -     | 3x1000 Rectifier, L2=1e-5, 40 epochs
Deep Learning, 4 hidden layers| 0.868 | 0.869   | -     | 4x500 Rectifier, L1=L2=1e-5, 300 epochs
Deep Learning, 6 hidden layers| 0.880 | running | -     | 6x500 Rectifier, L1=L2=1e-5

Deep Learning on low-level features alone beats everything else!
H2O preliminary results compare well with the paper's results* (TMVA & Theano).
46. Tips for H2O Deep Learning
General:
- More layers for more complex functions (exponentially more non-linearity).
- More neurons per layer to detect finer structure in the data ("memorizing").
- Add some regularization for less overfitting (lower validation set error).
Specifically:
- Do a grid search to get a feel for convergence, then continue training.
- Try Tanh/Rectifier; try max_w2=10…50, L1=1e-5…1e-3 and/or L2=1e-5…1e-3.
- Try Dropout (input: up to 20%, hidden: up to 50%) with a test/validation set. Input dropout is recommended for noisy high-dimensional input.
- Distributed: more training samples per iteration: faster, but less accuracy?
- With ADADELTA: try epsilon = 1e-4, 1e-6, 1e-8, 1e-10 and rho = 0.9, 0.95, 0.99.
- Without ADADELTA: try rate = 1e-4…1e-2, rate_annealing = 1e-5…1e-9, momentum_start = 0.5…0.9, momentum_stable = 0.99, momentum_ramp = 1/rate_annealing.
- Try balance_classes = true for datasets with large class imbalance.
- Enable force_load_balance for small datasets.
- Enable replicate_training_data if each node can hold all the data.
(a hedged usage sketch follows below)
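As a hedged illustration only, here is how several of these knobs map onto the present-day h2o Python package (the modern H2O-3 API, not the exact API from the time of this talk; the file path and column choices are placeholders):

import h2o
from h2o.estimators.deeplearning import H2ODeepLearningEstimator

h2o.init()
frame = h2o.import_file("path/to/train.csv")   # placeholder path
train, valid = frame.split_frame(ratios=[0.8])

model = H2ODeepLearningEstimator(
    hidden=[500, 500, 500],                       # 3 hidden layers
    activation="RectifierWithDropout",
    input_dropout_ratio=0.1,                      # input dropout: up to 20%
    hidden_dropout_ratios=[0.5, 0.5, 0.5],        # hidden dropout: up to 50%
    l1=1e-5, l2=1e-5, max_w2=10,
    adaptive_rate=True, rho=0.95, epsilon=1e-6,   # ADADELTA
    balance_classes=True,
    epochs=40,
)
model.train(x=frame.columns[:-1], y=frame.columns[-1],
            training_frame=train, validation_frame=valid)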
47. Extensions for H2O Deep Learning
- Vision: Convolutional & Pooling Layers (PUB-644)
- Anomaly Detection (PUB-806)
- Pre-Training: Stacked Auto-Encoders (PUB-1014)
- Faster Training: GPGPU support (PUB-1013)
- Language/Sequences: Recurrent Neural Networks
- Benchmark vs. other Deep Learning packages
- Investigate other optimization algorithms
Contribute to H2O! Add your own JIRA tickets!
48. Key Take-Aways
H2O is a distributed in-memory data science platform. It was designed for high-performance machine learning applications on big data.
H2O Deep Learning is ready to take your advanced analytics to the next level - try it on your data!
Join our community and Meetups!
https://github.com/h2oai
http://docs.h2o.ai
www.h2o.ai/community
@h2oai
Thank you!