This document discusses distributed deep learning on Hadoop clusters using CaffeOnSpark. CaffeOnSpark is an open-source project that allows deep learning models defined in Caffe to be trained and run on large datasets distributed across a Spark cluster. It provides a scalable architecture that can reduce training time by up to 19x compared to single-node training. CaffeOnSpark provides APIs in Scala and Python and can be easily deployed on both public and private clouds. It has been used in production at Yahoo since 2015 to power applications like Flickr and Yahoo Weather.
8. Flickr DL/ML Pipeline
(1) Prepare Datasets @ Scale
(2) Deep Learning @ Scale
(3) Non-deep Learning @ Scale
(4) Apply ML Model @ Scale
* 10 billion photos
* 7.5 million per day
* http://bit.ly/1KIDfof by Pierre Garrigues, Deep Learning Summit 2015
11. Hadoop Cluster Enhanced
GPU servers added
› 4 Tesla K80 cards
• 2 GK210 GPUs, 24 GB memory
Network interface enhanced
› InfiniBand for direct access to GPU memory
› Ethernet for external communication
12. Deep Learning Frameworks
Caffe
› Available since Sept. 2013, 6.3k forks
› Popular in vision community & Yahoo
TensorFlow
› Released in Nov. 2015, 9.8k forks
Theano, Torch, DL4J, etc.
13. CaffeOnSpark Open Sourced
github.com/yahoo/CaffeOnSpark
• Released in Feb. 2016
• Apache 2.0 license
• Distributed deep learning
– GPU or CPU
– Ethernet or InfiniBand
• Easily deployed on public cloud or private cloud
18. CaffeOnSpark: One Program (Scala)
http://bit.ly/21ZY1c2
cos = new CaffeOnSpark(ctx)
conf = new Config(ctx, args).init()
// (1) train DL model (deep learning)
dl_train_source = DataSource.getSource(conf, true)
cos.train(dl_train_source)
// (2) extract features via DL
lr_raw_source = DataSource.getSource(conf, false)
ext_df = cos.features(lr_raw_source)
// (3) apply ML (non-deep learning)
lr_input_df = ext_df.withColumn("L", cos.floats2doubleUDF(ext_df(conf.label)))
  .withColumn("F", cos.floats2doublesUDF(ext_df(conf.features(0))))
lr = new LogisticRegression().setLabelCol("L").setFeaturesCol("F")
lr_model = lr.fit(lr_input_df)
21. Demo: CaffeOnSpark on EC2
https://github.com/yahoo/CaffeOnSpark/wiki
› Get started on EC2
› Python for CaffeOnSpark
22. CaffeOnSpark: What’s Next?
Validation within training
Enhanced data layer
RNN and LSTM
Java API
Asynchronous distributed training
23. Related Work: SparkNet & DL4J
1) [driver] sc.broadcast(model) to executors
2) [executor] apply DL training against a mini-batch of the dataset to update models locally
3) [driver] aggregate(models) to produce a new model
REPEAT
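The three steps above can be sketched as a model-averaging loop. This is a minimal single-process stand-in: `localTrain` and `aggregate` are hypothetical helper names, the partitions are made-up toy data fitting y = 2x, and simple function calls stand in for the real `sc.broadcast`/aggregate over a cluster.

```scala
object ParamAveraging {
  // SparkNet/DL4J-style loop: broadcast the model, train locally on each
  // partition's mini-batch, then average the locally updated models.
  def localTrain(model: Array[Double], data: Seq[(Double, Double)], lr: Double): Array[Double] = {
    val w = model.clone()
    for ((x, t) <- data) {
      val grad = (w(0) * x - t) * x   // squared-loss gradient for y = w*x
      w(0) -= lr * grad
    }
    w
  }

  // [driver] aggregate(models): element-wise mean of the local models
  def aggregate(models: Seq[Array[Double]]): Array[Double] =
    Array(models.map(_(0)).sum / models.size)

  def main(args: Array[String]): Unit = {
    var global = Array(0.0)
    val partitions = Seq(Seq((1.0, 2.0), (2.0, 4.0)), Seq((3.0, 6.0)))
    for (_ <- 1 to 50) {                                                 // REPEAT
      val locals = partitions.map(p => localTrain(global, p, lr = 0.05)) // broadcast + local training
      global = aggregate(locals)                                         // driver aggregates
    }
    println(global(0))  // approaches 2.0, the slope of the toy data
  }
}
```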
24. Summary
24
Yahoo Hadoop clusters enhanced for deep learning
› GPU nodes + CPU nodes
› InfiniBand network for fast communication
CaffeOnSpark open sourced
› Empower Flickr and other Yahoo services
• In production since Q3 2015
• Reduced training latency, and improved accuracy
› Scalable deep learning made easy
• spark-submit on your Spark cluster
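A launch command along these lines submits a CaffeOnSpark job like any Spark application. This is an illustrative sketch adapted from the style of the CaffeOnSpark wiki's getting-started examples; the jar path, prototxt file names, executor counts, and HDFS paths are assumptions to adjust for your cluster.

```shell
# Hypothetical sketch of launching CaffeOnSpark training on YARN;
# paths, solver/net prototxt files, and resources are illustrative.
spark-submit --master yarn --deploy-mode cluster \
    --num-executors 2 \
    --files lenet_memory_solver.prototxt,lenet_memory_train_test.prototxt \
    --class com.yahoo.ml.caffe.CaffeOnSpark \
    caffe-grid/target/caffe-grid-0.1-SNAPSHOT-jar-with-dependencies.jar \
    -train \
    -conf lenet_memory_solver.prototxt \
    -devices 1 \
    -connection ethernet \
    -model hdfs:///mnist.model
```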
In 2013, I talked about Yahoo’s adoption of Storm for low-latency processing.
Last year, I described Yahoo’s effort to bring Spark onto YARN clusters.
Today, I will share our progress on machine learning using YARN clusters.
I will cover 3 areas:
WHY does Yahoo apply machine learning
WHAT challenges we try to address
HOW we address them
I will wrap up the talk with key lessons learned from our experience.
At last year’s Hadoop Summit, we discussed how Hadoop clusters have become the preferred platform for large-scale machine learning at Yahoo. Recently, we introduced distributed deep learning as a new capability of Hadoop clusters. These new clusters augment our existing CPU nodes and Ethernet connectivity with GPU nodes and InfiniBand connectivity. We developed a distributed deep learning solution, CaffeOnSpark, based on Apache Spark and Caffe from UC Berkeley. CaffeOnSpark enables deep learning tasks to be launched via the spark-submit command, as in any Spark application. Given a partition of HDFS-based training data, each Spark executor launches Caffe-based training threads to train deep neural network models. After back-propagation processing of a batch of training examples, CaffeOnSpark training threads exchange the gradients of model parameters across all GPUs on multiple servers. In this talk, we will provide a technical overview of CaffeOnSpark, and explain how CaffeOnSpark conducts deep learning in a private cloud or a public cloud (such as AWS EC2). We will share our experience at Yahoo through use cases (including photo auto-tagging), and discuss areas of collaboration with open source communities for Hadoop-based deep learning.
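The gradient-exchange step described above can be sketched as synchronous averaging across worker replicas: after each mini-batch, every worker contributes its local gradient and applies the same mean update. This is a minimal single-process stand-in with made-up gradient values; real CaffeOnSpark exchanges gradients among GPUs over InfiniBand or Ethernet.

```scala
object GradientSync {
  // Synchronous exchange: average all workers' local gradients, so every
  // replica applies the same update and model copies stay identical.
  def averageGradients(workerGrads: Seq[Array[Double]]): Array[Double] = {
    val n = workerGrads.head.length
    val sum = new Array[Double](n)
    for (g <- workerGrads; i <- 0 until n) sum(i) += g(i)
    sum.map(_ / workerGrads.size)
  }

  def main(args: Array[String]): Unit = {
    // Local gradients computed by three workers on their own mini-batches
    val grads = Seq(Array(0.2, -0.4), Array(0.4, 0.0), Array(0.0, -0.2))
    val avg = averageGradients(grads)
    val lr = 0.1
    val weights = Array(1.0, 1.0).zip(avg).map { case (w, g) => w - lr * g }
    println(weights.mkString(","))  // same update applied on every replica
  }
}
```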
Deep learning is a branch of machine learning, which is itself a branch of artificial intelligence.
It attempts to model high-level abstractions in data by using multiple processing layers.
A deep neural network has multiple hidden layers of units between the input and output layers.
* ImageNet competition 2014 … GoogLeNet w/ 22 layers.
* ILSVRC competition 2015 … Microsoft w/ 152 layers.
Many of these deep networks have millions or even billions of parameters.
To learn these parameters from data, we go through many iterations of forward prediction and back propagation over these networks.
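The iterate-forward-then-back-propagate loop can be illustrated on the smallest possible network, a single linear unit with squared loss. This is a hedged toy sketch, not CaffeOnSpark code; the target value and learning rate are made-up.

```scala
object OneStep {
  // One forward/backward iteration for a single linear unit y = w*x
  // with squared loss L = (y - t)^2 / 2, so dL/dw = (y - t) * x.
  def step(w: Double, x: Double, t: Double, lr: Double): Double = {
    val y = w * x            // forward prediction
    val grad = (y - t) * x   // back-propagated gradient
    w - lr * grad            // parameter update
  }

  def main(args: Array[String]): Unit = {
    var w = 0.0
    // Many iterations of forward prediction + back propagation,
    // as in deep network training (here on one toy example).
    for (_ <- 1 to 100) w = step(w, x = 2.0, t = 4.0, lr = 0.05)
    println(w)  // converges toward 2.0, where w*x matches the target
  }
}
```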
We released the magic view as part of the Flickr 4.0 release last April, and this is the most visible user-facing feature that exposes our image recognition capabilities. Our users can switch from the traditional timeline view of their photos to an experience where their photos are arranged according to 70 categories. For example, you can see here that landscape photos are sub-categorized into different types such as mountain, rock, or shore.
This is a great feature for serendipitous photo discovery. Most of us have thousands of photos that we don’t get to see very often but are emotionally very attached to, and these types of groupings help us re-discover photos.
To enable approximate computing, we are building machine learning on top of Hadoop, Spark, and our machine learning servers.
These servers are a YARN application, specifically designed for machine learning.
All data are stored in memory with customized stores. These stores enable lockless concurrency, and can handle millions of operations per second.
Our servers are implemented in Java, but create zero garbage. This enables us to run training consistently with high throughput, without worrying about garbage collection.
Our API supports asynchronous machine learning and mini-batches. This ensures very fast training by many learners.
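The lockless, asynchronous update pattern can be illustrated with a minimal stand-in: many learner threads pushing updates to shared state through atomic operations, with no locks taken. This is a hedged sketch of the concurrency idea only; the real servers store model parameters, not a counter.

```scala
import java.util.concurrent.atomic.AtomicLong

object AsyncUpdates {
  // Shared state updated lock-free by concurrent learners (stand-in for
  // asynchronous parameter updates on a machine learning server).
  val updates = new AtomicLong(0)

  def run(): Long = {
    val threads = (1 to 4).map { _ =>
      new Thread(() => {
        // Each learner applies 100000 updates without acquiring any lock.
        for (_ <- 1 to 100000) updates.incrementAndGet()
      })
    }
    threads.foreach(_.start())
    threads.foreach(_.join())
    updates.get()
  }

  def main(args: Array[String]): Unit =
    println(run())  // 400000: atomic operations lose no updates
}
```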
To minimize data movement, we enable clients to move computing logic to the servers. For example, we enable MapReduce operations on the servers.
As an example, you may want to perform statistical analysis of large models using MapReduce operations.
Our servers provide built-in support for Hadoop file systems. You can store your models after each training run, and load previously trained models from HDFS.
In summary, Yahoo has made significant progress on scalable machine learning.
We conduct daily training with billions of signals for critical businesses such as search and advertising.
Hadoop and YARN are playing a central role in this evolution. On YARN clusters, we built a framework for approximate computing.
We are currently exploring both GPU and CPU in a single cluster.