This document discusses distributed deep learning on Hadoop clusters using CaffeOnSpark. CaffeOnSpark is an open-source project that allows deep learning models defined in Caffe to be trained and run on large datasets distributed across a Spark cluster. It provides a scalable architecture that can reduce training time by up to 19x compared to single-node training. CaffeOnSpark provides APIs in Scala and Python and can be easily deployed on both public and private clouds. It has been used in production at Yahoo since 2015 to power applications like Flickr and Yahoo Weather.
8. Flickr DL/ML Pipeline
(1) Prepare Datasets @ Scale
(2) Deep Learning @ Scale
(3) Non-deep Learning @ Scale
(4) Apply ML Model @ Scale
* 10 billion photos
* 7.5 million per day
* http://bit.ly/1KIDfof by Pierre Garrigues, Deep Learning Summit 2015
11. Hadoop Cluster Enhanced
GPU servers added
› 4 Tesla K80 cards
• 2 GK210 GPUs, 24 GB memory
Network interface enhanced
› InfiniBand for direct access to GPU memory
› Ethernet for external communication
12. Deep Learning Frameworks
Caffe
› Available since Sept. 2013, 6.3k forks
› Popular in vision community & Yahoo
TensorFlow
› Released in Nov. 2015, 9.8k forks
Theano, Torch, DL4J, etc.
13. CaffeOnSpark Open Sourced
github.com/yahoo/CaffeOnSpark
• Released in Feb. 2016
• Apache 2.0 license
• Distributed deep learning
– GPU or CPU
– Ethernet or InfiniBand
• Easily deployed on public cloud or private cloud
18. CaffeOnSpark: One Program (Scala)
http://bit.ly/21ZY1c2
cos = new CaffeOnSpark(ctx)
conf = new Config(ctx, args).init()
// (1) train DL model (deep learning)
dl_train_source = DataSource.getSource(conf, true)
cos.train(dl_train_source)
// (2) extract features via DL
lr_raw_source = DataSource.getSource(conf, false)
ext_df = cos.features(lr_raw_source)
// (3) apply ML (non-deep learning)
lr_input_df = ext_df.withColumn("L", cos.floats2doubleUDF(ext_df(conf.label)))
  .withColumn("F", cos.floats2doublesUDF(ext_df(conf.features(0))))
lr = new LogisticRegression().setLabelCol("L").setFeaturesCol("F")
lr_model = lr.fit(lr_input_df)
21. Demo: CaffeOnSpark on EC2
https://github.com/yahoo/CaffeOnSpark/wiki
› Get started on EC2
› Python for CaffeOnSpark
22. CaffeOnSpark: What’s Next?
Validation within training
Enhanced data layer
RNN and LSTM
Java API
Asynchronous distributed training
23. Related Work: SparkNet & DL4J
1) [driver] sc.broadcast(model) to executors
2) [executor] apply DL training against a mini-batch of the dataset to update models locally
3) [driver] aggregate(models) to produce a new model
REPEAT
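The three steps above can be sketched as a model-averaging loop. This is a minimal single-process stand-in: `localTrain` and `aggregate` are hypothetical helper names, the partitions are made-up toy data fitting y = 2x, and simple function calls stand in for the real `sc.broadcast`/aggregate over a cluster.

```scala
object ParamAveraging {
  // SparkNet/DL4J-style loop: broadcast the model, train locally on each
  // partition's mini-batch, then average the locally updated models.
  def localTrain(model: Array[Double], data: Seq[(Double, Double)], lr: Double): Array[Double] = {
    val w = model.clone()
    for ((x, t) <- data) {
      val grad = (w(0) * x - t) * x   // squared-loss gradient for y = w*x
      w(0) -= lr * grad
    }
    w
  }

  // [driver] aggregate(models): element-wise mean of the local models
  def aggregate(models: Seq[Array[Double]]): Array[Double] =
    Array(models.map(_(0)).sum / models.size)

  def main(args: Array[String]): Unit = {
    var global = Array(0.0)
    val partitions = Seq(Seq((1.0, 2.0), (2.0, 4.0)), Seq((3.0, 6.0)))
    for (_ <- 1 to 50) {                                                 // REPEAT
      val locals = partitions.map(p => localTrain(global, p, lr = 0.05)) // broadcast + local training
      global = aggregate(locals)                                         // driver aggregates
    }
    println(global(0))  // approaches 2.0, the slope of the toy data
  }
}
```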
24. Summary
24
Yahoo Hadoop clusters enhanced for deep learning
› GPU nodes + CPU nodes
› InfiniBand network for fast communication
CaffeOnSpark open sourced
› Empower Flickr and other Yahoo services
• In production since Q3 2015
• Reduced training latency, and improved accuracy
› Scalable deep learning made easy
• spark-submit on your Spark cluster
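A launch command along these lines submits a CaffeOnSpark job like any Spark application. This is an illustrative sketch adapted from the style of the CaffeOnSpark wiki's getting-started examples; the jar path, prototxt file names, executor counts, and HDFS paths are assumptions to adjust for your cluster.

```shell
# Hypothetical sketch of launching CaffeOnSpark training on YARN;
# paths, solver/net prototxt files, and resources are illustrative.
spark-submit --master yarn --deploy-mode cluster \
    --num-executors 2 \
    --files lenet_memory_solver.prototxt,lenet_memory_train_test.prototxt \
    --class com.yahoo.ml.caffe.CaffeOnSpark \
    caffe-grid/target/caffe-grid-0.1-SNAPSHOT-jar-with-dependencies.jar \
    -train \
    -conf lenet_memory_solver.prototxt \
    -devices 1 \
    -connection ethernet \
    -model hdfs:///mnist.model
```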
In 2013, I talked about Yahoo’s adoption of Storm for low-latency processing.
Last year, I described Yahoo’s effort to bring Spark onto YARN clusters.
Today, I will share our progress on machine learning using YARN clusters.
I will cover 3 areas:
WHY does Yahoo apply machine learning
WHAT challenges we try to address
HOW we address them
I will wrap up the talk with key lessons learned from our experience.
At last year’s Hadoop Summit, we discussed how Hadoop clusters have become the preferred platform for large-scale machine learning at Yahoo. Recently, we introduced distributed deep learning as a new capability of Hadoop clusters. These new clusters augment our existing CPU nodes and Ethernet connectivity with GPU nodes and InfiniBand connectivity. We developed a distributed deep learning solution, CaffeOnSpark, based on Apache Spark and Caffe from UC Berkeley. CaffeOnSpark enables deep learning tasks to be launched via the spark-submit command, as in any Spark application. Given a partition of HDFS-based training data, each Spark executor launches Caffe-based training threads to train deep neural network models. After back-propagation processing of a batch of training examples, CaffeOnSpark training threads exchange the gradients of model parameters across all GPUs on multiple servers. In this talk, we will provide a technical overview of CaffeOnSpark, and explain how CaffeOnSpark conducts deep learning in a private cloud or a public cloud (such as AWS EC2). We will share our experience at Yahoo through use cases (including photo auto-tagging), and discuss areas of collaboration with open source communities for Hadoop-based deep learning.
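The gradient-exchange step described above can be sketched as synchronous averaging across worker replicas: after each mini-batch, every worker contributes its local gradient and applies the same mean update. This is a minimal single-process stand-in with made-up gradient values; real CaffeOnSpark exchanges gradients among GPUs over InfiniBand or Ethernet.

```scala
object GradientSync {
  // Synchronous exchange: average all workers' local gradients, so every
  // replica applies the same update and model copies stay identical.
  def averageGradients(workerGrads: Seq[Array[Double]]): Array[Double] = {
    val n = workerGrads.head.length
    val sum = new Array[Double](n)
    for (g <- workerGrads; i <- 0 until n) sum(i) += g(i)
    sum.map(_ / workerGrads.size)
  }

  def main(args: Array[String]): Unit = {
    // Local gradients computed by three workers on their own mini-batches
    val grads = Seq(Array(0.2, -0.4), Array(0.4, 0.0), Array(0.0, -0.2))
    val avg = averageGradients(grads)
    val lr = 0.1
    val weights = Array(1.0, 1.0).zip(avg).map { case (w, g) => w - lr * g }
    println(weights.mkString(","))  // same update applied on every replica
  }
}
```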
Deep learning is a branch of machine learning, which is itself a branch of artificial intelligence.
It attempts to model high-level abstractions in data by using multiple processing layers.
A deep neural network has multiple hidden layers of units between the input and output layers.
* ImageNet competition 2014 … GoogLeNet w/ 22 layers.
* ILSVRC competition 2015 … Microsoft w/ 152 layers.
Many of these deep networks have millions or even billions of parameters.
To learn these parameters from data, we go through many iterations of forward prediction and back propagation over these networks.
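The iterate-forward-then-back-propagate loop can be illustrated on the smallest possible network, a single linear unit with squared loss. This is a hedged toy sketch, not CaffeOnSpark code; the target value and learning rate are made-up.

```scala
object OneStep {
  // One forward/backward iteration for a single linear unit y = w*x
  // with squared loss L = (y - t)^2 / 2, so dL/dw = (y - t) * x.
  def step(w: Double, x: Double, t: Double, lr: Double): Double = {
    val y = w * x            // forward prediction
    val grad = (y - t) * x   // back-propagated gradient
    w - lr * grad            // parameter update
  }

  def main(args: Array[String]): Unit = {
    var w = 0.0
    // Many iterations of forward prediction + back propagation,
    // as in deep network training (here on one toy example).
    for (_ <- 1 to 100) w = step(w, x = 2.0, t = 4.0, lr = 0.05)
    println(w)  // converges toward 2.0, where w*x matches the target
  }
}
```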
We released the magic view as part of the Flickr 4.0 release last April, and this is the most visible user-facing feature that exposes our image recognition capabilities. Our users can switch from the traditional timeline view of their photos to an experience where their photos are arranged according to 70 categories. For example, you can see here that landscape photos are sub-categorized into different types such as mountain, rock, or shore.
This is a great feature for serendipitous photo discovery. Most of us have thousands of photos that we don’t get to see very often but are emotionally very attached to, and these types of groupings help us re-discover photos.
To enable approximate computing, we are building machine learning on top of Hadoop, Spark, and our machine learning servers.
These servers are a YARN application, specifically designed for machine learning.
All data are stored in memory with customized stores. These stores enable lockless concurrency, and can handle millions of operations per second.
Our servers are implemented in Java, but create zero garbage. This enables us to run training consistently with high throughput, without worrying about garbage collection.
Our API supports asynchronous machine learning and mini-batches. This ensures very fast training by many learners.
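The lockless, asynchronous update pattern can be illustrated with a minimal stand-in: many learner threads pushing updates to shared state through atomic operations, with no locks taken. This is a hedged sketch of the concurrency idea only; the real servers store model parameters, not a counter.

```scala
import java.util.concurrent.atomic.AtomicLong

object AsyncUpdates {
  // Shared state updated lock-free by concurrent learners (stand-in for
  // asynchronous parameter updates on a machine learning server).
  val updates = new AtomicLong(0)

  def run(): Long = {
    val threads = (1 to 4).map { _ =>
      new Thread(() => {
        // Each learner applies 100000 updates without acquiring any lock.
        for (_ <- 1 to 100000) updates.incrementAndGet()
      })
    }
    threads.foreach(_.start())
    threads.foreach(_.join())
    updates.get()
  }

  def main(args: Array[String]): Unit =
    println(run())  // 400000: atomic operations lose no updates
}
```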
To minimize data movement, we enable clients to move computing logic to the servers. For example, we enable MapReduce operations on the servers.
As an example, you may want to perform statistical analysis of large models using MapReduce operations.
Our servers provide built-in support for Hadoop file systems. You can store your models after each training run, and load previously trained models from HDFS.
In summary, Yahoo has made significant progress on scalable machine learning.
We conduct daily training with billions of signals for critical businesses such as search and advertising.
Hadoop and YARN are playing a central role in this evolution. On YARN clusters, we built a framework for approximate computing.
We are currently exploring both GPU and CPU in a single cluster.