Deep Learning on Apache Spark has the potential for huge impact in research and industry. This talk will describe best practices for building deep learning pipelines with Spark.
Rather than comparing deep learning systems or specific optimizations, this talk will focus on issues that are common to many deep learning frameworks when running on a Spark cluster: optimizing cluster setup and data ingest, tuning the cluster, and monitoring long-running jobs. We will demonstrate the techniques we cover using Google’s popular TensorFlow library.
More specifically, we will cover typical issues users encounter when integrating deep learning libraries with Spark clusters. Clusters can be configured to avoid task conflicts on GPUs and to allow the use of multiple GPUs per worker. Setting up pipelines for efficient data ingest improves job throughput, and interactive monitoring helps both with configuration and with checking the stability of deep learning jobs.
Speaker: Tim Hunter
This talk was originally presented at Spark Summit East 2017.
2. About me
Software engineer at Databricks
Apache Spark contributor
Ph.D. in Machine Learning from UC Berkeley
(and Spark user since Spark 0.2)
3. Outline
• Lessons (and challenges) learned from the field
• Tuning Spark for Deep Learning and GPUs
• Loading data in Spark
• Monitoring
4. Deep Learning and Spark
• 2016: the year of emerging solutions for combining Spark and Deep Learning
– 6 talks on Deep Learning, 3 talks on GPUs
• Yet, no consensus:
– Official MLlib support is limited (perceptron-like networks)
– Each talk mentions a different framework
5. Deep Learning and Spark
• Deep learning frameworks with Spark bindings:
– Caffe (CaffeOnSpark)
– Keras (Elephas)
– mxnet
– Paddle
– TensorFlow (TensorFlow on Spark, TensorFrames)
• Running natively on Spark
– BigDL
– DeepDist
– DeepLearning4J
– MLlib
– SparkCL
– SparkNet
• Extensions to Spark for specialized hardware
– Blaze (UCLA & Falcon Computing Solutions)
– IBM Conductor with Spark
(Alphabetical order!)
6. One Framework to Rule Them All?
• Should we look for The One Deep Learning Framework?
7. Databricks’ perspective
• Databricks: a Spark vendor on top of a public cloud
• Now provides GPU instances
• Enables customers with compute-intensive workloads
• This talk:
– lessons learned when deploying GPU instances on a public cloud
– what Spark could do better when running Deep Learning applications
8. ML in a data pipeline
• ML is a small part of the full pipeline:
"Hidden Technical Debt in Machine Learning Systems", Sculley et al., NIPS 2015
9. DL in a data pipeline (1)
• Training tasks:
[Diagram: training pipeline: Data collection → ETL → Featurization → Deep Learning → Validation → Export/Serving. The IO-intensive stages run on a large cluster with a high memory/CPU ratio; the compute-intensive Deep Learning stage runs on a small cluster with a low memory/CPU ratio.]
10. DL in a data pipeline (2)
• Specialized data transform (feature extraction, …)
[Diagram: a trained network applied as a data transform: images as input, predicted labels ("cat", "dog", "dog") as output. Image credit: Saulius Garalevicius, CC BY-SA 3.0]
11. Recurring patterns
• Spark as a scheduler (see the sketch after this list)
– Embarrassingly parallel tasks
– Data stored outside Spark
• Embedded Deep Learning transforms
– Data stored in DataFrame/RDDs
– Follows the RDD guarantees
• Specialized computations
– Multiple passes over data
– Specific communication patterns
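A minimal sketch of the "Spark as a scheduler" pattern, assuming an existing SparkContext `sc` and data that lives outside Spark; the train_model function and the model paths are hypothetical placeholders:

# Hypothetical sketch: Spark only schedules independent training runs.
# The training data and the resulting models live outside Spark (e.g. on S3).
def train_model(params):
    # ... call into TensorFlow/Caffe/etc. with these hyperparameters,
    # reading the training data from shared storage ...
    # return the (hypothetical) location of the saved model
    return "s3://my-bucket/models/lr=%s" % params["lr"]

param_grid = [{"lr": lr} for lr in [0.1, 0.01, 0.001]]
model_paths = sc.parallelize(param_grid, len(param_grid)).map(train_model).collect()

Each run is independent, so failed tasks are cheap to retry and Spark is used purely as a job scheduler.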
12. Using GPUs through PySpark
• A popular choice when there are many independent tasks
• A lot of DL packages have a Python interface: TensorFlow, Theano, Caffe, mxnet, etc.
• Lifetime of Python packages: the worker process (see the sketch below)
• Requires some configuration tweaks in Spark
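A minimal sketch of that per-process lifetime, assuming TensorFlow is installed on the workers; the input DataFrame (images_df) and the scoring logic are hypothetical:

# Hypothetical sketch: apply a model to each partition from PySpark.
# Importing inside the function means the library is loaded once per
# Python worker process, not once per record.
def score_partition(rows):
    import tensorflow as tf  # loaded when the worker process first runs this
    # ... build or restore the model here, once per partition ...
    for row in rows:
        # ... run the model on this row and yield its output ...
        yield row  # placeholder

predictions = images_df.rdd.mapPartitions(score_partition)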
13. PySpark recommendation
• spark.executor.cores = 1 (see the configuration sketch below)
– Gives the DL framework full access to all the resources
– This may be important for some frameworks that attempt to optimize the processor pipelines
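As an illustration (a sketch only; the application name is made up and other settings depend on your cluster manager), the recommendation can be applied when building the Spark context:

# Sketch: a single task slot per executor, so the DL framework does not
# compete with other Spark tasks for the machine's cores or its GPU.
from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("dl-training")          # hypothetical application name
        .set("spark.executor.cores", "1"))  # one task per executor
sc = SparkContext(conf=conf)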
14. Cooperative frameworks
• Use Spark for data input
• Examples:
– IBM GPU efforts
– Skymind’s DeepLearning4J
– DistML and other Parameter Server efforts
[Diagram: an input RDD (partitions 1…n) feeds a black box (the DL framework), which produces an output RDD (partitions 1…m).]
15. Cooperative frameworks
• Bypass Spark for asynchronous / specific communication patterns across machines
• Lose the benefit of RDDs and DataFrames, and of reproducibility/determinism
• But these guarantees are usually not needed for deep learning (stochastic gradient descent)
• "Reproducibility is worth a factor of 2" (Léon Bottou, quoted by John Langford)
16. Streaming data through DL
• The general choices:
– Cold layer (HDFS/S3/etc.)
– Local storage: files, Spark's on-disk persistence layer
– In memory: Spark RDDs or Spark DataFrames
• Find out if you are I/O constrained or processor-constrained
– How big is your dataset? MNIST or ImageNet?
• If using PySpark:
– All frameworks are heavily optimized for disk I/O
– Use Spark's broadcast for small datasets that fit in memory (see the sketch after this list)
– Reading files is fast: use local files when the data does not fit in memory
• If using a binding:
– Each binding has its own data ingest layer
– In general, better performance by caching in files (same reasons)
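A minimal sketch of the broadcast approach, assuming a small dataset that fits on the driver; load_small_dataset, the training call, and param_grid are hypothetical placeholders:

# Sketch: ship a small dataset to every executor once with a broadcast
# variable, instead of re-reading it from storage in every task.
small_dataset = load_small_dataset()   # hypothetical loader, runs on the driver
data_bc = sc.broadcast(small_dataset)

def train_with_params(params):
    data = data_bc.value               # local, in-memory copy on the executor
    # ... train a model on `data` with these hyperparameters ...
    return params

results = sc.parallelize(param_grid).map(train_with_params).collect()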
17. Detour: simplifying the life of users
• Deep Learning is commonly used with GPUs
• A lot of work has gone into Spark's dependencies:
– Few dependencies on the local machine when compiling Spark
– The build process works well in a large number of configurations (just Scala + Maven)
• GPUs present challenges: CUDA, support libraries, drivers, etc.
– Deep software stack, requires careful construction (hardware + drivers + CUDA + libraries)
– Users expect all of this to just work
– Turnkey stacks are just starting to appear
18. Simplifying the life of users
• Provide a Docker image with all the GPU SDK
• Pre-install GPU drivers on the instance
[Diagram: the GPU software stack: GPU hardware; Linux kernel with the NV kernel driver; NV kernel driver (userspace interface); CUDA; CuBLAS and CuDNN; deep learning libraries (TensorFlow, etc.) and JCUDA; Python / JVM clients. A container (nvidia-docker, lxc, etc.) packages the layers above the kernel driver, while the GPU drivers are pre-installed on the instance.]
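Once the image and the drivers are in place, a quick sanity check from Python (assuming TensorFlow is installed in the image) is to list the devices the library can see:

# The output should include at least one device of type "GPU" in addition
# to the CPU if the CUDA stack is wired correctly.
from tensorflow.python.client import device_lib

print([d.name for d in device_lib.list_local_devices()])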
20. Monitoring
• How do you monitor the progress of your tasks?
• It depends on the granularity
– Around tasks
– Inside (long-running) tasks
21. Monitoring: Accumulators
• Good to check throughput or failure rate
• Works for Scala
• Limited use for Python (for now, SPARK-2868)
• No "real-time" update
batchesAcc = sc.accumulator(0)

def processBatch(i):
    global batchesAcc
    batchesAcc += 1   # updates are aggregated back on the driver
    # Process image batch here

images = sc.parallelize(…)
images.map(processBatch).collect()
22. Monitoring: external system
• Plugs into an external system
• Existing solutions: Grafana, Graphite, Prometheus, etc.
• Most flexible, but more complex to deploy (see the sketch below)
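Purely as an illustration (the host, port, and metric name are made up), a task can push a counter to a Graphite-style endpoint using its plaintext protocol:

# Hypothetical sketch: report a counter to a Graphite endpoint from a task.
# The plaintext protocol is "<metric.path> <value> <unix timestamp>\n", port 2003.
import socket
import time

def report_batches_done(count):
    msg = "dl.batches_done %d %d\n" % (count, int(time.time()))
    conn = socket.create_connection(("graphite.example.internal", 2003), timeout=1)
    try:
        conn.sendall(msg.encode("ascii"))
    finally:
        conn.close()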
23. Conclusion
• Distributed deep learning: an exciting and fast-moving space
• Most insights are specific to a task, a dataset and an algorithm: nothing replaces experiments
• Easy to get started with parallel jobs
• Move up in complexity only when you have enough data and not enough patience
24. Challenges to address
• For Spark developers:
– Monitoring long-running tasks
– Presenting and introspecting intermediate results
• For DL developers:
– What boundary to put between the algorithm and Spark?
– How to integrate with Spark at a low level?
25. Thank You.
Check our blog post on Deep Learning and GPUs!
https://docs.databricks.com/applications/deep-learning/index.html
Office hours: today 3:50 at the Databricks booth
Editor's Notes
Hello everyone, welcome to this session. My name is Timothy Hunter and it is my pleasure to talk to you about deep learning with Spark.
Before we start, a couple of house rules: I hope to have an interactive discussion with you. Feel free to interrupt and raise your hands if you want to share a comment or a question.
But first, let me introduce myself briefly. I am a software engineer at databricks, and I have been writing machine learning algorithms with Spark since version 0.2.
I expect everyone here to have heard about deep learning. These are recent techniques for training complex models that look like neural networks. They are at the origin of a lot of recent advances in computer vision, speech recognition, even gaming and robotics. If you read the mainstream press, you may even believe they have some supernatural powers and will soon replace all of us in the room.
But I want to talk today about the challenges one encounters when trying to use them in the field.
what are the considerations when evaluating a DL framework, or integrating with Spark?
There is performance of course, but in practice, there are many other aspects to consider.
The most important is that deep learning or machine learning does not work in isolation. It needs to cooperate with a large number of other services in a production environment.
(usually a compute-intensive one)
What sort of tasks do you expect to see deep learning in? The first one is TRAINING.
We want to build a network that solves a specific task. One of the most famous tasks is for example building a network that assigns labels to images.
How does this fit into a full pipeline? Usually, it is a very specific part of a much larger data pipeline.
This pipeline typically starts by ingesting data from multiple data sources, etc.
One of the most remarkable characteristics is that deep learning has very different resource requirements compared to the other tasks.
The second typical usage is to transform some data
The data has already been collected, and it may be very massive. A trained neural network is then applied, for example to build a new representation of the data, or to augment it with new information such as labels, etc.
In that case, each row of data can be processed independently, and the issue is how to apply this transform at large scale and high speed.
Depending on the data and the operational intensity of the transform, it is not clear if the limiting factor is going to be the IO (ingesting the data) or the compute power (the transform itself).
How do you integrate Spark with deep learning frameworks?
At Databricks, we see a couple of patterns emerging.
The simplest: just use Spark to schedule a large number of independent tasks. The data is stored outside Spark. Spark just sees that some computations are happening on each worker.
The next level: transform data stored by Spark (RDD/DataFrame). For good performance, one should use a DL framework with Spark bindings.
The most complex: the transform does not follow the traditional divide-and-conquer Spark model and requires specific communication patterns.
Because the transforms are so computationally intensive, it is common to use some specialized hardware for speed. The easiest and most common solution is GPUs.
At the same time, GPUs are hard to integrate for a number of reasons.
…
Spark gives you the promise that you can write some code on your laptop, and it will work as expected on a few thousand computers sitting on a different continent. Deep learning with GPUs breaks this assumption.
This is especially a problem when working on large clusters with tens or hundreds of machines, in which there may be various hardware generations, configurations, etc.
This is why a typical solution is to package all the dependencies in a way that is independent from the underlying hardware. Virtualization is not always aware of GPU requirements, so a typical solution is to use containers.
We won't have time for a demo, but here are some pointers and highlights.