In recent releases, TensorFlow has been enhanced for distributed learning and HDFS access. Outside of the Google cloud, however, users still needed a dedicated cluster for TensorFlow applications. There are several community projects wiring TensorFlow onto Apache Spark clusters. Unfortunately, they are limited to synchronous distributed learning and don’t allow TensorFlow servers to communicate with each other directly.
In this talk, we will introduce a new framework, TensorFlowOnSpark, for scalable TensorFlow learning, which will be open sourced in Q1 2017. This new framework enables easy experimentation with algorithm designs, and supports scalable training & inference on Spark clusters. It supports all TensorFlow functionality, including synchronous & asynchronous learning, model & data parallelism, and TensorBoard. It provides architectural flexibility for data ingestion into TensorFlow and for the network protocols used in server-to-server communication. With just a few code changes, an existing TensorFlow algorithm can be transformed into a scalable application.
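As a hedged sketch of those "few code changes" (the `TFCluster` API and its parameters are assumed from the released project, not guaranteed here): the existing TF training code moves into a function that receives a per-executor context, and a short launcher reserves Spark executors for the TF cluster.

```python
# Hedged sketch: wrapping an existing TensorFlow training script for
# TensorFlowOnSpark. TFCluster and the ctx fields are assumed names.

def train_fun(argv, ctx):
    # ctx is supplied by TensorFlowOnSpark on each Spark executor; it carries
    # this node's role and index so the TF code can join the cluster.
    # The original single-node TF training code would go here, largely unchanged.
    role = "parameter server" if ctx.job_name == "ps" else "worker"
    return "%s %d of cluster" % (role, ctx.task_index)

def launch(sc):
    """Launch a TFoS cluster on an existing SparkContext (sketch; not run here)."""
    from tensorflowonspark import TFCluster
    # Reserve 4 executors: 1 parameter server + 3 workers, with TensorBoard on.
    cluster = TFCluster.run(sc, train_fun, None, num_executors=4, num_ps=1,
                            tensorboard=True,
                            input_mode=TFCluster.InputMode.TENSORFLOW)
    cluster.shutdown()
```

The training function itself stays ordinary Python, which is what keeps the conversion cost to a few lines.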
TensorFlowOnSpark Scales TensorFlow Apps to Spark Clusters
1. TensorFlowOnSpark
Scalable TensorFlow Learning on Spark Clusters
Lee Yang, Andrew Feng
Yahoo Big Data ML Platform Team
3. Why TensorFlowOnSpark at Yahoo?
▪ Major contributor to open-source Hadoop ecosystem
• Originators of Hadoop (2006)
• An early adopter of Spark (since 2013)
• Open-sourced CaffeOnSpark (2016)
• CaffeOnSpark Update: Recent Enhancements and Use Cases
• Wednesday @ 12:20pm by Mridul Jain & Jun Shi
▪ Large investment in production clusters
• Tens of clusters
• Thousands of nodes per cluster
▪ Massive amounts of data
• Petabytes of data
8. RDMA Speedup over gRPC
http://www.mellanox.com/solutions/machine-learning/tensorflow.php
10. TensorFlowOnSpark Design Goals
▪ Scale up existing TF apps with minimal changes
▪ Support all current TensorFlow functionality
• Synchronous/asynchronous training
• Model/data parallelism
• TensorBoard
▪ Integrate with existing HDFS data pipelines and ML algorithms
• e.g. Hive, Spark, MLlib
11. TensorFlowOnSpark
▪ PySpark wrapper of TF app code
▪ Launches distributed TF clusters on Spark executors
▪ Supports TF data ingestion modes
• feed_dict – RDD.mapPartitions()
• queue_runner – direct HDFS access from TF
▪ Supports TensorBoard during/after training
▪ Generally agnostic to Spark/TF versions
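The feed_dict mode above streams RDD partitions into the TF graph via `RDD.mapPartitions()`. A minimal sketch of the batching this implies (the helper name is ours, not part of the API):

```python
def to_batches(partition, batch_size):
    """Group rows from one Spark partition into fixed-size batches that a
    TF worker can consume via feed_dict (the last batch may be smaller)."""
    batch = []
    for row in partition:
        batch.append(row)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch

# In InputMode.SPARK, the wrapper effectively runs something like:
#   rdd.mapPartitions(lambda part: feed_to_tf(to_batches(part, 64)))
# where feed_to_tf is a stand-in for the code pushing batches to the TF worker.
# In InputMode.TENSORFLOW, the TF queue_runner instead reads HDFS files directly.
```

The two modes trade off convenience (reuse of existing Spark pipelines) against throughput (direct HDFS reads bypass the Python feeding path).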
20. Failure Recovery
▪ TF Checkpoints written to HDFS
▪ InputMode.SPARK
• TF worker runs in background
• RDD data feeding tasks can be retried
• However, TF worker failures will be “hidden” from Spark
▪ InputMode.TENSORFLOW
• TF worker runs in foreground
• TF worker failures will be retried as Spark task
• TF worker restores from checkpoint
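A sketch of how checkpoint-based recovery fits together; the HDFS path and helper names below are illustrative, not part of the TensorFlowOnSpark API.

```python
# Hedged sketch of checkpoint recovery for a retried TF worker task.

def pick_latest_checkpoint(paths):
    """Simplified stand-in for tf.train.latest_checkpoint: choose the
    checkpoint file with the highest global step from an HDFS listing."""
    if not paths:
        return None
    return max(paths, key=lambda p: int(p.rsplit("-", 1)[-1]))

def managed_session(server, is_chief):
    """Open a session that auto-restores from the latest checkpoint in an
    HDFS logdir, so a retried Spark task resumes training rather than
    starting over. (Not executed here; assumes TF 1.x with HDFS support,
    and the logdir path is a placeholder.)"""
    import tensorflow as tf
    sv = tf.train.Supervisor(is_chief=is_chief,
                             logdir="hdfs://namenode/path/to/checkpoints")
    return sv.managed_session(server.target)
```

Writing checkpoints to HDFS rather than local disk is what makes recovery possible at all: the replacement task may land on a different node.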
21. Failure Recovery
▪ Executor failures are problematic
• e.g. pre-emption
• TF cluster_spec is statically-defined at startup
• YARN does not re-allocate containers on the same node
• Even if it did, the port may no longer be available
▪ Need dynamic cluster membership
• Exploring options w/ TensorFlow team
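To make the static-membership problem concrete, here is a minimal sketch of how a cluster_spec is assembled once from executor addresses gathered at startup (the helper name is ours; the dict shape matches TF's ClusterSpec input):

```python
def build_cluster_spec(hosts, num_ps):
    """Build the static TF cluster_spec from executor host:port pairs
    collected at startup. This mapping is fixed for the life of the job,
    which is why a lost executor cannot simply rejoin under a new address:
    every server already holds the old spec."""
    return {"ps": hosts[:num_ps], "worker": hosts[num_ps:]}

# Each executor would then start its TF server from this spec, e.g.
#   tf.train.Server(tf.train.ClusterSpec(spec), job_name=..., task_index=...)
```

Dynamic cluster membership would require servers to accept spec updates at runtime, which is why the fix depends on upstream TensorFlow changes.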