SlideShare a Scribd company logo
1 of 37
TensorFlowOnSpark
S c a l a b l e Te n s o r F l o w L e a r n i n g o n S p a r k C l u s t e r s
Lee Yang, Andr ew Feng
Yahoo Big D ata ML Platfor m Team
What is TensorFlowOnSpark?
Why TensorFlowOnSpark at Yahoo?
▪ Major contributor to open-source Hadoop ecosystem
• Originators of Hadoop (2006)
• An early adopter of Spark (since 2013)
• Open-sourced CaffeOnSpark (2016)
• CaffeOnSpark Update: Recent Enhancements and Use Cases
• Wednesday @ 12:20pm by Mridul Jain & Jun Shi
▪ Large investment in production clusters
• Tens of clusters
• Thousands of nodes per cluster
▪ Massive amounts of data
• Petabytes of data
Private ML Clusters
Machine-learning at scale
Why TensorFlowOnSpark?
Machine-learning at scale
Scaling
Near-linear
scaling
RDMA Speedup over gRPC
2.4X faster
RDMA Speedup over gRPC
http://www.mellanox.com/solutions/machine-learning/tensorflow.php
RDMA Speedup over gRPC
http://www.mellanox.com/solutions/machine-learning/tensorflow.php
TensorFlowOnSpark Design Goals
▪ Scale up existing TF apps with minimal changes
▪ Support all current TensorFlow functionality
• Synchronous/asynchronous training
• Model/data parallelism
• TensorBoard
▪ Integrate with existing HDFS data pipelines and ML
algorithms
• ex. Hive, Spark, MLlib
TensorFlowOnSpark
▪ Pyspark wrapper of TF app code
▪ Launches distributed TF clusters using Spark
executors
▪ Supports TF data ingestion modes
• feed_dict – RDD.mapPartitions()
• queue_runner – direct HDFS access from TF
▪ Supports TensorBoard during/after training
▪ Generally agnostic to Spark/TF versions
Supported Environments
▪ Python 2.7 - 3.x
▪ Spark 1.6 - 2.x
▪ TensorFlow 0.12, 1.x
▪ Hadoop 2.x
Architectural Overview
TensorFlowOnSpark Basics
1. Launch TensorFlow cluster
2. Feed data to TensorFlow app
3. Shutdown TensorFlow cluster
API Example
cluster = TFCluster.run(sc, map_fn, args, num_executors,
num_ps, tensorboard, input_mode)
cluster.train(dataRDD, num_epochs=0)
cluster.inference(dataRDD)
cluster.shutdown()
Conversion Example
# diff –w eval_image_classifier.py
20a21,27
> from pyspark.context import SparkContext
> from pyspark.conf import SparkConf
> from tensorflowonspark import TFCluster, TFNode
> import sys
>
> def main_fun(argv, ctx):
27a35,36
> sys.argv = argv
>
84,85d92
<
< def main(_):
88a96,97
> cluster_spec, server = TFNode.start_cluster_server(ctx)
>
191c200,204
< tf.app.run()
---
> sc = SparkContext(conf=SparkConf().setAppName("eval_image_classifier"))
> num_executors = int(sc._conf.get("spark.executor.instances"))
> cluster = TFCluster.run(sc, main_fun, sys.argv, num_executors, 0, False, TFCluster.InputMode.TENSORFLOW)
> cluster.shutdown()
Input Modes
▪ InputMode.SPARK
HDFS → RDD.mapPartitions → feed_dict
▪ InputMode.TENSORFLOW
TFReader + QueueRunner ← HDFS
Executor
python worker
Executor
python workerpython worker python worker
InputMode.SPARK
worker:0
queue
Executor
python workerpython worker
ps:0
worker:1
queue
RDD RDD
Executor
python worker
InputMode.TENSORFLOW
Executor
python worker
Executor
python workerpython worker
ps:0
python worker
worker:0
python worker
worker:1
HDFS
Failure Recovery
▪ TF Checkpoints written to HDFS
▪ InputMode.SPARK
• TF worker runs in background
• RDD data feeding tasks can be retried
• However, TF worker failures will be “hidden” from Spark
▪ InputMode.TENSORFLOW
• TF worker runs in foreground
• TF worker failures will be retried as Spark task
• TF worker restores from checkpoint
Failure Recovery
▪ Executor failures are problematic
• e.g. pre-emption
• TF cluster_spec is statically-defined at startup
• YARN does not re-allocate on same node
• Even if possible, port may no longer be available.
▪ Need dynamic cluster membership
• Exploring options w/ TensorFlow team
TensorBoard
TensorBoard
TensorBoard
TensorFlow App Development
Experimentation Phase
•Single-node
•Small scale data
•TensorFlow APIs
tf.Graph
tf.Session
tf.InteractiveSession
TensorFlow App Development
Scaling Phase
•Multi-node
•Medium scale data (local disk)
•Distributed TensorFlow APIs
tf.train.ClusterSpec
tf.train.Server
tf.train.Saver
tf.train.Supervisor
TensorFlow App Development
Production Phase
•Cluster deployment
•Upstream data pipeline
•Model training w/ TensorFlowOnSpark APIs
TFCluster.run
TFNode.start_cluster_server
TFCluster.shutdown
•Production inference w/ TensorFlow Serving
Example Usage
https://github.com/yahoo/TensorFlowOnSpark/tree/master/examples
Common Gotchas
▪ Single node/task per executor
▪ HDFS access (native libs/env)
▪ Why doesn’t algorithm X scale linearly?
What’s New?
▪ Community contributions
• CDH compatibility
• TFNode.DataFeed
• Bug fixes
▪ RDMA merged into TensorFlow repository
▪ Registration server
▪ Spark streaming
▪ Pip packaging
Spark Streaming
from pyspark.streaming import StreamingContext
ssc = StreamingContext(sc, 10)
images = sc.textFile(args.images).map(lambda ln: parse(ln))
stream = ssc.textFileStream(args.images)
imageRDD = stream.map(lambda ln: parse(ln))
cluster = TFCluster.run(sc, map_fun, args,…)
predictionRDD = cluster.inference(imageRDD)
predictionRDD.saveAsTextFile(args.output)
predictionRDD.saveAsTextFiles(args.output)
ssc.start()
cluster.shutdown(ssc)
Pip packaging
pip install tensorflowonspark
${SPARK_HOME}/bin/spark-submit 
--master ${MASTER} 
--py-files ${TFoS_HOME}/examples/mnist/spark/mnist_dist.py 
--archives ${TFoS_HOME}/tfspark.zip 
${TFoS_HOME}/examples/mnist/spark/mnist_spark.py 
--cluster_size ${SPARK_WORKER_INSTANCES} 
--images examples/mnist/csv/train/images 
--labels examples/mnist/csv/train/labels 
--format csv 
--mode train 
--model mnist_model
Demo
Next Steps
▪ TF/Keras Layers
▪ Failure recovery w/ dynamic cluster
management (e.g. registration server)
Summary
TFoS brings deep learning to big-data
clusters
• TensorFlow: 0.12 -1.x
• Spark: 1.6-2.x
• Cluster manager: YARN, Standalone, Mesos
• EC2 image provided
• RDMA in TensorFlow
Thanks!
And our open-source contributors!
Questions?
https://github.com/yahoo/TensorFlowOnSpark

More Related Content

What's hot

How Linux Processes Your Network Packet - Elazar Leibovich
How Linux Processes Your Network Packet - Elazar LeibovichHow Linux Processes Your Network Packet - Elazar Leibovich
How Linux Processes Your Network Packet - Elazar LeibovichDevOpsDays Tel Aviv
 
Cours d'introduction aux HTML5 & CSS3
Cours d'introduction aux HTML5 & CSS3Cours d'introduction aux HTML5 & CSS3
Cours d'introduction aux HTML5 & CSS3Abel LIFAEFI MBULA
 
Laravel 9 comprendre les fondamentaux.
Laravel 9 comprendre les fondamentaux.Laravel 9 comprendre les fondamentaux.
Laravel 9 comprendre les fondamentaux.Valdy Ngouabira
 
BigData_Chp3: Data Processing
BigData_Chp3: Data ProcessingBigData_Chp3: Data Processing
BigData_Chp3: Data ProcessingLilia Sfaxi
 
Introduction to Storm
Introduction to Storm Introduction to Storm
Introduction to Storm Chandler Huang
 
Presentation Hadoop Québec
Presentation Hadoop QuébecPresentation Hadoop Québec
Presentation Hadoop QuébecMathieu Dumoulin
 
(BDT318) How Netflix Handles Up To 8 Million Events Per Second
(BDT318) How Netflix Handles Up To 8 Million Events Per Second(BDT318) How Netflix Handles Up To 8 Million Events Per Second
(BDT318) How Netflix Handles Up To 8 Million Events Per SecondAmazon Web Services
 
Apache Kafka, Un système distribué de messagerie hautement performant
Apache Kafka, Un système distribué de messagerie hautement performantApache Kafka, Un système distribué de messagerie hautement performant
Apache Kafka, Un système distribué de messagerie hautement performantALTIC Altic
 
Présentation Talend Open Studio
Présentation Talend Open StudioPrésentation Talend Open Studio
Présentation Talend Open Studiohoracio lassey
 
Taking advantage of Prometheus relabeling
Taking advantage of Prometheus relabelingTaking advantage of Prometheus relabeling
Taking advantage of Prometheus relabelingJulien Pivotto
 
Spark Summit EU 2015: Lessons from 300+ production users
Spark Summit EU 2015: Lessons from 300+ production usersSpark Summit EU 2015: Lessons from 300+ production users
Spark Summit EU 2015: Lessons from 300+ production usersDatabricks
 
Cours java avance avancé thread arraylist
Cours java avance avancé thread arraylistCours java avance avancé thread arraylist
Cours java avance avancé thread arraylistHoussem Hamrouni
 

What's hot (20)

How Linux Processes Your Network Packet - Elazar Leibovich
How Linux Processes Your Network Packet - Elazar LeibovichHow Linux Processes Your Network Packet - Elazar Leibovich
How Linux Processes Your Network Packet - Elazar Leibovich
 
Apache flink
Apache flinkApache flink
Apache flink
 
Cours d'introduction aux HTML5 & CSS3
Cours d'introduction aux HTML5 & CSS3Cours d'introduction aux HTML5 & CSS3
Cours d'introduction aux HTML5 & CSS3
 
Druid deep dive
Druid deep diveDruid deep dive
Druid deep dive
 
Une introduction à Hive
Une introduction à HiveUne introduction à Hive
Une introduction à Hive
 
Storm
StormStorm
Storm
 
Laravel 9 comprendre les fondamentaux.
Laravel 9 comprendre les fondamentaux.Laravel 9 comprendre les fondamentaux.
Laravel 9 comprendre les fondamentaux.
 
Rapport tp2 j2ee
Rapport tp2 j2eeRapport tp2 j2ee
Rapport tp2 j2ee
 
BigData_Chp3: Data Processing
BigData_Chp3: Data ProcessingBigData_Chp3: Data Processing
BigData_Chp3: Data Processing
 
Introduction to Storm
Introduction to Storm Introduction to Storm
Introduction to Storm
 
Presentation Hadoop Québec
Presentation Hadoop QuébecPresentation Hadoop Québec
Presentation Hadoop Québec
 
(BDT318) How Netflix Handles Up To 8 Million Events Per Second
(BDT318) How Netflix Handles Up To 8 Million Events Per Second(BDT318) How Netflix Handles Up To 8 Million Events Per Second
(BDT318) How Netflix Handles Up To 8 Million Events Per Second
 
Chapitre 2 hadoop
Chapitre 2 hadoopChapitre 2 hadoop
Chapitre 2 hadoop
 
Nouveautés de java 8
Nouveautés de java 8Nouveautés de java 8
Nouveautés de java 8
 
Apache Kafka, Un système distribué de messagerie hautement performant
Apache Kafka, Un système distribué de messagerie hautement performantApache Kafka, Un système distribué de messagerie hautement performant
Apache Kafka, Un système distribué de messagerie hautement performant
 
Présentation Talend Open Studio
Présentation Talend Open StudioPrésentation Talend Open Studio
Présentation Talend Open Studio
 
Taking advantage of Prometheus relabeling
Taking advantage of Prometheus relabelingTaking advantage of Prometheus relabeling
Taking advantage of Prometheus relabeling
 
Spark Summit EU 2015: Lessons from 300+ production users
Spark Summit EU 2015: Lessons from 300+ production usersSpark Summit EU 2015: Lessons from 300+ production users
Spark Summit EU 2015: Lessons from 300+ production users
 
Cours java avance avancé thread arraylist
Cours java avance avancé thread arraylistCours java avance avancé thread arraylist
Cours java avance avancé thread arraylist
 
Une Introduction à Hadoop
Une Introduction à HadoopUne Introduction à Hadoop
Une Introduction à Hadoop
 

Similar to TensorFlowOnSpark Scales TensorFlow Apps to Spark Clusters

饿了么 TensorFlow 深度学习平台:elearn
饿了么 TensorFlow 深度学习平台:elearn饿了么 TensorFlow 深度学习平台:elearn
饿了么 TensorFlow 深度学习平台:elearnJiang Jun
 
Tallinn Estonia Advanced Java Meetup Spark + TensorFlow = TensorFrames Oct 24...
Tallinn Estonia Advanced Java Meetup Spark + TensorFlow = TensorFrames Oct 24...Tallinn Estonia Advanced Java Meetup Spark + TensorFlow = TensorFrames Oct 24...
Tallinn Estonia Advanced Java Meetup Spark + TensorFlow = TensorFrames Oct 24...Chris Fregly
 
Optimize + Deploy Distributed Tensorflow, Spark, and Scikit-Learn Models on G...
Optimize + Deploy Distributed Tensorflow, Spark, and Scikit-Learn Models on G...Optimize + Deploy Distributed Tensorflow, Spark, and Scikit-Learn Models on G...
Optimize + Deploy Distributed Tensorflow, Spark, and Scikit-Learn Models on G...Chris Fregly
 
Atlanta Hadoop Users Meetup 09 21 2016
Atlanta Hadoop Users Meetup 09 21 2016Atlanta Hadoop Users Meetup 09 21 2016
Atlanta Hadoop Users Meetup 09 21 2016Chris Fregly
 
Deep Learning with Spark and GPUs
Deep Learning with Spark and GPUsDeep Learning with Spark and GPUs
Deep Learning with Spark and GPUsDataWorks Summit
 
Guglielmo iozzia - Google I/O extended dublin 2018
Guglielmo iozzia - Google  I/O extended dublin 2018Guglielmo iozzia - Google  I/O extended dublin 2018
Guglielmo iozzia - Google I/O extended dublin 2018Guglielmo Iozzia
 
Optimizing, Profiling, and Deploying TensorFlow AI Models in Production with ...
Optimizing, Profiling, and Deploying TensorFlow AI Models in Production with ...Optimizing, Profiling, and Deploying TensorFlow AI Models in Production with ...
Optimizing, Profiling, and Deploying TensorFlow AI Models in Production with ...Chris Fregly
 
Deep Learning with Apache Spark and GPUs with Pierce Spitler
Deep Learning with Apache Spark and GPUs with Pierce SpitlerDeep Learning with Apache Spark and GPUs with Pierce Spitler
Deep Learning with Apache Spark and GPUs with Pierce SpitlerDatabricks
 
High Performance Distributed TensorFlow in Production with GPUs - NIPS 2017 -...
High Performance Distributed TensorFlow in Production with GPUs - NIPS 2017 -...High Performance Distributed TensorFlow in Production with GPUs - NIPS 2017 -...
High Performance Distributed TensorFlow in Production with GPUs - NIPS 2017 -...Chris Fregly
 
Suneel Marthi - Deep Learning with Apache Flink and DL4J
Suneel Marthi - Deep Learning with Apache Flink and DL4JSuneel Marthi - Deep Learning with Apache Flink and DL4J
Suneel Marthi - Deep Learning with Apache Flink and DL4JFlink Forward
 
Scaling TensorFlow with Hops, Global AI Conference Santa Clara
Scaling TensorFlow with Hops, Global AI Conference Santa ClaraScaling TensorFlow with Hops, Global AI Conference Santa Clara
Scaling TensorFlow with Hops, Global AI Conference Santa ClaraJim Dowling
 
Optimizing, profiling and deploying high performance Spark ML and TensorFlow ...
Optimizing, profiling and deploying high performance Spark ML and TensorFlow ...Optimizing, profiling and deploying high performance Spark ML and TensorFlow ...
Optimizing, profiling and deploying high performance Spark ML and TensorFlow ...DataWorks Summit
 
High Performance Distributed TensorFlow with GPUs - NYC Workshop - July 9 2017
High Performance Distributed TensorFlow with GPUs - NYC Workshop - July 9 2017High Performance Distributed TensorFlow with GPUs - NYC Workshop - July 9 2017
High Performance Distributed TensorFlow with GPUs - NYC Workshop - July 9 2017Chris Fregly
 
Build Large-Scale Data Analytics and AI Pipeline Using RayDP
Build Large-Scale Data Analytics and AI Pipeline Using RayDPBuild Large-Scale Data Analytics and AI Pipeline Using RayDP
Build Large-Scale Data Analytics and AI Pipeline Using RayDPDatabricks
 
Hopsworks at Google AI Huddle, Sunnyvale
Hopsworks at Google AI Huddle, SunnyvaleHopsworks at Google AI Huddle, Sunnyvale
Hopsworks at Google AI Huddle, SunnyvaleJim Dowling
 
Apache spark-melbourne-april-2015-meetup
Apache spark-melbourne-april-2015-meetupApache spark-melbourne-april-2015-meetup
Apache spark-melbourne-april-2015-meetupNed Shawa
 
Hands-on Learning with KubeFlow + Keras/TensorFlow 2.0 + TF Extended (TFX) + ...
Hands-on Learning with KubeFlow + Keras/TensorFlow 2.0 + TF Extended (TFX) + ...Hands-on Learning with KubeFlow + Keras/TensorFlow 2.0 + TF Extended (TFX) + ...
Hands-on Learning with KubeFlow + Keras/TensorFlow 2.0 + TF Extended (TFX) + ...Chris Fregly
 
End-to-End Deep Learning with Horovod on Apache Spark
End-to-End Deep Learning with Horovod on Apache SparkEnd-to-End Deep Learning with Horovod on Apache Spark
End-to-End Deep Learning with Horovod on Apache SparkDatabricks
 

Similar to TensorFlowOnSpark Scales TensorFlow Apps to Spark Clusters (20)

饿了么 TensorFlow 深度学习平台:elearn
饿了么 TensorFlow 深度学习平台:elearn饿了么 TensorFlow 深度学习平台:elearn
饿了么 TensorFlow 深度学习平台:elearn
 
Tallinn Estonia Advanced Java Meetup Spark + TensorFlow = TensorFrames Oct 24...
Tallinn Estonia Advanced Java Meetup Spark + TensorFlow = TensorFrames Oct 24...Tallinn Estonia Advanced Java Meetup Spark + TensorFlow = TensorFrames Oct 24...
Tallinn Estonia Advanced Java Meetup Spark + TensorFlow = TensorFrames Oct 24...
 
Optimize + Deploy Distributed Tensorflow, Spark, and Scikit-Learn Models on G...
Optimize + Deploy Distributed Tensorflow, Spark, and Scikit-Learn Models on G...Optimize + Deploy Distributed Tensorflow, Spark, and Scikit-Learn Models on G...
Optimize + Deploy Distributed Tensorflow, Spark, and Scikit-Learn Models on G...
 
Atlanta Hadoop Users Meetup 09 21 2016
Atlanta Hadoop Users Meetup 09 21 2016Atlanta Hadoop Users Meetup 09 21 2016
Atlanta Hadoop Users Meetup 09 21 2016
 
Deep Learning with Spark and GPUs
Deep Learning with Spark and GPUsDeep Learning with Spark and GPUs
Deep Learning with Spark and GPUs
 
Meetup tensorframes
Meetup tensorframesMeetup tensorframes
Meetup tensorframes
 
Guglielmo iozzia - Google I/O extended dublin 2018
Guglielmo iozzia - Google  I/O extended dublin 2018Guglielmo iozzia - Google  I/O extended dublin 2018
Guglielmo iozzia - Google I/O extended dublin 2018
 
Optimizing, Profiling, and Deploying TensorFlow AI Models in Production with ...
Optimizing, Profiling, and Deploying TensorFlow AI Models in Production with ...Optimizing, Profiling, and Deploying TensorFlow AI Models in Production with ...
Optimizing, Profiling, and Deploying TensorFlow AI Models in Production with ...
 
Deep Learning with Apache Spark and GPUs with Pierce Spitler
Deep Learning with Apache Spark and GPUs with Pierce SpitlerDeep Learning with Apache Spark and GPUs with Pierce Spitler
Deep Learning with Apache Spark and GPUs with Pierce Spitler
 
High Performance Distributed TensorFlow in Production with GPUs - NIPS 2017 -...
High Performance Distributed TensorFlow in Production with GPUs - NIPS 2017 -...High Performance Distributed TensorFlow in Production with GPUs - NIPS 2017 -...
High Performance Distributed TensorFlow in Production with GPUs - NIPS 2017 -...
 
Terraform training 🎒 - Basic
Terraform training 🎒 - BasicTerraform training 🎒 - Basic
Terraform training 🎒 - Basic
 
Suneel Marthi - Deep Learning with Apache Flink and DL4J
Suneel Marthi - Deep Learning with Apache Flink and DL4JSuneel Marthi - Deep Learning with Apache Flink and DL4J
Suneel Marthi - Deep Learning with Apache Flink and DL4J
 
Scaling TensorFlow with Hops, Global AI Conference Santa Clara
Scaling TensorFlow with Hops, Global AI Conference Santa ClaraScaling TensorFlow with Hops, Global AI Conference Santa Clara
Scaling TensorFlow with Hops, Global AI Conference Santa Clara
 
Optimizing, profiling and deploying high performance Spark ML and TensorFlow ...
Optimizing, profiling and deploying high performance Spark ML and TensorFlow ...Optimizing, profiling and deploying high performance Spark ML and TensorFlow ...
Optimizing, profiling and deploying high performance Spark ML and TensorFlow ...
 
High Performance Distributed TensorFlow with GPUs - NYC Workshop - July 9 2017
High Performance Distributed TensorFlow with GPUs - NYC Workshop - July 9 2017High Performance Distributed TensorFlow with GPUs - NYC Workshop - July 9 2017
High Performance Distributed TensorFlow with GPUs - NYC Workshop - July 9 2017
 
Build Large-Scale Data Analytics and AI Pipeline Using RayDP
Build Large-Scale Data Analytics and AI Pipeline Using RayDPBuild Large-Scale Data Analytics and AI Pipeline Using RayDP
Build Large-Scale Data Analytics and AI Pipeline Using RayDP
 
Hopsworks at Google AI Huddle, Sunnyvale
Hopsworks at Google AI Huddle, SunnyvaleHopsworks at Google AI Huddle, Sunnyvale
Hopsworks at Google AI Huddle, Sunnyvale
 
Apache spark-melbourne-april-2015-meetup
Apache spark-melbourne-april-2015-meetupApache spark-melbourne-april-2015-meetup
Apache spark-melbourne-april-2015-meetup
 
Hands-on Learning with KubeFlow + Keras/TensorFlow 2.0 + TF Extended (TFX) + ...
Hands-on Learning with KubeFlow + Keras/TensorFlow 2.0 + TF Extended (TFX) + ...Hands-on Learning with KubeFlow + Keras/TensorFlow 2.0 + TF Extended (TFX) + ...
Hands-on Learning with KubeFlow + Keras/TensorFlow 2.0 + TF Extended (TFX) + ...
 
End-to-End Deep Learning with Horovod on Apache Spark
End-to-End Deep Learning with Horovod on Apache SparkEnd-to-End Deep Learning with Horovod on Apache Spark
End-to-End Deep Learning with Horovod on Apache Spark
 

More from DataWorks Summit

Floating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisFloating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisDataWorks Summit
 
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiTracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiDataWorks Summit
 
HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...DataWorks Summit
 
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...DataWorks Summit
 
Managing the Dewey Decimal System
Managing the Dewey Decimal SystemManaging the Dewey Decimal System
Managing the Dewey Decimal SystemDataWorks Summit
 
Practical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExamplePractical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExampleDataWorks Summit
 
HBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberHBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberDataWorks Summit
 
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixScaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixDataWorks Summit
 
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiBuilding the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiDataWorks Summit
 
Supporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsSupporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsDataWorks Summit
 
Security Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureSecurity Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureDataWorks Summit
 
Presto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EnginePresto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EngineDataWorks Summit
 
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...DataWorks Summit
 
Extending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudExtending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudDataWorks Summit
 
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiEvent-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiDataWorks Summit
 
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerSecuring Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerDataWorks Summit
 
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...DataWorks Summit
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouDataWorks Summit
 
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkBig Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkDataWorks Summit
 

More from DataWorks Summit (20)

Data Science Crash Course
Data Science Crash CourseData Science Crash Course
Data Science Crash Course
 
Floating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisFloating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache Ratis
 
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiTracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
 
HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...
 
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
 
Managing the Dewey Decimal System
Managing the Dewey Decimal SystemManaging the Dewey Decimal System
Managing the Dewey Decimal System
 
Practical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExamplePractical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist Example
 
HBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberHBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at Uber
 
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixScaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
 
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiBuilding the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
 
Supporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsSupporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability Improvements
 
Security Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureSecurity Framework for Multitenant Architecture
Security Framework for Multitenant Architecture
 
Presto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EnginePresto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything Engine
 
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
 
Extending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudExtending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google Cloud
 
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiEvent-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
 
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerSecuring Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
 
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near You
 
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkBig Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
 

Recently uploaded

Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreternaman860154
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 
Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxOnBoard
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024The Digital Insurer
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking MenDelhi Call girls
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesSinan KOZAK
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitecturePixlogix Infotech
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationRidwan Fadjar
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfEnterprise Knowledge
 
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...HostedbyConfluent
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxMalak Abu Hammad
 
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...gurkirankumar98700
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024Scott Keck-Warren
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdfhans926745
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
 
Google AI Hackathon: LLM based Evaluator for RAG
Google AI Hackathon: LLM based Evaluator for RAGGoogle AI Hackathon: LLM based Evaluator for RAG
Google AI Hackathon: LLM based Evaluator for RAGSujit Pal
 

Recently uploaded (20)

Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptx
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC Architecture
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
Google AI Hackathon: LLM based Evaluator for RAG
Google AI Hackathon: LLM based Evaluator for RAGGoogle AI Hackathon: LLM based Evaluator for RAG
Google AI Hackathon: LLM based Evaluator for RAG
 

TensorFlowOnSpark Scales TensorFlow Apps to Spark Clusters

  • 1. TensorFlowOnSpark S c a l a b l e Te n s o r F l o w L e a r n i n g o n S p a r k C l u s t e r s Lee Yang, Andr ew Feng Yahoo Big D ata ML Platfor m Team
  • 3. Why TensorFlowOnSpark at Yahoo? ▪ Major contributor to open-source Hadoop ecosystem • Originators of Hadoop (2006) • An early adopter of Spark (since 2013) • Open-sourced CaffeOnSpark (2016) • CaffeOnSpark Update: Recent Enhancements and Use Cases • Wednesday @ 12:20pm by Mridul Jain & Jun Shi ▪ Large investment in production clusters • Tens of clusters • Thousands of nodes per cluster ▪ Massive amounts of data • Petabytes of data
  • 7. RDMA Speedup over gRPC 2.4X faster
  • 8. RDMA Speedup over gRPC http://www.mellanox.com/solutions/machine-learning/tensorflow.php
  • 9. RDMA Speedup over gRPC http://www.mellanox.com/solutions/machine-learning/tensorflow.php
  • 10. TensorFlowOnSpark Design Goals ▪ Scale up existing TF apps with minimal changes ▪ Support all current TensorFlow functionality • Synchronous/asynchronous training • Model/data parallelism • TensorBoard ▪ Integrate with existing HDFS data pipelines and ML algorithms • ex. Hive, Spark, MLlib
  • 11. TensorFlowOnSpark ▪ Pyspark wrapper of TF app code ▪ Launches distributed TF clusters using Spark executors ▪ Supports TF data ingestion modes • feed_dict – RDD.mapPartitions() • queue_runner – direct HDFS access from TF ▪ Supports TensorBoard during/after training ▪ Generally agnostic to Spark/TF versions
  • 12. Supported Environments ▪ Python 2.7 - 3.x ▪ Spark 1.6 - 2.x ▪ TensorFlow 0.12, 1.x ▪ Hadoop 2.x
  • 14. TensorFlowOnSpark Basics 1. Launch TensorFlow cluster 2. Feed data to TensorFlow app 3. Shutdown TensorFlow cluster
  • 15. API Example cluster = TFCluster.run(sc, map_fn, args, num_executors, num_ps, tensorboard, input_mode) cluster.train(dataRDD, num_epochs=0) cluster.inference(dataRDD) cluster.shutdown()
  • 16. Conversion Example # diff –w eval_image_classifier.py 20a21,27 > from pyspark.context import SparkContext > from pyspark.conf import SparkConf > from tensorflowonspark import TFCluster, TFNode > import sys > > def main_fun(argv, ctx): 27a35,36 > sys.argv = argv > 84,85d92 < < def main(_): 88a96,97 > cluster_spec, server = TFNode.start_cluster_server(ctx) > 191c200,204 < tf.app.run() --- > sc = SparkContext(conf=SparkConf().setAppName("eval_image_classifier")) > num_executors = int(sc._conf.get("spark.executor.instances")) > cluster = TFCluster.run(sc, main_fun, sys.argv, num_executors, 0, False, TFCluster.InputMode.TENSORFLOW) > cluster.shutdown()
  • 17. Input Modes ▪ InputMode.SPARK HDFS → RDD.mapPartitions → feed_dict ▪ InputMode.TENSORFLOW TFReader + QueueRunner ← HDFS
  • 18. Executor python worker Executor python workerpython worker python worker InputMode.SPARK worker:0 queue Executor python workerpython worker ps:0 worker:1 queue RDD RDD
  • 19. Executor python worker InputMode.TENSORFLOW Executor python worker Executor python workerpython worker ps:0 python worker worker:0 python worker worker:1 HDFS
  • 20. Failure Recovery ▪ TF Checkpoints written to HDFS ▪ InputMode.SPARK • TF worker runs in background • RDD data feeding tasks can be retried • However, TF worker failures will be “hidden” from Spark ▪ InputMode.TENSORFLOW • TF worker runs in foreground • TF worker failures will be retried as Spark task • TF worker restores from checkpoint
  • 21. Failure Recovery ▪ Executor failures are problematic • e.g. pre-emption • TF cluster_spec is statically-defined at startup • YARN does not re-allocate on same node • Even if possible, port may no longer be available. ▪ Need dynamic cluster membership • Exploring options w/ TensorFlow team
  • 25. TensorFlow App Development Experimentation Phase •Single-node •Small scale data •TensorFlow APIs tf.Graph tf.Session tf.InteractiveSession
  • 26. TensorFlow App Development Scaling Phase •Multi-node •Medium scale data (local disk) •Distributed TensorFlow APIs tf.train.ClusterSpec tf.train.Server tf.train.Saver tf.train.Supervisor
  • 27. TensorFlow App Development Production Phase •Cluster deployment •Upstream data pipeline •Model training w/ TensorFlowOnSpark APIs TFCluster.run TFNode.start_cluster_server TFCluster.shutdown •Production inference w/ TensorFlow Serving
  • 29. Common Gotchas ▪ Single node/task per executor ▪ HDFS access (native libs/env) ▪ Why doesn’t algorithm X scale linearly?
  • 30. What’s New? ▪ Community contributions • CDH compatibility • TFNode.DataFeed • Bug fixes ▪ RDMA merged into TensorFlow repository ▪ Registration server ▪ Spark streaming ▪ Pip packaging
  • 31. Spark Streaming from pyspark.streaming import StreamingContext ssc = StreamingContext(sc, 10) images = sc.textFile(args.images).map(lambda ln: parse(ln)) stream = ssc.textFileStream(args.images) imageRDD = stream.map(lambda ln: parse(ln)) cluster = TFCluster.run(sc, map_fun, args,…) predictionRDD = cluster.inference(imageRDD) predictionRDD.saveAsTextFile(args.output) predictionRDD.saveAsTextFiles(args.output) ssc.start() cluster.shutdown(ssc)
  • 32. Pip packaging pip install tensorflowonspark ${SPARK_HOME}/bin/spark-submit --master ${MASTER} --py-files ${TFoS_HOME}/examples/mnist/spark/mnist_dist.py --archives ${TFoS_HOME}/tfspark.zip ${TFoS_HOME}/examples/mnist/spark/mnist_spark.py --cluster_size ${SPARK_WORKER_INSTANCES} --images examples/mnist/csv/train/images --labels examples/mnist/csv/train/labels --format csv --mode train --model mnist_model
  • 33. Demo
  • 34. Next Steps ▪ TF/Keras Layers ▪ Failure recovery w/ dynamic cluster management (e.g. registration server)
  • 35. Summary TFoS brings deep learning to big-data clusters • TensorFlow: 0.12 -1.x • Spark: 1.6-2.x • Cluster manager: YARN, Standalone, Mesos • EC2 image provided • RDMA in TensorFlow