TensorFlow & TensorFrames w/ Apache Spark
Presents...
Marco Saviano
TensorFlow & TensorFrames w/ Apache Spark
1. Numerical Computing
2. Apache Spark
3. Google TensorFlow
4. TensorFrames
5. Future developments
Numerical computing
• Queries and algorithms are computation-heavy
• Numerical algorithms, such as ML, use very simple data types:
integers, floating-point numbers, vectors, matrices
• Not necessarily a lot of data movement
• Numerical bottlenecks are good targets for optimization
Evolution of computing power
[Figure: scale out vs. scale up]
HPC Frameworks
[Figure: scale out vs. scale up]
Today’s talk:
Spark + TensorFlow = TensorFrames
Open source successes
Commits on master branch on GitHub
Apache Spark – 1015 contributors
Google TensorFlow – 582 contributors
Spark enterprise users
TensorFlow enterprise users
Apache Spark
Apache Spark™ is a fast and general engine for
large-scale data processing, with built-in
modules for streaming, SQL, machine learning
and graph processing
Spark Unified Stack
How does it work? (1/3)
Spark is written in Scala and runs on the
Java Virtual Machine.
Every Spark application consists of a driver
program:
It contains the main function, defines
distributed datasets on the cluster and
then applies operations to them.
Driver programs access Spark through a
SparkContext object.
How does it work? (2/3)
To run the operations defined in the
application, the driver typically manages a
number of nodes called executors.
These operations result in tasks that the
executors have to perform.
How does it work? (3/3)
You can manage and manipulate datasets distributed over a cluster by writing
just a driver program, without handling the distributed system yourself,
thanks to:
• Cluster managers (resource management, networking…)
• SparkContext (task definition from more abstract operations)
• RDD (Spark’s main programming abstraction for representing distributed
datasets)
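The division of labour between a driver and its executors can be imitated on a single machine. The following is a minimal sketch, not Spark code: a thread pool stands in for the executors, and the names `run_task` and `driver` are hypothetical and chosen only for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def run_task(partition):
    """A task: the unit of work one executor performs on one partition."""
    return [value * value for value in partition]


def driver(data, num_partitions=3):
    """Split the data into partitions, hand one task per partition to the
    pool (standing in for the executors), and gather the results in order."""
    size = max(1, -(-len(data) // num_partitions))  # ceiling division
    partitions = [data[i:i + size] for i in range(0, len(data), size)]
    with ThreadPoolExecutor(max_workers=num_partitions) as executors:
        results = list(executors.map(run_task, partitions))
    return [value for part in results for value in part]


squares = driver(list(range(10)))
print(squares)
```

The caller only sees `driver(...)`; how the work is partitioned and scheduled stays hidden, which is the property the slide attributes to the combination of cluster manager, SparkContext and RDD.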
RDD vs DataFrame
• RDD:
Immutable distributed collection of elements of your data,
partitioned across the nodes in your cluster, that can be operated on in
parallel with a low-level API, i.e. transformations and actions.
• DataFrame:
Immutable distributed collection of data, organized into named
columns. It is like a table in a relational database.
DataFrame: pros and cons
• Higher level API, which makes Spark available to wider
audience
• Performance gains thanks to the Catalyst query optimizer
• Space efficiency by leveraging the Tungsten subsystem
• Off-Heap Memory Management and managing memory explicitly
• Cache-aware computation improves the speed of data processing through
more effective use of L1/L2/L3 CPU caches
• Higher level API may limit expressiveness
• Complex transformations are better expressed using the RDD API
Google TensorFlow
• Programming system in which you represent computations as graphs
• Google Brain Team (https://research.google.com/teams/brain/)
• Very popular for deep learning and neural networks
Google TensorFlow
• Core written in C++
• Interface in C++ and Python
[Diagram: C++ front end and Python front end on top of the Core TensorFlow Execution System, which runs on CPU, GPU, Android, iOS, …]
Google TensorFlow adoption
Tensors
• Big idea: Express a numeric computation as a graph.
• Graph nodes are operations which have any number of inputs and outputs
• Graph edges are tensors which flow between nodes
• Tensors can be viewed as a multidimensional array of numbers
• A scalar is a tensor
• A vector is a tensor
• A matrix is a tensor
• And so on…
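The "multidimensional array" view can be made concrete with plain Python lists. This is only a sketch: the helper `shape` is hypothetical and assumes regular nesting (every sub-list at a given depth has the same length).

```python
def shape(tensor):
    """Return the shape of a tensor given as a number or as nested lists.

    Assumes the nesting is regular, which is what makes the data a tensor.
    """
    dims = []
    while isinstance(tensor, list):
        dims.append(len(tensor))
        if not tensor:
            break
        tensor = tensor[0]
    return tuple(dims)


scalar = 3.0
vector = [1.0, 2.0, 3.0]
matrix = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]

for name, value in [("scalar", scalar), ("vector", vector), ("matrix", matrix)]:
    print(name, "rank", len(shape(value)), "shape", shape(value))
```

The number of dimensions (the rank) is the only thing that separates a scalar, a vector and a matrix here; the same function handles all three.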
Programming model
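import tensorflow as tf  # TensorFlow 1.x graph-mode API

# Placeholders are graph inputs whose values are supplied at run time.
x = tf.placeholder(tf.int32, name="x")
y = tf.placeholder(tf.int32, name="y")
# Build the node z = x + 3 * y; nothing is computed yet.
output = tf.add(x, 3 * y, name="z")
# A session executes the graph with concrete inputs.
session = tf.Session()
output_value = session.run(output, {x: 3, y: 5})  # 3 + 3 * 5 = 18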
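[Graph diagram: inputs x (int32) and y (int32); y is multiplied by the constant 3 (mul) and the result is added to x to give z]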
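The build-then-run pattern of this programming model can be sketched without TensorFlow. The `Node` class and the `placeholder`, `constant` and `run` helpers below are hypothetical stand-ins written for this illustration; they are not TensorFlow APIs and ignore everything a real execution engine does (types, devices, optimisation).

```python
import operator


class Node:
    """One graph node: an operation plus the nodes that feed it."""

    def __init__(self, name, op=None, inputs=(), value=None):
        self.name = name
        self.op = op          # callable combining the input values, or None
        self.inputs = inputs  # upstream nodes
        self.value = value    # set for constants only


def placeholder(name):
    return Node(name)


def constant(value):
    return Node("const", value=value)


def run(node, feed):
    """Evaluate `node` recursively, reading placeholder values from `feed`."""
    if node.op is None:
        return feed[node] if node in feed else node.value
    return node.op(*(run(parent, feed) for parent in node.inputs))


# Build the graph first; nothing is computed at this point.
x = placeholder("x")
y = placeholder("y")
mul = Node("mul", operator.mul, (constant(3), y))
z = Node("z", operator.add, (x, mul))

# Then evaluate it with concrete inputs.
result = run(z, {x: 3, y: 5})
print(result)  # 3 + 3 * 5 = 18
```

Because the graph is a data structure separate from its inputs, the same `z` can be evaluated again with a different feed, which is what allows a graph to be shipped to other processes and applied to many rows.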
TensorFlow
Demo
TensorFrames
• TensorFrames (TensorFlow on Spark DataFrames) lets you manipulate
Spark's DataFrames with TensorFlow programs.
• Programs can be written in Python or Scala, or supplied directly as a
protocol buffer description of the operation graph
• Built on the JavaCPP project
• Officially supported Spark versions: 1.6+
Spark with TensorFlow
[Diagram: Spark worker process and worker Python process; data representations shown: Tungsten binary format, Java object, Python pickle, C++ buffer]
TensorFrames: native embedding of TensorFlow
[Diagram: Spark worker process only; data representations shown: Tungsten binary format, Java object, C++ buffer]
Programming model
• Integrate the TensorFlow API with Spark DataFrames
df = sqlContext.createDataFrame(zip(
    range(0, 10),
    range(1, 11))
).toDF("x", "y")
import tensorflow as tf
import tensorframes as tfs
# Placeholders bound to the columns of df, one row at a time
x = tfs.row(df, "x")
y = tfs.row(df, "y")
output = tf.add(x, 3 * y, name="z")
# Apply the graph to every row; the result is appended as column z
output_df = tfs.map_rows(output, df)
output_df.collect()
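[Graph diagram: inputs x (int32) and y (int32); y is multiplied by the constant 3 (mul) and the result is added to x to give z]
df: DataFrame[x: int, y: int]
output_df: DataFrame[x: int, y: int, z: int]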
Demo
Tensors
• TensorFlow expresses operations on tensors: homogeneous data
structures that consist of an array and a shape
• In TensorFrames, tensors are stored in Spark DataFrames
Spark DataFrame
x  y
1  [1.1 1.2]
2  [2.1 2.2]
3  [3.1 3.2]
[Diagram: the table is chunked and distributed across the cluster (Node 1, Node 2); one chunk holds row 1, the other holds rows 2 and 3]
Map operations
• TensorFrames provides most operations in two forms
• row-based version
• block-based version
• The block transforms are usually more efficient: there is less overhead in calling
TensorFlow, and they can manipulate more data at once.
• In some cases, it is not possible to treat a sequence of rows as a single tensor,
because the data must be homogeneous
row-based:
process_row: x = 1, y = [1.1 1.2]
process_row: x = 2, y = [2.1 2.2]
process_row: x = 3, y = [3.1 3.2]
block-based:
process_block: x = [1], y = [1.1 1.2]
process_block: x = [2 3], y = [[2.1 2.2]
                              [3.1 3.2]]
Non-homogeneous rows (the y vectors have different lengths):
x  y
1  [1]
2  [1 2]
3  [1 2 3]
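The homogeneity requirement can be checked mechanically. The sketch below is plain Python, not TensorFrames code: `stack_rows` is a hypothetical helper that tries to combine per-row vectors into one block and refuses when their lengths differ.

```python
def stack_rows(cells):
    """Stack per-row cells into one block, or raise if they are ragged.

    Each cell is a flat list of numbers (a vector); a block is a list of
    such vectors that all share the same length.
    """
    lengths = {len(cell) for cell in cells}
    if len(lengths) > 1:
        raise ValueError("rows have different shapes: %s" % sorted(lengths))
    return [list(cell) for cell in cells]


uniform = [[1.1, 1.2], [2.1, 2.2], [3.1, 3.2]]
ragged = [[1], [1, 2], [1, 2, 3]]

block = stack_rows(uniform)
print(len(block), "x", len(block[0]))

try:
    stack_rows(ragged)
except ValueError as error:
    print("cannot build a block:", error)
```

Data like `ragged` can still be processed one row at a time, which is why the row-based form of each operation exists alongside the block-based one.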
Row-based vs Block-based
# Row-based version
import tensorflow as tf
import tensorframes as tfs
from pyspark.sql import Row
from pyspark.sql.functions import *
data = [Row(x=float(x)) for x in range(5)]
df = sqlContext.createDataFrame(data)
with tf.Graph().as_default() as g:
    # x is a placeholder for the value of column "x" in a single row
    x = tfs.row(df, "x")
    z = tf.add(x, 3, name='z')
    df2 = tfs.map_rows(z, df)
df2.show()

# Block-based version
import tensorflow as tf
import tensorframes as tfs
from pyspark.sql import Row
from pyspark.sql.functions import *
data = [Row(x=float(x)) for x in range(5)]
df = sqlContext.createDataFrame(data)
with tf.Graph().as_default() as g:
    # x is a placeholder for column "x" over a whole block of rows
    x = tfs.block(df, "x")
    z = tf.add(x, 3, name='z')
    df2 = tfs.map_blocks(z, df)
df2.show()
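The difference in call overhead between the two forms can be seen in a small simulation. This is a sketch in plain Python under stated assumptions: `simulate_map_rows` and `simulate_map_blocks` are hypothetical stand-ins for the two map forms, the partitioning of the five values is arbitrary, and the `calls` counter stands for the number of times TensorFlow would be invoked.

```python
calls = {"row": 0, "block": 0}


def add_three_row(x):
    """Row-based transform: receives one scalar per call."""
    calls["row"] += 1
    return x + 3


def add_three_block(xs):
    """Block-based transform: receives a whole vector of values per call."""
    calls["block"] += 1
    return [x + 3 for x in xs]


def simulate_map_rows(fn, partitions):
    return [[fn(x) for x in part] for part in partitions]


def simulate_map_blocks(fn, partitions):
    return [fn(part) for part in partitions]


# x = 0.0 .. 4.0, split into two partitions (the split is arbitrary).
partitions = [[0.0, 1.0], [2.0, 3.0, 4.0]]

by_row = simulate_map_rows(add_three_row, partitions)
by_block = simulate_map_blocks(add_three_block, partitions)
print(by_row == by_block, calls)
```

Both forms produce the same values, but the row-based one makes one call per row while the block-based one makes one call per partition.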
Reduction operations
• Reduction operations coalesce a pair or a collection of rows and transform them
into a single row, until there is one row left.
• The transforms must be algebraic: the order in which they are done should not matter
f(f(a, b), c) == f(a, f(b, c))
import tensorflow as tf
import tensorframes as tfs
from pyspark.sql import Row
from pyspark.sql.functions import *
data = [Row(x=float(x)) for x in range(5)]
df = sqlContext.createDataFrame(data)
with tf.Graph().as_default() as g:
    # Placeholder for a block of values from column "x"
    x_input = tfs.block(df, "x", tf_name="x_input")
    # Sum the block; the output is named after the column it reduces
    x = tf.reduce_sum(x_input, name='x')
    res = tfs.reduce_blocks(x, df)
print(res)
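Why the transform has to satisfy f(f(a, b), c) == f(a, f(b, c)) can be shown with a two-stage reduction in plain Python. This is a sketch, not TensorFrames code: `simulate_reduce_blocks` is a hypothetical helper that reduces each partition first and then combines the partial results, and the two partitionings are arbitrary.

```python
from functools import reduce


def simulate_reduce_blocks(fn, partitions):
    """Reduce each partition to one value, then reduce the partial results."""
    partials = [reduce(fn, part) for part in partitions]
    return reduce(fn, partials)


def add(a, b):
    return a + b


def subtract(a, b):
    return a - b


split_a = [[0.0, 1.0], [2.0, 3.0, 4.0]]
split_b = [[0.0], [1.0, 2.0, 3.0], [4.0]]

# Addition is associative: every partitioning gives the same total.
sum_a = simulate_reduce_blocks(add, split_a)
sum_b = simulate_reduce_blocks(add, split_b)
print(sum_a, sum_b)

# Subtraction is not: the result depends on how the rows were grouped.
diff_a = simulate_reduce_blocks(subtract, split_a)
diff_b = simulate_reduce_blocks(subtract, split_b)
print(diff_a, diff_b)
```

Since the way rows are grouped into partitions is decided by the cluster rather than by the program, only transforms that behave like `add` give a well-defined result.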
Demo
Improving communication
[Diagram: Spark worker process; Tungsten binary format, Java object, C++ buffer; direct memory copy]
Improving communication
[Diagram: Spark worker process; columnar storage, C++ buffer; direct memory copy]
Future
• Integration with Tungsten:
• Direct memory copy
• Columnar storage
• Better integration with MLlib data types
• Improving GPU support
Questions?

VK Business Profile - provides IT solutions and Web DevelopmentVK Business Profile - provides IT solutions and Web Development
VK Business Profile - provides IT solutions and Web Developmentvyaparkranti
 
Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...
Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...
Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...Cizo Technology Services
 
Powering Real-Time Decisions with Continuous Data Streams
Powering Real-Time Decisions with Continuous Data StreamsPowering Real-Time Decisions with Continuous Data Streams
Powering Real-Time Decisions with Continuous Data StreamsSafe Software
 

Recently uploaded (20)

Odoo 14 - eLearning Module In Odoo 14 Enterprise
Odoo 14 - eLearning Module In Odoo 14 EnterpriseOdoo 14 - eLearning Module In Odoo 14 Enterprise
Odoo 14 - eLearning Module In Odoo 14 Enterprise
 
UI5ers live - Custom Controls wrapping 3rd-party libs.pptx
UI5ers live - Custom Controls wrapping 3rd-party libs.pptxUI5ers live - Custom Controls wrapping 3rd-party libs.pptx
UI5ers live - Custom Controls wrapping 3rd-party libs.pptx
 
Machine Learning Software Engineering Patterns and Their Engineering
Machine Learning Software Engineering Patterns and Their EngineeringMachine Learning Software Engineering Patterns and Their Engineering
Machine Learning Software Engineering Patterns and Their Engineering
 
Simplifying Microservices & Apps - The art of effortless development - Meetup...
Simplifying Microservices & Apps - The art of effortless development - Meetup...Simplifying Microservices & Apps - The art of effortless development - Meetup...
Simplifying Microservices & Apps - The art of effortless development - Meetup...
 
Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...
Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...
Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...
 
Large Language Models for Test Case Evolution and Repair
Large Language Models for Test Case Evolution and RepairLarge Language Models for Test Case Evolution and Repair
Large Language Models for Test Case Evolution and Repair
 
GOING AOT WITH GRAALVM – DEVOXX GREECE.pdf
GOING AOT WITH GRAALVM – DEVOXX GREECE.pdfGOING AOT WITH GRAALVM – DEVOXX GREECE.pdf
GOING AOT WITH GRAALVM – DEVOXX GREECE.pdf
 
How to submit a standout Adobe Champion Application
How to submit a standout Adobe Champion ApplicationHow to submit a standout Adobe Champion Application
How to submit a standout Adobe Champion Application
 
MYjobs Presentation Django-based project
MYjobs Presentation Django-based projectMYjobs Presentation Django-based project
MYjobs Presentation Django-based project
 
Hot Sexy call girls in Patel Nagar🔝 9953056974 🔝 escort Service
Hot Sexy call girls in Patel Nagar🔝 9953056974 🔝 escort ServiceHot Sexy call girls in Patel Nagar🔝 9953056974 🔝 escort Service
Hot Sexy call girls in Patel Nagar🔝 9953056974 🔝 escort Service
 
Unveiling Design Patterns: A Visual Guide with UML Diagrams
Unveiling Design Patterns: A Visual Guide with UML DiagramsUnveiling Design Patterns: A Visual Guide with UML Diagrams
Unveiling Design Patterns: A Visual Guide with UML Diagrams
 
Salesforce Implementation Services PPT By ABSYZ
Salesforce Implementation Services PPT By ABSYZSalesforce Implementation Services PPT By ABSYZ
Salesforce Implementation Services PPT By ABSYZ
 
Call Us🔝>༒+91-9711147426⇛Call In girls karol bagh (Delhi)
Call Us🔝>༒+91-9711147426⇛Call In girls karol bagh (Delhi)Call Us🔝>༒+91-9711147426⇛Call In girls karol bagh (Delhi)
Call Us🔝>༒+91-9711147426⇛Call In girls karol bagh (Delhi)
 
Unveiling the Future: Sylius 2.0 New Features
Unveiling the Future: Sylius 2.0 New FeaturesUnveiling the Future: Sylius 2.0 New Features
Unveiling the Future: Sylius 2.0 New Features
 
Innovate and Collaborate- Harnessing the Power of Open Source Software.pdf
Innovate and Collaborate- Harnessing the Power of Open Source Software.pdfInnovate and Collaborate- Harnessing the Power of Open Source Software.pdf
Innovate and Collaborate- Harnessing the Power of Open Source Software.pdf
 
Open Source Summit NA 2024: Open Source Cloud Costs - OpenCost's Impact on En...
Open Source Summit NA 2024: Open Source Cloud Costs - OpenCost's Impact on En...Open Source Summit NA 2024: Open Source Cloud Costs - OpenCost's Impact on En...
Open Source Summit NA 2024: Open Source Cloud Costs - OpenCost's Impact on En...
 
Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...
Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...
Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...
 
VK Business Profile - provides IT solutions and Web Development
VK Business Profile - provides IT solutions and Web DevelopmentVK Business Profile - provides IT solutions and Web Development
VK Business Profile - provides IT solutions and Web Development
 
Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...
Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...
Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...
 
Powering Real-Time Decisions with Continuous Data Streams
Powering Real-Time Decisions with Continuous Data StreamsPowering Real-Time Decisions with Continuous Data Streams
Powering Real-Time Decisions with Continuous Data Streams
 

Meetup tensorframes

  • 1. TensorFlow & TensorFrames w/ Apache Spark Presents... Marco Saviano
  • 2. TensorFlow & TensorFrames w/ Apache Spark 1. Numerical Computing 2. Apache Spark 3. Google Tensorflow 4. Tensorframes 5. Future developments
  • 3. Numerical computing • Queries and algorithms are computation-heavy • Numerical algorithms, like ML, use very simple data types: integers/floating-point operations, vectors, matrices • Not necessarily a lot of data movement • Numerical bottlenecks are good targets for optimization
  • 4. Evolution of computing power Scale out Scale up
  • 5. HPC Frameworks Scale out Scale up Today’s talk: Spark + TensorFlow = TensorFrames
  • 6. Open source successes Commits on master branch on GitHub Apache Spark – 1015 contributors Google Tensorflow – 582 contributors
  • 9. TensorFlow & TensorFrames w/ Apache Spark 1. Numerical Computing 2. Apache Spark 3. Google Tensorflow 4. Tensorframes 5. Future developments
  • 10. Apache Spark Apache Spark™ is a fast and general engine for large-scale data processing, with built-in modules for streaming, SQL, machine learning and graph processing
  • 12. How does it work? (1/3) Spark is written in Scala and runs on the Java Virtual Machine. Every Spark application consists of a driver program: It contains the main function, defines distributed datasets on the cluster and then applies operations to them. Driver programs access Spark through a SparkContext object.
  • 13. How does it work? (2/3) To run the operations defined in the application, the driver typically manages a number of nodes called executors. These operations result in tasks that the executors have to perform.
  • 14. How does it work? (3/3) Managing and manipulating datasets distributed over a cluster by writing just a driver program, without taking care of the distributed system, is possible because of: • Cluster managers (resource management, networking…) • SparkContext (task definition from more abstract operations) • RDD (Spark’s main programming abstraction to represent distributed datasets)
  • 15. RDD vs DataFrame • RDD: Immutable distributed collection of elements of your data, partitioned across nodes in your cluster, that can be operated on in parallel with a low-level API, i.e. transformations and actions. • DataFrame: Immutable distributed collection of data, organized into named columns. It is like a table in a relational database.
  • 16. DataFrame: pros and cons • Higher level API, which makes Spark available to a wider audience • Performance gains thanks to the Catalyst query optimizer • Space efficiency by leveraging the Tungsten subsystem • Off-heap memory management and managing memory explicitly • Cache-aware computation improves the speed of data processing through more effective use of L1/L2/L3 CPU caches • Higher level API may limit expressiveness • Complex transformations are better expressed using the RDD API
  • 17. TensorFlow & TensorFrames w/ Apache Spark 1. Numerical Computing 2. Apache Spark 3. Google Tensorflow 4. Tensorframes 5. Future developments
  • 18. Google TensorFlow • Programming system in which you represent computations as graphs • Google Brain Team (https://research.google.com/teams/brain/) • Very popular for deep learning and neural networks
  • 19. Google TensorFlow • Core written in C++ • Interface in C++ and Python C++ front end Python front end Core TensorFlow Execution System CPU GPU Android iOS …
  • 21. Tensors • Big idea: Express a numeric computation as a graph. • Graph nodes are operations which have any number of inputs and outputs • Graph edges are tensors which flow between nodes • Tensors can be viewed as a multidimensional array of numbers • A scalar is a tensor, • A vector is a tensor • A matrix is a tensor • And so on…
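  • 22. Programming model
        import tensorflow as tf
        # Describe the graph: two int32 inputs and z = x + 3 * y
        x = tf.placeholder(tf.int32, name="x")
        y = tf.placeholder(tf.int32, name="y")
        output = tf.add(x, 3 * y, name="z")
        # Create a session and run the graph with concrete inputs
        session = tf.Session()
        output_value = session.run(output, {x: 3, y: 5})
      [Figure: graph with inputs x: int32 and y: int32, the constant 3, a mul node and the output node z]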
  • 24. TensorFlow & TensorFrames w/ Apache Spark 1. Numerical Computing 2. Apache Spark 3. Google Tensorflow 4. Tensorframes 5. Future developments
  • 25. Tensorframes • TensorFrames (TensorFlow on Spark Dataframes) lets you manipulate Spark's DataFrames with TensorFlow programs. • Code written in Python, Scala or directly by passing a protocol buffer description of the operations graph • Built on the JavaCPP project • Officially supported Spark versions: 1.6+
  • 26. Spark with Tensorflow Spark worker process Worker python process C++ buffer Python pickle Tungsten binary format Python pickle Java object
  • 27. TensorFrames: native embedding of TensorFlow Spark worker process C++ buffer Tungsten binary format Java object
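  • 28. Programming model • Integrate the TensorFlow API in Spark DataFrames
        df = sqlContext.createDataFrame(zip(range(0, 10), range(1, 11))).toDF("x", "y")
        import tensorflow as tf
        import tensorframes as tfs
        # tfs.row takes the place of tf.placeholder: one input per DataFrame column
        x = tfs.row(df, "x")
        y = tfs.row(df, "y")
        output = tf.add(x, 3 * y, name="z")
        # Apply the graph row by row; the result gains a column z
        output_df = tfs.map_rows(output, df)
        output_df.collect()
      [Figure: graph with inputs x: int32 and y: int32, the constant 3, a mul node and the output node z; df: DataFrame[x: int, y: int]; output_df: DataFrame[x: int, y: int, z: int]]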
  • 29. Demo
  • 30. Tensors • TensorFlow expresses operations on tensors: homogeneous data structures that consist of an array and a shape • In TensorFrames, tensors are stored in a Spark DataFrame
      [Figure: a Spark DataFrame with columns x and y and rows (1, [1.1 1.2]), (2, [2.1 2.2]), (3, [3.1 3.2]) is chunked and distributed across the cluster: Node 1 holds row 1, Node 2 holds rows 2 and 3]
  • 31. Map operations • TensorFrames provides most operations in two forms • row-based version • block-based version • The block transforms are usually more efficient: there is less overhead in calling TensorFlow, and they can manipulate more data at once. • In some cases, it is not possible to consider a sequence of rows as a single tensor because the data must be homogeneous
      row-based:
        process_row: x = 1, y = [1.1 1.2]
        process_row: x = 2, y = [2.1 2.2]
        process_row: x = 3, y = [3.1 3.2]
      block-based:
        process_block: x = [1], y = [1.1 1.2]
        process_block: x = [2 3], y = [[2.1 2.2] [3.1 3.2]]
      [Table: columns x and y; rows (1, [1]), (2, [1 2]), (3, [1 2 3])]
  • 32. Row-based vs Block-based
      Row-based:
        import tensorflow as tf
        import tensorframes as tfs
        from pyspark.sql import Row
        from pyspark.sql.functions import *
        data = [Row(x=float(x)) for x in range(5)]
        df = sqlContext.createDataFrame(data)
        with tf.Graph().as_default() as g:
            x = tfs.row(df, "x")
            z = tf.add(x, 3, name='z')
            df2 = tfs.map_rows(z, df)
        df2.show()
      Block-based:
        import tensorflow as tf
        import tensorframes as tfs
        from pyspark.sql import Row
        from pyspark.sql.functions import *
        data = [Row(x=float(x)) for x in range(5)]
        df = sqlContext.createDataFrame(data)
        with tf.Graph().as_default() as g:
            x = tfs.block(df, "x")
            z = tf.add(x, 3, name='z')
            df2 = tfs.map_blocks(z, df)
        df2.show()
  • 33. Reduction operations • Reduction operations coalesce a pair or a collection of rows and transform them into a single row, until there is one row left. • The transforms must be algebraic: the order in which they are done should not matter: f(f(a, b), c) == f(a, f(b, c))
        import tensorflow as tf
        import tensorframes as tfs
        from pyspark.sql import Row
        from pyspark.sql.functions import *
        data = [Row(x=float(x)) for x in range(5)]
        df = sqlContext.createDataFrame(data)
        with tf.Graph().as_default() as g:
            x_input = tfs.block(df, "x", tf_name="x_input")
            x = tf.reduce_sum(x_input, name='x')
            res = tfs.reduce_blocks(x, df)
        print(res)
  • 34. Demo
  • 35. TensorFlow & TensorFrames w/ Apache Spark 1. Numerical Computing 2. Apache Spark 3. Google Tensorflow 4. Tensorframes 5. Future developments
  • 37. Spark worker process C++ buffer Direct memory copy Columnar storage Improving communication
  • 38. Future • Integration with Tungsten: • Direct memory copy • Columnar storage • Better integration with MLlib data types • Improving GPU support
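
The rule on the reduction slide, f(f(a, b), c) == f(a, f(b, c)), is what allows each node to reduce its own rows first and lets the partial results be merged afterwards. Below is a minimal plain-Python sketch of that two-phase pattern. It does not use Spark or TensorFrames; the names combine, reduce_partition and reduce_all, and the way the rows are split across nodes, are hypothetical and chosen only for illustration.

```python
from functools import reduce


def combine(a, b):
    # An associative merge of two partial results (here: a plain sum).
    return a + b


def reduce_partition(rows):
    # Phase 1: collapse the rows held by one node into a single value.
    return reduce(combine, rows)


def reduce_all(partitions):
    # Phase 2: merge the per-node partial results (empty partitions are skipped).
    partials = [reduce_partition(p) for p in partitions if p]
    return reduce(combine, partials)


# Hypothetical layout: the five rows x = 0.0 .. 4.0 from the slide, split over two nodes.
partitions = [[0.0, 1.0], [2.0, 3.0, 4.0]]
total = reduce_all(partitions)

# A different split gives the same answer because combine is associative.
other_split = [[0.0], [1.0, 2.0], [3.0, 4.0]]
same_total = reduce_all(other_split)

print(total, same_total)  # 10.0 10.0
```

If combine were not associative (subtraction, for example), the result would depend on how the rows happen to be partitioned, which is why the slide requires algebraic transforms.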

Editor's Notes

  1. Hello everybody! I am Marco Saviano, a big data engineer at AgileLab. Today I am here to introduce an implementation of Google TensorFlow in Apache Spark called TensorFrames.
  2. Here is the outline of my talk. I will start with our focus, numerical computing. After that I will talk about Apache Spark, just to recap DataFrames, given Burak's great presentation. Then I will introduce Google TensorFlow and TensorFrames, an implementation of TensorFlow in Spark. Finally I will talk about future developments of TensorFrames, considering that right now it is an experimental project.
  3. Everyone who works in the big data field, for example big data engineers and data scientists, is well aware that operations on data, like queries or analyses, involve heavy numerical computation. ML algorithms use very simple data types and structures: integers, floating-point numbers, vectors and matrices. We can say that the bread and butter of data science comes down to three words: integers, floats and doubles. Usually these algorithms don't need a lot of data movement. If we consider a numerical pipeline in operation, the ETL part, in which the data are imported, transformed and loaded, is really fast; the bottleneck in the pipeline is the number-crunching part. So these numerical bottlenecks are really good targets for optimization.
  4. To improve the performance of numerical computation, the evolution of computing power in recent years has followed two paths. One is to "scale up". Starting from a common laptop, we can move to a powerful personal computer, or use a GPU. It is well known that GPUs aren't used just for gaming: their architecture lets them run numerical computations much faster than a CPU. Finally, if we implement a new algorithm and we work for one of the most powerful companies in the world, we can ask our team to create a specific chip for our application. So scaling up means using a more powerful standalone machine. On the other hand, scaling out means using a computing cluster: multiple machines that work in parallel.
  5. This slide shows the main frameworks used for HPC. We can observe that TensorFlow is at the top of the "scale-up" axis, so it brings high performance on a standalone machine, while Spark is at the top of the "scale-out" axis. Today's talk is about joining these two paths, combining one of the best frameworks for standalone machines, TensorFlow, with Apache Spark, which excels in cluster computing.
  6. TensorFlow and Spark are two open source successes. The two graphs show the commits on the master branch of their GitHub repositories: at the top we have Spark, with 1015 contributors; at the bottom TensorFlow, with 582 contributors.
  7. They are also used by big enterprises.
  8. In this section I will go quite fast, given the in-depth introduction to Spark in the previous talk. Spark is a fast and general engine for large-scale processing of structured and unstructured data.
  9. It is made up of multiple components. Spark SQL allows you to process structured data, MLlib contains a bunch of machine learning algorithms, and GraphX brings graph processing. Spark can run on a cluster thanks to different cluster managers such as YARN, Mesos and its own standalone scheduler.
  10. Apache Spark is written in Scala and runs on the JVM. Every Spark application consists of a driver program containing the main function, the definitions of the distributed datasets on the cluster and the operations to apply to them. Driver programs access Spark through a SparkContext object, which lets the user work with the cluster transparently.
  11. Typically the cluster is composed of several nodes: one is the driver and the others are the executors. The driver typically manages the executors. The executors run the tasks assigned to them and return the results to the driver, which combines them and returns the final result to the user.
  12. Managing and operating on distributed datasets is transparent to the user thanks to: cluster managers, which handle resource management and networking; the SparkContext, which lets you define tasks through more abstract operations; and RDDs, which are Spark's main programming abstraction for representing distributed datasets.
  13. In Spark, datasets distributed across the cluster can be represented as RDDs or DataFrames (or Datasets).
  14. Catalyst transforms API operations on DataFrames into optimized RDD operations. Tungsten is a new Spark SQL component that provides more efficient Spark operations by working directly at the byte level rather than on Java objects. It offers off-heap memory management, using a binary in-memory data representation and managing memory explicitly.
  15. Now it's time to introduce Google Tensorflow. TensorFlow™ is an open source software library for numerical computation using data flow graphs. TensorFlow was originally developed by researchers and engineers working on the Google Brain Team within Google's Machine Intelligence research organization for the purposes of conducting machine learning and deep neural networks research, but the system is general enough to be applicable in a wide variety of other domains as well.
  16. The core is written in C++, but it has an interface in Python. The flexible architecture allows you to deploy computation to one or more CPUs or GPUs in a desktop, server, or mobile device with a single API. 
  17. TensorFlow has been the most widely adopted framework for deep learning since its release in November 2015. Being a product of the big G of course provokes great interest in the community.
  18. The idea is to express a numeric computation as a graph: the framework will convert the graph into efficient low-level operations. The TF graph is composed of nodes, which are the operations, and tensors, which "flow" among the nodes: hence the name TensorFlow. The tensor is the data structure used in the graph: it can be considered a generalization of vectors and matrices. Usually tensors with 4 dimensions are used.
  19. The slide shows the code for a simple operation: the sum of x and y multiplied by 3. There are two steps in the code. The first part describes the graph; then a session is created and run. The session is fed two inputs: the x value (3) and the y value (5). The operations are only executed at the last line of code, when the session runs: it is like lazy evaluation in Spark.
  20. TensorFrames is the binding of TensorFlow in Spark. It lets you manipulate Spark's DataFrames with TensorFlow programs. The code can be written in Python, Scala or directly by passing a protocol buffer description of the operations graph. The binding is done thanks to the JavaCPP project. Spark 1.6 and later are supported.
  21. Here we consider the scenario of using TensorFlow as a standalone library in a Spark application. The data, which are represented in the Tungsten binary format, have to be transformed into Java objects by the Spark process. After this, the data have to be transformed for use by the Python process: this transformation consists of serialization and deserialization with Python pickle. Finally, the C++ buffers used by TensorFlow are filled. This leads to poor performance due to the multiple data transformations and the inter-process communication.
  22. What TensorFrames does is embed the TensorFlow library inside a JAR file, and so in the process that runs on the executor. In this way we can completely skip the Python step: there is only one process on the executor.
  23. This slide shows the Python code that performs the same operations as the previous TensorFlow example. This time, the x and y inputs are columns of a Spark DataFrame. Here we don't have to create and run the session explicitly: the session is run when the TensorFrames map_rows function is triggered by an action. With the map_rows function of the TensorFrames library, the TensorFlow operations are applied row by row to the DataFrame. tfs.row plays the role of the placeholder in the previous example: it automatically infers the data type of the input. Then the TensorFlow add function is used. The result is a DataFrame.
  24. Open notebook tensorframes_1
  25. In TF, operations are expressed on tensors. In TensorFrames, tensors are stored in a Spark DataFrame, which is chunked and distributed across the cluster.
  26. Operations in TensorFrames come in two forms. In the row-based version, TensorFrames works on one row after the other. On the left we can see how it will process the table. In the block-based version, TensorFrames treats a sequence of rows as a single tensor. The block transforms are usually more efficient: there is less overhead in calling TensorFlow. However, in some cases it is not possible to treat a sequence of rows as a single tensor, because the data must be homogeneous.
  27. Looking at the code, the row-based and block-based versions are written in the same manner and return the same result.
  28. In TensorFrames we can compute reduction operations, similarly to Spark. TensorFrames minimizes the data transfer between computers by first reducing all the rows on each computer and then combining the results from each computer.
  29. Open notebook tensorframes_aggregate
  30. Right now we still have the conversion from the Tungsten binary format to Java objects. The author of TensorFrames aims to take the data directly from its binary representation in Tungsten and copy it straight into the C++ buffers used by TensorFlow.
  31. In future releases of Spark, most of the data in Spark will be stored column by column: in that case it will be possible to copy the data directly from Tungsten to the C++ buffers.
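
The two-step model described in the notes (first describe the graph, then run it with concrete inputs) can be illustrated without TensorFlow. The sketch below is a toy stand-in, not TensorFlow's implementation: the classes Placeholder, Constant and Op and the function run are hypothetical names invented for this example. It builds the same graph as the talk's example, z = x + 3 * y, and only computes a value when run is called.

```python
import operator


class Placeholder:
    # An input node whose value is supplied only when the graph is run.
    def __init__(self, name):
        self.name = name

    def evaluate(self, feed):
        return feed[self]


class Constant:
    # A node that always yields the same value.
    def __init__(self, value):
        self.value = value

    def evaluate(self, feed):
        return self.value


class Op:
    # A binary operation node; its inputs are other nodes of the graph.
    def __init__(self, fn, left, right, name):
        self.fn = fn
        self.left = left
        self.right = right
        self.name = name

    def evaluate(self, feed):
        return self.fn(self.left.evaluate(feed), self.right.evaluate(feed))


def run(node, feed):
    # Plays the role of session.run: nothing is computed before this call.
    return node.evaluate(feed)


# Step 1: describe the graph. No arithmetic happens here.
x = Placeholder("x")
y = Placeholder("y")
mul = Op(operator.mul, Constant(3), y, name="mul")
z = Op(operator.add, x, mul, name="z")

# Step 2: run the graph with the inputs used in the talk.
output_value = run(z, {x: 3, y: 5})
print(output_value)  # 18
```

Because the graph is only a description until run is called, the same z can be evaluated again with different inputs. That separation is what allows a framework to ship the description elsewhere, for example to the executors.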