3. Numerical computing
• Queries and algorithms are computation-heavy
• Numerical algorithms, like ML, use very simple data types:
integers/floating-point operations, vectors, matrices
• Not necessarily a lot of data movement
• Numerical bottlenecks are good targets for optimization
10. Apache Spark
Apache Spark™ is a fast and general engine for
large-scale data processing, with built-in
modules for streaming, SQL, machine learning
and graph processing
12. How does it work? (1/3)
Spark is written in Scala and runs on the
Java Virtual Machine.
Every Spark application consists of a driver
program:
It contains the main function, defines
distributed datasets on the cluster and
then applies operations to them.
Driver programs access Spark through a
SparkContext object.
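As a minimal sketch (my own example, not from the slides; Spark 1.x-era API, matching the code used later in this talk), a driver program in PySpark looks like this:
from pyspark import SparkConf, SparkContext

# The driver program: it creates the SparkContext, defines a
# distributed dataset on the cluster and applies operations to it.
conf = SparkConf().setAppName("driver-example")
sc = SparkContext(conf=conf)

rdd = sc.parallelize(range(100))       # distributed dataset on the cluster
print(rdd.map(lambda v: v * 2).sum())  # operations become tasks on the executors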
13. How does it work? (2/3)
To run the operations defined in the
application, the driver typically manages a
number of nodes called executors.
These operations result in tasks that the
executors have to perform.
14. How does it work? (3/3)
Managing and manipulating datasets distributed over a cluster by writing just a
driver program, without taking care of the distributed system, is possible
because of:
• Cluster managers (resources management, networking… )
• SparkContext (task definition from more abstract operations)
• RDD (Spark’s main programming abstraction to represent distributed
datasets)
15. RDD vs DataFrame
• RDD:
Immutable distributed collection of elements of your data,
partitioned across the nodes of your cluster, that can be operated on
in parallel with a low-level API, i.e. transformations and actions.
• DataFrame:
Immutable distributed collection of data, organized into named
columns. It is like a table in a relational database.
16. DataFrame: pros and cons
• Higher-level API, which makes Spark available to a wider
audience
• Performance gains thanks to the Catalyst query optimizer
• Space efficiency by leveraging the Tungsten subsystem
• Off-heap memory management and managing memory explicitly
• Cache-aware computation improves the speed of data processing through
more effective use of the L1/L2/L3 CPU caches
• However, the higher-level API may limit expressiveness
• Complex transformations are better expressed using the RDD API (see the sketch below)
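As an illustrative sketch of the trade-off (the data and column names are my own, not from the talk), the same aggregation written against both APIs:
from pyspark.sql import Row

data = [Row(category="a", amount=1.0), Row(category="a", amount=2.0),
        Row(category="b", amount=3.0)]
df = sqlContext.createDataFrame(data)

# DataFrame API: declarative, optimized by Catalyst
df.groupBy("category").sum("amount").show()

# RDD API: lower-level transformations and actions; more expressive,
# but opaque to the query optimizer
print(df.rdd.map(lambda r: (r.category, r.amount))
        .reduceByKey(lambda a, b: a + b)
        .collect())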
18. Google TensorFlow
• Programming system in which you represent computations as graphs
• Google Brain Team (https://research.google.com/teams/brain/)
• Very popular for deep learning and neural networks
19. Google TensorFlow
• Core written in C++
• Interface in C++ and Python
[Architecture diagram: C++ and Python front ends on top of the Core TensorFlow Execution System, which runs on CPU, GPU, Android, iOS, …]
21. Tensors
• Big idea: Express a numeric computation as a graph.
• Graph nodes are operations which have any number of inputs and outputs
• Graph edges are tensors which flow between nodes
• Tensors can be viewed as a multidimensional array of numbers
• A scalar is a tensor
• A vector is a tensor
• A matrix is a tensor
• And so on…
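A minimal sketch of tensors of increasing rank (TF 1.x-style constants, matching the code on the next slide):
import tensorflow as tf

scalar = tf.constant(3.0)                 # rank 0, shape ()
vector = tf.constant([1.0, 2.0])          # rank 1, shape (2,)
matrix = tf.constant([[1.0, 2.0],
                      [3.0, 4.0]])        # rank 2, shape (2, 2)
cube = tf.zeros([2, 2, 2])                # rank 3, shape (2, 2, 2)
print(scalar.get_shape(), vector.get_shape(),
      matrix.get_shape(), cube.get_shape())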
22. Programming model
import tensorflow as tf

# Describe the graph: z = x + 3 * y
x = tf.placeholder(tf.int32, name="x")
y = tf.placeholder(tf.int32, name="y")
output = tf.add(x, 3 * y, name="z")

# Run the graph in a session, feeding x = 3 and y = 5
session = tf.Session()
output_value = session.run(output, {x: 3, y: 5})  # 3 + 3 * 5 = 18
[Graph: placeholders x (int32) and y (int32); y feeds a mul node together with the constant 3; the mul output and x feed the add node z.]
25. Tensorframes
• TensorFrames (TensorFlow on Spark Dataframes) lets you manipulate
Spark's DataFrames with TensorFlow programs.
• Code written in Python, Scala or directly by passing a protocol buffer
description of the operations graph
• Built on the javacpp project
• Officially supported Spark versions: 1.6+
26. Spark with Tensorflow
[Diagram: data path between the Spark worker process and the worker Python process — Tungsten binary format → Java object → Python pickle (JVM side) → Python pickle (Python side) → C++ buffer for TensorFlow.]
30. Tensors
• TensorFlow expresses operations on tensors: homogeneous data
structures that consist of an array and a shape
• In TensorFrames, tensors are stored in a Spark DataFrame
Spark DataFrame:
x | y
1 | [1.1 1.2]
2 | [2.1 2.2]
3 | [3.1 3.2]
[Diagram: the table is chunked and distributed across the cluster — Node 1 holds row (1, [1.1 1.2]); Node 2 holds rows (2, [2.1 2.2]) and (3, [3.1 3.2]).]
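TensorFrames also ships helpers to inspect and register the tensor shapes hidden in a DataFrame; a sketch using the tfs.print_schema and tfs.analyze helpers from the TensorFrames user guide (the data values mirror the table above):
import tensorframes as tfs

df = sqlContext.createDataFrame(
    [(1, [1.1, 1.2]), (2, [2.1, 2.2]), (3, [3.1, 3.2])], ["x", "y"])

tfs.print_schema(df)   # prints each column with its inferred tensor shape
df2 = tfs.analyze(df)  # scans the data and registers precise shapes for TF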
31. Map operations
• TensorFrames provides most operations in two forms:
• a row-based version
• a block-based version
• The block transforms are usually more efficient: there is less overhead in calling
TensorFlow, and they can manipulate more data at once.
• In some cases, it is not possible to consider a sequence of rows as a single tensor,
because the data must be homogeneous (see the last table below).
Row-based:
process_row: x = 1, y = [1.1 1.2]
process_row: x = 2, y = [2.1 2.2]
process_row: x = 3, y = [3.1 3.2]
Block-based:
process_block: x = [1], y = [[1.1 1.2]]
process_block: x = [2 3], y = [[2.1 2.2] [3.1 3.2]]
Non-homogeneous data (rows cannot be stacked into a single block):
x | y
1 | [1]
2 | [1 2]
3 | [1 2 3]
32. Row-based vs Block-based
Row-based:
import tensorflow as tf
import tensorframes as tfs
from pyspark.sql import Row

data = [Row(x=float(x)) for x in range(5)]
df = sqlContext.createDataFrame(data)
with tf.Graph().as_default() as g:
    x = tfs.row(df, "x")        # placeholder inferred from column "x", one row at a time
    z = tf.add(x, 3, name='z')  # z = x + 3
    df2 = tfs.map_rows(z, df)
df2.show()

Block-based:
import tensorflow as tf
import tensorframes as tfs
from pyspark.sql import Row

data = [Row(x=float(x)) for x in range(5)]
df = sqlContext.createDataFrame(data)
with tf.Graph().as_default() as g:
    x = tfs.block(df, "x")       # placeholder for a whole block of rows
    z = tf.add(x, 3, name='z')   # z = x + 3, applied to the whole block
    df2 = tfs.map_blocks(z, df)
df2.show()
33. Reduction operations
• Reduction operations coalesce a pair or a collection of rows and transform them
into a single row, until there is one row left.
• The transforms must be algebraic: the order in which they are done should not matter
f(f(a, b), c) == f(a, f(b, c))
import tensorflow as tf
import tensorframes as tfs
from pyspark.sql import Row

data = [Row(x=float(x)) for x in range(5)]
df = sqlContext.createDataFrame(data)
with tf.Graph().as_default() as g:
    # Block placeholder for column "x"; the output tensor is named after the column
    x_input = tfs.block(df, "x", tf_name="x_input")
    x = tf.reduce_sum(x_input, name='x')  # sum within a block, then across blocks
    res = tfs.reduce_blocks(x, df)
print(res)  # 0.0 + 1.0 + 2.0 + 3.0 + 4.0 = 10.0
Hello everybody! I am Marco Saviano, a big data engineer at AgileLab. Today I am here to introduce to you an implementation of Google TensorFlow in Apache Spark, called TensorFrames.
Here is the outline of my talk. I will start by talking about our focus, numerical computing. After that I will talk about Apache Spark, just to recap DataFrames, considering the great presentation by Burak. Later I will introduce Google TensorFlow and TensorFrames, an implementation of TensorFlow in Spark. Finally I will talk about future developments of TensorFrames, considering that right now it is an experimental project.
All the people who work in the big data field, for example big data engineers and data scientists, are well aware that operations on data, like queries or analyses, involve heavy numerical computation.
If we consider ML algorithms, they use very simple data types and structures, like integer/floating-point operations, vectors and matrices. We can say that the bread and butter of data science can be told in three words: integers, floats and doubles.
Usually these algorithms don't need a lot of data movement. If we consider a numerical pipeline, the ETL part, in which the data are imported, transformed and loaded, is really fast; the bottleneck in the pipeline is the number-crunching part.
So these numerical bottlenecks are really good targets for optimization.
To improve the performance of numerical computation, in recent years the evolution of computing power has followed two paths. One is to "scale up". Starting from a common laptop, we can use a powerful personal computer, or use a GPU that exploits its power in numerical computation. It is well known that GPUs aren't used just for gaming, but also for numerical computation: their architecture allows them to perform numerical computation much faster than a CPU.
Finally, if we implement a new algorithm and we work for one of the most powerful companies in the world, we can ask our team to create a specific chip for our application.
So scaling up means using a powerful standalone machine.
On the other hand, scaling out means using a cluster: multiple machines that work in parallel.
This slide shows the main frameworks used for HPC. We can observe that TensorFlow is at the top of the "scale-up" axis, so it brings high performance on standalone architectures, while Spark is at the top of the "scale-out" axis.
The goal of today's talk is to join these two paths, combining one of the best frameworks for standalone computing, TensorFlow, with Apache Spark, which excels at cluster computing.
TensorFlow and Spark are two open source successes. The two graphs show the commits on the master branches of the two GitHub repositories: at the top we have Spark, with 1015 contributors; at the bottom TensorFlow, with 582 contributors.
They are also used by big enterprise companies.
In this section I will go quite fast, considering the deep introduction to Spark done in the previous talk.
Spark is a fast and general engine for large-scale processing of structured and unstructured data.
It is composed of multiple components. We have Spark SQL, which allows us to process structured data; MLlib contains a collection of machine learning algorithms; GraphX brings graph processing.
Spark can run on a cluster thanks to different cluster managers, like YARN, Mesos and its standalone scheduler.
Apache Spark is written in Scala and runs on the JVM. Every Spark application consists of a driver program containing the main function, the definition of distributed datasets on the cluster and the operations to apply to them. Driver programs access Spark through a SparkContext object, which allows the user to use the cluster's computing power in a transparent way.
Typically the cluster is composed of different nodes, of which one is the driver and the others are the executors.
The driver typically manages the executors. The executors compute the tasks they have been assigned and return the results to the driver, which combines them and returns the final result to the user.
The management of and the operations on distributed datasets are transparent to the user thanks to:
Cluster managers, which handle resource management and networking;
SparkContext, which allows tasks to be defined from more abstract operations;
RDDs, Spark's main programming abstraction to represent distributed datasets.
In Spark, the datasets distributed across the cluster can be represented as RDDs or DataFrames (or Datasets).
Catalyst transforms API operations on DataFrames into optimized RDD operations.
Tungsten is a new Spark SQL component that provides more efficient Spark operations by working directly at the byte level rather than on Java objects: off-heap memory management, using a binary in-memory data representation and managing memory explicitly.
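To see Catalyst at work, one can ask Spark for the plans behind a DataFrame operation (a sketch; the column names are my own):
df = sqlContext.createDataFrame([(1, "a"), (2, "b")], ["id", "label"])
# explain(True) prints the parsed, analyzed and optimized logical plans
# plus the physical plan generated by Catalyst.
df.filter(df.id > 1).select("label").explain(True)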
Now it's time to introduce Google Tensorflow.
TensorFlow™ is an open source software library for numerical computation using data flow graphs.
TensorFlow was originally developed by researchers and engineers working on the Google Brain Team within Google's Machine Intelligence research organization for the purposes of conducting machine learning and deep neural networks research, but the system is general enough to be applicable in a wide variety of other domains as well.
The core is written in C++, but it has an interface in Python.
The flexible architecture allows you to deploy computation to one or more CPUs or GPUs in a desktop, server, or mobile device with a single API.
TensorFlow has been the most adopted framework for deep learning since it was released in November 2015. Being a product of big G of course provokes great interest in the community.
The idea is to express a numeric computation as a graph: the framework will convert the graph into efficient low-level operations.
The TF graph is composed of nodes, which are the operations, and tensors that "flow" among the nodes: hence the name TensorFlow. The tensor is the data structure used in the graph: it can be considered a generalization of vectors and matrices. Usually tensors with four dimensions are used.
The slide shows the code for a simple operation: the sum of x with y multiplied by 3. There are two steps in the code: first the graph is described, and then a session is created and run. The session is fed with two inputs, the x value (3) and the y value (5).
The operations are executed only at the last line of code: it is like lazy evaluation in Spark.
TensorFrames is the binding of TensorFlow in Spark. It allows you to manipulate Spark's DataFrames with TensorFlow programs.
The code can be written in Python, Scala or directly by passing a protocol buffer description of the operations graph.
The binding is done thanks to the javacpp project.
It is supported on Spark 1.6 and greater.
Here we consider the scenario of using TensorFlow as a standalone library in a Spark application. The data, which are represented in the Tungsten binary format, have to be transformed into Java objects by the Spark process. After this, the data have to be transformed to be used by the Python process: this transformation consists of serialization and deserialization via Python pickle. Finally, the C++ buffers used by TensorFlow are filled.
This leads to bad performance due to multiple data transformations and inter-process communication.
What TensorFrames does is embed the TensorFlow library inside a JAR file, and therefore in the process that runs on the executor.
In this way it can completely skip the Python step: there is only one process on the executor.
This slide shows the Python code to do the same operations done in the previous TensorFlow example. This time, the x and y inputs are columns of a Spark DataFrame.
Here we do not have to create and run the session explicitly: the session is run when the map_rows TensorFrames function is triggered by an action.
With the map_rows function of the TensorFrames library, the TensorFlow operations are applied row by row to the DataFrame.
tfs.row plays the role of the "placeholder" used in the previous example: it automatically infers the data type of the input. Then the TensorFlow add function is used.
The result will be a DataFrame.
Open notebook tensorframes_1
In TF, the operations are expressed on tensors. In TensorFrames, tensors are stored in a Spark DataFrame: it chunks and distributes the DataFrame across the cluster.
The operations in TensorFrames can be done in two forms.
In the row-based version, TensorFrames works on one row after the other. On the left we can see how it processes the table.
In the block-based version, TensorFrames considers a sequence of rows as a single tensor.
The block transforms are usually more efficient: they imply less overhead in calling TensorFlow.
However, in some cases it is not possible to consider a sequence of rows as a single tensor, because the data must be homogeneous.
Looking at the code, the row-based and the block-based versions are written in the same manner, and they return the same result.
In TensorFrames, we can compute reduction operations, similarly to Spark.
TensorFrames minimizes the data transfer between machines by first reducing all the rows on each machine and then combining the results from each machine.
Open notebook tensorframes_aggregate
Right now we still have the conversion from the Tungsten binary format into Java objects. The author of TensorFrames aims to take the data directly from its binary representation in Tungsten and copy it straight into the C++ buffers used by TensorFlow.
In future releases of Spark, most of the data will be stored column by column: in that case it will be possible to copy the data directly from Tungsten into the C++ buffers.