3. Numerical computing
• Queries and algorithms are computation-heavy
• Numerical algorithms, like ML, use very simple data types:
integers/floating-point operations, vectors, matrices
• Not necessarily a lot of data movement
• Numerical bottlenecks are good targets for optimization
10. Apache Spark
Apache Spark™ is a fast and general engine for
large-scale data processing, with built-in
modules for streaming, SQL, machine learning
and graph processing
12. How does it work? (1/3)
Spark is written in Scala and runs on the
Java Virtual Machine.
Every Spark application consists of a driver
program:
It contains the main function, defines
distributed datasets on the cluster and
then applies operations to them.
Driver programs access Spark through a
SparkContext object.
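As a minimal sketch (my own example, not from the slides; Spark 1.x-era API, matching the code used later in this talk), a driver program in PySpark looks like this:
from pyspark import SparkConf, SparkContext

# The driver program: it creates the SparkContext, defines a
# distributed dataset on the cluster and applies operations to it.
conf = SparkConf().setAppName("driver-example")
sc = SparkContext(conf=conf)

rdd = sc.parallelize(range(100))       # distributed dataset on the cluster
print(rdd.map(lambda v: v * 2).sum())  # operations become tasks on the executors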
13. How does it work? (2/3)
To run the operations defined in the
application, the driver typically manages a
number of nodes called executors.
These operations result in tasks that the
executors have to perform.
14. How does it work? (3/3)
Managing and manipulating datasets distributed over a cluster by writing just a
driver program, without taking care of the distributed system, is possible
because of:
• Cluster managers (resources management, networking… )
• SparkContext (task definition from more abstract operations)
• RDD (Spark’s main programming abstraction to represent distributed
datasets)
15. RDD vs DataFrame
• RDD:
Immutable distributed collection of elements of your data,
partitioned across the nodes of your cluster, that can be operated on
in parallel with a low-level API, i.e. transformations and actions.
• DataFrame:
Immutable distributed collection of data, organized into named
columns. It is like a table in a relational database.
16. DataFrame: pros and cons
• Higher-level API, which makes Spark available to a wider
audience
• Performance gains thanks to the Catalyst query optimizer
• Space efficiency by leveraging the Tungsten subsystem
• Off-heap memory management and managing memory explicitly
• Cache-aware computation improves the speed of data processing through
more effective use of the L1/L2/L3 CPU caches
• However, the higher-level API may limit expressiveness
• Complex transformations are better expressed using the RDD API (see the sketch below)
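As an illustrative sketch of the trade-off (the data and column names are my own, not from the talk), the same aggregation written against both APIs:
from pyspark.sql import Row

data = [Row(category="a", amount=1.0), Row(category="a", amount=2.0),
        Row(category="b", amount=3.0)]
df = sqlContext.createDataFrame(data)

# DataFrame API: declarative, optimized by Catalyst
df.groupBy("category").sum("amount").show()

# RDD API: lower-level transformations and actions; more expressive,
# but opaque to the query optimizer
print(df.rdd.map(lambda r: (r.category, r.amount))
        .reduceByKey(lambda a, b: a + b)
        .collect())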
18. Google TensorFlow
• Programming system in which you represent computations as graphs
• Google Brain Team (https://research.google.com/teams/brain/)
• Very popular for deep learning and neural networks
19. Google TensorFlow
• Core written in C++
• Interface in C++ and Python
[Architecture diagram: C++ and Python front ends on top of the Core TensorFlow Execution System, which runs on CPU, GPU, Android, iOS, …]
21. Tensors
• Big idea: Express a numeric computation as a graph.
• Graph nodes are operations which have any number of inputs and outputs
• Graph edges are tensors which flow between nodes
• Tensors can be viewed as a multidimensional array of numbers
• A scalar is a tensor
• A vector is a tensor
• A matrix is a tensor
• And so on…
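A minimal sketch of tensors of increasing rank (TF 1.x-style constants, matching the code on the next slide):
import tensorflow as tf

scalar = tf.constant(3.0)                 # rank 0, shape ()
vector = tf.constant([1.0, 2.0])          # rank 1, shape (2,)
matrix = tf.constant([[1.0, 2.0],
                      [3.0, 4.0]])        # rank 2, shape (2, 2)
cube = tf.zeros([2, 2, 2])                # rank 3, shape (2, 2, 2)
print(scalar.get_shape(), vector.get_shape(),
      matrix.get_shape(), cube.get_shape())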
22. Programming model
import tensorflow as tf

# Describe the graph: z = x + 3 * y
x = tf.placeholder(tf.int32, name="x")
y = tf.placeholder(tf.int32, name="y")
output = tf.add(x, 3 * y, name="z")

# Run the graph in a session, feeding x = 3 and y = 5
session = tf.Session()
output_value = session.run(output, {x: 3, y: 5})  # 3 + 3 * 5 = 18
[Graph: placeholders x (int32) and y (int32); y feeds a mul node together with the constant 3; the mul output and x feed the add node z.]
25. Tensorframes
• TensorFrames (TensorFlow on Spark Dataframes) lets you manipulate
Spark's DataFrames with TensorFlow programs.
• Code written in Python, Scala or directly by passing a protocol buffer
description of the operations graph
• Built on the javacpp project
• Officially supported Spark versions: 1.6+
26. Spark with Tensorflow
[Diagram: data path between the Spark worker process and the worker Python process — Tungsten binary format → Java object → Python pickle (JVM side) → Python pickle (Python side) → C++ buffer for TensorFlow.]
30. Tensors
• TensorFlow expresses operations on tensors: homogeneous data
structures that consist of an array and a shape
• In TensorFrames, tensors are stored in a Spark DataFrame
Spark DataFrame:
x | y
1 | [1.1 1.2]
2 | [2.1 2.2]
3 | [3.1 3.2]
[Diagram: the table is chunked and distributed across the cluster — Node 1 holds row (1, [1.1 1.2]); Node 2 holds rows (2, [2.1 2.2]) and (3, [3.1 3.2]).]
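TensorFrames also ships helpers to inspect and register the tensor shapes hidden in a DataFrame; a sketch using the tfs.print_schema and tfs.analyze helpers from the TensorFrames user guide (the data values mirror the table above):
import tensorframes as tfs

df = sqlContext.createDataFrame(
    [(1, [1.1, 1.2]), (2, [2.1, 2.2]), (3, [3.1, 3.2])], ["x", "y"])

tfs.print_schema(df)   # prints each column with its inferred tensor shape
df2 = tfs.analyze(df)  # scans the data and registers precise shapes for TF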
31. Map operations
• TensorFrames provides most operations in two forms:
• a row-based version
• a block-based version
• The block transforms are usually more efficient: there is less overhead in calling
TensorFlow, and they can manipulate more data at once.
• In some cases, it is not possible to consider a sequence of rows as a single tensor,
because the data must be homogeneous (see the last table below).
Row-based:
process_row: x = 1, y = [1.1 1.2]
process_row: x = 2, y = [2.1 2.2]
process_row: x = 3, y = [3.1 3.2]
Block-based:
process_block: x = [1], y = [[1.1 1.2]]
process_block: x = [2 3], y = [[2.1 2.2] [3.1 3.2]]
Non-homogeneous data (rows cannot be stacked into a single block):
x | y
1 | [1]
2 | [1 2]
3 | [1 2 3]
32. Row-based vs Block-based
Row-based:
import tensorflow as tf
import tensorframes as tfs
from pyspark.sql import Row

data = [Row(x=float(x)) for x in range(5)]
df = sqlContext.createDataFrame(data)
with tf.Graph().as_default() as g:
    x = tfs.row(df, "x")        # placeholder inferred from column "x", one row at a time
    z = tf.add(x, 3, name='z')  # z = x + 3
    df2 = tfs.map_rows(z, df)
df2.show()

Block-based:
import tensorflow as tf
import tensorframes as tfs
from pyspark.sql import Row

data = [Row(x=float(x)) for x in range(5)]
df = sqlContext.createDataFrame(data)
with tf.Graph().as_default() as g:
    x = tfs.block(df, "x")       # placeholder for a whole block of rows
    z = tf.add(x, 3, name='z')   # z = x + 3, applied to the whole block
    df2 = tfs.map_blocks(z, df)
df2.show()
33. Reduction operations
• Reduction operations coalesce a pair or a collection of rows and transform them
into a single row, until there is one row left.
• The transforms must be algebraic: the order in which they are done should not matter
f(f(a, b), c) == f(a, f(b, c))
import tensorflow as tf
import tensorframes as tfs
from pyspark.sql import Row

data = [Row(x=float(x)) for x in range(5)]
df = sqlContext.createDataFrame(data)
with tf.Graph().as_default() as g:
    # Block placeholder for column "x"; the output tensor is named after the column
    x_input = tfs.block(df, "x", tf_name="x_input")
    x = tf.reduce_sum(x_input, name='x')  # sum within a block, then across blocks
    res = tfs.reduce_blocks(x, df)
print(res)  # 0.0 + 1.0 + 2.0 + 3.0 + 4.0 = 10.0
Hello everybody! I am Marco Saviano, a big data engineer at AgileLab. Today I am here to introduce to you an implementation of Google TensorFlow in Apache Spark, called TensorFrames.
Here is the outline of my talk. I will start by talking about our focus, numerical computing. After that I will talk about Apache Spark, just to recap DataFrames, considering the great presentation by Burak. Later I will introduce Google TensorFlow and TensorFrames, an implementation of TensorFlow in Spark. Finally I will talk about future developments of TensorFrames, considering that right now it is an experimental project.
All the people who work in the big data field, for example big data engineers and data scientists, are well aware that operations on data, like queries or analyses, involve heavy numerical computation.
If we consider ML algorithms, they use very simple data types and structures, like integer/floating-point operations, vectors and matrices. We can say that the bread and butter of data science can be told in three words: integers, floats and doubles.
Usually these algorithms don't need a lot of data movement. If we consider a numerical pipeline, the ETL part, in which the data are imported, transformed and loaded, is really fast; the bottleneck in the pipeline is the number-crunching part.
So these numerical bottlenecks are really good targets for optimization.
To improve the performance of numerical computation, in recent years the evolution of computing power has followed two paths. One is to "scale up". Starting from a common laptop, we can use a powerful personal computer, or use a GPU that exploits its power in numerical computation. It is well known that GPUs aren't used just for gaming, but also for numerical computation: their architecture allows them to perform numerical computation much faster than a CPU.
Finally, if we implement a new algorithm and we work for one of the most powerful companies in the world, we can ask our team to create a specific chip for our application.
So scaling up means using a powerful standalone machine.
On the other hand, scaling out means using a cluster: multiple machines that work in parallel.
This slide shows the main frameworks used for HPC. We can observe that TensorFlow is at the top of the "scale-up" axis, so it brings high performance on standalone architectures, while Spark is at the top of the "scale-out" axis.
The goal of today's talk is to join these two paths, combining one of the best frameworks for standalone computing, TensorFlow, with Apache Spark, which excels at cluster computing.
TensorFlow and Spark are two open source successes. The two graphs show the commits on the master branches of the two GitHub repositories: at the top we have Spark, with 1015 contributors; at the bottom TensorFlow, with 582 contributors.
They are also used by big enterprise companies.
In this section I will go quite fast, considering the deep introduction to Spark done in the previous talk.
Spark is a fast and general engine for large-scale processing of structured and unstructured data.
It is composed of multiple components. We have Spark SQL, which allows us to process structured data; MLlib contains a collection of machine learning algorithms; GraphX brings graph processing.
Spark can run on a cluster thanks to different cluster managers, like YARN, Mesos and its standalone scheduler.
Apache Spark is written in Scala and runs on the JVM. Every Spark application consists of a driver program containing the main function, the definition of distributed datasets on the cluster and the operations to apply to them. Driver programs access Spark through a SparkContext object, which allows the user to use the cluster's computing power in a transparent way.
Typically the cluster is composed of different nodes, of which one is the driver and the others are the executors.
The driver typically manages the executors. The executors compute the tasks they have been assigned and return the results to the driver, which combines them and returns the final result to the user.
The management of and the operations on distributed datasets are transparent to the user thanks to:
Cluster managers, which handle resource management and networking;
SparkContext, which allows tasks to be defined from more abstract operations;
RDDs, Spark's main programming abstraction to represent distributed datasets.
In Spark, the datasets distributed across the cluster can be represented as RDDs or DataFrames (or Datasets).
Catalyst transforms API operations on DataFrames into optimized RDD operations.
Tungsten is a new Spark SQL component that provides more efficient Spark operations by working directly at the byte level rather than on Java objects: off-heap memory management, using a binary in-memory data representation and managing memory explicitly.
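To see Catalyst at work, one can ask Spark for the plans behind a DataFrame operation (a sketch; the column names are my own):
df = sqlContext.createDataFrame([(1, "a"), (2, "b")], ["id", "label"])
# explain(True) prints the parsed, analyzed and optimized logical plans
# plus the physical plan generated by Catalyst.
df.filter(df.id > 1).select("label").explain(True)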
Now it's time to introduce Google Tensorflow.
TensorFlow™ is an open source software library for numerical computation using data flow graphs.
TensorFlow was originally developed by researchers and engineers working on the Google Brain Team within Google's Machine Intelligence research organization for the purposes of conducting machine learning and deep neural networks research, but the system is general enough to be applicable in a wide variety of other domains as well.
The core is written in C++, but it has an interface in Python.
The flexible architecture allows you to deploy computation to one or more CPUs or GPUs in a desktop, server, or mobile device with a single API.
TensorFlow has been the most adopted framework for deep learning since it was released in November 2015. Being a product of big G of course provokes great interest in the community.
The idea is to express a numeric computation as a graph: the framework will convert the graph into efficient low-level operations.
The TF graph is composed of nodes, which are the operations, and tensors that "flow" among the nodes: hence the name TensorFlow. The tensor is the data structure used in the graph: it can be considered a generalization of vectors and matrices. Usually tensors with four dimensions are used.
The slide shows the code for a simple operation: the sum of x with y multiplied by 3. There are two steps in the code: first the graph is described, and then a session is created and run. The session is fed with two inputs, the x value (3) and the y value (5).
The operations are executed only at the last line of code: it is like lazy evaluation in Spark.
TensorFrames is the binding of TensorFlow in Spark. It allows you to manipulate Spark's DataFrames with TensorFlow programs.
The code can be written in Python, Scala or directly by passing a protocol buffer description of the operations graph.
The binding is done thanks to the javacpp project.
It is supported on Spark 1.6 and greater.
Here we consider the scenario of using TensorFlow as a standalone library in a Spark application. The data, which are represented in the Tungsten binary format, have to be transformed into Java objects by the Spark process. After this, the data have to be transformed to be used by the Python process: this transformation consists of serialization and deserialization via Python pickle. Finally, the C++ buffers used by TensorFlow are filled.
This leads to bad performance due to multiple data transformations and inter-process communication.
What TensorFrames does is embed the TensorFlow library inside a JAR file, and therefore in the process that runs on the executor.
In this way it can completely skip the Python step: there is only one process on the executor.
This slide shows the Python code to do the same operations done in the previous TensorFlow example. This time, the x and y inputs are columns of a Spark DataFrame.
Here we do not have to create and run the session explicitly: the session is run when the map_rows TensorFrames function is triggered by an action.
With the map_rows function of the TensorFrames library, the TensorFlow operations are applied row by row to the DataFrame.
tfs.row plays the role of the "placeholder" used in the previous example: it automatically infers the data type of the input. Then the TensorFlow add function is used.
The result will be a DataFrame.
Open notebook tensorframes_1
In TF, the operations are expressed on tensors. In TensorFrames, tensors are stored in a Spark DataFrame: it chunks and distributes the DataFrame across the cluster.
The operations in TensorFrames can be done in two forms.
In the row-based version, TensorFrames works on one row after the other. On the left we can see how it processes the table.
In the block-based version, TensorFrames considers a sequence of rows as a single tensor.
The block transforms are usually more efficient: they imply less overhead in calling TensorFlow.
However, in some cases it is not possible to consider a sequence of rows as a single tensor, because the data must be homogeneous.
Looking at the code, the row-based and the block-based versions are written in the same manner, and they return the same result.
In TensorFrames, we can compute reduction operations, similarly to Spark.
TensorFrames minimizes the data transfer between machines by first reducing all the rows on each machine and then combining the results from each machine.
Open notebook tensorframes_aggregate
Right now we still have the conversion from the Tungsten binary format into Java objects. The author of TensorFrames aims to take the data directly from its binary representation in Tungsten and copy it straight into the C++ buffers used by TensorFlow.
In future releases of Spark, most of the data will be stored column by column: in that case it will be possible to copy the data directly from Tungsten into the C++ buffers.