SPARK
THE NEW AGE FOR DATA SCIENTIST
In a very short time, Apache Spark has emerged as the next generation big
data processing engine, and is being applied throughout the industry faster
than ever.
TABLE OF CONTENTS
GETTING STARTED
   WHAT IS APACHE SPARK?
   BIG DATA: WHAT IT IS AND WHY IT MATTERS
      Parallelism enables Big Data
      Big data defined
   WHY SPARK?
   FROM BIG CODE TO SMALL CODE
      What language should I choose for my next Apache Spark project?
A BRIEF HISTORY OF SPARK
   THE INTERVIEW
   THE GROWTH OF SPARK
   THE COMPANIES AROUND SPARK
A UNIFIED STACK
   SPARK CORE
      How Spark works
         RDD
         Key Value Methods
         Caching Data
      Distribution and Instrumentation
         Spark Submit
         Cluster Manager
   SPARK LIBRARIES
      Spark SQL
         Impala vs. Hive: differences between SQL-on-Hadoop components
         Cloudera Impala
         Apache Hive
         Other features of Hive
      Streaming Data
      Machine Learning
         Machine Learning Library (MLlib)
      GraphX
         How GraphX works
   OPTIMIZATIONS AND THE FUTURE
      Understanding closures
         Local vs. cluster modes
      Broadcasting
      Optimizing Partitioning
      Future of Spark
         R
         Tachyon
         Project Tungsten
REFERENCES
CONTACT INFORMATION
Getting Started
In this section I will explain the concepts behind Spark: how Spark works, how it was built, the advantages of using it, and a brief nod to the history of the MapReduce algorithm.
WHAT IS APACHE SPARK?
Apache Spark is an open-source framework developed by the AMPLab at the University of California, Berkeley, and subsequently donated to the Apache Software Foundation. Unlike Hadoop's disk-based, two-stage MapReduce paradigm, the in-memory primitives provided by Spark allow performance up to 100 times better for certain workloads.
Apache Spark is a general-purpose distributed computing engine for processing and analyzing large
amounts of data. Spark offers performance improvements over MapReduce, especially when Spark’s in-
memory computing capabilities can be leveraged. In practice, Spark extends the popular MapReduce
model to efficiently support more types of computations, including interactive queries and stream
processing. Speed is important in processing large datasets, or Big Data, as it means the difference between
exploring data interactively and waiting minutes or hours.
Figure 1 – Spark is the best solution for Big Data
BIG DATA: WHAT IT IS AND WHY IT MATTERS
Big data is a popular term used to describe the exponential growth and availability of data, both structured and unstructured. Today Big Data is as important to business – and society – as the Internet has become.
Why? More data may lead to more accurate analyses. More accurate analyses may lead to more confident decision making. And better decisions can mean greater operational efficiency, cost reductions and reduced risk.
Organizations are increasingly generating large volumes of data as a result of instrumented business
processes, monitoring of user activity, web site tracking, sensors, finance, accounting, among other
reasons. For example, with the advent of social network websites, users create records of their lives by
daily posting details of activities they perform, events they attend, places they visit, pictures they take,
and things they enjoy and want. This data deluge is often referred to as Big Data, a term that conveys the
challenges it poses to existing infrastructure with respect to storage, management, interoperability, governance and analysis of large amounts of data.
Parallelism enables Big Data
• Big Data
- Applying data parallelism to the processing of large amounts of data (terabytes, petabytes, …)
- Exploiting clusters of computing resources and the Cloud
• MapReduce
- A data parallelism technique introduced by Google for big data processing
- Large scale: distributes both data and computation over large clusters of machines
- Can be used to parallelize different kinds of computational tasks
- Open source implementation: Hadoop
Big data defined
As far back as 2001, industry analyst Doug Laney articulated the now mainstream definition of big data
as the three Vs of big data: volume, velocity and variety.
Figure 2 – Three Vs of Big Data
• Volume defines the amount of data: many factors contribute to the increase in data volume, such as transaction-based data stored over the years. In the past, excessive data volume was a storage issue. But with decreasing storage costs, other issues emerge, including how to determine relevance within large data volumes and how to use analytics to create value from relevant data.
• Velocity refers to the rate at which the data is produced and processed, for example in the batch
mode, near-time mode, real-time mode and streams. Data is streaming in at unprecedented speed
and must be dealt with in a timely manner. Reacting quickly enough to deal with data velocity is
a challenge for most organizations.
• Variety represents the data types. Data today comes in all types of formats. Structured, numeric
data in traditional databases. Information created from line-of-business applications.
Unstructured text documents, email, video, audio, stock ticker data and financial transactions.
Managing, merging and governing different varieties of data is something many organizations
still grapple with.
In addition, many companies consider two further dimensions when they think about big data:
• Variability. In addition to the increasing velocities and varieties of data, data flows can be highly
inconsistent with periodic peaks. Is something trending in social media? Daily, seasonal and
event-triggered peak data loads can be challenging to manage. Even more so with unstructured
data involved.
• Complexity. Today's data comes from multiple sources. And it is still an undertaking to link,
match, cleanse and transform data across systems. However, it is necessary to connect and
correlate relationships, hierarchies and multiple data linkages or your data can quickly spiral out
of control.
Figure 3 – The cube of Big Data
WHY SPARK?
On the generality side, Spark is designed to cover a wide range of workloads that previously required
separate distributed systems, including batch applications, iterative algorithms, interactive queries and
streaming.
By supporting these workloads in the same engine, Spark makes it easy and inexpensive to combine
different processing types, which is often necessary in production data analysis pipelines. In addition, it
reduces the management burden of maintaining separate tools.
Spark is designed to be highly accessible, offering simple APIs in Python, Java, Scala, and SQL, and rich
built-in libraries. It also integrates closely with other Big Data tools. In particular, Spark can run in Hadoop
clusters and access any Hadoop data source, including Cassandra, for example.
FROM BIG CODE TO SMALL CODE
The most difficult thing for big data developers today is choosing a programming language for big data
applications. There are many languages that a developer can use, such as Python or R: both are the
languages of choice among data scientists for building Hadoop applications. With the advent of various
big data frameworks like Apache Spark or Apache Kafka, the Scala programming language has gained prominence among big data developers.
What language should I choose for my next Apache Spark project?
The answer to this question varies, as it depends on the programming expertise of the developers, but Scala has arguably become the language of choice for working with big data frameworks like Apache Spark. Moreover, Spark itself is mostly written in Scala; more precisely, the code base breaks down roughly as follows:
1. Scala 81.2%
2. Python 8.7%
3. Java 8.5%
4. Shell 1.3%
5. Other 0.3%
Figure 4 – The Growth of Scala
Writing a program in Scala in a functional style need not hurt performance. Functional programming is an abstraction that restricts certain practices, for example side-effecting I/O. Enforcing this abstraction allows the code to be executed in a parallel execution environment. For this reason Spark exposes an essentially functional paradigm, and Scala turns out to be the most efficient of the supported languages.
Figure 5 – From Big Code to Tiny Code
In the next sections I will explain the advantages of Spark. Its philosophy of tight integration has several benefits:
- All libraries and higher-level components, in the stack, benefit from improvements at the
lower layers.
For example: when Spark’s core engine adds an optimization, SQL and machine learning
libraries automatically speed up as well.
- The costs associated with running the stack are minimized, because instead of running 5-10
independent software systems, an organization needs to run only one. These costs include
deployment, maintenance, testing, support, and others.
This also means that each time a new component is added to the Spark stack, every organization that uses
Spark will immediately be able to try this new component. This changes the cost of trying out a new type
of data analysis from downloading, deploying, and learning a new software project to upgrading Spark.
The advantages are the following:
- Readability and expressiveness
- Speed and testability
- Interactivity
- Fault tolerance
- A unified approach to Big Data
A Brief History of Spark
Apache Spark is one of the most interesting frameworks in big data in recent years. Spark had its humble
beginnings as a research project at UC Berkeley. Recently, O’Reilly’s Ben Lorica interviewed Ion Stoica, UC Berkeley professor and Databricks CEO, about the history of Apache Spark.
The question was: “How did Apache Spark get started?”
THE INTERVIEW
Ion Stoica replied:
“the story started back in 2009 with Mesos1. It was a class project in UC Berkley. Idea was to
build a cluster management framework, which can support different kind of cluster
computing systems. Once Mesos was built, then we thought what we can build on top of Mesos.
That’s how spark was born.”
“We wanted to show how easy it was to build a framework from scratch in Mesos. We also
wanted to be different from existing cluster computing systems. At that time, Hadoop targeted
batch processing, so we targeted interactive iterative computations like machine learning.”
“The requirement for machine learning came from the head of department at that time. They
were trying to scale machine learning on top of Hadoop. Hadoop was slow for ML as it needed
to exchange data between iterations through HDFS. The requirement for interactive querying
came from the company, Conviva, which I founded that time. As part of video analytical tool
built by company, we supported adhoc querying. We were using MySQL at that time, but we
knew it’s not good enough. Also there are many companies came out of UC, like Yahoo,
Facebook were facing similar challenges. So we knew there was need for framework focused
on interactive and iterative processing.2”
1 Mesos is a distributed systems kernel: it abstracts CPU, memory, storage, and other compute resources away from machines (physical or virtual), enabling fault-tolerant and elastic distributed systems to be built easily and run effectively. A later section explains Mesos in more detail.
2 (Phatak, 2015)
THE GROWTH OF SPARK
Spark is an open source project that has been built and is maintained by a thriving and diverse community
of developers. If you or your organization are trying Spark for the first time, you might be interested in
the history of the project. Spark started in 2009 as a research project in the UC Berkeley RAD Lab, later to
become the AMPLab. The researchers in the lab had previously been working on Hadoop Map-Reduce,
and observed that MapReduce was inefficient for iterative and interactive computing jobs. Thus, from the
beginning, Spark was designed to be fast for interactive queries and iterative algorithms, bringing in ideas
like support for in-memory storage and efficient fault recovery. Research papers were published about Spark at academic conferences and, soon after its creation in 2009, it was already 10–20× faster than MapReduce for certain jobs.
Figure 6 – The History of Spark
In 2011, the AMPLab started to develop higher-level components on Spark, such as Shark (Hive on Spark)
and Spark Streaming. These and other components are sometimes referred to as the Berkeley Data
Analytics Stack (BDAS). Spark was first open sourced in March 2010, and was transferred to the Apache
Software Foundation in June 2013, where it is now a top-level project.
THE COMPANIES AROUND SPARK
Today, many companies use Apache Spark. Companies like eBay, Pandora, Twitter, Netflix and others use Spark for fast query analysis, improving their business.
Figure 7 – Who’s using Spark?
A Unified Stack
The Spark project contains multiple closely integrated components. At its core, Spark is a “computational
engine” that is responsible for scheduling, distributing, and monitoring applications consisting of many
computational tasks across many worker machines, or a computing cluster. Because the core engine of
Spark is both fast and general-purpose, it powers multiple higher-level components specialized for
various workloads, such as SQL or machine learning.
Figure 8 – The Spark stack
Finally, one of the largest advantages of tight integration is the ability to build applications that seamlessly
combine different processing models. For example, in Spark you can write one application that uses
machine learning to classify data in real time as it is ingested from streaming sources. Simultaneously,
analysts can query the resulting data, also in real time, via SQL (e.g., to join the data with unstructured
logfiles). In addition, more sophisticated data engineers and data scientists can access the same data via
the Python shell for ad hoc analysis. Others might access the data in standalone batch applications. All the
while, the IT team has to maintain only one system. The figure above shows this at a glance, highlighting the Spark architecture.
SPARK CORE
Spark Core contains the basic functionality of Spark, including components for task scheduling, memory
management, fault recovery, interacting with storage systems, and more. It provides distributed task
dispatching, scheduling, and basic I/O functionalities, exposed through an application programming
interface (for Java, Python, Scala, and R) centered on the RDD abstraction. Spark Core is based on the
Resilient Distributed Datasets (RDDs), which are Spark’s main programming abstraction.
More precisely, an RDD is a huge list of objects in memory, but distributed: you can picture it as one large collection spread across many compute nodes. It is a big set of objects that we can distribute and parallelize as we want, so in practice it scales horizontally reasonably well. A "driver" program
invokes parallel operations such as map, filter or reduce on an RDD by passing a function to Spark, which
then schedules the function's execution in parallel on the cluster3. These operations, and additional ones
such as joins, take RDDs as input and produce new RDDs. RDDs are immutable and their operations are
lazy; fault-tolerance is achieved by keeping track of the "lineage" of each RDD, the sequence of operations that produced it, so that it can be reconstructed in the case of data loss. RDDs can contain any type of Python,
Java, or Scala objects.
In this section I will show you the basics of Apache Spark: the working mechanism, what an RDD is, the difference between transformations and actions, and the caching mechanism.
How Spark works
Spark applications run as independent sets of processes on a cluster, coordinated by the SparkContext
object in your main program (called the driver program). Specifically, to run on a cluster, the SparkContext
can connect to several types of cluster managers (either Spark’s own standalone cluster manager, Mesos
or YARN), which allocate resources across applications. Once connected, Spark acquires executors on
nodes in the cluster, which are processes that run computations and store data for your application. Next,
it sends your application code (defined by JAR or Python files passed to SparkContext) to the executors.
Finally, SparkContext sends tasks to the executors to run.
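As a concrete illustration, the following is a minimal sketch of a driver program in Scala, assuming local mode and a placeholder input path; it creates the SparkContext described above, loads a file and runs a single action.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object SimpleDriver {
  def main(args: Array[String]): Unit = {
    // The configuration names the application and selects a cluster manager;
    // "local[*]" runs everything in the local JVM using all available cores.
    val conf = new SparkConf().setAppName("SimpleDriver").setMaster("local[*]")
    val sc = new SparkContext(conf)

    // Loading a file produces an RDD; the path here is only a placeholder.
    val lines = sc.textFile("hdfs:///path/to/input.txt")

    // count() is an action: it triggers the executors to run and
    // returns the result to the driver.
    println(s"Number of lines: ${lines.count()}")

    sc.stop()
  }
}
```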
Figure 9 – The architecture of Spark, inside the Core
The Spark architecture follows the master-worker pattern: the driver program (master), through its SparkContext, orchestrates the work by sending jobs to the workers; the work is divided into chunks (tasks) for the executors, which compute their part as quickly as possible. Once the workers finish computing their requests, they send the results back to the master, which aggregates them according to the DAG (Directed Acyclic Graph) of operations. The work continues until all chunks have been computed.
3 (Zaharia, Chowdhury, Franklin, Shenker, & Stoica, 2011)
RDD
At its core, Apache Spark generalizes the MapReduce model. MapReduce and its variants have been highly successful in implementing large-scale data-intensive applications on commodity clusters. These systems, however, are built around an acyclic data flow model that is not suitable for other popular applications, in particular those that reuse a working set of data across multiple parallel operations. To address this, Spark introduces an abstraction called Resilient Distributed Datasets (RDDs). An RDD is a read-only collection of objects partitioned across a set of machines that can be rebuilt if a partition is lost.
Thanks to the RDD and DAG model, Spark can outperform Hadoop by 10x in iterative machine learning jobs, and can be used to interactively query a dataset of hundreds of gigabytes with sub-second response time.
Definition
RDD is a distributed memory abstraction that lets programmers perform in-memory computations on large clusters in a fault-tolerant manner. RDDs are motivated by two types of applications:
- Iterative algorithms
- Interactive data mining tools
In both cases, keeping data in memory can improve performance by an order of magnitude. To achieve fault tolerance efficiently, RDDs provide a restricted
form of shared memory, based on coarse-grained transformations rather than fine-grained updates to
shared state.
Behavior
An RDD - Resilient Distributed Dataset - is a collection of elements partitioned across the nodes of the cluster that can be operated on in parallel. Workers can access an RDD only in read mode, so they cannot change it.
More precisely:
- Datasets: elements such as lists or arrays
- Distributed: the work is distributed across the cluster, so the computation can take place in
parallel
- Resilient: given that the work is distributed, Spark has powerful fault-tolerance management: if a job on a node crashes, Spark detects the crash and immediately reschedules the job on another node.
Because many operations in Spark are lazy, rather than computing each transformation immediately, Spark records the transformations in what is called the DAG, the Directed Acyclic Graph, making fault tolerance as simple as possible.
The DAG grows with every transformation that is carried out, such as:
- map
- filter
- (other lazy operations)
Once all the results have been collected in a DAG, they can be reduced, and then brought to the desired
result, through the following actions:
- collect
- count
- reduce
- …
The results can be saved in local storage and then sent to the master, or sent directly.
Transformation
Spark offers an environment that supports the functional paradigm, and thanks to Scala many transformations can be expressed in very few lines of code:
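The code from the original figure is not reproduced here; the following is a minimal sketch, assuming an interactive shell where a SparkContext named sc already exists. The transformations only extend the DAG: nothing is computed yet.

```scala
// Assume sc is an existing SparkContext (as in the spark-shell).
val numbers = sc.parallelize(1 to 10)

// Transformations are lazy: nothing is computed here,
// the operations are only recorded in the DAG.
val doubled = numbers.map(_ * 2)
val evens   = doubled.filter(_ % 4 == 0)
```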
Actions
While all transformations are lazily evaluated, actions aggregate the data and return the result to the master (driver program). More precisely, each action aggregates partial results directly on the nodes, and these are then sent to the master, which combines them all; this policy, inherited from the Hadoop world, was designed to minimize network traffic as much as possible.
Associative properties
Thanks to the associative property, the data can be aggregated regardless of the order in which the partial results arrive. The Spark Context can therefore organize the work across the cluster in different ways, relying on the associative property that characterizes distributed computing.
Figure 10 – Associative Property of Spark
Action on Data
Whenever we want to "see" the result, collect materializes the entire RDD and returns it to the driver, as in the sketch below.
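The corresponding figure is likewise not reproduced; continuing the sketch above, these actions trigger the actual computation and return results to the driver.

```scala
// Actions force evaluation of the DAG and return results to the driver.
val howMany = evens.count()          // number of elements
val total   = evens.reduce(_ + _)    // sum, relying on an associative operation
val all     = evens.collect()        // materializes the whole RDD on the driver

println(s"count=$howMany, sum=$total, values=${all.mkString(",")}")
```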
Key Value Methods
Spark also works with <k,v> pairs to transform and aggregate the data you want to process. For example, these are some of the functions you can use:
Figure 12 – RDD functions
Example:
- Given the following list of key-value pairs
Figure 13 – List of key-value pairs
- The functions highlighted above (rddToPairRDDFunctions) partition the pairs across different nodes. Note that identical keys are assigned to the same node.
Figure 14
- Now each node can perform its transformations and actions, and then send its result to the master, as in the sketch below.
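A minimal sketch of this key-value workflow, assuming a small, made-up dataset; the pair-specific methods become available implicitly on an RDD of tuples.

```scala
// An RDD of (key, value) pairs; pair-specific methods such as reduceByKey
// are provided implicitly (via rddToPairRDDFunctions).
val sales = sc.parallelize(Seq(
  ("apples", 3), ("oranges", 5), ("apples", 2), ("oranges", 1)
))

// Values with the same key end up on the same node, where they are
// aggregated locally before the partial results are combined.
val totals = sales.reduceByKey(_ + _)

totals.collect().foreach { case (fruit, n) => println(s"$fruit -> $n") }
```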
Caching Data
- Suppose, for example, that we have a simple RDD and we want to run a computationally intensive algorithm on it, which we simulate with a Thread.sleep() call.
- The result is another RDD with the transformed data.
- Suppose we then want to apply a simple map function, which is not as computationally heavy as the first step; we obtain the following result.
- Now, maybe the result is not the one you want, and the analysis has to be run again. Instead of recalculating everything from scratch, Spark can store the intermediate result in the cache.
By default Spark caches in RAM but, if specified, it can also use larger storage such as the hard disk (be careful: in that case a bottleneck can occur!).
- If at this point you want to continue the analysis, the data you will use is the cached data, as in the sketch below.
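A minimal sketch of the walkthrough above, with the heavy step simulated by Thread.sleep(); cache() keeps the intermediate RDD in memory so later steps do not repeat the expensive work.

```scala
val data = sc.parallelize(1 to 100)

// Simulate a computationally expensive transformation.
val expensive = data.map { x =>
  Thread.sleep(10)   // stand-in for heavy work
  x * x
}

// Keep the intermediate result in memory (use persist() with a storage
// level such as MEMORY_AND_DISK to allow spilling to disk).
expensive.cache()

expensive.count()                       // first action: pays the full cost
val shifted = expensive.map(_ + 1)      // cheap follow-up transformation
shifted.count()                         // reuses the cached intermediate data
```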
This method has two benefits:
1. Speed: the data, especially intermediate results, are available again in a short time.
2. Fault tolerance: if the heavy computation of the first transformation crashed, the data would be recalculated and made available again without the user even noticing.
Distribution and Instrumentation
This section explains the systems side of Spark: where the data flows and which components are activated when an action occurs.
Spark Submit
In Spark, applications are launched through the spark-submit script: the launched application refers to a main program, the driver, which interacts with the cluster manager.
Figure 15
At this point, depending on how your cluster is configured, the driver may run directly on the machine on which you installed Spark (for example, in local mode), or the work may go directly to the master node at the head of the distributed cluster. This choice is made in the launch phase of the main program.
The driver then asks the cluster manager for the specific resources that can be used in the computation. Depending on what is available, the cluster manager grants the resources to be used by the MapReduce-style computation.
Figure 16
At this point the driver runs the main application, building up the RDDs until an action occurs; the action causes the work on the RDDs to be sent to the nodes of the cluster. When the distributed workflow is complete, execution continues inside the driver until the application finally terminates.
Figure 17
Now, the resources are freed. Note: the resources may also be released manually before the job is
completed.
To better understand the options offered by spark-submit, you can browse the help printed directly by the command shell:
Figure 18 – List of command in Spark
1. First, note that spark-submit accepts only compiled driver programs, packaged as a jar for Java/Scala, or Python files, together with the options that will be shown shortly.
2. Through the kill and status commands you can forcibly terminate or monitor the jobs present in Spark, passing the appropriate submission ID.
3. The master option determines whether to run Spark locally or on the cluster, if one is available, optionally specifying the number of processors you want to use.
4. Deploy mode accepts only two values, client or cluster. It specifies where the driver, described above, resides: in client mode the driver resides on the same machine from which you launch the computation, which must obviously be connected to the computing cluster via a fast link. The cluster option, on the other hand, offers a "fire-and-forget" service, sending the computation directly to the master.
5. Another option is specifying the class to run, through the class option.
6. The name option assigns a name to the application that will be computed.
7. The jars and py-files options specify the application files you want to run. Note that both Scala and Java projects produce their output as a jar file.
8. The executor-memory option (not shown in the screenshot) sets the amount of memory that each launched executor can use. The default is 1 GB, but you can change it depending on the resources required. A complete example invocation is sketched below.
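To tie the options together, here is a hypothetical spark-submit invocation; the host name, class and jar path are placeholders, and the commented master URLs anticipate the cluster managers discussed in the next section.

```
# Hypothetical submission to a standalone cluster (default port 7077);
# the host name, class name and jar path are placeholders.
spark-submit \
  --master spark://master-host:7077 \
  --deploy-mode cluster \
  --class com.example.MyApp \
  --name my-spark-app \
  --executor-memory 2G \
  my-app.jar

# Master URLs for the other managers discussed in the next section:
#   --master mesos://mesos-host:5050   (Apache Mesos, default port 5050)
#   --master yarn                      (Hadoop YARN, via HADOOP_CONF_DIR / YARN_CONF_DIR)
#   --master local[4]                  (local mode, using 4 cores)
```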
Cluster Manager
Apache Spark’s cluster manager is distributed. Its main goal is to manage and distribute the workload, assigning tasks to the machines that compose the cluster.
Figure 19 – Cluster Manager View
The kernel that would normally manage a single machine, in this case the master, is in practice "distributed" over the entire cluster.
Main Cluster Manager
Figure 20 – Main Cluster Manager
 Spark has its own built-in manager: Spark Standalone.
 The Hadoop cluster manager, on the other hand, is YARN, very convenient both in terms of infrastructure and ease of use.
 Apache Mesos is another product created at the University of California, Berkeley, in competition with the YARN cluster manager.
Standalone Manager
Spark-Standalone Manager is simple and quite convenient if you are
looking for a manager with the basic functions.
 To use spark-submit, just enter the host and port of the machine on which you want to launch the process (the default port is 7077).
 The standalone manager works in both client mode and in
cluster mode.
 Another very helpful setting is spark.deploy.spreadOut: when enabled, the manager tries to distribute the workload across the nodes of the cluster, using as many machines as possible.
On the other hand, if this parameter is set to false, the cluster manager tries to work with as few nodes as possible, grouping the work together.
 The total-executor-cores option lets you specify how many cores, in total, each job may use. By default all available cores in the cluster are used, but you can specify a lower number if you want to preserve capacity for future computations.
Apache Mesos Manager
 The master URL is much the same as in standalone mode; the only difference is the port, which is 5050.
 It supports both deploy modes, client and cluster; Mesos is not installed with Spark by default and must be installed separately.
 Mesos can allocate resources dynamically in two ways: fine-grained and coarse-grained. By default it uses the fine-grained mode; the difference lies in how resources are allocated to the various jobs:
o fine-grained: each Spark task is mapped onto and run as a Mesos task. This allows applications to scale their resources dynamically when needed4;
o coarse-grained: this mode reserves a large, fixed amount of resources for the whole duration of the launched application. It is best if you want maximum performance, since each application can determine its best configuration and keep it.
 If spark.mesos.coarse is set to true, you can also set the number of cores to be used on each node of the cluster; by default all available cores are used.
 Unlike the standalone manager, you cannot choose how the work is spread: by default Mesos uses as few nodes as possible, minimizing the distribution of work within the cluster.
4 This type of resource management is not optimal if you are running streaming applications, which require low latency; but it is good for parallel use by multiple processes launched by different instances: the resources needed for such parallel use are automatically allocated and distributed.
Hadoop YARN
 Unlike the previous cluster managers, to launch a spark-submit process you simply pass yarn as the master, since the HADOOP_CONF_DIR / YARN_CONF_DIR configuration directory provides the settings used to connect to the cluster manager.
 Client / cluster: with this manager too you can choose the deploy mode of the computation.
 The number of executors defaults to 2 with this manager, which can be limiting if you want to improve performance and scale.
 You can specify the number of cores to be used by each executor, which by default is set to 1.
 Unlike the other managers, YARN has the concept of queues: you can specify which queue the jobs submitted through the cluster manager should go into.
SPARK LIBRARIES
In this section we will look at how Spark is used at a pragmatic level, that is, how it is used to query data stored in Hadoop and which features its libraries allow us to use.
Spark SQL
It is called Spark SQL because, thanks to this functionality, it provides a working environment where you can query any database stored in Hadoop with a language very similar to SQL.
Spark SQL is a Spark module for structured data processing. Unlike the basic Spark RDD API, the interfaces
provided by Spark SQL provide Spark with more information about the structure of both the data and the
computation being performed. Internally, Spark SQL uses this extra information to perform extra
optimizations. There are several ways to interact with Spark SQL including SQL, the DataFrames API and
the Datasets API. When computing a result, the same execution engine is used, independent of which
API/language you are using to express the computation. This unification means that developers can easily
switch back and forth between the various APIs based on which provides the most natural way to express
a given transformation.
One use of Spark SQL is to execute SQL queries written using either a basic SQL syntax or HiveQL. Spark
SQL can also be used to read data from an existing Hive installation. For more on how to configure this
feature, please refer to the Hive Tables section. When running SQL from within another programming
language the results will be returned as a DataFrame. You can also interact with the SQL interface using
the command-line or over JDBC/ODBC.
This chapter shows only one example of how to use functional-style operations to query the Hadoop cluster. The language chosen for the example is Scala, which is the most powerful of the languages supported by the Spark community.
The Spark SQL interface supports the automatic conversion of an RDD containing case classes into a DataFrame5. The case classes define the schema of the table. The names of the fields of the classes are read using reflection (the ability of a program to perform computations about itself, in particular about the structure of its own source code) and become the column names. Classes can also be nested or contain complex types such as Sequences and Arrays. Such an RDD can be converted implicitly into a DataFrame and then registered as a table; tables can subsequently be used in SQL statements.
Figure 21 shows an excerpt of this code:
5 A DataFrame is a distributed collection of data organized into named columns. It is conceptually
equivalent to a table in a relational database or a data frame in R/Python, but with richer optimizations
under the hood. DataFrames can be constructed from a wide array of sources such as: structured data
files, tables in Hive, external databases, or existing RDDs. The DataFrame API is available in Scala, Java,
Python, and R.
Figure 21 – Example Code
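The excerpt itself is not reproduced here; what follows is a minimal sketch of the pattern just described (case class, implicit conversion to a DataFrame, registration as a table), assuming a Spark 1.x SQLContext and a made-up people.txt file.

```scala
import org.apache.spark.sql.SQLContext

// Case class: its fields, read via reflection, become the column names.
case class Person(name: String, age: Int)

val sqlContext = new SQLContext(sc)
import sqlContext.implicits._

// Build an RDD of Person objects and convert it implicitly to a DataFrame.
val people = sc.textFile("people.txt")
  .map(_.split(","))
  .map(p => Person(p(0), p(1).trim.toInt))
  .toDF()

// Register the DataFrame as a table and query it with SQL.
people.registerTempTable("people")
val teenagers = sqlContext.sql("SELECT name FROM people WHERE age BETWEEN 13 AND 19")
teenagers.collect().foreach(println)
```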
Impala vs. Hive: differences between SQL-on-Hadoop components
Apache Hive was introduced by Facebook to manage and process large amounts of distributed data sets stored on Hadoop. Apache Hive is an abstraction over Hadoop MapReduce that allows you to query data in its own SQL-like language, HiveQL.
Cloudera Impala, on the other hand, was developed to overcome the limitations posed by the slow SQL interaction with Hadoop. Cloudera Impala provides high performance for processing and analyzing data stored on a Hadoop cluster.
Cloudera Impala compared with Apache Hive:
- Impala does not require data to be moved or transformed when it is stored in Apache HDFS or HBase; Hive is a data warehouse infrastructure built on top of the Hadoop platform to support tasks such as querying, analyzing, processing and displaying data.
- Impala uses code generated at runtime for the "big loops", using LLVM (Low Level Virtual Machine); Hive generates its queries at compile time.
- Impala avoids startup overhead: its processes are daemons that are started when the nodes boot and are always ready to plan, coordinate and execute queries; Hive processes, being compiled at compile time, suffer from a "cold start" problem.
- Impala is very fast at parallel processing and follows the SQL-92 standard; Hive translates queries written in its own "dialect" (HiveQL) into a DAG that is executed as MapReduce jobs.
Table 1 – Impala vs. Hive
Cloudera Impala
Cloudera Impala is an excellent choice for developers who do not need to move large amounts of data: it integrates easily with the Hadoop ecosystem and with frameworks like MapReduce, Apache Hive, Pig and other Apache Hadoop software. Its architecture is specifically designed around the familiar SQL language, supporting all those users who are making the transition from relational databases to big data.
Columnar Storage
Data in Impala is stored in a columnar format, which provides a high compression ratio and very efficient scanning.
Figure 22 – Columnar Storage with Impala
Tree Architecture
The tree architecture is crucial to achieving distributed computation: the query is delivered to the nodes and the results are aggregated once the sub-processing is complete.
Figure 23 – Tree Architecture
Impala offers the following features:
 Support for HDFS (Hadoop Distributed File System) and Apache HBase storage
 It recognizes the common Hadoop file formats: text, LZO, SequenceFile, Avro, RCFile and Parquet
INTRODUCTION TO DATA ANALYSIS WITH SPARK
Pagina 22
 Support for Hadoop security (Kerberos authentication)
 It can easily reuse the metadata, the ODBC driver and the SQL syntax of Apache Hive
Apache Hive
Initially developed by Facebook, Apache Hive is a data warehouse infrastructure built on top of Hadoop to provide tools for querying, analysis, processing and visualization. It is a versatile tool that supports the analysis of large databases stored in Hadoop HDFS and in other compatible systems such as Amazon S3. To stay close to the SQL language it provides HiveQL, a tool that converts the queries and submits them as MapReduce jobs.
Other features of Hive:
 Indexing to speed up computation
 Support for different storage formats, such as text files, RCFile, HBase, ORC and others
 Facilities for building user-defined functions (UDFs) to manipulate strings, dates, ...
If you are looking for an advanced analytics language that offers a great deal of familiarity with SQL (without writing separate MapReduce programs), Apache Hive is the best choice. HiveQL queries are still converted into the corresponding MapReduce jobs, which run on the cluster and return the final result.
Streaming Data
Spark Streaming is an extension of the core API that enables scalable, high-throughput, fault-tolerant processing of streaming data. Data can be ingested from various sources like Kafka, Flume, HDFS/S3, Kinesis, Twitter or plain TCP sockets, and can be processed using complex algorithms expressed at a high level with functional operators such as map, reduce, join and window. The processed data can then be pushed out to a filesystem or a database.
Figure 24 – Streaming Data
Internally, Spark Streaming works as follows: it receives live streaming data and divides it into small batches, which are then processed by the Spark engine to generate the final stream of results.
Figure 25 – Streaming Data in detail
Spark Streaming provides a high-level abstraction called Discretized Stream or, more simply, DStream, which represents a continuous stream of data. DStreams can be created from input data streams from sources such as Kafka, Flume and Kinesis, or through high-level operations on other DStreams. Internally, a DStream is represented as a sequence of RDDs.
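A minimal sketch of a DStream pipeline, assuming a plain TCP source on a made-up host and port; each batch of the stream is processed as an RDD.

```scala
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Create a streaming context that cuts the live stream into 5-second batches.
val ssc = new StreamingContext(sc, Seconds(5))

// Hypothetical TCP source; Kafka, Flume or Kinesis receivers work similarly.
val lines = ssc.socketTextStream("stream-host", 9999)

// Classic streaming word count: each DStream operation is applied
// to the sequence of RDDs that make up the stream.
val counts = lines.flatMap(_.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)

counts.print()

ssc.start()             // start receiving and processing data
ssc.awaitTermination()  // block until the stream is stopped
```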
Machine Learning
Machine learning algorithms are constantly growing and evolving and are becoming ever more efficient. To understand machine learning in Spark, it is useful to look at how its libraries have evolved.
Machine Learning Library (MLlib)
MLlib is Spark’s machine learning (ML) library. Its goal is to make practical machine learning scalable and
easy. It consists of common learning algorithms and utilities, including classification, regression,
clustering, collaborative filtering, dimensionality reduction, as well as lower-level optimization primitives
and higher-level pipeline APIs.
It is divided into two packages:
 spark.mllib contains the original API built on top of RDDs.
 spark.ml provides higher-level API built on top of DataFrames for constructing ML pipelines.
Using spark.ml is recommended because with DataFrames the API is more versatile and flexible. But we
will keep supporting spark.mllib along with the development of spark.ml. Users should be comfortable
using spark.mllib features and expect more features coming. Developers should contribute new
algorithms to spark.ml if they fit the ML pipeline concept well, e.g., feature extractors and transformers.
In fact, on the one hand, Matlab and R are easy to use but not scalable. On the other hand, systems such as Mahout and GraphLab grew up, which are very scalable but difficult to use.
Figure 26 – MLlib Architecture
The Spark MLlib library aims to make the practice of machine learning scalable and relatively easy. It consists of a set of algorithms that include regression, clustering, collaborative filtering, dimensionality reduction and other functions that integrate at a low level. This architecture underpins the high-level APIs that are provided, and the library also offers tools for feature extraction and transformation.
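A minimal sketch using the lower-level spark.mllib package, assuming a made-up file of comma-separated points; it clusters the data with k-means.

```scala
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

// Hypothetical input: one comma-separated point per line, e.g. "1.0,2.0,3.0".
val points = sc.textFile("points.csv")
  .map(line => Vectors.dense(line.split(',').map(_.toDouble)))
  .cache()

// Train a k-means model with 3 clusters and 20 iterations.
val model = KMeans.train(points, 3, 20)

// Cost (sum of squared distances to the nearest centre) and cluster centres.
println(s"Within-set sum of squared errors: ${model.computeCost(points)}")
model.clusterCenters.foreach(println)
```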
GraphX
GraphX is the Spark component for graphs and graph-parallel computation. GraphX extends Spark RDDs by introducing a new graph abstraction that attaches the concepts of vertex and edge to them. To support graph computation, GraphX provides operators such as subgraph, joinVertices and aggregateMessages. In addition, GraphX includes a growing collection of graph algorithms to simplify high-level analysis.
Figure 27 – How use GraphX
The WWW itself is a gigantic graph, and multinationals like Twitter, Netflix and Amazon have used this concept to expand their markets and their profits. In the world of medicine, graphs are also widely used in research.
How GraphX works
A GraphX graph is a directed multigraph with properties attached to its vertices and edges. In particular, each vertex may have associated metadata, as well as a unique identifier within the graph.
In this case each point represents a person, as in the everyday "social graph". In GraphX this object is called a vertex, which consists of:
- (VertexId, VType): simply a key-value pair
The arrows connect the vertices; in the example they are created from the relationships between the people stored in the vertices. In Spark these are called edges, and they store the IDs of the connected vertices and the type of relationship between them:
- Edge(VertexId, VertexId, EType)
The Spark graph is built on top of these simple building blocks; in particular, a graph is represented by two different RDDs, one containing the vertices and the other containing the edges.
In addition, there is another concept to keep in mind when working with graphs, namely the view that exists over each connection: the graph exposes an EdgeTriplet, containing all the information about a connection (the edge together with its source and destination vertices). This provides a much more complete rendering of the graph and makes reasoning about it much simpler.
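A minimal sketch of the structure just described, assuming a tiny made-up social graph; the graph is built from a vertex RDD and an edge RDD, and the triplets view exposes each edge together with its endpoint attributes.

```scala
import org.apache.spark.graphx.{Edge, Graph}

// Vertices: (VertexId, attribute) pairs, here a person's name.
val vertices = sc.parallelize(Seq(
  (1L, "Alice"), (2L, "Bob"), (3L, "Carol")
))

// Edges: (source id, destination id, attribute), here the kind of relationship.
val edges = sc.parallelize(Seq(
  Edge(1L, 2L, "follows"),
  Edge(2L, 3L, "friend-of")
))

// The graph is just the combination of the two RDDs.
val graph = Graph(vertices, edges)

// Triplets give a complete view of each connection: source, edge, destination.
graph.triplets.collect().foreach { t =>
  println(s"${t.srcAttr} ${t.attr} ${t.dstAttr}")
}
```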
OPTIMIZATIONS AND THE FUTURE
In this final part I look at some recurring issues that are especially useful for understanding how Spark behaves. I will try to explain how the following constructs work and what problems they may cause:
 Closures
 Broadcast
 Partitioning
Understanding closures
One of the most difficult things to understand in Spark is understanding the scope and lifetime of variables
and methods when the code is executed on the cluster.
RDD operations that modify variables outside of their scope are a frequent source of confusion. The example below uses foreach() to increment a counter, but similar problems can occur with other well-known operations.
Example: consider a naive sum of RDD elements, shown below, whose behavior may differ depending on where it is executed. It is a common illustration of what happens when we move Spark from local mode to a cluster (via YARN, for example):
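The code from the original figure is not reproduced; below is a minimal sketch of the same anti-pattern, which looks correct but, as explained next, is not.

```scala
// Anti-pattern: mutating a driver-side variable from inside an RDD operation.
var counter = 0
val data = sc.parallelize(1 to 100)

// Each executor increments its own serialized copy of `counter`,
// not the variable living in the driver.
data.foreach(x => counter += x)

// In cluster mode this still prints 0.
println(s"Counter value: $counter")
```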
Local vs. cluster modes
The behavior of the code above is not clearly defined and might not work as intended. To run jobs, Spark splits the processing of RDD operations into tasks, each executed by an executor. Before execution, Spark computes the task's closure. The closure is the set of variables and methods that must be visible to the executor to perform its computation on the RDD (in this case, the body of the foreach). This closure is serialized and sent to each executor.
The variables within the closure are copied and sent to every executor, so when the counter is incremented inside the foreach, it is not the master node's counter (the driver's) that is updated. There is still a counter in the driver's memory, but it is not visible to the executors; the executors "see" only their serialized copy of the closure. So the final value of the counter on the driver will still be zero, because the updates performed on the executors are never sent back to the master (driver).
In local mode, by contrast, the foreach may run in the same JVM as the driver and reference the original counter, updating it directly, so the same code behaves differently.
To ensure well-defined behavior in these scenarios, you should use an Accumulator. Accumulators in Spark provide a mechanism for safely updating a variable by aggregating the partial results that the various worker nodes send back to the driver.
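A minimal sketch of the same sum rewritten with an accumulator (using the Spark 1.x accumulator API), which gives a well-defined result.

```scala
// An accumulator is created on the driver and safely updated by the executors.
val accum = sc.accumulator(0, "sum accumulator")
val data = sc.parallelize(1 to 100)

data.foreach(x => accum += x)

// The partial updates from the workers are merged back into the driver.
println(s"Accumulated sum: ${accum.value}")   // 5050
```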
Broadcasting
In the previous section you saw how Spark sends a copy of the data needed for the computation to the various workers (nodes) in the cluster. Without a policy and a well-designed strategy, this kind of operation can lead to problems, given that the computations Spark performs are massive and the data involved is on the order of gigabytes.
This section first shows what naive distribution of the data looks like, and then the strategy implemented in Spark to solve the problem.
Example: two lazy operations are defined, with closures using map and flatMap respectively:
Given a master (driver) and three workers:
1. An RDD of 1 MB (a dummy variable) is created inside the driver.
2. Whenever the operations are executed by the workers, a copy of the data is sent to each of them.
3. If the 1 MB dataset is requested, for example, 4 times by the first worker, it becomes 4 MB at the local level. Scaled up across all the workers of even this small network, the traffic could grow out of control.
In a network with millions of nodes, this must not be allowed to happen.
In Spark this issue is solved as follows: the data used by the map function is wrapped in a broadcast variable, which handles sending it to the workers in parallel before the lazy operations run:
The broadcast function first compresses the data as much as possible; then:
1. The broadcast data is sent once, and only once, to each worker.
2. Each worker caches its copy of the broadcast data.
This keeps the growth under control, even if the workers repeatedly need the same dataset for their computations. Moreover, to relieve the possible bottleneck between the driver and the workers, the workers can exchange broadcast blocks among themselves, in a manner similar to BitTorrent networks.
This means, for example, that if the first worker has already received the broadcast data to perform some computation, and the second worker also needs it, the workers can help each other, minimizing calls to the driver, which would otherwise become an annoying bottleneck and add latency.
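A minimal sketch of the broadcast mechanism described above, assuming a made-up lookup table; the table is shipped to each worker once and read from the local cache afterwards.

```scala
// A (potentially large) read-only structure needed by every task.
val countryNames = Map("it" -> "Italy", "fr" -> "France", "de" -> "Germany")

// Broadcast it once; each worker caches its copy instead of receiving
// a new copy with every task.
val bcNames = sc.broadcast(countryNames)

val codes = sc.parallelize(Seq("it", "de", "it", "fr"))
val resolved = codes.map(code => bcNames.value.getOrElse(code, "unknown"))

resolved.collect().foreach(println)
```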
Optimizing Partitioning
The previous section showed how uncontrolled distribution of data can make an application grow out of hand. This section shows, instead, how a disproportionate degree of parallelism can hurt performance if misused. The problem is best explained through an example.
Given the following RDD:
 a "huge" data set: the integers from 1 to Int.MaxValue
 divided into 10,000 partitions
we then (see the sketch after this list):
1. perform a filter that shrinks the result as much as possible;
2. apply a sort;
3. perform a map that adds 1 to each element of the final array;
4. show the result using the collect action.
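A minimal sketch of this pipeline as described; the filter predicate is a placeholder chosen so that only about ten values survive, and the 10,000-way split is deliberately excessive.

```scala
val huge = sc.parallelize(1 to Int.MaxValue, 10000)

val result = huge
  .filter(_ % 200000000 == 0)   // shrinks the data to roughly ten values
  .sortBy(x => x)               // still runs across ~10,000 tasks
  .map(_ + 1)
  .collect()
```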
If we look at how the nodes actually work at the systems level, the results are surprising.
First, we can see that the execution consists of two stages, each of roughly 10,000 tasks; the first (sortBy) completed in 13 seconds, while the second (collect) took 20 seconds.
If we analyze the result in more detail and drill down into the collect stage, we notice that the sortBy step, performed after the filtering, took a full 20 seconds. This is because Spark was asked to use 10,000 tasks to sort about 10 values (as the code is written, the sort runs after the filtering):
Figure 28 – Systems-level view
As you can see, this overhead is harmful if not controlled: the image above shows that 10,000 tasks are executed, and almost every one of them spends essentially zero time on the required computation. Simply put, the time spent synchronizing the various executors is far greater than the time spent on the desired computation, increasing overhead and user-perceived latency out of all proportion.
A technique for controlling the amount of parallelism is the coalesce function, which here reduces the number of partitions from 10,000 to 8 after the filter has run.
The second argument passed to coalesce controls shuffling: if it is set to true, the output is redistributed evenly across all the nodes; if it is set to false, as few partitions as possible are used.
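A minimal sketch of the same pipeline with coalesce applied right after the filter (8 partitions, no shuffle); the predicate is the same placeholder as before.

```scala
val resultCoalesced = sc.parallelize(1 to Int.MaxValue, 10000)
  .filter(_ % 200000000 == 0)
  .coalesce(8, shuffle = false)   // collapse the ~10,000 partitions into 8
  .sortBy(x => x)                 // now runs on only 8 tasks
  .map(_ + 1)
  .collect()
```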
If you monitor the execution time, the result is the same as before, but the performance is far better: the collect called after the filtering, in the second stage of the second test, takes only 4 seconds compared to the 20 seconds used previously.
In detail:
It is evident that the sort, this time, takes only 24 milliseconds and schedules just 8 tasks in total.
By reducing the number of partitions touched when an action is executed, coalesce saves an enormous amount of time.
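As a quick sanity check, under the same assumptions as the sketches above, the partition counts before and after the coalesce can be compared directly:

val filtered  = huge.filter(_ % 200000000 == 0)
val coalesced = filtered.coalesce(8, shuffle = true)

println(s"partitions before coalesce: ${filtered.partitions.length}")   // 10000
println(s"partitions after coalesce:  ${coalesced.partitions.length}")  // 8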
Future of Spark
This section presents some of the features that the community around Spark is working to develop and advance.
R
Integration with R is the first "frontier" to conquer in order to bring the community of
mathematicians and statisticians together with that of engineers and data scientists. R offers a
development environment that is user-friendly from a statistical and mathematical point of view and is
widely used in the community for data analysis.
Tachyon
Why is Spark up to 100 times faster than MapReduce? Part of the answer lies in the community's work on
Tachyon. Memory is the key to processing Big Data quickly and at scale: Tachyon is a memory-centric
storage layer adopted by Spark in distributed systems which, in addition to providing automatic
fault tolerance and caching techniques, offers reliable memory sharing across the cluster in use.
Continuing to develop this kind of service will make the field of big data analysis even faster and more
powerful than it already is.
Project Tungsten
Another feature developed by the community that has rallied around Spark is Project Tungsten:
the project reworks the execution core running on the various nodes of the cluster, substantially
improving the efficiency of memory and CPU usage for Spark applications and pushing performance as far
as possible. The fundamental idea of this research revolves around how modern CPUs are used: the goal is
to minimize I/O and to extract the maximum performance from the CPU and memory of the machines when they
are used as a cluster.
References
Karau, H., Konwinski, A., Wendell, P., & Zaharia, M. (2015). Learning Spark. Sebastopol, CA: O'Reilly Media.
Mortar Data Software. (2013, May). How Spark Works. Retrieved from https://www.linkedin.com/company/mortar-data-inc.
Phatak, M. (2015, January 2). History of Apache Spark: Journey from Academia to Industry. Retrieved from http://blog.madhukaraphatak.com/history-of-spark/
SAS – The Power to Know. (2013, April). Big Data. Retrieved from http://www.sas.com/en_ph/insights/big-data/what-is-big-data.html
Zaharia, M., Chowdhury, M., Franklin, M. J., Shenker, S., & Stoica, I. (2011). Spark: Cluster Computing with Working Sets. USENIX Workshop on Hot Topics in Cloud Computing (HotCloud). Retrieved from https://amplab.cs.berkeley.edu/wp-content/uploads/2011/06/Spark-Cluster-Computing-with-Working-Sets.pdf
INTRODUCTION TO DATA ANALYSIS WITH SPARK
Pagina 33
Contact information
MASSIMILIANO MARTELLA
Email: massimilian.martella@studio.unibo.it
More Related Content

What's hot

Which Hadoop Distribution to use: Apache, Cloudera, MapR or HortonWorks?
Which Hadoop Distribution to use: Apache, Cloudera, MapR or HortonWorks?Which Hadoop Distribution to use: Apache, Cloudera, MapR or HortonWorks?
Which Hadoop Distribution to use: Apache, Cloudera, MapR or HortonWorks?Edureka!
 
IBM POWER8 Processor-Based Systems RAS
IBM POWER8 Processor-Based Systems RASIBM POWER8 Processor-Based Systems RAS
IBM POWER8 Processor-Based Systems RASthinkASG
 
White Paper: Hadoop on EMC Isilon Scale-out NAS
White Paper: Hadoop on EMC Isilon Scale-out NAS   White Paper: Hadoop on EMC Isilon Scale-out NAS
White Paper: Hadoop on EMC Isilon Scale-out NAS EMC
 
Introduction To Big Data with Hadoop and Spark - For Batch and Real Time Proc...
Introduction To Big Data with Hadoop and Spark - For Batch and Real Time Proc...Introduction To Big Data with Hadoop and Spark - For Batch and Real Time Proc...
Introduction To Big Data with Hadoop and Spark - For Batch and Real Time Proc...Agile Testing Alliance
 
Hadoop Vs Spark — Choosing the Right Big Data Framework
Hadoop Vs Spark — Choosing the Right Big Data FrameworkHadoop Vs Spark — Choosing the Right Big Data Framework
Hadoop Vs Spark — Choosing the Right Big Data FrameworkAlaina Carter
 
The Future of Hadoop: A deeper look at Apache Spark
The Future of Hadoop: A deeper look at Apache SparkThe Future of Hadoop: A deeper look at Apache Spark
The Future of Hadoop: A deeper look at Apache SparkCloudera, Inc.
 
Hadoop benchmark: Evaluating Cloudera, Hortonworks, and MapR
Hadoop benchmark: Evaluating Cloudera, Hortonworks, and MapRHadoop benchmark: Evaluating Cloudera, Hortonworks, and MapR
Hadoop benchmark: Evaluating Cloudera, Hortonworks, and MapRDouglas Bernardini
 
What is Apache Spark | Apache Spark Tutorial For Beginners | Apache Spark Tra...
What is Apache Spark | Apache Spark Tutorial For Beginners | Apache Spark Tra...What is Apache Spark | Apache Spark Tutorial For Beginners | Apache Spark Tra...
What is Apache Spark | Apache Spark Tutorial For Beginners | Apache Spark Tra...Edureka!
 
Hadoop Innovation Summit 2014
Hadoop Innovation Summit 2014Hadoop Innovation Summit 2014
Hadoop Innovation Summit 2014Data Con LA
 
Hunk: Splunk Analytics for Hadoop
Hunk: Splunk Analytics for HadoopHunk: Splunk Analytics for Hadoop
Hunk: Splunk Analytics for HadoopGeorg Knon
 
Getting Spark ready for real-time, operational analytics
Getting Spark ready for real-time, operational analyticsGetting Spark ready for real-time, operational analytics
Getting Spark ready for real-time, operational analyticsairisData
 
Transitioning Compute Models: Hadoop MapReduce to Spark
Transitioning Compute Models: Hadoop MapReduce to SparkTransitioning Compute Models: Hadoop MapReduce to Spark
Transitioning Compute Models: Hadoop MapReduce to SparkSlim Baltagi
 
SAS on Your (Apache) Cluster, Serving your Data (Analysts)
SAS on Your (Apache) Cluster, Serving your Data (Analysts)SAS on Your (Apache) Cluster, Serving your Data (Analysts)
SAS on Your (Apache) Cluster, Serving your Data (Analysts)DataWorks Summit
 
2015 HortonWorks MDA Roadshow Presentation
2015 HortonWorks MDA Roadshow Presentation2015 HortonWorks MDA Roadshow Presentation
2015 HortonWorks MDA Roadshow PresentationFelix Liao
 
SplunkLive! Hunk Technical Deep Dive
SplunkLive! Hunk Technical Deep DiveSplunkLive! Hunk Technical Deep Dive
SplunkLive! Hunk Technical Deep DiveSplunk
 

What's hot (20)

Apache Spark Notes
Apache Spark NotesApache Spark Notes
Apache Spark Notes
 
Which Hadoop Distribution to use: Apache, Cloudera, MapR or HortonWorks?
Which Hadoop Distribution to use: Apache, Cloudera, MapR or HortonWorks?Which Hadoop Distribution to use: Apache, Cloudera, MapR or HortonWorks?
Which Hadoop Distribution to use: Apache, Cloudera, MapR or HortonWorks?
 
IBM POWER8 Processor-Based Systems RAS
IBM POWER8 Processor-Based Systems RASIBM POWER8 Processor-Based Systems RAS
IBM POWER8 Processor-Based Systems RAS
 
White Paper: Hadoop on EMC Isilon Scale-out NAS
White Paper: Hadoop on EMC Isilon Scale-out NAS   White Paper: Hadoop on EMC Isilon Scale-out NAS
White Paper: Hadoop on EMC Isilon Scale-out NAS
 
Introduction To Big Data with Hadoop and Spark - For Batch and Real Time Proc...
Introduction To Big Data with Hadoop and Spark - For Batch and Real Time Proc...Introduction To Big Data with Hadoop and Spark - For Batch and Real Time Proc...
Introduction To Big Data with Hadoop and Spark - For Batch and Real Time Proc...
 
Hadoop Vs Spark — Choosing the Right Big Data Framework
Hadoop Vs Spark — Choosing the Right Big Data FrameworkHadoop Vs Spark — Choosing the Right Big Data Framework
Hadoop Vs Spark — Choosing the Right Big Data Framework
 
The Future of Hadoop: A deeper look at Apache Spark
The Future of Hadoop: A deeper look at Apache SparkThe Future of Hadoop: A deeper look at Apache Spark
The Future of Hadoop: A deeper look at Apache Spark
 
Spark_Part 1
Spark_Part 1Spark_Part 1
Spark_Part 1
 
spark_v1_2
spark_v1_2spark_v1_2
spark_v1_2
 
Hadoop benchmark: Evaluating Cloudera, Hortonworks, and MapR
Hadoop benchmark: Evaluating Cloudera, Hortonworks, and MapRHadoop benchmark: Evaluating Cloudera, Hortonworks, and MapR
Hadoop benchmark: Evaluating Cloudera, Hortonworks, and MapR
 
What is Apache Spark | Apache Spark Tutorial For Beginners | Apache Spark Tra...
What is Apache Spark | Apache Spark Tutorial For Beginners | Apache Spark Tra...What is Apache Spark | Apache Spark Tutorial For Beginners | Apache Spark Tra...
What is Apache Spark | Apache Spark Tutorial For Beginners | Apache Spark Tra...
 
Hadoop in a Nutshell
Hadoop in a NutshellHadoop in a Nutshell
Hadoop in a Nutshell
 
Hadoop Innovation Summit 2014
Hadoop Innovation Summit 2014Hadoop Innovation Summit 2014
Hadoop Innovation Summit 2014
 
Hunk: Splunk Analytics for Hadoop
Hunk: Splunk Analytics for HadoopHunk: Splunk Analytics for Hadoop
Hunk: Splunk Analytics for Hadoop
 
MapR Unique features
MapR Unique featuresMapR Unique features
MapR Unique features
 
Getting Spark ready for real-time, operational analytics
Getting Spark ready for real-time, operational analyticsGetting Spark ready for real-time, operational analytics
Getting Spark ready for real-time, operational analytics
 
Transitioning Compute Models: Hadoop MapReduce to Spark
Transitioning Compute Models: Hadoop MapReduce to SparkTransitioning Compute Models: Hadoop MapReduce to Spark
Transitioning Compute Models: Hadoop MapReduce to Spark
 
SAS on Your (Apache) Cluster, Serving your Data (Analysts)
SAS on Your (Apache) Cluster, Serving your Data (Analysts)SAS on Your (Apache) Cluster, Serving your Data (Analysts)
SAS on Your (Apache) Cluster, Serving your Data (Analysts)
 
2015 HortonWorks MDA Roadshow Presentation
2015 HortonWorks MDA Roadshow Presentation2015 HortonWorks MDA Roadshow Presentation
2015 HortonWorks MDA Roadshow Presentation
 
SplunkLive! Hunk Technical Deep Dive
SplunkLive! Hunk Technical Deep DiveSplunkLive! Hunk Technical Deep Dive
SplunkLive! Hunk Technical Deep Dive
 

Viewers also liked

Preso spark leadership
Preso spark leadershipPreso spark leadership
Preso spark leadershipsjoerdluteyn
 
Spark introduction - In Chinese
Spark introduction - In ChineseSpark introduction - In Chinese
Spark introduction - In Chinesecolorant
 
Spark the next top compute model
Spark   the next top compute modelSpark   the next top compute model
Spark the next top compute modelDean Wampler
 
The Future of Data Science
The Future of Data ScienceThe Future of Data Science
The Future of Data Sciencesarith divakar
 
A Deeper Understanding of Spark Internals (Hadoop Conference Japan 2014)
A Deeper Understanding of Spark Internals (Hadoop Conference Japan 2014)A Deeper Understanding of Spark Internals (Hadoop Conference Japan 2014)
A Deeper Understanding of Spark Internals (Hadoop Conference Japan 2014)Hadoop / Spark Conference Japan
 
Pixie dust overview
Pixie dust overviewPixie dust overview
Pixie dust overviewDavid Taieb
 
Why dont you_create_new_spark_jl
Why dont you_create_new_spark_jlWhy dont you_create_new_spark_jl
Why dont you_create_new_spark_jlShintaro Fukushima
 
Spark tutorial py con 2016 part 2
Spark tutorial py con 2016   part 2Spark tutorial py con 2016   part 2
Spark tutorial py con 2016 part 2David Taieb
 
How Spark is Enabling the New Wave of Converged Applications
How Spark is Enabling  the New Wave of Converged ApplicationsHow Spark is Enabling  the New Wave of Converged Applications
How Spark is Enabling the New Wave of Converged ApplicationsMapR Technologies
 
Tiny Batches, in the wine: Shiny New Bits in Spark Streaming
Tiny Batches, in the wine: Shiny New Bits in Spark StreamingTiny Batches, in the wine: Shiny New Bits in Spark Streaming
Tiny Batches, in the wine: Shiny New Bits in Spark StreamingPaco Nathan
 
Spark tutorial pycon 2016 part 1
Spark tutorial pycon 2016   part 1Spark tutorial pycon 2016   part 1
Spark tutorial pycon 2016 part 1David Taieb
 
What’s New in Spark 2.0: Structured Streaming and Datasets - StampedeCon 2016
What’s New in Spark 2.0: Structured Streaming and Datasets - StampedeCon 2016What’s New in Spark 2.0: Structured Streaming and Datasets - StampedeCon 2016
What’s New in Spark 2.0: Structured Streaming and Datasets - StampedeCon 2016StampedeCon
 
Introduction to Structured Data Processing with Spark SQL
Introduction to Structured Data Processing with Spark SQLIntroduction to Structured Data Processing with Spark SQL
Introduction to Structured Data Processing with Spark SQLdatamantra
 
Apache Spark Overview part1 (20161107)
Apache Spark Overview part1 (20161107)Apache Spark Overview part1 (20161107)
Apache Spark Overview part1 (20161107)Steve Min
 

Viewers also liked (20)

Spark - Philly JUG
Spark  - Philly JUGSpark  - Philly JUG
Spark - Philly JUG
 
Performance
PerformancePerformance
Performance
 
Preso spark leadership
Preso spark leadershipPreso spark leadership
Preso spark leadership
 
Spark introduction - In Chinese
Spark introduction - In ChineseSpark introduction - In Chinese
Spark introduction - In Chinese
 
Apache Spark with Scala
Apache Spark with ScalaApache Spark with Scala
Apache Spark with Scala
 
Spark the next top compute model
Spark   the next top compute modelSpark   the next top compute model
Spark the next top compute model
 
Intro to Apache Spark
Intro to Apache SparkIntro to Apache Spark
Intro to Apache Spark
 
The Future of Data Science
The Future of Data ScienceThe Future of Data Science
The Future of Data Science
 
A Deeper Understanding of Spark Internals (Hadoop Conference Japan 2014)
A Deeper Understanding of Spark Internals (Hadoop Conference Japan 2014)A Deeper Understanding of Spark Internals (Hadoop Conference Japan 2014)
A Deeper Understanding of Spark Internals (Hadoop Conference Japan 2014)
 
Pixie dust overview
Pixie dust overviewPixie dust overview
Pixie dust overview
 
Why dont you_create_new_spark_jl
Why dont you_create_new_spark_jlWhy dont you_create_new_spark_jl
Why dont you_create_new_spark_jl
 
Spark in 15 min
Spark in 15 minSpark in 15 min
Spark in 15 min
 
Spark tutorial py con 2016 part 2
Spark tutorial py con 2016   part 2Spark tutorial py con 2016   part 2
Spark tutorial py con 2016 part 2
 
How Spark is Enabling the New Wave of Converged Applications
How Spark is Enabling  the New Wave of Converged ApplicationsHow Spark is Enabling  the New Wave of Converged Applications
How Spark is Enabling the New Wave of Converged Applications
 
Tiny Batches, in the wine: Shiny New Bits in Spark Streaming
Tiny Batches, in the wine: Shiny New Bits in Spark StreamingTiny Batches, in the wine: Shiny New Bits in Spark Streaming
Tiny Batches, in the wine: Shiny New Bits in Spark Streaming
 
Spark tutorial pycon 2016 part 1
Spark tutorial pycon 2016   part 1Spark tutorial pycon 2016   part 1
Spark tutorial pycon 2016 part 1
 
What’s New in Spark 2.0: Structured Streaming and Datasets - StampedeCon 2016
What’s New in Spark 2.0: Structured Streaming and Datasets - StampedeCon 2016What’s New in Spark 2.0: Structured Streaming and Datasets - StampedeCon 2016
What’s New in Spark 2.0: Structured Streaming and Datasets - StampedeCon 2016
 
Introduction to Structured Data Processing with Spark SQL
Introduction to Structured Data Processing with Spark SQLIntroduction to Structured Data Processing with Spark SQL
Introduction to Structured Data Processing with Spark SQL
 
Apache Spark & Streaming
Apache Spark & StreamingApache Spark & Streaming
Apache Spark & Streaming
 
Apache Spark Overview part1 (20161107)
Apache Spark Overview part1 (20161107)Apache Spark Overview part1 (20161107)
Apache Spark Overview part1 (20161107)
 

Similar to The New Age for Data Scientists with Apache Spark

Detailed guide to the Apache Spark Framework
Detailed guide to the Apache Spark FrameworkDetailed guide to the Apache Spark Framework
Detailed guide to the Apache Spark FrameworkAegis Software Canada
 
Introduction to spark
Introduction to sparkIntroduction to spark
Introduction to sparkHome
 
5 things one must know about spark!
5 things one must know about spark!5 things one must know about spark!
5 things one must know about spark!Edureka!
 
Apache spark installation [autosaved]
Apache spark installation [autosaved]Apache spark installation [autosaved]
Apache spark installation [autosaved]Shweta Patnaik
 
Low latency access of bigdata using spark and shark
Low latency access of bigdata using spark and sharkLow latency access of bigdata using spark and shark
Low latency access of bigdata using spark and sharkPradeep Kumar G.S
 
Learn Apache Spark: A Comprehensive Guide
Learn Apache Spark: A Comprehensive GuideLearn Apache Spark: A Comprehensive Guide
Learn Apache Spark: A Comprehensive GuideWhizlabs
 
Apache spark architecture (Big Data and Analytics)
Apache spark architecture (Big Data and Analytics)Apache spark architecture (Big Data and Analytics)
Apache spark architecture (Big Data and Analytics)Jyotasana Bharti
 
Performance of Spark vs MapReduce
Performance of Spark vs MapReducePerformance of Spark vs MapReduce
Performance of Spark vs MapReduceEdureka!
 
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...Simplilearn
 
Apache Spark - Intro to Large-scale recommendations with Apache Spark and Python
Apache Spark - Intro to Large-scale recommendations with Apache Spark and PythonApache Spark - Intro to Large-scale recommendations with Apache Spark and Python
Apache Spark - Intro to Large-scale recommendations with Apache Spark and PythonChristian Perone
 
Big data processing with apache spark part1
Big data processing with apache spark   part1Big data processing with apache spark   part1
Big data processing with apache spark part1Abbas Maazallahi
 
Scalable Machine Learning with PySpark
Scalable Machine Learning with PySparkScalable Machine Learning with PySpark
Scalable Machine Learning with PySparkLadle Patel
 
5 Reasons why Spark is in demand!
5 Reasons why Spark is in demand!5 Reasons why Spark is in demand!
5 Reasons why Spark is in demand!Edureka!
 
What is Apache spark
What is Apache sparkWhat is Apache spark
What is Apache sparkmanisha1110
 

Similar to The New Age for Data Scientists with Apache Spark (20)

Started with-apache-spark
Started with-apache-sparkStarted with-apache-spark
Started with-apache-spark
 
Detailed guide to the Apache Spark Framework
Detailed guide to the Apache Spark FrameworkDetailed guide to the Apache Spark Framework
Detailed guide to the Apache Spark Framework
 
Apache Spark PDF
Apache Spark PDFApache Spark PDF
Apache Spark PDF
 
Apache spark
Apache sparkApache spark
Apache spark
 
Introduction to spark
Introduction to sparkIntroduction to spark
Introduction to spark
 
5 things one must know about spark!
5 things one must know about spark!5 things one must know about spark!
5 things one must know about spark!
 
Module01
 Module01 Module01
Module01
 
Apache spark installation [autosaved]
Apache spark installation [autosaved]Apache spark installation [autosaved]
Apache spark installation [autosaved]
 
Low latency access of bigdata using spark and shark
Low latency access of bigdata using spark and sharkLow latency access of bigdata using spark and shark
Low latency access of bigdata using spark and shark
 
Learn Apache Spark: A Comprehensive Guide
Learn Apache Spark: A Comprehensive GuideLearn Apache Spark: A Comprehensive Guide
Learn Apache Spark: A Comprehensive Guide
 
IOT.ppt
IOT.pptIOT.ppt
IOT.ppt
 
Apache spark architecture (Big Data and Analytics)
Apache spark architecture (Big Data and Analytics)Apache spark architecture (Big Data and Analytics)
Apache spark architecture (Big Data and Analytics)
 
Performance of Spark vs MapReduce
Performance of Spark vs MapReducePerformance of Spark vs MapReduce
Performance of Spark vs MapReduce
 
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...
 
Apache Spark - Intro to Large-scale recommendations with Apache Spark and Python
Apache Spark - Intro to Large-scale recommendations with Apache Spark and PythonApache Spark - Intro to Large-scale recommendations with Apache Spark and Python
Apache Spark - Intro to Large-scale recommendations with Apache Spark and Python
 
Big data processing with apache spark part1
Big data processing with apache spark   part1Big data processing with apache spark   part1
Big data processing with apache spark part1
 
Scalable Machine Learning with PySpark
Scalable Machine Learning with PySparkScalable Machine Learning with PySpark
Scalable Machine Learning with PySpark
 
Spark1
Spark1Spark1
Spark1
 
5 Reasons why Spark is in demand!
5 Reasons why Spark is in demand!5 Reasons why Spark is in demand!
5 Reasons why Spark is in demand!
 
What is Apache spark
What is Apache sparkWhat is Apache spark
What is Apache spark
 

Recently uploaded

Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...Thomas Poetter
 
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...Dr Arash Najmaei ( Phd., MBA, BSc)
 
Decoding Patterns: Customer Churn Prediction Data Analysis Project
Decoding Patterns: Customer Churn Prediction Data Analysis ProjectDecoding Patterns: Customer Churn Prediction Data Analysis Project
Decoding Patterns: Customer Churn Prediction Data Analysis ProjectBoston Institute of Analytics
 
Networking Case Study prepared by teacher.pptx
Networking Case Study prepared by teacher.pptxNetworking Case Study prepared by teacher.pptx
Networking Case Study prepared by teacher.pptxHimangsuNath
 
wepik-insightful-infographics-a-data-visualization-overview-20240401133220kwr...
wepik-insightful-infographics-a-data-visualization-overview-20240401133220kwr...wepik-insightful-infographics-a-data-visualization-overview-20240401133220kwr...
wepik-insightful-infographics-a-data-visualization-overview-20240401133220kwr...KarteekMane1
 
Bank Loan Approval Analysis: A Comprehensive Data Analysis Project
Bank Loan Approval Analysis: A Comprehensive Data Analysis ProjectBank Loan Approval Analysis: A Comprehensive Data Analysis Project
Bank Loan Approval Analysis: A Comprehensive Data Analysis ProjectBoston Institute of Analytics
 
Cyber awareness ppt on the recorded data
Cyber awareness ppt on the recorded dataCyber awareness ppt on the recorded data
Cyber awareness ppt on the recorded dataTecnoIncentive
 
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default  Presentation : Data Analysis Project PPTPredictive Analysis for Loan Default  Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPTBoston Institute of Analytics
 
Defining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data StoryDefining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data StoryJeremy Anderson
 
Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)Cathrine Wilhelmsen
 
Digital Marketing Plan, how digital marketing works
Digital Marketing Plan, how digital marketing worksDigital Marketing Plan, how digital marketing works
Digital Marketing Plan, how digital marketing worksdeepakthakur548787
 
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...Boston Institute of Analytics
 
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...Boston Institute of Analytics
 
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...Amil Baba Dawood bangali
 
SMOTE and K-Fold Cross Validation-Presentation.pptx
SMOTE and K-Fold Cross Validation-Presentation.pptxSMOTE and K-Fold Cross Validation-Presentation.pptx
SMOTE and K-Fold Cross Validation-Presentation.pptxHaritikaChhatwal1
 
Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...Seán Kennedy
 
INTRODUCTION TO Natural language processing
INTRODUCTION TO Natural language processingINTRODUCTION TO Natural language processing
INTRODUCTION TO Natural language processingsocarem879
 
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdf
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdfEnglish-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdf
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdfblazblazml
 
Semantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptxSemantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptxMike Bennett
 

Recently uploaded (20)

Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
 
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
 
Decoding Patterns: Customer Churn Prediction Data Analysis Project
Decoding Patterns: Customer Churn Prediction Data Analysis ProjectDecoding Patterns: Customer Churn Prediction Data Analysis Project
Decoding Patterns: Customer Churn Prediction Data Analysis Project
 
Networking Case Study prepared by teacher.pptx
Networking Case Study prepared by teacher.pptxNetworking Case Study prepared by teacher.pptx
Networking Case Study prepared by teacher.pptx
 
Data Analysis Project: Stroke Prediction
Data Analysis Project: Stroke PredictionData Analysis Project: Stroke Prediction
Data Analysis Project: Stroke Prediction
 
wepik-insightful-infographics-a-data-visualization-overview-20240401133220kwr...
wepik-insightful-infographics-a-data-visualization-overview-20240401133220kwr...wepik-insightful-infographics-a-data-visualization-overview-20240401133220kwr...
wepik-insightful-infographics-a-data-visualization-overview-20240401133220kwr...
 
Bank Loan Approval Analysis: A Comprehensive Data Analysis Project
Bank Loan Approval Analysis: A Comprehensive Data Analysis ProjectBank Loan Approval Analysis: A Comprehensive Data Analysis Project
Bank Loan Approval Analysis: A Comprehensive Data Analysis Project
 
Cyber awareness ppt on the recorded data
Cyber awareness ppt on the recorded dataCyber awareness ppt on the recorded data
Cyber awareness ppt on the recorded data
 
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default  Presentation : Data Analysis Project PPTPredictive Analysis for Loan Default  Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPT
 
Defining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data StoryDefining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data Story
 
Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)
 
Digital Marketing Plan, how digital marketing works
Digital Marketing Plan, how digital marketing worksDigital Marketing Plan, how digital marketing works
Digital Marketing Plan, how digital marketing works
 
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
 
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
 
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
 
SMOTE and K-Fold Cross Validation-Presentation.pptx
SMOTE and K-Fold Cross Validation-Presentation.pptxSMOTE and K-Fold Cross Validation-Presentation.pptx
SMOTE and K-Fold Cross Validation-Presentation.pptx
 
Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...
 
INTRODUCTION TO Natural language processing
INTRODUCTION TO Natural language processingINTRODUCTION TO Natural language processing
INTRODUCTION TO Natural language processing
 
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdf
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdfEnglish-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdf
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdf
 
Semantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptxSemantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptx
 

The New Age for Data Scientists with Apache Spark

  • 1. SPARK THE NEW AGE FOR DATA SCIENTIST In a very short time, Apache Spark has emerged as the next generation big data processing engine, and is being applied throughout the industry faster than ever.
  • 2. TABLE OF CONTENTS Summary GETTING STARTED............................................................................................................................................................1 WHAT IS APACHE SPARK?........................................................................................................................................................................ 1 BIG DATA, WHAT IS AND WHY IT MATTERS.......................................................................................................................................... 1 Parallelism enable Big Data.......................................................................................................................................................2 Big data defined...............................................................................................................................................................................2 WHY SPARK?............................................................................................................................................................................................ 4 FROM BIG CODE TO SMALL CODE........................................................................................................................................................... 4 What language should I choose for my next Apache Spark project? ........................................................................4 A BRIEF HISTORY OF SPARK..........................................................................................................................................6 THE INTERVIEW ....................................................................................................................................................................................... 6 THE SPARK GROWTH............................................................................................................................................................................... 6 THE COMPANY AROUND SPARK............................................................................................................................................................. 7 A UNIFIED STACK...............................................................................................................................................................8 SPARK CORE............................................................................................................................................................................................... 8 How Spark works? 
..........................................................................................................................................................................9 RDD............................................................................................................................................................................................................................10 Key Value Methods............................................................................................................................................................................................12 Caching Data..........................................................................................................................................................................................................13 Distribution and Instrumentation ........................................................................................................................................14 Spark Submit.........................................................................................................................................................................................................14 Cluster Manager..................................................................................................................................................................................................16 SPARK LIBRARIES ..................................................................................................................................................................................18 Spark SQL ........................................................................................................................................................................................18 Impala vs Hive: difference between Sql on Hadoop components............................................................................................20 Cloudera Impala..................................................................................................................................................................................................21 Apache Hive...........................................................................................................................................................................................................22 Other features that include Hive:...............................................................................................................................................................22 Streaming Data.............................................................................................................................................................................22 Machine Learning........................................................................................................................................................................23 Machine Learning Library (MLlib)............................................................................................................................................................23 GraphX ..............................................................................................................................................................................................24 How it works 
GraphX?.....................................................................................................................................................................................24 OPTIMIZATIONS AND THE FUTURE.....................................................................................................................................................26 Understanding closures.............................................................................................................................................................26 Local vs. cluster modes....................................................................................................................................................................................26 Broadcasting..................................................................................................................................................................................27 Optimizing Partitioning ............................................................................................................................................................29 Future of Spark..............................................................................................................................................................................31 R...................................................................................................................................................................................................................................31 Tachyon ...................................................................................................................................................................................................................31
  • 3. TABLE OF CONTENTS Project Tungsten.................................................................................................................................................................................................31 REFERENCES.....................................................................................................................................................................32 CONTACT INFORMATION.............................................................................................................................................33
  • 4. INTRODUCTION TO DATA ANALYSIS WITH SPARK Pagina 1 Getting Started In this section I will explain the concepts around Spark, how Spark works and how Spark has been built, what are the advantages of using Spark and a brief nod to history of MapReduce Algorithm. WHAT IS APACHE SPARK? Apache Spark is an open-source framework developed by AMPlab of University of California and, successively, donated to Apache Software Foundation. Unlike the MapReduce paradigm based on two- level disk of Hadoop, the primitive in-memory multilayer provided by Spark allow youto have performance up to 100 times better. Apache Spark is a general-purpose distributed computing engine for processing and analyzing large amounts of data. Spark offers performance improvements over MapReduce, especially when Spark’s in- memory computing capabilities can be leveraged. In practice, Spark extends the popular MapReduce model to efficiently support more types of computations, including interactive queries and stream processing. Speed is important in processing large datasetsor Big Data, as it means the difference between exploring data interactively and waiting minutes or hours. Figure 1 – Spark is the best solution for Big Data BIG DATA, WHAT IS AND WHY IT MATTERS Big data is a popular term used to described the exponential growth and availability of data, both structured and unstructured. Today Big Data as important to business – and society – as the Internet has become.
  • 5. INTRODUCTION TO DATA ANALYSIS WITH SPARK Pagina 2 Why? More data may lead to more accurate analyses. More accurate analyses may lead to mode confident decision making. And better decision can mean greater operational efficiencies, cost reductions and reduced risk. Organizations are increasingly generating large volumes of data as result of instrumented business processes, monitoring of user activity, web site tracking, sensors, finance, accounting, among other reasons. For example, with the advent of social network websites, users create records of their lives by daily posting details of activities they perform, events they attend, places they visit, pictures they take, and things they enjoy and want. This data deluge is often referred to as Big Data, a term that conveys the challenges it poses on existing infrastructure in respect to storage, management, interoperability, governance and analysis large amounts of data. Parallelism enable Big Data • Big Data - Applying data parallelism for data processing of large amount of data (terabyte, petabyte, …) - Exploiting cluster of computing resources and the Cloud • MapReduce - Data parallelism technique introduced by Google for big data processing - Large scale, distributing both data and computation over large cluster of machines - Can be used to parallelize different kind of computational task - Open source implementation: Hadoop Big data defined As far back as 2001, industry analyst Doug Laney articulated the now mainstream definition of big data as the three Vs of big data: volume, velocity and variety. Figure 2 – Three Vs of Big Data • Volume defines the amount of data: many factors contribute to the increase in data volume. Transaction-based data stored the years. In the past, excessive data volume was a storage issue. Big Data Volume VelocityVariety
  • 6. INTRODUCTION TO DATA ANALYSIS WITH SPARK Pagina 3 But with decreasing storage costs, other issues emerge, including how to determine relevance within large data volumes and how to use analytics to create value from relevant data. • Velocity refers to the rate at which the data is produced and processed, for example in the batch mode, near-time mode, real-time mode and streams. Data isstreaming in at unprecedented speed and must be dealt with in a timely manner. Reacting quickly enough to deal with data velocity is a challenge for most organizations. • Variety represent the data types. Data today comes in all types of formats. Structured, numeric data in traditional databases. Information created from line-of-business applications. Unstructured text documents, email, video, audio, stock ticker data and financial transactions. Managing, merging and governing different varieties of data is something many organizations still grapple with. On the other hand, many company consider two additional dimension then they thinking about big data: • Variability. In addition to the increasing velocities and varieties of data, data flows can be highly inconsistent with periodic peaks. Is something trending in social media? Daily, seasonal and event-triggered peak data loads can be challenging to manage. Even more so with unstructured data involved. • Complexity. Today's data comes from multiple sources. And it is still an undertaking to link, match, cleanse and transform data across systems. However, it is necessary to connect and correlate relationships, hierarchies and multiple data linkages or your data can quickly spiral out of control. Figure 3 – The cube of Big Data
  • 7. INTRODUCTION TO DATA ANALYSIS WITH SPARK Pagina 4 WHY SPARK? On the generality side, Spark is designed to cover a wide range of workloads that previously required separate distributed system, including batch applications, iterative algorithms, interactive queries and streaming. By supporting these workloads in the same engine, Spark makes it easy and inexpensive to combine different processing types, which is often necessary in production data analysis pipelines. In addition, it reduces the management burden of maintaining separate tools. Spark is designed to be highly accessible, offering simple APIs in Python, Java, Scala, and SQL, and rich built-in libraries. It also integrates closely with other Big Data tools. In particular, Spark can run in Hadoop clusters and access any Hadoop data source, including Cassandra, for example. FROM BIG CODE TO SMALL CODE The most difficult thing for big data developers today is choosing a programming language for big data applications. There are many languages that a developer can uses, such as Python or R: both are the languages of choice among data scientists for building Hadoop applications. With the advent of various big data frameworks like Apache Spark or Apache Kafka, Scala programming language has gained prominence amid big data developers. What language should I choose for my next Apache Spark project? The answer to this question varies, as it depends on the programming expertise of the developers but preferably Scala programming language has become the language of choice for working with big data framework like Apache Spark. Not only, Spark is mostly written in Scala, more precisely: 1. Scala 81.2% 2. Python 8.7% 3. Java 8.5% 4. Shell 1.3% 5. Other 0.3% Figure 4 – The Growth of Scala
  • 8. INTRODUCTION TO DATA ANALYSIS WITH SPARK Pagina 5 Creating any program in Scala using functional programming cannot impact performance. Functional Programming is an abstraction, which restricts certain programming practices, for example I/O, avoiding the side effect. Forcing the abstraction allows the code to be executed in a Parallel Execution Environment. For this, Spark supports only a functional paradigm, in which Scala results be the most efficient language supported. Figure 5 – From Big Code to Tiny Code In the next sections I will explain the advantage of Spark: a philosophy of tight integration has several benefits: - All libraries and higher-level components, in the stack, benefit from improvements at the lower layers. For example: when Spark’s core engine adds an optimization, SQL and machine learning libraries automatically speed up as well. - The costs associated with running the stack are minimized, because instead of running 5-10 independed software system, an organization needs to run only one. These costs include deployment, maintenance, testing, support, and others. This also means that each time a new component is added to the Spark stack, every organization that uses Spark will immediately be able to try this new component. This changes the cost of trying out a new type of data analysis from downloading, deploying, and learning a new software project to upgrading Spark. The advantages are the following: - Readability and Expressiveness - Fast and Testability - Interactive - Fault Tolerant - Unify Big Data
  • 9. INTRODUCTION TO DATA ANALYSIS WITH SPARK Pagina 6 A Brief History of Spark Apache Spark is one of the most interesting frameworks in big data in recent years. Spark had its humble beginning as a research project at UC Berkeley. Recently O’Reilly Ben Lorica interviewed Ion Stoica, UC Berkeley professor and databricks CEO, about history of apache spark. The question was: “How Apache Spark started?” THE INTERVIEW Ion Stoica, professor and databricks CEO, replay this: “the story started back in 2009 with Mesos1. It was a class project in UC Berkley. Idea was to build a cluster management framework, which can support different kind of cluster computing systems. Once Mesos was built, then we thought what we can build on top of Mesos. That’s how spark was born.” “We wanted to show how easy it was to build a framework from scratch in Mesos. We also wanted to be different from existing cluster computing systems. At that time, Hadoop targeted batch processing, so we targeted interactive iterative computations like machine learning.” “The requirement for machine learning came from the head of department at that time. They were trying to scale machine learning on top of Hadoop. Hadoop was slow for ML as it needed to exchange data between iterations through HDFS. The requirement for interactive querying came from the company, Conviva, which I founded that time. As part of video analytical tool built by company, we supported adhoc querying. We were using MySQL at that time, but we knew it’s not good enough. Also there are many companies came out of UC, like Yahoo, Facebook were facing similar challenges. So we knew there was need for framework focused on interactive and iterative processing.2” THE SPARK GROWTH Spark is an open source project that has been built and is maintained by a thriving and diverse community of developers. If you or your organization are trying Spark for the first time, you might be interested in the history of the project. Spark started in 2009 as a research project in the UC Berkeley RAD Lab, later to become the AMPLab. The researchers in the lab had previously been working on Hadoop Map-Reduce, and observed that MapReduce was inefficient for iterative and interactive computing jobs. Thus, from the 1 Mesos is a distributed systems kernel: abstracts CPU, memory, storage, and other compute resources away from machines (physical or virtual), enabling fault-tolerant and elastic distributed systems to easily be built and run effectively. In the next section I explain more precisely what Mesos is it. 2 (Phatak, 2015)
  • 10. INTRODUCTION TO DATA ANALYSIS WITH SPARK Pagina 7 beginning, Spark was designed to be fast for interactive queries and iterative algorithms, bringing in ideas like support for in-memory storage and efficient fault recovery. Research papers were published about Figure 6 – The History of Spark Spark at academic conferences and soon after its creation in 2009, it was already 10–20× faster than MapReduce for certain jobs. In 2011, the AMPLab started to develop higher-level components on Spark, such as Shark (Hive on Spark) and Spark Streaming. These and other components are sometimes referred to as the Berkeley Data Analytics Stack (BDAS). Spark was first open sourced in March 2010, and was transferred to the Apache Software Foundation in June 2013, where it is now a top-level project. THE COMPANY AROUND SPARK Today, most company uses Apache Spark. Company like Ebay, Pandora, Twetter, Netflix… use Spark for quickly query analysis, improving their business. Figure 7 – Who’s using Spark?
  • 11. INTRODUCTION TO DATA ANALYSIS WITH SPARK Pagina 8 A Unified Stack The Spark project contains multiple closely integrated components. At its core, Spark is a “computational engine” that is responsible for scheduling, distributing, and monitoring applications consisting of many computational task across many worker machines, or a computing cluster. Because the core engine of Spark is both fast and general-purpose, it powers multiple higher-level components specialized for various workloads, such as SQL or machine learning. Figure 8 – The Spark stack Finally, on of the largest advantage of tight integration is the ability to build applications that seamlessly combine different processing models. For example, in Spark you can write one application that uses machine learning to classify data in real time as it is ingested from streaming sources. Simultaneously, analysts can query the resulting data, also in real time, via SQL (e.g., to join the data with unstructured logfiles). In addition, more sophisticated data engineers and data scientists can access the same data via the Python shell for ad hoc analysis. Others might access the data in standalone batch applications. All the while, the IT team has to maintain only one system. A Figure 6 explain that without words, evidencing the Spark architecture. SPARK CORE Spark Core contains the basic functionality of Spark, including components for task scheduling, memory management, fault recovery, interacting with storage systems, and more. It provides distributed task dispatching, scheduling, and basic I/O functionalities, exposed through an application programming interface (for Java, Python, Scala, and R) centered on the RDD abstraction. Spark Core is based on the Resilient Distributed Datasets (RDDs), which are Spark’s main programming abstraction. More precisely: the RDD is a huge list of object in memory, but distributed. It can image this as a huge magic tree distributed across many compute nodes. Is a big set of objects, that we can distribute and parallelize as we want. In practice it can basically scale horizontally reasonably well. A "driver" program invokes parallel operations such as map, filter or reduce on an RDD by passing a function to Spark, which
  • 12. INTRODUCTION TO DATA ANALYSIS WITH SPARK Pagina 9 then schedules the function's execution in parallel on the cluster3. These operations, and additional ones such as joins, take RDDs as input and produce new RDDs. RDDs are immutable and their operations are lazy; fault-tolerance is achieved by keeping track of the "lineage" of each RDD, the sequence of operations produced it, so that it can be reconstructed in the case of data loss. RDDs can contain any type of Python, Java, or Scala objects. In this section I will show you the basics of Apache Spark, then the working mechanism, what is the RDD file, the difference between actions and changes and the cache mechanism. How Spark works? Spark applications run as independent sets of processes on a cluster, coordinated by the SparkContext object in your main program (called the driver program). Specifically, to run on a cluster, the SparkContext can connect to several types of cluster managers (either Spark’s own standalone cluster manager, Mesos or YARN), which allocate resources across applications. Once connected, Spark acquires executors on nodes in the cluster, which are processes that run computations and store data for your application. Next, it sends your application code (defined by JAR or Python files passed to SparkContext) to the executors. Finally, SparkContext sends tasks to the executors to run. Figure 9 – The architecture of Spark, inside the Core The Spark architecture works through the Master-Worker pattern, where the Driver Program (Master), controlled by the Spark Context component, orchestra the work by sending the various jobs to the Workers; in particular, they divide the work into different chunks for each Executors; finally, they compute the result as quickly as possible. Once the Workers end to compute the request, they send the result to the Master that deals with aggregate all the results and insert them in the DAG (Directly Acyclic Graph). The work continues until all chunks has been computed. 3 (Zaharia, Chowdhury, Franklin, Shenker, & Stoica, 2011)
  • 13. INTRODUCTION TO DATA ANALYSIS WITH SPARK Pagina 10 RDD The core of Apache Spark is MapReduce algorithm: MapReduce and its variants have been highly successful in implementing large-scale data-intensive applications in commodity clusters. These systems are built around an Acyclic Data Graph model (DAG) that is not suitable for other popular applications: the focus on this type of working, across a multiset of multiple parallel operations is the better for this use. To achieve these goal, Spark introduces an abstraction called Resilient Distributed Datasets (RDDs). An RDD is a read-only collection of objects partitioned across a set of machines that can be rebuilt if a partition is lost. Thank the RDD and DAG model, Spark can outperform Hadoop by 10x in iterative machine learning jobs, and can be used to interactively query a hundred GB dataset with sub-second response time. Definition RDD is a distributed memory abstraction that lets programmers perform in-memory computations on large cluster in a fault-tolerant manner. RDDs are motivated by two types of applications: - Iterative algorithms - Interactive data mining tools In both cases, keeping data in memory can improve performance by an order of magnitude. can improve performance by an order of magnitude. To achieve fault tolerance efficiently, RDDs provide a restricted form of shared memory, based on coarse-grained transformations rather than fine-grained updates to shared state. Behavior The RDD - Resilient Distributed Datasets - is a collection of partitioned elements along the nodes of cluster which can be computed in parallel. The workers can access only to the RDD in read mode, so they cannot change it. More precisely: - Datasets: elements such as lists or arrays - Distributed: the work is distributed across the cluster, so the computation can take place in parallel - Resilient: given that the work is distributed, Spark has a powerful management of fault tolerance, because if a job on a node crash, Spark is able to detect the crash and immediately restore the job over to another node. Because many functions in Spark are lazy, rather than immediately calculate functions and extractions by time in time, these extractions are saved in what is called the DAG, the Directed Acyclic Graph, making fault tolerance as simple as possible. The Dag grow for all transformations that are carried out, as - Map - Filter - (Other lazy operation)
Once the transformations have been collected in the DAG, they can be reduced and brought to the desired result through actions such as:
- Collect
- Count
- Reduce
- ...
The results can be saved in local storage and then sent to the master (driver), or sent to it directly.

Transformations
Spark offers an environment that supports the functional paradigm, and thanks to Scala many operations can be expressed in a minimal number of lines.

Actions
While transformations are lazily evaluated, actions aggregate the data and return the result to the master (driver program). More precisely, each action aggregates a partial result directly on its node, and only then are the partial results sent to the master, which combines them all; this policy, inherited from the Hadoop world, was designed to minimize network traffic as much as possible.

Associative property
Thanks to the associative property of these aggregations, the data can be combined regardless of the order in which the partial results arrive. This is also what lets the SparkContext organize the cluster in different ways, a property that distinguishes distributed computing.

Figure 10 – Associative property of Spark

Actions on data
Whenever we want to "see" the result, collect materializes the entire RDD collection on the driver.
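A short sketch of this transformation/action split; the HDFS path is a hypothetical placeholder and sc is an existing SparkContext:

    // Transformations (lazy): nothing runs yet, the DAG simply grows.
    val lines = sc.textFile("hdfs:///data/input.txt") // hypothetical path
    val longLines = lines.filter(_.length > 80)
    val lengths = longLines.map(_.length)

    // Actions: the DAG is scheduled and results flow back to the driver.
    val howMany = lengths.count()
    val total = lengths.reduce(_ + _)
    val everything = longLines.collect() // brings the whole RDD to the driver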
Figure 11

Key Value Methods
Spark works on <key, value> pairs to transform and aggregate the computations you want to perform. Figure 12 lists some of the functions you can use on pair RDDs:

Figure 12 – RDD functions

Example: given the following key-value list:

Figure 13 – Key-value pair list

The functions highlighted above (exposed through the implicit rddToPairRDDFunctions conversion) distribute the table across the different nodes; note that identical keys are assigned to the same node.
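A minimal pair-RDD sketch along the lines of the figures above; the data are invented, and the methods come from the rddToPairRDDFunctions conversion just mentioned:

    import org.apache.spark.HashPartitioner

    // A small key-value RDD; identical keys end up in the same partition,
    // and therefore on the same node, when grouped or reduced.
    val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3), ("b", 4), ("c", 5)))

    val sums = pairs.reduceByKey(_ + _)                   // ("a",4), ("b",6), ("c",5)
    val grouped = pairs.groupByKey()                      // all values collected per key
    val byKey = pairs.partitionBy(new HashPartitioner(3)) // explicit key-based partitioning

    sums.collect().foreach(println)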
Figure 14

Now each node can perform its transformations and actions, and then send its result to the master.

Caching Data
- Suppose we have a simple RDD and want to run a computationally intensive algorithm on it, simulated here with the Thread.sleep() method.
- The result is another RDD containing the transformed data.
- Suppose we then apply a simple map function, far less expensive than the first step; without further care, the whole expensive computation would be triggered again.
- Instead of recalculating everything from scratch, Spark can store the intermediate result in a cache. By default Spark uses RAM, but, if specified, it can also spill to larger storage such as the hard disk (beware: in that case the disk can become a bottleneck).
- If you then continue the analysis, the data you work on are the ones already in the cache.
With this method there are two benefits:
1. Speed: the data, especially the intermediate results, are available quickly.
2. Fault tolerance: if the node running the heavy first transformation crashed, the data would be recalculated and made available again without the user noticing at all.
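A minimal caching sketch, with the expensive step simulated by Thread.sleep() as in the description above (the sizes and sleep time are arbitrary):

    val raw = sc.parallelize(1 to 100)

    // An "expensive" transformation, simulated with Thread.sleep().
    val heavy = raw.map { x => Thread.sleep(100); x * x }

    // or heavy.persist(org.apache.spark.storage.StorageLevel.MEMORY_AND_DISK) to allow spilling to disk
    heavy.cache()

    heavy.count()              // first action: pays the full cost and fills the cache
    heavy.map(_ + 1).collect() // later, lighter steps reuse the cached intermediate RDD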
Distribution and Instrumentation
This section explains the system-level aspects of Spark: where the data flows and which components are activated when an action occurs.

Spark Submit
In Spark, applications (and the MapReduce-style functions they contain) are launched through the spark-submit command. A launched application refers to a main program and is handed over to the driver, which talks to the cluster manager.

Figure 15

At this point, depending on how your cluster is configured, the driver may run directly on the machine on which Spark is installed (for example, in local mode) or be sent to the master node at the head of the distributed cluster; this choice is made when the main program is launched. The driver then asks the cluster manager for the specific resources to be used in the computation and, depending on availability, the cluster manager grants the resources that the MapReduce-style jobs can use.

Figure 16

At this point the driver executes the main application, building up the RDDs until an action occurs; the action causes the work on the RDDs to be shipped to the nodes of the cluster. Once the workflow on the cluster is complete, execution continues inside the driver until the program itself terminates.
Figure 17

Now the resources are freed. Note that resources may also be released manually before the job is completed.
To better understand the possibilities offered by spark-submit, you can browse the help printed directly by the command shell:

Figure 18 – List of spark-submit options

1. First, note that spark-submit only accepts compiled driver programs, in Java/Scala or Python, together with the options described below.
2. Through the kill and status commands you can forcibly terminate, or keep monitoring, the jobs known to Spark, passing the appropriate application ID.
3. The master option determines whether Spark runs locally or on the cluster, if one is available, and can also specify the number of processor cores to use.
4. The deploy-mode option accepts only two values, client or cluster, and specifies where the driver described above resides. If client is chosen, the driver resides on the same machine from which you launch the computation; this obviously requires a fast connection between that machine and the computing cluster. The cluster option, on the other hand, offers a "fire-and-forget" service, sending the computation directly to the master.
5. Another option is the class option, which specifies the main class to run.
6. The name option assigns a name to the process that will be computed.
7. The jars and py-files options specify the files to submit; note that both Scala and Java programs are packaged as JAR files.
8. The executor-memory option (not shown in the screenshot) sets the amount of memory available to each executor of the launched process. The default is 1 GB, but you can change it depending on the power required.
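Putting these options together, a typical invocation might look like the following; the class name, application name, host and JAR are purely illustrative placeholders:

    spark-submit \
      --class com.example.MyApp \
      --name my-spark-job \
      --master spark://master-host:7077 \
      --deploy-mode cluster \
      --executor-memory 2G \
      my-app.jar input.txt output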
Cluster Manager
Apache Spark's cluster management is distributed. The main goal of the cluster manager is to manage and distribute the workload, assigning tasks to the machines that compose the cluster.

Figure 19 – Cluster Manager View

The kernel that normally manages a single machine, in this case the master, is in practice "distributed" over the entire cluster.

Main Cluster Managers

Figure 20 – Main cluster managers

 Spark ships with its own manager: Spark Standalone.
 The Hadoop cluster manager, on the other hand, is YARN, very convenient both in terms of infrastructure and in terms of ease of use.
 Apache Mesos is another product created at the University of California, Berkeley, in competition with YARN.
Standalone Manager
The Spark Standalone manager is simple and quite convenient if you are looking for a manager with just the basic functions.
 To use it with spark-submit, simply enter the host and port of the machine on which you want to launch the process (the default port is 7077).
 The standalone manager works both in client mode and in cluster mode.
 Another very helpful setting is spark.deploy.spreadOut: when enabled, the manager tries to distribute the workload across the nodes of the cluster, involving as many machines as possible. If the parameter is set to false, the cluster manager instead tries to work with as few nodes as possible, grouping the work together.
 The total-executor-cores option lets you specify the total number of cores a job may use across its executors. By default all available cores in the cluster are used, but you can specify a lower number if you want to leave capacity for future computations.

Apache Mesos Manager
 The master URL is pretty much the same as in standalone mode; the only difference is the port, which is 5050.
 It supports both deploy modes, client and cluster; Mesos is not bundled with Spark and must be installed separately.
 Mesos can allocate resources dynamically in two ways, fine-grained and coarse-grained. By default it uses the fine-grained mode; the difference lies in how the various jobs are allocated:
o fine-grained: each Spark task is mapped to and run as a Mesos task. This allows applications to scale their resources dynamically when needed⁴.
o coarse-grained: this mode holds a large amount of resources for the whole duration of the launched application. It is best if you want maximum performance, since each application can determine its best configuration and keep it.
 If spark.mesos.coarse is set to true, you can also set the number of cores to be used on each node of the cluster; by default all available cores are used.
 Unlike the standalone manager, you cannot choose the kind of spread you want: by default Mesos uses as few nodes as possible, minimizing how far the work is distributed within the cluster.

Hadoop YARN
 Unlike the previous cluster managers, launching a spark-submit process simply requires the yarn master option, since the HADOOP_CONF_DIR / YARN_CONF_DIR configuration directory already holds the information needed to connect to the cluster manager.
 Client / cluster: with this manager too you can choose the launch mode of the computation.
 The number of executors is set by default to 2, which causes problems if you want to improve performance and scale.
 You can specify the number of cores to be used by each executor, which by default is set to 1.
 Unlike all other managers, YARN has the concept of queues: you can specify the queue for the jobs that have yet to be launched by the cluster manager.

⁴ This type of resource management, for example, is not optimal for streaming applications, which require low latency; it is, however, well suited to running several processes launched by different instances in parallel: the resources needed are automatically allocated and distributed.
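To make the three managers concrete, the master URLs passed to spark-submit differ as sketched below; the host names, JAR and resource numbers are placeholders:

    # Spark Standalone (default port 7077)
    spark-submit --master spark://master-host:7077 --total-executor-cores 8 my-app.jar

    # Apache Mesos (default port 5050); coarse-grained mode enabled via configuration
    spark-submit --master mesos://mesos-host:5050 --conf spark.mesos.coarse=true my-app.jar

    # Hadoop YARN (the cluster is located through HADOOP_CONF_DIR / YARN_CONF_DIR)
    spark-submit --master yarn --deploy-mode cluster --num-executors 4 --executor-cores 2 my-app.jar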
SPARK LIBRARIES
In this section we will look at how Spark is used at a pragmatic level, that is, how it is used to query the Hadoop file system and which features it makes available.

Spark SQL
It is called Spark SQL because this functionality provides a working environment where you can query any database stored in Hadoop with a language very similar to SQL. Spark SQL is a Spark module for structured data processing. Unlike the basic Spark RDD API, the interfaces provided by Spark SQL give Spark more information about the structure of both the data and the computation being performed. Internally, Spark SQL uses this extra information to perform extra optimizations. There are several ways to interact with Spark SQL, including SQL, the DataFrames API and the Datasets API. When computing a result, the same execution engine is used, independently of which API or language you use to express the computation. This unification means that developers can easily switch back and forth between the various APIs based on which provides the most natural way to express a given transformation.

One use of Spark SQL is to execute SQL queries written using either basic SQL syntax or HiveQL. Spark SQL can also be used to read data from an existing Hive installation; for more on how to configure this feature, refer to the documentation on Hive tables. When running SQL from within another programming language, the results are returned as a DataFrame. You can also interact with the SQL interface using the command line or over JDBC/ODBC.

This chapter shows only one example of how to use functional-style operations to query the Hadoop cluster. The language chosen for the example is Scala, the most expressive of those supported by the Spark community. The Spark SQL interface supports the automatic conversion of an RDD containing case classes into a DataFrame⁵. The classes define the schemas of the tables: the names of the class fields are read using reflection (the ability of a program to reason about its own structure, in particular its source code) and become the column names. Classes can also be nested or contain complex types such as sequences and arrays. Such an RDD can be converted implicitly into a DataFrame and then registered as a table, and the tables can subsequently be used in SQL statements. An excerpt of this code is shown in Figure 21.

⁵ A DataFrame is a distributed collection of data organized into named columns. It is conceptually equivalent to a table in a relational database or a data frame in R/Python, but with richer optimizations under the hood. DataFrames can be constructed from a wide array of sources, such as structured data files, tables in Hive, external databases, or existing RDDs. The DataFrame API is available in Scala, Java, Python, and R.
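In the spirit of the excerpt in Figure 21, a minimal reflection-based sketch; the Person case class and its data are hypothetical, and the Spark 1.x-style SQLContext API is assumed:

    import org.apache.spark.sql.SQLContext

    // Field names of the case class become column names via reflection.
    case class Person(name: String, age: Int)

    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    // An RDD of Person objects is converted implicitly into a DataFrame...
    val people = sc.parallelize(Seq(Person("Alice", 29), Person("Bob", 17))).toDF()

    // ...and registered as a table that SQL statements can use.
    people.registerTempTable("people")
    val adults = sqlContext.sql("SELECT name FROM people WHERE age >= 18")
    adults.collect().foreach(println)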
Figure 21 – Example code

Impala vs. Hive: differences between SQL-on-Hadoop components
Apache Hive was introduced by Facebook to manage and process large amounts of distributed data sets saved on Hadoop. Apache Hive is an abstraction over MapReduce and Hadoop that lets you query databases in its own SQL-like language, HiveQL. Cloudera Impala, on the other hand, was developed to overcome the limitations posed by Hive's slow SQL interaction with Hadoop: it provides high performance for processing and analyzing data saved on the Hadoop cluster.

Table 1 – Impala vs. Hive
- Cloudera Impala does not require data to be moved or transformed before querying what is stored in Apache HDFS and HBase; Apache Hive is a data warehouse infrastructure built on top of the Hadoop platform to support tasks such as querying, analysis, processing and displaying data.
- Impala uses code generated at runtime for the "big loops", via LLVM (Low Level Virtual Machine); Hive generates its queries at compile time.
- Impala avoids startup overhead: its processes are daemons started when the nodes boot and are always ready to plan, coordinate and execute queries; all Hive jobs, being compiled at query time, suffer from the "cold start" problem.
- Impala is very fast at parallel processing and follows the SQL-92 standard; Hive translates queries written in its own dialect (HiveQL) into a DAG of MapReduce jobs that execute them (for example, in Hive one can write a query like the sketch below).
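The following HiveQL sketch is purely illustrative; the page_views table and its columns are hypothetical and are not taken from the original figure:

    -- Hypothetical table of page views stored on the Hadoop cluster
    CREATE TABLE IF NOT EXISTS page_views (user_id STRING, url STRING, ts TIMESTAMP)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

    -- HiveQL looks like SQL, but the query is executed as MapReduce jobs
    SELECT url, COUNT(*) AS views
    FROM page_views
    GROUP BY url
    ORDER BY views DESC
    LIMIT 10;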
Cloudera Impala
Cloudera Impala is an excellent choice for developers who do not need to move large amounts of data: it integrates easily with the Hadoop ecosystem and with frameworks like MapReduce, Apache Hive, Pig and other Apache Hadoop software. Its architecture is specifically designed around the familiar SQL language, to support users making the transition from relational databases to big data.

Columnar Storage
In Impala, data are saved in a columnar layout that provides a good compression ratio and very efficient scanning.

Figure 22 – Columnar storage with Impala

Tree Architecture
The tree architecture is crucial to distributed computing: the query is delivered to the nodes and the results are aggregated once the sub-processing is complete.

Figure 23 – Tree architecture

Impala offers the following features:
 Support for HDFS (Hadoop Distributed File System) and Apache HBase storage
 It recognizes the common Hadoop file formats: text, LZO, SequenceFile, Avro, RCFile and Parquet
 Support for Hadoop security (Kerberos authentication)
 It can read metadata, the ODBC driver and the SQL syntax directly from Apache Hive

Apache Hive
Initially developed by Facebook, Apache Hive is a data warehouse infrastructure built on top of Hadoop to provide tools for querying, analysis, processing and visualization. It is a versatile tool that supports the analysis of large databases stored in Hadoop HDFS and in compatible systems like Amazon S3. To maintain compatibility with SQL it uses HiveQL, converting the queries and submitting them as MapReduce jobs. Other features of Hive include:
 Indexing, to speed up computation
 Support for different storage formats, such as text files, RCFile, HBase, ORC and others
 Support for building user-defined functions (UDFs) to manipulate strings, dates, ...
If you are looking for an advanced analytics language that keeps a great deal of familiarity with SQL (without writing separate MapReduce jobs), Apache Hive is the best choice. HiveQL queries are still converted into the corresponding MapReduce jobs, which run on the cluster and return the final result.

Streaming Data
Spark Streaming is an extension of the core API that enables streaming processing that is scalable, high-throughput and fault-tolerant. Data can be ingested from various sources such as Kafka, Flume, HDFS/S3, Kinesis, Twitter or plain TCP sockets, and can be processed using complex algorithms expressed at a high level thanks to the functional paradigm: map, reduce, join and window operations. The processed data can then be written to a file system or a database.

Figure 24 – Streaming data

Internally, Spark works as follows: it receives the streaming data live and divides it into separate batches, which are then processed by the Spark engine to generate the final stream containing the results.

Figure 25 – Streaming data in detail

Spark Streaming provides a high-level abstraction called Discretized Stream or, more simply, DStream, which represents a continuous stream of data. DStreams can be created from input data streams coming from sources such as Kafka, Flume and Kinesis, or through high-level operations on other DStreams. Internally, a DStream is represented as a sequence of RDDs.
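A minimal streaming word count in the DStream style just described; the socket host/port and batch interval are placeholders, and sc is an existing SparkContext:

    import org.apache.spark.streaming.{Seconds, StreamingContext}

    // A StreamingContext on top of the existing SparkContext, with 10-second micro-batches.
    val ssc = new StreamingContext(sc, Seconds(10))

    // A DStream built from a plain TCP socket; internally it is a sequence of RDDs.
    val lines = ssc.socketTextStream("localhost", 9999)
    val counts = lines.flatMap(_.split(" ")).map(w => (w, 1)).reduceByKey(_ + _)

    counts.print() // or counts.saveAsTextFiles("hdfs:///out/wordcounts")

    ssc.start()            // start receiving and processing the stream
    ssc.awaitTermination()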
Machine Learning
Machine learning algorithms are constantly growing, evolving and becoming ever more efficient. To understand machine learning in Spark, it helps to look at how the library itself has evolved.

Machine Learning Library (MLlib)
MLlib is Spark's machine learning (ML) library. Its goal is to make practical machine learning scalable and easy. It consists of common learning algorithms and utilities, including classification, regression, clustering, collaborative filtering and dimensionality reduction, as well as lower-level optimization primitives and higher-level pipeline APIs. It is divided into two packages:
 spark.mllib contains the original API built on top of RDDs.
 spark.ml provides a higher-level API built on top of DataFrames for constructing ML pipelines.
Using spark.ml is recommended because, with DataFrames, the API is more versatile and flexible, but spark.mllib continues to be supported alongside the development of spark.ml. Users can keep using spark.mllib features and expect more to come, while developers are encouraged to contribute new algorithms to spark.ml when they fit the ML pipeline concept well, e.g. feature extractors and transformers.
The motivation is that, on the one hand, Matlab and R are easy to use but do not scale; on the other hand, tools such as Mahout and GraphLab grew to be very scalable but difficult to use.

Figure 26 – MLlib architecture

The Spark MLlib library aims to make the practice of machine learning scalable and relatively easy. It consists of a set of algorithms that includes regression, clustering, collaborative filtering, dimensionality reduction and other functions that allow low-level integration. The architecture optimizes the high-level APIs that are provided, and it also offers tools for feature extraction and transformation.
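A small spark.ml pipeline sketch in the style recommended above; the tiny training DataFrame and its column names are invented, and a Spark 1.x-era sqlContext is assumed to exist:

    import org.apache.spark.ml.Pipeline
    import org.apache.spark.ml.classification.LogisticRegression
    import org.apache.spark.ml.feature.{HashingTF, Tokenizer}

    // Hypothetical training data: (id, text, label).
    val training = sqlContext.createDataFrame(Seq(
      (0L, "spark is fast", 1.0),
      (1L, "plain old batch job", 0.0)
    )).toDF("id", "text", "label")

    // A pipeline chains feature extraction and a learning algorithm.
    val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
    val hashingTF = new HashingTF().setInputCol("words").setOutputCol("features")
    val lr = new LogisticRegression().setMaxIter(10)

    val model = new Pipeline().setStages(Array(tokenizer, hashingTF, lr)).fit(training)
    model.transform(training).select("text", "prediction").show()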
GraphX
GraphX is the Spark component for graphs and graph-parallel computation. GraphX extends Spark's RDDs by introducing a new graph abstraction that attaches the concepts of vertex and edge to them. To support this, GraphX provides operators such as subgraph, joinVertices and aggregateMessages. In addition, GraphX includes a growing collection of graph algorithms to simplify access and high-level analysis.

Figure 27 – How to use GraphX

The WWW itself is a gigantic graph; companies like Twitter, Netflix and Amazon have used this concept to expand their markets and their profits, and in medicine graphs are widely used in research.
How does GraphX work? A GraphX graph is a directed multigraph made of vertices and edges; each vertex can carry metadata as well as a unique identifier within the graph.
In this example each point represents a person, as happens every day in the "social graph". In GraphX this object is called a vertex, represented as:
- (VertexId, VType): simply a key-value pair.
The arrows connect the vertices; in the example they encode the relationships between the people stored in the vertices. In Spark these are called edges, and they store the IDs of the connected vertices plus the type of relationship between them:
- Edge(VertexId, VertexId, EType)
The Spark graph is built on top of these simple relationships; in particular, it is built over two different RDDs, one containing the vertices and the other containing the edges. There is one more concept to keep in mind when working with graphs: the combined view of each connection. The graph exposes EdgeTriplets, which contain all the information about a connection (source vertex, edge and destination vertex). This gives a much more complete view of the graph and makes reasoning about it much simpler.
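A minimal GraphX sketch of these vertex/edge/triplet concepts; the people and relationships are invented:

    import org.apache.spark.graphx.{Edge, Graph}
    import org.apache.spark.rdd.RDD

    // Vertices: (VertexId, metadata) pairs; here the metadata is a person's name and age.
    val vertices: RDD[(Long, (String, Int))] = sc.parallelize(Seq(
      (1L, ("Alice", 28)), (2L, ("Bob", 27)), (3L, ("Charlie", 65))
    ))

    // Edges: (source id, destination id, relationship type).
    val edges: RDD[Edge[String]] = sc.parallelize(Seq(
      Edge(1L, 2L, "friend"), Edge(2L, 3L, "colleague")
    ))

    val graph = Graph(vertices, edges)

    // Triplets expose source vertex, edge attribute and destination vertex together.
    graph.triplets
      .map(t => s"${t.srcAttr._1} is a ${t.attr} of ${t.dstAttr._1}")
      .collect()
      .foreach(println)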
OPTIMIZATIONS AND THE FUTURE
In this final part I look at some recurring issues that help in understanding how Spark behaves: how the following constructs work and what problems they may cause.
 Closures
 Broadcast
 Partitioning

Understanding closures
One of the most difficult things to understand in Spark is the scope and lifetime of variables and methods when the code is executed on the cluster. RDD operations that modify variables outside of their scope are a frequent source of confusion. The example below uses foreach() to increment a counter, but similar problems occur with other well-known operations.
Example: consider summing the elements of an RDD; the behavior may differ depending on where the code is executed. A common case is what happens when we move Spark from local mode to cluster mode (via YARN, for example).

Local vs. cluster modes
The behavior of such code is not clearly defined and might not work as intended. To run jobs, Spark divides the processing of an RDD into tasks, each run by an executor. Before execution, Spark computes the closure of each task: the closure is the set of variables and methods that must be visible to the executor to perform its computation on the RDD (in this case, the body of the foreach statement). This closure is serialized and sent to each executor. The variables within the closure are therefore copies: when the counter is incremented inside the foreach, it does not refer to the counter on the master node (driver). The original counter still exists in the driver's memory, but it is not visible to the executors, which only "see" their serialized copy of the closure. As a result, the final value of the counter on the driver remains zero, because the updates happen only on the executors' copies and are never sent back to the driver.
In the local environment, instead, the foreach may run in the same JVM as the driver and update the original counter directly, behaving differently. To ensure well-defined behavior in these scenarios you should use an Accumulator. Accumulators in Spark provide a mechanism for safely updating a variable by aggregating the updates that the worker nodes send back to the driver.
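A sketch of the difference: the foreach counter mirrors the problematic pattern above, while the accumulator is the safe alternative (the Spark 1.x-style sc.accumulator API is assumed):

    val data = sc.parallelize(1 to 100)

    // Unreliable on a cluster: each executor increments its own serialized
    // copy of `counter`, so the driver's variable is never updated.
    var counter = 0
    data.foreach(x => counter += x)
    println(s"Closure counter: $counter") // typically still 0 in cluster mode

    // Safe: an accumulator is updated by the tasks and aggregated on the driver.
    val acc = sc.accumulator(0) // sc.longAccumulator("sum") in newer versions
    data.foreach(x => acc += x)
    println(s"Accumulator: ${acc.value}") // 5050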
Broadcasting
In the previous section you saw how Spark sends a copy of the data needed by each task to the various workers (nodes) of the cluster. Without a well-designed strategy, this kind of operation can lead to problems, given that the computations Spark performs are massive, on the order of gigabytes. This section first shows what incorrect management of this shipping looks like, and then the strategy implemented in Spark to solve the problem.
Example: consider lazy operations whose closures contain a map and a flatMap, respectively. Given a master (driver) and three workers:
1. A 1 MB dataset (a dummy variable) is created inside the driver.
2. Every time an action is executed on the workers, a copy of that data is sent to each worker.
3. If the first worker needs it, for example, 4 times, the 1 MB become 4 MB at the local level. Scaled up to all the workers of even this small network, the shipped data could grow out of control.
In a network with many nodes this must not happen. In Spark the issue is solved as follows: the shared data (the "indexer") is wrapped in a broadcast variable, and the lazy map operations then refer to that broadcast value. The broadcast mechanism first compresses the data as much as possible; then:
1. The indexer is sent once, and only once, to each worker.
2. Each worker caches its copy of the indexer.
This keeps the growth of the shipped data under control, even when the workers repeatedly compute over the same data. Not only that: to relieve the possible bottleneck between driver and workers, the workers can exchange the indexer among themselves in a manner similar to BitTorrent networks. This means, for example, that if the first worker has already received the indexer to perform some computation, and a second worker also needs it, the workers can help each other, minimizing the calls to the driver that could otherwise cause an annoying bottleneck and add latency.
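A minimal broadcast sketch of this mechanism; the lookup map is an invented stand-in for the "indexer" described above:

    // A sizeable lookup structure built once on the driver.
    val lookup = (1 to 100000).map(i => i -> s"value-$i").toMap

    // Broadcast it: it is shipped to each worker once, cached there,
    // and shared by every task running on that node.
    val bcLookup = sc.broadcast(lookup)

    val ids = sc.parallelize(1 to 1000)
    val resolved = ids.map(id => (id, bcLookup.value.getOrElse(id, "missing")))
    resolved.count()

    bcLookup.unpersist() // frees the copies cached on the executors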
Optimizing Partitioning
The previous section showed how uncontrolled shipping of data can make an application grow in an uncontrolled manner. This section shows instead how a disproportionate number of partitions can be just as harmful if misused. The problem is best explained through an example. Given the following RDD:
 a "huge" data set,
 the values from 1 to Int.MaxValue,
 divided into 10,000 partitions,
1. perform a filter that shrinks the result as much as possible,
2. sort the remaining values,
3. perform a map that adds 1 to each element of the final array,
4. show the result using the collect action.
If you then look, at the system level, at how the nodes of the network behave, the results are surprising. First, the execution is split into 2 stages of roughly 10,000 tasks each; the first (sortBy) completed in 13 seconds, the second (collect) in 20 seconds. If we analyze the result more closely and drill into the collect stage, we see that the sortBy performed after the filtering took a full 20 seconds. This is because Spark was asked to use 10,000 tasks to sort about 10 values (as the code is written, the sortBy is performed after the filtering).
Figure 28 – System-level view

As you can see, this overhead is harmful if not kept under control: the image shows that 10,000 tasks are executed, and almost every one of them spends essentially zero time on the required computation. Simply put, the time spent scheduling and synchronizing the executors is far higher than the time spent on the desired computation, inflating the overhead and the latency perceived by the user out of all proportion. A technique for controlling this is the coalesce function, which here reduces the number of partitions from 10,000 to 8 after the filter has run. The second argument passed to coalesce controls shuffling: if it is set to true the output is redistributed evenly across the nodes, whereas if it is set to false as few partitions as possible are used without a shuffle. If you monitor the execution time, the result is the same but the performance is far better: the collect performed after the filtering, in the second stage of the second test, takes only 4 seconds compared with the 20 seconds used previously. In detail, the sort now takes only 24 milliseconds and requests only 8 tasks in total. The coalesce function saves an enormous amount of time by reducing the number of partitions processed when actions are executed.
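A sketch of the pattern just described; the numbers mirror the example (1 to Int.MaxValue in 10,000 partitions) but the exact figures are illustrative:

    // A huge range split into very many partitions.
    val huge = sc.parallelize(1 to Int.MaxValue, 10000)

    // Filtering leaves a handful of values, still spread over ~10,000
    // mostly-empty partitions, so the sort would schedule ~10,000 tiny tasks.
    val few = huge.filter(_ <= 10)

    // coalesce(8, shuffle = false) shrinks the RDD to 8 partitions first,
    // so the expensive steps schedule only 8 tasks.
    val result = few.coalesce(8, shuffle = false)
      .sortBy(identity)
      .map(_ + 1)
      .collect()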
Future of Spark
This section shows some features that the community around Spark is trying to develop and advance.

R
Integration with R is definitely the first "frontier" to be conquered in order to bring together the community of mathematicians and statisticians and that of engineers and data scientists. R offers a user-friendly development environment from a statistical and mathematical point of view and is widely used for data analysis.

Tachyon
Why does Spark turn out to be up to 100 times faster than MapReduce? Part of the answer is provided by the community working on Tachyon. Memory is the key to processing big data quickly and scalably: Tachyon is a distributed in-memory storage layer adopted by Spark which, in addition to providing automatic fault tolerance and caching techniques, provides reliable memory sharing across the cluster. Continuing to develop this kind of service would make big data analysis even faster and more powerful than it already is.

Project Tungsten
Another initiative developed by the community that has gathered around Spark is Project Tungsten: the project changes the execution core running on the nodes of the cluster, substantially improving the efficiency of memory and CPU usage for Spark applications in order to push performance as far as possible. The key observation concerns modern CPUs: frameworks have traditionally been built around the cost of I/O, so this research aims to minimize I/O and to extract the maximum performance from the CPU and memory when the machines are used as a cluster.
References
Karau, H., Konwinski, A., Wendell, P., & Zaharia, M. (2015). Learning Spark. Sebastopol, CA: O'Reilly Media.
Mortar Data Software. (2013, May). How Spark Works. Retrieved from https://www.linkedin.com/company/mortar-data-inc.
Phatak, M. (2015, January 2). History of Apache Spark: Journey from Academia to Industry. Retrieved from http://blog.madhukaraphatak.com/history-of-spark/
SAS - The Power to Know. (2013, April). Big Data. Retrieved from http://www.sas.com/en_ph/insights/big-data/what-is-big-data.html
Zaharia, M., Chowdhury, M., Franklin, M. J., Shenker, S., & Stoica, I. (2011, June). Spark: Cluster Computing with Working Sets. USENIX Workshop on Hot Topics in Cloud Computing (HotCloud). Retrieved from https://amplab.cs.berkeley.edu/wp-content/uploads/2011/06/Spark-Cluster-Computing-with-Working-Sets.pdf
Contact information
MASSIMILIANO MARTELLA
Email: massimilian.martella@studio.unibo.it