SPARK
THE NEW AGE FOR DATA SCIENTIST
In a very short time, Apache Spark has emerged as the next generation big
data processing engine, and is being applied throughout the industry faster
than ever.
TABLE OF CONTENTS
GETTING STARTED
   WHAT IS APACHE SPARK?
   BIG DATA: WHAT IT IS AND WHY IT MATTERS
      Parallelism enables Big Data
      Big data defined
   WHY SPARK?
   FROM BIG CODE TO SMALL CODE
      What language should I choose for my next Apache Spark project?
A BRIEF HISTORY OF SPARK
   THE INTERVIEW
   THE GROWTH OF SPARK
   THE COMPANIES AROUND SPARK
A UNIFIED STACK
   SPARK CORE
      How Spark works
         RDD
         Key Value Methods
         Caching Data
      Distribution and Instrumentation
         Spark Submit
         Cluster Manager
   SPARK LIBRARIES
      Spark SQL
         Impala vs. Hive: differences between SQL-on-Hadoop components
         Cloudera Impala
         Apache Hive
         Other features of Hive
      Streaming Data
      Machine Learning
         Machine Learning Library (MLlib)
      GraphX
         How GraphX works
   OPTIMIZATIONS AND THE FUTURE
      Understanding closures
         Local vs. cluster modes
      Broadcasting
      Optimizing Partitioning
      Future of Spark
         R
         Tachyon
         Project Tungsten
REFERENCES
CONTACT INFORMATION
Getting Started
In this section I will explain the concepts behind Spark: how Spark works, how it was built, the advantages of using it, and a brief nod to the history of the MapReduce algorithm.
WHAT IS APACHE SPARK?
Apache Spark is an open-source framework developed by the AMPLab at the University of California, Berkeley, and subsequently donated to the Apache Software Foundation. Unlike Hadoop's disk-based, two-stage MapReduce paradigm, the in-memory primitives provided by Spark allow performance up to 100 times better for certain workloads.
Apache Spark is a general-purpose distributed computing engine for processing and analyzing large
amounts of data. Spark offers performance improvements over MapReduce, especially when Spark’s in-
memory computing capabilities can be leveraged. In practice, Spark extends the popular MapReduce
model to efficiently support more types of computations, including interactive queries and stream
processing. Speed is important in processing large datasets, or Big Data, as it means the difference between
exploring data interactively and waiting minutes or hours.
Figure 1 – Spark is the best solution for Big Data
BIG DATA: WHAT IT IS AND WHY IT MATTERS
Big data is a popular term used to describe the exponential growth and availability of data, both structured and unstructured. Today Big Data is as important to business – and society – as the Internet has become.
Why? More data may lead to more accurate analyses. More accurate analyses may lead to more confident decision making. And better decisions can mean greater operational efficiency, cost reductions and reduced risk.
Organizations are increasingly generating large volumes of data as a result of instrumented business
processes, monitoring of user activity, web site tracking, sensors, finance, accounting, among other
reasons. For example, with the advent of social network websites, users create records of their lives by
daily posting details of activities they perform, events they attend, places they visit, pictures they take,
and things they enjoy and want. This data deluge is often referred to as Big Data, a term that conveys the
challenges it poses to existing infrastructure with respect to storage, management, interoperability, governance and analysis of large amounts of data.
Parallelism enables Big Data
• Big Data
- Applying data parallelism to the processing of large amounts of data (terabytes, petabytes, …)
- Exploiting clusters of computing resources and the Cloud
• MapReduce
- A data parallelism technique introduced by Google for big data processing
- Large scale: distributes both data and computation over large clusters of machines
- Can be used to parallelize different kinds of computational tasks
- Open source implementation: Hadoop
Big data defined
As far back as 2001, industry analyst Doug Laney articulated the now mainstream definition of big data
as the three Vs of big data: volume, velocity and variety.
Figure 2 – Three Vs of Big Data
• Volume defines the amount of data: many factors contribute to the increase in data volume, such as transaction-based data stored over the years. In the past, excessive data volume was a storage issue. But with decreasing storage costs, other issues emerge, including how to determine relevance within large data volumes and how to use analytics to create value from relevant data.
• Velocity refers to the rate at which the data is produced and processed, for example in the batch
mode, near-time mode, real-time mode and streams. Data is streaming in at unprecedented speed
and must be dealt with in a timely manner. Reacting quickly enough to deal with data velocity is
a challenge for most organizations.
• Variety represents the data types. Data today comes in all types of formats. Structured, numeric
data in traditional databases. Information created from line-of-business applications.
Unstructured text documents, email, video, audio, stock ticker data and financial transactions.
Managing, merging and governing different varieties of data is something many organizations
still grapple with.
In addition, many companies consider two further dimensions when they think about big data:
• Variability. In addition to the increasing velocities and varieties of data, data flows can be highly
inconsistent with periodic peaks. Is something trending in social media? Daily, seasonal and
event-triggered peak data loads can be challenging to manage. Even more so with unstructured
data involved.
• Complexity. Today's data comes from multiple sources. And it is still an undertaking to link,
match, cleanse and transform data across systems. However, it is necessary to connect and
correlate relationships, hierarchies and multiple data linkages or your data can quickly spiral out
of control.
Figure 3 – The cube of Big Data
WHY SPARK?
On the generality side, Spark is designed to cover a wide range of workloads that previously required
separate distributed systems, including batch applications, iterative algorithms, interactive queries and
streaming.
By supporting these workloads in the same engine, Spark makes it easy and inexpensive to combine
different processing types, which is often necessary in production data analysis pipelines. In addition, it
reduces the management burden of maintaining separate tools.
Spark is designed to be highly accessible, offering simple APIs in Python, Java, Scala, and SQL, and rich
built-in libraries. It also integrates closely with other Big Data tools. In particular, Spark can run in Hadoop
clusters and access any Hadoop data source, including Cassandra, for example.
FROM BIG CODE TO SMALL CODE
The most difficult thing for big data developers today is choosing a programming language for big data
applications. There are many languages that a developer can use, such as Python or R: both are the
languages of choice among data scientists for building Hadoop applications. With the advent of various
big data frameworks like Apache Spark or Apache Kafka, the Scala programming language has gained prominence among big data developers.
What language should I choose for my next Apache Spark project?
The answer to this question varies, as it depends on the programming expertise of the developers, but Scala has arguably become the language of choice for working with big data frameworks like Apache Spark. Moreover, Spark itself is mostly written in Scala; more precisely, the code base breaks down roughly as follows:
1. Scala 81.2%
2. Python 8.7%
3. Java 8.5%
4. Shell 1.3%
5. Other 0.3%
Figure 4 – The Growth of Scala
Writing a program in Scala in a functional style need not hurt performance. Functional programming is an abstraction that restricts certain practices, for example side-effecting I/O. Enforcing this abstraction allows the code to be executed in a parallel execution environment. For this reason Spark exposes an essentially functional paradigm, and Scala turns out to be the most efficient of the supported languages.
Figure 5 – From Big Code to Tiny Code
In the next sections I will explain the advantages of Spark. Its philosophy of tight integration has several benefits:
- All libraries and higher-level components, in the stack, benefit from improvements at the
lower layers.
For example: when Spark’s core engine adds an optimization, SQL and machine learning
libraries automatically speed up as well.
- The costs associated with running the stack are minimized, because instead of running 5-10
independent software systems, an organization needs to run only one. These costs include
deployment, maintenance, testing, support, and others.
This also means that each time a new component is added to the Spark stack, every organization that uses
Spark will immediately be able to try this new component. This changes the cost of trying out a new type
of data analysis from downloading, deploying, and learning a new software project to upgrading Spark.
The advantages are the following:
- Readability and expressiveness
- Speed and testability
- Interactivity
- Fault tolerance
- A unified approach to Big Data
A Brief History of Spark
Apache Spark is one of the most interesting frameworks in big data in recent years. Spark had its humble
beginnings as a research project at UC Berkeley. Recently, O’Reilly’s Ben Lorica interviewed Ion Stoica, UC Berkeley professor and Databricks CEO, about the history of Apache Spark.
The question was: “How did Apache Spark get started?”
THE INTERVIEW
Ion Stoica replied:
“the story started back in 2009 with Mesos1. It was a class project in UC Berkley. Idea was to
build a cluster management framework, which can support different kind of cluster
computing systems. Once Mesos was built, then we thought what we can build on top of Mesos.
That’s how spark was born.”
“We wanted to show how easy it was to build a framework from scratch in Mesos. We also
wanted to be different from existing cluster computing systems. At that time, Hadoop targeted
batch processing, so we targeted interactive iterative computations like machine learning.”
“The requirement for machine learning came from the head of department at that time. They
were trying to scale machine learning on top of Hadoop. Hadoop was slow for ML as it needed
to exchange data between iterations through HDFS. The requirement for interactive querying
came from the company, Conviva, which I founded that time. As part of video analytical tool
built by company, we supported adhoc querying. We were using MySQL at that time, but we
knew it’s not good enough. Also there are many companies came out of UC, like Yahoo,
Facebook were facing similar challenges. So we knew there was need for framework focused
on interactive and iterative processing.2”
1 Mesos is a distributed systems kernel: it abstracts CPU, memory, storage, and other compute resources away from machines (physical or virtual), enabling fault-tolerant and elastic distributed systems to be built easily and run effectively. A later section explains Mesos in more detail.
2 (Phatak, 2015)
THE GROWTH OF SPARK
Spark is an open source project that has been built and is maintained by a thriving and diverse community
of developers. If you or your organization are trying Spark for the first time, you might be interested in
the history of the project. Spark started in 2009 as a research project in the UC Berkeley RAD Lab, later to
become the AMPLab. The researchers in the lab had previously been working on Hadoop Map-Reduce,
and observed that MapReduce was inefficient for iterative and interactive computing jobs. Thus, from the
beginning, Spark was designed to be fast for interactive queries and iterative algorithms, bringing in ideas
like support for in-memory storage and efficient fault recovery. Research papers were published about Spark at academic conferences and, soon after its creation in 2009, it was already 10–20× faster than MapReduce for certain jobs.
Figure 6 – The History of Spark
In 2011, the AMPLab started to develop higher-level components on Spark, such as Shark (Hive on Spark)
and Spark Streaming. These and other components are sometimes referred to as the Berkeley Data
Analytics Stack (BDAS). Spark was first open sourced in March 2010, and was transferred to the Apache
Software Foundation in June 2013, where it is now a top-level project.
THE COMPANIES AROUND SPARK
Today, many companies use Apache Spark. Companies like eBay, Pandora, Twitter, Netflix and others use Spark for fast query analysis, improving their business.
Figure 7 – Who’s using Spark?
A Unified Stack
The Spark project contains multiple closely integrated components. At its core, Spark is a “computational
engine” that is responsible for scheduling, distributing, and monitoring applications consisting of many
computational tasks across many worker machines, or a computing cluster. Because the core engine of
Spark is both fast and general-purpose, it powers multiple higher-level components specialized for
various workloads, such as SQL or machine learning.
Figure 8 – The Spark stack
Finally, one of the largest advantages of tight integration is the ability to build applications that seamlessly
combine different processing models. For example, in Spark you can write one application that uses
machine learning to classify data in real time as it is ingested from streaming sources. Simultaneously,
analysts can query the resulting data, also in real time, via SQL (e.g., to join the data with unstructured
logfiles). In addition, more sophisticated data engineers and data scientists can access the same data via
the Python shell for ad hoc analysis. Others might access the data in standalone batch applications. All the
while, the IT team has to maintain only one system. The figure above shows this at a glance, highlighting the Spark architecture.
SPARK CORE
Spark Core contains the basic functionality of Spark, including components for task scheduling, memory
management, fault recovery, interacting with storage systems, and more. It provides distributed task
dispatching, scheduling, and basic I/O functionalities, exposed through an application programming
interface (for Java, Python, Scala, and R) centered on the RDD abstraction. Spark Core is based on the
Resilient Distributed Datasets (RDDs), which are Spark’s main programming abstraction.
More precisely, an RDD is a huge list of objects in memory, but distributed: you can picture it as one large collection spread across many compute nodes. It is a big set of objects that we can distribute and parallelize as we want, so in practice it scales horizontally reasonably well. A "driver" program
invokes parallel operations such as map, filter or reduce on an RDD by passing a function to Spark, which
then schedules the function's execution in parallel on the cluster3. These operations, and additional ones
such as joins, take RDDs as input and produce new RDDs. RDDs are immutable and their operations are
lazy; fault-tolerance is achieved by keeping track of the "lineage" of each RDD, the sequence of operations that produced it, so that it can be reconstructed in the case of data loss. RDDs can contain any type of Python,
Java, or Scala objects.
In this section I will show you the basics of Apache Spark: the working mechanism, what an RDD is, the difference between transformations and actions, and the caching mechanism.
How Spark works
Spark applications run as independent sets of processes on a cluster, coordinated by the SparkContext
object in your main program (called the driver program). Specifically, to run on a cluster, the SparkContext
can connect to several types of cluster managers (either Spark’s own standalone cluster manager, Mesos
or YARN), which allocate resources across applications. Once connected, Spark acquires executors on
nodes in the cluster, which are processes that run computations and store data for your application. Next,
it sends your application code (defined by JAR or Python files passed to SparkContext) to the executors.
Finally, SparkContext sends tasks to the executors to run.
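As a concrete illustration, the following is a minimal sketch of a driver program in Scala, assuming local mode and a placeholder input path; it creates the SparkContext described above, loads a file and runs a single action.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object SimpleDriver {
  def main(args: Array[String]): Unit = {
    // The configuration names the application and selects a cluster manager;
    // "local[*]" runs everything in the local JVM using all available cores.
    val conf = new SparkConf().setAppName("SimpleDriver").setMaster("local[*]")
    val sc = new SparkContext(conf)

    // Loading a file produces an RDD; the path here is only a placeholder.
    val lines = sc.textFile("hdfs:///path/to/input.txt")

    // count() is an action: it triggers the executors to run and
    // returns the result to the driver.
    println(s"Number of lines: ${lines.count()}")

    sc.stop()
  }
}
```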
Figure 9 – The architecture of Spark, inside the Core
The Spark architecture follows the master-worker pattern: the driver program (master), through its SparkContext, orchestrates the work by sending jobs to the workers; the work is divided into chunks (tasks) for the executors, which compute their part as quickly as possible. Once the workers finish computing their requests, they send the results back to the master, which aggregates them according to the DAG (Directed Acyclic Graph) of operations. The work continues until all chunks have been computed.
3 (Zaharia, Chowdhury, Franklin, Shenker, & Stoica, 2011)
RDD
At its core, Apache Spark generalizes the MapReduce model. MapReduce and its variants have been highly successful in implementing large-scale data-intensive applications on commodity clusters. These systems, however, are built around an acyclic data flow model that is not suitable for other popular applications, in particular those that reuse a working set of data across multiple parallel operations. To address this, Spark introduces an abstraction called Resilient Distributed Datasets (RDDs). An RDD is a read-only collection of objects partitioned across a set of machines that can be rebuilt if a partition is lost.
Thanks to the RDD and DAG model, Spark can outperform Hadoop by 10x in iterative machine learning jobs, and can be used to interactively query a dataset of hundreds of gigabytes with sub-second response time.
Definition
RDD is a distributed memory abstraction that lets programmers perform in-memory computations on large clusters in a fault-tolerant manner. RDDs are motivated by two types of applications:
- Iterative algorithms
- Interactive data mining tools
In both cases, keeping data in memory can improve performance by an order of magnitude. To achieve fault tolerance efficiently, RDDs provide a restricted
form of shared memory, based on coarse-grained transformations rather than fine-grained updates to
shared state.
Behavior
An RDD - Resilient Distributed Dataset - is a collection of elements partitioned across the nodes of the cluster that can be operated on in parallel. Workers can access an RDD only in read mode, so they cannot change it.
More precisely:
- Datasets: elements such as lists or arrays
- Distributed: the work is distributed across the cluster, so the computation can take place in
parallel
- Resilient: given that the work is distributed, Spark has powerful fault-tolerance management: if a job on a node crashes, Spark detects the crash and immediately reschedules the job on another node.
Because many operations in Spark are lazy, rather than computing each transformation immediately, Spark records the transformations in what is called the DAG, the Directed Acyclic Graph, making fault tolerance as simple as possible.
The DAG grows with every transformation that is carried out, such as:
- map
- filter
- (other lazy operations)
Once all the results have been collected in a DAG, they can be reduced, and then brought to the desired
result, through the following actions:
- collect
- count
- reduce
- …
The results can be saved in local storage and then sent to the master, or sent directly.
Transformation
Spark offers an environment that supports the functional paradigm, and thanks to Scala many transformations can be expressed in very few lines of code:
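The code from the original figure is not reproduced here; the following is a minimal sketch, assuming an interactive shell where a SparkContext named sc already exists. The transformations only extend the DAG: nothing is computed yet.

```scala
// Assume sc is an existing SparkContext (as in the spark-shell).
val numbers = sc.parallelize(1 to 10)

// Transformations are lazy: nothing is computed here,
// the operations are only recorded in the DAG.
val doubled = numbers.map(_ * 2)
val evens   = doubled.filter(_ % 4 == 0)
```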
Actions
While all transformations are lazily evaluated, actions aggregate the data and return the result to the master (driver program). More precisely, each action aggregates partial results directly on the nodes, and these are then sent to the master, which combines them all; this policy, inherited from the Hadoop world, was designed to minimize network traffic as much as possible.
Associative properties
Thanks to the associative property, the data can be aggregated regardless of the order in which the partial results arrive. The Spark Context can therefore organize the work across the cluster in different ways, relying on the associative property that characterizes distributed computing.
Figure 10 – Associative Property of Spark
Action on Data
Whenever we want to "see" the result, collect materializes the entire RDD and returns it to the driver, as in the sketch below.
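The corresponding figure is likewise not reproduced; continuing the sketch above, these actions trigger the actual computation and return results to the driver.

```scala
// Actions force evaluation of the DAG and return results to the driver.
val howMany = evens.count()          // number of elements
val total   = evens.reduce(_ + _)    // sum, relying on an associative operation
val all     = evens.collect()        // materializes the whole RDD on the driver

println(s"count=$howMany, sum=$total, values=${all.mkString(",")}")
```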
Key Value Methods
Spark also works with <k,v> pairs to transform and aggregate the data you want to process. For example, these are some of the functions you can use:
Figure 12 – RDD functions
Example:
- Given the following list of key-value pairs
Figure 13 – List of key-value pairs
- The functions highlighted above (rddToPairRDDFunctions) partition the pairs across different nodes. Note that identical keys are assigned to the same node.
Figure 14
- Now each node can perform its transformations and actions, and then send its result to the master, as in the sketch below.
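A minimal sketch of this key-value workflow, assuming a small, made-up dataset; the pair-specific methods become available implicitly on an RDD of tuples.

```scala
// An RDD of (key, value) pairs; pair-specific methods such as reduceByKey
// are provided implicitly (via rddToPairRDDFunctions).
val sales = sc.parallelize(Seq(
  ("apples", 3), ("oranges", 5), ("apples", 2), ("oranges", 1)
))

// Values with the same key end up on the same node, where they are
// aggregated locally before the partial results are combined.
val totals = sales.reduceByKey(_ + _)

totals.collect().foreach { case (fruit, n) => println(s"$fruit -> $n") }
```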
Caching Data
- Suppose, for example, that we have a simple RDD and we want to run a computationally intensive algorithm on it, which we simulate with a Thread.sleep() call.
- The result is another RDD with the transformed data.
- Suppose we then want to apply a simple map function, which is not as computationally heavy as the first step; we obtain the following result.
- Now, maybe the result is not the one you want, and the analysis has to be run again. Instead of recalculating everything from scratch, Spark can store the intermediate result in the cache.
By default Spark caches in RAM but, if specified, it can also use larger storage such as the hard disk (be careful: in that case a bottleneck can occur!).
- If at this point you want to continue the analysis, the data you will use is the cached data, as in the sketch below.
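A minimal sketch of the walkthrough above, with the heavy step simulated by Thread.sleep(); cache() keeps the intermediate RDD in memory so later steps do not repeat the expensive work.

```scala
val data = sc.parallelize(1 to 100)

// Simulate a computationally expensive transformation.
val expensive = data.map { x =>
  Thread.sleep(10)   // stand-in for heavy work
  x * x
}

// Keep the intermediate result in memory (use persist() with a storage
// level such as MEMORY_AND_DISK to allow spilling to disk).
expensive.cache()

expensive.count()                       // first action: pays the full cost
val shifted = expensive.map(_ + 1)      // cheap follow-up transformation
shifted.count()                         // reuses the cached intermediate data
```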
This method has two benefits:
1. Speed: the data, especially intermediate results, are available again in a short time.
2. Fault tolerance: if the heavy computation of the first transformation crashed, the data would be recalculated and made available again without the user even noticing.
Distribution and Instrumentation
This section explains the systems side of Spark: where the data flows and which components are activated when an action occurs.
Spark Submit
In Spark, applications are launched through the spark-submit script: the launched application refers to a main program, the driver, which interacts with the cluster manager.
Figure 15
At this point, depending on how your cluster is configured, the driver may run directly on the machine on which you installed Spark (for example, in local mode), or the work may go directly to the master node at the head of the distributed cluster. This choice is made in the launch phase of the main program.
The driver then asks the cluster manager for the specific resources that can be used in the computation. Depending on what is available, the cluster manager grants the resources to be used by the MapReduce-style computation.
Figure 16
At this point the driver runs the main application, building up the RDDs until an action occurs; the action causes the work on the RDDs to be sent to the nodes of the cluster. When the distributed workflow is complete, execution continues inside the driver until the application finally terminates.
Figure 17
Now, the resources are freed. Note: the resources may also be released manually before the job is
completed.
To better understand the options offered by spark-submit, you can browse the help printed directly by the command shell:
Figure 18 – List of command in Spark
1. First, note that spark-submit accepts only compiled driver programs, packaged as a jar for Java/Scala, or Python files, together with the options that will be shown shortly.
2. Through the kill and status commands you can forcibly terminate or monitor the jobs present in Spark, passing the appropriate submission ID.
3. The master option determines whether to run Spark locally or on the cluster, if one is available, optionally specifying the number of processors you want to use.
4. Deploy mode accepts only two values, client or cluster. It specifies where the driver, described above, resides: in client mode the driver resides on the same machine from which you launch the computation, which must obviously be connected to the computing cluster via a fast link. The cluster option, on the other hand, offers a "fire-and-forget" service, sending the computation directly to the master.
5. Another option is specifying the class to run, through the class option.
6. The name option assigns a name to the application that will be computed.
7. The jars and py-files options specify the application files you want to run. Note that both Scala and Java projects produce their output as a jar file.
8. The executor-memory option (not shown in the screenshot) sets the amount of memory that each launched executor can use. The default is 1 GB, but you can change it depending on the resources required. A complete example invocation is sketched below.
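To tie the options together, here is a hypothetical spark-submit invocation; the host name, class and jar path are placeholders, and the commented master URLs anticipate the cluster managers discussed in the next section.

```
# Hypothetical submission to a standalone cluster (default port 7077);
# the host name, class name and jar path are placeholders.
spark-submit \
  --master spark://master-host:7077 \
  --deploy-mode cluster \
  --class com.example.MyApp \
  --name my-spark-app \
  --executor-memory 2G \
  my-app.jar

# Master URLs for the other managers discussed in the next section:
#   --master mesos://mesos-host:5050   (Apache Mesos, default port 5050)
#   --master yarn                      (Hadoop YARN, via HADOOP_CONF_DIR / YARN_CONF_DIR)
#   --master local[4]                  (local mode, using 4 cores)
```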
Cluster Manager
Apache Spark’s cluster manager is distributed. Its main goal is to manage and distribute the workload, assigning tasks to the machines that compose the cluster.
Figure 19 – Cluster Manager View
The kernel that would normally manage a single machine, in this case the master, is in practice "distributed" over the entire cluster.
Main Cluster Manager
Figure 20 – Main Cluster Manager
 Spark has its own built-in manager: Spark Standalone.
 The Hadoop cluster manager, on the other hand, is YARN, very convenient both in terms of infrastructure and ease of use.
 Apache Mesos is another product created at the University of California, Berkeley, in competition with the YARN cluster manager.
Standalone Manager
Spark-Standalone Manager is simple and quite convenient if you are
looking for a manager with the basic functions.
 To use spark-submit, just enter the host and port of the machine on which you want to launch the process (the default port is 7077).
 The standalone manager works in both client mode and in
cluster mode.
 Another very helpful setting is spark.deploy.spreadOut: when enabled, the manager tries to distribute the workload across the nodes of the cluster, using as many machines as possible.
On the other hand, if this parameter is set to false, the cluster manager tries to work with as few nodes as possible, grouping the work together.
 The total-executor-cores option lets you specify how many cores, in total, each job may use. By default all available cores in the cluster are used, but you can specify a lower number if you want to preserve capacity for future computations.
Apache Mesos Manager
 The master URL is much the same as in standalone mode; the only difference is the port, which is 5050.
 It supports both deploy modes, client and cluster; Mesos is not installed with Spark by default and must be installed separately.
 Mesos can allocate resources dynamically in two ways: fine-grained and coarse-grained. By default it uses the fine-grained mode; the difference lies in how resources are allocated to the various jobs:
o fine-grained: each Spark task is mapped onto and run as a Mesos task. This allows applications to scale their resources dynamically when needed4;
o coarse-grained: this mode reserves a large, fixed amount of resources for the whole duration of the launched application. It is best if you want maximum performance, since each application can determine its best configuration and keep it.
 If spark.mesos.coarse is set to true, you can also set the number of cores to be used on each node of the cluster; by default all available cores are used.
 Unlike the standalone manager, you cannot choose how the work is spread: by default Mesos uses as few nodes as possible, minimizing the distribution of work within the cluster.
4 This type of resource management is not optimal if you are running streaming applications, which require low latency; but it is good for parallel use by multiple processes launched by different instances: the resources needed for such parallel use are automatically allocated and distributed.
Hadoop YARN
 Unlike the previous cluster managers, to launch a spark-submit process you simply pass yarn as the master, since the HADOOP_CONF_DIR / YARN_CONF_DIR configuration directory provides the settings used to connect to the cluster manager.
 Client / cluster: with this manager too you can choose the deploy mode of the computation.
 The number of executors defaults to 2 with this manager, which can be limiting if you want to improve performance and scale.
 You can specify the number of cores to be used by each executor, which by default is set to 1.
 Unlike the other managers, YARN has the concept of queues: you can specify which queue the jobs submitted through the cluster manager should go into.
SPARK LIBRARIES
In this section we will look at how Spark is used at a pragmatic level, that is, how it is used to query data stored in Hadoop and which features its libraries allow us to use.
Spark SQL
It is called Spark SQL because, thanks to this functionality, it provides a working environment where you can query any database stored in Hadoop with a language very similar to SQL.
Spark SQL is a Spark module for structured data processing. Unlike the basic Spark RDD API, the interfaces
provided by Spark SQL provide Spark with more information about the structure of both the data and the
computation being performed. Internally, Spark SQL uses this extra information to perform extra
optimizations. There are several ways to interact with Spark SQL including SQL, the DataFrames API and
the Datasets API. When computing a result, the same execution engine is used, independent of which
API/language you are using to express the computation. This unification means that developers can easily
switch back and forth between the various APIs based on which provides the most natural way to express
a given transformation.
One use of Spark SQL is to execute SQL queries written using either a basic SQL syntax or HiveQL. Spark
SQL can also be used to read data from an existing Hive installation. For more on how to configure this
feature, please refer to the Hive Tables section. When running SQL from within another programming
language the results will be returned as a DataFrame. You can also interact with the SQL interface using
the command-line or over JDBC/ODBC.
This chapter shows only one example of how to use functional-style operations to query the Hadoop cluster. The language chosen for the example is Scala, which is the most powerful of the languages supported by the Spark community.
The Spark SQL interface supports the automatic conversion of an RDD containing case classes into a DataFrame5. The case classes define the schema of the table. The names of the fields of the classes are read using reflection (the ability of a program to perform computations about itself, in particular about the structure of its own source code) and become the column names. Classes can also be nested or contain complex types such as Sequences and Arrays. Such an RDD can be converted implicitly into a DataFrame and then registered as a table; tables can subsequently be used in SQL statements.
Figure 21 shows an excerpt of this code:
5 A DataFrame is a distributed collection of data organized into named columns. It is conceptually
equivalent to a table in a relational database or a data frame in R/Python, but with richer optimizations
under the hood. DataFrames can be constructed from a wide array of sources such as: structured data
files, tables in Hive, external databases, or existing RDDs. The DataFrame API is available in Scala, Java,
Python, and R.
Figure 21 – Example Code
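The excerpt itself is not reproduced here; what follows is a minimal sketch of the pattern just described (case class, implicit conversion to a DataFrame, registration as a table), assuming a Spark 1.x SQLContext and a made-up people.txt file.

```scala
import org.apache.spark.sql.SQLContext

// Case class: its fields, read via reflection, become the column names.
case class Person(name: String, age: Int)

val sqlContext = new SQLContext(sc)
import sqlContext.implicits._

// Build an RDD of Person objects and convert it implicitly to a DataFrame.
val people = sc.textFile("people.txt")
  .map(_.split(","))
  .map(p => Person(p(0), p(1).trim.toInt))
  .toDF()

// Register the DataFrame as a table and query it with SQL.
people.registerTempTable("people")
val teenagers = sqlContext.sql("SELECT name FROM people WHERE age BETWEEN 13 AND 19")
teenagers.collect().foreach(println)
```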
Impala vs. Hive: differences between SQL-on-Hadoop components
Apache Hive was introduced by Facebook to manage and process large amounts of distributed data sets stored on Hadoop. Apache Hive is an abstraction over Hadoop MapReduce that allows you to query data in its own SQL-like language, HiveQL.
Cloudera Impala, on the other hand, was developed to overcome the limitations posed by the slow SQL interaction with Hadoop. Cloudera Impala provides high performance for processing and analyzing data stored on a Hadoop cluster.
Cloudera Impala compared with Apache Hive:
- Impala does not require data to be moved or transformed when it is stored in Apache HDFS or HBase; Hive is a data warehouse infrastructure built on top of the Hadoop platform to support tasks such as querying, analyzing, processing and displaying data.
- Impala uses code generated at runtime for the "big loops", using LLVM (Low Level Virtual Machine); Hive generates its queries at compile time.
- Impala avoids startup overhead: its processes are daemons that are started when the nodes boot and are always ready to plan, coordinate and execute queries; Hive processes, being compiled at compile time, suffer from a "cold start" problem.
- Impala is very fast at parallel processing and follows the SQL-92 standard; Hive translates queries written in its own "dialect" (HiveQL) into a DAG that is executed as MapReduce jobs.
Table 1 – Impala vs. Hive
Cloudera Impala
Cloudera Impala is an excellent choice for developers who do not need to move large amounts of data: it integrates easily with the Hadoop ecosystem and with frameworks like MapReduce, Apache Hive, Pig and other Apache Hadoop software. Its architecture is specifically designed around the familiar SQL language, supporting all those users who are making the transition from relational databases to big data.
Columnar Storage
Data in Impala is stored in a columnar format, which provides a high compression ratio and very efficient scanning.
Figure 22 – Columnar Storage with Impala
Tree Architecture
The tree architecture is crucial to achieving distributed computation: the query is delivered to the nodes and the results are aggregated once the sub-processing is complete.
Figure 23 – Tree Architecture
Impala offers the following features:
 Support for HDFS (Hadoop Distributed File System) and Apache HBase storage
 It recognizes the common Hadoop file formats: text, LZO, SequenceFile, Avro, RCFile and Parquet
INTRODUCTION TO DATA ANALYSIS WITH SPARK
Pagina 22
 Support for Hadoop security (Kerberos authentication)
 It can easily reuse the metadata, the ODBC driver and the SQL syntax of Apache Hive
Apache Hive
Initially developed by Facebook, Apache Hive is a data warehouse infrastructure built on top of Hadoop to provide tools for querying, analysis, processing and visualization. It is a versatile tool that supports the analysis of large databases stored in Hadoop HDFS and in other compatible systems such as Amazon S3. To stay close to the SQL language it provides HiveQL, a tool that converts the queries and submits them as MapReduce jobs.
Other features of Hive:
 Indexing to speed up computation
 Support for different storage formats, such as text files, RCFile, HBase, ORC and others
 Facilities for building user-defined functions (UDFs) to manipulate strings, dates, ...
If you are looking for an advanced analytics language that offers a great deal of familiarity with SQL (without writing separate MapReduce programs), Apache Hive is the best choice. HiveQL queries are still converted into the corresponding MapReduce jobs, which run on the cluster and return the final result.
Streaming Data
Spark Streaming is an extension of the core API that enables scalable, high-throughput, fault-tolerant processing of streaming data. Data can be ingested from various sources like Kafka, Flume, HDFS/S3, Kinesis, Twitter or plain TCP sockets, and can be processed using complex algorithms expressed at a high level with functional operators such as map, reduce, join and window. The processed data can then be pushed out to a filesystem or a database.
Figure 24 – Streaming Data
Internally, Spark Streaming works as follows: it receives live streaming data and divides it into small batches, which are then processed by the Spark engine to generate the final stream of results.
Figure 25 – Streaming Data in detail
Spark Streaming provides a high-level abstraction called Discretized Stream or, more simply, DStream, which represents a continuous stream of data. DStreams can be created from input data streams from sources such as Kafka, Flume and Kinesis, or through high-level operations on other DStreams. Internally, a DStream is represented as a sequence of RDDs.
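A minimal sketch of a DStream pipeline, assuming a plain TCP source on a made-up host and port; each batch of the stream is processed as an RDD.

```scala
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Create a streaming context that cuts the live stream into 5-second batches.
val ssc = new StreamingContext(sc, Seconds(5))

// Hypothetical TCP source; Kafka, Flume or Kinesis receivers work similarly.
val lines = ssc.socketTextStream("stream-host", 9999)

// Classic streaming word count: each DStream operation is applied
// to the sequence of RDDs that make up the stream.
val counts = lines.flatMap(_.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)

counts.print()

ssc.start()             // start receiving and processing data
ssc.awaitTermination()  // block until the stream is stopped
```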
Machine Learning
Machine learning algorithms are constantly growing and evolving and are becoming ever more efficient. To understand machine learning in Spark, it is useful to look at how its libraries have evolved.
Machine Learning Library (MLlib)
MLlib is Spark’s machine learning (ML) library. Its goal is to make practical machine learning scalable and
easy. It consists of common learning algorithms and utilities, including classification, regression,
clustering, collaborative filtering, dimensionality reduction, as well as lower-level optimization primitives
and higher-level pipeline APIs.
It is divided into two packages:
 spark.mllib contains the original API built on top of RDDs.
 spark.ml provides higher-level API built on top of DataFrames for constructing ML pipelines.
Using spark.ml is recommended because with DataFrames the API is more versatile and flexible. But we
will keep supporting spark.mllib along with the development of spark.ml. Users should be comfortable
using spark.mllib features and expect more features coming. Developers should contribute new
algorithms to spark.ml if they fit the ML pipeline concept well, e.g., feature extractors and transformers.
In fact, on the one hand, Matlab and R are easy to use but not scalable. On the other hand, systems such as Mahout and GraphLab grew up, which are very scalable but difficult to use.
Figure 26 – MLlib Architecture
The Spark MLlib library aims to make the practice of machine learning scalable and relatively easy. It consists of a set of algorithms that include regression, clustering, collaborative filtering, dimensionality reduction and other functions that integrate at a low level. This architecture underpins the high-level APIs that are provided, and the library also offers tools for feature extraction and transformation.
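A minimal sketch using the lower-level spark.mllib package, assuming a made-up file of comma-separated points; it clusters the data with k-means.

```scala
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

// Hypothetical input: one comma-separated point per line, e.g. "1.0,2.0,3.0".
val points = sc.textFile("points.csv")
  .map(line => Vectors.dense(line.split(',').map(_.toDouble)))
  .cache()

// Train a k-means model with 3 clusters and 20 iterations.
val model = KMeans.train(points, 3, 20)

// Cost (sum of squared distances to the nearest centre) and cluster centres.
println(s"Within-set sum of squared errors: ${model.computeCost(points)}")
model.clusterCenters.foreach(println)
```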
GraphX
GraphX is the Spark component for graphs and graph-parallel computation. GraphX extends Spark RDDs by introducing a new graph abstraction that attaches the concepts of vertex and edge to them. To support graph computation, GraphX provides operators such as subgraph, joinVertices and aggregateMessages. In addition, GraphX includes a growing collection of graph algorithms to simplify high-level analysis.
Figure 27 – How use GraphX
The WWW itself is a gigantic graph, and multinationals like Twitter, Netflix and Amazon have used this concept to expand their markets and their profits. In the world of medicine, graphs are also widely used in research.
How GraphX works
A GraphX graph is a directed multigraph with properties attached to its vertices and edges. In particular, each vertex may have associated metadata, as well as a unique identifier within the graph.
In this case each point represents a person, as in the everyday "social graph". In GraphX this object is called a vertex, which consists of:
- (VertexId, VType): simply a key-value pair
The arrows connect the vertices; in the example they are created from the relationships between the people stored in the vertices. In Spark these are called edges, and they store the IDs of the connected vertices and the type of relationship between them:
- Edge(VertexId, VertexId, EType)
The Spark graph is built on top of these simple building blocks; in particular, a graph is represented by two different RDDs, one containing the vertices and the other containing the edges.
In addition, there is another concept to keep in mind when working with graphs, namely the view that exists over each connection: the graph exposes an EdgeTriplet, containing all the information about a connection (the edge together with its source and destination vertices). This provides a much more complete rendering of the graph and makes reasoning about it much simpler.
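A minimal sketch of the structure just described, assuming a tiny made-up social graph; the graph is built from a vertex RDD and an edge RDD, and the triplets view exposes each edge together with its endpoint attributes.

```scala
import org.apache.spark.graphx.{Edge, Graph}

// Vertices: (VertexId, attribute) pairs, here a person's name.
val vertices = sc.parallelize(Seq(
  (1L, "Alice"), (2L, "Bob"), (3L, "Carol")
))

// Edges: (source id, destination id, attribute), here the kind of relationship.
val edges = sc.parallelize(Seq(
  Edge(1L, 2L, "follows"),
  Edge(2L, 3L, "friend-of")
))

// The graph is just the combination of the two RDDs.
val graph = Graph(vertices, edges)

// Triplets give a complete view of each connection: source, edge, destination.
graph.triplets.collect().foreach { t =>
  println(s"${t.srcAttr} ${t.attr} ${t.dstAttr}")
}
```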
OPTIMIZATIONS AND THE FUTURE
In this final part I look at some recurring issues that are especially useful for understanding how Spark behaves. I will try to explain how the following constructs work and what problems they may cause:
 Closures
 Broadcast
 Partitioning
Understanding closures
One of the most difficult things to understand in Spark is understanding the scope and lifetime of variables
and methods when the code is executed on the cluster.
RDD operations that modify variables outside of their scope are a frequent source of confusion. The example below uses foreach() to increment a counter, but similar problems can occur with other well-known operations.
Example: consider a naive sum of RDD elements, shown below, whose behavior may differ depending on where it is executed. It is a common illustration of what happens when we move Spark from local mode to a cluster (via YARN, for example):
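The code from the original figure is not reproduced; below is a minimal sketch of the same anti-pattern, which looks correct but, as explained next, is not.

```scala
// Anti-pattern: mutating a driver-side variable from inside an RDD operation.
var counter = 0
val data = sc.parallelize(1 to 100)

// Each executor increments its own serialized copy of `counter`,
// not the variable living in the driver.
data.foreach(x => counter += x)

// In cluster mode this still prints 0.
println(s"Counter value: $counter")
```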
Local vs. cluster modes
The behavior of the code above is not clearly defined and might not work as intended. To run jobs, Spark splits the processing of RDD operations into tasks, each executed by an executor. Before execution, Spark computes the task's closure. The closure is the set of variables and methods that must be visible to the executor to perform its computation on the RDD (in this case, the body of the foreach). This closure is serialized and sent to each executor.
The variables within the closure are copied and sent to every executor, so when the counter is incremented inside the foreach, it is not the master node's counter (the driver's) that is updated. There is still a counter in the driver's memory, but it is not visible to the executors; the executors "see" only their serialized copy of the closure. So the final value of the counter on the driver will still be zero, because the updates performed on the executors are never sent back to the master (driver).
In local mode, by contrast, the foreach may run in the same JVM as the driver and reference the original counter, updating it directly, so the same code behaves differently.
To ensure well-defined behavior in these scenarios, you should use an Accumulator. Accumulators in Spark provide a mechanism for safely updating a variable by aggregating the partial results that the various worker nodes send back to the driver.
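A minimal sketch of the same sum rewritten with an accumulator (using the Spark 1.x accumulator API), which gives a well-defined result.

```scala
// An accumulator is created on the driver and safely updated by the executors.
val accum = sc.accumulator(0, "sum accumulator")
val data = sc.parallelize(1 to 100)

data.foreach(x => accum += x)

// The partial updates from the workers are merged back into the driver.
println(s"Accumulated sum: ${accum.value}")   // 5050
```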
Broadcasting
In the previous section you saw how Spark sends a copy of the data needed for the computation to the various workers (nodes) in the cluster. Without a policy and a well-designed strategy, this kind of operation can lead to problems, given that the computations Spark performs are massive and the data involved is on the order of gigabytes.
This section first shows what naive distribution of the data looks like, and then the strategy implemented in Spark to solve the problem.
Example: two lazy operations are defined, with closures using map and flatMap respectively:
Given a master (driver) and three workers:
1. An RDD of 1 MB (a dummy variable) is created inside the driver.
2. Whenever the operations are executed by the workers, a copy of the data is sent to each of them.
3. If the 1 MB dataset is requested, for example, 4 times by the first worker, it becomes 4 MB at the local level. Scaled up across all the workers of even this small network, the traffic could grow out of control.
In a network with millions of nodes, this must not be allowed to happen.
In Spark this issue is solved as follows: the data used by the map function is wrapped in a broadcast variable, which handles sending it to the workers in parallel before the lazy operations run:
The broadcast function first compresses the data as much as possible; then:
1. The broadcast data is sent once, and only once, to each worker.
2. Each worker caches its copy of the broadcast data.
This keeps the growth under control, even if the workers repeatedly need the same dataset for their computations. Moreover, to relieve the possible bottleneck between the driver and the workers, the workers can exchange broadcast blocks among themselves, in a manner similar to BitTorrent networks.
This means, for example, that if the first worker has already received the broadcast data to perform some computation, and the second worker also needs it, the workers can help each other, minimizing calls to the driver, which would otherwise become an annoying bottleneck and add latency.
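A minimal sketch of the broadcast mechanism described above, assuming a made-up lookup table; the table is shipped to each worker once and read from the local cache afterwards.

```scala
// A (potentially large) read-only structure needed by every task.
val countryNames = Map("it" -> "Italy", "fr" -> "France", "de" -> "Germany")

// Broadcast it once; each worker caches its copy instead of receiving
// a new copy with every task.
val bcNames = sc.broadcast(countryNames)

val codes = sc.parallelize(Seq("it", "de", "it", "fr"))
val resolved = codes.map(code => bcNames.value.getOrElse(code, "unknown"))

resolved.collect().foreach(println)
```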
Optimizing Partitioning
The previous section showed how uncontrolled distribution of data can make an application grow out of hand. This section shows, instead, how a disproportionate degree of parallelism can hurt performance if misused. The problem is best explained through an example.
Given the following RDD:
 a "huge" data set: the integers from 1 to Int.MaxValue
 divided into 10,000 partitions
we then (see the sketch after this list):
1. perform a filter that shrinks the result as much as possible;
2. apply a sort;
3. perform a map that adds 1 to each element of the final array;
4. show the result using the collect action.
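A minimal sketch of this pipeline as described; the filter predicate is a placeholder chosen so that only about ten values survive, and the 10,000-way split is deliberately excessive.

```scala
val huge = sc.parallelize(1 to Int.MaxValue, 10000)

val result = huge
  .filter(_ % 200000000 == 0)   // shrinks the data to roughly ten values
  .sortBy(x => x)               // still runs across ~10,000 tasks
  .map(_ + 1)
  .collect()
```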
If we look at how the nodes actually work at the systems level, the results are surprising.
First, we can see that the execution consists of two stages, each of roughly 10,000 tasks; the first (sortBy) completed in 13 seconds, while the second (collect) took 20 seconds.
If we analyze the result in more detail and drill down into the collect stage, we notice that the sortBy step, performed after the filtering, took a full 20 seconds. This is because Spark was asked to use 10,000 tasks to sort about 10 values (as the code is written, the sort runs after the filtering):
Figure 28 – Systems-level view
As you can see, this overhead is harmful if not controlled: the image above shows that 10,000 tasks are executed, and almost every one of them spends essentially zero time on the required computation. Simply put, the time spent synchronizing the various executors is far greater than the time spent on the desired computation, increasing overhead and user-perceived latency out of all proportion.
A technique for controlling the amount of parallelism is the coalesce function, which here reduces the number of partitions from 10,000 to 8 after the filter has run.
The second argument passed to coalesce controls shuffling: if it is set to true, the output is redistributed evenly across all the nodes; if it is set to false, as few partitions as possible are used.
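A minimal sketch of the same pipeline with coalesce applied right after the filter (8 partitions, no shuffle); the predicate is the same placeholder as before.

```scala
val resultCoalesced = sc.parallelize(1 to Int.MaxValue, 10000)
  .filter(_ % 200000000 == 0)
  .coalesce(8, shuffle = false)   // collapse the ~10,000 partitions into 8
  .sortBy(x => x)                 // now runs on only 8 tasks
  .map(_ + 1)
  .collect()
```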
If you monitor the execution time, the result is the same as before, but the performance is far better: the collect called after the filtering, in the second stage of the second test, takes only 4 seconds compared to the 20 seconds used previously.
In detail:
It is evident that the sort, this time, takes only 24 milliseconds and schedules just 8 tasks in total.
By reducing the number of partitions touched when an action is executed, coalesce saves an enormous amount of time.
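As a quick sanity check, under the same assumptions as the sketches above, the partition counts before and after the coalesce can be compared directly:

val filtered  = huge.filter(_ % 200000000 == 0)
val coalesced = filtered.coalesce(8, shuffle = true)

println(s"partitions before coalesce: ${filtered.partitions.length}")   // 10000
println(s"partitions after coalesce:  ${coalesced.partitions.length}")  // 8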
Future of Spark
This section presents some of the features that the community around Spark is working to develop and advance.
R
Integration with R is the first "frontier" to conquer in order to bring the community of
mathematicians and statisticians together with that of engineers and data scientists. R offers a
development environment that is user-friendly from a statistical and mathematical point of view and is
widely used in the community for data analysis.
Tachyon
Why is Spark up to 100 times faster than MapReduce? Part of the answer lies in the community's work on
Tachyon. Memory is the key to processing Big Data quickly and at scale: Tachyon is a memory-centric
storage layer adopted by Spark in distributed systems which, in addition to providing automatic
fault tolerance and caching techniques, offers reliable memory sharing across the cluster in use.
Continuing to develop this kind of service will make the field of big data analysis even faster and more
powerful than it already is.
Project Tungsten
Another feature developed by the community that has rallied around Spark is Project Tungsten:
the project reworks the execution core running on the various nodes of the cluster, substantially
improving the efficiency of memory and CPU usage for Spark applications and pushing performance as far
as possible. The fundamental idea of this research revolves around how modern CPUs are used: the goal is
to minimize I/O and to extract the maximum performance from the CPU and memory of the machines when they
are used as a cluster.
References
Karau, H., Konwinski, A., Wendell, P., & Zaharia, M. (2015). Learning Spark. Sebastopol, CA: O'Reilly Media.
Mortar Data Software. (2013, May). How Spark Works. Retrieved from https://www.linkedin.com/company/mortar-data-inc.
Phatak, M. (2015, January 2). History of Apache Spark: Journey from Academia to Industry. Retrieved from http://blog.madhukaraphatak.com/history-of-spark/
SAS – The Power to Know. (2013, April). Big Data. Retrieved from http://www.sas.com/en_ph/insights/big-data/what-is-big-data.html
Zaharia, M., Chowdhury, M., Franklin, M. J., Shenker, S., & Stoica, I. (2011). Spark: Cluster Computing with Working Sets. USENIX Workshop on Hot Topics in Cloud Computing (HotCloud). Retrieved from https://amplab.cs.berkeley.edu/wp-content/uploads/2011/06/Spark-Cluster-Computing-with-Working-Sets.pdf
INTRODUCTION TO DATA ANALYSIS WITH SPARK
Pagina 33
Contact information
MASSIMILIANO MARTELLA
Email: massimilian.martella@studio.unibo.it
More Related Content

What's hot

Which Hadoop Distribution to use: Apache, Cloudera, MapR or HortonWorks?
Which Hadoop Distribution to use: Apache, Cloudera, MapR or HortonWorks?Which Hadoop Distribution to use: Apache, Cloudera, MapR or HortonWorks?
Which Hadoop Distribution to use: Apache, Cloudera, MapR or HortonWorks?Edureka!
 
IBM POWER8 Processor-Based Systems RAS
IBM POWER8 Processor-Based Systems RASIBM POWER8 Processor-Based Systems RAS
IBM POWER8 Processor-Based Systems RASthinkASG
 
White Paper: Hadoop on EMC Isilon Scale-out NAS
White Paper: Hadoop on EMC Isilon Scale-out NAS   White Paper: Hadoop on EMC Isilon Scale-out NAS
White Paper: Hadoop on EMC Isilon Scale-out NAS EMC
 
Introduction To Big Data with Hadoop and Spark - For Batch and Real Time Proc...
Introduction To Big Data with Hadoop and Spark - For Batch and Real Time Proc...Introduction To Big Data with Hadoop and Spark - For Batch and Real Time Proc...
Introduction To Big Data with Hadoop and Spark - For Batch and Real Time Proc...Agile Testing Alliance
 
Hadoop Vs Spark — Choosing the Right Big Data Framework
Hadoop Vs Spark — Choosing the Right Big Data FrameworkHadoop Vs Spark — Choosing the Right Big Data Framework
Hadoop Vs Spark — Choosing the Right Big Data FrameworkAlaina Carter
 
The Future of Hadoop: A deeper look at Apache Spark
The Future of Hadoop: A deeper look at Apache SparkThe Future of Hadoop: A deeper look at Apache Spark
The Future of Hadoop: A deeper look at Apache SparkCloudera, Inc.
 
Hadoop benchmark: Evaluating Cloudera, Hortonworks, and MapR
Hadoop benchmark: Evaluating Cloudera, Hortonworks, and MapRHadoop benchmark: Evaluating Cloudera, Hortonworks, and MapR
Hadoop benchmark: Evaluating Cloudera, Hortonworks, and MapRDouglas Bernardini
 
What is Apache Spark | Apache Spark Tutorial For Beginners | Apache Spark Tra...
What is Apache Spark | Apache Spark Tutorial For Beginners | Apache Spark Tra...What is Apache Spark | Apache Spark Tutorial For Beginners | Apache Spark Tra...
What is Apache Spark | Apache Spark Tutorial For Beginners | Apache Spark Tra...Edureka!
 
Hadoop Innovation Summit 2014
Hadoop Innovation Summit 2014Hadoop Innovation Summit 2014
Hadoop Innovation Summit 2014Data Con LA
 
Hunk: Splunk Analytics for Hadoop
Hunk: Splunk Analytics for HadoopHunk: Splunk Analytics for Hadoop
Hunk: Splunk Analytics for HadoopGeorg Knon
 
Getting Spark ready for real-time, operational analytics
Getting Spark ready for real-time, operational analyticsGetting Spark ready for real-time, operational analytics
Getting Spark ready for real-time, operational analyticsairisData
 
Transitioning Compute Models: Hadoop MapReduce to Spark
Transitioning Compute Models: Hadoop MapReduce to SparkTransitioning Compute Models: Hadoop MapReduce to Spark
Transitioning Compute Models: Hadoop MapReduce to SparkSlim Baltagi
 
SAS on Your (Apache) Cluster, Serving your Data (Analysts)
SAS on Your (Apache) Cluster, Serving your Data (Analysts)SAS on Your (Apache) Cluster, Serving your Data (Analysts)
SAS on Your (Apache) Cluster, Serving your Data (Analysts)DataWorks Summit
 
2015 HortonWorks MDA Roadshow Presentation
2015 HortonWorks MDA Roadshow Presentation2015 HortonWorks MDA Roadshow Presentation
2015 HortonWorks MDA Roadshow PresentationFelix Liao
 
SplunkLive! Hunk Technical Deep Dive
SplunkLive! Hunk Technical Deep DiveSplunkLive! Hunk Technical Deep Dive
SplunkLive! Hunk Technical Deep DiveSplunk
 

What's hot (20)

Apache Spark Notes
Apache Spark NotesApache Spark Notes
Apache Spark Notes
 
Which Hadoop Distribution to use: Apache, Cloudera, MapR or HortonWorks?
Which Hadoop Distribution to use: Apache, Cloudera, MapR or HortonWorks?Which Hadoop Distribution to use: Apache, Cloudera, MapR or HortonWorks?
Which Hadoop Distribution to use: Apache, Cloudera, MapR or HortonWorks?
 
IBM POWER8 Processor-Based Systems RAS
IBM POWER8 Processor-Based Systems RASIBM POWER8 Processor-Based Systems RAS
IBM POWER8 Processor-Based Systems RAS
 
White Paper: Hadoop on EMC Isilon Scale-out NAS
White Paper: Hadoop on EMC Isilon Scale-out NAS   White Paper: Hadoop on EMC Isilon Scale-out NAS
White Paper: Hadoop on EMC Isilon Scale-out NAS
 
Introduction To Big Data with Hadoop and Spark - For Batch and Real Time Proc...
Introduction To Big Data with Hadoop and Spark - For Batch and Real Time Proc...Introduction To Big Data with Hadoop and Spark - For Batch and Real Time Proc...
Introduction To Big Data with Hadoop and Spark - For Batch and Real Time Proc...
 
Hadoop Vs Spark — Choosing the Right Big Data Framework
Hadoop Vs Spark — Choosing the Right Big Data FrameworkHadoop Vs Spark — Choosing the Right Big Data Framework
Hadoop Vs Spark — Choosing the Right Big Data Framework
 
The Future of Hadoop: A deeper look at Apache Spark
The Future of Hadoop: A deeper look at Apache SparkThe Future of Hadoop: A deeper look at Apache Spark
The Future of Hadoop: A deeper look at Apache Spark
 
Spark_Part 1
Spark_Part 1Spark_Part 1
Spark_Part 1
 
spark_v1_2
spark_v1_2spark_v1_2
spark_v1_2
 
Hadoop benchmark: Evaluating Cloudera, Hortonworks, and MapR
Hadoop benchmark: Evaluating Cloudera, Hortonworks, and MapRHadoop benchmark: Evaluating Cloudera, Hortonworks, and MapR
Hadoop benchmark: Evaluating Cloudera, Hortonworks, and MapR
 
What is Apache Spark | Apache Spark Tutorial For Beginners | Apache Spark Tra...
What is Apache Spark | Apache Spark Tutorial For Beginners | Apache Spark Tra...What is Apache Spark | Apache Spark Tutorial For Beginners | Apache Spark Tra...
What is Apache Spark | Apache Spark Tutorial For Beginners | Apache Spark Tra...
 
Hadoop in a Nutshell
Hadoop in a NutshellHadoop in a Nutshell
Hadoop in a Nutshell
 
Hadoop Innovation Summit 2014
Hadoop Innovation Summit 2014Hadoop Innovation Summit 2014
Hadoop Innovation Summit 2014
 
Hunk: Splunk Analytics for Hadoop
Hunk: Splunk Analytics for HadoopHunk: Splunk Analytics for Hadoop
Hunk: Splunk Analytics for Hadoop
 
MapR Unique features
MapR Unique featuresMapR Unique features
MapR Unique features
 
Getting Spark ready for real-time, operational analytics
Getting Spark ready for real-time, operational analyticsGetting Spark ready for real-time, operational analytics
Getting Spark ready for real-time, operational analytics
 
Transitioning Compute Models: Hadoop MapReduce to Spark
Transitioning Compute Models: Hadoop MapReduce to SparkTransitioning Compute Models: Hadoop MapReduce to Spark
Transitioning Compute Models: Hadoop MapReduce to Spark
 
SAS on Your (Apache) Cluster, Serving your Data (Analysts)
SAS on Your (Apache) Cluster, Serving your Data (Analysts)SAS on Your (Apache) Cluster, Serving your Data (Analysts)
SAS on Your (Apache) Cluster, Serving your Data (Analysts)
 
2015 HortonWorks MDA Roadshow Presentation
2015 HortonWorks MDA Roadshow Presentation2015 HortonWorks MDA Roadshow Presentation
2015 HortonWorks MDA Roadshow Presentation
 
SplunkLive! Hunk Technical Deep Dive
SplunkLive! Hunk Technical Deep DiveSplunkLive! Hunk Technical Deep Dive
SplunkLive! Hunk Technical Deep Dive
 

Viewers also liked

Preso spark leadership
Preso spark leadershipPreso spark leadership
Preso spark leadershipsjoerdluteyn
 
Spark introduction - In Chinese
Spark introduction - In ChineseSpark introduction - In Chinese
Spark introduction - In Chinesecolorant
 
Spark the next top compute model
Spark   the next top compute modelSpark   the next top compute model
Spark the next top compute modelDean Wampler
 
The Future of Data Science
The Future of Data ScienceThe Future of Data Science
The Future of Data Sciencesarith divakar
 
A Deeper Understanding of Spark Internals (Hadoop Conference Japan 2014)
A Deeper Understanding of Spark Internals (Hadoop Conference Japan 2014)A Deeper Understanding of Spark Internals (Hadoop Conference Japan 2014)
A Deeper Understanding of Spark Internals (Hadoop Conference Japan 2014)Hadoop / Spark Conference Japan
 
Pixie dust overview
Pixie dust overviewPixie dust overview
Pixie dust overviewDavid Taieb
 
Why dont you_create_new_spark_jl
Why dont you_create_new_spark_jlWhy dont you_create_new_spark_jl
Why dont you_create_new_spark_jlShintaro Fukushima
 
Spark tutorial py con 2016 part 2
Spark tutorial py con 2016   part 2Spark tutorial py con 2016   part 2
Spark tutorial py con 2016 part 2David Taieb
 
How Spark is Enabling the New Wave of Converged Applications
How Spark is Enabling  the New Wave of Converged ApplicationsHow Spark is Enabling  the New Wave of Converged Applications
How Spark is Enabling the New Wave of Converged ApplicationsMapR Technologies
 
Tiny Batches, in the wine: Shiny New Bits in Spark Streaming
Tiny Batches, in the wine: Shiny New Bits in Spark StreamingTiny Batches, in the wine: Shiny New Bits in Spark Streaming
Tiny Batches, in the wine: Shiny New Bits in Spark StreamingPaco Nathan
 
Spark tutorial pycon 2016 part 1
Spark tutorial pycon 2016   part 1Spark tutorial pycon 2016   part 1
Spark tutorial pycon 2016 part 1David Taieb
 
What’s New in Spark 2.0: Structured Streaming and Datasets - StampedeCon 2016
What’s New in Spark 2.0: Structured Streaming and Datasets - StampedeCon 2016What’s New in Spark 2.0: Structured Streaming and Datasets - StampedeCon 2016
What’s New in Spark 2.0: Structured Streaming and Datasets - StampedeCon 2016StampedeCon
 
Introduction to Structured Data Processing with Spark SQL
Introduction to Structured Data Processing with Spark SQLIntroduction to Structured Data Processing with Spark SQL
Introduction to Structured Data Processing with Spark SQLdatamantra
 
Apache Spark Overview part1 (20161107)
Apache Spark Overview part1 (20161107)Apache Spark Overview part1 (20161107)
Apache Spark Overview part1 (20161107)Steve Min
 

Viewers also liked (20)

Spark - Philly JUG
Spark  - Philly JUGSpark  - Philly JUG
Spark - Philly JUG
 
Performance
PerformancePerformance
Performance
 
Preso spark leadership
Preso spark leadershipPreso spark leadership
Preso spark leadership
 
Spark introduction - In Chinese
Spark introduction - In ChineseSpark introduction - In Chinese
Spark introduction - In Chinese
 
Apache Spark with Scala
Apache Spark with ScalaApache Spark with Scala
Apache Spark with Scala
 
Spark the next top compute model
Spark   the next top compute modelSpark   the next top compute model
Spark the next top compute model
 
Intro to Apache Spark
Intro to Apache SparkIntro to Apache Spark
Intro to Apache Spark
 
The Future of Data Science
The Future of Data ScienceThe Future of Data Science
The Future of Data Science
 
A Deeper Understanding of Spark Internals (Hadoop Conference Japan 2014)
A Deeper Understanding of Spark Internals (Hadoop Conference Japan 2014)A Deeper Understanding of Spark Internals (Hadoop Conference Japan 2014)
A Deeper Understanding of Spark Internals (Hadoop Conference Japan 2014)
 
Pixie dust overview
Pixie dust overviewPixie dust overview
Pixie dust overview
 
Why dont you_create_new_spark_jl
Why dont you_create_new_spark_jlWhy dont you_create_new_spark_jl
Why dont you_create_new_spark_jl
 
Spark in 15 min
Spark in 15 minSpark in 15 min
Spark in 15 min
 
Spark tutorial py con 2016 part 2
Spark tutorial py con 2016   part 2Spark tutorial py con 2016   part 2
Spark tutorial py con 2016 part 2
 
How Spark is Enabling the New Wave of Converged Applications
How Spark is Enabling  the New Wave of Converged ApplicationsHow Spark is Enabling  the New Wave of Converged Applications
How Spark is Enabling the New Wave of Converged Applications
 
Tiny Batches, in the wine: Shiny New Bits in Spark Streaming
Tiny Batches, in the wine: Shiny New Bits in Spark StreamingTiny Batches, in the wine: Shiny New Bits in Spark Streaming
Tiny Batches, in the wine: Shiny New Bits in Spark Streaming
 
Spark tutorial pycon 2016 part 1
Spark tutorial pycon 2016   part 1Spark tutorial pycon 2016   part 1
Spark tutorial pycon 2016 part 1
 
What’s New in Spark 2.0: Structured Streaming and Datasets - StampedeCon 2016
What’s New in Spark 2.0: Structured Streaming and Datasets - StampedeCon 2016What’s New in Spark 2.0: Structured Streaming and Datasets - StampedeCon 2016
What’s New in Spark 2.0: Structured Streaming and Datasets - StampedeCon 2016
 
Introduction to Structured Data Processing with Spark SQL
Introduction to Structured Data Processing with Spark SQLIntroduction to Structured Data Processing with Spark SQL
Introduction to Structured Data Processing with Spark SQL
 
Apache Spark & Streaming
Apache Spark & StreamingApache Spark & Streaming
Apache Spark & Streaming
 
Apache Spark Overview part1 (20161107)
Apache Spark Overview part1 (20161107)Apache Spark Overview part1 (20161107)
Apache Spark Overview part1 (20161107)
 

Similar to The New Age for Data Scientists with Apache Spark

Detailed guide to the Apache Spark Framework
Detailed guide to the Apache Spark FrameworkDetailed guide to the Apache Spark Framework
Detailed guide to the Apache Spark FrameworkAegis Software Canada
 
Introduction to spark
Introduction to sparkIntroduction to spark
Introduction to sparkHome
 
5 things one must know about spark!
5 things one must know about spark!5 things one must know about spark!
5 things one must know about spark!Edureka!
 
Apache spark installation [autosaved]
Apache spark installation [autosaved]Apache spark installation [autosaved]
Apache spark installation [autosaved]Shweta Patnaik
 
Low latency access of bigdata using spark and shark
Low latency access of bigdata using spark and sharkLow latency access of bigdata using spark and shark
Low latency access of bigdata using spark and sharkPradeep Kumar G.S
 
Learn Apache Spark: A Comprehensive Guide
Learn Apache Spark: A Comprehensive GuideLearn Apache Spark: A Comprehensive Guide
Learn Apache Spark: A Comprehensive GuideWhizlabs
 
Apache spark architecture (Big Data and Analytics)
Apache spark architecture (Big Data and Analytics)Apache spark architecture (Big Data and Analytics)
Apache spark architecture (Big Data and Analytics)Jyotasana Bharti
 
Performance of Spark vs MapReduce
Performance of Spark vs MapReducePerformance of Spark vs MapReduce
Performance of Spark vs MapReduceEdureka!
 
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...Simplilearn
 
Apache Spark - Intro to Large-scale recommendations with Apache Spark and Python
Apache Spark - Intro to Large-scale recommendations with Apache Spark and PythonApache Spark - Intro to Large-scale recommendations with Apache Spark and Python
Apache Spark - Intro to Large-scale recommendations with Apache Spark and PythonChristian Perone
 
Big data processing with apache spark part1
Big data processing with apache spark   part1Big data processing with apache spark   part1
Big data processing with apache spark part1Abbas Maazallahi
 
Scalable Machine Learning with PySpark
Scalable Machine Learning with PySparkScalable Machine Learning with PySpark
Scalable Machine Learning with PySparkLadle Patel
 
5 Reasons why Spark is in demand!
5 Reasons why Spark is in demand!5 Reasons why Spark is in demand!
5 Reasons why Spark is in demand!Edureka!
 
What is Apache spark
What is Apache sparkWhat is Apache spark
What is Apache sparkmanisha1110
 

Similar to The New Age for Data Scientists with Apache Spark (20)

Started with-apache-spark
Started with-apache-sparkStarted with-apache-spark
Started with-apache-spark
 
Detailed guide to the Apache Spark Framework
Detailed guide to the Apache Spark FrameworkDetailed guide to the Apache Spark Framework
Detailed guide to the Apache Spark Framework
 
Apache Spark PDF
Apache Spark PDFApache Spark PDF
Apache Spark PDF
 
Apache spark
Apache sparkApache spark
Apache spark
 
Introduction to spark
Introduction to sparkIntroduction to spark
Introduction to spark
 
5 things one must know about spark!
5 things one must know about spark!5 things one must know about spark!
5 things one must know about spark!
 
Module01
 Module01 Module01
Module01
 
Apache spark installation [autosaved]
Apache spark installation [autosaved]Apache spark installation [autosaved]
Apache spark installation [autosaved]
 
Low latency access of bigdata using spark and shark
Low latency access of bigdata using spark and sharkLow latency access of bigdata using spark and shark
Low latency access of bigdata using spark and shark
 
Learn Apache Spark: A Comprehensive Guide
Learn Apache Spark: A Comprehensive GuideLearn Apache Spark: A Comprehensive Guide
Learn Apache Spark: A Comprehensive Guide
 
IOT.ppt
IOT.pptIOT.ppt
IOT.ppt
 
Apache spark architecture (Big Data and Analytics)
Apache spark architecture (Big Data and Analytics)Apache spark architecture (Big Data and Analytics)
Apache spark architecture (Big Data and Analytics)
 
Performance of Spark vs MapReduce
Performance of Spark vs MapReducePerformance of Spark vs MapReduce
Performance of Spark vs MapReduce
 
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...
 
Apache Spark - Intro to Large-scale recommendations with Apache Spark and Python
Apache Spark - Intro to Large-scale recommendations with Apache Spark and PythonApache Spark - Intro to Large-scale recommendations with Apache Spark and Python
Apache Spark - Intro to Large-scale recommendations with Apache Spark and Python
 
Big data processing with apache spark part1
Big data processing with apache spark   part1Big data processing with apache spark   part1
Big data processing with apache spark part1
 
Scalable Machine Learning with PySpark
Scalable Machine Learning with PySparkScalable Machine Learning with PySpark
Scalable Machine Learning with PySpark
 
Spark1
Spark1Spark1
Spark1
 
5 Reasons why Spark is in demand!
5 Reasons why Spark is in demand!5 Reasons why Spark is in demand!
5 Reasons why Spark is in demand!
 
What is Apache spark
What is Apache sparkWhat is Apache spark
What is Apache spark
 

Recently uploaded

Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...Thomas Poetter
 
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...Dr Arash Najmaei ( Phd., MBA, BSc)
 
Decoding Patterns: Customer Churn Prediction Data Analysis Project
Decoding Patterns: Customer Churn Prediction Data Analysis ProjectDecoding Patterns: Customer Churn Prediction Data Analysis Project
Decoding Patterns: Customer Churn Prediction Data Analysis ProjectBoston Institute of Analytics
 
Networking Case Study prepared by teacher.pptx
Networking Case Study prepared by teacher.pptxNetworking Case Study prepared by teacher.pptx
Networking Case Study prepared by teacher.pptxHimangsuNath
 
wepik-insightful-infographics-a-data-visualization-overview-20240401133220kwr...
wepik-insightful-infographics-a-data-visualization-overview-20240401133220kwr...wepik-insightful-infographics-a-data-visualization-overview-20240401133220kwr...
wepik-insightful-infographics-a-data-visualization-overview-20240401133220kwr...KarteekMane1
 
Bank Loan Approval Analysis: A Comprehensive Data Analysis Project
Bank Loan Approval Analysis: A Comprehensive Data Analysis ProjectBank Loan Approval Analysis: A Comprehensive Data Analysis Project
Bank Loan Approval Analysis: A Comprehensive Data Analysis ProjectBoston Institute of Analytics
 
Cyber awareness ppt on the recorded data
Cyber awareness ppt on the recorded dataCyber awareness ppt on the recorded data
Cyber awareness ppt on the recorded dataTecnoIncentive
 
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default  Presentation : Data Analysis Project PPTPredictive Analysis for Loan Default  Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPTBoston Institute of Analytics
 
Defining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data StoryDefining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data StoryJeremy Anderson
 
Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)Cathrine Wilhelmsen
 
Digital Marketing Plan, how digital marketing works
Digital Marketing Plan, how digital marketing worksDigital Marketing Plan, how digital marketing works
Digital Marketing Plan, how digital marketing worksdeepakthakur548787
 
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...Boston Institute of Analytics
 
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...Boston Institute of Analytics
 
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...Amil Baba Dawood bangali
 
SMOTE and K-Fold Cross Validation-Presentation.pptx
SMOTE and K-Fold Cross Validation-Presentation.pptxSMOTE and K-Fold Cross Validation-Presentation.pptx
SMOTE and K-Fold Cross Validation-Presentation.pptxHaritikaChhatwal1
 
Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...Seán Kennedy
 
INTRODUCTION TO Natural language processing
INTRODUCTION TO Natural language processingINTRODUCTION TO Natural language processing
INTRODUCTION TO Natural language processingsocarem879
 
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdf
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdfEnglish-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdf
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdfblazblazml
 
Semantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptxSemantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptxMike Bennett
 

Recently uploaded (20)

Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
 
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
 
Decoding Patterns: Customer Churn Prediction Data Analysis Project
Decoding Patterns: Customer Churn Prediction Data Analysis ProjectDecoding Patterns: Customer Churn Prediction Data Analysis Project
Decoding Patterns: Customer Churn Prediction Data Analysis Project
 
Networking Case Study prepared by teacher.pptx
Networking Case Study prepared by teacher.pptxNetworking Case Study prepared by teacher.pptx
Networking Case Study prepared by teacher.pptx
 
Data Analysis Project: Stroke Prediction
Data Analysis Project: Stroke PredictionData Analysis Project: Stroke Prediction
Data Analysis Project: Stroke Prediction
 
wepik-insightful-infographics-a-data-visualization-overview-20240401133220kwr...
wepik-insightful-infographics-a-data-visualization-overview-20240401133220kwr...wepik-insightful-infographics-a-data-visualization-overview-20240401133220kwr...
wepik-insightful-infographics-a-data-visualization-overview-20240401133220kwr...
 
Bank Loan Approval Analysis: A Comprehensive Data Analysis Project
Bank Loan Approval Analysis: A Comprehensive Data Analysis ProjectBank Loan Approval Analysis: A Comprehensive Data Analysis Project
Bank Loan Approval Analysis: A Comprehensive Data Analysis Project
 
Cyber awareness ppt on the recorded data
Cyber awareness ppt on the recorded dataCyber awareness ppt on the recorded data
Cyber awareness ppt on the recorded data
 
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default  Presentation : Data Analysis Project PPTPredictive Analysis for Loan Default  Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPT
 
Defining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data StoryDefining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data Story
 
Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)
 
Digital Marketing Plan, how digital marketing works
Digital Marketing Plan, how digital marketing worksDigital Marketing Plan, how digital marketing works
Digital Marketing Plan, how digital marketing works
 
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
 
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
 
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
 
SMOTE and K-Fold Cross Validation-Presentation.pptx
SMOTE and K-Fold Cross Validation-Presentation.pptxSMOTE and K-Fold Cross Validation-Presentation.pptx
SMOTE and K-Fold Cross Validation-Presentation.pptx
 
Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...
 
INTRODUCTION TO Natural language processing
INTRODUCTION TO Natural language processingINTRODUCTION TO Natural language processing
INTRODUCTION TO Natural language processing
 
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdf
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdfEnglish-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdf
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdf
 
Semantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptxSemantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptx
 

The New Age for Data Scientists with Apache Spark

  • 1. SPARK THE NEW AGE FOR DATA SCIENTIST In a very short time, Apache Spark has emerged as the next generation big data processing engine, and is being applied throughout the industry faster than ever.
  • 2. TABLE OF CONTENTS Summary GETTING STARTED............................................................................................................................................................1 WHAT IS APACHE SPARK?........................................................................................................................................................................ 1 BIG DATA, WHAT IS AND WHY IT MATTERS.......................................................................................................................................... 1 Parallelism enable Big Data.......................................................................................................................................................2 Big data defined...............................................................................................................................................................................2 WHY SPARK?............................................................................................................................................................................................ 4 FROM BIG CODE TO SMALL CODE........................................................................................................................................................... 4 What language should I choose for my next Apache Spark project? ........................................................................4 A BRIEF HISTORY OF SPARK..........................................................................................................................................6 THE INTERVIEW ....................................................................................................................................................................................... 6 THE SPARK GROWTH............................................................................................................................................................................... 6 THE COMPANY AROUND SPARK............................................................................................................................................................. 7 A UNIFIED STACK...............................................................................................................................................................8 SPARK CORE............................................................................................................................................................................................... 8 How Spark works? 
..........................................................................................................................................................................9 RDD............................................................................................................................................................................................................................10 Key Value Methods............................................................................................................................................................................................12 Caching Data..........................................................................................................................................................................................................13 Distribution and Instrumentation ........................................................................................................................................14 Spark Submit.........................................................................................................................................................................................................14 Cluster Manager..................................................................................................................................................................................................16 SPARK LIBRARIES ..................................................................................................................................................................................18 Spark SQL ........................................................................................................................................................................................18 Impala vs Hive: difference between Sql on Hadoop components............................................................................................20 Cloudera Impala..................................................................................................................................................................................................21 Apache Hive...........................................................................................................................................................................................................22 Other features that include Hive:...............................................................................................................................................................22 Streaming Data.............................................................................................................................................................................22 Machine Learning........................................................................................................................................................................23 Machine Learning Library (MLlib)............................................................................................................................................................23 GraphX ..............................................................................................................................................................................................24 How it works 
GraphX?.....................................................................................................................................................................................24 OPTIMIZATIONS AND THE FUTURE.....................................................................................................................................................26 Understanding closures.............................................................................................................................................................26 Local vs. cluster modes....................................................................................................................................................................................26 Broadcasting..................................................................................................................................................................................27 Optimizing Partitioning ............................................................................................................................................................29 Future of Spark..............................................................................................................................................................................31 R...................................................................................................................................................................................................................................31 Tachyon ...................................................................................................................................................................................................................31
  • 3. TABLE OF CONTENTS Project Tungsten.................................................................................................................................................................................................31 REFERENCES.....................................................................................................................................................................32 CONTACT INFORMATION.............................................................................................................................................33
  • 4. INTRODUCTION TO DATA ANALYSIS WITH SPARK Pagina 1 Getting Started In this section I will explain the concepts around Spark, how Spark works and how Spark has been built, what are the advantages of using Spark and a brief nod to history of MapReduce Algorithm. WHAT IS APACHE SPARK? Apache Spark is an open-source framework developed by AMPlab of University of California and, successively, donated to Apache Software Foundation. Unlike the MapReduce paradigm based on two- level disk of Hadoop, the primitive in-memory multilayer provided by Spark allow youto have performance up to 100 times better. Apache Spark is a general-purpose distributed computing engine for processing and analyzing large amounts of data. Spark offers performance improvements over MapReduce, especially when Spark’s in- memory computing capabilities can be leveraged. In practice, Spark extends the popular MapReduce model to efficiently support more types of computations, including interactive queries and stream processing. Speed is important in processing large datasetsor Big Data, as it means the difference between exploring data interactively and waiting minutes or hours. Figure 1 – Spark is the best solution for Big Data BIG DATA, WHAT IS AND WHY IT MATTERS Big data is a popular term used to described the exponential growth and availability of data, both structured and unstructured. Today Big Data as important to business – and society – as the Internet has become.
  • 5. INTRODUCTION TO DATA ANALYSIS WITH SPARK Pagina 2 Why? More data may lead to more accurate analyses. More accurate analyses may lead to mode confident decision making. And better decision can mean greater operational efficiencies, cost reductions and reduced risk. Organizations are increasingly generating large volumes of data as result of instrumented business processes, monitoring of user activity, web site tracking, sensors, finance, accounting, among other reasons. For example, with the advent of social network websites, users create records of their lives by daily posting details of activities they perform, events they attend, places they visit, pictures they take, and things they enjoy and want. This data deluge is often referred to as Big Data, a term that conveys the challenges it poses on existing infrastructure in respect to storage, management, interoperability, governance and analysis large amounts of data. Parallelism enable Big Data • Big Data - Applying data parallelism for data processing of large amount of data (terabyte, petabyte, …) - Exploiting cluster of computing resources and the Cloud • MapReduce - Data parallelism technique introduced by Google for big data processing - Large scale, distributing both data and computation over large cluster of machines - Can be used to parallelize different kind of computational task - Open source implementation: Hadoop Big data defined As far back as 2001, industry analyst Doug Laney articulated the now mainstream definition of big data as the three Vs of big data: volume, velocity and variety. Figure 2 – Three Vs of Big Data • Volume defines the amount of data: many factors contribute to the increase in data volume. Transaction-based data stored the years. In the past, excessive data volume was a storage issue. Big Data Volume VelocityVariety
  • 6. INTRODUCTION TO DATA ANALYSIS WITH SPARK Pagina 3 But with decreasing storage costs, other issues emerge, including how to determine relevance within large data volumes and how to use analytics to create value from relevant data. • Velocity refers to the rate at which the data is produced and processed, for example in the batch mode, near-time mode, real-time mode and streams. Data isstreaming in at unprecedented speed and must be dealt with in a timely manner. Reacting quickly enough to deal with data velocity is a challenge for most organizations. • Variety represent the data types. Data today comes in all types of formats. Structured, numeric data in traditional databases. Information created from line-of-business applications. Unstructured text documents, email, video, audio, stock ticker data and financial transactions. Managing, merging and governing different varieties of data is something many organizations still grapple with. On the other hand, many company consider two additional dimension then they thinking about big data: • Variability. In addition to the increasing velocities and varieties of data, data flows can be highly inconsistent with periodic peaks. Is something trending in social media? Daily, seasonal and event-triggered peak data loads can be challenging to manage. Even more so with unstructured data involved. • Complexity. Today's data comes from multiple sources. And it is still an undertaking to link, match, cleanse and transform data across systems. However, it is necessary to connect and correlate relationships, hierarchies and multiple data linkages or your data can quickly spiral out of control. Figure 3 – The cube of Big Data
  • 7. INTRODUCTION TO DATA ANALYSIS WITH SPARK Pagina 4 WHY SPARK? On the generality side, Spark is designed to cover a wide range of workloads that previously required separate distributed system, including batch applications, iterative algorithms, interactive queries and streaming. By supporting these workloads in the same engine, Spark makes it easy and inexpensive to combine different processing types, which is often necessary in production data analysis pipelines. In addition, it reduces the management burden of maintaining separate tools. Spark is designed to be highly accessible, offering simple APIs in Python, Java, Scala, and SQL, and rich built-in libraries. It also integrates closely with other Big Data tools. In particular, Spark can run in Hadoop clusters and access any Hadoop data source, including Cassandra, for example. FROM BIG CODE TO SMALL CODE The most difficult thing for big data developers today is choosing a programming language for big data applications. There are many languages that a developer can uses, such as Python or R: both are the languages of choice among data scientists for building Hadoop applications. With the advent of various big data frameworks like Apache Spark or Apache Kafka, Scala programming language has gained prominence amid big data developers. What language should I choose for my next Apache Spark project? The answer to this question varies, as it depends on the programming expertise of the developers but preferably Scala programming language has become the language of choice for working with big data framework like Apache Spark. Not only, Spark is mostly written in Scala, more precisely: 1. Scala 81.2% 2. Python 8.7% 3. Java 8.5% 4. Shell 1.3% 5. Other 0.3% Figure 4 – The Growth of Scala
  • 8. INTRODUCTION TO DATA ANALYSIS WITH SPARK Pagina 5 Creating any program in Scala using functional programming cannot impact performance. Functional Programming is an abstraction, which restricts certain programming practices, for example I/O, avoiding the side effect. Forcing the abstraction allows the code to be executed in a Parallel Execution Environment. For this, Spark supports only a functional paradigm, in which Scala results be the most efficient language supported. Figure 5 – From Big Code to Tiny Code In the next sections I will explain the advantage of Spark: a philosophy of tight integration has several benefits: - All libraries and higher-level components, in the stack, benefit from improvements at the lower layers. For example: when Spark’s core engine adds an optimization, SQL and machine learning libraries automatically speed up as well. - The costs associated with running the stack are minimized, because instead of running 5-10 independed software system, an organization needs to run only one. These costs include deployment, maintenance, testing, support, and others. This also means that each time a new component is added to the Spark stack, every organization that uses Spark will immediately be able to try this new component. This changes the cost of trying out a new type of data analysis from downloading, deploying, and learning a new software project to upgrading Spark. The advantages are the following: - Readability and Expressiveness - Fast and Testability - Interactive - Fault Tolerant - Unify Big Data
  • 9. INTRODUCTION TO DATA ANALYSIS WITH SPARK Pagina 6 A Brief History of Spark Apache Spark is one of the most interesting frameworks in big data in recent years. Spark had its humble beginning as a research project at UC Berkeley. Recently O’Reilly Ben Lorica interviewed Ion Stoica, UC Berkeley professor and databricks CEO, about history of apache spark. The question was: “How Apache Spark started?” THE INTERVIEW Ion Stoica, professor and databricks CEO, replay this: “the story started back in 2009 with Mesos1. It was a class project in UC Berkley. Idea was to build a cluster management framework, which can support different kind of cluster computing systems. Once Mesos was built, then we thought what we can build on top of Mesos. That’s how spark was born.” “We wanted to show how easy it was to build a framework from scratch in Mesos. We also wanted to be different from existing cluster computing systems. At that time, Hadoop targeted batch processing, so we targeted interactive iterative computations like machine learning.” “The requirement for machine learning came from the head of department at that time. They were trying to scale machine learning on top of Hadoop. Hadoop was slow for ML as it needed to exchange data between iterations through HDFS. The requirement for interactive querying came from the company, Conviva, which I founded that time. As part of video analytical tool built by company, we supported adhoc querying. We were using MySQL at that time, but we knew it’s not good enough. Also there are many companies came out of UC, like Yahoo, Facebook were facing similar challenges. So we knew there was need for framework focused on interactive and iterative processing.2” THE SPARK GROWTH Spark is an open source project that has been built and is maintained by a thriving and diverse community of developers. If you or your organization are trying Spark for the first time, you might be interested in the history of the project. Spark started in 2009 as a research project in the UC Berkeley RAD Lab, later to become the AMPLab. The researchers in the lab had previously been working on Hadoop Map-Reduce, and observed that MapReduce was inefficient for iterative and interactive computing jobs. Thus, from the 1 Mesos is a distributed systems kernel: abstracts CPU, memory, storage, and other compute resources away from machines (physical or virtual), enabling fault-tolerant and elastic distributed systems to easily be built and run effectively. In the next section I explain more precisely what Mesos is it. 2 (Phatak, 2015)
  • 10. INTRODUCTION TO DATA ANALYSIS WITH SPARK Pagina 7 beginning, Spark was designed to be fast for interactive queries and iterative algorithms, bringing in ideas like support for in-memory storage and efficient fault recovery. Research papers were published about Figure 6 – The History of Spark Spark at academic conferences and soon after its creation in 2009, it was already 10–20× faster than MapReduce for certain jobs. In 2011, the AMPLab started to develop higher-level components on Spark, such as Shark (Hive on Spark) and Spark Streaming. These and other components are sometimes referred to as the Berkeley Data Analytics Stack (BDAS). Spark was first open sourced in March 2010, and was transferred to the Apache Software Foundation in June 2013, where it is now a top-level project. THE COMPANY AROUND SPARK Today, most company uses Apache Spark. Company like Ebay, Pandora, Twetter, Netflix… use Spark for quickly query analysis, improving their business. Figure 7 – Who’s using Spark?
  • 11. INTRODUCTION TO DATA ANALYSIS WITH SPARK Pagina 8 A Unified Stack The Spark project contains multiple closely integrated components. At its core, Spark is a “computational engine” that is responsible for scheduling, distributing, and monitoring applications consisting of many computational task across many worker machines, or a computing cluster. Because the core engine of Spark is both fast and general-purpose, it powers multiple higher-level components specialized for various workloads, such as SQL or machine learning. Figure 8 – The Spark stack Finally, on of the largest advantage of tight integration is the ability to build applications that seamlessly combine different processing models. For example, in Spark you can write one application that uses machine learning to classify data in real time as it is ingested from streaming sources. Simultaneously, analysts can query the resulting data, also in real time, via SQL (e.g., to join the data with unstructured logfiles). In addition, more sophisticated data engineers and data scientists can access the same data via the Python shell for ad hoc analysis. Others might access the data in standalone batch applications. All the while, the IT team has to maintain only one system. A Figure 6 explain that without words, evidencing the Spark architecture. SPARK CORE Spark Core contains the basic functionality of Spark, including components for task scheduling, memory management, fault recovery, interacting with storage systems, and more. It provides distributed task dispatching, scheduling, and basic I/O functionalities, exposed through an application programming interface (for Java, Python, Scala, and R) centered on the RDD abstraction. Spark Core is based on the Resilient Distributed Datasets (RDDs), which are Spark’s main programming abstraction. More precisely: the RDD is a huge list of object in memory, but distributed. It can image this as a huge magic tree distributed across many compute nodes. Is a big set of objects, that we can distribute and parallelize as we want. In practice it can basically scale horizontally reasonably well. A "driver" program invokes parallel operations such as map, filter or reduce on an RDD by passing a function to Spark, which
  • 12. INTRODUCTION TO DATA ANALYSIS WITH SPARK Pagina 9 then schedules the function's execution in parallel on the cluster3. These operations, and additional ones such as joins, take RDDs as input and produce new RDDs. RDDs are immutable and their operations are lazy; fault-tolerance is achieved by keeping track of the "lineage" of each RDD, the sequence of operations produced it, so that it can be reconstructed in the case of data loss. RDDs can contain any type of Python, Java, or Scala objects. In this section I will show you the basics of Apache Spark, then the working mechanism, what is the RDD file, the difference between actions and changes and the cache mechanism. How Spark works? Spark applications run as independent sets of processes on a cluster, coordinated by the SparkContext object in your main program (called the driver program). Specifically, to run on a cluster, the SparkContext can connect to several types of cluster managers (either Spark’s own standalone cluster manager, Mesos or YARN), which allocate resources across applications. Once connected, Spark acquires executors on nodes in the cluster, which are processes that run computations and store data for your application. Next, it sends your application code (defined by JAR or Python files passed to SparkContext) to the executors. Finally, SparkContext sends tasks to the executors to run. Figure 9 – The architecture of Spark, inside the Core The Spark architecture works through the Master-Worker pattern, where the Driver Program (Master), controlled by the Spark Context component, orchestra the work by sending the various jobs to the Workers; in particular, they divide the work into different chunks for each Executors; finally, they compute the result as quickly as possible. Once the Workers end to compute the request, they send the result to the Master that deals with aggregate all the results and insert them in the DAG (Directly Acyclic Graph). The work continues until all chunks has been computed. 3 (Zaharia, Chowdhury, Franklin, Shenker, & Stoica, 2011)
  • 13. INTRODUCTION TO DATA ANALYSIS WITH SPARK Pagina 10 RDD The core of Apache Spark is MapReduce algorithm: MapReduce and its variants have been highly successful in implementing large-scale data-intensive applications in commodity clusters. These systems are built around an Acyclic Data Graph model (DAG) that is not suitable for other popular applications: the focus on this type of working, across a multiset of multiple parallel operations is the better for this use. To achieve these goal, Spark introduces an abstraction called Resilient Distributed Datasets (RDDs). An RDD is a read-only collection of objects partitioned across a set of machines that can be rebuilt if a partition is lost. Thank the RDD and DAG model, Spark can outperform Hadoop by 10x in iterative machine learning jobs, and can be used to interactively query a hundred GB dataset with sub-second response time. Definition RDD is a distributed memory abstraction that lets programmers perform in-memory computations on large cluster in a fault-tolerant manner. RDDs are motivated by two types of applications: - Iterative algorithms - Interactive data mining tools In both cases, keeping data in memory can improve performance by an order of magnitude. can improve performance by an order of magnitude. To achieve fault tolerance efficiently, RDDs provide a restricted form of shared memory, based on coarse-grained transformations rather than fine-grained updates to shared state. Behavior The RDD - Resilient Distributed Datasets - is a collection of partitioned elements along the nodes of cluster which can be computed in parallel. The workers can access only to the RDD in read mode, so they cannot change it. More precisely: - Datasets: elements such as lists or arrays - Distributed: the work is distributed across the cluster, so the computation can take place in parallel - Resilient: given that the work is distributed, Spark has a powerful management of fault tolerance, because if a job on a node crash, Spark is able to detect the crash and immediately restore the job over to another node. Because many functions in Spark are lazy, rather than immediately calculate functions and extractions by time in time, these extractions are saved in what is called the DAG, the Directed Acyclic Graph, making fault tolerance as simple as possible. The Dag grow for all transformations that are carried out, as - Map - Filter - (Other lazy operation)
Once the transformations have been collected in the DAG, they can be reduced and brought to the desired result through actions such as:
- Collect
- Count
- Reduce
- ...
The results can be saved in local storage and then sent to the master (driver), or sent to it directly.

Transformations
Spark offers an environment that supports the functional paradigm, and thanks to Scala many operations can be expressed in a minimal number of lines.

Actions
While transformations are lazily evaluated, actions aggregate the data and return the result to the master (driver program). More precisely, each action aggregates a partial result directly on its node, and only then are the partial results sent to the master, which combines them all; this policy, inherited from the Hadoop world, was designed to minimize network traffic as much as possible.

Associative property
Thanks to the associative property of these aggregations, the data can be combined regardless of the order in which the partial results arrive. This is also what lets the SparkContext organize the cluster in different ways, a property that distinguishes distributed computing.

Figure 10 – Associative property of Spark

Actions on data
Whenever we want to "see" the result, collect materializes the entire RDD collection on the driver.
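A short sketch of this transformation/action split; the HDFS path is a hypothetical placeholder and sc is an existing SparkContext:

    // Transformations (lazy): nothing runs yet, the DAG simply grows.
    val lines = sc.textFile("hdfs:///data/input.txt") // hypothetical path
    val longLines = lines.filter(_.length > 80)
    val lengths = longLines.map(_.length)

    // Actions: the DAG is scheduled and results flow back to the driver.
    val howMany = lengths.count()
    val total = lengths.reduce(_ + _)
    val everything = longLines.collect() // brings the whole RDD to the driver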
Figure 11

Key Value Methods
Spark works on <key, value> pairs to transform and aggregate the computations you want to perform. Figure 12 lists some of the functions you can use on pair RDDs:

Figure 12 – RDD functions

Example: given the following key-value list:

Figure 13 – Key-value pair list

The functions highlighted above (exposed through the implicit rddToPairRDDFunctions conversion) distribute the table across the different nodes; note that identical keys are assigned to the same node.
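A minimal pair-RDD sketch along the lines of the figures above; the data are invented, and the methods come from the rddToPairRDDFunctions conversion just mentioned:

    import org.apache.spark.HashPartitioner

    // A small key-value RDD; identical keys end up in the same partition,
    // and therefore on the same node, when grouped or reduced.
    val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3), ("b", 4), ("c", 5)))

    val sums = pairs.reduceByKey(_ + _)                   // ("a",4), ("b",6), ("c",5)
    val grouped = pairs.groupByKey()                      // all values collected per key
    val byKey = pairs.partitionBy(new HashPartitioner(3)) // explicit key-based partitioning

    sums.collect().foreach(println)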
Figure 14

Now each node can perform its transformations and actions, and then send its result to the master.

Caching Data
- Suppose we have a simple RDD and want to run a computationally intensive algorithm on it, simulated here with the Thread.sleep() method.
- The result is another RDD containing the transformed data.
- Suppose we then apply a simple map function, far less expensive than the first step; without further care, the whole expensive computation would be triggered again.
- Instead of recalculating everything from scratch, Spark can store the intermediate result in a cache. By default Spark uses RAM, but, if specified, it can also spill to larger storage such as the hard disk (beware: in that case the disk can become a bottleneck).
- If you then continue the analysis, the data you work on are the ones already in the cache.
With this method there are two benefits:
1. Speed: the data, especially the intermediate results, are available quickly.
2. Fault tolerance: if the node running the heavy first transformation crashed, the data would be recalculated and made available again without the user noticing at all.
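A minimal caching sketch, with the expensive step simulated by Thread.sleep() as in the description above (the sizes and sleep time are arbitrary):

    val raw = sc.parallelize(1 to 100)

    // An "expensive" transformation, simulated with Thread.sleep().
    val heavy = raw.map { x => Thread.sleep(100); x * x }

    // or heavy.persist(org.apache.spark.storage.StorageLevel.MEMORY_AND_DISK) to allow spilling to disk
    heavy.cache()

    heavy.count()              // first action: pays the full cost and fills the cache
    heavy.map(_ + 1).collect() // later, lighter steps reuse the cached intermediate RDD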
Distribution and Instrumentation
This section explains the system-level aspects of Spark: where the data flows and which components are activated when an action occurs.

Spark Submit
In Spark, applications (and the MapReduce-style functions they contain) are launched through the spark-submit command. A launched application refers to a main program and is handed over to the driver, which talks to the cluster manager.

Figure 15

At this point, depending on how your cluster is configured, the driver may run directly on the machine on which Spark is installed (for example, in local mode) or be sent to the master node at the head of the distributed cluster; this choice is made when the main program is launched. The driver then asks the cluster manager for the specific resources to be used in the computation and, depending on availability, the cluster manager grants the resources that the MapReduce-style jobs can use.

Figure 16

At this point the driver executes the main application, building up the RDDs until an action occurs; the action causes the work on the RDDs to be shipped to the nodes of the cluster. Once the workflow on the cluster is complete, execution continues inside the driver until the program itself terminates.
Figure 17

Now the resources are freed. Note that resources may also be released manually before the job is completed.
To better understand the possibilities offered by spark-submit, you can browse the help printed directly by the command shell:

Figure 18 – List of spark-submit options

1. First, note that spark-submit only accepts compiled driver programs, in Java/Scala or Python, together with the options described below.
2. Through the kill and status commands you can forcibly terminate, or keep monitoring, the jobs known to Spark, passing the appropriate application ID.
3. The master option determines whether Spark runs locally or on the cluster, if one is available, and can also specify the number of processor cores to use.
4. The deploy-mode option accepts only two values, client or cluster, and specifies where the driver described above resides. If client is chosen, the driver resides on the same machine from which you launch the computation; this obviously requires a fast connection between that machine and the computing cluster. The cluster option, on the other hand, offers a "fire-and-forget" service, sending the computation directly to the master.
5. Another option is the class option, which specifies the main class to run.
6. The name option assigns a name to the process that will be computed.
7. The jars and py-files options specify the files to submit; note that both Scala and Java programs are packaged as JAR files.
8. The executor-memory option (not shown in the screenshot) sets the amount of memory available to each executor of the launched process. The default is 1 GB, but you can change it depending on the power required.
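Putting these options together, a typical invocation might look like the following; the class name, application name, host and JAR are purely illustrative placeholders:

    spark-submit \
      --class com.example.MyApp \
      --name my-spark-job \
      --master spark://master-host:7077 \
      --deploy-mode cluster \
      --executor-memory 2G \
      my-app.jar input.txt output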
Cluster Manager
Apache Spark's cluster management is distributed. The main goal of the cluster manager is to manage and distribute the workload, assigning tasks to the machines that compose the cluster.

Figure 19 – Cluster Manager View

The kernel that normally manages a single machine, in this case the master, is in practice "distributed" over the entire cluster.

Main Cluster Managers

Figure 20 – Main cluster managers

 Spark ships with its own manager: Spark Standalone.
 The Hadoop cluster manager, on the other hand, is YARN, very convenient both in terms of infrastructure and in terms of ease of use.
 Apache Mesos is another product created at the University of California, Berkeley, in competition with YARN.
Standalone Manager
The Spark Standalone manager is simple and quite convenient if you are looking for a manager with just the basic functions.
 To use it with spark-submit, simply enter the host and port of the machine on which you want to launch the process (the default port is 7077).
 The standalone manager works both in client mode and in cluster mode.
 Another very helpful setting is spark.deploy.spreadOut: when enabled, the manager tries to distribute the workload across the nodes of the cluster, involving as many machines as possible. If the parameter is set to false, the cluster manager instead tries to work with as few nodes as possible, grouping the work together.
 The total-executor-cores option lets you specify the total number of cores a job may use across its executors. By default all available cores in the cluster are used, but you can specify a lower number if you want to leave capacity for future computations.

Apache Mesos Manager
 The master URL is pretty much the same as in standalone mode; the only difference is the port, which is 5050.
 It supports both deploy modes, client and cluster; Mesos is not bundled with Spark and must be installed separately.
 Mesos can allocate resources dynamically in two ways, fine-grained and coarse-grained. By default it uses the fine-grained mode; the difference lies in how the various jobs are allocated:
o fine-grained: each Spark task is mapped to and run as a Mesos task. This allows applications to scale their resources dynamically when needed⁴.
o coarse-grained: this mode holds a large amount of resources for the whole duration of the launched application. It is best if you want maximum performance, since each application can determine its best configuration and keep it.
 If spark.mesos.coarse is set to true, you can also set the number of cores to be used on each node of the cluster; by default all available cores are used.
 Unlike the standalone manager, you cannot choose the kind of spread you want: by default Mesos uses as few nodes as possible, minimizing how far the work is distributed within the cluster.

Hadoop YARN
 Unlike the previous cluster managers, launching a spark-submit process simply requires the yarn master option, since the HADOOP_CONF_DIR / YARN_CONF_DIR configuration directory already holds the information needed to connect to the cluster manager.
 Client / cluster: with this manager too you can choose the launch mode of the computation.
 The number of executors is set by default to 2, which causes problems if you want to improve performance and scale.
 You can specify the number of cores to be used by each executor, which by default is set to 1.
 Unlike all other managers, YARN has the concept of queues: you can specify the queue for the jobs that have yet to be launched by the cluster manager.

⁴ This type of resource management, for example, is not optimal for streaming applications, which require low latency; it is, however, well suited to running several processes launched by different instances in parallel: the resources needed are automatically allocated and distributed.
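To make the three managers concrete, the master URLs passed to spark-submit differ as sketched below; the host names, JAR and resource numbers are placeholders:

    # Spark Standalone (default port 7077)
    spark-submit --master spark://master-host:7077 --total-executor-cores 8 my-app.jar

    # Apache Mesos (default port 5050); coarse-grained mode enabled via configuration
    spark-submit --master mesos://mesos-host:5050 --conf spark.mesos.coarse=true my-app.jar

    # Hadoop YARN (the cluster is located through HADOOP_CONF_DIR / YARN_CONF_DIR)
    spark-submit --master yarn --deploy-mode cluster --num-executors 4 --executor-cores 2 my-app.jar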
SPARK LIBRARIES
In this section we will look at how Spark is used at a pragmatic level, that is, how it is used to query the Hadoop file system and which features it makes available.

Spark SQL
It is called Spark SQL because this functionality provides a working environment where you can query any database stored in Hadoop with a language very similar to SQL. Spark SQL is a Spark module for structured data processing. Unlike the basic Spark RDD API, the interfaces provided by Spark SQL give Spark more information about the structure of both the data and the computation being performed. Internally, Spark SQL uses this extra information to perform extra optimizations. There are several ways to interact with Spark SQL, including SQL, the DataFrames API and the Datasets API. When computing a result, the same execution engine is used, independently of which API or language you use to express the computation. This unification means that developers can easily switch back and forth between the various APIs based on which provides the most natural way to express a given transformation.

One use of Spark SQL is to execute SQL queries written using either basic SQL syntax or HiveQL. Spark SQL can also be used to read data from an existing Hive installation; for more on how to configure this feature, refer to the documentation on Hive tables. When running SQL from within another programming language, the results are returned as a DataFrame. You can also interact with the SQL interface using the command line or over JDBC/ODBC.

This chapter shows only one example of how to use functional-style operations to query the Hadoop cluster. The language chosen for the example is Scala, the most expressive of those supported by the Spark community. The Spark SQL interface supports the automatic conversion of an RDD containing case classes into a DataFrame⁵. The classes define the schemas of the tables: the names of the class fields are read using reflection (the ability of a program to reason about its own structure, in particular its source code) and become the column names. Classes can also be nested or contain complex types such as sequences and arrays. Such an RDD can be converted implicitly into a DataFrame and then registered as a table, and the tables can subsequently be used in SQL statements. An excerpt of this code is shown in Figure 21.

⁵ A DataFrame is a distributed collection of data organized into named columns. It is conceptually equivalent to a table in a relational database or a data frame in R/Python, but with richer optimizations under the hood. DataFrames can be constructed from a wide array of sources, such as structured data files, tables in Hive, external databases, or existing RDDs. The DataFrame API is available in Scala, Java, Python, and R.
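In the spirit of the excerpt in Figure 21, a minimal reflection-based sketch; the Person case class and its data are hypothetical, and the Spark 1.x-style SQLContext API is assumed:

    import org.apache.spark.sql.SQLContext

    // Field names of the case class become column names via reflection.
    case class Person(name: String, age: Int)

    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    // An RDD of Person objects is converted implicitly into a DataFrame...
    val people = sc.parallelize(Seq(Person("Alice", 29), Person("Bob", 17))).toDF()

    // ...and registered as a table that SQL statements can use.
    people.registerTempTable("people")
    val adults = sqlContext.sql("SELECT name FROM people WHERE age >= 18")
    adults.collect().foreach(println)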
Figure 21 – Example code

Impala vs. Hive: differences between SQL-on-Hadoop components
Apache Hive was introduced by Facebook to manage and process large amounts of distributed data sets saved on Hadoop. Apache Hive is an abstraction over MapReduce and Hadoop that lets you query databases in its own SQL-like language, HiveQL. Cloudera Impala, on the other hand, was developed to overcome the limitations posed by Hive's slow SQL interaction with Hadoop: it provides high performance for processing and analyzing data saved on the Hadoop cluster.

Table 1 – Impala vs. Hive
- Cloudera Impala does not require data to be moved or transformed before querying what is stored in Apache HDFS and HBase; Apache Hive is a data warehouse infrastructure built on top of the Hadoop platform to support tasks such as querying, analysis, processing and displaying data.
- Impala uses code generated at runtime for the "big loops", via LLVM (Low Level Virtual Machine); Hive generates its queries at compile time.
- Impala avoids startup overhead: its processes are daemons started when the nodes boot and are always ready to plan, coordinate and execute queries; all Hive jobs, being compiled at query time, suffer from the "cold start" problem.
- Impala is very fast at parallel processing and follows the SQL-92 standard; Hive translates queries written in its own dialect (HiveQL) into a DAG of MapReduce jobs that execute them (for example, in Hive one can write a query like the sketch below).
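The following HiveQL sketch is purely illustrative; the page_views table and its columns are hypothetical and are not taken from the original figure:

    -- Hypothetical table of page views stored on the Hadoop cluster
    CREATE TABLE IF NOT EXISTS page_views (user_id STRING, url STRING, ts TIMESTAMP)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

    -- HiveQL looks like SQL, but the query is executed as MapReduce jobs
    SELECT url, COUNT(*) AS views
    FROM page_views
    GROUP BY url
    ORDER BY views DESC
    LIMIT 10;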
Cloudera Impala
Cloudera Impala is an excellent choice for developers who do not need to move large amounts of data: it integrates easily with the Hadoop ecosystem and with frameworks like MapReduce, Apache Hive, Pig and other Apache Hadoop software. Its architecture is specifically designed around the familiar SQL language, to support users making the transition from relational databases to big data.

Columnar Storage
In Impala, data are saved in a columnar layout that provides a good compression ratio and very efficient scanning.

Figure 22 – Columnar storage with Impala

Tree Architecture
The tree architecture is crucial to distributed computing: the query is delivered to the nodes and the results are aggregated once the sub-processing is complete.

Figure 23 – Tree architecture

Impala offers the following features:
 Support for HDFS (Hadoop Distributed File System) and Apache HBase storage
 It recognizes the common Hadoop file formats: text, LZO, SequenceFile, Avro, RCFile and Parquet
 Support for Hadoop security (Kerberos authentication)
 It can read metadata, the ODBC driver and the SQL syntax directly from Apache Hive

Apache Hive
Initially developed by Facebook, Apache Hive is a data warehouse infrastructure built on top of Hadoop to provide tools for querying, analysis, processing and visualization. It is a versatile tool that supports the analysis of large databases stored in Hadoop HDFS and in compatible systems like Amazon S3. To maintain compatibility with SQL it uses HiveQL, converting the queries and submitting them as MapReduce jobs. Other features of Hive include:
 Indexing, to speed up computation
 Support for different storage formats, such as text files, RCFile, HBase, ORC and others
 Support for building user-defined functions (UDFs) to manipulate strings, dates, ...
If you are looking for an advanced analytics language that keeps a great deal of familiarity with SQL (without writing separate MapReduce jobs), Apache Hive is the best choice. HiveQL queries are still converted into the corresponding MapReduce jobs, which run on the cluster and return the final result.

Streaming Data
Spark Streaming is an extension of the core API that enables streaming processing that is scalable, high-throughput and fault-tolerant. Data can be ingested from various sources such as Kafka, Flume, HDFS/S3, Kinesis, Twitter or plain TCP sockets, and can be processed using complex algorithms expressed at a high level thanks to the functional paradigm: map, reduce, join and window operations. The processed data can then be written to a file system or a database.

Figure 24 – Streaming data

Internally, Spark works as follows: it receives the streaming data live and divides it into separate batches, which are then processed by the Spark engine to generate the final stream containing the results.

Figure 25 – Streaming data in detail

Spark Streaming provides a high-level abstraction called Discretized Stream or, more simply, DStream, which represents a continuous stream of data. DStreams can be created from input data streams coming from sources such as Kafka, Flume and Kinesis, or through high-level operations on other DStreams. Internally, a DStream is represented as a sequence of RDDs.
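A minimal streaming word count in the DStream style just described; the socket host/port and batch interval are placeholders, and sc is an existing SparkContext:

    import org.apache.spark.streaming.{Seconds, StreamingContext}

    // A StreamingContext on top of the existing SparkContext, with 10-second micro-batches.
    val ssc = new StreamingContext(sc, Seconds(10))

    // A DStream built from a plain TCP socket; internally it is a sequence of RDDs.
    val lines = ssc.socketTextStream("localhost", 9999)
    val counts = lines.flatMap(_.split(" ")).map(w => (w, 1)).reduceByKey(_ + _)

    counts.print() // or counts.saveAsTextFiles("hdfs:///out/wordcounts")

    ssc.start()            // start receiving and processing the stream
    ssc.awaitTermination()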
Machine Learning
Machine learning algorithms are constantly growing, evolving and becoming ever more efficient. To understand machine learning in Spark, it helps to look at how the library itself has evolved.

Machine Learning Library (MLlib)
MLlib is Spark's machine learning (ML) library. Its goal is to make practical machine learning scalable and easy. It consists of common learning algorithms and utilities, including classification, regression, clustering, collaborative filtering and dimensionality reduction, as well as lower-level optimization primitives and higher-level pipeline APIs. It is divided into two packages:
 spark.mllib contains the original API built on top of RDDs.
 spark.ml provides a higher-level API built on top of DataFrames for constructing ML pipelines.
Using spark.ml is recommended because, with DataFrames, the API is more versatile and flexible, but spark.mllib continues to be supported alongside the development of spark.ml. Users can keep using spark.mllib features and expect more to come, while developers are encouraged to contribute new algorithms to spark.ml when they fit the ML pipeline concept well, e.g. feature extractors and transformers.
The motivation is that, on the one hand, Matlab and R are easy to use but do not scale; on the other hand, tools such as Mahout and GraphLab grew to be very scalable but difficult to use.

Figure 26 – MLlib architecture

The Spark MLlib library aims to make the practice of machine learning scalable and relatively easy. It consists of a set of algorithms that includes regression, clustering, collaborative filtering, dimensionality reduction and other functions that allow low-level integration. The architecture optimizes the high-level APIs that are provided, and it also offers tools for feature extraction and transformation.
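A small spark.ml pipeline sketch in the style recommended above; the tiny training DataFrame and its column names are invented, and a Spark 1.x-era sqlContext is assumed to exist:

    import org.apache.spark.ml.Pipeline
    import org.apache.spark.ml.classification.LogisticRegression
    import org.apache.spark.ml.feature.{HashingTF, Tokenizer}

    // Hypothetical training data: (id, text, label).
    val training = sqlContext.createDataFrame(Seq(
      (0L, "spark is fast", 1.0),
      (1L, "plain old batch job", 0.0)
    )).toDF("id", "text", "label")

    // A pipeline chains feature extraction and a learning algorithm.
    val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
    val hashingTF = new HashingTF().setInputCol("words").setOutputCol("features")
    val lr = new LogisticRegression().setMaxIter(10)

    val model = new Pipeline().setStages(Array(tokenizer, hashingTF, lr)).fit(training)
    model.transform(training).select("text", "prediction").show()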
GraphX
GraphX is the Spark component for graphs and graph-parallel computation. GraphX extends Spark's RDDs by introducing a new graph abstraction that attaches the concepts of vertex and edge to them. To support this, GraphX provides operators such as subgraph, joinVertices and aggregateMessages. In addition, GraphX includes a growing collection of graph algorithms to simplify access and high-level analysis.

Figure 27 – How to use GraphX

The WWW itself is a gigantic graph; companies like Twitter, Netflix and Amazon have used this concept to expand their markets and their profits, and in medicine graphs are widely used in research.
How does GraphX work? A GraphX graph is a directed multigraph made of vertices and edges; each vertex can carry metadata as well as a unique identifier within the graph.
In this example each point represents a person, as happens every day in the "social graph". In GraphX this object is called a vertex, represented as:
- (VertexId, VType): simply a key-value pair.
The arrows connect the vertices; in the example they encode the relationships between the people stored in the vertices. In Spark these are called edges, and they store the IDs of the connected vertices plus the type of relationship between them:
- Edge(VertexId, VertexId, EType)
The Spark graph is built on top of these simple relationships; in particular, it is built over two different RDDs, one containing the vertices and the other containing the edges. There is one more concept to keep in mind when working with graphs: the combined view of each connection. The graph exposes EdgeTriplets, which contain all the information about a connection (source vertex, edge and destination vertex). This gives a much more complete view of the graph and makes reasoning about it much simpler.
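A minimal GraphX sketch of these vertex/edge/triplet concepts; the people and relationships are invented:

    import org.apache.spark.graphx.{Edge, Graph}
    import org.apache.spark.rdd.RDD

    // Vertices: (VertexId, metadata) pairs; here the metadata is a person's name and age.
    val vertices: RDD[(Long, (String, Int))] = sc.parallelize(Seq(
      (1L, ("Alice", 28)), (2L, ("Bob", 27)), (3L, ("Charlie", 65))
    ))

    // Edges: (source id, destination id, relationship type).
    val edges: RDD[Edge[String]] = sc.parallelize(Seq(
      Edge(1L, 2L, "friend"), Edge(2L, 3L, "colleague")
    ))

    val graph = Graph(vertices, edges)

    // Triplets expose source vertex, edge attribute and destination vertex together.
    graph.triplets
      .map(t => s"${t.srcAttr._1} is a ${t.attr} of ${t.dstAttr._1}")
      .collect()
      .foreach(println)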
OPTIMIZATIONS AND THE FUTURE
In this final part I look at some recurring issues that help in understanding how Spark behaves: how the following constructs work and what problems they may cause.
 Closures
 Broadcast
 Partitioning

Understanding closures
One of the most difficult things to understand in Spark is the scope and lifetime of variables and methods when the code is executed on the cluster. RDD operations that modify variables outside of their scope are a frequent source of confusion. The example below uses foreach() to increment a counter, but similar problems occur with other well-known operations.
Example: consider summing the elements of an RDD; the behavior may differ depending on where the code is executed. A common case is what happens when we move Spark from local mode to cluster mode (via YARN, for example).

Local vs. cluster modes
The behavior of such code is not clearly defined and might not work as intended. To run jobs, Spark divides the processing of an RDD into tasks, each run by an executor. Before execution, Spark computes the closure of each task: the closure is the set of variables and methods that must be visible to the executor to perform its computation on the RDD (in this case, the body of the foreach statement). This closure is serialized and sent to each executor. The variables within the closure are therefore copies: when the counter is incremented inside the foreach, it does not refer to the counter on the master node (driver). The original counter still exists in the driver's memory, but it is not visible to the executors, which only "see" their serialized copy of the closure. As a result, the final value of the counter on the driver remains zero, because the updates happen only on the executors' copies and are never sent back to the driver.
In the local environment, instead, the foreach may run in the same JVM as the driver and update the original counter directly, behaving differently. To ensure well-defined behavior in these scenarios you should use an Accumulator. Accumulators in Spark provide a mechanism for safely updating a variable by aggregating the updates that the worker nodes send back to the driver.
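A sketch of the difference: the foreach counter mirrors the problematic pattern above, while the accumulator is the safe alternative (the Spark 1.x-style sc.accumulator API is assumed):

    val data = sc.parallelize(1 to 100)

    // Unreliable on a cluster: each executor increments its own serialized
    // copy of `counter`, so the driver's variable is never updated.
    var counter = 0
    data.foreach(x => counter += x)
    println(s"Closure counter: $counter") // typically still 0 in cluster mode

    // Safe: an accumulator is updated by the tasks and aggregated on the driver.
    val acc = sc.accumulator(0) // sc.longAccumulator("sum") in newer versions
    data.foreach(x => acc += x)
    println(s"Accumulator: ${acc.value}") // 5050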
Broadcasting
In the previous section you saw how Spark sends a copy of the data needed by each task to the various workers (nodes) of the cluster. Without a well-designed strategy, this kind of operation can lead to problems, given that the computations Spark performs are massive, on the order of gigabytes. This section first shows what incorrect management of this shipping looks like, and then the strategy implemented in Spark to solve the problem.
Example: consider lazy operations whose closures contain a map and a flatMap, respectively. Given a master (driver) and three workers:
1. A 1 MB dataset (a dummy variable) is created inside the driver.
2. Every time an action is executed on the workers, a copy of that data is sent to each worker.
3. If the first worker needs it, for example, 4 times, the 1 MB become 4 MB at the local level. Scaled up to all the workers of even this small network, the shipped data could grow out of control.
In a network with many nodes this must not happen. In Spark the issue is solved as follows: the shared data (the "indexer") is wrapped in a broadcast variable, and the lazy map operations then refer to that broadcast value. The broadcast mechanism first compresses the data as much as possible; then:
1. The indexer is sent once, and only once, to each worker.
2. Each worker caches its copy of the indexer.
This keeps the growth of the shipped data under control, even when the workers repeatedly compute over the same data. Not only that: to relieve the possible bottleneck between driver and workers, the workers can exchange the indexer among themselves in a manner similar to BitTorrent networks. This means, for example, that if the first worker has already received the indexer to perform some computation, and a second worker also needs it, the workers can help each other, minimizing the calls to the driver that could otherwise cause an annoying bottleneck and add latency.
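A minimal broadcast sketch of this mechanism; the lookup map is an invented stand-in for the "indexer" described above:

    // A sizeable lookup structure built once on the driver.
    val lookup = (1 to 100000).map(i => i -> s"value-$i").toMap

    // Broadcast it: it is shipped to each worker once, cached there,
    // and shared by every task running on that node.
    val bcLookup = sc.broadcast(lookup)

    val ids = sc.parallelize(1 to 1000)
    val resolved = ids.map(id => (id, bcLookup.value.getOrElse(id, "missing")))
    resolved.count()

    bcLookup.unpersist() // frees the copies cached on the executors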
Optimizing Partitioning
The previous section showed how uncontrolled shipping of data can make an application grow in an uncontrolled manner. This section shows instead how a disproportionate number of partitions can be just as harmful if misused. The problem is best explained through an example. Given the following RDD:
 a "huge" data set,
 the values from 1 to Int.MaxValue,
 divided into 10,000 partitions,
1. perform a filter that shrinks the result as much as possible,
2. sort the remaining values,
3. perform a map that adds 1 to each element of the final array,
4. show the result using the collect action.
If you then look, at the system level, at how the nodes of the network behave, the results are surprising. First, the execution is split into 2 stages of roughly 10,000 tasks each; the first (sortBy) completed in 13 seconds, the second (collect) in 20 seconds. If we analyze the result more closely and drill into the collect stage, we see that the sortBy performed after the filtering took a full 20 seconds. This is because Spark was asked to use 10,000 tasks to sort about 10 values (as the code is written, the sortBy is performed after the filtering).
Figure 28 – System-level view

As you can see, this overhead is harmful if not kept under control: the image shows that 10,000 tasks are executed, and almost every one of them spends essentially zero time on the required computation. Simply put, the time spent scheduling and synchronizing the executors is far higher than the time spent on the desired computation, inflating the overhead and the latency perceived by the user out of all proportion. A technique for controlling this is the coalesce function, which here reduces the number of partitions from 10,000 to 8 after the filter has run. The second argument passed to coalesce controls shuffling: if it is set to true the output is redistributed evenly across the nodes, whereas if it is set to false as few partitions as possible are used without a shuffle. If you monitor the execution time, the result is the same but the performance is far better: the collect performed after the filtering, in the second stage of the second test, takes only 4 seconds compared with the 20 seconds used previously. In detail, the sort now takes only 24 milliseconds and requests only 8 tasks in total. The coalesce function saves an enormous amount of time by reducing the number of partitions processed when actions are executed.
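A sketch of the pattern just described; the numbers mirror the example (1 to Int.MaxValue in 10,000 partitions) but the exact figures are illustrative:

    // A huge range split into very many partitions.
    val huge = sc.parallelize(1 to Int.MaxValue, 10000)

    // Filtering leaves a handful of values, still spread over ~10,000
    // mostly-empty partitions, so the sort would schedule ~10,000 tiny tasks.
    val few = huge.filter(_ <= 10)

    // coalesce(8, shuffle = false) shrinks the RDD to 8 partitions first,
    // so the expensive steps schedule only 8 tasks.
    val result = few.coalesce(8, shuffle = false)
      .sortBy(identity)
      .map(_ + 1)
      .collect()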
Future of Spark
This section shows some features that the community around Spark is trying to develop and advance.

R
Integration with R is definitely the first "frontier" to be conquered in order to bring together the community of mathematicians and statisticians and that of engineers and data scientists. R offers a user-friendly development environment from a statistical and mathematical point of view and is widely used for data analysis.

Tachyon
Why does Spark turn out to be up to 100 times faster than MapReduce? Part of the answer is provided by the community working on Tachyon. Memory is the key to processing big data quickly and scalably: Tachyon is a distributed in-memory storage layer adopted by Spark which, in addition to providing automatic fault tolerance and caching techniques, provides reliable memory sharing across the cluster. Continuing to develop this kind of service would make big data analysis even faster and more powerful than it already is.

Project Tungsten
Another initiative developed by the community that has gathered around Spark is Project Tungsten: the project changes the execution core running on the nodes of the cluster, substantially improving the efficiency of memory and CPU usage for Spark applications in order to push performance as far as possible. The key observation concerns modern CPUs: frameworks have traditionally been built around the cost of I/O, so this research aims to minimize I/O and to extract the maximum performance from the CPU and memory when the machines are used as a cluster.
References
Karau, H., Konwinski, A., Wendell, P., & Zaharia, M. (2015). Learning Spark. Sebastopol, CA: O'Reilly Media.
Mortar Data Software. (2013, May). How Spark Works. Retrieved from https://www.linkedin.com/company/mortar-data-inc.
Phatak, M. (2015, January 2). History of Apache Spark: Journey from Academia to Industry. Retrieved from http://blog.madhukaraphatak.com/history-of-spark/
SAS - The Power to Know. (2013, April). Big Data. Retrieved from http://www.sas.com/en_ph/insights/big-data/what-is-big-data.html
Zaharia, M., Chowdhury, M., Franklin, M. J., Shenker, S., & Stoica, I. (2011, June). Spark: Cluster Computing with Working Sets. USENIX Workshop on Hot Topics in Cloud Computing (HotCloud). Retrieved from https://amplab.cs.berkeley.edu/wp-content/uploads/2011/06/Spark-Cluster-Computing-with-Working-Sets.pdf
Contact information
MASSIMILIANO MARTELLA
Email: massimilian.martella@studio.unibo.it