Spark and Storm: Performance and Use
Case Comparison
Jessica Morris, Partap Singh
May 2, 2016
INFO 491: Big Data Analytics
EXECUTIVE SUMMARY
INTRODUCTION
Apache Spark
Apache Storm
EVALUATION CRITERIA
Data Processing
Latency
Reliability
Ease of Use
Application of Use
CONCLUSIONS
Visual Review of Data
Final Overview
Executive Summary
The purpose of this paper is to explore the differences between the Hadoop
Toolkit technologies Apache Spark and Apache Storm. We will use various online
sources for this comparison in order to understand the real life usage of these
technologies. The report concludes that Apache Spark and Apache Storm have very
different use cases, and their strengths and weaknesses should be carefully considered
when deciding between the two in an overlapping use case. In general, Apache Storm is
stronger at real-time stream processing, while Apache Spark is a very strong and fast
batch-processing technology. When the situation calls for real-time streaming, we
recommend Apache Storm. If the situation is more generalized or calls for high-latency
batch processing, Apache Spark is more appropriate.
Unlike Hadoop, both Storm and Spark are new and evolving technologies in the
business world, so a limited amount of information was available at the time this paper
was written. This paper is written with readers who are new to these technologies in
mind; a simple, holistic approach has been used to compare them. This paper is not
meant to be a reference for coding or programming languages.
Introduction
In the world of computing, the necessity of big data processing is more visible
now than ever. Technology such as Apache Hadoop provides frameworks for many
types of big data computing. Two technologies that exist within this framework are
Spark and Storm. Both of these tools can analyze big data in real time. In the past, such
analysis required expensive batch operations with high latencies. As technologies
continue to grow, pushing both the physical and logical limits of computing, the demand
for alternative ways of analyzing big data has risen. Both Spark and Storm use
distributed processing and real-time processing to meet these demands and improve
the implementation of business intelligence analytics.
Using the following benchmarks, this paper will analyze the use cases and
strengths of Spark and Storm.
 Data Processing
 Latency
 Reliability
 Ease of Use
 Application of Use
It should be noted that while Spark and Storm share some use cases and have an
overlap of technologies used, they both also have a unique set of strengths and
weaknesses. This means that there is no clear overall ‘winner’, only certain applications
in which Spark may be more appropriate than Storm or vice versa. In many cases, the
difference boils down to developer or analyst preference.
Both Spark and Storm are essential parts of the Apache Hadoop toolkit. It is
pioneering technologies like Spark and Storm that will pave the future for big data
analytics. Distributed computing and real time data processing are both technologies
with great potential and are very much worth discussing.
Apache Spark
Spark’s official website describes Apache Spark as a “fast and general engine for
large-scale data processing.” Businesses such as Amazon, eBay, Yahoo!, and TripAdvisor
all utilize Spark’s technology. This usage is often for ‘real time’ predictive analytics,
profiling and personalization. Many businesses in the financial, health services, and
entertainment industries make use of Spark as well.
The technology is founded on the concept of distributed datasets, which use
Scala, Java or Python objects. Spark is a flexible framework that can handle both batch
and real time data processing. Spark Streaming adds streaming capabilities similar to
Apache Storm as well, but where Spark truly shines is in its ability to process data files
using an efficient distributed computing process across a large number of clusters and
nodes.
In many ways, Spark was designed as an improvement on MapReduce. One key
advantage of Spark is that it provides an excellent interface for distributed programming
and cluster programming. Parallel data and fault tolerance are implicit in Spark’s design.
These components were developed as a response to inherent limitations found in
MapReduce. Spark runs on a resilient distributed dataset (RDD), which is read only and
designed for distribution across clusters. The result is a much more natural way of
processing data compared to other tools. Some inherent advantages to Spark are:
 Fault-tolerance
 Scalability- can run on clusters with thousands of nodes
 Speedy and efficient batch-processing- 10x – 100x (in memory) faster than
MapReduce
 Flexibility
Older technologies such as MapReduce broke data up into smaller blocks for
easier processing. By contrast, Spark can process an entire job in memory, allowing for
better parallelism. Spark works well with many of the major programming languages
used in data analysis, such as Python, Java, and Scala, Spark's native programming
language.
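As a toy illustration of the in-memory, parallel word-count pattern that Spark popularized, the following pure-Python sketch counts words in each "partition" independently and then merges the partial results. This is not the Spark API, and the partition data is invented; it only mirrors the map-then-merge shape that Spark executes across executor memory instead of spilling intermediate results to disk:

```python
from collections import Counter
from functools import reduce

# Toy stand-in for an RDD: each "partition" is a list that could be
# counted independently (Spark would schedule these across cluster
# nodes, keeping the intermediate counts in memory).
partitions = [
    ["spark", "storm", "spark"],
    ["storm", "hadoop", "spark"],
]

# Map step: count words within each partition.
partial_counts = [Counter(p) for p in partitions]

# Reduce step: merge the per-partition counts into one result.
totals = reduce(lambda a, b: a + b, partial_counts)

print(totals["spark"])  # 3
```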
Spark is built to process continuous query style workloads. Because Spark was
designed for ease of use and efficiency of processing, it is a market leader for streaming
technologies, beating out many other more specialized frameworks. Spark was built to
be a generalized way of processing data, making it inherently flexible – it can be applied
to anything from distributed data analytics to machine learning.
Hive, Pig, Sqoop, and many other Hadoop ecosystem tools work well with Spark.
In many cases the combination of these tools can replace MapReduce completely. Given
its unique approach to analytics and its synergy with many existing applications,
programming languages, and frameworks, Spark is a rich addition to the Hadoop toolkit.
Spark is a robust batch-processing framework with the added capability of micro-
batching, as well as extensions such as Spark Streaming.
Apache Storm
Storm is a free and open-source distributed real-time computation system for
processing large volumes of streaming data in parallel and at scale. Unlike Hadoop
MapReduce, which processes data in batches, Storm processes unbounded streams of
data in real time. Because it is simple and easy to use with any programming language,
many developers find Storm fun to use.
According to Hortonworks, several characteristics make Storm ideal for real-time
data-processing workloads:
 Fast: it can process one million 100-byte messages per second per node
 Scalable: parallel calculations run across a cluster of machines
 Fault-tolerant: if a node dies, its work is restarted on another node
 Reliable: Storm guarantees that each unit of data (tuple) will be processed at
least once or exactly once
Business opportunities and use cases where Storm can be beneficial include:
real-time customer service management, data monetization, operational dashboards,
cyber-security analytics, threat detection, online machine learning, continuous
computation, distributed RPC, ETL, and more.
Many well-known and established businesses use Apache Storm. Groupon, an
online coupon company, uses Storm to build real-time data integration systems; Storm
helps Groupon analyze, clean, normalize, and resolve large amounts of non-unique data
points with low latency and high throughput. The Weather Channel uses many Storm
topologies to ingest and persist weather data. Twitter uses Storm in a wide variety of
applications, from discovery and real-time analytics to personalization, search, and
revenue optimization.
Evaluation Criteria
Data Processing
Apache Spark is designed to do more than plain data processing. It actually can
make use of existing machine learning libraries and process graphs. Due to the high
performance of Apache Spark, it can be used for both batch processing and real time
processing. In a December 2014 blog post, "Storm or Spark," the developer Oliver
compared the data-processing capabilities of the two technologies. According to him,
both have overlapping capabilities as well as distinctive features and roles to play.
In 2011, BackType, a marketing intelligence company later bought by Twitter,
created Storm as a distributed computation framework for event stream processing.
Twitter soon open-sourced the project and put it on GitHub, but Storm ultimately
moved to the Apache Incubator and became an Apache top-level project in September
2014. Often referred to as the "Hadoop of real-time processing," Storm does for
unbounded streams of data what Hadoop did for batch processing.
Storm is written primarily in Clojure and is designed to support wiring “spouts”
(input streams) and “bolts” (processing and output modules) together as a directed
acyclic graph (DAG) called a topology. Storm topologies run on clusters and, based on
the topology configuration, the Storm scheduler distributes work to nodes around the
cluster. MapReduce jobs in Hadoop and topologies in Storm do roughly the same job,
except that Storm focuses on real-time, stream-based processing. Topologies run
forever by default, or until manually terminated. Once a topology is started, the spouts
bring data into the system and hand it off to bolts, where the main computational work
is done. As processing progresses, one or more bolts may need to write data out to a
database or file system, or send a message to another external system.
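The spout-and-bolt wiring described above can be imitated in a few lines of ordinary Python. This is only a conceptual, single-process sketch with invented names, not Storm's actual API, but it shows how tuples flow from an input stream through a chain of processing steps to a terminal bolt:

```python
# Toy, single-process imitation of Storm's spout/bolt wiring.
# Real Storm distributes these components across worker nodes.

def sentence_spout():
    """Spout: the input stream, emitting raw sentence tuples."""
    for line in ["storm streams data", "spark batches data"]:
        yield line

def split_bolt(stream):
    """Bolt: splits each sentence tuple into word tuples."""
    for sentence in stream:
        yield from sentence.split()

def count_bolt(stream):
    """Terminal bolt: accumulates word counts (in a real topology,
    this is where results would be written to a database or HDFS)."""
    counts = {}
    for word in stream:
        counts[word] = counts.get(word, 0) + 1
    return counts

# Wire the DAG: spout -> split bolt -> count bolt.
result = count_bolt(split_bolt(sentence_spout()))
print(result["data"])  # 2
```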
One of the strengths of the Storm ecosystem is its rich array of available spouts,
which are specialized for receiving data from all types of sources. Data processed in
Storm is easy to integrate with HDFS file systems. Even though Storm is written in
Clojure, it also supports programming in multiple languages, particularly those that run
on the JVM.
Spark, for its part, started as a project of the AMPLab at the University of
California, Berkeley; it became a very popular real-time distributed computation project
and ultimately an Apache top-level project in February 2014. Like Storm, Spark supports
stream-oriented processing, but it is more of a general-purpose distributed computing
platform. Because Spark can run on top of an existing Hadoop cluster, relying on YARN
for resource scheduling, it can be seen as a potential replacement for the MapReduce
functions of Hadoop. Written in Scala, Spark supports multi-language programming,
although it provides specific API support only for Scala, Java, and Python. Spark does
not have the specific abstraction of a "spout," but it includes adapters for working with
data stored in numerous disparate sources, including HDFS files, Cassandra, HBase, and
S3. Like Storm, Spark supports a streaming model, but this support is provided by just
one of several Spark modules, alongside purpose-built modules for SQL access, graph
operations, and machine learning.
Spark does not process streams one event at a time as Storm does. Instead, it
slices them into small batches over time intervals before processing them. DStream, the
abstraction for a continuous stream of data, is a sequence of micro-batches of RDDs
(Resilient Distributed Datasets): distributed collections that can be operated on in
parallel by arbitrary functions and by transformations over a sliding window of data
(windowed computations).
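The micro-batching idea can be sketched concretely: assign each timestamped event to a fixed-length batch interval, then treat each batch as a small job. This is a conceptual illustration in plain Python, not Spark Streaming's API; the events and the one-second interval are invented:

```python
# Slice a timestamped event stream into fixed-interval micro-batches,
# the way Spark Streaming's DStream discretizes a stream.
# Events are (arrival_time_seconds, value) pairs.
events = [(0.1, "a"), (0.4, "b"), (1.2, "c"), (1.9, "d"), (2.5, "e")]
interval = 1.0  # batch interval in seconds

batches = {}
for t, value in events:
    batch_id = int(t // interval)  # which micro-batch this event lands in
    batches.setdefault(batch_id, []).append(value)

# Each micro-batch would become one RDD, processed as a small batch job.
print(batches)  # {0: ['a', 'b'], 1: ['c', 'd'], 2: ['e']}
```

The batch interval is the knob behind Spark Streaming's seconds-scale latency: no event can be processed before its interval closes.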
Latency
Traditional data-warehousing environments were expensive and relied on high-
latency batch operations. As a result, organizations were not able to embrace the power
of real-time business intelligence and big data analytics. Today, powerful open-source
tools like Spark and Storm have emerged to overcome this challenge. Spark is a parallel
open-source processing framework; its workflows are modeled on Hadoop MapReduce
but are comparatively more efficient. One benefit of Apache Spark is that it does not
depend on Hadoop YARN to function: it has its own streaming API and independent
processes for continuous batch processing across short time intervals. Spark runs 100
times faster than Hadoop in certain situations, but it does not have its own distributed
storage system.
When talking about Spark's performance for data processing, we cannot ignore
Hadoop MapReduce, because the two technologies work hand in hand. Where Spark
processes data in memory, Hadoop MapReduce persists data back to disk after each
map or reduce action; Hadoop MapReduce therefore lags behind Spark. Spark requires a
large amount of memory, since it loads a process into memory and keeps it there for
caching. Spark collects events into small batches over short time windows before
processing them, whereas Storm processes events one at a time. Thus, Spark has a
latency of a few seconds, whereas Storm processes an event with millisecond latency.
Reliability
Spark and its RDD abstraction are designed to seamlessly handle failures of any
worker node in the cluster. Spark Streaming is built on Spark and so enjoys the same
fault tolerance for worker nodes, but for certain sources such as Kafka and Flume it
must also recover from failure of the driver process. For sources like files, recomputing
each micro-batch of data is sufficient to ensure zero data loss, because the data is
stored in a fault-tolerant file system like HDFS. However, for other sources like Kafka
and Flume, some received data that was buffered in memory but not yet processed
could be lost. To avoid this data loss, a write-ahead log was introduced in the Spark
Streaming 1.2 release.
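The write-ahead-log idea can be illustrated with a minimal sketch (plain Python with invented names; real Spark Streaming writes to a fault-tolerant store such as HDFS): every received record is appended to durable storage before it is buffered in memory, so the unprocessed buffer can be replayed after a crash:

```python
# Minimal write-ahead-log sketch. The "durable log" here is just a list
# standing in for a fault-tolerant file system.

durable_log = []     # survives failure (e.g., HDFS in real deployments)
receive_buffer = []  # in-memory buffer that is lost when a process dies

def receive(record):
    durable_log.append(record)     # 1. write ahead to durable storage
    receive_buffer.append(record)  # 2. only then buffer in memory

for r in ["r1", "r2", "r3"]:
    receive(r)

receive_buffer.clear()          # simulate a crash wiping the buffer
recovered = list(durable_log)   # on restart, replay from the log
print(recovered)  # ['r1', 'r2', 'r3']
```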
Storm guarantees that every spout tuple will be fully processed by the topology
by tracking the tree of tuples triggered by every spout tuple and determining when that
tree of tuples has been successfully completed. Every topology has a “message timeout”
associated with it and if Storm fails to detect that a spout tuple has been completed
within that timeout, it fails the tuple and replays it later. There are two main things an
application must do to benefit from Storm's reliability capabilities: first, tell Storm
whenever a new link in the tree of tuples is created, and second, tell Storm when an
individual tuple has finished processing. By doing both, Storm can detect when the tree
of tuples is fully processed and can acknowledge or fail the spout tuple appropriately.
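A minimal sketch of this ack-and-timeout bookkeeping, in plain Python with invented names (Storm's actual default message timeout is 30 seconds), might look like:

```python
import time

# Toy version of Storm's reliability bookkeeping: track tuples in
# flight; any tuple not acked within the message timeout is reported
# as failed so it can be replayed.

TIMEOUT = 0.05  # seconds (shortened for the demo)
in_flight = {}  # tuple_id -> time it was emitted

def emit(tuple_id):
    in_flight[tuple_id] = time.monotonic()

def ack(tuple_id):
    in_flight.pop(tuple_id, None)  # tuple tree completed: stop tracking

def expired():
    now = time.monotonic()
    return [t for t, emitted in in_flight.items() if now - emitted > TIMEOUT]

emit("t1"); emit("t2")
ack("t1")              # t1's tuple tree was fully processed
time.sleep(0.06)       # t2 was never acked within the timeout
print(expired())       # ['t2'] -> Storm would fail and replay t2
```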
Ease of Use
Both Apache Spark and Apache Storm were built on a founding principle of user
accessibility. As such, both platforms are scalable, fault-tolerant and support a wide
array of programming languages and platforms of use such as Apache Hadoop.
However, there are some differences.
For example, when developing for Spark, one must keep its base language, Scala,
in mind. Scala is a JVM-based language designed to improve upon the shortcomings of
Java. While Scala has been praised for its various features and adjustments, such as
scalability (hence the name Scala), it is not as well known as Java. This can make things
difficult for new developers who are unfamiliar with the language. Furthermore, Spark's
tuples are also based on Scala, and while they can be implemented in Java, this is known
to be a difficult process.
There are aspects of Storm that have made it a favorite among some developers.
Storm supports more programming languages than Spark: in addition to Java, Scala, and
Python, Core Storm also supports Clojure, Ruby, and many others. Because of this
expanded support, Storm's tuples can be much easier for programmers to manage.
Storm uses DAGs (directed acyclic graphs), allowing for a very natural flow of data
through Storm's tuples; each individual node plays a role in transforming the data as it
moves through the DAG. However, this can cause compiling to take a while, and it
sometimes leads developers to skip important safety checks.
Conversely, many features of Spark are much easier to implement than on other
platforms. Spark Streaming offers a language-integrated API, which allows developers
to write real-time jobs in much the same way as batch jobs.
Many developers make great use of Spark’s Interactive Mode which eases the
programming process. While many developers prefer Storm, developers who are
comfortable with Interactive Mode will choose Spark as nothing comparable is available
for Storm at this time. Thus, even with the aforementioned challenges, Spark remains
one of the simpler solutions to the complex problem of real time data.
Application of Use
Apache Spark and Storm are both powerful business-intelligence platforms with
a wide array of uses. These platforms were developed for processing real-time data;
however, the nuances of their application are many.
For example, since Storm processes real-time data event by event, it is great for
real-time log monitoring. Twitter originally developed Storm in 2011 for use in analytics.
At Twitter, Storm treats clusters of tweets as a batch and, working with other tools such
as Cassandra, analyzes each batch for data such as geographic location and user
demographics, and determines whether a tweet was spam. Other companies such as
Spotify use Storm's real-time strengths to recommend music to users based on listening
history or to target advertisements to specific demographics. According to Spotify:
"Storm enables us to build low-latency fault-tolerant distributed systems with ease."
Many industry professionals see Spark as a more general-purpose platform.
While Storm excels at analyzing user, log, and stream data in real time, Spark is more of
a real-time improvement on MapReduce. Being a much newer technology, Apache Spark
is far less adopted in industry, and there are fewer working use cases. Spark shines when
it runs on top of an existing Hadoop cluster, when layered on top of other technology
such as Mesos, or when supporting multiple processing paradigms. While Spark has a
specialized streaming application in Spark Streaming, it is not yet as well supported as
Storm. Spark is not designed to handle a multi-user environment, and all users have to
apportion data appropriately; if a project has many users, Storm may be a more
appropriate choice.
This does not mean that Spark is unused. On the contrary, Spark has seen rapid
adoption since its introduction to the Apache toolkit in 2014. In addition to streaming,
Spark shows great potential in the areas of machine learning, interactive analysis, and
fog computing. Uber uses Spark to convert unstructured data into structured data, and
Netflix uses Spark in a manner similar to how Spotify uses Storm: to create real-time
user-prediction models and recommend movies based on user history.
Overall, while Storm is the older and more tested technology, there is a wide
range of use cases where either Spark or Storm is appropriate. Furthermore, the
decision on whether to use Spark or Storm for a task highly depends on the user base,
organizational structure, and what users wish to accomplish in their data processing.
Conclusions
Visual Review of Data
The following table contains a simplified review of the differences between
Spark and Storm. This table is intended to help establish appropriate use cases for Spark
and Storm as well as help potential users decide between one or both technologies.
1) Data processing
Storm:
 Mainly does streaming, but supports micro-batching as well (Trident API)
 Stream primitives are the tuple, spouts (input streams), bolts (processing and output modules), and topologies
 Directed acyclic graph (DAG) topology
 Will run in a continuous loop unless aborted
 Strong point: a rich array of available spouts to collect data
 Uses Apache Zookeeper and minion worker processes to coordinate topologies
 Based on Clojure, but supports many languages such as Java, Scala, Python, Ruby, and more
Spark:
 Mainly batch processing, but also does micro-batching
 Uses DStream, an abstraction representing a continuous stream of data as micro-batches of RDDs (Resilient Distributed Datasets)
 RDDs are operated on in parallel
 Windowed computations
 Depends on Hadoop HDFS (to store data after computation)
 Based on Java, Scala, and Python only

2) Speed/Performance
Storm:
 Storm is fast: a benchmark clocked it at over a million tuples processed per second per node
 Claimed to be 4 to 5 times slower than Spark on a word-count benchmark
 Limited performance per server by stream-processing standards
Spark:
 Much faster (100 times) than Hadoop MapReduce
 Processes data in memory and requires a large amount of memory
 Faster than Storm, but with limited performance

3) Reliability
Storm:
 Guarantees that the tree of tuples will be fully processed
 Uses a timeout system: every piece of data is time-stamped, and an acknowledgment system is used to detect failures
Spark:
 Processes data in memory, hence volatile
 Uses a log system to track data processing and avoid data loss

4) Ease of Use
Storm:
 Built on Clojure but flexible enough to work with many other languages, including Java, Scala, and Python, making it very easy to work with
 Data-processing flow is natural and intuitive (a spout collects the stream of data and a bolt processes it through HDFS)
 Intermediate to advanced programmers can utilize this technology
Spark:
 Built on Scala, Java, and Python only
 Requires expertise in Java and Scala to utilize the technology, unless the GUI is used
 The GUI is very beginner-friendly and easy to learn for those new to programming

5) Application of use
Storm:
 Examples: Twitter, Groupon, Google
 Strong real-time data-processing capabilities
 Data streaming and analytics
 Event log monitoring, record keeping, filtering, lookup
Spark:
 Examples: Uber, Netflix, Pinterest
 Batching and micro-batching
 Distributed computing
 Machine learning
 Event detection
 Cluster computing
 Data analytics
Final Overview
Spark and Storm are very capable technologies for real-time analytics. Both
frameworks are based on a streaming architecture, meaning there is an inbound, infinite
list of tuples. In this report, Spark and Storm have been evaluated on the basis of data
processing, speed, reliability, ease of use, and application of use. In Storm nomenclature,
finite-sized tuples are processed through spouts and bolts. In Spark nomenclature, the
stream is called a DStream, or discretized stream; the Spark Streaming engine essentially
discretizes the stream into finite-sized RDDs, which are micro-batches of messages to
process. Both frameworks allow stateful processing: if a worker node or the driver node
loses data, it can be recovered from the saved state. The Resilient Distributed Dataset
(RDD) is the cornerstone of Spark, a finite-sized, immutable data structure that can be
replayed as needed and is replicated over HDFS or another persistent file system.
Since both technologies have varying strengths and weaknesses, it is hard to
recommend one over the other in a general sense. However, Spark and Storm have a
large application domain within which specific uses may be appropriate. In general,
industry professionals agree that Storm is currently the more useful real-time streaming
technology, even though Spark Streaming is growing by leaps and bounds. Storm has a
specialized toolset that gives it strong processing capabilities, while Spark is very
generalized and has developed in such a way that it may replace MapReduce in the
future. Although the use cases for Spark and Storm may overlap, the platforms are very
distinct. If one wants to use stream-based technology for analytic purposes, and is
comfortable with programming, Storm is the better choice. If one would like to use
cluster computing or batching technology, or is uncomfortable with programming, then
Spark is the better choice. Overall, both Spark and Storm are fantastic ways of achieving
lightning-fast, reliable, and easy-to-learn data processing.
Spark_Part 1
 
RDBMS vs Hadoop vs Spark
RDBMS vs Hadoop vs SparkRDBMS vs Hadoop vs Spark
RDBMS vs Hadoop vs Spark
 
Hadoop vs spark
Hadoop vs sparkHadoop vs spark
Hadoop vs spark
 
Apache spark architecture (Big Data and Analytics)
Apache spark architecture (Big Data and Analytics)Apache spark architecture (Big Data and Analytics)
Apache spark architecture (Big Data and Analytics)
 
Tools and techniques for data science
Tools and techniques for data scienceTools and techniques for data science
Tools and techniques for data science
 
Testistanbul 2016 - Keynote: "Performance Testing of Big Data" by Roland Leusden
Testistanbul 2016 - Keynote: "Performance Testing of Big Data" by Roland LeusdenTestistanbul 2016 - Keynote: "Performance Testing of Big Data" by Roland Leusden
Testistanbul 2016 - Keynote: "Performance Testing of Big Data" by Roland Leusden
 

INFO491FinalPaper

technologies continue to grow, pushing both the physical and logical limits of computing, the demand for alternative ways of analyzing big data has risen. Both Spark and Storm use distributed, real-time processing to improve the implementation of business intelligence analytics, meeting these demands. Using the following benchmarks, this paper will analyze the use cases and strengths of Spark and Storm:

- Data Processing
- Latency
- Reliability
- Ease of Use
- Application of Use

It should be noted that while Spark and Storm share some use cases and overlapping technologies, each also has a unique set of strengths and weaknesses. This means there is no clear overall 'winner', only certain applications in which Spark may be more appropriate than Storm, or vice versa. In many cases, the difference comes down to developer or analyst preference. Both Spark and Storm are essential parts of the Apache Hadoop toolkit, and it is pioneering technologies like these that will pave the way for the future of big data analytics. Distributed computing and real-time data processing are both technologies with great potential and are very much worth discussing.
Apache Spark

Spark's official website describes Apache Spark as a "fast and general engine for large-scale data processing." Businesses such as Amazon, eBay, Yahoo!, and TripAdvisor all utilize Spark's technology, often for 'real-time' predictive analytics, profiling, and personalization. Many businesses in the financial, health services, and entertainment industries make use of Spark as well. The technology is founded on the concept of distributed datasets, which can hold Scala, Java, or Python objects.

Spark is a flexible framework that can handle both batch and real-time data processing. Spark Streaming adds streaming capabilities similar to Apache Storm, but where Spark truly shines is in its ability to process data using an efficient distributed computing model across a large number of clusters and nodes. In many ways, Spark was designed as an improvement on MapReduce. One key advantage of Spark is that it provides an excellent interface for distributed and cluster programming. Data parallelism and fault tolerance are implicit in Spark's design; these components were developed in response to inherent limitations found in MapReduce. Spark is built on the resilient distributed dataset (RDD), a read-only collection designed for distribution across clusters. The result is a much more natural way of processing data compared to other tools. Some inherent advantages of Spark are:

- Fault tolerance
- Scalability: can run on clusters with thousands of nodes
- Fast and efficient batch processing: 10x to 100x (in memory) faster than MapReduce
- Flexibility

Older technologies such as MapReduce broke data into smaller blocks for easier processing. By contrast, Spark can process an entire job in memory, allowing for better parallelism. Spark works well with many of the major programming languages used in data analysis, such as Python, Java, and Scala, Spark's native language.
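As a concrete illustration of the partitioned, in-memory model described above, here is a minimal pure-Python sketch of a distributed-style word count. It mimics the shape of a Spark map/reduce job (in PySpark this would be a `flatMap`/`map`/`reduceByKey` chain); no Spark installation is assumed, and the partition data and helper names are illustrative.

```python
from collections import Counter
from functools import reduce

# Toy "partitions": in Spark these would be slices of an RDD
# distributed across the cluster's worker nodes.
partitions = [
    ["spark is fast", "storm is fast"],
    ["spark does batch processing"],
]

def map_partition(lines):
    # Map phase: count words within one partition, analogous to
    # flatMap + map emitting (word, 1) pairs in Spark.
    return Counter(word for line in lines for word in line.split())

def merge(acc, part):
    # Reduce phase: combine per-partition counts, analogous to reduceByKey.
    acc.update(part)
    return acc

# Each partition is processed independently, then the results are merged.
word_counts = reduce(merge, (map_partition(p) for p in partitions), Counter())
print(word_counts["spark"])
```

The key point mirrored here is that each partition is processed in isolation and only small aggregates are combined, which is what lets Spark keep the whole job in memory and parallelize it.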
Spark is built to process continuous, query-style workloads. Because it was designed for ease of use and efficiency of processing, it is a market leader among streaming technologies, beating out many more specialized frameworks. Spark was built to be a generalized way of processing data, making it inherently flexible: it can be applied to anything from distributed data analytics to machine learning. Hive, Pig, Sqoop, and many other Hadoop ecosystem tools work well with Spark, and in many cases the combination of these tools can replace MapReduce completely. Given its unique approach to analytics and its synergy with many existing applications, programming languages, and frameworks, Spark is a rich addition to the Hadoop toolkit. Spark is a robust batch processing framework with the added capability of micro-batching through extensions such as Spark Streaming.

Apache Storm

Storm is a free and open source distributed real-time computation system for processing large volumes of streaming data in parallel and at scale. Unlike Hadoop and MapReduce, which process data in batches, Storm processes unbounded streams of data in real time. Being simple and usable with any programming language, many developers find Storm fun to work with. According to Hortonworks, there are five characteristics that make Storm ideal for real-time data processing workloads:

- Fast: it can process one million 100-byte messages per second per node
- Scalable: parallel calculations run across a cluster of machines
- Fault-tolerant: if a node dies, its work is restarted on another node
- Reliable: Storm guarantees that each unit of data (tuple) will be processed at least once or exactly once
- Easy to operate
Among the business opportunities and use cases where Storm can be beneficial are: real-time customer service management, data monetization, operational dashboards, cyber security analytics, threat detection, online machine learning, continuous computation, distributed RPC, ETL, and more.
Many well-known and established businesses use Apache Storm. Groupon, an online coupon company, uses Storm to build real-time data integration systems; Storm helps Groupon analyze, clean, normalize, and resolve large amounts of non-unique data points with low latency and high throughput. The Weather Channel uses many Storm topologies to ingest and persist weather data. Twitter uses Storm across a wide variety of applications, from discovery and real-time analytics to personalization, search, and revenue optimization.

Evaluation Criteria

Data Processing

Apache Spark is designed to do more than plain data processing: it can also make use of existing machine learning libraries and process graphs. Due to its high performance, Spark can be used for both batch processing and real-time processing. In a December 2014 blog post, "Storm or Spark", the developer Oliver compared the data processing capabilities of both technologies; according to him, both have overlapping capabilities as well as distinctive features and roles to play.

In 2011, BackType, a marketing intelligence company later bought by Twitter, developed Storm as a distributed computation framework for event stream processing. Twitter soon open-sourced the project and put it on GitHub; Storm ultimately moved to the Apache Incubator and became an Apache top-level project in September 2014. Often referred to as the Hadoop of real-time processing, Storm does for unbounded streams of data what Hadoop did for batch processing. Storm is written primarily in Clojure and is designed to support wiring "spouts" (input streams) and "bolts" (processing and output modules) together as a directed acyclic graph (DAG) called a topology. Storm topologies run on clusters, and based on the topology configuration, the Storm scheduler distributes work to nodes around the cluster.
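To make the spout/bolt vocabulary concrete, here is a minimal self-contained sketch in plain Python (no Storm installation assumed; the function names and sample sentences are illustrative): a spout emits tuples, and two chained bolts transform and collect them, mirroring the shape of a two-stage topology.

```python
def sentence_spout():
    # Spout: an input stream emitting sentence tuples into the topology.
    for line in ["storm processes streams", "spark processes batches"]:
        yield (line,)

def split_bolt(stream):
    # Bolt: splits each incoming sentence tuple into word tuples.
    for (line,) in stream:
        for word in line.split():
            yield (word,)

def count_bolt(stream):
    # Bolt: terminal node of the DAG, accumulates word counts.
    counts = {}
    for (word,) in stream:
        counts[word] = counts.get(word, 0) + 1
    return counts

# Wire spout -> bolt -> bolt: a two-stage directed acyclic graph.
counts = count_bolt(split_bolt(sentence_spout()))
print(counts["processes"])
```

In a real Storm topology each stage would run as many parallel tasks distributed across the cluster, with the scheduler deciding which nodes host which spouts and bolts.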
MapReduce in Hadoop and topologies in Storm do roughly the same job, except that Storm focuses on real-time, stream-based processing. Topologies run forever by default, or until manually terminated. Once a topology is started, the spouts bring data into the system and hand it off to bolts, where the main computational work is done. As processing progresses, one or more bolts may need to write data out to a database or file system, or send a message to another external system. One of the strengths of the Storm ecosystem is its rich array of available spouts, specialized for receiving data from all types of sources. Processed data in Storm is easy to integrate with HDFS file systems. And even though Storm is based on Clojure, it also supports multilanguage programming.

Spark, by contrast, started as a project of AMPLab at the University of California, Berkeley. It became a very popular project for real-time distributed computation and ultimately an Apache top-level project in February 2014. Like Storm, Spark supports stream-oriented processing, but it is more of a general-purpose distributed computing platform. Spark can run on top of an existing Hadoop cluster, relying on YARN for resource scheduling, and can be seen as a potential replacement for Hadoop's MapReduce functions. Written in Scala, Spark supports multilanguage programming, providing specific API support for Scala, Java, and Python. Spark does not have the specific abstraction of a "spout", but it includes adapters for working with data stored in numerous disparate sources, including HDFS files, Cassandra, HBase, and S3. Like Storm, Spark supports a streaming model, but this support is provided by one of several Spark modules, alongside purpose-built modules for SQL access, graph operations, and machine learning. Spark does not process stream events one at a time like Storm; instead, it slices the stream into small batches over short time intervals before processing them.
The DStream, Spark's abstraction for a continuous stream of data, is a micro-batch sequence of RDDs (Resilient Distributed Datasets). RDDs are distributed collections that can be operated on in parallel by arbitrary functions and by transformations over a sliding window of data (windowed computations).
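The DStream idea above, a stream discretized into micro-batches with windowed computations over them, can be sketched without Spark as follows. This is pure Python; Spark Streaming slices by time interval, whereas this toy version slices by count, and the batch and window sizes are illustrative assumptions.

```python
from collections import deque

def micro_batches(events, batch_size):
    # Discretize a continuous stream into small batches
    # (the micro-batch step behind a DStream).
    batch = []
    for e in events:
        batch.append(e)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch

def windowed_sums(batches, window_len):
    # Sliding-window computation: aggregate over the most
    # recent `window_len` micro-batches.
    window = deque(maxlen=window_len)
    for batch in batches:
        window.append(batch)
        yield sum(x for b in window for x in b)

stream = [1, 2, 3, 4, 5, 6]
sums = list(windowed_sums(micro_batches(stream, batch_size=2), window_len=2))
print(sums)
```

Each output value covers the current micro-batch plus the previous one, which is the essence of a windowed computation over a discretized stream.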
Latency

Traditional data warehousing environments were expensive and had high latency in their batch operations. As a result, organizations were not able to embrace the power of real-time business intelligence and big data analytics. Today, powerful open-source tools like Spark and Storm have emerged to overcome this challenge. Spark is a parallel open-source processing framework whose workflows are designed along the lines of Hadoop MapReduce but are comparatively more efficient. One benefit Apache Spark has is that it does not depend on Hadoop YARN to function; it has its own streaming API and independent processes for continuous batch processing across short time intervals. Spark runs up to 100 times faster than Hadoop in certain situations, but it does not have its own distributed storage system. When talking about Spark's performance for data processing, we cannot ignore Hadoop MapReduce, because the two technologies often work hand in hand. Where Spark processes data in memory, Hadoop MapReduce persists data back to disk after each map or reduce action, so MapReduce lags behind Spark. On the other hand, Spark requires a large amount of memory, since it loads processes into memory and keeps them cached. Spark streams events in small batches that arrive in short time windows before it processes them, whereas Storm processes events one at a time. Thus, Spark has a latency of a few seconds, whereas Storm processes an event with millisecond latency.

Reliability

Spark and its RDD abstraction are designed to seamlessly handle failures of any worker node in the cluster. Spark Streaming, being built on Spark, enjoys the same fault tolerance for worker nodes, but applications fed by sources like Kafka and Flume must also recover from failure of the driver process. For sources like files, recomputation on every micro-batch of data was sufficient to ensure zero data loss
because the data is stored in a fault-tolerant file system like HDFS. However, for other sources like Kafka and Flume, some received data that was buffered in memory but not yet processed could be lost. To avoid this data loss, a write-ahead log system was introduced in the Spark Streaming 1.2 release.

Storm guarantees that every spout tuple will be fully processed by the topology. It does this by tracking the tree of tuples triggered by each spout tuple and determining when that tree has been successfully completed. Every topology has a "message timeout" associated with it; if Storm fails to detect that a spout tuple has completed within that timeout, it fails the tuple and replays it later. Two things need to be done to benefit from Storm's reliability capabilities: first, the application must tell Storm whenever a new link in the tree of tuples is created, and second, it must tell Storm when an individual tuple has finished processing. By doing both, Storm can detect when the tree of tuples is fully processed and can acknowledge or fail the spout tuple appropriately.

Ease of Use

Both Apache Spark and Apache Storm were built on a founding principle of user accessibility. As such, both platforms are scalable, fault-tolerant, and support a wide array of programming languages and host platforms such as Apache Hadoop. However, there are some differences. For example, when developing for Spark, one must keep its base language, Scala, in mind. Scala is a JVM-based language designed to improve upon the shortcomings of Java. While Scala has been praised for its various features, such as scalability (hence the name), it is not as well known as Java, which can make things difficult for new developers unfamiliar with the language. Furthermore, Spark's tuples are also based on Scala, and while they can be used from Java, this is known to be a difficult process.
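The message-timeout and acknowledgement mechanism described under Reliability above can be sketched in a few lines of plain Python. This is a conceptual sketch, not Storm's API: `ReliableSpout`, its `timeout` parameter, and the manual clock are illustrative assumptions used to show at-least-once delivery.

```python
class ReliableSpout:
    """Toy at-least-once source: emitted tuples stay pending until acked;
    tuples not acked within `timeout` seconds are replayed (Storm-style)."""

    def __init__(self, records, timeout=30.0):
        self.queue = list(records)
        self.pending = {}          # msg_id -> (record, emit_time)
        self.timeout = timeout
        self.next_id = 0

    def emit(self, now):
        # Replay any tuple whose ack deadline has passed.
        for msg_id, (rec, t) in list(self.pending.items()):
            if now - t > self.timeout:
                self.pending[msg_id] = (rec, now)
                return msg_id, rec
        if self.queue:
            rec = self.queue.pop(0)
            msg_id = self.next_id
            self.next_id += 1
            self.pending[msg_id] = (rec, now)
            return msg_id, rec
        return None

    def ack(self, msg_id):
        # Downstream finished processing: the tuple leaves the pending set.
        self.pending.pop(msg_id, None)

spout = ReliableSpout(["a", "b"], timeout=5.0)
m0 = spout.emit(now=0.0)         # "a" emitted with id 0
m1 = spout.emit(now=1.0)         # "b" emitted with id 1
spout.ack(m1[0])                 # "b" fully processed, acknowledged
replayed = spout.emit(now=10.0)  # "a" was never acked, so it is replayed
print(replayed)
```

Storm's real implementation tracks an entire tree of tuples per spout tuple rather than a single record, but the ack-or-replay-on-timeout logic is the same in spirit.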
There are aspects of Storm that have made it a favorite among some developers. Storm supports more programming languages than Spark: in addition to Java, Scala, and Python, core Storm also supports Clojure, Ruby, and many others. Because of this expanded support, Storm's tuples can be much easier for programmers to manage. Storm uses directed acyclic graphs (DAGs), allowing a very natural flow of data through its tuples; each node plays a role in transforming the data as it moves through the DAG. However, this can make compiling take a while, and it sometimes leads developers to skip important safety checks.

Conversely, many features of Spark are much easier to implement than on other platforms. Spark Streaming offers a language-integrated API, which lets developers write real-time jobs in much the same manner as batch jobs. Many developers also make great use of Spark's interactive mode, which eases the programming process. While many developers prefer Storm, those who are comfortable with interactive mode will choose Spark, as nothing comparable is available for Storm at this time. Thus, even with the aforementioned challenges, Spark remains one of the simpler solutions to the complex problem of real-time data.

Application of Use

Apache Spark and Storm are both powerful business intelligence platforms with a wide array of uses. Both were developed for processing real-time data; however, the nuances of their application are many. For example, Storm's approach to real-time data makes it great for real-time log monitoring. Twitter originally developed Storm in 2011 for use in analytics. At Twitter, Storm treats clusters of tweets as a batch and, working with other tools such as Cassandra, analyzes each batch for data such as geographic location and user demographics, and determines whether a tweet is spam. Other companies such as Spotify use Storm's real-time strengths to recommend music based on listening history or to target advertisements to specific demographics.
According to Spotify, "Storm enables us to build low-latency fault-tolerant distributed systems with ease." Many industry professionals see Spark as a more general-purpose platform.
While Storm excels at analyzing user, log, and stream data in real time, Spark is more of a real-time improvement on MapReduce. Being a much newer technology, Apache Spark is far less adopted in industry and there are fewer working use cases. Spark shines when it runs on top of an existing Hadoop cluster, is layered on other technology such as Mesos, or supports multiple processing paradigms. And while Spark has a specialized streaming application in Spark Streaming, it is not yet as well supported as Storm. Spark is also not designed to handle a multi-user environment; all users have to apportion data appropriately. If a project has many users, Storm may be the more appropriate choice.

This does not mean that Spark is unused. On the contrary, Spark has seen rapid adoption since its introduction to the Apache toolkit in 2014. In addition to streaming, Spark shows great potential in the areas of machine learning, interactive analysis, and fog computing. Uber uses Spark to convert unstructured data into structured data, and Netflix uses Spark in a manner similar to how Spotify uses Storm: to create real-time user prediction models and recommend movies based on user history. Overall, while Storm is the older and more tested technology, there is a wide range of use cases where either Spark or Storm is appropriate. The decision between them depends heavily on the user base, the organizational structure, and what users wish to accomplish in their data processing.

Conclusions

Visual Review of Data

The following table contains a simplified review of the differences between Spark and Storm. It is intended to help establish appropriate use cases for each technology and to help potential users decide between one or both.
1) Data processing

Storm:
- Mainly streaming, but also micro-batching (Trident API)
- Stream primitives: tuples, spouts (input streams), bolts (processing and output modules), topologies
- A topology is a directed acyclic graph (DAG)
- Runs in a continuous loop unless aborted
- Strong point: rich array of available spouts for collecting data
- Uses Apache Zookeeper and supervisor worker processes to coordinate topologies
- Based on Clojure, but supports many languages such as Java, Scala, Python, Ruby, and more

Spark:
- Mainly batch processing, but also micro-batching
- Uses the DStream, an abstraction representing a continuous stream of data as a micro-batch sequence of RDDs (Resilient Distributed Datasets)
- RDDs operated on in parallel
- Windowed computations
- Depends on Hadoop HDFS to store data after computation
- APIs for Java, Scala, and Python only

2) Speed/Performance

Storm:
- Fast: a benchmark clocked it at over a million tuples processed per second per node
- Claimed to be 4 to 5 times slower than Spark on a word count benchmark
- Limited performance per server by stream processing standards

Spark:
- Much faster than Hadoop MapReduce (up to 100 times)
- Processes data in memory and requires a large amount of memory
- Faster than Storm, but with limited per-server performance

3) Reliability

Storm:
- Guarantees that the tree of tuples will be fully processed
- Uses a timeout system: every tuple is time-stamped, and an acknowledgement system detects failures

Spark:
- Processes data in memory, hence volatile
- Uses a log system to track data processing and avoid data loss

4) Ease of Use

Storm:
- Built on Clojure but flexible enough to work with many other languages, including Java, Scala, and Python, making it very easy to work with
- Data processing flow is natural and intuitive (a spout collects the stream of data and bolts process it through to HDFS)
- Intermediate- to advanced-level programmers can utilize this technology

Spark:
- Built on Scala, with Java and Python APIs only
- Requires Java or Scala expertise to utilize the technology, unless a GUI is used
- The GUI is very beginner-friendly and easy to learn for those new to programming

5) Application of use

Storm:
- Examples: Twitter, Groupon, Google
- Strong real-time data processing capabilities
- Data streaming and analytics
- Event log monitoring, record keeping, filtering, lookup

Spark:
- Examples: Uber, Netflix, Pinterest
- Batching and micro-batching
- Distributed computing
- Machine learning
- Event detection
- Cluster computing
- Data analytics

Final Overview

Spark and Storm are both very capable real-time analytics technologies. Both frameworks are built on a streaming architecture, meaning there is an inbound, infinite stream of tuples. In this report, Spark and Storm have been evaluated on the basis of data processing, speed, reliability, ease of use, and application of use. In Storm nomenclature, finite-size tuples are processed through spouts and bolts. In Spark nomenclature, the stream is called a DStream, or discretized stream: the Spark Streaming engine discretizes the stream into finite-size RDDs, essentially micro-batches, in order to process messages. Both frameworks allow stateful processing, meaning that if a worker node or driver node loses data, it can be recovered from the saved state. The Resilient Distributed Dataset (RDD) is the cornerstone of Spark: a finite-size, immutable data structure that can be replayed as needed and is replicated over HDFS or another persistent file system.

Since both technologies have varying strengths and weaknesses, it is hard to recommend one over the other in a general sense. However, Spark and Storm cover a large application domain within which specific uses may be appropriate. In general, industry
professionals agree that Storm is, at this point, much more useful as a real-time streaming technology, even though Spark Streaming is growing by leaps and bounds. Storm has a specialized toolset that gives it strong stream processing capabilities, while Spark is very generalized and has developed in such a way that it may replace MapReduce in the future. Although the use cases for Spark and Storm overlap somewhat, the platforms are very distinct. If one wants to use stream-based technology for analytics, and is comfortable with programming, Storm is the better choice. If one wants cluster computing or batching technology, or is less comfortable with programming, then Spark is the better choice. Overall, both Spark and Storm are fantastic ways of achieving lightning-fast, reliable, and easy-to-learn data processing.