Spark and Storm: Performance and Use Case Comparison
Jessica Morris, Partap Singh
May 2, 2016
INFO 491: Big Data Analytics
EXECUTIVE SUMMARY
INTRODUCTION
Apache Spark
Apache Storm
EVALUATION CRITERIA
Data Processing
Latency
Reliability
Ease of Use
Application of Use
CONCLUSIONS
Visual Review of Data
Final Overview
Executive Summary
The purpose of this paper is to explore the differences between the Hadoop
toolkit technologies Apache Spark and Apache Storm. We draw on various online
sources for this comparison in order to understand the real-world usage of these
technologies. The report concludes that Apache Spark and Apache Storm have very
different use cases, and that their strengths and weaknesses should be carefully
considered when deciding between the two in an overlapping use case. In general,
Apache Storm is stronger at real-time stream processing, while Apache Spark is a very
strong and fast batch processing technology. When the situation calls for real-time
streaming, our recommendation is to use Apache Storm. If the situation is more
generalized or calls for high-latency batch processing, Apache Spark is more appropriate.
Unlike Hadoop, both Storm and Spark are new and evolving technologies in the
business world, so a limited amount of information was available at the time this paper
was written. This paper is written with readers in mind who are new to these
technologies, and a simple, holistic approach has been used to compare them.
This paper is not meant to be a reference for coding or any programming language.
Introduction
In the world of computing, the necessity of big data processing is more visible
now than ever. Technologies such as Apache Hadoop provide frameworks for many
types of big data computing. Two technologies that exist within this framework are
Spark and Storm. Both of these tools can analyze big data in real time. Real-time data
processing used to require expensive batch operations with high latencies. As
technologies continue to grow, pushing both the physical and logical limits of
computing, the demand for alternative ways of analyzing big data has risen. Both Spark
and Storm use distributed processing and real-time processing to improve upon the
implementation of business intelligence analytics, meeting these demands.
Using the following benchmarks, this paper will analyze the use cases and
strengths of Spark and Storm.
Data Processing
Latency
Reliability
Ease of Use
Application of Use
It should be noted that while Spark and Storm share some use cases and have an
overlap of technologies used, they both also have a unique set of strengths and
weaknesses. This means that there is no clear overall ‘winner’, only certain applications
in which Spark may be more appropriate than Storm or vice versa. In many cases, the
difference boils down to developer or analyst preference.
Both Spark and Storm are essential parts of the Apache Hadoop toolkit.
Pioneering technologies like Spark and Storm will pave the way for the future of big
data analytics. Distributed computing and real-time data processing are both
technologies with great potential and are very much worth discussing.
Apache Spark
Spark’s official website describes Apache Spark as a “fast and general engine for
large-scale data processing.” Businesses such as Amazon, eBay, Yahoo!, and TripAdvisor
all utilize Spark’s technology. This usage is often for ‘real time’ predictive analytics,
profiling and personalization. Many businesses in the financial, health services, and
entertainment industries make use of Spark as well.
The technology is founded on the concept of distributed datasets, which use
Scala, Java, or Python objects. Spark is a flexible framework that can handle both batch
and real-time data processing. Spark Streaming adds streaming capabilities similar to
Apache Storm as well, but where Spark truly shines is in its ability to process data files
using an efficient distributed computing process across a large number of clusters and
nodes.
In many ways, Spark was designed as an improvement on MapReduce. One key
advantage of Spark is that it provides an excellent interface for distributed and cluster
programming. Data parallelism and fault tolerance are implicit in Spark's design.
These components were developed as a response to inherent limitations found in
MapReduce. Spark runs on a resilient distributed dataset (RDD), which is read only and
designed for distribution across clusters. The result is a much more natural way of
processing data compared to other tools. Some inherent advantages to Spark are:
Fault tolerance
Scalability: can run on clusters with thousands of nodes
Speedy and efficient batch processing: 10x-100x faster than MapReduce (in memory)
Flexibility
Older technologies such as MapReduce broke up data into smaller blocks for
easier processing. By contrast, Spark can process the entire job in memory, allowing
for greater parallelism. Spark works well with many of the major programming
languages used in data analysis, such as Python, Java, and Scala, Spark's native
programming language.
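As a rough illustration of this in-memory, partitioned processing model, the sketch below mimics the map-and-reduce style of an RDD in plain Python. It is a conceptual sketch only, not the actual Spark API; the partition layout and function names are our own.

```python
from functools import reduce

# A toy "distributed dataset": a list of partitions, each held in memory.
partitions = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]

# Map step: apply a function to every element of every partition
# (in Spark, each partition would be processed on a different node).
squared = [[x * x for x in part] for part in partitions]

# Reduce step: combine partial results per partition, then merge them.
partial_sums = [sum(part) for part in squared]
total = reduce(lambda a, b: a + b, partial_sums)

print(total)  # sum of squares 1..9 = 285
```

Because each partition is transformed independently, the map step parallelizes naturally across nodes, which is the property Spark exploits.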
Spark is built to process continuous query style workloads. Because Spark was
designed for ease of use and efficiency of processing, it is a market leader for streaming
technologies, beating out many other more specialized frameworks. Spark was built to
be a generalized way of processing data, making it inherently flexible – it can be applied
to anything from distributed data analytics to machine learning.
Hive, Pig, Sqoop, and many other Hadoop ecosystem tools work well with Spark.
In many cases the combination of these tools can replace MapReduce completely. Given
its unique approach to analytics and its synergy with many existing applications,
programming languages, and frameworks, Spark is a rich addition to the Hadoop toolkit.
Spark is a robust batch processing framework with the added capability of micro-
batching through extensions such as Spark Streaming.
Apache Storm
Storm is a free and open-source distributed real-time computation system for
processing large volumes of streaming data in parallel and at scale. Unlike
Hadoop and MapReduce, which process data in batches, Storm processes unbounded
streams of data in real time. Because it is simple and easy to use with any programming
language, many developers find Storm fun to use.
According to Hortonworks, several characteristics make Storm ideal
for real-time data processing workloads:
Fast: it can process one million 100-byte messages per second per node
Scalable: parallel calculations run across a cluster of machines
Fault-tolerant: if a node dies, its work will be restarted on another node
Reliable: Storm guarantees that each unit of data (tuple) will be processed at
least once or exactly once
Business opportunities and use cases where Storm can be beneficial include
real-time customer service management, data monetization, operational
dashboards, cyber security analytics, threat detection, online machine learning,
continuous computation, distributed RPC, ETL, and more.
Many well-known and established businesses use Apache Storm. Groupon, an
online coupon-based company, uses Storm to build real-time data integration systems;
Storm helps Groupon analyze, clean, normalize, and resolve large amounts of
non-unique data points with low latency and high throughput. The Weather Channel
uses many Storm topologies to ingest and persist weather data. Twitter uses Storm in a
wide variety of applications, from discovery and real-time analytics to personalization,
search, revenue optimization, and many more.
Evaluation Criteria
Data Processing
Apache Spark is designed to do more than plain data processing. It actually can
make use of existing machine learning libraries and process graphs. Due to its high
performance, Apache Spark can be used for both batch processing and real-time
processing. In a December 2014 blog post, "Storm or Spark," the developer Oliver
compared the data processing capabilities of both technologies. According to him, both
have overlapping capabilities as well as distinctive features and roles to play.
In 2011, BackType, a marketing intelligence company bought by Twitter, brought
Storm along as a distributed computation framework for event stream processing.
Twitter soon open-sourced the project and put it on GitHub, but Storm ultimately
moved to the Apache Incubator and became an Apache top-level project in September
2014. Often referred to as the Hadoop of real-time processing, Storm aims to do for
unbounded streams of data and real-time processing what Hadoop did for batch
processing.
Storm is written primarily in Clojure and is designed to support wiring "spouts"
(input streams) and "bolts" (processing and output modules) together as a directed
acyclic graph (DAG) called a topology. Storm topologies run on clusters, and based on
the topology configuration, the Storm scheduler distributes work to nodes around the
cluster. MapReduce in Hadoop and topologies in Storm do roughly the same job, except
that Storm focuses on real-time, stream-based processing. Topologies run forever by
default, or until manually terminated. Once a topology is started, the spouts bring data
into the system and hand it off to bolts, where the main computational work is done. As
processing progresses, one or more bolts may need to write data out to a database
or file system, or send a message to another external system.
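To make the spout-and-bolt flow concrete, here is a minimal conceptual sketch in plain Python of a word-count topology. This is our own illustration of the dataflow idea, not the Storm API; real topologies are wired up through Storm's builder classes and run distributed on a cluster.

```python
from collections import Counter

def sentence_spout():
    """Spout: emits a stream of tuples into the topology."""
    for line in ["the cat sat", "the dog sat"]:
        yield (line,)

def split_bolt(stream):
    """Bolt: splits each sentence tuple into word tuples."""
    for (sentence,) in stream:
        for word in sentence.split():
            yield (word,)

def count_bolt(stream):
    """Terminal bolt: accumulates a running word count."""
    counts = Counter()
    for (word,) in stream:
        counts[word] += 1
    return counts

# Wire spout -> bolt -> bolt, mirroring a simple linear topology DAG.
counts = count_bolt(split_bolt(sentence_spout()))
print(counts["the"])  # 2
```

In a real Storm cluster, each bolt would run as many parallel tasks, with the scheduler deciding which nodes host them.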
One of the strengths of the Storm ecosystem is its rich array of available spouts,
which are specialized for receiving data from all types of sources. Data processed in
Storm is easy to integrate with HDFS file systems. Even though Storm is based on
Clojure, it also supports multi-language programming beyond the JVM.
Spark, meanwhile, started out as a project of AMPLab at the University of
California, Berkeley. It became a very popular project for real-time distributed
computation and ultimately became an Apache top-level project in February 2014. Like
Storm, Spark supports stream-oriented processing, but it is more of a general-purpose
distributed computing platform. While Spark can run on top of an existing Hadoop
cluster, relying on YARN for resource scheduling, it can be seen as a potential
replacement for the MapReduce functions of Hadoop. Written in Scala, Spark supports
multi-language programming, though it provides specific API support only for Scala,
Java, and Python. Spark does not have the specific abstraction of a "spout," but it
includes adapters for working with data stored in numerous disparate sources, including
HDFS files, Cassandra, HBase, and S3. Like Storm, Spark supports a streaming model,
but this support is provided by only one of several Spark modules, which also include
purpose-built modules for SQL access, graph operations, and machine learning.
Spark does not process streams one event at a time like Storm. Instead, it slices
them into small batches over time intervals before processing them. The DStream,
Spark's abstraction for a continuous stream of data, is a sequence of micro-batches of
RDDs (Resilient Distributed Datasets): distributed collections that can be operated on in
parallel by arbitrary functions and by transformations over a sliding window of data
(windowed computations).
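The micro-batching idea can be illustrated with a small sketch in plain Python (our own simplified model, not Spark's DStream API): events carry timestamps, the stream is sliced into fixed time intervals, and each slice is then processed as one batch.

```python
# Each event is (timestamp_seconds, value); the interval is the batch size.
events = [(0.2, 1), (0.7, 2), (1.1, 3), (1.9, 4), (2.4, 5)]
BATCH_INTERVAL = 1.0  # seconds, analogous to a DStream's batch duration

def discretize(stream, interval):
    """Slice a timestamped stream into micro-batches per time interval."""
    batches = {}
    for ts, value in stream:
        batches.setdefault(int(ts // interval), []).append(value)
    return [batches[k] for k in sorted(batches)]

micro_batches = discretize(events, BATCH_INTERVAL)
print(micro_batches)  # [[1, 2], [3, 4], [5]]

# Each micro-batch is then processed as a unit, e.g. one sum per interval.
batch_sums = [sum(b) for b in micro_batches]
print(batch_sums)     # [3, 7, 5]
```

This is why Spark Streaming's latency is bounded below by the batch interval: no event can be processed before its interval closes.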
Latency
Traditional data warehousing environments were expensive and had high latency
for batch operations. As a result, organizations were not able to embrace the power of
real-time business intelligence and big data analytics. Today, many powerful
open-source tools like Spark and Storm have emerged to overcome this challenge. Spark
is a parallel open-source processing framework whose workflows are designed in the
style of Hadoop MapReduce, but are comparatively more efficient. One benefit of
Apache Spark is that it does not depend on Hadoop YARN to function; it has its own
streaming API and independent processes for continuous micro-batch processing across
short time intervals. Spark runs up to 100 times faster than Hadoop in certain
situations, but Spark does not have its own distributed storage system.
When talking about Spark's performance for data processing, we cannot ignore
Hadoop MapReduce, because the two technologies often work hand in hand. Where
Spark processes data in memory, Hadoop MapReduce persists data back to disk after a
map action or a reduce action, so MapReduce lags behind when compared to Spark. In
exchange, Spark requires a large amount of memory, as it loads processes into memory
and keeps data cached. Spark streams events in small batches that arrive in short time
windows before processing them, whereas Storm processes events one at a time. Thus,
Spark has a latency of a few seconds, whereas Storm processes an event with just
millisecond latency.
Reliability
Spark and its RDD abstraction are designed to seamlessly handle failures of any
worker node in the cluster. Spark Streaming is built on Spark and so enjoys the same
fault tolerance for worker nodes, but applications reading from sources like Kafka and
Flume also have to recover from failures of the driver process. For sources like files,
recomputing each micro-batch of data was sufficient to ensure zero data loss, because
the data is stored in a fault-tolerant file system like HDFS. However, for other sources
like Kafka and Flume, some received data that was buffered in memory but not yet
processed could be lost. To avoid this data loss, a write-ahead log system was
introduced in the Spark Streaming 1.2 release.
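The write-ahead log idea itself is simple: append each received record to durable storage before processing it, so that anything buffered but not yet processed can be replayed after a crash. Below is a minimal sketch in plain Python of that principle (our own illustration, using a local file in place of the fault-tolerant storage such as HDFS that Spark Streaming would use).

```python
import os
import tempfile

log_path = os.path.join(tempfile.mkdtemp(), "receiver.wal")

def receive(record):
    """Append the record to the write-ahead log before buffering it."""
    with open(log_path, "a") as log:
        log.write(record + "\n")

def recover():
    """After a driver failure, replay everything in the log."""
    with open(log_path) as log:
        return [line.rstrip("\n") for line in log]

for rec in ["event-1", "event-2", "event-3"]:
    receive(rec)

# Simulate a crash: any in-memory buffer is gone, but the log survives,
# so the unprocessed records can be replayed.
print(recover())  # ['event-1', 'event-2', 'event-3']
```

The cost of this durability is an extra write per record, which is why the log is optional and aimed at sources like Kafka and Flume that buffer in memory.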
Storm guarantees that every spout tuple will be fully processed by the topology.
It does so by tracking the tree of tuples triggered by each spout tuple and determining
when that tree has been successfully completed. Every topology has a "message
timeout" associated with it; if Storm fails to detect that a spout tuple's tree has been
completed within that timeout, it fails the tuple and replays it later. To benefit from
Storm's reliability capabilities, an application must do two things: first, tell Storm
whenever a new link in the tree of tuples is created, and second, tell Storm when each
individual tuple has finished processing. By doing both of these things, Storm can
detect when the tree of tuples is fully processed and can acknowledge or fail the spout
tuple appropriately.
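Storm tracks each tuple tree very cheaply: tuples are tagged with random 64-bit ids, and an "acker" task XORs ids together as links are created and again as tuples are acked, so the running value returns to zero exactly when the whole tree is done. The sketch below, in plain Python, is our own simplified illustration of that idea, not Storm's internal code.

```python
import random

class Acker:
    """Tracks one spout tuple's tree via a running XOR of tuple ids."""
    def __init__(self):
        self.checksum = 0

    def anchor(self, tuple_id):
        # A new link in the tree: XOR its id in.
        self.checksum ^= tuple_id

    def ack(self, tuple_id):
        # That tuple finished processing: XOR the same id out again.
        self.checksum ^= tuple_id

    def tree_complete(self):
        # Every id XORed in twice cancels out, leaving zero.
        return self.checksum == 0

acker = Acker()
ids = [random.getrandbits(64) for _ in range(3)]
for i in ids:
    acker.anchor(i)   # spout and bolts emit three tuples
for i in ids:
    acker.ack(i)      # each tuple is acked as it completes
print(acker.tree_complete())  # True
```

Keeping only a single 64-bit checksum per spout tuple, regardless of how large the tree grows, is what makes this reliability guarantee affordable at Storm's throughput.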
Ease of Use
Both Apache Spark and Apache Storm were built on a founding principle of user
accessibility. As such, both platforms are scalable and fault-tolerant, and both support a
wide array of programming languages and platforms such as Apache Hadoop.
However, there are some differences.
For example, when developing for Spark, one must keep its base language, Scala,
in mind. Scala is a JVM-based language designed to improve upon the shortcomings of
Java. While Scala has been praised for its various features and adjustments, such as
scalability (hence the name Scala), it is not as well known as Java. This can make it
difficult for new developers who are not familiar with the language. Furthermore,
Spark's tuples are also based on Scala, and while they can be implemented in Java, this
is known to be a difficult process.
There are aspects of Storm that have made it a favorite among some developers.
Storm supports more programming languages than Spark: in addition to Java, Scala, and
Python, core Storm also supports Clojure, Ruby, and many others. Because of this
expanded support, Storm's tuples can be much easier for programmers to manage.
Storm uses DAGs (directed acyclic graphs), allowing for a very natural flow of data
through Storm's tuples; each individual node plays a role in transforming the data as it
moves through the DAG. However, this can cause compiling to take a while, and
sometimes leads developers to skip important safety checks.
Conversely, there are many features of Spark that are much easier to implement
than on other platforms. Spark Streaming provides a language-integrated API, which
allows developers to write real-time jobs in a manner very similar to batch jobs.
Many developers also make great use of Spark's interactive mode, which eases the
programming process. While many developers prefer Storm, developers who are
comfortable with an interactive mode will choose Spark, as nothing comparable is
available for Storm at this time. Thus, even with the aforementioned challenges, Spark
remains one of the simpler solutions to the complex problem of real-time data.
Application of Use
Apache Spark and Storm are both powerful business intelligence platforms with
a wide array of uses. Both platforms were developed for processing real-time data;
however, the nuances of their application are many.
For example, since Storm processes real-time data tuple by tuple, it is great for
real-time log monitoring. Storm came to Twitter in 2011 for use in analytics. At Twitter,
Storm treats clusters of tweets as a batch and, working with other tools such as
Cassandra, analyzes each batch of tweets for data such as geographic location and user
demographics, and determines whether a tweet is spam. Other companies such as
Spotify make use of Storm's real-time strengths to recommend music to users based on
listening history, or to target advertisements to specific demographics. According to
Spotify: "Storm enables us to build low-latency fault-tolerant distributed systems with
ease."
Many industry professionals see Spark as a more general-purpose platform.
While Storm excels at analyzing user, log, and stream data in real time, Spark is more of
a real-time improvement on MapReduce. Being a much newer technology, Apache
Spark is far less adopted in industry, and there are fewer working use cases. Spark
shines when it runs on top of an existing Hadoop cluster, layered on top of other
technology such as Mesos, or supporting multiple processing paradigms. While Spark
has a specialized streaming application in Spark Streaming, it is not yet as well
supported as Storm. Spark is not designed to handle a multi-user environment, and all
users have to apportion data appropriately. If a project has many users, Storm may be a
more appropriate choice.
This does not mean that Spark is unused. On the contrary, Spark has seen rapid
adoption since its introduction to the Apache toolkit in 2014. In addition to streaming,
Spark shows great potential in the areas of machine learning, interactive analysis, and
fog computing. Uber uses Spark to convert unstructured data into structured data, and
Netflix uses Spark in a manner similar to how Spotify uses Storm: to create real-time
user prediction models and recommend movies based on user history.
Overall, while Storm is the older and more tested technology, there is a wide
range of use cases where either Spark or Storm are appropriate. Furthermore, the
decision on whether to use Spark or Storm for a task highly depends on the user base,
organizational structure, and what users wish to accomplish in their data processing.
Conclusions
Visual Review of Data
The following table contains a simplified review of the differences between
Spark and Storm. This table is intended to help establish appropriate use cases for Spark
and Storm as well as help potential users decide between one or both technologies.
13. 13
1) Data processing

Storm:
Mainly streaming, but does micro-batching as well (Trident API)
Stream primitive is the tuple
Spouts (input streams), bolts (processing and output modules), topologies
Topology is a directed acyclic graph (DAG)
Runs in a continuous loop unless aborted
Strong point: rich array of available spouts to collect data
Uses Apache Zookeeper and worker processes to coordinate topologies
Based on Clojure, but supports many languages like Java, Scala, Python, Ruby, and more

Spark:
Mainly batch processing, but also does micro-batching
Uses the DStream, an abstraction for a continuous stream of data as a micro-batch of
RDDs (Resilient Distributed Datasets)
RDDs are operated on in parallel
Windowed computations
Hadoop HDFS dependent (to store data after computation)
Based on Scala, with support for Java and Python only

2) Speed/Performance

Storm:
Fast: a benchmark clocked it at over a million tuples processed per second per node
Claimed to be 4 to 5 times slower than Spark on a word count benchmark
Limited performance per server by stream processing standards

Spark:
Much faster than Hadoop MapReduce (100 times)
Processes data in memory and requires a large amount of memory
Faster than Storm, but with limited performance per server

3) Reliability

Storm:
Guarantees that the tree of tuples will be fully processed
Uses a timeout system: every tuple is time-stamped, and an acknowledgement system
is used to detect failures

Spark:
Processes data in memory, which is volatile
Uses a log system to keep track of data processing and avoid data loss

4) Ease of use

Storm:
Built on Clojure, but flexible enough to work with many other languages, including
Java, Scala, and Python, making it very easy to work with
Data processing flow is natural and intuitive (a spout collects the stream of data and a
bolt processes it into HDFS)
An intermediate to advanced level programmer can utilize this technology

Spark:
Built on Scala, with Java and Python only
Needs experts in the Java and Scala languages to utilize the technology, unless the GUI
is used
The GUI is very beginner friendly and easy to learn for those who are new to
programming

5) Application of use

Storm:
Examples: Twitter, Groupon, Google
Strong real-time data processing capabilities
Data streaming and analytics
Event log monitoring, record keeping, filtering, lookup

Spark:
Examples: Uber, Netflix, Pinterest
Batching and micro-batching
Distributed computing
Machine learning
Event detection
Cluster computing
Data analytics
Final Overview
Spark and Storm are very capable technologies for real-time analytics. Both
frameworks are based on a streaming architecture, meaning there is an inbound,
infinite list of tuples. In this report, Spark and Storm have been evaluated on the basis
of data processing, speed, reliability, ease of use, and application of use. In Storm
nomenclature, finite-size tuples are processed through spouts and bolts. In Spark
nomenclature, the stream is called a DStream, or discretized stream. In the Spark
Streaming framework, the data processing engine essentially discretizes the stream
into finite-size RDDs, which are in effect micro-batches of messages to process. Both
frameworks allow stateful processing: if a worker node or driver node loses data, it can
be recovered from the saved state. The Resilient Distributed Dataset (RDD) is the
cornerstone of Spark; it is a finite-sized, immutable data structure which can be
replayed as needed and replicated over HDFS or another persistent file system.
Since both technologies have varying strengths and weaknesses, it is hard to
recommend one over the other in a general sense. However, Spark and Storm each
have a large application domain for which specific uses may be appropriate. In general,
industry professionals agree that Storm is much more useful as a real-time streaming
technology at this point, even though Spark Streaming is growing by leaps and bounds.
Storm has a specialized toolset that gives it strong stream processing capabilities, while
Spark is very generalized and has developed in such a way that it may replace
MapReduce in the future. Although the use cases for Spark and Storm may have some
overlap, the platforms are very distinct. If one desires to use stream-based technology
for analytic purposes, and is comfortable with programming, Storm is the better choice.
If one would like to use cluster computing or batching technology, or is uncomfortable
with programming, then Spark is the better choice. Overall, both Spark and Storm are
fantastic ways of achieving lightning-fast, reliable, and easy-to-learn data processing.