2015 © Trivadis
BASEL BERN BRUGG LAUSANNE ZÜRICH DÜSSELDORF FRANKFURT A.M. FREIBURG I.BR. HAMBURG MUNICH STUTTGART VIENNA
2014 © Trivadis
Apache Storm vs. Spark Streaming –
Two Stream Processing Platforms compared
June 2015
Guido Schmutz
July 2015
Apache Storm vs. Spark Streaming - Two Stream Processing Platforms compared
Guido Schmutz
§  Working for Trivadis for more than 18 years
§  Oracle ACE Director for Fusion Middleware and SOA
§  Co-Author of different books
§  Consultant, Trainer, Software Architect for Java, Oracle, SOA and
Big Data / Fast Data
§  Member of Trivadis Architecture Board
§  Technology Manager @ Trivadis
§  More than 25 years of software development
experience
§  Contact: guido.schmutz@trivadis.com
§  Blog: http://guidoschmutz.wordpress.com
§  Twitter: gschmutz
Trivadis is a market leader in IT consulting, system integration,
solution engineering and the provision of IT services focusing
on and technologies in Switzerland,
Germany and Austria.
We offer our services in the following strategic business fields:
Trivadis Services takes over the interacting operation of your IT systems.
Agenda
1.  Introduction / Motivation
2.  Apache Storm
3.  Apache Spark (Streaming)
4.  Stream Processing in the Architecture
What is Stream Processing?
Infrastructure for continuous data processing
Computational model can be as general as MapReduce but with the ability
to produce low-latency results
Data collected continuously is naturally processed continuously
Also known as Event Processing / Complex Event Processing (CEP)
Trivadis Stream Processing Demo System
Use Hashtag #JFS2015 plus #storm and/or #spark
How to design a Stream Processing System?
[Diagram: three designs. (1) Event Stream → Collecting → Queue (Persist) → Processing → result. (2) Event Stream → Collecting → Processing → result. (3) Event Stream → combined Collecting/Processing → result.]
How to scale a Stream Processing System?
[Diagram: scaling with threads. Collecting threads 1 … n read the event stream and write events to a single persisted queue; processing threads 1 … n read from the queue and emit results.]
How to scale a Stream Processing System?
[Diagram: scaling with several queues. Collecting threads write events from the event stream into Queue 1 … Queue n (persisted); a separate processing process reads from each queue and emits results.]
How to scale a Stream Processing System?
[Diagram: scaling with processes and threads. Collecting processes 1 and 2 read the event stream; processing stages A and B run as threads 1 … n inside processes 1 and 2, each thread with its own queue (Q1 … Qn).]
How to make a (stateful) Stream Processing System
reliable?
Faults and stragglers are inevitable in large clusters running big data
applications
Streaming applications must recover from them quickly
[Diagram: the scaled-out pipeline (collecting processes, processing stages A and B with per-thread queues), shown twice.]
How to make a (stateful) Stream Processing System
reliable?
Solution 1: using active/passive system (hot replication)
•  Both systems process the full load
•  In case of a failure, automatically switch and use the “passive” system
•  Stragglers slow down both active and passive system
State = State in-memory and/or on-disk
[Diagram: an active and a passive copy of the same pipeline (collecting processes, processing stages A and B with per-thread queues), each copy holding its own State; both are fed by the event stream.]
How to make a (stateful) Stream Processing System
reliable?
Solution 2: Upstream backup
•  Nodes buffer messages and replay them to a new node in case of failure
•  Stragglers are treated as failures
State = State in-memory and/or on-disk
buffer = Buffer for replay in-memory and/or on-disk
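The upstream-backup idea can be illustrated with a minimal Python sketch. The class names UpstreamNode and SummingNode are hypothetical and not taken from Storm or Spark; the sketch only assumes that the upstream node keeps every sent message in an in-memory buffer and replays it to a replacement node after a failure.

```python
from collections import deque


class SummingNode:
    """Hypothetical stateful processing node: keeps a running sum in memory."""

    def __init__(self):
        self.state = 0

    def process(self, message):
        self.state += message


class UpstreamNode:
    """Hypothetical upstream node that buffers every message it sends."""

    def __init__(self):
        self.buffer = deque()  # messages kept for a possible replay

    def send(self, downstream, message):
        self.buffer.append(message)
        downstream.process(message)

    def replay_to(self, replacement):
        # Re-send every buffered message so the new node rebuilds its state.
        for message in self.buffer:
            replacement.process(message)


upstream = UpstreamNode()
worker = SummingNode()
for value in [1, 2, 3]:
    upstream.send(worker, value)

# Simulate a failure: the worker and its in-memory state are lost,
# so a fresh node is started and fed from the upstream buffer.
replacement = SummingNode()
upstream.replay_to(replacement)
```

After the replay, the replacement node holds the same sum as the failed worker. A real system would also have to trim the buffer once downstream state is safely persisted, which this sketch leaves out.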
[Diagram: a single pipeline (collecting processes, processing stages A and B with per-thread queues) with its State; upstream nodes keep a buffer for replay.]
Processing Models
Batch Processing
•  Familiar concept of processing data en masse
•  Generally incurs high latency
(Event-) Stream Processing
•  A one-at-a-time processing model
•  A datum is processed as it arrives
•  Sub-second latency
•  Difficult to process state data efficiently
Micro-Batching
•  A special case of batch processing with very small batch sizes (tiny)
•  A nice mix between batching and streaming
•  At cost of latency
•  Allows Stateful computation, making windowing an easy task
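As a rough illustration of micro-batching, the following Python sketch cuts a list of timestamped events into consecutive batches of a fixed interval. The function name micro_batches and the sample events are assumptions made for this example; intervals without events simply produce no batch here.

```python
from collections import defaultdict


def micro_batches(events, batch_interval):
    """Group (timestamp, payload) events into consecutive fixed-length batches."""
    batches = defaultdict(list)
    for timestamp, payload in events:
        # All events whose timestamp falls into the same interval share a batch.
        batches[int(timestamp // batch_interval)].append(payload)
    return [batches[index] for index in sorted(batches)]


# Hypothetical events: (arrival time in seconds, payload)
events = [(0.1, "a"), (0.4, "b"), (0.6, "c"), (1.2, "d")]
batches = micro_batches(events, batch_interval=0.5)
```

With a batch interval of 0.5 seconds, "a" and "b" end up in the same batch, which shows where the extra latency comes from: an event waits until its batch is closed before it is processed.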
Message Delivery Semantics
At most once [0,1]
•  Messages may be lost
•  Messages never redelivered
At least once [1 .. n]
•  Messages will never be lost
•  but messages may be redelivered (might be ok if consumer can handle it)
Exactly once [1]
•  Messages are never lost
•  Messages are never redelivered
•  Perfect message delivery
•  Incurs higher latency for transactional semantics
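The remark that redelivery "might be ok if consumer can handle it" can be sketched as follows. This is a minimal Python example under the assumption that every message carries a unique id; the class IdempotentCounter is hypothetical. It shows how a consumer can get an exactly-once effect on top of at-least-once delivery by ignoring ids it has already processed.

```python
class IdempotentCounter:
    """Hypothetical consumer that tolerates redelivered messages."""

    def __init__(self):
        self.seen_ids = set()
        self.total = 0

    def handle(self, message_id, amount):
        # A redelivered message has an id we already processed: skip it.
        if message_id in self.seen_ids:
            return False
        self.seen_ids.add(message_id)
        self.total += amount
        return True


# At-least-once delivery: message 2 arrives twice.
deliveries = [(1, 10), (2, 5), (2, 5), (3, 1)]
counter = IdempotentCounter()
for message_id, amount in deliveries:
    counter.handle(message_id, amount)
```

The duplicate delivery of message 2 does not change the total. The price is the extra state (the set of seen ids), which a real system would need to bound and persist.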
Agenda
1.  Introduction / Motivation
2.  Apache Storm
3.  Apache Spark (Streaming)
4.  Stream Processing in the Architecture
Apache Storm
A platform for doing analysis on streams of data as they come in, so you
can react to data as it happens.
•  highly distributed real-time computation system
•  Provides general primitives to do
real-time computation
•  To simplify working with queues & workers
•  scalable and fault-tolerant
Originated at Backtype, acquired by Twitter in 2011
Open Sourced late 2011
Part of Apache since September 2013
Apache Storm – Core concepts
Tuple
•  Immutable Set of Key/value pairs
Stream
•  an unbounded sequence of tuples that can be processed in parallel by Storm
Topology
•  Wires data and functions via a DAG (directed acyclic graph)
•  Executes on many machines similar to a MR job in Hadoop
Spout
•  Source of data streams (tuples)
•  can be run in “reliable” and “unreliable” mode
Bolt
•  Consumes 1+ streams and produces new streams
•  Complex operations often require multiple
steps and thus multiple bolts
[Diagram: example topology with two spouts (one labeled as the source of stream B) and four bolts. One bolt subscribes to A and emits C, one subscribes to A and emits D, one subscribes to A & B and emits nothing, one subscribes to C & D and emits nothing. Tuples (T) flow along the streams.]
Apache Storm – Core concepts
Each Spout or Bolt runs N instances in parallel
[Diagram: Text Spout → (shuffle) → Split Text 1 … n → (fields) → Word Count 1 … n → (global) → Report]
Stream groupings:
•  Shuffle grouping: random grouping
•  Fields grouping: grouped by value, such that equal values go to the same task
•  All grouping: replicates to all tasks
•  Global grouping: makes all tuples go to one task
•  None grouping: makes the bolt run in the same thread as the bolt/spout it subscribes to
•  Direct grouping: the producer (the task that emits) controls which consumer will receive the tuple
•  Local or Shuffle grouping: similar to the shuffle grouping, but shuffles tuples among bolt tasks running in the same worker process, if any; otherwise falls back to shuffle grouping behavior
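The routing rules behind the most common groupings can be sketched in a few lines of Python. This is not Storm's implementation: the function names are hypothetical, tasks are identified by index, and a CRC32 hash stands in for whatever hash Storm uses internally.

```python
import random
import zlib


def shuffle_grouping(num_tasks, rng=random):
    """Pick a random target task for each tuple."""
    return rng.randrange(num_tasks)


def fields_grouping(value, num_tasks):
    """Equal field values always map to the same task (stable hash, modulo)."""
    return zlib.crc32(value.encode("utf-8")) % num_tasks


def all_grouping(num_tasks):
    """Replicate the tuple to every task."""
    return list(range(num_tasks))


def global_grouping():
    """Send every tuple to one single task (task 0 in this sketch)."""
    return 0


# The same word is always routed to the same counter task.
first = fields_grouping("barca", 4)
second = fields_grouping("barca", 4)
```

Fields grouping is what makes per-key counting work: because "barca" always reaches the same task, that task can keep the count locally.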
Storm – How does it work ?
[Diagram: the Twitter Spout emits the tweet “Who will win: Barca, Real, Juve or Bayern? … bit.ly/1yRsPmE #fcb #barca”; shuffle grouping sends it to one of several Sentence Splitter tasks, which emits the words real, juve, barca, barca, bayern and fcb. Other tweets (… #barca, … #fcb) go to other splitter tasks.]
Storm – How does it work ?
[Diagram: fields grouping routes each word from the Sentence Splitters to a Word Counter task, so that equal words (for example both occurrences of barca) reach the same counter.]
Storm – How does it work ?
[Diagram: each Word Counter increments its count (INCR) per incoming word, giving real = 1, juve = 1, bayern = 1, fcb = 1, barca = 1 and then barca = 2.]
Storm – How does it work ?
[Diagram: the Word Counters send their counts via global grouping (labeled 30sec) to a single Report bolt, which shows real = 1, juve = 1, barca = 2, bayern = 1, fcb = 1.]
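The word-count walkthrough can be replayed in plain Python. This is a simulation, not Storm code: the function names, the TRACKED_TERMS filter and the CRC32-based routing are assumptions chosen so that the example reproduces the words and counts shown on the slides.

```python
import re
import zlib
from collections import Counter

# Hypothetical filter: only these terms are emitted by the splitter.
TRACKED_TERMS = {"barca", "real", "juve", "bayern", "fcb"}

TWEET = "Who will win: Barca, Real, Juve or Bayern? … bit.ly/1yRsPmE #fcb #barca"


def split_sentence(tweet):
    """Sentence Splitter: emit one lowercase word per tracked term found."""
    words = re.findall(r"[a-z0-9]+", tweet.lower())
    return [word for word in words if word in TRACKED_TERMS]


def run_topology(tweets, num_counters=2):
    """Route words to counter tasks by value, then merge into one report."""
    counters = [Counter() for _ in range(num_counters)]
    for tweet in tweets:
        for word in split_sentence(tweet):
            # Fields grouping: equal words always reach the same counter.
            task = zlib.crc32(word.encode("utf-8")) % num_counters
            counters[task][word] += 1
    report = Counter()
    for counter in counters:
        # Global grouping: every counter reports to the single report step.
        report.update(counter)
    return counters, report


counters, report = run_topology([TWEET])
```

Each word is counted by exactly one counter task, and the merged report matches the slide: barca = 2 and every other term = 1.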
Using a NoSQL datastore for persisting
results
Keep state in a NoSQL datastore
Using counter type columns of Cassandra
[Diagram: Twitter Stream → Twitter Spout → Sentence Splitters → Word Counters, without a Report bolt; the Word Counters send an INCR per word to the datastore, which holds real = 1, juve = 1, barca = 2, bayern = 1, fcb = 1.]
Storm Trident
High-level abstraction on top of Storm
•  Processing as a series of batches (micro-batches)
•  Stream is partitioned among nodes in cluster
5 kinds of operations in Trident
•  Operations that apply locally to each partition and cause no network transfer
•  Repartitioning operations that don't change the contents
•  Aggregation operations that do network transfer
•  Operations on grouped streams
•  Merges and Joins
[Diagram: Twitter Stream → Twitter Spout → (tweet) → Sentence Splitter → (hashtag) → Sentence Normalizer → (hashtag) → groupBy → Persistent Aggregate; the local operations and the aggregation are compiled into two bolts.]
Storm Core vs. Storm Trident
Criterion | Storm Core | Storm Trident
Community | > 100 contributors | > 100 contributors
Adoption | *** | *
Language Options | Java, Clojure, Scala, Python, Ruby, … | Java, Clojure, Scala
Processing Models | Event-Streaming | Micro-Batching
Processing DSL | No | Yes
Stateful Ops | No | Yes
Distributed RPC | Yes | Yes
Delivery Guarantees | At most once / At least once | Exactly Once
Latency | sub-second | seconds
Platform | Storm Cluster, YARN | Storm Cluster, YARN
Agenda
1.  Introduction
2.  Apache Storm
3.  Apache Spark (Streaming)
4.  Unified Log (Enterprise Event Bus)
5.  Stream Processing in the Architecture
Apache Spark
Apache Spark is a fast and general engine for large-scale data processing
•  The hot trend in Big Data!
•  Based on 2007 Microsoft Dryad paper
•  Written in Scala, supports Java, Python, SQL and R
•  Can run programs up to 100x faster than Hadoop MapReduce in memory, or
10x faster on disk
•  Runs everywhere – runs on Hadoop, Mesos, standalone or in the cloud
•  One of the largest OSS communities in big data with over 200 contributors in
50+ organizations
•  Originally developed in 2009 at UC Berkeley’s AMPLab
•  Open-sourced in 2010; part of the Apache Software Foundation since 2014
Apache Spark
Spark Core
•  General execution engine for the Spark platform
•  In-memory computing capabilities deliver speed
•  General execution model supports wide variety of use cases
•  DAG-based
•  Ease of development – native APIs in Java, Scala and Python
Spark Streaming
•  Run a streaming computation as a series of very small, deterministic batch jobs
•  Batch size as low as ½ sec, latency of about 1 sec
•  Exactly-once semantics
•  Potential for combining batch and streaming processing in same system
•  Started in 2012, first alpha release in 2013
Apache Spark - Generality
Libraries:
•  Spark SQL (Batch Processing)
•  BlinkDB (Approximate Querying)
•  Spark Streaming (Real-Time)
•  MLlib, SparkR (Machine Learning)
•  GraphX (Graph Processing)
Core Runtime:
•  Spark Core API and Execution Model
Cluster Resource Managers:
•  Spark Standalone, Mesos, YARN
Data Stores:
•  HDFS, Elasticsearch, Cassandra, S3 / DynamoDB
Adapted from C. Fregly: http://slidesha.re/11PP7FV
Apache Spark – Core concepts
Resilient Distributed Dataset (RDD)
•  Core Spark abstraction
•  Collections of objects (partitions) spread across cluster
•  Can be stored in-memory or on-disk (local)
•  Enables parallel processing on data sets
•  Built through parallel transformations
•  Immutable, re-computable, fault tolerant
•  Contains transformation history (“lineage”) for whole data set
Operations
•  Stateless Transformations (map, filter, groupBy)
•  Actions (count, collect, save)
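The split between transformations and actions can be illustrated without Spark. The class ToyRDD below is a hypothetical stand-in, not the Spark API: transformations only record a new step in the lineage, and nothing is computed until an action such as collect() runs.

```python
class ToyRDD:
    """Hypothetical, single-machine stand-in for an RDD with a lineage."""

    def __init__(self, source, lineage=()):
        self._source = source          # callable that (re)computes the data
        self.lineage = tuple(lineage)  # names of the steps that produced it

    def map(self, fn):
        # Transformation: returns a new ToyRDD, computes nothing yet.
        return ToyRDD(lambda: (fn(x) for x in self._source()),
                      self.lineage + ("map",))

    def filter(self, predicate):
        # Transformation: returns a new ToyRDD, computes nothing yet.
        return ToyRDD(lambda: (x for x in self._source() if predicate(x)),
                      self.lineage + ("filter",))

    def collect(self):
        # Action: walks the lineage and materializes the result.
        return list(self._source())

    def count(self):
        # Action: recomputes from the lineage and counts the elements.
        return sum(1 for _ in self._source())


calls = []


def traced_square(x):
    calls.append(x)  # record when the function is really executed
    return x * x


base = ToyRDD(lambda: iter([1, 2, 3, 4]), lineage=("parallelize",))
result = base.filter(lambda x: x % 2 == 0).map(traced_square)
calls_before_action = list(calls)  # still empty: nothing has run
values = result.collect()
```

Because each ToyRDD only stores how to recompute itself, a lost result can be rebuilt by running the lineage again, which is the idea behind "re-computable, fault tolerant".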
RDD Lineage Example
[Diagram: HDFS File Input 1 → SparkContext.hadoopFile() → HadoopRDD → filter() → FilteredRDD → map() → MappedRDD. HDFS File Input 2 → SparkContext.hadoopFile() → HadoopRDD → map() → MappedRDD. Both MappedRDDs → join() → ShuffledRDD → saveAsHadoopFile() → HDFS File Output. The steps up to join() are transformations (lazy); saveAsHadoopFile() is the action that executes the transformations.]
Adapted from Chris Fregly: http://slidesha.re/11PP7FV
Apache Spark Streaming – Core concepts
Discretized Stream (DStream)
•  Core Spark Streaming abstraction
•  Micro-batches of RDDs
•  Operations similar to RDD
Input DStreams
•  Represents the stream of raw data received from streaming sources
•  Data can be ingested from many sources: Kafka, Kinesis, Flume, Twitter,
ZeroMQ, TCP Socket, Akka actors, etc.
•  Custom Sources can be easily written for custom data sources
Operations
•  Same as Spark Core
•  Additional Stateful transformations (window, reduceByWindow)
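A windowed reduction over micro-batches can be sketched in plain Python. The function below is a hypothetical illustration, not the Spark Streaming API: window length and slide interval are expressed as a number of batches, and each inner list stands for the contents of one RDD in the DStream.

```python
from functools import reduce


def reduce_by_window(batches, window_length, slide_interval, reduce_fn):
    """Reduce all elements of the last `window_length` batches, every `slide_interval` batches."""
    results = []
    for end in range(window_length, len(batches) + 1, slide_interval):
        # Flatten the batches that fall into the current window.
        window = [x for batch in batches[end - window_length:end] for x in batch]
        if window:
            results.append(reduce(reduce_fn, window))
    return results


# Hypothetical micro-batches, one list per batch interval.
batches = [[1, 2], [3], [4, 5], [6]]
window_sums = reduce_by_window(batches, window_length=2, slide_interval=1,
                               reduce_fn=lambda a, b: a + b)
```

Each result covers two consecutive batches, so neighbouring windows overlap by one batch. This is the kind of stateful operation that is easy to express once the stream is already cut into batches.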
Discretized Stream (DStream)
[Diagram: with time increasing (time 1, time 2, time 3 … time n), the input stream of messages is cut into a DStream of RDDs (RDD @time 1 … RDD @time n), each holding message 1 … message n. The map() transformation yields a MappedDStream whose RDDs hold f(message 1) … f(message n) (the lineage). The saveAsHadoopFiles() action triggers Spark jobs that produce result 1 … result n for each batch.]
Adapted from Chris Fregly: http://slidesha.re/11PP7FV
Storm Core vs. Storm Trident vs. Spark Streaming
Criterion | Storm Core | Storm Trident | Spark Streaming
Community | > 100 contributors | > 100 contributors | > 280 contributors
Adoption | *** | * | *
Language Options | Java, Clojure, Scala, Python, Ruby, … | Java, Clojure, Scala | Java, Scala, Python (coming)
Processing Models | Event-Streaming | Micro-Batching | Micro-Batching, Batch (Spark Core)
Processing DSL | No | Yes | Yes
Stateful Ops | No | Yes | Yes
Distributed RPC | Yes | Yes | No
Delivery Guarantees | At most once / At least once | Exactly Once | Exactly Once
Latency | sub-second | seconds | seconds
Platform | Storm Cluster, YARN | Storm Cluster, YARN | YARN, Mesos, Standalone, DataStax EE
Agenda
1.  Introduction / Motivation
2.  Apache Storm
3.  Apache Spark (Streaming)
4.  Stream Processing in the Architecture
Architectural Pattern: Standalone Event Stream
Processing
[Diagram, components shown: event sources (Event Cloud, Internet of Things, Social Media Streams); Enterprise Event Bus (ingress); Event Processing (ESP / CEP) with State Store / Event Store, rules from a Business Rule Management System, and a Result Store; outputs to Enterprise Event Bus, Enterprise Service Bus, DB and Analytical Applications.]
Architectural Pattern: Event Stream Processing as part
of Lambda Architecture
[Diagram, components shown: the same event sources, Enterprise Event Bus (ingress) and Event Processing (ESP / CEP) with State Store / Event Store and Result Store, plus a Hadoop Big Data Infrastructure (HDFS, Map/Reduce) with its own Result Store; outputs to Enterprise Event Bus, Enterprise Service Bus, DB and Analytical Applications.]
Architectural Pattern: Event Stream Processing as part
of Kappa Architecture
[Diagram, components shown: the same event sources, Enterprise Event Bus (ingress) and Event Processing (ESP / CEP) with State Store / Event Store and Result Store; the Hadoop Big Data Infrastructure (HDFS) is used to replay events into Event Processing; outputs to Enterprise Service Bus, DB and Analytical Applications.]
Unified Log (Event) Architecture
Stream processing allows for computing feeds off of other feeds
Derived feeds are no different than the original feeds they are computed off
Single deployment of “Unified Log” but logically different feeds
[Diagram: a Collector writes Meter Readings to the feed Raw Meter Readings; Enrich / Transform (with Customer data) derives the feed Meter with Customer; Aggregate by Minute steps derive the feeds Meter by Minute and Meter by Customer by Minute; Persist steps store Raw Meter Readings and Meter by Minute.]
[Diagram: Unified Log / Event Architecture for the Trivadis Streaming Demo System. Technologies: Kafka, Storm, Cassandra, Elasticsearch, Titan. The distribution layer carries the feeds Tweets, Filtered Tweets and Sensor Readings; the speed layer contains Filter, Persist, Feature Extractor, Feature Counter and Skill Matcher, producing Feature Occurrences and Matches.]
[Diagram: the same architecture with one Storm topology shown in detail: Kafka → Kafka Spout → (shuffle) → Splitter → (fields) → Word Remover → Kafka.]
Central Unified Log for (real-time) subscription
Take all the organization’s data and put it into a central log for subscription
Properties of the Unified Log:
•  Unified: “Enterprise”, single deployment
•  Append-Only: events are appended, no update in place => immutable
•  Ordered: each event has an offset, which is unique within a shard
•  Fast: should be able to handle thousands of messages / sec
•  Distributed: lives on a cluster of machines
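The append-only and ordered properties can be made concrete with a small in-memory Python sketch. UnifiedLog and Consumer are hypothetical names for this example; a real unified log would be distributed and persistent, which is deliberately left out here. Each consumer tracks its own offset, so systems can read the same log at different positions.

```python
class UnifiedLog:
    """Hypothetical single-shard, in-memory append-only log."""

    def __init__(self):
        self._events = []

    def append(self, event):
        # Events are only ever appended; the offset is the position in the log.
        self._events.append(event)
        return len(self._events) - 1

    def read(self, offset, max_events):
        # Reading never removes events, so any number of consumers can read.
        return self._events[offset:offset + max_events]


class Consumer:
    """Hypothetical consumer that remembers how far it has read."""

    def __init__(self, log):
        self.log = log
        self.offset = 0

    def poll(self, max_events):
        events = self.log.read(self.offset, max_events)
        self.offset += len(events)
        return events


log = UnifiedLog()
offsets = [log.append(f"event-{i}") for i in range(12)]  # offsets 0..11

system_a = Consumer(log)
system_b = Consumer(log)
read_by_a = system_a.poll(6)   # System A has consumed up to offset 6
read_by_b = system_b.poll(10)  # System B has consumed up to offset 10
```

Both consumers read the same immutable events; a slow consumer does not hold back a fast one, and either can re-read from an older offset if needed.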
[Diagram: a log with offsets 0 to 11. The Collector writes at the end of the log; Consumer System A reads at offset 6 (time = 6) and Consumer System B reads at offset 10 (time = 10).]
Apache Kafka - Overview
•  A distributed publish-subscribe messaging system
•  Designed for processing of real time activity stream data (logs, metrics
collections, social media streams, …)
•  Initially developed at LinkedIn, now part of Apache
•  Does not follow JMS Standards and does not use JMS API
•  Kafka maintains feeds of messages in topics
[Diagram: producers write to the Kafka cluster and consumers read from it. Anatomy of a topic: Partition 0, Partition 1 and Partition 2, each an ordered sequence of messages numbered from old (0) to new; writes are appended at the new end.]
Trivadis Stream Processing Demo System - Update
Questions and answers ...
2014 © Trivadis
BASEL BERN BRUGES LAUSANNE ZÜRICH DÜSSELDORF FRANKFURT A.M. FREIBURG I.BR. HAMBURG MUNICH STUTTGART VIENNA
Guido Schmutz
Technology Manager
Juli 2015
Apache Storm vs. Spark Streaming - Two Stream Processing Platforms compared
46

More Related Content

What's hot

Blue-green deploys with Pulsar & Envoy in an event-driven microservice ecosys...
Blue-green deploys with Pulsar & Envoy in an event-driven microservice ecosys...Blue-green deploys with Pulsar & Envoy in an event-driven microservice ecosys...
Blue-green deploys with Pulsar & Envoy in an event-driven microservice ecosys...StreamNative
 
Top 5 Event Streaming Use Cases for 2021 with Apache Kafka
Top 5 Event Streaming Use Cases for 2021 with Apache KafkaTop 5 Event Streaming Use Cases for 2021 with Apache Kafka
Top 5 Event Streaming Use Cases for 2021 with Apache KafkaKai Wähner
 
Blockchain in Agri-Food – Industry Adoption Analysis
Blockchain in Agri-Food – Industry Adoption AnalysisBlockchain in Agri-Food – Industry Adoption Analysis
Blockchain in Agri-Food – Industry Adoption AnalysisNetscribes
 
Blockchain technology-report-final
Blockchain technology-report-finalBlockchain technology-report-final
Blockchain technology-report-finalRishabhMalik32
 
Event Sourcing, Stream Processing and Serverless (Ben Stopford, Confluent) K...
Event Sourcing, Stream Processing and Serverless (Ben Stopford, Confluent)  K...Event Sourcing, Stream Processing and Serverless (Ben Stopford, Confluent)  K...
Event Sourcing, Stream Processing and Serverless (Ben Stopford, Confluent) K...confluent
 
Clean Infrastructure as Code
Clean Infrastructure as Code Clean Infrastructure as Code
Clean Infrastructure as Code QAware GmbH
 
Blockchain in food industry
Blockchain in food industryBlockchain in food industry
Blockchain in food industryzaarahary
 
Unified Batch & Stream Processing with Apache Samza
Unified Batch & Stream Processing with Apache SamzaUnified Batch & Stream Processing with Apache Samza
Unified Batch & Stream Processing with Apache SamzaDataWorks Summit
 

What's hot (8)

Blue-green deploys with Pulsar & Envoy in an event-driven microservice ecosys...
Blue-green deploys with Pulsar & Envoy in an event-driven microservice ecosys...Blue-green deploys with Pulsar & Envoy in an event-driven microservice ecosys...
Blue-green deploys with Pulsar & Envoy in an event-driven microservice ecosys...
 
Top 5 Event Streaming Use Cases for 2021 with Apache Kafka
Top 5 Event Streaming Use Cases for 2021 with Apache KafkaTop 5 Event Streaming Use Cases for 2021 with Apache Kafka
Top 5 Event Streaming Use Cases for 2021 with Apache Kafka
 
Blockchain in Agri-Food – Industry Adoption Analysis
Blockchain in Agri-Food – Industry Adoption AnalysisBlockchain in Agri-Food – Industry Adoption Analysis
Blockchain in Agri-Food – Industry Adoption Analysis
 
Blockchain technology-report-final
Blockchain technology-report-finalBlockchain technology-report-final
Blockchain technology-report-final
 
Event Sourcing, Stream Processing and Serverless (Ben Stopford, Confluent) K...
Event Sourcing, Stream Processing and Serverless (Ben Stopford, Confluent)  K...Event Sourcing, Stream Processing and Serverless (Ben Stopford, Confluent)  K...
Event Sourcing, Stream Processing and Serverless (Ben Stopford, Confluent) K...
 
Clean Infrastructure as Code
Clean Infrastructure as Code Clean Infrastructure as Code
Clean Infrastructure as Code
 
Blockchain in food industry
Blockchain in food industryBlockchain in food industry
Blockchain in food industry
 
Unified Batch & Stream Processing with Apache Samza
Unified Batch & Stream Processing with Apache SamzaUnified Batch & Stream Processing with Apache Samza
Unified Batch & Stream Processing with Apache Samza
 

Viewers also liked

Apache storm vs. Spark Streaming
Apache storm vs. Spark StreamingApache storm vs. Spark Streaming
Apache storm vs. Spark StreamingP. Taylor Goetz
 
Hadoop Summit Europe 2014: Apache Storm Architecture
Hadoop Summit Europe 2014: Apache Storm ArchitectureHadoop Summit Europe 2014: Apache Storm Architecture
Hadoop Summit Europe 2014: Apache Storm ArchitectureP. Taylor Goetz
 
Big data real time architectures
Big data real time architecturesBig data real time architectures
Big data real time architecturesDaniel Marcous
 
Apache Storm 0.9 basic training - Verisign
Apache Storm 0.9 basic training - VerisignApache Storm 0.9 basic training - Verisign
Apache Storm 0.9 basic training - VerisignMichael Noll
 
Realtime Analytics with Storm and Hadoop
Realtime Analytics with Storm and HadoopRealtime Analytics with Storm and Hadoop
Realtime Analytics with Storm and HadoopDataWorks Summit
 
Introduction to Real-time data processing
Introduction to Real-time data processingIntroduction to Real-time data processing
Introduction to Real-time data processingYogi Devendra Vyavahare
 
Speed layer : Real time views in LAMBDA architecture
Speed layer : Real time views in LAMBDA architecture Speed layer : Real time views in LAMBDA architecture
Speed layer : Real time views in LAMBDA architecture Tin Ho
 
Lambda architecture on Spark, Kafka for real-time large scale ML
Lambda architecture on Spark, Kafka for real-time large scale MLLambda architecture on Spark, Kafka for real-time large scale ML
Lambda architecture on Spark, Kafka for real-time large scale MLhuguk
 
Storm: distributed and fault-tolerant realtime computation
Storm: distributed and fault-tolerant realtime computationStorm: distributed and fault-tolerant realtime computation
Storm: distributed and fault-tolerant realtime computationnathanmarz
 
Real time Analytics with Apache Kafka and Apache Spark
Real time Analytics with Apache Kafka and Apache SparkReal time Analytics with Apache Kafka and Apache Spark
Real time Analytics with Apache Kafka and Apache SparkRahul Jain
 
Apache Spark 2.0: A Deep Dive Into Structured Streaming - by Tathagata Das
Apache Spark 2.0: A Deep Dive Into Structured Streaming - by Tathagata Das Apache Spark 2.0: A Deep Dive Into Structured Streaming - by Tathagata Das
Apache Spark 2.0: A Deep Dive Into Structured Streaming - by Tathagata Das Databricks
 
Scaling Apache Storm - Strata + Hadoop World 2014
Scaling Apache Storm - Strata + Hadoop World 2014Scaling Apache Storm - Strata + Hadoop World 2014
Scaling Apache Storm - Strata + Hadoop World 2014P. Taylor Goetz
 
Apache Storm vs. Spark Streaming – two Stream Processing Platforms compared
Apache Storm vs. Spark Streaming – two Stream Processing Platforms comparedApache Storm vs. Spark Streaming – two Stream Processing Platforms compared
Apache Storm vs. Spark Streaming – two Stream Processing Platforms comparedGuido Schmutz
 
Real time machine learning
Real time machine learningReal time machine learning
Real time machine learningVinoth Kannan
 
Real-time Predictive Analytics in Manufacturing - Impetus Webinar
Real-time Predictive Analytics in Manufacturing - Impetus WebinarReal-time Predictive Analytics in Manufacturing - Impetus Webinar
Real-time Predictive Analytics in Manufacturing - Impetus WebinarImpetus Technologies
 
„Enterprise Event Bus“ Unified Log (Event) Processing Architecture
„Enterprise Event Bus“ Unified Log (Event) Processing Architecture„Enterprise Event Bus“ Unified Log (Event) Processing Architecture
„Enterprise Event Bus“ Unified Log (Event) Processing ArchitectureGuido Schmutz
 

Apache Storm vs Spark Streaming - Comparing Two Stream Processing Platforms

  • 1. Apache Storm vs. Spark Streaming – Two Stream Processing Platforms compared. Guido Schmutz, June 2015. 2015 © Trivadis. BASEL BERN BRUGG LAUSANNE ZÜRICH DÜSSELDORF FRANKFURT A.M. FREIBURG I.BR. HAMBURG MUNICH STUTTGART VIENNA. Slide footer: July 2015, Apache Storm vs. Spark Streaming - Two Stream Processing Platforms compared.
  • 2. Guido Schmutz: Working for Trivadis for more than 18 years; Oracle ACE Director for Fusion Middleware and SOA; Co-author of several books; Consultant, Trainer and Software Architect for Java, Oracle, SOA and Big Data / Fast Data; Member of the Trivadis Architecture Board; Technology Manager @ Trivadis; More than 25 years of software development experience. Contact: guido.schmutz@trivadis.com. Blog: http://guidoschmutz.wordpress.com. Twitter: gschmutz.
  • 3. Trivadis is a market leader in IT consulting, system integration, solution engineering and the provision of IT services focusing on [...] and [...] technologies in Switzerland, Germany and Austria. We offer our services in the following strategic business fields: [diagram not captured; surviving label: OPERATION]. Trivadis Services takes over the interacting operation of your IT systems.
  • 4. Agenda: 1. Introduction / Motivation; 2. Apache Storm; 3. Apache Spark (Streaming); 4. Stream Processing in the Architecture.
  • 5. What is Stream Processing? Infrastructure for continuous data processing. The computational model can be as general as MapReduce, but with the ability to produce low-latency results. Data collected continuously is naturally processed continuously. Also known as Event Processing / Complex Event Processing (CEP).
  • 6. Trivadis Stream Processing Demo System. Use hashtag #JFS2015 plus #storm and/or #spark.
  • 7. How to design a Stream Processing System? [Diagram: three pipeline variants] (a) Event Stream → Collecting/Processing → result; (b) Event Stream → Collecting → Processing → result; (c) Event Stream → Collecting → Queue (Persist) → Processing → result.
  • 8. How to scale a Stream Processing System? [Diagram: scaling with threads] Event Stream → Collecting Thread 1, 2, ..., n → Queue (Persist) → Processing Thread 1, 2, ..., n → result.
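The thread-based design on this slide can be illustrated with a small sketch. It is only an approximation under stated assumptions: the function name `run_pipeline`, the upper-casing "processing" step and the in-memory `queue.Queue` (standing in for the persistent queue on the slide) are all hypothetical and not taken from the presentation.

```python
import queue
import threading


def run_pipeline(events, n_collectors=2, n_processors=2):
    """Collect events on several threads, buffer them, process on several threads."""
    buffer = queue.Queue()           # in-memory stand-in for "Queue (Persist)"
    results = []
    results_lock = threading.Lock()
    stop = object()                  # sentinel that tells a processing thread to exit

    def collect(chunk):
        # A collecting thread only moves events into the queue.
        for event in chunk:
            buffer.put(event)

    def process():
        # A processing thread takes events from the queue until it sees the sentinel.
        while True:
            event = buffer.get()
            if event is stop:
                break
            with results_lock:
                results.append(event.upper())   # hypothetical processing step

    # Split the incoming stream round-robin across the collecting threads.
    chunks = [events[i::n_collectors] for i in range(n_collectors)]
    collectors = [threading.Thread(target=collect, args=(c,)) for c in chunks]
    processors = [threading.Thread(target=process) for _ in range(n_processors)]

    for t in collectors + processors:
        t.start()
    for t in collectors:
        t.join()
    # All events are queued; send one sentinel per processing thread.
    for _ in processors:
        buffer.put(stop)
    for t in processors:
        t.join()
    return results
```

Because the processing threads run concurrently, the order of the results is not guaranteed; a caller that needs ordering would have to sort or key the output.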
  • 9. How to scale a Stream Processing System? [Diagram: scaling with processes] The event stream feeds several collecting processes (each running collecting threads), which write events to persistent queues Queue 1, Queue 2, ..., Queue n; processing processes read events from the queues and emit results.
  • 10. How to scale a Stream Processing System? [Diagram: collecting processes 1 and 2 read from the event stream; processing stages A and B each run in several processes, with threads 1 … n that each own a queue (Q1 … Qn) of events.]
  • 11. How to make a (stateful) Stream Processing System reliable?
      • Faults and stragglers are inevitable in large clusters running big data applications
      • Streaming applications must recover from them quickly
      [Diagram: two copies of the pipeline from the previous slide (collecting process, processing stages A and B with per-thread queues), each reading from the event stream.]
  • 12. How to make a (stateful) Stream Processing System reliable? Solution 1: active/passive system (hot replication)
      • Both systems process the full load
      • In case of a failure, automatically switch over and use the “passive” system
      • Stragglers slow down both the active and the passive system
      [Diagram: an active and a passive copy of the collecting/processing pipeline, each with its own state. State = state in memory and/or on disk.]
  • 13. How to make a (stateful) Stream Processing System reliable? Solution 2: upstream backup
      • Nodes buffer messages and replay them to a new node in case of a failure
      • Stragglers are treated as failures
      [Diagram: the collecting/processing pipeline with state and replay buffers. State = state in memory and/or on disk; buffer = buffer for replay, in memory and/or on disk.]
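The upstream backup idea can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not how Storm or Spark implement it: the class names `Upstream` and `SumNode`, the offset-based acknowledgement and the checkpoint tuple are all hypothetical.

```python
class Upstream:
    """Hypothetical upstream node: buffers every record until it is acknowledged."""

    def __init__(self):
        self.buffer = []      # (offset, value) pairs not yet acknowledged
        self.next_offset = 0

    def emit(self, value):
        record = (self.next_offset, value)
        self.buffer.append(record)
        self.next_offset += 1
        return record

    def ack_up_to(self, offset):
        # Records up to and including `offset` are safe downstream: drop them.
        self.buffer = [(o, v) for o, v in self.buffer if o > offset]

    def replay(self):
        return list(self.buffer)


class SumNode:
    """Hypothetical stateful downstream node that keeps a running sum."""

    def __init__(self, checkpoint=(-1, 0)):
        self.last_offset, self.total = checkpoint

    def process(self, record):
        offset, value = record
        if offset <= self.last_offset:
            return            # already reflected in the state
        self.total += value
        self.last_offset = offset

    def checkpoint(self, upstream):
        # Acknowledge only what is covered by the checkpoint being returned.
        upstream.ack_up_to(self.last_offset)
        return (self.last_offset, self.total)


def demo():
    upstream = Upstream()
    node = SumNode()
    for value in [1, 2, 3]:
        node.process(upstream.emit(value))
    saved = node.checkpoint(upstream)
    for value in [4, 5]:
        node.process(upstream.emit(value))
    # The node fails here; its in-memory state is gone.
    replacement = SumNode(checkpoint=saved)
    for record in upstream.replay():
        replacement.process(record)
    return replacement.total
```

The replacement node starts from the last checkpoint and receives only the records that were never acknowledged, so its sum matches what the failed node had computed.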
  • 14. Processing Models
      Batch Processing
      • Familiar concept of processing data en masse
      • Generally incurs high latency
      (Event-) Stream Processing
      • A one-at-a-time processing model
      • A datum is processed as it arrives
      • Sub-second latency
      • Difficult to process state data efficiently
      Micro-Batching
      • A special case of batch processing with very small batch sizes
      • A mix between batching and streaming
      • Comes at the cost of latency
      • Allows stateful computation, making windowing an easy task
  • 15. Message Delivery Semantics
      At most once [0, 1]
      • Messages may be lost
      • Messages are never redelivered
      At least once [1 .. n]
      • Messages are never lost
      • Messages may be redelivered (which may be acceptable if the consumer can handle duplicates)
      Exactly once [1]
      • Messages are never lost
      • Messages are never redelivered
      • Perfect message delivery
      • Incurs higher latency for transactional semantics
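A small Python simulation makes the difference between the guarantees concrete. It is a sketch under assumptions: messages are hypothetical `(id, payload)` pairs, loss and lost acknowledgements are injected by hand, and the idempotent consumer shows one common way to tolerate the duplicates that at-least-once delivery produces.

```python
def deliver_at_most_once(messages, lost_ids):
    """Fire and forget: a lost message is never sent again."""
    return [msg for msg in messages if msg[0] not in lost_ids]


def deliver_at_least_once(messages, lost_ack_ids):
    """The sender retries when no acknowledgement arrives, so a lost
    acknowledgement leads to a duplicate delivery."""
    delivered = []
    for msg in messages:
        delivered.append(msg)
        if msg[0] in lost_ack_ids:
            delivered.append(msg)   # retry after the missing acknowledgement
    return delivered


def consume_idempotent(deliveries):
    """Consumer that remembers message ids and ignores duplicates."""
    seen = set()
    payloads = []
    for msg_id, payload in deliveries:
        if msg_id in seen:
            continue
        seen.add(msg_id)
        payloads.append(payload)
    return payloads


messages = [(1, "a"), (2, "b"), (3, "c")]
at_most_once = deliver_at_most_once(messages, lost_ids={2})
at_least_once = deliver_at_least_once(messages, lost_ack_ids={2})
deduplicated = consume_idempotent(at_least_once)
```

With at-most-once, message 2 is simply missing. With at-least-once it arrives twice, and the deduplicating consumer reduces that to a single effect per message.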
  • 16. Agenda: 1. Introduction / Motivation 2. Apache Storm 3. Apache Spark (Streaming) 4. Stream Processing in the Architecture
  • 17. Apache Storm. A platform for doing analysis on streams of data as they come in, so you can react to data as it happens.
      • Highly distributed real-time computation system
      • Provides general primitives to do real-time computation
      • Simplifies working with queues and workers
      • Scalable and fault-tolerant
      History: originated at BackType, which was acquired by Twitter in 2011; open sourced in late 2011; part of Apache since September 2013
  • 18. Apache Storm: Core concepts
      Tuple
      • Immutable set of key/value pairs
      Stream
      • An unbounded sequence of tuples that can be processed in parallel by Storm
      Topology
      • Wires data and functions via a DAG (directed acyclic graph)
      • Executes on many machines, similar to a MapReduce job in Hadoop
      Spout
      • Source of data streams (tuples)
      • Can be run in “reliable” and “unreliable” mode
      Bolt
      • Consumes one or more streams and produces new streams
      • Complex operations often require multiple steps and thus multiple bolts
      [Diagram: example topology with two spouts (one labelled as the source of stream B) and four bolts: one subscribes to A and emits C, one subscribes to A and emits D, one subscribes to A and B and emits nothing, one subscribes to C and D and emits nothing. Tuples (T) flow along the edges.]
  • 19. Apache Storm: Core concepts. Each spout or bolt runs N instances (tasks) in parallel. Stream groupings define how tuples are routed between those tasks:
      • Shuffle grouping: tuples are distributed randomly across tasks
      • Fields grouping: tuples are grouped by the value of a field, so that equal values always go to the same task
      • All grouping: replicates each tuple to all tasks
      • Global grouping: makes all tuples go to one task
      • None grouping: makes the bolt run in the same thread as the bolt/spout it subscribes to
      • Direct grouping: the producer (the task that emits) controls which consumer task will receive the tuple
      • Local or shuffle grouping: similar to shuffle grouping, but shuffles tuples among bolt tasks running in the same worker process, if any; otherwise falls back to shuffle grouping behavior
      [Diagram: Text Spout → (shuffle) → Split Text tasks 1 … n → (fields) → Word Count tasks 1 … n → (global) → Report]
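The routing decisions behind four of these groupings can be sketched as plain Python functions. This is an illustrative model, not Storm's implementation: the function names are hypothetical, tuples are modelled as dictionaries, and CRC32 stands in for whatever hash the framework uses internally. Each function returns the list of task indices that would receive the tuple.

```python
import random
import zlib


def shuffle_grouping(num_tasks, rng):
    """Pick one target task at random."""
    return [rng.randrange(num_tasks)]


def fields_grouping(tup, field, num_tasks):
    """Hash the field value so that equal values always reach the same task.
    CRC32 is used because it is stable across Python processes."""
    key = str(tup[field]).encode("utf-8")
    return [zlib.crc32(key) % num_tasks]


def all_grouping(num_tasks):
    """Replicate the tuple to every task."""
    return list(range(num_tasks))


def global_grouping():
    """Send every tuple to a single task (modelled here as task 0)."""
    return [0]


if __name__ == "__main__":
    rng = random.Random(42)
    for word in ["barca", "real", "barca"]:
        tup = {"word": word}
        print(word, shuffle_grouping(3, rng), fields_grouping(tup, "word", 3))
```

A word counter depends on fields grouping: if the two "barca" tuples could land on different tasks, each task would hold only a partial count.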
  • 20. Storm: How does it work? [Diagram, step 1: the Twitter Spout emits tweets ("Who will win: Barca, Real, Juve or Bayern? … bit.ly/1yRsPmE #fcb #barca", "… #barca", "… #fcb") and distributes them via shuffle grouping to three Sentence Splitter bolts, which emit the words real, juve, barca, barca, bayern, fcb.]
  • 21. Storm: How does it work? [Diagram, step 2: as before, the Twitter Spout feeds three Sentence Splitter bolts via shuffle grouping. The words real, juve, barca, barca, bayern, fcb are now routed via fields grouping to two Word Counter bolts, so that equal words reach the same counter.]
  • 22. Storm: How does it work? [Diagram, step 3: each Word Counter bolt increments a counter per received word (INCR barca, INCR real, INCR juve, INCR bayern, INCR fcb), resulting in real = 1, juve = 1, bayern = 1, fcb = 1 and barca = 2 after the second barca tuple.]
  • 23. Storm: How does it work? [Diagram, step 4: every 30 seconds the Word Counter bolts send their counts via global grouping to a single Report bolt, which holds the combined result: real = 1, juve = 1, barca = 2, bayern = 1, fcb = 1.]
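The four steps of this walkthrough can be replayed in a single-process Python sketch. The assumptions are stated plainly: the input sentences are hypothetical and already reduced to the words shown on the slides, the function names are invented for illustration, and CRC32 stands in for the fields-grouping hash.

```python
import zlib
from collections import Counter


def split_sentence(sentence):
    """Sentence Splitter: one word per emitted tuple."""
    return sentence.lower().split()


def task_for(word, num_tasks):
    """Fields grouping on the word: equal words map to the same counter task."""
    return zlib.crc32(word.encode("utf-8")) % num_tasks


def run_word_count(sentences, num_counters=2):
    # One Counter per Word Counter task.
    counters = [Counter() for _ in range(num_counters)]
    for sentence in sentences:
        for word in split_sentence(sentence):
            counters[task_for(word, num_counters)][word] += 1
    # Global grouping: a single report step merges all partial counts.
    report = Counter()
    for partial in counters:
        report.update(partial)
    return counters, report


counters, report = run_word_count(["real juve barca", "barca bayern", "fcb"])
```

Because every occurrence of a word is counted by exactly one task, the report step only has to merge disjoint partial results.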
  • 24. Using a NoSQL datastore for persisting results. Keep state in a NoSQL datastore, for example by using the counter column type of Cassandra. [Diagram: Twitter Stream → Twitter Spout → Sentence Splitter bolts → Word Counter bolts, which issue INCR operations (INCR barca, INCR real, INCR juve, INCR bayern, INCR fcb) against the datastore, resulting in real = 1, juve = 1, barca = 2, bayern = 1, fcb = 1.]
  • 25. Storm Trident. High-level abstraction on top of Storm
      • Processing as a series of batches (micro-batches)
      • Stream is partitioned among the nodes in the cluster
      5 kinds of operations in Trident
      • Operations that apply locally to each partition and cause no network transfer
      • Repartitioning operations that don't change the contents
      • Aggregation operations that do network transfer
      • Operations on grouped streams
      • Merges and joins
      [Diagram: Twitter Stream → Twitter Spout → (tweet) → Sentence Splitter → (tweet) → Sentence Normalizer → (hashtag) → groupBy → Persistent Aggregate; the splitter and normalizer run as local operations inside one bolt, the aggregation in a second bolt.]
  • 26. Storm Core vs. Storm Trident
      Criterion | Storm Core | Storm Trident
      Community | > 100 contributors | > 100 contributors
      Adoption | *** | *
      Language Options | Java, Clojure, Scala, Python, Ruby, … | Java, Clojure, Scala
      Processing Models | Event-Streaming | Micro-Batching
      Processing DSL | No | Yes
      Stateful Ops | No | Yes
      Distributed RPC | Yes | Yes
      Delivery Guarantees | At most once / At least once | Exactly once
      Latency | sub-second | seconds
      Platform | Storm Cluster, YARN | Storm Cluster, YARN
  • 27. Agenda: 1. Introduction 2. Apache Storm 3. Apache Spark (Streaming) 4. Unified Log (Enterprise Event Bus) 5. Stream Processing in the Architecture
  • 28. Apache Spark. Apache Spark is a fast and general engine for large-scale data processing
      • The hot trend in Big Data!
      • Based on the 2007 Microsoft Dryad paper
      • Written in Scala; supports Java, Python, SQL and R
      • Can run programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk
      • Runs everywhere: on Hadoop, Mesos, standalone or in the cloud
      • One of the largest OSS communities in big data, with over 200 contributors in 50+ organizations
      • Originally developed in 2009 at UC Berkeley's AMPLab
      • Open sourced in 2010; part of the Apache Software Foundation since 2014
  • 29. Apache Spark
      Spark Core
      • General execution engine for the Spark platform
      • In-memory computing capabilities deliver speed
      • General execution model supports a wide variety of use cases
      • DAG-based
      • Ease of development: native APIs in Java, Scala and Python
      Spark Streaming
      • Runs a streaming computation as a series of very small, deterministic batch jobs
      • Batch size as low as ½ sec, latency of about 1 sec
      • Exactly-once semantics
      • Potential for combining batch and stream processing in the same system
      • Started in 2012, first alpha release in 2013
  • 30. Apache Spark: Generality. [Diagram: layered stack]
      • Libraries: Spark SQL (Batch Processing), BlinkDB (Approximate Querying), Spark Streaming (Real-Time), MLlib and SparkR (Machine Learning), GraphX (Graph Processing)
      • Core Runtime: Spark Core API and Execution Model
      • Cluster Resource Managers: Spark Standalone, Mesos, YARN
      • Data Stores: HDFS, Elasticsearch, Cassandra, S3 / DynamoDB
      Adapted from C. Fregly: http://slidesha.re/11PP7FV
  • 31. Apache Spark: Core concepts
      Resilient Distributed Dataset (RDD)
      • Core Spark abstraction
      • Collections of objects (partitions) spread across the cluster
      • Can be stored in memory or on (local) disk
      • Enables parallel processing on data sets
      • Built through parallel transformations
      • Immutable, re-computable, fault tolerant
      • Contains the transformation history (“lineage”) for the whole data set
      Operations
      • Stateless transformations (map, filter, groupBy)
      • Actions (count, collect, save)
  • 32. RDD Lineage Example. [Diagram: HDFS File Input 1 → SparkContext.hadoopFile() → HadoopRDD → filter() → FilteredRDD → map() → MappedRDD; HDFS File Input 2 → SparkContext.hadoopFile() → HadoopRDD → map() → MappedRDD; both branches → join() → ShuffledRDD → saveAsHadoopFile() → HDFS File Output. The transformations are lazy; the final action executes them.] Adapted from Chris Fregly: http://slidesha.re/11PP7FV
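The interplay of lazy transformations, lineage and actions can be imitated with a toy class in plain Python. This is not the Spark API: `LazyDataset` and its methods are hypothetical names, the data is not partitioned or distributed, and only `map` and `filter` are modelled. The point is that transformations merely record steps, while `collect()` runs them, and the recorded lineage allows the result to be recomputed from the source at any time.

```python
class LazyDataset:
    """Toy stand-in for an RDD: an immutable source plus a recorded lineage."""

    def __init__(self, source, lineage=()):
        self._source = source            # callable that (re)loads the input
        self._lineage = tuple(lineage)   # recorded (name, function) steps

    def map(self, fn):
        # Transformation: returns a new dataset, nothing is computed yet.
        return LazyDataset(self._source, self._lineage + (("map", fn),))

    def filter(self, predicate):
        return LazyDataset(self._source, self._lineage + (("filter", predicate),))

    def lineage(self):
        return [name for name, _ in self._lineage]

    def collect(self):
        # Action: load the source and apply every recorded step in order.
        data = iter(self._source())
        for name, fn in self._lineage:
            data = map(fn, data) if name == "map" else filter(fn, data)
        return list(data)


load_calls = []


def load_numbers():
    load_calls.append("load")
    return [1, 2, 3, 4, 5, 6]


dataset = LazyDataset(load_numbers).filter(lambda x: x % 2 == 0).map(lambda x: x * 10)
```

Until `dataset.collect()` is called, `load_numbers` has not run at all. Calling `collect()` twice loads and recomputes twice, which is the same property that lets a lost partition be rebuilt from its lineage.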
  • 33. Apache Spark Streaming: Core concepts
      Discretized Stream (DStream)
      • Core Spark Streaming abstraction
      • Micro-batches of RDDs
      • Operations similar to those of an RDD
      Input DStreams
      • Represent the stream of raw data received from streaming sources
      • Data can be ingested from many sources: Kafka, Kinesis, Flume, Twitter, ZeroMQ, TCP socket, Akka actors, etc.
      • Custom receivers can be written for custom data sources
      Operations
      • Same as Spark Core
      • Additional stateful transformations (window, reduceByWindow)
  • 34. Discretized Stream (DStream). [Diagram: along the time axis (time 1 … time n) the input stream is cut into one RDD per batch interval (RDD @time 1 … RDD @time n), each holding message 1 … message n. map() turns the input DStream into a MappedDStream whose RDDs hold f(message 1) … f(message n); saveAsHadoopFiles() writes result 1 … result n for each batch. DStream transformations build the lineage; actions trigger Spark jobs.] Adapted from Chris Fregly: http://slidesha.re/11PP7FV
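The discretization shown in the diagram can be approximated in plain Python. This sketch does not use Spark: the function names are hypothetical, events are assumed to be `(timestamp_ms, value)` pairs, and each micro-batch is just a list standing in for one RDD. It shows a per-batch `map` and a sliding-window reduce over the most recent batches.

```python
def discretize(events, batch_interval_ms):
    """Cut a stream of (timestamp_ms, value) events into consecutive batches."""
    batches = {}
    for timestamp, value in events:
        batches.setdefault(timestamp // batch_interval_ms, []).append(value)
    if not batches:
        return []
    first, last = min(batches), max(batches)
    # Intervals without events still yield an (empty) batch.
    return [batches.get(index, []) for index in range(first, last + 1)]


def map_batches(batches, fn):
    """Apply fn to every element of every batch, one batch at a time."""
    return [[fn(value) for value in batch] for batch in batches]


def reduce_by_window(batches, fn, initial, window_length):
    """For each batch, reduce over the last `window_length` batches."""
    results = []
    for end in range(len(batches)):
        start = max(0, end - window_length + 1)
        accumulator = initial
        for batch in batches[start:end + 1]:
            for value in batch:
                accumulator = fn(accumulator, value)
        results.append(accumulator)
    return results


events = [(100, 1), (400, 2), (600, 3), (1200, 4), (1700, 5)]
batches = discretize(events, batch_interval_ms=500)
doubled = map_batches(batches, lambda value: value * 2)
window_sums = reduce_by_window(batches, lambda a, b: a + b, 0, window_length=2)
```

Windowing is straightforward here because the window is simply a slice over whole batches, which is the reason micro-batching makes stateful, windowed computation comparatively easy.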
  • 35. Storm Core vs. Storm Trident vs. Spark Streaming
      Criterion | Storm Core | Storm Trident | Spark Streaming
      Community | > 100 contributors | > 100 contributors | > 280 contributors
      Adoption | *** | * | *
      Language Options | Java, Clojure, Scala, Python, Ruby, … | Java, Clojure, Scala | Java, Scala, Python (coming)
      Processing Models | Event-Streaming | Micro-Batching | Micro-Batching, Batch (Spark Core)
      Processing DSL | No | Yes | Yes
      Stateful Ops | No | Yes | Yes
      Distributed RPC | Yes | Yes | No
      Delivery Guarantees | At most once / At least once | Exactly once | Exactly once
      Latency | sub-second | seconds | seconds
      Platform | Storm Cluster, YARN | Storm Cluster, YARN | YARN, Mesos, Standalone, DataStax EE
  • 36. Agenda: 1. Introduction / Motivation 2. Apache Storm 3. Apache Spark (Streaming) 4. Stream Processing in the Architecture
  • 37. Architectural Pattern: Standalone Event Stream Processing. [Diagram components: Event Cloud (Internet of Things, Social Media Streams); Enterprise Event Bus (Ingress); Event Processing (ESP / CEP) with State Store / Event Store and a Business Rule Management System (Rules); Result Store; Enterprise Event Bus; Enterprise Service Bus; Analytical Applications; DB.]
  • 38. Architectural Pattern: Event Stream Processing as part of Lambda Architecture. [Diagram components: Event Cloud (Internet of Things, Social Media Streams); Enterprise Event Bus (Ingress); Event Processing (ESP / CEP) with State Store / Event Store and Result Store; Hadoop Big Data Infrastructure with HDFS, Map/Reduce and its own Result Store; Enterprise Event Bus; Enterprise Service Bus; Analytical Applications; DB.]
  • 39. Architectural Pattern: Event Stream Processing as part of Kappa Architecture. [Diagram components: Event Cloud (Internet of Things, Social Media Streams); Enterprise Event Bus (Ingress); Event Processing (ESP / CEP) with State Store / Event Store and Result Store; Hadoop Big Data Infrastructure with HDFS, used to replay events into Event Processing; Enterprise Service Bus; Analytical Applications; DB.]
  • 40. Unified Log (Event) Architecture
      • Stream processing allows feeds to be computed from other feeds
      • Derived feeds are no different from the original feeds they are computed from
      • A single deployment of the “Unified Log” holds logically different feeds
      [Diagram: processing steps Collector, Enrich / Transform, Aggregate by Minute and Persist operate on the feeds Meter Readings, Raw Meter Readings, Customer, Meter with Customer, Meter by Minute and Meter by Customer by Minute; both Raw Meter Readings and Meter by Minute are persisted.]
  • 41. Unified Log / Event Architecture for the Trivadis Streaming Demo System. [Diagram: Distribution Layer (Kafka) with the feeds Tweets, Filtered Tweets, Sensor Readings, Feature Occurrences and Matches; Speed Layer (Storm) with the steps Filter, Feature Extractor, Feature Counter, Skill Matcher and Persist; stores: Cassandra, Elasticsearch, Titan.]
  • 42. Unified Log / Event Architecture for the Trivadis Streaming Demo System: Storm topology. [Diagram: same architecture as on the previous slide (Distribution Layer with Kafka, Speed Layer with Storm, stores Cassandra, Elasticsearch and Titan), with one Storm topology shown in detail: Kafka → Kafka Spout → (shuffle) → Splitter bolts → (fields) → Word Remover bolts → Kafka.]
  • 43. Central Unified Log for (real-time) subscription. Take all the organization's data and put it into a central log for subscription. Properties of the Unified Log:
      • Unified: “Enterprise”, single deployment
      • Append-only: events are appended, no update in place, hence immutable
      • Ordered: each event has an offset, which is unique within a shard
      • Fast: should be able to handle thousands of messages per second
      • Distributed: lives on a cluster of machines
      [Diagram: a log with offsets 0 … 11. The Collector writes at the end; Consumer System A reads at offset 6 (time = 6) and Consumer System B reads at offset 10 (time = 10).]
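The append-only and ordered properties, and the fact that each consumer keeps its own read position, can be shown with a minimal in-memory model in Python. This is a single-shard sketch with hypothetical class names (`UnifiedLog`, `Consumer`); it ignores distribution, persistence and throughput entirely.

```python
class UnifiedLog:
    """Single-shard, in-memory, append-only log. An event's offset is its index."""

    def __init__(self):
        self._events = []

    def append(self, event):
        # Events are only ever added at the end; existing entries never change.
        self._events.append(event)
        return len(self._events) - 1

    def read(self, offset, max_events=None):
        end = len(self._events) if max_events is None else offset + max_events
        return self._events[offset:end]


class Consumer:
    """Each consumer tracks its own offset, independently of all others."""

    def __init__(self, log):
        self.log = log
        self.offset = 0

    def poll(self, max_events=None):
        events = self.log.read(self.offset, max_events)
        self.offset += len(events)
        return events


log = UnifiedLog()
for reading in range(12):                 # the collector writes offsets 0 .. 11
    log.append({"reading": reading})

system_a = Consumer(log)
system_b = Consumer(log)
system_a.poll(max_events=6)               # system A has consumed up to offset 6
system_b.poll(max_events=10)              # system B has consumed up to offset 10
```

Since reading never removes anything from the log, a slow consumer does not hold back a fast one, and a new consumer can start from offset 0 and replay the full history.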
  • 44. Apache Kafka: Overview
      • A distributed publish-subscribe messaging system
      • Designed for processing real-time activity stream data (logs, metrics collections, social media streams, …)
      • Initially developed at LinkedIn, now part of Apache
      • Does not follow the JMS standards and does not use the JMS API
      • Kafka maintains feeds of messages in topics
      [Diagram: producers write to the Kafka cluster and consumers read from it. Anatomy of a topic: partitions 0, 1 and 2 are each an ordered sequence of messages numbered by offset (0 … 12, 0 … 9 and 0 … 12 in the figure); writes are appended at the new end, the oldest messages are at the other end.]
  • 45. Trivadis Stream Processing Demo System: Update
  • 46. Questions and answers ... Guido Schmutz, Technology Manager