SlideShare a Scribd company logo
1 of 70
Download to read offline
Distributed Real-Time Stream
Processing: Why and How
Petr Zapletal
@petr_zapletal
Scala eXchange 2015
Agenda
● Motivation
● Stream Processing
● Available Frameworks
● Systems Comparison
● Recommendations
● 8 Zettabytes (1 ZB = 1021
B = 1 billion TB) created in 2015
● Every minute we create
○ 200 million emails
○ 48 hours of YouTube video
○ 2 million google queries
○ 200 000 tweets
○ ...
● How can we make sense of all data
○ Most data is not interesting
○ New data supersedes old data
○ Challenge is not only storage but processing
The Data Deluge
New Sources And New Use Cases
● Many new sources of data become
available
○ Sensors
○ Mobile devices
○ Web feeds
○ Social networking
○ Cameras
○ Databases
○ ...
● Even more use cases become viable
○ Web/Social feed mining
○ Real-time data analysis
○ Fraud detection
○ Smart order routing
○ Intelligence and surveillance
○ Pricing and analytics
○ Trends detection
○ Log processing
○ Real-time data aggregation
○ …
Stream Processing to the Rescue
● Process data streams on-the-fly without permanent storage
● Stream data rates can be high
○ High resource requirements for processing (clusters, data centres)
● Processing stream data has real-time aspect
○ Latency of data processing matters
○ Must be able to react to events as they occur
Streaming Applications
ETL Operations
● Transformations, joining or filtering of incoming data
Windowing
● Trends in bounded interval, like tweets or sales
Streaming Applications
Machine Learning
● Clustering, Trend fitting, Classification
Pattern Recognition
● Fraud detection, Signal triggering, Anomaly detection
Processing Architecture Evolution
Batch Pipeline
Kappa ArchitectureLambda Architecture
Standalone Stream Processing
Stream
Processing
Batch Layer Serving
Layer
Stream layer
Query
Query
Allyourdata
Oozie
Stream
Processing
Query
Serving DBHDFS
Query
Distributed Stream Processing
● Continuous processing, aggregation and analysis of unbounded data
● General computational model as MapReduce
● Expected latency in milliseconds or seconds
● Systems often modelled as Directed Acyclic Graph (DAG)
● Describes topology of streaming job
● Data flows through chain of processors
from sources to sinks
Points of Interest
● Runtime and programming model
● Primitives
● State management
● Message delivery guarantees
● Fault tolerance & Low overhead recovery
● Latency, Throughput & Scalability
● Maturity and Adoption Level
● Ease of development and operability
Runtime and Programming Model
● Most important trait of stream processing system
● Defines expressiveness, possible operations and its limitations
● Therefore defines systems capabilities and its use cases
Native Streaming
Native stream processing systems
continuous operator model
record
Source Operator
Processing Operator
Processing Operator
Sink Operator
records processed one at a time
Micro-batching
Receiver
records
Processing Operator
Micro-batches
Records processed in short batches
Sink
Operator
Processing Operator
Native Streaming
● Records are processed as they arrive
● Native model with general processing ability
Pros
➔ Expressiveness
➔ Low-latency
➔ Stateful operations
Cons
➔ Throughput
➔ Fault-tolerance is expensive
➔ Load-balancing
Micro-batching
● Splits incoming stream into small batches
● Batch interval inevitably limits system expressiveness
● Can be built atop Native streaming easily
Pros
➔ High-throughput
➔ Easier fault tolerance
➔ Simpler load-balancing
Cons
➔ Lower latency, depends on
batch interval
➔ Limited expressivity
➔ Harder stateful operations
Programming Model
Compositional
➔ Provides basic building blocks as
operators or sources
➔ Custom component definition
➔ Manual Topology definition &
optimization
➔ Advanced functionality often
missing
Declarative
➔ High-level API
➔ Operators as higher order functions
➔ Abstract data types
➔ Advance operations like state
management or windowing supported
out of the box
➔ Advanced optimizers
Apache Streaming Landscape
TRIDENT
Storm
● Originally created by Nathan Marz and his team at BackType in 2010
● Being acquired and later open-sourced by Twitter, Apache project top-level
since 2014
● Pioneer in large scale stream processing
● Low-level native streaming API
● Uses Thrift for topology definition
● Large number of API languages available
○ Storm Multi-Language Protocol
Trident
● Higher level micro-batching system build atop Storm
● Stream is partitioned into a small batches
● Simplifies building topologies
● Java, Clojure and Scala API
● Provides exactly once delivery
● Adds higher level operations
○ Aggregations
○ State operations
○ Joining, merging , grouping, windowing, etc.
Spark Streaming
● Spark started in 2009 at UC Berkeley, Apache since since 2013
● General engine for large scale batch processing
● Spark Streaming introduced in 0.7, came out of alpha in 0.9 (Feb 2014)
● Unified batch and stream processing over a batch runtime
● Great integration with batch processing and its build-in libraries (SparkSQL, MLlib,
GraphX)
● Scala, Java & Python API
input data
stream Spark
Streaming
Spark
Engine
batches of
input data
batches of
processed data
Samza
● Developed in LinkedIn, open-sourced in 2013
● Builds heavily on Kafka’s log based philosophy
● Pluggable messaging system and executional backend
○ Uses Kafka & YARN usually
● JVM languages, usually Java or Scala
Task 1
Task 2
Task 3
Flink
● Started as Stratosphere in 2008 at as Academic project
● Native streaming
● High level API
● Batch as special case of Streaming (bounded vs unbounded dataset)
● Provides ML (FlinkML) and graph processing (Gelly) out of the box
● Java & Scala API
Stream Data
Batch Data
Kafka, RabbitMQ, ...
HDFS, JDBC, ...
System Comparison
Native Micro-batching Native Native
Compositional Declarative Compositional Declarative
At-least-once Exactly-once At-least-once Exactly-once
Record ACKs
RDD based
Checkpointing Log-based Checkpointing
Not build-in
Dedicated
Operators
Stateful
Operators
Stateful
Operators
Very Low Medium Low Low
Low Medium High High
High High Medium Low
Micro-batching
Exactly-once
Dedicated
DStream
Medium
High
Streaming
Model
API
Guarantees
Fault
Tolerance
State
Management
Latency
Throughput
Maturity
TRIDENT
Counting Words
Scala eXchange Apache
Apache Spark Storm
Apache Trident Flink
Streaming Samza Scala
2015 Streaming
(Apache,3)
(Scala, 2)
(Streaming, 2)
(eXchange, 1)
(Spark, 1)
(Storm, 1)
(Trident, 1)
(Flink, 1)
(Samza, 1)
(2015, 1)
Storm
TopologyBuilder builder = new TopologyBuilder();
builder.setSpout("spout", new RandomSentenceSpout(), 5);
builder.setBolt("split", new Split(), 8).shuffleGrouping("spout");
builder.setBolt("count", new WordCount(), 12).fieldsGrouping("split", new Fields("word"));
...
Map<String, Integer> counts = new HashMap<String, Integer>();
public void execute(Tuple tuple, BasicOutputCollector collector) {
String word = tuple.getString(0);
Integer count = counts.containsKey(word) ? counts.get(word) + 1 : 1;
counts.put(word, count);
collector.emit(new Values(word, count));
}
Storm
TopologyBuilder builder = new TopologyBuilder();
builder.setSpout("spout", new RandomSentenceSpout(), 5);
builder.setBolt("split", new Split(), 8).shuffleGrouping("spout");
builder.setBolt("count", new WordCount(), 12).fieldsGrouping("split", new Fields("word"));
...
Map<String, Integer> counts = new HashMap<String, Integer>();
public void execute(Tuple tuple, BasicOutputCollector collector) {
String word = tuple.getString(0);
Integer count = counts.containsKey(word) ? counts.get(word) + 1 : 1;
counts.put(word, count);
collector.emit(new Values(word, count));
}
Storm
TopologyBuilder builder = new TopologyBuilder();
builder.setSpout("spout", new RandomSentenceSpout(), 5);
builder.setBolt("split", new Split(), 8).shuffleGrouping("spout");
builder.setBolt("count", new WordCount(), 12).fieldsGrouping("split", new Fields("word"));
...
Map<String, Integer> counts = new HashMap<String, Integer>();
public void execute(Tuple tuple, BasicOutputCollector collector) {
String word = tuple.getString(0);
Integer count = counts.containsKey(word) ? counts.get(word) + 1 : 1;
counts.put(word, count);
collector.emit(new Values(word, count));
}
Trident
public static StormTopology buildTopology(LocalDRPC drpc) {
FixedBatchSpout spout = ...
TridentTopology topology = new TridentTopology();
TridentState wordCounts = topology.newStream("spout1", spout)
.each(new Fields("sentence"),new Split(), new Fields("word"))
.groupBy(new Fields("word"))
.persistentAggregate(new MemoryMapState.Factory(),
new Count(), new Fields("count"));
...
}
Trident
public static StormTopology buildTopology(LocalDRPC drpc) {
FixedBatchSpout spout = ...
TridentTopology topology = new TridentTopology();
TridentState wordCounts = topology.newStream("spout1", spout)
.each(new Fields("sentence"),new Split(), new Fields("word"))
.groupBy(new Fields("word"))
.persistentAggregate(new MemoryMapState.Factory(),
new Count(), new Fields("count"));
...
}
public static StormTopology buildTopology(LocalDRPC drpc) {
FixedBatchSpout spout = ...
TridentTopology topology = new TridentTopology();
TridentState wordCounts = topology.newStream("spout1", spout)
.each(new Fields("sentence"),new Split(), new Fields("word"))
.groupBy(new Fields("word"))
.persistentAggregate(new MemoryMapState.Factory(),
new Count(), new Fields("count"));
...
}
Trident
public static StormTopology buildTopology(LocalDRPC drpc) {
FixedBatchSpout spout = ...
TridentTopology topology = new TridentTopology();
TridentState wordCounts = topology.newStream("spout1", spout)
.each(new Fields("sentence"),new Split(), new Fields("word"))
.groupBy(new Fields("word"))
.persistentAggregate(new MemoryMapState.Factory(),
new Count(), new Fields("count"));
...
}
val conf = new SparkConf().setAppName("wordcount")
val ssc = new StreamingContext(conf, Seconds(1))
val text = ...
val counts = text.flatMap(line => line.split(" "))
.map(word => (word, 1))
.reduceByKey(_ + _)
counts.print()
ssc.start()
ssc.awaitTermination()
Spark Streaming
Spark Streaming
val conf = new SparkConf().setAppName("wordcount")
val ssc = new StreamingContext(conf, Seconds(1))
val text = ...
val counts = text.flatMap(line => line.split(" "))
.map(word => (word, 1))
.reduceByKey(_ + _)
counts.print()
ssc.start()
ssc.awaitTermination()
Spark Streaming
val conf = new SparkConf().setAppName("wordcount")
val ssc = new StreamingContext(conf, Seconds(1))
val text = ...
val counts = text.flatMap(line => line.split(" "))
.map(word => (word, 1))
.reduceByKey(_ + _)
counts.print()
ssc.start()
ssc.awaitTermination()
Spark Streaming
val conf = new SparkConf().setAppName("wordcount")
val ssc = new StreamingContext(conf, Seconds(1))
val text = ...
val counts = text.flatMap(line => line.split(" "))
.map(word => (word, 1))
.reduceByKey(_ + _)
counts.print()
ssc.start()
ssc.awaitTermination()
Samza
class WordCountTask extends StreamTask {
override def process(envelope: IncomingMessageEnvelope,
collector: MessageCollector, coordinator: TaskCoordinator) {
val text = envelope.getMessage.asInstanceOf[String]
val counts = text.split(" ")
.foldLeft(Map.empty[String, Int]) {
(count, word) => count + (word -> (count.getOrElse(word, 0) + 1))
}
collector.send(new OutgoingMessageEnvelope(new SystemStream("kafka", "wordcount"),
counts))
}
Samza
class WordCountTask extends StreamTask {
override def process(envelope: IncomingMessageEnvelope,
collector: MessageCollector, coordinator: TaskCoordinator) {
val text = envelope.getMessage.asInstanceOf[String]
val counts = text.split(" ")
.foldLeft(Map.empty[String, Int]) {
(count, word) => count + (word -> (count.getOrElse(word, 0) + 1))
}
collector.send(new OutgoingMessageEnvelope(new SystemStream("kafka", "wordcount"),
counts))
}
Samza
class WordCountTask extends StreamTask {
override def process(envelope: IncomingMessageEnvelope,
collector: MessageCollector, coordinator: TaskCoordinator) {
val text = envelope.getMessage.asInstanceOf[String]
val counts = text.split(" ")
.foldLeft(Map.empty[String, Int]) {
(count, word) => count + (word -> (count.getOrElse(word, 0) + 1))
}
collector.send(new OutgoingMessageEnvelope(new SystemStream("kafka", "wordcount"),
counts))
}
Flink
val env = ExecutionEnvironment.getExecutionEnvironment
val text = env.fromElements(...)
val counts = text.flatMap ( _.split(" ") )
.map ( (_, 1) )
.groupBy(0)
.sum(1)
counts.print()
env.execute("wordcount")
Flink
val env = ExecutionEnvironment.getExecutionEnvironment
val text = env.fromElements(...)
val counts = text.flatMap ( _.split(" ") )
.map ( (_, 1) )
.groupBy(0)
.sum(1)
counts.print()
env.execute("wordcount")
Fault Tolerance
● Fault tolerance in streaming systems is inherently harder that in batch
○ Can’t restart computation easily
○ State is a problem
○ Jobs can run 24/7
○ Fast recovery is critical
● A lot of challenges must be addressed
○ No single point of failure
○ Ensure processing of all incoming messages
○ State consistency
○ Fast recovery
○ Exactly once semantics is even harder
● Record acknowledgments
● Tracks the lineage of tuples in DAG as they are processed
● Special “ACK” bold track each lineage
● Able to replay from the root of failed tuples
● Performance difficulties
Storm & Trident
Reliable Processing
Acks are delivered via a system-level bolt
Acker Bold
ack ack
{A} {B}
ACK
Spark Streaming
● Failed DStreams can be recomputed using
their lineage
● Checkpointing to persistent data storage
○ Reduce lineage length
○ Recover metadata
● Computation redistributed from failed node
● Similar to restarting Spark’s failed batch job
faster recovery by using multiple nodes for
recomputations
failed tasks
failed node
Samza
● Takes an advantage of durable, partitioned, offset based messaging system
● Task monitors its offset, which moves forward as messages are processed
● Offset is checkpointed and can be restored on failureinputstream
Checkpoint
partition 0: offset 6
partition 1: offset 4
partition 2: offset 8
Partition 0
Partition 1
Partition 2
Samza
StreamTask
partition 0
StreamTask
partition 1
StreamTask
partition 2
Flink
● Based on distributed snapshots
● Sends checkpoint barriers through the stream
● Snapshots can be stored in durable storage
data stream
checkpoint
barrier n
checkpoint
barrier n-1
part of
checkpoint n+1
part of
checkpoint n
part of
checkpoint n-1
newer records older records
Managing State
● Most of the non-trivial streaming applications have a state
● Stateful operation looks like this:
f: (input, state) -> (output, state’)
● Delivery guarantees plays crucial role
○ At least once
■ Ensure all operators see all events ~ replay stream in failure case
○ Exactly once
■ Ensure that operators do not perform duplicate updates to their state
Storm & Trident
● States available only in Trident API
● Dedicated operators for state updates and queries
● State access methods
● Difficult to implement transactional state
● Exactly once semantics*
State
Transactional
Opaque
transactional
Non-
transactional
Non-
transactional
Transactional
Opaque
transactional
Spout
No No No
No
No No
Yes Yes
Yes
● State is implemented as another
stream
○ UpdateStateByKey()
○ TrackStateByKey()
○ Requires checkpointing
● Stateless runtime by design
● Exactly-once semantics
Spark Streaming
Input Stream
Job Job Job
Output
Stream
State
Micro-batch
processing
● Stateful operators, any task can hold state
● State is stored as another log
● Ships with 2 key-value stores
○ In-memory & RocksDB
○ Possible to send updates to kafka changelog to restore
store if needed
● Great for large data sets because of localized
storage
● Can use custom storage engines
● At-least once semantics, exactly once planned
Samza
Task Task
Input
Stream
Changelog
Stream
Output
Stream
● Stateful dataflow operators (conceptually similar to Samza)
● State access patterns
○ Local (Task) state - current state of a specific operator instance, operators do not interact
○ Partitioned (Key) state - maintains state of partitions (~ keys)
● Direct API integration
○ mapWithState(), flatMapWithState(), …
● Checkpointing
○ Pluggable backends for storing snapshots
● Exactly-once semantics
Flink
Operator
S1
S2
S3
Counting Words Revisited
Scala eXchange Apache
Apache Spark Storm
Apache Trident Flink
Streaming Samza Scala
2015 Streaming
(Apache,3)
(Scala, 2)
(Streaming, 2)
(eXchange, 1)
(Spark, 1)
(Storm, 1)
(Trident, 1)
(Flink, 1)
(Samza, 1)
(2015, 1)
Trident
import storm.trident.operation.builtin.Count;
TridentTopology topology = new TridentTopology();
TridentState wordCounts =
topology.newStream("spout1", spout)
.each(new Fields("sentence"), new Split(), new Fields("word"))
.groupBy(new Fields("word"))
.persistentAggregate(new MemoryMapState.Factory(), new Count(),
new Fields("count"))
.parallelismHint(6);
Trident
import storm.trident.operation.builtin.Count;
TridentTopology topology = new TridentTopology();
TridentState wordCounts =
topology.newStream("spout1", spout)
.each(new Fields("sentence"), new Split(), new Fields("word"))
.groupBy(new Fields("word"))
.persistentAggregate(new MemoryMapState.Factory(), new Count(),
new Fields("count"))
.parallelismHint(6);
// Initial RDD input to updateStateByKey
val initialRDD = ssc.sparkContext.parallelize(List.empty[(String, Int)])
val lines = ...
val words = lines.flatMap(_.split(" "))
val wordDstream = words.map(x => (x, 1))
val trackStateFunc = (batchTime: Time, word: String, one: Option[Int], state: State[Int]) => {
val sum = one.getOrElse(0) + state.getOption.getOrElse(0)
val output = (word, sum)
state.update(sum)
Some(output)
}
val stateDstream = wordDstream.trackStateByKey(
StateSpec.function(trackStateFunc).initialState(initialRDD))
Spark Streaming
Spark Streaming
// Initial RDD input to updateStateByKey
val initialRDD = ssc.sparkContext.parallelize(List.empty[(String, Int)])
val lines = ...
val words = lines.flatMap(_.split(" "))
val wordDstream = words.map(x => (x, 1))
val trackStateFunc = (batchTime: Time, word: String, one: Option[Int], state: State[Int]) => {
val sum = one.getOrElse(0) + state.getOption.getOrElse(0)
val output = (word, sum)
state.update(sum)
Some(output)
}
val stateDstream = wordDstream.trackStateByKey(
StateSpec.function(trackStateFunc).initialState(initialRDD))
Spark Streaming
// Initial RDD input to updateStateByKey
val initialRDD = ssc.sparkContext.parallelize(List.empty[(String, Int)])
val lines = ...
val words = lines.flatMap(_.split(" "))
val wordDstream = words.map(x => (x, 1))
val trackStateFunc = (batchTime: Time, word: String, one: Option[Int], state: State[Int]) => {
val sum = one.getOrElse(0) + state.getOption.getOrElse(0)
val output = (word, sum)
state.update(sum)
Some(output)
}
val stateDstream = wordDstream.trackStateByKey(
StateSpec.function(trackStateFunc).initialState(initialRDD))
Spark Streaming
// Initial RDD input to updateStateByKey
val initialRDD = ssc.sparkContext.parallelize(List.empty[(String, Int)])
val lines = ...
val words = lines.flatMap(_.split(" "))
val wordDstream = words.map(x => (x, 1))
val trackStateFunc = (batchTime: Time, word: String, one: Option[Int], state: State[Int]) => {
val sum = one.getOrElse(0) + state.getOption().getOrElse(0)
val output = (word, sum)
state.update(sum)
Some(output)
}
val stateDstream = wordDstream.trackStateByKey(
StateSpec.function(trackStateFunc).initialState(initialRDD))
Spark Streaming
// Initial RDD input to updateStateByKey
val initialRDD = ssc.sparkContext.parallelize(List.empty[(String, Int)])
val lines = ...
val words = lines.flatMap(_.split(" "))
val wordDstream = words.map(x => (x, 1))
val trackStateFunc = (batchTime: Time, word: String, one: Option[Int], state: State[Int]) => {
val sum = one.getOrElse(0) + state.getOption.getOrElse(0)
val output = (word, sum)
state.update(sum)
Some(output)
}
val stateDstream = wordDstream.trackStateByKey(
StateSpec.function(trackStateFunc).initialState(initialRDD))
Samza
class WordCountTask extends StreamTask with InitableTask {
private var store: CountStore = _
def init(config: Config, context: TaskContext) {
this.store = context.getStore("wordcount-store").asInstanceOf[KeyValueStore[String, Integer]]
}
override def process(envelope: IncomingMessageEnvelope,
collector: MessageCollector, coordinator: TaskCoordinator) {
val words = envelope.getMessage.asInstanceOf[String].split(" ")
words.foreach{ key =>
val count: Integer = Option(store.get(key)).getOrElse(0)
store.put(key, count + 1)
collector.send(new OutgoingMessageEnvelope(new SystemStream("kafka", "wordcount"), (key, count)))
}
}
class WordCountTask extends StreamTask with InitableTask {
private var store: CountStore = _
def init(config: Config, context: TaskContext) {
this.store = context.getStore("wordcount-store").asInstanceOf[KeyValueStore[String, Integer]]
}
override def process(envelope: IncomingMessageEnvelope,
collector: MessageCollector, coordinator: TaskCoordinator) {
val words = envelope.getMessage.asInstanceOf[String].split(" ")
words.foreach{ key =>
val count: Integer = Option(store.get(key)).getOrElse(0)
store.put(key, count + 1)
collector.send(new OutgoingMessageEnvelope(new SystemStream("kafka", "wordcount"), (key, count)))
}
}
Samza
class WordCountTask extends StreamTask with InitableTask {
private var store: CountStore = _
def init(config: Config, context: TaskContext) {
this.store = context.getStore("wordcount-store").asInstanceOf[KeyValueStore[String, Integer]]
}
override def process(envelope: IncomingMessageEnvelope,
collector: MessageCollector, coordinator: TaskCoordinator) {
val words = envelope.getMessage.asInstanceOf[String].split(" ")
words.foreach{ key =>
val count: Integer = Option(store.get(key)).getOrElse(0)
store.put(key, count + 1)
collector.send(new OutgoingMessageEnvelope(new SystemStream("kafka", "wordcount"), (key, count)))
}
}
Samza
Flink
val env = ExecutionEnvironment.getExecutionEnvironment
val text = env.fromElements(...)
val words = text.flatMap ( _.split(" ") )
words.keyBy(x => x).mapWithState{
(word, count: Option[Int]) =>
{
val newCount = count.getOrElse(0) + 1
val output = (word, newCount)
(output, Some(newCount))
}
}
...
val env = ExecutionEnvironment.getExecutionEnvironment
val text = env.fromElements(...)
val words = text.flatMap ( _.split(" ") )
words.keyBy(x => x).mapWithState{
(word, count: Option[Int]) =>
{
val newCount = count.getOrElse(0) + 1
val output = (word, newCount)
(output, Some(newCount))
}
}
...
Flink
Performance
● Hard to design not biased test, lots of variables
● Latency vs. Throughput
○ 500k/node/sec is ok, 1M/node/sec is nice and >1M/node/sec is great
○ Micro-batching latency usually in seconds, native in millis
● Guarantees
● Fault-tolerance cost
● State cost
● Network operations & Data locality
● Serialization
● Lots of tuning options
Project Maturity
● Storm & Trident
○ For a long time de-facto industrial standard
○ Widely used (Twitter, Yahoo!, Groupon, Spotify, Alibaba, Baidu and many more)
○ > 180 contributors
● Spark Streaming
○ Around 40% of Spark users use Streaming in Production or Prototyping
○ Significant uptake in adoption (Netflix, Cisco, DataStax, Pinterest, Intel, Pearson, … )
○ > 720 contributors (whole Spark)
● Samza
○ Used by LinkedIn and tens of other companies
○ > 30 contributors
● Flink
○ Still emerging, first production deployments
○ > 130 contributors
Summary
Native Micro-batching Native Native
Compositional Declarative Compositional Declarative
At-least-once Exactly-once At-least-once Exactly-once
Record ACKs
RDD based
Checkpointing Log-based Checkpointing
Not in-build
Dedicated
Operators
Stateful
Operators
Stateful
Operators
Very Low Medium Low Low
Low Medium High High
High High Medium Low
Micro-batching
Exactly-once
Dedicated
DStream
Medium
High
Streaming
Model
API
Guarantee
Fault
Tolerance
State
Management
Latency
Throughput
Maturity
TRIDENT
General Guidelines
● As always it depends
● Always evaluate particular application needs
● Fully understand internals improper use may have disastrous consequences
● Prefer High Level API
● Usually wants exactly once delivery
● Almost all non-trivial jobs have state
● Fast recovery is critical
○ Use Chaos Monkey or similar tool to be sure
Framework Recommendations
● Storm & Trident
○ Great fit for small and fast tasks
○ Very low (tens of milliseconds) latency
○ State & Fault tolerance degrades performance significantly
○ Potential update to Heron
■ Keeps API, according to Twitter better in every single way
■ Future open-sourcing is uncertain
● Spark Streaming
○ If Spark is already part of your infrastructure
○ Take advantage of various Spark libraries
○ Lambda architecture
○ Latency is not critical
○ Micro-batching limitations
Framework Recommendations
● Samza
○ Kafka is a cornerstone of your architecture
○ Application requires large states
○ At least once delivery is fine
● Flink
○ Conceptually great, fits very most use cases
○ Take advantage of batch processing capabilities
○ Need a functionality which is hard to implement in micro-batch
○ Enough courage to use emerging project
Questions
Thank you
● Jobs at www.cakesolutions.net/careers
● Mail petrz@cakesolutions.net
● Twitter @petr_zapletal

More Related Content

What's hot

Apache Storm Concepts
Apache Storm ConceptsApache Storm Concepts
Apache Storm ConceptsAndré Dias
 
Real-time streams and logs with Storm and Kafka
Real-time streams and logs with Storm and KafkaReal-time streams and logs with Storm and Kafka
Real-time streams and logs with Storm and KafkaAndrew Montalenti
 
Introduction to Twitter Storm
Introduction to Twitter StormIntroduction to Twitter Storm
Introduction to Twitter StormUwe Printz
 
Slide #1:Introduction to Apache Storm
Slide #1:Introduction to Apache StormSlide #1:Introduction to Apache Storm
Slide #1:Introduction to Apache StormMd. Shamsur Rahim
 
Hadoop Summit Europe 2014: Apache Storm Architecture
Hadoop Summit Europe 2014: Apache Storm ArchitectureHadoop Summit Europe 2014: Apache Storm Architecture
Hadoop Summit Europe 2014: Apache Storm ArchitectureP. Taylor Goetz
 
Cassandra and Storm at Health Market Sceince
Cassandra and Storm at Health Market SceinceCassandra and Storm at Health Market Sceince
Cassandra and Storm at Health Market SceinceP. Taylor Goetz
 
Real Time Data Streaming using Kafka & Storm
Real Time Data Streaming using Kafka & StormReal Time Data Streaming using Kafka & Storm
Real Time Data Streaming using Kafka & StormRan Silberman
 
Real-time Big Data Processing with Storm
Real-time Big Data Processing with StormReal-time Big Data Processing with Storm
Real-time Big Data Processing with Stormviirya
 
Developing Java Streaming Applications with Apache Storm
Developing Java Streaming Applications with Apache StormDeveloping Java Streaming Applications with Apache Storm
Developing Java Streaming Applications with Apache StormLester Martin
 
Introduction to Storm
Introduction to Storm Introduction to Storm
Introduction to Storm Chandler Huang
 
Realtime Analytics with Storm and Hadoop
Realtime Analytics with Storm and HadoopRealtime Analytics with Storm and Hadoop
Realtime Analytics with Storm and HadoopDataWorks Summit
 
Real-Time Big Data at In-Memory Speed, Using Storm
Real-Time Big Data at In-Memory Speed, Using StormReal-Time Big Data at In-Memory Speed, Using Storm
Real-Time Big Data at In-Memory Speed, Using StormNati Shalom
 
Learning Stream Processing with Apache Storm
Learning Stream Processing with Apache StormLearning Stream Processing with Apache Storm
Learning Stream Processing with Apache StormEugene Dvorkin
 
Storm: distributed and fault-tolerant realtime computation
Storm: distributed and fault-tolerant realtime computationStorm: distributed and fault-tolerant realtime computation
Storm: distributed and fault-tolerant realtime computationnathanmarz
 

What's hot (19)

Apache Storm Concepts
Apache Storm ConceptsApache Storm Concepts
Apache Storm Concepts
 
Real-time streams and logs with Storm and Kafka
Real-time streams and logs with Storm and KafkaReal-time streams and logs with Storm and Kafka
Real-time streams and logs with Storm and Kafka
 
Introduction to Twitter Storm
Introduction to Twitter StormIntroduction to Twitter Storm
Introduction to Twitter Storm
 
Slide #1:Introduction to Apache Storm
Slide #1:Introduction to Apache StormSlide #1:Introduction to Apache Storm
Slide #1:Introduction to Apache Storm
 
Hadoop Summit Europe 2014: Apache Storm Architecture
Hadoop Summit Europe 2014: Apache Storm ArchitectureHadoop Summit Europe 2014: Apache Storm Architecture
Hadoop Summit Europe 2014: Apache Storm Architecture
 
Cassandra and Storm at Health Market Sceince
Cassandra and Storm at Health Market SceinceCassandra and Storm at Health Market Sceince
Cassandra and Storm at Health Market Sceince
 
Real Time Data Streaming using Kafka & Storm
Real Time Data Streaming using Kafka & StormReal Time Data Streaming using Kafka & Storm
Real Time Data Streaming using Kafka & Storm
 
Storm
StormStorm
Storm
 
Real-time Big Data Processing with Storm
Real-time Big Data Processing with StormReal-time Big Data Processing with Storm
Real-time Big Data Processing with Storm
 
Developing Java Streaming Applications with Apache Storm
Developing Java Streaming Applications with Apache StormDeveloping Java Streaming Applications with Apache Storm
Developing Java Streaming Applications with Apache Storm
 
Apache Storm Tutorial
Apache Storm TutorialApache Storm Tutorial
Apache Storm Tutorial
 
Introduction to Storm
Introduction to Storm Introduction to Storm
Introduction to Storm
 
Realtime Analytics with Storm and Hadoop
Realtime Analytics with Storm and HadoopRealtime Analytics with Storm and Hadoop
Realtime Analytics with Storm and Hadoop
 
Resource Aware Scheduling in Apache Storm
Resource Aware Scheduling in Apache StormResource Aware Scheduling in Apache Storm
Resource Aware Scheduling in Apache Storm
 
Real-Time Big Data at In-Memory Speed, Using Storm
Real-Time Big Data at In-Memory Speed, Using StormReal-Time Big Data at In-Memory Speed, Using Storm
Real-Time Big Data at In-Memory Speed, Using Storm
 
Storm Anatomy
Storm AnatomyStorm Anatomy
Storm Anatomy
 
Storm and Cassandra
Storm and Cassandra Storm and Cassandra
Storm and Cassandra
 
Learning Stream Processing with Apache Storm
Learning Stream Processing with Apache StormLearning Stream Processing with Apache Storm
Learning Stream Processing with Apache Storm
 
Storm: distributed and fault-tolerant realtime computation
Storm: distributed and fault-tolerant realtime computationStorm: distributed and fault-tolerant realtime computation
Storm: distributed and fault-tolerant realtime computation
 

Viewers also liked

Strata NYC 2015: Sketching Big Data with Spark: randomized algorithms for lar...
Strata NYC 2015: Sketching Big Data with Spark: randomized algorithms for lar...Strata NYC 2015: Sketching Big Data with Spark: randomized algorithms for lar...
Strata NYC 2015: Sketching Big Data with Spark: randomized algorithms for lar...Databricks
 
(BDT309) Data Science & Best Practices for Apache Spark on Amazon EMR
(BDT309) Data Science & Best Practices for Apache Spark on Amazon EMR(BDT309) Data Science & Best Practices for Apache Spark on Amazon EMR
(BDT309) Data Science & Best Practices for Apache Spark on Amazon EMRAmazon Web Services
 
Kafka Reliability - When it absolutely, positively has to be there
Kafka Reliability - When it absolutely, positively has to be thereKafka Reliability - When it absolutely, positively has to be there
Kafka Reliability - When it absolutely, positively has to be thereGwen (Chen) Shapira
 
Josh Wills, MLconf 2013
Josh Wills, MLconf 2013Josh Wills, MLconf 2013
Josh Wills, MLconf 2013MLconf
 
Spark & Yarn better together 1.2
Spark & Yarn better together 1.2Spark & Yarn better together 1.2
Spark & Yarn better together 1.2Jianfeng Zhang
 
TopNotch: Systematically Quality Controlling Big Data by David Durst
TopNotch: Systematically Quality Controlling Big Data by David DurstTopNotch: Systematically Quality Controlling Big Data by David Durst
TopNotch: Systematically Quality Controlling Big Data by David DurstSpark Summit
 
Spark vs storm
Spark vs stormSpark vs storm
Spark vs stormTrong Ton
 
What's new in spark 2.0?
What's new in spark 2.0?What's new in spark 2.0?
What's new in spark 2.0?Örjan Lundberg
 
The Power of Both Choices: Practical Load Balancing for Distributed Stream Pr...
The Power of Both Choices: Practical Load Balancing for Distributed Stream Pr...The Power of Both Choices: Practical Load Balancing for Distributed Stream Pr...
The Power of Both Choices: Practical Load Balancing for Distributed Stream Pr...Anis Nasir
 
Anomaly Detection using Spark MLlib and Spark Streaming
Anomaly Detection using Spark MLlib and Spark StreamingAnomaly Detection using Spark MLlib and Spark Streaming
Anomaly Detection using Spark MLlib and Spark StreamingKeira Zhou
 
Scalable Distributed Real-Time Clustering for Big Data Streams
Scalable Distributed Real-Time Clustering for Big Data StreamsScalable Distributed Real-Time Clustering for Big Data Streams
Scalable Distributed Real-Time Clustering for Big Data StreamsAntonio Severien
 
Learning spark ch06 - Advanced Spark Programming
Learning spark ch06 - Advanced Spark ProgrammingLearning spark ch06 - Advanced Spark Programming
Learning spark ch06 - Advanced Spark Programmingphanleson
 
Introduction to Streaming Analytics
Introduction to Streaming AnalyticsIntroduction to Streaming Analytics
Introduction to Streaming AnalyticsGuido Schmutz
 
Data Workflows for Machine Learning - Seattle DAML
Data Workflows for Machine Learning - Seattle DAMLData Workflows for Machine Learning - Seattle DAML
Data Workflows for Machine Learning - Seattle DAMLPaco Nathan
 
Unit testing of spark applications
Unit testing of spark applicationsUnit testing of spark applications
Unit testing of spark applicationsKnoldus Inc.
 
How to deploy Apache Spark in a multi-tenant, on-premises environment
How to deploy Apache Spark in a multi-tenant, on-premises environmentHow to deploy Apache Spark in a multi-tenant, on-premises environment
How to deploy Apache Spark in a multi-tenant, on-premises environmentBlueData, Inc.
 
Resume Abhishek Vijaywargiya: Database Developer with 9 years of experience i...
Resume Abhishek Vijaywargiya: Database Developer with 9 years of experience i...Resume Abhishek Vijaywargiya: Database Developer with 9 years of experience i...
Resume Abhishek Vijaywargiya: Database Developer with 9 years of experience i...Abhishek Vijaywargiya
 
Data integration with Apache Kafka
Data integration with Apache KafkaData integration with Apache Kafka
Data integration with Apache Kafkaconfluent
 

Viewers also liked (20)

Strata NYC 2015: Sketching Big Data with Spark: randomized algorithms for lar...
Strata NYC 2015: Sketching Big Data with Spark: randomized algorithms for lar...Strata NYC 2015: Sketching Big Data with Spark: randomized algorithms for lar...
Strata NYC 2015: Sketching Big Data with Spark: randomized algorithms for lar...
 
(BDT309) Data Science & Best Practices for Apache Spark on Amazon EMR
(BDT309) Data Science & Best Practices for Apache Spark on Amazon EMR(BDT309) Data Science & Best Practices for Apache Spark on Amazon EMR
(BDT309) Data Science & Best Practices for Apache Spark on Amazon EMR
 
Kafka Reliability - When it absolutely, positively has to be there
Kafka Reliability - When it absolutely, positively has to be thereKafka Reliability - When it absolutely, positively has to be there
Kafka Reliability - When it absolutely, positively has to be there
 
Josh Wills, MLconf 2013
Josh Wills, MLconf 2013Josh Wills, MLconf 2013
Josh Wills, MLconf 2013
 
Spark & Yarn better together 1.2
Spark & Yarn better together 1.2Spark & Yarn better together 1.2
Spark & Yarn better together 1.2
 
TopNotch: Systematically Quality Controlling Big Data by David Durst
TopNotch: Systematically Quality Controlling Big Data by David DurstTopNotch: Systematically Quality Controlling Big Data by David Durst
TopNotch: Systematically Quality Controlling Big Data by David Durst
 
Spark vs storm
Spark vs stormSpark vs storm
Spark vs storm
 
What's new in spark 2.0?
What's new in spark 2.0?What's new in spark 2.0?
What's new in spark 2.0?
 
Big Data Paris
Big Data ParisBig Data Paris
Big Data Paris
 
The Power of Both Choices: Practical Load Balancing for Distributed Stream Pr...
The Power of Both Choices: Practical Load Balancing for Distributed Stream Pr...The Power of Both Choices: Practical Load Balancing for Distributed Stream Pr...
The Power of Both Choices: Practical Load Balancing for Distributed Stream Pr...
 
Anomaly Detection using Spark MLlib and Spark Streaming
Anomaly Detection using Spark MLlib and Spark StreamingAnomaly Detection using Spark MLlib and Spark Streaming
Anomaly Detection using Spark MLlib and Spark Streaming
 
Scalable Distributed Real-Time Clustering for Big Data Streams
Scalable Distributed Real-Time Clustering for Big Data StreamsScalable Distributed Real-Time Clustering for Big Data Streams
Scalable Distributed Real-Time Clustering for Big Data Streams
 
Learning spark ch06 - Advanced Spark Programming
Learning spark ch06 - Advanced Spark ProgrammingLearning spark ch06 - Advanced Spark Programming
Learning spark ch06 - Advanced Spark Programming
 
Introduction to Streaming Analytics
Introduction to Streaming AnalyticsIntroduction to Streaming Analytics
Introduction to Streaming Analytics
 
Data Workflows for Machine Learning - Seattle DAML
Data Workflows for Machine Learning - Seattle DAMLData Workflows for Machine Learning - Seattle DAML
Data Workflows for Machine Learning - Seattle DAML
 
Unit testing of spark applications
Unit testing of spark applicationsUnit testing of spark applications
Unit testing of spark applications
 
How to deploy Apache Spark in a multi-tenant, on-premises environment
How to deploy Apache Spark in a multi-tenant, on-premises environmentHow to deploy Apache Spark in a multi-tenant, on-premises environment
How to deploy Apache Spark in a multi-tenant, on-premises environment
 
Resume Abhishek Vijaywargiya: Database Developer with 9 years of experience i...
Resume Abhishek Vijaywargiya: Database Developer with 9 years of experience i...Resume Abhishek Vijaywargiya: Database Developer with 9 years of experience i...
Resume Abhishek Vijaywargiya: Database Developer with 9 years of experience i...
 
Overcoming cassandra query limitation spark
Overcoming cassandra query limitation sparkOvercoming cassandra query limitation spark
Overcoming cassandra query limitation spark
 
Data integration with Apache Kafka
Data integration with Apache KafkaData integration with Apache Kafka
Data integration with Apache Kafka
 

Similar to Stream Processing Frameworks Compared

Distributed Real-Time Stream Processing: Why and How 2.0
Distributed Real-Time Stream Processing:  Why and How 2.0Distributed Real-Time Stream Processing:  Why and How 2.0
Distributed Real-Time Stream Processing: Why and How 2.0Petr Zapletal
 
Distributed Real-Time Stream Processing: Why and How: Spark Summit East talk ...
Distributed Real-Time Stream Processing: Why and How: Spark Summit East talk ...Distributed Real-Time Stream Processing: Why and How: Spark Summit East talk ...
Distributed Real-Time Stream Processing: Why and How: Spark Summit East talk ...Spark Summit
 
Distributed Stream Processing - Spark Summit East 2017
Distributed Stream Processing - Spark Summit East 2017Distributed Stream Processing - Spark Summit East 2017
Distributed Stream Processing - Spark Summit East 2017Petr Zapletal
 
A performance analysis of OpenStack Cloud vs Real System on Hadoop Clusters
A performance analysis of OpenStack Cloud vs Real System on Hadoop ClustersA performance analysis of OpenStack Cloud vs Real System on Hadoop Clusters
A performance analysis of OpenStack Cloud vs Real System on Hadoop ClustersKumari Surabhi
 
Large-Scale Stream Processing in the Hadoop Ecosystem
Large-Scale Stream Processing in the Hadoop EcosystemLarge-Scale Stream Processing in the Hadoop Ecosystem
Large-Scale Stream Processing in the Hadoop EcosystemGyula Fóra
 
Understanding Hadoop
Understanding HadoopUnderstanding Hadoop
Understanding HadoopAhmed Ossama
 
Stream and Batch Processing in the Cloud with Data Microservices
Stream and Batch Processing in the Cloud with Data MicroservicesStream and Batch Processing in the Cloud with Data Microservices
Stream and Batch Processing in the Cloud with Data Microservicesmarius_bogoevici
 
MongoDB World 2019: Packing Up Your Data and Moving to MongoDB Atlas
MongoDB World 2019: Packing Up Your Data and Moving to MongoDB AtlasMongoDB World 2019: Packing Up Your Data and Moving to MongoDB Atlas
MongoDB World 2019: Packing Up Your Data and Moving to MongoDB AtlasMongoDB
 
Building a data pipeline to ingest data into Hadoop in minutes using Streamse...
Building a data pipeline to ingest data into Hadoop in minutes using Streamse...Building a data pipeline to ingest data into Hadoop in minutes using Streamse...
Building a data pipeline to ingest data into Hadoop in minutes using Streamse...Guglielmo Iozzia
 
Eko10 - Security Monitoring for Big Infrastructures without a Million Dollar ...
Eko10 - Security Monitoring for Big Infrastructures without a Million Dollar ...Eko10 - Security Monitoring for Big Infrastructures without a Million Dollar ...
Eko10 - Security Monitoring for Big Infrastructures without a Million Dollar ...Hernan Costante
 
Web-scale data processing: practical approaches for low-latency and batch
Web-scale data processing: practical approaches for low-latency and batchWeb-scale data processing: practical approaches for low-latency and batch
Web-scale data processing: practical approaches for low-latency and batchEdward Capriolo
 
USENIX LISA15: How TubeMogul Handles over One Trillion HTTP Requests a Month
USENIX LISA15: How TubeMogul Handles over One Trillion HTTP Requests a MonthUSENIX LISA15: How TubeMogul Handles over One Trillion HTTP Requests a Month
USENIX LISA15: How TubeMogul Handles over One Trillion HTTP Requests a MonthNicolas Brousse
 
OSMC 2018 | Learnings, patterns and Uber’s metrics platform M3, open sourced ...
OSMC 2018 | Learnings, patterns and Uber’s metrics platform M3, open sourced ...OSMC 2018 | Learnings, patterns and Uber’s metrics platform M3, open sourced ...
OSMC 2018 | Learnings, patterns and Uber’s metrics platform M3, open sourced ...NETWAYS
 
Data Science in the Cloud @StitchFix
Data Science in the Cloud @StitchFixData Science in the Cloud @StitchFix
Data Science in the Cloud @StitchFixC4Media
 
Wayfair Storefront Performance Monitoring with InfluxEnterprise by Richard La...
Wayfair Storefront Performance Monitoring with InfluxEnterprise by Richard La...Wayfair Storefront Performance Monitoring with InfluxEnterprise by Richard La...
Wayfair Storefront Performance Monitoring with InfluxEnterprise by Richard La...InfluxData
 
AWS Big Data Demystified #1: Big data architecture lessons learned
AWS Big Data Demystified #1: Big data architecture lessons learned AWS Big Data Demystified #1: Big data architecture lessons learned
AWS Big Data Demystified #1: Big data architecture lessons learned Omid Vahdaty
 
Ledingkart Meetup #2: Scaling Search @Lendingkart
Ledingkart Meetup #2: Scaling Search @LendingkartLedingkart Meetup #2: Scaling Search @Lendingkart
Ledingkart Meetup #2: Scaling Search @LendingkartMukesh Singh
 
Structured Streaming in Spark
Structured Streaming in SparkStructured Streaming in Spark
Structured Streaming in SparkDigital Vidya
 

Similar to Stream Processing Frameworks Compared (20)

Distributed Real-Time Stream Processing: Why and How 2.0
Distributed Real-Time Stream Processing:  Why and How 2.0Distributed Real-Time Stream Processing:  Why and How 2.0
Distributed Real-Time Stream Processing: Why and How 2.0
 
Distributed Real-Time Stream Processing: Why and How: Spark Summit East talk ...
Distributed Real-Time Stream Processing: Why and How: Spark Summit East talk ...Distributed Real-Time Stream Processing: Why and How: Spark Summit East talk ...
Distributed Real-Time Stream Processing: Why and How: Spark Summit East talk ...
 
Distributed Stream Processing - Spark Summit East 2017
Distributed Stream Processing - Spark Summit East 2017Distributed Stream Processing - Spark Summit East 2017
Distributed Stream Processing - Spark Summit East 2017
 
A performance analysis of OpenStack Cloud vs Real System on Hadoop Clusters
A performance analysis of OpenStack Cloud vs Real System on Hadoop ClustersA performance analysis of OpenStack Cloud vs Real System on Hadoop Clusters
A performance analysis of OpenStack Cloud vs Real System on Hadoop Clusters
 
Large-Scale Stream Processing in the Hadoop Ecosystem
Large-Scale Stream Processing in the Hadoop EcosystemLarge-Scale Stream Processing in the Hadoop Ecosystem
Large-Scale Stream Processing in the Hadoop Ecosystem
 
Understanding Hadoop
Understanding HadoopUnderstanding Hadoop
Understanding Hadoop
 
Stream and Batch Processing in the Cloud with Data Microservices
Stream and Batch Processing in the Cloud with Data MicroservicesStream and Batch Processing in the Cloud with Data Microservices
Stream and Batch Processing in the Cloud with Data Microservices
 
MongoDB World 2019: Packing Up Your Data and Moving to MongoDB Atlas
MongoDB World 2019: Packing Up Your Data and Moving to MongoDB AtlasMongoDB World 2019: Packing Up Your Data and Moving to MongoDB Atlas
MongoDB World 2019: Packing Up Your Data and Moving to MongoDB Atlas
 
Building a data pipeline to ingest data into Hadoop in minutes using Streamse...
Building a data pipeline to ingest data into Hadoop in minutes using Streamse...Building a data pipeline to ingest data into Hadoop in minutes using Streamse...
Building a data pipeline to ingest data into Hadoop in minutes using Streamse...
 
Eko10 - Security Monitoring for Big Infrastructures without a Million Dollar ...
Eko10 - Security Monitoring for Big Infrastructures without a Million Dollar ...Eko10 - Security Monitoring for Big Infrastructures without a Million Dollar ...
Eko10 - Security Monitoring for Big Infrastructures without a Million Dollar ...
 
Web-scale data processing: practical approaches for low-latency and batch
Web-scale data processing: practical approaches for low-latency and batchWeb-scale data processing: practical approaches for low-latency and batch
Web-scale data processing: practical approaches for low-latency and batch
 
Netty training
Netty trainingNetty training
Netty training
 
Netty training
Netty trainingNetty training
Netty training
 
USENIX LISA15: How TubeMogul Handles over One Trillion HTTP Requests a Month
USENIX LISA15: How TubeMogul Handles over One Trillion HTTP Requests a MonthUSENIX LISA15: How TubeMogul Handles over One Trillion HTTP Requests a Month
USENIX LISA15: How TubeMogul Handles over One Trillion HTTP Requests a Month
 
OSMC 2018 | Learnings, patterns and Uber’s metrics platform M3, open sourced ...
OSMC 2018 | Learnings, patterns and Uber’s metrics platform M3, open sourced ...OSMC 2018 | Learnings, patterns and Uber’s metrics platform M3, open sourced ...
OSMC 2018 | Learnings, patterns and Uber’s metrics platform M3, open sourced ...
 
Data Science in the Cloud @StitchFix
Data Science in the Cloud @StitchFixData Science in the Cloud @StitchFix
Data Science in the Cloud @StitchFix
 
Wayfair Storefront Performance Monitoring with InfluxEnterprise by Richard La...
Wayfair Storefront Performance Monitoring with InfluxEnterprise by Richard La...Wayfair Storefront Performance Monitoring with InfluxEnterprise by Richard La...
Wayfair Storefront Performance Monitoring with InfluxEnterprise by Richard La...
 
AWS Big Data Demystified #1: Big data architecture lessons learned
AWS Big Data Demystified #1: Big data architecture lessons learned AWS Big Data Demystified #1: Big data architecture lessons learned
AWS Big Data Demystified #1: Big data architecture lessons learned
 
Ledingkart Meetup #2: Scaling Search @Lendingkart
Ledingkart Meetup #2: Scaling Search @LendingkartLedingkart Meetup #2: Scaling Search @Lendingkart
Ledingkart Meetup #2: Scaling Search @Lendingkart
 
Structured Streaming in Spark
Structured Streaming in SparkStructured Streaming in Spark
Structured Streaming in Spark
 

More from Petr Zapletal

Change Data Capture - Scale by the Bay 2019
Change Data Capture - Scale by the Bay 2019Change Data Capture - Scale by the Bay 2019
Change Data Capture - Scale by the Bay 2019Petr Zapletal
 
Adopting GraalVM - NE Scala 2019
Adopting GraalVM - NE Scala 2019Adopting GraalVM - NE Scala 2019
Adopting GraalVM - NE Scala 2019Petr Zapletal
 
Adopting GraalVM - Scala eXchange London 2018
Adopting GraalVM - Scala eXchange London 2018Adopting GraalVM - Scala eXchange London 2018
Adopting GraalVM - Scala eXchange London 2018Petr Zapletal
 
Adopting GraalVM - Scale by the Bay 2018
Adopting GraalVM - Scale by the Bay 2018Adopting GraalVM - Scale by the Bay 2018
Adopting GraalVM - Scale by the Bay 2018Petr Zapletal
 
Real World Serverless
Real World ServerlessReal World Serverless
Real World ServerlessPetr Zapletal
 
Reactive mistakes - ScalaDays Chicago 2017
Reactive mistakes -  ScalaDays Chicago 2017Reactive mistakes -  ScalaDays Chicago 2017
Reactive mistakes - ScalaDays Chicago 2017Petr Zapletal
 
Reactive mistakes reactive nyc
Reactive mistakes   reactive nycReactive mistakes   reactive nyc
Reactive mistakes reactive nycPetr Zapletal
 
Top Mistakes When Writing Reactive Applications - Scala by the Bay 2016
Top Mistakes When Writing Reactive Applications - Scala by the Bay 2016Top Mistakes When Writing Reactive Applications - Scala by the Bay 2016
Top Mistakes When Writing Reactive Applications - Scala by the Bay 2016Petr Zapletal
 
Spark Concepts - Spark SQL, Graphx, Streaming
Spark Concepts - Spark SQL, Graphx, StreamingSpark Concepts - Spark SQL, Graphx, Streaming
Spark Concepts - Spark SQL, Graphx, StreamingPetr Zapletal
 
MLlib and Machine Learning on Spark
MLlib and Machine Learning on SparkMLlib and Machine Learning on Spark
MLlib and Machine Learning on SparkPetr Zapletal
 

More from Petr Zapletal (10)

Change Data Capture - Scale by the Bay 2019
Change Data Capture - Scale by the Bay 2019Change Data Capture - Scale by the Bay 2019
Change Data Capture - Scale by the Bay 2019
 
Adopting GraalVM - NE Scala 2019
Adopting GraalVM - NE Scala 2019Adopting GraalVM - NE Scala 2019
Adopting GraalVM - NE Scala 2019
 
Adopting GraalVM - Scala eXchange London 2018
Adopting GraalVM - Scala eXchange London 2018Adopting GraalVM - Scala eXchange London 2018
Adopting GraalVM - Scala eXchange London 2018
 
Adopting GraalVM - Scale by the Bay 2018
Adopting GraalVM - Scale by the Bay 2018Adopting GraalVM - Scale by the Bay 2018
Adopting GraalVM - Scale by the Bay 2018
 
Real World Serverless
Real World ServerlessReal World Serverless
Real World Serverless
 
Reactive mistakes - ScalaDays Chicago 2017
Reactive mistakes -  ScalaDays Chicago 2017Reactive mistakes -  ScalaDays Chicago 2017
Reactive mistakes - ScalaDays Chicago 2017
 
Reactive mistakes reactive nyc
Reactive mistakes   reactive nycReactive mistakes   reactive nyc
Reactive mistakes reactive nyc
 
Top Mistakes When Writing Reactive Applications - Scala by the Bay 2016
Top Mistakes When Writing Reactive Applications - Scala by the Bay 2016Top Mistakes When Writing Reactive Applications - Scala by the Bay 2016
Top Mistakes When Writing Reactive Applications - Scala by the Bay 2016
 
Spark Concepts - Spark SQL, Graphx, Streaming
Spark Concepts - Spark SQL, Graphx, StreamingSpark Concepts - Spark SQL, Graphx, Streaming
Spark Concepts - Spark SQL, Graphx, Streaming
 
MLlib and Machine Learning on Spark
MLlib and Machine Learning on SparkMLlib and Machine Learning on Spark
MLlib and Machine Learning on Spark
 

Recently uploaded

IMA MSN - Medical Students Network (2).pptx
IMA MSN - Medical Students Network (2).pptxIMA MSN - Medical Students Network (2).pptx
IMA MSN - Medical Students Network (2).pptxdolaknnilon
 
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...dajasot375
 
Defining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data StoryDefining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data StoryJeremy Anderson
 
INTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTDINTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTDRafezzaman
 
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptxEMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptxthyngster
 
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort servicejennyeacort
 
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdfKantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdfSocial Samosa
 
20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdf20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdfHuman37
 
GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]📊 Markus Baersch
 
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)jennyeacort
 
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degreeyuu sss
 
Predicting Salary Using Data Science: A Comprehensive Analysis.pdf
Predicting Salary Using Data Science: A Comprehensive Analysis.pdfPredicting Salary Using Data Science: A Comprehensive Analysis.pdf
Predicting Salary Using Data Science: A Comprehensive Analysis.pdfBoston Institute of Analytics
 
PKS-TGC-1084-630 - Stage 1 Proposal.pptx
PKS-TGC-1084-630 - Stage 1 Proposal.pptxPKS-TGC-1084-630 - Stage 1 Proposal.pptx
PKS-TGC-1084-630 - Stage 1 Proposal.pptxPramod Kumar Srivastava
 
MK KOMUNIKASI DATA (TI)komdat komdat.docx
MK KOMUNIKASI DATA (TI)komdat komdat.docxMK KOMUNIKASI DATA (TI)komdat komdat.docx
MK KOMUNIKASI DATA (TI)komdat komdat.docxUnduhUnggah1
 
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...Jack DiGiovanna
 
Multiple time frame trading analysis -brianshannon.pdf
Multiple time frame trading analysis -brianshannon.pdfMultiple time frame trading analysis -brianshannon.pdf
Multiple time frame trading analysis -brianshannon.pdfchwongval
 
2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSING
2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSING2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSING
2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSINGmarianagonzalez07
 
ASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel CanterASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel Cantervoginip
 
Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024Colleen Farrelly
 

Recently uploaded (20)

IMA MSN - Medical Students Network (2).pptx
IMA MSN - Medical Students Network (2).pptxIMA MSN - Medical Students Network (2).pptx
IMA MSN - Medical Students Network (2).pptx
 
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
 
Defining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data StoryDefining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data Story
 
INTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTDINTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTD
 
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptxEMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptx
 
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
 
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdfKantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
 
20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdf20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdf
 
GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]
 
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
 
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
 
Call Girls in Saket 99530🔝 56974 Escort Service
Call Girls in Saket 99530🔝 56974 Escort ServiceCall Girls in Saket 99530🔝 56974 Escort Service
Call Girls in Saket 99530🔝 56974 Escort Service
 
Predicting Salary Using Data Science: A Comprehensive Analysis.pdf
Predicting Salary Using Data Science: A Comprehensive Analysis.pdfPredicting Salary Using Data Science: A Comprehensive Analysis.pdf
Predicting Salary Using Data Science: A Comprehensive Analysis.pdf
 
PKS-TGC-1084-630 - Stage 1 Proposal.pptx
PKS-TGC-1084-630 - Stage 1 Proposal.pptxPKS-TGC-1084-630 - Stage 1 Proposal.pptx
PKS-TGC-1084-630 - Stage 1 Proposal.pptx
 
MK KOMUNIKASI DATA (TI)komdat komdat.docx
MK KOMUNIKASI DATA (TI)komdat komdat.docxMK KOMUNIKASI DATA (TI)komdat komdat.docx
MK KOMUNIKASI DATA (TI)komdat komdat.docx
 
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
 
Multiple time frame trading analysis -brianshannon.pdf
Multiple time frame trading analysis -brianshannon.pdfMultiple time frame trading analysis -brianshannon.pdf
Multiple time frame trading analysis -brianshannon.pdf
 
2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSING
2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSING2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSING
2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSING
 
ASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel CanterASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel Canter
 
Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024
 

Stream Processing Frameworks Compared

  • 1. Distributed Real-Time Stream Processing: Why and How Petr Zapletal @petr_zapletal Scala eXchange 2015
  • 2. Agenda ● Motivation ● Stream Processing ● Available Frameworks ● Systems Comparison ● Recommendations
  • 3. ● 8 Zettabytes (1 ZB = 1021 B = 1 billion TB) created in 2015 ● Every minute we create ○ 200 million emails ○ 48 hours of YouTube video ○ 2 million google queries ○ 200 000 tweets ○ ... ● How can we make sense of all data ○ Most data is not interesting ○ New data supersedes old data ○ Challenge is not only storage but processing The Data Deluge
  • 4. New Sources And New Use Cases ● Many new sources of data become available ○ Sensors ○ Mobile devices ○ Web feeds ○ Social networking ○ Cameras ○ Databases ○ ... ● Even more use cases become viable ○ Web/Social feed mining ○ Real-time data analysis ○ Fraud detection ○ Smart order routing ○ Intelligence and surveillance ○ Pricing and analytics ○ Trends detection ○ Log processing ○ Real-time data aggregation ○ …
  • 5. Stream Processing to the Rescue ● Process data streams on-the-fly without permanent storage ● Stream data rates can be high ○ High resource requirements for processing (clusters, data centres) ● Processing stream data has real-time aspect ○ Latency of data processing matters ○ Must be able to react to events as they occur
  • 6. Streaming Applications ETL Operations ● Transformations, joining or filtering of incoming data Windowing ● Trends in bounded interval, like tweets or sales
  • 7. Streaming Applications Machine Learning ● Clustering, Trend fitting, Classification Pattern Recognition ● Fraud detection, Signal triggering, Anomaly detection
  • 8. Processing Architecture Evolution Batch Pipeline Kappa ArchitectureLambda Architecture Standalone Stream Processing Stream Processing Batch Layer Serving Layer Stream layer Query Query Allyourdata Oozie Stream Processing Query Serving DBHDFS Query
  • 9. Distributed Stream Processing ● Continuous processing, aggregation and analysis of unbounded data ● General computational model as MapReduce ● Expected latency in milliseconds or seconds ● Systems often modelled as Directed Acyclic Graph (DAG) ● Describes topology of streaming job ● Data flows through chain of processors from sources to sinks
  • 10. Points of Interest ● Runtime and programming model ● Primitives ● State management ● Message delivery guarantees ● Fault tolerance & Low overhead recovery ● Latency, Throughput & Scalability ● Maturity and Adoption Level ● Ease of development and operability
  • 11. Runtime and Programming Model ● Most important trait of stream processing system ● Defines expressiveness, possible operations and its limitations ● Therefore defines systems capabilities and its use cases
  • 12. Native Streaming Native stream processing systems continuous operator model record Source Operator Processing Operator Processing Operator Sink Operator records processed one at a time
  • 14. Native Streaming ● Records are processed as they arrive ● Native model with general processing ability Pros ➔ Expressiveness ➔ Low-latency ➔ Stateful operations Cons ➔ Throughput ➔ Fault-tolerance is expensive ➔ Load-balancing
  • 15. Micro-batching ● Splits incoming stream into small batches ● Batch interval inevitably limits system expressiveness ● Can be built atop Native streaming easily Pros ➔ High-throughput ➔ Easier fault tolerance ➔ Simpler load-balancing Cons ➔ Lower latency, depends on batch interval ➔ Limited expressivity ➔ Harder stateful operations
  • 16. Programming Model Compositional ➔ Provides basic building blocks as operators or sources ➔ Custom component definition ➔ Manual Topology definition & optimization ➔ Advanced functionality often missing Declarative ➔ High-level API ➔ Operators as higher order functions ➔ Abstract data types ➔ Advance operations like state management or windowing supported out of the box ➔ Advanced optimizers
  • 18. Storm ● Originally created by Nathan Marz and his team at BackType in 2010 ● Being acquired and later open-sourced by Twitter, Apache project top-level since 2014 ● Pioneer in large scale stream processing ● Low-level native streaming API ● Uses Thrift for topology definition ● Large number of API languages available ○ Storm Multi-Language Protocol
  • 19. Trident ● Higher level micro-batching system build atop Storm ● Stream is partitioned into a small batches ● Simplifies building topologies ● Java, Clojure and Scala API ● Provides exactly once delivery ● Adds higher level operations ○ Aggregations ○ State operations ○ Joining, merging , grouping, windowing, etc.
  • 20. Spark Streaming ● Spark started in 2009 at UC Berkeley, Apache since since 2013 ● General engine for large scale batch processing ● Spark Streaming introduced in 0.7, came out of alpha in 0.9 (Feb 2014) ● Unified batch and stream processing over a batch runtime ● Great integration with batch processing and its build-in libraries (SparkSQL, MLlib, GraphX) ● Scala, Java & Python API input data stream Spark Streaming Spark Engine batches of input data batches of processed data
  • 21. Samza ● Developed in LinkedIn, open-sourced in 2013 ● Builds heavily on Kafka’s log based philosophy ● Pluggable messaging system and executional backend ○ Uses Kafka & YARN usually ● JVM languages, usually Java or Scala Task 1 Task 2 Task 3
  • 22. Flink ● Started as Stratosphere in 2008 at as Academic project ● Native streaming ● High level API ● Batch as special case of Streaming (bounded vs unbounded dataset) ● Provides ML (FlinkML) and graph processing (Gelly) out of the box ● Java & Scala API Stream Data Batch Data Kafka, RabbitMQ, ... HDFS, JDBC, ...
  • 23. System Comparison Native Micro-batching Native Native Compositional Declarative Compositional Declarative At-least-once Exactly-once At-least-once Exactly-once Record ACKs RDD based Checkpointing Log-based Checkpointing Not build-in Dedicated Operators Stateful Operators Stateful Operators Very Low Medium Low Low Low Medium High High High High Medium Low Micro-batching Exactly-once Dedicated DStream Medium High Streaming Model API Guarantees Fault Tolerance State Management Latency Throughput Maturity TRIDENT
  • 24. Counting Words Scala eXchange Apache Apache Spark Storm Apache Trident Flink Streaming Samza Scala 2015 Streaming (Apache,3) (Scala, 2) (Streaming, 2) (eXchange, 1) (Spark, 1) (Storm, 1) (Trident, 1) (Flink, 1) (Samza, 1) (2015, 1)
  • 25. Storm TopologyBuilder builder = new TopologyBuilder(); builder.setSpout("spout", new RandomSentenceSpout(), 5); builder.setBolt("split", new Split(), 8).shuffleGrouping("spout"); builder.setBolt("count", new WordCount(), 12).fieldsGrouping("split", new Fields("word")); ... Map<String, Integer> counts = new HashMap<String, Integer>(); public void execute(Tuple tuple, BasicOutputCollector collector) { String word = tuple.getString(0); Integer count = counts.containsKey(word) ? counts.get(word) + 1 : 1; counts.put(word, count); collector.emit(new Values(word, count)); }
  • 26. Storm TopologyBuilder builder = new TopologyBuilder(); builder.setSpout("spout", new RandomSentenceSpout(), 5); builder.setBolt("split", new Split(), 8).shuffleGrouping("spout"); builder.setBolt("count", new WordCount(), 12).fieldsGrouping("split", new Fields("word")); ... Map<String, Integer> counts = new HashMap<String, Integer>(); public void execute(Tuple tuple, BasicOutputCollector collector) { String word = tuple.getString(0); Integer count = counts.containsKey(word) ? counts.get(word) + 1 : 1; counts.put(word, count); collector.emit(new Values(word, count)); }
  • 27. Storm TopologyBuilder builder = new TopologyBuilder(); builder.setSpout("spout", new RandomSentenceSpout(), 5); builder.setBolt("split", new Split(), 8).shuffleGrouping("spout"); builder.setBolt("count", new WordCount(), 12).fieldsGrouping("split", new Fields("word")); ... Map<String, Integer> counts = new HashMap<String, Integer>(); public void execute(Tuple tuple, BasicOutputCollector collector) { String word = tuple.getString(0); Integer count = counts.containsKey(word) ? counts.get(word) + 1 : 1; counts.put(word, count); collector.emit(new Values(word, count)); }
  • 28. Trident public static StormTopology buildTopology(LocalDRPC drpc) { FixedBatchSpout spout = ... TridentTopology topology = new TridentTopology(); TridentState wordCounts = topology.newStream("spout1", spout) .each(new Fields("sentence"),new Split(), new Fields("word")) .groupBy(new Fields("word")) .persistentAggregate(new MemoryMapState.Factory(), new Count(), new Fields("count")); ... }
  • 29. Trident public static StormTopology buildTopology(LocalDRPC drpc) { FixedBatchSpout spout = ... TridentTopology topology = new TridentTopology(); TridentState wordCounts = topology.newStream("spout1", spout) .each(new Fields("sentence"),new Split(), new Fields("word")) .groupBy(new Fields("word")) .persistentAggregate(new MemoryMapState.Factory(), new Count(), new Fields("count")); ... }
  • 30. public static StormTopology buildTopology(LocalDRPC drpc) { FixedBatchSpout spout = ... TridentTopology topology = new TridentTopology(); TridentState wordCounts = topology.newStream("spout1", spout) .each(new Fields("sentence"),new Split(), new Fields("word")) .groupBy(new Fields("word")) .persistentAggregate(new MemoryMapState.Factory(), new Count(), new Fields("count")); ... } Trident public static StormTopology buildTopology(LocalDRPC drpc) { FixedBatchSpout spout = ... TridentTopology topology = new TridentTopology(); TridentState wordCounts = topology.newStream("spout1", spout) .each(new Fields("sentence"),new Split(), new Fields("word")) .groupBy(new Fields("word")) .persistentAggregate(new MemoryMapState.Factory(), new Count(), new Fields("count")); ... }
  • 31. val conf = new SparkConf().setAppName("wordcount") val ssc = new StreamingContext(conf, Seconds(1)) val text = ... val counts = text.flatMap(line => line.split(" ")) .map(word => (word, 1)) .reduceByKey(_ + _) counts.print() ssc.start() ssc.awaitTermination() Spark Streaming
  • 32. Spark Streaming val conf = new SparkConf().setAppName("wordcount") val ssc = new StreamingContext(conf, Seconds(1)) val text = ... val counts = text.flatMap(line => line.split(" ")) .map(word => (word, 1)) .reduceByKey(_ + _) counts.print() ssc.start() ssc.awaitTermination()
  • 33. Spark Streaming val conf = new SparkConf().setAppName("wordcount") val ssc = new StreamingContext(conf, Seconds(1)) val text = ... val counts = text.flatMap(line => line.split(" ")) .map(word => (word, 1)) .reduceByKey(_ + _) counts.print() ssc.start() ssc.awaitTermination()
  • 34. Spark Streaming val conf = new SparkConf().setAppName("wordcount") val ssc = new StreamingContext(conf, Seconds(1)) val text = ... val counts = text.flatMap(line => line.split(" ")) .map(word => (word, 1)) .reduceByKey(_ + _) counts.print() ssc.start() ssc.awaitTermination()
  • 35. Samza class WordCountTask extends StreamTask { override def process(envelope: IncomingMessageEnvelope, collector: MessageCollector, coordinator: TaskCoordinator) { val text = envelope.getMessage.asInstanceOf[String] val counts = text.split(" ") .foldLeft(Map.empty[String, Int]) { (count, word) => count + (word -> (count.getOrElse(word, 0) + 1)) } collector.send(new OutgoingMessageEnvelope(new SystemStream("kafka", "wordcount"), counts)) }
  • 36. Samza class WordCountTask extends StreamTask { override def process(envelope: IncomingMessageEnvelope, collector: MessageCollector, coordinator: TaskCoordinator) { val text = envelope.getMessage.asInstanceOf[String] val counts = text.split(" ") .foldLeft(Map.empty[String, Int]) { (count, word) => count + (word -> (count.getOrElse(word, 0) + 1)) } collector.send(new OutgoingMessageEnvelope(new SystemStream("kafka", "wordcount"), counts)) }
  • 37. Samza class WordCountTask extends StreamTask { override def process(envelope: IncomingMessageEnvelope, collector: MessageCollector, coordinator: TaskCoordinator) { val text = envelope.getMessage.asInstanceOf[String] val counts = text.split(" ") .foldLeft(Map.empty[String, Int]) { (count, word) => count + (word -> (count.getOrElse(word, 0) + 1)) } collector.send(new OutgoingMessageEnvelope(new SystemStream("kafka", "wordcount"), counts)) }
  • 38. Flink val env = ExecutionEnvironment.getExecutionEnvironment val text = env.fromElements(...) val counts = text.flatMap ( _.split(" ") ) .map ( (_, 1) ) .groupBy(0) .sum(1) counts.print() env.execute("wordcount")
  • 39. Flink val env = ExecutionEnvironment.getExecutionEnvironment val text = env.fromElements(...) val counts = text.flatMap ( _.split(" ") ) .map ( (_, 1) ) .groupBy(0) .sum(1) counts.print() env.execute("wordcount")
  • 40. Fault Tolerance ● Fault tolerance in streaming systems is inherently harder that in batch ○ Can’t restart computation easily ○ State is a problem ○ Jobs can run 24/7 ○ Fast recovery is critical ● A lot of challenges must be addressed ○ No single point of failure ○ Ensure processing of all incoming messages ○ State consistency ○ Fast recovery ○ Exactly once semantics is even harder
  • 41. ● Record acknowledgments ● Tracks the lineage of tuples in DAG as they are processed ● Special “ACK” bold track each lineage ● Able to replay from the root of failed tuples ● Performance difficulties Storm & Trident Reliable Processing Acks are delivered via a system-level bolt Acker Bold ack ack {A} {B} ACK
  • 42. Spark Streaming ● Failed DStreams can be recomputed using their lineage ● Checkpointing to persistent data storage ○ Reduce lineage length ○ Recover metadata ● Computation redistributed from failed node ● Similar to restarting Spark’s failed batch job faster recovery by using multiple nodes for recomputations failed tasks failed node
  • 43. Samza ● Takes an advantage of durable, partitioned, offset based messaging system ● Task monitors its offset, which moves forward as messages are processed ● Offset is checkpointed and can be restored on failureinputstream Checkpoint partition 0: offset 6 partition 1: offset 4 partition 2: offset 8 Partition 0 Partition 1 Partition 2 Samza StreamTask partition 0 StreamTask partition 1 StreamTask partition 2
  • 44. Flink ● Based on distributed snapshots ● Sends checkpoint barriers through the stream ● Snapshots can be stored in durable storage data stream checkpoint barrier n checkpoint barrier n-1 part of checkpoint n+1 part of checkpoint n part of checkpoint n-1 newer records older records
  • 45. Managing State ● Most of the non-trivial streaming applications have a state ● Stateful operation looks like this: f: (input, state) -> (output, state’) ● Delivery guarantees plays crucial role ○ At least once ■ Ensure all operators see all events ~ replay stream in failure case ○ Exactly once ■ Ensure that operators do not perform duplicate updates to their state
  • 46. Storm & Trident ● States available only in Trident API ● Dedicated operators for state updates and queries ● State access methods ● Difficult to implement transactional state ● Exactly once semantics* State Transactional Opaque transactional Non- transactional Non- transactional Transactional Opaque transactional Spout No No No No No No Yes Yes Yes
  • 47. ● State is implemented as another stream ○ UpdateStateByKey() ○ TrackStateByKey() ○ Requires checkpointing ● Stateless runtime by design ● Exactly-once semantics Spark Streaming Input Stream Job Job Job Output Stream State Micro-batch processing
  • 48. ● Stateful operators, any task can hold state ● State is stored as another log ● Ships with 2 key-value stores ○ In-memory & RocksDB ○ Possible to send updates to kafka changelog to restore store if needed ● Great for large data sets because of localized storage ● Can use custom storage engines ● At-least once semantics, exactly once planned Samza Task Task Input Stream Changelog Stream Output Stream
  • 49. ● Stateful dataflow operators (conceptually similar to Samza) ● State access patterns ○ Local (Task) state - current state of a specific operator instance, operators do not interact ○ Partitioned (Key) state - maintains state of partitions (~ keys) ● Direct API integration ○ mapWithState(), flatMapWithState(), … ● Checkpointing ○ Pluggable backends for storing snapshots ● Exactly-once semantics Flink Operator S1 S2 S3
  • 50. Counting Words Revisited Scala eXchange Apache Apache Spark Storm Apache Trident Flink Streaming Samza Scala 2015 Streaming (Apache,3) (Scala, 2) (Streaming, 2) (eXchange, 1) (Spark, 1) (Storm, 1) (Trident, 1) (Flink, 1) (Samza, 1) (2015, 1)
  • 51. Trident import storm.trident.operation.builtin.Count; TridentTopology topology = new TridentTopology(); TridentState wordCounts = topology.newStream("spout1", spout) .each(new Fields("sentence"), new Split(), new Fields("word")) .groupBy(new Fields("word")) .persistentAggregate(new MemoryMapState.Factory(), new Count(), new Fields("count")) .parallelismHint(6);
  • 52. Trident import storm.trident.operation.builtin.Count; TridentTopology topology = new TridentTopology(); TridentState wordCounts = topology.newStream("spout1", spout) .each(new Fields("sentence"), new Split(), new Fields("word")) .groupBy(new Fields("word")) .persistentAggregate(new MemoryMapState.Factory(), new Count(), new Fields("count")) .parallelismHint(6);
  • 53. // Initial RDD input to updateStateByKey val initialRDD = ssc.sparkContext.parallelize(List.empty[(String, Int)]) val lines = ... val words = lines.flatMap(_.split(" ")) val wordDstream = words.map(x => (x, 1)) val trackStateFunc = (batchTime: Time, word: String, one: Option[Int], state: State[Int]) => { val sum = one.getOrElse(0) + state.getOption.getOrElse(0) val output = (word, sum) state.update(sum) Some(output) } val stateDstream = wordDstream.trackStateByKey( StateSpec.function(trackStateFunc).initialState(initialRDD)) Spark Streaming
  • 54. Spark Streaming // Initial RDD input to updateStateByKey val initialRDD = ssc.sparkContext.parallelize(List.empty[(String, Int)]) val lines = ... val words = lines.flatMap(_.split(" ")) val wordDstream = words.map(x => (x, 1)) val trackStateFunc = (batchTime: Time, word: String, one: Option[Int], state: State[Int]) => { val sum = one.getOrElse(0) + state.getOption.getOrElse(0) val output = (word, sum) state.update(sum) Some(output) } val stateDstream = wordDstream.trackStateByKey( StateSpec.function(trackStateFunc).initialState(initialRDD))
  • 55. Spark Streaming // Initial RDD input to updateStateByKey val initialRDD = ssc.sparkContext.parallelize(List.empty[(String, Int)]) val lines = ... val words = lines.flatMap(_.split(" ")) val wordDstream = words.map(x => (x, 1)) val trackStateFunc = (batchTime: Time, word: String, one: Option[Int], state: State[Int]) => { val sum = one.getOrElse(0) + state.getOption.getOrElse(0) val output = (word, sum) state.update(sum) Some(output) } val stateDstream = wordDstream.trackStateByKey( StateSpec.function(trackStateFunc).initialState(initialRDD))
  • 56. Spark Streaming // Initial RDD input to updateStateByKey val initialRDD = ssc.sparkContext.parallelize(List.empty[(String, Int)]) val lines = ... val words = lines.flatMap(_.split(" ")) val wordDstream = words.map(x => (x, 1)) val trackStateFunc = (batchTime: Time, word: String, one: Option[Int], state: State[Int]) => { val sum = one.getOrElse(0) + state.getOption().getOrElse(0) val output = (word, sum) state.update(sum) Some(output) } val stateDstream = wordDstream.trackStateByKey( StateSpec.function(trackStateFunc).initialState(initialRDD))
  • 57. Spark Streaming // Initial RDD input to updateStateByKey val initialRDD = ssc.sparkContext.parallelize(List.empty[(String, Int)]) val lines = ... val words = lines.flatMap(_.split(" ")) val wordDstream = words.map(x => (x, 1)) val trackStateFunc = (batchTime: Time, word: String, one: Option[Int], state: State[Int]) => { val sum = one.getOrElse(0) + state.getOption.getOrElse(0) val output = (word, sum) state.update(sum) Some(output) } val stateDstream = wordDstream.trackStateByKey( StateSpec.function(trackStateFunc).initialState(initialRDD))
  • 58. Samza class WordCountTask extends StreamTask with InitableTask { private var store: CountStore = _ def init(config: Config, context: TaskContext) { this.store = context.getStore("wordcount-store").asInstanceOf[KeyValueStore[String, Integer]] } override def process(envelope: IncomingMessageEnvelope, collector: MessageCollector, coordinator: TaskCoordinator) { val words = envelope.getMessage.asInstanceOf[String].split(" ") words.foreach{ key => val count: Integer = Option(store.get(key)).getOrElse(0) store.put(key, count + 1) collector.send(new OutgoingMessageEnvelope(new SystemStream("kafka", "wordcount"), (key, count))) } }
  • 59. class WordCountTask extends StreamTask with InitableTask { private var store: CountStore = _ def init(config: Config, context: TaskContext) { this.store = context.getStore("wordcount-store").asInstanceOf[KeyValueStore[String, Integer]] } override def process(envelope: IncomingMessageEnvelope, collector: MessageCollector, coordinator: TaskCoordinator) { val words = envelope.getMessage.asInstanceOf[String].split(" ") words.foreach{ key => val count: Integer = Option(store.get(key)).getOrElse(0) store.put(key, count + 1) collector.send(new OutgoingMessageEnvelope(new SystemStream("kafka", "wordcount"), (key, count))) } } Samza
  • 60. class WordCountTask extends StreamTask with InitableTask { private var store: CountStore = _ def init(config: Config, context: TaskContext) { this.store = context.getStore("wordcount-store").asInstanceOf[KeyValueStore[String, Integer]] } override def process(envelope: IncomingMessageEnvelope, collector: MessageCollector, coordinator: TaskCoordinator) { val words = envelope.getMessage.asInstanceOf[String].split(" ") words.foreach{ key => val count: Integer = Option(store.get(key)).getOrElse(0) store.put(key, count + 1) collector.send(new OutgoingMessageEnvelope(new SystemStream("kafka", "wordcount"), (key, count))) } } Samza
  • 61. Flink val env = ExecutionEnvironment.getExecutionEnvironment val text = env.fromElements(...) val words = text.flatMap ( _.split(" ") ) words.keyBy(x => x).mapWithState{ (word, count: Option[Int]) => { val newCount = count.getOrElse(0) + 1 val output = (word, newCount) (output, Some(newCount)) } } ...
  • 62. val env = ExecutionEnvironment.getExecutionEnvironment val text = env.fromElements(...) val words = text.flatMap ( _.split(" ") ) words.keyBy(x => x).mapWithState{ (word, count: Option[Int]) => { val newCount = count.getOrElse(0) + 1 val output = (word, newCount) (output, Some(newCount)) } } ... Flink
  • 63. Performance ● Hard to design not biased test, lots of variables ● Latency vs. Throughput ○ 500k/node/sec is ok, 1M/node/sec is nice and >1M/node/sec is great ○ Micro-batching latency usually in seconds, native in millis ● Guarantees ● Fault-tolerance cost ● State cost ● Network operations & Data locality ● Serialization ● Lots of tuning options
  • 64. Project Maturity ● Storm & Trident ○ For a long time de-facto industrial standard ○ Widely used (Twitter, Yahoo!, Groupon, Spotify, Alibaba, Baidu and many more) ○ > 180 contributors ● Spark Streaming ○ Around 40% of Spark users use Streaming in Production or Prototyping ○ Significant uptake in adoption (Netflix, Cisco, DataStax, Pinterest, Intel, Pearson, … ) ○ > 720 contributors (whole Spark) ● Samza ○ Used by LinkedIn and tens of other companies ○ > 30 contributors ● Flink ○ Still emerging, first production deployments ○ > 130 contributors
  • 65. Summary Native Micro-batching Native Native Compositional Declarative Compositional Declarative At-least-once Exactly-once At-least-once Exactly-once Record ACKs RDD based Checkpointing Log-based Checkpointing Not in-build Dedicated Operators Stateful Operators Stateful Operators Very Low Medium Low Low Low Medium High High High High Medium Low Micro-batching Exactly-once Dedicated DStream Medium High Streaming Model API Guarantee Fault Tolerance State Management Latency Throughput Maturity TRIDENT
  • 66. General Guidelines ● As always it depends ● Always evaluate particular application needs ● Fully understand internals improper use may have disastrous consequences ● Prefer High Level API ● Usually wants exactly once delivery ● Almost all non-trivial jobs have state ● Fast recovery is critical ○ Use Chaos Monkey or similar tool to be sure
  • 67. Framework Recommendations ● Storm & Trident ○ Great fit for small and fast tasks ○ Very low (tens of milliseconds) latency ○ State & Fault tolerance degrades performance significantly ○ Potential update to Heron ■ Keeps API, according to Twitter better in every single way ■ Future open-sourcing is uncertain ● Spark Streaming ○ If Spark is already part of your infrastructure ○ Take advantage of various Spark libraries ○ Lambda architecture ○ Latency is not critical ○ Micro-batching limitations
  • 68. Framework Recommendations ● Samza ○ Kafka is a cornerstone of your architecture ○ Application requires large states ○ At least once delivery is fine ● Flink ○ Conceptually great, fits very most use cases ○ Take advantage of batch processing capabilities ○ Need a functionality which is hard to implement in micro-batch ○ Enough courage to use emerging project
  • 70. Thank you ● Jobs at www.cakesolutions.net/careers ● Mail petrz@cakesolutions.net ● Twitter @petr_zapletal