Distributed Real-Time Stream Processing: Why and How: Spark Summit East talk by Petr Zapletal

Petr Zapletal @petr_zapletal
@spark_summit
@cakesolutions
Distributed Real-Time Stream Processing:
Why and How

Agenda
● Motivation
● Stream Processing
● Available Frameworks
● Systems Comparison
● Recommendations

The Data Deluge
● New Sources and New Use Cases
● 8 Zettabytes (1 ZB = 1 trillion GB) created in 2015
● Stream Processing to the Rescue

● Continuous processing, aggregation and analysis of unbounded data
Distributed Stream Processing

Points of Interest
➜ Runtime and Programming Model

Points of Interest
➜ Primitives

Points of Interest
➜ Primitives
➜ State Management

Points of Interest
➜ Primitives
➜ Message Delivery Guarantees

Points of Interest
➜ Primitives
➜ Fault Tolerance & Low Overhead Recovery

Points of Interest
➜ Primitives
➜ Latency, Throughput & Scalability

Points of Interest
➜ Primitives
➜ Maturity and Adoption Level

Points of Interest
➜ Primitives
➜ Maturity and Adoption Level
➜ Ease of Development and Operability

● Most important trait of stream processing system
● Defines expressiveness, possible operations and its limitations
● Therefore defines systems capabilities and its use cases
Runtime and Programming Model

Native Streaming
records
Sink
Operator
Source
Operator
Processing
Operator
records processed one at a time
Processing
Operator

records processed in short batches
Processing
Operator
Receiver
records
Processing
Operator
Micro-batches
Sink
Operator
Micro-batching

Native Streaming
● Records are processed as they arrive
Pros
⟹ Expressiveness
⟹ Low-latency
⟹ Stateful operations
Cons
⟹ Fault-tolerance is expensive
⟹ Load-balancing

Micro-batching
Cons
⟹ Lower latency, depends on
batch interval
⟹ Limited expressivity
⟹ Harder stateful operations
● Splits incoming stream into small batches
Pros
⟹ High-throughput
⟹ Easier fault tolerance
⟹ Simpler load-balancing

Programming Model
Compositional
⟹ Provides basic building blocks as
operators or sources
⟹ Custom component definition
⟹ Manual Topology definition &
optimization
⟹ Advanced functionality often
missing
Declarative
⟹ High-level API
⟹ Operators as higher order functions
⟹ Abstract data types
⟹ Advance operations like state
management or windowing supported
out of the box
⟹ Advanced optimizers

Apache Streaming Landscape
TRIDENT

Storm
Stor
● Pioneer in large scale stream processing

● Higher level micro-batching system build atop Storm
Trident

● Unified batch and stream processing over a batch runtime
Spark Streaming
input data
stream Spark
Streaming
Spark
Engine
batches of
input data
batches of
processed data

Samza
● Builds heavily on Kafka’s log based philosophy
Task 1
Task 2
Task 3

Kafka Streams
● Simple streaming library on top of Apache Kafka

Apex
● Processes massive amounts of real-time events natively in Hadoop
Operator
Operator
OperatorOperator Operator Operator
Output
Stream
Stream
Stream
Stream
Stream
Tuple

Flink
Stream Data
Batch Data
Kafka, RabbitMQ, ...
HDFS, JDBC, ...
● Native streaming & High level API

Counting Words
Spark Summit 2017
Apache Apache Spark
Storm Apache Trident
Flink Streaming Samza
Scala 2017 Streaming
(Apache, 3)
(Streaming, 2)
(2017, 2)
(Spark, 2)
(Storm, 1)
(Trident, 1)
(Flink, 1)
(Samza, 1)
(Scala, 1)
(Summit, 1)

Storm
TopologyBuilder builder = new TopologyBuilder();
builder.setSpout("spout", new RandomSentenceSpout(), 5);
builder.setBolt("split", new Split(), 8).shuffleGrouping("spout");
builder.setBolt("count", new WordCount(), 12).fieldsGrouping("split", new Fields("word"));
...
Map<String, Integer> counts = new HashMap<String, Integer>();
public void execute(Tuple tuple, BasicOutputCollector collector) {
String word = tuple.getString(0);
Integer count = counts.containsKey(word) ? counts.get(word) + 1 : 1;
counts.put(word, count);
collector.emit(new Values(word, count));
}

Trident
public static StormTopology buildTopology(LocalDRPC drpc) {
FixedBatchSpout spout = ...
TridentTopology topology = new TridentTopology();
TridentState wordCounts = topology.newStream("spout1", spout)
.each(new Fields("sentence"),new Split(), new Fields("word"))
.groupBy(new Fields("word"))
.persistentAggregate(new MemoryMapState.Factory(),
new Count(), new Fields("count"));
...
}

Spark Streaming
val conf = new SparkConf().setAppName("wordcount")
val ssc = new StreamingContext(conf, Seconds(1))
val text = ...
val counts = text.flatMap(line => line.split(" "))
.map(word => (word, 1))
.reduceByKey(_ + _)
counts.print()
ssc.start()
ssc.awaitTermination()

val conf = new SparkConf().setAppName("wordcount")
val ssc = new StreamingContext(conf, Seconds(1))
val text = ...
val counts = text.flatMap(line => line.split(" "))
.map(word => (word, 1))
.reduceByKey(_ + _)
counts.print()
ssc.start()
ssc.awaitTermination()
Spark Streaming

Samza
class WordCountTask extends StreamTask {
override def process(envelope: IncomingMessageEnvelope,
collector: MessageCollector, coordinator: TaskCoordinator) {
val text = envelope.getMessage.asInstanceOf[String]
val counts = text.split(" ")
.foldLeft(Map.empty[String, Int]) {
(count, word) => count + (word -> (count.getOrElse(word, 0) + 1))
}
collector.send(new OutgoingMessageEnvelope(new SystemStream("kafka", "wordcount"),
counts))
}

class WordCountTask extends StreamTask {
override def process(envelope: IncomingMessageEnvelope,
collector: MessageCollector, coordinator: TaskCoordinator) {
val text = envelope.getMessage.asInstanceOf[String]
val counts = text.split(" ")
.foldLeft(Map.empty[String, Int]) {
(count, word) => count + (word -> (count.getOrElse(word, 0) + 1))
}
collector.send(new OutgoingMessageEnvelope(new SystemStream("kafka", "wordcount"),
counts))
}
Samza

Kafka Streams
final Serde<String> stringSerde = Serdes.String();
final Serde<Long> longSerde = Serdes.Long();
KStream<String, String> textLines = builder.stream(stringSerde, stringSerde,
"streams-file-input");
KStream<String, Long> wordCounts = textLines
.flatMapValues(value -> Arrays.asList(value.toLowerCase().split("W+")))
.groupBy((key, value) -> value)
.count("Counts")
.toStream();
wordCounts.to(stringSerde, longSerde, "streams-wordcount-output");

Apex
val input = dag.addOperator("input", new LineReader)
val parser = dag.addOperator("parser", new Parser)
val out = dag.addOperator("console", new ConsoleOutputOperator)
dag.addStream[String]("lines", input.out, parser.in)
dag.addStream[String]("words", parser.out, counter.data)
class Parser extends BaseOperator {
@transient
val out = new DefaultOutputPort[String]()
@transient
val in = new DefaultInputPort[String]()
override def process(t: String): Unit = {
for(w <- t.split(" ")) out.emit(w)
}
}

val input = dag.addOperator("input", new LineReader)
val parser = dag.addOperator("parser", new Parser)
val out = dag.addOperator("console", new ConsoleOutputOperator)
dag.addStream[String]("lines", input.out, parser.in)
dag.addStream[String]("words", parser.out, counter.data)
class Parser extends BaseOperator {
@transient
val out = new DefaultOutputPort[String]()
@transient
val in = new DefaultInputPort[String]()
override def process(t: String): Unit = {
for(w <- t.split(" ")) out.emit(w)
}
}
Apex

Flink
val env = ExecutionEnvironment.getExecutionEnvironment
val text = env.fromElements(...)
val counts = text.flatMap ( _.split(" ") )
.map ( (_, 1) )
.groupBy(0)
.sum(1)
counts.print()
env.execute("wordcount")

Summary
Native Micro-batching Native Native
Compositional Declarative Compositional Declarative
At-least-once Exactly-once At-least-once Exactly-once
Record ACKs
RDD based
Checkpointing
Log-based Checkpointing
Not build-in
Dedicated
Operators
Stateful
Operators
Stateful
Operators
Very Low Medium Low Low
Low Medium High High
High High Medium Medium
Micro-batching
Exactly-once*
Dedicated
DStream
Medium
High
Streaming
Model
API
Guarantees
Fault
Tolerance
State
Management
Latency
Throughput
Maturity
TRIDENT
Hybrid
Compositional*
Exactly-once
Checkpointing
Stateful
Operators
Very Low
High
Medium
Native
Declarative*
At-least-once
Low
Log-based
Stateful
Operators
Low
High

General Guidelines
As always, It depends

General Guidelines
➜ Evaluate particular application needs

General Guidelines
➜ Programming model

General Guidelines
➜ Available delivery guarantees

General Guidelines
➜ Almost all non-trivial jobs have state

General Guidelines
➜ Almost all non-trivial jobs have state
➜ Fast recovery is critical

Recommendations [Storm & Trident]
● Fits for small and fast tasks
● Very low (tens of milliseconds) latency
● State & Fault tolerance degrades performance significantly
● Potential update to Heron
○ Keeps the API, according to Twitter better in every single way
○ Open-sourced recently

Recommendations [Spark Streaming]
● Spark Ecosystem
● Data Exploration
● Latency is not critical
● Micro-batching limitations

Recommendations [Samza]
● Kafka is a cornerstone of your architecture
● Application requires large states
● Don’t need exactly once

Recommendations [Kafka Streams]
● Similar functionality like Samza with nicer APIs
● Operated by Kafka itself
● Great learning curve
● At-least once delivery
● May not support more advanced functionality

Recommendations [Apex]
● Prefer compositional approach
● Hadoop
● Great performance
● Dynamic DAG changes

Recommendations [Flink]
● Conceptually great, fits very most use cases
● Take advantage of batch processing capabilities
● Need a functionality which is hard to implement in micro-batch
● Enough courage to use emerging project

Dataflow and Apache Beam
Dataflow
Model & SDKs
Apache Flink
Apache Spark
Direct Pipeline
Google Cloud
Dataflow
Stream Processing
Batch Processing
Multiple Modes One Pipeline Many Runtimes
Local
or
cloud
Local
Cloud

Questions
MANCHESTER LONDON NEW YORK

@petr_zapletal @cakesolutions
347 708 1518
enquiries@cakesolutions.net
We are hiring
http://www.cakesolutions.net/careers

References
● http://storm.apache.org/
● http://spark.apache.org/streaming/
● http://samza.apache.org/
● https://apex.apache.org/
● https://flink.apache.org/
● http://beam.incubator.apache.org/
● http://www.cakesolutions.net/teamblogs/comparison-of-apache-stream-processing-frameworks-part-1
● http://data-artisans.com/blog/
● http://www.confluent.io/blog/introducing-kafka-streams-stream-processing-made-simple
● https://yahooeng.tumblr.com/post/135321837876/benchmarking-streaming-computation-engines-at
● http://www.slideshare.net/GyulaFra/largescale-stream-processing-in-the-hadoop-ecosystem
● https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-101
● http://stackoverflow.com/questions/29111549/where-do-apache-samza-and-apache-storm-differ-in-their-use-cases
● https://www.dropbox.com/s/1s8pnjwgkkvik4v/GearPump%20Edge%20Topologies%20and%20Deployment.pptx?dl=0
● ...

List of Figures
● https://databricks.com/blog/2015/07/30/diving-into-apache-spark-streamings-execution-model.html
● http://www.slideshare.net/ptgoetz/storm-hadoop-summit2014
● https://databricks.com/wp-content/uploads/2015/07/image41-1024x602.png
● http://data-artisans.com/wp-content/uploads/2015/08/microbatching.png
● http://samza.apache.org/img/0.9/learn/documentation/container/stateful_job
● http://data-artisans.com/wp-content/uploads/2015/08/streambarrier.png
● https://cwiki.apache.org/confluence/display/FLINK/Stateful+Stream+Processing
● https://raw.githubusercontent.com/tweise/apex-samples/kafka-count-jdbc/exactly-once/docs/images/image00.png
● https://4.bp.blogspot.com/-RlLeDymI_mU/Vp-1cb3AxNI/AAAAAAAACSQ/5TphliHJA4w/s1600/dataflow%2BASF.pn
g

Backup Slides

Processing Architecture Evolution
Batch Pipeline
Serving DBHDFS
Query

Lambda Architecture
Batch Layer Serving
Layer
Stream layer
Query
Allyourdata
Oozie
Query

Standalone Stream Processing
Stream
Processing

Kappa Architecture
Query
Stream
Processing

ETL Operations
● Transformations, joining or filtering of incoming
data
Streaming Applications

Windowing
● Trends in bounded interval, like tweets or sales

Machine Learning
● Clustering, Trend fitting,
Classification

Pattern Recognition
● Fraud detection, Signal
triggering, Anomaly detection

Fault tolerance in streaming systems is
inherently harder that in batch
Fault Tolerance

Managing State
f: (input, state) => (output, state’)

Performance
Hard to design not biased test, lots
of variables

Performance
➜ Latency vs. Throughput

Performance
➜ Costs of Delivery Guarantees, Fault-tolerance & State Management

Performance
➜ Tuning

Performance
➜ Tuning
➜ Network operations, Data locality & Serialization

Project Maturity
When picking up the framework, you
should always consider its maturity

For a long time de-facto industrial
standard
Project Maturity [Storm & Trident]

Project Maturity [Spark Streaming]
The most trending Scala repository
these days and one of the engines
behind Scala’s popularity

Project Maturity [Samza]
Used by LinkedIn and also by tens of
other companies

Project Maturity [Kafka Streams]
???

Project Maturity [Apex]
Graduated very recently, adopted by
a couple of corporate clients already

Project Maturity [Flink]
Still an emerging project, but we can
see its first production deployments

Distributed Real-Time Stream Processing: Why and How: Spark Summit East talk by Petr Zapletal

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Viewers also liked

Viewers also liked (20)

Similar to Distributed Real-Time Stream Processing: Why and How: Spark Summit East talk by Petr Zapletal

Similar to Distributed Real-Time Stream Processing: Why and How: Spark Summit East talk by Petr Zapletal (20)

More from Spark Summit

More from Spark Summit (20)

Recently uploaded

Recently uploaded (20)

Distributed Real-Time Stream Processing: Why and How: Spark Summit East talk by Petr Zapletal