SlideShare a Scribd company logo
1 of 59
Download to read offline
Four Things to Know about
Reliable Spark Streaming
Dean Wampler, Typesafe
Tathagata Das, Databricks
Agenda for today
• The Stream Processing Landscape
• How Spark Streaming Works - A Quick Overview
• Features in Spark Streaming that Help Prevent
Data Loss
• Design Tips for Successful Streaming
Applications
The Stream
Processing Landscape
Stream Processors
Stream Storage
Stream Sources
MQTT$
How Spark Streaming Works:
A Quick Overview
Spark Streaming
Scalable, fault-tolerant stream processing system
File systems
Databases
Dashboards
Flume
Kinesis
HDFS/S3
Kafka
Twitter
High-level API
joins, windows, …
often 5x less code
Fault-tolerant
Exactly-once semantics,
even for stateful ops
Integration
Integrates with MLlib, SQL,
DataFrames, GraphX
Spark Streaming
Receivers receive data streams and chop them up into batches
Spark processes the batches and pushes out the results
9
data streams
receivers
batches results
Word Count with Kafka
val!context!=!new!StreamingContext(conf,!Seconds(1))!
val!lines!=!KafkaUtils.createStream(context,!...)!
10
entry point of streaming
functionality
create DStream
from Kafka data
Word Count with Kafka
val!context!=!new!StreamingContext(conf,!Seconds(1))!
val!lines!=!KafkaUtils.createStream(context,!...)!
val!words!=!lines.flatMap(_.split("!"))!
11
split lines into words
Word Count with Kafka
val!context!=!new!StreamingContext(conf,!Seconds(1))!
val!lines!=!KafkaUtils.createStream(context,!...)!
val!words!=!lines.flatMap(_.split("!"))!
val!wordCounts!=!words.map(x!=>!(x,!1))!
!!!!!!!!!!!!!!!!!!!!!!.reduceByKey(_!+!_)!
wordCounts.print()!
context.start()!
12
print some counts on screen
count the words
start receiving and
transforming the data
Word Count with Kafka
object!WordCount!{!
!!def!main(args:!Array[String])!{!
!!!!val!context!=!new!StreamingContext(new!SparkConf(),!Seconds(1))!
!!!!val!lines!=!KafkaUtils.createStream(context,!...)!
!!!!val!words!=!lines.flatMap(_.split("!"))!
!!!!val!wordCounts!=!words.map(x!=>!(x,1)).reduceByKey(_!+!_)!
!!!!wordCounts.print()!
!!!!context.start()!
!!!!context.awaitTermination()!
!!}!
}!
13
Features in Spark Streaming
that Help Prevent Data Loss
15
A Deeper View of
Spark Streaming
Any Spark Application
16
Spark
Driver
User code runs in
the driver process
YARN / Mesos /
Spark Standalone
cluster
Any Spark Application
17
Spark
Driver
User code runs in
the driver process
YARN / Mesos /
Spark Standalone
cluster
Spark
Executor
Spark
Executor
Spark
Executor
Driver launches
executors in
cluster
Any Spark Application
18
Spark
Driver
User code runs in
the driver process
YARN / Mesos /
Spark Standalone
cluster
Tasks sent to
executors for
processing data
Spark
Executor
Spark
Executor
Spark
Executor
Driver launches
executors in
cluster
Spark Streaming Application: Receive data
19
Executor
Executor
Driver runs
receivers as long
running tasks Receiver Data stream
Driver@
object!WordCount!{!
!!def!main(args:!Array[String])!{!
!!!!val!context!=!new!StreamingContext(...)!
!!!!val!lines!=!KafkaUtils.createStream(...)!
!!!!val!words!=!lines.flatMap(_.split("!"))!
!!!!val!wordCounts!=!words.map(x!=>!(x,1))!
!!!!!!!!!!!!!!!!!!!!!!!!!!.reduceByKey(_!+!_)!
!!!!wordCounts.print()!
!!!!context.start()!
!!!!context.awaitTermination()!
!!}!
}!
Spark Streaming Application: Receive data
20
Executor
Executor
Driver runs
receivers as long
running tasks Receiver Data stream
Driver@
object!WordCount!{!
!!def!main(args:!Array[String])!{!
!!!!val!context!=!new!StreamingContext(...)!
!!!!val!lines!=!KafkaUtils.createStream(...)!
!!!!val!words!=!lines.flatMap(_.split("!"))!
!!!!val!wordCounts!=!words.map(x!=>!(x,1))!
!!!!!!!!!!!!!!!!!!!!!!!!!!.reduceByKey(_!+!_)!
!!!!wordCounts.print()!
!!!!context.start()!
!!!!context.awaitTermination()!
!!}!
}!
Receiver divides
stream into blocks and
keeps in memory
Data Blocks!
Spark Streaming Application: Receive data
21
Executor
Executor
Driver runs
receivers as long
running tasks Receiver Data stream
Driver@
object!WordCount!{!
!!def!main(args:!Array[String])!{!
!!!!val!context!=!new!StreamingContext(...)!
!!!!val!lines!=!KafkaUtils.createStream(...)!
!!!!val!words!=!lines.flatMap(_.split("!"))!
!!!!val!wordCounts!=!words.map(x!=>!(x,1))!
!!!!!!!!!!!!!!!!!!!!!!!!!!.reduceByKey(_!+!_)!
!!!!wordCounts.print()!
!!!!context.start()!
!!!!context.awaitTermination()!
!!}!
}!
Receiver divides
stream into blocks and
keeps in memory
Data Blocks!
Blocks also
replicated to
another executor Data Blocks!
Spark Streaming Application: Process data
22
Executor
Executor
Receiver
Data Blocks!
Data Blocks!
Every batch
interval, driver
launches tasks to
process the blocks
Driver@
object!WordCount!{!
!!def!main(args:!Array[String])!{!
!!!!val!context!=!new!StreamingContext(...)!
!!!!val!lines!=!KafkaUtils.createStream(...)!
!!!!val!words!=!lines.flatMap(_.split("!"))!
!!!!val!wordCounts!=!words.map(x!=>!(x,1))!
!!!!!!!!!!!!!!!!!!!!!!!!!!.reduceByKey(_!+!_)!
!!!!wordCounts.print()!
!!!!context.start()!
!!!!context.awaitTermination()!
!!}!
}!
Spark Streaming Application: Process data
23
Executor
Executor
Receiver
Data Blocks!
Data Blocks!
Data
store
Every batch
interval, driver
launches tasks to
process the blocks
Driver@
object!WordCount!{!
!!def!main(args:!Array[String])!{!
!!!!val!context!=!new!StreamingContext(...)!
!!!!val!lines!=!KafkaUtils.createStream(...)!
!!!!val!words!=!lines.flatMap(_.split("!"))!
!!!!val!wordCounts!=!words.map(x!=>!(x,1))!
!!!!!!!!!!!!!!!!!!!!!!!!!!.reduceByKey(_!+!_)!
!!!!wordCounts.print()!
!!!!context.start()!
!!!!context.awaitTermination()!
!!}!
}!
Fault Tolerance
and Reliability
Failures? Why care?
Many streaming applications need zero data loss guarantees
despite any kind of failures in the system
At least once guarantee – every record processed at least once
Exactly once guarantee – every record processed exactly once
Different kinds of failures – executor and driver
Some failures and guarantee requirements need additional
configurations and setups
25
Executor
Receiver
Data Blocks!
What if an executor fails?
26
Executor
Failed Ex.
Receiver
Blocks!
Blocks!
Driver
If executor fails,
receiver is lost and
all blocks are lost
Executor
Receiver
Data Blocks!
What if an executor fails?
Tasks and receivers restarted by Spark
automatically, no config needed
27
Executor
Failed Ex.
Receiver
Blocks!
Blocks!
Driver
If executor fails,
receiver is lost and
all blocks are lost
Receiver
Receiver
restarted
Tasks restarted
on block replicas
What if the driver fails?
28
Executor
Blocks!How do we
recover?
When the driver
fails, all the
executors fail
All computation,
all received
blocks are lost
Executor
Receiver
Blocks!
Failed Ex.
Receiver
Blocks!
Failed
Executor
Blocks!
Driver
Failed
Driver
Recovering Driver w/ DStream Checkpointing
DStream Checkpointing:
Periodically save the DAG of DStreams to
fault-tolerant storage
29
Executor
Blocks!
Executor
Receiver
Blocks!
Active
Driver
Checkpoint info
to HDFS / S3
Recovering Driver w/ DStream Checkpointing
30
Failed driver can be restarted
from checkpoint information
Failed
Driver
Restarted
Driver
DStream Checkpointing:
Periodically save the DAG of DStreams to
fault-tolerant storage
Recovering Driver w/ DStream Checkpointing
31
Failed driver can be restarted
from checkpoint information
Failed
Driver
Restarted
Driver
New
Executor
New
Executor
Receiver
New executors
launched and
receivers
restarted
DStream Checkpointing:
Periodically save the DAG of DStreams to
fault-tolerant storage
Recovering Driver w/ DStream Checkpointing
1.  Configure automatic driver restart
All cluster managers support this
2.  Set a checkpoint directory in a HDFS-compatible file
system
!streamingContext.checkpoint(hdfsDirectory)!
3.  Slightly restructure of the code to use checkpoints for
recovery
32
Configurating Automatic Driver Restart
Spark Standalone – Use spark-submit with “cluster” mode and “--
supervise”
See http://spark.apache.org/docs/latest/spark-standalone.html
YARN – Use spark-submit in “cluster” mode
See YARN config “yarn.resourcemanager.am.max-attempts”
Mesos – Marathon can restart applications or use the “--supervise” flag.
33
Restructuring code for Checkpointing
34
val!context!=!new!StreamingContext(...)!
val!lines!=!KafkaUtils.createStream(...)!
val!words!=!lines.flatMap(...)!
...!
context.start()!
Create
+
Setup
Start
Restructuring code for Checkpointing
35
val!context!=!new!StreamingContext(...)!
val!lines!=!KafkaUtils.createStream(...)!
val!words!=!lines.flatMap(...)!
...!
context.start()!
Create
+
Setup
Start
def!creatingFunc():!StreamingContext!=!{!!
!!!val!context!=!new!StreamingContext(...)!!!
!!!val!lines!=!KafkaUtils.createStream(...)!
!!!val!words!=!lines.flatMap(...)!
!!!...!
!!!context.checkpoint(hdfsDir)!
}!
Put all setup code into a function that returns a new StreamingContext
Restructuring code for Checkpointing
36
val!context!=!new!StreamingContext(...)!
val!lines!=!KafkaUtils.createStream(...)!
val!words!=!lines.flatMap(...)!
...!
context.start()!
Create
+
Setup
Start
def!creatingFunc():!StreamingContext!=!{!!
!!!val!context!=!new!StreamingContext(...)!!!
!!!val!lines!=!KafkaUtils.createStream(...)!
!!!val!words!=!lines.flatMap(...)!
!!!...!
!!!context.checkpoint(hdfsDir)!
}!
Put all setup code into a function that returns a new StreamingContext
Get context setup from HDFS dir OR create a new one with the function
val!context!=!
StreamingContext.getOrCreate(!
!!hdfsDir,!creatingFunc)!
context.start()!
Restructuring code for Checkpointing
StreamingContext.getOrCreate():
If HDFS directory has checkpoint info
recover context from info
else
call creatingFunc() to create
and setup a new context
Restarted process can figure out whether
to recover using checkpoint info or not
37
def!creatingFunc():!StreamingContext!=!{!!
!!!val!context!=!new!StreamingContext(...)!!!
!!!val!lines!=!KafkaUtils.createStream(...)!
!!!val!words!=!lines.flatMap(...)!
!!!...!
!!!context.checkpoint(hdfsDir)!
}!
val!context!=!
StreamingContext.getOrCreate(!
!!hdfsDir,!creatingFunc)!
context.start()!
Received blocks lost on Restart!
38
Failed
Driver
Restarted
Driver
New
Executor
New Ex.
Receiver
No Blocks!
In-memory blocks of
buffered data are
lost on driver restart
Recovering data with Write Ahead Logs
Write Ahead Log (WAL): Synchronously
save received data to fault-tolerant storage
39
Executor
Blocks saved
to HDFS
Executor
Receiver
Blocks!
Active
Driver
Data stream
Recovering data with Write Ahead Logs
40
Failed
Driver
Restarted
Driver
New
Executor
New Ex.
Receiver
Blocks!
Blocks recovered
from Write Ahead Log
Write Ahead Log (WAL): Synchronously
save received data to fault-tolerant storage
Recovering data with Write Ahead Logs
1.  Enable checkpointing, logs written in checkpoint directory
3.  Enabled WAL in SparkConf configuration
!!!!!sparkConf.set("spark.streaming.receiver.writeAheadLog.enable",!"true")!
3.  Receiver should also be reliable
Acknowledge source only after data saved to WAL
Unacked data will be replayed from source by restarted receiver
5.  Disable in-memory replication (already replicated by HDFS)
!Use StorageLevel.MEMORY_AND_DISK_SER for input DStreams
41
RDD Checkpointing
•  Stateful stream processing can lead to long RDD lineages
•  Long lineage = bad for fault-tolerance, too much recomputation
•  RDD checkpointing saves RDD data to the fault-tolerant storage to
limit lineage and recomputation
•  More:
http://spark.apache.org/docs/latest/streaming-programming-guide.html#checkpointing
42
Fault-tolerance Semantics
43
Zero data loss = every stage processes each
event at least once despite any failure
Sources
Transforming
Sinks
Outputting
Receiving
Fault-tolerance Semantics
44
Sources
Transforming
Sinks
Outputting
Receiving
Exactly once, as long as received data is not lost
Receiving
Outputting
End-to-end semantics:
At-least once
Fault-tolerance Semantics
45
Sources
Transforming
Sinks
Outputting
Receiving
Exactly once, as long as received data is not lost
Receiving
Outputting
Exactly once, if outputs are idempotent or transactional
End-to-end semantics:
At-least once
Fault-tolerance Semantics
46
Sources
Transforming
Sinks
Outputting
Receiving
Exactly once, as long as received data is not lost
At least once, w/ Checkpointing + WAL + Reliable receiversReceiving
Outputting
Exactly once, if outputs are idempotent or transactional
End-to-end semantics:
At-least once
Fault-tolerance Semantics
47
Exactly once receiving with new Kafka Direct approach
Treats Kafka like a replicated log, reads it like a file
Does not use receivers
No need to create multiple DStreams and union them
No need to enable Write Ahead Logs
!
@val!directKafkaStream!=!KafkaUtils.createDirectStream(...)!
!
https://databricks.com/blog/2015/03/30/improvements-to-kafka-integration-of-spark-streaming.html
http://spark.apache.org/docs/latest/streaming-kafka-integration.html
Sources
Transforming
Sinks
Outputting
Receiving
Fault-tolerance Semantics
48
Exactly once receiving with new Kafka Direct approach
Sources
Transforming
Sinks
Outputting
Receiving
Exactly once, as long as received data is not lost
Exactly once, if outputs are idempotent or transactional
End-to-end semantics:
Exactly once!
Design Tips for Successful
Streaming Applications
Areas for consideration
• Enhance resilience with additional
components.
• Mini-batch vs. per-message handling.
• Exploit Reactive Streams.
• Use Storm, Akka, Samza, etc. for handling
individual messages, especially with sub-
second latency requirements.
• Use Spark Streaming’s mini-batch model for
the Lambda architecture and highly-scalable
analytics.
Mini-batch vs. per-message handling
• Consider Kafka or Kinesis for resilient buffering
in front of Spark Streaming.
•  Buffer for traffic spikes.
•  Re-retrieval of data if an RDD partition is lost and must be
reconstructed from the source.
• Going to store the raw data anyway?
•  Do it first, then ingest to Spark from that storage.
Enhance Resiliency with Additional
Components.
• Spark Streaming v1.5 will have support for back pressure to more
easily build end-to-end reactive applications
Exploit Reactive Streams
• Spark Streaming v1.5 will have support for back
pressure to more easily build end-to-end reactive
applications
• Backpressure from consumer to producer:
•  Prevents buffer overflows.
•  Avoids unnecessary throttling.
Exploit Reactive Streams
• Spark Streaming v1.4? Buffer with Akka
Streams:
Exploit Reactive Streams
Event
Source
Akka
App
Spark
Streaming App
Event
Event
Event
Event
feedback
(back pressure)
• Spark Streaming v1.4 has a rate limit property:
•  spark.streaming.receiver.maxRate
•  Consider setting it for long-running streaming apps with a
variable input flow rate.
• Have a graph of Reactive Streams? Consider using an
Akka app to buffer the data fed to Spark Streaming over
a socket (until 1.5…).
Exploit Reactive Streams
Thank you!
Dean Wampler, Typesafe
Tathagata Das, Databricks
Databricks is
available as a hosted
platform on 
AWS with a monthly
subscription. 
What to do next?
Start with a free trial
Typesafe now offers
certified support for
Spark, Mesos &
DCOS, read more
about it
READ MORE
REACTIVE PLATFORM

Full Lifecycle Support for Play, Akka, Scala and Spark
Give your project a boost with Reactive Platform:
• Monitor Message-Driven Apps
• Resolve Network Partitions Decisively
• Integrate Easily with Legacy Systems
• Eliminate Incompatibility & Security Risks
• Protect Apps Against Abuse
• Expert Support from Dedicated Product Teams
Enjoy learning? See about the availability of
on-site training for Scala, Akka, Play and Spark!
Learn more about our offersCONTACT US

More Related Content

What's hot

Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...Databricks
 
A Deep Dive into Stateful Stream Processing in Structured Streaming with Tath...
A Deep Dive into Stateful Stream Processing in Structured Streaming with Tath...A Deep Dive into Stateful Stream Processing in Structured Streaming with Tath...
A Deep Dive into Stateful Stream Processing in Structured Streaming with Tath...Databricks
 
HBase and HDFS: Understanding FileSystem Usage in HBase
HBase and HDFS: Understanding FileSystem Usage in HBaseHBase and HDFS: Understanding FileSystem Usage in HBase
HBase and HDFS: Understanding FileSystem Usage in HBaseenissoz
 
Advanced Streaming Analytics with Apache Flink and Apache Kafka, Stephan Ewen
Advanced Streaming Analytics with Apache Flink and Apache Kafka, Stephan EwenAdvanced Streaming Analytics with Apache Flink and Apache Kafka, Stephan Ewen
Advanced Streaming Analytics with Apache Flink and Apache Kafka, Stephan Ewenconfluent
 
Apache Flink Adoption at Shopify
Apache Flink Adoption at ShopifyApache Flink Adoption at Shopify
Apache Flink Adoption at ShopifyYaroslav Tkachenko
 
Can Apache Kafka Replace a Database?
Can Apache Kafka Replace a Database?Can Apache Kafka Replace a Database?
Can Apache Kafka Replace a Database?Kai Wähner
 
Apache Flink internals
Apache Flink internalsApache Flink internals
Apache Flink internalsKostas Tzoumas
 
Stephan Ewen - Experiences running Flink at Very Large Scale
Stephan Ewen -  Experiences running Flink at Very Large ScaleStephan Ewen -  Experiences running Flink at Very Large Scale
Stephan Ewen - Experiences running Flink at Very Large ScaleVerverica
 
Kafka Tutorial - Introduction to Apache Kafka (Part 1)
Kafka Tutorial - Introduction to Apache Kafka (Part 1)Kafka Tutorial - Introduction to Apache Kafka (Part 1)
Kafka Tutorial - Introduction to Apache Kafka (Part 1)Jean-Paul Azar
 
Flink Forward San Francisco 2019: Moving from Lambda and Kappa Architectures ...
Flink Forward San Francisco 2019: Moving from Lambda and Kappa Architectures ...Flink Forward San Francisco 2019: Moving from Lambda and Kappa Architectures ...
Flink Forward San Francisco 2019: Moving from Lambda and Kappa Architectures ...Flink Forward
 
Understanding Query Plans and Spark UIs
Understanding Query Plans and Spark UIsUnderstanding Query Plans and Spark UIs
Understanding Query Plans and Spark UIsDatabricks
 
Building Modern Data Streaming Apps with Python
Building Modern Data Streaming Apps with PythonBuilding Modern Data Streaming Apps with Python
Building Modern Data Streaming Apps with PythonTimothy Spann
 
Tuning Apache Spark for Large-Scale Workloads Gaoxiang Liu and Sital Kedia
Tuning Apache Spark for Large-Scale Workloads Gaoxiang Liu and Sital KediaTuning Apache Spark for Large-Scale Workloads Gaoxiang Liu and Sital Kedia
Tuning Apache Spark for Large-Scale Workloads Gaoxiang Liu and Sital KediaDatabricks
 
Introduction to Storm
Introduction to Storm Introduction to Storm
Introduction to Storm Chandler Huang
 
Apache Flink and what it is used for
Apache Flink and what it is used forApache Flink and what it is used for
Apache Flink and what it is used forAljoscha Krettek
 
Spark Shuffle Deep Dive (Explained In Depth) - How Shuffle Works in Spark
Spark Shuffle Deep Dive (Explained In Depth) - How Shuffle Works in SparkSpark Shuffle Deep Dive (Explained In Depth) - How Shuffle Works in Spark
Spark Shuffle Deep Dive (Explained In Depth) - How Shuffle Works in SparkBo Yang
 
Real-time Stream Processing with Apache Flink
Real-time Stream Processing with Apache FlinkReal-time Stream Processing with Apache Flink
Real-time Stream Processing with Apache FlinkDataWorks Summit
 

What's hot (20)

Apache Spark Architecture
Apache Spark ArchitectureApache Spark Architecture
Apache Spark Architecture
 
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...
 
A Deep Dive into Stateful Stream Processing in Structured Streaming with Tath...
A Deep Dive into Stateful Stream Processing in Structured Streaming with Tath...A Deep Dive into Stateful Stream Processing in Structured Streaming with Tath...
A Deep Dive into Stateful Stream Processing in Structured Streaming with Tath...
 
HBase and HDFS: Understanding FileSystem Usage in HBase
HBase and HDFS: Understanding FileSystem Usage in HBaseHBase and HDFS: Understanding FileSystem Usage in HBase
HBase and HDFS: Understanding FileSystem Usage in HBase
 
Advanced Streaming Analytics with Apache Flink and Apache Kafka, Stephan Ewen
Advanced Streaming Analytics with Apache Flink and Apache Kafka, Stephan EwenAdvanced Streaming Analytics with Apache Flink and Apache Kafka, Stephan Ewen
Advanced Streaming Analytics with Apache Flink and Apache Kafka, Stephan Ewen
 
Apache Flink Adoption at Shopify
Apache Flink Adoption at ShopifyApache Flink Adoption at Shopify
Apache Flink Adoption at Shopify
 
Apache Spark 101
Apache Spark 101Apache Spark 101
Apache Spark 101
 
Can Apache Kafka Replace a Database?
Can Apache Kafka Replace a Database?Can Apache Kafka Replace a Database?
Can Apache Kafka Replace a Database?
 
Apache Flink internals
Apache Flink internalsApache Flink internals
Apache Flink internals
 
Stephan Ewen - Experiences running Flink at Very Large Scale
Stephan Ewen -  Experiences running Flink at Very Large ScaleStephan Ewen -  Experiences running Flink at Very Large Scale
Stephan Ewen - Experiences running Flink at Very Large Scale
 
Kafka Tutorial - Introduction to Apache Kafka (Part 1)
Kafka Tutorial - Introduction to Apache Kafka (Part 1)Kafka Tutorial - Introduction to Apache Kafka (Part 1)
Kafka Tutorial - Introduction to Apache Kafka (Part 1)
 
Flink Forward San Francisco 2019: Moving from Lambda and Kappa Architectures ...
Flink Forward San Francisco 2019: Moving from Lambda and Kappa Architectures ...Flink Forward San Francisco 2019: Moving from Lambda and Kappa Architectures ...
Flink Forward San Francisco 2019: Moving from Lambda and Kappa Architectures ...
 
Understanding Query Plans and Spark UIs
Understanding Query Plans and Spark UIsUnderstanding Query Plans and Spark UIs
Understanding Query Plans and Spark UIs
 
File Format Benchmark - Avro, JSON, ORC & Parquet
File Format Benchmark - Avro, JSON, ORC & ParquetFile Format Benchmark - Avro, JSON, ORC & Parquet
File Format Benchmark - Avro, JSON, ORC & Parquet
 
Building Modern Data Streaming Apps with Python
Building Modern Data Streaming Apps with PythonBuilding Modern Data Streaming Apps with Python
Building Modern Data Streaming Apps with Python
 
Tuning Apache Spark for Large-Scale Workloads Gaoxiang Liu and Sital Kedia
Tuning Apache Spark for Large-Scale Workloads Gaoxiang Liu and Sital KediaTuning Apache Spark for Large-Scale Workloads Gaoxiang Liu and Sital Kedia
Tuning Apache Spark for Large-Scale Workloads Gaoxiang Liu and Sital Kedia
 
Introduction to Storm
Introduction to Storm Introduction to Storm
Introduction to Storm
 
Apache Flink and what it is used for
Apache Flink and what it is used forApache Flink and what it is used for
Apache Flink and what it is used for
 
Spark Shuffle Deep Dive (Explained In Depth) - How Shuffle Works in Spark
Spark Shuffle Deep Dive (Explained In Depth) - How Shuffle Works in SparkSpark Shuffle Deep Dive (Explained In Depth) - How Shuffle Works in Spark
Spark Shuffle Deep Dive (Explained In Depth) - How Shuffle Works in Spark
 
Real-time Stream Processing with Apache Flink
Real-time Stream Processing with Apache FlinkReal-time Stream Processing with Apache Flink
Real-time Stream Processing with Apache Flink
 

Viewers also liked

Alpine academy apache spark series #1 introduction to cluster computing wit...
Alpine academy apache spark series #1   introduction to cluster computing wit...Alpine academy apache spark series #1   introduction to cluster computing wit...
Alpine academy apache spark series #1 introduction to cluster computing wit...Holden Karau
 
Rethinking Streaming Analytics For Scale
Rethinking Streaming Analytics For ScaleRethinking Streaming Analytics For Scale
Rethinking Streaming Analytics For ScaleHelena Edelson
 
Lambda Architecture with Spark Streaming, Kafka, Cassandra, Akka, Scala
Lambda Architecture with Spark Streaming, Kafka, Cassandra, Akka, ScalaLambda Architecture with Spark Streaming, Kafka, Cassandra, Akka, Scala
Lambda Architecture with Spark Streaming, Kafka, Cassandra, Akka, ScalaHelena Edelson
 
Reactive dashboard’s using apache spark
Reactive dashboard’s using apache sparkReactive dashboard’s using apache spark
Reactive dashboard’s using apache sparkRahul Kumar
 
Reactive app using actor model & apache spark
Reactive app using actor model & apache sparkReactive app using actor model & apache spark
Reactive app using actor model & apache sparkRahul Kumar
 
Real-Time Anomaly Detection with Spark MLlib, Akka and Cassandra
Real-Time Anomaly Detection  with Spark MLlib, Akka and  CassandraReal-Time Anomaly Detection  with Spark MLlib, Akka and  Cassandra
Real-Time Anomaly Detection with Spark MLlib, Akka and CassandraNatalino Busa
 
Streaming Analytics with Spark, Kafka, Cassandra and Akka
Streaming Analytics with Spark, Kafka, Cassandra and AkkaStreaming Analytics with Spark, Kafka, Cassandra and Akka
Streaming Analytics with Spark, Kafka, Cassandra and AkkaHelena Edelson
 
Intro to Apache Spark
Intro to Apache SparkIntro to Apache Spark
Intro to Apache SparkMammoth Data
 
Spark Kernel Talk - Apache Spark Meetup San Francisco (July 2015)
Spark Kernel Talk - Apache Spark Meetup San Francisco (July 2015)Spark Kernel Talk - Apache Spark Meetup San Francisco (July 2015)
Spark Kernel Talk - Apache Spark Meetup San Francisco (July 2015)Robert "Chip" Senkbeil
 
Sa introduction to big data pipelining with cassandra & spark west mins...
Sa introduction to big data pipelining with cassandra & spark   west mins...Sa introduction to big data pipelining with cassandra & spark   west mins...
Sa introduction to big data pipelining with cassandra & spark west mins...Simon Ambridge
 
Streaming Big Data with Spark, Kafka, Cassandra, Akka & Scala (from webinar)
Streaming Big Data with Spark, Kafka, Cassandra, Akka & Scala (from webinar)Streaming Big Data with Spark, Kafka, Cassandra, Akka & Scala (from webinar)
Streaming Big Data with Spark, Kafka, Cassandra, Akka & Scala (from webinar)Helena Edelson
 
Using Spark, Kafka, Cassandra and Akka on Mesos for Real-Time Personalization
Using Spark, Kafka, Cassandra and Akka on Mesos for Real-Time PersonalizationUsing Spark, Kafka, Cassandra and Akka on Mesos for Real-Time Personalization
Using Spark, Kafka, Cassandra and Akka on Mesos for Real-Time PersonalizationPatrick Di Loreto
 
Data Science lifecycle with Apache Zeppelin and Spark by Moonsoo Lee
Data Science lifecycle with Apache Zeppelin and Spark by Moonsoo LeeData Science lifecycle with Apache Zeppelin and Spark by Moonsoo Lee
Data Science lifecycle with Apache Zeppelin and Spark by Moonsoo LeeSpark Summit
 
Lambda Architecture with Spark, Spark Streaming, Kafka, Cassandra, Akka and S...
Lambda Architecture with Spark, Spark Streaming, Kafka, Cassandra, Akka and S...Lambda Architecture with Spark, Spark Streaming, Kafka, Cassandra, Akka and S...
Lambda Architecture with Spark, Spark Streaming, Kafka, Cassandra, Akka and S...Helena Edelson
 
Data processing platforms architectures with Spark, Mesos, Akka, Cassandra an...
Data processing platforms architectures with Spark, Mesos, Akka, Cassandra an...Data processing platforms architectures with Spark, Mesos, Akka, Cassandra an...
Data processing platforms architectures with Spark, Mesos, Akka, Cassandra an...Anton Kirillov
 
Akka in Production - ScalaDays 2015
Akka in Production - ScalaDays 2015Akka in Production - ScalaDays 2015
Akka in Production - ScalaDays 2015Evan Chan
 
Deep Recurrent Neural Networks for Sequence Learning in Spark by Yves Mabiala
Deep Recurrent Neural Networks for Sequence Learning in Spark by Yves MabialaDeep Recurrent Neural Networks for Sequence Learning in Spark by Yves Mabiala
Deep Recurrent Neural Networks for Sequence Learning in Spark by Yves MabialaSpark Summit
 
Consumer offset management in Kafka
Consumer offset management in KafkaConsumer offset management in Kafka
Consumer offset management in KafkaJoel Koshy
 

Viewers also liked (20)

Alpine academy apache spark series #1 introduction to cluster computing wit...
Alpine academy apache spark series #1   introduction to cluster computing wit...Alpine academy apache spark series #1   introduction to cluster computing wit...
Alpine academy apache spark series #1 introduction to cluster computing wit...
 
Rethinking Streaming Analytics For Scale
Rethinking Streaming Analytics For ScaleRethinking Streaming Analytics For Scale
Rethinking Streaming Analytics For Scale
 
Lambda Architecture with Spark Streaming, Kafka, Cassandra, Akka, Scala
Lambda Architecture with Spark Streaming, Kafka, Cassandra, Akka, ScalaLambda Architecture with Spark Streaming, Kafka, Cassandra, Akka, Scala
Lambda Architecture with Spark Streaming, Kafka, Cassandra, Akka, Scala
 
Reactive dashboard’s using apache spark
Reactive dashboard’s using apache sparkReactive dashboard’s using apache spark
Reactive dashboard’s using apache spark
 
Reactive app using actor model & apache spark
Reactive app using actor model & apache sparkReactive app using actor model & apache spark
Reactive app using actor model & apache spark
 
How to deploy Apache Spark 
to Mesos/DCOS
How to deploy Apache Spark 
to Mesos/DCOSHow to deploy Apache Spark 
to Mesos/DCOS
How to deploy Apache Spark 
to Mesos/DCOS
 
Real-Time Anomaly Detection with Spark MLlib, Akka and Cassandra
Real-Time Anomaly Detection  with Spark MLlib, Akka and  CassandraReal-Time Anomaly Detection  with Spark MLlib, Akka and  Cassandra
Real-Time Anomaly Detection with Spark MLlib, Akka and Cassandra
 
Streaming Analytics with Spark, Kafka, Cassandra and Akka
Streaming Analytics with Spark, Kafka, Cassandra and AkkaStreaming Analytics with Spark, Kafka, Cassandra and Akka
Streaming Analytics with Spark, Kafka, Cassandra and Akka
 
Intro to Apache Spark
Intro to Apache SparkIntro to Apache Spark
Intro to Apache Spark
 
Spark Kernel Talk - Apache Spark Meetup San Francisco (July 2015)
Spark Kernel Talk - Apache Spark Meetup San Francisco (July 2015)Spark Kernel Talk - Apache Spark Meetup San Francisco (July 2015)
Spark Kernel Talk - Apache Spark Meetup San Francisco (July 2015)
 
Sa introduction to big data pipelining with cassandra & spark west mins...
Sa introduction to big data pipelining with cassandra & spark   west mins...Sa introduction to big data pipelining with cassandra & spark   west mins...
Sa introduction to big data pipelining with cassandra & spark west mins...
 
Streaming Big Data with Spark, Kafka, Cassandra, Akka & Scala (from webinar)
Streaming Big Data with Spark, Kafka, Cassandra, Akka & Scala (from webinar)Streaming Big Data with Spark, Kafka, Cassandra, Akka & Scala (from webinar)
Streaming Big Data with Spark, Kafka, Cassandra, Akka & Scala (from webinar)
 
Using Spark, Kafka, Cassandra and Akka on Mesos for Real-Time Personalization
Using Spark, Kafka, Cassandra and Akka on Mesos for Real-Time PersonalizationUsing Spark, Kafka, Cassandra and Akka on Mesos for Real-Time Personalization
Using Spark, Kafka, Cassandra and Akka on Mesos for Real-Time Personalization
 
Data Science lifecycle with Apache Zeppelin and Spark by Moonsoo Lee
Data Science lifecycle with Apache Zeppelin and Spark by Moonsoo LeeData Science lifecycle with Apache Zeppelin and Spark by Moonsoo Lee
Data Science lifecycle with Apache Zeppelin and Spark by Moonsoo Lee
 
Lambda Architecture with Spark, Spark Streaming, Kafka, Cassandra, Akka and S...
Lambda Architecture with Spark, Spark Streaming, Kafka, Cassandra, Akka and S...Lambda Architecture with Spark, Spark Streaming, Kafka, Cassandra, Akka and S...
Lambda Architecture with Spark, Spark Streaming, Kafka, Cassandra, Akka and S...
 
Data processing platforms architectures with Spark, Mesos, Akka, Cassandra an...
Data processing platforms architectures with Spark, Mesos, Akka, Cassandra an...Data processing platforms architectures with Spark, Mesos, Akka, Cassandra an...
Data processing platforms architectures with Spark, Mesos, Akka, Cassandra an...
 
Akka in Production - ScalaDays 2015
Akka in Production - ScalaDays 2015Akka in Production - ScalaDays 2015
Akka in Production - ScalaDays 2015
 
Deep Recurrent Neural Networks for Sequence Learning in Spark by Yves Mabiala
Deep Recurrent Neural Networks for Sequence Learning in Spark by Yves MabialaDeep Recurrent Neural Networks for Sequence Learning in Spark by Yves Mabiala
Deep Recurrent Neural Networks for Sequence Learning in Spark by Yves Mabiala
 
Consumer offset management in Kafka
Consumer offset management in KafkaConsumer offset management in Kafka
Consumer offset management in Kafka
 
Spark streaming: Best Practices
Spark streaming: Best PracticesSpark streaming: Best Practices
Spark streaming: Best Practices
 

Similar to Four Things to Know About Reliable Spark Streaming with Typesafe and Databricks

Recipes for Running Spark Streaming Applications in Production-(Tathagata Das...
Recipes for Running Spark Streaming Applications in Production-(Tathagata Das...Recipes for Running Spark Streaming Applications in Production-(Tathagata Das...
Recipes for Running Spark Streaming Applications in Production-(Tathagata Das...Spark Summit
 
Modus operandi of Spark Streaming - Recipes for Running your Streaming Applic...
Modus operandi of Spark Streaming - Recipes for Running your Streaming Applic...Modus operandi of Spark Streaming - Recipes for Running your Streaming Applic...
Modus operandi of Spark Streaming - Recipes for Running your Streaming Applic...DataWorks Summit
 
Strata NYC 2015: What's new in Spark Streaming
Strata NYC 2015: What's new in Spark StreamingStrata NYC 2015: What's new in Spark Streaming
Strata NYC 2015: What's new in Spark StreamingDatabricks
 
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...Guido Schmutz
 
Productionizing your Streaming Jobs
Productionizing your Streaming JobsProductionizing your Streaming Jobs
Productionizing your Streaming JobsDatabricks
 
Deploying Apache Flume to enable low-latency analytics
Deploying Apache Flume to enable low-latency analyticsDeploying Apache Flume to enable low-latency analytics
Deploying Apache Flume to enable low-latency analyticsDataWorks Summit
 
Spark Study Notes
Spark Study NotesSpark Study Notes
Spark Study NotesRichard Kuo
 
Learning spark ch10 - Spark Streaming
Learning spark ch10 - Spark StreamingLearning spark ch10 - Spark Streaming
Learning spark ch10 - Spark Streamingphanleson
 
Lessons Learned From PayPal: Implementing Back-Pressure With Akka Streams And...
Lessons Learned From PayPal: Implementing Back-Pressure With Akka Streams And...Lessons Learned From PayPal: Implementing Back-Pressure With Akka Streams And...
Lessons Learned From PayPal: Implementing Back-Pressure With Akka Streams And...Lightbend
 
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...Guido Schmutz
 
Flink Streaming Hadoop Summit San Jose
Flink Streaming Hadoop Summit San JoseFlink Streaming Hadoop Summit San Jose
Flink Streaming Hadoop Summit San JoseKostas Tzoumas
 
R the unsung hero of Big Data
R the unsung hero of Big DataR the unsung hero of Big Data
R the unsung hero of Big DataDhafer Malouche
 
What's New in Apache Spark 2.3 & Why Should You Care
What's New in Apache Spark 2.3 & Why Should You CareWhat's New in Apache Spark 2.3 & Why Should You Care
What's New in Apache Spark 2.3 & Why Should You CareDatabricks
 
Spark Streaming& Kafka-The Future of Stream Processing by Hari Shreedharan of...
Spark Streaming& Kafka-The Future of Stream Processing by Hari Shreedharan of...Spark Streaming& Kafka-The Future of Stream Processing by Hari Shreedharan of...
Spark Streaming& Kafka-The Future of Stream Processing by Hari Shreedharan of...Data Con LA
 
Spark Streaming & Kafka-The Future of Stream Processing
Spark Streaming & Kafka-The Future of Stream ProcessingSpark Streaming & Kafka-The Future of Stream Processing
Spark Streaming & Kafka-The Future of Stream ProcessingJack Gudenkauf
 
Apache Big Data EU 2016: Building Streaming Applications with Apache Apex
Apache Big Data EU 2016: Building Streaming Applications with Apache ApexApache Big Data EU 2016: Building Streaming Applications with Apache Apex
Apache Big Data EU 2016: Building Streaming Applications with Apache ApexApache Apex
 
xPatterns on Spark, Shark, Mesos, Tachyon
xPatterns on Spark, Shark, Mesos, TachyonxPatterns on Spark, Shark, Mesos, Tachyon
xPatterns on Spark, Shark, Mesos, TachyonClaudiu Barbura
 
Spark (Structured) Streaming vs. Kafka Streams
Spark (Structured) Streaming vs. Kafka StreamsSpark (Structured) Streaming vs. Kafka Streams
Spark (Structured) Streaming vs. Kafka StreamsGuido Schmutz
 

Similar to Four Things to Know About Reliable Spark Streaming with Typesafe and Databricks (20)

Recipes for Running Spark Streaming Applications in Production-(Tathagata Das...
Recipes for Running Spark Streaming Applications in Production-(Tathagata Das...Recipes for Running Spark Streaming Applications in Production-(Tathagata Das...
Recipes for Running Spark Streaming Applications in Production-(Tathagata Das...
 
Modus operandi of Spark Streaming - Recipes for Running your Streaming Applic...
Modus operandi of Spark Streaming - Recipes for Running your Streaming Applic...Modus operandi of Spark Streaming - Recipes for Running your Streaming Applic...
Modus operandi of Spark Streaming - Recipes for Running your Streaming Applic...
 
20170126 big data processing
20170126 big data processing20170126 big data processing
20170126 big data processing
 
Strata NYC 2015: What's new in Spark Streaming
Strata NYC 2015: What's new in Spark StreamingStrata NYC 2015: What's new in Spark Streaming
Strata NYC 2015: What's new in Spark Streaming
 
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...
 
Productionizing your Streaming Jobs
Productionizing your Streaming JobsProductionizing your Streaming Jobs
Productionizing your Streaming Jobs
 
Deploying Apache Flume to enable low-latency analytics
Deploying Apache Flume to enable low-latency analyticsDeploying Apache Flume to enable low-latency analytics
Deploying Apache Flume to enable low-latency analytics
 
Spark Study Notes
Spark Study NotesSpark Study Notes
Spark Study Notes
 
Learning spark ch10 - Spark Streaming
Learning spark ch10 - Spark StreamingLearning spark ch10 - Spark Streaming
Learning spark ch10 - Spark Streaming
 
Lessons Learned From PayPal: Implementing Back-Pressure With Akka Streams And...
Lessons Learned From PayPal: Implementing Back-Pressure With Akka Streams And...Lessons Learned From PayPal: Implementing Back-Pressure With Akka Streams And...
Lessons Learned From PayPal: Implementing Back-Pressure With Akka Streams And...
 
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...
 
Flink Streaming Hadoop Summit San Jose
Flink Streaming Hadoop Summit San JoseFlink Streaming Hadoop Summit San Jose
Flink Streaming Hadoop Summit San Jose
 
R the unsung hero of Big Data
R the unsung hero of Big DataR the unsung hero of Big Data
R the unsung hero of Big Data
 
What's New in Apache Spark 2.3 & Why Should You Care
What's New in Apache Spark 2.3 & Why Should You CareWhat's New in Apache Spark 2.3 & Why Should You Care
What's New in Apache Spark 2.3 & Why Should You Care
 
Spark Streaming& Kafka-The Future of Stream Processing by Hari Shreedharan of...
Spark Streaming& Kafka-The Future of Stream Processing by Hari Shreedharan of...Spark Streaming& Kafka-The Future of Stream Processing by Hari Shreedharan of...
Spark Streaming& Kafka-The Future of Stream Processing by Hari Shreedharan of...
 
Spark Streaming & Kafka-The Future of Stream Processing
Spark Streaming & Kafka-The Future of Stream ProcessingSpark Streaming & Kafka-The Future of Stream Processing
Spark Streaming & Kafka-The Future of Stream Processing
 
So you think you can stream.pptx
So you think you can stream.pptxSo you think you can stream.pptx
So you think you can stream.pptx
 
Apache Big Data EU 2016: Building Streaming Applications with Apache Apex
Apache Big Data EU 2016: Building Streaming Applications with Apache ApexApache Big Data EU 2016: Building Streaming Applications with Apache Apex
Apache Big Data EU 2016: Building Streaming Applications with Apache Apex
 
xPatterns on Spark, Shark, Mesos, Tachyon
xPatterns on Spark, Shark, Mesos, TachyonxPatterns on Spark, Shark, Mesos, Tachyon
xPatterns on Spark, Shark, Mesos, Tachyon
 
Spark (Structured) Streaming vs. Kafka Streams
Spark (Structured) Streaming vs. Kafka StreamsSpark (Structured) Streaming vs. Kafka Streams
Spark (Structured) Streaming vs. Kafka Streams
 

More from Legacy Typesafe (now Lightbend)

The How and Why of Fast Data Analytics with Apache Spark
The How and Why of Fast Data Analytics with Apache SparkThe How and Why of Fast Data Analytics with Apache Spark
The How and Why of Fast Data Analytics with Apache SparkLegacy Typesafe (now Lightbend)
 
Typesafe Reactive Platform: Monitoring 1.0, Commercial features and more
Typesafe Reactive Platform: Monitoring 1.0, Commercial features and moreTypesafe Reactive Platform: Monitoring 1.0, Commercial features and more
Typesafe Reactive Platform: Monitoring 1.0, Commercial features and moreLegacy Typesafe (now Lightbend)
 
Akka 2.4 plus new commercial features in Typesafe Reactive Platform
Akka 2.4 plus new commercial features in Typesafe Reactive PlatformAkka 2.4 plus new commercial features in Typesafe Reactive Platform
Akka 2.4 plus new commercial features in Typesafe Reactive PlatformLegacy Typesafe (now Lightbend)
 
Reactive Revealed Part 3 of 3: Resiliency, Failures vs Errors, Isolation, Del...
Reactive Revealed Part 3 of 3: Resiliency, Failures vs Errors, Isolation, Del...Reactive Revealed Part 3 of 3: Resiliency, Failures vs Errors, Isolation, Del...
Reactive Revealed Part 3 of 3: Resiliency, Failures vs Errors, Isolation, Del...Legacy Typesafe (now Lightbend)
 
Akka 2.4 plus commercial features in Typesafe Reactive Platform
Akka 2.4 plus commercial features in Typesafe Reactive PlatformAkka 2.4 plus commercial features in Typesafe Reactive Platform
Akka 2.4 plus commercial features in Typesafe Reactive PlatformLegacy Typesafe (now Lightbend)
 
Reactive Revealed Part 2: Scalability, Elasticity and Location Transparency i...
Reactive Revealed Part 2: Scalability, Elasticity and Location Transparency i...Reactive Revealed Part 2: Scalability, Elasticity and Location Transparency i...
Reactive Revealed Part 2: Scalability, Elasticity and Location Transparency i...Legacy Typesafe (now Lightbend)
 
Microservices 101: Exploiting Reality's Constraints with Technology
Microservices 101: Exploiting Reality's Constraints with TechnologyMicroservices 101: Exploiting Reality's Constraints with Technology
Microservices 101: Exploiting Reality's Constraints with TechnologyLegacy Typesafe (now Lightbend)
 
A Deeper Look Into Reactive Streams with Akka Streams 1.0 and Slick 3.0
A Deeper Look Into Reactive Streams with Akka Streams 1.0 and Slick 3.0A Deeper Look Into Reactive Streams with Akka Streams 1.0 and Slick 3.0
A Deeper Look Into Reactive Streams with Akka Streams 1.0 and Slick 3.0Legacy Typesafe (now Lightbend)
 
Modernizing Your Aging Architecture: What Enterprise Architects Need To Know ...
Modernizing Your Aging Architecture: What Enterprise Architects Need To Know ...Modernizing Your Aging Architecture: What Enterprise Architects Need To Know ...
Modernizing Your Aging Architecture: What Enterprise Architects Need To Know ...Legacy Typesafe (now Lightbend)
 
Reactive Streams 1.0.0 and Why You Should Care (webinar)
Reactive Streams 1.0.0 and Why You Should Care (webinar)Reactive Streams 1.0.0 and Why You Should Care (webinar)
Reactive Streams 1.0.0 and Why You Should Care (webinar)Legacy Typesafe (now Lightbend)
 
[Sneak Preview] Apache Spark: Preparing for the next wave of Reactive Big Data
[Sneak Preview] Apache Spark: Preparing for the next wave of Reactive Big Data[Sneak Preview] Apache Spark: Preparing for the next wave of Reactive Big Data
[Sneak Preview] Apache Spark: Preparing for the next wave of Reactive Big DataLegacy Typesafe (now Lightbend)
 

More from Legacy Typesafe (now Lightbend) (15)

The How and Why of Fast Data Analytics with Apache Spark
The How and Why of Fast Data Analytics with Apache SparkThe How and Why of Fast Data Analytics with Apache Spark
The How and Why of Fast Data Analytics with Apache Spark
 
Reactive Design Patterns
Reactive Design PatternsReactive Design Patterns
Reactive Design Patterns
 
Revitalizing Aging Architectures with Microservices
Revitalizing Aging Architectures with MicroservicesRevitalizing Aging Architectures with Microservices
Revitalizing Aging Architectures with Microservices
 
Typesafe Reactive Platform: Monitoring 1.0, Commercial features and more
Typesafe Reactive Platform: Monitoring 1.0, Commercial features and moreTypesafe Reactive Platform: Monitoring 1.0, Commercial features and more
Typesafe Reactive Platform: Monitoring 1.0, Commercial features and more
 
Akka 2.4 plus new commercial features in Typesafe Reactive Platform
Akka 2.4 plus new commercial features in Typesafe Reactive PlatformAkka 2.4 plus new commercial features in Typesafe Reactive Platform
Akka 2.4 plus new commercial features in Typesafe Reactive Platform
 
Reactive Revealed Part 3 of 3: Resiliency, Failures vs Errors, Isolation, Del...
Reactive Revealed Part 3 of 3: Resiliency, Failures vs Errors, Isolation, Del...Reactive Revealed Part 3 of 3: Resiliency, Failures vs Errors, Isolation, Del...
Reactive Revealed Part 3 of 3: Resiliency, Failures vs Errors, Isolation, Del...
 
Akka 2.4 plus commercial features in Typesafe Reactive Platform
Akka 2.4 plus commercial features in Typesafe Reactive PlatformAkka 2.4 plus commercial features in Typesafe Reactive Platform
Akka 2.4 plus commercial features in Typesafe Reactive Platform
 
Reactive Revealed Part 2: Scalability, Elasticity and Location Transparency i...
Reactive Revealed Part 2: Scalability, Elasticity and Location Transparency i...Reactive Revealed Part 2: Scalability, Elasticity and Location Transparency i...
Reactive Revealed Part 2: Scalability, Elasticity and Location Transparency i...
 
Microservices 101: Exploiting Reality's Constraints with Technology
Microservices 101: Exploiting Reality's Constraints with TechnologyMicroservices 101: Exploiting Reality's Constraints with Technology
Microservices 101: Exploiting Reality's Constraints with Technology
 
A Deeper Look Into Reactive Streams with Akka Streams 1.0 and Slick 3.0
A Deeper Look Into Reactive Streams with Akka Streams 1.0 and Slick 3.0A Deeper Look Into Reactive Streams with Akka Streams 1.0 and Slick 3.0
A Deeper Look Into Reactive Streams with Akka Streams 1.0 and Slick 3.0
 
Modernizing Your Aging Architecture: What Enterprise Architects Need To Know ...
Modernizing Your Aging Architecture: What Enterprise Architects Need To Know ...Modernizing Your Aging Architecture: What Enterprise Architects Need To Know ...
Modernizing Your Aging Architecture: What Enterprise Architects Need To Know ...
 
Reactive Streams 1.0.0 and Why You Should Care (webinar)
Reactive Streams 1.0.0 and Why You Should Care (webinar)Reactive Streams 1.0.0 and Why You Should Care (webinar)
Reactive Streams 1.0.0 and Why You Should Care (webinar)
 
Going Reactive in Java with Typesafe Reactive Platform
Going Reactive in Java with Typesafe Reactive PlatformGoing Reactive in Java with Typesafe Reactive Platform
Going Reactive in Java with Typesafe Reactive Platform
 
Why Play Framework is fast
Why Play Framework is fastWhy Play Framework is fast
Why Play Framework is fast
 
[Sneak Preview] Apache Spark: Preparing for the next wave of Reactive Big Data
[Sneak Preview] Apache Spark: Preparing for the next wave of Reactive Big Data[Sneak Preview] Apache Spark: Preparing for the next wave of Reactive Big Data
[Sneak Preview] Apache Spark: Preparing for the next wave of Reactive Big Data
 

Recently uploaded

Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...MyIntelliSource, Inc.
 
Right Money Management App For Your Financial Goals
Right Money Management App For Your Financial GoalsRight Money Management App For Your Financial Goals
Right Money Management App For Your Financial GoalsJhone kinadey
 
SyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AI
SyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AISyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AI
SyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AIABDERRAOUF MEHENNI
 
Salesforce Certified Field Service Consultant
Salesforce Certified Field Service ConsultantSalesforce Certified Field Service Consultant
Salesforce Certified Field Service ConsultantAxelRicardoTrocheRiq
 
Unveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time ApplicationsUnveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time ApplicationsAlberto González Trastoy
 
Test Automation Strategy for Frontend and Backend
Test Automation Strategy for Frontend and BackendTest Automation Strategy for Frontend and Backend
Test Automation Strategy for Frontend and BackendArshad QA
 
Cloud Management Software Platforms: OpenStack
Cloud Management Software Platforms: OpenStackCloud Management Software Platforms: OpenStack
Cloud Management Software Platforms: OpenStackVICTOR MAESTRE RAMIREZ
 
Project Based Learning (A.I).pptx detail explanation
Project Based Learning (A.I).pptx detail explanationProject Based Learning (A.I).pptx detail explanation
Project Based Learning (A.I).pptx detail explanationkaushalgiri8080
 
A Secure and Reliable Document Management System is Essential.docx
A Secure and Reliable Document Management System is Essential.docxA Secure and Reliable Document Management System is Essential.docx
A Secure and Reliable Document Management System is Essential.docxComplianceQuest1
 
Professional Resume Template for Software Developers
Professional Resume Template for Software DevelopersProfessional Resume Template for Software Developers
Professional Resume Template for Software DevelopersVinodh Ram
 
Der Spagat zwischen BIAS und FAIRNESS (2024)
Der Spagat zwischen BIAS und FAIRNESS (2024)Der Spagat zwischen BIAS und FAIRNESS (2024)
Der Spagat zwischen BIAS und FAIRNESS (2024)OPEN KNOWLEDGE GmbH
 
5 Signs You Need a Fashion PLM Software.pdf
5 Signs You Need a Fashion PLM Software.pdf5 Signs You Need a Fashion PLM Software.pdf
5 Signs You Need a Fashion PLM Software.pdfWave PLM
 
TECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service providerTECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service providermohitmore19
 
Unlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language ModelsUnlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language Modelsaagamshah0812
 
Active Directory Penetration Testing, cionsystems.com.pdf
Active Directory Penetration Testing, cionsystems.com.pdfActive Directory Penetration Testing, cionsystems.com.pdf
Active Directory Penetration Testing, cionsystems.com.pdfCionsystems
 
The Essentials of Digital Experience Monitoring_ A Comprehensive Guide.pdf
The Essentials of Digital Experience Monitoring_ A Comprehensive Guide.pdfThe Essentials of Digital Experience Monitoring_ A Comprehensive Guide.pdf
The Essentials of Digital Experience Monitoring_ A Comprehensive Guide.pdfkalichargn70th171
 
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...ICS
 
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online ☂️
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online  ☂️CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online  ☂️
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online ☂️anilsa9823
 

Recently uploaded (20)

Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
 
Right Money Management App For Your Financial Goals
Right Money Management App For Your Financial GoalsRight Money Management App For Your Financial Goals
Right Money Management App For Your Financial Goals
 
Exploring iOS App Development: Simplifying the Process
Exploring iOS App Development: Simplifying the ProcessExploring iOS App Development: Simplifying the Process
Exploring iOS App Development: Simplifying the Process
 
SyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AI
SyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AISyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AI
SyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AI
 
Salesforce Certified Field Service Consultant
Salesforce Certified Field Service ConsultantSalesforce Certified Field Service Consultant
Salesforce Certified Field Service Consultant
 
Unveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time ApplicationsUnveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
 
Test Automation Strategy for Frontend and Backend
Test Automation Strategy for Frontend and BackendTest Automation Strategy for Frontend and Backend
Test Automation Strategy for Frontend and Backend
 
Cloud Management Software Platforms: OpenStack
Cloud Management Software Platforms: OpenStackCloud Management Software Platforms: OpenStack
Cloud Management Software Platforms: OpenStack
 
Project Based Learning (A.I).pptx detail explanation
Project Based Learning (A.I).pptx detail explanationProject Based Learning (A.I).pptx detail explanation
Project Based Learning (A.I).pptx detail explanation
 
A Secure and Reliable Document Management System is Essential.docx
A Secure and Reliable Document Management System is Essential.docxA Secure and Reliable Document Management System is Essential.docx
A Secure and Reliable Document Management System is Essential.docx
 
Call Girls In Mukherjee Nagar 📱 9999965857 🤩 Delhi 🫦 HOT AND SEXY VVIP 🍎 SE...
Call Girls In Mukherjee Nagar 📱  9999965857  🤩 Delhi 🫦 HOT AND SEXY VVIP 🍎 SE...Call Girls In Mukherjee Nagar 📱  9999965857  🤩 Delhi 🫦 HOT AND SEXY VVIP 🍎 SE...
Call Girls In Mukherjee Nagar 📱 9999965857 🤩 Delhi 🫦 HOT AND SEXY VVIP 🍎 SE...
 
Professional Resume Template for Software Developers
Professional Resume Template for Software DevelopersProfessional Resume Template for Software Developers
Professional Resume Template for Software Developers
 
Der Spagat zwischen BIAS und FAIRNESS (2024)
Der Spagat zwischen BIAS und FAIRNESS (2024)Der Spagat zwischen BIAS und FAIRNESS (2024)
Der Spagat zwischen BIAS und FAIRNESS (2024)
 
5 Signs You Need a Fashion PLM Software.pdf
5 Signs You Need a Fashion PLM Software.pdf5 Signs You Need a Fashion PLM Software.pdf
5 Signs You Need a Fashion PLM Software.pdf
 
TECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service providerTECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service provider
 
Unlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language ModelsUnlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language Models
 
Active Directory Penetration Testing, cionsystems.com.pdf
Active Directory Penetration Testing, cionsystems.com.pdfActive Directory Penetration Testing, cionsystems.com.pdf
Active Directory Penetration Testing, cionsystems.com.pdf
 
The Essentials of Digital Experience Monitoring_ A Comprehensive Guide.pdf
The Essentials of Digital Experience Monitoring_ A Comprehensive Guide.pdfThe Essentials of Digital Experience Monitoring_ A Comprehensive Guide.pdf
The Essentials of Digital Experience Monitoring_ A Comprehensive Guide.pdf
 
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
 
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online ☂️
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online  ☂️CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online  ☂️
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online ☂️
 

Four Things to Know About Reliable Spark Streaming with Typesafe and Databricks

  • 1. Four Things to Know about Reliable Spark Streaming Dean Wampler, Typesafe Tathagata Das, Databricks
  • 2. Agenda for today • The Stream Processing Landscape • How Spark Streaming Works - A Quick Overview • Features in Spark Streaming that Help Prevent Data Loss • Design Tips for Successful Streaming Applications
  • 7. How Spark Streaming Works: A Quick Overview
  • 8. Spark Streaming Scalable, fault-tolerant stream processing system File systems Databases Dashboards Flume Kinesis HDFS/S3 Kafka Twitter High-level API joins, windows, … often 5x less code Fault-tolerant Exactly-once semantics, even for stateful ops Integration Integrates with MLlib, SQL, DataFrames, GraphX
  • 9. Spark Streaming Receivers receive data streams and chop them up into batches Spark processes the batches and pushes out the results 9 data streams receivers batches results
  • 10. Word Count with Kafka val!context!=!new!StreamingContext(conf,!Seconds(1))! val!lines!=!KafkaUtils.createStream(context,!...)! 10 entry point of streaming functionality create DStream from Kafka data
  • 11. Word Count with Kafka val!context!=!new!StreamingContext(conf,!Seconds(1))! val!lines!=!KafkaUtils.createStream(context,!...)! val!words!=!lines.flatMap(_.split("!"))! 11 split lines into words
  • 12. Word Count with Kafka val!context!=!new!StreamingContext(conf,!Seconds(1))! val!lines!=!KafkaUtils.createStream(context,!...)! val!words!=!lines.flatMap(_.split("!"))! val!wordCounts!=!words.map(x!=>!(x,!1))! !!!!!!!!!!!!!!!!!!!!!!.reduceByKey(_!+!_)! wordCounts.print()! context.start()! 12 print some counts on screen count the words start receiving and transforming the data
  • 13. Word Count with Kafka object!WordCount!{! !!def!main(args:!Array[String])!{! !!!!val!context!=!new!StreamingContext(new!SparkConf(),!Seconds(1))! !!!!val!lines!=!KafkaUtils.createStream(context,!...)! !!!!val!words!=!lines.flatMap(_.split("!"))! !!!!val!wordCounts!=!words.map(x!=>!(x,1)).reduceByKey(_!+!_)! !!!!wordCounts.print()! !!!!context.start()! !!!!context.awaitTermination()! !!}! }! 13
  • 14. Features in Spark Streaming that Help Prevent Data Loss
  • 15. 15 A Deeper View of Spark Streaming
  • 16. Any Spark Application 16 Spark Driver User code runs in the driver process YARN / Mesos / Spark Standalone cluster
  • 17. Any Spark Application 17 Spark Driver User code runs in the driver process YARN / Mesos / Spark Standalone cluster Spark Executor Spark Executor Spark Executor Driver launches executors in cluster
  • 18. Any Spark Application 18 Spark Driver User code runs in the driver process YARN / Mesos / Spark Standalone cluster Tasks sent to executors for processing data Spark Executor Spark Executor Spark Executor Driver launches executors in cluster
  • 19. Spark Streaming Application: Receive data 19 Executor Executor Driver runs receivers as long running tasks Receiver Data stream Driver@ object!WordCount!{! !!def!main(args:!Array[String])!{! !!!!val!context!=!new!StreamingContext(...)! !!!!val!lines!=!KafkaUtils.createStream(...)! !!!!val!words!=!lines.flatMap(_.split("!"))! !!!!val!wordCounts!=!words.map(x!=>!(x,1))! !!!!!!!!!!!!!!!!!!!!!!!!!!.reduceByKey(_!+!_)! !!!!wordCounts.print()! !!!!context.start()! !!!!context.awaitTermination()! !!}! }!
  • 20. Spark Streaming Application: Receive data 20 Executor Executor Driver runs receivers as long running tasks Receiver Data stream Driver@ object!WordCount!{! !!def!main(args:!Array[String])!{! !!!!val!context!=!new!StreamingContext(...)! !!!!val!lines!=!KafkaUtils.createStream(...)! !!!!val!words!=!lines.flatMap(_.split("!"))! !!!!val!wordCounts!=!words.map(x!=>!(x,1))! !!!!!!!!!!!!!!!!!!!!!!!!!!.reduceByKey(_!+!_)! !!!!wordCounts.print()! !!!!context.start()! !!!!context.awaitTermination()! !!}! }! Receiver divides stream into blocks and keeps in memory Data Blocks!
  • 21. Spark Streaming Application: Receive data 21 Executor Executor Driver runs receivers as long running tasks Receiver Data stream Driver@ object!WordCount!{! !!def!main(args:!Array[String])!{! !!!!val!context!=!new!StreamingContext(...)! !!!!val!lines!=!KafkaUtils.createStream(...)! !!!!val!words!=!lines.flatMap(_.split("!"))! !!!!val!wordCounts!=!words.map(x!=>!(x,1))! !!!!!!!!!!!!!!!!!!!!!!!!!!.reduceByKey(_!+!_)! !!!!wordCounts.print()! !!!!context.start()! !!!!context.awaitTermination()! !!}! }! Receiver divides stream into blocks and keeps in memory Data Blocks! Blocks also replicated to another executor Data Blocks!
  • 22. Spark Streaming Application: Process data 22 Executor Executor Receiver Data Blocks! Data Blocks! Every batch interval, driver launches tasks to process the blocks Driver@ object!WordCount!{! !!def!main(args:!Array[String])!{! !!!!val!context!=!new!StreamingContext(...)! !!!!val!lines!=!KafkaUtils.createStream(...)! !!!!val!words!=!lines.flatMap(_.split("!"))! !!!!val!wordCounts!=!words.map(x!=>!(x,1))! !!!!!!!!!!!!!!!!!!!!!!!!!!.reduceByKey(_!+!_)! !!!!wordCounts.print()! !!!!context.start()! !!!!context.awaitTermination()! !!}! }!
  • 23. Spark Streaming Application: Process data 23 Executor Executor Receiver Data Blocks! Data Blocks! Data store Every batch interval, driver launches tasks to process the blocks Driver@ object!WordCount!{! !!def!main(args:!Array[String])!{! !!!!val!context!=!new!StreamingContext(...)! !!!!val!lines!=!KafkaUtils.createStream(...)! !!!!val!words!=!lines.flatMap(_.split("!"))! !!!!val!wordCounts!=!words.map(x!=>!(x,1))! !!!!!!!!!!!!!!!!!!!!!!!!!!.reduceByKey(_!+!_)! !!!!wordCounts.print()! !!!!context.start()! !!!!context.awaitTermination()! !!}! }!
  • 25. Failures? Why care? Many streaming applications need zero data loss guarantees despite any kind of failures in the system At least once guarantee – every record processed at least once Exactly once guarantee – every record processed exactly once Different kinds of failures – executor and driver Some failures and guarantee requirements need additional configurations and setups 25
  • 26. Executor Receiver Data Blocks! What if an executor fails? 26 Executor Failed Ex. Receiver Blocks! Blocks! Driver If executor fails, receiver is lost and all blocks are lost
  • 27. Executor Receiver Data Blocks! What if an executor fails? Tasks and receivers restarted by Spark automatically, no config needed 27 Executor Failed Ex. Receiver Blocks! Blocks! Driver If executor fails, receiver is lost and all blocks are lost Receiver Receiver restarted Tasks restarted on block replicas
  • 28. What if the driver fails? 28 Executor Blocks!How do we recover? When the driver fails, all the executors fail All computation, all received blocks are lost Executor Receiver Blocks! Failed Ex. Receiver Blocks! Failed Executor Blocks! Driver Failed Driver
  • 29. Recovering Driver w/ DStream Checkpointing DStream Checkpointing: Periodically save the DAG of DStreams to fault-tolerant storage 29 Executor Blocks! Executor Receiver Blocks! Active Driver Checkpoint info to HDFS / S3
  • 30. Recovering Driver w/ DStream Checkpointing 30 Failed driver can be restarted from checkpoint information Failed Driver Restarted Driver DStream Checkpointing: Periodically save the DAG of DStreams to fault-tolerant storage
  • 31. Recovering Driver w/ DStream Checkpointing 31 Failed driver can be restarted from checkpoint information Failed Driver Restarted Driver New Executor New Executor Receiver New executors launched and receivers restarted DStream Checkpointing: Periodically save the DAG of DStreams to fault-tolerant storage
  • 32. Recovering Driver w/ DStream Checkpointing 1.  Configure automatic driver restart All cluster managers support this 2.  Set a checkpoint directory in a HDFS-compatible file system !streamingContext.checkpoint(hdfsDirectory)! 3.  Slightly restructure of the code to use checkpoints for recovery 32
  • 33. Configurating Automatic Driver Restart Spark Standalone – Use spark-submit with “cluster” mode and “-- supervise” See http://spark.apache.org/docs/latest/spark-standalone.html YARN – Use spark-submit in “cluster” mode See YARN config “yarn.resourcemanager.am.max-attempts” Mesos – Marathon can restart applications or use the “--supervise” flag. 33
  • 34. Restructuring code for Checkpointing 34 val!context!=!new!StreamingContext(...)! val!lines!=!KafkaUtils.createStream(...)! val!words!=!lines.flatMap(...)! ...! context.start()! Create + Setup Start
  • 35. Restructuring code for Checkpointing 35 val!context!=!new!StreamingContext(...)! val!lines!=!KafkaUtils.createStream(...)! val!words!=!lines.flatMap(...)! ...! context.start()! Create + Setup Start def!creatingFunc():!StreamingContext!=!{!! !!!val!context!=!new!StreamingContext(...)!!! !!!val!lines!=!KafkaUtils.createStream(...)! !!!val!words!=!lines.flatMap(...)! !!!...! !!!context.checkpoint(hdfsDir)! }! Put all setup code into a function that returns a new StreamingContext
  • 36. Restructuring code for Checkpointing 36 val!context!=!new!StreamingContext(...)! val!lines!=!KafkaUtils.createStream(...)! val!words!=!lines.flatMap(...)! ...! context.start()! Create + Setup Start def!creatingFunc():!StreamingContext!=!{!! !!!val!context!=!new!StreamingContext(...)!!! !!!val!lines!=!KafkaUtils.createStream(...)! !!!val!words!=!lines.flatMap(...)! !!!...! !!!context.checkpoint(hdfsDir)! }! Put all setup code into a function that returns a new StreamingContext Get context setup from HDFS dir OR create a new one with the function val!context!=! StreamingContext.getOrCreate(! !!hdfsDir,!creatingFunc)! context.start()!
  • 37. Restructuring code for Checkpointing StreamingContext.getOrCreate(): If HDFS directory has checkpoint info recover context from info else call creatingFunc() to create and setup a new context Restarted process can figure out whether to recover using checkpoint info or not 37 def!creatingFunc():!StreamingContext!=!{!! !!!val!context!=!new!StreamingContext(...)!!! !!!val!lines!=!KafkaUtils.createStream(...)! !!!val!words!=!lines.flatMap(...)! !!!...! !!!context.checkpoint(hdfsDir)! }! val!context!=! StreamingContext.getOrCreate(! !!hdfsDir,!creatingFunc)! context.start()!
  • 38. Received blocks lost on Restart! 38 Failed Driver Restarted Driver New Executor New Ex. Receiver No Blocks! In-memory blocks of buffered data are lost on driver restart
  • 39. Recovering data with Write Ahead Logs Write Ahead Log (WAL): Synchronously save received data to fault-tolerant storage 39 Executor Blocks saved to HDFS Executor Receiver Blocks! Active Driver Data stream
  • 40. Recovering data with Write Ahead Logs 40 Failed Driver Restarted Driver New Executor New Ex. Receiver Blocks! Blocks recovered from Write Ahead Log Write Ahead Log (WAL): Synchronously save received data to fault-tolerant storage
  • 41. Recovering data with Write Ahead Logs 1.  Enable checkpointing, logs written in checkpoint directory 3.  Enabled WAL in SparkConf configuration !!!!!sparkConf.set("spark.streaming.receiver.writeAheadLog.enable",!"true")! 3.  Receiver should also be reliable Acknowledge source only after data saved to WAL Unacked data will be replayed from source by restarted receiver 5.  Disable in-memory replication (already replicated by HDFS) !Use StorageLevel.MEMORY_AND_DISK_SER for input DStreams 41
  • 42. RDD Checkpointing •  Stateful stream processing can lead to long RDD lineages •  Long lineage = bad for fault-tolerance, too much recomputation •  RDD checkpointing saves RDD data to the fault-tolerant storage to limit lineage and recomputation •  More: http://spark.apache.org/docs/latest/streaming-programming-guide.html#checkpointing 42
  • 43. Fault-tolerance Semantics 43 Zero data loss = every stage processes each event at least once despite any failure Sources Transforming Sinks Outputting Receiving
  • 44. Fault-tolerance Semantics 44 Sources Transforming Sinks Outputting Receiving Exactly once, as long as received data is not lost Receiving Outputting End-to-end semantics: At-least once
  • 45. Fault-tolerance Semantics 45 Sources Transforming Sinks Outputting Receiving Exactly once, as long as received data is not lost Receiving Outputting Exactly once, if outputs are idempotent or transactional End-to-end semantics: At-least once
  • 46. Fault-tolerance Semantics 46 Sources Transforming Sinks Outputting Receiving Exactly once, as long as received data is not lost At least once, w/ Checkpointing + WAL + Reliable receiversReceiving Outputting Exactly once, if outputs are idempotent or transactional End-to-end semantics: At-least once
  • 47. Fault-tolerance Semantics 47 Exactly once receiving with new Kafka Direct approach Treats Kafka like a replicated log, reads it like a file Does not use receivers No need to create multiple DStreams and union them No need to enable Write Ahead Logs ! @val!directKafkaStream!=!KafkaUtils.createDirectStream(...)! ! https://databricks.com/blog/2015/03/30/improvements-to-kafka-integration-of-spark-streaming.html http://spark.apache.org/docs/latest/streaming-kafka-integration.html Sources Transforming Sinks Outputting Receiving
  • 48. Fault-tolerance Semantics 48 Exactly once receiving with new Kafka Direct approach Sources Transforming Sinks Outputting Receiving Exactly once, as long as received data is not lost Exactly once, if outputs are idempotent or transactional End-to-end semantics: Exactly once!
  • 49. Design Tips for Successful Streaming Applications
  • 50. Areas for consideration • Enhance resilience with additional components. • Mini-batch vs. per-message handling. • Exploit Reactive Streams.
  • 51. • Use Storm, Akka, Samza, etc. for handling individual messages, especially with sub- second latency requirements. • Use Spark Streaming’s mini-batch model for the Lambda architecture and highly-scalable analytics. Mini-batch vs. per-message handling
  • 52. • Consider Kafka or Kinesis for resilient buffering in front of Spark Streaming. •  Buffer for traffic spikes. •  Re-retrieval of data if an RDD partition is lost and must be reconstructed from the source. • Going to store the raw data anyway? •  Do it first, then ingest to Spark from that storage. Enhance Resiliency with Additional Components.
  • 53. • Spark Streaming v1.5 will have support for back pressure to more easily build end-to-end reactive applications Exploit Reactive Streams
  • 54. • Spark Streaming v1.5 will have support for back pressure to more easily build end-to-end reactive applications • Backpressure from consumer to producer: •  Prevents buffer overflows. •  Avoids unnecessary throttling. Exploit Reactive Streams
  • 55. • Spark Streaming v1.4? Buffer with Akka Streams: Exploit Reactive Streams Event Source Akka App Spark Streaming App Event Event Event Event feedback (back pressure)
  • 56. • Spark Streaming v1.4 has a rate limit property: •  spark.streaming.receiver.maxRate •  Consider setting it for long-running streaming apps with a variable input flow rate. • Have a graph of Reactive Streams? Consider using an Akka app to buffer the data fed to Spark Streaming over a socket (until 1.5…). Exploit Reactive Streams
  • 57. Thank you! Dean Wampler, Typesafe Tathagata Das, Databricks
  • 58. Databricks is available as a hosted platform on  AWS with a monthly subscription.  What to do next? Start with a free trial Typesafe now offers certified support for Spark, Mesos & DCOS, read more about it READ MORE
  • 59. REACTIVE PLATFORM
 Full Lifecycle Support for Play, Akka, Scala and Spark Give your project a boost with Reactive Platform: • Monitor Message-Driven Apps • Resolve Network Partitions Decisively • Integrate Easily with Legacy Systems • Eliminate Incompatibility & Security Risks • Protect Apps Against Abuse • Expert Support from Dedicated Product Teams Enjoy learning? See about the availability of on-site training for Scala, Akka, Play and Spark! Learn more about our offersCONTACT US

Editor's Notes

  1. You’ve heard from TD about Streaming internals and how to make your Spark Streaming apps resilient. Let’s expand the view a bit with additional tips.