Spark Streaming has quickly established itself as one of the most popular streaming engines in the Hadoop ecosystem. Not only does it integrate with many types of message brokers and stream sources, it can also leverage the other major Spark modules, such as Spark SQL and MLlib, in the same application. This allows businesses and developers to make use of data in ways they couldn't hope to in the past.
However, while building a Spark Streaming pipeline, it's not sufficient to only know how to express your business logic. Running these pipelines in production with high uptime and continuous monitoring presents a number of operational challenges. Fortunately, Spark Streaming makes that manageable as well. In this talk, we'll go over the main steps you'll need to take to get your Spark Streaming application ready for production, specifically in conjunction with Kafka. This includes gracefully shutting down your application, performing upgrades, monitoring, various useful Spark configurations, and more.
Robert Sanders
Big Data Manager and Engineer
Robert Sanders is an Engineering Manager at Clairvoyant. In his day job, Robert wears multiple hats, going back and forth between architecting and engineering large-scale data platforms. Robert has a deep background in enterprise systems, initially working on full-stack implementations and then focusing on building Data Management Platforms.
About
Background
● Boutique consulting firm centered on building data solutions and products
● All things Web and Data Engineering, Analytics, ML and User Experience to bring it all together
● Supports the core Hadoop platform and data engineering pipelines, and provides administrative and DevOps expertise focused on Hadoop
● What is Spark Streaming and Kafka?
● Steps to Production
○ Managing the Streaming Application (Starting and Stopping)
○ Monitoring
○ Prevent Data Loss
■ Checkpointing
■ Implementing Kafka Delivery Semantics
● Stability
● Summary
Agenda
● Spark Streaming is an extension of the core Spark API that enables scalable, high-throughput, fault-tolerant
stream processing of live data streams.
What is Spark Streaming?
Spark Streaming - https://spark.apache.org/docs/latest/img/streaming-arch.png
● Spark processes micro-batches of data from the input stream on the Spark Engine
Processing in Spark Streaming
Spark Streaming Processing - https://spark.apache.org/docs/latest/img/streaming-flow.png
● Apache Kafka® is a Distributed Streaming Platform
● Kafka is a Circular Buffer
○ Data gets written to disk
○ As the log fills up, the oldest segment files are removed (per the retention settings)
What is Kafka?
Kafka - https://kafka.apache.org/images/kafka_diagram.png
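The "circular buffer" behavior above is driven by the broker/topic retention settings. As a sketch (property names come from Kafka's standard broker configuration; the values here are illustrative, not recommendations):

```
# server.properties (or per-topic overrides via retention.ms / retention.bytes)
log.retention.hours=168       # delete segments older than 7 days
log.retention.bytes=-1        # optionally cap retained bytes per partition (-1 = unbounded)
log.segment.bytes=1073741824  # roll a new segment file at 1 GB
```

Because old data ages out, a consumer that falls too far behind can permanently miss messages, which is why the data-loss topics later in this talk matter.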
val sparkConf = new SparkConf().setAppName("DirectKafkaWordCount")
val ssc = new StreamingContext(sparkConf, Seconds(2))
The Starting Point
val sparkConf = new SparkConf().setAppName("DirectKafkaWordCount")
val ssc = new StreamingContext(sparkConf, Seconds(2))
// Create direct kafka stream with brokers and topics
val topicsSet = topics.split(",").toSet
val kafkaParams = Map[String, Object](
ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG -> brokers,
ConsumerConfig.GROUP_ID_CONFIG -> groupId,
ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG -> classOf[StringDeserializer],
ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG -> classOf[StringDeserializer])
val messages = KafkaUtils.createDirectStream[String, String](
ssc,
LocationStrategies.PreferConsistent,
ConsumerStrategies.Subscribe[String, String]
(topicsSet, kafkaParams))
The Starting Point
val sparkConf = new SparkConf().setAppName("DirectKafkaWordCount")
val ssc = new StreamingContext(sparkConf, Seconds(2))
// Create direct kafka stream with brokers and topics
val topicsSet = topics.split(",").toSet
val kafkaParams = Map[String, Object](
ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG -> brokers,
ConsumerConfig.GROUP_ID_CONFIG -> groupId,
ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG -> classOf[StringDeserializer],
ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG -> classOf[StringDeserializer])
val messages = KafkaUtils.createDirectStream[String, String](
ssc,
LocationStrategies.PreferConsistent,
ConsumerStrategies.Subscribe[String, String]
(topicsSet, kafkaParams))
// Get the lines, split them into words, count the words and print
val lines = messages.map(_.value)
val words = lines.flatMap(_.split(" "))
val wordCounts = words.map(x => (x, 1L)).reduceByKey(_ + _)
wordCounts.print()
The Starting Point
val sparkConf = new SparkConf().setAppName("DirectKafkaWordCount")
val ssc = new StreamingContext(sparkConf, Seconds(2))
// Create direct kafka stream with brokers and topics
val topicsSet = topics.split(",").toSet
val kafkaParams = Map[String, Object](
ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG -> brokers,
ConsumerConfig.GROUP_ID_CONFIG -> groupId,
ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG -> classOf[StringDeserializer],
ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG -> classOf[StringDeserializer])
val messages = KafkaUtils.createDirectStream[String, String](
ssc,
LocationStrategies.PreferConsistent,
ConsumerStrategies.Subscribe[String, String]
(topicsSet, kafkaParams))
// Get the lines, split them into words, count the words and print
val lines = messages.map(_.value)
val words = lines.flatMap(_.split(" "))
val wordCounts = words.map(x => (x, 1L)).reduceByKey(_ + _)
wordCounts.print()
// Start the computation
ssc.start()
The Starting Point
val sparkConf = new SparkConf().setAppName("DirectKafkaWordCount")
val ssc = new StreamingContext(sparkConf, Seconds(2))
// Create direct kafka stream with brokers and topics
val topicsSet = topics.split(",").toSet
val kafkaParams = Map[String, Object](
ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG -> brokers,
ConsumerConfig.GROUP_ID_CONFIG -> groupId,
ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG -> classOf[StringDeserializer],
ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG -> classOf[StringDeserializer])
val messages = KafkaUtils.createDirectStream[String, String](
ssc,
LocationStrategies.PreferConsistent,
ConsumerStrategies.Subscribe[String, String]
(topicsSet, kafkaParams))
// Get the lines, split them into words, count the words and print
val lines = messages.map(_.value)
val words = lines.flatMap(_.split(" "))
val wordCounts = words.map(x => (x, 1L)).reduceByKey(_ + _)
wordCounts.print()
// Start the computation
ssc.start()
ssc.awaitTermination()
The Starting Point
1. Build your JAR (or Python File)
2. Execute the spark-submit command:
$ spark-submit \
    --class "org.apache.spark.testSimpleApp" \
    --master local[4] \
    /path/to/jar/simple-project_2.11-1.0.jar
Starting the Spark Streaming Application
1. Build your JAR (or Python File)
2. Execute the spark-submit command:
$ spark-submit \
    --class "org.apache.spark.testSimpleApp" \
    --master local[4] \
    /path/to/jar/simple-project_2.11-1.0.jar
Starting the Spark Streaming Application
What’s that?
● spark-defaults.conf
○ spark.yarn.maxAppAttempts=2
○ spark.yarn.am.attemptFailuresValidityInterval=1h
With these settings, YARN restarts the application on failure, and gives up only if it fails 2 times within any 1-hour window (failures older than the validity interval are no longer counted).
YARN Cluster Mode Configurations
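These settings can also be passed per-job on the command line. A spark-submit invocation for YARN cluster mode might look like this (the class name and JAR path are placeholders, not from the deck):

```
spark-submit \
    --master yarn \
    --deploy-mode cluster \
    --conf spark.yarn.maxAppAttempts=2 \
    --conf spark.yarn.am.attemptFailuresValidityInterval=1h \
    --class com.example.streaming.MyStreamingApp \
    /path/to/my-streaming-app.jar
```

In cluster mode the driver runs inside the YARN Application Master, which is what allows YARN to restart it automatically on failure.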
1. On Spark Streaming Startup
a. Create a touch file in HDFS
2. Within the Spark Code
a. Periodically check if the touch file still exists
b. If the touch file doesn’t exist, start the Graceful Shutdown process
3. To Stop
a. Delete the touch file and wait for the Graceful Shutdown process to complete
Tip: Build a shell script to do these start and stop operations
Graceful Shutdown
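The start/stop steps above can be sketched as a small shell helper (the touch-file path, class name, and JAR path are placeholder assumptions):

```
#!/bin/sh
# Hypothetical helper to start/stop a Spark Streaming app via an HDFS touch file.
TOUCH_FILE=/user/spark/streaming-app.running   # placeholder path

case "$1" in
  start)
    # 1. Create the marker file, then submit the application
    hdfs dfs -touchz "$TOUCH_FILE"
    spark-submit --master yarn --deploy-mode cluster \
        --class com.example.streaming.MyStreamingApp \
        /path/to/my-streaming-app.jar
    ;;
  stop)
    # 2. Delete the marker; the app notices it is gone and shuts down gracefully
    hdfs dfs -rm -skipTrash "$TOUCH_FILE"
    ;;
esac
```

The application-side half of this handshake (the periodic check for the file) is shown in the next slides.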
val sparkConf = new SparkConf().setAppName("DirectKafkaWordCount")
val ssc = new StreamingContext(sparkConf, Seconds(2))
// Create direct kafka stream with brokers and topics
val topicsSet = topics.split(",").toSet
val kafkaParams = Map[String, Object](
ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG -> brokers,
ConsumerConfig.GROUP_ID_CONFIG -> groupId,
ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG -> classOf[StringDeserializer],
ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG -> classOf[StringDeserializer])
val messages = KafkaUtils.createDirectStream[String, String](
ssc,
LocationStrategies.PreferConsistent,
ConsumerStrategies.Subscribe[String, String]
(topicsSet, kafkaParams))
// Get the lines, split them into words, count the words and print
val lines = messages.map(_.value)
val words = lines.flatMap(_.split(" "))
val wordCounts = words.map(x => (x, 1L)).reduceByKey(_ + _)
wordCounts.print()
// Start the computation
ssc.start()
ssc.awaitTermination()
The Starting Point - Step one to Graceful Shutdown
Replace ssc.awaitTermination() with a shutdown-aware loop
var TRIGGER_STOP = false
var ssc: StreamingContext = …
// Define Stream Creation, Transformations and Actions.
ssc.start()
var isStopped = false
while (!isStopped) {
  isStopped = ssc.awaitTerminationOrTimeout(SPARK_SHUTDOWN_CHECK_MILLIS)
  if (isStopped)
    LOGGER.info("The Spark Streaming context is stopped. Exiting application...")
  else
    LOGGER.info("Streaming App is still running. Timeout...")
  checkShutdownMarker(ssc, SPARK_SHUTDOWN_RUNNING_MARKER_TOUCH_FILE_LOCATION)
  if (!isStopped && TRIGGER_STOP) {
    LOGGER.info("Stopping the ssc Spark Streaming Context...")
    ssc.stop(stopSparkContext = true, stopGracefully = true)
    LOGGER.info("Spark Streaming Context is Stopped!")
  }
}
Graceful Shutdown
def checkShutdownMarker(ssc: StreamingContext, runningMarkerTouchFileLocation: String): Unit = {
  LOGGER.info("Checking if the running flag (" + runningMarkerTouchFileLocation + ") still exists...")
  if (!TRIGGER_STOP) {
    val fs = FileSystem.get(ssc.sparkContext.hadoopConfiguration)
    val fileExists = fs.exists(new Path(runningMarkerTouchFileLocation))
    LOGGER.info("Running File Exists: " + fileExists)
    TRIGGER_STOP = !fileExists
    if (TRIGGER_STOP)
      LOGGER.info("Running File does not exist. Triggering Stop...")
    else
      LOGGER.info("Running File exists. NOT triggering shutdown.")
  } else {
    LOGGER.info("Skipping as the Stop Trigger has already been set")
  }
}
Graceful Shutdown (cont.)
● Metadata checkpointing
○ Configuration
○ DStream Operations
○ Incomplete Batches
● Data checkpointing
○ Saves the RDDs in each microbatch to a reliable storage
Preventing Data Loss - Checkpointing
● Required if using stateful transformations (updateStateByKey or reduceByKeyAndWindow)
● Used to recover from Driver failures
Preventing Data Loss - Checkpointing
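As a sketch of what such a stateful transformation looks like, building on the word-count example from the earlier slides (the checkpoint path is a placeholder):

```
// Running word counts across batches; updateStateByKey requires a checkpoint directory.
ssc.checkpoint("hdfs:///user/spark/checkpoints/wordcount")  // placeholder path

// Merge each batch's counts into the running count kept in state
val updateFunc = (newValues: Seq[Long], runningCount: Option[Long]) =>
  Some(newValues.sum + runningCount.getOrElse(0L))

val runningCounts = wordCounts.updateStateByKey[Long](updateFunc)
runningCounts.print()
```

Without checkpointing, the RDD lineage of such state grows without bound across batches, which is exactly what data checkpointing truncates.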
val checkpointDirectory = "hdfs://..." // define checkpoint directory
// Function to create and setup a new StreamingContext
def functionToCreateContext(): StreamingContext = {
  val ssc = new StreamingContext(...) // new context
  val lines = ssc.socketTextStream(...) // create DStreams
  ...
  ssc.checkpoint(checkpointDirectory) // set checkpoint directory
  ssc // Return the StreamingContext
}
// Get StreamingContext from checkpoint data or create a new one
val ssc = StreamingContext.getOrCreate(checkpointDirectory, functionToCreateContext)
ssc.start() // Start the context
Checkpointing
● Can’t survive Spark Version Upgrades
● Clear checkpoint between Code Upgrades
Checkpointing Problems
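Clearing the checkpoint between code upgrades is typically a one-liner against HDFS (the directory is a placeholder; clearing it discards the saved progress, so save your offsets elsewhere first):

```
# Remove the old checkpoint directory before deploying the new code
hdfs dfs -rm -r -skipTrash /user/spark/checkpoints/my-streaming-app
```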
Receiver Based Streaming
Spark Streaming Receiver Based Streaming - https://databricks.com/wp-content/uploads/2015/03/Screen-Shot-2015-03-29-at-10.11.42-PM.png
● Data in the Receiver is stored within the Executors' memory
● Without a WAL, that data is lost if an Executor fails
● Once the data is written to the WAL, an acknowledgement is passed back to Kafka
Why have a WAL?
● Enable Checkpointing
○ Logs will be written to the Checkpoint Directory
● Enable the WAL in the Spark Configuration
○ spark.streaming.receiver.writeAheadLog.enable=true
● When using the WAL, the data is already persisted to HDFS. Disable in-memory replication.
○ Use StorageLevel.MEMORY_AND_DISK_SER
Recovering data with the WAL
Consuming from a Kafka Topic
Kafka Reads - https://fizalihsan.github.io/technology/kafka-partition-consumer.png
When setting up your Kafka Topics, set up multiple Partitions
● You need to track your own Kafka Offsets
○ Use ZooKeeper, HDFS, HBase, Kudu, a database, etc.
● Checkpoints are not recoverable across code or cluster upgrades
● For exactly-once delivery semantics:
○ store offsets after an idempotent output, OR
○ store offsets in an atomic transaction alongside the output
Direct Stream Gotchas
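The offset-tracking pattern above is commonly implemented with the offset ranges exposed by the direct stream. A sketch based on the spark-streaming-kafka-0-10 API, continuing from the `messages` stream created earlier (the output step is a placeholder):

```
import org.apache.spark.streaming.kafka010.{CanCommitOffsets, HasOffsetRanges}

messages.foreachRDD { rdd =>
  // Offset ranges are attached to the RDDs produced by the direct stream
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges

  // 1. Produce your output idempotently (placeholder for real output logic)
  // rdd.foreachPartition { partition => ... write to an external store ... }

  // 2. Only after the output succeeds, commit the offsets back to Kafka
  messages.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
}
```

Committing only after a successful idempotent write is what upgrades the default at-least-once behavior to effectively exactly-once.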
● Optimize reads, transformations and writes
● Caching
● Increase Parallelism
○ More partitions in Kafka
○ More Executors and Cores
● Repartition the data after receiving it
○ dstream.repartition(100)
● Increase Batch Duration
Improving Stability
● Use YARN Cluster Mode
● Gracefully Shutdown your application
● Monitor your job
● Use Checkpointing (but be careful)
● Setup Multiple Partitions in your Kafka Topics
● Use Direct Streams
● Save your Offsets
● Stabilize your Streaming Application
Summary
Metadata checkpointing - Saving of the information defining the streaming computation to fault-tolerant storage like HDFS. This is used to recover from failure of the node running the driver of the streaming application (discussed in detail later). Metadata includes:
Configuration - The configuration that was used to create the streaming application.
DStream operations - The set of DStream operations that define the streaming application.
Incomplete batches - Batches whose jobs are queued but have not completed yet.
Data checkpointing - Saving of the generated RDDs to reliable storage. This is necessary in some stateful transformations that combine data across multiple batches. In such transformations, the generated RDDs depend on RDDs of previous batches, which causes the length of the dependency chain to keep increasing with time. To avoid such unbounded increases in recovery time (proportional to dependency chain), intermediate RDDs of stateful transformations are periodically checkpointed to reliable storage (e.g. HDFS) to cut off the dependency chains.
Checkpointing must be enabled for applications with any of the following requirements:
Usage of stateful transformations - If either updateStateByKey or reduceByKeyAndWindow (with inverse function) is used in the application, then the checkpoint directory must be provided to allow for periodic RDD checkpointing.
Recovering from failures of the driver running the application - Metadata checkpoints are used to recover with progress information.
Note that simple streaming applications without the aforementioned stateful transformations can be run without enabling checkpointing. The recovery from driver failures will also be partial in that case (some received but unprocessed data may be lost). This is often acceptable and many run Spark Streaming applications in this way. Support for non-Hadoop environments is expected to improve in the future.
Since Kafka already stores the data, we won't need a WAL when using the Direct Stream approach.