Spark Streaming has quickly established itself as one of the most popular streaming engines in the Hadoop ecosystem. Not only does it integrate with many types of message brokers and stream sources, it can also leverage the other major Spark modules, such as Spark SQL and MLlib, in the same application. This allows businesses and developers to make use of data in ways they couldn't hope to in the past.
However, while building a Spark Streaming pipeline, it's not sufficient to only know how to express your business logic. Running these pipelines in production with high uptime and continuous monitoring presents a number of operational challenges. Fortunately, Spark Streaming makes that manageable as well. In this talk, we'll go over the main steps you'll need to take to get your Spark Streaming application ready for production, specifically in conjunction with Kafka. This includes gracefully shutting down your application, performing upgrades, monitoring, various useful Spark configurations, and more.
Robert Sanders
Big Data Manager and Engineer
Robert Sanders is an Engineering Manager at Clairvoyant. In his day job, Robert wears multiple hats, going back and forth between architecting and engineering large-scale data platforms. Robert has a deep background in enterprise systems, initially working on full-stack implementations and then focusing on building Data Management Platforms.
About
Background
● Boutique consulting firm centered on building data solutions and products
● All things Web and Data Engineering, Analytics, ML and User Experience to bring it all together
● Supports the core Hadoop platform and data engineering pipelines, and provides administrative and DevOps expertise focused on Hadoop
● What is Spark Streaming and Kafka?
● Steps to Production
○ Managing the Streaming Application (Starting and Stopping)
○ Monitoring
○ Prevent Data Loss
■ Checkpointing
■ Implementing Kafka Delivery Semantics
● Stability
● Summary
Agenda
● Spark Streaming is an extension of the core Spark API that enables scalable, high-throughput, fault-tolerant
stream processing of live data streams.
What is Spark Streaming?
Spark Streaming - https://spark.apache.org/docs/latest/img/streaming-arch.png
● Spark processes micro-batches of data from the input stream on the Spark Engine
Processing in Spark Streaming
Spark Streaming Processing - https://spark.apache.org/docs/latest/img/streaming-flow.png
● Apache Kafka® is a Distributed Streaming Platform
● Kafka is a Circular Buffer
○ Data gets written to disk
○ As the log fills up, the oldest segment files are removed (per the retention settings)
What is Kafka?
Kafka - https://kafka.apache.org/images/kafka_diagram.png
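The "circular buffer" behavior above is driven by the broker/topic retention settings. As a sketch (property names come from Kafka's standard broker configuration; the values here are illustrative, not recommendations):

```
# server.properties (or per-topic overrides via retention.ms / retention.bytes)
log.retention.hours=168       # delete segments older than 7 days
log.retention.bytes=-1        # optionally cap retained bytes per partition (-1 = unbounded)
log.segment.bytes=1073741824  # roll a new segment file at 1 GB
```

Because old data ages out, a consumer that falls too far behind can permanently miss messages, which is why the data-loss topics later in this talk matter.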
val sparkConf = new SparkConf().setAppName("DirectKafkaWordCount")
val ssc = new StreamingContext(sparkConf, Seconds(2))
The Starting Point
val sparkConf = new SparkConf().setAppName("DirectKafkaWordCount")
val ssc = new StreamingContext(sparkConf, Seconds(2))
// Create direct kafka stream with brokers and topics
val topicsSet = topics.split(",").toSet
val kafkaParams = Map[String, Object](
ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG -> brokers,
ConsumerConfig.GROUP_ID_CONFIG -> groupId,
ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG -> classOf[StringDeserializer],
ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG -> classOf[StringDeserializer])
val messages = KafkaUtils.createDirectStream[String, String](
ssc,
LocationStrategies.PreferConsistent,
ConsumerStrategies.Subscribe[String, String]
(topicsSet, kafkaParams))
The Starting Point
val sparkConf = new SparkConf().setAppName("DirectKafkaWordCount")
val ssc = new StreamingContext(sparkConf, Seconds(2))
// Create direct kafka stream with brokers and topics
val topicsSet = topics.split(",").toSet
val kafkaParams = Map[String, Object](
ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG -> brokers,
ConsumerConfig.GROUP_ID_CONFIG -> groupId,
ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG -> classOf[StringDeserializer],
ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG -> classOf[StringDeserializer])
val messages = KafkaUtils.createDirectStream[String, String](
ssc,
LocationStrategies.PreferConsistent,
ConsumerStrategies.Subscribe[String, String]
(topicsSet, kafkaParams))
// Get the lines, split them into words, count the words and print
val lines = messages.map(_.value)
val words = lines.flatMap(_.split(" "))
val wordCounts = words.map(x => (x, 1L)).reduceByKey(_ + _)
wordCounts.print()
The Starting Point
val sparkConf = new SparkConf().setAppName("DirectKafkaWordCount")
val ssc = new StreamingContext(sparkConf, Seconds(2))
// Create direct kafka stream with brokers and topics
val topicsSet = topics.split(",").toSet
val kafkaParams = Map[String, Object](
ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG -> brokers,
ConsumerConfig.GROUP_ID_CONFIG -> groupId,
ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG -> classOf[StringDeserializer],
ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG -> classOf[StringDeserializer])
val messages = KafkaUtils.createDirectStream[String, String](
ssc,
LocationStrategies.PreferConsistent,
ConsumerStrategies.Subscribe[String, String]
(topicsSet, kafkaParams))
// Get the lines, split them into words, count the words and print
val lines = messages.map(_.value)
val words = lines.flatMap(_.split(" "))
val wordCounts = words.map(x => (x, 1L)).reduceByKey(_ + _)
wordCounts.print()
// Start the computation
ssc.start()
The Starting Point
val sparkConf = new SparkConf().setAppName("DirectKafkaWordCount")
val ssc = new StreamingContext(sparkConf, Seconds(2))
// Create direct kafka stream with brokers and topics
val topicsSet = topics.split(",").toSet
val kafkaParams = Map[String, Object](
ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG -> brokers,
ConsumerConfig.GROUP_ID_CONFIG -> groupId,
ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG -> classOf[StringDeserializer],
ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG -> classOf[StringDeserializer])
val messages = KafkaUtils.createDirectStream[String, String](
ssc,
LocationStrategies.PreferConsistent,
ConsumerStrategies.Subscribe[String, String]
(topicsSet, kafkaParams))
// Get the lines, split them into words, count the words and print
val lines = messages.map(_.value)
val words = lines.flatMap(_.split(" "))
val wordCounts = words.map(x => (x, 1L)).reduceByKey(_ + _)
wordCounts.print()
// Start the computation
ssc.start()
ssc.awaitTermination()
The Starting Point
1. Build your JAR (or Python File)
2. Execute the spark-submit command:
$ spark-submit \
    --class "org.apache.spark.testSimpleApp" \
    --master local[4] \
    /path/to/jar/simple-project_2.11-1.0.jar
Starting the Spark Streaming Application
1. Build your JAR (or Python File)
2. Execute the spark-submit command:
$ spark-submit \
    --class "org.apache.spark.testSimpleApp" \
    --master local[4] \
    /path/to/jar/simple-project_2.11-1.0.jar
Starting the Spark Streaming Application
What’s that?
● spark-defaults.conf
○ spark.yarn.maxAppAttempts=2
○ spark.yarn.am.attemptFailuresValidityInterval=1h
With these settings, YARN restarts the application on failure, and gives up only if it fails 2 times within any 1-hour window (failures older than the validity interval are no longer counted).
YARN Cluster Mode Configurations
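These settings can also be passed per-job on the command line. A spark-submit invocation for YARN cluster mode might look like this (the class name and JAR path are placeholders, not from the deck):

```
spark-submit \
    --master yarn \
    --deploy-mode cluster \
    --conf spark.yarn.maxAppAttempts=2 \
    --conf spark.yarn.am.attemptFailuresValidityInterval=1h \
    --class com.example.streaming.MyStreamingApp \
    /path/to/my-streaming-app.jar
```

In cluster mode the driver runs inside the YARN Application Master, which is what allows YARN to restart it automatically on failure.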
1. On Spark Streaming Startup
a. Create a touch file in HDFS
2. Within the Spark Code
a. Periodically check if the touch file still exists
b. If the touch file doesn’t exist, start the Graceful Shutdown process
3. To Stop
a. Delete the touch file and wait for the Graceful Shutdown process to complete
Tip: Build a shell script to do these start and stop operations
Graceful Shutdown
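The start/stop steps above can be sketched as a small shell helper (the touch-file path, class name, and JAR path are placeholder assumptions):

```
#!/bin/sh
# Hypothetical helper to start/stop a Spark Streaming app via an HDFS touch file.
TOUCH_FILE=/user/spark/streaming-app.running   # placeholder path

case "$1" in
  start)
    # 1. Create the marker file, then submit the application
    hdfs dfs -touchz "$TOUCH_FILE"
    spark-submit --master yarn --deploy-mode cluster \
        --class com.example.streaming.MyStreamingApp \
        /path/to/my-streaming-app.jar
    ;;
  stop)
    # 2. Delete the marker; the app notices it is gone and shuts down gracefully
    hdfs dfs -rm -skipTrash "$TOUCH_FILE"
    ;;
esac
```

The application-side half of this handshake (the periodic check for the file) is shown in the next slides.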
val sparkConf = new SparkConf().setAppName("DirectKafkaWordCount")
val ssc = new StreamingContext(sparkConf, Seconds(2))
// Create direct kafka stream with brokers and topics
val topicsSet = topics.split(",").toSet
val kafkaParams = Map[String, Object](
ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG -> brokers,
ConsumerConfig.GROUP_ID_CONFIG -> groupId,
ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG -> classOf[StringDeserializer],
ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG -> classOf[StringDeserializer])
val messages = KafkaUtils.createDirectStream[String, String](
ssc,
LocationStrategies.PreferConsistent,
ConsumerStrategies.Subscribe[String, String]
(topicsSet, kafkaParams))
// Get the lines, split them into words, count the words and print
val lines = messages.map(_.value)
val words = lines.flatMap(_.split(" "))
val wordCounts = words.map(x => (x, 1L)).reduceByKey(_ + _)
wordCounts.print()
// Start the computation
ssc.start()
ssc.awaitTermination()
The Starting Point - Step one to Graceful Shutdown
Replace ssc.awaitTermination() with a shutdown-aware loop
var TRIGGER_STOP = false
var ssc: StreamingContext = …
// Define Stream Creation, Transformations and Actions.
ssc.start()
var isStopped = false
while (!isStopped) {
  isStopped = ssc.awaitTerminationOrTimeout(SPARK_SHUTDOWN_CHECK_MILLIS)
  if (isStopped)
    LOGGER.info("The Spark Streaming context is stopped. Exiting application...")
  else
    LOGGER.info("Streaming App is still running. Timeout...")
  checkShutdownMarker(ssc, SPARK_SHUTDOWN_RUNNING_MARKER_TOUCH_FILE_LOCATION)
  if (!isStopped && TRIGGER_STOP) {
    LOGGER.info("Stopping the ssc Spark Streaming Context...")
    ssc.stop(stopSparkContext = true, stopGracefully = true)
    LOGGER.info("Spark Streaming Context is Stopped!")
  }
}
Graceful Shutdown
def checkShutdownMarker(ssc: StreamingContext, runningMarkerTouchFileLocation: String): Unit = {
  LOGGER.info("Checking if the running flag (" + runningMarkerTouchFileLocation + ") still exists...")
  if (!TRIGGER_STOP) {
    val fs = FileSystem.get(ssc.sparkContext.hadoopConfiguration)
    val fileExists = fs.exists(new Path(runningMarkerTouchFileLocation))
    LOGGER.info("Running File Exists: " + fileExists)
    TRIGGER_STOP = !fileExists
    if (TRIGGER_STOP)
      LOGGER.info("Running File does not exist. Triggering Stop...")
    else
      LOGGER.info("Running File exists. NOT triggering shutdown.")
  } else {
    LOGGER.info("Skipping as the Stop Trigger has already been set")
  }
}
Graceful Shutdown (cont.)
● Metadata checkpointing
○ Configuration
○ DStream Operations
○ Incomplete Batches
● Data checkpointing
○ Saves the RDDs in each microbatch to a reliable storage
Preventing Data Loss - Checkpointing
● Required if using stateful transformations (updateStateByKey or reduceByKeyAndWindow)
● Used to recover from Driver failures
Preventing Data Loss - Checkpointing
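As a sketch of what such a stateful transformation looks like, building on the word-count example from the earlier slides (the checkpoint path is a placeholder):

```
// Running word counts across batches; updateStateByKey requires a checkpoint directory.
ssc.checkpoint("hdfs:///user/spark/checkpoints/wordcount")  // placeholder path

// Merge each batch's counts into the running count kept in state
val updateFunc = (newValues: Seq[Long], runningCount: Option[Long]) =>
  Some(newValues.sum + runningCount.getOrElse(0L))

val runningCounts = wordCounts.updateStateByKey[Long](updateFunc)
runningCounts.print()
```

Without checkpointing, the RDD lineage of such state grows without bound across batches, which is exactly what data checkpointing truncates.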
val checkpointDirectory = "hdfs://..." // define checkpoint directory
// Function to create and setup a new StreamingContext
def functionToCreateContext(): StreamingContext = {
  val ssc = new StreamingContext(...) // new context
  val lines = ssc.socketTextStream(...) // create DStreams
  ...
  ssc.checkpoint(checkpointDirectory) // set checkpoint directory
  ssc // Return the StreamingContext
}
// Get StreamingContext from checkpoint data or create a new one
val ssc = StreamingContext.getOrCreate(checkpointDirectory, functionToCreateContext)
ssc.start() // Start the context
Checkpointing
● Can’t survive Spark Version Upgrades
● Clear checkpoint between Code Upgrades
Checkpointing Problems
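Clearing the checkpoint between code upgrades is typically a one-liner against HDFS (the directory is a placeholder; clearing it discards the saved progress, so save your offsets elsewhere first):

```
# Remove the old checkpoint directory before deploying the new code
hdfs dfs -rm -r -skipTrash /user/spark/checkpoints/my-streaming-app
```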
Receiver Based Streaming
Spark Streaming Receiver Based Streaming - https://databricks.com/wp-content/uploads/2015/03/Screen-Shot-2015-03-29-at-10.11.42-PM.png
● Data in the Receiver is stored within the Executors' memory
● Without a WAL, that data is lost if an Executor fails
● Once the data is written to the WAL, an acknowledgement is passed back to Kafka
Why have a WAL?
● Enable Checkpointing
○ Logs will be written to the Checkpoint Directory
● Enable the WAL in the Spark Configuration
○ spark.streaming.receiver.writeAheadLog.enable=true
● When using the WAL, the data is already persisted to HDFS. Disable in-memory replication.
○ Use StorageLevel.MEMORY_AND_DISK_SER
Recovering data with the WAL
Consuming from a Kafka Topic
Kafka Reads - https://fizalihsan.github.io/technology/kafka-partition-consumer.png
When setting up your Kafka Topics, set up multiple Partitions
● You need to track your own Kafka Offsets
○ Use ZooKeeper, HDFS, HBase, Kudu, a database, etc.
● Checkpoints are not recoverable across code or cluster upgrades
● For exactly-once delivery semantics:
○ store offsets after an idempotent output, OR
○ store offsets in an atomic transaction alongside the output
Direct Stream Gotchas
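The offset-tracking pattern above is commonly implemented with the offset ranges exposed by the direct stream. A sketch based on the spark-streaming-kafka-0-10 API, continuing from the `messages` stream created earlier (the output step is a placeholder):

```
import org.apache.spark.streaming.kafka010.{CanCommitOffsets, HasOffsetRanges}

messages.foreachRDD { rdd =>
  // Offset ranges are attached to the RDDs produced by the direct stream
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges

  // 1. Produce your output idempotently (placeholder for real output logic)
  // rdd.foreachPartition { partition => ... write to an external store ... }

  // 2. Only after the output succeeds, commit the offsets back to Kafka
  messages.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
}
```

Committing only after a successful idempotent write is what upgrades the default at-least-once behavior to effectively exactly-once.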
● Optimize reads, transformations and writes
● Caching
● Increase Parallelism
○ More partitions in Kafka
○ More Executors and Cores
● Repartition the data after receiving it
○ dstream.repartition(100)
● Increase Batch Duration
Improving Stability
● Use YARN Cluster Mode
● Gracefully Shutdown your application
● Monitor your job
● Use Checkpointing (but be careful)
● Setup Multiple Partitions in your Kafka Topics
● Use Direct Streams
● Save your Offsets
● Stabilize your Streaming Application
Summary
Metadata checkpointing - Saving of the information defining the streaming computation to fault-tolerant storage like HDFS. This is used to recover from failure of the node running the driver of the streaming application (discussed in detail later). Metadata includes:
Configuration - The configuration that was used to create the streaming application.
DStream operations - The set of DStream operations that define the streaming application.
Incomplete batches - Batches whose jobs are queued but have not completed yet.
Data checkpointing - Saving of the generated RDDs to reliable storage. This is necessary in some stateful transformations that combine data across multiple batches. In such transformations, the generated RDDs depend on RDDs of previous batches, which causes the length of the dependency chain to keep increasing with time. To avoid such unbounded increases in recovery time (proportional to dependency chain), intermediate RDDs of stateful transformations are periodically checkpointed to reliable storage (e.g. HDFS) to cut off the dependency chains.
Checkpointing must be enabled for applications with any of the following requirements:
Usage of stateful transformations - If either updateStateByKey or reduceByKeyAndWindow (with inverse function) is used in the application, then the checkpoint directory must be provided to allow for periodic RDD checkpointing.
Recovering from failures of the driver running the application - Metadata checkpoints are used to recover with progress information.
Note that simple streaming applications without the aforementioned stateful transformations can be run without enabling checkpointing. The recovery from driver failures will also be partial in that case (some received but unprocessed data may be lost). This is often acceptable and many run Spark Streaming applications in this way. Support for non-Hadoop environments is expected to improve in the future.
Since Kafka already stores the data, we won't need a WAL when using the Direct Stream approach.