www.clairvoyantsoft.com
Productionalizing Spark Streaming
Applications
By: Robert Sanders
Quick Poll
| 2
| 3
Robert Sanders
Big Data Manager and Engineer
Robert Sanders is an Engineering Manager at
Clairvoyant. In his day job, Robert wears multiple hats
and goes back and forth between Architecting and
Engineering large-scale Data platforms. Robert has a deep
background in enterprise systems, initially working on
full-stack implementations and then focusing on building
Data Management Platforms.
| 4
About
Background Awards & Recognition
Boutique consulting firm centered on building data solutions and
products
All things Web and Data Engineering, Analytics, ML and User
Experience to bring it all together
Support the core Hadoop platform and data engineering pipelines, and provide
administrative and DevOps expertise focused on Hadoop
| 5
● What is Spark Streaming and Kafka?
● Steps to Production
○ Managing the Streaming Application (Starting and Stopping)
○ Monitoring
○ Prevent Data Loss
■ Checkpointing
■ Implementing Kafka Delivery Semantics
● Stability
● Summary
Agenda
| 6
● Spark Streaming is an extension of the core Spark API that enables scalable, high-throughput, fault-tolerant
stream processing of live data streams.
What is Spark Streaming?
Spark Streaming - https://spark.apache.org/docs/latest/img/streaming-arch.png
| 7
● Spark Streaming divides the input data stream into Micro Batches that are processed by the Spark Engine
Processing in Spark Streaming
Spark Streaming Processing - https://spark.apache.org/docs/latest/img/streaming-flow.png
| 8
● Apache Kafka® is a Distributed Streaming Platform
● Kafka is a Circular Buffer
○ Data gets written to disk
○ As the retention limit is reached, the oldest log segments are removed
What is Kafka?
Kafka - https://kafka.apache.org/images/kafka_diagram.png
| 9
val sparkConf = new SparkConf().setAppName("DirectKafkaWordCount")
val ssc = new StreamingContext(sparkConf, Seconds(2))
The Starting Point
| 10
val sparkConf = new SparkConf().setAppName("DirectKafkaWordCount")
val ssc = new StreamingContext(sparkConf, Seconds(2))
// Create direct kafka stream with brokers and topics
val topicsSet = topics.split(",").toSet
val kafkaParams = Map[String, Object](
ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG -> brokers,
ConsumerConfig.GROUP_ID_CONFIG -> groupId,
ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG -> classOf[StringDeserializer],
ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG -> classOf[StringDeserializer])
val messages = KafkaUtils.createDirectStream[String, String](
ssc,
LocationStrategies.PreferConsistent,
ConsumerStrategies.Subscribe[String, String]
(topicsSet, kafkaParams))
The Starting Point
| 11
val sparkConf = new SparkConf().setAppName("DirectKafkaWordCount")
val ssc = new StreamingContext(sparkConf, Seconds(2))
// Create direct kafka stream with brokers and topics
val topicsSet = topics.split(",").toSet
val kafkaParams = Map[String, Object](
ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG -> brokers,
ConsumerConfig.GROUP_ID_CONFIG -> groupId,
ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG -> classOf[StringDeserializer],
ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG -> classOf[StringDeserializer])
val messages = KafkaUtils.createDirectStream[String, String](
ssc,
LocationStrategies.PreferConsistent,
ConsumerStrategies.Subscribe[String, String]
(topicsSet, kafkaParams))
// Get the lines, split them into words, count the words and print
val lines = messages.map(_.value)
val words = lines.flatMap(_.split(" "))
val wordCounts = words.map(x => (x, 1L)).reduceByKey(_ + _)
wordCounts.print()
The Starting Point
| 12
val sparkConf = new SparkConf().setAppName("DirectKafkaWordCount")
val ssc = new StreamingContext(sparkConf, Seconds(2))
// Create direct kafka stream with brokers and topics
val topicsSet = topics.split(",").toSet
val kafkaParams = Map[String, Object](
ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG -> brokers,
ConsumerConfig.GROUP_ID_CONFIG -> groupId,
ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG -> classOf[StringDeserializer],
ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG -> classOf[StringDeserializer])
val messages = KafkaUtils.createDirectStream[String, String](
ssc,
LocationStrategies.PreferConsistent,
ConsumerStrategies.Subscribe[String, String]
(topicsSet, kafkaParams))
// Get the lines, split them into words, count the words and print
val lines = messages.map(_.value)
val words = lines.flatMap(_.split(" "))
val wordCounts = words.map(x => (x, 1L)).reduceByKey(_ + _)
wordCounts.print()
// Start the computation
ssc.start()
The Starting Point
| 13
val sparkConf = new SparkConf().setAppName("DirectKafkaWordCount")
val ssc = new StreamingContext(sparkConf, Seconds(2))
// Create direct kafka stream with brokers and topics
val topicsSet = topics.split(",").toSet
val kafkaParams = Map[String, Object](
ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG -> brokers,
ConsumerConfig.GROUP_ID_CONFIG -> groupId,
ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG -> classOf[StringDeserializer],
ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG -> classOf[StringDeserializer])
val messages = KafkaUtils.createDirectStream[String, String](
ssc,
LocationStrategies.PreferConsistent,
ConsumerStrategies.Subscribe[String, String]
(topicsSet, kafkaParams))
// Get the lines, split them into words, count the words and print
val lines = messages.map(_.value)
val words = lines.flatMap(_.split(" "))
val wordCounts = words.map(x => (x, 1L)).reduceByKey(_ + _)
wordCounts.print()
// Start the computation
ssc.start()
ssc.awaitTermination()
The Starting Point
| 14
1. Build your JAR (or Python File)
2. Execute the spark-submit command:
$ spark-submit \
--class "org.apache.spark.testSimpleApp" \
--master local[4] \
/path/to/jar/simple-project_2.11-1.0.jar
Starting the Spark Streaming Application
| 15
1. Build your JAR (or Python File)
2. Execute the spark-submit command:
$ spark-submit \
--class "org.apache.spark.testSimpleApp" \
--master local[4] \
/path/to/jar/simple-project_2.11-1.0.jar
Starting the Spark Streaming Application
What’s that?
| 16
● Local
○ --master local
○ --master local[2]
○ --master local[*]
● Spark Standalone
○ --master spark://{HOST}:{PORT}/
● Yarn
○ --master yarn
● Mesos
○ --master mesos://{HOST}:{PORT}
● Kubernetes
○ --master k8s://{HOST}:{PORT}
Spark Masters
| 17
● Local
○ --master local
○ --master local[2]
○ --master local[*]
● Spark Standalone
○ --master spark://{HOST}:{PORT}/
● Yarn
○ --master yarn
● Mesos
○ --master mesos://{HOST}:{PORT}
● Kubernetes
○ --master k8s://{HOST}:{PORT}
Spark Masters
| 18
● Spark Version <= 1.6.3
○ YARN Client Mode:
■ --master yarn-client
○ YARN Cluster Mode:
■ --master yarn-cluster
● Spark Version >= 2.0
○ YARN Client Mode:
■ --master yarn --deploy-mode client
○ YARN Cluster Mode:
■ --master yarn --deploy-mode cluster
Spark-YARN Integration
| 19
Spark Architecture
Spark Architecture - http://blog.cloudera.com/wp-content/uploads/2014/05/spark-yarn-f1.png
| 20
YARN Client Mode
YARN Client Mode - https://4.bp.blogspot.com/-lFcEx4GDEg4/WMgZQRjDRrI/AAAAAAAADt0/SA1v6gtRGGknkmTINUWbCg5ufEM7rVb9gCLcB/s1600/SparkYanrClusterMode.jpg
| 21
YARN Cluster Mode
YARN Cluster Mode - https://4.bp.blogspot.com/-lFcEx4GDEg4/WMgZQRjDRrI/AAAAAAAADt0/SA1v6gtRGGknkmTINUWbCg5ufEM7rVb9gCLcB/s1600/SparkYanrClusterMode.jpg
Use YARN Cluster Mode
| 22
| 23
● spark-defaults.conf
○ spark.yarn.maxAppAttempts=2
○ spark.yarn.am.attemptFailuresValidityInterval=1h
YARN Cluster Mode Configurations
| 24
● spark-defaults.conf
○ spark.yarn.maxAppAttempts=2
○ spark.yarn.am.attemptFailuresValidityInterval=1h
YARN will restart the application on failure, giving up only if it fails 2 times within the same 1-hour window (attempt failures older than 1 hour no longer count).
YARN Cluster Mode Configurations
| 25
$ spark-submit \
--class "org.apache.testSimpleApp" \
--master yarn \
--deploy-mode cluster \
--conf spark.yarn.maxAppAttempts=2 \
--conf spark.yarn.am.attemptFailuresValidityInterval=1h \
/path/to/jar/simple-project_2.11-1.0.jar
YARN Cluster Mode Configurations
| 26
val sparkConf = new SparkConf()
.setAppName("App")
.set("spark.yarn.maxAppAttempts", "1")
.set("spark.yarn.am.attemptFailuresValidityInterval", "2h")
val ssc = new StreamingContext(sparkConf, Seconds(2))
YARN Cluster Mode Configurations
| 27
● yarn application -kill {ApplicationID}
Shutting Down the Spark Streaming Application
What if a Micro Batch is
processing when we kill the
application?!
| 28
Shut the Streaming
Application down Gracefully
| 29
| 30
1. On Spark Streaming Startup
a. Create a touch file in HDFS
2. Within the Spark Code
a. Periodically check if the touch file still exists
b. If the touch file doesn’t exist, start the Graceful Shutdown process
3. To Stop
a. Delete the touch file and wait for the Graceful Shutdown process to complete
Tip: Build a shell script to do these start and stop operations (a sketch of the touch-file creation step follows below)
Graceful Shutdown
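A minimal sketch of step 1a (creating the HDFS touch file when the application starts), assuming the same marker-file location used by the shutdown check on the next slides; createTouchFile is a hypothetical helper, not from the original deck:

import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.streaming.StreamingContext

// Hypothetical helper: create the HDFS marker (touch) file on startup.
// Deleting this file later signals the job to begin a graceful shutdown.
def createTouchFile(ssc: StreamingContext, runningMarkerTouchFileLocation: String): Unit = {
  val fs = FileSystem.get(ssc.sparkContext.hadoopConfiguration)
  val path = new Path(runningMarkerTouchFileLocation)
  if (!fs.exists(path)) {
    fs.create(path).close() // zero-byte file, equivalent to `hdfs dfs -touchz`
  }
}

Call it just before ssc.start(); the start/stop shell script from the tip can then simply touch or remove the same path.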
| 31
val sparkConf = new SparkConf().setAppName("DirectKafkaWordCount")
val ssc = new StreamingContext(sparkConf, Seconds(2))
// Create direct kafka stream with brokers and topics
val topicsSet = topics.split(",").toSet
val kafkaParams = Map[String, Object](
ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG -> brokers,
ConsumerConfig.GROUP_ID_CONFIG -> groupId,
ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG -> classOf[StringDeserializer],
ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG -> classOf[StringDeserializer])
val messages = KafkaUtils.createDirectStream[String, String](
ssc,
LocationStrategies.PreferConsistent,
ConsumerStrategies.Subscribe[String, String]
(topicsSet, kafkaParams))
// Get the lines, split them into words, count the words and print
val lines = messages.map(_.value)
val words = lines.flatMap(_.split(" "))
val wordCounts = words.map(x => (x, 1L)).reduceByKey(_ + _)
wordCounts.print()
// Start the computation
ssc.start()
ssc.awaitTermination()
The Starting Point
| 32
val sparkConf = new SparkConf().setAppName("DirectKafkaWordCount")
val ssc = new StreamingContext(sparkConf, Seconds(2))
// Create direct kafka stream with brokers and topics
val topicsSet = topics.split(",").toSet
val kafkaParams = Map[String, Object](
ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG -> brokers,
ConsumerConfig.GROUP_ID_CONFIG -> groupId,
ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG -> classOf[StringDeserializer],
ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG -> classOf[StringDeserializer])
val messages = KafkaUtils.createDirectStream[String, String](
ssc,
LocationStrategies.PreferConsistent,
ConsumerStrategies.Subscribe[String, String]
(topicsSet, kafkaParams))
// Get the lines, split them into words, count the words and print
val lines = messages.map(_.value)
val words = lines.flatMap(_.split(" "))
val wordCounts = words.map(x => (x, 1L)).reduceByKey(_ + _)
wordCounts.print()
// Start the computation
ssc.start()
ssc.awaitTermination()
The Starting Point - Step one to Graceful Shutdown
Replace
| 33
var TRIGGER_STOP = false
var ssc: StreamingContext = …
// Define Stream Creation, Transformations and Actions.
ssc.start()
var isStopped = false
while (!isStopped) {
isStopped = ssc.awaitTerminationOrTimeout(SPARK_SHUTDOWN_CHECK_MILLIS)
if (isStopped)
LOGGER.info("The Spark Streaming context is stopped. Exiting application...")
else
LOGGER.info("Streaming App is still running. Timeout...")
checkShutdownMarker(ssc, SPARK_SHUTDOWN_RUNNING_MARKER_TOUCH_FILE_LOCATION)
if (!isStopped && TRIGGER_STOP) {
LOGGER.info("Stopping the ssc Spark Streaming Context...")
ssc.stop(stopSparkContext = true, stopGracefully = true)
LOGGER.info("Spark Streaming Context is Stopped!")
}
}
Graceful Shutdown
| 34
def checkShutdownMarker(ssc: StreamingContext, runningMarkerTouchFileLocation: String): Unit = {
LOGGER.info("Checking if the running flag (" + runningMarkerTouchFileLocation + ") still exists...")
if (!TRIGGER_STOP) {
val fs = FileSystem.get(ssc.sparkContext.hadoopConfiguration)
val fileExists = fs.exists(new Path(runningMarkerTouchFileLocation))
LOGGER.info("Running File Exists: " + fileExists)
TRIGGER_STOP = !fileExists
if (TRIGGER_STOP)
LOGGER.info("Running File does not exist. Triggering Stop...")
else
LOGGER.info("Running File exists. NOT triggering shutdown.")
} else {
LOGGER.info("Skipping as the Stop Trigger has already been set")
}
}
Graceful Shutdown (cont.)
| 35
● Operational monitoring - Ganglia, Graphite
○ http://spark.apache.org/docs/latest/monitoring#metrics
● StreamingListener (Spark >=2.1)
○ onBatchSubmitted
○ onBatchStarted
○ onBatchCompleted
○ onReceiverStarted
○ onReceiverStopped
○ onReceiverError
Monitoring
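A minimal sketch of a custom StreamingListener that logs batch metrics, assuming Spark >= 2.1 and the LOGGER used elsewhere in this deck; in practice these values could be pushed to Graphite or Ganglia instead of the log:

import org.apache.spark.streaming.scheduler.{StreamingListener, StreamingListenerBatchCompleted}

// Logs record counts and delays for every completed micro batch
class BatchMetricsListener extends StreamingListener {
  override def onBatchCompleted(batchCompleted: StreamingListenerBatchCompleted): Unit = {
    val info = batchCompleted.batchInfo
    LOGGER.info("Batch completed: records=" + info.numRecords +
      ", schedulingDelayMs=" + info.schedulingDelay.getOrElse(-1L) +
      ", processingDelayMs=" + info.processingDelay.getOrElse(-1L))
  }
}

// Register the listener before starting the StreamingContext
ssc.addStreamingListener(new BatchMetricsListener())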
| 36
Monitoring - Spark UI - Streaming Tab
| 37
● Metadata checkpointing
○ Configuration
○ DStream Operations
○ Incomplete Batches
● Data checkpointing
○ Saves the RDDs in each microbatch to reliable storage
Preventing Data Loss - Checkpointing
| 38
● Required if using stateful transformations (updateStateByKey or
reduceByKeyAndWindow) - see the sketch below
● Used to recover from Driver failures
Preventing Data Loss - Checkpointing
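For example, a running word count with updateStateByKey will not even start without a checkpoint directory; this is a minimal sketch that reuses the words DStream from the earlier word-count code, with an assumed HDFS checkpoint path:

// Checkpointing is mandatory for stateful transformations
ssc.checkpoint("hdfs:///user/spark/checkpoints/wordcount") // assumed path

// Keep a running total per word across micro batches
def updateTotal(newValues: Seq[Long], runningTotal: Option[Long]): Option[Long] =
  Some(newValues.sum + runningTotal.getOrElse(0L))

val runningCounts = words.map(x => (x, 1L)).updateStateByKey(updateTotal)
runningCounts.print()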
| 39
val checkpointDirectory = "hdfs://..." // define checkpoint directory
// Function to create and setup a new StreamingContext
def functionToCreateContext(): StreamingContext = {
val ssc = new StreamingContext(...) // new context
val lines = ssc.socketTextStream(...) // create DStreams
...
ssc.checkpoint(checkpointDirectory) // set checkpoint directory
ssc // Return the StreamingContext
}
// Get StreamingContext from checkpoint data or create a new one
val ssc = StreamingContext.getOrCreate(checkpointDirectory, functionToCreateContext)
ssc.start() // Start the context
Checkpointing
| 40
● Can’t survive Spark Version Upgrades
● Clear checkpoint between Code Upgrades
Checkpointing Problems
| 41
Receiver Based Streaming
Spark Streaming Receiver Based Streaming - https://databricks.com/wp-content/uploads/2015/03/Screen-Shot-2015-03-29-at-10.11.42-PM.png
| 42
● Data in the Receiver is stored within the Executor's memory
● If we don’t have a WAL, on Executor failure, the data will be lost
● Once the data is written to the WAL, acknowledgement is passed to Kafka
Why have a WAL?
| 43
● Enable Checkpointing
○ Logs will be written to the Checkpoint Directory
● Enable WAL in Spark Configuration
○ spark.streaming.receiver.writeAheadLog.enable=true
● When using the WAL, the data is already persisted to HDFS. Disable in-memory replication.
○ Use StorageLevel.MEMORY_AND_DISK_SER (see the sketch below)
Recovering data with the WAL
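A minimal sketch of a receiver-based Kafka stream with the WAL enabled, assuming the older spark-streaming-kafka-0-8 receiver API; the ZooKeeper quorum, consumer group, topic name and checkpoint path are placeholder values:

import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

val sparkConf = new SparkConf()
  .setAppName("ReceiverWithWAL")
  .set("spark.streaming.receiver.writeAheadLog.enable", "true")
val ssc = new StreamingContext(sparkConf, Seconds(2))
ssc.checkpoint("hdfs:///user/spark/checkpoints/receiver-app") // WAL segments are stored under the checkpoint directory

// Receiver-based stream; in-memory replication is skipped because the WAL already persists the data
val kafkaStream = KafkaUtils.createStream(
  ssc,
  "zk-host:2181",        // placeholder ZooKeeper quorum
  "wal-example-group",   // placeholder consumer group
  Map("my-topic" -> 1),  // topic -> number of receiver threads
  StorageLevel.MEMORY_AND_DISK_SER)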
Thought: Kafka already
stores replicated copies of
the data in a circular buffer.
Why do I need a WAL?
| 44
| 45
Direct Stream
Spark Streaming Direct Stream - https://databricks.com/wp-content/uploads/2015/03/Screen-Shot-2015-03-29-at-10.14.11-PM.png
Use the Direct Stream
| 46
| 47
● Creating a Topic:
kafka-topics --zookeeper <host>:2181 --create --topic <topic-name> --partitions <number-of-partitions> --replication-factor <number-of-replicas>
Kafka Topics
Kafka Writes - https://www.analyticshut.com/wp-content/uploads/2018/04/topic.png
| 48
Consuming from a Kafka Topic
Kafka Reads - https://fizalihsan.github.io/technology/kafka-partition-consumer.png
When setting up your Kafka
Topics, set up multiple
Partitions
| 49
| 50
● You need to track your own Kafka Offsets
○ Use ZooKeeper, HDFS, HBase, Kudu, DB, etc
● Checkpoints are not recoverable across code or cluster upgrades
● For Exactly-Once Delivery Semantics
○ store offsets after an idempotent output OR
○ store offsets in an atomic transaction alongside output
Direct Stream Gotchas
| 51
Managing Kafka Offsets
Managing Kafka Offsets - http://blog.cloudera.com/wp-content/uploads/2017/06/Spark-Streaming-flow-for-offsets.png
| 52
val storedOffsets: Option[mutable.Map[TopicPartition, Long]] = loadOffsets(spark, kuduContext)
val kafkaDStream = storedOffsets match {
case None =>
LOGGER.info("storedOffsets was None")
kafkaParams += ("auto.offset.reset" -> "latest")
KafkaUtils.createDirectStream[String, Array[Byte]]
(ssc, PreferConsistent, ConsumerStrategies.Subscribe[String, Array[Byte]]
(topicsSet, kafkaParams)
)
case Some(fromOffsets) =>
LOGGER.info("storedOffsets was Some(" + fromOffsets + ")")
kafkaParams += ("auto.offset.reset" -> "none")
KafkaUtils.createDirectStream[String, Array[Byte]]
(ssc, PreferConsistent, ConsumerStrategies.Assign[String, Array[Byte]]
(fromOffsets.keys.toList, kafkaParams, fromOffsets)
)
}
Managing Offsets
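The loading logic above has a natural counterpart: after each micro batch is written out, persist the offsets it covered. A minimal sketch using HasOffsetRanges from the Kafka 0.10 direct stream; processAndWrite and saveOffsets are hypothetical helpers (the latter being the mirror of the loadOffsets call above):

import org.apache.spark.streaming.kafka010.{HasOffsetRanges, OffsetRange}

kafkaDStream.foreachRDD { rdd =>
  // Grab the offset ranges before any transformation loses the underlying KafkaRDD
  val offsetRanges: Array[OffsetRange] = rdd.asInstanceOf[HasOffsetRanges].offsetRanges

  // 1. Write the batch idempotently (e.g. upsert into Kudu) -- hypothetical output step
  processAndWrite(rdd)

  // 2. Only then persist the offsets, so a crash in between replays the batch instead of skipping it
  saveOffsets(offsetRanges, kuduContext) // hypothetical helper
}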
| 53
Average Batch Processing Time < Batch Interval
Stability
| 54
Stability - Spark UI - Streaming Tab
| 55
● Optimize reads, transformations and writes
● Caching
● Increase Parallelism
○ More partitions in Kafka
○ More Executors and Cores
● Repartition the data after receiving it
○ dstream.repartition(100)
● Increase Batch Duration
Improving Stability
| 56
● Use YARN Cluster Mode
● Gracefully shut down your application
● Monitor your job
● Use Checkpointing (but be careful)
● Set up Multiple Partitions in your Kafka Topics
● Use Direct Streams
● Save your Offsets
● Stabilize your Streaming Application
Summary
Thank You!
| 57
Questions?
hello@clairvoyantsoft.com

Editor's Notes

  1. Metadata checkpointing - Saving of the information defining the streaming computation to fault-tolerant storage like HDFS. This is used to recover from failure of the node running the driver of the streaming application. Metadata includes:
     Configuration - The configuration that was used to create the streaming application.
     DStream operations - The set of DStream operations that define the streaming application.
     Incomplete batches - Batches whose jobs are queued but have not completed yet.
     Data checkpointing - Saving of the generated RDDs to reliable storage. This is necessary in some stateful transformations that combine data across multiple batches. In such transformations, the generated RDDs depend on RDDs of previous batches, which causes the length of the dependency chain to keep increasing with time. To avoid such unbounded increases in recovery time (proportional to the dependency chain), intermediate RDDs of stateful transformations are periodically checkpointed to reliable storage (e.g. HDFS) to cut off the dependency chains.
     Checkpointing must be enabled for applications with any of the following requirements:
     Usage of stateful transformations - If either updateStateByKey or reduceByKeyAndWindow (with inverse function) is used in the application, then the checkpoint directory must be provided to allow for periodic RDD checkpointing.
     Recovering from failures of the driver running the application - Metadata checkpoints are used to recover with progress information.
     Note that simple streaming applications without the aforementioned stateful transformations can be run without enabling checkpointing. The recovery from driver failures will also be partial in that case (some received but unprocessed data may be lost). This is often acceptable, and many run Spark Streaming applications in this way. Support for non-Hadoop environments is expected to improve in the future.
  3. Since Kafka already stores the data we won’t need a WAL