Presented at the Seattle Spark meetup on March 23rd, 2017, hosted at Expedia. (https://www.meetup.com/Seattle-Spark-Meetup/events/230310598/)
This presentation is a case study of taking Spark Streaming to production with Kafka as the data source, and highlights best practices for different concerns of stream processing:
1. Spark Streaming & Standalone Cluster Overview
2. Design Patterns for Performance
3. Guaranteed Message Processing & Direct Kafka Integration
4. Operational Monitoring & Alerting
5. Spark Cluster & App Resilience
8. Spark Streaming & Standalone Cluster Overview
RDD: Partitioned, replicated collection of data objects
Driver: JVM that creates the Spark program and negotiates for resources. Handles scheduling of tasks but does not do the heavy lifting; a potential bottleneck.
Executor: Slave to the driver; executes tasks on RDD partitions. Functions are serialized and shipped to it.
Lazy Execution: Transformations & Actions (see the sketch after this list)
Cluster Types: Standalone, YARN, Mesos
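To ground the lazy-execution point, a minimal Scala sketch (the input path and app name are illustrative): transformations such as filter and map only record lineage, and nothing runs on the executors until an action like count is called.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object LazyExecutionSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("lazy-execution-sketch")
    val sc   = new SparkContext(conf)

    // Transformations: nothing runs yet, Spark only records the lineage.
    val lines   = sc.textFile("hdfs:///tmp/events.log")   // illustrative path
    val errors  = lines.filter(_.contains("ERROR"))
    val lengths = errors.map(_.length)

    // Action: the driver schedules tasks and the executors do the heavy lifting.
    val count = lengths.count()
    println(s"error lines: $count")

    sc.stop()
  }
}
```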
16. Guaranteed Message Processing & Direct Kafka Integration
Guaranteed Message Processing = at-least-once processing + idempotence
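One way to read that equation, as a hedged sketch: with at-least-once delivery the same message may be processed twice, so the write itself must be idempotent. The EventStore trait and its upsert method below are hypothetical stand-ins for whatever sink the app writes to; the point is keying the write on a deterministic ID derived from the Kafka coordinates, so a replay overwrites rather than duplicates.

```scala
// Hypothetical sink: any store whose write is an upsert keyed on a unique ID
// (e.g. a Cassandra primary key or HBase row key) makes replays harmless.
trait EventStore {
  def upsert(id: String, payload: String): Unit
}

object IdempotentWrite {
  // Deterministic ID derived from the Kafka coordinates of the record,
  // so reprocessing the same record produces the same key.
  def recordId(topic: String, partition: Int, offset: Long): String =
    s"$topic-$partition-$offset"

  def process(store: EventStore)(topic: String, partition: Int,
                                 offset: Long, value: String): Unit = {
    // At-least-once: this may run more than once for the same record.
    // Idempotent upsert: a second run simply rewrites the same row.
    store.upsert(recordId(topic, partition, offset), value)
  }
}
```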
Kafka Receiver
Consumes messages faster than Spark can process them
Checkpoints before processing is finished
Inefficient CPU utilization
Direct Kafka Integration (see the sketch after this list)
Control over checkpointing & transactionality
Better distribution of resource consumption
1:1 Kafka topic-partition to Spark RDD-partition
Use Kafka as the WAL
Statelessness, Fail-fast
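A hedged sketch of the direct approach, assuming the spark-streaming-kafka-0-10 integration: each Kafka topic-partition maps 1:1 to an RDD partition, offsets are tracked explicitly, and they are committed only after the batch's output has been written, so Kafka itself serves as the WAL. Broker address, topic name, and group ID are placeholders.

```scala
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010._
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

object DirectKafkaSketch {
  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(new SparkConf().setAppName("direct-kafka"), Seconds(10))

    val kafkaParams = Map[String, Object](
      "bootstrap.servers"  -> "kafka:9092",               // placeholder
      "key.deserializer"   -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id"           -> "spark-streaming-app",      // placeholder
      "auto.offset.reset"  -> "latest",
      "enable.auto.commit" -> (false: java.lang.Boolean)) // commit manually, after output

    val stream = KafkaUtils.createDirectStream[String, String](
      ssc, PreferConsistent, Subscribe[String, String](Seq("events"), kafkaParams))

    stream.foreachRDD { rdd =>
      // 1:1 Kafka topic-partition to Spark RDD-partition; offsets are explicit.
      val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges

      rdd.foreachPartition { records =>
        records.foreach(r => println(s"${r.key} -> ${r.value}")) // idempotent output goes here
      }

      // Commit only after the output succeeded: Kafka itself acts as the WAL.
      stream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
    }

    ssc.start()
    ssc.awaitTermination()
  }
}
```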
Not necessarily the only way to set it up; it saves IP space.
OK, we built the app on the Spark framework for scalability and made it fast.
Pause, check on game player
Spark is hiding the fact that it can’t keep up with the stream. Crash + restart + bad checkpoint = missing messages.
Configuration can ameliorate this, but it is an artifact of the absence of a WAL on HDFS. There are multiple data-loss scenarios.
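For the receiver-based path, these are the configuration knobs that help ameliorate the problem; a sketch only, with illustrative values: enable the receiver write-ahead log, turn on backpressure so ingestion slows to match processing, and cap the receive rate.

```scala
import org.apache.spark.SparkConf

object ReceiverMitigations {
  // Sketch: settings that mitigate receiver-based overrun and data loss.
  val conf: SparkConf = new SparkConf()
    .setAppName("receiver-mitigations")
    // Persist received blocks to a write-ahead log (requires a checkpoint dir on HDFS or similar)
    .set("spark.streaming.receiver.writeAheadLog.enable", "true")
    // Let Spark throttle ingestion when batches fall behind
    .set("spark.streaming.backpressure.enabled", "true")
    // Hard cap on records per second per receiver (illustrative value)
    .set("spark.streaming.receiver.maxRate", "10000")
}
```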
Direct Kafka Integration = statelessness
Simple, at-a-glance check: batch processing time < batch interval.
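A sketch of watching that invariant programmatically, assuming a StreamingListener registered on the StreamingContext; the alerting hook here is a hypothetical placeholder.

```scala
import org.apache.spark.streaming.scheduler.{StreamingListener, StreamingListenerBatchCompleted}

// Sketch: flag batches whose processing time exceeds the batch interval.
class BatchTimeMonitor(batchIntervalMs: Long) extends StreamingListener {
  override def onBatchCompleted(batchCompleted: StreamingListenerBatchCompleted): Unit = {
    val processingMs = batchCompleted.batchInfo.processingDelay.getOrElse(0L)
    if (processingMs > batchIntervalMs) {
      // Hypothetical alerting hook -- wire this to your monitoring system.
      println(s"ALERT: batch took $processingMs ms > interval $batchIntervalMs ms")
    }
  }
}

// Usage (assuming an existing StreamingContext `ssc` with a 10s batch interval):
// ssc.addStreamingListener(new BatchTimeMonitor(10000L))
```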
Strong checkpointing strategy (direct) + fail-fast / idempotent code, then driver heartbeat + Kafka lag monitoring = confidence.
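A sketch of the checkpointing piece of that equation, assuming the standard StreamingContext.getOrCreate pattern: on a clean start the factory builds the context, and after a driver crash the context is recovered from the checkpoint directory. The checkpoint path is a placeholder.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object CheckpointedApp {
  val checkpointDir = "hdfs:///checkpoints/streaming-app"   // placeholder path

  def createContext(): StreamingContext = {
    val ssc = new StreamingContext(new SparkConf().setAppName("checkpointed-app"), Seconds(10))
    ssc.checkpoint(checkpointDir)
    // ... define the direct Kafka stream and fail-fast, idempotent output here ...
    ssc
  }

  def main(args: Array[String]): Unit = {
    // Recover from the checkpoint if it exists; otherwise build a fresh context.
    val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
    ssc.start()
    ssc.awaitTermination()
  }
}
```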
After a few days, we notice…
I thought resiliency was the promise of Spark: Resilient Distributed Datasets.
The app was crashing, but why?
Crashes while using the Kafka receiver = missing data (no WAL).
Is Spark so flaky?
Spark was being attacked by the operating system… and doing surprisingly well given the circumstances, especially with the direct Kafka integration and checkpointing.
Goal: have enough resiliency, redundancy, idempotence, and checkpointing to survive multiple failures without causing problems.