Spark Streaming
+ Kafka
Best Practices
Brandon O’Brien
@hakczar
Expedia, Inc
Or: “A Case Study in Operationalizing Spark Streaming”
Context/Disclaimer
• Our use case: build a resilient, scalable data pipeline with streaming ref-data lookups, a 24hr stream self-join and some aggregation. Values accuracy over speed.
• Spark Streaming 1.5-1.6, Kafka 0.9
• Standalone Cluster (not YARN or Mesos)
• No Hadoop
• Message velocity: k/s. Batch window: 10s
• Data sources: Kafka (primary), Redis (joins + ref data) & S3 (ref data)
Demo: Spark in Action
Game & Scoreboard Architecture
Outline
• Spark Streaming & Standalone Cluster Overview
• Design Patterns for Performance
• Guaranteed Message Processing & Direct Kafka Integration
• Operational Monitoring & Alerting
• Spark Cluster & App Resilience
Spark Streaming & Standalone Cluster Overview
• RDD: partitioned, replicated collection of data objects
• Driver: JVM that creates the Spark program and negotiates for resources. It handles task scheduling but does not do the heavy lifting, so it can become a bottleneck.
• Executor: slave to the driver; executes tasks on RDD partitions. Receives its functions from the driver via serialization.
• Lazy Execution: transformations & actions
• Cluster Types: Standalone, YARN, Mesos
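The transformation/action split above can be sketched without Spark at all. Below is a minimal plain-Python illustration (generators standing in for RDD transformations; all names are illustrative, not Spark API): transformations only build a recipe, and no work happens until an action forces evaluation.

```python
log = []  # records when the mapped function actually runs

def transform_map(data, fn):
    # Lazy "transformation": returns a generator; no work happens yet
    return (log.append(f"map({x})") or fn(x) for x in data)

def action_collect(data):
    # "Action": forces evaluation of the whole pipeline
    return list(data)

pipeline = transform_map(range(3), lambda x: x * 10)
assert log == []                   # nothing has executed yet
result = action_collect(pipeline)  # evaluation happens here
assert result == [0, 10, 20]
assert len(log) == 3
```

This is why a long chain of `map`/`filter` calls in Spark costs nothing until `collect`, `count`, or a streaming output operation runs it.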
Spark Streaming & Standalone Cluster Overview
• Standalone Cluster
  • Each node:
    • Master
    • Worker
    • Executor
    • Driver
  • Zookeeper cluster
Design Patterns for Performance
• Delegate all IO/CPU to the Executors
• Avoid unnecessary shuffles (join, groupBy, repartition)
• Externalize streaming joins & reference-data lookups (large/volatile ref data set):
  • JVM static hashmap
  • External cache (e.g. Redis)
  • Static LRU cache (amortize lookups)
  • RocksDB
• Hygienic function closures
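The "static LRU cache" pattern above amortizes external lookups inside each executor JVM. A minimal sketch in plain Python (class and parameter names are illustrative; in the real pipeline `fetch` would be something like a Redis GET):

```python
from collections import OrderedDict

class LruCache:
    """Small in-process LRU cache to amortize reference-data lookups."""

    def __init__(self, fetch, max_size=10_000):
        self._fetch = fetch            # hits the external store on a miss
        self._max_size = max_size
        self._entries = OrderedDict()  # insertion order == recency order

    def get(self, key):
        if key in self._entries:
            self._entries.move_to_end(key)  # mark as recently used
            return self._entries[key]
        value = self._fetch(key)            # cache miss: external lookup
        self._entries[key] = value
        if len(self._entries) > self._max_size:
            self._entries.popitem(last=False)  # evict least recently used
        return value

# Usage: count how often the (simulated) external store is actually hit
calls = []
cache = LruCache(fetch=lambda k: calls.append(k) or k.upper(), max_size=2)
assert cache.get("a") == "A"
assert cache.get("a") == "A"   # served from cache, no second fetch
assert len(calls) == 1
```

Declared as a static/singleton per executor JVM, a cache like this survives across batches, so hot keys are fetched once instead of once per record.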
We’re done, right?
Just need to QA the data…
70% missing data
Guaranteed Message Processing & Direct Kafka Integration
• Guaranteed Message Processing = at-least-once processing + idempotence
• Kafka Receiver:
  • Consumes messages faster than Spark can process them
  • Checkpoints before processing is finished
  • Inefficient CPU utilization
• Direct Kafka Integration:
  • Control over checkpointing & transactionality
  • Better distribution of resource consumption
  • 1:1 Kafka topic-partition to Spark RDD-partition
  • Use Kafka as the WAL
  • Statelessness, fail-fast
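The "at-least-once + idempotence" equation above means redeliveries must be harmless. One common way (a sketch, not the deck's exact implementation): key every write by `(partition, offset)` so a replay after a crash becomes a no-op.

```python
# Stand-in for an idempotent external sink (e.g. a keyed store);
# names and payloads here are illustrative only.
store = {}

def process(partition, offset, payload):
    """Process one Kafka message idempotently."""
    key = (partition, offset)
    if key in store:          # already processed: redelivery is a no-op
        return False
    store[key] = payload.upper()   # the actual "work"
    return True

assert process(0, 42, "click") is True
assert process(0, 42, "click") is False  # redelivered after restart: idempotent
assert store[(0, 42)] == "CLICK"
```

With the direct integration, the offsets that key each RDD partition are known up front, which is what makes this kind of transactional, replay-safe sink possible.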
Operational Monitoring & Alerting
• Driver “Heartbeat”
• Batch processing time
• Message count
• Kafka lag (latest offsets vs last processed)
• Driver start events
• StatsD + Graphite + Seyren
• http://localhost:4040/metrics/json/
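The Kafka lag metric above is just the gap between the latest broker offsets and the last offsets the app processed, summed across partitions. A sketch (the offset maps are hand-rolled stand-ins for what you would fetch from Kafka and from your checkpoint store):

```python
def total_lag(latest_offsets, processed_offsets):
    """Sum of (latest - processed) over all topic-partitions."""
    return sum(
        latest_offsets[p] - processed_offsets.get(p, 0)
        for p in latest_offsets
    )

latest = {0: 1000, 1: 950}     # tip of each topic-partition on the broker
processed = {0: 990, 1: 940}   # last offsets the app committed
assert total_lag(latest, processed) == 20
```

Emitted to StatsD/Graphite each batch, a flat or growing lag curve is the alerting signal that the app has silently fallen behind the stream.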
Data loss fixed
So we’re done, right?
Cluster & app
continuously crashing
Spark Cluster & App Stability
Spark slave memory utilization
Spark Cluster & App Stability
• Slave memory overhead
• OOM killer
• Crashes + Kafka Receiver = missing data
• Supervised driver: “--supervise” for spark-submit; driver restart logging
• Cluster resource overprovisioning
• Standby Masters for failover
• Auto-cleanup of work directories: spark.worker.cleanup.enabled=true
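The supervised-driver and cleanup settings above might look like this in practice (illustrative only: the master host, class name, and jar path are placeholders; `--supervise` requires cluster deploy mode on a standalone cluster):

```shell
# Supervised driver: the standalone Master restarts the driver if it exits abnormally
spark-submit \
  --master spark://master-host:7077 \
  --deploy-mode cluster \
  --supervise \
  --class com.example.StreamingApp \
  app.jar

# Work-directory cleanup is a worker-side setting, e.g. in spark-env.sh on each worker:
# SPARK_WORKER_OPTS="-Dspark.worker.cleanup.enabled=true"
```

Note the cleanup flag governs the *worker* process, so it takes effect when set where the workers launch, not only in the application's SparkConf.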
We’re done, right?
Finally, yes
Party Time
TL;DR
1. Use Direct Kafka Integration + transactionality
2. Cache reference data for speed
3. Avoid shuffles & driver bottlenecks
4. Supervised driver
5. Cleanup worker temp directory
6. Beware of function closures
7. Cluster resource over-provisioning
8. Spark slave memory headroom
9. Monitoring on Driver heartbeat & Kafka lag
10. Standby masters
Thanks!
Links
• Operationalizing Spark Streaming: https://techblog.expedia.com/2016/12/29/operationalizing-spark-streaming-part-1/
• Direct Kafka Integration: https://databricks.com/blog/2015/03/30/improvements-to-kafka-integration-of-spark-streaming.html
• App metrics: http://localhost:4040/metrics/json/
• MetricsSystem: http://www.hammerlab.org/2015/02/27/monitoring-spark-with-graphite-and-grafana/
• sparkConf.set("spark.worker.cleanup.enabled", "true")
Editor's Notes

  1. Tell our story, to share learnings.
  2. This was our use case; yours may be different.
  3. This is our use case; yours may be different.
  4. Live system to reason about.
  5. Not necessarily the only way to set it up. Saves IP space.
  6. OK, we built the app in the Spark framework for scalability and made it fast…
  7. Pause, check on game player.
  8. Spark is hiding the fact that it can’t keep up with the stream. Crash + restart + bad checkpoint = missing messages. There is config to ameliorate this; it is an artifact of the absence of a WAL/HDFS. Multiple data-loss scenarios. Direct Kafka Integration = statelessness.
  9. Simple, at a glance: batch process time < batch interval. Strong checkpointing strategy (direct) + fail-fast/idempotent code, then driver heartbeat + Kafka lag = confidence.
  10. After a few days, we notice…
  11. After a few days, we notice…
  12. I thought resiliency was the promise of Spark: Resilient Distributed Datasets.
  13. The app was crashing, but why?
  14. Crashes while using the Kafka Receiver = missing data (no WAL). Is Spark so flaky? Spark was being attacked by the operating system… and doing surprisingly well given the circumstances, especially with the Direct Kafka Integration and checkpointing. Goal: have enough resiliency, redundancy, idempotence, and checkpointing to survive multiple failures without causing problems.