Spark Streaming
+ Kafka
Best Practices
Brandon O’Brien
@hakczar
Expedia, Inc
Or: “A Case Study in Operationalizing Spark Streaming”
Context/Disclaimer
• Our use case: build a resilient, scalable data pipeline with streaming ref-data lookups, a 24hr stream self-join and some aggregation. Values accuracy over speed.
• Spark Streaming 1.5-1.6, Kafka 0.9
• Standalone Cluster (not YARN or Mesos)
• No Hadoop
• Message velocity: k/s. Batch window: 10s
• Data sources: Kafka (primary), Redis (joins + ref data) & S3 (ref data)
Demo: Spark in Action
Game & Scoreboard Architecture
Outline
• Spark Streaming & Standalone Cluster Overview
• Design Patterns for Performance
• Guaranteed Message Processing & Direct Kafka Integration
• Operational Monitoring & Alerting
• Spark Cluster & App Resilience
Spark Streaming & Standalone Cluster Overview
• RDD: partitioned, replicated collection of data objects
• Driver: JVM that creates the Spark program and negotiates for resources. It handles task scheduling but does not do the heavy lifting, so it can become a bottleneck.
• Executor: slave to the driver; executes tasks on RDD partitions. Receives its functions from the driver via serialization.
• Lazy Execution: transformations & actions
• Cluster Types: Standalone, YARN, Mesos
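The transformation/action split above can be sketched without Spark at all. Below is a minimal plain-Python illustration (generators standing in for RDD transformations; all names are illustrative, not Spark API): transformations only build a recipe, and no work happens until an action forces evaluation.

```python
log = []  # records when the mapped function actually runs

def transform_map(data, fn):
    # Lazy "transformation": returns a generator; no work happens yet
    return (log.append(f"map({x})") or fn(x) for x in data)

def action_collect(data):
    # "Action": forces evaluation of the whole pipeline
    return list(data)

pipeline = transform_map(range(3), lambda x: x * 10)
assert log == []                   # nothing has executed yet
result = action_collect(pipeline)  # evaluation happens here
assert result == [0, 10, 20]
assert len(log) == 3
```

This is why a long chain of `map`/`filter` calls in Spark costs nothing until `collect`, `count`, or a streaming output operation runs it.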
Spark Streaming & Standalone Cluster Overview
• Standalone Cluster
  • Each node:
    • Master
    • Worker
    • Executor
    • Driver
  • Zookeeper cluster
Design Patterns for Performance
• Delegate all IO/CPU to the Executors
• Avoid unnecessary shuffles (join, groupBy, repartition)
• Externalize streaming joins & reference-data lookups (large/volatile ref data set):
  • JVM static hashmap
  • External cache (e.g. Redis)
  • Static LRU cache (amortize lookups)
  • RocksDB
• Hygienic function closures
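The "static LRU cache" pattern above amortizes external lookups inside each executor JVM. A minimal sketch in plain Python (class and parameter names are illustrative; in the real pipeline `fetch` would be something like a Redis GET):

```python
from collections import OrderedDict

class LruCache:
    """Small in-process LRU cache to amortize reference-data lookups."""

    def __init__(self, fetch, max_size=10_000):
        self._fetch = fetch            # hits the external store on a miss
        self._max_size = max_size
        self._entries = OrderedDict()  # insertion order == recency order

    def get(self, key):
        if key in self._entries:
            self._entries.move_to_end(key)  # mark as recently used
            return self._entries[key]
        value = self._fetch(key)            # cache miss: external lookup
        self._entries[key] = value
        if len(self._entries) > self._max_size:
            self._entries.popitem(last=False)  # evict least recently used
        return value

# Usage: count how often the (simulated) external store is actually hit
calls = []
cache = LruCache(fetch=lambda k: calls.append(k) or k.upper(), max_size=2)
assert cache.get("a") == "A"
assert cache.get("a") == "A"   # served from cache, no second fetch
assert len(calls) == 1
```

Declared as a static/singleton per executor JVM, a cache like this survives across batches, so hot keys are fetched once instead of once per record.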
We’re done, right?
Just need to QA the data…
70% missing data
Guaranteed Message Processing & Direct Kafka Integration
• Guaranteed Message Processing = at-least-once processing + idempotence
• Kafka Receiver:
  • Consumes messages faster than Spark can process them
  • Checkpoints before processing is finished
  • Inefficient CPU utilization
• Direct Kafka Integration:
  • Control over checkpointing & transactionality
  • Better distribution of resource consumption
  • 1:1 Kafka topic-partition to Spark RDD-partition
  • Use Kafka as the WAL
  • Statelessness, fail-fast
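The "at-least-once + idempotence" equation above means redeliveries must be harmless. One common way (a sketch, not the deck's exact implementation): key every write by `(partition, offset)` so a replay after a crash becomes a no-op.

```python
# Stand-in for an idempotent external sink (e.g. a keyed store);
# names and payloads here are illustrative only.
store = {}

def process(partition, offset, payload):
    """Process one Kafka message idempotently."""
    key = (partition, offset)
    if key in store:          # already processed: redelivery is a no-op
        return False
    store[key] = payload.upper()   # the actual "work"
    return True

assert process(0, 42, "click") is True
assert process(0, 42, "click") is False  # redelivered after restart: idempotent
assert store[(0, 42)] == "CLICK"
```

With the direct integration, the offsets that key each RDD partition are known up front, which is what makes this kind of transactional, replay-safe sink possible.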
Operational Monitoring & Alerting
• Driver “Heartbeat”
• Batch processing time
• Message count
• Kafka lag (latest offsets vs last processed)
• Driver start events
• StatsD + Graphite + Seyren
• http://localhost:4040/metrics/json/
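The Kafka lag metric above is just the gap between the latest broker offsets and the last offsets the app processed, summed across partitions. A sketch (the offset maps are hand-rolled stand-ins for what you would fetch from Kafka and from your checkpoint store):

```python
def total_lag(latest_offsets, processed_offsets):
    """Sum of (latest - processed) over all topic-partitions."""
    return sum(
        latest_offsets[p] - processed_offsets.get(p, 0)
        for p in latest_offsets
    )

latest = {0: 1000, 1: 950}     # tip of each topic-partition on the broker
processed = {0: 990, 1: 940}   # last offsets the app committed
assert total_lag(latest, processed) == 20
```

Emitted to StatsD/Graphite each batch, a flat or growing lag curve is the alerting signal that the app has silently fallen behind the stream.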
Data loss fixed
So we’re done, right?
Cluster & app
continuously crashing
Spark Cluster & App Stability
Spark slave memory utilization
Spark Cluster & App Stability
• Slave memory overhead
• OOM killer
• Crashes + Kafka Receiver = missing data
• Supervised driver: “--supervise” for spark-submit; driver restart logging
• Cluster resource overprovisioning
• Standby Masters for failover
• Auto-cleanup of work directories: spark.worker.cleanup.enabled=true
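The supervised-driver and cleanup settings above might look like this in practice (illustrative only: the master host, class name, and jar path are placeholders; `--supervise` requires cluster deploy mode on a standalone cluster):

```shell
# Supervised driver: the standalone Master restarts the driver if it exits abnormally
spark-submit \
  --master spark://master-host:7077 \
  --deploy-mode cluster \
  --supervise \
  --class com.example.StreamingApp \
  app.jar

# Work-directory cleanup is a worker-side setting, e.g. in spark-env.sh on each worker:
# SPARK_WORKER_OPTS="-Dspark.worker.cleanup.enabled=true"
```

Note the cleanup flag governs the *worker* process, so it takes effect when set where the workers launch, not only in the application's SparkConf.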
We’re done, right?
Finally, yes
Party Time
TL;DR
1. Use Direct Kafka Integration + transactionality
2. Cache reference data for speed
3. Avoid shuffles & driver bottlenecks
4. Supervised driver
5. Cleanup worker temp directory
6. Beware of function closures
7. Cluster resource over-provisioning
8. Spark slave memory headroom
9. Monitoring on Driver heartbeat & Kafka lag
10. Standby masters
Thanks!
Links
• Operationalizing Spark Streaming: https://techblog.expedia.com/2016/12/29/operationalizing-spark-streaming-part-1/
• Direct Kafka Integration: https://databricks.com/blog/2015/03/30/improvements-to-kafka-integration-of-spark-streaming.html
• App metrics: http://localhost:4040/metrics/json/
• MetricsSystem: http://www.hammerlab.org/2015/02/27/monitoring-spark-with-graphite-and-grafana/
• sparkConf.set("spark.worker.cleanup.enabled", "true")
Editor's Notes

  1. Tell our story, to share learnings.
  2. This was our use case; yours may be different.
  3. This is our use case; yours may be different.
  4. Live system to reason about.
  5. Not necessarily the only way to set it up. Saves IP space.
  6. OK, we built the app in the Spark framework for scalability and made it fast…
  7. Pause, check on game player.
  8. Spark is hiding the fact that it can’t keep up with the stream. Crash + restart + bad checkpoint = missing messages. There is config to ameliorate this; it is an artifact of the absence of a WAL/HDFS. Multiple data-loss scenarios. Direct Kafka Integration = statelessness.
  9. Simple, at a glance: batch process time < batch interval. Strong checkpointing strategy (direct) + fail-fast/idempotent code, then driver heartbeat + Kafka lag = confidence.
  10. After a few days, we notice…
  11. After a few days, we notice…
  12. I thought resiliency was the promise of Spark: Resilient Distributed Datasets.
  13. The app was crashing, but why?
  14. Crashes while using the Kafka Receiver = missing data (no WAL). Is Spark so flaky? Spark was being attacked by the operating system… and doing surprisingly well given the circumstances, especially with the Direct Kafka Integration and checkpointing. Goal: have enough resiliency, redundancy, idempotence, and checkpointing to survive multiple failures without causing problems.