
DoorDash: Using Kafka to Replace RabbitMQ and Eliminate Task Processing Outages

Saba Khalilnaji, DoorDash, Software Engineer + Ashwin Kachhara, DoorDash, Software Engineer

Scaling backend infrastructure to handle hyper-growth is one of the many exciting challenges of working at DoorDash. In this talk, we’ll discuss some scaling issues in 2019 that prompted us to accelerate our adoption of Kafka.

In mid 2019, we faced significant scaling challenges and frequent outages involving Celery and RabbitMQ, two technologies powering the system that handles the asynchronous work enabling critical functionalities of our platform, including order checkout and Dasher assignments. We quickly solved this problem with a simple, Apache Kafka-based asynchronous task processing system that stopped our outages while we continued to iterate on a robust solution. Our initial version implemented the smallest set of features needed to accommodate a large portion of existing Celery tasks. Once in production, we continued to add support for more Celery features while addressing novel problems that arose when using Kafka.
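The slide transcript further down mentions a wrapper around Celery's @task annotation that routes task submissions to either system dynamically. A minimal sketch of that idea follows. It is not DoorDash's code: the `KAFKA_ENABLED_TASKS` switch, the `task` decorator, and both `submit_to_*` functions are hypothetical stand-ins that neither import Celery nor talk to a broker.

```python
import functools

# Hypothetical runtime switch: names of tasks whose submissions go to Kafka.
KAFKA_ENABLED_TASKS = set()


def submit_to_celery(name, args, kwargs):
    """Placeholder for the existing Celery/RabbitMQ submission path."""
    return ("celery", name, args, kwargs)


def submit_to_kafka(name, args, kwargs):
    """Placeholder for producing a task message to a Kafka topic."""
    return ("kafka", name, args, kwargs)


def task(func):
    """Wrap a function so callers keep a Celery-like .delay() interface."""
    name = func.__name__

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        # Calling the task directly still runs it synchronously.
        return func(*args, **kwargs)

    def delay(*args, **kwargs):
        # The routing decision is made on every submission, so changing
        # the switch affects all later calls without touching call sites.
        if name in KAFKA_ENABLED_TASKS:
            return submit_to_kafka(name, args, kwargs)
        return submit_to_celery(name, args, kwargs)

    wrapper.delay = delay
    return wrapper


@task
def send_receipt(order_id):
    return f"receipt for order {order_id}"


before = send_receipt.delay(42)          # routed to the Celery placeholder
KAFKA_ENABLED_TASKS.add("send_receipt")  # migrate this one task
after = send_receipt.delay(42)           # same call site, now routed to Kafka
```

Because call sites only ever see `.delay()`, a task could in principle be moved between systems one at a time, which matches the incremental rollout the talk describes.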

Thereafter, we adopted Kafka across a variety of domains either directly, or in conjunction with technologies like Flink and Cadence. Kafka’s ability to scale and provide at-least-once message delivery has been crucial for our use cases and given us a boost in reliability across several domains.
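At-least-once delivery means a message can be delivered again if a consumer fails after doing the work but before recording its progress, so handlers generally need to tolerate duplicates. The sketch below illustrates this with a plain Python list standing in for a partition. It uses no Kafka client, and `consume`, `handle`, and the `crash_before_commit_at` parameter are hypothetical names made up for the illustration.

```python
def consume(log, committed_offset, handler, crash_before_commit_at=None):
    """Process messages from committed_offset onward.

    The offset only advances (is "committed") after the handler returns.
    Returns the last committed offset.
    """
    offset = committed_offset
    while offset < len(log):
        handler(log[offset])
        if offset == crash_before_commit_at:
            # Simulated crash: the work was done, but the offset was
            # never committed, so this message will be redelivered.
            return offset
        offset += 1  # commit after successful processing
    return offset


processed_ids = set()  # deduplication state (an assumption of this sketch)
side_effects = []      # what the tasks actually did
deliveries = []        # every delivery, including duplicates


def handle(message):
    task_id, payload = message
    deliveries.append(task_id)
    if task_id in processed_ids:
        return  # duplicate delivery: skip the side effect
    processed_ids.add(task_id)
    side_effects.append(payload)


log = [(1, "a"), (2, "b"), (3, "c")]
committed = consume(log, 0, handle, crash_before_commit_at=1)
committed = consume(log, committed, handle)  # restart from last commit
```

After the restart, the second message is delivered twice, but the idempotent handler applies its side effect only once.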

https://www.meetup.com/KafkaBayArea/events/274915506/?isFirstPublish=true

Published in: Technology

DoorDash: Using Kafka to Replace RabbitMQ and Eliminate Task Processing Outages

  1. Using Kafka to Replace RabbitMQ and Eliminate Task Processing Outages at DoorDash
     Saba Khalilnaji (saba@doordash.com), Ashwin Kachhara (ashwin@doordash.com), 12/15/2020
  2. Contents
     ● Introduction
     ● Problems we faced with Celery / RabbitMQ
     ● Potential solutions to problems with Celery / RabbitMQ
     ● Kafka Onboarding Strategy
     ● No solution is perfect
     ● Key Wins
     ● Other use-cases of Kafka at DoorDash
     ● Conclusion
     ● Acknowledgements
  3. Introduction
     Tasks related to different use-cases leverage different topics with their dedicated worker pools, based on volume.
  4. Problems we faced with RabbitMQ & Celery
  5. Issues with availability
     ● Some of our outages were caused by heavy use of Celery scheduled tasks with ETA
     ● Sudden bursts of traffic left RabbitMQ in a degraded state with low throughput
     ● Our uWSGI workers' harakiri setting caused connection churn to RabbitMQ and cascading failures
     ● Celery task processing would stop with no evidence of resource constraints, requiring a restart
  6. Other problems with Celery and RabbitMQ
     ● SCALABILITY: Reached the maximum vertical scale available to us. The provider HA mode limited our capacity.
     ● OBSERVABILITY: Limited to a small set of RabbitMQ metrics available to us. Limited visibility into the Celery workers.
     ● OPERATIONAL EFFICIENCY: Unsustainable time spent operating and maintaining RabbitMQ. Not enough in-house RabbitMQ expertise.
  7. Potential Solutions to the problems with RabbitMQ and Celery
  8. Potential solutions we considered
     ● CELERY BROKER CHANGE: Continue using Celery with a potentially more reliable backing data store.
     ● MULTI-BROKER SYSTEM: Shard task processing across multiple brokers to reduce average load.
     ● RMQ / CELERY VERSION UPGRADE: Leverage potential reliability fixes in newer versions, buying us some time.
     ● CUSTOM KAFKA SOLUTION: More effort than any other solution, but potential to solve all our problems (by design).
  9. Option #1: Change the Celery Broker to Redis
     PROS
     ● Improved availability & observability w/ ECC & multi-AZ
     ● Improved operational efficiency
     ● In-house operational experience & expertise w/ Redis
     ● Broker swap is a simple supported option in Celery
     ● Connection churn doesn’t degrade Redis performance
     CONS
     ● Incompatible w/ Redis clustered mode
     ● Single node Redis does not scale horizontally
     ● No Celery observability improvements
     ● Does not address stopped worker problem
     Does not solve scalability, only partially solves observability, and does not address worker stopped problem
  10. Option #2: Change the Celery Broker to Kafka
     PROS
     ● Kafka can be highly available and horizontally scalable
     ● Improved observability and operational efficiency
     ● The team has lots of Kafka expertise
     ● Broker swap is a simple supported option in Celery
     ● Connection churn doesn’t degrade Kafka performance
     CONS
     ● Kafka is not supported by Celery yet
     ● No Celery observability improvements
     ● Does not address stopped worker problem
     ● Insufficient experience operating Kafka at scale
     Only partially solves observability, does not address worker stopped problem, and is not supported out of the box
  11. Option #3: Multi-Broker Solution
     PROS
     ● Improved availability
     ● Horizontal scalability
     ● Comparatively less effort required
     CONS
     ● No observability or operational efficiency boosts
     ● Does not address stopped worker problem
     ● Does not address connection churn issue
     Does not solve observability, connection churn, nor worker stopped problem
  12. Option #4: Upgrade both Celery & RabbitMQ versions
     PROS
     ● Might prevent RabbitMQ getting stuck
     ● Might prevent Celery workers getting stuck
     ● Buys us time to work on a longer-term strategy
     CONS
     ● Will not fix any issues immediately
     ● Requires newer versions of Python
     ● Does not address connection churn issue
     Might prevent stuck Celery workers, but doesn’t definitely solve anything else
  13. Option #5: Building a custom Kafka solution
     PROS
     ● Kafka can be highly available and horizontally scalable
     ● Improved observability and operational efficiency
     ● Team has a lot of in-house Kafka expertise
     ● Broker change is a straightforward option
     ● Connection churn doesn’t degrade Kafka performance
     ● Addresses stopped worker problem
     CONS
     ● More work to implement compared to other options
     ● Minimal team experience operating Kafka at scale
     Solves all our problems. Most amount of effort required, and limited experience operating at scale
  14. And the winner is…
  15. Building a custom Kafka Solution!
     It addressed all the problems we were facing, while also being an industry standard that can scale. Kafka would give us full control over observability and availability.
  16. Kafka Onboarding Strategy
  17. Kafka Onboarding Strategy
     ● HITTING THE GROUND RUNNING: Leverage the basic solution as we’re iterating on other parts of it. “Racing a car while swapping in a new fuel pump”
     ● NO-OP ADOPTION: Maintain the same task interface for seamless, no-hassle adoption and minimize effort on the part of developers
     ● INCREMENTAL ROLLOUT, ZERO DOWNTIME: Instead of a big flashy release, ship smaller independent features that can be individually tested
  18. Onboarding strategy: Hitting the ground running
     We built a minimum viable product (MVP) to bring us interim stability and buy us time to iterate on a more comprehensive solution.
  19. Onboarding strategy: Hitting the ground running
     We launched our MVP after 2 weeks of development. We achieved an 80% reduction in RabbitMQ task load a week after that.
  20. Onboarding strategy: Seamless adoption, incremental rollout
     ● We implemented a wrapper for Celery’s @task annotation
     ● Allowed us to route task submissions to either system dynamically
     ● As soon as a subfeature of Celery had been ported, tasks using it could be migrated (in seconds)
  21. No solution is perfect: Iterate as needed
  22. No solution is perfect: Head-of-the-line blocking
     A “slow” message in a partition can block all messages behind it from getting processed.
  23. No solution is perfect: Non-blocking task consumer
     Consists of
     ● 1 x Local message queue
     ● 1 x Kafka-consumer process
     ● N x Task-executor processes
     A “slow” message only blocks a single task-executor process until it completes. Other messages in the partition can continue to flow.
  24. Scheduled tasks (and more) via Cadence
     ● Kafka is not a hard dependency for Cadence
     ● Useful to execute & schedule multi-step workflows in a distributed service ecosystem
     ● Distributed, scalable, durable, and highly available
     ● Orchestrates asynchronous business logic scalably and with resilience
  25. Conclusions
  26. Conclusion & Key Wins
     ● NO MORE REPEATED OUTAGES: Dealt with outage problem within 3 weeks of development, giving us more time after that to focus on esoteric features.
     ● PROCESSING NO LONGER A BOTTLENECK: Task processing was no longer a bottleneck, allowing DoorDash to continue growing and serving customers.
     ● 10x INCREASED OBSERVABILITY: Granular observability in prod and dev environments, improving confidence as well as developer productivity.
     ● OPERATIONAL DECENTRALIZATION: Enable developers to debug their operational issues, and perform cluster-management ops if needed.
  27. Other notable use-cases of Kafka at DoorDash
  28. Other use-cases: Real-Time Streaming Platform
     Receive real-time production and analytics events
     ● Kafka REST Proxy
     ● Apache Flink
     Current Scale
     ● 800B events / day
     ● Peak > 200k / sec
  29. Other use-cases: Our Iguazu Pipeline
     Standardized events with schema definitions as Protobuf or Avro
     ● Low latency
     ● Lower costs
     ● Better data quality
  30. Other use-cases: Search Indexing
     Huge boost in
     ● Indexing speed
     ● Accuracy
  31. It takes a village!
     Engineering Branding: Ezra Berger, Wayne Cunningham
     Engineering: Clement Fang, Corry Haines, Danial Asif, Jay Weinstein, Luigi Tagliamonte, Matthew Anger, Shaohua Zhou, Yun-Yu Chen, Allen Wang, Matan Amir
  32. Thank you
     Saba Khalilnaji, Ashwin Kachhara, 12/15/2020
  33. Further Reading
     ● https://doordash.engineering/2020/09/03/eliminating-task-processing-outages-with-kafka/
     ● https://doordash.engineering/2020/08/14/workflows-cadence-event-driven-processing/
     ● https://doordash.engineering/2020/09/25/how-doordash-is-scaling-its-data-platform/
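
Slides 22 and 23 describe head-of-the-line blocking and a non-blocking task consumer built from one local message queue, one Kafka-consumer process, and N task-executor processes. The sketch below shows only the shape of that design, under several assumptions: threads stand in for the processes, a Python list stands in for a Kafka partition, and `run_consumer` and `handle` are hypothetical names. Offset commits and failure handling are left out entirely.

```python
import queue
import threading
import time


def run_consumer(messages, handler, num_executors=2):
    """Fan messages from one 'partition' out to several executors.

    Returns the messages in the order their processing completed.
    """
    local_queue = queue.Queue()  # the local message queue
    completed = []
    lock = threading.Lock()

    def executor():
        # Each executor pulls the next available message, so a slow
        # message occupies only the executor that picked it up.
        while True:
            message = local_queue.get()
            if message is None:  # sentinel: no more work
                return
            handler(message)
            with lock:
                completed.append(message)

    workers = [threading.Thread(target=executor) for _ in range(num_executors)]
    for worker in workers:
        worker.start()

    # Consumer loop: stands in for polling Kafka and enqueuing locally.
    for message in messages:
        local_queue.put(message)
    for _ in workers:
        local_queue.put(None)
    for worker in workers:
        worker.join()
    return completed


def handle(message):
    if message == "slow":
        time.sleep(0.3)  # simulate a slow task


order = run_consumer(["slow", "a", "b", "c"], handle, num_executors=2)
```

With two executors, the message at the head of the partition ties up one executor while the other drains the rest, so the slow message finishes last instead of delaying everything behind it.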
