
8 Lessons Learned from Using Kafka in 1000 Scala microservices - Scale by the Bay

Kafka is the bedrock of Wix's distributed microservices system. Over the last 5 years we have learned a lot about how to successfully scale our event-driven architecture to roughly 1500 microservices.
We've achieved greater decoupling and independence for services and dev teams with very different use cases, while keeping a single uniform infrastructure in place.

In these slides you will learn about 8 key decisions and steps you can take to safely scale up your Kafka-based system. These include:
* How to increase dev velocity of event-driven code.
* How to optimize working with Kafka in a polyglot setting.
* How to support a growing amount of traffic and a growing number of developers.



  1. natansil.com | twitter @NSilnitsky | linkedin/natansilnitsky | github.com/natansil — 8 Lessons Learned from using Kafka with 1000 Scala microservices. Natan Silnitsky, Backend Infra Developer, Wix.com
  2. At Wix: >180M registered users (website builders) from 190 countries; 5% of all Internet websites run on Wix; 4000+ people work at Wix; >500B HTTP requests/day; 6PB of static content. @NSilnitsky
  3. At Wix, 2017 → 2020: Kafka messages per day: 300M → 1510M; microservices: 500 → 1500; (service) developers: 300 → 900.
  4. What do you do when the traffic, meta-data, and number of developers and use cases grow?
  6. What do you do when the traffic, meta-data, and number of developers and use cases grow? #1 Common Infra with common features.
  7. Greyhound wraps Kafka: Service A → Kafka Producer → Kafka Broker → Kafka Consumer → Service B. @NSilnitsky
  9. Greyhound wraps Kafka: abstract, so that it is easy to change for everyone; simplify APIs, with additional features.
  10. Greyhound wraps Kafka: Scala ZIO + Java API. @NSilnitsky
  11. Greyhound wraps Kafka: simple Consumer API - boilerplate.
  12. Kafka Consumer API (* broker location, serde):

     val consumer: KafkaConsumer[String, SomeMessage] = createConsumer()

     def pollProcessAndCommit(): Unit = {
       val consumerRecords = consumer.poll(1000).asScala
       consumerRecords.foreach(record => {
         println(s"Record value: ${record.value.messageValue}")
       })
       consumer.commitAsync()
       pollProcessAndCommit()
     }

     pollProcessAndCommit()
  15. Greyhound Consumer API (* no explicit commit, broker location):

     val handler: RecordHandler[Console, Nothing, String, SomeMessage] =
       RecordHandler { record =>
         zio.console.putStrLn(record.value.messageValue)
       }

     GreyhoundConsumersBuilder
       .withConsumer(GreyhoundConsumer(
         topic = "some-topic",
         group = "group-2",
         handle = handler))
  16. Greyhound wraps Kafka: simple Consumer API + parallel consumption!
  18. (Thread-safe) parallel consumption: Kafka Broker (site-chat-bot-messages topic, many partitions) → Kafka Consumer, wrapped by a Greyhound Consumer that dispatches records to ZIO fibers + queues. @NSilnitsky
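The fiber-per-partition dispatch on this slide can be sketched without Greyhound or ZIO. A minimal illustration (all names are hypothetical, not Greyhound's API): give each partition its own single-threaded executor, so records within a partition stay ordered while different partitions make progress in parallel.

```scala
import java.util.concurrent.{ConcurrentLinkedQueue, ExecutorService, Executors, TimeUnit}
import scala.collection.concurrent.TrieMap

// A record as (partition, offset, value) — a stand-in for Kafka's ConsumerRecord.
final case class Record(partition: Int, offset: Long, value: String)

// One single-threaded executor per partition preserves per-partition ordering
// while letting partitions run concurrently — the idea behind the
// fiber-per-partition dispatch shown on the slide.
class ParallelDispatcher(handle: Record => Unit) {
  private val executors = TrieMap.empty[Int, ExecutorService]

  def submit(record: Record): Unit = {
    val exec = executors.getOrElseUpdate(
      record.partition, Executors.newSingleThreadExecutor())
    exec.submit(new Runnable { def run(): Unit = handle(record) })
  }

  def shutdown(): Unit = executors.values.foreach { e =>
    e.shutdown(); e.awaitTermination(5, TimeUnit.SECONDS)
  }
}
```

Real Greyhound uses ZIO fibers and queues instead of JVM threads, which makes the same pattern much cheaper per partition.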
  19. Greyhound wraps Kafka: simple Consumer API, thread-safe parallel consumption + retries! ...What about error handling?
  20. Non-blocking retries:

     val retryConfig = RetryConfig.nonBlocking(
       1.second,
       10.minutes)

     GreyhoundConsumersBuilder
       .withConsumer(GreyhoundConsumer(
         topic = "some-topic",
         group = "group-2",
         handle = handler,
         retryConfig = retryConfig))
  21. Kafka Broker: renew-sub-topic (partitioned) → Greyhound Consumer (wrapping a Kafka Consumer) FAILS TO READ. @NSilnitsky
  22. RETRY! The Greyhound Consumer's retry producer re-produces the failed record to renew-sub-topic-retry-0, renew-sub-topic-retry-1, ... Inspired by Uber. @NSilnitsky
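The retry-topic flow on these slides can be modeled as a pair of pure functions. This is an illustrative sketch, not Greyhound's actual implementation: the "<topic>-retry-<n>" naming follows the slide, but treating the two durations of RetryConfig.nonBlocking as an exponential-backoff floor and cap is my assumption.

```scala
import scala.concurrent.duration._

// Hypothetical retry plan: initial delay, delay cap, and max retry attempts.
final case class RetryPlan(initialDelay: FiniteDuration, maxDelay: FiniteDuration, maxAttempts: Int)

// Topic a failed record is re-produced to for a given attempt,
// or None once attempts are exhausted (e.g. hand off to a dead-letter topic).
def nextRetryTopic(baseTopic: String, attempt: Int, plan: RetryPlan): Option[String] =
  if (attempt < plan.maxAttempts) Some(s"$baseTopic-retry-$attempt") else None

// Exponential backoff between the configured floor and cap.
def retryDelay(attempt: Int, plan: RetryPlan): FiniteDuration = {
  val delay = plan.initialDelay * math.pow(2, attempt.toDouble).toLong
  if (delay > plan.maxDelay) plan.maxDelay else delay
}
```

Because each delay tier is a real topic, every retry tier adds partitions to the cluster — which is exactly the growth problem the next slide calls out.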
  23. What do you do when the traffic, meta-data, and number of developers and use cases grow? #1 Common Infra. #2 Retry topics will cause your cluster to grow faster. 😐
  24. Retry topics multiply: renew-sub-topic plus renew-sub-topic-retry-0, -1, ... -N spread across brokers. RETRY! Inspired by Uber. @NSilnitsky
  25. (Same diagram, with a failed record moving through the retry topics.) @NSilnitsky
  26. Blocking policy handler: retries the same message on failure (* lag). Kafka Broker: source-control-update-topic → Greyhound Consumer (wrapping a Kafka Consumer) in build-log-service.
  27. Blocking policy handler. @NSilnitsky
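The blocking policy trades lag for ordering: the handler retries the same record in place instead of re-producing it to a retry topic. A minimal sketch under that assumption (names are illustrative, not Greyhound's API):

```scala
import scala.annotation.tailrec
import scala.util.{Failure, Success, Try}

// Retries the same record in place until it succeeds or attempts run out.
// The partition is blocked while retrying (consumer lag grows), but
// records are never processed out of order.
@tailrec
def handleBlocking[A](record: A, handle: A => Unit, attemptsLeft: Int): Try[Unit] =
  Try(handle(record)) match {
    case s @ Success(_)                      => s
    case f @ Failure(_) if attemptsLeft <= 1 => f
    case Failure(_)                          => handleBlocking(record, handle, attemptsLeft - 1)
  }
```

A production version would also sleep between attempts and emit lag metrics; both are omitted here for brevity.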
  28. Greyhound wraps Kafka: simple Consumer API, thread-safe parallel consumption, scheduled retries & blocking handler + context propagation (super cool for us).
  29. Context propagation: user-request metadata (language, user type, geo, ...) flows from the browser sign-up into the Site-Members Service. @NSilnitsky
  30. Context propagation: the Site-Members Service's producer writes that metadata into the Kafka record's headers (alongside topic/partition/offset, key, value, timestamp).
  31. Context propagation: the Contacts Service's consumer reads the record from the Kafka Broker and restores the original request context from its headers.
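Kafka record headers are key → byte-array pairs, which is all context propagation needs: serialize the request metadata on produce, restore it on consume. A self-contained sketch modeling headers as a plain map, with hypothetical header names (the real wire format is Wix-internal):

```scala
import java.nio.charset.StandardCharsets.UTF_8

// The user-request metadata from the slide.
final case class RequestContext(language: String, geo: String, userType: String)

// Producer side: flatten the context into Kafka-style headers (key -> bytes).
def toHeaders(ctx: RequestContext): Map[String, Array[Byte]] = Map(
  "ctx.language" -> ctx.language.getBytes(UTF_8),
  "ctx.geo"      -> ctx.geo.getBytes(UTF_8),
  "ctx.userType" -> ctx.userType.getBytes(UTF_8))

// Consumer side: restore the context; None if any header is missing.
def fromHeaders(headers: Map[String, Array[Byte]]): Option[RequestContext] =
  for {
    lang <- headers.get("ctx.language")
    geo  <- headers.get("ctx.geo")
    ut   <- headers.get("ctx.userType")
  } yield RequestContext(new String(lang, UTF_8), new String(geo, UTF_8), new String(ut, UTF_8))
```

With kafka-clients these maps would become `org.apache.kafka.common.header.Headers` on the ProducerRecord/ConsumerRecord.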
  32. What do you do when the traffic, meta-data, and number of developers and use cases grow? #1 Common Infra. #2 Retry topics - bigger cluster. #3 Self-service tooling and documentation.
  33. Our team.
  34. Our team + self-service tools & docs.
  35. Self-service tools: 1. Kafka CLI scripts (+ flags); 2. Yahoo Kafka Manager (OSS); 3. Confluent Control Center; 4. A control plane of our own.
  36. Self-service tools (screenshots). @NSilnitsky
  43. Self-service docs ("How do I investigate this lag?", "How do I add retries on errors?"): 1. GitHub README for the Greyhound code; 2. Internal StackOverflow Q&A; 3. A Slack bot that answers without you.
  46. Self-service docs: discover topics and message structure. @NSilnitsky
  47. Message structure defined in... protobuf. And exposed with... an internal API site. @NSilnitsky
  48. Or... message structure defined in an avro schema, exposed with a schema registry. @NSilnitsky
  49. What do you do when the traffic, meta-data, and number of developers and use cases grow? #1 Common Infra. #2 Retry topics - bigger cluster. #3 Self-service tooling & docs. #4 Async event-driven monitoring is less trivial.
  50. Metrics publishing & alerts: Producer → Kafka Broker → Consumer; both publish metrics to a Metrics Server, which alerts via Slack and email.
  52. What do you do when the traffic, meta-data, and number of developers and use cases grow? #1 Common Infra. #2 Retry topics - bigger cluster. #3 Self-service tooling & docs. #4 Non-trivial monitoring. #5 Proactive broker maintenance.
  53. Don't let brokers break: ● add brokers when needed ● split clusters when needed ● delete unused topics ● avoid hard failures. "As a rule of thumb, we recommend each broker to have up to 4,000 partitions and each cluster to have up to 200,000 partitions." (https://blogs.apache.org/kafka/entry/apache-kafka-supports-more-partitions) @NSilnitsky
  54. We're migrating to Confluent Cloud: ● high availability ● no need to worry about scaling clusters. @NSilnitsky
  55. What do you do when the traffic, meta-data, and number of developers and use cases grow? #1 Common Infra. #2 Retry topics - bigger cluster. #3 Self-service tooling & docs. #4 Non-trivial monitoring. #5 Proactive broker maintenance. #6 Avoid using the Kafka SDK directly in Node.js.
  56. Scala services use Greyhound's Producer and Consumer against the Kafka Broker; NodeJS services run their own Producer and Consumer on a single event loop. @NSilnitsky
  57. ...but running the Kafka SDK on NodeJS's single event loop doesn't work well. ✘ @NSilnitsky
  58. Instead (* memory): NodeJS services talk over gRPC to a sidecar (or REST proxy) that hosts the Greyhound Consumer against the Kafka Broker.
  59. What do you do when the traffic, meta-data, and number of developers and use cases grow? #1 Common Infra. #2 Retry topics - bigger cluster. #3 Self-service tooling & docs. #4 Non-trivial monitoring. #5 Proactive broker maintenance. #6 Avoid the NodeJS SDK. #7 Consume and project.
  60. The popular flow: Wix Stores, Wix Bookings, and Wix Restaurants all query MetaSite over RPC - site installed apps? site version? site owner? @NSilnitsky
  61. The problems: reads + writes on a large MetaSite object, and request overload (~1M requests per minute). @NSilnitsky
  63. Step 1. Produce to Kafka: on every update, MetaSite produces a "Site Updated!" event with the entire MetaSite context to the Kafka Broker.
  64. Step 2. Consume and project: a consumer filters "Site installed apps updated!" events, and a reverse-lookup writer persists just the installed-apps context.
  65. Step 3. Split read from write: RPC reads go to a reverse-lookup reader over the projection, not to MetaSite.
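The consume-and-project step can be sketched as a small projector. The event and field names below are illustrative stand-ins for the MetaSite events: each update replaces the site's installed-apps slice and maintains a reverse lookup from app to sites, so reads never touch MetaSite.

```scala
import scala.collection.mutable

// Illustrative event: the full update carries everything; the projector
// keeps only the slice this consumer cares about.
final case class SiteUpdated(siteId: String, installedApps: Set[String], owner: String)

class InstalledAppsProjector {
  private val appsBySite = mutable.Map.empty[String, Set[String]]
  private val sitesByApp = mutable.Map.empty[String, Set[String]]

  // "Consume and project": called for every SiteUpdated event from Kafka.
  def onEvent(e: SiteUpdated): Unit = {
    val previous = appsBySite.getOrElse(e.siteId, Set.empty)
    appsBySite(e.siteId) = e.installedApps
    // Remove the site from apps that were uninstalled, add it to new ones.
    (previous -- e.installedApps).foreach(a => sitesByApp(a) = sitesByApp.getOrElse(a, Set.empty) - e.siteId)
    e.installedApps.foreach(a => sitesByApp(a) = sitesByApp.getOrElse(a, Set.empty) + e.siteId)
  }

  // Read side: answered from the projection, never from MetaSite.
  def installedApps(siteId: String): Set[String] = appsBySite.getOrElse(siteId, Set.empty)
  def sitesWithApp(app: String): Set[String] = sitesByApp.getOrElse(app, Set.empty)
}
```

In production the projection would live in a database rather than in memory, but the read/write split is the same.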
  66. Kafka messaging is event driven. It is only relevant to service-to-service communication, not to browser-server interactions where a user is waiting, right? @NSilnitsky
  67. Wrong. @NSilnitsky
  68. What do you do when the traffic, meta-data, and number of developers and use cases grow? #1 Common Infra. #2 Retry topics - bigger cluster. #3 Self-service tooling & docs. #4 Non-trivial monitoring. #5 Proactive broker maintenance. #6 Avoid the NodeJS SDK. #7 Consume and project. #8 WebSockets are Kafka's best friend.
  69. Kafka + WebSockets = completely distributed and event driven. @NSilnitsky
  70. Use case: a long-running async business process. Browser ↔ Web Sockets Service; Contacts Importer (Producer) → Kafka Broker → Contacts Jobs (Consumer). @NSilnitsky
  71. The browser subscribes for notifications via the Web Sockets Service. @NSilnitsky
  72. The browser uploads CSVs to import; the Contacts Importer produces import jobs to the Kafka Broker. @NSilnitsky
  73. When the Contacts Jobs consumer finishes, the (* distributed) Web Sockets Service pushes "done!" to the browser.
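The websocket leg of this flow can be sketched with sockets modeled as callbacks keyed by job id (all names hypothetical): the browser's subscription registers a socket, and the Kafka consumer pushes each job event to the matching socket.

```scala
import scala.collection.mutable

// An event the Contacts Jobs consumer reads from Kafka.
final case class JobEvent(jobId: String, status: String)

class WebSocketNotifier {
  // Open sockets keyed by job id; the callback stands in for socket.send.
  private val sockets = mutable.Map.empty[String, String => Unit]

  // Called when the browser subscribes for notifications on a job.
  def subscribe(jobId: String, push: String => Unit): Unit = sockets(jobId) = push

  // Called by the Kafka consumer for each job event; pushes to the
  // subscribed socket, ignores jobs nobody is watching.
  def onJobEvent(e: JobEvent): Unit =
    sockets.get(e.jobId).foreach(push => push(s"job ${e.jobId}: ${e.status}"))
}
```

Because the Web Sockets Service is distributed, a real version also has to route each event to the instance actually holding the socket (e.g. via a per-instance topic or a shared session registry); that routing is omitted here.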
  74. So, what do you do when the traffic, meta-data, and number of developers and use cases grow?
  75. Wix created an entire ecosystem to support large-scale Kafka-related needs.
  76. Greyhound: a Java/Scala high-level SDK for Apache Kafka. 0.1 is out! github.com/wix/greyhound
  77. Thank you. natansil.com | twitter @NSilnitsky | linkedin/natansilnitsky | github.com/natansil
  78. Slides & more: slideshare.net/NatanSilnitsky | medium.com/@natansil | twitter.com/NSilnitsky | natansil.com
