Kafka is the bedrock of Wix's distributed microservices system. Over the last 5 years we have learned a lot about how to successfully scale our event-driven architecture to roughly 1,400 microservices.
We have achieved greater decoupling and independence for our services and dev teams, which have very different use cases, while maintaining a single uniform infrastructure.
In these slides you will learn about 10 key decisions and steps you can take to safely scale up your Kafka-based system. These include:
* How to increase dev velocity for event-driven code.
* How to optimize working with Kafka in a polyglot setting.
* How to support a growing amount of traffic and a growing number of developers.
* How to tackle a multi-DC environment.
8. At Wix
● >180M registered users (website builders) from 190 countries
● 5% of all Internet websites run on Wix
● 3,000+ people work at Wix
● >500B HTTP requests / day
● 6PB of static content
@NSilnitsky
9. At Wix (2017 vs. 2020)
● Kafka messages per day: 300M (2017), 1,075M (2020)
● Microservices: 500 (2017), 1,400 (2020)
● (Backend) developers: 100 (2017), 300* (2020)
10. What do you do when the traffic, metadata, and number of developers and use cases grow?
12. #1 Common infra with common features.
29. #2 Retry topics will cause your cluster to grow faster. 😐
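To see why retry topics inflate a cluster, a rough back-of-the-envelope sketch may help. It assumes (the slides do not state this) that every consumer group with a retry policy gets one extra retry topic per retry attempt, each with as many partitions as the original topic. The function name and the numbers are hypothetical.

```python
def partitions_with_retries(topic_partitions, consumer_groups, retry_attempts):
    """Estimate total partitions for one topic once retry topics are added.

    Assumption: each consumer group gets `retry_attempts` dedicated retry
    topics, each with the same partition count as the original topic.
    """
    retry_topics = consumer_groups * retry_attempts
    return topic_partitions * (1 + retry_topics)


# Hypothetical example: a 10-partition topic read by 4 consumer groups.
without_retries = partitions_with_retries(10, 4, 0)
with_retries = partitions_with_retries(10, 4, 3)
print(without_retries, with_retries)
```

Under these assumptions the partition count is multiplied by the number of consumer groups times the number of retry attempts, which is why retry policies are worth budgeting for when sizing the cluster.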
38. #3 Self-service tooling and documentation.
49. Self-service docs
Typical questions: "How do I investigate this lag?" "How do I add retries on errors?"
● GitHub README for the Greyhound code
● Internal StackOverflow Q&A
● Slack bot that answers without you
55. #4 Testing needs to be easy.
57. #5 Async event-driven monitoring is less trivial.
60. #6 Proactive broker maintenance.
61. Don't let brokers break...
● Add brokers when needed
● Split clusters when needed
● Delete unused topics
● Avoid hard failures
"As a rule of thumb, we recommend each broker to have up to 4,000 partitions and each cluster to have up to 200,000 partitions."
https://blogs.apache.org/kafka/entry/apache-kafka-supports-more-partitions
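The rule of thumb above can be turned into a simple check. The following is a minimal sketch, not a tool from the slides: the function name and its inputs are hypothetical, and the partition counts are assumed to come from your own cluster tooling.

```python
MAX_PARTITIONS_PER_BROKER = 4_000     # rule of thumb quoted above
MAX_PARTITIONS_PER_CLUSTER = 200_000  # rule of thumb quoted above


def check_partition_budget(partitions_per_broker, cluster_partitions):
    """Return human-readable warnings for brokers or clusters over budget.

    `partitions_per_broker` maps a broker id to the number of partitions it
    hosts; `cluster_partitions` is the cluster-wide total. Both inputs are
    hypothetical and must be gathered by the caller.
    """
    warnings = []
    for broker, count in sorted(partitions_per_broker.items()):
        if count > MAX_PARTITIONS_PER_BROKER:
            warnings.append(f"broker {broker}: {count} partitions, consider adding brokers")
    if cluster_partitions > MAX_PARTITIONS_PER_CLUSTER:
        warnings.append(f"cluster: {cluster_partitions} partitions, consider splitting the cluster")
    return warnings


# Hypothetical numbers: broker 2 is over the per-broker limit.
for line in check_partition_budget({1: 3500, 2: 4200}, 7700):
    print(line)
```

Running a check like this on a schedule is one way to add brokers or split clusters before a limit is hit rather than after a failure.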
62. #7 Avoid active-active ordered produce (write).
70. #8 Avoid using the Kafka SDK directly in Node.js.
74. #9 Consume and project.
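The slide text does not spell this pattern out. One common reading is that a service consumes another service's events and maintains its own read-optimized copy (a projection), so it does not need to call the owning service at request time. The sketch below illustrates that reading with an in-memory dict and a hypothetical event shape; it uses no real Kafka client.

```python
def project(view, event):
    """Apply one consumed event to the local view (a dict keyed by entity id)."""
    key = event["id"]
    if event["type"] == "deleted":
        view.pop(key, None)
    else:
        # "created" or "updated": keep only the latest known state
        view[key] = event["data"]
    return view


# Hypothetical events, in the order they would be consumed from a topic.
events = [
    {"type": "created", "id": "site-1", "data": {"name": "Shop"}},
    {"type": "updated", "id": "site-1", "data": {"name": "Shop v2"}},
    {"type": "created", "id": "site-2", "data": {"name": "Blog"}},
    {"type": "deleted", "id": "site-2", "data": None},
]

view = {}
for event in events:
    project(view, event)
print(view)
```

In a real service the view would live in a database and the loop would be a consumer handler, but the shape of the logic stays the same: one small, idempotent update per event.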
82. Kafka messaging is event driven. It is only relevant to service-to-service communication, not to browser-server interactions where a user is waiting, right? Wrong.
83. #10 WebSockets are Kafka's best friend.
#9 Consume and project
#8 Avoid Node.js SDK
#7 Avoid active-active write
#6 Proactive broker maintenance
#5 Non-trivial monitoring
#4 Easy testing
#3 Self-service tooling & docs
#2 Retry topics - bigger cluster
#1 Common infra
86. Use case: long-running async business process
[Diagram components: Browser, WebSockets Service, Contacts Importer, Contacts Jobs, Kafka Broker, Producer, Consumer. Arrow label: "Subscribe for notifications".]
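The use case above can be illustrated with a small in-memory sketch. It is not the Wix implementation: the `NotificationHub` class, the message shape, and the job ids are hypothetical, and plain callbacks stand in for both the Kafka consumer and the browser WebSocket connections.

```python
from collections import defaultdict


class NotificationHub:
    """Stand-in for a WebSockets service that routes job events to subscribers."""

    def __init__(self):
        self._subscribers = defaultdict(list)  # job id -> list of send callbacks

    def subscribe(self, job_id, send):
        """Register a browser; `send` stands in for writing to its WebSocket."""
        self._subscribers[job_id].append(send)

    def on_kafka_message(self, message):
        """Handle one consumed job-status message and push it to subscribers."""
        for send in self._subscribers.get(message["job_id"], []):
            send(message)


hub = NotificationHub()
received = []  # what one browser would see on its socket
hub.subscribe("import-42", received.append)

# Messages as they might arrive from a job-status topic (hypothetical shape).
for msg in [
    {"job_id": "import-42", "status": "in_progress", "percent": 50},
    {"job_id": "import-7", "status": "completed", "percent": 100},
    {"job_id": "import-42", "status": "completed", "percent": 100},
]:
    hub.on_kafka_message(msg)

print(received)
```

The point of the design is that the browser never polls: the long-running import reports progress through Kafka, and the subscribed browser only receives the events for the job it asked about.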