10 Lessons Learned from using Kafka with 1000 microservices - java global summit

natansil.com twitter@NSilnitsky linkedin/natansilnitsky github.com/natansil
10 Lessons Learned from using Kafka
with 1000 microservices
Natan Silnitsky
Backend Infra Developer, Wix.com
JAVA
GLOBAL
SUMMIT

But first… a few Kafka terms

A few Kafka terms
[Diagram: a Kafka Producer sends messages to a Kafka Broker.]

[Diagram: the Kafka Broker hosts several Topics, each split into Partitions. The Kafka Producer writes to a Topic.]

[Diagram: each Partition is an append-only log. The Kafka Producer appends records at offsets 0, 1, 2, 3, 4, 5, ...]

[Diagram: Kafka Consumers read the records of a Partition in offset order (offsets 0 through 20 shown).]
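The terms above (partition, append-only log, offset) can be illustrated with a small in-memory model. This is only a sketch for intuition, not how a broker stores data; `PartitionLog` and its methods are hypothetical names invented for this example.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical in-memory model of one partition: an append-only list of records.
final class PartitionLog {
    private final List<String> records = new ArrayList<>();

    // Appends a record at the end and returns the offset it was assigned.
    long append(String value) {
        records.add(value);
        return records.size() - 1;
    }

    // Returns up to maxRecords records starting at fromOffset, in offset order.
    List<String> read(long fromOffset, int maxRecords) {
        int start = (int) Math.max(0, Math.min(fromOffset, records.size()));
        int end = Math.min(start + maxRecords, records.size());
        return new ArrayList<>(records.subList(start, end));
    }

    public static void main(String[] args) {
        PartitionLog log = new PartitionLog();
        for (int i = 0; i < 6; i++) {
            log.append("message-" + i);
        }
        long consumerOffset = 0; // a consumer tracks its own position in the log
        List<String> batch = log.read(consumerOffset, 4);
        consumerOffset += batch.size();
        System.out.println(batch + " next offset: " + consumerOffset);
    }
}
```

Records are never modified in place; a consumer only advances its own offset.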

Now, here's how we use it at Wix.

>180M registered users (website builders) from 190 countries
5% of all Internet websites run on Wix
3000+ people work at Wix
>500B HTTP Requests / Day
6PB of static content
At Wix

At Wix
                          2017    2020
Kafka messages per day    300M    1075M
Microservices             500     1,400
(Backend) developers      100     300*

What do you do when the traffic, meta-data, and number of developers and use cases grow?

#1 Common Infra with common features.

Greyhound wraps Kafka
[Diagram: Service A and Service B reach the Kafka Broker through Greyhound, which wraps the Kafka Producer and Kafka Consumer.]

Greyhound wraps Kafka
Abstract: so that it is easy to change for everyone
Simplify: APIs, with additional features

Greyhound wraps Kafka
Scala ZIO + Java API

Greyhound wraps Kafka
Simple Consumer API
- Boilerplate

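Kafka Consumer API

import java.time.Duration;
import org.apache.kafka.clients.consumer.Consumer;
import org.apache.kafka.clients.consumer.ConsumerRecords;

static void runConsumer() throws InterruptedException {
    final Consumer<Long, SomeMessage> consumer = createConsumer();
    while (true) {
        // poll(Duration) replaces the deprecated poll(long) overload
        final ConsumerRecords<Long, SomeMessage> consumerRecords =
            consumer.poll(Duration.ofMillis(1000));
        consumerRecords.forEach(record -> {
            System.out.printf("Record value:%d\n", record.value().messageValue);
        });
        // offsets have to be committed explicitly
        consumer.commitAsync();
    }
}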

Greyhound Consumer API
* No explicit commit

static void runConsumer() throws InterruptedException {
    String topic = "some-topic";
    MessageHandler<SomeMessage> handler =
        message -> System.out.printf("Record value:%d\n", message.messageValue);
    GreyhoundConsumer consumer = GreyhoundConsumer.aGreyhoundConsumerSpec(topic, handler);
}

Greyhound wraps Kafka
Simple Consumer API
+ Parallel Consumption!

Greyhound Consumer

static void runConsumer() {
    MessageHandler<PurgeSite> handler = ...
    GreyhoundConsumer consumer = GreyhoundConsumer.aGreyhoundConsumerSpec(topic, handler)
        .withGroup("some-group")
        .withMaxParallelism(40);
}

(Thread-safe) parallel consumption
[Diagram: a Greyhound Consumer wraps a Kafka Consumer that reads the Site Chat-bot-messages topic from the Kafka Broker and hands records to Scala ZIO fibers + queues.]
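One way to picture thread-safe parallel consumption is a set of workers, each with its own queue, where records from the same partition always go to the same worker. The slides say Greyhound does this with ZIO fibers and queues; the sketch below is a plain-Java analogue using single-threaded executors, and `ParallelDispatcher` is a hypothetical name, not part of Greyhound.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

// Hypothetical dispatcher: one single-threaded worker (with its own queue) per slot.
final class ParallelDispatcher<T> {
    private final List<ExecutorService> workers = new ArrayList<>();
    private final Consumer<T> handler;

    ParallelDispatcher(int maxParallelism, Consumer<T> handler) {
        this.handler = handler;
        for (int i = 0; i < maxParallelism; i++) {
            workers.add(Executors.newSingleThreadExecutor());
        }
    }

    // Records of the same partition always land on the same worker,
    // so they are handled one at a time and in submission order.
    void dispatch(int partition, T record) {
        ExecutorService worker = workers.get(Math.floorMod(partition, workers.size()));
        worker.execute(() -> handler.accept(record));
    }

    // Stops accepting work and waits for queued records to finish.
    // Returns false on timeout or interruption.
    boolean shutdown(long timeoutSeconds) {
        for (ExecutorService worker : workers) {
            worker.shutdown();
        }
        try {
            boolean finished = true;
            for (ExecutorService worker : workers) {
                finished &= worker.awaitTermination(timeoutSeconds, TimeUnit.SECONDS);
            }
            return finished;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }
}
```

Different partitions are processed concurrently, while ordering within a partition is preserved because each worker drains its queue sequentially.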

Greyhound wraps Kafka
Simple Consumer API
Thread-safe Parallel Consumption
...what about error handling?
+ Retries!

Greyhound Consumer

static void setupConsumer() {
    ConsumeRetryPolicy retryPolicy = ConsumeRetryPolicy.aRetryPolicy(
        Retries.fromBackoffs(Duration.ofSeconds(1), Duration.ofMinutes(10), ...));
    GreyhoundConsumer renewConsumer = GreyhoundConsumer.aGreyhoundConsumerSpec(
            topic,
            renewHandler)
        .withGroup("some-group")
        .withRetry(retryPolicy);
}

[Diagram: the Greyhound Consumer (wrapping a Kafka Consumer) fails to read a message from renew-sub-topic on the Kafka Broker.]

RETRY! (Inspired by Uber)
[Diagram: on failure, a retry producer inside the Greyhound Consumer writes the message to renew-sub-topic-retry-0 on the Kafka Broker, to be consumed again later.]
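The retry-topic flow can be sketched as a routing decision: after each failure, the message is produced to the next `-retry-N` topic, and the consumer of that topic waits for the matching backoff before handling it. The class below is a hypothetical illustration of that decision only (no Kafka calls); the `-retry-N` suffix follows the topic names shown on the slides, and everything else is an assumption.

```java
import java.time.Duration;
import java.util.List;
import java.util.Optional;

// Hypothetical helper that decides where a failed message goes next.
final class RetryRouter {
    private final String topic;
    private final List<Duration> backoffs;

    RetryRouter(String topic, List<Duration> backoffs) {
        this.topic = topic;
        this.backoffs = List.copyOf(backoffs);
    }

    // retriesDone is 0 when the failure happened on the original topic.
    // Returns the next retry topic, or empty when all retries are used up.
    Optional<String> nextTopic(int retriesDone) {
        if (retriesDone < 0 || retriesDone >= backoffs.size()) {
            return Optional.empty();
        }
        return Optional.of(topic + "-retry-" + retriesDone);
    }

    // How long to wait before handling a message read from retry topic number retriesDone.
    Duration delayFor(int retriesDone) {
        return backoffs.get(retriesDone);
    }
}
```

With backoffs of 1 second and 10 minutes, a message that fails on `renew-sub-topic` goes to `renew-sub-topic-retry-0`, then to `renew-sub-topic-retry-1`, and is given up on after that. Each configured backoff adds one more topic per consumer.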

#2 Retry Topics will cause your cluster to grow faster. 😐

RETRY! (Inspired by Uber)
[Diagram: renew-sub-topic and its renew-sub-topic-retry-N topics add partitions across the Kafka Brokers.]

Blocking policy handler
Retries same message on failure
* lag
[Diagram: build-log-service consumes source-control-update-topic from the Kafka Broker through a Greyhound Consumer wrapping a Kafka Consumer.]
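A blocking policy can be sketched as a loop that keeps retrying the same message in place, sleeping between attempts, instead of moving on to the next record. This keeps ordering at the cost of lag. `BlockingRetry` and `handleBlocking` are hypothetical names, not Greyhound's API.

```java
import java.time.Duration;
import java.util.List;
import java.util.function.Consumer;

final class BlockingRetry {
    // Retries the same message until the handler succeeds or the backoffs run out.
    // Returns the number of attempts made; rethrows the last failure when exhausted.
    static <T> int handleBlocking(T message, Consumer<T> handler, List<Duration> backoffs) {
        int attempt = 0;
        while (true) {
            try {
                handler.accept(message);
                return attempt + 1;
            } catch (RuntimeException failure) {
                if (attempt >= backoffs.size()) {
                    throw failure;
                }
                sleep(backoffs.get(attempt)); // nothing else is consumed meanwhile
                attempt++;
            }
        }
    }

    private static void sleep(Duration duration) {
        try {
            Thread.sleep(duration.toMillis());
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while backing off", e);
        }
    }
}
```

While the loop sleeps, later messages on the same partition wait, which is where the lag noted on the slide comes from.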

Greyhound wraps Kafka
Simple Consumer API
Thread-safe Parallel Consumption
Scheduled retries & blocking handler
+ Context Propagation (super cool for us)

Context propagation
User request metadata: Language, User Type, Geo, ...
[Diagram: the Browser sends a Sign up request to the Site-Members Service.]

Context propagation
[Diagram: the Site-Members Service Producer writes a record to the Kafka Broker. Record fields: Topic/Partition/Offset, Headers, Key, Value, timestamp.]

Context propagation
[Diagram: the Contacts Service Consumer reads the record from the Kafka Broker.]
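Since a Kafka record has a Headers field, request metadata can travel with the message and be restored on the consumer side. The sketch below models headers as a plain map; the `ctx-` prefix and the `ContextHeaders` class are assumptions made for this example, not Greyhound's actual header format.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

final class ContextHeaders {
    // Hypothetical header prefix; not taken from the talk.
    static final String PREFIX = "ctx-";

    // Producer side: copy the user request metadata into message headers.
    static Map<String, byte[]> toHeaders(Map<String, String> requestContext) {
        Map<String, byte[]> headers = new LinkedHashMap<>();
        requestContext.forEach((key, value) ->
            headers.put(PREFIX + key, value.getBytes(StandardCharsets.UTF_8)));
        return headers;
    }

    // Consumer side: rebuild the request metadata, ignoring unrelated headers.
    static Map<String, String> fromHeaders(Map<String, byte[]> headers) {
        Map<String, String> context = new LinkedHashMap<>();
        headers.forEach((key, value) -> {
            if (key.startsWith(PREFIX)) {
                context.put(key.substring(PREFIX.length()),
                    new String(value, StandardCharsets.UTF_8));
            }
        });
        return context;
    }
}
```

The handler in the consuming service can then see the same language, user type, and geo values that the original browser request carried.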

#3 Self-service tooling and documentation.

Our
team
Self-service
Tools & Docs

Self-service Tools
1. Kafka CLI scripts (+ flags)
2. Yahoo Kafka Manager (OSS)
3. Confluent Control Center
4. A control plane of our own


Self-service Docs
How do I investigate this lag?
How do I add retries on errors?
1. GitHub README for Greyhound code
2. Internal Stack Overflow Q&A
3. Slack bot that answers without you

Self-service Docs
Discover topics and message structure.
Message structure defined in... protobuf
And exposed with... internal API site
Or… Avro + schema registry

#4 Testing needs to be easy.

Greyhound Testkit
Testkit included!
[Diagram: a Scala test runs against a test environment that contains a Kafka Broker and ZooKeeper.]

#5 Async event driven monitoring is less trivial.

[Diagram: the Producer and Consumer publish metrics about their Kafka Broker traffic to a Metrics Server, which sends alerts to Slack and email.]

#6 Proactive broker maintenance.

Don't let brokers break...
● Add brokers when needed
● Split clusters when needed
● Delete unused topics
● Avoid hard failures
"As a rule of thumb, we recommend each broker to have up to 4,000 partitions and each cluster to have up to 200,000 partitions."
https://blogs.apache.org/kafka/entry/apache-kafka-supports-more-partitions
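The rule of thumb quoted above lends itself to a quick capacity check. The sketch below only does the arithmetic on the two quoted limits; `PartitionBudget` is a hypothetical helper, and it assumes partitions are spread evenly across brokers and that the caller has already decided how to count replicas.

```java
final class PartitionBudget {
    static final int MAX_PER_BROKER = 4_000;    // from the quoted rule of thumb
    static final int MAX_PER_CLUSTER = 200_000; // from the quoted rule of thumb

    // Smallest broker count that keeps every broker at or under the per-broker limit,
    // assuming an even spread of partitions.
    static int minBrokers(int totalPartitions) {
        return (totalPartitions + MAX_PER_BROKER - 1) / MAX_PER_BROKER;
    }

    // Whether the partition count still fits in a single cluster.
    static boolean fitsInOneCluster(int totalPartitions) {
        return totalPartitions <= MAX_PER_CLUSTER;
    }
}
```

For example, 10,000 partitions need at least 3 brokers under this rule, and anything above 200,000 partitions is a signal to split the cluster.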

#7 Avoid active-active ordered produce (write)

[Diagram: the Browser sends "Update my site" requests to a service running in DC1 and DC2. The Producer in each data center writes to its local Kafka Broker (message 1 in DC1, message 2 in DC2). After REPLICATION, the two brokers hold the messages out of order.]

Best Practice
[Diagram: the Producers in both data centers write messages 1 and 2 to the Kafka Broker of a single data center. REPLICATION copies them in the same order to the other data center, where a Consumer reads them.]
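A toy model shows why active-active writes lose ordering. It assumes, as a simplification, that each data center's log holds its local writes first and the replicated writes after them; `ActiveActiveDemo` is a hypothetical name and no real replication tool is modeled.

```java
import java.util.ArrayList;
import java.util.List;

final class ActiveActiveDemo {
    // Simplified model: local writes land first, replicated writes arrive afterwards.
    static List<String> logAfterReplication(List<String> localWrites, List<String> replicatedWrites) {
        List<String> log = new ArrayList<>(localWrites);
        log.addAll(replicatedWrites);
        return log;
    }

    public static void main(String[] args) {
        // Active-active: update 1 is written in DC1, update 2 in DC2.
        List<String> dc1 = logAfterReplication(List.of("1"), List.of("2"));
        List<String> dc2 = logAfterReplication(List.of("2"), List.of("1"));
        System.out.println("active-active DC1=" + dc1 + " DC2=" + dc2);

        // Single write location: both updates are written in DC1, DC2 only replicates.
        List<String> primary = logAfterReplication(List.of("1", "2"), List.of());
        List<String> replica = logAfterReplication(List.of(), primary);
        System.out.println("single writer DC1=" + primary + " DC2=" + replica);
    }
}
```

In the first case consumers in the two data centers see the updates in different orders; in the second they see the same order.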

#8 Avoid using the Kafka SDK directly in Node.js

[Diagram: Scala services use the Greyhound Producer and Consumer against the Kafka Broker. Node.js services, with their single event loop, running a Producer and Consumer directly: ✘]

[Diagram: Node.js services (single event loop) talk over gRPC to a Greyhound sidecar (or REST proxy), which runs the Consumer against the Kafka Broker.]
* memory

#9 Consume and project

The Popular Flow
[Diagram: Wix Stores, Wix Bookings and Wix Restaurants each call MetaSite over RPC to ask: Site installed apps? Site version? Site owner?]

[Diagram: the same flow. MetaSite serves both reads and writes of a large MetaSite object and suffers request overload (~1M RPM).]

1. Produce to Kafka: the MetaSite Producer publishes "Site Updated!" events carrying the entire MetaSite context to the Kafka Broker.
2. Consume and Project: a Consumer (the reverse lookup writer) filters for "Site installed Apps Updated!" and keeps only the Installed Apps context.
3. Split Read from Write: a separate reverse lookup reader serves the RPC calls.
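The consume-and-project idea can be sketched as a consumer that keeps only the slice of each event it cares about, plus a read path that answers from that local copy. All names here (`InstalledAppsProjection`, `SiteUpdated`, the field names) are hypothetical, and an in-memory map stands in for whatever store a real service would use.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

final class InstalledAppsProjection {
    // Hypothetical event carrying the full site context published by MetaSite.
    static final class SiteUpdated {
        final String siteId;
        final String owner;
        final int version;
        final List<String> installedApps;

        SiteUpdated(String siteId, String owner, int version, List<String> installedApps) {
            this.siteId = siteId;
            this.owner = owner;
            this.version = version;
            this.installedApps = installedApps;
        }
    }

    private final Map<String, List<String>> appsBySite = new ConcurrentHashMap<>();

    // Write side (consumer): project the large event down to the installed-apps slice.
    void onSiteUpdated(SiteUpdated event) {
        appsBySite.put(event.siteId, List.copyOf(event.installedApps));
    }

    // Read side: answer from the projection instead of calling MetaSite over RPC.
    List<String> installedApps(String siteId) {
        return appsBySite.getOrDefault(siteId, List.of());
    }
}
```

Callers that only need installed apps no longer add load to MetaSite, at the price of reading data that is eventually consistent with it.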

Kafka messaging is event driven.
It is only relevant to service-service communications,
not for browser-server interactions, where a user is waiting,
right?

Wrong.

#10 WebSockets are Kafka's best friend

Kafka + WebSockets = completely distributed and event driven

Use Case: Long-running async business process
[Diagram, in four steps:
1. The Browser subscribes for notifications with the WebSockets Service.
2. The Browser sends an Import CSVs request to Contacts Jobs, and a Producer writes the job to the Kafka Broker.
3. The Contacts Importer consumes the job and processes it.
4. A "done!" message goes through the Kafka Broker to the WebSockets Service Consumer, which notifies the Browser.]
* distributed

So, what do you do when the traffic, meta-data, and number of developers and use cases grow?

Wix created an entire ecosystem to support large-scale Kafka-related needs.

A Java/Scala high-level SDK for Apache Kafka.
0.1 is out!
github.com/wix/greyhound

Thank You

Slides & More
slideshare.net/NatanSilnitsky
medium.com/@natansil
twitter.com/NSilnitsky
natansil.com

Q&A
