This document discusses using Apache Kafka on Azure to process trillions of events per day for various Microsoft services. The deployment processed over 2.6 trillion ingress events per day from over 100,000 producers across more than 20 regions. The document outlines several use cases Kafka serves at Microsoft and how it is tuned for different needs such as throughput, latency, security, and real-time applications. It describes experiments conducted to optimize configurations such as batch size, number of disks per broker, and number of threads to maximize throughput and minimize latency for different workloads.
4. CUSTOMERS’ USE CASES
(Chart: use cases plotted by throughput and latency requirements)
o Security and intrusion detection: < 250 ms latency, 1.5 GBps
o Real-time online spelling and grammar checks: < 10 ms latency, 250 MBps
o Telemetry data - availability monitoring: < 100 ms latency, 2 GBps
5. LOW LATENCY
Real-time online spelling and grammar checks
Configs
o Higher Partition Density
o Smaller Producer Batch Size
o Additional Disks per Broker
HIGH THROUGHPUT
Security and intrusion detection applications
Configs
o Fewer Replications
o Fewer Required Acks
o Larger Producer Batch Size
HIGH THROUGHPUT LOW RELIABILITY
Telemetry data for availability monitoring
Configs
o Higher Partition Density
o Larger Producer Batch Size
o Additional Disks per Broker
7. CONFIGURATIONS
Hardware Configs
o CPU
o RAM
o Disks per broker
Kafka Producer Configs
o Batch size
o Linger
o Compression
o Producer required acks
o Buffer memory
Kafka Broker Configs
o Num.io.threads
o Num.network.threads
o Num.replica.fetchers
o Replica.fetch.min.bytes
o Number of topics/partitions
8. ADDITIONAL DISKS PER BROKER
Maximum of 16 standard HDD disks per broker
o CPU: 10 brokers with 8 cores
o RAM: 28 GB per broker
o Disks: Azure standard S30 HDD
1 TB - Up to 60 MB/second
Introduce ourselves...
We will share our experience and learnings from running one of the world's largest Kafka deployments. Besides underlying infrastructure considerations, we discuss several tunable Kafka broker and client configurations that affect performance.
Add siphon logo
Scale and numbers
Performance has several potentially orthogonal dimensions. In a real-time data pipeline, we are interested in throughput, latency, and in some critical cases reliability as well.
From our experience, performance requirements fall into three categories. Telemetry data ingestion for near real-time processes like security and intrusion detection applications requires high throughput but is tolerant of high latency. At the other end, real-time online spelling and grammar checks become obsolete with high latency and hence have stringent latency requirements. There are applications, such as availability monitoring apps, that require both high throughput and low latency but can tolerate data loss.
Add the numbers (like in blog)
Spend time on each use case
This is a summary of the configurations that have proved significant to achieve the requirements in the three quadrants.
A larger batch size is required for high throughput.
Swap headlines
Give a brief explanation
To stress-test our system in general and the Kafka clusters specifically, we developed an application which constantly generates message batches of random bytes and sends them to a cluster's front-end.
The tool spins up 100 threads that send 1,000 messages of 1 KB of random data to each topic at 5 ms intervals.
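A minimal sketch of a load generator with this shape is below; the topic name and bootstrap endpoint are placeholders, and the real tool posts its batches to the Event Server front-end rather than producing to Kafka directly.

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.ByteArraySerializer;

import java.util.Properties;
import java.util.Random;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class LoadGenerator {
    private static final int THREADS = 100;            // parallel sender threads
    private static final int MESSAGES_PER_BATCH = 1_000;
    private static final int MESSAGE_SIZE_BYTES = 1_024;
    private static final long INTERVAL_MS = 5;          // pause between batches

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");   // placeholder endpoint
        props.put("key.serializer", ByteArraySerializer.class.getName());
        props.put("value.serializer", ByteArraySerializer.class.getName());

        ExecutorService pool = Executors.newFixedThreadPool(THREADS);
        for (int t = 0; t < THREADS; t++) {
            pool.submit(() -> {
                Random random = new Random();
                try (KafkaProducer<byte[], byte[]> producer = new KafkaProducer<>(props)) {
                    while (!Thread.currentThread().isInterrupted()) {
                        for (int i = 0; i < MESSAGES_PER_BATCH; i++) {
                            byte[] payload = new byte[MESSAGE_SIZE_BYTES];
                            random.nextBytes(payload);    // 1 KB of random data
                            producer.send(new ProducerRecord<>("stress-topic", payload));
                        }
                        Thread.sleep(INTERVAL_MS);        // 5 ms between batches
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }
    }
}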
Event Server is used as a front-end web server which implements the Kafka producer and consumer APIs. It acts as a Kafka proxy.
We provision multiple Event Servers in a cluster to balance the load and manage produce requests sent from thousands of client machines to Kafka brokers.
Each Event Server application runs in a docker container on scale-sets of Azure Standard F8s Linux VMs, and is allocated 7 CPUs and 12 GB of memory with a maximum Java heap size set to 9 GB. To handle the large amount of traffic generated by our stress tool, we run 20 instances of these Event Servers.
The Event Server instantiates multiple parallel Kafka producer threads; each thread instantiates one producer. The number of sliding queues is controlled by the thread pool size.
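A rough sketch of that pattern, with class and variable names of our own rather than the actual Event Server code:

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

import java.util.ArrayList;
import java.util.List;
import java.util.Properties;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.atomic.AtomicLong;

public class ProducerPool {
    private final List<BlockingQueue<ProducerRecord<byte[], byte[]>>> queues = new ArrayList<>();
    private final AtomicLong nextQueue = new AtomicLong();

    public ProducerPool(Properties producerProps, int poolSize) {
        for (int i = 0; i < poolSize; i++) {
            BlockingQueue<ProducerRecord<byte[], byte[]>> queue = new LinkedBlockingQueue<>();
            queues.add(queue);
            Thread sender = new Thread(() -> {
                // Each sender thread owns exactly one KafkaProducer instance.
                try (KafkaProducer<byte[], byte[]> producer = new KafkaProducer<>(producerProps)) {
                    while (!Thread.currentThread().isInterrupted()) {
                        producer.send(queue.take());
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            sender.setDaemon(true);
            sender.start();
        }
    }

    // Incoming produce requests are spread round-robin across the per-thread queues.
    public void enqueue(ProducerRecord<byte[], byte[]> record) {
        int idx = (int) (nextQueue.getAndIncrement() % queues.size());
        queues.get(idx).add(record);
    }
}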
HDInsight Kafka clusters run Kafka 1.1 with 20 topics and 3 replicas. The number of partitions varies throughout the tests.
For our experiments, we ran Null sink connectors which consume messages from Kafka, discard them and then commit the offsets. This allowed us to measure both producer and consumer throughput, while eliminating any potential bottlenecks introduced by sending data to specific destinations.
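The actual setup uses Kafka Connect null sink connectors; an equivalent standalone consumer loop (topic, group id, and endpoint are illustrative) would look roughly like this:

import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.ByteArrayDeserializer;

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

public class NullSink {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker1:9092");      // placeholder endpoint
        props.put("group.id", "null-sink");
        props.put("enable.auto.commit", "false");             // we commit offsets explicitly
        props.put("key.deserializer", ByteArrayDeserializer.class.getName());
        props.put("value.deserializer", ByteArrayDeserializer.class.getName());

        long consumed = 0;
        try (KafkaConsumer<byte[], byte[]> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("stress-topic"));
            while (true) {
                ConsumerRecords<byte[], byte[]> records = consumer.poll(Duration.ofMillis(500));
                consumed += records.count();                   // discard payloads, count only
                consumer.commitSync();                         // commit the offsets just read
            }
        }
    }
}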
Kafka brokers run on Azure Standard D4 V2 Linux VMs
We never ran into high CPU utilization with this setup. On the other hand, the number of disks had a direct effect on throughput.
We monitored in-sync replicas and consumer lag as well
Batch size: controls batching on the producer client.
Linger: puts a ceiling on how long the producer waits before sending a batch. In low-load scenarios, this improves throughput by sacrificing latency.
Acks: the number of acknowledgments the producer requires the leader to have received before considering a request complete.
Buffer memory: controls the amount of memory available to the producer for buffering. To support larger batching, we increased this setting to 1 GB.
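As a concrete illustration, these knobs map to producer client properties like the following; the values are illustrative except buffer.memory, which we raised to 1 GB as noted above.

import org.apache.kafka.common.serialization.ByteArraySerializer;

import java.util.Properties;

public class ProducerTuning {
    // Illustrative high-throughput settings; actual values varied per workload.
    static Properties producerProps() {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker1:9092");             // placeholder
        props.put("key.serializer", ByteArraySerializer.class.getName());
        props.put("value.serializer", ByteArraySerializer.class.getName());

        props.put("batch.size", String.valueOf(512 * 1024));        // larger batches -> higher throughput
        props.put("linger.ms", "10");                               // ceiling on how long to wait for a batch
        props.put("acks", "1");                                     // 0, 1, or all/-1: reliability vs. speed
        props.put("compression.type", "lz4");                       // compression works better on big batches
        props.put("buffer.memory", String.valueOf(1024L * 1024 * 1024)); // 1 GB of producer buffering
        return props;
    }
}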
How efficiently are we using the available resources? The CPU was not fully utilized.
We monitored disk usage and utilization
Storage disks have limited IOPS (Input/Output Operations Per Second) and read/write bytes per second. When creating new partitions, Kafka stores each new partition on the disk with the fewest existing partitions to balance them across the available disks. Despite this, when processing hundreds of replicas on each disk, Kafka can easily saturate the available disk throughput.
The results show that throughput increases with the number of attached disks.
Never tested with SSD
Monitoring the current performance to identify bottlenecks:
Min.insync.replicas: this configuration specifies the minimum number of replicas that must acknowledge a write for the write to be considered successful; otherwise the producer raises an exception.
Together with the acks config set to all, this guarantees that data is written to at least this many replicas.
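A hedged sketch of wiring this up when creating a topic; the topic name, partition count, and the value 2 are ours for illustration.

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;

import java.util.Collections;
import java.util.Map;
import java.util.Properties;

public class ReliableTopicSetup {
    public static void main(String[] args) throws Exception {
        Properties adminProps = new Properties();
        adminProps.put("bootstrap.servers", "broker1:9092");        // placeholder

        try (AdminClient admin = AdminClient.create(adminProps)) {
            // 3 replicas; at least 2 must acknowledge each write for it to succeed.
            NewTopic topic = new NewTopic("telemetry", 100, (short) 3)
                    .configs(Map.of("min.insync.replicas", "2"));
            admin.createTopics(Collections.singletonList(topic)).all().get();
        }
        // Pair this with acks=all on the producer: a write then succeeds only after
        // min.insync.replicas brokers have it; otherwise the producer gets
        // NotEnoughReplicasException.
    }
}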
We monitor both the request handler idle ratio and the time spent waiting in the request queue: if the idle ratio is low or the queue wait time is high, there are not enough IO threads or CPU.
If time spent in the response queue is high, you need more network threads.
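These indicators are exposed as broker JMX metrics; a small probe (assuming the brokers expose JMX on port 9999) can read them:

import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class BrokerBottleneckCheck {
    public static void main(String[] args) throws Exception {
        // Assumes JMX_PORT=9999 on the broker; adjust host/port for your cluster.
        JMXServiceURL url = new JMXServiceURL("service:jmx:rmi:///jndi/rmi://broker1:9999/jmxrmi");
        JMXConnector connector = JMXConnectorFactory.connect(url);
        try {
            MBeanServerConnection mbs = connector.getMBeanServerConnection();

            // Low idle ratio => request handlers (num.io.threads) are saturated.
            Object handlerIdle = mbs.getAttribute(new ObjectName(
                    "kafka.server:type=KafkaRequestHandlerPool,name=RequestHandlerAvgIdlePercent"),
                    "OneMinuteRate");

            // Growing queues => requests waiting on IO threads / responses waiting on network threads.
            Object requestQueue = mbs.getAttribute(new ObjectName(
                    "kafka.network:type=RequestChannel,name=RequestQueueSize"), "Value");
            Object responseQueue = mbs.getAttribute(new ObjectName(
                    "kafka.network:type=RequestChannel,name=ResponseQueueSize"), "Value");

            System.out.printf("handlerIdle=%s requestQueue=%s responseQueue=%s%n",
                    handlerIdle, requestQueue, responseQueue);
        } finally {
            connector.close();
        }
    }
}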
Replica.fetch.min.bytes: the minimum bytes expected for each fetch response. If not enough bytes are available, the follower waits up to replica.fetch.wait.max.ms.
This setting helps under high load with larger batches.
Partition density: the number of partitions per broker, not including replicas.
Increasing the partition density adds overhead related to metadata operations and per-partition request/response traffic between the partition leader and its followers.
Even in the absence of data flowing through, partition replicas still issue fetch requests to their leaders, which results in extra processing for send and receive requests over the network.
More network and IO threads help
100 partitions per topic, i.e., a total of 200 partitions per broker (we have 20 topics and 10 brokers)
CPU usage also increases at a higher rate
Each Kafka producer batches records for a single partition, optimizing network and IO requests issued to a partition leader.
Therefore, increasing batch size could result in higher throughput. Under light load, this may increase Kafka send latency since the producer waits for a batch to be ready.
The Linger.ms setting also controls batching. It puts a ceiling on how long producers wait before sending a batch, even if the batch is not full. In low-load scenarios, this improves throughput by sacrificing latency.
Using a larger batch.size makes compression more efficient.
The buffer.memory setting controls the total memory available to a producer for buffering. If records are produced faster than they can be transmitted to Kafka, this buffer gets exhausted; additional send calls then block for up to max.block.ms, after which the producer throws a TimeoutException.
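In client code this shows up as send() blocking; a minimal way to handle it (class and method names are ours):

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.errors.TimeoutException;

public class BackpressureAwareSender {
    // send() blocks while buffer.memory is exhausted; after max.block.ms it throws TimeoutException.
    public static void send(KafkaProducer<byte[], byte[]> producer,
                            ProducerRecord<byte[], byte[]> record) {
        try {
            producer.send(record);        // may block up to max.block.ms waiting for buffer space
        } catch (TimeoutException e) {
            // Buffer stayed full for max.block.ms: shed load, retry later, or surface backpressure.
        }
    }
}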
Producer required acks configuration determines the number of acknowledgments required by the partition leader before a write request is considered completed.
While the trend is obvious, it is interesting to quantify the effect of the required acks setting. We can see that going from no reliability to maximum reliability (acks = 0 to acks = -1) cuts the throughput in half and almost doubles the latency.
Compression is beneficial and should be considered if there is a limitation on disk capacity.