SITE RELIABILITY ENGINEERING©2016 LinkedIn Corporation. All Rights Reserved.
Kafka at Peak Performance
Todd Palino
Who Am I?
Kafka At LinkedIn
 1100+ Kafka brokers
 Over 32,000 topics
 350,000+ Partitions
 875 Billion messages per day
 185 Terabytes In
 675 Terabytes Out
 Peak Load (whole site)
– 10.5 Million messages/sec
– 18.5 Gigabits/sec Inbound
– 70.5 Gigabits/sec Outbound
 1800+ Kafka brokers
 Over 79,000 topics
 1,130,000+ Partitions
 1.3 Trillion messages per day
 330 Terabytes In
 1.2 Petabytes Out
 Peak Load (single cluster)
– 2 Million messages/sec
– 4.7 Gigabits/sec Inbound
– 15 Gigabits/sec Outbound
What Will We Talk About?
 Picking Your Hardware
 Monitoring the Cluster
 Triaging Broker Performance Problems
 Conclusion
Hardware Selection
What’s Important To You?
 Message Retention - Disk size
 Message Throughput - Network capacity
 Producer Performance - Disk I/O
 Consumer Performance - Memory
Go Wide
 Kafka is well-suited to horizontal scaling
 RAIS - Redundant Array of Inexpensive Servers
 Also helps with CPU utilization
– Kafka needs to decompress and recompress every message batch
– KIP-31 will help with this by eliminating recompression
 Don’t co-locate Kafka
Disk Layout
 RAID
– Can survive a single disk failure (not RAID 0)
– Provides the broker with a single log directory
– Eats up disk I/O
 JBOD
– Gives Kafka all the disk I/O available
– Broker is not smart about balancing partitions
– If one disk fails, the entire broker stops
 Amazon EBS performance works!
Operating System Tuning
 Filesystem Options
– EXT or XFS
– Using unsafe mount options
 Virtual Memory
– Swappiness
– Dirty Pages
 Networking
Java
 Only use JDK 8 now
 Keep heap size small
– Even our largest brokers use a 6 GB heap
– Save the rest for page cache
 Garbage Collection - G1 all the way
– Basic tuning only
– Watch for humongous allocations
How Much Do You Need?
Buy The Book!
Early Access available now.
Covers all aspects of Kafka, from setup to client development to ongoing administration and troubleshooting.
Also discusses stream processing and other use cases.
Kafka Cluster Sizing
 How big for your local cluster?
– How much disk space do you have?
– How much network bandwidth do you have?
– CPU, memory, disk I/O
 How big for your aggregate cluster?
– In general, multiply the number of brokers by the number of local clusters
– May have additional concerns with lots of consumers
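As a rough sketch of the sizing questions above, the broker count for a local cluster can be estimated from disk and from network separately, taking the larger of the two. Every input here (traffic, retention, replication factor, consumer fan-out, and the 60% headroom factor) is an illustrative assumption rather than a LinkedIn figure, and CPU, memory, and disk I/O still need to be checked on their own.

```python
import math


def brokers_needed(bytes_in_per_day, retention_days, replication_factor,
                   usable_disk_per_broker, peak_bytes_in_per_sec,
                   consumer_fanout, nic_bytes_per_sec, headroom=0.6):
    """Estimate broker counts from disk and network (hypothetical model).

    Returns (brokers_for_disk, brokers_for_network, brokers_needed).
    """
    # Disk: every retained byte is stored once per replica.
    disk_total = bytes_in_per_day * retention_days * replication_factor
    by_disk = math.ceil(disk_total / (usable_disk_per_broker * headroom))

    # Network out: each inbound byte is read by every consumer group and
    # by each follower replica (replication_factor - 1 of them).
    out_per_sec = peak_bytes_in_per_sec * (consumer_fanout + replication_factor - 1)
    by_network = math.ceil(out_per_sec / (nic_bytes_per_sec * headroom))

    return by_disk, by_network, max(by_disk, by_network)


if __name__ == "__main__":
    TB = 10 ** 12
    print(brokers_needed(
        bytes_in_per_day=10 * TB,           # assumed daily inbound volume
        retention_days=4,                   # assumed retention
        replication_factor=2,
        usable_disk_per_broker=8 * TB,      # assumed log disk per broker
        peak_bytes_in_per_sec=500 * 10 ** 6,
        consumer_fanout=3,                  # assumed number of consumer groups
        nic_bytes_per_sec=1250 * 10 ** 6,   # a 10 Gb/s NIC
    ))
```

For an aggregate cluster, the rule of thumb on the slide is to multiply the local result by the number of local clusters.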
Topic Configuration
 Partition Counts for Local
– Many theories on how to do this correctly, but the answer is “it depends”
– How many consumers do you have?
– Do you have specific partition requirements?
– Keeping partition sizes manageable
 Partition Counts for Aggregate
– Multiply the number of partitions in a local cluster by the number of local clusters
– Periodically review partition counts in all clusters
 Message Retention
– If aggregate is where you really need the messages, retain them in local only long enough to cover mirror maker problems
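One way to turn the partition-count questions above into a starting number is sketched below. It takes the largest of three floors: the broker count, the number of consumers that must read in parallel, and the count needed to keep each partition under a size limit. The function names and the 50 GB partition limit are assumptions for illustration only, and the answer is still "it depends".

```python
import math


def local_partition_count(brokers, max_consumers, retained_bytes,
                          max_partition_bytes):
    """Starting partition count for a topic in one local cluster."""
    # Enough partitions that no single partition exceeds the size limit.
    by_size = math.ceil(retained_bytes / max_partition_bytes)
    # At least one partition per broker and per parallel consumer.
    return max(brokers, max_consumers, by_size)


def aggregate_partition_count(local_partitions, local_clusters):
    """Aggregate topic: local partition count times number of local clusters."""
    return local_partitions * local_clusters


if __name__ == "__main__":
    GB = 10 ** 9
    local = local_partition_count(
        brokers=12,
        max_consumers=20,
        retained_bytes=2000 * GB,      # assumed retained size of the topic
        max_partition_bytes=50 * GB,   # assumed manageable partition size
    )
    print(local, aggregate_partition_count(local, local_clusters=4))
```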
Possible Broker Improvements
 Namespaces
– Namespace topics by datacenter
– Eliminate local clusters and just have aggregate
– Significant hardware savings
 JBOD Fixes
– Intelligent partition assignment
– Admin tools to move partitions between mount points
– Broker should not fail completely with a single disk failure
Administrative Improvements
 Multiple cluster management
– Topic management across clusters
– Visualization of mirror maker paths
 Better client monitoring
– Burrow for consumer monitoring
– No open source solution for producer monitoring (audit)
 End-to-end availability monitoring
Keeping An Eye On Things
Monitoring The Foundation
 CPU Load
 Network inbound and outbound
 Filehandle usage for Kafka
 Disk
– Free space - where you write logs, and where Kafka stores messages
– Free inodes
– I/O performance - at least average wait and percent utilization
 Garbage Collection
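The free space and free inode checks above can be sketched with the standard library alone. This is a minimal example, not a monitoring agent: the 20% threshold is an assumption, and the path in the example should be replaced with the mount points that hold your application logs and Kafka log segments.

```python
import os


def disk_headroom(path):
    """Return (free_space_pct, free_inode_pct) for the filesystem at path."""
    st = os.statvfs(path)
    space_pct = 100.0 * st.f_bavail / st.f_blocks if st.f_blocks else 0.0
    # Some filesystems report zero total inodes; treat that as "not limited".
    inode_pct = 100.0 * st.f_favail / st.f_files if st.f_files else 100.0
    return space_pct, inode_pct


def low_resources(space_pct, inode_pct, min_pct=20.0):
    """List which resources are below the (assumed) alert threshold."""
    problems = []
    if space_pct < min_pct:
        problems.append("space")
    if inode_pct < min_pct:
        problems.append("inodes")
    return problems


if __name__ == "__main__":
    # Replace "/" with your own log and data mount points.
    for path in ("/",):
        space, inodes = disk_headroom(path)
        print(path, round(space, 1), round(inodes, 1),
              low_resources(space, inodes))
```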
Broker Ground Rules
 Tuning
– Stick (mostly) with the defaults
– Set default cluster retention as appropriate
– Default partition count should be at least the number of brokers
 Monitoring
– Watch the right things
– Don’t try to alert on everything
 Triage and Resolution
– Solve problems, don’t mask them
Too Much Information!
 Monitoring teams hate Kafka
– Per-Topic metrics
– Per-Partition metrics
– Per-Client metrics
 Capture as much as you can
– Many metrics are useful while triaging an issue
 Clients want metrics on their own topics
 Only alert on what is needed to signal a problem
Broker Monitoring
 Bytes In and Out, Messages In
– Why not messages out?
 Partitions
– Count and Leader Count
– Under Replicated and Offline
 Threads
– Network pool, Request pool
– Max Dirty Percent
 Requests
– Rates and times - total, queue, local, and send
Topic Monitoring
 Bytes In, Bytes Out
 Messages In, Produce Rate, Produce Failure Rate
 Fetch Rate, Fetch Failure Rate
 Partition Bytes
 Log End Offset
– Why bother?
– KIP-32 will make this unnecessary
 Quota Throttling
 Provide this to your customers for them to alert on
Client Monitoring
 For consumers, use Burrow
– Monitor all partitions for all consumers
– Provides an easy-to-digest “good, warning, bad” state, with detail available
– Fast and free
 Producers are a little harder
– Several internal implementations of message auditing
– The community needs a good open source standard
 Cluster availability monitoring
– kafka-monitoring is coming soon from LinkedIn!
It’s Broken! Now What?
All The Best Ops People…
 Know more of what is happening than their customers
 Are proactive
 Fix bugs rather than work around them
 This applies to our developers too!
Anticipating Trouble
 Trend cluster utilization and growth over time
 Use default configurations for quotas and retention to require customers to talk to you
 Monitor request times
– If you are able to develop a consistent baseline, this is early warning
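A baseline check on request times might look like the sketch below. It flags a reading that sits more than a few standard deviations above recent history. The three-sigma rule and the sample values are assumptions chosen for illustration; the slide does not say how the baseline should be built.

```python
from statistics import mean, stdev


def exceeds_baseline(baseline_ms, current_ms, sigmas=3.0):
    """True when current_ms is well above the historical request times.

    baseline_ms needs at least two samples, otherwise stdev() raises
    statistics.StatisticsError.
    """
    threshold = mean(baseline_ms) + sigmas * stdev(baseline_ms)
    return current_ms > threshold


if __name__ == "__main__":
    # Hypothetical produce request times (ms) from a quiet period.
    history = [10, 11, 9, 10, 12, 10, 11, 9]
    for reading in (12, 25):
        print(reading, exceeds_baseline(history, reading))
```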
Under Replicated Partitions
 Count of partitions that are not fully replicated within the cluster
 Also referred to as “replica lag”
 Primary indicator of problems within the cluster
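To make the definition concrete: a partition is under replicated when its in-sync replica (ISR) list is shorter than its assigned replica list. The sketch below counts such partitions and tallies which brokers are missing from the ISR, which often points at the broker in trouble. The input records are a hypothetical structure, for example parsed from topic describe output, and the topic names are made up.

```python
from collections import Counter


def under_replicated(partitions):
    """Return (topic, partition) pairs whose ISR is smaller than the replica set."""
    return [(p["topic"], p["partition"]) for p in partitions
            if len(p["isr"]) < len(p["replicas"])]


def missing_replicas_by_broker(partitions):
    """Count, per broker id, how many replicas it holds that are out of sync."""
    missing = Counter()
    for p in partitions:
        for broker in set(p["replicas"]) - set(p["isr"]):
            missing[broker] += 1
    return missing


if __name__ == "__main__":
    # Hypothetical cluster state.
    state = [
        {"topic": "page-views", "partition": 0, "replicas": [1, 2, 3], "isr": [1, 2, 3]},
        {"topic": "page-views", "partition": 1, "replicas": [2, 3, 1], "isr": [2, 1]},
        {"topic": "clicks", "partition": 0, "replicas": [3, 1, 2], "isr": [1, 2]},
    ]
    print(len(under_replicated(state)), missing_replicas_by_broker(state))
```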
Broker Performance Checks
 Are you still running 0.8?
 Are all the brokers in the cluster working?
 Are the network interfaces saturated?
– Reelect partition leaders
– Rebalance partitions in the cluster
– Spread out traffic more (increase partitions or brokers)
 Is the CPU utilization high? (especially iowait)
– Is another process competing for resources?
– Look for a bad disk
 Do you have really big messages?
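For the network saturation question, a simple check is to convert the interface byte rates to a fraction of link speed in each direction. The sketch assumes a full-duplex link, byte rates taken from the operating system's interface counters, and an 80% "saturated" threshold; all three are assumptions, as are the sample numbers.

```python
def nic_utilization(bytes_in_per_sec, bytes_out_per_sec, link_bits_per_sec):
    """Return (inbound, outbound) utilization as fractions of link speed."""
    inbound = bytes_in_per_sec * 8 / link_bits_per_sec
    outbound = bytes_out_per_sec * 8 / link_bits_per_sec
    return inbound, outbound


def saturated_directions(bytes_in_per_sec, bytes_out_per_sec,
                         link_bits_per_sec, threshold=0.8):
    """Name the directions running above the (assumed) threshold."""
    inbound, outbound = nic_utilization(
        bytes_in_per_sec, bytes_out_per_sec, link_bits_per_sec)
    names = []
    if inbound > threshold:
        names.append("inbound")
    if outbound > threshold:
        names.append("outbound")
    return names


if __name__ == "__main__":
    # Hypothetical broker on a 1 Gb/s interface.
    print(saturated_directions(60_000_000, 110_000_000, 1_000_000_000))
```

Outbound usually saturates first because every message is read by followers and, typically, by more than one consumer.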
Kafka’s OK, Now What?
 If Kafka is working properly, it’s probably a client issue
– Don’t throw it over the fence. Help your customers understand
 Common producer issues
– Batch size and linger time
– Receive and send buffers
– Sync vs. async, and acknowledgements
 Common consumer issues
– Garbage collection problems
– Min fetch bytes and max wait time
– Not enough partitions
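The producer and consumer issues above map onto a handful of client settings. The sketch below lists the corresponding property names from the Java clients with placeholder values and renders them as a properties file. The values are illustrative assumptions, not recommendations from the talk.

```python
# Producer: batching, socket buffers, and acknowledgements.
PRODUCER = {
    "batch.size": 65536,              # bytes per partition batch
    "linger.ms": 10,                  # wait this long to fill a batch
    "send.buffer.bytes": 1048576,
    "receive.buffer.bytes": 1048576,
    "acks": "all",                    # wait for all in-sync replicas
}

# Consumer: how much data to wait for, and for how long.
CONSUMER = {
    "fetch.min.bytes": 65536,
    "fetch.max.wait.ms": 500,
    "receive.buffer.bytes": 1048576,
}


def to_properties(config):
    """Render a config dict as sorted key=value lines."""
    return "\n".join(f"{key}={value}" for key, value in sorted(config.items()))


if __name__ == "__main__":
    print(to_properties(PRODUCER))
    print()
    print(to_properties(CONSUMER))
```

Sync versus async is a matter of how the application calls the producer (blocking on each send or not), so it does not appear as a property here.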
Conclusion
One Ecosystem
 Kafka can scale to millions of messages per second, and more
– Operations must scale the cluster appropriately
– Developers must use the right tuning and go parallel
 Few problems are owned by only one side
– Expanding partitions often requires coordination
– Applications that need higher reliability drive cluster configurations
 Either we work together, or we fail separately
Would You Like To Know More?
 Presentations: http://www.slideshare.net/toddpalino
– More Datacenters, More Problems
– Kafka As A Service
– Always download the originals for slide notes!
 Blog Posts: https://engineering.linkedin.com/blog
– Development and SRE blogs on Kafka and other topics
 LinkedIn Open Source: https://github.com/linkedin/streaming
– Burrow Consumer Monitoring - https://github.com/linkedin/Burrow
– Kafka Admin Tools - https://github.com/linkedin/kafka-tools
Getting Involved With Kafka
 http://kafka.apache.org
 Join the mailing lists
– users@kafka.apache.org
– dev@kafka.apache.org
 irc.freenode.net - #apache-kafka
 Meetups
– Apache Kafka - http://www.meetup.com/http-kafka-apache-org
– Bay Area Samza - http://www.meetup.com/Bay-Area-Samza-Meetup/
 Contribute code
Data @ LinkedIn is Hiring!
 Streams Infrastructure
– Kafka pub/sub ecosystem
– Stream Processing Platform built on Apache Samza
– Next Generation change capture technology (incubating)
 LinkedIn
– Strong commitment to open source
– Do cool things and work with awesome people
 Join us in working on cutting edge stream processing infrastructures
– Please contact kparamasivam@linkedin.com
– Software developers and Site Reliability Engineers at all levels
Appendix
JDK Options
Heap Size:       -Xmx6g -Xms6g
Metaspace:       -XX:MetaspaceSize=96m -XX:MinMetaspaceFreeRatio=50
                 -XX:MaxMetaspaceFreeRatio=80
G1 Tuning:       -XX:+UseG1GC -XX:MaxGCPauseMillis=20
                 -XX:InitiatingHeapOccupancyPercent=35
                 -XX:G1HeapRegionSize=16M
GC Logging:      -XX:+PrintGCDetails -XX:+PrintGCTimeStamps
                 -XX:+PrintGCDateStamps -XX:+PrintTenuringDistribution
                 -Xloggc:/path/to/logs/gc.log -verbose:gc
Error Handling:  -XX:-HeapDumpOnOutOfMemoryError
                 -XX:ErrorFile=/path/to/logs/hs_err.log
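On humongous allocations: G1 treats an object as humongous when it is at least half a heap region in size, so the region size above sets the threshold. The sketch below works that out and checks a buffer size against it. The helper names and the example message sizes are hypothetical.

```python
UNITS = {"k": 1024, "m": 1024 ** 2, "g": 1024 ** 3}


def parse_jvm_size(text):
    """Convert a JVM size such as '16M' or '6g' to bytes."""
    text = text.strip().lower()
    if text[-1] in UNITS:
        return int(text[:-1]) * UNITS[text[-1]]
    return int(text)


def humongous_threshold(region_size_bytes):
    """Smallest object size that G1 allocates as humongous."""
    return region_size_bytes // 2


def is_humongous(object_bytes, region_size_bytes):
    return object_bytes >= humongous_threshold(region_size_bytes)


if __name__ == "__main__":
    region = parse_jvm_size("16M")      # from -XX:G1HeapRegionSize=16M
    print(humongous_threshold(region))  # threshold in bytes
    for size in (1_000_000, 10_000_000):  # hypothetical buffer sizes
        print(size, is_humongous(size, region))
```

With a 16 MB region the threshold is 8 MB, so only very large message buffers should cross it.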
OS Tuning Parameters
 Networking:
net.core.rmem_default = 124928
net.core.rmem_max = 2048000
net.core.wmem_default = 124928
net.core.wmem_max = 2048000
net.ipv4.tcp_rmem = 4096 87380 4194304
net.ipv4.tcp_wmem = 4096 16384 4194304
net.ipv4.tcp_max_tw_buckets = 262144
net.ipv4.tcp_max_syn_backlog = 1024
OS Tuning Parameters (cont.)
 Virtual Memory
vm.oom_kill_allocating_task = 1
vm.max_map_count = 200000
vm.swappiness = 1
vm.dirty_writeback_centisecs = 500
vm.dirty_expire_centisecs = 500
vm.dirty_ratio = 60
vm.dirty_background_ratio = 5
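A small sketch for checking a Linux host against the virtual memory values above. It reads each key from /proc/sys and reports the ones that differ. The helper names are hypothetical, and the networking keys from the previous slide could be added to the table in the same way.

```python
from pathlib import Path

# Values copied from the slide above.
RECOMMENDED_VM = {
    "vm.oom_kill_allocating_task": "1",
    "vm.max_map_count": "200000",
    "vm.swappiness": "1",
    "vm.dirty_writeback_centisecs": "500",
    "vm.dirty_expire_centisecs": "500",
    "vm.dirty_ratio": "60",
    "vm.dirty_background_ratio": "5",
}


def read_sysctl(key, root="/proc/sys"):
    """Read one sysctl value, normalizing whitespace."""
    path = Path(root) / key.replace(".", "/")
    return " ".join(path.read_text().split())


def diff_settings(wanted, actual):
    """Return {key: (wanted, actual)} for every key that does not match."""
    return {key: (value, actual.get(key))
            for key, value in wanted.items()
            if actual.get(key) != value}


if __name__ == "__main__":
    actual = {}
    for key in RECOMMENDED_VM:
        try:
            actual[key] = read_sysctl(key)
        except OSError:
            pass  # key not present (non-Linux host or different kernel)
    for key, (want, have) in diff_settings(RECOMMENDED_VM, actual).items():
        print(f"{key}: want {want}, have {have}")
```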
Kafka Broker Sensors
kafka.server:name=BytesInPerSec,type=BrokerTopicMetrics
kafka.server:name=BytesOutPerSec,type=BrokerTopicMetrics
kafka.server:name=MessagesInPerSec,type=BrokerTopicMetrics
kafka.server:name=PartitionCount,type=ReplicaManager
kafka.server:name=LeaderCount,type=ReplicaManager
kafka.server:name=UnderReplicatedPartitions,type=ReplicaManager
kafka.server:name=RequestHandlerAvgIdlePercent,type=KafkaRequestHandlerPool
kafka.controller:name=ActiveControllerCount,type=KafkaController
kafka.controller:name=OfflinePartitionsCount,type=KafkaController
kafka.log:name=max-dirty-percent,type=LogCleanerManager
kafka.network:name=NetworkProcessorAvgIdlePercent,type=SocketServer
kafka.network:name=RequestsPerSec,request=*,type=RequestMetrics
kafka.network:name=RequestQueueTimeMs,request=*,type=RequestMetrics
kafka.network:name=LocalTimeMs,request=*,type=RequestMetrics
kafka.network:name=RemoteTimeMs,request=*,type=RequestMetrics
kafka.network:name=ResponseQueueTimeMs,request=*,type=RequestMetrics
kafka.network:name=ResponseSendTimeMs,request=*,type=RequestMetrics
kafka.network:name=TotalTimeMs,request=*,type=RequestMetrics
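The request time sensors above break one request into phases, and the total is roughly their sum. When triaging, it helps to see which phase dominates. The sketch below does that for one set of readings. The sample values are made up, and the hints in the comments are common interpretations rather than anything stated on the slides.

```python
# Phases in the order a request passes through the broker.
PHASES = [
    "RequestQueueTimeMs",   # waiting for a request handler thread
    "LocalTimeMs",          # work on the leader, often disk I/O
    "RemoteTimeMs",         # waiting on followers or on fetch wait time
    "ResponseQueueTimeMs",  # waiting for a network thread
    "ResponseSendTimeMs",   # sending the response to the client
]


def dominant_phase(timings):
    """Return (phase_name, share_of_total) for the slowest phase."""
    phase = max(PHASES, key=lambda name: timings.get(name, 0.0))
    total = timings.get("TotalTimeMs") or sum(timings.get(p, 0.0) for p in PHASES)
    share = timings.get(phase, 0.0) / total if total else 0.0
    return phase, share


if __name__ == "__main__":
    # Hypothetical readings for Produce requests, in milliseconds.
    produce = {
        "RequestQueueTimeMs": 2.0,
        "LocalTimeMs": 40.0,
        "RemoteTimeMs": 5.0,
        "ResponseQueueTimeMs": 1.0,
        "ResponseSendTimeMs": 2.0,
        "TotalTimeMs": 50.0,
    }
    print(dominant_phase(produce))
```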
Kafka Broker Sensors - Topics
kafka.server:name=BytesInPerSec,type=BrokerTopicMetrics,topic=*
kafka.server:name=BytesOutPerSec,type=BrokerTopicMetrics,topic=*
kafka.server:name=MessagesInPerSec,type=BrokerTopicMetrics,topic=*
kafka.server:name=TotalProduceRequestsPerSec,type=BrokerTopicMetrics,topic=*
kafka.server:name=FailedProduceRequestsPerSec,type=BrokerTopicMetrics,topic=*
kafka.server:name=TotalFetchRequestsPerSec,type=BrokerTopicMetrics,topic=*
kafka.server:name=FailedFetchRequestsPerSec,type=BrokerTopicMetrics,topic=*
kafka.log:type=Log,name=LogEndOffset,topic=*,partition=*
42

More Related Content

What's hot

Apache Kafka Architecture & Fundamentals Explained
Apache Kafka Architecture & Fundamentals ExplainedApache Kafka Architecture & Fundamentals Explained
Apache Kafka Architecture & Fundamentals Explainedconfluent
 
Tuning kafka pipelines
Tuning kafka pipelinesTuning kafka pipelines
Tuning kafka pipelinesSumant Tambe
 
Apache Kafka 0.8 basic training - Verisign
Apache Kafka 0.8 basic training - VerisignApache Kafka 0.8 basic training - Verisign
Apache Kafka 0.8 basic training - VerisignMichael Noll
 
Disaster Recovery with MirrorMaker 2.0 (Ryanne Dolan, Cloudera) Kafka Summit ...
Disaster Recovery with MirrorMaker 2.0 (Ryanne Dolan, Cloudera) Kafka Summit ...Disaster Recovery with MirrorMaker 2.0 (Ryanne Dolan, Cloudera) Kafka Summit ...
Disaster Recovery with MirrorMaker 2.0 (Ryanne Dolan, Cloudera) Kafka Summit ...confluent
 
Apache Kafka Fundamentals for Architects, Admins and Developers
Apache Kafka Fundamentals for Architects, Admins and DevelopersApache Kafka Fundamentals for Architects, Admins and Developers
Apache Kafka Fundamentals for Architects, Admins and Developersconfluent
 
Architecture patterns for distributed, hybrid, edge and global Apache Kafka d...
Architecture patterns for distributed, hybrid, edge and global Apache Kafka d...Architecture patterns for distributed, hybrid, edge and global Apache Kafka d...
Architecture patterns for distributed, hybrid, edge and global Apache Kafka d...Kai Wähner
 
Netflix Data Pipeline With Kafka
Netflix Data Pipeline With KafkaNetflix Data Pipeline With Kafka
Netflix Data Pipeline With KafkaAllen (Xiaozhong) Wang
 
How to tune Kafka® for production
How to tune Kafka® for productionHow to tune Kafka® for production
How to tune Kafka® for productionconfluent
 
Citi Tech Talk Disaster Recovery Solutions Deep Dive
Citi Tech Talk  Disaster Recovery Solutions Deep DiveCiti Tech Talk  Disaster Recovery Solutions Deep Dive
Citi Tech Talk Disaster Recovery Solutions Deep Diveconfluent
 
Kafka Tutorial - Introduction to Apache Kafka (Part 1)
Kafka Tutorial - Introduction to Apache Kafka (Part 1)Kafka Tutorial - Introduction to Apache Kafka (Part 1)
Kafka Tutorial - Introduction to Apache Kafka (Part 1)Jean-Paul Azar
 
Autoscaling Flink with Reactive Mode
Autoscaling Flink with Reactive ModeAutoscaling Flink with Reactive Mode
Autoscaling Flink with Reactive ModeFlink Forward
 
Kafka Quotas Talk at LinkedIn
Kafka Quotas Talk at LinkedInKafka Quotas Talk at LinkedIn
Kafka Quotas Talk at LinkedInAditya Auradkar
 
Kafka Streams State Stores Being Persistent
Kafka Streams State Stores Being PersistentKafka Streams State Stores Being Persistent
Kafka Streams State Stores Being Persistentconfluent
 
Cruise Control: Effortless management of Kafka clusters
Cruise Control: Effortless management of Kafka clustersCruise Control: Effortless management of Kafka clusters
Cruise Control: Effortless management of Kafka clustersPrateek Maheshwari
 
Evening out the uneven: dealing with skew in Flink
Evening out the uneven: dealing with skew in FlinkEvening out the uneven: dealing with skew in Flink
Evening out the uneven: dealing with skew in FlinkFlink Forward
 
Kafka at Scale: Multi-Tier Architectures
Kafka at Scale: Multi-Tier ArchitecturesKafka at Scale: Multi-Tier Architectures
Kafka at Scale: Multi-Tier ArchitecturesTodd Palino
 
No data loss pipeline with apache kafka
No data loss pipeline with apache kafkaNo data loss pipeline with apache kafka
No data loss pipeline with apache kafkaJiangjie Qin
 
Handle Large Messages In Apache Kafka
Handle Large Messages In Apache KafkaHandle Large Messages In Apache Kafka
Handle Large Messages In Apache KafkaJiangjie Qin
 

What's hot (20)

Apache Kafka Architecture & Fundamentals Explained
Apache Kafka Architecture & Fundamentals ExplainedApache Kafka Architecture & Fundamentals Explained
Apache Kafka Architecture & Fundamentals Explained
 
Tuning kafka pipelines
Tuning kafka pipelinesTuning kafka pipelines
Tuning kafka pipelines
 
Apache Kafka 0.8 basic training - Verisign
Apache Kafka 0.8 basic training - VerisignApache Kafka 0.8 basic training - Verisign
Apache Kafka 0.8 basic training - Verisign
 
Disaster Recovery with MirrorMaker 2.0 (Ryanne Dolan, Cloudera) Kafka Summit ...
Disaster Recovery with MirrorMaker 2.0 (Ryanne Dolan, Cloudera) Kafka Summit ...Disaster Recovery with MirrorMaker 2.0 (Ryanne Dolan, Cloudera) Kafka Summit ...
Disaster Recovery with MirrorMaker 2.0 (Ryanne Dolan, Cloudera) Kafka Summit ...
 
Apache Kafka Fundamentals for Architects, Admins and Developers
Apache Kafka Fundamentals for Architects, Admins and DevelopersApache Kafka Fundamentals for Architects, Admins and Developers
Apache Kafka Fundamentals for Architects, Admins and Developers
 
Architecture patterns for distributed, hybrid, edge and global Apache Kafka d...
Architecture patterns for distributed, hybrid, edge and global Apache Kafka d...Architecture patterns for distributed, hybrid, edge and global Apache Kafka d...
Architecture patterns for distributed, hybrid, edge and global Apache Kafka d...
 
Netflix Data Pipeline With Kafka
Netflix Data Pipeline With KafkaNetflix Data Pipeline With Kafka
Netflix Data Pipeline With Kafka
 
How to tune Kafka® for production
How to tune Kafka® for productionHow to tune Kafka® for production
How to tune Kafka® for production
 
Citi Tech Talk Disaster Recovery Solutions Deep Dive
Citi Tech Talk  Disaster Recovery Solutions Deep DiveCiti Tech Talk  Disaster Recovery Solutions Deep Dive
Citi Tech Talk Disaster Recovery Solutions Deep Dive
 
Kafka Tutorial - Introduction to Apache Kafka (Part 1)
Kafka Tutorial - Introduction to Apache Kafka (Part 1)Kafka Tutorial - Introduction to Apache Kafka (Part 1)
Kafka Tutorial - Introduction to Apache Kafka (Part 1)
 
Autoscaling Flink with Reactive Mode
Autoscaling Flink with Reactive ModeAutoscaling Flink with Reactive Mode
Autoscaling Flink with Reactive Mode
 
Kafka Quotas Talk at LinkedIn
Kafka Quotas Talk at LinkedInKafka Quotas Talk at LinkedIn
Kafka Quotas Talk at LinkedIn
 
Envoy and Kafka
Envoy and KafkaEnvoy and Kafka
Envoy and Kafka
 
Kafka Streams State Stores Being Persistent
Kafka Streams State Stores Being PersistentKafka Streams State Stores Being Persistent
Kafka Streams State Stores Being Persistent
 
Cruise Control: Effortless management of Kafka clusters
Cruise Control: Effortless management of Kafka clustersCruise Control: Effortless management of Kafka clusters
Cruise Control: Effortless management of Kafka clusters
 
kafka
kafkakafka
kafka
 
Evening out the uneven: dealing with skew in Flink
Evening out the uneven: dealing with skew in FlinkEvening out the uneven: dealing with skew in Flink
Evening out the uneven: dealing with skew in Flink
 
Kafka at Scale: Multi-Tier Architectures
Kafka at Scale: Multi-Tier ArchitecturesKafka at Scale: Multi-Tier Architectures
Kafka at Scale: Multi-Tier Architectures
 
No data loss pipeline with apache kafka
No data loss pipeline with apache kafkaNo data loss pipeline with apache kafka
No data loss pipeline with apache kafka
 
Handle Large Messages In Apache Kafka
Handle Large Messages In Apache KafkaHandle Large Messages In Apache Kafka
Handle Large Messages In Apache Kafka
 

Similar to Kafka at Peak Performance

Linked in multi tier, multi-tenant, multi-problem kafka
Linked in multi tier, multi-tenant, multi-problem kafkaLinked in multi tier, multi-tenant, multi-problem kafka
Linked in multi tier, multi-tenant, multi-problem kafkaNitin Kumar
 
Multi tier, multi-tenant, multi-problem kafka
Multi tier, multi-tenant, multi-problem kafkaMulti tier, multi-tenant, multi-problem kafka
Multi tier, multi-tenant, multi-problem kafkaTodd Palino
 
More Datacenters, More Problems
More Datacenters, More ProblemsMore Datacenters, More Problems
More Datacenters, More ProblemsTodd Palino
 
Data stream with cruise control
Data stream with cruise controlData stream with cruise control
Data stream with cruise controlBill Liu
 
ARIN 34 IPv6 IAB/IETF Activities Report
ARIN 34 IPv6 IAB/IETF Activities ReportARIN 34 IPv6 IAB/IETF Activities Report
ARIN 34 IPv6 IAB/IETF Activities ReportARIN
 
InfiniBand for the enterprise
InfiniBand for the enterpriseInfiniBand for the enterprise
InfiniBand for the enterpriseAnas Kanzoua
 
New Generation of IBM Power Systems Delivering value with Red Hat Enterprise ...
New Generation of IBM Power Systems Delivering value with Red Hat Enterprise ...New Generation of IBM Power Systems Delivering value with Red Hat Enterprise ...
New Generation of IBM Power Systems Delivering value with Red Hat Enterprise ...Filipe Miranda
 
Webinar: High Performance MongoDB Applications with IBM POWER8
Webinar: High Performance MongoDB Applications with IBM POWER8Webinar: High Performance MongoDB Applications with IBM POWER8
Webinar: High Performance MongoDB Applications with IBM POWER8MongoDB
 
Adobe Ask the AEM Community Expert Session Oct 2016
Adobe Ask the AEM Community Expert Session Oct 2016Adobe Ask the AEM Community Expert Session Oct 2016
Adobe Ask the AEM Community Expert Session Oct 2016AdobeMarketingCloud
 
MySQL for Software-as-a-Service (SaaS)
MySQL for Software-as-a-Service (SaaS)MySQL for Software-as-a-Service (SaaS)
MySQL for Software-as-a-Service (SaaS)Mario Beck
 
Spark Streaming the Industrial IoT
Spark Streaming the Industrial IoTSpark Streaming the Industrial IoT
Spark Streaming the Industrial IoTJim Haughwout
 
The role of NoSQL in the Next Generation of Financial Informatics
The role of NoSQL in the Next Generation of Financial InformaticsThe role of NoSQL in the Next Generation of Financial Informatics
The role of NoSQL in the Next Generation of Financial InformaticsAerospike, Inc.
 
Kafka 0.9, Things you should know
Kafka 0.9, Things you should knowKafka 0.9, Things you should know
Kafka 0.9, Things you should knowRatish Ravindran
 
Management and Automation of MongoDB Clusters - Slides
Management and Automation of MongoDB Clusters - SlidesManagement and Automation of MongoDB Clusters - Slides
Management and Automation of MongoDB Clusters - SlidesSeveralnines
 
Fast Online Access to Massive Offline Data - SECR 2016
Fast Online Access to Massive Offline Data - SECR 2016Fast Online Access to Massive Offline Data - SECR 2016
Fast Online Access to Massive Offline Data - SECR 2016Felix GV
 
#IBMEdge: Flash Storage Session
#IBMEdge: Flash Storage Session#IBMEdge: Flash Storage Session
#IBMEdge: Flash Storage SessionBrocade
 
Scylla Summit 2022: Building Zeotap's Privacy Compliant Customer Data Platfor...
Scylla Summit 2022: Building Zeotap's Privacy Compliant Customer Data Platfor...Scylla Summit 2022: Building Zeotap's Privacy Compliant Customer Data Platfor...
Scylla Summit 2022: Building Zeotap's Privacy Compliant Customer Data Platfor...ScyllaDB
 
Splunk All the Things: Our First 3 Months Monitoring Web Service APIs - Splun...
Splunk All the Things: Our First 3 Months Monitoring Web Service APIs - Splun...Splunk All the Things: Our First 3 Months Monitoring Web Service APIs - Splun...
Splunk All the Things: Our First 3 Months Monitoring Web Service APIs - Splun...Dan Cundiff
 
CisCon 2018 - Analytics per Storage Area Networks
CisCon 2018 - Analytics per Storage Area NetworksCisCon 2018 - Analytics per Storage Area Networks
CisCon 2018 - Analytics per Storage Area NetworksAreaNetworking.it
 
Rakuten Ichiba_Rakuten Technology Conference 2016
Rakuten Ichiba_Rakuten Technology Conference 2016Rakuten Ichiba_Rakuten Technology Conference 2016
Rakuten Ichiba_Rakuten Technology Conference 2016Rakuten Group, Inc.
 

Similar to Kafka at Peak Performance (20)

Linked in multi tier, multi-tenant, multi-problem kafka
Linked in multi tier, multi-tenant, multi-problem kafkaLinked in multi tier, multi-tenant, multi-problem kafka
Linked in multi tier, multi-tenant, multi-problem kafka
 
Multi tier, multi-tenant, multi-problem kafka
Multi tier, multi-tenant, multi-problem kafkaMulti tier, multi-tenant, multi-problem kafka
Multi tier, multi-tenant, multi-problem kafka
 
More Datacenters, More Problems
More Datacenters, More ProblemsMore Datacenters, More Problems
More Datacenters, More Problems
 
Data stream with cruise control
Data stream with cruise controlData stream with cruise control
Data stream with cruise control
 
ARIN 34 IPv6 IAB/IETF Activities Report
ARIN 34 IPv6 IAB/IETF Activities ReportARIN 34 IPv6 IAB/IETF Activities Report
ARIN 34 IPv6 IAB/IETF Activities Report
 
InfiniBand for the enterprise
InfiniBand for the enterpriseInfiniBand for the enterprise
InfiniBand for the enterprise
 
New Generation of IBM Power Systems Delivering value with Red Hat Enterprise ...
New Generation of IBM Power Systems Delivering value with Red Hat Enterprise ...New Generation of IBM Power Systems Delivering value with Red Hat Enterprise ...
New Generation of IBM Power Systems Delivering value with Red Hat Enterprise ...
 
Webinar: High Performance MongoDB Applications with IBM POWER8
Webinar: High Performance MongoDB Applications with IBM POWER8Webinar: High Performance MongoDB Applications with IBM POWER8
Webinar: High Performance MongoDB Applications with IBM POWER8
 
Adobe Ask the AEM Community Expert Session Oct 2016
Adobe Ask the AEM Community Expert Session Oct 2016Adobe Ask the AEM Community Expert Session Oct 2016
Adobe Ask the AEM Community Expert Session Oct 2016
 
MySQL for Software-as-a-Service (SaaS)
MySQL for Software-as-a-Service (SaaS)MySQL for Software-as-a-Service (SaaS)
MySQL for Software-as-a-Service (SaaS)
 
Spark Streaming the Industrial IoT
Spark Streaming the Industrial IoTSpark Streaming the Industrial IoT
Spark Streaming the Industrial IoT
 
The role of NoSQL in the Next Generation of Financial Informatics
The role of NoSQL in the Next Generation of Financial InformaticsThe role of NoSQL in the Next Generation of Financial Informatics
The role of NoSQL in the Next Generation of Financial Informatics
 
Kafka 0.9, Things you should know
Kafka 0.9, Things you should knowKafka 0.9, Things you should know
Kafka 0.9, Things you should know
 
Management and Automation of MongoDB Clusters - Slides
Management and Automation of MongoDB Clusters - SlidesManagement and Automation of MongoDB Clusters - Slides
Management and Automation of MongoDB Clusters - Slides
 
Fast Online Access to Massive Offline Data - SECR 2016
Fast Online Access to Massive Offline Data - SECR 2016Fast Online Access to Massive Offline Data - SECR 2016
Fast Online Access to Massive Offline Data - SECR 2016
 
#IBMEdge: Flash Storage Session
#IBMEdge: Flash Storage Session#IBMEdge: Flash Storage Session
#IBMEdge: Flash Storage Session
 
Scylla Summit 2022: Building Zeotap's Privacy Compliant Customer Data Platfor...
Scylla Summit 2022: Building Zeotap's Privacy Compliant Customer Data Platfor...Scylla Summit 2022: Building Zeotap's Privacy Compliant Customer Data Platfor...
Scylla Summit 2022: Building Zeotap's Privacy Compliant Customer Data Platfor...
 
Splunk All the Things: Our First 3 Months Monitoring Web Service APIs - Splun...
Splunk All the Things: Our First 3 Months Monitoring Web Service APIs - Splun...Splunk All the Things: Our First 3 Months Monitoring Web Service APIs - Splun...
Splunk All the Things: Our First 3 Months Monitoring Web Service APIs - Splun...
 
CisCon 2018 - Analytics per Storage Area Networks
CisCon 2018 - Analytics per Storage Area NetworksCisCon 2018 - Analytics per Storage Area Networks
CisCon 2018 - Analytics per Storage Area Networks
 
Rakuten Ichiba_Rakuten Technology Conference 2016
Rakuten Ichiba_Rakuten Technology Conference 2016Rakuten Ichiba_Rakuten Technology Conference 2016
Rakuten Ichiba_Rakuten Technology Conference 2016
 

Kafka at Peak Performance

  • 1. SITE RELIABILITY ENGINEERING©2016 LinkedIn Corporation. All Rights Reserved. Kafka at Peak Performance
  • 2. Todd Palino
  • 3. Who Am I?
  • 4. Kafka At LinkedIn
      - 1100+ Kafka brokers
      - Over 32,000 topics
      - 350,000+ Partitions
      - 875 Billion messages per day
      - 185 Terabytes In
      - 675 Terabytes Out
      - Peak Load (whole site)
          – 10.5 Million messages/sec
          – 18.5 Gigabits/sec Inbound
          – 70.5 Gigabits/sec Outbound

      - 1800+ Kafka brokers
      - Over 79,000 topics
      - 1,130,000+ Partitions
      - 1.3 Trillion messages per day
      - 330 Terabytes In
      - 1.2 Petabytes Out
      - Peak Load (single cluster)
          – 2 Million messages/sec
          – 4.7 Gigabits/sec Inbound
          – 15 Gigabits/sec Outbound
  • 5. What Will We Talk About?
      - Picking Your Hardware
      - Monitoring the Cluster
      - Triaging Broker Performance Problems
      - Conclusion
  • 6. Hardware Selection
  • 7. What’s Important To You?
      - Message Retention - Disk size
      - Message Throughput - Network capacity
      - Producer Performance - Disk I/O
      - Consumer Performance - Memory
  • 8. Go Wide
      - Kafka is well-suited to horizontal scaling
      - RAIS - Redundant Array of Inexpensive Servers
      - Also helps with CPU utilization
          – Kafka needs to decompress and recompress every message batch
          – KIP-31 will help with this by eliminating recompression
      - Don’t co-locate Kafka
  • 9. Disk Layout
      - RAID
          – Can survive a single disk failure (not RAID 0)
          – Provides the broker with a single log directory
          – Eats up disk I/O
      - JBOD
          – Gives Kafka all the disk I/O available
          – Broker is not smart about balancing partitions
          – If one disk fails, the entire broker stops
      - Amazon EBS performance works!
  • 10. Operating System Tuning
      - Filesystem Options
          – EXT or XFS
          – Using unsafe mount options
      - Virtual Memory
          – Swappiness
          – Dirty Pages
      - Networking
  • 11. Java
      - Only use JDK 8 now
      - Keep heap size small
          – Even our largest brokers use a 6 GB heap
          – Save the rest for page cache
      - Garbage Collection - G1 all the way
          – Basic tuning only
          – Watch for humongous allocations
  • 12. How Much Do You Need?
  • 13. Buy The Book!
      Early Access available now. Covers all aspects of Kafka, from setup to client development to ongoing administration and troubleshooting. Also discusses stream processing and other use cases.
  • 14. Kafka Cluster Sizing
      - How big for your local cluster?
          – How much disk space do you have?
          – How much network bandwidth do you have?
          – CPU, memory, disk I/O
      - How big for your aggregate cluster?
          – In general, multiply the number of brokers by the number of local clusters
          – May have additional concerns with lots of consumers
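The sizing questions above reduce to simple arithmetic once you have numbers for disk and network. The following is a minimal sketch, not a method from the talk: the helper names `brokers_needed` and `aggregate_brokers`, the 60% utilization ceiling, and the simplification that only inbound traffic (multiplied by the replication factor) limits the network are all assumptions made for illustration.

```python
import math


def brokers_needed(
    daily_ingest_tb: float,
    retention_days: float,
    replication_factor: int,
    disk_per_broker_tb: float,
    peak_inbound_gbps: float,
    nic_gbps: float,
    max_utilization: float = 0.6,  # assumed headroom target, not from the talk
) -> int:
    """Rough broker count for a local cluster, limited by disk or network."""
    # Disk: every retained byte is stored once per replica.
    stored_tb = daily_ingest_tb * retention_days * replication_factor
    by_disk = math.ceil(stored_tb / (disk_per_broker_tb * max_utilization))

    # Network (simplified): each produced byte arrives once per replica.
    cluster_inbound_gbps = peak_inbound_gbps * replication_factor
    by_network = math.ceil(cluster_inbound_gbps / (nic_gbps * max_utilization))

    # A cluster can never be smaller than its replication factor.
    return max(by_disk, by_network, replication_factor)


def aggregate_brokers(local_brokers: int, local_clusters: int) -> int:
    """Slide rule of thumb: local broker count times number of local clusters."""
    return local_brokers * local_clusters


if __name__ == "__main__":
    local = brokers_needed(
        daily_ingest_tb=10, retention_days=7, replication_factor=2,
        disk_per_broker_tb=20, peak_inbound_gbps=4, nic_gbps=10,
    )
    print(local, aggregate_brokers(local, local_clusters=3))
```

With these made-up inputs disk is the binding constraint; CPU, memory, disk I/O, and consumer fan-out still need to be checked separately.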
  • 15. Topic Configuration
      - Partition Counts for Local
          – Many theories on how to do this correctly, but the answer is “it depends”
          – How many consumers do you have?
          – Do you have specific partition requirements?
          – Keeping partition sizes manageable
      - Partition Counts for Aggregate
          – Multiply the number of partitions in a local cluster by the number of local clusters
          – Periodically review partition counts in all clusters
      - Message Retention
          – If aggregate is where you really need the messages, retain them in local only long enough to cover mirror maker problems
  • 16. Possible Broker Improvements
      - Namespaces
          – Namespace topics by datacenter
          – Eliminate local clusters and just have aggregate
          – Significant hardware savings
      - JBOD Fixes
          – Intelligent partition assignment
          – Admin tools to move partitions between mount points
          – Broker should not fail completely with a single disk failure
  • 17. Administrative Improvements
      - Multiple cluster management
          – Topic management across clusters
          – Visualization of mirror maker paths
      - Better client monitoring
          – Burrow for consumer monitoring
          – No open source solution for producer monitoring (audit)
      - End-to-end availability monitoring
  • 18. Keeping An Eye On Things
  • 19. Monitoring The Foundation
      - CPU Load
      - Network inbound and outbound
      - Filehandle usage for Kafka
      - Disk
          – Free space - where you write logs, and where Kafka stores messages
          – Free inodes
          – I/O performance - at least average wait and percent utilization
      - Garbage Collection
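The disk checks above (free space and free inodes) can be sampled with the standard library on Unix-like systems. This is a minimal sketch; the function name `disk_headroom` and the choice to report percentages are assumptions, and you would point it at both the application log directory and the Kafka data directories.

```python
import os


def disk_headroom(path: str) -> dict:
    """Return free space and free inodes for the filesystem holding `path`.

    Uses os.statvfs, which is available on Unix-like systems only.
    """
    st = os.statvfs(path)
    # f_bavail / f_favail count what an unprivileged process can still use.
    free_space_pct = 100.0 * st.f_bavail / st.f_blocks if st.f_blocks else 0.0
    # Some filesystems report zero total inodes; treat that as "no inode limit".
    free_inodes_pct = 100.0 * st.f_favail / st.f_files if st.f_files else 100.0
    return {
        "free_bytes": st.f_bavail * st.f_frsize,
        "free_space_pct": free_space_pct,
        "free_inodes_pct": free_inodes_pct,
    }


if __name__ == "__main__":
    # "." is a placeholder; use the broker's data and log mount points instead.
    print(disk_headroom("."))
```

I/O wait and utilization are not covered here; those usually come from the operating system's own tooling rather than from `statvfs`.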
  • 20. Broker Ground Rules
      - Tuning
          – Stick (mostly) with the defaults
          – Set default cluster retention as appropriate
          – Default partition count should be at least the number of brokers
      - Monitoring
          – Watch the right things
          – Don’t try to alert on everything
      - Triage and Resolution
          – Solve problems, don’t mask them
  • 21. Too Much Information!
      - Monitoring teams hate Kafka
          – Per-Topic metrics
          – Per-Partition metrics
          – Per-Client metrics
      - Capture as much as you can
          – Many metrics are useful while triaging an issue
      - Clients want metrics on their own topics
      - Only alert on what is needed to signal a problem
  • 22. Broker Monitoring
      - Bytes In and Out, Messages In
          – Why not messages out?
      - Partitions
          – Count and Leader Count
          – Under Replicated and Offline
      - Threads
          – Network pool, Request pool
          – Max Dirty Percent
      - Requests
          – Rates and times - total, queue, local, and send
  • 23. Topic Monitoring
      - Bytes In, Bytes Out
      - Messages In, Produce Rate, Produce Failure Rate
      - Fetch Rate, Fetch Failure Rate
      - Partition Bytes
      - Log End Offset
          – Why bother?
          – KIP-32 will make this unnecessary
      - Quota Throttling
      - Provide this to your customers for them to alert on
  • 24. Client Monitoring
      - For consumers, use Burrow
          – Monitor all partitions for all consumers
          – Provides an easy to digest “good, warning, bad” state, with detail available
          – Fast and free
      - Producers are a little harder
          – Several internal implementations of message auditing
          – The community needs a good open source standard
      - Cluster availability monitoring
          – kafka-monitoring is coming soon from LinkedIn!
  • 25. It’s Broken! Now What?
  • 26. All The Best Ops People…
      - Know more of what is happening than their customers
      - Are proactive
      - Fix bugs instead of working around them
      - This applies to our developers too!
  • 27. Anticipating Trouble
      - Trend cluster utilization and growth over time
      - Use default configurations for quotas and retention to require customers to talk to you
      - Monitor request times
          – If you are able to develop a consistent baseline, this is early warning
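One way to turn a request-time baseline into an early warning is to compare the latest sample against the historical mean plus a few standard deviations. This is a minimal sketch under assumptions of my own: the function name `exceeds_baseline`, the three-sigma threshold, and the sample values are illustrative, and the talk does not prescribe a particular statistic.

```python
from statistics import mean, stdev
from typing import Sequence


def exceeds_baseline(
    history_ms: Sequence[float],
    current_ms: float,
    sigmas: float = 3.0,  # assumed threshold; tune to your own noise level
) -> bool:
    """True when the current request time is well above the historical baseline."""
    if len(history_ms) < 2:
        raise ValueError("need at least two samples to estimate a baseline")
    threshold = mean(history_ms) + sigmas * stdev(history_ms)
    return current_ms > threshold


if __name__ == "__main__":
    # Hypothetical total request times (ms) collected during normal operation.
    history = [10, 12, 11, 9, 10, 11, 12, 10]
    print(exceeds_baseline(history, 11))
    print(exceeds_baseline(history, 25))
```

A check like this is only useful when the baseline is consistent, which is the condition the slide calls out.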
  • 28. Under Replicated Partitions
      - Count of partitions that are not fully replicated within the cluster
      - Also referred to as “replica lag”
      - Primary indicator of problems within the cluster
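In Kafka terms, a partition is under replicated when its in-sync replica set is smaller than its assigned replica set. The sketch below shows that comparison on plain data; the `PartitionState` type and the sample values are hypothetical stand-ins for whatever your metadata source returns, and in practice the broker already exposes the count through its `UnderReplicatedPartitions` sensor.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass(frozen=True)
class PartitionState:
    """Hypothetical snapshot of one partition's replica assignment."""
    topic: str
    partition: int
    replicas: Tuple[int, ...]  # broker ids assigned to the partition
    isr: Tuple[int, ...]       # broker ids currently in sync


def under_replicated(partitions: Iterable[PartitionState]) -> List[PartitionState]:
    """Return partitions whose in-sync set is smaller than the assigned set."""
    return [p for p in partitions if len(p.isr) < len(p.replicas)]


if __name__ == "__main__":
    snapshot = [
        PartitionState("clicks", 0, replicas=(1, 2, 3), isr=(1, 2, 3)),
        PartitionState("clicks", 1, replicas=(2, 3, 4), isr=(2, 4)),
    ]
    for p in under_replicated(snapshot):
        print(f"{p.topic}-{p.partition} missing replicas: {set(p.replicas) - set(p.isr)}")
```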
  • 29. Broker Performance Checks
      - Are you still running 0.8?
      - Are all the brokers in the cluster working?
      - Are the network interfaces saturated?
          – Reelect partition leaders
          – Rebalance partitions in the cluster
          – Spread out traffic more (increase partitions or brokers)
      - Is the CPU utilization high? (especially iowait)
          – Is another process competing for resources?
          – Look for a bad disk
      - Do you have really big messages?
  • 30. Kafka’s OK, Now What?
      - If Kafka is working properly, it’s probably a client issue
          – Don’t throw it over the fence. Help your customers understand
      - Common producer issues
          – Batch size and linger time
          – Receive and send buffers
          – Sync vs. async, and acknowledgements
      - Common consumer issues
          – Garbage collection problems
          – Min fetch bytes and max wait time
          – Not enough partitions
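The producer and consumer knobs named above map onto standard Kafka client configuration keys. The sketch below collects them in one place and renders them as a properties file; the key names are the Java client's, but every value is a hypothetical starting point chosen for illustration, not a recommendation from the talk.

```python
from typing import Dict, Union

Setting = Union[int, str]

# Producer: batch size and linger time, socket buffers, acknowledgements.
# All values are illustrative assumptions; tune against your own workload.
PRODUCER_SETTINGS: Dict[str, Setting] = {
    "batch.size": 65536,
    "linger.ms": 10,
    "send.buffer.bytes": 131072,
    "receive.buffer.bytes": 65536,
    "acks": "all",
}

# Consumer: minimum fetch size and the longest the broker may wait to fill it.
CONSUMER_SETTINGS: Dict[str, Setting] = {
    "fetch.min.bytes": 1024,
    "fetch.max.wait.ms": 500,
}


def to_properties(settings: Dict[str, Setting]) -> str:
    """Render settings as sorted key=value lines for a .properties file."""
    return "\n".join(f"{key}={value}" for key, value in sorted(settings.items()))


if __name__ == "__main__":
    print(to_properties(PRODUCER_SETTINGS))
    print(to_properties(CONSUMER_SETTINGS))
```

Keeping the settings a customer actually runs with next to the defaults makes the "help your customers understand" conversation much shorter.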
  • 31. Conclusion
  • 32. One Ecosystem
      - Kafka can scale to millions of messages per second, and more
          – Operations must scale the cluster appropriately
          – Developers must use the right tuning and go parallel
      - Few problems are owned by only one side
          – Expanding partitions often requires coordination
          – Applications that need higher reliability drive cluster configurations
      - Either we work together, or we fail separately
  • 33. Would You Like To Know More?
      - Presentations: http://www.slideshare.net/toddpalino
          – More Datacenters, More Problems
          – Kafka As A Service
          – Always download the originals for slide notes!
      - Blog Posts: https://engineering.linkedin.com/blog
          – Development and SRE blogs on Kafka and other topics
      - LinkedIn Open Source: https://github.com/linkedin/streaming
          – Burrow Consumer Monitoring - https://github.com/linkedin/Burrow
          – Kafka Admin Tools - https://github.com/linkedin/kafka-tools
  • 34. Getting Involved With Kafka
      - http://kafka.apache.org
      - Join the mailing lists
          – users@kafka.apache.org
          – dev@kafka.apache.org
      - irc.freenode.net - #apache-kafka
      - Meetups
          – Apache Kafka - http://www.meetup.com/http-kafka-apache-org
          – Bay Area Samza - http://www.meetup.com/Bay-Area-Samza-Meetup/
      - Contribute code
  • 35. Data @ LinkedIn is Hiring!
      - Streams Infrastructure
          – Kafka pub/sub ecosystem
          – Stream Processing Platform built on Apache Samza
          – Next Generation change capture technology (incubating)
      - LinkedIn
          – Strong commitment to open source
          – Do cool things and work with awesome people
      - Join us in working on cutting edge stream processing infrastructures
          – Please contact kparamasivam@linkedin.com
          – Software developers and Site Reliability Engineers at all levels
  • 36.
  • 37. Appendix
  • 38. JDK Options
      - Heap Size
          -Xmx6g -Xms6g
      - Metaspace
          -XX:MetaspaceSize=96m -XX:MinMetaspaceFreeRatio=50 -XX:MaxMetaspaceFreeRatio=80
      - G1 Tuning
          -XX:+UseG1GC -XX:MaxGCPauseMillis=20 -XX:InitiatingHeapOccupancyPercent=35 -XX:G1HeapRegionSize=16M
      - GC Logging
          -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps -XX:+PrintTenuringDistribution -Xloggc:/path/to/logs/gc.log -verbose:gc
      - Error Handling
          -XX:-HeapDumpOnOutOfMemoryError -XX:ErrorFile=/path/to/logs/hs_err.log
  • 39. OS Tuning Parameters
      - Networking:
          net.core.rmem_default = 124928
          net.core.rmem_max = 2048000
          net.core.wmem_default = 124928
          net.core.wmem_max = 2048000
          net.ipv4.tcp_rmem = 4096 87380 4194304
          net.ipv4.tcp_wmem = 4096 16384 4194304
          net.ipv4.tcp_max_tw_buckets = 262144
          net.ipv4.tcp_max_syn_backlog = 1024
  • 40. OS Tuning Parameters (cont.)
      - Virtual Memory
          vm.oom_kill_allocating_task = 1
          vm.max_map_count = 200000
          vm.swappiness = 1
          vm.dirty_writeback_centisecs = 500
          vm.dirty_expire_centisecs = 500
          vm.dirty_ratio = 60
          vm.dirty_background_ratio = 5
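Kernel settings like the ones on these two slides tend to drift between hosts, so it can help to diff the intended values against what a machine reports. The sketch below parses `key = value` text in sysctl style and lists mismatches; the helper names `parse_sysctl` and `drift` are assumptions, and how you collect the live values (for example from `sysctl -a` output) is left to your environment.

```python
from typing import Dict, Optional, Tuple


def parse_sysctl(text: str) -> Dict[str, str]:
    """Parse sysctl-style 'key = value' lines, skipping blanks and # comments."""
    settings: Dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        # Collapse runs of spaces/tabs so multi-field values compare cleanly.
        settings[key.strip()] = " ".join(value.split())
    return settings


def drift(
    desired: Dict[str, str], actual: Dict[str, str]
) -> Dict[str, Tuple[str, Optional[str]]]:
    """Map each mismatched key to (desired value, actual value or None)."""
    return {
        key: (want, actual.get(key))
        for key, want in desired.items()
        if actual.get(key) != want
    }


if __name__ == "__main__":
    desired = parse_sysctl("""
        vm.swappiness = 1
        vm.dirty_background_ratio = 5
        net.ipv4.tcp_rmem = 4096 87380 4194304
    """)
    # Hypothetical live values, as they might appear with tab separators.
    actual = parse_sysctl("vm.swappiness = 60\nnet.ipv4.tcp_rmem = 4096\t87380\t4194304")
    print(drift(desired, actual))
```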
  • 41. Kafka Broker Sensors
      kafka.server:name=BytesInPerSec,type=BrokerTopicMetrics
      kafka.server:name=BytesOutPerSec,type=BrokerTopicMetrics
      kafka.server:name=MessagesInPerSec,type=BrokerTopicMetrics
      kafka.server:name=PartitionCount,type=ReplicaManager
      kafka.server:name=LeaderCount,type=ReplicaManager
      kafka.server:name=UnderReplicatedPartitions,type=ReplicaManager
      kafka.server:name=RequestHandlerAvgIdlePercent,type=KafkaRequestHandlerPool
      kafka.controller:name=ActiveControllerCount,type=KafkaController
      kafka.controller:name=OfflinePartitionsCount,type=KafkaController
      kafka.log:name=max-dirty-percent,type=LogCleanerManager
      kafka.network:name=NetworkProcessorAvgIdlePercent,type=SocketServer
      kafka.network:name=RequestsPerSec,request=*,type=RequestMetrics
      kafka.network:name=RequestQueueTimeMs,request=*,type=RequestMetrics
      kafka.network:name=LocalTimeMs,request=*,type=RequestMetrics
      kafka.network:name=RemoteTimeMs,request=*,type=RequestMetrics
      kafka.network:name=ResponseQueueTimeMs,request=*,type=RequestMetrics
      kafka.network:name=ResponseSendTimeMs,request=*,type=RequestMetrics
      kafka.network:name=TotalTimeMs,request=*,type=RequestMetrics
  • 42. Kafka Broker Sensors - Topics
      kafka.server:name=BytesInPerSec,type=BrokerTopicMetrics,topic=*
      kafka.server:name=BytesOutPerSec,type=BrokerTopicMetrics,topic=*
      kafka.server:name=MessagesInPerSec,type=BrokerTopicMetrics,topic=*
      kafka.server:name=TotalProduceRequestsPerSec,type=BrokerTopicMetrics,topic=*
      kafka.server:name=FailedProduceRequestsPerSec,type=BrokerTopicMetrics,topic=*
      kafka.server:name=TotalFetchRequestsPerSec,type=BrokerTopicMetrics,topic=*
      kafka.server:name=FailedFetchRequestsPerSec,type=BrokerTopicMetrics,topic=*
      kafka.log:type=Log,name=LogEndOffset,topic=*,partition=*
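The sensor names above follow the JMX object-name layout of `domain:key=value,key=value`, with `*` standing in for a per-request or per-topic value. The sketch below splits a name into its parts and checks a concrete name against a wildcard pattern; `parse_mbean` and `matches` are hypothetical helpers, and the matching is a simplification that ignores quoted values and the fuller JMX pattern rules.

```python
from fnmatch import fnmatchcase
from typing import Dict, Tuple


def parse_mbean(name: str) -> Tuple[str, Dict[str, str]]:
    """Split 'domain:k1=v1,k2=v2' into (domain, {k1: v1, k2: v2})."""
    domain, _, rest = name.partition(":")
    props = dict(pair.split("=", 1) for pair in rest.split(","))
    return domain, props


def matches(pattern: str, name: str) -> bool:
    """True when `name` has the same domain and keys, and values fit the wildcards."""
    p_domain, p_props = parse_mbean(pattern)
    n_domain, n_props = parse_mbean(name)
    return (
        p_domain == n_domain
        and p_props.keys() == n_props.keys()
        and all(fnmatchcase(n_props[key], value) for key, value in p_props.items())
    )


if __name__ == "__main__":
    print(parse_mbean("kafka.server:name=UnderReplicatedPartitions,type=ReplicaManager"))
    print(matches(
        "kafka.network:name=TotalTimeMs,request=*,type=RequestMetrics",
        "kafka.network:name=TotalTimeMs,request=Produce,type=RequestMetrics",
    ))
```

Property order does not matter for the comparison, which is convenient because different tools print the keys in different orders.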