More Datacenters, More Problems
Todd Palino
Who Am I?
• Kafka, Samza, and ZooKeeper SRE at LinkedIn
• Site Reliability Engineering
– Administrators
– Architects
– Developers
• Keep the site running, always
Data @ LinkedIn is Hiring!
• Streams Infrastructure
– Kafka pub/sub ecosystem
– Stream Processing Platform built on Apache Samza
– Next-generation change capture technology (incubating)
• Join us in working on cutting-edge stream processing infrastructure
– Please contact kparamasivam@linkedin.com
– Software developers and Site Reliability Engineers at all levels
• LinkedIn Data Infrastructure Meetup
– Where: LinkedIn @ 2061 Stierlin Court, Mountain View
– When: May 11th at 6:30 PM
– Registration: http://www.meetup.com/LinkedIn-Data-Infrastructure-Meetup
Kafka At LinkedIn
 1100+ Kafka brokers
 Over 32,000 topics
 350,000+ Partitions
 875 Billion messages per day
 185 Terabytes In
 675 Terabytes Out
 Peak Load (whole site)
– 10.5 Million messages/sec
– 18.5 Gigabits/sec Inbound
– 70.5 Gigabits/sec Outbound
5
 1800+ Kafka brokers
 Over 95,000 topics
 1,340,000+ Partitions
 1.3 Trillion messages per day
 330 Terabytes In
 1.2 Petabytes Out
 Peak Load (single cluster)
– 2 Million messages/sec
– 4.7 Gigabits/sec Inbound
– 15 Gigabits/sec Outbound
What Will We Talk About?
• Tiered Cluster Architecture
• Multiple Datacenters
• Performance Concerns
• Conclusion
Tiered Cluster Architecture
One Kafka Cluster
Single Cluster – Remote Clients
Single Cluster – Spanning Datacenters
Multiple Clusters – Local and Remote Clients
Multiple Clusters – Message Aggregation
Why Not Direct?
• Network Concerns
– Bandwidth
– Network partitioning
– Latency
• Security Concerns
– Firewalls and ACLs
– Encrypting data in transit
• Resource Concerns
– A misbehaving application can swamp production resources
What Do We Lose?
• You may lose message ordering
– Mirror maker breaks apart message batches and redistributes them
• You may lose key-to-partition affinity (see the sketch below)
– Mirror maker will partition based on the key
– Differing partition counts in source and target will result in differing distribution
– Mirror maker does not (without work) honor custom partitioning
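A minimal sketch of why key-to-partition affinity breaks: the default partitioner hashes the key modulo the partition count, so the same key maps to different partition numbers when source and target clusters have different partition counts. The hash below is a stand-in (Kafka's default partitioner actually uses murmur2), and the key is hypothetical; any hash function shows the same effect.

    public class PartitionAffinityDemo {
        // Stand-in for hash-mod-partitions; Kafka's default partitioner
        // uses murmur2, but the modulo step is what matters here.
        static int partitionFor(String key, int numPartitions) {
            return (key.hashCode() & 0x7fffffff) % numPartitions;
        }

        public static void main(String[] args) {
            String key = "member-12345";  // hypothetical message key
            System.out.println("Source cluster (8 partitions):  " + partitionFor(key, 8));
            System.out.println("Target cluster (12 partitions): " + partitionFor(key, 12));
            // The two results generally differ, so records for the same key
            // land on different partition numbers after mirroring.
        }
    }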
Multiple Datacenters
Why Multiple Datacenters?
• Disaster Recovery
• Geolocalization
• Legal / Political
Planning For the Worst
• Keep your datacenters identical
– Snowflake services are hard to fail over
• Services that need to run only once in the infrastructure need to coordinate
– ZooKeeper across sites can be used for this (at least 3 sites; see the sketch below)
– Moving from one Kafka cluster to another is hard
– KIP-33: Time-based log indexing
• What about AWS (or Google Cloud)?
– Disaster recovery can be accomplished via availability zones
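One way to get the run-only-once coordination mentioned above is leader election against a ZooKeeper ensemble that spans at least three sites. This is a sketch using Apache Curator's LeaderLatch recipe (a common client library for this pattern, not something the talk prescribes); the ensemble hosts and znode path are hypothetical.

    import org.apache.curator.framework.CuratorFramework;
    import org.apache.curator.framework.CuratorFrameworkFactory;
    import org.apache.curator.framework.recipes.leader.LeaderLatch;
    import org.apache.curator.retry.ExponentialBackoffRetry;

    public class SingletonService {
        public static void main(String[] args) throws Exception {
            // Ensemble members spread across three sites (hypothetical hosts)
            String ensemble = "zk-dc1:2181,zk-dc2:2181,zk-dc3:2181";
            CuratorFramework client = CuratorFrameworkFactory.newClient(
                ensemble, new ExponentialBackoffRetry(1000, 3));
            client.start();

            LeaderLatch latch = new LeaderLatch(client, "/services/my-singleton/leader");
            latch.start();
            latch.await();  // blocks until this instance is elected leader

            // Only the elected leader reaches this point; do the
            // run-only-once work here, in whichever datacenter won.
            System.out.println("This instance is now the active copy");
        }
    }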
Planning For the Best
• Not much different from designing for disasters
• Consider having a limited number of aggregate clusters
– Every copy of a message costs you money
– Have 2 downstream (or super) sites that process aggregate messages
– Push results back out to all sites
– A good place for stream processing, like Samza
• What about AWS (or Google Cloud)?
– Geolocalization can be accomplished via regions
Segregating Data
• All the data, available everywhere
• The right data, only where we need it
• Topics are (usually) the smallest unit we want to deal with
– Mirroring a topic between clusters is easy
– Filtering that mirror is much harder
• Have enough topics that consumers do not have to throw away messages
Retaining Data
• Retain data only as long as you have to
• Move data away from the front lines as quickly as possible
Performance Concerns
Buy The Book!
Early Access available now. Covers all aspects of Kafka, from setup to client development to ongoing administration and troubleshooting. Also discusses stream processing and other use cases.
Kafka Cluster Sizing
• How big for your local cluster?
– How much disk space do you have?
– How much network bandwidth do you have?
– CPU, memory, disk I/O
• How big for your aggregate cluster?
– In general, multiply the number of brokers by the number of local clusters
– May have additional concerns with lots of consumers
Topic Configuration
• Partition Counts for Local
– Many theories on how to do this correctly, but the answer is "it depends"
– How many consumers do you have?
– Do you have specific partition requirements?
– Keep partition sizes manageable
• Partition Counts for Aggregate
– Multiply the number of partitions in a local cluster by the number of local clusters
– Periodically review partition counts in all clusters
• Message Retention
– If aggregate is where you really need the messages, only retain them in the local cluster long enough to cover mirror maker problems (see the sketch below)
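As an illustration of the retention point, here is a sketch that sets a short retention.ms on a local-cluster topic so it holds messages just long enough to ride out mirror maker problems. It uses today's AdminClient API, which postdates this talk; the bootstrap address, topic name, and four-hour window are assumptions.

    import java.util.List;
    import java.util.Map;
    import java.util.Properties;
    import org.apache.kafka.clients.admin.Admin;
    import org.apache.kafka.clients.admin.AdminClientConfig;
    import org.apache.kafka.clients.admin.AlterConfigOp;
    import org.apache.kafka.clients.admin.ConfigEntry;
    import org.apache.kafka.common.config.ConfigResource;

    public class LocalRetention {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "local-kafka:9092");  // hypothetical

            try (Admin admin = Admin.create(props)) {
                ConfigResource topic = new ConfigResource(ConfigResource.Type.TOPIC, "page-views");
                // Keep local data for 4 hours; the aggregate cluster keeps the long tail.
                AlterConfigOp setRetention = new AlterConfigOp(
                    new ConfigEntry("retention.ms", String.valueOf(4L * 60 * 60 * 1000)),
                    AlterConfigOp.OpType.SET);
                admin.incrementalAlterConfigs(Map.of(topic, List.of(setRetention))).all().get();
            }
        }
    }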
Where Do I Put Mirror Makers?
• Best practice was to keep the mirror maker local to the target cluster
– TLS (can) make this a new game
• In the datacenter with the produce cluster
– Fewer issues from networking
– Significant performance hit when using TLS on consume
• In the datacenter with the consume cluster
– Highest-performance TLS
– Potential to consume messages and drop them on produce
Mirror Maker Sizing
• Number of servers and streams
– Size the number of servers based on the peak bytes per second
– Co-locate mirror makers
– Run more mirror makers in an instance than you need
– Use multiple consumer and producer streams
• Other tunables to look at (see the sketch below)
– Partition assignment strategy
– In-flight requests per connection
– Linger time
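A sketch of what the tunables above look like as client configuration for a mirror maker's consumers and producers. The property keys are standard Kafka client settings; the specific values are assumptions chosen to illustrate the trade-offs, not recommendations.

    import java.util.Properties;

    public class MirrorMakerTuning {
        public static void main(String[] args) {
            Properties consumerProps = new Properties();
            // Partition assignment strategy: round-robin spreads partitions
            // evenly across the consumer streams in one mirror maker instance.
            consumerProps.put("partition.assignment.strategy",
                "org.apache.kafka.clients.consumer.RoundRobinAssignor");

            Properties producerProps = new Properties();
            // One in-flight request per connection preserves ordering on retries,
            // at the cost of throughput; raise it if ordering matters less.
            producerProps.put("max.in.flight.requests.per.connection", "1");
            // A small linger lets the producer fill bigger batches before sending.
            producerProps.put("linger.ms", "10");
        }
    }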
Segregation of Topics
• Not all topics are created equal
• High Priority Topics
– Topics that change search results
– Topics used for hourly or daily reporting
• Low Latency Topics
• Run a separate mirror maker for these topics (see the sketch below)
– One bloated topic won't affect reporting
– Restarting the mirror maker takes less time
– Less time to catch up when you fall behind
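A sketch of the segregation idea: the consumer side of a dedicated mirror maker subscribes only to its class of topics via a pattern, so one bloated low-priority topic cannot back up the high-priority pipeline. The topic naming convention and addresses are hypothetical, and subscribe(Pattern) without a rebalance listener assumes a recent client version.

    import java.util.Properties;
    import java.util.regex.Pattern;
    import org.apache.kafka.clients.consumer.ConsumerConfig;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.common.serialization.ByteArrayDeserializer;

    public class HighPriorityMirrorConsumer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "source-kafka:9092");  // hypothetical
            props.put(ConsumerConfig.GROUP_ID_CONFIG, "mirror-maker-high-priority");
            props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, ByteArrayDeserializer.class.getName());
            props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, ByteArrayDeserializer.class.getName());

            KafkaConsumer<byte[], byte[]> consumer = new KafkaConsumer<>(props);
            // Only mirror the high-priority topic classes; a separate mirror
            // maker with its own pattern handles everything else.
            consumer.subscribe(Pattern.compile("search-.*|reporting-.*"));
        }
    }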
Conclusion
Broker Improvements
• Namespaces
– Namespace topics by datacenter
– Eliminate local clusters and just have aggregate
– Significant hardware savings
• JBOD Fixes
– Intelligent partition assignment
– Admin tools to move partitions between mount points
– Broker should not fail completely with a single disk failure
Mirror Maker Improvements
• Identity Mirror Maker (sketched below)
– Messages in source partition 0 get produced directly to partition 0
– No decompression of message batches
– Keeps key-to-partition affinity, supporting custom partitioning
– Requires the mirror maker to maintain downstream partition counts
• Multi-Consumer Mirror Maker
– A single mirror maker that consumes from more than one cluster
– Reduces the number of copies of mirror maker that need to be running
– Forces a produce-local architecture, however
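A sketch of the identity mirror maker's core move, assuming source and target topics have identical partition counts: each record is produced to the same partition number it was consumed from, preserving key-to-partition affinity (including custom partitioning) without re-hashing the key. Client setup is omitted, the no-decompression optimization is not shown, and poll(Duration) assumes a recent client version.

    import java.time.Duration;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    public class IdentityMirrorLoop {
        static void mirror(KafkaConsumer<byte[], byte[]> consumer,
                           KafkaProducer<byte[], byte[]> producer) {
            while (true) {
                for (ConsumerRecord<byte[], byte[]> record : consumer.poll(Duration.ofMillis(500))) {
                    // Explicit partition number: partition 0 goes to partition 0,
                    // and so on, instead of repartitioning downstream.
                    producer.send(new ProducerRecord<>(record.topic(), record.partition(),
                                                       record.key(), record.value()));
                }
            }
        }
    }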
Administrative Improvements
• Multiple cluster management
– Topic management across clusters
– Visualization of mirror maker paths
• Better client monitoring
– Burrow for consumer monitoring
– No open source solution for producer monitoring (audit)
• End-to-end availability monitoring
Getting Involved With Kafka
• http://kafka.apache.org
• Join the mailing lists
– users@kafka.apache.org
– dev@kafka.apache.org
• irc.freenode.net – #apache-kafka
• Meetups
– Apache Kafka: http://www.meetup.com/http-kafka-apache-org
– Bay Area Samza: http://www.meetup.com/Bay-Area-Samza-Meetup/
• Contribute code
Data @ LinkedIn is Hiring!
• Streams Infrastructure
– Kafka pub/sub ecosystem
– Stream Processing Platform built on Apache Samza
– Next-generation change capture technology (incubating)
• Join us in working on cutting-edge stream processing infrastructure
– Please contact kparamasivam@linkedin.com
– Software developers and Site Reliability Engineers at all levels
• LinkedIn Data Infrastructure Meetup
– Where: LinkedIn @ 2061 Stierlin Court, Mountain View
– When: May 11th at 6:30 PM
– Registration: http://www.meetup.com/LinkedIn-Data-Infrastructure-Meetup

Editor's Notes

  1. So who am I, and why am I qualified to stand up here? I am a member of the Data Infrastructure Streaming SRE team at LinkedIn. We’re responsible for Kafka and Zookeeper operations, as well as Samza and a couple of iterations of our change capture systems. SRE stands for Site Reliability Engineering. Many of you, like myself before I started in this role, may not be familiar with the title. SRE combines several roles that fit together into one operations position. Foremost, we are administrators: we manage all of the systems in our area. We are also architects: we do capacity planning for our deployments, plan out our infrastructure in new datacenters, and make sure all the pieces fit together. And we are also developers: we identify the tools we need, both to make our jobs easier and to keep our users happy, and we write and maintain them. At the end of the day, our job is to keep the site running, always.
  2. These are the numbers I presented this time last year, as far as how much data we push around in Kafka at LinkedIn. Over the last year, it’s changed significantly. We now have well over 1800 brokers in total in our 80+ clusters, which are managing over 95,000 topics, with over 1.3 million partitions between them, not including replication. We’ve gone from 875 billion messages a day to over 1.3 trillion, and that was a slow day. There is now over 330 terabytes per day flowing into Kafka, almost double, and consumers are reading over 1.2 petabytes per day out. Of course, those are both compressed data numbers. I’m changing the peak load numbers, choosing to show what happens in our single largest production cluster. At peak, we’re receiving over 2 million messages per second, for a total of 4.7 gigabits per second of inbound traffic, and the consumers are reading over 15 gigabits per second at the same time. Again, this is compressed data, and these clusters are held at a max of 60% of disk and network utilization. This is a fairly astonishing growth rate for the amount of data we are moving around with Kafka. Some of it comes from standing up new datacenters, so let’s move directly into what that looks like.
  3. What are the things we are going to cover in this talk? We’ll start by talking about the basics of how Kafka works, very briefly, and move right into what a tiered architecture looks like along with the infrastructure tool we use for creating our tiers – mirror maker. We’ll then step back and discuss the reasons why we run multiple datacenters, and what that means for our design choices. I will cover performance tuning, specifically when it comes to laying out and managing tiered clusters. Lastly, we’ll talk about what work is going on right now that will continue to improve the ecosystem for running large Kafka installations, and what you can do to get involved.
  4. I won’t be going into too much detail on how Kafka works. If you do not have a basic understanding of Kafka itself, I suggest checking out some of the resources listed in the Reference slides at the end of this deck. Here’s what a single Kafka cluster looks like at LinkedIn. I’ll get into some details on the TrackerProducer/TrackerConsumer components later, but they are internal libraries that wrap the open source Kafka producer and consumer components and integrate with our schema registry and our monitoring systems. Every cluster has multiple Kafka brokers, storing their metadata in a Zookeeper ensemble. We have producers sending messages in, and consumers reading messages out. At the present time, our consumers talk to Zookeeper as well, and that works fine for us. In LinkedIn’s environment, all of these components live in the same datacenter, in the same network. What happens when you have two sites to deal with?
  5. Multiple datacenters is where this starts to get interesting. Here is an example of a layout that uses one Kafka cluster. We’re keeping the cluster in a single datacenter, because having it span datacenters is an entirely different level of complexity. In addition, Kafka has no provision for reading from the follower brokers, so you would still be crossing datacenters with your clients. The problem with this layout should be quite obvious – if we lose Datacenter A, we’ve lost everything. Not only do we have concerns with network partitions between the datacenters cutting off access for one consumer or producer or another, but we also have no redundancy at all.
  6. Another single cluster alternative is to have the cluster span all the datacenters. That is, at least a couple of brokers in each datacenter, all sharing the same Zookeeper instance and operating as one cluster. This layout has lots of problems as well, though. First, you’re still dealing with network latency across sites. This is compounded by spreading out the Zookeeper ensemble (and you have to, otherwise you can’t tolerate the loss of the datacenter that the ensemble is in). And your client connections are not as simple as pictured above. They actually cross all over the datacenters and connect wherever the leaders are. The other big problem is that you need to make sure that all partitions have replicas that are in separate datacenters. Brand new in 0.10 is the ability to have the brokers do rack-aware replica assignment, and you can utilize this to make sure the initial replica assignments are correct. Of course, that’s only the initial assignment – if you add more brokers or more datacenters and want to rebalance partitions, you’re on your own. Also, it’s brand new code, and if you’re anything like me you’re wary of relying on it just yet.
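Once you have a replica assignment, whether generated by the 0.10 rack-aware logic or by hand, it is cheap to verify offline that every partition really does span datacenters. A minimal Python sketch, where both the assignment map and the broker-to-datacenter map are hypothetical examples:

# Verify that every partition's replicas span more than one datacenter.
# Both maps are illustrative; in practice they come from cluster metadata.
assignment = {0: [1, 4], 1: [2, 5], 2: [3, 6]}    # partition -> replica broker ids
broker_rack = {1: "dc-a", 2: "dc-a", 3: "dc-a",   # broker id -> datacenter
               4: "dc-b", 5: "dc-b", 6: "dc-b"}

for partition, replicas in sorted(assignment.items()):
    racks = {broker_rack[b] for b in replicas}
    if len(racks) < 2:
        print("partition %d has all replicas in %s" % (partition, racks.pop()))

The same check works unchanged for rack-level placement inside a single datacenter.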
  7. So to improve this situation, we’ll run a Kafka cluster in each of our primary datacenters, A and B. In this layout, C is a lower-tier datacenter where we don’t have producers of the data, only consumers. Consider it a backend environment where you run things like Hadoop. Our producers all talk to the local Kafka cluster. Consumers in the primary datacenters talk to their local cluster as well. Now if we lose either datacenter A or B, the other datacenter can continue to operate. We’ve pushed more complexity on the consumers that need to access all of the data from both datacenters, however. They have to maintain consumer connections to both clusters, and they will have to deal with networking problems that come up as a result. In addition, latency in the network connection can manifest in strange ways in an application.
  8. Now we iterate on the architecture one more time. We add the concept of an aggregate Kafka cluster, which contains all of the messages from each of the primary datacenter local clusters. We also have a copy of this cluster in the secondary datacenter, C, for consumers there to access. We still have cross-datacenter traffic – that can’t be avoided if we need to move data around. But we have isolated it to one application, mirror maker, which we can monitor and assure works properly. This is a better situation than needing to have each consumer worry about it for themselves. We’ve definitely added complexity here, but it serves a purpose. By having the infrastructure be a little more complex, we simplify the usage of Kafka for our customers. Producers know that if they send messages to their local cluster, it will show up in the appropriate places without additional work on their part. Consumers can select which view of the data they need, and have assurances that they will see everything that is produced. The intricacies of how the data gets moved around are left to people like me, who run the Kafka infrastructure itself.
  9. We’ve chosen to keep all of our clients local to the clusters and use a tiered architecture due to several major concerns. The primary concern is around the networking itself. Kafka enables multiple consumers to read the same topic, which means if we are reading remotely, we are copying messages over expensive inter-datacenter connections multiple times. We also have to handle problems like network partitioning in every client. Granted, you can have a partition even within a single datacenter, but it happens much more frequently when you are dealing with large distances. There’s also the concern of latency in connections – distance increases latency. Latency can cause interesting problems in client applications, and I like life to be boring. There are also security concerns around talking across datacenters. If we keep all of our clients local, we do not have to worry about ACL problems between the clients and the brokers (and Zookeeper as well). We can also deal with the problem of encrypting data in transit much more easily. This is one problem we have not worried about as much, but it is becoming a big concern now. The last concern is over resource usage. Everything at LinkedIn talks to Kafka, and a problem that takes out a production cluster is a major event. It could mean we have to shift traffic out of the datacenter until we resolve it, or it could result in inconsistent behavior in applications. Any application could overwhelm a cluster, but there are some, such as applications that run in Hadoop, that are more prone to this. By keeping those clients talking to a cluster that is separate from the front end, we mitigate resource contention.
  10. Why do we even want to have multiple datacenters in the first place? The answer may be obvious to some, but it’s always worth starting out with a clear statement of what we’re hoping to accomplish. This makes it easy to make decisions about how the architecture should look to solve the problem that we intend to solve. The single largest reason behind having multiple sites is for disaster recovery. If one site is down for some reason, whether that is due to a loss of power, a natural disaster, or zombies, we want to be able to continue operations. Hopefully in an uninterrupted manner. This means that we need the entire stack of applications represented in all datacenters. We also need enough capacity to handle the failure scenario we are planning for. If you have 2 datacenters, you want to plan for losing one. If you have 3 datacenters, you might still only plan for the loss of one, which means you don’t have to have 300% of your planned capacity out there. Another big reason for multiple sites is to get the content as close to the users as possible. A round-trip for data that has to go halfway around the world is at least half a second, and if you have a few of those in a page load, your users are going to have a bad time. There are a lot of ways to accomplish this goal, from using a ready-made content distribution network, to setting up points of presence that terminate and aggregate connections, to having full-blown datacenters in strategic locations. A sometimes overlooked reason for multiple datacenters, especially for those of us with our heads down in the code, is legal and/or political reasons. There can be tax advantages to having datacenters in certain countries. There may also be legal requirements for serving or storing a local user’s content locally. You may think about China here, but I think more about countries like Germany that have strong federal data protection laws.
  11. Disaster recovery is the first place we look – if we want to get big, we have to be able to tolerate problems. It gets much easier to run more than one datacenter when all of them look exactly the same. Run all the same services everywhere, because special little snowflakes are difficult to move around and fail over. This can be difficult in practice, however, so we need to make accommodations for them. We call these “single master” services – there can be only one of them within our infrastructure. Services like this will need to coordinate, because in order to handle a disaster you’ll need to have a second copy of them somewhere, whether as a warm or cold standby. This requires cross-site coordination, and one way to do this is to use Zookeeper. Keep in mind that your Zookeeper ensemble needs to live in at least 3 sites to make this work. So if you have two datacenters, you’ll need a third site to act as an arbiter. Another problem with this type of application is that you have to fail over your position from one Kafka cluster to another. As offsets are different between clusters, this is a little complex to do. We currently have the ability to request an offset from the broker given a timestamp, but you probably won’t get what you want. Kafka satisfies this request by giving you the first offset of the log segment that your timestamp falls into by looking at the modification time of the log segments. This means that the offset you get is only guaranteed to be before the timestamp you asked for. It could be seconds, minutes, hours, or days before, but it will be before. Thankfully, there’s a fix coming for this. KIP-33, worked on as a set of changes by Becket, one of LinkedIn’s developers, adds a time-based index. This will allow Kafka to return a much more granular offset as an answer to that query. You may ask me “what about running in Amazon Web Services?” The answers are all the same, it’s just the language that is different. Rather than speaking about datacenters, we talk about “availability zones”.
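For the single-master coordination described above, one concrete approach is ZooKeeper’s leader-election recipe. Here is a minimal sketch using the kazoo Python client; the hostnames, the election path, and the identifier are placeholders, and the ensemble is assumed to span at least three sites, as noted above:

# Cross-site single-master coordination via ZooKeeper leader election.
from kazoo.client import KazooClient

def run_single_master_service():
    # Only the currently elected instance ever executes this.
    print("this instance is now the active master")

zk = KazooClient(hosts="zk-dca:2181,zk-dcb:2181,zk-dcc:2181")
zk.start()
election = zk.Election("/services/my-single-master", identifier="host-1")
election.run(run_single_master_service)   # blocks until elected, then runs

If the active instance dies or loses its session, another candidate wins the election, which gives you exactly the warm-standby failover behavior described above.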
  12. But nobody likes thinking about disasters, so let’s plan for the best of all worlds – you have so much traffic you need more capacity and you want to spread it out so that your users get the best experience. It’s actually not that much different from disaster planning. You’re still working with multiple sites, but you’re probably going to add more over time. So where you really only need two sites for DR, you might have 3, 4, or dozens of sites for geolocalization. One thing to be concerned about here is where you have aggregate Kafka clusters. Every copy of your messages is going to cost you money – both in storage and bandwidth. A good plan here is to designate a limited number of datacenters (2 is a good number, remembering DR) where aggregate data lives. This could be two “back office” type sites, or it could be two of your production datacenters that are specifically set up for this. Run the applications that need aggregate data there, and then push the results out to applications in the other sites to work with. This is a great place to run stream processing in a framework like Samza, where you are performing online transformation of your messages and generating new data for applications. Again, when it comes to AWS and Google Cloud, and the like, you’re just dealing with a difference in language. When you want to do geolocalization, you’re most likely going to talk about using different regions. But keep an eye on those data transfer costs!
  13. Legal requirements for data storage can present a lot of difficulties for us. We may start out wanting to make all the data available everywhere, but the reality is that what we really want is to make sure the right data is available where it is needed. What the “right data” is can be heavily influenced by the legal and political concerns we just mentioned. For example, what happens if we store a US user’s data in China? Or the opposite? If we have a datacenter in Germany, is there a problem if a German citizen’s data is stored outside the country? Do we need to handle it differently? For the most part, we only want to deal in whole topics. If we want to mirror messages from one Kafka cluster to another, it’s much easier for us to just mirror the entire topic. If it’s required that we filter the messages as we mirror them, a significant layer of complexity has been added. We now have to inspect every message and make a decision. It also complicates monitoring systems like audit (which I will talk about later). So what is the best practice? Have enough topics so that your consumers can operate without throwing away messages. This means that you should try to avoid having consumers that filter messages – you should be able to accomplish that by only consuming one of several topics.
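One way to follow that best practice is to split data into per-region or per-classification topics at produce time, so that both mirroring and consumption stay at whole-topic granularity. A hedged sketch using the kafka-python client; the topic naming scheme and the region field are hypothetical:

# Route messages to per-region topics at produce time so downstream
# mirror makers can whitelist whole topics instead of filtering messages.
from kafka import KafkaProducer

producer = KafkaProducer(bootstrap_servers="localhost:9092")

def send_profile_update(member_region, payload_bytes):
    topic = "profile-updates." + member_region    # e.g. profile-updates.de
    producer.send(topic, payload_bytes)

send_profile_update("de", b'{"member": 12345}')

A mirror maker feeding the German cluster can then whitelist profile-updates.de and never has to inspect a single message.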
  14. Does anyone know who this woman is? Her name is Helen Dixon, and she’s the Irish Data Protection Commissioner. Like many other companies, we hold ourselves to the standards of the IDPC for protecting our members’ data. There are specific rules about how long you can retain certain types of data, and how quickly it must be removed if requested. This leads us to our first best practice when it comes to message retention – only keep it around as long as you have to. This is not only important from a protection and privacy point of view, but from a resource point of view as well. The more data we retain, the more storage we need to have online. Fast storage is expensive, and we’re already keeping multiple replicas within a cluster. We also need to consider security – the more data we have, the more vulnerable we are making ourselves. This leads us to another good idea – keep the data away from the front lines of your infrastructure as much and as quickly as you can. The further back into your infrastructure your data is, the more layers an attacker would have to go through to get to it, the less vulnerable that data is. This means that if we can, we should have lower retention times in the first level clusters, relying on either other Kafka clusters (that are deeper into our infrastructure) or other systems entirely (like HDFS) to store messages for longer periods of time.
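Since retention is a per-topic override in Kafka, the short-retention-up-front pattern can be applied tier by tier. A sketch assuming kafka-python’s admin client (a recent enough version of that library); the broker address and topic name are placeholders, and retention.ms is the standard topic-level setting:

# Shorten retention on a front-line topic; the aggregate cluster's copy
# of the same topic would be left at a longer retention.
from kafka.admin import KafkaAdminClient, ConfigResource, ConfigResourceType

admin = KafkaAdminClient(bootstrap_servers="local-cluster:9092")
admin.alter_configs([
    ConfigResource(ConfigResourceType.TOPIC, "page-views",
                   configs={"retention.ms": str(4 * 60 * 60 * 1000)})  # 4 hours
])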
  15. More components means that we have more places to poke and prod to get the most efficiency out of our system. With multiple tiers most of this revolves around making sure the sizes of everything are correct.
  16. This is as good a time as any for a little self-promotion. Many of the questions around how to set up and lay out Kafka clusters, including specific performance concerns and tuning, are covered in this fine book that I am co-authoring. You’ll also find a trove of information about client development, stream processing, and a variety of use cases for Kafka. We currently have 4 chapters complete, and it’s available from O’Reilly under their early access program. We expect to have the book completed late this year, or early next, with chapters being released as soon as we can write them.
  17. It may not strictly be tuning, but having your Kafka clusters be the right size is the first place you need to start. It can be a little difficult to determine exactly how large your cluster should be, but there are a few major points you have to consider. First, how much disk space do you have on each broker, and how much space do you need to maintain message retention? One of our rules is that we keep disk usage for the log segments partition to under 60%. This allows us enough headroom to move partitions around when needed (especially because retention for the partition resets when you move it). The next concern is how much network you have. If you have gigabit network interfaces, and your Kafka cluster is going to receive 5 gigabits per second of traffic at peak, you need to take that into account. CPU, memory, and disk I/O all take a back seat to these concerns, because they mostly drive how fast your cluster operates, not if it operates at all. Size your local clusters first, then you can consider how large your aggregate cluster needs to be. For the most part, taking the number of local clusters you have, multiplied by the number of brokers in each local cluster, will give you the size of your aggregate cluster. If you have 3 local clusters with 10 brokers each, your aggregate cluster should be at least 30 brokers. You’ll also need to take into account the number of consumers of the aggregate messages. This can affect how much network bandwidth you need, which will change the number of brokers that you need.
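The disk half of that sizing exercise reduces to back-of-envelope arithmetic. A sketch that follows the 60% usage guideline above; all inputs are made-up example numbers, and the real exercise has to run the same kind of check against network bandwidth:

# Back-of-envelope broker count from disk capacity and retention needs.
def brokers_needed(daily_in_tb, retention_days, replication_factor,
                   disk_per_broker_tb, max_disk_usage=0.60):
    total_tb = daily_in_tb * retention_days * replication_factor
    usable_tb = disk_per_broker_tb * max_disk_usage
    return int(total_tb / usable_tb) + 1

# e.g. 10 TB/day in, 4 days retention, RF 2, 12 TB of disk per broker
print(brokers_needed(10, 4, 2, 12))   # -> 12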
  18. You will also need to size your topics appropriately. This tends to be a topic of much discussion, because there are so many variables that get considered here. For example: How many brokers do you have? Do you want to perfectly balance the topic across the brokers? If so, you should have the number of partitions be a multiple of the number of brokers. How many consumers does your topic have in its largest consumer group? If you have 8 partitions and 16 consumers, 8 of those consumers will be sitting idle. Does your application have specific requirements around partition counts? If you are using keyed messages, you may want to go with a larger number of partitions to start with so you don’t have to expand it later based on other criteria. Another concern we have is around keeping the size of a partition on disk manageable. Very large partitions can be harder to keep balanced in a cluster to make sure each broker is doing its fair share of work. We use a guideline internally of making sure partitions do not exceed 50 gigabytes on disk. When they get close to or exceed that, we expand the topic (provided it is not a keyed topic). Once again here, the partition counts in your aggregate cluster should be a simple calculation. For most topics, we take the number of partitions in the local cluster, multiply it by the number of local clusters, and that is the number of partitions in the aggregate cluster. You also want to check your partition counts regularly, especially if you use automatic topic creation. An imbalance in the number of partitions between clusters can bog down mirror maker very quickly. Another thing to consider is how much retention you need. There’s nothing that says that retention has to be the same in the local and aggregate tiers. You may want to retain messages longer in aggregate, keeping them in the local tier only long enough to get them out. Just remember that this can change your sizing calculation for the aggregate clusters. If you have twice the retention, you may very well need twice the number of brokers.
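Those heuristics also reduce to simple arithmetic. A sketch with illustrative inputs that rounds up to a multiple of the broker count, covers the largest consumer group, and respects the 50 gigabyte partition guideline:

# Partition count heuristics for a local cluster, plus the aggregate rule.
def local_partition_count(num_brokers, max_consumers, topic_size_gb,
                          max_partition_gb=50):
    needed = max(max_consumers, topic_size_gb // max_partition_gb + 1)
    # Round up to a multiple of the broker count for even balance.
    return ((needed + num_brokers - 1) // num_brokers) * num_brokers

local = local_partition_count(num_brokers=10, max_consumers=16, topic_size_gb=400)
aggregate = local * 3   # three local clusters feeding one aggregate
print(local, aggregate)  # -> 20 60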
  19. The answer of where to put the mirror makers, local to the source cluster or local to the target cluster, used to have a simple answer – putting it with the target cluster was safer. However, now that Kafka has introduced TLS communications, and we have a push to encrypt all cross-datacenter traffic, the game has changed. If we put the mirror maker in the datacenter with the produce cluster, we have a better layout from a network point of view. If a mirror maker fails to consume messages, it will just not commit offsets and you will not lose messages. You will just slow down or stop consuming. If the mirror maker fails to produce messages, then more likely than not it will start dropping messages. This means we want to keep network problems on the consumer side as much as possible. However, there’s a significant performance hit to consuming messages over TLS, due to the loss of a zero copy send in the broker. When producing messages over TLS, we don’t take this performance hit, so if you don’t have a directive to encrypt every link, even within your datacenter, placing the mirror makers in the datacenter with the consume cluster is now the way to go. Newer mirror maker versions, which exit on any produce failure, make this a little easier to deal with.
  20. We introduced another component with Mirror Maker, so we also need to size that appropriately. When we talk about sizing, we are talking about the number of copies of mirror maker with the same consumer and producer configuration, which is one pipeline. With mirror maker, it’s mostly about network throughput. Because of the decompression and recompression of message batches, you’re probably never going to run at wire speed. This means that you can easily co-locate multiple mirror makers on one set of servers to efficiently use them. You should also make sure that you are running more copies of mirror maker than you need to handle your peak traffic. If you fall behind, such as if you have a network partition for a period of time, you want to be able to catch up quickly. If you don’t have excess capacity, it will take a long time, or you will just continue to fall behind. You should also run multiple consumer and producer streams in each copy of mirror maker, as this will allow you to take advantage of the parallel nature of having multiple partitions. If you can process 15 megabytes per second at peak on one stream, you won’t get 30 with two streams, but you’ll do a lot better. We run with 8 consumer streams and 4 producer streams, and it works out pretty well. We also co-locate up to 11 mirror makers on one host, each for a separate pipeline. There are a few other parameters to consider. One is the partition assignment strategy. We asked our developers to add a round robin strategy of balancing partitions for wildcard consumers, like mirror maker. This provides a nice balance of partitions across your mirror makers, and should almost certainly be the configuration you use. You should also set the number of in flight requests per connection. A higher number will make things go faster, but it will also mean more loss of messages if mirror maker breaks. The linger time for the producer is another thing to look at. A longer linger time will allow mirror maker to assemble more efficient batches of messages, but it will also mean that messages take a little longer to get through the pipeline. Weigh which tradeoffs are the right ones for you.
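To make those tunables concrete, here is a sketch of the relevant client settings, written as Python dicts keyed by Kafka property names; exact property names vary by client version, and the values shown are illustrative starting points, not recommendations:

# Mirror maker pipeline settings (illustrative values only).
consumer_config = {
    "bootstrap.servers": "source-cluster:9092",
    "partition.assignment.strategy": "roundrobin",  # balance wildcard subscriptions
}
producer_config = {
    "bootstrap.servers": "target-cluster:9092",
    "max.in.flight.requests.per.connection": 2,  # higher is faster, loses more on failure
    "linger.ms": 100,   # bigger batches at the cost of pipeline latency
}
num_consumer_streams = 8   # what we run at LinkedIn
num_producer_streams = 4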
  21. Another thing you can do is to provide separate paths for different topics between the same two clusters. We do this at LinkedIn because not all topics are created equal. We have high priority topics. For you, this could be topics that change search results. For us, it’s mostly topics that are used for hourly or daily reporting. Most other topics, especially headed to Hadoop, we’re OK if they’re a little delayed. But if the hourly report to the executives is delayed, you can be sure that I’m fielding phone calls as to exactly what is broken and when it will be fixed. Another type of topic you may want to separate out is low latency topics. This could be topics that are used for messaging, or things that generate responses to users. We don’t want to wait for these messages to batch up and send. For these topics, we have two separate mirror maker pipelines that run in parallel. The high priority mirror maker has a small whitelist of topics, and the other mirror maker has a blacklist that contains the same topic list. This way a bloated topic that is not considered a priority will not delay the most important topics. It also means that the priority mirror maker starts up faster, and takes less time to catch up when there is a problem. These are all very good things if it means the CEO doesn’t know my name.
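Since the whitelist for the priority pipeline and the blacklist for the default pipeline must contain exactly the same topics, it is worth generating both from one source of truth. A tiny sketch with hypothetical topic names:

# One priority list drives both pipelines, so they can never drift apart.
HIGH_PRIORITY = ["search-updates", "hourly-reporting", "daily-reporting"]

priority_whitelist = "|".join(HIGH_PRIORITY)   # pipeline 1 mirrors only these
default_blacklist = "|".join(HIGH_PRIORITY)    # pipeline 2 mirrors everything else

print(priority_whitelist)   # search-updates|hourly-reporting|daily-reporting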
  22. There are a number of things that can be improved upon, both in the brokers and in the mirror maker, to make it easier to set up and manage multiple datacenters. In the architecture I laid out, and the one in use currently at LinkedIn, we have both local and aggregate clusters in every datacenter. This is wasteful, because we have two copies of the local cluster’s messages in a datacenter. If we were to use namespaces, we could have a single aggregate cluster that contains the messages for all datacenters, and clients could choose to consume either local or aggregate messages by using a wildcard. Namespaces could be accomplished either natively in Kafka (a KIP which has already come up and been rejected), or could be handled by using our wrapper library to prefix topic names. Another big problem is that we are using RAID and providing a single mount point to the Kafka brokers for a log dir. This is because there are some issues with the way JBOD is handled in the broker. Specifically, the brokers assign partitions to log dirs by round robin, not taking into account current size. In addition, there are no administrative functions to move partitions from one directory to another. And if a single disk fails, the entire broker fails. If JBOD was more robust, we could have replication factors of 3 or 4 without an increase in hardware cost, which would allow us to have “no data loss” configurations.
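The wrapper-library flavor of namespacing is easy to sketch: prefix every topic with its origin datacenter, and let consumers choose a view with a subscription pattern. The prefixes and topic names here are hypothetical:

# Datacenter namespacing in a client wrapper library (conceptual sketch).
LOCAL_DC = "dc-a"

def namespaced(topic):
    return LOCAL_DC + "." + topic             # e.g. "dc-a.page-views"

local_view_pattern = r"^dc-a\.page-views$"       # just this datacenter's copy
aggregate_view_pattern = r"^dc-.*\.page-views$"  # every datacenter's copy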
  23. The big improvement to mirror maker is the creation of an identity mirror maker, which would keep message batches together in the exact same partition from source to target cluster. This would completely eliminate the compression overhead from the mirror maker, making it much faster and more efficient. Of course, this requires maintaining the partition counts in the clusters properly, and allowing the mirror maker to increase partition counts in a target cluster if needed. Another issue with mirror maker is that as you add new local clusters, the number of paths you have to mirror messages over grows quadratically. If each path is a mirror maker instance, this gets out of control very quickly. There’s no good solution for this right now – you used to be able to configure a mirror maker with multiple consumers, but not in 0.8 or later. Bringing it back might not be a bad idea as a way to reduce the infrastructure complexity.
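The core of the identity idea fits in a few lines. A conceptual sketch with kafka-python, with placeholder addresses; a real identity mirror maker would also avoid decompressing batches and keep target partition counts in sync, neither of which this sketch does:

# Identity mirroring: each record is produced to the same partition number
# it came from, preserving key-to-partition affinity across clusters.
from kafka import KafkaConsumer, KafkaProducer

consumer = KafkaConsumer("page-views", bootstrap_servers="source:9092")
producer = KafkaProducer(bootstrap_servers="target:9092")

for record in consumer:
    producer.send(record.topic, value=record.value, key=record.key,
                  partition=record.partition)   # same partition out as in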
  24. That leads into the idea of multi-cluster management. While there are a couple of people making some headway on this in the open source world, we still lack a solid interface for managing Kafka clusters as part of an overall infrastructure. This would include maintaining topic configurations across multiple clusters and easily configuring and visualizing the mirror maker links between them. Another piece needed is better client monitoring overall. Burrow provides us with a good view of what the consumers are doing, but there’s nothing available yet for producer client monitoring. We, of course, have our internal audit system for this. And other companies have their own versions as well. It would be nice to have an open source solution that anyone can use for assuring that the producers are working properly. We could also use better end-to-end monitoring of our Kafka clusters, so we can know that they are available. We have a lot of metrics that can track information about the individual components, but without a client view of the cluster, we don’t know if the cluster is actually available. We also have a hard time making sure that the entire pipeline is working properly. There’s not a lot available for this right now, but watch this space…
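One cheap starting point for that end-to-end view is a canary: produce a timestamped message and alert if it does not come back within a deadline. A rough sketch with kafka-python and a hypothetical topic name; a production probe would position the consumer before producing so the canary cannot be missed:

# Minimal end-to-end availability probe for a Kafka cluster.
import time
from kafka import KafkaConsumer, KafkaProducer

producer = KafkaProducer(bootstrap_servers="cluster:9092")
consumer = KafkaConsumer("kafka-canary", bootstrap_servers="cluster:9092",
                         auto_offset_reset="earliest",
                         consumer_timeout_ms=30000)   # give up after 30 seconds

producer.send("kafka-canary", str(time.time()).encode())
producer.flush()

for record in consumer:
    latency = time.time() - float(record.value)
    print("end-to-end latency: %.3f seconds" % latency)
    break
else:
    print("ALERT: canary message never arrived")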
  25. So how can you get more involved in the Kafka community? The most obvious answer is to go to kafka.apache.org. From there you can join the mailing lists, either on the development or the user side. You’ll find people on the #apache-kafka channel on Freenode IRC if you have questions. We also coordinate meetups for both Kafka and Samza in the Bay Area, with streaming available if you are not local. You can also dive into the source repository, and work on and contribute your own tools back. Kafka may be young, but it’s a critical piece of data infrastructure for many of us.