This document discusses using Apache Kafka for database replication in LinkedIn's ESPRESSO database system. It provides an overview of ESPRESSO's architecture and its transition from per-instance to per-partition replication using Kafka. Key aspects covered include Kafka configuration, the message protocol that ensures in-order delivery, and checkpointing by the Kafka producer, which allows replication to resume from the last committed transaction after a failure.
The first section is brief and is intended to frame the requirements for Kafka-based internal replication
Seriously, this should take 5 minutes max
ESPRESSO is a NoSQL, RESTful, HTTP document store
Partition Placement and Replication
Helix assigns partitions to nodes
Initial deployments (0.8) used MySQL replication between nodes
Evolving to (1.0) using Kafka for internal replication
A couple of concepts are key to how Espresso replication with Kafka works.
Time To Market dictated 0.8 Architecture
Delegated intra-cluster replication to MySQL
Replication is at instance level
Rigid partition placement
Graph of 3 hosts in a “slice”: one node is serving 500 to 3K qps, while the other two are serving exactly zero.
Next we’ll explore the reasons for replacing MySQL replication with Kafka.
Upon node failure, rather than one node absorbing 100% of the failed node’s workload, each surviving node takes on an increase of 1/num_nodes of the load (for example, in a ten-node cluster each survivor picks up roughly an extra tenth of the failed node’s traffic)
All subsequent examples show a single partition, to keep the diagrams simple. The same logic runs for every partition.
Sounds like Kafka, right?
Let’s look at the “happy path” for replication:
Each Kafka message contains the SCN of the commit and an indicator of whether it is the beginning and/or end of the transaction
When the consumer sees the first message in a txn, it starts a txn in the local MySQL
Each message generates an “INSERT … ON DUPLICATE KEY UPDATE …” statement
When the consumer processes the last message in a txn, it executes a COMMIT statement
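A minimal sketch of that consumer-side apply loop, assuming a hypothetical ReplicationMessage type and table schema (Espresso’s actual message format and DDL are not shown here):

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

/** Hypothetical view of one replication message; not Espresso's real format. */
interface ReplicationMessage {
  long scn();            // system change number of the originating commit
  boolean isTxnBegin();  // first message of the transaction?
  boolean isTxnEnd();    // last message of the transaction?
  byte[] rowKey();
  byte[] payload();
}

public class TxnApplier {
  private final Connection mysql; // connection to the slave's local MySQL

  public TxnApplier(Connection mysql) {
    this.mysql = mysql;
  }

  /** Messages arrive in commit order; apply each one idempotently. */
  public void onMessage(ReplicationMessage msg) throws SQLException {
    if (msg.isTxnBegin()) {
      mysql.setAutoCommit(false); // first message: open a local transaction
    }
    // Every message becomes an idempotent upsert.
    try (PreparedStatement stmt = mysql.prepareStatement(
        "INSERT INTO espresso_rows (row_key, payload, scn) VALUES (?, ?, ?) "
            + "ON DUPLICATE KEY UPDATE payload = VALUES(payload), scn = VALUES(scn)")) {
      stmt.setBytes(1, msg.rowKey());
      stmt.setBytes(2, msg.payload());
      stmt.setLong(3, msg.scn());
      stmt.executeUpdate();
    }
    if (msg.isTxnEnd()) {
      mysql.commit(); // last message: make the whole transaction durable
    }
  }
}
```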
Old master stops producing
Helix sends a SlaveToMaster transition to the selected slave for the partition
Slave emits a control message to propose the next generation
Once the slave has read its own control message, it updates the generation in the Helix Property Store; if successful, it can start accepting writes
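A minimal sketch of that handoff sequence, using hypothetical ReplicationLog and PropertyStore wrappers rather than the real Kafka and Helix APIs:

```java
public final class MastershipHandoff {
  interface ReplicationLog {             // wraps the partition's Kafka topic
    void produceControl(ControlMessage m);
    ControlMessage pollControl();        // next control message; null when drained
  }
  interface PropertyStore {              // wraps the Helix property store
    boolean compareAndSetGeneration(String partition, int expected, int next);
  }
  record ControlMessage(String partition, String proposer, int generation) {}

  /** Returns true once this node may start accepting writes as master. */
  static boolean becomeMaster(ReplicationLog log, PropertyStore store,
                              String partition, String self, int currentGen) {
    int nextGen = currentGen + 1;
    // 1. Propose the next generation by producing a control message into the
    //    partition's replication stream.
    log.produceControl(new ControlMessage(partition, self, nextGen));
    // 2. Consume until we read our own proposal back: everything the old
    //    master committed before the handoff has now been applied locally.
    ControlMessage m;
    while ((m = log.pollControl()) != null) {
      if (self.equals(m.proposer()) && m.generation() == nextGen) {
        // 3. Record the new generation in the property store; only on
        //    success may this node start accepting writes as master.
        return store.compareAndSetGeneration(partition, currentGen, nextGen);
      }
    }
    return false; // stream exhausted without seeing our proposal
  }
}
```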
The producer periodically writes (SCN, Kafka offset) pairs to a per-partition MySQL table
The offset may only be checkpointed at the end of a complete transaction
On a non-retryable exception, we destroy the producer and restart from the last checkpoint.
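A minimal sketch of that checkpointing, assuming a hypothetical replication_checkpoint table; Espresso’s actual schema and recovery code are not shown:

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class ProducerCheckpointer {
  private final Connection mysql;

  public ProducerCheckpointer(Connection mysql) {
    this.mysql = mysql;
  }

  /** Persist (SCN, Kafka offset) once a full transaction has been acked. */
  public void checkpoint(int partition, long scn, long kafkaOffset) throws Exception {
    try (PreparedStatement stmt = mysql.prepareStatement(
        "INSERT INTO replication_checkpoint (partition_id, scn, kafka_offset) "
            + "VALUES (?, ?, ?) "
            + "ON DUPLICATE KEY UPDATE scn = VALUES(scn), kafka_offset = VALUES(kafka_offset)")) {
      stmt.setInt(1, partition);
      stmt.setLong(2, scn);
      stmt.setLong(3, kafkaOffset);
      stmt.executeUpdate();
    }
  }

  /** After destroying a failed producer, find the SCN to resume from. */
  public long lastCheckpointedScn(int partition) throws Exception {
    try (PreparedStatement stmt = mysql.prepareStatement(
        "SELECT scn FROM replication_checkpoint WHERE partition_id = ?")) {
      stmt.setInt(1, partition);
      try (ResultSet rs = stmt.executeQuery()) {
        return rs.next() ? rs.getLong(1) : 0L;
      }
    }
  }
}
```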
Next we will explore how the client handles these replayed messages.
Here is the replication stream from our master reconnect example
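Because the restarted producer resumes from the last checkpoint, the slave will see some messages twice. A minimal sketch of duplicate filtering, assuming (as the protocol above implies) that SCNs increase monotonically and checkpoints land on transaction boundaries:

```java
/** Restored from the slave's per-partition checkpoint after a restart. */
public class ReplayFilter {
  private long lastCommittedScn;

  public ReplayFilter(long lastCommittedScn) {
    this.lastCommittedScn = lastCommittedScn;
  }

  /** A replayed message carries an SCN we have already committed: skip it. */
  public boolean shouldApply(long scn) {
    return scn > lastCommittedScn;
  }

  /** Call once the COMMIT for the transaction with this SCN succeeds. */
  public void onCommit(long scn) {
    lastCommittedScn = scn;
  }
}
```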
The stall may be due to a garbage collection event, a failing disk, a switch glitch, …
Here the master is in the middle of a transaction
Helix sends a SlaveToMaster transition to one of the slaves
Slave becomes master and starts taking writes
Helix has revoked mastership
Node transitions to ERROR state
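One way to keep a stalled, deposed master’s in-flight writes from being applied is to fence on the generation proposed during handoff. The document does not spell out the exact fencing mechanism, so this sketch assumes each replicated message carries the generation under which it was produced (a hypothetical field):

```java
public class GenerationFence {
  private int currentGeneration; // latest generation seen for this partition

  public GenerationFence(int currentGeneration) {
    this.currentGeneration = currentGeneration;
  }

  /** Writes produced under a stale generation come from a deposed master. */
  public boolean accept(int messageGeneration) {
    return messageGeneration >= currentGeneration;
  }

  /** Invoked when a control message advances the generation. */
  public void onGenerationChange(int newGeneration) {
    currentGeneration = Math.max(currentGeneration, newGeneration);
  }
}
```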
We have the ability to replay binlogged events back into the top of the stack with Last Writer Wins conflict resolution
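A minimal sketch of Last Writer Wins resolution during such a replay, assuming each binlogged event carries a timestamp (the Event shape here is hypothetical):

```java
import java.util.HashMap;
import java.util.Map;

public class LastWriterWins {
  /** Hypothetical shape of a binlogged event. */
  record Event(String rowKey, long timestamp, byte[] value) {}

  private final Map<String, Event> latest = new HashMap<>();

  /** Replay an event; for the same row, the later write wins. */
  public void replay(Event e) {
    latest.merge(e.rowKey(), e,
        (oldE, newE) -> newE.timestamp() >= oldE.timestamp() ? newE : oldE);
  }

  public Event resolved(String rowKey) {
    return latest.get(rowKey);
  }
}
```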
Latency is measured from the time we send a message to Kafka until it is committed in the slave’s MySQL.