This is the slide deck which was used for the talk 'Change Data Capture using Kafka' at the Kafka Meetup at LinkedIn (Bangalore) held on 11th June 2016.
The talk describes the need for CDC and why it's a good use case for Kafka.
22. Kafka has it all
▪ Horizontally Scalable
▪ Durable – Replication at Partition level
▪ Low latency
▪ High throughput
▪ Data is kept on disk
▪ Log compaction
Imagine you are the owner of a new web-based company with a simple web application. As you are just beginning, you start small, and the webapp probably has the stereotypical three-tier architecture. You have some clients (which may be web browsers, or mobile apps, or both), which make requests to a web application running on your servers. The web application is where your application code or “business logic” lives. Whenever the application wants to remember something for the future, it stores it in a database. And whenever the application wants to look up something that it stored previously, it queries the database. This approach is simple to understand and works pretty well.
Let’s say your business flourishes and you are now attracting a lot of new customers. You ask your end users for feedback and realize that most people complain about slow performance and the lack of rich search functionality. You set up a cache to store pre-rendered HTML pages and other content to speed things up for the end users. You also realize that the basic search functionality of the DB is not good enough for the types of searches that are required, so you set up a separate indexing service.
Perhaps you need to move some expensive operations out of the web request flow, and into an asynchronous background process, so you add a message queue which lets you send jobs to your background workers.
Next, you want to send notifications, such as emails or push notifications, to your users, so you chain a notification system off the side of the job queue for background workers, and it perhaps needs a database of its own to keep track of its state. At this point, you are generating a lot of data that needs to be analyzed, and you can’t have your business analysts running big, expensive queries on your main database, so you add Hadoop and load the data from the database into it. Now you realize that since you have all the data in HDFS anyway, you could actually build your search indexes in Hadoop and push them out to the search servers. All the steps you have taken to improve the system have worked rather well; however, a flurry of issues has accumulated over time.
There are many instances where a single action at the web application triggers multiple concurrent writes to various data stores.
This approach of “dual writes” has a couple of problems associated with it, as we will see in the following slides.
The first and most obvious issue with dual writes is the race condition involved. In this slide, we are looking at two different operations triggered by the webapp, each making its way to two data stores.
Given that there is no coordination between the processes/threads on the webapp that issued these two operations, they might arrive in an order that leaves the data stores in inconsistent states, as shown in the slide. These are the worst kind of inconsistencies, as they don’t produce any error or exception anywhere! They are introduced silently and cause your data stores to diverge from each other.
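To make the race concrete, here is a minimal sketch (the store names and values are hypothetical, not from the talk) showing one possible interleaving of two concurrent dual writes that leaves two stores disagreeing without any error being raised:

```python
# Hypothetical sketch: two concurrent operations ("set x=1" from thread A,
# "set x=2" from thread B) reach two data stores in different orders.
db, cache = {}, {}

db["x"] = 1      # A's write reaches the DB first
db["x"] = 2      # B's write reaches the DB second -- B "wins" in the DB
cache["x"] = 2   # B's write reaches the cache first
cache["x"] = 1   # A's write reaches the cache second -- A "wins" in the cache

print(db["x"], cache["x"])  # 2 1 -- the stores silently diverge
```

No exception was thrown at any point, yet a reader hitting the DB and a reader hitting the cache now see different values for the same key.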
But I’d say you don’t even have to go that far to find a problem with the dual-writes approach.
This slide shows a single write operation being carried out by the webapp against two data stores. One data store gets the write and sends an ack back; however, the ack from the other data store is never received. It could be due to a network issue, an issue at the data store’s end, an issue with the webapp itself, etc., but the end result is that the operation went through to some data stores while we cannot say for certain that all data stores received it. At this point, we either add retry logic to the webapp, or to the data stores, or to both, to try to recover from this situation. The underlying problem is the attempt to atomically perform an operation on multiple data stores: either the operation should be performed on all the data stores at once, or on none of them.
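A small sketch of this partial-failure case, using a made-up `FlakyStore` class to stand in for a data store whose ack is lost:

```python
# Hypothetical sketch: a dual write where one store never acknowledges.
class FlakyStore:
    def __init__(self, fail=False):
        self.data, self.fail = {}, fail

    def write(self, key, value):
        if self.fail:
            # Simulated network issue: the ack never comes back.
            raise TimeoutError("ack never received")
        self.data[key] = value

db = FlakyStore()
search_index = FlakyStore(fail=True)

db.write("user:1", "alice")                 # succeeds
try:
    search_index.write("user:1", "alice")   # ack never received
except TimeoutError:
    pass  # the webapp is now stuck: retry? roll back the DB? give up?

print("user:1" in db.data, "user:1" in search_index.data)  # True False
```

The webapp cannot tell whether the second store applied the write and lost the ack, or never applied it at all, which is exactly why ad-hoc retry logic gets messy.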
“The Answer to the Great Question... Of Life, the Universe and Everything... Is... Forty-two,' said Deep Thought, with infinite majesty and calm.”
― Douglas Adams, The Hitchhiker's Guide to the Galaxy
The answer to our problem is a log. A log is perhaps the simplest possible storage abstraction: an append-only, totally-ordered sequence of records ordered by time.
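The abstraction is small enough to sketch in a few lines; here is a minimal in-memory version (an illustration only, not Kafka's implementation):

```python
# A minimal append-only, totally-ordered log.
class Log:
    def __init__(self):
        self._records = []

    def append(self, record):
        self._records.append(record)
        return len(self._records) - 1   # offset of the new record

    def read(self, offset):
        return self._records[offset:]   # all records from `offset` onward

log = Log()
log.append({"op": "set", "key": "x", "value": 1})
log.append({"op": "set", "key": "x", "value": 2})
print(log.read(0))  # records come back in exactly the order they were appended
```

Writers can only append at the tail, and every reader sees the same records in the same order, identified by their offsets.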
Rather than describing how exactly the log solves our problems, let’s look at an already solved problem and see what we can learn from it. This slide shows how database replication happens. Every database maintains a log of all the transactions that happen on the DB, which the follower then uses to reach the same state as the leader DB. One key thing to note here is that even though the leader DB is itself subjected to multiple concurrent writes, when the writes do happen on the DB they follow a particular order, which is recorded in the transaction log of the DB. Hence, the log effectively removes the concurrency from the writes.
The leader DB appends transactions to the transaction log, and the follower then applies them in order. The reason this works despite failures is that the follower maintains its current position in the transaction log. Let’s say the follower DB suffers a failure. Whenever it comes back up, it reads the log position it was at before the failure and resumes log consumption from that position onwards. Hence, disaster recovery on the consumers is fairly straightforward.
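The position-tracking idea can be sketched like this (a toy `Follower` class for illustration; a real follower would persist its position durably):

```python
# Hypothetical sketch: a follower applies log records in order and remembers
# its position, so after a crash it resumes where it left off, not from zero.
class Follower:
    def __init__(self):
        self.state, self.position = {}, 0

    def consume(self, log):
        for record in log[self.position:]:
            self.state[record["key"]] = record["value"]  # apply in log order
            self.position += 1                           # persisted in real life

log = [
    {"key": "x", "value": 1},
    {"key": "x", "value": 2},
    {"key": "y", "value": 3},
]

f = Follower()
f.consume(log[:2])  # follower crashes after consuming two records...
f.consume(log)      # ...and on restart resumes from position 2, not 0
print(f.state, f.position)  # {'x': 2, 'y': 3} 3
```

Because records are applied in log order and each one is applied exactly once past the saved position, the follower converges to the leader's state even across restarts.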
Equipped with the insights we gained from the use of the log in DB replication, we are now ready to propose a solution architecture.
In this proposed solution, we let the webapp write to a log. This log can then be consumed by the various data stores, including the DB itself. Given that the data is written to only one place (the log), we don’t see the race condition issue we saw with dual writes.
Also, as the webapp only has to write to a single data store (the log), we avoid the problem of writing concurrently to multiple data stores that we saw with the dual-write strategy.
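The proposed arrangement can be sketched as follows (store names are hypothetical): the webapp has a single write path, and every store derives its state from the same ordered log.

```python
# Sketch of the proposed solution: the webapp appends every write to one log,
# and each data store builds its own state by consuming that same log.
log = []

def webapp_write(op):
    log.append(op)   # the only write path -- no dual writes, no races

webapp_write({"key": "x", "value": 1})
webapp_write({"key": "x", "value": 2})

# Every consumer replays the same totally-ordered log, so all of them
# converge to the same state regardless of when they start reading.
db, cache = {}, {}
for store in (db, cache):
    for op in log:
        store[op["key"]] = op["value"]

print(db == cache)  # True
```

A brand-new store added later would simply replay the log from offset 0 and end up identical to the others.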
We realize that the log-based queue we require needs to have some basic qualities, as described in this slide. One major requirement is that we should be able to add new data stores that bootstrap by consuming from the log and then consistently maintain the same state as the other data stores by continuously consuming all writes arriving at the log.
The specific quality that makes Kafka stand out from other message queues for this use case is that Kafka stores the data on disk while still providing performance comparable to in-memory message queues.
The ability to keep data on disk allows new consumers to start reading from the oldest message in the log and consume messages in order until they are completely caught up with the change stream. Hence, bootstrapping a new consumer is simple and straightforward. Log compaction is the mechanism Kafka uses to intelligently expire data from the log. More details on the next slide.
Kafka topics consist of partitions, which are just logs. For simplicity, we will talk about a single partition and see how log compaction works for the key-value records that make their way into this partition. Kafka expires all the messages with a certain key except the latest message with that key in the partition, as can be seen on the slide. This allows Kafka to always retain the latest value for every key that has been written to it at least once, instead of expiring messages purely on a time basis.
More details at: http://kafka.apache.org/documentation.html#compaction
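The effect of compaction on a single partition can be simulated in a few lines (illustrative keys and values; this mimics the outcome, not Kafka's actual segment-by-segment cleaner):

```python
# Hypothetical sketch of log compaction on one partition: keep only the
# latest record per key, preserving the relative order of the survivors.
partition = [("k1", "a"), ("k2", "b"), ("k1", "c"), ("k3", "d"), ("k2", "e")]

def compact(records):
    # Last offset at which each key appears.
    latest = {key: i for i, (key, _) in enumerate(records)}
    # Keep a record only if it is the latest occurrence of its key.
    return [(k, v) for i, (k, v) in enumerate(records) if latest[k] == i]

print(compact(partition))  # [('k1', 'c'), ('k3', 'd'), ('k2', 'e')]
```

Note that every key written to the partition still appears after compaction; only its superseded older values are gone, which is exactly what a bootstrapping consumer needs.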
Let’s revisit the proposed solution now. There are still some issues that we observe with this arrangement:
If the webapp needs to perform multiple operations as a single atomic transaction, the responsibility of maintaining atomicity falls on the webapp, as Kafka doesn’t support atomically producing a set of messages.
The system of record in this arrangement is the log rather than the DB, and using a fairly new technology such as Kafka as the system of record may not appeal to people who trust a conventional DB.
The validation checks performed before transactions are written to the log all live in the webapp.
In this revised arrangement, all the writes from the webapp go to the DB, whose transaction log is emitted into Kafka for the other data stores to read. All the other data stores, including the database followers, consume from this stream of changes. This takes care of all the discussed issues with the previously proposed solution:
1) A conventional DB, which provides ACID properties out of the box, can be utilized; this takes care of the atomic-transaction responsibility that fell on the webapp in the previous arrangement
2) The DB is the system of record, which is trusted
3) The DB can enforce constraints on incoming writes
4) The DB can also serve one-off use cases that require “immediate consistency” reads
This approach of streaming the changes from a DB to the downstream consumers is what is referred to as Change Data Capture.
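The whole revised pipeline can be summarized in a short sketch (the CDC process, store names, and changelog format are all simplified stand-ins for a real DB transaction log and Kafka topic):

```python
# Hypothetical end-to-end sketch of Change Data Capture: the webapp writes
# only to the DB; the DB's transaction log becomes the change stream that
# downstream stores consume.
db, changelog = {}, []

def db_write(key, value):
    db[key] = value                  # the DB remains the system of record
    changelog.append((key, value))   # its transaction log feeds the stream

db_write("user:1", "alice")
db_write("user:1", "alicia")

# Downstream consumers (cache, search index, ...) replay the changelog.
cache = {}
for key, value in changelog:
    cache[key] = value

print(cache == db)  # True
```

The webapp never performs a dual write: the DB validates and orders every change, and everything downstream is derived from the DB's change stream.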