This document discusses using Apache Kafka as a data hub to capture changes from various data sources via change data capture (CDC). It outlines common CDC patterns such as modification-date columns, database triggers, transaction-log (log-based) capture, and diff scripts. It then discusses using Kafka Connect to replicate changes from data sources such as PostgreSQL and MongoDB, lists open source CDC connectors, and concludes with suggestions for getting involved in the Apache Kafka community.
4. Needs
● System integration / Data replication.
○ These are usually done via API calls, with an orchestration engine or a service mesh for microservices.
○ But… sometimes such APIs do not exist, or the legacy systems involved are too complicated.
● Migration.
○ From monolithic to a microservices architecture.
○ From one database type to another.
● Audit trail of changes.
○ Analysis of changes.
○ Identification of patterns.
5. What is CDC?
● Stands for Change Data Capture.
● Identifies changes that happen in source systems so that other systems can be kept in sync.
● Provides observability into what is happening in the applications (data layer).
● Most common scenario in database systems.
● Helps to avoid dual writes.
6. Patterns - Modification Date
● Add a column to each table with the date/time of the last modification of each record.
● This column must be available in all tables we want to track.
● Be sure this column is reliably set (in the app or even with database triggers); a sketch follows the query below.
● Query the data based on a date range.
● Take into account the number of inserts/updates for this range.
● Deletes might be an issue. Use logical deletes instead.
SELECT *
FROM sample_table
WHERE updated_time > now() - interval '1 second'
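As a minimal sketch (assuming PostgreSQL and the hypothetical sample_table from the query above), the column can be added and kept reliably set with a trigger instead of relying only on the application:

-- Add the last-modification column (hypothetical table name)
ALTER TABLE sample_table ADD COLUMN updated_time timestamptz NOT NULL DEFAULT now();

-- Trigger function that refreshes the column on every insert/update
CREATE OR REPLACE FUNCTION set_updated_time() RETURNS trigger AS $$
BEGIN
  NEW.updated_time := now();
  RETURN NEW;
END;
$$ LANGUAGE plpgsql;

-- PostgreSQL 11+ syntax; older versions use EXECUTE PROCEDURE
CREATE TRIGGER sample_table_set_updated_time
BEFORE INSERT OR UPDATE ON sample_table
FOR EACH ROW EXECUTE FUNCTION set_updated_time();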
7. Patterns - Triggers
● Database triggers can store each change made to the original table into a “shadow” table (see the sketch after this list).
● Store the whole record or just the PK from the original table.
● Must be defined for each table to track.
● Adds overhead into the database.
● Ensure all the triggers are enabled.
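A minimal sketch of this pattern, again assuming PostgreSQL and a hypothetical sample_table with an id primary key: an AFTER trigger copies the key and the operation type into a shadow table.

-- Shadow table holding the PK of the changed row and the operation type
CREATE TABLE sample_table_shadow (
  id         bigint,
  operation  text,
  changed_at timestamptz NOT NULL DEFAULT now()
);

CREATE OR REPLACE FUNCTION capture_sample_table_change() RETURNS trigger AS $$
BEGIN
  IF TG_OP = 'DELETE' THEN
    INSERT INTO sample_table_shadow (id, operation) VALUES (OLD.id, TG_OP);
  ELSE
    INSERT INTO sample_table_shadow (id, operation) VALUES (NEW.id, TG_OP);
  END IF;
  RETURN NULL; -- return value is ignored for AFTER row triggers
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER sample_table_cdc
AFTER INSERT OR UPDATE OR DELETE ON sample_table
FOR EACH ROW EXECUTE FUNCTION capture_sample_table_change();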
8. Patterns - Log based
● Changes in the database are stored in its transaction log.
● Based on this log, changes are notified to other systems.
● There is no overhead in the database and no extra SQL/configs required.
● More reliable than the previous patterns.
● Each database implements its own way of representing changes (see the sketch after this list).
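As an illustration of what log-based tools do under the hood (connectors such as Debezium automate this), PostgreSQL can expose the changes recorded in its write-ahead log through a logical replication slot, assuming wal_level is set to logical and using the built-in test_decoding plugin:

-- Create a logical replication slot that decodes the transaction log (WAL)
SELECT * FROM pg_create_logical_replication_slot('cdc_demo_slot', 'test_decoding');

-- Peek at the decoded changes without consuming them from the slot
SELECT * FROM pg_logical_slot_peek_changes('cdc_demo_slot', NULL, NULL);

-- Drop the slot when done so the server can recycle old log segments
SELECT pg_drop_replication_slot('cdc_demo_slot');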
9. Patterns - Diff or custom scripts
● Compute the difference between the previous and the current state.
● Via SQL or custom implementations (based on the data source); see the sketch after this list.
● Might generate more overhead.
● Requires more maintenance.
● More oriented to specific use cases.
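A minimal SQL sketch of the diff approach, assuming two hypothetical snapshot tables with the same schema taken at different points in time:

-- New rows and new versions of changed rows since the previous snapshot
SELECT * FROM sample_table_snapshot_current
EXCEPT
SELECT * FROM sample_table_snapshot_previous;

-- Deleted rows and old versions of changed rows
SELECT * FROM sample_table_snapshot_previous
EXCEPT
SELECT * FROM sample_table_snapshot_current;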
10. Considerations
● The same approach cannot be applied to all data sources.
● Understanding of the changed data (its context) is required.
● Management of deleted records.
● Schema changes in the data source.
● Initial loads (snapshots) of a huge volume of data.
11. Some products
● Commercial
○ Oracle Golden Gate - https://www.oracle.com/integration/goldengate
○ Qlik Replicate - https://www.qlik.com/us/data-streaming/data-streaming-cdc
○ HVR - https://www.hvr-software.com/product/change-data-capture
○ AWS Data Migration Service - https://aws.amazon.com/dms
● Open source:
○ Debezium - https://debezium.io
○ Maxwell - https://maxwells-daemon.io
○ MongoDB - https://docs.mongodb.com/kafka-connector
13. With Kafka Connect
● We already have a bunch of connectors:
○ Debezium (PostgreSQL, MySQL, SQL Server, MongoDB).
○ MongoDB.
○ Oracle CDC.
○ SQData.
○ TiDB.
○ …
● You can find more in Confluent Hub!
○ https://www.confluent.io/hub
16. kukulcan
https://github.com/mmolimar/kukulcan
● A REPL for Apache Kafka.
● Supports POSIX and Windows operating systems.
● Written in Scala, Java and Python.
● Shells in:
○ Ammonite REPL.
○ Scala REPL.
○ JShell.
○ Python shell.
● APIs for Admin, Producer, Consumer, Connect, Streams, Schema Registry, and KSQL.
17. Ammonite scripts
● Scripts to run the demo in Kukulcan.
● Source code: https://github.com/mmolimar/meetups
● Documentation: https://github.com/mmolimar/meetups/tree/master/kafka-cdc