Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.

JOnConf - A CDC use-case: designing an Evergreen Cache

When one’s app is challenged with poor performances, it’s easy to set up a cache in front of one’s SQL database. It doesn’t fix the root cause (e.g. bad schema design, bad SQL query, etc.) but it gets the job done. If the app is the only component that writes to the underlying database, it’s a no-brainer to update the cache accordingly, so the cache is always up-to-date with the data in the database.

Things start to go sour when the app is not the only component writing to the DB. Among other sources of writes, there are batches, other apps (shared databases exist unfortunately), etc. One might think about a couple of ways to keep data in sync i.e. polling the DB every now and then, DB triggers, etc. Unfortunately, they all have issues that make them unreliable and/or fragile.

You might have read about Change-Data-Capture before. It’s been described by Martin Kleppmann as turning the database inside out: it means the DB can send change events (SELECT, DELETE and UPDATE) that one can register to. Just opposite to Event Sourcing that aggregates events to produce state, CDC is about getting events out of states. Once CDC is implemented, one can subscribe to its events and update the cache accordingly. However, CDC is quite in its early stage, and implementations are quite specific.

In this talk, I’ll describe an easy-to-setup architecture that leverages CDC to have an evergreen cache.

Related Books

Free with a 30 day trial from Scribd

See all

Related Audiobooks

Free with a 30 day trial from Scribd

See all
  • Be the first to comment

JOnConf - A CDC use-case: designing an Evergreen Cache

  1. 1. @nicolas_frankel A CDC use-case: Designing an Evergreen Cache Nicolas Fränkel
  2. 2. @nicolas_frankel Me, myself and I  Former developer, team lead, architect, blah-blah  Developer Advocate  Interested in CDC and data streaming
  3. 3. @nicolas_frankel Hazelcast HAZELCAST IMDG is an operational, in-memory, distributed computing platform that manages data using in-memory storage and performs execution for breakthrough and scale. HAZELCAST JET is the ultra fast, application embeddable, 3rd generation stream processing engine for low latency batch and stream processing.
  4. 4. @nicolas_frankel Agenda 1. Why cache? 2. Alternatives to keeping the cache in sync 3. Change-Data-Capture (CDC) 4. Debezium, a CDC implementation 5. Hazelcast Jet + Debezium 6. Demo!
  5. 5. @nicolas_frankel The caching trade-off  Improved performance/availability  Stale data
  6. 6. @nicolas_frankel The initial state 1. The application 2. The RDBMS 3. The cache
  7. 7. @nicolas_frankel Aye, there’s the rub!  A new component writes to the database  E.g.: a table holding references needs to be updated every now and then
  8. 8. @nicolas_frankel How to keep the cache in sync with the DB?
  9. 9. @nicolas_frankel Cache invalidation “There are two hard things in computer science: 1. Naming things 2. Cache invalidation 3. And off-by-one errors”
  10. 10. @nicolas_frankel Cache eviction vs Time-To-Live  Cache eviction: which entities to evict when the cache is full • Least Recently Used • Least Frequently Used  TTL: how long will an entity be kept in the cache
  11. 11. @nicolas_frankel Choosing the “correct” TTL  Less frequent than the update frequency • Miss updates  More frequent than the update frequency • Waste resources
  12. 12. @nicolas_frankel Polling process Same issue regarding the frequency
  13. 13. @nicolas_frankel Event-driven for the win! 1. If no writes happen, there's no need to update the cache 2. If a write happens, then the relevant cache item should be updated accordingly
  14. 14. @nicolas_frankel RDMBS triggers  Not all RDBMS implement triggers  How to call an external process from the trigger?
  15. 15. @nicolas_frankel The example of MySQL: User-defined function  Functions must be written in C++  The OS must support dynamic loading  Becomes part of the running server • Bound by all constraints that apply to writing server code  Etc. -- https://dev.mysql.com/doc/refman/8.0/en/adding-udf.html
  16. 16. @nicolas_frankel lib_mysqludf_sys UDF library with functions to interact with the operating system CREATE TRIGGER MyTrigger AFTER INSERT ON MyTable FOR EACH ROW BEGIN DECLARE cmd CHAR(255); DECLARE result INT(10); SET cmd = CONCAT('update_row', '1'); SET result = sys_exec(cmd); END; -- https://github.com/mysqludf/lib_mysqludf_sys
  17. 17. @nicolas_frankel Cons  Implementation-dependent  Fragile  Who maintains/debugs it?  Resource-consuming if done frequently
  18. 18. @nicolas_frankel Change-Data-Capture “In databases, Change Data Capture is a set of software design patterns used to determine and track the data that has changed so that action can be taken using the changed data. CDC is an approach to data integration that is based on the identification, capture and delivery of the changes made to enterprise data sources.” -- https://en.wikipedia.org/wiki/Change_data_capture
  19. 19. @nicolas_frankel CDC implementation options 1. Timestamps on rows 2. Version numbers on rows 3. Status indicators on rows 4. Triggers on tables 5. Log scanners
  20. 20. @nicolas_frankel “Turning the database inside out” - Martin Kleppman -- https://www.confluent.io/blog/turning-the-database-inside-out-with-apache-samza/
  21. 21. @nicolas_frankel What is a transaction/binary/etc. log? “The binary log contains ‘events’ that describe database changes such as table creation operations or changes to table data.” -- https://dev.mysql.com/doc/refman/8.0/en/binary-log.html
  22. 22. @nicolas_frankel Reasons for the log 1. Data recovery 2. Replication
  23. 23. @nicolas_frankel What if we “hacked” the log?
  24. 24. @nicolas_frankel Sample MySQL binlog ### UPDATE `test`.`t` ### WHERE ### @1=1 /* INT meta=0 nullable=0 is_null=0 */ ### @2='apple' /* VARSTRING(20) meta=20 nullable=0 is_null=0 */ ### @3=NULL /* VARSTRING(20) meta=0 nullable=1 is_null=1 */ ### SET ### @1=1 /* INT meta=0 nullable=0 is_null=0 */ ### @2='pear' /* VARSTRING(20) meta=20 nullable=0 is_null=0 */ ### @3='2009:01:01' /* DATE meta=0 nullable=1 is_null=0 */ # at 569 #150112 21:40:14 server id 1 end_log_pos 617 CRC32 0xf134ad89 #Table_map: `test`.`t` mapped to number 251 # at 617 #150112 21:40:14 server id 1 end_log_pos 665 CRC32 0x87047106 #Delete_rows: table id 251 flags: STMT_END_F
  25. 25. @nicolas_frankel Kind reminder…  Implementation-dependent  Fragile  Who maintains/debugs it?
  26. 26. @nicolas_frankel Debezium to the rescue  Java-based abstraction layer for CDC  Provided by Red Hat  Apache v2 licensed  Very skewed toward Kafka
  27. 27. @nicolas_frankel Debezium connector plugins  Production-ready • MySQL • PostrgreSQL • MongoDB • SQL Server  Incubating • Oracle • DB2 (!) • Cassandra
  28. 28. @nicolas_frankel Debezium “Debezium records all row-level changes within each database table in a change event stream” -- https://debezium.io/
  29. 29. @nicolas_frankel Hazelcast Jet  Stream Processing Engine (SPE)  Distributed  In-memory  Embeds Hazelcast IMDG  Apache v2 licensed  (Hazelcast Jet Enterprise offering)
  30. 30. @nicolas_frankel Jet overview Stream Processor Data SinkData Source Hazelcast IMDG Map, Cache, List, Change Events Live Streams Kafka, JMS, Sensors, Feeds Databases JDBC, Relational, NoSQL, Change Events Files HDFS, Flat Files, Logs, File watcher Applications Sockets Ingest In-Memory Operational Storage Combine Join, Enrich, Group, Aggregate Stream Windowing, Event-Time Processing Compute Distributed and Parallel Computations Transform Filter, Clean, Convert Publish In-Memory, Subscriber Notifications Stream Stream
  31. 31. @nicolas_frankel Deployment modes // Create new cluster member JetInstance jet = Jet.newJetInstance(); // Connect to running cluster JetInstance jet = Jet.newJetClient(); Client/ServerEmbedded Java API Application Java API Application Java API Application Client API Application Client API Application Client API Application Client API Application
  32. 32. @nicolas_frankel Pipeline Job  Declarative code that defines and links sources, transforms, and sinks  Platform-specific SDK  Client submits pipeline to the SPE  Running instance of pipeline in SPE  SPE executes the pipeline • Code execution • Data routing • Flow control
  33. 33. @nicolas_frankel Back to our use-case A Jet job: 1. Watches change events in the database 2. Analyzes the change event 3. Updates the cache accordingly
  34. 34. @nicolas_frankel
  35. 35. @nicolas_frankel Recap  The caching trade-off  Event-based architectures FTW  Change-Data-Capture • Integration through Hazelcast Jet
  36. 36. @nicolas_frankel Thanks for your attention!  https://blog.frankel.ch/  @nicolas_frankel  https://jet-start.sh/docs/tutorials/cdc  https://bit.ly/evergreen-cache

    Be the first to comment

  • nfrankel

    Jun. 26, 2020

When one’s app is challenged with poor performances, it’s easy to set up a cache in front of one’s SQL database. It doesn’t fix the root cause (e.g. bad schema design, bad SQL query, etc.) but it gets the job done. If the app is the only component that writes to the underlying database, it’s a no-brainer to update the cache accordingly, so the cache is always up-to-date with the data in the database. Things start to go sour when the app is not the only component writing to the DB. Among other sources of writes, there are batches, other apps (shared databases exist unfortunately), etc. One might think about a couple of ways to keep data in sync i.e. polling the DB every now and then, DB triggers, etc. Unfortunately, they all have issues that make them unreliable and/or fragile. You might have read about Change-Data-Capture before. It’s been described by Martin Kleppmann as turning the database inside out: it means the DB can send change events (SELECT, DELETE and UPDATE) that one can register to. Just opposite to Event Sourcing that aggregates events to produce state, CDC is about getting events out of states. Once CDC is implemented, one can subscribe to its events and update the cache accordingly. However, CDC is quite in its early stage, and implementations are quite specific. In this talk, I’ll describe an easy-to-setup architecture that leverages CDC to have an evergreen cache.

Views

Total views

126

On Slideshare

0

From embeds

0

Number of embeds

0

Actions

Downloads

2

Shares

0

Comments

0

Likes

1

×