Apache Kafka® and the Data Mesh
Kenneth Cheung
Sr. Solutions Engineer, Confluent
kenneth@confluent.io
Hong Kong Apache Kafka® Meetup, August 10, 2021
Data “in practice” Needs More Discipline
Data as a Practice is not on the same level as Software as a Practice.
Copyright 2021, Confluent, Inc. All rights reserved. This document may not be reproduced in any manner without the express written permission of Confluent, Inc.
What is Data Mesh
Data Marts, DDD, Microservices, Event Streaming
[Figure labels: Domain (Inventory, Orders, Shipments, ...), Data Product, Data Mesh]
The Principles of a Data Mesh
1. Data ownership by domain
2. Data as a product
3. Data available everywhere, self serve
4. Data governed wherever it is
Spaghetti: Data architectures often lack rigour
Centralized Event Streams. Decentralized Data Products.
Kafka
Centralize an immutable stream of facts. Decentralize the freedom to act, adapt, and change.
A First Look
[Figure labels: Domain (Inventory, Orders, Shipments, ...), Data Product, Data Mesh]
The Principles of a Data Mesh
1. Domain-driven Decentralization: Local Autonomy (Organizational Concerns)
2. Data as a First-class Product: Product thinking, “Microservice for Data”
3. Self-serve Data Platform: Infra Tooling, Across Domains
4. Federated Governance: Interoperability, Network Effects (Organizational Concerns)
Principle 1: Domain-driven Decentralization
Objective: Ensure data is owned by those that truly understand it
Pattern (Decentralized Data Ownership): Ownership of a data asset given to the “local” team that is most familiar with it
Anti-pattern (Centralized Data Ownership): responsibility for data becomes the domain of the DWH team
Practical example
1. Joe in Inventory has a problem with Order data.
2. Inventory items are going negative because of bad Order data.
3. He could fix the data up locally in the Inventory domain, and get on with his job.
4. Or, better, he contacts Alice in Orders and gets it fixed at the source. This is more reliable, as Joe doesn’t fully understand the Orders process.
5. Ergo, Alice needs to be a responsible and responsive “Data Product Owner”, so everyone benefits from the fix to Joe’s problem.
[Figure labels: Orders Domain (Alice), Order Data, Shipment Domain, Shipping Data, Inventory (Joe), Billing, Recommendations]
Recommendations: Domain-driven Decentralization
Learn from DDD:
• Use a standard language and nomenclature for data.
• Business users should understand a data flow diagram.
• The stream of events should create a shared narrative that is business-user comprehensible.
Principle 2: Data as a First-Class Product
• Objective: Make shared data discoverable, addressable, trustworthy, secure, so other teams can make good use of it.
• Data is treated as a true product, not a by-product.
This product thinking is important to prevent data chauvinism.
Data product, a “microservice for the data world”
• Data product is a node on the data mesh, situated within a domain.
• Produces (and possibly consumes) high-quality data within the mesh.
• Encapsulates all the elements required for its function, namely data + code + infrastructure.
Example: “Items about to expire” Data Product
• Data: Data and metadata, including history
• Code: Creates, manipulates, serves, etc. that data
• Infra: Powers the data (e.g., storage) and the code (e.g., run, deploy, monitor)
Connectivity within the mesh lends itself naturally to Event Streaming with Kafka
Mesh is a logical view, not physical!
Event Streaming is Pub/Sub, not Point-to-Point
[Figure: Data Products write (publish) to a persisted stream; other Data Products read (consume) from it independently. The same holds for other streams.]
Data producers are scalably decoupled from consumers.
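The decoupling described on this slide can be sketched in a few lines. This is a minimal in-memory illustration, not Kafka itself: the `Stream` and `Consumer` classes and the order events are hypothetical stand-ins for a topic partition and its consumers.

```python
from dataclasses import dataclass, field


@dataclass
class Stream:
    """Append-only, persisted log of events (stand-in for a topic partition)."""
    events: list = field(default_factory=list)

    def publish(self, event) -> int:
        """Append an event and return its offset."""
        self.events.append(event)
        return len(self.events) - 1


@dataclass
class Consumer:
    """Reads a stream at its own pace by tracking a private offset."""
    stream: Stream
    offset: int = 0

    def poll(self) -> list:
        """Return all events this consumer has not seen yet."""
        new_events = self.stream.events[self.offset:]
        self.offset = len(self.stream.events)
        return new_events


orders = Stream()
billing = Consumer(orders)

orders.publish({"orderId": 1, "product": "apple"})
orders.publish({"orderId": 2, "product": "pear"})
billing_seen = billing.poll()  # billing reads the first two events

# A consumer added later still sees the full history; the producer is unaware of it.
recommendations = Consumer(orders)
orders.publish({"orderId": 3, "product": "plum"})
recommendations_seen = recommendations.poll()  # all three events
billing_next = billing.poll()                  # only the newest event
```

Because each consumer keeps its own offset, adding or removing a reader never changes the producer's code.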
Why is Event Streaming a good fit for meshing?
Streams are real-time, low latency ⇒ Propagate data immediately.
Streams are highly scalable ⇒ Handle today’s massive data volumes.
Streams are stored, replayable ⇒ Capture real-time & historical data.
Streams are immutable ⇒ Auditable source of record.
Streams are addressable, discoverable, … ⇒ Meet key criteria for mesh data.
Streams are popular for Microservices ⇒ Adapting to Data Mesh is often easy.
How to get data into & out of a data product
The Input Data Ports and Output Data Ports of a Data Product can each take one of three forms:
• Snapshot via Nightly ETL
• Snapshot via Req/Res API
• Continuous Stream
Onboarding existing data
[Figure: Source Connectors feed the Input Data Ports of a Data Product]
Use Kafka connectors to stream data from cloud services and existing systems into the mesh.
https://www.confluent.io/hub/
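As a rough illustration of onboarding an existing database, the sketch below builds the JSON payload that could be submitted to the Kafka Connect REST API (`POST /connectors`) for a JDBC source connector. The connector name, database URL, table, and column are hypothetical placeholders, and credentials are omitted; check the connector's own documentation before relying on any setting.

```python
import json

# Hypothetical values: name, host, database, table, and column are placeholders.
connector = {
    "name": "orders-db-source",
    "config": {
        "connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
        "connection.url": "jdbc:postgresql://orders-db.example.internal:5432/orders",
        "mode": "incrementing",             # detect new rows via a growing ID column
        "incrementing.column.name": "id",
        "table.whitelist": "orders",
        "topic.prefix": "orders-domain.",   # rows land in topic "orders-domain.orders"
        "tasks.max": "1",
    },
}

# Serialize the definition; sending it to a Connect cluster is left out on purpose.
payload = json.dumps(connector, indent=2)
print(payload)
```

Keeping the connector definition as data makes it easy to review and version alongside the data product's code.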
Data product: what’s happening inside
[Figure: a Data Product between its Input Data Ports and Output Data Ports, with the internal technologies left open: “…pick your favorites...”]
Data on the Inside: HOW the domain team solves specific problems internally doesn’t matter to other domains.
Event Streaming inside a data product
1. Stream data from other DPs or internal systems into ksqlDB.
2. Use ksqlDB to filter, process, join, aggregate, analyze.
3. Stream data to internal systems or the outside. Pull queries can drive a req/res API.
Use ksqlDB, Kafka Streams apps, etc. for processing data in motion.
Use Kafka connectors and CDC to “streamify” classic databases.
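The idea that a pull query can drive a req/res API can be illustrated without ksqlDB: a view is kept up to date from the event stream and answers point lookups on demand. The `StockView` class and the inventory events below are hypothetical, a plain-Python sketch of the concept rather than ksqlDB's implementation.

```python
from collections import defaultdict

# Hypothetical inventory events: positive deltas are receipts, negative are sales.
events = [
    {"item": "milk", "delta": 10},
    {"item": "eggs", "delta": 12},
    {"item": "milk", "delta": -3},
]


class StockView:
    """Materialized view: current stock per item, maintained from the stream."""

    def __init__(self):
        self._stock = defaultdict(int)

    def apply(self, event):
        """Fold one event into the view (the 'push' side)."""
        self._stock[event["item"]] += event["delta"]

    def pull(self, item):
        """Point lookup of the current value (the 'pull query' side)."""
        return self._stock.get(item)


view = StockView()
for event in events:
    view.apply(event)

milk = view.pull("milk")               # state after the first three events
view.apply({"item": "milk", "delta": -2})
milk_after_sale = view.pull("milk")    # the view reflects the new event
unknown = view.pull("bread")           # no events seen for this item
```

A request/response endpoint would simply call `pull` for each incoming request, while the stream keeps the view current in the background.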
Event Streaming inside a data product
1. Stream data from other Data Products into your local DB (e.g., MySQL) with a Sink Connector.
2. DB client apps work as usual.
3. Stream data to the outside with a Source Connector, using CDC and e.g. the Outbox Pattern, ksqlDB, etc.
Dealing with data change: schemas & versioning
[Figure: the Output Data Ports of a Data Product publish two schema versions]
V1 - user, product, quantity
V2 - userAnonymized, product, quantity
Also, when needed, data can be fully reprocessed by replaying history.
Publish evolving streams with back/forward-compatible schemas.
Publish versioned streams for breaking changes.
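The V1/V2 example above can be checked mechanically. The sketch below applies one simplified rule, loosely modeled on backward compatibility in schema registries: a new schema can read old data if every field it adds has a default. The helper names, the `currency` field, and the `orders.vN` topic naming are assumptions for illustration; real compatibility checks also compare types and cover forward compatibility.

```python
REQUIRED = object()  # marker: the field has no default value


def is_backward_compatible(old_fields, new_fields):
    """True if a reader using new_fields can read data written with old_fields.

    Each argument maps field name -> default value, or REQUIRED if none.
    """
    return all(
        name in old_fields or default is not REQUIRED
        for name, default in new_fields.items()
    )


def topic_for(base, version, old_fields, new_fields):
    """Keep the stream for compatible changes; bump the version for breaking ones."""
    if is_backward_compatible(old_fields, new_fields):
        return f"{base}.v{version}"
    return f"{base}.v{version + 1}"


v1 = {"user": REQUIRED, "product": REQUIRED, "quantity": REQUIRED}
# Hypothetical compatible change: a new optional field with a default.
v1_extended = {**v1, "currency": "HKD"}
# The slide's V2 renames user to userAnonymized, which old data cannot supply.
v2 = {"userAnonymized": REQUIRED, "product": REQUIRED, "quantity": REQUIRED}

compatible_topic = topic_for("orders", 1, v1, v1_extended)
breaking_topic = topic_for("orders", 1, v1, v2)
```

Under this rule the rename in V2 counts as a breaking change, so it is published as a new versioned stream instead of evolving the existing one.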
Recommendations: Data as a First-class Product
1. Data-on-the-Outside is harder to change, but it has more value in a holistic sense.
a. Use schemas as a contract.
b. Handle incompatible schema changes using the Dual Schema Upgrade Window pattern.
2. Get data from the source, not from intermediaries. Think: the Law of Demeter applied to data.
a. Otherwise, ‘slightly corrupt’ data proliferates within the mesh, like a “Game of Telephone”.
b. Event Streaming makes it easy to subscribe to data from authoritative sources.
3. Change data at the source, including error fixes. Don’t “fix data up” locally.
4. Some data sources will be difficult to turn into first-class data products. Example: batch-based sources that lose event-level data or are not reproducible.
a. Use Event Streaming plus CDC, Outbox Pattern, etc. to integrate these into the mesh.
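The Outbox Pattern mentioned above can be sketched with SQLite from the standard library: the business row and the event describing it are written in one transaction, so an event exists only if the change was committed. Table names, columns, and the `place_order` helper are hypothetical; a CDC connector or relay process (not shown) would publish the outbox rows to a stream.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, product TEXT, quantity INTEGER)")
conn.execute("CREATE TABLE outbox (id INTEGER PRIMARY KEY AUTOINCREMENT, topic TEXT, payload TEXT)")


def place_order(conn, order_id, product, quantity):
    """Write the event and the business row atomically."""
    event = {"orderId": order_id, "product": product, "quantity": quantity}
    with conn:  # commits on success, rolls back if any statement raises
        # Statement order inside the transaction does not affect atomicity.
        conn.execute(
            "INSERT INTO outbox (topic, payload) VALUES (?, ?)",
            ("orders", json.dumps(event)),
        )
        conn.execute("INSERT INTO orders VALUES (?, ?, ?)", (order_id, product, quantity))


place_order(conn, 1, "apple", 3)
try:
    # Duplicate primary key: the order insert fails, so the event is rolled back too.
    place_order(conn, 1, "pear", 2)
except sqlite3.IntegrityError:
    pass

outbox_rows = conn.execute("SELECT topic, payload FROM outbox ORDER BY id").fetchall()
order_count = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
```

Only the first order leaves a row in the outbox, so downstream consumers never see an event for a change that did not happen.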
Why Self-service Matters
Trade Surveillance System
● Data from 13 sources
● Some sources publish events
● Needed both historical and real-time data
● Historical data from database extracts arranged with dev team
● Format of events different from format of extracts
● 9 months of effort to get 13 sources into the new system
Principle 3: Self-serve Data Platform
Central infrastructure that provides real-time and historical data on demand
Objective: Make domains autonomous in their execution through rapid data provisioning
Consuming real-time & historical data from the mesh
1) Separate Systems for Real-time and Historical Data (Lambda Architecture)
- Considerations:
  - Difficult to correlate real-time with historical “snapshot” data
  - Two systems to manage
  - Snapshots have less granularity than event streams
2) One System for Real-time and Historical Data (Kappa Architecture)
- Considerations:
  - Operational complexity (addressed in Confluent Cloud)
  - Downsides of immutability of regular streams, e.g. when altering or deleting events
  - Storage cost (addressed in Confluent Cloud; in Apache Kafka with KIP-405)
What this can look like in practice
[Figure: screenshot, “Browse Schemas”]
Implementation: Database Inside-Out
[Figure labels: DB, CONNECTOR, Messaging that Remembers, STREAM PROCESSOR (ksqlDB), DB/View]
With ksqlDB the data mesh is queryable and decentralized.
Query is the interface to the mesh
Events are the interface to the mesh
[Figure labels: Destination Data Port, STREAM PROCESSOR, ksqlDB]
Think: Infrastructure as code, but for data
Code + Container Image: Same APP every time
Code + Event Streams: Same DATA every time
Mesh is one logical cluster. Data product has another.
Data Product has its own cluster for internal use
Principle 4: Federated Governance
• Objective: Independent data products can interoperate and create network effects.
• Establish global standards, like governance, that apply to all data products in the mesh.
• Ideally, these global standards and rules are applied automatically by the platform.
[Figure: several Domains on top of the Self-serve Data Platform]
What is decided locally by a domain?
What is decided globally (implemented and enforced by the platform)?
Must balance between Decentralization vs. Centralization. No silver bullet!
Example standard: Identifying customers globally
• Define how data is represented, so you can join and correlate data across different domains.
• Use data contracts, schemas, registries, etc. to implement and enforce such standards.
• Use Event Streaming to retrofit historical data to new requirements, standards.
[Figure: records in different domains all carry customerId=29639]
SELECT … FROM orders o
LEFT JOIN shipments s
ON o.customerId = s.customerId
EMIT CHANGES;
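What the shared `customerId` standard buys can be shown with a small batch version of the join above. This sketch does not model the continuous `EMIT CHANGES` semantics; the sample rows, the second customer ID, and the `left_join` helper are hypothetical.

```python
# Hypothetical rows from two domains that follow the same customerId standard.
orders = [
    {"customerId": 29639, "orderId": "o-1"},
    {"customerId": 40001, "orderId": "o-2"},
]
shipments = [
    {"customerId": 29639, "shipmentId": "s-9"},
]


def left_join(left_rows, right_rows, key, right_columns):
    """Keep every left row; attach matching right columns, or None if no match."""
    by_key = {}
    for row in right_rows:
        by_key.setdefault(row[key], []).append(row)

    joined = []
    for left in left_rows:
        # Fall back to a single all-None row when the key has no match.
        matches = by_key.get(left[key]) or [dict.fromkeys(right_columns)]
        for right in matches:
            joined.append({**left, **{col: right[col] for col in right_columns}})
    return joined


result = left_join(orders, shipments, "customerId", ["shipmentId"])
```

The join works only because both domains represent the customer the same way; with diverging identifiers, every order would come back unmatched.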
Example standard: Detect errors and recover with Streams
• Use strategies like logging, data profiling, data lineage, etc. to detect errors in the mesh.
• Streams are very helpful to detect errors and identify cause-effect relationships.
• Streams let you recover and fix errors: e.g., replay & reprocess historical data.
[Figure: My App consumes a stream (offsets 0 to 9) from the Output Data Ports of a Data Product]
Bug? Error? Rewind to start of stream, then reprocess.
If needed, tell the origin data product to fix problematic data at the source.
Event Streams give
you a powerful
Time Machine.
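Rewinding and reprocessing can be sketched as follows: the events stay untouched, and only the processing logic is replaced before replaying from the start. The event shapes and the two processor functions are hypothetical examples of a bug and its fix.

```python
# Hypothetical stored stream; events are never modified in place.
events = [
    {"type": "order", "quantity": 5},
    {"type": "return", "quantity": 2},
    {"type": "order", "quantity": 1},
]


def process_v1(total, event):
    """Buggy logic: counts returns as additional orders."""
    return total + event["quantity"]


def process_v2(total, event):
    """Fixed logic: returns reduce the running total."""
    if event["type"] == "return":
        return total - event["quantity"]
    return total + event["quantity"]


def replay(stream, processor, start_offset=0):
    """Rewind to start_offset and rebuild state by reprocessing every event."""
    total = 0
    for event in stream[start_offset:]:
        total = processor(total, event)
    return total


wrong_total = replay(events, process_v1)   # state produced by the buggy app
fixed_total = replay(events, process_v2)   # state after rewinding with the fix
```

Because the stream is the source of record, correcting the application and replaying is enough to repair derived state; the stored events themselves need no patching.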
Example standard: Tracking data lineage with Streams
• Lineage must work across domains and data products—and systems, clouds, data centers.
• Event streaming is a foundational technology for this.
Recommendations: Federated Governance
1. Be pragmatic: Don’t expect governance systems to be perfect.
a. They are a map that helps you navigate the data-landscape of your company.
b. But there will always be roads that have changed or have not been mapped.
2. Governance is more a process—i.e., an organizational concern—than a technology.
3. Beware of centralized data models, which can become slow to change. Where they must exist, use processes & tooling like GitHub to collaborate and change quickly. Good luck! 🙂
Data mesh journey
Principle 1: Data should have one owner: the team that creates it.
Principle 2: Data is your product: All exposed data should be good data.
Principle 3: Get access to any data immediately and painlessly, be it historical or real-time.
Principle 4: Governance, with standards, security, lineage, etc. (cross-cutting concerns)
[Figure: steps 1, 2, 3 plotted against “Difficulty to execute”, marked “Start Here”]
Learn More
Explore how to build a cloud-native Data Mesh using Confluent’s fully managed, serverless Apache Kafka® service.
Confluent Cloud
cnfl.io/confluent-cloud
Promo Code: DATAMESH101
Get Started Today
Start your Apache Kafka® journey at developer.confluent.io
Some news: Confluent Developer, a site built as the #1 destination for learning Apache Kafka® and Confluent, has just had a MAJOR CONTENT UPGRADE
Thank you!
Kenneth Cheung
kenneth@confluent.io
https://www.linkedin.com/in/kennethcb/

Apache kafka® and The Data Mesh

  • 1. Apache Kafka® and the Data Mesh Kenneth Cheung Sr. Solutions Engineer, Confluent kenneth@confluent.io Hong Kong Apache Kafka® Meetup, August 10 , 2021
  • 2. Data “in practice” Needs More Discipline 2 Data as a Practice … is not on the same level. Software as a Practice
  • 3. Copyright 2021, Confluent, Inc. All rights reserved. This document may not be reproduced in any manner without the express written permission of Confluent, Inc. What is Data Mesh 3 Data Marts DDD Microservices Event Streaming Domain Inventory Orders Shipments Data Product Data Mesh ...
  • 4. Data ownership by domain Data as a product Data governed wherever it is Data available everywhere, self serve 1 2 3 4 The Principles of a Data Mesh
  • 5. Spaghetti: Data architectures often lack rigour 5
  • 6. Centralized Event Streams. Decentralized Data Products. 6 Kafka Centralize an immutable stream of facts. Decentralize the freedom to act, adapt, and change.
  • 8. Domain-driven Decentralization Local Autonomy (Organizational Concerns) Data as a First-class Product Product thinking, “Microservice for Data” Federated Governance Interoperability, Network Effects (Organizational Concerns) Self-serve Data Platform Infra Tooling, Across Domains 1 2 3 4 The Principles of a Data Mesh
  • 9. Principle 1: Domain-driven Decentralization Pattern: Ownership of a data asset given to the “local” team that is most familiar with it Centralized Data Ownership Decentralized Data Ownership Objective: Ensure data is owned by those that truly understand it Anti-pattern: responsibility for data becomes the domain of the DWH team
  • 10. 10 Shipping Data Joe Practical example 1. Joe in Inventory has a problem with Order data. 2. Inventory items are going negative, because of bad Order data. 3. He could fix the data up locally in the Inventory domain, and get on with his job. 4. Or, better, he contacts Alice in Orders and get it fixed at the source. This is more reliable as Joe doesn’t fully understand the Orders process. 5. Ergo, Alice needs be an responsible & responsive “Data Product Owner”, so everyone benefits from the fix to Joe’s problem. Orders Domain Shipment Domain Order Data Inventory Billing Recommendations Alice
  • 11. Recommendations: Domain-driven Decentralization. Learn from DDD: Use a standard language and nomenclature for data. Business users should understand a data flow diagram. The stream of events should create a shared narrative that is business-user comprehensible.
  • 12. The Principles of a Data Mesh: (1) Domain-driven Decentralization: Local Autonomy (Organizational Concerns). (2) Data as a First-class Product: Product thinking, “Microservice for Data”. (3) Self-serve Data Platform: Infra Tooling, Across Domains. (4) Federated Governance: Interoperability, Network Effects (Organizational Concerns).
  • 13. Principle 2: Data as a First-Class Product. Objective: Make shared data discoverable, addressable, trustworthy, and secure, so other teams can make good use of it. Data is treated as a true product, not a by-product. This product thinking is important to prevent data chauvinism.
  • 14. Data product, a “microservice for the data world”. A data product is a node on the data mesh, situated within a domain. It produces, and possibly consumes, high-quality data within the mesh. It encapsulates all the elements required for its function, namely data + code + infrastructure. Data: data and metadata, including history. Code: creates, manipulates, serves, etc. that data. Infra: powers the data (e.g., storage) and the code (e.g., run, deploy, monitor). Example shown: the “Items about to expire” Data Product.
  • 15. Connectivity within the mesh lends itself... (Diagram labels: Domain, Data Product, Data Mesh.)
  • 16. ...naturally to Event Streaming with Kafka. The mesh is a logical view, not physical! (Diagram labels: Domain, Data Product, Data Mesh.)
  • 17. Event Streaming is Pub/Sub, not Point-to-Point. Data products write (publish) to a persisted stream, and other data products read (consume) from it independently, alongside other streams. Data producers are scalably decoupled from consumers.
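The pub/sub behaviour described on this slide can be sketched with a small in-memory model. This is only an illustration, not Kafka's client API: the `Stream` class, its methods, and the consumer names are hypothetical, and a plain Python list stands in for a persisted topic.

```python
from dataclasses import dataclass, field


@dataclass
class Stream:
    """Hypothetical in-memory stand-in for a persisted, append-only stream."""
    events: list = field(default_factory=list)
    offsets: dict = field(default_factory=dict)  # consumer name -> next offset

    def publish(self, event):
        # Producers only append; they never address a specific consumer.
        self.events.append(event)

    def consume(self, consumer):
        # Each consumer advances its own offset, independently of the others.
        start = self.offsets.get(consumer, 0)
        batch = self.events[start:]
        self.offsets[consumer] = len(self.events)
        return batch


orders = Stream()
orders.publish({"orderId": 1, "item": "apple"})
orders.publish({"orderId": 2, "item": "pear"})

inventory_view = orders.consume("inventory")  # reads the first two events
orders.publish({"orderId": 3, "item": "plum"})
billing_view = orders.consume("billing")      # a late reader still sees all three
inventory_next = orders.consume("inventory")  # only the event it has not seen yet
```

Because the events stay in the log after being read, adding a new consumer requires no change on the producer side, which is the decoupling the slide refers to.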
  • 18. Why is Event Streaming a good fit for meshing? Streams are real-time, low latency ⇒ Propagate data immediately. Streams are highly scalable ⇒ Handle today’s massive data volumes. Streams are stored, replayable ⇒ Capture real-time & historical data. Streams are immutable ⇒ Auditable source of record. Streams are addressable, discoverable, … ⇒ Meet key criteria for mesh data. Streams are popular for Microservices ⇒ Adapting to Data Mesh is often easy.
  • 19. How to get data into & out of a data product. A Data Product has Input Data Ports and Output Data Ports. Each side offers three options: Snapshot via Nightly ETL, Snapshot via Req/Res API, or Continuous Stream.
  • 20. Onboarding existing data. Use Kafka connectors (Source Connectors on the Input Data Ports of a Data Product) to stream data from cloud services and existing systems into the mesh. https://www.confluent.io/hub/
  • 21. Data product: what’s happening inside. Between the Input Data Ports and Output Data Ports: …pick your favorites... Data on the Inside: HOW the domain team solves specific problems internally. This doesn’t matter to other domains.
  • 22. Event Streaming inside a data product. Use ksqlDB, Kafka Streams apps, etc. for processing data in motion. 1. Stream data from other DPs or internal systems into ksqlDB (Input Data Ports). 2. Use ksqlDB to filter, process, join, aggregate, analyze. 3. Stream data to internal systems or the outside (Output Data Ports). Pull queries can drive a req/res API.
  • 23. Event Streaming inside a data product. Use Kafka connectors and CDC to “streamify” classic databases (e.g. MySQL). Stream data from other Data Products into your local DB (Sink Connector on the Input Data Ports). DB client apps work as usual. Stream data to the outside with CDC and e.g. the Outbox Pattern, ksqlDB, etc. (Source Connector on the Output Data Ports).
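The Outbox Pattern mentioned on this slide can be sketched as follows. This is a minimal illustration under stated assumptions, not Confluent's or any connector's implementation: SQLite stands in for the MySQL database, and the table names, the `place_order` helper, and the `orders` topic name are hypothetical. The idea is that the business row and the outgoing event are written in one transaction, so a CDC source connector reading the outbox table never sees an event for a change that was not committed.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")  # SQLite used here only as a stand-in database
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, product TEXT, quantity INTEGER)")
conn.execute("CREATE TABLE outbox (id INTEGER PRIMARY KEY AUTOINCREMENT, topic TEXT, payload TEXT)")


def place_order(order_id, product, quantity):
    """Write the event and the business row atomically (hypothetical helper)."""
    payload = json.dumps({"orderId": order_id, "product": product, "quantity": quantity})
    # "with conn" commits on success and rolls back if any statement raises.
    with conn:
        conn.execute("INSERT INTO outbox (topic, payload) VALUES (?, ?)", ("orders", payload))
        conn.execute("INSERT INTO orders VALUES (?, ?, ?)", (order_id, product, quantity))


place_order(1, "apple", 3)

try:
    # Duplicate primary key: the orders insert fails, so the outbox row
    # written just before it is rolled back as well.
    place_order(1, "pear", 2)
except sqlite3.IntegrityError:
    pass

outbox_rows = conn.execute("SELECT topic, payload FROM outbox ORDER BY id").fetchall()
order_count = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
```

In a real deployment the outbox table would be read by a CDC source connector rather than queried directly; that part is outside this sketch.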
  • 24. Dealing with data change: schemas & versioning. Publish evolving streams with back/forward-compatible schemas. Publish versioned streams for breaking changes, e.g. on the Output Data Ports of a Data Product: V1 - user, product, quantity; V2 - userAnonymized, product, quantity. Also, when needed, data can be fully reprocessed by replaying history.
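The difference between an evolving schema and a breaking change can be sketched with a simplified compatibility check. The rule below is an assumption loosely modeled on Avro-style schema resolution, not Schema Registry's actual algorithm; the field types, the `note` field, and the function name are hypothetical (the slide gives only the V1 and V2 field names).

```python
def is_backward_compatible(old_fields, new_fields):
    """Return True if a reader using new_fields can read data written with old_fields.

    Simplified rule (an assumption): every field the new reader expects must
    either exist in the old data with the same type or declare a default.
    """
    for name, spec in new_fields.items():
        if name in old_fields:
            if old_fields[name]["type"] != spec["type"]:
                return False
        elif "default" not in spec:
            return False
    return True


# Field names follow the slide; the types are assumed for illustration.
v1 = {
    "user": {"type": "string"},
    "product": {"type": "string"},
    "quantity": {"type": "int"},
}

# Hypothetical additive change: a new optional field with a default.
v1_extended = dict(v1, note={"type": "string", "default": ""})

# The slide's V2 replaces "user" with "userAnonymized".
v2 = {
    "userAnonymized": {"type": "string"},
    "product": {"type": "string"},
    "quantity": {"type": "int"},
}

additive_ok = is_backward_compatible(v1, v1_extended)  # can evolve in place
breaking_ok = is_backward_compatible(v1, v2)           # needs a new versioned stream
```

Under this rule the additive change passes and can be published on the same stream, while the V1 to V2 change fails and would be published as a separate versioned stream.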
  • 25. Recommendations: Data as a First-class Product. 1. Data-on-the-Outside is harder to change, but it has more value in a holistic sense. a. Use schemas as a contract. b. Handle incompatible schema changes using the Dual Schema Upgrade Window pattern. 2. Get data from the source, not from intermediaries. Think: Demeter's law applied to data. a. Otherwise, ‘slightly corrupt’ data proliferates within the mesh, like a “Game of Telephone”. b. Event Streaming makes it easy to subscribe to data from authoritative sources. 3. Change data at the source, including error fixes. Don’t “fix data up” locally. 4. Some data sources will be difficult to turn into first-class data products. Example: Batch-based sources that lose event-level data or are not reproducible. a. Use Event Streaming plus CDC, Outbox Pattern, etc. to integrate these into the mesh.
  • 26. The Principles of a Data Mesh: (1) Domain-driven Decentralization: Local Autonomy (Organizational Concerns). (2) Data as a First-class Product: Product thinking, “Microservice for Data”. (3) Self-serve Data Platform: Infra Tooling, Across Domains. (4) Federated Governance: Interoperability, Network Effects (Organizational Concerns).
  • 27. Why Self-service Matters. Example: a Trade Surveillance System. ● Data from 13 sources. ● Some sources publish events. ● Needed both historical and real-time data. ● Historical data came from database extracts arranged with the dev team. ● The format of the events differed from the format of the extracts. ● It took 9 months of effort to get the 13 sources into the new system.
  • 28. Principle 3: Self-serve Data Platform. Central infrastructure that provides real-time and historical data on demand. Objective: Make domains autonomous in their execution through rapid data provisioning.
  • 29. Consuming real-time & historical data from the mesh. 1) Separate Systems for Real-time and Historical Data (Lambda Architecture). Considerations: it is difficult to correlate real-time data with historical “snapshot” data; there are two systems to manage; snapshots have less granularity than event streams. 2) One System for Real-time and Historical Data (Kappa Architecture). Considerations: operational complexity (addressed in Confluent Cloud); downsides of the immutability of regular streams, e.g. when altering or deleting events; storage cost (addressed in Confluent Cloud, and in Apache Kafka with KIP-405).
  • 30. What this can look like in practice. (Screenshot: Browse Schemas.)
  • 32. With ksqlDB the data mesh is queryable and decentralized. Query is the interface to the mesh. Events are the interface to the mesh. (Diagram labels: Destination Data Port, Stream Processor, ksqlDB.)
  • 33. Think: Infrastructure as code, but for data. Code + Container Image: Same APP every time. Code + Event Streams: Same DATA every time.
  • 34. Mesh is one logical cluster. Data product has another: each Data Product has its own cluster for internal use.
  • 35. The Principles of a Data Mesh: (1) Domain-driven Decentralization: Local Autonomy (Organizational Concerns). (2) Data as a First-class Product: Product thinking, “Microservice for Data”. (3) Self-serve Data Platform: Infra Tooling, Across Domains. (4) Federated Governance: Interoperability, Network Effects (Organizational Concerns).
  • 36. Principle 4: Federated Governance. Objective: Independent data products can interoperate and create network effects. Establish global standards, like governance, that apply to all data products in the mesh. Ideally, these global standards and rules are applied automatically by the platform. What is decided locally by a domain? What is decided globally (implemented and enforced by the Self-serve Data Platform)? Must balance between Decentralization vs. Centralization. No silver bullet!
  • 37. Example standard: Identifying customers globally. Define how data is represented, so you can join and correlate data across different domains. Use data contracts, schemas, registries, etc. to implement and enforce such standards. Use Event Streaming to retrofit historical data to new requirements, standards. Example: every domain identifies the same customer as customerId=29639, which allows a join such as: SELECT … FROM orders o LEFT JOIN shipments s ON o.customerId = s.customerId EMIT CHANGES;
  • 38. Example standard: Detect errors and recover with Streams. Use strategies like logging, data profiling, data lineage, etc. to detect errors in the mesh. Streams are very helpful to detect errors and identify cause-effect relationships. Streams let you recover and fix errors: e.g., replay & reprocess historical data. Bug? Error? Rewind to the start of the stream, then reprocess. If needed, tell the origin data product to fix problematic data at the source. Event Streams give you a powerful Time Machine.
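The "rewind, then reprocess" recovery described on this slide can be sketched as follows. This is an illustration only: the event list stands in for a stored stream, and the event fields and function names are hypothetical. The point is that the consumer's state is derived entirely from the log, so fixing the code and replaying from offset 0 yields corrected state without touching the source data.

```python
# A stored stream of stock changes; a list stands in for the persisted log.
events = [
    {"offset": 0, "item": "apple", "delta": 5},
    {"offset": 1, "item": "apple", "delta": -2},
    {"offset": 2, "item": "pear", "delta": 4},
    {"offset": 3, "item": "apple", "delta": -1},
]


def buggy_apply(stock, event):
    # Bug: ignores the sign of the change and always adds.
    stock[event["item"]] = stock.get(event["item"], 0) + abs(event["delta"])


def fixed_apply(stock, event):
    stock[event["item"]] = stock.get(event["item"], 0) + event["delta"]


def reprocess(log, apply, from_offset=0):
    """Rebuild state from scratch by replaying the log from from_offset."""
    stock = {}
    for event in log[from_offset:]:
        apply(stock, event)
    return stock


wrong = reprocess(events, buggy_apply)  # state produced by the buggy app
right = reprocess(events, fixed_apply)  # state after rewinding and reprocessing
```

If the events themselves were wrong rather than the consuming code, replay alone would not help; as the slide notes, the origin data product would then need to fix the data at the source.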
  • 39. Example standard: Tracking data lineage with Streams. Lineage must work across domains and data products, and across systems, clouds, and data centers (including on-premise). Event streaming is a foundational technology for this.
  • 40. Recommendations: Federated Governance. 1. Be pragmatic: Don’t expect governance systems to be perfect. a. They are a map that helps you navigate the data-landscape of your company. b. But there will always be roads that have changed or have not been mapped. 2. Governance is more a process (i.e., an organizational concern) than a technology. 3. Beware of centralized data models, which can become slow to change. Where they must exist, use processes & tooling like GitHub to collaborate and change quickly. Good luck! 🙂
  • 41. Data mesh journey. Principle 1: Data should have one owner: the team that creates it. Principle 2: Data is your product: All exposed data should be good data. Principle 3: Get access to any data immediately and painlessly, be it historical or real-time. Principle 4: Governance, with standards, security, lineage, etc. (cross-cutting concerns). (Diagram: steps 1 to 3 are placed along a “Difficulty to execute” axis, with a “Start Here” marker.)
  • 42. Learn More. Explore how to build a cloud-native Data Mesh using Confluent’s fully managed, serverless Apache Kafka® service: Confluent Cloud, cnfl.io/confluent-cloud. Promo Code: DATAMESH101. Get Started Today.
  • 43. Start your Apache Kafka® journey at developer.confluent.io. Some news: Confluent Developer, a site built as the #1 destination for learning Apache Kafka® and Confluent, has just had a MAJOR CONTENT UPGRADE.