Don’t underestimate the Hidden Technical Debt in Machine Learning Systems.
Leverage Apache Kafka’s open ecosystem as a scalable and flexible Event Streaming Platform to build one pipeline for real-time and batch use cases.
Use Streaming Machine Learning with Apache Kafka, Tiered Storage, and TensorFlow IO to simplify your big data architecture.
Tiered Storage for Kafka provides:
- one platform for all data processing
- an event-based source of truth for materialized views
- no need for a pipeline between Kafka and a Data Lake like Hadoop
Benefits:
- cost reduction
- long-term backup
- performance isolation (real-time and historical analysis in the same cluster)
Use Cases for Reprocessing Historical Events:
- New consumer application
- Error-handling
- Compliance / regulatory processing
- Query and analyze existing events
- Model training
Apache Kafka, Tiered Storage and TensorFlow for Streaming Machine Learning without a Data Lake
1. Apache Kafka, Tiered Storage and TensorFlow for Streaming Machine Learning without a Data Lake
Kai Waehner
Technology Evangelist
contact@kai-waehner.de
LinkedIn
@KaiWaehner
www.confluent.io
www.kai-waehner.de
2. Disclaimer – Status for Tiered Storage in August 2020
KIP-405: Add Tiered Storage Support to Kafka (in the works)
https://cwiki.apache.org/confluence/display/KAFKA/KIP-405%3A+Kafka+Tiered+Storage
Confluent is actively working on this with the open source community; Uber is leading this initiative.
Confluent Tiered Storage is available today in Confluent Platform and used under the hood in Confluent Cloud.
www.kai-waehner.de | @KaiWaehner | Streaming Machine Learning without a Data Lake
4. Machine Learning to Improve Traditional and to Build New Use Cases
Use cases: Real Time Tracking, Predictive Maintenance, Fraud Detection, Cross Selling, Transportation Rerouting, Customer Service, Inventory Management, Autonomous Driving, Face Recognition, Robotics, Speech Translation, Video Generation, Supply Chain Optimization, Simulations, Customer Churn
[Figure labels: Real Time Information, Digital Transformation, Strategic Goals]
5. Global Automotive Company Builds Connected Car Infrastructure
Digital Transformation
● Improve Customer Experience
● Increase Revenue
● Reduce Risk
Timeline
● 3 years ago: Project begins
● Today: Connected car infrastructure in production for first use cases
● 2 years in the future: Improved processes leveraging machine learning (predictive maintenance, cross-selling)
6. Streaming Analytics for Predictive Maintenance at Scale
[Architecture diagram. Components: Car Sensors, IoT Integration Layer, Streaming Platform, Big Data Integration Layer, Batch Analytics Platform, BI Dashboard, Real Time Monitoring System. Flow labels: Ingest Data, All Data, Critical Data, Human Intelligence. Legend: Car Sensors, Streaming Platform, Other Components.]
7. Machine Learning (ML)
...allows computers to find hidden insights without being programmed where to look
Machine Learning
● Decision Trees
● Naïve Bayes
● Clustering
● Neural Networks
● Etc.
Deep Learning
● CNN
● RNN
● Transformer
● Autoencoder
● Etc.
8. Streaming Analytics for Predictive Maintenance at Scale
[Architecture diagram. Components: Car Sensors, IoT Integration Layer, Streaming Platform, Big Data Integration Layer, Batch Analytics Platform, BI Dashboard, Real Time Monitoring System, Analytics Platform. Flow labels: Ingest Data, All Data, Critical Data, "Potential Detect", Preprocess Data, Consume Data, Train Analytic Model, Data Processing, Deploy Analytic Model. Legend: Car Sensors, Streaming Platform, Analytics Platform, Other Components.]
9. The First Analytic Models
How to deploy the models in production?
…real-time processing?
…at scale?
…24/7 with zero downtime?
10. Hidden Technical Debt in Machine Learning Systems
https://papers.nips.cc/paper/5656-hidden-technical-debt-in-machine-learning-systems.pdf
11. Scalable, Technology-Agnostic ML Infrastructures
What is this thing used everywhere?
https://www.infoq.com/presentations/netflix-ml-meson
https://eng.uber.com/michelangelo
https://www.infoq.com/presentations/paypal-data-service-fraud
12. A Streaming Platform - The Underpinning of an Event-Driven Architecture
[Diagram. Producers: Microservices, DBs, SaaS apps, Mobile. Consumers: Customer 360, Real-time fraud detection, Data warehouse. Streams of real time events: Database change, Microservices events, SaaS data, Customer experiences. Connectors and stream processing apps appear on both the producer and the consumer side.]
13. Apache Kafka as Infrastructure for ML
14. Apache Kafka’s Open Ecosystem as Infrastructure for ML
[Diagram labels: Kafka Streams / ksqlDB, Kafka Connect, Confluent REST Proxy, Confluent Schema Registry, Go/.NET/Python Kafka Producer, ksqlDB Python Client]
15. Ingestion of IoT Data
[Diagram labels: Cars, Kafka Connect, Replication (MirrorMaker / Confluent Replicator), Analytics / Machine Learning]
17. Preprocessing with ksqlDB
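-- Enrich each sensor event with its car model, then keep only Audi A8 events.
-- Columns are qualified because car_model_id exists in both sources.
SELECT c.car_id, c.event_id, c.car_model_id, c.sensor_input
FROM car_sensor c
LEFT JOIN car_models m ON c.car_model_id = m.car_model_id
WHERE m.car_model_type = 'Audi_A8';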
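To make the preprocessing step concrete, here is a minimal Python sketch of the same logic outside ksqlDB: look up each sensor event's car model and keep only events of one model type. The function name `filter_by_model_type`, the dictionary layout, and the sample records are assumptions for illustration and are not taken from the talk.

```python
from typing import Dict, Iterable, Iterator


def filter_by_model_type(
    car_sensor: Iterable[Dict],
    car_models: Dict[int, Dict],
    model_type: str,
) -> Iterator[Dict]:
    """Enrich each sensor event with its car model and keep matching types."""
    for event in car_sensor:
        # Table lookup by key; None mirrors the unmatched side of a LEFT JOIN.
        model = car_models.get(event["car_model_id"])
        # The WHERE clause drops unmatched rows, so the result behaves
        # like an inner join for this query.
        if model is not None and model["car_model_type"] == model_type:
            yield {
                "car_id": event["car_id"],
                "event_id": event["event_id"],
                "car_model_id": event["car_model_id"],
                "sensor_input": event["sensor_input"],
            }


# Hypothetical sample data.
car_models = {1: {"car_model_type": "Audi_A8"}, 2: {"car_model_type": "Other"}}
events = [
    {"car_id": "c1", "event_id": 10, "car_model_id": 1, "sensor_input": 0.7},
    {"car_id": "c2", "event_id": 11, "car_model_id": 2, "sensor_input": 0.1},
    {"car_id": "c3", "event_id": 12, "car_model_id": 9, "sensor_input": 0.4},
]
selected = list(filter_by_model_type(events, car_models, "Audi_A8"))
```

Because the filter tests a column from the joined table, events without a matching car model never reach the output, even though the query is written as a LEFT JOIN.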
18. Data Ingestion into a Data Store for Model Training (and Consumption by other Decoupled Applications)
[Diagram labels: Preprocessed Data, Connect; consumption modes: Batch, Near Real Time, Real Time]
19. Extreme scale using TensorFlow and TPUs in the cloud!
Analytic Model
Model Training Using an Elastic Infrastructure in the Cloud
20. TensorFlow Model: Autoencoder for Anomaly Detection
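An autoencoder flags anomalies through its reconstruction error: inputs that resemble the training data are reconstructed well, unusual inputs are not. The sketch below shows only that decision step in plain Python. `toy_autoencoder` is a stand-in for a trained TensorFlow model, and the "mean plus three standard deviations" threshold is an assumed rule, not the one used in the talk.

```python
from statistics import mean, pstdev
from typing import Callable, List, Sequence

Vector = Sequence[float]


def reconstruction_error(x: Vector, reconstructed: Vector) -> float:
    """Mean squared error between an input and its reconstruction."""
    return sum((a - b) ** 2 for a, b in zip(x, reconstructed)) / len(x)


def fit_threshold(errors: List[float], k: float = 3.0) -> float:
    """Assumed rule: mean + k standard deviations of errors on normal data."""
    return mean(errors) + k * pstdev(errors)


def is_anomaly(
    x: Vector, autoencoder: Callable[[Vector], Vector], threshold: float
) -> bool:
    """An input is anomalous if the model reconstructs it poorly."""
    return reconstruction_error(x, autoencoder(x)) > threshold


def toy_autoencoder(x: Vector) -> List[float]:
    """Stand-in for a trained model: reconstructs every reading as the average."""
    centre = sum(x) / len(x)
    return [centre] * len(x)


# Hypothetical "normal" sensor vectors used to calibrate the threshold.
normal_data = [[1.0, 1.1, 0.9], [1.0, 1.0, 1.0], [0.9, 1.1, 1.0]]
errors = [reconstruction_error(x, toy_autoencoder(x)) for x in normal_data]
threshold = fit_threshold(errors)

spike_flagged = is_anomaly([1.0, 5.0, 1.0], toy_autoencoder, threshold)
normal_flagged = is_anomaly([1.0, 1.05, 0.95], toy_autoencoder, threshold)
```

With a real model, `toy_autoencoder` would be replaced by a call to the trained network's prediction function; the thresholding logic stays the same.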
21. Direct streaming ingestion for model training with TensorFlow I/O + Kafka Plugin (no additional data storage like S3 or HDFS required!)
Streaming Ingestion and Model Training with TensorFlow IO
https://github.com/tensorflow/io
[Diagram labels: Producer, Distributed Commit Log, Time, Model A, Model B, Model X (at a later time)]
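The idea that several models can train directly from the same commit log, at different points in time, can be sketched without any infrastructure. The example below uses an in-memory list in place of a Kafka topic partition and a running mean in place of a TensorFlow model; `CommitLog` and `RunningMeanModel` are hypothetical names, and none of this is the TensorFlow I/O API.

```python
from typing import Iterator, List


class CommitLog:
    """In-memory stand-in for one Kafka topic partition (append-only, offset-addressed)."""

    def __init__(self) -> None:
        self._records: List[float] = []

    def append(self, value: float) -> int:
        self._records.append(value)
        return len(self._records) - 1  # offset of the new record

    def read_batches(self, start_offset: int, batch_size: int) -> Iterator[List[float]]:
        """Yield consecutive batches starting at the given offset."""
        for i in range(start_offset, len(self._records), batch_size):
            yield self._records[i:i + batch_size]


class RunningMeanModel:
    """Toy 'model' updated batch by batch, standing in for incremental training."""

    def __init__(self) -> None:
        self.count = 0
        self.mean = 0.0

    def train_on_batch(self, batch: List[float]) -> None:
        for value in batch:
            self.count += 1
            self.mean += (value - self.mean) / self.count


log = CommitLog()
for reading in [1.0, 2.0, 3.0, 4.0]:
    log.append(reading)

# Model A trains on everything written so far.
model_a = RunningMeanModel()
for batch in log.read_batches(start_offset=0, batch_size=2):
    model_a.train_on_batch(batch)

log.append(10.0)  # the producer keeps writing

# Model X is created later and replays the log from the first offset.
model_x = RunningMeanModel()
for batch in log.read_batches(start_offset=0, batch_size=2):
    model_x.train_on_batch(batch)
```

Each model keeps its own read position, so adding a new model later does not require copying the data into a separate store first.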
22. Store Data Long-Term in Kafka?
Today, Kafka works well for recent events, short horizon storage, and manual data balancing.
Kafka’s present-day design offers extraordinarily low messaging latency by storing topic data on fast disks that are collocated with brokers. This is usually good.
But sometimes, you need to store a huge amount of data for a long time.
[Diagram labels: App; Kafka with Processing and Storage; Transactions, auth, quota enforcement, compaction, ...]
23. Simplified Data Lake Architecture
Tiered Storage for Kafka provides
● one platform for all data processing
● an event-based source of truth for materialized views
● no need for a pipeline between Kafka and a Data Lake like Hadoop
Benefits
● cost reduction
● long-term backup
● performance isolation (real-time and historical analysis in the same cluster)
24. Confluent Tiered Storage for Kafka
[Diagram labels: Apps; Kafka with Processing and Storage (Local); Object Store (Remote); Transactions, auth, quota enforcement, compaction, ...]
Store Forever
Older data is offloaded to inexpensive object storage, permitting it to be consumed at any time.
Save $$$
Storage limitations, like capacity and duration, are effectively uncapped.
Instantaneously scale up and down
Your Kafka clusters will be able to automatically self-balance load and hence elastically scale.
(Only available in Confluent Platform)
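The offloading behaviour can be pictured with a toy model: a bounded local tier for recent records, an unbounded remote tier for older ones, and reads that work by offset regardless of tier. This is a conceptual sketch only. `TieredLog` and its methods are hypothetical and do not reflect how Confluent Tiered Storage or KIP-405 are actually implemented.

```python
from typing import Dict


class TieredLog:
    """Toy log with a bounded local tier and an unbounded remote tier."""

    def __init__(self, local_capacity: int) -> None:
        self.local_capacity = local_capacity
        self.local: Dict[int, str] = {}   # recent records on "broker disk"
        self.remote: Dict[int, str] = {}  # older records in an "object store"
        self.next_offset = 0

    def append(self, record: str) -> int:
        offset = self.next_offset
        self.local[offset] = record
        self.next_offset += 1
        # Offload the oldest local records once the local tier is full.
        while len(self.local) > self.local_capacity:
            oldest = min(self.local)
            self.remote[oldest] = self.local.pop(oldest)
        return offset

    def read(self, offset: int) -> str:
        # Readers address records by offset and do not choose a tier.
        if offset in self.local:
            return self.local[offset]
        return self.remote[offset]


log = TieredLog(local_capacity=2)
for i in range(5):
    log.append(f"event-{i}")

oldest_event = log.read(0)   # served from the remote tier
newest_event = log.read(4)   # served from the local tier
```

The point of the sketch is the read path: a consumer of historical data uses the same offsets as a real-time consumer, which is what removes the need for a separate pipeline into a data lake.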
25. Confluent Tiered Storage for Kafka
(Only available in Confluent Platform)
26. Use Cases for Reprocessing Historical Events
Give me all events from time A to time B
Real-time Producer
Time
• New consumer application
• Error-handling
• Compliance / regulatory processing
• Query and analyze existing events
• Model training
Real-time Consumer
Consumer of Historical Data
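"Give me all events from time A to time B" is a range lookup on a time-ordered log. Kafka consumers can translate timestamps into offsets (for example with `offsetsForTimes` in the Java consumer); the sketch below shows the same idea on an in-memory list using binary search. The function name `events_between`, the half-open interval, and the sample events are assumptions for illustration.

```python
from bisect import bisect_left
from typing import List, Tuple

Event = Tuple[int, str]  # (timestamp in ms, payload)


def events_between(log: List[Event], time_a: int, time_b: int) -> List[Event]:
    """Return events with time_a <= timestamp < time_b from a time-ordered log."""
    timestamps = [ts for ts, _ in log]
    start = bisect_left(timestamps, time_a)  # first event at or after time A
    end = bisect_left(timestamps, time_b)    # first event at or after time B
    return log[start:end]


# Hypothetical events, already ordered by timestamp as in an append-only log.
log = [(100, "a"), (200, "b"), (300, "c"), (400, "d")]
historical = events_between(log, 150, 400)
```

The same lookup serves every use case in the list above: a new consumer, an error-handling replay, a compliance query, or a model training run differ only in the time window they request.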
27. Local Predictions
Model Training in Cloud
Model Deployment at the Edge
Analytic Model
Separation of Model Training and Model Inference
30. Model Deployment with Apache Kafka, ksqlDB and TensorFlow
User Defined Function (UDF):
CREATE STREAM AnomalyDetection AS
  SELECT sensor_id,
         detectAnomaly(sensor_values)
  FROM car_engine;
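Conceptually, the statement applies a scoring function to every event in the `car_engine` stream and emits the result as a new stream. The Python sketch below mirrors that shape with a generator. Here `detect_anomaly` is a hypothetical stand-in that scores by maximum deviation from the average; in the talk's setup the UDF would wrap a trained TensorFlow model, and ksqlDB UDFs themselves are written in Java.

```python
from typing import Callable, Dict, Iterable, Iterator, List


def detect_anomaly(sensor_values: List[float]) -> float:
    """Hypothetical scoring function; a real UDF would call a trained model."""
    centre = sum(sensor_values) / len(sensor_values)
    return max(abs(v - centre) for v in sensor_values)


def anomaly_detection(
    car_engine: Iterable[Dict],
    score: Callable[[List[float]], float] = detect_anomaly,
) -> Iterator[Dict]:
    """Apply the scoring function to each event, one record at a time."""
    for event in car_engine:
        yield {
            "sensor_id": event["sensor_id"],
            "anomaly_score": score(event["sensor_values"]),
        }


# Hypothetical input events.
car_engine = [
    {"sensor_id": "s1", "sensor_values": [1.0, 1.0, 1.0]},
    {"sensor_id": "s2", "sensor_values": [1.0, 4.0, 1.0]},
]
results = list(anomaly_detection(car_engine))
```

Because the generator handles one event at a time, it works the same on a finite list and on an unbounded stream, which matches how a continuous query scores events as they arrive.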
31. Streaming Analytics with Kafka and TensorFlow
[Architecture diagram. Components: Car Sensors, MQTT Proxy, Kafka Cluster, Kafka Connect, Kafka Streams Application, ksqlDB, TensorFlow, Tiered Storage, MongoDB, Storage, Dashboards, Search, Analytics, Mobile App, BI Tool. Flow labels: Ingest Data, All Data, Critical Data, "Potential Detect", Preprocess Data, Consume Data, Train Analytic Model, Deploy Analytic Model. Legend: Car Sensors, Kafka Ecosystem, TensorFlow, Other Components.]
32. Demo: 100,000 Connected Cars (Kafka + ksqlDB + MQTT + TensorFlow)
https://github.com/kaiwaehner/hivemq-mqtt-tensorflow-kafka-realtime-iot-machine-learning-training-inference
35. One pipeline to rule them all
Real-time model scoring, batch model training, near-real time BI analytics
Give me all events from time A to time B
Car sensors (MQTT connector)
Time
Production infrastructure (Java)
Data science / analytics infrastructure (Python + Jupyter)