The term data quality describes the correctness, reliability, and usability of datasets. Data scientists and business analysts often judge the quality of a dataset by its trustworthiness and completeness. But what information is needed to differentiate between good and bad data? How quickly can data quality issues be identified and explored? More importantly, how can metadata help data scientists make better sense of the high volume of data that flows into their organization from a variety of data sources?
To maximize the usefulness of datasets for data-intensive applications, it is critical that metadata is collected, maintained, and shared across the organization. The investment in metadata enables data lineage, data governance, and data discovery.
Machine Learning (ML) jobs, just another type of data-intensive application, benefit from metadata as well. But unlike most software projects, which use established tools for maintaining quality, ML projects have fewer safeguards to prevent defects. Marquez helps fill the tooling gap for ML jobs by tracking the relationships between training jobs, input datasets, and ML models. Marquez also links the different variations of training jobs, whose number can grow wildly due to experimentation and hyperparameter optimization. Data lineage tracking in Marquez also reveals unexpected changes in upstream data dependencies, which can harm model performance and be time-consuming to debug.
In this talk, we introduce Marquez: an open source metadata service for the collection, aggregation, and visualization of a data ecosystem’s metadata. We will demonstrate how metadata management with Marquez helps maintain high model performance and prevent quality issues.
5. Why manage and utilize metadata?
Data lineage
● Add context to data
Democratize
● Self-service data culture
Data quality
● Build trust in data
Data Platform
20. Marquez: Airflow
Airflow support for Marquez
● Enables global task-level metadata collection
● Extends Airflow’s DAG class
room_bookings_7_days_dag.py:
from marquez_airflow import DAG
from airflow.operators.postgres_operator import PostgresOperator
...
21. Marquez: Airflow
Airflow support for Marquez (cont.)
[Diagram: Airflow with four DAGs and the Marquez Lib.]
● Metadata
○ Task lifecycle
○ Task parameters
○ Task runs linked to versioned code
○ Task inputs / outputs
● Lineage
○ Track origin of data
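To make the metadata listed above concrete, here is a minimal sketch of what one collected task run could look like as a record. It is not the Marquez library's API: the `TaskRun` class, its field names, and the state strings are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Dict, List, Optional


@dataclass
class TaskRun:
    """Hypothetical record of the metadata collected for one task run."""

    task_name: str
    parameters: Dict[str, str]      # task parameters
    code_version: str               # e.g. a commit SHA, links the run to versioned code
    inputs: List[str] = field(default_factory=list)    # datasets read
    outputs: List[str] = field(default_factory=list)   # datasets written
    state: str = "NEW"              # task lifecycle state (assumed names)
    started_at: Optional[datetime] = None
    ended_at: Optional[datetime] = None

    def start(self) -> None:
        """Mark the run as started and record the start time."""
        self.state = "RUNNING"
        self.started_at = datetime.now(timezone.utc)

    def complete(self) -> None:
        """Mark the run as finished and record the end time."""
        self.state = "COMPLETED"
        self.ended_at = datetime.now(timezone.utc)
```

Recording inputs and outputs on every run is what later makes it possible to trace the origin of a dataset.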
27. Data Platform
● You are a successful Data Scientist or Machine Learning Engineer
● Your organization has a healthy data ecosystem
● Occasionally you build ML models for periodic, offline use
28. Data Platform
● You are a successful Data Scientist or Machine Learning Engineer
● Your organization has a healthy data ecosystem
● Occasionally you build ML models for one-time use
● Life is good
29. Data Platform
● Your CTO schedules a meeting with you
● He says those ML models are great and all…
● But he wants way more models… making real-time predictions… driving impactful business decisions
30. Data Platform
● You’re going from ML in “the Small” to ML in “the Large”¹
● What happens next?
¹ https://al3x.net/posts/2010/07/27/node.html
31. Data Platform
Machine Learning at Scale
● You set up infrastructure to build way more models
● Your models are driving business decisions in real-time
● The models make great predictions
[Diagram: three models, each marked 😄]
33. Data Platform
Machine Learning at Scale
● For some models, accuracy is declining without explanation
● There are no bugs in the training workflow
● Changing learning algorithms does not help
Model 😭
48. Data Platform
Machine Learning at Scale
● You traced the upstream lineage and found the source of bad data
[Diagram: lineage graph of Job and Dataset nodes leading to the Model]
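Tracing upstream lineage amounts to walking a dependency graph backwards from the model. The sketch below shows one way to do that with a breadth-first search. The graph contents and the `upstream_lineage` helper are hypothetical; only the `room_bookings_7_days` name echoes the DAG file shown earlier, and none of this is Marquez's actual API.

```python
from collections import deque
from typing import Dict, List, Set

# Hypothetical lineage graph: each node maps to the nodes it directly depends on.
UPSTREAM: Dict[str, List[str]] = {
    "model:room_demand": ["job:train_room_demand"],
    "job:train_room_demand": ["dataset:room_bookings_7_days"],
    "dataset:room_bookings_7_days": ["job:aggregate_bookings"],
    "job:aggregate_bookings": ["dataset:room_bookings"],
    "dataset:room_bookings": [],
}


def upstream_lineage(node: str, graph: Dict[str, List[str]]) -> List[str]:
    """Return every node upstream of `node`, nearest dependencies first."""
    seen: Set[str] = {node}
    order: List[str] = []
    queue = deque([node])
    while queue:
        current = queue.popleft()
        for parent in graph.get(current, []):
            if parent not in seen:  # guard against shared or cyclic dependencies
                seen.add(parent)
                order.append(parent)
                queue.append(parent)
    return order
```

Calling `upstream_lineage("model:room_demand", UPSTREAM)` lists the jobs and datasets to inspect when the model's accuracy drops, starting with the ones closest to the model.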
49. Data Platform
Machine Learning at Scale
● You traced the upstream lineage and found the source of bad data
● But it will take days of data cleansing work before model accuracy is restored
[Diagram: lineage graph of Job and Dataset nodes leading to the Model]
50. Data Platform
Machine Learning at Scale
● You traced the upstream lineage and found the source of bad data
● But it will take days of data cleansing work before model accuracy is restored
● You need to roll back to the last best model
[Diagram: lineage graph of Job and Dataset nodes leading to the Model]
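A rollback is only fast if every trained model version is recorded together with what it was trained on. The following is a minimal sketch under that assumption. `ModelVersion`, its fields, and `last_good_version` are hypothetical names for illustration, not part of Marquez.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class ModelVersion:
    """Hypothetical record of one trained model version."""

    version: int
    accuracy: float                    # metric recorded when the version was trained
    trained_on_bad_data: bool = False  # set once lineage shows a bad upstream dataset


def last_good_version(versions: List[ModelVersion]) -> Optional[ModelVersion]:
    """Return the most recent version not trained on known-bad data, if any."""
    candidates = [v for v in versions if not v.trained_on_bad_data]
    return max(candidates, key=lambda v: v.version, default=None)
```

This sketch picks the newest unaffected version. Choosing the unaffected version with the highest recorded accuracy instead would be an equally reasonable policy.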
58. Data Platform
Machine Learning at Scale
● Regularly test quality of upstream datasets?
[Diagram: lineage graph of Job and Dataset nodes leading to the Model; one dataset test passes ✅ and another fails ❌]
59. Data Platform
Machine Learning at Scale
● Regularly test quality of upstream datasets?
● Automatically alert an engineer for faster resolution
[Diagram: lineage graph of Job and Dataset nodes leading to the Model; one dataset test passes ✅ and another fails ❌]
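The two ideas on this slide, regular dataset tests and automatic alerts, can be sketched in a few lines. The checks, the `room_id` column, and the `notify` callback below are hypothetical. A real setup would plug in its own checks and paging or chat integration.

```python
from typing import Callable, Dict, List, Optional

Row = Dict[str, Optional[str]]
Check = Callable[[List[Row]], bool]


def non_empty(rows: List[Row]) -> bool:
    """The dataset must contain at least one row."""
    return len(rows) > 0


def no_missing_room_ids(rows: List[Row]) -> bool:
    """Every row must carry a room_id (hypothetical column)."""
    return all(row.get("room_id") is not None for row in rows)


def failed_checks(rows: List[Row], checks: Dict[str, Check]) -> List[str]:
    """Run each named check and return the names of those that failed."""
    return [name for name, check in checks.items() if not check(rows)]


def alert_on_failure(
    dataset: str, failures: List[str], notify: Callable[[str], None]
) -> bool:
    """Send one alert if any check failed; return True when an alert was sent."""
    if not failures:
        return False
    notify(f"Quality checks failed for {dataset}: {', '.join(failures)}")
    return True
```

Passing `notify` in as a callback keeps the checks independent of how the engineer is actually reached.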
64. Data Platform
Machine Learning at Scale
● Data pipelines often run on a fixed schedule
[Diagram: lineage graph of Job and Dataset nodes leading to the Model; one dataset is marked ❌]
65. Data Platform
Machine Learning at Scale
● Data pipelines often run on a fixed schedule
● It’s a race to fix the issue before bad data propagates
[Diagram: lineage graph of Job and Dataset nodes leading to the Model; one dataset is marked ❌]
67. Data Platform
Machine Learning at Scale
● Data pipelines often run on a fixed schedule
● It’s a race to fix the issue before bad data propagates
● What if the pipeline was dataset quality-aware?
[Diagram: lineage graph of Job and Dataset nodes leading to the Model; one dataset is marked ❌]
68. Data Platform
Machine Learning at Scale
● Data pipelines often run on a fixed schedule
● It’s a race to fix the issue before bad data propagates
● What if the pipeline was aware of dataset quality?
[Diagram: lineage graph of Job and Dataset nodes leading to the Model; one dataset is marked ❌, with the prompt “OK to run?”]
70. Marquez: Data model
[Diagram: entity-relationship model of Job, JobVersion, Run, Dataset, DatasetVersion, Model, and ModelVersion, linked by one-to-many (1 to *) relationships]
Determine if training is safe by checking metadata
DatasetVersion: quality_status (boolean)
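The "OK to run?" decision can then be a simple lookup over the dataset versions a training run is about to read. The sketch below takes the `quality_status` boolean on `DatasetVersion` from the slide. The other fields and the `ok_to_run` function are hypothetical and do not reflect Marquez's actual schema or API.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class DatasetVersion:
    """One version of a dataset, with the quality flag from the data model."""

    dataset: str          # hypothetical field: dataset name
    version: str          # hypothetical field: version identifier
    quality_status: bool  # True when this version passed its quality checks


def ok_to_run(inputs: Iterable[DatasetVersion]) -> bool:
    """Training is considered safe only if every input version passed its checks.

    A run with no inputs is treated as safe, since there is nothing to reject.
    """
    return all(version.quality_status for version in inputs)
```

A scheduler could call `ok_to_run` just before starting a training job and skip the run when it returns `False`, so that bad data stops at the failing dataset instead of reaching the model.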
71. Data Platform
ML + Marquez
Problems Solved
✅ Identified training data issues with lineage
✅ Fast model rollbacks with model version tracking
✅ Prevented bad training runs with data quality checking