The term data quality is used to describe the dependability, reliability, and usability of datasets. Data scientists and business analysts often determine the quality of a dataset by its trustworthiness and completeness. But what information might be needed to differentiate between useful vs noisy data? How quickly can data quality issues be identified and explored? More importantly, how can metadata enable data scientists to make better sense of the high volume of data within their organization from a variety of data sources?
With Airflow now ubiquitous for DAG orchestration, organizations increasingly dependon Airflow to manage complex inter-DAG dependencies and provide up-to-date runtime visibility into DAG execution. At WeWork, Airflow has quickly become an important component of our Data Platform powering billing, space inventory, etc. But what effects (if any) would upstream DAGs have on downstream DAGs if dataset consumption was delayed? What alerting rules should be in place to notify downstream DAGs of possible upstream processing issues or failures?
At WeWork, we feel it’s critical that DAG metadata is collected, maintained, and shared across the organization. This investment in metadata enables:
● Data lineage
● Data governance
● Data discovery
In this talk, we introduce Marquez: an open source metadata service for the collection, aggregation, and visualization of a data ecosystem’s metadata. We will demonstrate how metadata management with Marquez helps maintain inter-DAG dependencies, catalog historical runs of DAGs, and minimize data quality issues.
4. Data Platform
DATA
● What is the data source?
● Does the data have a
schema?
● Does the data have an
owner?
● How often is the data
updated?
Today: Limited context
7. Data lineage
● Add context to
data
Democratize
● Self-service data
culture
Data quality
● Build trust in
data
Why manage and utilize metadata?
Data Platform
38. Data Platform
● Open source!
● Enables global task-level metadata collection
● Extends Airflow’s DAG class
from marquez_airflow import DAG
from airflow.operators.postgres_operator import PostgresOperator
...
room_bookings_7_days_dag.py
Marquez: Airflow
Marquez Airflow Lib.
42. Marquez: Data triggers
● Powered by data lineage
● Timely processing of data
○ No polling!
● Reduce manual handling of
backfills
● Reduce production of bad data
○ Incomplete data
○ Low-quality data
LINEAGE
Data Platform JOBDATASET
Managing inter-DAG dependencies
with data triggers
43. Marquez: Data triggers
Managing inter-DAG dependencies
with data triggers
LINEAGE
Data Platform
● Powered by data lineage
● Timely processing of data
○ No polling!
● Reduce manual handling of
backfills
● Reduce production of bad data
○ Incomplete data
○ Low-quality data
JOBDATASET
44. Marquez: Data triggers
Managing inter-DAG dependencies
with data triggers
LINEAGE
Data Platform
● Powered by data lineage
● Timely processing of data
○ No polling!
● Reduce manual handling of
backfills
● Reduce production of bad data
○ Incomplete data
○ Low-quality data
JOBDATASET
45. Marquez: Data triggers
Managing inter-DAG dependencies
with data triggers
LINEAGE
Data Platform
● Powered by data lineage
● Timely processing of data
○ No polling!
● Reduce manual handling of
backfills
● Reduce production of bad data
○ Incomplete data
○ Low-quality data
JOBDATASET
46. Marquez: Data triggers
Managing inter-DAG dependencies
with data triggers
LINEAGE
Data Platform
● Powered by data lineage
● Timely processing of data
○ No polling!
● Reduce manual handling of
backfills
● Reduce production of bad data
○ Incomplete data
○ Low-quality data
JOBDATASET
47. Marquez: Data triggers
Managing inter-DAG dependencies
with data triggers
LINEAGE
Data Platform
● Powered by data lineage
● Timely processing of data
○ No polling!
● Reduce manual handling of
backfills
● Reduce production of bad data
○ Incomplete data
○ Low-quality data
JOBDATASET
48. Marquez: Data triggers
Managing inter-DAG dependencies
with data triggers
LINEAGE
Data Platform
● Powered by data lineage
● Timely processing of data
○ No polling!
● Reduce manual handling of
backfills
● Reduce production of bad data
○ Incomplete data
○ Low-quality data
JOBDATASET
49. Marquez: Data triggers
Managing inter-DAG dependencies
with data triggers
LINEAGE
Data Platform
● Powered by data lineage
● Timely processing of data
○ No polling!
● Reduce manual handling of
backfills
● Reduce production of bad data
○ Incomplete data
○ Low-quality data
JOBDATASET
50. Marquez: Data triggers
Managing inter-DAG dependencies
with data triggers
LINEAGE
Data Platform
● Powered by data lineage
● Timely processing of data
○ No polling!
● Reduce manual handling of
backfills
● Reduce production of bad data
○ Incomplete data
○ Low-quality data
JOBDATASET
51. Marquez: Data triggers
Managing inter-DAG dependencies
with data triggers
LINEAGE
Data Platform
● Powered by data lineage
● Timely processing of data
○ No polling!
● Reduce manual handling of
backfills
● Reduce production of bad data
○ Incomplete data
○ Low-quality data
JOBDATASET