SlideShare a Scribd company logo
1 of 58
Download to read offline
Data Platform
Data Lineage with Apache
Airflow using Marquez
CRUNCH ‘19
Data Platform
Hey!
I’m Willy Lulciuc
Software Engineer
Marquez Team, Data Platform
@wslulciuc
AGENDA
Intro to Marquez
Marquez + Airflow
02
03
Why metadata?01
Data Platform
Demo04
Future work05
Data Platform
DATA
● What is the data source?
● Does the data have a
schema?
● Does the data have an
owner?
● How often is the data
updated?
Today: Limited context
Data Platform
DATA
SOURCES
● MYSQL
● POSTGRESQL
● REDSHIFT
● SNOWFLAKE
● KAFKA
FORMATS
● CSV
● JSON
● AVRO
● PARQUET
Data contextualization!
Why metadata?01
Data lineage
● Add context to
data
Democratize
● Self-service data
culture
Data quality
● Build trust in
data
Why manage and utilize metadata?
Data Platform
… creating a healthy data
ecosystem
Freedom
● Experiment
● Flexible
● Self-sufficient
Accountability
● Cost
● Trust
Self-service
● Discover
● Explore
● Global context
A healthy data ecosystem
Data Platform
WeWork Metadata
Metadata (Marquez)
Ingest Storage Compute
StreamingBatch/ETL
● Data Platform
built around
Marquez
● Integrations
○ Ingest
○ Storage
○ Compute
Data Platform
Flink
Airflow
Kafka
Iceberg / S3
Intro to Marquez02
http://cidrdb.org/cidr2017/papers/p111-hellerstein-cidr17.pdf
Data
Lineage
Data
Governance
Data
Discovery
Marquez
Data Platform
Data Platform
Metadata Service
● Centralized metadata
management
○ Sources
○ Datasets
○ Jobs
● Modular framework
○ Data governance
○ Data lineage
○ Data discovery +
exploration
Marquez: Design
Marquez
Core
Lineage
Search
REST API
ETL Batch Stream
Integrations
APIs /
Libraries
Data PlatformData Platform
Core Lineage SearchServices
Integrations
APIs /
Libraries
Data PlatformData Platform
Core Lineage SearchServices
Integrations
APIs /
Libraries
Data Platform
Metadata
DB
Graph
Search
Storage
Data Platform
Marquez: Data model
Job
Dataset JobVersion
RunDatasetVersion
*
1
*
1
*
1
1*
1*
Data Platform
Source
1 *
● MYSQL
● POSTGRESQL
● REDSHIFT
● SNOWFLAKE
● KAFKA
● BATCH
● STREAM
● SERVICE
Marquez: Data model
Data Platform
DbTable Filesystem Stream
Job
Dataset JobVersion
RunDatasetVersion
*
1
*
1
*
1
1*
1*
Source
1 *
● MYSQL
● POSTGRESQL
● REDSHIFT
● SNOWFLAKE
● KAFKA
● BATCH
● STREAM
● SERVICE
Data Platform
v1 v4Dataset
v2
v4
v4
Job
v1
Dataset
v4
Job
v2
Marquez: Data model
● Debugging
○ What job version(s) produced and
consumed dataset version X?
● Backfilling
○ Full / incremental processing
Design benefits
Data Platform
Marquez: Metadata collection
How is metadata collected?
● Push-based metadata
collection
● REST API
● Language-specific SDKs
○ Java
○ Python
○ Go
Marquez
Job
Dataset+job
metadata
Data Platform
Marquez: Metadata collection
Source
{
"type":"POSTGRESQL",
"name":"analytics_db”,
"connectionUrl":"jdbc:postgresql://we.db:5431/analytics”,
"description":“Contains tables such as office room bookings.”
}
01
Data Platform
Marquez: Metadata collection
{
"type":"POSTGRESQL",
"name":"analytics_db”,
"connectionUrl":"jdbc:postgresql://we.db:5431/analytics”,
"description":“Contains tables such as office room bookings.”
}
{
"type":"DB_TABLE",
"name":"wedata.room_bookings”,
"physicalName":"wedata.room_bookings”,
"sourceName":"analytics_db”,
"namespace":"datascience",
"columns": [...],
"description":“All global room bookings for each office.”
}
02 Dataset
Source01
Data Platform
Marquez: Metadata collection
{
"type":"POSTGRESQL",
"name":"analytics_db”,
"connectionUrl":"jdbc:postgresql://we.db:5431/analytics”,
"description":“Contains tables such as office room bookings.”
}
{
"type":"DB_TABLE",
"name":"wedata.room_bookings”,
"physicalName":"wedata.room_bookings”,
"sourceName":"analytics_db”,
"namespace":"datascience”,
"columns": [...],
"description":“All global room bookings for each office.”
}
{
"type":"BATCH",
"name":"room_bookings_7_days”,
"inputs":["wedata.room_bookings”],
"outputs":[],
"location":"https://github.com/we/jobs/commit/124f6089...”,
"namespace":"datascience",
"description":“Weekly email of room bookings occupancy patterns.”
}
03 Job
Source01
02 Dataset
Data Platform
Marquez: Metadata collection
{
"type":"POSTGRESQL",
"name":"analytics_db”,
"connectionUrl":"jdbc:postgresql://we.db:5431/analytics”,
"description":“Contains tables such as office room bookings.”
}
{
"type":"DB_TABLE",
"name":"wedata.room_bookings”,
"physicalName":"wedata.room_bookings”,
"sourceName":"analytics_db”,
"namespace":"datascience”,
"columns": [...],
"description":“All global room bookings for each office.”
}
{
"type":"BATCH",
"name":"room_bookings_7_days”,
"inputs":["wedata.room_bookings”],
"outputs":[],
"location":"https://github.com/we/jobs/commit/124f6089...”,
"namespace":"datascience”,
"description":“Weekly email of room bookings occupancy patterns.”
}
03 Job
Source01
LINK SOURCE
LINK DATASET
02 Dataset
Data Platform
Marquez: Metadata collection
{
"type":"POSTGRESQL",
"name":"analytics_db”,
"connectionUrl":"jdbc:postgresql://we.db:5431/analytics”,
"description":“Contains tables such as office room bookings.”
}
{
"type":"DB_TABLE",
"name":"wedata.room_bookings”,
"physicalName":"wedata.room_bookings”,
"sourceName":"analytics_db”,
"namespace":"datascience”,
"columns": [...],
"description":“All global room bookings for each office.”
}
Source01
{
"type":"BATCH",
"name":"room_bookings_7_days”,
"inputs":["wedata.room_bookings”],
"outputs":[],
"location":"https://github.com/we/jobs/commit/124f6089...”,
"namespace":"datascience”,
"description":“Weekly email of room bookings occupancy patterns.”
}
03 Job
OWNERSHIP
02 Dataset
Data Platform
01 Job
v1
{
"type":"BATCH",
"name":"room_bookings_7_days”
"inputs":["wedata.room_bookings”],
"outputs":[],
...
}
LINEAGE
JOBDATASET
Marquez: Metadata collection
Data Platform
Marquez: Metadata collection
02
01 Job
v1
Job
v2
{
"type":"BATCH",
"name":"room_bookings_7_days”
"inputs":["wedata.room_bookings”],
"outputs":[],
...
}
{
"type":"BATCH",
"name":"room_bookings_7_days”
"inputs":["wedata.room_bookings”],
"outputs":["wedata.room_bookings_aggs”],
...
}
LINEAGE
LINEAGE
JOBDATASET
Data Platform
Marquez: Metadata collection
03 Job
v3
{
"type":"BATCH",
"name":"room_bookings_7_days”
"inputs":["wedata.room_bookings”],
"outputs":["wedata.room_bookings_aggs”],
...
}
{
"type":"BATCH",
"name":"room_bookings_7_days”
"inputs":["wedata.room_bookings”,"wedata.spaces”],
"outputs":["wedata.room_bookings_aggs”],
...
}
02 Job
v2
01 Job
v1
{
"type":"BATCH",
"name":"room_bookings_7_days”
"inputs":["wedata.room_bookings”],
"outputs":[],
...
}
LINEAGE
LINEAGE
LINEAGE
JOBDATASET
Marquez @ WeWork
VERSION
CONTROL
Quay.io
BUILD TEST
PUBLISH
IMAGE
CIRCLECI BUILD PIPELINE
DEPLOY
STAGING
DEPLOY
PRODUCTION
HOLD
PUSH IMAGE
PULL IMAGE
Marquez
Deployment
HELM INSTALL HELM INSTALL
+Marquez03 Airflow
Airflow
Airflow
DAG
DAG
DAG
DAG
Marquez Lib.
Data Platform
● Metadata
○ Task lifecycle
○ Task parameters
○ Task runs linked to versioned code
○ Task inputs / outputs
● Lineage
○ Track inter-DAG dependencies
● Built-in
○ SQL parser
○ Link to code builder (GitHub)
○ Metadata extractors
Marquez: Airflow
Airflow support for Marquez
Data Platform
Marquez: Airflow
Airflow
Marquez Airflow
Lib.
Extractor
Operator
Metadata
Data Platform
airflow.operators.PostgresOperator
marquez_airflow.extractors.PostgresExtractor
Extractor
Operator
Metadata
Airflow
Marquez Airflow
Lib.
Example
Marquez: Airflow
Data Platform
● Open source!
● Enables global task-level metadata collection
● Extends Airflow’s DAG class
from marquez_airflow import DAG
from airflow.operators.postgres_operator import PostgresOperator
...
room_bookings_7_days_dag.py
Marquez: Airflow
Marquez Airflow Lib.
Data Platform
DAG
MarquezLib.
Integration
Marquez
RESTAPI
Capturing task-level metadata in a
nutshell
Marquez: Airflow
Job
Dataset
Job
Version
Run
Dataset
Version
*
1
*
1
1*
1*
Source
1 *
*
1
Airflow
Airflow @ WeWork
Quay.io
IMAGE
AIRFLOW
MARQUEZ
AIRFLOW LIB.
● Multiple Airflow instances
● Internally distribute our own
docker image
○ Airflow
○ Marquez Airflow Lib.
● All DAGs are Marquez-enabled
○ Lineage across Airflow instances
○ DAG ownership enforcement
IMAGE
AIRFLOW CUSTOM OPERATORS
CUSTOM EXTRACTORSQuay.io
NAMESPACE 1
PULL IMAGE
NAMESPACE 2 NAMESPACE N
...
PULL IMAGEPULL IMAGE
MARQUEZ
AIRFLOW LIB.
DATAPLATFORM
MARQUEZ
BUILD / PUSH
IMAGE
Marquez: Data triggers
● Powered by data lineage
● Timely processing of data
○ No polling!
● Reduce manual handling of
backfills
● Reduce production of bad data
○ Incomplete data
○ Low-quality data
LINEAGE
Data Platform JOBDATASET
Managing inter-DAG dependencies
with data triggers
Marquez: Data triggers
Managing inter-DAG dependencies
with data triggers
LINEAGE
Data Platform
● Powered by data lineage
● Timely processing of data
○ No polling!
● Reduce manual handling of
backfills
● Reduce production of bad data
○ Incomplete data
○ Low-quality data
JOBDATASET
Marquez: Data triggers
Managing inter-DAG dependencies
with data triggers
LINEAGE
Data Platform
● Powered by data lineage
● Timely processing of data
○ No polling!
● Reduce manual handling of
backfills
● Reduce production of bad data
○ Incomplete data
○ Low-quality data
JOBDATASET
Marquez: Data triggers
Managing inter-DAG dependencies
with data triggers
LINEAGE
Data Platform
● Powered by data lineage
● Timely processing of data
○ No polling!
● Reduce manual handling of
backfills
● Reduce production of bad data
○ Incomplete data
○ Low-quality data
JOBDATASET
Marquez: Data triggers
Managing inter-DAG dependencies
with data triggers
LINEAGE
Data Platform
● Powered by data lineage
● Timely processing of data
○ No polling!
● Reduce manual handling of
backfills
● Reduce production of bad data
○ Incomplete data
○ Low-quality data
JOBDATASET
Marquez: Data triggers
Managing inter-DAG dependencies
with data triggers
LINEAGE
Data Platform
● Powered by data lineage
● Timely processing of data
○ No polling!
● Reduce manual handling of
backfills
● Reduce production of bad data
○ Incomplete data
○ Low-quality data
JOBDATASET
Marquez: Data triggers
Managing inter-DAG dependencies
with data triggers
LINEAGE
Data Platform
● Powered by data lineage
● Timely processing of data
○ No polling!
● Reduce manual handling of
backfills
● Reduce production of bad data
○ Incomplete data
○ Low-quality data
JOBDATASET
Marquez: Data triggers
Managing inter-DAG dependencies
with data triggers
LINEAGE
Data Platform
● Powered by data lineage
● Timely processing of data
○ No polling!
● Reduce manual handling of
backfills
● Reduce production of bad data
○ Incomplete data
○ Low-quality data
JOBDATASET
Marquez: Data triggers
Managing inter-DAG dependencies
with data triggers
LINEAGE
Data Platform
● Powered by data lineage
● Timely processing of data
○ No polling!
● Reduce manual handling of
backfills
● Reduce production of bad data
○ Incomplete data
○ Low-quality data
JOBDATASET
Marquez: Data triggers
Managing inter-DAG dependencies
with data triggers
LINEAGE
Data Platform
● Powered by data lineage
● Timely processing of data
○ No polling!
● Reduce manual handling of
backfills
● Reduce production of bad data
○ Incomplete data
○ Low-quality data
JOBDATASET
Demo04
Future work05
Data Platform
Roadmap
● Short-term
○ Release marquez 0.5.0
○ Release marquez-airflow 0.2.0
○ Release early version of marquez-helm
○ Docs
● Long-term
○ Apache Iceberg metadata support
○ Search
○ Data triggers
Marquez: Future work
Thanks! <o/
Data Platform
https://marquezproject.github.io/marquez
github.com/MarquezProject
@MarquezProject
Questions?
Data Platform CRUNCH ‘19

More Related Content

What's hot

Parallelization of Structured Streaming Jobs Using Delta Lake
Parallelization of Structured Streaming Jobs Using Delta LakeParallelization of Structured Streaming Jobs Using Delta Lake
Parallelization of Structured Streaming Jobs Using Delta LakeDatabricks
 
Streaming Event Time Partitioning with Apache Flink and Apache Iceberg - Juli...
Streaming Event Time Partitioning with Apache Flink and Apache Iceberg - Juli...Streaming Event Time Partitioning with Apache Flink and Apache Iceberg - Juli...
Streaming Event Time Partitioning with Apache Flink and Apache Iceberg - Juli...Flink Forward
 
Observability for Data Pipelines With OpenLineage
Observability for Data Pipelines With OpenLineageObservability for Data Pipelines With OpenLineage
Observability for Data Pipelines With OpenLineageDatabricks
 
Introducing Apache Airflow and how we are using it
Introducing Apache Airflow and how we are using itIntroducing Apache Airflow and how we are using it
Introducing Apache Airflow and how we are using itBruno Faria
 
Tame the small files problem and optimize data layout for streaming ingestion...
Tame the small files problem and optimize data layout for streaming ingestion...Tame the small files problem and optimize data layout for streaming ingestion...
Tame the small files problem and optimize data layout for streaming ingestion...Flink Forward
 
Real-time Analytics with Trino and Apache Pinot
Real-time Analytics with Trino and Apache PinotReal-time Analytics with Trino and Apache Pinot
Real-time Analytics with Trino and Apache PinotXiang Fu
 
Apache Flink internals
Apache Flink internalsApache Flink internals
Apache Flink internalsKostas Tzoumas
 
Airflow presentation
Airflow presentationAirflow presentation
Airflow presentationIlias Okacha
 
Change Data Feed in Delta
Change Data Feed in DeltaChange Data Feed in Delta
Change Data Feed in DeltaDatabricks
 
Battle of the Stream Processing Titans – Flink versus RisingWave
Battle of the Stream Processing Titans – Flink versus RisingWaveBattle of the Stream Processing Titans – Flink versus RisingWave
Battle of the Stream Processing Titans – Flink versus RisingWaveYingjun Wu
 
Delta Lake OSS: Create reliable and performant Data Lake by Quentin Ambard
Delta Lake OSS: Create reliable and performant Data Lake by Quentin AmbardDelta Lake OSS: Create reliable and performant Data Lake by Quentin Ambard
Delta Lake OSS: Create reliable and performant Data Lake by Quentin AmbardParis Data Engineers !
 
Optimizing Delta/Parquet Data Lakes for Apache Spark
Optimizing Delta/Parquet Data Lakes for Apache SparkOptimizing Delta/Parquet Data Lakes for Apache Spark
Optimizing Delta/Parquet Data Lakes for Apache SparkDatabricks
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceDatabricks
 
Airflow at lyft for Airflow summit 2020 conference
Airflow at lyft for Airflow summit 2020 conferenceAirflow at lyft for Airflow summit 2020 conference
Airflow at lyft for Airflow summit 2020 conferenceTao Feng
 
How I learned to time travel, or, data pipelining and scheduling with Airflow
How I learned to time travel, or, data pipelining and scheduling with AirflowHow I learned to time travel, or, data pipelining and scheduling with Airflow
How I learned to time travel, or, data pipelining and scheduling with AirflowPyData
 
Building an open data platform with apache iceberg
Building an open data platform with apache icebergBuilding an open data platform with apache iceberg
Building an open data platform with apache icebergAlluxio, Inc.
 
Apache Spark Core – Practical Optimization
Apache Spark Core – Practical OptimizationApache Spark Core – Practical Optimization
Apache Spark Core – Practical OptimizationDatabricks
 
Write Faster SQL with Trino.pdf
Write Faster SQL with Trino.pdfWrite Faster SQL with Trino.pdf
Write Faster SQL with Trino.pdfEric Xiao
 
Trino: A Ludicrously Fast Query Engine - Pulsar Summit NA 2021
Trino: A Ludicrously Fast Query Engine - Pulsar Summit NA 2021Trino: A Ludicrously Fast Query Engine - Pulsar Summit NA 2021
Trino: A Ludicrously Fast Query Engine - Pulsar Summit NA 2021StreamNative
 

What's hot (20)

Parallelization of Structured Streaming Jobs Using Delta Lake
Parallelization of Structured Streaming Jobs Using Delta LakeParallelization of Structured Streaming Jobs Using Delta Lake
Parallelization of Structured Streaming Jobs Using Delta Lake
 
Streaming Event Time Partitioning with Apache Flink and Apache Iceberg - Juli...
Streaming Event Time Partitioning with Apache Flink and Apache Iceberg - Juli...Streaming Event Time Partitioning with Apache Flink and Apache Iceberg - Juli...
Streaming Event Time Partitioning with Apache Flink and Apache Iceberg - Juli...
 
Observability for Data Pipelines With OpenLineage
Observability for Data Pipelines With OpenLineageObservability for Data Pipelines With OpenLineage
Observability for Data Pipelines With OpenLineage
 
Introducing Apache Airflow and how we are using it
Introducing Apache Airflow and how we are using itIntroducing Apache Airflow and how we are using it
Introducing Apache Airflow and how we are using it
 
Tame the small files problem and optimize data layout for streaming ingestion...
Tame the small files problem and optimize data layout for streaming ingestion...Tame the small files problem and optimize data layout for streaming ingestion...
Tame the small files problem and optimize data layout for streaming ingestion...
 
Real-time Analytics with Trino and Apache Pinot
Real-time Analytics with Trino and Apache PinotReal-time Analytics with Trino and Apache Pinot
Real-time Analytics with Trino and Apache Pinot
 
Apache Flink internals
Apache Flink internalsApache Flink internals
Apache Flink internals
 
Apache Airflow
Apache AirflowApache Airflow
Apache Airflow
 
Airflow presentation
Airflow presentationAirflow presentation
Airflow presentation
 
Change Data Feed in Delta
Change Data Feed in DeltaChange Data Feed in Delta
Change Data Feed in Delta
 
Battle of the Stream Processing Titans – Flink versus RisingWave
Battle of the Stream Processing Titans – Flink versus RisingWaveBattle of the Stream Processing Titans – Flink versus RisingWave
Battle of the Stream Processing Titans – Flink versus RisingWave
 
Delta Lake OSS: Create reliable and performant Data Lake by Quentin Ambard
Delta Lake OSS: Create reliable and performant Data Lake by Quentin AmbardDelta Lake OSS: Create reliable and performant Data Lake by Quentin Ambard
Delta Lake OSS: Create reliable and performant Data Lake by Quentin Ambard
 
Optimizing Delta/Parquet Data Lakes for Apache Spark
Optimizing Delta/Parquet Data Lakes for Apache SparkOptimizing Delta/Parquet Data Lakes for Apache Spark
Optimizing Delta/Parquet Data Lakes for Apache Spark
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data Science
 
Airflow at lyft for Airflow summit 2020 conference
Airflow at lyft for Airflow summit 2020 conferenceAirflow at lyft for Airflow summit 2020 conference
Airflow at lyft for Airflow summit 2020 conference
 
How I learned to time travel, or, data pipelining and scheduling with Airflow
How I learned to time travel, or, data pipelining and scheduling with AirflowHow I learned to time travel, or, data pipelining and scheduling with Airflow
How I learned to time travel, or, data pipelining and scheduling with Airflow
 
Building an open data platform with apache iceberg
Building an open data platform with apache icebergBuilding an open data platform with apache iceberg
Building an open data platform with apache iceberg
 
Apache Spark Core – Practical Optimization
Apache Spark Core – Practical OptimizationApache Spark Core – Practical Optimization
Apache Spark Core – Practical Optimization
 
Write Faster SQL with Trino.pdf
Write Faster SQL with Trino.pdfWrite Faster SQL with Trino.pdf
Write Faster SQL with Trino.pdf
 
Trino: A Ludicrously Fast Query Engine - Pulsar Summit NA 2021
Trino: A Ludicrously Fast Query Engine - Pulsar Summit NA 2021Trino: A Ludicrously Fast Query Engine - Pulsar Summit NA 2021
Trino: A Ludicrously Fast Query Engine - Pulsar Summit NA 2021
 

Similar to Data Lineage with Apache Airflow using Marquez

Data lineage and observability with Marquez - subsurface 2020
Data lineage and observability with Marquez - subsurface 2020Data lineage and observability with Marquez - subsurface 2020
Data lineage and observability with Marquez - subsurface 2020Julien Le Dem
 
Best Practices for Building and Deploying Data Pipelines in Apache Spark
Best Practices for Building and Deploying Data Pipelines in Apache SparkBest Practices for Building and Deploying Data Pipelines in Apache Spark
Best Practices for Building and Deploying Data Pipelines in Apache SparkDatabricks
 
Data Platform in the Cloud
Data Platform in the CloudData Platform in the Cloud
Data Platform in the CloudAmihay Zer-Kavod
 
Data infrastructure architecture for medium size organization: tips for colle...
Data infrastructure architecture for medium size organization: tips for colle...Data infrastructure architecture for medium size organization: tips for colle...
Data infrastructure architecture for medium size organization: tips for colle...DataWorks Summit/Hadoop Summit
 
Datacamp @ Transparency Camp 2010
Datacamp @ Transparency Camp 2010Datacamp @ Transparency Camp 2010
Datacamp @ Transparency Camp 2010Knowerce
 
WhereHows: Taming Metadata for 150K Datasets Over 9 Data Platforms
WhereHows: Taming Metadata for 150K Datasets Over 9 Data PlatformsWhereHows: Taming Metadata for 150K Datasets Over 9 Data Platforms
WhereHows: Taming Metadata for 150K Datasets Over 9 Data PlatformsMars Lan
 
Database Survival Guide: Exploratory Webcast
Database Survival Guide: Exploratory WebcastDatabase Survival Guide: Exploratory Webcast
Database Survival Guide: Exploratory WebcastEric Kavanagh
 
Intro to Spark development
 Intro to Spark development  Intro to Spark development
Intro to Spark development Spark Summit
 
SQL to NoSQL: Top 6 Questions
SQL to NoSQL: Top 6 QuestionsSQL to NoSQL: Top 6 Questions
SQL to NoSQL: Top 6 QuestionsMike Broberg
 
Changing the game with cloud dw
Changing the game with cloud dwChanging the game with cloud dw
Changing the game with cloud dwelephantscale
 
Introduction to Spark Training
Introduction to Spark TrainingIntroduction to Spark Training
Introduction to Spark TrainingSpark Summit
 
Real time analytics at uber @ strata data 2019
Real time analytics at uber @ strata data 2019Real time analytics at uber @ strata data 2019
Real time analytics at uber @ strata data 2019Zhenxiao Luo
 
Data Engineer's Lunch #82: Automating Apache Cassandra Operations with Apache...
Data Engineer's Lunch #82: Automating Apache Cassandra Operations with Apache...Data Engineer's Lunch #82: Automating Apache Cassandra Operations with Apache...
Data Engineer's Lunch #82: Automating Apache Cassandra Operations with Apache...Anant Corporation
 
AWS November Webinar Series - Advanced Analytics with Amazon Redshift and the...
AWS November Webinar Series - Advanced Analytics with Amazon Redshift and the...AWS November Webinar Series - Advanced Analytics with Amazon Redshift and the...
AWS November Webinar Series - Advanced Analytics with Amazon Redshift and the...Amazon Web Services
 
IoT databases - review and challenges - IoT, Hardware & Robotics meetup - onl...
IoT databases - review and challenges - IoT, Hardware & Robotics meetup - onl...IoT databases - review and challenges - IoT, Hardware & Robotics meetup - onl...
IoT databases - review and challenges - IoT, Hardware & Robotics meetup - onl...Marcin Bielak
 
Technological insights behind Clusterpoint database
Technological insights behind Clusterpoint databaseTechnological insights behind Clusterpoint database
Technological insights behind Clusterpoint databaseClusterpoint
 
The Nitty Gritty of Advanced Analytics Using Apache Spark in Python
The Nitty Gritty of Advanced Analytics Using Apache Spark in PythonThe Nitty Gritty of Advanced Analytics Using Apache Spark in Python
The Nitty Gritty of Advanced Analytics Using Apache Spark in PythonMiklos Christine
 
Data Vault 2.0: Big Data Meets Data Warehousing
Data Vault 2.0: Big Data Meets Data WarehousingData Vault 2.0: Big Data Meets Data Warehousing
Data Vault 2.0: Big Data Meets Data WarehousingAll Things Open
 
Demystifying Data Warehouse as a Service (DWaaS)
Demystifying Data Warehouse as a Service (DWaaS)Demystifying Data Warehouse as a Service (DWaaS)
Demystifying Data Warehouse as a Service (DWaaS)Kent Graziano
 
Monitoring Big Data Systems Done "The Simple Way" - Codemotion Milan 2017 - D...
Monitoring Big Data Systems Done "The Simple Way" - Codemotion Milan 2017 - D...Monitoring Big Data Systems Done "The Simple Way" - Codemotion Milan 2017 - D...
Monitoring Big Data Systems Done "The Simple Way" - Codemotion Milan 2017 - D...Demi Ben-Ari
 

Similar to Data Lineage with Apache Airflow using Marquez (20)

Data lineage and observability with Marquez - subsurface 2020
Data lineage and observability with Marquez - subsurface 2020Data lineage and observability with Marquez - subsurface 2020
Data lineage and observability with Marquez - subsurface 2020
 
Best Practices for Building and Deploying Data Pipelines in Apache Spark
Best Practices for Building and Deploying Data Pipelines in Apache SparkBest Practices for Building and Deploying Data Pipelines in Apache Spark
Best Practices for Building and Deploying Data Pipelines in Apache Spark
 
Data Platform in the Cloud
Data Platform in the CloudData Platform in the Cloud
Data Platform in the Cloud
 
Data infrastructure architecture for medium size organization: tips for colle...
Data infrastructure architecture for medium size organization: tips for colle...Data infrastructure architecture for medium size organization: tips for colle...
Data infrastructure architecture for medium size organization: tips for colle...
 
Datacamp @ Transparency Camp 2010
Datacamp @ Transparency Camp 2010Datacamp @ Transparency Camp 2010
Datacamp @ Transparency Camp 2010
 
WhereHows: Taming Metadata for 150K Datasets Over 9 Data Platforms
WhereHows: Taming Metadata for 150K Datasets Over 9 Data PlatformsWhereHows: Taming Metadata for 150K Datasets Over 9 Data Platforms
WhereHows: Taming Metadata for 150K Datasets Over 9 Data Platforms
 
Database Survival Guide: Exploratory Webcast
Database Survival Guide: Exploratory WebcastDatabase Survival Guide: Exploratory Webcast
Database Survival Guide: Exploratory Webcast
 
Intro to Spark development
 Intro to Spark development  Intro to Spark development
Intro to Spark development
 
SQL to NoSQL: Top 6 Questions
SQL to NoSQL: Top 6 QuestionsSQL to NoSQL: Top 6 Questions
SQL to NoSQL: Top 6 Questions
 
Changing the game with cloud dw
Changing the game with cloud dwChanging the game with cloud dw
Changing the game with cloud dw
 
Introduction to Spark Training
Introduction to Spark TrainingIntroduction to Spark Training
Introduction to Spark Training
 
Real time analytics at uber @ strata data 2019
Real time analytics at uber @ strata data 2019Real time analytics at uber @ strata data 2019
Real time analytics at uber @ strata data 2019
 
Data Engineer's Lunch #82: Automating Apache Cassandra Operations with Apache...
Data Engineer's Lunch #82: Automating Apache Cassandra Operations with Apache...Data Engineer's Lunch #82: Automating Apache Cassandra Operations with Apache...
Data Engineer's Lunch #82: Automating Apache Cassandra Operations with Apache...
 
AWS November Webinar Series - Advanced Analytics with Amazon Redshift and the...
AWS November Webinar Series - Advanced Analytics with Amazon Redshift and the...AWS November Webinar Series - Advanced Analytics with Amazon Redshift and the...
AWS November Webinar Series - Advanced Analytics with Amazon Redshift and the...
 
IoT databases - review and challenges - IoT, Hardware & Robotics meetup - onl...
IoT databases - review and challenges - IoT, Hardware & Robotics meetup - onl...IoT databases - review and challenges - IoT, Hardware & Robotics meetup - onl...
IoT databases - review and challenges - IoT, Hardware & Robotics meetup - onl...
 
Technological insights behind Clusterpoint database
Technological insights behind Clusterpoint databaseTechnological insights behind Clusterpoint database
Technological insights behind Clusterpoint database
 
The Nitty Gritty of Advanced Analytics Using Apache Spark in Python
The Nitty Gritty of Advanced Analytics Using Apache Spark in PythonThe Nitty Gritty of Advanced Analytics Using Apache Spark in Python
The Nitty Gritty of Advanced Analytics Using Apache Spark in Python
 
Data Vault 2.0: Big Data Meets Data Warehousing
Data Vault 2.0: Big Data Meets Data WarehousingData Vault 2.0: Big Data Meets Data Warehousing
Data Vault 2.0: Big Data Meets Data Warehousing
 
Demystifying Data Warehouse as a Service (DWaaS)
Demystifying Data Warehouse as a Service (DWaaS)Demystifying Data Warehouse as a Service (DWaaS)
Demystifying Data Warehouse as a Service (DWaaS)
 
Monitoring Big Data Systems Done "The Simple Way" - Codemotion Milan 2017 - D...
Monitoring Big Data Systems Done "The Simple Way" - Codemotion Milan 2017 - D...Monitoring Big Data Systems Done "The Simple Way" - Codemotion Milan 2017 - D...
Monitoring Big Data Systems Done "The Simple Way" - Codemotion Milan 2017 - D...
 

Recently uploaded

Principles and Practices of Data Visualization
Principles and Practices of Data VisualizationPrinciples and Practices of Data Visualization
Principles and Practices of Data VisualizationKianJazayeri1
 
What To Do For World Nature Conservation Day by Slidesgo.pptx
What To Do For World Nature Conservation Day by Slidesgo.pptxWhat To Do For World Nature Conservation Day by Slidesgo.pptx
What To Do For World Nature Conservation Day by Slidesgo.pptxSimranPal17
 
FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024
FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024
FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024Susanna-Assunta Sansone
 
modul pembelajaran robotic Workshop _ by Slidesgo.pptx
modul pembelajaran robotic Workshop _ by Slidesgo.pptxmodul pembelajaran robotic Workshop _ by Slidesgo.pptx
modul pembelajaran robotic Workshop _ by Slidesgo.pptxaleedritatuxx
 
Real-Time AI Streaming - AI Max Princeton
Real-Time AI  Streaming - AI Max PrincetonReal-Time AI  Streaming - AI Max Princeton
Real-Time AI Streaming - AI Max PrincetonTimothy Spann
 
Learn How Data Science Changes Our World
Learn How Data Science Changes Our WorldLearn How Data Science Changes Our World
Learn How Data Science Changes Our WorldEduminds Learning
 
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...Dr Arash Najmaei ( Phd., MBA, BSc)
 
SMOTE and K-Fold Cross Validation-Presentation.pptx
SMOTE and K-Fold Cross Validation-Presentation.pptxSMOTE and K-Fold Cross Validation-Presentation.pptx
SMOTE and K-Fold Cross Validation-Presentation.pptxHaritikaChhatwal1
 
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdf
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdfEnglish-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdf
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdfblazblazml
 
Semantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptxSemantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptxMike Bennett
 
Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)Cathrine Wilhelmsen
 
Advanced Machine Learning for Business Professionals
Advanced Machine Learning for Business ProfessionalsAdvanced Machine Learning for Business Professionals
Advanced Machine Learning for Business ProfessionalsVICTOR MAESTRE RAMIREZ
 
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...Boston Institute of Analytics
 
Digital Marketing Plan, how digital marketing works
Digital Marketing Plan, how digital marketing worksDigital Marketing Plan, how digital marketing works
Digital Marketing Plan, how digital marketing worksdeepakthakur548787
 
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...Amil Baba Dawood bangali
 
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...Thomas Poetter
 
Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 2Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 217djon017
 
Conf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
Conf42-LLM_Adding Generative AI to Real-Time Streaming PipelinesConf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
Conf42-LLM_Adding Generative AI to Real-Time Streaming PipelinesTimothy Spann
 
Bank Loan Approval Analysis: A Comprehensive Data Analysis Project
Bank Loan Approval Analysis: A Comprehensive Data Analysis ProjectBank Loan Approval Analysis: A Comprehensive Data Analysis Project
Bank Loan Approval Analysis: A Comprehensive Data Analysis ProjectBoston Institute of Analytics
 

Recently uploaded (20)

Principles and Practices of Data Visualization
Principles and Practices of Data VisualizationPrinciples and Practices of Data Visualization
Principles and Practices of Data Visualization
 
What To Do For World Nature Conservation Day by Slidesgo.pptx
What To Do For World Nature Conservation Day by Slidesgo.pptxWhat To Do For World Nature Conservation Day by Slidesgo.pptx
What To Do For World Nature Conservation Day by Slidesgo.pptx
 
FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024
FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024
FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024
 
modul pembelajaran robotic Workshop _ by Slidesgo.pptx
modul pembelajaran robotic Workshop _ by Slidesgo.pptxmodul pembelajaran robotic Workshop _ by Slidesgo.pptx
modul pembelajaran robotic Workshop _ by Slidesgo.pptx
 
Data Analysis Project: Stroke Prediction
Data Analysis Project: Stroke PredictionData Analysis Project: Stroke Prediction
Data Analysis Project: Stroke Prediction
 
Real-Time AI Streaming - AI Max Princeton
Real-Time AI  Streaming - AI Max PrincetonReal-Time AI  Streaming - AI Max Princeton
Real-Time AI Streaming - AI Max Princeton
 
Learn How Data Science Changes Our World
Learn How Data Science Changes Our WorldLearn How Data Science Changes Our World
Learn How Data Science Changes Our World
 
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
 
SMOTE and K-Fold Cross Validation-Presentation.pptx
SMOTE and K-Fold Cross Validation-Presentation.pptxSMOTE and K-Fold Cross Validation-Presentation.pptx
SMOTE and K-Fold Cross Validation-Presentation.pptx
 
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdf
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdfEnglish-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdf
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdf
 
Semantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptxSemantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptx
 
Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)
 
Advanced Machine Learning for Business Professionals
Advanced Machine Learning for Business ProfessionalsAdvanced Machine Learning for Business Professionals
Advanced Machine Learning for Business Professionals
 
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
 
Digital Marketing Plan, how digital marketing works
Digital Marketing Plan, how digital marketing worksDigital Marketing Plan, how digital marketing works
Digital Marketing Plan, how digital marketing works
 
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
 
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
 
Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 2Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 2
 
Conf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
Conf42-LLM_Adding Generative AI to Real-Time Streaming PipelinesConf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
Conf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
 
Bank Loan Approval Analysis: A Comprehensive Data Analysis Project
Bank Loan Approval Analysis: A Comprehensive Data Analysis ProjectBank Loan Approval Analysis: A Comprehensive Data Analysis Project
Bank Loan Approval Analysis: A Comprehensive Data Analysis Project
 

Data Lineage with Apache Airflow using Marquez

  • 1. Data Platform Data Lineage with Apache Airflow using Marquez CRUNCH ‘19
  • 2. Data Platform Hey! I’m Willy Lulciuc Software Engineer Marquez Team, Data Platform @wslulciuc
  • 3. AGENDA Intro to Marquez Marquez + Airflow 02 03 Why metadata?01 Data Platform Demo04 Future work05
  • 4. Data Platform DATA ● What is the data source? ● Does the data have a schema? ● Does the data have an owner? ● How often is the data updated? Today: Limited context
  • 5. Data Platform DATA SOURCES ● MYSQL ● POSTGRESQL ● REDSHIFT ● SNOWFLAKE ● KAFKA FORMATS ● CSV ● JSON ● AVRO ● PARQUET Data contextualization!
  • 7. Data lineage ● Add context to data Democratize ● Self-service data culture Data quality ● Build trust in data Why manage and utilize metadata? Data Platform
  • 8. … creating a healthy data ecosystem
  • 9. Freedom ● Experiment ● Flexible ● Self-sufficient Accountability ● Cost ● Trust Self-service ● Discover ● Explore ● Global context A healthy data ecosystem Data Platform
  • 11. Metadata (Marquez) Ingest Storage Compute StreamingBatch/ETL ● Data Platform built around Marquez ● Integrations ○ Ingest ○ Storage ○ Compute Data Platform Flink Airflow Kafka Iceberg / S3
  • 15. Data Platform Metadata Service ● Centralized metadata management ○ Sources ○ Datasets ○ Jobs ● Modular framework ○ Data governance ○ Data lineage ○ Data discovery + exploration Marquez: Design Marquez Core Lineage Search REST API ETL Batch Stream
  • 17. Core Lineage SearchServices Integrations APIs / Libraries Data PlatformData Platform
  • 18. Core Lineage SearchServices Integrations APIs / Libraries Data Platform Metadata DB Graph Search Storage Data Platform
  • 19. Marquez: Data model Job Dataset JobVersion RunDatasetVersion * 1 * 1 * 1 1* 1* Data Platform Source 1 * ● MYSQL ● POSTGRESQL ● REDSHIFT ● SNOWFLAKE ● KAFKA ● BATCH ● STREAM ● SERVICE
  • 20. Marquez: Data model Data Platform DbTable Filesystem Stream Job Dataset JobVersion RunDatasetVersion * 1 * 1 * 1 1* 1* Source 1 * ● MYSQL ● POSTGRESQL ● REDSHIFT ● SNOWFLAKE ● KAFKA ● BATCH ● STREAM ● SERVICE
  • 21. Data Platform v1 v4Dataset v2 v4 v4 Job v1 Dataset v4 Job v2 Marquez: Data model ● Debugging ○ What job version(s) produced and consumed dataset version X? ● Backfilling ○ Full / incremental processing Design benefits
  • 22. Data Platform Marquez: Metadata collection How is metadata collected? ● Push-based metadata collection ● REST API ● Language-specific SDKs ○ Java ○ Python ○ Go Marquez Job Dataset+job metadata
  • 23. Data Platform Marquez: Metadata collection Source { "type":"POSTGRESQL", "name":"analytics_db”, "connectionUrl":"jdbc:postgresql://we.db:5431/analytics”, "description":“Contains tables such as office room bookings.” } 01
  • 24. Data Platform Marquez: Metadata collection { "type":"POSTGRESQL", "name":"analytics_db”, "connectionUrl":"jdbc:postgresql://we.db:5431/analytics”, "description":“Contains tables such as office room bookings.” } { "type":"DB_TABLE", "name":"wedata.room_bookings”, "physicalName":"wedata.room_bookings”, "sourceName":"analytics_db”, "namespace":"datascience", "columns": [...], "description":“All global room bookings for each office.” } 02 Dataset Source01
  • 25. Data Platform Marquez: Metadata collection { "type":"POSTGRESQL", "name":"analytics_db”, "connectionUrl":"jdbc:postgresql://we.db:5431/analytics”, "description":“Contains tables such as office room bookings.” } { "type":"DB_TABLE", "name":"wedata.room_bookings”, "physicalName":"wedata.room_bookings”, "sourceName":"analytics_db”, "namespace":"datascience”, "columns": [...], "description":“All global room bookings for each office.” } { "type":"BATCH", "name":"room_bookings_7_days”, "inputs":["wedata.room_bookings”], "outputs":[], "location":"https://github.com/we/jobs/commit/124f6089...”, "namespace":"datascience", "description":“Weekly email of room bookings occupancy patterns.” } 03 Job Source01 02 Dataset
  • 26. Data Platform Marquez: Metadata collection { "type":"POSTGRESQL", "name":"analytics_db”, "connectionUrl":"jdbc:postgresql://we.db:5431/analytics”, "description":“Contains tables such as office room bookings.” } { "type":"DB_TABLE", "name":"wedata.room_bookings”, "physicalName":"wedata.room_bookings”, "sourceName":"analytics_db”, "namespace":"datascience”, "columns": [...], "description":“All global room bookings for each office.” } { "type":"BATCH", "name":"room_bookings_7_days”, "inputs":["wedata.room_bookings”], "outputs":[], "location":"https://github.com/we/jobs/commit/124f6089...”, "namespace":"datascience”, "description":“Weekly email of room bookings occupancy patterns.” } 03 Job Source01 LINK SOURCE LINK DATASET 02 Dataset
  • 27. Data Platform Marquez: Metadata collection { "type":"POSTGRESQL", "name":"analytics_db”, "connectionUrl":"jdbc:postgresql://we.db:5431/analytics”, "description":“Contains tables such as office room bookings.” } { "type":"DB_TABLE", "name":"wedata.room_bookings”, "physicalName":"wedata.room_bookings”, "sourceName":"analytics_db”, "namespace":"datascience”, "columns": [...], "description":“All global room bookings for each office.” } Source01 { "type":"BATCH", "name":"room_bookings_7_days”, "inputs":["wedata.room_bookings”], "outputs":[], "location":"https://github.com/we/jobs/commit/124f6089...”, "namespace":"datascience”, "description":“Weekly email of room bookings occupancy patterns.” } 03 Job OWNERSHIP 02 Dataset
  • 29. Data Platform Marquez: Metadata collection 02 01 Job v1 Job v2 { "type":"BATCH", "name":"room_bookings_7_days” "inputs":["wedata.room_bookings”], "outputs":[], ... } { "type":"BATCH", "name":"room_bookings_7_days” "inputs":["wedata.room_bookings”], "outputs":["wedata.room_bookings_aggs”], ... } LINEAGE LINEAGE JOBDATASET
  • 30. Data Platform Marquez: Metadata collection 03 Job v3 { "type":"BATCH", "name":"room_bookings_7_days” "inputs":["wedata.room_bookings”], "outputs":["wedata.room_bookings_aggs”], ... } { "type":"BATCH", "name":"room_bookings_7_days” "inputs":["wedata.room_bookings”,"wedata.spaces”], "outputs":["wedata.room_bookings_aggs”], ... } 02 Job v2 01 Job v1 { "type":"BATCH", "name":"room_bookings_7_days” "inputs":["wedata.room_bookings”], "outputs":[], ... } LINEAGE LINEAGE LINEAGE JOBDATASET
  • 32. VERSION CONTROL Quay.io BUILD TEST PUBLISH IMAGE CIRCLECI BUILD PIPELINE DEPLOY STAGING DEPLOY PRODUCTION HOLD PUSH IMAGE PULL IMAGE Marquez Deployment HELM INSTALL HELM INSTALL
  • 35. Airflow DAG DAG DAG DAG Marquez Lib. Data Platform ● Metadata ○ Task lifecycle ○ Task parameters ○ Task runs linked to versioned code ○ Task inputs / outputs ● Lineage ○ Track inter-DAG dependencies ● Built-in ○ SQL parser ○ Link to code builder (GitHub) ○ Metadata extractors Marquez: Airflow Airflow support for Marquez
  • 36. Data Platform Marquez: Airflow Airflow Marquez Airflow Lib. Extractor Operator Metadata
  • 38. Data Platform ● Open source! ● Enables global task-level metadata collection ● Extends Airflow’s DAG class from marquez_airflow import DAG from airflow.operators.postgres_operator import PostgresOperator ... room_bookings_7_days_dag.py Marquez: Airflow Marquez Airflow Lib.
  • 39. Data Platform DAG MarquezLib. Integration Marquez RESTAPI Capturing task-level metadata in a nutshell Marquez: Airflow Job Dataset Job Version Run Dataset Version * 1 * 1 1* 1* Source 1 * * 1 Airflow
  • 40. Airflow @ WeWork Quay.io IMAGE AIRFLOW MARQUEZ AIRFLOW LIB. ● Multiple Airflow instances ● Internally distribute our own docker image ○ Airflow ○ Marquez Airflow Lib. ● All DAGs are Marquez-enabled ○ Lineage across Airflow instances ○ DAG ownership enforcement
  • 41. IMAGE AIRFLOW CUSTOM OPERATORS CUSTOM EXTRACTORSQuay.io NAMESPACE 1 PULL IMAGE NAMESPACE 2 NAMESPACE N ... PULL IMAGEPULL IMAGE MARQUEZ AIRFLOW LIB. DATAPLATFORM MARQUEZ BUILD / PUSH IMAGE
  • 42. Marquez: Data triggers ● Powered by data lineage ● Timely processing of data ○ No polling! ● Reduce manual handling of backfills ● Reduce production of bad data ○ Incomplete data ○ Low-quality data LINEAGE Data Platform JOBDATASET Managing inter-DAG dependencies with data triggers
  • 43. Marquez: Data triggers Managing inter-DAG dependencies with data triggers LINEAGE Data Platform ● Powered by data lineage ● Timely processing of data ○ No polling! ● Reduce manual handling of backfills ● Reduce production of bad data ○ Incomplete data ○ Low-quality data JOBDATASET
  • 44. Marquez: Data triggers Managing inter-DAG dependencies with data triggers LINEAGE Data Platform ● Powered by data lineage ● Timely processing of data ○ No polling! ● Reduce manual handling of backfills ● Reduce production of bad data ○ Incomplete data ○ Low-quality data JOBDATASET
  • 45. Marquez: Data triggers Managing inter-DAG dependencies with data triggers LINEAGE Data Platform ● Powered by data lineage ● Timely processing of data ○ No polling! ● Reduce manual handling of backfills ● Reduce production of bad data ○ Incomplete data ○ Low-quality data JOBDATASET
  • 46. Marquez: Data triggers Managing inter-DAG dependencies with data triggers LINEAGE Data Platform ● Powered by data lineage ● Timely processing of data ○ No polling! ● Reduce manual handling of backfills ● Reduce production of bad data ○ Incomplete data ○ Low-quality data JOBDATASET
  • 47. Marquez: Data triggers Managing inter-DAG dependencies with data triggers LINEAGE Data Platform ● Powered by data lineage ● Timely processing of data ○ No polling! ● Reduce manual handling of backfills ● Reduce production of bad data ○ Incomplete data ○ Low-quality data JOBDATASET
  • 48. Marquez: Data triggers Managing inter-DAG dependencies with data triggers LINEAGE Data Platform ● Powered by data lineage ● Timely processing of data ○ No polling! ● Reduce manual handling of backfills ● Reduce production of bad data ○ Incomplete data ○ Low-quality data JOBDATASET
  • 49. Marquez: Data triggers Managing inter-DAG dependencies with data triggers LINEAGE Data Platform ● Powered by data lineage ● Timely processing of data ○ No polling! ● Reduce manual handling of backfills ● Reduce production of bad data ○ Incomplete data ○ Low-quality data JOBDATASET
  • 50. Marquez: Data triggers Managing inter-DAG dependencies with data triggers LINEAGE Data Platform ● Powered by data lineage ● Timely processing of data ○ No polling! ● Reduce manual handling of backfills ● Reduce production of bad data ○ Incomplete data ○ Low-quality data JOBDATASET
  • 51. Marquez: Data triggers Managing inter-DAG dependencies with data triggers LINEAGE Data Platform ● Powered by data lineage ● Timely processing of data ○ No polling! ● Reduce manual handling of backfills ● Reduce production of bad data ○ Incomplete data ○ Low-quality data JOBDATASET
  • 54. Data Platform Roadmap ● Short-term ○ Release marquez 0.5.0 ○ Release marquez-airflow 0.2.0 ○ Release early version of marquez-helm ○ Docs ● Long-term ○ Apache Iceberg metadata support ○ Search ○ Data triggers Marquez: Future work