Marquez: A Metadata Service for Data Abstraction, Data Lineage, and Event-based Triggers

Data Platform
DataEngConf NYC ‘18

Hey!
I’m Willy Lulciuc
Data Engineer
Marquez Team, Data Platform
@wslulciuc

01 Space
02 Community
03 Services

268,000 members globally
287 physical locations
72 cities
23 countries

AGENDA
01 Why metadata?
02 Room bookings pipeline (naïve)
03 Intro to Marquez
04 Room bookings pipeline (take 2)
05 Future work

Why manage and utilize metadata?
Data lineage
● Add context to data
Democratize
● Self-service data culture
Data quality
● Build trust in data

… creating a healthy data ecosystem

A healthy data ecosystem
Freedom
● Experiment
● Flexible
● Self-sufficient
Accountability
● Cost
● Trust
Self-service
● Discover
● Explore
● Global context

Let’s get booking!

01 Location + floor
02 Open time slot
03 Duration
04 Confirm

Which location has the most bookings?

02 Room bookings pipeline (naïve)

Example: Room bookings pipeline (naïve)
Requirements
● Read room bookings
● Sum room bookings by location
● Write top location
● Run once an hour
Start → Read → Sum → Write

Example: Room bookings pipeline (naïve)
S3 (.csv) → Postgres
Input (.csv in S3):
LOCATION,TS,ROOM
b940314,1541624285,2
b648485,1541501885,9
b648485,1541710685,4
Output (Postgres):
ID LOCATION TS BOOKINGS
1 b648485 1541721600 2
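
The read / sum / write steps of the naïve pipeline could look roughly like the sketch below. It assumes each CSV row is one booking laid out as location, timestamp, room, and it only prints the result instead of writing to Postgres; the function name `top_location` is made up for illustration.

```python
import csv
import io
from collections import Counter

# Sample rows from the slide: location, timestamp (epoch seconds), room.
RAW_CSV = """\
b940314,1541624285,2
b648485,1541501885,9
b648485,1541710685,4
"""

def top_location(csv_text):
    """Count bookings per location and return (location, bookings) for the busiest one."""
    counts = Counter()
    for row in csv.reader(io.StringIO(csv_text)):
        if not row:
            continue  # skip blank lines
        counts[row[0]] += 1  # assumption: one row == one booking
    return counts.most_common(1)[0]

location, bookings = top_location(RAW_CSV)
print(location, bookings)
```

On the sample rows this picks b648485 with 2 bookings, which matches the output row shown on the slide.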

Example: Room bookings pipeline (naïve)
We’re live!
[Diagram: a Scheduler runs the Room Bookings workflow, whose Job reads from S3 (upstream) and writes to Postgres (downstream); other labels: Archival, Top Locations]

Problems
● What’s our job’s input dataset?
● Does the dataset have an owner?
● How often is the dataset updated?

Curses, our job’s failing …

Oh, might be our input data!
[Diagram: the same Room Bookings workflow (Scheduler, Job, S3 upstream, Postgres downstream)]

Room field is of type string (was int)
S3 (.csv):
LOCATION,TS,ROOM
b648485,1541501885,9A
b940314,1541624285,2G
b648485,1541710685,4F

Problems
● Coordinate changes

Ugh, gaps in output data
[Diagram: the same Room Bookings workflow (Scheduler, Job, S3 upstream, Postgres downstream)]

Backfills!
time partitions: 00h 01h 02h 03h 04h 05h 06h 07h 08h 09h (latest)
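
Working out which hourly partitions need a backfill can be sketched as below. The set of completed partitions, the gap hours, and the helper name `missing_partitions` are all hypothetical; in practice the completed set would come from wherever job runs are recorded.

```python
from datetime import datetime, timedelta, timezone

def missing_partitions(start, latest, completed):
    """Return the hourly partitions in [start, latest] that have no completed run."""
    missing = []
    hour = start
    while hour <= latest:
        if hour not in completed:
            missing.append(hour)
        hour += timedelta(hours=1)
    return missing

day = datetime(2018, 11, 8, tzinfo=timezone.utc)
hours = [day + timedelta(hours=h) for h in range(10)]      # 00h .. 09h
completed = {h for h in hours if h.hour not in (3, 4, 7)}  # hypothetical gaps
gaps = missing_partitions(hours[0], hours[-1], completed)
print([h.strftime("%Hh") for h in gaps])
```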

Problems
● Figure out backfills

What we have so far …
[Diagram: the Room Bookings workflow (Scheduler, Job, S3, Postgres)]
Problems
● What’s our job’s input dataset?
● Does the dataset have an owner?
● How often is the dataset updated?
● Figure out backfills

… writing a job shouldn’t be this hard!

Marquez: Design
Metadata Service
● Centralized metadata management
○ Jobs
○ Datasets
● Modular
○ Data discovery
○ Data health
○ Data triggers
[Diagram: Clients (JVM) and Clients (Python) talk to the Marquez REST API; Marquez modules: Search, Health, Triggers]

Marquez: Data discovery
Module: Search
● Unified search
● Documentation
○ Owner
○ Schema
○ Datasource

Marquez: Data discovery
[Search UI mockup: query “room bo” under Datasets (filter: All; Tags: S3) returns]
● Room Bookings (SF), created: jul. 8, 2018: All San Francisco room bookings
● Room Booking Metrics (GLBL), created: feb. 15, 2010: Global room booking metrics

Marquez: Data health
Module: Health
● Owner
○ Team / project
● Schema
● Location
● Description
● Size
○ Growth over time
○ Number of records
● Lineage

Lineage queries!
[Diagram: lineage graph; legend: Dataset, Job, Lineage]
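
A lineage query over jobs and datasets can be sketched as a graph walk. The job graph below is hypothetical (job and dataset names are invented for illustration), and the function is a plain breadth-first traversal, not Marquez's query implementation.

```python
from collections import deque

# Hypothetical lineage: job name -> (input datasets, output datasets).
JOBS = {
    "archival": (["bookings_db"], ["room_bookings_raw"]),
    "room_bookings": (["room_bookings_raw"], ["top_locations"]),
    "report": (["top_locations"], ["weekly_report"]),
}

def upstream(dataset):
    """Return every job and dataset the given dataset transitively depends on."""
    # Index: dataset -> jobs that write it.
    producers = {}
    for job, (_, outputs) in JOBS.items():
        for out in outputs:
            producers.setdefault(out, []).append(job)

    seen, queue = [], deque([dataset])
    while queue:
        current = queue.popleft()
        for job in producers.get(current, []):
            if job in seen:
                continue
            seen.append(job)
            for inp in JOBS[job][0]:
                if inp not in seen:
                    seen.append(inp)
                    queue.append(inp)
    return seen

print(upstream("top_locations"))
```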

Marquez: Data triggers
Module: Triggers
● Timely processing of data
○ No polling!
● Reduce manual handling of backfills
● Reduce production of bad data
○ Incomplete data
○ Low-quality data

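Upstream failure detection!
[Diagram: lineage graph; legend: Dataset, Job, Job failure]

Affected paths!
[Diagram: lineage graph; legend: Dataset, Job, Job failure]

Cascading triggers!
[Diagram: lineage graph; legend: Dataset, Job, Trigger]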
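
Affected paths and cascading triggers both come down to walking the graph downstream. The sketch below uses a hypothetical job graph and hypothetical function names. It returns jobs in breadth-first order, which is enough for a chain like this one but is not a full topological sort.

```python
# Hypothetical job graph: job name -> (input datasets, output datasets).
JOBS = {
    "room_bookings": (["room_bookings_raw"], ["top_locations"]),
    "dashboard": (["top_locations"], ["dashboard_cache"]),
    "alerts": (["dashboard_cache"], []),
    "billing": (["invoices_raw"], ["invoices"]),
}

def downstream_jobs(dataset):
    """Jobs to trigger, in breadth-first order, when `dataset` gets a new version."""
    order, pending = [], [dataset]
    while pending:
        current = pending.pop(0)
        for job, (inputs, outputs) in JOBS.items():
            if current in inputs and job not in order:
                order.append(job)
                pending.extend(outputs)  # the job's outputs cascade further
    return order

def affected_by_failure(job):
    """Jobs that cannot get fresh input because `job` failed."""
    affected = []
    for out in JOBS[job][1]:
        for downstream in downstream_jobs(out):
            if downstream not in affected:
                affected.append(downstream)
    return affected

print(downstream_jobs("room_bookings_raw"))
print(affected_by_failure("room_bookings"))
```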

Marquez: Core concepts
Job + Datasets
[Diagram: Input Dataset → Job → Output Dataset]

Dataset versions!
A dataset version contains a complete snapshot of data as of some point in time
[Diagram: a Job between two datasets; version labels: v1, v2 on one side and v1, v2, v3 on the other]

Deltas “diffs”!
[Diagram: the same Job and dataset versions, with the change between versions labeled Δv2￫v3]
INSERT INTO room_bookings (location, bookings)
VALUES ('b648485', 2);
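
One way to picture the relationship between snapshots and deltas is the sketch below. It is insert-only; the contents of v2 and the helper names are hypothetical, and only the inserted row comes from the slide.

```python
def apply_delta(snapshot, delta):
    """Return a new snapshot: the previous rows plus the rows the delta inserts."""
    return snapshot + delta

def diff(old, new):
    """Recover the delta (inserted rows only) between two snapshots."""
    return [row for row in new if row not in old]

v2 = [("b940314", 1)]           # hypothetical contents of dataset version v2
delta_v2_v3 = [("b648485", 2)]  # the row inserted on the slide
v3 = apply_delta(v2, delta_v2_v3)

print(v3)
print(diff(v2, v3))
```

Keeping v2 untouched and building v3 as a new list mirrors the idea that each version is a complete snapshot as of some point in time.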

Job versions!
A job version is created when business logic has changed
[Diagram: the same dataset versions, with the Job now shown as Job v1 and Job v2]

Job runs!
[Diagram, built up over three slides: a New Run of the Job, followed by a new dataset version v4; labels: Job v1, Job v2, Dataset, New Run, v4, Finish, Update]

Data triggers!
[Diagram: the same job run, now with a Trigger; labels: New Run, v4, Update, Finish, Trigger, Job v7, Job v10]

Job failures!
[Diagram: the same job run ending in a Failure; labels: Job v1, Job v2, Dataset, New Run, v4, Failure]

Delayed datasets!
[Diagram: the same failed job run, with the dataset marked as delayed; labels: Job v1, Job v2, Dataset, New Run, v4, Failure, Delay]

Design benefits
● Early upstream failure detection
● Debugging
○ What job version(s) produced / consumed dataset version X?
● Recoverability
○ Full / incremental processing
● Coordination

Marquez: Data model
[Entity-relationship diagram: Job, JobVersion, JobRun, Dataset, DatasetVersion, Datasource, linked by one-to-many (1 to *) relationships; Types: DbTable, Filesystem, Stream]
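
The entities in the data model, and the debugging question "what job version(s) produced / consumed dataset version X?", can be sketched with a few dataclasses. The fields and the exact relationships here are assumptions made for illustration (a run belongs to one job version and reads and writes dataset versions); they are not taken from the Marquez schema.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass(frozen=True)
class DatasetVersion:
    dataset: str
    version: int

@dataclass(frozen=True)
class JobVersion:
    job: str
    version: int

@dataclass
class JobRun:
    job_version: JobVersion
    inputs: List[DatasetVersion] = field(default_factory=list)
    outputs: List[DatasetVersion] = field(default_factory=list)

def produced_by(runs, target):
    """Job versions whose runs wrote the given dataset version."""
    return [run.job_version for run in runs if target in run.outputs]

def consumed_by(runs, target):
    """Job versions whose runs read the given dataset version."""
    return [run.job_version for run in runs if target in run.inputs]

# Hypothetical run history.
raw_v2 = DatasetVersion("room_bookings_raw", 2)
top_v4 = DatasetVersion("top_locations", 4)
runs = [
    JobRun(JobVersion("archival", 1), outputs=[raw_v2]),
    JobRun(JobVersion("room_bookings", 2), inputs=[raw_v2], outputs=[top_v4]),
]

print(produced_by(runs, top_v4))
print(consumed_by(runs, raw_v2))
```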

Marquez: Metadata collection
How is metadata collected?
● Marquez API
● Language-specific SDKs
○ Java
○ Python
[Diagram: a Job records metadata in Marquez]

Workflow
Register Job
● Job version
● Inputs / outputs (logical names)
● Owner
● Description
Register Job Run
Start
● Update job run state to STARTED
Complete
● Update job run state to COMPLETED
Register Job Run Outputs
● Outputs (physical locations)
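
The workflow steps above can be walked through with an in-memory stand-in. Everything in this sketch is hypothetical: the class, its method names, the initial `NEW` state, and the output location are invented for illustration and are not the Marquez API or SDK. Only the `STARTED` and `COMPLETED` states and the `s3://room_bookings/raw/` location come from the slides.

```python
import itertools

class InMemoryMetadataStore:
    """Hypothetical stand-in for a metadata service; NOT the Marquez API."""

    def __init__(self):
        self.jobs = {}
        self.runs = {}
        self._ids = itertools.count(1)

    def register_job(self, name, version, inputs, outputs, owner, description):
        # Inputs / outputs are logical dataset names at this point.
        self.jobs[name] = {
            "version": version, "inputs": inputs, "outputs": outputs,
            "owner": owner, "description": description,
        }

    def register_job_run(self, job_name):
        run_id = next(self._ids)
        # "NEW" is an assumed initial state, not one named on the slides.
        self.runs[run_id] = {"job": job_name, "state": "NEW", "outputs": []}
        return run_id

    def start(self, run_id):
        self.runs[run_id]["state"] = "STARTED"

    def complete(self, run_id):
        self.runs[run_id]["state"] = "COMPLETED"

    def register_outputs(self, run_id, locations):
        # Outputs are physical locations once the run has written them.
        self.runs[run_id]["outputs"].extend(locations)

store = InMemoryMetadataStore()
store.register_job(
    name="room_bookings", version=1,
    inputs=["room_bookings_raw"], outputs=["top_locations"],
    owner="Data Engineering", description="Top location by bookings",
)
run_id = store.register_job_run("room_bookings")
store.start(run_id)
store.complete(run_id)
store.register_outputs(run_id, ["postgres://analytics/top_locations"])  # hypothetical location
print(store.runs[run_id])
```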

04 Room bookings pipeline (take 2)

Example: Room bookings pipeline (take 2)
Recall, we are tasked with analyzing room booking trends …
[Diagram: the Room Bookings workflow (Scheduler, Job, S3, Postgres); other label: Top Locations]

Example: Room bookings pipeline (take 2)
Problems
● Figure out backfills

[Search UI mockup: query “room bo” (filter: All) returns]
● Room Bookings (ALL), S3, created: feb. 15, 2010: All room bookings since beginning of time
● Room Bookings (SF), S3, created: jul. 8, 2018: All San Francisco room bookings
Well, that was easy!

Example: Room bookings pipeline (take 2)
Room Bookings (ALL), S3
Info
Owner: Data Engineering
Location: s3://room_bookings/raw/
Schema: https://registry.wework.com/schemas/ids/1
Updated: Hourly
Description: All room bookings since beginning of time

Room Bookings (ALL), S3
Info
Updated: Hourly
Bonus!

We also had to coordinate changes to our input data
[Diagram: the Room Bookings workflow (Scheduler, Job, S3, Postgres); other label: Top Locations]

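Our view
[Diagram: lineage graph limited to the Room bookings workflow; legend: Dataset, Job, Job failure]

Global view!
[Diagram: lineage graph showing the Room bookings workflow and the Top locations dataset; legend: Dataset, Job, Job failure]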

Room Bookings (ALL), S3
Info
Updated: Hourly
Oh, version bumped!

Patch, deploy, trigger!
[Diagram: lineage graph showing the Room bookings workflow and the Top locations dataset; legend: Dataset, Job, Trigger]

RECAP
● Make it trivial to discover datasets
● Global context when debugging
● Easily handle backfills
○ Datasets as dependencies

Marquez: Future work
WeWork + Marquez
● Data platform built around Marquez
● Internal integrations
○ Scheduling
○ Batching
○ Streaming

Marquez: Future work
Roadmap
● Short-term
○ Release Marquez 0.1.0
○ Docs
● Long-term
○ Marquez UI

github.com/MarquezProject
@MarquezProject

Thanks!

We’re hiring!
contact: willy.lulciuc@wework.com

Questions?
