At WeWork, it's critical that we understand the complete context for all datasets. We also want to be able to explore dependencies between jobs and the datasets they produce and consume. To do this, WeWork needs metadata. In this talk I will focus on Marquez, a core service for the collection, aggregation, and visualization of a data ecosystem's metadata. Marquez maintains the provenance of how datasets are consumed and produced while providing global visibility into job runtime.
7. Why manage and utilize metadata?
Data lineage
● Add context to data
Democratize
● Self-service data culture
Data quality
● Build trust in data
Data Platform
23. Example: Room bookings pipeline (naïve)
@wslulciuc
Problems
● What’s our job’s input dataset?
● Does the dataset have an owner?
● How often is the dataset updated?
25. Example: Room bookings pipeline (naïve)
[Diagram: Scheduler; Room Bookings workflow with an upstream Archival job and a downstream Top Locations job; S3 and Postgres]
Oh, might be our input data!
26. Example: Room bookings pipeline (naïve)
[Diagram: .csv files in S3]
LOCATION,TS,ROOM
b648485,1541501885,9A
b940314,1541624285,2G
b648485,1541710685,4F
Room field is of type string (the slide also shows the label "int")
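
The mismatch on this slide can be caught by a check on the job's input before any processing happens. The following is a minimal sketch, assuming the column order LOCATION, TS, ROOM and assuming the job expected ROOM to be an integer; the helper `find_type_mismatches` is hypothetical and does not come from the talk.

```python
import csv
import io

# Sample rows copied from the slide; the column order is an assumption.
SAMPLE = """b648485,1541501885,9A
b940314,1541624285,2G
b648485,1541710685,4F
"""

COLUMNS = ["location", "ts", "room"]


def find_type_mismatches(text, expected):
    """Return (row_number, column, value) for every value that fails its expected type."""
    mismatches = []
    reader = csv.reader(io.StringIO(text))
    for row_number, row in enumerate(reader, start=1):
        for column, value in zip(COLUMNS, row):
            try:
                expected[column](value)  # e.g. int("9A") raises ValueError
            except ValueError:
                mismatches.append((row_number, column, value))
    return mismatches


# Hypothetical expectation: the job assumed ROOM was an integer.
problems = find_type_mismatches(SAMPLE, {"location": str, "ts": int, "room": int})
```

Every row is reported here, because values such as 9A cannot be parsed as integers. Without shared metadata, a job typically discovers this only when it fails at runtime.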
27. Example: Room bookings pipeline (naïve)
Problems
● What’s our job’s input dataset?
● Does the dataset have an owner?
● How often is the dataset updated?
● Coordinate changes
28. Example: Room bookings pipeline (naïve)
[Diagram: Scheduler; Room Bookings workflow with an upstream Archival job and a downstream Top Locations job; S3 and Postgres]
Ugh, gaps in output data
30. Example: Room bookings pipeline (naïve)
Problems
● What’s our job’s input dataset?
● Does the dataset have an owner?
● How often is the dataset updated?
● Coordinate changes
● Figure out backfills
32. Example: Room bookings pipeline (naïve)
What we have so far …
[Diagram: Scheduler; Room Bookings workflow with a Job; S3 and Postgres]
Problems
● What’s our job’s input dataset?
● Does the dataset have an owner?
● How often is the dataset updated?
● Coordinate changes
● Figure out backfills
35. Marquez: Design
Metadata Service
● Centralized metadata management
○ Jobs
○ Datasets
● Modular
○ Data discovery
○ Data health
○ Data triggers
[Diagram: Clients (JVM) and Clients (Python); REST API; Marquez with modules Search, Health, Triggers]
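
To make the idea of centrally managed jobs and datasets concrete, here is a minimal in-memory sketch. It is not the Marquez API: the class `MetadataService`, its methods, the field names, and the job owner are all assumptions chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Dataset:
    name: str
    owner: str
    description: str = ""


@dataclass
class Job:
    name: str
    owner: str
    inputs: list = field(default_factory=list)   # logical dataset names
    outputs: list = field(default_factory=list)  # logical dataset names


class MetadataService:
    """In-memory stand-in for a centralized metadata store (not the Marquez API)."""

    def __init__(self):
        self.datasets = {}
        self.jobs = {}

    def register_dataset(self, dataset):
        self.datasets[dataset.name] = dataset

    def register_job(self, job):
        self.jobs[job.name] = job

    def consumers_of(self, dataset_name):
        """Names of jobs that list the dataset as an input."""
        return sorted(j.name for j in self.jobs.values() if dataset_name in j.inputs)


service = MetadataService()
service.register_dataset(Dataset("room_bookings", owner="Data Engineering"))
# The job owner below is illustrative only.
service.register_job(Job("top_locations", owner="Data Platform",
                         inputs=["room_bookings"], outputs=["top_locations"]))
```

Because jobs declare their inputs and outputs in one place, a question such as "who consumes room_bookings?" becomes a lookup (`service.consumers_of("room_bookings")`) rather than a hunt through individual pipelines.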
36. Marquez: Data discovery
Module: Search
● Unified search
● Documentation
○ Owner
○ Schema
○ Datasource
[Diagram: Marquez modules: Search, Health, Triggers]
37. Marquez: Data discovery
[Screenshot: search UI with the query "room bo"; labels shown: Search, All, Datasets, Tags, S3]
Room Bookings (SF)
created: jul. 8, 2018
All San Francisco room bookings
Room Booking Metrics (GLBL)
created: feb. 15, 2010
Global room booking metrics
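
The search shown on this slide can be approximated with a case-insensitive substring match over dataset names and descriptions. This is only a sketch: the catalog layout and the `search` function are assumptions, and the talk does not say how Marquez implements search.

```python
# Catalog entries copied from the slide; the dict layout is an assumption.
CATALOG = [
    {"name": "Room Bookings (SF)", "description": "All San Francisco room bookings"},
    {"name": "Room Booking Metrics (GLBL)", "description": "Global room booking metrics"},
]


def search(catalog, query):
    """Case-insensitive substring match on name or description."""
    needle = query.strip().lower()
    return [d["name"] for d in catalog
            if needle in d["name"].lower() or needle in d["description"].lower()]


results = search(CATALOG, "room bo")
```

The partial query "room bo" matches both datasets, which mirrors the two results on the slide.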
38. Marquez: Data health
Module: Health
● Owner
○ Team / project
● Schema
● Location
● Description
● Size
○ Growth over time
○ Number of records
● Lineage
[Diagram: Marquez modules: Search, Health, Triggers]
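
One way to read "growth over time" and "number of records" is as a series of record-count snapshots per dataset. The sketch below assumes that reading; the `Snapshot` type, the `growth` helper, and the counts are made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class Snapshot:
    day: str
    records: int


def growth(snapshots):
    """Record-count change between consecutive snapshots."""
    return [(b.day, b.records - a.records) for a, b in zip(snapshots, snapshots[1:])]


# Made-up counts, purely for illustration.
history = [
    Snapshot("2018-11-05", 1000),
    Snapshot("2018-11-06", 1250),
    Snapshot("2018-11-07", 1250),
]
deltas = growth(history)
# Days on which the dataset did not grow may indicate a stalled upstream job.
stalled = [day for day, delta in deltas if delta == 0]
```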
41. Marquez: Data triggers
Module: Triggers
● Timely processing of data
○ No polling!
● Reduce manual handling of backfills
● Reduce production of bad data
○ Incomplete data
○ Low-quality data
[Diagram: Marquez modules: Search, Health, Triggers]
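
"No polling" suggests a push model: downstream jobs subscribe to a dataset and are notified when it changes. A minimal sketch of that pattern follows, assuming a hypothetical `TriggerRegistry` class; the talk does not describe the actual trigger mechanism, and the partition string is invented.

```python
from collections import defaultdict


class TriggerRegistry:
    """Hypothetical publish/subscribe registry keyed by dataset name."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def on_update(self, dataset, callback):
        """Register a callback to run whenever `dataset` is updated."""
        self._subscribers[dataset].append(callback)

    def dataset_updated(self, dataset, partition):
        """Notify every subscriber of `dataset`; nobody has to poll."""
        for callback in self._subscribers[dataset]:
            callback(dataset, partition)


runs = []
registry = TriggerRegistry()
# The downstream job is represented by a callback that records what it would run.
registry.on_update("room_bookings", lambda ds, part: runs.append(("top_locations", part)))
registry.dataset_updated("room_bookings", "2018-11-06T10")
registry.dataset_updated("other_dataset", "2018-11-06T10")  # no subscribers, no run
```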
65. Marquez: Metadata collection
Workflow
Register Job
● Job version
● Inputs / outputs (logical names)
● Owner
● Description
Register Job Run
Start
● Update job run state to STARTED
Complete
● Update job run state to COMPLETED
66. Marquez: Metadata collection
Workflow
Register Job
● Job version
● Inputs / outputs (logical names)
● Owner
● Description
Register Job Run
Start
● Update job run state to STARTED
Complete
● Update job run state to COMPLETED
Register Job Run Outputs
● Outputs (physical locations)
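
The collection steps above can be sketched as a small state machine around a job run. Everything here is an assumption made for illustration: the `MetadataCollector` class, its method names, the initial `NEW` state (the slides name only STARTED and COMPLETED), and the example job and output location.

```python
import uuid
from enum import Enum


class RunState(Enum):
    NEW = "NEW"              # assumed initial state; the slides name only the two below
    STARTED = "STARTED"
    COMPLETED = "COMPLETED"


# Legal transitions for a job run.
TRANSITIONS = {
    RunState.NEW: {RunState.STARTED},
    RunState.STARTED: {RunState.COMPLETED},
    RunState.COMPLETED: set(),
}


class MetadataCollector:
    """In-memory stand-in for the collection steps on the slide (hypothetical API)."""

    def __init__(self):
        self.jobs = {}
        self.runs = {}

    def register_job(self, name, version, inputs, outputs, owner, description):
        # Inputs and outputs are logical dataset names, not physical paths.
        self.jobs[name] = {
            "version": version, "inputs": inputs, "outputs": outputs,
            "owner": owner, "description": description,
        }

    def register_run(self, job_name):
        if job_name not in self.jobs:
            raise KeyError(f"unknown job: {job_name}")
        run_id = str(uuid.uuid4())
        self.runs[run_id] = {"job": job_name, "state": RunState.NEW, "outputs": []}
        return run_id

    def _move(self, run_id, target):
        run = self.runs[run_id]
        if target not in TRANSITIONS[run["state"]]:
            raise ValueError(f"cannot go from {run['state'].value} to {target.value}")
        run["state"] = target

    def start(self, run_id):
        self._move(run_id, RunState.STARTED)

    def complete(self, run_id):
        self._move(run_id, RunState.COMPLETED)

    def register_outputs(self, run_id, locations):
        # Physical locations written by this particular run.
        self.runs[run_id]["outputs"].extend(locations)


collector = MetadataCollector()
collector.register_job("top_locations", version="1", inputs=["room_bookings"],
                       outputs=["top_locations"], owner="Data Platform",
                       description="Ranks locations by room bookings")
run_id = collector.register_run("top_locations")
collector.start(run_id)
collector.complete(run_id)
# Hypothetical physical location for the run's output.
collector.register_outputs(run_id, ["postgres://example-host/analytics/top_locations"])
```

Separating logical names (registered with the job) from physical locations (registered per run) lets the same job definition describe many runs that each write to a different place.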
68. Example: Room bookings pipeline (take 2)
Recall, we are tasked with analyzing room booking trends …
69. Example: Room bookings pipeline (take 2)
[Diagram: Scheduler; Room Bookings workflow with a Top Locations job; S3 and Postgres]
Recall, we are tasked with analyzing room booking trends …
70. Example: Room bookings pipeline (take 2)
Problems
● What’s our job’s input dataset?
● Does the dataset have an owner?
● How often is the dataset updated?
● Coordinate changes
● Figure out backfills
72. Example: Room bookings pipeline (take 2)
[Screenshot: search UI with the query "room bo"; filter: All]
Room Bookings (ALL) (S3)
created: feb. 15, 2010
All room bookings since beginning of time
Room Bookings (SF) (S3)
created: jul. 8, 2018
All San Francisco room bookings
73. Example: Room bookings pipeline (take 2)
[Screenshot: search UI with the query "room bo"; filter: All]
Room Bookings (ALL) (S3)
created: feb. 15, 2010
All room bookings since beginning of time
Room Bookings (SF) (S3)
created: jul. 8, 2018
All San Francisco room bookings
Well, that was easy!
74. Example: Room bookings pipeline (take 2)
Room Bookings (ALL) (S3)
created: feb. 15, 2010
All room bookings since beginning of time
Info
Owner: Data Engineering
Location: s3://room_bookings/raw/
Schema: https://registry.wework.com/schemas/ids/1
Updated: Hourly
Description: All room bookings since beginning of time
76. Example: Room bookings pipeline (take 2)
Room Bookings (ALL) (S3)
created: feb. 15, 2010
All room bookings since beginning of time
Info
Owner: Data Engineering
Location: s3://room_bookings/raw/
Schema: https://registry.wework.com/schemas/ids/1
Updated: Hourly
Description: All room bookings since beginning of time
Bonus!
77. Example: Room bookings pipeline (take 2)
Problems
● What’s our job’s input dataset?
● Does the dataset have an owner?
● How often is the dataset updated?
● Coordinate changes
● Figure out backfills
81. Example: Room bookings pipeline (take 2)
[Diagram: Scheduler; Room Bookings workflow with a Top Locations job; S3 and Postgres]
We also had to coordinate changes to our input data
84. Example: Room bookings pipeline (take 2)
Room Bookings (ALL) (S3)
created: feb. 15, 2010
All room bookings since beginning of time
Info
Owner: Data Engineering
Location: s3://room_bookings/raw/
Schema: https://registry.wework.com/schemas/ids/2
Updated: Hourly
Description: All room bookings since beginning of time
Oh, version bumped!
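
The schema link on this slide ends in ids/2, where the earlier slides showed ids/1. A consumer that pins the schema it was written against could detect such a bump before running. The sketch below assumes the trailing path segment is a numeric schema id; both helper functions are hypothetical.

```python
def schema_id(schema_url):
    """Extract the trailing numeric id from a URL such as .../schemas/ids/2."""
    return int(schema_url.rstrip("/").rsplit("/", 1)[-1])


def detect_bump(pinned_url, current_url):
    """Return (old_id, new_id) if the dataset's schema changed, else None."""
    old, new = schema_id(pinned_url), schema_id(current_url)
    return (old, new) if new != old else None


# URLs copied from the slides.
pinned = "https://registry.wework.com/schemas/ids/1"
current = "https://registry.wework.com/schemas/ids/2"
bump = detect_bump(pinned, current)
```

A job could refuse to run, or alert the dataset owner, whenever `detect_bump` returns a value instead of `None`.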
86. Example: Room bookings pipeline (take 2)
Problems
● What’s our job’s input dataset?
● Does the dataset have an owner?
● How often is the dataset updated?
● Coordinate changes
● Figure out backfills
89. Recap
● Make it trivial to discover datasets
● Global context when debugging
● Easily handle backfills
○ Datasets as dependencies
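
Treating datasets as dependencies means a backfill can be planned by walking the lineage graph from the repaired dataset to every job downstream of it. Here is a minimal breadth-first sketch; the `JOBS` layout, the `weekly_report` job, and the `downstream_jobs` helper are assumptions for illustration only.

```python
from collections import deque

# Each job declares its input and output datasets by logical name.
# "weekly_report" is a hypothetical extra consumer added to show a two-hop walk.
JOBS = {
    "archival": {"inputs": [], "outputs": ["room_bookings"]},
    "top_locations": {"inputs": ["room_bookings"], "outputs": ["top_locations"]},
    "weekly_report": {"inputs": ["top_locations"], "outputs": ["weekly_report"]},
}


def downstream_jobs(jobs, dataset):
    """Jobs to rerun, in breadth-first order, after `dataset` is backfilled."""
    order, seen = [], set()
    queue = deque([dataset])
    while queue:
        current = queue.popleft()
        for name, job in jobs.items():
            if current in job["inputs"] and name not in seen:
                seen.add(name)
                order.append(name)
                # The job's outputs become stale too, so follow them.
                queue.extend(job["outputs"])
    return order


plan = downstream_jobs(JOBS, "room_bookings")
```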
91. Marquez: Future work
WeWork + Marquez
● Data platform built around Marquez
● Internal integrations
○ Scheduling
○ Batching
○ Streaming