Amundsen is a data discovery and metadata platform that originated at Lyft and was recently donated to the Linux Foundation AI (LF AI). Since being open-sourced, Amundsen has been used and extended by many different companies in our community.
Solving Data Discovery Challenges at Lyft with Amundsen, an Open-source Metadata Platform
1. Solving Data Discovery Challenges with Amundsen, an Open-source Metadata Platform
Tao Feng | tfeng@apache.org
Staff Software Engineer
2. Who
● Engineer at Lyft Data Platform and Tools
● Apache Airflow PMC member and Committer
● Works on different data products (Airflow, Amundsen, etc.) and led the data org cost attribution effort
● Previously at LinkedIn and Oracle
3. Agenda
● What is Data Discovery
● Challenges in Data Discovery
● Introducing Amundsen
● Amundsen Architecture
● Deep Dive
● Impact and Future Work
5. Data-Driven Decisions
Analysts, Data Scientists, General Managers, Engineers, Experimenters, Product Managers
● Axiom: Good decisions are based on data
● Who needs data? Anyone who wants to make good decisions
○ HR wants to ensure salaries are competitive with the market
○ A politician wants to optimize campaign strategy
6. Data-Driven Decisions
1. Data is Collected
2. Analyst Finds the Data
3. Analyst Understands the Data
4. Analyst Creates Report
5. Analyst Shares the Results
6. Someone Makes a Decision
8. Case Study
● Goal: Predict Meetup Attendance
● Why:
- An unknown number of RSVPs will no-show
- Need to procure pizza, drinks, chairs, etc.
● How: Use data from past meetups to build a predictive model
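As a toy illustration of the kind of model the slide describes (not Lyft's actual approach), attendance can be predicted from RSVP counts scaled by the historical show-up rate of past events. All function names here are ours, for illustration only:

```python
# Illustrative sketch: predict meetup attendance from RSVPs using the
# historical show-up rate of past events.

def show_up_rate(past_events):
    """past_events: list of (rsvps, actual_attendance) tuples from prior meetups."""
    total_rsvps = sum(rsvps for rsvps, _ in past_events)
    total_attended = sum(attended for _, attended in past_events)
    return total_attended / total_rsvps

def predict_attendance(rsvps, past_events):
    # Scale the new event's RSVP count by the historical show-up rate.
    return round(rsvps * show_up_rate(past_events))
```

For example, if past events saw 60 of 100 and 90 of 150 RSVPs show up, the rate is 0.6 and 200 RSVPs predict roughly 120 attendees. Notice that building even this trivial model presumes we have already found and understood the data, which is exactly the problem the talk addresses.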
9. Step 2: Find the Data
● Ask a friend or expert
● Ask in a Slack channel
● Search in GitHub repos or other documents
10. Step 3: Understand the Data
● We find a table called core.meetup_events with columns: attending, not_attending, date, init_date
● Does attending mean they actually showed up, or just RSVPed?
● What's the difference between date and init_date?
● Is this data trustworthy and reliable?
11. Step 3: Understand the Data
● Ask the data owner, but how do we find the owner?
● Look for further documentation on GitHub, Confluence, etc.
● Run queries and try to figure it out
SELECT * FROM core.meetup_events LIMIT 100;
12. Data Discovery is Not Productive
● Data scientists spend up to 30% of their time on data discovery
● Data discovery itself provides little to no intrinsic value; the impactful work happens in analysis
● The answer to these problems is metadata
14. What is Amundsen
• In a nutshell, Amundsen is a data discovery and metadata platform for improving the productivity of data analysts, data scientists, and engineers when interacting with data.
• Amundsen is currently hosted at Linux Foundation AI (LF AI) as an incubation project with open governance and an RFC process (e.g. blog post).
15. Lyft data discovery before Amundsen
• Only a few (20-ish) core tables were listed
• Metadata refreshed through a cron job, with no human curation
• Metadata included: owner, code, ETL SLA (statically defined), table/column descriptions
• The metadata was not easy to extend
28. Metadata Service
• A proxy layer that exposes an API to interact with the graph database
‒ Supports different graph DBs: 1) Neo4j (Cypher-based), 2) AWS Neptune (Gremlin-based)
‒ Supports Apache Atlas as the metadata store
• Supports REST APIs so other services can push / pull metadata directly
‒ Service communication is authorized through Envoy RBAC at Lyft
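As a minimal sketch of how a client might address the metadata service: Amundsen identifies a table by a URI of the form db://cluster.schema/table. The helper names and the /table/ endpoint path below are assumptions for illustration, not the service's documented API:

```python
# Hedged sketch: building an Amundsen-style table URI and a REST URL for
# the metadata service. The db://cluster.schema/table key format follows
# Amundsen's convention; the /table/ endpoint path is an assumption here.

def table_uri(database, cluster, schema, table):
    # e.g. "hive://gold.core/meetup_events"
    return f"{database}://{cluster}.{schema}/{table}"

def table_metadata_url(base_url, database, cluster, schema, table):
    # Other services could GET this URL to pull metadata directly.
    return f"{base_url}/table/{table_uri(database, cluster, schema, table)}"
```

A client would issue an HTTP GET against the resulting URL (subject to Envoy RBAC at Lyft), keeping callers decoupled from whichever graph backend sits behind the proxy.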
29. Search Service
• A proxy layer to interact with the search backend
‒ Currently supports Elasticsearch and Apache Atlas as search backends
• Supports different search patterns
‒ Fuzzy search, ranked by popularity
‒ Multi-facet search
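To make "fuzzy search ranked by popularity" concrete, here is a hedged sketch of an Elasticsearch query body that combines fuzzy matching with a popularity boost. The field names (name, total_usage) are illustrative assumptions, not Amundsen's actual index mapping:

```python
# Hedged sketch: an Elasticsearch function_score query combining fuzzy
# matching on the table name with a popularity boost. Field names are
# assumptions for illustration.

def fuzzy_popularity_query(term):
    return {
        "query": {
            "function_score": {
                # Fuzzy match tolerates typos in the search term.
                "query": {"match": {"name": {"query": term, "fuzziness": "AUTO"}}},
                # Boost results by a usage-count field so popular tables rank higher.
                "field_value_factor": {
                    "field": "total_usage",
                    "modifier": "log1p",
                    "missing": 0,
                },
            }
        }
    }
```

The search service can build such a body and POST it to the backend, so callers never talk to Elasticsearch (or Atlas) directly.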
37. 1. What kind of information? (aka ABC of metadata)
Application Context
Metadata needed by humans or applications to operate
● Where is the data?
● What are the semantics of the data?
Behavior
How is data created and used over time?
● Who’s using the data?
● Who created the data?
Change
Change in data over time
● How is the data evolving over time?
● Evolution of code that generates the data
38. 2. About what data?
Short answer: Any data within your organization
Long answer (entity types, from the slide diagram):
● Data stores
● Schema registry
● Events / Schemas
● Streams
● People / Employees
● Notebooks
● Dashboards / Reports
● Processes
40. Dataset
• Includes metadata that is both manually curated and programmatically curated
• Current metadata:
‒ Table description, columns, column descriptions
‒ Last updated timestamp
‒ Partition date range
‒ Tags
‒ Owners, frequent users
‒ Column stats, column usage
‒ Which dashboards use the table
‒ Which Airflow (ETL) task produces the table
‒ GitHub source definition
‒ Unstructured metadata (e.g. data retention), which is easy to extend to cover different companies' metadata requirements
• Challenge: not every dataset defines the same set of metadata or follows the same practice
‒ Tier, SLA (operational metadata)
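The dataset metadata above can be sketched as a simple data structure. This is our own illustrative shape, not Amundsen's actual model; the free-form extra dict mirrors the "unstructured metadata" that lets each company add fields like data retention without schema changes:

```python
# Illustrative sketch of dataset metadata as a dataclass. Field names are
# ours, for illustration; the `extra` dict stands in for Amundsen's
# easy-to-extend unstructured metadata.

from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class DatasetMetadata:
    name: str
    description: str = ""
    owners: List[str] = field(default_factory=list)
    tags: List[str] = field(default_factory=list)
    frequent_users: List[str] = field(default_factory=list)
    last_updated: int = 0                                 # epoch seconds
    extra: Dict[str, str] = field(default_factory=dict)   # e.g. {"retention": "90d"}
```

Making every structured field optional reflects the challenge the slide names: not every dataset defines the same set of metadata.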
41. User
• Users have the most context / tribal knowledge around data assets.
• Connect users with data entities to surface that tribal knowledge.
43. Dashboard
• Current metadata:
‒ Description
‒ Owner
‒ Last updated timestamp, last successful run timestamp, last run status
‒ Tables used in the dashboard, queries, charts
‒ Dashboard preview
‒ Tags
• Challenge:
‒ Not every piece of dashboard metadata applies to every dashboard type
45. Pull model vs. Push model
Pull model:
● Periodically update the index by pulling from the system (e.g. database) via crawlers (Scheduler → Crawler → Database → Data graph).
● Preferred if waiting for indexing is OK and it is easy to bootstrap a central metadata store.
Push model:
● The system (e.g. DB) pushes to a message bus which downstream subscribes to (Database → Message queue → Data graph).
● The message format serves as the interface.
● Allows for near-real-time indexing.
● Preferred if near-real-time indexing is important and a clean interface exists.
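The push model above can be sketched in a few lines. Here a stdlib queue stands in for the message bus (e.g. Kafka) and a plain dict for the data graph; the event shape is an assumption for illustration:

```python
# Minimal sketch of the push model: the source system publishes metadata
# events to a message bus, and the indexer consumes them into the data
# graph in near-real time. A stdlib queue stands in for Kafka here.

import queue

bus = queue.Queue()

def publish_table_update(table, owner):
    # The message format is the interface between producer and indexer.
    bus.put({"type": "table_update", "table": table, "owner": owner})

def index_next(graph):
    # Downstream subscriber: apply one event to the data graph.
    event = bus.get_nowait()
    graph[event["table"]] = {"owner": event["owner"]}
    return event
```

The point of the sketch is the decoupling: the producer only needs to agree on the message format, which is why a clean interface is the precondition for preferring push.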
46. Metadata ingestion
• Pull model ingestion with Neo4j or AWS Neptune as the backend
‒ We could extend to a push-and-pull hybrid model if needed
47. Metadata ingestion
• Push model ingestion with Apache Atlas as the backend (ING blog post)
• Con: Apache Atlas can't ingest from external sources (e.g. Redshift) that don't support a hook interface (intercepting events, messages, or function calls during processing)
49. Why a graph database
• Data entities and their relationships can naturally be represented as a graph
• Performance is better than an RDBMS once the number of nodes and relationships reaches a large scale
• Adding new metadata is easy, as it is just adding a new node to the graph
56. Metadata source of truth
• Centralize all the fragmented metadata
• Treat the Amundsen graph as the metadata source of truth
‒ Unless an upstream source of truth is available (e.g. at Lyft, we define metadata for events in the IDL repo)
59. Central data quality issue portal
• A central portal for users to report data issues.
• Users can also see all past issues.
• Users can request further context / descriptions from owners through the portal.
60. Data Preview
• Supports data preview for datasets.
• Plugs in a client to different BI / visualization tools (e.g. Apache Superset).
• Delegates user authz to Superset to verify whether the given user can access the data.
61. Data Exploration
• Supports integration between Amundsen and a BI / visualization tool for data exploration (e.g. Apache Superset by default).
• Allows users to do complex data exploration.
63. “This is God’s work” - George X, ex-head of Analytics, Lyft
“I was on call and I’m confident 50% of the questions could have been answered by a simple search in Amundsen” - Bomee P, DS, Lyft
Amundsen @ Lyft: 750+ WAUs, 150k+ tables, 4k+ employee pages, 10k+ dashboards
66. Edmunds.com
• Data discovery use case, integrated with an in-house data quality service (e.g. blog post)
• Integrating with Databricks' Delta analytics platform
67. ING
• Data discovery on top of Amundsen with Apache Atlas
• Contributed many security integrations to Amundsen (e.g. blog post)
68. Workday
• Data discovery on their analytics platform, named Goku
• Amundsen is the landing page for Goku
• 1,400 users use their platform
69. Square
• Compliance and regulatory use cases
• Used by security analysts
• Contributed the Gremlin / AWS Neptune integration
• In the production phase (e.g. blog post)
70. Recent contributions from the community
• Redash dashboard integration (Asana)
• Tableau dashboard integration (Gusto)
• Looker dashboard integration (in progress, Brex)
• Integration with the Delta analytics platform (in progress, Edmunds)
• ...
72. Data Lineage (patterns, with description / example / key benefit / key challenge)
● Tool-contributed lineage: the tool creating the data asset also writes the lineage. Examples: 1) Informatica, 2) Hive hook exposing lineage. Benefit: lineage is captured at the time of creation. Challenge: no standard way to write lineage.
● Manually linked by user: users manually add and describe how datasets are linked. Challenge: does not scale.
● Inferred from DAG: extract dependencies based on scheduling. Examples: 1) Airflow lineage, 2) Marquez. Benefit: automatable. Challenge: doesn't support field/column-level lineage.
● Inferred from SQL: programmatically extract lineage with a SQL dialect parser. Example: https://github.com/uber/queryparser. Benefit: accurate, supports all SQL dialects. Challenge: SQL is easier, but there is a long tail of support for others (e.g. Spark).
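The "inferred from SQL" pattern can be sketched crudely with a regex that pulls source and target table names out of an INSERT ... SELECT. Real parsers (e.g. uber/queryparser) use full SQL grammars; this toy handles only the simplest cases and is our own illustration:

```python
# Hedged sketch of SQL-based lineage inference: extract the target table
# and the source tables from a simple INSERT ... SELECT statement. A real
# implementation needs a proper SQL parser per dialect.

import re

def infer_lineage(sql):
    target = re.search(r"INSERT\s+INTO\s+([\w.]+)", sql, re.I)
    sources = re.findall(r"(?:FROM|JOIN)\s+([\w.]+)", sql, re.I)
    return {"target": target.group(1) if target else None, "sources": sources}
```

Running this over a warehouse's query logs would yield edges (source table → target table) to load into the lineage graph, which is exactly where the long tail of non-SQL jobs (e.g. Spark) becomes the challenge.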
73. Data Lineage
• Current main Q4 focus
‒ Working on the UX design for table lineage
• An RFC is coming
‒ Provide a data model for data lineage
‒ Provide a UI for data lineage
‒ Allow different ingestion mechanisms (push-based, SQL parsing, etc.)
74. Machine Learning Feature as an entity
• ML feature as a separate resource entity
‒ Surface feature stats
‒ Surface feature and upstream dataset lineage
‒ Surface various metadata around ML features
75. Metadata platform
• Support programmatic GraphQL API access to metadata for other services' use cases
‒ Expose metadata (e.g. which tables are joined together most frequently) to BI SQL visualization tools
‒ Integrate with a data quality service to surface health scores and data quality information in Amundsen
• Support hybrid (pull + push) metadata ingestion
‒ Build an SDK to push metadata to Amundsen either through the API or through Kafka
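A hybrid-ingestion SDK surface might look like the sketch below: one entry point that delivers a metadata record over whichever transport is configured, HTTP POST or Kafka produce. The names are illustrative assumptions, not Amundsen's actual SDK; transports are injected callables so the example stays runnable without either dependency:

```python
# Hedged sketch of a hybrid push SDK: one push_metadata() entry point that
# hands a record to a pluggable transport (an HTTP POST in the API case,
# a Kafka produce in the streaming case). Names are illustrative only.

def push_metadata(record, transport):
    """transport: a callable taking the record, e.g. an HTTP POST or Kafka produce."""
    transport(record)
    return record

# A stand-in transport that just collects records, so the sketch is runnable.
sent = []
push_metadata({"table": "core.meetup_events", "tag": "pii"}, sent.append)
```

Keeping the transport pluggable is what makes the pull + push hybrid possible: the same record shape flows through crawlers, the REST API, or Kafka.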