Solving Data Discovery Challenges
with Amundsen, an open-source
metadata platform
Tao Feng | tfeng@apache.org
Staff Software Engineer
Who
● Engineer at Lyft Data Platform and
Tools
● Apache Airflow PMC and Committer
● Working on different data products
(Airflow, Amundsen, etc), and led
data org cost attribution effort
● Previously at LinkedIn, Oracle
Agenda
● What is Data Discovery
● Challenges in Data Discovery
● Introducing Amundsen
● Amundsen Architecture
● Deep Dive
● Impact and Future Work
What is Data Discovery
Data-Driven Decisions
Analysts, Data Scientists, General Managers, Engineers, Experimenters, Product Managers
● Axiom: Good decisions are based on data
● Who needs Data? Anyone who wants to make good decisions
○ HR wants to ensure salaries are competitive with market
○ Politician wants to optimize campaign strategy
Data-Driven Decisions
1. Data is Collected
2. Analyst Finds the Data
3. Analyst Understands the Data
4. Analyst Creates Report
5. Analyst Shares the Results
6. Someone Makes a Decision
Challenges in
Data Discovery
Case Study
● Goal: Predict Meetup Attendance
● Why:
- An unknown number of RSVPs will no-show
- Need to procure pizza, drinks, chairs, etc.
● How: Use data from past meetups to build a predictive model
Step 2: Find the Data
● Ask a friend or expert
● Ask in a Slack channel
● Search in the GitHub repos, or other documents
Step 3: Understand the Data
● We find a table called core.meetup_events with columns: attending, not_attending, date, init_date
● Does attending mean they actually showed up or just RSVPed?
● What's the difference between date and init_date?
● Is this data trustworthy and reliable?
Step 3: Understand the Data
● Ask the data owner, but how do we find the owner?
● Look for further documentation on GitHub, Confluence, etc.
● Run queries and try to figure it out
SELECT * FROM core.meetup_events LIMIT 100;
Data Discovery is Not Productive
● Data Scientists spend up to 30% of their
time in Data Discovery
● Data Discovery in itself provides little to
no intrinsic value. Impactful work
happens in Analysis.
● The answer to these problems is
Metadata
Introducing Amundsen
What is Amundsen
• In a nutshell, Amundsen is a data discovery and metadata platform for improving the
productivity of data analysts, data scientists, and engineers when interacting with data.
• Amundsen is currently hosted at Linux Foundation AI (LFAI) as an incubation project with
open governance and an RFC process. (e.g. blog post)
Lyft data discovery before Amundsen existed
• Only a few (20ish) core tables were listed
• Metadata refreshed through a cron job, no human curation
• Metadata includes: owner, code, ETL SLA (statically defined), table/column description
• The metadata was not easy to extend
Amundsen homepage
Search for datasets
See details of the data set
See detailed descriptions and profile of the column
See dashboards built on this data set
Search for existing dashboards/reports
Dashboard detail page
Search for co-workers!
Search for data owned and used by your peers
Architecture
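[Architecture diagram: metadata sources (Postgres, Hive, Redshift, ..., Presto, Mode Dashboard) feed the Databuilder crawler, which populates Neo4j (pluggable) and Elasticsearch (pluggable). The Metadata Service sits in front of Neo4j and the Search Service in front of Elasticsearch. The Frontend Service is shown alongside other microservices (ML Feature Service, Security Service).]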
Frontend Service
Metadata Service
• A proxy layer to interact with graph database with API
‒ Supports different graph dbs: 1) Neo4j (Cypher based), 2) AWS Neptune
(Gremlin based)
‒ Supports Apache Atlas as the metadata store engine
• Supports REST APIs for other services to push / pull metadata directly
‒ Service communication is authorized through Envoy RBAC at Lyft
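The proxy-layer idea can be sketched in a few lines. This is an illustration only: `BaseProxy`, `InMemoryProxy`, `make_proxy`, and the two description methods are hypothetical names, not Amundsen's actual API, and a plain dict stands in for Neo4j or Neptune.

```python
from abc import ABC, abstractmethod
from typing import Dict, Optional


class BaseProxy(ABC):
    """Hypothetical interface the metadata service codes against."""

    @abstractmethod
    def get_table_description(self, table_key: str) -> Optional[str]:
        ...

    @abstractmethod
    def put_table_description(self, table_key: str, description: str) -> None:
        ...


class InMemoryProxy(BaseProxy):
    """Stand-in backend; a real one would issue Cypher or Gremlin queries."""

    def __init__(self) -> None:
        self._descriptions: Dict[str, str] = {}

    def get_table_description(self, table_key: str) -> Optional[str]:
        return self._descriptions.get(table_key)

    def put_table_description(self, table_key: str, description: str) -> None:
        self._descriptions[table_key] = description


def make_proxy(backend: str) -> BaseProxy:
    # Only the in-memory backend exists in this sketch; others would be
    # registered here under their own names.
    registry = {"memory": InMemoryProxy}
    return registry[backend]()


proxy = make_proxy("memory")
proxy.put_table_description("core.meetup_events", "One row per meetup event")
print(proxy.get_table_description("core.meetup_events"))
```

Because callers only see `BaseProxy`, swapping the graph database would mean adding one class and one registry entry, which is the point of keeping the backend pluggable.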
Search Service
• A proxy layer to interact with the search backend
‒ Currently supports Elasticsearch and Apache Atlas as search backends.
• Supports different search patterns
‒ Fuzzy search: search based on popularity
‒ Multi facet search
Databuilder
Metadata Sources
Databuilder in action
How is the databuilder orchestrated?
Amundsen uses a workflow engine (e.g. Apache Airflow) to orchestrate Databuilder jobs
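A rough sketch of the kind of job a workflow engine might schedule is shown below. It assumes a simple extract, transform, load shape; the function names and record layout are hypothetical, and plain Python containers stand in for the metadata source and the graph store.

```python
from typing import Dict, Iterable, Iterator

Record = Dict[str, str]


def extract(source_rows: Iterable[Record]) -> Iterator[Record]:
    """Pull raw table metadata from a source (here: a plain list)."""
    for row in source_rows:
        yield row


def transform(records: Iterable[Record]) -> Iterator[Record]:
    """Normalize each record and derive a unique key for it."""
    for rec in records:
        schema, table = rec["schema"].lower(), rec["table"].lower()
        yield {"key": f"{schema}.{table}", "description": rec.get("description", "")}


def load(records: Iterable[Record], store: Dict[str, Record]) -> int:
    """Upsert records into the store and return how many were written."""
    count = 0
    for rec in records:
        store[rec["key"]] = rec
        count += 1
    return count


def run_job(source_rows: Iterable[Record], store: Dict[str, Record]) -> int:
    """One job run; a scheduler would call this on a fixed interval."""
    return load(transform(extract(source_rows)), store)


hive_rows = [
    {"schema": "CORE", "table": "Meetup_Events", "description": "Meetup RSVPs"},
    {"schema": "core", "table": "users"},
]
graph_store: Dict[str, Record] = {}
written = run_job(hive_rows, graph_store)
print(written, sorted(graph_store))
```

Keeping each stage a separate function means a new source only needs a new `extract`, while the rest of the job stays unchanged.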
Current built-in connectors
Deep Dive
Metadata model
1. What kind of information? (aka ABC of metadata)
Application Context
Metadata needed by humans or applications to operate
● Where is the data?
● What are the semantics of the data?
Behavior
How is data created and used over time?
● Who’s using the data?
● Who created the data?
Change
Change in data over time
● How is the data evolving over time?
● Evolution of code that generates the data
2. About what data?
Short answer: Any data within your organization
Long answer:
● Data stores
● Events / Schemas (schema registry)
● Streams
● People (employees)
● Notebooks
● Dashboards / Reports
● Processes
(Some of these are labeled "TODAY" on the slide.)
Dataset
Dataset
• Includes both manually curated and programmatically curated metadata
• Current metadata:
‒ Table description, columns, column descriptions
‒ Last updated timestamp
‒ Partition date range
‒ Tags
‒ Owners, frequent users
‒ Column stats, column usage
‒ Dashboards that use the table
‒ Airflow (ETL) task that produces the table
‒ GitHub source definition
‒ Unstructured metadata (e.g. data retention), which is easy to extend to cover different companies'
metadata requirements
• Challenge: not every dataset defines the same set of metadata or
follows the same practice
‒ e.g. tier, SLA (operational metadata)
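One way to picture this model is a record with a few fixed fields plus a free-form map for unstructured metadata. The sketch below is an assumption for illustration: `TableMetadata`, its field names, and `missing_fields` are hypothetical and do not reflect Amundsen's real schema.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class TableMetadata:
    name: str
    description: str = ""
    owners: List[str] = field(default_factory=list)
    tags: List[str] = field(default_factory=list)
    # Operational metadata that not every dataset defines
    tier: Optional[str] = None
    sla: Optional[str] = None
    # Unstructured metadata: extended per company without a schema change
    extra: Dict[str, str] = field(default_factory=dict)

    def missing_fields(self) -> List[str]:
        """List curated fields that are still empty, to flag gaps."""
        checks = {
            "description": self.description,
            "owners": self.owners,
            "tier": self.tier,
            "sla": self.sla,
        }
        return [name for name, value in checks.items() if not value]


table = TableMetadata(name="core.meetup_events", owners=["data-platform"])
table.extra["data_retention"] = "90 days"
print(table.missing_fields())
```

Optional fields with empty defaults reflect the challenge above: datasets that skip tier or SLA are still valid records, and the gaps can be reported rather than rejected.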
User
• Users have the most context / tribal knowledge about data assets.
• Connect users with data entities to surface that tribal knowledge.
Dashboard
• Dashboards represent users' existing research and analysis.
Dashboard
• Current metadata:
‒ Description
‒ Owner
‒ Last updated timestamp, last successful run timestamp, last run status
‒ Tables used in dashboard, queries, charts
‒ Dashboard preview
‒ Tags
• Challenge:
‒ Not every piece of dashboard metadata applies to every dashboard type
Push vs Pull
Pull model vs. Push model
Pull Model
● Periodically update the index by pulling from the system (e.g. database) via crawlers.
● [Diagram: Scheduler, Crawler, Database, Data graph]
● Preferred if
○ Waiting for indexing is ok
○ Easy to bootstrap central metadata
Push Model
● The system (e.g. DB) pushes to a message bus which downstream subscribes to.
● Message format serves as the interface
● Allows for near-real time indexing
● [Diagram: Database, Message queue, Data graph]
● Preferred if
○ Near-real time indexing is important
○ Clean interface exists
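The difference between the two models can be sketched with in-memory stand-ins. Everything here is hypothetical: dicts play the role of the database catalog and the central index, and `queue.Queue` plays the role of the message bus.

```python
import queue
from typing import Dict

# Stand-ins: a source "database" catalog and the central index ("data graph")
database: Dict[str, str] = {"core.meetup_events": "v1"}
index: Dict[str, str] = {}
message_bus = queue.Queue()


def pull_once() -> int:
    """Pull model: a scheduled crawler copies the current state of the source."""
    changed = 0
    for table, version in database.items():
        if index.get(table) != version:
            index[table] = version
            changed += 1
    return changed


def publish_change(table: str, version: str) -> None:
    """Push model: the source emits an event; the message format is the interface."""
    database[table] = version
    message_bus.put({"table": table, "version": version})


def consume_all() -> int:
    """Downstream subscriber applies events as they arrive."""
    applied = 0
    while not message_bus.empty():
        event = message_bus.get()
        index[event["table"]] = event["version"]
        applied += 1
    return applied


first_crawl = pull_once()            # bootstrap the index from the source
publish_change("core.users", "v1")   # a new table is announced via the bus
pushed = consume_all()
print(first_crawl, pushed, sorted(index))
```

The pull path needs nothing from the source beyond read access, which is why it is easy to bootstrap; the push path is fresher but only works when the source can emit events in an agreed format.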
Metadata ingestion
• Pull model ingestion with Neo4j or AWS Neptune as backend.
‒ We could extend to a push and pull hybrid model if needed
Metadata ingestion
• Push model ingestion with Apache Atlas as backend (ING blog post)
• Con: Apache Atlas cannot ingest from an external source (e.g. Redshift)
that does not provide a hook interface (for intercepting events, messages, or function calls
during processing).
Why Graph Database?
Why graph database
• Data entities and their relationships can be represented as a graph
• Performance is better than an RDBMS once the number of nodes and
relationships grows large
• Adding new metadata is easy: it is just adding a new node to the
graph
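A toy version of the graph model makes the extensibility point concrete. The node keys, relation names, and helper functions below are made up for illustration and are not Amundsen's actual graph schema.

```python
from collections import defaultdict
from typing import DefaultDict, List, Set, Tuple

nodes: Set[str] = set()
edges: DefaultDict[str, List[Tuple[str, str]]] = defaultdict(list)


def add_edge(src: str, relation: str, dst: str) -> None:
    """Create both nodes if needed and link them with a typed relation."""
    nodes.update((src, dst))
    edges[src].append((relation, dst))


def neighbors(src: str, relation: str) -> List[str]:
    """Follow one relation type out of a node."""
    return [dst for rel, dst in edges[src] if rel == relation]


add_edge("table:core.meetup_events", "OWNED_BY", "user:alice")
add_edge("table:core.meetup_events", "HAS_COLUMN", "column:attending")
# A new kind of metadata later on: no schema migration, just a node and an edge
add_edge("table:core.meetup_events", "TAGGED", "tag:meetup")

print(neighbors("table:core.meetup_events", "TAGGED"))
```

In a relational schema the tag would typically need a new table or column; here it is only another node type hanging off the existing table node.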
Search Tradeoff
Search Results
Ranked on Relevance and Popularity
Relevance - search for “apple” on Google
Low relevance High relevance
Popularity - search for “apple” on Google
Low popularity High popularity
Search Results - Striking the balance
Relevance
● Names, Description, Tags, [Owners, Frequent users]
● Different weights for different metadata, e.g. resource name
Popularity
● Querying activity
● Lower weight for automated querying
● Higher weight for ad-hoc querying
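The balance between relevance and popularity could be combined along these lines. The weights, field names, and the multiplicative formula are assumptions chosen for the sketch; the slides do not give the actual scoring function.

```python
import math
from typing import Any, Dict, List

# Hypothetical weights: a match on the resource name counts most
FIELD_WEIGHTS = {"name": 3.0, "description": 1.0, "tags": 2.0}
ADHOC_WEIGHT = 1.0       # ad-hoc queries signal human interest
AUTOMATED_WEIGHT = 0.1   # automated queries are down-weighted


def relevance(query: str, doc: Dict[str, Any]) -> float:
    """Sum the weights of every text field that contains the query."""
    q = query.lower()
    return sum(w for f, w in FIELD_WEIGHTS.items() if q in doc.get(f, "").lower())


def popularity(adhoc: int, automated: int) -> float:
    """Log-damped usage so heavy hitters do not drown out everything else."""
    return math.log1p(ADHOC_WEIGHT * adhoc + AUTOMATED_WEIGHT * automated)


def score(query: str, doc: Dict[str, Any]) -> float:
    pop = popularity(doc["adhoc_queries"], doc["automated_queries"])
    return relevance(query, doc) * (1.0 + pop)


docs: List[Dict[str, Any]] = [
    {"name": "tmp.meetup_backfill", "description": "scratch copy", "tags": "",
     "adhoc_queries": 0, "automated_queries": 1000},
    {"name": "core.meetup_events", "description": "Meetup RSVPs", "tags": "meetup",
     "adhoc_queries": 50, "automated_queries": 10},
]
ranked = sorted(docs, key=lambda d: score("meetup", d), reverse=True)
print([d["name"] for d in ranked])
```

Multiplying by `1.0 + popularity` keeps an unused but relevant table findable, while a table with no text match scores zero however popular it is.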
Metadata Source of Truth
Metadata source of truth
• Centralize all the fragmented metadata
• Treat the Amundsen graph as the metadata source of truth
‒ Unless an upstream source of truth is available (e.g. at Lyft, we define metadata for events in an IDL repo)
Other features
Announcement page
• Plugin client for announcing new features or new datasets
Central data quality issue portal
• Central portal for users to
report data issues.
• Users could see all the past
issues as well.
• Users could request further
context / descriptions from
owners through the portal.
Data Preview
• Supports data preview for
datasets.
• Plugin client with different BI Viz
tools (e.g Apache Superset).
• Delegate the user authz to
Superset to verify whether the
given user could access the
data.
Data Exploration
• Supports integration between
Amundsen and BI Viz tool for
data exploration (e.g Apache
Superset by default).
• Allows users to do complex data
exploration.
Impact
“This is God’s work” - George X, ex-head of Analytics, Lyft
“I was on call and I’m confident 50% of the questions could have been answered by a simple search in Amundsen” - Bomee P, DS, Lyft
Amundsen @ Lyft: 750+ WAUs, 150k+ tables, 4k+ employee pages, 10k+
dashboards
Amundsen Open Source
950+ community members
150+ companies in the community
25+ companies using it in production
Amundsen Open Source Community
Prominent users / Active community
Edmunds.com
• Data Discovery use case and integrated with in-house Data quality
service (e.g blog post)
• Integrating with Databricks’ Delta analytics platform
ING
• Data Discovery on top of Amundsen with Apache Atlas
• Contributed a lot of security integrations to Amundsen (e.g blog post)
Workday
• Data Discovery on their analytics platform, named Goku
• Amundsen is the landing page for Goku
• 1400 users using their platform
Square
• Compliance and regulatory use cases
• Used by security analysis
• Contributed the Gremlin / AWS Neptune integration
• Production phase (e.g blog post)
Recent Contributions from the community
• Redash dashboard integration (Asana)
• Tableau dashboard integration (Gusto)
• Looker dashboard integration (in progress, Brex )
• Integrating with Delta analytics platform (In progress, Edmunds)
• ...
Future
Data Lineage
Lineage patterns (description, example, key benefit, key challenge)
• Tool-contributed lineage
‒ Description: The tool creating the data asset also writes the lineage
‒ Example: 1) Informatica 2) Hive hook exposes lineage
‒ Key benefit: At time of creation
‒ Key challenge: No standard way to write lineage
• Manually linked by user
‒ Description: Manually added and described how datasets are linked
‒ Key challenge: Does not scale
• Inferred from DAG
‒ Description: Extract dependencies based on scheduling
‒ Example: 1) Airflow lineage 2) Marquez
‒ Key benefit: Automatable
‒ Key challenge: Doesn’t support field/column level lineage
• Inferred from SQL
‒ Description: Programmatically extracting lineage with SQL dialect
‒ Example: https://github.com/uber/queryparser
‒ Key benefit: Accurate, supports all SQL dialects
‒ Key challenge: SQL is easier, but long tail of support for others (Spark)
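The "inferred from DAG" pattern can be illustrated with a small sketch. It assumes each scheduled task declares the tables it reads and writes; the task names, table names, and both helper functions are hypothetical and not taken from Airflow or Marquez.

```python
from typing import Dict, List, Set

# Hypothetical task declarations: each task lists tables it reads and writes
tasks: Dict[str, Dict[str, List[str]]] = {
    "load_events": {"reads": ["raw.events"], "writes": ["core.meetup_events"]},
    "build_report": {
        "reads": ["core.meetup_events", "core.users"],
        "writes": ["report.attendance"],
    },
}


def table_lineage(task_defs: Dict[str, Dict[str, List[str]]]) -> Dict[str, Set[str]]:
    """Map each output table to the tables it is directly derived from."""
    upstream: Dict[str, Set[str]] = {}
    for spec in task_defs.values():
        for out in spec["writes"]:
            upstream.setdefault(out, set()).update(spec["reads"])
    return upstream


def all_upstream(table: str, upstream: Dict[str, Set[str]]) -> Set[str]:
    """Walk the lineage graph to collect every transitive upstream table."""
    seen: Set[str] = set()
    stack = [table]
    while stack:
        for parent in upstream.get(stack.pop(), set()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


lineage = table_lineage(tasks)
print(sorted(all_upstream("report.attendance", lineage)))
```

The sketch also shows the limitation named above: it only knows which tables a task touches, so it cannot say which columns feed which columns.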
Data Lineage
• Current main Q4 focus
‒ working on UX design for table lineage
• RFC is coming
‒ Provide data model for data lineage
‒ Provide UI for data lineage
‒ Allow different ingestion mechanisms (push based, SQL parsing, etc.)
Machine Learning Feature as entity
• ML Feature as a separate resource entity
‒ Surface feature stats
‒ Surface feature and upstream dataset lineage
‒ Surface various metadata around ML features
Metadata platform
• Support programmatic GraphQL API access to metadata for other services'
use cases
‒ Expose metadata (e.g. which tables are joined with which other tables most frequently) to BI SQL visualization
tools
‒ Integrate with a data quality service to surface health score and data quality information in
Amundsen
• Support hybrid (pull + push) metadata ingestion
‒ Build an SDK to push metadata to Amundsen either through an API or through Kafka
Q & A
Feedback
Your feedback is important to us.
Don’t forget to rate
and review the sessions.

Solving Data Discovery Challenges at Lyft with Amundsen, an Open-source Metadata Platform

  • 1. Solving Data Discovery Challenges with Amundsen, an open-source metadata platform Tao Feng | tfeng@apache.org Staff Software Engineer
  • 2. Who ● Engineer at Lyft Data Platform and Tools ● Apache Airflow PMC and Committer ● Working on different data products (Airflow, Amundsen, etc.), and led data org cost attribution effort ● Previously at LinkedIn, Oracle
  • 3. Agenda ● What is Data Discovery ● Challenges in Data Discovery ● Introducing Amundsen ● Amundsen Architecture ● Deep Dive ● Impact and Future Work
  • 4. What is Data Discovery
  • 5. Data-Driven Decisions Analysts, Data Scientists, General Managers, Engineers, Experimenters, Product Managers ● Axiom: Good decisions are based on data ● Who needs data? Anyone who wants to make good decisions ○ HR wants to ensure salaries are competitive with the market ○ A politician wants to optimize campaign strategy
  • 6. Data-Driven Decisions 1. Data is Collected 2. Analyst Finds the Data 3. Analyst Understands the Data 4. Analyst Creates Report 5. Analyst Shares the Results 6. Someone Makes a Decision
  • 8. Case Study ● Goal: Predict Meetup Attendance ● Why: - An unknown number of RSVPs will no-show - Need to procure pizza, drinks, chairs, etc. ● How: Use data from past meetups to build a predictive model
  • 9. Step 2: Find the Data ● Ask a friend or expert ● Ask in a Slack channel ● Search in the GitHub repos, or other documents
  • 10. Step 3: Understand the Data ● We find a table called core.meetup_events with columns: attending, not_attending, date, init_date ● Does attending mean they actually showed up or just RSVPed? ● What's the difference between date and init_date? ● Is this data trustworthy and reliable?
  • 11. Step 3: Understand the Data ● Ask the data owner, but how do we find the owner? ● Look for further documentation on GitHub, Confluence, etc. ● Run queries and try to figure it out SELECT * FROM core.meetup_events LIMIT 100;
  • 12. Data Discovery is Not Productive ● Data Scientists spend up to 30% of their time in Data Discovery ● Data Discovery in itself provides little to no intrinsic value. Impactful work happens in Analysis. ● The answer to these problems is Metadata
  • 14. What is Amundsen • In a nutshell, Amundsen is a data discovery and metadata platform for improving the productivity of data analysts, data scientists, and engineers when interacting with data. • Amundsen is currently hosted at Linux Foundation AI (LFAI) as an incubation project with open governance and an RFC process. (e.g. blog post)
  • 15. Lyft data discovery before Amundsen existed • Only a few (20ish) core tables were listed • Metadata was refreshed through a cron job, with no human curation • Metadata included: owner, code, ETL SLA (statically defined), table/column description • The metadata was not easy to extend
  • 18. See details of the data set
  • 19. See detailed descriptions and profile of the column
  • 20. See dashboards built on this data set
  • 21. Search for existing dashboards/reports
  • 24. Search for data owned and used by your peers
  • 26. Architecture diagram: Metadata Sources (Postgres, Hive, Redshift, ..., Presto, Mode Dashboard) are crawled by the Databuilder into Neo4j and Elasticsearch (both pluggable), which back the Metadata Service and Search Service behind the Frontend Service; other microservices include the ML Feature Service and Security Service.
  • 28. Metadata Service • A proxy layer to interact with the graph database through an API ‒ Supports different graph DBs: 1) Neo4j (Cypher based), 2) AWS Neptune (Gremlin based) ‒ Supports Apache Atlas as the metadata store engine • Supports REST APIs for other services pushing / pulling metadata directly ‒ Service communication is authorized through Envoy RBAC at Lyft
  • 29. Search Service • A proxy layer to interact with the search backend ‒ Currently supports Elasticsearch and Apache Atlas as search backends • Supports different search patterns ‒ Fuzzy search: search based on popularity ‒ Multi-facet search
  • 33. How is the Databuilder orchestrated? Amundsen uses a workflow engine (e.g. Apache Airflow) to orchestrate Databuilder jobs
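The orchestration idea on slide 33 can be sketched without a real workflow engine. The minimal example below uses Python's standard-library `graphlib` to run ingestion steps in dependency order. The task names, the `DEPENDENCIES` mapping, and the `run_job` helper are assumptions made for illustration; they are not Amundsen Databuilder or Airflow APIs.

```python
from graphlib import TopologicalSorter

# Hypothetical steps of one metadata ingestion job; names are illustrative only.
def extract_tables():
    return "extracted table metadata"

def load_graph():
    return "loaded metadata into graph store"

def publish_search_index():
    return "published search index"

TASKS = {
    "extract_tables": extract_tables,
    "load_graph": load_graph,
    "publish_search_index": publish_search_index,
}

# Each key lists the upstream tasks that must finish before it runs.
DEPENDENCIES = {
    "extract_tables": set(),
    "load_graph": {"extract_tables"},
    "publish_search_index": {"load_graph"},
}

def run_job(tasks, dependencies):
    """Run every task in an order that respects its dependencies."""
    log = []
    for name in TopologicalSorter(dependencies).static_order():
        log.append((name, tasks[name]()))
    return log

execution_log = run_job(TASKS, DEPENDENCIES)
```

A workflow engine adds scheduling, retries, and monitoring on top of this ordering, which is presumably why the deck delegates orchestration to one rather than running jobs by hand.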
  • 37. 1. What kind of information? (aka ABC of metadata) Application Context Metadata needed by humans or applications to operate ● Where is the data? ● What are the semantics of the data? Behavior How is data created and used over time? ● Who’s using the data? ● Who created the data? Change Change in data over time ● How is the data evolving over time? ● Evolution of code that generates the data TODAY
  • 38. 2. About what data? Short answer: Any data within your organization Long answer: Data stores, Schema registry, Events / Schemas, Streams, People (Employees), Notebooks, Dashboards / Reports, Processes (the slide labels a subset of these as TODAY)
  • 40. Dataset • Includes metadata that is both manually and programmatically curated • Current metadata: ‒ Table description, columns, column descriptions ‒ Last updated timestamp ‒ Partition date range ‒ Tags ‒ Owners, frequent users ‒ Column stats, column usage ‒ Used in which dashboards ‒ Produced by which Airflow (ETL) task ‒ GitHub source definition ‒ Unstructured metadata (e.g. data retention), which is easy to extend to cover different companies' metadata requirements • Challenge: not every dataset defines the same set of metadata or follows the same practice ‒ Tier, SLA (operational metadata)
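One way to picture the dataset metadata listed on slide 40 is as a record with a few fixed fields plus a free-form map for the unstructured part. This is only a sketch: the class names, field names, and sample values (owner address, retention period) are hypothetical and do not reflect Amundsen's actual data model.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

@dataclass
class ColumnMetadata:
    name: str
    description: Optional[str] = None
    stats: Dict[str, float] = field(default_factory=dict)

@dataclass
class TableMetadata:
    name: str
    description: Optional[str] = None
    columns: List[ColumnMetadata] = field(default_factory=list)
    last_updated_timestamp: Optional[int] = None
    tags: List[str] = field(default_factory=list)
    owners: List[str] = field(default_factory=list)
    # Free-form key/value pairs for metadata that only some datasets define.
    unstructured: Dict[str, str] = field(default_factory=dict)

# Illustrative values only; the owner and retention period are made up.
table = TableMetadata(
    name="core.meetup_events",
    columns=[ColumnMetadata("attending"), ColumnMetadata("init_date")],
    owners=["owner@example.com"],
    unstructured={"data_retention": "90 days"},
)
```

Making every field optional except the name reflects the challenge the slide mentions: not every dataset defines the same set of metadata.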
  • 41. User • Users have the most context / tribal knowledge around data assets. • Connect users with data entities to surface that tribal knowledge.
  • 42. Dashboard • A dashboard represents users' existing research and analysis.
  • 43. Dashboard • Current metadata: ‒ Description ‒ Owner ‒ Last updated timestamp, last successful run timestamp, last run status ‒ Tables used in dashboard, queries, charts ‒ Dashboard preview ‒ Tags • Challenge: ‒ Not every dashboard metadata field is applicable to other dashboard types
  • 45. Pull model vs. Push model. Pull Model: ● Periodically update the index by pulling from the system (e.g. database) via crawlers. ● Components: Scheduler, Crawler, Database, Data graph ● Preferred if: waiting for indexing is OK; it is easy to bootstrap central metadata. Push Model: ● The system (e.g. DB) pushes to a message bus which downstream subscribes to. ● Message format serves as the interface ● Allows for near-real time indexing ● Components: Database, Message queue, Data graph ● Preferred if: near-real time indexing is important; a clean interface exists.
  • 46. Metadata ingestion • Pull model ingestion with Neo4j or AWS Neptune as the backend. ‒ We could extend to a push and pull hybrid model if needed
  • 47. Metadata ingestion • Push model ingestion with Apache Atlas as the backend (ING blog post) • Cons: Apache Atlas doesn’t support an external source (e.g. Redshift) that doesn’t support the hook interface (intercepting events, messages or function calls during processing).
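The difference between the pull and push models described on the ingestion slides can be sketched with the standard library alone. Everything here is an assumption for illustration: `SOURCE_DB`, the message layout, the `core.rides` table, and both function names are hypothetical, and a plain `queue.Queue` stands in for a real message bus.

```python
import queue

# Hypothetical source system: table name -> description.
SOURCE_DB = {"core.meetup_events": "Meetup RSVP events"}

def pull_ingest(source, index):
    """Pull model: a scheduled crawler reads the source and updates the index."""
    for table, description in source.items():
        index[table] = description

def push_ingest(bus, index):
    """Push model: consume change messages that the source published to a bus."""
    while True:
        try:
            message = bus.get_nowait()
        except queue.Empty:
            break
        # The message format ({"table": ..., "description": ...}) is the interface.
        index[message["table"]] = message["description"]

pull_index = {}
pull_ingest(SOURCE_DB, pull_index)  # in practice this would run on a schedule

push_index = {}
message_bus = queue.Queue()
message_bus.put({"table": "core.rides", "description": "Ride facts"})
push_ingest(message_bus, push_index)  # in practice this would run continuously
```

In the pull case the index is only as fresh as the last crawl, while in the push case it is updated as soon as a message arrives, which matches the trade-off on slide 45.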
  • 49. Why graph database • Data entities and their relationships can be represented as a graph • Performance is better than an RDBMS once the number of nodes and relationships grows large • Adding new metadata is easy, as it just means adding a new node to the graph
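The extensibility argument on slide 49 can be illustrated with a tiny in-memory property graph. This is a sketch under stated assumptions, not how Neo4j or Amundsen store data: the `MetadataGraph` class, the relation names (`OWNED_BY`, `USED_IN`), and the node keys are all hypothetical.

```python
from collections import defaultdict

class MetadataGraph:
    """Tiny in-memory property graph: labeled nodes plus typed edges."""

    def __init__(self):
        self.nodes = {}                # key -> {"label": ..., **properties}
        self.edges = defaultdict(set)  # key -> {(relation, other_key)}

    def add_node(self, key, label, **properties):
        self.nodes[key] = {"label": label, **properties}

    def add_edge(self, start, relation, end):
        self.edges[start].add((relation, end))

    def neighbors(self, key, relation):
        """Return the keys reachable from `key` over edges of one relation type."""
        return sorted(end for rel, end in self.edges[key] if rel == relation)

graph = MetadataGraph()
graph.add_node("core.meetup_events", "Table")
graph.add_node("alice", "User")
graph.add_edge("core.meetup_events", "OWNED_BY", "alice")

# A new kind of metadata later on: just another node and edge, no schema change.
graph.add_node("meetup_attendance_dashboard", "Dashboard")
graph.add_edge("core.meetup_events", "USED_IN", "meetup_attendance_dashboard")
```

Adding the dashboard required no change to existing nodes, which is the property the slide contrasts with altering tables in a relational schema.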
  • 51. Search Results Ranked on Relevance and Popularity
  • 52. Relevance - search for “apple” on Google Low relevance High relevance
  • 53. Popularity - search for “apple” on Google Low popularity High popularity
  • 54. Search Results - Striking the balance. Relevance: ● Names, Description, Tags, [Owners, Frequent users] ● Different weights for different metadata, e.g. resource name. Popularity: ● Querying activity ● Lower weight for automated querying ● Higher weight for ad-hoc querying
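A rough sketch of how relevance and popularity might be combined into one ranking follows. The deck does not give a formula, so the field weights, the activity weights, the log damping, and the multiplicative combination are all assumptions chosen for illustration, as are the sample resources and their query counts.

```python
import math

# Hypothetical field weights: a match on the resource name counts the most.
FIELD_WEIGHTS = {"name": 3.0, "tags": 2.0, "description": 1.0}
# Hypothetical activity weights: ad-hoc queries count more than automated ones.
ADHOC_WEIGHT, AUTOMATED_WEIGHT = 1.0, 0.1

def relevance(term, resource):
    """Sum the weights of every field whose text contains the search term."""
    term = term.lower()
    return sum(
        weight
        for field_name, weight in FIELD_WEIGHTS.items()
        if term in resource.get(field_name, "").lower()
    )

def popularity(resource):
    """Log-damped query activity, discounting automated queries."""
    weighted = (ADHOC_WEIGHT * resource.get("adhoc_queries", 0)
                + AUTOMATED_WEIGHT * resource.get("automated_queries", 0))
    return math.log1p(weighted)

def rank(term, resources):
    """Return the names of matching resources, best score first."""
    scored = [(relevance(term, r) * (1.0 + popularity(r)), r["name"])
              for r in resources if relevance(term, r) > 0]
    return [name for _, name in sorted(scored, reverse=True)]

resources = [
    {"name": "core.meetup_events", "description": "meetup RSVPs",
     "adhoc_queries": 200, "automated_queries": 50},
    {"name": "tmp.meetup_backup", "description": "old copy",
     "adhoc_queries": 1, "automated_queries": 5000},
    {"name": "core.rides", "description": "ride facts", "adhoc_queries": 900},
]
results = rank("meetup", resources)
```

With these made-up numbers, `core.meetup_events` ranks ahead of `tmp.meetup_backup` because it matches in two fields, and `core.rides` is excluded because it has no relevance to the term despite heavy usage.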
  • 56. Metadata source of truth • Centralize all the fragmented metadata • Treat the Amundsen graph as the metadata source of truth ‒ Unless an upstream source of truth is available (e.g. at Lyft, we define metadata for events in the IDL repo)
  • 58. Announcement page • Plugin client to support new features or new datasets
  • 59. Central data quality issue portal • Central portal for users to report data issues. • Users could see all the past issues as well. • Users could request further context / descriptions from owners through the portal.
  • 60. Data Preview • Supports data preview for datasets. • Plugin client with different BI Viz tools (e.g. Apache Superset). • Delegates user authz to Superset to verify whether the given user can access the data.
  • 61. Data Exploration • Supports integration between Amundsen and a BI Viz tool for data exploration (Apache Superset by default). • Allows users to do complex data exploration.
  • 63. “This is God’s work” - George X, ex-head of Analytics, Lyft “I was on call and I’m confident 50% of the questions could have been answered by a simple search in Amundsen” - Bomee P, DS, Lyft Amundsen @ Lyft: 750+ WAUs, 150k+ tables, 4k+ employee pages, 10k+ dashboards
  • 64. Amundsen Open Source 950+ Community members 150+ Companies in the community 25+ Companies using in production
  • 65. Amundsen Open Source Community: Prominent users, Active community
  • 66. Edmunds.com • Data Discovery use case, integrated with an in-house data quality service (e.g. blog post) • Integrating with Databricks’ Delta analytics platform
  • 67. ING • Data Discovery on top of Amundsen with Apache Atlas • Contributed a lot of security integrations to Amundsen (e.g. blog post)
  • 68. Workday • Data Discovery on their analytics platform, named Goku • Amundsen is the landing page for Goku • 1400 users using their platform
  • 69. Square • Compliance and regulatory use cases • Used by security analysis • Contributed the Gremlin / AWS Neptune integration • Production phase (e.g. blog post)
  • 70. Recent Contributions from the community • Redash dashboard integration (Asana) • Tableau dashboard integration (Gusto) • Looker dashboard integration (in progress, Brex) • Integrating with Delta analytics platform (in progress, Edmunds) • ...
  • 72. Data Lineage (patterns compared by description, example, key benefit, and key challenge)
      ‒ Tool Contributed Lineage: the tool creating the data asset also writes the lineage; Example: 1) Informatica 2) Hive hook exposes lineage; Key benefit: captured at time of creation; Key challenge: no standard way to write lineage
      ‒ Manually linked by User: users manually add and describe how datasets are linked; Key challenge: does not scale
      ‒ Inferred from DAG: extract dependencies based on scheduling; Example: 1) Airflow lineage 2) Marquez; Key benefit: automatable; Key challenge: doesn’t support field/column level lineage
      ‒ Inferred from SQL: programmatically extract lineage by parsing the SQL dialect; Example: https://github.com/uber/queryparser ; Key benefit: accurate, supports all SQL dialects; Key challenge: SQL is easier, but there is a long tail of support for others (Spark)
  • 73. Data Lineage • Current main Q4 focus ‒ working on UX design for table lineage • RFC is coming ‒ Provide data model for data lineage ‒ Provide UI for data lineage ‒ Allows different ingestion mechanisms (Push based, SQL parsing, etc)
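To make the "Inferred from SQL" lineage pattern concrete, here is a deliberately naive sketch that pulls a target table and its upstream tables out of one simple statement using regular expressions. It is not how the queryparser project or Amundsen works; a real implementation needs a proper SQL parser, since this version ignores subqueries, CTEs, quoted identifiers, and comments. The function name and the `core.meetup_summary` and `core.meetup_venues` tables are hypothetical.

```python
import re

# Matches a plain or schema-qualified identifier such as core.meetup_events.
TABLE = r"([A-Za-z_][\w.]*)"
TARGET_RE = re.compile(
    r"\b(?:INSERT\s+(?:INTO|OVERWRITE\s+TABLE)|CREATE\s+TABLE)\s+" + TABLE,
    re.IGNORECASE,
)
SOURCE_RE = re.compile(r"\b(?:FROM|JOIN)\s+" + TABLE, re.IGNORECASE)

def infer_lineage(sql):
    """Return (target_table, sorted upstream tables) for one simple statement."""
    target_match = TARGET_RE.search(sql)
    target = target_match.group(1) if target_match else None
    sources = sorted({s for s in SOURCE_RE.findall(sql) if s != target})
    return target, sources

sql = """
INSERT INTO core.meetup_summary
SELECT e.date, COUNT(*) AS attendees
FROM core.meetup_events e
JOIN core.meetup_venues v ON e.venue_id = v.id
GROUP BY e.date
"""
target, upstream = infer_lineage(sql)
```

Each `(upstream, target)` pair could then be stored as an edge in the metadata graph, which is one of the ingestion mechanisms the lineage roadmap leaves open.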
  • 74. Machine Learning Feature as entity • ML Feature as a separate resource entity ‒ Surface feature stats ‒ Surface feature and upstream dataset lineage ‒ Surface various metadata around ML features
  • 75. Metadata platform • Support use cases where other services access metadata programmatically through a GraphQL API ‒ Expose metadata (e.g. which tables are joined with which other tables most frequently) to BI SQL visualization tools ‒ Integrate with a data quality service to surface health scores and data quality information in Amundsen • Support hybrid (pull + push) metadata ingestion ‒ Build an SDK to push metadata to Amundsen either through the API or through Kafka
  • 76. Q & A
  • 77. Feedback Your feedback is important to us. Don’t forget to rate and review the sessions.