6. • My first project is to analyze and predict Strata Attendance
• Where is the data?
• What does it mean?
Hi! I am a n00b Data Scientist!
7
7. • Option 1: Phone a friend!
• Option 2: Github search
Status quo
8
8. • What does this field mean?
‒ Does attendance data include employees?
‒ Does it include revenue?
• Let me dig in and understand
Understand the context
9
10. Exploring with SELECT * is EVIL
1. Lack of productivity for data scientists
2. Increased load on the databases
11
11. Data Scientists spend upto 1/3rd time in Data Discovery...
12
• Data discovery
‒ Lack of
understanding of
what data exists,
where, who owns it,
who uses it, and how
to request access.
13. Data Discovery - User personas
14
Data Modelers Analysts Data Scientists General
Managers
Data Platform
Engineers ExperimentersProduct
Managers
14. 3 Data Scientist personas
Power user
● All info in their head
● Get interrupted a lot
due to questions
● Lost
● Ask “power users” a
lot of questions
● Dependencies
landing on time
● Communicating with
stakeholders
Noob user Manager
15. Search based Lineage based Network based
Where is the
table/dashboard for X?
What does it contain?
I am changing a data
model, who are the owner
and most common users?
I want to follow a power
user in my team.
Does this analysis already
exist?
This table’s delivery was
delayed today, I want to
notify everyone
downstream.
I want to bookmark tables of
interest and get a feed of
data delay, schema change,
incidents.
Data Discovery answers 3 kinds of questions
17. Compared various existing solutions/open source projects
Criteria / Products Alation Where
Hows
Airbnb
Data
Portal
Cloudera
Navigator
Apache
Atlas
Search based
Lineage based
Network based
Hive/Presto support
Redshift support
Open source (pref.)
37. 38
Postgres Hive Redshift ... Presto
Github
Source
File
Databuilder Crawler
Neo4j
Elastic
Search
Metadata Service Search Service
Frontend ServiceML
Feature
Service
Security
Service
Other Microservices
Metadata Sources
38. 39
2. Metadata Service
• A thin proxy layer to interact with graph database
‒ Currently Neo4j is the default option for graph backend engine.
‒ Work with the community to support Apache Atlas
• Support Rest API for other services pushing / pulling metadata directly
46. 47
Postgres Hive Redshift ... Presto
Github
Source
File
Databuilder Crawler
Neo4j
Elastic
Search
Metadata Service Search Service
Frontend ServiceML
Feature
Service
Security
Service
Other Microservices
Metadata Sources
47. 3. Search Service
• Support REST API for building indexes
• A thin proxy layer to interact with the search backend
‒ Currently it supports Elasticsearch as backend.
• Support different search patterns
‒ Normal Search: match records based on relevancy
‒ Category Search: match records first based on data type, then relevancy
‒ Wildcard Search
48
49. How to make the search result more relevant?
50
• Define a search quality metric
‒ Click-Through-Rate (CTR) over top 5 results
• Search behaviour instrumentation is key
• Couple of improvements:
‒ Boost the exact table ranking
‒ Support wildcard search
‒ Support category search (e.g. “column: is_line_ride”)
54. Metadata - Challenges
• Standardization: No single data model that fits for all data resources
‒ A data resource could be a table, an Airflow DAG or a dashboard
• Extraction: Each data set metadata is stored and fetched differently,
‒ Hive Table: Stored in Hive metastore
‒ RDBMS(postgres etc): Fetched through DBAPI interface
‒ Github source code: Fetched through git hook
‒ Mode dashboard: Fetched through Mode API
‒ …
55
56. Pull model vs. Push model
57
Pull Model Push Model
● Periodically update the index by pulling from
the system (e.g. database) via crawlers.
● The system (e.g. database) pushes
metadata to a message bus which
downstream subscribes to.
Crawler
Database Data graph
Scheduler
Database Message
queue
Data graph
62. Amundsen seems to be more useful than what we thought
• Tremendous success at Lyft
‒ Used by Data Scientists, Engineers, PMs, Ops, even Cust. Service!
• Many organizations have similar problems
‒ Collaborating with ING, WeWork and more
‒ We plan to announce open source soon
65
63. Impact - Amundsen at Lyft
66
Beta release
(internal)
Generally Available
(GA) release
Alpha release
64. Adding more kinds of data resources
PeopleDashboardsData sets
Phase 1
(Complete)
Phase 2
(In development)
Phase 3
(In Scoping)
Streams Schemas Workflows
65. Serving more metadata about existing resources
Application Context
Existence, description, semantics, etc.
Behavior
How data is created and used over time
Change
How data is changing over time
67. Summary
• Data Discovery making data scientists unproductive
• 3 types of data discovery - search, lineage and network based
• Amundsen: Data graph for all data
• Blog post with more details: go.lyft.com/datadiscoveryblog
70
68. Mark Grover | @mark_grover
Tao Feng | @feng-tao
Blog post at go.lyft.com/datadiscoveryblog
Icons under Creative Commons License from https://thenounproject.com/
71
Today’s agenda:
Why empowering with data is important…
What are we doing in the data team at Lyft (context)...
What challenges we are facing and have seen other companies face…
How are we solving the problem...
At the core of it, we will primarily talk about the Data Discovery solution we are building and how we thought about the use case, solution, and the architecture.
Who is our audience: everyone who works at Lyft…
Power users: Data Scientists, Research Scientists, Product Managers…
Next: Engineers, GMs, Ops, etc.
What do we do: Wide scope of work… de-mystify and democratize data at Lyft…
Not mentioned here: Operational / transactional databases...
What does the architecture for our core infra look like?
Mobile application primarily…
Raw events can come either from the client… or from the back end events triggered in the server… the data comes to our message bus… Kinesis/Kafka and then with light ELTing written to S3 where it persists… today we keep all the data in archival… then we develop data models and transform raw events to tables in Hive. We use Hive from long running queries and Presto for interactive queries… People build dashboards on top of Hive and visualize for exploratory analysis in Presto...
Mark
Mark
Mark
Mark
Data Discovery: How much of a challenge is it? Significant challenge…
Data Scientists spend up to 1/3rd of their time in Data Discovery while doing exploratory analysis…
We surveyed users at Lyft and a few other companies: You’d want to spend most of the time on analysis…
But we have ~10PBs of data, thousands of tables… so it is hard to find what is there and what is the source of truth…
We can significantly increase productivity and impact if we can reduce this time...
Mark
Mark
Popularity is not click through rate but through query access patterns.
Amundsen architecture at Lyft:
3 micro-services(FE, metadata, search) and one generic data ingestion framework
Will discuss each of the component in details
High level walkthrough….how CCPA compilance works
What options we have (graph, rdbms,
ML features (one sentence on what is feature service)
Add a logo of neo4j
? mention about github, HMS, neo4j backend for description, trade-off, initial version propagate the metadata back, but we found that full-rebuild table doesn’t work in this case. ? what if user modify github source file
Why graph db is the best option? Why not rdbms, nosql etc
Why choose neo4j vs other graph db? (most popular graph db)
Amundsen needs to provide metadata which includes table, column, column statistics, usage information etc. Along with that, Amundsen also needs to provide lineage information where it need to be able to provide producer, consumer relationships within the life cycle of data.
Lineage could be simplified as a graph of entities and edges. E.g in the graph blahblah
There are other options: NoSQL(no join support), RDBMS(performance of join is not good)
Above graph data model shows our use case to show table and column metadata with usage to the column level. Querying a table detail from this Neo4j graph would be like asking Neo4j to search for a table node as a starting node and traverse it. In other words, there’s only one search operation needed to find anchor node which makes Neo4j performant -- no join operation at all.
We model the graph as bi-direction relationships.
? mention about github, HMS, neo4j backend for description, trade-off, initial version propagate the metadata back, but we found that full-rebuild table doesn’t work in this case. ? what if user modify github source file
? mention about github, HMS, neo4j backend for description, trade-off, initial version propagate the metadata back, but we found that full-rebuild table doesn’t work in this case. ? what if user modify github source file
? challenge
How to improve search relevance?
Think of search algorithm…
Event_ride_table -> event ride table
Talk about how we measure it: instrument empty results from user, % of click through rate
What we did? Boost the rank of exact table name match
Walk through pros and cons
Pull approach is basically extract data from the source periodically. Amundsen databuilder will be responsible for extraction, transformation, and load. This naturally gives us three abstracted construct Extractor, Transformer, Loader and, optionally Publisher. The design principle follows Apache Gobblin
Extractor extracts record from source one record at a time. For example in Hive, we would need a column metadata extractor for a table where each record represents a column of a table.
Transformer transforms a record. Any use case that we may have to transform (e.g: remove special character) or decorate the record (e.g: make a service call to enrich data). This is a place for that.
Loader writes data into either sink (destination) or into staging area.
Publisher assumes that loader loaded into staging area and publishes it to destination. Atomicity is a desired behavior but it’s up to the limitation of sink itself’s support on Atomicity.
We currently index 2 a day, dependency
Why we index twice a day?
Today’s agenda:
Why empowering with data is important…
What are we doing in the data team at Lyft (context)...
What challenges we are facing and have seen other companies face…
How are we solving the problem...
At the core of it, we will primarily talk about the Data Discovery solution we are building and how we thought about the use case, solution, and the architecture.
A slide on Amundsen @ Lyft?
? how long it has been in prod
How many datasets
Users
WAU
Usage
Kafka topic
Schema registry
ML workflow and Airflow DAGs
Three long term type of metadata(A.B.C)
We want to use the index A,B,C for the datasources we mentioned in last slide