Yelp has operated a connector ecosystem to feed vital data to domain-specific teams and data stores. We share some of our lessons and experiences from operating such a system, and we touch on the next phase of the system's evolution.
3. Who am I?
My name is Steven; my preferred pronoun is “he”
I graduated from UC Berkeley EECS in 2005
This is my second term at Yelp (2017 - now)
My first term was 2011 - 2015
I consider myself a generalist in the field
4. Who am I?
I work on the metrics-data team within metrics-platform
6. Data powers decision making
OnLine Transaction Processing (OLTP)
We use MySQL to power yelp.com
Each transaction interacts with a small amount of data
Display reviews, photos, tips of a business
OLTP query results are expected to return quickly
No one wants to wait more than 2 seconds for a business page to load
7. OLTP example: find the titles an author has written. Take advantage of an index.
https://en.wikipedia.org/wiki/Library_catalog#/media/File:Schlagwortkatalog.jpg
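The card-catalog lookup can be sketched in code. This is a minimal illustration using Python's built-in sqlite3 module rather than MySQL; the `book` table, the `idx_book_author` index, and the sample rows are all made up for the example.

```python
import sqlite3

# In-memory database standing in for an OLTP store (hypothetical schema).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE book (id INTEGER PRIMARY KEY, author TEXT, title TEXT, pages INTEGER)"
)
conn.executemany(
    "INSERT INTO book (author, title, pages) VALUES (?, ?, ?)",
    [
        ("Author A", "Title 1", 100),
        ("Author B", "Title 2", 200),
        ("Author A", "Title 3", 300),
    ],
)
# The index plays the role of the card catalog: it maps author -> rows.
conn.execute("CREATE INDEX idx_book_author ON book (author)")


def titles_by(author):
    """Return the titles written by one author, touching only matching rows."""
    rows = conn.execute(
        "SELECT title FROM book WHERE author = ? ORDER BY title", (author,)
    )
    return [title for (title,) in rows]


# Ask SQLite how it would run the lookup; the last column describes each step.
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT title FROM book WHERE author = ?", ("Author A",)
).fetchall()
uses_index = any("idx_book_author" in row[3] for row in plan)

print(titles_by("Author A"))
print(uses_index)
```

The query plan reports a search through the index instead of a scan of the whole table, which is why this kind of lookup stays fast as the table grows.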
8. Data powers decision making
Developers want to find out which local business has the most reviews
Table scan on the review table?
OnLine Analytical Processing (OLAP)
Queries that scan a large fraction of the total data
Need a specialized system to support such queries
Yelp uses AWS Redshift as a data warehouse to support OLAP queries.
9. OLAP example: average number of pages in a book stored in the main stacks. Need to scan all the titles.
https://www.dailycal.org/2013/12/08/best-worst-foods-sneak-main-stacks/
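For contrast with the indexed lookup, here is a minimal sketch of the analytical query. It uses Python's built-in sqlite3 module only for illustration (not Redshift), and the `book` table and its rows are made up. The point is that an average over every row has no index to lean on, so the engine has to read the whole table.

```python
import sqlite3

# In-memory database with a hypothetical book table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE book (id INTEGER PRIMARY KEY, author TEXT, title TEXT, pages INTEGER)"
)
conn.executemany(
    "INSERT INTO book (author, title, pages) VALUES (?, ?, ?)",
    [
        ("Author A", "Title 1", 100),
        ("Author B", "Title 2", 200),
        ("Author A", "Title 3", 300),
    ],
)

# The aggregate needs every row's page count.
(average_pages,) = conn.execute("SELECT AVG(pages) FROM book").fetchone()

# The query plan describes a scan of the table rather than an index search.
plan = conn.execute("EXPLAIN QUERY PLAN SELECT AVG(pages) FROM book").fetchall()
is_full_scan = any(row[3].startswith("SCAN") for row in plan)

print(average_pages)
print(is_full_scan)
```

On a transactional store this kind of query competes with user-facing traffic, which is one reason to run it on a separate warehouse.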
12. Data Fabric
We want to avoid n * m programs to transport data
n is the number of sources, and m is the number of sinks
Domain-specific data stores are here to stay
Stonebraker, “One Size Fits All”: An Idea Whose Time Has Come and Gone
Stream-Table Duality
We can formulate the transport of data as streams
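Stream-table duality can be sketched in a few lines. This is an illustrative model only, not the connector implementation: a stream is a list of (key, value) change events, a value of None is assumed to mean "delete", and the business records are made up.

```python
from typing import Dict, Iterable, List, Optional, Tuple

# A change event: (key, new_value). A value of None marks a delete (assumption).
Event = Tuple[str, Optional[dict]]


def materialize(stream: Iterable[Event]) -> Dict[str, dict]:
    """Stream -> table: replay change events in order, keeping the latest value per key."""
    table: Dict[str, dict] = {}
    for key, value in stream:
        if value is None:
            table.pop(key, None)
        else:
            table[key] = value
    return table


def to_stream(table: Dict[str, dict]) -> List[Event]:
    """Table -> stream: emit one upsert event per row."""
    return [(key, value) for key, value in table.items()]


changelog: List[Event] = [
    ("biz-1", {"name": "Cafe A", "reviews": 1}),
    ("biz-2", {"name": "Diner B", "reviews": 5}),
    ("biz-1", {"name": "Cafe A", "reviews": 2}),  # update overwrites the earlier row
    ("biz-2", None),                              # delete removes the row
]

table = materialize(changelog)
round_trip = materialize(to_stream(table))
print(table)
```

Because any table can be turned into a stream and back, each source only needs a program that writes to the stream and each sink one that reads from it, which is n + m programs instead of n * m.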
17. Benefits: Connector Ecosystem
Lower the barrier to entry
It’s easy to move data between data stores
High-performance implementation
Each data store has its own performance characteristics.
Stream processing over batch processing
Near real-time data availability
19. Lessons Learned: Connector Ecosystem
Schematized data is good
Lessens the likelihood of malformed data
Schema evolution can be difficult
Incompatible schema changes can break many things. Discourage them in the registration phase.
Decouple data producers and data consumers
We need automation to inform data producers how to manage the data life cycle, as producers do not think about who uses the data.
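Discouraging incompatible changes at registration time could look roughly like the sketch below. The schema representation (a dict of field name to type and required flag), the three rules, and the `register` function are assumptions made for illustration; they are a simplified stand-in, not the rules of any particular schema registry.

```python
from typing import Dict, List

# Hypothetical schema shape: field name -> {"type": str, "required": bool}
Schema = Dict[str, dict]


def incompatible_changes(old: Schema, new: Schema) -> List[str]:
    """List the changes in `new` that could break consumers written against `old`."""
    problems: List[str] = []
    for name, spec in old.items():
        if name not in new:
            problems.append(f"field removed: {name}")
        elif new[name]["type"] != spec["type"]:
            problems.append(
                f"type changed: {name} ({spec['type']} -> {new[name]['type']})"
            )
    for name, spec in new.items():
        if name not in old and spec.get("required", False):
            problems.append(f"new required field: {name}")
    return problems


def register(old: Schema, new: Schema) -> Schema:
    """Accept the new schema only if it introduces no incompatible change."""
    problems = incompatible_changes(old, new)
    if problems:
        raise ValueError("; ".join(problems))
    return new


v1: Schema = {
    "business_id": {"type": "string", "required": True},
    "stars": {"type": "int", "required": True},
}
# Adding an optional field is accepted.
v2: Schema = dict(v1, text={"type": "string", "required": False})
# Changing a type and dropping a field are both rejected.
v3: Schema = {"business_id": {"type": "int", "required": True}}

accepted = register(v1, v2)
rejected = incompatible_changes(v1, v3)
print(rejected)
```

Rejecting the change before any data is produced is cheaper than finding out from a broken downstream job.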
21. Desirable Improvements
Data producers should own their data life cycle
A specific connector owner does not have visibility into data semantics.
Data consumers are stakeholders
Consumers don’t want to find out about incompatible changes after they’ve been rolled out.
Self-serve mechanisms accelerate change
The only way to evolve rapidly is to self-serve
22. Data Mesh
Data specifications are like microservice APIs
They are contracts between producers and consumers
Each team owns its data specifications
To avoid accidental abstraction leakage
Decentralization allows rapid experiments
Common conventions are promoted to minimize friction among different domain systems
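One way to picture a data specification as a contract is the sketch below. The `DataSpec` class, its fields, and the team and dataset names are hypothetical; the idea is only that the owning team publishes the spec and both producers and consumers can check records against it.

```python
from dataclasses import dataclass
from typing import Any, Dict, List


@dataclass(frozen=True)
class DataSpec:
    """A hypothetical data specification published by the team that owns the data."""

    name: str
    owner: str
    version: int
    fields: Dict[str, type]

    def violations(self, record: Dict[str, Any]) -> List[str]:
        """Return the ways a record fails to honor the contract."""
        problems: List[str] = []
        for field_name, expected in self.fields.items():
            if field_name not in record:
                problems.append(f"missing field: {field_name}")
            elif not isinstance(record[field_name], expected):
                problems.append(
                    f"wrong type for {field_name}: expected {expected.__name__}"
                )
        return problems


# The owning team (made-up name) declares what consumers may rely on.
review_spec = DataSpec(
    name="review",
    owner="team-reviews",
    version=1,
    fields={"business_id": str, "stars": int},
)

good = review_spec.violations({"business_id": "biz-1", "stars": 4})
bad = review_spec.violations({"business_id": "biz-1", "stars": "four"})
print(good, bad)
```

As with a service API, consumers depend on the published spec and not on the producer's internal tables, which is what keeps abstractions from leaking.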
24. yelp.com/dataset_challenge
Academic dataset from 10 cities across the globe!
Your academic project, research or visualizations submitted by December 31, 2019 = a $5,000 prize*!
*See full terms on website
6M reviews
1M business attributes
190K businesses
200K photos