3. goals
• we operate in a bunch of markets
• aim to be globally distributed
• our users should never see stale content
• our developers should be able to design an application resilient to replication delay
5. our toolset
• a small set of moving parts
• enables us to do more with fewer shards
• masks the geographic traffic split from users and developers
• enhanced tolerance to replication delay
• ability to:
  – perform online replication hierarchy changes
  – batch-load data
7. cookies
• give the client a short-lived “dirty session” cookie
• encode the time of the latest interaction between you and them
• expire or ignore the cookie after the replicas have caught up
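The cookie scheme above can be sketched as follows. This is a minimal illustration, not the actual implementation: the function names, the cookie encoding (a plain timestamp), and the `DIRTY_TTL` bound are all assumptions for the example.

```python
import time

DIRTY_TTL = 5.0  # assumed upper bound on replication delay, in seconds

def make_dirty_cookie(now=None):
    """Record the time of the user's last write as the cookie value."""
    return str(now if now is not None else time.time())

def is_dirty(cookie_value, replica_caught_up_to, now=None):
    """The session is still 'dirty' if the cookie is fresh and the
    slowest replica has not yet replayed past the user's last write."""
    if cookie_value is None:
        return False
    last_write = float(cookie_value)
    now = now if now is not None else time.time()
    if now - last_write > DIRTY_TTL:
        return False  # cookie expired: assume the replicas caught up
    return replica_caught_up_to < last_write
```

While `is_dirty` returns true, the user's reads must be pinned to somewhere that already has their write; once it returns false, any replica will do.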
8. request routing
• the load balancer:
  – POST?
  – GET? → check for the cookie
  – routes the request into the appropriate datacenter
  – adds headers to requests
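A sketch of that routing decision: writes, and reads carrying a dirty-session cookie, go to the master datacenter; everything else can be served locally. The `Request` shape, cookie name, and header name here are invented for the example.

```python
from dataclasses import dataclass, field

@dataclass
class Request:
    method: str
    cookies: dict = field(default_factory=dict)
    headers: dict = field(default_factory=dict)

def route(request, master_dc, local_dc):
    """Hypothetical load-balancer routing rule."""
    if request.method == "POST":
        dc = master_dc              # writes always go to the master DC
    elif request.cookies.get("dirty_session"):
        dc = master_dc              # dirty reads pinned to fresh data
    else:
        dc = local_dc               # clean reads stay geographically close
    request.headers["X-Routed-DC"] = dc  # annotate the request for the webapp
    return dc
```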
9. tradeoffs
• users get read-after-write consistency
• routing a user’s request between datacenters increases latency
• getting it wrong: increased load on the master database
10. tradeoffs
• we need to be assured that a user’s request falls back to a datacenter that has all of their data
11. replication delay
• we need a clear picture of it
• never underestimate replication delay; always overestimate
12. Seconds_Behind_Master
• made of lies (for this purpose)
• underestimates most of the time
• overestimates some of the time
http://bugs.mysql.com/bug.php?id=66921
22. time is hard
http://bugs.mysql.com/bug.php?id=48326
23. repl_delay_reporter
• aggregates heartbeat information
• provides it to the webapp
• determines when to expire the dirty session cookie
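The core of that aggregation is taking the minimum across replicas: the most-lagged replica bounds when a write is visible everywhere. A one-function sketch, with an invented input shape:

```python
def lowest_heartbeat(replica_heartbeats):
    """Given the latest heartbeat observed on each replica
    (e.g. {"replica-1": 105, "replica-2": 98}), return the lowest.
    A write at master position p is visible on ALL replicas only
    once this value has reached p."""
    return min(replica_heartbeats.values())
```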
24. operations
• wait for replication:
  – “I inserted some data; when will it be available on all replicas?”
• throttle to replication:
  – “I want to bulk-insert data. Will doing so cause too much replication delay?”
25. wait for replication
• insert some data
• ask the master database “what’s the heartbeat right now?”
• ask the repl_delay_reporter “what’s the lowest heartbeat right now?”
• wait a bit
• loop until the lowest heartbeat exceeds the original master heartbeat
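The loop above can be sketched like this. `master.heartbeat()` and `reporter.lowest_heartbeat()` are assumed interfaces standing in for the real master query and the repl_delay_reporter; the poll interval and timeout are illustrative.

```python
import time

def wait_for_replication(master, reporter, poll_interval=0.5, timeout=30.0):
    """Block after a write until every replica has replayed past the
    master's current position."""
    target = master.heartbeat()          # master position at/after our insert
    deadline = time.monotonic() + timeout
    while reporter.lowest_heartbeat() < target:
        if time.monotonic() > deadline:
            raise TimeoutError("replicas did not catch up in time")
        time.sleep(poll_interval)        # wait a bit, then re-check
```

Note the loop compares two database heartbeat positions against each other; the only wall clock involved is the local one, used monotonically for the timeout, which matches the “relies on only one clock, and only for monotonicity” property.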
26. wait for replication
• determines when to expire the dirty session cookie
• relies on only one clock, and only for monotonicity
• used heavily by batches
  – provides read-after-write consistency
27. throttle to replication
• prevents batches from causing excessive replication delay
• operates before the beginning of each transaction
  – batches ask “is replication delay low enough for me to write right now?”
• batches are required to keep their transactions reasonably sized
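A sketch of the pre-transaction check, assuming a hypothetical `reporter.delay()` that returns current replication delay in seconds; the threshold and backoff values are made up for illustration.

```python
import time

MAX_DELAY = 2.0  # assumed acceptable replication delay, in seconds

def throttled_write(reporter, do_transaction, backoff=1.0):
    """Before each batch transaction, back off while the replicas are
    too far behind, then run the (reasonably sized) transaction."""
    while reporter.delay() > MAX_DELAY:
        time.sleep(backoff)  # let the replicas catch up
    do_transaction()
```

Because the check happens before every transaction, a batch self-throttles under replication pressure instead of piling more writes onto an already-lagging hierarchy.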
29. what this gets us
• batch data can reside on the same shards that serve OLTP requests
• support for databases with heterogeneous SLAs
• automatic load-shedding when there is a replication issue
30. what this gets us
• shunting of nearly ALL reads and reporting off of the master
• better mileage out of the Percona Toolkit
• online replication hierarchy changes