SRECon-Americas-2017: Trafficshift: Avoiding disasters at scale

•Download as PPTX, PDF•

2 likes•608 views

LinkedIn has evolved from serving live traffic out of one data center to four data centers spread geographically. Serving live traffic from four data centers at the same time has taken us from a disaster recovery model to a disaster avoidance model where we can take an unhealthy data center out of rotation and redistribute its traffic to the healthy data centers within minutes, with virtually no visible impact to users As we transitioned from big monolithic application to micro-services, we witnessed pain in determining capacity constraints of individual services to handle extra load during disaster scenarios. Stress testing individual services using artificial load in a complex micro-services architecture wasn’t sufficient to provide enough confidence in a data center’s capacity. To solve this problem, we at LinkedIn leverage live traffic to stress services site-wide by shifting traffic to simulate a disaster load. This talk provide details on how LinkedIn uses Traffic Shifts to mitigate user impact by migrating live traffic between its data centers and stress test site-wide services for improved capacity handling and member experience.

Engineering

TrafficShift - Avoiding
Disasters at Scale
Michael Kehoe
Staff SRE
LinkedIn
Anil Mallapur
SRE
LinkedIn

Overview
LinkedIn Architectural Overview
Fabric Disaster Recovery
Questions

467+ million
members
World’s largest professional network
200+
Countries

Who are we ?
Production-SRE team at LinkedIn
● Assist in restoring stability to services
during site critical issues
● Developing applications to improve
MTTD and MTTR
● Provide direction and guidelines for site
monitoring
● Build tools for efficient site issue
troubleshooting, issue detection &
correlation

Terminologies
Fabric/Colo Data Center with full application stack deployed
PoP/Edge Entry point to LinkedIn network (TCP/ SSL
termination)
Load Test Planned stress testing of data centers

2003 2010 2011 2013 2014 2015
Active &
Passive
Active &
Active
Multi-colo 3-
way Active &
Active
Multi-colo n-
way Active &
Active

2017
4 Data
Centers
13 PoPs
1000+
services

What are Disasters ?
Service
Degradation
Infrastructure
Issues
Human Error
Data
Center on
Fire

One solution for all disasters
TrafficShift - Reroute user
traffic to different datacenters
without any user interruption.

Border
Router
IPVS ATS
EDGE
ATS Fronte
nd
FABRIC
Stickyrouting
Service
Internet

ATS
Request
Stickyrouting
Service
Gets primary
colo for user
If not
cookie in
header
DC1 in
cookie DC1
DC2
Got DC2 as primary
colo for user
FABRICEDGE

US-East
1 2 3 10
91 92 93 100
BUCKETS
FABRIC
Stickyrouting

How StickyRouting assigns users to a colo ?
Capacity of Fabric
Offline job to
assign colo
to users
Geographic
distance to
users

Advantages of sticky routing
Less latency for users
Store data where it’s necessary
Provides precise control over capacity allotment

When to TrafficShift ?
Impact
Mitigation
Planned
Maintenance
Stress Test

Site Traffic and Disaster Recovery
US-West
US-
Central
US-East APAC
EDGE
0%
Distributed Load
50%
Distributed Load
50%
Distributed Load
0%
Distributed Load
Traffic stops being
served to offline
fabrics
Traffic is shifted to online
fabrics

TrafficShift Architecture
Web application
Salt master
Stickyrouting
Service
Couchbase
Backend Worker
Processes
FABRIC
BUCKETS

What is Load Testing ?
3 times a week
Peak hour traffic
Fixed SLA
USW
Target Data
Center
USW

Load Testing
FABRIC
Target
US-West US-East
50%
Traffic
Percentage

Benefits of Load Test
Capacity Planning
Leverage production traffic
to stress test services
Identify bugs in
production
Confidence in Disaster
Recovery

Big Red Button
Kill switch (No Kidding)
Failout of a datacenter and PoP in less
than 10 minutes
Minimal user impact

Key Takeaways
● Design infrastructure to facilitate disaster
recovery
● Stress test regularly to avoid surprises
● Automate everything to reduce time to
mitigate impact

LinkedIn’s PoP Architecture
29
• Using IPVS - Each PoP announces a unicast address and a regional anycast
address
• APAC, EU and NAMER anycast regions
• Use GeoDNS to steer users to the ‘best’ PoP
• DNS will either provide users with an anycast or unicast address for
www.linkedin.com
• US and EU members is nearly all anycast
• APAC is all unicast

LinkedIn’s PoP DR
30
• Sometimes need to fail out of PoP’s
• 3rd party provider issues (e.g. transit links
going down)
• Infrastructure maintenance
• Withdraw anycast route announcements
• Fail healthchecks on proxy to drain unicast
traffic

Viewers also liked

CouchbasetoHadoop_Matt_Michael_Justin v4Michael Kehoe

SouthBay SRE Meetup Jan 2016Michael Kehoe

Dropbox 5 edu tech Victoria Haynes

2017 New College Application LandscapeRebecca Joseph

Aula fotografia digitalMarco Hovnanian

Andrea palladio jesica rivasjesica rivas

Vida na terraDavid Casado Bravo

Iirp presentation tommy st claire updateSt Claire Adriaan

Sistema DigestórioEliando Oliveira

The Librarian Reloaded: Evolving Roles in Health InformaticsIris Thiele Isip-Tan

The servicescore card - Gamifying Operational Excellence - SRECONDaniel ( Danny ) ☃ Lawrence

High Performance Magnolia with Anycast Routingbkraft

Service Redundancy and Traffic Balancing Using AnycastSean Jain Ellis

Software reliability tools and common software errorsHimanshu

The Analytics Data Store: Information Supply FrameworkMartyn Richard Jones

Perl在nginx里的应用琛琳饶

Routing for an Anycast CDNTom Paseka

How TPM saves the dayPooja Tangi

Viewers also liked (18)

CouchbasetoHadoop_Matt_Michael_Justin v4

SouthBay SRE Meetup Jan 2016

Dropbox 5 edu tech

2017 New College Application Landscape

Aula fotografia digital

Andrea palladio jesica rivas

Vida na terra

Iirp presentation tommy st claire update

Sistema Digestório

The Librarian Reloaded: Evolving Roles in Health Informatics

The servicescore card - Gamifying Operational Excellence - SRECON

High Performance Magnolia with Anycast Routing

Service Redundancy and Traffic Balancing Using Anycast

Software reliability tools and common software errors

The Analytics Data Store: Information Supply Framework

Perl在nginx里的应用

Routing for an Anycast CDN

How TPM saves the day

Recently uploaded

🔝9953056974🔝!!-YOUNG call girls in Rajendra Nagar Escort rvice Shot 2000 nigh...9953056974 Low Rate Call Girls In Saket, Delhi NCR

An experimental study in using natural admixture as an alternative for chemic...Chandu841456

CCS355 Neural Network & Deep Learning UNIT III notes and Question bank .pdfAsst.prof M.Gokilavani

Risk Assessment For Installation of Drainage Pipes.pdfROCENODodongVILLACER

Architect Hassan Khalil Portfolio for 2024hassan khalil

INFLUENCE OF NANOSILICA ON THE PROPERTIES OF CONCRETEroselinkalist12

TechTAC® CFD Report Summary: A Comparison of Two Types of Tubing Anchor Catcherssdickerson1

8251 universal synchronous asynchronous receiver transmitterShivangiSharma879191

Introduction to Machine Learning Unit-3 for II MECHC Sai Kiran

CCS355 Neural Network & Deep Learning Unit II Notes with Question bank .pdfAsst.prof M.Gokilavani

CCS355 Neural Networks & Deep Learning Unit 1 PDF notes with Question bank .pdfAsst.prof M.Gokilavani

Past, Present and Future of Generative AIabhishek36461

Why does (not) Kafka need fsync: Eliminating tail latency spikes caused by fsyncssuser2ae721

Artificial-Intelligence-in-Electronics (K).pptxbritheesh05

Software and Systems Engineering Standards: Verification and Validation of Sy...VICTOR MAESTRE RAMIREZ

young call girls in Rajiv Chowk🔝 9953056974 🔝 Delhi escort Service9953056974 Low Rate Call Girls In Saket, Delhi NCR

Gfe Mayur Vihar Call Girls Service WhatsApp -> 9999965857 Available 24x7 ^ De...srsj9000

Study on Air-Water & Water-Water Heat Exchange in a Finned Tube ExchangerAnamika Sarkar

Gurgaon ✡️9711147426✨Call In girls Gurgaon Sector 51 escort servicejennyeacort

computer application and construction managementMariconPadriquez1

Recently uploaded (20)

🔝9953056974🔝!!-YOUNG call girls in Rajendra Nagar Escort rvice Shot 2000 nigh...

An experimental study in using natural admixture as an alternative for chemic...

CCS355 Neural Network & Deep Learning UNIT III notes and Question bank .pdf

Risk Assessment For Installation of Drainage Pipes.pdf

Architect Hassan Khalil Portfolio for 2024

INFLUENCE OF NANOSILICA ON THE PROPERTIES OF CONCRETE

TechTAC® CFD Report Summary: A Comparison of Two Types of Tubing Anchor Catchers

8251 universal synchronous asynchronous receiver transmitter

Introduction to Machine Learning Unit-3 for II MECH

CCS355 Neural Network & Deep Learning Unit II Notes with Question bank .pdf

CCS355 Neural Networks & Deep Learning Unit 1 PDF notes with Question bank .pdf

Past, Present and Future of Generative AI

Why does (not) Kafka need fsync: Eliminating tail latency spikes caused by fsync

Artificial-Intelligence-in-Electronics (K).pptx

Software and Systems Engineering Standards: Verification and Validation of Sy...

young call girls in Rajiv Chowk🔝 9953056974 🔝 Delhi escort Service

Gfe Mayur Vihar Call Girls Service WhatsApp -> 9999965857 Available 24x7 ^ De...

Study on Air-Water & Water-Water Heat Exchange in a Finned Tube Exchanger

Gurgaon ✡️9711147426✨Call In girls Gurgaon Sector 51 escort service

computer application and construction management

SRECon-Americas-2017: Trafficshift: Avoiding disasters at scale

1. TrafficShift - Avoiding Disasters at Scale Michael Kehoe Staff SRE LinkedIn Anil Mallapur SRE LinkedIn

2. Overview LinkedIn Architectural Overview Fabric Disaster Recovery Questions

3. 467+ million members World’s largest professional network 200+ Countries

4. Who are we ? Production-SRE team at LinkedIn ● Assist in restoring stability to services during site critical issues ● Developing applications to improve MTTD and MTTR ● Provide direction and guidelines for site monitoring ● Build tools for efficient site issue troubleshooting, issue detection & correlation

5. Terminologies Fabric/Colo Data Center with full application stack deployed PoP/Edge Entry point to LinkedIn network (TCP/ SSL termination) Load Test Planned stress testing of data centers

6. 2003 2010 2011 2013 2014 2015 Active & Passive Active & Active Multi-colo 3- way Active & Active Multi-colo n- way Active & Active

7. 2017 4 Data Centers 13 PoPs 1000+ services

9. What are Disasters ? Service Degradation Infrastructure Issues Human Error Data Center on Fire

10. One solution for all disasters TrafficShift - Reroute user traffic to different datacenters without any user interruption.

11. Whaaaat ?

12. Border Router IPVS ATS EDGE ATS Fronte nd FABRIC Stickyrouting Service Internet

13. ATS Request Stickyrouting Service Gets primary colo for user If not cookie in header DC1 in cookie DC1 DC2 Got DC2 as primary colo for user FABRICEDGE

14. US-East 1 2 3 10 91 92 93 100 BUCKETS FABRIC Stickyrouting

15. How StickyRouting assigns users to a colo ? Capacity of Fabric Offline job to assign colo to users Geographic distance to users

16. Advantages of sticky routing Less latency for users Store data where it’s necessary Provides precise control over capacity allotment

17. When to TrafficShift ? Impact Mitigation Planned Maintenance Stress Test

18. Site Traffic and Disaster Recovery US-West US- Central US-East APAC EDGE 0% Distributed Load 50% Distributed Load 50% Distributed Load 0% Distributed Load Traffic stops being served to offline fabrics Traffic is shifted to online fabrics

19. TrafficShift Architecture Web application Salt master Stickyrouting Service Couchbase Backend Worker Processes FABRIC BUCKETS

20. What is Load Testing ? 3 times a week Peak hour traffic Fixed SLA USW Target Data Center USW

21. Load Testing FABRIC Target US-West US-East 50% Traffic Percentage

22. Benefits of Load Test Capacity Planning Leverage production traffic to stress test services Identify bugs in production Confidence in Disaster Recovery

23. Big Red Button Kill switch (No Kidding) Failout of a datacenter and PoP in less than 10 minutes Minimal user impact

24. Key Takeaways ● Design infrastructure to facilitate disaster recovery ● Stress test regularly to avoid surprises ● Automate everything to reduce time to mitigate impact

25. Questions

26.

27. Edge Failout

28. Edge Presence

29. LinkedIn’s PoP Architecture 29 • Using IPVS - Each PoP announces a unicast address and a regional anycast address • APAC, EU and NAMER anycast regions • Use GeoDNS to steer users to the ‘best’ PoP • DNS will either provide users with an anycast or unicast address for www.linkedin.com • US and EU members is nearly all anycast • APAC is all unicast

30. LinkedIn’s PoP DR 30 • Sometimes need to fail out of PoP’s • 3rd party provider issues (e.g. transit links going down) • Infrastructure maintenance • Withdraw anycast route announcements • Fail healthchecks on proxy to drain unicast traffic

SRECon-Americas-2017: Trafficshift: Avoiding disasters at scale

Recommended

Recommended

More Related Content

Viewers also liked

Viewers also liked (18)

More from Michael Kehoe

More from Michael Kehoe (17)

Recently uploaded

Recently uploaded (20)

SRECon-Americas-2017: Trafficshift: Avoiding disasters at scale