Velocity San Jose 2017: Traffic shifts: Avoiding disasters at scale


LinkedIn has evolved from serving live traffic out of one data center to four data centers spread geographically. Serving live traffic from four data centers at the same time has taken the company from a disaster recovery model to a disaster avoidance model, where an unhealthy data center can be taken out of rotation and its traffic redistributed to the healthy data centers within minutes, with virtually no visible impact to users. As LinkedIn transitioned from big monolithic applications to microservices, it became difficult to determine the capacity constraints of individual services when handling extra load during disaster scenarios. Stress testing individual services with artificial load in a complex microservices architecture wasn’t sufficient to provide enough confidence in a data center’s capacity. To solve this problem, LinkedIn leverages live traffic to stress services site-wide by shifting traffic to simulate a disaster load. Michael Kehoe and Anil Mallapur discuss how LinkedIn uses traffic shifts to mitigate user impact by migrating live traffic between its data centers, and to stress-test site-wide services for improved capacity handling and member experience.

Engineering

TrafficShift: Avoiding Disasters at Scale
Jeff Weiner
Chief Executive Officer
Michael Kehoe
Staff SRE
Anil Mallapur
Sr SRE

Today’s agenda
1 Introductions
2 Evolution of the Infrastructure
3 Planning for Disaster
4 LinkedIn Traffic-Tier
5 TrafficShift
6 Load Testing
7 Q&A

Key Takeaways
• Design infrastructure to facilitate disaster recovery
• Test regularly
• Automate everything

World’s largest professional network
• Largest global network of professionals: 500+M members
• Serving users worldwide: 200+ countries

Who are we?
PRODUCTION-SRE TEAM AT LINKEDIN
• Assist in restoring stability to services during site-critical issues
• Develop applications to improve MTTD and MTTR
• Provide direction and guidelines for site monitoring
• Build tools for efficient site-issue detection, correlation & troubleshooting

Terminologies
• Fabric/Colo: Data center with full application stack deployed
• PoP/Edge: Entry point to LinkedIn network (TCP/SSL termination)
• Load Test: Planned stress testing of data centers

Evolution of the Infrastructure
Timeline from 2003 to 2017 (years marked: 2003, 2010, 2011, 2013, 2014, 2017), with stages in order:
• Active & Passive
• Active & Active
• Multi-colo 3-way Active & Active
• Multi-colo n-way Active & Active

2017
• 4 Data Centers
• 13 PoPs
• 1000+ services

What are Disasters
• Service Degradation
• Infrastructure Issues
• Human Error
• Data Center on Fire

One Solution for all Disasters
• TrafficShift – Reroute user traffic to different datacenters without any user interruption.

LinkedIn Traffic-Tier
Diagram spanning EDGE and FABRIC. Components shown: Border Router, IPVS, ATS, ATS, Frontend, Stickyrouting.

LinkedIn Traffic-Tier
Diagram spanning EDGE and FABRIC. Components shown: ATS, Stickyrouting, DC1, DC2. Callouts:
• DC1 in Cookie
• Got DC2 as primary fabric
• Gets primary fabric for user

LinkedIn Traffic-Tier
Diagram: a fabric divided into buckets, numbered 1, 2, 3 ... 10 through 91, 92, 93 ... 100.

How does Stickyrouting assign users to a fabric?
• Capacity of a datacenter
• Geographic distance to users
• Hadoop
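The two inputs named on this slide (datacenter capacity and geographic distance) can be sketched in a few lines of Python. This is a minimal illustration only, not LinkedIn's implementation: the function names, the hash-based bucket choice, and the "closest fabric with spare capacity" rule are all assumptions made for the example. Only the 100 buckets per fabric come from the slides.

```python
import hashlib

NUM_BUCKETS = 100  # the slides show buckets numbered 1..100 in a fabric


def bucket_for(member_id: str) -> int:
    """Map a member to a stable bucket in 1..NUM_BUCKETS.

    Hashing is an assumption for this sketch; the slides do not say
    how members are placed into buckets.
    """
    digest = hashlib.sha256(member_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NUM_BUCKETS + 1


def pick_primary_fabric(distances_km: dict, remaining_capacity: dict) -> str:
    """Pick the closest fabric that still has spare capacity (hypothetical rule)."""
    candidates = [f for f in distances_km if remaining_capacity.get(f, 0) > 0]
    if not candidates:
        raise ValueError("no fabric has remaining capacity")
    # Ties on distance are broken by fabric name so the result is deterministic.
    return min(candidates, key=lambda f: (distances_km[f], f))
```

For example, a member 50 km from DC2 and 100 km from DC1 would be assigned DC2 under this rule, unless DC2 had no remaining capacity, in which case the member would fall back to DC1.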

Advantages of Stickyrouting
• Less latency
• Store data where needed
• Control over capacity

Site Traffic and Disaster Recovery
Diagram: EDGE distributing load across DC1, DC2, DC3 and DC4 (load labels shown: 30%, 50%, 50%, 10%), with DC1 and DC4 highlighted.
• Traffic stops being served to offline fabrics when we mark buckets offline
• Traffic is shifted to online fabrics as ATS redirects those users to their secondary fabric
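The behaviour described on this slide, where marking buckets offline sends the affected users to their secondary fabric, can be sketched as follows. This is an illustrative model under stated assumptions, not the actual ATS or Stickyrouting code: the `Assignment` and `BucketState` types and the `route` function are hypothetical names introduced for the example.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Assignment:
    """A user's routing record (hypothetical shape)."""
    primary: str
    secondary: str
    bucket: int


@dataclass
class BucketState:
    """Tracks which (fabric, bucket) pairs are currently marked offline."""
    offline: set = field(default_factory=set)

    def mark_offline(self, fabric: str, buckets) -> None:
        self.offline.update((fabric, b) for b in buckets)

    def mark_online(self, fabric: str, buckets) -> None:
        self.offline.difference_update((fabric, b) for b in buckets)


def route(assignment: Assignment, state: BucketState) -> str:
    """Return the fabric that should serve this user right now."""
    if (assignment.primary, assignment.bucket) in state.offline:
        return assignment.secondary
    return assignment.primary
```

Marking buckets 1 to 10 of DC1 offline in this model moves only users in those buckets, so traffic can be shifted gradually; marking all 100 buckets offline fails the whole fabric out.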

When to TrafficShift
• Impact Mitigation
• Planned Maintenance
• Stress Test

TrafficShift Architecture
Diagram components: Web application, Salt master, Stickyrouting Service, Couchbase Backend, Worker Processes. Labels: FABRIC, BUCKETS.

What is Load Testing?
• 3x a week
• Peak hour traffic
• Fixed SLA

Load Testing
Diagram: traffic percentage per FABRIC across DC1, DC2 and DC3, with 60% shown.
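The arithmetic behind raising one fabric to a target share of traffic can be sketched as below. This is a back-of-the-envelope model, not the TrafficShift tool's logic: it assumes every fabric has 100 equally loaded buckets, that traffic is taken proportionally from each other fabric, and that all displaced users land on the fabric under test. The function name and the baseline shares in the usage note are invented for illustration.

```python
def buckets_to_offline(baseline_pct: dict, target: str, target_pct: int) -> dict:
    """Return how many of each other fabric's 100 buckets to mark offline
    so that `target` serves roughly `target_pct` percent of site traffic.

    baseline_pct maps fabric name to its current integer share (summing to 100).
    """
    if not 0 <= target_pct <= 100:
        raise ValueError("target_pct must be between 0 and 100")
    if sum(baseline_pct.values()) != 100:
        raise ValueError("baseline shares must sum to 100")

    others = [f for f in baseline_pct if f != target]
    extra = target_pct - baseline_pct[target]   # share the target must gain
    available = 100 - baseline_pct[target]      # share held by other fabrics
    if extra <= 0 or available == 0:
        return {f: 0 for f in others}

    # Same fraction of buckets from every other fabric, rounded up.
    buckets = -(-extra * 100 // available)
    return {f: buckets for f in others}
```

With made-up baseline shares of 40% for DC1 and 30% each for DC2 and DC3, bringing DC1 to 60% would mean marking 34 buckets offline in each of DC2 and DC3 under these assumptions.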

Benefits of Load testing
• Capacity Planning
• Stress Test
• Identify Bugs
• Confidence

Big Red Button
• Kill-switch for a datacenter
• Failout of a datacenter & PoP in minutes
• Minimal user impact

Key Takeaways
• Design infrastructure to facilitate disaster recovery
• Stress test regularly to avoid surprises
• Automate everything to reduce time to mitigate impact
