• Prior to 2010: running out of Chicago only (ECH3)
• 2010: completion of our second production data center (ELA4); formal disaster recovery strategy
• 2011: the passive site was not serving traffic; maintaining active/passive was not easy, and recovery from a true disaster would not have been easy
• 2013: built LVA1; re-architected services to be (mostly) multi-master; invested in how to recover quickly from a disaster or service outage
• 2014: started multi-colo load testing; Single Master Failover 1; built LTX1
• 2015: Single Master Failover 2; started LSG1
• 2016: ramp LSG1; LOR1, our next-generation data center design
Edge (PoP) shifts. LinkedIn currently operates 12 PoPs around the world, with more on the way, that help improve page load times for our users. These PoPs give LinkedIn more flexibility about where a user enters our network and also give us added redundancy in the case of an outage. We work with our DNS providers to direct users to an appropriate PoP or to alter the flow of traffic to each PoP. See Ritesh Maheshwari's post for more details on this approach.
Data Center Load Shifts. From each PoP, we direct traffic to a specific data center. Logged-in users are assigned to a specific data center by default, but during a traffic shift we can instruct the PoPs to reroute any portion of traffic to one or more different data centers.
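The bucket-based assignment behind these shifts can be sketched roughly as follows. The bucket count, data center names, and modulo assignment are all hypothetical; LinkedIn's in-house sticky-routing service is more involved.

```python
# Hypothetical sketch of sticky bucket routing. Bucket count, DC names, and
# the modulo assignment are illustrative; the real in-house service differs.

NUM_BUCKETS = 1000

def assign_bucket(member_id: int) -> int:
    """Deterministically ("stickily") map a member to a routing bucket."""
    return member_id % NUM_BUCKETS

# Each bucket carries a primary/secondary data center assignment.
assignments = {b: ("LVA1", "LTX1") for b in range(NUM_BUCKETS)}

def route(member_id: int, offline_dcs=frozenset()) -> str:
    """Send the member to the bucket's primary DC unless it is offline."""
    primary, secondary = assignments[assign_bucket(member_id)]
    return secondary if primary in offline_dcs else primary

print(route(12345))                        # primary DC for this member's bucket
print(route(12345, offline_dcs={"LVA1"}))  # falls back to the secondary DC
```

Because the member-to-bucket mapping is deterministic, a user keeps landing in the same data center until an operator changes the bucket's online/offline state.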
Single Master Failovers. Some of our legacy services have not been fully migrated to a multi-data-center architecture and operate in single-master mode in one data center. This includes both user-facing services and back-end services whose traffic may not be directly related to user page views. When performing maintenance, addressing site issues, or exploring capacity issues, we must also take these single-master services into account. Although some of these legacy services require special attention in these situations, many have been converted to a "fast-failover" mechanism that allows us to switch masters between data centers in seconds, with no downtime. Being able to move mastership around at will also lets us balance that part of the load between data centers.
Edge
• LinkedIn currently has 12 PoPs around the world
• These help improve page load times for our users
• Give flexibility on where a user enters our network
• Give us extra redundancy
• See Ritesh Maheshwari's blog post
Fabric
• We assign users to a specific datacenter, but during a traffic shift we can instruct the PoPs to reroute users to other datacenters to increase the load
Single Master Failovers
Some older, more complicated services have not been fully migrated to a multi-datacenter architecture; they operate in single-master mode in one datacenter. These services may not be directly related to user page views but are still important to the running of the site. We have converted most of these services to fast-failover, which allows us to change mastership between datacenters in seconds with no downtime.
• To mitigate user impact from problems with a 3rd party provider or LinkedIn's infrastructure
• To validate Disaster Recovery (DR) in case of any datacenter failure
• To validate and test capacity headroom across our datacenters
• To expose bugs and suboptimal configurations by load testing one or more datacenters
• To perform planned maintenance
• To validate and exercise the traffic shift automation
The traffic shifting process is orchestrated by an internally developed system designed to make load shifts hands-off.
The portal gives a holistic view of the site
The following graphs, captured last night, illustrate the traffic-shift process.
Here we see the number of buckets online for each data center. At roughly 6PM, we progressively marked 100 buckets offline, and then ramped them back online gradually.
The following graph shows actual measured request percentage for one segment of our traffic. The pattern corresponds to the bucket graph, and shows traffic going to zero in one data center, and being redistributed to the other two.
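The redistribution shown in that graph comes down to simple arithmetic over the bucket assignments. A rough sketch, with hypothetical bucket counts and data center names:

```python
# Illustrative arithmetic for how traffic share redistributes when one data
# center's buckets go offline (names and counts are hypothetical).

def traffic_share(buckets, offline_dcs=frozenset()):
    """buckets: list of (primary_dc, secondary_dc) pairs, one per bucket.
    A bucket's traffic lands on its primary DC unless that DC is offline,
    in which case it falls back to the secondary. Returns dc -> fraction."""
    counts = {}
    for primary, secondary in buckets:
        dc = secondary if primary in offline_dcs else primary
        counts[dc] = counts.get(dc, 0) + 1
    return {dc: n / len(buckets) for dc, n in counts.items()}

# 300 equal-weight buckets spread across three DCs, secondaries alternating.
dcs = ["LVA1", "LTX1", "ELA4"]
buckets = [(dcs[i % 3], dcs[(i + 1) % 3]) for i in range(300)]
print(traffic_share(buckets))                        # roughly a third each
print(traffic_share(buckets, offline_dcs={"LVA1"}))  # LVA1 drops to zero;
                                                     # the others absorb it
```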
Our ability to offline a data center with as little member impact as possible is one of our top priorities. We perform weekly load tests to validate the process and confirm we can offline a colo successfully with minimal member impact. We load test by shifting a percentage of traffic onto a targeted data center and evaluating whether it can sustain the load.
You simply schedule a load test and the system does the rest.
A load test is preceded by a series of email notifications starting several hours prior to shifting traffic. At the designated time, the system starts shifting traffic to the targeted data center by offlining buckets from the remaining data centers.
The manipulation of these buckets is facilitated by underlying libraries that interface with our “sticky routing” service developed in-house. The system has a feedback loop that uses our alerting system to check for any errors potentially triggered by the traffic shift. If an alert is detected, the traffic shift automatically halts and issues notifications, allowing an engineer to manually inspect the reason for the alert and determine whether it is safe to proceed.
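The shift-with-feedback loop described above might look something like this sketch, where `offline_bucket`, `check_alerts`, and `notify` are hypothetical stand-ins for the internal sticky-routing and alerting systems:

```python
# Hedged sketch of the automated shift loop with an alert feedback check.
# offline_bucket, check_alerts, and notify are hypothetical stand-ins for
# the internal systems described in the text.

def run_traffic_shift(buckets, offline_bucket, check_alerts, notify, step=10):
    """Offline buckets in small batches; halt and notify if an alert fires."""
    taken_offline = []
    for i in range(0, len(buckets), step):
        for bucket in buckets[i:i + step]:
            offline_bucket(bucket)
            taken_offline.append(bucket)
        alerts = check_alerts()  # feedback loop against the alerting system
        if alerts:
            # Halt so an engineer can inspect and decide whether to proceed.
            notify(f"Shift halted after {len(taken_offline)} buckets: {alerts}")
            return False, taken_offline
    return True, taken_offline
```

The key design point is that the automation checks for alerts after every batch, not just at the end, so a bad shift stops within one step of causing impact.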
The stress-test period begins once the system has successfully redirected the desired volume of traffic to the targeted data center. It typically lasts about 90 minutes, during which we observe the impact the extra load has on the data center.
If impact is detected, we immediately rebalance the load and begin investigating its source. We then work with service owners to review their systems and determine a solution. If the stress-test period completes without impact, the system rebalances traffic and the load test is considered complete once the rebalance finishes.
Our single master mechanism relies on Apache ZooKeeper to maintain the status of all single master services. On startup, all instances within a cluster of single master services check the value of a cluster master node in ZooKeeper. Each service determines whether or not it is a master based on the value stored in the cluster master node. All services also establish a watch on the cluster master node. When services accept master status, they create an ephemeral node in ZooKeeper that acts as a lockfile.
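The startup behavior described above can be simulated in a few lines. Here an in-memory dict stands in for ZooKeeper; real ZooKeeper session handling, ephemeral-node cleanup, and watch-delivery guarantees are deliberately omitted.

```python
# Toy in-memory simulation of the single-master pattern. A dict stands in
# for ZooKeeper; sessions and ephemeral-node expiry are omitted.

class FakeZooKeeper:
    def __init__(self):
        self.nodes = {}    # path -> value
        self.watches = {}  # path -> list of one-shot callbacks

    def set(self, path, value):
        self.nodes[path] = value
        for cb in self.watches.pop(path, []):  # ZK watches fire once
            cb(path, value)

    def get(self, path, watch=None):
        if watch:
            self.watches.setdefault(path, []).append(watch)
        return self.nodes.get(path)

class SingleMasterService:
    MASTER_PATH = "/cluster/master"

    def __init__(self, name, zk):
        self.name, self.zk = name, zk
        self._refresh()

    @property
    def is_master(self):
        return self._master == self.name

    def _refresh(self):
        # Read the cluster-master node and (re-)establish the watch on it.
        self._master = self.zk.get(self.MASTER_PATH, watch=self._on_change)
        if self.is_master:
            # Accepting mastership: create the lock node (ephemeral in real ZK).
            self.zk.set(f"/cluster/lock/{self.name}", "held")

    def _on_change(self, path, value):
        self._refresh()

zk = FakeZooKeeper()
a = SingleMasterService("dc-LVA1", zk)
b = SingleMasterService("dc-LTX1", zk)
zk.set(SingleMasterService.MASTER_PATH, "dc-LVA1")  # elect LVA1
zk.set(SingleMasterService.MASTER_PATH, "dc-LTX1")  # fast failover to LTX1
print(a.is_master, b.is_master)  # False True
```

Because every instance watches the cluster-master node, flipping that single value is all a failover takes; each instance re-evaluates its own mastership when the watch fires.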
We can perform a single master failover with a command-line tool that handles the workflow and all communication with the ZooKeeper instances.
But in a disaster scenario, it can be useful to have an easy-to-use interface to this functionality, as well as a visual overview of the system state.
This interface allows an engineer to fail over all of our fast-failover-enabled single-master services with one click. The interface also integrates the LiX- and ZooKeeper-based mechanisms into a single location.
SouthBay SRE Meetup Jan 2016
Senior Site Reliability Engineer
LinkedIn Traffic Shifting
• Sr Site Reliability Engineer (SRE)
• Member of PROD-SRE
What is a Traffic Shift?
• Edge (PoP) shift
• Datacenter Load shift
• Single Master Failovers
Why do we do traffic shifts?
• To mitigate user impact from problems with a 3rd party provider or LinkedIn's infrastructure/services
• To validate Disaster Recovery (DR) in case of any datacenter failure
• To validate and test capacity headroom across our datacenters
• To expose bugs and suboptimal configurations by load testing one or more datacenters
• To perform planned maintenance
• To validate and exercise the traffic shift automation
Edge Traffic shifts
How does it work?
• We use IPVS to load balance at our edges
• We can withdraw anycast routes to remove traffic from that PoP
• Health checks on our edge proxy are tested by DNS providers to verify whether that PoP is in rotation
• We can fail those health checks to remove unicast traffic from a PoP
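Deliberately failing a health check to drain a PoP can be illustrated with a toy handler. The class, statuses, and drain flag are hypothetical; real edge proxies and DNS-provider probes are more involved.

```python
# Toy illustration of failing a PoP health check to drain unicast traffic.
# Handler and status strings are hypothetical, not LinkedIn's actual edge API.

class EdgeHealthCheck:
    def __init__(self):
        self.draining = False  # flipped by an operator or automation

    def handle(self):
        """Status a DNS provider's prober would see for this PoP."""
        if self.draining:
            return 503, "draining"  # provider pulls the PoP from rotation
        return 200, "ok"

pop = EdgeHealthCheck()
print(pop.handle())   # healthy: the PoP stays in DNS rotation
pop.draining = True   # deliberately fail the check to drain the PoP
print(pop.handle())   # failing: new users are sent to other PoPs
```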
Datacenter Traffic shifts
How does it work?
• Different traffic types are partitioned and controlled separately
• Logged-in vs Logged-out
• Logged-in users are placed into 'buckets' and have primary/secondary datacenter assignments
• Buckets are marked online/offline to move site traffic
What a traffic shift looks like
Single Master Failover
How does it work?
• Only used in extreme cases
• Leverage distributed locking in Apache ZooKeeper
• Single master services have a Spring component that checks the mastership of the service in a particular datacenter
• The best way to prepare for a disaster is to practice one regularly!
• Tooling and automation is your best friend during an outage
• Capacity planning/management is extremely important