• Prior to 2010: running out of Chicago only (ECH3)
• 2010: completion of our second production data center (ELA4); formal disaster recovery strategy
• 2011: the passive site was not serving traffic; maintaining active/passive was not easy, and recovery from a true disaster would not have been easy
• 2013: built LVA1; re-architected services to be (mostly) multi-master; invested in how to recover quickly from a disaster or service outage
• 2014: started multi-colo load testing; Single Master Failover 1; built LTX1
• 2015: Single Master Failover 2; started LSG1
• 2016: ramp LSG1; LOR1, our next-generation data center design
Edge (PoP) shifts. LinkedIn currently operates 12 PoPs around the world, with more on the way, that help improve page load times for our users. These PoPs give LinkedIn more flexibility about where a user enters our network and also give us added redundancy in the case of an outage. We work with our DNS providers to direct users to an appropriate PoP or to alter the flow of traffic to each PoP. See Ritesh Maheshwari's post for more details on this approach.
Data Center Load Shifts. From each PoP, we direct traffic to a specific data center. Logged-in users are assigned to a specific data center by default, but during a traffic shift we can instruct the PoPs to reroute any portion of traffic to one or more different data centers.
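The bucket-based assignment behind these shifts can be sketched roughly as follows. The bucket count, data center names, and modulo assignment are all hypothetical; LinkedIn's in-house sticky-routing service is more involved.

```python
# Hypothetical sketch of sticky bucket routing. Bucket count, DC names, and
# the modulo assignment are illustrative; the real in-house service differs.

NUM_BUCKETS = 1000

def assign_bucket(member_id: int) -> int:
    """Deterministically ("stickily") map a member to a routing bucket."""
    return member_id % NUM_BUCKETS

# Each bucket carries a primary/secondary data center assignment.
assignments = {b: ("LVA1", "LTX1") for b in range(NUM_BUCKETS)}

def route(member_id: int, offline_dcs=frozenset()) -> str:
    """Send the member to the bucket's primary DC unless it is offline."""
    primary, secondary = assignments[assign_bucket(member_id)]
    return secondary if primary in offline_dcs else primary

print(route(12345))                        # primary DC for this member's bucket
print(route(12345, offline_dcs={"LVA1"}))  # falls back to the secondary DC
```

Because the member-to-bucket mapping is deterministic, a user keeps landing in the same data center until an operator changes the bucket's online/offline state.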
Single Master Failovers. Some of our legacy services have not been fully migrated to a multi-data-center architecture and operate in single-master mode in one data center. This includes both user-facing services and back-end services whose traffic may not be directly related to user page views. When performing maintenance, addressing site issues, or exploring capacity issues, we must also take these single-master services into account. Although some of these legacy services require special attention in these situations, many have been converted to a "fast-failover" mechanism that allows us to switch masters between data centers in seconds, with no downtime. Being able to move mastership around at will also lets us balance that part of the load between data centers.
Edge
• LinkedIn currently has 12 PoPs around the world
• These help improve page load times for our users
• Give flexibility on where a user enters our network
• Give us extra redundancy
• See Ritesh Maheshwari's blog post
Fabric
• We assign users to a specific datacenter, but during a traffic shift we can instruct the PoPs to reroute users to other datacenters to increase the load
Single Master Failovers
Some older, more complicated services have not been fully migrated to a multi-datacenter architecture; they operate in single-master mode in one datacenter. These services may not be directly related to user page views but are still important to the running of the site. We have converted most of these services to fast-failover, which allows us to change mastership between datacenters in seconds with no downtime.
• To mitigate user impact from problems with a 3rd party provider or LinkedIn's infrastructure
• To validate Disaster Recovery (DR) in case of any datacenter failure
• To validate and test capacity headroom across our datacenters
• To expose bugs and suboptimal configurations by load testing one or more datacenters
• To perform planned maintenance
• To validate and exercise the traffic shift automation
The traffic shifting process is orchestrated by an internally developed system designed to make load shifts hands-off.
The portal gives a holistic view of the site
The following graphs, captured last night, illustrate the traffic-shift process.
Here we see the number of buckets online for each data center. At roughly 6PM, we progressively marked 100 buckets offline, and then ramped them back online gradually.
The following graph shows actual measured request percentage for one segment of our traffic. The pattern corresponds to the bucket graph, and shows traffic going to zero in one data center, and being redistributed to the other two.
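The redistribution shown in that graph comes down to simple arithmetic over the bucket assignments. A rough sketch, with hypothetical bucket counts and data center names:

```python
# Illustrative arithmetic for how traffic share redistributes when one data
# center's buckets go offline (names and counts are hypothetical).

def traffic_share(buckets, offline_dcs=frozenset()):
    """buckets: list of (primary_dc, secondary_dc) pairs, one per bucket.
    A bucket's traffic lands on its primary DC unless that DC is offline,
    in which case it falls back to the secondary. Returns dc -> fraction."""
    counts = {}
    for primary, secondary in buckets:
        dc = secondary if primary in offline_dcs else primary
        counts[dc] = counts.get(dc, 0) + 1
    return {dc: n / len(buckets) for dc, n in counts.items()}

# 300 equal-weight buckets spread across three DCs, secondaries alternating.
dcs = ["LVA1", "LTX1", "ELA4"]
buckets = [(dcs[i % 3], dcs[(i + 1) % 3]) for i in range(300)]
print(traffic_share(buckets))                        # roughly a third each
print(traffic_share(buckets, offline_dcs={"LVA1"}))  # LVA1 drops to zero;
                                                     # the others absorb it
```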
Our ability to offline a data center with as little member impact as possible is one of our top priorities. We perform weekly load tests to validate the process and confirm we can offline a colo successfully with minimal member impact. We load test by shifting a percentage of traffic onto a targeted data center and evaluating whether it can sustain the load.
You simply schedule a load test and the system does the rest.
A load test is preceded by a series of email notifications starting several hours prior to shifting traffic. At the designated time, the system starts shifting traffic to the targeted data center by offlining buckets from the remaining data centers.
The manipulation of these buckets is facilitated by underlying libraries that interface with our “sticky routing” service developed in-house. The system has a feedback loop that uses our alerting system to check for any errors potentially triggered by the traffic shift. If an alert is detected, the traffic shift automatically halts and issues notifications, allowing an engineer to manually inspect the reason for the alert and determine whether it is safe to proceed.
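The shift-with-feedback loop described above might look something like this sketch, where `offline_bucket`, `check_alerts`, and `notify` are hypothetical stand-ins for the internal sticky-routing and alerting systems:

```python
# Hedged sketch of the automated shift loop with an alert feedback check.
# offline_bucket, check_alerts, and notify are hypothetical stand-ins for
# the internal systems described in the text.

def run_traffic_shift(buckets, offline_bucket, check_alerts, notify, step=10):
    """Offline buckets in small batches; halt and notify if an alert fires."""
    taken_offline = []
    for i in range(0, len(buckets), step):
        for bucket in buckets[i:i + step]:
            offline_bucket(bucket)
            taken_offline.append(bucket)
        alerts = check_alerts()  # feedback loop against the alerting system
        if alerts:
            # Halt so an engineer can inspect and decide whether to proceed.
            notify(f"Shift halted after {len(taken_offline)} buckets: {alerts}")
            return False, taken_offline
    return True, taken_offline
```

The key design point is that the automation checks for alerts after every batch, not just at the end, so a bad shift stops within one step of causing impact.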
The stress-test period begins once the system has successfully redirected the desired volume of traffic to the targeted data center. It typically lasts about 90 minutes, during which we observe the impact the extra load has on the data center.
If impact is detected, we immediately rebalance the load and begin investigating its source. We then work with service owners to review their systems and determine a solution. If the stress-test period completes without impact, the system rebalances traffic and the load test is considered complete once the rebalance finishes.
Our single master mechanism relies on Apache ZooKeeper to maintain the status of all single master services. On startup, all instances within a cluster of single master services check the value of a cluster master node in ZooKeeper. Each service determines whether or not it is a master based on the value stored in the cluster master node. All services also establish a watch on the cluster master node. When services accept master status, they create an ephemeral node in ZooKeeper that acts as a lockfile.
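The startup behavior described above can be simulated in a few lines. Here an in-memory dict stands in for ZooKeeper; real ZooKeeper session handling, ephemeral-node cleanup, and watch-delivery guarantees are deliberately omitted.

```python
# Toy in-memory simulation of the single-master pattern. A dict stands in
# for ZooKeeper; sessions and ephemeral-node expiry are omitted.

class FakeZooKeeper:
    def __init__(self):
        self.nodes = {}    # path -> value
        self.watches = {}  # path -> list of one-shot callbacks

    def set(self, path, value):
        self.nodes[path] = value
        for cb in self.watches.pop(path, []):  # ZK watches fire once
            cb(path, value)

    def get(self, path, watch=None):
        if watch:
            self.watches.setdefault(path, []).append(watch)
        return self.nodes.get(path)

class SingleMasterService:
    MASTER_PATH = "/cluster/master"

    def __init__(self, name, zk):
        self.name, self.zk = name, zk
        self._refresh()

    @property
    def is_master(self):
        return self._master == self.name

    def _refresh(self):
        # Read the cluster-master node and (re-)establish the watch on it.
        self._master = self.zk.get(self.MASTER_PATH, watch=self._on_change)
        if self.is_master:
            # Accepting mastership: create the lock node (ephemeral in real ZK).
            self.zk.set(f"/cluster/lock/{self.name}", "held")

    def _on_change(self, path, value):
        self._refresh()

zk = FakeZooKeeper()
a = SingleMasterService("dc-LVA1", zk)
b = SingleMasterService("dc-LTX1", zk)
zk.set(SingleMasterService.MASTER_PATH, "dc-LVA1")  # elect LVA1
zk.set(SingleMasterService.MASTER_PATH, "dc-LTX1")  # fast failover to LTX1
print(a.is_master, b.is_master)  # False True
```

Because every instance watches the cluster-master node, flipping that single value is all a failover takes; each instance re-evaluates its own mastership when the watch fires.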
We can perform a single master failover with a command-line tool that handles the workflow and all communication with the ZooKeeper instances.
But in a disaster scenario, it can be useful to have an easy-to-use interface to this functionality, as well as a visual overview of the system state.
This interface allows an engineer to fail over all of our fast-failover-enabled single-master services with one click. The interface also integrates the LiX- and ZooKeeper-based mechanisms into a single location.
SouthBay SRE Meetup Jan 2016
Senior Site Reliability Engineer
LinkedIn Traffic Shifting
• Sr Site Reliability Engineer (SRE)
• Member of PROD-SRE
What is a Traffic Shift?
• Edge (PoP) shift
• Datacenter Load shift
• Single Master Failovers
Why do we do traffic shifts?
• To mitigate user impact from problems with a 3rd party provider or LinkedIn's infrastructure/services
• To validate Disaster Recovery (DR) in case of any datacenter failure
• To validate and test capacity headroom across our datacenters
• To expose bugs and suboptimal configurations by load testing one or more datacenters
• To perform planned maintenance
• To validate and exercise the traffic shift automation
Edge Traffic shifts
How does it work?
• We use IPVS to load balance at our edges
• We can withdraw anycast routes to remove traffic from that PoP
• Health checks on our edge proxy are tested by DNS providers to verify whether that PoP is in rotation
• We can fail those health checks to remove unicast traffic from a PoP
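Deliberately failing a health check to drain a PoP can be illustrated with a toy handler. The class, statuses, and drain flag are hypothetical; real edge proxies and DNS-provider probes are more involved.

```python
# Toy illustration of failing a PoP health check to drain unicast traffic.
# Handler and status strings are hypothetical, not LinkedIn's actual edge API.

class EdgeHealthCheck:
    def __init__(self):
        self.draining = False  # flipped by an operator or automation

    def handle(self):
        """Status a DNS provider's prober would see for this PoP."""
        if self.draining:
            return 503, "draining"  # provider pulls the PoP from rotation
        return 200, "ok"

pop = EdgeHealthCheck()
print(pop.handle())   # healthy: the PoP stays in DNS rotation
pop.draining = True   # deliberately fail the check to drain the PoP
print(pop.handle())   # failing: new users are sent to other PoPs
```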
Datacenter Traffic shifts
How does it work?
• Different traffic types are partitioned and controlled separately
• Logged-in vs Logged-out
• Logged-in users are placed into 'buckets' and have primary/secondary datacenter assignments
• Buckets are marked online/offline to move site traffic
What a traffic shift looks like
Single Master Failover
How does it work?
• Only used in extreme cases
• Leverage distributed locking in Apache ZooKeeper
• Single master services have a Spring component that checks the mastership of the service in a particular datacenter
• The best way to prepare for a disaster is to practice one regularly!
• Tooling and automation is your best friend during an outage
• Capacity planning/management is extremely important