LinkedIn has evolved from serving live traffic out of one data center to four data centers spread geographically. Serving live traffic from four data centers at the same time has taken the company from a disaster recovery model to a disaster avoidance model, where an unhealthy data center can be taken out of rotation and its traffic redistributed to the healthy data centers within minutes, with virtually no visible impact to users.
As LinkedIn transitioned from big monolithic applications to microservices, it became difficult to determine the capacity constraints of individual services handling extra load during disaster scenarios. Stress testing individual services with artificial load in a complex microservices architecture wasn’t sufficient to provide confidence in a data center’s capacity. To solve this problem, LinkedIn leverages live traffic to stress services site-wide, shifting traffic between data centers to simulate a disaster load.
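As a rough illustration of why a traffic shift doubles as a capacity test, here is a minimal sketch assuming a hypothetical bucket-based allocator: draining one fabric forces the surviving fabrics to absorb its share of real user buckets. The fabric names, bucket count, and function names are invented for the example.

```python
# Hypothetical sketch of a bucket-based traffic shift used as a live
# stress test: draining buckets from one fabric pushes its share of
# real traffic onto the remaining fabrics, simulating a disaster load.

FABRICS = ["dc-a", "dc-b", "dc-c", "dc-d"]  # assumed fabric names
NUM_BUCKETS = 1000  # users hash into fixed buckets (assumed count)


def bucket_assignments(offline: set[str]) -> dict[int, str]:
    """Map each user bucket to an online fabric, skipping offline ones."""
    online = [f for f in FABRICS if f not in offline]
    if not online:
        raise RuntimeError("cannot drain every fabric at once")
    return {b: online[b % len(online)] for b in range(NUM_BUCKETS)}


def stress_test(target: str) -> dict[str, int]:
    """Drain `target` and report how many buckets land on each survivor."""
    load: dict[str, int] = {}
    for fabric in bucket_assignments(offline={target}).values():
        load[fabric] = load.get(fabric, 0) + 1
    return load


if __name__ == "__main__":
    # Draining dc-a spreads its ~250 buckets across the other three
    # fabrics, so each must now absorb ~333 buckets of live traffic.
    print(stress_test("dc-a"))
```

The point of using live traffic rather than synthetic load is visible in the sketch: the surviving fabrics serve real user requests at the elevated rate, so any capacity shortfall shows up exactly where it would in an actual disaster.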
Michael Kehoe and Anil Mallapur discuss how LinkedIn uses traffic shifts to mitigate user impact by migrating live traffic between its data centers, and to stress test services site-wide for improved capacity handling and member experience.
World’s largest professional network, serving users worldwide
Who are we?
PRODUCTION-SRE TEAM AT LINKEDIN
• Assist in restoring stability to services during site-critical issues
• Develop applications to improve MTTD (mean time to detect)
• Provide direction and guidelines for site
• Build tools for efficient site-issue detection, correlation & troubleshooting
Site Traffic and Disaster Recovery
Traffic stops being served to offline fabrics when we mark buckets offline.
Traffic is shifted to online fabrics as ATS redirects those users to their secondary fabric.
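A minimal sketch of that routing rule, assuming the primary/secondary bucket model the captions imply; the data model and names here are illustrative, not LinkedIn's actual code.

```python
# Each user bucket has a primary and a secondary fabric. Marking a
# bucket offline in its primary makes the edge (ATS in the slides)
# send that bucket's users to the secondary instead.

from dataclasses import dataclass, field


@dataclass
class Bucket:
    primary: str
    secondary: str
    offline_in: set[str] = field(default_factory=set)


def route(bucket: Bucket) -> str:
    """Pick the fabric that should serve this bucket's users."""
    if bucket.primary not in bucket.offline_in:
        return bucket.primary
    if bucket.secondary not in bucket.offline_in:
        return bucket.secondary
    raise RuntimeError("bucket is offline in both fabrics")


b = Bucket(primary="dc-a", secondary="dc-c")
assert route(b) == "dc-a"     # healthy: served from primary
b.offline_in.add("dc-a")      # operator marks the bucket offline
assert route(b) == "dc-c"     # edge redirects users to the secondary
```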
When to TrafficShift
[Architecture diagram: Service → Couchbase → Backend Worker]
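The diagram suggests a pipeline in which a service accepts a shift request, persists it in Couchbase, and a backend worker executes it. The sketch below mimics that flow with an in-memory dict standing in for Couchbase; every name, field, and step is an assumption, not LinkedIn's implementation.

```python
# Sketch of a Service -> Couchbase -> Backend Worker flow: the service
# records a requested shift in a document store, and a worker later
# picks it up and executes it asynchronously.

import uuid

store: dict[str, dict] = {}  # stand-in for the Couchbase bucket


def submit_shift(from_fabric: str, to_fabric: str, percent: int) -> str:
    """Service side: persist a shift request and return its id."""
    job_id = str(uuid.uuid4())
    store[job_id] = {
        "from": from_fabric,
        "to": to_fabric,
        "percent": percent,
        "status": "pending",
    }
    return job_id


def run_worker() -> None:
    """Worker side: execute pending shifts and mark them done."""
    for job in store.values():
        if job["status"] != "pending":
            continue
        # A real worker would mark buckets offline incrementally and
        # watch site health between steps; this just records the shift.
        print(f"shifting {job['percent']}% from {job['from']} to {job['to']}")
        job["status"] = "done"


job = submit_shift("dc-a", "dc-b", percent=25)
run_worker()
assert store[job]["status"] == "done"
```

Separating the request (service) from the execution (worker) keeps the long-running, health-gated shift out of the request path, which is the usual reason for putting a durable store between the two.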