LinkedIn has evolved from serving live traffic out of one data center to serving it out of four geographically distributed data centers. Serving live traffic from four data centers at the same time has taken us from a disaster recovery model to a disaster avoidance model, where we can take an unhealthy data center out of rotation and redistribute its traffic to the healthy data centers within minutes, with virtually no visible impact to users.
As we transitioned from a big monolithic application to microservices, we struggled to determine the capacity constraints of individual services under the extra load of a disaster scenario. Stress testing individual services with artificial load in a complex microservices architecture wasn't sufficient to give us confidence in a data center's capacity. To solve this problem, we at LinkedIn leverage live traffic to stress services site-wide by shifting traffic to simulate disaster load.
This talk details how LinkedIn uses Traffic Shifts to mitigate user impact by migrating live traffic between its data centers, and to stress test services site-wide for improved capacity handling and member experience.
World’s largest professional network
Who are we?
Production-SRE team at LinkedIn
● Assist in restoring stability to services during site-critical issues
● Develop applications to improve MTTD and MTTR
● Provide direction and guidelines for the site
● Build tools for efficient site-issue troubleshooting and detection
Fabric/Colo: Data center with the full application stack deployed
PoP/Edge: Entry point to LinkedIn's network (TCP/SSL termination)
Load Test: Planned stress testing of data centers
[Timeline, 2003–2015: LinkedIn's data center evolution from a single colo to multi-way Active & Active fabrics]
How does StickyRouting assign users to a colo?
● Based on the capacity of each fabric
● An offline job assigns each user a primary colo (see the sketch below)
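The slides don't show how the offline assignment job works internally; the following is a minimal sketch of the idea, assuming each fabric has a fixed user capacity and users are placed in the nearest fabric that still has headroom. Every name and number here (FABRICS, PREFERENCE, assign_primary_colo) is hypothetical, not LinkedIn's actual implementation.

```python
# Hypothetical sketch of capacity-aware user-to-colo assignment.
# Fabric names, capacities, and the preference table are made up.
FABRICS = {
    "fabric-east":   {"capacity": 3, "assigned": 0},
    "fabric-west":   {"capacity": 3, "assigned": 0},
    "fabric-europe": {"capacity": 2, "assigned": 0},
    "fabric-asia":   {"capacity": 2, "assigned": 0},
}

PREFERENCE = {  # fabrics ordered nearest-first per user region (illustrative)
    "NAMER": ["fabric-east", "fabric-west", "fabric-europe", "fabric-asia"],
    "EU":    ["fabric-europe", "fabric-east", "fabric-west", "fabric-asia"],
    "APAC":  ["fabric-asia", "fabric-west", "fabric-east", "fabric-europe"],
}

def assign_primary_colo(user_region: str) -> str:
    """Nearest fabric with headroom wins; spill to the next-closest otherwise."""
    for fabric in PREFERENCE[user_region]:
        info = FABRICS[fabric]
        if info["assigned"] < info["capacity"]:
            info["assigned"] += 1
            return fabric
    raise RuntimeError("all fabrics are at capacity")

# The real job runs offline over all members; here we assign a handful.
for region in ["EU", "EU", "EU", "NAMER", "APAC"]:
    print(region, "->", assign_primary_colo(region))
```

Note how the third EU user spills over to the next-closest fabric once fabric-europe hits its cap: this is the "precise control over capacity allotment" the next slide refers to.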
Advantages of sticky routing
● Lower latency for users
● Data is stored where it's needed
● Precise control over capacity allotment
When to TrafficShift?
Site Traffic and Disaster Recovery
● Traffic stops being served to the offline fabric
● Traffic is shifted to the online fabrics (redistribution sketched below)
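As a rough illustration of the disaster-avoidance math, here is a minimal sketch (not LinkedIn's actual tooling) that redistributes an offline fabric's traffic across the remaining fabrics in proportion to each one's spare capacity. The fabric names and percentages are hypothetical.

```python
# Hypothetical sketch: spread an offline fabric's load across the
# healthy fabrics, proportional to each one's spare capacity.
def redistribute(load: dict[str, float], capacity: dict[str, float],
                 offline: str) -> dict[str, float]:
    moved = load[offline]
    healthy = [f for f in load if f != offline]
    spare = {f: capacity[f] - load[f] for f in healthy}
    total_spare = sum(spare.values())
    if moved > total_spare:
        raise RuntimeError("healthy fabrics lack capacity for failover")
    new_load = {offline: 0.0}
    for f in healthy:
        new_load[f] = load[f] + moved * spare[f] / total_spare
    return new_load

# Example: take fabric-east offline and shift its 30% of site traffic.
load = {"fabric-east": 30, "fabric-west": 25, "fabric-central": 25, "fabric-asia": 20}
capacity = {"fabric-east": 50, "fabric-west": 50, "fabric-central": 50, "fabric-asia": 40}
print(redistribute(load, capacity, "fabric-east"))
```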
What is Load Testing?
● Run 3 times a week
● Drives peak-hour-level traffic into the fabric under test (ramp sketched below)
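The slides don't detail the load test mechanics, but conceptually a test ramps live traffic into the target fabric in steps while watching health signals, and rolls back if the fabric degrades. A minimal, hypothetical sketch follows; shift_traffic, error_rate, the ramp steps, and the error budget are all illustrative stand-ins, not LinkedIn's tooling.

```python
import time

ERROR_BUDGET = 0.001           # abort if the error rate exceeds 0.1%
RAMP_STEPS = [40, 50, 60, 70]  # percent of site traffic aimed at the target

def shift_traffic(fabric: str, percent: int) -> None:
    """Stub: in reality this would update StickyRouting's fabric weights."""
    print(f"routing {percent}% of traffic to {fabric}")

def error_rate(fabric: str) -> float:
    """Stub: in reality this would query site monitoring."""
    return 0.0

def run_load_test(target: str, baseline: int = 33) -> bool:
    for pct in RAMP_STEPS:
        shift_traffic(target, pct)
        time.sleep(300)                      # let traffic settle between steps
        if error_rate(target) > ERROR_BUDGET:
            shift_traffic(target, baseline)  # roll back: fabric is unhealthy
            return False
    shift_traffic(target, baseline)          # test done; restore normal split
    return True
```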
LinkedIn’s PoP Architecture
• Using IPVS, each PoP announces a unicast address and a regional anycast address
• Three anycast regions: APAC, EU, and NAMER
• GeoDNS steers users to the 'best' PoP (illustrated below)
• DNS provides users with either an anycast or a unicast address for a PoP
• US and EU member traffic is nearly all anycast
• APAC traffic is all unicast
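As a toy illustration of the steering logic above (not LinkedIn's DNS implementation; the addresses and region table are made up, using documentation IP ranges), a GeoDNS resolver might pick answers like this:

```python
# Hypothetical GeoDNS answer selection: anycast for NAMER/EU clients,
# an explicit unicast PoP address for APAC clients.
ANYCAST = {"NAMER": "198.51.100.1", "EU": "203.0.113.1"}
UNICAST_POPS = {"singapore": "192.0.2.10", "mumbai": "192.0.2.20",
                "sydney": "192.0.2.30"}

def resolve(client_region: str, nearest_pop: str) -> str:
    """Return the address GeoDNS would hand to this client."""
    if client_region in ANYCAST:
        return ANYCAST[client_region]   # BGP routes to the closest PoP
    return UNICAST_POPS[nearest_pop]    # explicit PoP choice for APAC

print(resolve("EU", "london"))          # -> regional anycast address
print(resolve("APAC", "singapore"))     # -> unicast PoP address
```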
LinkedIn’s PoP DR
• Sometimes we need to fail out of PoPs
• 3rd-party provider issues (e.g., transit links)
• Infrastructure maintenance
• Withdraw anycast route announcements
• Fail health checks on the proxy to drain unicast traffic (see the sketch below)
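The health-check drain in the last bullet can be illustrated with a tiny sketch. The endpoint path, the drain-flag file, and the use of Python's http.server are all assumptions for illustration, not LinkedIn's actual proxy:

```python
import os
from http.server import BaseHTTPRequestHandler, HTTPServer

DRAIN_FLAG = "/etc/pop/drain"  # hypothetical: touch this file to drain the PoP

class Health(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/healthcheck":
            self.send_error(404)
            return
        if os.path.exists(DRAIN_FLAG):
            # Deliberately fail so upstream load balancers pull this PoP
            # out of the unicast rotation and traffic drains away.
            self.send_response(503)
        else:
            self.send_response(200)
        self.end_headers()

if __name__ == "__main__":
    HTTPServer(("", 8080), Health).serve_forever()
```

Failing the check (rather than hard-killing the proxy) lets in-flight connections finish while new connections are steered elsewhere, which is what makes the drain graceful.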