Velocity San Jose 2017: Traffic shifts: Avoiding disasters at scale


LinkedIn has evolved from serving live traffic out of one data center to four data centers spread geographically. Serving live traffic from four data centers at the same time has taken the company from a disaster recovery model to a disaster avoidance model, where an unhealthy data center can be taken out of rotation and its traffic redistributed to the healthy data centers within minutes, with virtually no visible impact to users.

As LinkedIn transitioned from large monolithic applications to microservices, it became difficult to determine whether individual services had the capacity to handle extra load during disaster scenarios. Stress testing individual services with artificial load in a complex microservices architecture did not provide enough confidence in a data center’s capacity. To solve this problem, LinkedIn uses live traffic to stress services site-wide, shifting traffic to simulate a disaster load.

Michael Kehoe and Anil Mallapur discuss how LinkedIn uses traffic shifts to mitigate user impact by migrating live traffic between its data centers, and to stress test services site-wide for improved capacity handling and member experience.
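
The redistribution described in the abstract can be illustrated with a small calculation. The following is a minimal sketch, not LinkedIn's implementation: it assumes the traffic of the removed data center is spread over the remaining ones in proportion to their current share. The function name `redistribute`, the data center names, and the percentages are hypothetical; the slides indicate that the real assignment also weighs capacity and geographic distance.

```python
def redistribute(shares, offline):
    """Return traffic shares (percent) after taking `offline` out of rotation.

    Assumption: the removed data center's traffic is spread over the
    remaining data centers in proportion to their current share.
    """
    if offline not in shares:
        raise KeyError(f"unknown data center: {offline}")
    remaining = {dc: share for dc, share in shares.items() if dc != offline}
    total = sum(remaining.values())
    if total <= 0:
        raise ValueError("no healthy data center left to absorb traffic")
    return {dc: 100.0 * share / total for dc, share in remaining.items()}


# Hypothetical starting point: four data centers with unequal shares.
current = {"DC1": 40.0, "DC2": 30.0, "DC3": 20.0, "DC4": 10.0}
after = redistribute(current, "DC1")
```

Comparing `after` with what each remaining data center has been shown to handle indicates whether the healthy sites could absorb the extra load, which is the question the live-traffic stress tests are meant to answer.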

Published in: Engineering

  1. 1. TrafficShift: Avoiding Disasters at Scale Jeff Weiner Chief Executive Officer Michael Kehoe Staff SRE Anil Mallapur Sr SRE
  2. 2. Today’s agenda: 1. Introductions 2. Evolution of the Infrastructure 3. Planning for Disaster 4. LinkedIn Traffic-Tier 5. TrafficShift 6. Load Testing 7. Q&A
  3. 3. Key Takeaways • Design infrastructure to facilitate disaster recovery • Test regularly • Automate everything
  4. 4. Introductions
  5. 5. World’s largest professional network • Largest global network of professionals: 500+M members • Serving users worldwide: 200+ countries
  6. 6. Who are we? PRODUCTION-SRE TEAM AT LINKEDIN • Assist in restoring stability to services during site-critical issues • Develop applications to improve MTTD and MTTR • Provide direction and guidelines for site monitoring • Build tools for efficient site-issue detection, correlation & troubleshooting
  7. 7. Terminologies
  8. 8. Terminologies • Fabric/Colo: data center with the full application stack deployed • PoP/Edge: entry point to the LinkedIn network (TCP/SSL termination) • Load Test: planned stress testing of data centers
  9. 9. Evolution of the Infrastructure
  10. 10. Evolution of the Infrastructure 2003 2010 2011 2013 2014 2017 Active & Passive Active & Active Multi-colo 3-way Active & Active Multi-colo n-way Active & Active
  11. 11. 2017: 4 data centers, 13 PoPs, 1000+ services
  12. 12. Planning for Disaster
  13. 13. Why care about Disasters?
  14. 14. What are Disasters? • Service Degradation • Infrastructure Issues • Human Error • Data Center on Fire
  15. 15. One Solution for all Disasters • TrafficShift – Reroute user traffic to different datacenters without any user interruption.
  16. 16. LinkedIn Traffic-Tier
  17. 17. LinkedIn Traffic-Tier Border Router IPVS ATS ATS Frontend EDGE FABRIC Stickyrouting
  18. 18. LinkedIn Traffic-Tier ATS EDGE FABRIC DC1 DC2 DC1 in Cookie Got DC2 as primary fabric Gets primary fabric for user Stickyrouting
  19. 19. LinkedIn Traffic-Tier: Fabric Buckets (diagram: buckets numbered 1 through 100)
  20. 20. How does Stickyrouting assign users to a fabric? • Capacity of a datacenter • Geographic distance to users • Hadoop
  21. 21. Advantages of Stickyrouting • Lower latency • Store data where needed • Control over capacity
  22. 22. TrafficShift
  23. 23. Site Traffic and Disaster Recovery (diagram: EDGE in front of DC1, DC2, DC3, and DC4, with labels 30%, 50%, 50%, and 10% Distributed Load) • Traffic stops being served to offline fabrics when we mark buckets offline • Traffic is shifted to online fabrics as ATS redirects those users to their secondary fabric
  24. 24. When to TrafficShift • Impact Mitigation • Planned Maintenance • Stress Test
  25. 25. TrafficShift Architecture (diagram labels: Web application, Salt master, Stickyrouting Service, Couchbase Backend Worker Processes, FABRIC BUCKETS)
  26. 26. Load Testing
  27. 27. What is Load Testing? • 3x a week • Peak hour traffic • Fixed SLA
  28. 28. Load Testing FABRIC DC3 DC1 DC2 60% Traffic Percentage
  29. 29. Benefits of Load Testing • Capacity Planning • Stress Test • Identify Bugs • Confidence
  30. 30. Big Red Button • Kill-switch for a datacenter • Failout of a datacenter & PoP in minutes • Minimal user impact
  31. 31. Key Takeaways
  32. 32. Key Takeaways • Design infrastructure to facilitate disaster recovery • Stress test regularly to avoid surprises • Automate everything to reduce time to mitigate impact
  33. 33. Q & A
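
The bucket mechanism mentioned on slides 19 and 23 can be sketched as follows. This is a minimal illustration under stated assumptions, not the TrafficShift or Stickyrouting code: it assumes each user holds a bucket number between 1 and 100 together with a primary and a secondary fabric, and that a user whose bucket is marked offline in the primary fabric is sent to the secondary one. The names `Fabric`, `Assignment`, and `route`, and the policy of taking the lowest-numbered buckets offline first, are hypothetical.

```python
from dataclasses import dataclass, field

NUM_BUCKETS = 100  # the slides show fabric buckets numbered up to 100


@dataclass
class Fabric:
    name: str
    offline_buckets: set = field(default_factory=set)

    def mark_offline(self, percent: int) -> None:
        """Take `percent` of the buckets offline (hypothetical policy:
        lowest-numbered buckets first)."""
        if not 0 <= percent <= NUM_BUCKETS:
            raise ValueError("percent must be between 0 and 100")
        self.offline_buckets = set(range(1, percent + 1))

    def is_online(self, bucket: int) -> bool:
        return bucket not in self.offline_buckets


@dataclass(frozen=True)
class Assignment:
    bucket: int      # 1..NUM_BUCKETS
    primary: str     # name of the user's primary fabric
    secondary: str   # name of the fallback fabric


def route(assignment: Assignment, fabrics: dict) -> str:
    """Return the fabric that should serve this user right now."""
    primary = fabrics[assignment.primary]
    if primary.is_online(assignment.bucket):
        return primary.name
    # Assumption: the secondary fabric is healthy; a real system would
    # also need to handle the secondary being offline.
    return assignment.secondary


fabrics = {"DC1": Fabric("DC1"), "DC2": Fabric("DC2")}
low = Assignment(bucket=5, primary="DC1", secondary="DC2")
high = Assignment(bucket=50, primary="DC1", secondary="DC2")

before = (route(low, fabrics), route(high, fabrics))
fabrics["DC1"].mark_offline(10)   # shift roughly 10% of DC1's users away
partial = (route(low, fabrics), route(high, fabrics))
fabrics["DC1"].mark_offline(100)  # full failout of DC1
failout = (route(low, fabrics), route(high, fabrics))
```

Because users are grouped into buckets, traffic can be moved in small, controlled increments rather than all at once, which fits both the gradual load tests and the full failout described in the slides.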