
How robust monitoring powers high availability for the LinkedIn feed


It is common practice for networking services like LinkedIn to introduce a new feature to a small subset of users before deploying it to all members. Even with rigorous testing and tight processes, new deploys inevitably introduce bugs: unit and integration tests in a development environment cannot cover every use case. The result is an inferior experience for the users exposed to the new code. These bugs can also cause service outages, hurting availability and health metrics and increasing the on-call burden.

Although frameworks exist for testing new features in development clusters, it is not possible to completely simulate all aspects of a production environment. Read requests to the Feed service have enough variation to trigger large classes of unforeseen errors. So we introduced a production instance of the service, called a 'Dark Canary', that receives live production traffic tee'd from a production source node but never returns responses upstream. New experimental code is deployed to the dark canary and compared against the current version running on the production node for health failures. This reduces the number of bad sessions served to users and also gives us a better understanding of the service's capacity requirements.
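The tee described above can be sketched as a fire-and-forget fork of each incoming request: the production node serves the response as usual, while a copy goes to the dark canary on a side path whose result is discarded. This is a minimal illustration under our own assumptions, not LinkedIn's implementation; `serve_production` and `serve_dark_canary` are hypothetical handlers.

```python
import threading

def handle_request(request, serve_production, serve_dark_canary):
    """Serve the request from production; tee a copy to the dark canary.

    Only the production response is returned upstream. The dark canary
    response is computed on a background thread and discarded, so dark
    canary failures never reach real users.
    """
    def fire_and_forget():
        try:
            # Copy the request so the dark path cannot mutate it.
            serve_dark_canary(dict(request))
        except Exception:
            pass  # surfaced by monitoring, never propagated upstream

    threading.Thread(target=fire_and_forget, daemon=True).start()
    return serve_production(request)
```

The key property is in the `except` clause: a crash in the experimental code path is invisible to the caller, which is what makes it safe to run unreviewed builds against live traffic.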



  1. How Robust Monitoring Powers High Availability for LinkedIn Feed (Rushin Barot, SREcon17, LinkedIn)
  2. Overview ● LinkedIn Feed architecture ● Monitoring ● High availability for Feed ● Best practices
  3. LinkedIn in 2016 ● 467 million member accounts ● 107 million active users, 40 million+ messaging conversations per week ● Kafka: 1.4 trillion messages/day across 1400 brokers ● 2-3x increase in referral traffic YoY
  4. LinkedIn Feed Architecture (diagram: LinkedIn business logic, mid tier, time-ordered DB, data ingestion, datastore)
  5. Monitoring: What do we monitor?
  6. Monitoring: What do we monitor? ● Load average ● Errors ● Latency ● Saturation ● Instances
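The per-node signals listed on this slide can be tracked with a small accumulator. The sketch below is ours for illustration (class and method names are hypothetical, not LinkedIn's Autometrics API):

```python
from collections import defaultdict

class ServiceMetrics:
    """Minimal per-node tracker for the monitored signals:
    request counts, errors, and latency (p99)."""

    def __init__(self):
        self.latencies_ms = []
        self.counts = defaultdict(int)

    def record(self, latency_ms: float, error: bool) -> None:
        """Record one request's outcome."""
        self.latencies_ms.append(latency_ms)
        self.counts["requests"] += 1
        if error:
            self.counts["errors"] += 1

    def error_rate(self) -> float:
        """Fraction of requests that failed."""
        return self.counts["errors"] / max(self.counts["requests"], 1)

    def p99_latency_ms(self) -> float:
        """99th-percentile latency (nearest-rank on the sorted samples)."""
        ordered = sorted(self.latencies_ms)
        return ordered[int(0.99 * (len(ordered) - 1))] if ordered else 0.0
```

In practice such counters are emitted to a time-series store (the slides mention a round-robin database behind InGraphs) rather than held in process memory.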
  7. Monitoring: How do we monitor? ● InGraphs: frontend to a round-robin database (RRD) ● Autometrics: self-service metrics collection ● Nurse: auto-remediation ● Iris: alerting, similar to PagerDuty
  8. High Availability for Feed ● Data partitioning: 720 partitions ● Replication: 3 replicas ● Backup and restore: separate backup nodes ● Quotas: request throttling ● Failover ● Live load testing
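The partition and replica counts above can be illustrated with a toy placement scheme. This assumes modulo partitioning by member ID and consecutive replica placement, which are our simplifying assumptions for illustration, not FollowFeed's actual strategy:

```python
NUM_PARTITIONS = 720  # from the slide: 720 data partitions
NUM_REPLICAS = 3      # from the slide: 3 replicas per partition

def partition_for(member_id: int) -> int:
    """Map a member ID to one of the 720 feed partitions."""
    return member_id % NUM_PARTITIONS

def replicas_for(partition: int, num_nodes: int) -> list:
    """Pick 3 distinct nodes (by index) to hold a partition's replicas."""
    return [(partition + i) % num_nodes for i in range(NUM_REPLICAS)]
```

With three replicas per partition, any single node can be drained for failover or a live load test while the remaining copies continue serving reads.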
  9. Best Practices ● Automated hardware-issue detection ● A/B testing ● Proof rules (a.k.a. error budgeting) ● Canaries ● Dark canaries
  10. Dark Canary ● Mirror server in production that receives a copy of production traffic ● Responses are never sent to clients ● Traffic is tee'd from a source server to the dark canary
  11. Dark Canary (diagram)
  12. Dark Canary: Specifics for Feed ● Read-only requests ● Scalability via a traffic multiplier ● Dark canary failures have no impact on production ● Differentiation from production traffic ○ Page key: dark canary requests carry separate page keys ○ Tree ID: separate tree ID
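The traffic multiplier and page-key tagging on this slide can be sketched as follows. The names (`send_dark`, the `"feed-dark-canary"` page key) are hypothetical, and handling the fractional part of the multiplier probabilistically is our assumption:

```python
import random

def tee_with_multiplier(request, multiplier: float, send_dark) -> int:
    """Send `multiplier` copies of a read request to the dark canary.

    multiplier=1.0 mirrors traffic 1:1; 2.5 sends each request twice,
    plus a third copy with probability 0.5, so the dark canary can be
    driven beyond current production load for capacity tests.
    """
    copies = int(multiplier)
    if random.random() < multiplier - copies:
        copies += 1  # probabilistic extra copy for the fractional part
    for _ in range(copies):
        dark_request = dict(request)
        # Tag the copy so dashboards can separate dark traffic from real
        # traffic (the slide: dark requests carry separate page keys).
        dark_request["page_key"] = "feed-dark-canary"
        send_dark(dark_request)
    return copies
```

Because the multiplier only affects the discarded side of the tee, it can be raised gradually to find the redline of a single node without risking user-facing traffic.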
  13. Dark Canary Usage ● New build verification ○ Detection of schema changes, code issues, misconfigurations ● New colo certification ● Warming up caches in a multi-colo environment ● Capacity planning
  14. Dark Canary Capacity Measurement via Redliner ● Uses live traffic in the production environment ● No significant impact on the production node ● Minimal maintenance overhead
  15. Dark Canary Capacity Measurement via Redliner (diagram)
  16. Results
  17. System Stats: CPU
  18. System Stats: Memory
  19. Advantages and Limitations ● Pros ○ Proactive measure to catch issues ○ Low cost of testing with real production data and traffic patterns ○ No test implementation or maintenance required ○ Configurable traffic multiplier ○ New code can be deployed and tested in parallel with staging before being pushed to the regular canary ● Cons ○ Read-only requests only ○ Recommended for services near the bottom of the stack, since calls go to production downstream services ○ Additional hardware required for the dark canary
  20. Thank you
  21. References ● LinkedIn Tech Blog: FollowFeed architecture, Nurse, InGraphs, Autometrics ● Book: Site Reliability Engineering ● Paper: "Capacity Measurement and Planning for NoSQL Databases" by Ramya Pasumarti, Rushin Barot, Susie Xia, Ang Xu, Haricharan Ramachandra (IEEE SDI 2017)