
SRECon-Europe-2017: Reducing MTTR and False Escalations: Event Correlation at LinkedIn


LinkedIn’s production stack is made up of over 900 applications, 2200 internal APIs and hundreds of databases. With any given application depending on many interconnected pieces, it is difficult to escalate to the right person in a timely manner.

To combat this, LinkedIn built an Event Correlation Engine that monitors service health and maps dependencies between services to correctly escalate to the SREs who own the unhealthy service.

We’ll discuss the approach we used in building a correlation engine and how it has been used at LinkedIn to reduce incident impact and provide better quality of life to LinkedIn’s oncall engineers.

Published in: Engineering


  1. Reducing MTTR and False Escalations: Event Correlation at LinkedIn — Michael Kehoe, Staff SRE; Rusty Wickell, Sr Operations Engineer
  2. False Escalations • Have you ever? • Been woken because your service is unhealthy because of a dependency • Been woken because someone believes your service is responsible • Spent hours trying to work out why your service is broken
  3. Today’s agenda 1 Introductions 2 The Problem Statement 3 Architecture Considerations 4 Platform Overview 5 Ecosystem Integration 6 Key Takeaways 7 Q&A
  4. Introduction
  5. Who are we? PRODUCTION-SRE TEAM AT LINKEDIN • Assist in restoring stability to services during site-critical issues • Develop applications to improve MTTD and MTTR • Provide direction and guidelines for site monitoring • Build tools for efficient site-issue detection, correlation & troubleshooting
  6. Problem Statement
  7. Problem Statement • Learning Curve • MTTR • Reliability • Service Complexity
  8. Problem Statement • Learning Curve: understanding services is harder • High MTTR: complexity delays identification of cause • False Escalations: lack of understanding results in false escalations
  9. Project Goals
  10. Project Goals (diagram) • Unified API: internal application shows high latency/errors • Web Frontend: external monitoring shows high latency/errors
  11. Project Goals • Reduce MTTR: reduce impact on members • Reduce False Escalations: fewer disruptions to oncall SREs
  12. Project Goals • Applicable use-cases: internal application shows high latency/errors • Non-applicable use-cases: external monitoring shows high latency/errors
  13. Architecture Considerations
  14. Architecture Considerations • Real-Time Metrics Analysis: running metric correlation via stream-processing • Ad-Hoc Metric Analytics: metric correlation on demand • Alert Correlation: processing alerts and performing correlation
  15. Architecture Considerations REAL-TIME METRIC ANALYTICS • Pros • Fast response time • Ability to do advanced analytics in real-time • Cons • Resource intensive = Expensive
  16. Architecture Considerations AD-HOC METRIC ANALYTICS • Pros • Smaller resource footprint • Cons • Analysis time is slow
  17. Architecture Considerations ALERT CORRELATION • Pros • Leverage already existing alerts • Strong signal-to-noise ratio • Cons • Analysis constrained to alerts only (boolean state)
  18. Architecture Considerations EVALUATION • Alert Correlation gives us strong signal • Real-time analytics is expensive, but useful • Ad-Hoc metric analytics is slower, but cheaper
  19. Platform Overview
  20. Platform Overview • Call Graph: understanding how services depend on each other • Ad-Hoc Metric Correlation: K-Means analysis • Alert Correlation: using alerts to confirm performance • Recommendations Engine: collating and decorating data
  21. Correlation Engine Overview — Architecture: callgraph-api, callgraph-be, correlate-fe, drilldown, invisualize, site-stabilizer
  22. Problem Statement (recap) • Learning Curve • MTTR • Reliability • Service Complexity
  23. Learning Curve • Scattered Knowledge • Outdated Documentation • Poor Dependency Understanding
  24. Callgraph • Stores: callcount, latency and error rates • Created programmatically • Interface: API and a user interface • Lookup: service/API
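The Callgraph described above can be pictured as a directed graph whose edges carry callcount, latency and error-rate data. A minimal sketch (the class and field names here are illustrative, not LinkedIn's actual implementation):

```python
from collections import defaultdict

class CallGraph:
    """Minimal sketch of a call graph: directed edges between services,
    each edge annotated with callcount, average latency and error rate."""

    def __init__(self):
        # caller -> {callee -> edge metrics}
        self._edges = defaultdict(dict)

    def add_edge(self, caller, callee, callcount, avg_latency_ms, error_rate):
        self._edges[caller][callee] = {
            "callcount": callcount,
            "avg_latency_ms": avg_latency_ms,
            "error_rate": error_rate,
        }

    def downstreams(self, service):
        """Look up the services `service` calls, with the edge metrics."""
        return dict(self._edges.get(service, {}))

graph = CallGraph()
graph.add_edge("my-frontend", "profile-api",
               callcount=1200, avg_latency_ms=45.0, error_rate=0.002)
graph.add_edge("my-frontend", "feed-api",
               callcount=300, avg_latency_ms=120.0, error_rate=0.01)
```

A lookup such as `graph.downstreams("my-frontend")` then answers "what does this service depend on, and how heavily", which is the primitive the later correlation stages build on.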
  25. How do we map services? • Service Discovery: services, APIs, protocols • Metrics: destination service, endpoint, protocol
  26. Site Stabilizer | Real-Time and Ad-Hoc Metrics Analysis
  27. Approaches that we tried • Thresholds — challenge: not all metrics had thresholds • Statistical — challenge: expensive, real-time processing, tuning based on the individual metric’s behaviour • Machine Learning — challenge: expensive, real-time processing, tuning based on the individual metric’s behaviour
  28. K-Means • Clustering algorithm: partitions n observations into k clusters • Store: can be trained and saved
  29. K-Means: how it works (diagram: observations grouped into clusters 1–3 around their cluster centers)
  30. Predict score (diagram: clusters 1–3 and cluster center)
  31. Ranking Using K-Means • Predict score • Trend score: based on the trend of the time series • WoW: leverage week-on-week data
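The three signals on this slide can be sketched as simple functions. This is an assumption-heavy illustration: the talk does not specify how the scores are computed or combined, so the distance-to-center predict score, the half-series trend heuristic, and the equal weighting below are all hypothetical.

```python
import statistics

def predict_score(value, centers):
    """Distance from the nearest trained cluster center; assumes `centers`
    come from an offline k-means fit that was trained and saved earlier."""
    return min(abs(value - c) for c in centers)

def trend_score(series):
    """Crude trend signal: difference between the means of the second
    and first halves of the time series."""
    half = len(series) // 2
    return statistics.mean(series[half:]) - statistics.mean(series[:half])

def wow_score(current, week_ago):
    """Week-on-week change of the latest sample."""
    return abs(current[-1] - week_ago[-1])

def rank_metric(series, week_ago, centers):
    # Equal weighting of the three signals is an illustrative choice.
    return (predict_score(series[-1], centers)
            + abs(trend_score(series))
            + wow_score(series, week_ago))
```

Metrics whose latest samples sit far from any learned cluster, trend sharply, and diverge from last week's values rank highest, and those are the ones handed to the drilldown step.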
  32. Typical Workflow • Identify: identify the critical metrics using the k-means method • Drilldown: drill down to the corresponding critical services
  33. inVisualize | Alert Correlation and Visualization
  34. inVisualize • Polls the monitoring system continuously for alerts • Ingests and represents callcount, average latency and error rate from Callgraph • Correlates downstream alerts using Callgraph
  35. inVisualize assumptions • The more alerts a service has, the more likely it is affected or broken • The higher the change in latency/errors to a downstream, the more likely it is broken • The higher the callcount to a downstream, the more valuable it is
  36. inVisualize
  37. inVisualize • Saves state continuously for replay • Ranks the services based on a score, accessible via API • Score is normalized between 0–100
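The assumptions and the 0–100 normalization can be combined into one small ranking function. A sketch only: the field names and the multiplicative weighting are illustrative, since the deck states the assumptions but not the formula.

```python
def correlate(downstreams):
    """Rank a service's downstreams per the inVisualize assumptions:
    more alerts, a bigger latency/error change and a higher callcount
    all raise a downstream's suspicion score. Scores are normalized
    to 0-100, with 100 for the most suspicious downstream."""
    raw = {
        name: d["alerts"] * (1.0 + d["latency_change"]) * d["callcount"]
        for name, d in downstreams.items()
    }
    top = max(raw.values())  # assumes at least one downstream
    return {name: round(100.0 * r / top, 1) for name, r in raw.items()}

scores = correlate({
    "service-a": {"alerts": 3, "latency_change": 0.5, "callcount": 1000},
    "service-b": {"alerts": 1, "latency_change": 0.1, "callcount": 200},
})
```

Here `service-a` (more alerts, larger latency change, heavier traffic) gets score 100 and `service-b` a small fraction of that, matching the intent of the assumptions above.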
  38. Recommendation Engine
  39. Recommendation Engine • Input: service, colo, duration • Collate: collates the outputs from Site Stabilizer and inVisualize • Decorate: with information such as scheduled changes, deployments and A/B experiments • User Interface: responsible service, SRE team, correlation confidence score
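The collate-and-decorate step might look like the sketch below. The talk only says the two outputs are collated and decorated, so averaging the two scores into one confidence value, and decorating with a simple deploy flag, are assumptions made here for illustration.

```python
def recommend(stabilizer_scores, invisualize_scores, recent_deploys):
    """Collate Site Stabilizer and inVisualize scores into a single
    suspect service with a confidence score, decorated with whether
    the suspect was recently deployed."""
    combined = {
        s: (stabilizer_scores.get(s, 0) + invisualize_scores.get(s, 0)) / 2
        for s in set(stabilizer_scores) | set(invisualize_scores)
    }
    suspect, confidence = max(combined.items(), key=lambda kv: kv[1])
    return {
        "service": suspect,
        "confidence": confidence,
        "recent_deploy": suspect in recent_deploys,
    }

rec = recommend(
    stabilizer_scores={"service-c": 90, "service-d": 20},
    invisualize_scores={"service-c": 92, "service-d": 10},
    recent_deploys={"service-c"},
)
```

The decoration step is what turns a bare correlation into an actionable answer: the same suspect with a fresh deployment attached is a far stronger escalation signal than the score alone.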
  40. Ecosystem Integration
  41. Ecosystem Integration — escalate to the correct SRE via Nurse • Plan arguments: service-name: my-frontend; req_confidence = 85; escalate = true • Query: find what’s wrong with ‘my-frontend’ in DatacenterB • Result: Service: Service-C; Confidence: 91%; Reason: ‘Service-C’ has high latency after a deploy; Service Owner: SRE
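The escalation gate implied by the plan arguments is simple to express: only page the owning SRE when escalation is enabled and the correlation confidence clears the requested threshold. The function name and the recommendation shape are illustrative, not Nurse's actual interface.

```python
def should_escalate(recommendation, req_confidence=85, escalate=True):
    """Mirror the plan arguments on the slide: escalate to the service
    owner only when escalation is enabled and the correlation confidence
    meets or exceeds req_confidence."""
    return escalate and recommendation["confidence"] >= req_confidence
```

With the slide's example (confidence 91% against `req_confidence = 85`), the engine would page the owning SRE; a 70% correlation would be held back, which is exactly how false escalations are avoided.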
  42. Key Takeaways
  43. Key Takeaways • Approach: understand what correlation infrastructure makes sense • Dependencies: understand your dependencies
  44. Key Takeaways — Feedback Loops • Important to get feedback on accuracy • Provides a means to do reporting: system effectiveness; engineers saved from escalations • Use feedback data to train the system = improved results
  45. Team Michael Kehoe Rusty Wickell Reynold PJ Govindaluri Kishore Renjith Rejan
  46. Q&A