LinkedIn’s production stack is made up of over 900 applications and over 2200 internal API’s. With any given application having many interconnected pieces, it is difficult to escalate to the right person in a timely manner.
In order to combat this, LinkedIn built an Event Correlation Engine that monitors service health and maps dependencies between services to correctly escalate to the SRE’s who own the unhealthy service.
We’ll discuss the approach we used in building a correlation engine and how it has been used at LinkedIn to reduce incident impact and provide better quality of life to LinkedIn’s oncall engineers.
"Boost Your Digital Presence: Partner with a Leading SEO Agency"
Reducing MTTR and False Escalations: Event Correlation at LinkedIn
1. Reducing MTTR and False Escalations:
Event Correlation at LinkedIn
Michael Kehoe
Staff Site Reliability Engineer
LinkedIn
2. Have you ever?
2
False Escalations
• Been woken because your service is unhealthy because of a dependency?
• Been woken because someone believes your service is responsible?
• Spent hours trying to work out why your service is broken?
4. $ whoami
4
Michael Kehoe
• Staff Site Reliability Engineer (SRE) @ LinkedIn
• Production-SRE team
• Funny accent = Australian + 3 years American
5. $ whatis PROD-SRE
5
Michael Kehoe
• Production-SRE
• Develop applications to improve MTTD and
MTTR
• Build tools for efficient site issue
troubleshooting, issue detection & correlation
• Provide direction on site monitoring
• Assist in restoring stability to services during site
critical issues
9. Expected Use-Cases
9
Problem Statement
Applicable use-cases:
• A service has high latency or error rates
• Find the problematic service(s)
Non-applicable use-cases:
• External monitoring services show slow page-load times
15. 15
Correlation Engine Overview
At LinkedIn, we had two smaller projects that we could leverage
Drilldown + Site-Stabilizer
Near-Time metric analytics & event correlation
Invisualize
Alert Correlation
Existing knowledge available
16. Where to get started
16
Correlation Engine Overview
The ability to correlate is great!
But you need to understand dependencies
Build a callgraph!
19. drilldown (Near-Time analytics)
19
Correlation Engine Overview
Using callgraph, identifies high-value dependencies (and the associated metrics)
In 5min chunks, analyses high-value metrics
Using a k-means unsupervised algorithm, find similar trends between service metrics
Queryable API
Outputs correlation confidence scores
Normalised between 0-100
20. inVisualize (Alert Correlation)
20
Correlation Engine Overview
inVisualize analyses alerts (in realtime) from each service
Use callgraph to calculate the unhealthy service and affected services
Queryable API
Results normalised between 0-100
Visualizes impact
27. Latency Alert
NURSE
Nurse Plan arguments
• service-name: my-frontend
• req_confidence = 85
• escalate = True
Escalate to
correct SRE
Find what’s wrong with
‘my-frontend’ in
DatacenterB
IrisAlert Correlation API
Service: Service-C
Confidence: 91%
Reason: ‘Service-C’ has high latency after a deploy
Service Owner: SRE