This document summarizes Michael Kehoe's experience being loaned out to SRE teams at LinkedIn to help with various issues. It describes three scenarios: 1) A team lacked resource allocation documentation and had a backlog, 2) A new service had technical debt and complex performance issues, 3) A database team lacked automation and had alert fatigue.
To build success, the document recommends defining problem areas, success criteria, acquiring needed resources, and planning short and long-term. It also stresses communicating expectations with clients/partners. Key learnings are to measure toil, prioritize reducing it, and communicate effectively.
Helping operations top-heavy teams the smart way
1. Helping operations top-heavy
teams the smart way
(Lessons from my experience being loaned out to SRE teams)
Michael Kehoe
Staff Site Reliability Engineer
2. Michael Kehoe
$ WHOAMI
• Staff Site Reliability Engineer @ LinkedIn
• Production-SRE Team
• Funny accent = Australian + 4 years
American
• Former Network Engineer at the
University of Queensland
3. Production-SRE Team @ LinkedIn
$ WHOAMI
• Disaster Recovery - Planning &
Automation
• Incident Response – Process &
Automation
• Visibility Engineering – Making use of
operational data
• Reliability Principles – Defining best
practice & automating it
4. This talk is not
• How to quickly erase all your technical debt
• How to change your engineering culture
5. This talk is
• How to identify team anti-patterns
• How to work through high-toil
• How to create sustainable workloads
6. Today’s agenda
1 Background
2 Scenario 1: Resource Allocation
3 Scenario 2: Technical Debt
4 Scenario 3: High Toil
5 Building A Formula For Success
6 Key Learnings
7 Q&A
12. Problem Statement
Technical Debt
• New frontend service
• Understanding performance is
complicated
• Management of dependent
services was difficult
17. Building a formula for success
• Problem Statement: Define the areas that need attacking
• Communication & Partnerships: Communicate expectations with clients & partners
• Exit Criteria: Define success criteria
• Resource Acquisition: Get the help that you require
• Planning: Plan for short-term & long-term
18. Define the areas that need attacking
Problem Statement
• Admit there is a problem
• Measure the problem
• Understand the problem
• Determine the underlying causes that
need to be fixed
Building a formula for success
19. Define success criteria
Exit Criteria
• Define concrete goals
• Define concrete success criteria
• Measure via an operational metric
• Measure via a project being
completed
• Define timelines for completion
Building a formula for success
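As a rough illustration of the exit-criteria slide, the sketch below pairs one operational metric with a target and a deadline. The metric name, target value, and dates are hypothetical and do not come from the talk; treat it as one possible way to write a criterion down, not as the method the team used.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class ExitCriterion:
    """One concrete exit criterion: a metric, a target, and a deadline."""
    name: str
    target: float   # the metric must be at or below this value
    deadline: date  # timeline for completion

    def is_met(self, current_value: float) -> bool:
        # Assumes a lower-is-better metric (for example, pages per shift).
        return current_value <= self.target

    def is_overdue(self, today: date, current_value: float) -> bool:
        # Overdue only if the deadline has passed and the target is still missed.
        return today > self.deadline and not self.is_met(current_value)


# Hypothetical example values, chosen only for illustration.
criterion = ExitCriterion(
    name="pages per on-call shift",
    target=5.0,
    deadline=date(2024, 6, 30),
)
```

Calling `criterion.is_met(...)` with the latest measurement gives a yes/no answer that can go straight into a progress update for stakeholders.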
20. Get the help you require
Resource Acquisition
• Ask other teams for help
• Get dedicated engineers/ project
managers/ other roles as required
• Set exit-date for resources
Building a formula for success
21. Plan for the short-term & long-term
Planning
• Plan out short-term work
• Plan out longer-term projects
• Do they need to be rescheduled?
• Prioritize work that will reduce toil &
burnout (Automation +
Measurement)
Building a formula for success
22. Communicate expectations with
clients & partners
Communication & Partnerships
• Communicate problem statement &
exit criteria
• Send regular progress updates
• Ensure that stakeholders
understand delays & expected
outcomes
Building a formula for success
23. When Operations Isn’t Perfect
Code Yellow
https://devops.com/code-yellow-when-operations-isnt-perfect/
Michael
So we’re a part of a team at LinkedIn called Production-SRE
The key tenets of Production-SRE at LinkedIn are:
Assist in restoring stability during site-critical issues
Developing applications to reduce MTTD and MTTR
Provide direction and guidelines for site-troubleshooting
Build tools for efficient site-issue troubleshooting, issue detection and correlation
As this presentation goes on, you’ll notice how an Event Correlation system fits into these
This talk isn’t about how to magically erase all of your technical debt
Neither is it a talk on changing your engineering culture
This talk is about:
How to identify team anti-patterns
How to work through high-toil
How to create sustainable workloads
So the first scenario I want to discuss is when I got pulled into the Traffic team due to severe resource allocation issues:
The team lacked written documentation on how their platform worked and was deployed
They had a large backlog of work for clients
And there was a large amount of alert fatigue due to some poorly defined alerts and some infrastructure that needed upgrading but that they hadn’t gotten to yet
On top of that, 4/5 team members left in a short period of time and started doing reliability operations at another company together
So we’re in a bit of a pickle here.
So in response, we took 5 staff SREs from other teams and dedicated them to the Traffic team for a period of 3 months
Stopped all non-critical client work for a number of weeks
Completely recreated all monitoring systems
Spent a large chunk of time removing complexity
Focused on infrastructure reliability
The second team I worked with was our frontend API service team
Thousands of instances
Lack of maturity in automation for the team
Alert fatigue given the size of their infrastructure
Poor visibility into ops metrics