Helping operations top-heavy teams the smart way

This document summarizes Michael Kehoe's experience being loaned out to SRE teams at LinkedIn. It describes three scenarios: 1) a team with resource allocation problems that lacked written documentation, had a backlog of client work, and suffered alert fatigue; 2) a new frontend service with technical debt, performance that was complicated to understand, and dependent services that were difficult to manage; 3) a large database team with immature team-specific automation and alert fatigue. To build success, it recommends defining the problem statement, setting exit criteria, acquiring the needed resources, planning for the short and long term, and communicating expectations with clients and partners. The key learnings are to measure toil, prioritize reducing it, and communicate with partners and teams.

Helping operations top-heavy teams the smart way
(Lessons from my experience being loaned out to SRE teams)
Michael Kehoe
Staff Site Reliability Engineer

Michael Kehoe
$ WHOAMI
• Staff Site Reliability Engineer @ LinkedIn
• Production-SRE Team
• Funny accent = Australian + 4 years American
• Former Network Engineer at the University of Queensland

Production-SRE Team @ LinkedIn
$ WHOAMI
• Disaster Recovery – Planning & Automation
• Incident Response – Process & Automation
• Visibility Engineering – Making use of operational data
• Reliability Principles – Defining best practice & automating it

This talk is not
• How to quickly erase all your technical debt
• How to change your engineering culture

This talk is
• How to identify team anti-patterns
• How to work through high-toil
• How to create sustainable workloads

Today’s agenda
1 Background
2 Scenario 1: Resource Allocation
3 Scenario 2: Technical Debt
4 Scenario 3: High Toil
5 Building A Formula For Success
6 Key Learnings
7 Q&A

Personal Experience in the past 15 months
ASSISTANCE RENDERED
• Traffic-SRE: Resource Allocation
• Voyager-SRE: Technical Debt
• Capacity War-room
• Espresso-SRE: Reliability

Problem Statement
Resource Allocations
• Lack of written documentation
• Backlog of work for clients
• Alert Fatigue

Problem Statement
Technical Debt
• New frontend service
• Understanding performance is complicated
• Management of dependent services was difficult

Problem Statement
High Toil
• Large multi-tenant/multi-cluster database team
• Lack of maturity in team-specific automation
• Alert Fatigue

Building a formula for success
• Problem Statement: Define the areas that need attacking
• Exit Criteria: Define success criteria
• Resource Acquisition: Get the help that you require
• Planning: Plan for short-term & long-term
• Communication & Partnerships: Communicate expectations with clients & partners

Define the areas that need attacking
Problem Statement
• Admit there is a problem
• Measure the problem
• Understand the problem
• Determine underlying causes that need to be fixed

Define success criteria
Exit Criteria
• Define concrete goals
• Define concrete success criteria
• Measure via an operational metric
• Measure via a project being completed
• Define timelines for completion
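
One way to make the exit criteria above concrete is to write each one down as a named metric with a target and a deadline, then check observed values against it. The following is only a minimal sketch: the class, the metric names, the targets and the dates are all hypothetical and are not taken from the talk.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class ExitCriterion:
    """One concrete, measurable goal with a deadline (hypothetical structure)."""
    name: str
    target: float                  # value the operational metric must reach
    deadline: date                 # timeline for completion
    lower_is_better: bool = True   # e.g. pages per shift should go down

    def is_met(self, observed: float) -> bool:
        if self.lower_is_better:
            return observed <= self.target
        return observed >= self.target

    def is_overdue(self, today: date, observed: float) -> bool:
        # Overdue means the deadline has passed and the target is still unmet.
        return today > self.deadline and not self.is_met(observed)


def can_exit(criteria, observations):
    """True only when every criterion has an observation that meets its target."""
    return all(
        c.name in observations and c.is_met(observations[c.name])
        for c in criteria
    )


# Illustrative values only: metric names, targets and dates are made up.
criteria = [
    ExitCriterion("pages_per_on_call_shift", target=5, deadline=date(2030, 1, 31)),
    ExitCriterion("runbook_coverage_pct", target=90, deadline=date(2030, 1, 31),
                  lower_is_better=False),
]
observations = {"pages_per_on_call_shift": 4, "runbook_coverage_pct": 75}
print(can_exit(criteria, observations))
```

With these made-up numbers the check fails, because runbook coverage is still below its target. Treating a missing observation as "not met" keeps the check conservative: the engagement does not end on a metric nobody measured.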

Get the help you require
Resource Acquisition
• Ask other teams for help
• Get dedicated engineers/project managers/other roles as required
• Set an exit date for resources

Plan for the short-term & long-term
Planning
• Plan out short-term work
• Plan out longer-term projects
• Do they need to be rescheduled?
• Prioritize work that will reduce toil & burnout (Automation + Measurement)

Communicate expectations with clients & partners
Communication & Partnerships
• Communicate problem statement & exit criteria
• Send regular progress updates
• Ensure that stakeholders understand delays & expected outcomes

When Operations Isn’t Perfect
Code Yellow
https://devops.com/code-yellow-when-operations-isnt-perfect/

Key Learnings
• Measure: Measure toil/overhead
• Prioritize: Prioritize efforts to remove overhead/toil
• Communicate: Communicate with partners & teams
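
As a rough illustration of the "Measure" learning, toil can be tracked as the share of logged hours spent on toil each week. This is a sketch under stated assumptions: the work-log format of (week, hours, is_toil) tuples and the sample numbers are hypothetical, not something described in the talk.

```python
from collections import defaultdict


def toil_fraction_by_week(entries):
    """Return {week: toil_hours / total_hours} for (week, hours, is_toil) entries."""
    total = defaultdict(float)
    toil = defaultdict(float)
    for week, hours, is_toil in entries:
        total[week] += hours
        if is_toil:
            toil[week] += hours
    # Skip weeks with no logged hours to avoid dividing by zero.
    return {week: toil[week] / total[week] for week in total if total[week] > 0}


# Made-up work log: 30 of 40 hours were toil in week 1, 20 of 40 in week 2.
work_log = [
    ("2030-W01", 30.0, True),
    ("2030-W01", 10.0, False),
    ("2030-W02", 20.0, True),
    ("2030-W02", 20.0, False),
]
fractions = toil_fraction_by_week(work_log)
for week in sorted(fractions):
    print(week, fractions[week])
```

A per-week ratio like this gives a trend to report in progress updates and a number that exit criteria can reference.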

Editor's Notes

Michael: So we’re a part of a team at LinkedIn called Production-SRE. The key tenets of Production-SRE at LinkedIn are:
• Assist in restoring stability during site-critical issues
• Develop applications to reduce MTTD and MTTR
• Provide direction and guidelines for site troubleshooting
• Build tools for efficient site-issue troubleshooting, issue detection and correlation
As this presentation goes on, you’ll notice how an Event Correlation system fits into these.
This talk isn’t how to magically erase all of your technical debt. Neither is it a talk on changing your engineering culture.
This talk is:
• How to identify team anti-patterns
• How to work through high-toil
• How to create sustainable workloads
So the first scenario I want to discuss is when I got pulled into the Traffic team due to severe resource allocation issues:
• We had a team that lacked written documentation on how their platform worked and was deployed
• They had a large backlog of work for clients
• There was a large amount of alert fatigue due to some poorly defined alerts and some infrastructure that needed upgrading but that they hadn’t gotten to yet
On top of that, 4/5 team members left in a short period of time and started doing reliability operations at another company together. So we’re in a bit of a pickle here.
So in response, we:
• Took 5 staff SREs from other teams and dedicated them to the Traffic team for a period of 3 months
• Stopped all non-critical client work for a number of weeks
• Completely recreated all monitoring systems
• Spent a large chunk of time removing complexity
• Focused on infrastructure reliability
The second team I worked with was our frontend API service team
• Thousands of instances
• Lack of maturity in automation for the team
• Alert fatigue given the size of their infrastructure
• Poor visibility into ops metrics
