SysAdmin to SRE:
Creating Capacity to Make Tomorrow Better Than Today

Damon Edwards

@damonedwards
Pasadena Convention Center
March 7 - 10, 2019
Community
Ops Improvement
DevOps in Enterprise
Ops Tools
Not that far away, maybe in a company just like yours…
Overloaded. Constant firefighting.
[Figure: Project A and Project B surrounded by a pile of tickets. DUE: Yesterday! DUE: Tomorrow!]
Waiting in ticket queues for everything.
Things break. Break again. And again.
[Figure: the same thing breaks again later, and again. Each time: Help! → Ticket → Wait → Interrupt]
Everyone is busy, but it doesn’t get any better.
[Figure labels: Improvement Project, Business Delivery, Incidents]
Overloaded. Constant firefighting.
Waiting in ticket queues for everything.
Things break. Break again. And again.
Everyone is busy, but it doesn’t get any better.
Everything takes too long, costs
too much, and breaks too often!
Executives

Have you heard of SRE?
Google does it.
Jane Doe
Systems Administrator
We have SysAdmins. They should be SREs!
Jane Doe
SRE
[Cartoon: a Sys Admin surrounded by ITIL Books 1-5, a CAB calendar, a provisioning process chart, and a "Quality! is job #1" sign, is told: "Your new title is SRE. Now write code and be better at ops."]
Dilbert characters © Scott Adams www.dilbert.com
SysAdmins
Overloaded. Constant firefighting.
Waiting in ticket queues for everything.
Things break. Break again. And again.
Everyone is busy, but it doesn’t get any better.

Executive View
Everything takes too long, costs too much, and breaks too often!
Our transformation has largely ignored Ops. Any ideas?
Have you heard of SRE? Google does it.

SRE (new name)
Overloaded. Constant firefighting.
Waiting in ticket queues for everything.
Things break. Break again. And again.
Everyone is busy, but it doesn’t get any better.
Changing job titles or adding individual skills
doesn’t make systems administrators SREs.
Individual skills: Observability, Programming Skills, Distributed Systems Arch., Incident Response
Result: Not SRE
SRE is a rethinking of how Operations work gets done.
Principles of SRE
1. SRE needs Service Level Objectives, with consequences
SLO and Error Budgets: Tools for Shared Responsibility
[Diagram: a scale from 0 to 100. The Service Level Indicator is the measured value, the Service Level Objective is the target, and the Error Budget* is the gap between the objective and 100.]
(*Use this to improve the service)
Shared by DEV, BIZ, and Ops
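The relationship between the three terms can be sketched numerically. This is a minimal sketch, assuming a request-based availability indicator; the function names, the 99.9% target, and the event counts are illustrative and not taken from the talk.

```python
def error_budget_fraction(slo_percent: float) -> float:
    """Fraction of events allowed to fail: the gap between the SLO and 100%."""
    return (100.0 - slo_percent) / 100.0


def budget_report(slo_percent: float, total_events: int, failed_events: int) -> dict:
    """Compare a measured SLI against the SLO and report how much budget is left."""
    # SLI: the percentage of events that succeeded in the measurement window.
    sli_percent = 100.0 * (total_events - failed_events) / total_events
    # Error budget expressed as a number of failures the SLO tolerates.
    allowed_failures = error_budget_fraction(slo_percent) * total_events
    return {
        "sli_percent": sli_percent,
        "allowed_failures": allowed_failures,
        "remaining_failures": allowed_failures - failed_events,
        "slo_met": sli_percent >= slo_percent,
    }


# Hypothetical month: one million requests, 400 of them failed, SLO of 99.9%.
report = budget_report(slo_percent=99.9, total_events=1_000_000, failed_events=400)
```

With these assumed numbers the budget is roughly 1,000 failed requests, of which roughly 600 remain. DEV, BIZ, and Ops can all read the same figure when deciding whether to ship features or spend the time on reliability.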
Principles of SRE are what set SRE apart
1. SRE needs Service Level Objectives, with consequences

2. SREs have time to make tomorrow better than today
Toil: Name For a Problem We’ve All Felt
“Toil is the kind of work tied to running a production service that tends to be manual, repetitive, automatable, tactical, devoid of enduring value, and that scales linearly as a service grows.”
- Vivek Rau, Google
Toil vs. Engineering Work
Toil                  | Engineering Work
Lacks Enduring Value  | Builds Enduring Value
Rote, Repetitive      | Creative, Iterative
Tactical              | Strategic
Increases With Scale  | Enables Scaling
Can Be Automated      | Requires Human Creativity
Excessive Toil Prevents Fixing the System
Toil at manageable percentage of capacity: Engineering Work (E.W.) has room to reduce toil and improve the business.
Toil at unmanageable percentage of capacity (“Engineering Bankruptcy”): no capacity to reduce toil, no capacity to improve business.
Downward spiral is inevitable!
Toil is a naturally occurring force
General Evolution of Automation
1. No automation
2. Externally maintained system-specific automation
3. Externally maintained generic automation
4. Internally maintained system-specific automation
5. Systems that don’t need any automation
Niall Murphy

Microsoft Azure
[Figure: a cycle from Launch (ToDos & Unknowns) to Mature, with Toil appearing at every stage]
Principles of SRE are what set SRE apart
1. SRE needs Service Level Objectives, with consequences

2. SREs have time to make tomorrow better than today
3. SRE teams have the ability to regulate their workload
SRE teams have the ability to regulate their workload
What if handing-off responsibility to SRE/Ops wasn’t a right?
(separate the “running in production” from “run by SRE/Ops”)
InfoSec • Compliance • Risk • Dev
How?? Wait… What?? That’s nuts.
Stephen Thorne at DevOps Enterprise Summit London 2018: “Principles of SRE”
https://youtu.be/c-w_GYvi0eA
Where to start (the practical approach)
1. SRE needs Service Level Objectives, with consequences
↳ Company-wide culture change (hard!)
2. SREs have time to make tomorrow better than today
↳ Reduce toil. Everybody wins!
3. SRE teams have the ability to regulate their workload
↳ Company-wide culture change (hard!)
Why focus on reducing toil?
1. Lots of value independent of “SRE”
2. Your people are your most expensive assets… stay out of their way!
Your people are expensive, stay out of their way!
Delivering planned work:
[Figure: “Not this:” vs. “This:”, comparing paths from Backlog through Ticket Queues to ✅]
Responding to incidents: SRE OODA Loop
Observe: Invest in the right instrumentation
Orient: Invest in collaboration, checklists, investigatory tools
Decide: Empower them to make decisions!
Action: Empower them to take action!
Operations: The Last Mile
DevOps Enterprise Summit 2018 Las Vegas
https://rundeck.co/damon_at_does18
But there is a lot that gets in the way…
TL;DR:
Silos and Queues are major causes of dysfunction.
Silos
[Figure: Silo A (“I need X”) and Silo B (“I do X”, receiving requests for X) each have their own Backlog, Information, Priorities, and Tools]
Silos cause disconnects and mismatches
[Figure: between Silo A and Silo B, Context, Process, Tooling, and Capacity do not line up. Result: Toil]
Ticket queues are how we cope
[Figure: Function A in Silo A reaches Function B in Silo B through a Ticket Queue (??)]
Ticket queues = interruptions, waiting, and toil
Snowflakes:
Technically acceptable, but brittle and unreproducible
Super easy to get started reducing toil
Toil
Super easy to get started reducing toil
1. Track toil levels for each team
Toil
Super easy to get started reducing toil
1. Track toil levels for each team
2. Set toil limit for each team
Toil
Super easy to get started reducing toil
1. Track toil levels for each team
2. Set toil limit for each team
3. Fund efforts to reduce toil (with emphasis on teams already over limit)
Toil
Super easy to get started reducing toil
1. Track toil levels for each team
2. Set toil limit for each team
3. Fund efforts to reduce toil (with emphasis on teams already over limit)
Toil
↳ Refactor apps, tools, and processes
Super easy to get started reducing toil
1. Track toil levels for each team
2. Set toil limit for each team
3. Fund efforts to reduce toil (with emphasis on teams already over limit)
Toil
↳ Refactor apps, tools, and processes
↳ Apply self-service design pattern
Super easy to get started reducing toil
1. Track toil levels for each team
2. Set toil limit for each team
3. Fund efforts to reduce toil (with emphasis on teams already over limit)
Toil
↳ Refactor apps, tools, and processes
↳ Apply self-service design pattern
Empower teams to spot and fix the anti-patterns.
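Steps 1 to 3 can be sketched as a small report. This is a minimal sketch, assuming toil is tracked as hours per team per period; the team names and hours are made up, and the 50% default limit follows the cap described in Google's SRE book rather than anything stated in this talk.

```python
from dataclasses import dataclass


@dataclass
class TeamPeriod:
    team: str
    toil_hours: float   # hours spent on manual, repetitive, tactical work
    total_hours: float  # all hours worked in the same period


def toil_fraction(entry: TeamPeriod) -> float:
    """Step 1: the tracked toil level, as a fraction of total capacity."""
    return entry.toil_hours / entry.total_hours if entry.total_hours else 0.0


def teams_over_limit(entries, limit=0.5):
    """Steps 2 and 3: teams above the limit, worst first, as funding priority."""
    over = [(e.team, toil_fraction(e)) for e in entries if toil_fraction(e) > limit]
    return sorted(over, key=lambda pair: pair[1], reverse=True)


# Hypothetical data for one period.
entries = [
    TeamPeriod("db-ops", toil_hours=130, total_hours=200),
    TeamPeriod("network", toil_hours=60, total_hours=200),
    TeamPeriod("platform", toil_hours=150, total_hours=200),
]
funding_priority = teams_over_limit(entries, limit=0.5)
```

The ordering matters more than the exact numbers: teams already over the limit have the least capacity to dig themselves out, so they are listed first.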
“Do this for me, do it again, then do it again.”
[Figure, Before: each “I need you to do X” becomes a Ticket, then a Wait, then an Interrupt of your other work, then “Do X”, then “Done.” Later… the same request comes back, and again. Toil on both sides.]
[Figure, After: each request for X is served through Self-Service, and your other work continues (x2, x3).]
“I could fix it, but I can’t get to it.”
[Figure, Before: “I could fix it if I could get to it”, but access to the Environment means Wait and Interrupt. Toil on both sides.]
[Figure, After: Self-Service access to the Environment. “I’ve got this!”]
“I’m an expert, I don’t read the wiki.”
[Figure, Before: the docs are updated (“Service has changed. Use this flag or bad things will happen!”, “Pause monitoring first or we all get woken up!”). Later… the expert skips the docs (“I’ve done this before. I’ve got this…”) and runs “restart -doit -now” against the Environment. Toil on both sides.]
[Figure, After: the same updates go into the Self-Service Restart Job. Later… the expert (“I’ve done this before. I’ve got this.”) runs “restart” through Self-Service ✅]
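The "After" case can be sketched as a job that owns the procedure, so the person running it cannot skip a step they did not read about. This is a minimal sketch: the function names and the "--new-flag" value are hypothetical, and resuming monitoring afterwards is an assumption added here, not something the slides state.

```python
from typing import Callable, List


def make_restart_job(pause_monitoring: Callable[[], None],
                     restart_service: Callable[[List[str]], None],
                     resume_monitoring: Callable[[], None],
                     required_flags: List[str]) -> Callable[[], None]:
    """Build a one-button restart job that always follows the current procedure."""
    def restart_job() -> None:
        pause_monitoring()  # "Pause monitoring first or we all get woken up!"
        try:
            # The flags live in the job definition, not in the caller's memory.
            restart_service(list(required_flags))
        finally:
            resume_monitoring()  # assumption: always resume, even if restart fails
    return restart_job


# Hypothetical stand-ins that only record what happened.
log: List[str] = []
job = make_restart_job(
    pause_monitoring=lambda: log.append("monitoring paused"),
    restart_service=lambda flags: log.append("restart " + " ".join(flags)),
    resume_monitoring=lambda: log.append("monitoring resumed"),
    required_flags=["--new-flag"],
)
job()
```

When the service changes, the owner updates the job definition once; everyone who presses the button afterwards gets the new procedure without reading the wiki.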
“Dev work is more expensive than Ops work”
[Figure, Before: Service A breaks (!!) and Ops fixes it AGAIN by hand (1. Step, 2. Step, 3. …, ✅), over and over. DEV: “Fix it? Not in budget. Ops has a workaround.” Result for OPS: Interruptions and Toil.]
[Figure, After: the fix for Service A (1. Step, 2. Step, 3. …) is published as Self-Service. Later… when Service A breaks, the steps run through Self-Service while your other work continues.]
Self-Service Operations Design Pattern
[Diagram: a Consumer of Ops Capabilities invokes a Self-Service Operation On Demand; each operation packages an Ops Capability together with Specialist Knowledge]
Pull-Based
Accept tools/languages that teams want to use
Define “guardrails” to provide work safety
Let people who “push buttons” define the buttons
Build in security and compliance
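One way to picture the pattern is a catalog of operations that consumers pull from on demand, with guardrails and an audit trail built in. This is a minimal sketch under stated assumptions: the class names, roles, targets, and the "clear-cache" operation are all hypothetical, and a real deployment would typically use an existing tool rather than hand-written code.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Set


@dataclass
class Operation:
    name: str
    run: Callable[[str], str]   # procedure defined by the specialists
    allowed_roles: Set[str]     # guardrail: who may invoke it
    allowed_targets: Set[str]   # guardrail: where it may run


@dataclass
class SelfServiceCatalog:
    operations: Dict[str, Operation] = field(default_factory=dict)
    audit_log: List[str] = field(default_factory=list)

    def register(self, op: Operation) -> None:
        self.operations[op.name] = op

    def invoke(self, user: str, role: str, name: str, target: str) -> str:
        """Pull-based: the consumer runs the operation on demand, no ticket."""
        op = self.operations[name]
        allowed = role in op.allowed_roles and target in op.allowed_targets
        # Every attempt is recorded, which supports security and compliance.
        self.audit_log.append(f"{user} {name} {target} {'ok' if allowed else 'denied'}")
        if not allowed:
            raise PermissionError(f"{role} may not run {name} on {target}")
        return op.run(target)


catalog = SelfServiceCatalog()
catalog.register(Operation(
    name="clear-cache",
    run=lambda target: f"cache cleared on {target}",
    allowed_roles={"dev", "ops"},
    allowed_targets={"stage"},
))
result = catalog.invoke("jane", "dev", "clear-cache", "stage")
```

The `run` callable can wrap whatever script or tool the owning team already uses, which is the point of accepting the tools and languages teams want.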
Self-Service can also be a foundation for strategic initiatives
Strategic: Improve incident response times
Jody Mulkey at DOES ‘15 SF
https://youtu.be/USYrDaPEFtM
[Diagram: DEV, STAGE, and PROD environments, each with Services, Monitoring, and Scripts/Tools. Approved jobs are promoted between environments and offered through Self-Service to Dev & QA, NOC/Ops, and Dev (Empower)]
• Reduced MTTR by 92%
• Reduced escalations by 50%
• Reduced overall support costs by 55%
Strategic: Reduce compliance burden & improve consistency
Shaun Norris at DOES ‘18 Las Vegas
https://youtu.be/d5IMvK0YHTg
Optimized for compliance
• 86,000+ employees
• 60+ countries
• Highly regulated
[Diagram: lines of business (LOB #1, LOB #2, LOB #3, LOB …n) reach Services and Scripts/Tools in data centers and clouds through one Self-Service layer, for Compliance and Consistency]
12 months:
• Saved 28 person years of time
• 13,000+ ops tasks in privileged environments that didn’t require a review
• ~200 fewer customer-impacting events
Where I need your help…
Working on documenting the Self-Service Operations design pattern.
Read for free online: rundeck.com/self-service
Give feedback.
Recap: Creating Capacity to Make Tomorrow Better Than Today
SRE is more than a title
Be practical and start focusing on toil
Find and fix toil anti-patterns
Error Budgets and Toil Limits
Apply Self-Service Operations design pattern
SRE is a new way to think about Ops work
1. SRE needs Service Level Objectives, with consequences
2. SREs have time to make tomorrow better than today
3. SRE teams have the ability to regulate their workload
Let’s talk…
@damonedwards
damon@rundeck.com
rundeck.com/self-service

More Related Content

What's hot

Operations: The Last Mile
Operations: The Last Mile Operations: The Last Mile
Operations: The Last Mile Rundeck
 
Incident Management in the Age of DevOps and SRE
Incident Management in the Age of DevOps and SRE Incident Management in the Age of DevOps and SRE
Incident Management in the Age of DevOps and SRE Rundeck
 
Incident Management in the Age of DevOps and SRE
Incident Management in the Age of DevOps and SRE Incident Management in the Age of DevOps and SRE
Incident Management in the Age of DevOps and SRE Rundeck
 
Modern Operations: Solving DevOps’ Last Mile Problem
Modern Operations: Solving DevOps’ Last Mile Problem Modern Operations: Solving DevOps’ Last Mile Problem
Modern Operations: Solving DevOps’ Last Mile Problem Rundeck
 
Operations as a Service: Because Failure Still Happens
Operations as a Service: Because Failure Still Happens Operations as a Service: Because Failure Still Happens
Operations as a Service: Because Failure Still Happens Rundeck
 
Failure Happens: Improving Incident Response In Enterprises
Failure Happens: Improving Incident Response In Enterprises Failure Happens: Improving Incident Response In Enterprises
Failure Happens: Improving Incident Response In Enterprises Rundeck
 
The "Ops" Side of DevSecOps
The "Ops" Side of DevSecOps The "Ops" Side of DevSecOps
The "Ops" Side of DevSecOps Rundeck
 
Keeping Your DevOps Transformation From Crushing Your Ops Capacity
Keeping Your DevOps Transformation From Crushing Your Ops Capacity Keeping Your DevOps Transformation From Crushing Your Ops Capacity
Keeping Your DevOps Transformation From Crushing Your Ops Capacity Rundeck
 
Self-Service Operations: Because Failure Still Happens (Developer Edition)
SysAdmin to SRE: Creating Capacity to Make Tomorrow Better Than Today

  • 1. SysAdmin to SRE: Creating Capacity to Make Tomorrow Better Than Today Damon Edwards @damonedwards Pasadena Convention Center March 7 - 10, 2019
  • 2. Community Ops Improvement DevOps in Enterprise Ops Tools Damon Edwards
  • 3. Not that far away, maybe in a company just like yours…
  • 4. Not that far away, maybe in a company just like yours… Overloaded. Constant firefighting. [Diagram labels: Ticket (repeated), Project A, Project B, DUE: Yesterday!, DUE: Tomorrow!]
  • 5. Waiting in ticket queues for everything. Not that far away, maybe in a company just like yours…
  • 6. Waiting in ticket queues for everything. Ticket Not that far away, maybe in a company just like yours…
  • 7. Waiting in ticket queues for everything. Ticket Ticket Ticket Ticket Ticket Ticket Not that far away, maybe in a company just like yours…
  • 8. Things break. Break again. And again. Later… Later… same same Help! Ticket Wait Interrupt Help! Ticket Wait Interrupt Help! Ticket Wait Interrupt Not that far away, maybe in a company just like yours…
  • 9. Everyone is busy, but it doesn’t get any better. Improvement Project Business Delivery Incidents Business Delivery Business Delivery Not that far away, maybe in a company just like yours…
  • 10. Overloaded. Constant firefighting. Waiting in ticket queues for everything. Things break. Break again. And again. Everyone is busy, but it doesn’t get any better. Not that far away, maybe in a company just like yours…
  • 11. Overloaded. Constant firefighting. Waiting in ticket queues for everything. Things break. Break again. And again. Everyone is busy, but it doesn’t get any better. Not that far away, maybe in a company just like yours… Everything takes too long, costs too much, and breaks too often! Executives Have you heard of SRE? Google does it.
  • 14. Have you heard of SRE? Google does it.
  • 19. ITIL Book 1 ITIL Book 2 ITIL Book 3 ITIL Book 4 ITIL Book 5 Quality! is job #1 Sys Admin CAB CALENDAR Your new title is SRE. Now write code and be better at ops. PROVISIONING PROCESS Dilbert characters © Scott Adams www.dilbert.com
  • 20. SysAdmins Overloaded. Constant firefighting. Waiting in ticket queues for everything. Things break. Break again. And again. Everyone is busy, but it doesn’t get any better. Our transformation has largely ignored Ops. Any ideas? Have you heard of SRE? Google does it. Everything takes too long, costs too much, and breaks too often! Executive View
  • 21. SysAdmins Overloaded. Constant firefighting. Waiting in ticket queues for everything. Things break. Break again. And again. Everyone is busy, but it doesn’t get any better. Our transformation has largely ignored Ops. Any ideas? Have you heard of SRE? Google does it. Everything takes too long, costs too much, and breaks too often! Executive View SRE (new name) Overloaded. Constant firefighting. Waiting in ticket queues for everything. Things break. Break again. And again. Everyone is busy, but it doesn’t get any better. Our transformation has largely ignored Ops. Any ideas? Have you heard of SRE? Google does it. Everything takes too long, costs too much, and breaks too often! Executive View
  • 22. Changing job titles or adding individual skills doesn’t make systems administrators SREs.
  • 24. Changing job titles or adding individual skills doesn’t make systems administrators SREs. Observability Programming Skills Distributed Systems Arch. Incident Response
  • 25. Changing job titles or adding individual skills doesn’t make systems administrators SREs. Observability Programming Skills Distributed Systems Arch. Incident Response
  • 26. Changing job titles or adding individual skills doesn’t make systems administrators SREs. Not SRE Observability Programming Skills Distributed Systems Arch. Incident Response
  • 27. Changing job titles or adding individual skills doesn’t make systems administrators SREs.
  • 28. Changing job titles or adding individual skills doesn’t make systems administrators SREs. SRE is a rethinking of how Operations work gets done.
  • 30. Principles of SRE 1. SRE needs Service Level Objectives, with consequences
  • 32. SLO and Error Budgets: Tools for Shared Responsibility 0 100 Service Level Objective Error Budget* Service Level Indicator (*Use this to improve the service)
  • 34. SLO and Error Budgets: Tools for Shared Responsibility 0 100 Service Level Objective Error Budget* Service Level Indicator (*Use this to improve the service) DEV BIZ Ops
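The SLO and error budget relationship on these slides comes down to simple arithmetic. Below is a minimal Python sketch, assuming a request-based SLI. The class name, the 99.9% target, the event counts, and the consequence policy are illustrative assumptions, not figures from the talk.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ErrorBudget:
    """Request-based error budget for one service over one SLO window."""

    slo_target: float   # hypothetical objective, e.g. 0.999 means 99.9% of requests succeed
    total_events: int   # all requests observed in the window
    bad_events: int     # requests that failed the SLI check

    @property
    def sli(self) -> float:
        """Measured service level: fraction of good events."""
        if self.total_events == 0:
            return 1.0
        return (self.total_events - self.bad_events) / self.total_events

    @property
    def allowed_bad_events(self) -> float:
        """The budget itself: how many bad events the SLO tolerates."""
        return (1.0 - self.slo_target) * self.total_events

    @property
    def remaining_fraction(self) -> float:
        """Share of the budget still unspent (negative once overspent)."""
        allowed = self.allowed_bad_events
        if allowed == 0:
            return 1.0 if self.bad_events == 0 else 0.0
        return 1.0 - self.bad_events / allowed


def consequence(budget: ErrorBudget) -> str:
    """Hypothetical policy standing in for the 'consequences' of an SLO."""
    if budget.remaining_fraction > 0:
        return "budget left: keep shipping changes"
    return "budget spent: prioritize reliability work"


# Made-up numbers for one window.
window = ErrorBudget(slo_target=0.999, total_events=1_000_000, bad_events=250)
print(f"SLI={window.sli:.5f} remaining={window.remaining_fraction:.0%}")
print(consequence(window))
```

Because Dev, Biz, and Ops all read the same number, the budget can serve as the shared-responsibility tool the slide describes. What the consequence actually is would be each organization's call.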
  • 35. Principles of SRE are what set SRE apart 1. SRE needs Service Level Objectives, with consequences
  • 36. Principles of SRE are what set SRE apart 1. SRE needs Service Level Objectives, with consequences 2. SREs have time to make tomorrow better than today
  • 37. Toil: Name For a Problem We’ve All Felt
  • 38. Toil: Name For a Problem We’ve All Felt “Toil is the kind of work tied to running a production service that tends to be manual, repetitive, automatable, tactical, devoid of enduring value, and that scales linearly as a service grows.” -Vivek Rau Google
  • 39. Toil vs. Engineering Work Toil Engineering Work Lacks Enduring Value Builds Enduring Value Rote, Repetitive Creative, Iterative Tactical Strategic Increases With Scale Enables Scaling Can Be Automated Requires Human Creativity
  • 40. Excessive Toil Prevents Fixing the System. Toil at manageable percentage of capacity: Toil | Engineering Work. Reduce toil. Improve the business. Toil at unmanageable percentage of capacity (“Engineering Bankruptcy”): Toil | E.W. No capacity to reduce toil. No capacity to improve business.
  • 42. Excessive Toil Prevents Fixing the System. Toil at manageable percentage of capacity: Toil | Engineering Work. Reduce toil. Improve the business. Toil at unmanageable percentage of capacity (“Engineering Bankruptcy”): Toil | E.W. No capacity to reduce toil. No capacity to improve business. Downward spiral is inevitable!
  • 43. Toil is a naturally occurring force General Evolution of Automation 1. No automation 2. Externally maintained system-specific automation 3. Externally maintained generic automation 4. Internally maintained system-specific automation 5. Systems that don’t need any automation Niall Murphy Microsoft Azure
  • 44. Toil is a naturally occurring force General Evolution of Automation 1. No automation 2. Externally maintained system-specific automation 3. Externally maintained generic automation 4. Internally maintained system-specific automation 5. Systems that don’t need any automation Niall Murphy Microsoft Azure Launch (ToDos & Unknowns) Mature
  • 45. Toil is a naturally occurring force General Evolution of Automation 1. No automation 2. Externally maintained system-specific automation 3. Externally maintained generic automation 4. Internally maintained system-specific automation 5. Systems that don’t need any automation Niall Murphy Microsoft Azure Toil Toil Toil Toil Launch (ToDos & Unknowns) Mature
  • 46. Toil is a naturally occurring force General Evolution of Automation 1. No automation 2. Externally maintained system-specific automation 3. Externally maintained generic automation 4. Internally maintained system-specific automation 5. Systems that don’t need any automation Niall Murphy Microsoft Azure Toil Toil Toil Toil Launch (ToDos & Unknowns) Mature cycle
  • 47. Principles of SRE are what set SRE apart 1. SRE needs Service Level Objectives, with consequences 2. SREs have time to make tomorrow better than today
  • 48. Principles of SRE are what set SRE apart 1. SRE needs Service Level Objectives, with consequences 2. SREs have time to make tomorrow better than today 3. SRE teams have the ability to regulate their workload
  • 49. SRE teams have the ability to regulate their workload
  • 50. SRE teams have the ability to regulate their workload What if handing-off responsibility to SRE/Ops wasn’t a right?
  • 51. SRE teams have the ability to regulate their workload What if handing-off responsibility to SRE/Ops wasn’t a right? (separate the “running in production” from “run by SRE/Ops”)
  • 53. SRE teams have the ability to regulate their workload What if handing-off responsibility to SRE/Ops wasn’t a right? (separate the “running in production” from “run by SRE/Ops”) InfoSec • Compliance • Risk • Dev How?? Wait… What?? That’s nuts.
  • 54. Principles of SRE are what set SRE apart 1. SRE needs Service Level Objectives, with consequences 2. SREs have time to make tomorrow better than today 3. SRE teams have the ability to regulate their workload
  • 55. Principles of SRE are what set SRE apart Stephen Thorne At DevOps Enterprise Summit London 2018 “Principles of SRE” https://youtu.be/c-w_GYvi0eA 1. SRE needs Service Level Objectives, with consequences 2. SREs have time to make tomorrow better than today 3. SRE teams have the ability to regulate their workload
  • 56. Where to start (the practical approach)
  • 57. Where to start (the practical approach) 1. SRE needs Service Level Objectives, with consequences 2. SREs have time to make tomorrow better than today 3. SRE teams have the ability to regulate their workload
  • 58. Where to start (the practical approach) 1. SRE needs Service Level Objectives, with consequences 2. SREs have time to make tomorrow better than today 3. SRE teams have the ability to regulate their workload Company-wide culture change (hard!)
  • 59. Where to start (the practical approach) 1. SRE needs Service Level Objectives, with consequences 2. SREs have time to make tomorrow better than today 3. SRE teams have the ability to regulate their workload Company-wide culture change (hard!) Company-wide culture change (hard!)
  • 60. Where to start (the practical approach) 1. SRE needs Service Level Objectives, with consequences 2. SREs have time to make tomorrow better than today 3. SRE teams have the ability to regulate their workload Company-wide culture change (hard!) Company-wide culture change (hard!) Reduce toil. Everybody wins!
  • 62. Why focus on reducing toil?
  • 63. Why focus on reducing toil? 1. Lots of value independent of “SRE”
  • 64. Why focus on reducing toil? 1. Lots of value independent of “SRE” 2. Your people are your most expensive assets… stay out of their way!
  • 65. Your people are expensive, stay out of their way! Delivering planned work: Not this: Backlog, six Ticket Queues, ✅. This: Backlog, ✅.
  • 66. Your people are expensive, stay out of their way! Delivering planned work: Not this: Backlog, six Ticket Queues, ✅. This: Backlog, ✅. Responding to incidents: SRE OODA Loop: Observe, Orient, Decide, Action.
  • 67. Your people are expensive, stay out of their way! Delivering planned work: Not this: Backlog, six Ticket Queues, ✅. This: Backlog, ✅. Responding to incidents: SRE OODA Loop: Observe, Orient, Decide, Action. Invest in the right instrumentation.
  • 68. Your people are expensive, stay out of their way! Delivering planned work: Not this: Backlog, six Ticket Queues, ✅. This: Backlog, ✅. Responding to incidents: SRE OODA Loop: Observe, Orient, Decide, Action. Invest in the right instrumentation. Invest in collaboration, checklists, investigatory tools.
  • 69. Your people are expensive, stay out of their way! Delivering planned work: Not this: Backlog, six Ticket Queues, ✅. This: Backlog, ✅. Responding to incidents: SRE OODA Loop: Observe, Orient, Decide, Action. Invest in the right instrumentation. Invest in collaboration, checklists, investigatory tools. Empower them to make decisions!
  • 70. Your people are expensive, stay out of their way! Delivering planned work: Not this: Backlog, six Ticket Queues, ✅. This: Backlog, ✅. Responding to incidents: SRE OODA Loop: Observe, Orient, Decide, Action. Invest in the right instrumentation. Invest in collaboration, checklists, investigatory tools. Empower them to make decisions! Empower them to take action!
  • 71. Operations: The Last Mile DevOps Enterprise Summit 2018 Las Vegas https://rundeck.co/damon_at_does18 But there is a lot that gets in the way…
  • 72. Operations: The Last Mile DevOps Enterprise Summit 2018 Las Vegas https://rundeck.co/damon_at_does18 But there is a lot that gets in the way… tl;dr: Silos and Queues are major causes of dysfunction.
  • 73. TL;DR: Silos and Queues are major causes of dysfunction.
  • 75. Backlog Information I need X Priorities Tools Silos
  • 76. Backlog Information I need X Priorities Tools Silos Backlog I do X Requests for X Silo A Information Priorities Silo B Tools
  • 77. Silos cause disconnects and mismatches Backlog Information I need X Priorities Tools Backlog I do X Requests for X Silo A Information Priorities Silo B Tools Context Context Process Process Tooling Tooling Capacity Capacity
  • 78. Silos cause disconnects and mismatches Backlog Information I need X Priorities Tools Backlog I do X Requests for X Silo A Information Priorities Silo B Tools Context Context Process Process Tooling Tooling Capacity Capacity Toil
  • 79. Ticket queues are how we cope Silo A Silo B Function A Function B
  • 80. Ticket queues are how we cope Silo A Silo B Ticket Queue Function A Function B
  • 81. ?? Silo A Silo B Ticket Queue Function A Function B Ticket queues = interruptions, waiting, and toil
  • 82. ?? Silo A Silo B Ticket Queue Function A Function B Ticket queues = interruptions, waiting, and toil Toil
  • 83. ?? Silo A Silo B Ticket Queue Function A Function B Snowflakes: Technically acceptable, but brittle and unreproducible Ticket queues = interruptions, waiting, and toil
  • 84. ?? Silo A Silo B Ticket Queue Function A Function B Snowflakes: Technically acceptable, but brittle and unreproducible Ticket queues = interruptions, waiting, and toil Toil
  • 85. Super easy to get started reducing toil Toil
  • 86. Super easy to get started reducing toil 1. Track toil levels for each team Toil
  • 87. Super easy to get started reducing toil 1. Track toil levels for each team 2. Set toil limit for each team Toil
  • 88. Super easy to get started reducing toil 1. Track toil levels for each team 2. Set toil limit for each team 3. Fund efforts to reduce toil (with emphasis on teams already over limit) Toil
  • 89. Super easy to get started reducing toil 1. Track toil levels for each team 2. Set toil limit for each team 3. Fund efforts to reduce toil (with emphasis on teams already over limit) Toil ↳ Refactor apps, tools, and processes
  • 91. Super easy to get started reducing toil 1. Track toil levels for each team 2. Set toil limit for each team 3. Fund efforts to reduce toil (with emphasis on teams already over limit) Toil ↳ Refactor apps, tools, and processes ↳ Apply self-service design pattern
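The three steps above (track toil, set a limit, fund the teams over it) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the team names, the hours, and the 50% limit are made up, and how toil hours get measured in the first place (ticket tags, time surveys, and so on) is left open.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TeamToil:
    team: str
    toil_hours: float    # hours spent on manual, repetitive operational work
    total_hours: float   # total working hours in the same period

    @property
    def toil_fraction(self) -> float:
        if self.total_hours <= 0:
            return 0.0
        return self.toil_hours / self.total_hours


def over_limit(records: list[TeamToil], limit: float) -> list[TeamToil]:
    """Teams above the toil limit, worst first, as candidates for funding."""
    breaching = [r for r in records if r.toil_fraction > limit]
    return sorted(breaching, key=lambda r: r.toil_fraction, reverse=True)


# Hypothetical numbers for one week.
records = [
    TeamToil("platform", toil_hours=30, total_hours=40),
    TeamToil("data", toil_hours=16, total_hours=40),
    TeamToil("web", toil_hours=24, total_hours=40),
]
TOIL_LIMIT = 0.5  # assumed limit; each organization picks its own

for r in over_limit(records, TOIL_LIMIT):
    print(f"{r.team}: {r.toil_fraction:.0%} toil, over the {TOIL_LIMIT:.0%} limit")
```

Sorting worst first mirrors the slide's emphasis on teams already over their limit.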
  • 92. Empower teams to spot and fix the anti-patterns.
  • 93. “Do this for me, do it again, then do it again.” Done. I need you to do X Your other work I need you to do X I need you to do X Ticket Do X Later… Do X Do X Done. Done. Your other work Self-Service Self-Service Self-Service Your other work x2 Your other work x3 Later… Later… Later… Your other work Your other work After Before Wait Interrupt Ticket Wait Interrupt Ticket Wait Interrupt
  • 94. “Do this for me, do it again, then do it again.” Done. I need you to do X Your other work I need you to do X I need you to do X Ticket Do X Later… Do X Do X Done. Done. Your other work Self-Service Self-Service Self-Service Your other work x2 Your other work x3 Later… Later… Later… Your other work Your other work After Before Wait Interrupt Ticket Wait Interrupt Ticket Wait Interrupt Toil
  • 96. “Do this for me, do it again, then do it again.” Done. I need you to do X Your other work I need you to do X I need you to do X Ticket Do X Later… Do X Do X Done. Done. Your other work Self-Service Self-Service Self-Service Your other work x2 Your other work x3 Later… Later… Later… Your other work Your other work After Before Wait Interrupt Ticket Wait Interrupt Ticket Wait Interrupt Toil Toil
  • 97. “I could fix it, but I can’t get to it.” Environment I could fix it if I could get to it Before Wait Interrupt
  • 98. “I could fix it, but I can’t get to it.” Environment I could fix it if I could get to it Before Wait Interrupt Toil
  • 99. “I could fix it, but I can’t get to it.” Environment I could fix it if I could get to it Before Wait Interrupt After I’ve got this! Environment Self-Service Toil
  • 100. “I could fix it, but I can’t get to it.” Environment I could fix it if I could get to it Before Wait Interrupt After I’ve got this! Environment Self-Service Toil Toil
  • 101. “I’m an expert, I don’t read the wiki.” docs Service has changed. Use this flag or bad things will happen! Pause monitoring first or we all get woken up! “restart -doit -now” I’ve done this before. I’ve got this… Environment docs Later… Before
  • 103. “I’m an expert, I don’t read the wiki.” docs Service has changed. Use this flag or bad things will happen! Pause monitoring first or we all get woken up! “restart -doit -now” I’ve done this before. I’ve got this… Environment docs Later… Before Toil
  • 104. “I’m an expert, I don’t read the wiki.” docs Service has changed. Use this flag or bad things will happen! Pause monitoring first or we all get woken up! “restart -doit -now” I’ve done this before. I’ve got this… Environment docs Later… Before Service has changed. Use this flag or bad things will happen! Pause monitoring first or we all get woken up! “restart” Environment Later… Update Restart Job ✅ I’ve done this before. I’ve got this. Self-Service Self-Service After Toil
  • 105. “I’m an expert, I don’t read the wiki.” docs Service has changed. Use this flag or bad things will happen! Pause monitoring first or we all get woken up! “restart -doit -now” I’ve done this before. I’ve got this… Environment docs Later… Before Service has changed. Use this flag or bad things will happen! Pause monitoring first or we all get woken up! “restart” Environment Later… Update Restart Job ✅ I’ve done this before. I’ve got this. Self-Service Self-Service After Toil Toil
  • 106. “Dev work is more expensive than Ops work” Service A !! I’ll fix it AGAIN 1. Step 2. Step 3. ✅ Service A !! I’ll fix it AGAIN 1. Step 2. Step 3. ✅ Service A !! I’ll fix it AGAIN 1. Step 2. Step 3. ✅ Service A !! I’ll fix it AGAIN 1. Step 2. Step 3. ✅ Service A !! I’ll fix it AGAIN 1. Step 2. Step 3. ✅ Fix it? Not in budget. Ops has a work around. Service A !! I’ll fix it AGAIN 1. Step 2. Step 3. ✅ Interruptions Toil OPS DEV
  • 107. “Dev work is more expensive than Ops work” Service A !! I’ll fix it AGAIN 1. Step 2. Step 3. ✅ Service A !! I’ll fix it AGAIN 1. Step 2. Step 3. ✅ Service A !! I’ll fix it AGAIN 1. Step 2. Step 3. ✅ Service A !! I’ll fix it AGAIN 1. Step 2. Step 3. ✅ Service A !! I’ll fix it AGAIN 1. Step 2. Step 3. ✅ Fix it? Not in budget. Ops has a work around. Service A !! I’ll fix it AGAIN 1. Step 2. Step 3. ✅ Interruptions Toil OPS DEV Toil
  • 108. “Dev work is more expensive than Ops work” Service A !! I’ll fix it AGAIN 1. Step 2. Step 3. ✅ Service A !! I’ll fix it AGAIN 1. Step 2. Step 3. ✅ Service A !! I’ll fix it AGAIN 1. Step 2. Step 3. ✅ Service A !! I’ll fix it AGAIN 1. Step 2. Step 3. ✅ Service A !! I’ll fix it AGAIN 1. Step 2. Step 3. ✅ Fix it? Not in budget. Ops has a work around. Service A !! I’ll fix it AGAIN 1. Step 2. Step 3. ✅ Interruptions Toil OPS DEV Service A !! I’ll fix it 1. Step 2. Step 3. Your other work Self-Service Service A !! 1. Step 2. Step 3. Your other work Self-Service Service A !! 1. Step 2. Step 3. Later… Later… Toil
  • 109. “Dev work is more expensive than Ops work” Service A !! I’ll fix it AGAIN 1. Step 2. Step 3. ✅ Service A !! I’ll fix it AGAIN 1. Step 2. Step 3. ✅ Service A !! I’ll fix it AGAIN 1. Step 2. Step 3. ✅ Service A !! I’ll fix it AGAIN 1. Step 2. Step 3. ✅ Service A !! I’ll fix it AGAIN 1. Step 2. Step 3. ✅ Fix it? Not in budget. Ops has a work around. Service A !! I’ll fix it AGAIN 1. Step 2. Step 3. ✅ Interruptions Toil OPS DEV Service A !! I’ll fix it 1. Step 2. Step 3. Your other work Self-Service Service A !! 1. Step 2. Step 3. Your other work Self-Service Service A !! 1. Step 2. Step 3. Later… Later… Toil Toil
  • 110. Self-Service Operations Design Pattern Consumer of Ops Capabilities Self-Service Operation On Demand Ops Capability Specialist Knowledge Ops Capability Specialist Knowledge
  • 111. Self-Service Operations Design Pattern Consumer of Ops Capabilities Self-Service Operation On Demand Ops Capability Specialist Knowledge Ops Capability Specialist Knowledge Pull-Based
  • 112. Self-Service Operations Design Pattern Consumer of Ops Capabilities Self-Service Operation On Demand Ops Capability Specialist Knowledge Ops Capability Specialist Knowledge Pull-Based Accept tools/languages that teams want to use
  • 113. Self-Service Operations Design Pattern Consumer of Ops Capabilities Self-Service Operation On Demand Ops Capability Specialist Knowledge Ops Capability Specialist Knowledge Pull-Based Accept tools/languages that teams want to use Define “guardrails” to provide work safety
  • 114. Self-Service Operations Design Pattern Consumer of Ops Capabilities Self-Service Operation On Demand Ops Capability Specialist Knowledge Ops Capability Specialist Knowledge Pull-Based Accept tools/languages that teams want to use Let people who “push buttons” define the buttons Define “guardrails” to provide work safety
  • 115. Self-Service Operations Design Pattern Consumer of Ops Capabilities Self-Service Operation On Demand Ops Capability Specialist Knowledge Ops Capability Specialist Knowledge Pull-Based Accept tools/languages that teams want to use Let people who “push buttons” define the buttons Build in security and compliance Define “guardrails” to provide work safety
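One way to picture the Self-Service Operations pattern is as a catalog. Specialists publish an operation once, consumers pull it on demand, and guardrails plus an audit trail are built into the call path. The Python sketch below is a hedged illustration only: `Operation`, `Catalog`, `GuardrailError`, the roles, and the environments are hypothetical names, and the restart procedure returns its steps as strings rather than touching real systems.

```python
from dataclasses import dataclass, field
from typing import Callable


class GuardrailError(PermissionError):
    """Raised when a request falls outside an operation's guardrails."""


@dataclass(frozen=True)
class Operation:
    name: str
    procedure: Callable[[str], list[str]]   # specialist knowledge, captured once
    allowed_roles: frozenset[str]
    allowed_environments: frozenset[str]


@dataclass
class Catalog:
    operations: dict[str, Operation] = field(default_factory=dict)
    audit_log: list[tuple[str, str, str, str]] = field(default_factory=list)

    def publish(self, op: Operation) -> None:
        """Called by the people who own the capability."""
        self.operations[op.name] = op

    def run(self, name: str, user: str, role: str, environment: str) -> list[str]:
        """Called on demand by the consumer; no ticket, no waiting."""
        op = self.operations[name]
        allowed = role in op.allowed_roles and environment in op.allowed_environments
        # Every attempt is recorded, so compliance evidence is a by-product.
        self.audit_log.append((user, name, environment, "allowed" if allowed else "denied"))
        if not allowed:
            raise GuardrailError(f"{role} may not run {name} in {environment}")
        return op.procedure(environment)


def restart_service(environment: str) -> list[str]:
    """Stand-in procedure: the easily forgotten steps live in the job, not a wiki."""
    return [
        f"pause monitoring in {environment}",
        f"restart service in {environment}",
        f"resume monitoring in {environment}",
    ]


catalog = Catalog()
catalog.publish(Operation(
    name="restart",
    procedure=restart_service,
    allowed_roles=frozenset({"developer", "ops"}),
    allowed_environments=frozenset({"dev", "stage"}),
))
print(catalog.run("restart", user="dana", role="developer", environment="stage"))
```

Keeping the guardrail check and the audit record inside `run` is a design choice in this sketch. It means security and compliance travel with every invocation instead of depending on a separate review step.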
  • 116. Self-Service can also be a foundation for strategic initiatives
  • 117. Strategic: Improve incident response times https://youtu.be/USYrDaPEFtM Jody Mulkey at DOES ‘15 SF
  • 118. Strategic: Improve incident response times https://youtu.be/USYrDaPEFtM Jody Mulkey at DOES ‘15 SF Services Monitoring Scripts/Tools Services Monitoring Scripts/Tools Services Monitoring Scripts/Tools DEV STAGE PROD Dev & QA NOC/Ops Dev Promote approved jobs Self-Service Self-Service Empower
  • 120. Strategic: Improve incident response times https://youtu.be/USYrDaPEFtM Jody Mulkey at DOES ‘15 SF Services Monitoring Scripts/Tools Services Monitoring Scripts/Tools Services Monitoring Scripts/Tools DEV STAGE PROD Dev & QA NOC/Ops Dev Promote approved jobs Self-Service Self-Service Empower • Reduced MTTR by 92% • Reduced escalations by 50% • Reduced overall support costs by 55%
  • 121. Strategic: Reduce compliance burden & improve consistency Shaun Norris at DOES ‘18 Las Vegas https://youtu.be/d5IMvK0YHTg
  • 122. Strategic: Reduce compliance burden & improve consistency Shaun Norris at DOES ‘18 Las Vegas https://youtu.be/d5IMvK0YHTg Optimized for compliance • 86,000+ employees • 60+ countries • Highly regulated
  • 123. Strategic: Reduce compliance burden & improve consistency Shaun Norris at DOES ‘18 Las Vegas https://youtu.be/d5IMvK0YHTg Optimized for compliance • 86,000+ employees • 60+ countries • Highly regulated LOB #1 LOB #2 LOB #3 LOB …n Services Scripts/Tools Data Center Services Scripts/Tools Data Center Services Scripts/Tools Data Center Services Scripts/Tools Cloud Services Scripts/Tools Cloud Services Scripts/Tools Cloud Services Scripts/Tools Cloud Self-Service Compliance Consistency
  • 124. Strategic: Reduce compliance burden & improve consistency Shaun Norris at DOES ‘18 Las Vegas https://youtu.be/d5IMvK0YHTg Optimized for compliance • 86,000+ employees • 60+ countries • Highly regulated LOB #1 LOB #2 LOB #3 LOB …n Services Scripts/Tools Data Center Services Scripts/Tools Data Center Services Scripts/Tools Data Center Services Scripts/Tools Cloud Services Scripts/Tools Cloud Services Scripts/Tools Cloud Services Scripts/Tools Cloud Self-Service Compliance Consistency 12 months: • Saved 28 person-years of time • 13,000+ ops tasks in privileged environments that didn’t require a review • ~200 fewer customer-impacting events
  • 125. Where I need your help… Working on documenting the Self-Service Operations design pattern. Read for free online: rundeck.com/self-service Give feedback.
  • 126. Recap: Creating Capacity to Make Tomorrow Better Than Today. SRE is more than a title. SRE is a new way to think about Ops work. 1. SRE needs Service Level Objectives, with consequences 2. SREs have time to make tomorrow better than today 3. SRE teams have the ability to regulate their workload. Be practical and start focusing on toil. Find and fix toil anti-patterns. Error Budgets and Toil Limits. Apply Self-Service Operations design pattern.