Damon Edwards, co-founder of Rundeck, talks at SCALE 17x on March 9, 2019 in Pasadena, CA.
Wouldn't everyone in operations love more time to work on exciting projects? Build out new platforms, improve performance, contribute to open source projects, focus on security, level up their automation: all things that add value to your company and advance your career. But instead, the life of a traditional systems administrator is often buried in interruptions and repetitive work. Imagine the things you could do if you just had the time to get to them.
Then along comes a new way of working and a new role called Site Reliability Engineering (SRE). But SRE almost seems too good to be true! People are doing what systems administrators used to do, but getting to spend more than 50% of their time doing engineering work that adds enduring value to their company? How can less than half of these SREs' time be wasted on the interruptions, repetitive work, and drudgery that seem to consume most of the traditional systems administrator's time? And how do they do this with the same or a smaller headcount?
This talk will first take a close look at what SRE is and what SRE isn't. We will break down the principles behind the SRE movement and highlight where SRE departs from the current conventional wisdom of Operations and Systems Administration work. You'll learn about key concepts like Toil, SLOs, Error Budgets, and Shared Responsibility Models.
Next, we'll look at how to move to an SRE style of working. We'll look at how traditional operations beliefs and practices can leave organizational scar tissue that is difficult to overcome. We'll examine examples of how silos, excessive toil, reliance on queues, and incorrectly applied governance models undermine the adoption of SRE principles and practices in the enterprise. We'll also look at the individual skills and mindset changes that you'll need to adopt an SRE way of working.
You'll leave this talk with an appreciation for how SRE can create the capacity you need to make tomorrow better than today.
See a Demo of Rundeck Enterprise:
https://www.rundeck.com/see-demo
--or--
Download Rundeck Open Source here:
https://rundeck.com/open-source
Connect:
Stack Overflow community: https://stackoverflow.com/questions/tagged/rundeck
GitHub: https://github.com/rundeck/rundeck/issues
Twitter: https://twitter.com/Rundeck
Facebook: https://www.facebook.com/RundeckInc/
LinkedIn: https://www.linkedin.com/company/rundeck-inc
4. Not that far away, maybe in a company just like yours…
Overloaded. Constant firefighting.
[Figure: Project A and Project B surrounded by a pile of tickets, marked “DUE: Yesterday!” and “DUE: Tomorrow!”]
7. Waiting in ticket queues for everything.
[Figure: tickets piling up in a queue]
8. Things break. Break again. And again.
[Figure: the same problem returns later, and later again; each time a call for help becomes a ticket, with waiting and interruptions]
9. Everyone is busy, but it doesn’t get any better.
[Figure: workload made up of Business Delivery, Incidents, and an Improvement Project]
11. Overloaded. Constant firefighting.
Waiting in ticket queues for everything.
Things break. Break again. And again.
Everyone is busy, but it doesn’t get any better.
Executives: “Everything takes too long, costs too much, and breaks too often!”
“Have you heard of SRE? Google does it.”
21. SysAdmins
Overloaded. Constant firefighting.
Waiting in ticket queues for everything.
Things break. Break again. And again.
Everyone is busy, but it doesn’t get any better.
Executive view: “Everything takes too long, costs too much, and breaks too often!”
“Our transformation has largely ignored Ops. Any ideas?”
“Have you heard of SRE? Google does it.”
SRE (new name)
Overloaded. Constant firefighting.
Waiting in ticket queues for everything.
Things break. Break again. And again.
Everyone is busy, but it doesn’t get any better.
Executive view: “Everything takes too long, costs too much, and breaks too often!”
26. Changing job titles or adding individual skills doesn’t make systems administrators SREs.
Not SRE: Observability • Programming Skills • Distributed Systems Arch. • Incident Response
28. Changing job titles or adding individual skills doesn’t make systems administrators SREs.
SRE is a rethinking of how Operations work gets done.
34. SLO and Error Budgets: Tools for Shared Responsibility
[Figure: a scale from 0 to 100 marking the Service Level Indicator, the Service Level Objective, and the Error Budget*]
(*Use this to improve the service)
DEV • BIZ • OPS
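The relationship among the three terms on this slide can be made concrete with a little arithmetic. The sketch below is illustrative only: the function names, the 99.9% objective, and the request counts are assumptions chosen for the example, not figures from the talk.

```python
def sli_percent(total_events: int, failed_events: int) -> float:
    """Service Level Indicator: the measured percentage of good events."""
    return 100.0 * (total_events - failed_events) / total_events


def error_budget(slo_percent: float, total_events: int) -> float:
    """Events that may fail in the period while still meeting the SLO."""
    return total_events * (100.0 - slo_percent) / 100.0


def budget_remaining(slo_percent: float, total_events: int, failed_events: int) -> float:
    """Fraction of the error budget still unspent (negative once overspent)."""
    budget = error_budget(slo_percent, total_events)
    if budget == 0:
        # A 100% objective leaves no budget at all.
        return 0.0 if failed_events == 0 else float("-inf")
    return 1.0 - failed_events / budget


if __name__ == "__main__":
    # Assumed example: 1,000,000 requests against a 99.9% objective.
    print("SLI:", sli_percent(1_000_000, 250))
    print("Budget (events):", error_budget(99.9, 1_000_000))
    print("Budget remaining:", budget_remaining(99.9, 1_000_000, 250))
```

With a shared number like this, Dev, Biz, and Ops can agree in advance on what happens when the remaining budget approaches zero, which is the "consequences" part of the first principle.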
36. Principles of SRE are what set SRE apart
1. SRE needs Service Level Objectives, with consequences
2. SREs have time to make tomorrow better than today
38. Toil: Name For a Problem We’ve All Felt
“Toil is the kind of work tied to running a production service that tends to be manual, repetitive, automatable, tactical, devoid of enduring value, and that scales linearly as a service grows.”
- Vivek Rau, Google
39. Toil vs. Engineering Work
Toil | Engineering Work
Lacks Enduring Value | Builds Enduring Value
Rote, Repetitive | Creative, Iterative
Tactical | Strategic
Increases With Scale | Enables Scaling
Can Be Automated | Requires Human Creativity
42. Excessive Toil Prevents Fixing the System
[Figure: two capacity bars, each split between Toil and Engineering Work (E.W.)]
Toil at manageable percentage of capacity: engineering work is available to reduce toil and improve the business.
Toil at unmanageable percentage of capacity (“Engineering Bankruptcy”): no capacity to reduce toil, no capacity to improve business.
Downward spiral is inevitable!
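The tipping point described on this slide can be illustrated with a toy model. Everything in it is an assumption made for the sketch: capacity is fixed, toil grows by a constant fraction each period, and each hour of engineering work removes a fixed amount of future toil. It is not a model presented in the talk.

```python
from typing import List


def simulate_toil(capacity: float, toil: float, growth: float,
                  payoff: float, periods: int) -> List[float]:
    """Return the toil hours observed in each period.

    capacity: total team hours per period (assumed constant)
    toil:     toil hours in the first period
    growth:   fractional toil growth per period if nothing is done
    payoff:   toil hours removed per engineering hour invested
    """
    history = []
    for _ in range(periods):
        toil = min(toil, capacity)       # toil cannot exceed what the team can work
        history.append(toil)
        engineering = capacity - toil    # whatever is left goes to engineering work
        toil = max(0.0, toil * (1 + growth) - engineering * payoff)
    return history


if __name__ == "__main__":
    manageable = simulate_toil(100.0, 30.0, growth=0.10, payoff=0.1, periods=10)
    bankrupt = simulate_toil(100.0, 80.0, growth=0.10, payoff=0.1, periods=10)
    print("starts at 30% toil:", [round(t, 1) for t in manageable])
    print("starts at 80% toil:", [round(t, 1) for t in bankrupt])
```

Under these assumed parameters, the team that starts with spare engineering capacity drives toil down, while the team that starts near full toil never frees enough time to reverse the growth.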
46. Toil is a naturally occurring force
General Evolution of Automation
1. No automation
2. Externally maintained system-specific automation
3. Externally maintained generic automation
4. Internally maintained system-specific automation
5. Systems that don’t need any automation
- Niall Murphy, Microsoft Azure
[Figure: service life cycle from Launch (ToDos & Unknowns) to Mature, with Toil appearing at each stage of the cycle]
48. Principles of SRE are what set SRE apart
1. SRE needs Service Level Objectives, with consequences
2. SREs have time to make tomorrow better than today
3. SRE teams have the ability to regulate their workload
53. SRE teams have the ability to regulate their workload
What if handing-off responsibility to SRE/Ops wasn’t a right?
(separate the “running in production” from “run by SRE/Ops”)
InfoSec • Compliance • Risk • Dev: “How??” “Wait… What??” “That’s nuts.”
55. Principles of SRE are what set SRE apart
1. SRE needs Service Level Objectives, with consequences
2. SREs have time to make tomorrow better than today
3. SRE teams have the ability to regulate their workload
Source: Stephen Thorne, “Principles of SRE,” DevOps Enterprise Summit London 2018
https://youtu.be/c-w_GYvi0eA
61. Where to start (the practical approach)
1. SRE needs Service Level Objectives, with consequences
↳ Company-wide culture change (hard!)
2. SREs have time to make tomorrow better than today
↳ Reduce toil. Everybody wins!
3. SRE teams have the ability to regulate their workload
↳ Company-wide culture change (hard!)
64. Why focus on reducing toil?
1. Lots of value independent of “SRE”
2. Your people are your most expensive assets … stay out of their way!
70. Your people are expensive, stay out of their way!
Delivering planned work:
[Figure: two paths from Backlog to done (✅), labeled “Not this:” and “This:”, showing the Ticket Queues that work passes through on the way]
Responding to incidents: the SRE OODA Loop
Observe: Invest in the right instrumentation
Orient: Invest in collaboration, checklists, investigatory tools
Decide: Empower them to make decisions!
Action: Empower them to take action!
72. Operations: The Last Mile
DevOps Enterprise Summit 2018 Las Vegas
https://rundeck.co/damon_at_does18
But there is a lot that gets in the way…
tl;dr: Silos and Queues are major causes of dysfunction.
78. Silos cause disconnects and mismatches
[Figure: Silo A (“I need X”) sends requests for X to Silo B (“I do X”). Each silo has its own Backlog, Information, Priorities, and Tools. The two sides are mismatched in Context, Process, Tooling, and Capacity, and the result is Toil.]
80. Ticket queues are how we cope
[Figure: Function A in Silo A and Function B in Silo B, connected by a Ticket Queue]
84. Ticket queues = interruptions, waiting, and toil
[Figure: Function A in Silo A and Function B in Silo B, connected by a Ticket Queue]
Snowflakes: technically acceptable, but brittle and unreproducible
91. Super easy to get started reducing toil
1. Track toil levels for each team
2. Set toil limit for each team
3. Fund efforts to reduce toil (with emphasis on teams already over limit)
↳ Refactor apps, tools, and processes
↳ Apply self-service design pattern
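Steps 1 to 3 above can be sketched in a few lines. This is a minimal illustration, not a tool from the talk: the `TeamWeek` record, the team names, the hours, and the default 50% limit (which echoes the 50% figure in the talk abstract) are all assumptions.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class TeamWeek:
    """One team's self-reported hours for a week (hypothetical record)."""
    team: str
    toil_hours: float
    total_hours: float


def toil_fraction(entry: TeamWeek) -> float:
    """Step 1: track the toil level as a share of total capacity."""
    return entry.toil_hours / entry.total_hours


def teams_over_limit(entries: List[TeamWeek], limit: float = 0.5) -> List[Tuple[str, float]]:
    """Steps 2 and 3: apply the toil limit and rank the teams that exceed it,
    worst first, so funding can go to them before anyone else."""
    over = [(e.team, toil_fraction(e)) for e in entries if toil_fraction(e) > limit]
    return sorted(over, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    week = [
        TeamWeek("platform", toil_hours=30, total_hours=40),
        TeamWeek("database", toil_hours=10, total_hours=40),
        TeamWeek("network", toil_hours=24, total_hours=40),
    ]
    for team, fraction in teams_over_limit(week):
        print(f"{team}: {fraction:.0%} toil, over the limit")
```

How toil hours are collected (surveys, ticket tags, time tracking) is left open here; the point is only that a per-team number and a limit make the funding decision in step 3 straightforward.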
96. “Do this for me, do it again, then do it again.”
[Figure, Before: every “I need you to do X” becomes a Ticket. The requester waits, your other work is interrupted, you do X (“Done.”), and later the same request arrives again. Toil on both sides.]
[Figure, After: requesters do X themselves through Self-Service, again and again, while your other work continues (x2, x3).]
100. “I could fix it, but I can’t get to it.”
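[Figure, Before: “I could fix it if I could get to it.” Reaching the Environment means waiting and interrupting someone else. Toil on both sides.]
[Figure, After: “I’ve got this!” The Environment is reached through Self-Service.]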
105. “I’m an expert, I don’t read the wiki.”
[Figure, Before: changes are written into the docs (“Service has changed. Use this flag or bad things will happen!” and “Pause monitoring first or we all get woken up!”). Later the expert runs “restart -doit -now” against the Environment from memory: “I’ve done this before. I’ve got this…” Toil on both sides.]
[Figure, After: the same changes are applied as an update to the Self-Service Restart Job. Later the expert runs “restart” through Self-Service and gets the updated procedure (✅): “I’ve done this before. I’ve got this.”]
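The “After” picture amounts to putting the procedure into the job rather than the wiki. The sketch below illustrates that idea only; the job registry, step functions, and log messages are hypothetical and do not represent Rundeck's job format or API.

```python
from typing import Callable, Dict, List

# A step receives the run log and appends what it did (stand-in for real work).
Step = Callable[[List[str]], None]


def pause_monitoring(log: List[str]) -> None:
    log.append("monitoring paused")


def restart_service(log: List[str]) -> None:
    log.append("service restarted")


def restart_service_with_new_flag(log: List[str]) -> None:
    log.append("service restarted with the new flag")


def resume_monitoring(log: List[str]) -> None:
    log.append("monitoring resumed")


# Hypothetical registry of self-service jobs, keyed by the name callers use.
JOBS: Dict[str, List[Step]] = {"restart": [restart_service]}


def update_job(name: str, steps: List[Step]) -> None:
    """The service owner changes the procedure in one place."""
    JOBS[name] = list(steps)


def run_job(name: str) -> List[str]:
    """Callers keep using the same name and always get the current steps."""
    log: List[str] = []
    for step in JOBS[name]:
        step(log)
    return log


if __name__ == "__main__":
    update_job("restart", [pause_monitoring, restart_service_with_new_flag, resume_monitoring])
    print(run_job("restart"))
```

The expert who never reads the wiki still types “restart”, but the flag and the monitoring pause now come along automatically.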
109. “Dev work is more expensive than Ops work”
[Figure, Before: Service A fails (!!) over and over. Each time OPS says “I’ll fix it AGAIN” and works through the manual steps (1. Step, 2. Step, 3. …) until it is resolved (✅). DEV: “Fix it? Not in budget. Ops has a workaround.” The result is Interruptions and Toil.]
[Figure, After: the fix steps are captured as a Self-Service job. When Service A fails again later, the steps run through Self-Service while you carry on with your other work.]
115. Self-Service Operations Design Pattern
[Figure: a Consumer of Ops Capabilities runs a Self-Service Operation on demand; the operation packages Ops Capabilities together with Specialist Knowledge]
• Pull-Based
• Accept tools/languages that teams want to use
• Let people who “push buttons” define the buttons
• Build in security and compliance
• Define “guardrails” to provide work safety
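One way to picture the pattern is as a thin wrapper around an ops capability that checks guardrails and records every use. The class below is a hedged sketch under assumed names (`SelfServiceOperation`, role and environment allow-lists, an in-memory audit log); it is not how Rundeck or any specific product implements self-service.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Set


@dataclass
class SelfServiceOperation:
    """An ops capability packaged for on-demand use by other teams."""
    name: str
    action: Callable[[Dict[str, str]], str]   # the specialist's procedure
    allowed_roles: Set[str]                   # security: who may run it
    allowed_environments: Set[str]            # guardrail: where it may run
    audit_log: List[str] = field(default_factory=list)  # compliance record

    def run(self, user: str, role: str, params: Dict[str, str]) -> str:
        if role not in self.allowed_roles:
            self.audit_log.append(f"DENIED {user} {self.name}: role {role}")
            raise PermissionError(f"{role} may not run {self.name}")
        env = params.get("environment", "")
        if env not in self.allowed_environments:
            self.audit_log.append(f"DENIED {user} {self.name}: environment {env}")
            raise ValueError(f"{self.name} is not allowed in {env!r}")
        result = self.action(params)
        self.audit_log.append(f"OK {user} {self.name} {env}")
        return result


if __name__ == "__main__":
    clear_cache = SelfServiceOperation(
        name="clear-cache",
        action=lambda p: f"cache cleared in {p['environment']}",
        allowed_roles={"developer"},
        allowed_environments={"dev", "stage"},
    )
    print(clear_cache.run("ana", "developer", {"environment": "stage"}))
    print(clear_cache.audit_log)
```

The consumer pulls the operation when needed, the people who maintain the capability define what the "button" does, and the checks and audit trail travel with it.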
120. Strategic: Improve incident response times
Jody Mulkey at DOES ’15 SF
https://youtu.be/USYrDaPEFtM
[Figure: DEV, STAGE, and PROD environments, each with Services, Monitoring, and Scripts/Tools, used by Dev & QA, NOC/Ops, and Dev through Self-Service. Labels: “Promote approved jobs” and “Empower”.]
• Reduced MTTR by 92%
• Reduced escalations by 50%
• Reduced overall support costs by 55%
124. Strategic: Reduce compliance burden & improve consistency
Shaun Norris at DOES ’18 Las Vegas
https://youtu.be/d5IMvK0YHTg
Optimized for compliance
• 86,000+ employees
• 60+ countries
• Highly regulated
[Figure: lines of business (LOB #1, LOB #2, LOB #3, LOB …n) reach Services and Scripts/Tools in data centers and clouds through a shared Self-Service layer, labeled Compliance and Consistency]
12 months:
• Saved 28 person-years of time
• 13,000+ ops tasks in privileged environments that didn’t require a review
• ~200 fewer customer-impacting events
125. Where I need your help…
Working on documenting the Self-Service Operations design pattern.
Read for free online: rundeck.com/self-service
Give feedback.