Presented by Damon Edwards, co-founder of Rundeck, at DevOps Days Dallas on August 20, 2019.
Some DevOps transformations flourish, but others are stalling. Why is that? This talk will make the case that Operations is the most predictable differentiator.
So much of the energy in DevOps has been about activities that start in Dev and move towards Ops: continuous delivery, deployment pipelines, automated testing, and of course, the unofficial mantra of “deploy, deploy, deploy.” However, post-deployment, too many DevOps transformations maintain the status quo and leave questionable Operations practices in place.
Now along comes a new vision for Operations called SRE (a.k.a. Site Reliability Engineering)… But SRE seems almost too good to be true!
SREs cover much of what systems administrators used to do, but get to spend most of their time doing engineering work that adds enduring value to their company? How is it that SREs don’t get caught up in the interruptions, repetitive work, and drudgery that consume so much of our time? And how do companies use SRE to do so much more with the same or less headcount?
This talk will take a close look at what SRE is, what SRE isn’t, and how SRE avoids the pitfalls that have plagued traditional Ops work. Finally, we’ll break down the principles behind the SRE movement and highlight how early examples are proving that DevOps + SRE = the end-to-end speed and quality promised since the early days of DevOps.
See a demo of Rundeck Enterprise:
https://www.rundeck.com/see-demo
--or--
Download Rundeck Open Source here:
https://rundeck.com/open-source
Connect:
Stack Overflow community: https://stackoverflow.com/questions/tagged/rundeck
GitHub: https://github.com/rundeck/rundeck/issues
Twitter: https://twitter.com/Rundeck
Facebook: https://www.facebook.com/RundeckInc/
LinkedIn: https://www.linkedin.com/company/rundeck-inc
8. “SRE… When you ask software engineers to do operations”
“SRE… Next-generation, cloud-native Operations”
Class SRE implements DevOps
“SRE… When Ops does more engineering than Ops”
9. SRE
13. Thursday 10:00am PDT (1200 Agents)
Call Center Agents, Customers: “My browser times out!” / “Wow, this is so slow!” / “I can’t login” / “What a c#@p service!” / “I can’t login” / “Barely works” / “It’s broken”
14. Call Center Agents, Customers, VIP Customers: “Stuff isn’t working”
Technical Support, Service Desk: Many tickets, many calls
18. Friday 9:00am PDT
Call Center Agents, Customers, VIP Customers: “My browser times out!” / “Wow, this is so slow!” / “I can’t login” / “Are you kidding me?” / “How hard is it to run a website?” / “Soo Sloooow” / “It’s broken”
21. Dev: “No code updates”
“Probably not the new server hardening process or the network changes…”
Ops: “Uhh.. WHAT new server hardening process and network changes?”
Sec: “We were going to fail audit… you didn’t get the email?”
25. Monday 10:00am PDT
Call Center Agents, Customers, VIP Customers: “… so frustrating” / “Not again…” / “I can’t login” / “Are you kidding me?” / “How hard is it to run a website?” / “Soo Sloooow” / “It’s broken”
27. Service Desk: “…but monitoring is all green” (OK, OK, OK, OK, OK)
Customer Systems Lead Dev (Scrum): ding! Ignore.
Incident Commander: “Hey did you see that ticket?”
29. Customer Systems Lead Dev: “sigh. I’ll take a look”
Customer Systems Lead Dev: “Something is wrong with the database connection… But our code didn’t change.”
DBA: “No recent database updates.”
36. Wednesday 10:00am PDT
Call Center Agents, Customers: “… so frustrating” / “Not again…” / “I can’t login” / “Are you kidding me?” / “How hard is it to run a website?” / “Soo Sloooow” / “It’s broken”
Bridge Call, War Room (Headcount: 15)
Dev: “No code updates”
“Let’s see what the vendor consultant says”
Call Center Manager: “What is going on?”
Call Center Director: “What is being done?”
Vendor Consultant: “OK, let me take a look.”
38. “So?”
Vendor Consultant: “It’s been choking on a particular stored procedure you use everywhere… This stored procedure has almost 400 parameters. It’s 1 million lines of code”
DBA, Dev: “but… it’s been working for years!” (???)
39. “but… it’s been working for years!” (???)
Ops, SysEng, QA, DBA, Dev: change config, load test
1:00am (Headcount: 10)
42. Dev vs Ops
Vendor Consultant: “You should really fix that…”
Dir Finance: “No budget”
GM, Line of Business: “Stay on schedule”
Ops: “It’s not fixed. It’s just turned off.”
VP Ops: “I’m told bug #8543 is P1, but was rejected?”
Ops: “Refactor it before it bites us again.”
VP Dev: “It’s not a bug. You already have a fix.”
Dev: “No time.”
Dev: “Their change broke it.”
Dev wins. Dev wins.
48. Thursday 10:00am PDT (1200 Agents)
Call Center Agents, Customers: “My browser times out!” / “Wow, this is so slow!” / “I can’t login” / “What a c#@p service!” / “I can’t login” / “Barely works” / “It’s broken”
Call Center Agents, Customers, VIP Customers: “Stuff isn’t working”
Technical Support, Service Desk: Many tickets, many calls
Service Desk: “…but monitoring is all green” (OK, OK, OK, OK, OK)
3:30pm: Call Center Agent, Customer: “Now it works” / “Now it works”
Service Desk, Ops: ?
Service Desk: “Escalate!” (Ticket, Incident Commander)
Incident Commander: “Launch the incident bridge”
Bridge Call (Ops, Dev, Sec): “Not me…” / “Not me…” / “Not me…” / “Not me…”
Dev: “No code updates”
“Probably not the new server hardening process or the network changes…”
Ops: “Uhh.. WHAT new server hardening process and network changes?”
Sec: “We were going to fail audit… you didn’t get the email?”
War Room (SysAdmin, Platform, Network, Security, Storage, SysEng): “Try this” / Test
Incident Commander: “Theory: new security updates”
Call Center Manager: “What is going on?”
Ops: Rollback: OS changes, Network changes
3:30pm: Call Center Agent, Customer: “Now it works” / “Now it works”
Over the weekend: QA
Headcount: 40 / Headcount: 30 / Headcount: 10

Monday 10:00am PDT
Call Center Agents, Customers: “… so frustrating” / “Not again…” / “I can’t login” / “Are you kidding me?” / “How hard is it to run a website?” / “Soo Sloooow” / “It’s broken”
Call Center Agents, Customers, VIP Customers: “Stuff isn’t working”
Technical Support, Service Desk: Many tickets, many calls
Service Desk: “…but monitoring is all green” (OK, OK, OK, OK, OK)
Bridge Call (DBA, SysAdmin, Network, Security, SysEng): “Try this”
“New Theory: It’s the database connection”
Customer Systems Lead Dev (Scrum): ding! Ignore.
Incident Commander: “Hey did you see that ticket?”
Customer Systems Lead Dev: “sigh. I’ll take a look”
Customer Systems Lead Dev: “Something is wrong with the database connection… But our code didn’t change.”
DBA: “No recent database updates.”

Tuesday 10:00am PDT
Call Center Agents, Customers: “… so frustrating” / “Not again…” / “I can’t login” / “Are you kidding me?” / “How hard is it to run a website?” / “Soo Sloooow” / “It’s broken”
Bridge Call, War Room (DBA, SysAdmin, SysEng): “Try this” / Test
Dev: “No code updates”
Incident Commander: “New Theory: problem with stored procedures… but not sure what”
Incident Commander, Vendor Management: “DB Vendor phone support isn’t cutting it.” / “We only paid for bronze support”
Call Center Manager: “What is going on?”
Call Center Director: “What is being done?”
Approval Request: “Need to upgrade support”
Finance: ??

Wednesday 10:00am PDT
Call Center Agents, Customers: “… so frustrating” / “Not again…” / “I can’t login” / “Are you kidding me?” / “How hard is it to run a website?” / “Soo Sloooow” / “It’s broken”
Bridge Call, War Room (Headcount: 15)
Dev: “No code updates”
“Let’s see what the vendor consultant says”
Call Center Manager: “What is going on?”
Call Center Director: “What is being done?”
Vendor Consultant: “OK, let me take a look.”
“So?”
Vendor Consultant: “It’s been choking on a particular stored procedure you use everywhere… Someone toggled on the new performance analysis feature. This stored procedure has almost 400 parameters. It’s 1 million lines of code”
DBA, Dev: “but… it’s been working for years!” (???)
Ops, SysEng, QA: change config, load test
3:00pm (Headcount: 10)

Friday 9:00am PDT
Call Center Agents, Customers: “My browser times out!” / “Wow, this is so slow!” / “I can’t login” / “Are you kidding me?” / “How hard is it to run a website?” / “Soo Sloooow” / “It’s broken”
Call Center Agents, Customers, VIP Customers: “Stuff isn’t working”
Technical Support, Service Desk: Many tickets, many calls
49. Response labor: $270,000
Lost call center productivity: $620,000
Total: $890,000
50. $890,000 (+ project delays)
51. $890,000 (+ project delays) (+ brand damage)
52. $890,000 (+ project delays) (+ brand damage)
> $1,000,000
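The priced part of that total can be checked with a quick sketch. The two dollar figures are the ones on the slide; the variable and function names are illustrative assumptions, not anything from the talk. Project delays and brand damage are not priced on the slide, so the sketch leaves them out rather than guess at them.

```python
# Dollar figures come from the slide; all names here are hypothetical.
response_labor = 270_000
lost_call_center_productivity = 620_000


def incident_cost(*line_items: int) -> int:
    """Sum the priced line items of an incident, in dollars."""
    return sum(line_items)


total = incident_cost(response_labor, lost_call_center_productivity)
print(f"${total:,}")  # $890,000
```

Any estimate for the unpriced items would be added as further arguments to the same call, which is how the slide's running total climbs past $1,000,000.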
60. 26 ITIL Processes
Service Validation & Testing
Strategy Management for IT Services
Supplier Management
The 7 Step Improvement
Transition Planning & Support
Access Management
Availability Management
Business Relationship Management
Capacity Management
Change Management
Change Evaluation
Demand Management
Design Coordination
Event Management
Financial Management for IT Services
Incident Management
Information Security Management
IT Service Continuity Management
Knowledge Management Process
Problem Management Process
Release & Deployment Management
Request Fulfillment Process
Service Asset & Configuration Management
Service Catalog Management
Service Level Management
Service Portfolio Management
ITIL Processes
The same as everyone else.
65. 26 ITIL Processes
Encourages
Silos
Context
Context
Process
Process
Tooling
Tooling
Capacity
Capacity
66. 26 ITIL Processes
Command and Control Management
69. 26 ITIL Processes
Service Validation & Testing
Strategy Management for IT Services
Supplier Management
The 7 Step Improvement
Transition Planning & Support
Access Management
Availability Management
Business Relationship Management
Capacity Management
Change Management
Change Evaluation
Demand Management
Design Coordination
Event Management
Financial Management for IT Services
Incident Management
Information Security Management
IT Service Continuity Management
Knowledge Management Process
Problem Management Process
Release & Deployment Management
Request Fulfillment Process
Service Asset & Configuration Management
Service Catalog Management
Service Level Management
Service Portfolio Management
Encourages Silos (Context, Process, Tooling, Capacity)
Command and Control Management
Deming: “3. Cease dependence on inspection to achieve quality.”
Charity Majors: “Distributed systems have an infinite list of almost impossible failure scenarios”
71. The Rise of a New IT Operations Support Model
By 2015, DevOps will evolve from a niche strategy employed by large cloud providers into a mainstream strategy employed by 20% of Global 2000 organizations.
Why DevOps will emerge:
• DevOps is not usually driven from the top down and, thus, may be more easily accepted by IT operations teams.
• ITIL and other best practices frameworks are acknowledged to have not delivered on their goals, enabling IT organizations to look for new models.
• The growing interest in tools such as Chef, Puppet, etc., will help stimulate demand for OSS-based management.
Why DevOps will not emerge:
• Cultural changes are the hardest to implement, and DevOps requires a significant rethinking of IT operations conventional wisdom.
• There is a large body of work with respect to ITIL and other best practices frameworks that is already accepted within the industry.
• Open source (OSS) management tools, which are more aligned with this approach, have not seen significant enterprise market share traction.
Cameron Haight, March 18, 2011
DevOps vs ITIL?
83. [Diagram labels: Developer (x5), Release Plan (x4), Deploy Feature to Production (x5), Bugs, Old Release Still Running]
Immutable microservice deployment scales, is faster with large teams and diverse platform components
Adrian Cockcroft, DockerCon EU 2014
https://www.youtube.com/watch?v=nMTaS07i3jk
Architecture enables speed.
Speed is the advantage.
Keeps the people out of their own way!
87. Principles are what makes SRE different
1. SRE needs Service Level Objectives, with consequences
Stephen Thorne, Google
At DevOps Enterprise Summit
London 2018
“Principles of SRE”
https://youtu.be/c-w_GYvi0eA
91. SLO and Error Budgets: Tools for Shared Responsibility
[Diagram: a 0 to 100 scale marking the Service Level Indicator, the Service Level Objective, and the Error Budget*]
(*Use this to improve the service)
Shared by Dev, Biz, and Ops
SLO takes priority!!
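The relationship between the three terms on this slide can be sketched in a few lines of Python. This is a minimal illustration, not something shown in the talk: the `ErrorBudget` class, its field names, and the request-counting approach are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class ErrorBudget:
    """Hypothetical helper: names and fields are illustrative assumptions."""

    slo_target: float   # e.g. 0.999 means 99.9% of requests should succeed
    total_events: int   # all requests seen in the SLO window
    bad_events: int     # requests that failed the indicator's check

    @property
    def sli(self) -> float:
        """Measured Service Level Indicator: the fraction of good events."""
        if self.total_events == 0:
            return 1.0
        return 1 - self.bad_events / self.total_events

    @property
    def allowed_bad_events(self) -> float:
        """The error budget: how many failures the SLO tolerates in the window."""
        return (1 - self.slo_target) * self.total_events

    @property
    def remaining(self) -> float:
        """Budget left to spend on releases, experiments, and other risk."""
        return self.allowed_bad_events - self.bad_events

    def slo_takes_priority(self) -> bool:
        """True once the budget is spent and reliability work should come first."""
        return self.remaining <= 0


healthy = ErrorBudget(slo_target=0.999, total_events=1_000_000, bad_events=400)
exhausted = ErrorBudget(slo_target=0.999, total_events=1_000_000, bad_events=1_500)
```

In this sketch `healthy` still has budget to spend on change, while `exhausted` has overspent it, which is the point where the "consequences" attached to the SLO would apply.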
93. Principles of SRE are what set SRE apart
1. SRE needs Service Level Objectives, with consequences
2. SREs have time to make tomorrow better than today
Stephen Thorne, Google
At DevOps Enterprise Summit
London 2018
“Principles of SRE”
https://youtu.be/c-w_GYvi0eA
95. Toil: Name For a Problem We’ve All Felt
“Toil is the kind of work tied to running a production
service that tends to be manual, repetitive,
automatable, tactical, devoid of enduring value, and
that scales linearly as a service grows.”
-Vivek Rau
Google
96. Toil vs. Engineering Work
Toil | Engineering Work
Lacks Enduring Value | Builds Enduring Value
Rote, Repetitive | Creative, Iterative
Tactical | Strategic
Increases With Scale | Enables Scaling
Can Be Automated | Requires Human Creativity
99. Excessive Toil Prevents Fixing the System
[Diagram: Toil and Engineering Work (E.W.) as shares of a team's capacity]
Toil at manageable percentage of capacity: Reduce toil, Improve the business
Toil at unmanageable percentage of capacity (“Engineering Bankruptcy”): No capacity to reduce toil, No capacity to improve business
Downward spiral is inevitable!
102. Toil is a Naturally Occurring Force
General Evolution of Automation
1. No automation
2. Externally maintained system-specific automation
3. Externally maintained generic automation
4. Internally maintained system-specific automation
5. Systems that don’t need any automation
Niall Murphy
Microsoft Azure
[Diagram: Toil marked at four points along a service's life, from Launch (ToDos & Unknowns) to Mature]
104. Principles of SRE are what set SRE apart
1. SRE needs Service Level Objectives, with consequences
2. SREs have time to make tomorrow better than today
3. SRE teams have the ability to regulate their workload
Stephen Thorne, Google
At DevOps Enterprise Summit
London 2018
“Principles of SRE”
https://youtu.be/c-w_GYvi0eA
109. SRE teams have the ability to regulate their workload
SRE can say no.
Example: What if handing off responsibility to SRE/Ops wasn’t a right?
(Separate “running in production” from “run by SRE/Ops”.)
111. What's the Difference Between DevOps and SRE?
(class SRE implements DevOps)
@sethvargo @lizthegrey
117. Where to start (the practical approach)
1. SRE needs Service Level Objectives, with consequences: Company-wide culture change (hard!)
2. SREs have time to make tomorrow better than today: Reduce toil. Everybody wins!
3. SRE teams have the ability to regulate their workload: Company-wide culture change (hard!)
127. Track toil levels for each team
• Standardize (e.g. meetings and email are “overhead”, not “toil”)
• Track
  • Self-reporting
  • Periodic surveys
  • SM or PM interview/sampling
• Don’t get lost in time tracking weeds!
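One lightweight way to turn self-reported entries into a toil level per team is sketched below. The three category names follow the slide's distinction between "toil" and "overhead"; the `toil_share` function, the tuple layout, and the sample figures are assumptions made for illustration, not data from the talk.

```python
from collections import defaultdict

# Standardized categories, following the slide: meetings and email count
# as "overhead", not "toil". The exact set is an assumption for this sketch.
CATEGORIES = {"toil", "overhead", "engineering"}


def toil_share(entries):
    """Return {team: fraction of reported hours that were toil}.

    `entries` is an iterable of (team, category, hours) tuples, e.g. the
    result of a periodic survey. Coarse estimates are fine here; the goal
    is a trend per team, not precise time tracking.
    """
    total_hours = defaultdict(float)
    toil_hours = defaultdict(float)
    for team, category, hours in entries:
        if category not in CATEGORIES:
            raise ValueError(f"unknown category: {category!r}")
        total_hours[team] += hours
        if category == "toil":
            toil_hours[team] += hours
    return {
        team: toil_hours[team] / total
        for team, total in total_hours.items()
        if total > 0
    }


# Hypothetical survey results for two teams.
survey = [
    ("db", "toil", 30),
    ("db", "engineering", 10),
    ("web", "overhead", 5),
    ("web", "toil", 5),
    ("web", "engineering", 10),
]
shares = toil_share(survey)
```

Rejecting unknown categories keeps the definitions standardized across teams, which is what makes the resulting numbers comparable.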
131. Start reducing toil today
1. Track toil levels for each team
2. Set toil limit for each team (50% is conventional wisdom)
3. Fund efforts to reduce toil (with emphasis on teams already over limit)
Example process: “Code Yellow” (Michael Kehoe and Todd Palino, LinkedIn, at SREcon Americas 2019)
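Steps 2 and 3 can be sketched as a simple check against the limit. The 50% default comes from the slide; the `teams_over_limit` function, the team names, and the toil figures are hypothetical and only illustrate how "emphasis on teams already over limit" might be turned into a ranked list.

```python
TOIL_LIMIT = 0.50  # the slide's "conventional wisdom" limit


def teams_over_limit(toil_by_team, limit=TOIL_LIMIT):
    """Return (team, excess) pairs for teams above the limit, worst first.

    `toil_by_team` maps a team name to the fraction of its capacity spent
    on toil. Teams at the top of the result are the first candidates for
    funded toil-reduction work.
    """
    over = [
        (team, share - limit)
        for team, share in toil_by_team.items()
        if share > limit
    ]
    return sorted(over, key=lambda item: item[1], reverse=True)


# Hypothetical toil levels for three teams.
levels = {"db": 0.75, "web": 0.25, "net": 0.625}
priorities = teams_over_limit(levels)
```

With these sample figures, `db` and `net` are over the limit and `web` is not, so funding would go to `db` first.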
142. “Fix this for me, fix it again, then fix it again.”
Before: “I need you to do X” → Ticket → Wait → Interrupt → Do X → “Done.” The cycle repeats for every request (Later… Later… Later…), each one interrupting your other work.
After: Self-Service. Requesters do X themselves, leaving room for your other work (x2, x3).
145. “I could fix it, but I can’t get to it.”
Before: “I could fix it if I could get to it” (Environment out of reach: Wait, Interrupt)
After: “I’ve got this!” (Environment reached through Self-Service)
146. “The dog-pile.”
“I think it’s a problem with db07-store2.uswest.acme”
Before: everyone responds to the alert (!!) and runs “$ top” on db07store2.uswest.acme at the same time.
After: Self-Service. A single “healthcheck store2 -all” runs against db07store2.uswest.acme.
149. “I’m an expert, I don’t read the wiki.”
Before:
docs: “Service has changed. Use this flag or bad things will happen!” and “Pause monitoring first or we all get woken up!”
Later… the expert (“I’ve done this before. I’ve got this…”) runs “restart -doit -now” against the Environment.
After:
The same two warnings become an Update to the Self-Service Restart Job.
Later… the expert (“I’ve done this before. I’ve got this.”) runs “restart” through Self-Service against the Environment ✅
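The "After" picture amounts to moving the wiki's warnings into the job itself, so the expert cannot skip them. A minimal Python sketch of that idea follows. Everything in it is hypothetical: `RecordingMonitor` stands in for a real monitoring API, `--new-flag` stands in for whatever flag the changed service requires, and `run_command` is injected so the example runs without touching a real system.

```python
from contextlib import contextmanager


class RecordingMonitor:
    """Stand-in for a real monitoring API (hypothetical); it only records calls."""

    def __init__(self):
        self.calls = []

    def pause(self, service):
        self.calls.append(("pause", service))

    def resume(self, service):
        self.calls.append(("resume", service))


@contextmanager
def monitoring_paused(monitor, service):
    """Pause alerting for the duration of the block, then always resume it."""
    monitor.pause(service)
    try:
        yield
    finally:
        monitor.resume(service)


def restart_job(service, monitor, run_command, required_flags=("--new-flag",)):
    """Self-service restart that carries the steps the docs asked people to remember.

    `required_flags` is where the service owner records "use this flag";
    updating the job updates what every caller runs.
    """
    with monitoring_paused(monitor, service):
        return run_command(["restart", *required_flags, service])


# Usage with a fake command runner that just records what it was asked to run.
monitor = RecordingMonitor()
executed = []
result = restart_job("store2", monitor, lambda cmd: executed.append(cmd) or "ok")
```

The caller still just asks for a restart; the pause, the flag, and the resume travel with the job rather than with the person.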
152. Recap: Make Tomorrow Better Than Today
• SRE is a new way to think about Ops work
  1. SRE needs Service Level Objectives, with consequences
  2. SREs have time to make tomorrow better than today
  3. SRE teams have the ability to regulate their workload
• Beware: impact of traditional management structures
• Be practical and start focusing on toil
• Find and fix toil anti-patterns
• Empower with Self-Service Runbooks
• Use DevOps and SRE to improve speed and quality