Operations: The Last Mile
Damon Edwards
@damonedwards


Developers have had an unfair advantage.
Ops
Ah-ha!
Dev
Ka-ching!
Agile
2001
ITIL
1989
Ops
Business
Idea
Shorter Time-to-Market
Fast Feedback
from Users
Dev Ops
Running
Services
Improved Quality
Digital and DevOps
Availability Auditing
Security Compliance
"Go faster!"
“Open up!”
“Lock it down!”
2018
Story time….
Digital
Agile
DevOps
SRE
Cloud
Docker
Kubernetes
Microservices
CHANGE
Wow
That is cool
I wish I could
work there
But nobody was talking about what
happened after deployment…
It was just another Tuesday…
NOC
NOC
Biz
Manager
Escalate!
NOC NOC
NOC
(Bob)
Open
Incident
Ticket
9:30am 10:00am
NOC (Bob)
Biz Manager
Ticket
Context Wagon
Yes, but this
looks different
Haven’t there been
some intermittent
errors this week?
v3
?!
NOC
(Bob)
Open
Incident
Ticket
Ticket
Biz
Manager
App-specific
SREs
“Try this.”
“Try that.”
SRE
SysAdmin
with Prod Access
(Steve)
SRE
SRE
SRE
SRE
SRE
SRE
Bridge
Call
Biz
Manager
fixed?
fixed?
NOC (Bob)
Biz Manager
NOC (Bob)
Biz Manager
SysAdmin (Steve)
7 x SRE
Ticket
Context Wagon
Ticket
Context Wagon
Context
Switching
Distraction


“Dog Pile”
Disconnected
Access
Interruption
Waiting
SRE
“It’s a problem
with the Foo
service”
SRE
SRE
Foo
SRE
SRE
SRE
SRE
Bridge
Call
Biz
Manager
Foo
Service
No.
NOC
(Bob)
Update
Ticket
Ticket
Foo
Lead Dev
+ add
12:00pm
NOC (Bob)
Biz Manager
Foo SRE
Ticket
Context Wagon
Can you
fix it?
Partially
Done
Work
Escalation
Waiting
Foo
Lead Dev
(Karen)
ding!
Ignore.
App
Manager
Hey did you see
that ticket?
Foo
Lead Dev
(Karen)
sigh.
I’ll take a look
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo SRE
Scrum
Ticket
Context Wagon
Context
Switching
Interruption
Foo
Lead Dev
(Karen)
I’m going to need
more log files
Ticket
SysAdmin Team
+ add
Update
Ticket
Chat
“Can someone with
access to Foo Service
in Prod01 help me with
ticket #42516?”
SysAdmin
(Lee) Ticket
“logs
attached”
Foo
Lead Dev
(Karen)
Ticket
“no the
other ones”
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo SRE
SysAdmin (Lee)
Ticket
Context Wagon
Interruption
Disconnected
Access
Waiting
Context
Switching
Foo
Lead Dev
(Karen)
Logs
-Who restarted these services? (and why?)
-They didn’t use the correct environment
variables!
-This entire service pool needs to be restarted!
Ticket
Update
Ticket
NOC
(Bob)
Update
Ticket
Ticket
Middleware Team
+ add
“Middleware, please
urgently restart this entire
app pool with the correct
environment variable”
2:00pm
Ticket
Context Wagon
Partially
Done
Work
Waiting
Context
Switching
Interruption
NOC
(Bob)
Middleware
Manager
(Melissa)
No way. It’s the middle
of the day! You need
business approval.
NOC
(Bob)
Update
Ticket
Ticket
SVP for Line of
Business
+ add
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo SRE
SysAdmin (Lee)
Middleware Manager
Ticket
Context Wagon
2:30pm
Context
Switching
Interruption
Extra
Process
Misaligned
Priorities
Update
Ticket
Ticket
SVP for Line of
Business
+ add
SVP
(Susan)
Chief of
Staff
Tech VP
Tech VP
Update
Ticket
Ticket
“Restart approved”
Customer
impact?
Ticket
5:00pm
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo SRE
SysAdmin (Lee)
Middleware Manager
SVP
Chief of Staff
2 x Tech VP
Ticket
Context Wagon
Context
Switching
Interruption
Disconnected
Context
SharePoint
Ticket
Middleware
Manager
(Melissa)
Who knows these
production services
the best?
Ellen!
Middleware Middleware
(Scott)
Ellen
to
Europe
office
Middleware
(Scott)
Trial and error
.doc
5:00pm
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo SRE
SysAdmin (Lee)
Middleware Manager
SVP
Chief of Staff
2 x Tech VP
Middleware (Scott)
Ticket
Context Wagon
Waiting
Manual
Siloed
Knowledge
SharePoint
Middleware
(Scott)
Trial and error
.doc
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo SRE
SysAdmin (Lee)
Middleware Manager
SVP
Chief of Staff
2 x Tech VP
Middleware (Scott)
Ticket
Context Wagon
Middleware
(Scott)
Bar
Service
10 min
Middleware
(Scott)
Waiting for
Acme Service
Acme startup
failed
Bar
Service
6:00pm
Come on.. no.no.no.
What? Why?
Middleware
(Scott)
- Bar app startup timed out. Error says can’t
connect to Acme service.
- I looked at Acme but it seems to be running.
- Is this error message correct? Why can’t Bar
connect?
Ticket
Update
Ticket
Middleware
(Scott)
Bar SRE
+ add
Bar SRE
(Linda)
Middleware
(Scott)
-URGENT: Network
connection issue
between Bar and
Acme
Ticket
Update
Ticket
Network
SRE Team
+ add
6:45pm
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo SRE
SysAdmin (Lee)
Middleware Manager
SVP
Chief of Staff
2 x Tech VP
Middleware (Scott)
Bar SRE (Linda)
Ticket
Context Wagon
The new environment pre-flight
check is preventing startup.
Looks like Bar’s connection to
Acme is being blocked.
Escalation
Task
Switching
Bar SRE
(Linda)
Middleware
(Scott)
-URGENT: Network
connection issue
between Bar and
Acme
Ticket
Update
Ticket
Network
SRE Team
+ add
Bar
Lead Dev
6:45pm
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo SRE
SysAdmin (Lee)
Middleware Manager
SVP
Chief of Staff
2 x Tech VP
Middleware (Scott)
Bar SRE (Linda)
Customers are
calling. What
is going on?
The new environment pre-flight
check is preventing startup.
Looks like Bar’s connection to
Acme is being blocked.
Bar
Lead Dev
(Liu)
Business
Managers
I can comment out
the test… But the
CD pipeline only
goes to QA ENV!
Escalation
Task
Switching
Disconnected
Process
Network Dir
(Carlos)
Middleware
(Scott)
Carlos, I need a favor.
Can you escalate?
Middleware
Manager
(Melissa)
Customers are
calling. What
is going on?
Last week..
Net SRE
VP
VP
Priority!
Different
Incident!
Net SRE Net SRE
Net SRE
It’s the network!
Business
Managers
Your
network is
broken!
Business
Managers
We are already
working on it!
Network VPs
Distraction
Finger
Pointing
Heroics
Waiting
Network
SRE
(Hari)
The firewall is
blocking the traffic
You’ll have to take
it up with the
Firewall Team
-URGENT: Firewall is
blocking connection
between Bar and Acme
Ticket
Open
Firewall
Ticket
Firewall
Team
+ add
Firewall Engineer
(Freddie)
Middleware
(Scott)
Paging on-call…
Open bridge…
Can’t be the firewall, it hasn’t
changed since last Thursday.
No, it’s the firewall.
8:00pm
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo SRE
SysAdmin (Lee)
Middleware Manager
SVP
Chief of Staff
2 x Tech VP
Middleware (Scott)
Bar SRE (Linda)
Network PM (Carlos)
Network SRE (Bob)
Ticket
Context Wagon
Escalation
Task
Switching
Siloed
Knowledge
Interruption
Firewall Engineer
(Freddie)
Middleware
(Scott)
Firewall Engineer
(Freddie)
Middleware
(Scott)
Can’t be the firewall, it hasn’t
changed since last Thursday.
No, it’s the firewall.
There was a rule change last
Thursday that would stop Bar
from talking to Acme.
Can you change it back?
Sure, we make changes on
Thursday…
Chief of
Staff
SVP and VPs are livid… this was
supposed to be a safe change!!
Freddie, we’ve got customers calling.
Update
Firewall
Ticket
Firewall Engineer
(Freddie)
8:00pm
Extra
Process
Misaligned
Priorities
ESCALATE:
Emergency
production firewall
rule change review
Ticket
Update
Firewall
Ticket
NetSec
+ add
Firewall Engineer
(Freddie)
Paging on-call…
NetSec
(Nicole)
This is production so I’ll have
to get others on the Network
CAB…
Chief of
Staff
Firewall
(Freddie)
Middleware
(Scott)
Customer outage!
… I’ll call SVP Susan
Middleware
Manager
VP
VP
Bar
Lead Dev
9:00pm
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo SRE
SysAdmin (Lee)
Middleware Manager
SVP
Chief of Staff
2 x Tech VP
Ticket
Context Wagon
Extra
Process
Escalation
Task
Switching
Misaligned
Priorities
Chief of
Staff
Firewall
(Freddie)
Middleware
(Scott)
Customer outage!
APPROVE: Emergency
firewall rule change
Ticket
Update
Firewall
Ticket
NetSec
(Nicole)
… I’ll call SVP Susan
Middleware
Manager
VP
VP
Bar
Lead Dev
Firewall
(Freddie)
Net L2
(Bob)
Middleware
(Scott)
Firewall
change
Restart Bar
9:30pm
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo SRE
SysAdmin (Lee)
Middleware Manager
SVP
Chief of Staff
2 x Tech VP
Middleware (Scott)
Bar SRE (Linda)
Network PM (Carlos)
Network SRE (Bob)
Firewall (Freddie)
Ticket
Context Wagon
NetSec (Nicole)
Waiting
Middleware
(Scott)
Update
Ticket
Ticket
Customer Engagement
Manager
+ add
Policy
!!
“Ready for
API tests”
9:45pm
Firewall
(Freddie)
Net L2
(Bob)
Middleware
(Scott)
Firewall
change
Restart Bar
I think we
are good!
Middleware
Manager
VP
VP
Bar
Lead Dev
You
“think?”
Manual
Partially
Done
Work
Extra
Process
Escalation
Task
Switching
Customer
Engagement
Manager
(Varsha)
NOC
(Bob)
Customer Engagement
Manager
(Varsha)
Update
Ticket
Ticket
“APIs OK”
Middleware
(Scott)
Update
Ticket
11:00pm
Ticket
Context Wagon
Life
Interruption
Extra
Process
Ticket
“APIs OK”
Middleware
(Scott)
Update
Ticket
Ticket
“Services
restarted OK”
NOC
NOC
Lights are green…
I guess it is fixed.
Close
Ticket
NOC
(Bob)
Zzz
11:30pm
N
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo SRE
SysAdmin (Lee)
Middleware Manager
SVP
Chief of Staff
2 x Tech VP
Middleware (Scott)
Bar SRE (Linda)
Network PM (Carlos)
Network SRE (Bob)
Firewall (Freddie)
Ticket
Context Wagon
NetSec (Nicole)
Cust. Engmt. (Varsha)
Next Day
SVP
(Susan)
Whose fault is this?!
Why are we so bad at change?
What additional processes and approvals are you adding to never let this happen again?!
VP
VP
Dir
Dir
VP
Dir
VP
Later…
We’ve invested in Cloud, Agile,
DevOps, Containers…
Why does everything still take too
long and cost too much?
Executive Team
Our transformation has
largely ignored Ops
Most companies chase the symptoms…
…by following the conventional wisdom:
“We need better tools”
“We need more people”
“We need more discipline and attention to detail”
“We need more change reviews/approvals”
“We’ll wait and see what ITIL v4 says”
Challenge the conventional
wisdom about operations work
Forces That Undermine Operations
Silos, Queues, Excessive Toil, Low Trust
Where are decisions made? Who can take action?
escalate
1° 2° 3° 4°
escalate escalate or
Decisions made here
All work is contextual
rm -rf $PATHNAME
Is this dangerous?
Answer is always “it depends”
John Allspaw
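As a hedged illustration of the “it depends” point (not from the talk), the sketch below classifies the same command differently depending on what PATHNAME holds. The function name `assess_rm_rf`, the `/tmp/scratch` scratch area, and the three risk labels are assumptions made up for this example; real operational context covers far more than the path.

```python
import posixpath


def assess_rm_rf(pathname: str, scratch_root: str = "/tmp/scratch") -> str:
    """Give a rough risk label for `rm -rf $PATHNAME` (illustrative rules only).

    Works on the path string alone: it does not touch the filesystem and
    does not resolve symlinks, so it is a teaching aid, not a safety check.
    """
    if not pathname.strip():
        # An empty variable changes what the command line even means.
        return "it depends: PATHNAME is empty, check how the command is built"

    resolved = posixpath.normpath(pathname)
    if resolved.strip("/") == "":
        # Only slashes are left, so the target is the filesystem root.
        return "dangerous: resolves to the filesystem root"

    scratch = posixpath.normpath(scratch_root)
    if resolved.startswith(scratch + "/"):
        # Hypothetical convention: anything under the scratch area is disposable.
        return "low risk: inside the scratch area"

    return "it depends: ask someone who knows this system"


if __name__ == "__main__":
    for candidate in ["/", "/tmp/scratch/build-123", "/var/lib/app/data", ""]:
        print(repr(candidate), "->", assess_rm_rf(candidate))
```

The command text never changes across these calls; only the surrounding context does, which is why the people closest to that context are best placed to judge it.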
Where are decisions made? Who can take action?
escalate
1° 2° 3° 4°
escalate escalate or
Context
Psychological safety
Psychological safety is a shared belief that the team is safe for
interpersonal risk taking. It can be defined as "being able to show
and employ one's self without fear of negative consequences of
self-image, status or career."
- William Kahn, Boston University, 1990
Google (2016): most important characteristic
to predict team effectiveness?
Psychological safety!
Forces That Undermine Operations
Silos, Queues, Excessive Toil, Low Trust
Toil: Name For a Problem We’ve All Felt
“Toil is the kind of work tied to running a production
service that tends to be manual, repetitive,
automatable, tactical, devoid of enduring value, and
that scales linearly as a service grows.”
- Vivek Rau, Google
Toil vs. Engineering Work
Toil                   | Engineering Work
Lacks Enduring Value   | Builds Enduring Value
Rote, Repetitive       | Creative, Iterative
Tactical               | Strategic
Increases With Scale   | Enables Scaling
Can Be Automated       | Requires Human Creativity
Excessive Toil Prevents Fixing the System
Toil at manageable percentage of capacity:
Toil | Engineering Work (reduce toil, improve the business)
Toil at unmanageable percentage of capacity (“Engineering Bankruptcy”):
Toil | E.W. (no capacity to reduce toil, no capacity to improve business)
Downward spiral is inevitable!
Forces That Undermine Operations
Silos, Queues, Excessive Toil, Low Trust
Silos
Silo A (“I need X”): Backlog, Information, Priorities, Tools
Silo B (“I do X”): Backlog, Information, Priorities, Tools
Requests for X
Silos cause disconnects and mismatches
Each silo has its own:
Context
Process
Tooling
Capacity
Forces That Undermine Operations
Silos, Queues, Excessive Toil, Low Trust
How do we cover for our silos’ disconnects and mismatches?
Silo A | Ticket Queue (??) | Silo B
We all know how well that works
Ticket queues are an expensive way to manage work
Queues Create…
Longer Cycle Time
Increased Risk
More Variability
More Overhead
Lower Quality
Less Motivation
Adapted from Donald G. Reinertsen, The Principles of Product Development Flow: Second Generation Lean Product Development
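One way to see the “Longer Cycle Time” item is Little's Law, a standard queueing result that is not on the slide: for a stable queue, average time in the queue equals the average number of items waiting divided by the average completion rate. The sketch below is a minimal worked example; the function name and the ticket numbers are hypothetical.

```python
def average_cycle_time_days(items_in_queue: float, completed_per_day: float) -> float:
    """Little's Law rearranged: W = L / lambda.

    items_in_queue    -- long-run average number of tickets waiting (L)
    completed_per_day -- long-run average tickets completed per day (lambda)
    Returns the average time a ticket spends in the queue, in days (W).
    """
    if completed_per_day <= 0:
        raise ValueError("completed_per_day must be positive")
    return items_in_queue / completed_per_day


if __name__ == "__main__":
    # Hypothetical numbers: the team closes 5 tickets a day in both cases.
    print(average_cycle_time_days(30, 5))  # backlog of 30 tickets
    print(average_cycle_time_days(60, 5))  # backlog doubles, so does the wait
```

With throughput fixed, every ticket added to the backlog lengthens the average wait for everyone behind it, which is part of why a queue between two silos is costly even when each silo looks busy and efficient.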
What do queues do to value streams?
Queue A
Queue B
Queues disintegrate and obfuscate value streams
Ticket queues are “snowflake makers”
Silo A | Ticket Queue (??) | Silo B
Snowflakes
Technically acceptable, but brittle and unreproducible
Forces That Undermine Operations
Silos, Queues, Excessive Toil, Low Trust
So what can we do differently?
Forces That Undermine Operations
Silos, Queues, Excessive Toil, Low Trust
“Shift Left” the ability to take action
escalate
1° 2° 3° 4°
escalate escalate or
Push the ability to take action this direction
Tools
Enablement and tooling
Forces That Undermine Operations
Silos, Queues, Excessive Toil, Low Trust
Reduce Toil
1. Track toil levels for each team
2. Set toil limits for each team
3. Fund efforts to reduce toil (with emphasis on teams over toil limits)
Bonus: Use Service Level Objectives, Error Budgets, and other lessons from SRE
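The three steps and the error-budget bonus can be sketched in a few lines. This is a minimal illustration under stated assumptions, not a method from the talk: the `TeamToil` record, the team names, and the hours are hypothetical, and the 50% default limit is borrowed from the guideline in Google's SRE book rather than from these slides.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TeamToil:
    """One team's tracked hours for a period (step 1: track toil levels)."""
    team: str
    toil_hours: float
    total_hours: float

    def __post_init__(self) -> None:
        if self.total_hours <= 0:
            raise ValueError("total_hours must be positive")

    @property
    def toil_fraction(self) -> float:
        return self.toil_hours / self.total_hours


def teams_over_limit(teams: List[TeamToil], limit: float = 0.5) -> List[str]:
    """Steps 2 and 3: apply a toil limit and list teams to fund first."""
    return [t.team for t in teams if t.toil_fraction > limit]


def error_budget_minutes(slo: float, window_minutes: float) -> float:
    """Allowed unavailability in a window for an availability SLO like 0.999."""
    if not 0.0 < slo <= 1.0:
        raise ValueError("slo must be in (0, 1]")
    return (1.0 - slo) * window_minutes


if __name__ == "__main__":
    teams = [
        TeamToil("payments", toil_hours=30, total_hours=40),  # 75% toil
        TeamToil("search", toil_hours=10, total_hours=40),    # 25% toil
    ]
    print(teams_over_limit(teams))
    # A 99.9% SLO over 30 days leaves roughly 43 minutes of budget.
    print(round(error_budget_minutes(0.999, 30 * 24 * 60), 1))
```

The point of keeping the calculation this simple is that the numbers only need to be good enough to show which teams are over their limit and where funding should go first.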
Forces That Undermine Operations
Silos, Queues, Excessive Toil, Low Trust
Obvious: Get rid of as many silos as possible
Old Silo A Old Silo B Old Silo C Old Silo D
Cross-Functional Team 1
Cross-Functional Team 2
Cross-Functional Team n
“Horizontal” shared responsibility, not everyone do everything!
Shared and dedicated responsibility is key
“Netflix” Model
Cross-Functional Team 1
Cross-Functional Team 2
Cross-Functional Team n
“Google” Model
Development Team 1
Development Team 2
Development Team n
SRE Team
Clear handoff requirements
Error budget with consequences
Same high-quality, high-velocity results!
But what about the cross-cutting concerns?
Old Silo A Old Silo B Old Silo C Old Silo D
Cross-Functional Team 1
Cross-Functional Team 2
Cross-Functional Team n
3 x Specialist Capabilities
6 x Ticket Queue
Forces That Undermine Operations
Silos, Queues, Excessive Toil, Low Trust
Self-Service Operations: Turn handoffs into self-service
Cross-Functional Product Team 1, Ops (embedded)
Cross-Functional Product Team 2, Ops (embedded)
Cross-Functional Product Team n, Ops (embedded)
On Demand
Self-Service Operations
Ops (operates platform)
3 x Ops Capability (SRE, Dev, or Specialist)
Self-Service Operations: Works with any org model
Development Team 1
Development Team 2
Development Team n
Ops/SRE Team
Cross-Functional Product Team n, Ops (embedded)
On Demand
Self-Service Operations
Ops (operates platform)
3 x Ops Capability (SRE, Dev, or Specialist)
But, what about security and compliance?
Build in
Security
Here
Build in
Compliance
Here
Are all tickets bad?
No. Just use tickets for what they are good for:
1. Documenting true problems/issues/exceptions
2. Routing for necessary approvals
Not as a general-purpose work management system!
Ticket
System
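The rule on this slide can be sketched as a small routing decision. This is only an illustration of the idea: the `Route` names and the two boolean inputs are hypothetical and do not describe any particular ticketing or self-service product.

```python
from enum import Enum


class Route(Enum):
    SELF_SERVICE = "run on demand through self-service"
    APPROVAL_TICKET = "open a ticket to route for approval"
    PROBLEM_TICKET = "open a ticket to document the problem"


def route_request(is_problem: bool, needs_approval: bool) -> Route:
    """Keep tickets for their two good uses; everything else is self-service."""
    if is_problem:
        # Use 1: documenting true problems, issues, and exceptions.
        return Route.PROBLEM_TICKET
    if needs_approval:
        # Use 2: routing for necessary approvals.
        return Route.APPROVAL_TICKET
    # Routine operational work should not wait in a queue.
    return Route.SELF_SERVICE


if __name__ == "__main__":
    print(route_request(is_problem=False, needs_approval=False).value)
    print(route_request(is_problem=False, needs_approval=True).value)
    print(route_request(is_problem=True, needs_approval=False).value)
```

The default branch is the important one: unless a request is a genuine problem or needs an approval, it never enters a ticket queue.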
Strategy: Self-Service improves response times
https://youtu.be/USYrDaPEFtM
Jody Mulkey at DOES ‘15 SF
DEV: Services, Monitoring, Scripts/Tools
STAGE: Services, Monitoring, Scripts/Tools
PROD: Services, Monitoring, Scripts/Tools
Dev & QA, NOC/Ops, Dev
Promote approved jobs
Self-Service
Empower
Strategy: Self-Service improves consistency & compliance
Shaun Norris at DOES ‘18 London
https://youtu.be/d5IMvK0YHTg
Optimized for compliance
• 86,000+ employees
• 60+ countries
• Highly regulated
LOB #1, LOB #2, LOB #3, LOB …n
Self-Service
Compliance, Consistency
3 x Data Center: Services, Scripts/Tools
4 x Cloud: Services, Scripts/Tools
12 months: 13,000+ ops tasks in privileged
environments that didn’t require a review
Where I need your help…
Working on documenting the Self-Service Operations design pattern.
Read for free online:
rundeck.com/self-service
Recap
Don’t forget about Ops.
Challenge conventional wisdom.
Understand the forces undermining operations work
Learn from SRE: Reduce toil to create capacity to change
Focus on removing silos and queues
“Shift-Left” control and decision making.
Leverage the Self-Service Operations design pattern
Let’s talk…
@damonedwards
damon@rundeck.com
rundeck.com/self-service

More Related Content

What's hot

SRE Lessons for the Enterprise
SRE Lessons for the Enterprise SRE Lessons for the Enterprise
SRE Lessons for the Enterprise Rundeck
 
Modern Operations: Solving DevOps’ Last Mile Problem
Modern Operations: Solving DevOps’ Last Mile Problem Modern Operations: Solving DevOps’ Last Mile Problem
Modern Operations: Solving DevOps’ Last Mile Problem Rundeck
 
Incident Management in the Age of DevOps and SRE
Incident Management in the Age of DevOps and SRE Incident Management in the Age of DevOps and SRE
Incident Management in the Age of DevOps and SRE Rundeck
 
Failure Happens: Improving Incident Response In Enterprises
Failure Happens: Improving Incident Response In Enterprises Failure Happens: Improving Incident Response In Enterprises
Failure Happens: Improving Incident Response In Enterprises Rundeck
 
Operations as a Service: Because Failure Still Happens
Operations as a Service: Because Failure Still Happens Operations as a Service: Because Failure Still Happens
Operations as a Service: Because Failure Still Happens Rundeck
 
Operations: The Last Mile Problem For DevOps
Operations: The Last Mile Problem For DevOpsOperations: The Last Mile Problem For DevOps
Operations: The Last Mile Problem For DevOpsRundeck
 
Incident Management in the Age of DevOps and SRE
Incident Management in the Age of DevOps and SRE Incident Management in the Age of DevOps and SRE
Incident Management in the Age of DevOps and SRE Rundeck
 
The "Ops" Side of DevSecOps
The "Ops" Side of DevSecOps The "Ops" Side of DevSecOps
The "Ops" Side of DevSecOps Rundeck
 
Keeping Your DevOps Transformation From Crushing Your Ops Capacity
Keeping Your DevOps Transformation From Crushing Your Ops Capacity Keeping Your DevOps Transformation From Crushing Your Ops Capacity
Keeping Your DevOps Transformation From Crushing Your Ops Capacity Rundeck
 
Self-Service Operations: Because Failure Still Happens (Developer Edition)
Self-Service Operations: Because Failure Still Happens (Developer Edition)Self-Service Operations: Because Failure Still Happens (Developer Edition)
Self-Service Operations: Because Failure Still Happens (Developer Edition)Rundeck
 
Self-Service Operations: Because Ops Still Happens
Self-Service Operations: Because Ops Still HappensSelf-Service Operations: Because Ops Still Happens
Self-Service Operations: Because Ops Still HappensRundeck
 
Agile Infrastructure - Agile 2009
Agile Infrastructure - Agile 2009Agile Infrastructure - Agile 2009
Agile Infrastructure - Agile 2009Andrew Shafer
 
Agile Infrastructure Velocity 09
Agile Infrastructure Velocity 09Agile Infrastructure Velocity 09
Agile Infrastructure Velocity 09Andrew Shafer
 
Agile Infra @AgileRoots 2009
Agile Infra @AgileRoots 2009Agile Infra @AgileRoots 2009
Agile Infra @AgileRoots 2009Andrew Shafer
 
My History with Atlassian Tools, and Why I'm Moving to Studio
My History with Atlassian Tools, and Why I'm Moving to StudioMy History with Atlassian Tools, and Why I'm Moving to Studio
My History with Atlassian Tools, and Why I'm Moving to StudioAtlassian
 
Teaching Elephants to Dance (and Fly!) A Developer's Journey to Digital Trans...
Teaching Elephants to Dance (and Fly!) A Developer's Journey to Digital Trans...Teaching Elephants to Dance (and Fly!) A Developer's Journey to Digital Trans...
Teaching Elephants to Dance (and Fly!) A Developer's Journey to Digital Trans...Burr Sutter
 
[Tel aviv merge world tour] Perforce Keynote
[Tel aviv merge world tour] Perforce Keynote[Tel aviv merge world tour] Perforce Keynote
[Tel aviv merge world tour] Perforce KeynotePerforce
 
Teaching Elephants to Dance (and Fly!): A Developer's Journey to Digital Tran...
Teaching Elephants to Dance (and Fly!): A Developer's Journey to Digital Tran...Teaching Elephants to Dance (and Fly!): A Developer's Journey to Digital Tran...
Teaching Elephants to Dance (and Fly!): A Developer's Journey to Digital Tran...Burr Sutter
 
examkiller 000-938
examkiller 000-938examkiller 000-938
examkiller 000-938jimenoon
 

What's hot (20)

SRE Lessons for the Enterprise
SRE Lessons for the Enterprise SRE Lessons for the Enterprise
SRE Lessons for the Enterprise
 
Modern Operations: Solving DevOps’ Last Mile Problem
Modern Operations: Solving DevOps’ Last Mile Problem Modern Operations: Solving DevOps’ Last Mile Problem
Modern Operations: Solving DevOps’ Last Mile Problem
 
Incident Management in the Age of DevOps and SRE
Incident Management in the Age of DevOps and SRE Incident Management in the Age of DevOps and SRE
Incident Management in the Age of DevOps and SRE
 
Failure Happens: Improving Incident Response In Enterprises
Failure Happens: Improving Incident Response In Enterprises Failure Happens: Improving Incident Response In Enterprises
Failure Happens: Improving Incident Response In Enterprises
 
Operations as a Service: Because Failure Still Happens
Operations as a Service: Because Failure Still Happens Operations as a Service: Because Failure Still Happens
Operations as a Service: Because Failure Still Happens
 
Operations: The Last Mile Problem For DevOps
Operations: The Last Mile Problem For DevOpsOperations: The Last Mile Problem For DevOps
Operations: The Last Mile Problem For DevOps
 
Incident Management in the Age of DevOps and SRE
Incident Management in the Age of DevOps and SRE Incident Management in the Age of DevOps and SRE
Incident Management in the Age of DevOps and SRE
 
The "Ops" Side of DevSecOps
The "Ops" Side of DevSecOps The "Ops" Side of DevSecOps
The "Ops" Side of DevSecOps
 
Keeping Your DevOps Transformation From Crushing Your Ops Capacity
Keeping Your DevOps Transformation From Crushing Your Ops Capacity Keeping Your DevOps Transformation From Crushing Your Ops Capacity
Keeping Your DevOps Transformation From Crushing Your Ops Capacity
 
Self-Service Operations: Because Failure Still Happens (Developer Edition)
Self-Service Operations: Because Failure Still Happens (Developer Edition)Self-Service Operations: Because Failure Still Happens (Developer Edition)
Self-Service Operations: Because Failure Still Happens (Developer Edition)
 
Self-Service Operations: Because Ops Still Happens
Self-Service Operations: Because Ops Still HappensSelf-Service Operations: Because Ops Still Happens
Self-Service Operations: Because Ops Still Happens
 
SRE From Scratch
SRE From ScratchSRE From Scratch
SRE From Scratch
 
Agile Infrastructure - Agile 2009
Agile Infrastructure - Agile 2009Agile Infrastructure - Agile 2009
Agile Infrastructure - Agile 2009
 
Agile Infrastructure Velocity 09
Agile Infrastructure Velocity 09Agile Infrastructure Velocity 09
Agile Infrastructure Velocity 09
 
Agile Infra @AgileRoots 2009
Agile Infra @AgileRoots 2009Agile Infra @AgileRoots 2009
Agile Infra @AgileRoots 2009
 
My History with Atlassian Tools, and Why I'm Moving to Studio
My History with Atlassian Tools, and Why I'm Moving to StudioMy History with Atlassian Tools, and Why I'm Moving to Studio
My History with Atlassian Tools, and Why I'm Moving to Studio
 
Teaching Elephants to Dance (and Fly!) A Developer's Journey to Digital Trans...
Teaching Elephants to Dance (and Fly!) A Developer's Journey to Digital Trans...Teaching Elephants to Dance (and Fly!) A Developer's Journey to Digital Trans...
Teaching Elephants to Dance (and Fly!) A Developer's Journey to Digital Trans...
 
[Tel aviv merge world tour] Perforce Keynote
[Tel aviv merge world tour] Perforce Keynote[Tel aviv merge world tour] Perforce Keynote
[Tel aviv merge world tour] Perforce Keynote
 
Teaching Elephants to Dance (and Fly!): A Developer's Journey to Digital Tran...
Teaching Elephants to Dance (and Fly!): A Developer's Journey to Digital Tran...Teaching Elephants to Dance (and Fly!): A Developer's Journey to Digital Tran...
Teaching Elephants to Dance (and Fly!): A Developer's Journey to Digital Tran...
 
examkiller 000-938
examkiller 000-938examkiller 000-938
examkiller 000-938
 

Similar to Operations: The Last Mile

Atlassian - Software For Every Team
Atlassian - Software For Every TeamAtlassian - Software For Every Team
Atlassian - Software For Every TeamSven Peters
 
Tastypie: Easy APIs to Make Your Work Easier
Tastypie: Easy APIs to Make Your Work EasierTastypie: Easy APIs to Make Your Work Easier
Tastypie: Easy APIs to Make Your Work EasierHarvard Web Working Group
 
Concurrent Ruby Application Servers
Concurrent Ruby Application ServersConcurrent Ruby Application Servers
Concurrent Ruby Application ServersLin Jen-Shin
 
The Last Mile Continued: Incident Management
The Last Mile Continued: Incident Management The Last Mile Continued: Incident Management
The Last Mile Continued: Incident Management Rundeck
 
"Product Architecture: failures and lessons learnt" - Royi Benyossef @Product...
"Product Architecture: failures and lessons learnt" - Royi Benyossef @Product..."Product Architecture: failures and lessons learnt" - Royi Benyossef @Product...
"Product Architecture: failures and lessons learnt" - Royi Benyossef @Product...Product of Things
 
Socket applications
Socket applicationsSocket applications
Socket applicationsJoão Moura
 
Incident Management in the Age of DevOps and SRE
Incident Management in the Age of DevOps and SRE Incident Management in the Age of DevOps and SRE
Incident Management in the Age of DevOps and SRE Rundeck
 
Immutable Infrastructure & Rethinking Configuration - Interop 2019
Immutable Infrastructure & Rethinking Configuration - Interop 2019Immutable Infrastructure & Rethinking Configuration - Interop 2019
Immutable Infrastructure & Rethinking Configuration - Interop 2019RackN
 
The Ember.js Framework - Everything You Need To Know
The Ember.js Framework - Everything You Need To KnowThe Ember.js Framework - Everything You Need To Know
The Ember.js Framework - Everything You Need To KnowAll Things Open
 
Innovation dank DevOps (DevOpsCon Berlin 2015)
Innovation dank DevOps (DevOpsCon Berlin 2015)Innovation dank DevOps (DevOpsCon Berlin 2015)
Innovation dank DevOps (DevOpsCon Berlin 2015)Wooga
 
Evolving Archetecture
Evolving ArchetectureEvolving Archetecture
Evolving Archetectureleo lapworth
 
Mobile Development integration tests
Mobile Development integration testsMobile Development integration tests
Mobile Development integration testsKenneth Poon
 
Games for the Masses (Jax)
Games for the Masses (Jax)Games for the Masses (Jax)
Games for the Masses (Jax)Wooga
 
Progressive Enhancement for JavaScript Apps
Progressive Enhancement for JavaScript AppsProgressive Enhancement for JavaScript Apps
Progressive Enhancement for JavaScript AppsCodemotion
 
Scaling Up Lookout
Scaling Up LookoutScaling Up Lookout
Scaling Up LookoutLookout
 
Continuous Integration and Deployment Best Practices on AWS
Continuous Integration and Deployment Best Practices on AWSContinuous Integration and Deployment Best Practices on AWS
Continuous Integration and Deployment Best Practices on AWSDanilo Poccia
 
Debugging Production Applications in Nomad using Lightrun
Debugging Production Applications in Nomad using LightrunDebugging Production Applications in Nomad using Lightrun
Debugging Production Applications in Nomad using LightrunShaiAlmog1
 
Surviving SOA - delivering (somewhat) continuously on a hostile planet
Surviving SOA - delivering (somewhat) continuously on a hostile planetSurviving SOA - delivering (somewhat) continuously on a hostile planet
Surviving SOA - delivering (somewhat) continuously on a hostile planetTomAkehurst
 

Similar to Operations: The Last Mile (20)

Atlassian - Software For Every Team
Atlassian - Software For Every TeamAtlassian - Software For Every Team
Atlassian - Software For Every Team
 
Tastypie: Easy APIs to Make Your Work Easier
Tastypie: Easy APIs to Make Your Work EasierTastypie: Easy APIs to Make Your Work Easier
Tastypie: Easy APIs to Make Your Work Easier
 
Concurrent Ruby Application Servers
Concurrent Ruby Application ServersConcurrent Ruby Application Servers
Concurrent Ruby Application Servers
 
The Last Mile Continued: Incident Management
The Last Mile Continued: Incident Management The Last Mile Continued: Incident Management
The Last Mile Continued: Incident Management
 
"Product Architecture: failures and lessons learnt" - Royi Benyossef @Product...
"Product Architecture: failures and lessons learnt" - Royi Benyossef @Product..."Product Architecture: failures and lessons learnt" - Royi Benyossef @Product...
"Product Architecture: failures and lessons learnt" - Royi Benyossef @Product...
 
Socket applications
Socket applicationsSocket applications
Socket applications
 
Incident Management in the Age of DevOps and SRE
Incident Management in the Age of DevOps and SRE Incident Management in the Age of DevOps and SRE
Incident Management in the Age of DevOps and SRE
 
Immutable Infrastructure & Rethinking Configuration - Interop 2019
Immutable Infrastructure & Rethinking Configuration - Interop 2019Immutable Infrastructure & Rethinking Configuration - Interop 2019
Immutable Infrastructure & Rethinking Configuration - Interop 2019
 
The Ember.js Framework - Everything You Need To Know
The Ember.js Framework - Everything You Need To KnowThe Ember.js Framework - Everything You Need To Know
The Ember.js Framework - Everything You Need To Know
 
Innovation dank DevOps (DevOpsCon Berlin 2015)
Innovation dank DevOps (DevOpsCon Berlin 2015)Innovation dank DevOps (DevOpsCon Berlin 2015)
Innovation dank DevOps (DevOpsCon Berlin 2015)
 
Chat ops x line
Chat ops x lineChat ops x line
Chat ops x line
 
Evolving Archetecture
Evolving ArchetectureEvolving Archetecture
Evolving Archetecture
 
Mobile Development integration tests
Mobile Development integration testsMobile Development integration tests
Mobile Development integration tests
 
Kruize
KruizeKruize
Kruize
 
Games for the Masses (Jax)
Games for the Masses (Jax)Games for the Masses (Jax)
Games for the Masses (Jax)
 
Progressive Enhancement for JavaScript Apps
Progressive Enhancement for JavaScript AppsProgressive Enhancement for JavaScript Apps
Progressive Enhancement for JavaScript Apps
 
Scaling Up Lookout
Scaling Up LookoutScaling Up Lookout
Scaling Up Lookout
 
Continuous Integration and Deployment Best Practices on AWS
Continuous Integration and Deployment Best Practices on AWSContinuous Integration and Deployment Best Practices on AWS
Continuous Integration and Deployment Best Practices on AWS
 
Debugging Production Applications in Nomad using Lightrun
Debugging Production Applications in Nomad using LightrunDebugging Production Applications in Nomad using Lightrun
Debugging Production Applications in Nomad using Lightrun
 
Surviving SOA - delivering (somewhat) continuously on a hostile planet
Surviving SOA - delivering (somewhat) continuously on a hostile planetSurviving SOA - delivering (somewhat) continuously on a hostile planet
Surviving SOA - delivering (somewhat) continuously on a hostile planet
 

Operations: The Last Mile

  • 1. Operations: The Last Mile Damon Edwards @damonedwards
  • 2. Developers have had an unfair advantage.
  • 6. Ops Business Idea Shorter Time-to-Market Fast Feedback from Users Dev Ops Running Services Improved Quality Digital and DevOps Availability Auditing Security Compliance “Go faster!” “Open up!” “Lock it down!” 2018
  • 9. But nobody was talking about what happened after deployment…
  • 10. It was just another Tuesday…
  • 11. NOC NOC Biz Manager Escalate! NOC NOC NOC (Bob) Open Incident Ticket 9:30am 10:00am NOC (Bob) Biz Manager Ticket Context Wagon Yes, but this looks different Hasn’t there been some intermittent errors this week? v3 ?!
  • 18. NOC (Bob) Open Incident Ticket Ticket Biz Manager App-specific SREs “Try this.” “Try that.” SRE SysAdmin with Prod Access (Steve) SRE SRE SRE SRE SRE SRE Bridge Call Biz Manager fixed? fixed? NOC (Bob) Biz Manager NOC (Bob) Biz Manager SysAdmin (Steve) 7 x SRE Ticket Context Wagon Ticket Context Wagon Context Switching Distraction “Dog Pile” Disconnected Access Interruption Waiting
  • 22. SRE “It’s a problem with the Foo service” SRE SRE Foo SRE SRE SRE SRE Bridge Call Biz Manager Foo Service No. NOC (Bob) Update Ticket Ticket Foo Lead Dev + add 12:00pm NOC (Bob) Biz Manager Foo SRE Ticket Context Wagon Can you fix it? Partially Done Work Escalation Waiting
  • 25. Foo Lead Dev (Karen) ding! Ignore. App Manager Hey did you see that ticket? Foo Lead Dev (Karen) sigh. I’ll take a look NOC (Bob) Biz Manager App Manager Lead Dev (Karen) Foo SRE Scrum Ticket Context Wagon Context Switching Interruption
  • 30. Foo Lead Dev (Karen) I’m going to need more log files Ticket SysAdmin Team + add Update Ticket Chat “Can someone with access to Foo Service in Prod01 help me with ticket #42516?” SysAdmin (Lee) Ticket “logs attached” Foo Lead Dev (Karen) Ticket “no the other ones” NOC (Bob) Biz Manager App Manager Lead Dev (Karen) Foo SRE SysAdmin (Lee) Ticket Context Wagon Interruption Disconnected Access Waiting Context Switch
  • 35. Foo Lead Dev (Karen) Logs -Who restarted these services? (and why?) -They didn’t use the correct environment variables! -This entire service pool needs to be restarted! Ticket Update Ticket NOC (Bob) Update Ticket Ticket Middleware Team + add “Middleware, please urgent restart this entire app pool with the correct environment variable” 2:00pm Ticket Context Wagon Partially Done Work Waiting Context Switching Interruption
  • 40. NOC (Bob) Middleware Manager (Melissa) No way. It’s the middle of the day! You need business approval. NOC (Bob) Update Ticket Ticket SVP for Line of Business + add NOC (Bob) Biz Manager App Manager Lead Dev (Karen) Foo SRE SysAdmin (Lee) Middleware Manager Ticket Context Wagon 2:30pm Context Switching Interruption Extra Process Misaligned Priorities
  • 44. Update Ticket Ticket SVP for Line of Business + add SVP (Susan) Chief of Staff Tech VP Tech VP Update Ticket Ticket “Restart approved” Customer impact? 5:00pm NOC (Bob) Biz Manager App Manager Lead Dev (Karen) Foo SRE SysAdmin (Lee) Middleware Manager SVP Chief of Staff 2 x Tech VP Ticket Context Wagon Context Switching Interruption Disconnected Context
  • 48. Share point Ticket Middleware Manager (Melissa) Who knows these production services the best? Ellen! Middleware Middleware (Scott) Ellen to Europe office Middleware (Scott) Trial and error .doc 5:00pm NOC (Bob) Biz Manager App Manager Lead Dev (Karen) Foo SRE SysAdmin (Lee) Middleware Manager SVP Chief of Staff 2 x Tech VP Middleware (Scott) Ticket Context Wagon Waiting Manual Siloed Knowledge
  • 49. Share point Middleware (Scott) Trial and error .doc NOC (Bob) Biz Manager App Manager Lead Dev (Karen) Foo SRE SysAdmin (Lee) Middleware Manager SVP Chief of Staff 2 x Tech VP Middleware (Scott) Ticket Context Wagon Middleware (Scott) Bar Service 10 min Middleware (Scott) Waiting for Acme Service Acme startup failed Bar Service 6:00pm
  • 52. 8888888 Come on.. no.no.no. What? Why? Middleware (Scott)
  • 55. -Bar app startup timed out. Error says can’t connect to Acme service. - I looked at Acme but it seems to be running -Is this error message correct? Why can’t Bar connect? Ticket Update Ticket Middleware (Scott) Bar SRE + add Bar SRE (Linda) Middleware (Scott) -URGENT: Network connection issue between Bar and Acme Ticket Update Ticket Network SRE Team + add 6:45 NOC (Bob) Biz Manager App Manager Lead Dev (Karen) Foo SRE SysAdmin (Lee) Middleware Manager SVP Chief of Staff 2 x Tech VP Middleware (Scott) Bar SRE (Linda) Ticket Context Wagon The new environment pre-flight check is preventing startup. Looks like Bar’s connection to Acme is being blocked. Escalation Task Switching
  • 59. 6:45pm. Bar SRE (Linda) and Middleware (Scott): “The new environment pre-flight check is preventing startup. Looks like Bar’s connection to Acme is being blocked.” Ticket update, adding the Network SRE Team and Bar Lead Dev: “URGENT: Network connection issue between Bar and Acme.” “Customers are calling. What is going on?” Bar Lead Dev (Liu), with Business Managers: “I can comment out the test… But the CD pipeline only goes to QA ENV!” Ticket Context Wagon: NOC (Bob), Biz Manager, App Manager, Lead Dev (Karen), Foo SRE, SysAdmin (Lee), Middleware Manager, SVP, Chief of Staff, 2 x Tech VP, Middleware (Scott), Bar SRE (Linda). Callouts: Escalation, Task Switching, Disconnected Process.
  • 64. People: Network Dir (Carlos), Middleware (Scott), Middleware Manager (Melissa), Net SREs, VPs, Business Managers, Network VPs. Dialogue: “Carlos, I need a favor. Can you escalate?” “Customers are calling. What is going on?” “It’s the network!” “Your network is broken!” “We are already working on it!” Net SREs and VPs: “Last week… Priority! Different Incident!” Callouts: Distraction, Finger Pointing, Heroics, Waiting.
  • 69. 8:00pm. Network SRE (Hari): “The firewall is blocking the traffic. You’ll have to take it up with the Firewall Team.” Open Firewall Ticket, adding the Firewall Team: “URGENT: Firewall is blocking connection between Bar and Acme.” Paging on-call… Open bridge… Firewall Engineer (Freddie) and Middleware (Scott): “Can’t be the firewall, it hasn’t changed since last Thursday.” “No, it’s the firewall.” Ticket Context Wagon: NOC (Bob), Biz Manager, App Manager, Lead Dev (Karen), Foo SRE, SysAdmin (Lee), Middleware Manager, SVP, Chief of Staff, 2 x Tech VP, Middleware (Scott), Bar SRE (Linda), Network PM (Carlos), Network SRE (Bob). Callouts: Escalation, Task Switching, Siloed Knowledge, Interruption.
  • 72. 8:00pm. Firewall Engineer (Freddie) and Middleware (Scott): “Can’t be the firewall, it hasn’t changed since last Thursday.” “No, it’s the firewall.” “There was a rule change last Thursday that would stop Bar from talking to Acme. Can you change it back?” “Sure, we make changes on Thursday…” Chief of Staff: “SVP and VPs are livid… this was supposed to be a safe change!!” “Freddie, we’ve got customers calling.” Firewall Engineer (Freddie): update Firewall Ticket. Callouts: Extra Process, Misaligned Priorities.
  • 77. 9:00pm. Firewall Engineer (Freddie) updates the Firewall Ticket, adding NetSec: “ESCALATE: Emergency production firewall rule change review.” Paging on-call… NetSec (Nicole): “This is production so I’ll have to get others on the Network CAB…” Chief of Staff, Firewall (Freddie), Middleware (Scott): “Customer outage!” “… I’ll call” Also shown: SVP Susan, Middleware Manager, VP, VP, Bar Lead Dev. Ticket Context Wagon: NOC (Bob), Biz Manager, App Manager, Lead Dev (Karen), Foo SRE, SysAdmin (Lee), Middleware Manager, SVP, Chief of Staff, 2 x Tech VP, … Callouts: Extra Process, Escalation, Task Switching, Misaligned Priorities.
  • 79. 9:30pm. Chief of Staff, Firewall (Freddie), Middleware (Scott): “Customer outage!” “… I’ll call” Also shown: SVP Susan, Middleware Manager, VP, VP, Bar Lead Dev. NetSec (Nicole) updates the Firewall Ticket: “APPROVE: Emergency firewall rule change.” Firewall (Freddie), Net L2 (Bob), Middleware (Scott): firewall change, restart Bar. Ticket Context Wagon: NOC (Bob), Biz Manager, App Manager, Lead Dev (Karen), Foo SRE, SysAdmin (Lee), Middleware Manager, SVP, Chief of Staff, 2 x Tech VP, Middleware (Scott), Bar SRE (Linda), Network PM (Carlos), Network SRE (Bob), Firewall (Freddie), NetSec (Nicole). Callouts: Waiting.
  • 85. 9:45pm. Firewall (Freddie), Net L2 (Bob), Middleware (Scott): firewall change, restart Bar. “I think we are good!” Middleware Manager, VP, VP, Bar Lead Dev: “You ‘think?’” Policy!! Middleware (Scott) updates the ticket, adding the Customer Engagement Manager: “Ready for API tests.” Callouts: Manual, Partially Done Work, Extra Process, Escalation, Task Switching.
  • 88. 11:00pm. Ticket: “Ready for API tests.” NOC (Bob) and Customer Engagement Manager (Varsha). Customer Engagement Manager (Varsha) updates the ticket: “APIs OK.” Callouts: Life Interruption, Extra Process.
  • 90. 11:30pm. Ticket: “APIs OK.” Middleware (Scott) updates the ticket: “Services restarted OK.” NOC: “Lights are green… I guess it is fixed.” NOC (Bob): close ticket. Zzz. Ticket Context Wagon: NOC (Bob), Biz Manager, App Manager, Lead Dev (Karen), Foo SRE, SysAdmin (Lee), Middleware Manager, SVP, Chief of Staff, 2 x Tech VP, Middleware (Scott), Bar SRE (Linda), Network PM (Carlos), Network SRE (Bob), Firewall (Freddie), NetSec (Nicole), Cust. Engmt. (Varsha).
  • 91. Next Day. SVP (Susan), with VPs and Directors: “Whose fault is this?! Why are we so bad at change? What additional processes and approvals are you adding to never let this happen again?!”
  • 93. Executive Team: “We’ve invested in Cloud, Agile, DevOps, Containers… Why does everything still take too long and cost too much?” Our transformation has largely ignored Ops.
  • 94. Most companies chase the symptoms…
  • 100. …by following the conventional wisdom: “We need better tools.” “We need more people.” “We need more discipline and attention to detail.” “We need more change reviews/approvals.” “We’ll wait and see what ITIL v4 says.”
  • 103. Challenge the conventional wisdom about operations work
  • 104. Forces That Undermine Operations: Silos, Queues, Excessive Toil, Low Trust
  • 107. Where are decisions made? Who can take action? Escalation tiers: 1° → escalate → 2° → escalate → 3° → escalate → 4°. Callout on the diagram: Decisions made here.
  • 110. All work is contextual (John Allspaw). rm -rf $PATHNAME: Is this dangerous?
  • 115. All work is contextual (John Allspaw). rm -rf $PATHNAME: Answer is always “it depends.”
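To illustrate the “it depends” answer, here is a minimal sketch, assuming a hypothetical list of protected locations, that classifies what `rm -rf $PATHNAME` would mean for different values of the variable. The function name, the `PROTECTED` list, and the three outcome labels are assumptions made for illustration; they are not from the talk.

```python
from pathlib import PurePosixPath

# Hypothetical locations that an (imaginary) team never wants removed.
PROTECTED = (
    PurePosixPath("/"),
    PurePosixPath("/etc"),
    PurePosixPath("/var/lib/acme"),
)


def rm_rf_risk(pathname: str, protected=PROTECTED) -> str:
    """Classify the effect of `rm -rf $PATHNAME` for one value of PATHNAME."""
    if pathname.strip() == "":
        # Empty or unset: the outcome depends on quoting and on the rest of the command.
        return "it depends: PATHNAME is empty or unset"
    target = PurePosixPath(pathname)
    if not target.is_absolute():
        # A relative path resolves against a working directory we cannot see here.
        return "it depends: relative path, outcome depends on the working directory"
    for p in protected:
        # Removing a protected path, or any ancestor of it, destroys the protected data.
        if target == p or target in p.parents:
            return f"dangerous: would remove {p}"
    return "probably fine: outside every protected location"


for value in ["/tmp/build-123", "/var", "", "build"]:
    print(repr(value), "->", rm_rf_risk(value))
```

The same command gets a different answer for each value, which is the point of the slide: the person closest to the context is best placed to judge.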
  • 116. Where are decisions made? Who can take action? Escalation tiers: 1° → escalate → 2° → escalate → 3° → escalate → 4°. Callout on the diagram: Context.
  • 119. Psychological safety: Psychological safety is a shared belief that the team is safe for interpersonal risk taking. It can be defined as “being able to show and employ one’s self without fear of negative consequences of self-image, status or career.” (William Kahn, Boston University, 1990). Google (2016): most important characteristic to predict team effectiveness? Psychological safety!
  • 120. Forces That Undermine Operations: Silos, Queues, Excessive Toil, Low Trust
  • 122. Toil: Name For a Problem We’ve All Felt. “Toil is the kind of work tied to running a production service that tends to be manual, repetitive, automatable, tactical, devoid of enduring value, and that scales linearly as a service grows.” (Vivek Rau, Google)
  • 123. Toil vs. Engineering Work: Lacks Enduring Value vs. Builds Enduring Value; Rote, Repetitive vs. Creative, Iterative; Tactical vs. Strategic; Increases With Scale vs. Enables Scaling; Can Be Automated vs. Requires Human Creativity.
  • 126. Excessive Toil Prevents Fixing the System. Toil at manageable percentage of capacity: engineering work is available to reduce toil and improve the business. Toil at unmanageable percentage of capacity (“Engineering Bankruptcy”): no capacity to reduce toil, no capacity to improve business. Downward spiral is inevitable!
  • 127. Forces That Undermine Operations: Silos, Queues, Excessive Toil, Low Trust
  • 131. Silos cause disconnects and mismatches. Silo A (“I need X”) and Silo B (“I do X”, receiving requests for X) each have their own backlog, information, priorities, and tools. Mismatches between the two: context, process, tooling, capacity.
  • 132. Forces That Undermine Operations: Silos, Queues, Excessive Toil, Low Trust
  • 134. How do we cover for our silos’ disconnects and mismatches? With a ticket queue between Silo A and Silo B.
  • 135. Silo A, Ticket Queue, Silo B: ?? We all know how well that works.
  • 136. Ticket queues are an expensive way to manage work. Queues create: longer cycle time, increased risk, more variability, more overhead, lower quality, less motivation. (Adapted from Donald G. Reinertsen, The Principles of Product Development Flow: Second Generation Lean Product Development)
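One way to make the “longer cycle time” cost concrete is Little's Law from queueing theory: in a stable queue, the average time an item spends waiting is roughly the number of items in the queue divided by the average completion rate. The sketch below applies it to a ticket queue. The function name and the sample numbers are illustrative assumptions, not figures from the talk or from Reinertsen's book.

```python
def average_wait_days(tickets_in_queue: int, tickets_closed_per_day: float) -> float:
    """Little's Law: average time in queue = items in queue / throughput."""
    if tickets_closed_per_day <= 0:
        raise ValueError("throughput must be positive")
    return tickets_in_queue / tickets_closed_per_day


# Hypothetical firewall-change queue: 60 open requests, 4 closed per day.
print(average_wait_days(60, 4))  # 15.0 days of waiting, on average
```

Under these assumed numbers a request that takes minutes to execute still waits about three working weeks, which is why the queue, not the task, tends to dominate cycle time.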
  • 139. What do queues do to value streams? Queues (Queue A, Queue B) disintegrate and obfuscate value streams.
  • 142. Ticket queues are “snowflake makers.” Silo A, Ticket Queue, Silo B: ?? Snowflakes: technically acceptable, but brittle and unreproducible.
  • 143. Forces That Undermine Operations: Silos, Queues, Excessive Toil, Low Trust
  • 144. So what can we do differently?
  • 145. Forces That Undermine Operations: Silos, Queues, Excessive Toil, Low Trust
  • 148. “Shift Left” the ability to take action. Escalation tiers: 1° → escalate → 2° → escalate → 3° → escalate → 4°. Push the ability to take action this direction (shift left). Tools: enablement and tooling.
  • 149. Forces That Undermine Operations: Silos, Queues, Excessive Toil, Low Trust
  • 154. Reduce Toil: 1. Track toil levels for each team. 2. Set toil limits for each team. 3. Fund efforts to reduce toil (with emphasis on teams over toil limits). Bonus: Use Service Level Objectives, Error Budgets, and other lessons from SRE.
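The three steps above can be sketched in a few lines: record toil per team, compare it with a limit, and rank the teams that are over the limit so funding goes there first. This is a minimal sketch under stated assumptions: the `TeamToil` record, the hours-based measurement, the team names, and the 50% default limit are all illustrative choices, not something the talk prescribes.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TeamToil:
    team: str
    toil_hours: float   # hours spent on manual, repetitive work this period
    total_hours: float  # total hours worked this period

    @property
    def toil_fraction(self) -> float:
        # Step 1: track the toil level as a share of the team's capacity.
        return self.toil_hours / self.total_hours if self.total_hours else 0.0


def teams_over_limit(reports: List[TeamToil], limit: float = 0.5) -> List[TeamToil]:
    """Steps 2 and 3: apply a toil limit and rank offenders, worst first."""
    over = [r for r in reports if r.toil_fraction > limit]
    return sorted(over, key=lambda r: r.toil_fraction, reverse=True)


# Hypothetical weekly numbers for three teams.
reports = [
    TeamToil("network", toil_hours=30, total_hours=40),
    TeamToil("middleware", toil_hours=10, total_hours=40),
    TeamToil("firewall", toil_hours=24, total_hours=40),
]
for r in teams_over_limit(reports):
    print(f"{r.team}: {r.toil_fraction:.0%} toil, over the limit")
```

How toil is measured (surveys, ticket categories, time tracking) is left open here; the sketch only assumes some per-team number exists.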
  • 155. Forces That Undermine Operations: Silos, Queues, Excessive Toil, Low Trust
  • 158. Obvious: Get rid of as many silos as possible. Old Silos A, B, C, and D become Cross-Functional Teams 1, 2, … n. “Horizontal” shared responsibility, not everyone doing everything!
  • 161. Shared and dedicated responsibility is key. “Netflix” Model: Cross-Functional Teams 1, 2, … n. “Google” Model: Development Teams 1, 2, … n plus an SRE Team, with clear handoff requirements and an error budget with consequences. Same high-quality, high-velocity results!
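An “error budget with consequences” can be illustrated with simple arithmetic: the budget is the share of requests the Service Level Objective allows to fail, and a consequence kicks in once it is spent. The sketch below is an assumption-laden illustration; the request-based SLO, the function name, and the “pause releases when the budget is exhausted” policy are hypothetical examples, not the talk's or Google's exact rules.

```python
def error_budget(slo: float, total_requests: int, failed_requests: int) -> dict:
    """Compute how much of a request-based error budget remains.

    slo is the target success ratio, e.g. 0.999 for "99.9% of requests succeed".
    """
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be between 0 and 1 (exclusive)")
    allowed_failures = (1.0 - slo) * total_requests
    remaining = allowed_failures - failed_requests
    return {
        "allowed_failures": allowed_failures,
        "remaining": remaining,
        # Hypothetical consequence: feature releases pause once the budget is spent.
        "can_release": remaining > 0,
    }


# Hypothetical month: one million requests against a 99.9% SLO.
print(error_budget(0.999, 1_000_000, 400)["can_release"])    # True
print(error_budget(0.999, 1_000_000, 1_500)["can_release"])  # False
```

The design point is that the consequence is agreed in advance, so the decision to slow down is made by the numbers rather than by escalation.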
  • 164. But what about the cross-cutting concerns? Old Silos A, B, C, and D become Cross-Functional Teams 1, 2, … n, but the specialist capabilities the teams depend on still sit behind ticket queues.
  • 165. Forces That Undermine Operations: Silos, Queues, Excessive Toil, Low Trust
  • 166. Self-Service Operations: Turn handoffs into self-service. Cross-Functional Product Teams 1, 2, … n, each with embedded Ops, use Self-Service Operations on demand. Ops operates the platform; each Ops capability is provided by SRE, Dev, or a specialist.
  • 167. Self-Service Operations: Works with any org model. Development Teams 1, 2, … n and an Ops/SRE Team use Self-Service Operations on demand. Ops operates the platform; each Ops capability is provided by SRE, Dev, or a specialist.
  • 168. But, what about security and compliance? Development Teams 1 and 2 with an Ops/SRE Team, and Cross-Functional Product Team n with embedded Ops, use Self-Service Operations on demand. Ops operates the platform; each Ops capability is provided by SRE, Dev, or a specialist. Callouts on the diagram: “Build in security here”, “Build in compliance here.”
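As a rough illustration of building security and compliance into the self-service layer, the sketch below wraps each operations task in a job that checks the caller's role before running and records every attempt in an audit log. Everything here is hypothetical: the `Job` and `SelfServicePortal` names, the role model, and the log format are assumptions for illustration and do not describe Rundeck or any specific product.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Callable, Dict, List, Set


@dataclass
class Job:
    name: str
    action: Callable[[], str]  # the vetted procedure, defined by the capability owner
    allowed_roles: Set[str]    # security: who may run it without a ticket


@dataclass
class SelfServicePortal:
    jobs: Dict[str, Job] = field(default_factory=dict)
    audit_log: List[dict] = field(default_factory=list)

    def register(self, job: Job) -> None:
        self.jobs[job.name] = job

    def run(self, job_name: str, user: str, role: str) -> str:
        job = self.jobs[job_name]
        allowed = role in job.allowed_roles
        # Compliance: every attempt is recorded, whether or not it is allowed.
        self.audit_log.append({
            "time": datetime.now(timezone.utc).isoformat(),
            "user": user,
            "role": role,
            "job": job_name,
            "allowed": allowed,
        })
        if not allowed:
            raise PermissionError(f"role {role!r} may not run {job_name!r}")
        return job.action()


portal = SelfServicePortal()
portal.register(Job("restart-bar", lambda: "Bar restarted", {"noc", "bar-sre"}))
print(portal.run("restart-bar", user="bob", role="noc"))
```

In this arrangement the specialist team defines the procedure and the access rule once, and the people closest to the problem run it on demand instead of waiting in a ticket queue.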
  • 173. Are all tickets bad? No. Just use the ticket system for what tickets are good for: 1. Documenting true problems/issues/exceptions. 2. Routing for necessary approvals. Not as a general purpose work management system!
  • 176. Strategy: Self-Service improves response times. Jody Mulkey at DOES ’15 SF: https://youtu.be/USYrDaPEFtM. Environments: DEV, STAGE, PROD, each with services, monitoring, and scripts/tools. Actors: Dev & QA, NOC/Ops, Dev. Flow labels: promote approved jobs, self-service, empower.
  • 180. Strategy: Self-Service improves consistency & compliance. Shaun Norris at DOES ’18 London: https://youtu.be/d5IMvK0YHTg. Optimized for compliance: 86,000+ employees; 60+ countries; highly regulated. LOB #1, LOB #2, LOB #3, … LOB n each run services and scripts/tools across data centers and clouds, with self-service providing consistency and compliance. 12 months: 13,000+ ops tasks in privileged environments that didn’t require a review.
  • 181. Where I need your help… Working on documenting the Self-Service Operations design pattern. Read for free online: rundeck.com/self-service
  • 182. Recap: Don’t forget about Ops. Challenge conventional wisdom. Understand the forces undermining operations work. Focus on removing silos and queues. “Shift-Left” control and decision making. Learn from SRE: Reduce toil to create capacity to change. Leverage the Self-Service Operations design pattern.