2. About Me
• One of the founding members of “Devopsdays”
• Co-author of “The Devops Handbook”
• Author of the “Introduction to Devops” course on Linux Foundation edX
• Podcaster at devopscafe.org
• Co-founder of the Devops Enterprise Summit
• Ninth person in at Chef (VP of Customer Enablement)
• Formerly Director of Devops at Dell
• Founder of Socketplane (acquired by Docker)
• 10 startups over 25 years
https://github.com/botchagalupe/my-presentations
12. Devops Practices and Patterns
• Continuous Delivery
  • Everything in version control
  • Small batch principle
  • Trunk-based deployments
  • Manage flow (WIP)
  • Automate everything
• Culture
  • Everyone is responsible
  • Done means released
  • Stop the line when it breaks (see the sketch below)
  • Remove silos
itrevolution.com/devops-handbook
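Several of these practices are mechanical enough to sketch in code. Below is a minimal, hypothetical illustration (mine, not from the deck or the Handbook) of small batches flowing through trunk with a "stop the line" gate; `Change`, `run_tests`, and `deploy` are made-up stand-ins, not any real CI tool's API.

```python
# Minimal sketch of trunk-based flow with a "stop the line" gate.
# Change, run_tests, and deploy are illustrative stand-ins, not a real CI API.
from dataclasses import dataclass

@dataclass
class Change:
    author: str
    passes_tests: bool  # stand-in for a real test run

def run_tests(change: Change) -> bool:
    return change.passes_tests

def deploy(change: Change) -> None:
    print(f"deployed change by {change.author}")

def integrate(trunk_queue: list[Change]) -> None:
    """Integrate small batches into trunk in order; halt on first failure."""
    line_stopped = False
    for change in trunk_queue:
        if line_stopped:
            print(f"line stopped: holding change by {change.author}")
            continue
        if run_tests(change):
            deploy(change)       # done means released
        else:
            line_stopped = True  # stop the line when it breaks
            print(f"build broken by {change.author}: swarm the fix before new work")

integrate([Change("amy", True), Change("raj", False), Change("lee", True)])
```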
14. Recent IT Performance Data is Compelling
High performers compared to their peers…
• 30x more frequent deployments
• 200x faster lead times
• 60x the change success rate
• 168x faster mean time to recover (MTTR)
• 2x more likely to exceed profitability, market share & productivity goals
• 50% higher market capitalization growth over 3 years*
Data from 2014/2015 State of DevOps Report - https://puppetlabs.com/2015-devops-report
15. (Build of the previous slide.) The same results, summarized: Faster. Higher Quality. More Effective. 2555x
18. Organizational culture was one of the strongest predictors of both IT performance and the overall performance of the organization.
19. Devops is about Humans
Devops is a set of practices and patterns that turn human capital into high-performance organizational capital.
21. Google
• Over 15,000 engineers in over 40 offices
• 4,000+ projects under active development
• 5500+ code submissions per day (20+ p/m)
• Over 75M test cases run daily
• 50% of code changes monthly
• Single source tree
22. Amazon
• 11.6 second mean time between deploys
• 1,079 max deploys in a single hour
• 10,000 mean number of hosts simultaneously receiving a deploy
• 30,000 max number of hosts simultaneously receiving a deploy
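A quick back-of-the-envelope check (my arithmetic, not the slide's): 86,400 seconds per day ÷ 11.6 seconds between deploys ≈ 7,400 production deployments per day.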
23. Unicorns and Horses (Enterprises)
Shamelessly stolen and repurposed from: Pete Cheslock
24. Enterprise Organizations
• Ticketmaster - 98% reduction in MTTR
• Nordstrom - 20% shorter Lead Time
• Target - Full Stack Deploy 3 months to minutes
• USAA - Release from 28 days to 7 days
• ING - 500 application teams doing devops
• CSG - From 200 incidents per release to 18
28. Parts Unlimited - "Major Release 6" (Early 2014)
[The original slide is a wall-sized value stream map; the recoverable content follows.]
Phases: Initiation and Planning → Design → Dev Breakdown → Server Requirements Gathering → Server Approval and Assignment → Provisioning → Dev/Test → Staging Release → Production Release
The flow, in brief:
• Initiation and Planning (3 months, plus a 3-month hold/pause): project initiation; finance (ZRA) approves the project; monthly steering meeting with C-level steering committee input; project charter (high-level stories, project info, description, budget, schedule); PM and stakeholders (tech and biz) create a work breakdown in MS Project (high-level milestones, resource planning)
• Design (3 months): requirements created in project meetings (MS Office, SharePoint): detailed requirements for new features, technology refreshes, ERD (infra req), DRD (dev req), BRD (biz req); tech leads, architects, vendor and Ops architects create the design, tech requirements, and high-level server tickets
• Dev Breakdown (4 weeks): detailed analysis and requirements captured as Jira "stories" and Confluence pages, with ticket dependencies maybe tracked; team leads and PMs assign requirements and add detail for their teams; Architecture Review Board ("Bill" plus an architects working group; Ops sometimes); Dev leadership assigns dev teams at an Ops intake meeting (1 week)
• Server Requirements Gathering, Approval, and Assignment (1 week + 1 week): server request spreadsheet (DB, app, or web) created from the request, routed for approval (budget, appropriate resources), then approved into the Ops delivery queue
• Provisioning (1-6 weeks): delivery manager ("Matt") gives a "heads up" in ServiceNow and assigns a delivery engineer, who clarifies or confirms requirements with Dev or QA; server provisioned (with rework); DBA and app/web validation; data restore (1 week); compute, network, facility, cabling, and storage handled separately ("Linda", Ops PM)
• Dev/Test (16 weeks total: 6 weeks (H/C: 6), 3 weeks (H/C: 8), 4 weeks (H/C: 8), 3 weeks (H/C: 14); 2-week sprint cycle time): development in existing dev environments; test data acquired from mainframe service data setup by the Ops DBA and test data configuration manager ("Jennifer"); deploy to integration; Dev/QA integration and regression testing focused on the service (TestLink); sprint review
• Staging Release: product owners (using their own criteria) create a CAB ticket; the scrum team (or the Ops team, if legacy) pushes the deployment to stage with an email notification; new arch builds VMs via Jira, legacy goes through ServiceNow; >100 change tickets created in ServiceNow
• Production Release: QA lead, PMs, and QAs run end-to-end testing in prod; go/no-go decision meeting with team leads; Ops deploys by cluster via Jira; "remove feature flag" (if new arch)
Problems annotated on the map:
• 9+ months of planning before implementation starts (and information/requirements still incorrect or incomplete!)
• Gaps in requirements: licenses, dependencies on 3rd-party apps, capacity planning always seems low ("robbing Peter to pay Paul"), hardware not purchased in advance even though we know it's coming; duplicate info across different documents
• Procurement of physical servers can take months (procurement plus facilities lead times)
• Too many environments in one ticket causes audit confusion; piecemeal requests ("2 this week, 3 next week")
• One queue for the delivery team with ~1,000 tickets at once; capacity issues cause delay; the team is often told to stop everything and do something else; steering committee pressure to "Fix tickets!" and reset the delivery date
• Dev and QA told to submit server requests 6-8 weeks in advance (only done 50% of the time); requests sometimes go directly to delivery, and ad-hoc requests get lost, adding 2-3 week delays
• 30% of the delivery team's time spent "consulting" on performance and dealing with unfounded requests for more capacity; 3-5 days to fix
• No monitoring or backup for some environments; staging frequently down (external service updates take it offline), with lots of contention
• Often skips CAB, and what CAB reviews is often not what was built
• All-manual environment setup that only one person really knows, with low data quality; manual processes with lots of back and forth; many tickets with mismatched priorities; mostly manual testing; manual, per-cluster deploys
• Low success rates (S/R) at many steps: 90%, 75%, 55%, 50%, 20%, 15%, ~10% (rolled together in the sketch below)
Improvement experiments identified:
1. Fully Automated Environment Provisioning
2. Visualization of flow of work and expected upcoming work
3. Standard product catalog ("Environments on Demand")
4. Shorten from Design to Implementation
5. New "white glove" engagement model
7. Small Batches
8. Write end-to-end customer func. tests
9. Service Verification test writing: shift left to Dev (test early)
10. Test data setup automation
11. Resolve interface to legacy
12. Remove Bottleneck and Environment Contention (test more)
13. Dev Deploy to Prod for legacy
14. Unify change management tools
15. Tool
• Make the work visible for all
• Manage flow and eliminate waste
• Build alignment and consensus across team boundaries
• Empower teams to find and fix what is getting in the way
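The map's numbers lend themselves to simple arithmetic. The sketch below (mine, not from the deck) shows the two calculations a value stream map like this supports: total lead time across steps, and the rolled success rate, using a few of the slide's S/R figures as illustrative stand-ins rather than exact per-step data.

```python
# Value stream arithmetic: total lead time and rolled success rate (S/R).
# Durations and S/R values below are illustrative stand-ins loosely drawn
# from the map, not exact figures for each step.

steps = [
    # (name, lead_time_weeks, success_rate)
    ("Planning and design",      26, 0.90),
    ("Server approval",           2, 0.75),
    ("Provisioning",              6, 0.50),
    ("Dev/Test",                 16, 0.55),
    ("Staging and prod release",  3, 0.20),
]

total_lead_time = sum(weeks for _, weeks, _ in steps)

# Rolled S/R: the chance work passes every step complete and accurate the
# first time is the product of the per-step rates.
rolled = 1.0
for _, _, sr in steps:
    rolled *= sr

print(f"Total lead time: {total_lead_time} weeks")  # 53 weeks
print(f"Rolled success rate: {rolled:.1%}")         # ~3.7% first-pass yield
```

Even with generous per-step rates, the rolled figure collapses, which is why the countermeasures above attack handoffs and queues rather than any single step.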
29. • Small Batch
• Reduce Work in Process (WIP)
• 1x1 Flow
• Reduce Bottlenecks (TOC)
• Optimize Globally
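A toy queueing calculation makes the small-batch and 1x1-flow bullets concrete. The numbers are illustrative assumptions of mine (10 items, 3 stations, 1 hour per item per station), not anything from the deck: moving work one piece at a time gets the first item out in 3 hours instead of 21, and the last out in 12 instead of 30.

```python
# Toy model: full-batch transfer vs. one-piece (1x1) flow.
# Assumptions (mine, illustrative): 10 items, 3 stations, 1 hour per item.

ITEMS, STATIONS, HOURS_PER_ITEM = 10, 3, 1

def completion_times(batch_size: int) -> list[float]:
    """Hour at which each item leaves the last station when items are
    transferred between stations in batches of batch_size."""
    done = [0.0] * ITEMS
    for _ in range(STATIONS):
        station_free = 0.0
        for start in range(0, ITEMS, batch_size):
            batch = range(start, min(start + batch_size, ITEMS))
            # A batch starts when the station is free and the whole batch
            # has arrived from the previous station.
            begin = max(station_free, max(done[i] for i in batch))
            for k, i in enumerate(batch):
                done[i] = begin + (k + 1) * HOURS_PER_ITEM
            station_free = done[batch[-1]]
    return done

for bs, label in ((ITEMS, "full batch"), (1, "one-piece flow")):
    t = completion_times(bs)
    print(f"{label}: first item done at hour {t[0]:.0f}, last at hour {t[-1]:.0f}")
# full batch:     first item done at hour 21, last at hour 30
# one-piece flow: first item done at hour 3,  last at hour 12
```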
32. I fear not the man who has practiced 10,000 kicks once, but I fear the man who has practiced one kick 10,000 times.
- Bruce Lee
33. Toyota is not a story about techniques. It's an organization defined primarily by the unique behavior routines it continually teaches to all its members.
- Mike Rother (Page 262-263)
44. • Capability 1: Seeing problems as they occur
• Complex work is managed so that problems in design are revealed
• They see problems as they occur, through relentless testing of assumptions
• Capability 2: Swarming and solving problems as they are seen, to build new knowledge
• Problems that are seen are solved so that new knowledge is built quickly
• Improvement of daily work is prioritized above daily work
• Capability 3: Spreading new knowledge throughout the organization
• New local knowledge and improvements are turned into global improvements, shared throughout the organization
• Learning is fed back to prevent future failures
• Capability 4: Leading by developing
• The job of leaders is not command and control, but to create other capable leaders who can perpetuate this system of work
51. ▪ Views on Human Error
▪ The old view of human error (First Story)
▪ Human error is the cause of accidents
▪ To explain failure, you must seek failure
▪ You must find people's inaccurate assessments, wrong decisions, and bad judgments
52. ▪ Views on Human Error
▪ The new view of human error (Second Story)
▪ Human error is a symptom of trouble deeper inside a
system
▪ To explain failure, do not try to find where people
went wrong
▪ Instead, find how people’s assessments and actions
made sense at the time, given the circumstances that
surrounded them
53. ▪ Bad Apple Theory - Throw away the bad apples
▪ Complex systems are basically safe, they need to be
protected from unreliable people (bad apples)
▪ Human errors cause accidents: humans are the
dominant contributor to more than two thirds of mishaps
▪ Errors occur because of human loss of situation
awareness, complacency, negligence
▪ Errors are introduced to the system only through the
inherent unreliability of people.
54. What can go wrong usually goes
right, but then we draw the wrong
conclusion.
Murphy’s Law is Wrong!
Sidney Dekker
The Field Guide to Human Error
55. Blameless Culture
A blameless culture believes that
systems are NOT inherently safe
and humans do the best they can to
keep them running.
56. Thematic Vagabonding
People jump from one topic to the next,
treating all superficially, in certain cases
picking up topics dealt with earlier at a
later time; they don’t go beyond the
surface with any topic and seldom finish
with any. (Dörner, 1980)
58. ▪ Awesome Postmortems - Mindweather LLC
▪ in complex systems, there is no root cause, except…
▪ there are (multiple) conditions, some of which are
unknowable, unfixable, outside our control
▪ people did what made sense at the time, given the
information they had (no counterfactuals)
▪ failure and success are both normal in complex systems
▪ getting the full account* of what happened is more
important than blame/punishment
59. ▪ Hindsight bias:
▪ knew-it-all-along, to see the event as having been predictable,
counterfactuals
▪ Outcome bias:
▪ evaluating the quality of a decision when the outcome of that
decision is already known
▪ Availability bias:
▪ preference by decision makers for information and events that are more recent
▪ Fundamental attribution error:
▪ explain behavior in terms of internal disposition, such as
personality traits, abilities, motives, etc. as opposed to external
situational factors
60. ▪ Just Culture at Etsy (John Allspaw)
▪ Encourage learning by holding blameless post-mortems on outages and accidents
▪ Understand how accidents happen, in order to better equip ourselves to keep them from happening in the future
▪ Gather details from multiple perspectives on failures, and don't punish people for making mistakes
▪ Enable and encourage people who do make mistakes to be the experts on educating the rest of the organization how not to make them in the future
61. ▪ Just Culture at Etsy (John Allspaw)
▪ Accept that there is always a discretionary space where humans can decide to take action or not, and that the judgment of those decisions lies in hindsight
▪ Accept that the Hindsight Bias will continue to cloud our assessment of past events, and work hard to eliminate it
▪ Accept that the Fundamental Attribution Error is also difficult to escape, so we focus on the environment and circumstances people are working in when investigating accidents
68. A learning organization is a place
where people are continually
discovering how they create their
reality.
- Peter Senge
69. ▪ Five Disciplines must be adopted to become a
learning organization
▪ Systems Thinking
▪ Personal Mastery
▪ Mental Models
▪ Shared Vision
▪ Team Learning
70. Ladder of Inference
Chris Argyris
(top rung to bottom)
• Action
• Beliefs
• Conclusions
• Assumptions
• Meanings
• Select
• Observe
71. Ladder of Inference
▪ Can create bad judgement
▪ Our assumptions can lead us to bad conclusions
▪ Question your assumptions and conclusions
▪ Seek contrary data
▪ Make your assumptions visible to others
▪ Invite others to test your assumptions and conclusions
▪ Inquire into other people's assumptions and conclusions
▪ Move down the ladder instead of up
72. Ladder of Inference - Bad Judgement
▪ Observe - Notice people in the first row
▪ Select - Person in the front row keeps looking at their phone
▪ Meaning - Not listening to my presentation
▪ Assumption - He is not interested
▪ Conclusion - Doesn’t like my new idea
▪ Beliefs - Their team always blocks new ideas
▪ Action - I send a nasty email to their boss
73. Ladder of Inference - Alternative Assumption
▪ Observe - I notice people in the first row
▪ Select - Person in the front row keeps looking at their phone
▪ Meaning - Not listening to my presentation
▪ Assumption - Try and engage with a question (safely)
▪ Conclusion - Might find out that they are late for another meeting and they really don't want to miss this one… so they sent an email notifying the next meeting's team that they will be late…
▪ Beliefs - They are very excited about this new idea
▪ Action - Both teams set up another meeting to engage
82. ▪ Anomaly Response
▪ Computers do not resolve outages… people do
▪ Trade-offs under pressure
▪ Cognition in the wild
▪ An outage is not a detective story
▪ With each step the story changes
▪ Need to see what's happening with incomplete information
▪ Tools don't always make things better
83. ▪ Anomaly Response - Internet Services are Opaque
▪ Network layer abstractions
▪ Variability in network performance
▪ Interdependent and decoupled services
▪ Internet based distributed computing
▪ Geographically distributed communication
▪ Open internet facing interactions
85. ▪ Anomaly Response - Dynamic Fault Management
▪ Cascading effects
▪ Tempo changes and time pressure
▪ Multiple interleaved tasks
▪ Multiple interacting goals
▪ Need to revise assessments as new evidence comes in
86. "In dynamic fault management,
intervention precedes or is interwoven
with diagnosis"
- Woods (1994)