2. About Me
• J. Michael (Mike) McGarr
• 13 years as a software engineer
• Engineering Tools Team at Netflix
• DC Continuous Delivery Meetup Founder
• Director of EngOps, Blackboard Inc.
• Excella Consulting
• Booz Allen Hamilton
6/26/14 @SonOfGarr 2
12. How long does it take…
…to deploy a single line of code change to prod?
• Months?
• Weeks?
• Days?
• Hours?
• Minutes?
How long does it take to stand up a new service?
20. Principles Underpinning DevOps
“We assert that the Three Ways describe the values and philosophies that frame the processes, procedures, practices of DevOps, as well as the prescriptive steps.”
- Gene Kim
25. The Second Way
• Did your code change break something else?
• How does your code perform in production?
• Who has access to production graphs? Do they exist?
35. Pager Duty
“We found that when we woke up developers at 2am, defects got fixed faster than ever.”
- Patrick Lightbody, CEO, BrowserMob
Not Patrick Lightbody
41. Blameless Post-mortems
“Having a “blameless” Post-Mortem process means that engineers whose actions have contributed to an accident can give a detailed account of:
• what actions they took at what time,
• what effects they observed,
• expectations they had,
• assumptions they had made,
• and their understanding of timeline of events as they occurred.
…and that they can give this detailed account without fear of punishment or retribution.”
- John Allspaw, Etsy
Software Releases are stressful:
Off-hours/weekends
Take a long time
Heavily manual
Error-prone
Teams lack confidence in making changes
Business lacks confidence in IT
To reduce pain, orgs release less often
Releases become larger
Risk increases with each release
Dev Teams can’t keep up with customer demands
Customers don’t get what they want
Dev teams focus on writing new features
No time for improving quality
QA always seems to be forgotten
Technical debt slows teams' ability to deliver quickly
Cycle times continue to increase
Gene Kim: “A downward spiral”
Not a new story…
Agile is supposed to solve this, right?
Agile traditionally focuses on business working with dev teams
Dev and Ops motivations are at odds
Focus on flow
Focus on the whole system
Avoid local optimizations
Silos (and silo’d ownership) are toxic
Working code is only valuable in the customer's hands
Upon joining Blackboard, I assessed dev process…
Changes wait up to 3 hours to build
8pm nightly build missed late changes
9am QA tests run off nightly builds
QA tests took 16 hours
Manual evaluation of results
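To make the feedback delay concrete, here is a rough sketch of the arithmetic behind those numbers. The function names are hypothetical, and it assumes a commit landing at or after 8pm waits for the next night's build, that the build finishes before the 9am QA run, and that weekends and the manual evaluation step are ignored.

```python
from datetime import datetime, timedelta

# Figures taken from the notes above; everything else is an assumption.
NIGHTLY_BUILD_HOUR = 20            # 8pm nightly build
QA_START_HOUR = 9                  # 9am QA run against the nightly build
QA_DURATION = timedelta(hours=16)  # QA tests took 16 hours


def qa_results_time(commit: datetime) -> datetime:
    """Return when QA results for a commit are ready (before manual evaluation)."""
    # The commit makes tonight's build only if it lands before 8pm.
    build = commit.replace(hour=NIGHTLY_BUILD_HOUR, minute=0, second=0, microsecond=0)
    if commit >= build:
        build += timedelta(days=1)
    # QA starts at 9am the morning after the nightly build.
    qa_start = (build + timedelta(days=1)).replace(hour=QA_START_HOUR)
    return qa_start + QA_DURATION


def feedback_delay_hours(commit: datetime) -> float:
    """Hours between a commit and its QA results."""
    return (qa_results_time(commit) - commit).total_seconds() / 3600


if __name__ == "__main__":
    print(feedback_delay_hours(datetime(2014, 6, 23, 19, 0)))  # 30.0 (made the build)
    print(feedback_delay_hours(datetime(2014, 6, 23, 21, 0)))  # 52.0 (missed the build)
```

Under these assumptions, a developer waits more than a day for test results in the best case, and more than two days if the change misses the nightly build.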
Create feedback loops
Fail fast
Culture of continual experimentation
Culture of continuous learning
Observe-Orient-Decide-Act (OODA) – John Boyd – the faster you can loop, the more likely you are to survive a dogfight
Learning via retrospectives
Stickies, time, people
Experimentation via hackdays
Every two weeks
Protect this time
Graphite, Sensu, OpenStack
The opposite of fragile is not robust, but anti-fragile
Fragile orgs are weak and crack under stress
Robust orgs are strong, but still crack under stress
Anti-fragile orgs get stronger from stress