This is the story of how Microsoft moved from deploying every 2 years to deploying every 3 weeks for Azure DevOps and Visual Studio. These learnings have been replicated across the company.
12. Typical day at Microsoft
Data: Internal Microsoft engineering system activity, August 2018
12.4k Pull Requests per day
67k Git commits per day
78k Deployments per day
146k Builds per day
500m Test executions per day
500k Work items updated per day
5m Work items viewed per day
Azure DevOps Services is the toolchain of choice for Microsoft engineering, with over 90,000 internal users
https://aka.ms/DevOpsAtMicrosoft
54. Start with what is most important/most painful, and go from there
Designing metrics is as hard as designing features
Bake it into the review culture, from top to bottom. Cadence is the heartbeat; it spurs activity.
In pursuit of that goal, in the last year alone, I have visited 46 customers in 38 cities & 14 different countries. And sometimes I still get surprised by some of the strange and wonderful things that they do… it’s happening less and less…
Everyone has the same problems, faulty understandings, and dysfunctional behaviours… some are just better at working around them than others…
Visual Studio depends on Windows.
Windows depends on Visual Studio for compilers.
What happens when there’s a compiler bug?
Windows vendors the entire toolchain.
Plan, learn, react to feedback
Quality
Chaos
Enterprise scale
A product cycle looked something like this. It worked given the environment… but the environment has changed. We needed something different.
Today we look more like this. That is to say that our teams work in 3 week sprints… and we plan continuously. This is how we run the business.
Just move on to the next sprint, it’s right there.
I can’t tell you there was a day we made a decision to be Agile… instead, a group said “Hey, this agile thing sounds interesting… we want to try that”. The decision I made was not stopping them from trying Agile.
This approach aligns with what I was describing earlier… ALIGNED AUTONOMY.
We see alignment through the SCENARIOS and SEASONS… and we give teams autonomy through their use of STORIES and TASKS.
In fact, if you looked at my backlogs, and the backlogs of my teams… you’d see these exact terms.
When we started planning the service, our initial thinking was more like a box product
Started by looking at major/minor updates
All updates were major!
Story: December 2011 update went very badly and took a week to complete.
Larger sets of changes are harder to test and diagnose
Risk is proportional to the ship cycle
Ship frequently and stay near ship quality
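To make the “risk is proportional to the ship cycle” claim concrete, here is an illustrative back-of-envelope model (my own sketch, not a figure from the deck):

```latex
% Assume changes land at a rate $r$ per week, each carrying an average
% defect load $d$. The untested change batched into one release then
% grows linearly with the time $T$ since the last ship:
\[
  D_{\text{release}} \approx d \cdot r \cdot T
\]
```

Under those assumptions, going from a 2-year cycle (T ≈ 104 weeks) to a 3-week one shrinks every deployment’s batch of accumulated change by roughly 35x.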
Our organizational chart is by discipline. PMs report to PMs. Devs report to Devs. Testers report to Testers.
However, our business is managed through cross-discipline teams.
Melvin Conway made this observation; Fred Brooks coined the term for it…
Conway’s Law
“organizations which design systems ... are constrained to produce designs which are copies of the communication structures of these organizations”
Aka shipping the org chart
Hard requirements around release branches, while devs want freedom in topic branches
Starts with getting the branch workflows right in a single repo
Adding support for robust cross-repo sharing across release trains
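As a rough illustration of that split, a minimal sketch of branch policy (the branch-name patterns and policy fields are assumptions for illustration, not the actual Azure DevOps rules): release branches get hard gates, topic branches stay free.

```python
import re

# Hypothetical branch-name conventions, for illustration only.
RELEASE = re.compile(r"^releases/M\d+$")    # e.g. releases/M137 (release train)
TOPIC = re.compile(r"^(users|topic)/.+")    # e.g. users/alice/fix-login

def policy_for(branch: str) -> dict:
    """Return the (illustrative) protection policy for a branch."""
    if RELEASE.match(branch):
        # Hard requirements: reviewed, green build, history locked down.
        return {"require_review": True, "require_green_build": True, "allow_force_push": False}
    if TOPIC.match(branch):
        # Devs get freedom: rebase, force-push, experiment at will.
        return {"require_review": False, "require_green_build": False, "allow_force_push": True}
    # Anything else (e.g. main) defaults to the strict policy.
    return {"require_review": True, "require_green_build": True, "allow_force_push": False}

print(policy_for("releases/M137"))          # strict gates
print(policy_for("users/alice/fix-login"))  # free
```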
Tell stories here.
Code complete… we celebrated this achievement. But what did we have? A lot of code… with a lot of bugs. And no way to deliver it to customers.
Test & Stabilization… how do you think the team felt? Morale? Bad. We’re now just climbing a mountain.
No, we pay our debt as we go.
Our new bug curve looks more like this. We don’t let it ever grow out of control. This enables us to ship features when they’re ready… instead of only at these “big events”.
Availability and usage are hard to troubleshoot with all of this going on
Reinforce what this transformation made possible for our engineers, and the opportunity for it to do the same in their orgs and in the way they work with us as synergistic technologies
Find ways your methodologies/solutions could plug in
Synthetic tests – “test in production” used in the earliest days of the service; run broad functional tests against one test account; we left this behind pretty quickly
Command health – Aggregate availability number based on command pass/fail. This is a pretty standard model. It worked well for us when command volumes were relatively low. It lost sensitivity as command volumes grew.
Customer impact – The main message here is that we choose to create buckets to deal with the scale problem. We measure failed user minutes.
Pass/fail & performance grouped in time and aggregated to account then service
Dashed black line at the top is the Command Health (old) model – not sensitive
Solid black line shows Customer Impact
We now clearly see that there is a customer impact during this event even though failed/slow command numbers are small.
At a very high level… we’ve normalized so that every active account contributes an equal amount to the availability measure. This gives small accounts a voice.
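A minimal sketch of that “failed user minutes” idea (field names and the failure rule are assumptions; the real model also folds in slow commands and the performance buckets mentioned above):

```python
from collections import defaultdict

def availability(events):
    """Customer-impact availability: 1 - failed user minutes / active
    user minutes, normalized so every active account contributes equally."""
    # Bucket commands into per-account minutes; a minute fails for an
    # account if any command in that minute failed (or was too slow).
    minutes = defaultdict(dict)                 # account -> {minute: all_ok}
    for account, minute, ok in events:
        prev = minutes[account].get(minute, True)
        minutes[account][minute] = prev and ok

    # Each active account yields one availability number; the service
    # figure is their mean, so small accounts get an equal voice.
    per_account = []
    for buckets in minutes.values():
        failed = sum(1 for all_ok in buckets.values() if not all_ok)
        per_account.append(1 - failed / len(buckets))
    return sum(per_account) / len(per_account) if per_account else 1.0

# Tiny example: contoso fails 1 of its 2 active minutes (0.5), fabrikam
# none (1.0); the normalized service availability is their mean.
events = [("contoso", "09:00", True), ("contoso", "09:01", False),
          ("fabrikam", "09:00", True)]
print(availability(events))  # 0.75
```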
Classified alerts and reduced noise: repeat, non-actionable, …
One of our biggest achievements was tuning our alerts accurately enough to autodial the DRI without the need for human escalation. We achieved this by having a health model to eliminate noisy, redundant alerts, and smart boundaries to indicate when action is actually needed, as shown in Figure 15. It has given us a 40x improvement in alert precision, to the extent that by February 2015, all P0 and P1 alerts were routed correctly by the autodialer.
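A sketch of the kind of filtering that makes autodial viable (the dedupe window and the non-actionable classes here are assumptions, not the actual health model):

```python
import time

class AlertFilter:
    """Suppress repeat and non-actionable alerts before paging the DRI."""

    def __init__(self, repeat_window_s=900, non_actionable=("transient", "known-benign")):
        self.repeat_window_s = repeat_window_s      # dedupe window (assumed 15 min)
        self.non_actionable = set(non_actionable)   # alert classes that never page
        self._last_paged = {}                       # alert key -> last page time

    def should_page(self, key, kind, now=None):
        now = time.time() if now is None else now
        if kind in self.non_actionable:
            return False                            # log it, don't wake anyone
        last = self._last_paged.get(key)
        if last is not None and now - last < self.repeat_window_s:
            return False                            # repeat of a recent page
        self._last_paged[key] = now
        return True                                 # new and actionable: autodial the DRI

f = AlertFilter()
print(f.should_page("db-latency", "actionable", now=0))   # True
print(f.should_page("db-latency", "actionable", now=60))  # False (repeat)
print(f.should_page("cache-miss", "transient", now=0))    # False (non-actionable)
```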
Deployments don’t take much effort. We can redeploy rather than roll back.
Optimize for time to detect and time to mitigate, not mean time to failure: redundancy in the system
Root cause analysis on all of it
Revisiting telemetry and alerting to find earlier signs
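To make “optimize for time to detect and time to mitigate” measurable, a small sketch (the incident record layout is hypothetical):

```python
from datetime import datetime

# Hypothetical incident records: when the failure started, when
# telemetry/alerting detected it, and when it was mitigated.
incidents = [
    {"start": datetime(2015, 2, 1, 9, 0),
     "detected": datetime(2015, 2, 1, 9, 4),
     "mitigated": datetime(2015, 2, 1, 9, 30)},
    {"start": datetime(2015, 2, 3, 14, 0),
     "detected": datetime(2015, 2, 3, 14, 12),
     "mitigated": datetime(2015, 2, 3, 15, 0)},
]

def mean_minutes(deltas):
    return sum(d.total_seconds() for d in deltas) / len(deltas) / 60

ttd = mean_minutes([i["detected"] - i["start"] for i in incidents])      # time to detect
ttm = mean_minutes([i["mitigated"] - i["detected"] for i in incidents])  # time to mitigate
print(f"mean TTD: {ttd:.0f} min, mean TTM: {ttm:.0f} min")  # 8 min, 37 min
```

Tracking these per incident is what makes “find earlier signs” an actionable goal rather than a slogan.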
By being built on Azure, we get a lot of the failure handling/redundancy for free. Jeff’s whitepaper: aka.ms/vsosecurity
We’d ask for feedback after each milestone – planning, Beta, etc.
The problem was, there was never time to react to any of it. For the most part, we’d tell people “sorry”… and push those things off to the next release. We’ll get to that in 2.5 years when the next release comes out.
We’d find bugs with the process… and fix them. No problems there. But we couldn’t react to anything our customers using the product were telling us… or very, very little.
We’ve now got channels for feedback continually. We still have the “big event” feedback at preview, etc. But we’ve got a channel to talk to customers… constantly.
In fact, to make that a bit more real… here are examples from our release notes that we write every 3 weeks with updates to our service. At each of these intervals we’ve got a chance to listen to customers and react.
We’ve included in our process some lightweight methods to stay connected. Every 3 sprints we sit down with teams for a “chat”. This is a direct conversation with leadership about 3 things (next slide).
What’s next on your backlog?
Is debt under control?
Any issues?
Team chats are direct conversations with the leadership team. Every org has a layer in the “middle”… we’ve got that too, although we’ve done a lot to flatten our orgs in recent years.
Instead of this… we do this (next slide).
That’s not to say that the folks in the middle aren’t involved – they are. But we talk directly with the team.