In this presentation (video: https://youtu.be/3LpuncdAehE) I cover some key lessons we've learned over the years as we moved from TFS on-premises to running VS Team Services in Azure: feature flags, circuit breakers (resiliency), live site, and our testing transformation. You can find even more detail at https://aka.ms/LearnDevOps.
3. Timeline
Sprint 1 – August 2010
Team Rooms – August 2010
Service Online – April 23, 2011
Service Preview – June 2012
DRI Duty – October 2013
1ES – Spring 2014
Combined Engineering – November 2014
Test Conversion – April 2017
Sprint 127 – November 2017
7. What do feature flags give us?
Decouple deployment and exposure
Flags provide runtime control down to the individual user
Change without redeployment
Controlled via PowerShell or web UI
Support early feedback, experimentation (stages)
Quick off switch (sketch below)
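To make the pattern concrete, here is a minimal sketch in TypeScript. The class, flag storage, and per-user overrides are hypothetical illustrations of the idea, not the actual VSTS implementation:

// Minimal sketch of the feature-flag pattern; names and scoping are invented.
type FlagState = "on" | "off";

class FeatureFlagClient {
  private flags = new Map<string, FlagState>();          // flag -> global state
  private userOverrides = new Map<string, FlagState>();  // "flag:user" -> override

  // Runtime control down to an individual user: overrides win over globals.
  isEnabled(flag: string, userId?: string): boolean {
    const override = userId ? this.userOverrides.get(`${flag}:${userId}`) : undefined;
    return (override ?? this.flags.get(flag) ?? "off") === "on";
  }

  // "Quick off switch": flip state at runtime, no redeployment.
  set(flag: string, state: FlagState, userId?: string): void {
    if (userId) this.userOverrides.set(`${flag}:${userId}`, state);
    else this.flags.set(flag, state);
  }
}

// Deployment and exposure are decoupled: the code ships dark...
const flags = new FeatureFlagClient();
flags.set("SourceControl.Revert", "off");
flags.set("SourceControl.Revert", "on", "early-adopter-42"); // ...then lights up per user.

if (flags.isEnabled("SourceControl.Revert", "early-adopter-42")) {
  // new code path
}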
9. Define
<?xml version="1.0" encoding="utf-8"?>
<!--
  In this group we register TFS-specific features and set their states.
-->
<ServicingStepGroup name="TfsFeatureAvailability" … >
  <Steps>
    <!-- Feature Availability -->
    <ServicingStep name="Register features" stepPerformer="FeatureAvailability" … >
      <StepData>
        <!-- Specifying owner allows implicit removal of features -->
        <Features owner="TFS">
          <!-- Begin TFVC/Git -->
          <Feature name="SourceControl.Revert" description="Source control revert features" />
11. What could go wrong?
Features were to be revealed at the Connect 2013 event
We turned features on globally (SU1) just before the keynote...
It didn’t go well.
12. What went wrong?
Turned on flags in production the morning of the event
On the only instance we had at the time
Lessons learned over time…
Turn new features on completely at least 24 hours ahead of an event
Turn on incrementally
Monitor
Use feature flags for back-end changes
17. What happened?
1. VS requests notifications from SPS
2. SPS creates a subscription in Service Bus
3. That call was normally fast but became slow
4. Each call executed in SPS on a thread from the thread pool
5. The thread pool was quickly exhausted by incoming requests
6. Requests started queuing
7. SPS hosts critical services like auth and account…we're down!
8. VS retried failed connections & SPS retried failed calls to Service Bus
19. What do we need?
Limit the impact of a problem
Degrade gracefully
Once the problem is over, the service should self-recover quickly
For this incident, we mitigated by turning off the feature flag
Could we detect when a dependency is unhealthy and fail fast?
20. Circuit breakers
Pattern from Nygard's "Release It!", popularized by Netflix (Hystrix)
Stop cascading failures in a complex distributed system
Protection from latency, failure and volume (concurrency)
Shed load quickly: fail fast and rapidly recover
Fall back and gracefully degrade when possible
21. Circuit Breaker State Transitions (see the sketch below)
Closed
  On call -> pass through
  Call succeeds -> count success
  Call fails -> count failure
  Threshold reached -> trip breaker (go to Open)
Open
  On call -> fail fast
  On timeout -> attempt reset (go to Half-Open)
Half-Open
  On call -> pass through
  Call succeeds -> reset (go to Closed)
  Call fails -> trip breaker (go to Open)
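These transitions map almost directly onto code. Here is a minimal TypeScript sketch; the class name, the consecutive-failure counting, the thresholds, and the single-probe half-open policy are illustrative assumptions, not our actual implementation (real breakers typically count failures over a rolling window and also protect against volume):

// Sketch of the Closed/Open/Half-Open transitions from slide 21.
type State = "Closed" | "Open" | "HalfOpen";

class CircuitBreaker {
  private state: State = "Closed";
  private failures = 0;
  private openedAt = 0;

  constructor(private threshold = 5, private resetTimeoutMs = 30_000) {}

  async call<T>(fn: () => Promise<T>, fallback: () => T): Promise<T> {
    if (this.state === "Open") {
      if (Date.now() - this.openedAt < this.resetTimeoutMs) return fallback(); // fail fast
      this.state = "HalfOpen"; // timeout elapsed: attempt reset with one probe call
    }
    try {
      const result = await fn();  // pass through
      this.failures = 0;
      this.state = "Closed";      // half-open success -> reset
      return result;
    } catch (err) {
      this.failures++;
      if (this.state === "HalfOpen" || this.failures >= this.threshold) {
        this.state = "Open";      // trip breaker
        this.openedAt = Date.now();
      }
      return fallback();          // degrade gracefully
    }
  }
}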
22. How do we know it works?
Fault Injection Testing
Test on our account in production (SU0 – no customers)
Simulate real live site incidents
Randomly inject real faults to measure resiliency
Network disconnects and latency
CPU, Memory, Disk pressure
Opening circuit breakers
Does the service handle the fault gracefully?
Does the service recover quickly?
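The real faults above (network disconnects, CPU/memory/disk pressure) are injected at the infrastructure level; the following hypothetical wrapper just shows the idea of random latency and failure injection in application code, with invented names and config shape:

// Illustrative fault-injection wrapper, not the actual tooling.
const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

interface FaultConfig {
  latencyRate: number;    // fraction of calls that get extra latency, e.g. 0.05
  extraLatencyMs: number; // delay added to those calls
  failureRate: number;    // fraction of calls that throw, e.g. 0.01
}

// Wrap a dependency call so a run can randomly inject latency and failures,
// then watch whether breakers open, fallbacks kick in, and recovery is quick.
function withFaults<T>(call: () => Promise<T>, cfg: FaultConfig): () => Promise<T> {
  return async () => {
    if (Math.random() < cfg.latencyRate) await sleep(cfg.extraLatencyMs); // latency fault
    if (Math.random() < cfg.failureRate) throw new Error("injected fault"); // failure fault
    return call();
  };
}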
23. Lessons Learned
Tune in production
Verify fallback
Treat open breakers as symptoms, not causes, during an event
Make it easy to understand what opened a circuit breaker
Monitor breakers and root-cause why they opened
28. Alerting is the key to fast detection
Before
• Redundant alerts for the same issue
• Needed to set the right thresholds and tune often
• Stateless alerts contributed to further noise
After
• Every alert must be actionable and represent a real issue with the system
• Alerts should create a sense of urgency – false alerts dilute that
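One way to read "stateless alerts contributed to further noise": an alert that re-fires on every bad sample pages repeatedly for a single issue. A hypothetical stateful version (names and shape invented for this sketch) fires once on the transition into an unhealthy state and resolves on the way out:

// Sketch of a stateful alert: one page per incident, auto-resolve when it clears.
class StatefulAlert {
  private firing = false;

  constructor(
    private name: string,
    private isUnhealthy: () => boolean, // e.g. "error rate above SLA threshold"
  ) {}

  // Run on every monitoring interval.
  evaluate(): void {
    const unhealthy = this.isUnhealthy();
    if (unhealthy && !this.firing) {
      this.firing = true;
      console.log(`PAGE: ${this.name}`);     // exactly one actionable page per incident
    } else if (!unhealthy && this.firing) {
      this.firing = false;
      console.log(`RESOLVED: ${this.name}`); // closes the loop, no lingering noise
    }
    // unhealthy && already firing: stay silent instead of re-paging the same issue
  }
}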
30. A strategy our teams adopted to provide focus and to cope with an interrupt-driven culture:
• The team self-organizes each sprint into two distinct sub-teams: Features and Shield
• Membership rotates each sprint
Example: a team of 10 engineers splits into a Shield Team, which deals with all live-site issues and interruptions, and a Feature Team, which works on committed features (new work).
31. Live Site Incidents
• Conference bridge created
• DRIs brought onto the call
• Communication externally and internally
• Pursue multiple theories
• Gather data for root cause & mitigate
• Record changes
• Rotate people during long-running LSIs
32. Root Cause Analysis (RCA)
Repair work items are logged in VSTS and linked into the post-mortem for traceability
Time-to metrics (detect, mitigate) are key KPIs reviewed for improvements
Each Feature Team has goals for closing repair items
39. Engineer: Combining Dev and Test
New Engineer role merged responsibilities from dev and test
Every engineer and team has E2E accountability
Big cultural shift across the company
41. L0 – requires only built binaries, no dependencies
L1 – adds ability to use SQL and file system
Run L0 & L1 in the pull request builds
L2 – test a service via REST APIs
L3 – full environment to test end to end
TRA tests – legacy functional tests (a sketch of L0 vs. L1 follows below)
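To make the levels concrete, here is a hypothetical pair of tests in TypeScript; the function under test and the file layout are invented for illustration, not taken from the real test suite. The L0 test touches only in-memory code, while the L1 test is allowed to use the file system:

// Illustrative L0 vs. L1 tests, framework-free for self-containment.
import * as assert from "assert";
import * as fs from "fs";
import * as os from "os";
import * as path from "path";

// Unit under test: pure logic, no dependencies.
function normalizeBranchName(name: string): string {
  return name.trim().replace(/^refs\/heads\//, "");
}

// L0: in-memory only, runs in the pull request build.
assert.strictEqual(normalizeBranchName("refs/heads/master"), "master");

// L1: allowed to use the file system.
const tmp = path.join(os.tmpdir(), "l1-test.txt");
fs.writeFileSync(tmp, "refs/heads/feature");
assert.strictEqual(normalizeBranchName(fs.readFileSync(tmp, "utf8")), "feature");
fs.unlinkSync(tmp);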
42. Test at the lowest level possible
Fast and reliable
Product is designed for testability
Test code is product code
End-to-end tests can run in production
Testing infrastructure is a shared service
43. Test conversion progression: started with L0/L1 tests -> analyzed legacy TRA tests -> TRA-to-L2 conversion -> L3 tests
45. Test Reliability Workflows
Reliability Run Workflow
• Execute reliability runs on green builds
• Flaky tests excluded from Test Run Summary
• Aggregate Flaky tags over cumulative runs
• Un-tag flakiness when the reliability bug is resolved
Official Run Workflow
• Execute runs
• Carry flakiness from the latest reliability run to the current run
• Tag failed tests as Flaky & file reliability bugs
• Reliability bugs surfaced in team scorecards
Supporting work: VSTS extension tasks & test reporting enhancements
47. What do we measure?
Live Site Health/Debt
• Time to Detect, Time to Mitigate
• Incident prevention items
• Aging live site problems
• Customer support metrics (SLA, MPI, top drivers)
Engineering Health/Debt
• Bug cap per engineer
• Aging bugs in important categories
• Test reliability bugs
Velocity
• Time to build
• Time to self-test
• Time to deploy
• Time to learn (telemetry pipe)
As of October 2017:
Single repo
430 people pushed to the repo in the last 30 days
~40 feature teams
Code base is 90+% shared between the service and the on-prem product
Teams work in master
No nightly build
Over 3,000 projects (doubled in the last 3 years)
Pull requests build & run unit test validation
And I can make it the main path for all customers
https://blogs.msdn.microsoft.com/bharry/2013/11/25/a-rough-patch/
Sent the system into a death spiral where, even with all of the new features turned off, it struggled to become healthy again. We didn't have any resiliency built into the system.
We only had TFS SU1 and SPS at the time
Visual Studio 2013 introduced a new feature called synchronized settings
The retries had a multiplying effect!
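As an illustration (the actual retry counts aren't in the slides): if each VS client retries a failed request 3 times and SPS retries each failed Service Bus call 3 times, one user action fans out into up to 3 x 3 = 9 calls against a dependency that is already slow, so load grows multiplicatively exactly when the system can least afford it.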
Michael T. Nygard, “Release It! – Design and Deploy Production-Ready Software”
Ben Schmaus, “Making the Netflix API More Resilient” (Netflix Hystrix), https://github.com/Netflix/Hystrix/wiki
Commands come in
If closed, command runs
If it fails, record failure
If too many failures in the time window, open
If open
Command gets fallback
Periodically a command is allowed to pass through to test to see if the problem is resolved
See https://github.com/Netflix/Hystrix/wiki/How-it-Works for complete explanation
Test in production on SU0 where there are no customers, and we have real production load – not just tests.
Here are some patterns we’ve caught from this:
Fallbacks don’t work as expected. For example, a feature returning a 500 error when Redis is down. That’s a best-effort cache and shouldn’t cause failures.
Circuit breakers don’t open when they should. While it’s straightforward to test fallback logic through testcases, injecting faults helps verify the tuning of those resiliency mechanisms. A breaker that never opens is useless.
Requests don’t time out quickly enough or they retry when they shouldn’t. Our web UI shouldn’t hang for minutes trying to fetch an identity image, for example.
Unexpected dependencies turn small problems into major incidents. A call to a non-critical service was added to service startup, so if that non-critical service was down or slow while an instance of a critical service was starting, the critical service was down or slow too.
One important thing we do is to run nearly all of this in our Ring0 production deployments. That’s where our account is that we work from. We tried test deployments and other canary instances early on, but struggled with the typical problems of load patterns that aren’t real-world and missing telemetry and alerting infrastructure. There’s nothing like production. Having a Ring0 instance means that if we do have greater-than-expected impact, we only impact our own team and not other customers.
Long timeouts may limit the effectiveness of circuit breakers – better to fail fast
Sometimes people mistake an open circuit breaker for the cause rather than a symptom
Story about the initial release – in 2011, when we first announced the service at //Build, Twitter was a better monitor than anything we had
~60GB per day in 2015
Journey Story:
Darkness: The service went live > some issues happened > there were blind spots
Rally the Troops: We talked to the team about telemetry
There is Light: We started to generate tons of telemetry and alerts
Blinded by the Light: We became overwhelmed by too much data & too many alerts
Time to Tune: We refocused the team on reducing alert noise
Overcompensating: After tuning and tweaking alerts, we started missing some customer incidents
The Art of Balance: We're now learning how to enable precise alerting and how to make sense of so much data:
Using the sophistication of "high-maturity alerts" over log analytics
Shifting critical alerts that "page" to focus on SLA
Investigating anomaly detection
Key takeaways:
• 5 whys
• Define improvements for both code and process
• Visibility to ensure learnings are applied
Outcomes:
• Improve how you respond (TTx)
• Stop it from happening again
This is a graph of the TFS 2010 release – you can see beta 1, beta 2, RC, RTM as we built up and burned down bugs, and some percentage of bug fixes introduced new bugs or allowed more bugs to be discovered
One bug fixed per dev per day is the average over enough time. The bug cap started at 15 per engineer, and the team just grew to that number and stayed there. We limited it to 5, so we were at minimum 5 days from shipping – maybe more than that because of regressions. Zero Bug Bounce (ZBB)
November 2014
We never shipped a release of TFS with all of the tests passing! Tests weren’t reliable enough, so someone had to manually go through all of the failures. That led to mistakes, of course. Same for shipping the service every 3 weeks.
L2 conversion took over two years, as it was done in parallel with new feature development. So it progressed in fits and starts.
Phase 1
Made it easy to author and execute high quality L0/L1 tests
Stopped creating new TRA tests as much as possible
Phase 2
Categorized the legacy TRA tests into:
Tests that can be deleted
Tests that can move to L0/L1
Tests that will move to the VSSF Test SDK
On-prem tests we expect to maintain in lights-on mode
Phase 3
Test Arch v-team rewrote the L2 test framework
Top-down push from management with org wide scorecard
Phase 4
Just started
Continuous Reliability Runs triggered on green CI builds
If a test fails in any of 500 runs, it is considered Flaky
Bugs filed for flaky tests
Now working on automatic re-run of tests and detecting flakiness if a test passes on re-run
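A minimal sketch of the flaky-tagging rule described above, assuming a hypothetical result shape: on a green build the product is known-good, so a test that fails even once across the reliability runs is the test's fault and gets tagged.

// Sketch of flaky detection over reliability runs (data shapes invented).
interface ReliabilityResult {
  testName: string;
  runs: boolean[]; // pass/fail outcome of each reliability run (e.g. 500 entries)
}

// A test that fails in any reliability run on a green build is considered flaky.
function findFlakyTests(results: ReliabilityResult[]): string[] {
  return results
    .filter((r) => r.runs.some((passed) => !passed))
    .map((r) => r.testName);
}

// Flaky tests are excluded from the official run summary and get a reliability
// bug filed; the tag comes off only when that bug is resolved.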