Lessons Learned Doing DevOps at Scale at Microsoft
Buck Hodges
Director of Engineering
Microsoft VS Team Services
Level: Intermediate
Team Foundation Server (TFS)
Visual Studio Team Services (VSTS)
Sprint 1: August 2010
Sprint 127: November 2017
Team Rooms: August 2010
1ES: Spring 2014
DRI Duty: October 2013
Combined Engineering: November 2014
Test Conversion: April 2017
Service Online: April 23, 2011
Service Preview: June 2012
3 weeks
Team Foundation Server (TFS)
Visual Studio Team Services (VSTS)
Controlling Exposure
Feature Flags
ON
OFF
What do feature flags give us?
Decouple deployment and exposure
Flags provide runtime control down to individual user
Change without redeployment
Controlled via PowerShell or web UI
Support early feedback, experimentation (stages)
Quick off switch
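A minimal sketch of runtime flag control down to an individual user. The `FlagStore` class and its method names are hypothetical, invented for illustration; this is not the VSTS FeatureAvailability service, and the in-memory storage is an assumption.

```typescript
// Hypothetical in-memory flag store; not the VSTS FeatureAvailability service.
type FlagState = "on" | "off";

class FlagStore {
  // Global state per flag name.
  private globalState: Record<string, FlagState> = {};
  // Per-user overrides keyed by "flagName|userId".
  private userState: Record<string, FlagState> = {};

  setGlobal(flag: string, state: FlagState): void {
    this.globalState[flag] = state;
  }

  setForUser(flag: string, userId: string, state: FlagState): void {
    this.userState[flag + "|" + userId] = state;
  }

  // The most specific setting wins: user override, then global, then off.
  isEnabled(flag: string, userId: string): boolean {
    const override = this.userState[flag + "|" + userId];
    if (override !== undefined) {
      return override === "on";
    }
    return this.globalState[flag] === "on";
  }
}

const flags = new FlagStore();
// Expose the feature to one user without redeploying anything.
flags.setForUser("SourceControl.Revert", "early-adopter", "on");
const earlyAdopterSees = flags.isEnabled("SourceControl.Revert", "early-adopter");
const everyoneElseSees = flags.isEnabled("SourceControl.Revert", "someone-else");
```

Because exposure is data rather than code, the same call that turns a flag on can turn it off again, which is what makes the quick off switch possible.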
Example
Define
<?xml version="1.0" encoding="utf-8"?>
<!--
  In this group we should register TFS-specific features and set their states.
-->
<ServicingStepGroup name="TfsFeatureAvailability" … >
  <Steps>
    <!-- Feature Availability -->
    <ServicingStep name="Register features" stepPerformer="FeatureAvailability" … >
      <StepData>
        <!-- specifying owner to allow implicit removal of features -->
        <Features owner="TFS">
          <!-- Begin TFVC/Git -->
          <Feature name="SourceControl.Revert" description="Source control revert features" />
Check
private _addRevertButton(): void {
    // Only add the Revert button when the feature flag is on.
    if (FeatureAvailability.isFeatureEnabled(Flags.SourceControlRevert)) {
        this._calloutButtons.unshift(
            <button onClick={ () => Dialogs.revertPullRequest(
                this.props.repositoryContext,
                this.props.pullRequest.pullRequestContract(),
                this.props.pullRequest.branchStatusContract().sourceBranchStatus,
                this.props.pullRequest.branchStatusContract().targetBranchStatus) }
            >{ VCResources.PullRequest_Revert_Button }</button>
        );
    }
}
What could go wrong?
Features to be revealed at Connect 2013 event
We turned features on globally (SU1) just before the keynote...
It didn’t go well.
What went wrong?
Turned on flags in production morning of event
On the only instance we had at the time
Lessons learned over time…
Turn new features on completely at least 24 hours ahead of an event
Turn on incrementally
Monitor
Use feature flags for back end changes
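One way to turn a feature on incrementally is to expose it to a growing percentage of users. The sketch below is an assumption about how that could be done, not the mechanism used by VSTS; `bucketOf`, `isExposed`, and the string hash are hypothetical and chosen only for illustration.

```typescript
// Hypothetical staged rollout; the hash and the percentages are illustrative.
function bucketOf(userId: string): number {
  // Simple deterministic string hash mapped to a bucket in [0, 100).
  let hash = 0;
  for (let i = 0; i < userId.length; i++) {
    hash = (hash * 31 + userId.charCodeAt(i)) % 1000003;
  }
  return hash % 100;
}

// A user is exposed when their bucket falls below the current rollout percentage.
function isExposed(userId: string, rolloutPercent: number): boolean {
  return bucketOf(userId) < rolloutPercent;
}

// Count how many of a sample of users would see the feature at a given stage.
const sampleUsers: string[] = [];
for (let i = 0; i < 1000; i++) {
  sampleUsers.push("user-" + i);
}
function exposedCount(rolloutPercent: number): number {
  return sampleUsers.filter((u) => isExposed(u, rolloutPercent)).length;
}
```

Because the bucket depends only on the user id, raising the percentage only adds users; nobody who already has the feature loses it between stages, which keeps monitoring data comparable from one stage to the next.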
Resiliency
If we can’t prevent failure, can we limit impact?
Synchronize settings
New for Visual Studio 2013
Settings synchronized in the background
VS requests notifications from SPS
Creating notification subscriptions slowed
What happened?
1. VS requests notifications from SPS
2. SPS creates a subscription in Service Bus
3. That call was normally fast but became slow
4. Each call executed in SPS on a thread from the thread pool
5. Thread pool was quickly exhausted by incoming requests
6. Requests start queuing
7. SPS has critical services like auth and account…we’re down!
8. VS retried failed connections & SPS retried failed calls to Service Bus
Cascading failure
A low-priority feature consumed a limited resource
Blocked critical functions
Retries made the problem worse
What do we need?
Limit the impact of a problem
Degrade gracefully
Once problem is over, the service should self-recover quickly
For this incident, we mitigated by turning off feature flag
Could we detect when a dependency is unhealthy and fail fast?
Circuit breakers
Originated at Netflix
Stop cascading failures in a complex distributed system
Protection from latency, failure and volume (concurrency)
Shed load quickly: fail fast and rapidly recover
Fallback and gracefully degrade when possible
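Protection from volume can be sketched as a cap on concurrent calls to one dependency, so that a low-priority feature cannot consume the whole thread pool. `ConcurrencyLimit` and `runLimited` are hypothetical names, and the synchronous form is a simplification for illustration only.

```typescript
// Hypothetical concurrency cap for a single low-priority dependency.
class ConcurrencyLimit {
  private inFlight = 0;
  private maxInFlight: number;

  constructor(maxInFlight: number) {
    this.maxInFlight = maxInFlight;
  }

  // Returns false immediately instead of queuing when the cap is reached.
  tryEnter(): boolean {
    if (this.inFlight >= this.maxInFlight) {
      return false;
    }
    this.inFlight++;
    return true;
  }

  exit(): void {
    if (this.inFlight > 0) {
      this.inFlight--;
    }
  }
}

// Run work under the cap, or shed load through the fallback when it is full.
function runLimited<T>(limit: ConcurrencyLimit, work: () => T, fallback: () => T): T {
  if (!limit.tryEnter()) {
    return fallback();
  }
  try {
    return work();
  } finally {
    limit.exit();
  }
}
```

Rejecting at the cap rather than queuing is the design choice that matters here: queued requests are what exhausted the pool in the incident described above.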
Circuit Breaker State Transitions
Closed
On call -> pass through
Call succeeds -> count success
Call fails -> count failure
Threshold reached -> trip breaker
Open
On call -> fail
On timeout -> attempt reset
Half-Open
On call -> pass through
Call succeeds -> reset
Call fails -> trip breaker
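The state transitions above can be sketched as a small class. This is an illustration under stated assumptions, not the VSTS or Netflix implementation: the class and parameter names are hypothetical, the clock is injected, and the breaker trips on consecutive failures rather than on a failure rate over a window.

```typescript
// Hypothetical circuit breaker; names and thresholds are illustrative only.
type BreakerState = "closed" | "open" | "half-open";

class CircuitBreaker {
  private state: BreakerState = "closed";
  private failures = 0;
  private openedAt = 0;
  private failureThreshold: number;
  private resetTimeoutMs: number;
  private now: () => number;

  constructor(failureThreshold: number, resetTimeoutMs: number, now: () => number) {
    this.failureThreshold = failureThreshold;
    this.resetTimeoutMs = resetTimeoutMs;
    this.now = now;
  }

  getState(): BreakerState {
    return this.state;
  }

  call<T>(action: () => T, fallback: () => T): T {
    if (this.state === "open") {
      if (this.now() - this.openedAt < this.resetTimeoutMs) {
        return fallback(); // Open: fail fast without calling the dependency.
      }
      this.state = "half-open"; // Timeout elapsed: attempt reset.
    }
    let result: T;
    try {
      result = action(); // Closed or half-open: pass through.
    } catch {
      this.failures++;
      if (this.state === "half-open" || this.failures >= this.failureThreshold) {
        this.state = "open"; // Trip breaker.
        this.openedAt = this.now();
      }
      return fallback();
    }
    this.state = "closed"; // Success: reset.
    this.failures = 0;
    return result;
  }
}
```

While the breaker is open the dependency is never called, so a slow dependency cannot hold threads; the half-open probe is what lets the service recover on its own once the dependency is healthy again.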
How do we know it works?
Fault Injection Testing
Test on our account in production (SU0 – no customers)
Simulate real live site incidents
Randomly inject real faults to measure resiliency
Network disconnects and latency
CPU, Memory, Disk pressure
Opening circuit breakers
Does the service handle the fault gracefully?
Does the service recover quickly?
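A minimal sketch of fault injection around a single call, to show how the two questions above can be turned into a measurement. Everything here is hypothetical: `FaultPlan`, `withFaults`, and `countOutcomes` are invented names, and only one fault type (a thrown error) is modeled.

```typescript
// Hypothetical fault-injection wrapper; the fault rate and names are illustrative.
interface FaultPlan {
  failureRate: number;  // probability in [0, 1) that a call fails
  random: () => number; // injectable source of randomness, for example Math.random
}

// Wrap a call so that it sometimes throws instead of reaching the dependency.
function withFaults<T>(plan: FaultPlan, call: () => T): () => T {
  return () => {
    if (plan.random() < plan.failureRate) {
      throw new Error("injected fault");
    }
    return call();
  };
}

// Count how often a caller degrades to its fallback instead of succeeding.
function countOutcomes(
  call: () => string,
  fallback: string,
  attempts: number
): { ok: number; degraded: number } {
  const outcome = { ok: 0, degraded: 0 };
  for (let i = 0; i < attempts; i++) {
    let value: string;
    try {
      value = call();
    } catch {
      value = fallback;
    }
    if (value === fallback) {
      outcome.degraded++;
    } else {
      outcome.ok++;
    }
  }
  return outcome;
}
```

Passing the random source in as a parameter keeps an experiment repeatable, which helps when a fault run needs to be replayed after a fix.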
Lessons Learned
Tune in production
Verify fallback
Treat as symptoms not causes during an event
Make it easy to understand what opened a circuit breaker
Monitor and root cause why
Live site
Customer Intelligence, Business Intelligence, Operational Intelligence
Dashboard, DevOps, Debug, Experiments
Volume: ~7 TB average per day and growing!
Gather everything: Alerts, Activity Logging, Traces, Customer Intelligence, Synthetic, KPI Metrics, Job History, Perf Counters, Network, Platform
SLA
Mindset shift from on-premises to the cloud
Phase-1: Outside-in synthetic tests
Learn: Coverage too narrow as service footprint grows
Phase-2: Command health
Learn: Captures real user experience (errors & performance) but loses sensitivity as aggregate command volumes grow
Phase-3: Failed or slow user minutes
Learn: Empathizes with each individual customer experience - aka.ms/vsts-sla
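A sketch of the Phase-3 idea: count minutes per user, and treat a user minute as bad if that user saw any failed or slow command in it. The record shape, the slowness threshold, and the exact formula are assumptions made for illustration; the real definition is at aka.ms/vsts-sla.

```typescript
// Hypothetical command record; field names are illustrative.
interface CommandRecord {
  userId: string;
  timestampMs: number;
  failed: boolean;
  durationMs: number;
}

// Share of user minutes with no failed or slow command.
function userMinuteAvailability(records: CommandRecord[], slowThresholdMs: number): number {
  // Maps "userId|minute" to whether that user minute was bad.
  const bad: Record<string, boolean> = {};
  records.forEach((r) => {
    const key = r.userId + "|" + Math.floor(r.timestampMs / 60000);
    const isBad = r.failed || r.durationMs > slowThresholdMs;
    bad[key] = bad[key] === true || isBad;
  });
  const keys = Object.keys(bad);
  if (keys.length === 0) {
    return 1; // No activity: nothing was impacted.
  }
  const badCount = keys.filter((k) => bad[k] === true).length;
  return (keys.length - badCount) / keys.length;
}
```

Unlike a ratio over all commands, one user with a broken minute counts the same here whether the service handled a thousand commands or a million in that minute, which is why this model keeps its sensitivity as volume grows.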
Modeling the user experience
[Chart, 9/25/13 2:24 PM to 10:48 PM: FailedExecutionCount, SlowExecutionCount, Start, End, Availability (ID4 - Activity Only), Availability (Current). Annotations: Phase-3 model, Impact Window]
Alert when customers are impacted
Alerting is the key to fast detection
Before
• Redundant alerts for the same issue
• Needed to set the right thresholds and tune often
• Stateless alerts contributed to further noise
After
• Every alert must be actionable and represent a real issue with the system.
• Alerts should create a sense of urgency – false alerts dilute that
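One way to avoid redundant, stateless alerts is to remember which issues are already open and notify only on a change of state. The sketch below is an assumption about how that could look; `StatefulAlerter` and its message format are hypothetical.

```typescript
// Hypothetical stateful alerter: one notification per ongoing issue.
class StatefulAlerter {
  private active: Record<string, boolean> = {};
  private notifications: string[] = [];

  // Report the current health of a named check.
  report(issueKey: string, healthy: boolean): void {
    const wasActive = this.active[issueKey] === true;
    if (!healthy && !wasActive) {
      this.active[issueKey] = true;
      this.notifications.push("OPEN " + issueKey);
    } else if (healthy && wasActive) {
      this.active[issueKey] = false;
      this.notifications.push("RESOLVED " + issueKey);
    }
    // Repeated reports in the same state send nothing.
  }

  // Copy of everything sent so far.
  sent(): string[] {
    return this.notifications.slice();
  }
}
```

A check that fails on every evaluation for an hour then produces one page instead of sixty.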
Live Site Roles
A strategy adopted by our teams to provide focus, and assist with an interrupt culture.
• The team self-organizes each sprint into two distinct sub-teams: Features and Shield
• Rotates each sprint
Team of 10 Engineers
Shield Team: Deals with all live-site issues and interruptions
Feature Team: Works on committed features (new work)
Live Site Incidents
• Conference bridge created
• DRIs brought into the call
• Communication externally and internally
• Pursue multiple theories
• Gather data for root cause & mitigate
• Record changes
• Rotate people during long-running LSIs
Root Cause Analysis (RCA)
Repair work items are logged in VSTS but linked into the postmortem for traceability
Time-to metrics (such as time to detect and time to mitigate) are key KPIs that are reviewed for improvements
Each Feature Team has goals for closing repair items
Be Transparent
https://blogs.msdn.microsoft.com/vsoservice/?p=14575
https://blogs.msdn.microsoft.com/vsoservice/?p=14406
Lessons Learned
Transform Testing
[Diagram labels: Planning; Code; Code Complete; Test & Stabilize (Code and Test & Stabilize shown twice)]
# engineers on your team x 5 = ?
Rule: Stop working on new features when over cap.
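The rule can be written as a small check, assuming the slide's formula reads as the number of engineers on the team times 5. The function names are hypothetical.

```typescript
// Bug cap as read from the slide: engineers on the team times 5 (an assumption).
const BUGS_PER_ENGINEER = 5;

function bugCap(engineersOnTeam: number): number {
  return engineersOnTeam * BUGS_PER_ENGINEER;
}

// Over cap means more open bugs than the cap allows.
function mayWorkOnNewFeatures(openBugs: number, engineersOnTeam: number): boolean {
  return openBugs <= bugCap(engineersOnTeam);
}
```

For a team of 10 engineers this gives a cap of 50 open bugs; at 51 the team stops feature work until it is back under the cap.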
Engineer: Combining Dev and Test
New Engineer role merged responsibilities from dev and test
Every engineer and team has E2E accountability
Big cultural shift across the company
L0 – requires only built binaries, no dependencies
L1 – adds ability to use SQL and file system
Run L0 & L1 in the pull request builds
L2 – test a service via REST APIs
L3 – full environment to test end to end
TRA tests – Legacy functional tests
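A sketch of how test levels might drive what runs in a pull request build. The level names come from the slide; the `TestCase` shape and the example test names are hypothetical.

```typescript
// Test levels from the slide; everything else here is illustrative.
type TestLevel = "L0" | "L1" | "L2" | "L3";

interface TestCase {
  name: string;
  level: TestLevel;
}

// Only L0 and L1 tests run in pull request builds.
function runsInPullRequestBuild(test: TestCase): boolean {
  return test.level === "L0" || test.level === "L1";
}

const suite: TestCase[] = [
  { name: "parses branch names", level: "L0" },          // binaries only
  { name: "stores pull request in SQL", level: "L1" },   // SQL and file system
  { name: "creates pull request via REST", level: "L2" }, // deployed service
  { name: "reverts pull request end to end", level: "L3" } // full environment
];

const pullRequestSuite = suite.filter(runsInPullRequestBuild);
```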
Test at the lowest level possible
Fast and reliable
Product is designed for testability
Test code is product code
End to end tests can run in production
Testing infrastructure is a shared service
L3 tests
TRA to L2 conversion
Analyzed legacy TRA tests
Started with L0 / L1 tests
Aggregate Flaky tags over cumulative runs
Un-tag Flakiness when reliability bug is resolved
Test Reporting enhancements
VSTS Extension tasks & Test Reporting Enhancements
Reliability Run Workflow
Execute Reliability Runs on Green Builds
Flaky tests excluded from Test Run Summary
Official Run Workflow
Execute Runs
Carry Flakiness from Latest Reliability run to current run
Tag Failed Tests as Flaky & file Reliability bugs
Reliability Bugs surfaced in Team scorecards
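The reliability run and official run workflows can be sketched as a registry of flaky tags. This is an illustration only: `FlakyTestRegistry` and its methods are hypothetical and do not represent the VSTS extension tasks.

```typescript
interface TestResult {
  name: string;
  passed: boolean;
}

interface RunSummary {
  passed: number;
  failed: number;
  excludedAsFlaky: number;
}

// Hypothetical registry of flaky tags shared by both workflows.
class FlakyTestRegistry {
  private flaky: Record<string, boolean> = {};

  // Reliability run on a green build: a failure here is tagged as flaky.
  // Returns the newly tagged tests, each of which would get a reliability bug.
  recordReliabilityRun(results: TestResult[]): string[] {
    const newlyTagged: string[] = [];
    results.forEach((r) => {
      if (!r.passed && this.flaky[r.name] !== true) {
        this.flaky[r.name] = true;
        newlyTagged.push(r.name);
      }
    });
    return newlyTagged;
  }

  // Un-tag once the reliability bug is resolved.
  resolveReliabilityBug(testName: string): void {
    this.flaky[testName] = false;
  }

  // Official run: carry flakiness forward and exclude tagged tests.
  summarizeOfficialRun(results: TestResult[]): RunSummary {
    const summary: RunSummary = { passed: 0, failed: 0, excludedAsFlaky: 0 };
    results.forEach((r) => {
      if (this.flaky[r.name] === true) {
        summary.excludedAsFlaky++;
      } else if (r.passed) {
        summary.passed++;
      } else {
        summary.failed++;
      }
    });
    return summary;
  }
}
```

The build under a reliability run is already green, so a failure there points at the test rather than the product; that is the assumption that justifies tagging it as flaky.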
What do we measure?
Live Site Health/Debt
Time to Detect, Time To Mitigate
Incident prevention items
Aging live site problems
Customer support metrics (SLA, MPI, top drivers)
Engineering Health/Debt
Bug cap per engineer
Aging bugs in important categories
Test reliability bugs
Velocity
Time to build
Time to self test
Time to deploy
Time to learn (Telemetry pipe)
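A sketch of how Time to Detect and Time to Mitigate could be computed from incident timestamps. The `IncidentTimes` shape is hypothetical, and measuring both from the start of customer impact is an assumption; teams may define the start point differently.

```typescript
// Hypothetical incident record; all times are milliseconds since the epoch.
interface IncidentTimes {
  impactStartMs: number;
  detectedMs: number;
  mitigatedMs: number;
}

function minutesBetween(fromMs: number, toMs: number): number {
  return (toMs - fromMs) / 60000;
}

// Assumption: both metrics are measured from the start of customer impact.
function timeToDetectMinutes(incident: IncidentTimes): number {
  return minutesBetween(incident.impactStartMs, incident.detectedMs);
}

function timeToMitigateMinutes(incident: IncidentTimes): number {
  return minutesBetween(incident.impactStartMs, incident.mitigatedMs);
}

// Average of a metric across incidents, for a team scorecard.
function mean(values: number[]): number {
  if (values.length === 0) {
    return 0;
  }
  return values.reduce((sum, v) => sum + v, 0) / values.length;
}
```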
Lessons Learned
Wrapping Up
Thank you!
aka.ms/LearnDevOps
Lessons Learned Doing DevOps at Scale at Microsoft

  • 1. Lessons Learned Doing DevOps at Scale at Microsoft Buck Hodges Director of Engineering Microsoft VS Team Services Level: Intermediate
  • 2. Team Foundation Server (TFS) Visual Studio Team Services (VSTS)
  • 3. Sprint 1 August 2010 Sprint 127 November 2017 Team Rooms August 2010 1ES Spring 2014 DRI Duty October 2013 Combined Engineering November 2014 Test Conversion April 2017 Service Online April 23, 2011 Service Preview June 2012
  • 4. 3 weeks Team Foundation Server (TFS) Visual Studio Team Services (VSTS)
  • 7. What do feature flags give us? Decouple deployment and exposure Flags provide runtime control down to individual user Change without redeployment Controlled via PowerShell or web UI Support early feedback, experimentation (stages) Quick off switch
  • 9. Define <?xml version="1.0" encoding="utf-8"?> <!-- In this group we should register TFS specific features and sets their states. --> <ServicingStepGroup name="TfsFeatureAvailability" … > <Steps> <!-- Feature Availability --> <ServicingStep name="Register features" stepPerformer="FeatureAvailability" … > <StepData> <!--specifying owner to allow implicit removal of features --> <Features owner="TFS"> <!-- Begin TFVC/Git --> <Feature name="SourceControl.Revert" description="Source control revert features" />
  • 10. Check private _addRevertButton(): void { if (FeatureAvailability.isFeatureEnabled(Flags.SourceControlRevert)) { this._calloutButtons.unshift( <button onClick={ () => Dialogs.revertPullRequest( this.props.repositoryContext, this.props.pullRequest.pullRequestContract(), this.props.pullRequest.branchStatusContract().sourceBranchStatus, this.props.pullRequest.branchStatusContract().targetBranchStatus) } >{ VCResources.PullRequest_Revert_Button }</button> ); } }
  • 11. What could go wrong? Features to be revealed at Connect 2013 event We turned features on globally (SU1) just before the keynote... It didn’t go well.
  • 12. What went wrong? Turned on flags in production morning of event On the only instance we had at the time Lessons learned over time… Turn new features on completely at least 24 hours ahead of an event Turn on incrementally Monitor Use feature flags for back end changes
  • 14. If we can’t prevent failure, can we limit impact?
  • 15. Synchronize settings New for Visual Studio 2013 Settings synchronized in the background VS requests notifications from SPS
  • 17. What happened? 1. VS requests notifications from SPS 2. SPS creates a subscription in Service Bus 3. That call was normally fast but became slow 4. Each call executed in SPS on a thread from the thread pool 5. Thread pool was quickly exhausted by incoming requests 6. Requests start queuing 7. SPS has critical services like auth and account…we’re down! 8. VS retried failed connections & SPS retried failed calls to Service Bus
  • 18. Cascading failure A low-priority feature consumed a limited resource Blocked critical functions Retries made the problem worse
  • 19. What do we need? Limit the impact of a problem Degrade gracefully Once problem is over, the service should self-recover quickly For this incident, we mitigated by turning off feature flag Could we detect when a dependency is unhealthy and fail fast?
  • 20. Circuit breakers Originated at Netflix Stop cascading failures in a complex distributed system Protection from latency, failure and volume (concurrency) Shed load quickly: fail fast and rapidly recover Fallback and gracefully degrade when possible
  • 21. Circuit Breaker State Transitions Closed On call -> pass through Call succeeds -> count success Call fails -> count failure Threshold reached -> trip breaker Open On call -> fail On timeout -> attempt reset Half-Open On call -> pass through Call succeeds -> reset Call fails -> trip breaker Trip breaker Reset Attempt reset Trip breaker
  • 22. How do we know it works? Fault Injection Testing Test on our account in production (SU0 – no customers) Simulate real live site incidents Randomly inject real faults to measure resiliency Network disconnects and latency CPU, Memory, Disk pressure Opening circuit breakers Does the service handle the fault gracefully? Does the service recover quickly?
  • 23. Lessons Learned Tune in production Verify fallback Treat as symptoms not causes during an event Make it easy to understand what opened a circuit breaker Monitor and root cause why
  • 25. Customer IntelligenceBusiness IntelligenceOperational Intelligence Dashboard DevOps Debug Experiments Volume ~7TBAverage per day and growing! Alerts Activity Logging Traces Customer Intelligence Synthetic KPI Metrics Job History Perf Counters NetworkPlatform Gather everything SLA Mindset shift from on-premises to the cloud
  • 26. Phase-1: Outside-in synthetic tests Learn: Coverage too narrow as service footprint grows Phase-2: Command health Learn: Captures real user experience (errors & performance) but loses sensitivity as aggregate command volumes grow Phase-3: Failed or slow user minutes Learn: Empathizes with each individual customer experience - aka.ms/vsts-sla Modeling the user experience 80.0% 82.0% 84.0% 86.0% 88.0% 90.0% 92.0% 94.0% 96.0% 98.0% 100.0% -200 0 200 400 600 800 1000 1200 1400 1600 9/25/13 2:24 PM 9/25/13 3:36 PM 9/25/13 4:48 PM 9/25/13 6:00 PM 9/25/13 7:12 PM 9/25/13 8:24 PM 9/25/13 9:36 PM 9/25/13 10:48 PM FailedExecutionCount Start End SlowExecutionCount Availability (ID4 - Activity Only) Availability (Current) Phase-3 model Impact Window
  • 27. Alert when customers are impacted
  • 28. Alerting is the key to fast detection. Before: • Redundant alerts for the same issue • Needed to set the right thresholds and tune often • Stateless alerts contributed to further noise. After: • Every alert must be actionable and represent a real issue with the system • Alerts should create a sense of urgency; false alerts dilute that
  • 30. A strategy adopted by our teams to provide focus and assist with an interrupt culture. • The team self-organizes each sprint into two distinct sub-teams: Features and Shield • Rotates each sprint. Team of 10 engineers. Shield Team: deals with all live-site issues and interruptions. Feature Team: works on committed features (new work).
  • 31. Live Site Incidents • Conference bridge created • DRIs brought in to call • Communication externally and internally • Pursue multiple theories • Gather data for root cause & mitigate • Record changes • Rotate people during long-running LSIs
  • 32. Root Cause Analysis (RCA). Repair work items are logged in VSTS but linked into the postmortem for traceability. Time-to's are a key KPI and are reviewed for improvements. Each Feature Team has goals for closing repair items.
  • 36. Code Test & Stabilize Code Test & Stabilize Code Complete Planning
  • 37.
  • 38. Bug cap: (# of engineers on your team) × 5 = ? Rule: Stop working on new features when over cap.
  • 39. Engineer: Combining Dev and Test. New Engineer role merged responsibilities from dev and test. Every engineer and team has E2E accountability. Big cultural shift across the company.
  • 40.
  • 41. L0: requires only built binaries, no dependencies. L1: adds ability to use SQL and file system. Run L0 & L1 in the pull request builds. L2: test a service via REST APIs. L3: full environment to test end to end. TRA tests: legacy functional tests.
  • 42. Test at the lowest level possible. Fast and reliable. Product is designed for testability. Test code is product code. End-to-end tests can run in production. Testing infrastructure is a shared service.
  • 43. L3 tests; TRA to L2 conversion; Analyzed legacy TRA tests; Started with L0 / L1 tests
  • 44.
  • 45. VSTS Extension tasks & Test Reporting enhancements. Reliability Run Workflow: Execute Reliability Runs on green builds; tag failed tests as Flaky & file Reliability bugs; aggregate Flaky tags over cumulative runs; un-tag flakiness when the reliability bug is resolved. Official Run Workflow: Execute runs; carry flakiness from the latest Reliability run to the current run; Flaky tests excluded from Test Run Summary. Reliability bugs surfaced in Team scorecards.
  • 46.
  • 47. What do we measure? Live Site Health/Debt: Time to Detect, Time to Mitigate; incident prevention items; aging live site problems; customer support metrics (SLA, MPI, top drivers). Engineering Health/Debt: bug cap per engineer; aging bugs in important categories; test reliability bugs. Velocity: time to build; time to self test; time to deploy; time to learn (telemetry pipe).
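
The state transitions on slide 21 can be sketched in a few lines of Python. This is a minimal illustration, not the VSTS or Hystrix implementation: the class name, the `failure_threshold` and `reset_timeout` parameters, and the injectable `clock` are all assumptions made for the example. It also counts consecutive failures, a simplification of the "too many failures in the time window" rule.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class CircuitBreaker:
    """Hypothetical minimal breaker: Closed -> Open -> Half-Open -> Closed."""

    def __init__(self, failure_threshold: int = 3, reset_timeout: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold  # assumed tuning knob
        self.reset_timeout = reset_timeout          # seconds to stay open
        self.clock = clock                          # injectable for testing
        self.state = "closed"
        self.failures = 0
        self.opened_at = 0.0

    def call(self, command: Callable[[], T], fallback: Callable[[], T]) -> T:
        if self.state == "open":
            if self.clock() - self.opened_at < self.reset_timeout:
                return fallback()        # Open: fail fast, do not run the command
            self.state = "half-open"     # timeout elapsed: attempt reset
        try:
            result = command()           # Closed or Half-Open: pass through
        except Exception:
            self._on_failure()
            return fallback()
        self._on_success()
        return result

    def _on_success(self) -> None:
        if self.state == "half-open":
            self.state = "closed"        # trial call succeeded: reset
        self.failures = 0

    def _on_failure(self) -> None:
        self.failures += 1
        # A failed trial call, or too many failures while closed, trips the breaker.
        if self.state == "half-open" or self.failures >= self.failure_threshold:
            self.state = "open"
            self.opened_at = self.clock()
            self.failures = 0
```

A caller wraps each dependency call, for example `breaker.call(fetch_identity_image, lambda: DEFAULT_IMAGE)`, where both names are placeholders. The fallback is what lets the service degrade gracefully instead of hanging.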

Editor's Notes

  1. As of October 2017: Single repo. 430 people pushed to the repo in the last 30 days. ~40 feature teams. Code base is 90+% the same. Teams work in master. No nightly build. Over 3,000 projects (doubled in the last 3 years). Pull requests build & run unit test validation.
  2. And I can make it the main path for all customers
  3. https://blogs.msdn.microsoft.com/bharry/2013/11/25/a-rough-patch/ Sent the system into a death spiral where, even after turning all of the new features off, it struggled to become healthy again. We didn’t have any resiliency built in to the system.
  4. We only had TFS SU1 and SPS at the time https://blogs.msdn.microsoft.com/bharry/2013/11/25/a-rough-patch/
  5. Visual Studio 2013 introduced a new feature called synchronized settings
  6. From
  7. The retries had a multiplying effect!
  8. “Release It! – Design and Deploy Production-Ready Software”  Michael T. Nygard  Netflix Hystrix – “Making the Netflix API More Resilient”  Ben Schmaus https://github.com/Netflix/Hystrix/wiki
  9. Commands come in. If closed, the command runs. If it fails, record the failure. If there are too many failures in the time window, open. If open, the command gets the fallback. Periodically a command is allowed to pass through to test whether the problem is resolved. See https://github.com/Netflix/Hystrix/wiki/How-it-Works for complete explanation.
  10. Test in production on SU0 where there are no customers, and we have real production load, not just tests. Here are some patterns we’ve caught from this: Fallbacks don’t work as expected. For example, a feature returning a 500 error when Redis is down. That’s a best-effort cache and shouldn’t cause failures. Circuit breakers don’t open when they should. While it’s straightforward to test fallback logic through test cases, injecting faults helps verify the tuning of those resiliency mechanisms. A breaker that never opens is useless. Requests don’t time out quickly enough or they retry when they shouldn’t. Our web UI shouldn’t hang for minutes trying to fetch an identity image, for example. Unexpected dependencies turn small problems into major incidents. A call to a non-critical service was added through some additional code in service start. So if that non-critical service is down or slow when an instance of a critical service is starting, that critical service is down or slow. One important thing we do is to run nearly all of this in our Ring0 production deployments. That’s where our account is that we work from. We tried test deployments and other canary instances early on, but struggled with the typical problems of load patterns that aren’t real-world and missing telemetry and alerting infrastructure. There’s nothing like production. Having a Ring0 instance means that if we do have greater-than-expected impact, we only impact our own team and not other customers.
  11. Long timeouts may limit the effectiveness of circuit breakers; it is better to fail fast. Sometimes people mistake issues with circuit breakers for a cause.
  12. Story about initial release: Twitter was a better monitor than what we had when we first announced the service at //Build in 2011. ~60 GB per day in 2015.
  13. https://aka.ms/vsts-sla = https://blogs.msdn.microsoft.com/bharry/2013/10/14/how-do-you-measure-quality-of-a-service/
  14. Journey Story. Darkness: The service went live, some issues happened, and there were blind spots. Rally the Troops: We talked to the team about telemetry. There is Light: We start to generate tons of telemetry and alerts. Blinded by the Light: We become overwhelmed by too much data & alerts. Time to Tune: You refocus the team on reducing alert noise. Overcompensating: After tuning and tweaking alerts we start missing some customer incidents. The Art of Balance: We’re now learning how to enable precise alerting and how to make sense of so much data: using the sophistication of “high mature alerts” over log analytics; shifting critical alerts that “page” to focus on SLA; investigating anomaly detection.
  15. Key Takeaways: 5 whys; define improvements for both code and process; visibility to ensure learnings are applied. Outcomes: improve how you respond (TTx); stop it from happening again.
  16. This is a graph of the TFS 2010 release – you can see beta 1, beta 2, RC, RTM as we built up and burned down bugs, and some percentage of bug fixes introduced new bugs or allowed more bugs to be discovered
  17. 1 bug per dev per day is the average over enough time. Started with 15 and the team just grew to that and stayed there. Limited to 5 so we were, at minimum, 5 days from shipping. Maybe it’s more than that because of regressions. Zero Bug Bounce (ZBB).
  18. November 2014
  19. We never shipped a release of TFS with all of the tests passing! Tests weren’t reliable enough, so someone had to manually go through all of the failures. That led to mistakes, of course. Same for shipping the service every 3 weeks.
  20. L2 conversion took over two years, as it was done in parallel with new feature development. So it progressed in fits and starts. Phase 1: Made it easy to author and execute high quality L0/L1 tests; stopped creating new TRA tests as much as possible. Phase 2: Tests that can be deleted; tests that can move to L0/L1; tests that will move to VSSF Test SDK; on-prem tests we expect to maintain in lights-on mode. Phase 3: Test Arch v-team re-wrote L2 test framework; top-down push from management with org-wide scorecard. Phase 4: Just started.
  21. Continuous Reliability Runs are triggered on green CI builds. If a test fails in any of 500 runs, it is considered flaky. Bugs are filed for flaky tests. Now working on automatic re-run of tests and detecting flakiness if a test passes on re-run.
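
The flakiness rule in note 21 (a test that fails in any of 500 reliability runs is flaky) and the "carry flakiness into the official run" step from slide 45 can be sketched as follows. This is an illustrative Python sketch under assumptions, not the actual VSTS tooling: the function names are hypothetical, and tests are modeled as plain callables returning `True` on pass.

```python
from typing import Callable, Dict, List, Set


def find_flaky_tests(tests: Dict[str, Callable[[], bool]],
                     runs: int = 500) -> List[str]:
    """Reliability run: repeat each test on a green build; any failure marks it flaky."""
    flaky = []
    for name, run_test in tests.items():
        # any() stops at the first failing run, so known-flaky tests finish early.
        if any(not run_test() for _ in range(runs)):
            flaky.append(name)
    return flaky


def official_run_failures(results: Dict[str, bool], flaky: Set[str]) -> List[str]:
    """Official run: report only failures of tests not already tagged as flaky."""
    return sorted(name for name, passed in results.items()
                  if not passed and name not in flaky)
```

Excluding tagged tests from the summary keeps the official run's signal clean, while the filed reliability bugs keep the flaky tests from being forgotten.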