In this presentation (video: https://youtu.be/3LpuncdAehE) I cover some key lessons we've learned over the years as we moved from TFS on-premises to running VS Team Services in Azure: feature flags, circuit breakers (resiliency), live site, and our testing transformation. You can find even more detail at https://aka.ms/LearnDevOps.
3. Timeline
Sprint 1 – August 2010
Team Rooms – August 2010
Service Online – April 23, 2011
Service Preview – June 2012
DRI Duty – October 2013
1ES – Spring 2014
Combined Engineering – November 2014
Test Conversion – April 2017
Sprint 127 – November 2017
7. What do feature flags give us?
Decouple deployment and exposure
Flags provide runtime control down to the individual user
Change without redeployment
Controlled via PowerShell or web UI
Support early feedback, experimentation (stages)
Quick off switch (sketch below)
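To make the pattern concrete, here is a minimal sketch in TypeScript. The class, flag storage, and per-user overrides are hypothetical illustrations of the idea, not the actual VSTS implementation:

// Minimal sketch of the feature-flag pattern; names and scoping are invented.
type FlagState = "on" | "off";

class FeatureFlagClient {
  private flags = new Map<string, FlagState>();          // flag -> global state
  private userOverrides = new Map<string, FlagState>();  // "flag:user" -> override

  // Runtime control down to an individual user: overrides win over globals.
  isEnabled(flag: string, userId?: string): boolean {
    const override = userId ? this.userOverrides.get(`${flag}:${userId}`) : undefined;
    return (override ?? this.flags.get(flag) ?? "off") === "on";
  }

  // "Quick off switch": flip state at runtime, no redeployment.
  set(flag: string, state: FlagState, userId?: string): void {
    if (userId) this.userOverrides.set(`${flag}:${userId}`, state);
    else this.flags.set(flag, state);
  }
}

// Deployment and exposure are decoupled: the code ships dark...
const flags = new FeatureFlagClient();
flags.set("SourceControl.Revert", "off");
flags.set("SourceControl.Revert", "on", "early-adopter-42"); // ...then lights up per user.

if (flags.isEnabled("SourceControl.Revert", "early-adopter-42")) {
  // new code path
}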
9. Define
<?xml version="1.0" encoding="utf-8"?>
<!--
  In this group we register TFS-specific features and set their states.
-->
<ServicingStepGroup name="TfsFeatureAvailability" … >
  <Steps>
    <!-- Feature Availability -->
    <ServicingStep name="Register features" stepPerformer="FeatureAvailability" … >
      <StepData>
        <!-- Specifying owner allows implicit removal of features -->
        <Features owner="TFS">
          <!-- Begin TFVC/Git -->
          <Feature name="SourceControl.Revert" description="Source control revert features" />
11. What could go wrong?
Features were to be revealed at the Connect 2013 event
We turned features on globally (SU1) just before the keynote...
It didn’t go well.
12. What went wrong?
Turned on flags in production the morning of the event
On the only instance we had at the time
Lessons learned over time…
Turn new features on completely at least 24 hours ahead of an event
Turn on incrementally
Monitor
Use feature flags for back-end changes
17. What happened?
1. VS requests notifications from SPS
2. SPS creates a subscription in Service Bus
3. That call was normally fast but became slow
4. Each call executed in SPS on a thread from the thread pool
5. The thread pool was quickly exhausted by incoming requests
6. Requests started queuing
7. SPS hosts critical services like auth and account…we're down!
8. VS retried failed connections & SPS retried failed calls to Service Bus
19. What do we need?
Limit the impact of a problem
Degrade gracefully
Once the problem is over, the service should self-recover quickly
For this incident, we mitigated by turning off the feature flag
Could we detect when a dependency is unhealthy and fail fast?
20. Circuit breakers
Pattern from Nygard's "Release It!", popularized by Netflix (Hystrix)
Stop cascading failures in a complex distributed system
Protection from latency, failure and volume (concurrency)
Shed load quickly: fail fast and rapidly recover
Fall back and gracefully degrade when possible
21. Circuit Breaker State Transitions (see the sketch below)
Closed
  On call -> pass through
  Call succeeds -> count success
  Call fails -> count failure
  Threshold reached -> trip breaker (go to Open)
Open
  On call -> fail fast
  On timeout -> attempt reset (go to Half-Open)
Half-Open
  On call -> pass through
  Call succeeds -> reset (go to Closed)
  Call fails -> trip breaker (go to Open)
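These transitions map almost directly onto code. Here is a minimal TypeScript sketch; the class name, the consecutive-failure counting, the thresholds, and the single-probe half-open policy are illustrative assumptions, not our actual implementation (real breakers typically count failures over a rolling window and also protect against volume):

// Sketch of the Closed/Open/Half-Open transitions from slide 21.
type State = "Closed" | "Open" | "HalfOpen";

class CircuitBreaker {
  private state: State = "Closed";
  private failures = 0;
  private openedAt = 0;

  constructor(private threshold = 5, private resetTimeoutMs = 30_000) {}

  async call<T>(fn: () => Promise<T>, fallback: () => T): Promise<T> {
    if (this.state === "Open") {
      if (Date.now() - this.openedAt < this.resetTimeoutMs) return fallback(); // fail fast
      this.state = "HalfOpen"; // timeout elapsed: attempt reset with one probe call
    }
    try {
      const result = await fn();  // pass through
      this.failures = 0;
      this.state = "Closed";      // half-open success -> reset
      return result;
    } catch (err) {
      this.failures++;
      if (this.state === "HalfOpen" || this.failures >= this.threshold) {
        this.state = "Open";      // trip breaker
        this.openedAt = Date.now();
      }
      return fallback();          // degrade gracefully
    }
  }
}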
22. How do we know it works?
Fault Injection Testing
Test on our account in production (SU0 – no customers)
Simulate real live site incidents
Randomly inject real faults to measure resiliency
Network disconnects and latency
CPU, Memory, Disk pressure
Opening circuit breakers
Does the service handle the fault gracefully?
Does the service recover quickly?
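The real faults above (network disconnects, CPU/memory/disk pressure) are injected at the infrastructure level; the following hypothetical wrapper just shows the idea of random latency and failure injection in application code, with invented names and config shape:

// Illustrative fault-injection wrapper, not the actual tooling.
const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

interface FaultConfig {
  latencyRate: number;    // fraction of calls that get extra latency, e.g. 0.05
  extraLatencyMs: number; // delay added to those calls
  failureRate: number;    // fraction of calls that throw, e.g. 0.01
}

// Wrap a dependency call so a run can randomly inject latency and failures,
// then watch whether breakers open, fallbacks kick in, and recovery is quick.
function withFaults<T>(call: () => Promise<T>, cfg: FaultConfig): () => Promise<T> {
  return async () => {
    if (Math.random() < cfg.latencyRate) await sleep(cfg.extraLatencyMs); // latency fault
    if (Math.random() < cfg.failureRate) throw new Error("injected fault"); // failure fault
    return call();
  };
}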
23. Lessons Learned
Tune in production
Verify fallback
Treat open breakers as symptoms, not causes, during an event
Make it easy to understand what opened a circuit breaker
Monitor breakers and root-cause why they opened
28. Alerting is the key to fast detection
Before
• Redundant alerts for the same issue
• Needed to set the right thresholds and tune often
• Stateless alerts contributed to further noise
After
• Every alert must be actionable and represent a real issue with the system
• Alerts should create a sense of urgency – false alerts dilute that
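One way to read "stateless alerts contributed to further noise": an alert that re-fires on every bad sample pages repeatedly for a single issue. A hypothetical stateful version (names and shape invented for this sketch) fires once on the transition into an unhealthy state and resolves on the way out:

// Sketch of a stateful alert: one page per incident, auto-resolve when it clears.
class StatefulAlert {
  private firing = false;

  constructor(
    private name: string,
    private isUnhealthy: () => boolean, // e.g. "error rate above SLA threshold"
  ) {}

  // Run on every monitoring interval.
  evaluate(): void {
    const unhealthy = this.isUnhealthy();
    if (unhealthy && !this.firing) {
      this.firing = true;
      console.log(`PAGE: ${this.name}`);     // exactly one actionable page per incident
    } else if (!unhealthy && this.firing) {
      this.firing = false;
      console.log(`RESOLVED: ${this.name}`); // closes the loop, no lingering noise
    }
    // unhealthy && already firing: stay silent instead of re-paging the same issue
  }
}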
30. A strategy our teams adopted to provide focus and to cope with an interrupt-driven culture:
• The team self-organizes each sprint into two distinct sub-teams: Features and Shield
• Membership rotates each sprint
Example: a team of 10 engineers splits into a Shield Team, which deals with all live-site issues and interruptions, and a Feature Team, which works on committed features (new work).
31. Live Site Incidents
• Conference bridge created
• DRIs brought onto the call
• Communication externally and internally
• Pursue multiple theories
• Gather data for root cause & mitigate
• Record changes
• Rotate people during long-running LSIs
32. Root Cause Analysis (RCA)
Repair work items are logged in VSTS and linked into the post-mortem for traceability
Time-to metrics (detect, mitigate) are key KPIs reviewed for improvements
Each Feature Team has goals for closing repair items
39. Engineer: Combining Dev and Test
New Engineer role merged responsibilities from dev and test
Every engineer and team has E2E accountability
Big cultural shift across the company
41. L0 – requires only built binaries, no dependencies
L1 – adds ability to use SQL and file system
Run L0 & L1 in the pull request builds
L2 – test a service via REST APIs
L3 – full environment to test end to end
TRA tests – legacy functional tests (a sketch of L0 vs. L1 follows below)
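To make the levels concrete, here is a hypothetical pair of tests in TypeScript; the function under test and the file layout are invented for illustration, not taken from the real test suite. The L0 test touches only in-memory code, while the L1 test is allowed to use the file system:

// Illustrative L0 vs. L1 tests, framework-free for self-containment.
import * as assert from "assert";
import * as fs from "fs";
import * as os from "os";
import * as path from "path";

// Unit under test: pure logic, no dependencies.
function normalizeBranchName(name: string): string {
  return name.trim().replace(/^refs\/heads\//, "");
}

// L0: in-memory only, runs in the pull request build.
assert.strictEqual(normalizeBranchName("refs/heads/master"), "master");

// L1: allowed to use the file system.
const tmp = path.join(os.tmpdir(), "l1-test.txt");
fs.writeFileSync(tmp, "refs/heads/feature");
assert.strictEqual(normalizeBranchName(fs.readFileSync(tmp, "utf8")), "feature");
fs.unlinkSync(tmp);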
42. Test at the lowest level possible
Fast and reliable
Product is designed for testability
Test code is product code
End-to-end tests can run in production
Testing infrastructure is a shared service
43. Test conversion progression: started with L0/L1 tests -> analyzed legacy TRA tests -> TRA-to-L2 conversion -> L3 tests
45. Test Reliability Workflows
Reliability Run Workflow
• Execute reliability runs on green builds
• Flaky tests excluded from Test Run Summary
• Aggregate Flaky tags over cumulative runs
• Un-tag flakiness when the reliability bug is resolved
Official Run Workflow
• Execute runs
• Carry flakiness from the latest reliability run to the current run
• Tag failed tests as Flaky & file reliability bugs
• Reliability bugs surfaced in team scorecards
Supporting work: VSTS extension tasks & test reporting enhancements
47. What do we measure?
Live Site Health/Debt
• Time to Detect, Time to Mitigate
• Incident prevention items
• Aging live site problems
• Customer support metrics (SLA, MPI, top drivers)
Engineering Health/Debt
• Bug cap per engineer
• Aging bugs in important categories
• Test reliability bugs
Velocity
• Time to build
• Time to self-test
• Time to deploy
• Time to learn (telemetry pipe)
As of October 2017:
Single repo
430 people pushed to the repo in the last 30 days
~40 feature teams
Code base is 90+% shared between the service and the on-prem product
Teams work in master
No nightly build
Over 3,000 projects (doubled in the last 3 years)
Pull requests build & run unit test validation
And I can make it the main path for all customers
https://blogs.msdn.microsoft.com/bharry/2013/11/25/a-rough-patch/
Sent the system into a death spiral where, even with all of the new features turned off, it struggled to become healthy again. We didn't have any resiliency built into the system.
We only had TFS SU1 and SPS at the time
Visual Studio 2013 introduced a new feature called synchronized settings
The retries had a multiplying effect!
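As an illustration (the actual retry counts aren't in the slides): if each VS client retries a failed request 3 times and SPS retries each failed Service Bus call 3 times, one user action fans out into up to 3 x 3 = 9 calls against a dependency that is already slow, so load grows multiplicatively exactly when the system can least afford it.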
Michael T. Nygard, “Release It! – Design and Deploy Production-Ready Software”
Ben Schmaus, “Making the Netflix API More Resilient” (Netflix Hystrix), https://github.com/Netflix/Hystrix/wiki
Commands come in
If closed, command runs
If it fails, record failure
If too many failures in the time window, open
If open
Command gets fallback
Periodically a command is allowed to pass through to test to see if the problem is resolved
See https://github.com/Netflix/Hystrix/wiki/How-it-Works for complete explanation
Test in production on SU0 where there are no customers, and we have real production load – not just tests.
Here are some patterns we’ve caught from this:
Fallbacks don’t work as expected. For example, a feature returning a 500 error when Redis is down. That’s a best-effort cache and shouldn’t cause failures.
Circuit breakers don’t open when they should. While it’s straightforward to test fallback logic through testcases, injecting faults helps verify the tuning of those resiliency mechanisms. A breaker that never opens is useless.
Requests don’t time out quickly enough or they retry when they shouldn’t. Our web UI shouldn’t hang for minutes trying to fetch an identity image, for example.
Unexpected dependencies turn small problems into major incidents. A call to a non-critical service was added to service startup, so if that non-critical service was down or slow while an instance of a critical service was starting, the critical service was down or slow too.
One important thing we do is to run nearly all of this in our Ring0 production deployments. That’s where our account is that we work from. We tried test deployments and other canary instances early on, but struggled with the typical problems of load patterns that aren’t real-world and missing telemetry and alerting infrastructure. There’s nothing like production. Having a Ring0 instance means that if we do have greater-than-expected impact, we only impact our own team and not other customers.
Long timeouts may limit the effectiveness of circuit breakers – better to fail fast
Sometimes people mistake an open circuit breaker for the cause rather than a symptom
Story about the initial release – in 2011, when we first announced the service at //Build, Twitter was a better monitor than anything we had
~60GB per day in 2015
Journey Story:
Darkness: The service went live > some issues happened > there were blind spots
Rally the Troops: We talked to the team about telemetry
There is Light: We started to generate tons of telemetry and alerts
Blinded by the Light: We became overwhelmed by too much data & too many alerts
Time to Tune: We refocused the team on reducing alert noise
Overcompensating: After tuning and tweaking alerts, we started missing some customer incidents
The Art of Balance: We're now learning how to enable precise alerting and how to make sense of so much data:
Using the sophistication of "high-maturity alerts" over log analytics
Shifting critical alerts that "page" to focus on SLA
Investigating anomaly detection
Key takeaways:
• 5 whys
• Define improvements for both code and process
• Visibility to ensure learnings are applied
Outcomes:
• Improve how you respond (TTx)
• Stop it from happening again
This is a graph of the TFS 2010 release – you can see beta 1, beta 2, RC, RTM as we built up and burned down bugs, and some percentage of bug fixes introduced new bugs or allowed more bugs to be discovered
One bug fixed per dev per day is the average over enough time. The bug cap started at 15 per engineer, and the team just grew to that number and stayed there. We limited it to 5, so we were at minimum 5 days from shipping – maybe more than that because of regressions. Zero Bug Bounce (ZBB)
November 2014
We never shipped a release of TFS with all of the tests passing! Tests weren’t reliable enough, so someone had to manually go through all of the failures. That led to mistakes, of course. Same for shipping the service every 3 weeks.
L2 conversion took over two years, as it was done in parallel with new feature development. So it progressed in fits and starts.
Phase 1
Made it easy to author and execute high quality L0/L1 tests
Stopped creating new TRA tests as much as possible
Phase 2
Categorized the legacy TRA tests into:
Tests that can be deleted
Tests that can move to L0/L1
Tests that will move to the VSSF Test SDK
On-prem tests we expect to maintain in lights-on mode
Phase 3
Test Arch v-team rewrote the L2 test framework
Top-down push from management with org wide scorecard
Phase 4
Just started
Continuous Reliability Runs triggered on green CI builds
If a test fails in any of 500 runs, it is considered Flaky
Bugs filed for flaky tests
Now working on automatic re-run of tests and detecting flakiness if a test passes on re-run
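A minimal sketch of the flaky-tagging rule described above, assuming a hypothetical result shape: on a green build the product is known-good, so a test that fails even once across the reliability runs is the test's fault and gets tagged.

// Sketch of flaky detection over reliability runs (data shapes invented).
interface ReliabilityResult {
  testName: string;
  runs: boolean[]; // pass/fail outcome of each reliability run (e.g. 500 entries)
}

// A test that fails in any reliability run on a green build is considered flaky.
function findFlakyTests(results: ReliabilityResult[]): string[] {
  return results
    .filter((r) => r.runs.some((passed) => !passed))
    .map((r) => r.testName);
}

// Flaky tests are excluded from the official run summary and get a reliability
// bug filed; the tag comes off only when that bug is resolved.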