How can we detect a bad deployment before it hits production? By automatically looking at the right architectural metrics in your CI/CD pipeline and stopping a build before it's too late. Let's hook up your test automation with app metrics and use them as quality gates to stop bad builds early!
Most screenshots are from Dynatrace AppMon – http://bit.ly/dtpersonal – but presented concepts should work with many other tools
How I prepared for DevOps Days
I love metrics! And I think we need to make metrics-based decisions. There are different types of metrics and different visualizations
They come from tools. I work for Dynatrace and we provide all these metrics – but there are also other tools out there that do that job
A basic key metric for developers should be „Did I break the build“. This is why we at Dynatrace installed these Pipeline State UFOs that are hooked up with Jenkins to tell engineers how good or bad the current Trunk or Latest Sprint build is. The key thing here is that this should not only be applied to the build itself but to metrics across the delivery pipeline: from DevToOps. It should include metrics like the next examples.
The most basic metric for everyone operating software. Did my last deployment break anything? Is the software still available from those locations where my users are accessing the software? Use Synthetic Monitoring: http://www.dynatrace.com/en/synthetic-monitoring/
Monitoring user experience and its impact on conversion rate. Screenshot from Dynatrace AppMon & UEM
Even if the deployment seemed good because all features work and response time is the same as before, the deployment is NOT GOOD if your resource consumption goes up like this: you are now paying a lot of money for that extra compute power. Screenshot from Dynatrace AppMon
If you test for scalability, make sure the application scales linearly, or at least as close to linearly as possible. Not like in this case, where twice the load required 4.8x the number of containers. Screenshot from Dynatrace AppMon, comparing two Transaction Flows!
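One rough way to turn this check into a number is to divide the growth in resources by the growth in load. The sketch below is only an illustration: the function names, the container counts (10 and 48) and the 1.5 tolerance are assumptions, chosen so that the example reproduces the 2x load and 4.8x containers mentioned above.

```python
def scaling_ratio(load_before, load_after, resources_before, resources_after):
    """Return resource growth divided by load growth (1.0 means linear scaling)."""
    load_factor = load_after / load_before
    resource_factor = resources_after / resources_before
    return resource_factor / load_factor


def scales_acceptably(ratio, tolerance=1.5):
    """Treat anything up to `tolerance` times linear as acceptable (assumed threshold)."""
    return ratio <= tolerance


# Hypothetical numbers: load doubles, containers grow from 10 to 48 (4.8x).
ratio = scaling_ratio(1000, 2000, 10, 48)
print(f"resources grew {ratio:.1f}x faster than load")
print("acceptable" if scales_acceptably(ratio) else "NOT linear enough")
```

A ratio close to 1.0 means resources grow in step with load; the further above 1.0, the more expensive each additional user becomes.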
Understand user behavior depending on who they are and what they are doing. Screenshot from https://github.com/Dynatrace/Dynatrace-UEM-PureLytics-Heatmap
Does the behavior change if they have a less optimal user experience? Screenshot from https://github.com/Dynatrace/Dynatrace-UEM-PureLytics-Heatmap
Seems like users that have a frustrating experience are more likely to click on Support. Screenshot from https://github.com/Dynatrace/Dynatrace-UEM-PureLytics-Heatmap
Another cool example of conversion rate compared to technical metrics
In case you are a “DevOps Virgin” I definitely recommend checking out The Phoenix Project (the DevOps Bible) and Continuous Delivery (which is what we actually all want to achieve): delivering software faster, with great quality and without all the potential mistakes that a manual and rigid process brings with it.
This inspired many companies which have been talking about their successes!
At Dynatrace we also went through a major transformation over the last years.
But it is not only about delivering features faster – it is also about delivering fast features!
These stats come from here: http://nft.atcyber.com/infographics/infographic-the-importance-of-web-performance-20140913
But don’t make the mistake of blindly following every unicorn out there. Taken from http://www.hostingadvice.com/blog/cloud-66-devops-as-a-service/
If you just automate a process that hasn't yet had enough time for quality, you will just produce bad software -> but faster
If you have the freedom to add more features more rapidly, make sure you measure whether they are used. If not – take them out. This avoids piling up Technical and Business Debt
I get most of my stories from my Share Your PurePath program which is a free offering for our Dynatrace Free Trial & Personal License users: http://bit.ly/dtpersonal
They had a monolithic app that couldn't scale endlessly. Their popularity caused them to think about re-architecture and about allowing developers to make faster changes to their code. They were moving towards a Service Approach
Separating frontend logic from backend (search service). The idea was to also host these services potentially in the public cloud (frontend) and in a dynamic virtual environment (backend) to be able to scale better globally
On Go Live Date with the new architecture everything looked good at 7AM, when not many folks were online yet!
By noon – when the real traffic started to come in – the picture was completely different. User Experience across the globe was bad. Response Time jumped from 2.5 to 25s and bounce rate tripled from 20% to 60%
The backend service itself was well tested. The problem was that they never looked at what happens under load „end-to-end“. It turned out that the frontend had direct access to the database to execute the initial query when somebody executed a search. The returned list of search result IDs was then iterated over in a loop. For every element a „Micro“ Service call was made to the backend, which resulted in 33! Service Invocations for this particular use case, where the search returned 33 items. Lots of wasted traffic and resources, as these Key Architectural Metrics show us
They fixed the problem by understanding the end-to-end use cases and then defining backend service APIs that provided the data the frontend really needed. This reduced roundtrips, eliminated the architectural regression and improved performance and scalability
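To make the difference between the two designs concrete, here is a minimal sketch that counts service calls for a search returning 33 items. It is not their actual code: SearchBackend, get_item, get_items and the two render functions are hypothetical names, and the backend is an in-memory stand-in rather than a real service.

```python
class SearchBackend:
    """Hypothetical in-memory stand-in for the backend search service; counts calls."""

    def __init__(self, items):
        self._items = items
        self.calls = 0

    def get_item(self, item_id):
        # One round trip per item: the chatty pattern from the story.
        self.calls += 1
        return self._items[item_id]

    def get_items(self, item_ids):
        # One round trip for the whole result list: the fixed API.
        self.calls += 1
        return [self._items[i] for i in item_ids]


def render_chatty(backend, result_ids):
    """Frontend loops over the result IDs and calls the service for each one."""
    return [backend.get_item(i) for i in result_ids]


def render_batched(backend, result_ids):
    """Frontend asks for all results in a single call."""
    return backend.get_items(result_ids)


items = {i: f"result-{i}" for i in range(33)}
ids = list(items)

chatty = SearchBackend(items)
chatty_results = render_chatty(chatty, ids)

batched = SearchBackend(items)
batched_results = render_batched(batched, ids)

print(chatty.calls, batched.calls)  # 33 1
```

Both variants return the same data; only the number of round trips differs, which is exactly the kind of metric that can be tracked per use case.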
This story is also covered here: https://www.infoq.com/articles/Diagnose-Microservice-Performance-Anti-Patterns
If we monitor these key metrics in dev and in ops we can make much better decisions on which builds to deploy. We immediately detect bad changes and fix them. We will stop builds from making it into Production in case these metrics tell us that something is wrong.
We can also take features out that nobody uses if we have usage insights for our services. In this case we monitor the % of Visitors using a certain feature. If a feature is never used – even after we spent time improving its performance – it is about time to take this feature out. This removes code that nobody needs and therefore reduces technical debt: less code to maintain – fewer tests to maintain – fewer bugs in the system!
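As a small illustration of that decision, the sketch below flags features whose share of visitors stays under a threshold. The feature names, the usage percentages and the 1% cut-off are all assumptions made for the example, not values prescribed by any tool.

```python
def unused_features(usage_pct_by_feature, min_usage_pct=1.0):
    """Return the names of features used by fewer than min_usage_pct percent of visitors."""
    return sorted(
        name
        for name, pct in usage_pct_by_feature.items()
        if pct < min_usage_pct
    )


# Hypothetical usage data: percentage of visitors using each feature.
usage = {"news_alert": 0.5, "search": 63.0}

candidates = unused_features(usage)
print("candidates for removal:", candidates)  # candidates for removal: ['news_alert']
```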
How? Leverage your existing Functional, Unit or Integration Tests. Instrument the code you are testing and extract key metrics that you can track from build to build. Then baseline these metrics.
Check out blogs on Problem Pattern Detection and Key Performance Metrics: http://apmblog.dynatrace.com/2016/06/23/automatic-problem-detection-with-dynatrace/ http://apmblog.dynatrace.com/2016/02/23/top-tomcat-performance-problems-database-micro-services-and-frameworks/ https://www.infoq.com/articles/Diagnosing-Common-Java-Database-Performance-Hotspots
If one of these metrics spikes, you have detected a regression that should fail the build
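A quality gate along these lines could be sketched as follows. This is an assumption-laden illustration, not a Dynatrace or Jenkins API: find_regressions, the metric names and the 20% tolerance are made up for the example, and the numbers reuse the service call, SQL and payload figures from the search story.

```python
def find_regressions(baseline, current, max_increase=0.2):
    """Return {metric: (baseline, current)} for metrics that grew by more than max_increase.

    Metrics missing from `current` are skipped rather than treated as failures.
    """
    regressions = {}
    for name, base_value in baseline.items():
        value = current.get(name)
        if value is None:
            continue
        if value > base_value * (1 + max_increase):
            regressions[name] = (base_value, value)
    return regressions


# Hypothetical per-use-case metrics captured while the existing tests run.
baseline = {"service_calls": 1, "sql_statements": 3, "payload_kb": 5}
current = {"service_calls": 33, "sql_statements": 171, "payload_kb": 99}

regressions = find_regressions(baseline, current)
for name, (old, new) in regressions.items():
    print(f"REGRESSION {name}: {old} -> {new}")

# A CI step would pass this value to sys.exit(); non-zero fails the build.
exit_code = 1 if regressions else 0
print("exit code:", exit_code)
```

The tests themselves can still be green; the gate fails the build because of how the result was produced, not because the result was wrong.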
If we do all that we can build a beautiful pipeline where quality metrics are enforced along the way!!
With that we can make our users happy 24/7 – at any load
Boston DevOps Days 2016: Implementing Metrics Driven DevOps - Why and How
Why and How!
Andreas Grabner: @grabnerandi, email@example.com
DevOps @ Target
presented at Velocity, DOES and more …
“We increased from monthly to 80 deployments per week … only 10 incidents per month … over 96% successful! …”
“We Deliver High Quality Software, Faster and Automated using New Stack”
to Reduce Lead Time“
Adam Auerbach, Sr. Dir DevOps
https://github.com/capitalone/Hygieia & https://www.spreaker.com/user/pureperformance
“… deploy some of our most critical production workloads on the AWS platform …”, Rob Alexander, CIO
26.7s Load Time
33! Service Calls
99kB - 3kB for each call!
171! Total SQL Count
Direct access to DB from frontend service
Single search query end-to-end
The fixed end-to-end use case
“Re-architect” vs. “Migrate” to Service-Orientation
2.5s (vs 26.7)
1! (vs 33!) Service Call
5kB (vs 99) Payload!
3! (vs 177) Total
You measure it! from Dev (to) Ops
Build 17 testNewsAlert OK
Build # Use Case Stat # API Calls # SQL Payload CPU
1 5 2kb 70ms
1 35 5kb 120ms
Use Case Tests and Monitors Service & App Metrics
Build 26 testNewsAlert OK
Build 25 testNewsAlert OK
1 4 1kb 60ms
34 171 104kb 550ms
#ServInst Usage RT
1 0.5% 7.2s
1 63% 5.2s
1 4 1kb 60ms
2 3 10kb 150ms
1 0.6% 3.2s
5 75% 2.5s
Build 35 testNewsAlert -
- - - -
2 3 10kb 150ms
- - -
8 80% 2.0s
Metrics from and for Dev(to)Ops
Re-architecture into „Services“ + Performance Fixes
Scenario: Monolithic App with 2 Key Features
your tool of choice
#SQL, #Threads, Bytes Sent, # Connections
WPO Metrics, Objects Allocated, ...
Fail the build!
Performance: Production Ready Checks! Validate Monitoring
Ops/Biz: Provide Usage and Resource Feedback for next
Test / CI: Stop Bad Builds Early
Build & Deliver Apps like the Unicorns!
With a Metrics-Driven Pipeline!