The goal behind DevOps is Faster Lead Times
What this really means for Software Delivery -> my Kodak/Smart Phone Analogy
How and Which Metrics to use along the Delivery Pipeline to make better decisions along the way.
4. @grabnerandi
“The stuff we did
when we were a Start Up
and we All were
Devs, Testers and Ops”
Quote from Andreas Grabner back in 2013 @ DevOps Boston
6. Ultimate goal: minimize cycle time (= Lead Time)
Minimize feature cycle time.
Users: this is where you create value!
7. From the DevOps Webinar with Gene & Mark
Mark Tomlinson
Performance Sherpa
@mark_on_task
Andi Grabner
Performance Advocate
@grabnerandi
Gene Kim, CTO
Researcher and Author
@RealGeneKim
Webinar Recording: https://info.dynatrace.com/apm_wc_gene_kim_na_registration.html
8. High Performers Are …
200x more frequent deployments and 2,555x faster lead times than their peers
Source: Puppet Labs 2016 State Of DevOps Report: https://puppet.com/resources/white-paper/2016-state-of-devops-report
More Agile
3x lower change failure rate and 24x faster Mean Time to Recover
More Reliable
9. 24 “Features in a Box”: ship the whole box!
Very late feedback
10. “1 Feature at a Time”
“Optimize before Deploy” “Immediate Customer Feedback”
Continuous Innovation and Optimization
12. 700 deployments / YEAR
10 + deployments / DAY
50 – 60 deployments / DAY
Every 11.6 SECONDS
Innovators (aka Unicorns): Deliver value at the speed of business
14. @grabnerandi
DevOps @ Target
presented at Velocity, DOES and more …
http://apmblog.dynatrace.com/2016/07/07/measure-frequent-successful-software-releases/
“We increased from monthly to 80
deployments per week
… only 10 incidents per month …
… over 96% successful! ….”
15. “We Deliver High Quality Software, Faster and Automated using New Stack”
“Shift-Left Performance to Reduce Lead Time”
Adam Auerbach, Sr. Dir DevOps
https://github.com/capitalone/Hygieia & https://www.spreaker.com/user/pureperformance
“… deploy some of our most critical production
workloads on the AWS platform …”, Rob Alexander, CIO
16. From 0 to DevOps in 80 days
Lessons learnt from shifting an on-prem to a cloud culture
Bernd Greifeneder, CTO
http://dynatrace.com/trial
Webinar: http://ow.ly/cEYo305kFEy
Podcast: http://bit.ly/pureperf
18. #Perform2015
believe in the mission impossible
6 months major/minor release
+ intermediate fix-packs
+ weeks to months rollout delay
-> sprint releases (continuous delivery): 1h from code to production
21. @grabnerandi
New Deployment + Mkt Push
Increase # of unhappy users!
Decline in Conversion Rate
Overall increase of Users!
#2: User Experience -> Conversion
Spikes in FRUSTRATED Users!
24. Dynatrace Transformation by the numbers
23x more releases
170 Deployments / Day
More Quality: 31,000 Unit + Integration Tests / hour, 60h of UI Tests per Build
More Agile: ~200 code commits / day, 340 Stories per sprint
More Stability: 93% of production bugs found by Dev
450 Global EC2 Instances, 99.998% Global Availability
@grabnerandi
27. Understanding Code Complexity
• 4 Million Lines of Monolithic Code
• Partially coded and commented in
Russian
From Monolith to Microservice
• Initial devs no longer with company
• What to extract without breaking it?
Shift Left Quality & Performance
• No automated testing in the pipeline
• Bad builds just made it into production
Cross Application Impacts
• Shared Infrastructure between Apps
• No consolidated monitoring strategy
28. @grabnerandi
Scaling an Online Sports Club Search Service
[Chart: Users and Response Time over the years 20xx, 2014, 2015, 2016+]
1) 2-Man Project
2) Limited Success
3) Start Expansion
4) Performance Slows Growth
5) Potential Decline?
34. @grabnerandi
Single search query, end-to-end:
26.7s Load Time
5kB Payload
33! Service Calls, 99kB - 3kB for each call!
171! Total SQL Count
Architecture Violation: direct access to the DB from the frontend service
35. @grabnerandi
The fixed end-to-end use case
“Re-architect” vs. “Migrate” to Service-Orientation
2.5s (vs 26.7s) Load Time
1! (vs 33!) Service Call
5kB (vs 99kB) Payload
3! (vs 177!) Total SQL Count
49. “To Deliver High Quality Working Software Faster”
“We have to Shift-Left Performance to Optimize Pipelines”
http://apmblog.dynatrace.com/2016/10/04/scaling-continuous-delivery-shift-left-performance-to-improve-lead-time-pipeline-flow/
50. = Functional Result (passed/failed)
+ Web Performance Metrics (# of Images, # of JavaScript, Page Load Time, ...)
+ App Performance Metrics (# of SQL, # of Logs, # of API Calls, # of Exceptions ...)
Fail the build early!
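Such a metric-augmented result can be turned into an automated build gate. A minimal sketch, assuming illustrative metric names and thresholds (not from the talk):

```python
# Sketch: extend a pass/fail functional result with performance metrics
# and fail the build when any metric exceeds its threshold.
# Metric names and thresholds below are made up for illustration.

THRESHOLDS = {
    "page_load_time_ms": 3000,
    "image_count": 30,
    "sql_statement_count": 50,
    "api_call_count": 10,
}

def evaluate_build(functional_passed, metrics):
    """Return (ok, violations): functional result plus metric checks."""
    violations = [
        f"{name}={value} exceeds limit {THRESHOLDS[name]}"
        for name, value in metrics.items()
        if name in THRESHOLDS and value > THRESHOLDS[name]
    ]
    return (functional_passed and not violations), violations

ok, why = evaluate_build(
    functional_passed=True,
    metrics={"page_load_time_ms": 4200, "sql_statement_count": 12},
)
# Build fails even though functional tests passed: load time regressed.
print(ok, why)
```

The point is that a build can be functionally green and still be a bad build; the gate makes that visible early instead of in production.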
51. Reduce Lead Time: Stop 80% of Performance Issues
in your Integration Phase
CI/CD: Test Automation (Selenium, Appium,
Cucumber, Silk, ...) to detect functional and
architectural (performance, scalability) regressions
Perf: Performance Test (JMeter,
LoadRunner, Neotys, Silk, ...) to
detect tough performance issues
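Such an integration-phase stop could compare per-test call counts against the last known-good build. A sketch with made-up numbers and a hypothetical 1.2x tolerance; in practice the counts would come from tracing data:

```python
# Sketch: detect architectural regressions (e.g. an exploding SQL or
# service-call count) by comparing per-test counts against a baseline
# build. Numbers and the 1.2x growth factor are illustrative.

def architectural_regressions(baseline, current, factor=1.2):
    """Flag tests whose call counts grew by more than `factor`."""
    flagged = {}
    for test, counts in current.items():
        for metric, value in counts.items():
            base = baseline.get(test, {}).get(metric)
            if base and value > base * factor:
                flagged.setdefault(test, []).append(
                    f"{metric}: {base} -> {value}")
    return flagged

baseline = {"search": {"sql": 3, "service_calls": 1}}
current = {"search": {"sql": 171, "service_calls": 33}}
# The search test suddenly makes 33 service calls and 171 SQL
# statements: stop this build in integration, not in production.
print(architectural_regressions(baseline, current))
```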
52. Shift-Left Performance results in Reduced Lead Time
powered by Dynatrace Test Automation
http://apmblog.dynatrace.com/2016/10/04/scaling-continuous-delivery-shift-left-performance-to-improve-lead-time-pipeline-flow/
53. @grabnerandi
Fast Response to Outcome: Address Deployment Impact
User Experience, Conversion Rate
Costs and Efficiency
Availability
54. @grabnerandi
Real User Feedback: Building the RIGHT thing RIGHT!
Experiment &
innovate on
new ideas
Optimizing what is
not perfect
Removing what nobody needs
In case you are a “DevOps Virgin” I definitely recommend checking out The Phoenix Project (the DevOps Bible) and Continuous Delivery (which is what we actually all want to achieve): delivering software faster, with great quality, and without all the potential mistakes that a manual and rigid process brings with it.
This inspired many companies which have been talking about their successes!
Minimize feature cycle time.
See the full webinar: https://info.dynatrace.com/apm_wc_gene_kim_na_registration.html
My blog on this: http://apmblog.dynatrace.com/2016/11/16/transformation-to-continuous-innovation-and-optimization/
The new way is how we take pictures right now: we see what we are about to build, we optimize it on the spot based on best practices, we deploy it into production and immediately get feedback from our customers / friends. Then we can decide whether we “built” the right thing or not: whether we take another picture of the scene, or whether we delete some of those that nobody likes.
Several companies changed their way they develop and deploy software over the years. Here are some examples (numbers from 2011 – 2014)
Cars: from 2 deployments to 700
Flickr: 10+ per Day
Etsy: lets every new employee on their first day of employment make a code change and push it through the pipeline in production: THAT’S the right approach towards required culture change
Amazon: every 11.6s
Remember: these are very small changes, which is also a key goal of continuous delivery. The smaller the change, the easier it is to deploy, the less risk it has, the easier it is to test, and the easier it is to take it out in case it has a problem.
But not only the hipsters / unicorns have been doing it – it is catching on – even in enterprises that seem too big. But because they are too big to fail they had to go through a major transformation!
Taken from http://www.hostingadvice.com/blog/cloud-66-devops-as-a-service/
Check out the recorded webinar and podcast
Webinar: http://ow.ly/cEYo305kFEy
Podcast: http://bit.ly/pureperf
Dynatrace 6.2: an intensified burn-down phase in the last third of the release.
Ruxit: up/down trend within sprints; ideally a straight blue line, with red and green overlapping, slightly offset in time.
A basic key metric for developers should be “Did I break the build?”.
This is why we at Dynatrace installed these Pipeline State UFOs that are hooked up with Jenkins to tell engineers how good or bad the current Trunk or Latest Sprint build is
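The “did I break the build” check itself can be automated against Jenkins, which exposes build state as JSON under `.../lastBuild/api/json` with a `result` field; the job URL and sample payload below are placeholders:

```python
import json

# Sketch: decide whether the latest trunk build is green by parsing
# the lastBuild JSON that Jenkins serves. The URL and payload here
# are placeholders for illustration.

def build_is_green(payload):
    """True if the parsed lastBuild JSON reports SUCCESS."""
    data = json.loads(payload)
    return data.get("result") == "SUCCESS"

# In a real setup the payload would be fetched, e.g. with
# urllib.request.urlopen("https://jenkins.example.com/job/trunk/lastBuild/api/json")
payload = '{"number": 42, "result": "FAILURE"}'  # sample response
print("green" if build_is_green(payload) else "broken")
```

A status like this is exactly what the pipeline-state UFO visualizes for the whole team.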
The key thing here is that this should not only be applied to the build itself but to metrics across the delivery pipeline, from Dev to Ops. It should include metrics like the ones in the next examples.
The most basic metric for everyone operating software. Did my last deployment break anything? Is the software still available from those locations where my users are accessing the software? Use Synthetic Monitoring: http://www.dynatrace.com/en/synthetic-monitoring/
Monitoring user experience and impact on conversion rate
Screenshot from Dynatrace AppMon & UEM
Even if the deployment seemed good because all features work and response time is the same as before, the deployment is NOT good if your resource consumption goes up like this, as you are now paying a lot of money for that extra compute power.
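A gate like this can be expressed as a simple before/after comparison of resource metrics. A sketch with made-up numbers and an assumed 15% growth tolerance:

```python
# Sketch: flag a deployment whose resource footprint grew even though
# features and response time look fine. Sample numbers are made up;
# in practice they would come from your monitoring tool.

def deployment_verdict(before, after, max_growth=1.15):
    """Compare pre/post deployment metrics; fail on >15% resource growth."""
    problems = []
    for metric in ("cpu_cores", "memory_gb", "instance_count"):
        if after[metric] > before[metric] * max_growth:
            problems.append(f"{metric}: {before[metric]} -> {after[metric]}")
    return ("NOT GOOD" if problems else "OK"), problems

before = {"cpu_cores": 40, "memory_gb": 120, "instance_count": 10}
after = {"cpu_cores": 70, "memory_gb": 125, "instance_count": 10}
# CPU consumption grew 75% for the same workload: roll back or fix.
print(deployment_verdict(before, after))
```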
Screenshot from Dynatrace AppMon
Understand user behavior depending on who they are and what they are doing.
Screenshot from https://github.com/Dynatrace/Dynatrace-UEM-PureLytics-Heatmap
Does the behavior change if they have a less optimal user experience?
Screenshot from https://github.com/Dynatrace/Dynatrace-UEM-PureLytics-Heatmap
Seems like users that have a frustrating experience are more likely to click on Support
Screenshot from https://github.com/Dynatrace/Dynatrace-UEM-PureLytics-Heatmap
Unfortunately not every story is a good story. But the bad stories are often not told, even though we can learn even more from them. PrepSportswear failed 80% of their deployments after speeding up deployments.
They had a monolithic app that couldn't scale endlessly. Their popularity caused them to think about re-architecting and allowing developers to make faster changes to their code. They were moving towards a service approach.
Separating frontend logic from the backend (search service). The idea was to also host these services potentially in the public cloud (frontend) and in a dynamic virtual environment (backend) to be able to scale better globally.
On go-live day with the new architecture, everything looked good at 7AM, when not many folks were online yet!
By noon, when the real traffic started to come in, the picture was completely different. User experience across the globe was bad: response time jumped from 2.5s to 25s and bounce rate tripled from 20% to 60%.
The backend service itself was well tested. The problem was that they never looked at what happens under load end-to-end. It turned out that the frontend had direct access to the database to execute the initial query when somebody ran a search. The returned list of search result IDs was then iterated over in a loop, and for every element a “micro” service call was made to the backend. This resulted in 33(!) service invocations for this particular use case, where the search returned 33 items. Lots of wasted traffic and resources, as these key architectural metrics show us.
They fixed the problem by understanding the end-to-end use cases and then defining backend service APIs that provided the data the frontend really needed. This reduced roundtrips, eliminated the architectural regression, and improved performance and scalability.
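The regression and its fix boil down to the classic N+1 call pattern. A sketch with hypothetical `backend_lookup` / `backend_lookup_many` stand-ins for the real service calls, counting round trips:

```python
# Sketch of the anti-pattern and the fix described above. The two
# lookup functions are hypothetical stand-ins for the real backend
# service; the counter makes the round-trip difference visible.

calls = {"count": 0}

def backend_lookup(item_id):
    calls["count"] += 1  # one service round trip per item (N+1)
    return {"id": item_id}

def backend_lookup_many(item_ids):
    calls["count"] += 1  # one round trip for the whole result set
    return [{"id": i} for i in item_ids]

ids = list(range(33))  # the search returned 33 items

calls["count"] = 0
results = [backend_lookup(i) for i in ids]  # old frontend loop
n_plus_one = calls["count"]                 # 33 service calls

calls["count"] = 0
results = backend_lookup_many(ids)          # new batch API
batched = calls["count"]                    # 1 service call

print(n_plus_one, batched)  # 33 1
```

The batch API is the “backend service API that provides the data the frontend really needs” in miniature.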
Lessons Learned!
Got this story also covered here: https://www.infoq.com/articles/Diagnose-Microservice-Performance-Anti-Patterns
If we monitor these key metrics in dev and in ops we can make much better decisions on which builds to deploy
We immediately detect bad changes and fix them. We will stop builds from making it into Production in case these metrics tell us that something is wrong.
We can also take features out that nobody uses if we have usage insights for our services. Like in this case we monitor % of Visitors using a certain feature. If a feature is never used – even when we spent time to improve performance – it is about time to take this feature out. This removes code that nobody needs and therefore reduces technical debt: less code to maintain – less tests to maintain – less bugs in the system!
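A minimal sketch of such a usage check, with made-up visit data and an assumed 1% removal threshold:

```python
# Sketch: compute per-feature usage as a % of visitors and flag
# features below a removal threshold. Visit data and the 1% threshold
# are illustrative; real numbers would come from real-user monitoring.

def usage_report(visits, threshold_pct=1.0):
    """visits: {visitor_id: set of features used} -> low-usage features."""
    total = len(visits)
    seen = {}
    for features in visits.values():
        for f in features:
            seen[f] = seen.get(f, 0) + 1
    return sorted(
        f for f, n in seen.items() if 100.0 * n / total < threshold_pct
    )

visits = {i: {"search"} for i in range(1000)}
visits[0] = {"search", "export_pdf"}  # 1 of 1000 visitors (0.1%)
print(usage_report(visits))  # export_pdf is a candidate for removal
```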
You should know these books
The Phoenix Project explains in details the 3 Ways on how to mature your Organization:
http://itrevolution.com/the-three-ways-principles-underpinning-devops/
They come from tools. I work for Dynatrace and we provide all these metrics – but there are also other tools out there that do that job
Do it manually
Or do some of it automatically through tools such as Dynatrace.
And this plays nicely into what our friends from Capital One do to optimize their pipeline throughput: shift quality left; find problems earlier to avoid too many “bad builds” wasting time in later pipeline stages that take longer to execute.
Monitor your production deployments and monitor the impact on your end users, performance and conversion rates. Take this data to respond fast to issues
Using Real User Feedback also allows us to start experimenting, optimizing what is good and remove what nobody really needs
This is the Second Way (Feedback Loops) and already the Third Way (Continuous Experimentation).
If we do all that we become successful as a business as we outpace our competition with new innovative ideas