From 0 to DevOps in 80 Days
Link to the webinar replay: https://info.dynatrace.com/apm_dtm_ops_17q3_wc_from_enterprise_tocloud_native_na_registration.html
“Innovate or die” may sound extreme, but it’s the only way to thrive in today’s ever more competitive market. Bernd Greifeneder, CTO of Dynatrace, wanted to ensure that the company would still be relevant five years from now, so he formed an internal incubator with one goal: transform Dynatrace into a cloud-native DevOps organization.
The incubator focused on what the company needed to do to integrate nascent cloud technologies so that it wouldn’t be left in the dust when the inevitable tipping point to cloud arrived. Transforming into a cloud-native company would allow for rapid release cycles and provide an embedded feedback loop.
The results: Dynatrace now delivers 99.998% availability for its SaaS service and can deploy changes within an hour if necessary. In parallel, a new SaaS and Managed offering is released every 2 weeks, with 170 production updates per day.
Watch this recorded webinar as Bernd Greifeneder shares the lessons learned moving Dynatrace from an on-prem company to one that is cloud native.
Bernd discusses:
• The driving factors that led to the transformation
• The goals set for the engineering team back in 2011
• How to sell such a transformation project in a large enterprise organization
• How to support this multi-year project from the top down without impacting regular operations
• What's next on the innovator's mind
1. From 0 to DevOps in 80 days
Lessons learnt from shifting from an on-prem to a cloud culture
Bernd Greifeneder, CTO
http://dynatrace.com/trial
2. From the DevOps Webinar with Gene & Mark
Mark Tomlinson
Performance Sherpa
@mark_on_task
Andi Grabner
Performance Advocate
@grabnerandi
Gene Kim, CTO
Researcher and Author
@RealGeneKim
3. High Performers Are …
More Agile
• 200x more frequent deployments
• 2,555x faster lead times than their peers
More Reliable
• 3x lower change failure rate
• 24x faster Mean Time to Recover
Source: Puppet Labs 2016 State of DevOps Report: https://puppet.com/resources/white-paper/2016-state-of-devops-report
6. 2011: APM about to be disrupted!
• Migrate from on-prem to VMs, cloud, containers, and PaaS
• Architectures include micro-services, on-demand scaling, self-healing
• “Cloud Natives” demand SaaS-based solutions
• Bridging the gap between “New Stack” and “Enterprise Stack”
• Digital transformers demand analytics for Biz, Dev, Ops & Sec
• Many new players on the market
7. From 0 to DevOps in 80 days
Lessons learnt from shifting from an on-prem to a cloud culture
Bernd Greifeneder, CTO
http://dynatrace.com/trial
11. NOC lessons learnt
• Continuous Integration is faster than classic Ops
• Automation and APIs
• One delivery stack across the pipeline
12. DevOps is a culture that emphasizes the collaboration of the various teams involved in software delivery.
Continuous Delivery is an approach to building, testing, and releasing software reliably, faster, and more frequently.
19. Production facts – Oct 2016
• 450 AWS EC2 instances
• >2 years of value to customers in production
• 99.998% cluster availability since June ’15
• NO 24/7 Ops team
• 170 deployments per working day
• 2-week release cycle
20. COMPANY CONFIDENTIAL – DO NOT DISTRIBUTE #Perform2015
Shift-Left Quality
• Quality/Performance matters in Dev/Staging as well!
• Make Dev/CSA/PM dependent on quality in trunk!
25. iPhone 6
Failing early improves quality
Late feedback sucks
26. Shift-Left Quality
• Quality/Performance matters in Dev/Staging as well!
• Make Dev/CSA/PM dependent on quality in trunk!
• DevOps = start thinking like Ops before you commit
27. Shift-Left Quality
• Quality/Performance matters in Dev/Staging as well!
• Make Dev/CSA/PM dependent on quality in trunk!
• DevOps = start thinking like Ops before you commit
Shift-Right Metrics
• Enable devs to define quality metrics
• Make devs the primary consumers of their metrics
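The "shift-right metrics" idea above can be sketched as a small quality gate: developers declare the metrics and thresholds for their own component, and the pipeline evaluates them after each deployment. Everything here is a hypothetical illustration, not Dynatrace's actual implementation; metric names and budget numbers are made up.

```python
# Hypothetical "shift-right metrics" quality gate: dev-defined metrics
# and thresholds, evaluated automatically after a deployment. All names
# and numbers are illustrative assumptions.

THRESHOLDS = {
    "response_time_p95_ms": 250,   # 95th-percentile response-time budget
    "error_rate_percent": 1.0,     # acceptable failure rate
    "cpu_overhead_percent": 5.0,   # instrumentation-overhead budget
}

def evaluate_quality_gate(measured: dict, thresholds: dict = THRESHOLDS) -> list:
    """Return the names of all metrics that exceed their dev-defined threshold."""
    return [name for name, limit in thresholds.items()
            if measured.get(name, float("inf")) > limit]

# Example run: one metric is over budget, so the deployment is flagged.
violations = evaluate_quality_gate({
    "response_time_p95_ms": 310,
    "error_rate_percent": 0.4,
    "cpu_overhead_percent": 2.1,
})
print(violations)  # -> ['response_time_p95_ms']
```

Because the developers own both the thresholds and the alerts, the feedback loop closes on the people who can actually fix the problem.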
28. “Developers will never do that! That's the Operator's job.”
34. Daily deployments from trunk that the whole DEV team depends on
35. Believe in the mission impossible
Before: 6-month major/minor release cycle + intermediate fix-packs + weeks to months of rollout delay
Now: sprint releases (continuous delivery), 1h from code to production
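The back-of-the-envelope arithmetic behind that compression is worth making explicit: a six-month cadence versus one hour from code to production is roughly a four-thousand-fold shorter feedback loop. The month length used below is an approximation.

```python
# Rough arithmetic for the cycle-time compression on this slide:
# a 6-month release cadence versus 1 hour code-to-production.
HOURS_PER_MONTH = 30 * 24              # approximation: 30-day month

old_cycle_hours = 6 * HOURS_PER_MONTH  # ~6-month major/minor release
new_cycle_hours = 1                    # 1h code-to-production

speedup = old_cycle_hours / new_cycle_hours
print(f"{speedup:.0f}x shorter feedback loop")  # -> 4320x shorter feedback loop
```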
36. New “multimodal” dynamics
Apps of Record, Apps of Engagement, Apps of Innovation
• Explore new revenue models
• Look ahead, see beyond the present; non-IT led
• New hyper-scale, hyper-dynamic apps
• Public cloud and open-source bias
• New cloud platforms and micro-services stacks
• Decisions made in hours/days, no POC
• DevOps assumed: expect release cycles measured in hours
• Tight teaming between biz, dev, and ops
38. Thank you!
Food for thought:
• Which stage are you in, and what's next: classic siloed, Continuous Integration, or DevOps?
• Is technology or process/culture the hurdle?
• Do you have the right monitoring strategy?
http://dynatrace.com/trial
41. Continuous Delivery & Feedback - SaaS
• Dev Stage (daily): Unit+Integration testing & Build → Deploy → Develop & Fixing → Acceptance & Performance & Load Tests → Monitor
• Acceptance Stage (bi-daily): Unit+Integration testing & Build → Deploy → Fixing → Acceptance & Performance & Load Tests → Monitor
• Production Stage (bi-weekly & on demand): Unit+Integration testing & Build → Deploy → Hotfixing → Monitor → Release
Every 2 weeks a version is pushed to the next stage = 2-week release cycle (Week 1 to Week 4)
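The staged promotion described above can be sketched as a simple data-flow: every two weeks each version moves one stage toward production, so three versions are always in flight. Stage names follow the slide; the promotion function itself is an illustrative assumption.

```python
# Minimal sketch of the bi-weekly stage promotion on this slide: each
# tick shifts every version one stage toward production and starts a
# fresh version in dev. Purely illustrative.

def promote(pipeline: dict, new_version: str) -> dict:
    """Shift every version one stage toward production; start a new one in dev."""
    return {
        "dev": new_version,
        "acceptance": pipeline["dev"],
        "production": pipeline["acceptance"],
    }

pipeline = {"dev": "v1.3", "acceptance": "v1.2", "production": "v1.1"}
pipeline = promote(pipeline, "v1.4")   # one bi-weekly promotion tick
print(pipeline)
# -> {'dev': 'v1.4', 'acceptance': 'v1.3', 'production': 'v1.2'}
```

Any given version therefore spends two weeks in each stage, which is exactly why a 2-week promotion interval yields a 2-week release cycle.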
42. Continuous Delivery & Feedback - Managed (On-Premises)
• Dev Stage (daily): Unit+Integration testing & Build → Deploy → Develop & Fixing → Acceptance & Performance & Load Tests → Monitor
• Acceptance Stage (bi-daily): Unit+Integration testing & Build → Deploy → Fixing → Acceptance & Performance & Load Tests → Monitor
• Production Stage (monthly & on demand): Unit+Integration testing & Build → Deploy → Hotfixing → Monitor → Release
Every 4 weeks a version is pushed to the next stage = 4-week release cycle (Week 1 to Week 8)
43. CDF Tooling Chain - Dev Stage
Pipeline steps: Unit+Integration testing & Build → Deploy → Develop & Fixing → Acceptance & Performance & Load Tests → Monitor
Develop
• Gradle + Artifactory: build/deployment automation
• SVN + Git: version control
• Eclipse / IntelliJ IDEA
• Quickbuild + Jenkins: Continuous Integration
Unit & Integration Tests (fully automated)
• Unit tests + integration tests: Quickbuild; Eclipse/IntelliJ IDEA for local execution before commit
• Memory/CPU overhead tests: Dynatrace AppMon
• Code quality and coverage checks: Sonar
• Memory-leak detection: Bullseye, Valgrind
• Virus scan: Kaspersky
• Open-source license compliance checks: Blackduck
Facts/Numbers
• 28,000 unit test + 3,000 integration test executions per hour
Deployment (fully automated)
• Quickbuild: automated daily deployment of trunk builds using Ansible and Puppet, or calling Dynatrace Ruxit CloudControl, which uses AWS CloudFormation (whole infrastructure as code!)
• Selenium/Appium: automated customer-like deployment; deployment checks on UI level
Acceptance Tests
• Selenium/Appium: automated functional tests on UI level; automated E2E tests on UI level
• Browser compatibility testing (Chrome, IE, FF, Edge, Safari)
• 24/7 deployment checks
• Jira: manual regression testing
• AWS, VMware, VirtualBox test environments
Performance & Load Tests
• Cluster workload simulator
• Eclipse Memory Analyzer (MAT) + Eclipse Thread Dump Analyzer
• Java Flight Recorder
Monitoring
• Dynatrace real-user, service, and infrastructure monitoring
• Web checks, AWS monitoring, log analytics & monitoring
• OpsGenie and a shared HipChat room to escalate detected problems directly to Development
Security
• 24/7 OSSEC host-intrusion monitoring
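The "whole infrastructure as code" point above means the test environment is a versionable artifact rather than a manually configured set of servers. As a hedged sketch of that idea, the snippet below builds a minimal CloudFormation-style template as plain data; resource names, properties, and the helper function are invented for illustration, and a real template would be submitted through the AWS API rather than printed.

```python
# Illustrative infrastructure-as-code sketch in the spirit of the
# CloudFormation-based deployment above. The resource names and the
# helper function are assumptions made for this example.
import json

def cluster_template(instance_count: int, instance_type: str = "m4.large") -> dict:
    """Build a minimal CloudFormation-style template for a test cluster."""
    resources = {
        f"ClusterNode{i}": {
            "Type": "AWS::EC2::Instance",
            "Properties": {"InstanceType": instance_type},
        }
        for i in range(instance_count)
    }
    return {"AWSTemplateFormatVersion": "2010-09-09", "Resources": resources}

template = cluster_template(3)
print(json.dumps(template, indent=2))
print(len(template["Resources"]))  # -> 3
```

Because the environment is just data, it can live in version control next to the application code, which is what makes daily customer-like redeployment practical.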
44. CDF Tooling Chain - Acceptance Stage
Pipeline steps: Unit+Integration testing & Build → Deploy → Fixing → Acceptance & Performance & Load Tests → Monitor
Facts/Numbers (Dev + Acceptance Stage)
• ~700 automated UI tests
• ~60 hours of UI test execution per build: ~20 parallel-running test sets executed on ~30 execution machines, up to ~5 hours per test set
• ~15 different OS (Windows, Linux)
Unit & Integration Tests (fully automated)
• Unit tests + integration tests: Quickbuild; Eclipse/IntelliJ IDEA for local execution before commit
• Virus scan: Kaspersky
Deployment (fully automated)
• Quickbuild: automated deployment of sprint builds using Ansible and Puppet, or calling Dynatrace Ruxit CloudControl, which uses AWS CloudFormation (whole infrastructure as code!)
• Selenium/Appium: automated customer-like deployment; deployment checks on UI level
Acceptance Tests
• Selenium/Appium: automated functional tests on UI level; automated E2E tests on UI level
• Browser compatibility testing (Chrome, IE, FF, Edge, Safari)
• 24/7 deployment checks
• Jira: manual regression testing
• AWS, VMware, VirtualBox test environments
Fixing
• Gradle + Artifactory: build/deployment automation
• SVN + Git: version control
• Eclipse / IntelliJ IDEA
• Quickbuild + Jenkins: Continuous Integration
Performance & Load Tests
• Cluster workload simulator
• Eclipse Memory Analyzer (MAT) + Thread Dump Analyzer
• Java Flight Recorder
Monitoring
• Dynatrace real-user, service, and infrastructure monitoring
• Web checks, AWS monitoring, log analytics & monitoring
• OpsGenie and a shared HipChat room to escalate detected problems directly to Development
Security
• 24/7 OSSEC host-intrusion monitoring
45. CDF Tooling Chain - Production Stage
Pipeline steps: Unit+Integration testing & Build → Deploy → Hotfixing → Monitor
Fixing
• Gradle + Artifactory: build/deployment automation
• SVN + Git: version control
• Eclipse / IntelliJ IDEA
• Quickbuild + Jenkins: Continuous Integration
Monitoring
• Dynatrace real-user, service, and infrastructure monitoring
• Web checks, AWS monitoring, log analytics & monitoring
• OpsGenie and a shared HipChat room to escalate detected problems directly to Development
Security
• 24/7 OSSEC host-intrusion monitoring
• Monthly and on-demand vulnerability scans (KPMG Linz)
• Closed bug bounty program at HackerOne
Unit & Integration Tests (fully automated)
• Unit tests + integration tests: Quickbuild; Eclipse/IntelliJ IDEA for local execution before commit
• Virus scan: Kaspersky
Deployment (fully automated)
• Quickbuild: automated deployment of sprint builds using Ansible and Puppet, or calling Dynatrace Ruxit CloudControl, which uses AWS CloudFormation (whole infrastructure as code!)
• Selenium/Appium: automated customer-like deployment; deployment checks on UI level
Editor's Notes
CD flows directly into DevOps.
A DevOps culture can be seen as a product of Continuous Delivery.
Both have the same goal:
minimize feature cycle time and
deliver features earlier while meeting the quality goals.
So our goal is to deploy new features faster, to get them in front of our paying end users.
For many companies that tried this, it also meant that they failed faster.
It is also very important to keep the focus right: building and fixing the things that matter,
and avoiding ending up in a war room, as happened at Facebook in Dec 2012.
I would like to share some facts that show where we are now. …
We indeed managed to live a DevOps culture and to build a working CD pipeline.
But how did this happen in such a short time? I would like to pick the, from my point of view, most important decision on our way, one that also shows how persistent the R&D teams and the management at Dynatrace are.
---
This simple and powerful decision was that we shifted quality to the left:
Devs used to demo on their own machines ("it works on my machine"), but it broke thereafter.
We started deploying the latest successful trunk build of each component to a production-like environment every day,
and only this environment was allowed to be used to demo new features and to verify bug fixes.
We paid for additional cloud resources to realize this idea, and initially the only thing we won was pain:
DEV: when a failing test blocks the whole deployment; when one component breaks the whole environment; when a component needs to be able to handle gracefully every unhealthy state of the other components.
PM: when they wanted to review or demo upcoming features.
But we did not reverse this decision, because we realized that we had won something very promising.
Devs used to go home on Friday after checking in; the entire build broke, and others had to clean up.
The whole team was now sitting in the very same boat, and everybody felt the pain of insufficient quality on the product and test-coverage side, as well as of insufficient automation and quality within the continuous delivery and deployment pipeline.
You might now think that being a developer at Ruxit must be painful. But in fact the opposite is true. …
Think how painful it is to work on features for several months and then, weeks after the release, to get bad feedback.
You probably realize then that your assumptions were wrong and that you would have built the feature differently if you had known earlier. Every developer with some professional experience knows such situations and probably starts dreaming of a process that delivers such valuable feedback shortly after commit.
So, welcome to Ruxit! We make developer dreams come true!
By doing very frequent big-bang deployments from trunk, we gave everybody on the team the chance to learn to think like an operations person. And this is a very natural learning process, the way kids learn: they try, fail, and next time they try to do it better.
Trying to do it better could, for instance, mean that developers start thinking about metrics as feedback on whether their code works as expected or not.
Trying to do it better can also mean something else …
Once you manage to automate these feedback channels, the feedback metrics can be used for all deployments. This means our developers know at any time the status of each and every deployment of their component. This is something an operations team could never provide: they look only at production and do not have the detailed insight the whole dev team has.
Not for a development environment! So who is doing that for our development environment? Right, our devs. But they will never do this manually, as it is time-consuming and usually needed when the people who know how are out of the office.
So what is the solution? Of course, they automate these jobs: watchdogs detect unhealthy processes and restart them; hardware is replaced by a simple AWS API call.
And this is the answer to why we can do NoOps: our devs have already automated their duties!
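The watchdog idea in the note above can be sketched in a few lines: check each component's health and restart anything unhealthy. The component names, health checks, and restart action below are stand-ins invented for this example; in the real setup a failed host would be replaced through an AWS API call rather than a local restart.

```python
# Hedged sketch of the self-healing watchdog described above: health
# checks and the restart action are illustrative stand-ins.

def run_watchdog(components: dict, restart) -> list:
    """Restart every component whose health check fails; return their names."""
    restarted = []
    for name, is_healthy in components.items():
        if not is_healthy():
            restart(name)          # in production: restart or replace the host
            restarted.append(name)
    return restarted

restarted_log = []
components = {
    "cluster-node": lambda: True,
    "webhook-relay": lambda: False,   # simulated unhealthy process
}
run_watchdog(components, restart=restarted_log.append)
print(restarted_log)  # -> ['webhook-relay']
```

Run on a schedule, a loop like this replaces the manual restart duty a NOC would otherwise perform, which is the core of the NoOps claim.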
ANITA: Chaos Monkey kills instances at random.
If you don't do it, somebody else will do it for you.
Dynatrace 6.2: intensified burn-down phase in the last third.
Ruxit: up/down trend within sprints; ideal would be a straight blue line, with red and green overlapping with a slight time shift.