Starting Your DevOps Journey – Practical Tips for Ops

Starting your DevOps Journey
Practical Tips for Ops
http://dynatrace.com/trial
Brian Chandler
Systems Engineer @ Raymond James
@Channer531
Andreas Grabner
Chief DevOps Activist @ Dynatrace
@grabnerandi

Promise of DevOps: Faster & Efficient Innovation
Smaller Apps, Micro-Services More Deployments
App-, Service- & End-User Feedback Loops
Happy Users
Lower Costs

Proof: DevOps Adopters Are …
200x 2,555x
more frequent deployments faster lead times than their peers
More Agile
3x 24x
lower change failure rate faster Mean Time to Recover
More Reliable
More Successful 2x 50%
More likely to exceed market
expectations
Higher market cap growth
over 3 years
Source: Puppet Labs 2015 State Of DevOps Report: https://puppet.com/resources/white-paper/2016-state-of-devops-report

Dynatrace Transformation by the numbers
23x
170
More releases
Deployments / Day
31000 60h
Unit+Int Tests / hour UI Tests per Build
More Quality
~200 340
Code commits / day Stories per sprint
More Agile
93%
Production bugs found by Dev
More Stability 450 99.998%
Global EC2 Instances Global Availability
Webinar @ https://info.dynatrace.com/17q3_wc_from_agile_to_cloudy_devops_na_registration.html

YET: „DevOps Adoption is only 2%“ Gene Kim, Nov 2016

Interesting Ops Learnings from Adopters
New Technology Stack
New Architectural Patterns
End User Focused
New Deployment Models

DevOps Requirements and Engagement Options for Ops
Feedback through High Quality App & User Data
Ops as a Service: “Self-Service for Application Teams”
Bridge the Gap between Enterprise Stack and New Stack
Shift-Left: (No)Ops as “Part of Application Delivery”
RequirementsEngagementOptions

Basic App Monitoring1
App Dependencies2
End User Monitoring3
How to monitor mobile vs desktop vs tablet vs service endpoints?
How much network bandwidth is required per app, service and feature?
Where to start optimizing bandwidth: CDNs, Caching, Compression?
Are our applications up and running?
What load patterns do we have per application?
What is the resource consumption per application?
What are the dependencies between apps, services, DB and infra?
How to monitor „non custom app“ tiers?
Where are the dependency bottlenecks? Where is the weakest link?
Closing the Ops to Dev Feedback Loop: One Step at a Time!
“Soft-Launch” Support4
Virtualization Monitoring5 How to automatically monitor virtual and container instances?
What to monitor when deploying into public or private clouds?
How to deploy and monitor multiple versions of the same app / service?
What and how to baseline?
Do we have a better or worse version of an app/service/feature?
Ops: Need answers to these questions! Closing the gap to AppBizDev
Ready for “Cloud Native” How to alert on real problems and not architectural patterns?
How to consolidate monitoring between Cloud Native and Enterprise?
Who is using our apps? Geo? Device?
Which features are used? Whats the behavior?
Where to start optimizing? App Flow? Page Size?
Conversion Rates? Bounce Rates?
Where are the performance / resource hotspots?
When and where do applications break?
Do we have bad dependencies through code or config?
How does the system really behave in production?
What to learn for future architecturs?
What are the usage patterns for A/B or Green/Blue?
Difference between different versions and features?
Does the architecture work in these dynamic enviornments?
Does scale up/down work as expected?
Provide „Monitoring as a Service“ for Cloud Native
Application Teams6
Today

Questions to Answer!
Are our applications up & running?
What are the real load patterns?
What is the resource consumption?
Where to start optimizing?

Are our Apps Up, Running & Accessible?
Availability dropped to 0%

Early Warning SLA Monitoring!
Quality of
Connectivity, DNS
Quality of
Connectivity & DNS
Quality of Content
Delivery
Quality of Content
Delivery
3rd Party Impact
Delivery by Geo
Quality of Content
Delivery

Client Center Daily Traffic Pattern

Client Center sees a
peak of about 3,800
Req/min against the
it’s API.

Client Center sees a
peak of about 3,800
Req/min against the
it’s API.
60 unique
calls/functions that
make up the Client
Center API

~20% of that traffic is
ClientCenter/API/Holdings

ClientCenter/API/ClientDetails

ClientCenter/API/ClientDetails
ClientCenter/API/RecentSearch

Typical Peak Hour If you’re not careful, it could look like this…
Rhythmatic peaks and valleys suggest “lock-step” scripts (all virtual
users start and end at the same time.)
PRD usage is much more “fluid”. Steady stream
and balance across transaction usage
Total sum of traffic load was met. However, correct ratio of key transactions were not met.
Leveraging PRD data to tune QA Load Tests

Normal Production Distribution Failed Load Test Distribution
Black: Overall application load and peak volume Percentile breakdown of fast, warning, slow txs
VS.
Performance Differences Before and After Release

Occurrences of slow AccountList Transactions from load testingDistribution of “yellow” transactions for that time
AccountList makes
up most of these
transactions.
Normal distribution of
“expected” slow
transactions for this
API function.
Distribution generated
from load test. New
code would greatly
increase the
occurrences of slow
transactions in
production!
What is making up all that yellow?

Detection Load Distribution and Deployment Hotspots
Overall Load Distribution by SLA
Very Slow, Slow, Med, Fast
Tip: Logarithmic Y-Axis
Finding #3:
Server #3 only gets
load at certain times!
Finding #2a:
Server #1 was put back
in rotation HERE
Finding #2b:
Server #2 saw less
errors once #1 was up
Finding #1:
Response Time Spikes at
certain times not related
to load!
Validate Load Balancing
Tip: Load per Server!

Detection Load Distribution and Deployment Hotspots
Requests by App Server:
Tip: Percentage Bar Chart
Thread Usage:
Tip: Pool Size + Actual Use
Same for Web ServerSame for Web Server
Transfer Rate
Identify “heavy hitters”
Resource Utilization
Tip: CPU, Memory, I/O …

Detecting Resource Regression Hotspots
Time of Deployment
Other Resources: Bytes Transferred, Disk I/O, # of Log Messages, # of Open Connections, # of Calls …

Detecting Error Hotspots under Load

Automatic Hotspot Detection under Load
My Favorite: Layer Breakdown Chart
With increasing load: Which LAYER
doesn’t SCALE?

Automatic Availability Root Cause Detection
Web Performance Optimization
Automated 
List of root cause explanations for
SLA violations

Automatic Baselining per Business Transaction
Response Time Baselines based on
50th & 90th Percentile
Smart Alerting based on Significant
Measurement Violation
Direct link to Layer Breakdown and
Method Hotspot!

Automatic Anomaly and Root Cause Detection
Automatic Anomaly Detection Automatic Root Cause Information
Automatic Impact Details

Summary: Capabilities to Get Answers
Through Synthetic Monitoring: Are our applications up & running?
Availability, Response Time, CDN, Geo, …
Content Size and Content Validation
Through Endpoint Monitoring: What are the real load patterns?
Bucket by Response Time (Fast, Medium, Slow, Very Slow ...)
Bucket by Status Code (HTTP 2xx, 3xx, 4xx, 5xx, ...)
Through System Monitoring: What is the resource consumption?
CPU, Memory, Network and I/O
Through Basic Application Monitoring: Where to start optimizing?
Top Exceptions & Log Messages; # Thread (Idle, Busy)
Memory by Heap Space, Garbage Collection Activity
Execution Hotspots by Component

Which services do we actually host?
What is the health state of every component?
What are the dependencies?
What impacts the interconnected system health?
Questions to Answer!

Agent-Based Monitoring & Tracing:
Bridging Enterprise and New Stack
From Mobile
Via Middleware
To Mainframe
And Services
To SQL /
NoSQL
To SQL /
NoSQL
To SQL /
NoSQL
To External
Services

Analyzing Inter Tier Impact
#1: Load Spike
Direct correlation with # of
SQL queries -> OK!
#2: Same Load Spike
Direct correlation with # of
Exceptions -> OK!
#3: Starting with Load Spike
Time spent in JDBC (blue)
stays very high -> NOT OK!
#4: Problem Solved
Issue on Oracle Server
caused all SQL to be slow

Health State and Impact of Database!
DB-Related Blogs from Sonja: https://www.dynatrace.com/blog/author/sonja-chevre/

Proper Connection Pool Sizing!
Do we have enough DB
CONNECTIONS per pool?

Detecting Database Impact on Message Processing
#1: Cluster Failover Event
#2: System Struggled
but managed load
#2: System Struggled
but managed load
#3: DB Index Job with MAJOR
impact on End Users

@ Dynatrace: Service Tier Monitoring
#3: Queue Sizes
#1: Cassandra
Health
#2: Cassandra
Health
#1: Overall Tier Health
#4: Error States

What’s lurking under the water of the iceberg?

What is the cause of all performance problems?

40
Red wave of death appears on
dashboard.
Conference Bridge/Crisis Center
call with lots of “Smart Guy
Correlation”
Application recovers.
Triaging w/o anomaly detection on app dependencies

App1
Web
AppSvc
MB
EntSvc
DB
App2
Web
AppSvc
MB
EntSvc
DB DB
EntSvc
MB
App3
Web
AppSvc
App4
Web
AppSvc
MB
EntSvc
DB
App5
Web
AppSvc
MB
EntSvc
DB
41
DCRUM – True enterprise monitoring

App1
Web
AppSvc
MB
EntSvc
DB
App2
Web
AppSvc
MB
EntSvc
DB DB
EntSvc
MB
App3
Web
AppSvc
App4
Web
AppSvc
MB
EntSvc
DB
App5
Web
AppSvc
MB
EntSvc
DB
42

43

44
App1 App2 App5App4App3
Web Web Web
Svc1
WebWeb
DB1
EntSvc2
DB2
ENTSvc1
MB
Svc2 Svc4Svc3

45
DB1
EntSvc2
DB2
ENTSvc1
MB
Svc2 Svc4Svc3

46
DB1
EntSvc2
DB2
ENTSvc1
MB
Svc2 Svc4Svc3
Successful application dependency monitoring will allow you to
take a “bottom-up” approach to monitoring your enterprise.

“Bottom-up” Service View
Client Group 1, Servers A-D
Client Group 2, Servers E-H
Client Group 3, Servers I-L
Client Group 4, Servers M-Q
Client Group 5, Servers R-S
Different Apps and services
exercise enterprise services
and databases in varying ways!
Lack of load from these peers against this service
Poor performing node in this clientgroup

48
Link to the appropriate heat map
Alert sent based on deviation of calculated baseline
Baseline alerting granularity down the
operation level, not just the Software
Service
Delivering this data as actionable alerts

Usage and application behavior vary day-to-day.
A rolling average of services is not good enough
One week application usage trend
Monday Tuesday Wednesday Thursday Friday
The need for seasonal baselining

To achieve deeper statistical capabilities, we
use a combination of the PureLytics stream and
DCRUM REST interface to pour data into
analysis tools.
This allows us to reach back several weeks, on a single
minute for the given day (e.g. Monday at 10:03am
compared to the last 5 Mondays at 10:03am) to
calculate our baselines. For every unique operation in or
enterprise (25k+ recorded). That is a great deal of data!
Dynatrace performance metrics streaming

By reaching that far back at granular 1-minute intervals,
you can be very confident with the validity of your
baseline values
A 50ms-150ms deviation may not seem like a huge deal –
but in the world of app dependency monitoring, it truly is!
Graphical View of deep seasonal baselining

Service 1 needs to call Service 2 multiple
times. If service 2 slows down, it has an
enormous impact on all upstream services.
150ms shift in service 2
causes Service 1 to shift from
200ms-2s
Service 1
Service 2
Upstream impact of dependencies

Automatic Full Stack Monitoring
#1: All your Technologies #2: All Key Metrics
#3: Physical, Virtual, Containers or Cloud

Smartscape: Real Time Service-Oriented CMDB
#1: Understand WHO
talks with WHOM?
#2: Where are tiers
deployed?
#3: WHO might be
impacted by a failure?

Automatic Service Flow Tracing
#1: Understanding
Flow
#2: Dependencies
between Service
#3: Service
Clustering

Automatic Architectural Pattern Detection
#1: Action
initiated by the
SPA (Single Page
App)
#2: SPA was
making 3 AJAX
Calls in total!
#3: One of the calls
makes 13! Backend
REST Calls to
external system on
13 asynchronous
threads

Automatic Problem Pattern Detection
#1: Select Top
Common
Problem Patterns
#1: Explore
which
transactions
have this and
other problems

Automating Anomaly Detection
#1: All Root
Cause
Information
„encapsulated“
into a single
Problem
#2: “Time-Lapse”
of Problem
Evolution
#3: All relevant
Events: Infra,
Logging, App,
Service, End User
…

Automatic Integration with ChatOps

Summary: Capabilities to get answers
Through Automatic Dependency Detection
Which services hosted by which processes?
Where do these processes run?
Through Component Monitoring
Key metrics from Oracle, SQL, DB2, MySql, Postgres
Throughout on your Message Broker / Bus, Firewalls / Proxies
Through End-to-End Tracing
Which Services are depending for end-to-end use cases?
Where are our bottlenecks? How to optimize Deployment and archtiecture?
Through Anomaly Detection
Which tiers are acting out-of-the norm after an update or under certain load?
Who is impacted when one tier has an issue?
Where to look for the real root cause when a service goes down?

Basic App Monitoring1
App Dependencies2
End User Monitoring3
How to monitor mobile vs desktop vs tablet vs service endpoints?
How much network bandwidth is required per app, service and feature?
Where to start optimizing bandwidth: CDNs, Caching, Compression?
Are our applications up and running?
What load patterns do we have per application?
What is the resource consumption per application?
What are the dependencies between apps, services, DB and infra?
How to monitor „non custom app“ tiers?
Where are the dependency bottlenecks? Where is the weakest link?
DevOps Monitoring Maturity: What we covered today?
“Soft-Launch” Support4
Virtualization Monitoring5 How to automatically monitor virtual and container instances?
What to monitor when deploying into public or private clouds?
How to deploy and monitor multiple versions of the same app / service?
What and how to baseline?
Do we have a better or worse version of an app/service/feature?
Ops: Need answers to these questions! Closing the gap to AppBizDev
Ready for “Cloud Native” How to alert on real problems and not architectural patterns?
How to consolidate monitoring between Cloud Native and Enterprise?
Who is using our apps? Geo? Device?
Which features are used? Whats the behavior?
Where to start optimizing? App Flow? Page Size?
Conversion Rates? Bounce Rates?
Where are the performance / resource hotspots?
When and where do applications break?
Do we have bad dependencies through code or config?
How does the system really behave in production?
What to learn for future architecturs?
What are the usage patterns for A/B or Green/Blue?
Difference between different versions and features?
Does the architecture work in these dynamic enviornments?
Does scale up/down work as expected?
Provide „Monitoring as a Service“ for Cloud Native
Application Teams6

We have the experience.
 One of the largest health care
insurance providers in the nation
– to DevOps in two weeks
 One of the largest furniture retailers in
the United States
– to DevOps in two weeks

We have a proven approach--
The DevOps Xcelerator
 Outline your digital performance
management (DPM) strategy
 Build on what you already have
 Implement DPM to support DevOps
 Validate your success
DPM Vision & Strategy
Discovery & Planning
Implementation
Validate Success
Identify DPM goals that guide your implementation
strategy in alignment with business objectives.
Ask the right questions. Collect the information.
Assemble required resources. Create your
implementation plan.
Follow the Dynatrace Expert Services (DXS)
implementation framework to successfully execute
your implementation plan.
Track, measure and report progress towards your
DPM goals so that your digital performance
investments add increasing value to the business.

66
Q & A Brian Chandler
Systems Engineer @ Raymond James
@Channer531
Andreas Grabner
Chief DevOps Activist @ Dynatrace
@grabnerandi
Action Items for you!
Try Dynatrace SaaS: http://bit.ly/dtsaastrial
Try Dynatrace AppMon On Premise: http://bit.ly/dtpersonal
List to our Podcast: http://bit.ly/pureperf
Read more on our blog: http://blog.dynatrace.com

Starting Your DevOps Journey – Practical Tips for Ops

Starting Your DevOps Journey – Practical Tips for Ops

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Viewers also liked

Viewers also liked (8)

Similar to Starting Your DevOps Journey – Practical Tips for Ops

Similar to Starting Your DevOps Journey – Practical Tips for Ops (20)

More from Dynatrace

More from Dynatrace (20)

Recently uploaded

Recently uploaded (20)

Starting Your DevOps Journey – Practical Tips for Ops

Editor's Notes