Troubleshooting App Health and Performance with PCF Metrics 1.2

Join Allen Duet and Pieter Humphrey from Pivotal to learn how PCF Metrics enhances the developer experience on Pivotal Cloud Foundry with a simple and powerful way to troubleshoot app health and performance issues. You will see how, with a single, unified interface for events, logs, and metrics, app devs can easily navigate graphs to identify problems and then view the logs for that time slice.
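The workflow described above (spot a problem on a graph, then read the logs for that time slice) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not PCF Metrics code: the `LogLine` record, the 60-second bucket size and the helper functions are hypothetical, and log timestamps are assumed to be Unix epoch seconds. It builds a log distribution histogram, picks the busiest bucket, and returns the log lines in that slice.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class LogLine:
    # Hypothetical log record (not a PCF Metrics type): epoch seconds plus message text.
    timestamp: float
    message: str


def log_histogram(logs, bucket_seconds=60):
    """Count log lines per time bucket, keyed by bucket start (epoch-aligned)."""
    counts = Counter(int(line.timestamp // bucket_seconds) * bucket_seconds for line in logs)
    return dict(sorted(counts.items()))


def logs_in_slice(logs, start, end):
    """Return the log lines with start <= timestamp < end, oldest first."""
    selected = (line for line in logs if start <= line.timestamp < end)
    return sorted(selected, key=lambda line: line.timestamp)


def busiest_slice(logs, bucket_seconds=60):
    """Find the bucket with the most log lines; return (start, end, lines) or None."""
    histogram = log_histogram(logs, bucket_seconds)
    if not histogram:
        return None
    start = max(histogram, key=histogram.get)
    end = start + bucket_seconds
    return start, end, logs_in_slice(logs, start, end)


if __name__ == "__main__":
    logs = [
        LogLine(0, "GET /health 200"),
        LogLine(65, "GET /orders 200"),
        LogLine(70, "GET /orders 500"),
        LogLine(75, "Connection refused: db"),
        LogLine(130, "GET /health 200"),
    ]
    print(log_histogram(logs))  # {0: 1, 60: 3, 120: 1}
    start, end, lines = busiest_slice(logs)
    print(start, end, [line.message for line in lines])
```

Bucketing by integer division keeps every bucket aligned to the same boundaries, so a log histogram and a metrics graph drawn with the same bucket size line up slice for slice.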

  1. PCF Metrics – App Dev: Providing App Developers insight into app performance. Pieter Humphrey, Allen Duet
  2. Gartner believes that more than 80% of all mission-critical IT service outages result from people and process errors and failures, and of those outages, more than 50% result from a lack of coordination between change, release and configuration management processes. Source: Ronni J. Colville, "Four Steps to Optimize Configuration Management Process and Tools," Gartner Doc #G00258557, Oct 2013.
  3. Modern infrastructure is constantly changing. Methodologies: from linear / sequential to Agile, DevOps and CI/CD pipelines. Deployment: from sparingly at designated times to ready for prod at any time. Architecture: from monolithic app to microservices / composite app. Technologies: from app server on machine to containers and public / private / hybrid cloud. Operations: from many tools and ad hoc automation to managing services, not servers.
  4. Rate of change is driving more outages.
  5. Outages are often preventable using automation (2015): Facebook, 1 hour, Jan 26th (config / app / network failures); Apple App Store, 11 hours, March 11th (internal DNS error); NYSE, United and WSJ, 4 hr, 1.5 hr and 1 hr respectively, July 8th (software update, routing failure, server overload); UltraDNS, 2.5 hours, Oct 15th (configuration errors). Sources: https://blog.thousandeyes.com/top-internet-outages-2015/ http://www.informationweek.com/cloud/9-spectacular-cloud-computing-fails/d/d-id/1321305?image_number=2 http://www.informationweek.com/cloud/9-spectacular-cloud-computing-fails/d/d-id/1321305?image_number=4 http://www.informationweek.com/cloud/9-spectacular-cloud-computing-fails/d/d-id/1321305?image_number=8
  6. Speed and Availability Matters: “25% of customers will abandon a web page that takes more than 4 seconds to load”; “47% of consumers expect a web page to load in < 2 seconds”; “Customers prefer a competitor's website if it is 250 ms faster”; “Increase revenue 1% for each 100 ms improvement”. Sources: Gartner, Google, Amazon, Walmart.
  7. Speed, Performance and Human Perception. Delay time and user reaction: 0-100 ms, instant; 100-300 ms, feels sluggish; 300-1000 ms, machine is working...; 1 second+, mental context switch; 10 seconds+, I'll come back later... Stay under 250 ms to feel "fast". Stay under 1000 ms to keep users' attention. Source: Breaking the 1000 ms Mobile Barrier, Velocity (Google Slides), https://docs.google.com/presentation/d/1wAxB5DPN-rcelwbGO6lCOus_S1rP24LMqA8m1eXEDRo/present?slide=id.p19
  8. Changes to a single microservice or monolithic app can impact the performance of downstream apps and services, or cause breakage.
  9. Troubleshooting apps and microservices is hard. Most platforms have: disparate permissions on different apps; data silos across subsystems; trouble reconciling time series data.
  10. Platform overview (diagram). Development: multiple languages, microservices support, services marketplace (native, user-provided, partner). Platform: operating system, cloud API, container orchestration, app deployment & management, availability, visibility & administration. Operations: CI/CD tools, ID, security; health, metrics, patching; apps & platform dashboards.
  11. 4 Levels of High Availability: (1) app instance fail, (2) process fail, (3) VM fail, (4) availability zone fail.
  12. Container Scheduler Handles Workloads: 250,000 containers managed in a single environment. Source: https://blog.pivotal.io/pivotal-cloud-foundry/products/250k-containers-in-production-a-real-test-for-the-real-world
  13. Container Scheduler Handles Workloads: dynamic load balancing.
  14. Container Scheduler Handles Workloads: dynamic load balancing; remediation and rebalancing of workloads.
  15. Each Layer Upgradable with No Downtime. Container layers: application; app runtime*; file system mapping; Linux host & kernel. Deployment strategies: blue-green deploy, canary-style deploy. * e.g. embedded web server, app configurations, JRE, agents for services, packaged as buildpacks.
  16. Our Charter: To provide App Devs with data points to assess overall solution performance and health. Providing App Developers insight into app performance.
  17. Immediate, Integrated, Automated: near real-time view; covers 80-90% of the problems; one tool correlates events, logs and metrics; common set of facts for Dev+Ops; designed for PCF multi-tenancy; agentless, no install; enabled automatically for all applications.
  18. Available data: CF events, app logs, app metrics, routes.
  19. Select an app, watch streaming data.
  20. PCF Metrics v1.2.1: 2 weeks of app log storage; 2 weeks of detailed container and HTTP start/stop metric storage; app log distribution histogram; app event UI improvements; fault tolerance on all storage services; testing and tuning for large ingestion loads.
  21. Data Correlation Demo
  22. PCF Metrics 1.2 Architecture
  23. Our Journey. PCF Metrics v1.0: aggregate container and HTTP metrics provided for apps. PCF Metrics v1.1: aggregate container and HTTP metrics + app events and logs (24-hour storage). PCF Metrics v1.2.1: aggregate container and HTTP metrics + app events and logs (2 weeks storage). PCF Metrics v1.3: aggregate container and HTTP metrics + app events and logs (2 weeks storage) + TraceID capture and trace logs.
  24. v1.3+ for App Developers: Spring Boot actuator support; expanded event descriptions; additional log sources*; data exposed as API; continued UX improvements.
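The v1.3 roadmap lists TraceID capture and Trace Logs. The idea behind trace logs is that every log line emitted while serving one request carries the same trace ID, so lines from several microservices can be regrouped into a single request timeline. A minimal Python sketch, assuming log lines have already been parsed into hypothetical `TraceLog` records; the field names and helper functions are illustrative and are not the PCF Metrics API.

```python
from collections import defaultdict
from typing import NamedTuple, Optional


class TraceLog(NamedTuple):
    # Hypothetical parsed log record; trace_id is None for lines without one.
    trace_id: Optional[str]
    app: str
    timestamp: float
    message: str


def group_by_trace(logs):
    """Group log lines by trace ID, each group ordered by timestamp.

    Lines without a trace ID are skipped because they cannot be tied to a request.
    """
    traces = defaultdict(list)
    for line in logs:
        if line.trace_id is not None:
            traces[line.trace_id].append(line)
    return {
        trace_id: sorted(lines, key=lambda line: line.timestamp)
        for trace_id, lines in traces.items()
    }


def apps_touched(trace):
    """List the apps one request passed through, in first-seen order."""
    seen = []
    for line in trace:
        if line.app not in seen:
            seen.append(line.app)
    return seen


if __name__ == "__main__":
    logs = [
        TraceLog("abc", "orders", 2.0, "calling inventory"),
        TraceLog("abc", "gateway", 1.0, "GET /orders"),
        TraceLog("abc", "inventory", 3.0, "stock lookup timed out"),
        TraceLog(None, "gateway", 1.5, "health check"),
        TraceLog("def", "gateway", 4.0, "GET /health"),
    ]
    traces = group_by_trace(logs)
    print(apps_touched(traces["abc"]))  # ['gateway', 'orders', 'inventory']
```

Sorting each group by timestamp turns scattered per-app log lines into one ordered timeline per request, which is what makes a slow or failing downstream call easy to spot.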
