
Life Cycle of Metrics, Alerting, and Performance Monitoring in Microservices

Slides from my MonitoringSF meetup presentation: the why and how of achieving sane monitoring of microservices using Nomad, Consul, and Terraform.



  1. 1. Life Cycle of Metrics, Alerting, and Performance Monitoring in microservices Good Bad Ugly
  2. 2. 2 Operations Life Cycle
  3. 3. 3 Operations Life Cycle: Incident Incident
  4. 4. 4 Operations Life Cycle: Incident Management •Monitoring •Diagnosis •Escalation •Remediation •Communication Incident Response
  5. 5. 5 Operations Life Cycle: Resolution Response •Monitoring •Diagnosis •Escalation •Remediation •Communication Resolution
  6. 6. 6 Operations Life Cycle: Recovery Response •Monitoring •Diagnosis •Escalation •Remediation •Communication • Investigation • Root Cause Analysis • Incident RCA • Problem RCA • Incident Review/Post-Mortem • Identification of Action Items • Lessons Learned Resolution Recovery
  7. 7. 7 Operations Life Cycle: Prevention Response Recovery Prevention •Monitoring •Diagnosis •Escalation •Remediation •Communication • Investigation • Root Cause Analysis • Incident RCA • Problem RCA • Incident Review/Post-Mortem • Identification of Action Items • Lessons Learned •Documentation Development •Training •Risk Mitigation •Execution of AIs from prior incidents •Production Readiness Reviews
  8. 8. 8 Operations Life Cycle: Preparation Response Recovery Prevention Preparation •Monitoring •Diagnosis •Escalation •Remediation •Communication • Investigation • Root Cause Analysis • Incident RCA • Problem RCA • Incident Review/Post-Mortem • Identification of Action Items • Lessons Learned •Documentation Development •Training •Risk Mitigation •Execution of AIs from prior incidents •Production Readiness Reviews •System Development •Risk Identification •Monitoring Systems •Tactical, Operational, Strategic KPIs •Identify Meaningful KPIs •KPIs to Notification and Escalation Matrix •Architecture Review
  9. 9. 9 Operations Life Cycle: Complacency Edition Response Recovery Prevention Preparation "Stable"
  10. 10. 10 Operations Life Cycle: Complacency Edition Response Recovery Prevention Preparation Steady-State Undetected Problem Blissful Ignorance Quiet Before the Storm
  11. 11. 11 Situational Awareness: Why
  12. 12. 12 Idea! Software Life Cycle: IDEA!
  13. 13. 13 Idea! Software Life Cycle: But it's not a bright idea, yet
  14. 14. 14 Software Life Cycle: Development Begins Time Prod 1) Idea! R&D
  15. 15. 15 Software Life Cycle: Development Begins Time Prod 1) Idea! R&D Forever
  16. 16. 16 Software Life Cycle: Development Begins Time Prod 1) Idea! R&D Forever (a.k.a. next week, month, or quarter)
  17. 17. 17 Software Life Cycle: Sprint to Prod! Time Prod 1) Idea! 2) Production Ready R&D
  18. 18. 18 Software Life Cycle: Knowledge of Actual Process Time Prod 1) Idea! 2) Production Ready R&D
  19. 19. 19 Software Life Cycle: Knowledge of Actual Process Time Prod 1) Idea! 2) Production Ready R&D
  20. 20. 20 Software Life Cycle: Wisdom Time Prod 1) Idea! 2) Production Ready R&D
  21. 21. 21 Software Life Cycle: Contrived Lifecycle Time Readiness 1) Idea! 2) Production Ready 3) End of Life 2.9) "It’ll be time to wind this service down when ___ happens and ___ comes online." R&D
  22. 22. 22 Software Life Cycle: Dose of Reality Time Production 1) Idea! 2) Production Ready 4) End of Life "Production Supported" 3) "Oops" R&D
  23. 23. 23 Software Life Cycle: Do NOT Pass Go, No $200 Time Production 1) Idea! N) End of Life "Production Supported" Forced to fix code or docs. R&D
  24. 24. 24 Software Life Cycle: Why the fails? Time Production 1) Idea! 2) Production Ready N) End of Life "Production Supported" "Drug feet to produce docs." [3,M) "Oops" R&D N-1) "That’s it, we’ve had enough…"
  25. 25. 25 Software Life Cycle Time Production 1) Idea! 2) Production Ready N) End of Life "Production Supported" [3,M) "Oops" R&D N-2) "That’s it, we’ve had enough…" N-1) "Just support it until the next version is out"
  26. 26. 26 Software Life Cycle: Detecting Problems Early Time Production 1) Idea! 2) Production Ready 4) End of Life "Production Supported" 3) "Oops" R&D WTB Alerting Here
  27. 27. 27 Metrics
  28. 28. 28 Metrics: Direction: Push statsd sink
  29. 29. 29 Metrics: Direction: Push statsd sink <metricname>:<value>|<type>
  30. 30. 30 Metrics: Direction: Push statsd sink <metricname>:<value>|<type> "Primitive"
  31. 31. 31 Metrics: Direction: Push statsd sink <metricname>:<value>|<type> Coordinated Endpoint for Firehose Data
  32. 32. 32 Metrics: Direction: Push statsd sink
  33. 33. 33 Metrics: Direction: Push statsd sink
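
The statsd line on the previous slides is the entire wire protocol: a plain-text datagram of the form <metricname>:<value>|<type> sent to the sink, usually over UDP. As a rough illustration (not from the slides; the sink address and metric names are assumptions), a few lines of Go are enough to push counters, gauges, and timers:

package main

import (
	"fmt"
	"log"
	"net"
)

// push sends one statsd-formatted metric (<metricname>:<value>|<type>)
// as a UDP datagram to a statsd sink.
func push(conn net.Conn, name string, value int, typ string) error {
	_, err := fmt.Fprintf(conn, "%s:%d|%s", name, value, typ)
	return err
}

func main() {
	// Assumed sink address; statsd conventionally listens on UDP 8125.
	conn, err := net.Dial("udp", "127.0.0.1:8125")
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()

	push(conn, "myapp.requests_2xx", 1, "c") // counter
	push(conn, "myapp.queue_depth", 42, "g") // gauge
	push(conn, "myapp.request_ms", 87, "ms") // timer
}
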
  34. 34. 34 Metrics: Direction: Poll http agent database
  35. 35. 35 Metrics: Direction: Poll http agent database
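
In the poll direction, an agent scrapes the service over HTTP on a fixed interval and writes the samples into a database. The service's half of that contract is just an endpoint that reports current values; a minimal sketch in Go, with the path /stats.json and the metric name chosen only for illustration:

package main

import (
	"encoding/json"
	"log"
	"net/http"
	"sync/atomic"
)

var requests2xx uint64 // monotonic counter, incremented by handlers

// statsHandler reports the current counter values as JSON for the poller.
func statsHandler(w http.ResponseWriter, r *http.Request) {
	w.Header().Set("Content-Type", "application/json")
	json.NewEncoder(w).Encode(map[string]uint64{
		"requests_2xx": atomic.LoadUint64(&requests2xx),
	})
}

func main() {
	http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		atomic.AddUint64(&requests2xx, 1)
		w.Write([]byte("ok"))
	})
	http.HandleFunc("/stats.json", statsHandler) // the agent polls this on its interval
	log.Fatal(http.ListenAndServe(":8080", nil))
}
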
  36. 36. 36 Metrics: Host Metrics HTTP JSON HTTP Trap "Feature Rich"
  37. 37. 37 Metrics: Host Metrics
  38. 38. 38 Metrics: Host Metrics HTTP JSON Broker (noit)
  39. 39. 39 Metrics: Alerting Pipeline HTTP JSON Broker (noit) stratcon Message Queue (fq) Rules Engine (ernie) Alerting (bert)
  40. 40. Nomad HASHICORP Cluster Manager Scheduler
  41. 41. Nomad HASHICORP Cluster Manager Scheduler
  42. 42. HASHICORP Schedulers map a set of work to a set of resources
  43. 43. HASHICORP CPU Scheduler Web Server -Thread 1 CPU - Core 1 CPU - Core 2 Web Server -Thread 2 Redis -Thread 1 Kernel -Thread 1 Work (Input) Resources CPU Scheduler
  44. 44. HASHICORP CPU Scheduler Web Server -Thread 1 CPU - Core 1 CPU - Core 2 Web Server -Thread 2 Redis -Thread 1 Kernel -Thread 1 Work (Input) Resources CPU Scheduler
  45. 45. HASHICORP Advantages Higher Resource Utilization Decouple Work from Resources Better Quality of Service
  46. 46. HASHICORP Advantages Bin Packing Over-Subscription Job Queueing Higher Resource Utilization Decouple Work from Resources Better Quality of Service
  47. 47. HASHICORP Advantages Abstraction API Contracts Standardization Higher Resource Utilization Decouple Work from Resources Better Quality of Service
  48. 48. HASHICORP Advantages Priorities Resource Isolation Pre-emption Higher Resource Utilization Decouple Work from Resources Better Quality of Service
  49. 49. Nomad HASHICORP
  50. 50. Nomad HASHICORP Cluster Scheduler Easily Deploy Applications Operationally Simple Built for Scale
  51. 51. job "redis" { datacenters = ["us-east-1"] task "redis" { driver = "docker" config { image = "redis:latest" } resources { cpu = 500 # Mhz memory = 256 # MB network { mbits = 10 dynamic_ports = ["redis"] } } } } example.nomad
  52. 52. HASHICORP Job Specification Declares what to run
  53. 53. HASHICORP Job Specification Nomad determines where and manages how to run
  54. 54. HASHICORP Job Specification Nomad abstracts work from resources
  55. 55. job "my-app" { … task "my-app" { ephemeral_disk { sticky = true } } } example.nomad
  56. 56. HASHICORP Moves data between tasks on the same machine
  57. 57. HASHICORP Copies data between tasks on different machines
  58. 58. 58 Why is this more difficult?
  59. 59. 59 Metrics: Direction: Poll http agent database Static Endpoint Ephemeral Containers
  60. 60. 60 Metrics: Direction: Poll http agent database Static Endpoint Ephemeral Containers Function-as-a-Service
  61. 61. 61 Metrics: Push Metrics HTTP JSON HTTP Trap
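
When the workload is an ephemeral container or a function, there may be no stable address for a poller to find, so the push model returns: the process serializes its current metrics as JSON and submits them to an HTTP trap. A rough sketch of that submission in Go; the trap URL is a placeholder and the payload schema depends on the trap implementation (a Circonus HTTPTrap check, for example):

package main

import (
	"bytes"
	"encoding/json"
	"log"
	"net/http"
)

func main() {
	// Current metric values; a real service would sample these on a timer.
	payload := map[string]interface{}{
		"requests_2xx": 1042,
		"queue_depth":  7,
	}

	body, err := json.Marshal(payload)
	if err != nil {
		log.Fatal(err)
	}

	// Placeholder URL: substitute the submission URL of your trap endpoint.
	trapURL := "https://trap.example.com/module/httptrap/UUID/secret"

	resp, err := http.Post(trapURL, "application/json", bytes.NewReader(body))
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()
	log.Println("trap responded:", resp.Status)
}
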
  62. 62. 62 Metric Types
  63. 63. 63 Metrics: Counter •Counter - Monotonic Number •Bytes transmitted •Number of 2XX requests
  64. 64. 64 Metrics: Gauge •Counter - Monotonic Number •Bytes transmitted •Number of 2XX requests •Gauge - Non-monotonic number •Load average •Number of services in a critical state
  65. 65. 65 Metrics: Gauge
  66. 66. 66 Metrics: Histogram •Counter - Monotonic Number •Bytes transmitted •Number of 2XX requests •Gauge - Non-monotonic number •Load average •Number of services in a critical state •Histograms - Distribution of Streams of Values •Latency of an individual request •Disk IO latency •Bytes per response
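
A compact way to keep the three types straight is to see what each one stores. The sketch below is illustrative Go rather than any particular metrics library: a counter only ever increases, a gauge is overwritten with the current reading, and a histogram keeps the whole distribution so tail percentiles can be computed:

package main

import (
	"fmt"
	"math"
	"sort"
	"sync/atomic"
)

// Counter: a monotonic number, e.g. bytes transmitted or count of 2XX responses.
var bytesTx uint64

// Gauge: a non-monotonic number that moves both ways, e.g. load average or queue depth.
var queueDepth int64

// Histogram: the distribution of a stream of values, e.g. per-request latency.
type histogram struct{ samples []float64 }

func (h *histogram) record(v float64) { h.samples = append(h.samples, v) }

// percentile returns the nearest-rank value at quantile q (0 < q <= 1).
func (h *histogram) percentile(q float64) float64 {
	s := append([]float64(nil), h.samples...)
	sort.Float64s(s)
	idx := int(math.Ceil(q*float64(len(s)))) - 1
	if idx < 0 {
		idx = 0
	}
	return s[idx]
}

func main() {
	atomic.AddUint64(&bytesTx, 1500)   // counters only ever increase
	atomic.StoreInt64(&queueDepth, 12) // gauges are set to whatever is true right now
	atomic.StoreInt64(&queueDepth, 3)

	var latency histogram
	for _, ms := range []float64{3, 4, 4, 5, 5, 6, 7, 9, 11, 250} {
		latency.record(ms)
	}
	// The average hides the one slow request; the tail percentile does not.
	fmt.Printf("p50=%.0fms p99=%.0fms\n", latency.percentile(0.5), latency.percentile(0.99))
}

With these ten samples the mean is about 30 ms while p99 is 250 ms, which is exactly the long-tail gap the next slides illustrate.
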
  67. 67. 67 Metrics: Concepts •Interval - How often a metric is polled •Samples - Per Interval
  68. 68. 68 Metrics: Averages
  69. 69. 69 Metrics: Long Tail
  70. 70. 70 Operations Life Cycle: Complacency Edition Response Recovery Prevention Preparation Steady-State Undetected Problem Blissful Ignorance Quiet Before the Storm
  71. 71. 71 Metrics: Long Tail
  72. 72. 72 Metrics: Banded Latencies
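
Banded latencies are the precursor to the heat maps a few slides later: each collection interval's samples are counted into latency bands, and the per-band counts are what get stored and plotted. A toy Go sketch with arbitrary band boundaries (a production histogram such as Circonus's uses log-linear bins instead):

package main

import "fmt"

// band returns an upper-bound label for the latency band a sample falls in.
// The boundaries here are arbitrary and chosen only for illustration.
func band(ms float64) string {
	for _, b := range []float64{1, 5, 10, 50, 100, 500, 1000} {
		if ms <= b {
			return fmt.Sprintf("<=%gms", b)
		}
	}
	return ">1000ms"
}

func main() {
	samples := []float64{3, 4, 4, 5, 5, 6, 7, 9, 11, 250}

	counts := map[string]int{}
	for _, ms := range samples {
		counts[band(ms)]++
	}

	// Walk the bands in order; one interval's per-band counts become one
	// column of a heat map when plotted over time.
	for _, label := range []string{"<=1ms", "<=5ms", "<=10ms", "<=50ms", "<=100ms", "<=500ms", "<=1000ms", ">1000ms"} {
		fmt.Printf("%-9s %d\n", label, counts[label])
	}
}
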
  73. 73. 73 Data Sizes to Problem Specificity. Chart axes: amount of data necessary to answer the question vs. scope or specificity of the question. Is there a problem? Where is the problem? What is the problem?
  74. 74. 74 Histograms
  75. 75. 75 Histograms
  76. 76. 76 Histograms
  77. 77. 77 Histograms
  78. 78. 78 Histograms
  79. 79. 79 Histograms
  80. 80. 80 Metrics: Histogram Heat Map
  81. 81. 81 Metrics: Long Tail Alert Because Something Happened Out Here Don't Celebrate This Success
  82. 82. 82 Why is this hard?
  83. 83. Why is this hard? Milliseconds
  84. 84. $ nomad status atlas-4119-b246fd8fa2 ID = atlas-4119-b246fd8fa2 Name = atlas-4119 Type = service Priority = 50 Datacenters = dc1 Status = running Periodic = false Parameterized = false Summary Task Group Queued Starting Running Failed Complete Lost console 0 0 1 0 0 0 frontend 0 0 2 0 0 0 worker 0 0 1 0 0 0 Allocations ID Eval ID Node ID Task Group Desired Status Created At 24e12544 9fedfef9 b7d7483e console run running 01/25/17 23:14:28 UTC 87f46c82 9fedfef9 d6b60eb1 worker run running 01/25/17 23:14:28 UTC d5ea84f2 9fedfef9 70ba3d96 frontend run running 01/25/17 23:14:28 UTC eff8882a 9fedfef9 bbb7b28f frontend run running 01/25/17 23:14:28 UTC WTF?
  85. 85. $ nomad status atlas-4119-b246fd8fa2 ID = atlas-4119-b246fd8fa2 Name = atlas-4119 Type = service Priority = 50 Datacenters = dc1 Status = running Periodic = false Parameterized = false Summary Task Group Queued Starting Running Failed Complete Lost console 0 0 1 0 0 0 frontend 0 0 2 0 0 0 worker 0 0 1 0 0 0 Allocations ID Eval ID Node ID Task Group Desired Status Created At 24e12544 9fedfef9 b7d7483e console run running 01/25/17 23:14:28 UTC 87f46c82 9fedfef9 d6b60eb1 worker run running 01/25/17 23:14:28 UTC d5ea84f2 9fedfef9 70ba3d96 frontend run running 01/25/17 23:14:28 UTC eff8882a 9fedfef9 bbb7b28f frontend run running 01/25/17 23:14:28 UTC WTF?
  86. 86. $ nomad alloc-status 87f46c82 ID = 87f46c82 Eval ID = 9fedfef9 Name = atlas-4119.worker[0] Node ID = d6b60eb1 Job ID = atlas-4119-b246fd8fa2 Client Status = running Client Description = <none> Desired Status = run Desired Description = <none> Created At = 01/25/17 23:14:28 UTC Task "worker" is "running" Task Resources CPU Memory Disk IOPS Addresses 47/256 MHz 218 MiB/2.0 GiB 0 B 0 Recent Events: Time Type Description 01/25/17 23:19:36 UTC Started Task started by client 01/25/17 23:14:28 UTC Downloading Artifacts Client is downloading artifacts 01/25/17 23:14:28 UTC Received Task received by client
  87. 87. $ nomad alloc-status d5ea84f2 ID = d5ea84f2 Eval ID = 9fedfef9 Name = atlas-4119.frontend[1] Node ID = 70ba3d96 Job ID = atlas-4119-b246fd8fa2 Client Status = running Client Description = <none> Desired Status = run Desired Description = <none> Created At = 01/25/17 23:14:28 UTC Task "frontend" is "running" Task Resources CPU Memory Disk IOPS Addresses 370/1024 MHz 673 MiB/2.0 GiB 0 B 0 atlasfrontend: 10.151.2.227:80 Recent Events: Time Type Description 01/25/17 23:19:18 UTC Started Task started by client 01/25/17 23:14:28 UTC Downloading Artifacts Client is downloading artifacts 01/25/17 23:14:28 UTC Received Task received by client NOT STATIC
  88. 88. 88 Parting Thoughts
  89. 89. # Terraform and Circonus to the rescue module "atlas" { source = "../modules/atlas" environment = "staging" }
  90. 90. % cat ../modules/atlas/interface.tf variable "atlas-worker-tags" { type = "list" default = [ "app:atlas", "app:atlas-worker", "source:nomad" ] } variable "environment" { type = "string" }
  91. 91. module "atlas-worker-job" { source = "../nomad-job" environment = "${var.environment}" human_name = "Atlas Worker" job_name = "atlas" task_group = "worker" job_tags = [ "app:atlas", "app:atlas-worker" ] }
  92. 92. % cat ../modules/nomad-job/interface.tf # *-description's taken from https://www.nomadproject.io/docs/agent/telemetry.html variable "cpu-kernel-description" { type = "string" default = "Total CPU resources consumed by the task in the system space" } variable "cpu-throttled-periods-description" { type = "string" default = "Number of periods when the container hit its throttling limit (`nr_throttled`)" } variable "cpu-throttled-time-description" { type = "string" default = "Total time that the task was throttled (`throttled_time`)" } variable "cpu-total-percentage-description" { type = "string" default = "Total CPU resources consumed by the task across all cores" }
  93. 93. variable "cpu-total-ticks-description" { type = "string" default = "CPU ticks consumed by the process in the last collection interval" } variable "cpu-user-description" { type = "string" default = "An aggregation of all userland CPU usage for this Nomad job." } variable "environment" { type = "string" } variable "human_name" { description = "The human-friendly name for this job" type = "string" } variable "job_name" { type = "string" description = "The Nomad Job Name (or its prefix)" }
  94. 94. variable "job_tags" { type = "list" description = "Tags that should be added to this job's resources" } variable "memory-cache-description" { type = "string" default = "Amount of memory cached by the task" } variable "memory-kernel-usage-description" { type = "string" default = "Amount of memory used by the kernel for this task" } variable "memory-max-usage-description" { type = "string" default = "Maximum amount of memory ever used by the kernel for this task" } variable "memory-kernel-max-usage-description" { type = "string" default = "Maximum amount of memory ever used by the tasks in this job." }
  95. 95. variable "memory-rss-description" { type = "string" default = "An aggregation of all resident memory for this Nomad job." } variable "memory-swap-description" { type = "string" default = "Amount of memory swapped by the task" } variable "nomad-tags" { type = "list" default = [ "source:nomad" ] } variable "task_group" { type = "string" description = "The name of the task group" }
  96. 96. % cat ../modules/nomad-job/stream-groups.tf resource "circonus_stream_group" "cpu-kern" { name = "${var.human_name} CPU Kernel" description = "${var.cpu-kernel-description}" group { query = "*`${var.job_name}-${var.task_group}`cpu`system" type = "average" } tags = [ "${var.nomad-tags}", "${var.job_tags}", "resource:cpu", "use:utilization" ] # unit = "%" } resource "circonus_stream_group" "memory-rss" { name = "${var.human_name} Memory RSS" description = "${var.memory-rss-description}" group { query = "*`${var.job_name}-${var.task_group}`memory`rss" type = "average" } tags = [ "${var.nomad-tags}", "${var.job_tags}", "resource:memory", "use:utilization" ] }
  97. 97. resource "circonus_trigger" "rss-alarm" { check = "${circonus_check.usage.checks[0]}" stream_name = "${var.used_metric_name}" if { value { absent = "3600s" } then { notify = [ "${circonus_contact_group.circonus-owners-slack.id}", "${circonus_contact_group.circonus-owners-slack-escalation.id}", ] severity = 1 } } if { value { # SEV1 if we're over 4GB more = "${4 * 1024 * 1024 * 1024}" } ...
  98. 98. resource "circonus_contact_group" "job-owner-slack-escalation" { name = "${var.appname} Owners (${title(var.environment)} Slack Escalation)" slack { channel = "${var.alert_slack_escalate_channel_name}" team = "${var.alert_slack_team_id}" username = "Circonus" buttons = true } tags = [ "author:terraform", "environment:${var.environment}", "owner:${var.app-owner}", ] }
  99. 99. resource "circonus_contact_group" "app-owners-slack" { name = "${var.appname} Owners (${title(var.environment)} Slack)" slack { channel = "${var.alert_slack_channel_name}" team = "${var.alert_slack_team_id}" username = "Circonus" buttons = true } aggregation_window = "5m" alert_option { severity = 1 reminder = "15m" escalate_to = "${circonus_contact_group.app-owners-slack-escalation.id}" escalate_after = "1h" } alert_option { severity = 2 reminder = "1h" escalate_to = "${circonus_contact_group.app-owners-slack-escalation.id}" escalate_after = "6h" }
  100. 100. Why?
  101. 101. 106 Parting Thoughts •Be an engineer. Put rigid constraints around your app. •Don't confuse static with rigid. •Work top to bottom. •Develop an error budget and prioritize. •Be consistent in your observability regimen.
  102. 102. 107 Parting Thoughts •Expose HTTP Endpoints for stats (both monotonic counters and gauges) •Trap Metrics to a broker frequently to create a histogram (e.g. 100ms) •Expose or export JSON Histograms •Valuable metrics tend to record the behavior of edges, not vertices
  103. 103. 108 Parting Thoughts
  104. 104. 109 Demo Time
