
Life Cycle of Metrics, Alerting, and Performance Monitoring in Microservices

Slides from my MonitoringSF meetup presentation: the why and how of achieving sane monitoring of microservices using Nomad, Consul, and Terraform.



  1. 1. Life Cycle of Metrics, Alerting, and Performance Monitoring in microservices Good Bad Ugly
  2. 2. 2 Operations Life Cycle
  3. 3. 3 Operations Life Cycle: Incident Incident
  4. 4. 4 Operations Life Cycle: Incident Management •Monitoring •Diagnosis •Escalation •Remediation •Communication Incident Response
  5. 5. 5 Operations Life Cycle: Resolution Response •Monitoring •Diagnosis •Escalation •Remediation •Communication Resolution
  6. 6. 6 Operations Life Cycle: Recovery Response •Monitoring •Diagnosis •Escalation •Remediation •Communication • Investigation • Root Cause Analysis • Incident RCA • Problem RCA • Incident Review/Post-Mortem • Identification of Action Items • Lessons Learned Resolution Recovery
  7. 7. 7 Operations Life Cycle: Prevention Response Recovery Prevention •Monitoring •Diagnosis •Escalation •Remediation •Communication • Investigation • Root Cause Analysis • Incident RCA • Problem RCA • Incident Review/Post-Mortem • Identification of Action Items • Lessons Learned •Documentation Development •Training •Risk Mitigation •Execution of AIs from prior incidents •Production Readiness Reviews
  8. 8. 8 Operations Life Cycle: Preparation Response Recovery Prevention Preparation •Monitoring •Diagnosis •Escalation •Remediation •Communication • Investigation • Root Cause Analysis • Incident RCA • Problem RCA • Incident Review/Post-Mortem • Identification of Action Items • Lessons Learned •Documentation Development •Training •Risk Mitigation •Execution of AIs from prior incidents •Production Readiness Reviews •System Development •Risk Identification •Monitoring Systems •Tactical, Operational, Strategic KPIs •Identify Meaningful KPIs •KPIs to Notification and Escalation Matrix •Architecture Review
  9. 9. 9 Operations Life Cycle: Complacency Edition Response Recovery Prevention Preparation "Stable"
  10. 10. 10 Operations Life Cycle: Complacency Edition Response Recovery Prevention Preparation Steady-State Undetected Problem Blissful Ignorance Quiet Before the Storm
  11. 11. 11 Situational Awareness: Why
  12. 12. 12 Idea! Software Life Cycle: IDEA!
  13. 13. 13 Idea! Software Life Cycle: But it's not a bright idea, yet
  14. 14. 14 Software Life Cycle: Development Begins Time Prod 1) Idea! R&D
  15. 15. 15 Software Life Cycle: Development Begins Time Prod 1) Idea! R&D Forever
  16. 16. 16 Software Life Cycle: Development Begins Time Prod 1) Idea! R&D Forever (a.k.a. next week, month, or quarter)
  17. 17. 17 Software Life Cycle: Sprint to Prod! Time Prod 1) Idea! 2) Production Ready R&D
  18. 18. 18 Software Life Cycle: Knowledge of Actual Process Time Prod 1) Idea! 2) Production Ready R&D
  19. 19. 19 Software Life Cycle: Knowledge of Actual Process Time Prod 1) Idea! 2) Production Ready R&D
  20. 20. 20 Software Life Cycle: Wisdom Time Prod 1) Idea! 2) Production Ready R&D
  21. 21. 21 Software Life Cycle: Contrived Lifecycle Time Readiness 1) Idea! 2) Production Ready 3) End of Life 2.9) "It’ll be time to wind this service down when ___ happens and ___ comes online." R&D
  22. 22. 22 Software Life Cycle: Dose of Reality Time Production 1) Idea! 2) Production Ready 4) End of Life "Production Supported" 3) "Oops" R&D
  23. 23. 23 Software Life Cycle: Do NOT Pass Go, No $200 Time Production 1) Idea! N) End of Life "Production Supported" Forced to fix code or docs. R&D
  24. 24. 24 Software Life Cycle: Why the fails? Time Production 1) Idea! 2) Production Ready N) End of Life "Production Supported" "Drug feet to produce docs." [3,M) "Oops" R&D N-1) "That’s it, we’ve had enough…"
  25. 25. 25 Software Life Cycle Time Production 1) Idea! 2) Production Ready N) End of Life "Production Supported" [3,M) "Oops" R&D N-2) "That’s it, we’ve had enough…" N-1) "Just support it until the next version is out"
  26. 26. 26 Software Life Cycle: Detecting Problems Early Time Production 1) Idea! 2) Production Ready 4) End of Life "Production Supported" 3) "Oops" R&D WTB Alerting Here
  27. 27. 27 Metrics
  28. 28. 28 Metrics: Direction: Push statsd sink
  29. 29. 29 Metrics: Direction: Push statsd sink <metricname>:<value>|<type>
  30. 30. 30 Metrics: Direction: Push statsd sink <metricname>:<value>|<type> "Primitive"
  31. 31. 31 Metrics: Direction: Push statsd sink <metricname>:<value>|<type> Coordinated Endpoint for Firehose Data
  32. 32. 32 Metrics: Direction: Push statsd sink
  33. 33. 33 Metrics: Direction: Push statsd sink
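
The statsd line on the previous slides is the entire wire protocol: a plain-text datagram of the form <metricname>:<value>|<type> sent to the sink, usually over UDP. As a rough illustration (not from the slides; the sink address and metric names are assumptions), a few lines of Go are enough to push counters, gauges, and timers:

package main

import (
	"fmt"
	"log"
	"net"
)

// push sends one statsd-formatted metric (<metricname>:<value>|<type>)
// as a UDP datagram to a statsd sink.
func push(conn net.Conn, name string, value int, typ string) error {
	_, err := fmt.Fprintf(conn, "%s:%d|%s", name, value, typ)
	return err
}

func main() {
	// Assumed sink address; statsd conventionally listens on UDP 8125.
	conn, err := net.Dial("udp", "127.0.0.1:8125")
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()

	push(conn, "myapp.requests_2xx", 1, "c") // counter
	push(conn, "myapp.queue_depth", 42, "g") // gauge
	push(conn, "myapp.request_ms", 87, "ms") // timer
}
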
  34. 34. 34 Metrics: Direction: Poll http agent database
  35. 35. 35 Metrics: Direction: Poll http agent database
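
In the poll direction, an agent scrapes the service over HTTP on a fixed interval and writes the samples into a database. The service's half of that contract is just an endpoint that reports current values; a minimal sketch in Go, with the path /stats.json and the metric name chosen only for illustration:

package main

import (
	"encoding/json"
	"log"
	"net/http"
	"sync/atomic"
)

var requests2xx uint64 // monotonic counter, incremented by handlers

// statsHandler reports the current counter values as JSON for the poller.
func statsHandler(w http.ResponseWriter, r *http.Request) {
	w.Header().Set("Content-Type", "application/json")
	json.NewEncoder(w).Encode(map[string]uint64{
		"requests_2xx": atomic.LoadUint64(&requests2xx),
	})
}

func main() {
	http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		atomic.AddUint64(&requests2xx, 1)
		w.Write([]byte("ok"))
	})
	http.HandleFunc("/stats.json", statsHandler) // the agent polls this on its interval
	log.Fatal(http.ListenAndServe(":8080", nil))
}
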
  36. 36. 36 Metrics: Host Metrics HTTP JSON HTTP Trap "Feature Rich"
  37. 37. 37 Metrics: Host Metrics
  38. 38. 38 Metrics: Host Metrics HTTP JSON Broker (noit)
  39. 39. 39 Metrics: Alerting Pipeline HTTP JSON Broker (noit) stratcon Message Queue (fq) Rules Engine (ernie) Alerting (bert)
  40. 40. Nomad HASHICORP Cluster Manager Scheduler
  41. 41. Nomad HASHICORP Cluster Manager Scheduler
  42. 42. HASHICORP Schedulers map a set of work to a set of resources
  43. 43. HASHICORP CPU Scheduler Web Server -Thread 1 CPU - Core 1 CPU - Core 2 Web Server -Thread 2 Redis -Thread 1 Kernel -Thread 1 Work (Input) Resources CPU Scheduler
  44. 44. HASHICORP CPU Scheduler Web Server -Thread 1 CPU - Core 1 CPU - Core 2 Web Server -Thread 2 Redis -Thread 1 Kernel -Thread 1 Work (Input) Resources CPU Scheduler
  45. 45. HASHICORP Advantages Higher Resource Utilization Decouple Work from Resources Better Quality of Service
  46. 46. HASHICORP Advantages Bin Packing Over-Subscription Job Queueing Higher Resource Utilization Decouple Work from Resources Better Quality of Service
  47. 47. HASHICORP Advantages Abstraction API Contracts Standardization Higher Resource Utilization Decouple Work from Resources Better Quality of Service
  48. 48. HASHICORP Advantages Priorities Resource Isolation Pre-emption Higher Resource Utilization Decouple Work from Resources Better Quality of Service
  49. 49. Nomad HASHICORP
  50. 50. Nomad HASHICORP Cluster Scheduler Easily Deploy Applications Operationally Simple Built for Scale
  51. 51. job "redis" { datacenters = ["us-east-1"] task "redis" { driver = "docker" config { image = "redis:latest" } resources { cpu = 500 # Mhz memory = 256 # MB network { mbits = 10 dynamic_ports = ["redis"] } } } } example.nomad
  52. 52. HASHICORP Job Specification Declares what to run
  53. 53. HASHICORP Job Specification Nomad determines where and manages how to run
  54. 54. HASHICORP Job Specification Nomad abstracts work from resources
  55. 55. job "my-app" { … task "my-app" { ephemeral_disk { sticky = true } } } example.nomad
  56. 56. HASHICORP Moves data between tasks on the same machine
  57. 57. HASHICORP Copies data between tasks on different machines
  58. 58. 58 Why is this more difficult?
  59. 59. 59 Metrics: Direction: Poll http agent database Static Endpoint Ephemeral Containers
  60. 60. 60 Metrics: Direction: Poll http agent database Static Endpoint Ephemeral Containers Function-as-a-Service
  61. 61. 61 Metrics: Push Metrics HTTP JSON HTTP Trap
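
When the workload is an ephemeral container or a function, there may be no stable address for a poller to find, so the push model returns: the process serializes its current metrics as JSON and submits them to an HTTP trap. A rough sketch of that submission in Go; the trap URL is a placeholder and the payload schema depends on the trap implementation (a Circonus HTTPTrap check, for example):

package main

import (
	"bytes"
	"encoding/json"
	"log"
	"net/http"
)

func main() {
	// Current metric values; a real service would sample these on a timer.
	payload := map[string]interface{}{
		"requests_2xx": 1042,
		"queue_depth":  7,
	}

	body, err := json.Marshal(payload)
	if err != nil {
		log.Fatal(err)
	}

	// Placeholder URL: substitute the submission URL of your trap endpoint.
	trapURL := "https://trap.example.com/module/httptrap/UUID/secret"

	resp, err := http.Post(trapURL, "application/json", bytes.NewReader(body))
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()
	log.Println("trap responded:", resp.Status)
}
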
  62. 62. 62 Metric Types
  63. 63. 63 Metrics: Counter •Counter - Monotonic Number •Bytes transmitted •Number of 2XX requests
  64. 64. 64 Metrics: Gauge •Counter - Monotonic Number •Bytes transmitted •Number of 2XX requests •Gauge - Non-monotonic number •Load average •Number of services in a critical state
  65. 65. 65 Metrics: Gauge
  66. 66. 66 Metrics: Histogram •Counter - Monotonic Number •Bytes transmitted •Number of 2XX requests •Gauge - Non-monotonic number •Load average •Number of services in a critical state •Histograms - Distribution of Streams of Values •Latency of an individual request •Disk IO latency •Bytes per response
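
A compact way to keep the three types straight is to see what each one stores. The sketch below is illustrative Go rather than any particular metrics library: a counter only ever increases, a gauge is overwritten with the current reading, and a histogram keeps the whole distribution so tail percentiles can be computed:

package main

import (
	"fmt"
	"math"
	"sort"
	"sync/atomic"
)

// Counter: a monotonic number, e.g. bytes transmitted or count of 2XX responses.
var bytesTx uint64

// Gauge: a non-monotonic number that moves both ways, e.g. load average or queue depth.
var queueDepth int64

// Histogram: the distribution of a stream of values, e.g. per-request latency.
type histogram struct{ samples []float64 }

func (h *histogram) record(v float64) { h.samples = append(h.samples, v) }

// percentile returns the nearest-rank value at quantile q (0 < q <= 1).
func (h *histogram) percentile(q float64) float64 {
	s := append([]float64(nil), h.samples...)
	sort.Float64s(s)
	idx := int(math.Ceil(q*float64(len(s)))) - 1
	if idx < 0 {
		idx = 0
	}
	return s[idx]
}

func main() {
	atomic.AddUint64(&bytesTx, 1500)   // counters only ever increase
	atomic.StoreInt64(&queueDepth, 12) // gauges are set to whatever is true right now
	atomic.StoreInt64(&queueDepth, 3)

	var latency histogram
	for _, ms := range []float64{3, 4, 4, 5, 5, 6, 7, 9, 11, 250} {
		latency.record(ms)
	}
	// The average hides the one slow request; the tail percentile does not.
	fmt.Printf("p50=%.0fms p99=%.0fms\n", latency.percentile(0.5), latency.percentile(0.99))
}

With these ten samples the mean is about 30 ms while p99 is 250 ms, which is exactly the long-tail gap the next slides illustrate.
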
  67. 67. 67 Metrics: Concepts •Interval - How often a metric is polled •Samples - Per Interval
  68. 68. 68 Metrics: Averages
  69. 69. 69 Metrics: Long Tail
  70. 70. 70 Operations Life Cycle: Complacency Edition Response Recovery Prevention Preparation Steady-State Undetected Problem Blissful Ignorance Quiet Before the Storm
  71. 71. 71 Metrics: Long Tail
  72. 72. 72 Metrics: Banded Latencies
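
Banded latencies are the precursor to the heat maps a few slides later: each collection interval's samples are counted into latency bands, and the per-band counts are what get stored and plotted. A toy Go sketch with arbitrary band boundaries (a production histogram such as Circonus's uses log-linear bins instead):

package main

import "fmt"

// band returns an upper-bound label for the latency band a sample falls in.
// The boundaries here are arbitrary and chosen only for illustration.
func band(ms float64) string {
	for _, b := range []float64{1, 5, 10, 50, 100, 500, 1000} {
		if ms <= b {
			return fmt.Sprintf("<=%gms", b)
		}
	}
	return ">1000ms"
}

func main() {
	samples := []float64{3, 4, 4, 5, 5, 6, 7, 9, 11, 250}

	counts := map[string]int{}
	for _, ms := range samples {
		counts[band(ms)]++
	}

	// Walk the bands in order; one interval's per-band counts become one
	// column of a heat map when plotted over time.
	for _, label := range []string{"<=1ms", "<=5ms", "<=10ms", "<=50ms", "<=100ms", "<=500ms", "<=1000ms", ">1000ms"} {
		fmt.Printf("%-9s %d\n", label, counts[label])
	}
}
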
  73. 73. 73 Data Sizes to Problem Specificity. Chart axes: amount of data necessary to answer the question vs. scope or specificity of the question. Is there a problem? Where is the problem? What is the problem?
  74. 74. 74 Histograms
  75. 75. 75 Histograms
  76. 76. 76 Histograms
  77. 77. 77 Histograms
  78. 78. 78 Histograms
  79. 79. 79 Histograms
  80. 80. 80 Metrics: Histogram Heat Map
  81. 81. 81 Metrics: Long Tail Alert Because Something Happened Out Here Don't Celebrate This Success
  82. 82. 82 Why is this hard?
  83. 83. Why is this hard? Milliseconds
  84. 84. $ nomad status atlas-4119-b246fd8fa2 ID = atlas-4119-b246fd8fa2 Name = atlas-4119 Type = service Priority = 50 Datacenters = dc1 Status = running Periodic = false Parameterized = false Summary Task Group Queued Starting Running Failed Complete Lost console 0 0 1 0 0 0 frontend 0 0 2 0 0 0 worker 0 0 1 0 0 0 Allocations ID Eval ID Node ID Task Group Desired Status Created At 24e12544 9fedfef9 b7d7483e console run running 01/25/17 23:14:28 UTC 87f46c82 9fedfef9 d6b60eb1 worker run running 01/25/17 23:14:28 UTC d5ea84f2 9fedfef9 70ba3d96 frontend run running 01/25/17 23:14:28 UTC eff8882a 9fedfef9 bbb7b28f frontend run running 01/25/17 23:14:28 UTC WTF?
  85. 85. $ nomad status atlas-4119-b246fd8fa2 ID = atlas-4119-b246fd8fa2 Name = atlas-4119 Type = service Priority = 50 Datacenters = dc1 Status = running Periodic = false Parameterized = false Summary Task Group Queued Starting Running Failed Complete Lost console 0 0 1 0 0 0 frontend 0 0 2 0 0 0 worker 0 0 1 0 0 0 Allocations ID Eval ID Node ID Task Group Desired Status Created At 24e12544 9fedfef9 b7d7483e console run running 01/25/17 23:14:28 UTC 87f46c82 9fedfef9 d6b60eb1 worker run running 01/25/17 23:14:28 UTC d5ea84f2 9fedfef9 70ba3d96 frontend run running 01/25/17 23:14:28 UTC eff8882a 9fedfef9 bbb7b28f frontend run running 01/25/17 23:14:28 UTC WTF?
  86. 86. $ nomad alloc-status 87f46c82 ID = 87f46c82 Eval ID = 9fedfef9 Name = atlas-4119.worker[0] Node ID = d6b60eb1 Job ID = atlas-4119-b246fd8fa2 Client Status = running Client Description = <none> Desired Status = run Desired Description = <none> Created At = 01/25/17 23:14:28 UTC Task "worker" is "running" Task Resources CPU Memory Disk IOPS Addresses 47/256 MHz 218 MiB/2.0 GiB 0 B 0 Recent Events: Time Type Description 01/25/17 23:19:36 UTC Started Task started by client 01/25/17 23:14:28 UTC Downloading Artifacts Client is downloading artifacts 01/25/17 23:14:28 UTC Received Task received by client
  87. 87. $ nomad alloc-status d5ea84f2 ID = d5ea84f2 Eval ID = 9fedfef9 Name = atlas-4119.frontend[1] Node ID = 70ba3d96 Job ID = atlas-4119-b246fd8fa2 Client Status = running Client Description = <none> Desired Status = run Desired Description = <none> Created At = 01/25/17 23:14:28 UTC Task "frontend" is "running" Task Resources CPU Memory Disk IOPS Addresses 370/1024 MHz 673 MiB/2.0 GiB 0 B 0 atlasfrontend: 10.151.2.227:80 Recent Events: Time Type Description 01/25/17 23:19:18 UTC Started Task started by client 01/25/17 23:14:28 UTC Downloading Artifacts Client is downloading artifacts 01/25/17 23:14:28 UTC Received Task received by client NOT STATIC
  88. 88. 88 Parting Thoughts
  89. 89. # Terraform and Circonus to the rescue module "atlas" { source = "../modules/atlas" environment = "staging" }
  90. 90. % cat ../modules/atlas/interface.tf variable "atlas-worker-tags" { type = "list" default = [ "app:atlas", "app:atlas-worker", "source:nomad" ] } variable "environment" { type = "string" }
  91. 91. module "atlas-worker-job" { source = "../nomad-job" environment = "${var.environment}" human_name = "Atlas Worker" job_name = "atlas" task_group = "worker" job_tags = [ "app:atlas", "app:atlas-worker" ] }
  92. 92. % cat ../modules/nomad-job/interface.tf # *-description's taken from https://www.nomadproject.io/docs/agent/telemetry.html variable "cpu-kernel-description" { type = "string" default = "Total CPU resources consumed by the task in the system space" } variable "cpu-throttled-periods-description" { type = "string" default = "Number of periods when the container hit its throttling limit (`nr_throttled`)" } variable "cpu-throttled-time-description" { type = "string" default = "Total time that the task was throttled (`throttled_time`)" } variable "cpu-total-percentage-description" { type = "string" default = "Total CPU resources consumed by the task across all cores" }
  93. 93. variable "cpu-total-ticks-description" { type = "string" default = "CPU ticks consumed by the process in the last collection interval" } variable "cpu-user-description" { type = "string" default = "An aggregation of all userland CPU usage for this Nomad job." } variable "environment" { type = "string" } variable "human_name" { description = "The human-friendly name for this job" type = "string" } variable "job_name" { type = "string" description = "The Nomad Job Name (or its prefix)" }
  94. 94. variable "job_tags" { type = "list" description = "Tags that should be added to this job's resources" } variable "memory-cache-description" { type = "string" default = "Amount of memory cached by the task" } variable "memory-kernel-usage-description" { type = "string" default = "Amount of memory used by the kernel for this task" } variable "memory-max-usage-description" { type = "string" default = "Maximum amount of memory ever used by the kernel for this task" } variable "memory-kernel-max-usage-description" { type = "string" default = "Maximum amount of memory ever used by the tasks in this job." }
  95. 95. variable "memory-rss-description" { type = "string" default = "An aggregation of all resident memory for this Nomad job." } variable "memory-swap-description" { type = "string" default = "Amount of memory swapped by the task" } variable "nomad-tags" { type = "list" default = [ "source:nomad" ] } variable "task_group" { type = "string" description = "The name of the task group" }
  96. 96. % cat ../modules/nomad-job/stream-groups.tf resource "circonus_stream_group" "cpu-kern" { name = "${var.human_name} CPU Kernel" description = "${var.cpu-kernel-description}" group { query = "*`${var.job_name}-${var.task_group}`cpu`system" type = "average" } tags = [ "${var.nomad-tags}", "${var.job_tags}", "resource:cpu", "use:utilization" ] # unit = "%" } resource "circonus_stream_group" "memory-rss" { name = "${var.human_name} Memory RSS" description = "${var.memory-rss-description}" group { query = "*`${var.job_name}-${var.task_group}`memory`rss" type = "average" } tags = [ "${var.nomad-tags}", "${var.job_tags}", "resource:memory", "use:utilization" ] }
  97. 97. resource "circonus_trigger" "rss-alarm" { check = "${circonus_check.usage.checks[0]}" stream_name = "${var.used_metric_name}" if { value { absent = "3600s" } then { notify = [ "${circonus_contact_group.circonus-owners-slack.id}", "${circonus_contact_group.circonus-owners-slack-escalation.id}", ] severity = 1 } } if { value { # SEV1 if we're over 4GB more = "${4 * 1024 * 1024 * 1024}" } ...
  98. 98. resource "circonus_contact_group" "job-owner-slack-escalation" { name = "${var.appname} Owners (${title(var.environment)} Slack Escalation)" slack { channel = "${var.alert_slack_escalate_channel_name}" team = "${var.alert_slack_team_id}" username = "Circonus" buttons = true } tags = [ "author:terraform", "environment:${var.environment}", "owner:${var.app-owner}", ] }
  99. 99. resource "circonus_contact_group" "app-owners-slack" { name = "${var.appname} Owners (${title(var.environment)} Slack)" slack { channel = "${var.alert_slack_channel_name}" team = "${var.alert_slack_team_id}" username = "Circonus" buttons = true } aggregation_window = "5m" alert_option { severity = 1 reminder = "15m" escalate_to = "${circonus_contact_group.app-owners-slack-escalation.id}" escalate_after = "1h" } alert_option { severity = 2 reminder = "1h" escalate_to = "${circonus_contact_group.app-owners-slack-escalation.id}" escalate_after = "6h" }
  100. 100. Why?
  101. 101. 106 Parting Thoughts •Be an engineer. Put rigid constraints around your app. •Don't confuse static with rigid. •Work top to bottom. •Develop an error budget and prioritize. •Be consistent in your observability regimen.
  102. 102. 107 Parting Thoughts •Expose HTTP Endpoints for stats (both monotonic counters and gauges) •Trap Metrics to a broker frequently to create a histogram (e.g. 100ms) •Expose or export JSON Histograms •Valuable metrics tend to record the behavior of edges, not vertices
  103. 103. 108 Parting Thoughts
  104. 104. 109 Demo Time
