21. 21
Software Life Cycle: Contrived Lifecycle
Time
Readiness
1) Idea!
2) Production Ready
3) End of Life
2.9) "It'll be time to wind this service down
when ___ happens and ___ comes online."
R&D
22. 22
Software Life Cycle: Dose of Reality
Time
Production
1) Idea!
2) Production Ready
4) End of Life
"Production Supported"
3) "Oops"
R&D
23. 23
Software Life Cycle: Do NOT Pass Go, No $200
Time
Production
1) Idea!
N) End of Life
"Production Supported"
Forced to fix code or docs.
R&D
24. 24
Software Life Cycle: Why the fails?
Time
Production
1) Idea!
2) Production Ready
N) End of Life
"Production Supported"
"Dragged feet to produce docs."
[3,M) "Oops"
R&D
N-1) "That's it, we've had enough…"
25. 25
Software Life Cycle
Time
Production
1) Idea!
2) Production Ready
N) End of Life
"Production Supported"
[3,M) "Oops"
R&D
N-2) "That's it, we've had enough…"
N-1) "Just support it until
the next version is out"
26. 26
Software Life Cycle: Detecting Problems Early
Time
Production
1) Idea!
2) Production Ready
4) End of Life
"Production Supported"
3) "Oops"
R&D
WTB Alerting Here
43. HASHICORP
CPU Scheduler
Web Server -Thread 1
CPU - Core 1
CPU - Core 2
Web Server -Thread 2
Redis -Thread 1
Kernel -Thread 1
Work (Input) Resources
CPU
Scheduler
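The diagram above (work in, scheduler in the middle, cores out) can be sketched as a toy simulation. This is a deliberately naive round-robin, not how any real kernel scheduler behaves; the thread names mirror the slide, and the function name `schedule` and the tick model are assumptions made for illustration.

```python
from collections import deque


def schedule(threads, cores, ticks):
    """Toy round-robin: on each tick, hand every core the next runnable thread."""
    run_queue = deque(threads)
    timeline = []
    for _ in range(ticks):
        # Take at most one thread per core off the front of the run queue.
        running = [run_queue.popleft() for _ in range(min(cores, len(run_queue)))]
        timeline.append(running)
        # Preempted threads rejoin the back of the queue.
        run_queue.extend(running)
    return timeline


threads = ["web-1", "web-2", "redis-1", "kernel-1"]
timeline = schedule(threads, cores=2, ticks=4)
for tick, running in enumerate(timeline):
    waiting = [t for t in threads if t not in running]
    print(f"tick {tick}: running={running} waiting={waiting}")
```

With four runnable threads and two cores, half of the work is always waiting, which is the kind of saturation a "running" status alone does not reveal.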
64. 64
Metrics: Gauge
• Counter - Monotonic number
  • Bytes transmitted
  • Number of 2XX requests
• Gauge - Non-monotonic number
  • Load average
  • Number of services in a critical state
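A minimal sketch of the two metric types above, assuming nothing about any particular metrics library: the `Counter` and `Gauge` classes and the `rate_per_second` helper are hypothetical names. A counter is usually read as a rate between two samples, while a gauge is read as its latest value.

```python
class Counter:
    """Monotonic: only ever goes up (until the process restarts)."""

    def __init__(self):
        self.value = 0

    def inc(self, amount=1):
        if amount < 0:
            raise ValueError("counters never decrease")
        self.value += amount


class Gauge:
    """Non-monotonic: a point-in-time reading that can move either way."""

    def __init__(self):
        self.value = 0.0

    def set(self, value):
        self.value = value


def rate_per_second(previous, current, elapsed_seconds):
    """Turn two counter samples into a rate."""
    return (current - previous) / elapsed_seconds


bytes_transmitted = Counter()
before = bytes_transmitted.value
bytes_transmitted.inc(1500)
bytes_transmitted.inc(500)
throughput = rate_per_second(before, bytes_transmitted.value, elapsed_seconds=10)

load_average = Gauge()
load_average.set(1.5)
load_average.set(0.7)  # a gauge may go down; a counter may not

print(throughput, load_average.value)
```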
66. 66
Metrics: Histogram
• Counter - Monotonic number
  • Bytes transmitted
  • Number of 2XX requests
• Gauge - Non-monotonic number
  • Load average
  • Number of services in a critical state
• Histograms - Distribution of streams of values
  • Latency of an individual request
  • Disk IO latency
  • Bytes per response
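To make the histogram type concrete, here is a minimal fixed-bucket sketch. The `Histogram` class, the bucket bounds, and the sample latencies are all made up for illustration; real systems often use log-linear buckets rather than a hand-picked list.

```python
import bisect
import math


class Histogram:
    """Counts observations per bucket; bucket i holds values <= bounds[i]."""

    def __init__(self, bounds):
        self.bounds = sorted(bounds)
        # One extra bucket catches everything above the last bound.
        self.counts = [0] * (len(self.bounds) + 1)

    def observe(self, value):
        self.counts[bisect.bisect_left(self.bounds, value)] += 1

    def quantile_upper_bound(self, q):
        """Upper bound of the bucket that contains the q-th quantile."""
        total = sum(self.counts)
        if total == 0:
            raise ValueError("no observations recorded")
        target = q * total
        seen = 0
        for index, count in enumerate(self.counts):
            seen += count
            if seen >= target:
                if index < len(self.bounds):
                    return self.bounds[index]
                return math.inf  # fell into the overflow bucket


latency_ms = Histogram(bounds=[10, 50, 100])
for sample in [3, 7, 12, 48, 51, 250]:  # hypothetical request latencies
    latency_ms.observe(sample)

print(latency_ms.counts)
print(latency_ms.quantile_upper_bound(0.5))
print(latency_ms.quantile_upper_bound(0.99))
```

Unlike a single gauge of "average latency", the bucket counts keep the slow outlier visible.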
73. 73
Data Sizes to Problem Specificity
[Figure: amount of data necessary to answer the question (y-axis) versus scope or specificity of the question (x-axis): "Is there a problem?", "Where is the problem?", "What is the problem?"]
84. $ nomad status atlas-4119-b246fd8fa2
ID            = atlas-4119-b246fd8fa2
Name          = atlas-4119
Type          = service
Priority      = 50
Datacenters   = dc1
Status        = running
Periodic      = false
Parameterized = false

Summary
Task Group  Queued  Starting  Running  Failed  Complete  Lost
console     0       0         1        0       0         0
frontend    0       0         2        0       0         0
worker      0       0         1        0       0         0

Allocations
ID        Eval ID   Node ID   Task Group  Desired  Status   Created At
24e12544  9fedfef9  b7d7483e  console     run      running  01/25/17 23:14:28 UTC
87f46c82  9fedfef9  d6b60eb1  worker      run      running  01/25/17 23:14:28 UTC
d5ea84f2  9fedfef9  70ba3d96  frontend    run      running  01/25/17 23:14:28 UTC
eff8882a  9fedfef9  bbb7b28f  frontend    run      running  01/25/17 23:14:28 UTC
WTF?
86. $ nomad alloc-status 87f46c82
ID                  = 87f46c82
Eval ID             = 9fedfef9
Name                = atlas-4119.worker[0]
Node ID             = d6b60eb1
Job ID              = atlas-4119-b246fd8fa2
Client Status       = running
Client Description  = <none>
Desired Status      = run
Desired Description = <none>
Created At          = 01/25/17 23:14:28 UTC

Task "worker" is "running"
Task Resources
CPU         Memory           Disk  IOPS  Addresses
47/256 MHz  218 MiB/2.0 GiB  0 B   0

Recent Events:
Time                   Type                   Description
01/25/17 23:19:36 UTC  Started                Task started by client
01/25/17 23:14:28 UTC  Downloading Artifacts  Client is downloading artifacts
01/25/17 23:14:28 UTC  Received               Task received by client
87. $ nomad alloc-status d5ea84f2
ID                  = d5ea84f2
Eval ID             = 9fedfef9
Name                = atlas-4119.frontend[1]
Node ID             = 70ba3d96
Job ID              = atlas-4119-b246fd8fa2
Client Status       = running
Client Description  = <none>
Desired Status      = run
Desired Description = <none>
Created At          = 01/25/17 23:14:28 UTC

Task "frontend" is "running"
Task Resources
CPU           Memory           Disk  IOPS  Addresses
370/1024 MHz  673 MiB/2.0 GiB  0 B   0     atlasfrontend: 10.151.2.227:80

Recent Events:
Time                   Type                   Description
01/25/17 23:19:18 UTC  Started                Task started by client
01/25/17 23:14:28 UTC  Downloading Artifacts  Client is downloading artifacts
01/25/17 23:14:28 UTC  Received               Task received by client
NOT STATIC
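The "used/limit" fields in the two alloc-status outputs above are point-in-time readings, so they are more useful as utilization ratios tracked over time. A small sketch of turning them into ratios follows; the parsing helpers (`parse_quantity`, `utilization`) are hypothetical and only handle the unit spellings that appear in the output above.

```python
import re

# Multipliers for the units seen in the output above; MHz is left unscaled.
UNITS = {"B": 1, "KiB": 1024, "MiB": 1024 ** 2, "GiB": 1024 ** 3, "MHz": 1}


def parse_quantity(text, fallback_unit=None):
    """Parse '218 MiB' or '47' into (value in base units, unit)."""
    match = re.fullmatch(r"([\d.]+)\s*([A-Za-z]+)?", text.strip())
    if match is None:
        raise ValueError(f"cannot parse quantity: {text!r}")
    number, unit = match.groups()
    unit = unit or fallback_unit
    return float(number) * UNITS[unit], unit


def utilization(field):
    """Turn 'used/limit' (e.g. '47/256 MHz') into a ratio between 0 and 1."""
    used_text, limit_text = field.split("/")
    limit, unit = parse_quantity(limit_text)
    # '47/256 MHz' only names the unit once, so the used side borrows it.
    used, _ = parse_quantity(used_text, fallback_unit=unit)
    return used / limit


readings = {
    "worker cpu": "47/256 MHz",
    "worker memory": "218 MiB/2.0 GiB",
    "frontend cpu": "370/1024 MHz",
    "frontend memory": "673 MiB/2.0 GiB",
}
for name, field in readings.items():
    print(f"{name}: {utilization(field):.1%}")
```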
92. % cat ../modules/nomad-job/interface.tf
# *-description's taken from https://www.nomadproject.io/docs/agent/telemetry.html
variable "cpu-kernel-description" {
  type    = "string"
  default = "Total CPU resources consumed by the task in the system space"
}

variable "cpu-throttled-periods-description" {
  type    = "string"
  default = "Number of periods when the container hit its throttling limit (`nr_throttled`)"
}

variable "cpu-throttled-time-description" {
  type    = "string"
  default = "Total time that the task was throttled (`throttled_time`)"
}

variable "cpu-total-percentage-description" {
  type    = "string"
  default = "Total CPU resources consumed by the task across all cores"
}
93.
variable "cpu-total-ticks-description" {
  type    = "string"
  default = "CPU ticks consumed by the process in the last collection interval"
}

variable "cpu-user-description" {
  type    = "string"
  default = "An aggregation of all userland CPU usage for this Nomad job."
}

variable "environment" {
  type = "string"
}

variable "human_name" {
  description = "The human-friendly name for this job"
  type        = "string"
}

variable "job_name" {
  type        = "string"
  description = "The Nomad Job Name (or its prefix)"
}
94.
variable "job_tags" {
  type        = "list"
  description = "Tags that should be added to this job's resources"
}

variable "memory-cache-description" {
  type    = "string"
  default = "Amount of memory cached by the task"
}

variable "memory-kernel-usage-description" {
  type    = "string"
  default = "Amount of memory used by the kernel for this task"
}

variable "memory-max-usage-description" {
  type    = "string"
  default = "Maximum amount of memory ever used by the tasks in this job."
}

variable "memory-kernel-max-usage-description" {
  type    = "string"
  default = "Maximum amount of memory ever used by the kernel for this task"
}
95.
variable "memory-rss-description" {
  type    = "string"
  default = "An aggregation of all resident memory for this Nomad job."
}

variable "memory-swap-description" {
  type    = "string"
  default = "Amount of memory swapped by the task"
}

variable "nomad-tags" {
  type    = "list"
  default = [ "source:nomad" ]
}

variable "task_group" {
  type        = "string"
  description = "The name of the task group"
}
96. % cat ../modules/nomad-job/stream-groups.tf
resource "circonus_stream_group" "cpu-kern" {
  name        = "${var.human_name} CPU Kernel"
  description = "${var.cpu-kernel-description}"

  group {
    query = "*`${var.job_name}-${var.task_group}`cpu`system"
    type  = "average"
  }

  tags = [ "${var.nomad-tags}", "${var.job_tags}", "resource:cpu", "use:utilization" ]
  # unit = "%"
}

resource "circonus_stream_group" "memory-rss" {
  name        = "${var.human_name} Memory RSS"
  description = "${var.memory-rss-description}"

  group {
    query = "*`${var.job_name}-${var.task_group}`memory`rss"
    type  = "average"
  }

  tags = [ "${var.nomad-tags}", "${var.job_tags}", "resource:memory", "use:utilization" ]
}
97.
resource "circonus_trigger" "rss-alarm" {
  check       = "${circonus_check.usage.checks[0]}"
  stream_name = "${var.used_metric_name}"

  if {
    value {
      absent = "3600s"
    }

    then {
      notify = [
        "${circonus_contact_group.circonus-owners-slack.id}",
        "${circonus_contact_group.circonus-owners-slack-escalation.id}",
      ]
      severity = 1
    }
  }

  if {
    value {
      # SEV1 if we're over 4GB
      more = "${4 * 1024 * 1024 * 1024}"
    }
...
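Read as plain logic, the two rules visible in the trigger above are: alert if the metric has been absent for an hour, and alert if RSS exceeds 4 * 1024^3 bytes. The sketch below paraphrases that logic only; it is not how Circonus evaluates rules, the function name `evaluate` is hypothetical, and the severity of the second rule is taken from the code comment because its `then` block is elided on the slide.

```python
from typing import Optional

ABSENT_AFTER_SECONDS = 3600             # absent = "3600s"
RSS_LIMIT_BYTES = 4 * 1024 * 1024 * 1024  # more = 4 * 1024 * 1024 * 1024


def evaluate(last_value: Optional[float], seconds_since_last_sample: float) -> Optional[int]:
    """Return the severity of the first rule that fires, or None if all is well."""
    # Rule 1: no data for an hour is itself a SEV1.
    if last_value is None or seconds_since_last_sample >= ABSENT_AFTER_SECONDS:
        return 1
    # Rule 2: strictly more than the limit (severity assumed from the comment).
    if last_value > RSS_LIMIT_BYTES:
        return 1
    return None


one_gib = 1024 ** 3
print(evaluate(1 * one_gib, seconds_since_last_sample=10))
print(evaluate(5 * one_gib, seconds_since_last_sample=10))
print(evaluate(None, seconds_since_last_sample=4000))
```

Treating missing data as an alert condition matters as much as the threshold: a task that stops reporting never crosses any limit.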
106. 106
Parting Thoughts
• Be an engineer. Put rigid constraints around your app.
• Don't confuse static with rigid.
• Work top to bottom.
• Develop an error budget and prioritize.
• Be consistent in your observability regimen.
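For the error-budget point above, the arithmetic is short enough to sketch. The 99.9% target, the 30-day window, and the 30 minutes of downtime are assumed numbers for illustration, not figures from the talk.

```python
def error_budget_minutes(slo, window_days):
    """Minutes of allowed unavailability for a given SLO over a window."""
    return (1.0 - slo) * window_days * 24 * 60


def budget_remaining(slo, window_days, downtime_minutes):
    """How much of the budget is left after the downtime spent so far."""
    return error_budget_minutes(slo, window_days) - downtime_minutes


budget = error_budget_minutes(slo=0.999, window_days=30)
remaining = budget_remaining(slo=0.999, window_days=30, downtime_minutes=30)
print(f"budget: {budget:.1f} min, remaining: {remaining:.1f} min")
```

A shrinking remainder is a signal to prioritize reliability work over new features.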
107. 107
Parting Thoughts
• Expose HTTP endpoints for stats (both monotonic counters and gauges)
• Trap metrics to a broker frequently (e.g., every 100 ms) to create a histogram
• Expose or export JSON histograms
• Valuable metrics tend to record the behavior of edges, not vertices
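The first three points above can be sketched with the Python standard library alone. Everything here is an assumption made for illustration: the `Stats` class, the `/stats` path, the metric names, the bucket bounds, and the JSON layout (which is not any vendor's histogram format). In a real service, a background thread would call `sample()` on a fixed cadence such as every 100 ms.

```python
import bisect
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


class Stats:
    """Holds one counter, one gauge, and a histogram of sampled gauge values."""

    def __init__(self, bucket_bounds):
        self._lock = threading.Lock()
        self.requests_total = 0  # monotonic counter
        self.queue_depth = 0     # gauge
        self.bounds = sorted(bucket_bounds)
        self.buckets = [0] * (len(self.bounds) + 1)

    def record_request(self):
        with self._lock:
            self.requests_total += 1

    def set_queue_depth(self, depth):
        with self._lock:
            self.queue_depth = depth

    def sample(self):
        """Fold the current gauge reading into the histogram."""
        with self._lock:
            self.buckets[bisect.bisect_left(self.bounds, self.queue_depth)] += 1

    def to_json(self):
        with self._lock:
            return json.dumps({
                "counters": {"requests_total": self.requests_total},
                "gauges": {"queue_depth": self.queue_depth},
                "histograms": {
                    "queue_depth": {"le": self.bounds, "counts": self.buckets},
                },
            })


def make_handler(stats):
    class StatsHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            if self.path != "/stats":
                self.send_error(404)
                return
            body = stats.to_json().encode("utf-8")
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

    return StatsHandler


def serve(stats, port=8080):
    """Blocks forever; call from a dedicated thread in a real service."""
    HTTPServer(("127.0.0.1", port), make_handler(stats)).serve_forever()


stats = Stats(bucket_bounds=[1, 5, 10])
stats.record_request()
stats.record_request()
stats.set_queue_depth(3)
stats.sample()
stats.set_queue_depth(12)
stats.sample()
print(stats.to_json())
```

Calling `serve(stats)` would expose the same JSON at `http://127.0.0.1:8080/stats` for a collector to scrape.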