Statemaps visualize system state over time to help understand performance issues. They show each entity's (for example, a process's) transitions between states. The statemap output is an interactive SVG that can be zoomed and filtered. Tagged, stacked, and coalesced statemaps add context and allow correlation across systems. Statemaps are meant to raise questions about performance that can then be answered through the system's observability tools.
2. The stack of abstraction
• Our software systems are built as stacks of abstraction
• These stacks allow us to stand on the shoulders of history — to
reuse components without rebuilding them
• We can do this because of the software paradox: software is
both information and machine, exhibiting properties of both
• Our stacks are higher and run deeper than we can see or know:
software is opaque; the nature of abstraction is to seal us from
what runs beneath!
3. Run silent, run deep
• Not only is the stack deep, it is silent
• Running software emits neither light nor heat; it makes no
sound; it attracts no mass; it (mostly) has no odor
• Running software is — by all conventional notions — unseeable
• This generally isn’t a bad thing, as long as it all works…
4. Hurricanes from butterflies
• When the stack of abstraction performs pathologically, its power
transmogrifies to peril: layering amplifies performance
pathologies but hinders insight
• Work amplifies as we go down the stack
• Latency amplifies as we go up the stack
• Seemingly minor issues in one layer can cascade into systemic
pathological performance…
• As the system becomes dominated by its outliers, butterflies
spawn hurricanes of pathological performance
5. Debugging the hurricanes
• Understanding a pathologically performing system is
excruciatingly difficult:
• Symptoms are often far removed from root cause
• There may not be a single root cause but several
• The system is dynamic and may change without warning
• Improvements to the system are hard to model and verify
• Emphatically, this is not “tuning” — it is debugging
6. How do we debug?
• To debug methodically, we must resist the temptation to quick
hypotheses, focusing rather on questions and observations
• Iterating between questions and observations gathers the facts
that will constrain future hypotheses
• These facts can be used to disconfirm hypotheses!
• How do we ask questions?
• How do we make observations?
7. Asking questions
• For performance debugging, the initial question formulation is
particularly challenging: where does one start?
• Resource-centric methodologies like the USE Method
(Utilization/Saturation/Errors) can be excellent starting points…
• But keep these methodologies in their context: they provide
initial questions to ask — they are not recipes for debugging
arbitrary performance pathologies!
8. Making observations
• Questions are answered through observation
• But — reminder! — software cannot be conventionally seen!
• It is up to the system itself to have the capacity to be seen
• This capacity is the system’s observability — and without it, we
are reduced to guessing
• Do not conflate software observability with control theory’s
definition of observability!
• Software is observable when it can answer your question about
its behavior — software observability is not a boolean!
9. The pillars of observability
• Much has been made of the so-called “pillars of observability”:
monitoring, logging and instrumentation
• Each of these is important, for each has within it the capacity to
answer questions about the system
• But each also has limitations!
• Their shared limitation: each can only be as effective as the
observer — they cannot answer questions not asked!
• Observability seeks to answer questions asked and prompt new
ones: the human is the foundation of observability!
10. Observability through instrumentation
• Static instrumentation modifies source to provide semantically
relevant information, e.g., via logging or counters
• Dynamic instrumentation allows for the system to be changed
while running to emit data, e.g. DTrace, OpenTracing
• Both mechanisms of instrumentation are essential!
• Static instrumentation provides the observations necessary for
early question formulation…
• Dynamic instrumentation answers deeper, ad hoc questions
11. Aggregation
• When instrumenting the system, the system itself can become
overwhelmed by the overhead of instrumentation
• Aggregation is essential for scalable, non-invasive
instrumentation — and is a first-class primitive in (e.g.) DTrace
• But aggregation also eliminates important dimensions of data,
especially with respect to time; some questions may only be
answered with disaggregated data!
• Use aggregation for performance debugging — but also
understand its limits!
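A small sketch of this tradeoff in Python (the event stream and state names are made up for illustration): the aggregate is compact, but a question about *when* something happened can only be answered by the disaggregated data.

```python
from collections import Counter

# Sketch contrasting aggregation with disaggregated data. The
# (timestamp, state) events and state names are invented for illustration.
events = [
    (0.0, "on-cpu"), (1.0, "waiting"), (2.0, "waiting"),
    (3.0, "on-cpu"), (4.0, "waiting"), (5.0, "waiting"),
]

# Aggregation (in the spirit of DTrace's count()): cheap to keep and cheap
# to emit, because only one counter per state crosses the wire...
totals = Counter(state for _, state in events)
print("waiting events:", totals["waiting"])

# ...but the time dimension is gone: the aggregate cannot say *when* the
# waits happened, or whether they clustered. Only the disaggregated
# stream can answer that question.
first_wait = next(t for t, s in events if s == "waiting")
print("first wait at t =", first_wait)
```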
12. Visualization
• The visual cortex is unparalleled at detecting patterns
• The value of visualizing data is not merely providing answers,
but also (and especially) provoking new questions
• Our systems are so large, complicated and abstract that there is
not one way to visualize them, but many
• The visualization of systems and their representations is an
essential facet of system observability!
13. Visualization: Gnuplot
• Graphs are terrific — so much so that we should not restrict
ourselves to the captive graphs found in bundled software!
• An ad hoc plotting tool is essential for performance debugging;
and Gnuplot is an excellent (if idiosyncratic) one
• Gnuplot is easily combined with workhorses like awk or perl
• That Gnuplot is an essential tool helps to set expectations
around performance debugging tools: they are not magicians!
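A minimal sketch of the Gnuplot-plus-workhorse pattern — the filename latency.out and its whitespace-delimited two-column layout (seconds, milliseconds) are assumptions for illustration:

```gnuplot
# Plot column 2 (latency) against column 1 (time) from a text file
set terminal pngcairo size 800,400
set output "latency.png"
set xlabel "time (s)"
set ylabel "latency (ms)"
plot "latency.out" using 1:2 with lines title "latency"
```

Pre-filtering with awk composes naturally, e.g. `awk '$2 > 100' latency.out > slow.out` to plot only the outliers.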
16. Visualization: Statemaps
• Flamegraphs help understand the work a system is doing, but
how does one visualize a system that isn’t doing work?
• That is, idleness is a common pathology in a suboptimal
system; there is a hidden bottleneck — but where?
• To explore these kinds of problems, we have developed
statemaps, a visualization of entity state over time
20. Statemap output
• Statemap rendering code processes the JSON stream and
renders it into an SVG that is the actual statemap
• SVG can be manipulated interactively (zoomed, panned,
highlighted, etc.) but also stands independently
• Statemaps are entirely neutral with respect to methodology!
21. Instrumentation for statemaps
• Statemaps themselves — like gnuplot — are entirely generic to
input data: they visualize arbitrary state over arbitrary time
• We have developed example statemap-generating dynamic
instrumentation for database, CPU, I/O, and filesystem operations
• The data rate in terms of state transitions per second varies
based on what is being instrumented: from <10/sec to >1M/sec
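A sketch of what such instrumentation emits, in Python. The record keys ("time", "entity", "state") follow the example shown on the tagged-statemaps slide; the state set, entity ID, and nanosecond timebase here are invented:

```python
import json
import random

# Hypothetical state table: statemaps are generic, so the states and their
# numeric encodings are whatever the instrumentation chooses to define.
STATES = {"on-cpu": 0, "off-cpu-waiting": 1, "off-cpu-blocked": 2}

def emit_transition(entity, state, t_ns):
    """Emit one state transition as a line of JSON; return the record."""
    rec = {"time": str(t_ns), "entity": str(entity), "state": STATES[state]}
    print(json.dumps(rec))
    return rec

# Emit a small stream of transitions for one hypothetical entity.
t = 0
for _ in range(5):
    t += random.randint(1_000, 50_000)  # nanoseconds between transitions
    emit_transition(12, random.choice(list(STATES)), t)
```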
22. Coalescing states
• For even modestly large inputs, adjacent states must be
coalesced to allow for reasonable visualization
• When this aggregation is required, the statemap rendering code
coalesces the least significant two adjacent states — allowing
for larger trends to stay intact
• The threshold at which states are coalesced can be dynamically
adjusted to allow for higher resolution
• Importantly, the original data retains all state transitions!
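A minimal sketch of such coalescing in Python, assuming each on-screen rectangle carries a duration and a per-state time breakdown; the data shapes and the exact merge heuristic are assumptions for illustration, not the renderer's actual code:

```python
# Repeatedly merge the adjacent pair of rectangles with the smallest
# combined duration until a target count is reached: fine detail is folded
# together while the large-duration trends stay intact.

def coalesce(rects, target):
    # Copy the input so the caller's (disaggregated) data is retained
    rects = [{"duration": r["duration"], "states": dict(r["states"])}
             for r in rects]
    while len(rects) > target:
        # Find the adjacent pair with the least combined duration...
        i = min(range(len(rects) - 1),
                key=lambda j: rects[j]["duration"] + rects[j + 1]["duration"])
        # ...and merge it into one rectangle, summing per-state time
        a, b = rects[i], rects.pop(i + 1)
        a["duration"] += b["duration"]
        for state, ns in b["states"].items():
            a["states"][state] = a["states"].get(state, 0) + ns
    return rects

spans = [{"duration": d, "states": {s: d}}
         for d, s in [(900, "on-cpu"), (30, "waiting"), (40, "on-cpu"),
                      (800, "waiting"), (25, "on-cpu")]]
out = coalesce(spans, 3)
print([r["duration"] for r in out])  # total time is preserved across merges
```

Note that the merged rectangle keeps the per-state totals, so no state time is lost — only positional detail within the rectangle.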
25. Tagged statemaps
• We have found it useful to be able to tag states with immutable
information that describes the context around the state
• For example, tagging a state for CPU execution with immutable
context information (process, thread, etc.)
• Tags occur separately in the stream, e.g.:
{ "state": 0, "tag": "d136827", "pid": "51943", "tid": "1",
  "execname": "postgres",
  "psargs": "/opt/postgresql/9.6.3/bin/postgres -D /manatee/pg/data" }
…
{ "time": "330931", "entity": "12", "state": 0, "tag": "d136827" }
27. Stacked statemaps
• We have found it useful to be able to stack statemaps from
either disjoint sources or disjoint machines
• Allows for activity in one domain or machine to be tightly
correlated with activity in another domain or machine
• Across machines, can be subject to wall clock skew…
• …but if wall clocks are skewing within the datacenter, there are
likely bigger problems!
31. Statemaps
• Statemaps provide a generic and system-neutral tool for
visualizing system state over time
• Statemaps use visualization to prompt questions
• Statemaps work in concert with system observability facilities
that can answer the questions that statemaps raise
• We must keep the human in mind when developing for
observability — the capacity to answer arbitrary questions is
only as effective as the human asking them!
• Statemap renderer: https://github.com/joyent/statemap