Modern software systems increasingly span cloud and on-premises deployments as well as remote embedded devices and sensors. These distributed systems bring challenges with data, connectivity, performance, and systems management; to succeed, you must design and build with operability as a first-class property.
Matthew Skelton shares five practical, tried-and-tested techniques for improving operability with many kinds of software systems, including the cloud, serverless, on-premises, and the IoT: logging as a live diagnostics vector with sparse event IDs; operational checklists and runbook dialog sheets as a discovery mechanism for teams; endpoint health checks as a way to assess runtime dependencies and complexity; correlation IDs beyond simple HTTP calls; and lightweight user personas as drivers for operational dashboards.
These techniques work very differently with different technologies. For instance, an IoT device has limited storage, processing, and I/O, so generating and shipping logs and metrics looks very different from the cloud or serverless cases. However, the principles (logging as a live diagnostics vector, event IDs for discovery, and so on) work remarkably well across very different technologies.
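As a rough illustration of logging with event IDs, the sketch below tags each log line with a numeric ID taken from an enumeration. Everything in it is an assumption made for illustration: the EventId names and values, the JSON line format, and the reading of "sparse" as leaving gaps between IDs so related events can be added later. None of it is taken from the talk.

```python
import enum
import json
import logging


class EventId(enum.IntEnum):
    """Hypothetical event IDs; gaps between values leave room for new events."""
    TRANSCODE_STARTED = 1000
    TRANSCODE_FINISHED = 1010
    UPLOAD_STARTED = 2000
    UPLOAD_FAILED = 2090


def format_event(event_id: EventId, message: str, **fields) -> str:
    """Build one JSON log line that carries the event ID and its name."""
    record = {
        "event_id": int(event_id),
        "event": event_id.name,
        "message": message,
    }
    record.update(fields)
    return json.dumps(record, sort_keys=True)


def log_event(logger: logging.Logger, event_id: EventId, message: str, **fields) -> None:
    """Emit the formatted event at INFO level."""
    logger.info(format_event(event_id, message, **fields))


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO, format="%(message)s")
    log_event(logging.getLogger("video"), EventId.TRANSCODE_STARTED,
              "transcode started", job="job-1")
```

Because every line carries a stable numeric ID, a log aggregation tool can search or alert on the ID rather than on message wording, which may change between releases.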
Drawing from his experience helping teams improve the operability of their software systems, Matthew explains what works (and what doesn’t) and how teams can expand their understanding and awareness of operability through these straightforward, team-friendly techniques.
From a talk given by Matthew Skelton at Velocity Conference EU 2017 - https://conferences.oreilly.com/velocity/vl-eu/public/schedule/detail/61954
Practical operability techniques for distributed systems - Velocity EU 2017
1. Practical Operability
Techniques for Teams
Matthew Skelton
Skelton Thatcher Consulting
skeltonthatcher.com / @SkeltonThatcher
Velocity Conference EU 2017, London – 19 Oct 2017
2. Today
What is operability?
Modern logging
Run Book dialogue sheets
Endpoint healthchecks
Correlation IDs
User Personas for dashboards
5. Operability:
use modern logging, Run Book
dialogue sheets, endpoint
healthchecks, correlation IDs,
and user personas as
team collaboration techniques
30. Example: video processing
On-demand processing of TV and
mobile streaming adverts
Ad-agency TV broadcaster
High throughput
Glitch-free video & audio
34. Example: video processing
Discover processing bottlenecks
Trigger alerts via LogEntries /
HostedGraphite
Report on KPIs
Target areas for improvement
40. System characteristics
Hours of operation
During what hours does the service or system actually need to operate? Can portions or features of the
system be unavailable at times if needed?
Hours of operation - core features
(e.g. 03:00-01:00 GMT+0)
Hours of operation - secondary features
(e.g. 07:00-23:00 GMT+0)
Data and processing flows
How and where does data flow through the system? What controls or triggers data flows?
(e.g. mobile requests / scheduled batch jobs / inbound IoT sensor data)
…
45. endpoint healthchecks
Every runnable app/service/daemon
exposes /status/health
An HTTP GET to the endpoint returns:
200 – "I am healthy"
500 – "I am sick"
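A minimal sketch of such an endpoint, using only the Python standard library. The /status/health path, the status codes, and the response texts come from the slide; the check_dependencies function, the host, and the port are hypothetical placeholders, and a real service would replace the placeholder with its own dependency checks.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer
from typing import Tuple


def check_dependencies() -> bool:
    """Hypothetical placeholder: probe databases, queues, downstream services."""
    return True


def health_response(healthy: bool) -> Tuple[int, bytes]:
    """Map the health state to the status code and body from the slide."""
    if healthy:
        return 200, b"I am healthy"
    return 500, b"I am sick"


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self) -> None:
        # Only the health path is served in this sketch.
        if self.path != "/status/health":
            self.send_error(404)
            return
        status, body = health_response(check_dependencies())
        self.send_response(status)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


if __name__ == "__main__":
    # Host and port are assumptions for local experimentation.
    HTTPServer(("127.0.0.1", 8080), HealthHandler).serve_forever()
```

With the server running, an HTTP GET to http://127.0.0.1:8080/status/health returns 200 while check_dependencies reports healthy and 500 otherwise.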
56. Synchronous HTTP:
X-HEADER e.g. X-trace-id
X-trace-id: 348e1cf8
If header is present, pass it on
(Yes, RFC6648, but this is internal only)
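The "if header is present, pass it on" rule could be sketched as a small helper that copies the correlation ID from an incoming request onto the headers of any outgoing call. The X-trace-id header name is from the slide; the function names and the case-insensitive lookup are assumptions made for illustration.

```python
from typing import Dict, Mapping, Optional

TRACE_HEADER = "X-trace-id"  # header name taken from the slide


def find_trace_id(headers: Mapping[str, str]) -> Optional[str]:
    """Return the correlation ID if present; HTTP header names are case-insensitive."""
    for name, value in headers.items():
        if name.lower() == TRACE_HEADER.lower():
            return value
    return None


def outgoing_headers(incoming: Mapping[str, str],
                     extra: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
    """Build headers for a downstream call, passing the correlation ID on if present."""
    headers = dict(extra or {})
    trace_id = find_trace_id(incoming)
    if trace_id is not None:
        headers[TRACE_HEADER] = trace_id
    return headers


if __name__ == "__main__":
    print(outgoing_headers({"x-trace-id": "348e1cf8"}, {"Accept": "text/plain"}))
```

The resulting dictionary can be handed to any HTTP client, for example as the headers argument of urllib.request.Request. This sketch deliberately does not create a new ID when none arrives, since the slide only describes passing an existing one along.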
57. Asynchronous (queues, etc.):
Message Attributes, name:value pair
e.g. "trace-id":"348e1cf8"
AWS SQS: SendMessage() / ReceiveMessage()
Log the Correlation ID if present
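For queues, the same idea might look like the sketch below, which builds and reads a "trace-id" message attribute. The dictionary layout (DataType / StringValue under MessageAttributes) follows the shape SQS uses for string message attributes; the helper names and the log line format are hypothetical, and no AWS call is made here.

```python
from typing import Any, Dict, Optional

TRACE_ATTRIBUTE = "trace-id"  # attribute name taken from the slide


def build_message_attributes(trace_id: Optional[str]) -> Dict[str, Dict[str, str]]:
    """Attributes to send with a message; empty when there is no ID to pass on."""
    if trace_id is None:
        return {}
    return {TRACE_ATTRIBUTE: {"DataType": "String", "StringValue": trace_id}}


def extract_trace_id(message: Dict[str, Any]) -> Optional[str]:
    """Read the correlation ID from a received message, if it carries one."""
    attribute = message.get("MessageAttributes", {}).get(TRACE_ATTRIBUTE)
    if attribute is None:
        return None
    return attribute.get("StringValue")


def log_line(message: Dict[str, Any], text: str) -> str:
    """Prefix a log message with the correlation ID when present."""
    trace_id = extract_trace_id(message)
    if trace_id is None:
        return text
    return f"trace-id={trace_id} {text}"


if __name__ == "__main__":
    received = {"Body": "{}", "MessageAttributes": build_message_attributes("348e1cf8")}
    print(log_line(received, "processing message"))
```

With a real SQS client, the dictionary from build_message_attributes would be passed as the MessageAttributes parameter of SendMessage, and the consumer would need to request message attributes on ReceiveMessage for them to be returned.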
58. Example: electronic trading
High speed, low latency
Trading options & derivatives
Connected to stock exchanges
Sub-millisecond timings
> £1 million per day traded
62. Correlation IDs for trading
Evidence for timely operation
Help identify bottlenecks
Target areas for perf tuning
Identify race conditions
Increase operability
80. Operability
use modern logging, Run Book
dialogue sheets, endpoint
healthchecks, correlation IDs,
and user personas as
team collaboration techniques
81. Team Guide to
Software Operability
Matthew Skelton & Rob Thatcher
skeltonthatcher.com/publications
Download a free sample chapter
83. Resources
• Training: Practical Operability for Developers and Testers – led
by Matthew Skelton and Rob Thatcher – 1-day workshop –
http://www.unicom.co.uk/practical-operability-for-developers-and-testers.html
• Team Guide to Software Operability by Matthew Skelton and Rob
Thatcher (Skelton Thatcher Publications, 2016)
http://operabilitybook.com/
• Run Book template & Run Book dialogue sheets
http://runbooktemplate.info/