SlideShare a Scribd company logo
1 of 29
Download to read offline
Debugging under fire
Keeping your head when

systems have lost their mind
CTO
bryan@joyent.com
Bryan Cantrill
@bcantrill
The genesis of an outage
“Please don’t be me, please don’t be me”
“…doesn’t begin to describe it”
“WHEE!”
“Fat-finger”?
• Not just a “fat-finger”; even this relatively simple failure reflected
deeper complexities:















• Outage was instructive — and lucky — on many levels…
It could have been much worse!
• The (open source!) software stack that we have developed to
run our public cloud, Triton, is a complicated distributed system
• Compute nodes are PXE booted from the headnode with a
RAM-resident platform image
• It seemed entire conceivable that the services needed to boot
compute nodes would not be able to start because a compute
node could not boot…
• This was a condition we had tested, but at nowhere near the
scale — this was a failure that we hadn’t anticipated!
How did we get here?
• Software is increasingly delivered as part of a service
• Software configuration, deployment and management is
increasingly automated
• But automation is not total: humans are still in the loop, even
if only developing software
• Semi-automated systems are fraught with peril: the arrogance
and power of automation — but with human fallibility
Human fallibility in semi-automated systems
Human fallibility in semi-automated systems
Whither microservices?
• Microservices have yielded simpler components — but more
complicated systems
• …and open source has allowed us to deploy many more
kinds of software components, increasing complexity again
• As abstractions become more robust, failures become rare,
but arguably more acute: service outage is more likely due to
cascading failure in which there is not one bug but several
• That these failures may be in discrete software services
makes understanding the system very difficult…
The Microservices Complexity Paradox
The Microservices Complexity Paradox
an active shooter
Modern software failure modes
An even more apt metaphor
A mechanical distributed system
But… but… alerts and monitoring!


“It is a difficult thing to look at a winking light on a board,
or hear a peeping alarm — let alone several of them —
and immediately draw any sort of rational picture of
something happening”
— Nuclear Regulatory Commission’s Special Report

on incident at Three Mile Island
The debugging imperative
• We suffer from many of the same problems as nuclear power
in the 1970s: we are delivering systems that we think can’t fail
• In particular, distributed systems are vulnerable to software
defects — we must be able to debug them in production
• What does it mean to develop software to be debugged?
• Prompts a deeper question: how do we debug, anyway?
Debugging in the abstract
• Debugging is the process by which we understand
pathological behavior in a software system
• It is not unlike the process by which we understand the
behavior of a natural system — a process we call science
• Reasoning about the natural world can be very difficult:
experiments are expensive and even observations can be
very difficult
• Physical science is hypothesis-centric
The exceptionalism of software
• Software is entirely synthetic — it is mathematical machine!
• The conclusions of software debugging are often
mathematical in their unequivocal power!
• Software is so distilled and pure — experiments are so cheap
and observation so limitless — that we can structure our
reasoning about it differently
• We can understand software by simply observing it
The art of debugging
• The art of debugging isn’t to guess the answer — it is to be
able to ask the right questions to know how to answer them
• Answered questions are facts, not hypotheses
• Facts form constraints on future questions and hypotheses
• As facts beget questions which beget observations and more
facts, hypotheses become more tightly constrained — like a
cordon being cinched around the truth
The craft of debuggable software
• The essence of debugging is asking and answering questions
— and the craft of writing debuggable software is allowing the
software to be able to answer questions about itself
• This takes many forms:
• Designing for postmortem debuggability
• Designing for in situ instrumentation
• Designing for post hoc debugging
A culture of debugging
• Debugging must be viewed as the process by which systems
are understood and improved, not merely as the process by
which bugs are made to go away!
• Too often, we have found that beneath innocent wisps of
smoke lurk raging coal infernos
• Engineers must be empowered to understand anomalies!
• Engineers must be empowered to take the extra time to build
for debuggability — we must be secure in the knowledge that
this pays later dividends!
Debugging during an outage
• When systems are down, there is a natural tension: do we
optimize for recovery or understanding?
• “Can we resume service without losing information?”
• “What degree of service can we resume with minimal loss
of information?”
• Overemphasizing recovery with respect to understanding may
leave the problem undebugged or (worse) exacerbate the
problem with a destructive but unrelated action
The peril of overemphasizing recovery
• Recovery in lieu of understanding normalizes broken software
• If it becomes culturally engrained, the dubious principle of
software recovery has toxic corollaries, e.g.:
• Software should tolerate bad input (viz. “npm isntall”)
• Software should “recover” from fatal failures (uncaught
exceptions, segmentation violations, etc.)
• Software should not assert the correctness of its state
• These anti-patterns impede debuggability!
Debugging after an outage
• After an outage, we must debug to complete understanding
• In mature systems, we can expect cascading failures —
which can be exhausting to fully unwind
• It will be (very!) tempting after an outage to simply move on,
but every service failure (outage-inducing or not) represents
an opportunity to advance understanding
• Software engineers must be encouraged to understand their
own failures to encourage designing for debuggability
Enshrining debuggability
• Designing for debuggability effects true software robustness:
differentiating operational failure from programmatic ones
• Operational failures should be handled; programmatic failures
should be debugged
• Ironically, the more software is designed for debuggability the
less you will need to debug it — and the more you will
leverage it to debug the software that surrounds it
Debugging under fire
• It will always be stressful to debug a service that is down
• When a service is down, we must balance the need to restore
service with the need to debug it
• Missteps can be costly; taking time to huddle and think can
yield a better, safer path to recovery and root-cause
• In massive outages, parallelize by having teams take different
avenues of investigation
• Viewing outages as opportunities for understanding allows us
to develop software cultures that value debuggability!
Hungry for more?
• If you are the kind of software engineer who values
debuggability — and loves debugging — Joyent is hiring!
• If you have not yet hit your Cantrillian LD50, I will be joining
Brigit Kromhout, Andrew Clay Shafer, Matt Stratton as “Old
Geeks Shout At Cloud”
• Thank you!

More Related Content

What's hot

The Hurricane's Butterfly: Debugging pathologically performing systems
The Hurricane's Butterfly: Debugging pathologically performing systemsThe Hurricane's Butterfly: Debugging pathologically performing systems
The Hurricane's Butterfly: Debugging pathologically performing systemsbcantrill
 
Visualizing Systems with Statemaps
Visualizing Systems with StatemapsVisualizing Systems with Statemaps
Visualizing Systems with Statemapsbcantrill
 
Principles of Technology Leadership
Principles of Technology LeadershipPrinciples of Technology Leadership
Principles of Technology Leadershipbcantrill
 
The Internet-of-things: Architecting for the deluge of data
The Internet-of-things: Architecting for the deluge of dataThe Internet-of-things: Architecting for the deluge of data
The Internet-of-things: Architecting for the deluge of databcantrill
 
Jax Devops 2017 Succeeding in the Cloud – the guidebook of Fail
Jax Devops 2017  Succeeding in the Cloud – the guidebook of FailJax Devops 2017  Succeeding in the Cloud – the guidebook of Fail
Jax Devops 2017 Succeeding in the Cloud – the guidebook of FailSteve Poole
 
OSCON 2013: "Case Study: What to do when your project outgrows your company"
OSCON 2013: "Case Study: What to do when your project outgrows your company"OSCON 2013: "Case Study: What to do when your project outgrows your company"
OSCON 2013: "Case Study: What to do when your project outgrows your company"The Linux Foundation
 
Lessons Learned from Xen - SELF2013
Lessons Learned from Xen - SELF2013Lessons Learned from Xen - SELF2013
Lessons Learned from Xen - SELF2013The Linux Foundation
 
Microservices for Mortals by Bert Ertman at Codemotion Dubai
 Microservices for Mortals by Bert Ertman at Codemotion Dubai Microservices for Mortals by Bert Ertman at Codemotion Dubai
Microservices for Mortals by Bert Ertman at Codemotion DubaiCodemotion Dubai
 
Internet of Things, TYBSC IT, Semester 5, Unit II
Internet of Things, TYBSC IT, Semester 5, Unit IIInternet of Things, TYBSC IT, Semester 5, Unit II
Internet of Things, TYBSC IT, Semester 5, Unit IIArti Parab Academics
 
Internet of Things, TYBSC IT, Semester 5, Unit V
Internet of Things, TYBSC IT, Semester 5, Unit VInternet of Things, TYBSC IT, Semester 5, Unit V
Internet of Things, TYBSC IT, Semester 5, Unit VArti Parab Academics
 
Graphel: A Purely Functional Approach to Digital Interaction
Graphel: A Purely Functional Approach to Digital InteractionGraphel: A Purely Functional Approach to Digital Interaction
Graphel: A Purely Functional Approach to Digital Interactionmtrimpe
 
The economies of scaling software - Abdel Remani
The economies of scaling software - Abdel RemaniThe economies of scaling software - Abdel Remani
The economies of scaling software - Abdel Remanijaxconf
 
Cerebro general overiew eng
Cerebro general overiew engCerebro general overiew eng
Cerebro general overiew engCineSoft
 
The Reactive Principles: Design Principles For Cloud Native Applications
The Reactive Principles: Design Principles For Cloud Native ApplicationsThe Reactive Principles: Design Principles For Cloud Native Applications
The Reactive Principles: Design Principles For Cloud Native ApplicationsJonas Bonér
 
Xen Project Contributor Training Part2 : Processes and Conventions v1.1
Xen Project Contributor Training Part2 : Processes and Conventions v1.1Xen Project Contributor Training Part2 : Processes and Conventions v1.1
Xen Project Contributor Training Part2 : Processes and Conventions v1.1The Linux Foundation
 
Tiger oracle
Tiger oracleTiger oracle
Tiger oracled0nn9n
 
Your Goat Anti-Fragiled My Snowflake! Demystifying DevOps Jargon (30 minute v...
Your Goat Anti-Fragiled My Snowflake! Demystifying DevOps Jargon (30 minute v...Your Goat Anti-Fragiled My Snowflake! Demystifying DevOps Jargon (30 minute v...
Your Goat Anti-Fragiled My Snowflake! Demystifying DevOps Jargon (30 minute v...Clinton Wolfe
 
VMUG - My Journey to Full Stack Engineering
VMUG - My Journey to Full Stack EngineeringVMUG - My Journey to Full Stack Engineering
VMUG - My Journey to Full Stack EngineeringChris Wahl
 
Accidental Architecture 0.9
Accidental Architecture 0.9Accidental Architecture 0.9
Accidental Architecture 0.9Mark Cathcart
 

What's hot (20)

The Hurricane's Butterfly: Debugging pathologically performing systems
The Hurricane's Butterfly: Debugging pathologically performing systemsThe Hurricane's Butterfly: Debugging pathologically performing systems
The Hurricane's Butterfly: Debugging pathologically performing systems
 
Visualizing Systems with Statemaps
Visualizing Systems with StatemapsVisualizing Systems with Statemaps
Visualizing Systems with Statemaps
 
Principles of Technology Leadership
Principles of Technology LeadershipPrinciples of Technology Leadership
Principles of Technology Leadership
 
The Internet-of-things: Architecting for the deluge of data
The Internet-of-things: Architecting for the deluge of dataThe Internet-of-things: Architecting for the deluge of data
The Internet-of-things: Architecting for the deluge of data
 
Jax Devops 2017 Succeeding in the Cloud – the guidebook of Fail
Jax Devops 2017  Succeeding in the Cloud – the guidebook of FailJax Devops 2017  Succeeding in the Cloud – the guidebook of Fail
Jax Devops 2017 Succeeding in the Cloud – the guidebook of Fail
 
OSCON 2013: "Case Study: What to do when your project outgrows your company"
OSCON 2013: "Case Study: What to do when your project outgrows your company"OSCON 2013: "Case Study: What to do when your project outgrows your company"
OSCON 2013: "Case Study: What to do when your project outgrows your company"
 
Lessons Learned from Xen - SELF2013
Lessons Learned from Xen - SELF2013Lessons Learned from Xen - SELF2013
Lessons Learned from Xen - SELF2013
 
Microservices for Mortals by Bert Ertman at Codemotion Dubai
 Microservices for Mortals by Bert Ertman at Codemotion Dubai Microservices for Mortals by Bert Ertman at Codemotion Dubai
Microservices for Mortals by Bert Ertman at Codemotion Dubai
 
Internet of Things, TYBSC IT, Semester 5, Unit II
Internet of Things, TYBSC IT, Semester 5, Unit IIInternet of Things, TYBSC IT, Semester 5, Unit II
Internet of Things, TYBSC IT, Semester 5, Unit II
 
Internet of Things, TYBSC IT, Semester 5, Unit V
Internet of Things, TYBSC IT, Semester 5, Unit VInternet of Things, TYBSC IT, Semester 5, Unit V
Internet of Things, TYBSC IT, Semester 5, Unit V
 
Graphel: A Purely Functional Approach to Digital Interaction
Graphel: A Purely Functional Approach to Digital InteractionGraphel: A Purely Functional Approach to Digital Interaction
Graphel: A Purely Functional Approach to Digital Interaction
 
The economies of scaling software - Abdel Remani
The economies of scaling software - Abdel RemaniThe economies of scaling software - Abdel Remani
The economies of scaling software - Abdel Remani
 
Cerebro general overiew eng
Cerebro general overiew engCerebro general overiew eng
Cerebro general overiew eng
 
The Reactive Principles: Design Principles For Cloud Native Applications
The Reactive Principles: Design Principles For Cloud Native ApplicationsThe Reactive Principles: Design Principles For Cloud Native Applications
The Reactive Principles: Design Principles For Cloud Native Applications
 
Xen Project Contributor Training Part2 : Processes and Conventions v1.1
Xen Project Contributor Training Part2 : Processes and Conventions v1.1Xen Project Contributor Training Part2 : Processes and Conventions v1.1
Xen Project Contributor Training Part2 : Processes and Conventions v1.1
 
Tiger oracle
Tiger oracleTiger oracle
Tiger oracle
 
Your Goat Anti-Fragiled My Snowflake! Demystifying DevOps Jargon (30 minute v...
Your Goat Anti-Fragiled My Snowflake! Demystifying DevOps Jargon (30 minute v...Your Goat Anti-Fragiled My Snowflake! Demystifying DevOps Jargon (30 minute v...
Your Goat Anti-Fragiled My Snowflake! Demystifying DevOps Jargon (30 minute v...
 
Elatt Presentation
Elatt PresentationElatt Presentation
Elatt Presentation
 
VMUG - My Journey to Full Stack Engineering
VMUG - My Journey to Full Stack EngineeringVMUG - My Journey to Full Stack Engineering
VMUG - My Journey to Full Stack Engineering
 
Accidental Architecture 0.9
Accidental Architecture 0.9Accidental Architecture 0.9
Accidental Architecture 0.9
 

Similar to Debugging under fire: Keeping your head when systems have lost their mind

DevOps Days Vancouver 2014 Slides
DevOps Days Vancouver 2014 SlidesDevOps Days Vancouver 2014 Slides
DevOps Days Vancouver 2014 SlidesAlex Cruise
 
Empirical Methods in Software Engineering - an Overview
Empirical Methods in Software Engineering - an OverviewEmpirical Methods in Software Engineering - an Overview
Empirical Methods in Software Engineering - an Overviewalessio_ferrari
 
Mastering Microservices 2022 - Debugging distributed systems
Mastering Microservices 2022 - Debugging distributed systemsMastering Microservices 2022 - Debugging distributed systems
Mastering Microservices 2022 - Debugging distributed systemsBert Jan Schrijver
 
The 7 quests of resilient software design
The 7 quests of resilient software designThe 7 quests of resilient software design
The 7 quests of resilient software designUwe Friedrichsen
 
Scaling a Web Site - OSCON Tutorial
Scaling a Web Site - OSCON TutorialScaling a Web Site - OSCON Tutorial
Scaling a Web Site - OSCON Tutorialduleepa
 
Non-Functional Requirements
Non-Functional RequirementsNon-Functional Requirements
Non-Functional RequirementsDavid Simons
 
JavaLand 2022 - Debugging distributed systems
JavaLand 2022 - Debugging distributed systemsJavaLand 2022 - Debugging distributed systems
JavaLand 2022 - Debugging distributed systemsBert Jan Schrijver
 
GOTO night April 2022 - Debugging distributed systems
GOTO night April 2022 - Debugging distributed systemsGOTO night April 2022 - Debugging distributed systems
GOTO night April 2022 - Debugging distributed systemsBert Jan Schrijver
 
SRE Topics with Charity Majors and Liz Fong-Jones of Honeycomb
SRE Topics with Charity Majors and Liz Fong-Jones of HoneycombSRE Topics with Charity Majors and Liz Fong-Jones of Honeycomb
SRE Topics with Charity Majors and Liz Fong-Jones of HoneycombDaniel Zivkovic
 
No Silver Bullet - Essence and Accidents of Software Engineering
No Silver Bullet - Essence and Accidents of Software EngineeringNo Silver Bullet - Essence and Accidents of Software Engineering
No Silver Bullet - Essence and Accidents of Software EngineeringAditi Abhang
 
MongoDB World 2018: Tutorial - MongoDB Meets Chaos Monkey
MongoDB World 2018: Tutorial - MongoDB Meets Chaos MonkeyMongoDB World 2018: Tutorial - MongoDB Meets Chaos Monkey
MongoDB World 2018: Tutorial - MongoDB Meets Chaos MonkeyMongoDB
 
RedisConf17 - Observability and the Glorious Future
RedisConf17 - Observability and the Glorious FutureRedisConf17 - Observability and the Glorious Future
RedisConf17 - Observability and the Glorious FutureRedis Labs
 
Software Defects and SW Reliability Assessment
Software Defects and SW Reliability AssessmentSoftware Defects and SW Reliability Assessment
Software Defects and SW Reliability AssessmentKristine Hejna
 
Cloud Foundry Summit 2015: Devops, microservices and platforms, oh my!
Cloud Foundry Summit 2015: Devops, microservices and platforms, oh my!Cloud Foundry Summit 2015: Devops, microservices and platforms, oh my!
Cloud Foundry Summit 2015: Devops, microservices and platforms, oh my!VMware Tanzu
 
devops, microservices, and platforms, oh my!
devops, microservices, and platforms, oh my!devops, microservices, and platforms, oh my!
devops, microservices, and platforms, oh my!Andrew Shafer
 

Similar to Debugging under fire: Keeping your head when systems have lost their mind (20)

DevOps Days Vancouver 2014 Slides
DevOps Days Vancouver 2014 SlidesDevOps Days Vancouver 2014 Slides
DevOps Days Vancouver 2014 Slides
 
No silver bullet
No silver bulletNo silver bullet
No silver bullet
 
Empirical Methods in Software Engineering - an Overview
Empirical Methods in Software Engineering - an OverviewEmpirical Methods in Software Engineering - an Overview
Empirical Methods in Software Engineering - an Overview
 
Mastering Microservices 2022 - Debugging distributed systems
Mastering Microservices 2022 - Debugging distributed systemsMastering Microservices 2022 - Debugging distributed systems
Mastering Microservices 2022 - Debugging distributed systems
 
dist_systems.pdf
dist_systems.pdfdist_systems.pdf
dist_systems.pdf
 
The 7 quests of resilient software design
The 7 quests of resilient software designThe 7 quests of resilient software design
The 7 quests of resilient software design
 
Scaling a Web Site - OSCON Tutorial
Scaling a Web Site - OSCON TutorialScaling a Web Site - OSCON Tutorial
Scaling a Web Site - OSCON Tutorial
 
Non-Functional Requirements
Non-Functional RequirementsNon-Functional Requirements
Non-Functional Requirements
 
JavaLand 2022 - Debugging distributed systems
JavaLand 2022 - Debugging distributed systemsJavaLand 2022 - Debugging distributed systems
JavaLand 2022 - Debugging distributed systems
 
GOTO night April 2022 - Debugging distributed systems
GOTO night April 2022 - Debugging distributed systemsGOTO night April 2022 - Debugging distributed systems
GOTO night April 2022 - Debugging distributed systems
 
Debugging distributed systems
Debugging distributed systemsDebugging distributed systems
Debugging distributed systems
 
SRE Topics with Charity Majors and Liz Fong-Jones of Honeycomb
SRE Topics with Charity Majors and Liz Fong-Jones of HoneycombSRE Topics with Charity Majors and Liz Fong-Jones of Honeycomb
SRE Topics with Charity Majors and Liz Fong-Jones of Honeycomb
 
Software engineering
Software engineeringSoftware engineering
Software engineering
 
No Silver Bullet - Essence and Accidents of Software Engineering
No Silver Bullet - Essence and Accidents of Software EngineeringNo Silver Bullet - Essence and Accidents of Software Engineering
No Silver Bullet - Essence and Accidents of Software Engineering
 
MongoDB World 2018: Tutorial - MongoDB Meets Chaos Monkey
MongoDB World 2018: Tutorial - MongoDB Meets Chaos MonkeyMongoDB World 2018: Tutorial - MongoDB Meets Chaos Monkey
MongoDB World 2018: Tutorial - MongoDB Meets Chaos Monkey
 
RedisConf17 - Observability and the Glorious Future
RedisConf17 - Observability and the Glorious FutureRedisConf17 - Observability and the Glorious Future
RedisConf17 - Observability and the Glorious Future
 
Software Defects and SW Reliability Assessment
Software Defects and SW Reliability AssessmentSoftware Defects and SW Reliability Assessment
Software Defects and SW Reliability Assessment
 
Cloud Foundry Summit 2015: Devops, microservices and platforms, oh my!
Cloud Foundry Summit 2015: Devops, microservices and platforms, oh my!Cloud Foundry Summit 2015: Devops, microservices and platforms, oh my!
Cloud Foundry Summit 2015: Devops, microservices and platforms, oh my!
 
devops, microservices, and platforms, oh my!
devops, microservices, and platforms, oh my!devops, microservices, and platforms, oh my!
devops, microservices, and platforms, oh my!
 
Chaos engineering
Chaos engineering Chaos engineering
Chaos engineering
 

More from bcantrill

Predicting the Present
Predicting the PresentPredicting the Present
Predicting the Presentbcantrill
 
Sharpening the Axe: The Primacy of Toolmaking
Sharpening the Axe: The Primacy of ToolmakingSharpening the Axe: The Primacy of Toolmaking
Sharpening the Axe: The Primacy of Toolmakingbcantrill
 
Coming of Age: Developing young technologists without robbing them of their y...
Coming of Age: Developing young technologists without robbing them of their y...Coming of Age: Developing young technologists without robbing them of their y...
Coming of Age: Developing young technologists without robbing them of their y...bcantrill
 
I have come to bury the BIOS, not to open it: The need for holistic systems
I have come to bury the BIOS, not to open it: The need for holistic systemsI have come to bury the BIOS, not to open it: The need for holistic systems
I have come to bury the BIOS, not to open it: The need for holistic systemsbcantrill
 
Towards Holistic Systems
Towards Holistic SystemsTowards Holistic Systems
Towards Holistic Systemsbcantrill
 
The Coming Firmware Revolution
The Coming Firmware RevolutionThe Coming Firmware Revolution
The Coming Firmware Revolutionbcantrill
 
Hardware/software Co-design: The Coming Golden Age
Hardware/software Co-design: The Coming Golden AgeHardware/software Co-design: The Coming Golden Age
Hardware/software Co-design: The Coming Golden Agebcantrill
 
Tockilator: Deducing Tock execution flows from Ibex Verilator traces
Tockilator: Deducing Tock execution flows from Ibex Verilator tracesTockilator: Deducing Tock execution flows from Ibex Verilator traces
Tockilator: Deducing Tock execution flows from Ibex Verilator tracesbcantrill
 
No Moore Left to Give: Enterprise Computing After Moore's Law
No Moore Left to Give: Enterprise Computing After Moore's LawNo Moore Left to Give: Enterprise Computing After Moore's Law
No Moore Left to Give: Enterprise Computing After Moore's Lawbcantrill
 
Andreessen's Corollary: Ethical Dilemmas in Software Engineering
Andreessen's Corollary: Ethical Dilemmas in Software EngineeringAndreessen's Corollary: Ethical Dilemmas in Software Engineering
Andreessen's Corollary: Ethical Dilemmas in Software Engineeringbcantrill
 
Platform values, Rust, and the implications for system software
Platform values, Rust, and the implications for system softwarePlatform values, Rust, and the implications for system software
Platform values, Rust, and the implications for system softwarebcantrill
 
Is it time to rewrite the operating system in Rust?
Is it time to rewrite the operating system in Rust?Is it time to rewrite the operating system in Rust?
Is it time to rewrite the operating system in Rust?bcantrill
 
dtrace.conf(16): DTrace state of the union
dtrace.conf(16): DTrace state of the uniondtrace.conf(16): DTrace state of the union
dtrace.conf(16): DTrace state of the unionbcantrill
 
Papers We Love: ARC after dark
Papers We Love: ARC after darkPapers We Love: ARC after dark
Papers We Love: ARC after darkbcantrill
 
Down Memory Lane: Two Decades with the Slab Allocator
Down Memory Lane: Two Decades with the Slab AllocatorDown Memory Lane: Two Decades with the Slab Allocator
Down Memory Lane: Two Decades with the Slab Allocatorbcantrill
 
The Container Revolution: Reflections after the first decade
The Container Revolution: Reflections after the first decadeThe Container Revolution: Reflections after the first decade
The Container Revolution: Reflections after the first decadebcantrill
 
Papers We Love: Jails and Zones
Papers We Love: Jails and ZonesPapers We Love: Jails and Zones
Papers We Love: Jails and Zonesbcantrill
 
Why it’s (past) time to run containers on bare metal
Why it’s (past) time to run containers on bare metalWhy it’s (past) time to run containers on bare metal
Why it’s (past) time to run containers on bare metalbcantrill
 
Run containers on bare metal already!
Run containers on bare metal already!Run containers on bare metal already!
Run containers on bare metal already!bcantrill
 
A crime against common sense
A crime against common senseA crime against common sense
A crime against common sensebcantrill
 

More from bcantrill (20)

Predicting the Present
Predicting the PresentPredicting the Present
Predicting the Present
 
Sharpening the Axe: The Primacy of Toolmaking
Sharpening the Axe: The Primacy of ToolmakingSharpening the Axe: The Primacy of Toolmaking
Sharpening the Axe: The Primacy of Toolmaking
 
Coming of Age: Developing young technologists without robbing them of their y...
Coming of Age: Developing young technologists without robbing them of their y...Coming of Age: Developing young technologists without robbing them of their y...
Coming of Age: Developing young technologists without robbing them of their y...
 
I have come to bury the BIOS, not to open it: The need for holistic systems
I have come to bury the BIOS, not to open it: The need for holistic systemsI have come to bury the BIOS, not to open it: The need for holistic systems
I have come to bury the BIOS, not to open it: The need for holistic systems
 
Towards Holistic Systems
Towards Holistic SystemsTowards Holistic Systems
Towards Holistic Systems
 
The Coming Firmware Revolution
The Coming Firmware RevolutionThe Coming Firmware Revolution
The Coming Firmware Revolution
 
Hardware/software Co-design: The Coming Golden Age
Hardware/software Co-design: The Coming Golden AgeHardware/software Co-design: The Coming Golden Age
Hardware/software Co-design: The Coming Golden Age
 
Tockilator: Deducing Tock execution flows from Ibex Verilator traces
Tockilator: Deducing Tock execution flows from Ibex Verilator tracesTockilator: Deducing Tock execution flows from Ibex Verilator traces
Tockilator: Deducing Tock execution flows from Ibex Verilator traces
 
No Moore Left to Give: Enterprise Computing After Moore's Law
No Moore Left to Give: Enterprise Computing After Moore's LawNo Moore Left to Give: Enterprise Computing After Moore's Law
No Moore Left to Give: Enterprise Computing After Moore's Law
 
Andreessen's Corollary: Ethical Dilemmas in Software Engineering
Andreessen's Corollary: Ethical Dilemmas in Software EngineeringAndreessen's Corollary: Ethical Dilemmas in Software Engineering
Andreessen's Corollary: Ethical Dilemmas in Software Engineering
 
Platform values, Rust, and the implications for system software
Platform values, Rust, and the implications for system softwarePlatform values, Rust, and the implications for system software
Platform values, Rust, and the implications for system software
 
Is it time to rewrite the operating system in Rust?
Is it time to rewrite the operating system in Rust?Is it time to rewrite the operating system in Rust?
Is it time to rewrite the operating system in Rust?
 
dtrace.conf(16): DTrace state of the union
dtrace.conf(16): DTrace state of the uniondtrace.conf(16): DTrace state of the union
dtrace.conf(16): DTrace state of the union
 
Papers We Love: ARC after dark
Papers We Love: ARC after darkPapers We Love: ARC after dark
Papers We Love: ARC after dark
 
Down Memory Lane: Two Decades with the Slab Allocator
Down Memory Lane: Two Decades with the Slab AllocatorDown Memory Lane: Two Decades with the Slab Allocator
Down Memory Lane: Two Decades with the Slab Allocator
 
The Container Revolution: Reflections after the first decade
The Container Revolution: Reflections after the first decadeThe Container Revolution: Reflections after the first decade
The Container Revolution: Reflections after the first decade
 
Papers We Love: Jails and Zones
Papers We Love: Jails and ZonesPapers We Love: Jails and Zones
Papers We Love: Jails and Zones
 
Why it’s (past) time to run containers on bare metal
Why it’s (past) time to run containers on bare metalWhy it’s (past) time to run containers on bare metal
Why it’s (past) time to run containers on bare metal
 
Run containers on bare metal already!
Run containers on bare metal already!Run containers on bare metal already!
Run containers on bare metal already!
 
A crime against common sense
A crime against common senseA crime against common sense
A crime against common sense
 

Recently uploaded

Generic or specific? Making sensible software design decisions
Generic or specific? Making sensible software design decisionsGeneric or specific? Making sensible software design decisions
Generic or specific? Making sensible software design decisionsBert Jan Schrijver
 
The title is not connected to what is inside
The title is not connected to what is insideThe title is not connected to what is inside
The title is not connected to what is insideshinachiaurasa2
 
%in tembisa+277-882-255-28 abortion pills for sale in tembisa
%in tembisa+277-882-255-28 abortion pills for sale in tembisa%in tembisa+277-882-255-28 abortion pills for sale in tembisa
%in tembisa+277-882-255-28 abortion pills for sale in tembisamasabamasaba
 
%+27788225528 love spells in Atlanta Psychic Readings, Attraction spells,Brin...
%+27788225528 love spells in Atlanta Psychic Readings, Attraction spells,Brin...%+27788225528 love spells in Atlanta Psychic Readings, Attraction spells,Brin...
%+27788225528 love spells in Atlanta Psychic Readings, Attraction spells,Brin...masabamasaba
 
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...Health
 
%+27788225528 love spells in Boston Psychic Readings, Attraction spells,Bring...
%+27788225528 love spells in Boston Psychic Readings, Attraction spells,Bring...%+27788225528 love spells in Boston Psychic Readings, Attraction spells,Bring...
%+27788225528 love spells in Boston Psychic Readings, Attraction spells,Bring...masabamasaba
 
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...harshavardhanraghave
 
%in kaalfontein+277-882-255-28 abortion pills for sale in kaalfontein
%in kaalfontein+277-882-255-28 abortion pills for sale in kaalfontein%in kaalfontein+277-882-255-28 abortion pills for sale in kaalfontein
%in kaalfontein+277-882-255-28 abortion pills for sale in kaalfonteinmasabamasaba
 
Exploring the Best Video Editing App.pdf
Exploring the Best Video Editing App.pdfExploring the Best Video Editing App.pdf
Exploring the Best Video Editing App.pdfproinshot.com
 
%in Lydenburg+277-882-255-28 abortion pills for sale in Lydenburg
%in Lydenburg+277-882-255-28 abortion pills for sale in Lydenburg%in Lydenburg+277-882-255-28 abortion pills for sale in Lydenburg
%in Lydenburg+277-882-255-28 abortion pills for sale in Lydenburgmasabamasaba
 
%in Bahrain+277-882-255-28 abortion pills for sale in Bahrain
%in Bahrain+277-882-255-28 abortion pills for sale in Bahrain%in Bahrain+277-882-255-28 abortion pills for sale in Bahrain
%in Bahrain+277-882-255-28 abortion pills for sale in Bahrainmasabamasaba
 
TECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service providerTECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service providermohitmore19
 
Architecture decision records - How not to get lost in the past
Architecture decision records - How not to get lost in the pastArchitecture decision records - How not to get lost in the past
Architecture decision records - How not to get lost in the pastPapp Krisztián
 
%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...
%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...
%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...masabamasaba
 
%in ivory park+277-882-255-28 abortion pills for sale in ivory park
%in ivory park+277-882-255-28 abortion pills for sale in ivory park %in ivory park+277-882-255-28 abortion pills for sale in ivory park
%in ivory park+277-882-255-28 abortion pills for sale in ivory park masabamasaba
 
Introducing Microsoft’s new Enterprise Work Management (EWM) Solution
Introducing Microsoft’s new Enterprise Work Management (EWM) SolutionIntroducing Microsoft’s new Enterprise Work Management (EWM) Solution
Introducing Microsoft’s new Enterprise Work Management (EWM) SolutionOnePlan Solutions
 
AI & Machine Learning Presentation Template
AI & Machine Learning Presentation TemplateAI & Machine Learning Presentation Template
AI & Machine Learning Presentation TemplatePresentation.STUDIO
 
VTU technical seminar 8Th Sem on Scikit-learn
VTU technical seminar 8Th Sem on Scikit-learnVTU technical seminar 8Th Sem on Scikit-learn
VTU technical seminar 8Th Sem on Scikit-learnAmarnathKambale
 
%in tembisa+277-882-255-28 abortion pills for sale in tembisa
%in tembisa+277-882-255-28 abortion pills for sale in tembisa%in tembisa+277-882-255-28 abortion pills for sale in tembisa
%in tembisa+277-882-255-28 abortion pills for sale in tembisamasabamasaba
 
AI Mastery 201: Elevating Your Workflow with Advanced LLM Techniques
AI Mastery 201: Elevating Your Workflow with Advanced LLM TechniquesAI Mastery 201: Elevating Your Workflow with Advanced LLM Techniques
AI Mastery 201: Elevating Your Workflow with Advanced LLM TechniquesVictorSzoltysek
 

Recently uploaded (20)

Generic or specific? Making sensible software design decisions
Generic or specific? Making sensible software design decisionsGeneric or specific? Making sensible software design decisions
Generic or specific? Making sensible software design decisions
 
The title is not connected to what is inside
The title is not connected to what is insideThe title is not connected to what is inside
The title is not connected to what is inside
 
%in tembisa+277-882-255-28 abortion pills for sale in tembisa
%in tembisa+277-882-255-28 abortion pills for sale in tembisa%in tembisa+277-882-255-28 abortion pills for sale in tembisa
%in tembisa+277-882-255-28 abortion pills for sale in tembisa
 
%+27788225528 love spells in Atlanta Psychic Readings, Attraction spells,Brin...
%+27788225528 love spells in Atlanta Psychic Readings, Attraction spells,Brin...%+27788225528 love spells in Atlanta Psychic Readings, Attraction spells,Brin...
%+27788225528 love spells in Atlanta Psychic Readings, Attraction spells,Brin...
 
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
 
%+27788225528 love spells in Boston Psychic Readings, Attraction spells,Bring...
%+27788225528 love spells in Boston Psychic Readings, Attraction spells,Bring...%+27788225528 love spells in Boston Psychic Readings, Attraction spells,Bring...
%+27788225528 love spells in Boston Psychic Readings, Attraction spells,Bring...
 
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
 
%in kaalfontein+277-882-255-28 abortion pills for sale in kaalfontein
%in kaalfontein+277-882-255-28 abortion pills for sale in kaalfontein%in kaalfontein+277-882-255-28 abortion pills for sale in kaalfontein
%in kaalfontein+277-882-255-28 abortion pills for sale in kaalfontein
 
Exploring the Best Video Editing App.pdf
Exploring the Best Video Editing App.pdfExploring the Best Video Editing App.pdf
Exploring the Best Video Editing App.pdf
 
%in Lydenburg+277-882-255-28 abortion pills for sale in Lydenburg
%in Lydenburg+277-882-255-28 abortion pills for sale in Lydenburg%in Lydenburg+277-882-255-28 abortion pills for sale in Lydenburg
%in Lydenburg+277-882-255-28 abortion pills for sale in Lydenburg
 
%in Bahrain+277-882-255-28 abortion pills for sale in Bahrain
%in Bahrain+277-882-255-28 abortion pills for sale in Bahrain%in Bahrain+277-882-255-28 abortion pills for sale in Bahrain
%in Bahrain+277-882-255-28 abortion pills for sale in Bahrain
 
TECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service providerTECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service provider
 
Architecture decision records - How not to get lost in the past
Architecture decision records - How not to get lost in the pastArchitecture decision records - How not to get lost in the past
Architecture decision records - How not to get lost in the past
 
%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...
%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...
%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...
 
%in ivory park+277-882-255-28 abortion pills for sale in ivory park
%in ivory park+277-882-255-28 abortion pills for sale in ivory park %in ivory park+277-882-255-28 abortion pills for sale in ivory park
%in ivory park+277-882-255-28 abortion pills for sale in ivory park
 
Introducing Microsoft’s new Enterprise Work Management (EWM) Solution
Introducing Microsoft’s new Enterprise Work Management (EWM) SolutionIntroducing Microsoft’s new Enterprise Work Management (EWM) Solution
Introducing Microsoft’s new Enterprise Work Management (EWM) Solution
 
AI & Machine Learning Presentation Template
AI & Machine Learning Presentation TemplateAI & Machine Learning Presentation Template
AI & Machine Learning Presentation Template
 
VTU technical seminar 8Th Sem on Scikit-learn
VTU technical seminar 8Th Sem on Scikit-learnVTU technical seminar 8Th Sem on Scikit-learn
VTU technical seminar 8Th Sem on Scikit-learn
 
%in tembisa+277-882-255-28 abortion pills for sale in tembisa
%in tembisa+277-882-255-28 abortion pills for sale in tembisa%in tembisa+277-882-255-28 abortion pills for sale in tembisa
%in tembisa+277-882-255-28 abortion pills for sale in tembisa
 
AI Mastery 201: Elevating Your Workflow with Advanced LLM Techniques
AI Mastery 201: Elevating Your Workflow with Advanced LLM TechniquesAI Mastery 201: Elevating Your Workflow with Advanced LLM Techniques
AI Mastery 201: Elevating Your Workflow with Advanced LLM Techniques
 

Debugging under fire: Keeping your head when systems have lost their mind

  • 1. Debugging under fire Keeping your head when
 systems have lost their mind CTO bryan@joyent.com Bryan Cantrill @bcantrill
  • 2. The genesis of an outage
  • 3. “Please don’t be me, please don’t be me”
  • 4. “…doesn’t begin to describe it”
  • 6. “Fat-finger”? • Not just a “fat-finger”; even this relatively simple failure reflected deeper complexities:
 
 
 
 
 
 
 
 • Outage was instructive — and lucky — on many levels…
  • 7. It could have been much worse! • The (open source!) software stack that we have developed to run our public cloud, Triton, is a complicated distributed system • Compute nodes are PXE booted from the headnode with a RAM-resident platform image • It seemed entire conceivable that the services needed to boot compute nodes would not be able to start because a compute node could not boot… • This was a condition we had tested, but at nowhere near the scale — this was a failure that we hadn’t anticipated!
  • 8. How did we get here? • Software is increasingly delivered as part of a service • Software configuration, deployment and management is increasingly automated • But automation is not total: humans are still in the loop, even if only developing software • Semi-automated systems are fraught with peril: the arrogance and power of automation — but with human fallibility
  • 9. Human fallibility in semi-automated systems
  • 10. Human fallibility in semi-automated systems
  • 11. Whither microservices? • Microservices have yielded simpler components — but more complicated systems • …and open source has allowed us to deploy many more kinds of software components, increasing complexity again • As abstractions become more robust, failures become rare, but arguably more acute: service outage is more likely due to cascading failure in which there is not one bug but several • That these failures may be in discrete software services makes understanding the system very difficult…
  • 13. The Microservices Complexity Paradox an active shooter
  • 15. An even more apt metaphor
  • 17. But… but… alerts and monitoring! 
 “It is a difficult thing to look at a winking light on a board, or hear a peeping alarm — let alone several of them — and immediately draw any sort of rational picture of something happening” — Nuclear Regulatory Commission’s Special Report
 on incident at Three Mile Island
  • 18. The debugging imperative • We suffer from many of the same problems as nuclear power in the 1970s: we are delivering systems that we think can’t fail • In particular, distributed systems are vulnerable to software defects — we must be able to debug them in production • What does it mean to develop software to be debugged? • Prompts a deeper question: how do we debug, anyway?
  • 19. Debugging in the abstract • Debugging is the process by which we understand pathological behavior in a software system • It is not unlike the process by which we understand the behavior of a natural system — a process we call science • Reasoning about the natural world can be very difficult: experiments are expensive and even observations can be very difficult • Physical science is hypothesis-centric
  • 20. The exceptionalism of software • Software is entirely synthetic — it is mathematical machine! • The conclusions of software debugging are often mathematical in their unequivocal power! • Software is so distilled and pure — experiments are so cheap and observation so limitless — that we can structure our reasoning about it differently • We can understand software by simply observing it
  • 21. The art of debugging • The art of debugging isn’t to guess the answer — it is to be able to ask the right questions to know how to answer them • Answered questions are facts, not hypotheses • Facts form constraints on future questions and hypotheses • As facts beget questions which beget observations and more facts, hypotheses become more tightly constrained — like a cordon being cinched around the truth
  • 22. The craft of debuggable software • The essence of debugging is asking and answering questions — and the craft of writing debuggable software is allowing the software to be able to answer questions about itself • This takes many forms: • Designing for postmortem debuggability • Designing for in situ instrumentation • Designing for post hoc debugging
  • 23. A culture of debugging • Debugging must be viewed as the process by which systems are understood and improved, not merely as the process by which bugs are made to go away! • Too often, we have found that beneath innocent wisps of smoke lurk raging coal infernos • Engineers must be empowered to understand anomalies! • Engineers must be empowered to take the extra time to build for debuggability — we must be secure in the knowledge that this pays later dividends!
  • 24. Debugging during an outage • When systems are down, there is a natural tension: do we optimize for recovery or understanding? • “Can we resume service without losing information?” • “What degree of service can we resume with minimal loss of information?” • Overemphasizing recovery with respect to understanding may leave the problem undebugged or (worse) exacerbate the problem with a destructive but unrelated action
  • 25. The peril of overemphasizing recovery • Recovery in lieu of understanding normalizes broken software • If it becomes culturally engrained, the dubious principle of software recovery has toxic corollaries, e.g.: • Software should tolerate bad input (viz. “npm isntall”) • Software should “recover” from fatal failures (uncaught exceptions, segmentation violations, etc.) • Software should not assert the correctness of its state • These anti-patterns impede debuggability!
  • 26. Debugging after an outage • After an outage, we must debug to complete understanding • In mature systems, we can expect cascading failures — which can be exhausting to fully unwind • It will be (very!) tempting after an outage to simply move on, but every service failure (outage-inducing or not) represents an opportunity to advance understanding • Software engineers must be encouraged to understand their own failures to encourage designing for debuggability
  • 27. Enshrining debuggability • Designing for debuggability effects true software robustness: differentiating operational failure from programmatic ones • Operational failures should be handled; programmatic failures should be debugged • Ironically, the more software is designed for debuggability the less you will need to debug it — and the more you will leverage it to debug the software that surrounds it
  • 28. Debugging under fire • It will always be stressful to debug a service that is down • When a service is down, we must balance the need to restore service with the need to debug it • Missteps can be costly; taking time to huddle and think can yield a better, safer path to recovery and root-cause • In massive outages, parallelize by having teams take different avenues of investigation • Viewing outages as opportunities for understanding allows us to develop software cultures that value debuggability!
  • 29. Hungry for more? • If you are the kind of software engineer who values debuggability — and loves debugging — Joyent is hiring! • If you have not yet hit your Cantrillian LD50, I will be joining Brigit Kromhout, Andrew Clay Shafer, Matt Stratton as “Old Geeks Shout At Cloud” • Thank you!