SlideShare a Scribd company logo
1 of 26
Download to read offline
Running Aground:
Debugging Docker in production
Bryan Cantrill (@bcantrill), CTO, Joyent
The Docker revolution
• While OS containers have been around for over a decade, Docker
has brought the concept to a much broader audience
• Docker has used the rapid provisioning and shared filesystem of
containers to allow developers to think operationally
- Deployment procedures can be encoded via an image
- Images can be reliably and reproducibly deployed as containers
• Docker is doing to apt what apt did to tar
Docker + microservices
• Docker is particular apt at deploying microservices: small, well-
defined services that do one thing and do it well
• While the term provokes nerd rage in some, it is merely a new
embodiment of an old idea: the Unix Philosophy
- Write programs that do one thing and do it well.
- Write programs to work together.
- Write programs to handle text streams, because that is a
universal interface.
Docker in production
• Containers + microservices are great when they work — but what
happens when these systems fail?
• For continuous integration/continuous deployment use cases (and/
or other entirely stateless services), failure is less of a concern…
• But as Docker is increasingly used for workloads that matter, one
can no longer insist that failures don’t happen — or that restarts
will cure any that do
• The ability to understand failure is essential to leap the chasm
from development into meaningful production!
When containers fail...
When containers fail...
When containers fail...
When containers fail...
When containers fail...
When containers fail...
Docker at Joyent
• At Joyent, we have run SmartOS-based containers on the metal
and in multi-tenant production since ~2006
• We wanted to create a best-of-all-worlds platform: the developer
ease of Docker on the production-grade substrate of SmartOS
- We developed a Linux system call interface for SmartOS, allowing
SmartOS to run Linux binaries at bare-metal speed
- In March 2015, we introduced Triton, our (open source!) stack
that deploys Docker containers directly on the metal
- Triton virtualizes the notion of a Docker host (i.e., “docker ps”
shows all of one’s containers datacenter-wide)
Debugging Docker
• When deploying Docker + microservices, there is an unstated truth:
you are developing a distributed system
• While more resilient to certain classes of force majeure failure,
distributed systems remain vulnerable to software defects
• We must be able to debug such systems; hope is not a strategy!
• Distributed systems are hard to debug — and are more likely to
exhibit behavior non-reproducible in development
• Docker forces us to change the way we debug systems: we must
debug not in terms of sick pets but rather sick cattle
Software failure
• Different failure modes have different implications for debugging!
• And software has many different failure modes:
- Fatal failure (segmentation violation, uncaught exception)
- Non-fatal failure (gives the wrong answer, performs terribly)
- Explicit failure (assertion failure, error message)
- Implicit failure (cheerfully does the wrong thing)
Taxonomizing software failure
Implicit
Explicit
FatalNon-fatal
Segmentation violation
Bus Error
Panic
Type Error
Uncaught Exception
Assertion failure
Process explicitly aborts
Exits with an error code
Gives the wrong answer
Returns the wrong result
Leaks resources
Stops doing work
Performs pathologically
Emits an error message
Returns an error code
Debugging fatal failure
• When software fails fatally, we know that the software itself is
broken — its state has become inconsistent
• By saving in-memory state to stable storage, the software can be
debugged postmortem
• To debug, one starts with the invalid state and reasons backwards
to discover a transition from a valid state to an invalid one
• This technique is so old, that the terms for this saved state dates
back to the dawn of the computing age: a core dump
• Not as low-level as the name implies! Modern high-level languages
(e.g., node.js and Go) have capabilities for postmortem debugging
Debugging fatal failure: Containers
• Postmortem analysis lends itself very well to the container model:
- There is no run-time overhead; overhead (such as it is) is only at
the time of death
- The container can be safely (automatically!) restarted; the core
dump can be analyzed asynchronously
- Debugging tooling can be made arbitrarily rich, as it need not
exist within the failing container
Core dump management in Docker
• In Triton, all core dumps are automatically stored and then
uploaded into a system that allows for analysis, tagging, etc.
• This has been invaluable for debugging our own services!
• Outside of Triton, the lack of container awareness around
core_pattern in the Linux kernel is problematic for Docker: core
dumps from Docker are still a bit rocky (viz. docker#11740)
• Docker-based core dump management (e.g., “docker dumps”?)
would be a welcome addition!
Debugging non-fatal failure
• There is a solace in fatal failure: it always represents a software
defect at some level — and the inconsistent state is static
• Non-fatal failure can be more challenging: the state is valid and
dynamic — and it can be difficult to separate symptom from cause
• Non-fatal failure must still be understood empirically!
• Debugging in vivo requires that data be extracted from the system
— either of its own volition (e.g., via logs) or by coercion (e.g., via
instrumentation)
Debugging explicit, non-fatal failure
• When failure is explicit (e.g., an error or warning message), it
provides a very important data point
• If failure is non-reproducible or otherwise transient, analysis of
explicit software activity becomes essential
• Action in one container will often need to be associated with
failures in another
• For modern software, this becomes log analysis, and is an essential
forensic tool for understanding explicit failure
Log management in Docker
• “docker logs” is fine when the problem is simple — but more
complicated issues will require more sophisticated analysis
• Deeper analysis requires logs be moved out of a container
• Docker is not prescriptive about how this is done, and there are
many ways to do it:
- Logs can be shipped from a process within the container
- Logs can be pulled from a container that is sharing a volume
• Log management techniques that rely on Docker host manipulation
should be considered an anti-pattern!
Aside: Docker host anti-patterns
• In the traditional Docker model, Docker hosts are virtual machines
to which containers are directly provisioned
• It may become tempting to manipulate Docker hosts directly, but
doing this entirely compromises the Docker security model
• Worse, compromising the security model creates a VM dependency
that makes bare-metal containers impossible
• And ironically, Docker hosts can resemble pets: the reasons for
backdooring through the Docker host can come to resemble the
arguments made by those who resist containerization entirely!
Debugging implicit, non-fatal failure
• Problems that are both implicit and non-fatal represent the most
time-consuming, most difficult problems to debug because the
system must be understood against its will
- Wherever possible make software explicit about failure!
- Where errors are programmatic (and not operational), they
should always induce fatal failure!
• Data must be coerced from the system via instrumentation
Instrumenting production systems
• Traditionally, software instrumentation was hard-coded and static
(necessitating software restart or — worse — recompile)
• Dynamic system instrumentation was historically limited to system
call table (strace/truss) or packet capture (tcpdump/snoop)
• Effective for some problems, but a poor fit for ad hoc analysis
• In 2003, Sun developed DTrace, a facility for arbitrary, dynamic
instrumentation of production systems that has since been ported
to Mac OS X, FreeBSD, NetBSD and (to a degree) Linux
• DTrace has inspired dynamic instrumentation software in other
systems (see Brendan Gregg’s talks for details)
Instrumenting Docker containers
• In Docker, instrumentation is a challenge as containers may not
include the tooling necessary to understand the system
• Host-based techniques for instrumentation may be tempting, but
(again!) they should be considered an anti-pattern!
• DTrace has a privilege model that allows it to be safely (and
usefully) used from within a container
• In Triton, DTrace is available from within every container — one
can “docker exec -it bash” and then debug interactively
Debugging Docker in production
• Debugging Docker in production requires us to shift our thinking
• Different types of failures necessitate different techniques:
- Fatal failure is best debugged via postmortem analysis — which is
particular appropriate in an all-container world
- Non-fatal failure necessitates log analysis and dynamic
instrumentation
• The ability to debug production problems is essential to accelerate
Docker into broad production deployment!
Thank you
Bryan Cantrill
@bcantrill, bryan@joyent.com

More Related Content

Viewers also liked

The Peril and Promise of Early Adoption: Arriving 10 Years Early to Containers
The Peril and Promise of Early Adoption: Arriving 10 Years Early to ContainersThe Peril and Promise of Early Adoption: Arriving 10 Years Early to Containers
The Peril and Promise of Early Adoption: Arriving 10 Years Early to Containersbcantrill
 
Run containers on bare metal already!
Run containers on bare metal already!Run containers on bare metal already!
Run containers on bare metal already!bcantrill
 
Papers We Love: Jails and Zones
Papers We Love: Jails and ZonesPapers We Love: Jails and Zones
Papers We Love: Jails and Zonesbcantrill
 
Docker's Killer Feature: The Remote API
Docker's Killer Feature: The Remote APIDocker's Killer Feature: The Remote API
Docker's Killer Feature: The Remote APIbcantrill
 
The State of Cloud 2016: The whirlwind of creative destruction
The State of Cloud 2016: The whirlwind of creative destructionThe State of Cloud 2016: The whirlwind of creative destruction
The State of Cloud 2016: The whirlwind of creative destructionbcantrill
 
The Container Revolution: Reflections after the first decade
The Container Revolution: Reflections after the first decadeThe Container Revolution: Reflections after the first decade
The Container Revolution: Reflections after the first decadebcantrill
 
Manta: a new internet-facing object storage facility that features compute by...
Manta: a new internet-facing object storage facility that features compute by...Manta: a new internet-facing object storage facility that features compute by...
Manta: a new internet-facing object storage facility that features compute by...Hakka Labs
 
Intro to Joyent's Manta Object Storage Service
Intro to Joyent's Manta Object Storage ServiceIntro to Joyent's Manta Object Storage Service
Intro to Joyent's Manta Object Storage ServiceRod Boothby
 
Sstic 2015 detailed_version_triton_concolic_execution_frame_work_f_saudel_jsa...
Sstic 2015 detailed_version_triton_concolic_execution_frame_work_f_saudel_jsa...Sstic 2015 detailed_version_triton_concolic_execution_frame_work_f_saudel_jsa...
Sstic 2015 detailed_version_triton_concolic_execution_frame_work_f_saudel_jsa...Jonathan Salwan
 
Upcoming CAR s of Skoda
Upcoming CAR s of SkodaUpcoming CAR s of Skoda
Upcoming CAR s of SkodaJatin Parwani
 
The DIY Punk Rock DevOps Playbook
The DIY Punk Rock DevOps PlaybookThe DIY Punk Rock DevOps Playbook
The DIY Punk Rock DevOps Playbookbcantrill
 
node.js and Containers: Dispatches from the Frontier
node.js and Containers: Dispatches from the Frontiernode.js and Containers: Dispatches from the Frontier
node.js and Containers: Dispatches from the Frontierbcantrill
 
Expect the unexpected: Prepare for failures in microservices
Expect the unexpected: Prepare for failures in microservicesExpect the unexpected: Prepare for failures in microservices
Expect the unexpected: Prepare for failures in microservicesBhakti Mehta
 
Deploying JHipster Microservices
Deploying JHipster MicroservicesDeploying JHipster Microservices
Deploying JHipster MicroservicesJoe Kutner
 
Oral tradition in software engineering: Passing the craft across generations
Oral tradition in software engineering: Passing the craft across generationsOral tradition in software engineering: Passing the craft across generations
Oral tradition in software engineering: Passing the craft across generationsbcantrill
 

Viewers also liked (17)

The Peril and Promise of Early Adoption: Arriving 10 Years Early to Containers
The Peril and Promise of Early Adoption: Arriving 10 Years Early to ContainersThe Peril and Promise of Early Adoption: Arriving 10 Years Early to Containers
The Peril and Promise of Early Adoption: Arriving 10 Years Early to Containers
 
Run containers on bare metal already!
Run containers on bare metal already!Run containers on bare metal already!
Run containers on bare metal already!
 
Papers We Love: Jails and Zones
Papers We Love: Jails and ZonesPapers We Love: Jails and Zones
Papers We Love: Jails and Zones
 
Docker's Killer Feature: The Remote API
Docker's Killer Feature: The Remote APIDocker's Killer Feature: The Remote API
Docker's Killer Feature: The Remote API
 
The State of Cloud 2016: The whirlwind of creative destruction
The State of Cloud 2016: The whirlwind of creative destructionThe State of Cloud 2016: The whirlwind of creative destruction
The State of Cloud 2016: The whirlwind of creative destruction
 
The Container Revolution: Reflections after the first decade
The Container Revolution: Reflections after the first decadeThe Container Revolution: Reflections after the first decade
The Container Revolution: Reflections after the first decade
 
Manta: a new internet-facing object storage facility that features compute by...
Manta: a new internet-facing object storage facility that features compute by...Manta: a new internet-facing object storage facility that features compute by...
Manta: a new internet-facing object storage facility that features compute by...
 
Intro to Joyent's Manta Object Storage Service
Intro to Joyent's Manta Object Storage ServiceIntro to Joyent's Manta Object Storage Service
Intro to Joyent's Manta Object Storage Service
 
Digital concepts
Digital conceptsDigital concepts
Digital concepts
 
Sstic 2015 detailed_version_triton_concolic_execution_frame_work_f_saudel_jsa...
Sstic 2015 detailed_version_triton_concolic_execution_frame_work_f_saudel_jsa...Sstic 2015 detailed_version_triton_concolic_execution_frame_work_f_saudel_jsa...
Sstic 2015 detailed_version_triton_concolic_execution_frame_work_f_saudel_jsa...
 
DDTrace
DDTraceDDTrace
DDTrace
 
Upcoming CAR s of Skoda
Upcoming CAR s of SkodaUpcoming CAR s of Skoda
Upcoming CAR s of Skoda
 
The DIY Punk Rock DevOps Playbook
The DIY Punk Rock DevOps PlaybookThe DIY Punk Rock DevOps Playbook
The DIY Punk Rock DevOps Playbook
 
node.js and Containers: Dispatches from the Frontier
node.js and Containers: Dispatches from the Frontiernode.js and Containers: Dispatches from the Frontier
node.js and Containers: Dispatches from the Frontier
 
Expect the unexpected: Prepare for failures in microservices
Expect the unexpected: Prepare for failures in microservicesExpect the unexpected: Prepare for failures in microservices
Expect the unexpected: Prepare for failures in microservices
 
Deploying JHipster Microservices
Deploying JHipster MicroservicesDeploying JHipster Microservices
Deploying JHipster Microservices
 
Oral tradition in software engineering: Passing the craft across generations
Oral tradition in software engineering: Passing the craft across generationsOral tradition in software engineering: Passing the craft across generations
Oral tradition in software engineering: Passing the craft across generations
 

More from bcantrill

Predicting the Present
Predicting the PresentPredicting the Present
Predicting the Presentbcantrill
 
Sharpening the Axe: The Primacy of Toolmaking
Sharpening the Axe: The Primacy of ToolmakingSharpening the Axe: The Primacy of Toolmaking
Sharpening the Axe: The Primacy of Toolmakingbcantrill
 
Coming of Age: Developing young technologists without robbing them of their y...
Coming of Age: Developing young technologists without robbing them of their y...Coming of Age: Developing young technologists without robbing them of their y...
Coming of Age: Developing young technologists without robbing them of their y...bcantrill
 
I have come to bury the BIOS, not to open it: The need for holistic systems
I have come to bury the BIOS, not to open it: The need for holistic systemsI have come to bury the BIOS, not to open it: The need for holistic systems
I have come to bury the BIOS, not to open it: The need for holistic systemsbcantrill
 
Towards Holistic Systems
Towards Holistic SystemsTowards Holistic Systems
Towards Holistic Systemsbcantrill
 
The Coming Firmware Revolution
The Coming Firmware RevolutionThe Coming Firmware Revolution
The Coming Firmware Revolutionbcantrill
 
Hardware/software Co-design: The Coming Golden Age
Hardware/software Co-design: The Coming Golden AgeHardware/software Co-design: The Coming Golden Age
Hardware/software Co-design: The Coming Golden Agebcantrill
 
Tockilator: Deducing Tock execution flows from Ibex Verilator traces
Tockilator: Deducing Tock execution flows from Ibex Verilator tracesTockilator: Deducing Tock execution flows from Ibex Verilator traces
Tockilator: Deducing Tock execution flows from Ibex Verilator tracesbcantrill
 
No Moore Left to Give: Enterprise Computing After Moore's Law
No Moore Left to Give: Enterprise Computing After Moore's LawNo Moore Left to Give: Enterprise Computing After Moore's Law
No Moore Left to Give: Enterprise Computing After Moore's Lawbcantrill
 
Andreessen's Corollary: Ethical Dilemmas in Software Engineering
Andreessen's Corollary: Ethical Dilemmas in Software EngineeringAndreessen's Corollary: Ethical Dilemmas in Software Engineering
Andreessen's Corollary: Ethical Dilemmas in Software Engineeringbcantrill
 
Visualizing Systems with Statemaps
Visualizing Systems with StatemapsVisualizing Systems with Statemaps
Visualizing Systems with Statemapsbcantrill
 
Platform values, Rust, and the implications for system software
Platform values, Rust, and the implications for system softwarePlatform values, Rust, and the implications for system software
Platform values, Rust, and the implications for system softwarebcantrill
 
Is it time to rewrite the operating system in Rust?
Is it time to rewrite the operating system in Rust?Is it time to rewrite the operating system in Rust?
Is it time to rewrite the operating system in Rust?bcantrill
 
dtrace.conf(16): DTrace state of the union
dtrace.conf(16): DTrace state of the uniondtrace.conf(16): DTrace state of the union
dtrace.conf(16): DTrace state of the unionbcantrill
 
The Hurricane's Butterfly: Debugging pathologically performing systems
The Hurricane's Butterfly: Debugging pathologically performing systemsThe Hurricane's Butterfly: Debugging pathologically performing systems
The Hurricane's Butterfly: Debugging pathologically performing systemsbcantrill
 
Papers We Love: ARC after dark
Papers We Love: ARC after darkPapers We Love: ARC after dark
Papers We Love: ARC after darkbcantrill
 
Principles of Technology Leadership
Principles of Technology LeadershipPrinciples of Technology Leadership
Principles of Technology Leadershipbcantrill
 
Zebras all the way down: The engineering challenges of the data path
Zebras all the way down: The engineering challenges of the data pathZebras all the way down: The engineering challenges of the data path
Zebras all the way down: The engineering challenges of the data pathbcantrill
 
Platform as reflection of values: Joyent, node.js, and beyond
Platform as reflection of values: Joyent, node.js, and beyondPlatform as reflection of values: Joyent, node.js, and beyond
Platform as reflection of values: Joyent, node.js, and beyondbcantrill
 
Debugging under fire: Keeping your head when systems have lost their mind
Debugging under fire: Keeping your head when systems have lost their mindDebugging under fire: Keeping your head when systems have lost their mind
Debugging under fire: Keeping your head when systems have lost their mindbcantrill
 

More from bcantrill (20)

Predicting the Present
Predicting the PresentPredicting the Present
Predicting the Present
 
Sharpening the Axe: The Primacy of Toolmaking
Sharpening the Axe: The Primacy of ToolmakingSharpening the Axe: The Primacy of Toolmaking
Sharpening the Axe: The Primacy of Toolmaking
 
Coming of Age: Developing young technologists without robbing them of their y...
Coming of Age: Developing young technologists without robbing them of their y...Coming of Age: Developing young technologists without robbing them of their y...
Coming of Age: Developing young technologists without robbing them of their y...
 
I have come to bury the BIOS, not to open it: The need for holistic systems
I have come to bury the BIOS, not to open it: The need for holistic systemsI have come to bury the BIOS, not to open it: The need for holistic systems
I have come to bury the BIOS, not to open it: The need for holistic systems
 
Towards Holistic Systems
Towards Holistic SystemsTowards Holistic Systems
Towards Holistic Systems
 
The Coming Firmware Revolution
The Coming Firmware RevolutionThe Coming Firmware Revolution
The Coming Firmware Revolution
 
Hardware/software Co-design: The Coming Golden Age
Hardware/software Co-design: The Coming Golden AgeHardware/software Co-design: The Coming Golden Age
Hardware/software Co-design: The Coming Golden Age
 
Tockilator: Deducing Tock execution flows from Ibex Verilator traces
Tockilator: Deducing Tock execution flows from Ibex Verilator tracesTockilator: Deducing Tock execution flows from Ibex Verilator traces
Tockilator: Deducing Tock execution flows from Ibex Verilator traces
 
No Moore Left to Give: Enterprise Computing After Moore's Law
No Moore Left to Give: Enterprise Computing After Moore's LawNo Moore Left to Give: Enterprise Computing After Moore's Law
No Moore Left to Give: Enterprise Computing After Moore's Law
 
Andreessen's Corollary: Ethical Dilemmas in Software Engineering
Andreessen's Corollary: Ethical Dilemmas in Software EngineeringAndreessen's Corollary: Ethical Dilemmas in Software Engineering
Andreessen's Corollary: Ethical Dilemmas in Software Engineering
 
Visualizing Systems with Statemaps
Visualizing Systems with StatemapsVisualizing Systems with Statemaps
Visualizing Systems with Statemaps
 
Platform values, Rust, and the implications for system software
Platform values, Rust, and the implications for system softwarePlatform values, Rust, and the implications for system software
Platform values, Rust, and the implications for system software
 
Is it time to rewrite the operating system in Rust?
Is it time to rewrite the operating system in Rust?Is it time to rewrite the operating system in Rust?
Is it time to rewrite the operating system in Rust?
 
dtrace.conf(16): DTrace state of the union
dtrace.conf(16): DTrace state of the uniondtrace.conf(16): DTrace state of the union
dtrace.conf(16): DTrace state of the union
 
The Hurricane's Butterfly: Debugging pathologically performing systems
The Hurricane's Butterfly: Debugging pathologically performing systemsThe Hurricane's Butterfly: Debugging pathologically performing systems
The Hurricane's Butterfly: Debugging pathologically performing systems
 
Papers We Love: ARC after dark
Papers We Love: ARC after darkPapers We Love: ARC after dark
Papers We Love: ARC after dark
 
Principles of Technology Leadership
Principles of Technology LeadershipPrinciples of Technology Leadership
Principles of Technology Leadership
 
Zebras all the way down: The engineering challenges of the data path
Zebras all the way down: The engineering challenges of the data pathZebras all the way down: The engineering challenges of the data path
Zebras all the way down: The engineering challenges of the data path
 
Platform as reflection of values: Joyent, node.js, and beyond
Platform as reflection of values: Joyent, node.js, and beyondPlatform as reflection of values: Joyent, node.js, and beyond
Platform as reflection of values: Joyent, node.js, and beyond
 
Debugging under fire: Keeping your head when systems have lost their mind
Debugging under fire: Keeping your head when systems have lost their mindDebugging under fire: Keeping your head when systems have lost their mind
Debugging under fire: Keeping your head when systems have lost their mind
 

Recently uploaded

Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...
Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...
Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...Cizo Technology Services
 
How to submit a standout Adobe Champion Application
How to submit a standout Adobe Champion ApplicationHow to submit a standout Adobe Champion Application
How to submit a standout Adobe Champion ApplicationBradBedford3
 
Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...
Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...
Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...OnePlan Solutions
 
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptx
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptxKnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptx
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptxTier1 app
 
Balasore Best It Company|| Top 10 IT Company || Balasore Software company Odisha
Balasore Best It Company|| Top 10 IT Company || Balasore Software company OdishaBalasore Best It Company|| Top 10 IT Company || Balasore Software company Odisha
Balasore Best It Company|| Top 10 IT Company || Balasore Software company Odishasmiwainfosol
 
Intelligent Home Wi-Fi Solutions | ThinkPalm
Intelligent Home Wi-Fi Solutions | ThinkPalmIntelligent Home Wi-Fi Solutions | ThinkPalm
Intelligent Home Wi-Fi Solutions | ThinkPalmSujith Sukumaran
 
Unveiling Design Patterns: A Visual Guide with UML Diagrams
Unveiling Design Patterns: A Visual Guide with UML DiagramsUnveiling Design Patterns: A Visual Guide with UML Diagrams
Unveiling Design Patterns: A Visual Guide with UML DiagramsAhmed Mohamed
 
VK Business Profile - provides IT solutions and Web Development
VK Business Profile - provides IT solutions and Web DevelopmentVK Business Profile - provides IT solutions and Web Development
VK Business Profile - provides IT solutions and Web Developmentvyaparkranti
 
Software Project Health Check: Best Practices and Techniques for Your Product...
Software Project Health Check: Best Practices and Techniques for Your Product...Software Project Health Check: Best Practices and Techniques for Your Product...
Software Project Health Check: Best Practices and Techniques for Your Product...Velvetech LLC
 
PREDICTING RIVER WATER QUALITY ppt presentation
PREDICTING  RIVER  WATER QUALITY  ppt presentationPREDICTING  RIVER  WATER QUALITY  ppt presentation
PREDICTING RIVER WATER QUALITY ppt presentationvaddepallysandeep122
 
SpotFlow: Tracking Method Calls and States at Runtime
SpotFlow: Tracking Method Calls and States at RuntimeSpotFlow: Tracking Method Calls and States at Runtime
SpotFlow: Tracking Method Calls and States at Runtimeandrehoraa
 
SuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte Germany
SuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte GermanySuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte Germany
SuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte GermanyChristoph Pohl
 
cpct NetworkING BASICS AND NETWORK TOOL.ppt
cpct NetworkING BASICS AND NETWORK TOOL.pptcpct NetworkING BASICS AND NETWORK TOOL.ppt
cpct NetworkING BASICS AND NETWORK TOOL.pptrcbcrtm
 
Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024
Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024
Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024StefanoLambiase
 
React Server Component in Next.js by Hanief Utama
React Server Component in Next.js by Hanief UtamaReact Server Component in Next.js by Hanief Utama
React Server Component in Next.js by Hanief UtamaHanief Utama
 
Software Coding for software engineering
Software Coding for software engineeringSoftware Coding for software engineering
Software Coding for software engineeringssuserb3a23b
 
Ahmed Motair CV April 2024 (Senior SW Developer)
Ahmed Motair CV April 2024 (Senior SW Developer)Ahmed Motair CV April 2024 (Senior SW Developer)
Ahmed Motair CV April 2024 (Senior SW Developer)Ahmed Mater
 

Recently uploaded (20)

Advantages of Odoo ERP 17 for Your Business
Advantages of Odoo ERP 17 for Your BusinessAdvantages of Odoo ERP 17 for Your Business
Advantages of Odoo ERP 17 for Your Business
 
Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...
Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...
Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...
 
How to submit a standout Adobe Champion Application
How to submit a standout Adobe Champion ApplicationHow to submit a standout Adobe Champion Application
How to submit a standout Adobe Champion Application
 
Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...
Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...
Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...
 
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptx
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptxKnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptx
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptx
 
Balasore Best It Company|| Top 10 IT Company || Balasore Software company Odisha
Balasore Best It Company|| Top 10 IT Company || Balasore Software company OdishaBalasore Best It Company|| Top 10 IT Company || Balasore Software company Odisha
Balasore Best It Company|| Top 10 IT Company || Balasore Software company Odisha
 
Intelligent Home Wi-Fi Solutions | ThinkPalm
Intelligent Home Wi-Fi Solutions | ThinkPalmIntelligent Home Wi-Fi Solutions | ThinkPalm
Intelligent Home Wi-Fi Solutions | ThinkPalm
 
Unveiling Design Patterns: A Visual Guide with UML Diagrams
Unveiling Design Patterns: A Visual Guide with UML DiagramsUnveiling Design Patterns: A Visual Guide with UML Diagrams
Unveiling Design Patterns: A Visual Guide with UML Diagrams
 
VK Business Profile - provides IT solutions and Web Development
VK Business Profile - provides IT solutions and Web DevelopmentVK Business Profile - provides IT solutions and Web Development
VK Business Profile - provides IT solutions and Web Development
 
Software Project Health Check: Best Practices and Techniques for Your Product...
Software Project Health Check: Best Practices and Techniques for Your Product...Software Project Health Check: Best Practices and Techniques for Your Product...
Software Project Health Check: Best Practices and Techniques for Your Product...
 
Odoo Development Company in India | Devintelle Consulting Service
Odoo Development Company in India | Devintelle Consulting ServiceOdoo Development Company in India | Devintelle Consulting Service
Odoo Development Company in India | Devintelle Consulting Service
 
PREDICTING RIVER WATER QUALITY ppt presentation
PREDICTING  RIVER  WATER QUALITY  ppt presentationPREDICTING  RIVER  WATER QUALITY  ppt presentation
PREDICTING RIVER WATER QUALITY ppt presentation
 
SpotFlow: Tracking Method Calls and States at Runtime
SpotFlow: Tracking Method Calls and States at RuntimeSpotFlow: Tracking Method Calls and States at Runtime
SpotFlow: Tracking Method Calls and States at Runtime
 
SuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte Germany
SuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte GermanySuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte Germany
SuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte Germany
 
cpct NetworkING BASICS AND NETWORK TOOL.ppt
cpct NetworkING BASICS AND NETWORK TOOL.pptcpct NetworkING BASICS AND NETWORK TOOL.ppt
cpct NetworkING BASICS AND NETWORK TOOL.ppt
 
Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024
Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024
Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024
 
2.pdf Ejercicios de programación competitiva
2.pdf Ejercicios de programación competitiva2.pdf Ejercicios de programación competitiva
2.pdf Ejercicios de programación competitiva
 
React Server Component in Next.js by Hanief Utama
React Server Component in Next.js by Hanief UtamaReact Server Component in Next.js by Hanief Utama
React Server Component in Next.js by Hanief Utama
 
Software Coding for software engineering
Software Coding for software engineeringSoftware Coding for software engineering
Software Coding for software engineering
 
Ahmed Motair CV April 2024 (Senior SW Developer)
Ahmed Motair CV April 2024 (Senior SW Developer)Ahmed Motair CV April 2024 (Senior SW Developer)
Ahmed Motair CV April 2024 (Senior SW Developer)
 

Running aground: Debugging Docker in production

  • 1. Running Aground: Debugging Docker in production Bryan Cantrill (@bcantrill), CTO, Joyent
  • 2. The Docker revolution • While OS containers have been around for over a decade, Docker has brought the concept to a much broader audience • Docker has used the rapid provisioning and shared filesystem of containers to allow developers to think operationally - Deployment procedures can be encoded via an image - Images can be reliably and reproducibly deployed as containers • Docker is doing to apt what apt did to tar
  • 3. Docker + microservices • Docker is particular apt at deploying microservices: small, well- defined services that do one thing and do it well • While the term provokes nerd rage in some, it is merely a new embodiment of an old idea: the Unix Philosophy - Write programs that do one thing and do it well. - Write programs to work together. - Write programs to handle text streams, because that is a universal interface.
  • 4. Docker in production • Containers + microservices are great when they work — but what happens when these systems fail? • For continuous integration/continuous deployment use cases (and/ or other entirely stateless services), failure is less of a concern… • But as Docker is increasingly used for workloads that matter, one can no longer insist that failures don’t happen — or that restarts will cure any that do • The ability to understand failure is essential to leap the chasm from development into meaningful production!
  • 11. Docker at Joyent • At Joyent, we have run SmartOS-based containers on the metal and in multi-tenant production since ~2006 • We wanted to create a best-of-all-worlds platform: the developer ease of Docker on the production-grade substrate of SmartOS - We developed a Linux system call interface for SmartOS, allowing SmartOS to run Linux binaries at bare-metal speed - In March 2015, we introduced Triton, our (open source!) stack that deploys Docker containers directly on the metal - Triton virtualizes the notion of a Docker host (i.e., “docker ps” shows all of one’s containers datacenter-wide)
  • 12. Debugging Docker • When deploying Docker + microservices, there is an unstated truth: you are developing a distributed system • While more resilient to certain classes of force majeure failure, distributed systems remain vulnerable to software defects • We must be able to debug such systems; hope is not a strategy! • Distributed systems are hard to debug — and are more likely to exhibit behavior non-reproducible in development • Docker forces us to change the way we debug systems: we must debug not in terms of sick pets but rather sick cattle
  • 13. Software failure • Different failure modes have different implications for debugging! • And software has many different failure modes: - Fatal failure (segmentation violation, uncaught exception) - Non-fatal failure (gives the wrong answer, performs terribly) - Explicit failure (assertion failure, error message) - Implicit failure (cheerfully does the wrong thing)
  • 14. Taxonomizing software failure Implicit Explicit FatalNon-fatal Segmentation violation Bus Error Panic Type Error Uncaught Exception Assertion failure Process explicitly aborts Exits with an error code Gives the wrong answer Returns the wrong result Leaks resources Stops doing work Performs pathologically Emits an error message Returns an error code
  • 15. Debugging fatal failure • When software fails fatally, we know that the software itself is broken — its state has become inconsistent • By saving in-memory state to stable storage, the software can be debugged postmortem • To debug, one starts with the invalid state and reasons backwards to discover a transition from a valid state to an invalid one • This technique is so old, that the terms for this saved state dates back to the dawn of the computing age: a core dump • Not as low-level as the name implies! Modern high-level languages (e.g., node.js and Go) have capabilities for postmortem debugging
  • 16. Debugging fatal failure: Containers • Postmortem analysis lends itself very well to the container model: - There is no run-time overhead; overhead (such as it is) is only at the time of death - The container can be safely (automatically!) restarted; the core dump can be analyzed asynchronously - Debugging tooling can be made arbitrarily rich, as it need not exist within the failing container
  • 17. Core dump management in Docker • In Triton, all core dumps are automatically stored and then uploaded into a system that allows for analysis, tagging, etc. • This has been invaluable for debugging our own services! • Outside of Triton, the lack of container awareness around core_pattern in the Linux kernel is problematic for Docker: core dumps from Docker are still a bit rocky (viz. docker#11740) • Docker-based core dump management (e.g., “docker dumps”?) would be a welcome addition!
  • 18. Debugging non-fatal failure • There is a solace in fatal failure: it always represents a software defect at some level — and the inconsistent state is static • Non-fatal failure can be more challenging: the state is valid and dynamic — and it can be difficult to separate symptom from cause • Non-fatal failure must still be understood empirically! • Debugging in vivo requires that data be extracted from the system — either of its own volition (e.g., via logs) or by coercion (e.g., via instrumentation)
  • 19. Debugging explicit, non-fatal failure • When failure is explicit (e.g., an error or warning message), it provides a very important data point • If failure is non-reproducible or otherwise transient, analysis of explicit software activity becomes essential • Action in one container will often need to be associated with failures in another • For modern software, this becomes log analysis, and is an essential forensic tool for understanding explicit failure
  • 20. Log management in Docker • “docker logs” is fine when the problem is simple — but more complicated issues will require more sophisticated analysis • Deeper analysis requires logs be moved out of a container • Docker is not prescriptive about how this is done, and there are many ways to do it: - Logs can be shipped from a process within the container - Logs can be pulled from a container that is sharing a volume • Log management techniques that rely on Docker host manipulation should be considered an anti-pattern!
  • 21. Aside: Docker host anti-patterns • In the traditional Docker model, Docker hosts are virtual machines to which containers are directly provisioned • It may become tempting to manipulate Docker hosts directly, but doing this entirely compromises the Docker security model • Worse, compromising the security model creates a VM dependency that makes bare-metal containers impossible • And ironically, Docker hosts can resemble pets: the reasons for backdooring through the Docker host can come to resemble the arguments made by those who resist containerization entirely!
  • 22. Debugging implicit, non-fatal failure • Problems that are both implicit and non-fatal represent the most time-consuming, most difficult problems to debug because the system must be understood against its will - Wherever possible make software explicit about failure! - Where errors are programmatic (and not operational), they should always induce fatal failure! • Data must be coerced from the system via instrumentation
  • 23. Instrumenting production systems • Traditionally, software instrumentation was hard-coded and static (necessitating software restart or — worse — recompile) • Dynamic system instrumentation was historically limited to system call table (strace/truss) or packet capture (tcpdump/snoop) • Effective for some problems, but a poor fit for ad hoc analysis • In 2003, Sun developed DTrace, a facility for arbitrary, dynamic instrumentation of production systems that has since been ported to Mac OS X, FreeBSD, NetBSD and (to a degree) Linux • DTrace has inspired dynamic instrumentation software in other systems (see Brendan Gregg’s talks for details)
  • 24. Instrumenting Docker containers • In Docker, instrumentation is a challenge as containers may not include the tooling necessary to understand the system • Host-based techniques for instrumentation may be tempting, but (again!) they should be considered an anti-pattern! • DTrace has a privilege model that allows it to be safely (and usefully) used from within a container • In Triton, DTrace is available from within every container — one can “docker exec -it bash” and then debug interactively
  • 25. Debugging Docker in production • Debugging Docker in production requires us to shift our thinking • Different types of failures necessitate different techniques: - Fatal failure is best debugged via postmortem analysis — which is particular appropriate in an all-container world - Non-fatal failure necessitates log analysis and dynamic instrumentation • The ability to debug production problems is essential to accelerate Docker into broad production deployment!