Instrumentation of complex systems is necessary, and it addresses the shortcomings of static documentation for those systems. Instrumentation has flaws of its own, but they can be resolved with an intentional kind of documentation.
Given at Write the Docs, Portland OR 2014.
11. The nature of the problem domain:
• Low latency ( < 100ms per transaction )
• Firm real-time system
• Highly concurrent ( > 55 billion transactions per day )
• Global, 24/7 operation
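To give the two figures above a sense of scale, here is a rough back-of-the-envelope sketch. It treats the slide's bounds (55 billion transactions per day, 100 ms per transaction) as if they were exact, assumes traffic is spread evenly over the day (real traffic is almost certainly burstier), and applies Little's Law, which is an addition for illustration and not something the talk itself invokes. All names are hypothetical.

```python
# Back-of-the-envelope load implied by the two bounds on the slide.
# Both constants are the slide's stated bounds, treated here as exact values.
TRANSACTIONS_PER_DAY = 55_000_000_000  # "> 55 billion transactions per day"
LATENCY_BUDGET_S = 0.100               # "< 100ms per transaction"
SECONDS_PER_DAY = 24 * 60 * 60


def average_rate_per_second(per_day: int) -> float:
    """Average arrival rate, assuming load is spread evenly over the day."""
    return per_day / SECONDS_PER_DAY


def in_flight(rate_per_s: float, latency_s: float) -> float:
    """Little's Law: mean items in the system = arrival rate * time in system."""
    return rate_per_s * latency_s


rate = average_rate_per_second(TRANSACTIONS_PER_DAY)
concurrent = in_flight(rate, LATENCY_BUDGET_S)

print(f"{rate:,.0f} transactions/s on average")
print(f"{concurrent:,.0f} transactions in flight at the latency ceiling")
```

Under these assumptions the average works out to several hundred thousand transactions per second, with tens of thousands in flight at any instant. Peak load is not stated in the slides and would be higher.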
13. Complex Systems
• Non-linear feedback
• Tightly coupled to external systems
• Difficult to model, understand
• Usually a solution to some “wicked problem”
14. - C. WEST CHURCHMAN,
- GUEST EDITORIAL: WICKED PROBLEMS
- MANAGEMENT SCIENCE VOL. 4, 1967
“[WICKED PROBLEMS ARE] SOCIAL PROBLEMS WHICH ARE
ILL FORMULATED, WHERE THE INFORMATION IS CONFUSING,
WHERE THERE ARE MANY CLIENTS AND DECISION-MAKERS
WITH CONFLICTING VALUES, AND WHERE THE
RAMIFICATIONS IN THE WHOLE SYSTEM ARE THOROUGHLY
CONFUSING. […] THE ADJECTIVE ‘WICKED’ IS SUPPOSED TO
DESCRIBE THE MISCHIEVOUS AND EVEN EVIL QUALITY OF
THESE PROBLEMS, WHERE PROPOSED ‘SOLUTIONS’ OFTEN
TURN OUT TO BE WORSE THAN THE SYMPTOMS.”
17. HUMANS ARE BAD AT PREDICTING
THE PERFORMANCE OF COMPLEX
SYSTEMS(…). OUR ABILITY TO CREATE
LARGE AND COMPLEX SYSTEMS FOOLS
US INTO BELIEVING THAT WE’RE ALSO
ENTITLED TO UNDERSTAND THEM.
CARLOS BUENO
“MATURE OPTIMIZATION HANDBOOK”
18. The key challenge to
sustaining a complex
system is maintaining
our understanding of it.
27. DAVID E. HOFFMAN
“THE DEAD HAND: THE UNTOLD STORY OF THE COLD
WAR ARMS RACE AND ITS DANGEROUS LEGACY”
ONE OPERATOR (…) WAS CONFUSED BY THE
LOGBOOK. HE CALLED SOMEONE ELSE TO INQUIRE.
“WHAT SHALL I DO?” HE ASKED. “IN THE PROGRAM
THERE ARE INSTRUCTIONS OF WHAT TO DO, AND
THEN A LOT OF THINGS CROSSED OUT.”
THE OTHER PERSON THOUGHT FOR A MINUTE, THEN
REPLIED, “FOLLOW THE CROSSED OUT
INSTRUCTIONS.”
29. ERIC SCHLOSSER
COMMAND AND CONTROL: NUCLEAR WEAPONS, THE
DAMASCUS ACCIDENT, AND THE ILLUSION OF SAFETY
CLEARLY THE TEXTBOOKS (…) DIDN’T TELL YOU
WHAT REALLY HAPPENED IN THE FIELD. (…)
(T)HERE WAS A WAY YOU WERE SUPPOSED TO
DO THINGS – AND THE WAY THINGS GOT DONE.
RFHCO SUITS WERE HOT AND CUMBERSOME
(…) AND IF A MAINTENANCE TASK COULD BE
ACCOMPLISHED QUICKLY WITHOUT AN OFFICER
NOTICING, SOMETIMES THE SUITS WEREN’T
WORN.
31. HENRY S. F. COOPER, JR.
XIII: THE APOLLO FLIGHT THAT FAILED
THE FIRST DISASTER IN SPACE HAD
OCCURRED, AND NO ONE KNEW
WHAT HAD HAPPENED. ON THE
GROUND, THE FLIGHT CONTROLLERS
WERE NOT EVEN SURE THAT
ANYTHING HAD.
38. THIS “COLLECTIVE ENTITY” WAS ORGANIZED
AROUND THE PILOT TO MAKE IT “SAFER
AND MORE EFFICIENT IF THERE WAS A
FOCAL POINT. AND I WAS THE FOCAL
POINT. JIM FED THINGS INTO MY EARS.
THE MOON FED THINGS INTO MY EYES AND
I COULD FEEL THE MACHINE OPERATING.”
COMMANDER DAVID SCOTT
AS QUOTED IN DAVID A. MINDELL'S
DIGITAL APOLLO: HUMAN AND MACHINE IN SPACEFLIGHT
46. Case Study: Exchange Throttling
• All other metrics (run-queue, CPU, network IO) were fine.
• Confirmed that no changes had been made to the running systems via deployment.
• Amazon data showed no network issues to our machines.
54. Case Study: Timeout Jumps
• Timeout jump occurred only in US East; US West was fine.
• All other metrics (as above) checked out.
• System deployment strongly correlated with timeout jump.
• Rollback to the previous release reduced timeouts to acceptable levels.
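The reasoning in this case study (one region jumps, the other does not, and the jump lines up with a deployment) can be sketched as a small before/after comparison per region. This is only an illustration: the function names, the doubling threshold, and every number in the sample data are invented, and nothing here reflects the tooling or data actually used in the talk.

```python
from statistics import mean


def timeout_shift(samples, deploy_ts):
    """Split (timestamp, timeout_rate) samples at the deployment time.

    Returns (mean rate before, mean rate after). Raises StatisticsError
    if either side of the split has no samples.
    """
    before = [rate for ts, rate in samples if ts < deploy_ts]
    after = [rate for ts, rate in samples if ts >= deploy_ts]
    return mean(before), mean(after)


def jumped(samples, deploy_ts, factor=2.0):
    """True if the post-deployment mean exceeds `factor` times the prior mean.

    The default factor of 2.0 is an arbitrary, hypothetical threshold.
    """
    before, after = timeout_shift(samples, deploy_ts)
    return after > factor * before


# Invented sample data: (minute, fraction of requests timing out).
DEPLOY_MINUTE = 3
us_east = [(0, 0.010), (1, 0.012), (2, 0.011), (3, 0.050), (4, 0.055)]
us_west = [(0, 0.010), (1, 0.011), (2, 0.010), (3, 0.011), (4, 0.010)]

for region, samples in (("US East", us_east), ("US West", us_west)):
    print(region, "jumped" if jumped(samples, DEPLOY_MINUTE) else "steady")
```

A comparison like this only shows correlation with the deployment. As in the case study, the rollback is what confirms that the release was responsible.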
59. (THE FIREFIGHTERS) TRIED TO BEAT
DOWN THE FLAMES (OF CHERNOBYL
REACTOR 4). THEY KICKED AT THE
BURNING GRAPHITE WITH THEIR FEET.
… THE DOCTORS KEPT TELLING THEM
THEY’D BEEN POISONED BY GAS.
- SVETLANA ALEXIEVICH
- VOICES FROM CHERNOBYL: THE ORAL HISTORY OF A
NUCLEAR DISASTER
60. It is possible to collect too
much information, or
present it badly.
61. SAFETY SYSTEMS, SUCH AS WARNING
LIGHTS, ARE NECESSARY, BUT THEY HAVE
THE POTENTIAL FOR DECEPTION. (…) ONE OF
THE LESSONS OF COMPLEX SYSTEMS AND
(THREE MILE ISLAND) IS THAT ANY PART OF
THE SYSTEM MIGHT BE INTERACTING WITH
OTHER PARTS IN UNANTICIPATED WAYS.
- CHARLES PERROW
- NORMAL ACCIDENTS: LIVING WITH HIGH-RISK
TECHNOLOGIES
76. IF YOU DON'T TRUST A COMPUTER
BECAUSE SOMETIMES IT DOESN'T TELL
YOU THE TRUTH, TELLING IT TO TELL
YOU TO TRUST IT IS ASKING IT TO LIE
TO YOU SOMETIMES.
MIKE SASSAK,
CURBSIDE
79. I PROPOSE THAT MEN AND WOMEN BE RETURNED TO
WORK AS CONTROLLERS OF MACHINES, AND THAT THE
CONTROL OF PEOPLE BY MACHINES BE CURTAILED. I
PROPOSE, FURTHER, THAT THE EFFECTS OF CHANGES IN
TECHNOLOGY AND ORGANIZATION ON LIFE PATTERNS BE
TAKEN INTO CAREFUL CONSIDERATION, AND THAT THE
CHANGES BE WITHHELD OR INTRODUCED ON THE BASIS
OF THIS CONSIDERATION.
KURT VONNEGUT
PLAYER PIANO