SlideShare a Scribd company logo
1 of 87
Download to read offline
The 7 quests of resilient software design
A guide for the adventurous software engineer

Uwe Friedrichsen – codecentric AG – 2012-2017
Uwe Friedrichsen

IT traveller.
Connecting the dots.
Attracted by uncharted territory.
CTO at codecentric.

https://www.slideshare.net/ufried
https://medium.com/@ufried
 @ufried
You want to do resilient software design ...
... and you expect everything to be like this
But somehow it feels more like that ...
... or even that
What the **** went wrong?
The road to resilience is a twisted one
“7 quests you must complete!”
Quest #1
Understand the business case
“How much money will we earn with it?”
“Does it improve our velocity?”
Resilience is not about making money
Resilience is about not losing money
Lack of resilient software design
Reduced system availability
Users cannot do what they intend to do
Less transactions per time period
Immediate lost revenue
Users get annoyed
Churn rate increases
Delayed lost revenue
Due to non-determinism
of distributed systems
This is at most your
resilience budget
Quest #2
Embrace distributed systems
Everything fails, all the time.
-- Werner Vogels
If X then Y
What we learned in our IT education
If X then maybe Y



This changes
everything!
What we need for distributed systems












We are good at this (due to how our brains work)
Inside process thinking
Reasoning about
deterministic behavior
Designing a complicated system












We are poor at that (due to how our brains work)
Reasoning about
non-deterministic behavior
Across process thinking
Designing a complex system
Yet, we usually use deterministic thinking
to reason about distributed systems
Failures in distributed systems ...

•  Crash failure
•  Omission failure
•  Timing failure
•  Response failure
•  Byzantine failure
... turn seemingly simple issues into very hard ones
Time & Ordering

Leslie Lamport

"Time, clocks, and the
ordering of events in
distributed systems"
Consensus

Leslie Lamport

”The part-time
parliament”
(Paxos)
CAP

Eric A. Brewer

"Towards robust
distributed systems"
Faulty processes

Leslie Lamport,
Robert Shostak,
Marshall Pease

"The Byzantine
generals problem"
Consensus

Michael J. Fischer,
Nancy A. Lynch,
Michael S. Paterson

"Impossibility of
distributed consensus
with one faulty
process” (FLP)
Impossibility

Nancy A. Lynch

”A hundred
impossibility proofs
for distributed
computing"
Embrace distributed systems

•  Distributed systems introduce non-determinism regarding
•  Execution completeness
•  Message ordering
•  Communication timing
•  You will be affected by this at the application level
•  Don’t expect your infrastructure to hide all effects from you
•  Better have a plan to detect and recover from inconsistencies
But do I really need to care?
(The system, I am working on, is not a distributed system)
(Almost) every system is a distributed system
-- Chas Emerick


http://www.infoq.com/presentations/problems-distributed-systems
… and it’s getting “worse”


•  Cloud-based systems
•  Microservices
•  Zero Downtime
•  Mobile & IoT
•  Social Web
Quest #3
Avoid the “100% available” trap
The “100% available” trap, version #1

You: “How should the application respond if a technical failure occurs?”

Business owner: “This must not happen! It is your responsibility to make

 
sure that this will not happen.”
The “100% available” trap, version #2

You: “How do you handle the situation if the service you call does not

 
respond (or does not respond timely)?”

Developer 1: “We did not implement any extra measures. The other service

 
is so important and thus needs to be so highly available that it is

 
not worth any extra effort.”

Developer 2: “Actually, if that service should be down, we would not be able

 
to do anything useful anyway. Thus, it just needs to be up.”
The question is not, if a failure will happen
The question is, when a failure will happen
A short note about availability

Assume a service availability of 99,5% (incl. planned downtimes)

•  10 services involved in a request à 95,1% probability of success
•  50 services involved in a request à 77,8% probability of success
Quest #4
Establish the ops-dev feedback loop
The big wall between Dev and Ops
In a distributed environment, you cannot solve
availability issues on an infrastructure level only
Dev
 Ops
“I implemented something to
improve production availability”
“Here are the figures
how it worked”
Continuous improvement cycle
of resilient software design
Dev is where you
implement your
resilience measures
Build
Measure
Learn
Ops is where your
resilience measures
take effect
Dev
 Ops
“I implemented something to
improve production availability”
“Here are the figures
how it worked”
Continuous improvement cycle
of resilient software design
Dev is where you
implement your
resilience measures
Build
Measure
Learn
Ops is where your
resilience measures
take effect
All developer activities towards
improving robustness are basically
“shooting at the dark” which is neither
effective not sustainable
Having a wall between Dev and Ops
breaks the cycle required to implement
effective robustness measures
For effective resilient software design
you need a working ops-dev feedback loop
Establishing the feedback loop


•  Adopt DevOps
•  Adopt Site Reliability Engineering (SRE)
•  Or do it your own way if you know a better way ...
•  ... but make sure you establish the required feedback loops!
Quest #5
Master functional design
Without proper functional design
nothing else matters
Isolation

•  System must not fail as a whole
•  Split system in parts and isolate parts against each other
•  Avoid cascading failures
•  Foundation of resilient software design
Bulkhead

•  Bulkheads implement the “parts” that need to be isolated
•  Core isolation pattern (a.k.a. “failure units” or “units of mitigation”)
•  Diverse implementation choices available, e.g., (micro)services, actors, SCS, ...
•  Shaping good bulkheads is a pure functional design issue (and extremely hard)
Hmm, sound easy. Why should that be hard?
Service A
 Service B
Request
Due to functional design, Service A
always needs backing from Service B
to be able to answer a client request,

i.e. the isolation is broken by design
How do we avoid this …
Service
Request
Due to functional design we need
to call a lot of services to be able
to answer a client request,

i.e. availability is broken by design
... and this ...
Service
Service
Service
 Service
Service
Service
Service
Service
Service
Service
Service
Service
Mothership Service

(a.k.a. Monolith)
Request
By trying to avoid the aforementioned
issues we ended up with cramming all
required functionality in one big service

i.e. the isolation is broken by design
... without ending up with this?
Let us apply our well-known best practices

•  Divide & conquer a.k.a. functional decomposition
•  DRY (Don’t Repeat Yourself)
•  Design for reusability
•  Layered architecture
•  …
Unfortunately ...
Service A
 Service B
Request
Due to functional design, Service A
always needs backing from Service B
to be able to answer a client request,

i.e. the isolation is broken by design
... this usually leads to this …
Service
Request
Due to functional design we need
to call a lot of services to be able
to answer a client request,

i.e. availability is broken by design
... and this ...
Service
Service
Service
 Service
Service
Service
Service
Service
Service
Service
Service
Service
Mothership Service

(a.k.a. Monolith)
Request
By trying to avoid the aforementioned
issues we ended up with cramming all
required functionality in one big service

i.e. the isolation is broken by design
... and in the end also often to this.
Welcome to distributed hell!
Caches to the rescue!
Service A
 Service B
Request
Due to functional design, Service A
always needs backing from Service B
to be able to answer a client request,

i.e. the isolation is broken by design
CacheofB
Break tight service coupling
by caching data/responses
of downstream service
Caches to the rescue?
Do you really think
that copying stale data all over your system
is a suitable measure
to fix an inherently broken design? *

* Side note: Caches are a very important and powerful measure in many places. But they are not suitable as a cheap fix for a broken functional design
We have to re-learn design
for distributed system
No silver bullet
Yet, a few guiding thoughts ...
Foundations of design

•  “High cohesion, low coupling” & “separation of concerns”
•  “Crucial across process boundaries
•  Still poorly understood issue
•  Start with
•  Understanding organizational boundaries
•  Understanding use cases and flows
•  Identifying functional domains (à DDD)
•  Finding areas that change independently
•  Do not start with a data model!
Short activation paths

•  Long activation paths affect availability
•  Increase likelihood of failures
•  Minimize remote calls per request
•  Need to balance opposing forces
•  Avoid monolith à clear separation of concerns
•  Minimize requests à cluster functionality & data
•  Caches can sometimes help, but stale data as trade-off
Be (extremely) wary of reusability

•  Reusability increases coupling
•  Reusability usually leads to bad service design
•  Reusability compromises availability
•  Reusability rarely pays
•  Do not strive for reusable services
•  Strive for replaceable services instead
•  Try to tackle reusability issues with libraries
Quest #6
Know your toolbox
Core
Detect
 Treat
Prevent
Recover
Mitigate
 Complement
Supporting
patterns
Redundancy
Stateless
Idempotency
Escalation
Zero downtime
deployment
Location
transparency
Relaxed
temporal
constraints
Fallback
Shed load
Share load
Marked data
 Queue for
resources
Bounded queue
Finish work in
progress
Fresh work
before stale
Deferrable work
Communication
paradigm
Isolation
Bulkhead
System level
Monitor
Watchdog
Heartbeat
Either level
Voting
Synthetic
transaction
Leaky
bucket
Routine
checks
Health
check
Fail fast
Let sleeping dogs lie
Small releases
Hot deployments
Routine maintenance
Backup request
Anti-fragility
Diversity
 Jitter
Error
injection
Spread the news
Anti-entropy
Backpressure
Retry
Limit retries
Rollback
Roll-forward
Checkpoint
 Safe point
Failover
Read repair
Error
handler
Reset
Restart
Reconnect
Fail silently
Default value
Node level
Timeout
Circuit breaker
Complete
parameter
checking
Checksum
Statically
Dynamically
Confinement
Acknowledgement
Using resilience patterns

•  Patterns are options, not obligations
•  Don’t pick too many patterns
•  Each pattern increases complexity
•  Complexity is the enemy of robustness
•  Each pattern costs money in dev & ops
•  You only have a limited resilience budget
•  Look for complementary patterns
How other people did it
Core
Detect
 Treat
Prevent
Recover
Mitigate
 Complement
Supporting
patterns
Escalation
Communication
paradigm
Isolation
System level
Monitor
Heartbeat
Either level
Hot deployments
Restart
 Let it crash!
Node level
Actor
Messaging
Erlang (Akka)

Core patterns
Core
Detect
 Treat
Prevent
Recover
Mitigate
 Complement
Supporting
patterns
Fallback
Share load
Bounded
queue
Communication
paradigm
Isolation
System level
Monitor
Either level
Error
injection
Retry
Limit retries
Node level
Circuit breaker
Timeout
Zero downtime
deployment
Canary releases
Redundancy
Several variants
(Micro)service
Request/
response
Netflix

Core patterns
Quest #7
Preserve the collective memory
We face a new generation of developers
every 5 years
We loose our collective memory
every 5 years *


* Mean time until a topic discussion in the community starts over form scratch
Time working in IT
Growth of
knowledge
Depth of
insights
What do we do to compensate this effect?
We look for the new & shiny stuff ...
... as anything not new must be useless crap!
We need to rediscover our insights
every 5 years
In IT, we suffer from
continuous collective amnesia
and we are even proud of it
How can we become better?
Wrap-up
The 7 quests at a glance
Wrap-up

•  The road to resilient software design is a twisted one!
•  Most challenges are only indirectly related to RSD
•  Most challenges are not coding related
•  Mastering functional design is extremely hard ...
•  ... while learning the patterns is relatively easy
•  How do we preserve our collective memory?
Uwe Friedrichsen

IT traveller.
Connecting the dots.
Attracted by uncharted territory.
CTO at codecentric.

https://www.slideshare.net/ufried
https://medium.com/@ufried
 @ufried

More Related Content

What's hot

Stability Patterns for Microservices
Stability Patterns for MicroservicesStability Patterns for Microservices
Stability Patterns for Microservicespflueras
 
Monoliths and Microservices
Monoliths and Microservices Monoliths and Microservices
Monoliths and Microservices Bozhidar Bozhanov
 
Implementing agile iterative project delivery approach and achieving business...
Implementing agile iterative project delivery approach and achieving business...Implementing agile iterative project delivery approach and achieving business...
Implementing agile iterative project delivery approach and achieving business...Alan McSweeney
 
Scalability, Availability & Stability Patterns
Scalability, Availability & Stability PatternsScalability, Availability & Stability Patterns
Scalability, Availability & Stability PatternsJonas Bonér
 
Conhecendo Apache Kafka
Conhecendo Apache KafkaConhecendo Apache Kafka
Conhecendo Apache KafkaRafa Noronha
 
Design patterns for microservice architecture
Design patterns for microservice architectureDesign patterns for microservice architecture
Design patterns for microservice architectureThe Software House
 
Site reliability engineering - Lightning Talk
Site reliability engineering - Lightning TalkSite reliability engineering - Lightning Talk
Site reliability engineering - Lightning TalkMichae Blakeney
 
Dynatrace: New Approach to Digital Performance Management - Gartner Symposium...
Dynatrace: New Approach to Digital Performance Management - Gartner Symposium...Dynatrace: New Approach to Digital Performance Management - Gartner Symposium...
Dynatrace: New Approach to Digital Performance Management - Gartner Symposium...Michael Allen
 
Software Engineering - chp1- software dev methodologies
Software Engineering - chp1- software dev methodologiesSoftware Engineering - chp1- software dev methodologies
Software Engineering - chp1- software dev methodologiesLilia Sfaxi
 
Microservice Architecture
Microservice ArchitectureMicroservice Architecture
Microservice ArchitectureNguyen Tung
 
Microservices in the Apache Kafka Ecosystem
Microservices in the Apache Kafka EcosystemMicroservices in the Apache Kafka Ecosystem
Microservices in the Apache Kafka Ecosystemconfluent
 
SRE-iously! Reliability!
SRE-iously! Reliability!SRE-iously! Reliability!
SRE-iously! Reliability!New Relic
 
Overview of Site Reliability Engineering (SRE) & best practices
Overview of Site Reliability Engineering (SRE) & best practicesOverview of Site Reliability Engineering (SRE) & best practices
Overview of Site Reliability Engineering (SRE) & best practicesAshutosh Agarwal
 
Site Reliability Engineering (SRE) - Tech Talk by Keet Sugathadasa
Site Reliability Engineering (SRE) - Tech Talk by Keet SugathadasaSite Reliability Engineering (SRE) - Tech Talk by Keet Sugathadasa
Site Reliability Engineering (SRE) - Tech Talk by Keet SugathadasaKeet Sugathadasa
 
Lean and Kanban-based Software Development
Lean and Kanban-based Software DevelopmentLean and Kanban-based Software Development
Lean and Kanban-based Software DevelopmentTathagat Varma
 
From distributed caches to in-memory data grids
From distributed caches to in-memory data gridsFrom distributed caches to in-memory data grids
From distributed caches to in-memory data gridsMax Alexejev
 
Conway's law revisited - Architectures for an effective IT
Conway's law revisited - Architectures for an effective ITConway's law revisited - Architectures for an effective IT
Conway's law revisited - Architectures for an effective ITUwe Friedrichsen
 
Microservices Architectures: Become a Unicorn like Netflix, Twitter and Hailo
Microservices Architectures: Become a Unicorn like Netflix, Twitter and HailoMicroservices Architectures: Become a Unicorn like Netflix, Twitter and Hailo
Microservices Architectures: Become a Unicorn like Netflix, Twitter and Hailogjuljo
 
DDD SoCal: Decompose your monolith: Ten principles for refactoring a monolith...
DDD SoCal: Decompose your monolith: Ten principles for refactoring a monolith...DDD SoCal: Decompose your monolith: Ten principles for refactoring a monolith...
DDD SoCal: Decompose your monolith: Ten principles for refactoring a monolith...Chris Richardson
 

What's hot (20)

Stability Patterns for Microservices
Stability Patterns for MicroservicesStability Patterns for Microservices
Stability Patterns for Microservices
 
Monoliths and Microservices
Monoliths and Microservices Monoliths and Microservices
Monoliths and Microservices
 
Implementing agile iterative project delivery approach and achieving business...
Implementing agile iterative project delivery approach and achieving business...Implementing agile iterative project delivery approach and achieving business...
Implementing agile iterative project delivery approach and achieving business...
 
Scalability, Availability & Stability Patterns
Scalability, Availability & Stability PatternsScalability, Availability & Stability Patterns
Scalability, Availability & Stability Patterns
 
Conhecendo Apache Kafka
Conhecendo Apache KafkaConhecendo Apache Kafka
Conhecendo Apache Kafka
 
Design patterns for microservice architecture
Design patterns for microservice architectureDesign patterns for microservice architecture
Design patterns for microservice architecture
 
Site reliability engineering - Lightning Talk
Site reliability engineering - Lightning TalkSite reliability engineering - Lightning Talk
Site reliability engineering - Lightning Talk
 
Dynatrace: New Approach to Digital Performance Management - Gartner Symposium...
Dynatrace: New Approach to Digital Performance Management - Gartner Symposium...Dynatrace: New Approach to Digital Performance Management - Gartner Symposium...
Dynatrace: New Approach to Digital Performance Management - Gartner Symposium...
 
Software Engineering - chp1- software dev methodologies
Software Engineering - chp1- software dev methodologiesSoftware Engineering - chp1- software dev methodologies
Software Engineering - chp1- software dev methodologies
 
Microservice Architecture
Microservice ArchitectureMicroservice Architecture
Microservice Architecture
 
Microservices in the Apache Kafka Ecosystem
Microservices in the Apache Kafka EcosystemMicroservices in the Apache Kafka Ecosystem
Microservices in the Apache Kafka Ecosystem
 
SRE-iously! Reliability!
SRE-iously! Reliability!SRE-iously! Reliability!
SRE-iously! Reliability!
 
Scrum
Scrum Scrum
Scrum
 
Overview of Site Reliability Engineering (SRE) & best practices
Overview of Site Reliability Engineering (SRE) & best practicesOverview of Site Reliability Engineering (SRE) & best practices
Overview of Site Reliability Engineering (SRE) & best practices
 
Site Reliability Engineering (SRE) - Tech Talk by Keet Sugathadasa
Site Reliability Engineering (SRE) - Tech Talk by Keet SugathadasaSite Reliability Engineering (SRE) - Tech Talk by Keet Sugathadasa
Site Reliability Engineering (SRE) - Tech Talk by Keet Sugathadasa
 
Lean and Kanban-based Software Development
Lean and Kanban-based Software DevelopmentLean and Kanban-based Software Development
Lean and Kanban-based Software Development
 
From distributed caches to in-memory data grids
From distributed caches to in-memory data gridsFrom distributed caches to in-memory data grids
From distributed caches to in-memory data grids
 
Conway's law revisited - Architectures for an effective IT
Conway's law revisited - Architectures for an effective ITConway's law revisited - Architectures for an effective IT
Conway's law revisited - Architectures for an effective IT
 
Microservices Architectures: Become a Unicorn like Netflix, Twitter and Hailo
Microservices Architectures: Become a Unicorn like Netflix, Twitter and HailoMicroservices Architectures: Become a Unicorn like Netflix, Twitter and Hailo
Microservices Architectures: Become a Unicorn like Netflix, Twitter and Hailo
 
DDD SoCal: Decompose your monolith: Ten principles for refactoring a monolith...
DDD SoCal: Decompose your monolith: Ten principles for refactoring a monolith...DDD SoCal: Decompose your monolith: Ten principles for refactoring a monolith...
DDD SoCal: Decompose your monolith: Ten principles for refactoring a monolith...
 

Similar to The 7 quests of resilient software design

Timeless design in a cloud-native world
Timeless design in a cloud-native worldTimeless design in a cloud-native world
Timeless design in a cloud-native worldUwe Friedrichsen
 
DevOps and Microservice
DevOps and MicroserviceDevOps and Microservice
DevOps and MicroserviceInho Kang
 
Microservices - stress-free and without increased heart attack risk
Microservices - stress-free and without increased heart attack riskMicroservices - stress-free and without increased heart attack risk
Microservices - stress-free and without increased heart attack riskUwe Friedrichsen
 
Microsoft Microservices
Microsoft MicroservicesMicrosoft Microservices
Microsoft MicroservicesChase Aucoin
 
Smaller is Better - Exploiting Microservice Architectures on AWS - Technical 201
Smaller is Better - Exploiting Microservice Architectures on AWS - Technical 201Smaller is Better - Exploiting Microservice Architectures on AWS - Technical 201
Smaller is Better - Exploiting Microservice Architectures on AWS - Technical 201Amazon Web Services
 
Cloud Native Future
Cloud Native FutureCloud Native Future
Cloud Native FutureJulie Coonce
 
JAXLondon 2015 "DevOps and the Cloud: All Hail the (Developer) King"
JAXLondon 2015 "DevOps and the Cloud: All Hail the (Developer) King"JAXLondon 2015 "DevOps and the Cloud: All Hail the (Developer) King"
JAXLondon 2015 "DevOps and the Cloud: All Hail the (Developer) King"Daniel Bryant
 
DevOps and the cloud: all hail the (developer) king - Daniel Bryant, Steve Poole
DevOps and the cloud: all hail the (developer) king - Daniel Bryant, Steve PooleDevOps and the cloud: all hail the (developer) king - Daniel Bryant, Steve Poole
DevOps and the cloud: all hail the (developer) king - Daniel Bryant, Steve PooleJAXLondon_Conference
 
AWS Summit Auckland - Smaller is Better - Microservices on AWS
AWS Summit Auckland - Smaller is Better - Microservices on AWSAWS Summit Auckland - Smaller is Better - Microservices on AWS
AWS Summit Auckland - Smaller is Better - Microservices on AWSAmazon Web Services
 
Cloud Foundry Summit 2015: Devops, microservices and platforms, oh my!
Cloud Foundry Summit 2015: Devops, microservices and platforms, oh my!Cloud Foundry Summit 2015: Devops, microservices and platforms, oh my!
Cloud Foundry Summit 2015: Devops, microservices and platforms, oh my!VMware Tanzu
 
devops, microservices, and platforms, oh my!
devops, microservices, and platforms, oh my!devops, microservices, and platforms, oh my!
devops, microservices, and platforms, oh my!Andrew Shafer
 
Software Architecture for Agile Development
Software Architecture for Agile DevelopmentSoftware Architecture for Agile Development
Software Architecture for Agile DevelopmentHayim Makabee
 
Microservices - stress-free and without increased heart-attack risk - Uwe Fri...
Microservices - stress-free and without increased heart-attack risk - Uwe Fri...Microservices - stress-free and without increased heart-attack risk - Uwe Fri...
Microservices - stress-free and without increased heart-attack risk - Uwe Fri...distributed matters
 
AWS Innovate: Smaller IS Better – Exploiting Microservices on AWS, Craig Dickson
AWS Innovate: Smaller IS Better – Exploiting Microservices on AWS, Craig DicksonAWS Innovate: Smaller IS Better – Exploiting Microservices on AWS, Craig Dickson
AWS Innovate: Smaller IS Better – Exploiting Microservices on AWS, Craig DicksonAmazon Web Services Korea
 
Software Architecture and Architectors: useless VS valuable
Software Architecture and Architectors: useless VS valuableSoftware Architecture and Architectors: useless VS valuable
Software Architecture and Architectors: useless VS valuableComsysto Reply GmbH
 
Microservices Gone Wrong!
Microservices Gone Wrong!Microservices Gone Wrong!
Microservices Gone Wrong!Bert Ertman
 
Evolving Architecture and Organization - Lessons from Google and eBay
Evolving Architecture and Organization - Lessons from Google and eBayEvolving Architecture and Organization - Lessons from Google and eBay
Evolving Architecture and Organization - Lessons from Google and eBayRandy Shoup
 
devops, platforms and devops platforms
devops, platforms and devops platformsdevops, platforms and devops platforms
devops, platforms and devops platformsVMware Tanzu
 

Similar to The 7 quests of resilient software design (20)

Timeless design in a cloud-native world
Timeless design in a cloud-native worldTimeless design in a cloud-native world
Timeless design in a cloud-native world
 
DevOps and Microservice
DevOps and MicroserviceDevOps and Microservice
DevOps and Microservice
 
Microservices - stress-free and without increased heart attack risk
Microservices - stress-free and without increased heart attack riskMicroservices - stress-free and without increased heart attack risk
Microservices - stress-free and without increased heart attack risk
 
Microsoft Microservices
Microsoft MicroservicesMicrosoft Microservices
Microsoft Microservices
 
Smaller is Better - Exploiting Microservice Architectures on AWS - Technical 201
Smaller is Better - Exploiting Microservice Architectures on AWS - Technical 201Smaller is Better - Exploiting Microservice Architectures on AWS - Technical 201
Smaller is Better - Exploiting Microservice Architectures on AWS - Technical 201
 
Cloud Native Future
Cloud Native FutureCloud Native Future
Cloud Native Future
 
JAXLondon 2015 "DevOps and the Cloud: All Hail the (Developer) King"
JAXLondon 2015 "DevOps and the Cloud: All Hail the (Developer) King"JAXLondon 2015 "DevOps and the Cloud: All Hail the (Developer) King"
JAXLondon 2015 "DevOps and the Cloud: All Hail the (Developer) King"
 
DevOps and the cloud: all hail the (developer) king - Daniel Bryant, Steve Poole
DevOps and the cloud: all hail the (developer) king - Daniel Bryant, Steve PooleDevOps and the cloud: all hail the (developer) king - Daniel Bryant, Steve Poole
DevOps and the cloud: all hail the (developer) king - Daniel Bryant, Steve Poole
 
AWS Summit Auckland - Smaller is Better - Microservices on AWS
AWS Summit Auckland - Smaller is Better - Microservices on AWSAWS Summit Auckland - Smaller is Better - Microservices on AWS
AWS Summit Auckland - Smaller is Better - Microservices on AWS
 
Cloud Foundry Summit 2015: Devops, microservices and platforms, oh my!
Cloud Foundry Summit 2015: Devops, microservices and platforms, oh my!Cloud Foundry Summit 2015: Devops, microservices and platforms, oh my!
Cloud Foundry Summit 2015: Devops, microservices and platforms, oh my!
 
devops, microservices, and platforms, oh my!
devops, microservices, and platforms, oh my!devops, microservices, and platforms, oh my!
devops, microservices, and platforms, oh my!
 
Software Architecture for Agile Development
Software Architecture for Agile DevelopmentSoftware Architecture for Agile Development
Software Architecture for Agile Development
 
The Road to SaaS
The Road to SaaSThe Road to SaaS
The Road to SaaS
 
Microservices - stress-free and without increased heart-attack risk - Uwe Fri...
Microservices - stress-free and without increased heart-attack risk - Uwe Fri...Microservices - stress-free and without increased heart-attack risk - Uwe Fri...
Microservices - stress-free and without increased heart-attack risk - Uwe Fri...
 
No silver bullet
No silver bulletNo silver bullet
No silver bullet
 
AWS Innovate: Smaller IS Better – Exploiting Microservices on AWS, Craig Dickson
AWS Innovate: Smaller IS Better – Exploiting Microservices on AWS, Craig DicksonAWS Innovate: Smaller IS Better – Exploiting Microservices on AWS, Craig Dickson
AWS Innovate: Smaller IS Better – Exploiting Microservices on AWS, Craig Dickson
 
Software Architecture and Architectors: useless VS valuable
Software Architecture and Architectors: useless VS valuableSoftware Architecture and Architectors: useless VS valuable
Software Architecture and Architectors: useless VS valuable
 
Microservices Gone Wrong!
Microservices Gone Wrong!Microservices Gone Wrong!
Microservices Gone Wrong!
 
Evolving Architecture and Organization - Lessons from Google and eBay
Evolving Architecture and Organization - Lessons from Google and eBayEvolving Architecture and Organization - Lessons from Google and eBay
Evolving Architecture and Organization - Lessons from Google and eBay
 
devops, platforms and devops platforms
devops, platforms and devops platformsdevops, platforms and devops platforms
devops, platforms and devops platforms
 

More from Uwe Friedrichsen

The hitchhiker's guide for the confused developer
The hitchhiker's guide for the confused developerThe hitchhiker's guide for the confused developer
The hitchhiker's guide for the confused developerUwe Friedrichsen
 
Digitization solutions - A new breed of software
Digitization solutions - A new breed of softwareDigitization solutions - A new breed of software
Digitization solutions - A new breed of softwareUwe Friedrichsen
 
Real-world consistency explained
Real-world consistency explainedReal-world consistency explained
Real-world consistency explainedUwe Friedrichsen
 
The truth about "You build it, you run it!"
The truth about "You build it, you run it!"The truth about "You build it, you run it!"
The truth about "You build it, you run it!"Uwe Friedrichsen
 
The promises and perils of microservices
The promises and perils of microservicesThe promises and perils of microservices
The promises and perils of microservicesUwe Friedrichsen
 
DevOps is not enough - Embedding DevOps in a broader context
DevOps is not enough - Embedding DevOps in a broader contextDevOps is not enough - Embedding DevOps in a broader context
DevOps is not enough - Embedding DevOps in a broader contextUwe Friedrichsen
 
Towards complex adaptive architectures
Towards complex adaptive architecturesTowards complex adaptive architectures
Towards complex adaptive architecturesUwe Friedrichsen
 
Modern times - architectures for a Next Generation of IT
Modern times - architectures for a Next Generation of ITModern times - architectures for a Next Generation of IT
Modern times - architectures for a Next Generation of ITUwe Friedrichsen
 
The Next Generation (of) IT
The Next Generation (of) ITThe Next Generation (of) IT
The Next Generation (of) ITUwe Friedrichsen
 
Why resilience - A primer at varying flight altitudes
Why resilience - A primer at varying flight altitudesWhy resilience - A primer at varying flight altitudes
Why resilience - A primer at varying flight altitudesUwe Friedrichsen
 

More from Uwe Friedrichsen (20)

Deep learning - a primer
Deep learning - a primerDeep learning - a primer
Deep learning - a primer
 
Life after microservices
Life after microservicesLife after microservices
Life after microservices
 
The hitchhiker's guide for the confused developer
The hitchhiker's guide for the confused developerThe hitchhiker's guide for the confused developer
The hitchhiker's guide for the confused developer
 
Digitization solutions - A new breed of software
Digitization solutions - A new breed of softwareDigitization solutions - A new breed of software
Digitization solutions - A new breed of software
 
Real-world consistency explained
Real-world consistency explainedReal-world consistency explained
Real-world consistency explained
 
The truth about "You build it, you run it!"
The truth about "You build it, you run it!"The truth about "You build it, you run it!"
The truth about "You build it, you run it!"
 
The promises and perils of microservices
The promises and perils of microservicesThe promises and perils of microservices
The promises and perils of microservices
 
Watch your communication
Watch your communicationWatch your communication
Watch your communication
 
Life, IT and everything
Life, IT and everythingLife, IT and everything
Life, IT and everything
 
DevOps is not enough - Embedding DevOps in a broader context
DevOps is not enough - Embedding DevOps in a broader contextDevOps is not enough - Embedding DevOps in a broader context
DevOps is not enough - Embedding DevOps in a broader context
 
Production-ready Software
Production-ready SoftwareProduction-ready Software
Production-ready Software
 
Towards complex adaptive architectures
Towards complex adaptive architecturesTowards complex adaptive architectures
Towards complex adaptive architectures
 
Modern times - architectures for a Next Generation of IT
Modern times - architectures for a Next Generation of ITModern times - architectures for a Next Generation of IT
Modern times - architectures for a Next Generation of IT
 
The Next Generation (of) IT
The Next Generation (of) ITThe Next Generation (of) IT
The Next Generation (of) IT
 
Why resilience - A primer at varying flight altitudes
Why resilience - A primer at varying flight altitudesWhy resilience - A primer at varying flight altitudes
Why resilience - A primer at varying flight altitudes
 
No stress with state
No stress with stateNo stress with state
No stress with state
 
Resilience with Hystrix
Resilience with HystrixResilience with Hystrix
Resilience with Hystrix
 
Self healing data
Self healing dataSelf healing data
Self healing data
 
Devops for Developers
Devops for DevelopersDevops for Developers
Devops for Developers
 
Fantastic Elastic
Fantastic ElasticFantastic Elastic
Fantastic Elastic
 

Recently uploaded

Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxKatpro Technologies
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slidespraypatel2
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processorsdebabhi2
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Igalia
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Servicegiselly40
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024Results
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdfhans926745
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountPuma Security, LLC
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?Antenna Manufacturer Coco
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfEnterprise Knowledge
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Scriptwesley chun
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘RTylerCroy
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUK Journal
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxMalak Abu Hammad
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 

Recently uploaded (20)

Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 

The 7 quests of resilient software design

  • 1. The 7 quests of resilient software design A guide for the adventurous software engineer Uwe Friedrichsen – codecentric AG – 2012-2017
  • 2. Uwe Friedrichsen IT traveller. Connecting the dots. Attracted by uncharted territory. CTO at codecentric. https://www.slideshare.net/ufried https://medium.com/@ufried @ufried
  • 3. You want to do resilient software design ...
  • 4. ... and you expect everything to be like this
  • 5. But somehow it feels more like that ...
  • 6. ... or even that
  • 7. What the **** went wrong?
  • 8. The road to resilience is a twisted one
  • 9. “7 quests you must complete!”
  • 12. “How much money will we earn with it?”
  • 13. “Does it improve our velocity?”
  • 14. Resilience is not about making money Resilience is about not losing money
  • 15. Lack of resilient software design Reduced system availability Users cannot do what they intend to do Less transactions per time period Immediate lost revenue Users get annoyed Churn rate increases Delayed lost revenue Due to non-determinism of distributed systems This is at most your resilience budget
  • 18. Everything fails, all the time. -- Werner Vogels
  • 19. If X then Y What we learned in our IT education If X then maybe Y This changes everything! What we need for distributed systems We are good at this (due to how our brains work) Inside process thinking Reasoning about deterministic behavior Designing a complicated system We are poor at that (due to how our brains work) Reasoning about non-deterministic behavior Across process thinking Designing a complex system
  • 20. Yet, we usually use deterministic thinking to reason about distributed systems
  • 21. Failures in distributed systems ... •  Crash failure •  Omission failure •  Timing failure •  Response failure •  Byzantine failure
  • 22. ... turn seemingly simple issues into very hard ones Time & Ordering Leslie Lamport "Time, clocks, and the ordering of events in distributed systems" Consensus Leslie Lamport ”The part-time parliament” (Paxos) CAP Eric A. Brewer "Towards robust distributed systems" Faulty processes Leslie Lamport, Robert Shostak, Marshall Pease "The Byzantine generals problem" Consensus Michael J. Fischer, Nancy A. Lynch, Michael S. Paterson "Impossibility of distributed consensus with one faulty process” (FLP) Impossibility Nancy A. Lynch ”A hundred impossibility proofs for distributed computing"
  • 23. Embrace distributed systems •  Distributed systems introduce non-determinism regarding •  Execution completeness •  Message ordering •  Communication timing •  You will be affected by this at the application level •  Don’t expect your infrastructure to hide all effects from you •  Better have a plan to detect and recover from inconsistencies
  • 24. But do I really need to care? (The system, I am working on, is not a distributed system)
  • 25. (Almost) every system is a distributed system -- Chas Emerick http://www.infoq.com/presentations/problems-distributed-systems
  • 26. … and it’s getting “worse” •  Cloud-based systems •  Microservices •  Zero Downtime •  Mobile & IoT •  Social Web
  • 28. Avoid the “100% available” trap
  • 29. The “100% available” trap, version #1 You: “How should the application respond if a technical failure occurs?” Business owner: “This must not happen! It is your responsibility to make sure that this will not happen.”
  • 30. The “100% available” trap, version #2 You: “How do you handle the situation if the service you call does not respond (or does not respond timely)?” Developer 1: “We did not implement any extra measures. The other service is so important and thus needs to be so highly available that it is not worth any extra effort.” Developer 2: “Actually, if that service should be down, we would not be able to do anything useful anyway. Thus, it just needs to be up.”
  • 31. The question is not, if a failure will happen The question is, when a failure will happen
  • 32. A short note about availability Assume a service availability of 99,5% (incl. planned downtimes) •  10 services involved in a request à 95,1% probability of success •  50 services involved in a request à 77,8% probability of success
  • 34. Establish the ops-dev feedback loop
  • 35. The big wall between Dev and Ops
  • 36. In a distributed environment, you cannot solve availability issues on an infrastructure level only
  • 37. Dev Ops “I implemented something to improve production availability” “Here are the figures how it worked” Continuous improvement cycle of resilient software design Dev is where you implement your resilience measures Build Measure Learn Ops is where your resilience measures take effect
  • 38. Dev Ops “I implemented something to improve production availability” “Here are the figures how it worked” Continuous improvement cycle of resilient software design Dev is where you implement your resilience measures Build Measure Learn Ops is where your resilience measures take effect All developer activities towards improving robustness are basically “shooting at the dark” which is neither effective not sustainable Having a wall between Dev and Ops breaks the cycle required to implement effective robustness measures
  • 39. For effective resilient software design you need a working ops-dev feedback loop
  • 40. Establishing the feedback loop •  Adopt DevOps •  Adopt Site Reliability Engineering (SRE) •  Or do it your own way if you know a better way ... •  ... but make sure you establish the required feedback loops!
  • 43. Without proper functional design nothing else matters
  • 44. Isolation •  System must not fail as a whole •  Split system in parts and isolate parts against each other •  Avoid cascading failures •  Foundation of resilient software design
  • 45. Bulkhead •  Bulkheads implement the “parts” that need to be isolated •  Core isolation pattern (a.k.a. “failure units” or “units of mitigation”) •  Diverse implementation choices available, e.g., (micro)services, actors, SCS, ... •  Shaping good bulkheads is a pure functional design issue (and extremely hard)
  • 46. Hmm, sound easy. Why should that be hard?
  • 47. Service A Service B Request Due to functional design, Service A always needs backing from Service B to be able to answer a client request, i.e. the isolation is broken by design How do we avoid this …
  • 48. Service Request Due to functional design we need to call a lot of services to be able to answer a client request, i.e. availability is broken by design ... and this ... Service Service Service Service Service Service Service Service Service Service Service Service
  • 49. Mothership Service (a.k.a. Monolith) Request By trying to avoid the aforementioned issues we ended up with cramming all required functionality in one big service i.e. the isolation is broken by design ... without ending up with this?
  • 50. Let us apply our well-known best practices •  Divide & conquer a.k.a. functional decomposition •  DRY (Don’t Repeat Yourself) •  Design for reusability •  Layered architecture •  …
  • 52. Service A Service B Request Due to functional design, Service A always needs backing from Service B to be able to answer a client request, i.e. the isolation is broken by design ... this usually leads to this …
  • 53. Service Request Due to functional design we need to call a lot of services to be able to answer a client request, i.e. availability is broken by design ... and this ... Service Service Service Service Service Service Service Service Service Service Service Service
  • 54. Mothership Service (a.k.a. Monolith) Request By trying to avoid the aforementioned issues we ended up with cramming all required functionality in one big service i.e. the isolation is broken by design ... and in the end also often to this.
  • 56. Caches to the rescue!
  • 57. Service A Service B Request Due to functional design, Service A always needs backing from Service B to be able to answer a client request, i.e. the isolation is broken by design CacheofB Break tight service coupling by caching data/responses of downstream service
  • 58. Caches to the rescue?
  • 59. Do you really think that copying stale data all over your system is a suitable measure to fix an inherently broken design? * * Side note: Caches are a very important and powerful measure in many places. But they are not suitable as a cheap fix for a broken functional design
  • 60. We have to re-learn design for distributed system
  • 62. Yet, a few guiding thoughts ...
  • 63. Foundations of design •  “High cohesion, low coupling” & “separation of concerns” •  “Crucial across process boundaries •  Still poorly understood issue •  Start with •  Understanding organizational boundaries •  Understanding use cases and flows •  Identifying functional domains (à DDD) •  Finding areas that change independently •  Do not start with a data model!
  • 64. Short activation paths •  Long activation paths affect availability •  Increase likelihood of failures •  Minimize remote calls per request •  Need to balance opposing forces •  Avoid monolith à clear separation of concerns •  Minimize requests à cluster functionality & data •  Caches can sometimes help, but stale data as trade-off
  • 65. Be (extremely) wary of reusability •  Reusability increases coupling •  Reusability usually leads to bad service design •  Reusability compromises availability •  Reusability rarely pays •  Do not strive for reusable services •  Strive for replaceable services instead •  Try to tackle reusability issues with libraries
  • 68. Core Detect Treat Prevent Recover Mitigate Complement Supporting patterns Redundancy Stateless Idempotency Escalation Zero downtime deployment Location transparency Relaxed temporal constraints Fallback Shed load Share load Marked data Queue for resources Bounded queue Finish work in progress Fresh work before stale Deferrable work Communication paradigm Isolation Bulkhead System level Monitor Watchdog Heartbeat Either level Voting Synthetic transaction Leaky bucket Routine checks Health check Fail fast Let sleeping dogs lie Small releases Hot deployments Routine maintenance Backup request Anti-fragility Diversity Jitter Error injection Spread the news Anti-entropy Backpressure Retry Limit retries Rollback Roll-forward Checkpoint Safe point Failover Read repair Error handler Reset Restart Reconnect Fail silently Default value Node level Timeout Circuit breaker Complete parameter checking Checksum Statically Dynamically Confinement Acknowledgement
  • 69. Using resilience patterns •  Patterns are options, not obligations •  Don’t pick too many patterns •  Each pattern increases complexity •  Complexity is the enemy of robustness •  Each pattern costs money in dev & ops •  You only have a limited resilience budget •  Look for complementary patterns
  • 71. Core Detect Treat Prevent Recover Mitigate Complement Supporting patterns Escalation Communication paradigm Isolation System level Monitor Heartbeat Either level Hot deployments Restart Let it crash! Node level Actor Messaging Erlang (Akka) Core patterns
  • 72. Core Detect Treat Prevent Recover Mitigate Complement Supporting patterns Fallback Share load Bounded queue Communication paradigm Isolation System level Monitor Either level Error injection Retry Limit retries Node level Circuit breaker Timeout Zero downtime deployment Canary releases Redundancy Several variants (Micro)service Request/ response Netflix Core patterns
  • 75. We face a new generation of developers every 5 years
  • 76. We loose our collective memory every 5 years * * Mean time until a topic discussion in the community starts over form scratch
  • 77. Time working in IT Growth of knowledge Depth of insights
  • 78. What do we do to compensate this effect?
  • 79. We look for the new & shiny stuff ...
  • 80. ... as anything not new must be useless crap!
  • 81. We need to rediscover our insights every 5 years
  • 82. In IT, we suffer from continuous collective amnesia and we are even proud of it
  • 83. How can we become better?
  • 85. The 7 quests at a glance
  • 86. Wrap-up •  The road to resilient software design is a twisted one! •  Most challenges are only indirectly related to RSD •  Most challenges are not coding related •  Mastering functional design is extremely hard ... •  ... while learning the patterns is relatively easy •  How do we preserve our collective memory?
  • 87. Uwe Friedrichsen IT traveller. Connecting the dots. Attracted by uncharted territory. CTO at codecentric. https://www.slideshare.net/ufried https://medium.com/@ufried @ufried