Resilience Design Patterns
1. Latency Control & Supervision in
Resilience Design Patterns
Tu Pham - CTO @ Eway
2. TOC
● Terminology
● Why It's So IMPORTANT
● Why It's So HARD
● Design Patterns
● Anti-Patterns
● Q & A
3. Terminology
Distributed Systems
Networked components that communicate with each other
by passing messages, most often to achieve a common goal.
Resiliency
The capacity of any system to recover from difficulties.
Availability
The probability that a system is operating at time `t`.
Reliability
The degree to which a system / component performs specified functions
under specified conditions for a specified period of time.
4. Terminology
Faults
A fault is an incorrect internal state in your
system. Examples:
1. Slowing down of the storage layer
2. Memory leaks in the application
3. Blocked threads
4. Dependency failures
5. Bad data propagating in the system (most
often because there is not enough validation
of input data)
Failure
A failure is the inability of the system to perform
its intended job.
Failure means loss of up-time and availability
of the system. Faults, if not contained from
propagating, can lead to failures.
6. Why It's So IMPORTANT
1. Losing customers and partners to
competitors => financial losses for the
company
2. Affecting the livelihood of publishers and
advertisers
3. Affecting the salary and bonus of OUR TEAM
:))
4. Affecting services for customers and
colleagues
7. Why It's So HARD
But building resiliency in a complex
microservices architecture with
multiple distributed systems
communicating with each other is
difficult.
8. Why It's So HARD
Some of the things that make it hard:
1. The network is unreliable
2. Dependencies can always fail
3. User behavior is unpredictable
11. Latency Control
● Complements isolation
● Detection and handling of non-timely
responses
● Avoids cascading temporal failures
● Different approaches and patterns available
12. Timeout
● Preserve responsiveness
independent of downstream latency
● Measure the response time of
downstream calls
● Stop waiting after a pre-determined
timeout
● Take an alternate action if the timeout
is reached (see the sketch below)
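A minimal sketch of the timeout pattern in Go, assuming a hypothetical downstream call fetchPrice that is slower than our 200 ms budget; the caller stops waiting and takes an alternate action:

package main

import (
    "context"
    "errors"
    "fmt"
    "time"
)

// fetchPrice stands in for any downstream call; here it is
// deliberately slower than the caller's timeout.
func fetchPrice(ctx context.Context) (float64, error) {
    select {
    case <-time.After(500 * time.Millisecond): // simulated downstream latency
        return 42.0, nil
    case <-ctx.Done(): // the caller stopped waiting
        return 0, ctx.Err()
    }
}

func main() {
    // Stop waiting after a pre-determined 200ms, independent of
    // how slow the downstream actually is.
    ctx, cancel := context.WithTimeout(context.Background(), 200*time.Millisecond)
    defer cancel()

    price, err := fetchPrice(ctx)
    if errors.Is(err, context.DeadlineExceeded) {
        fmt.Println("timeout reached, taking alternate action")
        return
    }
    fmt.Println("price:", price)
}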
14. Fail Fast
● “If you know you’re going to fail, you
better fail fast”
● Avoid foreseeable failures
● Usually implemented by adding cheap
checks in front of costly actions
● Increases the probability that an action,
once started, does not fail (see the
sketch below)
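A minimal sketch of fail fast, using an illustrative in-flight limit as the cheap up-front check; all names here are hypothetical:

package main

import (
    "errors"
    "fmt"
)

var ErrNoCapacity = errors.New("fail fast: no capacity left")

type Server struct {
    inFlight    int
    maxInFlight int
}

// Handle runs cheap checks first; if failure is foreseeable it
// rejects immediately instead of failing halfway through the
// costly action.
func (s *Server) Handle(req string) error {
    if req == "" {
        return errors.New("fail fast: empty request")
    }
    if s.inFlight >= s.maxInFlight {
        return ErrNoCapacity
    }
    s.inFlight++
    defer func() { s.inFlight-- }()
    fmt.Println("doing costly work for", req)
    return nil
}

func main() {
    s := &Server{maxInFlight: 1}
    fmt.Println(s.Handle("order-1")) // ok
    fmt.Println(s.Handle(""))        // rejected up front
}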
15. Circuit Breaker
● Probably the most often cited resilience
pattern
● Extension of the timeout pattern
● Takes the downstream unit offline if
calls fail multiple times
● A specific variant of the fail fast
pattern
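A minimal, non-production sketch of a circuit breaker: after maxFailures consecutive failures the breaker opens and calls fail fast until the cooldown has elapsed. The thresholds are illustrative:

package main

import (
    "errors"
    "fmt"
    "time"
)

var ErrOpen = errors.New("circuit breaker open, failing fast")

type Breaker struct {
    failures    int
    maxFailures int
    openedAt    time.Time
    cooldown    time.Duration
}

func (b *Breaker) Call(fn func() error) error {
    // Open state: take the downstream unit offline and fail fast
    // until the cooldown expires.
    if b.failures >= b.maxFailures && time.Since(b.openedAt) < b.cooldown {
        return ErrOpen
    }
    if err := fn(); err != nil {
        b.failures++
        if b.failures >= b.maxFailures {
            b.openedAt = time.Now() // (re)open the breaker
        }
        return err
    }
    b.failures = 0 // a success closes the breaker again
    return nil
}

func main() {
    b := &Breaker{maxFailures: 3, cooldown: time.Second}
    flaky := func() error { return errors.New("downstream failed") }
    for i := 0; i < 5; i++ {
        fmt.Println(b.Call(flaky)) // calls 4 and 5 fail fast
    }
}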
19. Fan Out & Quickest Reply
● Send the request to multiple workers
● Use the quickest reply and discard all
other responses
● Reduces the probability of latent
responses
● Tradeoff is a WASTE of resources
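A minimal sketch of fan out & quickest reply using goroutines; the buffered channel keeps exactly one winner and the slower replies are discarded:

package main

import (
    "fmt"
    "math/rand"
    "time"
)

func worker(id int, out chan<- string) {
    // Simulated, randomly latent downstream work.
    time.Sleep(time.Duration(rand.Intn(100)) * time.Millisecond)
    select {
    case out <- fmt.Sprintf("reply from worker %d", id):
    default: // a quicker reply already won; discard this one
    }
}

func main() {
    out := make(chan string, 1) // room for exactly one winner
    for i := 1; i <= 3; i++ {   // fan the request out to 3 workers
        go worker(i, out)
    }
    fmt.Println(<-out) // quickest reply; the rest of the work is wasted
}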
20. Bounded Queues
● Limit request queue sizes in front of
highly utilized resources
● Avoids latency due to overloaded
resources
● Introduces pushback on the callers
● Another variant of the fail fast
pattern
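A minimal sketch of a bounded queue as a buffered channel in front of a slow worker; when the queue is full, enqueue rejects immediately, pushing back on the caller. The sizes are illustrative:

package main

import (
    "errors"
    "fmt"
    "time"
)

var ErrQueueFull = errors.New("queue full, pushing back on caller")

// enqueue fails fast instead of letting the queue (and latency)
// grow without bound.
func enqueue(queue chan<- string, req string) error {
    select {
    case queue <- req:
        return nil
    default:
        return ErrQueueFull
    }
}

func main() {
    queue := make(chan string, 2) // the bound

    // A single slow worker is the highly utilized resource.
    go func() {
        for req := range queue {
            time.Sleep(100 * time.Millisecond)
            fmt.Println("processed", req)
        }
    }()

    for i := 1; i <= 5; i++ {
        if err := enqueue(queue, fmt.Sprintf("req-%d", i)); err != nil {
            fmt.Printf("req-%d rejected: %v\n", i, err)
        }
    }
    time.Sleep(time.Second) // let the worker drain (demo only)
}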
22. Supervision
● Provides failure handling beyond the means of
a single failure unit
● Detect unit failures
● Provide means for error escalation
● Different approaches and patterns available
23. Shed Load
● Upstream isolation pattern
● Avoid becoming overloaded due to
too many requests
● Install a gatekeeper in front of the
resource
● Shed requests based on resource
load
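A minimal sketch of load shedding as an HTTP gatekeeper in front of a resource: once in-flight requests exceed a limit, further requests are shed with 503. The limit and handler are illustrative:

package main

import (
    "net/http"
    "sync/atomic"
)

// shedLoad is the gatekeeper: it tracks in-flight requests and
// sheds anything beyond the limit instead of overloading `next`.
func shedLoad(limit int64, next http.Handler) http.Handler {
    var inFlight int64
    return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
        if atomic.AddInt64(&inFlight, 1) > limit {
            atomic.AddInt64(&inFlight, -1)
            http.Error(w, "overloaded, request shed", http.StatusServiceUnavailable)
            return
        }
        defer atomic.AddInt64(&inFlight, -1)
        next.ServeHTTP(w, r)
    })
}

func main() {
    resource := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
        w.Write([]byte("ok"))
    })
    http.ListenAndServe(":8080", shedLoad(100, resource))
}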
24. Monitor
● Observe unit behavior and
interactions from the outside
● Automatically respond to detected
failures
● Part of the system – complex failure
handling strategies possible
● Outside the system – more robust
against system level failures
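A minimal sketch of an outside monitor: a watchdog polls a health endpoint and automatically responds after repeated failures. The URL, interval, threshold, and response action are all assumptions:

package main

import (
    "fmt"
    "net/http"
    "time"
)

// monitor observes the unit from the outside and calls onFailure
// once several consecutive health checks fail.
func monitor(url string, onFailure func()) {
    failures := 0
    for range time.Tick(5 * time.Second) {
        resp, err := http.Get(url)
        healthy := err == nil && resp.StatusCode == http.StatusOK
        if resp != nil {
            resp.Body.Close()
        }
        if healthy {
            failures = 0
            continue
        }
        if failures++; failures >= 3 { // tolerate transient blips
            onFailure()
            failures = 0
        }
    }
}

func main() {
    monitor("http://localhost:8080/health", func() {
        fmt.Println("unit unhealthy: restart it / page someone")
    })
}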
25. Error Handler
● Units often don’t have enough time
or information to handle errors
● Separate business logic and error
handling
● Business logic just focuses on
getting the task done (quickly)
● Error handler has sufficient time
and information to handle errors
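A minimal sketch of separating business logic from error handling: the worker only reports failures on a channel and keeps going, while a dedicated handler decides what to do with them. All names are illustrative:

package main

import (
    "errors"
    "fmt"
)

type taskError struct {
    task string
    err  error
}

// handleErrors is the dedicated error handler; it has the time and
// context the business logic lacks.
func handleErrors(errs <-chan taskError, done chan<- struct{}) {
    for e := range errs {
        fmt.Printf("error handler: task %q failed: %v\n", e.task, e.err)
    }
    close(done)
}

// process focuses on getting the task done quickly; on failure it
// only reports and moves on.
func process(task string, errs chan<- taskError) {
    if task == "bad" {
        errs <- taskError{task, errors.New("validation failed")}
        return
    }
    fmt.Println("processed", task)
}

func main() {
    errs := make(chan taskError, 8)
    done := make(chan struct{})
    go handleErrors(errs, done)

    for _, t := range []string{"a", "bad", "b"} {
        process(t, errs)
    }
    close(errs)
    <-done
}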
26. Escalation
● Units often don’t have enough time
or information to handle errors
● Escalate to a peer with more time and
information
● Often multi-level hierarchies
● A pure design issue
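A minimal sketch of a multi-level escalation hierarchy: a level that lacks the information to handle an error passes it to its parent. The two-level hierarchy and the disk-full example are illustrative:

package main

import (
    "errors"
    "fmt"
)

var errDiskFull = errors.New("disk full")

type supervisor struct {
    name   string
    parent *supervisor
    // handle reports whether this level has enough information
    // to deal with the error.
    handle func(err error) bool
}

func (s *supervisor) escalate(err error) {
    if s.handle(err) {
        fmt.Println(s.name, "handled:", err)
        return
    }
    if s.parent == nil {
        fmt.Println(s.name, "is the top level, giving up on:", err)
        return
    }
    fmt.Println(s.name, "escalating:", err)
    s.parent.escalate(err)
}

func main() {
    top := &supervisor{name: "ops team",
        handle: func(error) bool { return true }}
    mid := &supervisor{name: "service supervisor", parent: top,
        handle: func(err error) bool { return !errors.Is(err, errDiskFull) }}
    mid.escalate(errDiskFull) // mid can't fix disk full, escalates to top
}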
29. Fallback
● Units often don’t have enough time
or information to handle errors
● Instead of aborting the computation
because of a missing response, we
fill in a fallback value
● Of course, it can be DANGEROUS!!!
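A minimal sketch of the fallback pattern in an ad-serving setting (the scenario and names are invented): when the personalized call fails, a pre-agreed default fills in instead of aborting:

package main

import (
    "errors"
    "fmt"
)

// recommendedAds stands in for a downstream service that is down.
func recommendedAds(userID string) ([]string, error) {
    return nil, errors.New("recommendation service unavailable")
}

func adsFor(userID string) []string {
    ads, err := recommendedAds(userID)
    if err != nil {
        // Fallback value: generic house ads instead of an empty
        // page. This is exactly the kind of logic that must be
        // confirmed with the business side first.
        return []string{"house-ad-1", "house-ad-2"}
    }
    return ads
}

func main() {
    fmt.Println(adsFor("u-123"))
}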
30. Retry
● Units have enough time or
information to handle errors
● Just send the request again and
again until it reaches the BOUNDARY
of the retry policy
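A minimal sketch of bounded retry with exponential backoff; the attempt limit and base delay stand in for whatever boundary the retry policy defines:

package main

import (
    "errors"
    "fmt"
    "time"
)

// retry resends until fn succeeds or the policy boundary
// (maxAttempts) is reached.
func retry(maxAttempts int, baseDelay time.Duration, fn func() error) error {
    var err error
    for attempt := 1; attempt <= maxAttempts; attempt++ {
        if err = fn(); err == nil {
            return nil
        }
        if attempt < maxAttempts {
            time.Sleep(baseDelay << (attempt - 1)) // 1x, 2x, 4x, ...
        }
    }
    return fmt.Errorf("retry boundary reached after %d attempts: %w", maxAttempts, err)
}

func main() {
    calls := 0
    flaky := func() error {
        calls++
        if calls < 3 {
            return errors.New("temporary failure")
        }
        return nil
    }
    fmt.Println(retry(5, 100*time.Millisecond, flaky)) // succeeds on attempt 3
}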
32. Just Don’t
● Infinite delays
● One config / policy for all situations
● Fallback logic without confirmation from
business departments / upper managers
● Laggy / buggy monitoring systems