5. Availability
Availability is the probability that a system will work as
required when required during the period of a mission.
The mission could be the 18-hour span of an aircraft flight. The mission period could
also be the 3 to 15-month span of a military deployment.
Availability includes non-operational periods associated with reliability, maintenance,
and logistics.
6. Availability levels
Nines                  Unavailable period / year
1 nine    - 90%        36.5 days
1.5 nines - 95%        18.25 days
2 nines   - 99%        3.7 days
3 nines   - 99.9%      8.8 hours
4 nines   - 99.99%     52.6 minutes
5 nines   - 99.999%    5.3 minutes
6 nines   - 99.9999%   31.5 seconds
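The downtime figures follow directly from the availability percentage. The sketch below shows the arithmetic, assuming a 365-day year; the `Downtime` class and `PerYear` method are hypothetical names used only for illustration.

```csharp
using System;

static class Downtime
{
    // Allowed downtime per year for an availability given as a fraction
    // (e.g. 0.999 for "3 nines"), assuming a 365-day year.
    public static TimeSpan PerYear(double availability)
    {
        if (availability < 0.0 || availability > 1.0)
            throw new ArgumentOutOfRangeException(nameof(availability));
        return TimeSpan.FromDays(365 * (1.0 - availability));
    }

    static void Main()
    {
        foreach (double a in new[] { 0.9, 0.95, 0.99, 0.999, 0.9999, 0.99999 })
            Console.WriteLine($"{a:P3} -> {PerYear(a)}");
    }
}
```

Each extra nine divides the allowed downtime by ten, which is why the higher levels are measured in minutes and seconds.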
[Figure: availability versus cost]
7. What is resilience?
Resiliency is the ability of a system to
gracefully handle and recover from failures.
Source MSDN - https://docs.microsoft.com/en-us/azure/architecture/patterns/category/resiliency
10. Things will go wrong
As a system's complexity grows, the number and variety of issues that can occur
and affect its availability also increase:
Code Library
● Coding errors
● Edge cases
Hardware
● Disk
● Network card
External systems / APIs
● Coding errors
● Edge cases
● Network issues
● Degradation of service
● Request overload
● Unavailability
● Protocol issues (HTTPS, …)
Databases
● Coding errors
● Resource exhaustion
● Network issues
● Degradation of service
● Request overload
● Unavailability
13. The big question now is
How does the application code deal with
failures in its dependencies?
14. What are we looking for?
● Choose your availability level:
Not every application has high availability requirements
● Reduce exposure to dependency failures:
If a dependency fails, the application should do its best to keep behaving sensibly
● Assume chaos:
Things will go wrong at some point. Be prepared!
● Beware of misbehaving clients:
Your clients might be evil.
● Fail fast:
In case of failure, the application must fail ASAP and report the problem
16. Polly
Polly is a .NET resilience and transient-fault-handling library that allows developers
to express policies such as Retry, Circuit Breaker, Timeout, Bulkhead Isolation, and
Fallback in a fluent and thread-safe manner.
Polly targets .NET Standard 1.1 (coverage: .NET Framework 4.5-4.6.1, .NET Core 1.0,
Mono, Xamarin, UWP, WP8.1+) and .NET Standard 2.0+ (coverage: .NET Framework
4.6.1, .NET Core 2.0+, and later Mono, Xamarin and UWP targets).
https://github.com/App-vNext/Polly
17. Usage
// Create a policy
var policy = Policy
    .Handle<SomeExceptionType>()
    .Retry();
// Execute an action with void return within a policy
policy.Execute(() => SomeAction());
// Execute a function with a return value within a policy
var result = policy.Execute(() => SomeFunction()); // Return type is inferred
18. Retry
"Maybe it's just a blip"
Automatically retry an operation in case of exception.
Timing:
● Immediate retry
● Wait and Retry
○ Constant backoff (e.g. wait 10 seconds before each retry)
○ Dynamic backoff (e.g. exponential backoff)
Perseverance:
● Retry forever
● Give up after n attempts
https://github.com/App-vNext/Polly/wiki/Retry
20. Retry - code
// Retry once
Policy
    .Handle<SomeExceptionType>()
    .Retry();
// Retry multiple times
Policy
    .Handle<SomeExceptionType>()
    .Retry(3);
// Retry multiple times, calling an action on each retry
// with the current exception and retry count
Policy
    .Handle<SomeExceptionType>()
    .Retry(3, (exception, retryCount) =>
    {
        // do something, e.g. log the exception and the retry count
    });
21. Wait and retry - code
// Retry, waiting a specified duration between each retry.
// (The wait is imposed on catching the failure, before making the next try.)
Policy
    .Handle<SomeExceptionType>()
    .WaitAndRetry(new[]
    {
        TimeSpan.FromSeconds(1),
        TimeSpan.FromSeconds(2),
        TimeSpan.FromSeconds(3)
    });
// Retry a specified number of times, using a function to calculate the duration
// to wait between retries based on the current retry attempt
// (allows for exponential backoff)
Policy
    .Handle<SomeExceptionType>()
    .WaitAndRetry(5, retryAttempt =>
        TimeSpan.FromSeconds(Math.Pow(2, retryAttempt))
    );
// In this case it will wait for
// 2 ^ 1 = 2 seconds then
// 2 ^ 2 = 4 seconds then
// 2 ^ 3 = 8 seconds then
// 2 ^ 4 = 16 seconds then
// 2 ^ 5 = 32 seconds
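The "Retry forever" option from the Retry slide is not covered by the snippets above. The following is a minimal sketch of it using Polly's `WaitAndRetryForever`; the `RetryForeverDemo` class and its `FlakyCall` method are hypothetical stand-ins for a dependency that fails twice before succeeding, and the 100 ms backoff step is an arbitrary choice.

```csharp
using System;
using Polly;

class RetryForeverDemo
{
    static int calls;

    // Hypothetical flaky operation: throws on the first two calls, then succeeds.
    static string FlakyCall()
    {
        calls++;
        if (calls < 3) throw new TimeoutException("transient failure");
        return "ok";
    }

    public static string Run()
    {
        calls = 0;
        // Keep retrying on TimeoutException, waiting a little longer each time.
        // Use RetryForever() instead for immediate retries with no wait.
        var retryForever = Policy
            .Handle<TimeoutException>()
            .WaitAndRetryForever(attempt => TimeSpan.FromMilliseconds(100 * attempt));

        return retryForever.Execute(() => FlakyCall());
    }

    static void Main()
    {
        Console.WriteLine($"{Run()} after {calls} calls");
    }
}
```

Retrying forever only makes sense when the failure is expected to clear eventually; otherwise a bounded retry keeps the caller from blocking indefinitely.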
22. Timeout
"Don't wait forever"
Optimistic timeout
Optimistic timeout operates via CancellationToken and assumes delegates you
execute support co-operative cancellation. You must use Execute/Async(...) overloads
taking a CancellationToken, and the executed delegate must honor that
CancellationToken.
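As an illustration of optimistic timeout, the sketch below runs a cancellable delegate under a one-second timeout. `TimeoutDemo` and `SlowOperation` are hypothetical names, and the durations are arbitrary; the Polly calls used are `Policy.Timeout` with `TimeoutStrategy.Optimistic` and the `Execute` overload that takes a `CancellationToken`.

```csharp
using System;
using System.Threading;
using Polly;
using Polly.Timeout;

class TimeoutDemo
{
    // Hypothetical slow operation that honors the token it is given:
    // it waits up to 10 seconds but stops as soon as cancellation is signalled.
    static void SlowOperation(CancellationToken ct)
    {
        if (ct.WaitHandle.WaitOne(TimeSpan.FromSeconds(10)))
            ct.ThrowIfCancellationRequested();
    }

    // Returns true if the policy gave up on the operation.
    public static bool TimesOut()
    {
        var timeout = Policy.Timeout(TimeSpan.FromSeconds(1), TimeoutStrategy.Optimistic);
        try
        {
            // The policy passes its own token to the delegate and cancels it on timeout.
            timeout.Execute(ct => SlowOperation(ct), CancellationToken.None);
            return false;
        }
        catch (TimeoutRejectedException)
        {
            return true;
        }
    }

    static void Main()
    {
        Console.WriteLine(TimesOut() ? "Timed out" : "Completed in time");
    }
}
```

If the delegate ignored the token, an optimistic timeout could not interrupt it, which is the case the pessimistic strategy is meant for.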
Pessimistic timeout
Pessimistic timeout allows calling code to 'walk away' from waiting for an executed
delegate to complete, even if it does not support cancellation. In synchronous
23. Fallback
"Degrade gracefully"
Provide a substitute value or substitute action in the event of failure.
Triggers:
● Exception
e.g. if the user action threw an HttpRequestException, return "??"
● Specific return values
e.g. if the user action returned -1, return "??"
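Both triggers can be combined in one policy. The sketch below is an illustration only: `FallbackDemo` and `LookupDisplayName` are hypothetical, and an empty string stands in for the "specific return value" so that the types line up. It uses Polly's generic `Policy<TResult>` with `Handle`, `OrResult`, and `Fallback`.

```csharp
using System;
using System.Net.Http;
using Polly;

class FallbackDemo
{
    // Hypothetical lookup that either throws or returns an empty name.
    static string LookupDisplayName(bool fail)
    {
        if (fail) throw new HttpRequestException("profile service unreachable");
        return string.Empty;
    }

    public static string Lookup(bool fail)
    {
        var fallback = Policy<string>
            .Handle<HttpRequestException>()                 // trigger: exception
            .OrResult(name => string.IsNullOrEmpty(name))   // trigger: specific return value
            .Fallback("??");                                // substitute value

        return fallback.Execute(() => LookupDisplayName(fail));
    }

    static void Main()
    {
        Console.WriteLine(Lookup(fail: true));
        Console.WriteLine(Lookup(fail: false));
    }
}
```

In both cases the caller receives the substitute value instead of an exception or an unusable result.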
24. Fallback - code
// Provide a substitute value, if execution faults.
Policy<UserAvatar>
    .Handle<Whatever>()
    .Fallback<UserAvatar>(UserAvatar.Blank);
// Specify a func to provide a substitute value, if execution faults.
Policy<UserAvatar>
    .Handle<Whatever>()
    .Fallback<UserAvatar>(() => UserAvatar.GetRandomAvatar());
// Specify a substitute value or func, calling an action (e.g. for logging)
// if the fallback is invoked.
Policy<UserAvatar>
    .Handle<Whatever>()
    .Fallback<UserAvatar>(UserAvatar.Blank, onFallback: (exception, context) =>
    {
        // do something, e.g. log that the fallback was used
    });
26. Cache
"You've asked that one before"
Provides a response from cache if known.
● Multiple cache providers
● Absolute expiration: expire after a given amount of time
● Sliding expiration: keep items that are being hit
28. Cache - code
// Define a cache policy in the .NET Framework, using the Polly.Caching.Memory NuGet package.
var memoryCacheProvider = new MemoryCacheProvider(MemoryCache.Default);
var cachePolicy = Policy.Cache(memoryCacheProvider, TimeSpan.FromMinutes(5));
// Define a cache policy with absolute expiration at midnight tonight.
var absoluteCachePolicy = Policy.Cache(memoryCacheProvider,
    new AbsoluteTtl(DateTimeOffset.Now.Date.AddDays(1)));
// Define a cache policy with sliding expiration: items remain valid for
// another 5 minutes each time the cache item is used.
var slidingCachePolicy = Policy.Cache(memoryCacheProvider,
    new SlidingTtl(TimeSpan.FromMinutes(5)));
30. Circuit breaker
"Stop doing it if it hurts"
Breaks the circuit (i.e. blocks executions) for a period, when faults exceed some
pre-configured threshold.
31. Circuit breaker state machine
Closed:
● Executes user action and returns the result
● Initial state
Open:
● User action will NOT be executed
● Fail fast by throwing a BrokenCircuitException
● Will remain open until durationOfBreak elapses
Half-open:
● The next call will be treated as a trial to determine the
circuit's health
● If it throws an exception, the circuit goes back to Open
● If there is no exception, the circuit transitions to Closed
32. Circuit Breaker - code
// Break the circuit after the specified number of consecutive exceptions
// and keep circuit broken for the specified duration.
Policy
    .Handle<SomeExceptionType>()
    .CircuitBreaker(2, TimeSpan.FromMinutes(1));
// Break the circuit after the specified number of consecutive exceptions
// and keep circuit broken for the specified duration,
// calling an action on change of circuit state.
Action<Exception, TimeSpan> onBreak = (exception, timespan) => { ... };
Action onReset = () => { ... };
CircuitBreakerPolicy breaker = Policy
    .Handle<SomeExceptionType>()
    .CircuitBreaker(2, TimeSpan.FromMinutes(1), onBreak, onReset);
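To see the state machine in action, the sketch below sends three calls through a breaker configured like the one above. `CircuitBreakerDemo` and `CallDependency` are hypothetical; the dependency is assumed to fail on every call. The first two calls should surface the original exception, and once the circuit is open the third should fail fast with a `BrokenCircuitException` without invoking the dependency.

```csharp
using System;
using Polly;
using Polly.CircuitBreaker;

class CircuitBreakerDemo
{
    // Hypothetical dependency call that always fails.
    static void CallDependency()
    {
        throw new InvalidOperationException("dependency down");
    }

    // Makes three calls through a breaker that opens after two consecutive failures.
    public static string[] Run()
    {
        var breaker = Policy
            .Handle<InvalidOperationException>()
            .CircuitBreaker(2, TimeSpan.FromMinutes(1));

        var outcomes = new string[3];
        for (int i = 0; i < outcomes.Length; i++)
        {
            try
            {
                breaker.Execute(() => { CallDependency(); });
                outcomes[i] = "succeeded";
            }
            catch (BrokenCircuitException)
            {
                // Circuit is open: the dependency was not called at all.
                outcomes[i] = "rejected by open circuit";
            }
            catch (InvalidOperationException)
            {
                // Circuit was closed: the failure came from the dependency itself.
                outcomes[i] = "failed in dependency";
            }
        }
        return outcomes;
    }

    static void Main()
    {
        foreach (string outcome in Run())
            Console.WriteLine(outcome);
    }
}
```

Catching `BrokenCircuitException` separately lets the caller distinguish "the dependency failed" from "we did not even try", which is useful for logging and for choosing a fallback.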
33. Manually breaking the circuit
// Monitor the circuit state, for example for health reporting.
CircuitState state = breaker.CircuitState;
/*
CircuitState.Closed - Normal operation. Execution of actions allowed.
CircuitState.Open - The automated controller has opened the circuit. Execution of actions blocked.
CircuitState.HalfOpen - Recovering from open state, after the automated break duration has expired.
Execution of actions permitted. Success of subsequent action/s controls onward transition to Open or Closed
state.
CircuitState.Isolated - Circuit held manually in an open state. Execution of actions blocked.
*/
// Manually open (and hold open) a circuit breaker - for example to manually isolate a downstream service.
breaker.Isolate();
// Reset the breaker to closed state, to start accepting actions again.
breaker.Reset();
35. Bulkhead Isolation
"One fault shouldn't sink the whole ship"
Constrains the governed actions to a fixed-size resource pool, isolating their potential
to affect others.
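A minimal sketch of a bulkhead with Polly follows. The limits (2 parallel executions, 4 queued) are arbitrary illustrative numbers, and `BulkheadDemo` is a hypothetical name; `Policy.Bulkhead`, `BulkheadAvailableCount`, and `BulkheadRejectedException` come from Polly.

```csharp
using System;
using Polly;
using Polly.Bulkhead;

class BulkheadDemo
{
    // Runs one action through the bulkhead and returns the free execution slots afterwards.
    public static int Run()
    {
        // At most 2 concurrent executions, with up to 4 more callers allowed to queue.
        var bulkhead = Policy.Bulkhead(2, 4);
        try
        {
            bulkhead.Execute(() => Console.WriteLine("Running inside the bulkhead"));
        }
        catch (BulkheadRejectedException)
        {
            // Thrown when both the execution slots and the queue are full.
            Console.WriteLine("Bulkhead and queue are full; call rejected");
        }
        return bulkhead.BulkheadAvailableCount;
    }

    static void Main()
    {
        Console.WriteLine($"Free slots: {Run()}");
    }
}
```

Giving each dependency its own bulkhead means a slow dependency can exhaust only its own slots rather than every thread in the application.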