Fault Tolerance in a
High Volume, Distributed System
Ben Christensen
Software Engineer – API Platform at Netflix
@benjchristensen
http://www.linkedin.com/in/benjchristensen




                                             1
Dozens of dependencies.

   One going down takes everything down.


99.99%^30 = 99.7% uptime
     0.3% of 1 billion = 3,000,000 failures

            2+ hours downtime/month
even if all dependencies have excellent uptime.

          Reality is generally worse.
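
The arithmetic on this slide can be checked with a short calculation. This is a minimal sketch that assumes the 30 dependencies fail independently and that a month has 30 days; the class and method names are hypothetical.

```java
final class CompoundAvailability {
    /** Availability of a request that needs all n dependencies up (independent failures assumed). */
    static double combined(double perDependency, int n) {
        return Math.pow(perDependency, n);
    }

    /** Expected downtime in hours over a period of the given length. */
    static double downtimeHours(double availability, double hoursInPeriod) {
        return (1.0 - availability) * hoursInPeriod;
    }

    public static void main(String[] args) {
        double a = combined(0.9999, 30);
        System.out.printf("combined availability: %.2f%%%n", a * 100);
        System.out.printf("failures per 1 billion requests: %.0f%n", (1 - a) * 1_000_000_000L);
        System.out.printf("downtime per 30-day month: %.2f hours%n", downtimeHours(a, 30 * 24));
    }
}
```

The results come out at roughly 99.7% availability, about 3 million failures per billion requests, and a little over 2 hours of downtime per month, matching the slide.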


                                                  2
No single dependency should
 take down the entire app.

          Fail fast.
         Fail silent.
         Fallback.

        Shed load.



                              6
Options

Aggressive Network Timeouts

   Semaphores (Tryable)

     Separate Threads

      Circuit Breaker



                              7
Semaphores (Tryable): Limited Concurrency


TryableSemaphore executionSemaphore = getExecutionSemaphore();
// acquire a permit
if (executionSemaphore.tryAcquire()) {
    try {
        return executeCommand();
    } finally {
        executionSemaphore.release();
    }
} else {
    circuitBreaker.markSemaphoreRejection();
    // permit not available so return fallback
    return getFallback();
}
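
The same pattern can be tried in isolation with the standard library. This is a minimal sketch, not the Netflix implementation: it swaps TryableSemaphore for java.util.concurrent.Semaphore, leaves out the circuit breaker metrics, and SemaphoreGuard is a hypothetical name.

```java
import java.util.concurrent.Semaphore;
import java.util.function.Supplier;

final class SemaphoreGuard {
    private final Semaphore permits;

    SemaphoreGuard(int maxConcurrent) {
        this.permits = new Semaphore(maxConcurrent);
    }

    /** Runs the command if a permit is free right now; otherwise returns the fallback without blocking. */
    <T> T execute(Supplier<T> command, Supplier<T> fallback) {
        if (permits.tryAcquire()) {
            try {
                return command.get();
            } finally {
                permits.release();
            }
        }
        // no permit available: shed the request instead of queueing it
        return fallback.get();
    }

    public static void main(String[] args) {
        SemaphoreGuard guard = new SemaphoreGuard(1);
        // The nested call runs while the only permit is held, so it is rejected.
        String result = guard.<String>execute(
                () -> guard.<String>execute(() -> "inner ran", () -> "inner rejected"),
                () -> "outer rejected");
        System.out.println(result); // prints "inner rejected"
    }
}
```

Because tryAcquire() never blocks, the caller's thread is not tied up waiting for a slow dependency; it either proceeds immediately or gets the fallback.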




                                                                 10
Separate Threads: Limited Concurrency


try {
    if (!threadPool.isQueueSpaceAvailable()) {
        // We are at the property-defined max, so throw RejectedExecutionException ourselves to
        // simulate reaching the real max and go through the same code path and behavior.
        throw new RejectedExecutionException(
            "Rejected command because thread-pool queueSize is at rejection threshold.");
    }
    // ... define Callable that performs executeCommand() ...
    // submit the work to the thread-pool
    return threadPool.submit(command);
} catch (RejectedExecutionException e) {
    circuitBreaker.markThreadPoolRejection();
    // rejected so return fallback
    return getFallback();
}
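
A self-contained version of this rejection path can be sketched with a plain ThreadPoolExecutor. It is an illustration under assumptions: BoundedPoolGuard is a hypothetical name, the pool rejects at the real queue capacity rather than at a separately configured threshold, and no metrics are recorded.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.Callable;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Future;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

final class BoundedPoolGuard {
    private final ThreadPoolExecutor pool;

    /** queueSize must be at least 1 because ArrayBlockingQueue rejects smaller capacities. */
    BoundedPoolGuard(int threads, int queueSize) {
        // No handler is passed, so the default AbortPolicy applies:
        // submit() throws RejectedExecutionException once threads and queue are full.
        this.pool = new ThreadPoolExecutor(threads, threads, 0L, TimeUnit.MILLISECONDS,
                new ArrayBlockingQueue<>(queueSize));
    }

    /** Submits the command, or returns an already completed future holding the fallback if rejected. */
    <T> Future<T> submit(Callable<T> command, T fallback) {
        try {
            return pool.submit(command);
        } catch (RejectedExecutionException e) {
            return CompletableFuture.completedFuture(fallback);
        }
    }

    /** Convenience helper that waits for a result and wraps checked exceptions. */
    static <T> T join(Future<T> future) {
        try {
            return future.get();
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        }
    }

    void shutdown() {
        pool.shutdownNow();
    }
}
```

With one thread and a queue of one, a third concurrent submission is rejected and resolves to the fallback immediately, which is the load-shedding behavior the slide describes.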



                                                                                                             14
Separate Threads: Timeout

                                  Override of Future.get()

public K get() throws CancellationException, InterruptedException, ExecutionException {
    try {
        long timeout =
            getCircuitBreaker().getCommandTimeoutInMilliseconds();
        return get(timeout, TimeUnit.MILLISECONDS);
    } catch (TimeoutException e) {
        // report timeout failure
        circuitBreaker.markTimeout(
            System.currentTimeMillis() - startTime);

        // retrieve the fallback
        return getFallback();
    }
}
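
The timeout idea can also be shown without overriding Future.get(). The sketch below is an assumption-laden illustration: TimedCall and callWithTimeout are hypothetical names, the executor is supplied by the caller, and any failure (not only a timeout) is mapped to the fallback for brevity.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

final class TimedCall {
    /** Runs the command on the pool and gives up after timeoutMillis, returning the fallback. */
    static <T> T callWithTimeout(ExecutorService pool, Callable<T> command,
                                 long timeoutMillis, T fallback) {
        Future<T> future = pool.submit(command);
        try {
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            // interrupt the worker so the pool thread is not held by the slow call
            future.cancel(true);
            return fallback;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the caller's interrupt status
            return fallback;
        } catch (ExecutionException e) {
            return fallback; // the command itself failed
        }
    }
}
```

The calling thread waits at most timeoutMillis, which is only possible because the work runs on a separate thread. Interrupting the worker only helps if the underlying client code responds to interruption.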




                                                                                          17
Circuit Breaker




if (circuitBreaker.allowRequest()) {
    return executeCommand();
} else {
    // short-circuit and go directly to fallback
    circuitBreaker.markShortCircuited();
    return getFallback();
}
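
The slide shows only the call site, so here is one way allowRequest() might be backed. This is a deliberately simplified sketch, not the Netflix breaker: it trips on a count of consecutive failures rather than a rolling error percentage, and SimpleCircuitBreaker, the threshold, the sleep window, and the injected clock are all assumptions for illustration.

```java
import java.util.function.LongSupplier;
import java.util.function.Supplier;

final class SimpleCircuitBreaker {
    private final int failureThreshold;
    private final long sleepWindowMillis;
    private final LongSupplier clock;   // injected so the behavior can be tested without sleeping
    private int consecutiveFailures = 0;
    private long openedAt = -1;         // -1 means the circuit is closed

    SimpleCircuitBreaker(int failureThreshold, long sleepWindowMillis, LongSupplier clock) {
        this.failureThreshold = failureThreshold;
        this.sleepWindowMillis = sleepWindowMillis;
        this.clock = clock;
    }

    synchronized boolean allowRequest() {
        if (openedAt < 0) {
            return true;
        }
        // once the sleep window has passed, let trial requests through
        return clock.getAsLong() - openedAt >= sleepWindowMillis;
    }

    synchronized void markSuccess() {
        consecutiveFailures = 0;
        openedAt = -1;
    }

    synchronized void markFailure() {
        consecutiveFailures++;
        if (consecutiveFailures >= failureThreshold) {
            openedAt = clock.getAsLong(); // open, or re-open after a failed trial
        }
    }

    <T> T execute(Supplier<T> command, Supplier<T> fallback) {
        if (!allowRequest()) {
            return fallback.get(); // short-circuit: the dependency is not called at all
        }
        try {
            T result = command.get();
            markSuccess();
            return result;
        } catch (RuntimeException e) {
            markFailure();
            return fallback.get();
        }
    }
}
```

While the circuit is open, callers get the fallback without touching the failing dependency, which gives it room to recover and keeps latency low for the caller.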




                                                   21
Netflix uses all 4 in combination




                                   24
Tryable semaphores for “trusted” clients and fallbacks

       Separate threads for “untrusted” clients

 Aggressive timeouts on threads and network calls
             to “give up and move on”

       Circuit breakers as the “release valve”



                                                     26
Benefits of Separate Threads

         Protection from client libraries

    Lower risk to accept new/updated clients

           Quick recovery from failure

             Client misconfiguration

Client service performance characteristic changes

              Built-in concurrency
                                                    30
Drawbacks of Separate Threads

    Some computational overhead

Load on machine can be pushed too far

                 ...

    Benefits outweigh drawbacks
    when clients are “untrusted”


                                        31
Visualizing Circuits in Realtime
      (generally sub-second latency)




      Video available at
https://vimeo.com/33576628




                                       33
Rolling 10 second counter – 1 second granularity

      Median Mean 90th 99th 99.5th

       Latent Error Timeout Rejected

                 Error Percentage =
(error + timeout + rejected) / (success + latent success + error + timeout + rejected)
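
A rolling 10 second window with 1 second buckets can be sketched as a small ring of counters. This is an illustrative assumption, not the Netflix code: RollingErrorRate is a hypothetical name, "bad" stands for error + timeout + rejected, "good" stands for success + latent success, and timestamps are passed in explicitly (non-negative milliseconds) so the behavior is deterministic.

```java
import java.util.Arrays;

final class RollingErrorRate {
    private static final int BUCKETS = 10;                 // ten one-second buckets
    private final long[] bucketSecond = new long[BUCKETS]; // which second each bucket currently holds
    private final long[] good = new long[BUCKETS];
    private final long[] bad = new long[BUCKETS];

    RollingErrorRate() {
        Arrays.fill(bucketSecond, -1L); // -1 marks a bucket that has never been used
    }

    /** Returns the bucket index for this timestamp, resetting it if it still holds an older second. */
    private int bucketFor(long nowMillis) {
        long second = nowMillis / 1000;
        int i = (int) (second % BUCKETS);
        if (bucketSecond[i] != second) {
            bucketSecond[i] = second;
            good[i] = 0;
            bad[i] = 0;
        }
        return i;
    }

    synchronized void recordGood(long nowMillis) {
        good[bucketFor(nowMillis)]++;
    }

    synchronized void recordBad(long nowMillis) {
        bad[bucketFor(nowMillis)]++;
    }

    /** bad / (good + bad) as a percentage over buckets from the last ten seconds. */
    synchronized double errorPercentage(long nowMillis) {
        long second = nowMillis / 1000;
        long g = 0;
        long b = 0;
        for (int i = 0; i < BUCKETS; i++) {
            if (bucketSecond[i] >= 0 && second - bucketSecond[i] < BUCKETS) {
                g += good[i];
                b += bad[i];
            }
        }
        long total = g + b;
        return total == 0 ? 0.0 : 100.0 * b / total;
    }
}
```

Old data drops out of the percentage one bucket at a time, so the figure reacts within about a second to a change in a dependency's health.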

                                                   34
Netflix DependencyCommand Implementation




                                          35
Netflix DependencyCommand Implementation

              Fallbacks

               Cache
         Eventual Consistency
            Stubbed Data
           Empty Response
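
Two of the fallback styles listed above, cache and empty response, can be combined in a few lines. This is a hypothetical sketch rather than the DependencyCommand API: CachedFallback and fetch are made-up names, the remote call is passed in as a function, and the cache is an unbounded in-memory map.

```java
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

final class CachedFallback {
    // last successful response per key; may be stale when served as a fallback
    private final Map<String, List<String>> lastGood = new ConcurrentHashMap<>();

    List<String> fetch(String key, Function<String, List<String>> remoteCall) {
        try {
            List<String> fresh = remoteCall.apply(key);
            lastGood.put(key, fresh);
            return fresh;
        } catch (RuntimeException e) {
            // fallback 1: the last known good value; fallback 2: an empty response
            List<String> cached = lastGood.get(key);
            return cached != null ? cached : Collections.emptyList();
        }
    }
}
```

Which fallback is acceptable depends on the data: stale or empty results may be fine for recommendations but not for something like billing.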




                                          41
Rolling Number
Realtime Stats and
 Decision Making




                     44
Request Collapsing
Take advantage of resiliency to improve efficiency




                                                    45
Fail fast.
Fail silent.
Fallback.

Shed load.




               48
Questions & More Information


Fault Tolerance in a High Volume, Distributed System
    http://techblog.netflix.com/2012/02/fault-tolerance-in-high-volume.html



         Making the Netflix API More Resilient
   http://techblog.netflix.com/2011/12/making-netflix-api-more-resilient.html




                        Ben Christensen
                          @benjchristensen
                 http://www.linkedin.com/in/benjchristensen



                                                                              49
