Source: http://ir.netflix.com
(I’m skipping all the cloud intro etc. Netflix runs in the
cloud; if you hadn’t figured that out already, you aren’t
paying attention and should go to the other Netflix
talks at AWS re:Invent or read slideshare.net/netflix)
In production at Netflix
[Timeline figure; year labels: 2009, 2009, 2010, 2010, 2010, 2010, 2010, 2011]
Architecture applies to any cloud or datacenter
  Illustrated today using real world examples
[Diagram: Netflix streaming architecture]
Legend: Consumer Electronics, AWS Cloud Services, CDN Edge Locations
Customer Device (PC, PS3, TV…)
  Browse: Web Site or Discovery API
  Play: Streaming API
  Watch: OpenConnect CDN Boxes
Services: User Data, Personalization, DRM, QoS Logging,
  CDN Management and Steering, Content Encoding
[Diagram: service dependency graph]
Each icon is three to a few hundred instances across three AWS zones
Legend: Cassandra, memcached, Web service, S3 bucket
Start Here
Personalization movie group chooser
Deployed in Three Balanced Availability Zones
[Diagram: Load Balancers in front of Zone A, Zone B, and Zone C; every zone holds Cassandra and Evcache Replicas]
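The balanced three-zone layout implies a simple capacity rule: if one zone is lost, the surviving zones absorb its share of the traffic. A minimal sketch of that arithmetic, assuming load is spread evenly before and after the failure (the function names and the even-spread assumption are illustrative, not from the talk):

```python
from fractions import Fraction


def surviving_zone_load(zones: int, failed: int = 1) -> Fraction:
    """Load multiplier on each surviving zone after `failed` zones drop out.

    Assumes traffic was spread evenly and is re-spread evenly (an assumption).
    """
    if not 0 <= failed < zones:
        raise ValueError("need at least one surviving zone")
    return Fraction(zones, zones - failed)


def max_safe_utilization(zones: int, failed: int = 1) -> Fraction:
    """Highest steady-state utilization per zone that still fits after the failure."""
    return 1 / surviving_zone_load(zones, failed)


print(surviving_zone_load(3))    # 3/2
print(max_safe_utilization(3))   # 2/3
```

Under these assumptions, each of three zones would need to run at no more than two thirds of its capacity to keep serving on two zones.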
Triple Replicated Persistence
[Diagram: Load Balancers in front of Zone A, Zone B, and Zone C; every zone holds Cassandra and Evcache Replicas]
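With one replica per zone, a majority quorum can still be reached when a single zone is down. A small sketch of that arithmetic, assuming a replication factor of 3 with one replica in each zone and the usual Cassandra quorum size of floor(RF / 2) + 1 (the talk does not say which consistency level is used, so treat the quorum setting as an assumption):

```python
def quorum(replication_factor: int) -> int:
    """Majority quorum size for a given replication factor."""
    return replication_factor // 2 + 1


def quorum_available(replication_factor: int, replicas_down: int) -> bool:
    """True if enough replicas remain to satisfy a quorum read or write."""
    return replication_factor - replicas_down >= quorum(replication_factor)


rf = 3  # assumed: one replica in each of zones A, B, and C
print(quorum(rf))                # 2
print(quorum_available(rf, 1))   # True: one zone lost
print(quorum_available(rf, 2))   # False: two zones lost
```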
Isolated Regions
[Diagram: US-East Load Balancers and EU-West Load Balancers, each in front of Zone A, Zone B, and Zone C; every zone holds Cassandra Replicas]
Failure Mode          Probability   Mitigation Plan
Application Failure   High          Automatic degraded response
AWS Region Failure    Low           Wait for region to recover
AWS Zone Failure      Medium        Continue to run on 2 out of 3 zones
Datacenter Failure    Medium        Migrate more functions to cloud
Data store failure    Low           Restore from S3 backups
S3 failure            Low           Restore from remote archive
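The "Automatic degraded response" mitigation for application failures can be illustrated with a fallback wrapper. This is a simplified sketch of the general circuit-breaker idea, not the Hystrix API; the class, the function names, and the failure threshold are all hypothetical:

```python
from typing import Callable, TypeVar

T = TypeVar("T")


class DegradingCall:
    """Wraps a dependency call; after repeated failures it skips the call entirely."""

    def __init__(self, call: Callable[[], T], fallback: Callable[[], T],
                 max_failures: int = 3):
        self.call = call
        self.fallback = fallback
        self.max_failures = max_failures
        self.failures = 0  # consecutive failures seen so far

    def __call__(self) -> T:
        if self.failures >= self.max_failures:
            return self.fallback()      # circuit open: do not touch the dependency
        try:
            result = self.call()
        except Exception:
            self.failures += 1
            return self.fallback()      # degraded response instead of an error
        self.failures = 0               # a success resets the counter
        return result


def personalized_rows():                # hypothetical failing dependency
    raise TimeoutError("personalization service unavailable")


def popular_rows():                     # hypothetical unpersonalized fallback
    return ["Popular on Netflix"]


get_rows = DegradingCall(personalized_rows, popular_rows)
print(get_rows())   # ['Popular on Netflix']
```

A production breaker would also retry the dependency after a cool-down period; that part is left out here to keep the sketch short.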
Run what you wrote
 Rapid detection
 Rapid Response
http://techblog.netflix.com/2012/06/annoucing-archaius-dynamic-properties.html
http://techblog.netflix.com/2012/02/fault-tolerance-in-high-volume.html
http://techblog.netflix.com/2012/07/chaos-monkey-released-into-wild.html
http://techblog.netflix.com/2012/11/edda-learn-stories-of-your-cloud.html



[Diagram labels: Eureka Services metadata; AWS Instances, ASGs, etc.; AppDynamics Request flow; Edda; Monkeys]
http://techblog.netflix.com/2012/06/asgard-web-based-cloud-management-and.html
Classify and name the types of things that
might go wrong in the platform or infrastructure
Zone Network Outage
Zone Power Outage
Zone Dependent Service Outage
Dependent Service could be @NetflixOSS platform or underlying infrastructure
[Diagram: US-East Load Balancers and EU-West Load Balancers, each in front of Zone A, Zone B, and Zone C; every zone holds Cassandra Replicas]
Regional Network Outage
Control Plane Overload
[Diagram: US-East Load Balancers and EU-West Load Balancers, each in front of Zone A, Zone B, and Zone C; every zone holds Cassandra Replicas]
Cascading Capacity Overload
Capacity demand migrates to services in another zone that don’t scale up fast enough to take the load
Platform and Infrastructure Software Bugs and Global Configuration Errors: “Oops…”
Migrating demand across regions may just spread the problem further…
[Diagram: US-East Load Balancers and EU-West Load Balancers, each in front of Zone A, Zone B, and Zone C; every zone holds Cassandra Replicas]
Hardening the cloud
 Lessons Learned at Scale
Why Netflix Stays Up (Mostly)
http://techblog.netflix.com/2011/04/lessons-netflix-learned-from-aws-outage.html
http://googleappengine.blogspot.com/2012/10/about-todays-app-engine-outage.html
http://aws.amazon.com/message/67457/
http://techblog.netflix.com/2012/07/lessons-netflix-learned-from-aws-storm.html
@NetflixOSS Eureka service directory failed to mark down dead instances due to a configuration error
Zone Power Outage
Applications not using zone-aware routing kept trying to talk to dead instances and timing out
Effect: higher latency and errors
Mitigation: Fixed configuration, and made zone-aware routing the default
[Diagram: US-East Load Balancers and EU-West Load Balancers, each in front of Zone A, Zone B, and Zone C; every zone holds Cassandra Replicas]
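Zone-aware routing can be sketched as "prefer instances in my own zone, fall back to the others". The example below is an illustration only, not the Eureka or Netflix client API; the `Instance` type, the zone names, and `choose_instance` are hypothetical:

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Instance:
    host: str
    zone: str
    up: bool = True   # status as reported by the service directory


def choose_instance(instances, local_zone, rng=random):
    """Prefer registered-up instances in the caller's zone; otherwise use any zone."""
    healthy = [i for i in instances if i.up]
    if not healthy:
        raise RuntimeError("no healthy instances registered")
    same_zone = [i for i in healthy if i.zone == local_zone]
    return rng.choice(same_zone or healthy)


registry = [
    Instance("a1", "zone-a"),  # actually dead, but the stale directory still says "up"
    Instance("b1", "zone-b"),
    Instance("c1", "zone-c"),
]

# A caller in zone-b never picks the dead zone-a instance, even with stale status.
print(choose_instance(registry, "zone-b").host)   # b1
```

The design point is that a caller in a healthy zone stays off the failed zone even when the directory has not yet marked its instances down.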
Zone Enable DNS
Command Queue
Per-Zone Control Plane Command Queues
[Diagram: US-East Load Balancers and EU-West Load Balancers, each in front of Zone A, Zone B, and Zone C; every zone holds Cassandra Replicas]
A highly scalable, available and durable deployment pattern
Single function Cassandra Cluster
  Managed by Priam
  Between 6 and 72 nodes
Many Different Single-Function REST Clients
Stateless Data Access REST Service
  Astyanax Cassandra Client
Optional Datacenter Update Flow
Each icon represents a horizontally scaled service of three to hundreds of instances deployed over three availability zones
AppDynamics Service Flow Visualization
Linux Base AMI (CentOS or Ubuntu)
  Optional Apache frontend, memcached, non-Java apps
  Monitoring: AppDynamics machineagent, Epic/Atlas
  Log rotation to S3
  Java (JDK 6 or 7)
    AppDynamics appagent monitoring
    GC and thread dump logging
    Tomcat
      Application war file, base servlet, platform, client interface jars, Astyanax
      Healthcheck, status servlets, JMX interface, Servo autoscale
http://github.com/netflix
Linux Base AMI (CentOS or Ubuntu)
  Tomcat and Priam on JDK: Healthcheck, Status
  Monitoring: AppDynamics machineagent, Epic/Atlas
  Java (JDK 7)
    AppDynamics appagent monitoring
    GC and thread dump logging
    Cassandra Server
  Local Ephemeral Disk Space: 2TB of SSD or 1.6TB disk holding Commit log and SSTables
http://github.com/netflix
[Diagram: ring of Cassandra nodes; S3 Backup; Archive]
@NetflixOSS
http://techblog.netflix.com
Legend: Github / Techblog, Apache Contributions, Techblog Post Only, Coming Soon

Priam: Cassandra as a Service
Astyanax: Cassandra client for Java
CassJMeter: Cassandra test suite
Cassandra Multi-region EC2 datastore support
Aegisthus: Hadoop ETL for Cassandra
Explorers
Governator: Library lifecycle and dependency injection
Odin: Workflow orchestration
Blitz4j: Async logging

Exhibitor: Zookeeper as a Service
Curator: Zookeeper Patterns
EVCache: Memcached as a Service
Eureka / Discovery: Service Directory
Archaius: Dynamic Properties Service
Edda: Queryable config history
Server-side latency/error injection
REST Client + mid-tier LB
Configuration REST endpoints

Servo and Autoscaling Scripts
Honu: Log4j streaming to Hadoop
Circuit Breaker (Hystrix): Robust service pattern
Asgard: AutoScaleGroup based AWS console
Chaos Monkey: Robustness verification
Latency Monkey
Janitor Monkey
Bakeries and AMI
Build dynaslaves
http://github.com/Netflix
http://techblog.netflix.com
http://slideshare.net/Netflix

http://www.linkedin.com/in/adriancockcroft
We are sincerely eager to hear your FEEDBACK on this presentation and on re:Invent.
Please fill out an evaluation form when you have a chance.

TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
 
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live StreamsTop 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : Uncertainty
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 

AWS Re:Invent - High Availability Architecture at Netflix

  • 3. (I’m skipping all the cloud intro etc. Netflix runs in the cloud, if you hadn’t figured that out already you aren’t paying attention and should go to the other Netflix talks at AWS Re:Invent or read slideshare.net/netflix)
  • 9. Architecture applies to any cloud or datacenter Illustrated today using real world examples
  • 10. [diagram] Customer Device (PC, PS3, TV…). Browse: Web Site or Discovery API (User Data, Personalization). Play: Streaming API (DRM, QoS Logging). Watch: OpenConnect CDN Boxes (CDN Management and Steering, Content Encoding). Labels: Consumer Electronics, AWS Cloud Services, CDN Edge Locations.
  • 11. Each icon is three to a few hundred instances across three AWS zones. Legend: Cassandra, memcached, Web service, S3 bucket. Start Here. Personalization movie group chooser.
  • 13. Deployed in Three Balanced Availability Zones [diagram: Load Balancers over Zone A, Zone B and Zone C, each with Cassandra and Evcache Replicas]
  • 14. Triple Replicated Persistence [diagram: Load Balancers over Zone A, Zone B and Zone C, each with Cassandra and Evcache Replicas]
  • 15. Isolated Regions [diagram: US-East Load Balancers and EU-West Load Balancers, each over Zone A, Zone B and Zone C with Cassandra Replicas]
  • 16. Failure modes and mitigation plans:
       Failure Mode        | Probability | Mitigation Plan
       Application Failure | High        | Automatic degraded response
       AWS Region Failure  | Low         | Wait for region to recover
       AWS Zone Failure    | Medium      | Continue to run on 2 out of 3 zones
       Datacenter Failure  | Medium      | Migrate more functions to cloud
       Data store failure  | Low         | Restore from S3 backups
       S3 failure          | Low         | Restore from remote archive
  • 17. Run what you wrote. Rapid detection. Rapid response.
  • 21. http://techblog.netflix.com/2012/11/edda-learn-stories-of-your-cloud.html [diagram: Eureka (Services metadata), AWS (Instances, ASGs, etc.), AppDynamics (Request flow), Edda, Monkeys]
  • 26. Classify and name the types of things that might go wrong in the platform or infrastructure
  • 27. Zone Network Outage. Zone Power Outage. Zone Dependent Service Outage. Dependent Service could be @NetflixOSS platform or underlying infrastructure [diagram: US-East Load Balancers and EU-West Load Balancers, each over Zone A, Zone B and Zone C with Cassandra Replicas]
  • 29. Regional Network Outage. Control Plane Overload [diagram: US-East Load Balancers and EU-West Load Balancers, each over Zone A, Zone B and Zone C with Cassandra Replicas]
  • 31. Cascading Capacity Overload. Capacity demand migrates to services in another zone that don’t scale up fast enough to take the load. Platform and Infrastructure Software Bugs and Global Configuration Errors. Migrating demand across regions may just spread the problem further… “Oops…” [diagram: US-East Load Balancers and EU-West Load Balancers, each over Zone A, Zone B and Zone C with Cassandra Replicas]
  • 33. Hardening the cloud Lessons Learned at Scale Why Netflix Stays Up (Mostly)
  • 38. Zone Power Outage. @NetflixOSS Eureka service directory failed to mark down dead instances due to a configuration error. Applications not using zone-aware routing kept trying to talk to dead instances and timing out. Effect: higher latency and errors. Mitigation: Fixed configuration, and made zone-aware routing the default [diagram: US-East Load Balancers and EU-West Load Balancers, each over Zone A, Zone B and Zone C with Cassandra Replicas]
  • 40. Zone Enable DNS Command Queue. Per-Zone Control Plane Command Queues [diagram: US-East Load Balancers and EU-West Load Balancers, each over Zone A, Zone B and Zone C with Cassandra Replicas]
  • 41. A highly scalable, available and durable deployment pattern
  • 42. Single function Cassandra Cluster, Managed by Priam, Between 6 and 72 nodes. Stateless Data Access REST Service, Astyanax Cassandra Client. Many Different Single-Function REST Clients. Optional Datacenter Update Flow. Each icon represents a horizontally scaled service of three to hundreds of instances deployed over three availability zones. Appdynamics Service Flow Visualization.
  • 43. Linux Base AMI (CentOS or Ubuntu):
       Optional Apache frontend, memcached, non-java apps
       Java (JDK 6 or 7), with AppDynamics appagent monitoring and GC and thread dump logging
       Tomcat: Application war file, base servlet, platform, client interface jars, Astyanax; Healthcheck, status servlets, JMX interface, Servo autoscale
       Monitoring: AppDynamics machineagent, Epic/Atlas
       Log rotation to S3
  • 46. Linux Base AMI (CentOS or Ubuntu):
       Tomcat and Priam on JDK: Healthcheck, Status
       Java (JDK 7), with AppDynamics appagent monitoring and GC and thread dump logging
       Cassandra Server
       Monitoring: AppDynamics machineagent, Epic/Atlas
       Local Ephemeral Disk Space – 2TB of SSD or 1.6TB disk holding Commit log and SSTables
  • 48. [diagram: Cassandra nodes, S3 Backup, Archive]
  • 51. Legend: Github / Techblog; Apache Contributions; Techblog Post Only; Coming Soon
       Priam: Cassandra as a Service
       Exhibitor: Zookeeper as a Service
       Servo and Autoscaling Scripts
       Astyanax: Cassandra client for Java
       Curator: Zookeeper Patterns
       Honu: Log4j streaming to Hadoop
       CassJMeter: Cassandra test suite
       EVCache: Memcached as a Service
       Circuit Breaker - Hystrix: Robust service pattern
       Cassandra Multi-region EC2 datastore support
       Eureka / Discovery: Service Directory
       Asgard - AutoScaleGroup based AWS console
       Aegisthus: Hadoop ETL for Cassandra
       Archaius: Dynamics Properties Service
       Chaos Monkey: Robustness verification
       Edda: Queryable config history
       Explorers
       Latency Monkey: Server-side latency/error injection
       Governator - Library lifecycle and dependency injection
       Janitor Monkey
       Odin: Workflow orchestration
       REST Client + mid-tier LB
       Bakeries and AMI
       Blitz4j - Async logging
       Configuration REST endpoints
       Build dynaslaves
  • 53. http://github.com/Netflix http://techblog.netflix.com http://slideshare.net/Netflix http://www.linkedin.com/in/adriancockcroft
  • 54. We are sincerely eager to hear your FEEDBACK on this presentation and on re:Invent. Please fill out an evaluation form when you have a chance.
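
Slide 38 notes that applications without zone-aware routing kept calling dead instances, and that zone-aware routing was made the default. The idea can be illustrated with a minimal Python sketch. It is not the Eureka or Netflix client API: the `Instance` record, the `up` flag, and `pick_instance` are hypothetical names introduced here, and the fallback policy (any healthy instance in another zone) is an assumption.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Instance:
    """Hypothetical registry record; not an actual Eureka type."""
    host: str
    zone: str
    up: bool  # health status as reported by the service directory


def pick_instance(instances, local_zone, rng=random):
    """Prefer a healthy instance in the caller's zone.

    Falls back to a healthy instance in any other zone when the local
    zone has none (assumed policy, for illustration only).
    """
    healthy = [i for i in instances if i.up]
    if not healthy:
        raise LookupError("no healthy instances registered")
    local = [i for i in healthy if i.zone == local_zone]
    # An empty local list is falsy, so the choice widens to all zones.
    return rng.choice(local or healthy)
```

A caller in zone A keeps its traffic in zone A while a healthy local instance exists, and moves to zones B or C only when the directory marks every zone A instance down. The sketch is only as good as the `up` flag it reads, which is the dependency the slide 38 incident exposed.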