Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.

#NetflixEverywhere Global Architecture

QCon London Presentation - 3/8/16

Abstract:
On December 24th, 2012 ASW US-EAST1 experienced a region-wide failure that took down the Netflix service for almost 24 hours. Knowing that failure is inevitable in any complex system we evolved our cloud-based, micro-service architecture to support multi-region traffic management and failover capabilities. With that foundation in place we drove initiatives to achieve service ubiquity and rapid global expansion. The overarching theme is #NetflixEverywhere - an amazing, global, highly available movie and TV streaming experience for any member, anytime, on any device, anywhere in the world.

Building and evolving a pervasive, global service requires a multi-disciplined approach that balances requirements around service availability, latency, data replication, compute capacity, and efficiency. In this session, we’ll follow the Netflix journey of failure, innovation, and ubiquity. We'll review the many facets of globalization then delve deep into the architectural patterns that enable seamless, multi-region traffic management, reliable, fast data propagation, and efficient service infrastructure. The patterns presented will be broadly applicable to internet services with global aspirations.

  • Login to see the comments

#NetflixEverywhere Global Architecture

  1. 1. Josh Evans - Director of Operations Engineering March, 2016 #NetflixEverywhere Global Architecture
  2. 2. December 24th, 2012
  3. 3. Disappointment
  4. 4. Outrage
  5. 5. Withdrawal
  6. 6. December 24th, 2012
  7. 7. Failure is inevitable
  8. 8. Never fail the same way twice Failure-Driven Architecture
  9. 9. #NetflixEverywhere Failure-Driven Architecture Never fail the same way twice
  10. 10. • Introductions • Failure-Driven Architecture • Taking It Global #NetflixEverywhere Our Talk Today
  11. 11. • Introductions • Failure-Driven Architecture • Taking It Global #NetflixEverywhere Our Talk Today
  12. 12. 1999 – 2009 • Ecommerce (DVD  Streaming) 2009 – 2013 • Playback Services (Activate, Manifests, DRM) 2013 - present • Operations Engineering – CD, RTA, Chaos, Performance Josh Evans – Director of Operations Engineering jevans@netflix.com
  13. 13. Bringing movies & TV shows from all over the world to people all over the world • Streaming, on demand, subscription • Global & regional licensing • Hollywood, independent, international • Striving for global ubiquity
  14. 14. 2007 • Jan – Windows 2008 • May – Roku • Oct – LG, Samsung Blu-ray • Oct – Apple Mac • Nov – XBox 360 2009 • Jun – LG DTV • Nov –Sony PS3 (disc) • Nov – Sony Bravia – DTV & Blu-ray Device Ubiquity
  15. 15. 2011 • May – Android • Nov – First e-readers – Kindle Fire, Nook Device Ubiquity 2010 • Mar – Nintendo Wii (disc) • Apr – Apple iPad • Aug – Apple iPhone • Sep – Apple TV • Oct – Sony PS3 (no disc) • Oct – Nintendo Wii (no disc) • Nov – Windows Phone 7
  16. 16. 2010 - Canada 2011 - Latin America 2012 - UK, Ireland, Nordics 2013 – Netherlands 2014 - Austria, Belgium, France, Germany, Luxembourg, Switzerland 2015 - Australia, New Zealand, Japan, Spain, Italy, Portugal Geographic Ubiquity
  17. 17. • English • Spanish (Latin American) • Portuguese (Brazilian) • Dutch • French • German • Japanese • Spanish (Castilian) • Italian • Portuguese (European) Language Ubiquity - Subs, Dubs, UI
  18. 18. 75,000,000
  19. 19. • Introductions • Failure-Driven Architecture • Taking It Global #NetflixEverywhere Our Talk Today
  20. 20. August 2008
  21. 21. • No automation, virtualization, standardization • Manual, error prone, slow • Big iron & monoliths DC2 2009
  22. 22. Undifferentiated Heavy Lifting
  23. 23. US-East-1 Amazon Web Services 2010 • Scale & elasticity • Virtual, programmable • Global footprint
  24. 24. • Micro-services • Database • Cache • Traffic Architectural Pillars
  25. 25. • Micro-services • Database • Cache • Traffic Architectural Pillars
  26. 26. FIT Fault-Injection Test Framework Micro-service Failure
  27. 27. • Micro-services • Database • Cache • Traffic Architectural Pillars
  28. 28. NoSQL but… • Not web scale • Throttling Modest scale • 100s of play starts / second • 10,000s of requests / second • 10s of billions of records SimpleDB
  29. 29. • Micro-services • Database • Cache • Traffic Architectural Pillars
  30. 30. Ephemeral Volatile memCache (EVCache) Clustered memcached optimized for Netflix use cases
  31. 31. EVCache Server Memcached Prana (Sidecar) Monitoring & Other Processes Eureka Client Application Client Library EVCache Client Shards, consistent hashing TTLs & LRU EVCache Architecture
  32. 32. Zone A Client Application Client Library EVCache Client Zone B Client Application Client Library EVCache Client Zone C Client Application Client Library EVCache Client . . .. . .. . . Reads
  33. 33. Zone A Zone B Zone C . . .. . .. . . Writes Client Application Client Library EVCache Client Client Application Client Library EVCache Client Client Application Client Library EVCache Client
  34. 34. 1. Read from cache 2. On cache miss call service 3. Service calls DB & responds 4. Service updates cache Client Application Client Library EVCache Client Service Client S S S S. . . DB DB DB DB. . . . . . Fronting Micro-services . . .
  35. 35. Linear Scaling • 30 million requests/sec • 2 trillion requests per day globally • Hundreds of billions of objects • Tens of thousands of memcached instances • Milliseconds of latency per request
  36. 36. US-East-1 Canada International Expansion 2011 US Latin America
  37. 37. US-East-1 EU-West-1 Cloud Islands 2012
  38. 38. • Micro-services • Database • Cache • Traffic Architectural Pillars
  39. 39. US-East-1 EU-West-1 UK/IE, Nordics, Netherlands Latin America DNS Geo Mapping Canada US
  40. 40. • Micro-services • Database! • Caching • Traffic Architectural Pillars
  41. 41. Why Cassandra? • NoSQL at scale • Open source • Multi-region • Multi-directional • CAP Choices – Availability – Partition tolerance – Eventual consistency* Scalable, Durable, Global
  42. 42. Single Region, Multiple AZs 1. Client writes to any node 2. Coordinator replicates to nodes 3. Nodes ack to coordinator 4. Coordinator acks to client 5. Write to commit log Zone A Zone B Zone C Zone B Zone C Client Zone A • Hinted handoff to offline nodes Local Quorum (Typical) 100ms
  43. 43. Not quite fast enough
  44. 44. December 24th, 2012
  45. 45. US-East-1US-West-2 EU-West-1 Isthmus Spring 2013 Survive a regional ELB outage
  46. 46. AZ1 AZ2 AZ3 US-EAST-1 ELBs Zuul Data Data Data Geo-located state/province AZ1 AZ2 AZ3 US-WEST-2 ELBs Zuul Data Data Data Americas Internet Traffic Eastern NA + LatAm Traffic • Zuul routes locally or remotely • Eureka - multi-region aware Isthmus
  47. 47. US-East-1US-West-2 EU-West-1 Active-Active 2013 - 2014 Survive a large-scale regional service outage
  48. 48. Active-Active Data Replication
  49. 49. Region BRegion A Zone A Zone B Zone C Zone B Zone C Zone A Zone A Zone B Zone C Zone C Client Client Zone AZone B Multi-Region Writes 500ms Bi-directional Nightly compare & repair Local Quorum (Typical)
  50. 50. EVCache Replication Repl Writer SQS Application Client EVCache Replication Repl Writer 1. Set or delete 2. send metadata 3. poll msg 6. set or delete Application Client SQS 7. read EVCache Cross-Region Replication Region BRegion A
  51. 51. Active-Active Traffic Management
  52. 52. ELB US-West-2 ELB US-East-1 ELB EU-West-1 DNS api-global.netflix.com UltraDNS Route53
  53. 53. DNS api-global.netflix.com • Remove state from geo bucket ELB US-West-2 ELB US-East-1 ELB EU-West-1
  54. 54. api-global.netflix.com DNS • Remove state from geo bucket • Add state to geo bucket • Log event • For each end point ELB US-West-2 ELB US-East-1 ELB EU-West-1
  55. 55. api-global.netflix.com api-global.us-west-2 .prodaa.neflix.com api-global.us-east-1 .prodaa.neflix.com api-global.eu-west-1 .prodaa.neflix.com ELB ELB ELB Shim
  56. 56. api-global.netflix.com api-global.us-west-2 .prodaa.neflix.com api-global.us-east-1 .prodaa.neflix.com api-global.eu-west-1 .prodaa.neflix.com ELB ELB ELB Shim
  57. 57. api-global.netflix.com api-global.us-west-2 .prodaa.neflix.com api-global.us-east-1 .prodaa.neflix.com api-global.eu-west-1 .prodaa.neflix.com ELB ELB ELB Shim
  58. 58. Active-Active Failover
  59. 59. • Introductions • Failure-Driven Architecture • Taking It Global #NetflixEverywhere Our Talk Today
  60. 60. January 6th, 2016
  61. 61. Geographic Ubiquity
  62. 62. Before Global • English • Spanish (Latin American) • Portuguese (Brazilian) • Dutch • French • German • Japanese • Spanish (Castilian) • Italian • Portuguese (European) Global • Chinese • Korean • Arabic Language Ubiquity
  63. 63. March 18th, 2016 Daredevil Season 2 All episodes, all devices, all countries Simultaneously Content Ubiquity
  64. 64. Ubiquitous, Resilient Architecture
  65. 65. US-East-1US-West-2 EU-West-1 Reliably and efficiently serve any customer from any region Netflix Global 2015
  66. 66. US-East-1US-West-2 EU-West-1
  67. 67. US-East-1US-West-2 EU-West-1
  68. 68. Ubiquitous Data
  69. 69. EVCache Replication Repl Writer Application Client Kafka SQS • High latency • Read once Kafka • Low latency • Multiple readers • > 1M replications/sec
  70. 70. US Ring US Ring EU Ring EU-West-1US-East-1 1. Extend US ring to EU region & run repairs Client 2. Dual Write 3. Forklift
  71. 71. EU-West-1US-West-2 Global Ring Global Ring US-East-1 Global Ring
  72. 72. Ubiquitous Traffic Management
  73. 73. us-east-1-na • East US • East CA • MX us-west-2 • APAC • West US • West CA eu-west-1 • Europe • Mid East • Africa us-east-1-sa • LatAm • Not MX Virtual DNS Regions
  74. 74. • Fixed virtual modules • Origin tier • Standardized names api-global.netflix.com api-global.us-west-2 .prodaa.neflix.com api-global.us-east-1-sa .prodaa.neflix.com api-global.us-east-1-na .prodaa.neflix.com api-global.eu-west-1 .prodaa.netflix.com api-global.us-west-2.origin .prodaa.neflix.com api-global.us-east-1.origin .prodaa.neflix.com api-global.eu-west-1.origin .prodaa.neflix.com ELB ELB ELB Virtual Origin DNS Tiers
  75. 75. api-global.netflix.com api-global.us-west-2 .prodaa.neflix.com api-global.us-east-1-sa .prodaa.neflix.com api-global.us-east-1-na .prodaa.neflix.com api-global.eu-west-1 .prodaa.netflix.com api-global.us-west-2.origin .prodaa.neflix.com api-global.us-east-1.origin .prodaa.neflix.com api-global.eu-west-1.origin .prodaa.neflix.com ELB ELB ELB Virtual Origin Split Failover
  76. 76. api-global.netflix.com api-global.us-west-2 .prodaa.neflix.com api-global.us-east-1-sa .prodaa.neflix.com api-global.us-east-1-na .prodaa.neflix.com api-global.eu-west-1 .prodaa.neflix.com api-global.us-west-2.origin .prodaa.neflix.com api-global.us-east-1.origin .prodaa.neflix.com api-global.eu-west-1.origin .prodaa.neflix.com ELB ELB ELB Virtual Origin Split Failover
  77. 77. api-global.netflix.com api-global.us-west-2 .prodaa.neflix.com api-global.us-east-1-sa .prodaa.neflix.com api-global.us-east-1-na .prodaa.neflix.com api-global.eu-west-1 .prodaa.netflix.com api-global.us-west-2.origin .prodaa.neflix.com api-global.us-east-1.origin .prodaa.neflix.com api-global.eu-west-1.origin .prodaa.neflix.com ELB ELB ELB Virtual Origin Cascading Failover
  78. 78. api-global.netflix.com api-global.us-west-2 .prodaa.neflix.com api-global.us-east-1-sa .prodaa.neflix.com api-global.us-east-1-na .prodaa.neflix.com api-global.eu-west-1 .prodaa.netflix.com api-global.us-west-2.origin .prodaa.neflix.com api-global.us-east-1.origin .prodaa.neflix.com api-global.eu-west-1.origin .prodaa.neflix.com ELB ELB ELB Virtual Origin Cascading Failover
  79. 79. api-global.netflix.com api-global.us-west-2 .prodaa.neflix.com api-global.us-east-1-sa .prodaa.neflix.com api-global.us-east-1-na .prodaa.neflix.com api-global.eu-west-1 .prodaa.netflix.com api-global.us-west-2.origin .prodaa.neflix.com api-global.us-east-1.origin .prodaa.neflix.com api-global.eu-west-1.origin .prodaa.neflix.com ELB ELB ELB Virtual Origin Cascading Failover
  80. 80. api-global.netflix.com api-global.us-west-2 .prodaa.neflix.com api-global.us-east-1-sa .prodaa.neflix.com api-global.us-east-1-na .prodaa.neflix.com api-global.eu-west-1 .prodaa.netflix.com api-global.us-west-2.origin .prodaa.neflix.com api-global.us-east-1.origin .prodaa.neflix.com api-global.eu-west-1.origin .prodaa.neflix.com ELB ELB ELB Virtual Origin Cascading Failover
  81. 81. x x Multi-region Failover
  82. 82. January 6th, 2016
  83. 83. “Going global is just like having a baby.” - Reed Hastings, Netflix CEO
  84. 84. What’s next? • Global latency • Edge computing • ML-based monitoring • Self-healing systems • Capacity utilization • Fast, autonomous traffic • Integrate DB & caching #NetflixEverywhere
  85. 85. Takeaways
  86. 86. Never fail the same way twice Christmas Eve 2012 Today
  87. 87. Know your resiliency patterns Pattern Properties DC SPoF, infrastructure heavy lifting Cloud (one region) Multiple DCs, one control plane Islands Regional containment Isthmus Regional ELB bypass Active-active Regional failover Global Ubiquity, resiliency, efficiency
  88. 88. Invest in architectural pillars • Micro-services • Database • Caching • Traffic
  89. 89. #NetflixEverywhere Think globally, act locally
  90. 90. netflix.github.io
  91. 91. netflix.github.io
  92. 92. Netflix Tech Blog techblog.netflix.com
  93. 93. Josh Evans jevans@netflix.com @ops_engineering #NetflixEverywhere Global Architecture ? ? ? ? ?

×