Making a business more agile and responsive to opportunities and competitive threats starts with the time taken to develop and deliver web services, which is a large part of the overall response time. Cloud Native architecture delivers speed, scalability and security by automating the continuous delivery of single-function microservices backed by a denormalized NoSQL store. At Netflix, the streaming service is deployed globally, using Cassandra for cross-zone and cross-region replication. NetflixOSS is a set of open source components that anyone can use to adopt Cloud Native architectures, and there is a prize for the best open source contributions to NetflixOSS; see http://netflix.github.com
12. How to get to Cloud Native
Freedom and Responsibility for Developers
Decentralize and Automate Ops Activities
Integrate DevOps into the Business Organization
13. Four Transitions
• Management: Integrated Roles in a Single Organization
– Business, Development, Operations -> BusDevOps
• Developers: Denormalized Data – NoSQL
– Decentralized, scalable, available, polyglot
• Responsibility from Ops to Dev: Continuous Delivery
– Decentralized small daily production updates
• Responsibility from Ops to Dev: Agile Infrastructure - Cloud
– Hardware in minutes, provisioned directly by developers
14. Netflix BusDevOps Organization
• Chief Product Officer
– VP Product Management -> Directors Product
– VP UI Engineering -> Directors Development -> Developers + DevOps -> UI Data Sources -> AWS
– VP Discovery Engineering -> Directors Development -> Developers + DevOps -> Discovery Data Sources -> AWS
– VP Platform -> Directors Platform -> Developers + DevOps -> Platform Data Sources -> AWS
• Diagram annotations
– Code, independently updated continuous delivery
– Denormalized, independently updated and scaled data
– Cloud, self service updated & scaled infrastructure
19. Netflix Member Web Site Home Page
Personalization Driven – How Does It Work?
20. How Netflix Streaming Works
Customer Device (PC, PS3, TV…)
Web Site or Discovery API
User Data
Personalization
Streaming API
DRM
QoS Logging
OpenConnect CDN Boxes
CDN Management and Steering
Content Encoding
Diagram regions: Consumer Electronics, AWS Cloud Services, CDN Edge Locations
22. Real Web Server Dependencies Flow
(Netflix Home page business transaction as seen by AppDynamics)
Start Here
memcached
Cassandra
Web service
S3 bucket
Personalization movie group choosers (for US, Canada and Latam)
Each icon is three to a few hundred instances across three AWS zones
23. Three Balanced Availability Zones
Test with Chaos Gorilla
Load Balancers
• Zone A: Cassandra and EVCache Replicas
• Zone B: Cassandra and EVCache Replicas
• Zone C: Cassandra and EVCache Replicas
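One way to reason about why the zones are kept balanced: if each zone holds a full set of replicas and carries an equal share of traffic, losing one zone (the failure Chaos Gorilla simulates) leaves two replicas of every item and shifts the lost zone's load onto the survivors. The following is a minimal sketch of that arithmetic, not Netflix tooling; the function names and the 60% utilization figure are illustrative assumptions.

```python
ZONES = ("A", "B", "C")


def replicas_after_zone_loss(zones, failed_zone):
    """Return the zones that still hold a replica when one zone is lost.

    Assumes every zone holds one replica of every item.
    """
    if failed_zone not in zones:
        raise ValueError(f"unknown zone: {failed_zone}")
    return [zone for zone in zones if zone != failed_zone]


def utilization_after_zone_loss(utilization, zone_count=3):
    """Per-zone utilization once one zone's traffic is spread over the rest.

    Assumes traffic was balanced evenly across zones before the failure.
    """
    if zone_count < 2:
        raise ValueError("need at least two zones to survive losing one")
    return utilization * zone_count / (zone_count - 1)


def max_safe_utilization(zone_count=3):
    """Highest steady-state utilization that still fits after one zone is lost."""
    return (zone_count - 1) / zone_count


if __name__ == "__main__":
    print("replicas remain in zones:", replicas_after_zone_loss(ZONES, "B"))
    # 0.60 is a hypothetical steady-state utilization, not a Netflix figure.
    print("utilization after loss:", round(utilization_after_zone_loss(0.60), 2))
    print("max safe utilization:", round(max_safe_utilization(), 2))
```

With three zones the surviving two each take on half of the lost zone's traffic, so steady-state utilization has to stay at or below two thirds for a zone outage to be absorbed without adding capacity.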
24. Isolated Regions
• US-East Load Balancers
– Zone A: Cassandra Replicas
– Zone B: Cassandra Replicas
– Zone C: Cassandra Replicas
• EU-West Load Balancers
– Zone A: Cassandra Replicas
– Zone B: Cassandra Replicas
– Zone C: Cassandra Replicas
25. Cross Region Use Cases
• Geographic Isolation
– US to Europe replication of subscriber data
– Read intensive, low update rate
– Production use since late 2011
• Redundancy for regional failover
– US East to US West replication of everything
– Includes write intensive data, high update rate
– Testing now
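In Cassandra, replication across regions is normally declared per keyspace with NetworkTopologyStrategy, which takes a replica count for each data center. The sketch below only builds such a CQL statement as a string. The keyspace name, the data center names and the replica counts are assumptions chosen to match the three-zones-per-region diagrams, not Netflix's actual configuration.

```python
def create_keyspace_cql(keyspace, replicas_per_datacenter):
    """Build a CQL CREATE KEYSPACE statement using NetworkTopologyStrategy.

    replicas_per_datacenter maps a data center name to its replica count.
    """
    if not replicas_per_datacenter:
        raise ValueError("at least one data center is required")

    options = ["'class': 'NetworkTopologyStrategy'"]
    # Sort so the generated statement is deterministic.
    for datacenter, count in sorted(replicas_per_datacenter.items()):
        if count < 1:
            raise ValueError(f"replica count for {datacenter} must be >= 1")
        options.append(f"'{datacenter}': {count}")

    replication = "{" + ", ".join(options) + "}"
    return f"CREATE KEYSPACE {keyspace} WITH replication = {replication};"


if __name__ == "__main__":
    # Hypothetical names: one replica per zone in each of two regions.
    print(create_keyspace_cql("subscribers", {"us-east": 3, "eu-west": 3}))
```

Keeping the replica count per region explicit is what lets read-mostly subscriber data live in both regions while other keyspaces stay in one.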
26. Benchmarking Global Cassandra
Write intensive test of cross region replication capacity
16 x hi1.4xlarge SSD nodes per zone = 96 total
192 TB of SSD in six locations up and running Cassandra in 20 min
• US-West-2 Region - Oregon: Cassandra Replicas in Zones A, B and C
• US-East-1 Region - Virginia: Cassandra Replicas in Zones A, B and C
• Load generators: Test Load, Test Load, Validation Load
• 1 Million writes at CL.ONE (wait for one replica to ack)
• 1 Million reads after 500ms at CL.ONE with no data loss
• Inter-Zone Traffic
• Inter-Region Traffic: up to 9Gbits/s, 83ms
• 18TB backups from S3
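The cluster size quoted on this slide can be checked with a few lines of arithmetic. This is a small sketch using only the numbers stated above (16 nodes per zone, three zones in each of two regions, 192 TB in total); the function names are illustrative.

```python
def cluster_size(nodes_per_zone, zones_per_region, regions):
    """Return (locations, total_nodes) for a multi-region cluster."""
    locations = zones_per_region * regions
    return locations, nodes_per_zone * locations


def ssd_per_node_tb(total_ssd_tb, total_nodes):
    """Average SSD capacity per node implied by the cluster totals."""
    if total_nodes <= 0:
        raise ValueError("total_nodes must be positive")
    return total_ssd_tb / total_nodes


if __name__ == "__main__":
    locations, nodes = cluster_size(nodes_per_zone=16, zones_per_region=3, regions=2)
    print("locations:", locations)
    print("nodes:", nodes)
    print("SSD per node (TB):", ssd_per_node_tb(192, nodes))
```

Six locations of 16 nodes give the 96 nodes on the slide, and 192 TB across 96 nodes works out to 2 TB of SSD per node.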
27. Managing Multi-Region Availability
• Two regions, each with Regional Load Balancers and Cassandra Replicas in Zones A, B and C
• DNS providers: UltraDNS, DynECT DNS, AWS Route53
Denominator – manage traffic via multiple DNS providers with Java code
2013 Timeline - Concept Jan, Code Feb, OSS March, Production use May
28. Incidents – Impact and Mitigation
• PR: X Incidents (Public Relations Media Impact)
• CS: XX Incidents (High Customer Service Calls)
• Metrics impact – Feature disable: XXX Incidents (Affects AB Test Results)
• No Impact – fast retry or automated failover: XXXX Incidents
• Mitigation
– Y incidents mitigated by Active Active, game day practicing
– YY incidents mitigated by better tools and practices
– YYY incidents mitigated by better data tagging
29. Cloud Security
Automated attack surface monitoring
Crypto key store management (CloudHSM)
Scale to resist DDOS attacks
http://www.slideshare.net/jason_chan/resilience-and-security-scale-lessons-learned
32. Fitting Into Public Scale
Public, Grey Area, Private
1,000 Instances, 100,000 Instances
Netflix, Facebook, Startups
33. How big is Public?
AWS upper bound estimate based on the number of public IP Addresses
Every provisioned instance gets a public IP by default (instances in some VPCs don’t)
AWS Maximum Possible Instance Count 4.2 Million – May 2013
Growth >10x in Three Years, >2x Per Annum - http://bit.ly/awsiprange
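Both figures on this slide are simple to reproduce in outline. The sketch below sums the addresses in a list of CIDR blocks to get an upper bound on instance count, and converts a multi-year growth multiple into a per-year factor. The CIDR blocks are documentation-only example ranges, not the published AWS ranges, and the function names are illustrative.

```python
import ipaddress


def max_instance_count(cidr_blocks):
    """Upper bound on instances: one instance per public address.

    Assumes the blocks do not overlap; overlapping blocks would be
    counted twice.
    """
    return sum(ipaddress.ip_network(block).num_addresses for block in cidr_blocks)


def annual_growth_factor(total_growth, years):
    """Per-year growth factor that compounds to total_growth over years."""
    if total_growth <= 0 or years <= 0:
        raise ValueError("total_growth and years must be positive")
    return total_growth ** (1.0 / years)


if __name__ == "__main__":
    # Hypothetical example blocks (reserved documentation ranges).
    example_blocks = ["198.51.100.0/24", "203.0.113.0/24", "192.0.2.0/25"]
    print("upper bound:", max_instance_count(example_blocks))
    print("annual factor for 10x in 3 years:", round(annual_growth_factor(10, 3), 2))
```

Growth of 10x over three years compounds to a little over 2x per year, which is consistent with the ">2x Per Annum" figure above.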
34. A Cloud Native Open Source Platform
See netflix.github.com
35. Goals
Establish our solutions as Best Practices / Standards
Hire, Retain and Engage Top Engineers
Build up Netflix Technology Brand
Benefit from a shared ecosystem
39. What’s Coming Next?
More Use Cases
More Features
Better portability
Higher availability
Easier to deploy
Contributions from end users
Contributions from vendors
40. Vendor Driven Portability
Interest in using NetflixOSS for Enterprise Private Clouds
“It’s done when it runs Asgard”
Functionally complete
Demonstrated March
Released June in V3.3
Offering $10K prize for integration work
Vendor and end user interest
OpenStack “Heat” getting there
PayPal C3 Console based on Asgard
41. Functionality and scale now, portability coming
Moving from parts to a platform in 2013
Netflix is fostering a cloud native ecosystem
Rapid Evolution - Low MTBIAMSH
(Mean Time Between Idea And Making Stuff Happen)
42. Slideshare.net/Netflix Details
• Meetup S1E3 July – Featuring Contributors Eucalyptus, IBM, PayPal, Riot Games
– http://techblog.netflix.com/2013/07/netflixoss-meetup-series-1-episode-3.html
• Lightning Talks March S1E2
– http://www.slideshare.net/RuslanMeshenberg/netflixoss-meetup-lightning-talks-and-roadmap
• Lightning Talks Feb S1E1
– http://www.slideshare.net/RuslanMeshenberg/netflixoss-open-house-lightning-talks
• Asgard In Depth Feb S1E1
– http://www.slideshare.net/joesondow/asgard-overview-from-netflix-oss-open-house
• Security Architecture
– http://www.slideshare.net/jason_chan/resilience-and-security-scale-lessons-learned/
• Cost Aware Cloud Architectures – with Jinesh Varia of AWS
– http://www.slideshare.net/AmazonWebServices/building-costaware-architectures-jinesh-varia-aws-and-adrian-cockroft-netflix
43. What Changed?
Speed wins, Cloud Native helps you get there
NetflixOSS makes it easier for everyone to become Cloud Native
@adrianco #netflixcloud @NetflixOSS