SlideShare a Scribd company logo
1 of 44
Download to read offline
A Series Of Unfortunate Container Events
Netflix’s container platform lessons learned
About the speakers and team
Follow along - @sargun @tomaszbak_ca @fabiokung
@aspyker @amit_joshee @anwleung
2
Netflix’s container management platform
● Titus
● Scheduling
○ Service & batch jobs
○ Resource management
● Container Execution
○ Docker/AWS Integration
○ Netflix Infra Support
Service
Job Management
Resource Management & Optimization
Container Execution
Integration
Batch
3
Containers In
Production
4
Current Titus scale
● Deployed across multiple AWS accounts & three regions
● Over 5,000 instances (Mostly M4.4xls & R3.8xls)
● Over a week period launched over 1,000,000 containers
● Over 10,000 containers running concurrently
5
Single cloud platform for VMs and containers
● CI/CD (Spinnaker)
● Telemetry systems
● Discovery and RPC load balancing
● Healthcheck, Edda and system metrics
● Chaos monkey
● Traffic control (Flow & Kong)
● Netflix secure secret management
● Interactive access (ala ssh)
6
Integrate containers with AWS EC2
● VPC Connectivity (IP per container)
● Security Groups
● EC2 Metadata service
● IAM Roles
● Multi-tenant isolation (cpu, memory, disk quota, network)
● Live and S3 persisted logs rotation & mgmt
● Environmental context to similar to user data
● Autoscaling service jobs (coming)
7
● Service
○ Stream Processing (Flink)
○ UI Services (NodeJS)
○ Internal dashboards
● Batch
○ Personalization ML model training (GPUs)
○ Content value analysis
○ Digital watermarking
○ Ad hoc reporting
○ Continuous integration builds
○ Media encoding experimentation
Container users on Titus
Archer
8
Titus high level overview
9
RheaRheaTitus API
Cassandra
Titus Scheduler
● Job Lifecycle Control
● Resource Management
EC2 Autoscaling
Fenzo
9
container
container
container
docker
Titus Agents
Mesos agent
Docker
Docker Registry
containercontainerUser Containers
AWS Virtual Machines
Mesos
Titus System AgentsWorkflow
Systems
9
Look away, look away, Look away, look away
This session will wreck your evening​, your whole life, and your day
Every single episode is nothing but dismay
So, look away, Look away, look away
Lessons learned from a year in production?
10
Expect
Bad Actors
11
Run-away submissions
Submit a job, check status
If API doesn’t answer
assume 404 and re-submit
Problem:
User
12
Worked for our content processing job of 100 containers
Let’s run our “back-fill” -- 100s of thousands of containers
Problems
● Scheduler runs out of memory
● All other jobs get queued behind
Solutions
● Scheduler capacity groups
● Absolute caps on number of concurrent live jobs
● Upstream systems doing ingest control
System perceived as infinite queue
User
13
Uses REST/JSON poorly
{ env: { “PATH” : null } }
Problems
● Scheduler crashes, fails over, crashes, repeat
Solutions
● Input validation, input fuzz testing, exception handling
Invalid Jobs
User
14
Failing jobs that repeat
Image: “org/imagename:lateest”
Command: /bin/besh -c ...
Problems
● Containers can launch FAST! Can be restarted FAST!
● Scheduler works really hard
● Cloud resources allocated/deallocated FAST
Solutions
● Rate limiting of failing jobs
User
15
Problems
● Scheduler fails, can’t recover due to “bad” jobs
Solutions
Manual removal of bad job state? ✖ Test production data sets in staging
✔
Testing for “bad” job data
16
1. Export job data
2. Restore job data
3. Test recovery
4. Deploy new code
PROD
STAGING
PROD
Identifying bad actors
V2 API
● user (optional)
V2 Auditing
● Added collection of user
performing action
V3 API
● Owner -> teamEmail (required)
17
● User Namespaces
○ Docker 1.10 - User Namespaces (Feb 2016)
○ Docker 1.11 - Fixed shared networking NSs
■ User id mapping is per daemon (not per container)
○ Deployed user namespaces recently
■ Problems - shared NFS, OSX LDAP uid/gid’s
● Locked Down hosts
○ Users only have access to containers, not hosts
○ Required “power user” ssh access for perf/debugging
Really bad actors - container escapes protections
18
The Cloud Isn’t Perfect
19
Cloud rate limiting and overall limits
Let’s do a red/black deploy of 2000 containers instantly
Problems
● Scheduler and distributed host fleet ... no problem!
● Cloud provider … problem!
Solutions
● Exponential backoff with jitter on hosts
● Setting expectations of maximum concurrent launches
● Rate limiting of container scheduling and overall number of containers
User
20
Hosts start or go bad
Problems
● Hosts come up with flakey networks
● Host disks come up and are slow
● Hosts go bad over time
Solutions
● Scheduler must be aware
of host health checks
● Linux, storage, etc warming
● Auto-termination if hosts take too
long to become healthy
21
Upgrades - In place upgrades
● Simpler for container users
● Infrastructure becomes mutable
● Doesn’t leverage elastic cloud
● How to handle rollback?
Docker V1
Titus Agents V1
Mesos V1
Batch Container #1
Service Container #2
Docker V2
Titus Agents V2
Mesos V2
Batch Container #1
Service Container #2
Agent Agent with updates
✖
✖
✖
22
Upgrades - Whole cluster red/black
● Full red/black takes hours
● Costly (duplicate clusters)
● Insufficient Capacity Exception (ICE)
● Rollback requires ALL containers to move twice
✖
23
Upgrades - Partitioned cluster updates
● Requires complex scheduler knowledge
○ Batch jobs to have runtime limits
○ Service jobs with Spinnaker migration tasks
● Starting point for fleet cluster management
Docker V1
Titus Agents V1
Mesos V1
Batch Container #1
Service Container #2
Docker V2
Titus Agents V2
Mesos V2
Service Container #2
Let “drain”
old
new
✖
✖
✖
24
CI/CD
Task
Migration
Our Code Isn’t Perfect
25
Disconnected containers
Problem
● Host agent controllers lock up
● Control plane can’t kill replaced containers
User
● Why is my old code still running?
Solutions
● Monitor and alert on differences
● Reconcile to this system as
aggressively as possible Container
Scheduler
I’m running
You shouldn’t be!
Agent
Stop Container
×
×
Locked
Up
26
Scheduler failover speed is important
27
StandBy
C*
save
Active
C*
restore
Scheduler failover time increased
with scale
● Loss of API availability
● Reconciliation bugs caused task
crashes
Solutions:
● Data sharding (current vs old tasks)
● Do as little as possible during startup
Active Active✖
Know your dependencies
28
Agent
DNSS3Zookeeper
Problems
● Container creation errors
● Logs upload failure
● Task crashes
Solutions
● Retries
● Rate limiting
● Isolation
● Containers start with Docker .. end with the kernel
● Best container runtimes deeply leverage kernel primitives
○ Resource Isolation
○ Security
○ Networking etc
● Debugging tools (tracepoints, perf events) not container aware
○ Need for BPF, Kprobe
Containers require kernel knowledge
29
Problems
● Our instances fail
● Our code fails
● Our dependencies fail
Solutions
● Learn to love the Chaos Monkey
● Enabled for prod and all
services (even our scheduler)
Strategy: Embrace chaos
30
Alerting and dashboards key
Very complex system == very complex telemetry and alerting
Continuously evolving
● Based on real incidents and resulting deep analysis
Telemetry system Number
Metrics 100’s
Dashboard Graphs 70+
Alerts 50+
Elastic Search Indexes 4
31
Temporary ad hoc remediation
● Manual babysitting of scheduler state
● Pin high when auto-scaling and
capacity management isn’t work
● Automated for each ssh across all nodes
○ Detecting and remediating problems
32
What has worked well?
33
Solid software
Docker Distribution
● Our Docker registry
● Simple redirect on top of S3
Apache Mesos
● Extensible resource manager
● Highly reliable replicated log
Zookeeper
● Leader Election
● Isolated
Cassandra
● CDE Internal Service
● Careful of access model
34
Managing the Titus product
Focus on core business value
● Container cluster management
Features are important
● Deciding what not to do …
Just as important!
Deliberately chose NOT to do
● Service Discovery / RPC
● Continuous Delivery
35
Ops enablement
Phase 1: Manual red/black deploys
Phase 2: Runbook for on-call
Phase 3: Automated pipelines
Deploy New ASG
Find Current
Leader
Terminate Non
Leader Nodes
Terminate Current
Leader
36
Problem
● If you aren’t measuring, you don’t know
● If you don’t know, you can’t improve
Solutions
● Our SLOs
○ Start Latency
○ % Crashed
○ API Availability
● Once we started watching, we started improving
Service Level Objectives (SLOs)
37
Onboarding slowly
38
Documenting container readiness
● Broken down by type of application and feature set
● Readiness expressed in
○ Alpha (early users), beta (subject to change), GA (mass adoption)
39
Growing usage slowly, carefully
Titus Created
Batch GA
4Q 2015
Service Support
Added
1Q 2016
First Scale
Production Service
4Q 2016
Netflix Customer
Facing Service
2Q 2017
40
shadow
Key takeaways
● Expect problematic containers & workloads
● Continued need for cloud to evolve for containers
● Container schedulers, runtime are complex
● Ops enablement key for production systems
● Users need help adopting containers responsibly
● Worth the effort due to value containers unlock
41
Questions?
42
Backup
Titus High Level Overview
44
Titus UITitus UI
RheaRheaTitus API
Titus UI
Cassandra
Titus Master
Job Management &
Scheduler
Zookeeper
EC2
Auto-scaling API
Mesos Master
Fenzo
44
Docker
Registry
Docker
Registry
container
container
container
docker
Titus Agent
metrics agents
Titus executor
logging agent
btrfs
Mesos agent
Docker
S3
Docker
Registry
container
Pod & VPC network
drivers
containercontainer
AWS
metadata proxy
Integration
AWS VMs

More Related Content

What's hot

Actor Model and C++: what, why and how?
Actor Model and C++: what, why and how?Actor Model and C++: what, why and how?
Actor Model and C++: what, why and how?Yauheni Akhotnikau
 
NoSQL at Twitter (NoSQL EU 2010)
NoSQL at Twitter (NoSQL EU 2010)NoSQL at Twitter (NoSQL EU 2010)
NoSQL at Twitter (NoSQL EU 2010)Kevin Weil
 
memcached Binary Protocol in a Nutshell
memcached Binary Protocol in a Nutshellmemcached Binary Protocol in a Nutshell
memcached Binary Protocol in a NutshellToru Maesaka
 
Hardening Kafka Replication
Hardening Kafka Replication Hardening Kafka Replication
Hardening Kafka Replication confluent
 
Introduction to Micronaut - JBCNConf 2019
Introduction to Micronaut - JBCNConf 2019Introduction to Micronaut - JBCNConf 2019
Introduction to Micronaut - JBCNConf 2019graemerocher
 
Using the New Apache Flink Kubernetes Operator in a Production Deployment
Using the New Apache Flink Kubernetes Operator in a Production DeploymentUsing the New Apache Flink Kubernetes Operator in a Production Deployment
Using the New Apache Flink Kubernetes Operator in a Production DeploymentFlink Forward
 
Distributed Caching in Kubernetes with Hazelcast
Distributed Caching in Kubernetes with HazelcastDistributed Caching in Kubernetes with Hazelcast
Distributed Caching in Kubernetes with HazelcastMesut Celik
 
[KubeConUS2019 Docker, Inc. Booth] Distributed Builds on Kubernetes with Bui...
 [KubeConUS2019 Docker, Inc. Booth] Distributed Builds on Kubernetes with Bui... [KubeConUS2019 Docker, Inc. Booth] Distributed Builds on Kubernetes with Bui...
[KubeConUS2019 Docker, Inc. Booth] Distributed Builds on Kubernetes with Bui...Akihiro Suda
 
High throughput data replication over RAFT
High throughput data replication over RAFTHigh throughput data replication over RAFT
High throughput data replication over RAFTDataWorks Summit
 
Introduction to kubernetes
Introduction to kubernetesIntroduction to kubernetes
Introduction to kubernetesRishabh Indoria
 
Kubernetes Webinar - Using ConfigMaps & Secrets
Kubernetes Webinar - Using ConfigMaps & Secrets Kubernetes Webinar - Using ConfigMaps & Secrets
Kubernetes Webinar - Using ConfigMaps & Secrets Janakiram MSV
 
Nghiên cứu và triển khai hệ thống Private Cloud cho các ứng dụng đào tạo và ...
 Nghiên cứu và triển khai hệ thống Private Cloud cho các ứng dụng đào tạo và ... Nghiên cứu và triển khai hệ thống Private Cloud cho các ứng dụng đào tạo và ...
Nghiên cứu và triển khai hệ thống Private Cloud cho các ứng dụng đào tạo và ...hieu anh
 
containerd the universal container runtime
containerd the universal container runtimecontainerd the universal container runtime
containerd the universal container runtimeDocker, Inc.
 
セキュリティ未経験だったけど入社1年目から Bug Bounty Program 運営に参加してみた
セキュリティ未経験だったけど入社1年目から Bug Bounty Program 運営に参加してみたセキュリティ未経験だったけど入社1年目から Bug Bounty Program 運営に参加してみた
セキュリティ未経験だったけど入社1年目から Bug Bounty Program 運営に参加してみたLINE Corporation
 
unique_ptrにポインタ以外のものを持たせるとき
unique_ptrにポインタ以外のものを持たせるときunique_ptrにポインタ以外のものを持たせるとき
unique_ptrにポインタ以外のものを持たせるときShintarou Okada
 
Counting with Prometheus (CloudNativeCon+Kubecon Europe 2017)
Counting with Prometheus (CloudNativeCon+Kubecon Europe 2017)Counting with Prometheus (CloudNativeCon+Kubecon Europe 2017)
Counting with Prometheus (CloudNativeCon+Kubecon Europe 2017)Brian Brazil
 

What's hot (20)

Actor Model and C++: what, why and how?
Actor Model and C++: what, why and how?Actor Model and C++: what, why and how?
Actor Model and C++: what, why and how?
 
Observability-101
Observability-101Observability-101
Observability-101
 
NoSQL at Twitter (NoSQL EU 2010)
NoSQL at Twitter (NoSQL EU 2010)NoSQL at Twitter (NoSQL EU 2010)
NoSQL at Twitter (NoSQL EU 2010)
 
memcached Binary Protocol in a Nutshell
memcached Binary Protocol in a Nutshellmemcached Binary Protocol in a Nutshell
memcached Binary Protocol in a Nutshell
 
Hardening Kafka Replication
Hardening Kafka Replication Hardening Kafka Replication
Hardening Kafka Replication
 
Introduction to Micronaut - JBCNConf 2019
Introduction to Micronaut - JBCNConf 2019Introduction to Micronaut - JBCNConf 2019
Introduction to Micronaut - JBCNConf 2019
 
Using the New Apache Flink Kubernetes Operator in a Production Deployment
Using the New Apache Flink Kubernetes Operator in a Production DeploymentUsing the New Apache Flink Kubernetes Operator in a Production Deployment
Using the New Apache Flink Kubernetes Operator in a Production Deployment
 
Distributed Caching in Kubernetes with Hazelcast
Distributed Caching in Kubernetes with HazelcastDistributed Caching in Kubernetes with Hazelcast
Distributed Caching in Kubernetes with Hazelcast
 
Báo cáo môi trường - DTM - NHÀ HÀNG LAN RỪNG - 0918755356
Báo cáo môi trường - DTM - NHÀ HÀNG LAN RỪNG - 0918755356Báo cáo môi trường - DTM - NHÀ HÀNG LAN RỪNG - 0918755356
Báo cáo môi trường - DTM - NHÀ HÀNG LAN RỪNG - 0918755356
 
[KubeConUS2019 Docker, Inc. Booth] Distributed Builds on Kubernetes with Bui...
 [KubeConUS2019 Docker, Inc. Booth] Distributed Builds on Kubernetes with Bui... [KubeConUS2019 Docker, Inc. Booth] Distributed Builds on Kubernetes with Bui...
[KubeConUS2019 Docker, Inc. Booth] Distributed Builds on Kubernetes with Bui...
 
ZeroMQ with NodeJS
ZeroMQ with NodeJSZeroMQ with NodeJS
ZeroMQ with NodeJS
 
High throughput data replication over RAFT
High throughput data replication over RAFTHigh throughput data replication over RAFT
High throughput data replication over RAFT
 
Introduction to kubernetes
Introduction to kubernetesIntroduction to kubernetes
Introduction to kubernetes
 
Kubernetes Webinar - Using ConfigMaps & Secrets
Kubernetes Webinar - Using ConfigMaps & Secrets Kubernetes Webinar - Using ConfigMaps & Secrets
Kubernetes Webinar - Using ConfigMaps & Secrets
 
Nghiên cứu và triển khai hệ thống Private Cloud cho các ứng dụng đào tạo và ...
 Nghiên cứu và triển khai hệ thống Private Cloud cho các ứng dụng đào tạo và ... Nghiên cứu và triển khai hệ thống Private Cloud cho các ứng dụng đào tạo và ...
Nghiên cứu và triển khai hệ thống Private Cloud cho các ứng dụng đào tạo và ...
 
containerd the universal container runtime
containerd the universal container runtimecontainerd the universal container runtime
containerd the universal container runtime
 
セキュリティ未経験だったけど入社1年目から Bug Bounty Program 運営に参加してみた
セキュリティ未経験だったけど入社1年目から Bug Bounty Program 運営に参加してみたセキュリティ未経験だったけど入社1年目から Bug Bounty Program 運営に参加してみた
セキュリティ未経験だったけど入社1年目から Bug Bounty Program 運営に参加してみた
 
unique_ptrにポインタ以外のものを持たせるとき
unique_ptrにポインタ以外のものを持たせるときunique_ptrにポインタ以外のものを持たせるとき
unique_ptrにポインタ以外のものを持たせるとき
 
Counting with Prometheus (CloudNativeCon+Kubecon Europe 2017)
Counting with Prometheus (CloudNativeCon+Kubecon Europe 2017)Counting with Prometheus (CloudNativeCon+Kubecon Europe 2017)
Counting with Prometheus (CloudNativeCon+Kubecon Europe 2017)
 
Introduction to SLURM
Introduction to SLURMIntroduction to SLURM
Introduction to SLURM
 

Similar to Series of Unfortunate Netflix Container Events - QConNYC17

Disenchantment: Netflix Titus, Its Feisty Team, and Daemons
Disenchantment: Netflix Titus, Its Feisty Team, and DaemonsDisenchantment: Netflix Titus, Its Feisty Team, and Daemons
Disenchantment: Netflix Titus, Its Feisty Team, and DaemonsC4Media
 
Netflix Container Scheduling and Execution - QCon New York 2016
Netflix Container Scheduling and Execution - QCon New York 2016Netflix Container Scheduling and Execution - QCon New York 2016
Netflix Container Scheduling and Execution - QCon New York 2016aspyker
 
Scheduling a fuller house - Talk at QCon NY 2016
Scheduling a fuller house - Talk at QCon NY 2016Scheduling a fuller house - Talk at QCon NY 2016
Scheduling a fuller house - Talk at QCon NY 2016Sharma Podila
 
QConSF18 - Disenchantment: Netflix Titus, its Feisty Team, and Daemons
QConSF18 - Disenchantment: Netflix Titus, its Feisty Team, and DaemonsQConSF18 - Disenchantment: Netflix Titus, its Feisty Team, and Daemons
QConSF18 - Disenchantment: Netflix Titus, its Feisty Team, and Daemonsaspyker
 
Strata+Hadoop 2017 San Jose: Lessons from a year of supporting Apache Kafka
Strata+Hadoop 2017 San Jose: Lessons from a year of supporting Apache KafkaStrata+Hadoop 2017 San Jose: Lessons from a year of supporting Apache Kafka
Strata+Hadoop 2017 San Jose: Lessons from a year of supporting Apache Kafkaconfluent
 
Task migration using CRIU
Task migration using CRIUTask migration using CRIU
Task migration using CRIURohit Jnagal
 
Kubernetes @ Squarespace (SRE Portland Meetup October 2017)
Kubernetes @ Squarespace (SRE Portland Meetup October 2017)Kubernetes @ Squarespace (SRE Portland Meetup October 2017)
Kubernetes @ Squarespace (SRE Portland Meetup October 2017)Kevin Lynch
 
Tupperware: Containerized Deployment at FB
Tupperware: Containerized Deployment at FBTupperware: Containerized Deployment at FB
Tupperware: Containerized Deployment at FBDocker, Inc.
 
Ippevent : openshift Introduction
Ippevent : openshift IntroductionIppevent : openshift Introduction
Ippevent : openshift Introductionkanedafromparis
 
Batch Processing with Amazon EC2 Container Service
Batch Processing with Amazon EC2 Container ServiceBatch Processing with Amazon EC2 Container Service
Batch Processing with Amazon EC2 Container ServiceAmazon Web Services
 
Truemotion Adventures in Containerization
Truemotion Adventures in ContainerizationTruemotion Adventures in Containerization
Truemotion Adventures in ContainerizationRyan Hunter
 
Kubernetes @ Squarespace: Kubernetes in the Datacenter
Kubernetes @ Squarespace: Kubernetes in the DatacenterKubernetes @ Squarespace: Kubernetes in the Datacenter
Kubernetes @ Squarespace: Kubernetes in the DatacenterKevin Lynch
 
The journey to container adoption in enterprise
The journey to container adoption in enterpriseThe journey to container adoption in enterprise
The journey to container adoption in enterpriseIgor Moochnick
 
Software Testing
Software TestingSoftware Testing
Software TestingAndrew Wang
 
NetflixOSS Meetup S6E1 - Titus & Containers
NetflixOSS Meetup S6E1 - Titus & ContainersNetflixOSS Meetup S6E1 - Titus & Containers
NetflixOSS Meetup S6E1 - Titus & Containersaspyker
 
Wayfair Storefront Performance Monitoring with InfluxEnterprise by Richard La...
Wayfair Storefront Performance Monitoring with InfluxEnterprise by Richard La...Wayfair Storefront Performance Monitoring with InfluxEnterprise by Richard La...
Wayfair Storefront Performance Monitoring with InfluxEnterprise by Richard La...InfluxData
 
LinuxCon 2011: OpenVZ and Linux Kernel Testing
LinuxCon 2011: OpenVZ and Linux Kernel TestingLinuxCon 2011: OpenVZ and Linux Kernel Testing
LinuxCon 2011: OpenVZ and Linux Kernel TestingOpenVZ
 
LinuxCon 2011: OpenVZ and Linux Kernel Testing
LinuxCon 2011: OpenVZ and Linux Kernel TestingLinuxCon 2011: OpenVZ and Linux Kernel Testing
LinuxCon 2011: OpenVZ and Linux Kernel TestingAndrey Vagin
 
Prometheus: What is is, what is new, what is coming
Prometheus: What is is, what is new, what is comingPrometheus: What is is, what is new, what is coming
Prometheus: What is is, what is new, what is comingJulien Pivotto
 

Similar to Series of Unfortunate Netflix Container Events - QConNYC17 (20)

Disenchantment: Netflix Titus, Its Feisty Team, and Daemons
Disenchantment: Netflix Titus, Its Feisty Team, and DaemonsDisenchantment: Netflix Titus, Its Feisty Team, and Daemons
Disenchantment: Netflix Titus, Its Feisty Team, and Daemons
 
Netflix Container Scheduling and Execution - QCon New York 2016
Netflix Container Scheduling and Execution - QCon New York 2016Netflix Container Scheduling and Execution - QCon New York 2016
Netflix Container Scheduling and Execution - QCon New York 2016
 
Scheduling a fuller house - Talk at QCon NY 2016
Scheduling a fuller house - Talk at QCon NY 2016Scheduling a fuller house - Talk at QCon NY 2016
Scheduling a fuller house - Talk at QCon NY 2016
 
QConSF18 - Disenchantment: Netflix Titus, its Feisty Team, and Daemons
QConSF18 - Disenchantment: Netflix Titus, its Feisty Team, and DaemonsQConSF18 - Disenchantment: Netflix Titus, its Feisty Team, and Daemons
QConSF18 - Disenchantment: Netflix Titus, its Feisty Team, and Daemons
 
Strata+Hadoop 2017 San Jose: Lessons from a year of supporting Apache Kafka
Strata+Hadoop 2017 San Jose: Lessons from a year of supporting Apache KafkaStrata+Hadoop 2017 San Jose: Lessons from a year of supporting Apache Kafka
Strata+Hadoop 2017 San Jose: Lessons from a year of supporting Apache Kafka
 
Task migration using CRIU
Task migration using CRIUTask migration using CRIU
Task migration using CRIU
 
Kubernetes @ Squarespace (SRE Portland Meetup October 2017)
Kubernetes @ Squarespace (SRE Portland Meetup October 2017)Kubernetes @ Squarespace (SRE Portland Meetup October 2017)
Kubernetes @ Squarespace (SRE Portland Meetup October 2017)
 
Tupperware: Containerized Deployment at FB
Tupperware: Containerized Deployment at FBTupperware: Containerized Deployment at FB
Tupperware: Containerized Deployment at FB
 
Ippevent : openshift Introduction
Ippevent : openshift IntroductionIppevent : openshift Introduction
Ippevent : openshift Introduction
 
Batch Processing with Amazon EC2 Container Service
Batch Processing with Amazon EC2 Container ServiceBatch Processing with Amazon EC2 Container Service
Batch Processing with Amazon EC2 Container Service
 
Truemotion Adventures in Containerization
Truemotion Adventures in ContainerizationTruemotion Adventures in Containerization
Truemotion Adventures in Containerization
 
Kubernetes @ Squarespace: Kubernetes in the Datacenter
Kubernetes @ Squarespace: Kubernetes in the DatacenterKubernetes @ Squarespace: Kubernetes in the Datacenter
Kubernetes @ Squarespace: Kubernetes in the Datacenter
 
The journey to container adoption in enterprise
The journey to container adoption in enterpriseThe journey to container adoption in enterprise
The journey to container adoption in enterprise
 
Software Testing
Software TestingSoftware Testing
Software Testing
 
Netty training
Netty trainingNetty training
Netty training
 
NetflixOSS Meetup S6E1 - Titus & Containers
NetflixOSS Meetup S6E1 - Titus & ContainersNetflixOSS Meetup S6E1 - Titus & Containers
NetflixOSS Meetup S6E1 - Titus & Containers
 
Wayfair Storefront Performance Monitoring with InfluxEnterprise by Richard La...
Wayfair Storefront Performance Monitoring with InfluxEnterprise by Richard La...Wayfair Storefront Performance Monitoring with InfluxEnterprise by Richard La...
Wayfair Storefront Performance Monitoring with InfluxEnterprise by Richard La...
 
LinuxCon 2011: OpenVZ and Linux Kernel Testing
LinuxCon 2011: OpenVZ and Linux Kernel TestingLinuxCon 2011: OpenVZ and Linux Kernel Testing
LinuxCon 2011: OpenVZ and Linux Kernel Testing
 
LinuxCon 2011: OpenVZ and Linux Kernel Testing
LinuxCon 2011: OpenVZ and Linux Kernel TestingLinuxCon 2011: OpenVZ and Linux Kernel Testing
LinuxCon 2011: OpenVZ and Linux Kernel Testing
 
Prometheus: What is is, what is new, what is coming
Prometheus: What is is, what is new, what is comingPrometheus: What is is, what is new, what is coming
Prometheus: What is is, what is new, what is coming
 

More from aspyker

Herding Kats - Netflix’s Journey to Kubernetes Public
Herding Kats - Netflix’s Journey to Kubernetes PublicHerding Kats - Netflix’s Journey to Kubernetes Public
Herding Kats - Netflix’s Journey to Kubernetes Publicaspyker
 
Season 7 Episode 1 - Tools for Data Scientists
Season 7 Episode 1 - Tools for Data ScientistsSeason 7 Episode 1 - Tools for Data Scientists
Season 7 Episode 1 - Tools for Data Scientistsaspyker
 
CMP376 - Another Week, Another Million Containers on Amazon EC2
CMP376 - Another Week, Another Million Containers on Amazon EC2CMP376 - Another Week, Another Million Containers on Amazon EC2
CMP376 - Another Week, Another Million Containers on Amazon EC2aspyker
 
NetflixOSS Meetup S6E2 - Spinnaker, Kayenta
NetflixOSS Meetup S6E2 - Spinnaker, KayentaNetflixOSS Meetup S6E2 - Spinnaker, Kayenta
NetflixOSS Meetup S6E2 - Spinnaker, Kayentaaspyker
 
SRECon Lightning Talk
SRECon Lightning TalkSRECon Lightning Talk
SRECon Lightning Talkaspyker
 
Netflix Cloud Architecture and Open Source
Netflix Cloud Architecture and Open SourceNetflix Cloud Architecture and Open Source
Netflix Cloud Architecture and Open Sourceaspyker
 
Netflix OSS Meetup Season 5 Episode 1
Netflix OSS Meetup Season 5 Episode 1Netflix OSS Meetup Season 5 Episode 1
Netflix OSS Meetup Season 5 Episode 1aspyker
 
Netflix OSS Meetup Season 4 Episode 4
Netflix OSS Meetup Season 4 Episode 4Netflix OSS Meetup Season 4 Episode 4
Netflix OSS Meetup Season 4 Episode 4aspyker
 
Re:invent 2016 Container Scheduling, Execution and AWS Integration
Re:invent 2016 Container Scheduling, Execution and AWS IntegrationRe:invent 2016 Container Scheduling, Execution and AWS Integration
Re:invent 2016 Container Scheduling, Execution and AWS Integrationaspyker
 
Netflix and Containers: Not A Stranger Thing
Netflix and Containers:  Not A Stranger ThingNetflix and Containers:  Not A Stranger Thing
Netflix and Containers: Not A Stranger Thingaspyker
 
Netflix Open Source: Building a Distributed and Automated Open Source Program
Netflix Open Source:  Building a Distributed and Automated Open Source ProgramNetflix Open Source:  Building a Distributed and Automated Open Source Program
Netflix Open Source: Building a Distributed and Automated Open Source Programaspyker
 
Velocity NYC 2016 - Containers @ Netflix
Velocity NYC 2016 - Containers @ NetflixVelocity NYC 2016 - Containers @ Netflix
Velocity NYC 2016 - Containers @ Netflixaspyker
 
Netflix Open Source Meetup Season 4 Episode 3
Netflix Open Source Meetup Season 4 Episode 3Netflix Open Source Meetup Season 4 Episode 3
Netflix Open Source Meetup Season 4 Episode 3aspyker
 
Netflix Open Source Meetup Season 4 Episode 2
Netflix Open Source Meetup Season 4 Episode 2Netflix Open Source Meetup Season 4 Episode 2
Netflix Open Source Meetup Season 4 Episode 2aspyker
 
Netflix Container Runtime - Titus - for Container Camp 2016
Netflix Container Runtime - Titus - for Container Camp 2016Netflix Container Runtime - Titus - for Container Camp 2016
Netflix Container Runtime - Titus - for Container Camp 2016aspyker
 
Netflix Open Source Meetup Season 4 Episode 1
Netflix Open Source Meetup Season 4 Episode 1Netflix Open Source Meetup Season 4 Episode 1
Netflix Open Source Meetup Season 4 Episode 1aspyker
 
CS80A Foothill College Open Source Talk
CS80A Foothill College Open Source TalkCS80A Foothill College Open Source Talk
CS80A Foothill College Open Source Talkaspyker
 
Triangle Devops Meetup 10/2015
Triangle Devops Meetup 10/2015Triangle Devops Meetup 10/2015
Triangle Devops Meetup 10/2015aspyker
 
Netflix Open Source Meetup Season 3 Episode 2
Netflix Open Source Meetup Season 3 Episode 2Netflix Open Source Meetup Season 3 Episode 2
Netflix Open Source Meetup Season 3 Episode 2aspyker
 
Netflix Cloud Architecture and Open Source
Netflix Cloud Architecture and Open SourceNetflix Cloud Architecture and Open Source
Netflix Cloud Architecture and Open Sourceaspyker
 

More from aspyker (20)

Herding Kats - Netflix’s Journey to Kubernetes Public
Herding Kats - Netflix’s Journey to Kubernetes PublicHerding Kats - Netflix’s Journey to Kubernetes Public
Herding Kats - Netflix’s Journey to Kubernetes Public
 
Season 7 Episode 1 - Tools for Data Scientists
Season 7 Episode 1 - Tools for Data ScientistsSeason 7 Episode 1 - Tools for Data Scientists
Season 7 Episode 1 - Tools for Data Scientists
 
CMP376 - Another Week, Another Million Containers on Amazon EC2
CMP376 - Another Week, Another Million Containers on Amazon EC2CMP376 - Another Week, Another Million Containers on Amazon EC2
CMP376 - Another Week, Another Million Containers on Amazon EC2
 
NetflixOSS Meetup S6E2 - Spinnaker, Kayenta
NetflixOSS Meetup S6E2 - Spinnaker, KayentaNetflixOSS Meetup S6E2 - Spinnaker, Kayenta
NetflixOSS Meetup S6E2 - Spinnaker, Kayenta
 
SRECon Lightning Talk
SRECon Lightning TalkSRECon Lightning Talk
SRECon Lightning Talk
 
Netflix Cloud Architecture and Open Source
Netflix Cloud Architecture and Open SourceNetflix Cloud Architecture and Open Source
Netflix Cloud Architecture and Open Source
 
Netflix OSS Meetup Season 5 Episode 1
Netflix OSS Meetup Season 5 Episode 1Netflix OSS Meetup Season 5 Episode 1
Netflix OSS Meetup Season 5 Episode 1
 
Netflix OSS Meetup Season 4 Episode 4
Netflix OSS Meetup Season 4 Episode 4Netflix OSS Meetup Season 4 Episode 4
Netflix OSS Meetup Season 4 Episode 4
 
Re:invent 2016 Container Scheduling, Execution and AWS Integration
Re:invent 2016 Container Scheduling, Execution and AWS IntegrationRe:invent 2016 Container Scheduling, Execution and AWS Integration
Re:invent 2016 Container Scheduling, Execution and AWS Integration
 
Netflix and Containers: Not A Stranger Thing
Netflix and Containers:  Not A Stranger ThingNetflix and Containers:  Not A Stranger Thing
Netflix and Containers: Not A Stranger Thing
 
Netflix Open Source: Building a Distributed and Automated Open Source Program
Netflix Open Source:  Building a Distributed and Automated Open Source ProgramNetflix Open Source:  Building a Distributed and Automated Open Source Program
Netflix Open Source: Building a Distributed and Automated Open Source Program
 
Velocity NYC 2016 - Containers @ Netflix
Velocity NYC 2016 - Containers @ NetflixVelocity NYC 2016 - Containers @ Netflix
Velocity NYC 2016 - Containers @ Netflix
 
Netflix Open Source Meetup Season 4 Episode 3
Netflix Open Source Meetup Season 4 Episode 3Netflix Open Source Meetup Season 4 Episode 3
Netflix Open Source Meetup Season 4 Episode 3
 
Netflix Open Source Meetup Season 4 Episode 2
Netflix Open Source Meetup Season 4 Episode 2Netflix Open Source Meetup Season 4 Episode 2
Netflix Open Source Meetup Season 4 Episode 2
 
Netflix Container Runtime - Titus - for Container Camp 2016
Netflix Container Runtime - Titus - for Container Camp 2016Netflix Container Runtime - Titus - for Container Camp 2016
Netflix Container Runtime - Titus - for Container Camp 2016
 
Netflix Open Source Meetup Season 4 Episode 1
Netflix Open Source Meetup Season 4 Episode 1Netflix Open Source Meetup Season 4 Episode 1
Netflix Open Source Meetup Season 4 Episode 1
 
CS80A Foothill College Open Source Talk
CS80A Foothill College Open Source TalkCS80A Foothill College Open Source Talk
CS80A Foothill College Open Source Talk
 
Triangle Devops Meetup 10/2015
Triangle Devops Meetup 10/2015Triangle Devops Meetup 10/2015
Triangle Devops Meetup 10/2015
 
Netflix Open Source Meetup Season 3 Episode 2
Netflix Open Source Meetup Season 3 Episode 2Netflix Open Source Meetup Season 3 Episode 2
Netflix Open Source Meetup Season 3 Episode 2
 
Netflix Cloud Architecture and Open Source
Netflix Cloud Architecture and Open SourceNetflix Cloud Architecture and Open Source
Netflix Cloud Architecture and Open Source
 

Recently uploaded

Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Scriptwesley chun
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking MenDelhi Call girls
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...Martijn de Jong
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUK Journal
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Igalia
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdfhans926745
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherRemote DBA Services
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processorsdebabhi2
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Servicegiselly40
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 
Evaluating the top large language models.pdf
Evaluating the top large language models.pdfEvaluating the top large language models.pdf
Evaluating the top large language models.pdfChristopherTHyatt
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsJoaquim Jorge
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProduct Anonymous
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfsudhanshuwaghmare1
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdflior mazor
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024The Digital Insurer
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 

Recently uploaded (20)

Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
Evaluating the top large language models.pdf
Evaluating the top large language models.pdfEvaluating the top large language models.pdf
Evaluating the top large language models.pdf
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdf
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 

Series of Unfortunate Netflix Container Events - QConNYC17

  • 1. A Series Of Unfortunate Container Events Netflix’s container platform lessons learned
  • 2. About the speakers and team Follow along - @sargun @tomaszbak_ca @fabiokung @aspyker @amit_joshee @anwleung 2
  • 3. Netflix’s container management platform ● Titus ● Scheduling ○ Service & batch jobs ○ Resource management ● Container Execution ○ Docker/AWS Integration ○ Netflix Infra Support Service Job Management Resource Management & Optimization Container Execution Integration Batch 3
  • 5. Current Titus scale ● Deployed across multiple AWS accounts & three regions ● Over 5,000 instances (Mostly M4.4xls & R3.8xls) ● Over a week period launched over 1,000,000 containers ● Over 10,000 containers running concurrently 5
  • 6. Single cloud platform for VMs and containers ● CI/CD (Spinnaker) ● Telemetry systems ● Discovery and RPC load balancing ● Healthcheck, Edda and system metrics ● Chaos monkey ● Traffic control (Flow & Kong) ● Netflix secure secret management ● Interactive access (ala ssh) 6
  • 7. Integrate containers with AWS EC2 ● VPC Connectivity (IP per container) ● Security Groups ● EC2 Metadata service ● IAM Roles ● Multi-tenant isolation (cpu, memory, disk quota, network) ● Live and S3 persisted logs rotation & mgmt ● Environmental context to similar to user data ● Autoscaling service jobs (coming) 7
  • 8. ● Service ○ Stream Processing (Flink) ○ UI Services (NodeJS) ○ Internal dashboards ● Batch ○ Personalization ML model training (GPUs) ○ Content value analysis ○ Digital watermarking ○ Ad hoc reporting ○ Continuous integration builds ○ Media encoding experimentation Container users on Titus Archer 8
  • 9. Titus high level overview 9 RheaRheaTitus API Cassandra Titus Scheduler ● Job Lifecycle Control ● Resource Management EC2 Autoscaling Fenzo 9 container container container docker Titus Agents Mesos agent Docker Docker Registry containercontainerUser Containers AWS Virtual Machines Mesos Titus System AgentsWorkflow Systems 9
  • 10. Look away, look away, Look away, look away This session will wreck your evening​, your whole life, and your day Every single episode is nothing but dismay So, look away, Look away, look away Lessons learned from a year in production? 10
  • 12. Run-away submissions Submit a job, check status If API doesn’t answer assume 404 and re-submit Problem: User 12
  • 13. Worked for our content processing job of 100 containers Let’s run our “back-fill” -- 100s of thousands of containers Problems ● Scheduler runs out of memory ● All other jobs get queued behind Solutions ● Scheduler capacity groups ● Absolute caps on number of concurrent live jobs ● Upstream systems doing ingest control System perceived as infinite queue User 13
  • 14. Uses REST/JSON poorly { env: { “PATH” : null } } Problems ● Scheduler crashes, fails over, crashes, repeat Solutions ● Input validation, input fuzz testing, exception handling Invalid Jobs User 14
  • 15. Failing jobs that repeat Image: “org/imagename:lateest” Command: /bin/besh -c ... Problems ● Containers can launch FAST! Can be restarted FAST! ● Scheduler works really hard ● Cloud resources allocated/deallocated FAST Solutions ● Rate limiting of failing jobs User 15
  • 16. Problems ● Scheduler fails, can’t recover due to “bad” jobs Solutions Manual removal of bad job state? ✖ Test production data sets in staging ✔ Testing for “bad” job data 16 1. Export job data 2. Restore job data 3. Test recovery 4. Deploy new code PROD STAGING PROD
  • 17. Identifying bad actors V2 API ● user (optional) V2 Auditing ● Added collection of user performing action V3 API ● Owner -> teamEmail (required) 17
  • 18. ● User Namespaces ○ Docker 1.10 - User Namespaces (Feb 2016) ○ Docker 1.11 - Fixed shared networking NSs ■ User id mapping is per daemon (not per container) ○ Deployed user namespaces recently ■ Problems - shared NFS, OSX LDAP uid/gid’s ● Locked Down hosts ○ Users only have access to containers, not hosts ○ Required “power user” ssh access for perf/debugging Really bad actors - container escapes protections 18
  • 19. The Cloud Isn’t Perfect 19
  • 20. Cloud rate limiting and overall limits Let’s do a red/black deploy of 2000 containers instantly Problems ● Scheduler and distributed host fleet ... no problem! ● Cloud provider … problem! Solutions ● Exponential backoff with jitter on hosts ● Setting expectations of maximum concurrent launches ● Rate limiting of container scheduling and overall number of containers User 20
  • 21. Hosts start or go bad Problems ● Hosts come up with flakey networks ● Host disks come up and are slow ● Hosts go bad over time Solutions ● Scheduler must be aware of host health checks ● Linux, storage, etc warming ● Auto-termination if hosts take too long to become healthy 21
  • 22. Upgrades - In place upgrades ● Simpler for container users ● Infrastructure becomes mutable ● Doesn’t leverage elastic cloud ● How to handle rollback? Docker V1 Titus Agents V1 Mesos V1 Batch Container #1 Service Container #2 Docker V2 Titus Agents V2 Mesos V2 Batch Container #1 Service Container #2 Agent Agent with updates ✖ ✖ ✖ 22
  • 23. Upgrades - Whole cluster red/black ● Full red/black takes hours ● Costly (duplicate clusters) ● Insufficient Capacity Exception (ICE) ● Rollback requires ALL containers to move twice ✖ 23
  • 24. Upgrades - Partitioned cluster updates ● Requires complex scheduler knowledge ○ Batch jobs to have runtime limits ○ Service jobs with Spinnaker migration tasks ● Starting point for fleet cluster management Docker V1 Titus Agents V1 Mesos V1 Batch Container #1 Service Container #2 Docker V2 Titus Agents V2 Mesos V2 Service Container #2 Let “drain” old new ✖ ✖ ✖ 24 CI/CD Task Migration
  • 25. Our Code Isn’t Perfect 25
  • 26. Disconnected containers Problem ● Host agent controllers lock up ● Control plane can’t kill replaced containers User ● Why is my old code still running? Solutions ● Monitor and alert on differences ● Reconcile to this system as aggressively as possible Container Scheduler I’m running You shouldn’t be! Agent Stop Container × × Locked Up 26
  • 27. Scheduler failover speed is important 27 StandBy C* save Active C* restore Scheduler failover time increased with scale ● Loss of API availability ● Reconciliation bugs caused task crashes Solutions: ● Data sharding (current vs old tasks) ● Do as little as possible during startup Active Active✖
  • 28. Know your dependencies 28 Agent DNSS3Zookeeper Problems ● Container creation errors ● Logs upload failure ● Task crashes Solutions ● Retries ● Rate limiting ● Isolation
  • 29. ● Containers start with Docker .. end with the kernel ● Best container runtimes deeply leverage kernel primitives ○ Resource Isolation ○ Security ○ Networking etc ● Debugging tools (tracepoints, perf events) not container aware ○ Need for BPF, Kprobe Containers require kernel knowledge 29
  • 30. Problems ● Our instances fail ● Our code fails ● Our dependencies fail Solutions ● Learn to love the Chaos Monkey ● Enabled for prod and all services (even our scheduler) Strategy: Embrace chaos 30
  • 31. Alerting and dashboards key Very complex system == very complex telemetry and alerting Continuously evolving ● Based on real incidents and resulting deep analysis Telemetry system Number Metrics 100’s Dashboard Graphs 70+ Alerts 50+ Elastic Search Indexes 4 31
  • 32. Temporary ad hoc remediation ● Manual babysitting of scheduler state ● Pin high when auto-scaling and capacity management isn’t work ● Automated for each ssh across all nodes ○ Detecting and remediating problems 32
  • 33. What has worked well? 33
  • 34. Solid software Docker Distribution ● Our Docker registry ● Simple redirect on top of S3 Apache Mesos ● Extensible resource manager ● Highly reliable replicated log Zookeeper ● Leader Election ● Isolated Cassandra ● CDE Internal Service ● Careful of access model 34
  • 35. Managing the Titus product Focus on core business value ● Container cluster management Features are important ● Deciding what not to do … Just as important! Deliberately chose NOT to do ● Service Discovery / RPC ● Continuous Delivery 35
  • 36. Ops enablement Phase 1: Manual red/black deploys Phase 2: Runbook for on-call Phase 3: Automated pipelines Deploy New ASG Find Current Leader Terminate Non Leader Nodes Terminate Current Leader 36
  • 37. Problem ● If you aren’t measuring, you don’t know ● If you don’t know, you can’t improve Solutions ● Our SLOs ○ Start Latency ○ % Crashed ○ API Availability ● Once we started watching, we started improving Service Level Objectives (SLOs) 37
  • 39. Documenting container readiness ● Broken down by type of application and feature set ● Readiness expressed in ○ Alpha (early users), beta (subject to change), GA (mass adoption) 39
  • 40. Growing usage slowly, carefully Titus Created Batch GA 4Q 2015 Service Support Added 1Q 2016 First Scale Production Service 4Q 2016 Netflix Customer Facing Service 2Q 2017 40 shadow
  • 41. Key takeaways ● Expect problematic containers & workloads ● Continued need for cloud to evolve for containers ● Container schedulers, runtime are complex ● Ops enablement key for production systems ● Users need help adopting containers responsibly ● Worth the effort due to value containers unlock 41
  • 44. Titus High Level Overview 44 Titus UITitus UI RheaRheaTitus API Titus UI Cassandra Titus Master Job Management & Scheduler Zookeeper EC2 Auto-scaling API Mesos Master Fenzo 44 Docker Registry Docker Registry container container container docker Titus Agent metrics agents Titus executor logging agent btrfs Mesos agent Docker S3 Docker Registry container Pod & VPC network drivers containercontainer AWS metadata proxy Integration AWS VMs