Michael Wardrop, Netflix
Usage of containers has undergone rapid growth at Netflix and it is still accelerating. Our container story started organically with developers downloading Docker and using it to improve their developer experience. The first production workloads were simple batch jobs; pioneering micro-services followed, and containers then became a first class platform running critical workloads.
As the types of workloads changed and their importance increased, the security of our container ecosystem needed to evolve and adapt. This session will cover some security theory and architecture, along with practical considerations and lessons we learnt along the way.
2. Why?
There are lots of great presentations about Container Security theory
Not so many about the challenges of doing it in practice
I hope to inspire more sharing so that we learn from each other and improve everyone’s security together
4. Context
Know your threat models - My threat models may not be the same as yours
Don’t copy & paste security - Tailor solutions to your context
Although I am presenting, this is the work of many people from multiple teams over a few years
6. Containers at Netflix
Started organically with engineers
• Improved polyglot development and testing experience
Basic batch processing systems
• cron in the cloud
• Extract, Transform, Load
With momentum came demand
• Container management platform
• Integration with AWS and Netflix ecosystem
8. Titus
Netflix’s Container Management Platform
> 3 million containers launched per week
Scheduling
• Service & batch job lifecycle
• Resource management
AWS & Netflix Integrations
High churn
• Most batch workloads < 1 hour
• Due to auto scaling most Service containers < 1 day
Multi Region, multi AZ
Chaos Monkey & regional failover
> 1K different images
11. Newt
Netflix Workflow Toolkit - from Productivity Engineering
• Initialization of Projects (Stash repos, Jenkins jobs, Spinnaker pipelines, & alerts)
• Code generation
• Consistent development environment in polyglot world
• Isolated, reproducible, and cacheable builds
• Container based testing
• Good place to incorporate best practices and secure defaults
Docker is an important component
12. Rapid growth of container use cases
• 1000+ services
• Netflix API, Node.js Backend UI Scripts
• Machine Learning (GPUs) for personalization
• Encoding and Content use cases
• Netflix Studio use cases
• CDN tracking and planning
• Massively parallel CI system
• Data Pipeline routing & Stream Processing as a Service
• Big Data platform use cases
14. What’s interesting about OCI Containers?
1. Operating System virtualization - rely on the OS Kernel for security. On Linux, this means:
• Namespaces - different userland views
• Control Groups - resource limits
• Seccomp - Syscall filtering
• Mandatory Access Control - Apparmor, SELinux, etc
• Capabilities - break up the power of root
• Pivot Root - Change the root file system
2. File System Image - Bring your dependencies with you
Implemented as a Tar of Tars with some metadata
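The "Tar of Tars with some metadata" idea can be sketched in a few lines. This is a deliberately simplified illustration, not the real OCI image layout: the file names (layer0.tar, metadata.json) and the helper functions are hypothetical, and real images also handle whiteouts, digests, and configuration.

```python
import io
import json
import tarfile


def make_layer(files: dict[str, bytes]) -> bytes:
    """Pack a mapping of path -> contents into an in-memory tar (one layer)."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for path, data in files.items():
            info = tarfile.TarInfo(name=path)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()


def make_image(layers: list[bytes]) -> bytes:
    """Pack the layer tars plus a small metadata file into an outer tar."""
    members = {f"layer{i}.tar": layer for i, layer in enumerate(layers)}
    # Hypothetical metadata file: it only records the layer order.
    members["metadata.json"] = json.dumps({"layers": list(members)}).encode()
    return make_layer(members)


def flatten(image: bytes) -> dict[str, bytes]:
    """Apply layers in order; files in later layers overwrite earlier ones."""
    rootfs: dict[str, bytes] = {}
    with tarfile.open(fileobj=io.BytesIO(image)) as outer:
        meta = json.load(outer.extractfile("metadata.json"))
        for layer_name in meta["layers"]:
            layer_bytes = outer.extractfile(layer_name).read()
            with tarfile.open(fileobj=io.BytesIO(layer_bytes)) as layer:
                for member in layer.getmembers():
                    rootfs[member.name] = layer.extractfile(member).read()
    return rootfs
```

Flattening the layers in order is what lets an application layer override a file shipped by its base layer, which is also why an outdated base layer keeps its vulnerable files until the image is rebuilt.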
16. Container Ecosystem Security
What isn’t impacted?
• Registry
• Image Scanning
• Patch Management
• Control Plane
• Cloud Networking and APIs
• Developer Identity
• Service Identity
• Development
• Production
• Secret Management
• Key Management
• Version Control
• Source Code
• Continuous Integration
• Continuous Delivery
18. Cloud Security
AWS EC2 Metadata proxy
• Started with one per host, changed to one per container
• Block Server Side Request Forgery (SSRF) and XML External Entity (XXE) injection
• Honey Credentials
Identity & Access Management
• IAM Role per container
• Limit IAM permissions for the host & bind credentials to the host
• Restrict which IAM roles can be used by which Applications
Elastic Network Interfaces
• VPC routable IP Address per container
• Assigning Security Groups to containers
Cloud APIs have great power, protect them!
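"Restrict which IAM roles can be used by which Applications" amounts to a default-deny lookup at launch time. A minimal sketch, assuming a simple in-memory policy; the application names, role names, and function names are hypothetical and are not the Titus API.

```python
# Hypothetical policy: application name -> IAM role names it may use.
ROLE_POLICY = {
    "billing-batch": {"billingBatchRole"},
    "api-gateway": {"apiGatewayRole", "readOnlyMetricsRole"},
}


def role_allowed(application: str, requested_role: str,
                 policy: dict = ROLE_POLICY) -> bool:
    """Default deny: unknown applications and unlisted roles are rejected."""
    return requested_role in policy.get(application, set())


def validate_launch(application: str, requested_role: str) -> None:
    """Raise before a container is scheduled with a role it may not use."""
    if not role_allowed(application, requested_role):
        raise PermissionError(
            f"{application!r} may not run with IAM role {requested_role!r}"
        )
```

Checking at launch time, rather than when credentials are requested, means a misconfigured job fails before it ever holds credentials.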
19. Cloud Security
Separate accounts for Control Plane and Workers (12 accounts total)
STS service in control plane account
• AuthN, AuthZ, & Audit
Container’s Identity based on Target IAM account
• Workload can be logically in Target account despite executing on Titus
[Diagram: New Titus Architecture. Old: Internet and Federation in front of a single Titus US-East-1 Account containing the Agent Pools. New: Federation in front of a Titus US-East-1 Control Plane Account and a separate Titus US-East-1 Agent Account containing the Agent Pools.]
21. Control Plane Security
Root controls ONE host, Control Plane controls ALL hosts.
API
• V1 was HTTP
• V2 was HTTPS with optional mutual TLS
• V3 is mutual TLS only, with audit logs
Master to Workers communication
• Originally relied on Security Groups: no authentication, authorization, or encryption
• Dangerous! 1 misconfiguration away from shadow control plane attacks
• Now: mutual TLS authN, AuthZ policies, and auditing
got root
control?
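The step from "optional mutual TLS" to "mutual TLS only" comes down to the server refusing any connection that lacks a valid client certificate. A minimal sketch using Python's standard ssl module; the function name and file path parameters are assumptions for illustration, and this is not the Titus implementation.

```python
import ssl
from typing import Optional


def mutual_tls_server_context(ca_file: Optional[str] = None,
                              cert_file: Optional[str] = None,
                              key_file: Optional[str] = None) -> ssl.SSLContext:
    """Build a server-side TLS context that rejects clients without a cert."""
    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
    ctx.minimum_version = ssl.TLSVersion.TLSv1_2
    # CERT_REQUIRED is the "mutual" part: the handshake fails unless the
    # client presents a certificate that chains to a trusted CA.
    ctx.verify_mode = ssl.CERT_REQUIRED
    if ca_file:
        # CA bundle used to verify client certificates (hypothetical path).
        ctx.load_verify_locations(cafile=ca_file)
    if cert_file:
        # The server's own identity (hypothetical paths).
        ctx.load_cert_chain(certfile=cert_file, keyfile=key_file)
    return ctx
```

With ssl.CERT_OPTIONAL instead, unauthenticated clients would still connect, which corresponds to the V2 behaviour described above.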
23. Control Plane Security
Problem: Failing Jobs That Repeat
Symptoms
• Scheduler works really hard
• Cloud resources are allocated / deallocated fast
Solution
• Rate limiting of failing jobs
Example of a job that can never start (typos intentional):
Image: “org/imagename:lateest”
Command:/bin/besh -c …
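Rate limiting of failing jobs can be sketched as per-job exponential backoff. This is one possible approach under stated assumptions, not the Titus scheduler's algorithm: the class name, the delay values, and the doubling rule are all hypothetical.

```python
class FailureBackoff:
    """Delay relaunches of a job that keeps failing (exponential backoff)."""

    def __init__(self, base_delay: float = 1.0, max_delay: float = 300.0):
        self.base_delay = base_delay
        self.max_delay = max_delay
        # job_id -> (consecutive failures, time of the last failure)
        self._failures: dict[str, tuple[int, float]] = {}

    def record_failure(self, job_id: str, now: float) -> None:
        count, _ = self._failures.get(job_id, (0, 0.0))
        self._failures[job_id] = (count + 1, now)

    def record_success(self, job_id: str) -> None:
        # A successful run clears the history for that job.
        self._failures.pop(job_id, None)

    def may_launch(self, job_id: str, now: float) -> bool:
        if job_id not in self._failures:
            return True
        count, last_failure = self._failures[job_id]
        # 1st failure waits base_delay, 2nd waits 2x, 3rd 4x, up to max_delay.
        delay = min(self.max_delay, self.base_delay * 2 ** (count - 1))
        return now >= last_failure + delay
```

Passing the current time in explicitly keeps the sketch deterministic; a scheduler would typically supply a monotonic clock reading.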
24. Identity for People
Pandora: unified identity service
Meechum: multi factor Single Sign On
Metatron uses Meechum identity to create:
➡X.509 cert
• person to service authN via Mutual TLS
➡SSH cert
• Bastion access
25. Metatron: Identity for Services
Cryptographic bootstrap of service identity in the cloud
Established before application code, supports:
• EC2 Instances built on our BaseAMI
• Containers on Titus
• Netflix Functions
All get X.509 certificates for use in Mutual TLS, enabling authentication
26. Metatron: Identity for Services
Round 1
• Based on metadata signed by AWS
• No freshness guarantee, therefore no support for instance restarts
• No Lambda support ☹
Round 2
• Based on KMS encryption context
• Freshness guarantee, therefore can refresh identity at any time
• Lambda support 🥳
Closest open source equivalent is
How? Starts with an Application in Spinnaker, which signs some metadata, and puts it in User Data given to AWS
27. Gandalf: Authorization
Gandalf decides who can be let in,
and who shall not pass.
• Web portal for defining policies (REST, gRPC, SSH, custom)
• Policy updates are pushed out to Authorization agent
• All authorization decisions are made locally, in nanoseconds
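The "push policies out, decide locally" pattern can be sketched as an in-process agent that holds a snapshot of the rules. This is an illustrative assumption about the pattern, not Gandalf's API: the class, the rule tuple shape, and the version counter are hypothetical.

```python
import threading


class LocalAuthorizer:
    """Holds a pushed policy snapshot and answers checks without a network call."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._rules: frozenset = frozenset()
        self.version = 0

    def update_policies(self, version: int, rules) -> None:
        """Called when the policy service pushes a new snapshot."""
        snapshot = frozenset(rules)
        with self._lock:
            # Ignore stale or duplicate pushes that arrive out of order.
            if version > self.version:
                self._rules = snapshot
                self.version = version

    def is_allowed(self, caller: str, action: str, resource: str) -> bool:
        """Default deny: only explicitly listed triples are permitted."""
        return (caller, action, resource) in self._rules
```

Because the check is a set lookup in local memory, authorization stays fast and keeps working if the policy service is briefly unreachable; the trade-off is that a decision can be based on a slightly stale snapshot.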
28. SSH Access
For extraordinary circumstances
Vast majority of Instances and Containers go through their lifecycle without SSH access
Initial implementation
• Connection from bastion into limited environment on the host
• Restricted docker exec and docker cp like functionality
Current implementation
• After authorization check, the Bastion calls the Titus control plane
• A specially configured sshd is injected into the container
• The bastion connects directly to the injected sshd
30. Secret Protection
No secrets in code!
Encrypted via Gandalf web portal
• Define a policy for which Metatron identities (applications, groups, individuals) can access
• Copy a Base64 encoded bundle / download a binary file
Files in conventional path are automatically decrypted on instance / container startup and loaded into tmpfs
Library support for transparently loading and decrypting from configuration files
31. Secret Protection
The only place secrets should exist in the clear is in RAM when they are being used
[Diagram: over Mutual TLS, with both sides holding a Metatron X.509 cert, the Instance / Container sends Blinded( EncryptedBundle(Secret, Policy Id) ) to the Decryption Server, which returns Blinded( Secret ).]
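The slide does not say which blinding scheme is used. As an illustration of the general idea only, here is textbook RSA blinding with a toy key: the client multiplies the ciphertext by a random factor before sending it, so the decryption server never sees the real secret. The key, the function names, and the choice of RSA are assumptions, and a key this small offers no security.

```python
import math

# Toy RSA key (p=61, q=53). Illustration only; far too small to be secure.
N, E, D = 3233, 17, 2753


def encrypt(secret: int) -> int:
    """Textbook RSA encryption with the public key (N, E)."""
    return pow(secret, E, N)


def blind(ciphertext: int, r: int) -> int:
    """Client side: hide the ciphertext behind a random factor r."""
    if math.gcd(r, N) != 1:
        raise ValueError("blinding factor must be coprime with N")
    return (ciphertext * pow(r, E, N)) % N


def server_decrypt(blinded_ciphertext: int) -> int:
    """Server side: decrypts with the private key, learns only secret * r mod N."""
    return pow(blinded_ciphertext, D, N)


def unblind(blinded_secret: int, r: int) -> int:
    """Client side: remove the factor r to recover the secret."""
    return (blinded_secret * pow(r, -1, N)) % N
```

In use, the client would pick a fresh random r per request and keep it in memory only, which fits the goal that the secret exists in the clear only where it is used.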
32. Host
Problem: one kernel vulnerability away from loss of containment
Solutions
• Don’t use a generic kernel, use one tuned for your environment
• get rid of unneeded features, modules, and drivers
• Follow kernel hardening best practices like the Kernel Self Protection Project
Consider: Firecracker
33. Runtime
Use User Namespaces
Docker 1.10 - Introduced User Namespaces
• Didn’t work with shared networking NS
Docker 1.11 - Fixed shared networking NS
• User id mapping is per daemon (not per container)
Titus uses unique user namespace per container, shared User Id mapping
• avoids problems with shared filesystems
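A user namespace translates container user ids to host user ids through a mapping in the "inside-start outside-start length" format that Linux exposes in /proc/[pid]/uid_map. The sketch below shows that translation; the helper names and the example range are hypothetical and are not the mapping Titus uses.

```python
from typing import NamedTuple, Optional


class IdMapEntry(NamedTuple):
    inside_start: int   # first id as seen inside the namespace
    outside_start: int  # corresponding first id on the host
    length: int         # number of consecutive ids covered


def parse_id_map(text: str) -> list[IdMapEntry]:
    """Parse uid_map style text: one 'inside outside length' triple per line."""
    entries = []
    for line in text.splitlines():
        if line.strip():
            inside, outside, length = (int(part) for part in line.split())
            entries.append(IdMapEntry(inside, outside, length))
    return entries


def to_host_id(container_id: int, id_map: list[IdMapEntry]) -> Optional[int]:
    """Translate a container uid to the host uid, or None if it is unmapped."""
    for entry in id_map:
        offset = container_id - entry.inside_start
        if 0 <= offset < entry.length:
            return entry.outside_start + offset
    return None
```

With a mapping such as "0 100000 65536", root inside the container is the unprivileged uid 100000 on the host. If every container shares the same mapping, a given container uid always resolves to the same host uid, so file ownership on a shared filesystem stays consistent across containers.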
34. Vulnerability Management
Problem:
Stop known vulnerabilities from getting introduced into your ecosystem
Solution:
‘Shift left’
• IDE plugins
• Scanning of pull requests & builds in CI system
35. Vulnerability Management
Problem:
Discover and eliminate vulnerabilities in your ecosystem
Theory:
Scan your container images
Practice:
Discovering vulnerabilities is relatively easy; flushing them from your ecosystem is hard
36. Change Management
People
• > 1K Engineers
Applications
• > 5K Micro Services
CI
• > 600K CI builds per week
Artifacts
• > 2K NPMs
• > 17K Debians
• > 17K AMIs
• > 97K JARs
Artifact Churn
• Not deployed for ~ 3 days: ~ 18K total
• Deleted per day: ~ 13K total
Deployments & Autoscaling
• > 3M containers deployed per week
• Most batch workloads < 1 hour
• Most Service containers < 1 day
• ~ 50% VM Instance churn per day
37. Change Management
Who needs to change what when?
Change campaigns
• Targeted & actionable communication
• Email, Spinnaker, linters, build warnings
Deprecation cycles
• All micro services should be rebuilt & redeployed with latest supported artifact versions every 90 days
• Act as a forcing function to purge old / vulnerable software
[Figure legend: Orange: campaign rules; Pink: primary blockers; Green: affected services]
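The 90 day deprecation cycle reduces to an age check over the last build date of each service, which is the input a change campaign needs. A minimal sketch, assuming a simple mapping of service name to last build date; the function name and data shape are hypothetical.

```python
from datetime import date, timedelta

# The rebuild and redeploy window stated on the slide.
MAX_AGE = timedelta(days=90)


def overdue_services(last_built: dict[str, date], today: date) -> list[str]:
    """Return services whose last build is older than the 90 day window."""
    return sorted(
        name for name, built in last_built.items()
        if today - built > MAX_AGE
    )
```

The resulting list could feed the targeted communication channels named above (email, Spinnaker, linters, build warnings) so each owner only hears about their own overdue services.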
38. Takeaways
1. Cloud & Platform control planes are of strategic importance
• protect with multiple independent layers of security
2. People and Service identity are the foundation
• AuthN, AuthZ, & Auditing
• Secret Management
39. Takeaways
3. Need to take an ecosystem approach
• Container security does not happen in isolation
• Engineers should get Security involved early in project / platform lifecycle
• As a security practitioner you take what is there and iterate
4. Users will need help adopting containers responsibly
• Expect problematic containers and workloads
5. Users expect ability to debug and performance tune
• Metrics, Monitoring, and Alerting are key
• SSH as break glass, not as a crutch
40. Security:
Russell Lewis: OSCON 2016 — How Netflix Gives All Its Engineers SSH Access To Instances Running In Production
Ian Haken: USENIX Enigma 2017 — Secrets at Scale: Automated Bootstrapping of Secrets & Identity in the Cloud
Manish Mehta: CloudNativeCon 2017 — How Netflix Is Solving Authorization Across Their Cloud
Manish Mehta: RWC 2018 — Secrets at Scale
Travis McPeak: Enigma 2018 — Least Privilege: Security Gain without Developer Pain
Netflix Tech Blog: Security
Titus Team:
Netflix OSS: Season 6 Episode 1 - Titus, Slides, Source
QConSF18 - Disenchantment: Netflix Titus, its Feisty Team, and Daemon