This document summarizes a presentation about how RightScale uses Kubernetes, Terraform, and other tools in its cloud management platform. It discusses how RightScale transitioned from running Docker containers on individual VMs ("Bay of Containers") to running Kubernetes container clusters in the cloud ("Sea of Containers"). RightScale built custom images with Kubernetes components pre-installed to speed up cluster creation. Terraform is used to provision infrastructure, including Kubernetes clusters, and to integrate with the RightScale platform. The goal was to enable developers to have self-managed Kubernetes clusters using infrastructure-as-code principles. Key aspects included making clusters disposable while maintaining high availability, and distributing Terraform modules to development teams to simplify cluster creation and management.
2. • Ryan Williamson
• Director of Engineering
• Mark Dotson
• Steel Team Manager/Infrastructure Tech Lead
Presenters
3. Two Solutions from RightScale
RightScale Cloud Management Platform
Orchestrate, automate, and govern workloads across all your environments: virtual servers, public clouds, private clouds, bare metal servers, container clusters, any cloud service.
RightScale Optima
Work collaboratively across the organization to manage and optimize cloud costs.
[Diagram: the RightScale CMP engine and its extensible orchestration API, spanning orchestration (Cloud Workflow, plugins), monitoring, access control (accounts/groups, access/permissions), tags, and policies (cost, security/compliance, operational).]
6. • The role each technology plays in DevOps
• Tales from the trenches on Kubernetes & Terraform
• Case study of RightScale’s DevOps process
Agenda
7. Pets vs Cattle
Source: Randy Bias via Slideshare.net, The History of Pets vs Cattle,
License: CC Attribution-NoDerivs License
8. “Developers want to have self-service
programmatic access to infrastructure, integrated
into the continuous integration/continuous delivery
(CI/CD) pipeline. Immutable infrastructure is a
best-practice solution to this problem, and is
increasingly adopted by DevOps-oriented teams.”
“Cattle” = Immutable Infrastructure
Source: Gartner, Four Key Container Deployment Considerations for I&O Leaders Feb 2018
10. How We Got Here: Project Sherpa circa 2016
● Moved to Docker containers deployed in a "Bay of Containers"
○ Single containers of a given Docker image deployed via the "Docker Deploy" script onto individual instances launched from an ops-maintained template.
○ Service discovery mesh and inputs provided by HashiCorp Consul
○ Capacity planning via static mapping.
● Centralized development & management structures (ops, architecture, eng management)
11. Bay of Containers - What is That?
• Deploy N 'good neighbor' containers onto a VM
• Supports microservices
• Supports "traditional" services
• Abstract level - logical groupings of individual hosts running groups of containers.
[Diagram: "Host & Container Logical Grouping 1..n" - each host runs balancer, smtp, syslog, and service-discovery containers alongside a group of application containers (A..D on one host, E..Z on another), with sidecars attached where needed.]
12. Sea of Containers - The Next Step
[Diagram: a container-management layer scheduling A, B, and C containers across a pool of VMs.]
• N(×M) containers
• 0..N VMs
• Elastic mesh network
• Declarative everything
• Resource scheduling
• Abstract level - a cluster of hosts running everything
13. Where Did We Want to Go?
Sea of Containers
● Research spike to evaluate container engines - kube chosen
● Service discovery moved into cluster(s).
● Scheduler takes over capacity planning.
● Autoscaling nodes and pods are the potential ultimate HA model.
● Unified location to express app deployment - release strategy, limits, environment inputs… all in one place
Full on DevOps
● Engineering teams are The Deciders and Owners of all aspects of their stacks: instances, DBs, apps…
● No centralized development/management structures - cross-functional groups for shared-ownership aspects.
● RS service groupings moved to individual RS accounts
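The "unified location to express app deployment" maps naturally onto a Kubernetes Deployment manifest. A minimal sketch (the service name, image, and values are illustrative, not from the talk) with release strategy, resource limits, and environment inputs all declared in one place:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-api            # illustrative service name
spec:
  replicas: 3
  strategy:
    type: RollingUpdate        # release strategy
    rollingUpdate:
      maxUnavailable: 0
      maxSurge: 1
  selector:
    matchLabels:
      app: example-api
  template:
    metadata:
      labels:
        app: example-api
    spec:
      containers:
      - name: api
        image: registry.example.com/example-api:1.4.2
        resources:             # capacity planning handed to the scheduler
          requests:
            cpu: 250m
            memory: 256Mi
          limits:
            cpu: 500m
            memory: 512Mi
        env:                   # environment inputs
        - name: LOG_LEVEL
          value: info
```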
14. Overarching Themes
● Disposable
○ After 10 years in the cloud, the lesson learned is to build with failure planning as table stakes.
○ If an instance is exhibiting issues, launch a new instance and "replace" the problematic one.
● Repeatable
○ Infrastructure as code - no ad hoc or one-off fixes; fix the code.
○ You don't want your services updating on an unscheduled basis to new versions that break other aspects of your stack.
16. Concept: Disposable Kubernetes Clusters
● Small purpose-driven clusters to allow full ownership and avoid resource conflicts between development teams.
● A valid strategy for troubleshooting an entire Kubernetes cluster could be to launch a new cluster and "replace" the problem one.
● Storage layers/persistent services could generally live outside the application k8s cluster.
● Upgrades to Kubernetes versions could be executed as no-downtime replacement operations a-la A/B deployment.
● … but for all this to work we needed to keep things relatively simple so our new devops teams could gain confidence in using these new toolsets. (Walk before running.)
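In Terraform terms, the A/B replacement idea can be sketched with a `create_before_destroy` lifecycle: keying the node group on the Kubernetes version forces a replacement on upgrade, and the new resources come up before the old ones are destroyed. Resource types, names, and values below are illustrative assumptions, not the talk's actual code:

```hcl
variable "k8s_version" {
  default = "1.10.3"
}

variable "node_image_id" {
  description = "Pre-built node image ID (placeholder)"
}

resource "aws_launch_configuration" "node" {
  name_prefix   = "k8s-node-${var.k8s_version}-"
  image_id      = var.node_image_id
  instance_type = "m4.large"

  lifecycle {
    create_before_destroy = true
  }
}

# Changing k8s_version creates a new "B" node group before
# the old "A" group is torn down - a no-downtime replacement.
resource "aws_autoscaling_group" "cluster_nodes" {
  name_prefix          = "k8s-${var.k8s_version}-"
  min_size             = 3
  max_size             = 10
  desired_capacity     = 3
  availability_zones   = ["us-east-1a"]
  launch_configuration = aws_launch_configuration.node.name

  lifecycle {
    create_before_destroy = true
  }
}
```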
17. Pieces of the Puzzle: Infrastructure Deployment
● Disposable clusters require stateful assets (databases, secrets, logs, etc.) to live external to the cluster.
● Executed through the Terraform RS provider plus any other resource providers as needed, an overall framework was constructed in which each account had a static "Infrastructure Deployment" that held the stateful objects.
○ Familiar RightScale objects - Servers, Arrays, etc.
○ Secrets stored in a mix of HashiCorp Vault & RightScale Credentials
○ Use of external hosted services (e.g. Splunk), as determined by each team, also helped with this strategy.
● "Hub and Spoke" strategy: N disposable Kube clusters with unique network ranges are peered to the infrastructure's network range.
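On AWS, the hub-and-spoke peering might look roughly like this sketch: each disposable cluster VPC (spoke) is peered to the long-lived infrastructure VPC (hub), with a route back to the hub's CIDR. The VPC resources themselves are assumed to exist elsewhere in the configuration:

```hcl
# Illustrative hub-and-spoke peering: spoke = disposable cluster VPC
# with a unique CIDR, hub = static infrastructure VPC holding
# stateful assets.
resource "aws_vpc_peering_connection" "cluster_to_infra" {
  vpc_id      = aws_vpc.cluster.id        # spoke
  peer_vpc_id = aws_vpc.infrastructure.id # hub
  auto_accept = true                      # same-account peering
}

# Route cluster traffic destined for the hub over the peering link.
resource "aws_route" "cluster_to_infra" {
  route_table_id            = aws_vpc.cluster.main_route_table_id
  destination_cidr_block    = aws_vpc.infrastructure.cidr_block
  vpc_peering_connection_id = aws_vpc_peering_connection.cluster_to_infra.id
}
```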
18. Pieces of the Puzzle: The Image
● Speed to build the cluster is critical
● Images were built with as much baked into them as possible
● Versions of Kubernetes services (kubelet, flannel, etc.) locked - no unscheduled updates here at boot!
● Result: faster boot times and guaranteed operational clusters with the tested/verified versions of services.
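The deck doesn't name a build tool, but the pattern is easy to sketch with an image builder such as Packer: install exact-pinned versions of the cluster services at build time and hold them so nothing upgrades at boot. The tool choice, base AMI, package names, and versions here are all assumptions:

```hcl
source "amazon-ebs" "k8s_node" {
  ami_name      = "k8s-node-{{timestamp}}"
  instance_type = "m4.large"
  region        = "us-east-1"
  source_ami    = "ami-0123456789abcdef0" # placeholder base image
  ssh_username  = "ubuntu"
}

build {
  sources = ["source.amazon-ebs.k8s_node"]

  provisioner "shell" {
    inline = [
      "sudo apt-get update",
      # Pin exact versions of the cluster services...
      "sudo apt-get install -y kubelet=1.10.3-00 kubeadm=1.10.3-00 kubectl=1.10.3-00",
      # ...and hold them so nothing updates unscheduled at boot.
      "sudo apt-mark hold kubelet kubeadm kubectl",
    ]
  }
}
```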
20. Pieces of the Puzzle: Terraform Part 1
● Terraform is an API client with a large open-source community of developers writing "providers" that operate against publicly exposed and documented APIs.
● A developer-friendly, laptop-centric CLI runtime experience for provisioning infrastructure assets - ideal for fast development iteration.
● Given the powerful "mix-and-match" ability to take AWS objects and combine them with RightScale objects and GCE objects and more… magic!
● An easy way to use the "right tool for the job."
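A hedged sketch of the mix-and-match idea: resources from two providers in one plan, with an attribute of one feeding the other. The `rightscale_server_tag` resource and its fields are purely illustrative stand-ins, not the provider's verified schema:

```hcl
provider "aws" {
  region = "us-east-1"
}

provider "rightscale" {
  # credentials supplied via environment variables
}

# An AWS resource...
resource "aws_instance" "worker" {
  ami           = "ami-0123456789abcdef0" # placeholder
  instance_type = "m4.large"
}

# ...combined with a RightScale resource in the same plan.
# (Illustrative resource/field names - see the provider docs.)
resource "rightscale_server_tag" "worker" {
  resource_href = aws_instance.worker.id
  tags          = ["ec2:team=platform"]
}
```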
21. Pieces of the Puzzle: Terraform Part 2
● The RightScale provider was born! https://www.terraform.io/docs/providers/rightscale/index.html
● Allows us to use RightScale for the governance and orchestration we need, intermixed with other provider resources directly.
● Example: creation/reading of RightScale credential objects to populate secrets during the creation of EC2 autoscaling groups.
● Example 2: RightScript "any" execution against assets external to infrastructure managed by Terraform - in our case, scripts executing setup logic for new clusters.
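The first example might look roughly like this: read a secret from a RightScale credential and inject it into the user data of an EC2 autoscaling group. The `rightscale_credential` data source name and fields are assumptions based on the talk, not verified against the provider docs:

```hcl
# Assumed data source: reads a secret stored as a RightScale credential.
data "rightscale_credential" "db_password" {
  name = "PROD_DB_PASSWORD"
}

resource "aws_launch_configuration" "app" {
  name_prefix   = "app-"
  image_id      = "ami-0123456789abcdef0" # placeholder
  instance_type = "m4.large"

  # Inject the secret at boot via user data.
  user_data = <<-EOF
    #!/bin/bash
    echo "DB_PASSWORD=${data.rightscale_credential.db_password.value}" >> /etc/app.env
  EOF
}

resource "aws_autoscaling_group" "app" {
  min_size             = 2
  max_size             = 6
  availability_zones   = ["us-east-1a"]
  launch_configuration = aws_launch_configuration.app.name
}
```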
22. Pieces of the Puzzle: Devops
● No centralized team to manage infrastructure, and a desire to reinforce full stack ownership to devops teams.
● Use of Terraform modules to express the generic aspects of the cluster build process, with duplicated/full copies of this code distributed to each team.
○ Copy/paste isn't always the best way, but it is dirt simple to understand.
○ Smoothing the distribution process for updates to modules would be undertaken at a later date.
● Cluster invocations (environments or "envs") are largely identical other than the variables defined in a variables config file.
○ Creating a new cluster is as easy as copying an existing folder and changing a few variables.
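The resulting layout can be sketched as one shared module plus thin per-env invocations, where only the variables file differs between folders. All paths, variable names, and values below are illustrative:

```hcl
# envs/staging/main.tf - a thin "env" that invokes the shared module.
module "cluster" {
  source = "../../modules/k8s-cluster" # the distributed generic module

  cluster_name = var.cluster_name
  node_count   = var.node_count
  cidr_block   = var.cidr_block # unique range per cluster for peering
}

# envs/staging/terraform.tfvars - the only file that changes per env:
#   cluster_name = "staging-api"
#   node_count   = 3
#   cidr_block   = "10.42.0.0/16"
```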
25. Pieces of the Puzzle: HA while Disposable, Part 1
● The goal: purpose-driven disposable Kube clusters that can be replaced in no-downtime operations, yet are still HA for the end customer, while keeping things easy to manage… a bit of a challenge.
● At the time of our research, horror stories abounded about multi-master/cross-AZ troubleshooting-in-place woes and "my cluster is slow" mystery troubleshooting.
26. Pieces of the Puzzle: HA while Disposable, Part 2
● Each env is N fully independent clusters, each scoped to a single AZ, treated as a logical unit and knitted together via Kubernetes Federation.
● This gives each individual cluster the ability to service a given request: inbound requests are balanced between the N invocations with health checks, and the Federation ties them together so that if one cluster in a given AZ went belly up, the other would immediately be scaled up by the Federation to handle things.
● Combined with the ability to invoke our disposable cluster anywhere - including any cloud - and knit it into the Federation for easy combined management, this gave us a strong small-yet-resilient model.
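Under Federation v1 (current at the time of the talk), the cross-cluster placement described above could be expressed as a Deployment submitted to the federation API server with a replica-placement annotation. Treat this as a sketch: the annotation name and JSON shape follow the v1 federation docs of that era, and the cluster names, weights, and image are illustrative:

```yaml
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: example-api
  annotations:
    # Spread replicas across the member clusters; "rebalance" lets
    # the federation move replicas to healthy clusters if one AZ's
    # cluster goes belly up.
    federation.kubernetes.io/deployment-preferences: |
      {"rebalance": true,
       "clusters": {
         "cluster-az-a": {"weight": 1},
         "cluster-az-b": {"weight": 1}}}
spec:
  replicas: 4
  template:
    metadata:
      labels:
        app: example-api
    spec:
      containers:
      - name: api
        image: registry.example.com/example-api:1.4.2
```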