1. One Cluster to Serve Them All
How to run a multi-tenant K8s cluster for 1000+ users in research and education at a university
06.02.18 christian.huening@haw-hamburg.de | Twitter: @chrishuen
5. University Requirements
• Flexible compute resources for Research & Teaching purposes
• Students: Try technologies, host small services, etc.
• Research projects: Host project websites and services, and run large workloads in the cloud
• Must be simple to use but allow for complex setups!
• Large variety of technologies!
• 1000+ students
• AWS, Azure, GKE etc. are not an option due to administrative restrictions
6. Multi-Tenancy @ HAW
• Lab & Research projects each buy their own resources:
• Setup consumes too much time: the project has often elapsed before anything runs
• Large vendor variety is very hard to maintain
• Objectives:
• Consolidate heterogeneous compute resources
• Datacenter de-fragmentation due to scarcity of power, cooling and space
• Goals:
• Democratize Compute Resources
• Increase Research & Development ramp-up speed and efficiency
• Improve Resource Utilization
• Simplify usage
8. Worked well, but...
• VMware at scale is too expensive
• Resources became scarce as people demanded larger VM instances
• Also: lack of flexibility
• VMs are never returned
• VMs never get patched: users need to maintain the operating systems themselves (hint: they won’t)
• Problems with security rules: either too strict or too weak, leaving users unsatisfied
9. Containers to the Rescue!
• Lightweight
• Fast
• Flexible
• Resource Efficient
… you know it
• But:
Requires Orchestration
• Enter Kubernetes
10. Multiple Clusters?
• Requires the skill to run K8s
• Even if setup is automated:
• Still leaves configuration of cluster to the users
• Does not help in error cases
• Does not help with special setups
• Essentially same provisioning problem as with VMs
• aka: Who gets how many resources and when?
11. Other reasons for single-cluster
• “(…) Not needing to deploy and monitor multiple clusters (i.e. build all the tooling we did to run GKE at Google)” – David Oppenheimer
• “(…) with the increasing emergence of "secure container" technologies, this tendency will only increase, primarily driven by resource cost considerations” – Quinton Hoole
• Source: https://goo.gl/ypCtzg
12. To the multi-tenant cluster we go!
13. Initial Cluster Setup
• Kubernetes the Hard Way (https://github.com/kelseyhightower/kubernetes-the-hard-way), adapted into ICC the Hard Way (https://github.com/christianhuening/kubernetes-the-haw-hamburg-way)
• 3 Master Nodes
• VM, 1 Core, 4 GB
• 3 Worker Nodes
• Bare Metal, 8 Core, 128GB
• 5 Node etcd cluster
• VM, 1 Core, 8 GB, HDD Storage
• Canal (Flannel + Calico) as the overlay network solution
14. We need AAA
• AuthN
• Who?
• AuthZ
• What?
• Admission
• How much?
15. AuthN
• Login via HAW Accounts through LDAP
• "Let's key in the LDAP settings into the K8s LDAP module"
• ...oh... wait... there is no such module
• Instead: Auth Token Webhook in the API servers
• kubernetes-ldap service forked & extended from Apprenda/Kismatic
• Code: https://github.com/christianhuening/kubernetes-ldap
• API-Server Config:
--authentication-token-webhook-config-file=/etc/kubernetes/ssl/ldap-webhook-config.yaml
--authentication-token-webhook-cache-ttl=30m0s
--runtime-config=authentication.k8s.io/v1beta1=true
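The referenced webhook config file uses the standard kubeconfig format for Kubernetes authentication webhooks; a minimal sketch in which the server URL and certificate paths are assumptions, not the actual ICC values:
apiVersion: v1
kind: Config
clusters:
- name: ldap-webhook
  cluster:
    certificate-authority: /etc/kubernetes/ssl/ca.pem              # CA that signed the webhook's serving cert
    server: https://kubernetes-ldap.kube-system.svc/authenticate   # placeholder service URL
users:
- name: api-server
  user:
    client-certificate: /etc/kubernetes/ssl/apiserver.pem          # credentials the API server presents
    client-key: /etc/kubernetes/ssl/apiserver-key.pem
contexts:
- name: webhook
  context:
    cluster: ldap-webhook
    user: api-server
current-context: webhook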
16. AuthN
• The kubernetes-ldap service hosts two endpoints:
• /ldapAuth: Listens for login requests and returns a JWT token, exposed via Ingress
• /authenticate: Endpoint for the API server to validate incoming tokens
Login and token validation flow (kubelogin/kubectl, K8s API, kubernetes-ldap, HAW IDM):
• kubelogin sends the user's credentials to /ldapAuth
• kubernetes-ldap performs an LDAP bind against the HAW IDM and, on success, returns 200 with a JWT token
• kubelogin writes the token to ~/.kube/config
• Every subsequent kubectl API call carries the token; the API server calls /authenticate to validate it
• kubernetes-ldap answers OK / NOK and the API server proceeds with or rejects the request
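Under the hood, /authenticate implements the standard webhook token authentication API enabled by the runtime-config flag on the previous slide: the API server POSTs a TokenReview carrying the bearer token, and the service fills in the status. A sketch with placeholder values:
Request from the API server:
{
  "apiVersion": "authentication.k8s.io/v1beta1",
  "kind": "TokenReview",
  "spec": { "token": "<JWT issued by /ldapAuth>" }
}
Response from kubernetes-ldap (username and groups are placeholders):
{
  "apiVersion": "authentication.k8s.io/v1beta1",
  "kind": "TokenReview",
  "status": {
    "authenticated": true,
    "user": { "username": "infwaa42", "groups": ["students"] }
  }
}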
17. AuthN
• Users use kubelogin to authenticate
• Creates/updates the ~/.kube/config file (illustrative excerpt below)
• Sets the default namespace
• Activates the context
• The stored token is valid for 12 hours
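Illustrative excerpt of what kubelogin might write to ~/.kube/config; user, context and namespace names are placeholders, not the actual ICC values:
users:
- name: infwaa42
  user:
    token: eyJhbGciOiJSUzI1NiIs...     # JWT from /ldapAuth, valid for 12 hours
contexts:
- name: icc
  context:
    cluster: icc
    user: infwaa42
    namespace: infwaa42                # default namespace set by kubelogin
current-context: icc                    # context activated by kubelogin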
19. AuthZ
• A source of truth is required
• The majority of project and course work at HAW is done via Gitlab
• We built a Gitlab Integrator Service which:
• maps Groups, Projects and Personal Repos to Namespaces
• maps User roles from Gitlab to RoleBindings (see the sketch below)
• also applies PodSecurityPolicies & Docker Registry Secrets
• supports the webhook feature and a full sync every 3 hours
• allows namespaces to be excluded from synchronization
• kube-system got “cleaned up” once, whoops
• sets up the K8s Integration in Gitlab (i.e. for Continuous Delivery)
• can run inside the cluster or externally
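For illustration only, a RoleBinding of the kind the integrator might create for a Gitlab Developer; the names and the referenced role are assumptions, not the integrator's actual output:
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: gitlab-developer                 # hypothetical name
  namespace: research-project-x          # namespace derived from the Gitlab project
subjects:
- kind: User
  name: infwaa42                         # Gitlab user holding the Developer role
  apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: ClusterRole
  name: edit                             # built-in role; the integrator may use its own custom roles (next slide)
  apiGroup: rbac.authorization.k8s.io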
20. AuthZ - Custom Roles and Bindings
• Special permissions are granted through a ConfigMap
• The Integrator ensures these are present in the cluster
• Code: https://github.com/k8s-tamias/gitlab-k8s-integrator
21. AuthZ
• The service is at a point where it does too many things
• Reengineering:
• Tenant Operator/Controller
• Adapters for sources of truth like Gitlab, Github, LDAP, etc…
• Discussion at https://goo.gl/CQFvd8
• And in Multi-Tenancy Workgroup:
• Mailing List: https://goo.gl/fZ8g6B
• Slack: https://kubernetes.slack.com/messages/wg-multitenancy
• Come in, join the fun ☺ !
28. Storage – CEPH & rook.io
• rook.io on Container Linux:
• Runs the CEPH cluster as Pods in Kubernetes
• Same benefits for your storage cluster as you have for your apps
• Requires persistent storage for the ceph-mon data to be shutdown/restart-safe: mount /var/lib/rook to an extra hard drive (see the sketch below)
• BTW: No need for multiple pools due to the single, large cluster! ☺
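A sketch of the relevant part of a rook cluster spec; the apiVersion and surrounding fields are assumptions and differ between rook releases, so treat this as illustrative only:
apiVersion: rook.io/v1alpha1             # newer releases use ceph.rook.io/v1 and kind CephCluster
kind: Cluster
metadata:
  name: rook
  namespace: rook
spec:
  dataDirHostPath: /var/lib/rook         # ceph-mon data lands here; back it with a dedicated disk on each node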
29. Also: Logging
• No open-source logging solution is capable of multi-tenancy out of the box
• Option 1: Deploy Graylog + ES to every namespace: costs 2-4 GB of memory each
• Option 2: Provide a Helm chart for people who want it: won’t get used
• Option 3: Graylog can do it through Streams and Rules in combination with user permissions
• However: problematic and slow
• Gets set up via the gitlab-integrator
• As I said: it’s doing too many things…
30. Even More:
• SSL Certificate auto-provisioning via kube-lego
• Discontinued: Need to migrate to cert-manager!
• Monitoring via Prometheus-Operator
• No multi-tenancy yet, suggestions?
• GPGPU pods via https://github.com/NVIDIA/k8s-device-plugin (example pod spec below)
• And special PSPs in Namespaces via Gitlab-Integrator
• Dynamic Nodes from PC-Pools
• Add up to 1.2 TB memory and 600 cores
• Utilizes the csrapproval-controller (since K8s 1.7)
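The NVIDIA device plugin advertises GPUs as the extended resource nvidia.com/gpu; a minimal pod sketch requesting one (name and image are placeholders):
apiVersion: v1
kind: Pod
metadata:
  name: cuda-test                        # hypothetical name
spec:
  containers:
  - name: cuda
    image: nvidia/cuda:9.0-base          # placeholder image
    resources:
      limits:
        nvidia.com/gpu: 1                # one GPU, scheduled via the device plugin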
31. Summary
• Everything worked fine!
• Without actual users…
• Go-Live in September 2017 (Winter-Semester)
• ~150 concurrent users
• 2 very heavy users (master theses)
• Sort of brought down the cluster several times ☺
• Several problems showed up:
32. Problems – Control Plane
API Server Metrics (graph)
33. Problems - Control Plane
• API Servers were running out of capacity:
• Increased memory to 32GB
• Increased Cores to 4
• Increased API Server count to 6
• However: Problems persisted
• kubectl commands timed out
• Deployments didn’t start
• Nodes failed due to API-Servers not responding
• etcd?
34. Problems – etcd I
• Obviously etcd ran out of memory
• Disable Swap!
• Increase mem to 16 GB per Node
35. Problems – etcd II
• Switching to pure SSD storage is recommended! The journal excerpt below shows etcd missing heartbeats on the HDD-backed nodes:
Dez 12 09:13:23 icc-etcd-1 etcd[3875]: failed to send out heartbeat on time (exceeded the 100ms timeout for 228.310831ms)
Dez 12 09:13:23 icc-etcd-1 etcd[3875]: server is likely overloaded
Dez 12 09:13:23 icc-etcd-1 etcd[3875]: failed to send out heartbeat on time (exceeded the 100ms timeout for 228.294797ms)
Dez 12 09:13:23 icc-etcd-1 etcd[3875]: server is likely overloaded
37. Problems – etcd III
• We hit etcd’s default storage limit of 2 GB
• etcd then only accepted READ and DELETE requests
• Increase the size via the --quota-backend-bytes flag (see the sketch below)
• The maximum is 8 GB
• Effectively caused 1 day of downtime; running services remained up
• Recovery took about 7 hours at full utilization
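A sketch of the fix, assuming etcd is configured via startup flags (the value and follow-up steps depend on your setup): raise the quota, compact/defragment, then clear the NOSPACE alarm so writes are accepted again.
--quota-backend-bytes=8589934592         # 8 GiB, the practical maximum
etcdctl alarm disarm                     # clears the alarm that blocks writes once space is available again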
38. Other Performance Impacts
• kube-state-metrics’ pod_nanny required a higher setting (extra_mem = 150Mi per node) due to higher pod churn (see the sketch below)
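The pod_nanny is the addon-resizer sidecar that scales kube-state-metrics with cluster size; the setting above corresponds to its extra-memory argument. An illustrative args snippet (flag names may differ between addon-resizer versions):
- name: addon-resizer
  args:
  - --container=kube-state-metrics
  - --memory=100Mi                       # base memory request
  - --extra-memory=150Mi                 # additional memory per node, raised because of the higher pod churn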
39. Lessons Learned
• Large scale is not necessarily bound to the number of nodes
• etcd really is your pet and you want to make it feel good
• Multi-tenancy is possible, but complex
• Requires especially good monitoring, logging & auditing
• Students are very curious and use the new technologies
40. What’s next?
• Node Security & Container Isolation
• Network Policies (see the sketch below)
• Resource Management via Self-Service (tamias.io)
• Priorities / kube-arbitrator
• Improve usage of owned, but idle resources
• PodTolerationRestriction Controller
• IPv6 & multi-network setup (IoT research et al.)
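The Network Policies item above would likely start with a per-namespace default-deny rule; a minimal sketch (namespace name is a placeholder):
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress
  namespace: research-project-x          # hypothetical tenant namespace
spec:
  podSelector: {}                        # applies to every pod in the namespace
  policyTypes:
  - Ingress                              # no ingress rules listed, so all ingress traffic is denied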