1. One Cluster to Serve Them All
How to run a multi-tenant K8s cluster for 1000+ users in research and education at a university
06.02.18 christian.huening@haw-hamburg.de | Twitter: @chrishuen
5. University Requirements
• Flexible compute resources for Research & Teaching purposes
• Students: Try technologies, host small services, etc.
• Research projects: Host project websites and services, and run large workloads in the cloud
• Must be simple to use but allow for complex setups!
• Large variety of technologies!
• 1000+ students
• AWS, Azure, GKE etc. are not an option due to administrative restrictions
6. Multi-Tenancy @ HAW
• Lab & Research projects each buy their own resources:
• Setup consumes too much time: the project has often elapsed before anything runs
• Large vendor variety is very hard to maintain
• Objectives:
• Consolidate heterogeneous compute resources
• Datacenter de-fragmentation due to scarcity of power, cooling and space
• Goals:
• Democratize Compute Resources
• Increase Research & Development ramp-up speed and efficiency
• Improve Resource Utilization
• Simplify usage
8. Worked well, but...
• VMware at scale is too expensive
• Resources became scarce as people demanded larger VM instances
• Also: lack of flexibility
• VMs are never returned
• VMs never get patched: users need to maintain the operating systems themselves (hint: they won’t)
• Problems with security rules: either too strict or too weak, leaving users unsatisfied
9. Containers to the Rescue!
• Lightweight
• Fast
• Flexible
• Resource Efficient
… you know it
• But:
Requires Orchestration
• Enter Kubernetes
10. Multiple Clusters?
• Requires the skill to run K8s
• Even if setup is automated:
• Still leaves configuration of cluster to the users
• Does not help in error cases
• Does not help with special setups
• Essentially same provisioning problem as with VMs
• aka: Who gets how many resources and when?
11. Other reasons for single-cluster
• “(…) Not needing to deploy and monitor multiple clusters (i.e. build all the tooling we did to run GKE at Google)” – David Oppenheimer
• “(…) with the increasing emergence of "secure container" technologies, this tendency will only increase, primarily driven by resource cost considerations” – Quinton Hoole
• Source: https://goo.gl/ypCtzg
12. To the multi-tenant cluster we go!
13. Initial Cluster Setup
• Kubernetes the Hard Way (https://github.com/kelseyhightower/kubernetes-the-hard-way), adapted into ICC the Hard Way (https://github.com/christianhuening/kubernetes-the-haw-hamburg-way)
• 3 Master Nodes
• VM, 1 Core, 4 GB
• 3 Worker Nodes
• Bare Metal, 8 Core, 128GB
• 5 Node etcd cluster
• VM, 1 Core, 8 GB, HDD Storage
• Canal (Flannel + Calico) as the overlay network solution
14. We need AAA
• AuthN
• Who?
• AuthZ
• What?
• Admission
• How much?
15. AuthN
• Login via HAW Accounts through LDAP
• "Let's key in the LDAP settings into the K8s LDAP module"
• ...oh... wait... there is no such module
• Instead: Auth Token Webhook in the API servers
• kubernetes-ldap service forked & extended from Apprenda/Kismatic
• Code: https://github.com/christianhuening/kubernetes-ldap
• API-Server Config:
--authentication-token-webhook-config-file=/etc/kubernetes/ssl/ldap-webhook-config.yaml
--authentication-token-webhook-cache-ttl=30m0s
--runtime-config=authentication.k8s.io/v1beta1=true
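The referenced webhook config file uses the standard kubeconfig format for Kubernetes authentication webhooks; a minimal sketch in which the server URL and certificate paths are assumptions, not the actual ICC values:
apiVersion: v1
kind: Config
clusters:
- name: ldap-webhook
  cluster:
    certificate-authority: /etc/kubernetes/ssl/ca.pem              # CA that signed the webhook's serving cert
    server: https://kubernetes-ldap.kube-system.svc/authenticate   # placeholder service URL
users:
- name: api-server
  user:
    client-certificate: /etc/kubernetes/ssl/apiserver.pem          # credentials the API server presents
    client-key: /etc/kubernetes/ssl/apiserver-key.pem
contexts:
- name: webhook
  context:
    cluster: ldap-webhook
    user: api-server
current-context: webhook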
16. AuthN
• The kubernetes-ldap service hosts two endpoints:
• /ldapAuth: Listens for login requests and returns a JWT token, exposed via Ingress
• /authenticate: Endpoint for the API server to validate incoming tokens
Login and token validation flow (kubelogin/kubectl, K8s API, kubernetes-ldap, HAW IDM):
• kubelogin sends the user's credentials to /ldapAuth
• kubernetes-ldap performs an LDAP bind against the HAW IDM and, on success, returns 200 with a JWT token
• kubelogin writes the token to ~/.kube/config
• Every subsequent kubectl API call carries the token; the API server calls /authenticate to validate it
• kubernetes-ldap answers OK / NOK and the API server proceeds with or rejects the request
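Under the hood, /authenticate implements the standard webhook token authentication API enabled by the runtime-config flag on the previous slide: the API server POSTs a TokenReview carrying the bearer token, and the service fills in the status. A sketch with placeholder values:
Request from the API server:
{
  "apiVersion": "authentication.k8s.io/v1beta1",
  "kind": "TokenReview",
  "spec": { "token": "<JWT issued by /ldapAuth>" }
}
Response from kubernetes-ldap (username and groups are placeholders):
{
  "apiVersion": "authentication.k8s.io/v1beta1",
  "kind": "TokenReview",
  "status": {
    "authenticated": true,
    "user": { "username": "infwaa42", "groups": ["students"] }
  }
}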
17. AuthN
• Users use kubelogin to authenticate
• Creates/updates the ~/.kube/config file (illustrative excerpt below)
• Sets the default namespace
• Activates the context
• The stored token is valid for 12 hours
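Illustrative excerpt of what kubelogin might write to ~/.kube/config; user, context and namespace names are placeholders, not the actual ICC values:
users:
- name: infwaa42
  user:
    token: eyJhbGciOiJSUzI1NiIs...     # JWT from /ldapAuth, valid for 12 hours
contexts:
- name: icc
  context:
    cluster: icc
    user: infwaa42
    namespace: infwaa42                # default namespace set by kubelogin
current-context: icc                    # context activated by kubelogin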
19. AuthZ
• A source of truth is required
• The majority of project and course work at HAW is done via Gitlab
• We built a Gitlab Integrator Service which:
• maps Groups, Projects and Personal Repos to Namespaces
• maps User roles from Gitlab to RoleBindings (see the sketch below)
• also applies PodSecurityPolicies & Docker Registry Secrets
• supports the webhook feature and a full sync every 3 hours
• allows namespaces to be excluded from synchronization
• kube-system got “cleaned up” once, whoops
• sets up the K8s Integration in Gitlab (i.e. for Continuous Delivery)
• can run inside the cluster or externally
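For illustration only, a RoleBinding of the kind the integrator might create for a Gitlab Developer; the names and the referenced role are assumptions, not the integrator's actual output:
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: gitlab-developer                 # hypothetical name
  namespace: research-project-x          # namespace derived from the Gitlab project
subjects:
- kind: User
  name: infwaa42                         # Gitlab user holding the Developer role
  apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: ClusterRole
  name: edit                             # built-in role; the integrator may use its own custom roles (next slide)
  apiGroup: rbac.authorization.k8s.io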
20. AuthZ - Custom Roles and Bindings
• Special permissions are granted through a ConfigMap
• The Integrator ensures these are present in the cluster
• Code: https://github.com/k8s-tamias/gitlab-k8s-integrator
21. AuthZ
• The service is at a point where it does too many things
• Reengineering:
• Tenant Operator/Controller
• Adapters for sources of truth like Gitlab, Github, LDAP, etc…
• Discussion at https://goo.gl/CQFvd8
• And in Multi-Tenancy Workgroup:
• Mailing List: https://goo.gl/fZ8g6B
• Slack: https://kubernetes.slack.com/messages/wg-multitenancy
• Come in, join the fun ☺ !
28. Storage – CEPH & rook.io
• rook.io on Container Linux:
• Runs the CEPH cluster as Pods in Kubernetes
• Same benefits for your storage cluster as you have for your apps
• Requires persistent storage for the ceph-mon data to be shutdown/restart-safe: mount /var/lib/rook to an extra hard drive (see the sketch below)
• BTW: No need for multiple pools due to the single, large cluster! ☺
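A sketch of the relevant part of a rook cluster spec; the apiVersion and surrounding fields are assumptions and differ between rook releases, so treat this as illustrative only:
apiVersion: rook.io/v1alpha1             # newer releases use ceph.rook.io/v1 and kind CephCluster
kind: Cluster
metadata:
  name: rook
  namespace: rook
spec:
  dataDirHostPath: /var/lib/rook         # ceph-mon data lands here; back it with a dedicated disk on each node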
29. Also: Logging
• No open-source logging solution is capable of multi-tenancy out of the box
• Option 1: Deploy Graylog + ES to every namespace: costs 2-4 GB of memory each
• Option 2: Provide a Helm chart for people who want it: won’t get used
• Option 3: Graylog can do it through Streams and Rules in combination with user permissions
• However: problematic and slow
• Gets set up via the gitlab-integrator
• As I said: it’s doing too many things…
30. Even More:
• SSL Certificate auto-provisioning via kube-lego
• Discontinued: Need to migrate to cert-manager!
• Monitoring via Prometheus-Operator
• No multi-tenancy yet, suggestions?
• GPGPU pods via https://github.com/NVIDIA/k8s-device-plugin (example pod spec below)
• And special PSPs in Namespaces via Gitlab-Integrator
• Dynamic Nodes from PC-Pools
• Add up to 1.2 TB memory and 600 cores
• Utilizes the csrapproval-controller (since K8s 1.7)
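The NVIDIA device plugin advertises GPUs as the extended resource nvidia.com/gpu; a minimal pod sketch requesting one (name and image are placeholders):
apiVersion: v1
kind: Pod
metadata:
  name: cuda-test                        # hypothetical name
spec:
  containers:
  - name: cuda
    image: nvidia/cuda:9.0-base          # placeholder image
    resources:
      limits:
        nvidia.com/gpu: 1                # one GPU, scheduled via the device plugin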
31. Summary
• Everything worked fine!
• Without actual users…
• Go-Live in September 2017 (Winter-Semester)
• ~150 concurrent users
• 2 very heavy users (master theses)
• Sort of brought down the cluster several times ☺
• Several problems showed up:
32. Problems – Control Plane
API Server Metrics (graph)
33. Problems - Control Plane
• API Servers were running out of capacity:
• Increased memory to 32GB
• Increased Cores to 4
• Increased API Server count to 6
• However: Problems persisted
• kubectl commands timed out
• Deployments didn’t start
• Nodes failed due to API-Servers not responding
• etcd?
34. Problems – etcd I
• Obviously etcd ran out of memory
• Disable Swap!
• Increase mem to 16 GB per Node
35. Problems – etcd II
• Switching to pure SSD storage is recommended! The journal excerpt below shows etcd missing heartbeats on the HDD-backed nodes:
Dez 12 09:13:23 icc-etcd-1 etcd[3875]: failed to send out heartbeat on time (exceeded the 100ms timeout for 228.310831ms)
Dez 12 09:13:23 icc-etcd-1 etcd[3875]: server is likely overloaded
Dez 12 09:13:23 icc-etcd-1 etcd[3875]: failed to send out heartbeat on time (exceeded the 100ms timeout for 228.294797ms)
Dez 12 09:13:23 icc-etcd-1 etcd[3875]: server is likely overloaded
37. Problems – etcd III
• We hit etcd’s default storage limit of 2 GB
• etcd then only accepted READ and DELETE requests
• Increase the size via the --quota-backend-bytes flag (see the sketch below)
• The maximum is 8 GB
• Effectively caused 1 day of downtime; running services remained up
• Recovery took about 7 hours at full utilization
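A sketch of the fix, assuming etcd is configured via startup flags (the value and follow-up steps depend on your setup): raise the quota, compact/defragment, then clear the NOSPACE alarm so writes are accepted again.
--quota-backend-bytes=8589934592         # 8 GiB, the practical maximum
etcdctl alarm disarm                     # clears the alarm that blocks writes once space is available again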
38. Other Performance Impacts
• kube-state-metrics’ pod_nanny required a higher setting (extra_mem = 150Mi per node) due to higher pod churn (see the sketch below)
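The pod_nanny is the addon-resizer sidecar that scales kube-state-metrics with cluster size; the setting above corresponds to its extra-memory argument. An illustrative args snippet (flag names may differ between addon-resizer versions):
- name: addon-resizer
  args:
  - --container=kube-state-metrics
  - --memory=100Mi                       # base memory request
  - --extra-memory=150Mi                 # additional memory per node, raised because of the higher pod churn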
39. Lessons Learned
• Large scale is not necessarily bound to the number of nodes
• etcd really is your pet and you want to make it feel good
• Multi-tenancy is possible, but complex
• Requires especially good monitoring, logging & auditing
• Students are very curious and use the new technologies
40. What’s next?
• Node Security & Container Isolation
• Network Policies (see the sketch below)
• Resource Management via Self-Service (tamias.io)
• Priorities / kube-arbitrator
• Improve usage of owned, but idle resources
• PodTolerationRestriction Controller
• IPv6 & multi-network setup (IoT research et al.)
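The Network Policies item above would likely start with a per-namespace default-deny rule; a minimal sketch (namespace name is a placeholder):
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress
  namespace: research-project-x          # hypothetical tenant namespace
spec:
  podSelector: {}                        # applies to every pod in the namespace
  policyTypes:
  - Ingress                              # no ingress rules listed, so all ingress traffic is denied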