Realizing Linux Containers (LXC) - Building Blocks, Underpinnings & Motivations
1. Realizing Linux Containers (LXC): Building Blocks, Underpinnings & Motivations
Boden Russell – IBM Global Technology Services (brussell@us.ibm.com)
2. Definitions
Linux Containers (LXC for LinuX Containers) are lightweight virtual machines (VMs)
which are realized using features provided by a modern Linux kernel – VMs without
the hypervisor
Containerization of:
– (Linux) Operating Systems
– Single or multiple applications
LXC as a technology is not to be confused with LXC (tools), a user space toolset for creating & managing Linux Containers
From Wikipedia:
LXC (LinuX Containers) is an operating system–level virtualization method for running multiple
isolated Linux systems (containers) on a single control host… LXC provides operating system-level
virtualization not via a virtual machine, but rather provides a virtual environment that has its own
process and network space.
3. Why LXC
Provision in seconds / milliseconds
Near bare metal runtime performance
VM-like agility – it’s still “virtualization”
Flexibility
– Containerize a “system”
– Containerize “application(s)”
Lightweight
– Just enough Operating System (JeOS)
– Minimal per container penalty
Open source – free – lower TCO
Supported out of the box (OOTB) by modern Linux kernels
Growing in popularity
“Linux Containers are poised as the next VM in our modern Cloud era…”
Provision time: manual provisioning takes days; a traditional VM takes minutes; an LXC container takes seconds or milliseconds.
[Chart: linpack performance @ 45000; GFlops vs. vcpus (1-32) and bare metal (BM)]
[Charts: Google Trends interest over time for "LXC" and for "docker"]
4. Hypervisors vs. Linux Containers
[Diagram: three stacks compared. Type 1 Hypervisor: Hardware → Hypervisor → Virtual Machines, each with its own Operating System, bins / libs and apps. Type 2 Hypervisor: Hardware → Operating System → Hypervisor → Virtual Machines, each with its own Operating System, bins / libs and apps. Linux Containers: Hardware → Operating System → Containers, each with bins / libs and apps]
Containers share the OS kernel of the host and thus are lightweight.
However, every container must run on that same host kernel.
Containers are isolated, but share the OS and, where appropriate, libs / bins.
5. LXC Technology Stack
LXCs are built on modern kernel features
– cgroups; limits, prioritization, accounting & control
– namespaces; process based resource isolation
– chroot; apparent root FS directory
– Linux Security Modules (LSM); Mandatory Access Control (MAC)
User space interfaces for kernel functions
LXC tools
– Tools to isolate process(es) virtualizing kernel resources
LXC commoditization
– Dead easy LXC
– LXC virtualization
Orchestration & management
– Scheduling across multiple hosts
– Monitoring
– Uptime
6. Linux cgroups
History
– Work started in 2006 by Google engineers
– Merged into the upstream 2.6.24 kernel as LXC usage became more widespread
– A number of features still a WIP
Functionality
– Access; which devices can be used per cgroup
– Resource limiting; memory, CPU, device accessibility, block I/O, etc.
– Prioritization; who gets more of the CPU, memory, etc.
– Accounting; resource usage per cgroup
– Control; freezing & checkpointing
– Injection; packet tagging
Usage
– cgroup functionality exposed as “resource controllers” (aka “subsystems”)
– Subsystems mounted on FS
– Top-level subsystem mount is the root cgroup; all procs on host
– Directories under top-level mounts created per cgroup
– Procs put in tasks file for group assignment
– Interface via read / write pseudo files in group
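As a minimal sketch of this flow (cgroup v1 layout; the mount point and the group name "demo" are illustrative):
mkdir -p /sys/fs/cgroup/cpu && mount -t cgroup -o cpu cpu /sys/fs/cgroup/cpu   # mount the cpu subsystem; its root is the root cgroup (all procs)
mkdir /sys/fs/cgroup/cpu/demo                    # a directory under the mount creates a new cgroup
echo $$ > /sys/fs/cgroup/cpu/demo/tasks          # assign the current shell to the group
echo 512 > /sys/fs/cgroup/cpu/demo/cpu.shares    # tune the group via its read / write pseudo files
cat /sys/fs/cgroup/cpu/demo/cpu.shares           # settings and accounting are read back the same way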
7. Linux cgroup Subsystems
cgroups provided via kernel modules
– Not always loaded / provided by default
– Locate and load with modprobe
Some features tied to kernel version
See: https://www.kernel.org/doc/Documentation/cgroups/
Subsystem Tunable Parameters
blkio - Weighted proportional block I/O access. Group wide or per device.
- Per device hard limits on block I/O read / write, specified in bytes per second or IOPS.
cpu - Time period (microseconds per second) a group should have CPU access.
- Group wide upper limit on CPU time per second.
- Weighted proportional value of relative CPU time for a group.
cpuset - CPUs (cores) the group can access.
- Memory nodes the group can access and migrate ability.
- Memory hardwall, pressure, spread, etc.
devices - Define which devices and access type a group can use.
freezer - Suspend/resume group tasks.
memory - Max memory limits for the group (in bytes).
- Memory swappiness, OOM control, hierarchy, etc.
hugetlb - Limit HugeTLB size usage.
- Per cgroup HugeTLB metrics.
net_cls - Tag network packets with a class ID.
- Use tc to prioritize tagged packets.
net_prio - Weighted proportional priority on egress traffic (per interface).
10. Linux cgroups: CPU Usage
Use CPU shares (and other controls) to prioritize jobs / containers
Carry out complex scheduling schemes
Segment host resources
Adhere to SLAs
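For example, a minimal cgroup v1 sketch (the group names "gold" and "bronze" are hypothetical) that weights one container 2:1 over another and hard-caps the lower priority one to roughly half a core:
mkdir -p /sys/fs/cgroup/cpu/gold /sys/fs/cgroup/cpu/bronze
echo 2048 > /sys/fs/cgroup/cpu/gold/cpu.shares              # relative weight: ~2x bronze under contention
echo 1024 > /sys/fs/cgroup/cpu/bronze/cpu.shares
echo 100000 > /sys/fs/cgroup/cpu/bronze/cpu.cfs_period_us   # 100 ms scheduling period
echo 50000 > /sys/fs/cgroup/cpu/bronze/cpu.cfs_quota_us     # upper limit: 50 ms of CPU per period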
11. Linux cgroups: CPU Pinning
Pin containers / jobs to CPU cores
Carry out complex scheduling schemes
Reduce core switching costs
Adhere to SLAs
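A minimal cpuset sketch (the group name and core numbers are illustrative; with LXC tools the same effect is typically achieved via lxc.cgroup.cpuset.cpus in the container config):
mkdir /sys/fs/cgroup/cpuset/pinned
echo 0-1 > /sys/fs/cgroup/cpuset/pinned/cpuset.cpus    # restrict the group to cores 0 and 1
echo 0 > /sys/fs/cgroup/cpuset/pinned/cpuset.mems      # memory node 0; must be set before adding tasks
echo $$ > /sys/fs/cgroup/cpuset/pinned/tasks           # pin the current shell (and its children)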
13. LXC Realization: Linux cgroups
cgroup created per container (in each cgroup subsystem)
Prioritization, access, limits per container a la cgroup controls
Per container metrics (bean counters)
14. Linux namespaces
History
– Initial kernel patches in 2.4.19
– Recent 3.8 patches for user namespace support
– A number of features still a WIP
Functionality
– Provide process level isolation of global resources
• MNT (mount points, file systems, etc.)
• PID (process)
• NET (NICs, routing, etc.)
• IPC (System V IPC resources)
• UTS (host & domain name)
• USER (UID + GID)
– Process(es) in the namespace have the illusion that they are the only processes on the system
– Generally constructs exist to permit “connectivity” with parent namespace
Usage
– Construct namespace(s) of desired type
– Create process(es) in namespace (typically done when creating namespace)
– If necessary, initialize “connectivity” to parent namespace
– Process(es) in the namespace internally function as if they are the only proc(s) on the system
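With a recent util-linux this pattern can be sketched from the shell (exact flags vary by version):
unshare --uts --ipc --mount --pid --fork --mount-proc /bin/bash   # new namespaces plus a shell inside them
hostname demo-ns    # UTS change is visible only inside the namespace
ps -ef              # PID namespace: only processes started in here are listed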
16. Linux namespaces: MNT namespace
Isolates the mount table – per namespace mounts
mount / unmount operations isolated to namespace
Mount propagation
– Shared; mount objects propagate events to one another
– Slave; one mount propagates events to another, but not
vice versa
– Private; no event propagation (default)
Unbindable mount forbids bind mounting itself
Various tools / APIs support the mount namespace such
as the mount command
– Options to make shared, private, slave, etc.
– Mount with namespace support
Typically used with chroot or pivot_root for
effective root FS isolation
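A small sketch of the propagation modes and per-namespace mounts (paths are illustrative):
mount --make-shared /mnt/fsrw     # shared: mount events under it propagate to peer namespaces
mount --make-slave /mnt/cdrom     # slave: receives events from its master, but not vice versa
mount --make-private /mnt/fsrd    # private (the default): no propagation in either direction
unshare --mount /bin/bash         # enter a new MNT namespace with a copy of the current mount table
mount -t tmpfs tmpfs /mnt/fsrd    # on a private mount this is invisible to the parent namespace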
[Diagram: per-namespace mount tables. “global” (root) MNT namespace: /, /proc, /mnt/fsrd, /mnt/fsrw, /mnt/cdrom, /run2. “green” MNT namespace: /, /proc, /mnt/greenfs, /mnt/fsrw, /mnt/cdrom. “red” MNT namespace: /, /proc, /mnt/cdrom, /redns]
17. Linux namespaces: UTS namespace
Per namespace
– Hostname
– NIS domain name
Reported by commands such as hostname
Processes in namespace can change UTS values – only
reflected in the child namespace
Allows containers to have their own FQDN
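A quick sketch from the shell:
unshare --uts /bin/bash   # new UTS namespace
hostname greenhost        # the change is confined to this namespace
hostname                  # -> greenhost
exit
hostname                  # the parent namespace still reports the original name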
[Diagram: per-namespace UTS values. “global” (root) namespace: globalhost / rootns.com. “green” namespace: greenhost / greenns.org. “red” namespace: redhost / redns.com]
18. Linux namespaces: PID namespace
Per namespace PID mapping
– PID 1 in namespace not the same as PID 1 in parent namespace
– No PID conflicts between namespaces
– Effectively 2 PIDs; the PID in the namespace and the PID outside
the namespace
Permits migrating namespace processes between hosts
while keeping same PID
Only processes in the namespace are visible within the
namespace (visibility limited)
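A minimal sketch (util-linux unshare; --fork is required so the shell is actually created inside the new PID namespace, --mount-proc gives it a matching /proc):
unshare --pid --fork --mount-proc /bin/bash
echo $$    # -> 1: the shell is PID 1 inside the namespace
ps -ef     # only the namespace's own processes are visible; from the parent, the same shell appears under its "outside" PID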
[Diagram: per-namespace process tables. “global” (root) PID namespace: 1 /sbin/init, 2 [kthreadd], 3 [ksoftirqd], 4 [cpuset], 5 /sbin/udevd. “green” PID namespace: 1 /bin/bash, 2 /bin/vim. “red” PID namespace: 1 /bin/bash, 2 python, 3 node]
20. Linux namespaces: NET namespace
Per namespace network objects
– Network devices (eths)
– Bridges
– Routing tables
– IP address(es)
– ports
– Etc
Various commands support network namespace such as ip
Connectivity to other namespaces
– veths – create veth pair, move one inside the namespace and
configure
– Acts as a pipe between the 2 namespaces
LXCs can have their own IPs, routes, bridges, etc.
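A sketch using the ip command (the namespace name and addresses are illustrative):
ip netns add green                                           # create a named NET namespace
ip link add veth0 type veth peer name veth1                  # veth pair: the "pipe" between namespaces
ip link set veth1 netns green                                # move one end inside the namespace
ip addr add 10.0.3.1/24 dev veth0 && ip link set veth0 up    # configure the host end
ip netns exec green ip addr add 10.0.3.2/24 dev veth1        # configure the namespace end
ip netns exec green ip link set veth1 up
ip netns exec green ip link set lo up
ping -c 1 10.0.3.2                                           # host <-> namespace connectivity over the veth pair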
[Diagram: per-namespace network objects. “global” (root) NET namespace: lo, eth0, eth1, br0, with apps on ports 5000 / 6000 / 7000. “green” NET namespace: lo, eth0, with apps on ports 1000 / 7000. “red” NET namespace: lo, eth0 (down), eth1, with apps on ports 7000 / 9000. Note the same port numbers can be reused in different namespaces]
21. Linux namespaces: USER namespace
A long work in progress – still under development for XFS and other FS support
– Significant security impacts
– A handful of security holes already found + fixed
Two major features provided:
– Map UID / GID from outside the container to UID / GID inside the
container
– Permit non-root users to launch LXCs
– Distros rolling out phased support, with UID / GID mapping typically coming first
First process in the USER namespace has full CAPs inside the namespace (but no CAPs in the parent namespace), so it can perform initializations before other processes are created
UID / GID map can be pre-configured via FS
Eventually USER namespace will mitigate many perceived LXC
security concerns
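A minimal mapping sketch on a 3.8+ kernel (the PID 4321 and UID 1000 are purely illustrative; recent util-linux can also set this up in one step):
unshare --user /bin/bash                  # as an unprivileged user (say UID 1000): new USER namespace
# then, from the parent namespace, for that shell's PID (say 4321):
echo "0 1000 1" > /proc/4321/uid_map      # outside UID 1000 appears as UID 0 (root) inside
echo "0 1000 1" > /proc/4321/gid_map      # likewise for the GID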
[Diagram: per-namespace UID / GID maps. “global” (root) USER namespace: root 0:0, ntp 104:109, mysql 105:110, boden 106:111. “green” USER namespace: root 0:0, app 106:111. “red” USER namespace: root 0:0, app 104:109]
22. LXC Realization: Linux namespaces
A set of namespaces created for the container
Container process(es) “executed” in the namespace set
Process(es) in the container have isolated view of resources
Connectivity to parent where needed (via lxc tooling)
23. Linux namespaces & cgroups: Availability
Note: user namespace support is in upstream kernel 3.8+, but distributions are rolling out phased support:
- Map LXC UID / GID between container and host
- Non-root LXC creation
24. Linux chroots
Changes apparent root directory for process and children
– Search paths
– Relative directories
– Etc
A chroot can be escaped given the proper capabilities, so pivot_root is often used instead
– chroot; points the process’s file system root at a new directory
– pivot_root; makes a new mount the process’s root FS and moves the old root onto a directory beneath it (where it can be unmounted)
Often used when building system images
– Chroot to temp directory
– Download and install packages in chroot
– Compress chroot as a system root FS
LXC realization
– Bind mount container root FS (image)
– Launch (unshare or clone) LXC init process in a new MNT namespace
– pivot_root to the bind mount (root FS)
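A common image-build sketch with debootstrap and chroot (the suite, paths and mirror are illustrative):
debootstrap --variant=minbase wheezy /tmp/rootfs http://ftp.debian.org/debian   # populate a root FS in a temp directory
chroot /tmp/rootfs apt-get install -y openssh-server                            # install packages "inside" the image
tar -C /tmp/rootfs -czf rootfs.tar.gz .                                         # compress the tree as a system root FS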
25. Linux chroot vs pivot_root
Using pivot_root with a MNT namespace addresses chroot escape concerns
The pivot_root target directory becomes the “new root FS”
26. LXC Realization: Images
LXC images provide a flexible means to deliver only what you need – lightweight and minimal footprint
Basic constraints
– Same architecture
– Same endian
– Linux’ish Operating System; you can run different Linux distros on same host
Image types
– System; images intended to virtualize Operating System(s) – standard distro root FS less the
kernel
– Application; images intended to virtualize application(s) – only package apps + dependencies
(aka JeOS – Just enough Operating System)
Bind mount host libs / bins into LXC to share host resources
Container image init process
– Container init command provided on invocation – can be an application or a full fledged init
process
– Init script customized for image – skinny SysVinit, upstart, etc.
– Reduces LXC start-up overhead and runtime footprint
Various tools to build images
– SuSE Kiwi
– Debootstrap
– Etc.
LXC tooling options often include numerous image templates
27. Linux Security Modules & MAC
Linux Security Modules (LSM) – kernel modules which provide a framework for
Mandatory Access Control (MAC) security implementations
MAC vs DAC
– In MAC, admin (user or process) assigns access controls to subject / initiator
• Most MAC implementations provide the notion of profiles
• Profiles define access restrictions and are said to “confine” a subject
– In DAC, resource owner (user) assigns access controls to individual resources
Existing LSM implementations include: AppArmor, SELinux, GRSEC, etc.
28. Linux Capabilities & Other Security Measures
Linux capabilities
– Per process privileges which define operational (sys call) access
– Typically checked based on process EUID and EGID
– Root processes (i.e. EUID = EGID = 0) bypass capability checks
Capabilities can be dropped from / assigned to LXC processes to restrict what they can do
Other LXC security mitigations
– Reduce shared FS access using RO bind mounts
– Keep Linux kernel up to date
– User namespaces in 3.8+ kernel
• Allow to launch containers as non-root user
• Map UID / GID inside / outside of container
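For example, with LXC (tools) a container's capability set can be trimmed via its config (the capability list is illustrative):
echo "lxc.cap.drop = sys_module sys_time mac_admin mac_override" >> /var/lib/lxc/CN/config
lxc-start -d -n CN    # processes in CN can no longer load kernel modules, set the clock, etc.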
30. LXC Tooling
LXC is not a kernel feature – it’s a technology enabled via kernel features
– User space tooling required to manage LXCs effectively
Numerous toolsets exist
– Then: add-on patches to upstream kernel due to slow kernel acceptance
– Now: upstream LXC feature support is growing – less need for patches
More popular GNU Linux toolsets include libvirt-lxc and lxc (tools)
– OpenVZ is likely the most mature toolset, but it requires kernel patches
– Note: I would consider docker a commoditization of LXC
Non-GNU Linux based LXC
– Solaris zones
– BSD jails
– Illumos / SmartOS (Solaris derivatives)
– Etc.
32. Libvirt-lxc
Perhaps the simplest to learn through a familiar virsh interface
Libvirt provides LXC support by connecting to lxc:///
Many virsh commands work
• virsh -c lxc:/// define sample.xml
• virsh -c lxc:/// start sample
• virsh -c lxc:/// console sample
• virsh -c lxc:/// shutdown sample
• virsh -c lxc:/// undefine sample
No snapshotting, templates…
OpenStack support since Grizzly
No VNC
No Cinder support in Grizzly
Config drive not supported
Alternative means of accessing metadata
Attached disk rather than http calls
<domain type='lxc'>
<name>sample</name>
<memory>32768</memory>
<os> <type>exe</type> <init>/init</init> </os>
<vcpu>1</vcpu>
<clock offset='utc'/>
<on_poweroff>destroy</on_poweroff>
<on_reboot>restart</on_reboot>
<on_crash>destroy</on_crash>
<devices>
<emulator>/usr/libexec/libvirt_lxc</emulator>
<filesystem type='mount'> <source dir='/opt/vm-1-root'/> <target dir='/'/> </filesystem>
<interface type='network'> <source network='default'/> </interface>
<console type='pty' />
</devices>
</domain>
33. LXC (tools)
A little more functionality
Supported by the major distributions
LXC 1.0 recently released
– Cloning supported: lxc-clone
– Templates… btrfs
– lxc-create -t ubuntu -n CN creates a new ubuntu container
• “template” is downloaded from Ubuntu
• Some support for Fedora <= 14
• Debian is supported
– lxc-start -d -n CN starts the container
– lxc-destroy -n CN destroys the container
– /etc/lxc/lxc.conf has default settings
– /var/lib/lxc/CN is the default place for each container
34. LXC Commoditization: docker
Young project with great vibrancy in the industry
Currently based on unmodified LXC – but the goal is to make it dirt easy
As of March 10th, 2014 at v0.9. Monthly releases, 1.0 should be ready for production use
What docker adds to LXC
– Portable deployment across machines
• In Cloud terms, think of LXC as the hypervisor and docker as the Open Virtualization Appliance (OVA) and the provision engine
• Docker images can run unchanged on any platform supporting docker
– Application-centric
• User facing function geared towards application deployment, not VM analogs [!]
– Automatic build
• Create containers from build files
• Builders can use chef, maven, puppet, etc.
– Versioning support
• Think of git for docker containers
• Only delta of base container is tracked
– Component re-use
• Any container can be used as a base, specialized and saved
– Sharing
• Support for public/private repositories of containers
– Tools
• CLI / REST API for interacting with docker
• Vendors adding tools daily
Docker containers are self contained – no more “dependency hell”
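A typical workflow sketch with the v0.9-era CLI (the image and repository names are hypothetical):
docker build -t myorg/webapp .           # automatic build: image built from a Dockerfile ("build file")
docker run -d -p 8080:80 myorg/webapp    # run it, mapping host port 8080 to the container's port 80
docker ps                                # list running containers
docker push myorg/webapp                 # sharing: publish the image to a (public or private) registry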
36. Docker: LXC Virtualization?
Docker decouples the LXC provider from the operations
– LXC provider agnostic
Docker “images” run anywhere docker is supported
– Portability
37. LXC Orchestration & Management
Docker & libvirt-lxc in OpenStack
– Manage containers heterogeneously with traditional VMs… but not w/the level of support
& features we might like
CoreOS
– Zero-touch admin Linux distro with docker images as the unit of operation
– Centralized key/value store to coordinate distributed environment
Various other 3rd party apps
– Maestro for docker
– Shipyard for docker
– Fleet for CoreOS
– Etc.
LXC migration
– Container migration via CRIU
But…
– Still no great way to tie all virtual resources together with LXC – e.g. storage + networking
• IMO; an area which needs focus for LXC to become more generally applicable
38. Docker in OpenStack
Introduced in Havana
– A nova driver to integrate with docker REST API
– A Glance translator to integrate containers with Glance
• A docker container which implements a docker registry API
The claim is that docker will become a “group A” hypervisor
– In its current form it’s effectively a “tech preview”
39. LXC Evaluation
Goal: validate the promise with an eye towards practical applicability
Dimensions evaluated:
– Runtime performance benefits
– Density / footprint
– Workload isolation
– Ease of use and tooling
– Cloud Integration
– Security
– Ease of use / feature set
NOTE: tests performed in a passive manner – deeper analysis warranted.
40. Runtime Performance Benefits - CPU
Tested using libvirt lxc on Ubuntu 13.10 using linpack 11.1
Cpuset was used to limit the number of CPUs that the containers could use
The performance overhead falls within the error of measurement of this test
Bare metal performance was actually lower than some container results
[Chart: linpack performance @ 45000; GFlops (0-250) vs. vcpus (1-32) and bare metal (BM). Annotated values: 220.77; bare metal 220.5 @ 32 vcpus; 220.9 @ 31 vcpus]
41. Runtime Performance Benefits – I/O
I/O Tests using libvirt lxc show a < 1 % degradation
Tested with a pass-through mount
Sync read I/O test: rw=write, size=1024m, bs=128mb, direct=1, sync=1
Sync write I/O test: rw=write, size=1024m, bs=128mb, direct=1, sync=1
I/O throughput (MB/s): lxc write 1711.2; bare metal write 1724.9; lxc read 1626.4; bare metal read 1633.4
43. Density & Footprint – libvirt-lxc
Starting 500 containers: Mon Nov 11 13:38:49 CST 2013 ... all threads done in 157 s (sequential, I/O bound)
Stopping 500 containers: Mon Nov 11 13:42:20 CST 2013 ... all threads done in 162 s
Active memory delta: 417.2 KB
Starting 1000 containers: Mon Nov 11 13:59:19 CST 2013 ... all threads done in 335 s
Stopping 1000 containers: Mon Nov 11 14:14:26 CST 2013 ... all threads done in 339 s
Active memory delta: 838.4 KB
Using libvirt lxc on RHEL 6.4, we found that empty container overhead was just ~840 bytes per container. A container could be started in about 330 ms; start-up was I/O bound.
This represents the lower limit of LXC footprint
Containers ran /bin/sh
44. Density & Footprint – Docker
In this test, we created 150 Docker containers with CentOS, started
apache & then removed them
Average footprint was ~10MB per container
Average start time was 240ms
Serially booting 150 containers which run apache
– Takes on average 36 seconds
– Consumes about 2 % of the CPU
– Negligible HDD space
– Spawns around 225 processes for create
– Around 1.5 GB of memory ~ 10 MB per container
– Expect faster results once docker addresses performance topics in the
next few months
Serially destroying 150 containers running apache
– On average takes 9 seconds
– We would expect destroy to be faster – likely a docker bug and will triage
with the docker community
[Charts: container creation and container deletion; I/O profile and CPU profile]
45. Workload Isolation: Examples
Using the blkio cgroup (lxc.cgroup.blkio.throttle.read_bps_device) to cap the I/O of a container
Both the total bps and iops_device on read / write could be capped
Better async BIO support in kernel 3.10+
We used fio with oflag=sync, direct to test the ability to cap the reads:
– With limit set to 6 MB / second
READ: io=131072KB, aggrb=6147KB/s, minb=6295KB/s, maxb=6295KB/s, mint=21320msec,
maxt=21320msec
– With limit set to 60 MB / second
READ: io=131072KB, aggrb=61134KB/s, minb=62601KB/s, maxb=62601KB/s,
mint=2144msec, maxt=2144msec
– No read limit
READ: io=131072KB, aggrb=84726KB/s, minb=86760KB/s, maxb=86760KB/s,
mint=1547msec, maxt=1547msec
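The cap itself is set by writing "major:minor bytes-per-second" into the container's blkio cgroup (the device 8:0 and cgroup path are illustrative):
echo "8:0 6291456" > /sys/fs/cgroup/blkio/lxc/CN/blkio.throttle.read_bps_device    # ~6 MB/s read cap
# or equivalently in the container's LXC config:
# lxc.cgroup.blkio.throttle.read_bps_device = 8:0 6291456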
47. Who’s Using LXC
Google app engine & infra is said to be using some form of LXC
RedHat OpenShift
dotCloud (now docker inc)
CloudFoundry (early versions)
Rackspace Cloud Databases
– Outperforms AWS (Xen) according to perf results
Parallels Virtuozzo (commercial product)
Etc..
48. LXC Gaps
There are gaps…
Lack of industry tooling / support
Live migration still a WIP
Full orchestration across resources (compute / storage / networking)
Security concerns / fears
Not a well known technology… yet
Integration with existing virtualization and Cloud tooling
Not much / any industry standards
Missing skillset
Slower upstream support due to kernel dev process
Etc.
49. LXC: Use Cases For Traditional VMs
There are still use cases where traditional VMs are warranted.
Virtualization of non-Linux OSs
– Windows
– AIX
– Etc.
LXC not supported on host
VM requires unique kernel setup which is not applicable to other VMs on the host
(i.e. per VM kernel config)
Etc.
50. LXC Recommendations
Public & private Clouds
– Increase VM density 2-3x
– Accommodate Big Data & HPC type applications
– Move the support of Linux distros to containers
PaaS & managed services
– Realize “as a Service” and managed services using LXC
Operations management
– Ease management + increase agility of bare metal components
DevOps
Development & test
– Sandboxes
– Dev / test envs
– Etc.
If you are just starting with LXC and don’t have an in-depth skillset
– Start with LXC for private solutions (trusted code)