2. Topics
1. Docker
1. Docker Container
2. Docker Key Concepts
3. Docker Internals
4. Docker Architecture Linux Vs. OS X
5. Docker Architecture Windows
6. Docker Architecture Linux (Docker Daemon and Client)
7. Anatomy of Dockerfile
8. Building a Docker Image
9. Creating and Running a Docker Container
10. Invoking Docker Container using Java ProcessBuilder
2. Linux Containers
3. Docker containers are Linux Containers
A Docker container combines three Linux building blocks – cgroups, namespaces, and images – and docker run plays the role that lxc-start plays for plain LXC.
CGROUPS
• Kernel feature
• Groups of processes
• Control resource allocation
• CPU, CPU sets
• Memory
• Disk
• Block I/O
IMAGES
• Not a file system
• Not a VHD
• Basically a tar file
• Has a hierarchy, with arbitrary depth
• Fits into the Docker Registry
NAMESPACES
• The real magic behind containers
• Create barriers between processes
• Different namespaces:
• PID namespace
• Net namespace
• IPC namespace
• MNT namespace
• Linux kernel namespaces introduced between kernels 2.6.15 – 2.6.26
4. Docker Key Concepts
• Docker images
• A Docker image is a read-only template.
• For example, an image could contain an Ubuntu operating system with Apache and your
web application installed.
• Images are used to create Docker containers.
• Docker provides a simple way to build new images or update existing images, or you can
download Docker images that other people have already created.
• Docker images are the build component of Docker.
• Docker containers
• Docker containers are similar to a directory.
• A Docker container holds everything that is needed for an application to run.
• Each container is created from a Docker image.
• Docker containers can be run, started, stopped, moved, and deleted.
• Each container is an isolated and secure application platform.
• Docker containers are the run component of Docker.
• Docker Registries
• Docker registries hold images.
• These are public or private stores from which you upload or download images.
• The public Docker registry is called Docker Hub.
• It provides a huge collection of existing images for your use.
• These can be images you create yourself or you can use images that others have
previously created.
• Docker registries are the distribution component of Docker.
5. Docker Architecture Linux Vs. OS X
• In an OS X installation, the docker daemon is running inside a Linux virtual
machine provided by Boot2Docker.
• In OS X, the Docker host address is the address of the Linux VM. When you start
the boot2docker process, the VM is assigned an IP address. Under boot2docker,
ports on a container map to ports on the VM.
6. Docker – Somewhere in the Future…
Docker running natively on Windows!
7. Docker Architecture – Linux
• Docker Daemon
• The Docker daemon does the heavy lifting of
building, running, and distributing your Docker
containers.
• Both the Docker client and the daemon can run on
the same system, or you can connect a Docker client
to a remote Docker daemon.
• The Docker client and daemon communicate via
sockets or through a RESTful API.
• Docker Client (docker) Commands
• search (Search for images in the Docker Repository)
• pull (Pull an image)
• run (Run a container)
• create (Create a container)
• build (Build an image using a Dockerfile)
• images (Show images)
• push (Push an image to the Docker Repository)
• import / export
• start (Start a stopped container)
• stop (Stop a container)
• restart (Restart a container)
• save (Save an image to a tar archive)
• exec (Run a command in a running container)
• top (Look at the running processes in a container)
• ps (List the containers)
• attach (Attach to a running container)
• diff (Inspect changes to a container's file system)
Examples:
$ docker search applifire
$ docker pull applifire/jdk:7
$ docker images
$ docker run -it applifire/jdk:7 /bin/bash
8. Docker client examples
• Searching the Docker registry for images.
• Images in your local registry after a build, or pulled directly from the Docker registry.
• To pull an image: docker pull applifire/tomcat
9. Analyzing the "docker run -it ubuntu /bin/bash" command
In order, Docker does the following:
1. Pulls the ubuntu image:
• Docker checks for the presence of the ubuntu image and, if it doesn't exist
locally on the host, downloads it from Docker Hub.
• If the image already exists locally, Docker uses it for the new container.
2. Creates a new container:
• Once Docker has the image, it uses it to create a container.
3. Allocates a filesystem and mounts a read-write layer:
• The container is created in the file system and a read-write layer is added to
the image.
4. Allocates a network / bridge interface:
• Creates a network interface that allows the Docker container to talk to the
local host.
5. Sets up an IP address:
• Finds and attaches an available IP address from a pool.
6. Executes the process that you specify:
• Runs your application.
7. Captures and provides application output:
• Connects and logs standard input, output, and errors for you to see how
your application is running.
10. Anatomy of a Dockerfile
Command Description Example
FROM
The FROM instruction sets the Base Image for subsequent instructions. As
such, a valid Dockerfile must have FROM as its first instruction. The image
can be any valid image – it is especially easy to start by pulling an image
from the Public repositories
FROM ubuntu
FROM applifire/jdk:7
MAINTAINER
The MAINTAINER instruction allows you to set the Author field of the
generated images.
MAINTAINER arafkarsh
LABEL
The LABEL instruction adds metadata to an image. A LABEL is a key-value
pair. To include spaces within a LABEL value, use quotes and backslashes as
you would in command-line parsing.
LABEL version="1.0"
LABEL vendor="Algo"
RUN
The RUN instruction will execute any commands in a new layer on top of the
current image and commit the results. The resulting committed image will
be used for the next step in the Dockerfile.
RUN apt-get install -y curl
ADD
The ADD instruction copies new files, directories or remote file URLs from
<src> and adds them to the filesystem of the container at the path <dest>.
ADD hom* /mydir/
ADD hom?.txt /mydir/
COPY
The COPY instruction copies new files or directories from <src> and adds
them to the filesystem of the container at the path <dest>.
COPY hom* /mydir/
COPY hom?.txt /mydir/
ENV
The ENV instruction sets the environment variable <key> to the value
<value>. This value will be in the environment of all "descendent" Dockerfile
commands and can be replaced inline in many as well.
ENV JAVA_HOME /JDK8
ENV JRE_HOME /JRE8
EXPOSE
The EXPOSE instruction informs Docker that the container will listen on the
specified network ports at runtime. Docker uses this information to
interconnect containers using links and to determine which ports to expose
to the host when using the -P flag with the docker client.
EXPOSE 8080
11. Anatomy of a Dockerfile
Command Description Example
VOLUME
The VOLUME instruction creates a mount point with the specified
name and marks it as holding externally mounted volumes from the
native host or other containers. The value can be a JSON array,
VOLUME ["/var/log/"], or a plain string with multiple arguments,
such as VOLUME /var/log or VOLUME /var/log /var/db.
VOLUME /data/webapps
USER
The USER instruction sets the user name or UID to use when running
the image and for any RUN, CMD and ENTRYPOINT instructions that
follow it in the Dockerfile.
USER applifire
WORKDIR
The WORKDIR instruction sets the working directory for any RUN,
CMD, ENTRYPOINT, COPY and ADD instructions that follow it in the
Dockerfile.
WORKDIR /home/user
CMD
There can only be one CMD instruction in a Dockerfile. If you list
more than one CMD then only the last CMD will take effect.
The main purpose of a CMD is to provide defaults for an executing
container. These defaults can include an executable, or they can omit
the executable, in which case you must specify an ENTRYPOINT
instruction as well.
CMD echo "This is a test." | wc -
ENTRYPOINT
An ENTRYPOINT allows you to configure a container that will run as
an executable. Command line arguments to docker run <image> will
be appended after all elements in an exec form ENTRYPOINT, and will
override all elements specified using CMD. This allows arguments to
be passed to the entry point, i.e., docker run <image> -d will pass the
-d argument to the entry point. You can override the ENTRYPOINT
instruction using the docker run --entrypoint flag.
ENTRYPOINT ["top", "-b"]
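Taken together, the instructions in these two tables compose into a complete file. A minimal sketch follows; the package names, paths, and layering order are illustrative assumptions, not taken from the deck:

```dockerfile
# Illustrative Dockerfile combining the instructions described above.
# All names and paths here are assumptions for the sketch.
FROM ubuntu
MAINTAINER arafkarsh
LABEL version="1.0"
RUN apt-get update && apt-get install -y curl
ENV JAVA_HOME /JDK8
COPY webapp/ /opt/webapp/
WORKDIR /opt/webapp
VOLUME /data/webapps
EXPOSE 8080
USER applifire
ENTRYPOINT ["top", "-b"]
```

Built with docker build -t <name> ., each instruction above is committed as one image layer.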
12. Building a Docker image : Base Ubuntu
• Dockerfile (Text File)
• Create the Dockerfile
• Build image using Dockerfile
• The following command builds the Docker image based on the Dockerfile:
docker build -t applifire/ubuntu .
• Docker will automatically download any required base images from the Docker registry.
• This builds a base Ubuntu image with enough Linux utilities for the development environment.
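The slide references a Dockerfile without reproducing its text; a plausible sketch of such a base image follows (the utility list is an assumption — the deck only says "enough Linux utilities"):

```dockerfile
# Hypothetical Dockerfile for the applifire/ubuntu base image.
FROM ubuntu
MAINTAINER arafkarsh
# Utility selection is assumed, not taken from the deck
RUN apt-get update && apt-get install -y curl wget vim unzip tar
```

Then build it with: docker build -t applifire/ubuntu .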
13. Building a Docker Image : Java 8 (JRE) + Tomcat 8
• Dockerfile (Text File)
1. Create the Java (JRE8) Dockerfile with Ubuntu as the base image.
2. Create the Tomcat Dockerfile with JRE8 as the base image.
• Build image using Dockerfile
1. Build the Java 8 (JRE) Docker image:
docker build -t applifire/jre:8 .
2. Build the Tomcat 8 Docker image:
docker build -t applifire/tomcat:jre8 .
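A sketch of step 2, the Tomcat layer on top of the JRE image (the archive name, paths, and CMD are assumptions; the deck shows only the build commands):

```dockerfile
# Hypothetical Dockerfile for applifire/tomcat:jre8
FROM applifire/jre:8
# ADD auto-extracts a local tar archive into the target directory
ADD apache-tomcat-8.tar.gz /opt/
ENV CATALINA_HOME /opt/apache-tomcat-8
EXPOSE 8080
# Run Tomcat in the foreground so the container stays up
CMD ["sh", "-c", "$CATALINA_HOME/bin/catalina.sh run"]
```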
14. Building a Docker Image : Java 7 (JDK) + Gradle 2.3
• Dockerfile (Text File)
1. Create the Java (JDK7) Dockerfile with Ubuntu as the base image.
2. Create the Gradle Dockerfile with Java (JDK7) as the base image.
• Build image using Dockerfile
1. Build the Java 7 (JDK) Docker image:
docker build -t applifire/jdk:7 .
2. Build the Gradle 2.3 Docker image:
docker build -t applifire/gradle:jdk7 .
15. Creating & Running Docker Containers
docker run Example
-d Detached mode. To run servers like Tomcat or the Apache Web Server.
-p Publish a container's port:
IP:HostPort:ContainerPort 192.a.b.c:80:8080
IP::ContainerPort 192.a.b.c::8080
HostPort:ContainerPort 8081:8080
-it Run in interactive mode. When you want to log into the container. This mode works fine from a Unix shell; however, ensure that you don't use it when running Docker through the ProcessBuilder in Java.
-v Mount the host file system: -v host-file-system:container-file-system
--name Assign a name to the container
-w Set the working directory for the container
-u Set the user name with which you can log into the container
16. Creating & Running Docker Containers - Advanced
docker run Example
--cpuset="" CPUs in which to allow execution: 0-3, 0,1
-m Memory limit for the container. Number & unit (b, k, m, g): 1g = 1GB, 1m = 1MB, 1k = 1KB
--memory-swap Total memory usage (memory + swap space). Number & unit (b, k, m, g): 1g = 1GB
-e Set environment variables
--link=[] Link another container. For example, when your Tomcat container wants to talk to a MySQL DB container.
--ipc Inter-process communication
--dns Set custom DNS servers
--dns-search Set custom DNS search domains
-h Container host name: --hostname="voldermort"
--expose=[] Expose a container port or a range of ports: --expose=8080-8090
--add-host Add a custom host-to-IP mapping: Host:IP
17. Docker Container Process Management
docker ps
-a Show all containers. Only running containers are shown by default.
-q Only display the numeric IDs.
-s Display the total file sizes.
-f Provide filters to show containers: -f status=exited, -f exited=100
-l Show only the latest container.
docker start
Starts a stopped container, for example a Tomcat server.
Ex. docker start containerName
docker stop
Stops a container. Start and stop are mainly used for detached containers like Tomcat, MySQL, and Apache Web Server containers.
Ex. docker stop containerName
docker restart
Restarts a container.
Ex. docker restart containerName
18. Docker Container Management – Short Cuts
Remove all exited containers
docker rm containerId / name Removes the exited container
docker rm $(docker ps -aq) docker ps -aq returns the IDs of all containers; docker rm then removes those in the exited state (it will fail on any that are still running).
docker stop To remove a running container, you need to stop the container first. Ex. a Tomcat server running: docker stop containerName
Remove a Docker image
docker rmi imageId Removes the Docker image
Remove all Docker images with the <none> tag
docker rmi $(docker images | grep "^<none>" | tr -s ' ' | awk -F ' ' '{print $3}')*
* This command can be made even better…
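The footnote invites improvement. The text-processing half of that pipeline can be exercised on canned output, with no Docker daemon needed; the sample listing below is fabricated for the sketch:

```shell
# Hypothetical sample of `docker images` output (not from a live daemon),
# used to exercise the same grep/tr/awk pipeline as the slide.
sample='REPOSITORY          TAG      IMAGE ID            CREATED
<none>              <none>   abc123def456        2 days ago
applifire/jdk       7        789abc012def        3 days ago'

# Same filtering logic: keep <none> rows, squeeze spaces, print the ID column
dangling_ids=$(printf '%s\n' "$sample" | grep "^<none>" | tr -s ' ' | awk -F ' ' '{print $3}')
echo "$dangling_ids"
```

On a live host, the same result is available more directly via the dangling filter: docker rmi $(docker images -f dangling=true -q).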
19. Invoking Docker Container using Java ProcessBuilder API
When you execute a docker command using the Java ProcessBuilder API, never use run with -it (interactive terminal); it will block the container from exiting, unless you actually want an interactive session.
Ex. docker run applifire/maven:jdk7 pom.xml
If you are using a shell script to invoke the docker container, then refer to the following to handle Linux and OS X environments:
• Boot2Docker settings for OS X
• $? to get the exit code of the previous command
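The $?-based handling the slide describes can be sketched with a plain command standing in for the docker run invocation (docker itself is assumed unavailable here, so a stub command supplies the exit status):

```shell
# Stand-in for: docker run applifire/maven:jdk7 ... (note: no -it!)
# Any non-interactive command reports its status through $?.
/bin/sh -c 'exit 3'
status=$?
echo "container exited with status: $status"
```

A wrapper script (or Java's Process.waitFor()) would inspect this status to decide whether the containerized build succeeded.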
21. What's a Linux Container?
Linux Containers (LXC, for LinuX Containers) are
lightweight virtual machines (VMs)
which are realized using features provided by a modern Linux kernel –
VMs without the hypervisor.
Containerization of:
(Linux) operating systems
Single or multiple applications (Tomcat, MySQL DB, etc.)
42. Linux namespaces & cgroups: Availability
Note: user namespace support is in upstream kernel 3.8+, but distributions are
rolling out phased support:
- Map LXC UID/GID between container and host
- Non-root LXC creation
71. Thank You…..
Araf Karsh Hamid
(araf.karsh@gmail.com)
Thanks to :
Boden Russell – IBM Technology Services
(brussell@us.ibm.com)
For the fantastic presentation on LinuX Containers
Editor's Notes
Memory
You can limit the amount of RAM and swap space that can be used by a group of processes. It accounts for the memory used by the processes for their private use (their Resident Set Size, or RSS), but also for the memory used for caching purposes.
This is actually quite powerful, because traditional tools (ps, analysis of /proc, etc.) have no way to find out the cache memory usage incurred by specific processes. This can make a big difference, for instance, with databases.
A database will typically use very little memory for its processes (unless you do complex queries, but let’s pretend you don’t!), but can be a huge consumer of cache memory: after all, to perform optimally, your whole database (or at least, your “active set” of data that you refer to the most often) should fit into memory.
Limiting the memory available to the processes inside a cgroup is as easy as echo 1000000000 > /cgroup/polkadot/memory.limit_in_bytes (it will be rounded to a page size).
To check the current usage for a cgroup, inspect the pseudo-file memory.usage_in_bytes in the cgroup directory. You can gather very detailed (and very useful) information from memory.stat; the data contained in this file could justify a whole blog post by itself!
CPU
You might already be familiar with scheduler priorities, and with the nice and renice commands. Once again, control groups will let you define the amount of CPU that should be shared by a group of processes, instead of a single one. You can give each cgroup a relative number of CPU shares, and the kernel will make sure that each group of processes gets access to the CPU in proportion to the number of shares you gave it.
Setting the number of shares is as simple as echo 250 > /cgroup/polkadot/cpu.shares. Remember that those shares are just relative numbers: if you multiply everyone's share by 10, the end result will be exactly the same. This control group also gives statistics in cpu.stat.
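The relative nature of shares can be checked with quick arithmetic. Assuming two cgroups with 250 and 1000 shares and both fully loaded, the split under contention is:

```shell
# cpu.shares are relative: A gets a/(a+b) of the CPU when both contend.
shares_report=$(awk 'BEGIN {
  a = 250; b = 1000                 # cpu.shares of cgroups A and B
  printf "A: %.0f%%  B: %.0f%%", 100*a/(a+b), 100*b/(a+b)
}')
echo "$shares_report"
```

Multiplying both values by 10 (2500 and 10000) leaves the percentages unchanged, which is exactly the point made above.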
CPU sets
This is different from the cpu controller. In systems with multiple CPUs (i.e., the vast majority of servers, desktop & laptop computers, and even phones today!), the cpuset control group lets you define which processes can use which CPU.
This can be useful to reserve a full CPU to a given process or group of processes. Those processes will receive a fixed amount of CPU cycles, and they might also run faster because there will be less thrashing at the level of the CPU cache.
On systems with Non Uniform Memory Access (NUMA), the memory is split in multiple banks, and each bank is tied to a specific CPU (or set of CPUs); so binding a process (or group of processes) to a specific CPU (or a specific group of CPUs) can also reduce the overhead happening when a process is scheduled to run on a CPU, but accessing RAM tied to another CPU.
Block I/O
The blkio controller gives a lot of information about the disk accesses (technically, block devices requests) performed by a group of processes. This is very useful, because I/O resources are much harder to share than CPU or RAM.
A system has a given, known, and fixed amount of RAM. It has a fixed number of CPU cycles every second – and even on systems where the number of CPU cycles can change (tickless systems, or virtual machines), it is not an issue, because the kernel will slice the CPU time in shares of e.g. 1 millisecond, and there is a given, known, and fixed number of milliseconds every second (doh!). I/O bandwidth, however, is quite unpredictable. Or rather, as we will see, it is predictable, but the prediction isn’t very useful.
A hard disk with a 10ms average seek time will be able to do about 100 requests of 4 KB per second; but if the requests are sequential, typical desktop hard drives can easily sustain 80 MB/s transfer rates – which means 20000 requests of 4 kB per second.
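The two extremes quoted here follow from simple arithmetic (the 20000 figure in the text is a round number for 80 MB/s divided into 4 KB requests):

```shell
# Random I/O bound by seek time; sequential I/O bound by transfer rate.
iops=$(awk 'BEGIN { printf "%d", 1000 / 10 }')        # 10 ms avg seek -> req/s
seq_reqs=$(awk 'BEGIN { printf "%d", 80 * 1024 / 4 }') # 80 MB/s in 4 KB requests
echo "random: $iops req/s, sequential: $seq_reqs req/s"
```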
The average throughput (measured in IOPS, I/O Operations Per Second) will be somewhere between those two extremes. But as soon as some application performs a task requiring a lot of scattered, random I/O operations, the performance will drop – dramatically. The system does give you some guaranteed performance, but this guaranteed performance is so low, that it doesn’t help much (that’s exactly the problem of AWS EBS, by the way). It’s like a highway with an anti-traffic jam system that would guarantee that you can always go above a given speed, except that this speed is 5 mph. Not very helpful, is it?
That’s why SSD storage is becoming increasingly popular. SSD has virtually no seek time, and can therefore sustain random I/O as fast as sequential I/O. The available throughput is therefore predictably good, under any given load.
(Actually, there are some workloads that can cause problems; for instance, if you continuously write and rewrite a whole disk, you will find that the performance will drop dramatically. This is because read and write operations are fast, but erase, which must be performed at some point before write, is slow. This won't be a problem in most situations. An example use-case which could exhibit the issue would be to use SSD to do catch-up TV for 100 HD channels simultaneously: the disk will sustain the write throughput until it has written every block once; then it will need to erase, and performance will drop below acceptable levels.)
To get back to the topic – what’s the purpose of the blkio controller in a PaaS environment like dotCloud?
The blkio controller metrics will help detect applications that are putting an excessive strain on the I/O subsystem. The controller then lets you set limits, which can be expressed in number of operations and/or bytes per second, and it allows different limits for read and write operations. You can set some safeguard limits (to make sure that a single app won't significantly degrade performance for everyone), and once an I/O-hungry app has been identified, its quota can be adapted to reduce impact on other apps.
The pid namespace
This is probably the most useful for basic isolation.
Each pid namespace has its own process numbering. Different pid namespaces form a hierarchy: the kernel keeps track of which namespace created which other. A “parent” namespace can see its children namespaces, and it can affect them (for instance, with signals); but a child namespace cannot do anything to its parent namespace. As a consequence:
each pid namespace has its own “PID 1” init-like process;
processes living in a namespace cannot affect processes living in parent or sibling namespaces with system calls like kill or ptrace, since process ids are meaningful only inside a given namespace;
if a pseudo-filesystem like proc is mounted by a process within a pid namespace, it will only show the processes belonging to the namespace;
since the numbering is different in each namespace, it means that a process in a child namespace will have multiple PIDs: one in its own namespace, and a different PID in its parent namespace.
The last item means that from the top-level pid namespace, you will be able to see all processes running in all namespaces, but with different PIDs. Of course, a process can have more than 2 PIDs if there are more than two levels of hierarchy in the namespaces.
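On a Linux host, a process's namespace membership is visible under /proc; this sketch assumes a Linux /proc filesystem is available. Two processes in the same pid namespace show identical IDs in this link:

```shell
# The pid namespace of the current shell, e.g. pid:[4026531836]
pidns=$(readlink /proc/self/ns/pid)
echo "$pidns"
```

Comparing /proc/<pid>/ns/pid links across processes is a quick way to tell which ones share a pid namespace.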
The net namespace
With the pid namespace, you can start processes in multiple isolated environments (let’s bite the bullet and call them “containers” once and for all). But if you want to run e.g. a different Apache in each container, you will have a problem: there can be only one process listening to port 80/tcp at a time. You could configure your instances of Apache to listen on different ports… or you could use the net namespace.
As its name implies, the net namespace is about networking. Each different net namespace can have different network interfaces. Even lo, the loopback interface supporting 127.0.0.1, will be different in each different net namespace.
It is possible to create pairs of special interfaces, which will appear in two different net namespaces, and allow a net namespace to talk to the outside world.
A typical container will have its own loopback interface (lo), as well as one end of such a special interface, generally named eth0. The other end of the special interface will be in the “original” namespace, and will bear a poetic name like veth42xyz0. It is then possible to put those special interfaces together within an Ethernet bridge (to achieve switching between containers), or route packets between them, etc. (If you are familiar with the Xen networking model, this is probably no news to you!)
Note that each net namespace has its own meaning for INADDR_ANY, a.k.a. 0.0.0.0; so when your Apache process binds to *:80 within its namespace, it will only receive connections directed to the IP addresses and interfaces of its namespace – thus allowing you, at the end of the day, to run multiple Apache instances, with their default configuration listening on port 80.
In case you were wondering: each net namespace has its own routing table, but also its own iptables chains and rules.
The ipc namespace
This one won’t appeal a lot to you; unless you passed your UNIX 101 a long time ago, when they still taught about IPC (InterProcess Communication)!
IPC provides semaphores, message queues, and shared memory segments.
While still supported by virtually all UNIX flavors, those features are considered by many people as obsolete, and superseded by POSIX semaphores, POSIX message queues, and mmap. Nonetheless, some programs – including PostgreSQL – still use IPC.
What's the connection with namespaces? Well, each IPC resource is accessed through a unique 32-bit ID. IPC implements permissions on resources, but nonetheless, one application could be surprised if it failed to access a given resource because the resource has already been claimed by another process in a different container.
Enter the ipc namespace: processes within a given ipc namespace cannot access (or even see at all) IPC resources living in other ipc namespaces. And now you can safely run a PostgreSQL instance in each container without fearing IPC key collisions!
The mnt namespace
You might already be familiar with chroot, a mechanism that sandboxes a process (and its children) within a given directory. The mnt namespace takes that concept one step further.
As its name implies, the mnt namespace deals with mountpoints.
Processes living in different mnt namespaces can see different sets of mounted filesystems – and different root directories. If a filesystem is mounted in a mnt namespace, it will be accessible only to those processes within that namespace; it will remain invisible for processes in other namespaces.
At first glance, it sounds useful, since it allows you to sandbox each container within its own directory, hiding the other containers.
At a second glance, is it really that useful? After all, if each container is chroot‘ed in a different directory, container C1 won’t be able to access or see the filesystem of container C2, right? Well, that’s right, but there are side effects.
Inspecting /proc/mounts in a container will show the mountpoints of all containers. Also, those mountpoints will be relative to the original namespace, which can give some hints about the layout of your system – and maybe confuse some applications which would rely on the paths in /proc/mounts.
The mnt namespace makes the situation much cleaner, allowing each container to have its own mountpoints, and see only those mountpoints, with their path correctly translated to the actual root of the namespace.
The uts namespace
Finally, the uts namespace deals with one little detail: the hostname that will be “seen” by a group of processes.
Each uts namespace will hold a different hostname, and changing the hostname (through the sethostname system call) will only change it for processes running in the same namespace.