This article explains how containers work at a low level, for people who usually treat Docker as a black box.
To do this, we take the approach of implementing and running a container from scratch in Go. The basic principle of containers is surprisingly simple: the final code in this article is just 60 lines long.
The completed code can be found in the GitHub repository.
The following diagram is often used to illustrate the difference between a container and a virtual machine (VM). (Quoted from Docker official website)
The biggest difference between VMs and containers is that a container does not boot a guest OS when it is created. All containers exist as processes running on the same host OS.
Of course, ordinary processes share resources such as files with other processes and depend heavily on their environment. Therefore, features of the **Linux kernel** such as chroot and namespaces are used to run a process in a logically isolated state. This **isolated process** is what we call a container.
The kernel is literally the core of the OS. If you think of a Linux machine as the three-layer structure below, the kernel sits right in the middle.
- **Hardware**: Physical devices such as memory and CPU
- **Linux kernel**
- **User process**: Almost all programs such as shells and editors
The kernel has the privilege to manipulate the hardware directly, and it handles tasks such as memory management, process management, and device drivers.
User processes, on the other hand, have severely restricted access to hardware. To perform file operations, create processes, and so on, they must ask the kernel to do the work through **system calls**.
A program that creates a container also makes heavy use of system calls to take advantage of chroot, namespaces, and so on.
When making system calls from Go code, it is standard to use the official package golang.org/x/sys.
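To make the idea of "asking the kernel through system calls" concrete, here is a minimal sketch that writes and reads a file by calling open, write, read, and close directly. As an assumption made to keep the example free of external dependencies, it uses the standard library's syscall package instead of golang.org/x/sys; the helper names and the scratch file path are hypothetical.

```go
package main

import (
	"fmt"
	"syscall"
)

// writeFile creates (or truncates) path and writes data to it using the
// open(2), write(2), and close(2) system calls directly.
func writeFile(path string, data []byte) error {
	fd, err := syscall.Open(path, syscall.O_WRONLY|syscall.O_CREAT|syscall.O_TRUNC, 0644)
	if err != nil {
		return fmt.Errorf("open %s: %w", path, err)
	}
	defer syscall.Close(fd)

	// write(2) may write fewer bytes than requested, so loop until done.
	for len(data) > 0 {
		n, err := syscall.Write(fd, data)
		if err != nil {
			return fmt.Errorf("write %s: %w", path, err)
		}
		data = data[n:]
	}
	return nil
}

// readFile reads at most limit bytes from path with open(2) and read(2).
func readFile(path string, limit int) ([]byte, error) {
	fd, err := syscall.Open(path, syscall.O_RDONLY, 0)
	if err != nil {
		return nil, fmt.Errorf("open %s: %w", path, err)
	}
	defer syscall.Close(fd)

	buf := make([]byte, limit)
	n, err := syscall.Read(fd, buf)
	if err != nil {
		return nil, fmt.Errorf("read %s: %w", path, err)
	}
	return buf[:n], nil
}

func main() {
	const path = "/tmp/syscall-demo.txt" // hypothetical scratch file
	if err := writeFile(path, []byte("written via system calls\n")); err != nil {
		panic(err)
	}
	data, err := readFile(path, 64)
	if err != nil {
		panic(err)
	}
	fmt.Printf("pid %d read back: %s", syscall.Getpid(), data)
}
```

In everyday Go code you would normally use os.WriteFile and os.ReadFile, which end up making the same kind of requests to the kernel on your behalf.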
From here on, we will actually create a container with a Go program.
You need a Linux environment with the Go compiler installed to run the code. You can use the docker-compose.yml file included in the GitHub repository to try it out right away without the hassle of setting up an environment.
$ git clone
$ cd minimum-container
$ docker-compose run app
root@linux-env:/work_dir# go run main.go run sh
chroot
chroot changes the root directory of the currently running process (and its child processes). It is commonly called a **chroot jail** because the process can no longer access, or even see, anything above that directory.
The chroot branch of the GitHub repository (https://github.com/swen128/minimum-container/blob/chroot/main.go) contains an example of container-like code that uses chroot. It changes the root directory to ./rootfs and then executes the given arguments as a command.
main.go
// execute runs cmd with the given args in an isolated process.
func execute(cmd string, args ...string) {
	// Set both the root directory and the current directory to ./rootfs.
	if err := unix.Chroot("./rootfs"); err != nil {
		panic(err)
	}
	if err := unix.Chdir("/"); err != nil {
		panic(err)
	}
	command := exec.Command(cmd, args...)
	command.Stdin = os.Stdin
	command.Stdout = os.Stdout
	command.Stderr = os.Stderr
	if err := command.Run(); err != nil {
		panic(err)
	}
}
If you try running this main.go right away, you should get the following error:
$ go run main.go run sh
panic: exec: "sh": executable file not found in $PATH
This error occurs because there are no files in ./rootfs yet. After chroot, the root directory seen by the container is almost empty, so there is no sh binary to run.
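The reason for the error can be checked from the host before running the container. The following is a rough sketch, not part of the repository's code: it lists the host-side paths where a command would have to exist for a lookup inside the chroot to succeed. The function names and the search directories (/bin, /usr/bin) are assumptions made for illustration.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// candidatePaths returns the host-side paths at which cmd would have to
// exist for a lookup in pathDirs to succeed after chroot into rootfs.
func candidatePaths(rootfs, cmd string, pathDirs []string) []string {
	paths := make([]string, 0, len(pathDirs))
	for _, dir := range pathDirs {
		paths = append(paths, filepath.Join(rootfs, dir, cmd))
	}
	return paths
}

// findInRootfs reports the first candidate that is a regular file with an
// execute bit set. Absolute symlinks are resolved against the host root
// here, so treat this as a rough pre-check only.
func findInRootfs(rootfs, cmd string, pathDirs []string) (string, bool) {
	for _, p := range candidatePaths(rootfs, cmd, pathDirs) {
		info, err := os.Stat(p)
		if err == nil && info.Mode().IsRegular() && info.Mode().Perm()&0111 != 0 {
			return p, true
		}
	}
	return "", false
}

func main() {
	dirs := []string{"/bin", "/usr/bin"} // assumed search path inside the container
	if p, ok := findInRootfs("./rootfs", "sh", dirs); ok {
		fmt.Println("found:", p)
	} else {
		fmt.Println("sh is missing from ./rootfs; extract an image into it first")
	}
}
```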
That's where docker export comes in handy. You can extract all the files contained in any Docker image into ./rootfs by running the command below.
$ docker export $(docker create <image>) | tar -C rootfs -xvf -
Let's run the container again with the files prepared in ./rootfs. Try using the ls command or creating a file to confirm that the / directory inside the container corresponds to the rootfs directory on the host.
root@linux-env:/work_dir# go run main.go run sh
/ # ls /
bin dev etc home proc root sys tmp usr var
/ # touch /tmp/hoge
/ # exit
root@linux-env:/work_dir# ls rootfs/tmp
hoge
namespace
Linux namespaces are a feature that isolates various resources, such as file system mounts and PIDs.
To understand why this feature is needed, let's run the ps command inside the container-like environment we created in the previous section.
root@linux-host:/work_dir# go run main.go run ps
PID USER TIME COMMAND
You should see no results. The cause is that the ps command reads the /proc directory. Normally, a special pseudo file system that exposes process information and the like is mounted on /proc, but because the root directory has been changed in our container-like environment, /proc is still empty.
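To see why an empty /proc leaves ps with nothing to show, here is a simplified sketch of the basic idea: under a mounted proc file system, every process has a directory named after its PID, so listing the numeric directory names yields the visible processes. A real ps reads further files beneath each of those directories; the function names here are hypothetical.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
)

// pidsFromNames keeps only the names that look like process IDs.
// Under /proc, every running process has a directory named after its PID.
func pidsFromNames(names []string) []int {
	var pids []int
	for _, name := range names {
		if pid, err := strconv.Atoi(name); err == nil && pid > 0 {
			pids = append(pids, pid)
		}
	}
	return pids
}

// listPIDs reads procDir (normally "/proc") and returns the PIDs found there.
func listPIDs(procDir string) ([]int, error) {
	entries, err := os.ReadDir(procDir)
	if err != nil {
		return nil, err
	}
	names := make([]string, 0, len(entries))
	for _, e := range entries {
		if e.IsDir() {
			names = append(names, e.Name())
		}
	}
	return pidsFromNames(names), nil
}

func main() {
	pids, err := listPIDs("/proc")
	if err != nil {
		fmt.Println("cannot read /proc:", err)
		return
	}
	fmt.Printf("%d processes visible: %v\n", len(pids), pids)
}
```

Run inside the chroot before /proc is mounted, a program like this would find no numeric directories at all, which matches the empty ps output.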
Try mounting /proc first and then running ps again.
root@linux-host:/work_dir# go run main.go run sh
/ # mount proc /proc -t proc
/ # ps
PID USER TIME COMMAND
1 root 0:00 bash
100 root 0:00 go run main.go run sh
154 root 0:00 /tmp/go-build474892034/b001/exe/main run sh
160 root 0:00 sh
163 root 0:00 ps
There are two problems here. One is that you can see the processes running outside the container (PID 1, 100, 154), and the other is that the mounts you set inside the container are reflected on the host. This is not enough isolation from the external environment.
root@linux-host:/work_dir# cat /proc/mounts | grep proc
proc /proc proc rw,nosuid,nodev,noexec,relatime 0 0
proc /work_dir/rootfs/proc proc rw,relatime 0 0 <-Proc mount added in the container
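The same check can be done programmatically. The sketch below parses text in the /proc/mounts format (device, mount point, file system type, options, and two numeric fields per line) and returns the mount points of a given type. The function name is hypothetical, and the sample input is copied from the output above.

```go
package main

import (
	"fmt"
	"strings"
)

// mountPointsOfType scans text in /proc/mounts format and returns the
// mount points whose file system type equals fstype.
func mountPointsOfType(mounts, fstype string) []string {
	var points []string
	for _, line := range strings.Split(mounts, "\n") {
		fields := strings.Fields(line)
		// fields[1] is the mount point, fields[2] the file system type.
		if len(fields) >= 3 && fields[2] == fstype {
			points = append(points, fields[1])
		}
	}
	return points
}

func main() {
	// Sample input copied from the grep output shown in the article.
	sample := "proc /proc proc rw,nosuid,nodev,noexec,relatime 0 0\n" +
		"proc /work_dir/rootfs/proc proc rw,relatime 0 0\n"
	for _, p := range mountPointsOfType(sample, "proc") {
		fmt.Println(p)
	}
}
```

On a real host you could pass the contents of /proc/mounts instead of the sample string; any proc mount under rootfs that shows up there has leaked out of the container.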
Linux namespaces let you give each process its own namespace for a given kind of resource. A process cannot see or manipulate resources that belong to a different namespace, which solves the problems described above.
At the time of writing, there are eight types of Linux namespaces, each selected by a flag passed to system calls such as clone, setns, and unshare.
Namespace | Flag | Isolated resource |
---|---|---|
Mount | CLONE_NEWNS | File system mount point |
PID | CLONE_NEWPID | PID |
UTS | CLONE_NEWUTS | hostname |
Network | CLONE_NEWNET | Network devices, ports, etc. |
Time | CLONE_NEWTIME | Time obtained with clock_gettime (monotonic, boot) |
IPC | CLONE_NEWIPC | Interprocess communication |
Cgroup | CLONE_NEWCGROUP | cgroup root directory |
User | CLONE_NEWUSER | UID, GID |
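Each flag in the table is a single bit, so several namespaces are requested at once by combining flags with a bitwise OR. The following sketch decodes such a mask back into namespace names. The numeric values are those defined in the Linux header linux/sched.h; they are spelled out here as local constants (an assumption made so the example does not depend on a particular Go package), and the function name is hypothetical.

```go
package main

import "fmt"

// Flag values as defined in the Linux header <linux/sched.h>.
const (
	cloneNewTime   = 0x00000080
	cloneNewNS     = 0x00020000
	cloneNewCgroup = 0x02000000
	cloneNewUTS    = 0x04000000
	cloneNewIPC    = 0x08000000
	cloneNewUser   = 0x10000000
	cloneNewPID    = 0x20000000
	cloneNewNet    = 0x40000000
)

// namespaceFlags lists the namespaces in the same order as the table.
var namespaceFlags = []struct {
	name string
	flag uintptr
}{
	{"Mount", cloneNewNS},
	{"PID", cloneNewPID},
	{"UTS", cloneNewUTS},
	{"Network", cloneNewNet},
	{"Time", cloneNewTime},
	{"IPC", cloneNewIPC},
	{"Cgroup", cloneNewCgroup},
	{"User", cloneNewUser},
}

// namespacesIn returns, in table order, the namespaces selected by mask.
func namespacesIn(mask uintptr) []string {
	var names []string
	for _, ns := range namespaceFlags {
		if mask&ns.flag != 0 {
			names = append(names, ns.name)
		}
	}
	return names
}

func main() {
	mask := uintptr(cloneNewNS | cloneNewPID | cloneNewUTS)
	fmt.Printf("%#x selects %v\n", mask, namespacesIn(mask))
}
```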
To set Linux namespaces in Go, set Cloneflags in the SysProcAttr field of the Cmd structure. An example that actually creates a container using the Mount, PID, and UTS namespaces can be found in the namespace branch of the GitHub repository (https://github.com/swen128/minimum-container/blob/namespace/main.go).
main.go
func execute(cmd string, args ...string) {
	if err := unix.Chroot("./rootfs"); err != nil {
		panic(err)
	}
	if err := unix.Chdir("/"); err != nil {
		panic(err)
	}
	command := exec.Command(cmd, args...)
	command.Stdin = os.Stdin
	command.Stdout = os.Stdout
	command.Stderr = os.Stderr
	// Create the child process in new Mount, PID, and UTS namespaces.
	command.SysProcAttr = &unix.SysProcAttr{
		Cloneflags: unix.CLONE_NEWNS | unix.CLONE_NEWPID | unix.CLONE_NEWUTS,
	}
	if err := command.Run(); err != nil {
		panic(err)
	}
}
If you recreate the container with this code and run ps as before, you can see that only the processes inside the container are visible.
root@linux-host:/work_dir# go run main.go run sh
/ # mount proc /proc -t proc
/ # ps
PID USER TIME COMMAND
1 root 0:00 sh
4 root 0:00 ps
Also, with the UTS namespace, changing the host name inside the container no longer affects the outside world.
root@linux-host:/work_dir# go run main.go run sh
/ # hostname my-container
/ # hostname
my-container
/ # exit
root@linux-host:/work_dir# hostname
linux-host
In the previous section, we manually mounted /proc and set the host name after launching the container. That is inconvenient, so let's change the program so that this initialization is performed when the container is created.
The problem here is when to perform the initialization. Container creation needs to proceed in the following order:
1. Set up the namespaces
2. Run the initialization (such as mounting /proc)
3. Run the user-specified command (such as sh)
However, there is no hook that lets us insert step 2 between 1. and 3. So we write code that performs both 2. and 3., and run that code in a process whose namespaces have already been set.
An example implementation is in the reexec branch of the GitHub repository (https://github.com/swen128/minimum-container/blob/reexec/main.go).
main.go
package main

import (
	"os"
	"os/exec"

	"golang.org/x/sys/unix"
)

// Handle the command-line arguments:
// go run main.go run <cmd> <args>
func main() {
	if len(os.Args) < 3 {
		panic("The command line arguments are incorrect.")
	}
	switch os.Args[1] {
	case "run":
		initialize(os.Args[2:]...)
	case "child":
		execute(os.Args[2], os.Args[3:]...)
	default:
		panic("The command line arguments are incorrect.")
	}
}

// initialize runs the execute function in a child process that has
// new Linux namespaces.
func initialize(args ...string) {
	// Pass "child <cmd> <args>" as the arguments to this program itself.
	arg := append([]string{"child"}, args...)
	command := exec.Command("/proc/self/exe", arg...)
	command.Stdin = os.Stdin
	command.Stdout = os.Stdout
	command.Stderr = os.Stderr
	command.SysProcAttr = &unix.SysProcAttr{
		Cloneflags: unix.CLONE_NEWNS | unix.CLONE_NEWPID | unix.CLONE_NEWUTS,
	}
	if err := command.Run(); err != nil {
		panic(err)
	}
}

// execute performs the initialization that must happen after the namespaces
// are set, then runs the user-specified command.
func execute(cmd string, args ...string) {
	// Set both the root directory and the current directory to ./rootfs.
	if err := unix.Chroot("./rootfs"); err != nil {
		panic(err)
	}
	if err := unix.Chdir("/"); err != nil {
		panic(err)
	}
	// Mount the proc file system on /proc and set the host name.
	if err := unix.Mount("proc", "proc", "proc", 0, ""); err != nil {
		panic(err)
	}
	if err := unix.Sethostname([]byte("my-container")); err != nil {
		panic(err)
	}
	command := exec.Command(cmd, args...)
	command.Stdin = os.Stdin
	command.Stdout = os.Stdout
	command.Stderr = os.Stderr
	if err := command.Run(); err != nil {
		panic(err)
	}
}
I'm using a slightly tricky technique here to keep everything in a single executable. The key is the part of the initialize function that runs /proc/self/exe as a command. /proc/self/exe is also part of the proc file system; it is a symbolic link that points to the executable of the current process. This allows a program to execute itself recursively.
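The self re-execution trick can be tried in isolation, without root privileges or namespaces. The sketch below is a stripped-down illustration rather than the repository's code: it resolves its own executable through /proc/self/exe (falling back to os.Executable on systems without it, which is an addition for portability) and starts a second copy of itself with a "child" marker argument. The helper names are hypothetical.

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
)

// childArgv builds the arguments for the second run of this program:
// the marker "child" followed by whatever the user passed.
func childArgv(userArgs []string) []string {
	return append([]string{"child"}, userArgs...)
}

// selfPath resolves the running executable. On Linux, /proc/self/exe is a
// symbolic link to it; elsewhere we fall back to os.Executable.
func selfPath() (string, error) {
	if path, err := os.Readlink("/proc/self/exe"); err == nil {
		return path, nil
	}
	return os.Executable()
}

func main() {
	if len(os.Args) > 1 && os.Args[1] == "child" {
		// Second pass: in a container runtime, initialization and the
		// user command would run here.
		fmt.Printf("child: pid=%d args=%v\n", os.Getpid(), os.Args[2:])
		return
	}

	// First pass: start a copy of ourselves with the "child" marker.
	self, err := selfPath()
	if err != nil {
		panic(err)
	}
	fmt.Printf("parent: pid=%d re-executing %s\n", os.Getpid(), self)
	cmd := exec.Command(self, childArgv(os.Args[1:])...)
	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stderr
	if err := cmd.Run(); err != nil {
		panic(err)
	}
}
```

Running it with a few arguments should print one line from the parent and one from the child, each with a different PID. Adding Cloneflags to the command is what turns this pattern into the container launcher described in this section.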
Following the flow of the code above in order:
1. The user runs go run main.go run <cmd> <args>.
2. The initialize function runs /proc/self/exe child <cmd> <args> as a child process with new namespaces.
3. In that child, the execute function changes the root directory, mounts /proc, sets the host name, and then runs the user command.
At this point, the root directory and namespace settings are inherited by the grandchild process created to run the user command, and it functions as a container.
With the above, we have implemented the basic functions of a container, but many parts are still missing. I can't go into all the details in this article, but to give you a rough picture, here are two important standard specifications.
Specification | Typical implementation |
---|---|
OCI Runtime Specification | runc |
OCI Image Format Specification | containerd |
The OCI Runtime Spec specifies the life cycle of a container and the format of the **filesystem bundle**. A filesystem bundle is a directory that contains config.json, which describes various container settings, and the rootfs directory, which is the root file system.
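To give a feel for what config.json holds, here is a small sketch that reads a few of its fields in Go. The struct covers only a handful of the fields defined by the OCI Runtime Spec (ociVersion, process.args, process.cwd, root.path, hostname), and the sample document is hand-written and heavily trimmed, so treat it as an illustration rather than a complete configuration. The type and function names are hypothetical.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Config mirrors only a small subset of an OCI config.json.
type Config struct {
	OCIVersion string `json:"ociVersion"`
	Process    struct {
		Args []string `json:"args"` // command to run in the container
		Cwd  string   `json:"cwd"`  // working directory inside the container
	} `json:"process"`
	Root struct {
		Path string `json:"path"` // root file system, relative to the bundle
	} `json:"root"`
	Hostname string `json:"hostname"`
}

// parseConfig decodes the subset of config.json described by Config.
func parseConfig(data []byte) (Config, error) {
	var c Config
	if err := json.Unmarshal(data, &c); err != nil {
		return Config{}, fmt.Errorf("parse config.json: %w", err)
	}
	return c, nil
}

func main() {
	// A hand-written, trimmed example; not a complete config.json.
	sample := []byte(`{
  "ociVersion": "1.0.2",
  "process": {"args": ["sh"], "cwd": "/"},
  "root": {"path": "rootfs"},
  "hostname": "my-container"
}`)
	c, err := parseConfig(sample)
	if err != nil {
		panic(err)
	}
	fmt.Printf("run %v with root %q and hostname %q\n", c.Process.Args, c.Root.Path, c.Hostname)
}
```

The values correspond to things this article hard-coded: the command to run, the rootfs directory passed to chroot, and the host name set in the UTS namespace.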
The OCI Image Spec, on the other hand, specifies the format of a container image and how to convert an image into a filesystem bundle. The image here is the familiar one you get by building a Dockerfile.
As you might guess from the fact that a filesystem bundle contains a rootfs directory, what this article implemented is a small part of the OCI Runtime Spec. It did not touch on the OCI Image Spec or other elements, so if you are interested, I recommend investigating further.
- A container is a special process isolated by features of the Linux kernel.
  - chroot: isolates the root file system
  - namespace: isolates various global resources such as PIDs, file system mounts, and the host name
- Important standard specifications for containers
  - OCI Runtime Specification
  - OCI Image Format Specification
- The Runtime Spec is the one closely related to this article.