Overview

This article explains how containers work from the lowest layers up, for readers who usually treat Docker as a black box.

To do so, we implement and run a container from scratch in Go. The basic principle behind containers is surprisingly simple: the final code in this article is just 60 lines long.

The completed code can be found in the GitHub repository.

What is a container

The following diagram is often used to illustrate the difference between a container and a virtual machine (VM). (Quoted from Docker official website)

The biggest difference between VMs and containers is that creating a container does not start a guest OS. All containers exist as processes running on the same host OS.

But of course, ordinary processes share resources such as files with other processes and depend heavily on their environment. Therefore, to run a process in a logically isolated state, features of the **Linux kernel** such as chroot and namespaces are used. Such an **isolated process** is called a container.

What is the Linux kernel

The kernel is literally the core of the OS. If you think of a Linux machine as the following three-layer structure, the kernel sits right in the middle.

- **Hardware**: Physical devices such as memory and CPU
- **Linux kernel**
- **User processes**: Almost all programs, such as shells and editors

The kernel has the privilege to manipulate hardware directly and handles tasks such as memory management, process management, and device drivers.

User processes, on the other hand, have severely restricted access to hardware. To perform file operations, create processes, and so on, they must ask the kernel through **system calls**.

When implementing a program that creates a container, we also make heavy use of system calls to take advantage of chroot, namespace, and so on.

When making system calls from Go code, it is standard to use the official golang.org/x/sys package (the code in this article uses its unix subpackage).

Container implementation from scratch

From here on, we actually create a container with a Go program.

You need a Linux environment with the Go compiler installed to run the code. You can use the docker-compose.yml file included in the GitHub repository to try it out immediately without the hassle of building an environment.

$ git clone 
$ cd minimum-container
$ docker-compose run app

root@linux-env:/work_dir# go run main.go run sh

chroot

chroot changes the root directory of the currently running process (and its child processes). It is commonly referred to as a **chroot jail** because the process can no longer access, or even see, anything above that directory.

The chroot branch of the GitHub repository (https://github.com/swen128/minimum-container/blob/chroot/main.go) has a container-like code example using chroot. It changes the root directory to ./rootfs and then executes the given arguments as a command.

`main.go`


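// execute runs cmd with arguments args in an isolated process.
func execute(cmd string, args ...string) {
	// Set both the root directory and the current directory to ./rootfs.
	if err := unix.Chroot("./rootfs"); err != nil {
		panic(err)
	}
	if err := unix.Chdir("/"); err != nil {
		panic(err)
	}

	command := exec.Command(cmd, args...)
	command.Stdin = os.Stdin
	command.Stdout = os.Stdout
	command.Stderr = os.Stderr

	// Report failures such as a missing executable instead of ignoring them.
	if err := command.Run(); err != nil {
		panic(err)
	}
}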

If you try running this main.go right away, you should get the following error:

$ go run main.go run sh
panic: exec: "sh": executable file not found in $PATH

This error occurs because ./rootfs does not contain any files yet. After chroot, the root directory seen by the container is almost empty, so there is no sh binary.

That's where docker export comes in handy. You can extract all the files contained in any Docker image under ./rootfs by typing the command below.

$ docker export $(docker create <image>) | tar -C rootfs -xvf -

Let's run the container again with the files prepared in ./rootfs. Try using the ls command or creating a file to confirm that the / directory inside the container corresponds to the rootfs directory on the host.

root@linux-env:/work_dir# go run main.go run sh

/ # ls /
bin   dev   etc   home  proc  root  sys   tmp   usr   var
/ # touch /tmp/hoge
/ # exit

root@linux-env:/work_dir# ls rootfs/tmp
hoge

namespace

Linux namespaces are a feature that isolates various resources such as file system mounts and PIDs.

To understand why this feature is needed, let's run the ps command inside the container-like process we created in the previous section.

root@linux-host:/work_dir# go run main.go run ps
PID   USER     TIME  COMMAND

You should see no results. The cause is that the ps command reads the /proc directory. Normally, a special pseudo file system that provides process information and the like is mounted on /proc, but because the root directory has been changed in our container-like process, /proc there is still empty.

Try mounting /proc first and then running ps again.

root@linux-host:/work_dir# go run main.go run sh

/ # mount proc /proc -t proc
/ # ps
PID   USER     TIME  COMMAND
    1 root      0:00 bash
  100 root      0:00 go run main.go run sh
  154 root      0:00 /tmp/go-build474892034/b001/exe/main run sh
  160 root      0:00 sh
  163 root      0:00 ps

There are two problems here. One is that you can see the processes running outside the container (PID 1, 100, 154), and the other is that the mounts you set inside the container are reflected on the host. This is not enough isolation from the external environment.

root@linux-host:/work_dir# cat /proc/mounts | grep proc
proc /proc proc rw,nosuid,nodev,noexec,relatime 0 0
proc /work_dir/rootfs/proc proc rw,relatime 0 0        <-Proc mount added in the container

Linux namespaces let you give each process its own namespace for a given kind of resource. A process cannot see or manipulate resources that belong to a different namespace, which solves the problems described above.

At the time of writing, there are eight types of Linux namespaces. Each is selected by passing the corresponding flag to system calls such as clone, setns, and unshare.

Namespace	Flag	Isolated resource
Mount	CLONE_NEWNS	File system mount point
PID	CLONE_NEWPID	PID
UTS	CLONE_NEWUTS	hostname
Network	CLONE_NEWNET	Network devices, ports, etc.
Time	CLONE_NEWTIME	Time obtainable with clock_gettime (monotonic, boot)
IPC	CLONE_NEWIPC	Interprocess communication
Cgroup	CLONE_NEWCGROUP	cgroup root directory
User	CLONE_NEWUSER	UID, GID

To set Linux namespaces in Go, set the Cloneflags field of SysProcAttr in the Cmd struct. An example that actually creates a container using the Mount, PID, and UTS namespaces can be found in the namespace branch of the GitHub repository (https://github.com/swen128/minimum-container/blob/namespace/main.go).

`main.go`


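// execute runs cmd with arguments args in new Mount, PID, and UTS namespaces.
func execute(cmd string, args ...string) {
	// Set both the root directory and the current directory to ./rootfs.
	if err := unix.Chroot("./rootfs"); err != nil {
		panic(err)
	}
	if err := unix.Chdir("/"); err != nil {
		panic(err)
	}

	command := exec.Command(cmd, args...)
	command.Stdin = os.Stdin
	command.Stdout = os.Stdout
	command.Stderr = os.Stderr

	// Create the child process in new Linux namespaces.
	command.SysProcAttr = &unix.SysProcAttr{
		Cloneflags: unix.CLONE_NEWNS | unix.CLONE_NEWPID | unix.CLONE_NEWUTS,
	}

	if err := command.Run(); err != nil {
		panic(err)
	}
}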

If you recreate the container with this code and run ps as before, you can see that only the processes inside the container are visible.

root@linux-host:/work_dir# go run main.go run sh

/ # mount proc /proc -t proc
/ # ps
PID   USER     TIME  COMMAND
    1 root      0:00 sh
    4 root      0:00 ps

Also, with the UTS namespace, changing the host name inside the container no longer affects the outside world.

root@linux-host:/work_dir# go run main.go run sh

/ # hostname my-container
/ # hostname
my-container
/ # exit

root@linux-host:/work_dir# hostname
linux-host

Container initialization

In the previous section, we mounted /proc and set the hostname manually after launching the container. That is inconvenient, so let's change the program so that this initialization is performed when the container is created.

The problem here is when to perform the initialization. Container creation proceeds in the following steps:

1. Create a child process with namespaces set
2. Initialize the child process (e.g. mount /proc)
3. Execute a user-specified command (such as sh)

However, there is no hook that lets us run code between steps 1 and 3. So we write code that performs both steps 2 and 3, and execute that code in the process with the namespaces set.

An example implementation is the reexec branch of the GitHub repository (https://github.com/swen128/minimum-container/blob/reexec/main.go).

`main.go`


// Handle command line arguments.
// Usage: go run main.go run <cmd> <args>
func main() {
	if len(os.Args) < 3 {
		panic("The command line arguments are incorrect.")
	}
	switch os.Args[1] {
	case "run":
		initialize(os.Args[2:]...)
	case "child":
		execute(os.Args[2], os.Args[3:]...)
	default:
		panic("The command line arguments are incorrect.")
	}
}

// initialize runs the execute function in a child process with Linux namespaces set.
func initialize(args ...string) {
	// Re-run this program itself with the arguments: child <cmd> <args>
	arg := append([]string{"child"}, args...)
	command := exec.Command("/proc/self/exe", arg...)

	command.Stdin = os.Stdin
	command.Stdout = os.Stdout
	command.Stderr = os.Stderr

	command.SysProcAttr = &unix.SysProcAttr{
		Cloneflags: unix.CLONE_NEWNS | unix.CLONE_NEWPID | unix.CLONE_NEWUTS,
	}

	if err := command.Run(); err != nil {
		panic(err)
	}
}

// execute performs the initialization after the namespaces are set,
// then runs the user-specified command.
func execute(cmd string, args ...string) {
	// Set both the root directory and the current directory to ./rootfs.
	if err := unix.Chroot("./rootfs"); err != nil {
		panic(err)
	}
	if err := unix.Chdir("/"); err != nil {
		panic(err)
	}

	// Mount the proc file system and set the hostname inside the container.
	if err := unix.Mount("proc", "proc", "proc", 0, ""); err != nil {
		panic(err)
	}
	if err := unix.Sethostname([]byte("my-container")); err != nil {
		panic(err)
	}

	command := exec.Command(cmd, args...)
	command.Stdin = os.Stdin
	command.Stdout = os.Stdout
	command.Stderr = os.Stderr

	if err := command.Run(); err != nil {
		panic(err)
	}
}

We use a slightly tricky technique here to keep everything in a single executable. The key is the part of the initialize function that executes /proc/self/exe as a command. /proc/self/exe is also part of the proc file system: it is a symbolic link to the executable of the current process. This allows a program to execute itself again.

Following the execution of the code above step by step:

1. Execute the command go run main.go run <cmd> <args>
2. main.go runs and branches to the initialize function
3. A process with namespaces set is created
4. That process executes the command /proc/self/exe child <cmd> <args>
5. main.go runs again and branches to the execute function
6. Initialization such as mounting /proc is performed
7. Another process is created
8. That process executes the user-specified command

At this time, the root directory and namespace settings are inherited by the grandchild process created to execute the user command, and it functions as a container.

Container standard specifications

With the above, we have implemented the functions that are the basis of the container, but there are still many missing parts. I can't go into all the details in this article, but here are two standard specifications that are important to give you a rough picture.

Specification	Typical implementation
OCI Runtime Specification	runc
OCI Image Format Specification	containerd

The OCI Runtime Spec specifies the life cycle of the container and the format of the **filesystem bundle**. A filesystem bundle is a set of files consisting of config.json, which describes various container settings, and the rootfs directory, which is the root file system.

The OCI Image Spec, on the other hand, specifies the format of container images and how to convert an image into a filesystem bundle. An image here is the familiar artifact you get by building a Dockerfile.

As you might guess from the fact that the filesystem bundle contains the rootfs directory, this article implements a small part of what the OCI Runtime Spec covers. It does not touch on the OCI Image Spec or other elements, so if you are interested, I recommend investigating further.

Summary

- A container is a special process isolated by features of the Linux kernel.
- chroot: isolates the root file system
- namespace: isolates various global resources such as PIDs, file system mounts, and the hostname
- Important standard specifications for containers
  - OCI Runtime Specification
  - OCI Image Format Specification
- The Runtime Spec is closely related to this article.

How Docker works ~ Implement the container in 60 lines
