How Docker works ~ Implement the container in 60 lines

Overview

This article explains how containers work at a low level, for readers who usually treat Docker as a black box.

To do this, we implement and run a container from scratch in Go. The basic principle of containers is surprisingly simple: the finished program at the end of this article is only about 60 lines of code.

The completed code can be found in the GitHub repository.

What is a container

The following diagram is often used to illustrate the difference between a container and a virtual machine (VM).

docker-containerized-and-vm-transparent-bg.png (quoted from the official Docker website)

The biggest difference between VMs and containers is that a container does not boot a guest OS. All containers exist as processes running on the same host OS.

Of course, ordinary processes share resources such as files with other processes and depend heavily on their environment. To run a process in a logically isolated state, containers use **Linux kernel** features such as chroot and namespaces. Such an **isolated process** is called a container.

What is the Linux kernel

The kernel is literally the core of the OS. If you think of a Linux machine as the three-layer structure below, the kernel sits in the middle.

- **Hardware**: Physical devices such as memory and CPU
- **Linux kernel**
- **User processes**: Almost all programs, such as shells and editors

The kernel has the privilege to manipulate hardware directly, and it handles tasks such as memory management, process management, and device drivers.

User processes, on the other hand, have severely restricted access to hardware. To perform file operations, create processes, and so on, they must ask the kernel through **system calls**.

A program that creates containers also makes heavy use of system calls in order to use chroot, namespaces, and so on.

When making system calls from Go code, it is standard to use the official golang.org/x/sys package.
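As a rough illustration of what asking the kernel through a system call looks like in Go, here is a minimal sketch. To stay free of external dependencies it assumes the standard library's syscall package instead of golang.org/x/sys/unix, and writeViaSyscall is a hypothetical helper name introduced only for this example.

```go
package main

import (
	"fmt"
	"os"
	"syscall"
)

// writeViaSyscall (hypothetical helper) writes msg to the file descriptor fd
// by calling write(2) directly instead of going through fmt or bufio.
// It returns the number of bytes written.
func writeViaSyscall(fd int, msg string) (int, error) {
	return syscall.Write(fd, []byte(msg))
}

func main() {
	// getpid(2): both calls below end up asking the kernel for the same value.
	fmt.Println("pid from syscall:", syscall.Getpid())
	fmt.Println("pid from os:", os.Getpid())

	// write(2) on file descriptor 1 (standard output).
	if _, err := writeViaSyscall(1, "hello from write(2)\n"); err != nil {
		panic(err)
	}
}
```

The same calls are also available as unix.Getpid and unix.Write in golang.org/x/sys/unix, which is the package the rest of this article uses.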

Container implementation from scratch

From here on, we actually create a container with a Go program.

To run the code you need a Linux environment with the Go compiler installed. The docker-compose.yml file included in the GitHub repository lets you try it right away without setting up an environment yourself.

$ git clone 
$ cd minimum-container
$ docker-compose run app

root@linux-env:/work_dir# go run main.go run sh

chroot

chroot changes the root directory of the currently running process (and its child processes). Because the process can no longer access, or even see, anything above that directory, this is commonly called a **chroot jail**.

chroot.png

The chroot branch of the GitHub repository (https://github.com/swen128/minimum-container/blob/chroot/main.go) has a container-like code example using chroot. It changes the root directory to ./rootfs and then executes the given arguments as a command.

main.go


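// execute runs cmd with the given args in an isolated process.
func execute(cmd string, args ...string) {
	// Change the root directory to ./rootfs, then move the current
	// directory to the new root.
	if err := unix.Chroot("./rootfs"); err != nil {
		panic(err)
	}
	if err := unix.Chdir("/"); err != nil {
		panic(err)
	}

	// Connect the command's standard streams to those of this process.
	command := exec.Command(cmd, args...)
	command.Stdin = os.Stdin
	command.Stdout = os.Stdout
	command.Stderr = os.Stderr

	// Panic if the command cannot be found or exits with an error.
	if err := command.Run(); err != nil {
		panic(err)
	}
}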

If you try running this main.go right away, you should get the following error:

$ go run main.go run sh
panic: exec: "sh": executable file not found in $PATH

This error occurs because ./rootfs does not contain any files yet. After chroot, the root directory is essentially empty, so there is no sh binary to execute.
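To make the cause of this error concrete, the following sketch lists the host-side paths where the command would have to exist for the lookup to succeed after chroot. It is only an approximation of the real PATH lookup (it ignores empty PATH entries and does not check execute permission), and hostCandidates is a hypothetical helper, not part of the repository.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

// hostCandidates (hypothetical helper) returns the paths, as seen from the
// host, at which cmd would have to exist for a PATH lookup to find it after
// chroot into rootfs. Empty PATH entries are skipped for simplicity.
func hostCandidates(rootfs, pathEnv, cmd string) []string {
	var out []string
	for _, dir := range strings.Split(pathEnv, ":") {
		if dir == "" {
			continue
		}
		out = append(out, filepath.Join(rootfs, dir, cmd))
	}
	return out
}

func main() {
	for _, p := range hostCandidates("./rootfs", os.Getenv("PATH"), "sh") {
		// Lstat does not follow symlinks, so an absolute symlink inside
		// rootfs is not resolved against the host's root directory.
		if _, err := os.Lstat(p); err == nil {
			fmt.Println("found:", p)
			return
		}
	}
	fmt.Println("sh not found under ./rootfs; the lookup after chroot would fail")
}
```

With an empty ./rootfs, none of the candidate paths exist, which corresponds to the "executable file not found in $PATH" error above.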

That's where docker export comes in handy. You can extract all the files contained in any Docker image under ./rootfs by typing the command below.

$ docker export $(docker create <image>) | tar -C rootfs -xvf -

Let's run the container again with the files prepared in ./rootfs. Try the ls command or create a file to confirm that the / directory inside the container corresponds to the rootfs directory on the host.

root@linux-env:/work_dir# go run main.go run sh

/ # ls /
bin   dev   etc   home  proc  root  sys   tmp   usr   var
/ # touch /tmp/hoge
/ # exit

root@linux-env:/work_dir# ls rootfs/tmp
hoge

namespace

Linux namespaces are a kernel feature that isolates various resources, such as file system mount points and PIDs.

To understand why this feature is needed, let's run the ps command inside the container-like environment we created in the previous section.

root@linux-host:/work_dir# go run main.go run ps
PID   USER     TIME  COMMAND

You should see no results. This is because the ps command reads the /proc directory. Normally, a special pseudo file system that exposes process information is mounted on /proc, but because we changed the root directory, /proc inside the container-like environment is still empty.
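The dependency of ps on /proc can be sketched in a few lines of Go. The proc file system exposes one directory per process, named after its PID, so a process listing is essentially a scan for numeric directory names. This is a simplified illustration, not how ps is actually implemented, and pidsFromNames is a hypothetical helper.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
)

// pidsFromNames (hypothetical helper) keeps the directory names that parse
// as positive integers, which is how process entries appear under /proc.
func pidsFromNames(names []string) []int {
	var pids []int
	for _, n := range names {
		if pid, err := strconv.Atoi(n); err == nil && pid > 0 {
			pids = append(pids, pid)
		}
	}
	return pids
}

func main() {
	entries, err := os.ReadDir("/proc")
	if err != nil {
		fmt.Println("cannot read /proc:", err)
		return
	}
	var names []string
	for _, e := range entries {
		names = append(names, e.Name())
	}
	// With nothing mounted on /proc, this prints an empty list.
	fmt.Println("visible PIDs:", pidsFromNames(names))
}
```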

Try mounting /proc first and then running ps again.

root@linux-host:/work_dir# go run main.go run sh

/ # mount proc /proc -t proc
/ # ps
PID   USER     TIME  COMMAND
    1 root      0:00 bash
  100 root      0:00 go run main.go run sh
  154 root      0:00 /tmp/go-build474892034/b001/exe/main run sh
  160 root      0:00 sh
  163 root      0:00 ps

There are two problems here. First, processes running outside the container (PIDs 1, 100, and 154) are visible. Second, the mount made inside the container is reflected on the host, as the host's /proc/mounts shows. The container is not sufficiently isolated from its environment.

root@linux-host:/work_dir# cat /proc/mounts | grep proc
proc /proc proc rw,nosuid,nodev,noexec,relatime 0 0
proc /work_dir/rootfs/proc proc rw,relatime 0 0        <- proc mount added inside the container
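The same check can be done programmatically. The sketch below assumes the usual /proc/mounts layout (device, mount point, file system type, options, and two numeric fields per line) and returns the mount points of a given type. mountPointsOfType is a hypothetical helper, and escaped characters in mount points (such as \040 for a space) are not decoded.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// mountPointsOfType (hypothetical helper) returns the mount points whose
// file system type equals fstype, given text in /proc/mounts format.
func mountPointsOfType(mounts, fstype string) []string {
	var points []string
	for _, line := range strings.Split(mounts, "\n") {
		fields := strings.Fields(line)
		// fields[1] is the mount point, fields[2] the file system type.
		if len(fields) >= 3 && fields[2] == fstype {
			points = append(points, fields[1])
		}
	}
	return points
}

func main() {
	data, err := os.ReadFile("/proc/mounts")
	if err != nil {
		fmt.Println("cannot read /proc/mounts:", err)
		return
	}
	for _, p := range mountPointsOfType(string(data), "proc") {
		fmt.Println("proc is mounted on", p)
	}
}
```

If more than one entry is printed on the host after mounting /proc inside the container-like environment, the mount has leaked out of it.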

Linux namespaces let you give each process its own view of a resource. A process cannot see or manipulate resources that belong to a different namespace, which solves the problems described above.

At the time of writing, there are eight types of Linux namespaces. Each is selected by a flag passed to system calls such as clone, setns, and unshare.

| Namespace | Flag | Isolated resource |
|---|---|---|
| Mount | CLONE_NEWNS | File system mount points |
| PID | CLONE_NEWPID | PIDs |
| UTS | CLONE_NEWUTS | Hostname |
| Network | CLONE_NEWNET | Network devices, ports, etc. |
| Time | CLONE_NEWTIME | Times obtainable with clock_gettime (monotonic, boot) |
| IPC | CLONE_NEWIPC | Interprocess communication |
| Cgroup | CLONE_NEWCGROUP | cgroup root directory |
| User | CLONE_NEWUSER | UIDs, GIDs |
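One way to see which namespaces a process belongs to is to read the symbolic links under /proc/self/ns, whose targets look like uts:[4026531838]. Two processes whose links show the same number share that namespace. The following is a small sketch of reading and parsing these links; parseNsLink is a hypothetical helper, and only three of the eight namespace types are inspected.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

// parseNsLink (hypothetical helper) splits a link target such as
// "uts:[4026531838]" into the namespace type and its inode number.
func parseNsLink(link string) (string, uint64, error) {
	kind, rest, ok := strings.Cut(link, ":[")
	if !ok || !strings.HasSuffix(rest, "]") {
		return "", 0, fmt.Errorf("unexpected namespace link %q", link)
	}
	inode, err := strconv.ParseUint(strings.TrimSuffix(rest, "]"), 10, 64)
	if err != nil {
		return "", 0, err
	}
	return kind, inode, nil
}

func main() {
	for _, name := range []string{"mnt", "pid", "uts"} {
		link, err := os.Readlink("/proc/self/ns/" + name)
		if err != nil {
			fmt.Println("cannot read namespace link:", err)
			continue
		}
		kind, inode, err := parseNsLink(link)
		if err != nil {
			fmt.Println(err)
			continue
		}
		fmt.Printf("%s namespace inode: %d\n", kind, inode)
	}
}
```

Running this on the host and again inside a process created with the flags above should print different inode numbers for the namespaces that were separated.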

To set Linux namespaces in Go, set Cloneflags in the SysProcAttr field of the Cmd struct. An example that actually creates a container using the Mount, PID, and UTS namespaces can be found in the namespace branch of the GitHub repository (https://github.com/swen128/minimum-container/blob/namespace/main.go).

main.go


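func execute(cmd string, args ...string) {
	// Change the root directory to ./rootfs and move into it.
	if err := unix.Chroot("./rootfs"); err != nil {
		panic(err)
	}
	if err := unix.Chdir("/"); err != nil {
		panic(err)
	}

	command := exec.Command(cmd, args...)
	command.Stdin = os.Stdin
	command.Stdout = os.Stdout
	command.Stderr = os.Stderr

	// Start the command in new Mount, PID, and UTS namespaces.
	command.SysProcAttr = &unix.SysProcAttr{
		Cloneflags: unix.CLONE_NEWNS | unix.CLONE_NEWPID | unix.CLONE_NEWUTS,
	}

	if err := command.Run(); err != nil {
		panic(err)
	}
}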

If you recreate the container with this code and run ps as before, you can see that only the processes inside the container are visible.

root@linux-host:/work_dir# go run main.go run sh

/ # mount proc /proc -t proc
/ # ps
PID   USER     TIME  COMMAND
    1 root      0:00 sh
    4 root      0:00 ps

Also, with the UTS namespace, changing the host name inside the container no longer affects the outside world.

root@linux-host:/work_dir# go run main.go run sh

/ # hostname my-container
/ # hostname
my-container
/ # exit

root@linux-host:/work_dir# hostname
linux-host

Container initialization

In the previous section, we mounted /proc and set the host name by hand after launching the container. That is inconvenient, so let's change the program to perform this initialization when the container is created.

The question is when to perform the initialization. Creating a container involves the following steps:

  1. Create a child process with namespaces set
  2. Initialize the child process (e.g. mount /proc)
  3. Execute a user-specified command (such as sh)

However, there is no hook that lets us run our own code between steps 1 and 3. So we write code that performs both 2 and 3, and run that code in the process that has the namespaces set.

An example implementation is the reexec branch of the GitHub repository (https://github.com/swen128/minimum-container/blob/reexec/main.go).

main.go


package main

import (
	"os"
	"os/exec"

	"golang.org/x/sys/unix"
)

// Handle command line arguments.
// Usage: go run main.go run <cmd> <args>
func main() {
	if len(os.Args) < 3 {
		panic("The command line arguments are incorrect.")
	}
	switch os.Args[1] {
	case "run":
		initialize(os.Args[2:]...)
	case "child":
		execute(os.Args[2], os.Args[3:]...)
	default:
		panic("The command line arguments are incorrect.")
	}
}

// initialize re-executes this program as a child process with Linux
// namespaces set, so that the execute function runs inside them.
func initialize(args ...string) {
	// Pass "child <cmd> <args>" as the arguments to this program itself.
	arg := append([]string{"child"}, args...)
	command := exec.Command("/proc/self/exe", arg...)

	command.Stdin = os.Stdin
	command.Stdout = os.Stdout
	command.Stderr = os.Stderr

	command.SysProcAttr = &unix.SysProcAttr{
		Cloneflags: unix.CLONE_NEWNS | unix.CLONE_NEWPID | unix.CLONE_NEWUTS,
	}

	if err := command.Run(); err != nil {
		panic(err)
	}
}

// execute performs the initialization that must happen after the namespaces
// are set, then runs the user-specified command.
func execute(cmd string, args ...string) {
	// Set the root directory and the current directory to ./rootfs.
	if err := unix.Chroot("./rootfs"); err != nil {
		panic(err)
	}
	if err := unix.Chdir("/"); err != nil {
		panic(err)
	}

	// Mount the proc file system on /proc and set the host name.
	if err := unix.Mount("proc", "proc", "proc", 0, ""); err != nil {
		panic(err)
	}
	if err := unix.Sethostname([]byte("my-container")); err != nil {
		panic(err)
	}

	command := exec.Command(cmd, args...)
	command.Stdin = os.Stdin
	command.Stdout = os.Stdout
	command.Stderr = os.Stderr

	if err := command.Run(); err != nil {
		panic(err)
	}
}

To keep everything in a single executable, this code uses a slightly tricky technique. The key is the part of the initialize function that runs /proc/self/exe as a command. /proc/self/exe is part of the proc file system and is a symbolic link to the executable of the current process. This lets the program re-execute itself.

Following the execution of the code above in order:

  1. Execute the command go run main.go run <cmd> <args>
  2. main.go is executed and branches to the initialize function
  3. Create a process with namespaces set
  4. Execute the command /proc/self/exe child <cmd> <args>
  5. main.go is executed and branches to the execute function
  6. Perform initialization such as mounting /proc
  7. Create a process
  8. Execute the user-specified command

The root directory and namespace settings are inherited by the grandchild process created to run the user command, so that process functions as a container.
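The re-execution mechanism itself can be tried without root privileges. The sketch below keeps only the argument dispatch and the /proc/self/exe call, and leaves out chroot, namespaces, and mounting. childArgs and role are hypothetical helpers introduced for this illustration, and the child branch only prints what it would do.

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
)

// childArgs (hypothetical helper) builds the arguments passed to the
// re-executed program: "child" followed by the user's command and arguments.
func childArgs(userArgs []string) []string {
	return append([]string{"child"}, userArgs...)
}

// role (hypothetical helper) reports which branch an argument vector selects.
func role(argv []string) string {
	if len(argv) < 3 {
		return "usage"
	}
	switch argv[1] {
	case "run":
		return "parent"
	case "child":
		return "child"
	}
	return "usage"
}

func main() {
	switch role(os.Args) {
	case "parent":
		// Re-execute this same binary, this time taking the child branch.
		cmd := exec.Command("/proc/self/exe", childArgs(os.Args[2:])...)
		cmd.Stdout, cmd.Stderr = os.Stdout, os.Stderr
		if err := cmd.Run(); err != nil {
			fmt.Fprintln(os.Stderr, err)
			os.Exit(1)
		}
	case "child":
		// A real container would initialize itself here before exec.
		fmt.Printf("child pid=%d would now initialize and run %v\n", os.Getpid(), os.Args[2:])
	default:
		fmt.Fprintln(os.Stderr, "usage: <program> run <cmd> [args...]")
	}
}
```

Running it as go run main.go run sh on Linux should show one line printed by the child branch, from a process with a different PID than the one you started.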

Container standard specifications

With the above, we have implemented the basic functions of a container, but many pieces are still missing. This article cannot cover all of them, but to give a rough picture, here are two important standard specifications.

| Specification | Typical implementation |
|---|---|
| OCI Runtime Specification | runc |
| OCI Image Format Specification | containerd |

The OCI Runtime Spec specifies the life cycle of a container and the format of the **filesystem bundle**. A filesystem bundle consists of config.json, which describes the container's settings, and the container's root file system (conventionally a directory named rootfs), stored together in one directory.
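To give a feel for config.json, here is a sketch that decodes a small subset of its fields in Go. The field names (ociVersion, process.args, root.path, hostname, linux.namespaces) follow the OCI Runtime Spec, but the sample values are made up for this article's container, and the spec struct and parseSpec function are hypothetical names covering only a fraction of the real schema.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// spec (hypothetical type) covers only a small subset of config.json.
type spec struct {
	Version string `json:"ociVersion"`
	Process struct {
		Args []string `json:"args"`
		Cwd  string   `json:"cwd"`
	} `json:"process"`
	Root struct {
		Path string `json:"path"`
	} `json:"root"`
	Hostname string `json:"hostname"`
	Linux    struct {
		Namespaces []struct {
			Type string `json:"type"`
		} `json:"namespaces"`
	} `json:"linux"`
}

// sample is an illustrative config.json; the values are assumptions.
const sample = `{
  "ociVersion": "1.0.2",
  "process": {"args": ["sh"], "cwd": "/"},
  "root": {"path": "rootfs"},
  "hostname": "my-container",
  "linux": {"namespaces": [{"type": "mount"}, {"type": "pid"}, {"type": "uts"}]}
}`

// parseSpec (hypothetical helper) decodes config.json bytes into a spec.
func parseSpec(data []byte) (spec, error) {
	var s spec
	err := json.Unmarshal(data, &s)
	return s, err
}

func main() {
	s, err := parseSpec([]byte(sample))
	if err != nil {
		panic(err)
	}
	fmt.Println("root file system:", s.Root.Path)
	fmt.Println("command:", s.Process.Args)
	fmt.Println("hostname:", s.Hostname)
	for _, ns := range s.Linux.Namespaces {
		fmt.Println("namespace:", ns.Type)
	}
}
```

The values hard-coded in this article's program (./rootfs, my-container, and the three namespaces) correspond to settings that a runtime would read from this file.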

The OCI Image Spec, on the other hand, specifies the format of container images and how to convert an image into a filesystem bundle. An image here is the familiar artifact you get by building a Dockerfile.

docker.png

As you might guess from the fact that a filesystem bundle contains a rootfs directory, this article implements a small part of what the OCI Runtime Spec covers. It does not touch the OCI Image Spec or other elements, so if you are interested, I recommend investigating further.

Summary

- A container is a special process isolated using features of the Linux kernel.
  - chroot: isolates the root file system
  - namespace: isolates various global resources such as PIDs, file system mounts, and the hostname
- Important standard specifications for containers
  - OCI Runtime Specification
  - OCI Image Format Specification
  - The Runtime Spec is the one most closely related to this article.
