I tried adding system calls and scheduler to Linux

Introduction (written by Nada)

In this article, we add the following to Linux:

- a system call
- a scheduler

We explain each step in as much detail as possible so that the reader can reproduce it.

PC environment

Environment construction (written by Kikuoka)

We start by building the environment. We are going to modify the Linux source code, so we need an environment in which the modified kernel can actually be run. Whether your own computer runs Mac, Windows, or even Linux, it would be a disaster if you changed the kernel of your main machine and it started behaving strangely, so we run Linux on a virtual machine.

Preparation

First, install VirtualBox and Vagrant in preparation for running the virtual machines. Unless otherwise noted, I followed this site throughout. There are a few changes here, but following its procedure as it stands also works. In the end we create two machines: one that runs the modified source code (hereinafter referred to as debuggee) and one that debugs debuggee (hereinafter referred to as debugger). The details are explained below.

VirtualBox installation

For this part, I referred to this site. It is an English site; look at the Method 3 part, or, if that is too much trouble, follow the explanation below.

First, add the public key for VirtualBox with the following command.

~$ wget -q https://www.virtualbox.org/download/oracle_vbox_2016.asc -O- | sudo apt-key add -

Next, use the following command to add the VirtualBox repository.

~$ sudo add-apt-repository "deb [arch=amd64] http://download.virtualbox.org/virtualbox/debian $(lsb_release -cs) contrib"

You can actually install VirtualBox with the following command.

~$ sudo apt update && sudo apt install virtualbox-6.0

Install Vagrant

First, download the package file with the following command. At this time, note that the command differs depending on the environment you have. Since all the members of the team were using x86 machines, the commands below are adapted accordingly.

~$ wget https://releases.hashicorp.com/vagrant/2.2.6/vagrant_2.2.6_x86_64.deb

Install Vagrant from the package file you downloaded earlier with the following command.

~$ sudo dpkg -i vagrant_2.2.6_x86_64.deb

As the commands show, the versions installed here are 6.0 for VirtualBox and 2.2.6 for Vagrant. If one of them is too new, the other may not support it and things may not work properly, so unless you have a specific reason, we recommend using these versions.

Creating a virtual machine

Finally create a virtual machine. To do this, use the following command to create a directory where you can work and enter it.

~$ mkdir -p ~/Vagrant/ubuntu18
~$ cd Vagrant/ubuntu18

Initialize Vagrant. At this time, a Vagrant configuration file called Vagrantfile is generated.


~/Vagrant/ubuntu18$ vagrant init ubuntu/bionic64

We will modify this file, but before doing so, install the plugin needed to increase the disk capacity of the virtual machine.

~$ vagrant plugin install vagrant-disksize

This makes it possible to increase the capacity by writing an appropriate description in the Vagrantfile. If you do not do this, you may get an error due to insufficient capacity after starting the virtual machine, so be careful.

Open the generated Vagrantfile and modify it as follows.


Vagrant.configure("2") do |config|
  config.vm.box = "ubuntu/bionic64"
  config.vm.define "debugger" do |c|
    c.vm.provider "virtualbox" do |vb|
      vb.customize ["modifyvm", :id, "--uart2", "0x2F8", "3"]
      vb.customize ["modifyvm", :id, "--uartmode2", "server", "/tmp/vagrant-ttyS1"]
      vb.memory = "8192"
    end
  end
  config.vm.define "debuggee" do |c|
    c.vm.provider "virtualbox" do |vb|
      vb.customize ["modifyvm", :id, "--uart2", "0x2F8", "3"]
      vb.customize ["modifyvm", :id, "--uartmode2", "client", "/tmp/vagrant-ttyS1"]
    end
  end
  config.disksize.size = '100GB'
end

Finally, start the virtual machines with the following command.

~/Vagrant/ubuntu18$ vagrant up

When both debugger and debuggee start without any problem, enter the following command to open debuggee in the terminal.

~/Vagrant/ubuntu18$ vagrant ssh debuggee

Add debuggee settings with the following command.

debuggee:~$ sudo systemctl enable serial-getty@ttyS1.service

Restart the virtual machine with the following command.

~/Vagrant/ubuntu18$ vagrant reload 

Now open the debugger and see if serial communication is possible.

debugger:~$ sudo screen /dev/ttyS1
<Press Enter>
Ubuntu 18.04.3 LTS ubuntu-bionic ttyS1

ubuntu-bionic login:

kgdb settings

At this point, you can start the virtual machine, but it is difficult to debug as it is, so rebuild the kernel to make it compatible with kgdb.

Build the kernel development environment in the debugger with the following command.

debugger:~$ sudo apt-get install git build-essential kernel-package fakeroot libncurses5-dev libssl-dev ccache bison flex gdb

Then download the kernel source code in the debugger's home directory. This time, we are targeting version 5.3.9.

debugger:~$ wget https://cdn.kernel.org/pub/linux/kernel/v5.x/linux-5.3.9.tar.xz

Unzip the download file.

debugger:~$ tar Jxfv ./linux-5.3.9.tar.xz

Enter the generated directory and set the config.

debugger:~$ cd linux-5.3.9
debugger:~/linux-5.3.9$ cp /boot/config-`uname -r` .config
debugger:~/linux-5.3.9$ yes '' | make oldconfig
debugger:~/linux-5.3.9$ make menuconfig

The config setting screen opens, so look for the following items, check the first five, and uncheck the last one.

Kernel hacking -> [*]KGDB: kernel debugger
Kernel hacking -> KGDB: kernel debugger -> [*]KGDB: use kgdb over the serial console
Kernel hacking -> KGDB: kernel debugger -> [*]KGDB: internal test suite
Kernel hacking -> KGDB: kernel debugger -> [*]KGDB: Allow debugging with traps in notifiers
Kernel hacking -> KGDB: kernel debugger ->  [*]KGDB_KDB: include kdb frontend for kgdb
Processor type and features -> [ ]Randomize the address of the kernel image (KASLR)

Build the kernel source code. Depending on your computer, this can take several hours, so take a break and let the computer do the hard work.

debugger:~/linux-5.3.9$ make clean
debugger:~/linux-5.3.9$ make -j `getconf _NPROCESSORS_ONLN` deb-pkg

Share the packages generated by the build with debuggee. The directory /vagrant is set up as a shared folder, so move the packages there first.

debugger:~/linux-5.3.9$ mv ../linux-headers-5.3.9_5.3.9-1_amd64.deb /vagrant
debugger:~/linux-5.3.9$ mv ../linux-libc-dev_5.3.9-1_amd64.deb /vagrant
debugger:~/linux-5.3.9$ mv ../linux-image-5.3.9_5.3.9-1_amd64.deb /vagrant
debugger:~/linux-5.3.9$ mv ../linux-image-5.3.9-dbg_5.3.9-1_amd64.deb /vagrant

Open debuggee and install the package you shared earlier.

debuggee:~$ sudo dpkg -i /vagrant/linux-headers-5.3.9_5.3.9-1_amd64.deb
debuggee:~$ sudo dpkg -i /vagrant/linux-libc-dev_5.3.9-1_amd64.deb
debuggee:~$ sudo dpkg -i /vagrant/linux-image-5.3.9_5.3.9-1_amd64.deb
debuggee:~$ sudo dpkg -i /vagrant/linux-image-5.3.9-dbg_5.3.9-1_amd64.deb

Now you can debug with kgdb. We recommend restarting the virtual machine once the new packages are installed; otherwise kgdb may not work properly, or the kernel itself may misbehave.

Also, if you install the Remote-SSH extension for VS Code, you can edit files in the virtual machine with VS Code, so if you are used to VS Code or cannot use vim for religious reasons, give it a try.

At this point, the environment for making changes to the linux source code and actually running it is ready. Now, let's actually play with the source code from here.

Let's make a new system call (written by Kikuoka)

As the heading says, the goal is to add a new system call and actually call it to see how it works. A system call is a function that a program uses to invoke a service of the operating system (kernel). File input/output and network communication are implemented as system calls; this time we will add one that takes two integer arguments and adds them together. For the implementation, [this site](https://pr0gr4m.tistory.com/entry/Linux-Kernel-5-system-call-%EC%B6%94%EA%B0%80%ED%95%98%EA%B8%B0?category=714120) is in Korean, but it was very helpful. The procedure is described below.

Add system call

First, look for the file arch/x86/entry/syscalls/syscall_64.tbl in the Linux source code. It lists the system calls implemented in the current version in table form.

Screenshot from 2020-11-02 14-38-45.png

All you have to do is add the system call you want under an unused number here. Copying the format of the existing entries, it looks like this.

Screenshot from 2020-11-02 14-45-41.png

548     64      mycall                  __x64_sys_mycall

The entry mycall at number 548 is the system call we add. Of course, this alone does nothing: the table entry has no implementation behind it yet.

Next, open include/linux/syscalls.h. It contains the prototype declarations of the implemented system calls.

Screenshot from 2020-11-02 14-59-46.png

The function name here corresponds to the part after __x64_ in the fourth column of the table entry added above. So declare the prototype of sys_mycall at an appropriate position.

Screenshot from 2020-11-02 15-06-33.png

asmlinkage long sys_mycall(int a, int b, int __user *to_user);

a and b are the two values to be added. to_user is a pointer into the memory of the process that made the system call; the kernel needs it to pass back the result of the calculation.

After the declaration, it is finally time to define the function itself. Create a file called mycall.c (any name will do) in the kernel directory and define the function there.

Screenshot from 2020-11-02 15-45-52.png

Screenshot from 2020-11-02 15-28-19.png

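#include <linux/kernel.h>
#include <linux/syscalls.h>
#include <asm/processor.h>
#include <linux/uaccess.h>      /* put_user() */

/* mycall(a, b, to_user): store a + b in *to_user and return 21. */
SYSCALL_DEFINE3(mycall, int, a, int, b, int __user *, to_user)
{
        int sum = a + b;

        printk(KERN_INFO "[Kernel Message] a = %d, b = %d\n", a, b);
        /* to_user points into user space, so write through put_user(). */
        if (put_user(sum, to_user))
                return -EFAULT;
        return 21;
}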

SYSCALL_DEFINEn is a macro for defining a system call, where n is the number of arguments the system call takes. There are three arguments this time, so it becomes SYSCALL_DEFINE3. The macro itself takes more arguments than that: the first is the system call name, followed by a type and a name for each argument. When the system call is actually invoked, it goes through the name mycall rather than the macro. A system call with no arguments is defined with SYSCALL_DEFINE0.

At last, the definition of the entity is finished, and if the kernel is built correctly, mycall should be usable.

Let's actually use it

Now check that mycall works. The kernel has to be built again, but as things stand mycall.c would not be compiled, so add it to the Makefile in the kernel directory.

Screenshot from 2020-11-02 15-45-52.png

obj-y     = fork.o exec_domain.o panic.o \
.................................................
            mycall.o

Then build again, install the package with debuggee, reboot, and you'll be able to use the new system calls.

Let's run the test code as a trial. Screenshot from 2020-11-02 15-54-48.png

When this code is compiled and executed, the output is as follows.

Screenshot from 2020-11-02 15-57-28.png

The correct calculation result is output as the sum value, and the return value of mycall, 21, is output as the ret value, indicating that the system call is being called correctly.

Also check the kernel messages with the dmesg command.

Screenshot from 2020-11-02 16-04-00.png

The values of a and b are output as kernel messages at the end, indicating that the values can be passed correctly to the kernel.
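The test program itself appears only in the screenshot above. As a rough idea of what such a program can look like, here is a minimal sketch. It assumes the number 548 registered in syscall_64.tbl, and the wrapper name call_mycall is made up for illustration; it is not taken from the screenshot.

```c
#define _GNU_SOURCE
#include <unistd.h>
#include <sys/syscall.h>

/* Assumption: the number given to mycall in syscall_64.tbl. */
#define MYCALL_NR 548

/*
 * Hypothetical user-space wrapper (the name is ours).
 * Returns whatever the system call returns, or -1 with errno set
 * when the running kernel has no such system call.
 */
static long call_mycall(int a, int b, int *sum)
{
	return syscall(MYCALL_NR, a, b, sum);
}
```

Calling call_mycall(3, 4, &sum) from main and printing sum and the return value should, on the modified kernel, show the sum of the two arguments and the return value 21. On an unmodified kernel the call is expected to fail with -1.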

Let's make a scheduling class (written by Cho Jae Hyun)

Algorithm implementation

What is random scheduling?

Random scheduling is a method in which the next task to run on the CPU is chosen at random from the waiting tasks.

Random run queue implementation

struct random_rq{
	struct list_head task_list;
	unsigned int random_nr_running;
	int random_queued;
	int random_throttled;
	u64 random_time;
	u64 random_runtime;
};

Description of list_head

list_head is the linked list data structure used throughout the Linux source code. It is defined in include/linux/types.h under the Linux source directory (the tools tree carries its own copy in tools/include/linux/types.h).

struct list_head {
	struct list_head *next, *prev;
};

Typically, a linked list is implemented by putting next and prev pointers to the same structure type inside the data structure. (See code below)

struct linked_list{
	int data; //data
	struct linked_list *next, *prev; //Structure pointer
};

With list_head, however, a linked list is implemented by embedding a list_head member in the data structure. (See code below)

struct linked_list{
	int data; //data
	struct list_head list; //Linked list
};

Next, let's look at how list_head links elements (nodes) together. It uses the following functions, which are defined in include/linux/list.h under the Linux source directory.

INIT_LIST_HEAD initializes list_head by setting next and prev of list_head to itself.

static inline void INIT_LIST_HEAD(struct list_head *list)
{
	WRITE_ONCE(list->next, list);
	list->prev = list;
}

In a figure it looks like this: a circular doubly linked list that contains only the head. figure_list_01.png

list_add and list_add_tail are functions that add elements to the linked list, and receive the pointer of the head that is the head of the list and the pointer of the element that you want to add as arguments.

static inline void __list_add(struct list_head *new,
			      struct list_head *prev,
			      struct list_head *next)
{
	if (!__list_add_valid(new, prev, next))
		return;

	next->prev = new;
	new->next = next;
	new->prev = prev;
	WRITE_ONCE(prev->next, new);
}

static inline void list_add(struct list_head *new, struct list_head *head)
{
	__list_add(new, head, head->next);
}

static inline void list_add_tail(struct list_head *new, struct list_head *head)
{
	__list_add(new, head->prev, head);
}

Both list_add and list_add_tail are built on __list_add, which inserts new between prev and next. The difference is where new goes: list_add inserts it between head and head->next (the element after head), while list_add_tail inserts it between head->prev (the element before head) and head. In other words, list_add puts the element at the beginning of the list, and list_add_tail puts it at the end.

figure_list_02.png figure_list_03.png
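The difference in ordering can be checked outside the kernel. The following is a minimal user-space sketch, not kernel code: the list helpers are simplified copies (no WRITE_ONCE, no validity checks), and struct item, list_insert and demo_order are names made up for this illustration.

```c
#include <stddef.h>

/* Simplified user-space copies of the kernel list helpers. */
struct list_head {
	struct list_head *next, *prev;
};

static void INIT_LIST_HEAD(struct list_head *list)
{
	list->next = list;
	list->prev = list;
}

/* Link node in between two neighbouring entries (role of __list_add). */
static void list_insert(struct list_head *node,
			struct list_head *prev, struct list_head *next)
{
	next->prev = node;
	node->next = next;
	node->prev = prev;
	prev->next = node;
}

static void list_add(struct list_head *node, struct list_head *head)
{
	list_insert(node, head, head->next);
}

static void list_add_tail(struct list_head *node, struct list_head *head)
{
	list_insert(node, head->prev, head);
}

struct item {
	int data;
	struct list_head list;
};

/*
 * Hypothetical demo: insert 1, 2, 3 with list_add (use_tail == 0) or
 * list_add_tail (use_tail != 0), then write the data values to out[]
 * in list order. Returns the number of elements visited.
 */
static int demo_order(int use_tail, int out[3])
{
	struct list_head head, *pos;
	struct item items[3];
	int i, n = 0;

	INIT_LIST_HEAD(&head);
	for (i = 0; i < 3; i++) {
		items[i].data = i + 1;
		if (use_tail)
			list_add_tail(&items[i].list, &head);
		else
			list_add(&items[i].list, &head);
	}
	for (pos = head.next; pos != &head; pos = pos->next) {
		struct item *it = (struct item *)
			((char *)pos - offsetof(struct item, list));
		out[n++] = it->data;
	}
	return n;
}
```

With list_add the values come out as 3, 2, 1, because each new element goes to the front; with list_add_tail they come out as 1, 2, 3.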

list_del and list_del_init are functions used to remove elements from a linked list.

static inline void __list_del(struct list_head * prev, struct list_head * next)
{
	next->prev = prev;
	WRITE_ONCE(prev->next, next);
}

static inline void __list_del_entry(struct list_head *entry)
{
	if (!__list_del_entry_valid(entry))
		return;

	__list_del(entry->prev, entry->next);
}

static inline void list_del(struct list_head *entry)
{
	__list_del_entry(entry);
	entry->next = LIST_POISON1;
	entry->prev = LIST_POISON2;
}

static inline void list_del_init(struct list_head *entry)
{
	__list_del_entry(entry);
	INIT_LIST_HEAD(entry);
}

The difference between list_del and list_del_init lies in what happens to the next and prev pointers of the removed element. list_del sets them to LIST_POISON1 and LIST_POISON2, which are deliberately invalid pointers, while list_del_init makes them point to the element itself. (Description of LIST_POISON: https://lists.kernelnewbies.org/pipermail/kernelnewbies/2016-March/015879.html)
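One practical consequence is that an element removed with list_del_init looks like an empty list of its own, so it can be tested or added to a list again safely. The sketch below shows this in user space; the helpers are simplified copies of the kernel ones, and demo_del_init is a name invented for the illustration.

```c
#include <stdbool.h>

/* Simplified user-space copies of the kernel list helpers. */
struct list_head {
	struct list_head *next, *prev;
};

static void INIT_LIST_HEAD(struct list_head *list)
{
	list->next = list;
	list->prev = list;
}

static void list_add_tail(struct list_head *node, struct list_head *head)
{
	struct list_head *prev = head->prev;

	head->prev = node;
	node->next = head;
	node->prev = prev;
	prev->next = node;
}

/* Unlink entry, then make it point to itself again. */
static void list_del_init(struct list_head *entry)
{
	entry->next->prev = entry->prev;
	entry->prev->next = entry->next;
	INIT_LIST_HEAD(entry);
}

static bool list_empty(const struct list_head *head)
{
	return head->next == head;
}

/*
 * Hypothetical demo: build the list {a, b}, remove b with list_del_init,
 * and report whether the list is now {a} and b is self-linked.
 */
static bool demo_del_init(void)
{
	struct list_head head, a, b;

	INIT_LIST_HEAD(&head);
	list_add_tail(&a, &head);
	list_add_tail(&b, &head);
	list_del_init(&b);

	return head.next == &a && a.next == &head &&
	       head.prev == &a && list_empty(&b);
}
```

With list_del the same check on b would not hold, since its pointers would be the poison values rather than b itself.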

In random_rq, list_head is used to hold the list of tasks handled by the random scheduler, and we need to select the n-th task, where n is a random integer smaller than the number of tasks in the list. For this I used list_for_each, a macro that walks the list.

#define list_for_each(pos, head) \
	for (pos = (head)->next; pos != (head); pos = pos->next)

It simply starts at the element after head and follows next until it gets back to head.

This covers the linked list itself, but a list_head carries no data, so even after finding the n-th element with list_for_each we still do not have the data that belongs to it. To get at that data, Linux uses a macro called list_entry. (Defined in include/linux/list.h)

#define list_entry(ptr, type, member) \
	container_of(ptr, type, member)

The container_of macro looks like this: (Written in /include/linux/kernel.h)

#define container_of(ptr, type, member) ({				\
	void *__mptr = (void *)(ptr);					\
	BUILD_BUG_ON_MSG(!__same_type(*(ptr), ((type *)0)->member) &&	\
			 !__same_type(*(ptr), void),			\
			 "pointer type mismatch in container_of()");	\
	((type *)(__mptr - offsetof(type, member))); })

The offsetof macro looks like this:

#define offsetof(TYPE, MEMBER) ((size_t) &((TYPE *)0)->MEMBER)

A detailed explanation is given at https://stackoverflow.com/questions/5550404/list-entry-in-linux, but in brief, the macro takes three arguments: ptr, a pointer to the list_head; type, the type of the structure that holds the data you want (struct linked_list in the example above); and member, the name of the list_head member inside that structure (list in the example above). It computes the number of bytes from the start of the structure to the list_head member (this is the role of offsetof) and subtracts that from ptr, which gives a pointer to the structure that holds the data. The figure is as follows.

figure_list_04.png
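The pointer arithmetic can be tried in user space. The sketch below uses a simplified container_of without the kernel's compile-time type check, and data_of is a helper name made up for this example.

```c
#include <stddef.h>

struct list_head {
	struct list_head *next, *prev;
};

struct linked_list {
	int data;
	struct list_head list;
};

/* Simplified version: step back from the member to the start of the struct. */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

#define list_entry(ptr, type, member) \
	container_of(ptr, type, member)

/* Hypothetical helper: recover the data from a pointer to the embedded list_head. */
static int data_of(struct list_head *ptr)
{
	return list_entry(ptr, struct linked_list, list)->data;
}
```

Given a struct linked_list variable n, list_entry(&n.list, struct linked_list, list) yields &n again, so data_of(&n.list) returns n.data.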

Implementation of each function

The part actually implemented in the function required for the scheduler is as follows. (Reference: https://trepo.tuni.fi/bitstream/handle/10024/96864/GRADU-1428493916.pdf, p.27 ~ p.28)

enqueue_task

static void enqueue_task_random(struct rq *rq, struct task_struct *p, int flags){
	enqueue_list(&rq->rd_rq,p,flags);
	add_nr_running(rq,1);
	rq->rd_rq.random_nr_running++;
}

Here the enqueue_list looks like this:

void enqueue_list(struct random_rq *rd_rq,struct task_struct *p, int flags){
	INIT_LIST_HEAD(&p->rd_list);
	list_add_tail(&p->rd_list,&rd_rq->task_list);
}

dequeue_task

static void dequeue_task_random(struct rq *rq, struct task_struct *p, int flags){
	dequeue_list(&rq->rd_rq,p,flags);
	sub_nr_running(rq,1);
	rq->rd_rq.random_nr_running--;
}

Here the dequeue_list looks like this:

void dequeue_list(struct random_rq *rd_rq,struct task_struct *p, int flags){
	list_del_init(&p->rd_list);
}

yield_task

static void yield_task_random(struct rq *rq){
	struct task_struct *p = rq->curr; //TODO
	dequeue_list(&rq->rd_rq,p,0);
	enqueue_list(&rq->rd_rq,p,0);
}

yield_task can be implemented by moving the current task to the end of the list: remove the current task p with dequeue_list and put it back with enqueue_list.
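The effect of this dequeue-then-enqueue pair can be seen in a small user-space sketch. The list helpers are simplified copies of the kernel ones, and struct task, yield_to_tail and demo_yield are illustrative names, not part of the scheduler code above.

```c
#include <stddef.h>

struct list_head {
	struct list_head *next, *prev;
};

static void INIT_LIST_HEAD(struct list_head *list)
{
	list->next = list;
	list->prev = list;
}

static void list_add_tail(struct list_head *node, struct list_head *head)
{
	struct list_head *prev = head->prev;

	head->prev = node;
	node->next = head;
	node->prev = prev;
	prev->next = node;
}

static void list_del_init(struct list_head *entry)
{
	entry->next->prev = entry->prev;
	entry->prev->next = entry->next;
	INIT_LIST_HEAD(entry);
}

/* Stand-in for task_struct with only the fields needed here. */
struct task {
	int id;
	struct list_head rd_list;
};

/* Same two steps as yield_task_random: dequeue, then enqueue at the tail. */
static void yield_to_tail(struct list_head *queue, struct task *p)
{
	list_del_init(&p->rd_list);
	list_add_tail(&p->rd_list, queue);
}

/* Hypothetical demo: queue tasks 1, 2, 3, let task 1 yield, record the order. */
static int demo_yield(int out[3])
{
	struct list_head queue, *pos;
	struct task t[3];
	int i, n = 0;

	INIT_LIST_HEAD(&queue);
	for (i = 0; i < 3; i++) {
		t[i].id = i + 1;
		list_add_tail(&t[i].rd_list, &queue);
	}
	yield_to_tail(&queue, &t[0]);
	for (pos = queue.next; pos != &queue; pos = pos->next)
		out[n++] = ((struct task *)
			((char *)pos - offsetof(struct task, rd_list)))->id;
	return n;
}
```

After task 1 yields, the queue order becomes 2, 3, 1.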

pick_next_task

static struct task_struct *pick_next_task_random(struct rq *rq, struct task_struct *prev, struct rq_flags *rf){
	struct task_struct *p;
	struct random_rq *rd_rq = &rq->rd_rq;

	put_prev_task(rq,prev); //Hand back the previous task before picking the next one
	p = pick_random_entity(rd_rq);
	return p;
}

The implementation of the pick_random_entity function is as follows.

static struct task_struct* pick_random_entity(struct random_rq *random_rq){ //Pick one task randomly -> return p
	struct task_struct *p;
	unsigned long random_val;
	unsigned long cnt;
	struct list_head *ptr;
	if(random_rq->random_nr_running){
		random_val = 0UL;
		get_random_bytes(&random_val,sizeof(unsigned long));
		random_val = random_val%(random_rq->random_nr_running);
		cnt = 0UL;
		list_for_each(ptr,&random_rq->task_list){
			if(cnt==random_val){
				p = list_entry(ptr,struct task_struct,rd_list);
				return p;
			}
			cnt++;
		}
	}
	return NULL;
}

get_random_bytes fills random_val with a random number, and taking the remainder after dividing it by the number of tasks in the random run queue gives the desired random number n. list_for_each and list_entry are then used to find the n-th task p.
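The same selection logic can be exercised in user space. In the sketch below, rand() stands in for get_random_bytes (an assumption made so the code runs outside the kernel), the list helpers are simplified copies, and struct task, pick_nth and pick_random are names invented for the illustration.

```c
#include <stddef.h>
#include <stdlib.h>

struct list_head {
	struct list_head *next, *prev;
};

static void INIT_LIST_HEAD(struct list_head *list)
{
	list->next = list;
	list->prev = list;
}

static void list_add_tail(struct list_head *node, struct list_head *head)
{
	struct list_head *prev = head->prev;

	head->prev = node;
	node->next = head;
	node->prev = prev;
	prev->next = node;
}

/* Stand-in for task_struct with only the fields needed here. */
struct task {
	int id;
	struct list_head rd_list;
};

/* Walk the list and return the task at position n (0-based), or NULL. */
static struct task *pick_nth(struct list_head *head, unsigned long n)
{
	struct list_head *pos;
	unsigned long cnt = 0;

	for (pos = head->next; pos != head; pos = pos->next) {
		if (cnt == n)
			return (struct task *)
				((char *)pos - offsetof(struct task, rd_list));
		cnt++;
	}
	return NULL;
}

/* nr is the number of queued tasks; the remainder is always in 0..nr-1. */
static struct task *pick_random(struct list_head *head, unsigned long nr)
{
	if (nr == 0)
		return NULL;
	return pick_nth(head, (unsigned long)rand() % nr);
}
```

As long as nr matches the actual length of the list, the remainder is a valid position and the walk always finds a task; with an empty queue the function returns NULL, as pick_random_entity does.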

put_prev_task

static void put_prev_task_random(struct rq *rq, struct task_struct *p){
	enqueue_list(&rq->rd_rq,p,0);
}

put_prev_task puts the task that has just been running back at the end of the run queue list.

task_tick

static void task_tick_random(struct rq *rq, struct task_struct *p, int queued){
	if(!queued){
		resched_curr(rq);
	}
	return;
}

task_tick is a function that is called at a fixed time interval (the timer tick). The random scheduler does not need to track how long a task has waited or when it should finish, so it only requests rescheduling.

Let the implemented scheduler be used (written by Nada)

Next, make sure that the scheduler class implemented above is actually used.

Here we will mainly modify kernel/sched/core.c (hereinafter abbreviated as core.c).

Allocate a scheduler to a process (strictly task_struct)

// kernel/sched/core.c
int sched_fork(...)
{
	/* (omitted) */
	if (dl_prio(p->prio))
		return -EAGAIN;
	else if (rt_prio(p->prio))
		p->sched_class = &rt_sched_class;
	else
		p->sched_class = &fair_sched_class;
	/* (omitted) */
}

static void __setscheduler(struct rq *rq, struct task_struct *p,
			   const struct sched_attr *attr, bool keep_boost)
{
	/* (omitted) */
	if (dl_prio(p->prio))
		p->sched_class = &dl_sched_class;
	else if (rt_prio(p->prio))
		p->sched_class = &rt_sched_class;
	else
		p->sched_class = &fair_sched_class;
	/* (omitted) */
}

To find out how the kernel handles scheduler classes, I searched for the other scheduler classes (rt_sched_class, fair_sched_class, and so on) with GNU GLOBAL, which turned up the part of core.c shown above.

Perhaps we could add some processing to our random scheduler here.

On closer examination, the functions named {scheduler name}_prio (rt_prio, dl_prio, ...) are defined in the header file {scheduler name}.h.

Looking at the implementation of dl_prio as a trial, it seems that it receives the priority (p-> prio) of task_struct (process) and returns 1 if it is within the priority range that the scheduler is in charge of, otherwise it returns 0.

Example: Implementation of dl_prio

// include/linux/sched/deadline.h
#define MAX_DL_PRIO		0

static inline int dl_prio(int prio)
{
	if (unlikely(prio < MAX_DL_PRIO))
		return 1;
	return 0;
}

Following this, implement a function called rd_prio in the random scheduler's header file, include/linux/sched/random_sched.h.

// include/linux/sched/random_sched.h
static inline int rd_prio(int prio)
{
	if (121 <= prio && prio <= 139)
		return 1;
	return 0;
}

Here, 1 is returned for processes with a priority range of 121 to 139.
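For reference, the prio of a normal (non-real-time) task without priority boosting is derived from its nice value as 120 + nice, so the range 121 to 139 corresponds to nice values 1 to 19. The sketch below copies the relevant constants from include/linux/sched/prio.h into user space to check this; nice_selects_random is a helper name made up for the illustration.

```c
/* Constants as defined in include/linux/sched/prio.h. */
#define MAX_RT_PRIO	100
#define NICE_WIDTH	40
#define DEFAULT_PRIO	(MAX_RT_PRIO + NICE_WIDTH / 2)	/* 120 */
#define NICE_TO_PRIO(nice)	((nice) + DEFAULT_PRIO)

/* Same range check as rd_prio in random_sched.h. */
static int rd_prio(int prio)
{
	if (121 <= prio && prio <= 139)
		return 1;
	return 0;
}

/* Hypothetical helper: would a normal task with this nice value match rd_prio? */
static int nice_selects_random(int nice)
{
	return rd_prio(NICE_TO_PRIO(nice));
}
```

Under these definitions a task with nice 0 (prio 120) does not match, while nice 1 through 19 do.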

Using this function, rewrite core.c as follows.

// kernel/sched/core.c
int sched_fork(...)
{
	/* (omitted) */
	if (dl_prio(p->prio))
		return -EAGAIN;
	else if (rt_prio(p->prio))
		p->sched_class = &rt_sched_class;
	else if (rd_prio(p->prio)) // Added this line
		p->sched_class = &random_sched_class;
	else
		p->sched_class = &fair_sched_class;
	/* (omitted) */
}

static void __setscheduler(struct rq *rq, struct task_struct *p,
			   const struct sched_attr *attr, bool keep_boost)
{
	/* (omitted) */
	if (dl_prio(p->prio))
		p->sched_class = &dl_sched_class;
	else if (rt_prio(p->prio))
		p->sched_class = &rt_sched_class;
	else if (rd_prio(p->prio)) // Added this line
		p->sched_class = &random_sched_class;
	else
		p->sched_class = &fair_sched_class;
	/* (omitted) */
}

The scheduler class is now assigned to the process.

However, you need to include the newly created random_sched.h file in core.c.

Again following the other schedulers, I added the line

#include <linux/sched/random_sched.h>

at the place in kernel/sched/sched.h where the include statements are grouped. Since core.c includes kernel/sched/sched.h, core.c can now use the functions in random_sched.h.

Implementation of Policy function

Looking inside kernel/sched/sched.h, the header file shared by the schedulers, the following functions are defined.

static inline int idle_policy(int policy)
{
	return policy == SCHED_IDLE;
}
static inline int fair_policy(int policy)
{
	return policy == SCHED_NORMAL || policy == SCHED_BATCH;
}

static inline int rt_policy(int policy)
{(Omitted)}

static inline int dl_policy(int policy)
{(Omitted)}
static inline bool valid_policy(int policy)
{
	return idle_policy(policy) || fair_policy(policy) ||
		rt_policy(policy) || dl_policy(policy);
}

As you can probably guess, valid_policy checks whether a valid scheduling policy has been specified.

Therefore, the random scheduler follows this, as follows.

// Added this function
static inline int random_policy(int policy)
{
	return policy == SCHED_RANDOM;
}
static inline bool valid_policy(int policy)
{
	return idle_policy(policy) || fair_policy(policy) ||
		rt_policy(policy) || dl_policy(policy) || random_policy(policy); // Fixed this line
}

I implemented a function called random_policy and put it in valid_policy.

The constant SCHED_RANDOM used here is defined in include/uapi/linux/sched.h, following the other constants such as SCHED_NORMAL and SCHED_IDLE, as shown below.

#define SCHED_NORMAL		0
#define SCHED_FIFO		1
#define SCHED_RR		2
#define SCHED_BATCH		3
/* SCHED_ISO: reserved but not implemented yet */
#define SCHED_IDLE		5
#define SCHED_DEADLINE		6
#define SCHED_RANDOM		7

Execution result and Debugging (written by Cho Jae Hyun)

First, I built the kernel in the same way as the build environment described at https://leavatail.hatenablog.com/entry/2019/11/04/235300, and the following error occurred when Linux booted. The Vagrant folder (~/Vagrant/ubuntu18) contains a log of the kernel messages (those printed by printk), ubuntu-bionic-18.04-cloudimg-console.log, so I was able to find out why Linux failed to run.

Error log
[   28.333395] watchdog: BUG: soft lockup - CPU#2 stuck for 22s![kworker/2:1:88]
[   28.333395] Modules linked in:
[   28.333395] CPU: 2 PID: 88 Comm: kworker/2:1 Not tainted 5.3.9+ #1
[   28.333395] Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
[   28.333395] Workqueue: events timer_update_keys
[   28.333395] RIP: 0010:smp_call_function_many+0x239/0x270
[   28.333395] Code: 08 89 c7 e8 e9 f7 90 00 3b 05 77 a8 70 01 0f 83 5c fe ff ff 48 63 c8 48 8b 13 48 03 14 cd 00 a9 3d 82 8b 4a 18 83 e1 01 74 0a <f3> 90 8b 4a 18 83 e1 01 75 f6 eb c7 0f 0b e9 0b fe ff ff 48 c7 c2
[   28.333395] RSP: 0018:ffffc9000028bd08 EFLAGS: 00000202 ORIG_RAX: ffffffffffffff13
[   28.333395] RAX: 0000000000000000 RBX: ffff88803eaab500 RCX: 0000000000000001
[   28.333395] RDX: ffff88803ea30fe0 RSI: 0000000000000000 RDI: ffff88803e445a98
[   28.333395] RBP: ffffc9000028bd40 R08: ffff88803eb40000 R09: ffff88803e403c00
[   28.333395] R10: ffff88803e445a98 R11: 0000000000000000 R12: 0000000000000006
[   28.333395] R13: 000000000002b4c0 R14: ffffffff810390a0 R15: 0000000000000000
[   28.333395] FS:  0000000000000000(0000) GS:ffff88803ea80000(0000) knlGS:0000000000000000
[   28.333395] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   28.333395] CR2: 0000000000000000 CR3: 000000000260a000 CR4: 00000000000406e0
[   28.333395] Call Trace:
[   28.333395]  ? poke_int3_handler+0x70/0x70
[   28.333395]  on_each_cpu+0x2d/0x60
[   28.333395]  text_poke_bp_batch+0x8c/0x160
[   28.333395]  arch_jump_label_transform_apply+0x33/0x50
[   28.333395]  __jump_label_update+0x116/0x160
[   28.333395]  jump_label_update+0xb9/0xd0
[   28.333395]  static_key_enable_cpuslocked+0x5a/0x80
[   28.333395]  static_key_enable+0x1a/0x30
[   28.333395]  timers_update_migration+0x30/0x40
[   28.333395]  timer_update_keys+0x1a/0x40
[   28.333395]  process_one_work+0x1fd/0x3f0
[   28.333395]  worker_thread+0x34/0x410
[   28.333395]  kthread+0x121/0x140
[   28.333395]  ? process_one_work+0x3f0/0x3f0
[   28.333395]  ? kthread_park+0xb0/0xb0
[   28.333395]  ret_from_fork+0x35/0x40

Looking at RIP here, the problem occurred in the smp_call_function_many function. When I looked up what SMP is, it turned out to be symmetric multiprocessing, in which two or more processors share the same memory. The random scheduler does not implement anything for this, so I disabled the SMP option and rebuilt. (Reference: https://en.wikipedia.org/wiki/Symmetric_multiprocessing)

When SMP is disabled

With SMP disabled, booting got as far as "==> debuggee: Machine booted and ready!", but an error occurred at "==> debuggee: Checking for guest additions in VM...". The error log looks like this:

[   31.591660][ T1283] BUG: kernel NULL pointer dereference, address: 0000000000000000
[   31.592313][ T1283] #PF: supervisor read access in kernel mode
[   31.592667][ T1283] #PF: error_code(0x0000) - not-present page
[   31.593014][ T1283] PGD 38cbf067 P4D 38cbf067 PUD 3d23e067 PMD 0 
[   31.593377][ T1283] Oops: 0000 [#1] NOPTI
[   31.593615][ T1283] CPU: 0 PID: 1283 Comm: control Tainted: G           OE     5.3.9+ #20
[   31.594097][ T1283] Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
[   31.594640][ T1283] RIP: 0010:rb_erase+0x149/0x380
[   31.594927][ T1283] Code: f6 c2 01 0f 84 c2 01 00 00 48 83 e2 fc 0f 84 ee 00 00 00 48 89 c1 48 89 d0 48 8b 50 08 48 39 ca 0f 85 71 ff ff ff 48 8b 50 10 <f6> 02 01 48 8b 4a 08 75 3a 48 89 c7 48 89 48 10 48 89 42 08 48 83
[   31.596105][ T1283] RSP: 0018:ffffc90001df7988 EFLAGS: 00010046
[   31.596454][ T1283] RAX: ffff88803deb0060 RBX: ffff888037a30000 RCX: 0000000000000000
[   31.596909][ T1283] RDX: 0000000000000000 RSI: ffffffff8245ee90 RDI: ffff888037a30060
[   31.597381][ T1283] RBP: ffffc90001df7988 R08: 0000000000000000 R09: ffffc90000343b88
[   31.597846][ T1283] R10: 00000000faccb043 R11: 0000000078dc05ec R12: ffffffff8245ee40
[   31.598322][ T1283] R13: 0000000000000009 R14: ffff888037a30060 R15: 0000000000000001
[   31.598787][ T1283] FS:  00007fef954dc700(0000) GS:ffffffff82447000(0000) knlGS:0000000000000000
[   31.599314][ T1283] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   31.599694][ T1283] CR2: 0000000000000000 CR3: 000000003de92000 CR4: 00000000000406f0
[   31.600157][ T1283] Call Trace:
[   31.600347][ T1283]  dequeue_task_fair+0x9f/0x2a0
[   31.600687][ T1283]  deactivate_task+0x57/0xf0
[   31.600957][ T1283]  ? update_rq_clock+0x2c/0x80
[   31.601235][ T1283]  __schedule+0x344/0x5d0
[   31.601559][ T1283]  schedule+0x32/0xa0
[   31.601823][ T1283]  rtR0SemEventMultiLnxWait.isra.3+0x33b/0x370 [vboxguest]
[   31.602298][ T1283]  ? wait_woken+0x90/0x90
[   31.602569][ T1283]  VBoxGuest_RTSemEventMultiWaitEx+0xe/0x10 [vboxguest]
[   31.603009][ T1283]  VBoxGuest_RTSemEventMultiWaitNoResume+0x28/0x30 [vboxguest]
[   31.603496][ T1283]  vgdrvHgcmAsyncWaitCallbackWorker+0xda/0x210 [vboxguest]
[   31.603925][ T1283]  vgdrvHgcmAsyncWaitCallbackInterruptible+0x15/0x20 [vboxguest]
[   31.604385][ T1283]  VbglR0HGCMInternalCall+0x3ff/0x1180 [vboxguest]
[   31.604764][ T1283]  ? vgdrvHgcmAsyncWaitCallback+0x20/0x20 [vboxguest]
[   31.605168][ T1283]  ? prep_new_page+0x8e/0x130
[   31.605435][ T1283]  ? get_page_from_freelist+0x6db/0x1160
[   31.605827][ T1283]  ? page_counter_cancel+0x22/0x30
[   31.606122][ T1283]  ? page_counter_uncharge+0x22/0x40
[   31.606426][ T1283]  ? drain_stock.isra.49.constprop.76+0x33/0xb0
[   31.606795][ T1283]  ? try_charge+0x62e/0x760
[   31.607062][ T1283]  ? tomoyo_init_request_info+0x80/0x90
[   31.607407][ T1283]  vgdrvIoCtl_HGCMCallWrapper+0x127/0x2c0 [vboxguest]
[   31.607856][ T1283]  VGDrvCommonIoCtl+0x3ca/0x1a20 [vboxguest]
[   31.608234][ T1283]  ? __check_object_size+0xdd/0x1a0
[   31.609064][ T1283]  ? _copy_from_user+0x3d/0x60
[   31.609829][ T1283]  vgdrvLinuxIOCtl+0x113/0x290 [vboxguest]
[   31.610671][ T1283]  do_vfs_ioctl+0xa9/0x620
[   31.611427][ T1283]  ? tomoyo_file_ioctl+0x19/0x20
[   31.612197][ T1283]  ksys_ioctl+0x75/0x80
[   31.612901][ T1283]  __x64_sys_ioctl+0x1a/0x20
[   31.613633][ T1283]  do_syscall_64+0x59/0x130
[   31.614355][ T1283]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[   31.615173][ T1283] RIP: 0033:0x7fef965f56d7
[   31.615877][ T1283] Code: b3 66 90 48 8b 05 b1 47 2d 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 81 47 2d 00 f7 d8 64 89 01 48
[   31.618044][ T1283] RSP: 002b:00007fef954dba18 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[   31.619237][ T1283] RAX: ffffffffffffffda RBX: 00007fef954dba60 RCX: 00007fef965f56d7
[   31.620163][ T1283] RDX: 00007fef954dba60 RSI: 00000000c0485607 RDI: 0000000000000003
[   31.621066][ T1283] RBP: 00007fef954dba20 R08: 0000000000000079 R09: 0000000000000000
[   31.621966][ T1283] R10: 00007fef900008d0 R11: 0000000000000246 R12: 0000000000693410
[   31.622844][ T1283] R13: 00007fef954dbdbc R14: 00007fef954dbdac R15: 00007fef954dcdac
[   31.623731][ T1283] Modules linked in: nls_utf8 isofs snd_intel8x0 snd_ac97_codec ac97_bus input_leds snd_pcm snd_timer serio_raw snd soundcore vboxguest(OE) mac_hid sch_fq_codel ib_iser rdma_cm iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi ip_tables x_tables autofs4 btrfs zstd_compress raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel aes_x86_64 crypto_simd cryptd glue_helper vboxvideo drm_vram_helper ttm drm_kms_helper syscopyarea psmouse sysfillrect sysimgblt mptspi mptscsih mptbase scsi_transport_spi i2c_piix4 fb_sys_fops e1000 drm pata_acpi video
[   31.630278][ T1283] CR2: 0000000000000000
[   31.631004][ T1283] ---[ end trace 7027dee837a160c7 ]---

However, when I booted the virtual machine directly from Oracle VM VirtualBox, it started normally, so I ran the test code there.

Test code
#define _GNU_SOURCE /* needed for sched_getcpu() and syscall() */
#include <assert.h>
#include <sched.h>
#include <sys/syscall.h>
#include <sys/time.h>
#include <sys/types.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/* Record that the process ran on CPU `proc` from time `begin` to time `end` */
typedef struct {
  double begin;
  double end;
  int proc;
} rec_t;

/*Get the current time*/
double cur_time() {
  struct timeval tp[1];
  gettimeofday(tp, 0);
  return tp->tv_sec + tp->tv_usec * 1.0E-6;
}

void die(char * s) {
  perror(s);
  exit(1);
}

/* Keep running for T seconds and record the time intervals during which the CPU appears to have been assigned to this process */
int run(double T, int n) {
  pid_t pid = getpid();
  struct sched_param param;
  double limit = cur_time() + T;
  rec_t * R = (rec_t *)calloc(n, sizeof(rec_t));
  if (R == NULL) die("calloc");
  int i = 0;
  R[i].begin = R[i].end = cur_time();
  R[i].proc = sched_getcpu();

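  /* 145 is the x86_64 system call number of sched_getscheduler */
  int ret_1 = syscall(145,pid);
  printf("GET_SCHEDULER : %d\n",ret_1);

  /* 144 is the x86_64 system call number of sched_setscheduler;
     policy 7 is the number used here for the added SCHED_RANDOM policy */
  param.sched_priority = 139;
  int ret_2 = syscall(144,pid,7,&param);
  if(ret_2==0){
    printf("SET_SCHEDULER TO SCHED_RANDOM SUCCESS\n");
  } else {
    printf("SET_SCHEDULER TO SCHED_RANDOM FAILED\n");
  }

  /* Read the policy again to check whether it changed */
  ret_1 = syscall(145,pid);
  printf("GET_SCHEDULER : %d\n",ret_1);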

  while (R[i].end < limit && i < n) {
    double t = cur_time(); /*Get the current time*/
    int proc = sched_getcpu();
    if (t - R[i].end < 1.0E-3 && proc == R[i].proc) {
      /* Less than 1 ms since the last observation, on the same CPU -> extend R[i].end */
      R[i].end = t;
    } else {
      /* 1 ms or more has passed, or the CPU changed -> start a new interval */
      if (i + 1 >= n) break;
      i++;
      R[i].proc = proc;
      R[i].begin = R[i].end = cur_time();
    }
  }
  assert(i < n);
  int j;
  /* Print one line per interval: pid, begin, end, CPU, duration */
  for (j = 0; j <= i; j++) {
    printf("%d %f %f %d %f\n",
           pid, R[j].begin, R[j].end, R[j].proc,
           R[j].end - R[j].begin);
  }
  free(R);
  return 0;
}

int main(int argc, char ** argv) {
  double T = (argc > 1 ? atof(argv[1]) : 10.0);
  int n    = (argc > 2 ? atoi(argv[2]) : 100000);
  run(T, n);
  return 0;
}

The program takes the run time T (in seconds) from the arguments passed on the terminal and uses the sched_setscheduler system call (number 144) to switch its scheduling policy to the random scheduler. The priority is set from the terminal with the nice command.

vagrant@ubuntu-bionic:~$ nice -15 ./test_2

The nice command changes the nice value, and the command above sets it to 15. The priority therefore becomes 120 + 15 = 135, which places the task in the random scheduler.

Execution result

When I ran it, I was able to confirm that the task entered the random scheduler's queue, but then the following error occurred.

[  107.960596][    C0] BUG: kernel NULL pointer dereference, address: 0000000000000058
[  107.961544][    C0] #PF: supervisor read access in kernel mode
[  107.962053][    C0] #PF: error_code(0x0000) - not-present page
[  107.962552][    C0] PGD 3abb1067 P4D 3abb1067 PUD 3abb0067 PMD 0 
[  107.963078][    C0] Oops: 0000 [#1] NOPTI
[  107.963432][    C0] CPU: 0 PID: 1515 Comm: lsb_release Tainted: G           OE     5.3.9+ #20
[  107.964161][    C0] Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
[  107.964929][    C0] RIP: 0010:task_tick_fair+0xcc/0x160
[  107.965382][    C0] Code: 73 8b 0d 93 be 3a 01 48 39 ca 72 29 48 8b 0d 73 b5 3a 01 48 8d 51 e8 48 85 c9 b9 00 00 00 00 48 0f 44 d1 48 8b 8b a0 00 00 00 <48> 2b 4a 58 78 05 48 39 c8 72 12 0f 1f 44 00 00 48 83 c4 08 5b 41
[  107.967048][    C0] RSP: 0000:ffffc90000003e78 EFLAGS: 00010046
[  107.967563][    C0] RAX: 000000002d17f460 RBX: ffff88803abd8000 RCX: 000000076403fcc9
[  107.968239][    C0] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 000b45fd181a81d0
[  107.968913][    C0] RBP: ffffc90000003ea0 R08: 003ffff0b9e49c00 R09: 0000000000000400
[  107.969584][    C0] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
[  107.970253][    C0] R13: ffff88803abd8048 R14: ffffffff810f4650 R15: 000000191c91658e
[  107.970926][    C0] FS:  00007fd781841740(0000) GS:ffffffff82447000(0000) knlGS:0000000000000000
[  107.971675][    C0] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  107.972231][    C0] CR2: 0000000000000058 CR3: 000000003bf5e000 CR4: 00000000000406f0
[  107.972914][    C0] Call Trace:
[  107.973188][    C0]  <IRQ>
[  107.973437][    C0]  ? tick_sched_do_timer+0x60/0x60
[  107.973867][    C0]  scheduler_tick+0x44/0x60
[  107.974245][    C0]  update_process_times+0x45/0x60
[  107.974666][    C0]  tick_sched_handle+0x25/0x70
[  107.975069][    C0]  ? tick_sched_do_timer+0x52/0x60
[  107.975505][    C0]  tick_sched_timer+0x3b/0x80
[  107.975909][    C0]  __hrtimer_run_queues.constprop.24+0x10e/0x210
[  107.976443][    C0]  hrtimer_interrupt+0xd9/0x240
[  107.976856][    C0]  ? ksoftirqd_running+0x2f/0x40
[  107.977281][    C0]  smp_apic_timer_interrupt+0x68/0x100
[  107.977743][    C0]  apic_timer_interrupt+0xf/0x20
[  107.978607][    C0]  </IRQ>
[  107.979290][    C0] RIP: 0033:0x56fd37
[  107.980039][    C0] Code: ff 00 00 00 0f 8f 89 02 00 00 4c 8d 3c 07 48 81 fe 40 45 9d 00 0f 85 d3 03 00 00 4c 89 f5 4d 89 e2 4c 21 e5 48 0f be 5c 2a 28 <48> 83 fb ff 0f 84 3f 02 00 00 4c 8d 1c 5b 4f 8d 1c df 49 8b 73 08
[  107.982571][    C0] RSP: 002b:00007fff64ad6810 EFLAGS: 00000202 ORIG_RAX: ffffffffffffff13
[  107.983732][    C0] RAX: 0000000000000080 RBX: 0000000000000040 RCX: 00007fff64ad68b0
[  107.984854][    C0] RDX: 000000000275fac0 RSI: 00000000009d4540 RDI: 000000000275fae8
[  107.985985][    C0] RBP: 0000000000000075 R08: 0000000000000000 R09: 00007fd7817ccd80
[  107.987119][    C0] R10: 9aa0515eaaf52cf5 R11: 00007fd781823130 R12: 9aa0515eaaf52cf5
[  107.988255][    C0] R13: 00007fd7817d0db0 R14: 000000000000007f R15: 000000000275fb68
[  107.989397][    C0] Modules linked in: nls_utf8 isofs snd_intel8x0 snd_ac97_codec ac97_bus input_leds snd_pcm serio_raw vboxguest(OE) snd_timer snd soundcore mac_hid sch_fq_codel ib_iser rdma_cm iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi ip_tables x_tables autofs4 btrfs zstd_compress raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel aes_x86_64 crypto_simd cryptd glue_helper vboxvideo drm_vram_helper ttm drm_kms_helper syscopyarea psmouse sysfillrect sysimgblt mptspi mptscsih mptbase scsi_transport_spi i2c_piix4 fb_sys_fops e1000 drm pata_acpi video
[  107.997916][    C0] CR2: 0000000000000058
[  107.998821][    C0] ---[ end trace 717bffdc6fc42d15 ]---

Analyzing this log, the error occurred because the task_tick_fair function dereferenced a NULL pointer ("kernel NULL pointer dereference"). The reported address 0x58, which also appears in CR2, is consistent with reading a structure member through a NULL pointer. Since task_tick_fair is part of the CFS scheduler, my guess is that the error was caused by assigning to the random scheduler a priority range that is normally handled by the CFS scheduler.

In conclusion

When modifying Linux, it is very difficult to understand the whole code base before writing anything, so I recommend using tools such as GNU Global as you go. Last but not least, how did you find it?
