(Added in March 2020)
The ps command on Unix-like operating systems can display a field called wchan. wchan is useful for troubleshooting because it hints at what a sleeping process or thread (a task, in Linux kernel terminology; S or D in the ps STAT column) is waiting for. On traditional Unix, including the *BSDs, code that waits in the kernel is supposed to specify a string, and that string is what appears in wchan. On Linux, by contrast, wchan shows the name of the kernel function in which the task is waiting. I looked into how this function name is obtained.
To wait inside the Linux kernel, code calls a function called schedule(). This invokes the process scheduler, which selects another runnable task and switches to it (or idles the CPU if no task is runnable). When a task waits on a mutex or a semaphore, schedule() ends up being called from inside mutex_lock() or down().
However, neither schedule() nor the mutex- or semaphore-related function names appear in the WCHAN column of ps. If wchan simply showed the function that called schedule(), the column would be full of entries such as mutex_lock (shown as "mutex_", because ps truncates to the first 6 characters by default). Somehow these functions must be skipped, and the function that called them reported as wchan.
Since only the kernel holds task information, ps has to get the information for each task from the kernel. Task information is available through the proc filesystem, so I expected ps to use it, but I checked just in case.
$ strace -o pslog.txt ps l
F   UID   PID  PPID PRI  NI    VSZ    RSS WCHAN  STAT TTY        TIME COMMAND
4 10010  2520  2499  20   0 207112   5532 poll_s Ssl+ tty2       0:00 /usr/lib/gd
4 10010  2522  2520  20   0 516488 155716 ep_pol Sl+  tty2      27:22 /usr/lib/xo
... (omitted)
The system calls issued by ps are recorded in pslog.txt together with their arguments. Skimming through the log, ps first opens the /proc directory and calls getdents(2) to read the entries in it.
pslog.txt
openat(AT_FDCWD, "/proc", O_RDONLY|O_NONBLOCK|O_CLOEXEC|O_DIRECTORY) = 5
fstat(5, {st_mode=S_IFDIR|0555, st_size=0, ...}) = 0
(omitted)
getdents(5, /* 452 entries */, 32768) = 12120
Next, you can see ps reading the stat, status, and cmdline files under each of the entries it retrieved. Occasionally it also reads a file called wchan, probably only for processes that are selected for display and whose state is S or D.
pslog.txt
openat(AT_FDCWD, "/proc/2520/wchan", O_RDONLY) = 6
read(6, "poll_schedule_timeout", 63) = 21
close(6) = 0
You can see that /proc/PID/wchan contains something that looks like a function name, and that its first 6 characters match the "poll_s" shown by ps.
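The relationship between the file contents and the ps column can be sketched in C. This is only an illustration and not how ps itself is implemented: the helper names read_wchan_file() and truncate_wchan() and the constant WCHAN_WIDTH are made up for this example, and the width of 6 is simply what the output above shows.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Column width observed in the ps output above (an assumption of this sketch). */
#define WCHAN_WIDTH 6

/*
 * Read the contents of a wchan file (e.g. "/proc/2520/wchan") into buf.
 * Returns the number of bytes read, or -1 if the file cannot be opened.
 */
long read_wchan_file(const char *path, char *buf, size_t size)
{
    FILE *fp;
    size_t n;

    assert(size > 0);
    fp = fopen(path, "r");
    if (fp == NULL)
        return -1;
    /* In the strace log above the file holds just the name, no newline. */
    n = fread(buf, 1, size - 1, fp);
    buf[n] = '\0';
    fclose(fp);
    return (long)n;
}

/* Copy at most width characters of name into out (needs width + 1 bytes). */
void truncate_wchan(const char *name, size_t width, char *out)
{
    size_t len = strlen(name);

    if (len > width)
        len = width;
    memcpy(out, name, len);
    out[len] = '\0';
}
```

Passing "poll_schedule_timeout" through truncate_wchan() with WCHAN_WIDTH gives "poll_s", the value in the WCHAN column above.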
Searching the Linux source for poll_schedule_timeout turns up a function of that name in fs/select.c. At HEAD at the time of writing it looks like this.
fs/select.c
static int poll_schedule_timeout(struct poll_wqueues *pwq, int state,
                                 ktime_t *expires, unsigned long slack)
{
    int rc = -EINTR;
    set_current_state(state);
    if (!pwq->triggered)
        rc = schedule_hrtimeout_range(expires, slack, HRTIMER_MODE_ABS);
    __set_current_state(TASK_RUNNING);
    /*
     * (Comment omitted)
     */
    smp_store_mb(pwq->triggered, 0);
    return rc;
}
Of the functions called here, schedule_hrtimeout_range() is defined in kernel/time/hrtimer.c, and its comment says that it sleeps until the timeout expires. The other functions that poll_schedule_timeout() calls, such as set_current_state() and smp_store_mb(), do not appear to wait, so the wait must happen somewhere below schedule_hrtimeout_range(). Following schedule_hrtimeout_range() leads to schedule_hrtimeout_range_clock() in the same file, which contains the call to schedule(). So wchan shows neither schedule_hrtimeout_range_clock(), which called schedule(), nor schedule_hrtimeout_range(), which called that, but poll_schedule_timeout(), one level further up.
So how is the content of the wchan file in the proc filesystem computed? The source code for the proc filesystem is in fs/proc/* in the Linux source tree. Searching there for the string wchan finds a function called proc_pid_wchan() in fs/proc/base.c.
fs/proc/base.c
static int proc_pid_wchan(struct seq_file *m, struct pid_namespace *ns,
                          struct pid *pid, struct task_struct *task)
{
    unsigned long wchan;
    char symname[KSYM_NAME_LEN];
    if (!ptrace_may_access(task, PTRACE_MODE_READ_FSCREDS))
        goto print0;
    wchan = get_wchan(task);
    if (wchan && !lookup_symbol_name(wchan, symname)) {
        seq_puts(m, symname);
        return 0;
    }
print0:
    seq_putc(m, '0');
    return 0;
}
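The listing passes the value from get_wchan() to lookup_symbol_name() and prints the resulting string, which suggests an address-to-name lookup. The general idea of such a lookup can be sketched as a search over a table sorted by start address. This is a user-space illustration under my own assumptions, not the kernel's kallsyms implementation: struct sym, symbol_for_address(), and the table layout are all hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* One entry of a hypothetical symbol table, sorted by ascending start address. */
struct sym {
    unsigned long start;
    const char *name;
};

/*
 * Return the name of the last symbol whose start address is <= addr,
 * or NULL if addr lies below the first symbol. The table has no end
 * addresses, so every address above the last start maps to the last symbol.
 */
const char *symbol_for_address(const struct sym *tab, size_t n,
                               unsigned long addr)
{
    size_t lo = 0, hi = n;   /* binary search window [lo, hi) */

    assert(tab != NULL || n == 0);
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;

        if (tab[mid].start <= addr)
            lo = mid + 1;
        else
            hi = mid;
    }
    /* lo now equals the number of entries with start <= addr. */
    return lo == 0 ? NULL : tab[lo - 1].name;
}
```

The point of the sketch is that the address need not be the exact start of a function: a return address in the middle of a function still resolves to that function's name.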
get_wchan() looks like the interesting part. Since the returned wchan is an integer and symname is a string, lookup_symbol_name() apparently treats wchan as an address and converts it to a symbol name. get_wchan() is defined in architecture-dependent code; for x86 it is in arch/x86/kernel/process.c. The key part is as follows.
arch/x86/kernel/process.c
unsigned long get_wchan(struct task_struct *p)
{
    unsigned long start, bottom, top, sp, fp, ip, ret = 0;
    int count = 0;
    (omitted)
    fp = READ_ONCE_NOCHECK(((struct inactive_task_frame *)sp)->bp);
    do {
        if (fp < bottom || fp > top)
            goto out;
        ip = READ_ONCE_NOCHECK(*(unsigned long *)(fp + sizeof(unsigned long)));
        if (!in_sched_functions(ip)) {
            ret = ip;
            goto out;
        }
        fp = READ_ONCE_NOCHECK(*(unsigned long *)fp);
    } while (count++ < 16 && p->state != TASK_RUNNING);
    (rest omitted)
The loop runs at most 16 times, and it also stops once the task's state becomes TASK_RUNNING (runnable; R in the ps STAT column), because the task may start running on another CPU while the loop is in progress.
Linux is compiled with the gcc option -fno-omit-frame-pointer, so every function begins as follows (in reality there is also a mechanism for inserting hooks for debugging purposes).
function:
    push %rbp
    mov  %rsp, %rbp
    ...
I won't go into the machine instructions in detail, but this means that on entry a function saves the rbp register on the stack and then copies the updated stack pointer into rbp. As a result, the memory that rbp points to holds the value rbp had before the function was called, and the next higher address (the stack grows toward address 0) holds the address of the instruction to return to after the function finishes. The rbp register, called the frame pointer, stays unchanged within the function body (a function it calls saves and restores rbp in the same way), and is restored to the original value saved on the stack when the function returns.
Let's go back to the loop. In the assignment before the do, fp receives the rbp value that was saved on the stack when the task switch happened beneath schedule(). The first condition inside the loop checks that fp lies within the task's stack. It should never be out of range, but reading through a bad address could cause a panic, so the check guards against that. The next assignment loads into ip the value at the address just above fp. Because fp holds an rbp value, this is a return address. Judging by its name, in_sched_functions(), which we look at later, returns a boolean saying whether the address belongs to a task scheduler function. If ip is not in a "task scheduler function", get_wchan() returns ip; otherwise it updates fp and goes back to the top of the loop. fp is updated to the value stored at the address fp currently points to. Since fp held an rbp value, this yields the rbp value of the caller.
Walking the stack in this way reconstructs the chain of function calls that led up to the task switch. get_wchan() repeats this step until it leaves the "task scheduler functions".
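The walk described above can be modeled in user space on a fake stack. This is a sketch under stated assumptions and not the kernel code: array indices stand in for addresses, the scheduler range 0x2000 to 0x3000 is invented, the bounds check is simplified, and walk_frames() and in_sched_range() are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

/* Invented address range standing in for the scheduler's code. */
#define SCHED_TEXT_START 0x2000UL
#define SCHED_TEXT_END   0x3000UL

int in_sched_range(unsigned long addr)
{
    return addr >= SCHED_TEXT_START && addr < SCHED_TEXT_END;
}

/*
 * Walk a simulated stack. Word indices play the role of addresses:
 * stack[fp] holds the caller's saved frame pointer and stack[fp + 1]
 * holds the return address, as after "push %rbp; mov %rsp, %rbp".
 * Returns the first return address outside the scheduler range, or 0.
 */
unsigned long walk_frames(const unsigned long *stack, size_t nwords, size_t fp)
{
    unsigned long ip;
    int count = 0;

    assert(stack != NULL);
    do {
        /* Both words of the frame must lie inside the stack. */
        if (fp >= nwords || nwords - fp < 2)
            return 0;
        ip = stack[fp + 1];
        if (!in_sched_range(ip))
            return ip;              /* first caller outside the scheduler */
        fp = (size_t)stack[fp];     /* move up to the caller's frame */
    } while (count++ < 16);
    return 0;
}
```

With a stack whose two innermost return addresses fall inside the invented scheduler range and whose third does not, walk_frames() skips the first two frames and returns the third return address, which mirrors how poll_schedule_timeout() ends up in wchan.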
Finally, let's look at in_sched_functions(), which decides what counts as a "task scheduler function". It is in kernel/sched/core.c.
kernel/sched/core.c
int in_sched_functions(unsigned long addr)
{
    return in_lock_functions(addr) ||
        (addr >= (unsigned long)__sched_text_start
        && addr < (unsigned long)__sched_text_end);
}
in_lock_functions() is in kernel/locking/spinlock.c.
kernel/locking/spinlock.c
notrace int in_lock_functions(unsigned long addr)
{
    /* Linker adds these: start and end of __lockfunc functions */
    extern char __lock_text_start[], __lock_text_end[];
    return addr >= (unsigned long)__lock_text_start
        && addr < (unsigned long)__lock_text_end;
}
Taken together, the function returns whether the argument addr falls in one of two ranges: __sched_text_start to __sched_text_end, or __lock_text_start to __lock_text_end.
__sched_text_start and __sched_text_end are not symbols defined in C or assembler source code. They are symbols generated by the linker when Linux is built.
include/asm-generic/vmlinux.lds.h contains the following definition.
include/asm-generic/vmlinux.lds.h
#define SCHED_TEXT                          \
        ALIGN_FUNCTION();                   \
        __sched_text_start = .;             \
        *(.sched.text)                      \
        __sched_text_end = .;
This file is named .h and is written in C preprocessor syntax, but it is not #included from C source code; it is used by the script given to the linker. In Linux, linker scripts are also run through the C preprocessor before being passed to the linker. The x86 linker script is arch/x86/kernel/vmlinux.lds.S, which does contain #include <asm-generic/vmlinux.lds.h> and a reference to SCHED_TEXT.
The definition quoted above means:
- Assign the current address to the symbol __sched_text_start.
- Gather the code from the .sched.text sections and link it here.
- Assign the current address to the symbol __sched_text_end.
For details on ELF sections, please search the web. The string .sched.text appears in include/linux/sched/debug.h.
include/linux/sched/debug.h
#define __sched __attribute__((__section__(".sched.text")))
In fact, schedule_hrtimeout_range() and schedule_hrtimeout_range_clock(), whose source I did not quote earlier, carry this __sched attribute.
kernel/time/hrtimer.c
int __sched schedule_hrtimeout_range(ktime_t *expires, u64 delta,
                                     const enum hrtimer_mode mode)
{
    return schedule_hrtimeout_range_clock(expires, delta, mode,
                                          CLOCK_MONOTONIC);
}
Naturally it is also attached to schedule() and the like. In fact, every function in the task scheduler carries __sched, and all of them are gathered and linked into the section called .sched.text. __lock_text_start and __lock_text_end work the same way, only for lock-related functions rather than task scheduler functions.
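The same mechanism can be tried in user space. The following is a minimal sketch that assumes GCC (or Clang) with GNU ld or a compatible linker on an ELF system. Instead of a custom linker script it relies on the linker defining __start_NAME and __stop_NAME for a section whose name is a valid C identifier. The demo_sched macro, the section name demo_sched_text, and all function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical marker modeled on the kernel's __sched. */
#define demo_sched __attribute__((section("demo_sched_text"), noinline))

/* Defined by the linker, not in any C source file. */
extern char __start_demo_sched_text[], __stop_demo_sched_text[];

/* Two functions placed in the special section. */
demo_sched int demo_schedule(void) { return 1; }
demo_sched int demo_schedule_timeout(void) { return demo_schedule() + 1; }

/* An ordinary function, left in the default text section. */
int demo_poll_wait(void) { return demo_schedule_timeout() + 1; }

/* Range check in the style of in_sched_functions(). */
int in_demo_sched_functions(uintptr_t addr)
{
    uintptr_t start = (uintptr_t)__start_demo_sched_text;
    uintptr_t end = (uintptr_t)__stop_demo_sched_text;

    assert(start <= end);
    return addr >= start && addr < end;
}
```

Because the marked functions are linked next to each other, one pair of comparisons is enough to classify any code address, with no per-function table.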
By the way, consider the loop in get_wchan() again. Suppose that somewhere in Linux a function A calls a "task scheduler function" B marked __sched, B calls a function C that has no __sched and so is not a "task scheduler function", and C then calls a "task scheduler function" D. If the task waits there, wchan will point to C rather than A, which is the information we actually want. To prevent this, __sched is presumably attached with care to every function that is called from a "task scheduler function" and may wait. Then again, calling a function that may wait from inside the task scheduler would easily lead to nasty situations such as deadlock in the first place, so it may simply be that calls from a "task scheduler function" to functions without __sched are avoided.
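The A, B, C, D scenario can be written down as a tiny model. This is an illustrative sketch with hypothetical names (struct frame, reported_wchan()); it works on function names and a flag rather than on real stack frames.

```c
#include <assert.h>
#include <stddef.h>

/* One frame of a hypothetical call chain; is_sched models the __sched marker. */
struct frame {
    const char *func;
    int is_sched;
};

/*
 * chain[0] is the innermost function (the one that called into the
 * task switch); later entries are its callers. Returns the first
 * function that is not marked as a scheduler function, or NULL.
 */
const char *reported_wchan(const struct frame *chain, size_t n)
{
    size_t i;

    assert(chain != NULL || n == 0);
    for (i = 0; i < n; i++) {
        if (!chain[i].is_sched)
            return chain[i].func;
    }
    return NULL;
}
```

For the chain D (marked), C (unmarked), B (marked), A (unmarked), the model returns C. Marking C as well makes it return A.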
- On Linux, the WCHAN column of ps is obtained from /proc/PID/wchan (not mentioned above, but /proc/PID/task/TID/wchan when listing threads).
- wchan in the proc filesystem gives the symbol name of the function that was waiting by calling a "scheduler function".
- The "scheduler functions" are gathered and linked at contiguous addresses by the linker, and the symbols __sched_text_start and __sched_text_end are placed at the beginning and end, respectively.
- Linux function names keep getting longer, yet ps shows only 6 characters of wchan, which is awful.
What I wrote above does not apply to RHEL 8 (including CentOS 8 and the like) or recent Fedora, because their kernels are no longer compiled with the -fno-omit-frame-pointer option.
$ ps alxc
F   UID   PID  PPID PRI  NI    VSZ    RSS WCHAN  STAT TTY        TIME COMMAND
4     0     1     0  20   0 180900  16508 -      Ss   ?          0:04 systemd
1     0     2     0  20   0      0      0 -      S    ?          0:00 kthread
(omitted)
0    42  2787  2647  20   0 617620  22028 -      Ssl  ?          0:00 gsd-wac
0    42  2797  2647  20   0 522552   8480 -      Ssl  ?          0:00 gsd-wwa
0    42  2798  2647  20   0 947008  61240 -      Ssl  ?          0:00 gsd-xse
0    42  2829  2647  20   0 549532  16176 -      Sl   ?          0:00 gsd-pri
0    42  2873  2647  20   0 160644   6960 -      Ssl  ?          0:00 at-spi2
0  1000  2981  2529  20   0 218684   1304 -      R+   pts/1      0:00 ps
This is because Linux 4.14 introduced a mechanism called the ORC unwinder, which makes it possible to take a backtrace without following frame pointers, while the get_wchan() shown above does not support the ORC unwinder and still follows frame pointers (the ORC unwinder is what prints the backtrace on a panic and the like). Is nobody bothered by this??