[CentOS] The tsc clocksource is missing!? A story of trying to write a kernel patch

On a Linux guest running on KVM, tsc was not being recognized as a clocksource, and I tried to fix it by trial and error. In short: I applied my own patch to the distribution kernel, it didn't feel quite right, and afterwards I found the upstream patches and learned from them. I posted the same content on my blog, but I am also writing it up as an article on Qiita.

Background

This was more than a year ago. While looking at the clocksources of a Linux guest running on KVM, I noticed that tsc was sometimes not recognized. For a Linux guest running on an x86 hypervisor, the TSC is probably the fastest and most accurate counter device, so in some cases you are in trouble if it isn't recognized. At the time I happened to be reading the x86 Linux timekeeping source, so I went looking for the bug.

Specific symptoms

In detail: tsc is not registered as a clocksource. It looks like the following.

$ cat /sys/devices/system/clocksource/clocksource0/available_clocksource 
kvm-clock acpi_pm

Examples and inference

The OS I was using at the time was CentOS 7 with kernel-3.10.0-957.el7. First, I decided to reason from the symptoms, comparing against the Xen case to work out what kind of problem lies in which part.

Example: CentOS7 on KVM

The following is the result of confirmation including the output of dmesg.

$ cat /sys/devices/system/clocksource/clocksource0/available_clocksource 
kvm-clock acpi_pm 

$ dmesg | grep -iE '(clocksource|tsc)'
[    0.000000] tsc: Detected 3000.000 MHz processor
[    1.834422] TSC deadline timer enabled
[    2.600920] Switched to clocksource kvm-clock

I feel that the CentOS 7 kernel prints much less than Ubuntu and others. Even so, the messages above told me that the TSC itself was recognized as a device, which seemed like a good hint.

Next, it is important to compare with a working case. I'm not sure it is a fair comparison, but fortunately I knew that this problem does not occur for Linux guests running on Xen, so I compared against that.

Comparison with Xen: CentOS7 on Xen

$ cat /sys/devices/system/clocksource/clocksource0/available_clocksource 
xen tsc hpet acpi_pm 

$ dmesg | grep -iE '(clocksource|tsc)'
[    0.000000] tsc: Fast TSC calibration using PIT
[    0.000000] tsc: Detected 2394.486 MHz processor
[    0.031000] tsc: Detected 2394.476 MHz TSC
[    0.607022] Switched to clocksource xen
[    2.244080] tsc: Refined TSC clocksource calibration: 2394.455 MHz

Inference

There is a slight difference in how the TSC frequency is detected, but the main difference is the presence or absence of the "Refined TSC clocksource calibration" message. Of course, at this stage this is only a rough guess, and jumping to conclusions must be avoided.

Still, in this amateur's opinion, what is particularly unusual on the problematic side (the guest running on KVM) is that there is no error and nothing is output at all. From this, I could at least guess the following:

Something is probably wrong with the clocksource registration logic itself.
If the clocksource had simply been judged bad, there should normally be a message saying so.

Once you have spent some time playing with Linux, you develop a feel for cases like this, so (personally) I find it faster to read the source code than to use a debugger. It may not be a much-praised method, but I also love combining that with printk debugging.

Primitive debugging

Use printk to leave marks, then debug by reading the source code closely.

I have never known a kernel developer personally, and all I have is knowledge gained from books, magazines, and the net, but I feel that printk debugging is quite good. I don't remember when it was, but I once read an article in an old magazine saying that netfilter developer Rusty Russell solves problems with printk and a staring contest with the source listing [1].

[1] I believe it was an open source magazine. Most of what I know was absorbed from old magazines and books that I bought second hand.

In fact, unless you have to chase jumps through function pointers or investigate some complicated processing logic, you will probably find the problem just by reading the source code. And in the cases where it isn't that simple, I feel that using a debugger can be difficult as well.

I think this case is a typical one that can be solved by printk debugging.

Source preparation

Install and prepare the source with the following commands. I remember following the official CentOS page for these steps.

$ sudo yumdownloader --source kernel-3.10.0-957.el7
$ sudo yum groupinstall "Development Tools" -y 
$ sudo yum install rpmdevtools -y
$ rpmdev-setuptree
$ rpm -Uvh kernel-3.10.0-957.el7.src.rpm
$ sudo yum-builddep -y --enablerepo=* rpmbuild/SPECS/kernel.spec
$ rpmbuild -bp ~/rpmbuild/SPECS/kernel.spec
$ cp -r ~/rpmbuild/BUILD/kernel-3.10.0-957.el7 ~/rpmbuild/BUILD/kernel-3.10.0-957.el7.orig
$ cp -al ~/rpmbuild/BUILD/kernel-3.10.0-957.el7.orig ~/rpmbuild/BUILD/kernel-3.10.0-957.el7.new

Because the second copy is made with cp -al (hard links), configure vim so that it breaks the hard link when writing a file. For example, with the following .vimrc:

set nobackup
set writebackup
set backupcopy=no

Make a macro

Thinking up printk arguments one by one is tedious, so naturally I built a simple macro. For example:

#define __dprintk(n) \
printk("No.%s, func: %s, line: %d, file: %s\n", \
#n, __FUNCTION__, __LINE__, __FILE__);

The function name, line number, and file name are output, as well as the number of the inserted printk.

Where are the likely bugs?

Since registration of the clocksource is not working, the obvious place is where the tsc clocksource is registered. Let's look at the suspicious parts together with how to place the printk calls.

Entrance

Basically, the code related to TSC processing is defined in arch/x86/kernel/tsc.c, and a clocksource is registered with one of the clocksource_register_* functions, so check the call sites.

A quick grep shows the following.

static int __init init_tsc_clocksource(void)
{
    if (!cpu_has_tsc || tsc_disabled > 0 || !tsc_khz)
        return 0;

    if (tsc_clocksource_reliable)
        clocksource_tsc.flags &= ~CLOCK_SOURCE_MUST_VERIFY;
    /* lower the rating if we already know its unstable: */
    if (check_tsc_unstable()) {
        clocksource_tsc.rating = 0;
        clocksource_tsc.flags &= ~CLOCK_SOURCE_IS_CONTINUOUS;
    }

    if (boot_cpu_has(X86_FEATURE_NONSTOP_TSC_S3))
        clocksource_tsc.flags |= CLOCK_SOURCE_SUSPEND_NONSTOP;

    /*
     * Trust the results of the earlier calibration on systems
     * exporting a reliable TSC.
     */
    if (boot_cpu_has(X86_FEATURE_TSC_RELIABLE)) {
        clocksource_register_khz(&clocksource_tsc, tsc_khz);
        return 0;
    }

    schedule_delayed_work(&tsc_irqwork, 0);
    return 0;
}
/*
 * We use device_initcall here, to ensure we run after the hpet
 * is fully initialized, which may occur at fs_initcall time.
 */
device_initcall(init_tsc_clocksource);

There is a clocksource_register_khz call, and there is also a device_initcall. This looks like the right place.

If the TSC frequency is known to be reliable, X86_FEATURE_TSC_RELIABLE is set and clocksource_tsc is registered right away. However, the TSC frequency is calculated at every boot, and normally it isn't known to be reliable, so I guessed that execution does not enter this if block.

But this seemed strange to me in the first place. If you are familiar with the pvclock mechanism, you know that the frequency is provided by the hypervisor through the shared information page, so wouldn't setting X86_FEATURE_TSC_RELIABLE settle everything? Judging from how the problem shows up, though, this does not seem to be its essence, so I set the idea aside for now.

Execution then reaches the end of the function, so it can be inferred that the following call is where the problem lies.

schedule_delayed_work(&tsc_irqwork, 0);

schedule_delayed_work is a frequently used function that many of you may know [2]. It is documented, but you can guess what it does without looking: it takes a pointer to some work and a delay, and runs the work asynchronously.

[2] schedule_delayed_work

How to place the marks

In addition, since there are places inside the init_tsc_clocksource function that seem to return early without printing an error, I place the __dprintk defined earlier, just in case, to check whether execution proceeds to the next step. I don't know how far the processing gets, so I did this to make it easy to follow from the kernel messages later.

static int __init init_tsc_clocksource(void)
{
    if (!cpu_has_tsc || tsc_disabled > 0 || !tsc_khz)
        return 0;
__dprintk(1);
    if (tsc_clocksource_reliable)
        clocksource_tsc.flags &= ~CLOCK_SOURCE_MUST_VERIFY;
    /* lower the rating if we already know its unstable: */
    if (check_tsc_unstable()) {
        clocksource_tsc.rating = 0;
        clocksource_tsc.flags &= ~CLOCK_SOURCE_IS_CONTINUOUS;
    }

    if (boot_cpu_has(X86_FEATURE_NONSTOP_TSC_S3))
        clocksource_tsc.flags |= CLOCK_SOURCE_SUSPEND_NONSTOP;

    /*
     * Trust the results of the earlier calibration on systems
     * exporting a reliable TSC.
     */
    if (boot_cpu_has(X86_FEATURE_TSC_RELIABLE)) {
        clocksource_register_khz(&clocksource_tsc, tsc_khz);
        return 0;
    }
__dprintk(2);
    schedule_delayed_work(&tsc_irqwork, 0);
__dprintk(3);
    return 0;
}

After that, I keep inserting printk calls while looking at the suspicious parts. It doesn't look pretty (wry smile). Also, since printk is synchronous, it is not a good idea to insert too many: if you print too much, the output gets throttled, or the printk calls themselves become the problem.

Suspicious part

The processing continues in tsc_irqwork. Looking it up in the source shows the following.

static DECLARE_DELAYED_WORK(tsc_irqwork, tsc_refine_calibration_work);
static void tsc_refine_calibration_work(struct work_struct *work)
{
    static u64 tsc_start = -1, ref_start;
    static int hpet;
    u64 tsc_stop, ref_stop, delta;
    unsigned long freq;

    /* Don't bother refining TSC on unstable systems */
    if (check_tsc_unstable())   // If the TSC is judged unstable here,
        goto out;               // execution goes to out, where a message should appear.
                                // This doesn't seem to be the problem, but place __dprintk just in case.
__dprintk(4);

    /*
     * Since the work is started early in boot, we may be
     * delayed the first time we expire. So set the workqueue
     * again once we know timers are working.
     */
    // Until it can confirm that the timers are working,
    // the work seems to re-queue itself with schedule_delayed_work.
    // It looks suspicious, so place __dprintk inside.
    if (tsc_start == -1) {
__dprintk(5);
        /*
         * Only set hpet once, to avoid mixing hardware
         * if the hpet becomes enabled later.
         */
        hpet = is_hpet_enabled();
        schedule_delayed_work(&tsc_irqwork, HZ); //This looks suspicious.
        tsc_start = tsc_read_refs(&ref_start, hpet);
        return;
    }
__dprintk(6);

    // tsc_read_refs appears again, so this function is clearly important.
    tsc_stop = tsc_read_refs(&ref_stop, hpet);

    // ref_start and ref_stop appear to receive the value of acpi_pm or hpet.
    // Even here, going to out should produce a message, so it doesn't seem to matter.
    // Place a printk just in case.
    // The ACPI PM timer itself is also present on Nitro.
    /* hpet or pmtimer available ? */
    if (ref_start == ref_stop)
        goto out;
__dprintk(7);

    // From what follows, going to out should produce some kind of message.
    // Keep placing __dprintk after each goto.

    /* Check, whether the sampling was disturbed by an SMI */
    if (tsc_start == ULLONG_MAX || tsc_stop == ULLONG_MAX)
        goto out;
__dprintk(8);

    delta = tsc_stop - tsc_start;
    delta *= 1000000LL;
    if (hpet)
        freq = calc_hpet_ref(delta, ref_start, ref_stop);
    else
        freq = calc_pmtimer_ref(delta, ref_start, ref_stop);

    /* Make sure we're within 1% */
    if (abs(tsc_khz - freq) > tsc_khz/100)
        goto out;
__dprintk(9);
    tsc_khz = freq;
    //It is clear that we have not reached this point
    pr_info("Refined TSC clocksource calibration: %lu.%03lu MHz\n",
        (unsigned long)tsc_khz / 1000,
        (unsigned long)tsc_khz % 1000);

out:
    if (boot_cpu_has(X86_FEATURE_ART))
        art_related_clocksource = &clocksource_tsc;
__dprintk(10);
    clocksource_register_khz(&clocksource_tsc, tsc_khz);
}

From the above you can predict to some extent where the problem is, but let's check what printk actually shows.

Build with the above patch to confirm

$ cd ~/rpmbuild/BUILD
$ diff -uNrp kernel-3.10.0-957.el7.orig kernel-3.10.0-957.el7.new > ../SOURCES/linux-3.10.0-957.el7.patch 
$ cd ../SOURCES
$ (rm linux-3.10.0-957.el7.patch && sed 's/kernel-[^ ][^ ]*[gw]\/lin/lin/g' > linux-3.10.0-957.el7.patch) < linux-3.10.0-957.el7.patch
$ cd ../BUILD
$ cd ~/rpmbuild/BUILD/kernel-3.10.0-957.el7/linux-3.10.0-957.el7.x86_64/
$ cp /boot/config-3.10.0-957.el7.x86_64 .config
$ make oldconfig
$ cp .config ~/rpmbuild/SOURCES/config-`uname -m`-generic
$ cd ~/rpmbuild/SPECS
$ vim kernel.spec
$ cat kernel.spec | grep -E '(tscheck|ApplyOptionalPatch.*[3].*|Patch1000)'
%define buildid .tscheck
Patch1000: linux-3.10.0-957.el7.patch
ApplyOptionalPatch linux-3.10.0-957.el7.patch 

$ rpmbuild -bb --with baseonly --without debuginfo --without debug --without doc --without perf --without tools --without kdump --without bootwrapper --target=`uname -m` kernel.spec
$ sudo yum localinstall -y ~/rpmbuild/RPMS/x86_64/kernel-*.rpm

The power of printk

The result was as follows. You can see at a glance where the problem is.

$ dmesg | grep -iE '(clocksource|tsc)'
[    0.000000] Linux version 3.10.0-957.el7.tscheck.x86_64 ...
...
[    0.000000] tsc: Detected 3000.000 MHz processor
[    1.804286] TSC deadline timer enabled
[    2.557401] Switched to clocksource kvm-clock
[    3.035763] No.1, func: init_tsc_clocksource, line: 1309, file: arch/x86/kernel/tsc.c
[    3.044996] No.2, func: init_tsc_clocksource, line: 1330, file: arch/x86/kernel/tsc.c
[    3.054436] No.3, func: init_tsc_clocksource, line: 1332, file: arch/x86/kernel/tsc.c
[    3.063727] No.4, func: tsc_refine_calibration_work, line: 1248, file: arch/x86/kernel/tsc.c
[    3.073240] No.5, func: tsc_refine_calibration_work, line: 1256, file: arch/x86/kernel/tsc.c
[    4.083424] No.4, func: tsc_refine_calibration_work, line: 1248, file: arch/x86/kernel/tsc.c
[    4.092902] No.5, func: tsc_refine_calibration_work, line: 1256, file: arch/x86/kernel/tsc.c
[    5.085423] No.4, func: tsc_refine_calibration_work, line: 1248, file: arch/x86/kernel/tsc.c
[    5.085424] No.5, func: tsc_refine_calibration_work, line: 1256, file: arch/x86/kernel/tsc.c
...
[   76.453766] No.5, func: tsc_refine_calibration_work, line: 1256, file: arch/x86/kernel/tsc.c
[   77.464261] No.4, func: tsc_refine_calibration_work, line: 1248, file: arch/x86/kernel/tsc.c
[   77.473952] No.5, func: tsc_refine_calibration_work, line: 1256, file: arch/x86/kernel/tsc.c
[   78.484266] No.4, func: tsc_refine_calibration_work, line: 1248, file: arch/x86/kernel/tsc.c
[   78.494070] No.5, func: tsc_refine_calibration_work, line: 1256, file: arch/x86/kernel/tsc.c
...
[  627.100177] No.5, func: tsc_refine_calibration_work, line: 1256, file: arch/x86/kernel/tsc.c
[  628.110663] No.4, func: tsc_refine_calibration_work, line: 1248, file: arch/x86/kernel/tsc.c
[  628.120099] No.5, func: tsc_refine_calibration_work, line: 1256, file: arch/x86/kernel/tsc.c

You can see that No.4 and No.5 repeat endlessly, once every second.

So, as expected, we can conclude that the problem is likely in the following block:

if (tsc_start == -1) {
__dprintk(5);
    /*
     * Only set hpet once, to avoid mixing hardware
     * if the hpet becomes enabled later.
     */
    hpet = is_hpet_enabled();
    schedule_delayed_work(&tsc_irqwork, HZ); //This looks suspicious.
    tsc_start = tsc_read_refs(&ref_start, hpet);
    return;
}

From the processing, tsc_start in the block above is always -1, and tsc_irqwork keeps rescheduling itself endlessly. As a result, the processing never moved forward and no error was ever printed. A problem like this is probably hard to understand even with a debugger, so using printk this time doesn't seem like such a bad choice.

So when does tsc_start stay at -1? You can see that by reading tsc_read_refs.

Problem analysis

tsc_read_refs is defined as follows.

#define MAX_RETRIES     5
#define SMI_TRESHOLD    50000

/*
 * Read TSC and the reference counters. Take care of SMI disturbance
 */
static u64 tsc_read_refs(u64 *p, int hpet)
{
        u64 t1, t2;
        int i;

        for (i = 0; i < MAX_RETRIES; i++) {
                t1 = get_cycles();
                if (hpet)
                        *p = hpet_readl(HPET_COUNTER) & 0xFFFFFFFF;
                else
                        *p = acpi_pm_read_early();
                t2 = get_cycles();
                if ((t2 - t1) < SMI_TRESHOLD)
                        return t2;
        }
        return ULLONG_MAX;
}

Apparently ULLONG_MAX was being returned, which for a u64 is the same value as the -1 I had been looking at.

The specific processing seems to be as follows.

First, read the TSC count with t1 = get_cycles();
Next, read the count of hpet or acpi_pm by IO access.
Immediately after that, read the TSC count again with t2 = get_cycles();
Return t2 if the number of cycles elapsed during the IO access is less than SMI_TRESHOLD.
Otherwise, after MAX_RETRIES attempts, return ULLONG_MAX.

So in summary, this code is problematic in the following ways:

Finding the cause

If the TSC counts up relatively fast (or I/O to the ACPI PM timer or HPET is relatively slow), the difference between t1 = get_cycles() and t2 = get_cycles() becomes too large, and t2 - t1 is never less than SMI_TRESHOLD. As a result tsc_read_refs always returns ULLONG_MAX, tsc_start always stays at -1, and the work keeps retrying with schedule_delayed_work(&tsc_irqwork, HZ) even after the boot process has finished. In the end, initialization of the TSC clocksource is postponed forever and the tsc clocksource never appears. And because this retry path prints no message, it is hard to tell what is going on.

Write a patch

So far I had only written a patch that inserts printk calls for debugging, but next I patched the kernel to actually fix the problem. Whether it is right or wrong is another matter...

Although these are an amateur's ideas, there are two possible ways to fix this problem.

1. It is a little strange that SMI_TRESHOLD is a fixed value, because the TSC may run faster and IO latency may be higher depending on the device and environment. There must be a better way to do this.
2. As I mentioned at the beginning, with pvclock the TSC frequency is known in advance. Using it should avoid all this trouble.

Let's think concretely about what can be done with solutions 1 and 2 above.

Solution 1: It's a bit strange that SMI_TRESHOLD is a fixed value.

It seems a little strange that SMI_TRESHOLD is a fixed value. I don't know the depths of the Linux kernel, but from an amateur's point of view this is a fairly rough piece of processing, and I think it should scale to some extent with the system. Perhaps the value should be proportional to the TSC frequency. But honestly, I didn't know by how much.

(An answer produced in a brain-dead state, because I had only a single day off)

"Let's make it a larger fixed value!"

(Big problem, but I don't know what the problem is)

50000 -> 5000000

I wrote the following patch and tried it. vol.1

diff -uNrp linux-3.10.0-957.el7.x86_64/arch/x86/kernel/tsc.c linux-3.10.0-957.el7.x86_64/arch/x86/kernel/tsc.c
--- linux-3.10.0-957.el7.x86_64/arch/x86/kernel/tsc.c    2019-07-28 18:54:36.422551294 +0000
+++ linux-3.10.0-957.el7.x86_64/arch/x86/kernel/tsc.c    2019-07-28 18:55:24.100351452 +0000
@@ -391,7 +391,7 @@ static int __init tsc_setup(char *str)
 __setup("tsc=", tsc_setup);
 
 #define MAX_RETRIES     5
-#define SMI_TRESHOLD    50000
+#define SMI_TRESHOLD    5000000
 
 /*
  * Read TSC and the reference counters. Take care of SMI disturbance

It's fixed. vol.1

$  cat /sys/devices/system/clocksource/clocksource0/available_clocksource 
kvm-clock tsc acpi_pm 

$ dmesg | grep -iE '(clocksource|tsc)'
[    0.000000] tsc: Detected 3000.000 MHz processor
[    1.835560] TSC deadline timer enabled
[    2.605330] Switched to clocksource kvm-clock
[    3.086972] No.1, func: init_tsc_clocksource, line: 1309, file: arch/x86/kernel/tsc.c
[    3.096286] No.2, func: init_tsc_clocksource, line: 1330, file: arch/x86/kernel/tsc.c
[    3.105617] No.3, func: init_tsc_clocksource, line: 1332, file: arch/x86/kernel/tsc.c
[    3.114963] No.4, func: tsc_refine_calibration_work, line: 1248, file: arch/x86/kernel/tsc.c
[    3.124533] No.5, func: tsc_refine_calibration_work, line: 1256, file: arch/x86/kernel/tsc.c
[    4.209357] No.4, func: tsc_refine_calibration_work, line: 1248, file: arch/x86/kernel/tsc.c
[    4.219336] No.6, func: tsc_refine_calibration_work, line: 1266, file: arch/x86/kernel/tsc.c
[    4.229233] No.7, func: tsc_refine_calibration_work, line: 1273, file: arch/x86/kernel/tsc.c
[    4.239024] No.8, func: tsc_refine_calibration_work, line: 1278, file: arch/x86/kernel/tsc.c
[    4.248844] No.9, func: tsc_refine_calibration_work, line: 1290, file: arch/x86/kernel/tsc.c
[    4.258687] tsc: Refined TSC clocksource calibration: 3000.004 MHz
[    4.264562] No.10, func: tsc_refine_calibration_work, line: 1300, file: arch/x86/kernel/tsc.c

Solution 2: It's a bit strange not to use the pvclock TSC frequency even though it's a virtualized guest

As I said at the beginning, if you are familiar with the pvclock mechanism, you know that the frequency is provided by the hypervisor through the shared information page. So at the point where the TSC (that is, kvm-clock) is recognized, the TSC frequency is already known, as calculated by the hypervisor. If X86_FEATURE_TSC_RELIABLE is set when kvm-clock is recognized, wouldn't that settle everything? pvclock is, in the first place, a protocol that extends the TSC to virtual environments.

It is the following part.

/*
 * Trust the results of the earlier calibration on systems
 * exporting a reliable TSC.
 */
if (boot_cpu_has(X86_FEATURE_TSC_RELIABLE)) {
    clocksource_register_khz(&clocksource_tsc, tsc_khz);
    return 0;
}

There should be code that detects the TSC frequency when kvm-clock is recognized, so let's modify the kvm-clock code so that execution enters the if block above.

The relevant kvm-clock source is as follows. You can see that kvm_get_tsc_khz is installed as the calibration callback during initialization of a guest running on KVM. From the comments in the source code, we can more or less tell that the TSC frequency is calculated early here.

static unsigned long kvm_get_tsc_khz(void)
{
        struct pvclock_vcpu_time_info *src;
        int cpu;
        unsigned long tsc_khz;
   ...
        src = &hv_clock[cpu].pvti;
        tsc_khz = pvclock_tsc_khz(src);
        preempt_enable();
        return tsc_khz;
}
...
void __init kvmclock_init(void)
{
    ...
        x86_platform.calibrate_tsc = kvm_get_tsc_khz;
        x86_platform.calibrate_cpu = kvm_get_tsc_khz;
...

If you are a Linux user, you probably know that CPU flags are managed by arch/x86/kernel/cpu/mkcapflags.pl and the macros in arch/x86/include/asm/cpufeature.h. When I searched for something that forces a flag, I found the following in arch/x86/include/asm/cpufeature.h.

#define set_cpu_cap(c, bit)    set_bit(bit, (unsigned long *)((c)->x86_capability))

extern void setup_clear_cpu_cap(unsigned int bit);
extern void clear_cpu_cap(struct cpuinfo_x86 *c, unsigned int bit);

#define setup_force_cpu_cap(bit) do { \
    set_cpu_cap(&boot_cpu_data, bit);    \
    set_bit(bit, (unsigned long *)cpu_caps_set);    \
} while (0)

Use this.

I wrote the following patch and tried it. vol.2

diff -uNrp linux-3.10.0-957.el7.x86_64/arch/x86/kernel/kvmclock.c linux-3.10.0-957.el7.x86_64/arch/x86/kernel/kvmclock.c
--- linux-3.10.0-957.el7.x86_64/arch/x86/kernel/kvmclock.c    2019-07-29 02:35:27.318987845 +0000
+++ linux-3.10.0-957.el7.x86_64/arch/x86/kernel/kvmclock.c    2019-07-29 03:04:11.015862936 +0000
@@ -338,6 +338,7 @@ void __init kvmclock_init(void)
 
     x86_platform.calibrate_tsc = kvm_get_tsc_khz;
     x86_platform.calibrate_cpu = kvm_get_tsc_khz;
+    setup_force_cpu_cap(X86_FEATURE_TSC_RELIABLE);
     x86_platform.get_wallclock = kvm_get_wallclock;
     x86_platform.set_wallclock = kvm_set_wallclock;
 #ifdef CONFIG_X86_LOCAL_APIC

It's fixed. vol.2

$ cat /sys/devices/system/clocksource/clocksource0/available_clocksource 
kvm-clock tsc acpi_pm 

$  dmesg | grep -iE '(clocksource|tsc)'
[    0.000000] Linux version 3.10.0-957.21.3.el7.tsc_fixed.x86_64...
...
[    0.000000] tsc: Detected 3000.000 MHz processor
[    1.832686] TSC deadline timer enabled
[    1.929653] Skipped synchronization checks as TSC is reliable.
[    2.598602] Switched to clocksource kvm-clock
[    3.078334] No.1, func: init_tsc_clocksource, line: 1309, file: arch/x86/kernel/tsc.c

No patch left to write!

The kernel used on CentOS / RHEL is a bit old. So, to be honest, most of the bugs you find in your distro have already been fixed upstream. And when I searched, I found the fixes easily orz.

Better patch than Solution 1

As expected, the threshold was changed to scale in proportion to the TSC frequency. The way the result is compared against ULLONG_MAX was also fixed.

I learned that when you want something proportional to a value, you can just shift it (or maybe not). I don't fully understand the details, but it felt like a patch that captures the essence of CS. I learned a lot from this, and I want to make use of it next time.

x86/tsc: Make calibration refinement more robust

Excerpt of the characteristic parts only

diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
index e9f777b..3fae238 100644
--- a/arch/x86/kernel/tsc.c
+++ b/arch/x86/kernel/tsc.c
@@ -297,15 +297,16 @@ static int __init tsc_setup(char *str)
 
 __setup("tsc=", tsc_setup);
 
-#define MAX_RETRIES     5
-#define SMI_TRESHOLD    50000
+#define MAX_RETRIES		5
+#define TSC_DEFAULT_THRESHOLD	0x20000
 
 /*
- * Read TSC and the reference counters. Take care of SMI disturbance
+ * Read TSC and the reference counters. Take care of any disturbances
  */
 static u64 tsc_read_refs(u64 *p, int hpet)
 {
 	u64 t1, t2;
+	u64 thresh = tsc_khz ? tsc_khz >> 5 : TSC_DEFAULT_THRESHOLD;
 	int i;
 
 	for (i = 0; i < MAX_RETRIES; i++) {
@@ -315,7 +316,7 @@ static u64 tsc_read_refs(u64 *p, int hpet)
 		else
 			*p = acpi_pm_read_early();
 		t2 = get_cycles();
-		if ((t2 - t1) < SMI_TRESHOLD)
+		if ((t2 - t1) < thresh)
 			return t2;
 	}
 	return ULLONG_MAX;

Patch similar to Solution 2

The RHEL code is a bit old, so the CPU flag that gets set is somewhat different. To be honest, I didn't quite understand why setup_force_cpu_cap is inserted in this block; with this code, the CPU flag will be set again many times. I wonder what that means, but there may be some reason for it.

kvmclock: fix TSC calibration for nested guests

diff --git a/arch/x86/kernel/kvmclock.c b/arch/x86/kernel/kvmclock.c
index d79a18b..4c53d12 100644
--- a/arch/x86/kernel/kvmclock.c
+++ b/arch/x86/kernel/kvmclock.c
@@ -138,6 +138,7 @@ static unsigned long kvm_get_tsc_khz(void)
 	src = &hv_clock[cpu].pvti;
 	tsc_khz = pvclock_tsc_khz(src);
 	put_cpu();
+	setup_force_cpu_cap(X86_FEATURE_TSC_KNOWN_FREQ);
 	return tsc_khz;
 }

Afterword

Linux is difficult but fun. By the way, I love Rust and have even written a bootloader for Linux. That is also an introductory article, so please read it if you like.

A story about making an x86 bootloader that can boot vmlinux with Rust - Qiita

Please feel free to comment if you have any advice, such as "this is not a good idea." I will study!