When you map a file to the virtual memory space with MAP_SHARED using mmap (2), which everyone loves, the writing to that memory is reflected in the file. When and who is detecting the writing to this memory?
Considering the mechanism of mmap (2) and the function of CPU, there are roughly two possible mechanisms. Before that, I will briefly explain the operation of the OS and the functions of the CPU.
One of the important roles of the OS is to give each task (process in Unix-like terms) an independent virtual memory space. Address 100 of one process is allocated physical memory different from address 100 of another process (unless specified). The MMU (Memory Management Unit) is the hardware that maps the virtual memory space and the physical memory space in a CPU that runs full Linux, such as a PC, server, or smartphone. Many CPUs today have a built-in MMU, but in the past they were sometimes externally attached near the CPU. Linux assumes a paging-type MMU that manages memory by dividing it into pages of a certain size.
On many current mainstream CPUs, such as x86_64, ARM, and Power, the page size is 4KiB in principle. Since 4KiB is 12 bits, the 64-bit virtual memory address can be represented by the page offset of the lower 12 bits and the page number (virtual page number) of the virtual memory space of the upper 52 bits.
The MMU uses a table called a page table in memory to convert pages in virtual memory space into pages in physical memory space. A page table can be said to be an array indexed by virtual page numbers. In reality, the virtual memory space is used sparsely, and the size of the page table is compressed by making it multi-stage (4 to 5 stages in x86-64). Also, on x86-64, only 48 bits are actually used out of 64 bits. Each x86-64 page table entry (PTE: Page Table Entry) has a width of 64 bits, with a physical page number (PFN: Page Frame Number) on the upper side and various information about the page on the lower 12 bits as a bitmap. Have. What is important in this paper is the information on the lower side. x86-64 has the following information, but other architectures (with different names) have similar information.
P (Present) bit: When set to 1, it indicates that the physical page corresponding to the virtual page exists (mapped). If the P bit is cleared to 0, access to this virtual page will be a page fault because no memory is physically allocated. As an aside, in x86, the L1TF vulnerability is that the physical memory corresponding to the PFN represented by the upper bit may be speculatively accessed even when the P bit is cleared. .. When the P bit is cleared, it is allowed by the specifications to store the information used by the OS at the upper level, and in Linux, the information of the position in the swap area may be stored.
D (Dirty) bit: Set automatically by the CPU when the virtual page has write access. It is the responsibility of the OS to clear it, as it is not cleared automatically.
A (Accessed) bit: Set automatically by the CPU when the page is accessed. It is the responsibility of the OS to clear it, as it is not cleared automatically.
R / W (Read / Write) bit: When cleared, the page is read-only and writing is not allowed.
When a process tries to access a page that has a P bit cleared, the OS is notified of an exception called a page fault (#PF in x86 terminology). The OS may receive #PF and rewrite the page table appropriately to resume the execution of the process, or it may cause an error as an illegal memory access (typically a segmentation fault signal in Unix-like OS). When the process tries to write to the page where the R / W bit is cleared, #PF (write fault) is notified to the OS, and the page table may be rewritten appropriately and the process execution may be resumed. And it may be an error.
A major feature of D and A is that the CPU may change. The CPU does not change the rest of the page table, only the OS. Also, although the CPU of both D and A may be changed from 0 to 1, it is not cleared from 1 to 0.
In most Unix-like operating systems such as Linux, storage data is cached in memory, but this is done on a page-by-page basis, so it is called a page cache. File mmap is achieved by mapping this page cache to virtual memory space. Normal write (2) (including pwrite (2), writev (2), etc.) creates a page cache prior to writing (that is, reads the entire page content from storage) and rewrites this page cache. .. At the time of write (2) etc., the OS can grasp that the page cache has been rewritten and the consistency with the original on the storage has been lost. A page that is inconsistent with the original is called a dirty page, but the dirty page is written to the storage at an appropriate time and becomes the same state as the original, that is, a clean page. In Linux, the former bdflush, pdflush, and the recent bdi flusher are in charge of this writing, and the sysctl tuning parameter vm.dirty_ratio, which has become a secret sauce of each SIer company, changes the behavior of this bdi flusher. It is a thing.
In the case of mmap, the situation becomes difficult. The process writes the file to memory, including the mmap area. Normally, the OS (kernel) does not know when and where it was written. In some way, you have to detect what the process wrote in the mmap area and on which page it wrote, and make that page dirty.
As an aside, in some former implementations, mmap (2) uses a page cache, but write (2) and others use a separate buffer cache mechanism, which is superfluous between the two caches. There were some problems such as copy occurrence and file contents being corrupted when mmap (2) and write (2) were used in parallel (up to Linux-2.2, NetBSD-1.5, etc.). Currently, it has been improved to use the page cache for write (2) etc., and such a problem does not occur.
Based on the MMU function described earlier, I think there are two possible ways to supplement writing to the mmap area.
Find the one with the D bit of PTE set and make it dirty. Clear the D bit when exporting to storage. 1a. Scan and find all page tables (corresponding to the virtual space of all processes). 1b. For every physical page, look for the PTE that points to it.
Clear the R / W bit, catch the generated #PF (write fault), make the page dirty, and at the same time set the R / W bit to restart the process. Clear the R / W bit when writing to storage.
I've already done a page table survey and I'm looking at the A bit. The page cache is freed if it is not used for a long time (otherwise it fills up memory), but it is used to determine that it has not been used for a long time. This is called the [page replacement algorithm](https://ja.wikipedia.org/wiki/page replacement algorithm).
For example, the following algorithm determines the cache to be released. First, register all the pages used in the page cache in the list. Pages newly secured as a page cache will be added to the end of the list. When you run out of free memory, scan this list from the beginning to find the A bit of the PTE that is referencing the page. Pages that remain cleared are considered to have not been used for a while, so they are released, but if they are set, they move to the end of the list. As a result, the list will be arranged with pages that are used less recently near the beginning and pages that are frequently used near the end (LRU: Least Recently Used order). Although I wrote about the page cache here, the pages to be saved in the swap area can be determined by performing the same processing in the anonymous memory.
In reality, it is common to remove it from the list, add it to another list called inactive, and wait for a while. The list explained first is called an active list, and the ratio of active, inactive, and free is kept at a constant ratio. In addition, if the above is implemented in a simple manner, the scan will be greatly biased, so that a device for efficiently and fairly scanning the whole will be made. In addition, there are small differences depending on the OS and version, such as when and under what conditions the scan is executed, how many pages are moved, how much the page cache and anonymous memory are treated in the same way or differently. There is, but the rough operation is like this.
Let's get back to mmap. Method 1a (tentatively called the page table scan method) is a straightforward PTE investigation. You can investigate D bit and A bit at the same time. Areas mapped from multiple virtual spaces (executables, shared libraries, etc.) will be examined multiple times during a round of all the memory in use.
Method 1b (temporarily called the physical page scan method) requires a method to check the virtual space and virtual address that maps it from the physical page. Since A bit can be investigated at the same time, the active / inactive list mentioned earlier can be used to trace all the physical pages in use. It seems that scanning can be optimized according to the frequency of use.
Whether it's 1a or 1b, you need to investigate all the pages in use over a period of time. Otherwise, dirty pages will be undetectable for long periods of time. However, in the case of 1a, optimization such as skipping the scan of the virtual space of the process that has not been scheduled recently in cooperation with the process scheduler can be considered.
In the second method (temporarily called the write fault capture method), the problem is that the #PF processing may be heavy. Too many write faults can slow down the execution of the process. It is known that there is a lot of access in the area near the memory (locality of access), but once a write fault occurs and the R / W bit is set, the page is left dirty for a while. , Clear the R / W bit just before writing, so it is necessary to reduce it by writing multiple memory writes to the storage at once. On the other hand, because you can capture the moment when it becomes dirty, you can guarantee that it will be written to storage (start writing) 30 seconds after it becomes dirty, or keep the number of dirty pages within a certain range. Will be easier.
This is the end of this article because it has become long. When I actually examined the source code of the OS of some OSS, I found the OS that actually uses the physical page scan method of 1b and the write fault capture method of 2, so [Next article]( I would like to introduce it at https://qiita.com/SIGABRT/items/667b24a809f1575a2640).
Recommended Posts