Summary

The overhead of reading files appears to be large.
It seems that this overhead can be reduced significantly by combining many files into one.
I wanted to see this effect in concrete numbers, so I looked for a suitable dataset and measured it, taking the environment into account.
The overhead turned out to be larger than I expected.

Introduction

In recent years, machine learning has increased the need to read large numbers of files. When a program reads many files, however, the file reading overhead can exceed the program's main processing. CIFAR-10, for example, stores multiple images in a single file to reduce this overhead.

I was curious about this effect, so I used CIFAR-10 to investigate how the file reading overhead affects the program.

Table of contents

Dataset
Measurement program
Results
Explanation
Appendix

Dataset

The dataset used for the measurement is CIFAR-10, which is well known in image recognition. CIFAR-10 is a set of 32 x 32 pixel images in 10 classes. The version distributed as binary files on the CIFAR-10 site is used. Each binary file holds the data for 10,000 images. The binary structure is as follows.

Each image takes 1 byte for the label + 32 x 32 x 3 bytes for the image data = 3073 bytes, so one binary file is about 30 MB (3073 x 10,000 = 30,730,000 bytes). The file reading overhead is measured by reading this file.
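The record layout described above can be written down as a few constants. This is only a sketch for illustration: the constant names and the helpers `record_offset` and `pixel_offset` are hypothetical and do not appear in the measurement programs; the sizes come from the text.

```cpp
#include <cstddef>

// Sizes taken from the text: 1 label byte + 32 x 32 x 3 pixel bytes per image.
constexpr std::size_t kLabelBytes     = 1;
constexpr std::size_t kPixelBytes     = 32 * 32 * 3;                    // 3072
constexpr std::size_t kRecordBytes    = kLabelBytes + kPixelBytes;      // 3073
constexpr std::size_t kRecordsPerFile = 10000;
constexpr std::size_t kFileBytes      = kRecordBytes * kRecordsPerFile; // 30,730,000

// Hypothetical helper: byte offset of the label of image i in a batch file.
constexpr std::size_t record_offset(std::size_t i) {
    return i * kRecordBytes;
}

// Hypothetical helper: byte offset of the first pixel byte of image i.
constexpr std::size_t pixel_offset(std::size_t i) {
    return record_offset(i) + kLabelBytes;
}
```

The same arithmetic, `1 + 3073 * j`, appears later as the seek offset in `open_time_loop.cpp`.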

Measurement program

The following three programs are prepared to measure the overhead of reading a file.

open_time.cpp
open_time_individual.cpp
open_time_loop.cpp

`open_time` reads the CIFAR-10 binary file directly. fopen() and fclose() are each called only once during execution.
`open_time_individual` reads the CIFAR-10 data from a directory in which each image has been saved as a separate binary file in advance. fopen() and fclose() are each called 10000 times, once per image.
`open_time_loop` also reads the CIFAR-10 binary file directly, but unlike `open_time` it calls fopen() and fclose() for every image. As in `open_time_individual`, fopen() and fclose() are each called 10000 times during execution.

Apart from the file reading described above, the three programs share the same processing. The execution time is measured with system_clock from the chrono library. As mentioned in Dataset, the first byte of each record is the label of the image, so `fseek(fp, 1L, SEEK_CUR)` skips 1 byte. The image is read with `fread(pic, sizeof(uint8_t), 3072, fp)`, and the loop then loads each pixel value, adds 1 to it, and stores it back. Note that error handling for file operations is omitted.
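Since the programs omit error handling, here is a minimal sketch of what the checks could look like for the skip-and-read step. The helper name `read_pixels` is hypothetical and is not part of the measured programs; adding such checks would also slightly change the timings.

```cpp
#include <cstdint>
#include <cstdio>

// Hypothetical helper (not in the original programs): skips the 1-byte label
// at the current position of fp and reads the following 3072 pixel bytes.
// Returns true only if the seek succeeded and all 3072 bytes were read.
bool read_pixels(std::FILE* fp, std::uint8_t* pic) {
    if (fp == nullptr) {
        return false;  // fopen() failed earlier
    }
    if (std::fseek(fp, 1L, SEEK_CUR) != 0) {
        return false;  // could not skip the label byte
    }
    return std::fread(pic, sizeof(std::uint8_t), 3072, fp) == 3072;
}
```

A caller would pass the result of fopen() straight in and stop the loop as soon as `read_pixels` returns false.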

`open_time.cpp`


#include <stdio.h>
#include <stdint.h>
#include <chrono>
int main(int argc, char** argv) {
    std::chrono::system_clock::time_point start, end;
    uint8_t pic[3072] = {0};
    start = std::chrono::system_clock::now();
    // Open the batch file once and read all 10000 images from it.
    auto fp = fopen("./cifar-10-batches-bin/data_batch_1.bin", "rb");
    for(int j=0;j<10000;++j){
        fseek(fp,1L,SEEK_CUR);                 // skip the 1-byte label
        fread(pic, sizeof(uint8_t), 3072, fp); // read the 3072 pixel bytes
        for(int i=0;i<3072;++i){
            pic[i]++;                          // load, add 1, store
        }
    }
    fclose(fp);
    end = std::chrono::system_clock::now();
    double time = static_cast<double>(std::chrono::duration_cast<std::chrono::microseconds>(end - start).count() / 1000.0);
    printf("time %lf[ms]\n", time);

    return 0;
}

`open_time_individual.cpp`


#include <stdio.h>
#include <stdint.h>
#include <chrono>
#include <string>
int main(int argc, char** argv) {
    std::chrono::system_clock::time_point start, end;
    // Build the file names before the timer starts so that only file access is measured.
    std::string filenames[10000] = {""};
    for(int j=0; j<10000;++j){
        filenames[j] = "./cifar10-raw/" + std::to_string(j) + ".bin";
    }
    uint8_t pic[3072] = {0};
    start = std::chrono::system_clock::now();
    for(int j=0;j<10000;++j){
        // Open one per-image file on every iteration.
        auto fp = fopen(filenames[j].c_str(), "rb");
        fseek(fp,1L,SEEK_CUR);                 // skip the 1-byte label
        fread(pic, sizeof(uint8_t), 3072, fp); // read the 3072 pixel bytes
        for(int i=0;i<3072;++i){
            pic[i]++;                          // load, add 1, store
        }
        fclose(fp);
    }
    end = std::chrono::system_clock::now();
    double time = static_cast<double>(std::chrono::duration_cast<std::chrono::microseconds>(end - start).count() / 1000.0);
    printf("time %lf[ms]\n", time);

    return 0;
}

`open_time_loop.cpp`


#include <stdio.h>
#include <stdint.h>
#include <chrono>
int main(int argc, char** argv) {
    std::chrono::system_clock::time_point start, end;
    uint8_t pic[3072] = {0};
    start = std::chrono::system_clock::now();
    for(int j=0;j<10000;++j){
        // Reopen the same batch file on every iteration.
        auto fp = fopen("./cifar-10-batches-bin/data_batch_1.bin", "rb");
        // The position is 0 right after fopen, so this lands just past the label of image j.
        fseek(fp,1L+3073L*j,SEEK_CUR);
        fread(pic, sizeof(uint8_t), 3072, fp); // read the 3072 pixel bytes
        for(int i=0;i<3072;++i){
            pic[i]++;                          // load, add 1, store
        }
        fclose(fp);
    }
    end = std::chrono::system_clock::now();
    double time = static_cast<double>(std::chrono::duration_cast<std::chrono::microseconds>(end - start).count() / 1000.0);
    printf("time %lf[ms]\n", time);

    return 0;
}

Results

The result of the actual execution is shown below.

 % ./open_time
time 62.964000[ms]
 % ./open_time_individual
time 1154.943000[ms]
 % ./open_time_loop
time 1086.277000[ms]

Compared with open_time, open_time_individual and open_time_loop take roughly 17 to 18 times longer to run, which is on the order of 20 times. You can also see that the execution times of open_time_individual and open_time_loop are about the same.

Explanation

open_time and open_time_loop read the same data region, yet their execution times differ greatly, which shows that the execution time depends on fopen(). Also, since the execution times of open_time_individual and open_time_loop are about the same, the execution time depends on the number of fopen() calls rather than on which file is opened.
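As a rough worked figure, the measured times can be turned into a cost per extra fopen()/fclose() pair. The sketch below only redoes the arithmetic on the numbers reported in the results; the names are hypothetical, and the estimate attributes the whole difference to the extra open/close calls, ignoring other effects such as the longer seeks.

```cpp
// Measured values copied from the results, in milliseconds.
constexpr double kOpenOnceMs = 62.964;    // open_time: 1 fopen()/fclose() pair
constexpr double kOpenLoopMs = 1086.277;  // open_time_loop: 10000 pairs
constexpr int    kExtraPairs = 10000 - 1; // pairs that open_time does not perform

// Estimated extra time per fopen()/fclose() pair, in microseconds.
constexpr double per_call_overhead_us() {
    return (kOpenLoopMs - kOpenOnceMs) * 1000.0 / kExtraPairs;
}
```

On these numbers the estimate comes out at roughly 100 microseconds per pair in the measured environment.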

fopen() has to open the file through a system call, which requires a switch from user mode to kernel mode. Memory access, by contrast, incurs no such switching overhead once the address space has been allocated. For images as small as those in CIFAR-10, fopen() takes more time than the memory access itself.
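To illustrate the contrast with memory access, the sketch below reads a whole file into memory with a single fopen()/fclose() pair and then reaches each image by pointer arithmetic. It was not part of the measurement; `load_file` and `image_pixels` are hypothetical helpers, and the 3073-byte record size is the one given in Dataset.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <vector>

// Hypothetical helper: reads an entire file into memory using one
// fopen()/fclose() pair. Returns an empty vector on any failure.
std::vector<std::uint8_t> load_file(const char* path) {
    std::vector<std::uint8_t> buf;
    std::FILE* fp = std::fopen(path, "rb");
    if (fp == nullptr) {
        return buf;
    }
    if (std::fseek(fp, 0L, SEEK_END) == 0) {
        long size = std::ftell(fp);
        if (size > 0 && std::fseek(fp, 0L, SEEK_SET) == 0) {
            buf.resize(static_cast<std::size_t>(size));
            if (std::fread(buf.data(), 1, buf.size(), fp) != buf.size()) {
                buf.clear();  // short read: treat as failure
            }
        }
    }
    std::fclose(fp);
    return buf;
}

// Hypothetical helper: pointer to the 3072 pixel bytes of image i, just past
// its 1-byte label, or nullptr if the buffer does not hold that record.
const std::uint8_t* image_pixels(const std::vector<std::uint8_t>& buf, std::size_t i) {
    constexpr std::size_t kRecord = 3073;
    if ((i + 1) * kRecord > buf.size()) {
        return nullptr;
    }
    return buf.data() + i * kRecord + 1;
}
```

After one call such as `load_file("./cifar-10-batches-bin/data_batch_1.bin")`, every image is reachable through `image_pixels` without any further system call.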

Appendix

Shell script used to generate a separate binary file for each image from the CIFAR-10 binary file

for i in `seq 0 9999`
do
    # tail -c +N starts output at byte N, counted from 1, so add 1 to the byte offset
    t=$(($i * 3073 + 1))
    tail -c +$t cifar-10-batches-bin/data_batch_1.bin | head -c 3073 > "cifar10-raw/"$i".bin"
done

Python script that converts a split binary file to PNG to check that it is a valid image

import numpy as np
from PIL import Image

# Read one 3073-byte record: 1 label byte followed by 3072 pixel bytes.
with open("sample.bin", "rb") as fp:
    label = fp.read(1)
    data = np.zeros(3072, dtype='uint8')
    for i in range(3072):
        data[i] = int.from_bytes(fp.read(1), 'little')

# The pixels are stored channel-first (3, 32, 32); reorder to (height, width, channel).
data = data.reshape(3, 32, 32)
data = np.swapaxes(data, 0, 2)
data = np.swapaxes(data, 0, 1)
with Image.fromarray(data) as img:
    img.save("sample.png")
