Google Colab is surprisingly useful, but reading data directly from Google Drive is unusually slow (image data, in my case). So I use .p and .h5 files to speed up reading, to roughly the speed of reading from a local drive.
The exact method will differ from person to person, but this is how I do it.
Because of the task I am currently working on (speech synthesis), I write out the output as well, but for classification tasks and the like, I think it is better to save the input and output together as tuples rather than separating them.
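For example, a minimal sketch of that tuple idea (the paths and labels here are made up purely for illustration):

import pickle

# Hypothetical classification example: keep each (input path, label) pair together
samples = [('./image/cat/img_001.png', 0),
           ('./image/dog/img_042.png', 1)]
with open('samples.p', 'wb') as f:
    pickle.dump(samples, f)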
maketxt.py
import os
import re
import glob
from sklearn.model_selection import train_test_split

def Create_txt(txt_path):
    # Collect the segment directories (the glob pattern returns directories only,
    # so an os.path.isfile check is not needed here)
    fileList = [p for p in glob.glob('./image/**/', recursive=True) if re.search('/segment_', p)]

    # Split the data 8:1:1 into train / val / test
    train_data, val_test_data = train_test_split(fileList, test_size=0.2)
    val_data, test_data = train_test_split(val_test_data, test_size=0.5)

    try:
        train_txt = os.path.join(txt_path, 'train.txt')
        with open(train_txt, mode='x') as f:
            for train_path in train_data:
                f.write(train_path.rstrip('/') + '\n')

        val_txt = os.path.join(txt_path, 'val.txt')
        with open(val_txt, mode='x') as f:
            for val_path in val_data:
                f.write(val_path.rstrip('/') + '\n')

        test_txt = os.path.join(txt_path, 'test.txt')
        with open(test_txt, mode='x') as f:
            for test_path in test_data:
                f.write(test_path.rstrip('/') + '\n')
    except FileExistsError:
        print('already exists')
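Calling it is just one line (assuming, as later in this post, that the txt files go into ./data_file):

# Writes train.txt / val.txt / test.txt into ./data_file
Create_txt('./data_file')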
What this does
The .txt step itself doesn't take much time, so you can skip it if you like.
.p is short for pickle, a Python module that serializes objects so that their state can be saved to a file. (Please see other pages for the details.)
In this case, a large number of strings (file paths) are stored in one file. The code below simply reads the text file and puts the resulting list into a pickle file, as a dict keyed by split so that each split can be read back later.
import pickle

video_list = list()
txt_path = txt_path + '.txt'   # e.g. 'train' -> 'train.txt'
with open(txt_path, 'r') as textFile:
    for line in textFile:
        line = line.replace('\n', '')
        video_list.append(line)

# Gather the lists into one dict keyed by split (do the same for 'val' and 'test'),
# then serialize it to a .p file
data_dict = {'train': video_list}
pickle.dump(data_dict, open('txt_file.p', 'wb'))
This is the main part of this article.
A .h5 file is a binary file in the HDF5 format, and it can hold a hierarchical structure inside a single file. In other words, the file-and-folder management we usually do on a computer is done inside one huge file.
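A minimal sketch of that idea (the names here are just for illustration): groups behave like folders, datasets behave like files.

import h5py
import numpy as np

with h5py.File('example.h5', 'w') as f:
    f.create_group('/train/input')                                         # a "folder"
    f['/train/input'].create_dataset('sample_0', data=np.zeros((64, 64)))  # a "file"
    print(list(f['/train/input'].keys()))                                  # -> ['sample_0']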
By storing a large number of image files in one huge file, you can cut down the time it takes to load images. Google Colab in particular is slow, so this is where it helped me the most. (The first epoch was especially painful: it took 10 minutes instead of 2 hours.)
import pickle
import h5py

fileName = 'data.h5'
path = './data_file'

# Read the pickled path lists back in
dataFileList = pickle.load(open('txt_file.p', 'rb'))
train_list = dataFileList['train']

count = 0
with h5py.File(fileName, 'w') as f:
    f.create_group('/train')
    f.create_group('/train/input')
    f.create_group('/train/output')
    for train in train_list:
        # pull_item() loads one sample and returns
        # (input name, output name, input data, output data)
        data = pull_item(train)
        f.create_dataset('/train' + data[0].strip('.'), data=data[2])
        f.create_dataset('/train' + data[1].strip('.'), data=data[3])
        f.flush()
        if count % 100 == 0:
            print(count)
        count += 1
The flow is:
1. Create the h5 file, and make one group for the input and one for the output. (The group structure is up to you, but it seems hard to read if there are too many levels.)
2. Keep saving training data with f.create_dataset('file name', data=data contents).
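The pull_item() helper used in the loop above is a helper of mine that I haven't shown; purely as a hypothetical sketch (the file names and layout here are assumptions), it has to return two dataset names and two arrays, because that is how the loop uses its result:

import os
import numpy as np
from PIL import Image

def pull_item(segment_dir):
    # Hypothetical sketch: returns (input name, output name, input array, output array).
    # The names become dataset paths under /train after the strip('.') above.
    input_path = os.path.join(segment_dir, 'input.png')    # assumed file layout
    output_path = os.path.join(segment_dir, 'output.csv')  # assumed file layout
    img = np.asarray(Image.open(input_path))
    feature = np.loadtxt(output_path, delimiter=',')
    name = os.path.basename(segment_dir)
    return './input/' + name, './output/' + name + '.csv', img, feature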
import h5py
from PIL import Image
import numpy as np

image = '/train/input/images_18'
output = '/train/output/images_18.csv'
with h5py.File('data.h5', mode='r') as data:
    img_data = data[image]
    img_group = img_data[...]   # pull the actual array out of the dataset object
    img_group = img_group.astype(np.float64)
    feature = data[output]
    feature_data = feature[...]
# no data.close() needed: the with block closes the file
Because of how HDF5 works, indexing the file at first only gives you a dataset object, so you need something like
img_group = img_data[...]
to explicitly pull the contents out of the object. The rest is business as usual.
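As a side note, this indexing is general h5py behavior (not something specific to this post) and it is lazy, so you can also read just a slice without loading the whole array:

import h5py

with h5py.File('data.h5', mode='r') as data:
    dset = data['/train/input/images_18']   # dataset object; nothing read yet
    first_part = dset[0:10]                 # reads only the first 10 rows from disk
    whole = dset[...]                       # reads the entire array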
Even when I searched for things like "google colab loading is slow", I couldn't find any article about using a binary file, so I hope this post at least makes you aware that the option exists.
The topic seems to go deeper than I expected, but I don't have the time or motivation to dig into it properly, so if you want to know more, please look at other pages.