Google Colab is surprisingly useful, but reading data directly from Google Drive is unusually slow (image data, in my case). So I use .p and .h5 files to speed up reading, to roughly the speed of reading from a local drive.
The exact method will differ from person to person, but this is how I do it.
Because of the task I am currently working on (speech synthesis), I write out the output as well, but for classification tasks and the like, I think it is better to save the input and output together as tuples rather than separating them.
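For example, a minimal sketch of that tuple idea (the paths and labels here are made up purely for illustration):

import pickle

# Hypothetical classification example: keep each (input path, label) pair together
samples = [('./image/cat/img_001.png', 0),
           ('./image/dog/img_042.png', 1)]
with open('samples.p', 'wb') as f:
    pickle.dump(samples, f)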
maketxt.py
import os
import re
import glob
from sklearn.model_selection import train_test_split

def Create_txt(txt_path):
    # Collect the segment directories (the glob pattern returns directories only,
    # so an os.path.isfile check is not needed here)
    fileList = [p for p in glob.glob('./image/**/', recursive=True) if re.search('/segment_', p)]

    # Split the data 8:1:1 into train / val / test
    train_data, val_test_data = train_test_split(fileList, test_size=0.2)
    val_data, test_data = train_test_split(val_test_data, test_size=0.5)

    try:
        train_txt = os.path.join(txt_path, 'train.txt')
        with open(train_txt, mode='x') as f:
            for train_path in train_data:
                f.write(train_path.rstrip('/') + '\n')

        val_txt = os.path.join(txt_path, 'val.txt')
        with open(val_txt, mode='x') as f:
            for val_path in val_data:
                f.write(val_path.rstrip('/') + '\n')

        test_txt = os.path.join(txt_path, 'test.txt')
        with open(test_txt, mode='x') as f:
            for test_path in test_data:
                f.write(test_path.rstrip('/') + '\n')
    except FileExistsError:
        print('already exists')
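Calling it is just one line (assuming, as later in this post, that the txt files go into ./data_file):

# Writes train.txt / val.txt / test.txt into ./data_file
Create_txt('./data_file')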
What this does
The .txt step itself doesn't take much time, so you can skip it if you like.
.p is short for pickle, a Python module that serializes objects so that their state can be saved to a file. (Please see other pages for the details.)
In this case, a large number of strings (file paths) are stored in one file. The code below simply reads the text file and puts the resulting list into a pickle file, as a dict keyed by split so that each split can be read back later.
import pickle

video_list = list()
txt_path = txt_path + '.txt'   # e.g. 'train' -> 'train.txt'
with open(txt_path, 'r') as textFile:
    for line in textFile:
        line = line.replace('\n', '')
        video_list.append(line)

# Gather the lists into one dict keyed by split (do the same for 'val' and 'test'),
# then serialize it to a .p file
data_dict = {'train': video_list}
pickle.dump(data_dict, open('txt_file.p', 'wb'))
This is the main part of this article.
A .h5 file is a binary file in the HDF5 format, and it can hold a hierarchical structure inside a single file. In other words, the file-and-folder management we usually do on a computer is done inside one huge file.
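A minimal sketch of that idea (the names here are just for illustration): groups behave like folders, datasets behave like files.

import h5py
import numpy as np

with h5py.File('example.h5', 'w') as f:
    f.create_group('/train/input')                                         # a "folder"
    f['/train/input'].create_dataset('sample_0', data=np.zeros((64, 64)))  # a "file"
    print(list(f['/train/input'].keys()))                                  # -> ['sample_0']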
By storing a large number of image files in one huge file, you can cut down the time it takes to load images. Google Colab in particular is slow, so this is where it helped me the most. (The first epoch was especially painful: it took 10 minutes instead of 2 hours.)
import pickle
import h5py

fileName = 'data.h5'
path = './data_file'

# Read the pickled path lists back in
dataFileList = pickle.load(open('txt_file.p', 'rb'))
train_list = dataFileList['train']

count = 0
with h5py.File(fileName, 'w') as f:
    f.create_group('/train')
    f.create_group('/train/input')
    f.create_group('/train/output')
    for train in train_list:
        # pull_item() loads one sample and returns
        # (input name, output name, input data, output data)
        data = pull_item(train)
        f.create_dataset('/train' + data[0].strip('.'), data=data[2])
        f.create_dataset('/train' + data[1].strip('.'), data=data[3])
        f.flush()
        if count % 100 == 0:
            print(count)
        count += 1
The flow is:
1. Create the h5 file, and make one group for the input and one for the output. (The group structure is up to you, but it seems hard to read if there are too many levels.)
2. Keep saving training data with f.create_dataset('file name', data=data contents).
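The pull_item() helper used in the loop above is a helper of mine that I haven't shown; purely as a hypothetical sketch (the file names and layout here are assumptions), it has to return two dataset names and two arrays, because that is how the loop uses its result:

import os
import numpy as np
from PIL import Image

def pull_item(segment_dir):
    # Hypothetical sketch: returns (input name, output name, input array, output array).
    # The names become dataset paths under /train after the strip('.') above.
    input_path = os.path.join(segment_dir, 'input.png')    # assumed file layout
    output_path = os.path.join(segment_dir, 'output.csv')  # assumed file layout
    img = np.asarray(Image.open(input_path))
    feature = np.loadtxt(output_path, delimiter=',')
    name = os.path.basename(segment_dir)
    return './input/' + name, './output/' + name + '.csv', img, feature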
import h5py
from PIL import Image
import numpy as np

image = '/train/input/images_18'
output = '/train/output/images_18.csv'
with h5py.File('data.h5', mode='r') as data:
    img_data = data[image]
    img_group = img_data[...]   # pull the actual array out of the dataset object
    img_group = img_group.astype(np.float64)
    feature = data[output]
    feature_data = feature[...]
# no data.close() needed: the with block closes the file
Because of how HDF5 works, indexing the file at first only gives you a dataset object, so you need something like
img_group = img_data[...]
to explicitly pull the contents out of the object. The rest is business as usual.
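As a side note, this indexing is general h5py behavior (not something specific to this post) and it is lazy, so you can also read just a slice without loading the whole array:

import h5py

with h5py.File('data.h5', mode='r') as data:
    dset = data['/train/input/images_18']   # dataset object; nothing read yet
    first_part = dset[0:10]                 # reads only the first 10 rows from disk
    whole = dset[...]                       # reads the entire array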
Even when I searched for things like "google colab loading is slow", I couldn't find any article about using a binary file, so I hope this post at least makes you aware that the option exists.
The topic seems to go deeper than I expected, but I don't have the time or motivation to dig into it properly, so if you want to know more, please look at other pages.