Dataset preparation for PyTorch

Python script for preparing datasets (images) for deep learning

Do you use scikit-learn's train_test_split (or similar) when splitting a dataset into train and validation sets? If you instead prepare folders of training/evaluation data in advance and want to draw data from them (admittedly an uncommon workflow), the method in this article can do it. The program is here: GitHub - moriitkys/PrepareDataSet. The specifications of the program introduced in this article are as follows.

--Prepares data for PyTorch's DataLoader (and for Keras as well)
--Settings for deep learning can be made through a Tkinter UI (CreatePanel() of mylib/create_panel.py is instantiated and the processing is executed):

  1. Choice of training or inference
  2. Choice of model backbone (ResNet, Mobilenet, MyNet)
  3. Choice of whether to re-split the training/evaluation data (split randomly again, regardless of any previous split)
  4. Proportion of training data (train:val)
  5. Choice of whether to perform data augmentation
  6. Total number of epochs

--When data augmentation is executed, MakeDataSetRGB() of mylib/makedataset_rgb.py is instantiated, the data in the dataset and dataset_val folders is inflated, and the results are generated in dataset_aug and dataset_val_aug.
--When a split is executed, a dataset_val folder is created at the same level as the dataset folder, and all evaluation data is moved into it.
--The hierarchy is assumed to have one folder per class, such as /dataset/1/..., /dataset/2/... (class names can be alphabetic).
--There is also a function (revert_dataset_val()) that reverts from the split state (dataset_val exists) back to the pre-split state (dataset only).
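The split above moves a random share of each class folder from dataset into dataset_val according to the train:val ratio. A minimal sketch of that logic (split_dataset is an illustrative name here, not the repository's actual function):

```python
import os
import random
import shutil

def split_dataset(dataset_dir, val_dir, ratio_train=0.65, seed=None):
    """Move roughly (1 - ratio_train) of each class folder into val_dir."""
    rng = random.Random(seed)
    for class_name in sorted(os.listdir(dataset_dir)):
        src = os.path.join(dataset_dir, class_name)
        if not os.path.isdir(src):
            continue
        dst = os.path.join(val_dir, class_name)
        os.makedirs(dst, exist_ok=True)
        for filename in os.listdir(src):
            # Each file goes to validation with probability 1 - ratio_train
            if rng.random() >= ratio_train:
                shutil.move(os.path.join(src, filename), dst)
```

Reverting is then just the mirror operation: move every file in dataset_val back into dataset and delete dataset_val.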

How to operate the UI to set the variables and generate/prepare the data

Panel6.gif

The above shows the settings for: Train mode, ResNet, train-val split enabled, train:val = 0.65:0.35, augmentation enabled, and Epochs = 40. With these settings, the following directory structure and output are produced.

Keras_dir_def.PNG Keras_Res_val_aug_train.PNG Keras_Res_val_aug_train_jn.PNG
Figure 1-a. Default directory state. Figure 1-b. Directory after dataset splitting and augmentation. Figure 1-c. Output when the program completes successfully.

Differences between Keras and PyTorch

What both versions share are the variables obtained from the Tkinter panel, such as flag_train and total_epochs. The difference lies in how the dataset is prepared.

Keras dataset preparation

Keras takes the data as NumPy arrays. I couldn't think of a better approach, so I converted the lists with numpy.array, saved them as .npy files, and read them back from .npy at training time. I intended to reduce memory consumption by emptying the lists afterwards, but I would like to rewrite this part more cleanly.
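A minimal sketch of this caching pattern (dummy arrays stand in for the loaded images; the tmp_npy folder name follows the script below):

```python
import os
import numpy as np

os.makedirs("tmp_npy", exist_ok=True)

# Dummy stand-ins for the image/label lists built while reading files
x_train_list = [np.zeros((28, 28, 3), dtype=np.float32) for _ in range(4)]
y_train_list = [0, 0, 1, 1]

# Convert to arrays, cache them on disk, then drop the lists to free memory
x_train = np.array(x_train_list)
y_train = np.array(y_train_list)
np.save("tmp_npy/x_train.npy", x_train)
np.save("tmp_npy/y_train.npy", y_train)
x_train_list, y_train_list = [], []

# At training time, read the cached arrays back; mmap_mode="r" avoids
# loading the whole array into RAM at once
x_train = np.load("tmp_npy/x_train.npy", mmap_mode="r")
y_train = np.load("tmp_npy/y_train.npy")
```

Memory-mapped loading (mmap_mode) is one way to keep memory usage down when the cached arrays are large.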

Keras dataset preparation (click here to view program)
#Settings and prepare your dataset
import glob
import os
import sys
import keras
from keras import layers, models, optimizers
from keras.utils import np_utils
import keras.backend as K
import keras.layers as KL
import tensorflow as tf
from keras.preprocessing.image import load_img, img_to_array, array_to_img
from keras.preprocessing.image import random_rotation, random_shift, random_zoom
import numpy as np
import random
import matplotlib.pyplot as plt
import PIL
from PIL import Image
import cv2
from pathlib import Path
import shutil
from sklearn.model_selection import train_test_split
import mylib.makedataset_rgb as mkdataset
import mylib.create_panel as create_panel
import mylib.utils as myutils

# ------ Setting panels ------
import tkinter
from tkinter import messagebox
img_size_mynet = [28,28]# You can change input image size(Pay attention to network shape)
setting_panel = create_panel.CreatePanel(img_size_mynet = img_size_mynet)
setting_panel.create_buttons()#If you push "start", exit this line.

# ------ set params and preparing dataset ------
flag_train = setting_panel.flag_train
flag_aug = setting_panel.flag_aug
flag_split = setting_panel.flag_split
ratio_train = float(setting_panel.var_sp.get())#0.0 ~ 1.0
total_epochs = int(setting_panel.var_sp_epochs.get())

type_backbone = setting_panel.type_backbone#ex) ResNet, Mobilenet, MyNet
layer_name_gradcam = setting_panel.layer_name_gradcam# Don't use 
img_size = setting_panel.img_size#ex) ResNet:[224,224], Mobilenet:[192,192], MyNet:[28,28]
print(type_backbone)
print("img_size=" + str(img_size))

#How many classes are in "dataset" folder
categories = [i for i in os.listdir(os.getcwd().replace("/mylib", "") + "/dataset")]
categories_idx = {}#ex) HookWrench:0, SpannerWrench:1
for i, name in enumerate(categories):
    categories_idx[name] = i
nb_classes = len(categories)#ex) nb_classes=2

dirname_dataset = "dataset"# dataset folder
dirname_dataset_val = dirname_dataset + "_val"
output_folder = "outputs_keras/"+type_backbone
x_train, y_train, x_val, y_val = [],[],[],[]

def aug_dataset(dirname_dataset_1, dirname_dataset_val_1):
    '''
    This function returns updated dataset dirname 
    Contain MakeDataSetRGB() (mylib/makedataset_rgb.py)
    Argument1: Foldername (String), Argument2: Foldername (String)
    Usage:
    dirname_dataset, dirname_dataset_val = aug_dataset(dirname_dataset, dirname_dataset_val)
    '''
    dirname_dataset_aug = dirname_dataset_1 + "_aug"
    dirname_dataset_val_aug = dirname_dataset_val_1 + "_aug"
    make_dataset = mkdataset.MakeDataSetRGB()
    if os.path.exists(dirname_dataset_aug ) == True \
    or os.path.exists(dirname_dataset_val_aug ) == True:
        #https://pythonbasics.org/tkinter-messagebox/
        tki2 = tkinter.Tk()
        tki2.withdraw()
        ret = messagebox.askyesno('Verification', 'An aug folder already exists. Are you sure you want to clear it?')
        if ret == True:
            if os.path.exists(dirname_dataset_aug ) == True:
                shutil.rmtree(dirname_dataset_aug)
            if os.path.exists(dirname_dataset_val_aug ) == True:
                shutil.rmtree(dirname_dataset_val_aug)
            make_dataset.do_augmentation(dataset_folder_name = "dataset")
            make_dataset.do_augmentation(dataset_folder_name = "dataset_val")
            tki2.destroy()
        else:
            tki2.destroy()
        tki2.mainloop()
    else:
        make_dataset.do_augmentation(dataset_folder_name = "dataset")
        make_dataset.do_augmentation(dataset_folder_name = "dataset_val")
        
    dirname_dataset_2 = dirname_dataset_1 + "_aug"
    dirname_dataset_val_2 = dirname_dataset_val_1 + "_aug"
    return dirname_dataset_2, dirname_dataset_val_2

def prepare_dataset(dirname_dataset, dirname_dataset_val):
    label = 0
    for j in categories:# Prepare Training Dataset
        files = glob.glob(os.path.join(dirname_dataset, str(j), "*"))
        for imgfile in files:
            img = load_img(imgfile, target_size=(img_size[0], img_size[1]))
            array = img_to_array(img) / 255
            x_train.append(array)
            y_train.append(label)
        label += 1

    label = 0
    for j in categories:# Prepare Validation Dataset
        files = glob.glob(os.path.join(dirname_dataset_val, str(j), "*"))
        for imgfile in files:
            img = load_img(imgfile, target_size=(img_size[0], img_size[1]))
            array = img_to_array(img) / 255
            x_val.append(array)
            y_val.append(label)
        label += 1
            
if flag_train == True:
    print("train mode")
    print("total epochs = " + str(total_epochs))
    if flag_split == True:
        revert_dataset_val()# split helpers, same as in the PyTorch version
        prepare_dataset_val()
        print("splitting complete")
    elif flag_split == False and os.path.exists(dirname_dataset_val) == False:
        prepare_dataset_val()
        print("The dataset was not split yet, so it was split automatically")
    if flag_aug == True:
        dirname_dataset, dirname_dataset_val = aug_dataset(dirname_dataset, dirname_dataset_val)
        print("dataset source is " + dirname_dataset + "&" + dirname_dataset_val)
    elif flag_aug == False:
        dirname_dataset_aug = dirname_dataset + "_aug"
        dirname_dataset_val_aug = dirname_dataset_val + "_aug"
        make_dataset = mkdataset.MakeDataSetRGB()
        if os.path.exists(dirname_dataset_aug ) == True \
        and os.path.exists(dirname_dataset_val_aug ) == True:
            dirname_dataset = dirname_dataset_aug
            dirname_dataset_val = dirname_dataset_val_aug
    prepare_dataset(dirname_dataset, dirname_dataset_val)
    # make directory (weights_folder, outputs)
    if os.path.exists("weights_keras/"+type_backbone) == False:
        os.makedirs("weights_keras/"+type_backbone)
    if os.path.exists("outputs_keras/"+type_backbone) == False:
        os.makedirs("outputs_keras/"+type_backbone)
        
if os.path.exists(output_folder) == False:
    os.makedirs(output_folder)

# In Keras, use numpy array for NN model
if os.path.exists("tmp_npy") == False:
    os.makedirs("tmp_npy")
x_train, y_train, x_val, y_val = np.array(x_train), np.array(y_train), np.array(x_val), np.array(y_val)
np.save("tmp_npy/x_train.npy", x_train)
np.save("tmp_npy/y_train.npy", y_train)
np.save("tmp_npy/x_test.npy", x_val)
np.save("tmp_npy/y_test.npy", y_val)
x_train, y_train, x_val, y_val = [],[],[],[]
print("Complete")

In particular, the core of the data reading is the following part:

for imgfile in files:
    img = load_img(imgfile, target_size=(img_size[0], img_size[1]))
    array = img_to_array(img) / 255
    x_train.append(array)
    y_train.append(label)
# ~ the same applies to x_val, y_val

The data is then fed to the model as follows. For classification, use np_utils.to_categorical to convert the label array into one-hot (1-of-K) vectors.

y_train1=np_utils.to_categorical(y_train,nb_classes)
y_val1=np_utils.to_categorical(y_val,nb_classes)

history = model.fit(x_train,y_train1,epochs=total_epochs, callbacks = [cp_callback],batch_size=32,validation_data=(x_val,y_val1))

PyTorch dataset preparation

In PyTorch, one standard way to prepare a dataset is to read it with ImageFolder (or similar), split it into train and validation sets with scikit-learn's train_test_split (or similar), and use DataLoader to gather the training data and label pairs into mini-batches.
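A rough sketch of that route (a TensorDataset of random tensors stands in here for ImageFolder("dataset", transform), which would need real image folders):

```python
import torch
from torch.utils.data import DataLoader, Subset, TensorDataset
from sklearn.model_selection import train_test_split

# Stand-in for ImageFolder("dataset", transform): 100 fake RGB images, 2 classes
full_data = TensorDataset(torch.randn(100, 3, 28, 28), torch.randint(0, 2, (100,)))

# Split the indices with scikit-learn, then wrap each half as a Subset
train_idx, val_idx = train_test_split(list(range(len(full_data))),
                                      train_size=0.65, random_state=0)
train_data = Subset(full_data, train_idx)
val_data = Subset(full_data, val_idx)

# DataLoader gathers (image, label) pairs into mini-batches
train_loader = DataLoader(train_data, batch_size=32, shuffle=True)
val_loader = DataLoader(val_data, batch_size=16, shuffle=False)
```

The program in this article takes the folder-based route instead (splitting on disk, then one ImageFolder per folder), which keeps the split visible in the file system.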

PyTorch dataset preparation (click here to view program)
#Settings and prepare your dataset
import glob
import os
import sys
import numpy as np
import random
import matplotlib.pyplot as plt
import PIL
from PIL import Image
import cv2
import torch
import torchvision.transforms as transforms
from pathlib import Path
from torch.utils.data import DataLoader, Dataset
from torchvision.datasets import ImageFolder
from sklearn.model_selection import train_test_split
import shutil
import mylib.makedataset_rgb as mkdataset
import mylib.create_panel as create_panel
import mylib.utils as myutils

# ----- Setting buttons -----
import tkinter
from tkinter import messagebox
img_size_mynet = [28,28]# You can change input image size(Pay attention to network shape)
setting_panel = create_panel.CreatePanel(img_size_mynet = img_size_mynet)
setting_panel.create_buttons()#If you push "start", exit this line.

# ----- set params and preparing dataset -----
flag_train = setting_panel.flag_train
flag_aug = setting_panel.flag_aug
flag_split = setting_panel.flag_split
ratio_train = float(setting_panel.var_sp.get())#0.0 ~ 1.0
total_epochs = int(setting_panel.var_sp_epochs.get())

type_backbone = setting_panel.type_backbone
layer_name_gradcam = setting_panel.layer_name_gradcam
img_size = setting_panel.img_size
print(type_backbone)
print("img_size=" + str(img_size))

#How many classes are in "dataset" folder
categories = [i for i in os.listdir(os.getcwd().replace("/mylib", "") + "/dataset")]
nb_classes = len(categories)#ex) nb_classes=2

dirname_dataset = "dataset"# dataset folder
dirname_dataset_val = dirname_dataset + "_val"
output_folder = "outputs_pytorch/"+type_backbone

def aug_dataset(dirname_dataset_1, dirname_dataset_val_1):
    '''
    This function returns updated dataset dirname 
    Contain MakeDataSetRGB() (mylib/makedataset_rgb.py)
    Argument1: Foldername (String), Argument2: Foldername (String)
    Usage:
    dirname_dataset, dirname_dataset_val = aug_dataset(dirname_dataset, dirname_dataset_val)
    '''
    dirname_dataset_aug = dirname_dataset_1 + "_aug"
    dirname_dataset_val_aug = dirname_dataset_val_1 + "_aug"
    make_dataset = mkdataset.MakeDataSetRGB(do_reverse=True,
                                            do_gamma_correction=True, 
                                            do_add_noise=True, 
                                            do_cut_out=True, 
                                            do_deformation=True )
    if os.path.exists(dirname_dataset_aug ) == True \
    or os.path.exists(dirname_dataset_val_aug ) == True:
        #https://pythonbasics.org/tkinter-messagebox/
        tki2 = tkinter.Tk()
        tki2.withdraw()
        ret = messagebox.askyesno('Verification', 'An aug folder already exists. Are you sure you want to clear it?')
        if ret == True:
            if os.path.exists(dirname_dataset_aug ) == True:
                shutil.rmtree(dirname_dataset_aug)
            if os.path.exists(dirname_dataset_val_aug ) == True:
                shutil.rmtree(dirname_dataset_val_aug)
            make_dataset.do_augmentation(dataset_folder_name = "dataset")
            make_dataset.do_augmentation(dataset_folder_name = "dataset_val")
            tki2.destroy()
        else:
            tki2.destroy()
        tki2.mainloop()
        
    else:
        make_dataset.do_augmentation(dataset_folder_name = "dataset")
        make_dataset.do_augmentation(dataset_folder_name = "dataset_val")
        
    dirname_dataset_2 = dirname_dataset_1 + "_aug"
    dirname_dataset_val_2 = dirname_dataset_val_1 + "_aug"
    return dirname_dataset_2, dirname_dataset_val_2
            
def prepare_dataset_val():
    for j in categories:
        if os.path.exists(os.path.join(dirname_dataset_val, str(j))) == False:
            os.makedirs(os.path.join(dirname_dataset_val, str(j)))
            files = glob.glob(os.path.join(dirname_dataset, str(j), "*"))
            for imgfile in files:# move some data from "dataset" to "dataset_val"
                if myutils.train_or_val(ratio_train) == "val":
                    shutil.move(imgfile, os.path.join(dirname_dataset_val, str(j)))

def revert_dataset_val():
    '''
    Revert Dataset
    This function revert splitted validation dataset directory to dataset directory
    '''
    for j in categories:
        if os.path.exists(os.path.join(dirname_dataset_val, str(j))) == True:
            files = glob.glob(os.path.join(dirname_dataset_val, str(j), "*"))
            for imgfile in files:#Move all images in "dataset_val" to "dataset"
                shutil.move(imgfile, os.path.join(dirname_dataset, str(j)))
    if os.path.exists(dirname_dataset_val) == True:
        shutil.rmtree(dirname_dataset_val)#Delete "dataset_val" folder

transform = transforms.Compose([transforms.Resize((img_size[0], img_size[1])), transforms.ToTensor()])
train_loader = []
test_loader = []

def prepare_dataset(transform, dirname_dataset, dirname_dataset_val):
    dataset = ImageFolder(dirname_dataset, transform)# Prepare Training Dataset
    dataset_val = ImageFolder(dirname_dataset_val, transform)# Prepare Validation Dataset
    print(dataset.class_to_idx)
    return dataset, dataset_val
    
batch_size_train = 32
batch_size_val = 16
def get_device(gpu_id=-1):
    global batch_size_train, batch_size_val
    if gpu_id >= 0 and torch.cuda.is_available():
        print("GPU mode")
        batch_size_train = 32
        batch_size_val = 16
        return torch.device("cuda", gpu_id)
    else:
        return torch.device("cpu")
device = get_device(gpu_id=0)    

if flag_train == True:
    print("train mode")
    print("total epochs = " + str(total_epochs))
    if flag_split == True:
        revert_dataset_val()
        prepare_dataset_val()
        print("splitting complete")
    elif flag_split == False and os.path.exists(dirname_dataset_val) == False:
        prepare_dataset_val()
        print("The dataset was not split yet, so it was split automatically")
    if flag_aug == True:
        dirname_dataset, dirname_dataset_val = aug_dataset(dirname_dataset, dirname_dataset_val)
        print("dataset source is " + dirname_dataset + "&" + dirname_dataset_val)
    elif flag_aug == False:
        dirname_dataset_aug = dirname_dataset + "_aug"
        dirname_dataset_val_aug = dirname_dataset_val + "_aug"
        make_dataset = mkdataset.MakeDataSetRGB()
        if os.path.exists(dirname_dataset_aug ) == True \
        and os.path.exists(dirname_dataset_val_aug ) == True:
            dirname_dataset = dirname_dataset_aug
            dirname_dataset_val = dirname_dataset_val_aug
    #prepare_dataset()
    train_data, test_data = prepare_dataset(transform, dirname_dataset, dirname_dataset_val)
    # make directory (weights_folder, outputs)
    if os.path.exists("weights_pytorch/"+type_backbone) == False:
        os.makedirs("weights_pytorch/"+type_backbone)
    if os.path.exists("outputs_pytorch/"+type_backbone) == False:
        os.makedirs("outputs_pytorch/"+type_backbone)
    # In PyTorch, use DataLoader for NN model
    train_loader = DataLoader(train_data, batch_size=batch_size_train, shuffle=True)
    test_loader = DataLoader(test_data, batch_size=batch_size_val, shuffle=True)

if os.path.exists(output_folder) == False:
    os.makedirs(output_folder)
print("Complete")

--Dataset: returns one pair of data and its correct label
--DataLoader: lets you retrieve the data in mini-batches


def prepare_dataset(transform, dirname_dataset, dirname_dataset_val):
    dataset = ImageFolder(dirname_dataset, transform)# Prepare Training Dataset
    dataset_val = ImageFolder(dirname_dataset_val, transform)# Prepare Validation Dataset
    print(dataset.class_to_idx)
    return dataset, dataset_val
#~Omitted~
train_data, test_data = prepare_dataset(transform, dirname_dataset, dirname_dataset_val)
#~Omitted~
# In PyTorch, use DataLoader for NN model
train_loader = DataLoader(train_data, batch_size=batch_size_train, shuffle=True)
test_loader = DataLoader(test_data, batch_size=batch_size_val, shuffle=True)
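To make those two roles concrete, here is a minimal custom Dataset (PairDataset is an illustrative toy class, not part of the repository): __getitem__ returns one (data, label) pair, and DataLoader stacks those pairs into batches.

```python
import torch
from torch.utils.data import DataLoader, Dataset

class PairDataset(Dataset):
    """Toy dataset whose __getitem__ returns one (data, label) pair."""
    def __init__(self, n=10):
        self.data = torch.arange(n, dtype=torch.float32).unsqueeze(1)  # shape (n, 1)
        self.labels = torch.arange(n) % 2
    def __len__(self):
        return len(self.data)
    def __getitem__(self, idx):
        return self.data[idx], self.labels[idx]

ds = PairDataset()
x, y = ds[0]                   # one (data, label) pair
loader = DataLoader(ds, batch_size=4)
xb, yb = next(iter(loader))    # a mini-batch of 4 pairs
```

ImageFolder implements exactly this interface, which is why it can be handed straight to DataLoader.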

I plan to publish a separate article detailing training with this setup; the way data is fed into the model looks like this:

from torch.autograd import Variable# needed for Variable; deprecated in recent PyTorch

for batch_idx, (image, label) in enumerate(train_loader):
    #image, label = Variable(image), Variable(label)#cpu
    image, label = Variable(image).cuda(), Variable(label).cuda()
    optimizer.zero_grad()
    output = model(image)
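Since PyTorch 0.4, Variable is a no-op and tensors can be used directly. A self-contained modern version of the same step, with a toy linear model and random data standing in for the article's model and train_loader, might look like:

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Toy stand-ins for the article's model and train_loader
model = nn.Linear(4, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
criterion = nn.CrossEntropyLoss()
train_loader = DataLoader(TensorDataset(torch.randn(8, 4),
                                        torch.randint(0, 2, (8,))), batch_size=4)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

for batch_idx, (image, label) in enumerate(train_loader):
    image, label = image.to(device), label.to(device)  # replaces Variable(...).cuda()
    optimizer.zero_grad()
    output = model(image)
    loss = criterion(output, label)
    loss.backward()   # backpropagate
    optimizer.step()  # update weights
```

The .to(device) calls make the same code run on either CPU or GPU, matching the get_device() helper earlier in the script.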

About application

This program will be used in a later Keras vs. PyTorch article. The data augmentation program (makedataset_rgb.py) and the UI panel program (create_panel.py) are separated out into mylib, so I hope they are useful on their own.

reference

https://pytorch.org/docs/stable/data.html
https://qiita.com/mathlive/items/2a512831878b8018db02
https://qiita.com/takurooo/items/e4c91c5d78059f92e76d
