About learning with Google Colab

Overview

Google Colab is surprisingly useful, but reading data directly from Google Drive is unusually slow (image data, in my case). So I use .p and .h5 files to speed up reading, bringing it close to reading from a local disk.

Overall flow

The method will vary from person to person, but in my case:

  1. Create .txt files that list the data paths for training, validation, and test, respectively.
  2. Pack the .txt files created in step 1 into a .p file (this step alone probably doesn't gain much speed).
  3. Using the .p file, create a .h5 file that maps each name to its data (.png files this time).

That is the plan.

Because of the task I am currently working on (speech synthesis), I also write out the outputs separately, but for classification tasks and the like, I think it is better to save the input and output together as tuples rather than separating them.
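
For example, a minimal sketch of saving such (input, label) tuples with pickle for a classification task might look like this (the file names and labels are hypothetical):

import pickle

# Hypothetical (input path, label) pairs for a classification task
samples = [('./image/cat_001.png', 0), ('./image/dog_001.png', 1)]

with open('samples.p', 'wb') as f:
    pickle.dump(samples, f)  # save input and output together as tuples

with open('samples.p', 'rb') as f:
    samples = pickle.load(f)  # they come back in the same form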

1. Create a .txt file

maketxt.py


import os
import re
import glob
from sklearn.model_selection import train_test_split

def Create_txt(txt_path):
    # Collect only the directories whose paths contain 'segment_'
    # (only the image directories matter here, so matching on the path is enough).
    fileList = [p for p in glob.glob('./image/**/', recursive=True) if re.search('/segment_', p)]
    # Split the data 8:1:1 into train / validation / test
    train_data, val_test_data = train_test_split(fileList, test_size=0.2)
    val_data, test_data = train_test_split(val_test_data, test_size=0.5)

    try:
        train_txt = os.path.join(txt_path, 'train.txt')
        with open(train_txt, mode='x') as f:
            for train_path in train_data:
                f.write(train_path.rstrip('/') + '\n')

        val_txt = os.path.join(txt_path, 'val.txt')
        with open(val_txt, mode='x') as f:
            for val_path in val_data:
                f.write(val_path.rstrip('/') + '\n')

        test_txt = os.path.join(txt_path, 'test.txt')
        with open(test_txt, mode='x') as f:
            for test_path in test_data:
                f.write(test_path.rstrip('/') + '\n')

    except FileExistsError:
        print('already exists')

What this does:

  1. Gets all the data paths in the directory.
  2. Splits the acquired paths.
  3. Saves each split (the try statement ensures existing files are not overwritten).
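
As a usage sketch (the directory name './data_file' is an assumption based on the later steps):

Create_txt('./data_file')  # writes train.txt / val.txt / test.txt into ./data_file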

2. Create .p file

Reading a .txt file doesn't take much time to begin with, so you can skip this step.

.p stands for pickle, a Python module that can save the state of an object by serializing it. (Please see other pages for the details.)

In this case, a large number of strings (file paths) are stored in one file. The code is below; it just reads the text files and puts their contents into a pickle file.

import os
import pickle

# Collect the paths from each .txt file into one dictionary,
# keyed by split name so it can be looked up later (e.g. data_dict['train'])
data_dict = dict()
for split in ['train', 'val', 'test']:
    video_list = list()
    txt_path = os.path.join('./data_file', split + '.txt')  # directory holding the .txt files
    with open(txt_path, 'r') as textFile:
        for line in textFile:
            video_list.append(line.replace('\n', ''))
    data_dict[split] = video_list

pickle.dump(data_dict, open('txt_file.p', 'wb'))

3. Create .h5 file

This is the main part of this article.

A .h5 file is a binary file in the HDF5 format, which can hold a hierarchical structure inside a single file. In other words, the directory tree you would normally manage on a computer can be managed inside one huge file.
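
As a minimal sketch of that hierarchy (the group and dataset names here are made up for illustration):

import h5py
import numpy as np

# Groups act like directories and datasets like files, all inside one .h5 file
with h5py.File('example.h5', 'w') as f:
    # intermediate groups ('/train/input') are created automatically
    f.create_dataset('/train/input/sample_0', data=np.zeros((64, 64, 3)))

with h5py.File('example.h5', 'r') as f:
    print(f['/train/input/sample_0'].shape)  # (64, 64, 3)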

By storing a large number of image files in one huge file, you can cut down the time it takes to load images. Google Colab in particular is slow, so this helped a lot. (The first epoch was the worst offender: it took 10 minutes instead of 2 hours.)

Writing data

import h5py
import pickle

fileName = 'data.h5'
# Read the path dictionary back from the pickle file
dataFileList = pickle.load(open('txt_file.p', 'rb'))
train_list = dataFileList['train']

count = 0
with h5py.File(fileName, 'w') as f:
    f.create_group('/train')
    f.create_group('/train/input')
    f.create_group('/train/output')

    for train in train_list:
        # pull_item is my own loader; it returns
        # (input name, output name, input data, output data) for one sample
        data = pull_item(train)
        # strip the leading '.' of the relative path to get an absolute HDF5 path
        f.create_dataset('/train' + data[0].strip('.'), data=data[2])
        f.create_dataset('/train' + data[1].strip('.'), data=data[3])
        f.flush()
        if count % 100 == 0:
            print(count)
        count += 1
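
pull_item above is my own loader, so it is not shown here; as a rough sketch of what such a function might look like for this layout (the ./input and ./output directory names and file extensions are assumptions for illustration):

import numpy as np
from PIL import Image

def pull_item(name):
    # Hypothetical layout: a PNG under ./input and a CSV under ./output
    input_name = './input/' + name             # e.g. './input/images_18'
    output_name = './output/' + name + '.csv'
    input_data = np.asarray(Image.open(input_name + '.png'))  # image as an array
    output_data = np.loadtxt(output_name, delimiter=',')      # target features
    return input_name, output_name, input_data, output_data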

As a flow:

  1. Create the .h5 file, then create a group for inputs and a group for outputs. (The hierarchy is up to you, but it seems to get hard to read if there are too many levels.)

  2. Save the training data one sample after another with

f.create_dataset('dataset name', data=array_contents)

That's all there is to it.

Reading data

import h5py
from PIL import Image
import numpy as np

image = '/train/input/images_18'
output = '/train/output/images_18.csv'
with h5py.File('data.h5', mode='r') as data:
    img_data = data[image]            # lazy h5py dataset object
    img_group = img_data[...]         # pull the actual array into memory
    img_group = img_group.astype(np.float64)
    feature = data[output]
    feature_data = feature[...]
    # no explicit close() needed; the with block closes the file

Because of how HDF5 works, indexing the file at first only returns a dataset object, so you need to explicitly state that you want the contents of that object, as in

img_group = img_data[...]

After that, everything works as usual.
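
As a small sketch of that difference (assuming the data.h5 file written above):

import h5py

with h5py.File('data.h5', mode='r') as data:
    dset = data['/train/input/images_18']  # h5py.Dataset: nothing read yet
    arr = dset[...]                        # numpy.ndarray: contents now in memory
    print(type(dset), type(arr))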

At the end

Even when I searched for things like "google colab loading is slow", I couldn't find an article about using a binary file, so I hope this article at least makes you aware that the option exists.

The topic seems to go deeper than I expected, but I don't have much time or motivation to research it properly, so if you want to know more, please look at other pages.
