[TensorFlow 2.x compatible version] How to train a large amount of data using TFRecord & DataSet in TensorFlow (Keras)

Introduction

This is an updated version of my previous article: How to train a large amount of data with TFRecord & DataSet in TensorFlow & Keras - Qiita

What I want to achieve is an efficient way to train on a huge dataset that does not fit in memory. The approach lets CPU-side data reading and GPU-side computation run in parallel: the data is saved in a dedicated format and read through the DataSet API so that training can proceed efficiently.

With the release of TensorFlow 2, some module names have changed compared to the previous article, and some of the processing has become simpler to write. In this article I will show how to write the same thing in TensorFlow 2, focusing on the differences from the previous version. I will also switch to the Keras that is bundled with TensorFlow.

Advance preparation

This article uses Python 3.6.9 + TensorFlow 2.1.0 on Linux (Ubuntu 18.04).

Starting with TensorFlow 1.15 / 2.1, the CPU and GPU builds of the pip package have been unified. So whether you just want to try things out on a CPU or run serious training on a GPU,

pip3 install tensorflow==2.1.0

is all you need. Note that if you want to use a GPU, you have to set up CUDA 10.1. GPU support | TensorFlow
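
To check that the installation worked and whether a GPU is visible to TensorFlow, you can run a quick sanity check like the following (a minimal sketch; the output naturally depends on your machine):

import tensorflow as tf

print(tf.__version__)                          # e.g. 2.1.0
print(tf.config.list_physical_devices("GPU"))  # empty list if only the CPU is available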

Data preparation

TensorFlow has its own data format (TFRecord) that it can process efficiently. Let's convert existing data into TFRecords; later we will read them through the DataSet API.

data2tfrecord.py


#!/usr/bin/env python3

import numpy as np
import tensorflow as tf
from tensorflow.keras.datasets import mnist

def feature_float_list(l):
    return tf.train.Feature(float_list=tf.train.FloatList(value=l))

def record2example(r):
    return tf.train.Example(features=tf.train.Features(feature={
        "x": feature_float_list(r[0:-1]),
        "y": feature_float_list([r[-1]])
    }))

filename_train = "train.tfrecords"
filename_test  = "test.tfrecords"

# ===Read MNIST data===
# For simplicity, the test data is also used as the validation data during training.
(x_train, y_train), (x_test, y_test) = mnist.load_data()
print("x_train   : ", x_train.shape) # x_train   :  (60000, 28, 28)
print("y_train   : ", y_train.shape) # y_train   :  (60000,)
print("x_test    : ", x_test.shape)  # x_test    :  (10000, 28, 28)
print("y_test    : ", y_test.shape)  # y_test    :  (10000,)

# Preprocessing:
# convert pixels to float32 values in [0, 1],
# and flatten the features to 1D for TFRecord writing (one row = one record).
x_train = x_train.reshape((-1, 28*28)).astype("float32") / 255.0
x_test  = x_test.reshape((-1, 28*28)).astype("float32") / 255.0
# Labels are also converted to float32
y_train = y_train.reshape((-1, 1)).astype("float32")
y_test  = y_test.reshape((-1, 1)).astype("float32")
# Concatenate features and labels for TFRecord writing
data_train = np.c_[x_train, y_train]
data_test = np.c_[x_test,  y_test]

# For your own data, convert it to the same format here.
# If the whole dataset does not fit in memory, you can generate it piece by piece
# and repeat the write phase below for each piece.

#Write training data to TFRecord
with tf.io.TFRecordWriter(filename_train) as writer:
    for r in data_train:
        ex = record2example(r)
        writer.write(ex.SerializeToString())

#Write evaluation data to TFRecord
with tf.io.TFRecordWriter(filename_test) as writer:
    for r in data_test:
        ex = record2example(r)
        writer.write(ex.SerializeToString())

It is almost the same as last time, but with the TensorFlow version upgrade the package tensorflow.python_io has disappeared and the TFRecord-related functions now live in tensorflow.io. Also, since I switched to the Keras bundled with TensorFlow, the `import` lines have changed, but the way the MNIST dataset itself is loaded is unchanged.
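
As a quick check that the file was written correctly, the new API also makes it easy to read the TFRecord back directly in eager mode (a minimal sketch, assuming train.tfrecords was produced by the script above):

import tensorflow as tf

dataset = tf.data.TFRecordDataset(["train.tfrecords"])
print(sum(1 for _ in dataset))  # 60000 records

# Decode the first record and inspect its contents
for raw in dataset.take(1):
    ex = tf.train.Example()
    ex.ParseFromString(raw.numpy())
    print(len(ex.features.feature["x"].float_list.value))  # 784 pixel values
    print(ex.features.feature["y"].float_list.value)       # the label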

If you do not have the GPU libraries installed, you will get CUDA-related WARNINGs (libcublas cannot be found, etc.), but if you just want to try this lightly on the CPU you can ignore them.

Learning

It has changed a little from last time. Let's look at the code first and then go over the differences.

train.py


#!/usr/bin/env python3

import numpy as np
import tensorflow as tf
from tensorflow.keras.layers import Dense, Input
from tensorflow.keras.optimizers import RMSprop
from tensorflow.keras.callbacks import ModelCheckpoint
from tensorflow.keras.models import Model

#Learning settings
batch_size = 32
epochs = 10
#Feature setting
num_classes = 10    # Number of label classes: the 10 digits 0-9
feature_dim = 28*28 #Feature dimension. Handle as 1D for simplicity
# Number of training / evaluation records. Check these in advance.
# Note that when using multiple TFRecord files, these are totals across all files.
num_records_train = 60000
num_records_test  = 10000
# Number of mini-batches per epoch, used during training.
steps_per_epoch_train = (num_records_train-1) // batch_size + 1
steps_per_epoch_test  = (num_records_test-1) // batch_size + 1

#Decode 1 TFRecord
def parse_example(example):
    features = tf.io.parse_single_example(
        example,
        features={
            #Specify the number of dimensions when reading the list
            "x": tf.io.FixedLenFeature([feature_dim], dtype=tf.float32),
            "y": tf.io.FixedLenFeature([], dtype=tf.float32)
        })
    x = features["x"]
    y = features["y"]
    return x, y

# ===Prepare TFRecord file data for learning and evaluation===

dataset_train = tf.data.TFRecordDataset(["train.tfrecords"]) \
    .map(parse_example) \
    .shuffle(batch_size * 100) \
    .batch(batch_size).repeat(-1)
#When using multiple TFRecord files above, specify a list of file names.
# dataset_train = tf.data.TFRecordDataset(["train.tfrecords.{}".format(i) for i in range(10)]) \

dataset_test = tf.data.TFRecordDataset(["test.tfrecords"]) \
    .map(parse_example) \
    .batch(batch_size)

# ===Model definition===
#This time, only one 512-dimensional intermediate layer is specified.
layer_input = Input(shape=(feature_dim,))
fc1 = Dense(512, activation="relu")(layer_input)
layer_output = Dense(num_classes, activation="softmax")(fc1)
model = Model(layer_input, layer_output)
model.summary()

# With integer class labels, the model can be trained with loss="sparse_categorical_crossentropy".
# If the labels are one-hot vectors, use loss="categorical_crossentropy" instead.
model.compile(
    loss="sparse_categorical_crossentropy",
    optimizer=RMSprop(),
    metrics=["accuracy"])

# ===Learning===

# Save model checkpoints during training
cp_cb = ModelCheckpoint(
    filepath="weights.{epoch:02d}-{loss:.4f}-{val_loss:.4f}.hdf5",
    monitor="val_loss",
    verbose=1,
    save_best_only=True,
    mode="auto")
model.fit(
    x=dataset_train,
    epochs=epochs,
    verbose=1,
    steps_per_epoch=steps_per_epoch_train,
    validation_data=dataset_test,
    validation_steps=steps_per_epoch_test,
    callbacks=[cp_cb])

Difference from the previous time

tensorflow.keras.Model.fit() can now take a DataSet as its training data. tf.keras.Model | TensorFlow Core v2.1.0

x: Input data. It could be: (Omitted) A tf.data dataset. Should return a tuple of either (inputs, targets) or (inputs, targets, sample_weights).

Previously, when training from a DataSet, you had to plug the data into the `Input` layer, so there was a tedious procedure of creating two weight-sharing models, one for training and one for evaluation. In TensorFlow 2.x (and the Keras bundled with it), you can pass a DataSet to `Model.fit()`, so a single model is enough. You also no longer need to create an iterator yourself with `make_one_shot_iterator()`. Hooray!

In addition, the validation_data argument of tensorflow.keras.Model.fit() can now also take a DataSet for evaluation, so you no longer need to write your own evaluation callback (although the progress bar during evaluation does not appear... that is a topic for writing your own training loop).
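
For example, instead of make_one_shot_iterator() you can simply iterate over a DataSet in eager mode, and the same DataSet can be passed to evaluate() just like fit() (a minimal sketch reusing dataset_test, steps_per_epoch_test and model from train.py above):

# Peek at one mini-batch; no iterator plumbing is needed in TF 2.x eager mode
for x_batch, y_batch in dataset_test.take(1):
    print(x_batch.shape, y_batch.shape)  # (32, 784) (32,)

# A DataSet can also be passed straight to evaluate()
loss, acc = model.evaluate(dataset_test, steps=steps_per_epoch_test, verbose=0)
print(loss, acc)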

Performance improvement with multiple TFRecords

By loading multiple files in parallel, you may be able to increase GPU utilization (that is, speed up training).

Split the training data across several files and write them in the same way as last time. The only difference from before is that tf.python_io has changed to tf.io.

data2tfrecord.py (part)


for i in range(10):
    with tf.io.TFRecordWriter(filename_train + "." + str(i)) as writer:
        for r in data_train[i::10]:
            ex = record2example(r)
            writer.write(ex.SerializeToString())

For training, dataset_train is then created as follows.

train.py (part)


dataset_train = tf.data.Dataset.from_tensor_slices(["train.tfrecords.{}".format(i) for i in range(10)]) \
    .interleave(
        lambda filename: tf.data.TFRecordDataset(filename).map(parse_example, num_parallel_calls=1),
        cycle_length=10) \
    .shuffle(batch_size * 100) \
    .batch(batch_size) \
    .prefetch(1) \
    .repeat(-1)

The functionality equivalent to tf.contrib.data.parallel_interleave() (later tf.data.experimental.parallel_interleave()) from the previous article has been officially incorporated as the interleave() method of DataSet, so it is a little easier to write. However, by default it behaves like sloppy=False, so to get the sloppy=True behavior you apparently need to specify options with with_options(). tf.data.experimental.parallel_interleave | TensorFlow Core v2.1.0
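
For reference, this is roughly what that might look like (a sketch based on the pipeline above; I am assuming that experimental_deterministic=False corresponds to the old sloppy=True, and that num_parallel_calls on interleave() is what makes the file reads actually run in parallel):

# Allow non-deterministic (sloppy) ordering of interleaved elements
options = tf.data.Options()
options.experimental_deterministic = False

dataset_train = tf.data.Dataset.from_tensor_slices(["train.tfrecords.{}".format(i) for i in range(10)]) \
    .with_options(options) \
    .interleave(
        lambda filename: tf.data.TFRecordDataset(filename).map(parse_example, num_parallel_calls=1),
        cycle_length=10,
        num_parallel_calls=tf.data.experimental.AUTOTUNE) \
    .shuffle(batch_size * 100) \
    .batch(batch_size) \
    .prefetch(1) \
    .repeat(-1)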

Let's move to TensorFlow 2

There are some changes, but overall it is easier to write, so I felt there was no need to be afraid of migrating. You can also hope for core performance improvements (is that true?), so let's crunch through lots of data with TensorFlow 2!
