Here are some useful tips for using Google Colab.
Google Colab Pro limits a session to roughly 24 hours of continuous execution. If training a model takes longer than that, the runtime is reset partway through and the results computed so far are lost.
For example, suppose you estimate that 200 epochs will take about 24 hours and start the run; the actual calculation often takes slightly longer, and Colab may disconnect around epoch 190.
To work around this, we adopt the following approach: save a checkpoint of the model to Google Drive during training and resume from the latest saved file after a disconnection.
python
from tensorflow.keras.callbacks import ModelCheckpoint
checkpoint = ModelCheckpoint(filepath='XXX.h5',
                             monitor='loss',
                             save_best_only=True,
                             save_weights_only=False,
                             mode='min',
                             period=1)
1. filepath: the path (string) where the model file is saved
2. monitor: the quantity to monitor
3. save_best_only: if True, the latest best model according to the monitored quantity is kept and is not overwritten by worse ones
4. mode: one of {auto, min, max}
5. save_weights_only: if True, only the model's weights are saved; otherwise the entire model is saved
6. period: interval between checkpoints, in epochs
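The filepath can also contain named formatting options that Keras fills in from the epoch number and the logged metrics, so each file name records when it was written and how good the model was. A minimal sketch, assuming a validation split is used so that val_loss is available (the file name pattern here is just an example):
python
from tensorflow.keras.callbacks import ModelCheckpoint

# {epoch:02d} and {val_loss:.4f} are filled in by Keras when the checkpoint is written.
checkpoint = ModelCheckpoint(filepath='model-{epoch:02d}-{val_loss:.4f}.h5',
                             monitor='val_loss',
                             save_best_only=True,
                             save_weights_only=False,
                             mode='min',
                             period=1)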
Let's write the code using the Keras MNIST example.
python
from tensorflow.keras.callbacks import ModelCheckpoint
from tensorflow.keras.datasets import mnist
Mount Google Drive and set the folder where the models will be saved.
python
from google.colab import drive
import os

drive.mount('/content/drive')

MODEL_DIR = "/content/drive/My Drive/temp"
if not os.path.exists(MODEL_DIR):  # If the directory does not exist, create it.
    os.makedirs(MODEL_DIR)

checkpoint = ModelCheckpoint(
    filepath=os.path.join(MODEL_DIR, "model-{epoch:02d}.h5"), save_best_only=True)
python
history = model.fit(Xtrain, Ytrain, batch_size=BATCH_SIZE, epochs=NUM_EPOCHS, validation_split=0.1, callbacks=[checkpoint])
Running the above code saves a model file in the temp folder on Google Drive at the end of each epoch.
After a disconnection, training is resumed by loading one of the saved files, for example model-05.h5.
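If you do not want to type the file name by hand, you can look up the newest checkpoint in the folder first. A minimal sketch, assuming MODEL_DIR is the folder defined above and already contains saved checkpoints:
python
import os

# List the saved checkpoints and pick the newest one; the lexicographic sort works
# because the epoch number in "model-{epoch:02d}.h5" is zero-padded.
checkpoints = sorted(f for f in os.listdir(MODEL_DIR)
                     if f.startswith("model-") and f.endswith(".h5"))
latest = checkpoints[-1]  # e.g. "model-05.h5"
print("Resuming from", latest)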
python
# Load the saved model
model.load_weights(os.path.join(MODEL_DIR, "model-05.h5"))  # Specify the checkpoint file to resume from
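Because save_weights_only is left at its default of False, the .h5 files written by ModelCheckpoint contain the full model, not just the weights. In a fresh Colab session where the model object no longer exists, you can therefore rebuild it directly from the checkpoint with load_model instead of load_weights; this also restores the optimizer state, so the compile step does not need to be repeated. A minimal sketch:
python
from tensorflow.keras.models import load_model
import os

# Rebuild architecture, weights, and optimizer state from the full-model checkpoint.
model = load_model(os.path.join(MODEL_DIR, "model-05.h5"))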
Next, change the checkpoint file name from model-XX.h5 to model_new-XX.h5 so that the resumed run does not overwrite the files saved by the first run.
python
if not os.path.exists(MODEL_DIR):  # If the directory does not exist, create it.
    os.makedirs(MODEL_DIR)

checkpoint = ModelCheckpoint(
    filepath=os.path.join(MODEL_DIR, "model_new-{epoch:02d}.h5"),
    monitor='loss',
    save_best_only=True,
    mode='min',
    period=1)
python
history = model.fit(Xtrain, Ytrain, batch_size=BATCH_SIZE, epochs=NUM_EPOCHS, validation_split=0.1, callbacks=[checkpoint])
Looking at the training accuracy, you can see that training resumes from the level reached at the end of the previous run.
The newly trained model is saved as well.
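Note that the resumed run starts counting epochs from 1 again and trains for the full NUM_EPOCHS. If you want the epoch numbering (and the checkpoint file names) to continue from where the first run stopped, model.fit accepts an initial_epoch argument. A minimal sketch, assuming the first run stopped after epoch 5:
python
# Continue counting from epoch 6; training stops once NUM_EPOCHS epochs have been reached in total.
history = model.fit(Xtrain, Ytrain,
                    batch_size=BATCH_SIZE,
                    epochs=NUM_EPOCHS,
                    initial_epoch=5,
                    validation_split=0.1,
                    callbacks=[checkpoint])
For reference, the complete script for the first training run is shown below.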
python
from tensorflow.keras.callbacks import ModelCheckpoint
from tensorflow.keras.datasets import mnist
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout
from tensorflow.keras.utils import to_categorical
import os
import matplotlib.pyplot as plt
from google.colab import drive
drive.mount('/content/drive')
MODEL_DIR = "/content/drive/My Drive/temp"
if not os.path.exists(MODEL_DIR):  # If the directory does not exist, create it.
    os.makedirs(MODEL_DIR)

checkpoint = ModelCheckpoint(
    filepath=os.path.join(MODEL_DIR, "model-{epoch:02d}.h5"), save_best_only=True)
BATCH_SIZE = 128
NUM_EPOCHS = 20
(Xtrain, ytrain), (Xtest, ytest) = mnist.load_data()
Xtrain = Xtrain.reshape(60000, 784).astype("float32") / 255
Xtest = Xtest.reshape(10000, 784).astype("float32") / 255
Ytrain = to_categorical(ytrain, 10)
Ytest = to_categorical(ytest, 10)
print(Xtrain.shape, Xtest.shape, Ytrain.shape, Ytest.shape)
#Model definition
model = Sequential()
model.add(Dense(512, input_shape=(784,), activation="relu"))
model.add(Dropout(0.2))
model.add(Dense(512, activation="relu"))
model.add(Dropout(0.2))
model.add(Dense(10, activation="softmax"))
model.summary()
model.compile(optimizer="rmsprop", loss="categorical_crossentropy",
              metrics=["accuracy"])
#Learning execution
history = model.fit(Xtrain, Ytrain, batch_size=BATCH_SIZE, epochs=NUM_EPOCHS, validation_split=0.1, callbacks=[checkpoint])
#Graph drawing
plt.clf()
acc = history.history['accuracy']
val_acc = history.history['val_accuracy']
plot_epochs = range(1, len(acc)+1)
# Accuracy
plt.plot(plot_epochs, acc, 'bo-', label='Training acc')
plt.plot(plot_epochs, val_acc, 'b', label='Validation acc')
plt.title('model accuracy')
plt.ylabel('accuracy') #Y-axis label
plt.xlabel('epoch') #X-axis label
plt.legend()
plt.show()
loss = history.history['loss']
val_loss = history.history['val_loss']
plot_epochs = range(1, len(loss)+1)
# Loss
plt.plot(plot_epochs, loss, 'ro-', label='Training loss')
plt.plot(plot_epochs, val_loss, 'r', label='Validation loss')
plt.title('model loss')
plt.ylabel('loss') #Y-axis label
plt.xlabel('epoch') #X-axis label
plt.legend()
plt.show()
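The plots above are drawn from the History object, which lives only in memory and disappears together with the runtime. If you also want the learning curves to survive a disconnection, you can log the per-epoch metrics to a CSV file on Google Drive with the CSVLogger callback and redraw the graphs from that file later. A minimal sketch (the log file name is just an example):
python
from tensorflow.keras.callbacks import CSVLogger

# Append the per-epoch metrics (loss, accuracy, val_loss, val_accuracy) to a CSV file on Drive.
csv_logger = CSVLogger(os.path.join(MODEL_DIR, "training_log.csv"), append=True)

history = model.fit(Xtrain, Ytrain,
                    batch_size=BATCH_SIZE,
                    epochs=NUM_EPOCHS,
                    validation_split=0.1,
                    callbacks=[checkpoint, csv_logger])
The second script below puts the resume steps together: it mounts Drive again, loads the saved checkpoint, and continues training with the new checkpoint file prefix.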
python
from tensorflow.keras.callbacks import ModelCheckpoint
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout
from tensorflow.keras.utils import to_categorical
import os
import matplotlib.pyplot as plt
from google.colab import drive
drive.mount('/content/drive')
MODEL_DIR = "/content/drive/My Drive/temp"
if not os.path.exists(MODEL_DIR):  # If the directory does not exist, create it.
    os.makedirs(MODEL_DIR)

checkpoint = ModelCheckpoint(
    filepath=os.path.join(MODEL_DIR, "model-{epoch:02d}.h5"), save_best_only=True)

# Load the saved model.
# NOTE: the data preparation, model definition, and compile steps from the first script
# must be run again in the new session before the weights can be loaded.
model.load_weights(os.path.join(MODEL_DIR, "model-05.h5"))  # Specify the checkpoint file to resume from
if not os.path.exists(MODEL_DIR):  # If the directory does not exist, create it.
    os.makedirs(MODEL_DIR)

checkpoint = ModelCheckpoint(
    filepath=os.path.join(MODEL_DIR, "model_new-{epoch:02d}.h5"),
    monitor='loss',
    save_best_only=True,
    mode='min',
    period=1)
#Resume learning
history = model.fit(Xtrain, Ytrain, batch_size=BATCH_SIZE, epochs=NUM_EPOCHS, validation_split=0.1, callbacks=[checkpoint])
#Graph drawing
plt.clf()
acc = history.history['accuracy']
val_acc = history.history['val_accuracy']
plot_epochs = range(1, len(acc)+1)
# Accuracy
plt.plot(plot_epochs, acc, 'bo-', label='Training acc')
plt.plot(plot_epochs, val_acc, 'b', label='Validation acc')
plt.title('model accuracy')
plt.ylabel('accuracy') #Y-axis label
plt.xlabel('epoch') #X-axis label
plt.legend()
plt.show()
loss = history.history['loss']
val_loss = history.history['val_loss']
plot_epochs = range(1, len(loss)+1)
# Loss
plt.plot(plot_epochs, loss, 'ro-', label='Training loss')
plt.plot(plot_epochs, val_loss, 'r', label='Validation loss')
plt.title('model loss')
plt.ylabel('loss') #Y-axis label
plt.xlabel('epoch') #X-axis label
plt.legend()
plt.show()