[Google Colab] How to interrupt training and then resume it

1. Introduction

Here is a useful tip for working with Google Colab.

2. The problem

Google Colab Pro limits execution time to a maximum of 24 hours. If training a model takes longer than that, the session is terminated partway through and the results computed so far are lost.

For example, suppose you estimate that 200 epochs will take about 24 hours and start the run. In practice it takes slightly longer than estimated, and Google Colab may disconnect at around epoch 190, losing everything trained up to that point.

3. Solution

To solve this, we take the following approach (a compact sketch of the overall flow follows this list):

  1. Use Keras' ModelCheckpoint() to save the model at regular intervals.
  2. Save the checkpoints to a temp folder on Google Drive (see reference 1, Google Colab Tips Organized).
  3. When the session ends, load the saved model and resume training from where it left off.
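
The whole flow, condensed into one sketch (the folder and file names are the ones used in the rest of this article; sections 3.1 to 3.3 walk through each step):

```python
import os
from google.colab import drive
from tensorflow.keras.callbacks import ModelCheckpoint

# 1. Mount Google Drive so checkpoints survive the Colab session.
drive.mount('/content/drive')
MODEL_DIR = "/content/drive/My Drive/temp"
os.makedirs(MODEL_DIR, exist_ok=True)

# 2. Save a checkpoint every epoch during the first run.
checkpoint = ModelCheckpoint(
    filepath=os.path.join(MODEL_DIR, "model-{epoch:02d}.h5"),
    save_best_only=True)
# history = model.fit(..., callbacks=[checkpoint])

# 3. In a new session, load the last checkpoint and call fit() again.
# model.load_weights(os.path.join(MODEL_DIR, "model-05.h5"))
# history = model.fit(..., callbacks=[checkpoint])
```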

3.1. ModelCheckpoint() settings

```python
from tensorflow.keras.callbacks import ModelCheckpoint

checkpoint = ModelCheckpoint(filepath='XXX.h5',
                             monitor='loss',
                             save_best_only=True,
                             save_weights_only=False,
                             mode='min',
                             period=1)
```

Argument description

  1. filepath: Path (string) where the model file is saved.
  2. monitor: The quantity to monitor.
  3. save_best_only: If True, the latest best model according to the monitored quantity is not overwritten.
  4. save_weights_only: If True, only the model's weights are saved; otherwise the entire model is saved.
  5. mode: One of {auto, min, max}.
  6. period: Interval between checkpoints (number of epochs).
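
Note that in recent versions of tf.keras the period argument is deprecated in favor of save_freq: passing save_freq='epoch' saves once per epoch, while an integer value is interpreted as a number of batches. A sketch of the equivalent setup (XXX.h5 is the same placeholder file name as above):

```python
from tensorflow.keras.callbacks import ModelCheckpoint

# Equivalent setup for newer tf.keras versions, where `period` is deprecated.
# save_freq='epoch' checkpoints once per epoch.
checkpoint = ModelCheckpoint(filepath='XXX.h5',
                             monitor='loss',
                             save_best_only=True,
                             save_weights_only=False,
                             mode='min',
                             save_freq='epoch')
```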

3.2. First training run → Save intermediate results to Google Drive

The code below is based on the Keras MNIST example.

```python
from tensorflow.keras.callbacks import ModelCheckpoint
from tensorflow.keras.datasets import mnist
```

Mount Google Drive and set up the model save folder

```python
import os

from google.colab import drive
drive.mount('/content/drive')

MODEL_DIR = "/content/drive/My Drive/temp"

if not os.path.exists(MODEL_DIR):  # If the directory does not exist, create it.
    os.makedirs(MODEL_DIR)
checkpoint = ModelCheckpoint(
    filepath=os.path.join(MODEL_DIR, "model-{epoch:02d}.h5"), save_best_only=True)
```

Run training

```python
history = model.fit(Xtrain, Ytrain, batch_size=BATCH_SIZE, epochs=NUM_EPOCHS,
                    validation_split=0.1, callbacks=[checkpoint])
```

Running the above code saves the model files in the temp folder on Google Drive.
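
To check from the notebook which checkpoints have already been written (and to pick the most recent one to resume from), something like the following works; it simply lists the MODEL_DIR folder configured above:

```python
import os

# List the checkpoint files saved so far; with the zero-padded epoch numbers
# in the file names, lexicographic order matches epoch order.
checkpoints = sorted(f for f in os.listdir(MODEL_DIR)
                     if f.startswith("model-") and f.endswith(".h5"))
print(checkpoints)        # e.g. ['model-01.h5', 'model-02.h5', ...]
latest = checkpoints[-1]  # checkpoint with the highest epoch number
print("Latest checkpoint:", latest)
```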

3.3. Second training run → Load the saved model and resume training

Here we resume training from model-05.h5.

Model loading

```python
# Load the saved weights
model.load_weights(os.path.join(MODEL_DIR, "model-05.h5"))  # Specify the checkpoint file to resume from
```
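
Since save_weights_only is False in the checkpoint settings above, each .h5 file should contain the full model (architecture, weights, and optimizer state). In a fresh session you can therefore also restore the model with load_model() instead of rebuilding it and calling load_weights(); a minimal sketch:

```python
from tensorflow.keras.models import load_model
import os

MODEL_DIR = "/content/drive/My Drive/temp"

# Restore the entire model (architecture, weights and optimizer state)
# from the checkpoint, so no model definition or compile() is needed beforehand.
model = load_model(os.path.join(MODEL_DIR, "model-05.h5"))
model.summary()
```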

Rename the checkpoints of the second run

Change model-XX.h5 to model_new-XX.h5 so that the new checkpoints do not overwrite the files from the first run.

```python
if not os.path.exists(MODEL_DIR):  # If the directory does not exist, create it.
    os.makedirs(MODEL_DIR)
checkpoint = ModelCheckpoint(
    filepath=os.path.join(MODEL_DIR, "model_new-{epoch:02d}.h5"),
    monitor='loss',
    save_best_only=True,
    mode='min',
    period=1)
```

Resume training

```python
history = model.fit(Xtrain, Ytrain, batch_size=BATCH_SIZE, epochs=NUM_EPOCHS,
                    validation_split=0.1, callbacks=[checkpoint])
```

Looking at the training accuracy, we can see that training resumes from the level reached at the end of the first run.
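
If you also want the epoch counter (and therefore the {epoch:02d} part of the new checkpoint file names) to continue from where the first run stopped, model.fit() accepts an initial_epoch argument. A sketch, assuming the first run finished at epoch 5:

```python
# Continue the epoch numbering of the first run: training starts at epoch 6
# and runs up to NUM_EPOCHS, so the new checkpoints are named model_new-06.h5, ...
history = model.fit(Xtrain, Ytrain,
                    batch_size=BATCH_SIZE,
                    epochs=NUM_EPOCHS,
                    initial_epoch=5,
                    validation_split=0.1,
                    callbacks=[checkpoint])
```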

The newly trained model (model_new-XX.h5) is also saved to the temp folder.

4. Summary

  1. Google Colab Pro disconnects sessions after at most 24 hours of execution, so long training runs can be lost partway through.
  2. We solved this problem with Keras' ModelCheckpoint() and a mounted Google Drive folder.
  3. We confirmed that the proposed method works and is effective.

5. Overall code

First training run

```python
from tensorflow.keras.callbacks import ModelCheckpoint
from tensorflow.keras.datasets import mnist
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout
from tensorflow.keras.utils import to_categorical
import os
import matplotlib.pyplot as plt

# Mount Google Drive and set up the checkpoint folder
from google.colab import drive
drive.mount('/content/drive')
MODEL_DIR = "/content/drive/My Drive/temp"
if not os.path.exists(MODEL_DIR):  # If the directory does not exist, create it.
    os.makedirs(MODEL_DIR)
checkpoint = ModelCheckpoint(
    filepath=os.path.join(MODEL_DIR, "model-{epoch:02d}.h5"), save_best_only=True)

BATCH_SIZE = 128
NUM_EPOCHS = 20

# Load and preprocess the MNIST data
(Xtrain, ytrain), (Xtest, ytest) = mnist.load_data()
Xtrain = Xtrain.reshape(60000, 784).astype("float32") / 255
Xtest = Xtest.reshape(10000, 784).astype("float32") / 255
Ytrain = to_categorical(ytrain, 10)
Ytest = to_categorical(ytest, 10)
print(Xtrain.shape, Xtest.shape, Ytrain.shape, Ytest.shape)

# Model definition
model = Sequential()
model.add(Dense(512, input_shape=(784,), activation="relu"))
model.add(Dropout(0.2))
model.add(Dense(512, activation="relu"))
model.add(Dropout(0.2))
model.add(Dense(10, activation="softmax"))
model.summary()
model.compile(optimizer="rmsprop", loss="categorical_crossentropy",
              metrics=["accuracy"])

# Run training
history = model.fit(Xtrain, Ytrain, batch_size=BATCH_SIZE, epochs=NUM_EPOCHS,
                    validation_split=0.1, callbacks=[checkpoint])

# Plot accuracy
plt.clf()
acc = history.history['accuracy']
val_acc = history.history['val_accuracy']

plot_epochs = range(1, len(acc) + 1)
plt.plot(plot_epochs, acc, 'bo-', label='Training acc')
plt.plot(plot_epochs, val_acc, 'b', label='Validation acc')
plt.title('model accuracy')
plt.ylabel('accuracy')  # Y-axis label
plt.xlabel('epoch')  # X-axis label
plt.legend()
plt.show()

# Plot loss
loss = history.history['loss']
val_loss = history.history['val_loss']

plot_epochs = range(1, len(loss) + 1)
plt.plot(plot_epochs, loss, 'ro-', label='Training loss')
plt.plot(plot_epochs, val_loss, 'r', label='Validation loss')
plt.title('model loss')
plt.ylabel('loss')  # Y-axis label
plt.xlabel('epoch')  # X-axis label
plt.legend()
plt.show()
```

Second and subsequent training runs

```python
from tensorflow.keras.callbacks import ModelCheckpoint
from tensorflow.keras.datasets import mnist
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout
from tensorflow.keras.utils import to_categorical
import os
import matplotlib.pyplot as plt

# Mount Google Drive and set up the checkpoint folder
from google.colab import drive
drive.mount('/content/drive')
MODEL_DIR = "/content/drive/My Drive/temp"
if not os.path.exists(MODEL_DIR):  # If the directory does not exist, create it.
    os.makedirs(MODEL_DIR)

BATCH_SIZE = 128
NUM_EPOCHS = 20

# In a new session the data and the model have to be prepared again,
# exactly as in the first training run.
(Xtrain, ytrain), (Xtest, ytest) = mnist.load_data()
Xtrain = Xtrain.reshape(60000, 784).astype("float32") / 255
Xtest = Xtest.reshape(10000, 784).astype("float32") / 255
Ytrain = to_categorical(ytrain, 10)
Ytest = to_categorical(ytest, 10)

# Model definition (same architecture as the first run)
model = Sequential()
model.add(Dense(512, input_shape=(784,), activation="relu"))
model.add(Dropout(0.2))
model.add(Dense(512, activation="relu"))
model.add(Dropout(0.2))
model.add(Dense(10, activation="softmax"))
model.compile(optimizer="rmsprop", loss="categorical_crossentropy",
              metrics=["accuracy"])

# Load the saved weights
model.load_weights(os.path.join(MODEL_DIR, "model-05.h5"))  # Specify the checkpoint file to resume from

# Save the new checkpoints under a different name (model_new-XX.h5)
checkpoint = ModelCheckpoint(
    filepath=os.path.join(MODEL_DIR, "model_new-{epoch:02d}.h5"),
    monitor='loss',
    save_best_only=True,
    mode='min',
    period=1)

# Resume training
history = model.fit(Xtrain, Ytrain, batch_size=BATCH_SIZE, epochs=NUM_EPOCHS,
                    validation_split=0.1, callbacks=[checkpoint])

# Plot accuracy
plt.clf()
acc = history.history['accuracy']
val_acc = history.history['val_accuracy']

plot_epochs = range(1, len(acc) + 1)
plt.plot(plot_epochs, acc, 'bo-', label='Training acc')
plt.plot(plot_epochs, val_acc, 'b', label='Validation acc')
plt.title('model accuracy')
plt.ylabel('accuracy')  # Y-axis label
plt.xlabel('epoch')  # X-axis label
plt.legend()
plt.show()

# Plot loss
loss = history.history['loss']
val_loss = history.history['val_loss']

plot_epochs = range(1, len(loss) + 1)
plt.plot(plot_epochs, loss, 'ro-', label='Training loss')
plt.plot(plot_epochs, val_loss, 'r', label='Validation loss')
plt.title('model loss')
plt.ylabel('loss')  # Y-axis label
plt.xlabel('epoch')  # X-axis label
plt.legend()
plt.show()
```


6. Reference materials

  1. Google Colab Tips Organized
  2. [How to interrupt learning in Keras and then restart it](https://intellectual-curiosity.tokyo/2019/06/25/keras%e3%81%a7%e5%ad%a6%e7%bf%92%e3%82%92%e4%b8%ad%e6%96%ad%e3%81%97%e3%81%9f%e5%be%8c%e3%80%81%e9%80%94%e4%b8%ad%e3%81%8b%e3%82%89%e5%86%8d%e9%96%8b%e3%81%99%e3%82%8b%e6%96%b9%e6%b3%95/)
  3. Save the best model (How to use ModelCheckpoint)
