How to run TensorFlow as a batch job on the supercomputer ITO

Subsystem B (ITO-B) of the supercomputer ITO (ac.jp/scp/system/ITO/01_intro.html) is equipped with GPUs. This article explains how to run TensorFlow on ITO-B. The GPUs can only be used through batch processing, so TensorFlow has to run as a Python script rather than in Jupyter. The procedure for running TensorFlow on the front end of the supercomputer ITO was covered in a separate article. The two procedures have much in common, but batch processing on ITO-B is the easier of the two. **All of the following steps are performed on the login node of the supercomputer ITO.**

Install TensorFlow

It is assumed that the base Python environment has been built with Miniconda. Create a virtual environment from the `anaconda` channel and install the `tensorflow-gpu` package from the same channel. The background for this choice is explained in a separate article. Using the Miniconda installation prepared in that article, create a new virtual environment named `tf` and carry out the installation in it.

$ conda create -c anaconda -n tf
$ conda activate tf
$ conda install -c anaconda tensorflow-gpu

The GPU cannot be used on the login node, so the installation cannot be verified at this stage.
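Because the check has to happen inside a batch job, one option is to log the GPUs that TensorFlow can see at the start of the Python code. The following is a minimal sketch, not part of the original procedure. It assumes TensorFlow 2.1 or later, where `tf.config.list_physical_devices` is available. The function name `visible_gpus` is hypothetical.

```python
def visible_gpus():
    """Return the names of GPUs TensorFlow can see, or None without TensorFlow."""
    try:
        # Imported lazily so the script still runs where TensorFlow is missing.
        import tensorflow as tf
    except ImportError:
        return None
    # Assumption: TensorFlow 2.1+; older releases do not provide this call.
    return [device.name for device in tf.config.list_physical_devices("GPU")]


if __name__ == "__main__":
    gpus = visible_gpus()
    if gpus is None:
        print("TensorFlow is not installed in this environment")
    elif not gpus:
        print("TensorFlow is installed but sees no GPU")
    else:
        print("GPUs visible to TensorFlow:", gpus)
```

On the login node this is expected to report no GPU. Inside a batch job the message ends up in the job's standard output file.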

Creating a batch processing script

For details on batch processing, refer to the official site. Create a bash script that both prepares the environment (including loading the GPU modules) and runs the Python code. The following bash script, `ito_b.sh`, uses one GPU. Everything after an arrow `←` is an annotation for this article, so delete it from the actual script. Choose the resource group by consulting the official site.

ito_b.sh


#!/bin/bash
#PJM -L "rscunit=ito-b"          ← Specify ITO-B
#PJM -L "rscgrp=ito-g-1"         ← Specify the resource group
#PJM -L "vnode=1"                ← Specify the number of nodes to use
#PJM -L "vnode-core=9"           ← Specify the number of cores per node
#PJM -L "elapse=12:00:00"        ← Specify the maximum run time (12 hours here)
#PJM -X                          ← Inherit the login node's environment variables in the batch job

source ~/.bashrc     #← Miniconda writes its Python setup to .bashrc, so read it here
module load cuda/10.1     #← Load CUDA 10.1 because the GPU is used
module list     #← Confirm that CUDA is loaded
conda activate tf     #← Enter the virtual environment tf
conda info -e     #← Confirm that you have entered the virtual environment tf

python ann_experiments.py     #← Execute Python code
conda deactivate     #← Exit from the virtual environment

A batch job inherits nothing from the login node except environment variables, so the script itself must load the CUDA module and set up the Python environment. There is no need to activate the `tf` virtual environment on the login node beforehand. This example assumes that the Python code and `ito_b.sh` are in the same directory.
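The arrow annotations in the listing have to be removed before the script is submitted. Rather than deleting them by hand, a small helper can do it. The following is a minimal sketch using only the Python standard library. The function name `strip_arrow_notes` is hypothetical, and the sketch assumes that `←` never appears in a real command.

```python
import re

# Optional whitespace, an optional "#", the arrow, and the rest of the line.
_ARROW_NOTE = re.compile(r"[ \t]*#?←.*$")


def strip_arrow_notes(script_text):
    """Return script_text with every '←' annotation removed, line by line."""
    cleaned = [_ARROW_NOTE.sub("", line) for line in script_text.splitlines()]
    return "\n".join(cleaned) + "\n"


if __name__ == "__main__":
    sample = (
        '#!/bin/bash\n'
        '#PJM -L "vnode=1"← Specify the number of nodes to use\n'
        'conda activate tf     #← Enter the virtual environment tf\n'
    )
    print(strip_arrow_notes(sample), end="")
```

Lines without an arrow, such as the shebang and the `#PJM` directives themselves, pass through unchanged.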

Batch job submission

Submit the batch processing script `ito_b.sh` to the batch processing system as a batch job.

$ pjsub ito_b.sh

After that, you only need to wait for the job to finish, so you can log out. To check the status of the job, run the following.

$ pjstat

When the batch job ends, standard output and standard error are written to files with names such as `ito_b.sh.o0000000` and `ito_b.sh.e0000000`, respectively. Each file name is the batch script name followed by `.o` or `.e` and a 7-digit number. Take care that the standard output file does not grow too large.
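The naming rule above is regular enough to check with a short script. The following is a minimal sketch that recognizes the output files and reports any that exceed a size threshold. The function names and the 100 MiB default are hypothetical choices for illustration, not limits set by ITO.

```python
import re
from pathlib import Path

# Naming rule described above: <script name>.<o or e><7-digit number>
_OUTPUT_NAME = re.compile(r"^(?P<script>.+)\.(?P<stream>[oe])(?P<number>\d{7})$")


def parse_output_name(name):
    """Split a name like 'ito_b.sh.o0000000' into (script, stream, number)."""
    match = _OUTPUT_NAME.match(name)
    if match is None:
        return None
    stream = "stdout" if match.group("stream") == "o" else "stderr"
    return match.group("script"), stream, match.group("number")


def oversized_outputs(directory=".", limit_bytes=100 * 1024 * 1024):
    """List (file name, size) for job output files larger than limit_bytes."""
    hits = []
    for path in sorted(Path(directory).iterdir()):
        if path.is_file() and parse_output_name(path.name) is not None:
            size = path.stat().st_size
            if size > limit_bytes:
                hits.append((path.name, size))
    return hits


if __name__ == "__main__":
    print(parse_output_name("ito_b.sh.e0000000"))
    for name, size in oversized_outputs():
        print(f"{name}: {size} bytes")
```

Running it in the directory that holds `ito_b.sh` lists any output file that has grown past the chosen threshold.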
