Building an environment on ABCI and running MaskTrack RCNN

Overview

When I tried to run MaskTrack RCNN, the baseline model for the recent Video Instance Segmentation task, on ABCI, I ran into one error after another. This is a memo of how I built the environment on ABCI, from setup through training the model. At the end, I have collected the errors I wrestled with and how I solved them.

Prerequisites

-- Information as of January 15, 2021.
-- You have an ABCI account and can access the server remotely.
-- Anaconda3 is already installed.

Environment

-- Access ABCI remotely with ssh.
-- Start a GPU node. The following command uses a GPU interactively for up to 1 hour with the minimum configuration.

```bash
qrsh -g gcc50560 -l rt_G.small=1 -l h_rt=1:00:00 -m abes
```

-- Load the required modules. Run the following commands on the launched node.

```bash
module load cuda/9.0
module load cudnn/7.6/7.6.2
module load nccl/2.3/2.3.7-1
```

-- Create and set up a virtual environment. Run the following script with bash masktrackrcnn_env.sh to clone the repository and build the virtual environment.

```masktrackrcnn_env.sh
# clone MaskTrackRCNN
git clone https://github.com/youtubevos/MaskTrackRCNN.git
cd MaskTrackRCNN

# create environment
conda create -n MaskTrackRCNN python=3.7 -y

# >>> conda init >>>
__conda_setup="$(CONDA_REPORT_ERRORS=false "$HOME/anaconda3/bin/conda" shell.bash hook 2> /dev/null)"
if [ $? -eq 0 ]; then
    \eval "$__conda_setup"
else
    if [ -f "$HOME/anaconda3/etc/profile.d/conda.sh" ]; then
        . "$HOME/anaconda3/etc/profile.d/conda.sh"
        CONDA_CHANGEPS1=false conda activate base
    else
        \export PATH="$PATH:$HOME/anaconda3/bin"
    fi
fi
unset __conda_setup
# <<< conda init <<<

# activate environment
conda activate MaskTrackRCNN

# setup environment
conda install -c pytorch pytorch=0.4.1 cudatoolkit=9.0 torchvision -y
conda install -c conda-forge  opencv -y
conda install numpy cython -y
conda install -c psi4 gcc-5 -y
conda install libgcc -y
pip install git+https://github.com/youtubevos/cocoapi.git#"egg=pycocotools&subdirectory=PythonAPI"
bash compile.sh
pip install . --user

pip uninstall mmcv -y
pip install mmcv==0.2.0
```

-- Modify the library. The necessary environment is now in place, but left as it is, it causes a problem during training, so make one final fix: change line 39 of [python_lib_path]/site-packages/mmcv/runner/checkpoint.py to the code shown in the error summary at the end of this article (the entry about the dimension mismatch of bbox_head.fc_cls.weight).

-- Environment setup complete!

Dataset preparation

-- Download the data and labels. Download the data and labels from here.
-- Place the dataset. Create symbolic links so that the directory has the following structure.

```
MaskTrackRCNN
├── mmdet
├── tools
├── configs
├── data
│   ├── train
│   ├── val
│   ├── annotations
│   │   ├── instances_train_sub.json
│   │   ├── instances_val_sub.json
``` 

The annotation files are originally named train.json and valid.json, so it is a good idea to copy them under the names shown above. Alternatively, you may change the file names in the config.
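
The copy-and-rename step can be sketched as follows. This is only an illustration: copy_annotations is a hypothetical helper, and it assumes that train.json and valid.json sit together in one annotations directory of the downloaded dataset.

```bash
# Hypothetical helper: copy the annotation files to the names shown in the
# directory structure. $1 is the directory that holds train.json and
# valid.json (assumed layout). The originals are kept.
copy_annotations() {
    local ann_dir="$1"
    cp "$ann_dir/train.json" "$ann_dir/instances_train_sub.json" &&
    cp "$ann_dir/valid.json" "$ann_dir/instances_val_sub.json"
}

# Usage (placeholder path):
#   copy_annotations /path/to/original/data_dir/annotations
```

Copying inside the original annotations directory means a symbolic link to that directory exposes the renamed files without any further step.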

```sh
# An example of creating the symbolic links.
# $MaskTrackRCNN is the path to the root of the MaskTrackRCNN repository.
mkdir -p $MaskTrackRCNN/data
ln -s /path/to/original/data_dir/train $MaskTrackRCNN/data/train
ln -s /path/to/original/data_dir/valid $MaskTrackRCNN/data/val
ln -s /path/to/original/data_dir/annotations $MaskTrackRCNN/data/annotations
```
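
Before starting a long training run, it may be worth checking that the links resolve to the expected structure. A minimal sketch, where check_layout is a hypothetical helper and the list of paths is taken from the directory structure shown earlier:

```bash
# Hypothetical helper: report which entries of the expected structure are
# missing under the repository root given as $1.
# Returns 0 when everything exists, 1 otherwise.
check_layout() {
    local root="$1" missing=0 p
    for p in data/train data/val \
             data/annotations/instances_train_sub.json \
             data/annotations/instances_val_sub.json; do
        # -e follows symbolic links, so a dangling link counts as missing
        if [ ! -e "$root/$p" ]; then
            echo "missing: $p"
            missing=1
        fi
    done
    return "$missing"
}

# Usage:
#   check_layout "$MaskTrackRCNN" && echo "dataset layout looks fine"
```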

Model training

-- Load GCC 7.4: module load gcc/7.4.0
-- Activate the virtual environment: conda activate MaskTrackRCNN
-- Train: python tools/train.py configs/masktrack_rcnn_r50_fpn_1x_youtubevos.py

If you see a log like the following, training appears to be running correctly. At this rate, training takes about 24 hours.

```
2021-01-15 13:36:50,620 - INFO - Epoch [1][50/7669]	   lr: 0.00199, time: 0.774, data_time: 0.045, loss_rpn_cls: 0.0609, loss_rpn_reg: 0.0465, loss_cls: 0.9336, acc: 84.5996, loss_reg: 0.2753, loss_match: 0.4937, match_acc: 89.2641, loss_mask: 0.7734, loss: 2.5835
2021-01-15 13:37:27,818 - INFO - Epoch [1][100/7669]	lr: 0.00233, time: 0.744, data_time: 0.028, loss_rpn_cls: 0.0469, loss_rpn_reg: 0.0442, loss_cls: 0.7820, acc: 84.7695, loss_reg: 0.3567, loss_match: 0.2895, match_acc: 88.3957, loss_mask: 0.6092, loss: 2.1286
2021-01-15 13:38:04,878 - INFO - Epoch [1][150/7669]	lr: 0.00266, time: 0.741, data_time: 0.026, loss_rpn_cls: 0.0342, loss_rpn_reg: 0.0342, loss_cls: 0.7171, acc: 85.1309, loss_reg: 0.3467, loss_match: 0.2588, match_acc: 89.5726, loss_mask: 0.5057, loss: 1.8968
```
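
To follow the total loss without reading the full lines, the log can be reduced to a few columns. The sketch below assumes the log format shown above; extract_loss and epoch_hours are hypothetical helper names, not part of MaskTrackRCNN or mmcv.

```bash
# Hypothetical helper: read training log lines on stdin and print
# "epoch iteration total_loss" for each line that matches the format above.
extract_loss() {
    sed -n -E 's#.*Epoch \[([0-9]+)]\[([0-9]+)/[0-9]+].* loss: ([0-9.]+)$#\1 \2 \3#p'
}

# Hypothetical helper: rough hours for one epoch, computed as
# iterations per epoch ($1) times seconds per iteration ($2).
epoch_hours() {
    awk -v iters="$1" -v sec="$2" 'BEGIN { printf "%.2f\n", iters * sec / 3600 }'
}

# Usage (train.log is a placeholder for wherever the log was saved):
#   extract_loss < train.log
#   epoch_hours 7669 0.74
```

With the values from the log above (7669 iterations per epoch at about 0.74 seconds each), epoch_hours prints 1.58, which covers the training iterations only.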

It appears that you can change training hyperparameters, such as the number of epochs, by editing configs/masktrack_rcnn_r50_fpn_1x_youtubevos.py.
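
A quick way to find the relevant lines before editing is to grep the config for top-level assignments. This is a sketch under an assumption: the variable names below (total_epochs, optimizer, lr_config, checkpoint_config) follow the usual mmdetection config convention and have not been checked against this particular file. show_train_settings is a hypothetical helper.

```bash
# Hypothetical helper: list, with line numbers, the top-level assignments of
# common training settings in the config file given as $1.
# The variable names are assumptions based on mmdetection-style configs.
show_train_settings() {
    grep -nE '^(total_epochs|optimizer|lr_config|checkpoint_config)[[:space:]]*=' "$1"
}

# Usage:
#   show_train_settings configs/masktrack_rcnn_r50_fpn_1x_youtubevos.py
```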

Summary of errors I wrestled with

Execution command: bash masktrackrcnn_env.sh
Cause: CUDA is not loaded properly.
Solution: Run module load cuda/{version} and it should work. Writing this command inside the shell script (masktrackrcnn_env.sh) did not work, but running it directly in the interactive shell did.
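
One possible explanation, which the notes above do not confirm, is that module is a shell function defined by an init script, and a script started with bash does not necessarily read that init script. Under that assumption, a script could define the function itself before calling module load. In this sketch, ensure_module_cmd is a hypothetical helper, and the default path /etc/profile.d/modules.sh is an assumption about the system.

```bash
# Hypothetical helper: make the `module` command available in a
# non-interactive script. $1 is the init script to source; the default
# path is an assumption and may differ on your system.
ensure_module_cmd() {
    local init_script="${1:-/etc/profile.d/modules.sh}"
    if type module > /dev/null 2>&1; then
        return 0            # already defined, nothing to do
    fi
    if [ -f "$init_script" ]; then
        . "$init_script"    # expected to define the `module` shell function
    fi
    type module > /dev/null 2>&1
}

# Usage near the top of masktrackrcnn_env.sh:
#   ensure_module_cmd && module load cuda/9.0
```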

Execution command: python tools/train.py configs/masktrack_rcnn_r50_fpn_1x_youtubevos.py (training script)
Cause: The installed torch version is below 1.1, which the installed mmcv expects (the authors of MaskTrack RCNN presumably ran torch 0.4.1).
Solution: Downgrade mmcv to 0.2.0.

```
pip uninstall mmcv
pip install mmcv==0.2.0
```

```Error details
Traceback (most recent call last):
  File "tools/train.py", line 4, in <module>
    from mmcv import Config
  File "/home/acb11854zq/.local/lib/python3.7/site-packages/mmcv/__init__.py", line 4, in <module>
    from .fileio import *
  File "/home/acb11854zq/.local/lib/python3.7/site-packages/mmcv/fileio/__init__.py", line 4, in <module>
    from .io import dump, load, register_handler
  File "/home/acb11854zq/.local/lib/python3.7/site-packages/mmcv/fileio/io.py", line 4, in <module>
    from ..utils import is_list_of, is_str
  File "/home/acb11854zq/.local/lib/python3.7/site-packages/mmcv/utils/__init__.py", line 29, in <module>
    from .env import collect_env
  File "/home/acb11854zq/.local/lib/python3.7/site-packages/mmcv/utils/env.py", line 12, in <module>
    from .parrots_wrapper import get_build_config
  File "/home/acb11854zq/.local/lib/python3.7/site-packages/mmcv/utils/parrots_wrapper.py", line 79, in <module>
    _BatchNorm, _InstanceNorm, SyncBatchNorm_ = _get_norm()
  File "/home/acb11854zq/.local/lib/python3.7/site-packages/mmcv/utils/parrots_wrapper.py", line 71, in _get_norm
    SyncBatchNorm_ = torch.nn.SyncBatchNorm
AttributeError: module 'torch.nn' has no attribute 'SyncBatchNorm'
```

Execution command: python tools/train.py configs/masktrack_rcnn_r50_fpn_1x_youtubevos.py
Cause: The CUDA version used at build time.
Solution: Switching from CUDA 9.2 to CUDA 9.0 at build time fixed it (source). Note that CUDA 9.0 requires GCC < 6.0.

Execution command: python tools/train.py configs/masktrack_rcnn_r50_fpn_1x_youtubevos.py
Cause: The dimensions of bbox_head.fc_cls.weight in the model differ from those in the checkpoint.
Solution: Change line 39 of [python_lib_path]/site-packages/mmcv/runner/checkpoint.py to the following (Reference 1, Reference 2).

```checkpoint.py
print('While copying the parameter named {}, '
      'whose dimensions in the model are {} and '
      'whose dimensions in the checkpoint are {}.'
      .format(name, own_state[name].size(),
              param.size()))
```

Execution command: python tools/train.py configs/masktrack_rcnn_r50_fpn_1x_youtubevos.py
Cause: According to an issue, the problem seems to be that GCC < 4.9.2 was used at build time.
Solution: Install GCC into the conda environment with conda install -c psi4 gcc-5 (source). I needed 4.9.2 <= GCC < 6.0 so as not to conflict with the CUDA 9.0 fix above, but ABCI does not offer a GCC module in that range.
Result: The build succeeded, so the segmentation fault was probably resolved, but a new OpenCV import error appeared.
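
The version window 4.9.2 <= GCC < 6.0 can be checked before running the build. A minimal sketch, assuming GNU sort (for its -V version ordering) is available; version_lt and gcc_in_range are hypothetical helper names.

```bash
# Hypothetical helper: true if version $1 is strictly lower than version $2.
# Relies on the version ordering of GNU `sort -V`.
version_lt() {
    [ "$1" != "$2" ] &&
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$1" ]
}

# Hypothetical helper: true if $1 satisfies 4.9.2 <= version < 6.0.
gcc_in_range() {
    ! version_lt "$1" 4.9.2 && version_lt "$1" 6.0
}

# Usage inside the activated conda environment:
#   gcc_in_range "$(gcc -dumpversion)" && echo "GCC version is in range"
```

Note that some GCC builds print only the major version for -dumpversion, so the check is most reliable when a full x.y.z version is reported.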

Execution command: python tools/train.py configs/masktrack_rcnn_r50_fpn_1x_youtubevos.py
Cause: Possibly the libstdc++ version in the conda environment is too old.
Solution: conda install libgcc seems to resolve this for the time being, but afterwards I got an import error about the system's libstdc++.

```Error details
Traceback (most recent call last):
  File "tools/train.py", line 4, in <module>
    from mmcv import Config
  File "/home/acb11854zq/anaconda3/envs/test_gcc/lib/python3.7/site-packages/mmcv/__init__.py", line 5, in <module>
    from .opencv_info import *
  File "/home/acb11854zq/anaconda3/envs/test_gcc/lib/python3.7/site-packages/mmcv/opencv_info.py", line 1, in <module>
    import cv2
  File "/home/acb11854zq/.local/lib/python3.7/site-packages/cv2/__init__.py", line 5, in <module>
    from .cv2 import *
ImportError: /home/acb11854zq/anaconda3/envs/test_gcc/bin/../lib/libstdc++.so.6: version `GLIBCXX_3.4.22' not found (required by /home/acb11854zq/.local/lib/python3.7/site-packages/cv2/cv2.cpython-37m-x86_64-linux-gnu.so)
```

Execution command: python tools/train.py configs/masktrack_rcnn_r50_fpn_1x_youtubevos.py
Cause: Perhaps the system's libstdc++ is out of date. CXXABI_1.3.8 was apparently introduced with GCC 4.9.
Solution: When training the model, the error can be made to go away by loading GCC 7.4 on the system with module load gcc/7.4.0. Note that gcc/7.4.0 should be loaded only after the build with bash compile.sh has finished; otherwise an error occurs at build time.

Traceback (most recent call last):
   File "tools/train.py", line 4, in <module>
     from mmcv import Config
   File "/home/acb11854zq/.local/lib/python3.7/site-packages/mmcv/__init__.py", line 5, in <module>
     from .image import *
   File "/home/acb11854zq/.local/lib/python3.7/site-packages/mmcv/image/__init__.py", line 2, in <module>
     from .colorspace import (bgr2gray, bgr2hls, bgr2hsv, bgr2rgb, bgr2ycbcr,
   File "/home/acb11854zq/.local/lib/python3.7/site-packages/mmcv/image/colorspace.py", line 2, in <module>
     import cv2
   File "/home/acb11854zq/.local/lib/python3.7/site-packages/cv2/__init__.py", line 5, in <module>
     from .cv2 import *
 ImportError: /lib64/libstdc++.so.6: version `CXXABI_1.3.8' not found (required by /home/acb11854zq/.local/lib/python3.7/site-packages/cv2/cv2.cpython-37m-x86_64-linux-gnu.so)
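
For both libstdc++ errors, it helps to see which version tags a given libstdc++.so.6 actually contains. The sketch below searches the library file for GLIBCXX_* and CXXABI_* strings with GNU grep, in the same spirit as the common strings-and-grep check; list_cxx_versions and has_cxx_version are hypothetical helper names, and finding a string in the file is only a heuristic for the version being provided.

```bash
# Hypothetical helper: list the GLIBCXX_* / CXXABI_* version tags embedded in
# the shared library file given as $1 (-a makes grep treat binary as text).
list_cxx_versions() {
    grep -a -o -E '(GLIBCXX|CXXABI)_[0-9]+(\.[0-9]+)*' "$1" | sort -u -V
}

# Hypothetical helper: true if library $1 contains exactly the tag $2.
has_cxx_version() {
    list_cxx_versions "$1" | grep -q -x -F "$2"
}

# Usage with the paths and tags from the error messages:
#   has_cxx_version /lib64/libstdc++.so.6 CXXABI_1.3.8 || echo "CXXABI_1.3.8 not found"
#   list_cxx_versions "$CONDA_PREFIX/lib/libstdc++.so.6"
```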
