Building an environment on ABCI and running MaskTrack RCNN

Overview

When I tried to run MaskTrack RCNN, the baseline model for the recent Video Instance Segmentation task, on ABCI, I ran into one error after another while building the environment. This is a memo of the whole procedure, from environment setup to training the model. The errors I wrestled with, and how I solved them, are summarized at the end.

Premise

- Information as of January 15, 2021.
- You have an ABCI account and can access the server remotely.
- Anaconda3 is already installed.

Environment

- Remote access to ABCI with ssh
- Start a GPU node

The following command uses a GPU interactively for up to 1 hour with the minimum configuration.

```bash
qrsh -g gcc50560 -l rt_G.small=1 -l h_rt=1:00:00 -m abes
```

- Load the required modules

Run the following commands on the launched node to load the required modules.

```bash
module load cuda/9.0
module load cudnn/7.6/7.6.2
module load nccl/2.3/2.3.7-1
```

- Create and set up the virtual environment

Run the following script with `bash masktrackrcnn_env.sh` to clone the repository and build the virtual environment.

```masktrackrcnn_env.sh
# clone MaskTrackRCNN
git clone https://github.com/youtubevos/MaskTrackRCNN.git
cd MaskTrackRCNN

# create environment
conda create -n MaskTrackRCNN python=3.7 -y

# >>> conda init >>>
__conda_setup="$(CONDA_REPORT_ERRORS=false "$HOME/anaconda3/bin/conda" shell.bash hook 2> /dev/null)"
if [ $? -eq 0 ]; then
    \eval "$__conda_setup"
else
    if [ -f "$HOME/anaconda3/etc/profile.d/conda.sh" ]; then
        . "$HOME/anaconda3/etc/profile.d/conda.sh"
        CONDA_CHANGEPS1=false conda activate base
    else
        \export PATH="$PATH:$HOME/anaconda3/bin"
    fi
fi
unset __conda_setup
# <<< conda init <<<

# activate environment
conda activate MaskTrackRCNN

# setup environment
conda install -c pytorch pytorch=0.4.1 cudatoolkit=9.0 torchvision -y
conda install -c conda-forge  opencv -y
conda install numpy cython -y
conda install -c psi4 gcc-5 -y
conda install libgcc -y
pip install git+https://github.com/youtubevos/cocoapi.git#"egg=pycocotools&subdirectory=PythonAPI"
bash compile.sh
pip install . --user

pip uninstall mmcv -y
pip install mmcv==0.2.0
```
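
After the script finishes, it can help to confirm which versions were actually installed and where they were loaded from, since the tracebacks later in this article show packages being loaded both from ~/.local and from the conda environment. The following is a minimal sketch; the helper name `describe` is made up for illustration, and the list of modules is simply the set of packages that the script above installs.

```python
import importlib


def describe(module_name):
    """Return (version, file path) for a module, or None if it cannot be imported."""
    try:
        module = importlib.import_module(module_name)
    except Exception as exc:  # ImportError, or a loader error such as a missing GLIBCXX version
        print("{}: import failed ({})".format(module_name, exc))
        return None
    version = getattr(module, "__version__", "unknown")
    path = getattr(module, "__file__", "built-in")
    print("{}: version {} at {}".format(module_name, version, path))
    return version, path


if __name__ == "__main__":
    # Packages installed by masktrackrcnn_env.sh.
    for name in ("torch", "torchvision", "mmcv", "cv2", "mmdet"):
        describe(name)
```

With the script above, torch should report 0.4.1 and mmcv should report 0.2.0. A different mmcv version suggests that the final `pip install mmcv==0.2.0` step did not take effect.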

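- Modify the library

The necessary environment is now in place, but as it stands training stops with the RuntimeError described in the error summary below, so one final fix is needed. [python_lib_path] is the library directory of the Python returned by which python, for example the lib/python3.7 directory of the conda environment (or ~/.local/lib/python3.7 for packages installed with --user, as in the tracebacks later in this article).

Change the statement at line 39 of [python_lib_path]/site-packages/mmcv/runner/checkpoint.py to the following, so that a parameter whose shape does not match is printed instead of raising an error: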

```checkpoint.py
print('While copying the parameter named {}, '
      'whose dimensions in the model are {} and '
      'whose dimensions in the checkpoint are {}.'
      .format(name, own_state[name].size(),
              param.size()))
```
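
Because mmcv may be installed either in the conda environment or under ~/.local, guessing [python_lib_path] by hand is error-prone. A small sketch like the following asks Python itself where the file to edit lives; `locate_source` is a hypothetical helper name, and the result is only meaningful when run with the same Python that will run the training.

```python
import importlib.util


def locate_source(module_name):
    """Return the file path of a module, or None if it cannot be found."""
    try:
        spec = importlib.util.find_spec(module_name)
    except Exception:  # the parent package is missing or fails to import
        return None
    if spec is None:
        return None
    return spec.origin


# Prints the path of the checkpoint.py to edit, or None if mmcv is not importable.
print(locate_source("mmcv.runner.checkpoint"))
```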

- Environment setup complete!

Data set preparation

- Download the data and labels

Download the data and labels from here.

- Place the dataset

Create symbolic links so that the directory has the following structure.

```
MaskTrackRCNN
├── mmdet
├── tools
├── configs
├── data
│   ├── train
│   ├── val
│   ├── annotations
│   │   ├── instances_train_sub.json
│   │   ├── instances_val_sub.json
```

The annotation files are originally distributed under the names train.json and valid.json, so it's a good idea to copy or rename them to match the structure above. Alternatively, you can change the file names in the config.

```sh
# An example of creating the symbolic links.
# $MaskTrackRCNN is the path to the root of the MaskTrackRCNN repository.
mkdir -p "$MaskTrackRCNN/data"
ln -s /path/to/original/data_dir/train "$MaskTrackRCNN/data/train"
ln -s /path/to/original/data_dir/valid "$MaskTrackRCNN/data/val"
ln -s /path/to/original/data_dir/annotations "$MaskTrackRCNN/data/annotations"
```
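
The same placement, including the renaming of the annotation files, can also be sketched in Python. This is only an illustration under some assumptions: the function name `link_dataset` is made up, the original download is assumed to contain train, valid and annotations directories with train.json and valid.json inside annotations, and the annotation files are copied into a real data/annotations directory rather than symlinked.

```python
import os
import shutil
from pathlib import Path


def link_dataset(src_root, repo_root):
    """Create data/{train,val} symlinks and copy the renamed annotation files."""
    src_root, repo_root = Path(src_root), Path(repo_root)
    data_dir = repo_root / "data"
    ann_dir = data_dir / "annotations"
    ann_dir.mkdir(parents=True, exist_ok=True)

    # Source directory name -> link name expected by the repository.
    for src_name, link_name in (("train", "train"), ("valid", "val")):
        link = data_dir / link_name
        if not os.path.lexists(str(link)):
            os.symlink(str(src_root / src_name), str(link), target_is_directory=True)

    # Copy the labels under the names shown in the directory tree above.
    for src_name, dst_name in (("train.json", "instances_train_sub.json"),
                               ("valid.json", "instances_val_sub.json")):
        shutil.copyfile(str(src_root / "annotations" / src_name),
                        str(ann_dir / dst_name))
    return data_dir
```

It would be called as `link_dataset("/path/to/original/data_dir", "/path/to/MaskTrackRCNN")`.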

Model training

- Load GCC 7.4: `module load gcc/7.4.0`
- Activate the virtual environment: `conda activate MaskTrackRCNN`
- Train: `python tools/train.py configs/masktrack_rcnn_r50_fpn_1x_youtubevos.py`

If you see a log like the following, training is probably running correctly. At this rate, training takes about 24 hours.

```
2021-01-15 13:36:50,620 - INFO - Epoch [1][50/7669]	   lr: 0.00199, time: 0.774, data_time: 0.045, loss_rpn_cls: 0.0609, loss_rpn_reg: 0.0465, loss_cls: 0.9336, acc: 84.5996, loss_reg: 0.2753, loss_match: 0.4937, match_acc: 89.2641, loss_mask: 0.7734, loss: 2.5835
2021-01-15 13:37:27,818 - INFO - Epoch [1][100/7669]	lr: 0.00233, time: 0.744, data_time: 0.028, loss_rpn_cls: 0.0469, loss_rpn_reg: 0.0442, loss_cls: 0.7820, acc: 84.7695, loss_reg: 0.3567, loss_match: 0.2895, match_acc: 88.3957, loss_mask: 0.6092, loss: 2.1286
2021-01-15 13:38:04,878 - INFO - Epoch [1][150/7669]	lr: 0.00266, time: 0.741, data_time: 0.026, loss_rpn_cls: 0.0342, loss_rpn_reg: 0.0342, loss_cls: 0.7171, acc: 85.1309, loss_reg: 0.3467, loss_match: 0.2588, match_acc: 89.5726, loss_mask: 0.5057, loss: 1.8968
```
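
The log itself gives a rough way to check the time estimate: each line shows the iteration count per epoch (7669) and the average seconds per iteration (the `time` field). A minimal sketch, assuming the log format shown above; the regular expression and the function name are my own, and the sample line is abbreviated.

```python
import re

# Matches "Epoch [1][150/7669] ... time: 0.741"; \b keeps "data_time" from matching.
LOG_PATTERN = re.compile(r"Epoch \[(\d+)\]\[(\d+)/(\d+)\].*?\btime: ([\d.]+)")


def seconds_per_epoch(log_line):
    """Estimate the duration of one epoch from a single training log line."""
    match = LOG_PATTERN.search(log_line)
    if match is None:
        raise ValueError("not a training log line")
    iters_per_epoch = int(match.group(3))
    seconds_per_iter = float(match.group(4))
    return iters_per_epoch * seconds_per_iter


line = ("2021-01-15 13:38:04,878 - INFO - Epoch [1][150/7669] lr: 0.00266, "
        "time: 0.741, data_time: 0.026, loss: 1.8968")
hours = seconds_per_epoch(line) / 3600
print("about {:.1f} hours per epoch".format(hours))
```

For this line the result is about 1.6 hours per epoch. Multiplying by the number of epochs set in the config gives a rough total, not counting checkpointing or other overhead.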

It appears that you can change training hyperparameters, such as the number of epochs, by editing configs/masktrack_rcnn_r50_fpn_1x_youtubevos.py.

Summary of errors I wrestled with

unable to execute 'nvcc': No such file or directory

- Execution command: `bash masktrackrcnn_env.sh`
- Cause: CUDA is not loaded properly.
- Solution: Run `module load cuda/{version}` and it should work. Writing this command inside the shell script (masktrackrcnn_env.sh) didn't work for me, but running it directly in the interactive shell did.
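
A quick way to catch this before starting the build is to check that nvcc is on PATH in the current shell. This is just a sketch using the standard library; `find_tool` is a made-up helper name.

```python
import shutil


def find_tool(name):
    """Return the absolute path of an executable on PATH, or None if it is missing."""
    return shutil.which(name)


nvcc_path = find_tool("nvcc")
if nvcc_path is None:
    print("nvcc not found: run `module load cuda/9.0` in this shell before building")
else:
    print("nvcc found at", nvcc_path)
```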

AttributeError: module 'torch.nn' has no attribute 'SyncBatchNorm'

- Execution command: `python tools/train.py configs/masktrack_rcnn_r50_fpn_1x_youtubevos.py` (training script)
- Cause: The torch version is lower than 1.1, which the installed mmcv expects (the authors themselves, however, should be running 0.4.1).
- Solution: Downgrade mmcv to 0.2.0.

```
pip uninstall mmcv
pip install mmcv==0.2.0
```

```Error details
Traceback (most recent call last):
  File "tools/train.py", line 4, in <module>
    from mmcv import Config
  File "/home/acb11854zq/.local/lib/python3.7/site-packages/mmcv/__init__.py", line 4, in <module>
    from .fileio import *
  File "/home/acb11854zq/.local/lib/python3.7/site-packages/mmcv/fileio/__init__.py", line 4, in <module>
    from .io import dump, load, register_handler
  File "/home/acb11854zq/.local/lib/python3.7/site-packages/mmcv/fileio/io.py", line 4, in <module>
    from ..utils import is_list_of, is_str
  File "/home/acb11854zq/.local/lib/python3.7/site-packages/mmcv/utils/__init__.py", line 29, in <module>
    from .env import collect_env
  File "/home/acb11854zq/.local/lib/python3.7/site-packages/mmcv/utils/env.py", line 12, in <module>
    from .parrots_wrapper import get_build_config
  File "/home/acb11854zq/.local/lib/python3.7/site-packages/mmcv/utils/parrots_wrapper.py", line 79, in <module>
    _BatchNorm, _InstanceNorm, SyncBatchNorm_ = _get_norm()
  File "/home/acb11854zq/.local/lib/python3.7/site-packages/mmcv/utils/parrots_wrapper.py", line 71, in _get_norm
    SyncBatchNorm_ = torch.nn.SyncBatchNorm
AttributeError: module 'torch.nn' has no attribute 'SyncBatchNorm'
```
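
The traceback shows mmcv accessing `torch.nn.SyncBatchNorm` at import time, so whether a given torch build has that attribute can be checked directly. The sketch below is generic so that it also runs when torch is not installed; `has_attribute` is a hypothetical helper name.

```python
import importlib


def has_attribute(module_name, attribute):
    """Return True or False if the module imports, or None if it cannot be imported."""
    try:
        module = importlib.import_module(module_name)
    except Exception:  # not installed, or broken at import time
        return None
    return hasattr(module, attribute)


print("torch.nn.SyncBatchNorm available:", has_attribute("torch.nn", "SyncBatchNorm"))
```

With torch 0.4.1 this should print False, which is why an mmcv version that does not touch this attribute (0.2.0 here) is needed.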

ImportError: /home/acb11854zq/.local/lib/python3.7/site-packages/mmdet/ops/nms/gpu_nms.cpython-37m-x86_64-linux-gnu.so: undefined symbol: __cudaPopCallConfiguration

- Execution command: `python tools/train.py configs/masktrack_rcnn_r50_fpn_1x_youtubevos.py`
- Cause: The CUDA version used at build time.
- Solution: Switching from CUDA 9.2 to CUDA 9.0 at build time fixed it (source). Note that CUDA 9.0 requires GCC < 6.0.
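
To confirm which CUDA release the build will actually use, the output of `nvcc --version` can be parsed. A minimal sketch, assuming the usual output line of the form "Cuda compilation tools, release 9.0, V9.0.176"; both function names are made up for illustration.

```python
import re
import shutil
import subprocess


def parse_nvcc_release(version_text):
    """Extract the release number (e.g. '9.0') from `nvcc --version` output."""
    match = re.search(r"release (\d+\.\d+)", version_text)
    return match.group(1) if match else None


def nvcc_release():
    """Return the release of the nvcc on PATH, or None if nvcc is unavailable."""
    if shutil.which("nvcc") is None:
        return None
    result = subprocess.run(["nvcc", "--version"], stdout=subprocess.PIPE,
                            stderr=subprocess.STDOUT, universal_newlines=True)
    return parse_nvcc_release(result.stdout)


release = nvcc_release()
if release != "9.0":
    print("expected CUDA 9.0 for the build, found:", release)
```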

RuntimeError: While copying the parameter named bbox_head.fc_cls.weight, whose dimensions in the model are torch.Size([41, 1024]) and whose dimensions in the checkpoint are torch.Size([81, 1024]).

- Execution command: `python tools/train.py configs/masktrack_rcnn_r50_fpn_1x_youtubevos.py`
- Cause: The shape of bbox_head.fc_cls.weight in the model differs from its shape in the checkpoint.
- Solution: Change the statement at line 39 of [python_lib_path]/site-packages/mmcv/runner/checkpoint.py to the following (Reference 1, Reference 2).

```checkpoint.py
print('While copying the parameter named {}, '
      'whose dimensions in the model are {} and '
      'whose dimensions in the checkpoint are {}.'
      .format(name, own_state[name].size(),
              param.size()))
```
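
To see what this change tolerates, it helps to list which checkpoint parameters fit the model and which do not. The sketch below works on plain dictionaries of shapes so it runs without torch; `split_by_shape` is a hypothetical helper, the fc_cls shapes come from the error message above, and the backbone entry is invented for the example.

```python
def split_by_shape(model_shapes, checkpoint_shapes):
    """Split checkpoint parameter names into loadable and mismatched ones.

    Both arguments map parameter names to shape tuples.
    """
    loadable, mismatched = [], []
    for name, ckpt_shape in checkpoint_shapes.items():
        if name not in model_shapes:
            continue  # key that this model does not have
        if tuple(model_shapes[name]) == tuple(ckpt_shape):
            loadable.append(name)
        else:
            mismatched.append(name)
            print("While copying the parameter named {}, "
                  "whose dimensions in the model are {} and "
                  "whose dimensions in the checkpoint are {}."
                  .format(name, tuple(model_shapes[name]), tuple(ckpt_shape)))
    return loadable, mismatched


# fc_cls shapes are from the error message; the backbone entry is made up.
model = {"bbox_head.fc_cls.weight": (41, 1024),
         "backbone.conv1.weight": (64, 3, 7, 7)}
checkpoint = {"bbox_head.fc_cls.weight": (81, 1024),
              "backbone.conv1.weight": (64, 3, 7, 7)}
loadable, mismatched = split_by_shape(model, checkpoint)
```

With a real model, the shape dictionaries could be built with something like `{k: tuple(v.shape) for k, v in model.state_dict().items()}`. The 81 versus 41 rows presumably reflect a checkpoint pretrained with 80 classes plus background being loaded into a model with 40 classes plus background, in which case the classification layer is simply trained from scratch.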

Segmentation fault

- Execution command: `python tools/train.py configs/masktrack_rcnn_r50_fpn_1x_youtubevos.py`
- Cause: According to an issue, the problem seems to be that GCC < 4.9.2 was used at build time.
- Solution: Install GCC into the conda environment with `conda install -c psi4 gcc-5` (source). I wanted 4.9.2 <= GCC < 6.0 so as not to conflict with the CUDA 9.0 solution above, but ABCI does not offer a GCC module in that range.
- Result: The build succeeded, so the segmentation fault is probably solved, but a new OpenCV import error appeared.
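
The version window used here (4.9.2 <= GCC < 6.0) is easy to get wrong by eye, so a small check can be run before `bash compile.sh`. This is a sketch with made-up function names; it assumes the version string comes from `gcc -dumpversion`, which may print only the major version on newer compilers, hence the padding.

```python
def parse_version(text):
    """Turn a version string such as '5.4.0' or '7' into a 3-tuple of ints."""
    parts = tuple(int(part) for part in text.strip().split("."))
    return parts + (0,) * (3 - len(parts))


def gcc_ok_for_build(version_text):
    """True if 4.9.2 <= GCC < 6.0, the range needed for the build in this article."""
    return (4, 9, 2) <= parse_version(version_text) < (6, 0, 0)


for candidate in ("4.8.5", "4.9.2", "5.4.0", "7.4.0"):
    print(candidate, "ok" if gcc_ok_for_build(candidate) else "outside the range")
```

In this list only 4.9.2 and 5.4.0 fall inside the range, which matches the choice of gcc-5 from conda for the build and GCC 7.4 only at training time.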

ImportError: /home/acb11854zq/anaconda3/envs/test_gcc/bin/../lib/libstdc++.so.6: version `GLIBCXX_3.4.22' not found (required by /home/acb11854zq/.local/lib/python3.7/site-packages/cv2/cv2.cpython-37m-x86_64-linux-gnu.so)

- Execution command: `python tools/train.py configs/masktrack_rcnn_r50_fpn_1x_youtubevos.py`
- Cause: Conda's libstdc++ may be too old?
- Solution: `conda install libgcc` seems to resolve this for the time being, but then an import error about the system's libstdc++ appears instead.

```Error details
Traceback (most recent call last):
  File "tools/train.py", line 4, in <module>
    from mmcv import Config
  File "/home/acb11854zq/anaconda3/envs/test_gcc/lib/python3.7/site-packages/mmcv/__init__.py", line 5, in <module>
    from .opencv_info import *
  File "/home/acb11854zq/anaconda3/envs/test_gcc/lib/python3.7/site-packages/mmcv/opencv_info.py", line 1, in <module>
    import cv2
  File "/home/acb11854zq/.local/lib/python3.7/site-packages/cv2/__init__.py", line 5, in <module>
    from .cv2 import *
ImportError: /home/acb11854zq/anaconda3/envs/test_gcc/bin/../lib/libstdc++.so.6: version `GLIBCXX_3.4.22' not found (required by /home/acb11854zq/.local/lib/python3.7/site-packages/cv2/cv2.cpython-37m-x86_64-linux-gnu.so)
```

ImportError: /lib64/libstdc++.so.6: version `CXXABI_1.3.8' not found (required by /home/acb11854zq/.local/lib/python3.7/site-packages/cv2/cv2.cpython-37m-x86_64-linux-gnu.so)

- Execution command: `python tools/train.py configs/masktrack_rcnn_r50_fpn_1x_youtubevos.py`
- Cause: The system's libstdc++ may be too old? CXXABI_1.3.8 was apparently introduced in GCC 4.9.
- Solution: Loading GCC 7.4 on the system with `module load gcc/7.4.0` when training the model makes the error go away. Note that you should load it only after the build with `bash compile.sh` has finished; otherwise an error occurs at build time.

```Error details
Traceback (most recent call last):
  File "tools/train.py", line 4, in <module>
    from mmcv import Config
  File "/home/acb11854zq/.local/lib/python3.7/site-packages/mmcv/__init__.py", line 5, in <module>
    from .image import *
  File "/home/acb11854zq/.local/lib/python3.7/site-packages/mmcv/image/__init__.py", line 2, in <module>
    from .colorspace import (bgr2gray, bgr2hls, bgr2hsv, bgr2rgb, bgr2ycbcr,
  File "/home/acb11854zq/.local/lib/python3.7/site-packages/mmcv/image/colorspace.py", line 2, in <module>
    import cv2
  File "/home/acb11854zq/.local/lib/python3.7/site-packages/cv2/__init__.py", line 5, in <module>
    from .cv2 import *
ImportError: /lib64/libstdc++.so.6: version `CXXABI_1.3.8' not found (required by /home/acb11854zq/.local/lib/python3.7/site-packages/cv2/cv2.cpython-37m-x86_64-linux-gnu.so)
```
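
For both libstdc++ import errors above, it is useful to see which GLIBCXX and CXXABI versions a given libstdc++.so.6 actually provides. The sketch below scans the library file for version tags, much like `strings libstdc++.so.6 | grep GLIBCXX` would; the function name is made up, and the conda path assumes Anaconda is installed in ~/anaconda3 with the environment name used in this article.

```python
import os
import re


def symbol_versions(library_path, prefix):
    """List the version numbers of tags such as GLIBCXX_3.4.22 found in a library file."""
    pattern = re.compile(re.escape(prefix.encode()) + rb"_(\d+(?:\.\d+)*)")
    with open(library_path, "rb") as handle:
        data = handle.read()
    found = {match.group(1).decode() for match in pattern.finditer(data)}
    # Sort numerically so that 3.4.9 comes before 3.4.22.
    return sorted(found, key=lambda v: tuple(int(p) for p in v.split(".")))


if __name__ == "__main__":
    candidates = (
        "/lib64/libstdc++.so.6",
        os.path.expanduser("~/anaconda3/envs/MaskTrackRCNN/lib/libstdc++.so.6"),
    )
    for path in candidates:
        if os.path.exists(path):
            print(path)
            print("  newest GLIBCXX:", symbol_versions(path, "GLIBCXX")[-1:])
            print("  newest CXXABI:", symbol_versions(path, "CXXABI")[-1:])
```

If the newest GLIBCXX is below 3.4.22, or the newest CXXABI is below 1.3.8, the corresponding import error above is expected with that library.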
