When I tried to run MaskTrack RCNN, the baseline model for the recent Video Instance Segmentation task, on ABCI, I ran into one error after another. This memo records how I built the environment on ABCI, from setup through training the model. At the end I have collected the errors I wrestled with and how I resolved each of them.
- Information as of January 15, 2021.
- You have an ABCI account and can access the server remotely.
- Anaconda3 has already been installed.
- Access ABCI remotely over ssh.
- Start a GPU node. The following command requests an interactive GPU session of up to 1 hour with the minimum configuration (replace the group ID after `-g` with your own).
```bash
qrsh -g gcc50560 -l rt_G.small=1 -l h_rt=1:00:00 -m abes
```
- Load the required modules. Run the following commands on the launched node.
```bash
module load cuda/9.0
module load cudnn/7.6/7.6.2
module load nccl/2.3/2.3.7-1
```
- Create and set up the virtual environment. Save the following script as masktrackrcnn_env.sh and run `bash masktrackrcnn_env.sh` to clone the repository and build the virtual environment.
```masktrackrcnn_env.sh
# clone MaskTrackRCNN
git clone https://github.com/youtubevos/MaskTrackRCNN.git
cd MaskTrackRCNN
# create environment
conda create -n MaskTrackRCNN python=3.7 -y
# >>> conda init >>>
# Double quotes (not single quotes) so that $HOME is expanded
__conda_setup="$(CONDA_REPORT_ERRORS=false "$HOME/anaconda3/bin/conda" shell.bash hook 2> /dev/null)"
if [ $? -eq 0 ]; then
    \eval "$__conda_setup"
else
    if [ -f "$HOME/anaconda3/etc/profile.d/conda.sh" ]; then
        . "$HOME/anaconda3/etc/profile.d/conda.sh"
        CONDA_CHANGEPS1=false conda activate base
    else
        \export PATH="$PATH:$HOME/anaconda3/bin"
    fi
fi
unset __conda_setup
# <<< conda init <<<
# activate environment
conda activate MaskTrackRCNN
# setup environment
conda install -c pytorch pytorch=0.4.1 cudatoolkit=9.0 torchvision -y
conda install -c conda-forge opencv -y
conda install numpy cython -y
conda install -c psi4 gcc-5 -y
conda install libgcc -y
pip install git+https://github.com/youtubevos/cocoapi.git#"egg=pycocotools&subdirectory=PythonAPI"
bash compile.sh
pip install . --user
pip uninstall mmcv -y
pip install mmcv==0.2.0
```
- Modify the library. The required environment is now in place, but left as is, training stops with an error when the checkpoint is loaded, so make one final correction. Change line 39 of `[python_lib_path]/site-packages/mmcv/runner/checkpoint.py` to the code below. For `[python_lib_path]`, substitute the library path that corresponds to the interpreter returned by `which python`.
```checkpoint.py
print('While copying the parameter named {}, '
      'whose dimensions in the model are {} and '
      'whose dimensions in the checkpoint are {}.'
      .format(name, own_state[name].size(),
              param.size()))
```
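Finding the right `checkpoint.py` by hand is error-prone, because mmcv may live under the conda environment or under `~/.local`. The following is a minimal sketch that asks Python where a package is installed without importing it (importing mmcv can itself fail, as the errors later in this memo show). The helper name `find_module_file` is my own and not part of mmcv.

```python
import importlib.util
from pathlib import Path
from typing import Optional


def find_module_file(package: str, relative: str) -> Optional[Path]:
    """Return the path of `relative` inside an installed top-level package.

    Uses importlib.util.find_spec, which locates a top-level package
    without executing its __init__.py. Returns None if nothing is found.
    """
    spec = importlib.util.find_spec(package)
    if spec is None or not spec.submodule_search_locations:
        return None
    for root in spec.submodule_search_locations:
        candidate = Path(root) / relative
        if candidate.is_file():
            return candidate
    return None


if __name__ == "__main__":
    # Prints the file to edit, or None if mmcv is not installed
    # for the interpreter that runs this script.
    print(find_module_file("mmcv", "runner/checkpoint.py"))
```

Run it with the same interpreter that `which python` reports inside the activated environment, so that the path it prints is the copy of mmcv that training will actually use.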
- Environment construction is complete!
- Download the data and labels. Download the data and labels from here.
- Place the dataset. Create symbolic links so that the repository has the following structure.
```
MaskTrackRCNN
├── mmdet
├── tools
├── configs
├── data
│   ├── train
│   ├── val
│   ├── annotations
│   │   ├── instances_train_sub.json
│   │   ├── instances_val_sub.json
```
The labels are originally distributed as `train.json` and `valid.json`, so copy them and rename the copies to the names above. Alternatively, change the annotation file names in the config.
```sh
# Example of creating the symbolic links
# $MaskTrackRCNN is the path to the root of the MaskTrackRCNN repository
mkdir $MaskTrackRCNN/data
ln -s /path/to/original/data_dir/train $MaskTrackRCNN/data/train
ln -s /path/to/original/data_dir/valid $MaskTrackRCNN/data/val
ln -s /path/to/original/data_dir/annotations $MaskTrackRCNN/data/annotations
```
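Before starting a long training run it is worth confirming that the links resolve. Here is a small sketch that checks the layout shown above; the function name `missing_dataset_paths` and the `EXPECTED` list are my own, built only from the directory tree in this memo.

```python
from pathlib import Path
from typing import List

# Paths relative to the repository root, taken from the tree above.
EXPECTED = [
    "data/train",
    "data/val",
    "data/annotations/instances_train_sub.json",
    "data/annotations/instances_val_sub.json",
]


def missing_dataset_paths(repo_root: str) -> List[str]:
    """Return the expected paths that do not exist under repo_root.

    Path.exists() follows symbolic links, so a link whose target has
    been moved or mistyped is also reported as missing.
    """
    root = Path(repo_root)
    return [rel for rel in EXPECTED if not (root / rel).exists()]


if __name__ == "__main__":
    # Assumes the script is run from the root of the repository.
    for rel in missing_dataset_paths("."):
        print("missing:", rel)
```

An empty result means every expected directory and annotation file is reachable from the repository root.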
- Load GCC 7.4: `module load gcc/7.4.0`
- Activate the virtual environment: `conda activate MaskTrackRCNN`
- Train the model: `python tools/train.py configs/masktrack_rcnn_r50_fpn_1x_youtubevos.py`
If you see a log like the following, training is running correctly. At this rate, training takes about 24 hours.
```
2021-01-15 13:36:50,620 - INFO - Epoch [1][50/7669] lr: 0.00199, time: 0.774, data_time: 0.045, loss_rpn_cls: 0.0609, loss_rpn_reg: 0.0465, loss_cls: 0.9336, acc: 84.5996, loss_reg: 0.2753, loss_match: 0.4937, match_acc: 89.2641, loss_mask: 0.7734, loss: 2.5835
2021-01-15 13:37:27,818 - INFO - Epoch [1][100/7669] lr: 0.00233, time: 0.744, data_time: 0.028, loss_rpn_cls: 0.0469, loss_rpn_reg: 0.0442, loss_cls: 0.7820, acc: 84.7695, loss_reg: 0.3567, loss_match: 0.2895, match_acc: 88.3957, loss_mask: 0.6092, loss: 2.1286
2021-01-15 13:38:04,878 - INFO - Epoch [1][150/7669] lr: 0.00266, time: 0.741, data_time: 0.026, loss_rpn_cls: 0.0342, loss_rpn_reg: 0.0342, loss_cls: 0.7171, acc: 85.1309, loss_reg: 0.3467, loss_match: 0.2588, match_acc: 89.5726, loss_mask: 0.5057, loss: 1.8968
```
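The `time` field in these lines is the average number of seconds per iteration, and `[50/7669]` gives the position within the epoch, so the duration of one epoch can be estimated directly from the log. The sketch below parses a line and does that arithmetic; `parse_progress` and `hours_per_epoch` are names I made up for illustration, and the pattern is written only against the log format shown above.

```python
import re
from typing import Optional, Tuple

# Matches e.g. "Epoch [1][50/7669] lr: 0.00199, time: 0.774, data_time: ..."
# The \b keeps "data_time" from being mistaken for "time".
LOG_PATTERN = re.compile(r"Epoch \[(\d+)\]\[(\d+)/(\d+)\].*?\btime: ([\d.]+)")


def parse_progress(line: str) -> Optional[Tuple[int, int, int, float]]:
    """Return (epoch, iteration, iterations_per_epoch, seconds_per_iteration)."""
    match = LOG_PATTERN.search(line)
    if match is None:
        return None
    epoch, iteration, total, seconds = match.groups()
    return int(epoch), int(iteration), int(total), float(seconds)


def hours_per_epoch(iterations_per_epoch: int, seconds_per_iteration: float) -> float:
    """Estimated wall-clock hours for one epoch at a constant iteration time."""
    return iterations_per_epoch * seconds_per_iteration / 3600.0


if __name__ == "__main__":
    line = ("2021-01-15 13:37:27,818 - INFO - Epoch [1][100/7669] "
            "lr: 0.00233, time: 0.744, data_time: 0.028")
    _, _, total, seconds = parse_progress(line)
    print(round(hours_per_epoch(total, seconds), 2), "hours per epoch")
```

At about 0.75 seconds per iteration and 7669 iterations per epoch, one epoch comes to roughly 1.6 hours; multiplying by the number of epochs set in the config gives the total.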
Training hyperparameters such as the number of epochs can apparently be changed by editing `configs/masktrack_rcnn_r50_fpn_1x_youtubevos.py`.
Execution command: `bash masktrackrcnn_env.sh`
Cause: CUDA is not loaded properly.
Solution: Run `module load cuda/{version}` and it should work. Putting this command inside the shell script (masktrackrcnn_env.sh) did not work, but running it directly in the interactive shell did.
Execution command: `python tools/train.py configs/masktrack_rcnn_r50_fpn_1x_youtubevos.py` (the training script)
Cause: The installed mmcv assumes torch 1.1 or later (it references `torch.nn.SyncBatchNorm`), while this environment, like the repository author's, uses torch 0.4.1.
Solution: Downgrade mmcv to 0.2.0.
```
pip uninstall mmcv
pip install mmcv==0.2.0
```
```Error details
Traceback (most recent call last):
File "tools/train.py", line 4, in <module>
from mmcv import Config
File "/home/acb11854zq/.local/lib/python3.7/site-packages/mmcv/__init__.py", line 4, in <module>
from .fileio import *
File "/home/acb11854zq/.local/lib/python3.7/site-packages/mmcv/fileio/__init__.py", line 4, in <module>
from .io import dump, load, register_handler
File "/home/acb11854zq/.local/lib/python3.7/site-packages/mmcv/fileio/io.py", line 4, in <module>
from ..utils import is_list_of, is_str
File "/home/acb11854zq/.local/lib/python3.7/site-packages/mmcv/utils/__init__.py", line 29, in <module>
from .env import collect_env
File "/home/acb11854zq/.local/lib/python3.7/site-packages/mmcv/utils/env.py", line 12, in <module>
from .parrots_wrapper import get_build_config
File "/home/acb11854zq/.local/lib/python3.7/site-packages/mmcv/utils/parrots_wrapper.py", line 79, in <module>
_BatchNorm, _InstanceNorm, SyncBatchNorm_ = _get_norm()
File "/home/acb11854zq/.local/lib/python3.7/site-packages/mmcv/utils/parrots_wrapper.py", line 71, in _get_norm
SyncBatchNorm_ = torch.nn.SyncBatchNorm
AttributeError: module 'torch.nn' has no attribute 'SyncBatchNorm'
```
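The decision behind this fix is just a version comparison: the traceback fails on `torch.nn.SyncBatchNorm`, which torch older than 1.1 does not have. As a minimal sketch, the check can be written without importing torch at all; `version_tuple` and `torch_has_sync_batchnorm` are hypothetical helpers of mine, not functions from torch or mmcv.

```python
from typing import Tuple


def version_tuple(version: str) -> Tuple[int, ...]:
    """Turn '0.4.1' into (0, 4, 1).

    A local suffix such as '+cu90' is dropped, and parsing stops at the
    first non-numeric piece (for example 'post2').
    """
    parts = []
    for piece in version.split("+")[0].split("."):
        if not piece.isdigit():
            break
        parts.append(int(piece))
    return tuple(parts)


def torch_has_sync_batchnorm(torch_version: str) -> bool:
    """True if this torch version is 1.1 or later."""
    return version_tuple(torch_version) >= (1, 1)


if __name__ == "__main__":
    # 0.4.1 is the version installed by masktrackrcnn_env.sh.
    print(torch_has_sync_batchnorm("0.4.1"))
```

Inside the real environment, `hasattr(torch.nn, "SyncBatchNorm")` is the more direct test; the string comparison is only useful when you want to decide which mmcv to install before torch is importable.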
Execution command: `python tools/train.py configs/masktrack_rcnn_r50_fpn_1x_youtubevos.py`
Cause: The CUDA version used at build time.
Solution: Build with CUDA 9.0 instead of CUDA 9.2 (source). Note that CUDA 9.0 requires GCC < 6.0 at build time.
Execution command: `python tools/train.py configs/masktrack_rcnn_r50_fpn_1x_youtubevos.py`
Cause: The dimensions of `bbox_head.fc_cls.weight` in the model differ from those in the checkpoint.
Solution: Change line 39 of `[python_lib_path]/site-packages/mmcv/runner/checkpoint.py` to the following (Reference 1, Reference 2).
```checkpoint.py
print('While copying the parameter named {}, '
      'whose dimensions in the model are {} and '
      'whose dimensions in the checkpoint are {}.'
      .format(name, own_state[name].size(),
              param.size()))
```
```
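To see why turning the error into a `print` is enough, here is a toy model of the copy loop that uses plain Python lists instead of tensors. It is not mmcv's code: `copy_parameters` and its `strict` flag are hypothetical, and I am assuming that the original line 39 raised an error on a size mismatch.

```python
from typing import Dict, List


def copy_parameters(own_state: Dict[str, List[float]],
                    checkpoint: Dict[str, List[float]],
                    strict: bool) -> List[str]:
    """Copy checkpoint values into own_state and return the skipped names.

    strict=True models the assumed original behaviour (stop on mismatch);
    strict=False models the edited behaviour (report and carry on).
    """
    skipped = []
    for name, values in checkpoint.items():
        if name not in own_state:
            continue
        if len(own_state[name]) != len(values):
            message = ('While copying the parameter named {}, '
                       'whose dimensions in the model are {} and '
                       'whose dimensions in the checkpoint are {}.'
                       .format(name, len(own_state[name]), len(values)))
            if strict:
                raise RuntimeError(message)
            print(message)
            skipped.append(name)
            continue
        own_state[name][:] = values  # sizes match: overwrite in place
    return skipped


if __name__ == "__main__":
    model = {"backbone.weight": [0.0, 0.0], "bbox_head.fc_cls.weight": [0.0] * 3}
    ckpt = {"backbone.weight": [1.0, 2.0], "bbox_head.fc_cls.weight": [9.0] * 5}
    print(copy_parameters(model, ckpt, strict=False))
```

With the mismatch only reported, every layer whose size matches is still initialised from the checkpoint, and the mismatched layer simply keeps its freshly initialised values and is learned during training.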
Execution command: `python tools/train.py configs/masktrack_rcnn_r50_fpn_1x_youtubevos.py`
Cause: According to an issue, the problem appears to be that a GCC older than 4.9.2 was used at build time.
Solution: Install GCC into the conda environment with `conda install -c psi4 gcc-5` (source). I want 4.9.2 <= GCC < 6.0 so that this does not conflict with the CUDA 9.0 requirement of GCC < 6.0, but ABCI does not provide a module in that range.
Result: The build succeeded, so the Segmentation fault was probably resolved, but a new opencv import error appeared.
Execution command: `python tools/train.py configs/masktrack_rcnn_r50_fpn_1x_youtubevos.py`
Cause: Conda's libstdc++ version is too old?
Solution: `conda install libgcc` seems to solve this problem for the time being, but afterwards I get an import error about the system's libstdc++ instead.
```Error details
Traceback (most recent call last):
File "tools/train.py", line 4, in <module>
from mmcv import Config
File "/home/acb11854zq/anaconda3/envs/test_gcc/lib/python3.7/site-packages/mmcv/__init__.py", line 5, in <module>
from .opencv_info import *
File "/home/acb11854zq/anaconda3/envs/test_gcc/lib/python3.7/site-packages/mmcv/opencv_info.py", line 1, in <module>
import cv2
File "/home/acb11854zq/.local/lib/python3.7/site-packages/cv2/__init__.py", line 5, in <module>
from .cv2 import *
ImportError: /home/acb11854zq/anaconda3/envs/test_gcc/bin/../lib/libstdc++.so.6: version `GLIBCXX_3.4.22' not found (required by /home/acb11854zq/.local/lib/python3.7/site-packages/cv2/cv2.cpython-37m-x86_64-linux-gnu.so)
```
Execution command: `python tools/train.py configs/masktrack_rcnn_r50_fpn_1x_youtubevos.py`
Cause: Perhaps the system's libstdc++ is out of date? CXXABI_1.3.8 was apparently introduced with GCC 4.9.
Solution: When training the model, the error can be made to go away by loading GCC 7.4 on the system with `module load gcc/7.4.0`. Note that this must be done only after the build with `bash compile.sh` has finished; loading it earlier causes an error at build time.
```Error details
Traceback (most recent call last):
File "tools/train.py", line 4, in <module>
from mmcv import Config
File "/home/acb11854zq/.local/lib/python3.7/site-packages/mmcv/__init__.py", line 5, in <module>
from .image import *
File "/home/acb11854zq/.local/lib/python3.7/site-packages/mmcv/image/__init__.py", line 2, in <module>
from .colorspace import (bgr2gray, bgr2hls, bgr2hsv, bgr2rgb, bgr2ycbcr,
File "/home/acb11854zq/.local/lib/python3.7/site-packages/mmcv/image/colorspace.py", line 2, in <module>
import cv2
File "/home/acb11854zq/.local/lib/python3.7/site-packages/cv2/__init__.py", line 5, in <module>
from .cv2 import *
ImportError: /lib64/libstdc++.so.6: version `CXXABI_1.3.8' not found (required by /home/acb11854zq/.local/lib/python3.7/site-packages/cv2/cv2.cpython-37m-x86_64-linux-gnu.so)
```
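Both libstdc++ errors come down to the same question: which `GLIBCXX_*` and `CXXABI_*` symbol versions does a given `libstdc++.so.6` provide? The sketch below answers it by scanning the library file for those version strings, much like piping `strings` into `grep`. The helper names `symbol_versions` and `provides` are my own.

```python
import re
from typing import List

# Version tags look like GLIBCXX_3.4.22 or CXXABI_1.3.8. Requiring a digit
# right after the underscore skips names such as GLIBCXX_FORCE_NEW.
SYMBOL_VERSION = re.compile(rb"(GLIBCXX|CXXABI)_(\d+(?:\.\d+)*)")


def symbol_versions(library_path: str) -> List[str]:
    """Return the distinct version tags found in the file, sorted numerically."""
    with open(library_path, "rb") as handle:
        data = handle.read()
    found = set()
    for match in SYMBOL_VERSION.finditer(data):
        prefix = match.group(1).decode()
        numbers = tuple(int(p) for p in match.group(2).decode().split("."))
        found.add((prefix, numbers))
    return ["{}_{}".format(prefix, ".".join(str(n) for n in numbers))
            for prefix, numbers in sorted(found)]


def provides(library_path: str, required: str) -> bool:
    """True if the library file contains the required version tag."""
    return required in symbol_versions(library_path)


if __name__ == "__main__":
    # Path taken from the traceback above; adjust it for other libraries.
    print(provides("/lib64/libstdc++.so.6", "CXXABI_1.3.8"))
```

Running it against both `/lib64/libstdc++.so.6` and the `libstdc++.so.6` inside the conda environment shows which of the two is too old for the `cv2` binary named in the error message.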