This is Ichi Lab from RHEMS Giken. (* Please note that the title TPU is an abbreviation for Tensor Processing Unit and does not mean thermoplastic polyurethane.)
The previous article is here. [For beginners] I tried using the Tensorflow Object Detection API
TensorFlow's Object Detection API (API) is very useful for creating AI for object detection. On the other hand, I think that there are many people who have the following problems.
This time, I would like to leave a memorandum on how I managed to use Cloud TPU at a cost that even an individual can afford. Using the API as described in this article makes training much cheaper than running everything on GCP's Cloud TPU from start to finish.
We hope that it will be of some help to everyone.
To state the conclusion first, I think the following is the cheapest way to use the API with a TPU.
Google Colaboratory has been covered in many other articles, so I will omit the details. In exchange for getting a high-performance environment for free, you may be unable to use the GPU or TPU for a while if you overuse it.
In that case, preparing a similar environment yourself costs money, but it saves you the wait until Colaboratory becomes available again.
There is also a nice service called Google Colaboratory Pro for $9.99 per month, but at the time of writing this article (2020/06) it is available only in the United States. (Another article reports registering from Japan, but that may violate the rules, so it is at your own risk, and I have not tried it.)
The explanation here is based on the following conditions.
The methods introduced here always use GCP services, and with either method you will definitely be charged for Cloud Storage usage. GCP offers a $300 free tier for first-time users, but it does not cover the TPU usage fee, and the Cloud Storage free tier has some restrictions, so be sure to check the details yourself before proceeding. (Cloud ML has a free tier, but I haven't tried it.)
To train on a TPU, the data must be stored in Cloud Storage.
Here, we will assume that the names are as follows.
Project ID:
gcp-project-123
Bucket name:
my-bucket-123
Don't forget the -m option if you want to copy quickly from your local PC to your bucket!
Example command that sends the folders in the current directory to the bucket, using zsh on a Mac:
gsutil -m cp -r \* gs://my-bucket-123/
The folder structure in the bucket is as follows. (* The following explanation will proceed on the premise of this configuration)
gs://my-bucket-123/
├── models
│   └── ssd_mobilenet_v1_fpn (Model data of the transfer learning source)
│       └── .ckpt and many more
├── data
│   ├── save (Directory where training output is saved)
│   ├── train (Training data, tfrecord)
│   └── val (Validation data, tfrecord)
├── hoge.config (Config data)
└── tf_label_map.pbtxt (Label data)
This time, I used ssd_mobilenet_v1_fpn_coco as the transfer learning source. On the Tensorflow detection model zoo page, trained models that support TPU are marked with a ☆.
I covered the contents of the config in [the previous article](https://qiita.com/IchiLab/items/fd99bcd92670607f8f9b#%E3%82%B3%E3%83%B3%E3%83%95%E3%82%A3%E3%82%B0%E3%83%95%E3%82%A1%E3%82%A4%E3%83%AB%E3%81%AE%E7%B7%A8%E9%9B%86), so I will omit them here. However, now that the above files are in Cloud Storage, the following items need to be updated to match.
fine_tune_checkpoint: "gs://my-bucket-123/models/ssd_mobilenet_v1_fpn/model.ckpt"
label_map_path: "gs://my-bucket-123/tf_label_map.pbtxt"
input_path: "gs://my-bucket-123/data/train/{filename}.tfrecord"
input_path: "gs://my-bucket-123/data/val/{filename}.tfrecord"
(* How to write the {filename} part is explained in [the previous article](https://qiita.com/IchiLab/items/fd99bcd92670607f8f9b#%E3%82%B3%E3%83%B3%E3%83%95%E3%82%A3%E3%82%B0%E3%83%95%E3%82%A1%E3%82%A4%E3%83%AB%E3%81%AE%E7%B7%A8%E9%9B%86).)
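As a rough illustration of how the config values line up with the bucket layout, the Python sketch below builds the four gs:// paths from the bucket name. The helper names gcs_path and config_paths, the dictionary keys, and the file names train.tfrecord and val.tfrecord are hypothetical and only for illustration; they are not part of the Object Detection API.

```python
# Hypothetical helpers: these names are illustrative only and are not
# part of the Tensorflow Object Detection API.
BUCKET = "my-bucket-123"  # bucket name assumed throughout this article


def gcs_path(*parts, bucket=BUCKET):
    """Join path components into a gs:// URI under the given bucket."""
    cleaned = [p.strip("/") for p in parts if p.strip("/")]
    return "/".join(["gs://" + bucket] + cleaned)


def config_paths(train_file, val_file, bucket=BUCKET):
    """Return the config values that must point at Cloud Storage."""
    return {
        "fine_tune_checkpoint": gcs_path(
            "models", "ssd_mobilenet_v1_fpn", "model.ckpt", bucket=bucket),
        "label_map_path": gcs_path("tf_label_map.pbtxt", bucket=bucket),
        "train_input_path": gcs_path("data", "train", train_file, bucket=bucket),
        "eval_input_path": gcs_path("data", "val", val_file, bucket=bucket),
    }


if __name__ == "__main__":
    # The tfrecord file names here are placeholders, not real files.
    for key, value in config_paths("train.tfrecord", "val.tfrecord").items():
        print(key + ': "' + value + '"')
```

Printing the values this way makes it easy to compare them by eye with what is written in hoge.config before starting a paid TPU session.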
If you can manage within the free environment, do the training here. However, as mentioned earlier, Cloud Storage charges will be incurred.
It can be done locally or in a container: run git clone once, then store the result in your Google Drive.
By the way, when I tried the latest master, I ran into many problems, such as things that used to work no longer working, so I recommend the following branch.
git clone -b tf_2_1_reference https://github.com/tensorflow/models.git
Don't forget the coco API.
git clone --depth 1 https://github.com/cocodataset/cocoapi.git
Here, it is assumed that the source code is placed in the following directory.
/content/drive/My Drive/models/research
/content/drive/My Drive/cocoapi/PythonAPI
In your browser, open Google Drive and select New > Other > Google Colaboratory.
Change the title from "Untitled0.ipynb" to any name you like. (Recommended)
From the menu above, select "Runtime" -> "Change Runtime Type", set Hardware Accelerator to "TPU", and click "Save". Then select Connect.
First of all, nothing will run unless the API source can be read, so mount Google Drive.
from google.colab import drive
drive.mount('/content/drive')
To link with Cloud Storage, set the project with the gcloud command.
from google.colab import auth
auth.authenticate_user()
project_id = 'gcp-project-123'
!gcloud config set project {project_id}
!gsutil ls gs://my-bucket-123
Authentication works the same way as for Google Drive.
If successful, you can check the contents of the bucket with the ls command.
%cd /content/drive/My\ Drive/cocoapi/PythonAPI
!make
!cp -r pycocotools /content/drive/My\ Drive/models/research/
Convert .proto to .py.
%cd /content/drive/My\ Drive/models/research
!protoc object_detection/protos/*.proto --python_out=.
At the time of writing, the Tensorflow Object Detection API does not support TensorFlow 2.X. Google Colaboratory, on the other hand, has the 2.X series installed from the beginning. Therefore, you need to check the version and reinstall.
!pip list | grep tensor
!pip install tensorflow==1.15.0rc3
%env PYTHONPATH=/env/python:/content/drive/My Drive/models/research:/content/drive/My Drive/models/research/slim
Let's test whether the environment has been built successfully.
If all goes well, you will see "OK" over multiple lines.
By the way, this article uses the source of a slightly older branch; in more recent versions the test has been renamed to model_builder_tf1_test.py.
%cd /content/drive/My Drive/models/research
!python object_detection/builders/model_builder_test.py
If you start TensorBoard with the directory where the training output is saved, as shown below, you can monitor the loss and the number of training steps per second.
%load_ext tensorboard
%tensorboard --logdir gs://my-bucket-123/data/save
Use model_tpu_main.py instead of model_main.py for training.
You can specify the GCP project ID and TPU name as options, but this was not necessary in the Google Colaboratory environment.
This is probably because the TPU address is already registered in an environment variable (my guess).
(If you check with %env, the TPU address beginning with grpc:// is registered under the name TPU_NAME.)
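The check described above can also be done from Python. The following is a minimal sketch, assuming (as observed in this article) that the Colaboratory TPU runtime exposes the address in the TPU_NAME environment variable with a grpc:// prefix; the function name tpu_address is hypothetical.

```python
import os


def tpu_address(environ=None):
    """Return the TPU address stored in TPU_NAME, or None if it is absent.

    Assumption: the value starts with grpc://, as observed in this article.
    """
    environ = os.environ if environ is None else environ
    name = environ.get("TPU_NAME", "")
    return name if name.startswith("grpc://") else None


if __name__ == "__main__":
    address = tpu_address()
    if address is None:
        print("TPU_NAME is not set; check that the runtime type is TPU.")
    else:
        print("TPU address:", address)
```

Running this before training is a cheap way to notice that the runtime was not switched to TPU.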
%cd /content/drive/My Drive/models/research
pipeline = 'gs://my-bucket-123/hoge.config'
save = 'gs://my-bucket-123/data/save'
train_step = 1000
mode = 'train'
batch_size = 64
!python object_detection/model_tpu_main.py \
--pipeline_config_path={pipeline} \
--mode={mode} \
--num_train_steps={train_step} \
--eval_training_data=True \
--train_batch_size={batch_size} \
--model_dir={save} \
--alsologtostderr
Google Colaboratory sessions are limited to a maximum of 12 hours, but the remaining time is not displayed anywhere. You can find out by running the code below.
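import time, psutil
# Seconds elapsed since the runtime booted
elapsed = time.time() - psutil.boot_time()
# Seconds left, assuming the 12-hour limit
remaining = 12 * 3600 - elapsed
print('remaining time (hours): ', remaining / 3600)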
Once training starts, other cells wait until it finishes, so run anything else you need before starting.
By the way, if Google Colaboratory says "I can't use it for a while, please wait" and you can't wait, try this method.
First, you need to have Compute Engine and Cloud TPU enabled.
The first time it is displayed as below (image at the time of writing)
For Compute Engine
"Navigation menu" in the upper left -> "Compute Engine" -> "VM instances"
Preparation starts automatically.
For Cloud TPU
"Navigation menu" in the upper left -> "Compute Engine" -> "TPU"
The first time, you need to select "Enable API". (Rest assured that TPU billing does not start with this alone.)
If you're ready or already enabled, open Cloud Shell.
Cloud Shell has an icon like the one below in the upper right corner.
After waiting a while for it to open, start the VM and TPU at the same time with the ctpu command.
ctpu up --zone=us-central1-b --tf-version=1.15 --machine-type=n1-standard-4 --name=mytpu --preemptible --preemptible-vm
The important point here is to pass the preemptible options for both the VM and the TPU, that is, to use preemptible instances.
The following table shows the results calculated by the official pricing tool when the location of TPU V2 is us-central1.
TPU Class | Regular | Preemptible |
---|---|---|
Per hour | About 485 yen | About 146 yen |
For details on preemptible TPUs, please refer to the official documentation.
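To get a feel for the difference, here is a small Python sketch that estimates the TPU charge from the approximate hourly figures in the table above. It assumes per-second billing with no minimum charge, which is a simplification; the function names are hypothetical, and the official pricing tool remains the authoritative source.

```python
# Approximate hourly prices in yen, taken from the table above.
REGULAR_YEN_PER_HOUR = 485
PREEMPTIBLE_YEN_PER_HOUR = 146


def tpu_cost_yen(seconds, yen_per_hour):
    """Estimate the charge for `seconds` of TPU use, billed per second."""
    return yen_per_hour * seconds / 3600


def saving_ratio(regular=REGULAR_YEN_PER_HOUR,
                 preemptible=PREEMPTIBLE_YEN_PER_HOUR):
    """Fraction of the regular price saved by using a preemptible TPU."""
    return 1 - preemptible / regular


if __name__ == "__main__":
    half_hour = 30 * 60  # seconds
    print("regular, 30 min:    ", tpu_cost_yen(half_hour, REGULAR_YEN_PER_HOUR))
    print("preemptible, 30 min:", tpu_cost_yen(half_hour, PREEMPTIBLE_YEN_PER_HOUR))
    print("saving ratio:       ", round(saving_ratio(), 2))
```

With these figures, the preemptible TPU comes out roughly 70% cheaper per hour than the regular one.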
You can do the same from the console or the gcloud command. Details can be found in the official documentation, Creating and Deleting TPUs (https://cloud.google.com/tpu/docs/creating-deleting-tpus?hl=ja).
When you execute the command, a confirmation will be displayed as shown below.
Name: mytpu
Zone: us-central1-b
GCP Project: gcp-project-123
TensorFlow Version: 1.15
VM:
Machine Type: n1-standard-4
Disk Size: 250 GB
Preemptible: true
Cloud TPU:
Size: v2-8
Preemptible: true
Reserved: false
OK to create your Cloud TPU resources with the above configuration? [Yn]:
Type y and press Enter / return to start creating each resource. The only reason I chose n1-standard-4 for the machine type is that its memory is close to that of the Google Colaboratory environment, so change it if necessary.
By the way, if you have accidentally deleted the default service account of Compute Engine, you will not be able to create the resources with the above ctpu command.
2020/06/20 00:00:00 Creating Compute Engine VM mytpu (this may take a minute)...
2020/06/20 00:00:07 TPU operation still running...
2020/06/20 00:00:07 error retrieving Compute Engine zone operation:
(I got an error like this. When did I even delete it?) I didn't know how to fix it, so I created a new project.
"... Let's go back in time."
Cloud TPU is billed by the second. Once you have confirmed that it has started up, stop the TPU for the time being.
Once the instance has started successfully, connect via "SSH" below.
When the connection was completed, the console screen opened as shown below.
From here on, we work inside this console.
You can also check the TPU status with the gcloud command here.
gcloud config set compute/zone us-central1-b
Updated property [compute/zone].
gcloud compute tpus list
NAME ZONE ACCELERATOR_TYPE NETWORK_ENDPOINTS NETWORK RANGE STATUS
mytpu us-central1-b v2-8 10.240.1.2:8470 default 10.240.1.0/29 STOPPING
As an aside, the TPU status is displayed as follows.
Creating | Starting | Running | Stopping | Stopped |
---|---|---|---|---|
CREATING | STARTING | READY | STOPPING | STOPPED |
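If you want to check the status from a script rather than by eye, the text output can be parsed. The sketch below is only an illustration under the assumption that the output is whitespace-separated with the name in the first column and the status in the last, as in the sample shown in this article; the real output format may differ, and the function name parse_tpu_list is hypothetical.

```python
def parse_tpu_list(output):
    """Map each TPU name to its status from `gcloud compute tpus list` text.

    Assumes whitespace-separated columns, with NAME first and STATUS last,
    and a single header row, as in the sample output in this article.
    """
    lines = [line for line in output.splitlines() if line.strip()]
    statuses = {}
    for line in lines[1:]:  # skip the header row
        columns = line.split()
        statuses[columns[0]] = columns[-1]
    return statuses


# Sample output copied from this article.
SAMPLE = """\
NAME ZONE ACCELERATOR_TYPE NETWORK_ENDPOINTS NETWORK RANGE STATUS
mytpu us-central1-b v2-8 10.240.1.2:8470 default 10.240.1.0/29 STOPPING
"""

if __name__ == "__main__":
    for name, status in parse_tpu_list(SAMPLE).items():
        print(name, "is", status)
```

A check like this could be used to confirm that the TPU has reached STOPPED before you walk away from the console.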
I added this item because I want to make it unambiguous whether python means Python 2 or Python 3. I always want python to be 3.X, so change the settings as follows.
Open .bashrc
vi ~/.bashrc
Add the settings at the last line: move to the bottom with Shift + G, enter the following, press esc, and overwrite with :wq!
alias python="python3"
alias pip='pip3'
Apply the settings
source ~/.bashrc
python now runs python3.
From here, the steps are almost the same as the API environment setup, but I will describe them without omission.
sudo apt-get update
sudo apt-get install -y protobuf-compiler python-pil python-lxml python-tk
pip install -U pip && pip install Cython contextlib2 jupyter matplotlib tf_slim pillow
Next, fetch the API source code and the coco API.
git clone -b tf_2_1_reference https://github.com/tensorflow/models.git
git clone --depth 1 https://github.com/cocodataset/cocoapi.git
Again, the API source code used in this article is the branch above.
Then install the coco API.
make failed for me as follows.
x86_64-linux-gnu-gcc: error: pycocotools/_mask.c: No such file or directory
To avoid this, modify the Makefile a bit.
cd cocoapi/PythonAPI
vi Makefile
After opening the Makefile, change python to python3 (there are two places).
make
cp -r pycocotools /home/ichilab/models/research && cd ../../ && rm -rf cocoapi
Convert .proto to .py.
cd models/research
protoc object_detection/protos/*.proto --python_out=.
Once I closed the SSH screen, I had to do this part again.
(pwd = models/research)
export PYTHONPATH=$PYTHONPATH:`pwd`:`pwd`/slim
source ~/.bashrc
Let's test whether the environment has been built successfully. If successful, "OK" will be displayed over multiple lines.
python object_detection/builders/model_builder_test.py
It may be good to make sure that you can see the contents of the bucket.
gsutil ls gs://my-bucket-123
Training cannot run while the TPU is stopped, so start it again here. Once you have confirmed that it is running, move on to the next step.
Start training.
python object_detection/model_tpu_main.py \
--tpu_name=mytpu \
--model_dir=gs://my-bucket-123/data/save \
--mode=train \
--pipeline_config_path=gs://my-bucket-123/hoge.config \
--alsologtostderr
Here is a brief description of the options.
--gcp_project : The project ID. It was optional.
--tpu_name : The name given when launching with the ctpu command.
--tpu_zone : The zone of the TPU. It was optional.
--model_dir : Specifies where the trained ckpt files are saved.
--pipeline_config_path : Specifies the location of the configuration file.
By the way, when I used the latest source here,
tensorflow.python.framework.errors_impl.InvalidArgumentError: From /job:tpu_worker/replica:0/task:0:
I was quite troubled by this error. It is the only reason I'm using the source code of the branch I mentioned earlier. The config and other files were under exactly the same conditions, so the cause is unknown at this time.
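The options described above could be assembled programmatically, which helps avoid typos in long command lines. The following Python sketch only builds the argument list and prints it; the function name build_train_command is hypothetical, and the flags are the ones listed in this article.

```python
import shlex


def build_train_command(tpu_name, model_dir, pipeline_config_path,
                        gcp_project=None, tpu_zone=None, mode="train"):
    """Build the argument list for model_tpu_main.py.

    gcp_project and tpu_zone are omitted unless given, since the article
    found them to be optional.
    """
    args = [
        "python", "object_detection/model_tpu_main.py",
        "--tpu_name=" + tpu_name,
        "--model_dir=" + model_dir,
        "--mode=" + mode,
        "--pipeline_config_path=" + pipeline_config_path,
    ]
    if gcp_project:
        args.append("--gcp_project=" + gcp_project)
    if tpu_zone:
        args.append("--tpu_zone=" + tpu_zone)
    return args


if __name__ == "__main__":
    command = build_train_command(
        tpu_name="mytpu",
        model_dir="gs://my-bucket-123/data/save",
        pipeline_config_path="gs://my-bucket-123/hoge.config",
    )
    # Print a shell-safe version of the command for copy and paste.
    print(" ".join(shlex.quote(arg) for arg in command))
```

The returned list could also be passed to subprocess.run from the models/research directory, although this article simply runs the command by hand.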
When you're done, give priority to stopping and deleting the TPU.
Now that the environment setup has been explained, you are probably wondering how much it costs.
I'm sorry that I can't post a proper comparison, but training 100,000 steps with Google Colaboratory cost me less than 400 yen. I have never run 100,000 steps with the VM and TPU started in my own project, but as a guide: apart from the TPU charge at the rate mentioned above, using Compute Engine for about 6 hours cost 4 yen and the external IP usage charge was 1 yen, which was less than 10 yen in total.
For this, the official price calculation tool will get you closer to the correct answer than my article.
What did you think?
Surprisingly, I couldn't find any article summarizing environment setup for Cloud TPU × Tensorflow Object Detection API, so I hope this encourages more people to train with TPUs and to take an interest in GCP.
I sincerely hope that faster training will accelerate your research on object detection AI.