Introduction

This is Ichi Lab from RHEMS Giken. (* Please note that "TPU" in the title stands for Tensor Processing Unit, not thermoplastic polyurethane.)

The previous article is here. [For beginners] I tried using the Tensorflow Object Detection API

TensorFlow's Object Detection API (hereafter "the API") is very useful for creating object detection AI. On the other hand, I think many people run into the following problems.

Training takes too long
High-performance GPUs (such as those with a Tensor core) are too expensive to buy
It seems that there is a TPU that is most suitable for deep learning, but I'm not sure how to use it.
Cloud TPU is insanely expensive

This time, I would like to leave a memorandum on how I managed to use Cloud TPU at a cost that even an individual can afford. Using the API as described in this article makes GCP's Cloud TPU much cheaper than running everything on it from start to finish.

We hope that it will be of some help to everyone.

Preface

Rough method

To state the conclusion first, I think the following is the cheapest way to use the API with a TPU.

Train within the free tier of Google Colaboratory
If you want to keep training after you run out of the free tier, continue on a GCP VM and Cloud TPU (both preemptible)

Google Colaboratory has been covered extensively in other articles, so I will omit the details, but it is a wonderful service with the following features.

Free to use (with a limit of less than 12 hours)
You can run Python from your browser
Operability like Jupyter Notebook
Expensive GPU / TPU can be used for free

At the cost of getting a great and high-performance environment for free, you may not be able to use the GPU or TPU for a while if you overuse it.

In that case, preparing a similar environment yourself costs money, but it saves you the time you would otherwise spend waiting until you can use it again.

There is also a nice service called Google Colaboratory Pro for $9.99 per month, but at the time of writing this article (2020/06) it is available only in the United States. (Another article reports being able to register from Japan, but this may violate the terms of service, so do it at your own risk; I have not tried it.)

Prerequisites

The explanation here is based on the following conditions.

Have used API (recommended)
Have touched GCP (recommended)
Annotated training data exists (required)
Have a Google account (required)
GCP billing settings have been completed (credit card registration) (required)

Precautions (about money)

Both methods introduced here use GCP services, and with either one you will definitely be charged for Cloud Storage usage. First-time GCP users get a $300 free tier, but it does not cover TPU usage fees, and the Cloud Storage free tier has some restrictions, so be sure to check the details yourself before proceeding. (Cloud ML has a free tier, but I haven't tried it.)

Common preparation

In order to train using a TPU, the data must be stored in Cloud Storage.

Here, we will assume the following names.

Project ID: gcp-project-123
Bucket name: my-bucket-123

Don't forget the -m option if you want to quickly copy from your local PC to your bucket!

`Example command for sending the contents of the current directory to the bucket with zsh on Mac`


gsutil -m cp -r \* gs://my-bucket-123/

The folder structure in the bucket is as follows. (* The following explanation will proceed on the premise of this configuration)

gs://my-bucket-123/
├── models
│     └── ssd_mobilenet_v1_fpn (model data for the transfer learning source)
│           └── .ckpt and many more
├── data
│     ├── save (directory where training output is saved)
│     ├── train (training data, tfrecord)
│     └── val (validation data, tfrecord)
├── hoge.config (config file)
└── tf_label_map.pbtxt (label data)
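
Before paying for TPU time, it may be worth confirming that the upload produced this layout. The following is only a rough sketch: the helper name missing_paths and the REQUIRED list are my own illustration, and it assumes you pass it the text printed by a recursive listing such as gsutil ls -r gs://my-bucket-123. The data/save directory is deliberately left out, because it may not exist in the bucket until training writes something into it.

```python
# Paths taken from the folder structure above.
# data/save is omitted on purpose: it may not exist before the first training run.
REQUIRED = [
    "models/ssd_mobilenet_v1_fpn/",
    "data/train/",
    "data/val/",
    "hoge.config",
    "tf_label_map.pbtxt",
]


def missing_paths(listing, bucket="gs://my-bucket-123"):
    """Return the required paths that no listed object URL starts with.

    `listing` is the text of a recursive bucket listing, one URL per line.
    """
    # Directory lines in a recursive listing end with ":", so strip that off.
    urls = [line.strip().rstrip(":") for line in listing.splitlines() if line.strip()]
    missing = []
    for path in REQUIRED:
        prefix = f"{bucket}/{path}"
        if not any(url.startswith(prefix) for url in urls):
            missing.append(path)
    return missing


# Example with a hand-written listing (not real output).
example = """gs://my-bucket-123/hoge.config
gs://my-bucket-123/data/train/:
gs://my-bucket-123/data/train/sample.tfrecord
"""
print(missing_paths(example))
```

An empty list means every expected file or directory was found.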

This time, I used ssd_mobilenet_v1_fpn_coco as the transfer learning source. In addition, on the Tensorflow detection model zoo page, trained models that support TPU are marked with a ☆.

I covered the contents of the config in [the previous article](https://qiita.com/IchiLab/items/fd99bcd92670607f8f9b#%E3%82%B3%E3%83%B3%E3%83%95%E3%82%A3%E3%82%B0%E3%83%95%E3%82%A1%E3%82%A4%E3%83%AB%E3%81%AE%E7%B7%A8%E9%9B%86), so I will omit the details here. However, now that the above files are in Cloud Storage, the following items need to be changed to match.

fine_tune_checkpoint: "gs://my-bucket-123/models/ssd_mobilenet_v1_fpn/model.ckpt"
label_map_path: "gs://my-bucket-123/tf_label_map.pbtxt"
input_path: "gs://my-bucket-123/data/train/{filename}.tfrecord"
input_path: "gs://my-bucket-123/data/val/{filename}.tfrecord"

(* For how to write the {filename} part, see [the previous article](https://qiita.com/IchiLab/items/fd99bcd92670607f8f9b#%E3%82%B3%E3%83%B3%E3%83%95%E3%82%A3%E3%82%B0%E3%83%95%E3%82%A1%E3%82%A4%E3%83%AB%E3%81%AE%E7%B7%A8%E9%9B%86).)
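
If you would rather not edit these paths by hand, something like the following could do it. This is just a sketch under my own assumptions: the function name point_config_at_bucket is hypothetical, it only handles the two fixed paths (fine_tune_checkpoint and label_map_path), and it treats the config as plain text rather than parsing it. The input_path entries depend on your own tfrecord file names, so they are left alone.

```python
import re

BUCKET = "gs://my-bucket-123"  # bucket name assumed throughout this article


def point_config_at_bucket(config_text, bucket=BUCKET):
    """Rewrite fine_tune_checkpoint and label_map_path to point at the bucket."""
    replacements = {
        "fine_tune_checkpoint": f"{bucket}/models/ssd_mobilenet_v1_fpn/model.ckpt",
        "label_map_path": f"{bucket}/tf_label_map.pbtxt",
    }
    for key, value in replacements.items():
        # Replace whatever quoted value currently follows the key.
        # label_map_path usually appears more than once; all are replaced.
        config_text = re.sub(rf'{key}:\s*"[^"]*"', f'{key}: "{value}"', config_text)
    return config_text


# Example with a made-up fragment of a config file.
fragment = '''fine_tune_checkpoint: "/local/path/model.ckpt"
label_map_path: "/local/path/tf_label_map.pbtxt"
'''
print(point_config_at_bucket(fragment))
```

To apply it to a real file, read hoge.config as text, pass it through the function, and write the result back before uploading.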

1. How to do it using Google Colaboratory

If the free tier is enough for you, use this method. However, as mentioned earlier, Cloud Storage charges will be incurred.

1-1. Put the API source on the drive

You can do this locally or in a container: once you have run git clone, store the result in your Google Drive.

By the way, when I tried the latest master, I ran into many problems, such as things that used to work no longer working, so I recommend the following branch.

git clone -b tf_2_1_reference https://github.com/tensorflow/models.git

Don't forget the coco API.

git clone --depth 1 https://github.com/cocodataset/cocoapi.git

Here, it is assumed that the source code is placed in the following directory.

/content/drive/My Drive/models/research
/content/drive/My Drive/cocoapi/PythonAPI

1-2. Make a new notebook

In your browser, from Google Drive, select New > Other > Google Colaboratory.

Change the title from "Untitled0.ipynb" to any name you like. (Recommendation)

From the menu at the top, select "Runtime" -> "Change runtime type", set Hardware accelerator to "TPU", and click "Save". Then select Connect.

1-3. Mount Google Drive

First of all, nothing will work unless the notebook can read the API source, so mount the drive.

from google.colab import drive
drive.mount('/content/drive')

1-4. GCP project settings

To link with Cloud Storage, set the project with the gcloud command.

from google.colab import auth
auth.authenticate_user()
project_id = 'gcp-project-123'
!gcloud config set project {project_id}
!gsutil ls gs://my-bucket-123

Authentication is similar to Google Drive. If successful, you can check the contents of the bucket with the ls command.

1-5. Installation of cocoAPI (first time only)

%cd /content/drive/My\ Drive/cocoapi/PythonAPI
!make
!cp -r pycocotools /content/drive/My\ Drive/models/research/

1-6. Execution of protoc (first time only)

Convert .proto to .py.

%cd /content/drive/My\ Drive/models/research
!protoc object_detection/protos/*.proto --python_out=.

1-7. Change the version of tensorflow

At the time of writing, the Tensorflow Object Detection API does not support tensorflow 2.X. On the other hand, Google Colaboratory has the 2.X series installed from the beginning. Therefore, you need to check the version and reinstall.

!pip list | grep tensor

!pip install tensorflow==1.15.0rc3
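
To double-check the result from Python rather than from pip list, a small check like the one below may help. It is only a sketch: is_tf1 is a helper name I made up, and it simply looks at the major version number in the version string.

```python
def is_tf1(version):
    """Return True when a version string such as '1.15.0rc3' is a 1.x release."""
    return version.split(".")[0] == "1"


try:
    import tensorflow as tf  # only importable once TensorFlow is installed

    status = "OK" if is_tf1(tf.__version__) else "needs the 1.x series"
    print(tf.__version__, status)
except ImportError:
    print("TensorFlow is not installed in this environment")
```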

1-8. Setting environment variables

%env PYTHONPATH=/env/python:/content/drive/My Drive/models/research:/content/drive/My Drive/models/research/slim

1-9. Executing API test code

Let's test whether the environment has been built successfully. If all goes well, you will see "OK" over multiple lines. By the way, this article uses the source of a slightly older branch, but recently it has been renamed to model_builder_tf1_test.py.

%cd /content/drive/My Drive/models/research
!python object_detection/builders/model_builder_test.py

1-10. Start Tensorboard (not required)

If you specify the directory where training output is saved and start TensorBoard as shown below, you can check how the loss changes and how many training steps run per second.

%load_ext tensorboard
%tensorboard --logdir gs://my-bucket-123/data/save

1-11. Start learning

Use model_tpu_main.py instead of model_main.py for training. You can specify the GCP project ID and TPU name as options, but this was not necessary in the Google Colaboratory environment. My guess is that this is because the TPU address is already registered in an environment variable. (If you check with %env, a TPU address beginning with grpc:// is registered under the name TPU_NAME.)

%cd /content/drive/My Drive/models/research
pipeline = 'gs://my-bucket-123/hoge.config'
save = 'gs://my-bucket-123/data/save'
train_step = 1000
mode = 'train'
batch_size = 64

!python object_detection/model_tpu_main.py \
 --pipeline_config_path={pipeline} \
 --mode={mode} \
 --num_train_steps={train_step} \
 --eval_training_data=True \
 --train_batch_size={batch_size} \
 --model_dir={save} \
 --alsologtostderr
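
If training fails to find the TPU, it may help to confirm that the environment variable mentioned above is really set. The following is a small sketch; tpu_address is a hypothetical helper of mine, and it only assumes what I observed in Colaboratory, namely that TPU_NAME holds an address beginning with grpc://.

```python
import os


def tpu_address(env=os.environ):
    """Return the TPU gRPC address stored in TPU_NAME, or None if it is absent."""
    address = env.get("TPU_NAME", "")
    return address if address.startswith("grpc://") else None


print(tpu_address() or "no TPU address found; check the runtime type")
```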

1-12. Bonus (convenient to do)

Google Colaboratory has a usage limit of less than 12 hours, but nowhere does it show how many hours you actually have left. You can find out by running the code below.

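import time, psutil

# Seconds elapsed since the Colaboratory VM booted
elapsed = time.time() - psutil.boot_time()
# Sessions are limited to less than 12 hours
remaining = 12 * 3600 - elapsed
print('remaining time (hours): ', remaining / 3600)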

Once you start training, other cells wait until it finishes, so run this at a convenient break point, for example just before starting a training run.

2. How to launch your own GCP VM and Cloud TPU

If Google Colaboratory tells you that you can't use it for a while and you can't wait, try this method.

2-1. Launch VM and Cloud TPU

First, you need to have Compute Engine and Cloud TPU enabled.

For Compute Engine, open "Navigation menu" -> "Compute Engine" -> "VM instances" in the upper left. The first time you do this, preparation starts automatically.

For Cloud TPU, open "Navigation menu" -> "Compute Engine" -> "TPU" in the upper left. The first time, you need to select "Enable API". (Rest assured that TPU billing will not start with this alone.)

If you're ready or everything is already enabled, open Cloud Shell from its icon in the upper right corner. Once it has opened after a short wait, start the VM and TPU at the same time with the ctpu command.

ctpu up --zone=us-central1-b --tf-version=1.15 --machine-type=n1-standard-4 --name=mytpu --preemptible --preemptible-vm

The important point here is to pass the preemptible options for both the VM and the TPU, that is, to use preemptible instances.

The following table shows the prices calculated by the official pricing tool for TPU v2 in us-central1.

TPU Class	Regular	Preemptible
Per hour	About 485 yen	About 146 yen

For details on preemptible instances, please refer to the official documentation.
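
To get a feel for the difference, here is a rough calculation using only the two hourly figures from the table. The function name tpu_cost_yen is my own, the numbers are approximate, and actual billing is per second and depends on the exchange rate, so treat the result as a guide only.

```python
# Approximate hourly prices in yen from the table above (TPU v2, us-central1).
REGULAR_YEN_PER_HOUR = 485
PREEMPTIBLE_YEN_PER_HOUR = 146


def tpu_cost_yen(hours, preemptible=True):
    """Estimate the TPU charge in yen for the given number of hours."""
    rate = PREEMPTIBLE_YEN_PER_HOUR if preemptible else REGULAR_YEN_PER_HOUR
    return hours * rate


for hours in (1, 6, 12):
    regular = tpu_cost_yen(hours, preemptible=False)
    cheap = tpu_cost_yen(hours, preemptible=True)
    print(f"{hours:>2} h: regular {regular} yen, preemptible {cheap} yen")
```

With these figures, the preemptible price works out to roughly 30% of the regular price.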

You can do the same from the console or the gcloud command. Details can be found in the official documentation, Creating and Deleting TPUs (https://cloud.google.com/tpu/docs/creating-deleting-tpus?hl=ja).

When you execute the command, a confirmation will be displayed as shown below.

  Name:                 mytpu
  Zone:                 us-central1-b
  GCP Project:          gcp-project-123
  TensorFlow Version:   1.15
  VM:
      Machine Type:     n1-standard-4
      Disk Size:        250 GB
      Preemptible:      true
  Cloud TPU:
      Size:             v2-8
      Preemptible:      true
      Reserved:         false
OK to create your Cloud TPU resources with the above configuration? [Yn]:

Type y and press Enter / return to start creating each resource. I chose n1-standard-4 for the machine type simply because its memory is close to that of the Google Colaboratory environment, so change it if necessary.

By the way, if you accidentally delete the default service account of Compute Engine, you will not be able to create the resources with the above ctpu command.

2020/06/20 00:00:00 Creating Compute Engine VM mytpu (this may take a minute)...
2020/06/20 00:00:07 TPU operation still running...
2020/06/20 00:00:07 error retrieving Compute Engine zone operation:

(I got an error like this. When did I delete it...?) I didn't know how to fix it, so I created a new project instead.

"... Let's get back on track."

2-2. Stop the TPU that has just started

Cloud TPU is billed by the second. Once you have confirmed that it started up successfully, stop it for the time being.
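
One way to stop the TPU is the gcloud compute tpus stop command (start and delete work the same way). The sketch below wraps it in Python; tpu_command and run_tpu_command are helper names I made up, and the defaults assume the TPU name mytpu and the zone us-central1-b used in this article. By default it only prints the command, so nothing is changed until you pass dry_run=False.

```python
import subprocess


def tpu_command(action, name="mytpu", zone="us-central1-b"):
    """Build the gcloud command line for starting, stopping, or deleting a TPU."""
    if action not in ("start", "stop", "delete"):
        raise ValueError(f"unsupported action: {action}")
    return ["gcloud", "compute", "tpus", action, name, f"--zone={zone}"]


def run_tpu_command(action, dry_run=True, **kwargs):
    """Print the command, or actually run it when dry_run is False."""
    command = tpu_command(action, **kwargs)
    if dry_run:
        print(" ".join(command))
    else:
        subprocess.run(command, check=True)
    return command


run_tpu_command("stop")  # prints the command without running it
```

The same helper can be reused later with "start" to resume the TPU and with "delete" once training is finished.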

2-3. Enter the VM instance created by SSH

When the instance has started successfully, connect to it using "SSH".

When the connection is complete, a console window opens.

From here on, we will work in this console.

You can also check the TPU status with the gcloud command here.


gcloud config set compute/zone us-central1-b
Updated property [compute/zone].

gcloud compute tpus list
NAME   ZONE           ACCELERATOR_TYPE  NETWORK_ENDPOINTS  NETWORK  RANGE          STATUS
mytpu  us-central1-b  v2-8              10.240.1.2:8470    default  10.240.1.0/29  STOPPING

As an aside, the status of the TPU is displayed as follows.

Creating	Starting	Running	Stopping	Stopped
CREATING	STARTING	READY	STOPPING	STOPPED
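
If you want to read the status from a script instead of by eye, the table printed by gcloud compute tpus list can be parsed roughly as follows. This is a sketch with a hypothetical helper name (tpu_status), and it assumes every column has a value, as in the output shown above; an empty column would shift the fields.

```python
def tpu_status(listing, name):
    """Return the STATUS column for the TPU called `name`, or None if not listed."""
    lines = [line.split() for line in listing.splitlines() if line.strip()]
    if not lines:
        return None
    header, rows = lines[0], lines[1:]
    for row in rows:
        record = dict(zip(header, row))  # pair each column name with its value
        if record.get("NAME") == name:
            return record.get("STATUS")
    return None


# The output shown earlier in this section.
sample = """NAME   ZONE           ACCELERATOR_TYPE  NETWORK_ENDPOINTS  NETWORK  RANGE          STATUS
mytpu  us-central1-b  v2-8              10.240.1.2:8470    default  10.240.1.0/29  STOPPING
"""
print(tpu_status(sample, "mytpu"))
```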

2-4. Set alias (not required)

I added this step because I want to be clear about whether python means Python 2 or Python 3. I always want python to be 3.X, so change the settings as follows.

Open .bashrc

vi ~/.bashrc

Add settings to the last line

Press Shift + G to move to the last line, press i to enter insert mode, type the following, press Esc, and save with :wq!

alias python="python3" 
alias pip='pip3'

Reflect settings

source ~/.bashrc

The python command now runs python3.

2-5. Installation of required libraries

From here on, the steps are almost the same as building the API environment, but I will describe them without omitting anything.

sudo apt-get update
sudo apt-get install -y protobuf-compiler python-pil python-lxml python-tk
pip install -U pip && pip install Cython contextlib2 jupyter matplotlib tf_slim pillow

Next, bring the API source code and cocoAPI.

git clone -b tf_2_1_reference https://github.com/tensorflow/models.git
git clone --depth 1 https://github.com/cocodataset/cocoapi.git

Again, the source code for the API used in this article is the branch above.

2-6. Installation of coco API

Then install the coco API.

In my case, make failed as follows.

x86_64-linux-gnu-gcc: error: pycocotools/_mask.c: No such file or directory

To avoid this, modify the Makefile a bit.

cd cocoapi/PythonAPI
vi Makefile

After opening the Makefile, change python to python3 (in two places).
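
If you prefer not to edit the file in vi, the same change can be made with a few lines of Python. This is only a sketch: use_python3 is a name I made up, and it replaces every standalone word python in the text, which is a slightly broader rule than editing the two lines by hand. Words such as python3 or pycocotools are not touched, so running it twice is harmless.

```python
import re


def use_python3(makefile_text):
    """Replace the standalone word 'python' with 'python3'."""
    return re.sub(r"\bpython\b", "python3", makefile_text)


# Simplified example lines, not the full Makefile.
example = "python setup.py build_ext --inplace\npython setup.py build_ext install\n"
print(use_python3(example))

# To apply it to the real file (run from cocoapi/PythonAPI):
# from pathlib import Path
# path = Path("Makefile")
# path.write_text(use_python3(path.read_text()))
```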

make
cp -r pycocotools /home/ichilab/models/research && cd ../../ && rm -rf cocoapi

Please change the ichilab part to your user name.

2-7. Executing protoc

Convert .proto to .py.

cd models/research
protoc object_detection/protos/*.proto --python_out=.

2-8. Setting environment variables

I had to redo this step every time I closed the SSH window, because export only applies to the current shell session.

(pwd = models/research)
export PYTHONPATH=$PYTHONPATH:`pwd`:`pwd`/slim
source ~/.bashrc

2-9. Executing API test code

Let's test whether the environment has been built successfully. If successful, "OK" will be displayed over multiple lines.

python object_detection/builders/model_builder_test.py

It may be good to make sure that you can see the contents of the bucket.

gsutil ls gs://my-bucket-123

2-10. Resume the stopped TPU

You can't train while the TPU is stopped, so start it again here. Once you have confirmed that it has started, move on to the next step.

2-11. Start learning

Now start training.

python object_detection/model_tpu_main.py \
--tpu_name=mytpu \
--model_dir=gs://my-bucket-123/data/save \
--mode=train \
--pipeline_config_path=gs://my-bucket-123/hoge.config \
--alsologtostderr

Here is a brief description of the options.

--gcp_project: The project ID. It turned out to be optional.
--tpu_name: The name given when launching with the ctpu command.
--tpu_zone: The zone of the TPU. It turned out to be optional.
--model_dir: The destination where the trained ckpt files are saved.
--pipeline_config_path: The location of the config file.

By the way, when I used the latest source here, I was plagued by the following error.

tensorflow.python.framework.errors_impl.InvalidArgumentError: From /job:tpu_worker/replica:0/task:0:

This is the only reason I'm using the source code from the branch I mentioned earlier. The config and other files were exactly the same, so the cause is unknown at this time.

2-12. Stop / delete TPU and VM after learning

When you're done, stop and delete the TPU first, before anything else.

How much will it cost

Now that the environment setup has been explained, you are probably wondering how much it will cost.

I'm sorry that I can't offer a proper comparison. Training 100,000 steps with Google Colaboratory cost a little under 400 yen. I have never run 100,000 steps on a VM and TPU launched in my own project, but as a guide: for the TPU, refer to the hourly prices mentioned above; Compute Engine cost 4 yen for about 6 hours of use, and the external IP usage charge was 1 yen, so those two together came to less than 10 yen.

On this point, the official pricing calculator will give you a more accurate answer than my article.

Conclusion

What did you think?

Surprisingly, I couldn't find any article summarizing environment setup for Cloud TPU × Tensorflow Object Detection API, so I hope this article encourages more people to train with TPUs and to take an interest in GCP.

I sincerely hope that faster training will accelerate your research on object detection AI.

[For those who want to use TPU] I tried using the Tensorflow Object Detection API 2