Introduction

The other day, in the article "Try Deep Learning with FPGA", I wrote about PYNQ and BNN-PYNQ. In that article, I introduced a relatively inexpensive FPGA board, the PYNQ-Z1, and ran one of the bundled demos (Cifar10). This time, I will build on that bundled demo and use it to sort cucumbers.

Background

Customizing BNN-PYNQ

As I wrote in the previous article, deep learning consists broadly of training and inference. BNN-PYNQ implements only inference on the FPGA (training must be done on a CPU / GPU). Customizing BNN-PYNQ therefore means changing the inference-side network structure and parameters to match what was trained.

Taking the Cifar10 demo from last time as an example, BNN-PYNQ runs deep learning on the FPGA from an application in Jupyter through the files listed below. The CPU / FPGA speed comparison in the previous article was done by switching which shared library (python_hw or python_sw) is loaded in #4 below.

#	File	Overview	Custom method
1	Cifar10.ipynb	The application. Last time, this was the Jupyter notebook that ran the demo.
2	bnn.py	A library for running BNN-PYNQ from Python.
3	X-X-thres.bin X-X-weights.bin classes.txt	Parameter files. They carry the results of training on a CPU/GPU into BNN-PYNQ.	BinaryNets for Pynq - Training Networks
4	python_sw-cnv-pynq.so	A shared library for running deep learning on the CPU.	make-sw.sh python_sw
	python_hw-cnv-pynq.so	A shared library for running deep learning on the FPGA.	make-sw.sh python_hw
5	cnv-pynq-pynq.bit	A bitstream file for running the processing on the FPGA. Switching the overlay switches which bitstream file is loaded.	make-hw.sh
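The naming pattern of the two shared libraries in row 4 can be sketched as a small helper. This is only an illustration of the hw/sw switch described above: the function name and its `runtime` argument are hypothetical and are not part of the BNN-PYNQ API.

```python
# Hypothetical helper: the function name and the "runtime" argument are
# illustrative only and are not part of bnn.py.
def shared_library_name(runtime, network="cnv-pynq"):
    """Return the shared-library file name for the chosen runtime."""
    if runtime not in ("hw", "sw"):
        raise ValueError("runtime must be 'hw' (FPGA) or 'sw' (CPU)")
    return "python_{}-{}.so".format(runtime, network)


print(shared_library_name("hw"))  # python_hw-cnv-pynq.so
print(shared_library_name("sw"))  # python_sw-cnv-pynq.so
```

Because both libraries follow the same naming scheme, comparing CPU and FPGA only requires changing which file is loaded; the application code stays the same.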

This time I will customize BNN-PYNQ, but rebuilding the network structure right away is a high hurdle, so I will keep the same network structure as Cifar10 and change only the parameters that are loaded.

Cucumber selection

This was a hot topic for a while, so many of you may know it: the task is to classify cucumbers into 9 grades based on their images. Sorting "cucumbers" by deep learning with TensorFlow

The data required for training is published on GitHub, so I will use it. Two datasets are published there, ProtoType-1 and ProtoType-2; this time I will use ProtoType-1, whose dataset format is close to that of Cifar10. GitHub - workpiles/CUCUMBER-9

2L〜2S
Good product. Good color, relatively straight, and even in thickness. Sorted into 5 sizes, from 2L to 2S.

BL〜BS
B product. Poor color, slightly bent, or uneven in thickness. Sorted into 3 sizes, from L to S.

C
C product. Bad shape.
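The nine classes can be written down as a label list. Note that the intermediate size names (L, M, S, BL, BM, BS) and the ordering are my assumptions based on the grade ranges above; the label indices actually used by the dataset may differ, so check them against the dataset before relying on this.

```python
# Assumed label names and order; the dataset's own label indices may differ.
GRADES = [
    ("good", ["2L", "L", "M", "S", "2S"]),  # 5 sizes, from 2L to 2S
    ("B", ["BL", "BM", "BS"]),              # 3 sizes, from L to S
    ("C", ["C"]),                           # bad shape, no size split
]

# Flatten the groups into one list of class names.
CLASS_NAMES = [name for _, names in GRADES for name in names]

print(len(CLASS_NAMES))  # 9
```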

Looking at some blogs, the accuracy seems to be around 80% without any special tricks. Since I am not changing the network structure this time, I would be happy to reach about that level.

Implementation content

Training (run on a CPU / GPU instance)

Create the parameter data to load on the FPGA. As mentioned in the table above, follow the procedure published on GitHub. BinaryNets for Pynq - Training Networks

Note that this parameter file must be created on the CPU / GPU. This time, I set up a GPU instance (NC6 Ubuntu 16.04) on Azure.

Build GPU environment

Install Nvidia Drivers, CUDA, cuDNN.

Install Nvidia Drivers

$ wget http://developer.download.nvidia.com/compute/cuda/repos/ubuntu1604/x86_64/cuda-repo-ubuntu1604_8.0.61-1_amd64.deb
$ sudo dpkg -i cuda-repo-ubuntu1604_8.0.61-1_amd64.deb
$ sudo apt-get update

CUDA installation

$ sudo apt-get install cuda -y

cuDNN installation

In order to install, you need to register as a developer with Nvidia and download the file. NVIDIA cuDNN

$ sudo dpkg -i libcudnn5_5.1.10-1+cuda8.0_amd64.deb libcudnn5-dev_5.1.10-1+cuda8.0_amd64.deb

PATH setting

$ sudo sh -c "echo 'CUDA_HOME=/usr/local/cuda' >> /etc/profile.d/cuda.sh"
$ sudo sh -c "echo 'export LD_LIBRARY_PATH=\${LD_LIBRARY_PATH}:\${CUDA_HOME}/lib64' >> /etc/profile.d/cuda.sh"
$ sudo sh -c "echo 'export LIBRARY_PATH=\${LIBRARY_PATH}:\${CUDA_HOME}/lib64' >> /etc/profile.d/cuda.sh"
$ sudo sh -c "echo 'export C_INCLUDE_PATH=\${C_INCLUDE_PATH}:\${CUDA_HOME}/include' >> /etc/profile.d/cuda.sh"
$ sudo sh -c "echo 'export CXX_INCLUDE_PATH=\${CXX_INCLUDE_PATH}:\${CUDA_HOME}/include' >> /etc/profile.d/cuda.sh"
$ sudo sh -c "echo 'export PATH=\${PATH}:\${CUDA_HOME}/bin' >> /etc/profile.d/cuda.sh"

Reboot the instance

$ sudo reboot

Confirmation of installation

$ nvidia-smi
Thu Mar 30 07:42:52 2017       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 375.39                 Driver Version: 375.39                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80           Off  | 8CFC:00:00.0     Off |                    0 |
| N/A   38C    P0    75W / 149W |      0MiB / 11439MiB |     97%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

Python installation

Install the Python libraries (Theano, Lasagne, Numpy, Pylearn2). I install pyenv first so that I can use Python 2.7.

Install pyenv & python 2.7

$ sudo apt-get install git gcc make openssl libssl-dev libbz2-dev libreadline-dev libsqlite3-dev
$ git clone https://github.com/yyuu/pyenv.git ~/.pyenv

$ vi .bashrc
export PYENV_ROOT="$HOME/.pyenv"
export PATH="$PYENV_ROOT/bin:$PATH"
eval "$(pyenv init -)"
$ source .bashrc

$ env PYTHON_CONFIGURE_OPTS="--enable-shared" pyenv install 2.7.13
$ pyenv global 2.7.13

Install Python libraries (Theano, Lasagne, Numpy, Pylearn 2)

$ pip install --user git+https://github.com/Theano/[email protected]
$ pip install --user https://github.com/Lasagne/Lasagne/archive/master.zip

$ echo "[global]" >> ~/.theanorc
$ echo "floatX = float32" >> ~/.theanorc
$ echo "device = gpu" >> ~/.theanorc
$ echo "openmp = True" >> ~/.theanorc
$ echo "openmp_elemwise_minsize = 200000" >> ~/.theanorc
$ echo "" >> ~/.theanorc
$ echo "[nvcc]" >> ~/.theanorc
$ echo "fastmath = True" >> ~/.theanorc
$ echo "" >> ~/.theanorc
$ echo "[blas]" >> ~/.theanorc
$ echo "ldflags = -lopenblas" >> ~/.theanorc

$ git clone https://github.com/lisa-lab/pylearn2
$ cd pylearn2/
$ python setup.py develop --user

Data set preparation

Prepare the dataset to train on. This time, I will use the cucumber image data from GitHub.

$ git clone https://github.com/workpiles/CUCUMBER-9.git
$ cd CUCUMBER-9/prototype_1/
$ tar -zxvf cucumber-9-python.tar.gz

Program preparation

We will make small changes to the Xilinx training program so that it loads a different dataset. There are two main changes.

Change the data to load to CUCUMBER9
Change the number of classification classes to 9
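To illustrate where the number of classes matters, here is a minimal sketch of label encoding for 9 classes. It assumes, as is common in BinaryNet-style training scripts, that labels are turned into one-hot vectors scaled to -1/+1 for a hinge loss; the function name is hypothetical and the real cucumber9.py may organize this differently.

```python
NUM_CLASSES = 9  # was 10 for Cifar10


def encode_targets(labels, num_classes=NUM_CLASSES):
    """Turn integer labels into one-hot rows scaled to -1/+1."""
    targets = []
    for label in labels:
        if not 0 <= label < num_classes:
            raise ValueError("label out of range: {}".format(label))
        row = [-1.0] * num_classes  # every class starts as "not this one"
        row[label] = 1.0            # the true class is marked with +1
        targets.append(row)
    return targets


print(encode_targets([0, 8]))
```

If the class count is left at 10, a script like this would still run but the output layer would carry an unused class, so it is worth changing every place where the count appears.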

Get the program from BNN-PYNQ

$ git clone https://github.com/Xilinx/BNN-PYNQ.git
$ cd BNN-PYNQ/bnn/src/training/

Change the program to be executed when learning

Create cucumber9.py, which reads the cucumber image data and runs the training.

$ cp cifar10.py cucumber9.py
$ vi cucumber9.py

Binary data conversion program changes

BNN-PYNQ handles binarized data, so the real-valued parameter data must be converted to binary. Create cucumber9-gen-binary-weights.py, which converts the parameter data produced by training on the cucumber images to binary.

$ cp cifar10-gen-binary-weights.py cucumber9-gen-binary-weights.py
$ vi cucumber9-gen-binary-weights.py
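The idea of converting real-valued weights to binary can be sketched as follows. This is a simplified illustration only: it is not the conversion script itself, the function names are hypothetical, and the bit order and word size shown here are assumptions rather than the file layout BNN-PYNQ actually uses.

```python
def binarize(weights):
    """Map each real-valued weight to one bit: 1 if positive, else 0."""
    return [1 if w > 0 else 0 for w in weights]


def pack_bits(bits, word_size=8):
    """Pack bits into integers, first bit in the least significant position."""
    words = []
    for start in range(0, len(bits), word_size):
        word = 0
        for offset, bit in enumerate(bits[start:start + word_size]):
            word |= bit << offset
        words.append(word)
    return words


bits = binarize([0.7, -0.2, 0.1, -1.5])
print(bits)             # [1, 0, 1, 0]
print(pack_bits(bits))  # [5]
```

Each weight then occupies a single bit instead of a 32-bit float, which is what makes the parameters small enough to suit the FPGA.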

Running the training

Now that you have the environment, data, and program ready to learn, run the program.

$ pwd
/home/ubuntu/BNN-PYNQ/bnn/src/training
$ python cucumber9.py
WARNING (theano.sandbox.cuda): The cuda backend is deprecated and will be removed in the next release.  Please switch to the gpuarray backend. You can get more information about how to switch at this URL:
 https://github.com/Theano/Theano/wiki/Converting-to-the-new-gpu-back-end%28gpuarray%29

Using gpu device 0: Tesla K80 (CNMeM is disabled, cuDNN 5110)
/home/ubuntu/.local/lib/python2.7/site-packages/theano/tensor/basic.py:2144: UserWarning: theano.tensor.round() changed its default from `half_away_from_zero` to `half_to_even` to have the same default as NumPy. Use the Theano flag `warn.round=False` to disable this warning.
  "theano.tensor.round() changed its default from"
batch_size = 50
alpha = 0.1
epsilon = 0.0001
W_LR_scale = Glorot
num_epochs = 500
LR_start = 0.001
LR_fin = 3e-07
LR_decay = 0.983907435305
save_path = cucumber9_parameters.npz
train_set_size = 2475
shuffle_parts = 1
Loading CUCUMBER9 dataset...
Building the CNN...
W_LR_scale = 20.0499
H = 1
W_LR_scale = 27.7128
H = 1
W_LR_scale = 33.9411
H = 1
W_LR_scale = 39.1918
H = 1
W_LR_scale = 48.0
H = 1
W_LR_scale = 55.4256
H = 1
W_LR_scale = 22.6274
H = 1
W_LR_scale = 26.1279
H = 1
W_LR_scale = 18.6369
H = 1
Training...
Epoch 1 of 500 took 6.08435511589s
  LR:                            0.001
  training loss:                 1.48512187053
  validation loss:               2.05507221487
  validation error rate:         61.1111117734%
  best epoch:                    1
  best validation error rate:    61.1111117734%
  test loss:                     2.05507221487
  test error rate:               61.1111117734%

…

Epoch 500 of 500 took 5.53324913979s
  LR:                            3.04906731299e-07
  training loss:                 0.0024273797482
  validation loss:               0.132337698506
  validation error rate:         14.2222222355%
  best epoch:                    205
  best validation error rate:    11.9999999387%
  test loss:                     0.124302371922
  test error rate:               11.9999999387%

After a while, training finishes and the parameter file is created.
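The learning-rate values in the log above are consistent with a simple exponential decay from LR_start to LR_fin over num_epochs. The short check below reproduces LR_decay and the LR printed for epoch 500; the variable names are mine, and the formula is inferred from the logged numbers rather than quoted from the training script.

```python
LR_START = 0.001   # LR_start in the log
LR_FIN = 3e-07     # LR_fin in the log
NUM_EPOCHS = 500   # num_epochs in the log

# Per-epoch multiplier that takes LR_START to LR_FIN in NUM_EPOCHS steps.
lr_decay = (LR_FIN / LR_START) ** (1.0 / NUM_EPOCHS)

# Epoch 1 uses LR_START, so epoch 500 has been decayed 499 times.
lr_last_epoch = LR_START * lr_decay ** (NUM_EPOCHS - 1)

print(lr_decay)       # close to the logged 0.983907435305
print(lr_last_epoch)  # close to the logged 3.04906731299e-07
```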

$ ls
cucumber9_parameters.npz

Binarizing the parameter data

Convert the real-valued parameter data to binary.

$ python cucumber9-gen-binary-weights.py
cucumber9_parameters.npz

The binary parameter data is now complete. These are the files to load on PYNQ.

$ ls binparam-cnv-pynq/
0-0-thres.bin     0-3-weights.bin   1-12-thres.bin    1-20-weights.bin  1-2-thres.bin     1-9-weights.bin   2-3-thres.bin     3-11-weights.bin  3-6-thres.bin    6-0-weights.bin
0-0-weights.bin   0-4-thres.bin     1-12-weights.bin  1-21-thres.bin    1-2-weights.bin   2-0-thres.bin     2-3-weights.bin   3-12-thres.bin    3-6-weights.bin  7-0-thres.bin
0-10-thres.bin    0-4-weights.bin   1-13-thres.bin    1-21-weights.bin  1-30-thres.bin    2-0-weights.bin   2-4-thres.bin     3-12-weights.bin  3-7-thres.bin    7-0-weights.bin
0-10-weights.bin  0-5-thres.bin     1-13-weights.bin  1-22-thres.bin    1-30-weights.bin  2-10-thres.bin    2-4-weights.bin   3-13-thres.bin    3-7-weights.bin  8-0-thres.bin
0-11-thres.bin    0-5-weights.bin   1-14-thres.bin    1-22-weights.bin  1-31-thres.bin    2-10-weights.bin  2-5-thres.bin     3-13-weights.bin  3-8-thres.bin    8-0-weights.bin
0-11-weights.bin  0-6-thres.bin     1-14-weights.bin  1-23-thres.bin    1-31-weights.bin  2-11-thres.bin    2-5-weights.bin   3-14-thres.bin    3-8-weights.bin  8-1-thres.bin
0-12-thres.bin    0-6-weights.bin   1-15-thres.bin    1-23-weights.bin  1-3-thres.bin     2-11-weights.bin  2-6-thres.bin     3-14-weights.bin  3-9-thres.bin    8-1-weights.bin
0-12-weights.bin  0-7-thres.bin     1-15-weights.bin  1-24-thres.bin    1-3-weights.bin   2-12-thres.bin    2-6-weights.bin   3-15-thres.bin    3-9-weights.bin  8-2-thres.bin
0-13-thres.bin    0-7-weights.bin   1-16-thres.bin    1-24-weights.bin  1-4-thres.bin     2-12-weights.bin  2-7-thres.bin     3-15-weights.bin  4-0-thres.bin    8-2-weights.bin
0-13-weights.bin  0-8-thres.bin     1-16-weights.bin  1-25-thres.bin    1-4-weights.bin   2-13-thres.bin    2-7-weights.bin   3-1-thres.bin     4-0-weights.bin  8-3-thres.bin
0-14-thres.bin    0-8-weights.bin   1-17-thres.bin    1-25-weights.bin  1-5-thres.bin     2-13-weights.bin  2-8-thres.bin     3-1-weights.bin   4-1-thres.bin    8-3-weights.bin
0-14-weights.bin  0-9-thres.bin     1-17-weights.bin  1-26-thres.bin    1-5-weights.bin   2-14-thres.bin    2-8-weights.bin   3-2-thres.bin     4-1-weights.bin  classes.txt
0-15-thres.bin    0-9-weights.bin   1-18-thres.bin    1-26-weights.bin  1-6-thres.bin     2-14-weights.bin  2-9-thres.bin     3-2-weights.bin   4-2-thres.bin
0-15-weights.bin  1-0-thres.bin     1-18-weights.bin  1-27-thres.bin    1-6-weights.bin   2-15-thres.bin    2-9-weights.bin   3-3-thres.bin     4-2-weights.bin
0-1-thres.bin     1-0-weights.bin   1-19-thres.bin    1-27-weights.bin  1-7-thres.bin     2-15-weights.bin  3-0-thres.bin     3-3-weights.bin   4-3-thres.bin
0-1-weights.bin   1-10-thres.bin    1-19-weights.bin  1-28-thres.bin    1-7-weights.bin   2-1-thres.bin     3-0-weights.bin   3-4-thres.bin     4-3-weights.bin
0-2-thres.bin     1-10-weights.bin  1-1-thres.bin     1-28-weights.bin  1-8-thres.bin     2-1-weights.bin   3-10-thres.bin    3-4-weights.bin   5-0-thres.bin
0-2-weights.bin   1-11-thres.bin    1-1-weights.bin   1-29-thres.bin    1-8-weights.bin   2-2-thres.bin     3-10-weights.bin  3-5-thres.bin     5-0-weights.bin
0-3-thres.bin     1-11-weights.bin  1-20-thres.bin    1-29-weights.bin  1-9-thres.bin     2-2-weights.bin   3-11-thres.bin    3-5-weights.bin   6-0-thres.bin

Inference (run on PYNQ)

Arrangement of parameter data

Transfer the parameter data created earlier to PYNQ.

$ sudo mkdir /opt/python3.6/lib/python3.6/site-packages/bnn/params/cucumber9
$ sudo ls /opt/python3.6/lib/python3.6/site-packages/bnn/params/cucumber9/
0-0-thres.bin     0-3-weights.bin   1-12-thres.bin    1-20-weights.bin  1-2-thres.bin     1-9-weights.bin   2-3-thres.bin     3-11-weights.bin  3-6-thres.bin    6-0-weights.bin
0-0-weights.bin   0-4-thres.bin     1-12-weights.bin  1-21-thres.bin    1-2-weights.bin   2-0-thres.bin     2-3-weights.bin   3-12-thres.bin    3-6-weights.bin  7-0-thres.bin
0-10-thres.bin    0-4-weights.bin   1-13-thres.bin    1-21-weights.bin  1-30-thres.bin    2-0-weights.bin   2-4-thres.bin     3-12-weights.bin  3-7-thres.bin    7-0-weights.bin
0-10-weights.bin  0-5-thres.bin     1-13-weights.bin  1-22-thres.bin    1-30-weights.bin  2-10-thres.bin    2-4-weights.bin   3-13-thres.bin    3-7-weights.bin  8-0-thres.bin
0-11-thres.bin    0-5-weights.bin   1-14-thres.bin    1-22-weights.bin  1-31-thres.bin    2-10-weights.bin  2-5-thres.bin     3-13-weights.bin  3-8-thres.bin    8-0-weights.bin
0-11-weights.bin  0-6-thres.bin     1-14-weights.bin  1-23-thres.bin    1-31-weights.bin  2-11-thres.bin    2-5-weights.bin   3-14-thres.bin    3-8-weights.bin  8-1-thres.bin
0-12-thres.bin    0-6-weights.bin   1-15-thres.bin    1-23-weights.bin  1-3-thres.bin     2-11-weights.bin  2-6-thres.bin     3-14-weights.bin  3-9-thres.bin    8-1-weights.bin
0-12-weights.bin  0-7-thres.bin     1-15-weights.bin  1-24-thres.bin    1-3-weights.bin   2-12-thres.bin    2-6-weights.bin   3-15-thres.bin    3-9-weights.bin  8-2-thres.bin
0-13-thres.bin    0-7-weights.bin   1-16-thres.bin    1-24-weights.bin  1-4-thres.bin     2-12-weights.bin  2-7-thres.bin     3-15-weights.bin  4-0-thres.bin    8-2-weights.bin
0-13-weights.bin  0-8-thres.bin     1-16-weights.bin  1-25-thres.bin    1-4-weights.bin   2-13-thres.bin    2-7-weights.bin   3-1-thres.bin     4-0-weights.bin  8-3-thres.bin
0-14-thres.bin    0-8-weights.bin   1-17-thres.bin    1-25-weights.bin  1-5-thres.bin     2-13-weights.bin  2-8-thres.bin     3-1-weights.bin   4-1-thres.bin    8-3-weights.bin
0-14-weights.bin  0-9-thres.bin     1-17-weights.bin  1-26-thres.bin    1-5-weights.bin   2-14-thres.bin    2-8-weights.bin   3-2-thres.bin     4-1-weights.bin  classes.txt
0-15-thres.bin    0-9-weights.bin   1-18-thres.bin    1-26-weights.bin  1-6-thres.bin     2-14-weights.bin  2-9-thres.bin     3-2-weights.bin   4-2-thres.bin
0-15-weights.bin  1-0-thres.bin     1-18-weights.bin  1-27-thres.bin    1-6-weights.bin   2-15-thres.bin    2-9-weights.bin   3-3-thres.bin     4-2-weights.bin
0-1-thres.bin     1-0-weights.bin   1-19-thres.bin    1-27-weights.bin  1-7-thres.bin     2-15-weights.bin  3-0-thres.bin     3-3-weights.bin   4-3-thres.bin
0-1-weights.bin   1-10-thres.bin    1-19-weights.bin  1-28-thres.bin    1-7-weights.bin   2-1-thres.bin     3-0-weights.bin   3-4-thres.bin     4-3-weights.bin
0-2-thres.bin     1-10-weights.bin  1-1-thres.bin     1-28-weights.bin  1-8-thres.bin     2-1-weights.bin   3-10-thres.bin    3-4-weights.bin   5-0-thres.bin
0-2-weights.bin   1-11-thres.bin    1-1-weights.bin   1-29-thres.bin    1-8-weights.bin   2-2-thres.bin     3-10-weights.bin  3-5-thres.bin     5-0-weights.bin
0-3-thres.bin     1-11-weights.bin  1-20-thres.bin    1-29-weights.bin  1-9-thres.bin     2-2-weights.bin   3-11-thres.bin    3-5-weights.bin   6-0-thres.bin
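After copying this many files, a quick check that nothing was lost in transfer can help. The sketch below assumes, based on the listing above, that every X-Y-weights.bin file has a matching X-Y-thres.bin file; the function name is hypothetical, and the demo runs on a throwaway directory rather than the real params directory.

```python
import os
import tempfile


def missing_pairs(param_dir):
    """Return the prefixes that lack either a weights or a thres file."""
    names = set(os.listdir(param_dir))
    prefixes = set()
    for name in names:
        for suffix in ("-weights.bin", "-thres.bin"):
            if name.endswith(suffix):
                prefixes.add(name[:-len(suffix)])
    return sorted(
        p for p in prefixes
        if p + "-weights.bin" not in names or p + "-thres.bin" not in names
    )


# Demo on a throwaway directory, so nothing on the board is touched.
demo_dir = tempfile.mkdtemp()
for name in ("0-0-weights.bin", "0-0-thres.bin", "0-1-weights.bin", "classes.txt"):
    open(os.path.join(demo_dir, name), "w").close()

print(missing_pairs(demo_dir))  # ['0-1']
```

On the board, the same function could be pointed at the cucumber9 params directory; an empty list would mean every pair is present.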

Placement of test data

Download the test data used for inference to PYNQ.

$ git clone https://github.com/workpiles/CUCUMBER-9.git
$ cd CUCUMBER-9/prototype_1/
$ tar -zxvf cucumber-9-python.tar.gz
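Since the dataset format is described as close to Cifar10, the extracted files can probably be read like CIFAR-10 Python batches. The sketch below is written for Python 3 and assumes a pickled dictionary with "data" and "labels" keys; those key names, the 3072-value image length, and the function name are assumptions, so inspect the extracted files before relying on them. The demo uses a fake one-image batch instead of the real data.

```python
import os
import pickle
import tempfile


def load_batch(path):
    """Load one CIFAR-10-style batch and return (images, labels)."""
    with open(path, "rb") as f:
        # encoding="latin1" lets Python 3 read pickles written by Python 2.
        batch = pickle.load(f, encoding="latin1")
    return batch["data"], batch["labels"]


# Build a fake one-image batch so the example runs without the dataset.
demo_path = os.path.join(tempfile.mkdtemp(), "fake_batch")
with open(demo_path, "wb") as f:
    pickle.dump({"data": [[0] * 3072], "labels": [3]}, f, protocol=2)

images, labels = load_batch(demo_path)
print(len(images), len(images[0]), labels)  # 1 3072 [3]
```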

Executing inference

Let's run it from Jupyter, as in the previous demo. To run CUCUMBER9, specify cucumber9 as the parameter set to load, as shown below.

classifier = bnn.CnvClassifier('cucumber9')

The execution result is as shown in the capture below.

screencapture-192-168-0-15-9090-nbconvert-html-bnn-Cucumber9-ipynb-1492315054198-min.png

It classifies correctly! The execution times are as follows. The PYNQ's CPU is admittedly weak, but the FPGA is still about 360 times faster.

`FPGA`


Inference took 2240.00 microseconds
Classification rate: 446.43 images per second

`CPU`


Inference took 816809.00 microseconds
Classification rate: 1.22 images per second
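The classification rates and the speedup follow directly from the two inference times above. A short check, using only the numbers from the output (the variable and function names are mine):

```python
FPGA_US = 2240.00    # FPGA inference time in microseconds
CPU_US = 816809.00   # CPU inference time in microseconds


def images_per_second(microseconds):
    """Convert a per-image time in microseconds to images per second."""
    return 1e6 / microseconds


speedup = CPU_US / FPGA_US

print(round(images_per_second(FPGA_US), 2))  # 446.43
print(round(images_per_second(CPU_US), 2))   # 1.22
print(round(speedup, 1))                     # 364.6
```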

Reference

When writing the program, I referred to the following blog.

Conclusion

This time, I powered the PYNQ from a mobile battery. I was surprised at how power-efficient it is.

Try Deep Learning with FPGA-Select Cucumbers