- I installed CUDA-enabled OpenCV (Python) but could not find documentation or sample code for it in either Japanese or English, so I wrote this with reference to the C++ version.
- I wanted to try shouting "CUDA, hurray!"
- I wanted to write a Qiita article.


- Installed CUDA-enabled OpenCV on Ubuntu
- Ran resize and template matching with OpenCV from Python
- Conclusion
  - CUDA, hurray
  - Depending on the operation and the image size, the CPU can be faster (possibly even more so in my current environment)
  - Template matching is overwhelmingly faster with CUDA
  - Transfers take longer than expected




First, import the libraries and load maharo.jpg (200 x 200). This is an image of [Mahalo](https://wikiwiki.jp/kawausotter/%E3%83%9E%E3%83%8F%E3%83%AD%EF%BC%88%E3%82%B5%E3%83%B3%E3%82%B7%E3%83%A3%E3%82%A4%E3%83%B3%E6%B0%B4%E6%97%8F%E9%A4%A8%EF%BC%89), the otter at Sunshine Aquarium.

import cv2 as cv
import numpy as np
from matplotlib import pyplot as plt

src = cv.imread("maharo.jpg", cv.IMREAD_GRAYSCALE)
h, w = src.shape[::-1]  # shape[::-1] is (width, height); the image is square, so w = h = 200

plt.subplot(111),plt.imshow(src,cmap = 'gray')
plt.title('otter image'), plt.xticks([]), plt.yticks([])


Next, set up the GPU-side variables. You can check whether the GPU is usable by confirming that cv2.cuda.getCudaEnabledDeviceCount() returns 1 or more. In most cases, CUDA-enabled OpenCV is used by rewriting cv2.function as cv2.cuda.function. (There are exceptions, such as the template matching used later; see dir(cv2.cuda) for details.) GPU memory for a variable is allocated with cv2.cuda_GpuMat(), and data moves between host and device memory with .upload() and .download().

g_src = cv.cuda_GpuMat()
g_dst = cv.cuda_GpuMat()
# Copy the host image into GPU memory
g_src.upload(src)
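The availability check described above can be wrapped in a small helper. This is a minimal sketch and not part of the original notebook: `cuda_device_count` is a hypothetical name, and it assumes that a missing OpenCV install, or a build without the `cuda` module, should simply count as zero devices.

```python
def cuda_device_count():
    """Return how many CUDA devices OpenCV reports, or 0 if none are usable."""
    try:
        import cv2 as cv
    except ImportError:
        return 0  # assumption: treat "OpenCV not installed" as "no GPU"
    cuda = getattr(cv, "cuda", None)
    if cuda is None or not hasattr(cuda, "getCudaEnabledDeviceCount"):
        return 0
    try:
        # Clamp at 0 so that any negative error value also counts as "none".
        return max(0, int(cuda.getCudaEnabledDeviceCount()))
    except Exception:
        return 0


if cuda_device_count() >= 1:
    print("CUDA path available")
else:
    print("falling back to the CPU path")
```

Guarding the GPU code this way lets the same notebook run on a machine without a CUDA build, at the cost of silently taking the slower path.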

The image of Mahalo is now on the GPU.

Resize

To get a feel for the raw performance of the GPU, start with a resize on its own (512 -> 2K), with no host/device transfers inside the timed loop. Note that the CUDA module does not support Lanczos interpolation. (There are gaps like this throughout.)

CPU

cpu_dst = cv.resize(src, (h*10, w*10), interpolation=cv.INTER_CUBIC)
# 777 µs ± 8.06 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)


CUDA

g_dst = cv.cuda.resize(g_src, (h*4, w*4), interpolation=cv.INTER_CUBIC)
# 611 µs ± 6.34 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

__Raw GPU performance is underwhelming__ The GPU brings little benefit here. (As written, the CPU call scales by 10 and the CUDA call by 4, so the two timings are not directly comparable.) Once the download back to the CPU is included, the CUDA version is slower still.

g_dst = cv.cuda.resize(g_src, (h*4, w*4), interpolation=cv.INTER_CUBIC)
gpu_dst = g_dst.download()
# 1.51 ms ± 16.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
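The timing comments in this article are in the format printed by Jupyter's `%timeit`. Outside a notebook, a similar measurement could be sketched with the standard `timeit` module; `bench_us` is a hypothetical helper name, and the default loop counts are arbitrary choices, not the ones `%timeit` picked for the numbers above.

```python
import timeit


def bench_us(func, number=100, repeat=7):
    """Return the mean time of one func() call in microseconds."""
    # timeit.repeat returns one total time (in seconds) per run of `number` calls.
    totals = timeit.repeat(func, number=number, repeat=repeat)
    per_call = [total / number for total in totals]
    return sum(per_call) / len(per_call) * 1e6


# Stand-in workload; replace the lambda with the call you want to time.
print(f"{bench_us(lambda: sum(range(1000))):.1f} us per loop")
```

For example, passing `lambda: cv.cuda.resize(g_src, (h*4, w*4), interpolation=cv.INTER_CUBIC)` and then the same call followed by `.download()` would separate compute time from transfer time, assuming the variables from this article are in scope.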

After all the trouble of getting CUDA installed, this is disappointing. Then again, a resize may simply not parallelize that well, so perhaps this is a hard case. The download appears to cost roughly 0.9 ms (1.51 ms versus 611 µs). What does this depend on? My feeling is that PCIe 3.0 is the bottleneck (no evidence).
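As a rough sanity check on the PCIe guess, the time needed to move the resized image at peak link speed can be estimated by hand. This is a back-of-the-envelope sketch only: it assumes a PCIe 3.0 x16 link at its theoretical peak of about 15.75 GB/s (not measured on this machine) and the 800 x 800 8-bit image that the code above produces; `transfer_time_us` is a hypothetical helper.

```python
def transfer_time_us(width, height, bytes_per_pixel=1, bandwidth_gb_s=15.75):
    """Ideal transfer time in microseconds, ignoring latency and overhead."""
    n_bytes = width * height * bytes_per_pixel
    return n_bytes / (bandwidth_gb_s * 1e9) * 1e6


# 800 x 800 grayscale image = 640,000 bytes
t = transfer_time_us(800, 800)
print(f"{t:.1f} us")  # prints "40.6 us"
```

Under these assumptions the bandwidth-limited time is far below the measured gap, which hints that fixed per-call costs (allocation, synchronization, latency) may matter more than raw bandwidth. That is a hypothesis, not something verified here.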

Template Matching

Let's pull ourselves together and try template matching, which looks like it should benefit from parallelism.

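# Upscale the source to 2000 x 2000 and cut a 200 x 200 template out of it
ex_src = cv.resize(src, (h*10, w*10), interpolation=cv.INTER_CUBIC)
tmpl = ex_src[1000:1200, 1000:1200]
th, tw = tmpl.shape[::-1]  # shape[::-1] is (width, height); the template is square, so tw = th = 200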


CPU

result = cv.matchTemplate(ex_src, tmpl, cv.TM_CCOEFF_NORMED)
# 138 ms ± 5.03 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

It is not as slow as I expected... (Addendum: probably because the CPU version runs on 32 threads.)

result = cv.matchTemplate(ex_src, tmpl, cv.TM_CCOEFF_NORMED)
min_val, max_val, min_loc, max_loc = cv.minMaxLoc(result)
top_left = max_loc
bottom_right = (top_left[0] + th, top_left[1] + tw)
cv.rectangle(ex_src,top_left, bottom_right, 255, 2)
plt.subplot(121),plt.imshow(result,cmap = 'gray')
plt.title('Matching Result'), plt.xticks([]), plt.yticks([])
plt.subplot(122),plt.imshow(ex_src,cmap = 'gray')
plt.title('Detected Point'), plt.xticks([]), plt.yticks([])

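(Figure: image.png, the plot produced by the code above)

CUDA

The CUDA version of template matching is a little unusual (at least in Python): you first create a matcher object with createTemplateMatching(precision, METHOD) and then call its match method.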

gsrc = cv.cuda_GpuMat()
gtmpl = cv.cuda_GpuMat()
gresult = cv.cuda_GpuMat()
# Copy the search image and the template to the GPU before matching
gsrc.upload(ex_src)
gtmpl.upload(tmpl)
matcher = cv.cuda.createTemplateMatching(cv.CV_8UC1, cv.TM_CCOEFF_NORMED)
gresult = matcher.match(gsrc, gtmpl)
# 10.5 ms ± 406 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)

__Raw GPU performance is great__ Against 138 ms on the CPU, CUDA takes 10.5 ms, which is about 13 times faster.

matcher = cv.cuda.createTemplateMatching(cv.CV_8UC1, cv.TM_CCOEFF_NORMED)
gresult = matcher.match(gsrc, gtmpl)
resultg = gresult.download()
# 16.6 ms ± 197 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Even when the download is included, it is still about 8 times faster. This is on a GTX 1080, so a recent Ampere-generation GPU might be even faster...? I only need max_locg; is there a way to obtain it while the result is still in gresult, without downloading the whole map?
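One possible answer to that question, offered as an untested sketch: OpenCV's CUDA arithmetic module documents a `cv.cuda.minMaxLoc` that accepts a `GpuMat`. The helper below assumes that your build exposes it to Python and that it returns the same four values as `cv.minMaxLoc`. Both points are assumptions to verify with `dir(cv.cuda)`. `max_location` is a hypothetical name, and the function falls back to downloading when the CUDA call is missing.

```python
try:
    import cv2 as cv
except ImportError:
    cv = None  # lets the module load on machines without OpenCV


def max_location(result):
    """Return (max_val, (x, y)) for a score map held in a GpuMat or NumPy array."""
    if cv is None:
        raise RuntimeError("OpenCV is not installed")
    if hasattr(result, "download"):  # duck-typing: looks like a GpuMat
        cuda_min_max_loc = getattr(getattr(cv, "cuda", None), "minMaxLoc", None)
        if cuda_min_max_loc is not None:
            # Assumed signature: (minVal, maxVal, minLoc, maxLoc)
            _, max_val, _, max_loc = cuda_min_max_loc(result)
            return max_val, max_loc
        result = result.download()  # fallback: copy the whole map to the host
    _, max_val, _, max_loc = cv.minMaxLoc(result)
    return max_val, max_loc
```

With the variables from this article, `max_location(gresult)` would stand in for the download followed by `cv.minMaxLoc(resultg)`. Whether it is actually faster would need to be timed.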

gresult = matcher.match(gsrc, gtmpl)
resultg = gresult.download()
min_valg, max_valg, min_locg, max_locg = cv.minMaxLoc(resultg)
top_leftg = max_locg
bottom_rightg = (top_leftg[0] + tw, top_leftg[1] + th)
cv.rectangle(ex_src, top_leftg, bottom_rightg, 255, 2)  # draw on ex_src, the image that is displayed below
plt.subplot(121),plt.imshow(resultg,cmap = 'gray')
plt.title('Matching Result'), plt.xticks([]), plt.yticks([])
plt.subplot(122),plt.imshow(ex_src,cmap = 'gray')
plt.title('Detected Point'), plt.xticks([]), plt.yticks([])


print(max_loc, max_locg)
# (1000, 1000) (1000, 1000)

The same result is obtained.

The code used this time (Jupyter Notebook) is published on GitHub.

- Test code
- Build OpenCV with CUDA enabled on Ubuntu

