- I installed CUDA-enabled OpenCV (Python), but I couldn't find documentation or sample code for it (in Japanese or English), so I wrote this with reference to the C++ version.
- I wanted to try the "hooray for CUDA" experience.
- I wanted to write a Qiita article.
- Installed CUDA-enabled OpenCV on Ubuntu
- Resized and template-matched with OpenCV in Python
- Conclusion:
  - Hooray for CUDA
  - Depending on the operation and the image size, the CPU can be faster (perhaps even more so in the current environment)
  - Template matching is overwhelmingly fast with CUDA
  - Host-device transfer takes unexpectedly long
First, load the various modules and read in the 200x200 maharo.jpg. This is an image of [Mahalo](https://wikiwiki.jp/kawausotter/%E3%83%9E%E3%83%8F%E3%83%AD%EF%BC%88%E3%82%B5%E3%83%B3%E3%82%B7%E3%83%A3%E3%82%A4%E3%83%B3%E6%B0%B4%E6%97%8F%E9%A4%A8%EF%BC%89), an otter at the Sunshine Aquarium.
```python
import cv2 as cv
import numpy as np
from matplotlib import pyplot as plt

src = cv.imread("maharo.jpg", cv.IMREAD_GRAYSCALE)  # note: no stray space in the filename
h, w = src.shape  # h=200, w=200

plt.subplot(111), plt.imshow(src, cmap='gray')
plt.title('otter image'), plt.xticks([]), plt.yticks([])
plt.show()
```
Next, set up the GPU-side variables.

You can check whether the GPU is usable by confirming that `cv2.cuda.getCudaEnabledDeviceCount()` returns a number greater than or equal to 1.

Basically, CUDA-enabled OpenCV is used by rewriting `cv2.<function>` as `cv2.cuda.<function>`. (There are some exceptions, such as the template matching below; see `dir(cv2.cuda)` for details.)

In addition, GPU memory is allocated with `cv2.cuda_GpuMat()`, and data is exchanged between host and device memory with `.upload()` / `.download()`.
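As a quick way to follow that tip, you can list what your build actually exposes under the `cuda` namespace (a small sketch; the exact names depend on how your OpenCV was built):

```python
# List the public functions/classes available under cv2.cuda in this build
print([name for name in dir(cv.cuda) if not name.startswith('_')])
```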
```python
print(cv.cuda.getCudaEnabledDeviceCount())  # >= 1 means a usable CUDA device was found

g_src = cv.cuda_GpuMat()  # GPU-side buffers
g_dst = cv.cuda_GpuMat()
g_src.upload(src)  # host -> device
```
For now, the Mahalo image has been moved onto the GPU.

Resize

Let's sample the GPU's raw flavor with just a resize (512 -> 2K) that keeps all host-device reads and writes out of the timed loop. By the way, the CUDA version does not support Lanczos interpolation. (There are scattered gaps like this throughout the CUDA API.)
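You can confirm that limitation on your own build with a minimal check (my sketch; `cv.cuda.resize` is expected to reject the unsupported flag with a `cv.error`):

```python
# Expected to fail: the CUDA resize path does not implement Lanczos interpolation
try:
    cv.cuda.resize(g_src, (w * 4, h * 4), interpolation=cv.INTER_LANCZOS4)
except cv.error as e:
    print("CUDA resize rejected INTER_LANCZOS4:", e)
```

CPU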
```python
%%timeit
cpu_dst = cv.resize(src, (w*10, h*10), interpolation=cv.INTER_CUBIC)
# 777 µs ± 8.06 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
```
CUDA
```python
%%timeit
g_dst = cv.cuda.resize(g_src, (w*4, h*4), interpolation=cv.INTER_CUBIC)
# 611 µs ± 6.34 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
```
__The raw flavor is slow__

This result shows no GPU advantage. Worse still, when the host-device reads and writes are included, execution is even slower:
```python
%%timeit
g_src.upload(src)  # host -> device
g_dst = cv.cuda.resize(g_src, (w*4, h*4), interpolation=cv.INTER_CUBIC)
gpu_dst = g_dst.download()  # device -> host
# 1.51 ms ± 16.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
```
After all the trouble of installing CUDA, this is disappointing. Then again, maybe a resize just doesn't expose enough parallel work to pay off. The transfers alone seem to take about 800 to 900 µs (the 1.51 ms total minus the 611 µs compute). What does that depend on? My hunch is that PCIe 3.0 bandwidth is the bottleneck (no evidence).
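To pin down the transfer cost, one can time just the copies (a rough sketch, my addition; it reuses `src`, `g_src`, `w`, and `h` from above):

```python
import time

# Reproduce only the transfers from the timed cell above: upload the 200x200
# source, download the 800x800 result, and skip the resize computation itself.
g_big = cv.cuda.resize(g_src, (w * 4, h * 4), interpolation=cv.INTER_CUBIC)
n = 1000
t0 = time.perf_counter()
for _ in range(n):
    g_src.upload(src)     # host -> device copy
    _ = g_big.download()  # device -> host copy
t1 = time.perf_counter()
print(f"transfers only: {(t1 - t0) / n * 1e6:.0f} µs per iteration")
```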
Template Matching

Let's regroup and try template matching, which looks like a much better fit for parallel execution.
```python
ex_src = cv.resize(src, (w*10, h*10), interpolation=cv.INTER_CUBIC)  # 2000x2000 search image
tmpl = ex_src[1000:1200, 1000:1200]  # 200x200 template cut from the search image
tw, th = tmpl.shape[::-1]  # template width and height
```
CPU
```python
%%timeit
result = cv.matchTemplate(ex_src, tmpl, cv.TM_CCOEFF_NORMED)
# 138 ms ± 5.03 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
```
Not as slow as I expected... (Added later: probably because the operation runs across 32 threads.)
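That hypothesis is easy to check, since OpenCV reports the thread count used by its parallel backend (a one-line check, my addition):

```python
print(cv.getNumThreads())  # number of CPU threads OpenCV uses for parallel code
```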
```python
result = cv.matchTemplate(ex_src, tmpl, cv.TM_CCOEFF_NORMED)
min_val, max_val, min_loc, max_loc = cv.minMaxLoc(result)

top_left = max_loc
bottom_right = (top_left[0] + tw, top_left[1] + th)  # (x + width, y + height)
cv.rectangle(ex_src, top_left, bottom_right, 255, 2)

plt.subplot(121), plt.imshow(result, cmap='gray')
plt.title('Matching Result'), plt.xticks([]), plt.yticks([])
plt.subplot(122), plt.imshow(ex_src, cmap='gray')
plt.title('Detected Point'), plt.xticks([]), plt.yticks([])
plt.show()
```
CUDA
The CUDA version of template matching is a bit unusual (at least from Python): rather than a direct function call, you create a matcher with `createTemplateMatching(precision, method)`.
```python
gsrc = cv.cuda_GpuMat()
gtmpl = cv.cuda_GpuMat()
gresult = cv.cuda_GpuMat()

gsrc.upload(ex_src)  # 2000x2000 search image -> device
gtmpl.upload(tmpl)   # 200x200 template -> device

matcher = cv.cuda.createTemplateMatching(cv.CV_8UC1, cv.TM_CCOEFF_NORMED)
```
```python
%%timeit
gresult = matcher.match(gsrc, gtmpl)
# 10.5 ms ± 406 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
```
__The raw flavor is delicious__

Compared to 138 ms on the CPU, CUDA takes 10.5 ms, roughly 13 times faster.
```python
%%timeit
gsrc.upload(ex_src)
gtmpl.upload(tmpl)
matcher = cv.cuda.createTemplateMatching(cv.CV_8UC1, cv.TM_CCOEFF_NORMED)  # matcher creation is included in this measurement
gresult = matcher.match(gsrc, gtmpl)
resultg = gresult.download()
# 16.6 ms ± 197 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
```
Even counting the uploads and the download, it is still about 8 times faster. This is on a GTX 1080, so the much-discussed Ampere-generation GPUs should be faster still...? Only max_locg is actually needed here; is there a way to extract it while the result is still on the GPU, at the gresult stage?
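One candidate, assuming your build exposes the cudaarithm bindings, is `cv2.cuda.minMaxLoc`, which scans the `GpuMat` on the device so that only a handful of scalars cross the bus (a sketch; I have not verified it against every OpenCV version):

```python
# Assumption: this build wraps cv::cuda::minMaxLoc from the cudaarithm module.
# The extrema are found on the GPU; only the values and locations come back.
min_valg, max_valg, min_locg, max_locg = cv.cuda.minMaxLoc(gresult)
```

The download-everything route below works regardless: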
```python
gresult = matcher.match(gsrc, gtmpl)
resultg = gresult.download()  # bring the score map back to the host
min_valg, max_valg, min_locg, max_locg = cv.minMaxLoc(resultg)

top_leftg = max_locg
bottom_rightg = (top_leftg[0] + tw, top_leftg[1] + th)
cv.rectangle(ex_src, top_leftg, bottom_rightg, 255, 2)  # draw on the big image, not src

plt.subplot(121), plt.imshow(resultg, cmap='gray')
plt.title('Matching Result'), plt.xticks([]), plt.yticks([])
plt.subplot(122), plt.imshow(ex_src, cmap='gray')
plt.title('Detected Point'), plt.xticks([]), plt.yticks([])
plt.show()
```
```python
print(max_loc, max_locg)
# (1000, 1000) (1000, 1000)
```
The same result is obtained.
The code used in this article (a Jupyter Notebook) is published on GitHub.
References:

- Test code
- Build OpenCV with CUDA enabled on Ubuntu