You will be able to understand and implement the Scale-space that supports SIFT (Scale Invariant Feature Transform), which describes the features of images that are resistant to scale changes and rotation.
Software name | version |
---|---|
Python | 3.4 or 3.5 |
Pillow | 3.1.0 |
Numpy | 1.10 |
Scipy | 0.16.1 |
Matplotlib | 1.5.0 |
This article was written as a material for the 41st meeting of Morning Project Samurai.
This is a draft version. If you find any mistakes, we would appreciate it if you could contact us.
Scale-space
Consider the detection of a particular object from any given image. For example, consider detecting a person from a scene (image) of a video of an urban road. At this time, there is no need for a scale image that shows the details of the pattern of the clothes that a person is wearing. It is better to have an image on a scale that allows you to understand the shape of a person but not what the clothes you are wearing. Information that is too detailed is rather noise.
As mentioned above, there is an image scale suitable for detecting the object of interest. However, in reality, a given image is not always at the proper scale. Therefore, a Scale-space is constructed by generating a plurality of images having different scales from a given image, and an image of an appropriate scale is found from the scale-spaces to detect an object.
In Scale-space, scale-up is the removal of detailed information from the original image using a technique called Gaussian convolution. The degree of scale-up is determined by the Gaussian convolution parameter $ \ sigma $. The larger the scale $ \ sigma $, the more information is removed from the original image.
For example, in the figure below, the original image on the far left can clearly see the details of the face and even the fur hair of the hat. In the image on the far right, with a scale of $ \ sigma = 3.2 $, you can see that there is a vague woman and the eyes are clear, but nothing more.
The detailed construction method and its program will be described after learning the Gaussian convolution, Fourier transform, sampling theorem, and so on.
Gaussian convolution
The one-dimensional Gaussian convolution is expressed by the following equation.
F(x, \sigma)=\int_{-\infty}^{\infty} f(u) \frac{1}{\sqrt{2 \pi \sigma^{2}}}e^{-\frac{(u-x)^2}{2\sigma^{2}}} du
$ f $ is the original signal. $ \ frac {1} {\ sqrt {2 \ pi \ sigma ^ {2}}} e ^ {-\ frac {(ux) ^ 2} {2 \ sigma ^ {2}}} $ is called Gaussian kernel .. $ \ sigma $ is a parameter that determines the shape of the Gaussian kernel. The Gaussian kernel is shaped like a bell with a maximum value at the center $ x $, and its height and spread are determined by $ \ sigma $.
Gaussian kernel Let's try drawing a Gaussian kernel using numpy, scipy, and matplotlib.
import numpy as np
from scipy.signal import gaussian
from matplotlib import pyplot as plt
if __name__ == '__main__':
npoints = 101
sigmas = np.array([6.0, 12.0])
for i, sigma in enumerate(sigmas):
y = gaussian(npoints, sigma) / (np.sqrt(2.0 * np.pi) * sigma)
plt.subplot(len(sigmas), 1, i + 1)
plt.title('sigma = %s' % sigma)
plt.ylim(ymax=0.08)
plt.plot(np.arange(npoints/2 - npoints, npoints/2, dtype=np.int), y)
plt.tight_layout()
plt.show()
If you run the above program, you will get the result below.
[Postscript of the nature of Gaussian kernel]
The Gaussian convolution formula is "the original signal $ f $ multiplied by the weight of the center $ x $ spread $ \ sigma $ (Gaussian kernel) and the sum (weighted average) of $ F (x, \ sigma) $. It can be interpreted as "represented by". From this, $ F $ has the information of the entire $ f $ while strongly reflecting the information in the vicinity of $ u = x $ of the original signal $ f (u) $ with the strength determined by $ \ sigma $. It can be interpreted as a "new signal consisting of $ F (x, \ sigma)
As we saw in the previous section, the Gaussian kernel becomes flatter as $ \ sigma $ increases. This means that "the larger $ \ sigma $, the lighter the color of the information in the vicinity of $ u = x $ of the original signal $ f $ contained in $ F (x, \ sigma) $". As sigma $ gets bigger and bigger, $ F $ will eventually get the same value for all $ x $. In the context of Scale-space, "the larger the scale $ \ sigma $, the more macroscopic the signal is viewed (detailed information is lost), and eventually all the information contained in the signal is identified." Can be interpreted as.
Let's visualize the story so far with a program.
import numpy as np
from scipy.ndimage.filters import gaussian_filter1d
from matplotlib import pyplot as plt
if __name__ == '__main__':
x = np.arange(0, 100)
f = np.sin(0.5 * x) + np.random.normal(0, 6.0, x.shape)
sigmas = [1.6, 3.2, 6.4]
plt.subplot(len(sigmas) + 1, 1, 1)
plt.ylim(ymin=np.min(f), ymax=np.max(f))
plt.title('Original')
plt.plot(f)
for i, sigma in enumerate(sigmas):
plt.subplot(len(sigmas) + 1, 1, 2 + i)
plt.title('Sigma = %s' % sigma)
plt.plot(gaussian_filter1d(f, sigma))
plt.tight_layout()
plt.show()
The following figure is obtained by executing the above program.
As $ \ sigma $ grows (scales up), you can see that the detailed information contained in the signal $ f $ is being lost from $ F $.
The image is a two-dimensional signal. Therefore, in order to scale up an image using Gaussian convolution, Gaussian convolution for a two-dimensional signal is required. Hereafter, when we simply write Gaussian convolution, we will refer to the two-dimensional Gaussian convolution.
The Gaussian convolution for a two-dimensional signal is expressed by the following equation.
F(x, y, \sigma) = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} f(u, v) \frac{1}{2\sigma^{2}} e^{-\frac{(u - x)^2 + (v-y)^{2}}{\sigma^{2}}} dudv
$ \ frac {1} {2 \ sigma ^ {2}} e ^ {-\ frac {(u-x) ^ 2 + (vy) ^ {2}} {\ sigma ^ {2}}} $ is Gaussian It is a kernel. This Gaussian kernel is a three-dimensional bell type, with $ (x, y) $ taking the maximum value and $ \ sigma $ representing its spread.
Gaussian kernel Let's draw the Gaussian kernel programmatically.
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
def gaussian_kernel_2d(x, y, sigma):
return np.exp(-(np.power(x/sigma, 2) + np.power(y/sigma, 2)))/(2 * np.power(sigma, 2))
if __name__ == '__main__':
xrange = np.arange(-10.0, 10.5, 0.5)
yrange = np.arange(-10.0, 10.5, 0.5)
kernel_values = np.zeros(shape=(len(yrange), len(xrange)))
sigmas = [1.6, 2.4]
fig = plt.figure()
X, Y = np.meshgrid(xrange, yrange)
for i, sigma in enumerate(sigmas):
ax = fig.add_subplot(len(sigmas), 1, 1 + i, projection='3d')
plt.title('Sigma = %s' % sigma)
ax.set_zlim(zmax=0.2)
ax.plot_wireframe(X, Y, gaussian_kernel_2d(X, Y, sigma))
plt.show()
The following figure can be obtained by executing the above program. It is possible to change the viewpoint with the mouse.
Gaussian filter Computers can only handle discrete values, and their range is finite. The Gaussian kernel is a continuous function with a domain of $ \ pm \ infinty $. Therefore, it is necessary to approximate the Gaussian kernel with a finite number of numerical strings and discretize it so that it can be handled by a computer.
Consider a grid of $ s $ rows and $ t $ columns. Hereafter, this grid is called the filter of $ s \ times t $. This is represented in the program as a two-dimensional array as shown below.
filter = np.zeros(shape=(s, t))
If you can set all the elements of the variable filter to represent the Gaussian kernel when $ u = 0 $, you can create a Gaussian filter $ g $ which is a discretized Gaussian kernel with $ u = 0 $. Here, the $ k $ row $ l $ column element of $ g $ $ g (k, l) $ = $ \ frac {1} {\ alpha} e ^ {-\ frac {l ^ {2} + k Set as ^ {2}} {\ sigma ^ {2}}} $. Here, $ \ alpha $ is a constant that normalizes so that when all the elements of $ g $ are added, the sum is $ 1 $.
The following equation is a Gaussian convolution using a Gaussian filter. This is called a discretized Gaussian convolution. Hereafter, when we simply refer to Gaussian convolution, we will refer to this discretized Gaussian convolution.
F(i, j) = \sum_{k=0}^{s}\sum_{l=0}^{t} f(i+k-[\frac{s}{2}], j+l-[\frac{t}{2}]) g(k, l)
If you create a program obediently according to the above formula, it will be as follows.
for k in range(s):
for l in range(t):
F[i, j] = f[i + k - int(s/2), j + l - int(t/2)] * g(k, l)
In this case, $ s \ times t $ operations are required to obtain one pixel of the output image $ F $, and $ m \ times n \ times s \ times t $ to obtain all the pixels of the output image. Two operations are required. A convolution with a $ 320 \ times 240 $ image and a $ 5 \ times 5 $ Gaussian filter requires a total of $ 1,920,000 $ operations. Whether or not this number of calculations is critical depends on the machine specs and applications, but it is critical on my MacBook Air (13-inch, Early 2015).
Gaussian convolution has the property that the output image $ F $ of Gaussian convolution using the Gaussian filter of $ s \ times t $ is represented by the image $ F_ {1} $ output in the next two steps. There is.
This is proved as follows.
\begin{align}
F(i, j) &=\sum_{k=0}^{s}\sum_{l=0}^{t} f(i+k-[\frac{s}{2}], j+l-[\frac{t}{2}]) g(k, l)\\
&= \sum_{k=0}^{s}\sum_{l=0}^{t} f(i+k-[\frac{s}{2}], j+l-[\frac{t}{2}]) \frac{1}{\alpha}e^{-\frac{l^{2} + k^{2}}{\sigma^{2}}}\\
&= \sum_{k=0}^{s}\sum_{l=0}^{t} f(i+k-[\frac{s}{2}], j+l-[\frac{t}{2}]) \frac{1}{\alpha}e^{-\frac{l^{2}}{\sigma^{2}}}e^{-\frac{k^{2}}{\sigma^{2}}}\\
&=\frac{\alpha_{0}\alpha_{1}}{\alpha} \sum_{k=0}^{s}\frac{1}{\alpha_{1}}e^{\frac{-k^{2}}{\sigma^{2}}}\sum_{l=0}^{t} f(i+k-[\frac{s}{2}], j+l-[\frac{t}{2}]) \frac{1}{\alpha_{0}}e^{-\frac{l^{2}}{\sigma^{2}}}\\
&=\frac{\alpha_{0}\alpha_{1}}{\alpha} \sum_{k=0}^{s}\frac{1}{\alpha_{1}}e^{\frac{-k^{2}}{\sigma^{2}}}\sum_{l=0}^{t} f(i+k-[\frac{s}{2}], j+l-[\frac{t}{2}]) g_{0}(l)\\
&=\frac{\alpha_{0}\alpha_{1}}{\alpha} \sum_{k=0}^{s}\frac{1}{\alpha_{1}}e^{-\frac{k^{2}}{\sigma^{2}}} F_{0}(i, j)\\
&=\frac{\alpha_{0}\alpha_{1}}{\alpha} \sum_{k=0}^{s} F_{0}(i, j) g_{1}(k)\\
&=\frac{\alpha_{0}\alpha_{1}}{\alpha} F_{1}(i, j)
\end{align}
By doing so, the number of operations is $ s \ times m + t \ times n $. For a $ 320 \ times 240 $ image and a $ 5 \ times 5 $ Gaussian filter convolution, the number of operations is $ 768,000 $. By speeding up, the number of operations can be reduced by $ 60 % $.
Let's actually apply Gaussian convolution to the image. This time, Pillow, numpy, scipy are used. The program below generates an image with scale $ \ sigma = 1.6 $ and an image with scale $ \ sigma = 3.2 $.
from PIL import Image
import numpy as np
from scipy.ndimage.filters import gaussian_filter
from matplotlib import pyplot as plt
if __name__ == '__main__':
orig_image = Image.open('lena.jpg').convert('L')
orig_image = np.array(orig_image, dtype=np.uint8)
sigmas = [1.6, 3.2]
plt.subplot(1, len(sigmas) + 1, 1)
plt.title('Orig')
plt.imshow(orig_image, cmap='Greys_r')
for i, sigma in enumerate(sigmas):
plt.subplot(1, len(sigmas) + 1, 2 + i)
plt.title('Sigma=%s' % sigma)
plt.imshow(gaussian_filter(orig_image, sigma), cmap='Greys_r')
plt.tight_layout()
plt.savefig('gausian_convolution_2d_examples.png')
plt.show()
The following figure is obtained by executing the above program.
OpenCV is a well-known library for working with images. If you have it installed, you can use it. You can also try implementing the Gaussian convolution on your own.
Up to this point, you can generate an image of any scale $ \ sigma $ with respect to the original image.
Finally, we will build Scale-space. The 1st octave in the chapter title will be described later.
from PIL import Image
import numpy as np
from scipy.ndimage.filters import gaussian_filter
from matplotlib import pyplot as plt
if __name__ == '__main__':
orig_image = Image.open('lena.jpg').convert('L')
orig_image = np.array(orig_image, dtype=np.uint8)
sigma = 1.6
s = 3
k = np.power(2, 1/s)
scale_space = []
for n in range(s + 1):
scale_space.append(gaussian_filter(orig_image, np.power(k, n) * sigma))
for n, img in enumerate(scale_space):
plt.subplot(1, len(scale_space), 1 + n)
plt.title('Sigma=%s' % np.round(np.power(k, n) * sigma, 2))
plt.imshow(img, cmap='Greys_r')
plt.tight_layout()
plt.show()
When the above program is executed, four images with scales from $ 1.6 $ to $ 3.2 $ are stored in the variable scale_space, and the figure below is displayed.
The space consisting of the images from $ k \ sigma $ to $ 2k \ sigma $ is called 1st octave.
In creating the 2nd octave and later, it is necessary to mention the Fourier transform and the sampling theorem. However, this time, I will not go deep and just let it flow smoothly.
The Fourier transform transforms an image from the pixel domain to the frequency domain.
F(j\omega) = \int_{-\infty}^{\infty} f(x) e^{-j\omega x} dx
Let's perform Fourier transform by FFT (Fast Fourier Transform) using numpy, scipy, and matplotlib. In the program below, the sampling rate is 20 and the angular frequency of the original signal is $ 2 \ pi $.
import numpy as np
from scipy.fftpack import fft
from matplotlib import pyplot as plt
if __name__ == '__main__':
sampling_rate = 20
sampling_interval = 1.0/sampling_rate
x = np.arange(0, 1, sampling_interval)
omega0 = 1.0
omega = 2.0 * np.pi * omega0
f = np.sin(omega * x)
F = fft(f)
plt.plot(np.arange(-len(f)/2, len(f)/2), np.abs(np.concatenate((F[len(f)/2:], F[:len(f)/2]))), 'bo-')
plt.show()
When the above program is executed, the following figure is obtained.
From this figure, it can be seen that the original signal $ f $ contains a signal with a frequency of $ 1 $. It is a good idea to Fourier transform the superposition of several periodic functions.
[I will write in a little more detail later]
The sampling theorem states, "When the maximum period of the signal contained in the original signal $ f $ is $ W $, if the sampling period $ T $ of the original signal is $ T \ leq \ frac {1} {W} $ , The original signal $ f $ is completely recoverable from the sampled signal $ f_ {d} $. "
[Certificate will be added at a later date]
Let the sampling period be $ 1/20 $. Then, according to the sampling theorem, if the signal has a maximum frequency of $ 10 $, all the frequency information should be successfully extracted by Fourier transform.
Let's check with the program below.
import numpy as np
from scipy.fftpack import fft
from matplotlib import pyplot as plt
if __name__ == '__main__':
sampling_rate = 20
sampling_interval = 1.0/sampling_rate
x = np.arange(0, 1, sampling_interval)
omega0s = [9, 10, 11, 30]
for i, omega0 in enumerate(omega0s):
omega = 2.0 * np.pi * omega0
f = np.sin(omega * x)
F = fft(f)
plt.subplot(len(omega0s), 1, 1 + i)
plt.plot(np.arange(-len(f)/2, len(f)/2), np.abs(np.concatenate((F[len(f)/2:], F[:len(f)/2]))), 'bo-')
plt.title('Max frequency = %s' % omega0)
plt.xlabel('Frequency')
plt.ylabel('Amplitude spectrum')
plt.tight_layout()
plt.show()
When executed, the following figure is obtained. From this result, it can be seen that the frequency information can be successfully extracted with the signal having the maximum frequency of $ 9 $ (details will be added later).
[Postscript]
[Postscript]
[Postscript]
[Postscript]
Recommended Posts