When building a machine learning dataset from images collected from the Internet, duplicate images need to be removed. Duplicates within the training data are still tolerable, but duplicates between the training data and the test data cause so-called leakage.
The simplest way to detect duplicate images is to use a file hash such as MD5. However, a file hash is just a hash of the image file's binary string: if the same image is saved in a different format or with different compression parameters, the duplicate slips through undetected.
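For instance, here is a minimal sketch of why file hashes miss such duplicates (the file names `lenna.png` and `lenna.jpg` are hypothetical and are assumed to contain the same picture saved in different formats):

```python
import hashlib

def md5_of_file(path: str) -> str:
    """Return the MD5 digest of a file's raw bytes."""
    with open(path, 'rb') as f:
        return hashlib.md5(f.read()).hexdigest()

# The same picture saved as PNG and as JPEG has different bytes,
# so the file-level hashes differ even though the visual content is identical.
print(md5_of_file('lenna.png'))
print(md5_of_file('lenna.jpg'))
```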
Therefore, in this article we introduce algorithms that hash the features of the image itself, and we look at the characteristics of these hashing algorithms through simple experiments.
Average Hash (aHash) is a hash value based on a feature of the image (its brightness pattern) and can be computed with a simple algorithm. The specific procedure is as follows: convert the image to grayscale, shrink it to 8x8 pixels, compute the mean pixel value, and set each bit to 1 or 0 depending on whether the corresponding pixel is brighter or darker than the mean, which yields a 64-bit hash.
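As a reference, here is a minimal NumPy/OpenCV sketch of that procedure (an illustration only, not OpenCV's own implementation; the function names are made up for this example):

```python
import cv2
import numpy as np

def average_hash(img: np.ndarray, hash_size: int = 8) -> np.ndarray:
    """Compute a 64-bit aHash as an 8x8 array of 0/1 bits."""
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)       # 1. convert to grayscale
    small = cv2.resize(gray, (hash_size, hash_size))   # 2. shrink to 8x8 pixels
    return (small > small.mean()).astype(np.uint8)     # 3-4. 1 if brighter than the mean, else 0

def hamming_distance(hash1: np.ndarray, hash2: np.ndarray) -> int:
    """Number of differing bits; [0, 64] for 8x8 hashes."""
    return int(np.count_nonzero(hash1 != hash2))
```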
aHash has the advantages of a simple algorithm and fast computation. On the other hand, it has the drawback of not being very robust: for example, the hash value of a gamma-corrected image can end up far from that of the original image.
Perceptual Hash (pHash): while aHash uses the pixel values themselves, pHash uses the Discrete Cosine Transform (DCT) of the image. The DCT is one of the methods for converting a signal such as an image into the frequency domain, and it is used in JPEG compression, for example. JPEG compression reduces the amount of data by applying the DCT to the image and keeping only the low-frequency components, which are the ones humans perceive most easily.
Similar to JPEG compression, pHash focuses on the low-frequency components of the image's DCT and hashes them. By doing so, it preferentially extracts features that humans perceive easily, so the resulting hash is expected to be robust to translation of the image and to changes in brightness.
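A rough sketch of this idea is shown below (the 32x32 resize and the 8x8 low-frequency block follow common pHash descriptions; OpenCV's PHash implementation may differ in details such as using the mean instead of the median):

```python
import cv2
import numpy as np

def perceptual_hash(img: np.ndarray, hash_size: int = 8, dct_size: int = 32) -> np.ndarray:
    """Compute a 64-bit pHash from the low-frequency DCT coefficients."""
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    small = cv2.resize(gray, (dct_size, dct_size)).astype(np.float32)
    dct = cv2.dct(small)                           # 2D DCT of the reduced image
    low_freq = dct[:hash_size, :hash_size]         # keep only the 8x8 low-frequency block
    # Encode each coefficient as 1/0 depending on whether it exceeds the median
    return (low_freq > np.median(low_freq)).astype(np.uint8)
```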
There seem to be various other variants besides aHash and pHash, and some people have benchmarked them [^1].
We apply various kinds of processing to an image and compare the hash value of the result with that of the original image. aHash and pHash are computed as the hash values.
We also try the technique of treating the output of the layer immediately before the final layer of ResNet50 as a hash. This method was adopted in a paper [^2][^3].
OpenCV is used to compute aHash and pHash, although there is also a library called ImageHash. For aHash and pHash, hash values can be compared using the Hamming distance, whose range is [0, 64]. To match this range, the hash (mock) comparison using ResNet50 computes the cosine similarity and then converts it to the range above.
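As a small usage sketch (assuming `opencv-contrib-python` is installed, since `cv2.img_hash` lives in the contrib modules), the comparison boils down to the following; the blurred image is just an arbitrary example of a processed version:

```python
import cv2

img1 = cv2.imread('resources/lena_std.tif', cv2.IMREAD_COLOR)
img2 = cv2.blur(img1, (5, 5))  # an arbitrary processed version for illustration

hasher = cv2.img_hash.PHash_create()
hash1 = hasher.compute(img1)
hash2 = hasher.compute(img2)
print(hasher.compare(hash1, hash2))  # Hamming distance in [0, 64]

# For the ResNet50 "hash", a cosine similarity cs in [0, 1] is mapped to the same range:
# distance = 64 * (1 - cs)
```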
import copy
import pprint
import cv2
import numpy as np
from keras import models
from keras.applications.resnet50 import ResNet50, preprocess_input
from sklearn.metrics.pairwise import cosine_similarity
class ImagePairGenerator(object):
"""
A class that generates image pairs for experiments
"""
def __init__(self, img: np.ndarray):
self._img = img
self._processings = self._prepare_processings()
def _prepare_processings(self):
h, w, _ = self._img.shape
# Position and size for cropping around the face of the Lenna image
org = np.array([128, 128])
size = np.array([256, 256])
# kind (processing description), img1, img2
processings = [
('Same',
lambda x: x,
lambda x: x),
('Grayscale',
lambda x: x,
lambda x: cv2.cvtColor(
cv2.cvtColor(x, cv2.COLOR_BGR2GRAY), cv2.COLOR_GRAY2BGR)),
*list(map(lambda s:
(f'Reduced to 1/{s:2}',
lambda x: x,
lambda x: cv2.resize(x, (w // s, h // s))),
np.power(2, range(1, 5)))),
*list(map(lambda s:
(f'Smoothing(kernel size = {s:2})',
lambda x: x,
lambda x: cv2.blur(x, (s, s))),
[3, 5, 7, 9, 11])),
*list(map(lambda s:
(f'Insert text(fontScale = {s})',
lambda x: x,
lambda x: cv2.putText(x, 'Text', org=(10, 30*s),
fontFace=cv2.FONT_HERSHEY_SIMPLEX,
fontScale=s,
color=(255, 255, 255),
thickness=3*s,
lineType=cv2.LINE_AA)),
range(1, 8))),
*list(map(lambda q:
(f'JPEG compression(quality = {q})',
lambda x: x,
lambda x: img_encode_decode(x, q)),
range(10, 100, 10))),
*list(map(lambda gamma:
(f'Gamma correction(gamma = {gamma})',
lambda x: x,
lambda x: img_gamma(x, gamma)),
[0.2, 0.5, 0.8, 1.2, 1.5, 2.0])),
*list(map(lambda d:
(f'Translation({d:2} pixels)',
lambda x: img_crop(x, org, size),
lambda x: img_crop(x, org + d, size)),
np.power(2, range(7)))),
]
return processings
def __iter__(self):
for kind, p1, p2 in self._processings:
yield (kind,
p1(copy.deepcopy(self._img)),
p2(copy.deepcopy(self._img)))
class ResNet50Hasher(object):
"""
Class that uses the output of the layer just before ResNet50's final layer as a hash value
"""
_input_size = 224
def __init__(self):
self._model = self._prepare_model()
def _prepare_model(self):
resnet50 = ResNet50(include_top=False, weights='imagenet',
input_shape=(self._input_size, self._input_size, 3),
pooling='avg')
model = models.Sequential()
model.add(resnet50)
return model
def compute(self, img: np.ndarray) -> np.ndarray:
img_arr = np.array([
cv2.resize(img, (self._input_size, self._input_size))
])
img_arr = preprocess_input(img_arr)
embeddings = self._model.predict(img_arr)
return embeddings
@staticmethod
def compare(x1: np.ndarray, x2: np.ndarray):
"""
Compute the cosine similarity (range [0, 1]) and convert it to
the range [0, 64] so that it can be compared with the Hamming
distance used for aHash and pHash.
"""
cs = cosine_similarity(x1, x2)
distance = 64 * (1 - cs)  # similarity 1 -> distance 0, similarity 0 -> distance 64
return distance.ravel()[0] # np.array -> float
def img_crop(img: np.ndarray, org: np.ndarray, size: np.ndarray) -> np.ndarray:
"""
Crop any area from the image.
"""
y, x = org
h, w = size
return img[y:y + h, x:x + w, :]
def img_encode_decode(img: np.ndarray, quality=90) -> np.ndarray:
"""
Reproduce the degradation caused by JPEG compression.
Reference: https://qiita.com/ka10ryu1/items/5fed6b4c8f29163d0d65
"""
encode_param = [int(cv2.IMWRITE_JPEG_QUALITY), quality]
_, enc_img = cv2.imencode('.jpg', img, encode_param)
dec_img = cv2.imdecode(enc_img, cv2.IMREAD_COLOR)
return dec_img
def img_gamma(img: np.ndarray, gamma=0.5) -> np.ndarray:
"""
Gamma correction.
Reference: https://www.dogrow.net/python/blog99/
"""
lut = np.empty((1, 256), np.uint8)
for i in range(256):
lut[0, i] = np.clip(pow(i / 255.0, gamma) * 255.0, 0, 255)
return cv2.LUT(img, lut)
def image_hashing_test():
image_path = 'resources/lena_std.tif'
img = cv2.imread(image_path, cv2.IMREAD_COLOR)
h, w, _ = img.shape
hashers = [
('aHash', cv2.img_hash.AverageHash_create()),
('pHash', cv2.img_hash.PHash_create()),
('ResNet', ResNet50Hasher())
]
pairs = ImagePairGenerator(img)
result_dict = {}
for pair_kind, img1, img2 in pairs:
result_dict[pair_kind] = {}
for hasher_kind, hasher in hashers:
hash1 = hasher.compute(img1)
hash2 = hasher.compute(img2)
distance = hasher.compare(hash1, hash2)
result_dict[pair_kind][hasher_kind] = distance
# Visually check the images (only when their shapes match, since aligning them would be a hassle)
if img1.shape == img2.shape:
window_name = pair_kind
cv2.imshow(window_name, cv2.hconcat((img1, img2)))
cv2.waitKey()
cv2.destroyWindow(window_name)
pprint.pprint(result_dict)
if __name__ == '__main__':
image_hashing_test()
{'Same': {'ResNet': 0.0, 'aHash': 0.0, 'pHash': 0.0},
'Grayscale': {'ResNet': 14.379967, 'aHash': 0.0, 'pHash': 0.0},
'Reduced to 1/ 2': {'ResNet': 1.2773285, 'aHash': 3.0, 'pHash': 1.0},
'Reduced to 1/ 4': {'ResNet': 6.5748253, 'aHash': 4.0, 'pHash': 1.0},
'Reduced to 1/ 8': {'ResNet': 18.959282, 'aHash': 7.0, 'pHash': 3.0},
'Reduced to 1/16': {'ResNet': 34.8299, 'aHash': 12.0, 'pHash': 0.0},
'JPEG compression(quality = 10)': {'ResNet': 6.4169083, 'aHash': 2.0, 'pHash': 0.0},
'JPEG compression(quality = 20)': {'ResNet': 2.6065674, 'aHash': 1.0, 'pHash': 0.0},
'JPEG compression(quality = 30)': {'ResNet': 1.8446579, 'aHash': 0.0, 'pHash': 0.0},
'JPEG compression(quality = 40)': {'ResNet': 1.2492218, 'aHash': 0.0, 'pHash': 1.0},
'JPEG compression(quality = 50)': {'ResNet': 1.0534592, 'aHash': 0.0, 'pHash': 0.0},
'JPEG compression(quality = 60)': {'ResNet': 0.99293137, 'aHash': 0.0, 'pHash': 0.0},
'JPEG compression(quality = 70)': {'ResNet': 0.7313309, 'aHash': 0.0, 'pHash': 0.0},
'JPEG compression(quality = 80)': {'ResNet': 0.58068085, 'aHash': 0.0, 'pHash': 0.0},
'JPEG compression(quality = 90)': {'ResNet': 0.354187, 'aHash': 0.0, 'pHash': 0.0},
'Gamma correction(gamma = 0.2)': {'ResNet': 16.319721, 'aHash': 2.0, 'pHash': 1.0},
'Gamma correction(gamma = 0.5)': {'ResNet': 4.2003975, 'aHash': 2.0, 'pHash': 0.0},
'Gamma correction(gamma = 0.8)': {'ResNet': 0.48334503, 'aHash': 0.0, 'pHash': 0.0},
'Gamma correction(gamma = 1.2)': {'ResNet': 0.381176, 'aHash': 0.0, 'pHash': 1.0},
'Gamma correction(gamma = 1.5)': {'ResNet': 1.7187691, 'aHash': 2.0, 'pHash': 1.0},
'Gamma correction(gamma = 2.0)': {'ResNet': 4.074257, 'aHash': 6.0, 'pHash': 2.0},
'Insert text(fontScale = 1)': {'ResNet': 0.7838249, 'aHash': 0.0, 'pHash': 0.0},
'Insert text(fontScale = 2)': {'ResNet': 1.0911484, 'aHash': 0.0, 'pHash': 1.0},
'Insert text(fontScale = 3)': {'ResNet': 2.7721176, 'aHash': 0.0, 'pHash': 2.0},
'Insert text(fontScale = 4)': {'ResNet': 4.646305, 'aHash': 0.0, 'pHash': 4.0},
'Insert text(fontScale = 5)': {'ResNet': 8.435852, 'aHash': 2.0, 'pHash': 3.0},
'Insert text(fontScale = 6)': {'ResNet': 11.267036, 'aHash': 6.0, 'pHash': 3.0},
'Insert text(fontScale = 7)': {'ResNet': 15.272251, 'aHash': 2.0, 'pHash': 7.0},
'Smoothing(kernel size =  3)': {'ResNet': 1.3798943, 'aHash': 2.0, 'pHash': 0.0},
'Smoothing(kernel size =  5)': {'ResNet': 3.1528091, 'aHash': 4.0, 'pHash': 1.0},
'Smoothing(kernel size =  7)': {'ResNet': 4.903698, 'aHash': 4.0, 'pHash': 1.0},
'Smoothing(kernel size =  9)': {'ResNet': 6.8400574, 'aHash': 4.0, 'pHash': 1.0},
'Smoothing(kernel size = 11)': {'ResNet': 9.477722, 'aHash': 5.0, 'pHash': 2.0},
'Translation( 1 pixels)': {'ResNet': 0.47764206, 'aHash': 6.0, 'pHash': 0.0},
'Translation( 2 pixels)': {'ResNet': 0.98942566, 'aHash': 10.0, 'pHash': 3.0},
'Translation( 4 pixels)': {'ResNet': 1.475399, 'aHash': 15.0, 'pHash': 5.0},
'Translation( 8 pixels)': {'ResNet': 2.587471, 'aHash': 20.0, 'pHash': 13.0},
'Translation(16 pixels)': {'ResNet': 3.1883087, 'aHash': 25.0, 'pHash': 21.0},
'Translation(32 pixels)': {'ResNet': 4.8445663, 'aHash': 23.0, 'pHash': 31.0},
'Translation(64 pixels)': {'ResNet': 9.34531, 'aHash': 28.0, 'pHash': 30.0}}
Please note that ResNet is used with weights pre-trained on ImageNet as-is. In other words, by preparing your own training data, you could obtain a network with characteristics different from those described below.
Also, the tendencies may change depending on the image. For practical use, it is better to evaluate on a dataset of some size.
- Same: Obviously, the distance is 0 for every hashing method.
- Grayscale: For aHash and pHash the distance to the original image is 0, because the image is converted to grayscale at the start of the hashing process anyway.
- Reduction: For ResNet and aHash the distance grows with the reduction factor, while pHash is relatively robust.
- JPEG compression: Interestingly, the ResNet distance appears roughly proportional to how strongly the image is compressed. aHash and pHash, on the other hand, are relatively robust; a simple hashing scheme (encoding whether each value is above or below the mean) may be inherently robust to compression artifacts.
- Gamma correction: Compared to ResNet, aHash and pHash appear relatively robust.
- Text insertion: aHash appears relatively robust. Possibly the mean pixel value shifts toward the color of the inserted text, so the encoding result barely changes. For pHash and ResNet, the distance grows with the size of the text.
- Smoothing: pHash is relatively robust, presumably because it looks only at the low-frequency components.
- Translation: All hashes drift further away as the amount of translation increases. ResNet is perhaps relatively robust.
We introduced methods for hashing an image, measured the distance between processed images and the original, and observed the characteristics of each hashing method. Among the algorithms compared in this article, pHash tends to be robust to image scaling, compression degradation, and smoothing.
I only recently learned about the xHash family of algorithms, and I think they are simple, well-conceived algorithms. They are fairly mature, tried-and-true techniques, and I hope to use them where they fit.
- Try image comparison using perceptual hash (pHash): http://hideack.hatenablog.com/entry/2015/03/16/194336
[^3]: To tell the truth, while reading this paper I thought, "Isn't there an easier way to detect duplicates?", which was the impetus for writing this article.