When building a machine learning dataset from images collected from the Internet, duplicate images need to be removed. Duplicates within the training data are still tolerable, but duplicates between the training data and the test data cause so-called leakage.
The simplest way to detect duplicate images is to use a file hash such as MD5. However, a file hash is just a hash of the image file's binary string: if the same image is saved in a different format or with different compression parameters, the duplicate slips through undetected.
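For instance, here is a minimal sketch of why file hashes miss such duplicates (the file names `lenna.png` and `lenna.jpg` are hypothetical and are assumed to contain the same picture saved in different formats):

```python
import hashlib

def md5_of_file(path: str) -> str:
    """Return the MD5 digest of a file's raw bytes."""
    with open(path, 'rb') as f:
        return hashlib.md5(f.read()).hexdigest()

# The same picture saved as PNG and as JPEG has different bytes,
# so the file-level hashes differ even though the visual content is identical.
print(md5_of_file('lenna.png'))
print(md5_of_file('lenna.jpg'))
```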
Therefore, in this article we introduce algorithms that hash the features of the image itself, and we look at the characteristics of these hashing algorithms through simple experiments.
Average Hash (aHash) is a hash value based on a feature of the image (its brightness pattern) and can be computed with a simple algorithm. The specific procedure is as follows: convert the image to grayscale, shrink it to 8x8 pixels, compute the mean pixel value, and set each bit to 1 or 0 depending on whether the corresponding pixel is brighter or darker than the mean, which yields a 64-bit hash.
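As a reference, here is a minimal NumPy/OpenCV sketch of that procedure (an illustration only, not OpenCV's own implementation; the function names are made up for this example):

```python
import cv2
import numpy as np

def average_hash(img: np.ndarray, hash_size: int = 8) -> np.ndarray:
    """Compute a 64-bit aHash as an 8x8 array of 0/1 bits."""
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)       # 1. convert to grayscale
    small = cv2.resize(gray, (hash_size, hash_size))   # 2. shrink to 8x8 pixels
    return (small > small.mean()).astype(np.uint8)     # 3-4. 1 if brighter than the mean, else 0

def hamming_distance(hash1: np.ndarray, hash2: np.ndarray) -> int:
    """Number of differing bits; [0, 64] for 8x8 hashes."""
    return int(np.count_nonzero(hash1 != hash2))
```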
aHash has the advantages of a simple algorithm and fast computation. On the other hand, it has the drawback of not being very robust: for example, the hash value of a gamma-corrected image can end up far from that of the original image.
Perceptual Hash (pHash): while aHash uses the pixel values themselves, pHash uses the Discrete Cosine Transform (DCT) of the image. The DCT is one of the methods for converting a signal such as an image into the frequency domain, and it is used in JPEG compression, for example. JPEG compression reduces the amount of data by applying the DCT to the image and keeping only the low-frequency components, which are the ones humans perceive most easily.
Similar to JPEG compression, pHash focuses on the low-frequency components of the image's DCT and hashes them. By doing so, it preferentially extracts features that humans perceive easily, so the resulting hash is expected to be robust to translation of the image and to changes in brightness.
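A rough sketch of this idea is shown below (the 32x32 resize and the 8x8 low-frequency block follow common pHash descriptions; OpenCV's PHash implementation may differ in details such as using the mean instead of the median):

```python
import cv2
import numpy as np

def perceptual_hash(img: np.ndarray, hash_size: int = 8, dct_size: int = 32) -> np.ndarray:
    """Compute a 64-bit pHash from the low-frequency DCT coefficients."""
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    small = cv2.resize(gray, (dct_size, dct_size)).astype(np.float32)
    dct = cv2.dct(small)                           # 2D DCT of the reduced image
    low_freq = dct[:hash_size, :hash_size]         # keep only the 8x8 low-frequency block
    # Encode each coefficient as 1/0 depending on whether it exceeds the median
    return (low_freq > np.median(low_freq)).astype(np.uint8)
```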
There seem to be various other variants besides aHash and pHash, and some people have benchmarked them [^1].
We apply various kinds of processing to an image and compare the hash value of the result with that of the original image. aHash and pHash are computed as the hash values.
We also try the technique of treating the output of the layer immediately before the final layer of ResNet50 as a hash. This method was adopted in a paper [^2][^3].
OpenCV is used to compute aHash and pHash, although there is also a library called ImageHash. For aHash and pHash, hash values can be compared using the Hamming distance, whose range is [0, 64]. To match this range, the hash (mock) comparison using ResNet50 computes the cosine similarity and then converts it to the range above.
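As a small usage sketch (assuming `opencv-contrib-python` is installed, since `cv2.img_hash` lives in the contrib modules), the comparison boils down to the following; the blurred image is just an arbitrary example of a processed version:

```python
import cv2

img1 = cv2.imread('resources/lena_std.tif', cv2.IMREAD_COLOR)
img2 = cv2.blur(img1, (5, 5))  # an arbitrary processed version for illustration

hasher = cv2.img_hash.PHash_create()
hash1 = hasher.compute(img1)
hash2 = hasher.compute(img2)
print(hasher.compare(hash1, hash2))  # Hamming distance in [0, 64]

# For the ResNet50 "hash", a cosine similarity cs in [0, 1] is mapped to the same range:
# distance = 64 * (1 - cs)
```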
import copy
import pprint
import cv2
import numpy as np
from keras import models
from keras.applications.resnet50 import ResNet50, preprocess_input
from sklearn.metrics.pairwise import cosine_similarity
class ImagePairGenerator(object):
"""
A class that generates image pairs for experiments
"""
def __init__(self, img: np.ndarray):
self._img = img
self._processings = self._prepare_processings()
def _prepare_processings(self):
h, w, _ = self._img.shape
# Position and size for cropping around the face of the Lenna image
org = np.array([128, 128])
size = np.array([256, 256])
# kind (processing description), img1, img2
processings = [
('Same',
lambda x: x,
lambda x: x),
('Grayscale',
lambda x: x,
lambda x: cv2.cvtColor(
cv2.cvtColor(x, cv2.COLOR_BGR2GRAY), cv2.COLOR_GRAY2BGR)),
*list(map(lambda s:
(f'Reduced to 1/{s:2}',
lambda x: x,
lambda x: cv2.resize(x, (w // s, h // s))),
np.power(2, range(1, 5)))),
*list(map(lambda s:
(f'Smoothing(kernel size = {s:2})',
lambda x: x,
lambda x: cv2.blur(x, (s, s))),
[3, 5, 7, 9, 11])),
*list(map(lambda s:
(f'Insert text(fontScale = {s})',
lambda x: x,
lambda x: cv2.putText(x, 'Text', org=(10, 30*s),
fontFace=cv2.FONT_HERSHEY_SIMPLEX,
fontScale=s,
color=(255, 255, 255),
thickness=3*s,
lineType=cv2.LINE_AA)),
range(1, 8))),
*list(map(lambda q:
(f'JPEG compression(quality = {q})',
lambda x: x,
lambda x: img_encode_decode(x, q)),
range(10, 100, 10))),
*list(map(lambda gamma:
(f'Gamma correction(gamma = {gamma})',
lambda x: x,
lambda x: img_gamma(x, gamma)),
[0.2, 0.5, 0.8, 1.2, 1.5, 2.0])),
*list(map(lambda d:
(f'Translation({d:2} pixels)',
lambda x: img_crop(x, org, size),
lambda x: img_crop(x, org + d, size)),
np.power(2, range(7)))),
]
return processings
def __iter__(self):
for kind, p1, p2 in self._processings:
yield (kind,
p1(copy.deepcopy(self._img)),
p2(copy.deepcopy(self._img)))
class ResNet50Hasher(object):
"""
Class that uses the output of the layer just before ResNet50's final layer as a hash value
"""
_input_size = 224
def __init__(self):
self._model = self._prepare_model()
def _prepare_model(self):
resnet50 = ResNet50(include_top=False, weights='imagenet',
input_shape=(self._input_size, self._input_size, 3),
pooling='avg')
model = models.Sequential()
model.add(resnet50)
return model
def compute(self, img: np.ndarray) -> np.ndarray:
img_arr = np.array([
cv2.resize(img, (self._input_size, self._input_size))
])
img_arr = preprocess_input(img_arr)
embeddings = self._model.predict(img_arr)
return embeddings
@staticmethod
def compare(x1: np.ndarray, x2: np.ndarray):
"""
Compute the cosine similarity (range [0, 1]) and convert it to
the range [0, 64] so that it can be compared with the Hamming
distance used for aHash and pHash.
"""
cs = cosine_similarity(x1, x2)
distance = 64 * (1 - cs)  # similarity 1 -> distance 0, similarity 0 -> distance 64
return distance.ravel()[0] # np.array -> float
def img_crop(img: np.ndarray, org: np.ndarray, size: np.ndarray) -> np.ndarray:
"""
Crop any area from the image.
"""
y, x = org
h, w = size
return img[y:y + h, x:x + w, :]
def img_encode_decode(img: np.ndarray, quality=90) -> np.ndarray:
"""
Reproduce the degradation caused by JPEG compression.
Reference: https://qiita.com/ka10ryu1/items/5fed6b4c8f29163d0d65
"""
encode_param = [int(cv2.IMWRITE_JPEG_QUALITY), quality]
_, enc_img = cv2.imencode('.jpg', img, encode_param)
dec_img = cv2.imdecode(enc_img, cv2.IMREAD_COLOR)
return dec_img
def img_gamma(img: np.ndarray, gamma=0.5) -> np.ndarray:
"""
Gamma correction.
Reference: https://www.dogrow.net/python/blog99/
"""
lut = np.empty((1, 256), np.uint8)
for i in range(256):
lut[0, i] = np.clip(pow(i / 255.0, gamma) * 255.0, 0, 255)
return cv2.LUT(img, lut)
def image_hashing_test():
image_path = 'resources/lena_std.tif'
img = cv2.imread(image_path, cv2.IMREAD_COLOR)
h, w, _ = img.shape
hashers = [
('aHash', cv2.img_hash.AverageHash_create()),
('pHash', cv2.img_hash.PHash_create()),
('ResNet', ResNet50Hasher())
]
pairs = ImagePairGenerator(img)
result_dict = {}
for pair_kind, img1, img2 in pairs:
result_dict[pair_kind] = {}
for hasher_kind, hasher in hashers:
hash1 = hasher.compute(img1)
hash2 = hasher.compute(img2)
distance = hasher.compare(hash1, hash2)
result_dict[pair_kind][hasher_kind] = distance
# Visually check the images (only when their shapes match, since aligning them would be a hassle)
if img1.shape == img2.shape:
window_name = pair_kind
cv2.imshow(window_name, cv2.hconcat((img1, img2)))
cv2.waitKey()
cv2.destroyWindow(window_name)
pprint.pprint(result_dict)
if __name__ == '__main__':
image_hashing_test()
{'Same': {'ResNet': 0.0, 'aHash': 0.0, 'pHash': 0.0},
'Grayscale': {'ResNet': 14.379967, 'aHash': 0.0, 'pHash': 0.0},
'Reduced to 1/ 2': {'ResNet': 1.2773285, 'aHash': 3.0, 'pHash': 1.0},
'Reduced to 1/ 4': {'ResNet': 6.5748253, 'aHash': 4.0, 'pHash': 1.0},
'Reduced to 1/ 8': {'ResNet': 18.959282, 'aHash': 7.0, 'pHash': 3.0},
'Reduced to 1/16': {'ResNet': 34.8299, 'aHash': 12.0, 'pHash': 0.0},
'JPEG compression(quality = 10)': {'ResNet': 6.4169083, 'aHash': 2.0, 'pHash': 0.0},
'JPEG compression(quality = 20)': {'ResNet': 2.6065674, 'aHash': 1.0, 'pHash': 0.0},
'JPEG compression(quality = 30)': {'ResNet': 1.8446579, 'aHash': 0.0, 'pHash': 0.0},
'JPEG compression(quality = 40)': {'ResNet': 1.2492218, 'aHash': 0.0, 'pHash': 1.0},
'JPEG compression(quality = 50)': {'ResNet': 1.0534592, 'aHash': 0.0, 'pHash': 0.0},
'JPEG compression(quality = 60)': {'ResNet': 0.99293137, 'aHash': 0.0, 'pHash': 0.0},
'JPEG compression(quality = 70)': {'ResNet': 0.7313309, 'aHash': 0.0, 'pHash': 0.0},
'JPEG compression(quality = 80)': {'ResNet': 0.58068085, 'aHash': 0.0, 'pHash': 0.0},
'JPEG compression(quality = 90)': {'ResNet': 0.354187, 'aHash': 0.0, 'pHash': 0.0},
'Gamma correction(gamma = 0.2)': {'ResNet': 16.319721, 'aHash': 2.0, 'pHash': 1.0},
'Gamma correction(gamma = 0.5)': {'ResNet': 4.2003975, 'aHash': 2.0, 'pHash': 0.0},
'Gamma correction(gamma = 0.8)': {'ResNet': 0.48334503, 'aHash': 0.0, 'pHash': 0.0},
'Gamma correction(gamma = 1.2)': {'ResNet': 0.381176, 'aHash': 0.0, 'pHash': 1.0},
'Gamma correction(gamma = 1.5)': {'ResNet': 1.7187691, 'aHash': 2.0, 'pHash': 1.0},
'Gamma correction(gamma = 2.0)': {'ResNet': 4.074257, 'aHash': 6.0, 'pHash': 2.0},
'Insert text(fontScale = 1)': {'ResNet': 0.7838249, 'aHash': 0.0, 'pHash': 0.0},
'Insert text(fontScale = 2)': {'ResNet': 1.0911484, 'aHash': 0.0, 'pHash': 1.0},
'Insert text(fontScale = 3)': {'ResNet': 2.7721176, 'aHash': 0.0, 'pHash': 2.0},
'Insert text(fontScale = 4)': {'ResNet': 4.646305, 'aHash': 0.0, 'pHash': 4.0},
'Insert text(fontScale = 5)': {'ResNet': 8.435852, 'aHash': 2.0, 'pHash': 3.0},
'Insert text(fontScale = 6)': {'ResNet': 11.267036, 'aHash': 6.0, 'pHash': 3.0},
'Insert text(fontScale = 7)': {'ResNet': 15.272251, 'aHash': 2.0, 'pHash': 7.0},
'Smoothing(kernel size =  3)': {'ResNet': 1.3798943, 'aHash': 2.0, 'pHash': 0.0},
'Smoothing(kernel size =  5)': {'ResNet': 3.1528091, 'aHash': 4.0, 'pHash': 1.0},
'Smoothing(kernel size =  7)': {'ResNet': 4.903698, 'aHash': 4.0, 'pHash': 1.0},
'Smoothing(kernel size =  9)': {'ResNet': 6.8400574, 'aHash': 4.0, 'pHash': 1.0},
'Smoothing(kernel size = 11)': {'ResNet': 9.477722, 'aHash': 5.0, 'pHash': 2.0},
'Translation( 1 pixels)': {'ResNet': 0.47764206, 'aHash': 6.0, 'pHash': 0.0},
'Translation( 2 pixels)': {'ResNet': 0.98942566, 'aHash': 10.0, 'pHash': 3.0},
'Translation( 4 pixels)': {'ResNet': 1.475399, 'aHash': 15.0, 'pHash': 5.0},
'Translation( 8 pixels)': {'ResNet': 2.587471, 'aHash': 20.0, 'pHash': 13.0},
'Translation(16 pixels)': {'ResNet': 3.1883087, 'aHash': 25.0, 'pHash': 21.0},
'Translation(32 pixels)': {'ResNet': 4.8445663, 'aHash': 23.0, 'pHash': 31.0},
'Translation(64 pixels)': {'ResNet': 9.34531, 'aHash': 28.0, 'pHash': 30.0}}
Please note that ResNet is used with weights pre-trained on ImageNet as-is. In other words, by preparing your own training data, you could obtain a network with characteristics different from those described below.
Also, the tendencies may change depending on the image. For practical use, it is better to evaluate on a dataset of some size.
- Same: Obviously, the distance is 0 for every hashing method.
- Grayscale: For aHash and pHash the distance to the original image is 0, because the image is converted to grayscale at the start of the hashing process anyway.
- Reduction: For ResNet and aHash the distance grows with the reduction factor, while pHash is relatively robust.
- JPEG compression: Interestingly, the ResNet distance appears roughly proportional to how strongly the image is compressed. aHash and pHash, on the other hand, are relatively robust; a simple hashing scheme (encoding whether each value is above or below the mean) may be inherently robust to compression artifacts.
- Gamma correction: Compared to ResNet, aHash and pHash appear relatively robust.
- Text insertion: aHash appears relatively robust. Possibly the mean pixel value shifts toward the color of the inserted text, so the encoding result barely changes. For pHash and ResNet, the distance grows with the size of the text.
- Smoothing: pHash is relatively robust, presumably because it looks only at the low-frequency components.
- Translation: All hashes drift further away as the amount of translation increases. ResNet is perhaps relatively robust.
We introduced methods for hashing an image, measured the distance between processed images and the original, and observed the characteristics of each hashing method. Among the algorithms compared in this article, pHash tends to be robust to image scaling, compression degradation, and smoothing.
I only recently learned about the xHash family of algorithms, and I think they are simple, well-conceived algorithms. They are fairly mature, tried-and-true techniques, and I hope to use them where they fit.
- Try image comparison using perceptual hash (pHash): http://hideack.hatenablog.com/entry/2015/03/16/194336
[^3]: To tell the truth, while reading this paper I thought, "Isn't there an easier way to detect duplicates?", which was the impetus for writing this article.