If the scraped sites build their articles from the same source, the same image may be downloaded more than once under different titles. I want to detect and delete image folders whose titles differ but whose contents are exactly the same.
Environment: Windows 10, Anaconda, Python 3.6.1, Jupyter Notebook
Using ImageHash, a Python library for comparing similar images, on Windows
ImageHash hashes an image's visual content rather than its raw bytes, ignoring the image's size and subtle differences: identical images yield the same digest value, and similar images yield similar digest values. It can therefore judge similarity regardless of an image's extension or dimensions.
ImageHash depends on numpy, scipy, Pillow, and PyWavelets. Anaconda already ships these, so with Anaconda only the ImageHash install itself is needed; otherwise, run all of the commands below. A quick sanity check follows the install commands.
sh
pip install numpy
pip install scipy
pip install Pillow
pip install PyWavelets
pip install ImageHash
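Once the install finishes, a quick check makes the behavior concrete. This is a minimal sketch, and the two file names are hypothetical: a resized copy of an image should give a small hash distance, while unrelated images give a large one.
py
from PIL import Image
import imagehash

# Hypothetical example files: an original and a resized copy of it
hash_a = imagehash.phash(Image.open('cat.jpg'))
hash_b = imagehash.phash(Image.open('cat_small.jpg'))

print(hash_a)           # 64-bit perceptual hash, printed as 16 hex digits
print(hash_a - hash_b)  # Hamming distance: 0 for identical content,
                        # small for similar images, large for different ones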
compimages.py
from PIL import Image, ImageFile
import imagehash
import os

# Load truncated (partially downloaded) images instead of raising an error
ImageFile.LOAD_TRUNCATED_IMAGES = True

# Output the difference (Hamming distance) between the perceptual hashes of two images
def d_hash(img, otherimg):
    hash_a = imagehash.phash(Image.open(img))
    hash_b = imagehash.phash(Image.open(otherimg))
    return hash_a - hash_b

# Detect the smaller image: 0 if the first has the smaller pixel area, 1 otherwise
def minhash(img, otherimg):
    width_a, height_a = Image.open(img).size
    width_b, height_b = Image.open(otherimg).size
    if width_a * height_a < width_b * height_b:
        return 0
    else:
        return 1

# Specify the working folder
directory_dir = r'C:\Users\hogehoge\images'

# Get the folder list and folder paths, skipping folders with two files or fewer
# (folder_list is filtered the same way so its indices line up with folder_dir)
folder_list = [f for f in os.listdir(directory_dir)
               if len(os.listdir(os.path.join(directory_dir, f))) > 2]
folder_dir = [os.path.join(directory_dir, f) for f in folder_list]

# Get the image list and image count per folder
img_list = [os.listdir(d) for d in folder_dir]
img_list_count = [len(names) for names in img_list]

# Create a full-path image list per folder with a nested comprehension,
# keeping only jpg/png files
img_dir = [[os.path.join(d, names[i]) for i in range(count)
            if names[i].lower().endswith(('.jpg', '.png'))]
           for (count, d, names) in zip(img_list_count, folder_dir, img_list)]

i = 0
length = len(img_dir)
delete_file = []

# Compare folders pairwise with d_hash() and minhash()
while i < length:
    # Progress
    print('i = ' + str(i) + '/' + str(length))
    for j in range(i + 1, length):
        # Flag used to break out of the j loop after a match
        switch = 0
        for k in img_dir[j]:
            # A hash distance below 10 is treated as the same image
            # (img_dir[i][1], the folder's second image, serves as its representative)
            if d_hash(img_dir[i][1], k) < 10:
                print(folder_list[i] + ' | vs | ' + folder_list[j])
                # Save the path of the folder holding the smaller image in the delete list
                if minhash(img_dir[i][1], k) == 0:
                    delete_file.append(folder_dir[i])
                else:
                    delete_file.append(folder_dir[j])
                switch = 1
                break
        if switch != 0:
            break
    i += 1

# View the folder paths queued for deletion
print(delete_file)
# If you want to continue with the deletion, uncomment:
#import shutil
#for f in delete_file:
#    shutil.rmtree(f)
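Before letting shutil.rmtree loose, the two helpers can be sanity-checked by hand on one known duplicate pair. This is just a sketch; the two paths below are hypothetical placeholders:
py
# Hypothetical paths to the same image saved under two different titles
a = r'C:\Users\hogehoge\images\title_A\001.jpg'
b = r'C:\Users\hogehoge\images\title_B\001.jpg'

print(d_hash(a, b))   # expected to stay below the threshold of 10
print(minhash(a, b))  # 0 if a is the smaller image, 1 otherwise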
The first folder takes the longest, but the number of folders to compare against shrinks as i increases, so by the time processing reaches the halfway point, each folder needs only about half as many comparisons. Even so, assuming 100 folders of 10 images each, the worst case is 100 x 99 / 2 = 4950 folder pairs, i.e. roughly **50,000** hash comparisons in total. If this could be parallelized, e.g. with the threading or multiprocessing modules, I would like to implement that in the future; a rough sketch follows.
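As a starting point, here is a minimal sketch of that idea. It uses a process pool from concurrent.futures rather than the threading module, since phash is CPU-bound and Python threads would be serialized by the GIL; every image is hashed once up front so the pairwise loop can reuse cached hashes instead of reopening files each time. The flat paths list is a hypothetical stand-in for a flattened img_dir.
py
from concurrent.futures import ProcessPoolExecutor
from PIL import Image
import imagehash

def hash_one(path):
    # Hash a single image; runs in a worker process
    return path, imagehash.phash(Image.open(path))

if __name__ == '__main__':  # required for process pools on Windows
    # Hypothetical flat list of image paths (e.g. a flattened img_dir)
    paths = [r'C:\Users\hogehoge\images\title_A\001.jpg',
             r'C:\Users\hogehoge\images\title_B\001.jpg']
    with ProcessPoolExecutor() as pool:
        hashes = dict(pool.map(hash_one, paths))
    # The pairwise comparison then becomes a cheap in-memory subtraction
    print(hashes[paths[0]] - hashes[paths[1]])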