If the scraped sites build their articles from the same source, the same image may be downloaded more than once under different titles. I want to detect and delete image folders whose titles differ but whose contents are exactly the same.
Environment: Windows 10, Anaconda, Python 3.6.1, Jupyter Notebook
Using ImageHash, a Python library for comparing similar images, on Windows
ImageHash hashes an image's visual content rather than its raw bytes, ignoring the image's size and subtle differences: identical images yield the same digest value, and similar images yield similar digest values. It can therefore judge similarity regardless of an image's extension or dimensions.
ImageHash depends on numpy, scipy, Pillow, and PyWavelets. Anaconda already ships these, so with Anaconda only the ImageHash install itself is needed; otherwise, run all of the commands below. A quick sanity check follows the install commands.
sh
pip install numpy
pip install scipy
pip install Pillow
pip install PyWavelets
pip install ImageHash
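Once the install finishes, a quick check makes the behavior concrete. This is a minimal sketch, and the two file names are hypothetical: a resized copy of an image should give a small hash distance, while unrelated images give a large one.
py
from PIL import Image
import imagehash

# Hypothetical example files: an original and a resized copy of it
hash_a = imagehash.phash(Image.open('cat.jpg'))
hash_b = imagehash.phash(Image.open('cat_small.jpg'))

print(hash_a)           # 64-bit perceptual hash, printed as 16 hex digits
print(hash_a - hash_b)  # Hamming distance: 0 for identical content,
                        # small for similar images, large for different ones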
compimages.py
from PIL import Image, ImageFile
import imagehash
import os

# Load truncated (partially downloaded) images instead of raising an error
ImageFile.LOAD_TRUNCATED_IMAGES = True

# Output the difference (Hamming distance) between the perceptual hashes of two images
def d_hash(img, otherimg):
    hash_a = imagehash.phash(Image.open(img))
    hash_b = imagehash.phash(Image.open(otherimg))
    return hash_a - hash_b

# Detect the smaller image: 0 if the first has the smaller pixel area, 1 otherwise
def minhash(img, otherimg):
    width_a, height_a = Image.open(img).size
    width_b, height_b = Image.open(otherimg).size
    if width_a * height_a < width_b * height_b:
        return 0
    else:
        return 1

# Specify the working folder
directory_dir = r'C:\Users\hogehoge\images'

# Get the folder list and folder paths, skipping folders with two files or fewer
# (folder_list is filtered the same way so its indices line up with folder_dir)
folder_list = [f for f in os.listdir(directory_dir)
               if len(os.listdir(os.path.join(directory_dir, f))) > 2]
folder_dir = [os.path.join(directory_dir, f) for f in folder_list]

# Get the image list and image count per folder
img_list = [os.listdir(d) for d in folder_dir]
img_list_count = [len(names) for names in img_list]

# Create a full-path image list per folder with a nested comprehension,
# keeping only jpg/png files
img_dir = [[os.path.join(d, names[i]) for i in range(count)
            if names[i].lower().endswith(('.jpg', '.png'))]
           for (count, d, names) in zip(img_list_count, folder_dir, img_list)]

i = 0
length = len(img_dir)
delete_file = []

# Compare folders pairwise with d_hash() and minhash()
while i < length:
    # Progress
    print('i = ' + str(i) + '/' + str(length))
    for j in range(i + 1, length):
        # Flag used to break out of the j loop after a match
        switch = 0
        for k in img_dir[j]:
            # A hash distance below 10 is treated as the same image
            # (img_dir[i][1], the folder's second image, serves as its representative)
            if d_hash(img_dir[i][1], k) < 10:
                print(folder_list[i] + ' | vs | ' + folder_list[j])
                # Save the path of the folder holding the smaller image in the delete list
                if minhash(img_dir[i][1], k) == 0:
                    delete_file.append(folder_dir[i])
                else:
                    delete_file.append(folder_dir[j])
                switch = 1
                break
        if switch != 0:
            break
    i += 1

# View the folder paths queued for deletion
print(delete_file)
# If you want to continue with the deletion, uncomment:
#import shutil
#for f in delete_file:
#    shutil.rmtree(f)
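Before letting shutil.rmtree loose, the two helpers can be sanity-checked by hand on one known duplicate pair. This is just a sketch; the two paths below are hypothetical placeholders:
py
# Hypothetical paths to the same image saved under two different titles
a = r'C:\Users\hogehoge\images\title_A\001.jpg'
b = r'C:\Users\hogehoge\images\title_B\001.jpg'

print(d_hash(a, b))   # expected to stay below the threshold of 10
print(minhash(a, b))  # 0 if a is the smaller image, 1 otherwise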
The first folder takes the longest, but the number of folders to compare against shrinks as i increases, so by the time processing reaches the halfway point, each folder needs only about half as many comparisons. Even so, assuming 100 folders of 10 images each, the worst case is 100 x 99 / 2 = 4950 folder pairs, i.e. roughly **50,000** hash comparisons in total. If this could be parallelized, e.g. with the threading or multiprocessing modules, I would like to implement that in the future; a rough sketch follows.
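As a starting point, here is a minimal sketch of that idea. It uses a process pool from concurrent.futures rather than the threading module, since phash is CPU-bound and Python threads would be serialized by the GIL; every image is hashed once up front so the pairwise loop can reuse cached hashes instead of reopening files each time. The flat paths list is a hypothetical stand-in for a flattened img_dir.
py
from concurrent.futures import ProcessPoolExecutor
from PIL import Image
import imagehash

def hash_one(path):
    # Hash a single image; runs in a worker process
    return path, imagehash.phash(Image.open(path))

if __name__ == '__main__':  # required for process pools on Windows
    # Hypothetical flat list of image paths (e.g. a flattened img_dir)
    paths = [r'C:\Users\hogehoge\images\title_A\001.jpg',
             r'C:\Users\hogehoge\images\title_B\001.jpg']
    with ProcessPoolExecutor() as pool:
        hashes = dict(pool.map(hash_one, paths))
    # The pairwise comparison then becomes a cheap in-memory subtraction
    print(hashes[paths[0]] - hashes[paths[1]])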