Last time, I got as far as sorting a large number of image files into year/month folders. At that time, a CSV file like the following was generated in each folder as a clue for deleting duplicate files.
info.csv
IMG_2607.jpg,BCF3E765,1944106
IMG_2607(1).jpg,BCF3E765,1944106
IMG_2608.jpg,02B27221,3109397
IMG_2608(1).jpg,02B27221,3109397
010(8).jpg,E4A68AB2,3801239
010(9).jpg,3EBBC7BD,1841698
010(10).jpg,B9431E60,103645
From the left, the columns are file name, CRC32, and file size. Files whose CRC32 and file size both match are almost certainly the same file, so the extra copies are targeted for deletion. In the file above, the 2nd and 4th lines are duplicates, so I would like to divide the entries into 2 groups as follows.
servived
IMG_2607.jpg,BCF3E765,1944106
IMG_2608.jpg,02B27221,3109397
010(8).jpg,E4A68AB2,3801239
010(9).jpg,3EBBC7BD,1841698
010(10).jpg,B9431E60,103645
delete
IMG_2607(1).jpg,BCF3E765,1944106
IMG_2608(1).jpg,02B27221,3109397
The idea is simply
delete list ← original list - duplicate list
survival list ← original list - delete list
but however much I fiddled with it, I could not reach a convincing implementation. So this time I settled on the simple implementation below (I think this is enough because what I want to do is not complicated).
Classify.py
import os
import sys

import pandas as pd


def classify(path, target):
    lines = pd.read_csv(os.path.join(path, target), header=None)
    d = {}
    servived_dict = {}  # What survives
    delete_dict = {}  # Target to be deleted

    # Convert each filename,crc32,filesize row to {filename: (crc32, filesize)}
    for i in range(len(lines)):
        (filename, crc32, filesize) = lines.values[i]
        d[filename] = (crc32, filesize)

    # The first file seen with a given (crc32, filesize) survives;
    # any later file with the same pair goes to the delete list
    for key, value in d.items():
        if value in servived_dict.values():
            delete_dict[key] = value
        else:
            servived_dict[key] = value

    def output(full_path, dic):
        with open(full_path, mode='w') as f:
            for key in dic.keys():
                # I only want the full path of the file to be deleted
                f.write(os.path.join(path, key) + "\n")

    output(os.path.join(path, "servived.txt"), servived_dict)
    output(os.path.join(path, "delete.txt"), delete_dict)


if __name__ == "__main__":
    full_path = sys.argv[1]
    classify(os.path.dirname(full_path), os.path.basename(full_path))
The first for statement converts the rows to {filename: (crc32, filesize)} so that they are easy to handle later. Without the tuple, I would have to check the CRC32 and the file size separately, which is a bit of a pain. Also, although servived_dict is needed for the classification itself, writing it out to servived.txt serves no purpose, so that output is unnecessary (although it was useful when debugging).
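To illustrate why the tuple is convenient, here is a minimal sketch that uses the (crc32, filesize) pair directly as a set member, so "have I seen this file before?" becomes a single `in` check. The helper name `split_rows` and the hard-coded `rows` list (copied from the sample info.csv above) are assumptions for illustration only and are not part of Classify.py.

```python
# Sample rows from info.csv above: (filename, crc32, filesize)
rows = [
    ("IMG_2607.jpg", "BCF3E765", 1944106),
    ("IMG_2607(1).jpg", "BCF3E765", 1944106),
    ("IMG_2608.jpg", "02B27221", 3109397),
    ("IMG_2608(1).jpg", "02B27221", 3109397),
    ("010(8).jpg", "E4A68AB2", 3801239),
    ("010(9).jpg", "3EBBC7BD", 1841698),
    ("010(10).jpg", "B9431E60", 103645),
]


def split_rows(rows):
    """Hypothetical helper: split file names into (survived, delete) lists."""
    seen = set()  # (crc32, filesize) pairs already encountered
    survived, delete = [], []
    for filename, crc32, filesize in rows:
        key = (crc32, filesize)  # tuples are hashable, so they can live in a set
        if key in seen:
            delete.append(filename)  # same CRC32 and size as an earlier file
        else:
            seen.add(key)
            survived.append(filename)  # first file with this pair survives
    return survived, delete


survived, delete = split_rows(rows)
print(delete)  # ['IMG_2607(1).jpg', 'IMG_2608(1).jpg']
```

Compared with scanning `servived_dict.values()` on every iteration, the set lookup does not slow down as the folder grows, though for a few hundred files per folder either approach should be fine.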
I'm not much of a Pythonista yet; I suspect this could be finished off quickly with set arithmetic, but this will do for now. Next is the deletion process, which refers to the delete.txt generated in each year/month folder (why should I have to write it...).
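For reference, one way the set-arithmetic version might look is sketched below. It keeps the first file name seen for each (crc32, filesize) pair and then computes the delete list as "all files minus survivors". The function name `classify_with_sets` and the inline `rows` data are assumptions for illustration; this is a sketch of the idea, not a tested replacement for Classify.py.

```python
# Sample rows from info.csv above: (filename, crc32, filesize)
rows = [
    ("IMG_2607.jpg", "BCF3E765", 1944106),
    ("IMG_2607(1).jpg", "BCF3E765", 1944106),
    ("IMG_2608.jpg", "02B27221", 3109397),
    ("IMG_2608(1).jpg", "02B27221", 3109397),
    ("010(8).jpg", "E4A68AB2", 3801239),
    ("010(9).jpg", "3EBBC7BD", 1841698),
    ("010(10).jpg", "B9431E60", 103645),
]


def classify_with_sets(rows):
    """Hypothetical helper: return (survived, delete) as sets of file names."""
    first_by_key = {}
    for filename, crc32, filesize in rows:
        # setdefault stores the name only the first time a pair appears
        first_by_key.setdefault((crc32, filesize), filename)

    all_files = {filename for filename, _, _ in rows}
    survived = set(first_by_key.values())
    delete = all_files - survived  # delete list = original list - survival list
    return survived, delete


survived_set, delete_set = classify_with_sets(rows)
print(sorted(delete_set))  # ['IMG_2607(1).jpg', 'IMG_2608(1).jpg']
```

Sets do not preserve order, so the names would need to be sorted (or the insertion-ordered dict reused) before writing them to delete.txt if a stable file order matters.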