Sorting image files with Python (2)

Preface

Last time, I got as far as sorting a large number of image files into year/month folders. In the process, a CSV file like the following was generated in each folder as a clue for deleting duplicate files.

`info.csv`


IMG_2607.jpg,BCF3E765,1944106
IMG_2607(1).jpg,BCF3E765,1944106
IMG_2608.jpg,02B27221,3109397
IMG_2608(1).jpg,02B27221,3109397
010(8).jpg,E4A68AB2,3801239
010(9).jpg,3EBBC7BD,1841698
010(10).jpg,B9431E60,103645

From the left, the columns are the file name, the CRC32, and the file size. Files whose CRC32 and file size both match are almost certainly the same file, so the extra copies are targeted for deletion. In the file above, the 2nd and 4th lines duplicate the 1st and 3rd, so I would like to divide the rows into 2 groups as follows.

`servived`


IMG_2607.jpg,BCF3E765,1944106
IMG_2608.jpg,02B27221,3109397
010(8).jpg,E4A68AB2,3801239
010(9).jpg,3EBBC7BD,1841698
010(10).jpg,B9431E60,103645

`delete`


IMG_2607(1).jpg,BCF3E765,1944106
IMG_2608(1).jpg,02B27221,3109397
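For context, a row in this format could be produced roughly as follows. This is only a sketch: the helper name `info_row` is hypothetical, and it assumes the CRC32 comes from the standard `zlib.crc32` and is written as eight uppercase hex digits. That matches the sample rows, but it is not necessarily how the files were generated last time.

```python
import os
import zlib

def info_row(path):
    """Return 'filename,CRC32,filesize' for one file (hypothetical helper)."""
    crc = 0
    with open(path, "rb") as f:
        # Read in 1 MiB chunks so large images are not loaded into memory at once
        for chunk in iter(lambda: f.read(1 << 20), b""):
            crc = zlib.crc32(chunk, crc)
    # Mask to 32 bits and format as 8 uppercase hex digits (assumed format)
    return "{},{:08X},{}".format(
        os.path.basename(path), crc & 0xFFFFFFFF, os.path.getsize(path)
    )
```

Two files with identical contents get the same CRC32 and size columns, which is what the grouping above relies on.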

The idea is simply delete list ← original list - duplicate list and survival list ← original list - delete list, but although I kneaded it in various ways, I could not reach a convincing implementation. So this time I settled for a simple implementation along these lines:

Prepare an empty survival list and an empty deletion list.
If the (CRC32, file size) of the inspection target is already on the survival list, add the target to the deletion list.
If it is not on the survival list, add the target to the survival list.

(I think this is enough, because what I want to do is not complicated.)
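The steps above can be sketched with plain lists before bringing in files and pandas. The function name `split_rows` and the in-memory `rows` list are assumptions made for illustration; the rows themselves are the sample data from `info.csv`.

```python
# Sample rows from info.csv: (filename, crc32, filesize)
rows = [
    ("IMG_2607.jpg", "BCF3E765", 1944106),
    ("IMG_2607(1).jpg", "BCF3E765", 1944106),
    ("IMG_2608.jpg", "02B27221", 3109397),
    ("IMG_2608(1).jpg", "02B27221", 3109397),
    ("010(8).jpg", "E4A68AB2", 3801239),
    ("010(9).jpg", "3EBBC7BD", 1841698),
    ("010(10).jpg", "B9431E60", 103645),
]

def split_rows(rows):
    survived = []        # file names that are kept
    survived_keys = []   # (crc32, filesize) of the kept files
    delete = []          # file names to be deleted
    for filename, crc32, filesize in rows:
        key = (crc32, filesize)
        if key in survived_keys:
            # Same CRC32 and size already kept: treat as a duplicate
            delete.append(filename)
        else:
            survived_keys.append(key)
            survived.append(filename)
    return survived, delete

survived, delete = split_rows(rows)
print(survived)
print(delete)
```

The first file seen for each (CRC32, file size) pair survives, so the result depends on the row order in the CSV.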

Development environment

OS : Windows 10 Home (1903)
Editor : Visual Studio Code 1.49.1
Python : 3.8.3
- pandas : 1.0.5

Code

`Classify.py`


import os
import sys
import pandas as pd

def classify(path, target):
    lines = pd.read_csv(os.path.join(path, target), header=None)

    d = {}
    servived_dict = {}      # Files that survive
    delete_dict = {}        # Files to be deleted

    # Convert each filename,crc32,filesize row into {filename: (crc32, filesize)}
    for i in range(len(lines)):
        (filename, crc32, filesize) = lines.values[i]
        d[filename] = (crc32, filesize)

    for key, value in d.items():
        if value in servived_dict.values():
            delete_dict[key] = value
        else:
            servived_dict[key] = value

    def output(full_path, dic):
        with open(full_path, mode='w') as f:
            for key in dic.keys():
                #I only want the full path of the file to be deleted
                f.write(os.path.join(path, key) + "\n")

    output(os.path.join(path, "servived.txt"), servived_dict)
    output(os.path.join(path, "delete.txt"), delete_dict)

if __name__ == "__main__":
    full_path = sys.argv[1]
    classify(os.path.dirname(full_path), os.path.basename(full_path))

In the first for statement, the inspection targets are converted to {filename: (crc32, filesize)} so that they can be handled easily later. Without the tuple, the CRC32 and the file size would each have to be checked separately, which gets a little messy. Also, although servived_dict is needed for the classification itself, writing it out to servived.txt serves no purpose, so that output is unnecessary (although it was useful when debugging).
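The point about tuples can be seen in a few lines. The values below come from the sample rows, except the third tuple, which is a made-up case (same CRC32, different size) added purely for illustration.

```python
a = ("BCF3E765", 1944106)   # IMG_2607.jpg
b = ("BCF3E765", 1944106)   # IMG_2607(1).jpg
c = ("BCF3E765", 103645)    # hypothetical: same CRC32, different size

# Tuples compare element by element, so one == covers both fields
same_ab = (a == b)
same_ac = (a == c)

# Tuples are hashable, so they can also go into a set.
# A membership test on a set is a hash lookup, whereas
# `value in some_dict.values()` scans the values one by one.
seen = {a}
b_seen = b in seen
c_seen = c in seen

print(same_ab, same_ac, b_seen, c_seen)
```

With only a few hundred files per folder the linear scan in `Classify.py` is unlikely to matter; the set is just an option if the lists grow.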

Afterword

I am not much of a Pythonista yet, and I have a feeling this could be finished quickly with set operations, but this is where I am leaving it for now. Next is the deletion process, which refers to the delete.txt generated in each year/month folder (why did I bother writing it out to a file ...)
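For what it is worth, one way the set-based version might look is sketched below. It is not the code used in this article: it keeps the first file name seen for each (CRC32, file size) pair in a dict, then obtains the delete group as a set difference. The names `first_by_key`, `survived`, and `delete` are chosen here for illustration.

```python
# Sample rows from info.csv: (filename, crc32, filesize)
rows = [
    ("IMG_2607.jpg", "BCF3E765", 1944106),
    ("IMG_2607(1).jpg", "BCF3E765", 1944106),
    ("IMG_2608.jpg", "02B27221", 3109397),
    ("IMG_2608(1).jpg", "02B27221", 3109397),
    ("010(8).jpg", "E4A68AB2", 3801239),
    ("010(9).jpg", "3EBBC7BD", 1841698),
    ("010(10).jpg", "B9431E60", 103645),
]

# Keep only the first file name seen for each (crc32, filesize) pair
first_by_key = {}
for filename, crc32, filesize in rows:
    first_by_key.setdefault((crc32, filesize), filename)

survived = set(first_by_key.values())
# delete group = all file names - surviving file names
delete = {filename for filename, _, _ in rows} - survived

print(sorted(survived))
print(sorted(delete))
```

Sets do not preserve order, so the output is sorted here only to make it stable.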