Data from the recording media (USB sticks, SD cards, etc.) of digital cameras and action cams is copied to a large-capacity external SSD for storage, but the free space on that SSD is running low. The SSD seems to contain duplicate data, which I would like to delete. The duplicate files appear to have arisen because:

- after copying to the external SSD, I added new data to the recording media before erasing the already-copied files, and then backed the media up again;
- the case of some file names changed between the OS environments in which the backups were made, so I mistakenly assumed a backup had not been completed and backed the same data up twice.

Checking the contents of the two directory trees by hand, ignoring case, would be a hassle, so I decided to write a script.
Run the script with the two directories you want to compare as arguments. For each subdirectory it visits, it compares every file, directory, and symbolic link by time stamp and file size; files of the same size are then compared byte by byte with Python's `filecmp.cmp(..., shallow=False)`.
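As a rough illustration of those two steps, here is a minimal sketch assuming case-insensitive name matching as described above. The script itself is not shown in this post, so the function names (`pair_entries`, `same_content`) are illustrative, not taken from the actual code:

```python
import filecmp
import os

def pair_entries(dir1, dir2):
    """Pair directory entries by case-insensitive name (illustrative sketch).

    Returns {lowercased name: (name in dir1 or None, name in dir2 or None)}.
    """
    names1 = {n.lower(): n for n in os.listdir(dir1)}
    names2 = {n.lower(): n for n in os.listdir(dir2)}
    keys = sorted(names1.keys() | names2.keys())
    return {k: (names1.get(k), names2.get(k)) for k in keys}

def same_content(path1, path2):
    """Cheap size check first; only same-size files are read in full."""
    if os.path.getsize(path1) != os.path.getsize(path2):
        return False
    # shallow=False makes filecmp read both files byte by byte instead of
    # trusting the os.stat() signature (size and mtime) alone.
    return filecmp.cmp(path1, path2, shallow=False)
```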
A 3-character symbol showing the comparison result is prepended to each file name. When a file with the same name exists on the other side, the symbol combines one character for the time stamp (mtime) comparison (`>`, `=`, `<`) with two characters for the content comparison (`++`, `--`, `!=`, `==`, `  ` (two blanks), `!!`).

For example, `===` means "same time stamp and same content", `>++` means "newer and larger than the file of the same name in the compared directory", and `=!=` means "same time stamp and file size as the file of the same name in the compared directory, but different contents (data)". For symbolic links, `=` followed by two blank characters means "same time stamp and same link target path", while `=!!` means "same time stamp but different link target path".
Execution example

```
% ./cmp_dirtree ./data0 ./data1
### ========================================
1: ./data0
2: ./data1
### ========================================
###-----------------< M4ROOT >--------------------
1: ./data0/M4ROOT
2: ./data1/M4ROOT
### Sub directories: ---
1: =++ : 2019/02/09 12:43:35 : 952 : CLIP
2: =-- : 2019/02/09 12:43:35 : 476 : CLIP
1: === : 2019/02/09 12:43:35 : 68 : GENERAL
2: === : 2019/02/09 12:43:35 : 68 : GENERAL
(Omitted)
### File lists: ---
1: >== : 2019/11/05 03:47:45 : 6148 : .DS_Store
2: <== : 2019/11/05 01:10:13 : 6148 : .DS_Store
1: >++ : 2019/11/02 19:09:34 : 5122 : MEDIAPRO.XML
2: <-- : 2019/07/01 19:07:35 : 2595 : MEDIAPRO.XML
1: >== : 2019/11/02 19:09:34 : 7 : STATUS.BIN
2: <== : 2019/07/01 19:07:35 : 7 : STATUS.BIN
###-----------------< M4ROOT/CLIP >--------------------
1: ./data0/M4ROOT/CLIP
2: ./data1/M4ROOT/CLIP
### File lists: ---
1: === : 2019/02/09 14:53:23 : 1878686635 : C0001.MP4
2: === : 2019/02/09 14:53:23 : 1878686635 : C0001.MP4
1: === : 2019/02/09 14:53:23 : 2008 : C0001M01.XML
2: === : 2019/02/09 14:53:23 : 2008 : C0001M01.XML
(Omitted)
1: === : 2019/07/01 19:07:35 : 7627022896 : C0006.MP4
2: === : 2019/07/01 19:07:35 : 7627022896 : C0006.MP4
1: === : 2019/07/01 19:07:35 : 2009 : C0006M01.XML
2: === : 2019/07/01 19:07:35 : 2009 : C0006M01.XML
1:     : 2019/07/28 14:15:53 : 15709053750 : C0007.MP4
2: !   : ~~
1:     : 2019/07/28 14:15:53 : 2008 : C0007M01.XML
2: !   : ~~
(The following is omitted)
```
If there are many directories and files to compare, duplicate files are easy to find by filtering the output, e.g. `./cmp_dirtree ./data0 ./data1 | grep -e '.=='`.