[Kotlin] Delete files with duplicate contents [Java]

What I want to do

When a directory contains many files with identical contents, keep one copy of each and delete the duplicates.

Approach

Comparing the full contents of every file against every other file is expensive.

So this time I compare the hashes (digests) of the files instead, using the following procedure.

  1. Hash the file to get its digest
  2. Check whether the digest already exists in a Set
  3. If it exists, delete the file
  4. If it does not exist, add the digest to the Set

Sample code

A sample that works for now is shown below.

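import java.io.File
import java.security.MessageDigest

// One SHA-256 instance is reused; digest() resets it after every call.
val sha256: MessageDigest = MessageDigest.getInstance("SHA-256")

// Return the digest as List<Byte> so that equals/hashCode compare contents.
fun getDigest(bytes: ByteArray): List<Byte> = sha256.digest(bytes).asList()

// Regular files only (readBytes() fails on a subdirectory).
// listFiles() returns null when the path is not a readable directory.
fun getFiles(pathToDir: String): List<File> =
    File(pathToDir).listFiles()?.filter { it.isFile } ?: emptyList()

fun main(args: Array<String>) {
    // Path of the directory to be processed, taken here from the first argument
    val files = getFiles(args[0])

    val set = HashSet<List<Byte>>()
    var count = 0

    files.forEach {
        val digest = getDigest(it.readBytes())

        // add() returns false when the digest is already in the set, i.e. a duplicate
        if (!set.add(digest)) {
            if (it.delete()) {
                println("Deleted:\t${it.name}")
                count++
            } else {
                println("Failed to delete:\t${it.name}")
            }
        }
    }

    println("\n\n$count deleted.")
}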

Execution result

Deleted:	43_3 copies.gif
Deleted:	46_Copy of 3 2.gif
Deleted:	70_Copy of 1 2.gif
Deleted:	94_1 copy.gif
Deleted:	50_Copy of 3 2.gif
Deleted:	66_1 copy.gif
Deleted:	95_1 copy.jpg
Deleted:	58_3 copies.gif
Deleted:	63_1 copy.gif
Deleted:	32_1 copy.jpg
Deleted:	55_3 copies.gif
Deleted:	62_3 copies.gif
Deleted:	49_3 copies.gif
Deleted:	9_Copy of 1 2.gif
Deleted:	47_3 copies.gif
Deleted:	96_1 copy.jpg
Deleted:	71_1 copy.gif
Deleted:	52_Copy of 3 2.gif
Deleted:	64_Copy of 1 2.gif
Deleted:	61_3 copies.gif
Deleted:	56_3 copies.gif
Deleted:	60_Copy of 3 2.gif
Deleted:	31_1 copy.jpg
Deleted:	57_Copy of 3 2.gif
Deleted:	98_Copy of 1 2.jpg
Deleted:	34_1 copy.jpg
Deleted:	68_1 copy.gif
Deleted:	53_3 copies.gif
Deleted:	42_3 copies.gif
Deleted:	74_Copy of 1 2.gif
Deleted:	30_1 copy.gif
Deleted:	36_Copy of 1 2.gif
Deleted:	65_1 copy.gif
Deleted:	100_1 copy.jpg
Deleted:	37_1 copy.gif
Deleted:	35_Copy of 1 2.gif
Deleted:	45_3 copies.gif
Deleted:	99_1 copy.jpg
Deleted:	87_Copy of 1 2.jpg
Deleted:	33_1 copy.jpg
Deleted:	73_1 copy.gif
Deleted:	1_7 copies.jpg
Deleted:	48_3 copies.gif
Deleted:	54_Copy of 3 2.gif
Deleted:	51_3 copies.gif
Deleted:	67_1 copy.gif
Deleted:	93_Copy of 1 2.gif
Deleted:	44_Copy of 3 2.gif
Deleted:	72_Copy of 1 2.gif
Deleted:	97_Copy of 1 2.jpg


50 deleted.

Commentary

How to get a hash

I used java.security.MessageDigest. It is part of the Java standard library, so no additional libraries need to be installed.
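
The sample reads each file completely with readBytes() before hashing it. MessageDigest can also be fed in chunks through update(), which may help with large files. The following is only a minimal sketch: the helper name streamingDigest and the 8 KiB buffer size are assumptions made for illustration and are not part of the sample above.

```kotlin
import java.io.File
import java.security.MessageDigest

// Hypothetical helper: hashes a file chunk by chunk so that the whole
// file never has to be held in memory at once.
fun streamingDigest(file: File, algorithm: String = "SHA-256"): List<Byte> {
    val md = MessageDigest.getInstance(algorithm)
    file.inputStream().use { input ->
        val buffer = ByteArray(8192) // buffer size chosen arbitrarily
        while (true) {
            val read = input.read(buffer)
            if (read < 0) break          // end of file
            md.update(buffer, 0, read)   // feed only the bytes actually read
        }
    }
    return md.digest().asList()
}

fun main() {
    val tmp = File.createTempFile("digest-demo", ".bin")
    try {
        tmp.writeBytes(ByteArray(100_000) { (it % 251).toByte() })
        // Digest computed the same way as in the sample, for comparison
        val whole = MessageDigest.getInstance("SHA-256").digest(tmp.readBytes()).asList()
        println(streamingDigest(tmp) == whole) // prints "true"
    } finally {
        tmp.delete()
    }
}
```

Both ways produce the same digest, so the Set-based duplicate check would work unchanged with either.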

This time I specified SHA-256, but if you want to reduce the chance of a hash collision (two different files producing the same digest) as far as possible, I think you should specify SHA-512.
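
Switching algorithms only means changing the name passed to MessageDigest.getInstance. As a rough sketch (the helper digestWith is hypothetical, and it assumes the JDK in use provides SHA-512, as common JDKs do), the two digests differ in length:

```kotlin
import java.security.MessageDigest

// Hypothetical helper: digest of the given bytes with the named algorithm.
fun digestWith(algorithm: String, bytes: ByteArray): ByteArray =
    MessageDigest.getInstance(algorithm).digest(bytes)

fun main() {
    val data = "example file contents".toByteArray()
    for (algorithm in listOf("SHA-256", "SHA-512")) {
        val digest = digestWith(algorithm, data)
        println("$algorithm: ${digest.size} bytes")
    }
    // prints:
    // SHA-256: 32 bytes
    // SHA-512: 64 bytes
}
```

The longer digest takes more memory per Set entry, so this is a trade-off rather than a free improvement.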

Duplicate management

A HashSet is probably the easiest and cheapest way to keep track of the digests seen so far. Note that ByteArray is unreliable with equals (it compares references, not contents), so the digest is converted to a List here.
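
The difference can be checked with a small experiment. This is only an illustrative sketch; digestOf and distinctCount are hypothetical helpers introduced for the demonstration.

```kotlin
import java.security.MessageDigest

// Hypothetical helper: SHA-256 digest of a string's UTF-8 bytes.
fun digestOf(text: String): ByteArray =
    MessageDigest.getInstance("SHA-256").digest(text.toByteArray())

// Hypothetical helper: adds both values to a HashSet and returns
// how many distinct entries the set ends up with.
fun <T> distinctCount(first: T, second: T): Int {
    val set = HashSet<T>()
    set.add(first)
    set.add(second)
    return set.size
}

fun main() {
    val a = digestOf("same contents")
    val b = digestOf("same contents")

    println(a.contentEquals(b))                    // true: the bytes are identical
    println(a == b)                                // false: arrays compare by reference
    println(distinctCount(a, b))                   // 2: HashSet<ByteArray> misses the duplicate
    println(distinctCount(a.asList(), b.asList())) // 1: List<Byte> compares element by element
}
```

With raw ByteArray keys the Set would treat every digest as new, and no file would ever be detected as a duplicate.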
