What to do

Use `Apache Tika Parsers` to read a file as a Java `String` regardless of the file's `Charset` (character encoding).

Get `Apache Tika Parsers` from the following repository.

Maven Repository: org.apache.tika » tika-parsers
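
For a Gradle project that uses the Kotlin DSL, the dependency could be declared roughly as follows. The coordinates come from the repository entry above; the version number is only an assumption, so check the Maven Repository page for the version you actually want.

```kotlin
// build.gradle.kts (sketch; the version number is an assumption, not taken from the article)
dependencies {
    implementation("org.apache.tika:tika-parsers:1.28.5")
}
```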

Method

This time, I use `UniversalEncodingDetector()` to get the `Charset` and then pass that `Charset` to the `String` constructor.

The example initializes `TikaInputStream` from an `InputStream`, but there are several other ways to do it, so please refer to the respective documentation. The same goes for initializing the `String`.

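// Kotlin has its own Metadata (kotlin.Metadata), so import Tika's class under an alias.
import org.apache.tika.metadata.Metadata as TikaMetadata
import java.io.InputStream
import java.nio.charset.Charset
import org.apache.tika.io.TikaInputStream
import org.apache.tika.parser.txt.UniversalEncodingDetector

/**
 * Detects the character encoding of the input; returns null if none could be determined.
 */
fun getCharset(input: InputStream, metadata: TikaMetadata): Charset? {
    val encodingDetector = UniversalEncodingDetector()
    return TikaInputStream.get(input)
            .let { encodingDetector.detect(it, metadata) }
}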

val metadata = TikaMetadata()

val charset = getCharset(/* some InputStream, etc. */, metadata)

if (charset == null) throw Exception("The character encoding could not be detected.")

val result = String(/* ByteArray, etc. */, charset)
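
The snippet above leaves the `InputStream` and `ByteArray` as placeholders. One way to fill them in, as a minimal sketch, is to hold the whole file in memory as a `ByteArray` and use the same bytes for both detection and decoding. The function name `decodeWithDetectedCharset` and the `fallback` parameter are hypothetical additions, not part of the original code, which throws when detection fails. This sketch also passes a plain `ByteArrayInputStream` to the detector instead of wrapping it in `TikaInputStream`, on the assumption that a stream supporting mark/reset is all the detector needs.

```kotlin
import java.io.ByteArrayInputStream
import java.nio.charset.Charset
import org.apache.tika.metadata.Metadata as TikaMetadata
import org.apache.tika.parser.txt.UniversalEncodingDetector

/**
 * Hypothetical helper: detects the encoding of [bytes] with Tika and decodes them.
 * Uses [fallback] (an assumption of this sketch) when the detector returns null.
 */
fun decodeWithDetectedCharset(bytes: ByteArray, fallback: Charset = Charsets.UTF_8): String {
    // ByteArrayInputStream supports mark/reset, which the detector uses to look ahead.
    val detected: Charset? = ByteArrayInputStream(bytes).use { input ->
        UniversalEncodingDetector().detect(input, TikaMetadata())
    }
    // Decode the same bytes that were inspected, so nothing has to be read twice.
    return String(bytes, detected ?: fallback)
}
```

For a file on disk this could be called as `decodeWithDetectedCharset(File(path).readBytes())`, where `path` is whatever file you want to read. Reading everything into memory keeps the sketch simple but is only reasonable for files of modest size.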

Impressions after trying it

Shift-JIS and `UTF-8` (with and without a BOM) were handled correctly. However, for `Shift-JIS` files that contained little Japanese, detection sometimes failed and the text came out garbled.

Without prior information, character encoding detection has to rely on statistical methods, so I think occasional failures are unavoidable with any method. Keep in mind that detection can fail.
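
One detail worth knowing when a UTF-8 file has a BOM: the JDK's UTF-8 decoder keeps the BOM as a leading U+FEFF character in the resulting `String`. The following sketch uses only the standard library to check for the BOM bytes and to drop that character after decoding. The helper names `hasUtf8Bom` and `decodeWithoutBom` are hypothetical and not part of Tika or the original code.

```kotlin
import java.nio.charset.Charset

// The UTF-8 encoded form of U+FEFF, which some files carry as their first three bytes.
val UTF8_BOM = byteArrayOf(0xEF.toByte(), 0xBB.toByte(), 0xBF.toByte())

/** Hypothetical helper: true when [bytes] starts with the UTF-8 BOM. */
fun hasUtf8Bom(bytes: ByteArray): Boolean =
    bytes.size >= UTF8_BOM.size && UTF8_BOM.indices.all { bytes[it] == UTF8_BOM[it] }

/** Hypothetical helper: decodes [bytes] with [charset] and drops a leading U+FEFF if present. */
fun decodeWithoutBom(bytes: ByteArray, charset: Charset): String =
    String(bytes, charset).removePrefix("\uFEFF")
```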
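
Since a detected `Charset` can be wrong, one possible safeguard is to decode strictly and treat a decoding error as a sign of a bad guess. The sketch below uses only the JDK's `CharsetDecoder`; the function name `decodeStrictOrNull` is hypothetical. A failure here rules a charset out, but success does not prove the guess right: bytes that are pure ASCII decode without error as both Shift-JIS and UTF-8, which illustrates why text with little Japanese gives a detector so little to work with.

```kotlin
import java.nio.ByteBuffer
import java.nio.charset.CharacterCodingException
import java.nio.charset.Charset
import java.nio.charset.CodingErrorAction

/**
 * Hypothetical helper: decodes [bytes] as [charset], returning null instead of
 * silently substituting replacement characters when the bytes are not valid.
 */
fun decodeStrictOrNull(bytes: ByteArray, charset: Charset): String? =
    try {
        charset.newDecoder()
            .onMalformedInput(CodingErrorAction.REPORT)       // invalid byte sequences raise an error
            .onUnmappableCharacter(CodingErrorAction.REPORT)  // so do sequences with no character mapping
            .decode(ByteBuffer.wrap(bytes))
            .toString()
    } catch (e: CharacterCodingException) {
        null
    }
```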
