[Java] Get Charset with Apache Tika / Initialize String from Charset [Kotlin]

What to do

Use `Apache Tika Parsers` to read a file as a Java `String` regardless of the file's `Charset` (character encoding).

`Apache Tika Parsers` is required, so add it to your project as a dependency.


This time, I use `UniversalEncodingDetector()` to get the `Charset` and then call the `String` constructor with that `Charset`.

In the example, the `TikaInputStream` is initialized from an `InputStream`, but there are several ways to do it, so please refer to the respective documentation. The same goes for initializing the `String`.

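// Kotlin has its own Metadata annotation, so import Tika's Metadata under an alias.
import org.apache.tika.io.TikaInputStream
import org.apache.tika.metadata.Metadata as TikaMetadata
import org.apache.tika.parser.txt.UniversalEncodingDetector
import java.io.InputStream
import java.nio.charset.Charset

// Detects the character encoding of the input; returns null if it cannot be determined.
fun getCharset(input: InputStream, metadata: TikaMetadata): Charset? {
    val encodingDetector = UniversalEncodingDetector()
    return TikaInputStream.get(input)
            .let { encodingDetector.detect(it, metadata) }
}
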
val metadata = TikaMetadata()

val charset = getCharset(/*Some kind of InputStream etc.*/, metadata)

if (charset == null) throw Exception("The character encoding could not be detected.")

val result = String(/*ByteArray etc.*/, charset)
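
As one concrete way of filling in the placeholders above, here is a minimal sketch that reads a file into a `ByteArray` once and uses the same bytes for both detection and decoding. The helper name `readTextDetectingCharset` is something I made up for illustration and is not part of Tika. It assumes `TikaInputStream.get(ByteArray)` and the `org.apache.tika.parser.txt` package location of `UniversalEncodingDetector`, which may differ depending on the Tika version you use.

```kotlin
import org.apache.tika.io.TikaInputStream
import org.apache.tika.metadata.Metadata as TikaMetadata
import org.apache.tika.parser.txt.UniversalEncodingDetector
import java.nio.file.Files
import java.nio.file.Path

// Hypothetical helper (not part of Tika): reads a whole file as text
// using whatever charset UniversalEncodingDetector reports.
fun readTextDetectingCharset(path: Path): String {
    // Read the bytes once; they are used for detection and for decoding.
    val bytes = Files.readAllBytes(path)

    // TikaInputStream can also be created from a ByteArray instead of an InputStream.
    val charset = TikaInputStream.get(bytes).use { stream ->
        UniversalEncodingDetector().detect(stream, TikaMetadata())
    } ?: throw IllegalStateException("The character encoding could not be detected.")

    // Equivalent to String(bytes, charset).
    return bytes.toString(charset)
}
```

Holding the whole file in memory is fine for small text files; for large files it is probably better to stay with the `InputStream` based approach.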

Impressions after trying it

I was able to handle `Shift-JIS` and `UTF-8` (with and without a BOM) without problems. However, for `Shift-JIS` text that contains little Japanese, detection sometimes failed and the characters came out garbled.

Detecting a character encoding without any prior information has to rely on statistical methods, so this limitation applies to any approach. Still, I think it is worth keeping in mind that detection can fail.
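
One way to notice some of these failures is to decode strictly, so that bytes which are invalid in the detected charset raise an error instead of silently turning into garbled characters. The following is only a sketch using the JDK's `CharsetDecoder`, not something Tika does for you, and `decodeStrictOrNull` is a name I made up. It also assumes your JDK ships the `Shift_JIS` charset.

```kotlin
import java.nio.ByteBuffer
import java.nio.charset.CharacterCodingException
import java.nio.charset.Charset
import java.nio.charset.CodingErrorAction

// Hypothetical helper: decodes the bytes strictly and returns null
// when they are not valid in the given charset.
fun decodeStrictOrNull(bytes: ByteArray, charset: Charset): String? =
    try {
        charset.newDecoder()
            .onMalformedInput(CodingErrorAction.REPORT)
            .onUnmappableCharacter(CodingErrorAction.REPORT)
            .decode(ByteBuffer.wrap(bytes))
            .toString()
    } catch (e: CharacterCodingException) {
        null
    }

fun main() {
    val sjis = Charset.forName("Shift_JIS")
    val bytes = "日本語".toByteArray(sjis)

    // Shift_JIS bytes are not valid UTF-8 here, so a wrong guess is rejected.
    println(decodeStrictOrNull(bytes, Charsets.UTF_8))
    // Decoding with the right charset round-trips the text.
    println(decodeStrictOrNull(bytes, sjis))
}
```

This only catches wrong guesses that produce invalid byte sequences. If the wrongly detected charset happens to accept every byte, the text will still be garbled without any error, so it is a safety net and not a fix.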

