[Java] Obtaining a Charset with Apache Tika / Initializing a String from the Charset [Kotlin]


What we want to do

Read a file into a Java String regardless of the file's Charset (character encoding), using Apache Tika Parsers.

Apache Tika Parsers can be obtained from the following.

How

Here we use UniversalEncodingDetector to detect the Charset, then pass that Charset to the String constructor.

The example initializes the TikaInputStream from an InputStream, but there are several other ways to create one, so please refer to the documentation. The same goes for initializing the String.

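import java.io.InputStream
import java.nio.charset.Charset
import org.apache.tika.io.TikaInputStream
// Metadata is imported under an alias because the name clashes with Kotlin's own Metadata
import org.apache.tika.metadata.Metadata as TikaMetadata
import org.apache.tika.parser.txt.UniversalEncodingDetector

/**
 * Detects the character encoding of the input; returns null if it cannot be determined.
 */
fun getCharset(input: InputStream, metadata: TikaMetadata): Charset? {
    val encodingDetector = UniversalEncodingDetector()
    // Wrap the stream in a TikaInputStream, which supports the mark/reset the detector relies on
    return TikaInputStream.get(input)
            .let { encodingDetector.detect(it, metadata) }
}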

val metadata = TikaMetadata()

val charset = getCharset(/* some InputStream etc. */, metadata)

if (charset == null) throw Exception("Character code could not be obtained.")

val result = String(/* ByteArray etc. */, charset)

Impressions after trying it

Shift-JIS and UTF-8 (with or without a BOM) were handled correctly. However, for Shift-JIS files that contain little Japanese text, detection sometimes failed and the characters came out garbled.
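One way to see why files with little Japanese are hard to detect is to look at the bytes themselves. The sketch below uses only the Kotlin/Java standard library, not Tika, and the names shiftJis, highByteCount, and decodesIdentically are hypothetical helpers introduced for illustration. It assumes the runtime ships the Shift_JIS charset, which a full JDK normally does.

```kotlin
import java.nio.charset.Charset

// Assumption: the runtime provides Shift_JIS (true for a full JDK).
val shiftJis: Charset = Charset.forName("Shift_JIS")

/** Counts bytes >= 0x80, the only bytes that can hint at a non-ASCII encoding. */
fun highByteCount(bytes: ByteArray): Int =
    bytes.count { (it.toInt() and 0xFF) >= 0x80 }

/** True when the bytes produce the same text under Shift_JIS and UTF-8. */
fun decodesIdentically(bytes: ByteArray): Boolean =
    String(bytes, shiftJis) == String(bytes, Charsets.UTF_8)
```

For ASCII-only text such as "hello, world 123", highByteCount is 0 and decodesIdentically is true, so a detector has no evidence to work with. For text such as "日本語" encoded as Shift_JIS, the high bytes appear and the two decodings differ. A file with only a few such bytes gives a statistical detector very little signal.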

Character encoding detection without prior information can only rely on statistical methods, so I think this is a problem that no approach can fully avoid. It is worth keeping in mind that detection can fail.
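Since detection can fail, one possible safeguard is to decode strictly and treat malformed input as a sign that the guess was wrong. The following is a minimal sketch using the standard java.nio.charset API. decodeStrictOrNull is a hypothetical helper name, not part of Tika. Note that the String constructor silently substitutes replacement characters, while a CharsetDecoder configured with CodingErrorAction.REPORT throws instead.

```kotlin
import java.nio.ByteBuffer
import java.nio.charset.CharacterCodingException
import java.nio.charset.Charset
import java.nio.charset.CodingErrorAction

/**
 * Hypothetical helper: decodes [bytes] with [charset] and returns null
 * if the bytes are malformed or unmappable in that charset.
 */
fun decodeStrictOrNull(bytes: ByteArray, charset: Charset): String? =
    try {
        charset.newDecoder()
            .onMalformedInput(CodingErrorAction.REPORT)      // throw instead of inserting U+FFFD
            .onUnmappableCharacter(CodingErrorAction.REPORT)
            .decode(ByteBuffer.wrap(bytes))
            .toString()
    } catch (e: CharacterCodingException) {
        null
    }
```

For example, Shift_JIS bytes for "日本語" decoded strictly as UTF-8 return null, which signals that the detected Charset should not be trusted. This only catches byte sequences that are invalid in the guessed encoding. A wrong guess can still decode without error and produce garbled text, so it reduces the risk rather than removing it.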