Use `Apache Tika Parsers` to read a file as a Java `String` regardless of the file's `Charset` (character encoding).
Please get `Apache Tika Parsers` from the following.
This time, I use `UniversalEncodingDetector()` to get the `Charset`, and then call the `String` constructor with that `Charset`.
In the example, an `InputStream` or similar is used to initialize `TikaInputStream`, but there are several ways to do it, so please refer to the respective documentation. The same goes for initializing `String`.
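// Kotlin also has a type named Metadata, so import Tika's class under an alias.
import java.io.InputStream
import java.nio.charset.Charset
import org.apache.tika.io.TikaInputStream
import org.apache.tika.metadata.Metadata as TikaMetadata
import org.apache.tika.parser.txt.UniversalEncodingDetector

/**
 * Detects the character encoding of the input, or returns null if it cannot be determined.
 */
fun getCharset(input: InputStream, metadata: TikaMetadata): Charset? {
    val encodingDetector = UniversalEncodingDetector()
    return TikaInputStream.get(input)
        .let { encodingDetector.detect(it, metadata) }
}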
val metadata = TikaMetadata()
val charset = getCharset(/* some kind of InputStream, etc. */, metadata)
    ?: throw IllegalStateException("The character encoding could not be detected.")
val result = String(/* ByteArray, etc. */, charset)
I was able to handle `Shift-JIS` and `UTF-8` (with and without a BOM) without problems. However, for `Shift-JIS` files that do not contain much Japanese, detection sometimes failed and the characters came out garbled.
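As a rough illustration of why files with little Japanese are hard to detect, here is a small sketch that uses only the Java standard library, not Tika. ASCII characters encode to the same bytes in `Shift-JIS` and `UTF-8`, so only the few bytes with the high bit set give a detector something to work with. The helper names `encodesIdentically` and `highBitByteCount` are made up for this illustration, and the sketch assumes the JDK provides the `Shift_JIS` charset.

```kotlin
import java.nio.charset.Charset

// Assumption: the running JDK ships the Shift_JIS charset (standard JDKs do).
val SHIFT_JIS: Charset = Charset.forName("Shift_JIS")

// True when the text produces exactly the same bytes in Shift-JIS and UTF-8.
// This holds for ASCII-only text, which therefore carries no hint about the encoding.
fun encodesIdentically(text: String): Boolean =
    text.toByteArray(SHIFT_JIS).contentEquals(text.toByteArray(Charsets.UTF_8))

// Number of bytes with the high bit set (0x80 or above) when the text is encoded.
// Kotlin's Byte is signed, so such bytes show up as negative values.
fun highBitByteCount(text: String, charset: Charset): Int =
    text.toByteArray(charset).count { it < 0 }
```

For example, `encodesIdentically("version 1.0")` is true, while a mostly ASCII line such as `"version 1.0 テスト"` yields only a handful of high-bit bytes from `highBitByteCount(..., SHIFT_JIS)`, which is very little evidence for a statistical detector.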
Detecting a character encoding without any prior information has to rely on statistical methods, so I think occasional failures are unavoidable whichever method is used. Still, we should keep in mind that detection can fail.
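One way to notice some of these failures is to decode strictly, so that byte sequences that are invalid in the detected `Charset` are reported instead of being silently replaced. The following is a minimal sketch using the standard `CharsetDecoder`; the function name `decodeStrictOrNull` is hypothetical and it is not part of Tika. It only catches the cases where the bytes are actually invalid in the chosen encoding; a wrong guess that happens to be valid will still produce garbled text.

```kotlin
import java.nio.ByteBuffer
import java.nio.charset.CharacterCodingException
import java.nio.charset.Charset
import java.nio.charset.CodingErrorAction

// Decodes the bytes with the given charset, returning null if any byte
// sequence is malformed or unmappable in that charset.
fun decodeStrictOrNull(bytes: ByteArray, charset: Charset): String? {
    val decoder = charset.newDecoder()
        .onMalformedInput(CodingErrorAction.REPORT)
        .onUnmappableCharacter(CodingErrorAction.REPORT)
    return try {
        decoder.decode(ByteBuffer.wrap(bytes)).toString()
    } catch (e: CharacterCodingException) {
        null
    }
}
```

If the result is `null` for the detected `Charset`, the caller could fall back to another candidate encoding or report the error, instead of passing garbled text along.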