[Java] Get Charset with Apache Tika / Initialize String from Charset [Kotlin]

What to do

Use `Apache Tika Parsers` to read a file as a Java `String` regardless of the file's `Charset` (character encoding).

Please get `Apache Tika Parsers` from the following.

Method

This time, I use `UniversalEncodingDetector()` to get the `Charset`, then call the `String` constructor with that `Charset`.

In the example, `TikaInputStream` is initialized from an `InputStream` or similar, but there are several ways to do it, so please refer to the respective documentation. The same applies to initializing the `String`.

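// The name Metadata collides with Kotlin's own Metadata, so import it under an alias.
import org.apache.tika.io.TikaInputStream
import org.apache.tika.metadata.Metadata as TikaMetadata
import org.apache.tika.parser.txt.UniversalEncodingDetector
import java.io.InputStream
import java.nio.charset.Charset

/**
 * Detects the character encoding of the input; returns null if it cannot be determined.
 */
fun getCharset(input: InputStream, metadata: TikaMetadata): Charset? {
    val encodingDetector = UniversalEncodingDetector()
    return TikaInputStream.get(input)
            .let { encodingDetector.detect(it, metadata) }
}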

val metadata = TikaMetadata()

val charset = getCharset(/*Some kind of InputStream etc.*/, metadata)

if (charset == null) throw Exception("The character encoding could not be detected.")

val result = String(/*ByteArray etc.*/, charset)
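
For reference, the steps above can be put together roughly as follows. This is only a minimal sketch under a few assumptions of mine: the helper names `decodeWithDetectedCharset` and `readTextDetectingCharset` are hypothetical, the whole file is read into memory first, and UTF-8 is used as a fallback when detection returns `null` (the original example throws instead). It also assumes the Tika parsers dependency is on the classpath.

```kotlin
import org.apache.tika.io.TikaInputStream
import org.apache.tika.metadata.Metadata as TikaMetadata
import org.apache.tika.parser.txt.UniversalEncodingDetector
import java.nio.charset.Charset
import java.nio.file.Files
import java.nio.file.Path

/**
 * Hypothetical helper: decodes [bytes] with the detected charset,
 * or with [fallback] (an assumption of this sketch) when detection returns null.
 */
fun decodeWithDetectedCharset(bytes: ByteArray, fallback: Charset = Charsets.UTF_8): String {
    val detector = UniversalEncodingDetector()
    // TikaInputStream.get(ByteArray) wraps the bytes in a stream that supports
    // mark/reset, which detect() uses to peek at the start of the data.
    val detected: Charset? = TikaInputStream.get(bytes).use { stream ->
        detector.detect(stream, TikaMetadata())
    }
    return String(bytes, detected ?: fallback)
}

/** Hypothetical helper: reads the whole file into memory, then decodes it. */
fun readTextDetectingCharset(path: Path): String =
    decodeWithDetectedCharset(Files.readAllBytes(path))
```

Keeping the raw bytes around means the same data can be used for both detection and decoding, so the stream does not have to be read twice. For large files this may not be appropriate.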

Impressions after trying it

I was able to handle `Shift-JIS` and `UTF-8` (with and without a BOM) without problems. However, for `Shift-JIS` files that contain little Japanese, detection sometimes failed and the characters were garbled.

Detecting a character encoding without any prior information has to rely on statistical methods, so I think some failures are unavoidable whichever approach is used. It is worth keeping in mind that detection can fail.
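
To illustrate why text with little Japanese is hard to detect, here is a small sketch that uses only the standard library (no Tika). The function names and sample strings are made up for illustration. ASCII-only text encodes to exactly the same bytes in Shift_JIS and UTF-8, so a detector only has the few non-ASCII bytes to work with.

```kotlin
import java.nio.charset.Charset

val SHIFT_JIS: Charset = Charset.forName("Shift_JIS")

/** True when [text] encodes to identical bytes in Shift_JIS and UTF-8. */
fun sameBytesInBothEncodings(text: String): Boolean =
    text.toByteArray(SHIFT_JIS).contentEquals(text.toByteArray(Charsets.UTF_8))

/** Encodes [text] as Shift_JIS, decodes it with [guess], and reports whether the text survived. */
fun survivesDecodingAs(text: String, guess: Charset): Boolean =
    String(text.toByteArray(SHIFT_JIS), guess) == text

fun main() {
    // ASCII only: the bytes carry no hint about which encoding was used.
    println(sameBytesInBothEncodings("name,price,stock"))            // true
    // Two kanji: only four bytes differ from what UTF-8 would produce.
    println(sameBytesInBothEncodings("name,price,在庫"))              // false
    // A wrong guess garbles the text; the right one restores it.
    println(survivesDecodingAs("name,price,在庫", Charsets.UTF_8))    // false
    println(survivesDecodingAs("name,price,在庫", SHIFT_JIS))         // true
}
```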
