In this article, I used juniversalchardet.
I created two sample programs:
- Guess the character encoding of a file, decode it, and display it on the console.
- Decode a URL-encoded string.
juniversalchardet is a Java port of universalchardet, the encoding detector from Mozilla. It infers the character encoding from how frequently certain byte patterns occur. For Japanese, it currently supports ISO-2022-JP, Shift_JIS, and EUC-JP.
Add the following to your Maven dependencies:
pom.xml
<dependency>
    <groupId>com.googlecode.juniversalchardet</groupId>
    <artifactId>juniversalchardet</artifactId>
    <version>1.0.3</version>
</dependency>
For versatility, the method takes an InputStream as its argument. Note that reading advances the InputStream that is passed in, so the caller cannot re-read the same bytes from it afterwards. If the data fed to UniversalDetector consists entirely of single-byte characters, the detection result is null. In that case, this sample returns the environment's default charset.
Detector.java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.Charset;
import org.mozilla.universalchardet.UniversalDetector;

public class Detector {
    public static Charset getCharsetName(InputStream is) throws IOException {
        // Allocate a 4 KB read buffer
        byte[] buf = new byte[4096];
        UniversalDetector detector = new UniversalDetector(null);
        // Keep feeding the detector until it reaches a conclusion or the stream ends
        int nread;
        while (!detector.isDone() && (nread = is.read(buf)) > 0) {
            detector.handleData(buf, 0, nread);
        }
        // Signal the end of input, then fetch the guess
        detector.dataEnd();
        final String detectedCharset = detector.getDetectedCharset();
        detector.reset();
        if (detectedCharset != null) {
            // Use the saved name: after reset() the detector no longer holds a result
            return Charset.forName(detectedCharset);
        }
        // If no encoding could be detected, fall back to the environment default
        return Charset.defaultCharset();
    }
}
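The fallback described above (use the environment default when the detector returns null) can be isolated into a small helper. The following is a minimal sketch using only the JDK. The class name `CharsetFallback` and the method `resolve` are hypothetical names introduced for illustration and are not part of juniversalchardet. As an extra assumption, it also falls back when the detected name is not supported by the running JVM.

```java
import java.nio.charset.Charset;

// Hypothetical helper for illustration; not part of juniversalchardet.
final class CharsetFallback {

    // Maps a detected charset name to a Charset. Falls back to the JVM default
    // when the name is null (nothing detected) or unknown to this JVM.
    static Charset resolve(String detectedName) {
        if (detectedName == null) {
            return Charset.defaultCharset();
        }
        try {
            return Charset.forName(detectedName);
        } catch (IllegalArgumentException e) {
            // Covers both IllegalCharsetNameException and UnsupportedCharsetException
            return Charset.defaultCharset();
        }
    }

    public static void main(String[] args) {
        System.out.println(resolve("UTF-8"));
        System.out.println(resolve(null));
        System.out.println(resolve("NO-SUCH-CHARSET"));
    }
}
```

Whether silently falling back is appropriate depends on the application; logging the unresolved name may be preferable when misdetection needs to be noticed.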
The first sample detects the character encoding of a file and prints the file's contents to the console. FileInputStream does not support mark/reset, so the sample opens one instance for detection and a second one for output.
Main.java
import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.Charset;

public class Main {
    public static void main(String[] args) throws IOException {
        final String path = "./test.txt";
        Charset cs;
        // First pass: read the file only to detect its encoding
        try (FileInputStream fis = new FileInputStream(path)) {
            cs = Detector.getCharsetName(fis);
            System.out.println("charset:" + cs);
        }
        // Second pass: reopen the file and decode it with the detected charset
        try (BufferedReader br = new BufferedReader(new InputStreamReader(new FileInputStream(path), cs))) {
            br.lines().forEach(s -> System.out.println(s));
        }
    }
}
Execution result
charset:Shift_JIS
あいうえお
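As an alternative to opening the file twice, a stream can be wrapped in BufferedInputStream, which does support mark/reset. The following is a minimal sketch under that assumption. The class `MarkResetSketch` and the method `peek` are hypothetical names, and the detector call is left out so that the example runs with the JDK alone.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper for illustration; not part of the article's samples.
final class MarkResetSketch {

    // Reads up to `limit` bytes for inspection, then rewinds the stream so the
    // same bytes can be read again. Requires a stream that supports mark/reset.
    static byte[] peek(InputStream in, int limit) {
        if (!in.markSupported()) {
            throw new IllegalArgumentException("stream does not support mark/reset");
        }
        try {
            in.mark(limit);
            byte[] head = new byte[limit];
            int total = 0;
            int n;
            while (total < limit && (n = in.read(head, total, limit - total)) > 0) {
                total += n;
            }
            in.reset();
            return Arrays.copyOf(head, total);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] data = "hello, world".getBytes(StandardCharsets.UTF_8);
        try (InputStream in = new BufferedInputStream(new ByteArrayInputStream(data))) {
            // First pass: this is where the bytes would be handed to a detector.
            byte[] head = peek(in, 5);
            System.out.println(new String(head, StandardCharsets.UTF_8));
            // Second pass: the stream starts from the beginning again.
            byte[] all = peek(in, data.length);
            System.out.println(new String(all, StandardCharsets.UTF_8));
        }
    }
}
```

The trade-off is that the detector only sees the first `limit` bytes, and the buffer has to hold that many bytes in memory. For large files, opening the file twice as in the sample above keeps memory use constant.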
The second sample decodes a URL-encoded string. In addition to juniversalchardet, it uses Apache Commons Codec.
pom.xml
<dependency>
    <groupId>commons-codec</groupId>
    <artifactId>commons-codec</artifactId>
    <version>1.12</version>
</dependency>
Detector.java
import java.nio.charset.Charset;
import org.mozilla.universalchardet.UniversalDetector;

public class Detector {
    // Upper bound on how often the input is re-fed (an arbitrary value chosen for this sample)
    private static final int MAX_PASSES = 16;

    public static Charset getCharsetName(byte[] bytes) {
        UniversalDetector detector = new UniversalDetector(null);
        // A short input may not give the detector enough evidence, so feed the same bytes repeatedly.
        // The pass limit prevents an endless loop when the detector never reaches a conclusion,
        // which can happen with ASCII-only or empty input.
        for (int i = 0; i < MAX_PASSES && !detector.isDone(); i++) {
            detector.handleData(bytes, 0, bytes.length);
            detector.dataEnd();
        }
        final String charsetName = detector.getDetectedCharset();
        detector.reset();
        if (charsetName != null) {
            return Charset.forName(charsetName);
        }
        // If no encoding could be detected, fall back to the environment default
        return Charset.defaultCharset();
    }
}
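The comment in the detector above notes that a short input cannot be guessed reliably. One rough way to see why is that the same few bytes can be structurally valid in more than one encoding. The sketch below checks this with a strict JDK decoder. It is only an illustration and is not how juniversalchardet works internally. The class `StrictDecodeCheck` and the method `isValid` are hypothetical names, and whether Shift_JIS and EUC-JP are available depends on the JVM, so the example checks `Charset.isSupported` first.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;

// Hypothetical helper for illustration; not part of juniversalchardet.
final class StrictDecodeCheck {

    // Returns true when `bytes` decode under `cs` without any malformed or
    // unmappable sequence. Validity alone does not prove the encoding is right.
    static boolean isValid(byte[] bytes, Charset cs) {
        try {
            cs.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // The first four bytes of the URL-encoded sample string
        byte[] bytes = {(byte) 0x82, (byte) 0xa0, (byte) 0x82, (byte) 0xa2};
        for (String name : new String[] {"UTF-8", "ISO-8859-1", "Shift_JIS", "EUC-JP"}) {
            if (Charset.isSupported(name)) {
                System.out.println(name + ": " + isValid(bytes, Charset.forName(name)));
            }
        }
    }
}
```

When several candidates accept the bytes, a detector has to rely on statistics, and a few bytes give it little to work with.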
Main.java
import java.io.UnsupportedEncodingException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import org.apache.commons.codec.DecoderException;
import org.apache.commons.codec.net.URLCodec;

public class Main {
    public static void main(String[] args) throws DecoderException, UnsupportedEncodingException {
        final String str = "%82%a0%82%a2%82%a4%82%a6%82%a8";
        // Decode the URL-encoded string into a byte array.
        // ISO-8859-1 maps every byte to exactly one char, so the round trip preserves the raw bytes.
        byte[] bytes = new URLCodec()
                .decode(str, StandardCharsets.ISO_8859_1.name())
                .getBytes(StandardCharsets.ISO_8859_1.name());
        Charset cs = Detector.getCharsetName(bytes);
        System.out.println("charset:" + cs);
        // Convert the byte array to a string using the detected charset
        final String s = new String(bytes, cs);
        System.out.println(s);
    }
}
Execution result
charset:Shift_JIS
あいうえお
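For reference, the percent-decoding step that the sample delegates to URLCodec can be sketched with the JDK alone. This is a simplified illustration, not Commons Codec's implementation. The class `PercentDecoder` is a hypothetical name, and the sketch assumes that every character outside a `%XX` escape is plain ASCII.

```java
import java.io.ByteArrayOutputStream;

// Hypothetical helper for illustration; not the Commons Codec implementation.
final class PercentDecoder {

    // Decodes %XX escapes (and '+' as a space) into raw bytes. All other
    // characters are assumed to be ASCII and are copied through unchanged.
    static byte[] decode(String s) {
        ByteArrayOutputStream out = new ByteArrayOutputStream(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == '%') {
                if (i + 2 >= s.length()) {
                    throw new IllegalArgumentException("incomplete escape at index " + i);
                }
                int hi = Character.digit(s.charAt(i + 1), 16);
                int lo = Character.digit(s.charAt(i + 2), 16);
                if (hi < 0 || lo < 0) {
                    throw new IllegalArgumentException("invalid escape at index " + i);
                }
                out.write((hi << 4) | lo);
                i += 2; // skip the two hex digits just consumed
            } else if (c == '+') {
                out.write(' ');
            } else {
                out.write(c);
            }
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[] bytes = decode("%82%a0%82%a2%82%a4%82%a6%82%a8");
        StringBuilder hex = new StringBuilder();
        for (byte b : bytes) {
            hex.append(String.format("%02x ", b & 0xff));
        }
        System.out.println(hex.toString().trim());
    }
}
```

The resulting byte array is what gets passed to the detector. No charset is involved until the bytes are turned into a String.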
Please note that if the input string is too short, the detector may misidentify the encoding. Character encodings are troublesome, and information in English is relatively scarce because the problem mostly affects languages that use multi-byte encodings. I hope you find this article helpful.