Guessing the character encoding in Java

Introduction

This article uses juniversalchardet to build two sample programs:

- Guess the character encoding of a file, decode it, and print it to the console.
- Decode a URL-encoded string.

juniversalchardet is a Java port of Mozilla's universalchardet library. It infers the character encoding from how frequently particular byte patterns occur. For Japanese, it currently supports ISO-2022-JP, SHIFT_JIS, and EUC-JP.
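To see why the encoding has to be guessed at all, it helps to decode one byte sequence with several charsets. The following is a minimal sketch using only the JDK; the class name `SameBytesDifferentText` and the helper `decode` are made up for illustration, and it assumes a runtime that ships the Shift_JIS and EUC-JP charsets, as a standard OpenJDK does.

```java
import java.nio.charset.Charset;

public class SameBytesDifferentText {
  // Decodes the same byte sequence with the charset of the given name.
  public static String decode(byte[] bytes, String charsetName) {
    return new String(bytes, Charset.forName(charsetName));
  }

  public static void main(String[] args) {
    // The two hiragana U+3042 U+3044 encoded as Shift_JIS: 0x82 0xA0 0x82 0xA2
    byte[] bytes = {(byte) 0x82, (byte) 0xA0, (byte) 0x82, (byte) 0xA2};
    for (String name : new String[] {"Shift_JIS", "EUC-JP", "ISO-8859-1"}) {
      // Only the Shift_JIS line shows the intended text.
      System.out.println(name + ": " + decode(bytes, name));
    }
  }
}
```

The bytes themselves carry no label saying which charset produced them, which is why a detector has to rely on statistics.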

Development environment

OpenJDK 11
Maven 3.6

Preparation

Add the following to your Maven dependencies.

`pom.xml`


<dependency>
    <groupId>com.googlecode.juniversalchardet</groupId>
    <artifactId>juniversalchardet</artifactId>
    <version>1.0.3</version>
</dependency>

Sample 1. Read file

Detector class

For versatility, the method takes an InputStream as its argument. Note that detection consumes the stream: the read position of the InputStream passed in is advanced. If the data given to UniversalDetector consists only of single-byte characters, the detection result is null. In that case, this method returns the environment's default charset.

`Detector.java`


import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.Charset;
import org.mozilla.universalchardet.UniversalDetector;

public class Detector {
  public static Charset getCharsetName(InputStream is) throws IOException {
    //Allocate a 4kb memory buffer
    byte[] buf = new byte[4096];
    UniversalDetector detector = new UniversalDetector(null);

    //Continue reading InputStream until the character code guess result is obtained.
    int nread;
    while ((nread = is.read(buf)) > 0 && !detector.isDone()) {
      detector.handleData(buf, 0, nread);
    }
    
    //Get guess results
    detector.dataEnd();
    final String detectedCharset = detector.getDetectedCharset();
    
    detector.reset();

    if (detectedCharset != null) {
      //Use the saved name: after reset() the detector no longer holds the result
      return Charset.forName(detectedCharset);
    }
    //If the character code cannot be obtained, use the environment default
    return Charset.forName(System.getProperty("file.encoding"));
  }
}
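The null fallback in the Detector class can also be isolated into a small helper that is easy to test without the detector. This is only a sketch: `CharsetFallback` and `resolve` are hypothetical names, and the extra guard against charset names the JVM does not support is an assumption added here, not something the sample above does.

```java
import java.nio.charset.Charset;

public class CharsetFallback {
  // Maps a detector result (possibly null) to a usable Charset.
  public static Charset resolve(String detectedName, Charset fallback) {
    if (detectedName == null) {
      return fallback; // nothing was detected
    }
    try {
      if (Charset.isSupported(detectedName)) {
        return Charset.forName(detectedName);
      }
    } catch (IllegalArgumentException e) {
      // The name is syntactically illegal; fall through to the fallback.
    }
    return fallback;
  }

  public static void main(String[] args) {
    System.out.println(resolve(null, Charset.defaultCharset()));
    System.out.println(resolve("utf-8", Charset.defaultCharset()));
  }
}
```

Passing the fallback as a parameter keeps the choice of default (environment default, UTF-8, or something else) with the caller.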

Main class

This class determines the character encoding of the file and prints the file's contents to the console. FileInputStream does not support mark / reset, so separate instances are created for encoding detection and for console output.

`Main.java`


import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.Charset;

public class Main {
  public static void main(String[] args) throws IOException {
    final String path = "./test.txt";

    Charset cs;
    try (FileInputStream fis = new FileInputStream(path)) {
      cs = Detector.getCharsetName(fis);
      System.out.println("charset:" + cs);
    }

    try (BufferedReader br = new BufferedReader(new InputStreamReader(new FileInputStream(path), cs))) {
      br.lines().forEach(System.out::println);
    }
  }
}
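As an alternative to opening the file twice, a stream that does support mark / reset, such as BufferedInputStream, can be rewound after its first bytes have been inspected. The sketch below shows only the rewind mechanics; `PeekSketch` and `peek` are hypothetical names, a ByteArrayInputStream stands in for the file so that the example is self-contained, and `readNBytes(int)` requires Java 11 or later.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

public class PeekSketch {
  // Reads up to `limit` bytes for inspection, then rewinds the stream to where it was.
  public static byte[] peek(BufferedInputStream in, int limit) throws IOException {
    in.mark(limit);                      // remember the current position
    byte[] head = in.readNBytes(limit);  // may return fewer bytes at end of stream
    in.reset();                          // go back to the marked position
    return head;
  }

  public static void main(String[] args) throws IOException {
    byte[] data = "hello world".getBytes(StandardCharsets.US_ASCII);
    try (BufferedInputStream in = new BufferedInputStream(new ByteArrayInputStream(data))) {
      byte[] head = peek(in, 5);
      System.out.println(new String(head, StandardCharsets.US_ASCII));             // hello
      System.out.println(new String(in.readAllBytes(), StandardCharsets.US_ASCII)); // hello world
    }
  }
}
```

The trade-off is that only the first `limit` bytes are available for guessing, so a file whose multi-byte characters appear later may need a larger limit or the two-pass approach used above.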

Execution example

`Execution result`


charset:Shift_JIS
あいうえお

Sample 2. URL decoding

This sample additionally uses Apache Commons Codec.

`pom.xml`


<dependency>
    <groupId>commons-codec</groupId>
    <artifactId>commons-codec</artifactId>
    <version>1.12</version>
</dependency>

Detector class

`Detector.java`


import java.nio.charset.Charset;
import org.mozilla.universalchardet.UniversalDetector;

public class Detector {
  public static Charset getCharsetName(byte[] bytes) {
    UniversalDetector detector = new UniversalDetector(null);
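    //If the input is too short the detector cannot decide, so feed the same bytes repeatedly.
    //Cap the repetitions (100 is an arbitrary limit) so that input that never produces
    //a result, for example ASCII-only or empty data, cannot loop forever.
    for (int i = 0; i < 100 && !detector.isDone(); i++) {
      detector.handleData(bytes, 0, bytes.length);
      detector.dataEnd();
    }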
    final String charsetName = detector.getDetectedCharset();
    detector.reset();
    if (charsetName != null) {
      return Charset.forName(charsetName);
    }
    return Charset.forName(System.getProperty("file.encoding"));
  }
}

Main class

`Main.java`


import java.io.UnsupportedEncodingException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import org.apache.commons.codec.DecoderException;
import org.apache.commons.codec.net.URLCodec;

public class Main {
  public static void main(String[] args) throws DecoderException, UnsupportedEncodingException {
    final String str = "%82%a0%82%a2%82%a4%82%a6%82%a8";
    //Parse URL-encoded strings into byte arrays
    byte[] bytes = new URLCodec()
        .decode(str, StandardCharsets.ISO_8859_1.name())
        .getBytes(StandardCharsets.ISO_8859_1.name());
    
    Charset cs = Detector.getCharsetName(bytes);
    System.out.println("charset:"+cs);

    //Decode the byte array into a string using the detected charset
    final String s = new String(bytes, cs);
    System.out.println(s);
  }
}
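The step that turns a URL-encoded string into raw bytes can also be written by hand, which makes it clear that no charset is involved until the bytes are decoded. The following is a rough sketch using only the JDK, not a description of how URLCodec works internally; `PercentDecodeSketch` and `percentDecode` are hypothetical names, and the sketch assumes that every unescaped character in the input is ASCII.

```java
import java.io.ByteArrayOutputStream;

public class PercentDecodeSketch {
  // Converts %XX escapes to raw bytes; '+' becomes a space, as in form encoding.
  public static byte[] percentDecode(String s) {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    for (int i = 0; i < s.length(); i++) {
      char c = s.charAt(i);
      if (c == '%') {
        if (i + 2 >= s.length()) {
          throw new IllegalArgumentException("Truncated escape at index " + i);
        }
        int hi = Character.digit(s.charAt(i + 1), 16);
        int lo = Character.digit(s.charAt(i + 2), 16);
        if (hi < 0 || lo < 0) {
          throw new IllegalArgumentException("Invalid escape at index " + i);
        }
        out.write((hi << 4) | lo); // one raw byte per escape
        i += 2;                    // skip the two hex digits
      } else if (c == '+') {
        out.write(' ');
      } else {
        out.write(c); // assumption: unescaped characters are ASCII
      }
    }
    return out.toByteArray();
  }

  public static void main(String[] args) {
    byte[] bytes = percentDecode("%82%a0%82%a2%82%a4%82%a6%82%a8");
    System.out.println(bytes.length); // 10
  }
}
```

The resulting byte array can then be handed to the detector exactly as in the Main class above.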

Execution example

`Execution result`


charset:Shift_JIS
あいうえお
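A guess made from input this short can be wrong, so it may be worth checking that the bytes at least decode without errors under the guessed charset. The sketch below uses the JDK's CharsetDecoder in strict mode; `StrictDecodeCheck` and `decodesCleanly` are hypothetical names. Passing the check does not prove the guess is right (ISO-8859-1, for instance, accepts any byte sequence), but failing it shows the guess is wrong.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class StrictDecodeCheck {
  // Returns true if the bytes decode under cs with no malformed or unmappable input.
  public static boolean decodesCleanly(byte[] bytes, Charset cs) {
    try {
      cs.newDecoder()
          .onMalformedInput(CodingErrorAction.REPORT)      // throw instead of substituting U+FFFD
          .onUnmappableCharacter(CodingErrorAction.REPORT)
          .decode(ByteBuffer.wrap(bytes));
      return true;
    } catch (CharacterCodingException e) {
      return false;
    }
  }

  public static void main(String[] args) {
    byte[] bytes = {(byte) 0x82, (byte) 0xA0};
    System.out.println(decodesCleanly(bytes, StandardCharsets.UTF_8));      // false
    System.out.println(decodesCleanly(bytes, StandardCharsets.ISO_8859_1)); // true
  }
}
```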

Conclusion

Note that if the input is too short, the detector may guess the wrong encoding. Character encodings are troublesome, and because the problem mostly affects languages written with multi-byte characters, there is not much information about it in English. I hope you find this article helpful.