Guess the character encoding in Java

Introduction

This article uses juniversalchardet to build two sample programs:

- Guess the character encoding of a file, decode it, and print it to the console.
- Decode a URL-encoded string.

juniversalchardet is a Java port of Mozilla's universalchardet library. It guesses the character encoding from how frequently certain byte patterns occur. For Japanese, it currently supports ISO-2022-JP, Shift_JIS, and EUC-JP.

Development environment

Preparation

Add the following to your Maven dependencies.

pom.xml


<dependency>
    <groupId>com.googlecode.juniversalchardet</groupId>
    <artifactId>juniversalchardet</artifactId>
    <version>1.0.3</version>
</dependency>

Sample 1. Read file

Detector class

To keep the class general, the method takes an InputStream as its argument. Note that detection consumes the stream: the InputStream passed in has its read position advanced. If the input consists only of single-byte characters, UniversalDetector returns null as its detection result. In that case, this sample returns the environment's default charset.

Detector.java


import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.Charset;
import org.mozilla.universalchardet.UniversalDetector;

public class Detector {
  public static Charset getCharsetName(InputStream is) throws IOException {
    // Allocate a 4 KB read buffer
    byte[] buf = new byte[4096];
    UniversalDetector detector = new UniversalDetector(null);

    // Keep feeding the detector until it reaches a result or the stream ends.
    // isDone() is checked first so that no chunk is read and then discarded.
    int nread;
    while (!detector.isDone() && (nread = is.read(buf)) > 0) {
      detector.handleData(buf, 0, nread);
    }

    // Signal end of input and fetch the result BEFORE resetting the detector;
    // after reset() the detected charset is cleared.
    detector.dataEnd();
    final String detectedCharset = detector.getDetectedCharset();
    detector.reset();

    if (detectedCharset != null) {
      return Charset.forName(detectedCharset);
    }
    // If no encoding was detected, fall back to the environment default
    return Charset.defaultCharset();
  }
}
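
The null fallback described above can be isolated into a small helper that uses only the standard library. This is a minimal sketch, not part of juniversalchardet: the class name `CharsetFallback` and the method `toCharsetOrDefault` are hypothetical, and it assumes that falling back to `Charset.defaultCharset()` is also acceptable when the detected name is one the running JVM does not support.

```java
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;
import java.nio.charset.UnsupportedCharsetException;

final class CharsetFallback {
  // Hypothetical helper: maps a detector result (possibly null) to a Charset.
  static Charset toCharsetOrDefault(String detectedName) {
    if (detectedName == null) {
      // Nothing was detected, so use the environment default.
      return Charset.defaultCharset();
    }
    try {
      return Charset.forName(detectedName);
    } catch (IllegalCharsetNameException | UnsupportedCharsetException e) {
      // The name is malformed, or this JVM has no charset with that name.
      return Charset.defaultCharset();
    }
  }

  public static void main(String[] args) {
    System.out.println(toCharsetOrDefault("UTF-8"));
    System.out.println(toCharsetOrDefault(null));
    System.out.println(toCharsetOrDefault("NO-SUCH-CHARSET"));
  }
}
```

With this helper, the last lines of the detector method could be reduced to a single call that passes in the detected name.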

Main class

The Main class determines the file's character encoding and prints the file's contents to the console. FileInputStream does not support mark/reset, so the program opens one instance for detection and a second one for console output.

Main.java


import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.Charset;

public class Main {
  public static void main(String[] args) throws IOException {
    final String path = "./test.txt";

    Charset cs;
    try (FileInputStream fis = new FileInputStream(path)) {
      cs = Detector.getCharsetName(fis);
      System.out.println("charset:" + cs);
    }

    try (BufferedReader br = new BufferedReader(new InputStreamReader(new FileInputStream(path), cs))) {
      br.lines().forEach(System.out::println);
    }
  }
}

Execution example

Execution result


charset:Shift_JIS
AIUEO
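
As an alternative to opening the file twice, a stream that does support mark/reset can be rewound after inspection. The following is a minimal sketch using only the standard library: the class `MarkResetSketch` and the method `peek` are hypothetical names, and a `ByteArrayInputStream` stands in for the file. Wrapping a `FileInputStream` in a `BufferedInputStream` should work the same way.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class MarkResetSketch {
  // Reads up to `limit` bytes for inspection, then rewinds the stream
  // so that the caller can read the same bytes again.
  static byte[] peek(InputStream in, int limit) throws IOException {
    if (!in.markSupported()) {
      throw new IllegalArgumentException("stream does not support mark/reset");
    }
    in.mark(limit);
    byte[] head = new byte[limit];
    int total = 0;
    int n;
    while (total < limit && (n = in.read(head, total, limit - total)) > 0) {
      total += n;
    }
    in.reset();
    return Arrays.copyOf(head, total);
  }

  public static void main(String[] args) throws IOException {
    byte[] data = "hello, world".getBytes(StandardCharsets.US_ASCII);
    try (InputStream in = new BufferedInputStream(new ByteArrayInputStream(data))) {
      byte[] head = peek(in, 5);
      System.out.println(new String(head, StandardCharsets.US_ASCII)); // prints "hello"
      // readAllBytes() requires Java 9 or later
      System.out.println(new String(in.readAllBytes(), StandardCharsets.US_ASCII)); // prints "hello, world"
    }
  }
}
```

The mark limit has to cover every byte consumed during detection. Because the detector may read a large part of the file before it is done, reopening the file as the article does can be the simpler choice for big inputs.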

Sample 2. URL decoding

This sample additionally uses Apache Commons Codec.

pom.xml


<dependency>
    <groupId>commons-codec</groupId>
    <artifactId>commons-codec</artifactId>
    <version>1.12</version>
</dependency>

Detector class

Detector.java


import java.nio.charset.Charset;
import org.mozilla.universalchardet.UniversalDetector;

public class Detector {
  // Upper bound on how many times the same bytes are fed to the detector
  // (16 is an arbitrary choice for this sample)
  private static final int MAX_ATTEMPTS = 16;

  public static Charset getCharsetName(byte[] bytes) {
    UniversalDetector detector = new UniversalDetector(null);
    // A short input may not be enough for a guess, so feed it repeatedly.
    // The loop is capped because input the detector never settles on
    // (for example, plain ASCII or an empty array) might otherwise loop forever.
    for (int i = 0; i < MAX_ATTEMPTS && !detector.isDone(); i++) {
      detector.handleData(bytes, 0, bytes.length);
      detector.dataEnd();
    }
    final String charsetName = detector.getDetectedCharset();
    detector.reset();
    if (charsetName != null) {
      return Charset.forName(charsetName);
    }
    // If no encoding was detected, fall back to the environment default
    return Charset.defaultCharset();
  }
}

Main class

Main.java


import java.io.UnsupportedEncodingException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import org.apache.commons.codec.DecoderException;
import org.apache.commons.codec.net.URLCodec;

public class Main {
  public static void main(String[] args) throws DecoderException, UnsupportedEncodingException {
    final String str = "%82%a0%82%a2%82%a4%82%a6%82%a8";
    // Decode the URL-encoded string into a byte array.
    // ISO-8859-1 maps each byte to exactly one char, so the bytes survive the round trip.
    byte[] bytes = new URLCodec()
        .decode(str, StandardCharsets.ISO_8859_1.name())
        .getBytes(StandardCharsets.ISO_8859_1.name());

    Charset cs = Detector.getCharsetName(bytes);
    System.out.println("charset:" + cs);

    // Decode the byte array into a string using the detected charset
    final String s = new String(bytes, cs);
    System.out.println(s);
  }
}

Execution example

Execution result


charset:Shift_JIS
あいうえお
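
The percent-decoding step in Main relies on ISO-8859-1 mapping every byte value from 0x00 to 0xFF to the code point with the same value, which is why decoding to a String and encoding back yields the raw bytes. The same idea can be sketched with only the standard library's `java.net.URLDecoder` instead of Commons Codec. The class `PercentDecodeSketch` and the method `decodeToBytes` are hypothetical names, and the `Charset` overload of `URLDecoder.decode` used here requires Java 10 or later.

```java
import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;

final class PercentDecodeSketch {
  // Turns a percent-encoded string into the raw bytes it represents.
  // Assumes the input contains only ASCII characters and %XX escapes;
  // as with any form decoding, '+' is turned into a space.
  static byte[] decodeToBytes(String urlEncoded) {
    String latin1 = URLDecoder.decode(urlEncoded, StandardCharsets.ISO_8859_1);
    return latin1.getBytes(StandardCharsets.ISO_8859_1);
  }

  public static void main(String[] args) {
    byte[] bytes = decodeToBytes("%82%a0%82%a2%82%a4%82%a6%82%a8");
    StringBuilder hex = new StringBuilder();
    for (byte b : bytes) {
      hex.append(String.format("%02x ", b & 0xff));
    }
    System.out.println(hex.toString().trim()); // prints "82 a0 82 a2 82 a4 82 a6 82 a8"
  }
}
```

The resulting byte array can then be passed to the detector exactly as in the Commons Codec version.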

Conclusion

Note that detection can produce a wrong result when the input string is too short. Character encodings are troublesome, and because the problem mostly affects languages written with multi-byte characters, there is little detailed information about it in English. I hope you find this article helpful.
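
One way to see why short inputs are ambiguous is to check which charsets can decode the same bytes without error. The sketch below uses only the standard library and is independent of juniversalchardet. The class `StrictDecodeCheck`, the method `decodesCleanly`, and the candidate list are hypothetical choices for illustration, and charsets missing from the running JVM are skipped.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;

final class StrictDecodeCheck {
  // Returns true if `bytes` decode under `cs` with no malformed
  // or unmappable input.
  static boolean decodesCleanly(byte[] bytes, Charset cs) {
    try {
      cs.newDecoder()
          .onMalformedInput(CodingErrorAction.REPORT)
          .onUnmappableCharacter(CodingErrorAction.REPORT)
          .decode(ByteBuffer.wrap(bytes));
      return true;
    } catch (CharacterCodingException e) {
      return false;
    }
  }

  public static void main(String[] args) {
    // The first two bytes of the URL-decoding sample
    byte[] bytes = {(byte) 0x82, (byte) 0xa0};
    String[] candidates = {"UTF-8", "ISO-8859-1", "Shift_JIS", "windows-1252"};
    for (String name : candidates) {
      if (Charset.isSupported(name)) {
        System.out.println(name + ": " + decodesCleanly(bytes, Charset.forName(name)));
      }
    }
  }
}
```

ISO-8859-1 accepts every byte sequence, so a strict decode alone can never rule it out, while UTF-8 rejects these two bytes because 0x82 cannot start a character. When several candidates remain, a detector has to rely on statistics, and with only a few bytes those statistics are weak.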
