Practice working with Unicode surrogate pairs in Java

After reviewing character encodings a little, I realized that I had never paid attention to Unicode surrogate pairs. So I practiced building strings that contain surrogate pairs and counting their characters in Java, the language I also use at work.

Up to Java 1.4, the standard API did not take surrogate pairs into account; Java 1.5 added APIs that do. The test code below therefore calls both the APIs available up to the 1.4 series and the ones added in 1.5, and compares their behavior.

First, let's express surrogate pairs with the char type. The high surrogate and the low surrogate of each pair are held in separate char variables and then put into an array.

char c1 = '\u3042'; // HIRAGANA LETTER A, cp=12354
char c2 = '\uD842'; // high surrogate of U+20BB7 (tsuchi-yoshi), cp=134071
char c3 = '\uDFB7'; // low surrogate of U+20BB7 (tsuchi-yoshi), cp=134071
char c4 = '\u30D5'; // katakana fu, cp=12501
char c5 = '\u309A'; // handakuten, cp=12442
char c6 = '\uD842'; // high surrogate of U+20B9F (kuchi + shichi), cp=134047
char c7 = '\uDF9F'; // low surrogate of U+20B9F (kuchi + shichi), cp=134047
String s = new String(new char[] { c1, c2, c3, c4, c5, c6, c7 });
assertEquals(s, "\u3042\uD842\uDFB7\u30D5\u309A\uD842\uDF9F");
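The two char values used for each supplementary character above follow from the UTF-16 encoding rule: subtract 0x10000 from the code point, put the upper 10 bits on top of 0xD800 and the lower 10 bits on top of 0xDC00. As a minimal sketch of that arithmetic (the class name SurrogateMath and its helper methods are hypothetical, made up for this illustration), the result can be compared with Character.toChars() and Character.toCodePoint(), both added in 1.5:

```java
import java.util.Arrays;

final class SurrogateMath {
    // Hypothetical helper: high (leading) surrogate of a supplementary code point.
    static char highSurrogateOf(int codePoint) {
        return (char) (0xD800 + ((codePoint - 0x10000) >> 10));
    }

    // Hypothetical helper: low (trailing) surrogate of a supplementary code point.
    static char lowSurrogateOf(int codePoint) {
        return (char) (0xDC00 + ((codePoint - 0x10000) & 0x3FF));
    }

    public static void main(String[] args) {
        int cp = 0x20BB7; // the tsuchi-yoshi character used in the test string

        char[] byHand = { highSurrogateOf(cp), lowSurrogateOf(cp) };
        char[] byApi = Character.toChars(cp); // standard API since 1.5

        System.out.printf("%04X %04X%n", (int) byHand[0], (int) byHand[1]); // D842 DFB7
        System.out.println(Arrays.equals(byHand, byApi));                   // true

        // The reverse direction: combine a pair back into one code point.
        System.out.println(Character.toCodePoint(byHand[0], byHand[1]) == cp); // true
    }
}
```

The hand-written helpers are only valid for code points in the range U+10000 to U+10FFFF; in real code it is probably simpler to rely on Character.toChars() directly.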

Next, copy the string using String.length() and String.charAt(), which do not take surrogate pairs into account. The last assertEquals() shows that the result matches an int[] in which each surrogate pair is split in two: the high surrogate and the low surrogate are treated as independent characters and copied separately.

int len = s.length();
assertEquals(len, 7); // ignores surrogate pair :P
int[] actualCps = new int[len];
for (int i = 0; i < len; i++) {
    char c = s.charAt(i);
    actualCps[i] = (int) c;
}
// Ignores surrogate pairs... :(
// BUT JavaScript unicode escape in browser accepts this format...:(
assertEquals(actualCps, new int[] { 0x3042, 0xD842, 0xDFB7, 0x30D5, 0x309A, 0xD842, 0xDF9F });
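One practical consequence of this char-based view is that index arithmetic can cut a pair in half. The following is a rough sketch, not part of the original test code: the class name SplitPairDemo and the method splitsPair are hypothetical. It uses Character.isHighSurrogate() and Character.isLowSurrogate(), both added in 1.5, to check whether a char index falls in the middle of a pair.

```java
final class SplitPairDemo {
    // Hypothetical helper: true if 'index' points at the low surrogate of a
    // well-formed pair, i.e. cutting the string there would split a character.
    static boolean splitsPair(String s, int index) {
        return index > 0 && index < s.length()
                && Character.isHighSurrogate(s.charAt(index - 1))
                && Character.isLowSurrogate(s.charAt(index));
    }

    public static void main(String[] args) {
        // Same string as in the test code above.
        String s = "\u3042\uD842\uDFB7\u30D5\u309A\uD842\uDF9F";

        System.out.println(splitsPair(s, 1)); // false: index 1 starts a pair
        System.out.println(splitsPair(s, 2)); // true: index 2 is inside the pair

        // substring() works on char indices, so this ends with a lone high surrogate.
        String broken = s.substring(0, 2);
        char last = broken.charAt(broken.length() - 1);
        System.out.println(Character.isHighSurrogate(last)); // true
    }
}
```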

Now use String.codePointCount() and String.codePointAt(), which do take surrogate pairs into account. The last assertEquals() shows that each surrogate-paired character comes back as a single value equal to its Unicode code point. In other words, each surrogate pair is counted and handled as one character.

int countOfCp = s.codePointCount(0, len);
assertEquals(countOfCp, 5); // GOOD.

actualCps = new int[countOfCp];
for (int i = 0, j = 0, cp; i < len; i += Character.charCount(cp)) {
    cp = s.codePointAt(i);
    actualCps[j++] = cp;
}
// GOOD.
assertEquals(actualCps, new int[] { 0x3042, 0x20BB7, 0x30D5, 0x309A, 0x20B9F });
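The manual loop above only needs the 1.5 API. Assuming Java 8 or later is available (an assumption; the original test code does not require it), the same code point array can also be obtained from the codePoints() stream, and String.offsetByCodePoints() (available since 1.5) can translate code point positions into char indices. A small sketch, with the class name CodePointStreamDemo and its helper methods being hypothetical:

```java
final class CodePointStreamDemo {
    // Hypothetical helper: all code points of a string (requires Java 8+).
    static int[] toCodePoints(String s) {
        return s.codePoints().toArray();
    }

    // Hypothetical helper: substring addressed by code point positions
    // instead of char indices.
    static String substringByCodePoints(String s, int beginCp, int endCp) {
        int begin = s.offsetByCodePoints(0, beginCp);
        int end = s.offsetByCodePoints(begin, endCp - beginCp);
        return s.substring(begin, end);
    }

    public static void main(String[] args) {
        // Same string as in the test code above.
        String s = "\u3042\uD842\uDFB7\u30D5\u309A\uD842\uDF9F";

        int[] cps = toCodePoints(s);
        System.out.println(cps.length); // 5

        // The second character (code point position 1) occupies two chars.
        String second = substringByCodePoints(s, 1, 2);
        System.out.println(second.length());                            // 2
        System.out.println(Integer.toHexString(second.codePointAt(0))); // 20bb7

        // Round trip: String(int[], int, int) rebuilds the string from code points.
        System.out.println(new String(cps, 0, cps.length).equals(s));   // true
    }
}
```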
