In UTF-16, a character is usually represented by a single 16-bit code unit (2 bytes). However, as the number of characters Unicode had to cover grew, the 65,536 values that can be expressed in 2 bytes became insufficient, so characters outside that range are expressed in 4 bytes as a pair of 16-bit code units. Such a pair is called a surrogate pair.
The character "𠮟" (U+20B9F) is represented by a surrogate pair, so the ordinary length method, which counts UTF-16 code units (char values), counts it as two characters.
Therefore, to correctly count the characters in a string that contains surrogate pairs, use the codePointCount method instead of the length method.
var str1 = "Hello";
System.out.println(str1.length()); // Result: 5
// "𠮟" is a surrogate pair (2 chars); "る" is a single char
var str2 = "𠮟る";
System.out.println(str2.length()); // Result: 3
// This returns the correct number of characters (code points)
System.out.println(str2.codePointCount(0, str2.length())); // Result: 2
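To see why length() reports 3 here, it can help to look at the individual UTF-16 code units. The following is only a minimal sketch: the class name CodeUnitInspector and the helper countCodePoints are made up for illustration, and the two-character string is written with Unicode escapes (U+20B9F as \uD842\uDF9F, followed by U+308B) so the file does not depend on source encoding.

```java
public class CodeUnitInspector {

    // Hypothetical helper: counts code points by streaming over them.
    static long countCodePoints(String s) {
        return s.codePoints().count();
    }

    public static void main(String[] args) {
        // U+20B9F (a surrogate pair) followed by U+308B (a single char)
        String s = "\uD842\uDF9F\u308B";

        // length() counts char values (UTF-16 code units)
        System.out.println(s.length()); // 3

        // The first two chars form the surrogate pair
        System.out.println(Character.isHighSurrogate(s.charAt(0))); // true
        System.out.println(Character.isLowSurrogate(s.charAt(1)));  // true

        // codePointAt combines the pair into one code point
        System.out.println(Integer.toHexString(s.codePointAt(0))); // 20b9f

        // Both ways of counting code points agree
        System.out.println(countCodePoints(s));                // 2
        System.out.println(s.codePointCount(0, s.length()));   // 2
    }
}
```

The stream-based count and codePointCount give the same result; codePointCount is the more direct choice when only the number is needed.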
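codePointCount method
/**
 * @param begin index of the first char of the range to count (inclusive)
 * @param end index after the last char of the range to count (exclusive)
 * @return the number of characters (Unicode code points) in the range
 */
public int codePointCount(int begin, int end)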
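Note that the two arguments are char indexes, not character positions. The sketch below tries a few ranges on the same two-character string (again written with Unicode escapes); the class name CodePointRangeDemo and the helper countFrom are hypothetical names used only for this illustration.

```java
public class CodePointRangeDemo {

    // Hypothetical helper: counts code points from a char index to the end.
    static int countFrom(String s, int begin) {
        return s.codePointCount(begin, s.length());
    }

    public static void main(String[] args) {
        // U+20B9F (chars 0 and 1) followed by U+308B (char 2)
        String s = "\uD842\uDF9F\u308B";

        // Range covering exactly the surrogate pair
        System.out.println(s.codePointCount(0, 2)); // 1

        // Range covering only the last char
        System.out.println(countFrom(s, 2)); // 1

        // A range that cuts the pair in half: an unpaired surrogate
        // is still counted as one code point
        System.out.println(s.codePointCount(0, 1)); // 1

        // offsetByCodePoints converts "1 character forward" into a char index
        System.out.println(s.offsetByCodePoints(0, 1)); // 2
    }
}
```

Because a range can split a surrogate pair, it is usually safer to obtain char indexes with offsetByCodePoints than to compute them by hand.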