[Java] "hoge hoge".equals("hoge hoge") // false

What happened

The other day, I was reading in an HTML file and doing things like replacing characters in its text.

String str1 = "hoge hoge";
String str2 = anyElement; // "hoge hoge" imported from an HTML file

System.out.println(str1.equals(str2)); 
// false

What, it printed false. Huh, false ...? At first I wondered whether this was a bug in String#equals, but that seemed unlikely, so I looked into it.

Look at the bytes

So I converted the two strings above to UTF-8 bytes with String#getBytes, and got the following.

str1 : [104, 111, 103, 101, 32, 104, 111, 103, 101]
str2 : [104, 111, 103, 101, -62, -96, 104, 111, 103, 101]
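To reproduce the byte dump without an HTML file, here is a minimal sketch. It assumes the imported value contained a no-break space (U+00A0), which is written as an escape so it is visible in the source. The class name `NbspBytesDemo` and the helper `utf8Bytes` are made up for this illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class NbspBytesDemo {
    // Returns the UTF-8 bytes of s formatted like "[104, 111]".
    static String utf8Bytes(String s) {
        return Arrays.toString(s.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        String str1 = "hoge hoge";       // ordinary space (U+0020)
        String str2 = "hoge\u00A0hoge";  // no-break space (U+00A0), standing in for the HTML value

        System.out.println("str1 : " + utf8Bytes(str1));
        System.out.println("str2 : " + utf8Bytes(str2));
        System.out.println(str1.equals(str2)); // false
    }
}
```

Passing the charset explicitly keeps the result the same regardless of the platform default encoding.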

Hmm?

-62, -96 ...!?

***What is this!?***

&nbsp; (0xC2 0xA0)

[Non-breaking space - Wikipedia](https://ja.wikipedia.org/wiki/%E3%83%8E%E3%83%BC%E3%83%96%E3%83%AC%E3%83%BC%E3%82%AF%E3%82%B9%E3%83%9A%E3%83%BC%E3%82%B9)

According to the page above, the two bytes 0xC2 0xA0 are the UTF-8 encoding of the "no-break space" (U+00A0).

In other words, an &nbsp; in an HTML file is not the usual half-width space (0x20). In UTF-8 it is represented by the two bytes 0xC2 0xA0.

Even when printed to standard output, it looks just like a half-width space. It's a trap ...
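Since the two characters look identical when printed, one way to spot the difference is to print anything outside printable ASCII as its code point. The following is only a sketch; `VisibleChars` and `escape` are hypothetical names, and the `<U+XXXX>` notation is just a choice made for this example.

```java
class VisibleChars {
    // Keeps printable ASCII (0x20 to 0x7E) as is and renders every other char as <U+XXXX>.
    static String escape(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            if (c >= 0x20 && c <= 0x7E) {
                sb.append(c);
            } else {
                sb.append(String.format("<U+%04X>", (int) c));
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(escape("hoge hoge"));       // hoge hoge
        System.out.println(escape("hoge\u00A0hoge"));  // hoge<U+00A0>hoge
    }
}
```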

The trouble when it's not a normal half-width space

When that happens, string processing can go wrong. The following shows what happens when every space in the string is actually a no-break space.

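String[] hoge = "a\u00A0b\u00A0c".split(" ");
// hoge = ["a b c"] (one element; the gaps are no-break spaces, so nothing is split)
// not ["a", "b", "c"] ...?

String fuga = "a\u00A0b\u00A0c".replaceAll(" ", "d");
// fuga = "a b c" (unchanged; the gaps are still no-break spaces)
// not "adbdc" ...?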

It's a trap ... However, it works if you do something like the following.

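import java.nio.charset.StandardCharsets;

public static final byte[] NBSP_BYTES = {(byte) 0xC2, (byte) 0xA0};
// Decode explicitly as UTF-8 so the result does not depend on the platform default charset.
public static final String NBSP = new String(NBSP_BYTES, StandardCharsets.UTF_8);

// A character class needs no "|": inside [] it would match a literal pipe character.
String[] hoge = "a\u00A0b\u00A0c".split("[ " + NBSP + "]");
// hoge = ["a", "b", "c"]

String fuga = "a\u00A0b\u00A0c".replaceAll("[ " + NBSP + "]", "d");
// fuga = "adbdc"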

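Both replaceAll and split take a regular expression as their argument, so a character class containing both the half-width space and the no-break space matches either one. Note that no | is needed between them: inside a character class, | is just a literal character and would also be matched.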
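For the original equals problem, another option is to normalize the string once, right after importing it, and then use ordinary spaces everywhere. This is a sketch of that idea rather than the only way to do it; `NbspNormalize` and `normalizeSpaces` are names invented for this example.

```java
import java.util.Arrays;

class NbspNormalize {
    static final char NBSP = '\u00A0';

    // Replaces every no-break space with an ordinary half-width space.
    static String normalizeSpaces(String s) {
        return s.replace(NBSP, ' ');
    }

    public static void main(String[] args) {
        String str1 = "hoge hoge";
        String str2 = "hoge\u00A0hoge"; // stands in for the value imported from HTML

        System.out.println(str1.equals(str2));                  // false
        System.out.println(str1.equals(normalizeSpaces(str2))); // true

        // After normalizing, a plain " " works for split as well.
        System.out.println(Arrays.toString(normalizeSpaces("a\u00A0b c").split(" "))); // [a, b, c]
    }
}
```

String#replace(char, char) does not use regular expressions, so nothing here needs escaping.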

Summary

With this, non-breaking spaces should not trip me up anymore ...
