Introduction

As part of a system modification (Java 7), I wrote a process that counts the lines whose first character is "K" in a file received by FTP. The regular expression "^K.*" is used to determine whether the first character of a line is "K".
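As a rough illustration of the counting step, here is a minimal sketch. The class name `KLineCounter`, the method name, and the sample data are hypothetical and not taken from the actual system; only the regular expression comes from the description above.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.regex.Pattern;

class KLineCounter {
    // Matches a whole line whose first character is "K".
    private static final Pattern STARTS_WITH_K = Pattern.compile("^K.*");

    // Counts the lines read from the reader that start with "K".
    static int countKLines(BufferedReader reader) throws IOException {
        int count = 0;
        String line;
        while ((line = reader.readLine()) != null) {
            if (STARTS_WITH_K.matcher(line).matches()) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical sample data standing in for the FTP-received file.
        String sample = "K001\nA002\nK003\n";
        System.out.println(countKLines(new BufferedReader(new StringReader(sample))));
    }
}
```

Note that any character in front of the "K", visible or not, makes the line fail this match.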

In testing, the received file had been created as Shift-JIS, so reading the FTP-received file with the character encoding "Windows-31J" caused no problem. The actual FTP-received file, however, is UTF-8 (with BOM). Because of the BOM, the first line alone was not recognized as starting with "K", and the total line count did not match.

I had overlooked this in the specification.

BOM

What is a BOM? The [Byte Order Mark](https://ja.wikipedia.org/wiki/%E3%83%90%E3%82%A4%E3%83%88%E3%82%AA%E3%83%BC%E3%83%80%E3%83%BC%E3%83%9E%E3%83%BC%E3%82%AF) is information that lets a reader determine that "this file is written in a Unicode format."

Only the main encodings and their BOMs are listed here.

Encoding	Endianness	BOM bytes
UTF-8	-	0xEF 0xBB 0xBF
UTF-16	BE	0xFE 0xFF
UTF-16	LE	0xFF 0xFE
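The table can be turned into a check on the first bytes of a file. The following is only a sketch covering the three BOMs listed above; `BomDetector` and its method names are made up for illustration.

```java
class BomDetector {
    // Returns the encoding whose BOM starts the given bytes, or null if none
    // of the three BOMs in the table matches.
    static String detect(byte[] head) {
        if (startsWith(head, 0xEF, 0xBB, 0xBF)) return "UTF-8";
        if (startsWith(head, 0xFE, 0xFF)) return "UTF-16BE";
        if (startsWith(head, 0xFF, 0xFE)) return "UTF-16LE";
        return null;
    }

    // Compares the leading bytes as unsigned values, since Java bytes are signed.
    private static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) return false;
        for (int i = 0; i < prefix.length; i++) {
            if ((data[i] & 0xFF) != prefix[i]) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        // UTF-8 BOM followed by "K" (0x4B).
        byte[] head = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 0x4B};
        System.out.println(detect(head));
    }
}
```

Checking at the byte level works regardless of which encoding the file is later decoded with.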

BOM skip

I added a step to skip the BOM, based on the code from the following site, but the result did not change. Reference: [JAVA] Manual removal logic of BOM (Byte Of Mark)

private static String excludeBOMString(String original_str) {
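	// Guard against null and empty input; charAt(0) would throw on "".
	if (original_str != null && !original_str.isEmpty()) {
		char c = original_str.charAt(0);
		// U+FEFF is how a correctly decoded BOM appears in a Java String.
		if (c == '\uFEFF') {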
			StringBuilder sb = new StringBuilder();
			for (int i=1; i < original_str.length(); i++) {
				sb.append(original_str.charAt(i));
			}
			return sb.toString();
		} else {
			return original_str;
		}
	} else {
		return "";
	}
}

Why? When I debugged it in Eclipse on my Windows 7 PC, the first character was "fffd". Looking at the character codes a little further gave the following, which showed that I had to skip three characters to reach the first real character, "K" (0x4b). These values are neither the UTF-8 BOM bytes (0xEF, 0xBB, 0xBF) nor 0xfeff, the BOM as it appears in Java's internal UTF-16 representation, but there was no time left before the afternoon verification work, so I put off working out why.

Index	Character code
0	0xfffd
1	0xff7b
2	0xff7f
3	0x4b

I changed the BOM check to "fffd" and the skip to 3 characters. With that, the BOM was skipped and the count of lines starting with "K" came out correctly.

private static String excludeBOMString(String original_str) {
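	// Guard against null and empty input; charAt(0) would throw on "".
	if (original_str != null && !original_str.isEmpty()) {
		char c = original_str.charAt(0);
		// U+FFFD (REPLACEMENT CHARACTER) is what the BOM decoded to under Windows-31J.
		if (c == '\uFFFD') {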
			StringBuilder sb = new StringBuilder();
			for (int i=3; i < original_str.length(); i++) {
				sb.append(original_str.charAt(i));
			}
			return sb.toString();
		} else {
			return original_str;
		}
	} else {
		return "";
	}
}

In the afternoon verification work I deployed the modified program, but the totals still did not match. Why? When I debugged in Eclipse on my Windows 7 PC, the totals matched. The verification environment is Windows Server 2012 R2, so I added debug output of each line read to see whether something differed between environments. The first line started from the character after "K". Changing the skip from 3 characters to 2 made the totals match.

However, it's scary that it depends on the environment.

0xFFFD (REPLACEMENT CHARACTER)

Anyway, the afternoon verification work was over, so I decided to investigate the cause.

As for "0xfffd": when you try to read a UTF-8 encoded text file as Shift-JIS, any byte sequence with no corresponding character is converted to the character '0xFFFD'. (Source: About garbled character detection in Java)

"In my environment (Windows 7) ... Code point 0xFFFD, "REPLACEMENT CHARACTER": �! In my environment it shows as a black diamond with a white "?", so I wondered whether it was failing to display properly, but this is apparently "the character displayed when a character cannot be displayed", so it seems to be fine." (Source: The largest character in Unicode)

Since the character encoding for reading the file was specified as "Windows-31J", the BOM bytes had no corresponding character and were converted to "0xfffd". That much made sense to me.

As for the number of characters to skip differing between Windows 7 and Windows Server 2012 R2: on Windows Server 2012 R2 the second character, "0xff7b", was missing, so skipping 3 characters went one too far.

Index (Win 7)	Character code (Win 7)	Index (Win 2012 R2)	Character code (Win 2012 R2)
0	0xfffd	0	0xfffd
1	0xff7b	-	-
2	0xff7f	1	0xff7f
3	0x4b	2	0x4b
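To reproduce this kind of table, the leading bytes can be decoded under each encoding and dumped character by character. This is only a sketch, and `BomDecodeDemo` and `dump` are hypothetical names. I deliberately show no expected output for the Windows-31J case, because as described above the number of characters produced differed between my two environments.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class BomDecodeDemo {
    // UTF-8 BOM followed by "K", as at the start of the received file.
    static final byte[] HEAD = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 0x4B};

    // Decodes the bytes with the given charset and formats each char as hex.
    static String dump(byte[] bytes, Charset charset) {
        String decoded = new String(bytes, charset);
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < decoded.length(); i++) {
            if (i > 0) sb.append(' ');
            sb.append(String.format("0x%04x", (int) decoded.charAt(i)));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Decoded as UTF-8, the BOM becomes the single character U+FEFF.
        System.out.println("UTF-8:       " + dump(HEAD, StandardCharsets.UTF_8));
        // Windows-31J is not one of the standard charsets, so guard in case
        // this runtime does not provide it.
        if (Charset.isSupported("Windows-31J")) {
            System.out.println("Windows-31J: " + dump(HEAD, Charset.forName("Windows-31J")));
        }
    }
}
```

Running this in both environments would show directly whether the decoder emits one or two characters before 0xff7f.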

I searched online but could not find any documentation on this difference. The file contains only alphanumeric characters, so rather than spend more time investigating, I decided it was better to read the file as "UTF-8" and modified the program accordingly.

As a result, the BOM check went back to "feff" with a one-character skip, which is the original manual removal logic unchanged. I tested it on Windows 7 and Windows Server 2012 R2, and both gave the same result.
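Putting the final approach together, a minimal sketch might look like the following. It assumes the file arrives as an `InputStream`; `Utf8BomReader` and its method are hypothetical names, not the actual program. It decodes as UTF-8, drops a leading U+FEFF from the first line only, and then applies the same "^K.*" check.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

class Utf8BomReader {
    // Counts lines starting with "K", decoding as UTF-8 and dropping a leading BOM.
    static int countKLines(InputStream in) throws IOException {
        int count = 0;
        try (BufferedReader reader =
                 new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8))) {
            String line;
            boolean first = true;
            while ((line = reader.readLine()) != null) {
                if (first) {
                    first = false;
                    // Java's UTF-8 decoder keeps the BOM as U+FEFF, so strip it here.
                    if (!line.isEmpty() && line.charAt(0) == '\uFEFF') {
                        line = line.substring(1);
                    }
                }
                if (line.matches("^K.*")) {
                    count++;
                }
            }
        }
        return count;
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical sample: a UTF-8 file with BOM containing three lines.
        byte[] sample = "\uFEFFK1\nA\nK2\n".getBytes(StandardCharsets.UTF_8);
        System.out.println(countKLines(new ByteArrayInputStream(sample)));
    }
}
```

Because the check is for U+FEFF after decoding, the same code also handles UTF-8 files without a BOM.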

Finally

This was the first time I had seen data converted to the REPLACEMENT CHARACTER (0xfffd), so I wrote it up as a memo.

In Java, "UTF-8" means UTF-8 without a BOM. This has been discussed at length, but it seems there is no plan to change it, for backward compatibility reasons. See also: [Java SE: What happens when you read a UTF-8 file with a BOM?](https://hondou.homedns.org/pukiwiki/pukiwiki.php?JavaSE%20BOM%C9%D5%A4%ADUTF8%A5%D5%A5%A1%A5%A4%A5%EB%A4%F2%C6%C9%A4%DF%B9%FE%A4%E0%A4%C8%A4%C9%A4%A6%A4%CA%A4%EB%A4%AB%A1%A9)

I have run into BOMs before in PHP as well. Having to care whether a file has a BOM or not is a nuisance.

[Java] UTF-8 (with BOM) is converted to 0xFFFD (REPLACEMENT CHARACTER)
