At work I implemented some file-handling logic, and along the way I learned how to remove the BOM from a UTF-8 (with BOM) file. I am summarizing it here for future reference.
First of all, what is a BOM?
Roughly speaking, the BOM (byte order mark) is **a mark placed at the beginning of a file written in a Unicode encoding**. In UTF-8, it is the 3 bytes **0xEF 0xBB 0xBF**. You normally cannot see the BOM in Notepad, but it really is there at the start of the file contents. When a program reads a file that has a BOM, it can use the mark to decide how to interpret what follows. As a marker, it has two main roles: identifying the text as Unicode-encoded, and indicating the byte order (endianness).
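To make the 3 bytes concrete, here is a minimal sketch that encodes the code point U+FEFF as UTF-8 and prints the result in hex. The class name `BomBytes` and the helper `toHex` are illustrative names of my own, not part of any library.

```java
import java.nio.charset.StandardCharsets;

final class BomBytes {
    // Formats bytes as space-separated uppercase hex, e.g. "EF BB BF".
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // U+FEFF is the code point used as the BOM.
        byte[] utf8Bom = "\uFEFF".getBytes(StandardCharsets.UTF_8);
        System.out.println(toHex(utf8Bom)); // prints "EF BB BF"
    }
}
```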
In encodings whose code units are 2 bytes or more, such as UTF-16 and UTF-32, the BOM is used to specify the byte order (endianness). In an encoding with 1-byte code units like UTF-8, however, there is no byte order to specify. So why does UTF-8 (with BOM) exist?
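The endianness role can be seen with a small sketch like the one below. It relies on the documented behavior of Java's generic `UTF-16` charset, which reads a leading BOM to choose the byte order when decoding. The class name `Utf16BomDemo` and the method `decodeUtf16` are hypothetical names used only for this illustration.

```java
import java.nio.charset.StandardCharsets;

final class Utf16BomDemo {
    // Decodes with the generic UTF-16 charset, which uses a leading BOM
    // (FE FF = big-endian, FF FE = little-endian) to pick the byte order.
    static String decodeUtf16(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_16);
    }

    public static void main(String[] args) {
        byte[] bigEndian = {(byte) 0xFE, (byte) 0xFF, 0x00, 0x41};    // BOM FE FF, then 'A'
        byte[] littleEndian = {(byte) 0xFF, (byte) 0xFE, 0x41, 0x00}; // BOM FF FE, then 'A'

        System.out.println(decodeUtf16(bigEndian));    // prints "A"
        System.out.println(decodeUtf16(littleEndian)); // prints "A"
    }
}
```

The same letter is stored as `00 41` in one file and `41 00` in the other, and the BOM is what lets the decoder tell them apart.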
After investigating, I found that the reason lies in how Excel opens CSV files. When Excel opens a CSV, it tries to read it as Shift-JIS, so a file written in UTF-8 without a BOM comes out garbled. To prevent this, the file is saved with a BOM, which tells Excel to read it with a Unicode encoding.
Now for the main subject: how to remove the BOM. Java does not assume that UTF-8 has a BOM in the first place. So when it reads a file with a BOM, it treats the BOM like any other character and does not remove it. If you want the BOM removed, you need to implement that step yourself.
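This behavior is easy to confirm with a small sketch. The bytes below stand in for the contents of a UTF-8 (with BOM) file, and the class name `Utf8BomKept` and the method `decodeUtf8` are hypothetical names chosen for this example.

```java
import java.nio.charset.StandardCharsets;

final class Utf8BomKept {
    // Decodes the bytes as UTF-8 exactly as the standard library does.
    static String decodeUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // EF BB BF (the BOM) followed by "abc".
        byte[] withBom = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'a', 'b', 'c'};
        String s = decodeUtf8(withBom);

        System.out.println(s.length());              // prints 4, not 3
        System.out.println(s.charAt(0) == '\uFEFF'); // prints true
        System.out.println(s.equals("abc"));         // prints false
    }
}
```

The leftover U+FEFF is invisible when printed, which is why comparisons such as checking the first CSV header name can fail without an obvious cause.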
Java
// Unicode representation of the BOM (U+FEFF)
public static final String BOM = "\uFEFF";

/**
 * Removes the BOM if the file content starts with one.
 *
 * @param s the file content as a string
 * @return the file content without a leading BOM
 */
private static String removeUTF8BOM(String s) {
    if (s.startsWith(BOM)) {
        // Keep everything after the leading BOM character
        s = s.substring(1);
    }
    return s;
}
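As a usage sketch, the same check can be combined with reading a file. This assumes the file is small enough to load into memory at once. The class name `BomStripExample`, the methods `stripBom` and `readWithoutBom`, and the temporary file are all hypothetical and exist only for this illustration.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

final class BomStripExample {
    static final String BOM = "\uFEFF";

    // Same check as removeUTF8BOM: drop one leading U+FEFF if present.
    static String stripBom(String s) {
        return s.startsWith(BOM) ? s.substring(1) : s;
    }

    // Reads the whole file as UTF-8 and returns its text without a leading BOM.
    static String readWithoutBom(Path path) throws IOException {
        String text = new String(Files.readAllBytes(path), StandardCharsets.UTF_8);
        return stripBom(text);
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("bom-demo", ".csv");
        try {
            // Write a small CSV that starts with a BOM.
            Files.write(tmp, (BOM + "id,name\n").getBytes(StandardCharsets.UTF_8));
            System.out.println(readWithoutBom(tmp).startsWith("id")); // prints true
        } finally {
            Files.deleteIfExists(tmp);
        }
    }
}
```

Strings without a BOM pass through unchanged, so the method is safe to call on every file regardless of how it was saved.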
Another approach is to use the class library provided by Apache. See the following for the detailed specification.
Class for reading files with BOM
To remove the BOM using Java
Remedy for UTF-8 (with BOM)