BOM (Byte Order Mark) should die. No mercy.
See [Wikipedia](https://ja.wikipedia.org/wiki/%E3%83%90%E3%82%A4%E3%83%88%E3%82%AA%E3%83%BC%E3%83%80%E3%83%BC%E3%83%9E%E3%83%BC%E3%82%AF).
If you read a CSV with `csv.DictReader` or similar, the BOM ends up attached to the start of the header row. So if you expect the first column to come in as `seq`, you actually get a header like `<0xEF>seq`.
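The symptom is easy to reproduce. Here is a minimal sketch, assuming Python 3 and a made-up two-column CSV held in memory (the data and the variable names are only for illustration):

```python
import csv
import io

# Hypothetical CSV text that starts with a BOM (U+FEFF), which is what you
# get after decoding a BOM-prefixed UTF-8 file with plain 'utf-8'.
data = '\ufeffseq,name\n1,foo\n'

reader = csv.DictReader(io.StringIO(data))
first_key = reader.fieldnames[0]

# The BOM is glued onto the first header name, so a lookup by 'seq' fails.
print(repr(first_key))
print('seq' in reader.fieldnames)
```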
- I think you can strip it with nkf.
- You can also strip it on the program side.
$ nkf --overwrite --oc=UTF-8 filename
I think this is the proper way to do it. Nothing beats stripping the BOM before the file is ever read.
That said, it is not always possible to strip it before importing, so here is a program-side version.
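import codecs

def strip_bom(s):
    # Work on the UTF-8 bytes so we can compare against codecs.BOM_UTF8.
    s = s.encode('utf8')
    if s.startswith(codecs.BOM_UTF8):
        # Drop the BOM bytes, then decode back to text.
        return s[len(codecs.BOM_UTF8):].decode('utf8')
    return s.decode('utf8')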
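One way to use a helper like this with `csv.DictReader` is to strip only the first line before the reader sees it. This is a sketch, assuming Python 3. `strip_bom` is repeated so the snippet runs on its own, and `skip_bom_lines` is a hypothetical helper name, not something from a library:

```python
import codecs
import csv
import io

def strip_bom(s):
    raw = s.encode('utf8')
    if raw.startswith(codecs.BOM_UTF8):
        return raw[len(codecs.BOM_UTF8):].decode('utf8')
    return s

def skip_bom_lines(lines):
    """Yield lines unchanged, except that a BOM on the first one is removed."""
    for i, line in enumerate(lines):
        yield strip_bom(line) if i == 0 else line

# Made-up input standing in for a BOM-prefixed file opened as text.
source = io.StringIO('\ufeffseq,name\n1,foo\n')
rows = list(csv.DictReader(skip_bom_lines(source)))
print(rows[0]['seq'])
```

Only the first line is touched because a BOM is only meaningful at the very start of the stream.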
The `codecs` module has a constant called `BOM_UTF8`, but why can't I strip it with an option to `open`?
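For what it is worth, Python 3's built-in `open` does accept `encoding='utf-8-sig'`, a standard codec that drops a leading BOM while decoding. Here is a sketch, assuming Python 3. The temporary file and its contents are made up for the demonstration:

```python
import codecs
import csv
import os
import tempfile

# Write a hypothetical BOM-prefixed CSV to a temporary file.
fd, path = tempfile.mkstemp(suffix='.csv')
with os.fdopen(fd, 'wb') as f:
    f.write(codecs.BOM_UTF8 + b'seq,name\n1,foo\n')

# 'utf-8-sig' skips the BOM if present and behaves like 'utf-8' otherwise.
with open(path, encoding='utf-8-sig', newline='') as f:
    reader = csv.DictReader(f)
    fieldnames = reader.fieldnames
    rows = list(reader)

os.remove(path)
print(fieldnames)
```

Because the codec also reads BOM-less files normally, it should be safe to use when you do not know in advance whether a file has a BOM.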