Which encoding to use when reading UTF-8 with a BOM in Python
Specify **'utf_8_sig'**.
Example of reading a file
io.open(filename, "r", encoding="utf_8_sig")
Converting from the str type (UTF-8 bytes) to the unicode type (Python 2)
uni_string = unicode(str_string, 'utf_8_sig')
I got a little stuck reading UTF-8 in Python, so I'm writing this down so I don't forget.
UTF-8 files may start with a BOM (byte order mark). It acts as a signature indicating that the encoding is UTF-8. When present, the first 3 bytes of the file are 'EF BB BF'.
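To make the signature concrete, here is a minimal sketch that checks whether some bytes start with the UTF-8 BOM. The helper name `has_utf8_bom` and the sample data are made up for illustration; `codecs.BOM_UTF8` is the standard-library constant holding those three bytes.

```python
import codecs


def has_utf8_bom(data: bytes) -> bool:
    """Return True if the bytes begin with the UTF-8 BOM (EF BB BF)."""
    return data.startswith(codecs.BOM_UTF8)


# Hypothetical sample data: the same text with and without a BOM.
with_bom = codecs.BOM_UTF8 + "id,name".encode("utf-8")
without_bom = "id,name".encode("utf-8")

print(codecs.BOM_UTF8.hex())      # efbbbf
print(has_utf8_bom(with_bom))     # True
print(has_utf8_bom(without_bom))  # False
```

For a real file you could pass in the result of reading the first few bytes in binary mode, for example `open(path, "rb").read(3)`.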
The trouble is that both kinds exist: UTF-8 with a BOM and UTF-8 without one.
Windows Notepad and Excel add a BOM when saving as UTF-8. Linux and Mac tools generally seem to handle UTF-8 without a BOM.
This time I wanted to load a CSV file that had been edited in Excel, so I had to take the BOM into account.
As it turned out, this is covered in the documentation.
Official documentation: UTF-8 with BOM
If you set the encoding to 'utf_8_sig', a BOM at the start of the data is skipped when reading. If there is no BOM, the data is read as ordinary UTF-8.
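The difference between the two codecs can be seen by decoding the same bytes both ways. This is a small sketch on in-memory bytes; the variable names and sample text are only for illustration.

```python
import codecs

# Hypothetical sample data: a CSV header with and without a leading BOM.
raw_with_bom = codecs.BOM_UTF8 + "name,age".encode("utf-8")
raw_without_bom = "name,age".encode("utf-8")

# Plain 'utf_8' keeps the BOM as a U+FEFF character at the start of the text.
decoded_plain = raw_with_bom.decode("utf_8")
print(repr(decoded_plain))  # '\ufeffname,age'

# 'utf_8_sig' drops the BOM when it is present...
decoded_sig = raw_with_bom.decode("utf_8_sig")
print(repr(decoded_sig))    # 'name,age'

# ...and behaves like plain UTF-8 when it is not.
decoded_no_bom = raw_without_bom.decode("utf_8_sig")
print(repr(decoded_no_bom))  # 'name,age'
```

The stray U+FEFF in the first result is what usually shows up as a strange first column name when a CSV saved by Excel is read as plain UTF-8.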
ImportCSV.py
import io

# 'utf_8_sig' skips a leading BOM if one is present
with io.open('sample.csv', 'rt', encoding='utf_8_sig') as f:
    print(f.readlines())
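If the goal is to get rows rather than raw lines, the same encoding can be combined with the standard `csv` module. The following is a sketch for Python 3; the function name `read_csv_rows` and the temporary sample file are assumptions made for the example, not part of the original script.

```python
import csv
import os
import tempfile


def read_csv_rows(path):
    """Read a CSV file as a list of rows, skipping a UTF-8 BOM if present."""
    # newline="" is what the csv module recommends for files passed to it.
    with open(path, "r", encoding="utf_8_sig", newline="") as f:
        return list(csv.reader(f))


# Build a throwaway file that starts with a BOM, as Excel would write it.
fd, sample_path = tempfile.mkstemp(suffix=".csv")
with os.fdopen(fd, "wb") as f:
    f.write(b"\xef\xbb\xbfname,age\r\nalice,30\r\n")

rows = read_csv_rows(sample_path)
os.remove(sample_path)

print(rows)  # [['name', 'age'], ['alice', '30']]
```

Because the BOM is stripped during decoding, the first header cell comes back as `'name'` rather than `'\ufeffname'`.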
Character encodings are an easy place to get stuck in Python, but if you specify the encoding correctly when converting to the unicode type, you should not have to worry about them afterwards.