As a Python beginner, I sometimes run into encoding errors when reading and writing CSV files, so I wrote up a summary as a note. This is also an **article for beginners**. The environment is Windows.
Errors commonly encountered when reading and writing CSV files
Error message
UnicodeEncodeError: 'shift_jis' codec can't encode character '\u9ad9' in position 14: illegal multibyte sequence
This means the text contains characters that cannot be encoded in Shift_JIS. It occurs when writing a file, if the text being written contains characters that the encoding specified for the file cannot represent.
The encoding is specified here:
Code example
with open(filepath, 'w', newline='', encoding='shift-jis') as f:
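As a minimal sketch of how this error arises and one way around it, the snippet below writes the same rows twice, first with `shift-jis` and then with `cp932`. The sample rows, the temporary file path, and the helper name `write_csv` are assumptions made for illustration; only the `open(...)` arguments come from the example above.

```python
import csv
import os
import tempfile

# Hypothetical sample data; "\u9ad9" is the character named in the error message above.
rows = [["id", "name"], ["1", "\u9ad9橋"]]


def write_csv(filepath, rows, encoding):
    """Write rows to a CSV file using the given encoding."""
    with open(filepath, "w", newline="", encoding=encoding) as f:
        csv.writer(f).writerows(rows)


# Temporary location so the sketch does not touch real files.
filepath = os.path.join(tempfile.mkdtemp(), "sample.csv")

try:
    write_csv(filepath, rows, "shift-jis")
    error_message = None
except UnicodeEncodeError as e:
    # shift-jis has no byte sequence for U+9AD9, so the write fails.
    error_message = str(e)

# cp932 can represent the same character, so this write succeeds.
write_csv(filepath, rows, "cp932")
```

After running this, `error_message` holds the text of the `UnicodeEncodeError`, and the file on disk is the cp932 version.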
Error message
UnicodeDecodeError: 'shift_jis' codec can't decode byte 0xee in position 0
This means the file contains bytes that cannot be decoded as Shift_JIS. It occurs when reading a file, if the encoding of the file does not match the encoding specified for reading. (In other words, the file contains characters that cannot be read with the encoding specified for reading.)
The encoding is specified here:
Code example
data = pd.read_csv(filepath, encoding='shift-jis')
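The read-side failure can be reproduced with the standard library alone. The sketch below creates a cp932 file containing U+9AD9, tries to read it as `shift-jis`, and falls back to `cp932` when decoding fails. The helper `read_csv_rows`, the sample rows, and the fallback strategy are assumptions for illustration; `pd.read_csv` takes an `encoding` argument in the same way.

```python
import csv
import os
import tempfile


def read_csv_rows(filepath, encoding):
    """Read every row of a CSV file using the given encoding."""
    with open(filepath, newline="", encoding=encoding) as f:
        return list(csv.reader(f))


# Create a cp932 file containing U+9AD9 (hypothetical sample data).
filepath = os.path.join(tempfile.mkdtemp(), "sample.csv")
with open(filepath, "w", newline="", encoding="cp932") as f:
    csv.writer(f).writerows([["id", "name"], ["1", "\u9ad9橋"]])

decode_failed = False
try:
    rows = read_csv_rows(filepath, "shift-jis")
except UnicodeDecodeError as e:
    # The bytes cp932 wrote for U+9AD9 are not valid shift-jis.
    decode_failed = True
    print("read failed:", e)
    rows = read_csv_rows(filepath, "cp932")
```

Falling back to another encoding is only safe when you know where the file came from; a wrong guess can decode without an error and still produce garbled text.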
If you create, write, and read a file entirely in Python, no error should occur as long as the encodings you specify follow one row of the table below and the text contains only characters that encoding can represent. ("File encoding" here means the encoding of the CSV produced with the encoding specified at write time.)
Encoding specified when writing | File encoding | Encoding specified when reading |
---|---|---|
UTF-8 | UTF-8 | UTF-8 |
cp932 | ANSI | cp932 |
shift-jis | ANSI | shift-jis |
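The rows of the table can be checked with a small round trip: write with one encoding, read back with the same one, and compare. This is a sketch under assumptions: the sample rows and file names are made up, and the text deliberately uses only characters that all three encodings can represent.

```python
import csv
import os
import tempfile

# Hypothetical sample data using characters all three encodings can represent.
rows = [["id", "name"], ["1", "高橋"]]

tmpdir = tempfile.mkdtemp()
results = {}

for encoding in ("utf-8", "cp932", "shift-jis"):
    filepath = os.path.join(tmpdir, f"sample_{encoding}.csv")

    # Write with the encoding under test.
    with open(filepath, "w", newline="", encoding=encoding) as f:
        csv.writer(f).writerows(rows)

    # Read back with the same encoding and compare with the original rows.
    with open(filepath, newline="", encoding=encoding) as f:
        results[encoding] = list(csv.reader(f)) == rows
```

Each entry in `results` records whether the data survived the round trip for that encoding.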
**If both cp932 and shift-jis produce ANSI files, which one should you use?** I think the biggest difference between cp932 and shift-jis is whether they can handle environment-dependent characters such as **髙 (Hashigodaka)** and **﨑 (Tatesaki)**. cp932 can handle them; shift-jis cannot. So, for example, when ANSI CSV files are received from other systems, it is better to assume they should be read with cp932 rather than shift-jis.
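To see the difference for a particular string, you can test each character against an encoding. The helper `unencodable_chars` below is a hypothetical name introduced for this sketch, not a library function, and the sample string is made up.

```python
def unencodable_chars(text, encoding):
    """Return the characters in text that the given encoding cannot represent."""
    bad = []
    for ch in text:
        try:
            ch.encode(encoding)
        except UnicodeEncodeError:
            bad.append(ch)
    return bad


# "\u9ad9" is Hashigodaka and "\ufa11" is Tatesaki; the last two are the common forms.
sample = "\u9ad9\ufa11高崎"

not_in_shift_jis = unencodable_chars(sample, "shift_jis")
not_in_cp932 = unencodable_chars(sample, "cp932")
```

Here `not_in_shift_jis` contains the two environment-dependent characters, while `not_in_cp932` is empty. Running a check like this on the data before writing shows in advance whether shift-jis will fail.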