Ideally, any file that Python has to read would be encoded in UTF-8. In practice the side producing the data has its own constraints, so the receiving side often has to handle the conversion when reading.
CSV files exported from Excel on Windows are nominally Shift JIS, so with pandas it is tempting to write:
import pandas as pd
dataset1 = pd.read_csv("hogehoge.csv",encoding="shift_jis")
Be careful if you assume that is all it takes: some files will not load correctly.
test.csv
Yamada,1000
Sato,2000
Yamamoto,3000
That file reads without any problem.
test2.csv
1,Yamada,1000
2,髙橋,2000
3,黒﨑,3000
This one, however, reliably fails with the following error:
UnicodeDecodeError: 'shift_jis' codec can't decode byte 0xfb in position 0: illegal multibyte sequence
The cause is that test2.csv contains characters from the Windows extension set, such as hashigo-daka "髙" (the "ladder" form of 高) and tachi-saki "﨑" (a variant of 崎). These characters are not part of standard Shift JIS, so to read them the encoding has to be cp932:
encoding='cp932'
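To see the difference at the byte level, here is a minimal sketch using only Python's built-in codecs. The helper name `decodes_as` is hypothetical and exists only for this illustration. It encodes the two characters with cp932 and then checks whether each codec can decode the resulting bytes.

```python
def decodes_as(raw: bytes, encoding: str) -> bool:
    """Return True if `raw` can be decoded with `encoding` (hypothetical helper)."""
    try:
        raw.decode(encoding)
        return True
    except UnicodeDecodeError:
        return False


# The two Windows extension characters discussed above.
for ch in ("髙", "﨑"):
    raw = ch.encode("cp932")  # cp932 has a byte sequence for these characters
    print(
        ch,
        raw.hex(),
        "shift_jis:", decodes_as(raw, "shift_jis"),
        "cp932:", decodes_as(raw, "cp932"),
    )
```

Ordinary kanji decode the same way under both codecs, which is why a file can read fine for a long time and only start failing once such a name appears in the data.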
In short, do not assume that shift_jis is safe just because the file came from Windows. If you read with cp932 from the start, you avoid this kind of unnecessary trouble.
import pandas as pd
dataset1 = pd.read_csv("hogehoge.csv",encoding="cp932")
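The whole situation can be reproduced without Excel. The sketch below is an illustration under a few assumptions. It uses the standard-library `csv` module rather than pandas so that it runs with no extra dependencies. The rows are hypothetical sample data modeled on test2.csv, and `read_rows` is a helper invented for this example. It writes the file as cp932 bytes and then tries both encodings.

```python
import csv
import os
import tempfile

# Hypothetical sample rows containing the two extension characters.
rows = [["1", "Yamada", "1000"], ["2", "髙橋", "2000"], ["3", "黒﨑", "3000"]]


def read_rows(path, encoding):
    """Return all CSV rows, or None if the file cannot be decoded."""
    try:
        with open(path, encoding=encoding, newline="") as f:
            return list(csv.reader(f))
    except UnicodeDecodeError:
        return None


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "test2.csv")
    # Write the file as cp932 bytes, as a stand-in for a Windows export.
    with open(path, "w", encoding="cp932", newline="") as f:
        csv.writer(f).writerows(rows)

    strict = read_rows(path, "shift_jis")  # decoding fails, so this is None
    lenient = read_rows(path, "cp932")     # decoding succeeds

print(strict)
print(lenient == rows)
```

The same `encoding="cp932"` argument is what the `pd.read_csv` call above relies on. Note also that the sample files have no header row, so with pandas you would probably want `header=None` as well.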