--In Python 3, the default character encoding used by `open` and similar functions when handling files depends on the OS.
--On Unix (Linux), it depends on the locale (`LC_CTYPE`).
--If you read or write a file without specifying an encoding, you may encounter `UnicodeDecodeError` and similar errors depending on the environment.
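One way to see which encoding your own environment would pick is to open a file in text mode without an `encoding` argument and inspect the file object. The following is only a minimal sketch: the temporary file and the helper name `default_open_encoding` are assumptions made for illustration, not part of the original examples.

```python
import locale
import os
import tempfile


def default_open_encoding():
    """Return the encoding open() picks when none is given (hypothetical helper)."""
    fd, path = tempfile.mkstemp(suffix=".txt")
    os.close(fd)
    try:
        # No encoding argument: Python falls back to its environment-dependent default.
        with open(path, mode="r") as fp:
            return fp.encoding
    finally:
        os.remove(path)


print("open() default:", default_open_encoding())
print("locale preferred:", locale.getpreferredencoding(False))
```

The printed names vary by machine, which is exactly the portability problem described above.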
--Checking the behavior on macOS
--For example, suppose you have a UTF-8 text file containing Japanese text. Open this file and read its contents:
with open('utf-8.txt', mode='r') as fp:
    text = fp.read()
--The file opens without any error and its contents are read correctly.
--This is because the default character encoding on macOS is UTF-8.
--You can check the character encoding actually used with `locale.getpreferredencoding`.
>>> import locale
>>> locale.getpreferredencoding()
'UTF-8'
--Because `getpreferredencoding` returns `UTF-8`, UTF-8 text can be read without error.
--Now actually change `LC_CTYPE` and confirm that an error occurs.
--Use `setlocale` to temporarily change `LC_CTYPE`.
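import locale
# Temporarily switch LC_CTYPE to the C locale
locale.setlocale(locale.LC_CTYPE, 'C')
print(locale.getpreferredencoding(False))  # => US-ASCII
# No encoding argument, so the locale-dependent default is used
with open('hoge.txt') as fp:
    text = fp.read()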
Result
US-ASCII
Traceback (most recent call last):
  File "test.py", line 7, in <module>
    text = fp.read()
  File "/path/to/encodings/ascii.py", line 26, in decode
    return codecs.ascii_decode(input, self.errors)[0]
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe3 in position 0: ordinal not in range(128)
--Setting `LC_CTYPE` to `C` changes the character encoding to US-ASCII.
--As a result, reading the UTF-8 text raised `UnicodeDecodeError`.
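The same failure can be reproduced without touching the locale at all, by asking `open` to decode UTF-8 bytes as ASCII. This is a small sketch under assumptions: the temporary file and its sample content are made up for the example, and passing `encoding="ascii"` only stands in for the US-ASCII default seen above.

```python
import os
import tempfile

# Write a small UTF-8 file containing Japanese text (sample content is made up).
fd, path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "w", encoding="utf-8") as fp:
    fp.write("こんにちは")

try:
    # Forcing ASCII stands in for the US-ASCII default under LC_CTYPE=C.
    with open(path, encoding="ascii") as fp:
        fp.read()
except UnicodeDecodeError as exc:
    error_message = str(exc)
    print(error_message)
else:
    error_message = None
finally:
    os.remove(path)
```

The first character of the sample text is encoded in UTF-8 as the bytes `e3 81 93`, so the decoder stops at byte `0xe3` in position 0, just as in the traceback above.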
Note
- The same behavior can be confirmed without `setlocale` by changing the `LC_CTYPE` environment variable directly.
- Unless you call `getpreferredencoding(do_setlocale=False)`, you will not get the encoding that was temporarily changed with `setlocale`.
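The environment-variable route can be tried from a script by starting a child interpreter with a modified environment. Treat this as a sketch with assumptions: on Python 3.7 and later, C-locale coercion and UTF-8 mode can mask the effect, so the example sets `PYTHONCOERCECLOCALE=0` and `PYTHONUTF8=0` for the child, and it removes `LC_ALL` because that variable would take precedence over `LC_CTYPE`.

```python
import os
import subprocess
import sys

# Child environment: LC_CTYPE=C, with LC_ALL removed because it would override LC_CTYPE.
env = dict(os.environ, LC_CTYPE="C")
env.pop("LC_ALL", None)
# Assumption: on Python 3.7+ these switches turn off C-locale coercion
# and UTF-8 mode, which would otherwise hide the effect.
env["PYTHONCOERCECLOCALE"] = "0"
env["PYTHONUTF8"] = "0"

code = "import locale; print(locale.getpreferredencoding(False))"
result = subprocess.run(
    [sys.executable, "-c", code],
    env=env,
    stdout=subprocess.PIPE,
    universal_newlines=True,
    check=True,
)
child_encoding = result.stdout.strip()
print(child_encoding)
```

The name that comes back depends on the platform; for instance, the C locale is reported as `US-ASCII` on macOS and typically as `ANSI_X3.4-1968` on glibc-based Linux systems, while Windows does not use `LC_CTYPE` for this purpose.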
--As a rule, when dealing with files it is better to specify the character encoding explicitly.
--In Python 3, `open` accepts an `encoding` argument, so you can use that (= you can handle files regardless of `LC_CTYPE`).
with open('utf-8.txt', encoding='utf-8') as fp:
    text = fp.read()
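To check that an explicit `encoding` really is independent of `LC_CTYPE`, the round trip below switches the locale to `C` first and then writes and reads UTF-8 text. It is a sketch only: the temporary file and sample string are assumptions for illustration, and the original locale is restored at the end.

```python
import locale
import os
import tempfile

text = "日本語のテキスト"  # sample content, made up for this example

fd, path = tempfile.mkstemp(suffix=".txt")
os.close(fd)
saved_locale = locale.setlocale(locale.LC_CTYPE)  # query the current setting
try:
    # Even with LC_CTYPE switched to C, the explicit encoding is what gets used.
    locale.setlocale(locale.LC_CTYPE, "C")
    with open(path, mode="w", encoding="utf-8") as fp:
        fp.write(text)
    with open(path, mode="r", encoding="utf-8") as fp:
        round_tripped = fp.read()
finally:
    locale.setlocale(locale.LC_CTYPE, saved_locale)
    os.remove(path)

print(round_tripped == text)
```

Specifying the encoding on the write side matters just as much as on the read side, since writing relies on the same default.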
--If you want to write a library that works with both Python 2 and Python 3, it is better to open the file in binary mode and then decode the contents as UTF-8, or to use the `codecs` module.
# -*- coding: utf-8 -*-
import codecs
import locale

import six

locale.setlocale(locale.LC_CTYPE, 'C')

# Option 1: read raw bytes, then decode them as UTF-8
with open('utf-8.txt', 'rb') as fp:
    text1 = fp.read()
text1 = six.text_type(text1, 'utf-8')

# Option 2: let codecs.open decode while reading
with codecs.open('utf-8.txt', 'r', encoding='utf-8') as fp:
    text2 = fp.read()

assert text1 == text2
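Another option, not used in the code above, is `io.open`, which takes an `encoding` argument on Python 2.6+ and is the same function as the built-in `open` on Python 3. A minimal sketch, assuming a temporary file and made-up sample text:

```python
# -*- coding: utf-8 -*-
import io
import os
import tempfile

sample = u"日本語"  # made-up sample text; a unicode string on both Python 2 and 3

fd, path = tempfile.mkstemp(suffix=".txt")
os.close(fd)
try:
    # io.open accepts an encoding argument on both major versions.
    with io.open(path, "w", encoding="utf-8") as fp:
        fp.write(sample)
    with io.open(path, "r", encoding="utf-8") as fp:
        text3 = fp.read()
finally:
    os.remove(path)

print(text3 == sample)
```

This avoids the `six` dependency, at the cost of having to pass unicode strings (not byte strings) when writing on Python 2.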
--Python 3 determines the default character encoding for file handling from the OS and the locale (`LC_CTYPE`).
--As a rule, specify the character encoding explicitly when handling files. Otherwise, you may run into unexpected problems.
--I have to admit I made exactly this mistake recently.