Character encoding when dealing with files in Python 3

Overview

- In Python 3, the default character encoding used by open() and similar functions when handling files depends on the OS.
- On Unix (Linux), it depends on the locale (LC_CTYPE).
- If you read or write files without paying attention to this, you may hit UnicodeDecodeError and similar errors depending on the environment.
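
As a quick way to see which encoding a given environment picks, the following minimal sketch opens a scratch file without an encoding argument and reports the encoding of the resulting file object. The helper name default_open_encoding and the use of a temporary file are assumptions made for illustration; the printed values vary by OS, locale, and Python version.

```python
import locale
import os
import tempfile


def default_open_encoding():
    """Return the encoding open() picks when no encoding argument is given."""
    # Create an empty scratch file so there is something to open.
    fd, path = tempfile.mkstemp(suffix=".txt")
    os.close(fd)
    try:
        with open(path) as fp:
            # Text-mode file objects expose the encoding they were opened with.
            return fp.encoding
    finally:
        os.remove(path)


print(default_open_encoding())
print(locale.getpreferredencoding(False))
```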

Verification

- Check the behavior on a local macOS machine.
- For example, suppose you have a UTF-8 text file containing Japanese text. Open the file and read its contents:

with open('utf-8.txt', mode='r') as fp:
    text = fp.read()

- The file opens without any error and its contents are read.
- This is because the default character encoding on macOS is UTF-8.
- You can check the encoding actually in use with locale.getpreferredencoding.

>>> import locale
>>> locale.getpreferredencoding()
'UTF-8'

- Because getpreferredencoding returns UTF-8, UTF-8 text can be read without error.
- Next, change LC_CTYPE and confirm that an error occurs.
- Use setlocale to change LC_CTYPE temporarily.

import locale

locale.setlocale(locale.LC_CTYPE, 'C')
print(locale.getpreferredencoding(False))  # => US-ASCII

with open('hoge.txt') as fp:
    text = fp.read()

Result

US-ASCII
Traceback (most recent call last):
  File "test.py", line 7, in <module>
    text = fp.read()
  File "/path/to/encodings/ascii.py", line 26, in decode
    return codecs.ascii_decode(input, self.errors)[0]
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe3 in position 0: ordinal not in range(128)

- Setting LC_CTYPE to C changes the character encoding to US-ASCII.
- As a result, reading the UTF-8 text raised UnicodeDecodeError.
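
The same failure can be reproduced without touching the locale by passing encoding='ascii' explicitly, which is roughly what the C locale makes open() do in the run above. This is a minimal sketch: the helper name read_utf8_file_as_ascii and the sample text are assumptions for illustration. The sample starts with the hiragana "あ", whose UTF-8 encoding begins with byte 0xe3, the same byte reported in the traceback.

```python
import os
import tempfile


def read_utf8_file_as_ascii(text):
    """Write text as UTF-8, then try to read it back as ASCII.

    Returns the UnicodeDecodeError raised by the read, or None if the
    text was pure ASCII and decoded cleanly.
    """
    fd, path = tempfile.mkstemp(suffix=".txt")
    try:
        # Write the sample with an explicit encoding so the bytes on disk
        # are UTF-8 regardless of the current locale.
        with os.fdopen(fd, "w", encoding="utf-8") as fp:
            fp.write(text)
        try:
            with open(path, encoding="ascii") as fp:
                fp.read()
        except UnicodeDecodeError as exc:
            return exc
        return None
    finally:
        os.remove(path)


error = read_utf8_file_as_ascii("あいうえお")  # hypothetical sample text
print(error)
```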

Note
- The same behavior can be confirmed without setlocale by changing the LC_CTYPE environment variable directly.
- Call getpreferredencoding(do_setlocale=False); otherwise you will not get the encoding that was temporarily changed with setlocale.

Solution

- As a rule, specify the character encoding explicitly when dealing with files.
- In Python 3, open() accepts an encoding argument, so use that (this lets you handle files regardless of LC_CTYPE).

with open('utf-8.txt', encoding='utf-8') as fp:
    text = fp.read()
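
The same applies to writing. As a rough sketch, the round trip below writes and reads with encoding='utf-8' on both sides, so the result should not depend on LC_CTYPE. The helper names write_text and read_text, the temporary directory, and the sample string are assumptions for illustration.

```python
import os
import tempfile


def write_text(path, text):
    """Write text to path as UTF-8, independent of the locale."""
    with open(path, "w", encoding="utf-8") as fp:
        fp.write(text)


def read_text(path):
    """Read path as UTF-8, independent of the locale."""
    with open(path, encoding="utf-8") as fp:
        return fp.read()


with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "utf-8.txt")
    original = "日本語のテキスト\n"  # hypothetical sample text
    write_text(path, original)
    restored = read_text(path)

print(restored == original)
```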

- If you want to write a library that works with both Python 2 and Python 3, either open the file in binary mode and decode the contents as UTF-8, or use the codecs module.

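# -*- coding: utf-8 -*-
import locale
import codecs
import six

# Force the C locale to show that neither approach depends on LC_CTYPE.
locale.setlocale(locale.LC_CTYPE, 'C')

# Approach 1: read raw bytes in binary mode, then decode them as UTF-8.
with open('utf-8.txt', 'rb') as fp:
    text1 = fp.read()
    text1 = six.text_type(text1, 'utf-8')

# Approach 2: let codecs.open decode as UTF-8 while reading.
with codecs.open('utf-8.txt', 'r', encoding='utf-8') as fp:
    text2 = fp.read()

assert text1 == text2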
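
Another option for code that must run on both Python 2 and Python 3 is io.open, which accepts an encoding argument on both (in Python 3 it is the same function as the built-in open). This is a swapped-in alternative rather than the approach used above; the helper name read_utf8, the temporary file, and the sample text are assumptions for illustration.

```python
# -*- coding: utf-8 -*-
import io
import os
import tempfile


def read_utf8(path):
    """Read a file as UTF-8 text on both Python 2 and Python 3."""
    with io.open(path, "r", encoding="utf-8") as fp:
        return fp.read()


# Create a small UTF-8 file to read back (sample data for illustration).
fd, path = tempfile.mkstemp(suffix=".txt")
os.close(fd)
try:
    with io.open(path, "wb") as fp:
        fp.write(u"\u3042\u3044\u3046".encode("utf-8"))
    text = read_utf8(path)
finally:
    os.remove(path)

assert text == u"\u3042\u3044\u3046"
```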

Summary

- Python 3 determines the default character encoding for file handling from the OS and the locale (LC_CTYPE).
- As a rule, specify the character encoding explicitly when handling files. Otherwise, you will run into unexpected problems.
- I made exactly this mistake recently:

https://github.com/zengin-code/zengin-py/pull/4

References

http://qiita.com/methane/items/6e294ef5a1fad4afa843
http://qiita.com/methane/items/dac75ef5019b311a0f10
https://docs.python.jp/3/library/locale.html#locale.setlocale
https://docs.python.jp/3/library/locale.html#locale.getpreferredencoding