The various encodings of Python 3

Handling text is much easier than it was in Python 2, but in some environments you can still run into an unexpected UnicodeError in Python 3. Let's sort out how Python handles its various encodings as of Python 3.6.

Python encoding

filesystem encoding (sys.getfilesystemencoding())

This encoding is mainly used for file paths, but it is also used for command line arguments. (Otherwise you would have trouble passing a file path as a command line argument.)

Because it is tied to the locale, it is also the encoding in effect when Python works with glibc and similar libraries. The name may be a holdover from the Python 2 era, but calling it the system encoding rather than the filesystem encoding would describe reality better.
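
As a minimal sketch of how the filesystem encoding is applied, the standard functions os.fsdecode() and os.fsencode() convert between bytes paths and str paths. The helper name roundtrip_path_bytes is made up for this illustration, and the round trip is assumed to run on a POSIX system or on Windows with the Python 3.6 default settings.

```python
import os
import sys


def roundtrip_path_bytes(raw: bytes) -> bytes:
    """Decode a bytes path to str and encode it back with the filesystem encoding."""
    as_text = os.fsdecode(raw)   # bytes -> str, using the filesystem encoding
    return os.fsencode(as_text)  # str -> bytes, using the same encoding


print("filesystem encoding:", sys.getfilesystemencoding())

# A UTF-8 encoded file name; only a boolean is printed so that the example
# also runs when standard output cannot show Japanese text.
raw_name = "こんにちは.txt".encode("utf-8")
print(roundtrip_path_bytes(raw_name) == raw_name)
```

On POSIX the round trip keeps the bytes intact even under an ASCII locale, because undecodable bytes are carried through the str form with the surrogateescape error handler.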

preferred encoding (locale.getpreferredencoding())

This encoding is mainly used for the contents of text files. It is the encoding used when you open a text file with the `open` function without specifying one.

Standard I/O encoding (sys.stdout.encoding)

The standard I/O encoding is the filesystem encoding when attached to a terminal and the preferred encoding otherwise, but it can be changed with the environment variable PYTHONIOENCODING.
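
One way to observe the effect of PYTHONIOENCODING is to start a child interpreter and ask it for its sys.stdout.encoding. This is only a sketch: the helper name child_stdout_encoding is hypothetical, and the child's standard output is a pipe here, not a terminal.

```python
import codecs
import os
import subprocess
import sys


def child_stdout_encoding(io_encoding=None):
    """Run a child interpreter and return the sys.stdout.encoding it reports."""
    env = dict(os.environ)
    env.pop("PYTHONIOENCODING", None)  # start from a clean setting
    if io_encoding is not None:
        env["PYTHONIOENCODING"] = io_encoding
    result = subprocess.run(
        [sys.executable, "-c", "import sys; print(sys.stdout.encoding)"],
        stdout=subprocess.PIPE,
        env=env,
        check=True,
    )
    return result.stdout.decode("ascii").strip()


# Without the variable the result depends on the environment.
print("default :", child_stdout_encoding())
# With the variable set, the child should report a UTF-8 codec.
print("override:", codecs.lookup(child_stdout_encoding("utf-8")).name)
```

codecs.lookup() is used only to normalize the spelling of the encoding name before printing it.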

default encoding (sys.getdefaultencoding())

This is the encoding used when none is specified explicitly while converting between a Unicode string (str) and a byte string (bytes). In Python 3 it is always UTF-8, regardless of the environment.

>>> "こんにちは".encode()
b'\xe3\x81\x93\xe3\x82\x93\xe3\x81\xab\xe3\x81\xa1\xe3\x81\xaf'
>>> _.decode()
'こんにちは'

In the Python 2 era it was ascii, and because str and unicode were converted implicitly, it tended to be a source of trouble. (There was also a hack to force it to utf-8 in order to run apps that had not been written with multibyte text in mind.)

Python 3 does not convert implicitly between str and bytes, so this is nothing more than a default argument. You can forget that it exists.
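
The following short sketch shows both points: the default encoding is reported as utf-8, and mixing str and bytes fails with a TypeError instead of converting silently. The helper name concat_or_error is made up for this illustration.

```python
import sys


def concat_or_error(text, data):
    """Try to concatenate str and bytes; return the error text instead of raising."""
    try:
        return text + data  # Python 3 never converts implicitly here
    except TypeError as exc:
        return "TypeError: {}".format(exc)


print(sys.getdefaultencoding())
print(concat_or_error("abc", b"def"))
# The conversion has to be written out; decode() defaults to UTF-8.
print("abc" + b"def".decode())
```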

Behavior in various environments

macOS, Android

The filesystem encoding is always UTF-8, because these systems fix it to UTF-8 at the OS level.

The preferred encoding is always UTF-8 on Android. On macOS it can be changed through the locale, but I think this is rarely a problem. (If you set a locale, the behavior is the same as on Linux, described later.)

Windows

Since Python on Windows uses the W (wide character) API both when opening files and when receiving command line arguments, the filesystem encoding is rarely used.

It matters when a file path is handled as a byte string. Until Python 3.5 this went through the Windows A (ANSI) API, so the filesystem encoding was effectively the current code page, for example cp932 on Japanese Windows. Starting with Python 3.6 this behavior changed: bytes paths are converted from UTF-8 to UTF-16 and passed to the W API, so the filesystem encoding is 'utf-8'. (The environment variable PYTHONLEGACYWINDOWSFSENCODING reverts to the Python 3.5 behavior.)

The preferred encoding, on the other hand, still follows the code page. This legacy will remain until Microsoft changes the standard text file encoding to UTF-8, which is unfortunate.

Linux, other Unix

The most disappointing case is Linux and other Unix systems. Everything depends on the locale (LC_CTYPE). If someone still uses LANG=ja_JP.eucJP, both the filesystem encoding and the preferred encoding will be EUC-JP. If you want to create text files in UTF-8, then even if you have no intention of supporting Windows, specify the encoding explicitly in the `open` function so that your code also works in environments where the locale is not UTF-8.
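
A minimal sketch of that advice follows: passing encoding="utf-8" on both the write and the read makes the result independent of the locale. The helper name write_and_read_utf8 and the file name sample.txt are made up for this illustration.

```python
import os
import tempfile


def write_and_read_utf8(text):
    """Write text as UTF-8 into a temporary file and read it back."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "sample.txt")
        # An explicit encoding overrides the locale-dependent preferred encoding.
        with open(path, "w", encoding="utf-8") as f:
            f.write(text)
        with open(path, "r", encoding="utf-8") as f:
            return f.read()


# Only a boolean is printed, so this runs even when stdout is ASCII-only.
print(write_and_read_utf8("こんにちは") == "こんにちは")
```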

The locale dependency is particularly painful because the encoding (LC_CTYPE) of the default locale (C, also called POSIX) is ASCII.

Some people use the C or POSIX locale on purpose, because they dislike how the behavior of sort and other commands changes with the locale. Embedded systems and container images may omit the en_US.utf8 and ja_JP.utf8 locales to save space. Also, if you ssh from a Mac (which sends the LANG environment variable) to a Linux machine that only has en_US.UTF-8, the attempt to use ja_JP.UTF-8 may fail and fall back to C. In all of these cases Python ends up completely in ASCII mode.

$ export LC_ALL=C
$ echo 'print("こんにちは\n")' > hello
$ ruby hello
こんにちは
$ perl hello
こんにちは
$ python3 hello
Traceback (most recent call last):
  File "hello", line 1, in <module>
    print("\u3053\u3093\u306b\u3061\u306f\n")
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-4: ordinal not in range(128)

If it is acceptable for the filesystem encoding to remain ASCII, you can set just the standard I/O encoding with PYTHONIOENCODING, without changing the locale. Write it in .bashrc or in your crontab.

$ PYTHONIOENCODING=utf-8 python3 hello
こんにちは
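
The failure and the fix above can be imitated inside one process, without touching the locale, by printing to an in-memory stream with a chosen encoding. This is only a sketch of what happens to sys.stdout; the helper name print_with_encoding is made up for this illustration.

```python
import io


def print_with_encoding(text, encoding):
    """Print text to an in-memory text stream and return the bytes it produced."""
    raw = io.BytesIO()
    # newline="\n" keeps the output identical on every platform.
    stream = io.TextIOWrapper(raw, encoding=encoding, newline="\n")
    print(text, file=stream)
    stream.flush()
    return raw.getvalue()


# An ASCII stream behaves like stdout under LC_ALL=C.
try:
    print_with_encoding("こんにちは", "ascii")
except UnicodeEncodeError as exc:
    print("ascii stream failed:", exc.reason)

# A UTF-8 stream behaves like stdout under PYTHONIOENCODING=utf-8.
print(print_with_encoding("こんにちは", "utf-8"))
```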

(Appendix) Linux locale settings for using UTF-8 with Python

The first thing to remember is the locale command, which displays the current locale, and the locale -a command, which displays a list of available locales.

$ locale
LANG=C.UTF-8
LANGUAGE=en_US:en
LC_CTYPE="C.UTF-8"
LC_NUMERIC="C.UTF-8"
LC_TIME="C.UTF-8"
LC_COLLATE="C.UTF-8"
LC_MONETARY="C.UTF-8"
LC_MESSAGES="C.UTF-8"
LC_PAPER="C.UTF-8"
LC_NAME="C.UTF-8"
LC_ADDRESS="C.UTF-8"
LC_TELEPHONE="C.UTF-8"
LC_MEASUREMENT="C.UTF-8"
LC_IDENTIFICATION="C.UTF-8"
LC_ALL=

$ locale -a
C
C.UTF-8
POSIX
en_US.utf8

I think C.UTF-8 is included in most modern Linux distributions. It is a UTF-8 version of the C locale that does nothing extra, which makes it perfect for people who want the C locale but also want UTF-8 file names. If it does not exist and you have root privileges, you should be able to create it with sudo localedef -c -i POSIX -f UTF-8 C.UTF-8.

If ja_JP.UTF-8 or en_US.UTF-8 exists and you want to use it, set it in the LANG environment variable and check the result with the locale command.

$ export LANG=en_US.utf8
$ locale
LANG=en_US.utf8
LANGUAGE=en_US:en
LC_CTYPE="en_US.utf8"
LC_NUMERIC="en_US.utf8"
LC_TIME="en_US.utf8"
LC_COLLATE="en_US.utf8"
LC_MONETARY="en_US.utf8"
LC_MESSAGES="en_US.utf8"
LC_PAPER="en_US.utf8"
LC_NAME="en_US.utf8"
LC_ADDRESS="en_US.utf8"
LC_TELEPHONE="en_US.utf8"
LC_MEASUREMENT="en_US.utf8"
LC_IDENTIFICATION="en_US.utf8"
LC_ALL=

"I want to use the C locale! But I want UTF-8! But I can't create C.UTF-8!" In that case, let's control things a little more precisely. Of the LC_XXXX categories, Python uses LC_CTYPE to determine the encoding. The environment variable LANG supplies the default for every LC_XXXX category, each category can be overridden individually with its own LC_XXXX variable, and LC_ALL, if set, overrides all of them. So you can set LC_CTYPE to some UTF-8 locale while keeping LANG=C.

$ export LANG=C
$ export LC_CTYPE=en_US.UTF-8
$ locale
LANG=C
LANGUAGE=en_US:en
LC_CTYPE=en_US.UTF-8
LC_NUMERIC="C"
LC_TIME="C"
LC_COLLATE="C"
LC_MONETARY="C"
LC_MESSAGES="C"
LC_PAPER="C"
LC_NAME="C"
LC_ADDRESS="C"
LC_TELEPHONE="C"
LC_MEASUREMENT="C"
LC_IDENTIFICATION="C"
LC_ALL=

$ locale charmap
UTF-8
$ python3 -c 'import sys; print(sys.getfilesystemencoding())'
utf-8
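
The precedence described above (LC_ALL first, then LC_CTYPE, then LANG) can be written down as a small function. This is a sketch that only resolves the locale name from a dictionary of environment variables; it does not check whether that locale is installed. The helper name effective_lc_ctype is made up for this illustration, and falling back to "C" when nothing is set is an assumption that matches glibc.

```python
def effective_lc_ctype(env):
    """Return the locale name that applies to LC_CTYPE for the given variables."""
    for name in ("LC_ALL", "LC_CTYPE", "LANG"):
        value = env.get(name)
        if value:  # unset and empty variables are both skipped
            return value
    return "C"  # assumed default when nothing is set


print(effective_lc_ctype({"LANG": "C", "LC_CTYPE": "en_US.UTF-8"}))
print(effective_lc_ctype({"LANG": "en_US.utf8", "LC_ALL": "C"}))
```

To inspect the current process, pass os.environ to the function.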
