Background

In Python 3, str has become Unicode. So what happens with CP932, the encoding traditionally known as Shift-JIS?

I ran into a character conversion error when printing to standard output on Windows, so I sorted out what was going on.

Environment

Windows Python3 (Anaconda3)

Windows and Python and encoding

Python string encoding

In Python 3, there are two string types:
- str type (Unicode only)
- bytes type (arbitrary encoding)

str holds Unicode text only; it cannot hold a string in any other encoding. bytes, on the other hand, can hold data in any encoding, including UTF-8. You convert str to bytes with encode() and back with decode(). If you are unsure which method belongs to which type, check with dir(str). Unlike Python 2, the two types do not both offer both methods: str has only encode() and bytes has only decode().

Python 2, by contrast, has a str type and a unicode type.

Python 3 internal                      Windows standard output (input)
=================                      ===============================

  Unicode     ---------------------->   CP932
 (str type)     str.encode('cp932')    (bytes type)
              <----------------------
                bytes.decode('cp932')
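
The round trip in the diagram can be sketched as follows. This is only a minimal illustration; the sample text is an arbitrary choice and does not come from the original article.

```python
# Round trip between str (Unicode text) and bytes (encoded data).
text = "日本語"                      # sample text chosen for illustration

encoded = text.encode("cp932")       # str -> bytes
decoded = encoded.decode("cp932")    # bytes -> str

print(type(encoded).__name__)        # bytes
print(decoded == text)               # True

# The same text has a different byte length in each encoding.
print(len(text.encode("cp932")))     # 6 (2 bytes per character)
print(len(text.encode("utf-8")))     # 9 (3 bytes per character)
```

The str object itself is the same in both cases; only the bytes produced by encode() depend on the chosen encoding.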

Windows encoding

On Japanese-language Windows, standard output uses an encoding called CP932 by default. Therefore, when a str is printed to standard output or written to a text-mode file, it is automatically converted to CP932 unless you specify otherwise.

Why can't the string be printed?

You never call a conversion explicitly, but when Python writes to standard output it automatically encodes the string to the system encoding first.

On Windows it tries to convert to CP932, so if the string contains a character that CP932 cannot represent, a UnicodeEncodeError is raised.
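
To see which encoding Python will actually use on a given machine, something like the following may help. It is a rough sketch: describe_encodings is a hypothetical helper name introduced here, and the values it reports depend on the OS, the locale, and how the script is launched.

```python
import locale
import sys


def describe_encodings():
    """Return the encodings Python reports for this process (hypothetical helper)."""
    return {
        # Encoding used when print() writes to standard output.
        "stdout": getattr(sys.stdout, "encoding", None),
        # Locale encoding, the usual default for text files opened
        # without an explicit encoding argument.
        "preferred": locale.getpreferredencoding(False),
    }


for name, value in describe_encodings().items():
    print(f"{name}: {value}")
```

On a Japanese Windows setup one would expect cp932 to appear here, at least for the locale encoding, but the result can differ between a console and redirected output.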

>>> s = '\xa0'
>>> print(s)

>>> s.encode('utf-8')
b'\xc2\xa0'
>>> s.encode('cp932')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'cp932' codec can't encode character '\xa0' in position 0: illegal multibyte sequence

Remove the offending character

UnicodeEncodeError is raised because the string contains a character that cannot be converted to CP932, so removing that character may solve the problem.

In this case \xa0 is the culprit, so stripping it with the replace method makes the exception go away.

>>> s
'\xa0'
>>> s.encode('cp932')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'cp932' codec can't encode character '\xa0' in position 0: illegal multibyte sequence
>>> s2 = s.replace('\xa0', '')
>>> s2.encode('cp932')
b''
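
When it is not obvious which characters are the problem, one possible approach is to test each character individually. This is a minimal sketch; find_unencodable is a hypothetical helper and the sample string is made up for illustration.

```python
def find_unencodable(text, encoding="cp932"):
    """Return the distinct characters in text that encoding cannot represent."""
    bad = []
    for ch in text:
        try:
            ch.encode(encoding)
        except UnicodeEncodeError:
            # Record each offending character once, in order of appearance.
            if ch not in bad:
                bad.append(ch)
    return bad


# Sample containing NO-BREAK SPACE (twice) and an emoji.
sample = "a\xa0b\U0001f600\xa0"

# ascii() shows the escapes, so this print is safe on any console.
print([ascii(ch) for ch in find_unencodable(sample)])
```

The returned characters can then be passed to replace one by one, as in the example above.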

Ignore the offending character

Removing every character that cannot be converted to CP932 one by one is tedious, and it is easy to miss some. I wondered whether encode had an option to skip characters it cannot convert, and a search showed that there is an ignore option.

[Reference] Conversion to byte string https://docs.python.jp/3/howto/unicode.html (Besides ignore, there are also replace, namereplace, and others.)

An example of suppressing the exception with the ignore option:

>>> s.encode('cp932')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'cp932' codec can't encode character '\xa0' in position 0: illegal multibyte sequence
>>> s.encode('cp932', "ignore")
b''
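
For comparison, here is a short sketch of what the other standard error handlers produce for the same character. The sample string is an illustrative choice; the handler names are the ones defined by Python's codecs machinery.

```python
s = "a\xa0b"   # sample string; \xa0 cannot be encoded to CP932

handlers = ("ignore", "replace", "xmlcharrefreplace",
            "backslashreplace", "namereplace")

# Encode the same string with each error handler and collect the results.
results = {name: s.encode("cp932", name) for name in handlers}

for name, encoded in results.items():
    print(f"{name:18} {encoded!r}")

# ignore             b'ab'
# replace            b'a?b'
# xmlcharrefreplace  b'a&#160;b'
# backslashreplace   b'a\\xa0b'
# namereplace        b'a\\N{NO-BREAK SPACE}b'
```

ignore silently loses information, so one of the replacing handlers may be preferable when you need to notice that something was dropped.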

Summary

Strings containing characters such as \xa0 are held internally as Unicode in Python 3, so they can be processed within Python without problems. However, when they must be converted to CP932 in a Windows environment, for example when printing to standard output or writing to a file, a Unicode --> CP932 conversion is performed, and that is where UnicodeEncodeError is raised.

If you encode the string once with the ignore option to get a bytes object and then decode it back to str, the result no longer contains the unencodable characters, so later conversions will not raise UnicodeEncodeError.

When writing to a file, bytes can only be written in binary mode, so open the file with 'wb' or 'ab' instead of 'w' or 'a'. If you open the file with codecs, you can specify the encoding and the ignore option at open time and write str directly.

Example of standard output:

s = '\xa0'
# Drop characters that CP932 cannot represent, then convert back to str
b = s.encode('cp932', 'ignore')
s_after = b.decode('cp932')
print(s_after)  # only CP932-encodable characters remain
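
An alternative that avoids cleaning every string by hand is to attach an error handler to the output stream itself. The sketch below uses an in-memory stream as a stand-in for a CP932 standard output so that it behaves the same on any OS; that stand-in is an assumption made for illustration, not something from the original article.

```python
import io

# In-memory stand-in for a CP932 standard output stream (assumption:
# a real Japanese Windows stdout would behave like this wrapper).
raw = io.BytesIO()
fake_stdout = io.TextIOWrapper(raw, encoding="cp932",
                               errors="replace", newline="\n")

# With errors="replace", unencodable characters become "?" instead of
# raising UnicodeEncodeError.
print("a\xa0b", file=fake_stdout)
fake_stdout.flush()

written = raw.getvalue()
print(written)   # b'a?b\n'
```

On Python 3.7 and later, the real stream can be adjusted in a similar way with sys.stdout.reconfigure(errors="replace"), provided sys.stdout is an ordinary text stream.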

Example of file output:

s = '\xa0'
b = s.encode('cp932', 'ignore')
# bytes can only be written to a file opened in binary mode
with open('test', 'ab') as f:
    f.write(b)

Example of outputting a file using codecs:

import codecs
s = '\xa0'
# codecs.open encodes on write, so a str can be written directly
with codecs.open('test', 'ab', 'cp932', 'ignore') as f:
    f.write(s)
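
In Python 3 the built-in open also accepts encoding and errors arguments, so the same effect should be achievable without codecs. A minimal sketch follows; the temporary directory and the file name test.txt are placeholders chosen for illustration.

```python
import os
import tempfile

# Hypothetical output location, used so the example does not touch real files.
path = os.path.join(tempfile.mkdtemp(), "test.txt")

# Text mode: open() encodes each str on write and, with errors="ignore",
# skips characters that CP932 cannot represent.
with open(path, "a", encoding="cp932", errors="ignore") as f:
    f.write("a\xa0b")

# Read the raw bytes back to confirm what was actually written.
with open(path, "rb") as f:
    data = f.read()
print(data)   # b'ab'
```

Note that the file is opened in text mode ('a', not 'ab') here, because the conversion from str to bytes is done by open itself.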

References

Python3 Unicode HOWTO https://docs.python.jp/3/howto/unicode.html

CP932 and UTF-8 https://android.googlesource.com/toolchain/benchmark/+/master/python/src/Modules/cjkcodecs/README

(Windows) Causes and workarounds for UnicodeEncodeError on Python 3