Introduction

In Python 3, this article is an attempt to somehow make garbled output strings such as `'\udc82Ђ\udce7\udc83J\udc83^\udc8a\udcbf\udc8e\udc9a'` readable again.

Reference

When a Shift_JIS byte string is decoded as UTF-8

`UnicodeDecodeError` by default

Attempting to decode Shift_JIS bytes with `decode()`, which uses UTF-8 by default, raises `UnicodeDecodeError`.

>>> bytes_sjis = "ひらカタ漢字".encode("shift_jis")
>>> bytes_sjis
b'\x82\xd0\x82\xe7\x83J\x83^\x8a\xbf\x8e\x9a'
>>> bytes_sjis.decode()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x82 in position 0: invalid start byte
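
As a small sketch of what this exception carries, it can be caught and inspected. The attributes `encoding`, `object`, `start`, `end` and `reason` are standard on `UnicodeDecodeError`; the variable name `error` is only chosen here for illustration.

```python
# Shift_JIS bytes for the sample string
bytes_sjis = "ひらカタ漢字".encode("shift_jis")

error = None
try:
    bytes_sjis.decode()  # encoding defaults to "utf-8", errors to "strict"
except UnicodeDecodeError as exc:
    error = exc  # keep a reference; "exc" is cleared when the block ends
    print(exc.encoding)                    # codec that failed
    print(exc.start, exc.end)              # slice bounds of the offending bytes
    print(exc.object[exc.start:exc.end])   # the offending bytes themselves
    print(exc.reason)                      # short description of the failure
```

The position and the offending byte are the same ones shown in the traceback message above.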

Decoding result when an error handler is specified

The error in the previous section occurs because the optional `errors` argument of `decode()` defaults to `"strict"`. If you pass some other value for `errors`, no error is raised and a different string is returned.

> The errors argument specifies what to do when the input string cannot be converted according to the encoding's rules. Legal values for this argument are 'strict' (raise a `UnicodeDecodeError` exception), 'replace' (use U+FFFD, REPLACEMENT CHARACTER), 'ignore' (just leave the character out of the Unicode result), or 'backslashreplace' (insert a `\xNN` escape sequence).
> Unicode HOWTO

In addition, you can also specify `'surrogateescape'`.

> `'surrogateescape'`: Replace the byte sequence with individual surrogate codes in the range U+DC80 to U+DCFF.
> 7.2. codecs — Codec registry and base classes

Let's look at the results.

>>> bytes_sjis.decode("utf-8", errors="replace")
'�Ђ�J�^����'
>>> bytes_sjis.decode("utf-8", errors="ignore")
'ЂJ^'
>>> bytes_sjis.decode("utf-8", errors="backslashreplace")
'\\x82Ђ\\xe7\\x83J\\x83^\\x8a\\xbf\\x8e\\x9a'
>>> bytes_sjis.decode("utf-8", errors="surrogateescape")
'\udc82Ђ\udce7\udc83J\udc83^\udc8a\udcbf\udc8e\udc9a'

Incidentally, in a Windows environment (CP932, roughly Shift_JIS) rather than a UTF-8 environment, the same results are displayed as follows.

>>> bytes_sjis.decode("utf-8", errors="replace")
'\ufffd\u0402\ufffdJ\ufffd^\ufffd\ufffd\ufffd\ufffd'
>>> bytes_sjis.decode("utf-8", errors="ignore")
'\u0402J^'
>>> bytes_sjis.decode("utf-8", errors="backslashreplace")
'\\x82\u0402\\xe7\\x83J\\x83^\\x8a\\xbf\\x8e\\x9a'
>>> bytes_sjis.decode("utf-8", errors="surrogateescape")
'\udc82\u0402\udce7\udc83J\udc83^\udc8a\udcbf\udc8e\udc9a'
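
The two listings differ only in how the prompt displays the strings; the decoded strings themselves are the same. As a minimal sketch, the built-in `ascii()` gives the escaped form regardless of the terminal's encoding, which makes results easier to compare across environments. The names `handlers` and `results` are introduced here for illustration.

```python
bytes_sjis = "ひらカタ漢字".encode("shift_jis")

handlers = ("replace", "ignore", "backslashreplace", "surrogateescape")
results = {h: bytes_sjis.decode("utf-8", errors=h) for h in handlers}

for name, text in results.items():
    # ascii() escapes every non-ASCII character, so this output does not
    # depend on what the terminal is able to display.
    print(f"{name:17} {ascii(text)}")
```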

Restoring the original string from the result of decoding a Shift_JIS byte string as UTF-8

When `'replace'` or `'ignore'` is specified, information has been lost and the original cannot be restored, but in the other cases you can restore the original string as shown below. Why this also works in the `backslashreplace` case is something I do not fully understand...?

>>> bytes_sjis = "ひらカタ漢字".encode("shift_jis")
>>> backslash_str = bytes_sjis.decode("utf-8", errors="backslashreplace")
>>> backslash_str.encode().decode("unicode_escape").encode("raw_unicode_escape").decode("shift_jis")
'ひらカタ漢字'

>>> surrogate_str = bytes_sjis.decode("utf-8", errors="surrogateescape")
>>> surrogate_str.encode("utf-8", errors="surrogateescape").decode("shift_jis")
'ひらカタ漢字'
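
To see why the `backslashreplace` chain works, it may help to look at each intermediate value. The following is a step-by-step sketch of the same one-liner, with the names `step1` to `step3` chosen here for illustration. One assumption to keep in mind: the trick relies on the original bytes not containing a literal backslash (0x5C), because `unicode_escape` would try to read it as the start of an escape sequence.

```python
bytes_sjis = "ひらカタ漢字".encode("shift_jis")
backslash_str = bytes_sjis.decode("utf-8", errors="backslashreplace")

# 1. Encode back to UTF-8: the escapes stay as ASCII text such as b"\\x82",
#    while the accidentally decoded character U+0402 becomes b"\xd0\x82" again.
step1 = backslash_str.encode("utf-8")

# 2. unicode_escape turns each "\xNN" escape into the code point U+00NN and
#    reads every other byte as Latin-1, so each character now stands for
#    exactly one of the original bytes.
step2 = step1.decode("unicode_escape")

# 3. raw_unicode_escape writes code points below U+0100 as single bytes,
#    which yields the original Shift_JIS byte string.
step3 = step2.encode("raw_unicode_escape")

restored = step3.decode("shift_jis")
print(step3 == bytes_sjis)
```

In other words, the chain is a roundabout way of getting the raw bytes back; once they are recovered, decoding with the right codec is all that remains.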

When a UTF-8 byte string is ASCII-decoded

Let's do a UTF-8 -> ASCII -> UTF-8 round trip in the same way as the Shift_JIS -> UTF-8 -> Shift_JIS conversion.

`UnicodeDecodeError` by default

>>> bytes_utf8 = "ひらカタ漢字".encode("utf-8")
>>> bytes_utf8
b'\xe3\x81\xb2\xe3\x82\x89\xe3\x82\xab\xe3\x82\xbf\xe6\xbc\xa2\xe5\xad\x97'
>>> bytes_utf8.decode("ascii")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe3 in position 0: ordinal not in range(128)

Decoding result when an error handler is specified

>>> bytes_utf8.decode("ascii", errors="ignore")
''
>>> bytes_utf8.decode("ascii", errors="replace")
'������������������'
>>> bytes_utf8.decode("ascii", errors="backslashreplace")
'\\xe3\\x81\\xb2\\xe3\\x82\\x89\\xe3\\x82\\xab\\xe3\\x82\\xbf\\xe6\\xbc\\xa2\\xe5\\xad\\x97'
>>> bytes_utf8.decode("ascii", errors="surrogateescape")
'\udce3\udc81\udcb2\udce3\udc82\udc89\udce3\udc82\udcab\udce3\udc82\udcbf\udce6\udcbc\udca2\udce5\udcad\udc97'

Restoring the original string from the result of decoding a UTF-8 byte string as ASCII

For UTF-8 and ASCII, the same approach as before works, with `utf-8` as the final decoding step.

>>> backslash_str = bytes_utf8.decode("ascii", errors="backslashreplace")
>>> backslash_str.encode().decode("unicode_escape").encode("raw_unicode_escape").decode("utf-8")
'ひらカタ漢字'
>>> surrogate_str = bytes_utf8.decode("ascii", errors="surrogateescape")
>>> surrogate_str.encode("ascii", errors="surrogateescape").decode("utf-8")
'ひらカタ漢字'
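
Since `surrogateescape` maps each undecodable byte to its own surrogate code point, the round trip should be lossless for any input, not just this sample. Here is a quick sketch that checks this for all 256 byte values under the ASCII codec; the names `all_bytes`, `escaped` and `restored` are introduced here for illustration.

```python
all_bytes = bytes(range(256))

# Bytes 0x00-0x7F decode normally; 0x80-0xFF become U+DC80-U+DCFF.
escaped = all_bytes.decode("ascii", errors="surrogateescape")

# Encoding with the same handler turns each surrogate back into its byte.
restored = escaped.encode("ascii", errors="surrogateescape")

print(restored == all_bytes)
```

Note that the handler has to be given on the encoding side as well: a string containing lone surrogates cannot be encoded under the default `errors="strict"`.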

Common examples

Finally, here are some examples of forcibly restoring text that came out garbled because a setting was forgotten.

When you forget to set `ensure_ascii=False` for `json` output

>>> import json
>>> ascii_json = json.dumps({"キー":"値"})
>>> ascii_json
'{"\\u30ad\\u30fc": "\\u5024"}'
>>> ascii_json.encode().decode("unicode_escape")
'{"キー": "値"}'
>>> ascii_json.encode().decode("raw_unicode_escape")
'{"キー": "値"}'
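
Decoding the escapes by hand works for this sample, but since the string is valid JSON, a less fragile route is probably to parse it with `json.loads` and, if needed, dump it again with `ensure_ascii=False`. The sketch below also shows one case where the two routes differ: characters outside the BMP are written as surrogate pairs, which `unicode_escape` leaves as two separate code points. The variable names are chosen here for illustration.

```python
import json

ascii_json = json.dumps({"キー": "値"})       # ensure_ascii defaults to True
data = json.loads(ascii_json)                 # the parser undoes the \uXXXX escapes
readable_json = json.dumps(data, ensure_ascii=False)

# A character outside the BMP is escaped as a surrogate pair.
emoji_json = json.dumps("😀")
via_escape = emoji_json.encode().decode("unicode_escape")  # two lone surrogates
via_loads = json.loads(emoji_json)                         # one character

print(ascii(readable_json))
print(len(via_escape), len(via_loads))
```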

When `encoding` of the result obtained by `requests` is not changed

>>> import requests
>>> r = requests.get('http://www.mof.go.jp/')
>>> r.text
'...
<meta property="og:title" content="\x8dà\x96±\x8fÈ\x83z\x81[\x83\x80\x83y\x81[\x83W" />
...'
>>> r.text.encode("raw_unicode_escape").decode("shift_jis")
'...
<meta property="og:title" content="財務省ホームページ" />
...'
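
The garbling here appears to come from the body having been decoded as ISO-8859-1 (Latin-1), which maps every byte to the code point of the same value. The following sketch reproduces that without any network access, using a hypothetical `body` in place of the real response content. With `requests` itself, setting `r.encoding = "shift_jis"` (or `r.encoding = r.apparent_encoding`) before reading `r.text` would presumably avoid the problem in the first place.

```python
# Hypothetical stand-in for the raw response body (r.content in requests).
body = '<meta property="og:title" content="財務省ホームページ" />'.encode("shift_jis")

# What r.text looks like when the body is decoded as ISO-8859-1.
garbled = body.decode("iso-8859-1")

# Latin-1 maps code points 0-255 back to the same byte values, so encoding
# with it recovers the raw bytes, which can then be decoded correctly.
restored = garbled.encode("iso-8859-1").decode("shift_jis")

print(ascii(garbled))
print(restored == body.decode("shift_jis"))
```

For strings whose code points are all below U+0100, `encode("raw_unicode_escape")` produces the same bytes as `encode("iso-8859-1")`, which is why the one-liner above works.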

[Python3] Switch between Shift_JIS, UTF-8 and ASCII
