This is an attempt to somehow recover readable text in Python 3 from garbled output such as `'\udc82Ђ\udce7\udc83J\udc83^\udc8a\udcbf\udc8e\udc9a'`.
UnicodeDecodeError
Decoding Shift_JIS bytes with the default codec (UTF-8) raises UnicodeDecodeError.
>>> bytes_sjis = "ひらカタ漢字".encode("shift_jis")
>>> bytes_sjis
b'\x82\xd0\x82\xe7\x83J\x83^\x8a\xbf\x8e\x9a'
>>> bytes_sjis.decode()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x82 in position 0: invalid start byte
The error in the previous section occurs because the optional `errors` argument of `decode()` defaults to `"strict"`. If you pass some other value for `errors`, no exception is raised and a different string is returned.
> The errors argument specifies the response when the input string can't be converted according to the encoding's rules. Legal values for this argument are
> 'strict' (raise a UnicodeDecodeError exception),
> 'replace' (use U+FFFD, REPLACEMENT CHARACTER),
> 'ignore' (just leave the character out of the Unicode result), or
> 'backslashreplace' (inserts a \xNN escape sequence).
> Unicode HOWTO
Besides those, you can also specify `'surrogateescape'`.
> 'surrogateescape' - Replace each undecodable byte with an individual surrogate code in the range U+DC80 to U+DCFF.
> 7.2. codecs — Codec registry and base classes
Let's look at the result for each handler.
>>> bytes_sjis.decode("utf-8", errors="replace")
'�Ђ�J�^����'
>>> bytes_sjis.decode("utf-8", errors="ignore")
'ЂJ^'
>>> bytes_sjis.decode("utf-8", errors="backslashreplace")
'\\x82Ђ\\xe7\\x83J\\x83^\\x8a\\xbf\\x8e\\x9a'
>>> bytes_sjis.decode("utf-8", errors="surrogateescape")
'\udc82Ђ\udce7\udc83J\udc83^\udc8a\udcbf\udc8e\udc9a'
Incidentally, in a non-UTF-8 environment such as Windows (CP932, roughly Shift_JIS), the same results are displayed as follows.
>>> bytes_sjis.decode("utf-8", errors="replace")
'\ufffd\u0402\ufffdJ\ufffd^\ufffd\ufffd\ufffd\ufffd'
>>> bytes_sjis.decode("utf-8", errors="ignore")
'\u0402J^'
>>> bytes_sjis.decode("utf-8", errors="backslashreplace")
'\\x82\u0402\\xe7\\x83J\\x83^\\x8a\\xbf\\x8e\\x9a'
>>> bytes_sjis.decode("utf-8", errors="surrogateescape")
'\udc82\u0402\udce7\udc83J\udc83^\udc8a\udcbf\udc8e\udc9a'
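The strings themselves are the same in both environments; only the way the interactive prompt displays them differs. As a minimal sketch, the built-in `ascii()` produces the escaped form regardless of the terminal encoding, which can be handy when comparing results across environments:

```python
# Sketch: ascii() escapes every non-ASCII character, so the output does not
# depend on the terminal encoding (UTF-8, CP932, ...).
bytes_sjis = "ひらカタ漢字".encode("shift_jis")

for handler in ("replace", "ignore", "backslashreplace", "surrogateescape"):
    decoded = bytes_sjis.decode("utf-8", errors=handler)
    print(f"{handler:17}", ascii(decoded))
```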
If `'replace'` or `'ignore'` is specified, information has been discarded and the original cannot be restored, but in the other cases the original string can be restored as shown below.
# Why does this work in the backslashreplace case...???
>>> bytes_sjis = "ひらカタ漢字".encode("shift_jis")
>>> backslash_str = bytes_sjis.decode("utf-8", errors="backslashreplace")
>>> backslash_str.encode().decode("unicode_escape").encode("raw_unicode_escape").decode("shift_jis")
'ひらカタ漢字'
>>> surrogate_str = bytes_sjis.decode("utf-8", errors="surrogateescape")
>>> surrogate_str.encode("utf-8", errors="surrogateescape").decode("shift_jis")
'ひらカタ漢字'
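A likely explanation for the backslashreplace case can be sketched step by step. Re-encoding as UTF-8 turns `Ђ` back into its original two bytes. `unicode_escape` then expands the `\xNN` escapes and reads every other byte as the code point of the same value. Finally, `raw_unicode_escape` writes each code point below 256 back as a single byte. The names `step1` to `step3` are only for illustration. This chain is not safe when the original bytes contain a backslash (0x5C), which can occur as the second byte of a Shift_JIS character, whereas surrogateescape round-trips arbitrary bytes when it is used on both sides.

```python
bytes_sjis = "ひらカタ漢字".encode("shift_jis")
backslash_str = bytes_sjis.decode("utf-8", errors="backslashreplace")

# 1. UTF-8 encode: 'Ђ' becomes b'\xd0\x82' again; the escapes stay as ASCII text.
step1 = backslash_str.encode("utf-8")
# 2. unicode_escape: expands '\xNN' and maps other bytes to same-valued code points.
step2 = step1.decode("unicode_escape")
# 3. raw_unicode_escape: code points below 256 become single bytes.
step3 = step2.encode("raw_unicode_escape")
assert step3 == bytes_sjis

# surrogateescape restores any byte sequence when given on both decode and encode.
all_bytes = bytes(range(256))
escaped = all_bytes.decode("utf-8", errors="surrogateescape")
assert escaped.encode("utf-8", errors="surrogateescape") == all_bytes
```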
Let's try UTF-8 -> ASCII -> UTF-8 in the same way as the Shift_JIS -> UTF-8 -> Shift_JIS conversion above.
UnicodeDecodeError
>>> bytes_utf8 = "ひらカタ漢字".encode("utf-8")
>>> bytes_utf8
b'\xe3\x81\xb2\xe3\x82\x89\xe3\x82\xab\xe3\x82\xbf\xe6\xbc\xa2\xe5\xad\x97'
>>> bytes_utf8.decode("ascii")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe3 in position 0: ordinal not in range(128)
>>> bytes_utf8.decode("ascii", errors="ignore")
''
>>> bytes_utf8.decode("ascii", errors="replace")
'������������������'
>>> bytes_utf8.decode("ascii", errors="backslashreplace")
'\\xe3\\x81\\xb2\\xe3\\x82\\x89\\xe3\\x82\\xab\\xe3\\x82\\xbf\\xe6\\xbc\\xa2\\xe5\\xad\\x97'
>>> bytes_utf8.decode("ascii", errors="surrogateescape")
'\udce3\udc81\udcb2\udce3\udc82\udc89\udce3\udc82\udcab\udce3\udc82\udcbf\udce6\udcbc\udca2\udce5\udcad\udc97'
For UTF-8 data, a plain `.encode().decode()` seems to be enough. Note that the bytes here are decoded as UTF-8, so the error handlers never actually fire and the round trip is trivial.
>>> backslash_str = bytes_utf8.decode("utf-8", errors="backslashreplace")
>>> backslash_str.encode().decode()
'ひらカタ漢字'
>>> surrogate_str = bytes_utf8.decode("utf-8", errors="surrogateescape")
>>> surrogate_str.encode().decode()
'ひらカタ漢字'
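When the same UTF-8 bytes are decoded as ASCII instead, the handlers do fire and a plain `.encode()` is no longer enough. A minimal sketch, assuming the mis-decoding used `errors="surrogateescape"`: the same handler has to be passed on the encode side to get the original bytes back. The flag `strict_ok` is only there for illustration.

```python
bytes_utf8 = "ひらカタ漢字".encode("utf-8")
surrogate_str = bytes_utf8.decode("ascii", errors="surrogateescape")

# A strict encode refuses lone surrogates.
try:
    surrogate_str.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# Passing surrogateescape when encoding turns each surrogate back into its byte.
restored_bytes = surrogate_str.encode("ascii", errors="surrogateescape")
restored = restored_bytes.decode("utf-8")
assert restored_bytes == bytes_utf8
```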
Finally, here are some examples of forcibly restoring text that was garbled because a setting was forgotten.
Forgetting `ensure_ascii=False` in `json` output
>>> import json
>>> ascii_json = json.dumps({"キー":"値"})
>>> ascii_json
'{"\\u30ad\\u30fc": "\\u5024"}'
>>> ascii_json.encode().decode("unicode_escape")
'{"キー": "値"}'
>>> ascii_json.encode().decode("raw_unicode_escape")
'{"キー": "値"}'
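Since the escaped text is still valid JSON, parsing it is another way to get the original values back, and passing `ensure_ascii=False` avoids the escapes in the first place. A small sketch using only the standard `json` module (the names `data` and `readable_json` are just for illustration):

```python
import json

ascii_json = json.dumps({"キー": "値"})  # default: ensure_ascii=True
assert ascii_json == '{"\\u30ad\\u30fc": "\\u5024"}'

# json.loads understands \uXXXX escapes, so nothing is lost.
data = json.loads(ascii_json)

# ensure_ascii=False keeps non-ASCII characters as they are.
readable_json = json.dumps(data, ensure_ascii=False)
```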
Not changing the `encoding` of a result obtained with `requests`
>>> import requests
>>> r = requests.get('http://www.mof.go.jp/')
>>> r.text
'...
<meta property="og:title" content="\x8dà\x96±\x8fÈ\x83z\x81[\x83\x80\x83y\x81[\x83W" />
...'
>>> r.text.encode("raw_unicode_escape").decode("shift_jis")
'...
<meta property="og:title" content="財務省ホームページ" />
...'
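The same repair can be tried without a network connection. The sketch below assumes the text was mis-decoded as ISO-8859-1 (Latin-1), which is what the garbled `r.text` above looks like; `garbled` is a hypothetical stand-in for `r.text`. With a real response, setting `r.encoding = "shift_jis"` before reading `r.text`, or decoding `r.content` directly, should avoid the problem in the first place.

```python
# Stand-in for r.text: Shift_JIS bytes wrongly decoded as Latin-1.
original = "財務省ホームページ"
garbled = original.encode("shift_jis").decode("latin-1")

# Latin-1 maps code points 0-255 straight back to bytes 0-255,
# so the original bytes can be recovered and decoded properly.
repaired = garbled.encode("latin-1").decode("shift_jis")
assert repaired == original

# For code points below 256, raw_unicode_escape gives the same bytes.
assert garbled.encode("raw_unicode_escape") == garbled.encode("latin-1")
```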