[Python3] Switch between Shift_JIS, UTF-8 and ASCII

Introduction

In Python 3, garbled output sometimes shows up as a string like `'\udc82Ђ\udce7\udc83J\udc83^\udc8a\udcbf\udc8e\udc9a'`. This article is an attempt to turn such strings back into readable text.

When a Shift_JIS byte string is decoded as UTF-8

UnicodeDecodeError by default

Decoding Shift_JIS bytes with `decode()`, which uses UTF-8 by default, raises UnicodeDecodeError.

>>> bytes_sjis = "ひらカタ漢字".encode("shift_jis")
>>> bytes_sjis
b'\x82\xd0\x82\xe7\x83J\x83^\x8a\xbf\x8e\x9a'
>>> bytes_sjis.decode()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x82 in position 0: invalid start byte
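
As a small supplementary sketch (not part of the original walkthrough), the exception object itself reports which codec failed and at which byte offset, which helps when the failing data is longer than this sample. The variable names `failure` and `offending` are only illustrative.

```python
# Shift_JIS bytes of the six-character sample used in this article.
bytes_sjis = "ひらカタ漢字".encode("shift_jis")

try:
    bytes_sjis.decode()  # same as decode("utf-8", errors="strict")
except UnicodeDecodeError as exc:
    # The exception records the codec, the failing byte range and the reason.
    failure = (exc.encoding, exc.start, exc.end, exc.reason)
    # exc.object is the full input; slice out the bytes that were rejected.
    offending = exc.object[exc.start:exc.end]

print(failure)
print(offending)
```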

Decoding result when an error handler is specified

The error in the previous section occurs because the optional `errors` argument of `decode()` defaults to `"strict"`. If you pass some other value for `errors`, no exception is raised and a different string is returned.

> The errors argument specifies what to do when the input string cannot be converted according to the encoding's rules. The values that can be used for this argument are `'strict'` (raise a UnicodeDecodeError), `'replace'` (use the replacement character U+FFFD), `'ignore'` (simply leave the character out of the resulting Unicode string), and `'backslashreplace'` (insert a `\xNN` escape sequence). (Unicode HOWTO)

Besides these, you can also specify `'surrogateescape'`:

> `'surrogateescape'`: replace the byte sequence with individual surrogate codes in the range U+DC80 to U+DCFF. (7.2. codecs — Codec registry and base classes)

Let's look at the output.

>>> bytes_sjis.decode("utf-8", errors="replace")
'�Ђ�J�^����'
>>> bytes_sjis.decode("utf-8", errors="ignore")
'ЂJ^'
>>> bytes_sjis.decode("utf-8", errors="backslashreplace")
'\\x82Ђ\\xe7\\x83J\\x83^\\x8a\\xbf\\x8e\\x9a'
>>> bytes_sjis.decode("utf-8", errors="surrogateescape")
'\udc82Ђ\udce7\udc83J\udc83^\udc8a\udcbf\udc8e\udc9a'

Incidentally, in a Windows environment whose console uses CP932 (roughly Shift_JIS) instead of UTF-8, the same results are displayed as follows.

>>> bytes_sjis.decode("utf-8", errors="replace")
'\ufffd\u0402\ufffdJ\ufffd^\ufffd\ufffd\ufffd\ufffd'
>>> bytes_sjis.decode("utf-8", errors="ignore")
'\u0402J^'
>>> bytes_sjis.decode("utf-8", errors="backslashreplace")
'\\x82\u0402\\xe7\\x83J\\x83^\\x8a\\xbf\\x8e\\x9a'
>>> bytes_sjis.decode("utf-8", errors="surrogateescape")
'\udc82\u0402\udce7\udc83J\udc83^\udc8a\udcbf\udc8e\udc9a'
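
To compare the handlers side by side without depending on the console encoding, a loop like the following can be used. This is only a sketch: `ascii()` is used here so that every non-ASCII character is printed as an escape, which is my own choice for the display and not something the examples above rely on.

```python
bytes_sjis = "ひらカタ漢字".encode("shift_jis")

handlers = ["replace", "ignore", "backslashreplace", "surrogateescape"]
decoded = {name: bytes_sjis.decode("utf-8", errors=name) for name in handlers}

for name, text in decoded.items():
    # ascii() escapes every non-ASCII character, so the printed form is the
    # same whether the console uses UTF-8 or CP932.
    print(f"{name:17} len={len(text):2} {ascii(text)}")
```

The lengths make the information loss visible: `'ignore'` keeps only the characters that happened to decode, while `'surrogateescape'` keeps one code point for every byte that was rejected.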

Restoring the original string from the result of decoding Shift_JIS bytes as UTF-8

When `'replace'` or `'ignore'` is specified, information has been discarded and the original cannot be restored. In the other cases, the original string can be restored as shown below. It may be surprising that this also works for `'backslashreplace'`.

>>> bytes_sjis = "ひらカタ漢字".encode("shift_jis")
>>> backslash_str = bytes_sjis.decode("utf-8", errors="backslashreplace")
>>> backslash_str.encode().decode("unicode_escape").encode("raw_unicode_escape").decode("shift_jis")
'ひらカタ漢字'

>>> surrogate_str = bytes_sjis.decode("utf-8", errors="surrogateescape")
>>> surrogate_str.encode("utf-8", errors="surrogateescape").decode("shift_jis")
'ひらカタ漢字'
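
The `'backslashreplace'` chain is easier to follow when it is split into steps. The sketch below is my reading of why it works, using hypothetical names `step1` to `step3`; it assumes the original bytes contain no literal backslash (0x5C), because such a byte could be mistaken for the start of an escape in step 2.

```python
bytes_sjis = "ひらカタ漢字".encode("shift_jis")
backslash_str = bytes_sjis.decode("utf-8", errors="backslashreplace")

# Step 1: back to bytes. The escapes stay as literal ASCII text, and the one
# character that happened to decode (U+0402) becomes b'\xd0\x82' again.
step1 = backslash_str.encode("utf-8")

# Step 2: unicode_escape turns each \xNN escape into the character U+00NN and
# reads the remaining bytes as Latin-1, so every character is below U+0100.
step2 = step1.decode("unicode_escape")

# Step 3: raw_unicode_escape writes characters below U+0100 as single bytes,
# which reproduces the original Shift_JIS byte string.
step3 = step2.encode("raw_unicode_escape")

print(step3 == bytes_sjis)
print(step3.decode("shift_jis"))
```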

When a UTF-8 byte string is decoded as ASCII

Let's try the round trip UTF-8 -> ASCII -> UTF-8 in the same way as the Shift_JIS -> UTF-8 -> Shift_JIS conversion above.

UnicodeDecodeError by default

>>> bytes_utf8 = "ひらカタ漢字".encode("utf-8")
>>> bytes_utf8
b'\xe3\x81\xb2\xe3\x82\x89\xe3\x82\xab\xe3\x82\xbf\xe6\xbc\xa2\xe5\xad\x97'
>>> bytes_utf8.decode("ascii")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe3 in position 0: ordinal not in range(128)

Decoding result when an error handler is specified

>>> bytes_utf8.decode("ascii", errors="ignore")
''
>>> bytes_utf8.decode("ascii", errors="replace")
'������������������'
>>> bytes_utf8.decode("ascii", errors="backslashreplace")
'\\xe3\\x81\\xb2\\xe3\\x82\\x89\\xe3\\x82\\xab\\xe3\\x82\\xbf\\xe6\\xbc\\xa2\\xe5\\xad\\x97'
>>> bytes_utf8.decode("ascii", errors="surrogateescape")
'\udce3\udc81\udcb2\udce3\udc82\udc89\udce3\udc82\udcab\udce3\udc82\udcbf\udce6\udcbc\udca2\udce5\udcad\udc97'

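Restoring the original string from the result of decoding UTF-8 bytes as ASCII

For UTF-8 and ASCII the same two recipes work; only the codec names change. Note that the garbled strings must be produced by decoding with `"ascii"`. Decoding with `"utf-8"` would simply succeed, and a plain `.encode().decode()` would then appear to work without restoring anything.

>>> backslash_str = bytes_utf8.decode("ascii", errors="backslashreplace")
>>> backslash_str.encode().decode("unicode_escape").encode("raw_unicode_escape").decode("utf-8")
'ひらカタ漢字'
>>> surrogate_str = bytes_utf8.decode("ascii", errors="surrogateescape")
>>> surrogate_str.encode("ascii", errors="surrogateescape").decode("utf-8")
'ひらカタ漢字'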
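
The `'surrogateescape'` round trip follows the same pattern for any pair of codecs, so it can be wrapped in a small helper. `redecode` is a hypothetical name introduced only for this sketch, and it assumes the garbled text was produced with `errors="surrogateescape"` in the first place.

```python
def redecode(text: str, wrong: str, right: str) -> str:
    """Undo a decode done with `wrong` + surrogateescape, then decode as `right`.

    Hypothetical helper for illustration; not a standard library function.
    """
    # Encoding with the same codec and handler restores the original bytes.
    raw = text.encode(wrong, errors="surrogateescape")
    return raw.decode(right)


original = "ひらカタ漢字"
as_ascii = original.encode("utf-8").decode("ascii", errors="surrogateescape")
as_utf8 = original.encode("shift_jis").decode("utf-8", errors="surrogateescape")

print(redecode(as_ascii, "ascii", "utf-8"))
print(redecode(as_utf8, "utf-8", "shift_jis"))
```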

Common examples

Finally, here are some examples of forcibly restoring text that came out garbled because a setting was forgotten.

When you forget to set `ensure_ascii=False` for `json` output

>>> import json
>>> ascii_json = json.dumps({"キー": "値"})
>>> ascii_json
'{"\\u30ad\\u30fc": "\\u5024"}'
>>> ascii_json.encode().decode("unicode_escape")
'{"キー": "値"}'
>>> ascii_json.encode().decode("raw_unicode_escape")
'{"キー": "値"}'
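
For comparison, here is a short sketch of the setting that was forgotten, together with an alternative repair that goes through the JSON parser instead of the escape codecs. The variable names are illustrative only.

```python
import json

data = {"キー": "値"}

escaped = json.dumps(data)                       # default: ensure_ascii=True
readable = json.dumps(data, ensure_ascii=False)  # keeps non-ASCII characters

print(escaped)
print(readable)

# Both forms are valid JSON and parse back to the same object, so parsing
# and dumping again is an alternative to decoding the escapes by hand.
fixed = json.dumps(json.loads(escaped), ensure_ascii=False)
print(fixed == readable)
```

Going through `json.loads` may be the safer choice when the text already contains non-ASCII characters or escaped quotes, since it interprets the escapes by JSON's rules.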

When the encoding of a response obtained with `requests` has not been changed

>>> import requests
>>> r = requests.get('http://www.mof.go.jp/')
>>> r.text
'...
<meta property="og:title" content="\x8dà\x96±\x8fÈ\x83z\x81[\x83\x80\x83y\x81[\x83W" />
...'
>>> r.text.encode("raw_unicode_escape").decode("shift_jis")
'...
<meta property="og:title" content="財務省ホームページ" />
...'
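
The same repair can be reproduced without network access. The sketch below assumes the garbled `r.text` came from the body being decoded as ISO-8859-1, which is what the output above suggests; `body` stands in for the raw response bytes. With `requests` itself, setting `r.encoding = "shift_jis"` before reading `r.text`, or decoding `r.content` directly, should avoid the problem in the first place.

```python
# Simulated response body: Shift_JIS bytes, as the server would send them.
body = '<meta property="og:title" content="財務省ホームページ" />'.encode("shift_jis")

# What the text looks like when the body is decoded as ISO-8859-1 by mistake.
garbled = body.decode("iso-8859-1")
print(ascii(garbled))

# ISO-8859-1 maps every byte to the code point of the same value, so encoding
# back with the same codec recovers the exact bytes.
recovered = garbled.encode("iso-8859-1").decode("shift_jis")
print(recovered)
```

For text whose characters are all below U+0100, `encode("iso-8859-1")` and `encode("raw_unicode_escape")` produce the same bytes, so either can be used for this step.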
