7.1.1.2 UTF-8 encoding and decoding

UTF-8 is what you use when exchanging data with the outside world. Two operations are required:

- A way to encode a string into a byte string
- A way to decode a byte string into a string
- UTF-8, a variable-length ("dynamic") encoding, is the standard text encoding for Python, Linux, and HTML.
- When you create a Python string by copying and pasting from another source such as a web page, make sure the source is encoded in UTF-8. (Otherwise an exception can occur when the bytes are decoded.)
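The last point can be illustrated with a minimal sketch. It assumes a hypothetical external source that saved the text as Latin-1 instead of UTF-8; the variable names are made up for illustration.

```python
# Hypothetical external source: "café" saved as Latin-1, not UTF-8.
external_bytes = "caf\u00e9".encode("latin-1")  # b'caf\xe9'

try:
    # Decoding as UTF-8 fails because 0xe9 alone is not a valid UTF-8 sequence.
    text = external_bytes.decode("utf-8")
except UnicodeDecodeError as exc:
    text = None
    print("decode failed:", type(exc).__name__)

# Decoding with the encoding the source actually used succeeds.
recovered = external_bytes.decode("latin-1")
print(recovered == "caf\u00e9")
```

The point of the sketch is only that the bytes themselves do not say which encoding produced them, so the encoding has to be known or agreed on in advance.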

7.1.1.3 Encoding

- Encoding means that the sender of information converts the message into a symbolic form that can reach the receiver.
- In Python, it means encoding a string into bytes.
- **UTF-8** is an **8-bit variable-length encoding**.


# The first argument of the string encode() function is the encoding name.
# Assign the Unicode string "\u2603" to snowman.
>>> snowman="\u2603"
>>> len(snowman)
1

#Encode this Unicode character into a byte sequence.
>>> ds=snowman.encode("utf-8")
>>> len(ds)
3
>>> ds
b'\xe2\x98\x83'

- **The snowman above is written as \u2603 inside Python, but once converted to UTF-8, the standard encoding of the outside world, it becomes b'\xe2\x98\x83'.**
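To see what "variable length" means in practice, here is a small sketch that counts the UTF-8 bytes for a few sample characters. The sample characters other than the snowman are my own choice, not from the book.

```python
# Sample characters: ASCII letter, accented letter, snowman, emoji.
samples = ["A", "\u00e9", "\u2603", "\U0001F600"]

# Map each character to the number of bytes UTF-8 needs for it.
byte_lengths = {ch: len(ch.encode("utf-8")) for ch in samples}

for ch, n in byte_lengths.items():
    # Print the code point rather than the character to avoid console encoding issues.
    print(f"U+{ord(ch):04X} -> {n} byte(s)")
# U+0041 -> 1 byte(s)
# U+00E9 -> 2 byte(s)
# U+2603 -> 3 byte(s)
# U+1F600 -> 4 byte(s)
```

Every one of these is a single character for len(), but the encoded length ranges from 1 to 4 bytes.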


#An error will occur unless the Unicode character is also a valid ASCII character.
>>> ds=snowman.encode("ascii")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character '\u2603' in position 0: ordinal not in range(128)

# The encode() function takes a second argument that makes encoding exceptions less likely. Its default value is "strict", which behaves as in the example above: UnicodeEncodeError is raised if a character cannot be encoded (here, any non-ASCII character).

# Specify "ignore" to discard characters that cannot be encoded.
>>> snowman.encode("ascii","ignore")
b''
# Specify "replace" to substitute ? for characters that cannot be encoded.
>>> snowman.encode("ascii","replace")
b'?'
# Specify "backslashreplace" to generate a Python Unicode escape sequence, in the style of unicode-escape.
>>> snowman.encode("ascii","backslashreplace")
b'\\u2603'
# Specify "xmlcharrefreplace" to generate a character entity string that can be used in web pages.
>>> snowman.encode("ascii","xmlcharrefreplace")
b'&#9731;'
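The handlers are easier to compare on a string that mixes ASCII and non-ASCII characters. The following is a small sketch; the sample string is my own and is not from the book.

```python
text = "caf\u00e9 \u2603"  # 'café' followed by a space and the snowman

handlers = ("ignore", "replace", "backslashreplace", "xmlcharrefreplace")

# Encode the same text to ASCII with each error handler.
results = {name: text.encode("ascii", name) for name in handlers}

for name, encoded in results.items():
    print(f"{name:18} {encoded!r}")
# ignore             b'caf '
# replace            b'caf? ?'
# backslashreplace   b'caf\\xe9 \\u2603'
# xmlcharrefreplace  b'caf&#233; &#9731;'
```

Note that "ignore" and "replace" lose information, while "backslashreplace" and "xmlcharrefreplace" keep the code point in a form that can be restored later.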

7.1.1.4 Decoding

- Decoding means that the receiver interprets the meaning of the symbols produced by the sender of the information.
- In Python, it means decoding a byte string into a Unicode string.
- **Text taken from an external source (files, websites, network APIs, etc.) has to be converted to Unicode inside Python.** Such text arrives encoded as a byte string.
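As a rough sketch of the "external source" case, the following writes UTF-8 bytes to a temporary file and reads them back. The temporary file merely stands in for a real external source, and the variable names are assumptions for illustration.

```python
import os
import tempfile

text = "caf\u00e9 \u2603"

# Write the text as UTF-8 bytes to a temporary file (stand-in for an external source).
with tempfile.NamedTemporaryFile(delete=False, suffix=".txt") as tmp:
    tmp.write(text.encode("utf-8"))
    path = tmp.name

try:
    # Read raw bytes and decode them explicitly.
    with open(path, "rb") as f:
        raw = f.read()
    decoded = raw.decode("utf-8")

    # Or let open() decode for us by naming the encoding.
    with open(path, encoding="utf-8") as f:
        decoded_by_open = f.read()
finally:
    os.remove(path)

print(len(raw))         # 9 bytes: c, a, f (1 each), é (2), space (1), snowman (3)
print(decoded == text)  # True
```

Passing encoding="utf-8" to open() explicitly avoids depending on the platform's default encoding.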

# Create a Unicode string with the value 'café'.
>>> place = "caf\u00e9"
>>> place
'café'
>>> type(place)
<class 'str'>
# Encode it in UTF-8 format and assign the result to the place_bytes variable.
>>> place_bytes=place.encode("utf-8")
# Note that place_bytes has 5 bytes.
# The first 3 bytes are the same as ASCII (an advantage of UTF-8): each of those characters is encoded in 1 byte, while the final é is encoded in 2 bytes.
>>> place_bytes
b'caf\xc3\xa9'
>>> type(place_bytes)
<class 'bytes'>

>>> place2=place_bytes.decode("utf-8")
>>> place2
'café'
# The ASCII decoder returns an error because the byte value 0xc3 is invalid in ASCII.
>>> place3=place_bytes.decode("ascii")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 3: ordinal not in range(128)

# Use UTF-8 encoding whenever possible. Decoding with a different encoding may not raise an error, but the result is garbled.
>>> place4=place_bytes.decode("latin-1")
>>> place4
'cafÃ©'
>>> place5=place_bytes.decode("windows-1252")
>>> place5
'cafÃ©'
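The garbled result is not random: each UTF-8 byte was simply read as one Latin-1 character. As a sketch under that assumption (you must already know which wrong encoding was applied), the damage can be undone by reversing the wrong step.

```python
place_bytes = "caf\u00e9".encode("utf-8")  # b'caf\xc3\xa9'

# Wrong decoder: the 2 bytes of é become 2 separate characters.
garbled = place_bytes.decode("latin-1")

# Reverse the wrong step to get the original bytes back, then decode correctly.
repaired = garbled.encode("latin-1").decode("utf-8")

print(len(garbled), len(repaired))  # 5 4
print(repaired == "caf\u00e9")      # True
```

This works for Latin-1 because it maps every byte value to exactly one character, so no information is lost in the wrong decode. It is a repair of last resort; decoding as UTF-8 in the first place is the better fix.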

coffee break "About numbers and letters"

- Computers perform calculations and make decisions in binary.
- **Bit**: one digit of a binary number.
- **Hexadecimal**: a notation that expresses 4 binary digits as a single digit from 0 to F, to make them easier for humans to read.
- **Byte**: a unit of 8 bits, written as 2 hexadecimal digits (0 to 255 in decimal).
- **What is binary?** The characters actually shown on a display are produced by the computer's OS, which converts each character code into a character image.
- Image files are usually compressed, in formats such as GIF or JPG.
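The relationship between bits, hexadecimal digits, and bytes can be checked directly in Python. This is a small sketch using the first byte of the UTF-8 snowman (0xE2) as the sample value.

```python
value = 0xE2  # first byte of the UTF-8 encoded snowman

print(format(value, "08b"))  # 11100010  (8 bits = 1 byte)
print(format(value, "02X"))  # E2        (2 hexadecimal digits)
print(value)                 # 226       (decimal, within 0 to 255)

# Each hexadecimal digit corresponds to 4 bits.
high, low = value >> 4, value & 0xF
print(format(high, "04b"), format(low, "04b"))  # 1110 0010

# bytes.hex() shows a whole byte string as hexadecimal digits.
snowman_hex = b"\xe2\x98\x83".hex()
print(snowman_hex)  # e29883
```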

Example: contents of a JPG file (bytes shown in hexadecimal)

FFD8FFE0 00104A46 49460001 0101004B 004B0000 FFFE0094 56542D43 6F6D7072 65737320 28746D29 2058696E 67205465 63686E6F 6C6F6779 20436F72 702E0000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 0000FFDB 00840010 0B0C0E0C 0A100E0D 0E121110 131828
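A dump like this can be turned back into bytes and inspected in Python. The sketch below uses only the first 16 bytes of the dump above; the helper group_hex is a hypothetical name of my own.

```python
# First 16 bytes of the dump above, copied as hexadecimal text.
HEX_TEXT = "FFD8FFE0 00104A46 49460001 0101004B"


def group_hex(data: bytes, word_size: int = 4) -> str:
    """Format bytes as uppercase hex, one space-separated group per word_size bytes."""
    words = [
        data[i:i + word_size].hex().upper()
        for i in range(0, len(data), word_size)
    ]
    return " ".join(words)


header = bytes.fromhex(HEX_TEXT)  # spaces between groups are ignored

print(len(header))                 # 16
print(header[:2] == b"\xff\xd8")   # True: a JPEG file starts with FF D8
print(header[6:10])                # b'JFIF'
print(group_hex(header) == HEX_TEXT)  # True
```

Some of the bytes happen to be readable ASCII text (here "JFIF"), but most of an image file is binary data that is not meant to be decoded as characters.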

Impressions

When I started Chapter 7, the topic turned to data, and binary numbers, hexadecimal numbers, and how they relate to bytes all came up at once. I studied this as a student but have forgotten it, so I will review it little by little as I work through this chapter.

References

"Introduction to Python 3" by Bill Lubanovic (published by O'Reilly Japan)

Reference URLs
http://zaq.g1.xrea.com/2sinsuu5.htm
https://docs.python.org/ja/3/howto/unicode.html