3-3, Python strings and character codes

Character code

Because every string is stored using a character code (encoding), the string is a somewhat special data type.

Since computers can only process numbers, text must be converted to numbers before it can be processed. When computers were first designed, 8 bits were grouped into 1 byte. The largest unsigned integer that can be represented in 1 byte is 255 (binary 11111111 is decimal 255), so representing a larger integer requires more bytes. For example, the largest integer that can be represented in 2 bytes is 65535, and in 4 bytes it is 4294967295.
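The byte limits above follow from the formula 2^(8 × bytes) − 1. A minimal sketch that checks the arithmetic (the helper name `max_unsigned` is made up for this illustration):

```python
def max_unsigned(num_bytes: int) -> int:
    """Largest unsigned integer that fits in num_bytes bytes (8 bits each)."""
    return 2 ** (8 * num_bytes) - 1

# Binary 11111111 is decimal 255
print(int('11111111', 2))

# Maximum values for 1, 2, and 4 bytes
for n in (1, 2, 4):
    print(n, 'byte(s):', max_unsigned(n))
```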

Computers were first developed mainly in the United States, so the character code that was standardized first contained only 128 characters. This character code is called ASCII, and it covers the letters, digits, symbols, spaces, line breaks, and so on that are used in English. For example, the character code of `A` is 65 and the character code of `z` is 122.
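In Python, the built-in functions `ord()` and `chr()` convert between a character and its code. A small sketch (the variable names are only for illustration):

```python
code_a = ord('A')        # character -> code
code_z = ord('z')
char_65 = chr(65)        # code -> character
print(code_a, code_z, char_65)

# ASCII only covers codes 0 to 127, so a kanji cannot be encoded as ASCII
try:
    '日'.encode('ascii')
    ascii_ok = True
except UnicodeEncodeError as exc:
    ascii_ok = False
    print('cannot encode as ASCII:', exc.reason)
```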

Of course, 1 byte is not enough to represent Japanese, so at least 2 bytes are needed. The codes also must not conflict with ASCII, so Japan created the JIS codes. Other countries likewise created their own character codes based on ASCII. As a result, text files that mixed several languages were displayed as garbled characters, because the same byte values meant different characters in different codes.
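Garbled characters can be reproduced by decoding bytes with the wrong character code. The sketch below assumes Shift_JIS as the JIS-based encoding and Latin-1 as the "wrong" Western encoding; both choices are only for illustration.

```python
text = '日本語'

# Encode with a Japanese encoding: each kanji takes 2 bytes
sjis_bytes = text.encode('shift_jis')
print(len(sjis_bytes))

# Decoding the same bytes with a different code produces garbled text
garbled = sjis_bytes.decode('latin-1')
print(repr(garbled))

# Decoding with the correct code restores the original
restored = sjis_bytes.decode('shift_jis')
print(restored)
```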

Unicode was created to solve the problem of garbled characters. It unifies the characters of all languages into a single character code. Unicode is still being extended; most commonly used characters fit in 2 bytes, while rarer characters need more.
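Python 3 strings are Unicode, so `ord()` returns the same kind of number (a code point) for characters from any language. A short sketch with a few sample characters chosen for illustration:

```python
samples = ['A', 'あ', '日']

# Map each character to its Unicode code point
code_points = {ch: ord(ch) for ch in samples}

for ch, cp in code_points.items():
    print(f'{ch!r}: U+{cp:04X} (decimal {cp})')
```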

In ASCII, the character `A` is 65 in decimal and `01000001` in binary. The character `0` is 48 in decimal and `00110000` in binary. (Note: the character `0` and the number 0 are not the same.) To represent the ASCII character `A` in 2-byte Unicode, zeros are simply added in front: the Unicode character code of `A` is `00000000 01000001`.

This introduces a new problem. With Unicode the garbled characters disappear, but text that is entirely English takes twice as much space as it does in ASCII. Variable-length UTF-8 was created to solve this. UTF-8 encodes each character in a different number of bytes depending on the size of its Unicode code. The original design allowed 1 to 6 bytes per character, but the current standard limits it to 1 to 4 bytes. Letters of the alphabet take 1 byte, ordinary kanji usually take 3 bytes, and rarely used kanji take 4 bytes.
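These points can be checked in Python. The sketch below uses the `utf-16-be` codec as a stand-in for the fixed 2-byte form described above, and U+20BB7 as an example of a rarely used kanji; both are assumptions made for the illustration.

```python
# The character '0' is not the number 0
zero_code = ord('0')
print(zero_code, '0' == 0)

# 'A' as 8 bits, and as 2 bytes with a leading zero byte
a_bits = format(ord('A'), '08b')
a_two_bytes = 'A'.encode('utf-16-be')
print(a_bits, a_two_bytes)

# UTF-8 uses a different number of bytes per character
rare_kanji = '\U00020BB7'  # a kanji outside the 2-byte range
utf8_lengths = {ch: len(ch.encode('utf-8')) for ch in ('A', '日', rare_kanji)}
print(utf8_lengths)
```

For English-only text, UTF-8 therefore produces exactly the same bytes as ASCII.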

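Example: the kanji 日 has the Unicode code `01100101 11100101` (hex 65E5), and in UTF-8 it becomes the 3 bytes `11100110 10010111 10100101` (hex E6 97 A5).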
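The bit patterns for 日 can be reproduced with `ord()` and `str.encode()`. A minimal sketch (the helper `to_bits` is a made-up name for this example):

```python
def to_bits(data: bytes) -> str:
    """Format each byte as 8 binary digits, separated by spaces."""
    return ' '.join(f'{byte:08b}' for byte in data)

ch = '日'
code_point_bits = f'{ord(ch):016b}'   # Unicode code as 16 bits
utf8_bytes = ch.encode('utf-8')       # UTF-8 encoded bytes

print(hex(ord(ch)))
print(code_point_bits)
print(utf8_bytes.hex())
print(to_bits(utf8_bytes))
```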

