Summary of what I learned from reading the Python 2.7 documentation
Unicode HOWTO — Python 2.7.13 documentation https://docs.python.org/2/howto/unicode.html
7.8. codecs — Codec registry and base classes — Python 2.7.13 documentation https://docs.python.org/2/library/codecs.html#encodings-and-unicode
ASCII (American Standard Code for Information Interchange) assigned the numbers 0-127 to characters. Example: a is 97.
$ python -V
Python 2.7.10
>>> unichr(97)
u'a'
>>> ord('a')
97
unichr(i) - 2. Built-in Functions — Python 2.7.13 documentation
ord(c) - 2. Built-in Functions — Python 2.7.13 documentation
However, ASCII could not represent accented characters used in European languages, such as é, or the Cyrillic letters used in Russian.
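A minimal sketch of this limitation, assuming Python's built-in 'ascii' codec as a stand-in for a pure ASCII system. The variable names are illustrative only, and the code is written to run on both Python 2.7 and Python 3.

```python
from __future__ import print_function

# 'a' lies inside the ASCII range (0-127), so it encodes to a single byte.
ascii_bytes = u'a'.encode('ascii')
print(repr(ascii_bytes))

# e with an acute accent (U+00E9, code point 233) lies outside that range.
try:
    u'\u00e9'.encode('ascii')
    encodable = True
except UnicodeEncodeError as exc:
    encodable = False
    print('cannot encode:', exc.reason)
```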
As 8-bit computers became mainstream, a byte could hold 2^8 = 256 values (0-255), and each system assigned its own characters to the values 128-255.
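To illustrate the problem, the sketch below decodes the same byte with two different 8-bit encodings. latin-1 and cp1251 are picked here only as convenient examples that ship with Python; they are not necessarily the encodings the documentation has in mind.

```python
from __future__ import print_function

raw = b'\xe9'  # a single byte with the value 233

as_latin1 = raw.decode('latin-1')  # Western European encoding
as_cp1251 = raw.decode('cp1251')   # Cyrillic encoding

# The same byte turns into two different characters.
print(repr(as_latin1), repr(as_cp1251))
```

Under latin-1 the byte is é (U+00E9), while under cp1251 it is the Cyrillic letter й (U+0439), so the byte alone does not say which character was meant.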
Unicode was developed to eliminate this difference.
Unicode
The Unicode standard describes how characters are represented by code points.
Character: a, code point: 97 (0x61)
Initially, Unicode used 16-bit code points (65,536 possible values). The code space now ranges from 0 to 1,114,111 (0x10ffff).
A Unicode string is a sequence of code points, which are numbers from 0 to 0x10ffff.
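As a small illustration of "a string is a sequence of code points", the sketch below lists the code point of each character in a short Unicode string. The sample string is an arbitrary choice, and only characters below U+FFFF are used so that the result is the same on every Python 2.7 build.

```python
from __future__ import print_function

# 'a', e with acute accent, Cyrillic small letter ya
text = u'a\u00e9\u044f'

# ord() returns the code point of a single character.
code_points = [ord(ch) for ch in text]
print(code_points)                       # [97, 233, 1103]
print([hex(cp) for cp in code_points])

# Upper limit of the Unicode code space.
max_code_point = 0x10ffff
print(max_code_point)                    # 1114111
```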
Encodings
The rules for translating a Unicode string into a sequence of bytes are called an encoding.
>>> 'a'.encode('hex')
'61'
$ python -V
Python 2.7.10
>>> s = 'a b c x y z'
>>> s.encode('hex')
'612062206320782079207a'
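The 'hex' codec used with str.encode above is specific to Python 2. A rough equivalent that also runs on Python 3 is binascii.hexlify. The sketch below uses it to show how one character becomes different byte sequences under different encodings; hex_of is a hypothetical helper written for this note, not a standard function.

```python
from __future__ import print_function
import binascii


def hex_of(text, encoding):
    """Encode a Unicode string and return its bytes as a hex string."""
    return binascii.hexlify(text.encode(encoding)).decode('ascii')


e_acute = u'\u00e9'

print(hex_of(u'a', 'utf-8'))          # 61
print(hex_of(e_acute, 'latin-1'))     # e9
print(hex_of(e_acute, 'utf-8'))       # c3a9
print(hex_of(e_acute, 'utf-16-be'))   # 00e9
```

For ASCII characters such as a, UTF-8 produces the same single byte as ASCII, which is why the examples above print 61.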
The character a is entered with CTRL-v + u0061.