The standard Python 2 on Mac OS X is built with UCS-2, so the built-in functions len and unichr behave differently than on a UCS-4 build, which is widely used in Linux distributions.
If the build option is UCS-2 and a string contains characters at U+10000 or above, you cannot simply use len to find the number of characters. Python 2 installed with Homebrew is also built with UCS-2.
Use the value of sys.maxunicode to see if UCS-2 was specified for the build option.
>>> import sys
>>> 0xFFFF == sys.maxunicode
True
Applying len to the following string (U+20BB7 U+91CE U+5BB6) returns 4.
>>> str = u'𠮷野家'
>>> 4 == len(str)
True
The internal representation of U+20BB7 is the surrogate pair U+D842 U+DFB7.
>>> 0xD842 == ord(str[0])
True
>>> 0xDFB7 == ord(str[1])
True
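The pair shown above follows from the UTF-16 encoding rule, and the arithmetic can be checked with plain integers on any build. The following is a minimal sketch; the helper name combine_surrogates is hypothetical and does not appear in the original article.

```python
def combine_surrogates(high, low):
    """Combine a UTF-16 surrogate pair (two integers) into one code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    # Each surrogate carries 10 bits; the result is offset by 0x10000.
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


print(hex(combine_surrogates(0xD842, 0xDFB7)))  # 0x20bb7
```

Because it works only on integers, the check does not depend on whether the interpreter is a narrow or a wide build.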
Let's count the number of characters, using the fact that high surrogates range from U+D800 to U+DBFF. To keep the code simple, the case where a high or low surrogate appears in isolation is not considered. On a UCS-4 build, a plain for loop is enough.
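# -*- coding: utf-8 -*-
import sys

def utf8_len(text):
    length = 0
    # Wide (UCS-4) build: each element is already one character.
    if sys.maxunicode > 0xFFFF:
        for c in text:
            length += 1
        return length
    # Narrow (UCS-2) build: count a surrogate pair as one character.
    code_units = len(text)
    pos = 0
    while pos < code_units:
        cp = ord(text[pos])
        length += 1
        if 0xD800 <= cp <= 0xDBFF:
            pos += 2  # high surrogate: skip its low surrogate as well
        else:
            pos += 1
    return length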
Let's try the previous string again.
str = u'𠮷野家'
print(3 == utf8_len(str))
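The function above deliberately ignores isolated surrogates. As a rough sketch of how that case might be handled, the variant below works on a list of UTF-16 code units given as integers, so it behaves the same on any build. The name count_code_points and the choice to count an isolated surrogate as one character are assumptions made for illustration, not part of the original article.

```python
def count_code_points(units):
    """Count characters in a sequence of UTF-16 code units (integers).

    A high surrogate followed by a low surrogate counts as one character;
    an isolated surrogate is counted as one character by itself.
    """
    length = 0
    pos = 0
    n = len(units)
    while pos < n:
        is_high = 0xD800 <= units[pos] <= 0xDBFF
        next_is_low = pos + 1 < n and 0xDC00 <= units[pos + 1] <= 0xDFFF
        # Advance past both units only when they form a valid pair.
        pos += 2 if (is_high and next_is_low) else 1
        length += 1
    return length


print(count_code_points([0xD842, 0xDFB7, 0x91CE, 0x5BB6]))  # 3
print(count_code_points([0xD842, 0x91CE]))  # 2 (isolated high surrogate)
```

Checking the following unit before skipping it also avoids reading past the end when the input ends with a high surrogate.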
As an exercise, let's modify the code a bit and define a function that applies a callback to each character.
# -*- coding: utf-8 -*-
import sys

def utf8_each_char(text, func):
    # Wide (UCS-4) build: iterating yields whole characters.
    if sys.maxunicode > 0xFFFF:
        for c in text:
            func(c)
        return
    # Narrow (UCS-2) build: pass a surrogate pair to func as one character.
    code_units = len(text)
    pos = 0
    while pos < code_units:
        buf = text[pos]
        if 0xD800 <= ord(buf) <= 0xDBFF:
            buf += text[pos + 1]  # append the low surrogate
            pos += 2
        else:
            pos += 1
        func(buf)
Let's display one character at a time. To use print with a lambda expression, you need to import print_function at the beginning of the file.
from __future__ import print_function
str = u'𠮷野家'
f = lambda c: print(c)
utf8_each_char(str, f)
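The same traversal can also be written as a generator, so the caller can use an ordinary for loop instead of passing a callback. This is only a sketch under stated assumptions: the name iter_chars is hypothetical, and unlike the function above it does not branch on sys.maxunicode but always keeps a high surrogate together with the low surrogate that follows it. The sample string spells out the two surrogate code units explicitly.

```python
def iter_chars(text):
    """Yield one character at a time, keeping surrogate pairs together."""
    pos = 0
    n = len(text)
    while pos < n:
        end = pos + 1
        if (0xD800 <= ord(text[pos]) <= 0xDBFF and end < n
                and 0xDC00 <= ord(text[end]) <= 0xDFFF):
            end += 1  # include the low surrogate in the same slice
        yield text[pos:end]
        pos = end


# The surrogate pair for U+20BB7, followed by U+91CE and U+5BB6.
sample = u'\ud842\udfb7\u91ce\u5bb6'
for c in iter_chars(sample):
    print(len(c))
```

Since a well-formed string on a wide build contains no surrogates, the pairing check never fires there, which is why one code path can serve both kinds of build.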
The UCS-2 restriction also affects unichr, which generates a character from a code point integer: on a narrow build it does not accept integers of 0x10000 and above.
>>> unichr(0x20BB7)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
ValueError: unichr() arg not in range(0x10000) (narrow Python build)
Unicode escape sequences are not affected by UCS-2.
>>> print(u"\U00020BB7")
𠮷
The following is a definition of a user function that takes into account the restrictions of UCS-2.
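# -*- coding: utf-8 -*-
import sys

def utf8_chr(cp):
    # unichr works directly on a wide build or for code points below 0x10000.
    if sys.maxunicode > 0xFFFF or cp < 0x10000:
        return unichr(cp)
    # Narrow build: split the code point into a surrogate pair.
    cp -= 0x10000
    high = (cp >> 10) | 0xD800
    low = (cp & 0x3FF) | 0xDC00
    return unichr(high) + unichr(low)

print(utf8_chr(0x20BB7))
print(utf8_chr(0x91CE))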