The return values of len and unichr can differ depending on whether Python 2 was built with UCS-2 or UCS-4.

The standard Python 2 on Mac OS X is built with UCS-2, so the values returned by the built-in functions len and unichr differ from those on UCS-4 builds, which are widely used in Linux distributions.

Behavior with the UCS-2 build option

If the build option is UCS-2 and the string contains characters at U+10000 or above, you can't just use len to find the number of characters. Python installed with Homebrew is also built with UCS-2.

Use the value of sys.maxunicode to see if UCS-2 was specified for the build option.

>>> import sys
>>> 0xFFFF == sys.maxunicode
True

Applying len to the following string (U+20BB7 U+91CE U+5BB6) gives a return value of 4.

>>> str = u'𠮷野家'
>>> 4 == len(str)
True

The internal representation of U+20BB7 is the surrogate pair U+D842 U+DFB7.

>>> 0xD842 == ord(str[0])
True
>>> 0xDFB7 == ord(str[1])
True
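
As a cross-check that does not depend on the build option, the same pair can be read from the UTF-16 encoding of the character. The following is a minimal sketch; the helper name utf16_code_units is hypothetical and not part of the original article.

```python
import struct

def utf16_code_units(text):
    """Return the UTF-16 code units of text as a list of integers."""
    # Big-endian UTF-16 without a BOM: two bytes per code unit.
    data = text.encode('utf-16-be')
    return list(struct.unpack('>%dH' % (len(data) // 2), data))

units = utf16_code_units(u'\U00020BB7')
print([hex(u) for u in units])  # ['0xd842', '0xdfb7']
```

Because the result comes from the codec rather than from indexing the string, it should be the same on UCS-2 and UCS-4 builds.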

Count characters with UCS-2 in mind

Let's count the characters, keeping in mind that high surrogates range from U+D800 to U+DBFF. To keep the code simple, the case of an isolated high or low surrogate is not handled. On a UCS-4 build, a plain for loop is enough.

# -*- coding: utf-8 -*-

import sys

def utf8_len(str):

    length = 0

    if sys.maxunicode > 0xFFFF:
        for c in str:
            length += 1

        return length

    code_units = len(str)
    pos = 0
    cp = -1

    while pos < code_units:

        cp = ord(str[pos])
        length += 1

        if cp > 0xD7FF and 0xDC00 > cp:
            pos += 2
        else:
            pos += 1

    return length

Let's try the previous string again.

str = u'𠮷野家'
print(3 == utf8_len(str))
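
A shorter alternative, sketched here under the same assumption that surrogates are well paired, is to count every code unit that is not a low surrogate (U+DC00 to U+DFFF). The function name count_code_points is hypothetical. It does not need to check sys.maxunicode, because a UCS-4 string of well-formed text contains no low surrogates.

```python
def count_code_points(text):
    """Count code points by skipping low surrogates (U+DC00 to U+DFFF)."""
    # Each surrogate pair contributes one high and one low surrogate,
    # so ignoring the low half counts the pair exactly once.
    return sum(1 for ch in text if not 0xDC00 <= ord(ch) <= 0xDFFF)

print(count_code_points(u'\U00020BB7\u91ce\u5bb6'))  # 3
```

Note that an isolated low surrogate is not counted at all by this sketch, while an isolated high surrogate is counted as one character.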

As an exercise, let's modify the code a bit and define a function that applies a callback to each character.

# -*- coding: utf-8 -*-

import sys

def utf8_each_char(str, func):

    if sys.maxunicode > 0xFFFF:
        for c in str:
            func(c)
    else:
        code_units = len(str)
        pos = 0
        buf = ''
        cp = -1

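        # This loop belongs to the UCS-2 branch only.
        while pos < code_units:
            buf = str[pos]
            cp = ord(buf)

            # High surrogate: take the following low surrogate as well.
            if cp > 0xD7FF and 0xDC00 > cp:
                buf += str[pos+1]
                func(buf)
                pos += 2
            else:
                func(buf)
                pos += 1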

Let's display one character at a time. To use print with a lambda expression, you need to import print_function at the beginning of the file.

from __future__ import print_function

str = u'𠮷野家'
f = lambda c: print(c)
utf8_each_char(str, f)
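
The same traversal can also be written as a generator, so that the caller uses an ordinary for loop instead of passing a callback. This is only a sketch; the name iter_chars is hypothetical, and unlike the function above it yields an isolated surrogate on its own rather than assuming a partner follows.

```python
def iter_chars(text):
    """Yield one string per code point, joining surrogate pairs."""
    pos = 0
    size = len(text)
    while pos < size:
        ch = text[pos]
        is_high = 0xD800 <= ord(ch) <= 0xDBFF
        # Join only when a low surrogate really follows the high one.
        if is_high and pos + 1 < size and 0xDC00 <= ord(text[pos + 1]) <= 0xDFFF:
            yield text[pos:pos + 2]
            pos += 2
        else:
            yield ch
            pos += 1

chars = list(iter_chars(u'\U00020BB7\u91ce\u5bb6'))
print(len(chars))  # 3
```

On a UCS-4 build the pairing branch is simply never taken for well-formed text, so no check of sys.maxunicode is needed.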

Generate characters from code points with UCS-2 in mind

The UCS-2 constraint also affects unichr, which generates a character from a code point integer: on a UCS-2 build it does not accept integers of 0x10000 and above.

>>> unichr(0x20BB7)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: unichr() arg not in range(0x10000) (narrow Python build)

Unicode escape sequences are not affected by UCS-2.


>>> print(u"\U00020BB7")
𠮷

The following is a definition of a user function that takes into account the restrictions of UCS-2.

# -*- coding: utf-8 -*-

import sys

def utf8_chr(cp):
    if 0xFFFF < sys.maxunicode or cp < 0x10000:
        return unichr(cp)

    cp -= 0x10000
    high = (cp >> 10) | 0xD800
    low = (cp & 0x3FF) | 0xDC00

    return unichr(high) + unichr(low)

print(utf8_chr(0x20BB7))
print(utf8_chr(0x91CE))
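
The reverse direction has the same limitation: on a UCS-2 build, ord rejects a two-unit string. A possible counterpart is sketched below; the name code_point_of is hypothetical, and it assumes the argument is either a single code unit or one well-formed surrogate pair.

```python
def code_point_of(ch):
    """Return the code point of ch, which may be a surrogate pair."""
    if len(ch) == 2:
        high, low = ord(ch[0]), ord(ch[1])
        if 0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF:
            # Undo the split: 10 bits from each half, plus the 0x10000 offset.
            return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)
    # Single code unit (or a UCS-4 character); ord raises TypeError otherwise.
    return ord(ch)

print(hex(code_point_of(u'\U00020BB7')))  # 0x20bb7
print(hex(code_point_of(u'\u91ce')))      # 0x91ce
```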