UnicodeEncodeError: the biggest natural enemy (well, an exaggeration) of Python programmers (on Python 2) who handle Japanese. Yesterday the person next to me fell prey to it, and while helping to solve it I managed to sort out how string handling works in Python 2. (I'd like to put together a Python 3 version soon.)
- Always be aware of whether you are dealing with byte strings or Unicode strings.
- (Basically) handle Unicode strings inside the program, and convert to/from byte strings when exchanging data with standard I/O (e.g. print).
A byte string holds text encoded in a specific encoding (e.g. UTF-8), and is written in literals as 'あ'. A Unicode string, on the other hand, is a sequence of Unicode code points, and in literals it carries a `u` prefix, as in u'あ'.
python
(py2.7)~ » ipython
(abridgement)
>>> 'あ'  # byte string
Out[1]: '\xe3\x81\x82'
>>> u'あ'  # Unicode string
Out[2]: u'\u3042'
>>> 'あ'.decode('utf-8')  # byte string -> Unicode string (= decode); unicode('あ', 'utf-8') also works
Out[3]: u'\u3042'
>>> u'あ'.encode('utf-8')  # Unicode string -> byte string (= encode)
Out[4]: '\xe3\x81\x82'
If you check with the `type` function, you can see that a byte string is of type `str`, while a Unicode string is of type `unicode`.
python
>>> type('a')
Out[5]: str
>>> type(u'a')
Out[6]: unicode
Furthermore, in Python2, both byte strings and Unicode strings are strings and can be concatenated.
python
>>> u'a' + 'a'
Out[7]: u'aa'
Things get tricky once you have to deal with Japanese (more precisely, any non-ASCII characters). As the output of the example above shows, concatenating a Unicode string and a byte string produces a Unicode string. In the process the byte string has to be decoded into a Unicode string, but the problem is that a Python string carries no information about its own encoding.
"If the encoding is unknown, decode as ASCII," says Python, and hello UnicodeDecodeError. It's rare to make this mistake with literals, but easy to slip up with strings received from outside your own program (including via standard input/output).
python
>>> u'a' + 'あ'  # concatenate a Unicode string and a (non-ASCII) byte string
---------------------------------------------------------------------------
UnicodeDecodeError                        Traceback (most recent call last)
<ipython-input-8-084e015bd795> in <module>()
----> 1 u'a' + 'あ'
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe3 in position 0: ordinal not in range(128)
>>> u'a' + 'あ'.decode('utf-8')  # byte string -> Unicode string
Out[9]: u'a\u3042'
>>> print(u'a' + 'あ'.decode('utf-8'))
aあ
The reason for working with Unicode strings rather than byte strings is that string operations are often more convenient at the code-point level than at the byte level. For example, to count characters you can use the `len` function on a Unicode string; on a byte string, `len` returns the number of bytes, so it cannot be used for that purpose.
python
>>> len(u'あいう')
Out[11]: 3
>>> len('あいう')
Out[12]: 9
As an example, consider the following simple program.
test.py
#!/usr/bin/env python
# -*- coding: utf-8 -*-
print(u'あ' + u'いう')
Try running it in a terminal. For most people this probably works without problems.
python
(py2.7)~ » python test.py
あいう
Then what happens if you redirect the output to a file? In many environments you get a UnicodeEncodeError, as shown below.
python
(py2.7)~ » python test.py > test.txt
Traceback (most recent call last):
File "test.py", line 4, in <module>
print(u'あ' + u'いう')
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-2: ordinal not in range(128)
In this example, print passes a Unicode string to standard output, and at that point a Unicode-string-to-byte-string conversion (encoding) takes place.
If standard I/O is connected to a terminal, Python automatically selects an appropriate encoding from the locale (e.g. the LANG environment variable). When standard I/O is connected to something other than a terminal (a redirect, a pipe, etc.), there is no information from which to choose an encoding, so Python falls back to ASCII, which fails in most cases (i.e. whenever non-ASCII characters are included).
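You can observe this from inside a program; a minimal check (the exact values printed depend on your environment and Python version):

```python
import sys

# When stdout is a terminal, Python 2 picks the encoding from the locale;
# when redirected, sys.stdout.encoding becomes None and ASCII is assumed.
print(sys.stdout.isatty())    # False when redirected to a file or pipe
print(sys.stdout.encoding)    # e.g. 'UTF-8' on a terminal; None when redirected (Python 2)
```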
(ref.) http://blog.livedoor.jp/dankogai/archives/51816624.html
Encoding the Unicode string before passing it to standard output can solve this problem.
test.py(Unicode string->Byte string)
#!/usr/bin/env python
# -*- coding: utf-8 -*-
print((u'あ' + u'いう').encode('utf-8'))
By setting the environment variable `PYTHONIOENCODING`, you can fix the encoding used for standard I/O regardless of the locale. With this set, you no longer need to encode each string yourself.
python
(py2.7)~ » cat test.py
#!/usr/bin/env python
# -*- coding: utf-8 -*-
print(u'あ' + u'いう')
(py2.7)~ » PYTHONIOENCODING=utf-8 python test.py > test.txt
(py2.7)~ » cat test.txt
あいう
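Another way to fix the output encoding from inside the program (an alternative not covered in the post above, sketched here against an in-memory stream rather than the real stdout) is to wrap the output stream with `codecs.getwriter`, so writes are always encoded with a chosen encoding regardless of the locale:

```python
import codecs
import io

# Stand-in for a redirected stdout; in a real script you would wrap
# sys.stdout itself in the same way.
buf = io.BytesIO()
writer = codecs.getwriter('utf-8')(buf)

# The writer encodes every Unicode string as UTF-8 before writing.
writer.write(u'あ' + u'いう')
assert buf.getvalue() == b'\xe3\x81\x82\xe3\x81\x84\xe3\x81\x86'
```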
(ref.) http://methane.hatenablog.jp/entry/20120806/1344269400