Things to keep in mind when processing strings in Python3

Continuing from yesterday's article, this time I will summarize my own policy for dealing with strings in Python 3.

Personal conclusion

- In most cases, I work with character strings, and I also exchange standard input/output as character strings.
- However, byte strings sometimes have to be handled, for example when an external program passes one in. Put the other way around, I do not handle bytes except in such cases.

This is not a matter of good or bad; there simply are not many cases where your code has to deal with bytes.

Bytes and strings

A byte string is a sequence of bytes, typically text encoded with a specific encoding, and is written as b'a' as a literal. A character string, on the other hand, is a sequence of Unicode code points and is written as 'a' as a literal.

That was brief, but it is already enough to see how the handling differs from Python 2.

- A "Python 3 byte string" is treated much like a "Python 2 byte string". However, the "Python 2 byte string" is a "character string", whereas the "Python 3 byte string" is not a "character string" but a completely different type.
- A "Python 3 string" and a "Python 2 Unicode string" can be considered equivalent. The literal notation differs: a "Python 2 Unicode string" had to be prefixed with `u`, while a "Python 3 string" needs no prefix.

`python`


(py3.4)~ » ipython
   (abridgement)
>>> b'a' #Byte sequence
Out[1]: b'a'

#Literal notation cannot be used when the bytes contain non-ASCII characters
#You need to encode the string with a specific encoding
>>> b'あ'
  File "<ipython-input-2-c12eb8e58bcd>", line 1
    b'あ'
        ^
SyntaxError: bytes can only contain ASCII literal characters.

>>> 'あ'.encode('utf-8') #String->Byte string(Encode)
Out[3]: b'\xe3\x81\x82'


>>> 'あ' #String
Out[4]: 'あ'

>>> b'\xe3\x81\x82'.decode('utf-8') #Byte string->String(Decode)
Out[5]: 'あ'


# Python2(Repost)
(py2.7)~ » ipython
   (abridgement)
>>> 'あ' #Byte string
Out[1]: '\xe3\x81\x82'

>>> u'あ' #Unicode string
Out[2]: u'\u3042'

>>> 'あ'.decode('utf-8') #Byte string->Unicode string(=Decode); unicode('あ', 'utf-8') also works
Out[3]: u'\u3042'

>>> u'あ'.encode('utf-8') #Unicode string->Byte string(=Encode)
Out[4]: '\xe3\x81\x82'

Checking with the type function shows that a byte string is of type bytes and a string is of type str.

`python`


>>> type(b'a')
Out[6]: bytes #≒ Python2 str type

>>> type('a')
Out[7]: str #≒ Python2 unicode type
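
As a small illustration of the "sequence of code points" versus "encoded bytes" distinction (a sketch that is not part of the original session; the variable names are my own), len() and indexing behave differently on the two types:

```python
# A str is a sequence of Unicode code points;
# bytes is a sequence of integers in range(256).
text = 'あ'                       # one code point, U+3042
data = text.encode('utf-8')      # three bytes in UTF-8

print(len(text))                 # 1
print(len(data))                 # 3
print(ord(text[0]) == 0x3042)    # True
print(data[0])                   # 227 (0xe3): indexing bytes gives an int
print(data.decode('utf-8') == text)  # True: the round trip is lossless
```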

Also, as mentioned above, Python 3 byte strings are not "strings". Therefore they cannot be concatenated with a character string, and the methods they support are different. This point is fairly important. In Python 2 both were strings, so processing would somehow carry on and only fail at the very end with a UnicodeEncodeError. In Python 3 the same mistake fails with an error about mismatched types, so both the error message and the place where it occurs are relatively easy to understand.

`python`


>>> s = 'str' #String

>>> b = b'byte' #Byte sequence

>>> s + b #String+Byte string is an error
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-20-5fe2240a1b50> in <module>()
----> 1 s + b

TypeError: Can't convert 'bytes' object to str implicitly

>>> s.find('t') #The string supports the find method
Out[11]: 1

>>> b.find('y') #bytes also has a find method, but its argument must be bytes (e.g. b'y'), not str
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-24-e1b070a5aaba> in <module>()
----> 1 b.find('y')

TypeError: Type str doesn't support the buffer API
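
One way to get past both errors above is to convert explicitly so that both operands have the same type. A minimal sketch, assuming the bytes hold UTF-8 text (the names joined_text and joined_bytes are only for illustration):

```python
s = 'str'       # string
b = b'byte'     # byte string

# Decode the bytes to get a str, then concatenate str + str
joined_text = s + b.decode('utf-8')
print(joined_text)       # strbyte

# Or encode the str and concatenate bytes + bytes
joined_bytes = s.encode('utf-8') + b
print(joined_bytes)      # b'strbyte'

# bytes.find works when the argument is also bytes
print(b.find(b'y'))      # 1
```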

Also, since Python 3.2, the encoding appears to be chosen from the locale even when standard output is connected to something other than a terminal. As a result, the following case, which raised UnicodeEncodeError in Python 2, now works normally.

(ref.) http://methane.hatenablog.jp/entry/20120806/1344269400 (see the addendum)

`python`


(py3.4)~ » cat test.py
#!/usr/bin/env python
# -*- coding: utf-8 -*-

print('あ' + 'いう')


#Run in a terminal(standard I/O is connected to the terminal)
(py3.4)~ » python test.py
あいう

#Redirect to a file(standard output is connected to something other than the terminal)
(py3.4)~ » python test.py > test.txt
(py3.4)~ » cat test.txt
あいう

UnicodeEncodeError is no longer scary

Even without tempting fate so blatantly, you can still run into UnicodeEncodeError. For example, when a script is run from cron, the encoding cannot be determined from the locale, so Python falls back to ASCII for encoding and decoding, and you usually end up with a UnicodeEncodeError.

(ref.) http://www.python.jp/pipermail/python-ml-jp/2014-November/011311.html (a very timely post)

Given this, it may be better to always specify the encoding explicitly with the PYTHONIOENCODING environment variable rather than relying on the locale.

(ref.) http://methane.hatenablog.jp/entry/20120806/1344269400
(ref.) http://www.python.jp/pipermail/python-ml-jp/2014-November/011314.html
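
The effect of PYTHONIOENCODING can be checked from Python itself. The following is a rough sketch, not taken from the referenced posts: it assumes Python 3.5 or later (for subprocess.run), and run_child and CHILD_CODE are hypothetical names. It starts a child interpreter whose standard output is a pipe rather than a terminal and sets the variable only for that child.

```python
import codecs
import os
import subprocess
import sys

# Child program: report the stdout encoding, then print a non-ASCII character.
CHILD_CODE = "import sys; print(sys.stdout.encoding); print('\\u3042')"

def run_child(extra_env):
    """Run CHILD_CODE in a new interpreter with stdout connected to a pipe."""
    env = dict(os.environ)
    env.update(extra_env)
    result = subprocess.run(
        [sys.executable, '-c', CHILD_CODE],
        env=env,
        stdout=subprocess.PIPE,
        check=True,
    )
    return result.stdout  # raw bytes written by the child

raw = run_child({'PYTHONIOENCODING': 'utf-8'})
reported, printed = raw.decode('utf-8').split()
print(codecs.lookup(reported).name)  # utf-8
print(printed == '\u3042')           # True
```

What the child does without the variable depends on the Python version and the environment, so the sketch deliberately checks only the case where the encoding is given explicitly.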

So what to do when dealing with bytes

You can use sys.stdin.buffer (standard input) / sys.stdout.buffer (standard output) to work with bytes instead of strings.

`python`


(py3.4)~ » cat test.py
#!/usr/bin/env python
# -*- coding: utf-8 -*-

import sys

#print('あ' + 'いう') #print writes to sys.stdout
sys.stdout.write('あ' + 'いう' + '\n') #Write a string to sys.stdout
sys.stdout.buffer.write(('あ' + 'いう' + '\n').encode('utf-8')) #Write a byte string to sys.stdout.buffer

#Run in a terminal
(py3.4)~ » python test.py
あいう
あいう

#Redirect to a file
(py3.4)~ » python test.py > test.txt
(py3.4)~ » cat test.txt
あいう
あいう
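
When bytes do have to be handled, the policy from the conclusion can be kept by decoding at the input boundary, processing as a string, and encoding again at the output boundary. The following is a minimal sketch under my own assumptions: upper_case_stream is a hypothetical helper, UTF-8 is assumed, and io.BytesIO objects stand in for sys.stdin.buffer and sys.stdout.buffer so the example runs without real input.

```python
import io

def upper_case_stream(in_bytes, out_bytes, encoding='utf-8'):
    """Read bytes, process them as str, and write bytes back out."""
    text = in_bytes.read().decode(encoding)    # bytes -> str at the input boundary
    result = text.upper()                      # all processing happens on str
    out_bytes.write(result.encode(encoding))   # str -> bytes at the output boundary

# Stand-ins for sys.stdin.buffer / sys.stdout.buffer
fake_stdin = io.BytesIO('abc あ\n'.encode('utf-8'))
fake_stdout = io.BytesIO()
upper_case_stream(fake_stdin, fake_stdout)
print(fake_stdout.getvalue())   # b'ABC \xe3\x81\x82\n'
```

In a real script the call would presumably be upper_case_stream(sys.stdin.buffer, sys.stdout.buffer).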

Again, in Python 3, bytes and strings are completely different types. Therefore a byte string cannot be written to sys.stdout, which accepts character strings, and a character string cannot be written to sys.stdout.buffer, which accepts byte strings.

`python`


>>> import sys

#Text stream(ref. https://docs.python.org/3/library/io.html#io.TextIOWrapper)
>>> type(sys.stdout) 
Out[2]: _io.TextIOWrapper

#Byte stream(ref. https://docs.python.org/3/library/io.html#io.BufferedWriter)
>>> type(sys.stdout.buffer)
Out[3]: _io.BufferedWriter 

#Cannot write bytes to text stream
>>> sys.stdout.write('a'.encode('utf-8')) 
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-4-581ae8b6af82> in <module>()
----> 1 sys.stdout.write('a'.encode('utf-8'))

TypeError: must be str, not bytes

#Strings cannot be written to byte stream
>>> sys.stdout.buffer.write('a') 
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-5-42da1d141b96> in <module>()
----> 1 sys.stdout.buffer.write('a')

TypeError: 'str' does not support the buffer interface
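
The relationship between the two stream types can also be seen by building the text layer by hand. This is only a sketch under my own assumptions: an io.BytesIO object stands in for a byte stream such as sys.stdout.buffer, and the encoding is fixed to UTF-8 instead of being taken from the locale.

```python
import io

raw = io.BytesIO()   # stand-in for a byte stream such as sys.stdout.buffer
# newline='\n' disables newline translation so the bytes are the same on every OS
text_stream = io.TextIOWrapper(raw, encoding='utf-8', newline='\n')

text_stream.write('\u3042\n')   # a str goes in ...
text_stream.flush()             # ... and is pushed down to the byte stream
print(raw.getvalue())           # b'\xe3\x81\x82\n'

try:
    text_stream.write(b'a')     # bytes are still rejected by the text layer
except TypeError as exc:
    print('TypeError:', exc)
```

The text stream encodes on the way down, which is why the underlying byte stream only ever sees bytes.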