Continuing from yesterday's article, this time I will summarize my own policy for handling strings in Python 3.
- In most cases, handle character strings, and exchange data with standard input/output as character strings.
- However, you sometimes need to handle byte strings, for example when a byte string is passed in from an external program. Put the other way around, do not handle bytes except in such cases.
This is not a matter of good or bad; there simply are not many cases where your code has to deal with bytes.
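The policy above can be sketched as follows: decode bytes once, at the boundary where they arrive from an external program, and work with str everywhere else. This is only a minimal sketch. The child command is a made-up stand-in for an external program, and the assumption that it emits UTF-8 is mine.

```python
import subprocess
import sys

# Hypothetical external program: a child Python process that writes raw UTF-8 bytes.
child_code = r"import sys; sys.stdout.buffer.write(b'\xe3\x81\x82\n')"

# check_output returns bytes because no text mode or encoding was requested.
raw = subprocess.check_output([sys.executable, "-c", child_code])

# Decode once at the boundary (assumes the program emits UTF-8) ...
text = raw.decode("utf-8")

# ... and handle only str from here on.
message = "received: " + text.strip()
```

Keeping the decode step in one place means the rest of the code never has to ask whether a value is bytes or str.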
A byte string is data encoded with a specific encoding, and is written as b'a' in literal form. A character string, on the other hand, is a sequence of Unicode code points, and is written as 'a' in literal form.
That is only a brief description, but it already shows how the handling differs from Python 2.
- A "Python 3 byte string" is treated much like a "Python 2 byte string". However, whereas a "Python 2 byte string" is a "character string", a "Python 3 byte string" is not a "character string" but a completely different type.
- A "Python 3 string" and a "Python 2 Unicode string" can be considered equivalent. The literal notation differs: a "Python 2 Unicode string" had to be prefixed with `u`, while a "Python 3 string" needs no prefix.
python
(py3.4)~ » ipython
(omitted)
>>> b'a' # Byte string
Out[1]: b'a'
# A bytes literal cannot contain non-ASCII characters;
# encode a string with a specific encoding instead
>>> b'あ'
  File "<ipython-input-2-c12eb8e58bcd>", line 1
    b'あ'
    ^
SyntaxError: bytes can only contain ASCII literal characters.
>>> 'あ'.encode('utf-8') # String -> byte string (encode)
Out[3]: b'\xe3\x81\x82'
>>> 'あ' # String
Out[4]: 'あ'
>>> b'\xe3\x81\x82'.decode('utf-8') # Byte string -> string (decode)
Out[5]: 'あ'
# Python 2 (repeated from the previous article)
(py2.7)~ » ipython
(omitted)
>>> 'あ' # Byte string
Out[1]: '\xe3\x81\x82'
>>> u'あ' # Unicode string
Out[2]: u'\u3042'
>>> 'あ'.decode('utf-8') # Byte string -> Unicode string (decode); unicode('あ', 'utf-8') is equivalent
Out[3]: u'\u3042'
>>> u'あ'.encode('utf-8') # Unicode string -> byte string (encode)
Out[4]: '\xe3\x81\x82'
If you check with the type function, you can see that a byte string is of type bytes and a string is of type str.
python
>>> type(b'a')
Out[6]: bytes #≒ Python2 str type
>>> type('a')
Out[7]: str #≒ Python2 unicode type
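Because the two types are distinct, a value of uncertain origin can be normalized with an isinstance check. The helper below is a hypothetical sketch (the name ensure_text and the UTF-8 default are my assumptions, not something from the article).

```python
def ensure_text(value, encoding="utf-8"):
    """Return value as str, decoding it first if it is bytes."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    if isinstance(value, str):
        return value
    raise TypeError("expected bytes or str, got %s" % type(value).__name__)


decoded = ensure_text(b"\xe3\x81\x82")  # bytes are decoded to str
unchanged = ensure_text("a")            # str passes through untouched
```

A helper like this belongs at the edge of a program; inside it, values should already be str.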
Also, as mentioned above, Python 3 byte strings are not "strings". Therefore, they cannot be concatenated with character strings, and the methods they support differ as well. This point is fairly important. In Python 2, both are "strings", so processing somehow carries on and only fails at the very end with UnicodeEncodeError. In Python 3, the failure is a type error instead, so the error message and the place where it occurs are relatively easy to understand.
python
>>> s = 'str' #String
>>> b = b'byte' #Byte sequence
>>> s + b #String+Byte string is an error
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-20-5fe2240a1b50> in <module>()
----> 1 s + b
TypeError: Can't convert 'bytes' object to str implicitly
>>> s.find('t') # str.find takes a str argument
Out[11]: 1
>>> b.find('y') # bytes.find needs a bytes argument, so passing a str is an error
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-24-e1b070a5aaba> in <module>()
----> 1 b.find('y')
TypeError: Type str doesn't support the buffer API
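The errors above go away once the conversion is made explicit in one direction or the other. A minimal sketch, assuming the data is plain ASCII so that the choice of encoding does not matter:

```python
s = "str"
b = b"byte"

# Mixing the two types requires an explicit conversion in one direction.
as_text = s + b.decode("ascii")    # str + str
as_bytes = s.encode("ascii") + b   # bytes + bytes

# bytes does have a find method, but its argument must also be bytes.
index = b.find(b"y")

# Mixing the types without converting raises TypeError right at the call.
try:
    s + b
except TypeError as exc:
    error_name = type(exc).__name__
```

The failure happens at the line that mixes the types, which is what makes the Python 3 behavior easier to debug.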
Also, since Python 3.2, Python apparently selects an appropriate encoding from the locale even when standard output is connected to something other than a terminal. As a result, the following case, which raises UnicodeEncodeError in Python 2, works normally.
(ref.) http://methane.hatenablog.jp/entry/20120806/1344269400 (see the addendum)
python
(py3.4)~ » cat test.py
#!/usr/bin/env python
# -*- coding: utf-8 -*-
print('あ' + 'いう')
# Run in a terminal (standard I/O is connected to the terminal)
(py3.4)~ » python test.py
あいう
# Redirect to a file (standard output is connected to something other than the terminal)
(py3.4)~ » python test.py > test.txt
(py3.4)~ » cat test.txt
あいう
That said, even without doing anything so obviously doomed, you can still run into UnicodeEncodeError. For example, when a script is executed from cron, the encoding cannot be chosen from the locale, so encoding and decoding are attempted with ASCII, and the usual result is UnicodeEncodeError.
(ref.) http://www.python.jp/pipermail/python-ml-jp/2014-November/011311.html (Posting is extremely timely)
Considering this, it may be better to always specify the encoding with the environment variable PYTHONIOENCODING rather than relying on the locale.
(ref.) http://methane.hatenablog.jp/entry/20120806/1344269400
(ref.) http://www.python.jp/pipermail/python-ml-jp/2014-November/011314.html
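One way to observe the effect of PYTHONIOENCODING is to start a child interpreter with the variable set and capture its output through a pipe, which is not a terminal. This is a minimal sketch: the child script is illustrative, and UTF-8 is assumed to be the encoding you want.

```python
import os
import subprocess
import sys

# Illustrative child script: prints three non-ASCII characters (escaped here).
child_code = r"print('\u3042\u3044\u3046')"

# Copy the current environment and pin the stdio encoding for the child.
env = dict(os.environ, PYTHONIOENCODING="utf-8")

# stdout is a pipe, not a terminal; without the variable the locale would decide.
raw = subprocess.check_output([sys.executable, "-c", child_code], env=env)
text = raw.decode("utf-8").strip()
```

For cron, the same idea would mean setting the variable in the crontab entry or in a wrapper script rather than in Python code.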
You can use sys.stdin.buffer (standard input) and sys.stdout.buffer (standard output) to work with bytes instead of strings.
python
(py3.4)~ » cat test.py
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import sys
# print('あ' + 'いう')  # print writes to sys.stdout
sys.stdout.write('あ' + 'いう' + '\n')  # write a str to sys.stdout
sys.stdout.buffer.write(('あ' + 'いう' + '\n').encode('utf-8'))  # write bytes to sys.stdout.buffer
# Run in a terminal
(py3.4)~ » python test.py
あいう
あいう
# Redirect to a file
(py3.4)~ » python test.py > test.txt
(py3.4)~ » cat test.txt
あいう
あいう
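When both input and output are byte streams, a filter can decode on the way in and encode on the way out. The sketch below uses in-memory byte streams in place of sys.stdin.buffer and sys.stdout.buffer; the function name copy_decoded and the upper-casing step are hypothetical placeholders.

```python
import io


def copy_decoded(src, dst, encoding="utf-8"):
    """Read bytes from src, process them as str, write bytes to dst."""
    text = src.read().decode(encoding)   # bytes -> str at the input boundary
    result = text.upper()                # any str processing goes here
    dst.write(result.encode(encoding))   # str -> bytes at the output boundary


# In a real script the arguments would be sys.stdin.buffer and sys.stdout.buffer;
# in-memory byte streams stand in for them here.
src = io.BytesIO("abc\n".encode("utf-8"))
dst = io.BytesIO()
copy_decoded(src, dst)
output = dst.getvalue()
```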
Again, in Python 3, bytes and strings are completely different. Therefore, a byte string cannot be written to sys.stdout, which accepts character strings, and a character string cannot be written to sys.stdout.buffer, which accepts byte strings.
python
>>> import sys
#Text stream(ref. https://docs.python.org/3/library/io.html#io.TextIOWrapper)
>>> type(sys.stdout)
Out[2]: _io.TextIOWrapper
#Byte stream(ref. https://docs.python.org/3/library/io.html#io.BufferedWriter)
>>> type(sys.stdout.buffer)
Out[3]: _io.BufferedWriter
#Cannot write bytes to text stream
>>> sys.stdout.write('a'.encode('utf-8'))
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-4-581ae8b6af82> in <module>()
----> 1 sys.stdout.write('a'.encode('utf-8'))
TypeError: must be str, not bytes
#Strings cannot be written to byte stream
>>> sys.stdout.buffer.write('a')
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-5-42da1d141b96> in <module>()
----> 1 sys.stdout.buffer.write('a')
TypeError: 'str' does not support the buffer interface
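The relationship between the two stream types can also be used the other way round: wrapping a byte stream in io.TextIOWrapper gives a text stream with an encoding you choose yourself. This is only a sketch under my own assumptions. The helper name utf8_text_stream is hypothetical, and an in-memory byte stream stands in for sys.stdout.buffer.

```python
import io


def utf8_text_stream(byte_stream):
    """Wrap a byte stream in a text stream that always encodes as UTF-8."""
    return io.TextIOWrapper(byte_stream, encoding="utf-8", newline="\n")


# An in-memory byte stream stands in for sys.stdout.buffer.
raw = io.BytesIO()
out = utf8_text_stream(raw)
out.write("\u3042\n")   # a str goes in ...
out.flush()             # push the encoded data down to the byte stream
data = raw.getvalue()   # ... and UTF-8 bytes come out
```

This keeps the encoding decision inside the script, independent of the locale or of environment variables.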