str and unicode

This article covers the Python **2.x** series.

The examples were run in the interactive Python console, started with the `python` command from Terminal on Mac and from the Command Prompt on Windows, using `Python 2.7.9`.

What is str?

`str` is a so-called multibyte string, that is, a sequence of bytes. It is written as `'string'`.




#### **`len(string)` returns the number of bytes: 9 when the encoding is utf-8.**

>>> len('いろは')

9

Iterating with `for ... in` processes the string byte by byte.

>>> for c in 'abc':
...     print c
...
a
b
c

>>> for c in 'いろは':
...     print c
...
�
�
�
�
�
�
�
�
�
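As a rough illustration of the byte-oriented behaviour described above, the sketch below encodes the three characters "いろは" to UTF-8 and inspects the result. It uses explicit `u''` escapes and `bytearray` so that it runs on Python 2.7 and also on Python 3; the helper name `byte_values` is made up for this example.

```python
# The three hiragana i, ro, ha, written as escapes to keep the source ASCII.
text = u'\u3044\u308d\u306f'

# In Python 2 the result is a str: it holds bytes, not characters.
data = text.encode('utf-8')


def byte_values(raw):
    """Return the integer value of every byte in raw (hypothetical helper)."""
    return list(bytearray(raw))


print(len(data))          # number of bytes, not number of characters
print(byte_values(data))  # one entry per byte, as a for loop over str would visit
```

Each character takes three bytes in UTF-8, which is why printing the bytes one at a time produces unreadable output.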



#### **`unicode.encode()` converts unicode to str.**

>>> u'abc'.encode()

'abc'

The conversion depends on an encoding: the same characters become different bytes under different encodings.

>>> u'いろは'.encode('utf-8')

'\xe3\x81\x84\xe3\x82\x8d\xe3\x81\xaf'

>>> u'いろは'.encode('cp932')

'\x82\xa2\x82\xeb\x82\xcd'
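To make the effect of the encoding more concrete, here is a minimal sketch that encodes the same unicode value with several encodings and compares the byte counts. The helper `encoded_sizes` and the choice of encodings are assumptions made for this illustration; the code is meant to run on Python 2.7 and 3.

```python
# "i-ro-ha" in hiragana, written as escapes to keep the source ASCII.
text = u'\u3044\u308d\u306f'


def encoded_sizes(value, encodings):
    """Map each encoding name to the number of bytes value needs in it."""
    return dict((name, len(value.encode(name))) for name in encodings)


sizes = encoded_sizes(text, ['utf-8', 'cp932', 'euc-jp'])
for name in sorted(sizes):
    print('%s: %d bytes' % (name, sizes[name]))
```

The characters are identical in every case; only their byte representation changes.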

The bytes of a `str` literal depend on the encoding of the execution environment. On the Mac Terminal it is `utf-8`.

>>> 'いろは'

'\xe3\x81\x84\xe3\x82\x8d\xe3\x81\xaf'

On the Windows 7 Command Prompt it is `cp932`.

>>> 'いろは'

'\x82\xa2\x82\xeb\x82\xcd'
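If you are unsure which encoding your own environment uses, the interpreter can report a few hints. The sketch below only collects values from standard-library calls; the helper name `console_encodings` is hypothetical, and which of these values actually governs a given console is something to verify on your own machine.

```python
import locale
import sys


def console_encodings():
    """Collect the encodings the interpreter reports (hypothetical helper)."""
    return {
        # Encodings of the attached terminal; may be None when piped.
        'stdin': getattr(sys.stdin, 'encoding', None),
        'stdout': getattr(sys.stdout, 'encoding', None),
        # Encoding suggested by the user's locale settings.
        'locale': locale.getpreferredencoding(),
        # Codec Python 2 uses for implicit str <-> unicode conversion.
        'default': sys.getdefaultencoding(),
    }


for name, value in sorted(console_encodings().items()):
    print('%s: %s' % (name, value))
```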

A `str` literal written directly in a script file follows the encoding of the file. However, if the file's actual encoding does not match the `# coding: (encoding name)` declaration, an error can occur when the script is run.

#!/usr/bin/env python
# coding: utf-8

print 'いろは'

The script above prints UTF-8 bytes, so if the console uses a different encoding, the output is garbled.

What is unicode?

`unicode` treats a string as a sequence of characters, not bytes. It is written with a `u` prefix, as in `u'string'`.




#### **`len(string)` returns the number of characters.**

>>> len(u'いろは')

3

Iterating with `for ... in` processes the string character by character.

>>> for c in u'abc':
...     print c
...
a
b
c

>>> for c in u'いろは':
...     print c
...
い
ろ
は
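For comparison with the byte-level view, this small sketch lists the characters of a unicode value together with their code points. It is only an illustration and should run on Python 2.7 and 3; the `U+XXXX` formatting is a presentation choice made here, not something from the interpreter.

```python
# "i-ro-ha" in hiragana, written as escapes to keep the source ASCII.
text = u'\u3044\u308d\u306f'

# Iterating over unicode yields one item per character.
characters = list(text)
code_points = ['U+%04X' % ord(ch) for ch in characters]

print(len(text))
print(code_points)
```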



#### **`str.decode()` converts str to unicode.**

>>> 'いろは'.decode('utf-8')

u'\u3044\u308d\u306f'
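Decoding only gives the right text when the encoding passed to `decode()` matches the one the bytes were written in. The following sketch, which should run on Python 2.7 and 3, decodes the UTF-8 and cp932 byte sequences shown earlier and then deliberately uses the wrong codec; the variable names are made up for this example.

```python
utf8_bytes = b'\xe3\x81\x84\xe3\x82\x8d\xe3\x81\xaf'
cp932_bytes = b'\x82\xa2\x82\xeb\x82\xcd'

# Both byte strings decode to the same three characters.
from_utf8 = utf8_bytes.decode('utf-8')
from_cp932 = cp932_bytes.decode('cp932')
same_text = (from_utf8 == from_cp932)

# Decoding these cp932 bytes as UTF-8 is rejected by the codec.
try:
    cp932_bytes.decode('utf-8')
    wrong_codec_failed = False
except UnicodeDecodeError:
    wrong_codec_failed = True

print(same_text)
print(wrong_codec_failed)
```

A wrong codec does not always fail this loudly; some byte sequences are valid in several encodings and decode to different, incorrect text.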

Text held as `unicode` is represented uniformly as Unicode, so developers do not need to think about encodings while working with `unicode` values.

>>> u'いろはに' + u'ほへと'

u'\u3044\u308d\u306f\u306b\u307b\u3078\u3068'


However, whenever text is output outside the script, it is always converted to `str`.

>>> print u'いろは'.encode('utf-8')

いろは



#### **If you do not convert unicode to str yourself, the Python runtime converts it automatically, but the encoding it uses depends on the execution environment.**

A common case is that the runtime tries to encode a string containing Japanese with the `ascii` codec and raises a `UnicodeEncodeError` exception.

>>> print u'いろは'

Traceback (most recent call last):
  File "<input>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-2: ordinal not in range(128)

Whether output works without conversion depends on the execution environment. When sending text out of the script, convert it explicitly from `unicode` to `str`.
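One way to make that explicit conversion is a small helper that encodes just before output. This is a sketch under stated assumptions: `encode_for_output` is a hypothetical name, falling back to UTF-8 when the console reports no encoding is a guess that may not suit your environment, and `errors='replace'` trades exceptions for `?` placeholders. It should run on Python 2.7 and 3.

```python
import sys


def encode_for_output(text, encoding=None, errors='replace'):
    """Encode text explicitly before it leaves the script (hypothetical helper).

    encoding defaults to the console's reported encoding, falling back to
    UTF-8 when none is reported (an assumption; adjust for your environment).
    """
    if encoding is None:
        encoding = getattr(sys.stdout, 'encoding', None) or 'utf-8'
    return text.encode(encoding, errors)


text = u'\u3044\u308d\u306f'
as_utf8 = encode_for_output(text, 'utf-8')
as_ascii = encode_for_output(text, 'ascii')  # unencodable characters become '?'
```

Passing `errors='strict'` instead restores the `UnicodeEncodeError`, which is preferable when silently losing characters is not acceptable.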

Proper use of unicode and str

Inside the script, I think it is best to use `unicode` throughout.

A `str` obtained from a library, such as `sys.argv`, is converted to `unicode` immediately.


Conversely, when sending a string out of the script (for example with `print`), it is converted from `unicode` to `str` just before output.
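The pattern described here, decoding at the input boundary and encoding at the output boundary, could be sketched as follows. The helpers `to_text` and `to_bytes` are hypothetical, the UTF-8 default is an assumption, and the byte string stands in for an argument received from outside. The code should run on Python 2.7 and 3.

```python
def to_text(value, encoding='utf-8'):
    """Decode incoming bytes to unicode; pass unicode through unchanged."""
    if isinstance(value, bytes):  # bytes is an alias of str in Python 2
        return value.decode(encoding)
    return value


def to_bytes(value, encoding='utf-8'):
    """Encode outgoing unicode to bytes; pass bytes through unchanged."""
    if isinstance(value, bytes):
        return value
    return value.encode(encoding)


# Input boundary: for example, an argument that arrived as UTF-8 bytes.
raw_argument = b'\xe3\x81\x84\xe3\x82\x8d\xe3\x81\xaf'
name = to_text(raw_argument)

# Inside the script everything is unicode.
greeting = u'[' + name + u']'

# Output boundary: encode just before writing.
payload = to_bytes(greeting)
```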


#### **If you mix `str` and `unicode`, the Python runtime tries to convert the `str` to `unicode`.**

This implicit conversion uses the `ascii` codec by default, so a `UnicodeDecodeError` is raised whenever the `str` contains non-ASCII bytes, which is a frequent annoyance.

>>> 'いろはに' + u'ほへと'

Traceback (most recent call last):
  File "<input>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe3 in position 0: ordinal not in range(128)
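The failure can be reproduced without relying on the implicit conversion by decoding with the `ascii` codec directly, which is roughly what Python 2 attempts when the two types are mixed. The sketch below assumes the bytes are UTF-8 and shows that decoding explicitly before concatenating avoids the error; it should run on Python 2.7 and 3.

```python
raw = b'\xe3\x81\x84\xe3\x82\x8d\xe3\x81\xaf\xe3\x81\xab'  # UTF-8 bytes
text = u'\u307b\u3078\u3068'

# Roughly what Python 2 attempts for raw + text: decode raw as ASCII.
try:
    raw.decode('ascii')
    error = None
except UnicodeDecodeError as exc:
    error = exc

# Decoding explicitly with the real encoding avoids the implicit step.
combined = raw.decode('utf-8') + text

print(error)
print(len(combined))
```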

When you don't know the encoding of str ...

When I do not know the encoding of a `str` obtained from outside the script, I convert it to `unicode` with the following code.

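def toUnicode(encodedStr):
    '''
    Decode a str of unknown encoding by trial and error.

    :return: a unicode string, or None if no candidate encoding works.
    '''
    # Already unicode: nothing to decode.
    if isinstance(encodedStr, unicode):
        return encodedStr

    # Candidates are tried in order and the first successful decode wins,
    # so the order matters when several encodings accept the same bytes.
    for charset in [u'cp932', u'utf-8', u'euc-jp', u'shift-jis', u'iso2022-jp']:
        try:
            return encodedStr.decode(charset)
        except UnicodeDecodeError:
            pass

    # No candidate matched.
    return None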
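As a variation on the same trial-and-error idea, the sketch below also reports which encoding matched, which helps when checking a guess. The function `guess_decode` and its default candidate order are assumptions made for this illustration, not the author's code. Trial decoding is only a heuristic: a successful decode does not prove the encoding is the right one. The code should run on Python 2.7 and 3.

```python
def guess_decode(raw, candidates=('utf-8', 'cp932', 'euc-jp')):
    """Return (text, encoding) for the first candidate that decodes raw.

    Hypothetical variant of the trial-and-error idea; returns (None, None)
    when every candidate fails.
    """
    for name in candidates:
        try:
            return raw.decode(name), name
        except UnicodeDecodeError:
            continue
    return None, None


print(guess_decode(b'\xe3\x81\x84\xe3\x82\x8d\xe3\x81\xaf')[1])
print(guess_decode(b'\x82\xa2\x82\xeb\x82\xcd')[1])
```

UTF-8 is tried first in this sketch because it rejects most byte sequences written in other encodings, whereas a permissive codec placed first can accept bytes it was never meant for.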