UTF-8 text processing in Python

The Python 2.x series is confusing because the str object and the unicode object are separate types. After looking into various approaches, I ended up with the script in this post. The Python 3.x series seems easier because all text is handled as unicode.
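
To make the distinction concrete, here is a minimal sketch of the difference between text (unicode) and its UTF-8 encoded bytes. The variable names `text` and `data` are only for illustration and are not from the original script; the `u''` literals should be accepted by Python 2.6+ and Python 3.3+.

```python
# Sketch: text (unicode) versus UTF-8 encoded bytes.

text = u'\u65e5\u672c\u8a9e'    # three characters of text
data = text.encode('utf-8')     # encode: text -> UTF-8 bytes

print(len(text))    # number of characters
print(len(data))    # number of bytes (each of these characters takes 3 bytes in UTF-8)

# decode: bytes -> text; the round trip gives back the original text
assert data.decode('utf-8') == text
```

In Python 2 the encoded value is a str and the text is a unicode object, which is why the two must be kept apart when reading and writing files.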

Environment: Mac OS X 10.6.8, Python 2.6.1

`python`


# coding: UTF-8

import codecs
import string
import re

f_in  = codecs.open('test.txt', 'r', 'utf-8')
f_out = codecs.open('test_out.txt', 'w', 'utf-8')

lines = f_in.readlines() #Read
lines2 = []
for line in lines:
	line = string.replace(line,u'text',u'text') # Plain text replacement
	line = re.sub(r'(\d)(?=(\d{3})+(?!\d))', r'\1,', line) # Regular expression replacement: insert a comma every 3 digits
	lines2.append(line) #Make a separate list
else:
	f_out.write(string.join(lines2,'')) #writing
	f_in.close()
	f_out.close()

`test.txt`


This is sample text.
Insert a comma every 3 digits.
iPad mini 36800 yen

`test_out.txt`


This is a sample text.
Insert a comma every 3 digits.
iPad mini 36,800 yen
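
The regular expression is the least obvious part of the script, so here is a small standalone sketch of how it behaves. The names `THOUSANDS` and `add_commas` are hypothetical helpers introduced only for this illustration.

```python
import re

# (\d)           capture one digit
# (?=(\d{3})+    lookahead: followed by one or more complete groups of three digits
# (?!\d))        and no further digit after those groups
THOUSANDS = re.compile(r'(\d)(?=(\d{3})+(?!\d))')

def add_commas(text):
    """Insert a comma after each digit that is followed by a multiple of three digits."""
    return THOUSANDS.sub(r'\1,', text)

print(add_commas(u'iPad mini 36800 yen'))   # iPad mini 36,800 yen
print(add_commas(u'1234567'))               # 1,234,567
print(add_commas(u'999'))                   # 999
```

Because the pattern only looks at runs of digits, it also inserts commas into the fractional part of a decimal number (for example `0.12345` becomes `0.12,345`), so it is best suited to integer values like the price in the sample.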

Postscript: I also wrote code that works with Python 3.3. Python 3 can still use the codecs module, so the only differences seem to be that replace is called as a method of the str object and that the u'' literal is not used.

`python`


from __future__ import unicode_literals

If you add this line, all string literals are treated as unicode even without the u'' prefix, so the same code also works on Python 2.6. That may be the best approach at the moment.
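
As a possible variation for code meant to run on both series, the same read, replace, and write flow can be sketched with `io.open` and `with` blocks. This is an assumption on my part rather than the approach used in this post: `io.open` takes an `encoding` argument on Python 2.6+ and is the same function as the built-in `open` on Python 3, and the `with` blocks close the files automatically. The function name `convert` is hypothetical.

```python
import io
import re

def convert(src_path, dst_path):
    """Read UTF-8 text, insert thousands separators, and write UTF-8 text."""
    # Reading through io.open yields unicode text on both Python 2 and 3.
    with io.open(src_path, 'r', encoding='utf-8') as f_in:
        lines = f_in.readlines()

    converted = [re.sub(r'(\d)(?=(\d{3})+(?!\d))', r'\1,', line) for line in lines]

    # The file is flushed and closed when the with block ends.
    with io.open(dst_path, 'w', encoding='utf-8') as f_out:
        f_out.write(u''.join(converted))
```

With the sample files from this post, it would be called as `convert('test.txt', 'test_out.txt')`.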

`python`


# coding: UTF-8
from __future__ import unicode_literals # Treat all string literals as unicode. Not required on the Python 3 series
import codecs
import re

f_in  = codecs.open('test.txt', 'r', 'utf-8')
f_out = codecs.open('test_out.txt', 'w', 'utf-8')

lines = f_in.readlines() #Read
lines2 = []
for line in lines:
    line = line.replace('text','text') # Plain text replacement
    line = re.sub(r'(\d)(?=(\d{3})+(?!\d))', r'\1,', line) #Regular expression replacement
    lines2.append(line) #Make a separate list
else:
    f_out.write(''.join(lines2)) #writing
    f_in.close()
    f_out.close()