$ python --version
Python 2.7.12 :: Continuum Analytics, Inc.
$ pip freeze | grep pandas
pandas==0.19.1
$ file --mime sample.tsv
sample.tsv: text/plain; charset=utf-8
$ cat sample.tsv
ID language
1 Japanese
2 english
codecs
First, try reading the file with codecs.
>>> open("sample.tsv", "r").read()
'ID\t\xe8\xa8\x80\xe8\xaa\x9e\n1\t\xe6\x97\xa5\xe6\x9c\xac\xe8\xaa\x9e\n2\t\xe8\x8b\xb1\xe8\xaa\x9e\n'
>>> import codecs
>>> codecs.open("sample.tsv", "r", "utf-8").read()
u'ID\t\u8a00\u8a9e\n1\t\u65e5\u672c\u8a9e\n2\t\u82f1\u8a9e\n'
If you read it with codecs, the result is `unicode`.
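Under the hood this is plain UTF-8 decoding; as a sketch, decoding the byte dump shown above by hand gives the same unicode string that codecs.open() returns:

```python
# -*- coding: utf-8 -*-
# The raw bytes that a plain open().read() returns (copied from the dump above).
raw = b'ID\t\xe8\xa8\x80\xe8\xaa\x9e\n1\t\xe6\x97\xa5\xe6\x9c\xac\xe8\xaa\x9e\n2\t\xe8\x8b\xb1\xe8\xaa\x9e\n'

# Decoding those UTF-8 bytes yields the same unicode string as codecs.open().read().
text = raw.decode("utf-8")
print(repr(text))
```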
pandas
pandas has a `read_table` function, which is handy for reading TSV files.
>>> import pandas as pd
>>> df = pd.read_table(open("sample.tsv", "r"))
>>> df
ID language
0 1 Japanese
1 2 english
>>> df.columns
Index([u'ID', u'language'], dtype='object')
>>> df[u"language"]
Traceback (most recent call last):
...
KeyError: u'\u8a00\u8a9e'
>>> list(df.columns)
['ID', '\xe8\xa8\x80\xe8\xaa\x9e']
>>> type(list(df.columns)[1])
<type 'str'>
>>> df["language"]
0 Japanese
1 english
Name: language, dtype: object
The `u` prefixes in the display of `df.columns` are hard to accept, given that the actual type of the column name is `str`.
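As an aside, read_table can also do the decoding itself via its encoding parameter (which exists in pandas 0.19). A minimal sketch, assuming a UTF-8 file on disk; under Python 2 this should yield unicode column names directly, and under Python 3 all strings are unicode anyway:

```python
# -*- coding: utf-8 -*-
import tempfile

import pandas as pd

# Write the sample TSV as raw UTF-8 bytes (same content as sample.tsv).
with tempfile.NamedTemporaryFile(suffix=".tsv", delete=False) as f:
    f.write(u'ID\t\u8a00\u8a9e\n1\t\u65e5\u672c\u8a9e\n2\t\u82f1\u8a9e\n'.encode("utf-8"))
    path = f.name

# Let read_table decode the file instead of pre-decoding with codecs.
df = pd.read_table(path, encoding="utf-8")
print(list(df.columns))
```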
codecs & pandas
with read_table
Next, open the file with codecs and read it with read_table.
>>> df = pd.read_table(codecs.open("sample.tsv", "r", "utf-8"))
>>> df
ID language
0 1 Japanese
1 2 english
>>> df[u"language"]
Traceback (most recent call last):
...
KeyError: u'\u8a00\u8a9e'
>>> df["language"]
0 Japanese
1 english
Name: language, dtype: object
For some reason, the column names are still `str`.
without read_table
>>> from collections import defaultdict
>>> data = defaultdict(list)
>>> f = codecs.open("sample.tsv", "r", "utf-8")
>>> labels = f.readline()[:-1].split("\t")  # strip the trailing newline, then split on tabs
>>> values = f.readline()[:-1].split("\t")  # strip the trailing newline, then split on tabs
>>> for label, value in zip(labels, values):
... data[label].append(value)
...
>>> df = pd.DataFrame(data)
>>> df
ID language
0 1 Japanese
>>> df["language"]
Traceback (most recent call last):
...
KeyError: '\xe8\xa8\x80\xe8\xaa\x9e'
>>> df[u"language"]
0 Japanese
Name: language, dtype: object
>>> list(df.columns)
[u'ID', u'\u8a00\u8a9e']
>>> type(list(df.columns)[1])
<type 'unicode'>
When the file is read with codecs without going through read_table, the column names are `unicode`, as expected.
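The snippet above only copies a single data row into the DataFrame. A generalized sketch that loops over every row (assumptions: a UTF-8 file, and a hypothetical helper name `read_tsv`; note all values stay strings, unlike with read_table):

```python
# -*- coding: utf-8 -*-
import codecs
from collections import defaultdict

import pandas as pd


def read_tsv(path):
    # Read a UTF-8 TSV into a DataFrame with unicode column names,
    # iterating over every data row instead of a single readline().
    data = defaultdict(list)
    with codecs.open(path, "r", "utf-8") as f:
        labels = f.readline().rstrip(u"\n").split(u"\t")
        for line in f:
            for label, value in zip(labels, line.rstrip(u"\n").split(u"\t")):
                data[label].append(value)
    return pd.DataFrame(data, columns=labels)
```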