$ python --version
Python 2.7.12 :: Continuum Analytics, Inc.
$ pip freeze | grep pandas
pandas==0.19.1
$ file --mime sample.tsv
sample.tsv: text/plain; charset=utf-8
$ cat sample.tsv
ID language
1 Japanese
2 english
codecs
First, try reading the file with codecs.
>>> open("sample.tsv", "r").read()
'ID\t\xe8\xa8\x80\xe8\xaa\x9e\n1\t\xe6\x97\xa5\xe6\x9c\xac\xe8\xaa\x9e\n2\t\xe8\x8b\xb1\xe8\xaa\x9e\n'
>>> import codecs
>>> codecs.open("sample.tsv", "r", "utf-8").read()
u'ID\t\u8a00\u8a9e\n1\t\u65e5\u672c\u8a9e\n2\t\u82f1\u8a9e\n'
If you read it with codecs, the result is `unicode`.
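Under the hood this is plain UTF-8 decoding; as a sketch, decoding the byte dump shown above by hand gives the same unicode string that codecs.open() returns:

```python
# -*- coding: utf-8 -*-
# The raw bytes that a plain open().read() returns (copied from the dump above).
raw = b'ID\t\xe8\xa8\x80\xe8\xaa\x9e\n1\t\xe6\x97\xa5\xe6\x9c\xac\xe8\xaa\x9e\n2\t\xe8\x8b\xb1\xe8\xaa\x9e\n'

# Decoding those UTF-8 bytes yields the same unicode string as codecs.open().read().
text = raw.decode("utf-8")
print(repr(text))
```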
pandas
pandas has a `read_table` function, which is handy for reading TSV files.
>>> import pandas as pd
>>> df = pd.read_table(open("sample.tsv", "r"))
>>> df
ID language
0 1 Japanese
1 2 english
>>> df.columns
Index([u'ID', u'language'], dtype='object')
>>> df[u"language"]
Traceback (most recent call last):
...
KeyError: u'\u8a00\u8a9e'
>>> list(df.columns)
['ID', '\xe8\xa8\x80\xe8\xaa\x9e']
>>> type(list(df.columns)[1])
<type 'str'>
>>> df["language"]
0 Japanese
1 english
Name: language, dtype: object
The `u` prefixes in the display of `df.columns` are hard to accept, given that the actual type of the column name is `str`.
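As an aside, read_table can also do the decoding itself via its encoding parameter (which exists in pandas 0.19). A minimal sketch, assuming a UTF-8 file on disk; under Python 2 this should yield unicode column names directly, and under Python 3 all strings are unicode anyway:

```python
# -*- coding: utf-8 -*-
import tempfile

import pandas as pd

# Write the sample TSV as raw UTF-8 bytes (same content as sample.tsv).
with tempfile.NamedTemporaryFile(suffix=".tsv", delete=False) as f:
    f.write(u'ID\t\u8a00\u8a9e\n1\t\u65e5\u672c\u8a9e\n2\t\u82f1\u8a9e\n'.encode("utf-8"))
    path = f.name

# Let read_table decode the file instead of pre-decoding with codecs.
df = pd.read_table(path, encoding="utf-8")
print(list(df.columns))
```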
codecs & pandas
with read_table
Next, open the file with codecs and read it with read_table.
>>> df = pd.read_table(codecs.open("sample.tsv", "r", "utf-8"))
>>> df
ID language
0 1 Japanese
1 2 english
>>> df[u"language"]
Traceback (most recent call last):
...
KeyError: u'\u8a00\u8a9e'
>>> df["language"]
0 Japanese
1 english
Name: language, dtype: object
For some reason, the column names are still `str`.
without read_table
>>> from collections import defaultdict
>>> data = defaultdict(list)
>>> f = codecs.open("sample.tsv", "r", "utf-8")
>>> labels = f.readline()[:-1].split("\t")  # strip the trailing newline, then split on tabs
>>> values = f.readline()[:-1].split("\t")  # strip the trailing newline, then split on tabs
>>> for label, value in zip(labels, values):
... data[label].append(value)
...
>>> df = pd.DataFrame(data)
>>> df
ID language
0 1 Japanese
>>> df["language"]
Traceback (most recent call last):
...
KeyError: '\xe8\xa8\x80\xe8\xaa\x9e'
>>> df[u"language"]
0 Japanese
Name: language, dtype: object
>>> list(df.columns)
[u'ID', u'\u8a00\u8a9e']
>>> type(list(df.columns)[1])
<type 'unicode'>
When the file is read with codecs without going through read_table, the column names are `unicode`, as expected.
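The snippet above only copies a single data row into the DataFrame. A generalized sketch that loops over every row (assumptions: a UTF-8 file, and a hypothetical helper name `read_tsv`; note all values stay strings, unlike with read_table):

```python
# -*- coding: utf-8 -*-
import codecs
from collections import defaultdict

import pandas as pd


def read_tsv(path):
    # Read a UTF-8 TSV into a DataFrame with unicode column names,
    # iterating over every data row instead of a single readline().
    data = defaultdict(list)
    with codecs.open(path, "r", "utf-8") as f:
        labels = f.readline().rstrip(u"\n").split(u"\t")
        for line in f:
            for label, value in zip(labels, line.rstrip(u"\n").split(u"\t")):
                data[label].append(value)
    return pd.DataFrame(data, columns=labels)
```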