Here is an Excel file. It was exported from a database: each row is one record, and each record has a single field containing free text. The task this time is to extract frequently used keywords from the text in that field, count how many times each keyword appears, and rank them.
The input and output are Excel files on Windows; the processing in between is done on a Mac.
I work in my usual environment.
I plan to process the data with pandas later, so I convert the file to UTF-8 and read it with pandas.
Save the file as CSV from the Excel menu: test.xls -> test.csv
$ nkf -g test.csv
Shift_JIS
$ nkf -w test.csv > test_utf8.csv
$ nkf -g test_utf8.csv
UTF-8
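If nkf is not at hand, the same conversion can probably be done in Python. The following is a minimal sketch, not part of the original workflow. The function names `sjis_to_utf8` and `convert_file` are hypothetical, and it assumes the CSV exported by Excel is in the Windows variant of Shift_JIS (`cp932`). It is written for Python 3.

```python
def sjis_to_utf8(data):
    """Re-encode Shift_JIS (cp932) bytes as UTF-8 bytes."""
    return data.decode('cp932').encode('utf-8')


def convert_file(src_path, dst_path):
    # Read raw bytes so that Python does not guess the encoding.
    with open(src_path, 'rb') as src:
        converted = sjis_to_utf8(src.read())
    with open(dst_path, 'wb') as dst:
        dst.write(converted)
```

Calling `convert_file('test.csv', 'test_utf8.csv')` would then stand in for the `nkf -w` step above.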
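import pandas as pd
csv_file = 'test_utf8.csv'  # the UTF-8 copy made with nkf above
# header=1 uses the second row of the file as the column names
df = pd.read_csv(csv_file, encoding='utf-8', header=1)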
brew search mecab
pip search mecab
pip install mecab-python
... Successfully installed mecab-python-0.996
If this message appears, the installation succeeded. MeCab can now be used from Python (2.x series).
# -*- coding: utf-8 -*-
import MeCab

def count_word(df):
    e = df[u'comment']
    dic_n = {}  # noun -> count
    dic_v = {}  # verb -> count
    m = MeCab.Tagger('-Ochasen')  # Put the output in Chasen mode
    for s in e:
        if type(s) != unicode:  # skip cells that are not text (e.g. empty cells)
            continue
        # Keep the encoded string in a variable while it is parsed (see the note at the end)
        s8 = s.encode('utf-8')
        print s8
        node = m.parseToNode(s8)
        while node:
            word = node.feature.split(',')[0]  # part of speech
            key = node.surface
            if word == '名詞':  # noun
                dic = dic_n
                print "<", key, "> (n)"
            elif word == '動詞':  # verb
                dic = dic_v
                print "<", key, "> (v)"
            else:
                node = node.next
                continue
            dic[key] = dic.get(key, 0) + 1
            node = node.next
    return dic_n, dic_v
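The counting step on its own can be sketched without MeCab, using `collections.Counter` from the standard library. This is only an illustration under assumptions: the function name `count_by_pos` and the `(surface, part of speech)` input shape are hypothetical, the sample tokens are written by hand rather than produced by MeCab, and the part-of-speech labels are the Japanese ones that the IPA dictionary outputs (名詞 = noun, 動詞 = verb). It is written for Python 3.

```python
from collections import Counter


def count_by_pos(tokens, targets=(u'名詞', u'動詞')):
    """Tally surface forms separately for each target part of speech.

    tokens: iterable of (surface, pos) pairs, for example built from
    node.surface and node.feature.split(',')[0].
    """
    counters = {pos: Counter() for pos in targets}
    for surface, pos in tokens:
        if pos in counters:  # ignore particles, symbols, and so on
            counters[pos][surface] += 1
    return counters


# Hand-written sample tokens, not real MeCab output.
sample = [(u'猫', u'名詞'), (u'が', u'助詞'), (u'走る', u'動詞'), (u'猫', u'名詞')]
counts = count_by_pos(sample)
print(counts[u'名詞'].most_common())  # [('猫', 2)]
```

`Counter.most_common()` already returns the entries sorted by count in descending order, so it could also replace the manual sort when ranking.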
import csv

def write_to_csv(dic, csv_file):
    with open(csv_file, 'w') as f:
        writer = csv.writer(f, lineterminator='\n')
        # Sort by value (count), highest first
        for k, v in sorted(dic.items(), key=lambda x: x[1], reverse=True):
            print k, v
            writer.writerow([k, v])
dic_n, dic_v = count_word(df)
write_to_csv(dic_n, 'test_dic_n_utf8.csv')
write_to_csv(dic_v, 'test_dic_v_utf8.csv')
$ nkf -g test_dic_n_utf8.csv
UTF-8
$ nkf -s test_dic_n_utf8.csv > test_dic_n_sjis.csv
$ nkf -g test_dic_n_sjis.csv
Shift_JIS
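As an alternative to converting with nkf afterwards, the ranking could be written in Shift_JIS directly from Python. The following is a rough sketch under assumptions, not what the article does: the function names `ranking_to_sjis_bytes` and `write_ranking_sjis` are hypothetical, the dictionary keys are assumed to be Unicode strings (Python 3), and `cp932` is assumed to be the encoding Excel on Windows expects. A word that cannot be represented in cp932 would raise `UnicodeEncodeError`.

```python
import csv
import io


def ranking_to_sjis_bytes(dic):
    """Return the ranking as CSV bytes encoded in Shift_JIS (cp932)."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator='\n')
    # Highest count first, the same order as write_to_csv.
    for word, count in sorted(dic.items(), key=lambda x: x[1], reverse=True):
        writer.writerow([word, count])
    return buf.getvalue().encode('cp932')


def write_ranking_sjis(dic, csv_file):
    # Write bytes so that no further encoding is applied.
    with open(csv_file, 'wb') as f:
        f.write(ranking_to_sjis_bytes(dic))
```

For example, `write_ranking_sjis({u'猫': 2, u'犬': 5}, 'test_dic_n_sjis.csv')` would produce a file that starts with the row for 犬.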
Open test_dic_n_sjis.csv in Excel and save it as an xls file.
That's all.
http://qiita.com/tstomoki/items/f17c04bd18699a6465be
http://qiita.com/ysk_1031/items/7f0cfb7e9e4c4b9129c9
http://salinger.github.io/blog/2013/01/17/1/ [^1]

[^1]: This site includes the following note: "If you want to handle Unicode strings in MeCab, you need to encode them first. If you write `node = tagger.parseToNode(string.encode("utf-8"))`, the encoded string may be garbage collected during parsing and behave strangely. There is no problem if you assign it to a variable first." This is why count_word stores the encoded string in `s8` before parsing it.