I have an Excel file exported from a database. Each row is one record, and the text is stored in a single field. The goal this time is to extract the frequently used keywords from the text in that field, count how often each keyword appears, and rank them.

The input and output are Windows Excel files; the processing in between is done on a Mac.

What to prepare

I work in my usual environment:

Mac
python
MeCab
nkf

I plan to process the data further with pandas later, so I use UTF-8 and pandas throughout.

Export the xls file to csv

Do this from the Excel menu: test.xls -> test.csv

Convert the character encoding from sjis to utf-8

$ nkf -g test.csv
Shift_JIS
$ nkf -w test.csv > test_utf8.csv
$ nkf -g test_utf8.csv
UTF-8
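
If nkf is not available, the same conversion can be sketched with the Python standard library alone. This is a minimal sketch, not part of the original procedure: it assumes Python 3 (the code in this article targets 2.x), the helper name convert_encoding is made up for illustration, and it assumes the Excel export is really cp932, Python's name for the Windows flavor of Shift_JIS.

```python
def convert_encoding(src_path, dst_path, src_enc="cp932", dst_enc="utf-8"):
    """Re-encode a text file from src_enc to dst_enc (hypothetical helper)."""
    # newline="" keeps the original line endings untouched on both sides.
    with open(src_path, "r", encoding=src_enc, newline="") as src:
        text = src.read()
    with open(dst_path, "w", encoding=dst_enc, newline="") as dst:
        dst.write(text)


# Usage, equivalent to `nkf -w test.csv > test_utf8.csv`:
# convert_encoding("test.csv", "test_utf8.csv")
```

Swapping the two encoding arguments gives the reverse conversion.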

Load csv with python

import pandas as pd 

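csv_file = 'test_utf8.csv'	# the UTF-8 file produced by nkf above
# header=1: the second row of the file holds the column names
df = pd.read_csv(csv_file, encoding='utf-8', header=1)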

Split into words and count nouns and verbs

Install MeCab

$ brew search mecab
$ brew install mecab mecab-ipadic

$ pip search mecab
$ pip install mecab-python

When "... Successfully installed mecab-python-0.996" appears, the installation is complete. MeCab can now be used from python (2.x series).

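# -*- coding: utf-8 -*-
import MeCab

def count_word(df):
	e = df[u'comment']	# column that holds the text
	dic_n = {}	# noun -> count
	dic_v = {}	# verb -> count
	m = MeCab.Tagger('-Ochasen')	# Use the ChaSen-style output format
	
	for s in e:
		# Skip empty cells, which pandas loads as NaN (float)
		if not isinstance(s, unicode):
			continue
		# Keep the encoded string in a variable so that it is not
		# garbage collected while MeCab is parsing it
		s8 = s.encode('utf-8')
		print s8
		node = m.parseToNode(s8)
		while node:
			# The first feature field is the part of speech.
			# With the IPA dictionary the tags are Japanese strings.
			word = node.feature.split(',')[0]
			key = node.surface
			if word == '名詞':	# noun
				dic = dic_n
				print "<", key, "> (n)"
			elif word == '動詞':	# verb
				dic = dic_v
				print "<", key, "> (v)"
			else:
				node = node.next
				continue
			dic[key] = dic.get(key, 0) + 1
			node = node.next
	return dic_n, dic_v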
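
The counting step can be tried without MeCab installed. The following is a minimal sketch, assuming Python 3 rather than the 2.x used in this article; the function name count_by_pos and the sample tokens are made up for illustration and are not real MeCab output. It takes (surface, part of speech) pairs, such as those collected from node.surface and node.feature.split(',')[0], and keeps one collections.Counter per part of speech.

```python
from collections import Counter


def count_by_pos(tokens, wanted=("名詞", "動詞")):
    """Count surface forms separately for each wanted part of speech."""
    counts = {pos: Counter() for pos in wanted}
    for surface, pos in tokens:
        # Tokens with any other part of speech (particles etc.) are ignored.
        if pos in counts:
            counts[pos][surface] += 1
    return counts


# Hand-written sample tokens (hypothetical, not real MeCab output).
sample = [("猫", "名詞"), ("が", "助詞"), ("走る", "動詞"), ("猫", "名詞")]
counts = count_by_pos(sample)
print(counts["名詞"].most_common())  # [('猫', 2)]
```

Counter.most_common() already returns the entries in descending order of count, so the ranking falls out of the same structure.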

Write to csv in descending order of frequency (utf-8)

import csv

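def write_to_csv(dic, csv_file):
	# The keys are already UTF-8 byte strings, so they are written as-is
	with open(csv_file, 'w') as f:
		writer = csv.writer(f, lineterminator='\n')
		# Sort by count, largest first
		for k, v in sorted(dic.items(), key=lambda x: x[1], reverse=True):
			print k, v
			writer.writerow([k, v])

# Count the words in the DataFrame loaded earlier, then write both rankings
dic_n, dic_v = count_word(df)
write_to_csv(dic_n, 'test_dic_n_utf8.csv')
write_to_csv(dic_v, 'test_dic_v_utf8.csv')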

Convert to sjis

$ nkf -g test_dic_n_utf8.csv 
UTF-8
$ nkf -s test_dic_n_utf8.csv > test_dic_n_sjis.csv
$ nkf -g test_dic_n_sjis.csv 
Shift_JIS
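
As an alternative to the nkf step, the ranking could be written as Shift_JIS directly from Python. This is only a sketch under assumptions: it targets Python 3 (the code in this article is for 2.x), the function name write_ranking_sjis is made up for illustration, and it uses cp932, Python's codec for the Windows variant of Shift_JIS that Excel expects.

```python
import csv


def write_ranking_sjis(counts, csv_path):
    """Write (keyword, count) rows, most frequent first, encoded as cp932."""
    rows = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
    # newline="" lets the csv module control the line endings itself.
    # errors="replace" writes "?" for characters cp932 cannot represent.
    with open(csv_path, "w", encoding="cp932", errors="replace", newline="") as f:
        csv.writer(f).writerows(rows)
    return rows


# Usage with a dict (or Counter) of keyword -> count:
# write_ranking_sjis({"猫": 2, "走る": 1}, "test_dic_n_sjis.csv")
```

The trade-off is that characters outside cp932 are silently replaced, whereas keeping a UTF-8 file and converting afterwards leaves the original data intact.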

Convert to xls format

Open test_dic_n_sjis.csv in Excel and save it as xls.

That's all.

Reference site

http://qiita.com/tstomoki/items/f17c04bd18699a6465be
http://qiita.com/ysk_1031/items/7f0cfb7e9e4c4b9129c9
http://salinger.github.io/blog/2013/01/17/1/ [^1]

[^1]: This site includes the following caution: "If you want to handle Unicode strings in MeCab, you need to encode them first. If you write node = tagger.parseToNode(string.encode("utf-8")), the encoded string may be garbage collected during parsing and MeCab may behave strangely. Assigning the encoded string to a variable first avoids the problem."

Keyword extraction by MeCab (python)