The text data is stored in an Excel file. This script splits each text into space-separated words with MeCab and outputs the result in tab-delimited format (a TSV file).
Mac OS 10.12.3
Python 3.6.0
MeCab 0.996
mecab-python3==0.7
Install MeCab with Homebrew (Mac): see "Use MeCab from Python3".
Or build and install it yourself with make: see "Morphological analysis engine MeCab can be used with Python3 (March 2016 version)".
[Python] Read Excel with pandas
mecab.py
#!/usr/bin/env python
import sys

import MeCab
import xlrd

args = sys.argv

# Open the Excel file given as the first command-line argument
book = xlrd.open_workbook(args[1])
sh = book.sheet_by_index(0)

# Print the header row
print("\t".join(('text', 'price')))

# Create a tagger with the word-separation (wakati) output option
t = MeCab.Tagger("-Owakati")

# Process each row, skipping the header row
for rx in range(1, sh.nrows):
    # Pick up the columns you need
    text = sh.cell_value(rowx=rx, colx=1)
    price = sh.cell_value(rowx=rx, colx=2)

    # Delete line breaks inside the cell
    text = text.replace('\n', '').replace('\r', '')

    try:
        # Parse, then remove the line break MeCab appends
        m = t.parse(text).replace('\n', '')
        # Output; str() is needed because xlrd returns numeric cells as float
        print("\t".join((m, str(price))))
    except RuntimeError as e:
        # Report on stderr so the error does not end up in the TSV output
        print("RuntimeError: " + str(e), file=sys.stderr)
$ ./mecab.py [excel file name]
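The output step of the script can also be sketched with the standard csv module instead of joining strings by hand. This is only a minimal sketch under assumptions: the helper names format_price and write_tsv and the sample rows are hypothetical, and the rows stand in for text that MeCab has already split and for price cells read from Excel. xlrd returns numeric cells as float, so the sketch drops the trailing ".0" from whole numbers before writing.

```python
import csv
import io


def format_price(value):
    """Render a cell value as text; whole-number floats lose the trailing '.0'."""
    if isinstance(value, float) and value.is_integer():
        return str(int(value))
    return str(value)


def write_tsv(rows, out):
    """Write (segmented_text, price) pairs as tab-delimited lines with a header."""
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(("text", "price"))
    for text, price in rows:
        writer.writerow((text, format_price(price)))


# Hypothetical rows standing in for MeCab output and Excel price cells
sample_rows = [
    ("すもも も もも も もも の うち", 1200.0),
    ("hello world", 980.5),
]

buffer = io.StringIO()
write_tsv(sample_rows, buffer)
print(buffer.getvalue(), end="")
```

One reason to prefer csv.writer here is that it quotes a field if the text itself happens to contain a tab, which a plain "\t".join would not do.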