Please refer to First Post
9/24 added
Use MeCab to morphologically analyze the text (neko.txt) of Natsume Soseki's novel "I Am a Cat" and save the result to a file called neko.txt.mecab. Use this file to implement programs that address the following problems. For problems 37, 38, and 39, use matplotlib or Gnuplot.
Implement a program that reads the morphological analysis result (neko.txt.mecab). Store each morpheme in a mapping type with keys for surface form (surface), base form (base), part of speech (pos), and part-of-speech subclassification 1 (pos1), and express one sentence as a list of morphemes (mapping types). Use the program created here for the rest of the problems in Chapter 4.
file_analyze_mecab_030.py
from natto import MeCab
import codecs

def file_analyze_mecab(input_filename, output_filename):
    with codecs.open(input_filename, 'r', 'utf-8') as f:
        text = f.read()
    m = MeCab()  # natto-py wrapper around MeCab, using the default (mecabrc) settings
    wt = m.parse(text)
    with codecs.open(output_filename, 'w', 'utf-8') as wf:
        wf.write(wt)

if __name__ == "__main__":
    file_analyze_mecab('neko.txt', 'neko.txt.mecab')
result
One noun,number,*,*,*,*,one,Ichi,Ichi
symbol,Blank,*,*,*,*, , ,
I noun,Pronoun,General,*,*,*,I,Wagamama,Wagamama
Is a particle,Particle,*,*,*,*,Is,C,Wow
(Omitted because it is long)
Impression: This was the first time I had heard of morphological analysis, so I started by researching it from there. For MeCab's parameters, I referred to MeCab: Yet Another Part-of-Speech and Morphological Analyzer. The naming of the modules and so on is wonderful.
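As a side note, the shape of each line that MeCab emits (assuming the standard IPA dictionary output; the sample line below is an illustration, not taken from neko.txt.mecab) can be sketched like this. Once the tab is replaced by a comma, the base form lands at index 7, which is what the parsing script below relies on:

```python
# One line of MeCab output (IPA dictionary) looks like:
#   surface\tpos,pos1,pos2,pos3,conjugation type,conjugation form,base,reading,pronunciation
line = "吾輩\t名詞,代名詞,一般,*,*,*,吾輩,ワガハイ,ワガハイ"

# Replacing the tab with a comma flattens the line into 10 fields,
# so the base form sits at index 7.
fields = line.replace('\t', ',').split(',')
print(fields[0])  # surface form: 吾輩
print(fields[1])  # part of speech: 名詞 (noun)
print(fields[7])  # base form: 吾輩
```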
mecab_030.py
# -*- coding: utf-8 -*-
import codecs

if __name__ == "__main__":
    with codecs.open("neko.txt.mecab", 'r', 'utf-8') as f:
        data = f.readlines()
    mecab_list = []
    for temp_word in data:
        temp_word = temp_word.replace('\t', ',')
        temp_word = temp_word.replace('\n', '')
        # Morpheme lines have 9 commas (known words) or 7 (unknown words);
        # EOS and blank lines have neither and are skipped
        if temp_word.count(',') == 9 or temp_word.count(',') == 7:
            temp_list = temp_word.split(',')
            temp_dict = {'surface': temp_list[0], 'base': temp_list[7],
                         'pos': temp_list[1], 'pos1': temp_list[2]}
            mecab_list.append(temp_dict)
    print(mecab_list)
    with codecs.open('neko.txt.mecab.analyze', 'w', 'utf-8') as wf:
        for line in mecab_list:
            wf.write(str(line) + '\n')
result
{'surface': 'object', 'base': 'object', 'pos': 'noun', 'pos1': 'General'}
(Omitted because it is long)
Impressions: For easy viewing, the output file stores each morpheme as a plain string on its own line instead of as a list. It took me a very long time to notice that the number of ',' in a morphological-analysis result line is either 7 or 9.
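The 7-or-9 observation can be checked directly: under the IPA dictionary, a word found in the dictionary carries 9 feature fields, a word unknown to the dictionary carries only 7 (no reading or pronunciation), and EOS lines carry none. A minimal sketch (the sample lines are assumptions following that layout):

```python
known = "猫\t名詞,一般,*,*,*,*,猫,ネコ,ネコ"    # word in the dictionary: 9 commas after flattening
unknown = "ニャーニャー\t名詞,一般,*,*,*,*,*"    # unknown word: no reading/pronunciation, 7 commas
eos = "EOS"                                      # sentence boundary: no commas at all

for line in (known, unknown, eos):
    flattened = line.replace('\t', ',')
    print(flattened.count(','))  # 9, 7, 0 -> only lines with 9 or 7 are kept as morphemes
```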
Extract all the surface forms of verbs.
vurb_031.py
# -*- coding: utf-8 -*-
import codecs
import re
import ast

if __name__ == "__main__":
    with codecs.open("neko.txt.mecab.analyze", 'r', 'utf-8') as f:
        temp_lines = f.readlines()
    pattern = re.compile(r".*動詞.*")  # 動詞 = verb
    for temp_line in temp_lines:
        if pattern.match(temp_line):
            data = ast.literal_eval(temp_line)
            print(data['surface'])
result
so
is there
Born
Ta
Tsuka
(Omitted because it is long)
Impression: The program reads the analysis-result file, extracts the lines containing the verb keyword with a regular expression, converts each to a dictionary, and outputs only the surface form.
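Because each line of neko.txt.mecab.analyze is the str() of a dictionary, ast.literal_eval can turn it back into a dict safely (unlike eval, it only accepts Python literals). A small sketch with a made-up line in the same shape:

```python
import ast

# A hypothetical line as written by the 030 script: str(dict) plus a newline
line = "{'surface': '見る', 'base': '見る', 'pos': '動詞', 'pos1': '自立'}\n"

data = ast.literal_eval(line)  # surrounding whitespace is tolerated
print(data['surface'])  # 見る
print(data['pos'])      # 動詞
```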
Extract all the base forms of verbs.
base_vurb_032.py
# -*- coding: utf-8 -*-
import codecs
import re
import ast

if __name__ == "__main__":
    with codecs.open("neko.txt.mecab.analyze", 'r', 'utf-8') as f:
        temp_lines = f.readlines()
    pattern = re.compile(r".*動詞.*")  # 動詞 = verb
    for temp_line in temp_lines:
        if pattern.match(temp_line):
            data = ast.literal_eval(temp_line)
            print(data['base'])
result
Is
is there
Born
Ta
Tsukuri
Impressions: The procedure is the same as in 031; only the output is changed to base.
Extract all nouns of sahen (s-irregular) connection (サ変接続).
sahen_noun_033.py
# -*- coding: utf-8 -*-
import codecs
import re
import ast

if __name__ == "__main__":
    with codecs.open("neko.txt.mecab.analyze", 'r', 'utf-8') as f:
        temp_lines = f.readlines()
    pattern = re.compile(r".*サ変接続.*")  # サ変接続 = sahen (s-irregular) connection
    for temp_line in temp_lines:
        if pattern.match(temp_line):
            data = ast.literal_eval(temp_line)
            print(data['surface'])
result
Register
Memory
Talk
Decoration
Protrusion
(Omitted because it is long)
Impressions: The procedure is the same as in 032; I just changed the extraction condition to サ変接続 (sahen connection).
Extract noun phrases in which two nouns are connected by the particle "no" (の).
no_noun_034.py
# -*- coding: utf-8 -*-
import codecs
import ast

if __name__ == "__main__":
    with codecs.open('neko.txt.mecab.analyze', 'r', 'utf-8') as f:
        temp_lines = f.readlines()
    flag = 0  # 0: looking for a noun, 1: noun seen, 2: noun + 'の' seen
    temp_word = ''
    temp_list = []
    for temp_line in temp_lines:
        temp_dict = ast.literal_eval(temp_line)
        if temp_dict['pos'] == '名詞' and flag == 0:  # first noun (名詞 = noun)
            temp_word = temp_dict['surface']
            flag = 1
        elif temp_dict['surface'] == 'の' and temp_dict['pos'] == '助詞' and flag == 1:  # particle 'の' (助詞 = particle)
            temp_word += temp_dict['surface']
            flag = 2
        elif temp_dict['pos'] == '名詞' and flag == 2:  # second noun: phrase complete
            temp_word += temp_dict['surface']
            temp_list.append(temp_word)
            temp_word = ''
            flag = 0
        else:
            temp_word = ''
            flag = 0
    no_noun_list = set(temp_list)
    for temp in no_noun_list:
        print(temp)
result
Child of
My year
Boredom too much
Left corner
Opponent's ability
On the forehead
For those
(Omitted because it is long)
Impression: At first I did an n-gram analysis of the pos information with N = 3 to list the index numbers where noun-particle-noun sequences appear, then extracted the surface strings at those indexes and kept the ones that did not begin or end with 'no'; but that did not produce correct results, so I corrected it to the current code. You have to read the problem statement properly. A lesson learned.
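The windowed idea mentioned above can also be made to work: slide a window of three consecutive morphemes and keep those matching noun + 'の' + noun. A sketch on hand-made data (the sample morphemes are assumptions, not taken from neko.txt):

```python
# Hypothetical morpheme stream, in the same dict shape as the analyze file
morphemes = [
    {'surface': '彼', 'pos': '名詞'},
    {'surface': 'の', 'pos': '助詞'},
    {'surface': '掌', 'pos': '名詞'},
    {'surface': 'で', 'pos': '助詞'},
]

# Sliding window of three consecutive morphemes: noun + particle 'の' + noun
phrases = {a['surface'] + b['surface'] + c['surface']
           for a, b, c in zip(morphemes, morphemes[1:], morphemes[2:])
           if a['pos'] == '名詞'
           and b['surface'] == 'の' and b['pos'] == '助詞'
           and c['pos'] == '名詞'}
print(phrases)  # {'彼の掌'}
```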