Please refer to First Post
9/24 added
Use MeCab to morphologically analyze the text (neko.txt) of Natsume Soseki's novel "I Am a Cat" and save the result to a file called neko.txt.mecab. Use this file to implement programs that address the following problems. For problems 37, 38, and 39, use matplotlib or Gnuplot.
Implement a program that reads the morphological analysis result (neko.txt.mecab). Store each morpheme in a mapping type with keys for surface form (surface), base form (base), part of speech (pos), and part-of-speech subclassification 1 (pos1), and express one sentence as a list of morphemes (mapping types). Use the program created here for the rest of the problems in Chapter 4.
file_analyze_mecab_030.py
from natto import MeCab
import codecs

def file_analyze_mecab(input_filename, output_filename):
    with codecs.open(input_filename, 'r', 'utf-8') as f:
        text = f.read()
    m = MeCab()  # natto-py wrapper around MeCab, using the default (mecabrc) settings
    wt = m.parse(text)
    with codecs.open(output_filename, 'w', 'utf-8') as wf:
        wf.write(wt)

if __name__ == "__main__":
    file_analyze_mecab('neko.txt', 'neko.txt.mecab')
result
One noun,number,*,*,*,*,one,Ichi,Ichi
symbol,Blank,*,*,*,*, , ,
I noun,Pronoun,General,*,*,*,I,Wagamama,Wagamama
Is a particle,Particle,*,*,*,*,Is,C,Wow
(Omitted because it is long)
Impression: This was the first time I had heard of morphological analysis, so I started by researching it from there. For MeCab's parameters, I referred to MeCab: Yet Another Part-of-Speech and Morphological Analyzer. The naming of the modules and so on is wonderful.
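As a side note, the shape of each line that MeCab emits (assuming the standard IPA dictionary output; the sample line below is an illustration, not taken from neko.txt.mecab) can be sketched like this. Once the tab is replaced by a comma, the base form lands at index 7, which is what the parsing script below relies on:

```python
# One line of MeCab output (IPA dictionary) looks like:
#   surface\tpos,pos1,pos2,pos3,conjugation type,conjugation form,base,reading,pronunciation
line = "吾輩\t名詞,代名詞,一般,*,*,*,吾輩,ワガハイ,ワガハイ"

# Replacing the tab with a comma flattens the line into 10 fields,
# so the base form sits at index 7.
fields = line.replace('\t', ',').split(',')
print(fields[0])  # surface form: 吾輩
print(fields[1])  # part of speech: 名詞 (noun)
print(fields[7])  # base form: 吾輩
```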
mecab_030.py
# -*- coding: utf-8 -*-
import codecs

if __name__ == "__main__":
    with codecs.open("neko.txt.mecab", 'r', 'utf-8') as f:
        data = f.readlines()
    mecab_list = []
    for temp_word in data:
        temp_word = temp_word.replace('\t', ',')
        temp_word = temp_word.replace('\n', '')
        # Morpheme lines have 9 commas (known words) or 7 (unknown words);
        # EOS and blank lines have neither and are skipped
        if temp_word.count(',') == 9 or temp_word.count(',') == 7:
            temp_list = temp_word.split(',')
            temp_dict = {'surface': temp_list[0], 'base': temp_list[7],
                         'pos': temp_list[1], 'pos1': temp_list[2]}
            mecab_list.append(temp_dict)
    print(mecab_list)
    with codecs.open('neko.txt.mecab.analyze', 'w', 'utf-8') as wf:
        for line in mecab_list:
            wf.write(str(line) + '\n')
result
{'surface': 'object', 'base': 'object', 'pos': 'noun', 'pos1': 'General'}
(Omitted because it is long)
Impressions: For easy viewing, the output file stores each morpheme as a plain string on its own line instead of as a list. It took me a very long time to notice that the number of ',' in a morphological-analysis result line is either 7 or 9.
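The 7-or-9 observation can be checked directly: under the IPA dictionary, a word found in the dictionary carries 9 feature fields, a word unknown to the dictionary carries only 7 (no reading or pronunciation), and EOS lines carry none. A minimal sketch (the sample lines are assumptions following that layout):

```python
known = "猫\t名詞,一般,*,*,*,*,猫,ネコ,ネコ"    # word in the dictionary: 9 commas after flattening
unknown = "ニャーニャー\t名詞,一般,*,*,*,*,*"    # unknown word: no reading/pronunciation, 7 commas
eos = "EOS"                                      # sentence boundary: no commas at all

for line in (known, unknown, eos):
    flattened = line.replace('\t', ',')
    print(flattened.count(','))  # 9, 7, 0 -> only lines with 9 or 7 are kept as morphemes
```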
Extract all the surface forms of verbs.
vurb_031.py
# -*- coding: utf-8 -*-
import codecs
import re
import ast

if __name__ == "__main__":
    with codecs.open("neko.txt.mecab.analyze", 'r', 'utf-8') as f:
        temp_lines = f.readlines()
    pattern = re.compile(r".*動詞.*")  # 動詞 = verb
    for temp_line in temp_lines:
        if pattern.match(temp_line):
            data = ast.literal_eval(temp_line)
            print(data['surface'])
result
so
is there
Born
Ta
Tsuka
(Omitted because it is long)
Impression: The program reads the analysis-result file, extracts the lines containing the verb keyword with a regular expression, converts each to a dictionary, and outputs only the surface form.
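Because each line of neko.txt.mecab.analyze is the str() of a dictionary, ast.literal_eval can turn it back into a dict safely (unlike eval, it only accepts Python literals). A small sketch with a made-up line in the same shape:

```python
import ast

# A hypothetical line as written by the 030 script: str(dict) plus a newline
line = "{'surface': '見る', 'base': '見る', 'pos': '動詞', 'pos1': '自立'}\n"

data = ast.literal_eval(line)  # surrounding whitespace is tolerated
print(data['surface'])  # 見る
print(data['pos'])      # 動詞
```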
Extract all the base forms of verbs.
base_vurb_032.py
# -*- coding: utf-8 -*-
import codecs
import re
import ast

if __name__ == "__main__":
    with codecs.open("neko.txt.mecab.analyze", 'r', 'utf-8') as f:
        temp_lines = f.readlines()
    pattern = re.compile(r".*動詞.*")  # 動詞 = verb
    for temp_line in temp_lines:
        if pattern.match(temp_line):
            data = ast.literal_eval(temp_line)
            print(data['base'])
result
Is
is there
Born
Ta
Tsukuri
Impressions: The procedure is the same as in 031; only the output is changed to base.
Extract all nouns of sahen (s-irregular) connection (サ変接続).
sahen_noun_033.py
# -*- coding: utf-8 -*-
import codecs
import re
import ast

if __name__ == "__main__":
    with codecs.open("neko.txt.mecab.analyze", 'r', 'utf-8') as f:
        temp_lines = f.readlines()
    pattern = re.compile(r".*サ変接続.*")  # サ変接続 = sahen (s-irregular) connection
    for temp_line in temp_lines:
        if pattern.match(temp_line):
            data = ast.literal_eval(temp_line)
            print(data['surface'])
result
Register
Memory
Talk
Decoration
Protrusion
(Omitted because it is long)
Impressions: The procedure is the same as in 032; I just changed the extraction condition to サ変接続 (sahen connection).
Extract noun phrases in which two nouns are connected by the particle "no" (の).
no_noun_034.py
# -*- coding: utf-8 -*-
import codecs
import ast

if __name__ == "__main__":
    with codecs.open('neko.txt.mecab.analyze', 'r', 'utf-8') as f:
        temp_lines = f.readlines()
    flag = 0  # 0: looking for a noun, 1: noun seen, 2: noun + 'の' seen
    temp_word = ''
    temp_list = []
    for temp_line in temp_lines:
        temp_dict = ast.literal_eval(temp_line)
        if temp_dict['pos'] == '名詞' and flag == 0:  # first noun (名詞 = noun)
            temp_word = temp_dict['surface']
            flag = 1
        elif temp_dict['surface'] == 'の' and temp_dict['pos'] == '助詞' and flag == 1:  # particle 'の' (助詞 = particle)
            temp_word += temp_dict['surface']
            flag = 2
        elif temp_dict['pos'] == '名詞' and flag == 2:  # second noun: phrase complete
            temp_word += temp_dict['surface']
            temp_list.append(temp_word)
            temp_word = ''
            flag = 0
        else:
            temp_word = ''
            flag = 0
    no_noun_list = set(temp_list)
    for temp in no_noun_list:
        print(temp)
result
Child of
My year
Boredom too much
Left corner
Opponent's ability
On the forehead
For those
(Omitted because it is long)
Impression: At first I did an n-gram analysis of the pos information with N = 3 to list the index numbers where noun-particle-noun sequences appear, then extracted the surface strings at those indexes and kept the ones that did not begin or end with 'no'; but that did not produce correct results, so I corrected it to the current code. You have to read the problem statement properly. A lesson learned.
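The windowed idea mentioned above can also be made to work: slide a window of three consecutive morphemes and keep those matching noun + 'の' + noun. A sketch on hand-made data (the sample morphemes are assumptions, not taken from neko.txt):

```python
# Hypothetical morpheme stream, in the same dict shape as the analyze file
morphemes = [
    {'surface': '彼', 'pos': '名詞'},
    {'surface': 'の', 'pos': '助詞'},
    {'surface': '掌', 'pos': '名詞'},
    {'surface': 'で', 'pos': '助詞'},
]

# Sliding window of three consecutive morphemes: noun + particle 'の' + noun
phrases = {a['surface'] + b['surface'] + c['surface']
           for a, b, c in zip(morphemes, morphemes[1:], morphemes[2:])
           if a['pos'] == '名詞'
           and b['surface'] == 'の' and b['pos'] == '助詞'
           and c['pos'] == '名詞'}
print(phrases)  # {'彼の掌'}
```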