Language processing 100 knocks-40: Reading dependency analysis results (morpheme)

This is a record of problem 40, "Reading the dependency analysis result (morpheme)", from ["Chapter 5: Dependency analysis"](http://www.cl.ecei.tohoku.ac.jp/nlp100/#ch5) of Language processing 100 knocks 2015. Chapter 5 as a whole is tedious because building the algorithms takes time, and it feels like the first real hurdle of the 100 knocks. This problem is only a warm-up and is not very difficult. The one new element is that a class is used for the first time in the 100 knocks.

Reference link

| Link | Remarks |
|---|---|
| 040.Reading the dependency analysis result (morpheme).ipynb | Answer program GitHub link |
| 100 amateur language processing knocks:40 | Source from which many parts were copied |
| CaboCha official | CaboCha page to look at first |

Environment

I installed CRF++ and CaboCha so long ago that I have forgotten how I installed them. Since neither package is being updated any more, I have not rebuilt the environment. All I remember is being frustrated when I tried to use CaboCha on Windows. I think I could not get it to work on 64-bit Windows (my memory is vague, and the problem may have been my own lack of skill).

| Type | Version | Contents |
|---|---|---|
| OS | Ubuntu18.04.01 LTS | Running on a virtual machine |
| pyenv | 1.2.16 | I use pyenv because I sometimes use multiple Python environments |
| Python | 3.8.1 | Python 3.8.1 on pyenv; packages are managed using venv |
| Mecab | 0.996-5 | Installed with apt-get |
| CRF++ | 0.58 | Too old, and I forgot how I installed it (probably make install) |
| CaboCha | 0.69 | Too old, and I forgot how I installed it (probably make install) |

Chapter 5: Dependency analysis

Content of study

Apply the dependency analyzer CaboCha to "I am a cat" and get hands-on experience with dependency trees and syntactic analysis.

Class, Dependency Parsing, CaboCha, Clause, Dependency, Case, Functional Verb Syntax, Dependency Path, [Graphviz](http://www.graphviz.org/)

Knock content

Use CaboCha to run dependency analysis on the text (neko.txt) of Natsume Soseki's novel "I am a cat" and save the result in a file called neko.txt.cabocha. Use this file to implement programs that answer the following questions.

40. Reading the dependency analysis result (morpheme)

Implement the class Morph that represents a morpheme. This class has the surface form (surface), base form (base), part of speech (pos), and part of speech subclassification 1 (pos1) as member variables. In addition, read the CaboCha analysis result (neko.txt.cabocha), express each sentence as a list of Morph objects, and display the morpheme sequence of the third sentence.

Problem supplement (about "dependency")

"Dependency" is a relationship between clauses: each clause modifies (depends on) another clause. I touched on this in the previous article "[Play] Syntactic analysis of Shinkalion's ton demo mail". Even for a document like this one,

[image: input document]

the relationships between clauses can be made clear.

[image: dependency analysis result]

Answer

Answer premise

First, perform dependency analysis with CaboCha.

cabocha -f1 ../04.Morphological analysis/neko.txt -o neko.txt.cabocha

The execution result is as follows. Dependency information is added to the MeCab output. A line such as `* 0 -1D 0/0 0.000000` carries the dependency information: the first number (0) is the clause number, and the next number (-1, followed by D) is the number of the clause it depends on. A value of -1 means the clause depends on nothing, so the first line here is a poor example.

Partial excerpt from neko.txt.cabocha


* 0 -1D 0/0 0.000000
One noun,number,*,*,*,*,one,Ichi,Ichi
EOS
EOS
* 0 -1D 1/1 0.000000
symbol,Blank,*,*,*,*, , , 
I am a cat noun,Proper noun,General,*,*,*,I am a cat,Wagamama High Spec,Wagamama High Spec
.. symbol,Kuten,*,*,*,*,。,。,。
EOS
* 0 2D 0/1 -1.911675
Name noun,General,*,*,*,*,name,Namae,Namae
Is a particle,Particle,*,*,*,*,Is,C,Wow
* 1 2D 0/0 -1.911675
Still adverb,Particle connection,*,*,*,*,yet,Mada,Mada
* 2 -1D 0/0 0.000000
No adjective,Independence,*,*,Adjective, Auoudan,Uninflected word,No,Nai,Nai
.. symbol,Kuten,*,*,*,*,。,。,。
EOS
EOS
* 0 1D 1/2 1.504358
symbol,Blank,*,*,*,*, , , 
Where noun,Pronoun,General,*,*,*,Where,Doco,Doco
Particles,Case particles,General,*,*,*,so,De,De
* 1 2D 0/1 1.076607
Born verb,Independence,*,*,One step,Continuous form,Born,Umale,Umale
Auxiliary verb,*,*,*,Special,Uninflected word,Ta,Ta,Ta
* 2 4D 0/1 -0.197109
Katon noun,General,*,*,*,*,Fire,Katong,Katong
And particles,Case particles,General,*,*,*,When,To,To
* 3 4D 0/1 -0.197109
Register noun,Change connection,*,*,*,*,Register,Kentou,Kento
Is a particle,Case particles,General,*,*,*,But,Moth,Moth
* 4 -1D 0/1 0.000000
Tsuka verb,Independence,*,*,Five-dan / Ka line,Imperfective form,Tsukuri,Tsuka,Tsuka
Nu auxiliary verb,*,*,*,Special,Uninflected word,Nu,Nu,Nu
.. symbol,Kuten,*,*,*,*,。,。,。
EOS
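
To make the header lines in the excerpt above more concrete, here is a minimal sketch that pulls the clause number and the dependency target out of a line such as `* 0 2D 0/1 -1.911675`. The names `HEADER` and `parse_chunk_header` are my own for illustration and are not part of the answer program; the remaining two fields are not explained in this article, so the sketch matches them without interpreting them.

```python
import re

# Header line of the form "* <clause no> <target no>D <field> <field>".
# Only the first two numbers are interpreted here; the last two fields
# are matched as opaque strings.
HEADER = re.compile(r'^\* (\d+) (-?\d+)D (\S+) (\S+)$')


def parse_chunk_header(line):
    """Return (clause_number, target_number), or None for other lines."""
    m = HEADER.match(line.rstrip('\n'))
    if m is None:
        return None
    return int(m.group(1)), int(m.group(2))


for sample in ['* 0 2D 0/1 -1.911675', '* 2 -1D 0/0 0.000000', 'EOS']:
    print(sample, '->', parse_chunk_header(sample))
```

In the third sentence of the excerpt, clauses 0 and 1 both point at clause 2, and clause 2 has -1 because it is the root of the sentence.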

Answer program [040. Reading dependency analysis results (morpheme).ipynb](https://github.com/YoheiFukuhara/nlp100/blob/master/05.%E4%BF%82%E3%82%8A%E5%8F%97%E3%81%91%E8%A7%A3%E6%9E%90/040.%E4%BF%82%E3%82%8A%E5%8F%97%E3%81%91%E8%A7%A3%E6%9E%90%E7%B5%90%E6%9E%9C%E3%81%AE%E8%AA%AD%E3%81%BF%E8%BE%BC%E3%81%BF(%E5%BD%A2%E6%85%8B%E7%B4%A0).ipynb)

So, this is the main Python program.

import re

morphs = []
sentences = []

#Delimiter: tab or comma
separator = re.compile('\t|,')

#Excluded lines
exclude = re.compile(r'''EOS\n      # EOS followed by a line feed
                         |          # OR
                         \*\s\d+\s  # "*", whitespace, one or more digits, whitespace
                       ''', re.VERBOSE)

class Morph:
    def __init__(self, line):
        
        #Split on tabs and commas
        cols = separator.split(line)
        
        self.surface = cols[0] #Surface form(surface)
        self.base = cols[7]    #Base form(base)
        self.pos = cols[1]     #Part of speech(pos)
        self.pos1 = cols[2]    #Part of speech subclassification 1(pos1)

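with open('./neko.txt.cabocha') as f:
    
    for line in f:
        #Morpheme lines are everything except EOS and dependency header lines
        if not exclude.match(line):
            morphs.append(Morph(line))
        #EOS closes a sentence; empty sentences (consecutive EOS) are skipped
        if line == 'EOS\n' and len(morphs) > 0:
            sentences.append(morphs)
            morphs = []

#Display the morphemes of the third sentence
for morph in sentences[2]:
    print(morph.__dict__)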

Answer commentary

Regular expressions

I use the regular expressions I learned in Chapter 2 here for practice. `separator` is the delimiter for the morphological analysis result, and `exclude` is a regular expression that excludes EOS lines and dependency header lines. For more information on regular expressions, see the article "Basics and Tips for Python Regular Expressions Learned from Zero".

#Delimiter: tab or comma
separator = re.compile('\t|,')

#Excluded lines
exclude = re.compile(r'''EOS\n      # EOS followed by a line feed
                         |          # OR
                         \*\s\d+\s  # "*", whitespace, one or more digits, whitespace
                       ''', re.VERBOSE)
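
As a quick check of what these two patterns do, the sketch below runs them against three sample lines. The definitions are repeated so the block runs on its own, and the sample lines are modeled on the excerpt shown earlier with a tab between the surface form and the features, which is an assumption about the untranslated file.

```python
import re

separator = re.compile('\t|,')
exclude = re.compile(r'''EOS\n      # EOS followed by a line feed
                         |          # OR
                         \*\s\d+\s  # "*", whitespace, digits, whitespace
                       ''', re.VERBOSE)

# Sample lines modeled on the excerpt (hypothetical, not read from the file)
samples = [
    'EOS\n',
    '* 0 2D 0/1 -1.911675\n',
    'name\tnoun,General,*,*,*,*,name,Namae,Namae\n',
]

# True means the line is skipped and no Morph is created for it
skipped = [bool(exclude.match(s)) for s in samples]
print(skipped)  # [True, True, False]

# A morpheme line splits into the surface form plus nine feature columns
cols = separator.split(samples[2])
print(len(cols), cols[0], cols[7])  # 10 name name
```

Because `match` anchors at the start of the line, a morpheme line that merely contains the text EOS somewhere in the middle would not be excluded.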

Class

This is the first class to appear in the 100 knocks. `__init__` is the initializer that runs when an instance is created. It receives a whole line of the morphological analysis result, splits it on tabs and commas, and stores the fields in instance variables.

class Morph:
    def __init__(self, line):

        #Split on tabs and commas
        cols = separator.split(line)

        self.surface = cols[0] #Surface form(surface)
        self.base = cols[7]    #Base form(base)
        self.pos = cols[1]     #Part of speech(pos)
        self.pos1 = cols[2]    #Part of speech subclassification 1(pos1)
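
One thing to be aware of with splitting on tabs and commas in a single pass is that a surface form that is itself a comma would shift every column. The following is a sketch of a variant, not the answer program, that splits on the tab first and only then on commas. Both sample lines are hypothetical and modeled on the excerpt above; I have not checked how CaboCha actually outputs a comma token.

```python
class Morph:
    """Variant of the Morph class above that splits on the tab first."""

    def __init__(self, line):
        # Everything before the first tab is the surface form,
        # everything after it is the comma-separated feature list.
        surface, features = line.rstrip('\n').split('\t', 1)
        cols = features.split(',')

        self.surface = surface
        self.base = cols[6]   # base form
        self.pos = cols[0]    # part of speech
        self.pos1 = cols[1]   # part of speech subclassification 1


# Hypothetical sample lines
word = Morph('name\tnoun,General,*,*,*,*,name,Namae,Namae\n')
comma = Morph(',\tsymbol,Comma,*,*,*,*,*\n')

print(word.surface, word.base, word.pos, word.pos1)
print(repr(comma.surface), comma.pos)
```

The column indexes are one lower than in the answer program because the surface form is no longer part of `cols`.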

Output of instance variables

Reading `__dict__` returns the instance variables as a dictionary. I didn't know about it, but it's convenient.

for morph in sentences[2]:
    print(morph.__dict__)
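
As a small aside, the built-in `vars()` returns the same dictionary as `__dict__`, and defining `__repr__` lets you print the object directly. This is only a sketch: the Morph class here takes the four values as arguments so the example stays short, which differs from the answer program.

```python
class Morph:
    # Simplified constructor for illustration only
    def __init__(self, surface, base, pos, pos1):
        self.surface = surface
        self.base = base
        self.pos = pos
        self.pos1 = pos1

    def __repr__(self):
        # Reuse the instance dictionary so new attributes show up automatically
        return f'Morph({vars(self)})'


m = Morph('name', 'name', 'noun', 'General')
print(vars(m) == m.__dict__)  # True
print(m)
```

With `__repr__` defined, printing a whole sentence (a list of Morph objects) also gives readable output.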

Output result (execution result)

When the program is executed, the following results will be output.


{'surface': 'name', 'base': 'name', 'pos': 'noun', 'pos1': 'General'}
{'surface': 'Is', 'base': 'Is', 'pos': 'Particle', 'pos1': 'Binding particle'}
{'surface': 'yet', 'base': 'yet', 'pos': 'adverb', 'pos1': 'Particle connection'}
{'surface': 'No', 'base': 'No', 'pos': 'adjective', 'pos1': 'Independence'}
{'surface': '。', 'base': '。', 'pos': 'symbol', 'pos1': 'Kuten'}
