This is my record of problem 45, "Extraction of verb case patterns," from ["Chapter 5: Dependency analysis"](http://www.cl.ecei.tohoku.ac.jp/nlp100/#ch5) of Language Processing 100 Knocks 2015. The number of `if` branches keeps growing, and the program is becoming more and more complicated. Working out the algorithm is getting a little tedious.
Link | Remarks |
---|---|
045. Extraction of verb case patterns.ipynb | GitHub link to the answer program |
100 amateur language processing knocks: 45 | The source from which I copied and pasted many parts |
CaboCha official | The CaboCha page to look at first |
I installed CRF++ and CaboCha so long ago that I have forgotten how I installed them. Since neither package has been updated since then, I have not rebuilt the environment. I only remember being frustrated when I tried to use CaboCha on Windows: I believe I could not get it to work on 64-bit Windows (my memory is vague, and it may have been a problem on my side).
type | version | Contents |
---|---|---|
OS | Ubuntu 18.04.01 LTS | Running in a virtual machine |
pyenv | 1.2.16 | I use pyenv because I sometimes use multiple Python environments |
Python | 3.8.1 | Python 3.8.1 on pyenv; packages are managed with venv |
MeCab | 0.996-5 | Installed with apt-get |
CRF++ | 0.58 | Installed so long ago that I forgot how (probably `make install`) |
CaboCha | 0.69 | Installed so long ago that I forgot how (probably `make install`) |
Apply the dependency parser CaboCha to "I Am a Cat" and get hands-on experience with dependency trees and syntactic analysis.
Class, Dependency Parsing, CaboCha, Clause, Dependency, Case, Functional Verb Syntax, Dependency Path, [Graphviz](http://www.graphviz.org/)
Parse the dependencies of the text (neko.txt) of Natsume Soseki's novel "I Am a Cat" with CaboCha and save the result in a file called neko.txt.cabocha. Use this file to implement programs that address the following questions.
Treating the text used here as a corpus, we want to investigate the cases that Japanese predicates can take. Think of a verb as a predicate and a particle of a phrase that depends on the verb as a case, and output the predicate and its cases in tab-delimited format. Make sure the output satisfies the following specifications.
- In a phrase containing verbs, use the base form of the leftmost verb as the predicate.
- A case is a particle that depends on the predicate.
- If multiple particles (phrases) depend on the predicate, list all of the particles in lexicographic order, separated by spaces.
Consider the example sentence 「吾輩はここで始めて人間というものを見た」 ("Here I saw a human being for the first time"), the 8th sentence of neko.txt.cabocha. This sentence contains two verbs, 始める (begin) and 見る (see). If the phrase depending on 始める is analyzed as ここで, and the phrases depending on 見る are analyzed as 吾輩は and ものを, the program should produce the following output.
始める	で
見る	は を
Save the output of this program to a file and check the following items using UNIX commands.
- Combinations of predicates and case patterns that appear frequently in the corpus
- The case patterns of the verbs する (do), 見る (see), and 与える (give) (sorted by frequency of appearance in the corpus)
I do not pay much attention to this beyond completing the program, but the Japanese notion of "case" seems to run deep. If you are curious, take a look at the Wikipedia article on "case"; I only skimmed it. I remember being asked, back when I was doing a language exchange in Australia, what the difference between は (wa) and が (ga) was.
import re

# Delimiters: tab or comma
separator = re.compile('\t|,')

# Dependency: matches a chunk header line and captures the destination index
dependancy = re.compile(r'''(?:\*\s\d+\s)  # not captured
                            (-?\d+)        # number (destination phrase index)
                          ''', re.VERBOSE)
class Morph:
    def __init__(self, line):
        # Split on tabs and commas
        cols = separator.split(line)

        self.surface = cols[0]  # surface form
        self.base = cols[7]     # base form
        self.pos = cols[1]      # part of speech
        self.pos1 = cols[2]     # part-of-speech subcategory 1
class Chunk:
    def __init__(self, morphs, dst):
        self.morphs = morphs
        self.srcs = []   # list of source phrase index numbers
        self.dst = dst   # destination phrase index number

        self.verb = ''
        self.joshi = ''

        for morph in morphs:
            if morph.pos != '記号':
                self.joshi = ''  # reset for non-symbol morphemes so that only a particle at the end of the phrase (ignoring symbols) is kept
            if morph.pos == '動詞':  # verb
                self.verb = morph.base
            if morph.pos == '助詞':  # particle
                self.joshi = morph.base
# Fill in the source indexes and add the Chunk list to the sentence list
def append_sentence(chunks, sentences):

    # Fill in the source phrase indexes
    for i, chunk in enumerate(chunks):
        if chunk.dst != -1:
            chunks[chunk.dst].srcs.append(i)
    sentences.append(chunks)
    return sentences, []
morphs = []
chunks = []
sentences = []

with open('./neko.txt.cabocha') as f:

    for line in f:
        dependancies = dependancy.match(line)

        # Neither EOS nor a dependency analysis result: a morpheme line
        if not (line == 'EOS\n' or dependancies):
            morphs.append(Morph(line))

        # EOS or a dependency analysis result, and morphemes have been accumulated
        elif len(morphs) > 0:
            chunks.append(Chunk(morphs, dst))
            morphs = []

        # A dependency analysis result
        if dependancies:
            dst = int(dependancies.group(1))

        # EOS with accumulated chunks: the sentence is complete
        if line == 'EOS\n' and len(chunks) > 0:
            sentences, chunks = append_sentence(chunks, sentences)
with open('./045.result_python.txt', 'w') as out_file:
    for sentence in sentences:
        for chunk in sentence:
            if chunk.verb != '' and len(chunk.srcs) > 0:

                # Build the list of particles from the source phrases
                sources = [sentence[source].joshi for source in chunk.srcs if sentence[source].joshi != '']

                if len(sources) > 0:
                    sources.sort()
                    out_file.write('{}\t{}\n'.format(chunk.verb, ' '.join(sources)))
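As a side note, the two regular expressions can be checked in isolation. The following minimal sketch is not part of the answer program; the sample lines are only illustrative of CaboCha's lattice-format chunk headers and MeCab's morpheme output.

python
import re

separator = re.compile('\t|,')
dependancy = re.compile(r'(?:\*\s\d+\s)(-?\d+)')

# Illustrative chunk header in CaboCha's lattice format:
# "* <phrase index> <destination index>D <head>/<func> <score>"
header = '* 0 2D 0/1 -1.911675\n'
print(dependancy.match(header).group(1))  # -> '2' (index of the destination phrase)

# Illustrative morpheme line in MeCab (IPAdic) format
morpheme = '吾輩\t名詞,代名詞,一般,*,*,*,吾輩,ワガハイ,ワガハイ\n'
cols = separator.split(morpheme)
print(cols[0], cols[1], cols[7])  # -> surface form, part of speech, base form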
The following is the UNIX command part. It was my first time using the grep command, and it is convenient.
UNIX command section
# Sort, deduplicate with counts, then sort in descending order
sort 045.result_python.txt | uniq --count | sort --numeric-sort --reverse > "045.result_1_all.txt"

# Extract lines starting with "する" (do) followed by whitespace, sort, deduplicate with counts, sort in descending order
grep "^する\s" 045.result_python.txt | sort | uniq --count | sort --numeric-sort --reverse > "045.result_2_する.txt"

# Extract lines starting with "見る" (see) followed by whitespace, sort, deduplicate with counts, sort in descending order
grep "^見る\s" 045.result_python.txt | sort | uniq --count | sort --numeric-sort --reverse > "045.result_3_見る.txt"

# Extract lines starting with "与える" (give) followed by whitespace, sort, deduplicate with counts, sort in descending order
grep "^与える\s" 045.result_python.txt | sort | uniq --count | sort --numeric-sort --reverse > "045.result_4_与える.txt"
The Chunk class stores the base forms of the verb and the particle. If one phrase contains more than one verb, the later one wins. A case particle should come at the end of the phrase, but there is a conditional branch that accounts for trailing symbols (punctuation).
python
class Chunk:
    def __init__(self, morphs, dst):
        self.morphs = morphs
        self.srcs = []   # list of source phrase index numbers
        self.dst = dst   # destination phrase index number

        self.verb = ''
        self.joshi = ''

        for morph in morphs:
            if morph.pos != '記号':
                self.joshi = ''  # reset for non-symbol morphemes so that only a particle at the end of the phrase (ignoring symbols) is kept
            if morph.pos == '動詞':  # verb
                self.verb = morph.base
            if morph.pos == '助詞':  # particle
                self.joshi = morph.base
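Strictly speaking, the problem statement asks for the base form of the leftmost verb, whereas the loop above lets a later verb overwrite an earlier one. If you wanted to follow the specification literally, a minimal sketch of the difference (using a made-up, simplified phrase) would be:

python
# Hypothetical, simplified morphemes of one phrase: (part of speech, base form) pairs
phrase = [('動詞', '始める'), ('動詞', '見る'), ('助詞', 'を')]

verb = ''
for pos, base in phrase:
    # keep only the first (leftmost) verb instead of letting a later one overwrite it
    if pos == '動詞' and verb == '':
        verb = base

print(verb)  # -> 始める (the Chunk code above would end up with 見る)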
The particles of the source phrases are collected with a list comprehension and sorted to satisfy the "lexicographic order" requirement. Finally, the `join` function outputs them separated by spaces. The nesting is deep, and it felt awkward to write.
python
with open('./045.result_python.txt', 'w') as out_file:
    for sentence in sentences:
        for chunk in sentence:
            if chunk.verb != '' and len(chunk.srcs) > 0:

                # Build the list of particles from the source phrases
                sources = [sentence[source].joshi for source in chunk.srcs if sentence[source].joshi != '']

                if len(sources) > 0:
                    sources.sort()
                    out_file.write('{}\t{}\n'.format(chunk.verb, ' '.join(sources)))
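As a quick illustration of the sort-and-join step (the particle values here are made up for the example):

python
sources = ['を', 'は']    # particles collected from the source phrases (made-up values)
sources.sort()            # lexicographic (code point) order: は sorts before を
print(' '.join(sources))  # -> は を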
When the program is executed, the following results will be output. Since there are many, only 10 lines are displayed here.
bash:045.result_python.txt(First 10 lines)
Be born
Tsukugato
By crying
Or
At the beginning
To see
Listen
To catch
Boil
Eat
Since there are many, only 10 lines are displayed here.
bash:045.result_1_all.txt(First 10 lines)
There are 3176
1997 Tsukugato
800
721
464 to be
330
309 I think
305 see
301
Until there are 262
bash:045.result_2_する.txt(First 10 lines)
1099
651
221
109 But
Until 86
59 What is
41
27 What is it?
Up to 24
18 as
bash:045.result_3_見る.txt(First 10 lines)
305 see
99 see
31 to see
24 Seeing
19 from seeing
11 Seeing
7 Because I see
5 to see
2 While watching
2 Just by looking
"Give" has a low frequency of appearance, and this is all.
bash:045.result_4_与える.txt
7 to give
4 to give
3 What to give
Give 1 but give
1 As to give
1 to give