This is my record of knock 49, "Extraction of dependency paths between nouns", from ["Chapter 5: Dependency analysis"](http://www.cl.ecei.tohoku.ac.jp/nlp100/#ch5) of the 2015 edition of the 100 Language Processing Knocks. We have finally reached the halfway point of the 100 knocks, and here comes **a notoriously nasty one**: **the problem statement is hard to understand, and even once you understand it, it is hard to solve**. Of all the knocks I have done, this was the most troublesome one for writing my own algorithm (there are more complicated knocks later, but I mostly delegate those to packages, so they caused far less grief).
| Link | Remarks |
|---|---|
| 049. Extraction of dependency paths between nouns.ipynb | GitHub link to the answer program |
| 100 amateur language processing knocks: 49 | Source of many copied-and-pasted parts |
| CaboCha official | The CaboCha page to look at first |
I installed CRF++ and CaboCha so long ago that I have forgotten how. Since neither package has been updated since, I have not rebuilt the environment. I only remember the frustration of trying to use CaboCha on Windows: as far as I recall, I could not get it to work on 64-bit Windows (my memory is vague, and it may have been a problem on my side).
| Type | Version | Remarks |
|---|---|---|
| OS | Ubuntu 18.04.01 LTS | Runs in a virtual machine |
| pyenv | 1.2.16 | I use pyenv because I sometimes use multiple Python environments |
| Python | 3.8.1 | Python 3.8.1 on pyenv; packages are managed with venv |
| MeCab | 0.996-5 | Installed with apt-get |
| CRF++ | 0.58 | Too old; I forgot how I installed it (probably `make install`) |
| CaboCha | 0.69 | Too old; I forgot how I installed it (probably `make install`) |
Apply the dependency parser CaboCha to "I am a cat" and get a feel for how dependency trees and syntactic analysis work.

Keywords: class, dependency parsing, CaboCha, clause, dependency, case, functional verb syntax, dependency path, [Graphviz](http://www.graphviz.org/)

Parse the text of Natsume Soseki's novel "I am a cat" (neko.txt) with CaboCha and save the result to a file named neko.txt.cabocha. Use this file to implement programs that address the following questions.
Extract the shortest dependency path that connects every pair of noun phrases in a sentence. When the clause numbers of the noun phrase pair are *i* and *j* (*i* < *j*), the dependency path must satisfy the following specifications:

- As in problem 48, a path is expressed by concatenating the expressions (surface morpheme strings) of each clause, from the start clause to the end clause, with " -> ".
- Replace the noun phrases in clauses *i* and *j* with X and Y, respectively.

In addition, the shape of a dependency path can take the following two forms:

- If clause *j* is on the path from clause *i* to the root of the syntax tree: show the path from clause *i* to clause *j*.
- Otherwise, when the paths from clause *i* and clause *j* to the root of the syntax tree meet at a common clause *k*: concatenate the path from clause *i* to just before clause *k*, the path from clause *j* to just before clause *k*, and the contents of clause *k*, with " | ".

For example, the sentence "I saw a human being for the first time here" (the 8th sentence of neko.txt.cabocha) should produce the following output:

```
X is|In Y->Start with->Human->Things|saw
X is|Called Y->Things|saw
X is|Y|saw
In X->Start with-> Y
In X->Start with->Human-> Y
Called X-> Y
```
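To make the two path shapes concrete, here is a minimal sketch, separate from the answer program, that classifies a noun-phrase pair as P1 or P2. The `DST` array is a hypothetical model of the example sentence above, where each entry is the clause's dependency target and clause 5 ("saw") is the root:

```python
# Hypothetical dependency targets for the 6 clauses of the example
# sentence; clause 5 ("saw") is the root (dst == -1).
DST = [5, 2, 3, 4, 5, -1]

def chain(i):
    """Clause indices on the path from clause i (exclusive) to the root."""
    path = []
    while DST[i] != -1:
        path.append(DST[i])
        i = DST[i]
    return path

def path_shape(i, j):
    """Return 'P1' if clause j lies on clause i's path to the root;
    otherwise 'P2' plus the common clause k where the two paths meet."""
    ci = chain(i)
    if j in ci:
        return 'P1', None
    k = min(set(ci) & set(chain(j)))  # nearest common clause
    return 'P2', k

print(path_shape(1, 3))  # → ('P1', None): clause 3 is on clause 1's path
print(path_shape(0, 1))  # → ('P2', 5): the paths meet at clause 5 ("saw")
```

This is the essence of the full program: walk each clause's chain of contacts toward the root, and use the intersection of the two chains to decide the shape.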
The knock statement is long and hard to understand. I finally got it after reading the article ["100 amateur language processing knocks: 49"](https://qiita.com/segavvy/items/dfbf9d5dd5885e807d49#%E5%95%8F%E9%A1%8C%E3%81%AE%E6%84%8F%E5%91%B3), so I will skip my own explanation.

The program is about 150 lines. It could probably be made somewhat simpler with more effort, but this is where I gave up.
```python
import re

# Delimiter for MeCab columns (tab or comma)
separator = re.compile('\t|,')

# Dependency line, e.g. "* 0 5D 0/1 -1.514009"
dependancy = re.compile(r'''(?:\*\s\d+\s)  # not captured
                            (-?\d+)        # number (dependency target)
                         ''', re.VERBOSE)
NOUN_REPLACE = '@@noun@@'
SEP1 = ' -> '
SEP2 = ' | '

class Morph:
    def __init__(self, line):
        # Split on tabs and commas
        cols = separator.split(line)
        self.surface = cols[0]  # surface form
        self.base = cols[7]     # base form
        self.pos = cols[1]      # part of speech
        self.pos1 = cols[2]     # part-of-speech subcategory 1

class Chunk:
    def __init__(self, morphs, dst):
        self.morphs = morphs
        self.dst = dst  # index of the dependency target clause
        self.dsts = []
        self.phrase = ''
        self.phrase_noun = ''
        self.noun = False
        for morph in morphs:
            if morph.pos != '記号':  # symbol
                self.phrase += morph.surface  # build the clause from non-symbols
                if morph.pos == '名詞':  # noun
                    # Treat the first noun as the noun phrase
                    if self.noun == False:
                        self.noun = True
                        self.phrase_noun += NOUN_REPLACE
                        continue
                    # If the previous morpheme was also a noun, skip the surface form
                    elif self.noun and self.phrase_noun.endswith(NOUN_REPLACE):
                        continue
                # Apart from replacement and skips, append the surface form as is
                self.phrase_noun += morph.surface

morphs = []
chunks = []
sentences = []

with open('./neko.txt.cabocha') as f:
    for line in f:
        dependancies = dependancy.match(line)

        # Neither EOS nor a dependency-analysis line: a morpheme line
        if not (line == 'EOS\n' or dependancies):
            morphs.append(Morph(line))

        # EOS or a dependency line, with accumulated morphemes
        elif len(morphs) > 0:
            chunks.append(Chunk(morphs, dst))
            morphs = []

        # Dependency line: remember the target for the next chunk
        if dependancies:
            dst = int(dependancies.group(1))

        # EOS with accumulated chunks: the sentence is complete
        if line == 'EOS\n' and len(chunks) > 0:
            # Build each clause's full list of contacts
            # (loop backwards from the last clause for efficiency)
            for chunk in chunks[::-1]:
                if chunk.dst != -1:
                    chunk.dsts.extend([chunk.dst])
                    if chunks[chunk.dst].dst != -1:
                        chunk.dsts.extend(chunks[chunk.dst].dsts)
            sentences.append(chunks)
            chunks = []

def output_result(i, chunk, sentence):
    print('\tX:', i, chunk.phrase, chunk.dsts)

    # Search for Y among the clauses after the noun phrase X
    for i_y, y_chunk in enumerate(sentence[i+1:], i+1):

        # Y must be a noun phrase
        if y_chunk.noun:
            print('\t\tY:', i_y, y_chunk.phrase)
            result = ''

            # Pattern 1: Y is on X's path (list of contacts)
            if i_y in chunk.dsts:
                result = '\t\t\tP1:' + '\t' + sentence[i].phrase_noun.replace(NOUN_REPLACE, 'X') + SEP1
                # Output the path from X to Y
                for i_path in chunk.dsts:
                    # Finish when Y is reached
                    if i_path == i_y:
                        result += 'Y'
                        break
                    else:
                        result += sentence[i_path].phrase + SEP1

            # Pattern 2: Y is not on X's path
            else:
                result = '\t\t\tP2:' + '\t' + sentence[i].phrase_noun.replace(NOUN_REPLACE, 'X')
                # The common clause is the minimum of the intersection of the two paths
                i_goal = min(set(chunk.dsts).intersection(set(sentence[i_y].dsts)))
                # Path from X up to the common clause
                for i_path in chunk.dsts:
                    if i_path == i_goal:
                        result += SEP2 + sentence[i_y].phrase_noun.replace(NOUN_REPLACE, 'Y')
                        break
                    else:
                        result += SEP1 + sentence[i_path].phrase
                # Path from Y's contacts to the common clause
                for i_path in sentence[i_y].dsts:
                    if i_path == i_goal:
                        result += SEP2 + sentence[i_goal].phrase
                    else:
                        result += SEP1 + sentence[i_path].phrase
            print(result)

for i_sentence, sentence in enumerate(sentences):
    # Output the whole sentence
    print(i_sentence, '\t', ''.join([chunk.phrase for chunk in sentence]))
    for i_chunk, chunk in enumerate(sentence):
        # Only clauses that are noun phrases and have a dependency target
        if chunk.noun and chunk.dst != -1:
            output_result(i_chunk, chunk, sentence)
    # Limit the output because there are many sentences
    if i_sentence > 5:
        break
```
First, the Chunk class is extended significantly. The flag `noun` is set to True when the clause contains a noun. A `phrase_noun` is then built so that clauses like "X is" and "Y is" can be produced later; since at this point the clause could become either X or Y, the placeholder `NOUN_REPLACE` is put where X or Y will go. To handle consecutive nouns (e.g. "dove" + "clock"), `endswith` checks whether the phrase already ends with `NOUN_REPLACE`, so that `NOUN_REPLACE` is never emitted twice in a row.
```python
class Chunk:
    def __init__(self, morphs, dst):
        self.morphs = morphs
        self.dst = dst  # index of the dependency target clause
        self.dsts = []
        self.phrase = ''
        self.phrase_noun = ''
        self.noun = False
        for morph in morphs:
            if morph.pos != '記号':  # symbol
                self.phrase += morph.surface  # build the clause from non-symbols
                if morph.pos == '名詞':  # noun
                    # Treat the first noun as the noun phrase
                    if self.noun == False:
                        self.noun = True
                        self.phrase_noun += NOUN_REPLACE
                        continue
                    # If the previous morpheme was also a noun, skip the surface form
                    elif self.noun and self.phrase_noun.endswith(NOUN_REPLACE):
                        continue
                # Apart from replacement and skips, append the surface form as is
                self.phrase_noun += morph.surface
```
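The consecutive-noun handling can also be sketched in isolation. The helper below is hypothetical (it is not part of the answer program); it takes (surface, pos) pairs and mimics how `phrase_noun` is built:

```python
NOUN_REPLACE = '@@noun@@'

def build_phrase_noun(morphs):
    """morphs: list of (surface, pos) pairs. The first run of nouns is
    collapsed into a single NOUN_REPLACE placeholder."""
    phrase_noun = ''
    seen_noun = False
    for surface, pos in morphs:
        if pos == '名詞':  # noun
            if not seen_noun:
                seen_noun = True
                phrase_noun += NOUN_REPLACE
                continue
            # The previous morpheme was also a noun: skip, so the
            # placeholder is not emitted twice in a row
            if phrase_noun.endswith(NOUN_REPLACE):
                continue
        phrase_noun += surface
    return phrase_noun

# Consecutive nouns ("dove" + "clock") collapse into one placeholder
print(build_phrase_noun([('鳩', '名詞'), ('時計', '名詞')]))   # → @@noun@@
# A trailing particle is kept as is
print(build_phrase_noun([('吾輩', '名詞'), ('は', '助詞')]))   # → @@noun@@は
```

Replacing the placeholder later with `'X'` or `'Y'` then yields clauses like "X is".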
What changed from last time is the final `if` branch, "EOS with accumulated chunks". It loops over the set of clauses `chunks` to build each clause's full list of contacts. The loop runs backwards from the last clause because we want to build all the contact lists in one pass, reusing the lists that have already been built.
```python
for line in f:
    dependancies = dependancy.match(line)

    # Neither EOS nor a dependency-analysis line: a morpheme line
    if not (line == 'EOS\n' or dependancies):
        morphs.append(Morph(line))

    # EOS or a dependency line, with accumulated morphemes
    elif len(morphs) > 0:
        chunks.append(Chunk(morphs, dst))
        morphs = []

    # Dependency line: remember the target for the next chunk
    if dependancies:
        dst = int(dependancies.group(1))

    # EOS with accumulated chunks: the sentence is complete
    if line == 'EOS\n' and len(chunks) > 0:
        # Build each clause's full list of contacts
        # (loop backwards from the last clause for efficiency)
        for chunk in chunks[::-1]:
            if chunk.dst != -1:
                chunk.dsts.extend([chunk.dst])
                if chunks[chunk.dst].dst != -1:
                    chunk.dsts.extend(chunks[chunk.dst].dsts)
        sentences.append(chunks)
        chunks = []
```
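The backward pass can be sketched standalone with a hypothetical `dst` list modelled on "I saw a human being for the first time here" (clause 5, "saw", is the root):

```python
class Clause:
    """Minimal stand-in for Chunk: only dst and dsts."""
    def __init__(self, dst):
        self.dst = dst   # direct contact
        self.dsts = []   # all contacts up to the root

# Hypothetical sentence: clause 5 is the root (dst == -1)
chunks = [Clause(5), Clause(2), Clause(3), Clause(4), Clause(5), Clause(-1)]

# Loop backwards so each clause can reuse the list its contact already built
for chunk in chunks[::-1]:
    if chunk.dst != -1:
        chunk.dsts.extend([chunk.dst])
        if chunks[chunk.dst].dst != -1:
            chunk.dsts.extend(chunks[chunk.dst].dsts)

print([c.dsts for c in chunks])
# → [[5], [2, 3, 4, 5], [3, 4, 5], [4, 5], [5], []]
```

Each clause's `dsts` is the whole chain of contacts up to the root, which is exactly the `[2, 3, 4, 5]` style of list that appears next to X in the output.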
For easier checking later, the whole sentence and the X and Y information are printed with indentation, although this is not part of the assignment. The output falls into roughly two patterns:

- Pattern 1: X and Y are on the same path; Y replaces the entire clause. Marked P1 in the output.
- Pattern 2: X and Y are on different paths joined with "|"; Y replaces only the noun part. Marked P2 in the output.

Pattern 1 is simple: since X and Y are on the same path, the clauses from X to Y are output in order. Pattern 2 is more complicated because X and Y lie on different paths. I use the `intersection` function to find the clause where the two paths meet, the same set operation I used back in knock 06 of Chapter 1.
```python
def output_result(i, chunk, sentence):
    print('\tX:', i, chunk.phrase, chunk.dsts)

    # Search for Y among the clauses after the noun phrase X
    for i_y, y_chunk in enumerate(sentence[i+1:], i+1):

        # Y must be a noun phrase
        if y_chunk.noun:
            print('\t\tY:', i_y, y_chunk.phrase)
            result = ''

            # Pattern 1: Y is on X's path (list of contacts)
            if i_y in chunk.dsts:
                result = '\t\t\tP1:' + '\t' + sentence[i].phrase_noun.replace(NOUN_REPLACE, 'X') + SEP1
                # Output the path from X to Y
                for i_path in chunk.dsts:
                    # Finish when Y is reached
                    if i_path == i_y:
                        result += 'Y'
                        break
                    else:
                        result += sentence[i_path].phrase + SEP1

            # Pattern 2: Y is not on X's path
            else:
                result = '\t\t\tP2:' + '\t' + sentence[i].phrase_noun.replace(NOUN_REPLACE, 'X')
                # The common clause is the minimum of the intersection of the two paths
                i_goal = min(set(chunk.dsts).intersection(set(sentence[i_y].dsts)))
                # Path from X up to the common clause
                for i_path in chunk.dsts:
                    if i_path == i_goal:
                        result += SEP2 + sentence[i_y].phrase_noun.replace(NOUN_REPLACE, 'Y')
                        break
                    else:
                        result += SEP1 + sentence[i_path].phrase
                # Path from Y's contacts to the common clause
                for i_path in sentence[i_y].dsts:
                    if i_path == i_goal:
                        result += SEP2 + sentence[i_goal].phrase
                    else:
                        result += SEP1 + sentence[i_path].phrase
            print(result)
```
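For example, taking the contact lists for sentence 6 from the output below (X = clause 1 "later" and Y = clause 5 "in humans"), the common clause is the minimum of the intersection of the two lists, because each list is ordered from the clause toward the root:

```python
x_dsts = [2, 9]  # contacts of X, "later"
y_dsts = [8, 9]  # contacts of Y, "in humans"

# Both paths end at the root, so the intersection is non-empty;
# the smallest shared index is the clause where the paths first meet.
i_goal = min(set(x_dsts).intersection(set(y_dsts)))
print(i_goal)  # → 9
```

Clause 9 ("That's it") is then printed after the second "|" as the common clause of the P2 path.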
Running the program produces the following output. It is probably correct; I have not checked it thoroughly.
Output result

```
0   one
1   I am a cat
2   No name yet
    X:0 The name is [2]
3   I have no idea where I was born
    X:0 where [1, 2, 4]
        Y:2 Katonto
            P1: In X->Born-> Y
        Y:3 I have a clue
            P2: In X->Born->Katonto|Y|Do not use
    X:2 Katonto [4]
        Y:3 I have a clue
            P2: With X|Y|Do not use
    X:3 I have a clue [4]
4   I remember only crying in a dim and damp place
    X:0 anything [1, 5, 7]
        Y:At 3
            P2: Even with X->dim|In Y|In tears->I remember
        Y:6 Only what was there
            P2: Even with X->dim->In tears|Only Y|I remember
        Y:7 remember
            P1: Even with X->dim->In tears-> Y
    X:At 3 [5, 7]
        Y:6 Only what was there
            P2: In X->In tears|Only Y|I remember
        Y:7 remember
            P1: In X->In tears-> Y
    X:6 Only what was there [7]
        Y:7 remember
            P1: Only X-> Y
5   I saw human beings for the first time here
    X:0 I am [5]
        Y:1 where
            P2: X is|In Y->Start with->Human->Things|saw
        Y:3 human
            P2: X is|Called Y->Things|saw
        Y:4 things
            P2: X is|Y|saw
    X:1 where [2, 3, 4, 5]
        Y:3 human
            P1: In X->Start with-> Y
        Y:4 things
            P1: In X->Start with->Human-> Y
    X:3 human [4, 5]
        Y:4 things
            P1: Called X-> Y
    X:4 things [5]
6   Moreover, I heard later that it was the most evil race of human beings called Shosei.
    X:1 later [2, 9]
        Y:3 it is
            P2: In X->When you hear|Y is|That's it
        Y:4 Called a student
            P2: In X->When you hear|Called Y->In humans->Was a race|That's it
        Y:5 in humans
            P2: In X->When you hear|In Y->Was a race|That's it
        Y:6 Ichiban
            P2: In X->When you hear| Y ->Evil->Was a race|That's it
        Y:7 Evil
            P2: In X->When you hear|Y->Was a race|That's it
        Y:Was 8 races
            P2: In X->When you hear|Was Y|That's it
        Y:9 That's right
            P1: In X->When you hear-> Y
    X:3 it is [9]
        Y:4 Called a student
            P2: X is|Called Y->In humans->Was a race|That's it
        Y:5 in humans
            P2: X is|In Y->Was a race|That's it
        Y:6 Ichiban
            P2: X is| Y ->Evil->Was a race|That's it
        Y:7 Evil
            P2: X is|Y->Was a race|That's it
        Y:Was 8 races
            P2: X is|Was Y|That's it
        Y:9 That's right
            P1: X is-> Y
    X:4 Called a student [5, 8, 9]
        Y:5 in humans
            P1: Called X-> Y
        Y:6 Ichiban
            P2: Called X->In humans| Y ->Evil|Was a race->That's it
        Y:7 Evil
            P2: Called X->In humans|Y|Was a race->That's it
        Y:Was 8 races
            P1: Called X->In humans-> Y
        Y:9 That's right
            P1: Called X->In humans->Was a race-> Y
    X:5 in humans [8, 9]
        Y:6 Ichiban
            P2: In X| Y ->Evil|Was a race->That's it
        Y:7 Evil
            P2: In X|Y|Was a race->That's it
        Y:Was 8 races
            P1: In X-> Y
        Y:9 That's right
            P1: In X->Was a race-> Y
    X:6 Ichiban [7, 8, 9]
        Y:7 Evil
            P1: X -> Y
        Y:Was 8 races
            P1: X ->Evil-> Y
        Y:9 That's right
            P1: X ->Evil->Was a race-> Y
    X:7 Evil [8, 9]
        Y:Was 8 races
            P1: X-> Y
        Y:9 That's right
            P1: X->Was a race-> Y
    X:Was 8 races [9]
        Y:9 That's right
            P1: Was X-> Y
```