This is my record of knock 49, "Extraction of dependency paths between nouns", from ["Chapter 5: Dependency analysis"](http://www.cl.ecei.tohoku.ac.jp/nlp100/#ch5) of the 2015 edition of the 100 Language Processing Knocks. We have finally reached the halfway point of the 100 knocks, and here comes **a notoriously nasty one**: **the problem statement is hard to understand, and even once you understand it, it is hard to solve**. Of all the knocks I have done, this was the most troublesome one for writing my own algorithm (there are more complicated knocks later, but I mostly delegate those to packages, so they caused far less grief).
| Link | Remarks |
|---|---|
| 049. Extraction of dependency paths between nouns.ipynb | GitHub link to the answer program |
| 100 amateur language processing knocks: 49 | Source of many copied-and-pasted parts |
| CaboCha official | The CaboCha page to look at first |
I installed CRF++ and CaboCha so long ago that I have forgotten how. Since neither package has been updated since, I have not rebuilt the environment. I only remember the frustration of trying to use CaboCha on Windows: as far as I recall, I could not get it to work on 64-bit Windows (my memory is vague, and it may have been a problem on my side).
| Type | Version | Remarks |
|---|---|---|
| OS | Ubuntu 18.04.01 LTS | Runs in a virtual machine |
| pyenv | 1.2.16 | I use pyenv because I sometimes use multiple Python environments |
| Python | 3.8.1 | Python 3.8.1 on pyenv; packages are managed with venv |
| MeCab | 0.996-5 | Installed with apt-get |
| CRF++ | 0.58 | Too old; I forgot how I installed it (probably `make install`) |
| CaboCha | 0.69 | Too old; I forgot how I installed it (probably `make install`) |
Apply the dependency parser CaboCha to "I am a cat" and get a feel for how dependency trees and syntactic analysis work.

Keywords: class, dependency parsing, CaboCha, clause, dependency, case, functional verb syntax, dependency path, [Graphviz](http://www.graphviz.org/)

Parse the text of Natsume Soseki's novel "I am a cat" (neko.txt) with CaboCha and save the result to a file named neko.txt.cabocha. Use this file to implement programs that address the following questions.
Extract the shortest dependency path that connects every pair of noun phrases in a sentence. When the clause numbers of the noun phrase pair are *i* and *j* (*i* < *j*), the dependency path must satisfy the following specifications:

- As in problem 48, a path is expressed by concatenating the expressions (surface morpheme strings) of each clause, from the start clause to the end clause, with " -> ".
- Replace the noun phrases in clauses *i* and *j* with X and Y, respectively.

In addition, the shape of a dependency path can take the following two forms:

- If clause *j* is on the path from clause *i* to the root of the syntax tree: show the path from clause *i* to clause *j*.
- Otherwise, when the paths from clause *i* and clause *j* to the root of the syntax tree meet at a common clause *k*: concatenate the path from clause *i* to just before clause *k*, the path from clause *j* to just before clause *k*, and the contents of clause *k*, with " | ".

For example, the sentence "I saw a human being for the first time here" (the 8th sentence of neko.txt.cabocha) should produce the following output:

```
X is|In Y->Start with->Human->Things|saw
X is|Called Y->Things|saw
X is|Y|saw
In X->Start with-> Y
In X->Start with->Human-> Y
Called X-> Y
```
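To make the two path shapes concrete, here is a minimal sketch, separate from the answer program, that classifies a noun-phrase pair as P1 or P2. The `DST` array is a hypothetical model of the example sentence above, where each entry is the clause's dependency target and clause 5 ("saw") is the root:

```python
# Hypothetical dependency targets for the 6 clauses of the example
# sentence; clause 5 ("saw") is the root (dst == -1).
DST = [5, 2, 3, 4, 5, -1]

def chain(i):
    """Clause indices on the path from clause i (exclusive) to the root."""
    path = []
    while DST[i] != -1:
        path.append(DST[i])
        i = DST[i]
    return path

def path_shape(i, j):
    """Return 'P1' if clause j lies on clause i's path to the root;
    otherwise 'P2' plus the common clause k where the two paths meet."""
    ci = chain(i)
    if j in ci:
        return 'P1', None
    k = min(set(ci) & set(chain(j)))  # nearest common clause
    return 'P2', k

print(path_shape(1, 3))  # → ('P1', None): clause 3 is on clause 1's path
print(path_shape(0, 1))  # → ('P2', 5): the paths meet at clause 5 ("saw")
```

This is the essence of the full program: walk each clause's chain of contacts toward the root, and use the intersection of the two chains to decide the shape.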
The knock statement is long and hard to understand. I finally got it after reading the article ["100 amateur language processing knocks: 49"](https://qiita.com/segavvy/items/dfbf9d5dd5885e807d49#%E5%95%8F%E9%A1%8C%E3%81%AE%E6%84%8F%E5%91%B3), so I will skip my own explanation.

The program is about 150 lines. It could probably be made somewhat simpler with more effort, but this is where I gave up.
```python
import re

# Delimiter for MeCab columns (tab or comma)
separator = re.compile('\t|,')

# Dependency line, e.g. "* 0 5D 0/1 -1.514009"
dependancy = re.compile(r'''(?:\*\s\d+\s)  # not captured
                            (-?\d+)        # number (dependency target)
                         ''', re.VERBOSE)
NOUN_REPLACE = '@@noun@@'
SEP1 = ' -> '
SEP2 = ' | '

class Morph:
    def __init__(self, line):
        # Split on tabs and commas
        cols = separator.split(line)
        self.surface = cols[0]  # surface form
        self.base = cols[7]     # base form
        self.pos = cols[1]      # part of speech
        self.pos1 = cols[2]     # part-of-speech subcategory 1

class Chunk:
    def __init__(self, morphs, dst):
        self.morphs = morphs
        self.dst = dst  # index of the dependency target clause
        self.dsts = []
        self.phrase = ''
        self.phrase_noun = ''
        self.noun = False
        for morph in morphs:
            if morph.pos != '記号':  # symbol
                self.phrase += morph.surface  # build the clause from non-symbols
                if morph.pos == '名詞':  # noun
                    # Treat the first noun as the noun phrase
                    if self.noun == False:
                        self.noun = True
                        self.phrase_noun += NOUN_REPLACE
                        continue
                    # If the previous morpheme was also a noun, skip the surface form
                    elif self.noun and self.phrase_noun.endswith(NOUN_REPLACE):
                        continue
                # Apart from replacement and skips, append the surface form as is
                self.phrase_noun += morph.surface

morphs = []
chunks = []
sentences = []

with open('./neko.txt.cabocha') as f:
    for line in f:
        dependancies = dependancy.match(line)

        # Neither EOS nor a dependency-analysis line: a morpheme line
        if not (line == 'EOS\n' or dependancies):
            morphs.append(Morph(line))

        # EOS or a dependency line, with accumulated morphemes
        elif len(morphs) > 0:
            chunks.append(Chunk(morphs, dst))
            morphs = []

        # Dependency line: remember the target for the next chunk
        if dependancies:
            dst = int(dependancies.group(1))

        # EOS with accumulated chunks: the sentence is complete
        if line == 'EOS\n' and len(chunks) > 0:
            # Build each clause's full list of contacts
            # (loop backwards from the last clause for efficiency)
            for chunk in chunks[::-1]:
                if chunk.dst != -1:
                    chunk.dsts.extend([chunk.dst])
                    if chunks[chunk.dst].dst != -1:
                        chunk.dsts.extend(chunks[chunk.dst].dsts)
            sentences.append(chunks)
            chunks = []

def output_result(i, chunk, sentence):
    print('\tX:', i, chunk.phrase, chunk.dsts)

    # Search for Y among the clauses after the noun phrase X
    for i_y, y_chunk in enumerate(sentence[i+1:], i+1):

        # Y must be a noun phrase
        if y_chunk.noun:
            print('\t\tY:', i_y, y_chunk.phrase)
            result = ''

            # Pattern 1: Y is on X's path (list of contacts)
            if i_y in chunk.dsts:
                result = '\t\t\tP1:' + '\t' + sentence[i].phrase_noun.replace(NOUN_REPLACE, 'X') + SEP1
                # Output the path from X to Y
                for i_path in chunk.dsts:
                    # Finish when Y is reached
                    if i_path == i_y:
                        result += 'Y'
                        break
                    else:
                        result += sentence[i_path].phrase + SEP1

            # Pattern 2: Y is not on X's path
            else:
                result = '\t\t\tP2:' + '\t' + sentence[i].phrase_noun.replace(NOUN_REPLACE, 'X')
                # The common clause is the minimum of the intersection of the two paths
                i_goal = min(set(chunk.dsts).intersection(set(sentence[i_y].dsts)))
                # Path from X up to the common clause
                for i_path in chunk.dsts:
                    if i_path == i_goal:
                        result += SEP2 + sentence[i_y].phrase_noun.replace(NOUN_REPLACE, 'Y')
                        break
                    else:
                        result += SEP1 + sentence[i_path].phrase
                # Path from Y's contacts to the common clause
                for i_path in sentence[i_y].dsts:
                    if i_path == i_goal:
                        result += SEP2 + sentence[i_goal].phrase
                    else:
                        result += SEP1 + sentence[i_path].phrase
            print(result)

for i_sentence, sentence in enumerate(sentences):
    # Output the whole sentence
    print(i_sentence, '\t', ''.join([chunk.phrase for chunk in sentence]))
    for i_chunk, chunk in enumerate(sentence):
        # Only clauses that are noun phrases and have a dependency target
        if chunk.noun and chunk.dst != -1:
            output_result(i_chunk, chunk, sentence)
    # Limit the output because there are many sentences
    if i_sentence > 5:
        break
```
First, the Chunk class is extended significantly. The flag `noun` is set to True when the clause contains a noun. A `phrase_noun` is then built so that clauses like "X is" and "Y is" can be produced later; since at this point the clause could become either X or Y, the placeholder `NOUN_REPLACE` is put where X or Y will go. To handle consecutive nouns (e.g. "dove" + "clock"), `endswith` checks whether the phrase already ends with `NOUN_REPLACE`, so that `NOUN_REPLACE` is never emitted twice in a row.
```python
class Chunk:
    def __init__(self, morphs, dst):
        self.morphs = morphs
        self.dst = dst  # index of the dependency target clause
        self.dsts = []
        self.phrase = ''
        self.phrase_noun = ''
        self.noun = False
        for morph in morphs:
            if morph.pos != '記号':  # symbol
                self.phrase += morph.surface  # build the clause from non-symbols
                if morph.pos == '名詞':  # noun
                    # Treat the first noun as the noun phrase
                    if self.noun == False:
                        self.noun = True
                        self.phrase_noun += NOUN_REPLACE
                        continue
                    # If the previous morpheme was also a noun, skip the surface form
                    elif self.noun and self.phrase_noun.endswith(NOUN_REPLACE):
                        continue
                # Apart from replacement and skips, append the surface form as is
                self.phrase_noun += morph.surface
```
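The consecutive-noun handling can also be sketched in isolation. The helper below is hypothetical (it is not part of the answer program); it takes (surface, pos) pairs and mimics how `phrase_noun` is built:

```python
NOUN_REPLACE = '@@noun@@'

def build_phrase_noun(morphs):
    """morphs: list of (surface, pos) pairs. The first run of nouns is
    collapsed into a single NOUN_REPLACE placeholder."""
    phrase_noun = ''
    seen_noun = False
    for surface, pos in morphs:
        if pos == '名詞':  # noun
            if not seen_noun:
                seen_noun = True
                phrase_noun += NOUN_REPLACE
                continue
            # The previous morpheme was also a noun: skip, so the
            # placeholder is not emitted twice in a row
            if phrase_noun.endswith(NOUN_REPLACE):
                continue
        phrase_noun += surface
    return phrase_noun

# Consecutive nouns ("dove" + "clock") collapse into one placeholder
print(build_phrase_noun([('鳩', '名詞'), ('時計', '名詞')]))   # → @@noun@@
# A trailing particle is kept as is
print(build_phrase_noun([('吾輩', '名詞'), ('は', '助詞')]))   # → @@noun@@は
```

Replacing the placeholder later with `'X'` or `'Y'` then yields clauses like "X is".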
What changed from last time is the final `if` branch, "EOS with accumulated chunks". It loops over the set of clauses `chunks` to build each clause's full list of contacts. The loop runs backwards from the last clause because we want to build all the contact lists in one pass, reusing the lists that have already been built.
```python
for line in f:
    dependancies = dependancy.match(line)

    # Neither EOS nor a dependency-analysis line: a morpheme line
    if not (line == 'EOS\n' or dependancies):
        morphs.append(Morph(line))

    # EOS or a dependency line, with accumulated morphemes
    elif len(morphs) > 0:
        chunks.append(Chunk(morphs, dst))
        morphs = []

    # Dependency line: remember the target for the next chunk
    if dependancies:
        dst = int(dependancies.group(1))

    # EOS with accumulated chunks: the sentence is complete
    if line == 'EOS\n' and len(chunks) > 0:
        # Build each clause's full list of contacts
        # (loop backwards from the last clause for efficiency)
        for chunk in chunks[::-1]:
            if chunk.dst != -1:
                chunk.dsts.extend([chunk.dst])
                if chunks[chunk.dst].dst != -1:
                    chunk.dsts.extend(chunks[chunk.dst].dsts)
        sentences.append(chunks)
        chunks = []
```
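The backward pass can be sketched standalone with a hypothetical `dst` list modelled on "I saw a human being for the first time here" (clause 5, "saw", is the root):

```python
class Clause:
    """Minimal stand-in for Chunk: only dst and dsts."""
    def __init__(self, dst):
        self.dst = dst   # direct contact
        self.dsts = []   # all contacts up to the root

# Hypothetical sentence: clause 5 is the root (dst == -1)
chunks = [Clause(5), Clause(2), Clause(3), Clause(4), Clause(5), Clause(-1)]

# Loop backwards so each clause can reuse the list its contact already built
for chunk in chunks[::-1]:
    if chunk.dst != -1:
        chunk.dsts.extend([chunk.dst])
        if chunks[chunk.dst].dst != -1:
            chunk.dsts.extend(chunks[chunk.dst].dsts)

print([c.dsts for c in chunks])
# → [[5], [2, 3, 4, 5], [3, 4, 5], [4, 5], [5], []]
```

Each clause's `dsts` is the whole chain of contacts up to the root, which is exactly the `[2, 3, 4, 5]` style of list that appears next to X in the output.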
For easier checking later, the whole sentence and the X and Y information are printed with indentation, although this is not part of the assignment. The output falls into roughly two patterns:

- Pattern 1: X and Y are on the same path; Y replaces the entire clause. Marked P1 in the output.
- Pattern 2: X and Y are on different paths joined with "|"; Y replaces only the noun part. Marked P2 in the output.

Pattern 1 is simple: since X and Y are on the same path, the clauses from X to Y are output in order. Pattern 2 is more complicated because X and Y lie on different paths. I use the `intersection` function to find the clause where the two paths meet, the same set operation I used back in knock 06 of Chapter 1.
```python
def output_result(i, chunk, sentence):
    print('\tX:', i, chunk.phrase, chunk.dsts)

    # Search for Y among the clauses after the noun phrase X
    for i_y, y_chunk in enumerate(sentence[i+1:], i+1):

        # Y must be a noun phrase
        if y_chunk.noun:
            print('\t\tY:', i_y, y_chunk.phrase)
            result = ''

            # Pattern 1: Y is on X's path (list of contacts)
            if i_y in chunk.dsts:
                result = '\t\t\tP1:' + '\t' + sentence[i].phrase_noun.replace(NOUN_REPLACE, 'X') + SEP1
                # Output the path from X to Y
                for i_path in chunk.dsts:
                    # Finish when Y is reached
                    if i_path == i_y:
                        result += 'Y'
                        break
                    else:
                        result += sentence[i_path].phrase + SEP1

            # Pattern 2: Y is not on X's path
            else:
                result = '\t\t\tP2:' + '\t' + sentence[i].phrase_noun.replace(NOUN_REPLACE, 'X')
                # The common clause is the minimum of the intersection of the two paths
                i_goal = min(set(chunk.dsts).intersection(set(sentence[i_y].dsts)))
                # Path from X up to the common clause
                for i_path in chunk.dsts:
                    if i_path == i_goal:
                        result += SEP2 + sentence[i_y].phrase_noun.replace(NOUN_REPLACE, 'Y')
                        break
                    else:
                        result += SEP1 + sentence[i_path].phrase
                # Path from Y's contacts to the common clause
                for i_path in sentence[i_y].dsts:
                    if i_path == i_goal:
                        result += SEP2 + sentence[i_goal].phrase
                    else:
                        result += SEP1 + sentence[i_path].phrase
            print(result)
```
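For example, taking the contact lists for sentence 6 from the output below (X = clause 1 "later" and Y = clause 5 "in humans"), the common clause is the minimum of the intersection of the two lists, because each list is ordered from the clause toward the root:

```python
x_dsts = [2, 9]  # contacts of X, "later"
y_dsts = [8, 9]  # contacts of Y, "in humans"

# Both paths end at the root, so the intersection is non-empty;
# the smallest shared index is the clause where the paths first meet.
i_goal = min(set(x_dsts).intersection(set(y_dsts)))
print(i_goal)  # → 9
```

Clause 9 ("That's it") is then printed after the second "|" as the common clause of the P2 path.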
Running the program produces the following output. It is probably correct; I have not checked it thoroughly.
Output result

```
0   one
1   I am a cat
2   No name yet
    X:0 The name is [2]
3   I have no idea where I was born
    X:0 where [1, 2, 4]
        Y:2 Katonto
            P1: In X->Born-> Y
        Y:3 I have a clue
            P2: In X->Born->Katonto|Y|Do not use
    X:2 Katonto [4]
        Y:3 I have a clue
            P2: With X|Y|Do not use
    X:3 I have a clue [4]
4   I remember only crying in a dim and damp place
    X:0 anything [1, 5, 7]
        Y:At 3
            P2: Even with X->dim|In Y|In tears->I remember
        Y:6 Only what was there
            P2: Even with X->dim->In tears|Only Y|I remember
        Y:7 remember
            P1: Even with X->dim->In tears-> Y
    X:At 3 [5, 7]
        Y:6 Only what was there
            P2: In X->In tears|Only Y|I remember
        Y:7 remember
            P1: In X->In tears-> Y
    X:6 Only what was there [7]
        Y:7 remember
            P1: Only X-> Y
5   I saw human beings for the first time here
    X:0 I am [5]
        Y:1 where
            P2: X is|In Y->Start with->Human->Things|saw
        Y:3 human
            P2: X is|Called Y->Things|saw
        Y:4 things
            P2: X is|Y|saw
    X:1 where [2, 3, 4, 5]
        Y:3 human
            P1: In X->Start with-> Y
        Y:4 things
            P1: In X->Start with->Human-> Y
    X:3 human [4, 5]
        Y:4 things
            P1: Called X-> Y
    X:4 things [5]
6   Moreover, I heard later that it was the most evil race of human beings called Shosei.
    X:1 later [2, 9]
        Y:3 it is
            P2: In X->When you hear|Y is|That's it
        Y:4 Called a student
            P2: In X->When you hear|Called Y->In humans->Was a race|That's it
        Y:5 in humans
            P2: In X->When you hear|In Y->Was a race|That's it
        Y:6 Ichiban
            P2: In X->When you hear| Y ->Evil->Was a race|That's it
        Y:7 Evil
            P2: In X->When you hear|Y->Was a race|That's it
        Y:Was 8 races
            P2: In X->When you hear|Was Y|That's it
        Y:9 That's right
            P1: In X->When you hear-> Y
    X:3 it is [9]
        Y:4 Called a student
            P2: X is|Called Y->In humans->Was a race|That's it
        Y:5 in humans
            P2: X is|In Y->Was a race|That's it
        Y:6 Ichiban
            P2: X is| Y ->Evil->Was a race|That's it
        Y:7 Evil
            P2: X is|Y->Was a race|That's it
        Y:Was 8 races
            P2: X is|Was Y|That's it
        Y:9 That's right
            P1: X is-> Y
    X:4 Called a student [5, 8, 9]
        Y:5 in humans
            P1: Called X-> Y
        Y:6 Ichiban
            P2: Called X->In humans| Y ->Evil|Was a race->That's it
        Y:7 Evil
            P2: Called X->In humans|Y|Was a race->That's it
        Y:Was 8 races
            P1: Called X->In humans-> Y
        Y:9 That's right
            P1: Called X->In humans->Was a race-> Y
    X:5 in humans [8, 9]
        Y:6 Ichiban
            P2: In X| Y ->Evil|Was a race->That's it
        Y:7 Evil
            P2: In X|Y|Was a race->That's it
        Y:Was 8 races
            P1: In X-> Y
        Y:9 That's right
            P1: In X->Was a race-> Y
    X:6 Ichiban [7, 8, 9]
        Y:7 Evil
            P1: X -> Y
        Y:Was 8 races
            P1: X ->Evil-> Y
        Y:9 That's right
            P1: X ->Evil->Was a race-> Y
    X:7 Evil [8, 9]
        Y:Was 8 races
            P1: X-> Y
        Y:9 That's right
            P1: X->Was a race-> Y
    X:Was 8 races [9]
        Y:9 That's right
            P1: Was X-> Y
```