This is a record of my attempt at problem 42, "Display of source and destination clauses," from ["Chapter 5: Dependency analysis"](http://www.cl.ecei.tohoku.ac.jp/nlp100/#ch5) of the Language Processing 100 Knocks 2015. Because it outputs each source clause together with the clause it depends on, this feels like dependency analysis in earnest. Technically, however, only the output method changes a little, so it is not much different from the previous knocks.
Link | Remarks |
---|---|
042.Display of source and destination clauses.ipynb | GitHub link to the answer program |
100 amateur language processing knocks:42 | Source from which I copied and pasted many parts |
CaboCha official | The CaboCha page to look at first |
I installed CRF++ and CaboCha so long ago that I have forgotten how I installed them. Since neither package has been updated in a long time, I have not rebuilt the environment. I only remember being frustrated when I tried to use CaboCha on Windows. I think I could not get it to work on 64-bit Windows (my memory is vague, and the problem may have been my own technical skill).
Type | Version | Notes |
---|---|---|
OS | Ubuntu 18.04.01 LTS | Running in a virtual machine |
pyenv | 1.2.16 | I use pyenv because I sometimes use multiple Python environments |
Python | 3.8.1 | Python 3.8.1 on pyenv; packages are managed with venv |
MeCab | 0.996-5 | Installed with apt-get |
CRF++ | 0.58 | So old that I forgot how I installed it (probably `make install`) |
CaboCha | 0.69 | So old that I forgot how I installed it (probably `make install`) |
Apply the dependency analyzer CaboCha to "I Am a Cat" and get hands-on experience working with dependency trees and syntactic analysis.
Class, Dependency Parsing, CaboCha, Clause, Dependency, Case, Functional Verb Construction, Dependency Path, [Graphviz](http://www.graphviz.org/)
Use CaboCha to run dependency analysis on the text (neko.txt) of Natsume Soseki's novel "I Am a Cat" and save the result in a file called neko.txt.cabocha. Use this file to implement programs that address the following questions.
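The article does not show how neko.txt.cabocha was produced, so the following is only a minimal sketch of one way to do it from Python. It assumes the `cabocha` command is on the PATH and uses its `-f1` option to select the lattice output format that the parser in this article reads. The helper name `run_cabocha` is hypothetical.

```python
import shutil
import subprocess
from pathlib import Path


def run_cabocha(src='neko.txt', dst='neko.txt.cabocha'):
    """Pipe src through `cabocha -f1` and write the lattice output to dst.

    Returns True if the command was run, False if cabocha or src is missing.
    """
    # Do nothing (and do not create dst) when the tool or the input is absent
    if shutil.which('cabocha') is None or not Path(src).exists():
        return False
    with open(src, 'rb') as fin, open(dst, 'wb') as fout:
        # The text is fed on stdin and the analysis is read from stdout
        subprocess.run(['cabocha', '-f1'], stdin=fin, stdout=fout, check=True)
    return True


if __name__ == '__main__':
    print('parsed' if run_cabocha() else 'cabocha or neko.txt not found; skipped')
```

Feeding the file through standard input and output keeps the sketch independent of any file-related command-line options.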
Extract the text of every source clause and the clause it depends on, in tab-delimited format. However, do not output symbols such as punctuation marks.
```python
import re

# Delimiter: tab or comma
separator = re.compile('\t|,')

# Dependency line
dependancy = re.compile(r'''(?:\*\s\d+\s) # Not subject to capture
                            (-?\d+)       # Number (destination clause index)
                         ''', re.VERBOSE)


class Morph:
    def __init__(self, line):
        # Split on tabs and commas
        cols = separator.split(line)

        self.surface = cols[0]  # Surface form (surface)
        self.base = cols[7]     # Base form (base)
        self.pos = cols[1]      # Part of speech (pos)
        self.pos1 = cols[2]     # Part-of-speech subcategory 1 (pos1)


class Chunk:
    def __init__(self, morphs, dst):
        self.morphs = morphs
        self.srcs = []  # List of source clause index numbers
        self.dst = dst  # Destination clause index number
        # Clause text without symbols (the analyzer output tags symbols as '記号')
        self.phrase = ''.join([morph.surface for morph in morphs if morph.pos != '記号'])


# Fill in the sources and add the Chunk list to the sentence list
def append_sentence(chunks, sentences):
    # Fill in the sources
    for i, chunk in enumerate(chunks):
        if chunk.dst != -1:
            chunks[chunk.dst].srcs.append(i)
    sentences.append(chunks)
    return sentences, []


morphs = []
chunks = []
sentences = []

with open('./neko.txt.cabocha') as f:
    for line in f:
        dependancies = dependancy.match(line)

        # Neither EOS nor a dependency line: this is a morpheme line
        if not (line == 'EOS\n' or dependancies):
            morphs.append(Morph(line))

        # EOS or a dependency line, with morphemes collected so far
        elif len(morphs) > 0:
            chunks.append(Chunk(morphs, dst))
            morphs = []

        # Dependency line: remember the destination for the next chunk
        if dependancies:
            dst = int(dependancies.group(1))

        # EOS with chunks collected so far
        if line == 'EOS\n' and len(chunks) > 0:
            sentences, chunks = append_sentence(chunks, sentences)

for si, sentence in enumerate(sentences):
    print('-----', si, '-----')
    for ci, chunk in enumerate(sentence):
        if chunk.dst != -1:
            print('{}:{}\t{}'.format(ci, chunk.phrase, sentence[chunk.dst].phrase))

    # Limit the output because there is a lot of it
    if si > 5:
        break
```
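To see what the two regular expressions do, here is a small sketch that applies them to two sample lines in CaboCha's lattice format. The sample lines are hand-written assumptions for illustration, not copied from the real neko.txt.cabocha, and the dependency pattern is the same one written without `re.VERBOSE`.

```python
import re

separator = re.compile('\t|,')
dependancy = re.compile(r'(?:\*\s\d+\s)(-?\d+)')

# Hand-written sample lines (assumed format: chunk header, then morpheme line)
header = '* 0 2D 0/1 -1.911675\n'
root_header = '* 2 -1D 0/0 0.000000\n'
morph_line = '名前\t名詞,一般,*,*,*,*,名前,ナマエ,ナマエ\n'

# The header yields the destination chunk index; -1 marks the root
dst = int(dependancy.match(header).group(1))
root_dst = int(dependancy.match(root_header).group(1))

# A morpheme line does not match the dependency pattern at all
is_header = dependancy.match(morph_line) is not None

# Splitting on tab and comma puts the surface at 0, POS at 1, base form at 7
cols = separator.split(morph_line)

print(dst, root_dst, is_header)      # 2 -1 False
print(cols[0], cols[1], cols[7])     # 名前 名詞 名前
```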
Unlike the previous Chunk class, this one excludes symbols from the clause text.
```python
class Chunk:
    def __init__(self, morphs, dst):
        self.morphs = morphs
        self.srcs = []  # List of source clause index numbers
        self.dst = dst  # Destination clause index number
        # Clause text without symbols (the analyzer output tags symbols as '記号')
        self.phrase = ''.join([morph.surface for morph in morphs if morph.pos != '記号'])
```
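The effect of the filter can be tried in isolation. This is a minimal sketch that assumes the analyzer labels symbols with the part-of-speech tag `記号`, as the IPA dictionary does. The `Morph` here is a hypothetical namedtuple stand-in holding only the two fields needed, not the class from this article.

```python
from collections import namedtuple

# Hypothetical stand-in for the Morph class: surface form and part of speech
Morph = namedtuple('Morph', ['surface', 'pos'])

# Hand-made clause: an adjective followed by a full stop tagged as a symbol
morphs = [Morph('無い', '形容詞'), Morph('。', '記号')]

with_symbols = ''.join(m.surface for m in morphs)
without_symbols = ''.join(m.surface for m in morphs if m.pos != '記号')

print(with_symbols)     # 無い。
print(without_symbols)  # 無い
```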
The task asks for "tab-delimited text", but printing everything at once is hard to read, and the result is easier to follow with a separator between sentences. So I took the liberty of interpreting the requirement loosely and printed the tab with `print`.
```python
for si, sentence in enumerate(sentences):
    print('-----', si, '-----')
    for ci, chunk in enumerate(sentence):
        if chunk.dst != -1:
            print('{}:{}\t{}'.format(ci, chunk.phrase, sentence[chunk.dst].phrase))

    # Limit the output because there is a lot of it
    if si > 5:
        break
```
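If a strictly tab-delimited result is preferred, one option is to drop the index and the separator lines and write only the pairs. The following is a minimal sketch of that idea, not the author's method. `SimpleChunk` and `write_pairs` are hypothetical names, and the data is hand-made rather than taken from the real parse result.

```python
import io
from collections import namedtuple

# Hypothetical stand-in for Chunk: clause text and destination index
SimpleChunk = namedtuple('SimpleChunk', ['phrase', 'dst'])

# Hand-made sentence with three clauses; the last one is the root (dst == -1)
sentences = [[SimpleChunk('名前は', 2), SimpleChunk('まだ', 2), SimpleChunk('無い', -1)]]


def write_pairs(sentences, out):
    """Write one 'source<TAB>destination' line per dependency."""
    for sentence in sentences:
        for chunk in sentence:
            if chunk.dst == -1:
                continue  # the root has no destination clause
            dst_phrase = sentence[chunk.dst].phrase
            # Skip pairs where either side is empty after symbol removal
            if chunk.phrase and dst_phrase:
                out.write(f'{chunk.phrase}\t{dst_phrase}\n')


buf = io.StringIO()
write_pairs(sentences, buf)
print(buf.getvalue(), end='')
```

Passing `sys.stdout` or an open file as `out` instead of the `StringIO` buffer would send the same lines to the terminal or to disk.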
When the program is executed, the following result is output (only the first seven sentences, numbered 0 to 6, are shown).
Output result

```
----- 0 -----
----- 1 -----
----- 2 -----
0:No name
1:Not yet
----- 3 -----
0:Where was born
1:Born
2:I don't get it
3:I have no idea
----- 4 -----
0:Anything dim
1:Dim crying
2:Weeping
3:Crying where you did
4:Meow meow crying
5:I cry and remember
6:I remember only what I was
----- 5 -----
0:I saw
1:For the first time here
2:For the first time called human
3:Human beings
4:I saw something
----- 6 -----
0:And that's right
1:I will ask you later
2:I heard that
3:That's right
4:In the human being called Shosei
5:Was a race in humans
6:The worst
7:Was an evil race
8:It seems that it was a race
```