This is a record of solving the 46th exercise, "Extraction of verb case frame information," from ["Chapter 5: Dependency analysis"](http://www.cl.ecei.tohoku.ac.jp/nlp100/#ch5) of the 2015 edition of the 100 Language Processing Knocks. Last time only the particles were output as cases; this time the clauses themselves (the terms of each case frame) are output as well. Naturally, that makes it even more work...
| Link | Remarks |
|---|---|
| 046.Extraction of verb case frame information.ipynb | GitHub link to the answer program |
| 100 amateur language processing knocks: 46 | Source of much of the copied-and-pasted code |
| CaboCha official | The CaboCha page to check first |
I installed CRF++ and CaboCha so long ago that I have forgotten how to install them. Since neither package has been updated at all, I have not rebuilt the environment. I only remember being frustrated when I tried to use CaboCha on Windows; I believe I could not use it on 64-bit Windows (my memory is vague, and it may have been a problem on my end).
| Type | Version | Notes |
|---|---|---|
| OS | Ubuntu 18.04.01 LTS | Runs in a virtual machine |
| pyenv | 1.2.16 | I use pyenv because I sometimes use multiple Python environments |
| Python | 3.8.1 | Python 3.8.1 on pyenv; packages are managed with venv |
| MeCab | 0.996-5 | Installed with apt-get |
| CRF++ | 0.58 | Too old and I forgot how I installed it (probably `make install`) |
| CaboCha | 0.69 | Too old and I forgot how I installed it (probably `make install`) |
Apply the dependency parser CaboCha to "I Am a Cat" and experience working with dependency trees and syntactic analysis.

Keywords: Class, Dependency parsing, CaboCha, Clause, Dependency, Case, Functional verb constructions, Dependency path, [Graphviz](http://www.graphviz.org/)

Using CaboCha, parse the dependencies in the text (neko.txt) of Natsume Soseki's novel "I Am a Cat" and save the result in a file called neko.txt.cabocha. Use this file to implement programs that address the following questions.
Modify the program from problem 45 so that, along with each predicate and case pattern, the terms (the clauses themselves that relate to the predicate) are also output in tab-delimited format. In addition to the specifications of problem 45, satisfy the following:

- A term should be the word sequence of the clause relating to the predicate (there is no need to strip the trailing particle)
- If there are multiple clauses relating to the predicate, arrange them with the same criteria and in the same order as the particles, separated by spaces
Consider the example sentence "吾輩はここで始めて人間というものを見た" ("Here, for the first time, I saw a human being" — the 8th sentence of neko.txt.cabocha). This sentence contains two verbs, 始める ("begin") and 見る ("see"). The clause relating to 始める is analyzed as ここで, and the clauses relating to 見る are analyzed as 吾輩は and ものを, so the program should produce the following output.

```
始める	で	ここで
見る	は を	吾輩は ものを
```
If you are interested, take a look at the Wikipedia article "Case grammar". You can solve the problem without reading it; I did not really understand it from a glance anyway.
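For reference, each morpheme line in neko.txt.cabocha follows the MeCab lattice format: the surface form, a tab, then comma-separated features. The sample line below is illustrative rather than taken from the file; splitting on tabs and commas, as the program does, puts the surface form at index 0, the part of speech at index 1, and the base form at index 7:

```python
import re

# Illustrative morpheme line in the MeCab/CaboCha lattice format
line = '吾輩\t名詞,代名詞,一般,*,*,*,吾輩,ワガハイ,ワガハイ'
cols = re.split('\t|,', line)
print(cols[0])  # surface form: 吾輩
print(cols[1])  # part of speech: 名詞
print(cols[7])  # base form: 吾輩
```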
```python
import re

# Delimiter for morpheme lines (tab or comma)
separator = re.compile('\t|,')

# Dependency line: capture only the destination clause index
dependency = re.compile(r'''(?:\*\s\d+\s)  # not captured ("* <clause no> ")
                           (-?\d+)         # number (destination clause)
                        ''', re.VERBOSE)

class Morph:
    def __init__(self, line):
        # Split on tabs and commas
        cols = separator.split(line)
        self.surface = cols[0]  # surface form
        self.base = cols[7]     # base form
        self.pos = cols[1]      # part of speech
        self.pos1 = cols[2]     # part-of-speech subdivision 1

class Chunk:
    def __init__(self, morphs, dst):
        self.morphs = morphs
        self.srcs = []  # list of source clause indexes
        self.dst = dst  # destination clause index
        self.phrase = ''
        self.verb = ''
        self.joshi = ''
        for morph in morphs:
            if morph.pos != '記号':  # skip symbols
                self.phrase += morph.surface  # build the clause text
                self.joshi = ''  # reset so that only a trailing particle survives
            if morph.pos == '動詞':  # verb
                self.verb = morph.base
            if morph.pos == '助詞':  # particle
                self.joshi = morph.base

# Fill in the source indexes and append the Chunk list to the sentence list
def append_sentence(chunks, sentences):
    # Record each clause as a source of its destination clause
    for i, chunk in enumerate(chunks):
        if chunk.dst != -1:
            chunks[chunk.dst].srcs.append(i)
    sentences.append(chunks)
    return sentences, []

morphs = []
chunks = []
sentences = []
with open('./neko.txt.cabocha') as f:
    for line in f:
        dependencies = dependency.match(line)
        # Neither EOS nor a dependency line: a morpheme line
        if not (line == 'EOS\n' or dependencies):
            morphs.append(Morph(line))
        # EOS or a dependency line, with morphemes pending: close the chunk
        elif len(morphs) > 0:
            chunks.append(Chunk(morphs, dst))
            morphs = []
        # A dependency line: remember the destination for the next chunk
        if dependencies:
            dst = int(dependencies.group(1))
        # EOS with chunks pending: close the sentence
        if line == 'EOS\n' and len(chunks) > 0:
            sentences, chunks = append_sentence(chunks, sentences)

def output_file(out_file, sentence, chunk):
    # Build [particle, clause] pairs for the source clauses that have a particle
    sources = [[sentence[source].joshi, sentence[source].phrase]
               for source in chunk.srcs if sentence[source].joshi != '']
    if len(sources) > 0:
        sources.sort()
        joshi = ' '.join([row[0] for row in sources])
        phrase = ' '.join([row[1] for row in sources])
        out_file.write('{}\t{}\t{}\n'.format(chunk.verb, joshi, phrase))

with open('./046.result_python.txt', 'w') as out_file:
    for sentence in sentences:
        for chunk in sentence:
            if chunk.verb != '' and len(chunk.srcs) > 0:
                output_file(out_file, sentence, chunk)
```
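Before moving on, the dependency-line regex can be sanity-checked on a couple of lines in the CaboCha lattice format (the sample lines here are illustrative, not taken from neko.txt.cabocha):

```python
import re

# Same pattern as in the program: skip "* <clause no> " and capture the
# destination clause index (which is -1 for the sentence-final clause)
dependency = re.compile(r'''(?:\*\s\d+\s)  # not captured
                           (-?\d+)         # destination clause index
                        ''', re.VERBOSE)

print(dependency.match('* 0 2D 0/1 -1.514009\n').group(1))   # '2'
print(dependency.match('* 3 -1D 1/1 0.000000\n').group(1))   # '-1'
# Morpheme lines do not match at all
print(dependency.match('吾輩\t名詞,代名詞,一般,*,*,*,吾輩,ワガハイ,ワガハイ\n'))  # None
```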
As before, the main change is to the Chunk class, the workhorse of Chapter 5: I added the instance variable phrase to hold the clause text. Everything else is the same as last time.
```python
class Chunk:
    def __init__(self, morphs, dst):
        self.morphs = morphs
        self.srcs = []  # list of source clause indexes
        self.dst = dst  # destination clause index
        self.phrase = ''
        self.verb = ''
        self.joshi = ''
        for morph in morphs:
            if morph.pos != '記号':  # skip symbols
                self.phrase += morph.surface  # build the clause text
                self.joshi = ''  # reset so that only a trailing particle survives
            if morph.pos == '動詞':  # verb
                self.verb = morph.base
            if morph.pos == '助詞':  # particle
                self.joshi = morph.base
```
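One subtlety in the loop is worth spelling out: the particle slot is cleared for every non-symbol morpheme, so only a particle that is the last non-symbol morpheme of the clause survives. A standalone sketch of just that trick, using the Japanese part-of-speech tags that MeCab actually emits ('記号' = symbol, '助詞' = particle) and hand-made (pos, base) tuples:

```python
# Standalone sketch of the particle-reset trick in Chunk.__init__
def last_particle(morphs):
    joshi = ''
    for pos, base in morphs:
        if pos != '記号':
            joshi = ''    # any non-symbol morpheme clears a pending particle
        if pos == '助詞':
            joshi = base  # ... which is then re-set if this one is a particle
    return joshi

print(last_particle([('名詞', '吾輩'), ('助詞', 'は'), ('記号', '、')]))  # 'は'
print(last_particle([('助詞', 'の'), ('名詞', 'もの')]))                  # ''
```

A clause-final symbol (such as punctuation) does not clear the particle, while a particle buried mid-clause is discarded.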
Since the output logic became rather involved, I extracted it into a function. The first list comprehension builds a list of particle/clause pairs; after sorting, the columns are output using join.
```python
def output_file(out_file, sentence, chunk):
    # Build [particle, clause] pairs for the source clauses that have a particle
    sources = [[sentence[source].joshi, sentence[source].phrase]
               for source in chunk.srcs if sentence[source].joshi != '']
    if len(sources) > 0:
        sources.sort()
        joshi = ' '.join([row[0] for row in sources])
        phrase = ' '.join([row[1] for row in sources])
        out_file.write('{}\t{}\t{}\n'.format(chunk.verb, joshi, phrase))
```
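A quick, self-contained illustration (with hand-made pairs for the verb 見る from the example sentence) of why particles and clauses stay aligned: sorting the [particle, clause] pairs before splitting them into two joined strings keeps each clause in the same position as its particle.

```python
# Hypothetical [particle, clause] pairs for the verb 見る
sources = [['を', 'ものを'], ['は', '吾輩は']]
sources.sort()  # sorts by particle, dragging each clause along with it
joshi = ' '.join(row[0] for row in sources)
phrase = ' '.join(row[1] for row in sources)
print('{}\t{}\t{}'.format('見る', joshi, phrase))  # 見る	は を	吾輩は ものを
```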
When the program is executed, the following results are output (first 10 lines only).

```bash:046.result_python.txt
生れる	で	どこで
つく	か が	生れたか 見当が
泣く	で	所で
する	て は	泣いて いた事だけは
始める	で	ここで
見る	は を	吾輩は ものを
聞く	で	あとで
捕える	を	我々を
煮る	て	捕えて
食う	て	煮て
```