This is a record of knock 47, "Mining functional verb constructions", from "Chapter 5: Dependency analysis" of Language Processing 100 Knocks 2015. On top of the previous knock, the extraction conditions become more complicated. Just understanding the problem statement takes a little time, and solving it naturally takes longer.
| Link | Remarks |
|:--|:--|
| 047.Functional verb syntax mining.ipynb | GitHub link to the answer program |
| 100 amateur language processing knocks: 47 | The source I copied and pasted many parts of the code from |
| CaboCha official | The CaboCha page to look at first |
I installed CRF++ and CaboCha so long ago that I have forgotten how I installed them. Since neither package has been updated at all, I have not rebuilt the environment. I only remember being frustrated when I tried to use CaboCha on Windows. I think I could not get it to work on 64-bit Windows (my memory is vague, and the problem may have been my own technical skill).
| Type | Version | Contents |
|:--|:--|:--|
| OS | Ubuntu 18.04.01 LTS | Running on a virtual machine |
| pyenv | 1.2.16 | I use pyenv because I sometimes use multiple Python environments |
| Python | 3.8.1 | Python 3.8.1 on pyenv; packages are managed with venv |
| MeCab | 0.996-5 | Installed with apt-get |
| CRF++ | 0.58 | Installed so long ago that I forgot how (probably `make install`) |
| CaboCha | 0.69 | Installed so long ago that I forgot how (probably `make install`) |
Apply the dependency parser CaboCha to "I Am a Cat" and get hands-on experience with dependency trees and syntactic analysis.
Class, dependency parsing, CaboCha, bunsetsu (clause), dependency, case, functional verb construction, dependency path, Graphviz
Use CaboCha to run dependency parsing on the text (neko.txt) of Natsume Soseki's novel "I Am a Cat" and save the result to a file named neko.txt.cabocha. Use this file to implement programs that address the following questions.
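To make the file format concrete, here is a minimal sketch of how the lines in neko.txt.cabocha can be told apart. It assumes CaboCha's lattice output (`-f1`), where a chunk header looks like `* 0 2D 0/0 ...` and each morpheme line uses MeCab's tab-and-comma layout. The sample lines and the helper name `classify_line` are made up for illustration.

```python
import re

# Chunk header, e.g. "* 0 2D 0/0 -0.764522": capture the head chunk index
HEADER = re.compile(r'\*\s\d+\s(-?\d+)D')


def classify_line(line):
    """Return ('eos',), ('chunk', dst) or ('morph', surface, pos, pos1, base)."""
    line = line.rstrip('\n')
    if line == 'EOS':
        return ('eos',)
    header = HEADER.match(line)
    if header:
        return ('chunk', int(header.group(1)))
    # Morpheme line: surface, then comma-separated features after a tab
    cols = re.split(r'\t|,', line)
    return ('morph', cols[0], cols[1], cols[2], cols[7])


# Hand-written sample in the assumed format (not copied from the real file)
sample = [
    '* 0 2D 0/0 -0.764522\n',
    '吾輩\t名詞,代名詞,一般,*,*,*,吾輩,ワガハイ,ワガハイ\n',
    'は\t助詞,係助詞,*,*,*,*,は,ハ,ワ\n',
    'EOS\n',
]
parsed = [classify_line(line) for line in sample]
for item in parsed:
    print(item)
```

The head index in a chunk header is `-1` for the root chunk of a sentence, which is why the pattern allows an optional minus sign.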
We want to focus only on cases where the を (wo) case of a verb contains a sahen-connection noun (サ変接続名詞). Modify the program from knock 46 to meet the following specification.
- Only consider cases where a bunsetsu consisting of "sahen-connection noun + を (particle)" depends on a verb.
- The predicate is "sahen-connection noun + を + base form of the verb". If the bunsetsu contains multiple verbs, use the leftmost one.
- If multiple particles (bunsetsu) depend on the predicate, list all the particles in lexicographic order, separated by spaces.
- If multiple bunsetsu depend on the predicate, list all the arguments separated by spaces, in the same order as the particles.
For example, the sentence 「別段くるにも及ばんさと、主人は手紙に返事をする。」 ("The master replies to the letter, saying there is no particular need to come.") should produce the following output.

    返事をする      と に は        及ばんさと 手紙に 主人は
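The specification can be traced on a hand-built version of the example sentence. This is only a sketch: the `ToyChunk` class, the `mine` function and the dependency links are assumptions written by hand rather than real CaboCha output. It also skips the "sahen noun + を" chunk itself when collecting arguments, a choice made here so that the toy result matches the expected line.

```python
from dataclasses import dataclass


@dataclass
class ToyChunk:
    phrase: str    # surface text without symbols
    particle: str  # trailing particle, '' if none
    verb: str      # base form of the leftmost verb, '' if none
    sahen: str     # "sahen noun + を" if present, else ''
    dst: int       # index of the head chunk, -1 for the root


def mine(sentence):
    """Return one tab-separated row per 'sahen noun + を + verb' predicate."""
    rows = []
    for i, chunk in enumerate(sentence):
        if not chunk.sahen or chunk.dst == -1:
            continue
        head = sentence[chunk.dst]
        if not head.verb:
            continue
        # Other chunks that depend on the same verb and end in a particle
        args = sorted(
            (c.particle, c.phrase)
            for j, c in enumerate(sentence)
            if c.dst == chunk.dst and j != i and c.particle
        )
        if args:
            particles = ' '.join(p for p, _ in args)
            phrases = ' '.join(ph for _, ph in args)
            rows.append('\t'.join([chunk.sahen + head.verb, particles, phrases]))
    return rows


# Hand-assigned dependencies for the example sentence (an assumption)
toy_sentence = [
    ToyChunk('別段', '', '', '', 1),
    ToyChunk('くるにも', 'も', 'くる', '', 2),
    ToyChunk('及ばんさと', 'と', '及ぶ', '', 6),
    ToyChunk('主人は', 'は', '', '', 6),
    ToyChunk('手紙に', 'に', '', '', 6),
    ToyChunk('返事を', 'を', '', '返事を', 6),
    ToyChunk('する', '', 'する', '', -1),
]
result = mine(toy_sentence)
print(result[0])
```

Sorting the `(particle, phrase)` pairs together keeps each argument aligned with its particle, which is what the last two items of the specification ask for.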
Save the output of this program to a file and check the following items using UNIX commands.
- Predicates that frequently appear in the corpus (sahen-connection noun + を + verb)
- Predicate and particle patterns that frequently appear in the corpus
According to "Functional verb / Idiom verb", a functional verb is defined as follows. In other words, it is a verb like "do" (する) that carries no meaning on its own and has to attach to a noun, as in "have a meal".
A functional verb is a verb that has lost its original meaning and combines with an action noun to express the meaning of a verb.

    import re

    # Delimiters in a morpheme line: a tab, then commas
    separator = re.compile('\t|,')

    # Chunk header line with the dependency information
    dependancy = re.compile(r'''(?:\*\s\d+\s) # Not captured: asterisk and chunk index
                                (-?\d+)       # Number (index of the head chunk)
                              ''', re.VERBOSE)

    class Morph:
        def __init__(self, line):

            # Split on tabs and commas
            cols = separator.split(line)

            self.surface = cols[0]  # Surface form (surface)
            self.base = cols[7]     # Base form (base)
            self.pos = cols[1]      # Part of speech (pos)
            self.pos1 = cols[2]     # Part-of-speech subcategory 1 (pos1)

    class Chunk:
        def __init__(self, morphs, dst):
            self.morphs = morphs
            self.srcs = []   # List of source (dependent) chunk indices
            self.dst = dst   # Index of the head chunk

            self.phrase = ''
            self.verb = ''
            self.joshi = ''
            self.sahen = ''  # "Sahen noun + を" if the chunk matches the pattern, else ''

            for i, morph in enumerate(morphs):
                if morph.pos != '記号':  # 記号 = symbol
                    self.phrase += morph.surface  # Build the phrase from non-symbol morphemes
                    self.joshi = ''  # Reset; the particle is taken from the last morpheme below
                if morph.pos == '動詞' and self.verb == '':  # 動詞 = verb (keep the leftmost)
                    self.verb = morph.base
                if morphs[-1].pos == '助詞':  # 助詞 = particle
                    self.joshi = morphs[-1].base
                try:
                    # サ変接続 = sahen-connection noun, followed by the particle を
                    if morph.pos1 == 'サ変接続' and \
                       morphs[i+1].surface == 'を':
                        self.sahen = morph.surface + morphs[i+1].surface
                except IndexError:
                    pass  # The last morpheme has no successor

    # Fill in the source chunks and append the chunk list to the sentence list
    def append_sentence(chunks, sentences):

        # Fill in the source chunks
        for i, chunk in enumerate(chunks):
            if chunk.dst != -1:
                chunks[chunk.dst].srcs.append(i)
        sentences.append(chunks)
        return sentences, []

    morphs = []
    chunks = []
    sentences = []

    with open('./neko.txt.cabocha') as f:

        for line in f:
            dependancies = dependancy.match(line)

            # Neither EOS nor a chunk header: a morpheme line
            if not (line == 'EOS\n' or dependancies):
                morphs.append(Morph(line))

            # EOS or a chunk header, with morphemes pending
            elif len(morphs) > 0:
                chunks.append(Chunk(morphs, dst))
                morphs = []

            # Chunk header: remember the head index
            if dependancies:
                dst = int(dependancies.group(1))

            # EOS with chunks pending: close the sentence
            if line == 'EOS\n' and len(chunks) > 0:
                sentences, chunks = append_sentence(chunks, sentences)

    def output_file(out_file, sahen, sentence, chunk):
        # Build a list of [particle, phrase] pairs
        sources = [[sentence[source].joshi, sentence[source].phrase] \
                    for source in chunk.srcs if sentence[source].joshi != '']

        if len(sources) > 0:
            sources.sort()  # Lexicographic order of particles, as the specification requires
            joshi = ' '.join([row[0] for row in sources])
            phrase = ' '.join([row[1] for row in sources])
            out_file.write(('{}\t{}\t{}\n'.format(sahen, joshi, phrase)))

    with open('./047.result_python.txt', 'w') as out_file:
        for sentence in sentences:
            for chunk in sentence:
                # The chunk has "sahen noun + を", has a head, and the head contains a verb
                if chunk.sahen != '' and \
                   chunk.dst != -1 and \
                   sentence[chunk.dst].verb != '':
                    output_file(out_file, chunk.sahen+sentence[chunk.dst].verb,
                                sentence, sentence[chunk.dst])
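The inversion performed in `append_sentence` can be looked at in isolation: each chunk stores only its head (`dst`), and the list of dependents (`srcs`) is derived from those links. A minimal sketch, using a hypothetical helper `build_srcs` and a hand-written list of head indices:

```python
def build_srcs(dsts):
    """Given each chunk's head index (-1 for the root), return each chunk's dependents."""
    srcs = [[] for _ in dsts]
    for i, dst in enumerate(dsts):
        if dst != -1:
            srcs[dst].append(i)
    return srcs


# Hand-written head indices for a seven-chunk toy sentence (not real parser output)
dsts = [1, 2, 6, 6, 6, 6, -1]
srcs = build_srcs(dsts)
print(srcs)
```

Because the loop visits chunks from left to right, each list of dependents comes out in sentence order.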

    # Extract the predicate, sort, count duplicates, then sort by count in descending order
    cut --fields=1 047.result_python.txt | sort | uniq --count \
    | sort --numeric-sort --reverse > 047.result_unix1.txt

    # Extract the predicate and particles, sort, count duplicates, then sort by count in descending order
    cut --fields=1,2 047.result_python.txt | sort | uniq --count \
    | sort --numeric-sort --reverse > 047.result_unix2.txt
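If the GNU tools are not at hand, the same tally can be approximated in Python. This is a rough equivalent rather than a replacement for the commands: `collections.Counter` stands in for `sort | uniq --count`, and rows with equal counts may come out in a different order than `sort --numeric-sort --reverse` would give. The function name `count_patterns` and the sample rows are illustrative.

```python
from collections import Counter


def count_patterns(lines, n_fields):
    """Count the first n_fields tab-separated columns, most frequent first."""
    keys = ['\t'.join(line.rstrip('\n').split('\t')[:n_fields]) for line in lines]
    return Counter(keys).most_common()


# Made-up rows in the same three-column layout as 047.result_python.txt
rows = [
    '返事をする\tと\t云うと\n',
    '返事をする\tと は\t聞くと 主人は\n',
    '返事をする\tと\t聞くと\n',
    '話をする\tに\t人に\n',
]
by_predicate = count_patterns(rows, 1)
by_pattern = count_patterns(rows, 2)
print(by_predicate)
print(by_pattern[0])
```

To run it on the real output, pass the opened file object instead of `rows`.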
As usual, the change goes into the `Chunk` class, which is the lifeline of this program. If the part-of-speech subcategory `pos1` is サ変接続 (sahen connection) and the next morpheme is を, the instance variable `sahen` is set to the string that joins the two (example: 返事を).

    class Chunk:
        def __init__(self, morphs, dst):
            self.morphs = morphs
            self.srcs = []   # List of source (dependent) chunk indices
            self.dst = dst   # Index of the head chunk

            self.phrase = ''
            self.verb = ''
            self.joshi = ''
            self.sahen = ''  # "Sahen noun + を" if the chunk matches the pattern, else ''

            for i, morph in enumerate(morphs):
                if morph.pos != '記号':  # 記号 = symbol
                    self.phrase += morph.surface  # Build the phrase from non-symbol morphemes
                    self.joshi = ''  # Reset; the particle is taken from the last morpheme below
                if morph.pos == '動詞' and self.verb == '':  # 動詞 = verb (keep the leftmost)
                    self.verb = morph.base
                if morphs[-1].pos == '助詞':  # 助詞 = particle
                    self.joshi = morphs[-1].base
                try:
                    # サ変接続 = sahen-connection noun, followed by the particle を
                    if morph.pos1 == 'サ変接続' and \
                       morphs[i+1].surface == 'を':
                        self.sahen = morph.surface + morphs[i+1].surface
                except IndexError:
                    pass  # The last morpheme has no successor
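The adjacent-pair check can also be written without `try`/`except IndexError` by pairing each morpheme with its successor. The sketch below is an alternative formulation, not the code used in this article; it works on plain `(surface, pos1)` tuples, and the function name `find_sahen_wo` is hypothetical.

```python
def find_sahen_wo(morphs):
    """Return 'noun + を' for the last sahen-noun/を pair in the chunk, or ''."""
    found = ''
    # zip stops at the shorter sequence, so the final morpheme never looks past the end
    for current, following in zip(morphs, morphs[1:]):
        if current[1] == 'サ変接続' and following[0] == 'を':
            found = current[0] + following[0]
    return found


# (surface, pos1) pairs written by hand for illustration
print(find_sahen_wo([('返事', 'サ変接続'), ('を', '格助詞')]))
print(repr(find_sahen_wo([('返事', 'サ変接続')])))
```

Like the class above, it keeps the last matching pair when a chunk contains more than one.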
The conditional branch in the output section was changed as well.

    with open('./047.result_python.txt', 'w') as out_file:
        for sentence in sentences:
            for chunk in sentence:
                if chunk.sahen != '' and \
                   chunk.dst != -1 and \
                   sentence[chunk.dst].verb != '':
                    output_file(out_file, chunk.sahen+sentence[chunk.dst].verb,
                                sentence, sentence[chunk.dst])
Running the Python script produces the following output.
047.result_python.txt (first 10 lines only)

    Decide to make a decision
    In return, in memory of the return
    Take a nap Take a nap
    He takes a nap
    Persecution by chasing after persecution
    Living a family life
    Talk talk talk
    Make a letter to the editor Make a letter to the editor
    Sometimes talk to talk
    To make a sketch
Running the first UNIX command outputs the "predicates that frequently appear in the corpus (sahen-connection noun + を + verb)".
047.result_unix1.txt (first 10 lines only)

    29 reply
    21 Say hello
    16 talk
    15 imitate
    13 quarrel
    9 Exercise
    9 Ask a question
    6 Be careful
    6 Take a nap
    6 Ask questions
Running the second UNIX command outputs the "predicate and particle patterns that frequently appear in the corpus".
047.result_unix2.txt (first 10 lines only)

    14 When you reply
    9 Exercise
    9 Do the imitation
    8 What is a reply?
    7 To quarrel
    6 To talk
    6 When you say hello
    5 to talk
    5 To say hello
    4 Ask a question