This is my record of problem 45, "Extraction of verb case patterns," from ["Chapter 5: Dependency analysis"](http://www.cl.ecei.tohoku.ac.jp/nlp100/#ch5) of Language Processing 100 Knocks 2015. The number of `if` branches keeps growing, and the program is becoming more and more complicated. Working out the algorithm is getting a little tedious.
Link | Remarks |
---|---|
045. Extraction of verb case patterns.ipynb | GitHub link to the answer program |
100 amateur language processing knocks: 45 | The source from which I copied and pasted many parts |
CaboCha official | The CaboCha page to look at first |
I installed CRF++ and CaboCha so long ago that I have forgotten how I installed them. Since neither package has been updated since then, I have not rebuilt the environment. I only remember being frustrated when I tried to use CaboCha on Windows: I believe I could not get it to work on 64-bit Windows (my memory is vague, and it may have been a problem on my side).
type | version | Contents |
---|---|---|
OS | Ubuntu 18.04.01 LTS | Running in a virtual machine |
pyenv | 1.2.16 | I use pyenv because I sometimes use multiple Python environments |
Python | 3.8.1 | Python 3.8.1 on pyenv; packages are managed with venv |
MeCab | 0.996-5 | Installed with apt-get |
CRF++ | 0.58 | Installed so long ago that I forgot how (probably `make install`) |
CaboCha | 0.69 | Installed so long ago that I forgot how (probably `make install`) |
Apply the dependency parser CaboCha to "I Am a Cat" and get hands-on experience with dependency trees and syntactic analysis.
Class, Dependency Parsing, CaboCha, Clause, Dependency, Case, Functional Verb Syntax, Dependency Path, [Graphviz](http://www.graphviz.org/)
Parse the dependencies of the text (neko.txt) of Natsume Soseki's novel "I Am a Cat" with CaboCha and save the result in a file called neko.txt.cabocha. Use this file to implement programs that address the following questions.
Treating the text used here as a corpus, we want to investigate the cases that Japanese predicates can take. Think of a verb as a predicate and a particle of a phrase that depends on the verb as a case, and output the predicate and its cases in tab-delimited format. Make sure the output satisfies the following specifications.
- In a phrase containing verbs, use the base form of the leftmost verb as the predicate.
- A case is a particle that depends on the predicate.
- If multiple particles (phrases) depend on the predicate, list all of the particles in lexicographic order, separated by spaces.
Consider the example sentence 「吾輩はここで始めて人間というものを見た」 ("Here I saw a human being for the first time"), the 8th sentence of neko.txt.cabocha. This sentence contains two verbs, 始める (begin) and 見る (see). If the phrase depending on 始める is analyzed as ここで, and the phrases depending on 見る are analyzed as 吾輩は and ものを, the program should produce the following output.
始める	で
見る	は を
Save the output of this program to a file and check the following items using UNIX commands.
- Combinations of predicates and case patterns that appear frequently in the corpus
- The case patterns of the verbs する (do), 見る (see), and 与える (give) (sorted by frequency of appearance in the corpus)
I do not pay much attention to this beyond completing the program, but the Japanese notion of "case" seems to run deep. If you are curious, take a look at the Wikipedia article on "case"; I only skimmed it. I remember being asked, back when I was doing a language exchange in Australia, what the difference between は (wa) and が (ga) was.
import re

# Delimiters: tab or comma
separator = re.compile('\t|,')

# Dependency: matches a chunk header line and captures the destination index
dependancy = re.compile(r'''(?:\*\s\d+\s)  # not captured
                            (-?\d+)        # number (destination phrase index)
                          ''', re.VERBOSE)
class Morph:
    def __init__(self, line):
        # Split on tabs and commas
        cols = separator.split(line)

        self.surface = cols[0]  # surface form
        self.base = cols[7]     # base form
        self.pos = cols[1]      # part of speech
        self.pos1 = cols[2]     # part-of-speech subcategory 1
class Chunk:
    def __init__(self, morphs, dst):
        self.morphs = morphs
        self.srcs = []   # list of source phrase index numbers
        self.dst = dst   # destination phrase index number

        self.verb = ''
        self.joshi = ''

        for morph in morphs:
            if morph.pos != '記号':
                self.joshi = ''  # reset for non-symbol morphemes so that only a particle at the end of the phrase (ignoring symbols) is kept
            if morph.pos == '動詞':  # verb
                self.verb = morph.base
            if morph.pos == '助詞':  # particle
                self.joshi = morph.base
# Fill in the source indexes and add the Chunk list to the sentence list
def append_sentence(chunks, sentences):

    # Fill in the source phrase indexes
    for i, chunk in enumerate(chunks):
        if chunk.dst != -1:
            chunks[chunk.dst].srcs.append(i)
    sentences.append(chunks)
    return sentences, []
morphs = []
chunks = []
sentences = []

with open('./neko.txt.cabocha') as f:

    for line in f:
        dependancies = dependancy.match(line)

        # Neither EOS nor a dependency analysis result: a morpheme line
        if not (line == 'EOS\n' or dependancies):
            morphs.append(Morph(line))

        # EOS or a dependency analysis result, and morphemes have been accumulated
        elif len(morphs) > 0:
            chunks.append(Chunk(morphs, dst))
            morphs = []

        # A dependency analysis result
        if dependancies:
            dst = int(dependancies.group(1))

        # EOS with accumulated chunks: the sentence is complete
        if line == 'EOS\n' and len(chunks) > 0:
            sentences, chunks = append_sentence(chunks, sentences)
with open('./045.result_python.txt', 'w') as out_file:
    for sentence in sentences:
        for chunk in sentence:
            if chunk.verb != '' and len(chunk.srcs) > 0:

                # Build the list of particles from the source phrases
                sources = [sentence[source].joshi for source in chunk.srcs if sentence[source].joshi != '']

                if len(sources) > 0:
                    sources.sort()
                    out_file.write('{}\t{}\n'.format(chunk.verb, ' '.join(sources)))
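As a side note, the two regular expressions can be checked in isolation. The following minimal sketch is not part of the answer program; the sample lines are only illustrative of CaboCha's lattice-format chunk headers and MeCab's morpheme output.

python
import re

separator = re.compile('\t|,')
dependancy = re.compile(r'(?:\*\s\d+\s)(-?\d+)')

# Illustrative chunk header in CaboCha's lattice format:
# "* <phrase index> <destination index>D <head>/<func> <score>"
header = '* 0 2D 0/1 -1.911675\n'
print(dependancy.match(header).group(1))  # -> '2' (index of the destination phrase)

# Illustrative morpheme line in MeCab (IPAdic) format
morpheme = '吾輩\t名詞,代名詞,一般,*,*,*,吾輩,ワガハイ,ワガハイ\n'
cols = separator.split(morpheme)
print(cols[0], cols[1], cols[7])  # -> surface form, part of speech, base form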
The following is the UNIX command part. It was my first time using the grep command, and it is convenient.
UNIX command section
# Sort, deduplicate with counts, then sort in descending order
sort 045.result_python.txt | uniq --count | sort --numeric-sort --reverse > "045.result_1_all.txt"

# Extract lines starting with "する" (do) followed by whitespace, sort, deduplicate with counts, sort in descending order
grep "^する\s" 045.result_python.txt | sort | uniq --count | sort --numeric-sort --reverse > "045.result_2_する.txt"

# Extract lines starting with "見る" (see) followed by whitespace, sort, deduplicate with counts, sort in descending order
grep "^見る\s" 045.result_python.txt | sort | uniq --count | sort --numeric-sort --reverse > "045.result_3_見る.txt"

# Extract lines starting with "与える" (give) followed by whitespace, sort, deduplicate with counts, sort in descending order
grep "^与える\s" 045.result_python.txt | sort | uniq --count | sort --numeric-sort --reverse > "045.result_4_与える.txt"
The Chunk class stores the base forms of the verb and the particle. If one phrase contains more than one verb, the later one wins. A case particle should come at the end of the phrase, but there is a conditional branch that accounts for trailing symbols (punctuation).
python
class Chunk:
    def __init__(self, morphs, dst):
        self.morphs = morphs
        self.srcs = []   # list of source phrase index numbers
        self.dst = dst   # destination phrase index number

        self.verb = ''
        self.joshi = ''

        for morph in morphs:
            if morph.pos != '記号':
                self.joshi = ''  # reset for non-symbol morphemes so that only a particle at the end of the phrase (ignoring symbols) is kept
            if morph.pos == '動詞':  # verb
                self.verb = morph.base
            if morph.pos == '助詞':  # particle
                self.joshi = morph.base
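Strictly speaking, the problem statement asks for the base form of the leftmost verb, whereas the loop above lets a later verb overwrite an earlier one. If you wanted to follow the specification literally, a minimal sketch of the difference (using a made-up, simplified phrase) would be:

python
# Hypothetical, simplified morphemes of one phrase: (part of speech, base form) pairs
phrase = [('動詞', '始める'), ('動詞', '見る'), ('助詞', 'を')]

verb = ''
for pos, base in phrase:
    # keep only the first (leftmost) verb instead of letting a later one overwrite it
    if pos == '動詞' and verb == '':
        verb = base

print(verb)  # -> 始める (the Chunk code above would end up with 見る)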
The particles of the source phrases are collected with a list comprehension and sorted to satisfy the "lexicographic order" requirement. Finally, the `join` function outputs them separated by spaces. The nesting is deep, and it felt awkward to write.
python
with open('./045.result_python.txt', 'w') as out_file:
    for sentence in sentences:
        for chunk in sentence:
            if chunk.verb != '' and len(chunk.srcs) > 0:

                # Build the list of particles from the source phrases
                sources = [sentence[source].joshi for source in chunk.srcs if sentence[source].joshi != '']

                if len(sources) > 0:
                    sources.sort()
                    out_file.write('{}\t{}\n'.format(chunk.verb, ' '.join(sources)))
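As a quick illustration of the sort-and-join step (the particle values here are made up for the example):

python
sources = ['を', 'は']    # particles collected from the source phrases (made-up values)
sources.sort()            # lexicographic (code point) order: は sorts before を
print(' '.join(sources))  # -> は を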
When the program is executed, the following results will be output. Since there are many, only 10 lines are displayed here.
bash:045.result_python.txt(First 10 lines)
Be born
Tsukugato
By crying
Or
At the beginning
To see
Listen
To catch
Boil
Eat
Since there are many, only 10 lines are displayed here.
bash:045.result_1_all.txt(First 10 lines)
There are 3176
1997 Tsukugato
800
721
464 to be
330
309 I think
305 see
301
Until there are 262
bash:045.result_2_する.txt(First 10 lines)
1099
651
221
109 But
Until 86
59 What is
41
27 What is it?
Up to 24
18 as
bash:045.result_3_見る.txt(First 10 lines)
305 see
99 see
31 to see
24 Seeing
19 from seeing
11 Seeing
7 Because I see
5 to see
2 While watching
2 Just by looking
"Give" has a low frequency of appearance, and this is all.
bash:045.result_4_与える.txt
7 to give
4 to give
3 What to give
Give 1 but give
1 As to give
1 to give