This is a record of problem 44, "Visualization of dependency trees", from ["Chapter 5: Dependency analysis"](http://www.cl.ecei.tohoku.ac.jp/nlp100/#ch5) of Language Processing 100 Knocks 2015. Visualizing the tree makes it very easy to see how the phrases of a sentence depend on one another. Once the dependencies are visualized, you can also do fun things like the article "I tried to linguistically analyze Karen Takizawa's incomprehensible sentences.".
| Link | Remarks |
|---|---|
| 044.Visualization of dependent trees.ipynb | GitHub link to the answer program |
| 100 amateur language processing knocks:44 | Source from which I copied many parts |
| CaboCha official | The CaboCha page to look at first |
I installed CRF++ and CaboCha so long ago that I have forgotten how to install them. Since neither package is updated anymore, I have not rebuilt the environment. I only have a frustrating memory of trying to use CaboCha on Windows. I think I could not use it on 64-bit Windows (my memory is vague, and the problem may have been my own technical skill).
| Type | Version | Notes |
|---|---|---|
| OS | Ubuntu 18.04.01 LTS | Running in a virtual machine |
| pyenv | 1.2.16 | I use pyenv because I sometimes use multiple Python environments |
| Python | 3.8.1 | Python 3.8.1 on pyenv; packages are managed with venv |
| MeCab | 0.996-5 | Installed with apt-get |
| CRF++ | 0.58 | Too old, and I forgot how I installed it (probably `make install`) |
| CaboCha | 0.69 | Too old, and I forgot how I installed it (probably `make install`) |
On top of the environment above, I use the following additional Python package. It can be installed with plain pip.

| Type | Version |
|---|---|
| pydot | 1.4.1 |
Apply the dependency analyzer CaboCha to "I am a cat" and get hands-on experience with dependency trees and syntactic analysis.
Class, Dependency Parsing, CaboCha, Clause, Dependency, Case, Functional Verb Parsing, Dependency Path, [Graphviz](http://www.graphviz.org/)
Use CaboCha to run dependency analysis on the text (neko.txt) of Natsume Soseki's novel "I am a cat", and save the result in a file called neko.txt.cabocha. Use this file to implement programs that answer the following questions.
Visualize the dependency tree of a given sentence as a directed graph. For visualization, convert the dependency tree to the DOT language and use [Graphviz](http://www.graphviz.org/). Alternatively, to visualize directed graphs directly from Python, use pydot.
The problem statement seems to describe two ways of visualizing the tree. I am ignoring the first one, quoted here, and I have not even checked whether it is easy. That is fine, because "Amateur language processing 100 knocks: 44", which I always refer to for these knocks, did not use it either.
"For visualization, convert the dependency tree to the DOT language and use [Graphviz](http://www.graphviz.org/)."
This time I used the second method, quoted next. With it, all you have to do is install pydot with pip and pass the edges to a pydot function from Python.
"Alternatively, to visualize directed graphs directly from Python, use pydot."
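For reference, the first method (writing DOT text yourself) could look roughly like the sketch below. The function name `edges_to_dot` and the sample chunk pairs are hypothetical and are not part of the answer program; the result is plain DOT text that Graphviz can render.

```python
def quote(label):
    """Wrap a label in double quotes, escaping any double quotes inside it."""
    return '"' + label.replace('"', '\\"') + '"'


def edges_to_dot(edges):
    """Build DOT text for a directed graph from (source, target) label pairs."""
    lines = ['digraph dependency {']
    for src, dst in edges:
        # "->" is the DOT operator for a directed edge.
        lines.append('    {} -> {};'.format(quote(src), quote(dst)))
    lines.append('}')
    return '\n'.join(lines)


# Hypothetical chunk pairs, only to show the shape of the output.
sample_edges = [('I', 'am'), ('a cat', 'am')]
dot_text = edges_to_dot(sample_edges)
print(dot_text)
```

If the text is saved as `sample.dot`, running `dot -Tpng sample.dot -o sample.png` should render it, assuming Graphviz is installed. Note that identical labels collapse into a single node in DOT, so a real implementation would probably need to make each chunk's node name unique, for example by including its index.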
First, there is a field called [**graph theory**](https://ja.wikipedia.org/wiki/%E3%82%B0%E3%83%A9%E3%83%95%E7%90%86%E8%AB%96).
"Graph theory is a mathematical theory of graphs, which consist of a set of nodes (vertices) and a set of edges (branches / sides)."
The [definitions of directed and undirected graphs](https://ja.wikipedia.org/wiki/%E3%82%B0%E3%83%A9%E3%83%95%E7%90%86%E8%AB%96#%E6%A6%82%E8%A6%81) are roughly as follows (a "directed graph" is one whose edges have a direction). Please follow the link for details.
"If you want to consider not only how nodes are connected but also "from which to which", add an arrow to each edge. Such a graph is called a directed graph or a digraph. A graph without arrows is called an undirected graph."
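To make the difference concrete, here is a small sketch that stores the same edge list once as a directed graph and once as an undirected graph. The function name `adjacency` and the node names are made up for illustration.

```python
from collections import defaultdict


def adjacency(edges, directed=True):
    """Return a dict mapping each node to the sorted list of nodes it connects to."""
    result = defaultdict(set)
    for src, dst in edges:
        result[src].add(dst)
        result[dst]  # touch the key so nodes without outgoing edges also appear
        if not directed:
            # Without arrows, the connection also counts in the other direction.
            result[dst].add(src)
    return {node: sorted(targets) for node, targets in result.items()}


edges = [('A', 'B'), ('B', 'C')]
directed = adjacency(edges, directed=True)
undirected = adjacency(edges, directed=False)
print(directed)    # {'A': ['B'], 'B': ['C'], 'C': []}
print(undirected)  # {'A': ['B'], 'B': ['A', 'C'], 'C': ['B']}
```

A dependency tree is the directed case: each chunk points at the chunk it depends on, and not the other way round.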
```python
import re
from subprocess import run, PIPE

import pydot

# Delimiter: tab or comma
separator = re.compile('\t|,')

# Dependency header line
dependancy = re.compile(r'''(?:\*\s\d+\s) # Not captured
                            (-?\d+)       # Number (destination chunk)
                          ''', re.VERBOSE)

text = input('Please enter text')

# Default value
if len(text) == 0:
    text = "I don't remember exactly whether I said it or not, but I think I probably said it when I had a hand-wound party the other day, without feeling like I said it a little. I tried it, but I came to think that it doesn't matter whether I say it or not."

cmd = 'echo {} | cabocha -f1'.format(text)
proc = run(cmd, shell=True, stdout=PIPE, stderr=PIPE)
print(proc.stdout.decode('UTF-8'))


class Chunk:
    def __init__(self, phrase, dst):
        self.phrase = phrase
        self.dst = dst  # Index of the destination chunk


phrase = ''
chunks = []

for line in proc.stdout.decode('UTF-8').splitlines():
    dependancies = dependancy.match(line)

    # Neither EOS nor a dependency header line, i.e. a morpheme line
    # (note that EOS has no trailing line break)
    if not (line == 'EOS' or dependancies):
        # Split on tabs and commas
        cols = separator.split(line)
        phrase += cols[0]  # Surface form

    # EOS or a dependency header line, with morphemes already collected
    elif phrase != '':
        chunks.append(Chunk(phrase, dst))
        phrase = ''

    # Dependency header line: remember the destination for the next chunk
    if dependancies:
        dst = int(dependancies.group(1))

# Convert chunks that have a destination into the format passed to pydot
edges = []
for i, chunk in enumerate(chunks):
    if chunk.dst != -1 and \
       chunk.phrase != '' and \
       chunks[chunk.dst].phrase != '':
        edges.append(((i, chunk.phrase), (chunk.dst, chunks[chunk.dst].phrase)))

# Save the image as a directed graph with pydot
if len(edges) > 0:
    graph = pydot.graph_from_edges(edges, directed=True)
    graph.write_png('044.dot.png')
```
The "given sentence" in the problem is supplied through the `input` function (I am not sure whether this matches the intent of the problem). If nothing is entered, a default sentence is used.
```python
text = input('Please enter text')

# Default value
if len(text) == 0:
    text = "I don't remember exactly whether I said it or not, but I think I probably said it when I had a hand-wound party the other day, without feeling like I said it a little. I tried it, but I came to think that it doesn't matter whether I say it or not."
```
CaboCha is executed through the shell using the `run` function of the `subprocess` package. I did not use CaboCha's Python wrapper, simply because it was a hassle.
```python
cmd = 'echo {} | cabocha -f1'.format(text)
proc = run(cmd, shell=True, stdout=PIPE, stderr=PIPE)
print(proc.stdout.decode('UTF-8'))
```
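Because the text is pasted directly into a shell command, quotes or other shell metacharacters in the input can break it. One possible alternative, which is not what the answer program does, is to skip the shell and feed the text through standard input. The helper name `run_filter` is hypothetical, and the sketch assumes `cabocha -f1` reads sentences from standard input, as it does in the `echo ... | cabocha -f1` pipeline.

```python
import sys
from subprocess import run, PIPE


def run_filter(text, command):
    """Send text to a command's standard input and return its decoded output."""
    # A trailing newline is added because echo would have added one too.
    proc = run(command, input=(text + '\n').encode('UTF-8'),
               stdout=PIPE, stderr=PIPE)
    return proc.stdout.decode('UTF-8')


# Stand-in command that copies standard input to standard output,
# so the sketch runs even where CaboCha is not installed.
echo_stdin = [sys.executable, '-c',
              'import sys; sys.stdout.buffer.write(sys.stdin.buffer.read())']
result = run_filter("it's; $HOME", echo_stdin)
print(result)

# With CaboCha installed, the call would presumably be:
# run_filter(text, ['cabocha', '-f1'])
```

Since no shell is involved, the apostrophe, semicolon and `$HOME` in the sample reach the command unchanged.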
The first part of what the `print` function outputs is shown below (partial print result).

```
* 0 1D 0/4 0.285960
Say verb,Independence,*,*,Godan / Wa line reminder,Continuous connection,To tell,It,It
Particles,Connection particle,*,*,*,*,hand,Te,Te
A verb,Non-independent,*,*,Five steps, La line,Continuous connection,is there,Ah,Ah
Auxiliary verb,*,*,*,Special,Uninflected word,Ta,Ta,Ta
Ka particle,Sub-particles / parallel particles / final particles,*,*,*,*,Or,Mosquito,Mosquito
* 1 4D 0/4 2.230543
Say verb,Independence,*,*,Godan / Wa line reminder,Continuous connection,To tell,It,It
Verb,Non-independent,*,*,One step,Imperfective form,Teru,Te,Te
No auxiliary verb,*,*,*,Special Nai,Continuous connection,Absent,Naka,Naka
Auxiliary verb,*,*,*,Special,Uninflected word,Ta,Ta,Ta
Ka particle,Sub-particles / parallel particles / final particles,*,*,*,*,Or,Mosquito,Mosquito
* 2 4D 0/3 2.418727
Which noun,Pronoun,General,*,*,*,Which,Dotch,Dotch
Auxiliary verb,*,*,*,Special,Continuous connection,Is,Dad,Dad
Auxiliary verb,*,*,*,Special,Uninflected word,Ta,Ta,Ta
Ka particle,Sub-particles / parallel particles / final particles,*,*,*,*,Or,Mosquito,Mosquito
```
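In this output, each line starting with `*` opens a new chunk: the first number is the chunk index, and the number before `D` is the index of the chunk it depends on (`-1` for the root). The other lines are morphemes, with the surface form before the tab. Here is a minimal sketch of reading that layout, arranged a little differently from the answer program. The function name `parse_lattice` and the sample tokens are made up, and the sample is hand-written rather than real CaboCha output.

```python
import re

# Header line such as "* 0 1D 0/4 0.285960":
# capture the chunk index and the destination index.
header = re.compile(r'\*\s(\d+)\s(-?\d+)D')


def parse_lattice(output):
    """Return a list of (phrase, destination index) tuples, one per chunk."""
    chunks = []
    for line in output.splitlines():
        match = header.match(line)
        if match:
            # Start a new chunk with an empty phrase.
            chunks.append(['', int(match.group(2))])
        elif line != 'EOS' and chunks:
            # Morpheme line: the surface form is the text before the tab.
            chunks[-1][0] += line.split('\t')[0]
    return [tuple(chunk) for chunk in chunks]


# Hand-written sample in the same shape as the -f1 output.
sample = (
    '* 0 1D 0/1 0.000000\n'
    'AA\tnoun,*\n'
    'bb\tparticle,*\n'
    '* 1 -1D 0/0 0.000000\n'
    'CC\tverb,*\n'
    'EOS\n'
)
print(parse_lattice(sample))  # [('AAbb', 1), ('CC', -1)]
```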
```python
# Convert chunks that have a destination into the format passed to pydot
edges = []
for i, chunk in enumerate(chunks):
    if chunk.dst != -1 and \
       chunk.phrase != '' and \
       chunks[chunk.dst].phrase != '':
        edges.append(((i, chunk.phrase), (chunk.dst, chunks[chunk.dst].phrase)))
```
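To see what this loop does on a small scale, here is a self-contained sketch with three made-up chunks. A `namedtuple` stands in for the `Chunk` class, and the phrases are placeholders. The last chunk is the root (`dst == -1`), so it appears only as a destination, never as a source.

```python
from collections import namedtuple

# Stand-in for the Chunk class: a phrase and the index of its destination chunk.
Chunk = namedtuple('Chunk', ['phrase', 'dst'])

# Made-up chunks: the first two both depend on the third, which is the root.
chunks = [Chunk('AA', 2), Chunk('BB', 2), Chunk('CC', -1)]

edges = []
for i, chunk in enumerate(chunks):
    # Skip the root and any chunk whose own or destination phrase is empty.
    if chunk.dst != -1 and chunk.phrase != '' and chunks[chunk.dst].phrase != '':
        edges.append(((i, chunk.phrase), (chunk.dst, chunks[chunk.dst].phrase)))

for edge in edges:
    print(edge)
```

Pairing each phrase with its index keeps chunks with identical text apart, since two tuples with different indices are different nodes.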
By the way, `edges` ends up with contents like this.

```
((0, 'Did you say'), (1, 'Didn't you say'))
((1, 'Didn't you say'), (4, 'I don't remember'))
((2, 'Which was'), (4, 'I don't remember'))
((3, 'Properly'), (4, 'I don't remember'))
((4, 'I don't remember'), (19, 'I thought about it,'))
((5, 'Certainly'), (7, 'Hooray'))
((6, 'A hand-wound party during this time'), (7, 'Hooray'))
((7, 'Hooray'), (8, 'Sometimes'))
((8, 'Sometimes'), (10, 'Said'))
((9, 'A little bit'), (10, 'Said'))
((10, 'Said'), (11, 'Feeling'))
((11, 'Feeling'), (12, 'Without'))
((12, 'Without'), (14, 'Nishimo'))
((13, 'Without'), (14, 'Nishimo'))
((14, 'Nishimo'), (15, 'Without'))
((15, 'Without'), (17, 'I think I said'))
((16, 'Perhaps'), (17, 'I think I said'))
((17, 'I think I said'), (19, 'I thought about it,'))
((18, 'To here'), (19, 'I thought about it,'))
((19, 'I thought about it,'), (28, 'It depends.'))
((20, 'Oh dear'), (21, 'I'll tell you'))
((21, 'I'll tell you'), (28, 'It depends.'))
((22, 'Say'), (23, 'I don't care'))
((23, 'I don't care'), (25, 'There is no problem,'))
((24, 'Up to that point'), (25, 'There is no problem,'))
((25, 'There is no problem,'), (26, 'I think'))
((26, 'I think'), (27, 'Reached'))
((27, 'Reached'), (28, 'It depends.'))
```
Finally, use the `graph_from_edges` function to build a directed graph and the `write_png` function to save it as an image. Passing `directed=True` when building the graph turns the lines between chunks into arrows.
```python
# Save the image as a directed graph with pydot
if len(edges) > 0:
    graph = pydot.graph_from_edges(edges, directed=True)
    graph.write_png('044.dot.png')
```
When the program is executed, the dependency graph is saved as the image `044.dot.png`.
By the way, the inspiration for this article was the article "[Play] Synthetic analysis of Shinkalion's ton demo mail".