This is a record of the 57th knock, "Dependency Analysis", from "Language Processing 100 Knocks 2015, Chapter 6: Processing English Texts" (.tohoku.ac.jp/nlp100/#ch6). It is the Stanford CoreNLP version of "Language Processing 100 Knocks 44: Visualization of Dependency Trees", and I reuse a lot of code from it.
Link | Remarks |
---|---|
057.Dependency analysis.ipynb | GitHub link to the answer program |
100 amateur language processing knocks: 57 | Source from which many parts were copied and pasted |
Stanford CoreNLP Official | The Stanford CoreNLP page to look at first |
type | version | Contents |
---|---|---|
OS | Ubuntu 18.04.01 LTS | Running virtually |
pyenv | 1.2.16 | I use pyenv because I sometimes use multiple Python environments |
Python | 3.8.1 | Python 3.8.1 on pyenv; packages are managed using venv |
Stanford CoreNLP | 3.9.2 | I installed it a year ago and don't remember the details; it was still the latest a year later, so I kept using it |
openJDK | 1.8.0_242 | I reused the JDK that was already installed for other purposes |
In the above environment, I use the following additional Python package, installed with regular pip.
type | version |
---|---|
pydot | 1.4.1 |
An overview of various basic natural language processing techniques through English text processing with Stanford CoreNLP.
Stanford Core NLP, Stemming, Part-of-speech tagging, Named entity recognition, Co-reference analysis, Parsing analysis, Phrase structure analysis, S-expressions
For the English text (nlp.txt), execute the following processing.
Visualize the collapsed dependencies of Stanford CoreNLP as a directed graph. For visualization, convert the dependency tree to the DOT language and use [Graphviz](http://www.graphviz.org/). Also, use pydot to visualize directed graphs directly from Python.
In Stanford CoreNLP, dependency relations are called "Dependencies", and the mechanism is described in Stanford Dependencies. There seem to be two types, and this time the target is collapsed-dependencies.
After finishing, I noticed that the directed graph would have been easier to understand if I had added the relation labels between edges (prep_on, etc.).
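Before extracting edges, it can help to check which dependency representations the CoreNLP XML actually contains. This is a minimal sketch (not from the original article) that uses an inline XML fragment standing in for `nlp.txt.xml`; the fragment's contents are illustrative assumptions:

```python
import xml.etree.ElementTree as ET

# Hypothetical inline excerpt standing in for nlp.txt.xml.
xml_text = """
<root><document><sentences>
  <sentence id="1">
    <dependencies type="basic-dependencies"/>
    <dependencies type="collapsed-dependencies"/>
  </sentence>
</sentences></document></root>
"""

root = ET.fromstring(xml_text)
# List the type attribute of every dependencies element in the first sentence.
types = [d.get('type')
         for d in root.iterfind('./document/sentences/sentence/dependencies')]
print(types)
# → ['basic-dependencies', 'collapsed-dependencies']
```

The `[@type="..."]` predicate used later in the knock simply filters this list down to one representation.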
```python
import xml.etree.ElementTree as ET

import pydot

for i, sentence in enumerate(
        ET.parse('./nlp.txt.xml').iterfind('./document/sentences/sentence')):
    edges = []
    for dependency in sentence.iterfind(
            './dependencies[@type="collapsed-dependencies"]/dep'):
        # Exclude punctuation
        if dependency.get('type') != 'punct':
            governor = dependency.find('./governor')
            dependent = dependency.find('./dependent')
            edges.append(((governor.get('idx'), governor.text),
                          (dependent.get('idx'), dependent.text)))
    if len(edges) > 0:
        graph = pydot.graph_from_edges(edges, directed=True)
        graph.write_jpeg('057.graph_{}.jpeg'.format(i))
    if i > 5:
        break
```
The table below maps the paths in the XML file to the target governor and dependent. The dependencies tag at the 5th level is filtered to those whose type attribute is collapsed-dependencies.
output | 1st level | 2nd level | 3rd level | 4th level | 5th level | 6th level | 7th level |
---|---|---|---|---|---|---|---|
Governor | root | document | sentences | sentence | dependencies | dep | governor |
Dependent | root | document | sentences | sentence | dependencies | dep | dependent |
The XML file is located at [GitHub](https://github.com/YoheiFukuhara/nlp100/blob/master/06.%E8%8B%B1%E8%AA%9E%E3%83%86%E3%82%AD%E3%82%B9%E3%83%88%E3%81%AE%E5%87%A6%E7%90%86/nlp.txt.xml).
xml:nlp.txt.xml (Excerpt)

```xml
<root>
  <document>
    <docId>nlp.txt</docId>
    <sentences>
      <sentence id="1">
        --Omission--
        <dependencies type="collapsed-dependencies">
          <dep type="root">
            <governor idx="0">ROOT</governor>
            <dependent idx="18">field</dependent>
          </dep>
          <dep type="amod">
            <governor idx="3">processing</governor>
            <dependent idx="1">Natural</dependent>
          </dep>
          <dep type="compound">
            <governor idx="3">processing</governor>
            <dependent idx="2">language</dependent>
          </dep>
```
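The excerpt above can be parsed with ElementTree to recover the (governor, dependent) pairs the knock uses. This is a self-contained sketch: the XML is an inline copy of the excerpt, with the closing tags (absent from the excerpt) added so it parses:

```python
import xml.etree.ElementTree as ET

# Inline copy of the excerpt above, closed so that it is well-formed XML.
xml_text = """
<root><document><sentences><sentence id="1">
  <dependencies type="collapsed-dependencies">
    <dep type="root"><governor idx="0">ROOT</governor><dependent idx="18">field</dependent></dep>
    <dep type="amod"><governor idx="3">processing</governor><dependent idx="1">Natural</dependent></dep>
    <dep type="compound"><governor idx="3">processing</governor><dependent idx="2">language</dependent></dep>
  </dependencies>
</sentence></sentences></document></root>
"""

sentence = ET.fromstring(xml_text).find('./document/sentences/sentence')
edges = []
for dep in sentence.iterfind('./dependencies[@type="collapsed-dependencies"]/dep'):
    gov = dep.find('./governor')
    dpd = dep.find('./dependent')
    # Each edge is ((governor idx, governor word), (dependent idx, dependent word)).
    edges.append(((gov.get('idx'), gov.text), (dpd.get('idx'), dpd.text)))
print(edges)
# → [(('0', 'ROOT'), ('18', 'field')), (('3', 'processing'), ('1', 'Natural')),
#    (('3', 'processing'), ('2', 'language'))]
```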
Here is the code part. What it does is the same as "Language Processing 100 Knocks 44: Visualization of Dependency Trees", so I won't explain it again. However, I regret that I should have added the relation labels between edges using Graphviz or networkx. I think it could be written by referring to the articles "Drawing multigraphs and beautiful graphs with networkx [python]" and "Drawing beautiful graphs using Graphviz on Python". In the first place, pydot has not been updated since December 2018, so I am worried about its future.
```python
for i, sentence in enumerate(
        ET.parse('./nlp.txt.xml').iterfind('./document/sentences/sentence')):
    edges = []
    for dependency in sentence.iterfind(
            './dependencies[@type="collapsed-dependencies"]/dep'):
        # Exclude punctuation
        if dependency.get('type') != 'punct':
            governor = dependency.find('./governor')
            dependent = dependency.find('./dependent')
            edges.append(((governor.get('idx'), governor.text),
                          (dependent.get('idx'), dependent.text)))
    if len(edges) > 0:
        graph = pydot.graph_from_edges(edges, directed=True)
        graph.write_jpeg('057.graph_{}.jpeg'.format(i))
```
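As for the regret about edge labels: one way to get them without networkx is to emit DOT source directly, carrying the dep type attribute (amod, prep_on, etc.) along as an edge label. This is a sketch of that idea, not the original article's code; `edges_to_dot` and the triple format are hypothetical names I introduce here:

```python
def edges_to_dot(labeled_edges):
    """Build Graphviz DOT source from
    ((gov_idx, gov_word), (dep_idx, dep_word), label) triples."""
    lines = ['digraph dependencies {']
    for (g_idx, g_word), (d_idx, d_word), label in labeled_edges:
        # Node names combine token index and word so repeated words stay distinct.
        lines.append('  "{} {}" -> "{} {}" [label="{}"];'.format(
            g_idx, g_word, d_idx, d_word, label))
    lines.append('}')
    return '\n'.join(lines)

dot = edges_to_dot([(('3', 'processing'), ('1', 'Natural'), 'amod'),
                    (('3', 'processing'), ('2', 'language'), 'compound')])
print(dot)
```

The resulting string can be written to a `.dot` file and rendered with the `dot` command, so pydot is not strictly required for the labeled version.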
When the program is executed, the following results are output (only the first three sentences are shown).