This is a record of the 57th knock, "Dependency Analysis", from "Language Processing 100 Knocks 2015, Chapter 6: Processing English Texts" (.tohoku.ac.jp/nlp100/#ch6). It is the Stanford CoreNLP version of "Language Processing 100 Knocks 44: Visualization of Dependency Trees", and I reuse a lot of code from it.
Link | Remarks |
---|---|
057.Dependency analysis.ipynb | GitHub link to the answer program |
100 amateur language processing knocks: 57 | Source from which many parts were copied and pasted |
Stanford CoreNLP Official | The Stanford CoreNLP page to look at first |
type | version | Contents |
---|---|---|
OS | Ubuntu 18.04.01 LTS | Running virtually |
pyenv | 1.2.16 | I use pyenv because I sometimes use multiple Python environments |
Python | 3.8.1 | Python 3.8.1 on pyenv; packages are managed using venv |
Stanford CoreNLP | 3.9.2 | I installed it a year ago and don't remember the details; it was still the latest a year later, so I kept using it |
openJDK | 1.8.0_242 | I reused the JDK that was already installed for other purposes |
In the above environment, I use the following additional Python package, installed with regular pip.
type | version |
---|---|
pydot | 1.4.1 |
An overview of various basic natural language processing techniques through English text processing with Stanford CoreNLP.
Stanford Core NLP, Stemming, Part-of-speech tagging, Named entity recognition, Co-reference analysis, Parsing analysis, Phrase structure analysis, S-expressions
For the English text (nlp.txt), execute the following processing.
Visualize the collapsed dependencies of Stanford CoreNLP as a directed graph. For visualization, convert the dependency tree to the DOT language and use [Graphviz](http://www.graphviz.org/). Also, use pydot to visualize directed graphs directly from Python.
In Stanford CoreNLP, dependency relations are called "Dependencies", and the mechanism is described in Stanford Dependencies. There seem to be two types, and this time the target is collapsed-dependencies.
After finishing, I noticed that the directed graph would have been easier to understand if I had added the relation labels between edges (prep_on, etc.).
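Before extracting edges, it can help to check which dependency representations the CoreNLP XML actually contains. This is a minimal sketch (not from the original article) that uses an inline XML fragment standing in for `nlp.txt.xml`; the fragment's contents are illustrative assumptions:

```python
import xml.etree.ElementTree as ET

# Hypothetical inline excerpt standing in for nlp.txt.xml.
xml_text = """
<root><document><sentences>
  <sentence id="1">
    <dependencies type="basic-dependencies"/>
    <dependencies type="collapsed-dependencies"/>
  </sentence>
</sentences></document></root>
"""

root = ET.fromstring(xml_text)
# List the type attribute of every dependencies element in the first sentence.
types = [d.get('type')
         for d in root.iterfind('./document/sentences/sentence/dependencies')]
print(types)
# → ['basic-dependencies', 'collapsed-dependencies']
```

The `[@type="..."]` predicate used later in the knock simply filters this list down to one representation.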
```python
import xml.etree.ElementTree as ET

import pydot

for i, sentence in enumerate(
        ET.parse('./nlp.txt.xml').iterfind('./document/sentences/sentence')):
    edges = []
    for dependency in sentence.iterfind(
            './dependencies[@type="collapsed-dependencies"]/dep'):
        # Exclude punctuation
        if dependency.get('type') != 'punct':
            governor = dependency.find('./governor')
            dependent = dependency.find('./dependent')
            edges.append(((governor.get('idx'), governor.text),
                          (dependent.get('idx'), dependent.text)))
    if len(edges) > 0:
        graph = pydot.graph_from_edges(edges, directed=True)
        graph.write_jpeg('057.graph_{}.jpeg'.format(i))
    if i > 5:
        break
```
The table below maps the paths in the XML file to the target governor and dependent. The dependencies tag at the 5th level is filtered to those whose type attribute is collapsed-dependencies.
output | 1st level | 2nd level | 3rd level | 4th level | 5th level | 6th level | 7th level |
---|---|---|---|---|---|---|---|
Governor | root | document | sentences | sentence | dependencies | dep | governor |
Dependent | root | document | sentences | sentence | dependencies | dep | dependent |
The XML file is located at [GitHub](https://github.com/YoheiFukuhara/nlp100/blob/master/06.%E8%8B%B1%E8%AA%9E%E3%83%86%E3%82%AD%E3%82%B9%E3%83%88%E3%81%AE%E5%87%A6%E7%90%86/nlp.txt.xml).
xml:nlp.txt.xml (Excerpt)

```xml
<root>
  <document>
    <docId>nlp.txt</docId>
    <sentences>
      <sentence id="1">
        --Omission--
        <dependencies type="collapsed-dependencies">
          <dep type="root">
            <governor idx="0">ROOT</governor>
            <dependent idx="18">field</dependent>
          </dep>
          <dep type="amod">
            <governor idx="3">processing</governor>
            <dependent idx="1">Natural</dependent>
          </dep>
          <dep type="compound">
            <governor idx="3">processing</governor>
            <dependent idx="2">language</dependent>
          </dep>
```
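The excerpt above can be parsed with ElementTree to recover the (governor, dependent) pairs the knock uses. This is a self-contained sketch: the XML is an inline copy of the excerpt, with the closing tags (absent from the excerpt) added so it parses:

```python
import xml.etree.ElementTree as ET

# Inline copy of the excerpt above, closed so that it is well-formed XML.
xml_text = """
<root><document><sentences><sentence id="1">
  <dependencies type="collapsed-dependencies">
    <dep type="root"><governor idx="0">ROOT</governor><dependent idx="18">field</dependent></dep>
    <dep type="amod"><governor idx="3">processing</governor><dependent idx="1">Natural</dependent></dep>
    <dep type="compound"><governor idx="3">processing</governor><dependent idx="2">language</dependent></dep>
  </dependencies>
</sentence></sentences></document></root>
"""

sentence = ET.fromstring(xml_text).find('./document/sentences/sentence')
edges = []
for dep in sentence.iterfind('./dependencies[@type="collapsed-dependencies"]/dep'):
    gov = dep.find('./governor')
    dpd = dep.find('./dependent')
    # Each edge is ((governor idx, governor word), (dependent idx, dependent word)).
    edges.append(((gov.get('idx'), gov.text), (dpd.get('idx'), dpd.text)))
print(edges)
# → [(('0', 'ROOT'), ('18', 'field')), (('3', 'processing'), ('1', 'Natural')),
#    (('3', 'processing'), ('2', 'language'))]
```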
Here is the code part. What it does is the same as "Language Processing 100 Knocks 44: Visualization of Dependency Trees", so I won't explain it again. However, I regret that I should have added the relation labels between edges using Graphviz or networkx. I think it could be written by referring to the articles "Drawing multigraphs and beautiful graphs with networkx [python]" and "Drawing beautiful graphs using Graphviz on Python". In the first place, pydot has not been updated since December 2018, so I am worried about its future.
```python
for i, sentence in enumerate(
        ET.parse('./nlp.txt.xml').iterfind('./document/sentences/sentence')):
    edges = []
    for dependency in sentence.iterfind(
            './dependencies[@type="collapsed-dependencies"]/dep'):
        # Exclude punctuation
        if dependency.get('type') != 'punct':
            governor = dependency.find('./governor')
            dependent = dependency.find('./dependent')
            edges.append(((governor.get('idx'), governor.text),
                          (dependent.get('idx'), dependent.text)))
    if len(edges) > 0:
        graph = pydot.graph_from_edges(edges, directed=True)
        graph.write_jpeg('057.graph_{}.jpeg'.format(i))
```
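As for the regret about edge labels: one way to get them without networkx is to emit DOT source directly, carrying the dep type attribute (amod, prep_on, etc.) along as an edge label. This is a sketch of that idea, not the original article's code; `edges_to_dot` and the triple format are hypothetical names I introduce here:

```python
def edges_to_dot(labeled_edges):
    """Build Graphviz DOT source from
    ((gov_idx, gov_word), (dep_idx, dep_word), label) triples."""
    lines = ['digraph dependencies {']
    for (g_idx, g_word), (d_idx, d_word), label in labeled_edges:
        # Node names combine token index and word so repeated words stay distinct.
        lines.append('  "{} {}" -> "{} {}" [label="{}"];'.format(
            g_idx, g_word, d_idx, d_word, label))
    lines.append('}')
    return '\n'.join(lines)

dot = edges_to_dot([(('3', 'processing'), ('1', 'Natural'), 'amod'),
                    (('3', 'processing'), ('2', 'language'), 'compound')])
print(dot)
```

The resulting string can be written to a `.dot` file and rendered with the `dot` command, so pydot is not strictly required for the labeled version.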
When the program is executed, the following results are output (only the first three sentences are shown).