Language Processing 100 Knocks 2015, "Chapter 6: Processing English texts": this is the record of the 58th knock, "Tuple Extraction". The previous knock visualized the entire dependency tree; this time we extract and output only a specific dependency relation. About 80% of it is the same as what we did last time.
Link | Remarks |
---|---|
058.Extraction of tuples.ipynb | GitHub link to the answer program |
100 amateur language processing knocks:58 | Source of many copied-and-pasted parts |
Stanford Core NLP Official | Stanford Core NLP page to look at first |
type | version | Contents |
---|---|---|
OS | Ubuntu18.04.01 LTS | Running in a virtual machine |
pyenv | 1.2.16 | I use pyenv because I sometimes use multiple Python environments |
Python | 3.8.1 | Python 3.8.1 on pyenv. Packages are managed with venv |
Stanford CoreNLP | 3.9.2 | Installed a year ago and I don't remember the details... It was still the latest a year later, so I used it as it was |
openJDK | 1.8.0_242 | I used the JDK that was already installed for other purposes |
An overview of various basic technologies of natural language processing through English text processing using Stanford Core NLP.
Stanford Core NLP, Stemming, Part-of-speech tagging, Named entity recognition, Co-reference analysis, Parsing analysis, Phrase structure analysis, S-expressions
For the English text (nlp.txt), execute the following processing.
Output the set of "subject predicate object" in tab-delimited format based on the result of the dependency analysis (collapsed-dependencies) of Stanford Core NLP. However, refer to the following for the definitions of subject, predicate, and object.
- Predicate: a word that has children (dependents) with both nsubj and dobj relations
- Subject: the child (dependent) related to the predicate by nsubj
- Object: the child (dependent) related to the predicate by dobj
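To make these definitions concrete, here is a minimal sketch that applies them to a hand-written list of dependency triples. The sample sentence, the relation list, and the variable names are made up for illustration; this is not CoreNLP output and not part of the answer program below.

```python
# Minimal illustration of the predicate/subject/object definitions above.
# The triples are hand-written for an assumed sentence "Turing published an article".
deps = [
    ('nsubj', 'published', 'Turing'),   # subject relation
    ('dobj', 'published', 'article'),   # object relation
    ('det', 'article', 'an'),           # ignored: neither nsubj nor dobj
]

tuples = {}  # predicate -> [predicate, subject, object]
for dep_type, governor, dependent in deps:
    if dep_type in ('nsubj', 'dobj'):
        entry = tuples.setdefault(governor, [governor, '', ''])
        entry[1 if dep_type == 'nsubj' else 2] = dependent

for predicate, subject, obj in tuples.values():
    if subject and obj:
        print(predicate, subject, obj, sep='\t')  # -> published	Turing	article
```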
When I heard "tuple", I thought of Python's tuple, but here it means something different. First, the Wikipedia article "Tuple" describes it as follows: a **"set of multiple components"**.
A tuple (English: tuple) is a general concept that collectively refers to a set consisting of multiple components.
Stanford CoreNLP mentions tuples on its Stanford Open Information Extraction page.
Open information extraction (open IE) refers to the extraction of relation tuples, typically binary relations, from plain text, such as (Mark Zuckerberg; founded; Facebook).
The figure on the same page also makes the idea of a "tuple" easy to understand.
```python
import xml.etree.ElementTree as ET

texts = []

# Sentence enumeration: process one sentence at a time
for sentence in ET.parse('./nlp.txt.xml').iterfind('./document/sentences/sentence'):
    output = {}

    # Dependency enumeration
    for dep in sentence.iterfind('./dependencies[@type="collapsed-dependencies"]/dep'):

        # Relation check
        dep_type = dep.get('type')
        if dep_type == 'nsubj' or dep_type == 'dobj':

            # Add to the predicate dictionary
            governor = dep.find('./governor')
            index = governor.get('idx')
            if index in output:
                texts = output[index]
            else:
                texts = [governor.text, '', '']

            # Store as subject or object (for the same predicate, the later one wins)
            if dep_type == 'nsubj':
                texts[1] = dep.find('./dependent').text
            else:
                texts[2] = dep.find('./dependent').text
            output[index] = texts

    for key, texts in output.items():
        if texts[1] != '' and texts[2] != '':
            print(sentence.get('id'), '\t', '\t'.join(texts))
```
The table below maps the paths in the XML file to the dependency source and destination we want to output. At the 5th level, the `dependencies` tag whose `type` attribute is `collapsed-dependencies` is targeted. At the 6th level, the `dep` tags whose `type` attribute is `nsubj` or `dobj` are targeted.
Output | 1st level | 2nd level | 3rd level | 4th level | 5th level | 6th level | 7th level |
---|---|---|---|---|---|---|---|
Dependency source | root | document | sentences | sentence | dependencies | dep | governor |
Dependency destination | root | document | sentences | sentence | dependencies | dep | dependent |
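One quick way to confirm this mapping is to print the `type` attributes that the XPath expression actually selects for the first sentence. The snippet below is only a throwaway check, assuming nlp.txt.xml is in the current directory; the exact counts depend on your file.

```python
import xml.etree.ElementTree as ET
from collections import Counter

tree = ET.parse('./nlp.txt.xml')
first_sentence = tree.find('./document/sentences/sentence')  # sentence id="1"

# Count the dep types inside the collapsed-dependencies block of sentence 1,
# i.e. the 6th-level "type" attribute from the table above.
counts = Counter(
    dep.get('type')
    for dep in first_sentence.iterfind('./dependencies[@type="collapsed-dependencies"]/dep')
)
print(counts.most_common())
print('nsubj:', counts['nsubj'], 'dobj:', counts['dobj'])
```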
The XML file is located on [GitHub](https://github.com/YoheiFukuhara/nlp100/blob/master/06.%E8%8B%B1%E8%AA%9E%E3%83%86%E3%82%AD%E3%82%B9%E3%83%88%E3%81%AE%E5%87%A6%E7%90%86/nlp.txt.xml).
```xml:nlp.txt.xml(Excerpt)
<root>
  <document>
    <docId>nlp.txt</docId>
    <sentences>
      <sentence id="1">
--Omission--
        <dependencies type="collapsed-dependencies">
          <dep type="root">
            <governor idx="0">ROOT</governor>
            <dependent idx="18">field</dependent>
          </dep>
--Omission--
          <dep type="nsubj">
            <governor idx="18">field</governor>
            <dependent idx="12">processing</dependent>
          </dep>
--Omission--
          <dep type="dobj">
            <governor idx="13">enabling</governor>
            <dependent idx="14">computers</dependent>
          </dep>
```
This is the part that creates the dictionary variable `output` for each sentence. The index of the predicate (`governor`) is used as the dictionary key, and the value is a list whose entries are the predicate text, the subject text, and the object text. If the same predicate has more than one subject or object, the later one wins (it overwrites the earlier one).
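The "later one wins" behavior simply comes from overwriting a slot of the list stored under the predicate's index. The toy example below uses made-up values (not CoreNLP output) to show what happens when the same predicate index receives two nsubj children.

```python
# Toy demonstration of "the later one wins": when predicate index '5'
# receives a second nsubj dependent, it overwrites the first one.
# All values here are made up for illustration.
output = {}
deps = [
    ('nsubj', '5', 'published', 'Turing'),
    ('dobj', '5', 'published', 'article'),
    ('nsubj', '5', 'published', 'He'),  # second subject for the same predicate
]

for dep_type, idx, governor_text, dependent_text in deps:
    texts = output[idx] if idx in output else [governor_text, '', '']
    if dep_type == 'nsubj':
        texts[1] = dependent_text
    else:
        texts[2] = dependent_text
    output[idx] = texts

print(output)  # {'5': ['published', 'He', 'article']} -- 'He' replaced 'Turing'
```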
```python
    # Dependency enumeration
    for dep in sentence.iterfind('./dependencies[@type="collapsed-dependencies"]/dep'):

        # Relation check
        dep_type = dep.get('type')
        if dep_type == 'nsubj' or dep_type == 'dobj':

            # Add to the predicate dictionary
            governor = dep.find('./governor')
            index = governor.get('idx')
            if index in output:
                texts = output[index]
            else:
                texts = [governor.text, '', '']

            # Store as subject or object (for the same predicate, the later one wins)
            if dep_type == 'nsubj':
                texts[1] = dep.find('./dependent').text
            else:
                texts[2] = dep.find('./dependent').text
            output[index] = texts
```
If both a subject and an object are present, the tuple is output.
```python
    for key, texts in output.items():
        if texts[1] != '' and texts[2] != '':
            print(sentence.get('id'), '\t', '\t'.join(texts))
```
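One small detail about this print call: `print` uses a space as its default separator, so the explicit `'\t'` argument ends up surrounded by spaces and the sentence id is not followed by a bare tab. The short demo below, with dummy values, shows the difference; joining the whole row first is one way to get strictly tab-delimited output, if that is preferred.

```python
# Demonstration of the print separator detail, with dummy values.
sentence_id = '5'
texts = ['published', 'Turing', 'article']

# Same style as above: the default sep=' ' adds spaces around the explicit '\t'.
print(sentence_id, '\t', '\t'.join(texts))

# Strictly tab-delimited alternative: build the whole row, then print it once.
print('\t'.join([sentence_id] + texts))
```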
When the program is executed, the following results will be output.
Output result

```
3 involve understanding generation
5 published Turing article
6 involved experiment translation
11 provided ELIZA interaction
12 exceeded patient base
12 provide ELIZA response
14 structured which information
19 discouraged underpinnings sort
19 underlies that approach
20 produced Some systems
21 make which decisions
23 contains that errors
34 involved implementations coding
38 take algorithms set
39 produced Some systems
40 make which decisions
41 have models advantage
41 express they certainty
42 have Systems advantages
43 make procedures use
44 make that decisions
```
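As an aside, the same extraction logic can also be packaged as a small generator function that yields the tuples instead of printing them, which makes it easier to reuse or test. This is only a sketch of one possible refactor, assuming nlp.txt.xml is in the current directory; it is not the answer program shown above.

```python
import xml.etree.ElementTree as ET


def extract_tuples(xml_path='./nlp.txt.xml'):
    """Yield (sentence id, predicate, subject, object) for every predicate
    that has both an nsubj and a dobj child in the collapsed dependencies."""
    for sentence in ET.parse(xml_path).iterfind('./document/sentences/sentence'):
        output = {}
        for dep in sentence.iterfind('./dependencies[@type="collapsed-dependencies"]/dep'):
            dep_type = dep.get('type')
            if dep_type in ('nsubj', 'dobj'):
                governor = dep.find('./governor')
                entry = output.setdefault(governor.get('idx'), [governor.text, '', ''])
                entry[1 if dep_type == 'nsubj' else 2] = dep.find('./dependent').text
        for predicate, subject, obj in output.values():
            if subject and obj:
                yield sentence.get('id'), predicate, subject, obj


if __name__ == '__main__':
    for row in extract_tuples():
        print('\t'.join(row))
```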