This is a record of knock 59, "S-expression analysis", from "Chapter 6: Processing English texts" of Language Processing 100 Knocks 2015 (.tohoku.ac.jp/nlp100/#ch6). The task is to write a parser for a format called the "S-expression". This was the first time I had thought about writing a parser, and it turned out to be a deep topic. This knock took me a very long time. The finished program is only about 50 lines, but it leaves a lot of room for efficiency improvements. This time I gave up on efficiency and kept the code as simple as possible.
Link | Remarks |
---|---|
059.Analysis of S-expressions.ipynb | GitHub link to the answer program |
100 amateur language processing knocks:59 | Source from which I copied and pasted many parts of the code |
Stanford Core NLP Official | The Stanford Core NLP page to look at first |
type | version | Contents |
---|---|---|
OS | Ubuntu18.04.01 LTS | Running in a virtual machine |
pyenv | 1.2.16 | I use pyenv because I sometimes use multiple Python environments |
Python | 3.8.1 | I use Python 3.8.1 on pyenv. Packages are managed using venv |
Stanford CoreNLP | 3.9.2 | I installed it a year ago and don't remember the details ... It was still the latest version a year later, so I used it as is |
openJDK | 1.8.0_242 | I used the JDK that was already installed for other purposes |
An overview of various basic technologies of natural language processing through English text processing using Stanford Core NLP.
Stanford Core NLP, Stemming, Part-of-speech tagging, Named entity recognition, Coreference resolution, Dependency parsing, Phrase structure parsing, S-expressions
For the English text (nlp.txt), execute the following processing.
Read the result of phrase structure analysis (S-expression) of Stanford Core NLP and display all noun phrases (NP) in the sentence. Display all nested noun phrases as well.
The Wikipedia article "S-expression" explains it as follows.
A formal notation for binary trees or list structures, introduced in Lisp and mainly used in Lisp. The "S" derives from "Symbol".
How natural language is represented as S-expressions is described on the Stanford Parser page, and there is also an [Online Test Tool](http://nlp.stanford.edu:8080/parser/). There was also a Python package that parses S-expressions, but it did not seem to be widely used, so I did my best and wrote my own.
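To make the "list structure" idea concrete, here is a minimal sketch that reads an S-expression into nested Python lists. The function name `parse_sexp` is hypothetical, the sketch assumes well-formed input with balanced parentheses, and it is not the approach taken by the answer program.

```python
import re


def parse_sexp(text):
    """Parse an S-expression string into nested lists (hypothetical helper)."""
    # A token is a single parenthesis or a run of non-space, non-parenthesis characters
    tokens = re.findall(r'[()]|[^\s()]+', text)
    stack = [[]]  # stack[0] collects the top-level expressions
    for token in tokens:
        if token == '(':
            stack.append([])            # start a new nested list
        elif token == ')':
            finished = stack.pop()      # close the current list
            stack[-1].append(finished)  # and attach it to its parent
        else:
            stack[-1].append(token)     # a label or a word
    return stack[0][0]


tree = parse_sexp('(S (NP (NN NLP)) (VP (VBZ is)))')
print(tree)
# ['S', ['NP', ['NN', 'NLP']], ['VP', ['VBZ', 'is']]]
```

Each list starts with its label, followed by either a word or further nested lists, which mirrors how the phrase structure is nested in the parser output.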
```python
import re
import xml.etree.ElementTree as ET

reg_split = re.compile(r'''
                         (      # Group start
                         \(|\)  # Split characters (opening or closing parenthesis)
                         )      # Group end
                         ''', re.VERBOSE)


def output_np(chunks):
    depth = 1
    output = []

    for chunk in chunks:

        # An opening parenthesis increases the depth by 1
        if chunk == '(':
            depth += 1

        # A closing parenthesis decreases the depth by 1
        elif chunk == ')':
            depth -= 1
        else:
            # If the chunk is a part-of-speech/word pair, split it and keep the word
            sets = chunk.split(' ')
            if len(sets) == 2:
                output.append(sets[1])

        # Output when the depth reaches 0
        if depth == 0:
            print('\t', ' '.join(output))
            break


for parse in \
        ET.parse('./nlp.txt.xml').iterfind('./document/sentences/sentence/parse'):

    print(parse.text)

    # Split at opening and closing parentheses into a list
    # (whitespace-only and empty elements are excluded)
    chunks = [chunk.strip() for chunk in reg_split.split(parse.text)
              if chunk.strip() != '']

    # Output starts when NP is reached
    for i, chunk in enumerate(chunks):
        if chunk == 'NP':
            output_np(chunks[i+1:])
```
The table below maps the XML file path to the target S-expression. The S-expression itself is the content of the `parse` tag.
Output | 1st level | 2nd level | 3rd level | 4th level | 5th level |
---|---|---|---|---|---|
S-expression | root | document | sentences | sentence | parse |
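As a rough illustration of this path, the sketch below runs the same `iterfind` expression against a tiny inline document. `SAMPLE_XML` is a made-up stand-in that only assumes the same nesting as the real file; it is not the actual CoreNLP output.

```python
import xml.etree.ElementTree as ET

# A tiny stand-in for nlp.txt.xml (assumption: same nesting as the real file)
SAMPLE_XML = '''<root>
  <document>
    <sentences>
      <sentence id="1">
        <parse>(ROOT (NP (NN NLP)))</parse>
      </sentence>
      <sentence id="2">
        <parse>(ROOT (NP (NNS computers)))</parse>
      </sentence>
    </sentences>
  </document>
</root>'''

root = ET.fromstring(SAMPLE_XML)

# The path is relative to <root>, so it starts at the 2nd level (document)
parses = [p.text for p in root.iterfind('./document/sentences/sentence/parse')]
for text in parses:
    print(text)
```

Because the search starts from the root element, the 1st level (`root`) does not appear in the path string.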
The XML file is located on [GitHub](https://github.com/YoheiFukuhara/nlp100/blob/master/06.%E8%8B%B1%E8%AA%9E%E3%83%86%E3%82%AD%E3%82%B9%E3%83%88%E3%81%AE%E5%87%A6%E7%90%86/nlp.txt.xml).
nlp.txt.xml (excerpt)

```xml
<root>
  <document>
    <docId>nlp.txt</docId>
    <sentences>
      <sentence id="1">

--Omission--

        <parse>(ROOT (S (PP (IN As) (NP (JJ such))) (, ,) (NP (NN NLP)) (VP (VBZ is) (ADJP (VBN related) (PP (TO to) (NP (NP (DT the) (NN area)) (PP (IN of) (NP (JJ humani-computer) (NN interaction))))))) (. .))) </parse>
```
Let's indent the S-expression above to make it a little easier to read. It is a relatively short sentence, but the result is still long ... The program joins and outputs the words inside each part enclosed by NP (noun phrase).
```
(ROOT
  (S
    (PP
      (IN As)
      (NP
        (JJ such)
      )
    )
    (, ,)
    (NP
      (NN NLP)
    )
    (VP
      (VBZ is)
      (ADJP
        (VBN related)
        (PP
          (TO to)
          (NP
            (NP
              (DT the)
              (NN area)
            )
            (PP
              (IN of)
              (NP
                (JJ humani-computer)
                (NN interaction)
              )
            )
          )
        )
      )
    )
    (. .)
  )
)
```
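An indented layout like the one above could be generated rather than written by hand. The following is a minimal sketch; `indent_sexp` is a hypothetical helper, and it assumes every leaf has the form `(POS word)` as in the parser output shown here.

```python
import re


def indent_sexp(text, indent='  '):
    """Return an S-expression with one node per line (hypothetical helper)."""
    # Split on parentheses, keeping them, and drop whitespace-only pieces
    chunks = [c.strip() for c in re.split(r'([()])', text) if c.strip()]
    lines = []
    depth = 0
    leaf_open = False  # True right after a "(POS word)" line was emitted
    for chunk in chunks:
        if chunk == '(':
            continue  # the opening parenthesis is emitted with its label
        if chunk == ')':
            if leaf_open:
                leaf_open = False  # the leaf line already ends with ")"
            else:
                depth -= 1
                lines.append(indent * depth + ')')
        elif ' ' in chunk:
            # A part-of-speech/word pair stays on a single line
            lines.append(indent * depth + '(' + chunk + ')')
            leaf_open = True
        else:
            # A phrase label opens a new nesting level
            lines.append(indent * depth + '(' + chunk)
            depth += 1
    return '\n'.join(lines)


print(indent_sexp('(ROOT (S (NP (NN NLP)) (. .)))'))
```

The `leaf_open` flag is what keeps `(NN NLP)` on one line while phrase labels such as `NP` get their closing parenthesis on a separate line.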
This is the part that reads the XML, loops over it, and searches for NP. The S-expression is split into a list with a regular expression, and whitespace-only and empty elements, which only get in the way, are excluded. When an NP is found, the output function `output_np` is called. NPs are output in the order they are found from the beginning, but for nested NPs this is inefficient because the same chunks pass through the same logic multiple times. I wanted to keep it simple, though, so I left it inefficient.
```python
for parse in \
        ET.parse('./nlp.txt.xml').iterfind('./document/sentences/sentence/parse'):

    print(parse.text)

    # Split at opening and closing parentheses into a list
    # (whitespace-only and empty elements are excluded)
    chunks = [chunk.strip() for chunk in reg_split.split(parse.text)
              if chunk.strip() != '']

    # Output starts when NP is reached
    for i, chunk in enumerate(chunks):
        if chunk == 'NP':
            output_np(chunks[i+1:])
```
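To see what the split step produces, here is a small standalone sketch on a short fragment. It uses a compact one-line pattern that should behave the same as the verbose `reg_split`; the sample string is made up for illustration.

```python
import re

# Compact equivalent of the verbose pattern: a captured "(" or ")"
reg_split = re.compile(r'(\(|\))')

text = '(NP (DT the) (NN area))'

# Because the pattern is a capturing group, re.split keeps the parentheses
raw = reg_split.split(text)

# Strip each piece and drop the empty ones
chunks = [c.strip() for c in raw if c.strip() != '']
print(chunks)
# ['(', 'NP', '(', 'DT the', ')', '(', 'NN area', ')', ')']
```

Phrase labels such as `NP` end up as their own element, while each part-of-speech/word pair stays together as a single element like `'DT the'`, which is what `output_np` relies on.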
The depth within the S-expression is tracked by counting opening and closing parentheses, and the collected words are output when the NP closes.
```python
def output_np(chunks):
    depth = 1
    output = []

    for chunk in chunks:

        # An opening parenthesis increases the depth by 1
        if chunk == '(':
            depth += 1

        # A closing parenthesis decreases the depth by 1
        elif chunk == ')':
            depth -= 1
        else:
            # If the chunk is a part-of-speech/word pair, split it and keep the word
            sets = chunk.split(' ')
            if len(sets) == 2:
                output.append(sets[1])

        # Output when the depth reaches 0
        if depth == 0:
            print('\t', ' '.join(output))
            break
```
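For comparison, the repeated scanning of nested NPs could be avoided by walking the chunks only once and keeping track of which NPs are currently open. This is only a sketch of one possible alternative, not the answer program; `extract_nps` is a hypothetical name, and it assumes the same `(POS word)` leaf format.

```python
import re


def extract_nps(sexp):
    """Collect the words of every NP in one pass (hypothetical alternative)."""
    chunks = [c.strip() for c in re.split(r'([()])', sexp) if c.strip()]
    results = []      # one word list per NP, in the order the NPs open
    open_nodes = []   # per open phrase: its index in results, or None if not an NP
    in_leaf = False   # True right after a part-of-speech/word pair
    for chunk in chunks:
        if chunk == '(':
            continue
        if chunk == ')':
            if in_leaf:
                in_leaf = False       # this ")" closes the leaf
            else:
                open_nodes.pop()      # this ")" closes a phrase
        elif ' ' in chunk:
            word = chunk.split(' ')[1]
            # Add the word to every NP that is currently open
            for idx in open_nodes:
                if idx is not None:
                    results[idx].append(word)
            in_leaf = True
        elif chunk == 'NP':
            results.append([])
            open_nodes.append(len(results) - 1)
        else:
            open_nodes.append(None)
    return [' '.join(words) for words in results]


sexp = ('(ROOT (S (PP (IN As) (NP (JJ such))) (, ,) (NP (NN NLP)) '
        '(VP (VBZ is) (ADJP (VBN related) (PP (TO to) (NP (NP (DT the) '
        '(NN area)) (PP (IN of) (NP (JJ humani-computer) '
        '(NN interaction))))))) (. .)))')
for np_text in extract_nps(sexp):
    print('\t', np_text)
```

Reserving a slot in `results` when an NP opens keeps the output in the same outer-first order as the original program, even though inner NPs finish earlier.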
Running the program produces the following output (excerpt from the beginning).
Output result (top excerpt)

```
(ROOT (S (PP (NP (JJ Natural) (NN language) (NN processing)) (IN From) (NP (NNP Wikipedia))) (, ,) (NP (NP (DT the) (JJ free) (NN encyclopedia) (JJ Natural) (NN language) (NN processing)) (PRN (-LRB- -LRB-) (NP (NN NLP)) (-RRB- -RRB-))) (VP (VBZ is) (NP (NP (NP (DT a) (NN field)) (PP (IN of) (NP (NN computer) (NN science)))) (, ,) (NP (JJ artificial) (NN intelligence)) (, ,) (CC and) (NP (NP (NNS linguistics)) (VP (VBN concerned) (PP (IN with) (NP (NP (DT the) (NNS interactions)) (PP (IN between) (NP (NP (NNS computers)) (CC and) (NP (JJ human) (-LRB- -LRB-) (JJ natural) (-RRB- -RRB-) (NNS languages)))))))))) (. .)))
     Natural language processing
     Wikipedia
     the free encyclopedia Natural language processing -LRB- NLP -RRB-
     the free encyclopedia Natural language processing
     NLP
     a field of computer science , artificial intelligence , and linguistics concerned with the interactions between computers and human -LRB- natural -RRB- languages
     a field of computer science
     a field
     computer science
     artificial intelligence
     linguistics concerned with the interactions between computers and human -LRB- natural -RRB- languages
     linguistics
     the interactions between computers and human -LRB- natural -RRB- languages
     the interactions
     computers and human -LRB- natural -RRB- languages
     computers
     human -LRB- natural -RRB- languages
(ROOT (S (PP (IN As) (NP (JJ such))) (, ,) (NP (NN NLP)) (VP (VBZ is) (ADJP (VBN related) (PP (TO to) (NP (NP (DT the) (NN area)) (PP (IN of) (NP (JJ humani-computer) (NN interaction))))))) (. .)))
     such
     NLP
     the area of humani-computer interaction
     the area
     humani-computer interaction
(ROOT (S (S (NP (NP (JJ Many) (NNS challenges)) (PP (IN in) (NP (NN NLP)))) (VP (VBP involve) (S (NP (NP (JJ natural) (NN language) (NN understanding)) (, ,) (SBAR (WHNP (WDT that)) (S (VP (VBZ is)))) (, ,)) (VP (VBG enabling) (NP (NNS computers)) (S (VP (TO to) (VP (VB derive) (NP (NN meaning)) (PP (IN from) (NP (ADJP (JJ human) (CC or) (JJ natural)) (NN language) (NN input)))))))))) (, ,) (CC and) (S (NP (NNS others)) (VP (VBP involve) (NP (JJ natural) (NN language) (NN generation)))) (. .)))
     Many challenges in NLP
     Many challenges
     NLP
     natural language understanding , that is ,
     natural language understanding
     computers
     meaning
     human or natural language input
     others
     natural language generation
```