This is the record of the 53rd knock, "Tokenization", from "Chapter 6: Processing English texts" of [Language processing 100 knocks 2015](http://www.cl.ecei.tohoku.ac.jp/nlp100/#ch6). Stanford Core NLP finally makes its appearance here, and it is the main subject of Chapter 6. This time the installation is the main work; the Stanford Core NLP execution and the Python part are not a big deal.
Link | Remarks
---|---
053_1.Tokenization.ipynb | Answer program GitHub link (Stanford Core NLP execution part in Bash)
053_2.Tokenization.ipynb | Answer program GitHub link (Python)
100 amateur language processing knocks: 53 | Copy-and-paste source for many parts
Stanford Core NLP Official | The Stanford Core NLP page to look at first
type | version | Contents
---|---|---
OS | Ubuntu 18.04.01 LTS | Running virtually
pyenv | 1.2.16 | I use pyenv because I sometimes use multiple Python environments
Python | 3.8.1 | Python 3.8.1 on pyenv; packages are managed with venv
Stanford CoreNLP | 3.9.2 | I installed it a year ago and don't remember the details... It was still the latest version a year later, so I kept using it
openJDK | 1.8.0_242 | I used the JDK that was already installed for other purposes as-is
An overview of various basic technologies of natural language processing through English text processing using Stanford Core NLP.
Stanford Core NLP, Stemming, Part-of-speech tagging, Named entity recognition, Co-reference resolution, Dependency parsing, Phrase structure analysis, S-expressions
For the English text (nlp.txt), execute the following processing.
Use Stanford Core NLP to get the analysis result of the input text in XML format. Also, read this XML file and output the input text in the form of one word per line.
"Stanford Core NLP" is a library for natural language processing. There is a similar one called "Stanford NLP", which supports Japanese. "Stanford NLP" has been used since 70th knock. The difference is clearly described in Article "Introduction to Stanford NLP with Python". Looking at Stanford CoreNLP's Release History, it hasn't been updated much recently.
I am running it according to the official page. If you do not specify the `-annotators` option, you will get stuck in a later knock (the 57th, as I recall). I allocate 5G of memory with `-Xmx5g`; when it was too small, an error occurred. When the command is executed, the result is output to the same location as the input file `nlp.txt`, with the extension `xml` appended.
```bash
java -cp "/usr/local/lib/stanford-corenlp-full-2018-10-05/*" \
 -Xmx5g \
 edu.stanford.nlp.pipeline.StanfordCoreNLP \
 -annotators tokenize,ssplit,pos,lemma,ner,parse,dcoref \
 -file nlp.txt
```
By the way, if you put the output XML file in the same directory as the `CoreNLP-to-HTML.xsl` that comes with the distribution in `/usr/local/lib/stanford-corenlp-full-2018-10-05` and open it in a browser, you can see a nicely rendered result (this worked for me in IE and Edge, but not in Firefox or Chrome).
```python
import xml.etree.ElementTree as ET

# Extract only the word elements
for i, word in enumerate(ET.parse('./nlp.txt.xml').iter('word')):
    print(i, '\t', word.text)

    # Limit the output because there are many words
    if i > 30:
        break
```
I am using the Python standard package `xml` as the XML parser. It is easy to use: just read the `nlp.txt.xml` output by Stanford CoreNLP with the `parse` function and iterate over the `word` tags.
```python
for i, word in enumerate(ET.parse('./nlp.txt.xml').iter('word')):
    print(i, '\t', word.text)
```
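The knock itself asks for one word per line without the index. As a variant (my addition, not part of the answer program; the output file name `words.txt` is an assumption), the same iterator can write every word to a file:

```python
import xml.etree.ElementTree as ET

# Variant sketch: write every word on its own line
# ('words.txt' is an assumed output file name, not from the original answer)
with open('./words.txt', 'w') as f_out:
    for word in ET.parse('./nlp.txt.xml').iter('word'):
        f_out.write(word.text + '\n')
```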
The contents of the XML look like this (excerpt from the beginning). The whole XML file is on [GitHub](https://github.com/YoheiFukuhara/nlp100/blob/master/06.%E8%8B%B1%E8%AA%9E%E3%83%86%E3%82%AD%E3%82%B9%E3%83%88%E3%81%AE%E5%87%A6%E7%90%86/nlp.txt.xml).
```xml:nlp.txt.xml(Excerpt from the beginning)
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet href="CoreNLP-to-HTML.xsl" type="text/xsl"?>
<root>
  <document>
    <docId>nlp.txt</docId>
    <sentences>
      <sentence id="1">
        <tokens>
          <token id="1">
            <word>Natural</word>
            <lemma>natural</lemma>
            <CharacterOffsetBegin>0</CharacterOffsetBegin>
            <CharacterOffsetEnd>7</CharacterOffsetEnd>
            <POS>JJ</POS>
            <NER>O</NER>
            <Speaker>PER0</Speaker>
          </token>
          <token id="2">
            <word>language</word>
            <lemma>language</lemma>
            <CharacterOffsetBegin>8</CharacterOffsetBegin>
            <CharacterOffsetEnd>16</CharacterOffsetEnd>
            <POS>NN</POS>
            <NER>O</NER>
            <Speaker>PER0</Speaker>
          </token>
```
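Given this structure, sibling tags such as `lemma` and `POS` can be read the same way as `word`. Below is a minimal sketch, assuming only the tag names visible in the excerpt above; it is not part of the answer program:

```python
import xml.etree.ElementTree as ET

# Sketch: iterate over token elements and read the child tags
# shown in the excerpt above (word, lemma, POS)
for i, token in enumerate(ET.parse('./nlp.txt.xml').iter('token')):
    print(token.findtext('word'), token.findtext('lemma'),
          token.findtext('POS'), sep='\t')
    if i > 5:  # keep the output short
        break
```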
Running the answer program above produces the following output.

Output result

```
0 Natural
1 language
2 processing
3 From
4 Wikipedia
5 ,
6 the
7 free
8 encyclopedia
9 Natural
10 language
11 processing
12 -LRB-
13 NLP
14 -RRB-
15 is
16 a
17 field
18 of
19 computer
20 science
21 ,
22 artificial
23 intelligence
24 ,
25 and
26 linguistics
27 concerned
28 with
29 the
30 interactions
31 between
```
By the way, `-LRB-` and `-RRB-` are parentheses, which are converted by Stanford Core NLP:

- `-LRB-`: left round bracket `(`
- `-RRB-`: right round bracket `)`
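If you ever want the original characters back, a small token-to-character mapping is enough. The helper below is hypothetical (my own illustration; neither Stanford Core NLP nor the answer program provides it):

```python
# Hypothetical helper: map the bracket tokens back to their characters
BRACKETS = {'-LRB-': '(', '-RRB-': ')'}

def restore_brackets(token):
    return BRACKETS.get(token, token)

print(restore_brackets('-LRB-'))  # -> (
print(restore_brackets('NLP'))    # -> NLP
```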