This is the record of the 53rd knock, "Tokenization", from "Chapter 6: Processing English texts" of [Language processing 100 knocks 2015](http://www.cl.ecei.tohoku.ac.jp/nlp100/#ch6). Stanford Core NLP finally makes its appearance here, and it is the main subject of Chapter 6. This time the installation is the main work; the Stanford Core NLP execution and the Python part are not a big deal.
Link | Remarks
---|---
053_1.Tokenization.ipynb | Answer program GitHub link (Stanford Core NLP execution part in Bash)
053_2.Tokenization.ipynb | Answer program GitHub link (Python)
100 amateur language processing knocks: 53 | Copy-and-paste source for many parts
Stanford Core NLP Official | The Stanford Core NLP page to look at first
type | version | Contents
---|---|---
OS | Ubuntu 18.04.01 LTS | Running virtually
pyenv | 1.2.16 | I use pyenv because I sometimes use multiple Python environments
Python | 3.8.1 | Python 3.8.1 on pyenv; packages are managed with venv
Stanford CoreNLP | 3.9.2 | I installed it a year ago and don't remember the details... It was still the latest version a year later, so I kept using it
openJDK | 1.8.0_242 | I used the JDK that was already installed for other purposes as-is
An overview of various basic technologies of natural language processing through English text processing using Stanford Core NLP.
Stanford Core NLP, Stemming, Part-of-speech tagging, Named entity recognition, Co-reference resolution, Dependency parsing, Phrase structure analysis, S-expressions
For the English text (nlp.txt), execute the following processing.
Use Stanford Core NLP to get the analysis result of the input text in XML format. Also, read this XML file and output the input text in the form of one word per line.
"Stanford Core NLP" is a library for natural language processing. There is a similar one called "Stanford NLP", which supports Japanese. "Stanford NLP" has been used since 70th knock. The difference is clearly described in Article "Introduction to Stanford NLP with Python". Looking at Stanford CoreNLP's Release History, it hasn't been updated much recently.
I am running it according to the official page. If you do not specify the `-annotators` option, you will get stuck in a later knock (the 57th, as I recall). I allocate 5G of memory with `-Xmx5g`; when it was too small, an error occurred. When the command is executed, the result is output to the same location as the input file `nlp.txt`, with the extension `xml` appended.
```bash
java -cp "/usr/local/lib/stanford-corenlp-full-2018-10-05/*" \
 -Xmx5g \
 edu.stanford.nlp.pipeline.StanfordCoreNLP \
 -annotators tokenize,ssplit,pos,lemma,ner,parse,dcoref \
 -file nlp.txt
```
By the way, if you put the output XML file in the same directory as the `CoreNLP-to-HTML.xsl` that comes with the distribution in `/usr/local/lib/stanford-corenlp-full-2018-10-05` and open it in a browser, you can see a nicely rendered result (this worked for me in IE and Edge, but not in Firefox or Chrome).
```python
import xml.etree.ElementTree as ET

# Extract only the word elements
for i, word in enumerate(ET.parse('./nlp.txt.xml').iter('word')):
    print(i, '\t', word.text)

    # Limit the output because there are many words
    if i > 30:
        break
```
I am using the Python standard package `xml` as the XML parser. It is easy to use: just read the `nlp.txt.xml` output by Stanford CoreNLP with the `parse` function and iterate over the `word` tags.
```python
for i, word in enumerate(ET.parse('./nlp.txt.xml').iter('word')):
    print(i, '\t', word.text)
```
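The knock itself asks for one word per line without the index. As a variant (my addition, not part of the answer program; the output file name `words.txt` is an assumption), the same iterator can write every word to a file:

```python
import xml.etree.ElementTree as ET

# Variant sketch: write every word on its own line
# ('words.txt' is an assumed output file name, not from the original answer)
with open('./words.txt', 'w') as f_out:
    for word in ET.parse('./nlp.txt.xml').iter('word'):
        f_out.write(word.text + '\n')
```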
The contents of the XML look like this (excerpt from the beginning). The whole XML file is on [GitHub](https://github.com/YoheiFukuhara/nlp100/blob/master/06.%E8%8B%B1%E8%AA%9E%E3%83%86%E3%82%AD%E3%82%B9%E3%83%88%E3%81%AE%E5%87%A6%E7%90%86/nlp.txt.xml).
```xml:nlp.txt.xml(Excerpt from the beginning)
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet href="CoreNLP-to-HTML.xsl" type="text/xsl"?>
<root>
  <document>
    <docId>nlp.txt</docId>
    <sentences>
      <sentence id="1">
        <tokens>
          <token id="1">
            <word>Natural</word>
            <lemma>natural</lemma>
            <CharacterOffsetBegin>0</CharacterOffsetBegin>
            <CharacterOffsetEnd>7</CharacterOffsetEnd>
            <POS>JJ</POS>
            <NER>O</NER>
            <Speaker>PER0</Speaker>
          </token>
          <token id="2">
            <word>language</word>
            <lemma>language</lemma>
            <CharacterOffsetBegin>8</CharacterOffsetBegin>
            <CharacterOffsetEnd>16</CharacterOffsetEnd>
            <POS>NN</POS>
            <NER>O</NER>
            <Speaker>PER0</Speaker>
          </token>
```
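Given this structure, sibling tags such as `lemma` and `POS` can be read the same way as `word`. Below is a minimal sketch, assuming only the tag names visible in the excerpt above; it is not part of the answer program:

```python
import xml.etree.ElementTree as ET

# Sketch: iterate over token elements and read the child tags
# shown in the excerpt above (word, lemma, POS)
for i, token in enumerate(ET.parse('./nlp.txt.xml').iter('token')):
    print(token.findtext('word'), token.findtext('lemma'),
          token.findtext('POS'), sep='\t')
    if i > 5:  # keep the output short
        break
```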
Running the answer program above produces the following output.

Output result

```
0 Natural
1 language
2 processing
3 From
4 Wikipedia
5 ,
6 the
7 free
8 encyclopedia
9 Natural
10 language
11 processing
12 -LRB-
13 NLP
14 -RRB-
15 is
16 a
17 field
18 of
19 computer
20 science
21 ,
22 artificial
23 intelligence
24 ,
25 and
26 linguistics
27 concerned
28 with
29 the
30 interactions
31 between
```
By the way, `-LRB-` and `-RRB-` are parentheses, which are converted by Stanford Core NLP:

- `-LRB-`: left round bracket `(`
- `-RRB-`: right round bracket `)`
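If you ever want the original characters back, a small token-to-character mapping is enough. The helper below is hypothetical (my own illustration; neither Stanford Core NLP nor the answer program provides it):

```python
# Hypothetical helper: map the bracket tokens back to their characters
BRACKETS = {'-LRB-': '(', '-RRB-': ')'}

def restore_brackets(token):
    return BRACKETS.get(token, token)

print(restore_brackets('-LRB-'))  # -> (
print(restore_brackets('NLP'))    # -> NLP
```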