Stanford CoreNLP is a comprehensive natural language processing library for English text. In this article, I will show how to use CoreNLP from Python.
Download Version 3.2.0 (released 2013-06-20) instead of the latest version from the link below. The reason for using this older version is explained later. http://nlp.stanford.edu/software/stanford-corenlp-full-2013-06-20.zip
$ curl -L -O http://nlp.stanford.edu/software/stanford-corenlp-full-2013-06-20.zip
In my case, I put it in /usr/local/lib.
$ unzip ./stanford-corenlp-full-2013-06-20.zip -d /usr/local/lib/
We will use corenlp-python, developed by Torotoki based on dasmith's version. It is also registered in PyPI; however, the corenlp-python on PyPI only supports CoreNLP Version 3.2.0 (as of this writing).
$ pip install corenlp-python
Generate a parser by specifying the path where CoreNLP was extracted, then parse the text; the result is returned as JSON.
corenlp_example.py
import pprint
import json
import corenlp
# Generate the parser, pointing at the directory where CoreNLP was extracted
corenlp_dir = "/usr/local/lib/stanford-corenlp-full-2013-06-20/"
parser = corenlp.StanfordCoreNLP(corenlp_path=corenlp_dir)
# Parse the text and pretty-print the resulting JSON
result_json = json.loads(parser.parse("I am Alice."))
pprint.pprint(result_json)
Execution result:
{u'coref': [[[[u'I', 0, 0, 0, 1], [u'Alice', 0, 2, 2, 3]]]],
 u'sentences': [{u'dependencies': [[u'nsubj', u'Alice', u'I'],
                                   [u'cop', u'Alice', u'am'],
                                   [u'root', u'ROOT', u'Alice']],
                 u'parsetree': u'(ROOT (S (NP (PRP I)) (VP (VBP am) (NP (NNP Alice))) (. .)))',
                 u'text': u'I am Alice.',
                 u'words': [[u'I',
                             {u'CharacterOffsetBegin': u'0',
                              u'CharacterOffsetEnd': u'1',
                              u'Lemma': u'I',
                              u'NamedEntityTag': u'O',
                              u'PartOfSpeech': u'PRP'}],
                            [u'am',
                             {u'CharacterOffsetBegin': u'2',
                              u'CharacterOffsetEnd': u'4',
                              u'Lemma': u'be',
                              u'NamedEntityTag': u'O',
                              u'PartOfSpeech': u'VBP'}],
                            [u'Alice',
                             {u'CharacterOffsetBegin': u'5',
                              u'CharacterOffsetEnd': u'10',
                              u'Lemma': u'Alice',
                              u'NamedEntityTag': u'PERSON',
                              u'PartOfSpeech': u'NNP'}],
                            [u'.',
                             {u'CharacterOffsetBegin': u'10',
                              u'CharacterOffsetEnd': u'11',
                              u'Lemma': u'.',
                              u'NamedEntityTag': u'O',
                              u'PartOfSpeech': u'.'}]]}]}
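Since the result is an ordinary Python dict, you can pull out just the fields you need. As a minimal sketch based on the structure shown above, printing each token with its part-of-speech and named entity tags looks like this:
# Walk the parsed JSON: each entry in "words" is a [token, attribute dict] pair
for sentence in result_json["sentences"]:
    for word, attrs in sentence["words"]:
        print("%s\t%s\t%s" % (word, attrs["PartOfSpeech"], attrs["NamedEntityTag"]))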
By default it runs everything from tokenization and morphological analysis through parsing and named entity extraction, but if you only need part of that functionality, specify the annotators in a properties file. Narrowing down the annotators speeds up processing (ner in particular is heavy).
For example, if you only want word and sentence splitting, create the following user.properties file.
user.properties
annotators = tokenize, ssplit
Pass the path of this file to the properties parameter when creating the parser.
corenlp_example2.py
import pprint
import json
import corenlp
# Generate the parser, this time with a properties file
corenlp_dir = "/usr/local/lib/stanford-corenlp-full-2013-06-20/"
properties_file = "./user.properties"
parser = corenlp.StanfordCoreNLP(
    corenlp_path=corenlp_dir,
    properties=properties_file)  # set the properties file
# Parse the text and pretty-print the resulting JSON
result_json = json.loads(parser.parse("I am Alice."))
pprint.pprint(result_json)
Execution result:
{u'sentences': [{u'dependencies': [],
                 u'parsetree': [],
                 u'text': u'I am Alice.',
                 u'words': [[u'I',
                             {u'CharacterOffsetBegin': u'0',
                              u'CharacterOffsetEnd': u'1'}],
                            [u'am',
                             {u'CharacterOffsetBegin': u'2',
                              u'CharacterOffsetEnd': u'4'}],
                            [u'Alice',
                             {u'CharacterOffsetBegin': u'5',
                              u'CharacterOffsetEnd': u'10'}],
                            [u'.',
                             {u'CharacterOffsetBegin': u'10',
                              u'CharacterOffsetEnd': u'11'}]]}]}
The example above uses only tokenize and ssplit, but various other annotators are available, so I will briefly summarize them below.
annotator | function | dependent annotators
---|---|---
tokenize | Tokenization (word splitting) | (none)
cleanxml | XML tag removal | tokenize
ssplit | Sentence splitting | tokenize
pos | Part-of-speech tagging | tokenize, ssplit
lemma | Lemmatization | tokenize, ssplit, pos
ner | Named entity recognition | tokenize, ssplit, pos, lemma
regexner | Named entity recognition with regular expressions | tokenize, ssplit
sentiment | Sentiment analysis | (unknown)
truecase | Case normalization | tokenize, ssplit, pos, lemma
parse | Syntactic parsing | tokenize, ssplit
dcoref | Coreference resolution | tokenize, ssplit, pos, lemma, ner, parse
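When you enable an annotator, its dependencies have to be listed as well. For example, following the dependency column above, a user.properties that goes as far as named entity recognition would look like this:
user.properties
annotators = tokenize, ssplit, pos, lemma, ner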