Stanford CoreNLP is a comprehensive natural language processing library for English text. In this article, I will show how to use CoreNLP from Python.
Download Version 3.2.0 (released 2013-06-20) instead of the latest version from the link below. The reason for using this older version is explained later. http://nlp.stanford.edu/software/stanford-corenlp-full-2013-06-20.zip
$ curl -L -O http://nlp.stanford.edu/software/stanford-corenlp-full-2013-06-20.zip
In my case, I put it in /usr/local/lib.
$ unzip ./stanford-corenlp-full-2013-06-20.zip -d /usr/local/lib/
We will use corenlp-python, developed by Torotoki based on dasmith's version. It is also registered in PyPI; however, the corenlp-python on PyPI only supports CoreNLP Version 3.2.0 (as of this writing).
$ pip install corenlp-python
Generate a parser by specifying the path where CoreNLP was extracted, then parse the text; the result is returned as JSON.
corenlp_example.py
import pprint
import json
import corenlp
# Generate the parser, pointing at the directory where CoreNLP was extracted
corenlp_dir = "/usr/local/lib/stanford-corenlp-full-2013-06-20/"
parser = corenlp.StanfordCoreNLP(corenlp_path=corenlp_dir)
# Parse the text and pretty-print the resulting JSON
result_json = json.loads(parser.parse("I am Alice."))
pprint.pprint(result_json)
Execution result:
{u'coref': [[[[u'I', 0, 0, 0, 1], [u'Alice', 0, 2, 2, 3]]]],
 u'sentences': [{u'dependencies': [[u'nsubj', u'Alice', u'I'],
                                   [u'cop', u'Alice', u'am'],
                                   [u'root', u'ROOT', u'Alice']],
                 u'parsetree': u'(ROOT (S (NP (PRP I)) (VP (VBP am) (NP (NNP Alice))) (. .)))',
                 u'text': u'I am Alice.',
                 u'words': [[u'I',
                             {u'CharacterOffsetBegin': u'0',
                              u'CharacterOffsetEnd': u'1',
                              u'Lemma': u'I',
                              u'NamedEntityTag': u'O',
                              u'PartOfSpeech': u'PRP'}],
                            [u'am',
                             {u'CharacterOffsetBegin': u'2',
                              u'CharacterOffsetEnd': u'4',
                              u'Lemma': u'be',
                              u'NamedEntityTag': u'O',
                              u'PartOfSpeech': u'VBP'}],
                            [u'Alice',
                             {u'CharacterOffsetBegin': u'5',
                              u'CharacterOffsetEnd': u'10',
                              u'Lemma': u'Alice',
                              u'NamedEntityTag': u'PERSON',
                              u'PartOfSpeech': u'NNP'}],
                            [u'.',
                             {u'CharacterOffsetBegin': u'10',
                              u'CharacterOffsetEnd': u'11',
                              u'Lemma': u'.',
                              u'NamedEntityTag': u'O',
                              u'PartOfSpeech': u'.'}]]}]}
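Since the result is an ordinary Python dict, you can pull out just the fields you need. As a minimal sketch based on the structure shown above, printing each token with its part-of-speech and named entity tags looks like this:
# Walk the parsed JSON: each entry in "words" is a [token, attribute dict] pair
for sentence in result_json["sentences"]:
    for word, attrs in sentence["words"]:
        print("%s\t%s\t%s" % (word, attrs["PartOfSpeech"], attrs["NamedEntityTag"]))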
By default it runs everything from tokenization and morphological analysis through parsing and named entity extraction, but if you only need part of that functionality, specify the annotators in a properties file. Narrowing down the annotators speeds up processing (ner in particular is heavy).
For example, if you only want word and sentence splitting, create the following user.properties file.
user.properties
annotators = tokenize, ssplit
Pass the path of this file to the properties parameter when creating the parser.
corenlp_example2.py
import pprint
import json
import corenlp
# Generate the parser, this time with a properties file
corenlp_dir = "/usr/local/lib/stanford-corenlp-full-2013-06-20/"
properties_file = "./user.properties"
parser = corenlp.StanfordCoreNLP(
    corenlp_path=corenlp_dir,
    properties=properties_file)  # set the properties file
# Parse the text and pretty-print the resulting JSON
result_json = json.loads(parser.parse("I am Alice."))
pprint.pprint(result_json)
Execution result:
{u'sentences': [{u'dependencies': [],
                 u'parsetree': [],
                 u'text': u'I am Alice.',
                 u'words': [[u'I',
                             {u'CharacterOffsetBegin': u'0',
                              u'CharacterOffsetEnd': u'1'}],
                            [u'am',
                             {u'CharacterOffsetBegin': u'2',
                              u'CharacterOffsetEnd': u'4'}],
                            [u'Alice',
                             {u'CharacterOffsetBegin': u'5',
                              u'CharacterOffsetEnd': u'10'}],
                            [u'.',
                             {u'CharacterOffsetBegin': u'10',
                              u'CharacterOffsetEnd': u'11'}]]}]}
The example above uses only tokenize and ssplit, but various other annotators are available, so I will briefly summarize them below.
annotator | function | dependent annotators
---|---|---
tokenize | Tokenization (word splitting) | (none)
cleanxml | XML tag removal | tokenize
ssplit | Sentence splitting | tokenize
pos | Part-of-speech tagging | tokenize, ssplit
lemma | Lemmatization | tokenize, ssplit, pos
ner | Named entity recognition | tokenize, ssplit, pos, lemma
regexner | Named entity recognition with regular expressions | tokenize, ssplit
sentiment | Sentiment analysis | (unknown)
truecase | Case normalization | tokenize, ssplit, pos, lemma
parse | Syntactic parsing | tokenize, ssplit
dcoref | Coreference resolution | tokenize, ssplit, pos, lemma, ner, parse
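When you enable an annotator, its dependencies have to be listed as well. For example, following the dependency column above, a user.properties that goes as far as named entity recognition would look like this:
user.properties
annotators = tokenize, ssplit, pos, lemma, ner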