A record of solving the problems in the first half of Chapter 6. The target file is nlp.txt as shown on the web page.
Perform the following processing on the English text (nlp.txt).
50. Sentence segmentation
Treat the pattern (. or ; or : or ? or !) → whitespace character → uppercase letter as a sentence delimiter, and output the input document in the form of one sentence per line.
# -*- coding: utf-8 -*-
__author__ = 'todoroki'

import re

# Sentence boundary: punctuation mark, one space, then an uppercase letter.
punt = re.compile(r"(?P<punt>[\.:;!\?]) (?P<head>[A-Z])")

if __name__ == "__main__":
    with open('nlp.txt', 'r') as f:
        for line in f:
            l = line.strip()
            # Replace the space at each boundary with a newline.
            print(punt.sub(r"\g<punt>\n\g<head>", l))
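The boundary rule can also be wrapped in a small function so that it is easy to check on a short string. The following is a minimal Python 3 sketch, not the script above: the names `BOUNDARY` and `split_sentences` are hypothetical, and it uses a lookbehind and a lookahead instead of the named groups, and `\s+` instead of a single space.

```python
import re

# Boundary: one of . ; : ? ! followed by whitespace and then an uppercase letter.
# The lookbehind/lookahead keep the punctuation and the capital in the output.
BOUNDARY = re.compile(r"(?<=[.;:?!])\s+(?=[A-Z])")


def split_sentences(text):
    """Return the sentences found in text, split at each boundary."""
    return [s for s in BOUNDARY.split(text.strip()) if s]


if __name__ == "__main__":
    sample = "NLP is fun. It has many uses: Translation is one! e.g. this stays."
    for sentence in split_sentences(sample):
        print(sentence)
```

Because the rule requires an uppercase letter after the whitespace, "one! e.g. this stays." is not split. The same rule will still split after abbreviations such as "Dr. Smith", which is a limitation of the pattern itself.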
51. Word extraction
Treating whitespace as the word delimiter, take the output of 50 as input and output it in the form of one word per line. Output a blank line at the end of each sentence.
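# -*- coding: utf-8 -*-
__author__ = 'todoroki'

import re

if __name__ == "__main__":
    # nlp_line.txt is assumed to hold the output of 50 (one sentence per line).
    with open('nlp_line.txt', 'r') as f:
        for line in f:
            l = line.strip()
            for word in l.split():
                # Strip punctuation and other non-word characters.
                print(re.sub(r"\W", "", word))
            # A blank line marks the end of the sentence.
            print("")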
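One detail worth checking is what happens to a token that consists only of punctuation, such as a lone "-": stripping `\W` leaves an empty string, which then looks like an end-of-sentence blank line. The Python 3 sketch below illustrates one way to avoid that; the function name `words_per_line` is hypothetical, and skipping empty tokens is a design choice of this sketch rather than something the task requires.

```python
import re


def words_per_line(sentences):
    """Yield one cleaned word per item, then "" after each sentence."""
    for sentence in sentences:
        for token in sentence.split():
            word = re.sub(r"\W", "", token)
            if word:  # skip tokens that were pure punctuation
                yield word
        yield ""  # blank entry marks the end of the sentence


if __name__ == "__main__":
    for item in words_per_line(["Hello, world!", "A - b."]):
        print(item)
```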
52. Stemming
Take the output of 51 as input, apply Porter's stemming algorithm, and output each word and its stem in tab-delimited format. In Python, the stemming module is suggested as an implementation of Porter's stemming algorithm; the code here uses NLTK's PorterStemmer instead.
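# -*- coding: utf-8 -*-
__author__ = 'todoroki'

from nltk.stem.porter import PorterStemmer

if __name__ == "__main__":
    # Create the stemmer once instead of once per line.
    stemmer = PorterStemmer()
    with open('nlp_word.txt') as f:
        for line in f:
            l = line.strip()
            if len(l) > 0:
                print("%s\t%s" % (l, stemmer.stem(l)))
            else:
                # Assumed intent: keep the blank line that separates sentences.
                print("")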
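The formatting step can be separated from the stemmer itself, which makes it testable without NLTK. This is a rough Python 3 sketch: `word_stem_rows` is a hypothetical name, and `toy_stem` is a deliberately crude stand-in used only for demonstration, not Porter's algorithm.

```python
def word_stem_rows(lines, stem):
    """Yield "word<TAB>stem" for each word line; blank lines pass through."""
    for line in lines:
        word = line.strip()
        if word:
            yield "%s\t%s" % (word, stem(word))
        else:
            yield ""  # sentence separator from the previous step


def toy_stem(word):
    """Crude placeholder: lowercase and drop trailing 's'. NOT Porter stemming."""
    return word.lower().rstrip("s")


# With NLTK installed, the real stemmer can be plugged in like this:
#   from nltk.stem.porter import PorterStemmer
#   with open("nlp_word.txt") as f:
#       for row in word_stem_rows(f, PorterStemmer().stem):
#           print(row)
```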
53. Tokenization
Use Stanford Core NLP to get the analysis result of the input text in XML format. Also, read this XML file and output the input text in the form of one word per line.
# -*- coding: utf-8 -*-
__author__ = 'todoroki'

import re

# Matches a <word> element whose content consists only of word characters.
WORD = re.compile(r"<word>(\w+)</word>")

with open('nlp.txt.xml', 'r') as f:
    for line in f:
        word = WORD.search(line.strip())
        if word:
            print(word.group(1))
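Note that the `\w+` pattern skips tokens such as "." or "well-known", because they contain non-word characters. If those tokens should be kept, an XML parser is an alternative to the regular expression. The following Python 3 sketch uses the standard library's `xml.etree.ElementTree`; the function name `extract_words` is hypothetical, and the sample XML is a tiny hand-written stand-in, not real Stanford Core NLP output.

```python
import io
import xml.etree.ElementTree as ET


def extract_words(xml_source):
    """Return the text of every <word> element, in document order.

    xml_source may be a file name or a binary file object.
    """
    root = ET.parse(xml_source).getroot()
    return [elem.text for elem in root.iter("word")]


if __name__ == "__main__":
    # Hand-written sample that only mimics the <token>/<word> nesting.
    sample = io.BytesIO(
        b"<root><document><sentences><sentence id='1'><tokens>"
        b"<token id='1'><word>well-known</word></token>"
        b"<token id='2'><word>.</word></token>"
        b"</tokens></sentence></sentences></document></root>"
    )
    for word in extract_words(sample):
        print(word)
```

For the real file, `extract_words("nlp.txt.xml")` would be the corresponding call.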
To produce the XML file, download Stanford Core NLP, change into its folder, and execute the following command.
java -Xmx5g -cp stanford-corenlp-3.6.0.jar:stanford-corenlp-models-3.6.0.jar:* edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize,ssplit,pos,lemma,ner,parse,mention,coref -file nlp_line.txt -outputFormat xml
For some reason this failed with an error under zsh, so I ran it under bash instead.
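A likely cause, though this is an assumption since the error message is not shown, is the unquoted `*` in the classpath: by default zsh aborts with "no matches found" when a glob pattern matches nothing, whereas bash passes the word through unchanged and lets java expand it. Quoting the `-cp` argument should avoid this in either shell. Another option is to start java without a shell at all, as in the Python 3 sketch below; the function name `build_corenlp_command` and its parameters are hypothetical.

```python
import subprocess


def build_corenlp_command(input_path, version="3.6.0", memory="5g"):
    """Build the java argument list for running CoreNLP on input_path."""
    classpath = ":".join([
        "stanford-corenlp-%s.jar" % version,
        "stanford-corenlp-models-%s.jar" % version,
        "*",  # passed literally; java expands it, not the shell
    ])
    return [
        "java", "-Xmx" + memory, "-cp", classpath,
        "edu.stanford.nlp.pipeline.StanfordCoreNLP",
        "-annotators", "tokenize,ssplit,pos,lemma,ner,parse,mention,coref",
        "-file", input_path,
        "-outputFormat", "xml",
    ]


# To run it from inside the CoreNLP folder (no shell, so no globbing):
#   subprocess.check_call(build_corenlp_command("nlp_line.txt"))
```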
54. Part-of-speech tagging
Read the analysis result XML of Stanford Core NLP and output each word, lemma, and part of speech in tab-delimited format.
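# -*- coding: utf-8 -*-
__author__ = 'todoroki'

import re

WORD = re.compile(r"<word>(\w+)</word>")
LEMMA = re.compile(r"<lemma>(\w+)</lemma>")
POS = re.compile(r"<POS>(\w+)</POS>")

with open("nlp.txt.xml", "r") as f:
    words = []  # collects [word, lemma, POS] for the current token
    for line in f:
        # Once all three fields are collected, print them and start over.
        if len(words) == 3:
            print("\t".join(words))
            words = []
        line = line.strip()
        # Assumed step in each branch below: store the matched text.
        word = WORD.search(line)
        if len(words) == 0 and word:
            words.append(word.group(1))
            continue
        lemma = LEMMA.search(line)
        if len(words) == 1 and lemma:
            words.append(lemma.group(1))
            continue
        pos = POS.search(line)
        if len(words) == 2 and pos:
            words.append(pos.group(1))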
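The line-by-line approach relies on the word, lemma, and POS lines always appearing in that order and all matching `\w+`; if a word matches but its lemma contains a non-word character, the three fields can drift out of step. Reading each `<token>` element as a unit avoids that. The following Python 3 sketch uses the standard library's `xml.etree.ElementTree`; the name `token_rows` is hypothetical, and the sample XML is hand-written for illustration rather than taken from real output.

```python
import io
import xml.etree.ElementTree as ET


def token_rows(xml_source):
    """Return "word<TAB>lemma<TAB>POS" for every <token> element.

    xml_source may be a file name or a binary file object.
    """
    root = ET.parse(xml_source).getroot()
    rows = []
    for token in root.iter("token"):
        # A missing child yields "" so every row still has three fields.
        fields = [token.findtext(tag, default="") for tag in ("word", "lemma", "POS")]
        rows.append("\t".join(fields))
    return rows


if __name__ == "__main__":
    sample = io.BytesIO(
        b"<root><tokens>"
        b"<token id='1'><word>cats</word><lemma>cat</lemma><POS>NNS</POS></token>"
        b"<token id='2'><word>.</word><lemma>.</lemma><POS>.</POS></token>"
        b"</tokens></root>"
    )
    for row in token_rows(sample):
        print(row)
```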
