100 Language Processing Knocks, Chapter 6: English text processing (first half)

A record of solving the problems in the first half of Chapter 6. The target file is nlp.txt, as provided on the problem page.

Perform the following processing on the English text (nlp.txt).

50. Sentence segmentation

Treating the pattern (. or ; or : or ? or !) → whitespace → uppercase letter as a sentence boundary, output the input document in the form of one sentence per line.

# -*- coding: utf-8 -*-
__author__ = 'todoroki'

import re

punt = re.compile(r"(?P<punt>[\.:;!\?]) (?P<head>[A-Z])")

if __name__ == "__main__":
    f = open('nlp.txt', 'r')
    for line in f:
        l = line.strip()
        # if punt.search(l):
            # print punt.sub(r"\g<punt>\n\g<head>", l)
        print punt.sub(r"\g<punt>\n\g<head>", l)
    f.close()
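As a quick sanity check, the same regex can be exercised on a short sample string (the sample sentence is my own, not taken from nlp.txt):

```python
import re

# Same pattern as above: terminal punctuation, a space, then a capital letter.
punt = re.compile(r"(?P<punt>[.:;!?]) (?P<head>[A-Z])")

sample = "Natural language processing is fun. It is also hard! Really? Yes."
# Insert a newline between the punctuation and the following capital letter.
result = punt.sub(r"\g<punt>\n\g<head>", sample)
print(result)
```

Note that this heuristic over-splits on abbreviations such as "U.S. Government", where a period followed by a space and a capital letter occurs mid-sentence.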

51. Word segmentation

Taking the output of problem 50 as input, treat whitespace as word delimiters and output one word per line. However, output a blank line at the end of each sentence.

# -*- coding: utf-8 -*-
__author__ = 'todoroki'

import re

if __name__ == "__main__":
    f = open('nlp_line.txt', 'r')
    for line in f:
        l = line.strip()
        for word in l.split():
            print re.sub(r"\W", "", word)
        print ""
    f.close()
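One thing to be aware of: `re.sub(r"\W", "", word)` removes every non-word character, so apostrophes and hyphens inside a token disappear as well (the sample words here are my own):

```python
import re

# \W matches any character that is not [a-zA-Z0-9_],
# so internal punctuation is stripped along with the trailing kind.
words = "Obama's re-election, they said.".split()
cleaned = [re.sub(r"\W", "", w) for w in words]
print(cleaned)
```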

52. Stemming

Take the output of problem 51 as input, apply Porter's stemming algorithm, and output the word and its stem in tab-delimited format. The problem statement suggests the stemming module as a Python implementation of Porter's algorithm, but the solution below uses NLTK's PorterStemmer instead.

# -*- coding: utf-8 -*-
__author__ = 'todoroki'

from nltk.stem.porter import PorterStemmer

if __name__ == "__main__":
    # Create the stemmer once, outside the loop.
    stemmer = PorterStemmer()
    f = open('nlp_word.txt')
    for line in f:
        l = line.strip()
        if len(l) > 0:
            print "%s\t%s" % (l, stemmer.stem(l))
        else:
            print ""
    f.close()
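The behaviour of PorterStemmer can be checked on a few words in isolation (the sample words are my own choice, not from the input file):

```python
from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()
# Print word and stem, tab-delimited, in the same format as the script above.
for word in ["processing", "studies", "caresses"]:
    print("%s\t%s" % (word, stemmer.stem(word)))
```

Porter stems are not always dictionary words: "studies" becomes "studi", for example.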

53. Tokenization

Use Stanford CoreNLP to obtain the analysis result of the input text in XML format. Then read this XML file and output the input text in the form of one word per line.

# -*- coding: utf-8 -*-
__author__ = 'todoroki'

import re

WORD = re.compile(r"<word>(\w+)</word>")

f = open('nlp.txt.xml', 'r')
for line in f:
    word = WORD.search(line.strip())
    if word:
        print word.group(1)
f.close()
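The regex `<word>(\w+)</word>` misses tokens that contain non-word characters, such as punctuation tokens or contractions like "Don't". A more robust sketch parses the XML with the standard library; the inline fragment below is hand-written to mimic the shape of CoreNLP's output, not real output:

```python
import xml.etree.ElementTree as ET

# A minimal hand-written fragment in the shape of CoreNLP's XML output.
xml_text = """<root><document><sentences><sentence id="1"><tokens>
  <token id="1"><word>Don't</word></token>
  <token id="2"><word>panic</word></token>
  <token id="3"><word>.</word></token>
</tokens></sentence></sentences></document></root>"""

root = ET.fromstring(xml_text)
# Collect the text of every <word> element, in document order.
words = [w.text for w in root.iter("word")]
print("\n".join(words))
```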

Command for creating the XML file

Download Stanford CoreNLP, move into its folder, and execute the following command.

java -Xmx5g -cp stanford-corenlp-3.6.0.jar:stanford-corenlp-models-3.6.0.jar:* edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize,ssplit,pos,lemma,ner,parse,mention,coref -file nlp_line.txt -outputFormat xml

For some reason it failed with an error under zsh, so I ran it under bash.

54. Part-of-speech tagging

Read the analysis result XML of Stanford Core NLP and output words, lemmas, and part of speech in tab-delimited format.

# -*- coding: utf-8 -*-
__author__ = 'todoroki'

import re

WORD = re.compile(r"<word>(\w+)</word>")
LEMMA = re.compile(r"<lemma>(\w+)</lemma>")
POS = re.compile(r"<POS>(\w+)</POS>")

f = open("nlp.txt.xml", "r")
words = []
for line in f:
    line = line.strip()
    # Collect word, lemma and POS in order; each <token> element
    # lists them in this sequence in the CoreNLP XML.
    word = WORD.search(line)
    if len(words) == 0 and word:
        words.append(word.group(1))
        continue
    lemma = LEMMA.search(line)
    if len(words) == 1 and lemma:
        words.append(lemma.group(1))
        continue
    pos = POS.search(line)
    if len(words) == 2 and pos:
        words.append(pos.group(1))
        # The triple is complete: print it and start over.
        print "\t".join(words)
        words = []
f.close()
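As in problem 53, the line-by-line regex approach is fragile; the same extraction can be sketched with the standard XML parser. The fragment below is hand-written to follow CoreNLP's token layout, not real output:

```python
import xml.etree.ElementTree as ET

# A minimal hand-written fragment following CoreNLP's token layout.
xml_text = """<root><document><sentences><sentence id="1"><tokens>
  <token id="1"><word>Dogs</word><lemma>dog</lemma><POS>NNS</POS></token>
  <token id="2"><word>bark</word><lemma>bark</lemma><POS>VBP</POS></token>
</tokens></sentence></sentences></document></root>"""

root = ET.fromstring(xml_text)
for token in root.iter("token"):
    # findtext returns the text of the first matching child element.
    print("\t".join([token.findtext("word"),
                     token.findtext("lemma"),
                     token.findtext("POS")]))
```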
