This is a record of knock 52, "Stemming", from "Chapter 6: Processing English Text" of the Language Processing 100 Knocks 2015 (.tohoku.ac.jp/nlp100/#ch6). **Stemming is something you will use often in language processing.** It is actually used again in the machine learning knocks from knock 71 onward. Technically it is very easy, because it only requires calling a function.

Reference links

Link	Remarks
052.Stemming.ipynb	GitHub link to the answer program
100 amateur language processing knocks:52	Source from which I copied many parts of the code

Environment

Type	Version	Notes
OS	Ubuntu 18.04.01 LTS	Running as a virtual machine
pyenv	1.2.16	I use pyenv because I sometimes use multiple Python environments
Python	3.8.1	Python 3.8.1 on pyenv; packages are managed using venv

In the above environment, I use the following additional Python package, installed with regular pip. This time I did not use the stemming package specified in the knock: it has not been updated since 2010, and nltk now seems to be the more common choice.

Type	Version
nltk	3.4.5

Chapter 6: Processing English Text

Content of study

An overview of various basic natural language processing techniques, learned through English text processing with Stanford Core NLP.

Stanford Core NLP, Stemming, Part-of-speech tagging, Named entity recognition, Coreference resolution, Dependency parsing, Phrase structure analysis, S-expressions

Knock content

For the English text (nlp.txt), execute the following processing.

52. Stemming

Take the output of 51 as input, apply Porter's stemming algorithm, and output each word and its stem in tab-delimited format. In Python, use the stemming module as an implementation of Porter's stemming algorithm.
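
To make the task concrete, here is a minimal sketch that applies Porter's algorithm to a handful of words and prints each word with its stem. It assumes nltk's `PorterStemmer` (the implementation used in this article) rather than the `stemming` module named in the task, and the word list is a hand-picked sample of words from nlp.txt, chosen only for illustration.

```python
from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()

# Hand-picked sample words (an assumption for illustration, not the full input)
words = ["Natural", "language", "processing", "computer", "computers"]

# Map each word to the stem returned by Porter's algorithm
stems = {word: stemmer.stem(word) for word in words}

for word, stem in stems.items():
    # One "word<TAB>stem" pair per line
    print(f"{word}\t{stem}")
```

Note that inflected forms such as computer and computers are reduced to the same stem.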

Problem supplement (about "stemming")

"Stemming" means reducing a word to its stem, the leading part of the word that does not change under inflection (e.g., the stem of "Natural" is "natur"). Stemming is used again later, in knock 71. There are several stemming algorithms; this time I use Porter's algorithm, which seems to be a well-known one. If you want to know more, please look it up.

Answer

Answer program: [052.Stemming.ipynb](https://github.com/YoheiFukuhara/nlp100/blob/master/06.%E8%8B%B1%E8%AA%9E%E3%83%86%E3%82%AD%E3%82%B9%E3%83%88%E3%81%AE%E5%87%A6%E7%90%86/052.%E3%82%B9%E3%83%86%E3%83%9F%E3%83%B3%E3%82%B0.ipynb)

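from nltk.stem.porter import PorterStemmer as PS

ps = PS()

# Read one token per line from the result of knock 51 and write the word and its stem
with open('./051.result.txt') as file_in, \
     open('./052.result.txt', 'w') as file_out:
    for token in file_in:
        # Skip blank lines
        if token != '\n':
            word = token.rstrip()
            # Note: print() also puts a space on each side of the '\t' argument
            print(word, '\t', ps.stem(word), file=file_out)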

Answer commentary

The program is short enough that it needs little explanation. Calling ps.stem() on a word returns its stem, so the stemming itself is a single function call.
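
One detail worth noting: `print(word, '\t', stem)` separates its arguments with spaces, so each output line contains a space on both sides of the tab. If strictly tab-delimited output is wanted, a variant along the following lines might be used. This is only a sketch: `write_pairs` is a hypothetical helper that is not part of the answer program, and the demonstration passes `str.lower` as a stand-in for `ps.stem` so that it runs without nltk.

```python
import io


def write_pairs(lines, out, stem):
    """Write "word<TAB>stem" for every non-blank line in `lines`.

    `stem` is any callable that maps a word to its stem, for example
    PorterStemmer().stem from nltk. (Hypothetical helper for illustration.)
    """
    for line in lines:
        word = line.rstrip()
        if word:  # skip blank lines
            # sep='\t' makes the tab the only separator, with no extra spaces
            print(word, stem(word), sep='\t', file=out)


# Demonstration with a stand-in stemmer; swap in ps.stem for real stemming
demo_out = io.StringIO()
write_pairs(["Natural\n", "\n", "language\n"], demo_out, str.lower)
result = demo_out.getvalue()
print(result, end='')
```

With the real files, `lines` would be the open `051.result.txt` handle and `out` the open `052.result.txt` handle.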

Output result (execution result)

When the program is executed, the following result is output (excerpt from the first 30 lines).

052.result.txt (excerpt from the first 30 lines)


Natural 	 natur
language 	 languag
processing 	 process
From 	 from
Wikipedia 	 wikipedia
the 	 the
free 	 free
encyclopedia 	 encyclopedia
Natural 	 natur
language 	 languag
processing 	 process
(NLP) 	 (nlp)
is 	 is
a 	 a
field 	 field
of 	 of
computer 	 comput
science 	 scienc
artificial 	 artifici
intelligence 	 intellig
and 	 and
linguistics 	 linguist
concerned 	 concern
with 	 with
the 	 the
interactions 	 interact
between 	 between
computers 	 comput
and 	 and
human 	 human

100 Language Processing Knock-52: Stemming