This is a record of the 52nd knock, "Stemming", from "Chapter 6: Processing English texts" of Language Processing 100 Knocks 2015 (.tohoku.ac.jp/nlp100/#ch6). **Stemming is something you will use often in language processing.** It is actually used later, in the machine learning knocks from the 71st onward. Technically it is very easy, because it just calls a function.
Link | Remarks |
---|---|
052.Stemming.ipynb | Answer program GitHub link |
100 amateur language processing knocks:52 | Source from which I copied and pasted many parts |
type | version | Contents |
---|---|---|
OS | Ubuntu 18.04.01 LTS | Running as a virtual machine |
pyenv | 1.2.16 | I use pyenv because I sometimes use multiple Python environments |
Python | 3.8.1 | I'm using Python 3.8.1 on pyenv. Packages are managed using venv |
In the above environment, I use the following additional Python package, installed with regular pip. This time I did not use the stemming package specified by the knock: it has not been updated since 2010, and nltk now seems to be more common.
type | version |
---|---|
nltk | 3.4.5 |
This chapter gives an overview of various basic technologies of natural language processing through English text processing using Stanford Core NLP.
Stanford Core NLP, Stemming, Part-of-speech tagging, Named entity recognition, Coreference resolution, Dependency parsing, Phrase structure analysis, S-expressions
For the English text (nlp.txt), execute the following processing.
Take the output of 51 as input, apply Porter's stemming algorithm, and output each word and its stem in tab-delimited format. In Python, use the stemming module as an implementation of Porter's stemming algorithm.
A "stem" is the front part of a word that does not change under inflection (e.g., the stem of "Natural" is "natur"), and "stemming" is the process of reducing a word to its stem. Stemming will be used later in the 71st knock. There are several stemming algorithms; this time I use Porter's algorithm, which seems to be a well-known one. If you want to know more, please look it up.
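from nltk.stem.porter import PorterStemmer as PS

ps = PS()

# Read one token per line from the output of 51 and write "word<TAB>stem"
with open('./051.result.txt') as file_in, \
        open('./052.result.txt', 'w') as file_out:
    for token in file_in:
        # Skip blank lines
        if token != '\n':
            word = token.rstrip()
            # sep='\t' keeps the output strictly tab-delimited
            print(word, ps.stem(word), sep='\t', file=file_out)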
The program is too short to need much explanation. You can stem a word just by calling `ps.stem()`, so it is very easy to use.
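If you want to reuse the same logic outside this script, one possible arrangement is to pull the loop into a small generator. This is only a sketch: the function name `stem_tokens` and the in-memory sample input are assumptions for illustration, not part of the answer program. Unlike the original check, it also skips lines that contain only whitespace.

```python
import io

from nltk.stem.porter import PorterStemmer


def stem_tokens(lines, stemmer):
    """Yield (word, stem) pairs for each non-blank line (hypothetical helper)."""
    for line in lines:
        word = line.rstrip()
        if word:
            yield word, stemmer.stem(word)


# In-memory stand-in for 051.result.txt: one token per line, with a blank line
sample_in = io.StringIO("Natural\nlanguage\nprocessing\n\nFrom\nWikipedia\n")
sample_out = io.StringIO()

for word, stem in stem_tokens(sample_in, PorterStemmer()):
    print(word, stem, sep="\t", file=sample_out)

result = sample_out.getvalue()
print(result, end="")
```

Because the function accepts any iterable of lines, the same code should work with a real file object opened on `051.result.txt`.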
When the program is executed, the following result is output (excerpt from the first 30 lines).
052.result.txt (excerpt from the first 30 lines)
Natural natur
language languag
processing process
From from
Wikipedia wikipedia
the the
free free
encyclopedia encyclopedia
Natural natur
language languag
processing process
(NLP) (nlp)
is is
a a
field field
of of
computer comput
science scienc
artificial artifici
intelligence intellig
and and
linguistics linguist
concerned concern
with with
the the
interactions interact
between between
computers comput
and and
human human
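To see what the result gives you, the following sketch groups words by their stem. It assumes the lines are in "word, tab, stem" form as the task requires; the four sample lines are copied from the excerpt above, and the variable names are my own.

```python
from collections import defaultdict

# A few "word<TAB>stem" lines copied from the excerpt (assumed tab-delimited)
lines = [
    "computer\tcomput",
    "computers\tcomput",
    "Natural\tnatur",
    "language\tlanguag",
]

# Collect every surface form that shares a stem
groups = defaultdict(set)
for line in lines:
    word, stem = line.split("\t")
    groups[stem].add(word)

for stem, words in sorted(groups.items()):
    print(stem, sorted(words))
```

Here "computer" and "computers" fall into the same group, which is why stemmed words are convenient as features later on.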