This is the 71st record of Language Processing 100 Knock 2015.
This time I use the `nltk` package and the `stanfordnlp` package to exclude stop words. A simple stop word dictionary is obtained from the `nltk` package, and symbols are also excluded based on their part of speech.
Until now, I hadn't posted my answers to the blog because they were basically the same as "Amateur language processing 100 knocks". However, for "Chapter 8: Machine Learning" I have worked on the problems seriously and changed them to some extent, so I will post them. I mainly use Stanford NLP.
Link | Remarks |
---|---|
071_1.Stop word(Preparation).ipynb | Answer program (preparation edition), GitHub link |
071_2.Stop word(Run).ipynb | Answer program (run edition), GitHub link |
100 amateur language processing knocks:71 | The blog I am always indebted to for the 100 language processing knocks |
Getting Started with Stanford NLP in Python | An easy-to-understand explanation of the difference from Stanford CoreNLP |
type | version | Contents |
---|---|---|
OS | Ubuntu18.04.01 LTS | Running in a virtual machine |
pyenv | 1.2.15 | I use pyenv because I sometimes use multiple Python environments |
Python | 3.6.9 | I use python3.6.9 on pyenv; there is no deep reason not to use the 3.7 or 3.8 series; packages are managed using venv |
In the above environment, I am using the following additional Python packages. Just install with regular pip.
type | version |
---|---|
nltk | 3.4.5 |
stanfordnlp | 0.2.0 |
In this chapter, the [sentence polarity dataset v1.0](http://www.cs.cornell.edu/people/pabo/movie-review-data/rt-polaritydata.README.1.0.txt) of Movie Review Data published by Bo Pang and Lillian Lee is used to work on the task of classifying sentences as positive or negative (polarity analysis).
Create an appropriate list of English stop words (stop list). Furthermore, implement a function that returns true if the word (character string) given as an argument is included in the stop list, and false otherwise. In addition, write a test for that function.
** "Appropriately" **?
I wondered what to do with the ** "appropriately" ** of the assignment. As a result, we decided to use the stop words defined in the nltk
package and the part-of-speech information of the morphological analysis results to determine the authenticity.
First of all, there is some preparation. This is separate from running the answer and only needs to be done once after installing the packages.
Download the stop word list for the `nltk` package. This is done first, separately from `pip install`.
```python
import nltk

# Download the stop word list
nltk.download('stopwords')

# Check the downloaded stop words
print(nltk.corpus.stopwords.words('english'))
```
Also download the English model for the `stanfordnlp` package. Note that it is about 250 MB. This is also done first, separately from `pip install`.
```python
import stanfordnlp

# Download the English model (about 250 MB)
stanfordnlp.download('en')

# Set up the pipeline once to confirm the model loads
stanfordnlp.Pipeline()
```
Below is the answer program (run edition).

```python
import re

from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer as PS
import stanfordnlp

# Defined as a set for fast membership tests
STOP_WORDS = set(stopwords.words('english'))

ps = PS()

# Seems to conform to the Universal POS tags
# https://universaldependencies.org/u/pos/
EXC_POS = {'PUNCT',  # punctuation
           'X',      # other
           'SYM',    # symbol
           'PART',   # particle ('s etc.)
           'CCONJ',  # coordinating conjunction (and etc.)
           'AUX',    # auxiliary verb (would etc.)
           'PRON',   # pronoun
           'SCONJ',  # subordinating conjunction (whether etc.)
           'ADP',    # adposition/preposition (in etc.)
           'NUM'}    # number

# Specifying all the default processors was slow, so narrow down to the minimum
# https://stanfordnlp.github.io/stanfordnlp/processors.html
nlp = stanfordnlp.Pipeline(processors='tokenize,pos,lemma')

# Half-width (ASCII) symbol at the beginning or end of a string
reg_sym = re.compile(r'^[!-/:-@[-`{-~]|[!-/:-@[-`{-~]$')
# Any digit
reg_dit = re.compile('[0-9]')


# Remove leading and trailing symbols
def remove_symbols(lemma):
    return reg_sym.sub('', lemma)


# Stop word judgment
def is_stopword(word):
    lemma = remove_symbols(word.lemma)
    return lemma in STOP_WORDS \
        or lemma == '' \
        or word.upos in EXC_POS \
        or len(lemma) == 1 \
        or reg_dit.search(lemma) is not None


# Judge the first 3 sentences as a trial
with open('./sentiment.txt') as file:
    for i, line in enumerate(file):
        # The first 3 characters only indicate negative/positive,
        # so exclude them from nlp processing (keep it as fast as possible)
        doc = nlp(line[3:])
        print(i, line)
        for sentence in doc.sentences:
            for word in sentence.words:
                print(word.text, word.upos, remove_symbols(word.lemma),
                      ps.stem(remove_symbols(word.lemma)), is_stopword(word))
        if i == 2:
            break
```
This time, in addition to simple stop word exclusion, words are also excluded based on the part-of-speech information from morphological analysis. First, the stop words are obtained as a set.
```python
# Defined as a set for fast membership tests
STOP_WORDS = set(stopwords.words('english'))
```
These are the contents of the stop word list.

```
['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', "don't", 'should', "should've", 'now', 'd', 'll', 'm', 'o', 're', 've', 'y', 'ain', 'aren', "aren't", 'couldn', "couldn't", 'didn', "didn't", 'doesn', "doesn't", 'hadn', "hadn't", 'hasn', "hasn't", 'haven', "haven't", 'isn', "isn't", 'ma', 'mightn', "mightn't", 'mustn', "mustn't", 'needn', "needn't", 'shan', "shan't", 'shouldn', "shouldn't", 'wasn', "wasn't", 'weren', "weren't", 'won', "won't", 'wouldn', "wouldn't"]
```
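As a side note on the speed comment above, membership tests on a set are much faster on average than on a list. A minimal sketch to check this (the timing numbers will of course vary by machine):

```python
import timeit

from nltk.corpus import stopwords

stop_list = stopwords.words('english')  # plain list
stop_set = set(stop_list)               # set with O(1) average lookup

# 'zebra' is a worst case for the list: it is absent, so every element is scanned
print(timeit.timeit(lambda: 'zebra' in stop_list, number=100000))
print(timeit.timeit(lambda: 'zebra' in stop_set, number=100000))
```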
In addition, the following parts of speech are defined as excluded. The list grew steadily as I reviewed the results afterwards.
What is nice about this approach is that, for example, `like` in "I **like** this movie" is a verb and is not treated as a stop word, whereas `like` in "he is **like** my hero" is excluded as ADP (preposition).
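A quick way to see this in action, as a minimal sketch assuming the `nlp` pipeline and `is_stopword` defined above (the exact tags depend on the model version):

```python
# Show how the POS tag of "like" differs by context
for text in ('I like this movie .', 'he is like my hero .'):
    doc = nlp(text)
    for word in doc.sentences[0].words:
        if word.text == 'like':
            print(text, '->', word.upos, 'stop word:', is_stopword(word))
# Expected: VERB (not a stop word) in the first sentence,
#           ADP (stop word) in the second
```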
The tags used here follow the Universal POS tags.
```python
# Seems to conform to the Universal POS tags
# https://universaldependencies.org/u/pos/
EXC_POS = {'PUNCT',  # punctuation
           'X',      # other
           'SYM',    # symbol
           'PART',   # particle ('s etc.)
           'CCONJ',  # coordinating conjunction (and etc.)
           'AUX',    # auxiliary verb (would etc.)
           'PRON',   # pronoun
           'SCONJ',  # subordinating conjunction (whether etc.)
           'ADP',    # adposition/preposition (in etc.)
           'NUM'}    # number
```
Compile the regular expressions used later. The first matches a half-width (ASCII) symbol at the beginning or end of a string; the second matches digits.
```python
reg_sym = re.compile(r'^[!-/:-@[-`{-~]|[!-/:-@[-`{-~]$')
reg_dit = re.compile('[0-9]')
```
A function that removes half-width symbols from the beginning and end of a string. For example, for a string like `-a`, the leading character is removed.
```python
# Remove leading and trailing symbols
def remove_symbols(lemma):
    return reg_sym.sub('', lemma)
```
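A quick usage check with some hypothetical inputs:

```python
print(remove_symbols('-a'))     # 'a'    -> leading '-' removed
print(remove_symbols('fine.'))  # 'fine' -> trailing '.' removed
print(remove_symbols('"'))      # ''     -> the whole string was a symbol
```

Note that at most one symbol is removed at each end per call, since each alternative in the pattern matches a single character.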
The essential function is defined below. `lemma` is the lemma: the dictionary form of a word, as produced by lemmatisation (e.g. better -> good).
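To see the lemmatisation at work, one can run the pipeline on a short sentence (a sketch; the exact output depends on the model, but `better` should come back as its dictionary form `good`):

```python
doc = nlp('this movie is better than that one .')
for word in doc.sentences[0].words:
    if word.text == 'better':
        print(word.text, '->', word.lemma)  # expected: better -> good
```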
A word is judged to be a stop word in any of the following cases:

* the lemma is in the stop word list
* the lemma is empty after symbol removal
* the part of speech is in EXC_POS
* the lemma is a single character
* the lemma contains a digit
```python
# Stop word judgment
def is_stopword(word):
    lemma = remove_symbols(word.lemma)
    return lemma in STOP_WORDS \
        or lemma == '' \
        or word.upos in EXC_POS \
        or len(lemma) == 1 \
        or reg_dit.search(lemma) is not None
```
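The task also asks for a test of this function. Since `is_stopword` expects a stanfordnlp word object exposing `lemma` and `upos` attributes, a minimal sketch can use a hypothetical stand-in (`MockWord` below is not part of stanfordnlp):

```python
from collections import namedtuple

# Hypothetical stand-in for a stanfordnlp word (only .lemma and .upos are used)
MockWord = namedtuple('MockWord', ['lemma', 'upos'])

def test_is_stopword():
    assert is_stopword(MockWord('the', 'DET')) is True      # in the stop word list
    assert is_stopword(MockWord('would', 'AUX')) is True    # excluded part of speech
    assert is_stopword(MockWord('x', 'NOUN')) is True       # single character
    assert is_stopword(MockWord('90s', 'NOUN')) is True     # contains a digit
    assert is_stopword(MockWord('movie', 'NOUN')) is False  # ordinary content word

test_is_stopword()
print('all tests passed')
```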
After that, the file is read and each word is judged. Since stanfordnlp is slow, the first three characters of each line, which only encode the negative/positive label, are excluded from nlp processing to keep it as fast as possible. This time, only the first three sentences are processed as a trial.
Finally, each word is also output in stemmed form using `ps.stem`. This is so that, for example, the three words adhere, adherence, and adherent are all reduced to the common stem adher. In the subsequent machine learning part, I think this form works better, so I use it.
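As a quick check of the stemmer on those example words (a minimal sketch):

```python
from nltk.stem.porter import PorterStemmer

ps = PorterStemmer()
for w in ('adhere', 'adherence', 'adherent'):
    print(w, '->', ps.stem(w))  # all three should reduce to 'adher'
```

The main loop below then reads the file and prints each word with its POS tag, cleaned lemma, stem, and stop word judgment.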
```python
with open('./sentiment.txt') as file:
    for i, line in enumerate(file):
        # The first 3 characters only indicate negative/positive,
        # so exclude them from nlp processing (keep it as fast as possible)
        doc = nlp(line[3:])
        print(i, line)
        for sentence in doc.sentences:
            for word in sentence.words:
                print(word.text, word.upos, remove_symbols(word.lemma),
                      ps.stem(remove_symbols(word.lemma)), is_stopword(word))
        if i == 2:
            break
```
The execution result looks like this.
```
0 +1 a chick flick for guys .
a DET a a True
chick NOUN chick chick False
flick NOUN flick flick False
for ADP for for True
guys NOUN guy guy False
. PUNCT True
1 +1 an impressive if flawed effort that indicates real talent .
an DET a a True
impressive ADJ impressive impress False
if SCONJ if if True
flawed VERB flaw flaw False
effort NOUN effort effort False
that PRON that that True
indicates VERB indicate indic False
real ADJ real real False
talent NOUN talent talent False
. PUNCT True
2 +1 displaying about equal amounts of naiveté , passion and talent , beneath clouds establishes sen as a filmmaker of considerable potential .
displaying VERB displaying display False
about ADP about about True
equal ADJ equal equal False
amounts NOUN amount amount False
of ADP of of True
naiveté NOUN naiveté naiveté False
, PUNCT True
passion NOUN passion passion False
and CCONJ and and True
talent NOUN talent talent False
, PUNCT True
beneath ADP beneath beneath True
clouds NOUN cloud cloud False
establishes VERB establish establish False
sen NOUN sen sen False
as ADP as as True
a DET a a True
filmmaker NOUN filmmaker filmmak False
of ADP of of True
considerable ADJ considerable consider False
potential NOUN potential potenti False
. PUNCT True
```