This is the record of the 72nd exercise of Language Processing 100 Knock 2015. The Japanese word used here for "feature" (素性) apparently takes a reading different from the everyday one and is a language-processing term (see the Wikipedia article on "feature structure", 素性構造); to anyone doing machine learning it simply means "feature". This time, the text file is read and the lemmas (dictionary headwords), excluding the stop words covered in the last knock (071: stop words), are extracted as features.
Link | Remarks |
---|---|
072_1. Feature extraction (extraction).ipynb | GitHub link to the answer program (extraction) |
072_2. Feature extraction (analysis).ipynb | GitHub link to the answer program (analysis) |
100 amateur language processing knocks: 72 | I am always indebted to this site when working through the 100 knocks |
Getting Started with Stanford NLP in Python | An easy-to-understand explanation of the differences from Stanford CoreNLP |
type | version | Contents |
---|---|---|
OS | Ubuntu 18.04.01 LTS | Running in a virtual environment |
pyenv | 1.2.15 | I use pyenv because I sometimes work with multiple Python environments |
Python | 3.6.9 | Python 3.6.9 on pyenv. There is no particular reason for not using the 3.7 or 3.8 series. Packages are managed with venv |
In the above environment, I use the following additional Python packages. They can be installed with an ordinary pip install.
type | version |
---|---|
nltk | 3.4.5 |
stanfordnlp | 0.2.0 |
pandas | 0.25.3 |
matplotlib | 3.1.1 |
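Before the extraction program below will run, the NLTK stop-word corpus and the stanfordnlp English models also need to be downloaded once. A minimal sketch, assuming the libraries' default download locations (stanfordnlp may ask to confirm the directory):

import nltk
import stanfordnlp

#One-time download of the stop-word list used via stopwords.words('english')
nltk.download('stopwords')
#One-time download of the English models used by stanfordnlp.Pipeline
stanfordnlp.download('en')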
In this chapter, the [sentence polarity dataset v1.0](http://www.cs.cornell.edu/people/pabo/movie-review-data/rt-polaritydata.README.1.0.txt) of Movie Review Data published by Bo Pang and Lillian Lee is used to work on the task of classifying sentences as positive or negative (polarity analysis).
Design your own features that may be useful for polarity analysis, and extract them from the training data. As a minimum baseline, use the reviews with the stop words removed and each word stemmed as the features.
The task says the minimum baseline is to stem each word, but this program starts from lemmas (dictionary headwords) instead; the Porter stemmer is then applied to those lemmas in the code below. This time, not only is the extraction performed, but the kinds of words obtained and their frequency distribution are also visualized.
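For reference, a stem and a lemma are not the same thing; the following small sketch uses NLTK's PorterStemmer, and the example words are arbitrary:

from nltk.stem.porter import PorterStemmer

ps = PorterStemmer()
#The Porter stemmer strips suffixes mechanically, so the result is not always a dictionary headword
print(ps.stem('movies'))   # -> movi
print(ps.stem('stories'))  # -> stori
#A lemmatizer would return the dictionary forms 'movie' and 'story' instead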
First, the extraction program, which is the main subject of this task.
import warnings
import re
from collections import Counter
import csv
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer as PS
import stanfordnlp
#Defined as a set for fast membership tests
STOP_WORDS = set(stopwords.words('english'))
ps = PS()
#Seems to be compliant with Universal POS tags
# https://universaldependencies.org/u/pos/
EXC_POS = {'PUNCT', #Punctuation
'X', #Other
'SYM', #symbol
'PART', #Particle('s etc.)
'CCONJ', #conjunction(and etc.)
'AUX', #Auxiliary verb(would etc.)
'PRON', #Pronoun
'SCONJ', #Subordinate conjunction(whether etc.)
'ADP', #Preposition(in etc.)
'NUM'} #number
#Specifying all the default processors was slow, so narrow them down to the minimum
# https://stanfordnlp.github.io/stanfordnlp/processors.html
nlp = stanfordnlp.Pipeline(processors='tokenize,pos,lemma')
#Regex matching an ASCII symbol at the start or end of a token
reg_sym = re.compile(r'^[!-/:-@[-`{-~]|[!-/:-@[-`{-~]$')
#Regex matching any digit
reg_dit = re.compile('[0-9]')
#Remove leading and trailing symbols
def remove_symbols(lemma):
return reg_sym.sub('', lemma)
#Judge whether a word should be treated as a stop word
def is_stopword(word):
lemma = remove_symbols(word.lemma)
return True if lemma in STOP_WORDS \
or lemma == '' \
or word.upos in EXC_POS \
or len(lemma) == 1 \
or reg_dit.search(lemma)\
else False
#Hide warning
warnings.simplefilter('ignore', UserWarning)
lemma = []
with open('./sentiment.txt') as file:
for i, line in enumerate(file):
print("\r{0}".format(i), end="")
#The first 3 characters only indicate negative/positive, so exclude them from NLP processing (to keep it as fast as possible)
doc = nlp(line[3:])
for sentence in doc.sentences:
lemma.extend([ps.stem(remove_symbols(word.lemma)) for word in sentence.words if is_stopword(word) is False])
freq_lemma = Counter(lemma)
with open('./lemma_all.txt', 'w') as f_out:
writer = csv.writer(f_out, delimiter='\t')
writer.writerow(['Char', 'Freq'])
for key, value in freq_lemma.items():
writer.writerow([key] + [value])
The language processing part of Stanford NLP is slow; **it takes about an hour**. Since I didn't want to re-run it during trial and error, the extraction result is written out to a [CSV file](https://github.com/YoheiFukuhara/nlp100/blob/master/08.%E6%A9%9F%E6%A2%B0%E5%AD%A6%E7%BF%92/lemma_all.txt). Writing it out allowed the analysis of the extraction result to be separated into its own program. The logic is almost the same as in the [last knock on stop words](https://qiita.com/FukuharaYohei/items/60719ddaa47474a9d670#%E5%9B%9E%E7%AD%94%E3%83%97%E3%83%AD%E3%82%B0%E3%83%A9%E3%83%A0%E5%AE%9F%E8%A1%8C%E7%B7%A8-071_2%E3%82%B9%E3%83%88%E3%83%83%E3%83%97%E3%83%AF%E3%83%BC%E3%83%89%E5%AE%9F%E8%A1%8Cipynb), so there is not much to explain. The only point worth mentioning is the part below, where the warning messages were annoying, so they are suppressed.
#Hide warning
warnings.simplefilter('ignore', UserWarning)
As a bonus, the extracted features are briefly analyzed.
import pandas as pd
import matplotlib.pyplot as plt
#Read the tab-separated feature/frequency file written by the extraction program
df_feature = pd.read_table('./lemma_all.txt')
df_sorted = df_feature.sort_values('Freq', ascending=False)
#Output the 10 most frequent features
print(df_sorted.head(10))
#Output basic statistics of the features
print(df_sorted.describe())
#Count how many features share each frequency (sorted in descending order of that count)
uniq_freq = df_feature['Freq'].value_counts()
print(uniq_freq)
#Bar graph of frequencies shared by more than 30 features
uniq_freq[uniq_freq > 30].sort_index().plot.bar(figsize=(12, 10))
#Bar graph of frequencies shared by more than 30 and fewer than 1000 features
uniq_freq[(uniq_freq > 30) & (uniq_freq < 1000)].sort_index().plot.bar(figsize=(12, 10))
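The analysis above is meant to run in a Jupyter notebook, where the charts render inline automatically. If it is run as a plain script, the figures need to be shown or saved explicitly; a small sketch using standard matplotlib calls (the file name is my own choice):

#Display all open figures when running outside a notebook
plt.show()
#Alternatively, save the current chart to a file
# plt.savefig('feature_freq.png')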
I'm using pandas to process the CSV.

The top 10 extracted features are as follows (the leftmost column is the index, so it can be ignored). Since this is Movie Review Data, words such as film and movie appear frequently.
Char Freq
102 film 1801
77 movi 1583
96 make 838
187 stori 540
258 time 504
43 charact 492
79 good 432
231 comedi 414
458 even 392
21 much 388
Looking at the basic statistics, they are as follows: roughly 12,000 features were extracted, with a mean frequency of about 8.9.
Freq
count 12105.000000
mean 8.860140
std 34.019655
min 1.000000
25% 1.000000
50% 2.000000
75% 6.000000
max 1801.000000
Counting how many of the roughly 12,000 features share each frequency (in descending order of that count) gives the following; more than half of the features appear at most twice (a quick check follows the listing).
1 4884
2 1832
3 1053
4 707
5 478
6 349
7 316
8 259
9 182
10 176
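The claim that more than half of the features appear at most twice can be checked directly from the frequency table loaded above (df_feature); a quick sketch:

#Fraction of features whose frequency is at most 2: (4884 + 1832) / 12105, roughly 0.55
print((df_feature['Freq'] <= 2).mean())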
Narrowing down to frequencies shared by more than 30 features, a bar graph is displayed with the frequency on the X-axis and the number of features on the Y-axis.
Since the many features that appear three times or fewer dominated the chart and made it hard to read, the plot is further restricted to frequencies shared by more than 30 and fewer than 1000 features.
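If the bar charts are still hard to read, axis labels can be added; a small sketch built on the same filter, using standard pandas/matplotlib calls (the label text is my own choice):

ax = uniq_freq[(uniq_freq > 30) & (uniq_freq < 1000)].sort_index().plot.bar(figsize=(12, 10))
#Frequency on the X-axis, number of features sharing that frequency on the Y-axis
ax.set_xlabel('Freq')
ax.set_ylabel('Number of features')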