This is the record of the 72nd exercise of Language Processing 100 Knock 2015. The Japanese word used here for "feature" (素性) apparently takes a reading different from the everyday one and is a language-processing term (see the Wikipedia article on "feature structure", 素性構造); to anyone doing machine learning it simply means "feature". This time, the text file is read and the lemmas (dictionary headwords), excluding the stop words covered in the last knock (071: stop words), are extracted as features.
Link | Remarks |
---|---|
072_1. Feature extraction (extraction).ipynb | GitHub link to the answer program (extraction) |
072_2. Feature extraction (analysis).ipynb | GitHub link to the answer program (analysis) |
100 amateur language processing knocks: 72 | I am always indebted to this site when working through the 100 knocks |
Getting Started with Stanford NLP in Python | An easy-to-understand explanation of the differences from Stanford CoreNLP |
type | version | Contents |
---|---|---|
OS | Ubuntu 18.04.01 LTS | Running in a virtual environment |
pyenv | 1.2.15 | I use pyenv because I sometimes work with multiple Python environments |
Python | 3.6.9 | Python 3.6.9 on pyenv. There is no particular reason for not using the 3.7 or 3.8 series. Packages are managed with venv |
In the above environment, I use the following additional Python packages. They can be installed with an ordinary pip install.
type | version |
---|---|
nltk | 3.4.5 |
stanfordnlp | 0.2.0 |
pandas | 0.25.3 |
matplotlib | 3.1.1 |
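Before the extraction program below will run, the NLTK stop-word corpus and the stanfordnlp English models also need to be downloaded once. A minimal sketch, assuming the libraries' default download locations (stanfordnlp may ask to confirm the directory):

import nltk
import stanfordnlp

#One-time download of the stop-word list used via stopwords.words('english')
nltk.download('stopwords')
#One-time download of the English models used by stanfordnlp.Pipeline
stanfordnlp.download('en')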
In this chapter, the [sentence polarity dataset v1.0](http://www.cs.cornell.edu/people/pabo/movie-review-data/rt-polaritydata.README.1.0.txt) of Movie Review Data published by Bo Pang and Lillian Lee is used to work on the task of classifying sentences as positive or negative (polarity analysis).
Design your own features that may be useful for polarity analysis, and extract them from the training data. As a minimum baseline, use the reviews with the stop words removed and each word stemmed as the features.
The task says the minimum baseline is to stem each word, but this program starts from lemmas (dictionary headwords) instead; the Porter stemmer is then applied to those lemmas in the code below. This time, not only is the extraction performed, but the kinds of words obtained and their frequency distribution are also visualized.
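For reference, a stem and a lemma are not the same thing; the following small sketch uses NLTK's PorterStemmer, and the example words are arbitrary:

from nltk.stem.porter import PorterStemmer

ps = PorterStemmer()
#The Porter stemmer strips suffixes mechanically, so the result is not always a dictionary headword
print(ps.stem('movies'))   # -> movi
print(ps.stem('stories'))  # -> stori
#A lemmatizer would return the dictionary forms 'movie' and 'story' instead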
First, the extraction program, which is the main subject of this task.
import warnings
import re
from collections import Counter
import csv
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer as PS
import stanfordnlp
#Defined as a set for fast membership tests
STOP_WORDS = set(stopwords.words('english'))
ps = PS()
#Seems to be compliant with Universal POS tags
# https://universaldependencies.org/u/pos/
EXC_POS = {'PUNCT', #Punctuation
'X', #Other
'SYM', #symbol
'PART', #Particle('s etc.)
'CCONJ', #conjunction(and etc.)
'AUX', #Auxiliary verb(would etc.)
'PRON', #Pronoun
'SCONJ', #Subordinate conjunction(whether etc.)
'ADP', #Preposition(in etc.)
'NUM'} #number
#Specifying all the default processors was slow, so narrow them down to the minimum
# https://stanfordnlp.github.io/stanfordnlp/processors.html
nlp = stanfordnlp.Pipeline(processors='tokenize,pos,lemma')
#Regex matching an ASCII symbol at the start or end of a token
reg_sym = re.compile(r'^[!-/:-@[-`{-~]|[!-/:-@[-`{-~]$')
#Regex matching any digit
reg_dit = re.compile('[0-9]')
#Remove leading and trailing symbols
def remove_symbols(lemma):
return reg_sym.sub('', lemma)
#Judge whether a word should be treated as a stop word
def is_stopword(word):
lemma = remove_symbols(word.lemma)
return True if lemma in STOP_WORDS \
or lemma == '' \
or word.upos in EXC_POS \
or len(lemma) == 1 \
or reg_dit.search(lemma)\
else False
#Hide warning
warnings.simplefilter('ignore', UserWarning)
lemma = []
with open('./sentiment.txt') as file:
for i, line in enumerate(file):
print("\r{0}".format(i), end="")
#The first 3 characters only indicate negative/positive, so exclude them from NLP processing (to keep it as fast as possible)
doc = nlp(line[3:])
for sentence in doc.sentences:
lemma.extend([ps.stem(remove_symbols(word.lemma)) for word in sentence.words if is_stopword(word) is False])
freq_lemma = Counter(lemma)
with open('./lemma_all.txt', 'w') as f_out:
writer = csv.writer(f_out, delimiter='\t')
writer.writerow(['Char', 'Freq'])
for key, value in freq_lemma.items():
writer.writerow([key] + [value])
The language processing part of Stanford NLP is slow; **it takes about an hour**. Since I didn't want to re-run it during trial and error, the extraction result is written out to a [CSV file](https://github.com/YoheiFukuhara/nlp100/blob/master/08.%E6%A9%9F%E6%A2%B0%E5%AD%A6%E7%BF%92/lemma_all.txt). Writing it out allowed the analysis of the extraction result to be separated into its own program. The logic is almost the same as in the [last knock on stop words](https://qiita.com/FukuharaYohei/items/60719ddaa47474a9d670#%E5%9B%9E%E7%AD%94%E3%83%97%E3%83%AD%E3%82%B0%E3%83%A9%E3%83%A0%E5%AE%9F%E8%A1%8C%E7%B7%A8-071_2%E3%82%B9%E3%83%88%E3%83%83%E3%83%97%E3%83%AF%E3%83%BC%E3%83%89%E5%AE%9F%E8%A1%8Cipynb), so there is not much to explain. The only point worth mentioning is the part below, where the warning messages were annoying, so they are suppressed.
#Hide warning
warnings.simplefilter('ignore', UserWarning)
As a bonus, the extracted features are briefly analyzed.
import pandas as pd
import matplotlib.pyplot as plt
#Read the tab-separated feature/frequency file written by the extraction program
df_feature = pd.read_table('./lemma_all.txt')
df_sorted = df_feature.sort_values('Freq', ascending=False)
#Output the 10 most frequent features
print(df_sorted.head(10))
#Output basic statistics of the features
print(df_sorted.describe())
#Count how many features share each frequency (sorted in descending order of that count)
uniq_freq = df_feature['Freq'].value_counts()
print(uniq_freq)
#Bar graph of frequencies shared by more than 30 features
uniq_freq[uniq_freq > 30].sort_index().plot.bar(figsize=(12, 10))
#Bar graph of frequencies shared by more than 30 and fewer than 1000 features
uniq_freq[(uniq_freq > 30) & (uniq_freq < 1000)].sort_index().plot.bar(figsize=(12, 10))
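The analysis above is meant to run in a Jupyter notebook, where the charts render inline automatically. If it is run as a plain script, the figures need to be shown or saved explicitly; a small sketch using standard matplotlib calls (the file name is my own choice):

#Display all open figures when running outside a notebook
plt.show()
#Alternatively, save the current chart to a file
# plt.savefig('feature_freq.png')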
I'm using pandas to process the CSV.

The top 10 extracted features are as follows (the leftmost column is the index, so it can be ignored). Since this is Movie Review Data, words such as film and movie appear frequently.
Char Freq
102 film 1801
77 movi 1583
96 make 838
187 stori 540
258 time 504
43 charact 492
79 good 432
231 comedi 414
458 even 392
21 much 388
Looking at the basic statistics, they are as follows: roughly 12,000 features were extracted, with a mean frequency of about 8.9.
Freq
count 12105.000000
mean 8.860140
std 34.019655
min 1.000000
25% 1.000000
50% 2.000000
75% 6.000000
max 1801.000000
Counting how many of the roughly 12,000 features share each frequency (in descending order of that count) gives the following; more than half of the features appear at most twice (a quick check follows the listing).
1 4884
2 1832
3 1053
4 707
5 478
6 349
7 316
8 259
9 182
10 176
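The claim that more than half of the features appear at most twice can be checked directly from the frequency table loaded above (df_feature); a quick sketch:

#Fraction of features whose frequency is at most 2: (4884 + 1832) / 12105, roughly 0.55
print((df_feature['Freq'] <= 2).mean())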
Narrowing down to frequencies shared by more than 30 features, a bar graph is displayed with the frequency on the X-axis and the number of features on the Y-axis.
Since the many features that appear three times or fewer dominated the chart and made it hard to read, the plot is further restricted to frequencies shared by more than 30 and fewer than 1000 features.
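If the bar charts are still hard to read, axis labels can be added; a small sketch built on the same filter, using standard pandas/matplotlib calls (the label text is my own choice):

ax = uniq_freq[(uniq_freq > 30) & (uniq_freq < 1000)].sort_index().plot.bar(figsize=(12, 10))
#Frequency on the X-axis, number of features sharing that frequency on the Y-axis
ax.set_xlabel('Freq')
ax.set_ylabel('Number of features')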