This is a record of problem 75, "Weight of features", from Language Processing 100 Knocks 2015. It outputs which features are important for classification in the trained model. Unlike deep learning, this kind of model should be explainable. Until now I had not posted my answers to this blog because they were basically the same as "Amateur language processing 100 knocks", but for "Chapter 8: Machine Learning" I took the problems seriously and changed my approach to some extent, so I am posting them. I mainly use scikit-learn.

Reference links

Link	Remarks
075. Weight of features.ipynb	GitHub link to the answer program
Amateur language processing 100 knocks: 75	The site I always rely on for the 100 knocks
Getting started with Python with 100 knocks on language processing #75 - Machine learning, scikit-learn coef_ property	Knock result using scikit-learn

Environment

Type	Version	Contents
OS	Ubuntu 18.04.01 LTS	Running as a virtual machine
pyenv	1.2.15	I use pyenv because I sometimes use multiple Python environments
Python	3.6.9	I'm using Python 3.6.9 on pyenv. There is no deep reason for not using the 3.7 or 3.8 series. Packages are managed with venv

In the above environment, I use the following additional Python packages. Just install them with regular pip.

Type	Version
matplotlib	3.1.1
numpy	1.17.4
pandas	0.25.3
scikit-learn	0.21.3

Task

Chapter 8: Machine Learning

In this chapter, we use the [sentence polarity dataset v1.0](http://www.cs.cornell.edu/people/pabo/movie-review-data/rt-polaritydata.README.1.0.txt) from the Movie Review Data published by Bo Pang and Lillian Lee to work on the task of classifying sentences as positive or negative (polarity analysis).

75. Weight of features

Check the top 10 features with high weights and the top 10 features with low weights in the logistic regression model learned in 73.

Answer

Answer program: [075. Weight of features.ipynb](https://github.com/YoheiFukuhara/nlp100/blob/master/08.%E6%A9%9F%E6%A2%B0%E5%AD%A6%E7%BF%92/075.%E7%B4%A0%E6%80%A7%E3%81%AE%E9%87%8D%E3%81%BF.ipynb)

Basically, this is the [previous answer program "074. Prediction.ipynb"](https://github.com/YoheiFukuhara/nlp100/blob/master/08.%E6%A9%9F%E6%A2%B0%E5%AD%A6%E7%BF%92/074.%E4%BA%88%E6%B8%AC.ipynb) with logic added to display the feature weights.

import csv

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.base import BaseEstimator, TransformerMixin

# Class that lets GridSearchCV switch the word vectorization method
class myVectorizer(BaseEstimator, TransformerMixin):
    def __init__(self, method='tfidf', min_df=0.0005, max_df=0.10):
        self.method = method
        self.min_df = min_df
        self.max_df = max_df

    def fit(self, x, y=None):
        if self.method == 'tfidf':
            self.vectorizer = TfidfVectorizer(min_df=self.min_df, max_df=self.max_df)
        else:
            self.vectorizer = CountVectorizer(min_df=self.min_df, max_df=self.max_df)
        self.vectorizer.fit(x)
        return self

    def transform(self, x, y=None):
        return self.vectorizer.transform(x)
		
# Parameters for GridSearchCV
PARAMETERS = [
    {
        'vectorizer__method': ['tfidf', 'count'],
        'vectorizer__min_df': [0.0003, 0.0004],
        'vectorizer__max_df': [0.07, 0.10],
        'classifier__C': [1, 3],    # I also tried 10, but it was only slower and the score was lower
        'classifier__solver': ['newton-cg', 'liblinear']},
    ]

# Read one column from the tab-separated file
def read_csv_column(col):
    with open('./sentiment_stem.txt') as file:
        reader = csv.reader(file, delimiter='\t')
        next(reader)    # Skip the header row
        return [row[col] for row in reader]
		
x_all = read_csv_column(1)
y_all = read_csv_column(0)
x_train, x_test, y_train, y_test = train_test_split(x_all, y_all)

def train(x_train, y_train, file):
    pipeline = Pipeline([('vectorizer', myVectorizer()), ('classifier', LogisticRegression())])
    
    # clf stands for classifier
    clf = GridSearchCV(
            pipeline,             # Pipeline to tune
            PARAMETERS,           # Parameter set you want to optimize
            cv = 5)               # Number of cross-validation folds
    
    clf.fit(x_train, y_train)
    pd.DataFrame.from_dict(clf.cv_results_).to_csv(file)

    print('Grid Search Best parameters:', clf.best_params_)
    print('Grid Search Best validation score:', clf.best_score_)
    print('Grid Search Best training score:', clf.best_estimator_.score(x_train, y_train))    
    
    #Feature weight output
    output_coef(clf.best_estimator_)
    
    return clf.best_estimator_

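# Output the feature weights
def output_coef(estimator):
    vec = estimator.named_steps['vectorizer']
    clf = estimator.named_steps['classifier']

    # coef_ has shape (1, n_features) for binary classification
    coef_df = pd.DataFrame([clf.coef_[0]]).T.rename(columns={0: 'Coefficients'})
    # get_feature_names() matches scikit-learn 0.21.3 used here;
    # scikit-learn 1.0 and later provide get_feature_names_out() instead
    coef_df.index = vec.vectorizer.get_feature_names()
    coef_sort = coef_df.sort_values('Coefficients')
    coef_sort[:10].plot.barh()        # 10 lowest weights
    coef_sort.tail(10).plot.barh()    # 10 highest weights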

def validate(estimator, x_test, y_test):
    
    for i, (x, y) in enumerate(zip(x_test, y_test)):
        y_pred = estimator.predict_proba([x])
        # argmax gives the column index; this equals the label only if the labels are '0' and '1'
        if y == str(np.argmax(y_pred)):
            if y == '1':
                result = 'TP:The correct answer is Positive and the prediction is Positive'
            else:
                result = 'TN:The correct answer is Negative and the prediction is Negative'
        else:
            if y == '1':
                result = 'FN:The correct answer is Positive and the prediction is Negative'
            else:
                result = 'FP:The correct answer is Negative and the prediction is Positive'
        print(result, y_pred, x)
        if i == 29:
            break

estimator = train(x_train, y_train, 'gs_result.csv')
validate(estimator, x_test, y_test)

Answer commentary

The function `output_coef` receives the estimator with the best hyperparameters found in training.

#Feature weight output
output_coef(clf.best_estimator_)
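
As a minimal sketch of why `clf.best_estimator_` can be handed straight to `output_coef`: with the default `refit=True`, GridSearchCV refits the whole pipeline on all the training data using the best parameters, so `best_estimator_` is itself a fitted Pipeline. The toy sentences, labels, and the small grid below are assumptions made up for illustration, not the article's data or settings.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Hypothetical toy data (not the article's sentiment_stem.txt)
texts = ['good movie', 'great film', 'nice story', 'good acting',
         'bad movie', 'dull film', 'poor story', 'bad acting']
labels = ['1', '1', '1', '1', '0', '0', '0', '0']

pipeline = Pipeline([
    ('vectorizer', CountVectorizer()),
    ('classifier', LogisticRegression(solver='liblinear')),
])

# '<step name>__<parameter name>' addresses a parameter inside the pipeline
search = GridSearchCV(pipeline, {'classifier__C': [1, 3]}, cv=2)
search.fit(texts, labels)

best = search.best_estimator_  # a fitted Pipeline, refit with the best parameters
print(type(best).__name__, list(best.named_steps))
```

The same double-underscore naming is what keys such as `vectorizer__method` and `classifier__C` in `PARAMETERS` rely on.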

Because the estimator is a pipeline, it is divided into steps. Fetch each step from the `named_steps` attribute.

vec = estimator.named_steps['vectorizer']
clf = estimator.named_steps['classifier']
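
The relationship between the two steps can be sketched as follows: the classifier holds one weight per vocabulary entry of the vectorizer. This sketch uses a plain `TfidfVectorizer` as the first step rather than the article's `myVectorizer` wrapper (which is why the article reaches one level deeper with `vec.vectorizer`), and the toy data is an assumption for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Hypothetical toy data
texts = ['good movie', 'great film', 'bad movie', 'dull film']
labels = ['1', '1', '0', '0']

pipe = Pipeline([
    ('vectorizer', TfidfVectorizer()),
    ('classifier', LogisticRegression(solver='liblinear')),
])
pipe.fit(texts, labels)

# named_steps maps each step name to its fitted object
vec = pipe.named_steps['vectorizer']
clf = pipe.named_steps['classifier']

# Binary logistic regression keeps a single row of weights,
# with one weight per vocabulary entry
print(clf.coef_.shape, len(vec.vocabulary_))
# A positive weight pushes the prediction toward classes_[1]
print(clf.classes_)
```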

Store the weights in a pandas DataFrame and transpose it so that each feature becomes a row. At the same time, rename the column.

coef_df = pd.DataFrame([clf.coef_[0]]).T.rename(columns={0: 'Coefficients'})
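
To make the shape handling concrete, here is a small sketch with a made-up stand-in for `clf.coef_`. It shows the transposed construction used above next to a direct construction from a dict of columns, which yields the same values. The coefficient numbers are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for clf.coef_, which has shape (1, n_features)
coef = np.array([[0.8, -1.2, 0.1]])

# Construction used in the article: one-row frame, transposed, column renamed
coef_df = pd.DataFrame([coef[0]]).T.rename(columns={0: 'Coefficients'})

# Alternative: build the single column directly
direct_df = pd.DataFrame({'Coefficients': coef[0]})

print(coef_df.shape, direct_df.shape)
print(np.array_equal(coef_df.values, direct_df.values))
```

Either form works; the direct one simply avoids the transpose and rename.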

Set the feature names as the index, then sort by weight value.

coef_df.index = vec.vectorizer.get_feature_names()
coef_sort = coef_df.sort_values('Coefficients')
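
One caveat when running this on a newer environment: `get_feature_names()` exists in scikit-learn 0.21.3 as used here, but it was deprecated in 1.0 and removed in 1.2 in favor of `get_feature_names_out()`. A minimal sketch of a version-tolerant lookup follows; the helper name `feature_names` is hypothetical and not part of the article's program.

```python
from sklearn.feature_extraction.text import CountVectorizer


def feature_names(vectorizer):
    """Return feature names on both old and new scikit-learn releases."""
    if hasattr(vectorizer, 'get_feature_names_out'):
        # scikit-learn 1.0 and later
        return list(vectorizer.get_feature_names_out())
    # Older releases such as 0.21.3
    return vectorizer.get_feature_names()


vectorizer = CountVectorizer().fit(['bad film', 'good film'])
names = feature_names(vectorizer)
print(names)
```

In the article's code the call would be applied to `vec.vectorizer`, since the pipeline step is the `myVectorizer` wrapper.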

Finally, it is output as a bar graph.

coef_sort[:10].plot.barh()
coef_sort.tail(10).plot.barh()
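
As one possible variation, the two charts can be drawn side by side with titles so it is clear which is which. This is only a sketch: the random weights, the `word0`-style feature names, and the output file name are assumptions for illustration, not the article's results.

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Hypothetical weights for 30 made-up features
rng = np.random.RandomState(0)
coef_df = pd.DataFrame({'Coefficients': rng.randn(30)},
                       index=['word{}'.format(i) for i in range(30)])
coef_sort = coef_df.sort_values('Coefficients')

# One axes object per chart, passed explicitly to pandas
fig, (ax_low, ax_high) = plt.subplots(1, 2, figsize=(10, 4))
coef_sort.head(10).plot.barh(ax=ax_low, legend=False,
                             title='Top 10 low-weight features')
coef_sort.tail(10).plot.barh(ax=ax_high, legend=False,
                             title='Top 10 high-weight features')
fig.tight_layout()
fig.savefig('coef_top10.png')  # hypothetical output file name
```

In a notebook the figure is displayed inline, so the `matplotlib.use('Agg')` and `savefig` lines can be dropped there.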

Feature weight output result

Top 10 low-weight features

Negative words such as bad and dull line up. Does tv come from review sentences like "TV is interesting"?

Top 10 high-weight features

Positive words such as beauti (beautiful) and join line up. Does perform come from a phrase like "cost performance"? I don't think flaw is a particularly positive word, but I won't dig into that this time.
