Language Processing 100 Knocks, No. 76 (using scikit-learn): Labeling

This is a record of No. 76, "Labeling", of the Language Processing 100 Knocks 2015. The task asks for predictions on the training data to be labeled, but this time I deliberately use the test data instead. Until now I had not posted these to my blog because my answers were basically the same as "Amateur language processing 100 knocks", but for "Chapter 8: Machine Learning" I took the problems seriously and changed things to some extent, so I am posting them. I mainly use scikit-learn.

| Type | Version | Contents |
|---|---|---|
| OS | Ubuntu18.04.01 LTS | Running in a virtual machine |
| pyenv | 1.2.15 | I use pyenv because I sometimes use multiple Python environments |
| Python | 3.6.9 | I'm using Python 3.6.9 on pyenv. There is no deep reason for not using the 3.7 or 3.8 series. Packages are managed using venv |

In the above environment, I am using the following additional Python packages. They can be installed with regular pip.

| Type | Version |
|---|---|
| matplotlib | 3.1.1 |
| numpy | 1.17.4 |
| pandas | 0.25.3 |
| scikit-learn | 0.21.3 |


Chapter 8: Machine Learning

In this chapter, [sentence polarity dataset v1.0](http://www.cs.cornell.edu/people/pabo/movie-review-data/rt-polaritydata.README.1.0.txt) of the Movie Review Data published by Bo Pang and Lillian Lee is used to work on the task (polarity analysis) of classifying sentences as positive or negative.

76. Labeling

Apply the logistic regression model to the training data and output the correct label, predicted label, and predicted probability in tab-delimited format.

This time, I ignore the "for training data" part and run the model on the test data instead. I thought predictions on test data would be more useful than predictions on training data.
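The difference between the two choices can be sketched as follows. This is only an illustration on a tiny made-up corpus (the sentences, `test_size`, `random_state`, and variable names are assumptions, not part of the original notebook): the training score measures how well the model fits data it has already seen, while the test score estimates how it behaves on unseen sentences.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# Tiny made-up corpus (hypothetical; the real data is sentiment_stem.txt)
texts = ['good fun movi', 'great act', 'love thi film', 'wonder stori',
         'bad plot', 'bore and slow', 'terribl act', 'aw film',
         'good stori great cast', 'bad bore movi', 'fun and wonder',
         'slow terribl plot']
labels = ['1', '1', '1', '1', '0', '0', '0', '0', '1', '0', '1', '0']

x_train, x_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=0, stratify=labels)

model = Pipeline([('vectorizer', CountVectorizer()),
                  ('classifier', LogisticRegression(solver='liblinear'))])
model.fit(x_train, y_train)

train_score = model.score(x_train, y_train)  # accuracy on data seen during fitting
test_score = model.score(x_test, y_test)     # accuracy on held-out data
print('training accuracy:', train_score)
print('test accuracy:', test_score)
```

On real data the training accuracy is usually the more optimistic of the two, which is why labeling the test data gives a more honest picture.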


Answer Program [076. Labeling.ipynb](https://github.com/YoheiFukuhara/nlp100/blob/master/08.%E6%A9%9F%E6%A2%B0%E5%AD%A6%E7%BF%92/076.%E3%83%A9%E3%83%99%E3%83%AB%E4%BB%98%E3%81%91.ipynb)

Basically, this is the [previous "Answer Program (Analysis) 075. Weight of Features.ipynb"](https://github.com/YoheiFukuhara/nlp100/blob/master/08.%E6%A9%9F%E6%A2%B0%E5%AD%A6%E7%BF%92/075.%E7%B4%A0%E6%80%A7%E3%81%AE%E9%87%8D%E3%81%BF.ipynb) with prediction and file output logic added.

import csv

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.base import BaseEstimator, TransformerMixin

# Class for using word vectorization in GridSearchCV
class myVectorizer(BaseEstimator, TransformerMixin):
    def __init__(self, method='tfidf', min_df=0.0005, max_df=0.10):
        self.method = method
        self.min_df = min_df
        self.max_df = max_df

    def fit(self, x, y=None):
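        if self.method == 'tfidf':
            self.vectorizer = TfidfVectorizer(min_df=self.min_df, max_df=self.max_df)
        else:
            self.vectorizer = CountVectorizer(min_df=self.min_df, max_df=self.max_df)
        # Learn the vocabulary here so that transform() can be called afterwards
        self.vectorizer.fit(x)
        return self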

    def transform(self, x, y=None):
        return self.vectorizer.transform(x)
# Parameters for GridSearchCV
PARAMETERS = [
    {
        'vectorizer__method': ['tfidf', 'count'],
        'vectorizer__min_df': [0.0003, 0.0004],
        'vectorizer__max_df': [0.07, 0.10],
        'classifier__C': [1, 3],    # I also tried 10, but it was only slower and scored lower
        'classifier__solver': ['newton-cg', 'liblinear']},
    ]

#Read file
def read_csv_column(col):
    with open('./sentiment_stem.txt') as file:
        reader = csv.reader(file, delimiter='\t')
        header = next(reader)
        return [row[col] for row in reader]    
x_all = read_csv_column(1)
y_all = read_csv_column(0)
x_train, x_test, y_train, y_test = train_test_split(x_all, y_all)

def train(x_train, y_train, file):
    pipeline = Pipeline([('vectorizer', myVectorizer()), ('classifier', LogisticRegression())])
    # clf stands for classification
    clf = GridSearchCV(
            pipeline,             # Pipeline to tune
            PARAMETERS,           # Parameter sets to search over
            cv = 5)               # Number of cross-validation folds
    clf.fit(x_train, y_train)

    print('Grid Search Best parameters:', clf.best_params_)
    print('Grid Search Best validation score:', clf.best_score_)
    print('Grid Search Best training score:', clf.best_estimator_.score(x_train, y_train))    
    # Output the feature weights
    output_coef(clf.best_estimator_)
    return clf.best_estimator_

# Output feature weights
def output_coef(estimator):
    vec = estimator.named_steps['vectorizer']
    clf = estimator.named_steps['classifier']

    coef_df = pd.DataFrame([clf.coef_[0]]).T.rename(columns={0: 'Coefficients'})
    coef_df.index = vec.vectorizer.get_feature_names()
    coef_sort = coef_df.sort_values('Coefficients')
    # Return the sorted weights so the caller can inspect them
    return coef_sort

def validate(estimator, x_test, y_test):
    # Print the outcome type for the first 30 test sentences
    for i, (x, y) in enumerate(zip(x_test, y_test)):
        y_pred = estimator.predict_proba([x])
        # argmax gives the index of the more probable class ('0' or '1')
        if y == np.argmax(y_pred).astype(str):
            if y == '1':
                result = 'TP:The correct answer is Positive and the prediction is Positive'
            else:
                result = 'TN:The correct answer is Negative and the prediction is Negative'
        else:
            if y == '1':
                result = 'FN:The correct answer is Positive and the prediction is Negative'
            else:
                result = 'FP:The correct answer is Negative and the prediction is Positive'
        print(result, y_pred, x)
        if i == 29:
            break

    # Output the list as TSV
    y_pred = estimator.predict(x_test)
    y_prob = estimator.predict_proba(x_test)

    results = pd.DataFrame([y_test, y_pred, y_prob.T[1], x_test]).T.rename(columns={0: 'Correct label', 1: 'Predicted label', 2: 'Predicted probability (positive)', 3: 'Word string'})
    results.to_csv('./predict.txt', sep='\t')

estimator = train(x_train, y_train, 'gs_result.csv')
validate(estimator, x_test, y_test)

Answer commentary

The following tab-delimited file is output by the to_csv function of pandas.

| Column | Item | Description | Example |
|---|---|---|---|
| 1st column | Correct label | The correct label the data originally had | 0 (0 is negative) |
| 2nd column | Predicted label | The prediction result obtained using the predict function | 0 (0 is negative) |
| 3rd column | Predicted probability | The predicted probability obtained using the predict_proba function. The second column of the function's return value is the probability of being positive | |
| 4th column | Word string | The word string (explanatory variable) the data originally had | empti shell epic rather real deal |
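The relationship between the predicted label and the predicted probability can be illustrated with a small sketch. The training sentences and sample inputs here are made up, and the plain `CountVectorizer` pipeline is a simplification of the article's tuned pipeline; the point is only that the columns of `predict_proba` follow the order of `classes_`, so column 1 corresponds to the label '1' (positive).

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Tiny made-up training data (hypothetical); labels are strings as in the article
texts = ['good fun', 'great stori', 'bad plot', 'bore film']
labels = ['1', '1', '0', '0']

model = Pipeline([('vectorizer', CountVectorizer()),
                  ('classifier', LogisticRegression(solver='liblinear'))])
model.fit(texts, labels)

samples = ['good stori', 'bad film']
y_pred = model.predict(samples)        # predicted labels ('0' or '1')
y_prob = model.predict_proba(samples)  # shape (n_samples, n_classes)

# Columns of predict_proba follow classes_, so column 1 belongs to label '1'
classes = model.named_steps['classifier'].classes_
positive_prob = y_prob[:, 1]           # same values as y_prob.T[1]

for label, prob in zip(y_pred, positive_prob):
    print(label, round(prob, 3))
```

Because each row of `predict_proba` sums to 1, the predicted label is simply the class with the larger of the two probabilities.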
# Output the list as TSV
y_pred = estimator.predict(x_test)
y_prob = estimator.predict_proba(x_test)

results = pd.DataFrame([y_test, y_pred, y_prob.T[1], x_test]).T.rename(columns={0: 'Correct label', 1: 'Predicted label', 2: 'Predicted probability (positive)', 3: 'Word string'})
results.to_csv('./predict.txt', sep='\t')
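As a possible alternative to transposing a list of lists, the same table could be assembled from a dict so that each column keeps its own dtype. This is a sketch with made-up values; the column names, the in-memory buffer, and `index=False` are assumptions for illustration and differ from what the original notebook writes (it keeps the index column).

```python
import io

import pandas as pd

# Made-up stand-ins for y_test, y_pred, y_prob.T[1], and x_test
y_test = ['0', '1', '1']
y_pred = ['0', '1', '0']
positive_prob = [0.21, 0.87, 0.45]
x_test = ['empti shell epic rather real deal', 'good fun movi', 'slow but love']

results = pd.DataFrame({
    'Correct label': y_test,
    'Predicted label': y_pred,
    'Predicted probability (positive)': positive_prob,
    'Word string': x_test,
})

# Write to an in-memory buffer here; pass a path such as './predict.txt' for a real file
buffer = io.StringIO()
results.to_csv(buffer, sep='\t', index=False)
tsv_text = buffer.getvalue()
print(tsv_text)
```

With this construction the probability column stays numeric instead of becoming a generic object column, which makes later sorting or filtering by probability straightforward.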

The output file [predict.txt](https://github.com/YoheiFukuhara/nlp100/blob/master/08.%E6%A9%9F%E6%A2%B0%E5%AD%A6%E7%BF%92/predict.txt) is on GitHub.
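Once such a file exists, the TP/TN/FP/FN breakdown that the validate function prints for the first 30 rows can also be counted over all rows at once. A minimal sketch, assuming a tab-separated file with a leading index column and hypothetical English column names; the rows are made up and read from a string rather than from the real predict.txt.

```python
import io

import pandas as pd
from sklearn.metrics import confusion_matrix

# Made-up TSV content standing in for predict.txt (first column is the pandas index)
tsv_text = (
    '\tCorrect label\tPredicted label\tPredicted probability (positive)\tWord string\n'
    '0\t0\t0\t0.21\tempti shell epic rather real deal\n'
    '1\t1\t1\t0.87\tgood fun movi\n'
    '2\t1\t0\t0.45\tslow but love\n'
    '3\t0\t1\t0.66\tbad but funni\n'
)

results = pd.read_csv(io.StringIO(tsv_text), sep='\t', index_col=0)

# With labels=[0, 1], ravel() returns the counts in the order TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(
    results['Correct label'], results['Predicted label'], labels=[0, 1]).ravel()
print('TP:', tp, 'TN:', tn, 'FP:', fp, 'FN:', fn)
```

Note that `read_csv` parses the labels as integers here, so the `labels` argument uses 0 and 1 rather than the strings '0' and '1' used during training.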

