This is my record of knock 73, "Learning", from Language processing 100 knocks 2015. It took a lot of time for research and trial and error.

Until now I had not posted these to the blog, because my answers were basically the same as "Amateur language processing 100 knocks". For "Chapter 8: Machine Learning", however, I worked through the material seriously and changed a fair amount, so I am posting it. I mainly use scikit-learn.

Reference link

| Link | Remarks |
|:--|:--|
| 073_1.Learning(Preprocessing).ipynb | Answer program (preprocessing part), GitHub link |
| 073_2.Learning(Training).ipynb | Answer program (training part), GitHub link |
| Amateur language processing 100 knocks: 73 | The source I always rely on for the 100 language processing knocks |
| Getting started with Python with 100 knocks on language processing #73 - Machine learning, logistic regression with scikit-learn | Knock results using scikit-learn |

Environment

| Type | Version | Contents |
|:--|:--|:--|
| OS | Ubuntu 18.04.01 LTS | Running in a virtual machine |
| pyenv | 1.2.15 | I use pyenv because I sometimes use multiple Python environments |
| Python | 3.6.9 | Python 3.6.9 on pyenv. There is no deep reason for not using the 3.7 or 3.8 series. Packages are managed with venv |

In the above environment, I am using the following additional Python packages. Just install with regular pip.

| Type | Version |
|:--|:--|
| nltk | 3.4.5 |
| stanfordnlp | 0.2.0 |
| pandas | 0.25.3 |
| scikit-learn | 0.21.3 |

Task

Chapter 8: Machine Learning

In this chapter, we use the [sentence polarity dataset v1.0](http://www.cs.cornell.edu/people/pabo/movie-review-data/rt-polaritydata.README.1.0.txt) from the Movie Review Data published by Bo Pang and Lillian Lee to work on the task of classifying sentences as positive or negative (polarity analysis).

73. Learning

Train a logistic regression model using the features extracted in 72.

Answer

Answer premise

Assumption 1: Divide into preprocessing and learning

Stop-word removal plus lemmatization and stemming, for which I used Stanford NLP, took a long time, so I split the work into preprocessing and training.

Assumption 2: Word vectorization

I use tf-idf for word vectorization. tf-idf scores the importance of a word from two indicators: tf (Term Frequency, how often the word appears in a document) and idf (Inverse Document Frequency). It lowers the importance of general words that appear in many documents and raises the importance of words that appear only in specific documents.
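To make the two indicators concrete, here is a minimal pure-Python sketch of one common textbook form of tf-idf (tf is the term count in a document, idf is log(N / df)). The toy documents and the helper name `tf_idf` are invented for illustration. scikit-learn's TfidfVectorizer applies a smoothed idf and normalization by default, so its numbers will differ from these.

```python
import math
from collections import Counter

# Hypothetical toy corpus (tokenized by whitespace below)
docs = [
    "good film good cast",
    "bad film",
    "boring plot bad film",
]

def tf_idf(documents):
    """Return one {term: weight} dict per document using tf * log(N / df)."""
    tokenized = [doc.split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        tf = Counter(tokens)
        weights.append({term: count * math.log(n_docs / df[term])
                        for term, count in tf.items()})
    return weights

weights = tf_idf(docs)
for doc, w in zip(docs, weights):
    print(doc, "->", {t: round(v, 3) for t, v in w.items()})
```

In this toy corpus "film" appears in every document, so its weight is 0, while "good", which is frequent in a single document, gets the largest weight.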

Assumption 3: Hyperparameter search

To be honest, I could not judge whether tf-idf would still be effective after stop-word processing, so I also vectorize with simple word frequencies using [CountVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) and compare the accuracy. In addition, the hyperparameters of logistic regression are compared by grid search.

Answer program (preprocessing) [073_1. Learning (preprocessing).ipynb](https://github.com/YoheiFukuhara/nlp100/blob/master/08.%E6%A9%9F%E6%A2%B0%E5%AD%A6%E7%BF%92/073_1.%E5%AD%A6%E7%BF%92(%E5%89%8D%E5%87%A6%E7%90%86).ipynb)

First comes the preprocessing. What it does is almost the same as the [previous article's "Answer program (analysis) 072_2. Feature extraction (analysis).ipynb"](https://qiita.com/FukuharaYohei/items/f1a12d8e63fc576a456f#%E5%9B%9E%E7%AD%94%E3%83%97%E3%83%AD%E3%82%B0%E3%83%A9%E3%83%A0%E5%88%86%E6%9E%90%E7%B7%A8-072_2%E7%B4%A0%E6%80%A7%E6%8A%BD%E5%87%BA%E5%88%86%E6%9E%90ipynb), so there is nothing special to mention. The drawback is that it takes about an hour to run.

import warnings
import re
import csv

from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer as PS
import stanfordnlp

# Defined as a set for fast membership tests
STOP_WORDS = set(stopwords.words('english'))

# Stemmer
ps = PS()

#Seems to be compliant with Universal POS tags
# https://universaldependencies.org/u/pos/
EXC_POS = {'PUNCT',   #Punctuation
           'X',       #Other
           'SYM',     #symbol
           'PART',    #Particle('s etc.)
           'CCONJ',   #conjunction(and etc.)
           'AUX',     #Auxiliary verb(would etc.)
           'PRON',    #Pronoun
           'SCONJ',   #Subordinate conjunction(whether etc.)
           'ADP',     #Preposition(in etc.)
           'NUM'}     #number
		   
#It was slow to specify all the default processors, so narrow down to the minimum
# https://stanfordnlp.github.io/stanfordnlp/processors.html
nlp = stanfordnlp.Pipeline(processors='tokenize,pos,lemma')

reg_sym = re.compile(r'^[!-/:-@[-`{-~]|[!-/:-@[-`{-~]$')
reg_dit = re.compile('[0-9]')


#Remove leading and trailing symbols
def remove_symbols(lemma):
    return reg_sym.sub('', lemma)
	

# Decide whether a word should be treated as a stop word
def is_stopword(word):
    lemma = remove_symbols(word.lemma)
    return (lemma in STOP_WORDS
            or lemma == ''
            or word.upos in EXC_POS
            or len(lemma) == 1
            or reg_dit.search(lemma) is not None)

# Suppress UserWarning messages
warnings.simplefilter('ignore', UserWarning)

with open('./sentiment.txt') as file_in:
    with open('./sentiment_stem.txt', 'w') as file_out:
        writer = csv.writer(file_out, delimiter='\t')
        writer.writerow(['Label', 'Lemmas'])

        for i, line in enumerate(file_in):
            print("\r{0}".format(i), end="")
        
            lemma = []
        
            # The first 3 characters only mark negative / positive, so skip them in NLP processing (to keep it as fast as possible)
            doc = nlp(line[3:])
            for sentence in doc.sentences:
                lemma.extend([ps.stem(remove_symbols(word.lemma)) for word in sentence.words if not is_stopword(word)])
            writer.writerow([1 if line[0] == '+' else 0, ' '.join(lemma)])
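
The behaviour of `reg_sym` is easy to misread, so here is a small check that reuses the same pattern on a few tokens. The sample strings are invented for illustration and are not taken from the dataset. Note that the pattern strips at most one symbol from each end of the string.

```python
import re

# Same pattern as in the preprocessing program:
# one ASCII symbol at the start, or one ASCII symbol at the end
reg_sym = re.compile(r'^[!-/:-@[-`{-~]|[!-/:-@[-`{-~]$')

def remove_symbols(lemma):
    return reg_sym.sub('', lemma)

# Hypothetical sample tokens
samples = ["'s", "(film)", "well-made", "--so", "!"]
cleaned = {s: remove_symbols(s) for s in samples}
for original, result in cleaned.items():
    print(repr(original), "->", repr(result))
```

Symbols inside a token, such as the hyphen in "well-made", are left alone, and a token made of a single symbol becomes the empty string, which `is_stopword` then rejects.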

Answer program (training) [073_2. Learning (training).ipynb](https://github.com/YoheiFukuhara/nlp100/blob/master/08.%E6%A9%9F%E6%A2%B0%E5%AD%A6%E7%BF%92/073_2.%E5%AD%A6%E7%BF%92(%E8%A8%93%E7%B7%B4).ipynb)

Now for the main topic, the training part.

import csv

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.base import BaseEstimator, TransformerMixin

# Wrapper class so that GridSearchCV can switch the word vectorization method
class myVectorizer(BaseEstimator, TransformerMixin):
    def __init__(self, method='tfidf', min_df=0.0005, max_df=0.10):
        self.method = method
        self.min_df = min_df
        self.max_df = max_df

    def fit(self, x, y=None):
        if self.method == 'tfidf':
            self.vectorizer = TfidfVectorizer(min_df=self.min_df, max_df=self.max_df)
        else:
            self.vectorizer = CountVectorizer(min_df=self.min_df, max_df=self.max_df)
        self.vectorizer.fit(x)
        return self

    def transform(self, x, y=None):
        return self.vectorizer.transform(x)
		
#Parameters for GridSearchCV
PARAMETERS = [
    {
        'vectorizer__method':['tfidf', 'count'], 
        'vectorizer__min_df': [0.0004, 0.0005], 
        'vectorizer__max_df': [0.07, 0.10], 
        'classifier__C': [1, 3],    # I also tried 10, but it was just slow and the score was low
        'classifier__solver': ['newton-cg', 'liblinear']},
    ]

#Read file
def read_csv_column(col):
    with open('./sentiment_stem.txt') as file:
        reader = csv.reader(file, delimiter='\t')
        header = next(reader)
        return [row[col] for row in reader]

x_all = read_csv_column(1)
y_all = read_csv_column(0)

def train(x_train, y_train, file):
    pipline = Pipeline([('vectorizer', myVectorizer()), ('classifier', LogisticRegression())])
    
    #clf stands for classification
    clf = GridSearchCV(
            pipline,              #Estimator to tune (vectorizer + classifier)
            PARAMETERS,           #Parameter set you want to optimize
            cv = 5)               #Number of cross-validations
    
    clf.fit(x_train, y_train)
    pd.DataFrame.from_dict(clf.cv_results_).to_csv(file)

    print('Grid Search Best parameters:', clf.best_params_)
    print('Grid Search Best validation score:', clf.best_score_)
    print('Grid Search Best training score:', clf.best_estimator_.score(x_train, y_train))    

train(x_all, y_all, 'gs_result.csv')

Answer explanation (training)

Word vectorization

Words are vectorized with TfidfVectorizer or [CountVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html). The code is a little hard to follow because it is wrapped in a class so that it can be used with GridSearchCV, but the important part is the following.

def fit(self, x, y=None):
    if self.method == 'tfidf':
        self.vectorizer = TfidfVectorizer(min_df=self.min_df, max_df=self.max_df)
    else:
        self.vectorizer = CountVectorizer(min_df=self.min_df, max_df=self.max_df)
    self.vectorizer.fit(x)
    return self

def transform(self, x, y=None):
    return self.vectorizer.transform(x)

fit learns the vocabulary from all the documents, and transform converts each word sequence into a vector. Both TfidfVectorizer and [CountVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) take the following two parameters.

- min_df: Terms whose document frequency is below the specified proportion are excluded from vectorization. I set it on the assumption that nothing can be learned from terms that appear too rarely.
- max_df: Terms whose document frequency is above the specified proportion are excluded from vectorization. I set it because I thought that words such as "film" are meaningless in this dataset.
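
The effect of the two thresholds can be sketched without scikit-learn. The toy documents and the helper `select_vocabulary` below are hypothetical. The thresholds are far larger than the 0.0003 to 0.10 used in this article only because the toy corpus has four documents.

```python
from collections import Counter

# Hypothetical, already preprocessed toy documents
docs = [
    "film good",
    "film bad",
    "film bore plot",
    "film good cast",
]

def select_vocabulary(documents, min_df, max_df):
    """Keep terms whose document-frequency ratio lies within [min_df, max_df]."""
    n_docs = len(documents)
    # Count each term once per document
    df = Counter(term for doc in documents for term in set(doc.split()))
    return sorted(term for term, count in df.items()
                  if min_df <= count / n_docs <= max_df)

vocab = select_vocabulary(docs, min_df=0.5, max_df=0.9)
print(vocab)
```

Here "film" appears in all four documents and is dropped by max_df, the words that appear in only one document are dropped by min_df, and only "good" survives.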

Logistic regression

Training uses logistic regression via LogisticRegression. Logistic regression itself is explained in the article "Coursera Machine Learning Introductory Course (3rd week - Logistic regression, regularization)". Thanks to the Coursera Machine Learning Introductory Course, I was able to approach this with an understanding of regularization.

def train(x_train, y_train, file):
    pipline = Pipeline([('vectorizer', myVectorizer()), ('classifier', LogisticRegression())])

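In the parameter definition below, classifier__C controls regularization (in scikit-learn, C is the inverse of the regularization strength, so a smaller value regularizes more strongly) and classifier__solver selects the optimizer. I do not understand the differences between the optimizers, and I have not looked into them, on the view that grid search will pick the best one.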

PARAMETERS = [
    {
        'vectorizer__method':['tfidf', 'count'], 
        'vectorizer__min_df': [0.0004, 0.0005], 
        'vectorizer__max_df': [0.07, 0.10], 
        'classifier__C': [1, 3],    # I also tried 10, but it was just slow and the score was low
        'classifier__solver': ['newton-cg', 'liblinear']},
    ]
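
As a rough illustration of what C does, the sketch below fits a one-feature logistic regression by plain gradient descent for several values of C. It is not how scikit-learn's solvers work internally. The data, the function name `fit_logreg_1d` and the step-size rule are assumptions made for this example, and the objective (a squared weight penalty plus C times the log-loss, written here divided by C) is assumed to mirror the L2-penalized form described in the scikit-learn user guide.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logreg_1d(xs, ys, C, steps=2000):
    """Minimize w**2 / (2 * C) + sum(log-loss) by gradient descent (bias not penalized)."""
    # Step size 1/L, where L is an upper bound on the curvature of the objective
    lr = 1.0 / (1.0 / C + 0.25 * sum(x * x + 1.0 for x in xs))
    w, b = 0.0, 0.0
    for _ in range(steps):
        grad_w = w / C          # gradient of the penalty term
        grad_b = 0.0
        for x, y in zip(xs, ys):
            err = sigmoid(w * x + b) - y   # gradient of the log-loss
            grad_w += err * x
            grad_b += err
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Hypothetical toy data: negative x is class 0, positive x is class 1
xs = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]
ys = [0, 0, 0, 1, 1, 1]

w_by_C = {C: fit_logreg_1d(xs, ys, C)[0] for C in (0.01, 1, 3)}
for C, w in w_by_C.items():
    print("C =", C, "weight =", round(w, 3))
```

The learned weight shrinks as C gets smaller, which is the sense in which a small C means strong regularization.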

Pipeline

I use Pipeline to chain word vectorization and logistic regression training into a single estimator. The two steps then run together in one call, and the grid search described below can tune the hyperparameters of both steps at once.

def train(x_train, y_train, file):
    pipline = Pipeline([('vectorizer', myVectorizer()), ('classifier', LogisticRegression())])

Hyperparameter grid search

I search for hyperparameters with GridSearchCV. Because the steps are pipelined, the parameters of both word vectorization and logistic regression training can be searched together. The search space is defined in PARAMETERS, where each key joins the "step name" and the "parameter name" with __ (two underscores). More parameters could be searched, but I left them out because processing takes a long time. With this parameter set, the search takes about 2 minutes.

#Parameters for GridSearchCV
PARAMETERS = [
    {
        'vectorizer__method':['tfidf', 'count'], 
        'vectorizer__min_df': [0.0004, 0.0005], 
        'vectorizer__max_df': [0.07, 0.10], 
        'classifier__C': [1, 3],    # I also tried 10, but it was just slow and the score was low
        'classifier__solver': ['newton-cg', 'liblinear']},
    ]

#clf stands for classification
clf = GridSearchCV(
        pipline,
        PARAMETERS,           #Parameter set you want to optimize
        cv = 5)               #Number of cross-validations
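
To see how much work this grid implies, the following sketch expands the same dictionary with the standard library only. The helper `expand_grid` is hypothetical and only imitates the enumeration; it is not GridSearchCV's implementation.

```python
from itertools import product

# Same grid as PARAMETERS above (a single dict of candidate lists)
param_grid = {
    'vectorizer__method': ['tfidf', 'count'],
    'vectorizer__min_df': [0.0004, 0.0005],
    'vectorizer__max_df': [0.07, 0.10],
    'classifier__C': [1, 3],
    'classifier__solver': ['newton-cg', 'liblinear'],
}
CV_FOLDS = 5

def expand_grid(grid):
    """Yield every combination as {step name: {parameter name: value}}."""
    keys = list(grid)
    for values in product(*(grid[k] for k in keys)):
        combo = {}
        for key, value in zip(keys, values):
            step, _, name = key.partition('__')  # split "step__parameter"
            combo.setdefault(step, {})[name] = value
        yield combo

combos = list(expand_grid(param_grid))
n_fits = len(combos) * CV_FOLDS
print(len(combos), "combinations,", n_fits, "cross-validation fits")
print(combos[0])
```

Five two-valued parameters give 2^5 = 32 combinations, so 5-fold cross-validation means 160 fits, plus one more because GridSearchCV refits the best combination on all the data by default. The wider grid used for the result comparison in this article (4 values of min_df, 5 solvers) expands to 160 combinations, five times as many, which is roughly in line with the run time growing from about 2 to about 11 minutes.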

Search for word vectorization method by grid search

I defined the myVectorizer class to find out which of TfidfVectorizer and [CountVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) works better. It receives the parameter method and switches the vectorizer it uses in an `if` conditional branch. I referred to the following articles.

- Implement a minimum self-made estimator (Estimator) with scikit-learn
- Determine whether to do PCA by grid search

class myVectorizer(BaseEstimator, TransformerMixin):
    def __init__(self, method='tfidf', min_df=0.0005, max_df=0.10):
        self.method = method
        self.min_df = min_df
        self.max_df = max_df

    def fit(self, x, y=None):
        if self.method == 'tfidf':
            self.vectorizer = TfidfVectorizer(min_df=self.min_df, max_df=self.max_df)
        else:
            self.vectorizer = CountVectorizer(min_df=self.min_df, max_df=self.max_df)
        self.vectorizer.fit(x)
        return self

    def transform(self, x, y=None):
        return self.vectorizer.transform(x)

Grid search results

The grid search results are written to a CSV file. Below I compare each parameter by its average and maximum scores (computed in Excel).

pd.DataFrame.from_dict(clf.cv_results_).to_csv(file)
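
The per-parameter average and maximum can also be computed in Python instead of Excel. The sketch below is one way to do it under stated assumptions: the rows are a made-up stand-in for gs_result.csv with artificial scores, and `summarize` is a hypothetical helper. The column names follow the `param_<name>` and `mean_test_score` keys of `cv_results_`.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical rows standing in for gs_result.csv; the scores are artificial
rows = [
    {'param_vectorizer__method': 'tfidf', 'param_classifier__C': '1', 'mean_test_score': '0.80'},
    {'param_vectorizer__method': 'tfidf', 'param_classifier__C': '3', 'mean_test_score': '0.60'},
    {'param_vectorizer__method': 'count', 'param_classifier__C': '1', 'mean_test_score': '0.70'},
    {'param_vectorizer__method': 'count', 'param_classifier__C': '3', 'mean_test_score': '0.50'},
]

def summarize(rows, param_column, score_column='mean_test_score'):
    """Return {parameter value: (mean score, max score)} for one parameter column."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[param_column]].append(float(row[score_column]))
    return {value: (mean(scores), max(scores)) for value, scores in groups.items()}

summary = summarize(rows, 'param_vectorizer__method')
for value, (avg, best) in summary.items():
    print(value, round(avg, 4), best)
```

With the real file, the rows could be read with `csv.DictReader` and passed to the same function once per parameter column.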

For this comparison I widened the parameter grid a little, so training took about 11 minutes.

#Parameters for GridSearchCV
PARAMETERS = [
    {
        'vectorizer__method':['tfidf', 'count'], 
        'vectorizer__min_df': [0.0003, 0.0004, 0.0005, 0.0006], 
        'vectorizer__max_df': [0.07, 0.10], 
        'classifier__C': [1, 3],    # I also tried 10, but it was just slow and the score was low
        'classifier__solver': ['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga']}
    ]

Highest score hyperparameters

The following hyperparameters gave the highest score, a mean accuracy of 75.6% over 5-fold cross-validation.

- Vectorizer
  - Type: tf-idf (TfidfVectorizer)
  - max_df: 0.07
  - min_df: 0.0003
- Logistic regression
  - Optimizer: newton-cg, lbfgs, and liblinear gave the same score
  - Regularization term (C): 1

Now, let's compare each parameter below.

Vectorizer parameter

TfidfVectorizer/CountVectorizer

tf-idf clearly has a better score.

min_df

The smaller min_df is, the better the score.

max_df

For tf-idf, a smaller max_df gives a better score.

Logistic regression parameters

By optimizer

There is not much difference.

Regularization term

C = 1 clearly has a better score.

100 language processing knock-73 (using scikit-learn): learning