This is my record of knock 73, "Learning", from Language processing 100 knocks 2015. It took a lot of time for research and trial and error.
Until now I had not posted these answers to the blog because they were basically the same as "Amateur language processing 100 knocks". However, I took "Chapter 8: Machine Learning" seriously and my answers differ to some extent, so I am posting them. I mainly use scikit-learn.
Link | Remarks |
---|---|
073_1.Learning(Preprocessing).ipynb | Answer program (preprocessing part), GitHub link |
073_2.Learning(Training).ipynb | Answer program (training part), GitHub link |
Amateur language processing 100 knocks: 73 | I always rely on this series while working through the 100 knocks |
Getting started with Python with 100 knocks on language processing #73 - Machine learning, logistic regression with scikit-learn | Knock results using scikit-learn |
type | version | Contents |
---|---|---|
OS | Ubuntu 18.04.01 LTS | Running in a virtual machine |
pyenv | 1.2.15 | I use pyenv because I sometimes use multiple Python environments |
Python | 3.6.9 | I'm using Python 3.6.9 on pyenv. There is no particular reason for not using the 3.7 or 3.8 series. Packages are managed with venv |
In the above environment, I use the following additional Python packages. They can be installed with regular pip.
type | version |
---|---|
nltk | 3.4.5 |
stanfordnlp | 0.2.0 |
pandas | 0.25.3 |
scikit-learn | 0.21.3 |
In this chapter, we use the [sentence polarity dataset v1.0](http://www.cs.cornell.edu/people/pabo/movie-review-data/rt-polaritydata.README.1.0.txt) of Movie Review Data published by Bo Pang and Lillian Lee to work on the task of classifying sentences as positive or negative (polarity analysis).
Learn a logistic regression model using the features extracted in 72.
I used Stanford NLP for stop-word removal and lemmatization/stemming, and it took a long time, so I split the work into preprocessing and training.
I use tf-idf for word vectorization. tf-idf computes the importance of a word from two measures, tf (Term Frequency) and idf (Inverse Document Frequency): it lowers the importance of words that appear in many documents (common words) and raises the importance of words that appear only in specific documents.
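As a rough illustration of how the two measures combine, here is a minimal pure-Python sketch. It assumes raw counts for tf and the smoothed idf, ln((1 + n) / (1 + df)) + 1, which is the default documented for scikit-learn; the real TfidfVectorizer also normalizes each vector, which this sketch skips. The function name `tfidf` and the three sample sentences are made up for the example.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {term: weight} dict per document (no length normalization)."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term
    df = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed idf: terms found in every document get the minimum weight of 1.0
    idf = {t: math.log((1 + n_docs) / (1 + d)) + 1 for t, d in df.items()}
    return [{t: c * idf[t] for t, c in Counter(tokens).items()}
            for tokens in tokenized]

docs = ["good film", "bad film", "film bore"]  # made-up sample sentences
weights = tfidf(docs)
print(weights[0])
```

In this toy corpus "film" appears in every document, so it ends up with a lower weight than "good", which appears in only one.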
To be honest, I couldn't judge whether tf-idf is still effective after stop-word processing, so I also vectorize with simple word counts using [CountVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) and compare the accuracy. In addition, the hyperparameters of logistic regression are compared by grid search.
First is the preprocessing. However, what it does is the same as the [previous "answer program (analysis) 072_2. Feature extraction (analysis).ipynb"](https://qiita.com/FukuharaYohei/items/f1a12d8e63fc576a456f#%E5%9B%9E%E7%AD%94%E3%83%97%E3%83%AD%E3%82%B0%E3%83%A9%E3%83%A0%E5%88%86%E6%9E%90%E7%B7%A8-072_2%E7%B4%A0%E6%80%A7%E6%8A%BD%E5%87%BA%E5%88%86%E6%9E%90ipynb), so there is nothing special to mention. The drawback is that it takes about an hour to run.
import warnings
import re
import csv

from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer as PS
import stanfordnlp

# Defined as a set for fast membership tests
STOP_WORDS = set(stopwords.words('english'))

# Stemmer
ps = PS()

# Seems to be compliant with Universal POS tags
# https://universaldependencies.org/u/pos/
EXC_POS = {'PUNCT',   # Punctuation
           'X',       # Other
           'SYM',     # Symbol
           'PART',    # Particle ('s etc.)
           'CCONJ',   # Conjunction (and etc.)
           'AUX',     # Auxiliary verb (would etc.)
           'PRON',    # Pronoun
           'SCONJ',   # Subordinate conjunction (whether etc.)
           'ADP',     # Preposition (in etc.)
           'NUM'}     # Number

# Specifying all the default processors was slow, so narrow down to the minimum
# https://stanfordnlp.github.io/stanfordnlp/processors.html
nlp = stanfordnlp.Pipeline(processors='tokenize,pos,lemma')

reg_sym = re.compile(r'^[!-/:-@[-`{-~]|[!-/:-@[-`{-~]$')
reg_dit = re.compile('[0-9]')

# Remove a leading and a trailing symbol
def remove_symbols(lemma):
    return reg_sym.sub('', lemma)

# Return True if the word should be treated as a stop word
def is_stopword(word):
    lemma = remove_symbols(word.lemma)
    return True if lemma in STOP_WORDS \
                  or lemma == '' \
                  or word.upos in EXC_POS \
                  or len(lemma) == 1 \
                  or reg_dit.search(lemma) \
                else False

# Hide warnings
warnings.simplefilter('ignore', UserWarning)

with open('./sentiment.txt') as file_in:
    with open('./sentiment_stem.txt', 'w') as file_out:
        writer = csv.writer(file_out, delimiter='\t')
        writer.writerow(['Label', 'Lemmas'])

        for i, line in enumerate(file_in):
            print("\r{0}".format(i), end="")

            lemma = []

            # The first 3 characters only indicate negative / positive,
            # so skip them in the nlp processing (to make it as fast as possible)
            doc = nlp(line[3:])
            for sentence in doc.sentences:
                lemma.extend([ps.stem(remove_symbols(word.lemma)) for word in sentence.words if is_stopword(word) is False])
            writer.writerow([1 if line[0] == '+' else 0, ' '.join(lemma)])
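To make the symbol handling in the preprocessing easier to follow, here is a small stand-alone check of the reg_sym pattern. It copies only the regular expression and remove_symbols from the code above; the sample strings are made up. Note that the pattern strips at most one leading and one trailing ASCII symbol, not a whole run of them.

```python
import re

# Same pattern as reg_sym in the preprocessing code
reg_sym = re.compile(r'^[!-/:-@[-`{-~]|[!-/:-@[-`{-~]$')

def remove_symbols(lemma):
    # Delete one symbol at the start and one at the end, if present
    return reg_sym.sub('', lemma)

# Made-up sample tokens
for text in ["film.", "(good)", "'s", "--", "well-made", "...ok"]:
    print(repr(text), '->', repr(remove_symbols(text)))
```

Tokens that become empty or a single character after this step (such as "--" or "'s") are then rejected by is_stopword.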
Next is the training part, which is the main subject.
import csv

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.base import BaseEstimator, TransformerMixin

# Class for switching the word vectorizer inside GridSearchCV
class myVectorizer(BaseEstimator, TransformerMixin):
    def __init__(self, method='tfidf', min_df=0.0005, max_df=0.10):
        self.method = method
        self.min_df = min_df
        self.max_df = max_df

    def fit(self, x, y=None):
        if self.method == 'tfidf':
            self.vectorizer = TfidfVectorizer(min_df=self.min_df, max_df=self.max_df)
        else:
            self.vectorizer = CountVectorizer(min_df=self.min_df, max_df=self.max_df)
        self.vectorizer.fit(x)
        return self

    def transform(self, x, y=None):
        return self.vectorizer.transform(x)

# Parameters for GridSearchCV
PARAMETERS = [
    {
        'vectorizer__method': ['tfidf', 'count'],
        'vectorizer__min_df': [0.0004, 0.0005],
        'vectorizer__max_df': [0.07, 0.10],
        'classifier__C': [1, 3],    # I also tried 10, but it was only slower and scored lower
        'classifier__solver': ['newton-cg', 'liblinear']},
    ]

# Read one column of the file
def read_csv_column(col):
    with open('./sentiment_stem.txt') as file:
        reader = csv.reader(file, delimiter='\t')
        header = next(reader)  # skip the header row
        return [row[col] for row in reader]

x_all = read_csv_column(1)
y_all = read_csv_column(0)

def train(x_train, y_train, file):
    pipline = Pipeline([('vectorizer', myVectorizer()), ('classifier', LogisticRegression())])

    # clf stands for classification
    clf = GridSearchCV(
            pipline,
            PARAMETERS,           # Parameter set you want to optimize
            cv = 5)               # Number of cross-validation folds

    clf.fit(x_train, y_train)
    pd.DataFrame.from_dict(clf.cv_results_).to_csv(file)

    print('Grid Search Best parameters:', clf.best_params_)
    print('Grid Search Best validation score:', clf.best_score_)
    print('Grid Search Best training score:', clf.best_estimator_.score(x_train, y_train))

train(x_all, y_all, 'gs_result.csv')
TfidfVectorizer or [CountVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) is used for word vectorization.
It's a bit confusing because it is wrapped in a class so that it can be used with GridSearchCV, but the important points are:
    def fit(self, x, y=None):
        if self.method == 'tfidf':
            self.vectorizer = TfidfVectorizer(min_df=self.min_df, max_df=self.max_df)
        else:
            self.vectorizer = CountVectorizer(min_df=self.min_df, max_df=self.max_df)
        self.vectorizer.fit(x)
        return self

    def transform(self, x, y=None):
        return self.vectorizer.transform(x)
fit learns the vocabulary from all the words, and transform converts each word string into a vector.
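The split between fit and transform can be pictured with a toy count vectorizer. This is only a sketch of the contract, not scikit-learn's implementation; the class name `ToyCountVectorizer` and the sample sentences are made up.

```python
from collections import Counter

class ToyCountVectorizer:
    """Toy illustration of fit/transform (not scikit-learn's implementation)."""

    def fit(self, docs):
        # Learn the vocabulary: one column index per distinct word
        words = sorted({w for doc in docs for w in doc.split()})
        self.vocabulary_ = {w: i for i, w in enumerate(words)}
        return self

    def transform(self, docs):
        # Turn each document into a count vector over the learned vocabulary;
        # words that were not seen during fit are ignored
        rows = []
        for doc in docs:
            counts = Counter(doc.split())
            rows.append([counts.get(w, 0) for w in self.vocabulary_])
        return rows

vec = ToyCountVectorizer().fit(["good film", "bad film"])
print(vec.vocabulary_)
print(vec.transform(["good good film", "unseen word"]))
```

Because the vocabulary is fixed at fit time, data passed to transform later (for example a validation fold) is always mapped onto the same columns.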
Both TfidfVectorizer and CountVectorizer take the following two parameters.
- min_df: Excludes a word from vectorization if the proportion of documents it appears in is lower than the specified value. I set it because I assumed that a model cannot learn from words that appear too rarely.
- max_df: Excludes a word from vectorization if the proportion of documents it appears in is higher than the specified value. This time, I set it because I thought that words such as "film" carry no meaning.
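The effect of the two thresholds can be sketched in a few lines of pure Python. This is an assumption-laden illustration rather than scikit-learn's code: the helper `filter_by_document_frequency`, the sample sentences, and the threshold values are all made up, and both thresholds are treated as proportions of documents.

```python
from collections import Counter

def filter_by_document_frequency(docs, min_df, max_df):
    """Keep terms whose document-frequency ratio lies in [min_df, max_df]."""
    tokenized = [set(doc.split()) for doc in docs]
    n_docs = len(tokenized)
    # Number of documents in which each term occurs
    df = Counter(term for tokens in tokenized for term in tokens)
    return sorted(t for t, d in df.items() if min_df <= d / n_docs <= max_df)

docs = ["good film", "bad film", "dull film", "good plot"]  # made-up sample
print(filter_by_document_frequency(docs, min_df=0.3, max_df=0.6))  # ['good']
```

Here "film" (3 of 4 documents) is dropped by max_df, while "bad", "dull", and "plot" (1 of 4 each) are dropped by min_df.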
I train with logistic regression using LogisticRegression. Logistic regression itself is explained in the article "Coursera Machine Learning Introductory Course (3rd week - Logistic regression, regularization)". Thanks to the Coursera Machine Learning Introductory Course, I could approach this with an understanding of regularization.
def train(x_train, y_train, file):
    pipline = Pipeline([('vectorizer', myVectorizer()), ('classifier', LogisticRegression())])
In the parameter definition below, classifier__C is the regularization parameter (in scikit-learn, C is the inverse of the regularization strength) and classifier__solver is the optimizer. I don't understand the differences between the optimizers, and I haven't investigated them, on the assumption that grid search will pick the best one.
PARAMETERS = [
    {
        'vectorizer__method': ['tfidf', 'count'],
        'vectorizer__min_df': [0.0004, 0.0005],
        'vectorizer__max_df': [0.07, 0.10],
        'classifier__C': [1, 3],    # I also tried 10, but it was only slower and scored lower
        'classifier__solver': ['newton-cg', 'liblinear']},
    ]
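To show what C does, the sketch below evaluates an L2-regularized logistic regression objective of the form 0.5 * ||w||^2 + C * (sum of log losses), which is how the scikit-learn user guide writes it for the L2 penalty. The function name, feature vectors, labels, and weights are all made up; this is not the solver's actual code.

```python
import math

def l2_logistic_objective(w, b, X, y, C):
    """Return (total, penalty, weighted data loss) for labels y_i in {-1, +1}."""
    penalty = 0.5 * sum(wj * wj for wj in w)
    data_loss = sum(
        math.log1p(math.exp(-yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)))
        for xi, yi in zip(X, y)
    )
    return penalty + C * data_loss, penalty, C * data_loss

X = [[1.0, 0.0], [0.0, 1.0]]   # made-up feature vectors
y = [1, -1]                    # made-up labels
w, b = [0.5, -0.5], 0.0        # made-up weights and intercept

for C in (1, 3):
    total, penalty, loss = l2_logistic_objective(w, b, X, y, C)
    print(C, round(penalty, 3), round(loss, 3))
```

The penalty term stays the same while the data term grows with C, so a larger C means weaker regularization.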
I use Pipeline to chain word vectorization and logistic regression training. As a result, the two steps run as a single unit, and the hyperparameter search in the grid search described later can cover both at once.
def train(x_train, y_train, file):
    pipline = Pipeline([('vectorizer', myVectorizer()), ('classifier', LogisticRegression())])
I search for hyperparameters with GridSearchCV. Because the steps are pipelined, the parameters of both word vectorization and logistic regression training can be searched at the same time.
The search targets are defined in PARAMETERS, where each key joins the "step name" and the "parameter name" with __.
There are more searchable parameters, but I omitted them because processing takes a long time. This parameter set takes about 2 minutes.
# Parameters for GridSearchCV
PARAMETERS = [
    {
        'vectorizer__method': ['tfidf', 'count'],
        'vectorizer__min_df': [0.0004, 0.0005],
        'vectorizer__max_df': [0.07, 0.10],
        'classifier__C': [1, 3],    # I also tried 10, but it was only slower and scored lower
        'classifier__solver': ['newton-cg', 'liblinear']},
    ]

# clf stands for classification
clf = GridSearchCV(
        pipline,
        PARAMETERS,           # Parameter set you want to optimize
        cv = 5)               # Number of cross-validation folds
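As a rough check on why the search time grows so quickly, the sketch below splits each key into its step name and parameter name and counts the combinations in the grid above. It uses only the standard library; the dictionary simply repeats the values from PARAMETERS.

```python
from itertools import product

PARAMETERS = {
    'vectorizer__method': ['tfidf', 'count'],
    'vectorizer__min_df': [0.0004, 0.0005],
    'vectorizer__max_df': [0.07, 0.10],
    'classifier__C': [1, 3],
    'classifier__solver': ['newton-cg', 'liblinear'],
}

# Each key is "<pipeline step name>__<parameter name>"
for key in PARAMETERS:
    step, param = key.split('__', 1)
    print(step, '->', param)

combinations = list(product(*PARAMETERS.values()))
cv = 5
print(len(combinations))        # 32 candidate settings
print(len(combinations) * cv)   # 160 fits during cross-validation
```

By default GridSearchCV also refits the best setting once on all the data. Every extra value in any list multiplies the total, which is why trimming the grid matters.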
The myVectorizer class is defined to find out whether TfidfVectorizer or CountVectorizer is better. It receives the parameter method and switches the vectorizer in an if conditional branch. I referred to the following articles.
- Implement a minimum self-made estimator (Estimator) with scikit-learn
- Determine whether to do PCA by grid search
class myVectorizer(BaseEstimator, TransformerMixin):
    def __init__(self, method='tfidf', min_df=0.0005, max_df=0.10):
        self.method = method
        self.min_df = min_df
        self.max_df = max_df

    def fit(self, x, y=None):
        if self.method == 'tfidf':
            self.vectorizer = TfidfVectorizer(min_df=self.min_df, max_df=self.max_df)
        else:
            self.vectorizer = CountVectorizer(min_df=self.min_df, max_df=self.max_df)
        self.vectorizer.fit(x)
        return self

    def transform(self, x, y=None):
        return self.vectorizer.transform(x)
The grid search results are output to a CSV file. Later, I compare each parameter by its average and maximum scores (using Excel).
pd.DataFrame.from_dict(clf.cv_results_).to_csv(file)
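The same per-parameter comparison can also be done in Python instead of Excel. The sketch below assumes the CSV has the `param_<name>` and `mean_test_score` columns that GridSearchCV's cv_results_ provides; the rows are made-up stand-ins, not the actual results, and `summarize` is a hypothetical helper.

```python
import csv
import io
from collections import defaultdict

# Made-up rows standing in for gs_result.csv; the scores are NOT real results
sample = io.StringIO(
    "param_vectorizer__method,param_classifier__C,mean_test_score\n"
    "tfidf,1,0.75\n"
    "tfidf,3,0.74\n"
    "count,1,0.73\n"
    "count,3,0.72\n"
)

def summarize(rows, column):
    """Return {value: (mean, max)} of mean_test_score grouped by one column."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[column]].append(float(row['mean_test_score']))
    return {k: (sum(v) / len(v), max(v)) for k, v in groups.items()}

rows = list(csv.DictReader(sample))
print(summarize(rows, 'param_vectorizer__method'))
print(summarize(rows, 'param_classifier__C'))
```

To use it on the real file, replace the in-memory sample with `open('gs_result.csv')`.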
I increased the number of parameter values a bit, so the training took about 11 minutes.
# Parameters for GridSearchCV
PARAMETERS = [
    {
        'vectorizer__method': ['tfidf', 'count'],
        'vectorizer__min_df': [0.0003, 0.0004, 0.0005, 0.0006],
        'vectorizer__max_df': [0.07, 0.10],
        'classifier__C': [1, 3],    # I also tried 10, but it was only slower and scored lower
        'classifier__solver': ['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga']}
    ]
The best hyperparameter combination had an average accuracy of 75.6% over the 5 cross-validation folds.
Now, let's compare each parameter below.
TfidfVectorizer/CountVectorizer: tf-idf clearly has a better score.
min_df: The smaller the min_df, the better the score.
max_df: For tf-idf, a smaller max_df gives a better score.
solver: There is not much difference.
C: Obviously, 1 has a better score.