Last time, I parsed sentences and converted words into feature vectors. However, even if a word appears many times in a sentence, it is not very helpful for judging the category if it appears just as often in sentences of every category.
For example, when you want to classify a movie review as "positive" or "negative", a word like "wow" shows up in both positive and negative contexts, so it does little to tell you whether the review is positive or negative.
With this in mind, "TF-IDF" is a method that, when classifying by words, increases the weight of a word if it is important and decreases it if it is not. TF stands for term frequency and IDF for inverse document frequency, and they are defined as follows.
Assuming that $n_d$ denotes the total number of documents and $df(t, d)$ the number of documents that contain the word $t$, tf-idf is defined as

$$\mathrm{tf\text{-}idf}(t, d) = \mathrm{tf}(t, d) \times \mathrm{idf}(t, d), \qquad \mathrm{idf}(t, d) = \log\frac{n_d}{1 + df(t, d)}$$

where $\mathrm{tf}(t, d)$ is the number of times the word $t$ occurs in document $d$.
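As a quick check with the three example documents used below ($n_d = 3$): "guitar" appears in two of the three documents, so its idf is $\log\frac{3}{1+2} = 0$, while "and" appears in only one, giving $\log\frac{3}{1+1} \approx 0.41$; ubiquitous words are weighted down and rarer words up. (scikit-learn's TfidfTransformer actually uses a slightly different smoothed variant, roughly $\log\frac{1 + n_d}{1 + df(t, d)} + 1$ when smooth_idf=True, which is why no weight in the output below is exactly zero.)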
By the way, the TfidfTransformer class in Python's scikit-learn implements this relatively easily: it takes the word counts produced by the CountVectorizer used last time and converts them into tf-idf weights.
tf_idf.py
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer

count = CountVectorizer()
docs = np.array(["He likes to play the guitar",
                 "She likes to play the piano",
                 "He likes to play the guitar, and she likes to play the piano"])
# Bag of words: raw term counts for each document
bag = count.fit_transform(docs)
# Convert the counts into L2-normalized tf-idf weights
tfidf = TfidfTransformer(use_idf=True, norm='l2', smooth_idf=True)
np.set_printoptions(precision=2)
print(tfidf.fit_transform(bag).toarray())
Output result:
[[ 0. 0.48 0.48 0.37 0. 0.37 0. 0.37 0.37]
[ 0. 0. 0. 0.37 0.48 0.37 0.48 0.37 0.37]
[ 0.34 0.26 0.26 0.4 0.26 0.4 0.26 0.4 0.4 ]]
With a sentence like the example above, the input contains no extra symbols and can be passed to CountVectorizer as it is. However, some text data includes HTML markup, separator lines and so on, so such noise has to be removed before starting the analysis (text data cleansing). This can be done with Python's regular expressions and the like (see "Regular expression operations" in the Python documentation).
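As an illustration only (the function name and the exact rules are my own choice and would need adapting to the data at hand), a minimal cleansing step could look like this:

import re

def clean_text(text):
    # Strip HTML tags such as <br /> left over from scraped pages
    text = re.sub(r'<[^>]*>', '', text)
    # Lowercase and replace anything that is not a word character or whitespace
    text = re.sub(r'[^\w\s]', ' ', text.lower())
    # Collapse the runs of whitespace introduced above
    return re.sub(r'\s+', ' ', text).strip()

print(clean_text('<br />Wow, this movie was GREAT!!'))
# -> 'wow this movie was great'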
Words that appear frequently in sentences of a given language regardless of the sentence's category are not very useful for classifying sentences, so it is better to deal with them before actually running the machine learning (these are the so-called stop words). For English you can get stop words from Python's NLTK library, but for Japanese there is no official library, so in many cases the [slothlib](http://svn.sourceforge.jp/svnroot/slothlib/CSharp/Version1/SlothLib/NLP/Filter/StopWord/word/Japanese.txt) page is read, its source is parsed and the words are extracted. First, open the URL with code like the following and then look at the source using a somewhat mysterious thing called Beautiful Soup.
ja_stopwords.py
import urllib.request
from bs4 import BeautifulSoup

def get_stop_words():
    # Fetch the slothlib list of Japanese stop words (words that appear
    # frequently regardless of a sentence's category) so they can be excluded.
    url = 'http://svn.sourceforge.jp/svnroot/slothlib/CSharp/Version1/SlothLib/NLP/Filter/StopWord/word/Japanese.txt'
    # Download the page with urllib and parse the source with BeautifulSoup.
    soup = BeautifulSoup(urllib.request.urlopen(url).read(), "html.parser")
    return str(soup)

print(get_stop_words())
Output result:
あそこ
あたり
あちら
あっち
あと
あな
あなた
あれ
いくつ
いつ
いま
いや
...
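The function above returns the page text with one stop word per line. As a rough sketch of how the list could then be wired into the vectorizer (my own addition; note that for Japanese text you would also need a proper tokenizer, which is not covered here):

from sklearn.feature_extraction.text import CountVectorizer

# Turn the downloaded text into a list of stop words, dropping blank lines
stop_ja = [w for w in get_stop_words().splitlines() if w.strip()]

# Words in this list are then ignored when building the bag of words;
# a Japanese tokenizer would still have to be supplied via `tokenizer=`.
count_ja = CountVectorizer(stop_words=stop_ja)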
Now, having obtained features in this way, let us actually run a logistic regression on preprocessed sentences and classify whether each sentence is positive or negative. I could not find a handy dataset in Japanese (there seem to be many collections scraped from Twitter, but having to fiddle with AWS for that is a pain), so I will try it on English movie reviews annotated as negative or positive. For this program I referred to the chapter on natural language processing in the book [Python Machine Learning](https://www.amazon.co.jp/dp/4844380605).
reviews.py
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
# df is a pandas DataFrame holding each review's text ('review') and its
# 0/1 sentiment label ('sentiment'); loading and cleansing it is omitted here.
X_train = df.loc[:25000, 'review'].values
y_train = df.loc[:25000, 'sentiment'].values
X_test = df.loc[25000:, 'review'].values
y_test = df.loc[25000:, 'sentiment'].values
tfidf = TfidfVectorizer(strip_accents=None, lowercase=False, preprocessor=None)
# stop (a stop-word list) and the tokenizer / tokenizer_porter functions used
# below are assumed to have been defined beforehand.
param_grid = [{'vect__ngram_range': [(1, 1)],
'vect__stop_words': [stop, None],
'vect__tokenizer': [tokenizer, tokenizer_porter],
'clf__penalty': ['l1', 'l2'],
'clf__C': [1.0, 10.0, 100.0]},
{'vect__ngram_range': [(1, 1)],
'vect__stop_words': [stop, None],
'vect__tokenizer': [tokenizer, tokenizer_porter],
'vect__use_idf':[False],
'vect__norm':[None],
'clf__penalty': ['l1', 'l2'],
'clf__C': [1.0, 10.0, 100.0]},
]
lr_tfidf = Pipeline([('vect', tfidf),
                     # liblinear supports both the 'l1' and 'l2' penalties in the grid
                     ('clf', LogisticRegression(solver='liblinear', random_state=0))])
gs_lr_tfidf = GridSearchCV(lr_tfidf, param_grid,
scoring='accuracy',
cv=5,
verbose=1,
n_jobs=-1)
gs_lr_tfidf.fit(X_train, y_train)
print('Accuracy: %.3f' % gs_lr_tfidf.best_score_)
First, df is split into training data and test data. It may be easier to follow if you picture df as a table, that is, a DataFrame from the pandas library.
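Purely for illustration (this is not how the real data was loaded), df can be pictured as something like:

import pandas as pd

# Toy stand-in for the real DataFrame: one row per review,
# with 'sentiment' as the 0/1 label (1 marking the positive review here)
df = pd.DataFrame({'review': ['This movie was great fun', 'A dull, boring mess'],
                   'sentiment': [1, 0]})
print(df)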
The ['review'] column is the text of the review (this is X) and the ['sentiment'] column is the label for that review (a 0/1 value indicating negative or positive; this is y). (I will omit how df was actually read from the original data and how the data was cleansed...) After that, the GridSearchCV class from sklearn is used to tune the optimal parameters for the logistic regression: an instance of GridSearchCV called gs_lr_tfidf is created and trained with gs_lr_tfidf.fit() on X_train and y_train.
(Tuning hyperparameters with sklearn)
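Incidentally, the X_test and y_test prepared above are not used in the snippet; as a small follow-up (not part of the original code), the best model found by the grid search can be checked against them like this:

clf = gs_lr_tfidf.best_estimator_  # the pipeline refit on all the training data
print('Test accuracy: %.3f' % clf.score(X_test, y_test))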
However, actually running this takes a tremendous amount of time... So when the data is large, it seems to be common to use what is called out-of-core learning instead.
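As a rough sketch of that idea (my own illustration, not the article's code): HashingVectorizer needs no fitting on the whole corpus, so an SGD-based logistic regression can be updated batch by batch with partial_fit. The stream_minibatches helper and the doc_stream iterable here are hypothetical.

from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier

# Stateless vectorizer: hashes tokens into a fixed number of features,
# so the whole corpus never has to be held in memory at once.
vect = HashingVectorizer(decode_error='ignore', n_features=2**21)
clf = SGDClassifier(loss='log', random_state=1)  # 'log_loss' on newer scikit-learn

def stream_minibatches(doc_stream, batch_size=1000):
    # Hypothetical helper: doc_stream is an iterable of (text, label) pairs,
    # e.g. read line by line from the review CSV.
    texts, labels = [], []
    for text, label in doc_stream:
        texts.append(text)
        labels.append(label)
        if len(texts) == batch_size:
            yield texts, labels
            texts, labels = [], []
    if texts:
        yield texts, labels

# for X_batch, y_batch in stream_minibatches(doc_stream):
#     clf.partial_fit(vect.transform(X_batch), y_batch, classes=[0, 1])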