Try the book "Introduction to Natural Language Processing Application Development in 15 Steps" --Chapter 2 Step 05 Memo "Features Conversion"

Contents

This is a memo for myself as I read "Introduction to Natural Language Processing Application Development in 15 Steps". This time I note my own takeaways from Chapter 2, Step 05.

Preparation

- Personal Mac: macOS Mojave version 10.14.6
- docker version: 19.03.2 for both Client and Server

Chapter overview

Step 04 covered feature extraction methods, and the upcoming Step 06 will train a classifier on the extracted feature vectors. Step 05 sits between them: it covers dimensionality reduction methods that reshape the feature vectors into a form suitable for the classifier.

- Latent Semantic Analysis (LSA)
- Principal Component Analysis (PCA)

05.1 Feature preprocessing

BoW vectorizes the raw occurrence counts of words, so "the distribution of feature vector values tends to be very biased."

- Address it at feature extraction time
  - e.g. TF-IDF (Step 04)
- Address it by post-processing the extracted feature vectors
  - sklearn.preprocessing.QuantileTransformer maps the values into the range from 0 to 1 and makes their distribution uniform.

The example in the book was hard to follow, so I tried it myself.

test_quantileTransformer.py


import numpy as np
import MeCab
import pprint

from sklearn.preprocessing import QuantileTransformer
from sklearn.feature_extraction.text import CountVectorizer

# the tokenizer body was omitted in the original memo; a minimal MeCab
# word-splitting sketch is filled in here as an assumption
tagger = MeCab.Tagger('-Owakati')

def _tokenize(text):
    return tagger.parse(text).strip().split(' ')

texts = [
    'Cars, cars, cars run fast',
    'The bike runs fast',
    'Bicycle runs slowly',
    'Tricycle runs slowly',
    'Programming is fun',
    'Python is Python Python is Python Python is fun',
]

vectorizer = CountVectorizer(tokenizer=_tokenize, max_features = 5)
bow = vectorizer.fit_transform(texts)
pprint.pprint(bow.toarray())

qt = QuantileTransformer()
qtd = qt.fit_transform(bow)
pprint.pprint(qtd.toarray())

Execution example


array([[0, 3, 0, 1, 3],
       [0, 1, 0, 1, 0],
       [0, 2, 1, 1, 0],
       [0, 1, 1, 1, 0],
       [0, 1, 0, 0, 0],
       [5, 5, 0, 0, 0]], dtype=int64)
array([[0.00000000e+00, 7.99911022e-01, 0.00000000e+00, 9.99999900e-01,
        9.99999900e-01],
       [0.00000000e+00, 9.99999998e-08, 0.00000000e+00, 9.99999900e-01,
        0.00000000e+00],
       [0.00000000e+00, 6.00000000e-01, 9.99999900e-01, 9.99999900e-01,
        0.00000000e+00],
       [0.00000000e+00, 9.99999998e-08, 9.99999900e-01, 9.99999900e-01,
        0.00000000e+00],
       [0.00000000e+00, 9.99999998e-08, 0.00000000e+00, 0.00000000e+00,
        0.00000000e+00],
       [9.99999900e-01, 9.99999900e-01, 0.00000000e+00, 0.00000000e+00,
        0.00000000e+00]])

- The entries at (5, 0) and (0, 4) have high counts, but the corresponding word appears in no other sentence, so the converted values are almost 1.
- (2, :) and (3, :) contain counts of only 1, but the remaining sentences have counts of 0 there, so the converted values are almost 1.
- (1, :) takes the values 1, 2, 3, and 5, so the converted values differ as well:
  - 1 before conversion: almost 0 after conversion
  - 2 before conversion: about 0.6 after conversion
  - 3 before conversion: about 0.8 after conversion
  - 5 before conversion: almost 1 after conversion

05.2 Latent Semantic Analysis (LSA) / 05.3 Principal Component Analysis (PCA)

| Item | LSA | PCA |
|---|---|---|
| Overview | A method to obtain vectors that express a document at the level of the "meaning" behind its "words", based on feature vectors such as BoW that represent the relationship between documents and words | A method to find the "directions in which the data points are most widely spread" |
| Mathematical operation | SVD (singular value decomposition) | EVD (eigenvalue decomposition) |
| Implementation | svd = sklearn.decomposition.TruncatedSVD() then svd.fit_transform() (see the sketch below the table) | evd = sklearn.decomposition.PCA() then evd.fit_transform() |
| Importance of each dimension | Refer to singular_values_ to see the importance of each dimension after compression | Refer to explained_variance_ratio_ to get the contribution ratio of each dimension (its cumulative sum gives the cumulative contribution ratio) |
| Dimensionality reduction | Specify n_components when instantiating | Specify n_components when instantiating |
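
As a quick check of the implementations in the table, here is a minimal sketch of my own (not code from the book), reusing the bow matrix from the QuantileTransformer example above; the choice of two components is arbitrary.

from sklearn.decomposition import PCA, TruncatedSVD

# LSA: TruncatedSVD accepts the sparse BoW matrix as-is
svd = TruncatedSVD(n_components=2)
lsa_feat = svd.fit_transform(bow)               # shape: (6 documents, 2 dimensions)
print(svd.singular_values_)                     # importance of each compressed dimension

# PCA: needs a dense array, so convert the sparse matrix first
evd = PCA(n_components=2)
pca_feat = evd.fit_transform(bow.toarray())     # shape: (6 documents, 2 dimensions)
print(evd.explained_variance_ratio_.cumsum())   # cumulative contribution ratio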

Points to consider with both methods

LSA-Topic model

A topic model asks "do these two sentences mean the same thing?" without explicitly attaching class IDs to the training data; since no correct answer (class ID) is given explicitly, it is a kind of "unsupervised learning".

PCA-whitening

Whitening decorrelates the components of the vectors (by multiplying the target vectors by the eigenvectors obtained with PCA) and scales each component to mean 0 and variance 1, which removes the data's original "spread in each axis direction". This can be expected to improve classification performance.
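
As a reminder of what whitening does, here is a minimal sketch of my own (not from the book) using scikit-learn's PCA with whiten=True: each component of the transformed data ends up with mean roughly 0, variance roughly 1, and almost no correlation with the other components.

import numpy as np
from sklearn.decomposition import PCA

# correlated 2-dimensional toy data, made up for this illustration
rng = np.random.RandomState(0)
X = rng.multivariate_normal(mean=[0.0, 0.0], cov=[[3.0, 1.5], [1.5, 1.0]], size=500)

pca = PCA(n_components=2, whiten=True)
X_white = pca.fit_transform(X)

print(X_white.mean(axis=0))      # roughly [0, 0]
print(X_white.std(axis=0))       # roughly [1, 1]
print(np.corrcoef(X_white.T))    # roughly the identity matrix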

PCA-Visualization method

Since a high-dimensional vector can be converted into a low-dimensional vector, it can also be used as a visualization method.
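
For instance, a minimal sketch of my own (reusing the bow matrix from the QuantileTransformer example above) that compresses the 5-dimensional BoW vectors to 2 dimensions with PCA and scatter-plots them:

import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

# project the 5-dimensional BoW vectors onto 2 dimensions for plotting
pca = PCA(n_components=2)
points = pca.fit_transform(bow.toarray())

plt.scatter(points[:, 0], points[:, 1])
for i, (x, y) in enumerate(points):
    plt.annotate(str(i), (x, y))  # label each point with its document index
plt.show()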

05.4 Application / Implementation

It does not work just by replacing TruncatedSVD with PCA, because PCA cannot take a sparse matrix as input.

Execution example


    def train(self, texts, labels):
        vectorizer = TfidfVectorizer(tokenizer=self._tokenize, ngram_range=(1, 3))
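        # PCA cannot handle sparse input, so densify the TF-IDF matrix here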
        bow = vectorizer.fit_transform(texts).toarray()

        pca = PCA(n_components = 500)
        pca_feat = pca.fit_transform(bow)

        classifier = SVC()
        classifier.fit(pca_feat, labels)

        self.vectorizer = vectorizer
        self.pca = pca
        self.classifier = classifier

    def predict(self, texts):
        bow = self.vectorizer.transform(texts).toarray()
        pca_feat = self.pca.transform(bow)
        return self.classifier.predict(pca_feat)

It runs if you drop the Pipeline notation, convert the vectorizer's (sparse) output to a dense array with toarray(), and then feed that into PCA.
