[Chapter 6] Introduction to scikit-learn with 100 knocks of language processing

In this article, I will explain scikit-learn by working through Chapter 6 of Language Processing 100 Knocks.

First, let's do pip install scikit-learn.

50. Obtaining and shaping data

Download the News Aggregator Data Set and create training data (train.txt), verification data (valid.txt), and evaluation data (test.txt) as follows.

Unzip the downloaded zip file and read the explanation in readme.txt. Extract only the cases (articles) whose information source (publisher) is “Reuters”, “Huffington Post”, “Businessweek”, “Contactmusic.com”, or “Daily Mail”. Randomly shuffle the extracted cases. Use 80% of the extracted cases as training data and split the remainder into 10% each for verification data and evaluation data, and save them with the file names train.txt, valid.txt, and test.txt, respectively. Write one case per line in the file, in a tab-delimited format of category name and article headline (this file will be reused later in Problem 70).

After creating the training data and evaluation data, check the number of cases in each category.

This problem has nothing to do with scikit-learn, so you can solve it the way you like. First of all, download the file and read readme.txt.

!wget https://archive.ics.uci.edu/ml/machine-learning-databases/00359/NewsAggregatorDataset.zip
!unzip -c NewsAggregatorDataset.zip readme.txt

The readme aside, I want to handle compressed files without extracting them wherever possible. The zip file containing the data itself is best handled with the zipfile module. Any way of reading it is fine, but here I think pandas is the easiest. Let sklearn.model_selection.train_test_split() do the splitting; it also shuffles.

As a basic note, the library is called scikit-learn, but the module name you import is sklearn.

import csv
import zipfile

import pandas as pd
from sklearn.model_selection import train_test_split


with zipfile.ZipFile("NewsAggregatorDataset.zip") as z:
    with z.open("newsCorpora.csv") as f:
        names = ('ID','TITLE','URL','PUBLISHER','CATEGORY','STORY','HOSTNAME','TIMESTAMP')
        df = pd.read_table(f, names=names, quoting=csv.QUOTE_NONE)

publisher_set = {"Reuters", "Huffington Post", "Businessweek", "Contactmusic.com", "Daily Mail"}
df = df[df['PUBLISHER'].isin(publisher_set)]
df, valid_test_df = train_test_split(df, train_size=0.8, random_state=0)
df.to_csv('train.txt', columns=('CATEGORY','TITLE'), sep='\t', header=False, index=False)
valid_df, test_df = train_test_split(valid_test_df, test_size=0.5, random_state=0)
valid_df.to_csv('valid.txt', columns=('CATEGORY','TITLE'), sep='\t', header=False, index=False)
test_df.to_csv('test.txt', columns=('CATEGORY','TITLE'), sep='\t', header=False, index=False)

pandas.read_table() reads a TSV file and creates a DataFrame object. names sets the column names. quoting=csv.QUOTE_NONE tells the parser to treat quotation marks as ordinary characters. csv.QUOTE_NONE is the constant 3, so writing 3 has the same effect.
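To see what quoting=csv.QUOTE_NONE changes, here is a minimal sketch on made-up data. The two toy headlines and the name toy_df are assumptions for illustration, not part of the real data set; the first headline contains an unbalanced double quote, which is the kind of input that makes this setting matter.

```python
import csv
import io

import pandas as pd

# Toy TSV with an unbalanced double quote in the first headline (illustrative data).
raw = 'b\t"Unbalanced quote headline\ne\tSecond headline\n'

# With QUOTE_NONE the quote character is kept as a plain character,
# so each physical line stays one row.
toy_df = pd.read_table(io.StringIO(raw), names=("CATEGORY", "TITLE"),
                       quoting=csv.QUOTE_NONE)
print(toy_df)
print(csv.QUOTE_NONE)
```

The quote stays in the TITLE value as-is, which is what we want for headlines.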

(I had heard that read_table() was deprecated and that read_csv(sep='\t') should be used instead, but since no warning appears, the deprecation seems to have been reverted.)

The df['PUBLISHER'] part extracts a column, and its return value is of type Series. In pandas, the DataFrame type represents the structure of the whole table, and each column is represented by the Series type. Its isin() method returns a Series of booleans holding the result of the in operation for each element. If you pass that Series to df as if it were a key, you get back a DataFrame containing only the rows where it is True.
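The boolean-mask filtering described above can be tried on a tiny table. This is only a sketch; the three rows and the publisher "Example Times" are invented for illustration.

```python
import pandas as pd

# Toy table; "Example Times" is a made-up publisher that should be filtered out.
df = pd.DataFrame({
    "PUBLISHER": ["Reuters", "Example Times", "Daily Mail"],
    "TITLE": ["headline A", "headline B", "headline C"],
})

# isin() returns a boolean Series, one value per row.
mask = df["PUBLISHER"].isin({"Reuters", "Daily Mail"})
print(mask.tolist())

# Indexing the DataFrame with that Series keeps only the True rows.
filtered = df[mask]
print(filtered["TITLE"].tolist())
```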

names = ('CATEGORY','TITLE')
df = pd.read_table('train.txt', names=names, quoting=csv.QUOTE_NONE)
df['CATEGORY'].value_counts()
b    4503
e    4254
t    1210
m     717
Name: CATEGORY, dtype: int64
df = pd.read_table('test.txt', names=names, quoting=csv.QUOTE_NONE)
df['CATEGORY'].value_counts()
b    565
e    518
t    163
m     90
Name: CATEGORY, dtype: int64

51. Feature extraction

Extract the features from the training data, verification data, and evaluation data, and save them with the file names train.feature.txt, valid.feature.txt, and test.feature.txt, respectively. Feel free to design the features that are likely to be useful for categorization. The minimum baseline would be an article headline converted to a word string.

This problem does not say that the extracted features should be converted into vectors (a matrix). It seems the intent is to save the features in a human-readable format so that they can be used for error analysis later.

(With scikit-learn's CountVectorizer, feature extraction and vectorization are done together, which does not fit this problem well.)

Therefore, the plan is to extract the features ourselves, build dictionary objects, save them, and vectorize them with DictVectorizer in the next problem. Each dictionary key is a feature name and each value is 1.0, i.e. the features are binary. Building a dictionary from a headline is also needed at inference time, so make it a function.

The format for saving features is not specified, but I think the JSON Lines (jsonl) format is a good choice for readability.

I want to separate commas and quotes from words. How you do it doesn't matter. spaCy is well known as a tokenizer, but I think the tokenizer of CountVectorizer also works well for this problem.
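As a quick check of what the CountVectorizer tokenizer does, here is a small sketch with two made-up headlines. With the default settings, the tokenizer keeps runs of two or more word characters, so punctuation and single-character tokens such as "I" and "a" are dropped, and build_tokenizer() itself does not lowercase.

```python
from sklearn.feature_extraction.text import CountVectorizer

# build_tokenizer() returns a plain function: str -> list of tokens.
tokenize = CountVectorizer().build_tokenizer()

tokens_simple = tokenize("I have a dog.")
tokens_punct = tokenize("Fed's \"tapering\" talk, explained")

print(tokens_simple)
print(tokens_punct)
```

If single-character tokens matter for your features, a different token_pattern or another tokenizer would be needed.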

q51.py


import argparse
import json


from sklearn.feature_extraction.text import CountVectorizer


def ngram_gen(seq, n):
    return zip(*(seq[i:] for i in range(n)))


nlp = CountVectorizer().build_tokenizer()

def make_feats_dict(title):
    words = nlp(title)
    
    feats = {}
    for token in words:
        feats[token] = 1.0
    for bigram in ngram_gen(words, 2):
        feats[' '.join(bigram)] = 1.0
    for trigram in ngram_gen(words, 3):
        feats[' '.join(trigram)] = 1.0
    return feats


def dump_features(input_file, output_file):
    with open(input_file) as fi, open(output_file, 'w') as fo:
        for line in fi:
            vals = line.rstrip().split('\t')
            label, title = vals
            feats = {'**LABEL**': label}
            feats.update(make_feats_dict(title))
            print(json.dumps(feats), file=fo)

            
def main():
    parser = argparse.ArgumentParser()
    parser.add_argument('input_file')
    parser.add_argument('output_file')
    args = vars(parser.parse_args())
    dump_features(**args)
    
            
if __name__ == '__main__':
    main()

Overwriting q51.py
!python q51.py test.txt test.feature.txt
!python q51.py valid.txt valid.feature.txt
!python q51.py train.txt train.feature.txt

ngram_gen() generates n-grams. For bigrams, it transposes [[I, am, an, NLPer], [am, an, NLPer]] with zip(), which stops at the shorter list. It is an elegant trick.
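To make the transpose trick concrete, this sketch runs the same ngram_gen() on the example sentence from the explanation above.

```python
def ngram_gen(seq, n):
    # zip() over n shifted copies of seq; it stops when the shortest copy ends.
    return zip(*(seq[i:] for i in range(n)))


words = ["I", "am", "an", "NLPer"]
bigrams = list(ngram_gen(words, 2))
trigrams = list(ngram_gen(words, 3))

print(bigrams)
print(trigrams)
```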

(The label is not a feature, but I write it out because it will ease the next problem.)

The main function may look odd, but it uses [Unpacking the argument list](https://docs.python.org/ja/3/tutorial/controlflow.html#unpacking-argument-lists), which also came up in Chapter 4, to pass keyword arguments as a dictionary. Since the return value of parse_args() is a Namespace object, it is converted to a dictionary with vars() (which appeared in Chapter 5).
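The vars() plus ** combination can be seen in isolation in the sketch below. The function describe_paths is a hypothetical stand-in for dump_features(), and the argument list is passed to parse_args() directly so that the example runs without a command line.

```python
import argparse


def describe_paths(input_file, output_file):
    # Hypothetical stand-in for dump_features(); it only formats its arguments.
    return f"{input_file} -> {output_file}"


parser = argparse.ArgumentParser()
parser.add_argument("input_file")
parser.add_argument("output_file")

# Passing a list here replaces sys.argv for the sake of the example.
namespace = parser.parse_args(["test.txt", "test.feature.txt"])
kwargs = vars(namespace)           # Namespace -> dict
result = describe_paths(**kwargs)  # dict -> keyword arguments

print(kwargs)
print(result)
```

This only works because the argument names registered with the parser match the parameter names of the function.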

52. Learning

Learn the logistic regression model using the training data constructed in 51.

First, create a list X of dictionaries representing the features from the file created in 51. To feed it to a machine learning model, we need vectors that line up the values of all features. So use DictVectorizer(). Its fit(X) method learns the mapping between feature names and indices from X and stores it inside the instance. Then transform(X) converts X into a matrix (a SciPy sparse matrix by default). fit_transform(X) does both at once.

Then use LogisticRegression(). Simply instantiate it and call fit(X, y) to learn the weight vectors inside the instance. Hyperparameters are set at instantiation. X should be matrix-like and y list-like, and it is fine as long as their lengths match.

Save the learned model by referring to Model persistence. With joblib.dump(), older versions of joblib could generate a large number of files unless the optional argument compress was specified, so be careful.

If you do not also save the mapping between feature names and indices at this point, you will be stuck at inference time. Let's dump the whole DictVectorizer instance as well.
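Here is a small sketch of what DictVectorizer does with feature dictionaries. The three feature names are made up; the point is the name-to-index mapping learned by fit and the fact that features unseen during fit are silently ignored by transform.

```python
from scipy.sparse import issparse
from sklearn.feature_extraction import DictVectorizer

X_dicts = [
    {"dog": 1.0, "have dog": 1.0},
    {"cat": 1.0},
]

vectorizer = DictVectorizer()
X_matrix = vectorizer.fit_transform(X_dicts)  # fit + transform in one call

print(vectorizer.vocabulary_)  # feature name -> column index (names are sorted)
print(issparse(X_matrix))      # the result is a SciPy sparse matrix by default
print(X_matrix.toarray())

# "bird" was not seen during fit, so it is dropped at transform time.
X_new = vectorizer.transform([{"bird": 1.0, "dog": 1.0}])
print(X_new.toarray())
```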
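A minimal round trip with joblib looks like the sketch below. It uses a temporary directory and a hypothetical file name, toy.feature.joblib, so that nothing is left on disk; the check is that the restored vectorizer carries the same feature mapping.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction import DictVectorizer

vectorizer = DictVectorizer().fit([{"dog": 1.0}, {"cat": 1.0}])

with tempfile.TemporaryDirectory() as tmpdir:
    # Hypothetical file name, used only for this example.
    path = os.path.join(tmpdir, "toy.feature.joblib")
    joblib.dump(vectorizer, path, compress=3)
    restored = joblib.load(path)
    files = os.listdir(tmpdir)

print(files)
print(restored.vocabulary_)
```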

q52.py


import argparse
import json


import numpy as np
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
import joblib


def argparse_imf():
    parser = argparse.ArgumentParser()
    parser.add_argument('-i', '--input')
    parser.add_argument('-m', '--model')
    parser.add_argument('-f', '--feats')
    args = parser.parse_args()
    return args


def load_xy(filename):
    X = []
    y = []
    with open(filename) as f:
        for line in f:
            dic = json.loads(line)
            y.append(dic.pop('**LABEL**'))
            X.append(dic)
    return X, y


def main():
    args = argparse_imf()
    X_train, y_train = load_xy(args.input)
    
    vectorizer = DictVectorizer()
    X_train = vectorizer.fit_transform(X_train)
    y_train = np.array(y_train)
    clf = LogisticRegression(random_state=0, max_iter=1000, verbose=1)
    clf.fit(X_train, y_train)
    
    joblib.dump(clf, args.model, compress=3)
    joblib.dump(vectorizer, args.feats, compress=3)

    
if __name__ == '__main__':
    main()

Overwriting q52.py
!python q52.py -i train.feature.txt -m train.logistic.model -f train.feature.joblib

53. Forecast

Use the logistic regression model learned in 52 and implement a program that calculates the category and its prediction probability from a given article headline.

I take the "given article headline" in this question to mean an arbitrary headline, not the test data created above.

If you load the saved model and call predict(X), you get the labels; if you call predict_proba(X), you get the prediction probabilities. X is obtained by building feature dictionaries from the input and converting them with the DictVectorizer saved in 52.

If you enter two titles and apply predict_proba (), you will get a numpy.ndarray like this.

>>> y_proba
array([[0.24339871, 0.54111814, 0.10059608, 0.11488707],
       [0.19745579, 0.69644375, 0.04204659, 0.06405386]])

Prediction probabilities for all labels come out, but you probably only want the maximum for each headline. What should we do? ndarray seems to have a max() method ...

>>> y_proba.max()
0.6964437549683299
>>> y_proba.max(axis=0)
array([0.24339871, 0.69644375, 0.10059608, 0.11488707])

Give it a try. Below is an example answer.

q53.py


import argparse
import json
import sys


import numpy as np
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
import joblib


from q51 import make_feats_dict
from q52 import argparse_imf, load_xy


def predict_label_proba(X, vectorizer, clf):
    X = vectorizer.transform(X)
    y_proba = clf.predict_proba(X)    
    y_pred = clf.classes_[y_proba.argmax(axis=1)]
    y_proba_max = y_proba.max(axis=1)
    return y_pred, y_proba_max


def main():
    args = argparse_imf()
    vectorizer = joblib.load(args.feats)
    clf = joblib.load(args.model)
    X = list(map(make_feats_dict, sys.stdin))
    y_pred, y_proba = predict_label_proba(X, vectorizer, clf)
    for label, proba in zip(y_pred, y_proba):
        print('%s\t%.4f' % (label, proba))

    
if __name__ == '__main__':
    main()

Overwriting q53.py
!echo 'I have a dog.' | python q53.py -m train.logistic.model -f train.feature.joblib
e	0.5441
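The row-wise selection inside predict_label_proba() can be checked on the y_proba array shown earlier. In this sketch the label order ('b', 'e', 'm', 't') is an assumption standing in for clf.classes_, which scikit-learn keeps in sorted order.

```python
import numpy as np

y_proba = np.array([[0.24339871, 0.54111814, 0.10059608, 0.11488707],
                    [0.19745579, 0.69644375, 0.04204659, 0.06405386]])

# Assumed stand-in for clf.classes_ (sorted label order).
classes = np.array(["b", "e", "m", "t"])

# axis=1 works along each row, i.e. per headline.
y_pred = classes[y_proba.argmax(axis=1)]
y_proba_max = y_proba.max(axis=1)

print(y_pred)
print(y_proba_max)
```

axis=0 would instead take the maximum per label across headlines, which is not what is wanted here.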

54. Measurement of correct answer rate

Measure the correct answer rate of the logistic regression model learned in 52 on the training data and evaluation data.

You can implement it by hand, but I'll leave it to sklearn.metrics.accuracy_score().

The most important thing to learn about scikit-learn is the flow so far.

  1. Extract features and convert them to a list of dicts
  2. Convert to a matrix with DictVectorizer.fit_transform()
  3. Select and instantiate a machine learning model such as LogisticRegression
  4. Learn with fit(X_train, y_train)
  5. Infer with predict(X_test)
  6. Evaluate in some way

Make sure you have this flow down.
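The six steps above can be sketched end to end on toy data. The feature dictionaries and labels below are invented for illustration and are far too small to say anything about real performance; the sketch only shows how the pieces connect.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# 1. Features as a list of dicts (toy data).
X_train = [{"stock": 1.0, "market": 1.0}, {"bank": 1.0, "market": 1.0},
           {"movie": 1.0, "star": 1.0}, {"album": 1.0, "star": 1.0}]
y_train = ["b", "b", "e", "e"]
X_test = [{"bank": 1.0, "market": 1.0, "unseen word": 1.0},
          {"movie": 1.0, "star": 1.0}]
y_test = ["b", "e"]

# 2. Convert to a matrix; fit only on the training data.
vectorizer = DictVectorizer()
X_train_mat = vectorizer.fit_transform(X_train)
X_test_mat = vectorizer.transform(X_test)

# 3. and 4. Instantiate the model and learn.
clf = LogisticRegression(random_state=0, max_iter=1000)
clf.fit(X_train_mat, y_train)

# 5. Infer.
y_pred = clf.predict(X_test_mat)

# 6. Evaluate.
accuracy = accuracy_score(y_test, y_pred)
print(y_pred, accuracy)
```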

q54.py


import argparse
import json


import numpy as np
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
import joblib


from q52 import argparse_imf, load_xy


def predict(args):
    X_test, y_true = load_xy(args.input)
    
    vectorizer = joblib.load(args.feats)
    X_test = vectorizer.transform(X_test)
    y_true = np.array(y_true)
    
    clf = joblib.load(args.model)
    y_pred = clf.predict(X_test)
    
    return y_true, y_pred


def main():
    args = argparse_imf()
    y_true, y_pred = predict(args)
    accuracy = accuracy_score(y_true, y_pred) * 100
    print('Accuracy: %.3f' % accuracy)

    
if __name__ == '__main__':
    main()

Overwriting q54.py
!python q54.py -i train.feature.txt -m train.logistic.model -f train.feature.joblib
Accuracy: 99.897
!python q54.py -i test.feature.txt -m train.logistic.model -f train.feature.joblib
Accuracy: 87.275

55. Creating a confusion matrix

Create a confusion matrix of the logistic regression model learned in 52 on the training data and evaluation data.

Leave it to sklearn.metrics.confusion_matrix().

q55.py


from sklearn.metrics import confusion_matrix

from q52 import argparse_imf
from q54 import predict


def main():
    args = argparse_imf()
    y_true, y_pred = predict(args)
    labels = ('b', 'e', 't', 'm')
    matrix = confusion_matrix(y_true, y_pred, labels=labels)
    print(labels)
    print(matrix)

    
if __name__ == '__main__':
    main()
Overwriting q55.py
!python q55.py -i train.feature.txt -m train.logistic.model -f train.feature.joblib
('b', 'e', 't', 'm')
[[4499    1    3    0]
 [   2 4252    0    0]
 [   3    1 1206    0]
 [   0    1    0  716]]
!python q55.py -i test.feature.txt -m train.logistic.model -f train.feature.joblib
('b', 'e', 't', 'm')
[[529  26  10   0]
 [ 13 503   2   0]
 [ 37  36  89   1]
 [ 19  26   0  45]]

56. Measurement of precision, recall, F1 score

Measure the precision, recall, and F1 score of the logistic regression model learned in 52 on the evaluation data. Obtain the precision, recall, and F1 score for each category, and integrate the performance for each category with the micro-average and macro-average.

Leave it to sklearn.metrics.classification_report(). In multi-class (single-label) classification, the micro-average over all classes equals the correct answer rate (Reference), which is why the report prints an accuracy row rather than a micro avg row.

q56.py


from sklearn.metrics import classification_report

from q52 import argparse_imf
from q54 import predict


def main():
    args = argparse_imf()
    y_true, y_pred = predict(args)
    print(classification_report(y_true, y_pred, digits=4))

    
if __name__ == '__main__':
    main()
Overwriting q56.py
!python q56.py -i test.feature.txt -m train.logistic.model -f train.feature.joblib
              precision    recall  f1-score   support

           b     0.8846    0.9363    0.9097       565
           e     0.8511    0.9710    0.9071       518
           m     0.9783    0.5000    0.6618        90
           t     0.8812    0.5460    0.6742       163

    accuracy                         0.8728      1336
   macro avg     0.8988    0.7383    0.7882      1336
weighted avg     0.8775    0.8728    0.8633      1336
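The statement that the micro-average equals the accuracy in single-label multi-class classification can be checked numerically. The labels below are made up for illustration; precision_recall_fscore_support() is the scikit-learn function that computes the averaged scores.

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# Toy labels: 6 of the 8 predictions are correct.
y_true = ["b", "b", "e", "e", "t", "m", "t", "b"]
y_pred = ["b", "e", "e", "e", "b", "m", "t", "b"]

acc = accuracy_score(y_true, y_pred)
micro_p, micro_r, micro_f, _ = precision_recall_fscore_support(
    y_true, y_pred, average="micro")
macro_p, macro_r, macro_f, _ = precision_recall_fscore_support(
    y_true, y_pred, average="macro")

print(acc, micro_p, micro_r, micro_f)
print(macro_p, macro_r, macro_f)
```

The macro-average generally differs, because it weights every class equally regardless of how many cases it has.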

57. Confirmation of feature weights

Check the top 10 features with high weights and the top 10 features with low weights in the logistic regression model learned in 52.

The attribute coef_ holds the weights, but since this is multi-class classification, its shape is (number of classes) x (number of features). So I output the top and bottom 10 for each of the 4 classes.

q57.py


import joblib
import numpy as np


from q52 import argparse_imf


def get_topk_indices(array, k=10):
    unsorted_max_indices = np.argpartition(-array, k)[:k]
    max_weights = array[unsorted_max_indices]
    max_indices = np.argsort(-max_weights)
    return unsorted_max_indices[max_indices]

def show_weights(args):
    vectorizer = joblib.load(args.feats)
    # get_feature_names() was removed in scikit-learn 1.2;
    # get_feature_names_out() is available from 1.0 onward.
    feature_names = np.array(vectorizer.get_feature_names_out())
    
    clf = joblib.load(args.model)
    coefs = clf.coef_  # shape: (number of classes, number of features)
    y_labels = clf.classes_
    for coef, y_label in zip(coefs, y_labels):
        # Top 10 features with the highest weights for this class
        max_k_indices = get_topk_indices(coef)
        print(y_label)
        for name, weight in zip(feature_names[max_k_indices],  coef[max_k_indices]):
            print(name, weight, sep='\t')
        print('...')
        # Top 10 features with the lowest weights for this class
        min_k_indices = get_topk_indices(-coef)
        for name, weight in zip(feature_names[min_k_indices],  coef[min_k_indices]):
            print(name, weight, sep='\t')
        print()

def main():
    args = argparse_imf()
    show_weights(args)

    
if __name__ == '__main__':
    main()
Overwriting q57.py
!python q57.py -i test.feature.txt -m train.logistic.model -f train.feature.joblib

I only want the top and bottom entries, so instead of sorting the whole coef_ I take a slightly roundabout route. numpy has no topk()-like function; argpartition() only returns the indices of the top k elements in unsorted order, so I sort just those k afterwards.
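The partition-then-sort idea can be verified on a small array. This sketch reuses get_topk_indices() as written in q57.py with made-up weights; note that argpartition() needs k to be smaller than the array length.

```python
import numpy as np


def get_topk_indices(array, k=10):
    # argpartition puts the k largest values (of array) in the first k slots,
    # but in no particular order.
    unsorted_max_indices = np.argpartition(-array, k)[:k]
    max_weights = array[unsorted_max_indices]
    # Sort only those k candidates in descending order.
    max_indices = np.argsort(-max_weights)
    return unsorted_max_indices[max_indices]


weights = np.array([0.1, -2.0, 3.5, 0.7, -0.3, 2.2])  # toy weights
top3 = get_topk_indices(weights, k=3)
bottom3 = get_topk_indices(-weights, k=3)

print(top3, weights[top3])
print(bottom3, weights[bottom3])
```

The result matches a full np.argsort(-weights)[:k], but only k elements are fully sorted.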

58. Change regularization parameters

When training a logistic regression model, the degree of overfitting during learning can be controlled by adjusting the regularization parameters. Learn the logistic regression model with different regularization parameters and find the accuracy rate on the training data, validation data, and evaluation data. Summarize the results of the experiment in a graph with the regularization parameters on the horizontal axis and the accuracy rate on the vertical axis.

import argparse
import json


import numpy as np
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
import joblib
import matplotlib.pyplot as plt
from tqdm import tqdm


from q52 import load_xy


def get_accuracy(clf, X, y_true):
    y_pred = clf.predict(X)
    return accuracy_score(y_true, y_pred)


X_train, y_train = load_xy('train.feature.txt')
X_valid, y_valid = load_xy('valid.feature.txt')
X_test, y_test = load_xy('test.feature.txt')

vectorizer = DictVectorizer()
X_train = vectorizer.fit_transform(X_train)
X_valid = vectorizer.transform(X_valid)
X_test = vectorizer.transform(X_test)

train_accuracies = []
valid_accuracies = []
test_accuracies = []

for exp in tqdm(range(10)):
    clf = LogisticRegression(random_state=0, max_iter=1000, C=2**exp)
    clf.fit(X_train, y_train)
    train_accuracies.append(get_accuracy(clf, X_train, y_train))
    valid_accuracies.append(get_accuracy(clf, X_valid, y_valid))
    test_accuracies.append(get_accuracy(clf, X_test, y_test))


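cs = [2**c for c in range(10)]
plt.plot(cs, train_accuracies, label='train')
plt.plot(cs, valid_accuracies, label='valid')
plt.plot(cs, test_accuracies, label='test')
plt.xscale('log')  # C grows exponentially, so use a log axis
plt.xlabel('C (inverse of regularization strength)')
plt.ylabel('accuracy')
plt.legend()
plt.show()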
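In LogisticRegression, C is the inverse of the regularization strength, so a smaller C means stronger regularization. The following sketch illustrates this on synthetic data from make_classification (an assumption for illustration, not the news data) by comparing the size of the learned weight vector for a few values of C.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data, used only to illustrate the effect of C.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

norms = {}
for C in (0.01, 1.0, 100.0):
    clf = LogisticRegression(C=C, max_iter=1000, random_state=0).fit(X, y)
    # L2 norm of the weights: stronger regularization shrinks it.
    norms[C] = float(np.linalg.norm(clf.coef_))

for C, norm in norms.items():
    print(C, norm)
```

Larger weights let the model fit the training data more closely, which is why training accuracy tends to rise with C while validation accuracy eventually stops improving.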

59. Searching for hyperparameters

Learn the categorization model while changing the learning algorithm and learning parameters. Find the learning algorithm parameter that has the highest accuracy rate on the verification data. Also, find the correct answer rate on the evaluation data when the learning algorithm and parameters are used.

Algorithm and hyperparameter selection should be done on the verification data, not by tuning on the test set. This time I didn't go that far, though; I planned to just use sklearn.ensemble.GradientBoostingClassifier with reasonable settings and call it done ... or so I intended.

from sklearn.ensemble import GradientBoostingClassifier


clf = GradientBoostingClassifier(random_state=0, min_samples_split=0.01,
                                 min_samples_leaf=5, max_depth=10, 
                                 max_features='sqrt', n_estimators=500, 
                                 subsample=0.8)
clf.fit(X_train, y_train)
valid_acc = get_accuracy(clf, X_valid, y_valid) * 100
print('Validation Accuracy: %.3f' % valid_acc)
test_acc = get_accuracy(clf, X_test, y_test) * 100
print('Test Accuracy: %.3f' % test_acc)
Validation Accuracy: 88.997
Test Accuracy: 88.548

I tried the trendy GBDT, but it was hard because the performance changed considerably with the hyperparameters. If I have time, I will do a proper grid search ...

In any case, machine learning in Python usually means scikit-learn, and I think this is a chapter where you can learn it.
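A grid search that selects on the verification data and touches the evaluation data only once could look like the sketch below. It runs on synthetic data from make_classification, and the grid values are arbitrary assumptions chosen to keep the example fast; ParameterGrid is scikit-learn's helper for enumerating parameter combinations.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import ParameterGrid, train_test_split

# Synthetic stand-in for the news data, split into train / valid / test.
X, y = make_classification(n_samples=300, n_features=20, n_informative=5,
                           random_state=0)
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, train_size=0.6, random_state=0)
X_valid, X_test, y_valid, y_test = train_test_split(
    X_rest, y_rest, test_size=0.5, random_state=0)

# Arbitrary small grid, for illustration only.
grid = ParameterGrid({"n_estimators": [20, 50], "max_depth": [2, 3]})

best_score, best_params, best_clf = -1.0, None, None
for params in grid:
    clf = GradientBoostingClassifier(random_state=0, **params)
    clf.fit(X_train, y_train)
    score = clf.score(X_valid, y_valid)  # select on the validation data only
    if score > best_score:
        best_score, best_params, best_clf = score, params, clf

# The test data is used once, after the choice has been made.
test_acc = best_clf.score(X_test, y_test)
print(best_params, best_score, test_acc)
```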
