The 2020 edition of "100 Language Processing Knock", a well-known collection of natural language processing problems, has been released. This article summarizes my solutions to Chapter 6: Machine Learning, one of the following ten chapters.
- Chapter 1: Warm-up
- Chapter 2: UNIX Commands
- Chapter 3: Regular Expressions
- Chapter 4: Morphological Analysis
- Chapter 5: Dependency Analysis
- Chapter 6: Machine Learning
- Chapter 7: Word Vectors
- Chapter 8: Neural Networks
- Chapter 9: RNN and CNN
- Chapter 10: Machine Translation
We use Google Colaboratory for the answers. For details on how to set up and use Google Colaboratory, see this article. A notebook containing the execution results of the answers below is available on GitHub.
In this chapter, we use the News Aggregator Data Set published by Fabio Gasparetti to work on the task of classifying news article headlines into the categories "business", "science and technology", "entertainment", and "health" (category classification).
Download the News Aggregator Data Set and create training data (train.txt), validation data (valid.txt), and evaluation data (test.txt) as follows.
After creating the training and evaluation data, check the number of examples in each category.
First, download the specified data.
!wget https://archive.ics.uci.edu/ml/machine-learning-databases/00359/NewsAggregatorDataset.zip
!unzip NewsAggregatorDataset.zip
#Check the number of lines
!wc -l ./newsCorpora.csv
output
422937 ./newsCorpora.csv
#Check the first 10 lines
!head -10 ./newsCorpora.csv
output
1 Fed official says weak data caused by weather, should not slow taper http://www.latimes.com/business/money/la-fi-mo-federal-reserve-plosser-stimulus-economy-20140310,0,1312750.story\?track=rss Los Angeles Times b ddUyU0VZz0BRneMioxUPQVP6sIxvM www.latimes.com 1394470370698
2 Fed's Charles Plosser sees high bar for change in pace of tapering http://www.livemint.com/Politics/H2EvwJSK2VE6OF7iK1g3PP/Feds-Charles-Plosser-sees-high-bar-for-change-in-pace-of-ta.html Livemint b ddUyU0VZz0BRneMioxUPQVP6sIxvM www.livemint.com 1394470371207
3 US open: Stocks fall after Fed official hints at accelerated tapering http://www.ifamagazine.com/news/us-open-stocks-fall-after-fed-official-hints-at-accelerated-tapering-294436 IFA Magazine b ddUyU0VZz0BRneMioxUPQVP6sIxvM www.ifamagazine.com 1394470371550
4 Fed risks falling 'behind the curve', Charles Plosser says http://www.ifamagazine.com/news/fed-risks-falling-behind-the-curve-charles-plosser-says-294430 IFA Magazine b ddUyU0VZz0BRneMioxUPQVP6sIxvM www.ifamagazine.com 1394470371793
5 Fed's Plosser: Nasty Weather Has Curbed Job Growth http://www.moneynews.com/Economy/federal-reserve-charles-plosser-weather-job-growth/2014/03/10/id/557011 Moneynews b ddUyU0VZz0BRneMioxUPQVP6sIxvM www.moneynews.com 1394470372027
6 Plosser: Fed May Have to Accelerate Tapering Pace http://www.nasdaq.com/article/plosser-fed-may-have-to-accelerate-tapering-pace-20140310-00371 NASDAQ b ddUyU0VZz0BRneMioxUPQVP6sIxvM www.nasdaq.com 1394470372212
7 Fed's Plosser: Taper pace may be too slow http://www.marketwatch.com/story/feds-plosser-taper-pace-may-be-too-slow-2014-03-10\?reflink=MW_news_stmp MarketWatch b ddUyU0VZz0BRneMioxUPQVP6sIxvM www.marketwatch.com 1394470372405
8 Fed's Plosser expects US unemployment to fall to 6.2% by the end of 2014 http://www.fxstreet.com/news/forex-news/article.aspx\?storyid=23285020-b1b5-47ed-a8c4-96124bb91a39 FXstreet.com b ddUyU0VZz0BRneMioxUPQVP6sIxvM www.fxstreet.com 1394470372615
9 US jobs growth last month hit by weather:Fed President Charles Plosser http://economictimes.indiatimes.com/news/international/business/us-jobs-growth-last-month-hit-by-weatherfed-president-charles-plosser/articleshow/31788000.cms Economic Times b ddUyU0VZz0BRneMioxUPQVP6sIxvM economictimes.indiatimes.com 1394470372792
10 ECB unlikely to end sterilisation of SMP purchases - traders http://www.iii.co.uk/news-opinion/reuters/news/152615 Interactive Investor b dPhGU51DcrolUIMxbRm0InaHGA2XM www.iii.co.uk 1394470501265
# Replace double quotes with single quotes to avoid errors when reading the file
!sed -e 's/"/'\''/g' ./newsCorpora.csv > ./newsCorpora_re.csv
Next, we load the data into a pandas DataFrame and create the datasets according to the instructions in the problem statement. Scikit-learn's `train_test_split` is used to split the data. Passing a column to the `stratify` option makes the split preserve that column's composition ratio in each resulting dataset. Here we specify `CATEGORY`, the target variable for classification, so that the category distribution is not biased in any of the splits.
import pandas as pd
from sklearn.model_selection import train_test_split
# Load the data
df = pd.read_csv('./newsCorpora_re.csv', header=None, sep='\t', names=['ID', 'TITLE', 'URL', 'PUBLISHER', 'CATEGORY', 'STORY', 'HOSTNAME', 'TIMESTAMP'])
# Extract articles from the specified publishers
df = df.loc[df['PUBLISHER'].isin(['Reuters', 'Huffington Post', 'Businessweek', 'Contactmusic.com', 'Daily Mail']), ['TITLE', 'CATEGORY']]
# Split the data
train, valid_test = train_test_split(df, test_size=0.2, shuffle=True, random_state=123, stratify=df['CATEGORY'])
valid, test = train_test_split(valid_test, test_size=0.5, shuffle=True, random_state=123, stratify=valid_test['CATEGORY'])
# Save the data
train.to_csv('./train.txt', sep='\t', index=False)
valid.to_csv('./valid.txt', sep='\t', index=False)
test.to_csv('./test.txt', sep='\t', index=False)
# Check the number of examples in each category
print('[Training data]')
print(train['CATEGORY'].value_counts())
print('[Validation data]')
print(valid['CATEGORY'].value_counts())
print('[Evaluation data]')
print(test['CATEGORY'].value_counts())
output
[Training data]
b 4501
e 4235
t 1220
m 728
Name: CATEGORY, dtype: int64
[Validation data]
b 563
e 529
t 153
m 91
Name: CATEGORY, dtype: int64
[Evaluation data]
b 563
e 530
t 152
m 91
Name: CATEGORY, dtype: int64
Extract features from the training data, validation data, and evaluation data, and save them as train.feature.txt, valid.feature.txt, and test.feature.txt, respectively. Feel free to design features that are likely to be useful for categorization. A minimal baseline would be the article headline converted to a sequence of words.
This time, we compute TF-IDF over the words obtained by splitting each article headline on spaces, and use those values as features. TF-IDF is computed not only for single words (uni-grams) but also for pairs of consecutive words (bi-grams). Before that, three text preprocessing steps are applied: (1) replace symbols with spaces, (2) lowercase the letters, and (3) replace digit sequences with 0.
import string
import re
def preprocessing(text):
    table = str.maketrans(string.punctuation, ' '*len(string.punctuation))
    text = text.translate(table)  # Replace symbols with spaces
    text = text.lower()  # Lowercase
    text = re.sub('[0-9]+', '0', text)  # Replace digit sequences with 0
    return text
# Recombine the data
df = pd.concat([train, valid, test], axis=0)
df.reset_index(drop=True, inplace=True) #Reassign the index
# Apply the preprocessing
df['TITLE'] = df['TITLE'].map(lambda x: preprocessing(x))
print(df.head())
output
TITLE CATEGORY
0 refile update 0 european car sales up for sixt... b
1 amazon plans to fight ftc over mobile app purc... t
2 kids still get codeine in emergency rooms desp... m
3 what on earth happened between solange and jay... e
4 nato missile defense is flight tested over hawaii b
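As a quick check of the preprocessing function, here is how it transforms a made-up headline (the example string is hypothetical):
# Punctuation becomes spaces, letters are lowercased, and digit runs collapse to 0
print(preprocessing("Apple's Q3 2014 Earnings Beat Expectations!"))
# => 'apple s q0 0 earnings beat expectations '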
from sklearn.feature_extraction.text import TfidfVectorizer
# Split back into train+valid and test
train_valid = df[:len(train) + len(valid)]
test = df[len(train) + len(valid):]
# TfidfVectorizer
vec_tfidf = TfidfVectorizer(min_df=10, ngram_range=(1, 2))  # ngram_range specifies the lengths of the word n-grams for which TF-IDF is computed
# Vectorization
X_train_valid = vec_tfidf.fit_transform(train_valid['TITLE'])  # Fit without using any information from the test data
X_test = vec_tfidf.transform(test['TITLE'])
# Convert the vectors to DataFrames
X_train_valid = pd.DataFrame(X_train_valid.toarray(), columns=vec_tfidf.get_feature_names())
X_test = pd.DataFrame(X_test.toarray(), columns=vec_tfidf.get_feature_names())
# Split train+valid back into train and valid
X_train = X_train_valid[:len(train)]
X_valid = X_train_valid[len(train):]
# Save the features
X_train.to_csv('./X_train.txt', sep='\t', index=False)
X_valid.to_csv('./X_valid.txt', sep='\t', index=False)
X_test.to_csv('./X_test.txt', sep='\t', index=False)
print(X_train.head())
output
0m 0million 0nd 0s 0st ... yuan zac zac efron zendaya zone
0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0
1 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0
2 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0
3 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0
4 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0
[5 rows x 2815 columns]
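Since most entries are zero, it can be handy to look at only the non-zero TF-IDF values of a single headline (a minimal sketch over the first training example):
# Show the highest-weighted terms of the first headline
row = X_train.iloc[0]
print(row[row > 0].sort_values(ascending=False).head(10))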
Train a logistic regression model using the training data constructed in Problem 51.
We continue to use scikit-learn and train a logistic regression model.
from sklearn.linear_model import LogisticRegression
# Train the model
lg = LogisticRegression(random_state=123, max_iter=10000)
lg.fit(X_train, train['CATEGORY'])
output
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
intercept_scaling=1, l1_ratio=None, max_iter=10000,
multi_class='auto', n_jobs=None, penalty='l2',
random_state=123, solver='lbfgs', tol=0.0001, verbose=0,
warm_start=False)
Implement a program that computes the category and its prediction probability for a given article headline, using the logistic regression model trained in Problem 52.
We define a function that takes as input a dataset that has gone through the Problem 51 pipeline, from text preprocessing to TF-IDF vectorization.
import numpy as np
def score_lg(lg, X):
    return [np.max(lg.predict_proba(X), axis=1), lg.predict(X)]
train_pred = score_lg(lg, X_train)
test_pred = score_lg(lg, X_test)
print(train_pred)
output
[array([0.8402725 , 0.67906432, 0.55642575, ..., 0.86051523, 0.61362406,
0.90827641]), array(['b', 't', 'm', ..., 'b', 'm', 'e'], dtype=object)]
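To go all the way from a raw headline to a prediction, the pieces above can be combined into a small helper (a sketch; the function name and the example headline are my own):
def predict_headline(lg, vec_tfidf, title):
    # Preprocess, vectorize with the fitted vectorizer, then score
    vec = vec_tfidf.transform([preprocessing(title)])
    X = pd.DataFrame(vec.toarray(), columns=vec_tfidf.get_feature_names())
    prob, pred = score_lg(lg, X)
    return pred[0], prob[0]

print(predict_headline(lg, vec_tfidf, 'Fed keeps interest rates unchanged'))  # hypothetical headline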
Measure the accuracy of the logistic regression model trained in Problem 52 on the training data and the evaluation data.
We use scikit-learn's `accuracy_score` to calculate the accuracy.
from sklearn.metrics import accuracy_score
train_accuracy = accuracy_score(train['CATEGORY'], train_pred[1])
test_accuracy = accuracy_score(test['CATEGORY'], test_pred[1])
print(f'Accuracy (training data): {train_accuracy:.3f}')
print(f'Accuracy (evaluation data): {test_accuracy:.3f}')
output
Accuracy (training data): 0.927
Accuracy (evaluation data): 0.885
Create confusion matrices for the logistic regression model trained in Problem 52 on the training data and the evaluation data.
The confusion matrix is also computed with scikit-learn, and the result is visualized with seaborn.
from sklearn.metrics import confusion_matrix
import seaborn as sns
import matplotlib.pyplot as plt
#Training data
train_cm = confusion_matrix(train['CATEGORY'], train_pred[1])
print(train_cm)
sns.heatmap(train_cm, annot=True, cmap='Blues')
plt.show()
output
[[4344 93 8 56]
[ 52 4173 2 8]
[ 96 125 494 13]
[ 192 133 7 888]]
#Evaluation data
test_cm = confusion_matrix(test['CATEGORY'], test_pred[1])
print(test_cm)
sns.heatmap(test_cm, annot=True, cmap='Blues')
plt.show()
output
[[528 20 2 13]
[ 12 516 1 1]
[ 11 26 52 2]
[ 38 26 1 87]]
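Because the categories have very different sizes, a row-normalized version of the matrix can make per-category error rates easier to compare (a minimal sketch):
# Normalize each row so that it sums to 1 (per-category recall on the diagonal)
test_cm_norm = test_cm / test_cm.sum(axis=1, keepdims=True)
sns.heatmap(test_cm_norm, annot=True, fmt='.2f', cmap='Blues')
plt.show()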
Measure the precision, recall, and F1 score of the logistic regression model trained in Problem 52 on the evaluation data. Obtain the precision, recall, and F1 score for each category, then aggregate the per-category performance with the micro-average and the macro-average.
from sklearn.metrics import precision_score, recall_score, f1_score
def calculate_scores(y_true, y_pred):
    # Precision
    precision = precision_score(y_true, y_pred, average=None, labels=['b', 'e', 't', 'm'])  # With average=None, the precision of each class is returned as an ndarray
    precision = np.append(precision, precision_score(y_true, y_pred, average='micro'))  # Append the micro-average
    precision = np.append(precision, precision_score(y_true, y_pred, average='macro'))  # Append the macro-average
    # Recall
    recall = recall_score(y_true, y_pred, average=None, labels=['b', 'e', 't', 'm'])
    recall = np.append(recall, recall_score(y_true, y_pred, average='micro'))
    recall = np.append(recall, recall_score(y_true, y_pred, average='macro'))
    # F1 score
    f1 = f1_score(y_true, y_pred, average=None, labels=['b', 'e', 't', 'm'])
    f1 = np.append(f1, f1_score(y_true, y_pred, average='micro'))
    f1 = np.append(f1, f1_score(y_true, y_pred, average='macro'))
    # Combine the results into a DataFrame
    scores = pd.DataFrame({'Precision': precision, 'Recall': recall, 'F1 score': f1},
                          index=['b', 'e', 't', 'm', 'Micro average', 'Macro average'])
    return scores
print(calculate_scores(test['CATEGORY'], test_pred[1]))
output
Precision Recall F1 score
b 0.896 0.938 0.917
e 0.878 0.974 0.923
t 0.845 0.572 0.682
m 0.929 0.571 0.707
Micro average 0.885 0.885 0.885
Macro average 0.887 0.764 0.807
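For reference, scikit-learn's `classification_report` produces a similar per-class and averaged summary in a single call (a minimal sketch):
from sklearn.metrics import classification_report

print(classification_report(test['CATEGORY'], test_pred[1], digits=3))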
Check the 10 features with the highest weights and the 10 features with the lowest weights in the logistic regression model trained in Problem 52.
The learned weight of each feature is stored in `coef_`, one row per class.
features = X_train.columns.values
index = [i for i in range(1, 11)]
for c, coef in zip(lg.classes_, lg.coef_):
    print(f'[Category] {c}')
    best10 = pd.DataFrame(features[np.argsort(coef)[::-1][:10]], columns=['Higher importance'], index=index).T
    worst10 = pd.DataFrame(features[np.argsort(coef)[:10]], columns=['Lower importance'], index=index).T
    display(pd.concat([best10, worst10], axis=0))
    print('\n')
output
[Category] b
Higher importance: bank, fed, china, ecb, stocks, euro, obamacare, oil, yellen, dollar
Lower importance: video, ebola, the, her, and, she, apple, google, star, microsoft

[Category] e
Higher importance: kardashian, chris, her, movie, star, film, paul, he, wedding, she
Lower importance: us, update, google, study, china, gm, ceo, facebook, apple, says

[Category] m
Higher importance: ebola, study, cancer, drug, mers, fda, cases, cdc, could, cigarettes
Lower importance: facebook, gm, ceo, apple, bank, deal, google, sales, climate, twitter

[Category] t
Higher importance: google, facebook, apple, microsoft, climate, gm, nasa, tesla, comcast, heartbleed
Lower importance: stocks, fed, her, percent, drug, american, cancer, ukraine, still, shares
When training a logistic regression model, the degree of overfitting can be controlled through the regularization parameter. Train the logistic regression model with different regularization parameters and measure the accuracy on the training data, validation data, and evaluation data. Summarize the results of the experiment in a graph with the regularization parameter on the horizontal axis and the accuracy on the vertical axis.
As the graph below shows, when the regularization is too strong (C is small), learning does not proceed and the accuracy stays low; when it is too weak (C is large), overfitting occurs and the gap between training and evaluation accuracy widens. This result confirms that choosing an appropriate C matters.
from tqdm import tqdm

result = []
for C in tqdm(np.logspace(-5, 4, 10, base=10)):
    # Train the model
    lg = LogisticRegression(random_state=123, max_iter=10000, C=C)
    lg.fit(X_train, train['CATEGORY'])
    # Get the predictions
    train_pred = score_lg(lg, X_train)
    valid_pred = score_lg(lg, X_valid)
    test_pred = score_lg(lg, X_test)
    # Calculate the accuracy
    train_accuracy = accuracy_score(train['CATEGORY'], train_pred[1])
    valid_accuracy = accuracy_score(valid['CATEGORY'], valid_pred[1])
    test_accuracy = accuracy_score(test['CATEGORY'], test_pred[1])
    # Store the results
    result.append([C, train_accuracy, valid_accuracy, test_accuracy])
output
100%|██████████| 10/10 [07:26<00:00, 44.69s/it] #Show progress using tqdm
#Visualization
result = np.array(result).T
plt.plot(result[0], result[1], label='train')
plt.plot(result[0], result[2], label='valid')
plt.plot(result[0], result[3], label='test')
plt.ylim(0, 1.1)
plt.ylabel('Accuracy')
plt.xscale('log')
plt.xlabel('C')
plt.legend()
plt.show()
Train categorization models while varying the learning algorithm and its parameters. Find the learning algorithm and parameters that achieve the highest accuracy on the validation data. Then measure the accuracy on the evaluation data using that algorithm and those parameters.
Here, the parameter search covers `C`, which specifies the strength of the regularization, and `l1_ratio`, which specifies the balance between L1 and L2 regularization. Optuna is used for the optimization.
!pip install optuna
import optuna
# Define the objective to optimize as a function
def objective_lg(trial):
    # Parameters to be tuned
    l1_ratio = trial.suggest_uniform('l1_ratio', 0, 1)
    C = trial.suggest_loguniform('C', 1e-4, 1e4)
    # Train the model
    lg = LogisticRegression(random_state=123,
                            max_iter=10000,
                            penalty='elasticnet',
                            solver='saga',
                            l1_ratio=l1_ratio,
                            C=C)
    lg.fit(X_train, train['CATEGORY'])
    # Get the predictions
    valid_pred = score_lg(lg, X_valid)
    # Calculate the accuracy
    valid_accuracy = accuracy_score(valid['CATEGORY'], valid_pred[1])
    return valid_accuracy
# Run the optimization
study = optuna.create_study(direction='maximize')
study.optimize(objective_lg, timeout=3600)

# Show the results
print('Best trial:')
trial = study.best_trial
print('  Value: {:.3f}'.format(trial.value))
print('  Params: ')
for key, value in trial.params.items():
    print('    {}: {}'.format(key, value))
output
Best trial:
Value: 0.892
Params:
l1_ratio: 0.23568685768996045
C: 4.92280374981671
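Optuna can also visualize how the search progressed (a minimal sketch; `optuna.visualization` requires plotly to be available in the environment):
import optuna.visualization as vis

# Plot the best validation accuracy found so far at each trial
fig = vis.plot_optimization_history(study)
fig.show()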
We train the model again with the parameters found by the search and check the accuracy.
# Set the parameters
l1_ratio = trial.params['l1_ratio']
C = trial.params['C']
# Train the model
lg = LogisticRegression(random_state=123,
                        max_iter=10000,
                        penalty='elasticnet',
                        solver='saga',
                        l1_ratio=l1_ratio,
                        C=C)
lg.fit(X_train, train['CATEGORY'])
# Get the predictions
train_pred = score_lg(lg, X_train)
valid_pred = score_lg(lg, X_valid)
test_pred = score_lg(lg, X_test)
# Calculate the accuracy
train_accuracy = accuracy_score(train['CATEGORY'], train_pred[1])
valid_accuracy = accuracy_score(valid['CATEGORY'], valid_pred[1])
test_accuracy = accuracy_score(test['CATEGORY'], test_pred[1])
print(f'Accuracy (training data): {train_accuracy:.3f}')
print(f'Accuracy (validation data): {valid_accuracy:.3f}')
print(f'Accuracy (evaluation data): {test_accuracy:.3f}')
output
Accuracy (training data): 0.966
Accuracy (validation data): 0.892
Accuracy (evaluation data): 0.895
Since the accuracy on the evaluation data with the default parameters was 0.885, we can see that adopting appropriate parameters improved the accuracy.
We also try XGBoost here. Note that no parameter search is performed this time; the model is trained with fixed parameters.
!pip install xgboost
import xgboost as xgb
params={'objective': 'multi:softmax',
'num_class': 4,
'eval_metric': 'mlogloss',
'colsample_bytree': 1.0,
'colsample_bylevel': 0.5,
'min_child_weight': 1,
'subsample': 0.9,
'eta': 0.1,
'max_depth': 5,
'gamma': 0.0,
'alpha': 0.0,
'lambda': 1.0,
'num_round': 1000,
'early_stopping_rounds': 50,
'verbosity': 0
}
# Convert the data into XGBoost's format
category_dict = {'b': 0, 'e': 1, 't': 2, 'm': 3}
y_train = train['CATEGORY'].map(lambda x: category_dict[x])
y_valid = valid['CATEGORY'].map(lambda x: category_dict[x])
y_test = test['CATEGORY'].map(lambda x: category_dict[x])
dtrain = xgb.DMatrix(X_train, label=y_train)
dvalid = xgb.DMatrix(X_valid, label=y_valid)
dtest = xgb.DMatrix(X_test, label=y_test)
# Train the model
num_round = params.pop('num_round')
early_stopping_rounds = params.pop('early_stopping_rounds')
watchlist = [(dtrain, 'train'), (dvalid, 'eval')]
model = xgb.train(params, dtrain, num_round, evals=watchlist, early_stopping_rounds=early_stopping_rounds)
# Get the predictions
train_pred = model.predict(dtrain, ntree_limit=model.best_ntree_limit)
valid_pred = model.predict(dvalid, ntree_limit=model.best_ntree_limit)
test_pred = model.predict(dtest, ntree_limit=model.best_ntree_limit)
# Calculate the accuracy
train_accuracy = accuracy_score(y_train, train_pred)
valid_accuracy = accuracy_score(y_valid, valid_pred)
test_accuracy = accuracy_score(y_test, test_pred)
print(f'Accuracy (training data): {train_accuracy:.3f}')
print(f'Accuracy (validation data): {valid_accuracy:.3f}')
print(f'Accuracy (evaluation data): {test_accuracy:.3f}')
output
Accuracy (training data): 0.963
Accuracy (validation data): 0.873
Accuracy (evaluation data): 0.873
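As with the logistic regression weights in Problem 57, one could also check which features the trees rely on most; `xgb.plot_importance` does this directly (a minimal sketch):
# Plot the 10 features most frequently used for splits
xgb.plot_importance(model, max_num_features=10)
plt.show()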
100 Language Processing Knock is designed so that you can learn not only natural language processing itself but also basic data handling and general-purpose machine learning. Even if you are studying machine learning through online courses, it is an excellent set of exercises for producing output, so please give it a try.