"Supervised machine learning" to find the function $ f $ such that $ y = f (X) $ when the corresponding objective variable $ y $ is known for some explanatory variables $ X_n $. Is called. The simplest of these are "linear simple regression" and "linear multiple regression".
As an example, we will use the "Pima Indian Diabetes Diagnosis" dataset.
#Import a library that provides access to resources by URL.
import urllib.request
#Specify resources on the web
url = 'https://raw.githubusercontent.com/maskot1977/ipython_notebook/master/toydata/pima-indians-diabetes.txt'
#Download the resource from the specified URL and give it a name.
urllib.request.urlretrieve(url, 'pima-indians-diabetes.txt')
('pima-indians-diabetes.txt', <http.client.HTTPMessage at 0x7fd16c201550>)
#Import a library for processing spreadsheet-like data
import pandas as pd
#Read the data and store it in DataFrame format
df = pd.read_csv('pima-indians-diabetes.txt', delimiter="\t", index_col=0)
#Check the contents
df
| | NumTimePreg | OralGluTol | BloodPres | SkinThick | SerumInsulin | BMI | PedigreeFunc | Age | Class |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 765 | 5 | 121 | 72 | 23 | 112 | 26.2 | 0.245 | 30 | 0 |
| 766 | 1 | 126 | 60 | 0 | 0 | 30.1 | 0.349 | 47 | 1 |
| 767 | 1 | 93 | 70 | 31 | 0 | 30.4 | 0.315 | 23 | 0 |

768 rows × 9 columns
The above "Class" is the objective variable $ y $, and if it is 1, it is judged that it is not diabetic, and if it is 0, it is judged that it is not diabetic. Let's create a model that predicts it.
#Explanatory variables
X = df.iloc[:, :8]
#Min-max normalization that maps each column to the range 0 to 1.
#With axis=0 the function is applied column by column; axis=1 would apply it row by row.
X = X.apply(lambda x: (x-x.min())/(x.max() - x.min()), axis=0)
#Objective variable
y = df.iloc[:, 8]
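Incidentally, scikit-learn's MinMaxScaler performs the same min-max normalization; a minimal equivalent sketch (it returns a NumPy array rather than a DataFrame):

#Equivalent min-max normalization with scikit-learn
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler() #Maps each column to the range [0, 1]
X_scaled = scaler.fit_transform(df.iloc[:, :8]) #Same values as the lambda above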
In machine learning, in order to evaluate performance, the known data is divided into training data (also called the training set or teacher data) and test data (also called the test set). A prediction model is built by fitting (training) on the training data, and its performance is evaluated by how accurately it predicts the test data that was not used to build it. Evaluating on a single held-out split like this is, strictly speaking, hold-out validation; repeating it over many different splits, as we will do later, gives a simple form of cross-validation. Here we use:
- Training data (60% of all data)
  - X_train: explanatory variables of the training data
  - y_train: objective variable of the training data
- Test data (40% of all data)
  - X_test: explanatory variables of the test data
  - y_test: objective variable of the test data
We aim to learn the relationship between X_train and y_train and predict y_test from X_test.
Python's machine learning library scikit-learn provides methods for splitting into training and test data.
#Import method to split into training data and test data
from sklearn.model_selection import train_test_split
#Randomly split into training and test data at a ratio of 6:4
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.4)
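Note that train_test_split shuffles at random, so the class proportions in the two sets can drift apart. Passing the stratify argument preserves them; an optional variation (not used for the results recorded below):

#Split while preserving the class ratio of y in both sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.4, stratify=y)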
Logistic regression is similar to linear multiple regression, but it handles the case where the objective variable $ y $ takes the discrete values 0 or 1. In a previous lecture we implemented logistic regression with the scipy library, but in practice it is more convenient to use scikit-learn.
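Concretely, logistic regression models the probability that $ y = 1 $ by passing a linear combination of the explanatory variables through the sigmoid function:

$$ p(y = 1 \mid X) = \frac{1}{1 + e^{-(w_0 + w_1 x_1 + \cdots + w_n x_n)}} $$

and predicts class 1 when this probability exceeds 0.5.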
Check which methods and parameters are available in the scikit-learn documentation: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html
from sklearn.linear_model import LogisticRegression #Logistic regression
classifier = LogisticRegression() #Generate classifier
classifier.fit(X_train, y_train) #Learning
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
intercept_scaling=1, l1_ratio=None, max_iter=100,
multi_class='warn', n_jobs=None, penalty='l2',
random_state=None, solver='warn', tol=0.0001, verbose=0,
warm_start=False)
It is convenient to use timeit to measure the time required for learning.
import timeit #Library for measuring execution time
timeit.timeit(lambda: classifier.fit(X_train, y_train), number=1)
0.006864218999993454
It is necessary to distinguish between the classification accuracy on the data used for training and the accuracy on data not used for training. A model for which the former is very high but the latter is low has poor generalization performance and is said to be "overfitted".
#Accuracy (train): how accurately can the data used for training be predicted?
classifier.score(X_train,y_train)
0.7800289435600579
#Accuracy (test): how accurately can data not used for training be predicted?
classifier.score(X_test,y_test)
0.7402597402597403
Beyond the overall accuracy, we can also obtain the specific class each sample is assigned to.
#Predict data not used for training
y_pred = classifier.predict(X_test)
y_pred
array([1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0,
0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0,
0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0])
When the true labels are known, the predictions can be checked against them. The confusion matrix is useful for aggregating that comparison.
from sklearn.metrics import confusion_matrix #Method to calculate confusion matrix
#A confusion matrix that shows how well the prediction result matches the correct answer (the true answer).
pd.DataFrame(confusion_matrix(y_pred, y_test),
index=['predicted 0', 'predicted 1'], columns=['real 0', 'real 1'])
| | real 0 | real 1 |
|---|---|---|
| predicted 0 | 47 | 18 |
| predicted 1 | 2 | 10 |
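The test accuracy reported above can be read directly off this matrix: the diagonal entries are the correct predictions, so

$$ \text{accuracy} = \frac{47 + 10}{47 + 18 + 2 + 10} = \frac{57}{77} \approx 0.740 $$

which matches the value returned by classifier.score(X_test, y_test).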
Which class each sample is assigned to is decided from the "strength of confidence" of each classification. By ordering the samples by confidence, we can draw ROC curves and PR curves and evaluate the performance of the prediction model.
#Calculate the confidence (predicted probability) of each prediction
y_proba = classifier.predict_proba(X_test)
y_proba
array([[0.21417054, 0.78582946],
[0.46404957, 0.53595043],
[0.70401466, 0.29598534],
[0.75314361, 0.24685639],
[0.76452966, 0.23547034],
[0.33685542, 0.66314458],
[0.76393323, 0.23606677],
[0.82487552, 0.17512448],
[0.87720401, 0.12279599],
[0.83530283, 0.16469717],
[0.64980016, 0.35019984],
[0.78574888, 0.21425112],
[0.51054138, 0.48945862],
[0.24870259, 0.75129741],
[0.91082684, 0.08917316],
[0.86200773, 0.13799227],
[0.71562431, 0.28437569],
[0.62886446, 0.37113554],
[0.63181921, 0.36818079],
[0.77975231, 0.22024769],
[0.65396517, 0.34603483],
[0.81535938, 0.18464062],
[0.54607196, 0.45392804],
[0.79688063, 0.20311937],
[0.80333846, 0.19666154],
[0.728435 , 0.271565 ],
[0.36817034, 0.63182966],
[0.54025915, 0.45974085],
[0.6614052 , 0.3385948 ],
[0.74309548, 0.25690452],
[0.92572332, 0.07427668],
[0.80406998, 0.19593002],
[0.61165474, 0.38834526],
[0.43564389, 0.56435611],
[0.42922327, 0.57077673],
[0.61369072, 0.38630928],
[0.68195508, 0.31804492],
[0.86971152, 0.13028848],
[0.81006182, 0.18993818],
[0.86324924, 0.13675076],
[0.82269894, 0.17730106],
[0.48717372, 0.51282628],
[0.72772261, 0.27227739],
[0.81581007, 0.18418993],
[0.54651378, 0.45348622],
[0.65486361, 0.34513639],
[0.69695761, 0.30304239],
[0.50397912, 0.49602088],
[0.70579261, 0.29420739],
[0.56812519, 0.43187481],
[0.28702944, 0.71297056],
[0.78684682, 0.21315318],
[0.77913962, 0.22086038],
[0.20665217, 0.79334783],
[0.64020202, 0.35979798],
[0.54394942, 0.45605058],
[0.74972094, 0.25027906],
[0.89307226, 0.10692774],
[0.63129007, 0.36870993],
[0.775181 , 0.224819 ],
[0.88651222, 0.11348778],
[0.83087546, 0.16912454],
[0.52015754, 0.47984246],
[0.17895175, 0.82104825],
[0.68620306, 0.31379694],
[0.6503939 , 0.3496061 ],
[0.53702941, 0.46297059],
[0.74395419, 0.25604581],
[0.79430285, 0.20569715],
[0.70717315, 0.29282685],
[0.74036824, 0.25963176],
[0.35031104, 0.64968896],
[0.59128595, 0.40871405],
[0.62945511, 0.37054489],
[0.85812094, 0.14187906],
[0.95492842, 0.04507158],
[0.82726693, 0.17273307]])
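Each row gives the estimated probabilities of class 0 and class 1, and the two values sum to 1. For binary logistic regression, predict() simply chooses the class with the larger probability, which amounts to thresholding the class-1 probability at 0.5. A quick consistency check (a sketch using the arrays above):

import numpy as np
#predict() corresponds to picking the class whose probability exceeds 0.5
np.array_equal(classifier.classes_[(y_proba[:, 1] > 0.5).astype(int)], y_pred)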
#Import a library to illustrate diagrams and graphs.
import matplotlib.pyplot as plt
%matplotlib inline
The ROC curve, and the AUC score given by the area under it, can be handled with the following methods:
https://scikit-learn.org/stable/modules/generated/sklearn.metrics.roc_curve.html
https://scikit-learn.org/stable/modules/generated/sklearn.metrics.auc.html
from sklearn.metrics import roc_curve
from sklearn.metrics import auc
#Compute the AUC score
fpr, tpr, thresholds = roc_curve(y_test, y_proba[:, 1])
roc_auc = auc(fpr, tpr)
print("AUC score : %f" % roc_auc)
#Draw a ROC curve
plt.figure(figsize=(4,4))
plt.plot(fpr, tpr, label='ROC curve (area = %0.2f)' % roc_auc)
plt.plot([0, 1], [0, 1], 'k--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.0])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC curve: AUC=%0.2f' % roc_auc)
plt.legend(loc="lower right")
plt.show()
AUC score : 0.756560
Similarly, here is the method for drawing a PR curve.
from sklearn.metrics import auc
from sklearn.metrics import precision_recall_curve
precision, recall, thresholds = precision_recall_curve(y_test, y_proba[:, 1])
area = auc(recall, precision)
print ("AUPR score: %0.2f" % area)
#Draw a PR curve
plt.figure(figsize=(4,4))
plt.plot(recall, precision, label='Precision-Recall curve')
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.ylim([0.0, 1.05])
plt.xlim([0.0, 1.0])
plt.title('Precision-Recall curve: AUPR=%0.2f' % area)
plt.legend(loc="lower left")
plt.show()
AUPR score: 0.68
Machine learning methods take many parameters, and the default values do not necessarily give good predictions. One way to find good parameters is "grid search". GridSearchCV further splits the training data (3-fold cross-validation by default), tries every combination of the candidate parameters, and selects the parameters that perform best on average.
%%time
#Grid search to find the best parameters
from sklearn.model_selection import GridSearchCV
#Parameters for grid search (Logistic Regression parameters)
parameters = [
{'solver': ['liblinear', 'saga'], 'penalty':['l1', 'l2'], 'C': [0.1, 1, 10, 100]},
{'solver': ['newton-cg', 'sag', 'lbfgs' ], 'penalty':['l2'], 'C': [0.1, 1, 10, 100]},
]
#Grid search execution
classifier = GridSearchCV(LogisticRegression(), parameters, cv=3, n_jobs=-1)
classifier.fit(X_train, y_train)
print("Accuracy score (train): ", classifier.score(X_train, y_train))
print("Accuracy score (test): ", classifier.score(X_test, y_test))
print(classifier.best_estimator_) #Classifier with the best parameters
Accuracy score (train): 0.784370477568741
Accuracy score (test): 0.6883116883116883
LogisticRegression(C=1, class_weight=None, dual=False, fit_intercept=True,
intercept_scaling=1, l1_ratio=None, max_iter=100,
multi_class='warn', n_jobs=None, penalty='l1',
random_state=None, solver='saga', tol=0.0001, verbose=0,
warm_start=False)
CPU times: user 192 ms, sys: 25.4 ms, total: 218 ms
Wall time: 2.52 s
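Besides best_estimator_, the fitted GridSearchCV object records the whole search, which is useful for seeing how sensitive the performance is to each parameter. A minimal sketch of the attributes involved:

print(classifier.best_params_) #Best parameter combination found
print(classifier.best_score_) #Mean cross-validation accuracy of that combination
#Table of every combination tried, with the mean and std of its validation score
pd.DataFrame(classifier.cv_results_)[['params', 'mean_test_score', 'std_test_score']]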
From here on, we would like to compare the performance of various machine learning methods, so let's prepare a variable for recording the evaluation metrics.
scores = []
There are various metrics for the performance of a classification model. The main ones are listed below (their definitions in terms of the confusion matrix are summarized after the list). Let's check the meaning of each.
https://scikit-learn.org/stable/modules/generated/sklearn.metrics.accuracy_score.html
https://scikit-learn.org/stable/modules/generated/sklearn.metrics.precision_score.html
https://scikit-learn.org/stable/modules/generated/sklearn.metrics.recall_score.html
https://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html
https://scikit-learn.org/stable/modules/generated/sklearn.metrics.matthews_corrcoef.html
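In terms of the confusion-matrix counts (TP, FP, FN, TN), the first four metrics are defined as:

$$ \text{accuracy} = \frac{TP + TN}{TP + FP + FN + TN}, \quad \text{precision} = \frac{TP}{TP + FP}, \quad \text{recall} = \frac{TP}{TP + FN}, \quad F_1 = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}} $$

and the Matthews correlation coefficient (MCC) balances all four counts:

$$ \text{MCC} = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}} $$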
I created the following function, which computes these evaluation metrics together and stores them in the recording variable. It repeats training and evaluation over different random data splits so that the average performance, its standard deviation, and the training time can be recorded.
import timeit
from sklearn import metrics
def record_classification_scores(classifier_name, classifier, iter=5):
records = []
for run_id in range(iter):
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.4)
print('Run ', run_id + 1)
seconds = timeit.timeit(lambda: classifier.fit(X_train, y_train), number=1)
print(' Learning Time (s):', seconds)
y_pred = classifier.predict(X_test)
y_proba = classifier.predict_proba(X_test)
accuracy_score = metrics.accuracy_score(y_test, y_pred)
precision_score = metrics.precision_score(y_test, y_pred)
recall_score = metrics.recall_score(y_test, y_pred)
f1_score = metrics.f1_score(y_test, y_pred)
fpr, tpr, thresholds = roc_curve(y_test, y_proba[:, 1])
roc_auc = auc(fpr, tpr)
pre, rec, thresholds = precision_recall_curve(y_test, y_proba[:, 1])
aupr = auc(rec, pre)
mcc = metrics.matthews_corrcoef(y_test, y_pred)
records.append([classifier_name, accuracy_score, precision_score, recall_score,
f1_score, roc_auc, aupr, mcc, seconds])
return records
Now, let's learn using the "classifier with the best parameters" created earlier and record the performance index.
%%time
scores += record_classification_scores('LR', classifier.best_estimator_)
Run 1
Learning Time (s): 0.004809510999990607
Run 2
Learning Time (s): 0.004076423000000773
Run 3
Learning Time (s): 0.004598837999992611
Run 4
Learning Time (s): 0.004291107000000238
Run 5
Learning Time (s): 0.003665049000005638
CPU times: user 65.8 ms, sys: 3.33 ms, total: 69.1 ms
Wall time: 67.6 ms
The average performance and its standard deviation are as follows.
df_scores = pd.DataFrame(scores, columns = ['Classifier', 'Accuracy', 'Precision', 'Recall',
'F1 score', 'ROC AUC', 'AUPR', 'MCC', 'Time'])
df_scores_mean = df_scores.iloc[:, :-1].mean()
df_scores_errors = df_scores.iloc[:, :-1].std()
df_scores_mean.plot(kind='bar', grid=True, yerr=df_scores_errors)
<matplotlib.axes._subplots.AxesSubplot at 0x7fd15943ca90>
Gradient boosting is a technique that has attracted a lot of attention recently. We won't go into details here; check the required parameters in the scikit-learn documentation: https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingClassifier.html
%%time
#Grid search to find the best parameters
from sklearn.model_selection import GridSearchCV
#Gradient boosting
from sklearn.ensemble import GradientBoostingClassifier
#Parameters for grid search
parameters = [{
'loss': ['deviance', 'exponential'],
'learning_rate':[0.1,0.2],
'n_estimators':[20,100,200],
'max_depth':[3,5,7,9]
}]
#Grid search execution
classifier = GridSearchCV(GradientBoostingClassifier(), parameters, cv=3, n_jobs=-1)
classifier.fit(X_train, y_train)
print("Accuracy score (train): ", classifier.score(X_train, y_train))
print("Accuracy score (test): ", classifier.score(X_test, y_test))
print(classifier.best_estimator_) #Best parameters
Accuracy score (train): 1.0
Accuracy score (test): 0.7142857142857143
GradientBoostingClassifier(criterion='friedman_mse', init=None,
learning_rate=0.2, loss='deviance', max_depth=7,
max_features=None, max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, n_estimators=100,
n_iter_no_change=None, presort='auto',
random_state=None, subsample=1.0, tol=0.0001,
validation_fraction=0.1, verbose=0,
warm_start=False)
CPU times: user 947 ms, sys: 25.1 ms, total: 973 ms
Wall time: 25.2 s
Learn and record performance using the resulting "classifier with the best parameters".
%%time
scores += record_classification_scores('GB', classifier.best_estimator_)
Run 1
Learning Time (s): 0.3291641410000068
Run 2
Learning Time (s): 0.31575948799999765
Run 3
Learning Time (s): 0.3144692120000059
Run 4
Learning Time (s): 0.3252903609999862
Run 5
Learning Time (s): 0.3103595519999942
CPU times: user 1.64 s, sys: 7.28 ms, total: 1.65 s
Wall time: 1.65 s
I made a function to visualize the performance comparison of multiple classification methods.
def visualize_classification_result(scores):
df_scores = pd.DataFrame(scores, columns = ['Classifier', 'Accuracy', 'Precision', 'Recall',
'F1 score', 'ROC AUC', 'AUPR', 'MCC', 'Time'])
df_scores_mean = df_scores.iloc[:, :-1].groupby('Classifier').mean()
df_scores_errors = df_scores.iloc[:, :-1].groupby('Classifier').std()
df_scores_mean.T.plot(kind='bar', grid=True, yerr=df_scores_errors.T,
figsize=(12, 2), legend=False)
plt.legend(loc = 'right', bbox_to_anchor = (0.7, 0.5, 0.5, 0.0))
df_scores_mean.plot(kind='bar', grid=True, yerr=df_scores_errors,
figsize=(12, 2), legend=False)
plt.legend(loc = 'right', bbox_to_anchor = (0.7, 0.5, 0.5, 0.0))
df_time_mean = df_scores.iloc[:, [0, -1]].groupby('Classifier').mean()
df_time_errors = df_scores.iloc[:, [0, -1]].groupby('Classifier').std()
df_time_mean.plot(kind='bar', grid=True, yerr=df_time_errors,
figsize=(12, 2), legend=False)
plt.yscale('log')
visualize_classification_result(scores)
The multi-layer perceptron (MLP) is the simplest deep learning model, and it too is implemented in scikit-learn.
Check which methods and parameters are available in the scikit-learn documentation: https://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPClassifier.html
%%time
#Grid search to find the best parameters
from sklearn.model_selection import GridSearchCV
#Multilayer perceptron
from sklearn.neural_network import MLPClassifier
#Parameters for grid search
parameters = [{'hidden_layer_sizes': [8, (8, 8), (8, 8, 8)],
'solver': ['sgd', 'adam', 'lbfgs'],
'activation': ['logistic', 'tanh', 'relu'],
'learning_rate_init': [0.1, 0.01, 0.001]}]
#Grid search execution
classifier = GridSearchCV(MLPClassifier(max_iter=10000, early_stopping=True),
parameters, cv=3, n_jobs=-1)
classifier.fit(X_train, y_train)
print("Accuracy score (train): ", classifier.score(X_train, y_train))
print("Accuracy score (test): ", classifier.score(X_test, y_test))
print(classifier.best_estimator_) #Classifier with the best parameters
Accuracy score (train): 0.7930535455861071
Accuracy score (test): 0.7272727272727273
MLPClassifier(activation='relu', alpha=0.0001, batch_size='auto', beta_1=0.9,
beta_2=0.999, early_stopping=True, epsilon=1e-08,
hidden_layer_sizes=8, learning_rate='constant',
learning_rate_init=0.1, max_iter=10000, momentum=0.9,
n_iter_no_change=10, nesterovs_momentum=True, power_t=0.5,
random_state=None, shuffle=True, solver='lbfgs', tol=0.0001,
validation_fraction=0.1, verbose=False, warm_start=False)
CPU times: user 1.15 s, sys: 39.8 ms, total: 1.19 s
Wall time: 2min 29s
Learn and record performance using the resulting "classifier with the best parameters".
%%time
scores += record_classification_scores('MLP', classifier.best_estimator_)
Run 1
Learning Time (s): 0.4756240830000138
Run 2
Learning Time (s): 0.34581674499997916
Run 3
Learning Time (s): 0.15651393699999971
Run 4
Learning Time (s): 0.14490434999999025
Run 5
Learning Time (s): 0.005184319999955278
CPU times: user 1.16 s, sys: 3.54 ms, total: 1.17 s
Wall time: 1.17 s
Compare performance.
visualize_classification_result(scores)
scikit-learn provides the breast_cancer dataset as practice data for machine learning. Divide the dataset into explanatory variables and objective variables as follows, classify the breast_cancer data with MLPClassifier while tuning parameters with GridSearchCV, and evaluate the performance.
# https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_breast_cancer.html
from sklearn.datasets import load_breast_cancer
breast_cancer = load_breast_cancer()
X = pd.DataFrame(breast_cancer.data)
y = pd.DataFrame(breast_cancer.target.ravel())
scikit-learn provides a wine dataset as data for machine learning practice. Divide the dataset into explanatory variables and objective variables as follows, classify the wine data with MLPClassifier while tuning the parameters with GridSearchCV, and evaluate the performance.
# https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_wine.html
from sklearn.datasets import load_wine
wine = load_wine()
X = pd.DataFrame(wine.data)
y = pd.DataFrame(wine.target)
However, the wine dataset is a three-class classification problem rather than a two-class one like the breast_cancer dataset, so the objective variable must be preprocessed as follows.
# https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html
import numpy as np
from sklearn.preprocessing import OneHotEncoder
encoder = OneHotEncoder(categories="auto", sparse=False, dtype=np.float32)
y = pd.DataFrame(encoder.fit_transform(y))
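To see what this produces: each class label is converted into an indicator vector. A tiny sketch with the three wine class labels:

#Class labels 0, 1, 2 become the rows [1,0,0], [0,1,0], [0,0,1]
demo_encoder = OneHotEncoder(categories="auto", sparse=False, dtype=np.float32)
demo_encoder.fit_transform([[0], [1], [2]])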
Also, if the MLP has too few nodes (neurons), classification performance will be low, so increase the number of nodes appropriately.
Now that we have worked through classification problems, let's solve a regression problem. As the subject, we take the relationship between the composition of concrete and its strength. First, get the concrete data.
The data source is Concrete Slump Test Data Set https://archive.ics.uci.edu/ml/datasets/Concrete+Slump+Test.
#Import a library that provides access to resources by URL.
import urllib.request
#Specify resources on the web
url = 'https://raw.githubusercontent.com/maskot1977/ipython_notebook/master/toydata/slump_test.data'
#Download the resource from the specified URL and give it a name.
urllib.request.urlretrieve(url, 'slump_test.data')
('slump_test.data', <http.client.HTTPMessage at 0x7ff02ed82518>)
import pandas as pd
df = pd.read_csv('slump_test.data', index_col=0)
df
| No | Cement | Slag | Fly ash | Water | SP | Coarse Aggr. | Fine Aggr. | SLUMP(cm) | FLOW(cm) | Compressive Strength (28-day)(Mpa) |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 273.0 | 82.0 | 105.0 | 210.0 | 9.0 | 904.0 | 680.0 | 23.0 | 62.0 | 34.99 |
| 2 | 163.0 | 149.0 | 191.0 | 180.0 | 12.0 | 843.0 | 746.0 | 0.0 | 20.0 | 41.14 |
| 3 | 162.0 | 148.0 | 191.0 | 179.0 | 16.0 | 840.0 | 743.0 | 1.0 | 20.0 | 41.81 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 101 | 258.8 | 88.0 | 239.6 | 175.3 | 7.6 | 938.9 | 646.0 | 0.0 | 20.0 | 50.50 |
| 102 | 297.1 | 40.9 | 239.9 | 194.0 | 7.5 | 908.9 | 651.8 | 27.5 | 67.0 | 49.17 |
| 103 | 348.7 | 0.1 | 223.1 | 208.5 | 9.6 | 786.2 | 758.1 | 29.0 | 78.0 | 48.77 |

103 rows × 10 columns
Let's read the explanation at the data source. Here, the leftmost 7 columns are used as explanatory variables and the rightmost column as the objective variable.
X = df.iloc[:, :-3].apply(lambda x: (x-x.min())/(x.max() - x.min()), axis=0)
y = df.iloc[:, -1]
#Import method to split into training data and test data
from sklearn.model_selection import train_test_split
#Randomly split into training and test data at a ratio of 6:4
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.4)
Let's try the simplest regression model, Multiple Linear Regression.
from sklearn.linear_model import LinearRegression
regressor = LinearRegression() #Linear multiple regression
regressor.fit(X_train, y_train) #Learning
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)
In scikit-learn's regression models, a method called .score() calculates the coefficient of determination (the R2 value):

$$ R^2 = 1 - \frac{\sum_{i} (y_i - \hat{y}_i)^2}{\sum_{i} (y_i - \bar{y})^2} $$

where $ y_i $ are the true values, $ \hat{y}_i $ the predictions, and $ \bar{y} $ the mean of the true values. Consequently, R2 equals 1 for a perfect model, is near 0 for a model no better than always predicting the mean, and can even be negative for a model worse than that.
Now, let's train on the training set and then check the performance on the test set.
regressor.score(X_train, y_train), regressor.score(X_test, y_test)
(0.9224703183565424, 0.8177828980042425)
Predictions are made by feeding data not used for training into the obtained regression model.
A plot of the true objective variable against its predicted value is commonly called a y-y plot (although this does not seem to be an official name). The closer the points lie to the diagonal, the better the regression model.
import numpy as np
import sklearn.metrics as metrics
y_pred = regressor.predict(X_test) #Substitute data not used for learning
print("R2=", metrics.r2_score(y_test, y_pred))
print("RMSE=", np.sqrt(metrics.mean_squared_error(y_test, y_pred)))
print("MAE=", metrics.mean_absolute_error(y_test, y_pred))
%matplotlib inline
import matplotlib.pyplot as plt
plt.figure(figsize=(4,4))
plt.scatter(y_test, y_pred, alpha=0.2, c="blue")
plt.plot([y.min(), y.max()], [y.min(), y.max()], c="black")
plt.grid()
plt.xlabel("Real Y")
plt.ylabel("Predicted Y")
plt.show()
R2= 0.8177828980042425
RMSE= 3.0139396734524633
MAE= 2.4622169354183447
scikit-learn provides methods to calculate the following metrics for the performance evaluation of regression models (their formulas are given after the list).
sklearn.metrics.r2_score
The R2 value described earlier. A regression model with a larger value (a value closer to 1) is a better model.
sklearn.metrics.mean_squared_error
Mean squared error (MSE) is the average of the squared deviations between the objective variable and the predicted values; RMSE (root mean squared error) is its square root. The formulas resemble those of the variance and the standard deviation, respectively. A regression model with a smaller value (closer to 0) is a better model.
sklearn.metrics.mean_absolute_error
Mean absolute error (MAE) is the average of the absolute deviations between the objective variable and the predicted values. Because mean squared error squares each deviation, it tends to emphasize large errors, whereas mean absolute error does not. A regression model with a smaller value (closer to 0) is a better model.
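For a test set of $ n $ samples with true values $ y_i $ and predictions $ \hat{y}_i $, these metrics are:

$$ \mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2, \quad \mathrm{RMSE} = \sqrt{\mathrm{MSE}}, \quad \mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} |y_i - \hat{y}_i| $$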
PLS (Partial Least Squares)
Partial least squares regression (PLS) is a type of linear regression method that, instead of using the explanatory variables directly, uses latent variables: linear combinations of the explanatory variables constructed to be uncorrelated with each other. Compared with ordinary linear multiple regression, it has the advantages of being robust to multicollinearity among the explanatory variables and of remaining usable even when there are more explanatory variables than samples.
#Import method to split into training data and test data
from sklearn.model_selection import train_test_split
#Randomly split into training and test data at a ratio of 6:4
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.4)
from sklearn.cross_decomposition import PLSRegression #Method to perform PLS regression
regressor = PLSRegression() #Regressor generation
regressor.fit(X_train, y_train) #Learning
PLSRegression(copy=True, max_iter=500, n_components=2, scale=True, tol=1e-06)
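The fitted model exposes the latent variables it constructed. A quick peek, using attribute names from scikit-learn's PLSRegression (here with the default n_components=2):

#Latent-variable values (scores) of the training samples: one column per component
regressor.x_scores_.shape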
import numpy as np
import sklearn.metrics as metrics
y_pred = regressor.predict(X_test)
print("R2=", metrics.r2_score(y_test, y_pred))
print("RMSE=", np.sqrt(metrics.mean_squared_error(y_test, y_pred)))
print("MAE=", metrics.mean_absolute_error(y_test, y_pred))
%matplotlib inline
import matplotlib.pyplot as plt
plt.figure(figsize=(4,4))
plt.scatter(y_test, y_pred, alpha=0.2, c="blue")
plt.plot([y.min(), y.max()], [y.min(), y.max()], c="black")
plt.grid()
plt.xlabel("Real Y")
plt.ylabel("Predicted Y")
plt.show()
R2= 0.7649827980366806
RMSE= 3.523505754210429
MAE= 2.8524226793359198
Let's tune the parameters and make a better model.
%%time
#Grid search to find the best parameters
from sklearn.model_selection import GridSearchCV
#Parameters for grid search
parameters = [
{'n_components': [2, 3, 4, 5, 6], 'scale':[True, False], 'max_iter': [1000]},
]
#Grid search execution
regressor = GridSearchCV(PLSRegression(), parameters, cv=3, n_jobs=-1)
regressor.fit(X_train, y_train)
print("R2 (train): ", regressor.score(X_train, y_train))
print("R2 (test): ", regressor.score(X_test, y_test))
print(regressor.best_estimator_) #Regression model with best parameters
R2 (train): 0.9225247360485105
R2 (test): 0.8162623239997147
PLSRegression(copy=True, max_iter=1000, n_components=3, scale=False, tol=1e-06)
CPU times: user 114 ms, sys: 30.4 ms, total: 144 ms
Wall time: 2.18 s
Let's make a prediction using the obtained regression model. Has the prediction performance improved?
import numpy as np
import sklearn.metrics as metrics
y_pred = regressor.predict(X_test)
print("R2=", metrics.r2_score(y_test, y_pred))
print("RMSE=", np.sqrt(metrics.mean_squared_error(y_test, y_pred)))
print("MAE=", metrics.mean_absolute_error(y_test, y_pred))
%matplotlib inline
import matplotlib.pyplot as plt
plt.figure(figsize=(4,4))
plt.scatter(y_test, y_pred, alpha=0.2, c="blue")
plt.plot([y.min(), y.max()], [y.min(), y.max()], c="black")
plt.grid()
plt.xlabel("Real Y")
plt.ylabel("Predicted Y")
plt.show()
R2= 0.8162623239997147
RMSE= 3.115474987155132
MAE= 2.3005236909984426
From here on, we would like to compare the performance of various regression models as well, so let's prepare a variable for recording the evaluation metrics.
scores = []
I made the following function to compute the evaluation metrics of a regression model and store them in the recording variable. It repeats training and evaluation over different random data splits and records the average performance, its standard deviation, and the training time.
import timeit
import numpy as np
from sklearn import metrics
def record_regression_scores(regressor_name, regressor, iter=5):
records = []
run_id = 0
successful = 0
max_trial = 100
while successful < iter:
run_id += 1
if run_id >= max_trial:
break
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.4)
print('Run ', run_id)
seconds = timeit.timeit(lambda: regressor.fit(X_train, y_train), number=1)
print(' Learning Time (s):', seconds)
y_pred = regressor.predict(X_test)
r2_score = metrics.r2_score(y_test, y_pred)
rmse_score = np.sqrt(metrics.mean_squared_error(y_test, y_pred))
mae_score = metrics.mean_absolute_error(y_test, y_pred)
if r2_score < 0:
print("\t\t encountered negative r2_score")
continue
else:
successful += 1
records.append([regressor_name, r2_score, rmse_score, mae_score, seconds])
return records
Now, let's learn using the "regression model with the best parameters" created earlier and record the performance evaluation index.
%%time
scores += record_regression_scores("PLS", regressor)
Run 1
Learning Time (s): 2.0181297670001186
Run 2
Learning Time (s): 1.9526900320001914
Run 3
Learning Time (s): 1.9921050099997046
Run 4
Learning Time (s): 2.0573012720001316
Run 5
Learning Time (s): 1.979584856999736
CPU times: user 552 ms, sys: 101 ms, total: 653 ms
Wall time: 10 s
The average performance and its standard deviation are as follows.
df_scores = pd.DataFrame(scores, columns = ['Regressor', 'R2', 'RMSE', 'MAE', 'Time'])
df_scores_mean = df_scores.iloc[:, :-1].mean()
df_scores_errors = df_scores.iloc[:, :-1].std()
df_scores_mean.plot(kind='bar', grid=True, yerr=df_scores_errors)
<matplotlib.axes._subplots.AxesSubplot at 0x7ff0177f0550>
We used gradient boosting for classification earlier, but gradient boosting can also be used for regression.
#Import method to split into training data and test data
from sklearn.model_selection import train_test_split
#Randomly split into training and test data at a ratio of 6:4
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.4)
from sklearn.ensemble import GradientBoostingRegressor
regressor = GradientBoostingRegressor() #Gradient boosting
regressor.fit(X_train, y_train) #Learning
GradientBoostingRegressor(alpha=0.9, criterion='friedman_mse', init=None,
learning_rate=0.1, loss='ls', max_depth=3,
max_features=None, max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, n_estimators=100,
n_iter_no_change=None, presort='auto',
random_state=None, subsample=1.0, tol=0.0001,
validation_fraction=0.1, verbose=0, warm_start=False)
regressor.score(X_train, y_train), regressor.score(X_test, y_test)
(0.9996743754326906, 0.7386973055974495)
import numpy as np
import sklearn.metrics as metrics
y_pred = regressor.predict(X_test)
print("R2=", metrics.r2_score(y_test, y_pred))
print("RMSE=", np.sqrt(metrics.mean_squared_error(y_test, y_pred)))
print("MAE=", metrics.mean_absolute_error(y_test, y_pred))
%matplotlib inline
import matplotlib.pyplot as plt
plt.figure(figsize=(4,4))
plt.scatter(y_test, y_pred, alpha=0.2, c="blue")
plt.plot([y.min(), y.max()], [y.min(), y.max()], c="black")
plt.grid()
plt.xlabel("Real Y")
plt.ylabel("Predicted Y")
plt.show()
R2= 0.7386973055974495
RMSE= 4.012901982806575
MAE= 3.0486670616108
%%time
#Grid search to find the best parameters
from sklearn.model_selection import GridSearchCV
#Gradient boosting
from sklearn.ensemble import GradientBoostingRegressor
#Parameters for grid search
parameters = [{
'learning_rate':[0.1,0.2],
'n_estimators':[20,100],
'max_depth':[3,5]
}]
#Grid search execution
regressor = GridSearchCV(GradientBoostingRegressor(), parameters, cv=3, n_jobs=-1)
regressor.fit(X_train, y_train)
print("R2 (train): ", regressor.score(X_train, y_train))
print("R2 (test): ", regressor.score(X_test, y_test))
print(regressor.best_estimator_) #Regression model with best parameters
R2 (train): 0.9996743754326906
R2 (test): 0.7195388936429337
GradientBoostingRegressor(alpha=0.9, criterion='friedman_mse', init=None,
learning_rate=0.1, loss='ls', max_depth=3,
max_features=None, max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, n_estimators=100,
n_iter_no_change=None, presort='auto',
random_state=None, subsample=1.0, tol=0.0001,
validation_fraction=0.1, verbose=0, warm_start=False)
CPU times: user 130 ms, sys: 15.4 ms, total: 145 ms
Wall time: 2.26 s
import numpy as np
import sklearn.metrics as metrics
y_pred = regressor.predict(X_test)
print("R2=", metrics.r2_score(y_test, y_pred))
print("RMSE=", np.sqrt(metrics.mean_squared_error(y_test, y_pred)))
print("MAE=", metrics.mean_absolute_error(y_test, y_pred))
%matplotlib inline
import matplotlib.pyplot as plt
plt.figure(figsize=(4,4))
plt.scatter(y_test, y_pred, alpha=0.2, c="blue")
plt.plot([y.min(), y.max()], [y.min(), y.max()], c="black")
plt.grid()
plt.xlabel("Real Y")
plt.ylabel("Predicted Y")
plt.show()
R2= 0.7195388936429337
RMSE= 4.15741070004397
MAE= 3.0704656592653317
%%time
scores += record_regression_scores("GB", regressor.best_estimator_)
Run 1
Learning Time (s): 0.027196151000225655
Run 2
Learning Time (s): 0.01961764999987281
Run 3
Learning Time (s): 0.01894888400011041
Run 4
Learning Time (s): 0.019140249999964
Run 5
Learning Time (s): 0.020592135999777383
CPU times: user 123 ms, sys: 2.75 ms, total: 126 ms
Wall time: 128 ms
I made a function to visualize the performance comparison of multiple regression methods.
def visualize_regression_result(scores):
df_scores = pd.DataFrame(scores, columns =['Regressor', 'R2', 'RMSE', 'MAE', 'Time'])
df_scores_mean = df_scores.iloc[:, :-1].groupby('Regressor').mean()
df_scores_errors = df_scores.iloc[:, :-1].groupby('Regressor').std()
df_scores_mean.T.plot(kind='bar', grid=True, yerr=df_scores_errors.T,
figsize=(12, 2), legend=False)
#plt.yscale('log')
plt.legend(loc = 'right', bbox_to_anchor = (0.7, 0.5, 0.5, 0.0))
df_scores_mean.plot(kind='bar', grid=True, yerr=df_scores_errors,
figsize=(12, 2), legend=False)
#plt.yscale('log')
plt.legend(loc = 'right', bbox_to_anchor = (0.7, 0.5, 0.5, 0.0))
df_time_mean = df_scores.iloc[:, [0, -1]].groupby('Regressor').mean()
df_time_errors = df_scores.iloc[:, [0, -1]].groupby('Regressor').std()
df_time_mean.plot(kind='bar', grid=True, yerr=df_time_errors,
figsize=(12, 2), legend=False)
plt.yscale('log')
visualize_regression_result(scores)
We used the multi-layer perceptron for classification earlier, but regression with a multi-layer perceptron is also possible.
#Import method to split into training data and test data
from sklearn.model_selection import train_test_split
#Randomly split into training and test data at a ratio of 6:4
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.4)
from sklearn.neural_network import MLPRegressor
regressor = MLPRegressor() #Regressor generation
regressor.fit(X_train, y_train) #Learning
/usr/local/lib/python3.6/dist-packages/sklearn/neural_network/multilayer_perceptron.py:566: ConvergenceWarning: Stochastic Optimizer: Maximum iterations (200) reached and the optimization hasn't converged yet.
% self.max_iter, ConvergenceWarning)
MLPRegressor(activation='relu', alpha=0.0001, batch_size='auto', beta_1=0.9,
beta_2=0.999, early_stopping=False, epsilon=1e-08,
hidden_layer_sizes=(100,), learning_rate='constant',
learning_rate_init=0.001, max_iter=200, momentum=0.9,
n_iter_no_change=10, nesterovs_momentum=True, power_t=0.5,
random_state=None, shuffle=True, solver='adam', tol=0.0001,
validation_fraction=0.1, verbose=False, warm_start=False)
import numpy as np
import sklearn.metrics as metrics
y_pred = regressor.predict(X_test)
print("R2=", metrics.r2_score(y_test, y_pred))
print("RMSE=", np.sqrt(metrics.mean_squared_error(y_test, y_pred)))
print("MAE=", metrics.mean_absolute_error(y_test, y_pred))
%matplotlib inline
import matplotlib.pyplot as plt
plt.figure(figsize=(4,4))
plt.scatter(y_test, y_pred, alpha=0.2, c="blue")
plt.plot([y.min(), y.max()], [y.min(), y.max()], c="black")
plt.grid()
plt.xlabel("Real Y")
plt.ylabel("Predicted Y")
plt.show()
R2= -4.0467411800805415
RMSE= 19.21649631146132
MAE= 17.449687389239205
%%time
#Grid search to find the best parameters
from sklearn.model_selection import GridSearchCV
#Parameters for grid search
parameters = [{
'hidden_layer_sizes': [10, (10, 10)],
'solver': ['sgd', 'adam', 'lbfgs'],
#'solver': ['lbfgs'],
#'activation': ['logistic', 'tanh', 'relu']
'activation': ['relu']
}]
#Grid search execution
regressor = GridSearchCV(MLPRegressor(max_iter=10000, early_stopping=True),
parameters, cv=3, n_jobs=-1)
regressor.fit(X_train, y_train)
print("R2 (train): ", regressor.score(X_train, y_train))
print("R2 (test): ", regressor.score(X_test, y_test))
print(regressor.best_estimator_) #Regression model with best parameters
R2 (train): 0.9742637037080083
R2 (test): 0.9562295568855493
MLPRegressor(activation='relu', alpha=0.0001, batch_size='auto', beta_1=0.9,
beta_2=0.999, early_stopping=True, epsilon=1e-08,
hidden_layer_sizes=(10, 10), learning_rate='constant',
learning_rate_init=0.001, max_iter=10000, momentum=0.9,
n_iter_no_change=10, nesterovs_momentum=True, power_t=0.5,
random_state=None, shuffle=True, solver='lbfgs', tol=0.0001,
validation_fraction=0.1, verbose=False, warm_start=False)
CPU times: user 222 ms, sys: 17.5 ms, total: 239 ms
Wall time: 7.86 s
import numpy as np
import sklearn.metrics as metrics
y_pred = regressor.predict(X_test)
print("R2=", metrics.r2_score(y_test, y_pred))
print("RMSE=", np.sqrt(metrics.mean_squared_error(y_test, y_pred)))
print("MAE=", metrics.mean_absolute_error(y_test, y_pred))
%matplotlib inline
import matplotlib.pyplot as plt
plt.figure(figsize=(4,4))
plt.scatter(y_test, y_pred, alpha=0.2, c="blue")
plt.plot([y.min(), y.max()], [y.min(), y.max()], c="black")
plt.grid()
plt.xlabel("Real Y")
plt.ylabel("Predicted Y")
plt.show()
R2= 0.9562295568855493
RMSE= 1.789613149534058
MAE= 1.3873465536350154
%%time
scores += record_regression_scores("MLP", regressor.best_estimator_)
Run 1
Learning Time (s): 0.06779548599979535
Run 2
Learning Time (s): 0.1298420270004499
Run 3
Learning Time (s): 0.1824235089998183
Run 4
Learning Time (s): 0.43246253200004503
Run 5
Learning Time (s): 0.22879209799975797
CPU times: user 1.06 s, sys: 3.13 ms, total: 1.06 s
Wall time: 1.07 s
visualize_regression_result(scores)
scikit-learn provides the diabetes dataset as practice data for machine learning. Divide the dataset into explanatory variables and objective variables as follows, and while tuning the parameters with GridSearchCV, regress the diabetes data with MLPRegressor or GradientBoostingRegressor and compare their performance.
# https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_diabetes.html
from sklearn.datasets import load_diabetes
diabetes = load_diabetes()
X = diabetes.data
y = diabetes.target