In this article, I summarize how to implement logistic regression.
We will proceed with the following 6 steps.
First, import the required modules.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
#Module for visualization
import seaborn as sns
#Module to read the dataset
from sklearn.datasets import load_iris
#Module for standardization (zero mean, unit variance)
from sklearn.preprocessing import StandardScaler
#Module that separates training data and test data
from sklearn.model_selection import train_test_split
#Module to perform logistic regression
from sklearn.linear_model import LogisticRegression
#Module to evaluate classification
from sklearn.metrics import classification_report
#Modules that handle confusion matrices
from sklearn.metrics import confusion_matrix
This time, we will use the iris dataset for binary classification.
First get the data, standardize it, and then split it.
#Loading iris dataset
iris = load_iris()
#Divide into objective variable and explanatory variable
X, y = iris.data[:100, [0, 2]], iris.target[:100]
#Standardization (zero mean, unit variance)
std = StandardScaler()
X = std.fit_transform(X)
#Divide into training data and test data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=123)
To make this a binary classification problem, we use only the first 100 rows of the dataset (Setosa and Versicolor). We also restrict the explanatory variables to two, Sepal Length and Petal Length, so that the data is easy to plot.
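As a quick check of the subsetting described above, the following sketch inspects which classes appear in the first 100 rows and which feature columns indices 0 and 2 refer to. The variable names (subset_labels, subset_class_names, selected_features) are my own and are not part of the original code.

```python
from sklearn.datasets import load_iris

iris = load_iris()

# Class labels present in the first 100 rows
subset_labels = sorted(set(iris.target[:100].tolist()))
subset_class_names = [str(iris.target_names[i]) for i in subset_labels]

# Names of the two selected feature columns (indices 0 and 2)
selected_features = [iris.feature_names[i] for i in (0, 2)]

print(subset_labels)
print(subset_class_names)
print(selected_features)
```

If the third class (Virginica) showed up in subset_labels, the slice would no longer describe a binary problem.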
Standardization matters when features differ in scale. For example, if one feature takes 2-digit values and another takes 4-digit values, the latter can have a disproportionate influence on the model. Setting the mean of every feature to 0 and its variance to 1 puts all features on the same scale.
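To make the standardization step concrete, here is a minimal sketch that computes it by hand with NumPy and compares the result with StandardScaler. The names X_raw, X_manual, and X_scaled are introduced only for this illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

X_raw = load_iris().data[:100, [0, 2]]

# Standardize by hand: subtract each column's mean,
# then divide by that column's (population) standard deviation
X_manual = (X_raw - X_raw.mean(axis=0)) / X_raw.std(axis=0)

# The same operation with scikit-learn
X_scaled = StandardScaler().fit_transform(X_raw)

matches = bool(np.allclose(X_manual, X_scaled))
print(matches)
print(X_scaled.mean(axis=0).round(6), X_scaled.std(axis=0).round(6))
```

After the transform, each column should have a mean of roughly 0 and a standard deviation of roughly 1, up to floating-point error.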
The random_state argument fixes the seed so that the data is split the same way on every run.
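The effect of random_state can be checked with a small sketch like the one below, which splits the same data twice with the same seed and compares the results. The suffixes _a and _b are just labels for the two runs.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()
X, y = iris.data[:100, [0, 2]], iris.target[:100]

# Two independent splits that use the same seed
_, X_test_a, _, y_test_a = train_test_split(X, y, test_size=0.2, random_state=123)
_, X_test_b, _, y_test_b = train_test_split(X, y, test_size=0.2, random_state=123)

same_split = bool(np.array_equal(X_test_a, X_test_b)
                  and np.array_equal(y_test_a, y_test_b))
print(same_split)
print(X_test_a.shape)
```

With test_size=0.2 and 100 rows, the test set holds 20 samples. As a side note, a common alternative to standardizing before the split is to fit the scaler on the training data only and then apply it to the test data, so that test statistics do not influence preprocessing.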
Let's plot the data before classification by logistic regression.
#Creating drawing objects and subplots
fig, ax = plt.subplots()
#Setosa plot
ax.scatter(X_train[y_train == 0, 0], X_train[y_train == 0, 1],
           marker='o', label='Setosa')
#Versicolor plot
ax.scatter(X_train[y_train == 1, 0], X_train[y_train == 1, 1],
           marker='x', label='Versicolor')
#Axis label settings
ax.set_xlabel('Sepal Length')
ax.set_ylabel('Petal Length')
#Legend settings
ax.legend(loc = 'best')
plt.show()
The first scatter call plots the training samples labeled Setosa (y_train == 0), and the second plots those labeled Versicolor (y_train == 1). In both, feature 0 (Sepal Length) is on the horizontal axis and feature 1 (Petal Length) is on the vertical axis.
Output result
Create a LogisticRegression instance and fit it to the training data.
#Create an instance
logreg = LogisticRegression()
#Create a model from training data
logreg.fit(X_train, y_train)
#Predict the probability of class 1 (Versicolor)
y_proba = logreg.predict_proba(X_test)[:, 1]
print(y_proba)
Output result
y_proba: [0.02210131 0.99309888 0.95032727 0.04834431 0.99302674 0.04389388
0.10540851 0.99718459 0.90218405 0.03983599 0.08000775 0.99280579
0.99721384 0.78408501 0.08947531 0.01793823 0.99798469 0.01793823
0.99429762 0.9920454 ]
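These probabilities are the output of the sigmoid function, which always returns a number between 0 and 1. Because we took column 1 of predict_proba, each value is the probability of Versicolor: the closer it is to 0, the more likely the sample is Setosa, and the closer it is to 1, the more likely it is Versicolor.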
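To show where these probabilities come from, the following sketch recomputes them by hand: it forms the linear score z = w . x + b from the fitted coefficients and passes it through a sigmoid. The sigmoid helper and the names z, p_manual, and p_sklearn are my own. The setup repeats the preprocessing from earlier so that the block runs on its own.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

iris = load_iris()
X, y = iris.data[:100, [0, 2]], iris.target[:100]
X = StandardScaler().fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=123)

logreg = LogisticRegression().fit(X_train, y_train)


def sigmoid(z):
    """Map any real number to the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))


# Linear score: weighted sum of the two features plus the intercept
z = X_test @ logreg.coef_.ravel() + logreg.intercept_[0]

p_manual = sigmoid(z)
p_sklearn = logreg.predict_proba(X_test)[:, 1]

probabilities_match = bool(np.allclose(p_manual, p_sklearn))
print(probabilities_match)
```

For a two-class problem like this one, the hand-computed values should agree with column 1 of predict_proba.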
Next, let's predict the result of classification.
#Predict classification results
y_pred = logreg.predict(X_test)
print(y_pred)
Output result
y_pred: [0 1 1 0 1 0 0 1 1 0 0 1 1 1 0 0 1 0 1 1]
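The predict method converts the probability output by the sigmoid function into a class label by thresholding at 0.5: values below 0.5 are classified as 0 (Setosa), and values above 0.5 are classified as 1 (Versicolor). The cross-entropy error function is the loss minimized during training. It is not applied at prediction time.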
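The relationship between the predicted probabilities and the predicted labels can be checked with a sketch like this one, which thresholds the probabilities at 0.5 by hand and compares the result with predict. The name y_thresholded is hypothetical. The setup again repeats the earlier preprocessing so that the block is self-contained.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

iris = load_iris()
X, y = iris.data[:100, [0, 2]], iris.target[:100]
X = StandardScaler().fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=123)

logreg = LogisticRegression().fit(X_train, y_train)

y_proba = logreg.predict_proba(X_test)[:, 1]
y_pred = logreg.predict(X_test)

# Label 1 (Versicolor) when the probability exceeds 0.5, otherwise 0 (Setosa)
y_thresholded = (y_proba > 0.5).astype(int)

labels_match = bool(np.array_equal(y_pred, y_thresholded))
print(labels_match)
```

If a different trade-off between the two classes were needed, the threshold could be changed in the hand-written version. The predict method itself always uses 0.5.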
Since this is a binary classification task, we evaluate the model with a confusion matrix.
#Create a confusion matrix
classes = [1, 0]
cm = confusion_matrix(y_test, y_pred, labels=classes)
#Convert to a data frame labeled by class (rows: true, columns: predicted)
cmdf = pd.DataFrame(cm, index=classes, columns=classes)
#Plot the confusion matrix
sns.heatmap(cmdf, annot=True)
Output result
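To make the layout of the confusion matrix easier to read, here is a small sketch on made-up labels (y_true_demo and y_pred_demo are illustrative values, not results from the model above). With labels=[1, 0], the rows are the true classes and the columns are the predicted classes, both in the order 1 then 0.

```python
from sklearn.metrics import confusion_matrix

# Hand-made example: five samples, illustrative values only
y_true_demo = [1, 1, 1, 0, 0]
y_pred_demo = [1, 1, 0, 0, 1]

cm_demo = confusion_matrix(y_true_demo, y_pred_demo, labels=[1, 0])

# Row 0: samples that are truly class 1
tp, fn = cm_demo[0]
# Row 1: samples that are truly class 0
fp, tn = cm_demo[1]

print(cm_demo)
print(tp, fn, fp, tn)
```

Here class 1 is treated as the positive class, so the top-left cell counts true positives and the bottom-right cell counts true negatives.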
Next, compute the evaluation metrics.
#Output precision, recall, and F1 score
print(classification_report(y_test, y_pred))
Output result
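The numbers in the classification report can be reproduced from the counts in the confusion matrix. The sketch below does this for class 1 on made-up labels (illustrative values, not results from the model above) and compares the hand calculation with scikit-learn's precision_score, recall_score, and f1_score.

```python
from sklearn.metrics import f1_score, precision_score, recall_score

# Hand-made example: five samples, illustrative values only
y_true_demo = [1, 1, 1, 0, 0]
y_pred_demo = [1, 1, 0, 0, 1]

pairs = list(zip(y_true_demo, y_pred_demo))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

# Precision: of the samples predicted as 1, how many are truly 1
precision_manual = tp / (tp + fp)
# Recall: of the samples that are truly 1, how many were predicted as 1
recall_manual = tp / (tp + fn)
# F1: harmonic mean of precision and recall
f1_manual = (2 * precision_manual * recall_manual
             / (precision_manual + recall_manual))

precision = precision_score(y_true_demo, y_pred_demo)
recall = recall_score(y_true_demo, y_pred_demo)
f1 = f1_score(y_true_demo, y_pred_demo)

print(precision_manual, recall_manual, f1_manual)
print(precision, recall, f1)
```

By default these functions score class 1 as the positive class, which corresponds to the row for class 1 in the classification report.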
With this, we have evaluated how well the model classifies Setosa and Versicolor.
With logistic regression, you create and evaluate a model by following steps 1 to 6 above.
This article is aimed at beginners, so I covered only the implementation (code). When I find the time, I would like to write an article about the theory (the math).
Thank you for reading.