Logistic Regression (for beginners) -Code Edition-

This time, I will summarize the implementation of logistic regression.

■ Logistic regression procedure

We will proceed with the following 6 steps.

1. Preparation of module
2. Data preparation
3. Data visualization
4. Creating a model
5. Predict classification
6. Model evaluation

1. Preparation of module

First, import the required modules.


import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

#Module for visualization
import seaborn as sns

#Module to read the dataset
from sklearn.datasets import load_iris

#Module for standardization (scaling to zero mean and unit variance)
from sklearn.preprocessing import StandardScaler

#Module that separates training data and test data
from sklearn.model_selection import train_test_split

#Module to perform logistic regression
from sklearn.linear_model import LogisticRegression

#Module to evaluate classification
from sklearn.metrics import classification_report

#Modules that handle confusion matrices
from sklearn.metrics import confusion_matrix

2. Data preparation

This time, we will use the iris dataset for binary classification.

First load the data, standardize it, and then split it into training data and test data.


#Loading iris dataset
iris = load_iris()

#Divide into objective variable and explanatory variable
X, y = iris.data[:100, [0, 2]], iris.target[:100]

#Standardization (scaling to zero mean and unit variance)
std = StandardScaler()
X = std.fit_transform(X)

#Divide into training data and test data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=123)

To perform binary classification, we use only the first 100 rows of the dataset (Setosa and Versicolor only). We have also narrowed the explanatory variables down to two (Sepal Length and Petal Length only) to make the data easier to plot.
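
As a quick check of this slicing, the sketch below prints which features and classes remain after selecting the first 100 rows and columns 0 and 2. The variable names selected_features and labels_present are introduced here for illustration.

```python
from sklearn.datasets import load_iris

iris = load_iris()

# Same slice as above: first 100 rows, columns 0 and 2
X, y = iris.data[:100, [0, 2]], iris.target[:100]

# Names of the two selected explanatory variables
selected_features = [iris.feature_names[i] for i in [0, 2]]
print(selected_features)

# Class labels present in the slice and their names
labels_present = sorted(set(y.tolist()))
print(labels_present, [str(iris.target_names[i]) for i in labels_present])

# Number of rows and columns
print(X.shape, y.shape)
```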

Standardization is needed because, for example, when one feature (explanatory variable) takes 2-digit values and another takes 4-digit values, the latter has a disproportionately large influence. Standardization aligns the scales by transforming every feature to have a mean of 0 and a variance of 1.
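
To make the effect of standardization concrete, here is a minimal sketch on toy data. The array X_toy is made up for illustration; the manual calculation is shown only to compare with StandardScaler.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy data (made up): a 2-digit feature and a 4-digit feature
X_toy = np.array([[10.0, 1000.0],
                  [20.0, 3000.0],
                  [30.0, 5000.0]])

# Manual standardization: subtract the column mean, divide by the column std
X_manual = (X_toy - X_toy.mean(axis=0)) / X_toy.std(axis=0)

# The same operation with StandardScaler
X_scaled = StandardScaler().fit_transform(X_toy)

# After scaling, each column has mean ~0 and standard deviation ~1
print(X_scaled.mean(axis=0))
print(X_scaled.std(axis=0))
```

After scaling, both columns are on the same scale, so neither dominates just because of its units.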

Setting random_state fixes the seed of the random number generator so that the data is split the same way every time.
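
The effect of random_state can be checked with a small sketch: splitting the same data twice with the same seed gives the same result. The arrays X_demo and y_demo are made up for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (made up): 10 samples with 2 features each
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.arange(10)

# Split twice with the same seed
a_train, a_test, _, _ = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=123)
b_train, b_test, _, _ = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=123)

# Both splits contain exactly the same rows
same_split = np.array_equal(a_train, b_train) and np.array_equal(a_test, b_test)
print(same_split)

# test_size=0.2 puts 20% of the samples in the test set
print(len(a_train), len(a_test))
```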

3. Data visualization

Let's plot the data before classification by logistic regression.


#Creating drawing objects and subplots
fig, ax = plt.subplots()

#Setosa plot
ax.scatter(X_train[y_train == 0, 0], X_train[y_train == 0, 1], 
           marker = 'o', label = 'Setosa')

#Versicolor plot
ax.scatter(X_train[y_train == 1, 0], X_train[y_train == 1, 1],
           marker = 'x', label = 'Versicolor')

#Axis label settings
ax.set_xlabel('Sepal Length')
ax.set_ylabel('Petal Length')

#Legend settings
ax.legend(loc = 'best')

plt.show()

Each scatter call selects the rows of one class (y_train == 0 for Setosa, y_train == 1 for Versicolor) and plots feature 0 (Sepal Length) on the horizontal axis against feature 1 (Petal Length) on the vertical axis.

Output result

4. Creating a model

Create an instance of the logistic regression model and fit it to the training data.


#Create an instance
logreg = LogisticRegression()

#Create a model from training data
logreg.fit(X_train, y_train)
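
After fitting, the learned parameters can be inspected on the instance. The following is a minimal sketch that repeats the data preparation from step 2 so it can run on its own; it only prints the attributes and does not change the model.

```python
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

# Same data preparation as in step 2
iris = load_iris()
X, y = iris.data[:100, [0, 2]], iris.target[:100]
X = StandardScaler().fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=123)

# Same model as in step 4
logreg = LogisticRegression()
logreg.fit(X_train, y_train)

# Class labels seen during training (0: Setosa, 1: Versicolor)
print(logreg.classes_)

# One weight per feature (Sepal Length, Petal Length) and one intercept
print(logreg.coef_)
print(logreg.intercept_)
```

For a binary problem, coef_ has one row with one weight per explanatory variable, and intercept_ holds a single value.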

5. Predict classification

Now that the model is complete, we first predict the probability of classification.

#Predict the probability of class 1 (Versicolor)
y_proba = logreg.predict_proba(X_test)[:, 1]
print(y_proba)

Output result


y_proba: [0.02210131 0.99309888 0.95032727 0.04834431 0.99302674 0.04389388
 0.10540851 0.99718459 0.90218405 0.03983599 0.08000775 0.99280579
 0.99721384 0.78408501 0.08947531 0.01793823 0.99798469 0.01793823
 0.99429762 0.9920454 ]

The probabilities are computed with the sigmoid function, which outputs a number in the range 0 to 1. Because we took the column for class 1, the closer a value is to 0, the higher the probability of Setosa, and the closer it is to 1, the higher the probability of Versicolor.

\sigma(z)=\frac{1}{1+\exp(-z)}
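
As a rough check of this formula, the sketch below applies a hand-written sigmoid to the model's linear output z and compares the result with predict_proba. The helper name sigmoid is introduced here for illustration, and the data preparation from the earlier steps is repeated so the block runs on its own.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

# Same data preparation and model as in steps 2 and 4
iris = load_iris()
X, y = iris.data[:100, [0, 2]], iris.target[:100]
X = StandardScaler().fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=123)
logreg = LogisticRegression().fit(X_train, y_train)


def sigmoid(z):
    """Hand-written sigmoid (illustrative helper, not part of scikit-learn)."""
    return 1.0 / (1.0 + np.exp(-z))


# z = w . x + b for each test sample
z = logreg.decision_function(X_test)

# Probability of class 1 computed by hand and by scikit-learn
p_manual = sigmoid(z)
p_sklearn = logreg.predict_proba(X_test)[:, 1]

print(np.allclose(p_manual, p_sklearn))
```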

Next, let's predict the result of classification.


#Predict classification results
y_pred = logreg.predict(X_test)
print(y_pred)

Output result


y_pred: [0 1 1 0 1 0 0 1 1 0 0 1 1 1 0 0 1 0 1 1]

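To obtain the class labels, the probability output by the sigmoid function earlier is compared with a threshold of 0.5. Values below 0.5 are classified as 0: Setosa, and values above 0.5 are classified as 1: Versicolor.

The cross entropy error function is not applied at prediction time. It is the loss that is minimized during training to find the weights w (scikit-learn also adds a regularization term by default).

L(w)=-\left[y\log(p(x,w))+(1-y)\log(1-p(x,w))\right]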
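
The relationship between the probabilities, the predicted labels, and the cross entropy error can be sketched as follows. This is an illustration under the assumption that labels come from a 0.5 threshold on the class 1 probability; the cross entropy is averaged over the test samples so that it can be compared with scikit-learn's log_loss. The data preparation is repeated so the block runs on its own.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss

# Same data preparation and model as in steps 2 and 4
iris = load_iris()
X, y = iris.data[:100, [0, 2]], iris.target[:100]
X = StandardScaler().fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=123)
logreg = LogisticRegression().fit(X_train, y_train)

# Probability of class 1 (Versicolor) for each test sample
p = logreg.predict_proba(X_test)[:, 1]

# Labels by thresholding the probability at 0.5
y_pred_manual = (p > 0.5).astype(int)
matches = np.array_equal(y_pred_manual, logreg.predict(X_test))
print(matches)

# Cross entropy averaged over the test samples, by hand and with log_loss
ce_manual = -np.mean(y_test * np.log(p) + (1 - y_test) * np.log(1 - p))
ce_sklearn = log_loss(y_test, p)
print(ce_manual, ce_sklearn)
```

A smaller cross entropy means the predicted probabilities are closer to the true labels.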

6. Model evaluation

Since this is a classification problem (binary classification), we will evaluate the model using a confusion matrix.


#Create a confusion matrix
classes = [1, 0]
cm = confusion_matrix(y_test, y_pred, labels=classes)

#Convert to a DataFrame (rows: actual class, columns: predicted class)
cmdf = pd.DataFrame(cm, index=classes, columns=classes)

#Plot the confusion matrix
sns.heatmap(cmdf, annot=True)
plt.show()

Output result

Next, compute the numerical values of the evaluation metrics.


#Output precision, recall, and F1 score
print(classification_report(y_test, y_pred))

Output result

With the above, we were able to evaluate how well the model classifies Setosa and Versicolor.
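
To show how the metrics in the report relate to the confusion matrix, here is a small sketch. The labels y_true and y_hat are made up for illustration and are not the results of the model above.

```python
import numpy as np
from sklearn.metrics import (confusion_matrix, precision_score,
                             recall_score, f1_score)

# Made-up labels for illustration only (not the article's results)
y_true = np.array([1, 1, 1, 0, 0, 0, 1, 0])
y_hat = np.array([1, 0, 1, 0, 1, 0, 1, 0])

# With labels=[1, 0], row 0 is actual class 1 and column 0 is predicted class 1
cm = confusion_matrix(y_true, y_hat, labels=[1, 0])
tp, fn = cm[0]
fp, tn = cm[1]

# Metrics for class 1 computed by hand from the confusion matrix
precision = tp / (tp + fp)   # of the samples predicted as 1, how many are 1
recall = tp / (tp + fn)      # of the samples that are 1, how many were found
f1 = 2 * precision * recall / (precision + recall)

print(precision, recall, f1)

# The same values from scikit-learn (positive class is 1 by default)
print(precision_score(y_true, y_hat),
      recall_score(y_true, y_hat),
      f1_score(y_true, y_hat))
```

classification_report prints these three values for each class, together with the number of samples (support).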

■ Finally

With logistic regression, we create and evaluate a model by following steps 1 to 6 above.

This time, for beginners, I have summarized only the implementation (code). When I get the chance, I would like to write an article about the theory (mathematical formulas).

Thank you for reading.