[Machine learning] Understanding logistic regression from both scikit-learn and mathematics

1. Purpose

Anyone who wants to try machine learning can implement it fairly easily with scikit-learn and similar libraries. However, if you want to get results at work or raise your own skill level, an explanation like "I don't know what is going on behind the scenes, but this is the result I got" is clearly too weak.

The purpose of this article is to "set the theory aside and try scikit-learn first" in sections 2 and 3, and then to "understand the background from the mathematics" from section 4 onward.

2. What is logistic regression?

Logistic regression is a type of statistical regression model for variables that follow the Bernoulli distribution. Source: [Wikipedia](https://ja.wikipedia.org/wiki/logistic regression)

That definition is hard to digest, so put simply: logistic regression is used for **"predicting the probability that a certain event will occur" or "classifying based on that probability"**. In other words, it is the method to use when you want to classify with machine learning or when you want to predict a probability.

(I was very surprised to be able to "predict probabilities" while studying machine learning.)

◆ About sigmoid function (logistic function)

So how does logistic regression perform classification and probability prediction? The details are left to the mathematics chapter, but "predicting the probability that a certain event will occur" means that when you feed the necessary information into the "sigmoid function" below, it returns the probability that the event (call it A) occurs. If that probability is 50% or more, the sample is classified as A; if it is less than 50%, it is classified as not A.

[Sigmoid function] キャプチャ6.PNG

By the way, the sigmoid function is defined as follows.

y = \frac{1}{1 + e^{-(a_1x_1 + a_2x_2 + ... + b)}}

The $y$ that comes out of this calculation represents the probability that the event occurs, and logistic regression is what calculates this probability.

For example, if a sample lands at the red dot on the sigmoid curve above, the probability of occurrence is predicted to be 40%. Since that is less than 50%, the sample is classified as "event A does not occur".
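As a minimal sketch of the calculation described above, the following evaluates the sigmoid function for one sample and applies the 50% threshold. The parameter values `a1`, `a2`, `b` and the inputs `x1`, `x2` are made up for illustration and do not come from a fitted model; they were simply chosen so that the probability lands near 40%.

```python
import math

def sigmoid(z):
    """Logistic (sigmoid) function: maps any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical parameters and inputs, chosen only for illustration
a1, a2, b = 0.1, -0.1, 0.0
x1, x2 = 50, 54

z = a1 * x1 + a2 * x2 + b   # the exponent inside the sigmoid function
prob = sigmoid(z)           # probability that event A occurs
is_a = prob >= 0.5          # classify as A when the probability is 50% or more

print(f"probability = {prob:.3f}, classified as A: {is_a}")
```

With these assumed numbers the probability is below 50%, so the sample is classified as not A, which mirrors the red-dot example.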

◆ What is classification?

As for the "classification" mentioned above: machine learning is mainly used for either "regression" (predicting a numerical value) or "classification". As the name suggests, classification is used when you want to sort samples into categories such as "A or B".

◆ Specific example

Suppose you have, for 15 students, the overall average scores in Japanese and mathematics from junior high through high school, together with data on whether each student went on to the humanities or the sciences.

キャプチャ1.PNG

Based on this data, I would like to take another student's Japanese and mathematics scores and predict whether that student will go on to the humanities or the sciences.

The distribution of the 15 students' overall average scores in Japanese and mathematics is as follows.

キャプチャ2.PNG

There appears to be a boundary between the blue dots (humanities) and the orange dots (science).

Next, let's perform logistic regression with scikit-learn and build a model that classifies students into humanities and sciences.

3. Logistic regression with scikit-learn

(1) Import of required libraries

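Import the following, which are required to perform logistic regression and to run the rest of the code in this article.

import numpy as np, pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import log_loss, accuracy_score, confusion_matrix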

(2) Data preparation

Set up the Japanese and mathematics scores and the track (True for humanities, False for sciences) as data, as shown below.

data = pd.DataFrame({
        "bunri":[False,True,False,True,True,False,True,False,True,False,False,True,False,False,False],
        "Japanese_score":[45, 60, 52, 70, 85, 31, 90, 55, 75, 30, 42, 65, 38, 55, 60],
        "Math_score":[75, 50, 80, 35, 40, 65, 42, 90, 35, 90, 80, 35, 88, 80, 90],
    })

(3) Try plotting the data (important)

Plot the Japanese and mathematics scores together with each student's track (humanities or science). To get a feel for the data's characteristics, do not jump straight into scikit-learn; plot any data first.

y = data["bunri"].values
x1, x2 = data["Japanese_score"].values, data["Math_score"].values 
#Plot the data
plt.grid(which='major',color='black',linestyle=':')
plt.grid(which='minor',color='black',linestyle=':')
plt.plot(x1[y], x2[y], 'o', color='C0', label='bunkei')#Blue dot: y is True(=Humanities)
plt.plot(x1[~y], x2[~y], '^', color='C1', label='rikei')#Orange dot: y is False(=Science)
plt.xlabel("Japanese_score")
plt.ylabel("Math_score")
plt.legend(loc='best')
plt.show()
キャプチャ3.PNG

It looks like blue (humanities) and orange (science) can be clearly separated. (In the real world, data is unlikely to divide this cleanly.) Let's build a logistic regression model.

(4) Model construction

(i) Data shaping

First, we arrange the data into the shape needed to build the model.

y = data["bunri"].values#Same as in the plotting code above, so this line can be omitted.
X = data[["Japanese_score", "Math_score"]].values

Since this is not an article on Python syntax, I will skip the details, but this puts X and y into the form that scikit-learn expects for logistic regression.

(ii) Model construction

Finally, here is the model-building code.

#loss='log_loss' selects logistic regression (scikit-learn versions before 1.1 call it 'log'); penalty=None turns off regularization
clf = SGDClassifier(loss='log_loss', penalty=None, max_iter=10000, fit_intercept=True, random_state=1234, tol=1e-3)
clf.fit(X, y)

That's all it takes for a simple model. The first line creates a logistic regression model in a variable called clf, and the next line fits (= trains) clf on the prepared X and y.

(iii) Try to output the parameters

The word "parameter" appeared suddenly, but it refers to $a$ and $b$ in $y = \frac{1}{1 + e^{-(a_1x_1 + a_2x_2 + ... + b)}}$, the sigmoid function introduced at the beginning. In this example there are two explanatory variables, the Japanese score and the mathematics score, so the model can be written as $y = \frac{1}{1 + e^{-(a_1x_1 + a_2x_2 + b)}}$, and $a_1$, $a_2$, and $b$ can be obtained from scikit-learn as shown below.

#Get and display the weights
b = clf.intercept_[0]
a1 = clf.coef_[0, 0]
a2 = clf.coef_[0, 1]
print("b = {:.3f}, a1 = {:.3f}, a2 = {:.3f}".format(b, a1, a2))

This displays b = 4.950, a1 = 446.180, a2 = -400.540, so the fitted sigmoid function is $y = \frac{1}{1 + e^{-(446.180x_1 + (-400.540)x_2 + 4.950)}}$.

(5) Illustrate the constructed model

Now let's illustrate this boundary in the scatter plot above.

y = data["bunri"].values
x1, x2 = data["Japanese_score"].values, data["Math_score"].values 
#Plot the data
plt.grid(which='major',color='black',linestyle=':')
plt.grid(which='minor',color='black',linestyle=':')
plt.plot(x1[y], x2[y], 'o', color='C0', label='bunkei')#Blue dot: y is True(=Humanities)
plt.plot(x1[~y], x2[~y], '^', color='C1', label='rikei')#Orange dot: y is False(=Science)
plt.xlabel("Japanese_score")
plt.ylabel("Math_score")
plt.legend(loc='best')

#Plot and display the decision boundary
#Purple: boundary line (the points where a1*x1 + a2*x2 + b = 0, i.e. y = 0.5)
line_x = np.arange(np.min(x1) - 1, np.max(x1) + 1)
line_y = - line_x * a1 / a2 - b / a2
plt.plot(line_x, line_y, linestyle='-.', linewidth=3, color='purple', label='kyoukai')
plt.ylim([np.min(x2) - 1, np.max(x2) + 1])
plt.legend(loc='best')
plt.show()
キャプチャ4.PNG

In this way, stay aware of what scikit-learn is actually doing and how each step connects to the figure.
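The slope and intercept of the purple line follow from the 50% threshold. As a short derivation, assuming $a_2 \neq 0$, the boundary is the set of points where the sigmoid function equals exactly one half:

```latex
\begin{aligned}
\frac{1}{1 + e^{-(a_1x_1 + a_2x_2 + b)}} = \frac{1}{2}
&\iff e^{-(a_1x_1 + a_2x_2 + b)} = 1 \\
&\iff a_1x_1 + a_2x_2 + b = 0 \\
&\iff x_2 = -\frac{a_1}{a_2}x_1 - \frac{b}{a_2}
\end{aligned}
```

Points on one side of this line get a probability above 50%, and points on the other side get a probability below 50%.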

(6) In the real world ...

Building a model is not the end goal. In the real world, you need to use the model to predict the track (humanities or science) of students it has not seen. Suppose you obtained the scores of five more students and wrote them down. Store them in a variable called z as shown below.

z = pd.DataFrame({
        "Japanese_score":[80, 50, 65, 40, 75],
        "Math_score":[50, 70, 55, 50, 40],
    })
z2 = z[["Japanese_score", "Math_score"]].values

What we want to do is apply the logistic regression model obtained with scikit-learn to these new students' data and predict each student's track.

y_est = clf.predict(z2)

Displaying y_est then shows the result "([True, False, True, False, True])". In other words, the first student, with 80 points in Japanese and 50 points in mathematics, is predicted to be a humanities student.

This achieves the goal of predicting a student's track from their Japanese and mathematics scores.

Next, let's display the "probability of being a humanities student" mentioned at the beginning.

clf.predict_proba(z2)

Written this way, the output has two columns: the first is the probability of not being in the humanities (False) and the second is the probability of being in the humanities (True). However, this example is so clear-cut that the result comes out as below, with every probability at either 0% or 100%.

[0., 1.], [1., 0.], [0., 1.], [1., 0.], [0., 1.]
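To see where these 0 and 1 values come from, here is a small sketch that recomputes the humanities probability by hand from the parameters reported in step (4) (iii). It assumes the model's probability is simply the sigmoid function applied to $a_1x_1 + a_2x_2 + b$; the helper `sigmoid` is written here for illustration and is not part of scikit-learn.

```python
import math

def sigmoid(z):
    """Numerically stable sigmoid: avoids overflow in exp() when |z| is large."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

# Parameters reported in step (4) (iii) of this article
b, a1, a2 = 4.950, 446.180, -400.540

# The five new students: (Japanese_score, Math_score)
students = [(80, 50), (50, 70), (65, 55), (40, 50), (75, 40)]

probs = [sigmoid(a1 * x1 + a2 * x2 + b) for x1, x2 in students]
labels = [p >= 0.5 for p in probs]   # True = humanities

for (x1, x2), p, label in zip(students, probs, labels):
    print(f"Japanese={x1}, Math={x2}: P(humanities)={p:.3f}, humanities={label}")
```

Because the coefficients are in the hundreds, the exponent is in the thousands for every student, so the sigmoid function saturates and each probability rounds to 0 or 1.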

4. Understanding Logistic Regression from Mathematics

Up to section 3, we used scikit-learn to build a logistic regression model, plot it, and predict the track of five more students. Here, I would like to clarify how the logistic regression model in that flow is computed mathematically.

(1) Prerequisite knowledge

Use the following rule for the logarithm of a product: $\log_a MN = \log_a M + \log_a N$

(2) What is a sigmoid function?

As mentioned in section 2 under "◆ About sigmoid function (logistic function)", the sigmoid function is a function for expressing the occurrence of an event as a probability, and has the following shape.

キャプチャ5.PNG

Also, this blue function is defined as follows.

y = \frac{1}{1 + e^{-(a_1x_1 + a_2x_2 + ... + b)}}

The above $a_1$, $a_2$, and $b$ are the so-called parameters; they play the same role as $a$ and $b$ in the linear function $y = ax + b$. $x_1$ and $x_2$ are the so-called explanatory variables. In this case, the Japanese score is $x_1$ and the mathematics score is $x_2$.

Once $a_1$, $a_2$, and $b$ are set to **good values**, you can plug a new student's Japanese and mathematics scores in as $x_1$ and $x_2$, and the probability of being a humanities student (or of being a science student) comes out as $y$.

In other words, **logistic regression in machine learning computes the parameters $a$ and $b$ in order to determine the sigmoid function**.

This is hard to picture from sentences alone, so from here on let's work through it with concrete numbers.

(3) Mathematical understanding

(i) Apply the sigmoid function to the data one row at a time

The table below lays out the "data" set up in "Data preparation" in tabular form, with the sigmoid function for each student in the rightmost column. Since we want the "probability of being a humanities student" this time, humanities is set to 1 and science to 0 (conversely, if you want the probability of being a science student, set science to 1 and humanities to 0).

| Student | Japanese score | Math score | Track (0: science, 1: humanities) | Sigmoid function |
|---|---|---|---|---|
| 1st | 45 | 75 | 0 | $\frac{1}{1 + e^{-(45a_1 + 75a_2 + b)}}$ |
| 2nd | 60 | 50 | 1 | $\frac{1}{1 + e^{-(60a_1 + 50a_2 + b)}}$ |
| ... | ... | ... | ... | ... |
| 15th | 60 | 90 | 0 | $\frac{1}{1 + e^{-(60a_1 + 90a_2 + b)}}$ |

(ii) Find the maximum likelihood estimator

Now, how do we find the parameters $a$ ($a_1$ and $a_2$ in this example) and $b$? The bottom line: **for each of the 15 students, take the probability the model assigns to that student's actual track (the humanities probability for humanities students, one minus it for science students), multiply these together, and find the $a_1$, $a_2$, and $b$ that maximize the product**.

The values found this way are called the maximum likelihood estimator.

◆ What is the maximum likelihood estimator?

In Japanese it is read "saiyū suiteiryō" and means the "most likely" estimate. It sounds complicated, but you can think of it as "the best-fitting numbers".

◆ Let's multiply

Multiplying these probabilities from the 1st student to the 15th student gives the following (call it $L$). Science students (label 0) contribute $1 - y$, and humanities students (label 1) contribute $y$.

L = \left[1 - \frac{1}{1 + e^{-(45a_1 + 75a_2 + b)}}\right] \times \left[\frac{1}{1 + e^{-(60a_1 + 50a_2 + b)}}\right] \times \cdots \\
\times \left[1 - \frac{1}{1 + e^{-(60a_1 + 90a_2 + b)}}\right]
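As a sketch of what $L$ looks like in code, the following evaluates the product for the 15 students under two candidate parameter sets. The function name `likelihood` and both parameter sets are hypothetical, chosen only to show that different parameters give different values of $L$.

```python
import numpy as np

# The 15 students from "(2) Data preparation" (1 = humanities, 0 = science)
label = np.array([0, 1, 0, 1, 1, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0])
japanese = np.array([45, 60, 52, 70, 85, 31, 90, 55, 75, 30, 42, 65, 38, 55, 60])
math_score = np.array([75, 50, 80, 35, 40, 65, 42, 90, 35, 90, 80, 35, 88, 80, 90])

def likelihood(a1, a2, b):
    """Product over all students of the probability assigned to the actual label."""
    y = 1.0 / (1.0 + np.exp(-(a1 * japanese + a2 * math_score + b)))
    per_student = np.where(label == 1, y, 1.0 - y)   # y for humanities, 1 - y for science
    return per_student.prod()

# Two hypothetical parameter sets, for comparison only
L_flat = likelihood(0.0, 0.0, 0.0)      # every probability is exactly 0.5
L_better = likelihood(0.1, -0.1, 0.0)   # favors humanities when Japanese > math

print(L_flat, L_better)
```

The second parameter set gives a larger $L$ because it assigns a high probability to each student's actual track; maximum likelihood estimation searches for the parameters that push this value as high as possible.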

◆ Convert to logarithms to find the maximum of L

As you may have gathered, $L$ is a product over 15 students. A product like this becomes very hard to compute when the data covers millions of people, so we take the logarithm, which turns the product into a sum.

\log L = \log\left[1 - \frac{1}{1 + e^{-(45a_1 + 75a_2 + b)}}\right] + \log\left[\frac{1}{1 + e^{-(60a_1 + 50a_2 + b)}}\right] + \cdots \\
+ \log\left[1 - \frac{1}{1 + e^{-(60a_1 + 90a_2 + b)}}\right]
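One practical reason for taking the logarithm can be shown with a small sketch. The data here is hypothetical: 2,000 samples, each given a probability of 0.5, standing in for a larger dataset.

```python
import numpy as np

# Hypothetical case: 2,000 samples, each assigned probability 0.5 by the model
probs = np.full(2000, 0.5)

L = probs.prod()              # the direct product underflows to 0.0 in floating point
log_L = np.log(probs).sum()   # the sum of logarithms stays finite

print(L)      # 0.0
print(log_L)  # about -1386.29
```

Because the logarithm is monotonically increasing, the parameters that maximize $\log L$ are the same ones that maximize $L$, so nothing is lost by working with the sum.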

◆ Find the parameters

The parameters $a_1$, $a_2$, and $b$ that maximize $\log L$ cannot be found analytically (= by hand calculation). The SGDClassifier used in section 3 relies on **"stochastic gradient descent"** to compute the optimal parameters numerically.

So it is fine to leave the actual calculation to scikit-learn, as long as you understand that this is the theory working behind it.
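To give a feel for what "computing the parameters numerically" means, here is a minimal sketch of plain batch gradient ascent on $\log L$. This is a simplified relative of stochastic gradient descent and is not scikit-learn's actual implementation. The standardization of the scores, the step size, and the number of iterations are all assumptions made for this sketch, so the resulting parameters are on a different scale from the ones reported in section 3.

```python
import numpy as np

# The 15 students from "(2) Data preparation" (1 = humanities, 0 = science)
label = np.array([0, 1, 0, 1, 1, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0], dtype=float)
scores = np.array([[45, 75], [60, 50], [52, 80], [70, 35], [85, 40],
                   [31, 65], [90, 42], [55, 90], [75, 35], [30, 90],
                   [42, 80], [65, 35], [38, 88], [55, 80], [60, 90]], dtype=float)

# Assumption: standardize each column so that one step size suits both features
X = (scores - scores.mean(axis=0)) / scores.std(axis=0)

def log_likelihood(w, b):
    """log L, using log(sigmoid(t)) = -log(1 + exp(-t)) for numerical stability."""
    s = X @ w + b
    signed = np.where(label == 1, s, -s)
    return -np.logaddexp(0.0, -signed).sum()

w, b = np.zeros(2), 0.0
initial = log_likelihood(w, b)
step = 0.5                        # hypothetical learning rate
for _ in range(5000):
    y = 1.0 / (1.0 + np.exp(-(X @ w + b)))
    error = label - y             # gradient of log L with respect to each exponent
    w += step * (X.T @ error) / len(label)
    b += step * error.mean()

final = log_likelihood(w, b)
predicted = (1.0 / (1.0 + np.exp(-(X @ w + b)))) >= 0.5
print(initial, final, predicted.astype(int))
```

Each pass nudges the parameters in the direction that increases $\log L$. Stochastic gradient descent follows the same idea but updates the parameters from one sample (or a small batch) at a time.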

In section 3, step (4) (iii), the parameters came out as $b$ = 4.950, $a_1$ = 446.180, $a_2$ = -400.540, so **the sigmoid function we were looking for is $y = \frac{1}{1 + e^{-(446.180x_1 + (-400.540)x_2 + 4.950)}}$**.

(iii) Summary so far

If you plug a new student's Japanese score ($x_1$) and mathematics score ($x_2$) into this sigmoid function ($y = \frac{1}{1 + e^{-(446.180x_1 + (-400.540)x_2 + 4.950)}}$), the probability of being a humanities student is calculated. If that probability is 0.5 or greater, the student is classified as humanities; if it is less than 0.5, the student is classified as science.

(iv) Going a little further

**◆ Why does maximizing L give "good" parameters?**

We found the parameters $a$ and $b$ that maximize $L$ and $\log L$, but why does maximizing $L$ and $\log L$ give the optimal parameters?

The following picture may help. Suppose you have only three data points at hand, and from those three points you try to sketch the normal distribution (the blue curve) that they came from. Which of the two blue curves below is more likely to be the underlying normal distribution?

キャプチャ7.PNG

Clearly, the curve on the left is the more plausible one. In the distribution on the right, the peak, where values should occur most frequently, has no data at hand near it. This is an intuitive argument, but the curve on the left is also the more plausible one mathematically.

The figure below is the same pair of normal distributions with probabilities added. The values are only rough, but each one is the probability with which the red dot at hand would occur if it followed that normal distribution.

キャプチャ8.PNG

Multiplying the probabilities for the distribution on the left (this product plays the same role as $L$) gives 0.14 x 0.28 x 0.38 = 0.014896. Doing the same for the right gives 0.01 x 0.03 x 0.09 = 0.000027.

In this way, the larger the product of the probabilities, the better the curve represents the original distribution.

Therefore, we look for the parameters $a_1$, $a_2$, and $b$ that make $L$, the product of the probabilities, and $\log L$, its logarithm, as large as possible.
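The comparison above can be reproduced in a few lines. The first part simply redoes the two multiplications quoted in the text. The second part applies the same idea to three hypothetical data points and two hypothetical normal distributions, since the actual values behind the figure are not given.

```python
import math

def normal_pdf(x, mu, sigma):
    """Density of the normal distribution N(mu, sigma^2) at x."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

# The products quoted in the text
left_product = 0.14 * 0.28 * 0.38
right_product = 0.01 * 0.03 * 0.09

# Hypothetical data points and candidate distributions, for illustration only
data = [-1.0, 0.0, 0.5]
L_near = math.prod(normal_pdf(x, mu=0.0, sigma=1.0) for x in data)  # centered on the data
L_far = math.prod(normal_pdf(x, mu=3.0, sigma=1.0) for x in data)   # centered far away

print(left_product, right_product)
print(L_near, L_far)
```

The distribution centered on the data gives the larger product, which is the same reasoning used to pick the parameters of the sigmoid function.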

**◆ Difference between the sigmoid function and the logistic function**

It is fine to think of the sigmoid function as a special case of the logistic function.

Logistic function: $N = \frac{K}{1 + \exp(K(t_0 - t))}$

The sigmoid function is this function with $K = 1$ and $t_0 = 0$.
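As a quick numerical check of this relationship, the sketch below evaluates both functions at a few arbitrary points. The function names `logistic` and `sigmoid` and the sample points are chosen here for illustration.

```python
import math

def logistic(t, K, t0):
    """Logistic function N = K / (1 + exp(K * (t0 - t))), as written above."""
    return K / (1.0 + math.exp(K * (t0 - t)))

def sigmoid(t):
    """Sigmoid function 1 / (1 + exp(-t))."""
    return 1.0 / (1.0 + math.exp(-t))

# With K = 1 and t0 = 0 the two functions should agree at every point checked
points = [-3.0, -1.0, 0.0, 0.5, 2.0]
max_gap = max(abs(logistic(t, K=1.0, t0=0.0) - sigmoid(t)) for t in points)
print(max_gap)
```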

Reference URL: Wikipedia https://ja.wikipedia.org/wiki/%E3%82%B7%E3%82%B0%E3%83%A2%E3%82%A4%E3%83%89%E9%96%A2%E6%95%B0

**◆ Confusion matrix**

I did not use it this time because the example was so clear-cut, but in general a metric called the confusion matrix is used to check accuracy. A reference URL is listed below, so if you are interested, please look into it.

Reference URL
https://note.nkmk.me/python-sklearn-confusion-matrix-score/
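For a first impression, here is a minimal sketch using the `confusion_matrix` and `accuracy_score` functions imported in section 3. The true labels and predictions below are hypothetical and are not results from this article's model; they include one deliberate mistake so that the matrix is not trivial.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical true labels and predictions (True = humanities, False = science)
y_true = [True, False, True, False, True]
y_pred = [True, False, False, False, True]

cm = confusion_matrix(y_true, y_pred)   # rows: actual class, columns: predicted class
acc = accuracy_score(y_true, y_pred)    # fraction of predictions that are correct

print(cm)
print(acc)
```

With the classes ordered as False, True, the off-diagonal entries count the misclassified students, so here the single humanities student predicted as science appears in the bottom-left cell.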

5. Summary

How was it? Logistic regression is a topic that I find very hard to grasp, so it may be difficult to understand after a single reading. I hope that reading this article several times will help deepen your understanding.
