Anyone can implement machine learning relatively easily with libraries such as scikit-learn. However, if you want to produce results at work or raise your own level, you can see that "I don't know the background, but I got this result" is clearly a weak position to explain from.
The purpose of this article is twofold: in sections 2-3, "never mind the theory, let's try scikit-learn first," and from section 4 onward, "understand the background through the mathematics."
I come from a private liberal-arts background, so I am not strong at mathematics myself. I have tried to explain things so that they are as easy to follow as possible, even for people who are not comfortable with math.
A similar article exists for simple linear regression, so please read it as well: [Machine learning] Understanding simple linear regression from both scikit-learn and mathematics.
Logistic regression is a type of statistical regression model for variables that follow a Bernoulli distribution. Source: [Wikipedia](https://ja.wikipedia.org/wiki/ロジスティック回帰)
That definition alone doesn't tell you much, so to put it simply: logistic regression is used for **"predicting the probability that a certain event will occur" and "classifying based on that probability."** In other words, reach for logistic regression when you want to classify with machine learning or when you want to predict a probability.
(When I was studying machine learning, I was quite surprised that it could "predict probabilities.")
So how does logistic regression actually perform classification and probability prediction? The detailed explanation is deferred to the mathematics chapter, but in short, "predicting the probability that a certain event will occur" means: feed the necessary information into the "sigmoid function" below, and out comes the probability that the event (call it A) occurs. If that probability is 50% or more, the input is classified as A; if it is below 50%, it is classified as not-A.
[Sigmoid function]
By the way, the sigmoid function is defined as follows.
y = \frac{1}{1 + e^{-(a_1x_1 + a_2x_2 + ... + b)}}
The $y$ that comes out of this calculation represents the probability that the event occurs, and logistic regression is what computes this probability.
For example, if the red dot in the figure above sits on the sigmoid curve at 40%, the probability of occurrence is predicted to be 40%; since that is below 50%, event A is classified as not occurring.
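To make this concrete, here is a minimal sketch in Python (the parameter values and inputs are made-up numbers purely for illustration) that computes the sigmoid function and applies the 50% rule:

```python
import numpy as np

def sigmoid(x1, x2, a1, a2, b):
    """Probability that event A occurs, given inputs x1, x2 and parameters a1, a2, b."""
    return 1 / (1 + np.exp(-(a1 * x1 + a2 * x2 + b)))

# Made-up parameters and inputs, purely for illustration
p = sigmoid(x1=0.4, x2=0.2, a1=1.0, a2=1.0, b=-1.0)
print(p)         # about 0.40 -> the predicted probability that A occurs
print(p >= 0.5)  # False -> below 50%, so classified as "not A"
```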
A word on "classification": machine learning mainly performs either "regression (predicting numerical values)" or "classification." As the name suggests, classification is used when you want to sort inputs into categories such as "A or B."
Suppose you have, for 15 students, the overall average score in Japanese and in mathematics from junior high school through high school, together with data on whether each student went on to a humanities or a science track.
Based on this data, we would like to take another student's Japanese and mathematics scores and predict whether they will go on to a humanities or a science track.
The distribution of the 15 students' overall average scores in Japanese and mathematics looks like this:
Somehow, there seems to be a boundary between the blue dots (humanities) and the orange dots (science).
Next, let's perform logistic regression analysis using scikit-learn and create a model that classifies humanities and sciences.
Import the following, which is required to perform logistic regression (numpy, pandas, and matplotlib are also needed for the data handling and plots below).
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import log_loss, accuracy_score, confusion_matrix
Set up the Japanese and mathematics scores and the track labels (True for humanities, False for science) as data, as shown below.
data = pd.DataFrame({
"bunri":[False,True,False,True,True,False,True,False,True,False,False,True,False,False,False],
"Japanese_score":[45, 60, 52, 70, 85, 31, 90, 55, 75, 30, 42, 65, 38, 55, 60],
"Math_score":[75, 50, 80, 35, 40, 65, 42, 90, 35, 90, 80, 35, 88, 80, 90],
})
Let's plot the Japanese and mathematics scores together with the humanities/science labels. To grasp the characteristics of your data, don't jump straight into scikit-learn; get into the habit of plotting whatever data you have first.
y = data["bunri"].values
x1, x2 = data["Japanese_score"].values, data["Math_score"].values
#Plot the data
plt.grid(which='major',color='black',linestyle=':')
plt.grid(which='minor',color='black',linestyle=':')
plt.plot(x1[y], x2[y], 'o', color='C0', label='bunkei')#Blue dot: y is True(=Humanities)
plt.plot(x1[~y], x2[~y], '^', color='C1', label='rikei')#Orange dot: y is False(=Science)
plt.xlabel("Japanese_score")
plt.ylabel("Math_score")
plt.legend(loc='best')
plt.show()
It looks like blue (humanities) and orange (science) can be separated quite cleanly. (In the real world, data is rarely divided this neatly.) Now let's build the logistic regression model.
First, we arrange the data into the shape required for model building.
y = data["bunri"].values#It is the same as the previous illustration, so you can omit it.
X = data[["Japanese_score", "Math_score"]].values
Since this is not an article about Python syntax, I'll skip the details, but here we shape the data into the form scikit-learn's logistic regression expects: X is a 2-column array of the explanatory variables, and y is the label array.
It's finally the model building code.
clf = SGDClassifier(loss='log', penalty='none', max_iter=10000, fit_intercept=True, random_state=1234, tol=1e-3)#Note: newer scikit-learn versions use loss='log_loss' and penalty=None
clf.fit(X, y)
That's all it takes to build a simple model. The first line says "create a logistic regression model in a variable called clf!", and the next line fits (= trains) clf on the prepared X and y.
The word "parameter" appeared out of nowhere; it refers to the $a$ and $b$ in the sigmoid function from the beginning, $y = \frac{1}{1 + e^{-(a_1x_1 + a_2x_2 + ... + b)}}$. In this example there are two explanatory variables, the Japanese score and the math score, so the model can be written as $y = \frac{1}{1 + e^{-(a_1x_1 + a_2x_2 + b)}}$, and $a$ and $b$ can be obtained from scikit-learn as shown below.
#Get and display weights
b = clf.intercept_[0]
a1 = clf.coef_[0, 0]
a2 = clf.coef_[0, 1]
print(b, a1, a2)
This displays b = 4.950, a1 = 446.180, a2 = -400.540, so you can see the fitted model is the sigmoid function $y = \frac{1}{1 + e^{-(446.180x_1 - 400.540x_2 + 4.950)}}$.
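As a sanity check, you can plug these parameters back into the sigmoid function yourself and confirm that the result matches scikit-learn's predict_proba (a minimal sketch reusing a1, a2, b, and clf from above; the sample scores are arbitrary):

```python
import numpy as np

x1_new, x2_new = 60, 50  # arbitrary Japanese and math scores
p = 1 / (1 + np.exp(-(a1 * x1_new + a2 * x2_new + b)))
print(p)                                      # probability of humanities (True)
print(clf.predict_proba([[x1_new, x2_new]]))  # the True column should match p
```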
Now let's draw this boundary on the scatter plot from earlier.
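Since the classification flips at a probability of exactly 50%, the boundary is the set of points where the exponent of the sigmoid function is zero. Solving for $x_2$ gives the straight line used in the code below:

a_1x_1 + a_2x_2 + b = 0 \quad \Leftrightarrow \quad x_2 = -\frac{a_1}{a_2}x_1 - \frac{b}{a_2}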
y = data["bunri"].values
x1, x2 = data["Japanese_score"].values, data["Math_score"].values
#Plot the data
plt.grid(which='major',color='black',linestyle=':')
plt.grid(which='minor',color='black',linestyle=':')
plt.plot(x1[y], x2[y], 'o', color='C0', label='bunkei')#Blue dot: y is True(=Humanities)
plt.plot(x1[~y], x2[~y], '^', color='C1', label='rikei')#Orange dot: y is False(=Science)
plt.xlabel("Japanese_score")
plt.ylabel("Math_score")
plt.legend(loc='best')
#Plot and display borders
#Purple: Borderline
line_x = np.arange(np.min(x1) - 1, np.max(x1) + 1)
line_y = - line_x * a1 / a2 - b / a2#Boundary: a1*x1 + a2*x2 + b = 0, solved for x2
plt.plot(line_x, line_y, linestyle='-.', linewidth=3, color='purple', label='kyoukai')
plt.ylim([np.min(x2) - 1, np.max(x2) + 1])
plt.legend(loc='best')
plt.show()
In this way, always stay aware of what scikit-learn is computing under the hood and how each step connects to the model.
Building a model is not the goal in itself. In the real world, you need to use the prediction model to predict the track of other students. Suppose you obtained data for another 5 students and noted it down; store it in a variable called z as shown below.
z = pd.DataFrame({
"Japanese_score":[80, 50, 65, 40, 75],
"Math_score":[50, 70, 55, 50, 40],
})
z2 = z[["Japanese_score", "Math_score"]].values
What we want to do is feed these new students' data into the logistic regression model obtained with scikit-learn earlier and predict their humanities/science track.
y_est = clf.predict(z2)
print(y_est)
This displays y_est as [True, False, True, False, True]. In other words, the first person, with 80 points in Japanese and 50 in mathematics, is predicted to go on to the humanities.
With this, the goal of predicting a student's track from their Japanese and math scores is achieved.
Let's also display the "probability of being in the humanities" mentioned at the beginning.
clf.predict_proba(z2)
Written this way, two columns are displayed for each student: the probability of each class, in the order given by clf.classes_ (here the probability of science (False) first, then humanities (True)). However, this example is so clear-cut that the result comes out as below, with the probabilities split sharply into 0% and 100%.
[0., 1.], [1., 0.], [0., 1.], [1., 0.], [0., 1.]
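If you want to confirm which column corresponds to which class, the classes_ attribute gives the column order of predict_proba:

```python
print(clf.classes_)  # [False  True] -> column 0 is science (False), column 1 is humanities (True)
```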
Up to section 3, we implemented the whole flow: build a logistic regression model with scikit-learn → plot it → predict the track of another 5 students. From here, I would like to make clear how the logistic regression model in that flow is computed mathematically.
We will use the rule for the logarithm of a product: $\log(A \times B) = \log A + \log B$.
As mentioned in section 2, "◆ About the sigmoid function," the sigmoid function expresses a certain event as a probability, and has the following shape.
This function (the blue curve in the figure) is defined as follows.
y = \frac{1}{1 + e^{-(a_1x_1 + a_2x_2 + ... + b)}}
The $a_1$, $a_2$, and $b$ above are so-called parameters; their role is the same as that of $a$ and $b$ in the linear function $y = ax + b$. And $x_1$ and $x_2$ are so-called explanatory variables: in this case, the Japanese score is $x_1$ and the math score is $x_2$.
If you can determine **good values** for $a_1$, $a_2$, and $b$, then for a newly obtained student you simply plug their Japanese and math scores in as $x_1$ and $x_2$, and the probability of going into the humanities (or, set up the other way, the sciences) comes out as $y$.
In other words, **in logistic regression, machine learning's job is to compute these parameters $a$ and $b$, which determine the sigmoid function**.
It's hard to get a feel for this from sentences alone, so from the next section let's calculate with concrete numbers plugged in.
The table below organizes the "data" set up in "Data preparation," with the sigmoid function in the rightmost column. This time we want the "probability of being in the humanities," so humanities is coded as 1 and science as 0 (conversely, if you wanted the probability of being in the sciences, you would code science as 1 and humanities as 0).
| Student | Japanese score | Math score | Track (0: science, 1: humanities) | Sigmoid function |
|---|---|---|---|---|
| 1st | 45 | 75 | 0 | |
| 2nd | 60 | 50 | 1 | |
| ... | ... | ... | ... | ... |
| 15th | 60 | 90 | 0 | |
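For example, the sigmoid-function entry for the 1st person (Japanese 45, math 75) is their predicted probability of being in the humanities:

\frac{1}{1 + e^{-(45a_1 + 75a_2 + b)}}

Since this student actually went into the sciences (label 0), the probability of the observed outcome is one minus this value, which is exactly the first factor of the product $L$ below.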
Now, how do we find the parameters $a$ ($a_1$ and $a_2$ in this example) and $b$? The short answer: **multiply together, over all 15 students, the probability of the track each student actually chose, and find the $a_1$, $a_2$, and $b$ that maximize that product**.
This approach is called maximum likelihood estimation.
◆ What is a maximum likelihood estimate? In Japanese it is 最尤推定量 (saiyū suiteiryō), meaning the "most plausible (most likely)" estimate. → Put crudely, you can read it as "the best-fitting numbers."
◆ Let's multiply. Multiplying together each student's probability of their observed track, from the 1st person to the 15th, gives the following (call it L).
L = \left[1 - \frac{1}{1 + e^{-(45a_1 + 75a_2 + b)}}\right] \times \left[\frac{1}{1 + e^{-(60a_1 + 50a_2 + b)}}\right] \times \cdots \times \left[1 - \frac{1}{1 + e^{-(60a_1 + 90a_2 + b)}}\right]
◆ Convert to a logarithm to find the maximum of L. As you can see from the formula, L is a product of 15 factors. With data for millions of people such a product becomes very hard to handle, so we convert it to a logarithm, turning the product into a sum using the log rule above.
\log L = \log\left[1 - \frac{1}{1 + e^{-(45a_1 + 75a_2 + b)}}\right] + \log\left[\frac{1}{1 + e^{-(60a_1 + 50a_2 + b)}}\right] + \cdots + \log\left[1 - \frac{1}{1 + e^{-(60a_1 + 90a_2 + b)}}\right]
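As a sketch (this is just a direct translation of the formula above, not what scikit-learn does internally), $\log L$ can be computed with numpy, reusing the X and y arrays prepared earlier:

```python
import numpy as np

def log_likelihood(a1, a2, b, X, y):
    """logL = sum over all students of log P(observed track)."""
    p = 1 / (1 + np.exp(-(a1 * X[:, 0] + a2 * X[:, 1] + b)))  # P(humanities) per student
    return np.sum(np.where(y, np.log(p), np.log(1 - p)))      # log p if humanities, log(1-p) if science
```

(For extreme parameter values p can hit exactly 0 or 1, so in practice you would clip it slightly away from those values before taking the log.)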
◆ Find the parameters. The $a_1$, $a_2$, and $b$ that maximize $\log L$ cannot be obtained analytically (= by solving a formula by hand). scikit-learn uses **"stochastic gradient descent"** to compute the optimal parameters numerically.
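To get a feel for what "gradient descent" is doing here, below is a minimal full-batch gradient ascent sketch that nudges the parameters in the direction that increases $\log L$ (the learning rate and iteration count are arbitrary choices; scikit-learn's SGDClassifier instead updates on one sample at a time with a decaying learning rate, so this is only an illustration):

```python
import numpy as np

y_num = y.astype(float)  # True/False -> 1.0/0.0
a = np.zeros(2)          # a1, a2
b = 0.0
lr = 1e-4                # arbitrary learning rate
for _ in range(100000):
    p = 1 / (1 + np.exp(-(X @ a + b)))  # current P(humanities) for each student
    a += lr * X.T @ (y_num - p)         # gradient of logL with respect to a
    b += lr * np.sum(y_num - p)         # gradient of logL with respect to b
```

The gradient of $\log L$ with respect to the exponent works out to the simple form $y - p$, which is why the update rules above are so compact.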
So, as long as you understand that this is what the theory behind it is doing, it's fine to rely on scikit-learn for the actual computation.
In "(iii) Try to output parameters", $ b $ = 4.950, $ a_1 $ = 446.180, $ a_2 $ = -400.540, so ** I wanted to find $ y = \ frac {1 } {1 + e ^ {-(446.180x_1 + (-400.540) x_2 + 4.950)}} $ is a sigmoid function **.
This sigmoid function ($ y = \ frac {1} {1 + e ^ {-(446.180x_1 + (-400.540) x_2 + 4.950)}}
**◆ Why does maximizing L give "good" parameters?** We found the parameters $a$ and $b$ that maximize $L$ (and $\log L$), but why does maximizing $L$ and $\log L$ yield the optimal parameters?
The figure below gives the intuition. Suppose you have only three data points, and from those three points you try to roughly sketch the blue curve of the underlying normal distribution. Of the two blue curves below, which is more plausible as the overall distribution?
Obviously, the graph on the left is more plausible. In the graph on the right, the peak of the distribution, the region where data should occur most frequently, contains none of the data at hand. This is the intuitive picture, but mathematically as well, the graph on the left is the better fit.
Below is the same normal distribution with probabilities annotated. As a rough guide, I wrote next to each red data point the probability with which it would occur if it really came from that normal distribution.
Multiplying the probabilities for the left distribution (this has the same meaning as $L$) gives 0.14 × 0.28 × 0.38 = 0.014896. Doing the same on the right gives 0.01 × 0.03 × 0.09 = 0.000027.
In other words, the larger the product of the probabilities, the closer the curve is to the one that properly represents the underlying distribution.
That is why we look for the parameters $a_1$, $a_2$, and $b$ that make $L$, the product of each student's track probabilities, and its logarithm $\log L$ as large as possible.
**◆ Difference between the sigmoid function and the logistic function** It is fine to understand the sigmoid function as a special case of the logistic function.
Logistic function: $N = \frac{K}{1 + \exp(K(t_0 - t))}$. The sigmoid function is this function with $K = 1$ and $t_0 = 0$.
Reference URL: Wikipedia https://ja.wikipedia.org/wiki/%E3%82%B7%E3%82%B0%E3%83%A2%E3%82%A4%E3%83%89%E9%96%A2%E6%95%B0
**◆ Confusion matrix**
I didn't use it this time because the example was deliberately clear-cut, but in general a tool called the confusion matrix is used to check a model's accuracy.
I'll list a reference URL, so if you are interested, please use it to study further.
Reference URL
https://note.nkmk.me/python-sklearn-confusion-matrix-score/
So, what did you think? Logistic regression is a topic I personally find quite hard to grasp, so it may not click on a single read. I hope that reading it a few times helps deepen your understanding.