Anyone can implement machine learning relatively easily with libraries such as scikit-learn. However, if you want to produce results at work or raise your own level, you can see that "I don't know the background, but I got this result" is clearly a weak position to explain from.
The purpose of this article is twofold: in sections 2-3, "never mind the theory, let's try scikit-learn first," and from section 4 onward, "understand the background through the mathematics."
I come from a private liberal-arts background, so I am not strong at mathematics myself. I have tried to explain things so that they are as easy to follow as possible, even for people who are not comfortable with math.
A similar article exists for simple linear regression, so please read it as well: [Machine learning] Understanding simple linear regression from both scikit-learn and mathematics.
Logistic regression is a type of statistical regression model for variables that follow a Bernoulli distribution. Source: [Wikipedia](https://ja.wikipedia.org/wiki/ロジスティック回帰)
That definition alone doesn't tell you much, so to put it simply: logistic regression is used for **"predicting the probability that a certain event will occur" and "classifying based on that probability."** In other words, reach for logistic regression when you want to classify with machine learning or when you want to predict a probability.
(When I was studying machine learning, I was quite surprised that it could "predict probabilities.")
So how does logistic regression actually perform classification and probability prediction? The detailed explanation is deferred to the mathematics chapter, but in short, "predicting the probability that a certain event will occur" means: feed the necessary information into the "sigmoid function" below, and out comes the probability that the event (call it A) occurs. If that probability is 50% or more, the input is classified as A; if it is below 50%, it is classified as not-A.
[Sigmoid function]
By the way, the sigmoid function is defined as follows.
y = \frac{1}{1 + e^{-(a_1x_1 + a_2x_2 + ... + b)}}
The $y$ that comes out of this calculation represents the probability that the event occurs, and logistic regression is what computes this probability.
For example, if the red dot in the figure above sits on the sigmoid curve at 40%, the probability of occurrence is predicted to be 40%; since that is below 50%, event A is classified as not occurring.
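To make this concrete, here is a minimal sketch in Python (the parameter values and inputs are made-up numbers purely for illustration) that computes the sigmoid function and applies the 50% rule:

```python
import numpy as np

def sigmoid(x1, x2, a1, a2, b):
    """Probability that event A occurs, given inputs x1, x2 and parameters a1, a2, b."""
    return 1 / (1 + np.exp(-(a1 * x1 + a2 * x2 + b)))

# Made-up parameters and inputs, purely for illustration
p = sigmoid(x1=0.4, x2=0.2, a1=1.0, a2=1.0, b=-1.0)
print(p)         # about 0.40 -> the predicted probability that A occurs
print(p >= 0.5)  # False -> below 50%, so classified as "not A"
```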
A word on "classification": machine learning mainly performs either "regression (predicting numerical values)" or "classification." As the name suggests, classification is used when you want to sort inputs into categories such as "A or B."
Suppose you have, for 15 students, the overall average score in Japanese and in mathematics from junior high school through high school, together with data on whether each student went on to a humanities or a science track.
Based on this data, we would like to take another student's Japanese and mathematics scores and predict whether they will go on to a humanities or a science track.
The distribution of the 15 students' overall average scores in Japanese and mathematics looks like this:
Somehow, there seems to be a boundary between the blue dots (humanities) and the orange dots (science).
Next, let's perform logistic regression analysis using scikit-learn and create a model that classifies humanities and sciences.
Import the following, which is required to perform logistic regression (numpy, pandas, and matplotlib are also needed for the data handling and plots below).
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import log_loss, accuracy_score, confusion_matrix
Set up the Japanese and mathematics scores and the track labels (True for humanities, False for science) as data, as shown below.
data = pd.DataFrame({
"bunri":[False,True,False,True,True,False,True,False,True,False,False,True,False,False,False],
"Japanese_score":[45, 60, 52, 70, 85, 31, 90, 55, 75, 30, 42, 65, 38, 55, 60],
"Math_score":[75, 50, 80, 35, 40, 65, 42, 90, 35, 90, 80, 35, 88, 80, 90],
})
Let's plot the Japanese and mathematics scores together with the humanities/science labels. To grasp the characteristics of your data, don't jump straight into scikit-learn; get into the habit of plotting whatever data you have first.
y = data["bunri"].values
x1, x2 = data["Japanese_score"].values, data["Math_score"].values
#Plot the data
plt.grid(which='major',color='black',linestyle=':')
plt.grid(which='minor',color='black',linestyle=':')
plt.plot(x1[y], x2[y], 'o', color='C0', label='bunkei')#Blue dot: y is True(=Humanities)
plt.plot(x1[~y], x2[~y], '^', color='C1', label='rikei')#Orange dot: y is False(=Science)
plt.xlabel("Japanese_score")
plt.ylabel("Math_score")
plt.legend(loc='best')
plt.show()
It looks like blue (humanities) and orange (science) can be separated quite cleanly. (In the real world, data is rarely divided this neatly.) Now let's build the logistic regression model.
First, we arrange the data into the shape required for model building.
y = data["bunri"].values#It is the same as the previous illustration, so you can omit it.
X = data[["Japanese_score", "Math_score"]].values
Since this is not an article about Python syntax, I'll skip the details, but here we shape the data into the form scikit-learn's logistic regression expects: X is a 2-column array of the explanatory variables, and y is the label array.
It's finally the model building code.
clf = SGDClassifier(loss='log', penalty='none', max_iter=10000, fit_intercept=True, random_state=1234, tol=1e-3)#Note: newer scikit-learn versions use loss='log_loss' and penalty=None
clf.fit(X, y)
That's all it takes to build a simple model. The first line says "create a logistic regression model in a variable called clf!", and the next line fits (= trains) clf on the prepared X and y.
The word "parameter" appeared out of nowhere; it refers to the $a$ and $b$ in the sigmoid function from the beginning, $y = \frac{1}{1 + e^{-(a_1x_1 + a_2x_2 + ... + b)}}$. In this example there are two explanatory variables, the Japanese score and the math score, so the model can be written as $y = \frac{1}{1 + e^{-(a_1x_1 + a_2x_2 + b)}}$, and $a$ and $b$ can be obtained from scikit-learn as shown below.
#Get and display weights
b = clf.intercept_[0]
a1 = clf.coef_[0, 0]
a2 = clf.coef_[0, 1]
print(b, a1, a2)
This displays b = 4.950, a1 = 446.180, a2 = -400.540, so you can see the fitted model is the sigmoid function $y = \frac{1}{1 + e^{-(446.180x_1 - 400.540x_2 + 4.950)}}$.
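As a sanity check, you can plug these parameters back into the sigmoid function yourself and confirm that the result matches scikit-learn's predict_proba (a minimal sketch reusing a1, a2, b, and clf from above; the sample scores are arbitrary):

```python
import numpy as np

x1_new, x2_new = 60, 50  # arbitrary Japanese and math scores
p = 1 / (1 + np.exp(-(a1 * x1_new + a2 * x2_new + b)))
print(p)                                      # probability of humanities (True)
print(clf.predict_proba([[x1_new, x2_new]]))  # the True column should match p
```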
Now let's draw this boundary on the scatter plot from earlier.
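Since the classification flips at a probability of exactly 50%, the boundary is the set of points where the exponent of the sigmoid function is zero. Solving for $x_2$ gives the straight line used in the code below:

a_1x_1 + a_2x_2 + b = 0 \quad \Leftrightarrow \quad x_2 = -\frac{a_1}{a_2}x_1 - \frac{b}{a_2}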
y = data["bunri"].values
x1, x2 = data["Japanese_score"].values, data["Math_score"].values
#Plot the data
plt.grid(which='major',color='black',linestyle=':')
plt.grid(which='minor',color='black',linestyle=':')
plt.plot(x1[y], x2[y], 'o', color='C0', label='bunkei')#Blue dot: y is True(=Humanities)
plt.plot(x1[~y], x2[~y], '^', color='C1', label='rikei')#Orange dot: y is False(=Science)
plt.xlabel("Japanese_score")
plt.ylabel("Math_score")
plt.legend(loc='best')
#Plot and display borders
#Purple: Borderline
line_x = np.arange(np.min(x1) - 1, np.max(x1) + 1)
line_y = - line_x * a1 / a2 - b / a2#Boundary: a1*x1 + a2*x2 + b = 0, solved for x2
plt.plot(line_x, line_y, linestyle='-.', linewidth=3, color='purple', label='kyoukai')
plt.ylim([np.min(x2) - 1, np.max(x2) + 1])
plt.legend(loc='best')
plt.show()
In this way, always stay aware of what scikit-learn is computing under the hood and how each step connects to the model.
Building a model is not the goal in itself. In the real world, you need to use the prediction model to predict the track of other students. Suppose you obtained data for another 5 students and noted it down; store it in a variable called z as shown below.
z = pd.DataFrame({
"Japanese_score":[80, 50, 65, 40, 75],
"Math_score":[50, 70, 55, 50, 40],
})
z2 = z[["Japanese_score", "Math_score"]].values
What we want to do is feed these new students' data into the logistic regression model obtained with scikit-learn earlier and predict their humanities/science track.
y_est = clf.predict(z2)
print(y_est)
This displays y_est as [True, False, True, False, True]. In other words, the first person, with 80 points in Japanese and 50 in mathematics, is predicted to go on to the humanities.
With this, the goal of predicting a student's track from their Japanese and math scores is achieved.
Let's also display the "probability of being in the humanities" mentioned at the beginning.
clf.predict_proba(z2)
Written this way, two columns are displayed for each student: the probability of each class, in the order given by clf.classes_ (here the probability of science (False) first, then humanities (True)). However, this example is so clear-cut that the result comes out as below, with the probabilities split sharply into 0% and 100%.
[0., 1.], [1., 0.], [0., 1.], [1., 0.], [0., 1.]
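If you want to confirm which column corresponds to which class, the classes_ attribute gives the column order of predict_proba:

```python
print(clf.classes_)  # [False  True] -> column 0 is science (False), column 1 is humanities (True)
```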
Up to section 3, we implemented the whole flow: build a logistic regression model with scikit-learn → plot it → predict the track of another 5 students. From here, I would like to make clear how the logistic regression model in that flow is computed mathematically.
We will use the rule for the logarithm of a product: $\log(A \times B) = \log A + \log B$.
As mentioned in section 2, "◆ About the sigmoid function," the sigmoid function expresses a certain event as a probability, and has the following shape.
This function (the blue curve in the figure) is defined as follows.
y = \frac{1}{1 + e^{-(a_1x_1 + a_2x_2 + ... + b)}}
The $a_1$, $a_2$, and $b$ above are so-called parameters; their role is the same as that of $a$ and $b$ in the linear function $y = ax + b$. And $x_1$ and $x_2$ are so-called explanatory variables: in this case, the Japanese score is $x_1$ and the math score is $x_2$.
If you can determine **good values** for $a_1$, $a_2$, and $b$, then for a newly obtained student you simply plug their Japanese and math scores in as $x_1$ and $x_2$, and the probability of going into the humanities (or, set up the other way, the sciences) comes out as $y$.
In other words, **in logistic regression, machine learning's job is to compute these parameters $a$ and $b$, which determine the sigmoid function**.
It's hard to get a feel for this from sentences alone, so from the next section let's calculate with concrete numbers plugged in.
The table below organizes the "data" set up in "Data preparation," with the sigmoid function in the rightmost column. This time we want the "probability of being in the humanities," so humanities is coded as 1 and science as 0 (conversely, if you wanted the probability of being in the sciences, you would code science as 1 and humanities as 0).
| Student | Japanese score | Math score | Track (0: science, 1: humanities) | Sigmoid function |
|---|---|---|---|---|
| 1st | 45 | 75 | 0 | |
| 2nd | 60 | 50 | 1 | |
| ... | ... | ... | ... | ... |
| 15th | 60 | 90 | 0 | |
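For example, the sigmoid-function entry for the 1st person (Japanese 45, math 75) is their predicted probability of being in the humanities:

\frac{1}{1 + e^{-(45a_1 + 75a_2 + b)}}

Since this student actually went into the sciences (label 0), the probability of the observed outcome is one minus this value, which is exactly the first factor of the product $L$ below.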
Now, how do we find the parameters $a$ ($a_1$ and $a_2$ in this example) and $b$? The short answer: **multiply together, over all 15 students, the probability of the track each student actually chose, and find the $a_1$, $a_2$, and $b$ that maximize that product**.
This approach is called maximum likelihood estimation.
◆ What is a maximum likelihood estimate? In Japanese it is 最尤推定量 (saiyū suiteiryō), meaning the "most plausible (most likely)" estimate. → Put crudely, you can read it as "the best-fitting numbers."
◆ Let's multiply. Multiplying together each student's probability of their observed track, from the 1st person to the 15th, gives the following (call it L).
L = \left[1 - \frac{1}{1 + e^{-(45a_1 + 75a_2 + b)}}\right] \times \left[\frac{1}{1 + e^{-(60a_1 + 50a_2 + b)}}\right] \times \cdots \times \left[1 - \frac{1}{1 + e^{-(60a_1 + 90a_2 + b)}}\right]
◆ Convert to a logarithm to find the maximum of L. As you can see from the formula, L is a product of 15 factors. With data for millions of people such a product becomes very hard to handle, so we convert it to a logarithm, turning the product into a sum using the log rule above.
\log L = \log\left[1 - \frac{1}{1 + e^{-(45a_1 + 75a_2 + b)}}\right] + \log\left[\frac{1}{1 + e^{-(60a_1 + 50a_2 + b)}}\right] + \cdots + \log\left[1 - \frac{1}{1 + e^{-(60a_1 + 90a_2 + b)}}\right]
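As a sketch (this is just a direct translation of the formula above, not what scikit-learn does internally), $\log L$ can be computed with numpy, reusing the X and y arrays prepared earlier:

```python
import numpy as np

def log_likelihood(a1, a2, b, X, y):
    """logL = sum over all students of log P(observed track)."""
    p = 1 / (1 + np.exp(-(a1 * X[:, 0] + a2 * X[:, 1] + b)))  # P(humanities) per student
    return np.sum(np.where(y, np.log(p), np.log(1 - p)))      # log p if humanities, log(1-p) if science
```

(For extreme parameter values p can hit exactly 0 or 1, so in practice you would clip it slightly away from those values before taking the log.)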
◆ Find the parameters. The $a_1$, $a_2$, and $b$ that maximize $\log L$ cannot be obtained analytically (= by solving a formula by hand). scikit-learn uses **"stochastic gradient descent"** to compute the optimal parameters numerically.
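To get a feel for what "gradient descent" is doing here, below is a minimal full-batch gradient ascent sketch that nudges the parameters in the direction that increases $\log L$ (the learning rate and iteration count are arbitrary choices; scikit-learn's SGDClassifier instead updates on one sample at a time with a decaying learning rate, so this is only an illustration):

```python
import numpy as np

y_num = y.astype(float)  # True/False -> 1.0/0.0
a = np.zeros(2)          # a1, a2
b = 0.0
lr = 1e-4                # arbitrary learning rate
for _ in range(100000):
    p = 1 / (1 + np.exp(-(X @ a + b)))  # current P(humanities) for each student
    a += lr * X.T @ (y_num - p)         # gradient of logL with respect to a
    b += lr * np.sum(y_num - p)         # gradient of logL with respect to b
```

The gradient of $\log L$ with respect to the exponent works out to the simple form $y - p$, which is why the update rules above are so compact.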
So, as long as you understand that this is what the theory behind it is doing, it's fine to rely on scikit-learn for the actual computation.
In "(iii) Try to output parameters", $ b $ = 4.950, $ a_1 $ = 446.180, $ a_2 $ = -400.540, so ** I wanted to find $ y = \ frac {1 } {1 + e ^ {-(446.180x_1 + (-400.540) x_2 + 4.950)}} $ is a sigmoid function **.
This sigmoid function ($ y = \ frac {1} {1 + e ^ {-(446.180x_1 + (-400.540) x_2 + 4.950)}}
**◆ Why does maximizing L give "good" parameters?** We found the parameters $a$ and $b$ that maximize $L$ (and $\log L$), but why does maximizing $L$ and $\log L$ yield the optimal parameters?
The figure below gives the intuition. Suppose you have only three data points, and from those three points you try to roughly sketch the blue curve of the underlying normal distribution. Of the two blue curves below, which is more plausible as the overall distribution?
Obviously, the graph on the left is more plausible. In the graph on the right, the peak of the distribution, the region where data should occur most frequently, contains none of the data at hand. This is the intuitive picture, but mathematically as well, the graph on the left is the better fit.
Below is the same normal distribution with probabilities annotated. As a rough guide, I wrote next to each red data point the probability with which it would occur if it really came from that normal distribution.
Multiplying the probabilities for the left distribution (this has the same meaning as $L$) gives 0.14 × 0.28 × 0.38 = 0.014896. Doing the same on the right gives 0.01 × 0.03 × 0.09 = 0.000027.
In other words, the larger the product of the probabilities, the closer the curve is to the one that properly represents the underlying distribution.
That is why we look for the parameters $a_1$, $a_2$, and $b$ that make $L$, the product of each student's track probabilities, and its logarithm $\log L$ as large as possible.
**◆ Difference between the sigmoid function and the logistic function** It is fine to understand the sigmoid function as a special case of the logistic function.
Logistic function: $N = \frac{K}{1 + \exp(K(t_0 - t))}$. The sigmoid function is this function with $K = 1$ and $t_0 = 0$.
Reference URL: Wikipedia https://ja.wikipedia.org/wiki/%E3%82%B7%E3%82%B0%E3%83%A2%E3%82%A4%E3%83%89%E9%96%A2%E6%95%B0
**◆ Confusion matrix**
I didn't use it this time because the example was deliberately clear-cut, but in general a tool called the confusion matrix is used to check a model's accuracy.
I'll list a reference URL, so if you are interested, please use it to study further.
Reference URL
https://note.nkmk.me/python-sklearn-confusion-matrix-score/
So, what did you think? Logistic regression is a topic I personally find quite hard to grasp, so it may not click on a single read. I hope that reading it a few times helps deepen your understanding.