Anyone can implement machine learning relatively easily with scikit-learn and similar libraries. However, if you want to produce results at work, or simply improve as a practitioner, **"I don't know the background, but I got this result"** is clearly a weak position to be in.
This time, we will focus on **decorrelation**.
The goal of this article is to answer questions such as "I've heard of decorrelation, but why is it done?", "How is it used?", and "What processing does decorrelation actually perform mathematically?".
Understanding this topic properly requires background on several things: matrices, covariance, eigenvalues and eigenvectors, and so on. Rather than going deep into the mathematics, **I aim to convey the overall picture**.
Decorrelation is rarely used on its own in machine learning; it more commonly appears as a step inside principal component analysis.
First, Chapter 2 gives an overview of decorrelation; then Chapter 3 actually performs it; finally, Chapter 4 explains how to understand decorrelation mathematically.
As the name implies, decorrelation means **eliminating the correlation between variables**. But what is wrong with variables being highly correlated in the first place?
The short answer: **the variance of the partial regression coefficients becomes large, so the accuracy of the model tends to be unstable**.
...That probably doesn't mean much on its own, so let me explain a little more.
For example, the formula for a regression model is commonly expressed as:
$ y = a_1x_1 + a_2x_2 + \cdots + b $
Multiple regression analysis plugs actual data into $ y $ (the objective variable) and $ x_1 $, $ x_2 $ (the explanatory variables) to estimate the partial regression coefficients $ a_1 $, $ a_2 $ and the constant term $ b $.
Deriving the variance of a partial regression coefficient (intuitively, how widely the estimated coefficient tends to scatter) gets complicated, so I will state only the conclusion: **a factor of (1 − correlation coefficient²) appears in the denominator of the formula for the variance of a partial regression coefficient**.
In other words, **the larger the correlation, the smaller that denominator, and therefore the larger the variance of the partial regression coefficient: the coefficient can take a wide range of values, so the accuracy of the model becomes unstable**.
That is the theory.
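To make this concrete, here is a small simulation of my own (toy synthetic data, not from the original article): fit ordinary least squares many times on data where the two explanatory variables are either weakly correlated or almost collinear, and compare how widely the fitted coefficient scatters across samples.

```python
import numpy as np

rng = np.random.default_rng(0)

def coef_spread(corr, n_trials=200, n=100):
    """Standard deviation of the fitted coefficient of x1 over repeated samples."""
    coefs = []
    for _ in range(n_trials):
        x1 = rng.normal(size=n)
        # Make x2 correlated with x1; `corr` controls the strength
        x2 = corr * x1 + np.sqrt(1 - corr ** 2) * rng.normal(size=n)
        # True model: y = 1.0 * x1 + 1.0 * x2 + noise
        y = 1.0 * x1 + 1.0 * x2 + rng.normal(size=n)
        # Ordinary least squares with an intercept column
        X = np.column_stack([np.ones(n), x1, x2])
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)
        coefs.append(beta[1])
    return np.std(coefs)

spread_low = coef_spread(0.1)    # weakly correlated explanatory variables
spread_high = coef_spread(0.99)  # almost collinear explanatory variables
print(spread_low, spread_high)
```

The spread of the estimated coefficient is several times larger in the highly correlated case, even though the true model is identical, which is exactly the instability described above.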
As we saw in (1), when variables are highly correlated, it does not follow that one of them should simply be deleted.
That is because **"high correlation between variables" only means the variables are in a roughly linear relationship**.
→ So **if you delete one carelessly, you may also throw away important information that the variable actually carries**.
This is where decorrelation comes in.
We build the model after removing the correlation between the variables.
This is hard to picture in the abstract, so let's actually try it.
This time, as a concrete example, I will use Kaggle's kickstarter-projects dataset, which I always use: https://www.kaggle.com/kemical/kickstarter-projects
This chapter is long, but **the essential decorrelation is only in (ⅶ)**, so it is fine to jump there first.
※Important notes※
・This time I could not find explanatory variables that genuinely needed decorrelation, so I decorrelate variables that are not actually used for model building. Please treat this chapter purely as a demonstration that "decorrelation is done like this".
・A site that happened to perform decorrelation on the same kickstarter-projects dataset I always use in my articles served as a reference: https://ds-blog.tbtech.co.jp/entry/2019/04/27/Kaggle%E3%81%AB%E6%8C%91%E6%88%A6%E3%81%97%E3%82%88%E3%81%86%EF%BC%81_%EF%BD%9E%E3%82%B3%E3%83%BC%E3%83%89%E8%AA%AC%E6%98%8E%EF%BC%92%EF%BD%9E
# Import numpy and pandas
import numpy as np
import pandas as pd
# Import seaborn for the correlation heatmap
import seaborn as sns
# Import datetime to process the date columns
import datetime
df = pd.read_csv("ks-projects-201801.csv")
The following shows that this is a dataset of shape (378661, 15):
df.shape
I will omit the details, but since the data contains each campaign's start time and end time, I convert these into the "number of campaign days":
df['deadline'] = pd.to_datetime(df["deadline"])
df["launched"] = pd.to_datetime(df["launched"])
df["days"] = (df["deadline"] - df["launched"]).dt.days
I will omit the details here as well: the objective variable "state" has categories other than success ("successful") and failure ("failed"), but this time I keep only the success and failure rows.
df = df[(df["state"] == "successful") | (df["state"] == "failed")]
Then replace "successful" with 1 and "failed" with 0.
df["state"] = df["state"].replace("failed",0)
df["state"] = df["state"].replace("successful",1)
Jumping ahead a little: since "usd pledged" is the variable I will use from here on, I impute missing values for this column only.
df["usd pledged"] = df["usd pledged"].fillna(df["usd pledged"].mean())
Let's check the correlation between the variables. (In recent pandas versions, `corr()` needs `numeric_only=True` when the frame contains non-numeric columns.)
sns.heatmap(df.corr(numeric_only=True))
Now let's decorrelate "pledged" and "usd pledged", which are highly correlated with each other.
Decorrelation itself only requires the code below. The meaning may not be clear yet; for now read it as "so this is how it's done", and see Chapter 4, "Understanding from Mathematics".
# Store only the pledged and usd pledged columns of df in df_corr
df_corr = pd.DataFrame({'pledged' : df['pledged'], 'usdpledged' : df['usd pledged']})
# Compute the variance-covariance matrix (rowvar=0 treats each column as a variable)
cov = np.cov(df_corr, rowvar=0)
# Store the eigenvectors of the covariance matrix in S
_, S = np.linalg.eig(cov)
# Decorrelate the data (.T denotes the transpose)
pledged_decorr = np.dot(S.T, df_corr.T).T
This completes the decorrelation. As a check, let's look at the correlation coefficient between the decorrelated "pledged" and "usd pledged".
print('Correlation coefficient: {:.3f}'.format(np.corrcoef(pledged_decorr[:, 0], pledged_decorr[:, 1])[0,1]))
This displays "Correlation coefficient: 0.000". The decorrelation succeeded!
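As a self-contained sanity check (synthetic toy data of my own, not the Kickstarter dataset), the same recipe makes the covariance between two correlated columns vanish. Here I use `np.linalg.eigh`, the eigensolver for symmetric matrices, which on a covariance matrix plays the same role as `np.linalg.eig`:

```python
import numpy as np

rng = np.random.default_rng(42)

# Two strongly correlated toy variables
x = rng.normal(size=1000)
y = 0.9 * x + 0.2 * rng.normal(size=1000)
X = np.column_stack([x, y])

# Eigenvectors of the variance-covariance matrix
cov = np.cov(X, rowvar=False)
_, S = np.linalg.eigh(cov)

# Project the data onto the eigenvectors; same as np.dot(S.T, X.T).T
X_dec = X @ S

# The covariance matrix of the decorrelated data is (numerically) diagonal
cov_dec = np.cov(X_dec, rowvar=False)
print(cov_dec)
```

The off-diagonal entries of `cov_dec` come out at floating-point zero, so the correlation between the transformed columns is 0.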
Now, in this chapter, let's see how decorrelation is actually handled mathematically. As mentioned at the beginning, understanding it requires some familiarity with matrices and with eigenvalues/eigenvectors.
If you find it difficult you can skip this chapter; the explanation is not rigorous, but roughly it goes as follows.
Suppose we have several explanatory variables, $ \boldsymbol{x_1}, \boldsymbol{x_2}, \ldots, \boldsymbol{x_n} $.
The variance-covariance matrix of these variables can be written as follows.
The blue frame marks the covariance of each pair of variables, and the red frame marks the variance of each variable.
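Concretely, writing $ \sigma_i^2 $ for the variance of $ \boldsymbol{x_i} $ and $ \sigma_{ij} $ for the covariance of $ \boldsymbol{x_i} $ and $ \boldsymbol{x_j} $, this matrix has the general form:

```math
\Sigma =
\begin{pmatrix}
\sigma_1^2  & \sigma_{12} & \cdots & \sigma_{1n} \\
\sigma_{21} & \sigma_2^2  & \cdots & \sigma_{2n} \\
\vdots      & \vdots      & \ddots & \vdots      \\
\sigma_{n1} & \sigma_{n2} & \cdots & \sigma_n^2
\end{pmatrix}
```

The diagonal entries (red frame) are the variances, and the off-diagonal entries (blue frame) are the covariances.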
**Diagonalizing** this variance-covariance matrix transforms it as follows!
...That may not mean much yet. The important point is that **all the blue-frame entries (the covariances) become 0**.
**Diagonalization drives the covariances to 0, and that is exactly what decorrelation does.**
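In symbols: collecting the eigenvectors of $ \Sigma $ as the columns of a matrix $ S $ (which, since $ \Sigma $ is symmetric, can be taken orthogonal, so $ S^{-1} = S^{\top} $), diagonalization gives:

```math
S^{\top} \Sigma \, S =
\begin{pmatrix}
\lambda_1 & 0         & \cdots & 0 \\
0         & \lambda_2 & \cdots & 0 \\
\vdots    & \vdots    & \ddots & \vdots \\
0         & 0         & \cdots & \lambda_n
\end{pmatrix}
```

where the $ \lambda_i $ are the eigenvalues. The transformed data $ S^{\top}\boldsymbol{x} $ therefore has covariance matrix $ S^{\top} \Sigma S $, whose covariances are all 0. This is exactly what `np.dot(S.T, df_corr.T).T` computed in Chapter 3.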
So why does a covariance of 0 between variables mean they are uncorrelated?
To see that, recall the formula for the correlation coefficient. Writing the correlation coefficient as $ r $:
$ r = \dfrac{\mathrm{Cov}(x, y)}{\sigma_x \, \sigma_y} $
that is, the covariance divided by the product of the standard deviations of $ x $ and $ y $. From this equation we can see that **if the covariance is 0, the numerator is 0, so the correlation coefficient is 0**.
That is why **diagonalization sets the covariance between the variables to 0, which sets the correlation coefficient to 0 and achieves decorrelation**.
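As a quick check of this formula (toy data of my own, not from the article), the hand-computed value matches NumPy's built-in `np.corrcoef`:

```python
import numpy as np

rng = np.random.default_rng(7)
x = rng.normal(size=500)
y = 2.0 * x + rng.normal(size=500)

# Correlation coefficient from the definition: covariance / (std_x * std_y)
cov_xy = np.cov(x, y)[0, 1]
r_manual = cov_xy / (np.std(x, ddof=1) * np.std(y, ddof=1))

# NumPy's built-in correlation coefficient agrees
r_numpy = np.corrcoef(x, y)[0, 1]
print(r_manual, r_numpy)
```

Note the `ddof=1` (sample standard deviation), which matches the normalization `np.cov` uses by default.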
How was it? My own view is that at first, when you cannot yet interpret complicated code, it is very important to simply implement a basic end-to-end pipeline with scikit-learn and the like, without worrying too much about accuracy.
Once you are used to that, though, I feel it becomes just as important to understand, from the mathematical background, how these tools work behind the scenes.
Some parts may be hard to follow, but I hope this helps deepen your understanding.