If you want to try machine learning, anyone can implement it relatively easily with scikit-learn and similar libraries. However, if you want to achieve results at work or raise your own level, an explanation like "I don't know the background, but I got this result" is clearly weak.
This time, I would like to write about the "correlation coefficient", which is often used in preprocessing. Many people know that the correlation coefficient lies between -1 and 1, but can you explain **"why it takes a value between -1 and 1"**?
In this article, I briefly introduce the correlation coefficient and then pursue two aims: from section 2, "Theory can wait; first try visualizing the correlation coefficient with Python", and from section 4 onward, "Understand the background from mathematics".
I come from a private liberal arts school, so I am not strong at mathematics. I have tried to explain things so that they are as easy to follow as possible, even for readers who are not comfortable with math.
Similar articles have been posted for linear simple regression, logistic regression, and SVM, so please read them as well.

[Machine learning] Understanding linear simple regression from both scikit-learn and mathematics

[[Machine learning] Understanding logistic regression from both scikit-learn and mathematics](https://qiita.com/Hawaii/items/ee2a0687ca451fe213be)

[[Machine learning] Understanding SVM from both scikit-learn and mathematics](https://qiita.com/Hawaii/items/4688a50cffb2140f297d)
The correlation coefficient is an index that measures the strength of the linear relationship between two random variables, and it takes a real value between -1 and 1. Source: [Wikipedia](https://ja.wikipedia.org/wiki/%E7%9B%B8%E9%96%A2%E4%BF%82%E6%95%B0)
Roughly speaking, when the correlation coefficient is positive, the larger one variable is, the larger the other tends to be; when it is negative, the larger one variable is, the smaller the other tends to be.
This is only a rule of thumb, but guidelines such as those in the following source are commonly used to judge the strength of a correlation. [Source](https://sci-pursuit.com/math/statistics/correlation-coefficient.html)
It is easy to get confused here, but be aware that a weak correlation does not mean that **there is no relationship between the two variables**. As stated in the definition above, the correlation coefficient is **an index that measures the strength of the linear relationship between two variables**, so a relationship that is not linear **cannot be detected by the correlation coefficient**.
Let's look at a concrete example. The following two variables are clearly related, in the shape of a quadratic curve. However, their correlation coefficient is -0.447, so if you judge mechanically from the coefficient alone you would conclude that the correlation is relatively weak, and you might overlook a relationship that actually exists between the two variables.
In this way, it is important to keep in mind that **"the correlation coefficient is only an index for measuring linear relationships"** and that you should **"visualize the variables whenever possible so as not to overlook a true relationship"**.
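The point that a clear nonlinear relationship can hide behind a weak correlation coefficient can be checked with a few lines of NumPy. This is a minimal sketch on made-up data, not the data behind the example above: `x` is assumed to be symmetric around 0 and `y` is exactly `x ** 2`, so the resulting coefficient differs from the -0.447 quoted earlier.

```python
import numpy as np

# Hypothetical data: x is symmetric around 0 and y is fully determined by x (y = x^2).
x = np.linspace(-3.0, 3.0, 61)
y = x ** 2

# Pearson correlation coefficient, taken from the 2x2 correlation matrix.
r = np.corrcoef(x, y)[0, 1]
print(f"correlation between x and x^2: {r:.3f}")

# For comparison, a perfectly linear relationship on the same x.
r_linear = np.corrcoef(x, 2.0 * x + 1.0)[0, 1]
print(f"correlation between x and 2x + 1: {r_linear:.3f}")
```

Here `y` is completely determined by `x`, yet the coefficient is practically zero, while the linear case gives a coefficient of 1. A scatter plot would reveal the quadratic shape immediately.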
In machine learning, the correlation coefficient is mainly used in preprocessing. More specifically, it is used to decide which explanatory variables to use for the objective variable (= feature selection).
There are two main use cases.
**(1) Select variables that are highly correlated with the objective variable as explanatory variables**

Naturally, when building a model you need to choose explanatory variables that are related to the objective variable. (Including variables that have nothing to do with the objective variable can reduce accuracy.) The correlation coefficient is used as one indicator of this "relatedness": you calculate the correlation coefficients and select the variables judged to be strongly correlated as explanatory variables.
**(2) If two explanatory variables are highly correlated with each other, delete one of them**

I think this is easier to understand with a concrete example. The setting is fictitious, but **suppose you want to build a model that measures the technical skill of staff who specialize in shoe shining**. Technical skill is the objective variable, and among the many candidate explanatory variables are **"years of service" and "staff ID"**.
You can probably guess what happens: the longer someone has served, the earlier they joined and the smaller their staff ID; the shorter their service, the more recently they joined and the larger their staff ID. **There is clearly a strong negative correlation between the two.**
In such a case, including both the staff ID and the years of service raises the computational cost and may have an unwanted effect on model building, so you delete one of them from the explanatory variables.
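The two use cases above can be sketched with pandas as follows. This is only an illustration under stated assumptions: the data is randomly generated, the column names (`years_of_service`, `staff_id`, `shoe_size`, `technical_skill`) are hypothetical, and the thresholds 0.4 and 0.9 are arbitrary choices, not values taken from this article.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200

# Hypothetical staff data (all names and values are made up for illustration).
years_of_service = rng.uniform(0, 20, n)
# Longer service means an earlier hire, hence a smaller staff ID (1 = longest-serving).
staff_id = (-years_of_service).argsort().argsort() + 1
shoe_size = rng.normal(26, 1.5, n)  # assumed to be unrelated to skill
technical_skill = 3.0 * years_of_service + rng.normal(0, 5, n)

df = pd.DataFrame({
    "years_of_service": years_of_service,
    "staff_id": staff_id,
    "shoe_size": shoe_size,
    "technical_skill": technical_skill,
})
corr = df.corr()

# (1) Keep candidates whose |correlation| with the objective variable is large enough.
target_corr = corr["technical_skill"].drop("technical_skill")
selected = target_corr[target_corr.abs() >= 0.4].index.tolist()

# (2) Among the kept candidates, drop one variable of each highly correlated pair.
to_drop = set()
for i, a in enumerate(selected):
    for b in selected[i + 1:]:
        if a not in to_drop and b not in to_drop and abs(corr.loc[a, b]) >= 0.9:
            to_drop.add(b)

features = [c for c in selected if c not in to_drop]
print(features)
```

Which variable of a correlated pair to drop is a judgment call; this sketch simply keeps the one that appears first in the column order.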
Import the following library, which is needed to compute and plot the correlation coefficients.
import seaborn as sns
Use the iris dataset.
df = sns.load_dataset("iris")
It can be output as a heat map as shown below.
sns.heatmap(df.corr(numeric_only=True), vmax=1, vmin=-1, center=0, annot=True)
The correlation coefficients themselves are computed by `df.corr()`, and the result is drawn as a heat map. This lets you check intuitively whether each correlation is strong or weak, instead of reading the numbers one by one.
Now we finally come to the main subject. Until now I never questioned the correlation coefficient and simply accepted that "it takes a value from -1 to 1", but why does it take a value from -1 to 1?
To state the conclusion first, **the correlation coefficient is equal to $ \cos\theta $, where $ \theta $ is the angle formed by the two deviation vectors**.
I would like to explain this.
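As prior knowledge, the following holds for the inner product of two vectors $ x $ and $ y $, where $ \theta $ is the angle between them. Note that $ \cos\theta $ always lies between -1 and 1.

$$
x \cdot y = \|x\| \|y\| \cos\theta
$$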
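The inner product formula can be tried out numerically. The following is a small sketch with arbitrary example vectors; the helper name `cos_angle` is hypothetical and introduced only for this illustration.

```python
import numpy as np

def cos_angle(x, y):
    """Cosine of the angle between x and y, from x . y = ||x|| ||y|| cos(theta)."""
    return np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

a = np.array([1.0, 2.0, 3.0])

same_direction = cos_angle(a, 2.0 * a)   # vectors point the same way
opposite_direction = cos_angle(a, -a)    # vectors point in opposite ways
perpendicular = cos_angle(np.array([1.0, 0.0]), np.array([0.0, 1.0]))

print(same_direction, opposite_direction, perpendicular)
```

The three cases correspond to the two extremes and the midpoint of the cosine: same direction, opposite direction, and a right angle.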
The correlation coefficient is defined as follows, where $ \sigma_{xy} $ is the covariance of $ x $ and $ y $, and $ \sigma_x $ and $ \sigma_y $ are their standard deviations.

$$
r_{xy} := \frac{\sigma_{xy}}{\sigma_x \sigma_y}
$$

Intuitively, the covariance is a number that expresses how the two sets of data vary together, but on its own it is hard to tell whether a given value is large or small. Dividing by the standard deviations normalizes it (= puts it on a common scale).
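The definition can be computed step by step and compared with NumPy's built-in result. This is a minimal sketch on made-up numbers; it assumes the population versions of the covariance and standard deviation (dividing by $ N $). Dividing by $ N - 1 $ instead would give the same ratio, since the factor cancels.

```python
import numpy as np

# Arbitrary example data (not from the article).
x = np.array([1.0, 2.0, 4.0, 7.0, 11.0])
y = np.array([2.0, 3.0, 3.5, 8.0, 9.0])

# Covariance: the mean of the products of deviations from the mean.
cov_xy = np.mean((x - x.mean()) * (y - y.mean()))

# Standard deviations (np.std divides by N by default, i.e. ddof=0).
sigma_x = np.std(x)
sigma_y = np.std(y)

# Correlation coefficient according to the definition r_xy = cov_xy / (sigma_x * sigma_y).
r_xy = cov_xy / (sigma_x * sigma_y)

# Reference value from NumPy for comparison.
r_numpy = np.corrcoef(x, y)[0, 1]
print(r_xy, r_numpy)
```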
Starting from the prior knowledge in (1), the inner product formula, we can rearrange as follows.
$$
\begin{align}
x \cdot y &= \|x\| \|y\| \cos\theta \\
\cos\theta &= \frac{x \cdot y}{\|x\| \|y\|} \\
&= \frac{\frac{x \cdot y}{N}}{\frac{\|x\|}{\sqrt{N}} \frac{\|y\|}{\sqrt{N}}} \quad \text{(numerator and denominator are both divided by the number of data points } N \text{)}
\end{align}
$$
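If $ x $ and $ y $ are taken to be the deviation vectors, whose components are $ x_i - \bar{x} $ and $ y_i - \bar{y} $ for the raw data values $ x_i $ and $ y_i $, this expression is exactly the covariance of the two variables divided by their standard deviations:

$$
\cos\theta = \frac{\frac{1}{N}\sum_{i=1}^{N}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\frac{1}{N}\sum_{i=1}^{N}(x_i - \bar{x})^2}\,\sqrt{\frac{1}{N}\sum_{i=1}^{N}(y_i - \bar{y})^2}} = \frac{\sigma_{xy}}{\sigma_x \sigma_y}
$$

As a result, we have arrived at the same form as the definition of the correlation coefficient given in (2).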
In other words, the correlation coefficient between $ x $ and $ y $ is equal to $ \cos\theta $, where $ \theta $ is the angle between their deviation vectors. Since the cosine of any angle lies in the range -1 to 1, the correlation coefficient also lies in the range -1 to 1.
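The conclusion can also be confirmed numerically. The sketch below uses randomly generated data (an assumption made for illustration) and compares the cosine of the angle between the deviation vectors with the correlation coefficient returned by `np.corrcoef`. It also computes the cosine for the raw, uncentered vectors to show that subtracting the mean matters.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical data: y depends linearly on x, plus noise.
x = rng.normal(10.0, 2.0, 100)
y = 0.5 * x + rng.normal(0.0, 1.0, 100)

# Deviation vectors: each value minus the mean of its variable.
dx = x - x.mean()
dy = y - y.mean()

# Cosine of the angle between the deviation vectors.
cos_theta = np.dot(dx, dy) / (np.linalg.norm(dx) * np.linalg.norm(dy))

# Correlation coefficient computed by NumPy.
r = np.corrcoef(x, y)[0, 1]

# Cosine of the angle between the raw vectors (no centering), for contrast.
cos_raw = np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

print(f"cos(theta) of deviation vectors: {cos_theta:.6f}")
print(f"correlation coefficient:         {r:.6f}")
print(f"cos(theta) of raw vectors:       {cos_raw:.6f}")
```

The first two values agree up to floating-point error, while the third does not, because the raw vectors still contain the means.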
As described so far, the correlation coefficient is, by definition, the same as $ \cos\theta $ for the angle $ \theta $ formed by the deviation vectors of the two variables, and $ \cos\theta $ lies in the range -1 to 1, so the correlation coefficient also lies in the range -1 to 1.
How was it? My view is that if you are given a very complicated explanation from the start, you cannot understand it and cannot move forward, so I think it is very important to set the theory aside at first and simply try building a machine learning model (and computing correlation coefficients for that purpose).
However, once you get used to it, I feel it is very important to understand what the correlation coefficient really means from its mathematical background.
I hope it helps you to deepen your understanding.