When analyzing data, you often want to look at the correlation between variables. For two numerical variables you can compute the correlation coefficient, but what if one or both variables are categorical? I looked into it, so here is a summary.
When both variables are numerical, you can use the well-known (Pearson) correlation coefficient. Its definition is as follows.
r=\frac{\sum_{i}(x_{i}-\bar{x})(y_{i}-\bar{y})}{\sqrt{\sum_{i}(x_{i}-\bar{x})^2}\sqrt{\sum_{i}(y_{i}-\bar{y})^2}}
To compute the correlation coefficient in Python, use the corr() method of pandas.DataFrame.
import numpy as np
import pandas as pd

x = np.random.randint(1, 10, 100)
y = np.random.randint(1, 10, 100)
data = pd.DataFrame({'x': x, 'y': y})
data.corr()
If the value is 0, there is no correlation; the closer it is to 1, the stronger the positive correlation; and the closer it is to -1, the stronger the negative correlation.
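As a sanity check, the definition above can be computed directly with NumPy and compared against pandas' corr(). This is a small sketch of my own; the correlated toy data and the seed are assumptions, not from the original.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
x = rng.integers(1, 10, 100).astype(float)
y = x + rng.normal(0, 2, 100)  # y roughly follows x, so r should be clearly positive

# Pearson r straight from the definition
r_manual = ((x - x.mean()) * (y - y.mean())).sum() / (
    np.sqrt(((x - x.mean()) ** 2).sum()) * np.sqrt(((y - y.mean()) ** 2).sum())
)

# The same quantity via pandas
r_pandas = pd.DataFrame({"x": x, "y": y}).corr().loc["x", "y"]

print(abs(r_manual - r_pandas) < 1e-9)  # the two computations agree
```

The (n-1) factors in the sample variance and covariance cancel, which is why the raw sums in the formula match pandas' output exactly.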
For the correlation between a categorical variable and a numerical variable, a statistic called the correlation ratio is used. The definition is as follows.
r=\frac{\sum_{cat}n_{cat}\times(\bar{x}_{cat}-\bar{x})^2}{\sum_{i}(x_{i}-\bar{x})^2}
where $n_{cat}$ is the number of items in category $cat$, $\bar{x}_{cat}$ is the mean within that category, $\bar{x}$ is the overall mean, and the denominator is the total sum of squared deviations.
Please refer to here for a specific example. The numerator expresses how far apart the category means are: the farther apart the categories, the larger the numerator and the stronger the correlation.
This correlation ratio is also 0 when there is no correlation, and indicates a stronger correlation as it approaches 1 (unlike the correlation coefficient, it is never negative).
In Python, the calculation looks like this (see here).
def correlation_ratio(cat_key, num_key, data):
    categorical = data[cat_key]
    numerical = data[num_key]

    mean = numerical.dropna().mean()
    all_var = ((numerical - mean) ** 2).sum()  # total sum of squared deviations

    unique_cat = pd.Series(categorical.unique())
    unique_cat = list(unique_cat.dropna())

    categorical_num = [numerical[categorical == cat] for cat in unique_cat]
    # category count × (category mean − overall mean)^2
    categorical_var = [
        len(x.dropna()) * (x.dropna().mean() - mean) ** 2 for x in categorical_num
    ]

    r = sum(categorical_var) / all_var
    return r
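For intuition, here is a minimal self-contained sketch (the toy data is my own) where each category completely determines the numerical value, so the between-category variation accounts for all of the total variation and the correlation ratio comes out as exactly 1:

```python
import pandas as pd

# Toy data: within each category the value is constant
df = pd.DataFrame({"cat": ["a", "a", "b", "b"],
                   "num": [1.0, 1.0, 5.0, 5.0]})

num = df["num"]
mean = num.mean()
all_var = ((num - mean) ** 2).sum()  # total sum of squared deviations

# category count × (category mean − overall mean)^2, summed over categories
between = sum(len(g) * (g.mean() - mean) ** 2 for _, g in num.groupby(df["cat"]))

eta = between / all_var
print(eta)  # → 1.0
```

Shuffling the values so the category means coincide would instead drive the numerator, and hence the ratio, toward 0.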
For the correlation between two categorical variables, we use a statistic called Cramér's V. The definition is
r=\sqrt{\frac{\chi^2}{n(k-1)}}
Here $\chi^{2}$ is the chi-square statistic, $n$ is the number of data items, and $k$ is the smaller of the two variables' numbers of categories. Please refer to here for the chi-square statistic. Roughly speaking, it expresses how much the distribution within each category differs from the overall distribution. Again, a value close to 0 means no correlation, and a value close to 1 means a strong correlation.
To calculate it in Python, do the following ([here](https://qiita.com/shngt/items/45da2d30acf9e84924b7#%E3%82%AF%E3%83%A9%E3%83%A1%E3%83%BC%E3%83%AB%E3%81%AE%E9%80%A3%E9%96%A2%E4%BF%82%E6%95%B0)).
import scipy.stats as st

def cramerV(x, y, data):
    table = pd.crosstab(data[x], data[y])
    x2, p, dof, e = st.chi2_contingency(table, correction=False)

    n = table.sum().sum()
    r = np.sqrt(x2 / (n * (np.min(table.shape) - 1)))
    return r
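As a quick check (again with toy data of my own), a 2×2 table with a perfect association should give Cramér's V of exactly 1:

```python
import numpy as np
import pandas as pd
import scipy.stats as st

# Toy data: knowing x fully determines y, i.e. a perfect association
df = pd.DataFrame({"x": ["a"] * 10 + ["b"] * 10,
                   "y": ["p"] * 10 + ["q"] * 10})

table = pd.crosstab(df["x"], df["y"])
x2, p, dof, e = st.chi2_contingency(table, correction=False)

n = table.sum().sum()
v = np.sqrt(x2 / (n * (min(table.shape) - 1)))
print(v)  # → 1.0
```

Note that correction=False disables Yates' continuity correction; with the correction on, even a perfectly associated 2×2 table yields a value below 1.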
That alone would just be a rehash of the previous article, so I wrote a method that computes all of these indices at once for a DataFrame. No more looking them up one by one!
def is_categorical(data, key):
    col_type = data[key].dtype
    if col_type == 'int':
        nunique = data[key].nunique()
        return nunique < 6  # treat low-cardinality integer columns as categorical
    elif col_type == 'float':
        return False
    else:
        return True
def get_corr(data, categorical_keys=None):
    keys = data.keys()
    if categorical_keys is None:
        categorical_keys = keys[[is_categorical(data, key) for key in keys]]

    corr = pd.DataFrame({})
    corr_ratio = pd.DataFrame({})
    corr_cramer = pd.DataFrame({})

    for key1 in keys:
        for key2 in keys:
            if (key1 in categorical_keys) and (key2 in categorical_keys):
                r = cramerV(key1, key2, data)
                corr_cramer.loc[key1, key2] = r
            elif (key1 in categorical_keys) and (key2 not in categorical_keys):
                r = correlation_ratio(cat_key=key1, num_key=key2, data=data)
                corr_ratio.loc[key1, key2] = r
            elif (key1 not in categorical_keys) and (key2 in categorical_keys):
                r = correlation_ratio(cat_key=key2, num_key=key1, data=data)
                corr_ratio.loc[key1, key2] = r
            else:
                r = data.corr().loc[key1, key2]
                corr.loc[key1, key2] = r

    return corr, corr_ratio, corr_cramer
Unless specified otherwise, which keys are categorical variables is determined automatically from the column types.
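To illustrate that heuristic on a tiny DataFrame (my own example; the function is repeated so the snippet runs standalone, and note that comparing the dtype to 'int' matches the platform's default integer width): low-cardinality integer columns count as categorical, floats never do, and everything else (such as strings) does.

```python
import pandas as pd

def is_categorical(data, key):  # repeated from above for a standalone snippet
    col_type = data[key].dtype
    if col_type == 'int':
        return data[key].nunique() < 6
    elif col_type == 'float':
        return False
    else:
        return True

df = pd.DataFrame({"survived": [0, 1, 0, 1],
                   "age": [22.0, 38.0, 26.0, 35.0],
                   "sex": ["m", "f", "f", "m"]})

# On a 64-bit Linux build this prints [True, False, True]
print([is_categorical(df, k) for k in df.columns])
```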
Let's apply it to the Titanic data.
data = pd.read_csv("train.csv")
data = data.drop(["PassengerId", "Name", "Ticket", "Cabin"], axis=1)

category = ["Survived", "Pclass", "Sex", "Embarked"]
corr, corr_ratio, corr_cramer = get_corr(data, category)
corr
corr_ratio
corr_cramer
In addition, the results can be visualized with a seaborn heatmap.
import seaborn as sns
sns.heatmap(corr_cramer, vmin=-1, vmax=1)
The explanations of each statistic ended up rather rough, so please see the pages listed in the references. Even when I write these things up, I end up forgetting them and looking them up again, so I try to build methods that automate as much as possible. The source for these methods is also on GitHub, so feel free to use it.
[Calculate the relationship between variables of various scales (Python)](https://qiita.com/shngt/items/45da2d30acf9e84924b7#%E3%82%AF%E3%83%A9%E3%83%A1%E3%83%BC%E3%83%AB%E3%81%AE%E9%80%A3%E9%96%A2%E4%BF%82%E6%95%B0): correlation analysis, correlation ratio, chi-square test, Cramér's V