You can compute the VIF (Variance Inflation Factor) in Python and use the result to check for multicollinearity among the explanatory variables. As a rule of thumb, VIF > 10 is taken to indicate strong multicollinearity.
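import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Load the data and keep only the numeric columns
df_all = pd.read_excel('train.xlsx', sheet_name="Sheet1")
cols = df_all.select_dtypes(include=[np.number]).columns
# Skip the first numeric column (presumably the objective variable)
cols_x = cols[1:]
data_x = df_all[cols_x]

# Calculate VIF for each explanatory variable
vif = pd.DataFrame()
vif["VIF Factor"] = [variance_inflation_factor(data_x.values, i) for i in range(data_x.shape[1])]
# vif["features"] = data_x.columns  # uncomment to label each row with its column name

# Output the calculation result of VIF
print(vif)

# Graph VIF
plt.plot(vif["VIF Factor"])
plt.show()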
The result will come out like this. It's convenient!
However, I discovered that Python and Excel gave different VIF results ('Д') .. !!
To begin with, VIF is calculated by the following formula.
VIF = 1/(1-R^2) # R^2: coefficient of determination
Here R^2 is the coefficient of determination obtained when one of the explanatory variables is treated as the objective variable and regressed on the remaining explanatory variables by multiple regression. Intuitively, if a variable can be expressed well by the remaining explanatory variables, that variable is probably unnecessary. Since the VIFs differ, this R^2 must differ between Python and Excel, so I panicked for a moment.
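To make the formula concrete, here is a minimal NumPy sketch of the textbook calculation: each column is regressed on the others with an intercept, and VIF is computed from the resulting R^2. The function name `vif_with_intercept` and the synthetic data are made up for illustration, and this is not necessarily how any particular library computes it.

```python
import numpy as np

def vif_with_intercept(X):
    """Textbook VIF: regress each column on the others (plus an intercept)."""
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    vifs = np.empty(k)
    for i in range(k):
        y = X[:, i]                              # column treated as the objective variable
        others = np.delete(X, i, axis=1)         # remaining explanatory variables
        A = np.column_stack([np.ones(n), others])  # prepend an intercept column
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ coef
        ss_res = resid @ resid                   # residual sum of squares
        ss_tot = ((y - y.mean()) ** 2).sum()     # centered total sum of squares
        r2 = 1.0 - ss_res / ss_tot
        vifs[i] = 1.0 / (1.0 - r2)
    return vifs

# Hypothetical data: x3 is almost x1 + x2, x4 is unrelated to the rest
rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
x3 = x1 + x2 + 0.1 * rng.normal(size=n)
x4 = rng.normal(size=n)

vifs = vif_with_intercept(np.column_stack([x1, x2, x3, x4]))
print(vifs)
```

In this made-up data the three nearly collinear columns get a large VIF, while the unrelated column stays close to 1.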
It turned out that the cause was whether or not an intercept is included in the regression.
On the Python side, the regression is processed with intercept = 0, whereas in Excel I had not fixed the intercept at 0, so it was estimated.
When I set the intercept to 0 in Excel, I was able to confirm that the VIFs match.
↑ Whether or not this box is checked makes the difference.
- Is there any problem with using Python's statsmodels as it is?
- Should the intercept be specified?
- Or should VIF be evaluated with whichever setting maximizes R^2, regardless of whether the intercept is 0?
- VIF is only a guide, so is this not worth worrying about?
Those are the things I'm wondering about. What does everyone think? I'm also curious what algorithm statsmodels actually uses to calculate VIF in the first place ...