You can check VIF with Python and it's super convenient!

You can compute the VIF (Variance Inflation Factor) in Python and use the result to check for multicollinearity among the explanatory variables. As a rule of thumb, VIF > 10 is taken to indicate strong multicollinearity.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from statsmodels.stats.outliers_influence import variance_inflation_factor

df_all = pd.read_excel('train.xlsx', sheet_name="Sheet1")

# Numeric columns only; the first one is skipped (presumably the objective variable)
cols = df_all.select_dtypes(include=[np.number]).columns
cols_x = cols[1:]
data_x = df_all[cols_x]

# Calculate VIF for each explanatory variable
vif = pd.DataFrame()
vif["VIF Factor"] = [variance_inflation_factor(data_x.values, i) for i in range(data_x.shape[1])]
#vif["features"] = data_x.columns

# Output the calculation result of VIF
print(vif)

# Graph VIF
plt.plot(vif["VIF Factor"])
plt.show()

The result will come out like this. It's convenient!

However, when compared with the VIF calculated by Excel ...

I found that the VIF values came out different ('Д') ..!!

In the first place, VIF is calculated by the following formula.

VIF = 1/(1-R2) #R2: coefficient of determination

Here, one of the explanatory variables is treated as the objective variable, and R2 is the coefficient of determination obtained from a multiple regression on the remaining explanatory variables. Intuitively, I understand it like this: if one variable can be expressed well by the remaining explanatory variables, that variable is not really needed. A different VIF means that this R2 differs between Python and Excel, so I panicked for a moment.
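To make the formula above concrete, here is a minimal NumPy sketch that computes VIF straight from the definition, with an intercept in each auxiliary regression. The function name vif_with_intercept and the synthetic data are made up for illustration and are not from my actual train.xlsx.

```python
import numpy as np

def vif_with_intercept(X):
    """VIF of each column of X, regressing it on the other columns plus an intercept."""
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    vifs = []
    for j in range(k):
        y = X[:, j]                                  # variable treated as the objective
        others = np.delete(X, j, axis=1)             # remaining explanatory variables
        A = np.column_stack([np.ones(n), others])    # intercept column + remaining variables
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ coef
        ss_res = resid @ resid
        ss_tot = ((y - y.mean()) ** 2).sum()         # total sum of squares around the mean
        r2 = 1.0 - ss_res / ss_tot
        vifs.append(1.0 / (1.0 - r2))                # VIF = 1 / (1 - R2)
    return np.array(vifs)

# Synthetic data (assumption): x2 is almost a copy of x1, x3 is unrelated
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = x1 + 0.1 * rng.normal(size=200)
x3 = rng.normal(size=200)
X = np.column_stack([x1, x2, x3])

vifs = vif_with_intercept(X)
print(vifs)
```

In this toy data the first two columns should get a large VIF because each is almost fully explained by the other, while the third should stay close to 1.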

The cause of the difference was whether or not the intercept was included ..

It turned out that the difference came from whether an intercept is included in the regression on the explanatory variables.

The Python side computes with the intercept fixed at 0, whereas when I ran the regression in Excel I had not specified the intercept, so Excel estimated one.

I was able to confirm that the VIFs match when I set the intercept to 0 in Excel.

In Excel's regression dialog, this comes down to whether the zero-intercept option ("Constant is Zero") is checked.

I want to ask everyone .. Which is correct after all?

- Is there no problem with simply using Python's statsmodels as it is?
- Should the intercept be specified?
- Should VIF simply be evaluated with whichever setting maximizes R2, regardless of whether the intercept is 0?
- VIF is only a guide, so is there no need to worry about this?

Those are the points I'm wondering about. What do you all think? I'm also wondering what algorithm statsmodels actually uses to calculate VIF in the first place ...
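On that last question, my current reading (an assumption, not something confirmed in this post) is that variance_inflation_factor regresses the chosen column on the other columns of the matrix exactly as passed in, without adding an intercept, and that the R2 is then measured around zero rather than around the mean. Here is a rough NumPy sketch of that reading. vif_no_intercept is a name made up here, the data is synthetic, and this is not the library's code.

```python
import numpy as np

def vif_no_intercept(exog, i):
    """VIF of column i when it is regressed on the other columns with no intercept term."""
    exog = np.asarray(exog, dtype=float)
    y = exog[:, i]
    others = np.delete(exog, i, axis=1)
    coef, *_ = np.linalg.lstsq(others, y, rcond=None)
    resid = y - others @ coef
    # Uncentered R2: total sum of squares is taken around 0, not around the mean
    r2_uncentered = 1.0 - (resid @ resid) / (y @ y)
    return 1.0 / (1.0 - r2_uncentered)

# Two unrelated variables (synthetic) that both sit far from zero
rng = np.random.default_rng(1)
exog = np.column_stack([rng.normal(50, 1, 300), rng.normal(80, 1, 300)])

vif0 = vif_no_intercept(exog, 0)

# With an intercept and only two variables, R2 is the squared correlation
r = np.corrcoef(exog[:, 0], exog[:, 1])[0, 1]
vif0_centered = 1.0 / (1.0 - r ** 2)

print(vif0, vif0_centered)
```

If this reading is right, variables with large means can show a huge VIF without an intercept even when they are unrelated, which would explain a gap like the one I saw against Excel.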

If you have any opinions or advice, please feel free to let me know!!