The chi-square test is a test for cross-tabulated data: it lets you statistically test whether there is a relationship between the categories (for example, whether men prefer tea while women prefer water). If you want to know more about the chi-square test itself, see the link above. However, even when the p-value falls below the significance level, there are two points that need to be considered. After all, what the chi-square test checks is whether there is bias in the crosstabulation table as a whole; even if the result is significant, it does not mean that every category is related.
The first point is a condition that should properly be confirmed before performing the chi-square test at all. The applicability criterion for the chi-square test is that **cells with an expected frequency below 5 must not exceed 20% of all cells in the crosstabulation table**, which is known as **Cochran's rule**. The exact threshold varies between sources, and you will also see formulations such as "25% or more" and "greater than 20%". In Python, the chi-square test is typically done with scipy.stats.chi2_contingency, so you can use the expected-frequency table returned by this function to check Cochran's rule.
import numpy as np
from scipy import stats

# Chi-square test: cross is the crosstab as a 2-D numpy array
x2, p, dof, expected = stats.chi2_contingency(cross)

# Cochran's rule: fraction of cells whose expected frequency is below 5
(expected < 5).mean()
If this fraction (the proportion of cells with an expected frequency below 5) is less than 0.2, Cochran's rule is satisfied. If you find that your data does not meet this rule, it is better to switch to Fisher's exact test (https://en.wikipedia.org/wiki/Fisher%27s_exact_test).
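As a quick sketch of what that switch might look like: scipy provides scipy.stats.fisher_exact, which handles 2x2 tables (the counts below are invented, purely for illustration).

# Fisher's exact test: scipy's fisher_exact supports 2x2 tables
# (the counts here are made up for illustration)
table_2x2 = [[8, 2],
             [1, 5]]
oddsratio, p_fisher = stats.fisher_exact(table_2x2)
print(p_fisher)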
If you run the chi-square test in R, it returns the adjusted standardized residuals of each cell together with the test result, so there is no extra work; in Python, however, it seems you need to compute them manually.
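(As an aside, not part of the original flow: if statsmodels is available, it appears to expose these residuals directly through its contingency-table class; a minimal sketch, assuming cross is the same 2-D crosstab as above.)

import statsmodels.api as sm

# statsmodels' Table class computes the adjusted standardized
# residuals directly (assumes cross is defined as above)
table = sm.stats.Table(cross)
print(table.standardized_resids)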
The residual is defined as

\text{residual} = \text{observed value} - \text{expected value}

However, in order to calculate the adjusted standardized residuals, we additionally need to define the **residual variance**:

\text{residual variance} = \left(1 - \frac{\text{row marginal total}}{\text{grand total}}\right)\left(1 - \frac{\text{column marginal total}}{\text{grand total}}\right)
See the reference site below for the details. Based on these definitions, the adjusted standardized residual is

\text{adjusted standardized residual} = \frac{\text{residual}}{\sqrt{\text{expected value} \times \text{residual variance}}}

Writing the flow up to this point in Python code looks like this.
# Residuals: observed minus expected frequencies
res = cross - expected

# Residual variance of each cell:
# (1 - row marginal / grand total) * (1 - column marginal / grand total)
n = cross.sum()
res_var = np.outer(1 - cross.sum(axis=1) / n, 1 - cross.sum(axis=0) / n)

# Adjusted standardized residuals
stdres = res / np.sqrt(expected * res_var)

# An adjusted standardized residual with absolute value 1.96 or greater
# indicates a significant cell at the 5% level. Here the residuals are
# converted to two-sided p-values using the standard normal distribution.
stats.norm.sf(np.abs(stdres)) * 2
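Putting the whole flow together, here is a minimal end-to-end sketch using a made-up crosstab (men/women versus tea/water, echoing the opening example; all counts are invented):

import numpy as np
from scipy import stats

# Invented crosstab: rows = men / women, columns = prefer tea / prefer water
cross = np.array([[30, 10],
                  [15, 25]])

# Overall chi-square test
x2, p, dof, expected = stats.chi2_contingency(cross)
print(p)  # overall p-value

# Adjusted standardized residuals per cell
n = cross.sum()
res_var = np.outer(1 - cross.sum(axis=1) / n, 1 - cross.sum(axis=0) / n)
stdres = (cross - expected) / np.sqrt(expected * res_var)
print(stats.norm.sf(np.abs(stdres)) * 2)  # two-sided p-values per cell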
I hope you found this useful.
Reference: https://note.chiebukuro.yahoo.co.jp/detail/n71838