This is about the confidence interval for the **difference in population proportions**, not the confidence interval for a single population proportion.

What is the difference in population proportions?

I will skip the detailed explanation here. The following site explains it clearly.

Confidence interval for difference in population ratio

Why you would want to find it

In business, we often run a chi-square test or a test for a difference in population proportions. Of course, the conclusion about **whether there is a significant difference** matters, but if you look only at that, it is hard to grasp **the effect size and its variability**. So the idea here is to make the result a little more intuitive.

A library function for the confidence interval of a single population proportion seems to exist, but I could not find one for the confidence interval of the difference in population proportions (after a one-minute search).

How to use Python to estimate the 95% confidence interval for the population ratio and determine a reasonable sample size

The calculation formula is not complicated, so implement it quickly.

Formula

(\hat{p}_1 - \hat{p}_2) - z_{\alpha/2} \sqrt{\frac{\hat{p}_1(1 - \hat{p}_1)}{n_1} + \frac{\hat{p}_2(1 - \hat{p}_2)}{n_2}} \leq p_1 - p_2 \leq (\hat{p}_1 - \hat{p}_2) + z_{\alpha/2} \sqrt{\frac{\hat{p}_1(1 - \hat{p}_1)}{n_1} + \frac{\hat{p}_2(1 - \hat{p}_2)}{n_2}}

For the details, see the site introduced earlier. The expression on the left is called the lower bound, and the expression on the right is called the upper bound.

If the interval from the lower bound to the upper bound does not contain 0, you can say that there is a significant difference (at the 5% level for a 95% interval).

How to find the 95% confidence interval? Relationship with significant differences and the meaning and formula of 1.96
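As a minimal sketch of the formula and the "does the interval contain 0" check, the bounds can be computed directly from the counts. The function name `diff_proportion_ci` and its parameters are hypothetical, chosen only for illustration. The counts are the ones from the sample table used later in this article, and the z-value is taken from the standard normal distribution instead of being hard-coded as 1.96.

```python
from math import sqrt
from statistics import NormalDist


def diff_proportion_ci(x1, n1, x2, n2, confidence=0.95):
    """Normal-approximation CI for p1 - p2 (hypothetical helper)."""
    p1, p2 = x1 / n1, x2 / n2
    # Two-sided z-value: about 1.96 when confidence is 0.95
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    se = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    diff = p1 - p2
    return diff - z * se, diff + z * se


# Counts from the sample table: 50 of 150 men and 40 of 160 women purchased
lb, ub = diff_proportion_ci(50, 150, 40, 160)
significant = not (lb <= 0 <= ub)  # True only if the interval excludes 0
print('{:.3f} <= p1 - p2 <= {:.3f}, significant: {}'.format(lb, ub, significant))
```

With these counts the interval comes out at roughly -0.018 to 0.184. It contains 0, so the difference would not be judged significant at the 5% level.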

Source code

I firmly believe that code is only worth something if it actually runs, so here it is.

The idea is to feed in a 2x2 cross-tabulation table like the following as a CSV file.

| | Purchase | Not purchased |
|---|---|---|
| Man | 50 | 100 |
| Woman | 40 | 120 |
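Because the script that follows reads the file with `csv.QUOTE_NONNUMERIC`, every unquoted field is converted to a float. The file therefore presumably holds only the four counts, without the header row or the row labels. Here is a small sketch of writing such a file and reading it back. The file name `test.csv` comes from the script, while the temporary directory is an assumption made just for this illustration.

```python
import csv
import os
import tempfile

# Counts only: row 1 = men, row 2 = women; column 1 = purchased
rows = [[50, 100], [40, 120]]

path = os.path.join(tempfile.mkdtemp(), 'test.csv')
with open(path, 'w', newline='') as f:
    csv.writer(f).writerows(rows)

# QUOTE_NONNUMERIC makes the reader convert unquoted fields to float
with open(path, newline='') as f:
    d = list(csv.reader(f, quoting=csv.QUOTE_NONNUMERIC))

print(d)  # [[50.0, 100.0], [40.0, 120.0]]
```

If the file did include text labels such as "Man", the reader would fail when trying to convert them to numbers, so they need to be left out.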

`main.py`


import csv
import numpy as np

# Parameters: z-value for a 95% confidence level
z = 1.96

# Read the test data (a 2x2 table of counts, numbers only)
with open('test.csv', newline='') as f:
    reader = csv.reader(f, quoting=csv.QUOTE_NONNUMERIC)
    d = [row for row in reader]

# Calculate the sample size and sample proportion of each group
n = [sum(d[0]), sum(d[1])]
p = [d[0][0] / n[0], d[1][0] / n[1]]

# Calculate the 95% confidence interval
diff = p[0] - p[1]
se = np.sqrt(p[0] * (1 - p[0]) / n[0] + p[1] * (1 - p[1]) / n[1])
lb = diff - z * se
ub = diff + z * se

# Output the result
print('95% confidence interval for the difference in population proportions: {:.3f} <= p1 - p2 <= {:.3f}'.format(lb, ub))

In conclusion

This may be a niche topic, but it should come in handy ...

I just want to find the 95% confidence interval for the difference in population ratios in Python