This article describes, partly as a review for myself, how to statistically test the effect of an A/B test. It also shows how to calculate the number of samples required for an A/B test. To make the results easy to verify and understand, simulations are run in Python throughout to check the accuracy of the derived expressions.
Finally, the result is validated by reproducing Adobe's Sample Size Calculator.
I would like to test whether there is a significant difference between the click-through rates $p_1$ and $p_2$ of the ads shown to visitors of sites X and Y.
Let $\{x_i\}$ and $\{y_i\}$ be sequences that take the value 1 when a visitor clicks the ad and 0 when they do not.
Their averages are

$$\bar{x} = \frac{1}{n_1}\sum_i x_i, \qquad \bar{y} = \frac{1}{n_2}\sum_i y_i,$$

where $n_1$ and $n_2$ are the sizes of $\{x_i\}$ and $\{y_i\}$, that is, the numbers of visitors to sites X and Y. We now set up the hypotheses

Null hypothesis $H_0: p_1 = p_2$
Alternative hypothesis $H_1: p_1 \neq p_2$

If the null hypothesis is rejected, the alternative hypothesis is accepted, that is, there is a significant difference between the click-through rates of the two sites.
It is known that the total numbers of clicks $\sum_i x_i$ and $\sum_i y_i$ are well approximated by normal distributions as the sample sizes grow. Therefore,

$$\sum_i x_i \sim N\bigl(n_1 p_1,\; n_1 p_1(1-p_1)\bigr), \qquad \sum_i y_i \sim N\bigl(n_2 p_2,\; n_2 p_2(1-p_2)\bigr)$$

hold approximately. From this,

$$\bar{x} \sim N\!\left(p_1,\; \frac{p_1(1-p_1)}{n_1}\right), \qquad \bar{y} \sim N\!\left(p_2,\; \frac{p_2(1-p_2)}{n_2}\right)$$

hold. Therefore, by the properties of the normal distribution,

$$\bar{x} - \bar{y} \sim N\!\left(p_1 - p_2,\; \frac{p_1(1-p_1)}{n_1} + \frac{p_2(1-p_2)}{n_2}\right)$$

holds. Based on this, the test statistic $u$ is obtained by standardizing the difference:

$$u = \frac{\bar{x} - \bar{y}}{\sqrt{\dfrac{p_1(1-p_1)}{n_1} + \dfrac{p_2(1-p_2)}{n_2}}}$$

Since this formula contains the unknowns $p_1$ and $p_2$, they are approximated by $\bar{x}$ and $\bar{y}$:

$$u = \frac{\bar{x} - \bar{y}}{\sqrt{\dfrac{\bar{x}(1-\bar{x})}{n_1} + \dfrac{\bar{y}(1-\bar{y})}{n_2}}}$$

This $u$ approximately follows the normal distribution

$$u \sim N\!\left(\frac{p_1 - p_2}{\sqrt{\dfrac{p_1(1-p_1)}{n_1} + \dfrac{p_2(1-p_2)}{n_2}}},\; 1\right).$$
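Since the stated aim of this article is to check the derived expressions by simulation, here is a small sketch (my own addition, using the parameter values 0.08, 0.10 and 400 that appear later in the article) that checks this distributional claim: it simulates $u$ many times and compares its sample mean and standard deviation with the predicted mean and 1.

```python
import numpy as np
import scipy as sp
import scipy.stats

p1, p2 = 0.08, 0.10
n1, n2 = 400, 400

# Predicted mean of u under these parameters
mu = (p1 - p2) / np.sqrt(p1*(1-p1)/n1 + p2*(1-p2)/n2)

us = []
for _ in range(10000):
    x = sp.stats.bernoulli.rvs(size=n1, p=p1)
    y = sp.stats.bernoulli.rvs(size=n2, p=p2)
    x_, y_ = x.mean(), y.mean()
    us.append((x_ - y_) / np.sqrt(x_*(1-x_)/n1 + y_*(1-y_)/n2))

# The sample mean and standard deviation of u should be close to mu and 1
print(np.mean(us), np.std(us), mu)
```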
Now assume that the null hypothesis $H_0$ holds. In this case the test statistic $u$ follows

$$u \sim N(0, 1).$$

Therefore, at a significance level of 5%, the hypothesis $H_0$ should be rejected if the test statistic $u$ falls outside the 95% interval of the standard normal distribution. That is, $H_0$ is rejected when

$$u < Z_{\alpha/2} \quad \text{or} \quad u > Z_{1-\alpha/2},$$

where $Z_p$ denotes the point at which the cumulative probability of the standard normal distribution equals $p$.
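As a quick numerical reference (added here, not part of the original derivation): for $\alpha = 0.05$ the critical values are $Z_{0.025} \approx -1.96$ and $Z_{0.975} \approx 1.96$, which scipy returns directly.

```python
import scipy.stats

# 95% two-sided interval of the standard normal: approximately (-1.96, 1.96)
lower, upper = scipy.stats.norm.interval(0.95)
print(lower, upper)
```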
When X and Y are given as arrays of 0s and 1s, the above test can be implemented as follows.
```python
import numpy as np
import scipy as sp
import scipy.stats


def test(X, Y, alpha):
    n1, n2 = len(X), len(Y)
    p1_ = X.mean()
    p2_ = Y.mean()
    # Test statistic
    u = (p1_ - p2_) / np.sqrt(p1_*(1-p1_)/n1 + p2_*(1-p2_)/n2)
    # Two-sided acceptance bounds of the standard normal distribution
    lower, upper = sp.stats.norm.interval(1 - alpha)
    if u < lower or u > upper:
        print("There is a significant difference")
        return True
    else:
        print("No significant difference")
        return False
```
Let us validate this test. We can do so by checking that the rejection rate equals $\alpha = 0.05$ when the null hypothesis $H_0: p_1 = p_2$ holds.
Here, with $p_1$ and $p_2$ both equal to 0.08 and the numbers of visitors equal to 4000 and 5000, the situation is simulated 100,000 times and the rejection rate is computed.
```python
p1, p2 = 0.08, 0.08
n1, n2 = 4000, 5000
res = []
for i in range(100000):
    # Sample generation
    X = sp.stats.bernoulli.rvs(size=n1, p=p1)
    Y = sp.stats.bernoulli.rvs(size=n2, p=p2)
    res.append(test(X, Y, alpha=0.05))
print(f"Rejection rate: {np.mean(res)}")
```
Running this gave a rejection rate of 0.05024, confirming that the test appears to be working correctly.
So the rejection rate when $H_0$ holds is $\alpha = 0.05$. But what is the rejection rate when $H_1$ holds? Using the program above, let us check the rejection rate when $p_1, p_2 = 0.08, 0.10$ and $n_1, n_2 = 400, 400$, that is, when $H_1$ holds. When actually executed, the rejection rate was 0.16753. In other words, even though $H_1$ holds, there is a high probability that the test fails to detect a significant difference (it is detected only 16.8% of the time).
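The run reported above amounts to re-running the earlier simulation with only the parameters changed. A sketch of that modified setup (it reuses `np`, `sp`, and the `test` function defined above):

```python
p1, p2 = 0.08, 0.10   # H1 holds: the click-through rates differ
n1, n2 = 400, 400
res = []
for i in range(100000):
    X = sp.stats.bernoulli.rvs(size=n1, p=p1)
    Y = sp.stats.bernoulli.rvs(size=n2, p=p2)
    res.append(test(X, Y, alpha=0.05))
print(f"Rejection rate: {np.mean(res)}")   # about 0.17 in the run reported above
```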
The rejection rate when $H_1$ holds is called the detection power and is written $1 - \beta$, where $\beta$ is the type II error rate. The power improves as the amount of data increases. If you run an A/B test with low power and conclude that there is no significant difference, you may well be overlooking a real improvement. Therefore, I would like to be able to calculate this detection power.
Recall that the test statistic $u$ follows the normal distribution

$$u \sim N\!\left(\frac{p_1 - p_2}{\sqrt{\dfrac{p_1(1-p_1)}{n_1} + \dfrac{p_2(1-p_2)}{n_2}}},\; 1\right).$$

In the test, $H_0$ was rejected when

$$u < Z_{\alpha/2} \quad \text{or} \quad u > Z_{1-\alpha/2}.$$

The probability of this rejection event can be calculated under the condition that $H_1$ holds. This is the detection power.
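Written out explicitly (a restatement added here for clarity, not taken verbatim from the original; $\Phi$ is the standard normal CDF and $\mu$ is the mean of $u$ under $H_1$ given above), the power is

$$1 - \beta = 1 - \Bigl\{ \Phi\bigl(Z_{1-\alpha/2} - \mu\bigr) - \Phi\bigl(Z_{\alpha/2} - \mu\bigr) \Bigr\},$$

which is exactly what the `power` function below computes with scipy.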
When $H_0$ holds, the rejection rate could be calculated even though $p_1$ and $p_2$ were unknown. Under $H_1$, however, the values of $p_1$ and $p_2$ are required. In site A/B testing, $p_1$ can usually be computed from your own data as the click-through rate of the existing site. On the other hand, $p_2$ corresponds to the click-through rate of the new site and is completely unknown, so there is no choice but to set it to a reasonably expected value.
The power calculation is easy to implement.
```python
def power(p1, p2, n1, n2, alpha):
    # Mean of the test statistic u under H1
    u = (p1 - p2) / np.sqrt(p1*(1-p1)/n1 + p2*(1-p2)/n2)
    # Distribution of the test statistic under H1: N(u, 1)
    model = sp.stats.norm(loc=u)
    # Two-sided acceptance bounds under H0
    lower, upper = sp.stats.norm.interval(1 - alpha)
    # Probability that u falls outside the acceptance region
    power = 1 - (model.cdf(upper) - model.cdf(lower))
    print(f"Detection power: {power}")
    return power
```
Calling this `power` with $p_1, p_2 = 0.08, 0.10$ and $n_1, n_2 = 400, 400$ produced the output

Detection power: 0.16736

This is close to the rejection rate of 0.16753 obtained in the simulation above, so the calculation appears reasonably accurate.
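For reference, the call corresponding to those numbers (with $\alpha = 0.05$, the significance level used throughout) is simply the following; the loop also illustrates the earlier remark that power grows as the amount of data increases.

```python
power(0.08, 0.10, 400, 400, alpha=0.05)   # Detection power: 0.16736

# For a fixed difference in click-through rates, the power approaches 1 as n grows
for n in (400, 1600, 6400):
    power(0.08, 0.10, n, n, alpha=0.05)
```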
When estimating the sample size required for an A/B test, choose a size that achieves the required detection power. A power of 0.8 is commonly used as the standard.
Since the power calculation shown above is cheap, we simply call `power` repeatedly, narrowing the range of $n$ by binary search, to find the required sample size. In an A/B test roughly the same number of visitors can be sent to each site, so we set $n_1 = n_2$.
Now, assuming the click-through rate of the existing site is 8% and that of the new site is 10%, the required sample size is obtained by the following program.
```python
# Binary search for the smallest n that achieves the target power
lower_n, upper_n = 10, 100000
alpha = 0.05
target_power = 0.8
p1 = 0.08
p2 = 0.10

while True:
    n = int((lower_n + upper_n) / 2)
    p = power(p1, p2, n, n, alpha)
    if p > target_power:
        upper_n = n
    else:
        lower_n = n
    if (upper_n - lower_n) <= 1:
        break

print(f"Required sample size: {upper_n}")
```
The output is 3211. Adobe's Sample Size Calculator gives 3210, so the two almost match. Note that this sample size is **per site**; an A/B test therefore needs twice as many samples in total.
When calculating the power we set $p_1, p_2 = 0.08, 0.10$, but note that rejecting the null hypothesis does not prove that $p_2 = 0.10$; it only shows that there is a significant difference. If you want to test that the click-through rate of site Y is at least 25% higher than that of site X, you need a hypothesis test of the form

Null hypothesis $H_0: 1.25\,p_1 = p_2$
Alternative hypothesis $H_1: 1.25\,p_1 < p_2$
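As one possible sketch of such a one-sided test (my own illustration; the function name `test_one_sided` and the `lift` parameter are hypothetical, and the variance uses the same plug-in estimates as the two-sided test above), the statistic can be built around the difference $\bar{y} - 1.25\,\bar{x}$, which has mean zero under $H_0$:

```python
import numpy as np
import scipy as sp
import scipy.stats


def test_one_sided(X, Y, alpha, lift=1.25):
    """Sketch: test H0: lift*p1 = p2 against H1: lift*p1 < p2."""
    n1, n2 = len(X), len(Y)
    p1_, p2_ = X.mean(), Y.mean()
    # Under H0 the difference p2_ - lift*p1_ has mean 0; its variance picks up
    # a lift**2 factor on the X term because Var(lift * p1_) = lift**2 * Var(p1_).
    se = np.sqrt(lift**2 * p1_*(1-p1_)/n1 + p2_*(1-p2_)/n2)
    u = (p2_ - lift*p1_) / se
    # One-sided test: reject H0 only when u lies in the upper tail
    return u > sp.stats.norm.ppf(1 - alpha)
```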