I tried verifying the results of an A/B test with a chi-square test

In digital marketing, teams run PDCA cycles on all kinds of measures every day. In that process, simply comparing which number is larger can make it hard to judge accurately whether a measure worked. To explore how statistical knowledge can feed better analysis and improvement proposals, I worked through a case study using statistical hypothesis testing.

The topic this time is to check whether the results of an A/B test on a website differ significantly, at a significance level of 0.05. I used the chi-square test of independence, which is often used to evaluate A/B tests, and took the data from Kaggle's "audacity ab testing" dataset. https://www.kaggle.com/samtyagi/audacity-ab-testing

First, import the libraries and load the data.

import math
import numpy as np
import pandas as pd
import scipy.stats

df=pd.read_csv("homepage_actions.csv")
df.head()

The columns are as follows.
- timestamp: access date and time
- id: user ID
- group: control for the control group, experiment for the test group
- action: click if the user clicked, view if the user only viewed the page

Next, aggregate the data. Start with the total number of rows in each group.

group=df.groupby('group').count()
group

Next, use a pivot table to count the views and clicks in each group.

# Count the id column so the values do not duplicate the index/column keys.
pd.pivot_table(df, index='group', columns='action', values='id', aggfunc='count')
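
As a cross-check on the pivot table, pandas also offers pd.crosstab, which counts the rows for every group/action pair directly. The sketch below runs on a tiny made-up frame (the rows are hypothetical, not taken from homepage_actions.csv) just to show the shape of the resulting table.

```python
import pandas as pd

# Hypothetical toy rows with the same columns as homepage_actions.csv;
# the values are invented for illustration only.
toy = pd.DataFrame({
    "id": [1, 1, 2, 3, 3, 4],
    "group": ["control", "control", "control",
              "experiment", "experiment", "experiment"],
    "action": ["view", "click", "view", "view", "click", "view"],
})

# Count rows for each (group, action) combination.
table = pd.crosstab(toy["group"], toy["action"])
print(table)
```

On the real data, the same call with df in place of toy should give the click and view counts per group.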

Dividing the number of clicks by the total number of rows in each group gives the click rates:
- Control group: 932 / 4264 = 0.21857410881801126
- Test group: 928 / 3924 = 0.2364937410805302
The raw click rate is higher in the test group.

Is that difference statistically significant, so that we can say the test group really has a higher click rate? Let's check with the chi-square test. In Python, use the chi2_contingency function in scipy.stats.

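# Rows: control, experiment. Columns: click, view.
data = np.array([[932, 3332], [928, 2996]])
# correction=False turns off Yates' continuity correction.
chi2, p, dof, expected = scipy.stats.chi2_contingency(data, correction=False)

print("Chi-square value:", chi2)
print("p-value:", p)
print("Degrees of freedom:", dof)
print("Expected frequencies:", expected)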
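
To see what chi2_contingency works out for this 2x2 table, the statistic can also be computed by hand. The sketch below assumes the same observed counts and no continuity correction; the variable names are my own, and it is meant as an illustration of the Pearson chi-square calculation rather than a replacement for the library call.

```python
import numpy as np
from scipy.stats import chi2 as chi2_dist

# Observed counts: rows = control / experiment, columns = click / view.
observed = np.array([[932, 3332], [928, 2996]])

# Expected count per cell under independence: row total * column total / N.
row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)
expected = row_totals * col_totals / observed.sum()

# Pearson statistic (no continuity correction) and its p-value.
stat = ((observed - expected) ** 2 / expected).sum()
dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)
p_value = chi2_dist.sf(stat, dof)

print("chi-square:", stat)
print("p-value:", p_value)
```

The values printed here should agree with the output of chi2_contingency when correction=False is passed.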

In the output, the p-value is higher than 0.05. The null hypothesis, that group and click behavior are independent (in other words, that both groups have the same click rate), therefore cannot be rejected at the 5% level. The data do not show that the test group's click rate is significantly higher, even though its raw rate is larger. Comparing the raw numbers alone is not enough to make that call.
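The decision rule itself is just a comparison of the p-value with the significance level. Here is a minimal sketch, assuming the 0.05 level used in this article and the same 2x2 table; ALPHA and verdict are names introduced only for this example.

```python
import numpy as np
from scipy.stats import chi2_contingency

ALPHA = 0.05  # significance level assumed in this article

# Rows: control, experiment. Columns: click, view.
observed = np.array([[932, 3332], [928, 2996]])
chi2, p, dof, expected = chi2_contingency(observed, correction=False)

# Reject the null hypothesis only when p falls below the chosen level.
if p < ALPHA:
    verdict = "reject the null hypothesis"
else:
    verdict = "fail to reject the null hypothesis"
print(f"p = {p:.4f} -> {verdict} at alpha = {ALPHA}")
```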

Click rate alone may not always be the right yardstick for judging whether a measure is good or bad, but this kind of analysis provides useful evidence for deciding on more effective measures.