While learning various formulas in the course of studying statistics, I had trouble remembering the **test of no correlation**, so I spent some time staring at its formula. That made me curious about how the result depends on the correlation coefficient and the sample size, so I had Python compute and plot it.
--Environment
- **Null hypothesis $H_0$**: the population correlation coefficient is 0 (no correlation)
- **Alternative hypothesis $H_1$**: the population correlation coefficient is not 0
Compute the statistic $t$ from the sample correlation coefficient $r$ and the sample size $n$ using the formula below, then obtain the $p$-value. The statistic $t$ has $\nu = n - 2$ degrees of freedom.

$$
t = \frac{|r| \sqrt{n - 2}}{\sqrt{1 - r^2}}
$$

With a significance level of $\alpha = 0.05$, a two-sided test rejects $H_0$ when the one-sided (upper-tail) probability of $t$ is below 0.025, so 0.025 is the threshold used here.
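As a quick sanity check of the formula, here is a minimal sketch that computes $t$ and the two-sided $p$-value from $r$ and $n$, then compares the result with the $p$-value reported by `scipy.stats.pearsonr`. The helper name `no_correlation_test` and the random data are made up for illustration only.

```python
import numpy as np
from scipy import stats

def no_correlation_test(r, n):
    """Return (t, two-sided p) for a sample correlation r and sample size n."""
    dof = n - 2                                  # degrees of freedom
    t = abs(r) * np.sqrt(dof) / np.sqrt(1 - r**2)
    one_sided = stats.t.sf(t, dof)               # upper-tail probability of t
    return t, 2 * one_sided                      # double it for the two-sided test

# Illustrative data (hypothetical, generated only for this sketch)
rng = np.random.default_rng(0)
x = rng.normal(size=30)
y = 0.5 * x + rng.normal(size=30)

r, p_scipy = stats.pearsonr(x, y)
t, p_formula = no_correlation_test(r, len(x))
print(f"r={r:.3f}, t={t:.3f}, p (formula)={p_formula:.4f}, p (pearsonr)={p_scipy:.4f}")
```

The two $p$-values should agree up to floating-point error, since `pearsonr` tests the same null hypothesis. Comparing the one-sided probability with 0.025 is equivalent to comparing this two-sided $p$-value with 0.05.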
#Used for data creation
import pandas as pd
import numpy as np
import math
from scipy import stats
import itertools
#Used for graph drawing
import matplotlib.pyplot as plt
import seaborn as sns
from mpl_toolkits.mplot3d import Axes3D
%matplotlib inline
# Note: on Matplotlib 3.6 and later this style is named 'seaborn-v0_8-darkgrid'
plt.style.use('seaborn-darkgrid')
plt.rcParams['font.family'] = 'Yu Gothic'
plt.rcParams['font.size'] = 20
# Given a correlation coefficient (coef) and a sample size (n),
# return the t-value (t), degrees of freedom (df), and p-value (p).
def Uncorrelated(coef, n):
    t = (np.abs(coef) * math.sqrt(n - 2)) / math.sqrt(1 - coef**2)
    df = n - 2
    p = np.round(1 - stats.t.cdf(np.abs(t), df), 3)  # One-sided p-value, rounded to 3 decimals
    return coef, n, t, df, p
# Sample sizes from 10 to 1000 in steps of 10
samplesizes = np.arange(10, 1001, 10)
# Correlation coefficients from -0.99 to 0.99 in steps of 0.01
coefficients = np.linspace(-0.99, 0.99, 199)
#print(coefficients)
#Cross join the above two(Cartesian product)
c_s = list(itertools.product(coefficients, samplesizes) )
# Pass each (correlation coefficient, sample size) pair to Uncorrelated
# and collect the returned tuples so they can be converted to a DataFrame.
df_prelist = []
for coef, n in c_s:
    df_prelist.append(Uncorrelated(coef, n))
#Preparation is complete
df = pd.DataFrame(df_prelist,columns=['coef','sample_size','t','df','p_value'])
The resulting `df` looks like this:
df
df.sample(10)
It contains the t-value, degrees of freedom, and p-value of the test of no correlation for every combination of correlation coefficient from -0.99 to 0.99 and sample size from 10 to 1000.
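The same table can also be built without an explicit Python loop. The following is only a sketch of one alternative, using `np.meshgrid` for the Cartesian product and `stats.t.sf` for the upper-tail probability; the variable names (`coefs`, `sample_sizes`, `table`) are chosen here for illustration.

```python
import numpy as np
import pandas as pd
from scipy import stats

sample_sizes = np.arange(10, 1001, 10)      # 10, 20, ..., 1000
coefs = np.linspace(-0.99, 0.99, 199)       # -0.99, -0.98, ..., 0.99

# indexing='ij' keeps coefs on the outer axis, the same ordering as
# itertools.product(coefs, sample_sizes)
coef_grid, n_grid = np.meshgrid(coefs, sample_sizes, indexing='ij')
coef_flat, n_flat = coef_grid.ravel(), n_grid.ravel()

dof = n_flat - 2
t = np.abs(coef_flat) * np.sqrt(dof) / np.sqrt(1 - coef_flat**2)
p = np.round(stats.t.sf(t, dof), 3)         # one-sided p-value, rounded to 3 decimals

table = pd.DataFrame({
    'coef': coef_flat,
    'sample_size': n_flat,
    't': t,
    'df': dof,
    'p_value': p,
})
print(table.shape)
```

Because every operation works on whole arrays, this avoids calling a Python function once per row, which matters little at this size but helps if the grid is made finer.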
fig = plt.figure( figsize=(16, 12) )
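# Axes3D(fig) no longer attaches itself to the figure in recent Matplotlib, so create the 3D axes via add_subplot
ax = fig.add_subplot(projection='3d')
mappable = ax.scatter( np.array(df['coef']), np.array(df['sample_size']), np.array(df['p_value']), c=np.array(df['p_value']), cmap='RdYlBu')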
fig.colorbar(mappable, ax=ax)
ax.set_xlabel('Correlation coefficient', labelpad=15)
ax.set_ylabel('sample size', labelpad=15)
ax.set_zlabel('p-value', labelpad=15)
plt.savefig('3D graph.png', bbox_inches='tight', pad_inches=0.3)
plt.show()
The closer a point is to blue, the higher its p-value, meaning $H_0$ is not rejected there ... but **this plot is hard to read**.
# Label every row as 'Reject H0', then relabel rows whose p_value exceeds 0.025 as 'Do not reject H0'
df['judge'] = 'Reject H0'
df.loc[df['p_value'] > 0.025, 'judge'] = 'Do not reject H0'
#Graph redraw
grid = sns.FacetGrid( df, hue = 'judge', height=10 )
grid.map(plt.scatter, 'coef', 'sample_size')
grid.add_legend(title='Judgment')
plt.ylabel('sample size')
plt.xlabel('Correlation coefficient')
plt.title('Correlation coefficient x sample size: rejection of H0 in the test of no correlation', size=30)
#Draw a red line
plt.vlines(df[df['judge'] == 'Do not reject H0']['coef'].max(), -50, 50, color='red', linestyles='dashed')
plt.vlines(df[df['judge'] == 'Do not reject H0']['coef'].min(), -50, 50, color='red', linestyles='dashed')
plt.annotate('Outside |' + str(df[df['judge'] == 'Do not reject H0']['coef'].max().round(2) ) + '|: always rejected for n >= 10',
             xy=(df[df['judge'] == 'Do not reject H0']['coef'].max(), 80), size=15, color='black')
plt.savefig('2D graph.png', bbox_inches='tight', pad_inches=0.3)
plt.show()
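Indeed, on this grid, **if the absolute value of the sample correlation coefficient is greater than 0.62, the null hypothesis $H_0$ is rejected even at $n = 10$, and "the population correlation coefficient is not 0" is adopted** ($\alpha = 0.05$)! The value 0.62 reflects the 0.01 grid step and the rounded p-value; the exact critical value at $n = 10$ is about 0.632.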
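The threshold can also be obtained directly rather than read off the plot. Solving the test statistic for $|r|$ gives $|r| = t / \sqrt{t^2 + n - 2}$, so plugging in the critical $t$-value yields the smallest $|r|$ that is rejected. The sketch below assumes a two-sided test; the function name `critical_r` is hypothetical.

```python
import numpy as np
from scipy import stats

def critical_r(n, alpha=0.05):
    """Smallest |r| rejected by a two-sided test of no correlation at level alpha."""
    dof = n - 2
    t_crit = stats.t.ppf(1 - alpha / 2, dof)     # two-sided critical t-value
    return t_crit / np.sqrt(t_crit**2 + dof)     # invert t = |r| sqrt(dof) / sqrt(1 - r^2)

for n in (10, 30, 100, 1000):
    print(n, round(float(critical_r(n)), 3))
```

The critical value shrinks as $n$ grows, which is why the "Do not reject H0" region in the 2D graph narrows toward larger sample sizes.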