Introduction

While learning various formulas in my statistics studies, I had trouble remembering the **uncorrelated test** (the test of no correlation), so I kept staring at its formula. Something about it made me curious, so I had Python run the calculation and plot the results.

- Environment

Windows-10-10.0.18362-SP0
Python 3.7.6
pip 19.3.1
pandas 1.0.3
matplotlib 3.1.2
seaborn 0.10.0
numpy 1.18.1
scipy 1.4.1

Uncorrelated test

It tests whether the population can be said to be correlated, based on the correlation coefficient obtained from a sample.

- **Null hypothesis H0**: the population correlation coefficient is 0 (no correlation)

- **Alternative hypothesis H1**: the population correlation coefficient is not 0

From the formula below, compute the statistic $ t $ from the sample correlation coefficient $ r $ and the sample size $ n $, then get the $ p $ value. The degrees of freedom $ ν $ of the statistic $ t $ are $ n-2 $.

$$ t = \frac{|r| \sqrt{n - 2}}{\sqrt{1 - r^2}} $$

With a significance level of $ \alpha = 0.05 $ in a two-sided test, it is enough to check whether the one-sided (upper-tail) $ p $ value for $ t $ is below 0.025.
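As a small worked example of the formula above, the sketch below computes $ t $, the degrees of freedom, and both the one-sided and two-sided $ p $ values for one sample, then compares the two-sided value with the one reported by `scipy.stats.pearsonr`. The helper name `no_correlation_test` and the randomly generated data are assumptions made only for illustration.

```python
import numpy as np
from scipy import stats


def no_correlation_test(r, n):
    """Return t, degrees of freedom, one-sided p, and two-sided p for a sample correlation r."""
    dof = n - 2
    t = abs(r) * np.sqrt(dof) / np.sqrt(1 - r**2)
    p_one_sided = stats.t.sf(t, dof)  # upper-tail probability P(T > |t|)
    return t, dof, p_one_sided, 2 * p_one_sided


# Hypothetical data: 30 points with a moderate linear relationship
rng = np.random.default_rng(0)
x = rng.normal(size=30)
y = 0.5 * x + rng.normal(size=30)

r, p_scipy = stats.pearsonr(x, y)  # pearsonr reports a two-sided p-value
t, dof, p_one, p_two = no_correlation_test(r, len(x))

print('r =', r, 't =', t, 'df =', dof)
print('one-sided p =', p_one, 'two-sided p =', p_two, 'pearsonr p =', p_scipy)
```

Comparing the one-sided value with 0.025 gives the same decision as comparing the two-sided value with 0.05.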

... I can't remember this formula because I rarely use it. Still, I thought, "If n (the sample size) is large, the t-value becomes large, so **in the end it all comes down to the sample size!!**" So I computed every combination of sample size and correlation coefficient and **looked at how far the null hypothesis goes without being rejected**.

Preparation

#Used for data creation
import pandas as pd
import numpy as np
import math
from scipy import stats
import itertools

#Used for graph drawing
import matplotlib.pyplot as plt
import seaborn as sns
from mpl_toolkits.mplot3d import Axes3D


%matplotlib inline


plt.style.use('seaborn-darkgrid')
plt.rcParams['font.family'] = 'Yu Gothic'
plt.rcParams['font.size'] = 20


# Given a correlation coefficient (coef) and a sample size (n), return the t-value (t), degrees of freedom (df), and one-sided p-value (p).
def Uncorrelated(coef, n):
    df = n - 2
    t = (np.abs(coef) * math.sqrt(df)) / math.sqrt(1 - coef**2)
    # Upper-tail probability. Left unrounded: rounding to 3 decimals would turn a value such as 0.0254 into 0.025
    # and flip the later "p > 0.025" comparison for borderline cases.
    p = stats.t.sf(t, df)
    return coef, n, t, df, p


# Sample sizes from 10 to 1000 in increments of 10
samplesizes = np.arange(10, 1001, 10)

# Correlation coefficients from -0.99 to 0.99 in increments of 0.01
coefficients = np.linspace(-0.99, 0.99, 199)
#print(coefficients)

# Cross join the two arrays above (Cartesian product)
c_s = list(itertools.product(coefficients, samplesizes))

# Pass each (correlation coefficient, sample size) pair to Uncorrelated and collect the returned tuples
df_prelist = [Uncorrelated(coef, n) for coef, n in c_s]

# Preparation is complete: convert the list into a pandas DataFrame
df = pd.DataFrame(df_prelist, columns=['coef','sample_size','t','df','p_value'])

The resulting df looks like this:

df

df.sample(10)

It contains the t-value, degrees of freedom, and p-value of the uncorrelated test for correlation coefficients from -0.99 to 0.99 and sample sizes from 10 to 1000.
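As a cross-check, the same kind of table can probably be built without a Python loop by letting NumPy evaluate the formula on the whole grid at once. This is only a sketch: `df_vec` is a hypothetical name, and the p-value here is the unrounded one-sided tail probability.

```python
import numpy as np
import pandas as pd
from scipy import stats

coefficients = np.linspace(-0.99, 0.99, 199)
samplesizes = np.arange(10, 1001, 10)

# indexing='ij' keeps the coefficient as the outer (slow) index,
# the same ordering as itertools.product(coefficients, samplesizes)
coef_grid, n_grid = np.meshgrid(coefficients, samplesizes, indexing='ij')
dof_grid = n_grid - 2
t_grid = np.abs(coef_grid) * np.sqrt(dof_grid) / np.sqrt(1 - coef_grid**2)
p_grid = stats.t.sf(t_grid, dof_grid)  # one-sided (upper-tail) p-value

df_vec = pd.DataFrame({
    'coef': coef_grid.ravel(),
    'sample_size': n_grid.ravel(),
    't': t_grid.ravel(),
    'df': dof_grid.ravel(),
    'p_value': p_grid.ravel(),
})
print(df_vec.sample(10))
```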

Graph drawing

fig = plt.figure( figsize=(16, 12) )
# add_subplot(projection='3d') attaches the 3D axes to the figure explicitly (relies on the Axes3D import above)
ax = fig.add_subplot(111, projection='3d')
# Color each point by its p-value: red = low, blue = high
mappable = ax.scatter( np.array(df['coef']), np.array(df['sample_size']), np.array(df['p_value']), c=np.array(df['p_value']), cmap='RdYlBu')
fig.colorbar(mappable, ax=ax)
ax.set_xlabel('Correlation coefficient', labelpad=15)
ax.set_ylabel('sample size', labelpad=15)
ax.set_zlabel('p-value', labelpad=15)
plt.savefig('3D graph.png', bbox_inches='tight', pad_inches=0.3)
plt.show()

... The closer a point is to blue, the higher its p-value, meaning H0 is not rejected ... **but this is hard to read.**

So I added a judge column that labels rows with a p-value greater than 0.025 as "Do not reject H0", and drew the graph again.

# Label rows whose p_value is greater than 0.025 as 'Do not reject H0', and all others as 'Reject H0'
df['judge'] = np.where(df['p_value'] > 0.025, 'Do not reject H0', 'Reject H0')


# Redraw the graph
grid = sns.FacetGrid( df, hue = 'judge', height=10 )
grid.map(plt.scatter, 'coef', 'sample_size')
grid.add_legend(title='Judgment')
plt.ylabel('sample size')
plt.xlabel('Correlation coefficient')
plt.title('Correlation coefficient x sample size: rejected or not by the uncorrelated test', size=30)

# Draw red dashed lines at the largest and smallest coefficients that are not rejected
not_rejected = df.loc[df['judge'] == 'Do not reject H0', 'coef']
plt.vlines(not_rejected.max(), -50, 50, color='red', linestyles='dashed')
plt.vlines(not_rejected.min(), -50, 50, color='red', linestyles='dashed')
plt.annotate('Outside |' + str(not_rejected.max().round(2)) + '|, H0 is rejected for every n >= 10',
            xy=(not_rejected.max(), 80), size=15, color='black')
plt.savefig('2D graph.png', bbox_inches='tight', pad_inches=0.3)
plt.show()

Indeed, **once the absolute value of the sample correlation coefficient exceeds roughly 0.63 (the exact critical value for n = 10 is about 0.632), the null hypothesis H0 is rejected even at n = 10 and "the population correlation coefficient is not 0" is adopted** ($ \alpha = 0.05 $)!
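The boundary can also be obtained directly, without scanning a grid, by solving the formula for $ r $: from $ t = |r| \sqrt{n - 2} / \sqrt{1 - r^2} $ it follows that $ |r| = t / \sqrt{t^2 + n - 2} $. Plugging in the critical $ t $ value gives the smallest $ |r| $ that is rejected for a given $ n $. The sketch below assumes a two-sided test; the function name `critical_r` is hypothetical.

```python
import numpy as np
from scipy import stats


def critical_r(n, alpha=0.05):
    """Smallest |r| that rejects H0 in a two-sided test with sample size n."""
    dof = n - 2
    t_crit = stats.t.ppf(1 - alpha / 2, dof)  # upper alpha/2 point of the t distribution
    return t_crit / np.sqrt(t_crit**2 + dof)


for n in (10, 30, 100, 1000):
    print(n, round(float(critical_r(n)), 3))
# The first line printed is: 10 0.632
```

The critical value shrinks as n grows, which is the "in the end it's the sample size" effect seen in the graph.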

... By the way, writing this article brought me back to the original purpose of "learning the formula" :upside_down: