numpy.sum(data) #total
numpy.mean(data) #average
numpy.amax(data) #maximum
numpy.amin(data) #minimum
numpy.median(data) #median
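A minimal sketch of these functions on toy data (the array below is an assumption for illustration):

import numpy as np

data = np.array([2, 3, 3, 4, 4, 4, 4, 5, 5, 6]) #hypothetical toy data
print(np.sum(data))    #40
print(np.mean(data))   #4.0
print(np.amax(data))   #6
print(np.amin(data))   #2
print(np.median(data)) #4.0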
An index that indicates how far the data are spread out from the mean.
\sigma^2=\frac{1}{N}\sum_{i=1}^{N} (x_i-\mu)^2
numpy.var(data, ddof = 0) #sample (biased) variance
The sample variance is calculated using the sample mean in place of the (unknown) population mean, but this makes the value biased: it tends to underestimate the population variance.
The bias-corrected version is called the unbiased variance.
\sigma^2=\frac{1}{N-1}\sum_{i=1}^{N} (x_i-\mu)^2
numpy.var(data, ddof = 1) #unbiased variance
Hereafter, unbiased variance will be used.
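The difference is only the divisor, as a quick comparison shows (a sketch, reusing the toy data above):

import numpy as np

data = np.array([2, 3, 3, 4, 4, 4, 4, 5, 5, 6])
print(np.var(data, ddof = 0)) #1.2    (divides by N)
print(np.var(data, ddof = 1)) #1.33... (divides by N - 1)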
The standard deviation is the square root of the variance:
\begin{align}
\sigma&=\sqrt{\sigma^2}\\
&=\sqrt{\frac{1}{N-1}\sum_{i=1}^{N} (x_i-\mu)^2}
\end{align}
numpy.std(data, ddof=1)
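As a quick check (same toy data), the standard deviation equals the square root of the unbiased variance:

import numpy as np

data = np.array([2, 3, 3, 4, 4, 4, 4, 5, 5, 6])
print(np.std(data, ddof = 1))          #1.154...
print(np.sqrt(np.var(data, ddof = 1))) #same value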
Covariance is an index of how two variables vary together.
--When the covariance is greater than 0
→ If one variable takes a large value, the other also tends to take a large value
→ There is a positive correlation.
--When the covariance is less than 0
→ If one variable takes a large value, the other tends to take a small value
→ There is a negative correlation.
Cov(x,y)=\frac{1}{N}\sum_{i=1}^{N} (x_i-\mu_x)(y_i-\mu_y)
import numpy as np
import pandas as pd

#Hypothetical example data (the original cov_data is not shown; any two-column DataFrame works)
cov_data = pd.DataFrame({
    "x": [18.5, 18.8, 19.1, 19.7, 19.9, 20.3],
    "y": [34, 39, 41, 38, 45, 41]
})
print(cov_data)
#Data retrieval
x = cov_data["x"]
y = cov_data["y"]
#sample size
N = len(cov_data)
#Calculation of mean value
mu_x = np.mean(x)
mu_y = np.mean(y)
#Covariance (unbiased: divide by N - 1, matching ddof = 1 below)
cov = sum((x - mu_x) * (y - mu_y)) / (N - 1)
numpy's cov function returns the covariance matrix, with the variances on the diagonal and the covariance off the diagonal:

\begin{bmatrix}
\sigma_x^2 & Cov(x,y) \\
Cov(x,y) & \sigma_y^2
\end{bmatrix}
np.cov(x, y, ddof = 1)
hoge = np.cov(x, y, ddof = 1)
cov = hoge[1, 0] #the off-diagonal element is the covariance
The correlation coefficient standardizes the covariance so that its maximum value is 1 and its minimum value is -1.
\rho_{xy}=\frac{Cov(x,y)}{\sqrt{\sigma_x^2\sigma_y^2}}
#Variance calculation (ddof = 1, consistent with the covariance above)
sigma_2_x = np.var(x, ddof = 1)
sigma_2_y = np.var(y, ddof = 1)
#Correlation coefficient
rho = cov / np.sqrt(sigma_2_x * sigma_2_y)
numpy.corrcoef returns the corresponding correlation matrix:

\begin{bmatrix}
1 & \rho_{xy} \\
\rho_{xy} & 1
\end{bmatrix}
numpy.corrcoef(x,y)
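As a sanity check (a sketch; x and y are the hypothetical columns defined above), the off-diagonal element of this matrix equals the manually computed ρ:

corr_matrix = np.corrcoef(x, y)
print(corr_matrix[1, 0]) #matches rho from the manual calculation above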
A conversion that sets the mean of the data to 0 and the standard deviation to 1: that is, subtract the mean from each data point and divide by the standard deviation.
standardized = (data - numpy.mean(data)) / numpy.std(data, ddof=1)
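A minimal check (a sketch, assuming the toy data array from the first example): after the conversion, the mean is approximately 0 and the unbiased standard deviation is 1.

import numpy as np

data = np.array([2, 3, 3, 4, 4, 4, 4, 5, 5, 6])
standardized = (data - np.mean(data)) / np.std(data, ddof = 1)
print(np.mean(standardized))          #approximately 0 (up to floating-point error)
print(np.std(standardized, ddof = 1)) #1.0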
Probability for a continuous variable[^1]. For a continuous variable, the probability of any one specific value is always 0, because values can have infinitely many decimal places; for example, no one is exactly 160 centimeters tall. However, the "probability that a person is between 159 cm and 160 cm tall" can be calculated, and the quantity used for this is the "probability density". Integrating the probability density over the entire range of the variable gives 1.
cf. The probability of a discrete variable[^2] is the kind of probability most people learn at school (e.g. P(x) = 1/4).
More precisely, consider the probability that a real-valued variable X satisfies x <= X <= x + Δx. Dividing this probability by Δx and letting Δx → 0 gives P(x), which is called the probability density at x.
When calculating a probability, the variable whose probability is being calculated is called a random variable. For example, suppose the probability that x = 2 is 1/3: x is the random variable, and 2 is one of the values it takes.
N(x|\mu, \sigma^2)=\frac{1}{\sqrt{2\pi\sigma^2}}e^{-{\frac{(x-\mu)^2}{2\sigma^2}}}
Example: random variable x = 3, mean = 4, standard deviation = 0.8
>>> x = 3
>>> mu = 4
>>> sigma = 0.8
>>> 1 / (numpy.sqrt(2 * numpy.pi * sigma**2)) * numpy.exp(- ((x - mu)**2) / (2 * sigma**2))
0.228
The same value is obtained easily with the function below.
>>> from scipy import stats
>>> stats.norm.pdf(x = 3, loc = 4, scale = 0.8)
0.228
F(x)=P(X\leq x)

The cumulative distribution function is expressed as above; that is, a function that calculates the probability that the variable takes a value less than or equal to x. The value obtained is called the lower probability, and this x is called the percent point. For a normal distribution it can be obtained by the integral below, or with the scipy.stats.hoge.cdf function (hoge being the distribution name, e.g. norm).
P(X\leq x)=\int_{-\infty}^{x}\frac{1}{\sqrt{2\pi\sigma^2}}e^{-{\frac{(t-\mu)^2}{2\sigma^2}}}dt
>>> from scipy import stats
>>> stats.norm.cdf(x = 3, loc = 4, scale = 0.8) #loc is mean, scale is standard deviation
0.106
Percent point where the lower probability is 2.5%:
>>> stats.norm.ppf(q = 0.025, loc = 4, scale = 0.8)
2.432
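Since ppf is the inverse of cdf, feeding the percent point back into cdf recovers the lower probability (a quick round-trip check):

>>> lower = stats.norm.ppf(q = 0.025, loc = 4, scale = 0.8)
>>> stats.norm.cdf(x = lower, loc = 4, scale = 0.8)
0.025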
t=\frac{\hat{\mu}-\mu}{\frac{\hat{\sigma}}{\sqrt{N}}}
That is,

t\ value=\frac{\text{sample mean}-\text{population mean}}{\text{standard error}}

The distribution of t-values obtained by repeating the trial many times is the sampling distribution of the t-value.
The sampling distribution of the t-value when the population follows a normal distribution is called the t-distribution.
The t-test checks whether the mean of the data differs from a specific value. The exact procedure depends on whether the data are paired or not; see the following page for details: Functions of the stats module
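A minimal sketch of a one-sample t-test (the data array and hypothesized mean are assumptions for illustration; stats.ttest_1samp is SciPy's one-sample t-test):

import numpy as np
from scipy import stats

data = np.array([4.2, 3.9, 4.5, 4.1, 4.4, 3.8, 4.3, 4.0]) #hypothetical sample
mu_0 = 4.0 #hypothesized population mean

#Manual t-value: (sample mean - population mean) / standard error
se = np.std(data, ddof = 1) / np.sqrt(len(data))
t_value = (np.mean(data) - mu_0) / se
print(t_value)

#The same t-value from SciPy's one-sample t-test, plus the p-value
t_stat, p_value = stats.ttest_1samp(data, popmean = mu_0)
print(t_stat, p_value)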
It is interpreted as the ordinary residual divided by the standard deviation of the distribution.
Example: the binomial distribution
--When p = 0.5, the outcome is 0 or 1 with even odds, so a correct guess is unlikely anyway; a miss in this case is treated as a "small deviation" by the Pearson residual.
--When p = 0.9, the guess should be correct with high probability; if it is nevertheless wrong, it is treated as a "large deviation" by the Pearson residual.
\begin{align}
\text{Pearson residuals} &= \frac{y-N\hat{p}}{\sqrt{N\hat{p}(1-\hat{p})}}\\
&=\frac{y-\hat{p}}{\sqrt{\hat{p}(1-\hat{p})}} \quad (N=1)
\end{align}

\hat{p} represents the estimated probability of success.
The sum of squares of the Pearson residuals is the Pearson chi-square statistic.
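A minimal sketch (the 0/1 outcomes and the estimated success probability below are assumptions for illustration): squaring and summing the per-observation Pearson residuals (N = 1 case) gives the Pearson chi-square statistic.

import numpy as np

y = np.array([1, 0, 1, 1, 0, 1, 1, 1]) #hypothetical 0/1 outcomes
p_hat = 0.7 #hypothetical estimated success probability

#Pearson residuals for each observation (N = 1 case)
residuals = (y - p_hat) / np.sqrt(p_hat * (1 - p_hat))

#Pearson chi-square statistic: sum of squared Pearson residuals
chi2 = np.sum(residuals**2)
print(chi2)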
[^1]: A variable that can take values with decimal places and varies continuously.
Example: a length in cm, such as 3 cm or 4.5 cm.
[^2]: A variable that takes only discrete values such as integers.
Example: a count, such as 1 item.