numpy.sum(data) #total
numpy.mean(data) #average
numpy.amax(data) #maximum
numpy.amin(data) #minimum
numpy.median(data) #median
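A minimal sketch of these functions on toy data (the array below is an assumption for illustration):

import numpy as np

data = np.array([2, 3, 3, 4, 4, 4, 4, 5, 5, 6]) #hypothetical toy data
print(np.sum(data))    #40
print(np.mean(data))   #4.0
print(np.amax(data))   #6
print(np.amin(data))   #2
print(np.median(data)) #4.0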
An index that indicates how far the data are spread out from the mean.
\sigma^2=\frac{1}{N}\sum_{i=1}^{N} (x_i-\mu)^2
numpy.var(data, ddof = 0) #sample (biased) variance
The sample variance is calculated using the sample mean in place of the (unknown) population mean, but this makes the value biased: it tends to underestimate the population variance.
The bias-corrected version is called the unbiased variance.
\sigma^2=\frac{1}{N-1}\sum_{i=1}^{N} (x_i-\mu)^2
numpy.var(data, ddof = 1) #unbiased variance
Hereafter, unbiased variance will be used.
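The difference is only the divisor, as a quick comparison shows (a sketch, reusing the toy data above):

import numpy as np

data = np.array([2, 3, 3, 4, 4, 4, 4, 5, 5, 6])
print(np.var(data, ddof = 0)) #1.2    (divides by N)
print(np.var(data, ddof = 1)) #1.33... (divides by N - 1)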
The standard deviation is the square root of the variance:
\begin{align}
\sigma&=\sqrt{\sigma^2}\\
&=\sqrt{\frac{1}{N-1}\sum_{i=1}^{N} (x_i-\mu)^2}
\end{align}
numpy.std(data, ddof=1)
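As a quick check (same toy data), the standard deviation equals the square root of the unbiased variance:

import numpy as np

data = np.array([2, 3, 3, 4, 4, 4, 4, 5, 5, 6])
print(np.std(data, ddof = 1))          #1.154...
print(np.sqrt(np.var(data, ddof = 1))) #same value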
Covariance is an index of how two variables vary together.
--When the covariance is greater than 0
→ If one variable takes a large value, the other also tends to take a large value
→ There is a positive correlation.
--When the covariance is less than 0
→ If one variable takes a large value, the other tends to take a small value
→ There is a negative correlation.
Cov(x,y)=\frac{1}{N}\sum_{i=1}^{N} (x_i-\mu_x)(y_i-\mu_y)
import numpy as np
import pandas as pd

#Hypothetical example data (the original cov_data is not shown; any two-column DataFrame works)
cov_data = pd.DataFrame({
    "x": [18.5, 18.8, 19.1, 19.7, 19.9, 20.3],
    "y": [34, 39, 41, 38, 45, 41]
})
print(cov_data)
#Data retrieval
x = cov_data["x"]
y = cov_data["y"]
#sample size
N = len(cov_data)
#Calculation of mean value
mu_x = np.mean(x)
mu_y = np.mean(y)
#Covariance (unbiased: divide by N - 1, matching ddof = 1 below)
cov = sum((x - mu_x) * (y - mu_y)) / (N - 1)
numpy's cov function returns the covariance matrix, with the variances on the diagonal and the covariance off the diagonal:

\begin{bmatrix}
\sigma_x^2 & Cov(x,y) \\
Cov(x,y) & \sigma_y^2
\end{bmatrix}
np.cov(x, y, ddof = 1)
hoge = np.cov(x, y, ddof = 1)
cov = hoge[1, 0] #the off-diagonal element is the covariance
The correlation coefficient standardizes the covariance so that its maximum value is 1 and its minimum value is -1.
\rho_{xy}=\frac{Cov(x,y)}{\sqrt{\sigma_x^2\sigma_y^2}}
#Variance calculation (ddof = 1, consistent with the covariance above)
sigma_2_x = np.var(x, ddof = 1)
sigma_2_y = np.var(y, ddof = 1)
#Correlation coefficient
rho = cov / np.sqrt(sigma_2_x * sigma_2_y)
numpy.corrcoef returns the corresponding correlation matrix:

\begin{bmatrix}
1 & \rho_{xy} \\
\rho_{xy} & 1
\end{bmatrix}
numpy.corrcoef(x,y)
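As a sanity check (a sketch; x and y are the hypothetical columns defined above), the off-diagonal element of this matrix equals the manually computed ρ:

corr_matrix = np.corrcoef(x, y)
print(corr_matrix[1, 0]) #matches rho from the manual calculation above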
A conversion that sets the mean of the data to 0 and the standard deviation to 1: that is, subtract the mean from each data point and divide by the standard deviation.
standardized = (data - numpy.mean(data)) / numpy.std(data, ddof=1)
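A minimal check (a sketch, assuming the toy data array from the first example): after the conversion, the mean is approximately 0 and the unbiased standard deviation is 1.

import numpy as np

data = np.array([2, 3, 3, 4, 4, 4, 4, 5, 5, 6])
standardized = (data - np.mean(data)) / np.std(data, ddof = 1)
print(np.mean(standardized))          #approximately 0 (up to floating-point error)
print(np.std(standardized, ddof = 1)) #1.0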
Probability for a continuous variable[^1]. For a continuous variable, the probability of any one specific value is always 0, because values can have infinitely many decimal places; for example, no one is exactly 160 centimeters tall. However, the "probability that a person is between 159 cm and 160 cm tall" can be calculated, and the quantity used for this is the "probability density". Integrating the probability density over the entire range of the variable gives 1.
cf. The probability of a discrete variable[^2] is the kind of probability most people learn at school (e.g. P(x) = 1/4).
More precisely, consider the probability that a real-valued variable X satisfies x <= X <= x + Δx. Dividing this probability by Δx and letting Δx → 0 gives P(x), which is called the probability density at x.
When calculating a probability, the variable whose probability is being calculated is called a random variable. For example, suppose the probability that x = 2 is 1/3: x is the random variable, and 2 is one of the values it takes.
N(x|\mu, \sigma^2)=\frac{1}{\sqrt{2\pi\sigma^2}}e^{-{\frac{(x-\mu)^2}{2\sigma^2}}}
Example: random variable x = 3, mean = 4, standard deviation = 0.8
>>> x = 3
>>> mu = 4
>>> sigma = 0.8
>>> 1 / (numpy.sqrt(2 * numpy.pi * sigma**2)) * numpy.exp(- ((x - mu)**2) / (2 * sigma**2))
0.228
The same value is obtained easily with the function below.
>>> from scipy import stats
>>> stats.norm.pdf(x = 3, loc = 4, scale = 0.8)
0.228
F(x)=P(X\leq x)

The cumulative distribution function is expressed as above; that is, a function that calculates the probability that the variable takes a value less than or equal to x. The value obtained is called the lower probability, and this x is called the percent point. For a normal distribution it can be obtained by the integral below, or with the scipy.stats.hoge.cdf function (hoge being the distribution name, e.g. norm).
P(X\leq x)=\int_{-\infty}^{x}\frac{1}{\sqrt{2\pi\sigma^2}}e^{-{\frac{(t-\mu)^2}{2\sigma^2}}}dt
>>> from scipy import stats
>>> stats.norm.cdf(x = 3, loc = 4, scale = 0.8) #loc is mean, scale is standard deviation
0.106
Percent point where the lower probability is 2.5%:
>>> stats.norm.ppf(q = 0.025, loc = 4, scale = 0.8)
2.432
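Since ppf is the inverse of cdf, feeding the percent point back into cdf recovers the lower probability (a quick round-trip check):

>>> lower = stats.norm.ppf(q = 0.025, loc = 4, scale = 0.8)
>>> stats.norm.cdf(x = lower, loc = 4, scale = 0.8)
0.025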
t=\frac{\hat{\mu}-\mu}{\frac{\hat{\sigma}}{\sqrt{N}}}
That is,

t\ value=\frac{\text{sample mean}-\text{population mean}}{\text{standard error}}

The distribution of t-values obtained by repeating the trial many times is the sampling distribution of the t-value.
The sampling distribution of the t-value when the population follows a normal distribution is called the t-distribution.
The t-test checks whether the mean of the data differs from a specific value. The exact procedure depends on whether the data are paired or not; see the following page for details: Functions of the stats module
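A minimal sketch of a one-sample t-test (the data array and hypothesized mean are assumptions for illustration; stats.ttest_1samp is SciPy's one-sample t-test):

import numpy as np
from scipy import stats

data = np.array([4.2, 3.9, 4.5, 4.1, 4.4, 3.8, 4.3, 4.0]) #hypothetical sample
mu_0 = 4.0 #hypothesized population mean

#Manual t-value: (sample mean - population mean) / standard error
se = np.std(data, ddof = 1) / np.sqrt(len(data))
t_value = (np.mean(data) - mu_0) / se
print(t_value)

#The same t-value from SciPy's one-sample t-test, plus the p-value
t_stat, p_value = stats.ttest_1samp(data, popmean = mu_0)
print(t_stat, p_value)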
It is interpreted as the ordinary residual divided by the standard deviation of the distribution.
Example: the binomial distribution
--When p = 0.5, the outcome is 0 or 1 with even odds, so a correct guess is unlikely anyway; a miss in this case is treated as a "small deviation" by the Pearson residual.
--When p = 0.9, the guess should be correct with high probability; if it is nevertheless wrong, it is treated as a "large deviation" by the Pearson residual.
\begin{align}
\text{Pearson residuals} &= \frac{y-N\hat{p}}{\sqrt{N\hat{p}(1-\hat{p})}}\\
&=\frac{y-\hat{p}}{\sqrt{\hat{p}(1-\hat{p})}} \quad (N=1)
\end{align}

\hat{p} represents the estimated probability of success.
The sum of squares of the Pearson residuals is the Pearson chi-square statistic.
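A minimal sketch (the 0/1 outcomes and the estimated success probability below are assumptions for illustration): squaring and summing the per-observation Pearson residuals (N = 1 case) gives the Pearson chi-square statistic.

import numpy as np

y = np.array([1, 0, 1, 1, 0, 1, 1, 1]) #hypothetical 0/1 outcomes
p_hat = 0.7 #hypothetical estimated success probability

#Pearson residuals for each observation (N = 1 case)
residuals = (y - p_hat) / np.sqrt(p_hat * (1 - p_hat))

#Pearson chi-square statistic: sum of squared Pearson residuals
chi2 = np.sum(residuals**2)
print(chi2)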
[^1]: A variable that can take values with decimal places and varies continuously.
Example: a length in cm, such as 3 cm or 4.5 cm.
[^2]: A variable that takes only discrete values such as integers.
Example: a count, such as 1 item.