Yesterday covered descriptive statistics and interval estimation as prerequisites for hypothesis testing. Before moving on, let's review the NumPy and SciPy functions we often use to compute basic statistics.
Assume you have numeric vectors X and Y, and that `import numpy as np` and `from scipy import stats` have already been run.
Function | Description |
---|---|
np.max(X) | Find the maximum value of X |
np.min(X) | Find the minimum value of X |
np.mean(X) | Find the mean of X |
np.median(X) | Find the median of X |
np.var(X) | Find the variance of X |
np.std(X) | Find the standard deviation of X |
stats.scoreatpercentile(X, 25) | Find the first quartile of X |
stats.scoreatpercentile(X, 75) | Find the third quartile of X |
np.dot(X, Y) | Find the dot (inner) product of X and Y |
np.outer(X, Y) | Find the outer product of X and Y |
np.corrcoef(X, Y)[0,1] | Find the correlation coefficient between X and Y |
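As a quick check, here is a minimal sketch of these functions in action (the data values are made up for illustration):

```python
import numpy as np
from scipy import stats

# Made-up sample vectors for illustration
X = np.array([1, 3, 5, 7, 9, 11])
Y = np.array([2, 4, 5, 8, 9, 12])

print(np.max(X))                       # maximum: 11
print(np.min(X))                       # minimum: 1
print(np.mean(X))                      # mean: 6.0
print(np.median(X))                    # median: 6.0
print(np.var(X))                       # variance
print(np.std(X))                       # standard deviation
print(stats.scoreatpercentile(X, 25))  # first quartile
print(stats.scoreatpercentile(X, 75))  # third quartile
print(np.dot(X, Y))                    # dot (inner) product
print(np.outer(X, Y))                  # outer product (a matrix)
print(np.corrcoef(X, Y)[0, 1])         # correlation coefficient
```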
A hypothesis test is a statistical significance test of a hypothesis, so the first step is to state one. Depending on the population you assume, for example, the following cases can be considered.
Here are the distributions of the sample sum for each type of population.
Population | Distribution of X1 + ... + Xn |
---|---|
Bernoulli population | If the population follows a Bernoulli distribution with parameter p, then X1 + ... + Xn follows the binomial distribution Bi(n, p). |
Poisson population | If the population follows a Poisson distribution Po(λ), then X1 + ... + Xn follows the Poisson distribution Po(nλ). |
Normal population | If the population follows a normal distribution N(μ, σ^2), then X1 + ... + Xn follows the normal distribution N(nμ, nσ^2). |
For the **normal distribution**, which appears frequently, it may be quickest to consult the [Wikipedia description](http://en.wikipedia.org/wiki/%E6%AD%A3%E8%A6%8F%E5%88%86%E5%B8%83), but the definition is as follows.
f(x) = \frac 1 {\sqrt{2\pi}\,\sigma} \exp \left\{ -\frac {(x-\mu)^2} {2\sigma^2} \right\}, \quad -\infty \lt x \lt \infty
When a random variable X follows this normal distribution, its expected value is:
E(X) = \int_{-\infty}^{\infty} x \cdot \frac 1 {\sqrt{2\pi}\,\sigma} \exp \left\{ -\frac {(x-\mu)^2} {2\sigma^2} \right\} dx = \mu
Similarly, the variance is given by:
V(X) = \int_{-\infty}^{\infty} (x-\mu)^2 \cdot \frac 1 {\sqrt{2\pi}\,\sigma} \exp \left\{ -\frac {(x-\mu)^2} {2\sigma^2} \right\} dx = \sigma^2
From this, a normal distribution with mean μ and variance σ^2 is written as follows.
N(\mu, \sigma^2)
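A quick empirical check (a sketch; the parameters and sample size are chosen arbitrarily): draw samples from N(μ, σ^2) and confirm that the sample mean and variance approach μ and σ^2.

```python
import numpy as np

mu, sigma = 50.0, 10.0  # arbitrary parameters for illustration
X = np.random.normal(mu, sigma, 100000)

print(np.mean(X))  # should be close to mu = 50
print(np.var(X))   # should be close to sigma^2 = 100
```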
The **exponential distribution** is a continuous distribution defined by the following probability density function.
f(x) = \begin{cases} {\lambda}e^{-{\lambda}x} & (x \ge 0) \\ 0 & (x \lt 0) \end{cases}
This distribution models continuous waiting times: for example, the waiting time until an event, the lifetime or useful life of a system with a constant failure rate, or the number of years until a disaster.
The expected value and variance of the random variable X that follows this distribution can be calculated by the following equations.
E(X) = 1/{\lambda} \\
V(X) = 1/{\lambda^2}
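These can be checked by simulation. One gotcha: `np.random.exponential` takes the scale parameter 1/λ, not λ itself. A minimal sketch with an arbitrary λ:

```python
import numpy as np

lam = 0.5                                     # arbitrary rate parameter λ
X = np.random.exponential(1.0 / lam, 100000)  # scale = 1/λ

print(np.mean(X))  # should be close to 1/λ = 2.0
print(np.var(X))   # should be close to 1/λ^2 = 4.0
```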
For a rare event whose waiting time follows an exponential distribution, occurrence in the near future is not unnatural even if its probability is small, because the exponential distribution is memoryless: the remaining waiting time does not depend on how long you have already waited. A large earthquake is a familiar example.
Consider a binomial distribution such as a series of coin tosses. When n is large and p is small (a rare event observed many times), Poisson's law of small numbers holds: the binomial distribution Bi(n, p) is well approximated by a Poisson distribution with λ = np. Familiar examples are a lottery in which only 3 out of 1,000 tickets win, or the success rate of huge deals that each have a very low probability of closing. The Poisson probability mass function is as follows.
P(X = k) = \frac {{\lambda}^k e^{-\lambda}} {k!}, \quad \lambda \gt 0
If the random variable X follows a Poisson distribution, its expected value and variance are as follows. A characteristic feature of the Poisson distribution is that the expected value and the variance are both equal to λ.
E(X) = \lambda \\
V(X) = \lambda
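The law of small numbers can also be checked numerically: for large n and small p, the binomial probabilities are close to the Poisson probabilities with λ = np. A minimal sketch using scipy.stats, with the lottery example (3 winners out of 1,000 tickets):

```python
import numpy as np
from scipy import stats

n, p = 1000, 3 / 1000  # lottery: 3 winning tickets out of 1000
lam = n * p            # λ = np = 3

k = np.arange(10)
print(stats.binom.pmf(k, n, p))   # exact binomial probabilities
print(stats.poisson.pmf(k, lam))  # Poisson approximation, nearly identical
```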
The **chi-square test**, which also appeared the other day, tests hypotheses about variance and goodness of fit. Under the null hypothesis, the test statistic follows the [chi-square distribution](http://en.wikipedia.org/wiki/%E3%82%AB%E3%82%A4%E4%BA%8C%E4%B9%97%E5%88%86%E5%B8%83).
When n values are randomly sampled from the normal distribution N(μ, σ^2) and
Z = \sum_{i=1}^n \frac {(X_i - \mu)^2} {\sigma^2}
Z follows a chi-square distribution with n degrees of freedom.
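This, too, can be confirmed by simulation (a sketch with arbitrary parameters): compute Z many times from n normal samples and compare its sample moments with those of a chi-square distribution with n degrees of freedom, which has mean n and variance 2n.

```python
import numpy as np

mu, sigma, n = 0.0, 1.0, 5  # arbitrary parameters
trials = 100000

X = np.random.normal(mu, sigma, (trials, n))
Z = np.sum((X - mu) ** 2 / sigma ** 2, axis=1)

print(np.mean(Z))  # should be close to n = 5
print(np.var(Z))   # should be close to 2n = 10
```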
For example, suppose you observe passers-by in a shopping street and count 45 women and 55 men. These 100 people show some bias, but is it still plausible that the true male-female ratio is fifty-fifty? Under that null hypothesis, the expected count for each group is 50, and the chi-square statistic is:
\chi^2 = \frac {(45-50)^2} {50} + \frac {(55-50)^2} {50} = 1
Here the degrees of freedom are 1 (the number of categories minus 1). Under a chi-square distribution with 1 degree of freedom, the probability of a value of 1 or more is about 0.32, so the null hypothesis that men and women are equally represented is not rejected. In other words, this outcome can easily happen by chance.
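The same calculation with `scipy.stats.chisquare`, passing the expected counts of 50 each under the null hypothesis:

```python
from scipy import stats

observed = [45, 55]  # women, men
expected = [50, 50]  # fifty-fifty under the null hypothesis

chi2, p = stats.chisquare(observed, f_exp=expected)
print(chi2)  # 1.0
print(p)     # about 0.317 -> not rejected at the 5% level
```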
The **t-test (Student's t-test)** tests hypotheses about the mean for small samples. Using the population mean μ, the sample mean \bar{X}, and the sample standard deviation s (computed with n in the denominator) for a sample of size n drawn from a normally distributed population, T is obtained as follows.
T = \frac {\sqrt{n-1}\,(\bar{X} - \mu)} s
Then T follows a t distribution with n-1 degrees of freedom.
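Since the formula uses the standard deviation with n in the denominator (NumPy's default, `ddof=0`), T written this way agrees with the statistic returned by `scipy.stats.ttest_1samp`. A minimal sketch with made-up data and a hypothesized mean:

```python
import numpy as np
from scipy import stats

X = np.array([68, 75, 80, 71, 73, 79, 69, 65])  # made-up sample
mu = 70.0                                       # hypothesized population mean
n = len(X)

# T by the formula above (np.std divides by n by default)
T = np.sqrt(n - 1) * (np.mean(X) - mu) / np.std(X)
print(T)

# Matches SciPy's one-sample t-test statistic
t, p = stats.ttest_1samp(X, mu)
print(t, p)
```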
Let's use examples to see how the chi-square test and the t-test differ and what the implementation code looks like.
The chi-square test examines cross-tabulated data such as the following to see whether there is a relationship between store and product sales.
Store | Product A | Product B | Total |
---|---|---|---|
Store X | 435 | 165 | 600 |
Store Y | 265 | 135 | 400 |
Total | 700 | 300 | 1000 |
This chi-square test itself was performed previously, so the full walkthrough is omitted, but a minimal sketch follows.
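For reference, a sketch using `scipy.stats.chi2_contingency`, which computes the totals and expected counts automatically:

```python
from scipy import stats

# Observed counts: rows are stores, columns are products
table = [[435, 165],
         [265, 135]]

chi2, p, dof, expected = stats.chi2_contingency(table)
print(chi2, p, dof)
print(expected)  # expected counts under independence
```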
The t-test examines, for example, whether there is a significant difference between the Japanese and math scores in the following data. (* Pseudo data)
Attendance number | Japanese | Math |
---|---|---|
1 | 68 | 86 |
2 | 75 | 83 |
3 | 80 | 76 |
4 | 71 | 81 |
5 | 73 | 75 |
6 | 79 | 82 |
7 | 69 | 87 |
8 | 65 | 75 |
Here is the t-test code.
```python
import numpy as np
from scipy import stats

# Japanese and math scores for the same eight students (paired data)
X = np.array([68, 75, 80, 71, 73, 79, 69, 65])
Y = np.array([86, 83, 76, 81, 75, 82, 87, 75])

print(X)
print(Y)

# Paired t-test, since each student has both scores
t, p = stats.ttest_rel(X, Y)
print("t value is %(t)s" % locals())
print("Probability is %(p)s" % locals())

if p < 0.05:
    print("There is a significant difference")
else:
    print("There is no significant difference")

# [68 75 80 71 73 79 69 65]
# [86 83 76 81 75 82 87 75]
# t value is -2.9923203754253302
# Probability is 0.0201600161737
# There is a significant difference
```
We found that there was a significant difference between Japanese and math grades.
So what about the next science and social grades?
Attendance number | Science | Society |
---|---|---|
1 | 85 | 80 |
2 | 69 | 76 |
3 | 77 | 84 |
4 | 77 | 93 |
5 | 75 | 76 |
6 | 74 | 80 |
7 | 87 | 79 |
8 | 69 | 84 |
Let's try with the same code.
```python
# [85 69 77 77 75 74 87 69]
# [80 76 84 93 76 80 79 84]
# t value is -1.6077470858053244
# Probability is 0.151925908683
# There is no significant difference
```
This time it turned out that there was no significant difference.