Data without context is just a list of numbers. To make good use of the data at hand, we need to gather a wide range of supporting information: the mechanism of the phenomenon behind the data, its historical background, and the environment in which it arises. Based on that information, we can then collect further data with an open mind.
Data does not speak for itself simply by being collected; its characteristics emerge through comparison. Computing the mean or variance is called obtaining summary statistics. In addition, we draw frequency charts and line graphs to visualize the data and grasp its characteristics.
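As a concrete illustration of the summary statistics just mentioned, here is a minimal sketch with numpy. The sample data is made up for illustration; the frequency table at the end is a text-mode stand-in for the frequency chart described above.

```python
import numpy as np

# Hypothetical sample data (illustrative, not from the text)
data = np.array([12, 15, 11, 14, 18, 13, 16, 15, 14, 12])

mean = data.mean()        # central tendency
var = data.var(ddof=1)    # unbiased sample variance
std = data.std(ddof=1)    # sample standard deviation

print(f"mean={mean:.1f}, var={var:.2f}, std={std:.2f}")

# A frequency table as a simple text substitute for a frequency chart
values, counts = np.unique(data, return_counts=True)
for v, c in zip(values, counts):
    print(f"{v}: {'*' * c}")
```

Comparing the mean against the spread (variance or standard deviation) is exactly the kind of comparison the text refers to: the numbers only become informative relative to one another.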
Only once such exploratory analysis gives us a view of the whole phenomenon do we finally apply statistical methods. At that point, the purpose of the analysis needs to be clear.
Returning to comparison: any comparison requires a criterion, and there are two ways to obtain one. The first is to seek the criterion from outside, that is, to compare against the true model. This is almost never possible in practice. The alternative is to compare the data against itself; this is what methods such as the t-distribution and analysis of variance do.
The term "model" has already come up; in a nutshell, a model is a probability distribution. It is one way of expressing a stochastic phenomenon: it states that the phenomenon occurs with a certain probability. In practice, however, it is rare for an observed phenomenon to follow such a distribution exactly, because each real situation has its own slightly different character, and the data may also contain observation noise. We therefore consider a conditional distribution model, the representative of which is regression analysis. statsmodels provides many classes suited to this kind of analysis.
## Linear regression models in statsmodels
| Method | statsmodels class |
|---|---|
| Ordinary least squares | OLS |
| Weighted least squares | WLS |
| Generalized least squares | GLS |
| Recursive least squares | RecursiveLS |
The model is estimated by one of these four methods. $ x $ is the explanatory variable, $ e $ is the error, and $ y $ is the dependent variable, modeled as a linear combination of $ x $. For the model obtained by least squares to be plausible, the error is assumed to satisfy the following preconditions:

- it has no bias (zero mean),
- its variance is known and constant,
- its covariance between observations is zero,
- it follows a normal distribution.

GLS is a model that can handle heteroscedasticity, where the error variance is not constant, as well as autocorrelated errors, where the errors are correlated with one another. WLS handles heteroscedasticity, and RecursiveLS handles autocorrelated errors. These models adjust in various ways for errors that violate the conditions, and estimate the regression coefficients once the conditions are satisfied.
In linear regression the error is assumed to follow a normal distribution. The generalized linear model relaxes this: the distribution of $ y $ may be any member of the exponential family. Further developments of this idea include:

- the generalized estimating equation (GEE),
- the generalized linear mixed model,
- the generalized additive model.

OLS is used for linear regression, but in the generalized linear model and its extensions the regression coefficients are estimated by maximum likelihood or closely related methods.