There are several ways to calculate the sum of squares for an ANOVA of unbalanced data with two or more factors. The SAS GLM procedure[^1] lets you choose from four types, "type I", "type II", "type III", and "type IV", so these four seem to be the major methods. In this post I summarize what I learned about the differences between types I to III, using two-way ANOVA as an example.
[^1]: This is the procedure used for ANOVA in SAS.
Unbalanced data is data in which the number of observations differs from cell to cell, as in the table below, where factor A and factor B each have two levels. The $A_2B_1$ cell has two observations, but the other cells have only one.

| | $B_1$ | $B_2$ |
|---|---|---|
| $A_1$ | 1 | 13 |
| $A_2$ | 5, 8 | 2 |
If all cells have the same number of observations (balanced data), the same ANOVA results are obtained regardless of the type of sum of squares. In practice, however, most data are unbalanced, so you need to choose a sum-of-squares type when performing ANOVA.
Since an ANOVA model is an ordinary linear model, it can be written as follows, using the observation vector $\boldsymbol{y}$, the design matrix $\boldsymbol{X}$, the unknown parameter vector $\boldsymbol{\theta}$, and the error vector $\boldsymbol{\epsilon}$, whose components are independently distributed as $N(0, \sigma^2)$:
```math
\boldsymbol{y}=\boldsymbol{X}\boldsymbol{\theta}+\boldsymbol{\epsilon}
```
The least squares estimator of the unknown parameters is
```math
\boldsymbol{\hat{\theta}}=(\boldsymbol{X}^T\boldsymbol{X})^{-1}\boldsymbol{X}^T\boldsymbol{y}
```
The residual sum of squares $SS_e$ can be obtained from the residual vector $\boldsymbol{\hat{\epsilon}}$:
```math
\begin{aligned}
SS_e&=\boldsymbol{\hat{\epsilon}}^T\boldsymbol{\hat{\epsilon}}\\
&=(\boldsymbol{y}-\boldsymbol{\hat{y}})^T(\boldsymbol{y}-\boldsymbol{\hat{y}})\\
&=(\boldsymbol{y}-\boldsymbol{X}\boldsymbol{\hat{\theta}})^T(\boldsymbol{y}-\boldsymbol{X}\boldsymbol{\hat{\theta}})\\
&=\boldsymbol{y}^T\boldsymbol{y}-\boldsymbol{y}^T\boldsymbol{X}(\boldsymbol{X}^T\boldsymbol{X})^{-1}\boldsymbol{X}^T\boldsymbol{y}
\end{aligned}
```
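As a quick numerical check, this identity can be verified with NumPy; a minimal sketch on made-up data (the design matrix and observations below are arbitrary):

```python
import numpy as np

# hypothetical toy data, just to check the identity
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(10), rng.normal(size=(10, 2))])  # toy design matrix
y = rng.normal(size=10)                                       # toy observations

# least squares estimate and residual vector
theta_hat = np.linalg.solve(X.T @ X, X.T @ y)
resid = y - X @ theta_hat

SS_e_direct = resid @ resid                                   # directly from the residuals
SS_e_formula = y @ y - y @ X @ np.linalg.solve(X.T @ X, X.T @ y)
print(np.isclose(SS_e_direct, SS_e_formula))                  # True
```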
The term $\boldsymbol{y}^T\boldsymbol{X}(\boldsymbol{X}^T\boldsymbol{X})^{-1}\boldsymbol{X}^T\boldsymbol{y}$ contains the design matrix, so its value changes depending on which parameters are in the model. I will therefore list the parameters included in the model and write this term as $SS(\boldsymbol{\theta})$. Then, for the total sum of squares $SS_T=\boldsymbol{y}^T\boldsymbol{y}$, the following equation holds.[^2]
```math
SS_T=SS(\boldsymbol{\theta})+SS_e
```
The total sum of squares $SS_T$ is thus split into the residual sum of squares $SS_e$ and the remaining part $SS(\boldsymbol{\theta})$, which is called the model sum of squares. The value of $SS_T$ is always the same, but how it is split between $SS_e$ and $SS(\boldsymbol{\theta})$ depends on which model you use.
Thinking of an analysis-of-variance model, $SS(\boldsymbol{\theta})$ can be decomposed further according to the factors included in the model. The different ways of decomposing this model sum of squares are exactly the different types of sums of squares.

[^2]: The total sum of squares here is simply the sum of the squared observations. It is not the sum of squared deviations.
First, consider the decomposition of the sum of squares using a one-way ANOVA model with a single factor A. Let the total number of observations be $N$, and consider model ①, which does not include the effect of factor A, and model ②, which does. Model ② is called the full model because it contains all the effects under consideration, and model ① is called a reduced model because the effect of factor A has been removed.
```math
\begin{aligned}
y_{ij}&=\mu+\epsilon_{ij} &&\cdots\ \text{Model ①}\\
y_{ij}&=\mu+\alpha_i+\epsilon_{ij} &&\cdots\ \text{Model ②}
\end{aligned}
```
In model ① the $\alpha_i$ term is absent, so the mean is the same at every level, while in model ② the mean differs with the level of factor A. If the design matrix of model ② is $\boldsymbol{X}$ and that of model ① is $\boldsymbol{1}$ (the $N \times 1$ vector whose components are all 1), the model sums of squares can be written as
```math
\begin{aligned}
SS(\mu)&=\frac{1}{N}\boldsymbol{y}^T\boldsymbol{J}_N\boldsymbol{y} &&\cdots\ \text{Model ①}\\
SS(\mu,\boldsymbol{\alpha})&=\boldsymbol{y}^T\boldsymbol{X}(\boldsymbol{X}^T\boldsymbol{X})^{-1}\boldsymbol{X}^T\boldsymbol{y} &&\cdots\ \text{Model ②}
\end{aligned}
```
Here $\boldsymbol{J}_N$ is the $N \times N$ square matrix whose components are all 1. However, in the ANOVA model $\boldsymbol{X}^T\boldsymbol{X}$ is singular, so the inverse cannot be computed. Possible remedies include rewriting the design matrix by imposing constraints on the model, or using a generalized inverse; here we proceed with the former. If we impose the zero-sum constraint $\sum_{i}\alpha_i=0$, the design matrix can be rewritten as follows. As an example, suppose factor A has two levels, with two observations at level 1 and three observations at level 2.
```math
\begin{pmatrix}
1 & 1 & 0 \\
1 & 1 & 0 \\
1 & 0 & 1 \\
1 & 0 & 1 \\
1 & 0 & 1
\end{pmatrix}
\Longrightarrow
\begin{pmatrix}
1 & 1 \\
1 & 1 \\
1 & -1 \\
1 & -1 \\
1 & -1
\end{pmatrix}
```
The matrix on the left is the design matrix before the constraint is imposed, and the matrix on the right is the design matrix after. With the design matrix $\boldsymbol{X}$ rewritten under the constraint, $\boldsymbol{X}^T\boldsymbol{X}$ becomes nonsingular and the model sum of squares can be computed.
Carrying out these calculations decomposes the total sum of squares into a model part and a residual part under each model. $SS_T$ is the same for either model, but the model sum of squares is larger for model ② than for model ①, and the residual sum of squares is correspondingly smaller. The difference between the two model sums of squares is the sum of squares of factor A (the so-called between-group sum of squares); in this example it is $SS(\mu,\boldsymbol{\alpha})-SS(\mu)$. The residual sum of squares is taken from model ②.
The sum of squares of factor A can therefore be viewed either as "the increase in the model sum of squares when factor A is added to model ①" or as "the decrease in the model sum of squares when factor A is removed from model ②". In one-way ANOVA the two views give the same value, but for unbalanced data with two or more factors the value depends on which view you take. Type I computes sums of squares following the former idea; types II and III follow the latter.
Incidentally, the formula that often appears in ANOVA textbooks, decomposing the deviation sum of squares into a between-group sum of squares and a residual sum of squares,

```math
\sum_{i}\sum_{j}(y_{ij}-\bar{y}_{..})^2=\sum_{i}n_i(\bar{y}_{i.}-\bar{y}_{..})^2+\sum_{i}\sum_{j}(y_{ij}-\bar{y}_{i.})^2
```

is, in the notation used so far, equivalent to computing $SS_T-SS(\mu)=\{SS(\mu,\boldsymbol{\alpha})-SS(\mu)\}+SS_e$. The quantity $SS_T-SS(\mu)$ is apparently called the mean-adjusted total sum of squares.
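To make the one-way decomposition concrete, here is a small sketch; the five data values are made up for illustration (two observations at level 1, three at level 2, matching the design matrix above). It checks that $SS(\mu,\boldsymbol{\alpha})-SS(\mu)$ computed from the constrained design matrix equals the textbook between-group sum of squares:

```python
import numpy as np

y = np.array([3.0, 5.0, 6.0, 8.0, 7.0])  # hypothetical observations
X1 = np.ones((5, 1))                      # model ①: grand mean only
X2 = np.array([[1,  1],                   # model ②: zero-sum constrained design matrix
               [1,  1],
               [1, -1],
               [1, -1],
               [1, -1]], dtype=float)

def model_ss(X, y):
    # model sum of squares y'X(X'X)^(-1)X'y
    return y @ X @ np.linalg.solve(X.T @ X, X.T @ y)

SS_A = model_ss(X2, y) - model_ss(X1, y)  # SS(μ, α) − SS(μ)

# textbook between-group sum of squares Σ nᵢ(ȳᵢ. − ȳ..)²
groups = [y[:2], y[2:]]
between = sum(len(g) * (g.mean() - y.mean())**2 for g in groups)
print(np.isclose(SS_A, between))          # True
```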
Now consider a two-way ANOVA of the data introduced at the beginning (factors A and B with two levels each, where the $A_2B_1$ cell has two observations).
First, consider the full model for the two-way ANOVA:
```math
y_{ijk}=\mu+\alpha_i+\beta_j+\gamma_{ij}+\epsilon_{ijk}
```
Here $\alpha$ represents the effect of factor A, $\beta$ the effect of factor B, and $\gamma$ the interaction. In matrix form this can be written as
```math
\begin{pmatrix}
1\\ 13\\ 5\\ 8\\ 2
\end{pmatrix}=
\begin{pmatrix}
1 & 1 & 0 & 1 & 0 & 1 & 0 & 0 & 0\\
1 & 1 & 0 & 0 & 1 & 0 & 1 & 0 & 0\\
1 & 0 & 1 & 1 & 0 & 0 & 0 & 1 & 0\\
1 & 0 & 1 & 1 & 0 & 0 & 0 & 1 & 0\\
1 & 0 & 1 & 0 & 1 & 0 & 0 & 0 & 1
\end{pmatrix}
\begin{pmatrix}
\mu\\ \alpha_1\\ \alpha_2\\ \beta_1\\ \beta_2\\ \gamma_{11}\\ \gamma_{12}\\ \gamma_{21}\\ \gamma_{22}
\end{pmatrix}
+\begin{pmatrix}
\epsilon_{111}\\ \epsilon_{121}\\ \epsilon_{211}\\ \epsilon_{212}\\ \epsilon_{221}
\end{pmatrix}
```
As in the one-way case, we rewrite the design matrix using constraints. For the two-way layout, impose $\sum_{i}\alpha_i=0$, $\sum_{j}\beta_j=0$, $\sum_{j}\gamma_{ij}=0$, and $\sum_{i}\gamma_{ij}=0$. Under these constraints, $\alpha_2=-\alpha_1$, $\beta_2=-\beta_1$, and $\gamma_{11}=-\gamma_{12}=-\gamma_{21}=\gamma_{22}$, so the unknown parameters of the model reduce to the four quantities $\mu, \alpha_1, \beta_1, \gamma_{11}$. The design matrix can then be rewritten as:
```math
\boldsymbol{X}=
\begin{pmatrix}
1 & 1 & 1 & 1\\
1 & 1 & -1 & -1\\
1 & -1 & 1 & -1\\
1 & -1 & 1 & -1\\
1 & -1 & -1 & 1
\end{pmatrix}
```
The first through fourth columns of this design matrix correspond to $\mu, \alpha_1, \beta_1, \gamma_{11}$, respectively. From here we consider reduced models in the same way as for the one-way ANOVA. For the two-way layout, consider the following models and their model sums of squares.
| Model (error term omitted) | Model sum of squares | Symbol in code | Value |
|---|---|---|---|
| $y_{ijk}=\mu$ | $SS(\mu)$ | `SS_mu` | 168.2 |
| $y_{ijk}=\mu+\alpha_i$ | $SS(\mu,\boldsymbol{\alpha})$ | `SS_mu_a` | 173.0 |
| $y_{ijk}=\mu+\beta_j$ | $SS(\mu,\boldsymbol{\beta})$ | `SS_mu_b` | 177.83 |
| $y_{ijk}=\mu+\alpha_i+\beta_j$ | $SS(\mu,\boldsymbol{\alpha},\boldsymbol{\beta})$ | `SS_mu_a_b` | 180.71 |
| $y_{ijk}=\mu+\alpha_i+\gamma_{ij}$ | $SS(\mu,\boldsymbol{\alpha},\boldsymbol{\gamma})$ | `SS_mu_a_g` | 242.43 |
| $y_{ijk}=\mu+\beta_j+\gamma_{ij}$ | $SS(\mu,\boldsymbol{\beta},\boldsymbol{\gamma})$ | `SS_mu_b_g` | 249.86 |
| $y_{ijk}=\mu+\alpha_i+\beta_j+\gamma_{ij}$ | $SS(\mu,\boldsymbol{\alpha},\boldsymbol{\beta},\boldsymbol{\gamma})$ | `SS_mu_a_b_g` | 258.5 |
The bottom row of the table is the full model; the others are reduced models. The design matrix of a reduced model is obtained by deleting from $\boldsymbol{X}$ the columns that do not correspond to parameters in the model. For example, $\boldsymbol{X_{\mu\alpha}}$ keeps the first and second columns, which correspond to the parameters $\mu$ and $\alpha_1$ contained in that model:
```math
\boldsymbol{X_{\mu\alpha}}=
\begin{pmatrix}
1 & 1 \\
1 & 1 \\
1 & -1 \\
1 & -1 \\
1 & -1
\end{pmatrix}
```
The value of each model sum of squares is computed as follows. We will use these model sums of squares to calculate the sums of squares of each type.
```python
import numpy as np
import pandas as pd

# data set
df = pd.DataFrame({'A':['A1','A1','A2','A2','A2'],
                   'B':['B1','B2','B1','B1','B2'],
                   'Value':[1,13,5,8,2]})

# observation vector and design matrix (full model, zero-sum constraints)
y = np.matrix(df['Value']).T
X = np.matrix([[1, 1, 1, 1],
               [1, 1, -1, -1],
               [1, -1, 1, -1],
               [1, -1, 1, -1],
               [1, -1, -1, 1]])

# model sum of squares y'X(X'X)^(-1)X'y for each model
SS_mu = y.T*X[:,0]*np.linalg.inv(X[:,0].T*X[:,0])*X[:,0].T*y
SS_mu_a = y.T*X[:,0:2]*np.linalg.inv(X[:,0:2].T*X[:,0:2])*X[:,0:2].T*y
SS_mu_b = y.T*X[:,0:3:2]*np.linalg.inv(X[:,0:3:2].T*X[:,0:3:2])*X[:,0:3:2].T*y
SS_mu_a_b = y.T*X[:,0:3]*np.linalg.inv(X[:,0:3].T*X[:,0:3])*X[:,0:3].T*y
SS_mu_a_g = y.T*X[:,[0,1,3]]*np.linalg.inv(X[:,[0,1,3]].T*X[:,[0,1,3]])*X[:,[0,1,3]].T*y
SS_mu_b_g = y.T*X[:,[0,2,3]]*np.linalg.inv(X[:,[0,2,3]].T*X[:,[0,2,3]])*X[:,[0,2,3]].T*y
SS_mu_a_b_g = y.T*X*np.linalg.inv(X.T*X)*X.T*y
```
## Type I

The type I sum of squares is calculated as follows. Since type I depends on the order in which the factors enter the model, here we enter the factors in the order A → B → interaction.
```math
\begin{aligned}
\text{Sum of squares of factor A}&=SS(\mu, \boldsymbol{\alpha})-SS(\mu)\\
\text{Sum of squares of factor B}&=SS(\mu, \boldsymbol{\alpha}, \boldsymbol{\beta})-SS(\mu, \boldsymbol{\alpha})\\
\text{Sum of squares of interaction}&=SS(\mu, \boldsymbol{\alpha}, \boldsymbol{\beta}, \boldsymbol{\gamma})-SS(\mu, \boldsymbol{\alpha}, \boldsymbol{\beta})\\
\text{Residual sum of squares}&=SS_T-SS(\mu, \boldsymbol{\alpha}, \boldsymbol{\beta}, \boldsymbol{\gamma})
\end{aligned}
```
For type I, we start from the model containing only the grand mean $\mu$ and add the factors to the model one at a time, in order; the increase in the model sum of squares at each step is that factor's sum of squares.
Let's actually calculate the type I sums of squares. Since each model sum of squares has already been computed above, we simply reuse those values.
```python
# Type I SS
A = SS_mu_a - SS_mu
B = SS_mu_a_b - SS_mu_a
AB = SS_mu_a_b_g - SS_mu_a_b
Residual = y.T*y - SS_mu_a_b_g
print(A)
print(B)
print(AB)
print(Residual)
```
```
[[4.8]]          # sum of squares of factor A
[[7.71428571]]   # sum of squares of factor B
[[77.78571429]]  # sum of squares of interaction
[[4.5]]          # residual sum of squares
```
Now that we have the sums of squares, let's compare the results with an ANOVA done in statsmodels.
```python
import statsmodels.formula.api as smf
import statsmodels.api as sm

model = smf.ols("Value ~ A*B", data = df).fit()
sm.stats.anova_lm(model, typ = 1)
```
Comparing the sum_sq column of the anova_lm output with the values calculated by hand, you can see that they match.
The type I sum of squares has the following properties: (1) adding up all the sums of squares, including the residual, gives the mean-adjusted total sum of squares $SS_T-SS(\mu)$, and (2) the values depend on the order in which the factors enter the model. Regarding (2), trying the order B → A → interaction gives different values, as the sketch below shows. Since the interaction enters the model last either way, its sum of squares does not change with the order, and likewise the residual sum of squares does not depend on the order.
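A quick sketch of the B → A → interaction order, reusing the model sums of squares from the snippet above (the numeric comments follow from the model sums of squares already computed):

```python
# Type I SS with the factors entered in the order B → A → interaction
B_first = SS_mu_b - SS_mu         # sum of squares of factor B (≈ 9.63, not 7.71)
A_second = SS_mu_a_b - SS_mu_b    # sum of squares of factor A (≈ 2.88, not 4.8)
AB = SS_mu_a_b_g - SS_mu_a_b      # interaction: enters last, so unchanged
print(B_first, A_second, AB)

# the same order in statsmodels: the term order in the formula sets the type I order
model_ba = smf.ols("Value ~ B*A", data = df).fit()
print(sm.stats.anova_lm(model_ba, typ = 1))
```

Incidentally, the value for factor A in this order is the same as its type II value below, since A is now adjusted for the other main effect.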
## Type II

The type II sum of squares is calculated as follows.
```math
\begin{aligned}
\text{Sum of squares of factor A}&=SS(\mu, \boldsymbol{\alpha}, \boldsymbol{\beta})-SS(\mu, \boldsymbol{\beta})\\
\text{Sum of squares of factor B}&=SS(\mu, \boldsymbol{\alpha}, \boldsymbol{\beta})-SS(\mu, \boldsymbol{\alpha})\\
\text{Sum of squares of interaction}&=SS(\mu, \boldsymbol{\alpha}, \boldsymbol{\beta}, \boldsymbol{\gamma})-SS(\mu, \boldsymbol{\alpha}, \boldsymbol{\beta})\\
\text{Residual sum of squares}&=SS_T-SS(\mu, \boldsymbol{\alpha}, \boldsymbol{\beta}, \boldsymbol{\gamma})
\end{aligned}
```
In type II, we start from the model containing all the main effects and compute the decrease in the model sum of squares when a given factor is removed. The interaction and residual sums of squares are exactly the same as in type I.
Let's actually calculate the sum of squares.
```python
# Type II SS
A = SS_mu_a_b - SS_mu_b
B = SS_mu_a_b - SS_mu_a
AB = SS_mu_a_b_g - SS_mu_a_b
Residual = y.T*y - SS_mu_a_b_g
print(A)
print(B)
print(AB)
print(Residual)
```
```
[[2.88095238]]   # sum of squares of factor A
[[7.71428571]]   # sum of squares of factor B
[[77.78571429]]  # sum of squares of interaction
[[4.5]]          # residual sum of squares
```
We also run the ANOVA in statsmodels.
```python
model = smf.ols("Value ~ A*B", data = df).fit()
sm.stats.anova_lm(model, typ = 2)
```
Comparing with the values calculated by hand, you can see that they match.
As the formulas show, type II does not depend on the order in which the factors enter the model; the sum of squares for factor A is the same whether it is computed before or after that of factor B. Also, adding up all the sums of squares for these data, including the residual, gives $92.88$, while the mean-adjusted total sum of squares (the deviation sum of squares) is $94.8$: for type II, the sums of squares do not add up to the mean-adjusted total, as the quick check below confirms.
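A minimal check of these totals, assuming the type II snippet above has just been run:

```python
# type II sums of squares (including the residual) vs. the mean-adjusted total
print(float(A + B + AB + Residual))   # ≈ 92.88
print(float(y.T*y - SS_mu))           # ≈ 94.8
```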
## Type III

The type III sum of squares is calculated as follows.
```math
\begin{aligned}
\text{Sum of squares of factor A}&=SS(\mu, \boldsymbol{\alpha}, \boldsymbol{\beta}, \boldsymbol{\gamma})-SS(\mu, \boldsymbol{\beta}, \boldsymbol{\gamma})\\
\text{Sum of squares of factor B}&=SS(\mu, \boldsymbol{\alpha}, \boldsymbol{\beta}, \boldsymbol{\gamma})-SS(\mu, \boldsymbol{\alpha}, \boldsymbol{\gamma})\\
\text{Sum of squares of interaction}&=SS(\mu, \boldsymbol{\alpha}, \boldsymbol{\beta}, \boldsymbol{\gamma})-SS(\mu, \boldsymbol{\alpha}, \boldsymbol{\beta})\\
\text{Residual sum of squares}&=SS_T-SS(\mu, \boldsymbol{\alpha}, \boldsymbol{\beta}, \boldsymbol{\gamma})
\end{aligned}
```
In type III, we start from the full two-way model and compute the decrease in the model sum of squares when a given factor is removed. The interaction and residual sums of squares are exactly the same as in types I and II. In type III, the main-effect sums of squares are also adjusted for the interaction. A diagram would look almost the same as for type II, so it is omitted.
Let's actually calculate the sum of squares.
```python
# Type III SS
A = SS_mu_a_b_g - SS_mu_b_g
B = SS_mu_a_b_g - SS_mu_a_g
AB = SS_mu_a_b_g - SS_mu_a_b
Residual = y.T*y - SS_mu_a_b_g
print(A)
print(B)
print(AB)
print(Residual)
```
```
[[8.64285714]]   # sum of squares of factor A
[[16.07142857]]  # sum of squares of factor B
[[77.78571429]]  # sum of squares of interaction
[[4.5]]          # residual sum of squares
```
Let's also run the ANOVA in statsmodels.
```python
model = smf.ols("Value ~ C(A, Sum)*C(B, Sum)", data = df).fit()
sm.stats.anova_lm(model, typ = 3)
```
Comparing with the values calculated by hand, you can see that they match.[^3]
Like type II, type III does not depend on the order in which factors enter the model. Adding up all the sums of squares for these data, including the residual, gives $107$, while the mean-adjusted total sum of squares (the deviation sum of squares) is $94.8$, so for type III, too, the sums of squares do not add up to the mean-adjusted total. As the formulas show, type III coincides with type II when the model contains no interaction.
[^3]: When using the type III sum of squares, if you do not code the factor variables with sum-to-zero contrasts like `C(A, Sum)`, you will get strange results. I puzzled over this for about an hour.
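To illustrate footnote [^3], the sketch below (reusing `df`, `smf`, and `sm` from above) compares the `typ = 3` results under statsmodels' default treatment (dummy) coding and under sum-to-zero coding; only the latter reproduces the hand calculation:

```python
# default treatment (dummy) coding: the typ=3 output does NOT match
# the type III sums of squares calculated by hand
model_default = smf.ols("Value ~ A*B", data = df).fit()
print(sm.stats.anova_lm(model_default, typ = 3))

# sum-to-zero coding: matches the hand calculation
model_sum = smf.ols("Value ~ C(A, Sum)*C(B, Sum)", data = df).fit()
print(sm.stats.anova_lm(model_sum, typ = 3))
```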
Let's take a closer look at the difference between type II and type III. So far, the constraints have been $\sum_{i}\alpha_i=0$, $\sum_{j}\beta_j=0$, $\sum_{j}\gamma_{ij}=0$, and $\sum_{i}\gamma_{ij}=0$. Now change the constraints on the interaction to $\sum_{j}n_{ij}\gamma_{ij}=0$ and $\sum_{i}n_{ij}\gamma_{ij}=0$, where $n_{ij}$ is the number of observations in cell $A_iB_j$; in these data $n_{21}=2$ and all the others are 1. These constraints weight the interaction by the cell counts.
Eliminating parameters under these constraints gives $\gamma_{12}=-\gamma_{11}$, $\gamma_{21}=-(1/2)\gamma_{11}$, and $\gamma_{22}=\gamma_{11}$, so the constrained design matrix $\boldsymbol{X}$ becomes
```math
\boldsymbol{X}=
\begin{pmatrix}
1 & 1 & 1 & 1\\
1 & 1 & -1 & -1\\
1 & -1 & 1 & -0.5\\
1 & -1 & 1 & -0.5\\
1 & -1 & -1 & 1
\end{pmatrix}
```
We now use this design matrix to compute the sums of squares by the type III method above.
```python
# design matrix (full model, under the weighted constraints)
y = np.matrix(df['Value']).T
X = np.matrix([[1, 1, 1, 1],
               [1, 1, -1, -1],
               [1, -1, 1, -0.5],
               [1, -1, 1, -0.5],
               [1, -1, -1, 1]])

# model sum of squares for each model
SS_mu_a_b = y.T*X[:,0:3]*np.linalg.inv(X[:,0:3].T*X[:,0:3])*X[:,0:3].T*y
SS_mu_a_g = y.T*X[:,[0,1,3]]*np.linalg.inv(X[:,[0,1,3]].T*X[:,[0,1,3]])*X[:,[0,1,3]].T*y
SS_mu_b_g = y.T*X[:,[0,2,3]]*np.linalg.inv(X[:,[0,2,3]].T*X[:,[0,2,3]])*X[:,[0,2,3]].T*y
SS_mu_a_b_g = y.T*X*np.linalg.inv(X.T*X)*X.T*y

# sums of squares by the type III method
A = SS_mu_a_b_g - SS_mu_b_g
B = SS_mu_a_b_g - SS_mu_a_g
AB = SS_mu_a_b_g - SS_mu_a_b
Residual = y.T*y - SS_mu_a_b_g
print(A)
print(B)
print(AB)
print(Residual)
```
```
[[2.88095238]]   # sum of squares of factor A
[[7.71428571]]   # sum of squares of factor B
[[77.78571429]]  # sum of squares of interaction
[[4.5]]          # residual sum of squares
```
These results are identical to the type II results obtained earlier: changing the constraints within the type III calculation method reproduces type II. In other words, with the computation fixed to the type III method, the sums of squares under the constraints that weight the interaction by the cell counts are type II, and the sums of squares under the unweighted constraints are type III. The difference between type II and type III can thus be viewed as a difference in constraints.
As for which type to use: type I seems appropriate when there is a natural ordering among the factors. Between type II and type III, my feeling is that type II, which takes the cell counts into account, is preferable when the degree of imbalance is large. However, there seems to be ongoing debate about which of type II and type III is better. Perhaps the safe options are to use whichever is customary in your field, or to compute both and compare the results.