I had to use Python to perform Structural Equation Modeling (SEM). However, since there were almost no Japanese articles to serve as a reference, I decided to write a summary article myself.
--Structural equation modeling is also known as covariance structure analysis. --There are people who call it SEM (Esu, I, Emu) or SEM (Sem).
――It is said that it was born in the form of being implemented in a computer from the 1960s to the 1970s.
There are various algorithms such as. When I start digging deep here, it seems that I will not come back anymore. Therefore, I will omit it this time.
――Personally, I think the biggest attraction is that you can execute the factor model and the regression model at the same time. ――In addition, you can set up a ** hypothetical model ** saying, "First of all, this is the model," and then program the model as it is and execute it. ――In other words, it is a ** flexible analysis method ** that allows you to immediately put your thoughts into shape. ――So, you can ** minimize the discussion, repeat SEM repeatedly, and refine the model **.
In order to perform structural equation modeling, you need to know the idea of observed variables and latent variables.
――As the name suggests, it is ** "observable variable" **. ――Specifically, it refers to the data that can be actually collected in university research, experiments, and business. ――Examples that often appear in structural equation modeling textbooks are subjects such as "humanities academic ability" and "national language, mathematics ...". --Observed variables correspond to "national language" and "mathematics" that can be observed as scores.
--The variables ** hidden behind the observable variables **. ――In the factor model, this is called ** factor (or common factor) **. ――The matching app can collect (≒ observe) data such as "annual income", "occupation", "house", and "hobby", but you cannot observe the "whether or not it is popular (degree of popularity)" behind it. ――Probably many people somehow * recognize that there are "people who are popular (people with a high degree of popularity)" **. I know that the index of popularity is built into the ** model of matching. However, it is difficult to obtain it as data. ** These variables are called latent variables.
――That's why it is used in academic fields such as sociology and psychology where ** variables (latent variables) that cannot be easily obtained as data are required **. ――In fact, there are many things that cannot be acquired as data even in business **, so it can be applied to customer understanding and business structure understanding.
Once you know the difference between an observed variable and a latent variable, you also need to understand the path diagram. This is because this path diagram is a blueprint for a so-called hypothetical model (probably related to this).
First, let's write the variables. I think that I will illustrate from the data that can be acquired, so I will write from the observed variables. Consider what invisible variables these observed variables are derived from. Then draw a unidirectional arrow for the derived variable. In the above path diagram, y1, y2, y3, y4 are considered to be the observed variables derived from the dem60 behind them, so the arrows extend from the latent variables to the observed variables. Also, due to the complexity of the figure, we are not plotting the covariance arrows at this time.
With this basic knowledge, anyone can play structural equation modeling. Actually, before playing with semipy, why ** SEM ** in Python at this time in the first place?
--Python, which has a rich library of machine learning models, had the weakness of not being able to execute SEM. --The Institute of Statistical Mathematical Models includes ** R **, which is a strong programming language for statistical analysis. ――So, if you want to execute SEM using R or SPSS and really want to do it in Python, the mainstream method is to write it in full scratch or to connect it to Python by force using ** PypeR **. ――However, ** semipy has become version 2.0, which makes it more practical. ** ** ――From now on, as the number of users of this library increases, the number of methods that can be used will increase.
So, let's execute structural equation modeling with one Python using semopy.
I'd really like to prepare an original dataset and play with it, but I don't have time, so I'll use the sample semopy: Structural Equation Modeling in Python to summarize the procedure.
Installation with pip command
pip install semopy
Install with git command
git clone https://gitlab.com/georgy.m/semopy
--When setting the abbreviation as as sem
, rewrite the part written as semopy
as sem
.
#Import semopy
import semopy as sem
#Also import the Model that instantiates the hypothetical model
from semopy import Model
--I was able to run it correctly without running from semopy import Model
This time we will use the sample dataset, so get the sample data with the code below.
#Import sample data from the semopy library
from semopy.examples import political_democracy
#Get sample data and assign it to data
data = political_democracy.get_data()
data
--If you see a result like this, it's OK.
--semopy cannot be executed correctly unless the data type is float64 --If the data types are different, it is better to perform type conversion. --Sample data does not require type conversion, so proceed as it is
#Data type confirmation
data.dtypes
#Execution result
y1 float64
y2 float64
y3 float64
y4 float64
y5 float64
y6 float64
y7 float64
y8 float64
x1 float64
x2 float64
x3 float64
dtype: object
--In the sample data, the hypothetical model has already been described.
--Get the hypothetical model settings with get_mode ()
--Assign to the variable desc and try print ()
desc = political_democracy.get_model()
print(desc)
#Execution result
# measurement model
ind60 =~ x1 + x2 + x3
dem60 =~ y1 + y2 + y3 + y4
dem65 =~ y5 + y6 + y7 + y8
# regressions
dem60 ~ ind60
dem65 ~ ind60 + dem60
# residual correlations
y1 ~~ y5
y2 ~~ y4 + y6
y3 ~~ y7
y4 ~~ y8
y6 ~~ y8
--All of these are described in str type (character string)
--Almost the same as the description method in R (the number of ~
in the regression model is different, so some correction is required)
#Assign the hypothetical model to the variable desc
desc = '''
# measurement model
ind60 =~ x1 + x2 + x3
dem60 =~ y1 + y2 + y3 + y4
dem65 =~ y5 + y6 + y7 + y8
# regressions
dem60 ~ ind60
dem65 ~ ind60 + dem60
# residual correlations
y1 ~~ y5
y2 ~~ y4 + y6
y3 ~~ y7
y4 ~~ y8
y6 ~~ y8
'''
--Writing in str type is a simple story that you can enclose it in '''
when assigning it to a variable.
――In order to learn how to write, you need to have a deep understanding of datasets. --Political_democracy is Political Democracy: Industrialization And Political Democracy Dataset in lavaan: Latent Variable Analysis.
|Physical name|Logical name | Japanese translation|Mold|Number of defects| |:--|:--|:--|:--| |y1|1960 Expert Evaluation of Freedom of the Press|float64|0| |y2|Freedom of political opposition in 1960|float64|0| |y3|Fairness of the 1960 elections|float64|0| |y4|Effectiveness of the legislature elected in 1960|float64|0| |y5|1965 Expert Evaluation of Freedom of the Press|float64|0| |y6|Freedom of political opposition in 1965|float64|0| |y7|Fairness of the 1965 election|float64|0| |y8|Effectiveness of the legislature elected in 1965|float64|0| |x1|Gross National Product (GNP) per capita in 1960|float64|0| |x2|Inanimate energy consumption per capita in 1960|float64|0| |x3|Percentage of labor force in industry in 1960|float64|0|
――The latent variable of industry (Industry1960, ind60) in 1960 leads to the observed variables of GNP, energy consumption and the ratio of labor force. ――The latent variable of democracy in 1960 (Democracy1960, demo60) leads to the observation variables of expert evaluation of the press, freedom to oppose politics, election fairness, and government effectiveness. --The latent variable of democracy (Democracy1965, demo65) in 1966 leads to the observation variables of expert evaluation of the press, freedom to oppose politics, election fairness, and government effectiveness.
# measurement model
ind60 =~ x1 + x2 + x3
dem60 =~ y1 + y2 + y3 + y4
dem65 =~ y5 + y6 + y7 + y8
--Industry in 1960 (Industry1960, ind60) and democracy in 1960 (Democracy1960, demo60) are at the same time, so there may be a causal relationship. --The 1965 democracy (Democracy1965, demo65) can be explained by the historical data of the 1960 industry (Industry1960, ind60) and the 1960 democracy (Democracy1960, demo60).
# regressions
dem60 ~ ind60
dem65 ~ ind60 + dem60
--The 1960 press freedom (y1) will correlate with the 1965 press freedom (y5). --The freedom to oppose politics in 1960 (y2) can be explained by the fairness of elections at the same time (y4) and the freedom to oppose politics in 1965 (y6). etc…
# residual correlations
y1 ~~ y5
y2 ~~ y4 + y6
y3 ~~ y7
y4 ~~ y8
y6 ~~ y8
That's why I made a hypothetical model.
semopy has various modules as below
from see import see
see(semopy)
#Execution result
ismodule help() .Model()
.ModelEffects() .ModelMeans() .calc_stats()
.constraints .create_regularization() .efa
.estimate_means() .examples .gather_statistics()
.inspector .means .model .model_base
.model_effects .model_means .multigroup .name
.optimizer .parser .plot .polycorr
.regularization .semplot() .solver
.startingvalues .stats .utils
However, running SEM is easy because it's almost the same process as building a machine learning model.
--Substitute a hypothetical model (desc in this case) as an argument to Model ()
and assign it to mod (abbreviation of model).
#Prepare a learning device
mod = Model(desc)
--The data used for learning (or analysis) is used as an argument (data in this case), and the result is assigned to res (abbreviation of result).
#Substitute the learning result for res
res = mod.fit(data)
#Display learning results
print(res)
#Expected execution result
Name of objective: MLW
Optimization method: SLSQP
Optimization successful.
Optimization terminated successfully.
Objective value: 0.508
Number of iterations: 52
Params: 2.180 1.819 1.257 1.058 1.265 1.186 1.280 1.266 1.482 0.572 0.838 0.624 1.893 1.320 2.156 7.385 0.793 5.067 0.347 3.148 1.357 4.954 3.430 3.951 0.172 0.120 0.082 2.352 3.256 0.467 0.448
I was able to execute it correctly
#Display a list of learning result parameters
inspect = mod.inspect()
print(inspect)
#Expected execution result
lval op rval Estimate Std. Err z-value p-value
0 dem60 ~ ind60 1.482379 0.399024 3.71502 0.00020319
1 dem65 ~ ind60 0.571912 0.221383 2.58336 0.00978421
2 dem65 ~ dem60 0.837574 0.0984456 8.50799 0
3 x1 ~ ind60 1.000000 - - -
4 x2 ~ ind60 2.180494 0.138565 15.7363 0
5 x3 ~ ind60 1.818546 0.151993 11.9646 0
6 y1 ~ dem60 1.000000 - - -
7 y2 ~ dem60 1.256819 0.182687 6.87965 6.00009e-12
8 y3 ~ dem60 1.058174 0.151521 6.9837 2.87503e-12
9 y4 ~ dem60 1.265186 0.145151 8.71634 0
10 y5 ~ dem65 1.000000 - - -
11 y6 ~ dem65 1.185743 0.168908 7.02003 2.21823e-12
12 y7 ~ dem65 1.279717 0.159996 7.99841 1.33227e-15
13 y8 ~ dem65 1.266084 0.158238 8.00114 1.33227e-15
14 dem60 ~~ dem60 3.950849 0.920451 4.2923 1.76835e-05
15 dem65 ~~ dem65 0.172210 0.214861 0.801494 0.422846
16 ind60 ~~ ind60 0.448321 0.0866766 5.17234 2.31175e-07
17 y1 ~~ y5 0.624423 0.358435 1.74208 0.0814939
18 y1 ~~ y1 1.892743 0.44456 4.25756 2.06666e-05
19 y2 ~~ y4 1.319589 0.70268 1.87794 0.0603898
20 y2 ~~ y6 2.156164 0.734155 2.93693 0.00331475
21 y2 ~~ y2 7.385292 1.37567 5.3685 7.93938e-08
22 y3 ~~ y7 0.793329 0.607642 1.30558 0.191694
23 y3 ~~ y3 5.066628 0.951722 5.32365 1.01708e-07
24 y4 ~~ y8 0.347222 0.442234 0.785154 0.432363
25 y4 ~~ y4 3.147911 0.738841 4.2606 2.03874e-05
26 y6 ~~ y8 1.357037 0.5685 2.38705 0.0169843
27 y6 ~~ y6 4.954364 0.914284 5.41884 5.9986e-08
28 y7 ~~ y7 3.430032 0.712732 4.81251 1.49045e-06
29 x2 ~~ x2 0.119894 0.0697474 1.71897 0.0856192
30 x1 ~~ x1 0.081573 0.0194949 4.18432 2.86025e-05
31 y5 ~~ y5 2.351910 0.480369 4.89604 9.77851e-07
32 y8 ~~ y8 3.256389 0.69504 4.68518 2.79711e-06
33 x3 ~~ x3 0.466732 0.0901676 5.17628 2.26359e-07
#Show goodness of fit
stats = semopy.calc_stats(mod)
print(stats)
#Expected execution result
DoF DoF Baseline chi2 chi2 p-value chi2 Baseline CFI \
Value 35 55 38.125446 0.329171 730.654577 0.995374
GFI AGFI NFI TLI RMSEA AIC BIC \
Value 0.94782 0.918003 0.94782 0.992731 0.034738 60.983321 132.825453
LogLik
Value 0.508339
#It's hard to see, so transpose
print(stats.T)
Value
DoF 35.000000
DoF Baseline 55.000000
chi2 38.125446
chi2 p-value 0.329171
chi2 Baseline 730.654577
CFI 0.995374
GFI 0.947820
AGFI 0.918003
NFI 0.947820
TLI 0.992731
RMSEA 0.034738
AIC 60.983321
BIC 132.825453
LogLik 0.508339
--Since it is sample data, interpretation is omitted.
--For simple bar charts and scatter charts, Matplotlib
is fine, but path charts are graphs consisting of nodes and edges.
--Therefore, import Graphviz and output the path diagram.
Example of pip command
$ pip install graphviz
--semplot specifies the model to be used in the first argument (mod this time), and specifies the file name to be output in the second argument.
g = semopy.semplot(m, "sample.png ")
g
--If you take plot_covs = True
as an argument, the covariance arrow will also be plotted as a dotted line.
g = semopy.semplot(m, "sample.png ", plot_covs=True)
--If you take engine =" circleo "
as an argument, the illustration method will change to a circle type (circle type).
--The default is a hierarchical type of engine =" dot "
g = semopy.semplot(mod, "sample.png ", plot_covs=True, engine="circo")
-Summary of how to draw a graph in Graphviz and dot language --Qiita
layout | description |
---|---|
circo | Circular graph |
dot | Hierarchical graph.Suitable for directed graphs.Default layout engine |
fdp | spring(Spring)Model graph.Suitable for undirected graphs. |
neato | spring(Spring)Model graph.Suitable for undirected graphs. |
osage | Array type graph. |
sfdp | Multi-scale version of fdp.Suitable for large undirected graphs. |
twopi | Radial graph.Nodes are arranged concentrically. |
――I thought this would be the first one, and when I played with semopy, there was already an article Structural Equation Modeling (SEM) in Python ~ Trying out the tutorial of semopy ~ ―― Qiita, which I had practiced earlier. There was a person ――The above article was also very helpful. ――For the path diagram, the point that the error variable is not output has not been solved yet, and the drawing function is stronger in R before. --The above may have a solution if you look at the semipy documentation. --Let's try structural equation modeling (SEM) in Python! I hope it will be helpful for those who say.
end
Recommended Posts