Introduction

I had to use Python to perform Structural Equation Modeling (SEM). However, since there were almost no Japanese articles to serve as a reference, I decided to write a summary article myself.

What is structural equation modeling?

--Structural equation modeling is also known as covariance structure analysis. --There are people who call it SEM (Esu, I, Emu) or SEM (Sem).

History of structural equation modeling

――It is said that it was born in the form of being implemented in a computer from the 1960s to the 1970s.

Algorithm type

Simultaneous equation regression method
Iterative maximum likelihood algorithm
Iterative canonical phase matching algorithm

There are various algorithms such as. When I start digging deep here, it seems that I will not come back anymore. Therefore, I will omit it this time.

Attractiveness of structural equation modeling

――Personally, I think the biggest attraction is that you can execute the factor model and the regression model at the same time. ――In addition, you can set up a ** hypothetical model ** saying, "First of all, this is the model," and then program the model as it is and execute it. ――In other words, it is a ** flexible analysis method ** that allows you to immediately put your thoughts into shape. ――So, you can ** minimize the discussion, repeat SEM repeatedly, and refine the model **.

Observed variables and latent variables

In order to perform structural equation modeling, you need to know the idea of observed variables and latent variables.

Observed variables

――As the name suggests, it is ** "observable variable" **. ――Specifically, it refers to the data that can be actually collected in university research, experiments, and business. ――Examples that often appear in structural equation modeling textbooks are subjects such as "humanities academic ability" and "national language, mathematics ...". --Observed variables correspond to "national language" and "mathematics" that can be observed as scores.

Latent variable

--The variables ** hidden behind the observable variables **. ――In the factor model, this is called ** factor (or common factor) **. ――The matching app can collect (≒ observe) data such as "annual income", "occupation", "house", and "hobby", but you cannot observe the "whether or not it is popular (degree of popularity)" behind it. ――Probably many people somehow * recognize that there are "people who are popular (people with a high degree of popularity)" **. I know that the index of popularity is built into the ** model of matching. However, it is difficult to obtain it as data. ** These variables are called latent variables.

Practical application

――That's why it is used in academic fields such as sociology and psychology where ** variables (latent variables) that cannot be easily obtained as data are required **. ――In fact, there are many things that cannot be acquired as data even in business **, so it can be applied to customer understanding and business structure understanding.

Path diagram

Once you know the difference between an observed variable and a latent variable, you also need to understand the path diagram. This is because this path diagram is a blueprint for a so-called hypothetical model (probably related to this).

How to draw a path diagram

Observed variables are ** rectangles **
Latent variables are ** circle / ellipse **
Causality is ** unidirectional arrow (many practices / straight lines) **
Covariance (correlation) is ** bidirectional arrow (many dotted lines / curves) **

First, let's write the variables. I think that I will illustrate from the data that can be acquired, so I will write from the observed variables. Consider what invisible variables these observed variables are derived from. Then draw a unidirectional arrow for the derived variable. In the above path diagram, y1, y2, y3, y4 are considered to be the observed variables derived from the dem60 behind them, so the arrows extend from the latent variables to the observed variables. Also, due to the complexity of the figure, we are not plotting the covariance arrows at this time.

Why is it demopy?

With this basic knowledge, anyone can play structural equation modeling. Actually, before playing with semipy, why ** SEM ** in Python at this time in the first place?

--Python, which has a rich library of machine learning models, had the weakness of not being able to execute SEM. --The Institute of Statistical Mathematical Models includes ** R **, which is a strong programming language for statistical analysis. ――So, if you want to execute SEM using R or SPSS and really want to do it in Python, the mainstream method is to write it in full scratch or to connect it to Python by force using ** PypeR **. ――However, ** semipy has become version 2.0, which makes it more practical. ** ** ――From now on, as the number of users of this library increases, the number of methods that can be used will increase.

So, let's execute structural equation modeling with one Python using semopy.

Perform structural equation modeling using semopy

I'd really like to prepare an original dataset and play with it, but I don't have time, so I'll use the sample semopy: Structural Equation Modeling in Python to summarize the procedure.

0. Library installation

Installation with pip command

pip install semopy

Install with git command

git clone https://gitlab.com/georgy.m/semopy

1. Import semopy

--When setting the abbreviation as as sem, rewrite the part written as semopy as sem.

#Import semopy
import semopy as sem
#Also import the Model that instantiates the hypothetical model
from semopy import Model

--I was able to run it correctly without running from semopy import Model

2. Prepare the dataset

This time we will use the sample dataset, so get the sample data with the code below.

#Import sample data from the semopy library
from semopy.examples import political_democracy

#Get sample data and assign it to data
data = political_democracy.get_data()

data

スクリーンショット 2020-12-13 15.08.12.png

--If you see a result like this, it's OK.

Data type conversion

--semopy cannot be executed correctly unless the data type is float64 --If the data types are different, it is better to perform type conversion. --Sample data does not require type conversion, so proceed as it is

#Data type confirmation
data.dtypes

#Execution result
y1    float64
y2    float64
y3    float64
y4    float64
y5    float64
y6    float64
y7    float64
y8    float64
x1    float64
x2    float64
x3    float64
dtype: object

3. Set the model

Check the settings of the hypothetical model

--In the sample data, the hypothetical model has already been described. --Get the hypothetical model settings with get_mode () --Assign to the variable desc and try print ()

desc = political_democracy.get_model()
print(desc)

#Execution result
# measurement model
ind60 =~ x1 + x2 + x3
dem60 =~ y1 + y2 + y3 + y4
dem65 =~ y5 + y6 + y7 + y8
# regressions
dem60 ~ ind60
dem65 ~ ind60 + dem60
# residual correlations
y1 ~~ y5
y2 ~~ y4 + y6
y3 ~~ y7
y4 ~~ y8
y6 ~~ y8

--All of these are described in str type (character string) --Almost the same as the description method in R (the number of ~ in the regression model is different, so some correction is required)

Set up a hypothetical model

#Assign the hypothetical model to the variable desc
desc = '''
    # measurement model
    ind60 =~ x1 + x2 + x3
    dem60 =~ y1 + y2 + y3 + y4
    dem65 =~ y5 + y6 + y7 + y8
    # regressions
    dem60 ~ ind60
    dem65 ~ ind60 + dem60
    # residual correlations
    y1 ~~ y5
    y2 ~~ y4 + y6
    y3 ~~ y7
    y4 ~~ y8
    y6 ~~ y8
    '''

--Writing in str type is a simple story that you can enclose it in ''' when assigning it to a variable.

Description method

――In order to learn how to write, you need to have a deep understanding of datasets. --Political_democracy is Political Democracy: Industrialization And Political Democracy Dataset in lavaan: Latent Variable Analysis.

Overview of the dataset to use

|Physical name|Logical name | Japanese translation|Mold|Number of defects| |:--|:--|:--|:--| |y1|1960 Expert Evaluation of Freedom of the Press|float64|0| |y2|Freedom of political opposition in 1960|float64|0| |y3|Fairness of the 1960 elections|float64|0| |y4|Effectiveness of the legislature elected in 1960|float64|0| |y5|1965 Expert Evaluation of Freedom of the Press|float64|0| |y6|Freedom of political opposition in 1965|float64|0| |y7|Fairness of the 1965 election|float64|0| |y8|Effectiveness of the legislature elected in 1965|float64|0| |x1|Gross National Product (GNP) per capita in 1960|float64|0| |x2|Inanimate energy consumption per capita in 1960|float64|0| |x3|Percentage of labor force in industry in 1960|float64|0|

Structural equations | Latent variables = ~ Observed variables

――The latent variable of industry (Industry1960, ind60) in 1960 leads to the observed variables of GNP, energy consumption and the ratio of labor force. ――The latent variable of democracy in 1960 (Democracy1960, demo60) leads to the observation variables of expert evaluation of the press, freedom to oppose politics, election fairness, and government effectiveness. --The latent variable of democracy (Democracy1965, demo65) in 1966 leads to the observation variables of expert evaluation of the press, freedom to oppose politics, election fairness, and government effectiveness.

    # measurement model
    ind60 =~ x1 + x2 + x3
    dem60 =~ y1 + y2 + y3 + y4
    dem65 =~ y5 + y6 + y7 + y8

Measurement equation | Variable (result) ~ Variable (explanation)

--Industry in 1960 (Industry1960, ind60) and democracy in 1960 (Democracy1960, demo60) are at the same time, so there may be a causal relationship. --The 1965 democracy (Democracy1965, demo65) can be explained by the historical data of the 1960 industry (Industry1960, ind60) and the 1960 democracy (Democracy1960, demo60).

    # regressions
    dem60 ~ ind60
    dem65 ~ ind60 + dem60

Residual correlation | Variables ~~ Variables

--The 1960 press freedom (y1) will correlate with the 1965 press freedom (y5). --The freedom to oppose politics in 1960 (y2) can be explained by the fairness of elections at the same time (y4) and the freedom to oppose politics in 1965 (y6). etc…

# residual correlations
y1 ~~ y5
y2 ~~ y4 + y6
y3 ~~ y7
y4 ~~ y8
y6 ~~ y8

That's why I made a hypothetical model.

4. Perform learning

semopy has various modules as below

from see import see
see(semopy)

#Execution result
 ismodule             help()               .Model()
 .ModelEffects()      .ModelMeans()        .calc_stats()
 .constraints         .create_regularization()                  .efa
 .estimate_means()    .examples            .gather_statistics()
 .inspector           .means               .model               .model_base
 .model_effects       .model_means         .multigroup          .name
 .optimizer           .parser              .plot                .polycorr
 .regularization      .semplot()           .solver
 .startingvalues      .stats               .utils

However, running SEM is easy because it's almost the same process as building a machine learning model.

model () ｜ Prepare the model to use

--Substitute a hypothetical model (desc in this case) as an argument to Model () and assign it to mod (abbreviation of model).

#Prepare a learning device
mod = Model(desc)

fit () ｜ Perform learning

--The data used for learning (or analysis) is used as an argument (data in this case), and the result is assigned to res (abbreviation of result).

#Substitute the learning result for res
res = mod.fit(data)

#Display learning results
print(res)

#Expected execution result
Name of objective: MLW
Optimization method: SLSQP
Optimization successful.
Optimization terminated successfully.
Objective value: 0.508
Number of iterations: 52
Params: 2.180 1.819 1.257 1.058 1.265 1.186 1.280 1.266 1.482 0.572 0.838 0.624 1.893 1.320 2.156 7.385 0.793 5.067 0.347 3.148 1.357 4.954 3.430 3.951 0.172 0.120 0.082 2.352 3.256 0.467 0.448

I was able to execute it correctly

5. Evaluate the learning results

insepect () | Check parameter estimates

#Display a list of learning result parameters
inspect = mod.inspect()
print(inspect)

#Expected execution result
     lval  op   rval  Estimate   Std. Err   z-value      p-value
0   dem60   ~  ind60  1.482379   0.399024   3.71502   0.00020319
1   dem65   ~  ind60  0.571912   0.221383   2.58336   0.00978421
2   dem65   ~  dem60  0.837574  0.0984456   8.50799            0
3      x1   ~  ind60  1.000000          -         -            -
4      x2   ~  ind60  2.180494   0.138565   15.7363            0
5      x3   ~  ind60  1.818546   0.151993   11.9646            0
6      y1   ~  dem60  1.000000          -         -            -
7      y2   ~  dem60  1.256819   0.182687   6.87965  6.00009e-12
8      y3   ~  dem60  1.058174   0.151521    6.9837  2.87503e-12
9      y4   ~  dem60  1.265186   0.145151   8.71634            0
10     y5   ~  dem65  1.000000          -         -            -
11     y6   ~  dem65  1.185743   0.168908   7.02003  2.21823e-12
12     y7   ~  dem65  1.279717   0.159996   7.99841  1.33227e-15
13     y8   ~  dem65  1.266084   0.158238   8.00114  1.33227e-15
14  dem60  ~~  dem60  3.950849   0.920451    4.2923  1.76835e-05
15  dem65  ~~  dem65  0.172210   0.214861  0.801494     0.422846
16  ind60  ~~  ind60  0.448321  0.0866766   5.17234  2.31175e-07
17     y1  ~~     y5  0.624423   0.358435   1.74208    0.0814939
18     y1  ~~     y1  1.892743    0.44456   4.25756  2.06666e-05
19     y2  ~~     y4  1.319589    0.70268   1.87794    0.0603898
20     y2  ~~     y6  2.156164   0.734155   2.93693   0.00331475
21     y2  ~~     y2  7.385292    1.37567    5.3685  7.93938e-08
22     y3  ~~     y7  0.793329   0.607642   1.30558     0.191694
23     y3  ~~     y3  5.066628   0.951722   5.32365  1.01708e-07
24     y4  ~~     y8  0.347222   0.442234  0.785154     0.432363
25     y4  ~~     y4  3.147911   0.738841    4.2606  2.03874e-05
26     y6  ~~     y8  1.357037     0.5685   2.38705    0.0169843
27     y6  ~~     y6  4.954364   0.914284   5.41884   5.9986e-08
28     y7  ~~     y7  3.430032   0.712732   4.81251  1.49045e-06
29     x2  ~~     x2  0.119894  0.0697474   1.71897    0.0856192
30     x1  ~~     x1  0.081573  0.0194949   4.18432  2.86025e-05
31     y5  ~~     y5  2.351910   0.480369   4.89604  9.77851e-07
32     y8  ~~     y8  3.256389    0.69504   4.68518  2.79711e-06
33     x3  ~~     x3  0.466732  0.0901676   5.17628  2.26359e-07

calc_stats () | Display goodness of fit

#Show goodness of fit
stats = semopy.calc_stats(mod)
print(stats)

#Expected execution result
DoF  DoF Baseline       chi2  chi2 p-value  chi2 Baseline       CFI  \
Value   35            55  38.125446      0.329171     730.654577  0.995374   
           GFI      AGFI      NFI       TLI     RMSEA        AIC         BIC  \
Value  0.94782  0.918003  0.94782  0.992731  0.034738  60.983321  132.825453   
         LogLik  
Value  0.508339  

#It's hard to see, so transpose
print(stats.T)
                    Value
DoF             35.000000
DoF Baseline    55.000000
chi2            38.125446
chi2 p-value     0.329171
chi2 Baseline  730.654577
CFI              0.995374
GFI              0.947820
AGFI             0.918003
NFI              0.947820
TLI              0.992731
RMSEA            0.034738
AIC             60.983321
BIC            132.825453
LogLik           0.508339

--Since it is sample data, interpretation is omitted.

6. Output the path diagram

--For simple bar charts and scatter charts, Matplotlib is fine, but path charts are graphs consisting of nodes and edges. --Therefore, import Graphviz and output the path diagram.

Install Graphviz

Example of pip command

$ pip install graphviz

semplot () | Output path diagram

--semplot specifies the model to be used in the first argument (mod this time), and specifies the file name to be output in the second argument.

g = semopy.semplot(m, "sample.png ")

g

Also show covariance

--If you take plot_covs = True as an argument, the covariance arrow will also be plotted as a dotted line.

g = semopy.semplot(m, "sample.png ", plot_covs=True)

Change the layout

--If you take engine =" circleo " as an argument, the illustration method will change to a circle type (circle type). --The default is a hierarchical type of engine =" dot "

g = semopy.semplot(mod, "sample.png ", plot_covs=True, engine="circo")

Layout type

-Summary of how to draw a graph in Graphviz and dot language --Qiita

layout	description
circo	Circular graph
dot	Hierarchical graph.Suitable for directed graphs.Default layout engine
fdp	spring(Spring)Model graph.Suitable for undirected graphs.
neato	spring(Spring)Model graph.Suitable for undirected graphs.
osage	Array type graph.
sfdp	Multi-scale version of fdp.Suitable for large undirected graphs.
twopi	Radial graph.Nodes are arranged concentrically.

Impressions

――I thought this would be the first one, and when I played with semopy, there was already an article Structural Equation Modeling (SEM) in Python ~ Trying out the tutorial of semopy ~ ―― Qiita, which I had practiced earlier. There was a person ――The above article was also very helpful. ――For the path diagram, the point that the error variable is not output has not been solved yet, and the drawing function is stronger in R before. --The above may have a solution if you look at the semipy documentation. --Let's try structural equation modeling (SEM) in Python! I hope it will be helpful for those who say.

end

Introduction to Structural Equation Modeling (SEM), Covariance Structure Analysis with Python