Structural Equation Modelling in R (Part 1)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Brief explanation
Structural Equation Modelling (SEM) is a state of art methodology and fulfills much of broader discusion about statistical modeling, and allows to make inference and causal analysis. SEM models are regression models braodly used in Marketing, Human Resources, Biostatistics and Medicine, revealing their flexibility as analytical tool.
Despite being a state-of-the-art methodology, SEM does not create new developments of statistical theory, since it consists in a combination of multivariate analysis, factor analysis and regression analysis. SEM makes it possible to validate or reject one or more hypothesis about an existing relation between different variables, to estimate simultaneously dependency relations between variables and estimate measurement error in model’s variables. SEM may be understood as a net of unidirectional or bidirectional paths linking different variables. Such net can be validated using multivariate data.
Confirmatory Factor Analysis
A usual methodology for model evaluation is Confirmatory Factor Analysis (CFA) that is a particular case of SEM. It is a process which consists in specifying quantity and kinds of observed variables to one or more latent variables and analyze how well those variables measure the latent variable itself. Think of a latent variable as an artificial variable that is represented as a linear combination of observed variables.
Holzinger and Swineford (1939) proposed one the most famous CFA models. In their work they analysed the mental ability based test scores of children from two different schools (details here)
We can represent Holzinger and Swineford’s model in a diagram:
In this model, \(\newcommand{\R}{\mathbb{R}} x_i\) are exogenous variables that record scores and \(y_i\) are latent (and endogenous) variables that represent visual ability, textual ability and speed ability respectively, double-headed arrows represent an association between \(y_1\), \(y_2\), and \(y_3\), single-headed arrows represent direct effects, and \(\varepsilon_i\) represent error terms.
A factor loading is a measurable effect which reflects the incidence of an observed variable over another either observed or latent variable. Interpreting factor loadings is equivalent to interpreting direct effects in a linear econometric model. In this example factor loadings are well defined as there are no missing relations between variables, is known which observed variables have an effect over defined latent variables, and both latent variables covariate so there is a relation between latent variables.
Model evaluation
When I wrote my thesis I had to study SEM’s goodness of fit and its indicators. The interested reader may visit my Github repo to read more about some important linear algebra aspects of SEM but here I present a table that synthesizes 50% of my thesis:
Indicator | Type | Classification | Direction | Ideal values | Normed (0-1) |
---|---|---|---|---|---|
Chi-squared | significance | adjusted | more is better | \(>2df\) | no |
RMSEA | significance | adjusted | less is better | \(\leq 0,05\) | no |
SMSR | significance | unadjusted | less is better | \(<0,1\) | no |
GFI | significance | unadjusted | more is better | \(>0,9\) | yes |
This table is useful for a cheatsheet and to keep in mind what to look when comparing models.
Using Lavaan to obtain parameters estimates
I highly recommend using lavaan. I have used AMOS and Stata but I guess R’s flexibility is unbeatable at the moment. Lavaan also provides HolzingerSwineford1939
dataset that contains the complete recording made by Holzinger and Swineford 🙂
First I load the neeeded packages to perform the analysis
# Needed packages to perform the analysis load <- function(pkg){ new.pkg <- pkg[!(pkg %in% installed.packages()[, "Package"])] if (length(new.pkg)) install.packages(new.pkg, dependencies = TRUE) sapply(pkg, require, character.only = TRUE) } packages <- c("lavaan","semPlot","semTools","nonnest2","htmlTable") load(packages)
Then I fit the model using the cfa()
function
# 1.2. Specify the model HS.model <- ' visual =~ x1 + x2 + x3 textual =~ x4 + x5 + x6 speed =~ x7 + x8 + x9 ' # 1.3. Fit the model fit <- cfa(HS.model, data = HolzingerSwineford1939)
Then I obtain a model summary, confidence intervals and goodness of fit indicators
# 2.1. Display summary output summary(fit, fit.measures=FALSE) lavaan (0.5-22) converged normally after 35 iterations Number of observations 301 Estimator ML Minimum Function Test Statistic 85.306 Degrees of freedom 24 P-value (Chi-square) 0.000 Parameter Estimates: Information Expected Standard Errors Standard Latent Variables: Estimate Std.Err z-value P(>|z|) visual =~ x1 1.000 x2 0.554 0.100 5.554 0.000 x3 0.729 0.109 6.685 0.000 textual =~ x4 1.000 x5 1.113 0.065 17.014 0.000 x6 0.926 0.055 16.703 0.000 speed =~ x7 1.000 x8 1.180 0.165 7.152 0.000 x9 1.082 0.151 7.155 0.000 Covariances: Estimate Std.Err z-value P(>|z|) visual ~~ textual 0.408 0.074 5.552 0.000 speed 0.262 0.056 4.660 0.000 textual ~~ speed 0.173 0.049 3.518 0.000 Variances: Estimate Std.Err z-value P(>|z|) .x1 0.549 0.114 4.833 0.000 .x2 1.134 0.102 11.146 0.000 .x3 0.844 0.091 9.317 0.000 .x4 0.371 0.048 7.779 0.000 .x5 0.446 0.058 7.642 0.000 .x6 0.356 0.043 8.277 0.000 .x7 0.799 0.081 9.823 0.000 .x8 0.488 0.074 6.573 0.000 .x9 0.566 0.071 8.003 0.000 visual 0.809 0.145 5.564 0.000 textual 0.979 0.112 8.737 0.000 speed 0.384 0.086 4.451 0.000 # 2.2. Obtain confidence intervals for the estimated coefficients parameterEstimates(fit, ci = TRUE, level = 0.95) lhs op rhs est se z pvalue ci.lower ci.upper 1 visual =~ x1 1.000 0.000 NA NA 1.000 1.000 2 visual =~ x2 0.554 0.100 5.554 0 0.358 0.749 3 visual =~ x3 0.729 0.109 6.685 0 0.516 0.943 4 textual =~ x4 1.000 0.000 NA NA 1.000 1.000 5 textual =~ x5 1.113 0.065 17.014 0 0.985 1.241 6 textual =~ x6 0.926 0.055 16.703 0 0.817 1.035 7 speed =~ x7 1.000 0.000 NA NA 1.000 1.000 8 speed =~ x8 1.180 0.165 7.152 0 0.857 1.503 9 speed =~ x9 1.082 0.151 7.155 0 0.785 1.378 10 x1 ~~ x1 0.549 0.114 4.833 0 0.326 0.772 11 x2 ~~ x2 1.134 0.102 11.146 0 0.934 1.333 12 x3 ~~ x3 0.844 0.091 9.317 0 0.667 1.022 13 x4 ~~ x4 0.371 0.048 7.779 0 0.278 0.465 14 x5 ~~ x5 0.446 0.058 7.642 0 0.332 0.561 15 x6 ~~ x6 0.356 0.043 8.277 0 0.272 0.441 16 x7 ~~ x7 0.799 0.081 9.823 0 0.640 0.959 17 x8 ~~ x8 0.488 0.074 6.573 0 0.342 0.633 18 x9 ~~ x9 0.566 0.071 8.003 0 0.427 0.705 19 visual ~~ visual 0.809 0.145 5.564 0 0.524 1.094 20 textual ~~ textual 0.979 0.112 8.737 0 0.760 1.199 21 speed ~~ speed 0.384 0.086 4.451 0 0.215 0.553 22 visual ~~ textual 0.408 0.074 5.552 0 0.264 0.552 23 visual ~~ speed 0.262 0.056 4.660 0 0.152 0.373 24 textual ~~ speed 0.173 0.049 3.518 0 0.077 0.270 # 2.3 Obtain goodness of fit indicators of the model fitMeasures(fit, c("chisq", "rmsea", "srmr", "gfi", "ecvi")) chisq rmsea srmr gfi ecvi 85.306 0.092 0.065 0.943 0.423 reliability(fit) visual textual speed total alpha 0.6261171 0.8827069 0.6884550 0.7604886 omega 0.6253180 0.8851754 0.6877600 0.8453351 omega2 0.6253180 0.8851754 0.6877600 0.8453351 omega3 0.6120052 0.8850608 0.6858417 0.8596204 avevar 0.3705589 0.7210163 0.4244883 0.5145874
Exporting lavaan output to tables
There are many possibilities but these are some parameters I use on a daily basis
htmlTable(txtRound(parameterEstimates(fit, ci = TRUE, level = 0.95), digits=3, excl.cols=1), align="l|r|r|r|r|r|r|r|r|r")
lhs | op | rhs | est | se | z | pvalue | ci.lower | ci.upper | |
---|---|---|---|---|---|---|---|---|---|
1 | visual | =~ | 1.000 | 1.000 | 0.000 | 1.000 | 1.000 | ||
2 | visual | =~ | 2.000 | 0.554 | 0.100 | 5.554 | 2.798 | 0.358 | 0.749 |
3 | visual | =~ | 3.000 | 0.729 | 0.109 | 6.685 | 2.313 | 0.516 | 0.943 |
4 | textual | =~ | 4.000 | 1.000 | 0.000 | 1.000 | 1.000 | ||
5 | textual | =~ | 5.000 | 1.113 | 0.065 | 17.014 | 0.000 | 0.985 | 1.241 |
6 | textual | =~ | 6.000 | 0.926 | 0.055 | 16.703 | 0.000 | 0.817 | 1.035 |
7 | speed | =~ | 7.000 | 1.000 | 0.000 | 1.000 | 1.000 | ||
8 | speed | =~ | 8.000 | 1.180 | 0.165 | 7.152 | 8.564 | 0.857 | 1.503 |
9 | speed | =~ | 9.000 | 1.082 | 0.151 | 7.155 | 8.398 | 0.785 | 1.378 |
10 | x1 | ~~ | 1.000 | 0.549 | 0.114 | 4.833 | 1.344 | 0.326 | 0.772 |
11 | x2 | ~~ | 2.000 | 1.134 | 0.102 | 11.146 | 0.000 | 0.934 | 1.333 |
12 | x3 | ~~ | 3.000 | 0.844 | 0.091 | 9.317 | 0.000 | 0.667 | 1.022 |
13 | x4 | ~~ | 4.000 | 0.371 | 0.048 | 7.779 | 7.327 | 0.278 | 0.465 |
14 | x5 | ~~ | 5.000 | 0.446 | 0.058 | 7.642 | 2.132 | 0.332 | 0.561 |
15 | x6 | ~~ | 6.000 | 0.356 | 0.043 | 8.277 | 2.220 | 0.272 | 0.441 |
16 | x7 | ~~ | 7.000 | 0.799 | 0.081 | 9.823 | 0.000 | 0.640 | 0.959 |
17 | x8 | ~~ | 8.000 | 0.488 | 0.074 | 6.573 | 4.923 | 0.342 | 0.633 |
18 | x9 | ~~ | 9.000 | 0.566 | 0.071 | 8.003 | 1.110 | 0.427 | 0.705 |
19 | visual | ~~ | visual | 0.809 | 0.145 | 5.564 | 2.640 | 0.524 | 1.094 |
20 | textual | ~~ | textual | 0.979 | 0.112 | 8.737 | 0.000 | 0.760 | 1.199 |
21 | speed | ~~ | speed | 0.384 | 0.086 | 4.451 | 8.533 | 0.215 | 0.553 |
22 | visual | ~~ | textual | 0.408 | 0.074 | 5.552 | 2.818 | 0.264 | 0.552 |
23 | visual | ~~ | speed | 0.262 | 0.056 | 4.660 | 3.168 | 0.152 | 0.373 |
24 | textual | ~~ | speed | 0.173 | 0.049 | 3.518 | 4.346 | 0.077 | 0.270 |
htmlTable(txtRound(reliability(fit),3), align="l|r|r|r|r")
visual | textual | speed | total | |
---|---|---|---|---|
alpha | 0.626 | 0.883 | 0.688 | 0.760 |
omega | 0.625 | 0.885 | 0.688 | 0.845 |
omega2 | 0.625 | 0.885 | 0.688 | 0.845 |
omega3 | 0.612 | 0.885 | 0.686 | 0.860 |
avevar | 0.371 | 0.721 | 0.424 | 0.515 |
There are another packages to covnert R output to tables such as knitr
and xtables
that can also export to \(\rm\LaTeX\).
Creating a diagram for the model in R
I'm aware this plot has too many options but it is the way I got this to produce a suitable result for my expectations.
svg("sem_r/cfa_example_2.svg") semPaths(fit, style="lisrel", whatLabels = "std", edge.label.cex = .6, node.label.cex = .6, label.prop=0.9, edge.label.color = "black", rotation = 4, equalizeManifests = FALSE, optimizeLatRes = TRUE, node.width = 1.5, edge.width = 0.5, shapeMan = "rectangle", shapeLat = "ellipse", shapeInt = "triangle", sizeMan = 4, sizeInt = 2, sizeLat = 4, curve=2, unCol = "#070b8c") dev.off()
Appendix
If you are curious about the first diagram, I made it with TikZ. This is the code
\documentclass[border=3mm]{standalone} \usepackage[applemac]{inputenc} \usepackage[protrusion=true,expansion=true]{microtype} \usepackage[bb=lucida,bbscaled=1,cal=boondoxo]{mathalfa} \usepackage[stdmathitalics=true,math-style=iso,lucidasmallscale=true,romanfamily=bright]{lucimatx} \usepackage{tikz} \usetikzlibrary{arrows,positioning} \newcommand{\at}{\makeatletter @\makeatother} \begin{document} \begin{tikzpicture}[auto,node distance=.5cm, latent/.style={fill=red!20,circle,draw, thick,inner sep=0pt,minimum size=8mm,align=center}, observed/.style={fill=blue!20,rectangle,draw, thick,inner sep=0pt,minimum width=8mm,minimum height=8mm,align=center}, error/.style={fill=yellow!20,circle,draw, thick,inner sep=0pt,minimum width=8mm,minimum height=8mm,align=center}, paths/.style={->, thick, >=stealth'}, paths2/.style={<-, thick, >=stealth'}, twopaths/.style={<->, thick, >=stealth'} ] %Define observed variables \node [latent] (y1) at (0,0) {$y_1$}; \node [latent] (y2) [below=3cm of y1] {$y_2$}; \node [latent] (y3) [below=3cm of y2] {$y_3$}; %%% \node [observed] (x2) [left=1.8cm of y1] {$x_2$}; \node [observed] (x1) [above=of x2] {$x_1$}; \node [observed] (x3) [below=of x2] {$x_3$}; \node [error] (ex1) [left=0.5cm of x1] {$\delta_1$}; \node [error] (ex2) [left=0.5cm of x2] {$\delta_2$}; \node [error] (ex3) [left=0.5cm of x3] {$\delta_3$}; %%% \node [observed] (x5) [left=1.8cm of y2] {$x_5$}; \node [observed] (x4) [above=of x5] {$x_4$}; \node [observed] (x6) [below=of x5] {$x_6$}; \node [error] (ex4) [left=0.5cm of x4] {$\delta_4$}; \node [error] (ex5) [left=0.5cm of x5] {$\delta_5$}; \node [error] (ex6) [left=0.5cm of x6] {$\delta_6$}; %%% \node [observed] (x8) [left=1.8cm of y3] {$x_8$}; \node [observed] (x7) [above=of x8] {$x_7$}; \node [observed] (x9) [below=of x8] {$x_9$}; \node [error] (ex7) [left=0.5cm of x7] {$\delta_7$}; \node [error] (ex8) [left=0.5cm of x8] {$\delta_8$}; \node [error] (ex9) [left=0.5cm of x9] {$\delta_9$}; % Draw paths form latent to observed variables \foreach \all in {x1,x2,x3}{ \draw [paths] (y1.west) to node {} (\all.east); } \draw [paths] (x1.west) to (ex1.east); \draw [paths] (x2.west) to (ex2.east); \draw [paths] (x3.west) to (ex3.east); %%% \foreach \all in {x4,x5,x6}{ \draw [paths] (y2.west) to node {} (\all.east); } \draw [paths] (x4.west) to (ex4.east); \draw [paths] (x5.west) to (ex5.east); \draw [paths] (x6.west) to (ex6.east); %%% \foreach \all in {x7,x8,x9}{ \draw [paths] (y3.west) to node {} (\all.east); } \draw [paths] (x7.west) to (ex7.east); \draw [paths] (x8.west) to (ex8.east); \draw [paths] (x9.west) to (ex9.east); %%% \draw [twopaths] (y1.east) to [bend left=90] (y2.east); \draw [twopaths] (y2.east) to [bend left=90] (y3.east); \draw [twopaths] (y1.east) to [bend left=90] (y3.east); \end{tikzpicture} \end{document}
and then I did run pdf2svg cfa_example.pdf cfa_example.svg
.
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.