Statistics Sunday: Fit Statistics in Structural Equation Modeling
There are two types of fit statistics in structural equation modeling: absolute fit and relative fit. When assessing model fit, you should use a combination of both. Nearly all of these statistics are derived in some way from chi-square, which is itself neither a measure of absolute nor relative fit. So let's start there.
Chi-Square
Chi-square is an exception to the absolute versus relative fit dichotomy. It's a measure of exact fit: does your model fit the data? Any deviations between the observed covariance matrix and the model-specified covariance matrix are tallied up, giving an overall metric of the difference between observed and model-specified. If the chi-square is not significant, the model fits your data. If it is significant, the model does not fit your data.

The problem is that chi-square is biased toward significance with large sample sizes and/or large correlations between variables. So for many models, your chi-square will indicate the model does not fit the data, even if it's actually a good model. One way to correct for this is with the normed chi-square I mentioned in the video: divide chi-square by your degrees of freedom. There is no agreed-upon cutoff value for normed chi-square. Personally, I use the critical value for a chi-square with 1 degree of freedom, 3.841. I've been told that's both too liberal and too conservative. Like I said: no agreed-upon cutoff value.
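Computing the normed chi-square is a one-liner in R. Here's a minimal sketch, using hypothetical chi-square and degrees of freedom values for illustration; qchisq() reproduces the 3.841 critical value I mentioned:

# Hypothetical model chi-square and degrees of freedom, for illustration
chi.sq = 26.76
df = 5

# Normed chi-square: chi-square divided by its degrees of freedom
chi.sq/df
## [1] 5.352

# Critical value for a chi-square with 1 df at alpha = .05
qchisq(0.95, df = 1)
## [1] 3.841459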
But chi-square is still very useful for two reasons. First, we use it to compute other fit indices. I’ll talk about that next. Second, we can use it to compare nested models. You can find out more about that a little farther down in this post.
You may ask, then – if chi-square is biased to be significant, why do we use it for all of our other fit indices? The calculations conducted to create these different fit indices are meant to correct for these biases in different ways, factoring in things like sample size or model complexity. That underlying bias is there, though, and there are many different ways to try to correct for it, each way with its own flaws. This is why you should look at a range of fit indices.
Because your fit indices are based on chi-square, which is given to you by whatever statistical program you use to conduct your SEM, you can compute any fit index, even if your program doesn’t give them to you.
Measures of Absolute Fit
These measures are based on the assumption that the perfect model has a fit of 0 – or rather, no deviation between observed and model-specified covariance matrices. As a result, these measures tell you how much worse your model is than the theoretically perfect model, and are sometimes called badness-of-fit measures. For these measures, smaller is better.

Root Mean Square Error of Approximation (RMSEA)
Chi-square is a little like ANOVA in how it deals with variance. That's why it's chi-square: we measure deviations from central tendency by squaring them, to keep them from summing to 0. The same thing is done in ANOVA: squared deviations are added up, producing the sum of squares, which is divided by degrees of freedom to produce the mean square, which is then used in the calculation of the F statistic. RMSEA is calculated in a very similar way to this sum-of-squares-then-mean-square process:

RMSEA = √(χ² − df) / √[df(N − 1)]
where df is degrees of freedom and N is total sample size.

Chi-square is biased toward significance, so the higher the degrees of freedom, the higher the chi-square will likely be. In fact, the expected value of chi-square is equal to its degrees of freedom. The expected value of RMSEA for a perfectly fitting model, then, is 0, since in the equation above, degrees of freedom is subtracted from chi-square. There is no single agreed-upon cutoff for RMSEA, though 0.05 and 0.07 are commonly used.
Let’s look once again at the fit measures from the Satisfaction with Life confirmatory factor analysis. In fact, here’s a trick I didn’t introduce previously – while including fit.measures=TRUE in the summary function will give you only a small number of fit measures, you can access more information with fitMeasures:
Facebook<-read.delim(file="small_facebook_set.txt", header=TRUE)
SWL_Model<-'SWL =~ LS1 + LS2 + LS3 + LS4 + LS5'
library(lavaan)
## This is lavaan 0.5-23.1097
## lavaan is BETA software! Please report any bugs.
SWL_Fit<-cfa(SWL_Model, data=Facebook)
fitMeasures(SWL_Fit)
##                npar                fmin               chisq 
##              10.000               0.052              26.760 
##                  df              pvalue      baseline.chisq 
##               5.000               0.000             635.988 
##         baseline.df     baseline.pvalue                 cfi 
##              10.000               0.000               0.965 
##                 tli                nnfi                 rfi 
##               0.930               0.930               0.916 
##                 nfi                pnfi                 ifi 
##               0.958               0.479               0.966 
##                 rni                logl   unrestricted.logl 
##               0.965           -2111.647           -2098.267 
##                 aic                 bic              ntotal 
##            4243.294            4278.785             257.000 
##                bic2               rmsea      rmsea.ci.lower 
##            4247.082               0.130               0.084 
##      rmsea.ci.upper        rmsea.pvalue                 rmr 
##               0.181               0.003               0.106 
##          rmr_nomean                srmr        srmr_bentler 
##               0.106               0.040               0.040 
## srmr_bentler_nomean         srmr_bollen  srmr_bollen_nomean 
##               0.040               0.040               0.040 
##          srmr_mplus   srmr_mplus_nomean               cn_05 
##               0.040               0.040             107.321 
##               cn_01                 gfi                agfi 
##             145.888               0.959               0.876 
##                pgfi                 mfi                ecvi 
##               0.320               0.959               0.182
The RMSEA is 0.13. We can recreate this using the model chi-square (called chisq above), degrees of freedom (df), and sample size (ntotal):
chi.sq = 26.76
df = 5
N = 257
sqrt(chi.sq - df)/sqrt(df*(N - 1))
## [1] 0.130384
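lavaan also reports a confidence interval around the RMSEA (the rmsea.ci.lower and rmsea.ci.upper values in the output above), which is worth reporting alongside the point estimate. You can pull just those pieces by passing fitMeasures a character vector of measure names:

fitMeasures(SWL_Fit, c("rmsea", "rmsea.ci.lower", "rmsea.ci.upper"))
##          rmsea rmsea.ci.lower rmsea.ci.upper 
##          0.130          0.084          0.181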
Standardized Root Mean Square Residual
The standardized root mean square residual (SRMR) is the square root of the average squared residual between the observed covariance matrix and the model-specified covariance matrix, standardized to range between 0 and 1. Unlike some of the other fit indices I discuss here, SRMR is biased to be larger for models with few degrees of freedom or small sample sizes. This means SRMR has the unusual characteristic of being smaller (i.e., showing better fit) for more complex models. If you remember from the CFA post and video, both models showed poor fit on many of the fit indices but showed good fit based on SRMR. In essence, SRMR rewards something that is penalized by other fit indices. Also unlike the other fit indices discussed here, SRMR is not based on chi-square; you can read more about its calculation here.
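Because SRMR isn't derived from chi-square, you can't recreate it from the values we've been working with, but you can pull it directly from the fit measures output. A quick sketch, reusing the SWL_Fit object from earlier:

fitMeasures(SWL_Fit, "srmr")
## srmr 
## 0.04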
Measures of Relative Fit

In addition to measures of absolute fit, which deal with deviations of the observed covariance matrix from the model-specified covariance matrix, we have measures of relative fit, which compare our model to another theoretical model: the null model, sometimes called the independence model. This model assumes that all variables included are independent, or uncorrelated with each other. This is basically the worst possible model, and fit measures using this model can be thought of as goodness-of-fit measures: how much better does your model fit than the worst possible model you could have? In the fit measures output, this value is called the baseline chi-square. So let's create new variables for our calculations, "null" and "null.df", which hold the baseline chi-square and its degrees of freedom:

null = 635.988
null.df = 10
For relative fit measures, closer to 1 is generally better. Anything lower than 0.9 would be considered poor fit. If any of these formulas produces a value higher than 1, the fit measure is set to 1.
Normed Fit Index (NFI)
According to David Kenny, this was the first fit measure proposed in the literature. It's computed as the difference between the null and observed model chi-squares, divided by the null chi-square.

(null - chi.sq)/null
## [1] 0.9579237
This measure doesn't provide any kind of correction for more complex models, so it isn't recommended for use. (Although, when I was in grad school, which wasn't that long ago, it was one of the recommended measures in my SEM course. How quickly things change...)
Tucker-Lewis Index (TLI)
This measure is also sometimes called the Non-Normed Fit Index (NNFI). It is similar to the NFI but corrects for more complex models by taking a ratio of each chi-square to its corresponding degrees of freedom.

((null/null.df) - (chi.sq/df))/((null/null.df) - 1)
## [1] 0.9304779
Comparative Fit Index (CFI)
CFI provides a very similar, though slightly higher, estimate than the NNFI/TLI; its penalty for complexity is smaller. Instead of taking a ratio of chi-square to degrees of freedom, CFI uses the difference between chi-square and the corresponding degrees of freedom.

((null - null.df) - (chi.sq - df))/(null - null.df)
## [1] 0.965239
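As a sanity check, we can pull the same three indices straight from lavaan and compare them to the hand calculations above (any difference in the later decimal places just reflects rounding in the chi-square values we typed in):

fitMeasures(SWL_Fit, c("nfi", "tli", "cfi"))
##   nfi   tli   cfi 
## 0.958 0.930 0.965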
There are many other fit indices you'll see listed in the fit measures output. GFI and AGFI (which are actually absolute fit measures) were developed by the creators of the LISREL software and are automatically computed by that program. However, pretty much everything else I've read says not to use these fit indices. (Again, different from what I heard in grad school.) I prefer to use CFI and TLI. CFI is always going to be higher than TLI, because it penalizes you less for model complexity than the TLI does. So using both gives you a sort of range of goodness of fit, with the lower end of the continuum (TLI) being more conservative than the upper end (CFI). They're similar, so they'll often tell you the same thing, but you can run into the situation of having a TLI just below your cutoff and a CFI just above it.
Comparing Nested Models
I mentioned in the video the idea of nested versus non-nested models. First, let's talk about nested models. A nested model is another model you specify that has the same structure but adds or drops paths. For instance, I conducted two three-factor models using the Rumination Response Scale: one in which the 3 factors were allowed to correlate with each other and another where they were considered orthogonal (uncorrelated). If I drew out these two models, they would look the same, except that one would have curved arrows between the 3 factors to reflect the correlations and the other would not. Because I'm comparing two models with the same structure, I can test the impact of that change with my chi-square values.

RRS_Model<- '
Depression =~ Rum1 + Rum2 + Rum3 + Rum4 + Rum6 + Rum8 + Rum9 +
              Rum14 + Rum17 + Rum18 + Rum19 + Rum22
Reflecting =~ Rum7 + Rum11 + Rum12 + Rum20 + Rum21
Brooding =~ Rum5 + Rum10 + Rum13 + Rum15 + Rum16
'
RRS_Fit<-cfa(RRS_Model, data=Facebook)
RRS_Fit2<-cfa(RRS_Model, data=Facebook, orthogonal=TRUE)
summary(RRS_Fit)
## lavaan (0.5-23.1097) converged normally after 40 iterations
## 
##   Number of observations                           257
## 
##   Estimator                                         ML
##   Minimum Function Test Statistic              600.311
##   Degrees of freedom                               206
##   P-value (Chi-square)                           0.000
## 
## Parameter Estimates:
## 
##   Information                                 Expected
##   Standard Errors                             Standard
## 
## Latent Variables:
##                    Estimate  Std.Err  z-value  P(>|z|)
##   Depression =~
##     Rum1              1.000
##     Rum2              0.867    0.124    6.965    0.000
##     Rum3              0.840    0.124    6.797    0.000
##     Rum4              0.976    0.126    7.732    0.000
##     Rum6              1.167    0.140    8.357    0.000
##     Rum8              1.147    0.141    8.132    0.000
##     Rum9              1.095    0.136    8.061    0.000
##     Rum14             1.191    0.135    8.845    0.000
##     Rum17             1.261    0.141    8.965    0.000
##     Rum18             1.265    0.142    8.887    0.000
##     Rum19             1.216    0.135    8.992    0.000
##     Rum22             1.257    0.142    8.870    0.000
##   Reflecting =~
##     Rum7              1.000
##     Rum11             0.906    0.089   10.138    0.000
##     Rum12             0.549    0.083    6.603    0.000
##     Rum20             1.073    0.090   11.862    0.000
##     Rum21             0.871    0.088    9.929    0.000
##   Brooding =~
##     Rum5              1.000
##     Rum10             1.092    0.133    8.216    0.000
##     Rum13             0.708    0.104    6.823    0.000
##     Rum15             1.230    0.143    8.617    0.000
##     Rum16             1.338    0.145    9.213    0.000
## 
## Covariances:
##                    Estimate  Std.Err  z-value  P(>|z|)
##   Depression ~~
##     Reflecting        0.400    0.061    6.577    0.000
##     Brooding          0.373    0.060    6.187    0.000
##   Reflecting ~~
##     Brooding          0.419    0.068    6.203    0.000
## 
## Variances:
##                    Estimate  Std.Err  z-value  P(>|z|)
##    .Rum1              0.687    0.063   10.828    0.000
##    .Rum2              0.796    0.072   11.007    0.000
##    .Rum3              0.809    0.073   11.033    0.000
##    .Rum4              0.694    0.064   10.857    0.000
##    .Rum6              0.712    0.067   10.668    0.000
##    .Rum8              0.778    0.072   10.746    0.000
##    .Rum9              0.736    0.068   10.768    0.000
##    .Rum14             0.556    0.053   10.442    0.000
##    .Rum17             0.576    0.056   10.370    0.000
##    .Rum18             0.611    0.059   10.418    0.000
##    .Rum19             0.526    0.051   10.352    0.000
##    .Rum22             0.609    0.058   10.428    0.000
##    .Rum7              0.616    0.067    9.200    0.000
##    .Rum11             0.674    0.069    9.746    0.000
##    .Rum12             0.876    0.080   10.894    0.000
##    .Rum20             0.438    0.056    7.861    0.000
##    .Rum21             0.673    0.068    9.867    0.000
##    .Rum5              0.955    0.090   10.657    0.000
##    .Rum10             0.663    0.065   10.154    0.000
##    .Rum13             0.626    0.058   10.819    0.000
##    .Rum15             0.627    0.064    9.731    0.000
##    .Rum16             0.417    0.050    8.368    0.000
##     Depression        0.360    0.072    4.987    0.000
##     Reflecting        0.708    0.111    6.408    0.000
##     Brooding          0.455    0.096    4.715    0.000
summary(RRS_Fit2)
## lavaan (0.5-23.1097) converged normally after 31 iterations
## 
##   Number of observations                           257
## 
##   Estimator                                         ML
##   Minimum Function Test Statistic             1007.349
##   Degrees of freedom                               209
##   P-value (Chi-square)                           0.000
## 
## Parameter Estimates:
## 
##   Information                                 Expected
##   Standard Errors                             Standard
## 
## Latent Variables:
##                    Estimate  Std.Err  z-value  P(>|z|)
##   Depression =~
##     Rum1              1.000
##     Rum2              0.903    0.129    6.985    0.000
##     Rum3              0.915    0.129    7.065    0.000
##     Rum4              1.071    0.134    8.023    0.000
##     Rum6              1.245    0.147    8.462    0.000
##     Rum8              1.142    0.145    7.849    0.000
##     Rum9              1.124    0.141    7.961    0.000
##     Rum14             1.219    0.140    8.686    0.000
##     Rum17             1.198    0.143    8.374    0.000
##     Rum18             1.189    0.144    8.235    0.000
##     Rum19             1.240    0.141    8.806    0.000
##     Rum22             1.215    0.145    8.380    0.000
##   Reflecting =~
##     Rum7              1.000
##     Rum11             0.999    0.100    9.952    0.000
##     Rum12             0.614    0.090    6.842    0.000
##     Rum20             1.002    0.100    9.979    0.000
##     Rum21             0.971    0.098    9.875    0.000
##   Brooding =~
##     Rum5              1.000
##     Rum10             1.132    0.150    7.536    0.000
##     Rum13             0.662    0.112    5.901    0.000
##     Rum15             1.295    0.164    7.914    0.000
##     Rum16             1.461    0.176    8.292    0.000
## 
## Covariances:
##                    Estimate  Std.Err  z-value  P(>|z|)
##   Depression ~~
##     Reflecting        0.000
##     Brooding          0.000
##   Reflecting ~~
##     Brooding          0.000
## 
## Variances:
##                    Estimate  Std.Err  z-value  P(>|z|)
##    .Rum1              0.692    0.065   10.637    0.000
##    .Rum2              0.777    0.072   10.829    0.000
##    .Rum3              0.766    0.071   10.808    0.000
##    .Rum4              0.630    0.060   10.454    0.000
##    .Rum6              0.653    0.064   10.184    0.000
##    .Rum8              0.790    0.075   10.537    0.000
##    .Rum9              0.719    0.069   10.485    0.000
##    .Rum14             0.540    0.054    9.999    0.000
##    .Rum17             0.640    0.062   10.247    0.000
##    .Rum18             0.686    0.066   10.337    0.000
##    .Rum19             0.513    0.052    9.881    0.000
##    .Rum22             0.655    0.064   10.243    0.000
##    .Rum7              0.656    0.075    8.790    0.000
##    .Rum11             0.588    0.069    8.491    0.000
##    .Rum12             0.838    0.079   10.604    0.000
##    .Rum20             0.582    0.069    8.446    0.000
##    .Rum21             0.580    0.067    8.613    0.000
##    .Rum5              0.993    0.096   10.386    0.000
##    .Rum10             0.671    0.071    9.454    0.000
##    .Rum13             0.671    0.063   10.729    0.000
##    .Rum15             0.616    0.072    8.530    0.000
##    .Rum16             0.342    0.064    5.368    0.000
##     Depression        0.354    0.073    4.867    0.000
##     Reflecting        0.668    0.112    5.972    0.000
##     Brooding          0.417    0.096    4.332    0.000
The first model, where the 3 factors are allowed to correlate, produces a chi-square of 600.311, with 206 degrees of freedom. The second model, where the 3 factors are forced to be orthogonal, produces a chi-square of 1007.349, with 209 degrees of freedom. I can compare these two models by looking at the difference in chi-square between them. That produces a chi-square with degrees of freedom equal to the difference between df for model 1 and df for model 2.
1007.349 - 600.311
## [1] 407.038
This gives me a change in chi-square (Δχ2) of 407.038, with 3 degrees of freedom. I don't even need to check a chi-square table to tell you that value is significant. (I looked it up and was informed my p-value is less than 0.00001.) So forcing the 3 factors to be orthogonal significantly worsens model fit. This provides further evidence that the 3 subscales are highly correlated with each other.
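If you'd rather not do the subtraction and table lookup by hand, lavaan will run the same test for you: calling anova() on two nested lavaan fits performs the chi-square (likelihood ratio) difference test. A minimal sketch, reusing the two fits from above; pchisq() recovers the p-value for the hand-computed difference:

# Chi-square difference test of the two nested models
anova(RRS_Fit, RRS_Fit2)

# Or get the p-value for the hand-computed difference directly;
# lower.tail=FALSE gives the upper-tail probability
pchisq(407.038, df = 3, lower.tail = FALSE)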
Information Criterion Measures
There are a few other fit indices that don't really fall within absolute or relative fit. These are the information criterion measures: the Akaike Information Criterion (AIC), the Bayesian Information Criterion (BIC), and the sample-size adjusted BIC. These fit indices are only meaningful when comparing two different models fit to the same data, and, unlike the chi-square difference test, they don't require the models to be nested. For instance, let's say that in addition to examining a single-factor model of the Satisfaction with Life Scale, I also tested a two-factor model. These two models have a different structure, so they would be non-nested models. I can't look at the difference in chi-square to figure out which model is better. Instead, I can compare my information criterion measures; I prefer to use AIC. In this case, the model with the lowest AIC is the superior model, as in the sketch below.
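Here's what that comparison might look like in code. This is a minimal sketch: the two-factor split of the five SWL items (and the factor names Past and Present) is purely hypothetical, invented for illustration, since the scale is usually treated as unidimensional:

# Hypothetical two-factor split of the SWL items, for illustration only
SWL_Model2<-'
Past =~ LS1 + LS2 + LS3
Present =~ LS4 + LS5
'
SWL_Fit2<-cfa(SWL_Model2, data=Facebook)

# Compare information criteria across the non-nested models;
# the model with the lower values is preferred
fitMeasures(SWL_Fit, c("aic", "bic"))
fitMeasures(SWL_Fit2, c("aic", "bic"))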
Fit measures are a hotly debated topic in structural equation modeling, with disagreement on which ones to use, which cutoffs to apply, and even whether we should be using them at all. (What can I say? We statisticians don't get out much.) Regardless of where you fall in the debate, if you're testing a structural equation model, chances are someone is going to ask to see fit measures, so it's best to provide them even if you hate them with a fiery passion. And though people will likely disagree with which ones you selected and which cutoffs you used, the best things you can do are 1) pick your fit measures before conducting your analysis and stick to them (do not cherry-pick fit measures that make your model look good), and 2) provide sources to back up which ones you used and which cutoffs you selected. My recommendations for sources are:

1. Hooper, D., Coughlan, J., & Mullen, M. R. (2008). Structural equation modelling: Guidelines for determining model fit. Electronic Journal of Business Research Methods, 6(1), 53-60.
2. Hu, L., & Bentler, P. M. (1998). Fit indices in covariance structure modeling: Sensitivity to underparameterized model misspecification. Psychological Methods, 3(4), 424-453.