Five ways to calculate internal consistency
Let’s get psychometric and learn a range of ways to compute the internal consistency of a test or questionnaire in R. We’ll be covering:
- Average inter-item correlation
- Average item-total correlation
- Cronbach’s alpha
- Split-half reliability (adjusted using the Spearman–Brown prophecy formula)
- Composite reliability
If you’re unfamiliar with any of these, here are some resources to get you up to speed:
- https://en.wikipedia.org/wiki/Internal_consistency
- https://en.wikipedia.org/wiki/Cronbach%27s_alpha
- http://www.socialresearchmethods.net/kb/reltypes.php
- http://zencaroline.blogspot.com.au/2007/06/composite-reliability.html
The data
For this post, we’ll be using data on a Big 5 measure of personality that is freely available from Personality Tests. You can download the data yourself HERE, or run the following code, which handles the download and saves the data as an object called d:
temp <- tempfile()
download.file("http://personality-testing.info/_rawdata/BIG5.zip", temp, mode = "wb")
d <- read.table(unz(temp, "BIG5/data.csv"), header = TRUE, sep = "\t")
unlink(temp); rm(temp)
At the time this post was written, this data set contained data for 19719 people, starting with some demographic information and then their responses on 50 items: 10 for each Big 5 dimension. This is a bit much, so let’s cut it down to work with the first 500 participants and the Extraversion items (E1 to E10):
d <- d[1:500, paste0("E", 1:10)]
str(d)
#> 'data.frame': 500 obs. of 10 variables:
#>  $ E1 : int 4 2 5 2 3 1 5 4 3 1 ...
#>  $ E2 : int 2 2 1 5 1 5 1 3 1 4 ...
#>  $ E3 : int 5 3 1 2 3 2 5 5 5 2 ...
#>  $ E4 : int 2 3 4 4 3 4 1 3 1 5 ...
#>  $ E5 : int 5 3 5 3 3 1 5 5 5 2 ...
#>  $ E6 : int 1 3 1 4 1 3 1 1 1 4 ...
#>  $ E7 : int 4 1 1 3 3 2 5 4 5 1 ...
#>  $ E8 : int 3 5 5 4 1 4 4 3 2 4 ...
#>  $ E9 : int 5 1 5 4 3 1 4 4 5 1 ...
#>  $ E10: int 1 5 1 5 5 5 1 3 3 5 ...
Here is a list of the extraversion items that people are rating from 1 = Disagree to 5 = Agree:
- E1 I am the life of the party.
- E2 I don’t talk a lot.
- E3 I feel comfortable around people.
- E4 I keep in the background.
- E5 I start conversations.
- E6 I have little to say.
- E7 I talk to a lot of different people at parties.
- E8 I don’t like to draw attention to myself.
- E9 I don’t mind being the center of attention.
- E10 I am quiet around strangers.
You can see that there are five items that need to be reverse scored (E2, E4, E6, E8, E10). Because ratings range from 1 to 5, we can reverse these items by subtracting each response from 6:
d[, paste0("E", c(2, 4, 6, 8, 10))] <- 6 - d[, paste0("E", c(2, 4, 6, 8, 10))]
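In case the subtraction trick is unclear, here is a quick check on a toy vector (illustrative only, not the real data):

```r
# On a 1-5 scale, subtracting from 6 maps 1 <-> 5 and 2 <-> 4,
# while the midpoint 3 stays put
x <- c(1, 2, 3, 4, 5)
6 - x
#> [1] 5 4 3 2 1
```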
We’ve now got a data frame of responses with each column being an item (scored in the correct direction) and each row being a participant. Let’s get started!
Average inter-item correlation
The average inter-item correlation is an easy place to start. To calculate this statistic, we need the correlations between all items, and then to average them. Let’s use my corrr package to get these correlations as follows (no bias here!):
library(corrr)
d %>% correlate()
#> # A tibble: 10 x 11
#>    rowname        E1        E2        E3        E4        E5        E6
#>      <chr>     <dbl>     <dbl>     <dbl>     <dbl>     <dbl>     <dbl>
#> 1       E1        NA 0.4528892 0.5002331 0.5237516 0.5378563 0.3657830
#> 2       E2 0.4528892        NA 0.4792026 0.5549109 0.5916995 0.5694593
#> 3       E3 0.5002331 0.4792026        NA 0.4930294 0.6161848 0.3296186
#> 4       E4 0.5237516 0.5549109 0.4930294        NA 0.5123502 0.4714739
#> 5       E5 0.5378563 0.5916995 0.6161848 0.5123502        NA 0.4996636
#> 6       E6 0.3657830 0.5694593 0.3296186 0.4714739 0.4996636        NA
#> 7       E7 0.6360617 0.4731673 0.5684515 0.4999339 0.6205428 0.3725773
#> 8       E8 0.4498123 0.3791802 0.4177098 0.4505763 0.3850780 0.3310247
#> 9       E9 0.5280366 0.3958574 0.4753085 0.4631383 0.4851778 0.3280024
#> 10     E10 0.4908861 0.4487868 0.5000737 0.5234228 0.5525188 0.4137737
#> # ... with 4 more variables: E7 <dbl>, E8 <dbl>, E9 <dbl>, E10 <dbl>
Because the diagonal is already set to NA, we can obtain the average correlation of each item with all others by computing the means for each column (excluding the rowname column):
inter_item <- d %>% correlate() %>% select(-rowname) %>% colMeans(na.rm = TRUE)

inter_item
#>        E1        E2        E3        E4        E5        E6        E7
#> 0.4983678 0.4827948 0.4866458 0.4991764 0.5334524 0.4090418 0.5133091
#>        E8        E9       E10
#> 0.4270190 0.4731896 0.4814496
As an aside, note that select() comes from the dplyr package, which is imported when you use corrr.
We can see that E5 and E7 are, on average, more strongly correlated with the other items than E8 is. However, most items correlate with the others in a reasonably restricted range, around .4 to .5.
To obtain the overall average inter-item correlation, we calculate the mean() of these values:
mean(inter_item)
#> [1] 0.4804446
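As an aside, the same per-item and overall averages can be reproduced in base R from the full correlation matrix. A sketch on simulated toy data (the toy object here is made up for illustration; the equivalence doesn't depend on the particular data set):

```r
set.seed(1)
toy <- as.data.frame(matrix(sample(1:5, 200, replace = TRUE), ncol = 4))

m <- cor(toy)                        # full correlation matrix
diag(m) <- NA                        # drop self-correlations, as correlate() does
colMeans(m, na.rm = TRUE)            # per-item average correlations
mean(m[lower.tri(m)])                # overall average inter-item correlation
```

Because the matrix is symmetric, averaging the per-item means gives the same overall value as averaging the lower-triangle entries directly.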
However, with these values, we can explore a range of attributes about the relationships between the items. For example, we can visualise them in a histogram and highlight the mean as follows:
library(ggplot2)

data.frame(inter_item) %>%
  ggplot(aes(x = inter_item)) +
    geom_histogram(bins = 10, alpha = .5) +
    geom_vline(xintercept = mean(inter_item), color = "red") +
    xlab("Mean inter-item correlation") +
    theme_bw()
Average item-total correlation
We can investigate the average item-total correlation in a similar way to the inter-item correlations. The first thing we need to do is calculate the total score. Let’s say that a person’s score is the mean of their responses to all ten items:
d$score <- rowMeans(d)
head(d)
#>   E1 E2 E3 E4 E5 E6 E7 E8 E9 E10 score
#> 1  4  4  5  4  5  5  4  3  5   5   4.4
#> 2  2  4  3  3  3  3  1  1  1   1   2.2
#> 3  5  5  1  2  5  5  1  1  5   5   3.5
#> 4  2  1  2  2  3  2  3  2  4   1   2.2
#> 5  3  5  3  3  3  5  3  5  3   1   3.4
#> 6  1  1  2  2  1  3  2  2  1   1   1.6
Now, we’ll correlate() everything again, but this time focus() on the correlations of the score with the items:
item_total <- d %>% correlate() %>% focus(score)

item_total
#> # A tibble: 10 x 2
#>    rowname     score
#>      <chr>     <dbl>
#> 1       E1 0.7520849
#> 2       E2 0.7297506
#> 3       E3 0.7350644
#> 4       E4 0.7485092
#> 5       E5 0.7945943
#> 6       E6 0.6367799
#> 7       E7 0.7768180
#> 8       E8 0.6640914
#> 9       E9 0.7273987
#> 10     E10 0.7306038
Again, we can calculate their mean as:
mean(item_total$score)
#> [1] 0.7295695
And we can plot the results:
item_total %>%
  ggplot(aes(x = score)) +
    geom_histogram(bins = 10, alpha = .5) +
    geom_vline(xintercept = mean(item_total$score), color = "red") +
    xlab("Mean item-total correlation") +
    theme_bw()
Cronbach’s alpha
Cronbach’s alpha is one of the most widely reported measures of internal consistency. Although it’s possible to implement the maths behind it, I’m lazy and like to use the alpha()
function from the psych package. This function takes a data frame or matrix of data in the structure that we’re using: each column is a test/questionnaire item, each row is a person. Let’s test it out below. Note that alpha()
is also a function from the ggplot2 package, and this creates a conflict. To specify that we want alpha()
from the psych package, we will use psych::alpha():
d$score <- NULL  # delete the score column we made earlier
psych::alpha(d)
#>
#> Reliability analysis
#> Call: psych::alpha(x = d)
#>
#>  raw_alpha std.alpha G6(smc) average_r S/N    ase mean   sd
#>        0.9       0.9     0.9      0.48 9.2 0.0065    3 0.98
#>
#>  lower alpha upper     95% confidence boundaries
#>  0.89 0.9 0.91
#>
#>  Reliability if an item is dropped:
#>     raw_alpha std.alpha G6(smc) average_r S/N alpha se
#> E1       0.89      0.89    0.89      0.48 8.2   0.0073
#> E2       0.89      0.89    0.89      0.48 8.3   0.0072
#> E3       0.89      0.89    0.89      0.48 8.3   0.0073
#> E4       0.89      0.89    0.89      0.48 8.2   0.0073
#> E5       0.89      0.89    0.89      0.47 7.9   0.0075
#> E6       0.90      0.90    0.90      0.50 8.9   0.0068
#> E7       0.89      0.89    0.89      0.47 8.1   0.0075
#> E8       0.90      0.90    0.90      0.49 8.8   0.0069
#> E9       0.89      0.89    0.89      0.48 8.4   0.0071
#> E10      0.89      0.89    0.89      0.48 8.3   0.0072
#>
#>  Item statistics
#>       n raw.r std.r r.cor r.drop mean  sd
#> E1  500  0.75  0.75  0.72   0.68  2.6 1.3
#> E2  500  0.73  0.73  0.70   0.66  3.2 1.3
#> E3  500  0.74  0.74  0.70   0.67  3.3 1.3
#> E4  500  0.75  0.75  0.72   0.68  2.8 1.3
#> E5  500  0.79  0.80  0.78   0.74  3.3 1.3
#> E6  500  0.64  0.64  0.59   0.55  3.6 1.3
#> E7  500  0.78  0.77  0.75   0.70  2.8 1.5
#> E8  500  0.66  0.66  0.61   0.58  2.7 1.3
#> E9  500  0.73  0.72  0.68   0.64  3.1 1.5
#> E10 500  0.73  0.73  0.69   0.66  2.3 1.3
#>
#>  Non missing response frequency for each item
#>        1    2    3    4    5 miss
#> E1  0.28 0.23 0.23 0.16 0.10    0
#> E2  0.12 0.21 0.22 0.21 0.23    0
#> E3  0.09 0.20 0.25 0.22 0.24    0
#> E4  0.19 0.26 0.23 0.21 0.12    0
#> E5  0.11 0.21 0.20 0.23 0.25    0
#> E6  0.09 0.15 0.17 0.28 0.30    0
#> E7  0.28 0.20 0.17 0.16 0.20    0
#> E8  0.23 0.27 0.19 0.19 0.12    0
#> E9  0.19 0.21 0.17 0.19 0.25    0
#> E10 0.36 0.27 0.12 0.15 0.09    0
This function provides a range of output, and generally what we’re interested in is std.alpha
, which is “the standardised alpha based upon the correlations”. Also note that we get “the average interitem correlation”, average_r
, and various versions of “the correlation of each item with the total score” such as raw.r
, whose values match our earlier calculations.
If you’d like to access the alpha value itself, you can do the following:
psych::alpha(d)$total$std.alpha
#> [1] 0.9024126
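As a quick check on how these numbers hang together, standardised alpha follows directly from the average inter-item correlation r_bar for k items: alpha = (k * r_bar) / (1 + (k - 1) * r_bar). Plugging in the (rounded) values from above:

```r
# Standardised alpha from k items with average inter-item correlation r_bar
k <- 10
r_bar <- 0.48  # approximate average inter-item correlation from earlier
(k * r_bar) / (1 + (k - 1) * r_bar)
#> [1] 0.9022556
```

This lands close to the 0.902 that psych::alpha() reported; the small gap is rounding in r_bar.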
Split-half reliability (adjusted using the Spearman–Brown prophecy formula)
There are times when we can’t calculate internal consistency using item responses. For example, I often work with a decision-making variable called recklessness. This variable is calculated after people answer questions (e.g., “What is the longest river in Asia?”), and then decide whether or not to bet on their answer being correct. Recklessness is calculated as the proportion of incorrect answers that a person bets on.
If you think about it, it’s not possible to calculate internal consistency for this variable using any of the above measures. The reason for this is that the items that contribute to two people’s recklessness scores could be completely different. One person could give incorrect answers on questions 1 to 5 (thus these questions go into calculating their score), while another person might incorrectly respond to questions 6 to 10. Thus, calculating recklessness for many individuals isn’t as simple as summing across items. Instead, we need an item pool from which to pull different combinations of questions for each person.
To overcome this sort of issue, an appropriate method for calculating internal consistency is to use a split-half reliability. This entails splitting your test items in half (e.g., into odd and even) and calculating your variable for each person with each half. For example, I typically calculate recklessness for each participant from odd items and then from even items. These scores are then correlated and adjusted using the Spearman-Brown prophecy/prediction formula (for examples, see some of my publications such as this or this). Similar to Cronbach’s alpha, a value closer to 1 and further from zero indicates greater internal consistency.
We can still calculate split-half reliability for variables that do not have this problem! So let’s do this with our extraversion data as follows:
# Calculating total score...
score_e <- rowMeans(d[, c(FALSE, TRUE)])  # with even items
score_o <- rowMeans(d[, c(TRUE, FALSE)])  # with odd items

# Correlating scores from even and odd items
r <- cor(score_e, score_o)
r
#> [1] 0.7681034

# Adjusting with the Spearman-Brown prophecy formula
(2 * r) / (1 + r)
#> [1] 0.8688445
Thus, in this case, the split-half reliability approach yields an internal consistency estimate of .87.
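If you use this adjustment often, it is convenient to wrap in a small helper. The function name below is my own, not from any package; n is the factor by which the test is lengthened, and n = 2 recovers the split-half adjustment above:

```r
# Spearman-Brown prophecy formula: predicted reliability when a test with
# reliability r is lengthened by a factor of n (n = 2 for split halves)
spearman_brown <- function(r, n = 2) (n * r) / (1 + (n - 1) * r)

spearman_brown(0.7681034)  # close to the .87 reported above
```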
Composite reliability
The final method for calculating internal consistency that we’ll cover is composite reliability. Where possible, my personal preference is to use this approach. Although it’s not perfect, it takes care of many inappropriate assumptions that measures like Cronbach’s alpha make. If the specificities interest you, I suggest reading this post.
Composite reliability is based on the factor loadings in a confirmatory factor analysis (CFA). In the case of a unidimensional scale (like extraversion here), we define a one-factor CFA, and then use the factor loadings to compute our internal consistency estimate. I won’t go into the detail, but we can interpret a composite reliability score similarly to any of the other metrics covered here (closer to one indicates better internal consistency). We’ll fit our CFA model using the lavaan package as follows:
library(lavaan)

# Define the model
items <- paste(names(d), collapse = "+")
model <- paste("extraversion", items, sep = "=~")
model
#> [1] "extraversion=~E1+E2+E3+E4+E5+E6+E7+E8+E9+E10"

# Fit the model
fit <- cfa(model, data = d)
There are various ways to get to the composite reliability from this model. We’ll extract the standardized factor loadings and work with those:
sl <- standardizedSolution(fit)
sl <- sl$est.std[sl$op == "=~"]
names(sl) <- names(d)

sl  # These are the standardized factor loadings for each item
#>        E1        E2        E3        E4        E5        E6        E7
#> 0.7266721 0.6903172 0.7154705 0.7118762 0.7841427 0.5791526 0.7589250
#>        E8        E9       E10
#> 0.5962855 0.6726232 0.6940120
We then obtain the composite reliability via the following:
# Compute residual variance of each item
re <- 1 - sl^2

# Compute composite reliability
sum(sl)^2 / (sum(sl)^2 + sum(re))
#> [1] 0.9029523
There you have it. The composite reliability for the extraversion factor is .90.
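Since we will want this for each factor in the multi-factor case, the calculation is worth wrapping in a small function (again, the function name is my own). The loadings below are the ones printed earlier:

```r
# Composite reliability from standardized factor loadings, assuming a
# single factor with uncorrelated residuals:
# (sum of loadings)^2 / ((sum of loadings)^2 + sum of residual variances)
composite_reliability <- function(loadings) {
  sum(loadings)^2 / (sum(loadings)^2 + sum(1 - loadings^2))
}

sl <- c(0.7266721, 0.6903172, 0.7154705, 0.7118762, 0.7841427,
        0.5791526, 0.7589250, 0.5962855, 0.6726232, 0.6940120)
composite_reliability(sl)  # ~.90, as above
```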
One appealing aspect of composite reliability is that we can calculate it for multiple factors in the same model. For example, if we had included all of the personality items in a CFA with five factors, we could run the above calculations separately for each factor and obtain a composite reliability for each.
Just to finish off, I’ll mention that you can use the standardised factor loadings to visualise more information like we did earlier with the correlations. I’ll leave this part up to you!
Sign off
Thanks for reading and I hope this was useful for you.
For updates of recent blog posts, follow @drsimonj on Twitter, or email me at [email protected] to get in touch.
If you’d like the code that produced this blog, check out the blogR GitHub repository.