Site icon R-bloggers

G is for GLM Function

[This article was first published on Deeply Trivial, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
In the Beta post, I used linear regression and demonstrated how to request standardized regression coefficients (betas). You may remember that I mentioned in that post that there are other types of regression. Linear regression is used with a continouous outcome. But sometimes you want to predict outcomes that aren’t continuous – perhaps they’re binary outcomes (0 or 1 coded in some meaningful way), counts, or any other types of outcome variables that don’t follow the typical normal distribution. Fortunately, there are many types of regression to allow you to work with these different types of variables. You can access many of them with the glm function, a built-in (base R) function for conducting a variety of regression models.

GLM stands for general linear model, which is the basis for many statistical analyses, including regression and structural equation modeling. It provides a mathematical method of relating predictor variables to outcomes in terms of an equation, converting values on the predictor variable(s) to values on the outcome variable.

Using the glm function is very similar to using the lm function for a linear model – you need to symbolically represent your equation:

Y ~ x1 + x2 + … + lastx

And you need some kind of data object for R to apply the equation to. A key difference in the glm function is that you reference the kind of regression model to use, through the “family” argument. Here are the different family options:
The family refers to the distribution that best represents your outcome data. For instance, Gaussian refers to the normal distribution – unless you specify some additional arguments in your glm function, these results will be essentially the same as using the lm function. Binomial is used for two-level outcomes, which is what I’ll demonstrate below. Poisson is used for count data. Quasibinomial is used when you have additional variance in your outcome not explained by the binomial distribution alone (that is, there is more variance than if it followed the binomial distribution perfectly); it provides similar results as binomial, except with an additional error term. Quasipoisson is similar for count data. We won’t worry about those for now though – just remember that anytime you have to estimate something extra, you need to provide more data, so each additional term increases how much data you must collect. We’ll come back to this topic when we get to power analysis.

So let’s have some fun using the GLM function. First up, let’s run a binomial regression. But we’ll need some binary data first.

Sara’s Dissertation

I’ve been sitting on my dissertation data for several years now and have gotten some fun presentations out it. To briefly summarize, I conducted my dissertation, which if you’re so inclined you can read it here, on pretrial publicity: participants were exposed to a certain kind of pretrial publicity at random or to a control condition with no biasing information, then they completed a measure of justice attitudes, read a trial transcript, rendered a verdict, and provided a guilt rating. I included both verdict and guilt rating because of my meta-analysis, which included some studies that used guilt ratings instead of or in addition to verdicts. I was curious how these two measures related to each other, and what impact attitudes might have. So in addition to my planned analyses on pretrial publicity, I conducted some exploratory analyses using the attitude items.

The measure I used provided many attitude items, but based on a literature search, I picked out a handful of specific attitudes that appeared to impact how people make decisions about guilt versus innocence. Though I’ve chided other researchers for examining individual items instead of overall measure scores, there are some cases where it makes sense to think in terms of specific versus general attitudes. The theoretical background for my dissertation had to do with the tenuous relationship between attitudes and behaviors. Attitudes are poor predictors of behaviors, though more specific attitudes are better predictors than more general attitudes. My hypothesis was that pretrial publicity was not in itself biasing, unless you hold an attitude that a specific piece of information implies a defendant is guilty. And I conducted my exploratory analysis with much the same framework, that specific attitudes relating to the case and factors of the trial would be better predictors of the outcome.

First, I ran a regression with a handful of attitudes: belief that courts should be able to use illegal evidence (e.g., obtained without a search warrant), beliefs about circumstantial evidence, belief that the defendant should be required to testify (even though defendants don’t have to do anything, even mount a defense), belief that police abuse power, an attitude that police should be allowed to arrest anyone who seems “suspicious”, belief that police are “overzealous”, and finally an attitude that upstanding citizens have nothing to fear from police. I conducted this analysis once with verdict as the outcome, and again with guilt rating as the outcome.

Since we want to get started with our glm function, let’s run a binomial regression with verdict as the outcome.

dissertation<-read.delim("dissertation_data.txt",
                         header=TRUE)
verdict_binomial<-glm(verdict ~ illev + circumst + deftest + policepower +
                        suspicious + overzealous + upstanding, family=binomial,
                       data=dissertation)
summary(verdict_binomial)

## 
## Call:
## glm(formula = verdict ~ illev + circumst + deftest + policepower + 
##     suspicious + overzealous + upstanding, family = binomial, 
##     data = dissertation)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.0267  -1.0100  -0.5999   1.0981   2.0656  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -0.57140    0.81612  -0.700 0.483834    
## illev        0.13836    0.10659   1.298 0.194280    
## circumst    -0.43640    0.12770  -3.417 0.000632 ***
## deftest      0.09870    0.10859   0.909 0.363398    
## policepower -0.13188    0.11256  -1.172 0.241354    
## suspicious   0.37546    0.12157   3.088 0.002013 ** 
## overzealous -0.01246    0.09991  -0.125 0.900717    
## upstanding   0.22768    0.10820   2.104 0.035362 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 490.08  on 354  degrees of freedom
## Residual deviance: 445.22  on 347  degrees of freedom
##   (1 observation deleted due to missingness)
## AIC: 461.22
## 
## Number of Fisher Scoring iterations: 4

You would interpret the output in much the same way you would linear regression – significant variables indicate those significantly related to the outcome. I could use this resulting equation to predict a person’s verdict based on their attitudes on these items. As I found in my dissertation analysis, beliefs about circumstantial evidence, police ability to arrest “suspicious people”, and that upstanding (law-abiding) citizens have no reason to fear the police significantly affected whether the participant convicted in this particular case. The next step in my analysis would be to test how well the equation does at predicting verdict: how often it correctly and incorrectly classified people. (Spoiler alert: I did this as part of my dissertation and found that, though 3 indicators were significant, this equation wasn’t great at predicting verdict, doing so correctly only about 55 percent of the time.)

I also conducted a linear regression using guilt rating. Just for fun, let’s conduct an lm as well as a glm of the Gaussian family to see how results match up:

guilt_lm<-lm(guilt ~ illev + circumst + deftest + policepower +
                        suspicious + overzealous + upstanding,
                        data=dissertation)
summary(guilt_lm)

## 
## Call:
## lm(formula = guilt ~ illev + circumst + deftest + policepower + 
##     suspicious + overzealous + upstanding, data = dissertation)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.0357 -0.7452  0.1828  0.9706  2.5013 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  4.16081    0.38966  10.678  < 2e-16 ***
## illev        0.11111    0.05816   1.911  0.05689 .  
## circumst    -0.08779    0.06708  -1.309  0.19147    
## deftest     -0.02020    0.05834  -0.346  0.72942    
## policepower  0.02828    0.06058   0.467  0.64090    
## suspicious   0.17286    0.06072   2.847  0.00468 ** 
## overzealous -0.03298    0.04792  -0.688  0.49176    
## upstanding   0.08941    0.05374   1.664  0.09706 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.169 on 347 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.07647, Adjusted R-squared:  0.05784 
## F-statistic: 4.105 on 7 and 347 DF,  p-value: 0.0002387

guilt_gaus<-glm(guilt ~ illev + circumst + deftest + policepower +
                        suspicious + overzealous + upstanding,
                        family="gaussian", data=dissertation)
summary(guilt_gaus)

## 
## Call:
## glm(formula = guilt ~ illev + circumst + deftest + policepower + 
##     suspicious + overzealous + upstanding, family = "gaussian", 
##     data = dissertation)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.0357  -0.7452   0.1828   0.9706   2.5013  
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  4.16081    0.38966  10.678  < 2e-16 ***
## illev        0.11111    0.05816   1.911  0.05689 .  
## circumst    -0.08779    0.06708  -1.309  0.19147    
## deftest     -0.02020    0.05834  -0.346  0.72942    
## policepower  0.02828    0.06058   0.467  0.64090    
## suspicious   0.17286    0.06072   2.847  0.00468 ** 
## overzealous -0.03298    0.04792  -0.688  0.49176    
## upstanding   0.08941    0.05374   1.664  0.09706 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 1.366385)
## 
##     Null deviance: 513.40  on 354  degrees of freedom
## Residual deviance: 474.14  on 347  degrees of freedom
##   (1 observation deleted due to missingness)
## AIC: 1128.2
## 
## Number of Fisher Scoring iterations: 2

As you can see, the regression results are the same, though the output is slightly different between the two. The lm function gives you your R-squared and F-test for the regression (test that any indicators are significant), while the glm function gives you dispersion parameters and AIC. For this reason, if you’re conducting a linear regression, use the lm function. The glm function only makes sense here if you need to specify additional arguments to use a different member of the Gaussian family. The results may be the same, but the applications are slightly different.

I also did a second set of binomial regressions for my dissertation, this one dealing with variable that might affect how a person translates their guilt rating (which was on a scale of 1 to 7, the likelihood that the defendant actually committed the crime) to a verdict. That is, I found that while most people didn’t select a verdict of guilty unless they also selected a guilt rating of 6 or 7, some convicted at a much lower rating, while others didn’t select guilty even when they gave a guilt rating of 7, which translated to “definitely committed the crime”. So I conducted two binomial regressions: one with criminal justice attitudes (e.g., how they think “reasonable doubt” should be defined, belief in the fairness of juries, and so on) as well as guilt rating, and one that also looked at how each attitude item interacts with guilt rating. This lets us see if the impact of guilt rating on verdict depends on a person’s attitude. To make things easier to read, I’m going to set up my two equations separately from the glm function.

predictors_only<-'verdict ~ obguilt + reasdoubt + bettertolet + libertyvorder + 
                  jurevidence + guilt'
pred_int<-'verdict ~ obguilt + reasdoubt + bettertolet + libertyvorder + 
                  jurevidence + guilt + obguilt*guilt + reasdoubt*guilt +
                  bettertolet*guilt + libertyvorder*guilt + jurevidence*guilt'
model1<-glm(predictors_only, family="binomial", data=dissertation)
summary(model1)

## 
## Call:
## glm(formula = predictors_only, family = "binomial", data = dissertation)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.4814  -0.6050  -0.1296   0.6419   2.7575  
## 
## Coefficients:
##                Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   -10.21158    1.33072  -7.674 1.67e-14 ***
## obguilt         0.39024    0.16010   2.437   0.0148 *  
## reasdoubt      -0.12315    0.11591  -1.062   0.2880    
## bettertolet    -0.10054    0.11406  -0.881   0.3781    
## libertyvorder  -0.01425    0.14961  -0.095   0.9241    
## jurevidence     0.08448    0.09897   0.854   0.3933    
## guilt           1.81769    0.20222   8.989  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 490.08  on 354  degrees of freedom
## Residual deviance: 309.66  on 348  degrees of freedom
##   (1 observation deleted due to missingness)
## AIC: 323.66
## 
## Number of Fisher Scoring iterations: 5

model2<-glm(pred_int, family="binomial", data=dissertation)
summary(model2)

## 
## Call:
## glm(formula = pred_int, family = "binomial", data = dissertation)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.6101  -0.5432  -0.1289   0.6422   2.2805  
## 
## Coefficients:
##                      Estimate Std. Error z value Pr(>|z|)  
## (Intercept)         -12.84571    6.10651  -2.104   0.0354 *
## obguilt              -0.34506    1.13742  -0.303   0.7616  
## reasdoubt             1.56658    0.83360   1.879   0.0602 .
## bettertolet          -0.21819    0.86374  -0.253   0.8006  
## libertyvorder         0.88325    0.95459   0.925   0.3548  
## jurevidence          -0.62756    0.94042  -0.667   0.5046  
## guilt                 2.43022    1.19628   2.031   0.0422 *
## obguilt:guilt         0.13062    0.21752   0.600   0.5482  
## reasdoubt:guilt      -0.33930    0.16323  -2.079   0.0376 *
## bettertolet:guilt     0.01426    0.16662   0.086   0.9318  
## libertyvorder:guilt  -0.17482    0.18666  -0.937   0.3490  
## jurevidence:guilt     0.14006    0.18359   0.763   0.4456  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 490.08  on 354  degrees of freedom
## Residual deviance: 300.66  on 343  degrees of freedom
##   (1 observation deleted due to missingness)
## AIC: 324.66
## 
## Number of Fisher Scoring iterations: 6

This matches what I found initially: that there was a significant interaction between guilt rating and attitudes about reasonable doubt. So depending on how an individual defined reasonable doubt, they would convict at lower or higher guilt ratings.

That’s all for now! Tune in tomorrow for my first ever Deeply Trivial video on interpreting CFA output! And back to A to Z posts on Monday!

To leave a comment for the author, please follow the link and comment on their blog: Deeply Trivial.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.