[This article was first published on Deeply Trivial, and kindly contributed to R-bloggers].
As my data example, I’ll use my dissertation dataset and the linear model I introduced in the GLM post.
dissertation<-read.delim("dissertation_data.txt",header=TRUE)
guilt_lm_full<-lm(guilt ~ illev + circumst + deftest + policepower +
                    suspicious + overzealous + upstanding, data=dissertation)
options(scipen = 999)
summary(guilt_lm_full)

## 
## Call:
## lm(formula = guilt ~ illev + circumst + deftest + policepower + 
##     suspicious + overzealous + upstanding, data = dissertation)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.0357 -0.7452  0.1828  0.9706  2.5013 
## 
## Coefficients:
##             Estimate Std. Error t value             Pr(>|t|)    
## (Intercept)  4.16081    0.38966  10.678 < 0.0000000000000002 ***
## illev        0.11111    0.05816   1.911              0.05689 .  
## circumst    -0.08779    0.06708  -1.309              0.19147    
## deftest     -0.02020    0.05834  -0.346              0.72942    
## policepower  0.02828    0.06058   0.467              0.64090    
## suspicious   0.17286    0.06072   2.847              0.00468 ** 
## overzealous -0.03298    0.04792  -0.688              0.49176    
## upstanding   0.08941    0.05374   1.664              0.09706 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.169 on 347 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.07647, Adjusted R-squared:  0.05784 
## F-statistic: 4.105 on 7 and 347 DF,  p-value: 0.0002387
In this model, the outcome variable is a guilt rating, ranging from 1 to 7. This is our y, which the regression model tries to reproduce through a linear combination of our x’s. Using the coefficients in the output, we could compute the predicted y (y-hat): what a person’s score would be if the linear model fit the data perfectly. Fortunately, R has a built-in function that computes y-hat for a dataset: predict. It takes a regression model and, optionally, a dataset to predict values for; if you leave the dataset out, it returns predictions for the data used to fit the model. Let’s have R predict values for the dissertation dataset, and add them on as a new variable.
dissertation$predicted<-predict(guilt_lm_full, dissertation)
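To see the same pattern on data you can run yourself, here is a quick sketch using R’s built-in mtcars dataset (the model and variables are illustrative, not part of the dissertation analysis):

```r
# Fit a simple linear model on the built-in mtcars data
mpg_lm <- lm(mpg ~ wt + hp, data = mtcars)

# predict() takes the fitted model and a data frame of cases to predict for;
# here we predict for the same data used to fit the model
mtcars$predicted <- predict(mpg_lm, mtcars)

head(mtcars[c("mpg", "predicted")])
```

Passing a different data frame (with the same predictor columns) would give out-of-sample predictions instead.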
In this application, we don’t care as much about the predicted values – we will later on in this post – but we probably do care about the residuals: the difference between the observed value and the predicted value. This gives us an idea of how our model is doing and whether it fits reasonably well. It can also tell us if the model falls apart at certain values or ranges of values.
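That definition is easy to verify directly: the residuals R reports are just the observed values minus the predicted ones. A quick check, again sketched on mtcars rather than the dissertation data:

```r
# Residuals are observed y minus predicted y-hat
mpg_lm <- lm(mpg ~ wt + hp, data = mtcars)
manual_resid <- mtcars$mpg - predict(mpg_lm, mtcars)

# resid() extracts the same quantities the model stored while fitting
isTRUE(all.equal(unname(manual_resid), unname(resid(mpg_lm))))  # TRUE
```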
In the residuals post, I showed that you can easily request residuals from the model. As we did with predicted, let’s create a new variable in the dataset that contains our residuals.
dissertation$residual<-resid(guilt_lm_full)

## Error in `$<-.data.frame`(`*tmp*`, residual, value = structure(c(0.0326393185592984, : replacement has 355 rows, data has 356
Ruh-roh, we got an error. Our dataset contains 356 observations, but we only have 355 residuals. This is because someone has a missing value on one of the variables in the regression model and was dropped from the analysis. There are a variety of ways we could find out which case is missing a value, but since I’m only working with a handful of variables, I’ll just run descriptives and look for the variable with only 355 values.
library(psych)

## Warning: package 'psych' was built under R version 3.4.4

describe(dissertation[c(13,15,18,21,27,29,31,44)])

##             vars   n mean   sd median trimmed  mad min max range  skew
## illev          1 356 2.98 1.13      3    3.02 1.48   1   5     4 -0.17
## circumst       2 356 2.99 0.95      3    2.97 1.48   1   5     4  0.13
## deftest        3 356 3.39 1.46      4    3.57 0.00  -9   5    14 -5.25
## policepower    4 355 3.86 1.41      4    4.02 0.00  -9   5    14 -6.40
## suspicious     5 356 2.09 1.14      2    2.01 1.48  -9   5    14 -1.97
## overzealous    6 356 3.34 1.34      4    3.41 1.48  -9   5    14 -4.49
## upstanding     7 356 3.09 1.29      3    3.11 1.48  -9   5    14 -2.31
## guilt          8 356 4.80 1.21      5    4.90 1.48   2   7     5 -0.59
##             kurtosis   se
## illev          -1.04 0.06
## circumst       -0.51 0.05
## deftest        40.74 0.08
## policepower    55.05 0.08
## suspicious     23.52 0.06
## overzealous    38.44 0.07
## upstanding     19.66 0.07
## guilt          -0.54 0.06
The variable policepower is the culprit. I can drop the case with the missing value and then rerun the residual code.
dissertation<-subset(dissertation, !is.na(policepower))
dissertation$residual<-resid(guilt_lm_full)
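If you don’t want to scan descriptives, complete.cases() will flag the rows lm() drops. A sketch on a toy data frame (the variable names here are made up for illustration):

```r
# Toy data frame with one missing value (illustrative only)
df <- data.frame(x = c(1, 2, NA, 4), y = c(5, 6, 7, 8))

# Which rows have any missing values? lm() would drop these
which(!complete.cases(df))  # row 3

# How many NAs per variable?
colSums(is.na(df))

# Keep only complete rows: like subset(..., !is.na(...)), but across all columns
df_complete <- df[complete.cases(df), ]
```

Relatedly, fitting the model with lm(..., na.action = na.exclude) makes resid() and predict() pad their output with NA for dropped cases, so the lengths match the original data and this error never comes up.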
Now I can plot my observed values and residuals.
library(ggplot2)

## 
## Attaching package: 'ggplot2'
## The following objects are masked from 'package:psych':
## 
##     %+%, alpha

qplot(guilt,residual,data=dissertation)
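qplot() is a thin convenience wrapper; the same scatterplot in full ggplot() syntax (with mtcars standing in for the dissertation data, and mpg for guilt) would look like this:

```r
library(ggplot2)

# Stand-in model and residuals (mtcars in place of the dissertation data)
mpg_lm <- lm(mpg ~ wt + hp, data = mtcars)
mtcars$residual <- resid(mpg_lm)

# Equivalent of qplot(guilt, residual, data = dissertation):
# observed outcome on x, residual on y
p <- ggplot(mtcars, aes(x = mpg, y = residual)) +
  geom_point()
```

Printing p (or just typing p at the console) renders the plot.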