Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
This post will describe linear regression as from the book Veterinary Epidemiologic Research, describing the examples provided with R.
Regression analysis is used for modeling the relationship between a single variable Y (the outcome, or dependent variable) measured on a continuous or near-continuous scale and one or more predictor (independent or explanatory variable), X. If only one predictor is used, we have a simple regression model, and if we have more than one predictor, we have a multiple regression model. Let’s download the data coming together with the book. There’s a choice of various format, including R:
temp <- tempfile() download.file("http://ic.upei.ca/ver/sites/ic.upei.ca.ver/files/ver2_data_R.zip", tmp) # fetch the file into the temporary file load(unz(tmp, "ver2_data_R/daisy2.rdata")) # extract the target file from the temporary file unlink(tmp) # remove the temporary file ### some data management daisy2 <- daisy2[daisy2$h7 == 1, ] # we only use a subset of the data daisy2[, c(4, 7, 11, 15:18)] <- lapply(daisy2[, c(4, 7, 11, 15:18)], factor) library(Hmisc) daisy2 <- upData(daisy2, labels = c(region = 'Region', herd = 'Herd number', cow = 'Cow number', study_lact = 'Study lactation number', herd_size = 'Herd size', mwp = "Minimum wait period for herd", parity = 'Lactation number', milk120 = 'Milk volume in first 120 days of lactation', calv_dt = 'Calving date', cf = 'Calving to first service interval', fs = 'Conception at first service', cc = 'Calving to conception interval', wpc = 'Interval from wait period to conception', spc = 'Services to conception', twin = 'Twins born', dyst = 'Dystocia at calving', rp = 'Retained placenta at calving', vag_disch = 'Vaginal discharge observed', h7 = 'Indicator for 7 herd subset'), levels = list(fs = list('No' = 0, 'Yes' = 1), twin = list('No' = 0, 'Yes' = 1), dyst = list('No' = 0, 'Yes' = 1), rp = list('No' = 0, 'Yes' = 1), vag_disch = list('No' = 0, 'Yes' = 1)), units = c(milk120 = "litres")) summary(daisy2$milk120, na.rm = TRUE) Min. 1st Qu. Median Mean 3rd Qu. Max. 1110 2742 3226 3225 3693 5630 sd(daisy2$milk120, na.rm = TRUE) [1] 703.364 daisy2 <- daisy2[!is.na(daisy2$milk120), ] # get rid of missing observations for milk production
To use R to download zipped file, see this answer on Stackoverflow by Dirk Eddelbuettel.
The regression model could be written:
lm.milk <- lm(milk120 ~ parity, data = daisy2) (lm.milk.sum <- summary(lm.milk)) Call: lm(formula = milk120 ~ parity, data = daisy2) Residuals: Milk volume in first 120 days of lactation [litres] Min 1Q Median 3Q Max -2234.73 -384.50 -0.63 376.52 2188.48 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 2629.63 29.31 89.712 < 2e-16 *** parity2 715.30 42.45 16.849 < 2e-16 *** parity3 794.67 44.05 18.039 < 2e-16 *** parity4 834.51 48.50 17.205 < 2e-16 *** parity5 812.19 50.77 15.998 < 2e-16 *** parity6 929.63 66.92 13.893 < 2e-16 *** parity7 795.26 218.86 3.634 0.000287 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 613.5 on 1761 degrees of freedom Multiple R-squared: 0.2419, Adjusted R-squared: 0.2393 F-statistic: 93.65 on 6 and 1761 DF, p-value: < 2.2e-16
The output shows in order: the model specified, the distribution of residuals (which should be normal with mean = 0 and so median close to 0), the
anova(lm.milk) Analysis of Variance Table Response: milk120 Df Sum Sq Mean Sq F value Pr(>F) parity 6 211463774 35243962 93.653 < 2.2e-16 *** Residuals 1761 662708166 376325 --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
How would you calculate this F-test yourself? The F value is given by
(src.var.total <- sum((daisy2$milk120 - mean(daisy2$milk120))^2)) #source of variation: total [1] 874171940 (src.var.res <- sum(lm.milk$res^2)) #source of variation: error (or residual) [1] 662708166 (f.stat <- ((src.var.total - src.var.res) / (lm.milk.sum$fstatistic[2])) / + (src.var.res / lm.milk.sum$fstatistic[3])) #F-statistic 93.65301 1 - pf(f.stat, lm.milk.sum$fstatistic[2], lm.milk.sum$fstatistic[3]) # p-value 0
We could also compute ourself
1 - sum(lm.milk$res^2) / sum((daisy2$milk120 - mean(daisy2$milk120))^2) [1] 0.2419018
We can also show some actual values, predicted ones and residuals:
head(data.frame(daisy2[, c(7:8)], fitted.value = fitted(lm.milk), + residual = resid(lm.milk)), n = 10) parity milk120 fitted.value residual 1 5 3505.8 3441.824 63.97631 2 5 3691.3 3441.824 249.47631 3 5 4173.0 3441.824 731.17626 4 5 3727.3 3441.824 285.47631 5 5 3090.8 3441.824 -351.02369 6 4 5041.2 3464.141 1577.05893 7 5 3861.2 3441.824 419.37621 8 5 4228.4 3441.824 786.57616 9 6 3431.1 3559.258 -128.15761 10 5 4445.5 3441.824 1003.67626
Next we could plot actual values and intervals for prediction both for the mean and new observations. We first create a new data frame to hold the values of parity for which we want the predictions, then compute the prediction and confidence bands and plot them with ggplo2.
confidence <- data.frame(parity = as.factor(1:7)) # Confidence bands confidence.band <- predict(lm.milk, int = "c" , newdata = confidence) confidence.band <- as.data.frame(cbind(confidence.band, parity = c(1:7))) # Prediction bands prediction.band <- predict(lm.milk, int = "p" , newdata = confidence) prediction.band <- as.data.frame(cbind(prediction.band, parity = c(1:7))) library(ggplot2) ggplot(data = daisy2, aes(x = as.numeric(parity), y = milk120)) + geom_point() + geom_smooth(data = confidence.band, aes(x = parity, y = lwr), method = lm, se = FALSE, colour = "blue") + geom_smooth(data = confidence.band, aes(x = parity, y = upr), method = lm, se = FALSE, colour = "blue") + geom_smooth(data = prediction.band, aes(x = parity, y = lwr), method = lm, se = FALSE, colour = "green") + geom_smooth(data = prediction.band, aes(x = parity, y = upr), method = lm, se = FALSE, colour = "green") + xlab("Parity") + ylab("Milk volume in first 120 days of lactation")
Finally, let’s have a look at a multiple regression and see how we can assess the significance of predictor variables. We will look at the effect of milk production, parity, retained placenta and vaginal discharge on the waiting period of the cows.
daisy2$milk120.sq <- daisy2$milk120^2 daisy2$parity.cat[as.numeric(daisy2$parity) >= 3] <- "3+" daisy2$parity.cat[daisy2$parity == 2] <- "2" daisy2$parity.cat[as.numeric(daisy2$parity) < 2] <- "1" lm.wpc <- lm(wpc ~ milk120 + milk120.sq + as.factor(parity.cat) + rp + vag_disch, data = daisy2) summary(lm.wpc) Call: lm(formula = wpc ~ milk120 + milk120.sq + as.factor(parity.cat) + rp + vag_disch, data = daisy2) Residuals: Interval from wait period to conception Min 1Q Median 3Q Max -69.21 -37.98 -14.66 24.46 219.46 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 1.232e+02 2.134e+01 5.775 9.29e-09 *** milk120 -3.426e-02 1.339e-02 -2.558 0.0106 * milk120.sq 4.507e-06 2.011e-06 2.241 0.0251 * as.factor(parity.cat)2 5.083e+00 4.056e+00 1.253 0.2103 as.factor(parity.cat)3+ 8.887e+00 3.665e+00 2.425 0.0154 * rpYes 9.771e+00 4.572e+00 2.137 0.0327 * vag_dischYes 9.838e+00 6.086e+00 1.616 0.1062 --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 51.47 on 1529 degrees of freedom (232 observations deleted due to missingness) Multiple R-squared: 0.01306, Adjusted R-squared: 0.009191 F-statistic: 3.373 on 6 and 1529 DF, p-value: 0.002622
The F-test tells us if any of the predictors is useful in predicting the length of the waiting period. The F-statistic is significant and we can reject the null hypothesis that no predictors are useful. If we had failed to reject the null hypothesis, we would still be interested in the possibility of non-linear transformations of variables (already done here) or the presence of outliers, masking the true effect. Also we could have not enough data to demonstrate a real effect. Now that we rejected this null hypothesis, maybe not all predictors are necessary to predict the response.
We can see from the summary table which predictors are significant. However we can test a single predictor, with an F-test:
lm.wpc2 <- lm(wpc ~ milk120 + milk120.sq + as.factor(parity.cat) + rp, data = daisy2) anova(lm.wpc, lm.wpc2) Analysis of Variance Table Model 1: wpc ~ milk120 + milk120.sq + as.factor(parity.cat) + rp + vag_disch Model 2: wpc ~ milk120 + milk120.sq + as.factor(parity.cat) + rp Res.Df RSS Df Sum of Sq F Pr(>F) 1 1529 4050745 2 1530 4057666 -1 -6921.1 2.6125 0.1062
or with a t-test, as
The same approach can be used to test a group of predictors. For example, if we want to see if we have to keep both milk production and its squared value in the model:
lm.wpc3 <- lm(wpc ~ as.factor(parity.cat) + rp + vag_disch, data = daisy2) anova(lm.wpc, lm.wpc3) Analysis of Variance Table Model 1: wpc ~ milk120 + milk120.sq + as.factor(parity.cat) + rp + vag_disch Model 2: wpc ~ as.factor(parity.cat) + rp + vag_disch Res.Df RSS Df Sum of Sq F Pr(>F) 1 1529 4050745 2 1531 4076138 -2 -25394 4.7925 0.008416 ** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.