Basic Generalised Linear Modelling – Part 1: Exercises

Hanif Kusuma

4 years ago

[This article was first published on R-exercises, and kindly contributed to R-bloggers].

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

A generalized linear model (GLM) is a flexible generalization of an ordinary linear regression that allows for response variables that have error distribution models other than a normal distribution.
The GLM generalizes linear regression by allowing the linear model to be related to the response variable via a link function and by allowing the magnitude of the variance of each measurement to be a function of its predicted value.
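In symbols, the two ingredients described above (a link function and a mean-dependent variance) can be sketched as follows. The notation ($g$ for the link, $V$ for the variance function, $\phi$ for the dispersion parameter) is the conventional one and is an assumption here, not taken from the original post.

```latex
g(\mu_i) = \eta_i = \beta_0 + \beta_1 x_{i1} + \dots + \beta_p x_{ip},
\qquad \mu_i = \mathbb{E}[Y_i],
\qquad \operatorname{Var}(Y_i) = \phi \, V(\mu_i)

% Poisson regression with the canonical log link:
\log(\mu_i) = \eta_i, \qquad V(\mu_i) = \mu_i, \qquad \phi = 1
```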

The GLMs most commonly used for count and binary data can be split into three groups:
• Poisson regression for count data with no over- or underdispersion issues
• Quasi-Poisson or negative binomial models where the count data are overdispersed
• Logistic regression models where the response data are binary (e.g. present or absent, male or female) or proportional (e.g. percentages)
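As a rough illustration of the three groups, the sketch below fits each family with base R's glm() on simulated data. The variables x, counts and present are made up for the example and have nothing to do with the exercise dataset.

```r
# Simulated data (an assumption for illustration only)
set.seed(42)
n <- 100
x <- runif(n)
counts  <- rpois(n, lambda = exp(0.5 + 1.2 * x))          # count response
present <- rbinom(n, size = 1, prob = plogis(-1 + 2 * x)) # binary response

# Poisson regression: counts, variance assumed equal to the mean
m_pois <- glm(counts ~ x, family = poisson(link = "log"))

# Quasi-Poisson: same mean model, dispersion estimated from the data
m_quasi <- glm(counts ~ x, family = quasipoisson(link = "log"))

# Logistic regression: binary response
m_logit <- glm(present ~ x, family = binomial(link = "logit"))

# Poisson and quasi-Poisson give the same coefficients;
# only the standard errors are rescaled by the dispersion estimate
all.equal(coef(m_pois), coef(m_quasi))
```

A negative binomial fit is left out here because it needs a function outside base glm() families.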

In this exercise, we will focus on GLMs that use Poisson regression. Please download the dataset for this exercise here. The dataset was used to investigate the biogeographical determinants of ant species richness at a regional scale (Gotelli and Ellison, 2002). The main purpose of this exercise is to replicate the Poisson regression of ant species richness against latitude, elevation and habitat type in their paper.

Answers to these exercises are available here. If you obtained a different (correct) answer than those listed on the solutions page, please feel free to post your answer as a comment on that page. Load the dataset and the required packages before starting the exercises.

Exercise 1
Load the data and check its structure using the scatterplotMatrix function (from the car package). Assess covariation and patterns in the data.
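A minimal sketch of this kind of first look, using base R's pairs() as a stand-in for scatterplotMatrix so that no extra package is needed. The data frame ants and its columns (Srich, Latitude, Elevation, Habitat) are simulated here and their names are assumptions; the real dataset may be laid out differently.

```r
# Simulated stand-in for the exercise data (column names are hypothetical)
set.seed(1)
n <- 200
ants <- data.frame(
  Latitude  = runif(n, 41.9, 45),
  Elevation = runif(n, 1, 550),
  Habitat   = factor(rep(c("Bog", "Forest"), length.out = n))
)
eta <- 1.8 - 0.25 * (ants$Latitude - 43) - 0.001 * ants$Elevation +
  0.6 * (ants$Habitat == "Forest")
ants$Srich <- rpois(n, exp(eta))

# Structure of the data frame
str(ants)

# Pairwise scatterplots with a smoother in each panel
num_vars <- c("Srich", "Latitude", "Elevation")
pairs(ants[, num_vars], panel = panel.smooth)

# Pairwise correlations as a numeric summary of covariation
cor_mat <- cor(ants[, num_vars])
round(cor_mat, 2)
```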

Exercise 2
Fit the GLM and run a variance inflation factor (VIF) analysis to check for variance inflation. Pay attention to collinearity.

Exercise 3
If there are any issues with covariation, try centering the predictor variables.

Exercise 4
Re-run the VIF analysis with the new (centered) variables.
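To see why centering can help, the sketch below compares VIFs before and after centering for a model that includes an interaction term. It assumes simulated data with hypothetical column names, and it uses a hand-rolled helper, vif_manual(), that computes classical VIFs from the correlation matrix of the design columns. This is a simplified stand-in for a packaged VIF function, not the same calculation.

```r
# Simulated stand-in for the exercise data (column names are hypothetical)
set.seed(1)
n <- 200
ants <- data.frame(
  Latitude  = runif(n, 41.9, 45),
  Elevation = runif(n, 1, 550)
)

# Hypothetical helper: classical VIFs are the diagonal of the inverse
# correlation matrix of the predictors (intercept column dropped)
vif_manual <- function(formula, data) {
  X <- model.matrix(formula, data)[, -1, drop = FALSE]
  diag(solve(cor(X)))
}

# Raw predictors: the product term is almost a rescaled copy of Elevation
vif_raw <- vif_manual(~ Latitude * Elevation, ants)

# Centered predictors: subtract the mean from each variable
ants$cLatitude  <- ants$Latitude  - mean(ants$Latitude)
ants$cElevation <- ants$Elevation - mean(ants$Elevation)
vif_centered <- vif_manual(~ cLatitude * cElevation, ants)

round(vif_raw, 1)
round(vif_centered, 2)
```

Centering does not change the fitted interaction model; it only removes the artificial correlation between main effects and their product.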

Exercise 5
Check for influential data points (outliers) using influence measures (Cook's distance) and plot them. If all values are less than 1, it is OK to proceed.
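A minimal sketch of the Cook's distance check with base R, again on simulated data with hypothetical column names. The model formula is an assumption about what the exercise fits.

```r
# Simulated stand-in for the exercise data (column names are hypothetical)
set.seed(1)
n <- 200
ants <- data.frame(
  Latitude  = runif(n, 41.9, 45),
  Elevation = runif(n, 1, 550),
  Habitat   = factor(rep(c("Bog", "Forest"), length.out = n))
)
eta <- 1.8 - 0.25 * (ants$Latitude - 43) - 0.001 * ants$Elevation +
  0.6 * (ants$Habitat == "Forest")
ants$Srich <- rpois(n, exp(eta))

m <- glm(Srich ~ Habitat + Latitude + Elevation, family = poisson, data = ants)

# One Cook's distance per observation
cd <- cooks.distance(m)

# Spike plot with the rule-of-thumb threshold of 1 marked
plot(cd, type = "h", xlab = "Observation", ylab = "Cook's distance")
abline(h = 1, lty = 2)

# Indices of observations above the threshold (empty if none)
which(cd > 1)
```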

Exercise 6
Check for overdispersion. The dispersion statistic needs to be around 1 before going to the next step.
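One common way to get a dispersion statistic is the sum of squared Pearson residuals divided by the residual degrees of freedom. The sketch below applies it to simulated Poisson data (hypothetical column names), so the value should land near 1 by construction.

```r
# Simulated stand-in for the exercise data (column names are hypothetical)
set.seed(1)
n <- 200
ants <- data.frame(
  Latitude  = runif(n, 41.9, 45),
  Elevation = runif(n, 1, 550),
  Habitat   = factor(rep(c("Bog", "Forest"), length.out = n))
)
eta <- 1.8 - 0.25 * (ants$Latitude - 43) - 0.001 * ants$Elevation +
  0.6 * (ants$Habitat == "Forest")
ants$Srich <- rpois(n, exp(eta))

m <- glm(Srich ~ Habitat + Latitude + Elevation, family = poisson, data = ants)

# Pearson chi-square statistic over residual degrees of freedom
pearson_resid <- residuals(m, type = "pearson")
dispersion <- sum(pearson_resid^2) / df.residual(m)
dispersion
```

Values well above 1 suggest overdispersion, in which case a quasi-Poisson or negative binomial model may be more appropriate.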

Exercise 7
Check the model summary. What can we infer from it?

Exercise 8
Since we have several candidate variables, we will use model averaging. The first step is to set the base R option that controls how missing values are handled. Then try to assess which variables have a significant influence on the response variable. Here we include the latitude, elevation, and habitat variables to find the best model.
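The idea behind this step can be sketched by hand in base R: fit every subset of the three predictors, rank the models by AICc, and turn the differences into Akaike weights. This is a hand-rolled illustration on simulated data with hypothetical column names, not the packaged model-averaging routine the solutions may use. Setting na.action to "na.fail" makes model fitting stop on missing values, so all candidate models are guaranteed to use the same rows.

```r
# Simulated stand-in for the exercise data (column names are hypothetical)
set.seed(1)
n <- 200
ants <- data.frame(
  Latitude  = runif(n, 41.9, 45),
  Elevation = runif(n, 1, 550),
  Habitat   = factor(rep(c("Bog", "Forest"), length.out = n))
)
eta <- 1.8 - 0.25 * (ants$Latitude - 43) - 0.001 * ants$Elevation +
  0.6 * (ants$Habitat == "Forest")
ants$Srich <- rpois(n, exp(eta))

# Fail loudly on missing values instead of silently dropping rows
old_opts <- options(na.action = "na.fail")

# All subsets of the candidate predictors, including the null model
predictors <- c("Habitat", "Latitude", "Elevation")
subsets <- list(character(0))
for (k in seq_along(predictors)) {
  subsets <- c(subsets, combn(predictors, k, simplify = FALSE))
}

fits <- lapply(subsets, function(v) {
  f <- if (length(v) > 0) reformulate(v, response = "Srich") else Srich ~ 1
  glm(f, family = poisson, data = ants)
})

# Small-sample corrected AIC for each model
aicc <- sapply(fits, function(m) {
  k <- length(coef(m))
  AIC(m) + 2 * k * (k + 1) / (nobs(m) - k - 1)
})

# Akaike weights from the AICc differences
delta  <- aicc - min(aicc)
weight <- exp(-delta / 2) / sum(exp(-delta / 2))

model_table <- data.frame(
  model = sapply(subsets, function(v) {
    if (length(v) > 0) paste(v, collapse = " + ") else "1"
  }),
  AICc = aicc, delta = delta, weight = weight
)
model_table[order(model_table$delta), ]

# Summed weight of the models that contain each predictor
importance <- sapply(predictors, function(term) {
  sum(weight[sapply(subsets, function(v) term %in% v)])
})
importance

options(old_opts)  # restore the previous na.action setting
```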

Exercise 9
Check the validation plots.
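A minimal sketch of the standard diagnostic plots, assuming simulated data with hypothetical column names. Calling plot() on a fitted glm object draws the usual set: residuals vs fitted, Q-Q, scale-location, and residuals vs leverage.

```r
# Simulated stand-in for the exercise data (column names are hypothetical)
set.seed(1)
n <- 200
ants <- data.frame(
  Latitude  = runif(n, 41.9, 45),
  Elevation = runif(n, 1, 550),
  Habitat   = factor(rep(c("Bog", "Forest"), length.out = n))
)
eta <- 1.8 - 0.25 * (ants$Latitude - 43) - 0.001 * ants$Elevation +
  0.6 * (ants$Habitat == "Forest")
ants$Srich <- rpois(n, exp(eta))

m <- glm(Srich ~ Habitat + Latitude + Elevation, family = poisson, data = ants)

# Four default diagnostic panels on one page
op <- par(mfrow = c(2, 2))
plot(m)
par(op)

# Pearson residuals against fitted values, drawn by hand
pearson_resid <- residuals(m, type = "pearson")
plot(fitted(m), pearson_resid,
     xlab = "Fitted values", ylab = "Pearson residuals")
abline(h = 0, lty = 2)
```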

Exercise 10
Produce a base plot and add the points of the predicted values.
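One way this could look in base graphics, on simulated data with hypothetical column names: observed richness is plotted against latitude, fitted values are added as open points, and a prediction curve per habitat is drawn at mean elevation.

```r
# Simulated stand-in for the exercise data (column names are hypothetical)
set.seed(1)
n <- 200
ants <- data.frame(
  Latitude  = runif(n, 41.9, 45),
  Elevation = runif(n, 1, 550),
  Habitat   = factor(rep(c("Bog", "Forest"), length.out = n))
)
eta <- 1.8 - 0.25 * (ants$Latitude - 43) - 0.001 * ants$Elevation +
  0.6 * (ants$Habitat == "Forest")
ants$Srich <- rpois(n, exp(eta))

m <- glm(Srich ~ Habitat + Latitude + Elevation, family = poisson, data = ants)

# Prediction grid: a latitude sequence per habitat, elevation held at its mean
new_data <- expand.grid(
  Latitude  = seq(min(ants$Latitude), max(ants$Latitude), length.out = 50),
  Elevation = mean(ants$Elevation),
  Habitat   = levels(ants$Habitat)
)
# type = "response" returns predictions on the count scale, not the log scale
new_data$fit <- predict(m, newdata = new_data, type = "response")

habitat_cols <- c(Bog = "brown", Forest = "darkgreen")

# Base plot of the observed data
plot(ants$Latitude, ants$Srich, pch = 16,
     col = habitat_cols[as.character(ants$Habitat)],
     xlab = "Latitude", ylab = "Species richness")

# Fitted value for each observation
points(ants$Latitude, fitted(m), pch = 1)

# Predicted curve for each habitat
for (h in levels(ants$Habitat)) {
  sub <- new_data[new_data$Habitat == h, ]
  lines(sub$Latitude, sub$fit, col = habitat_cols[h], lwd = 2)
}
legend("topright", legend = names(habitat_cols), col = habitat_cols,
       pch = 16, lwd = 2)
```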
