Basic Generalised Linear Modelling – Part 1: Exercises
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
A generalized linear model (GLM) is a flexible generalization of an ordinary linear regression that allows for response variables that have error distribution models other than a normal distribution.
The GLM generalizes linear regression by allowing the linear model to be related to the response variable via a link function and by allowing the magnitude of the variance of each measurement to be a function of its predicted value.
The GLMs used most often in practice fall into three groups:
• Poisson regression for count data with no over- or underdispersion issues
• Quasi-Poisson or negative binomial models where the count data are overdispersed
• Logistic regression models where the response data are binary (e.g. present or absent, male or female) or proportional (e.g. percentages)
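As a small illustration of the link and variance functions mentioned above, the sketch below inspects the family objects that base R passes to glm(). The values in mu are made up for illustration only.

```r
# Each family object bundles a link function and a variance function.
fam_pois  <- poisson()        # count data, default link: log
fam_quasi <- quasipoisson()   # same link and variance, but the dispersion is estimated
fam_binom <- binomial()       # binary or proportional data, default link: logit

mu <- c(0.5, 2, 10)           # illustrative predicted means (made up)

fam_pois$link                 # name of the link function
fam_pois$linkfun(mu)          # maps the means onto the linear-predictor scale
fam_pois$variance(mu)         # Poisson variance equals the mean
fam_quasi$variance(mu)        # same variance function as the Poisson family
fam_binom$link
fam_binom$variance(0.3)       # p * (1 - p) for a probability p
```

A negative binomial model is not available as a base R family; it is usually fitted with a dedicated function from an add-on package.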
In this exercise, we will focus on GLMs that use Poisson regression. Please download the dataset for this exercise here. The dataset comes from a study of the biogeographical determinants of ant species richness at a regional scale (Gotelli and Ellison, 2002). The main purpose of this exercise is to replicate the Poisson regression of ant species richness against latitude, elevation and habitat type in their paper.
Answers to these exercises are available here. If you obtained a different (correct) answer than those listed on the solutions page, please feel free to post your answer as a comment on that page. Load the dataset and the required packages before starting the exercises.
Exercise 1
Load the data and check the data structure using the scatterplotMatrix function. Assess the covariation and patterns in the data.
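A minimal sketch of this kind of first look is shown here. Because the real file is not included in this post, it uses a simulated stand-in; the data frame ants and the column names Srich, Latitude, Elevation and Habitat are assumptions and may not match the downloaded dataset. scatterplotMatrix() comes from the car package, so the sketch falls back to base R's pairs() when car is not installed.

```r
# Simulated stand-in for the ant data; all names and values are hypothetical.
set.seed(1)
n <- 44
ants <- data.frame(
  Latitude  = runif(n, 42, 45),
  Elevation = runif(n, 1, 550),
  Habitat   = factor(rep(c("Bog", "Forest"), each = n / 2))
)
ants$Srich <- rpois(n, exp(12 - 0.25 * ants$Latitude - 0.001 * ants$Elevation +
                             0.6 * (ants$Habitat == "Forest")))

str(ants)                                    # variable types and first few values
num_vars <- ants[, c("Srich", "Latitude", "Elevation")]
round(cor(num_vars), 2)                      # pairwise correlations

if (requireNamespace("car", quietly = TRUE)) {
  car::scatterplotMatrix(~ Srich + Latitude + Elevation, data = ants)
} else {
  pairs(num_vars)                            # base R fallback
}
```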
Exercise 2
Run a GLM and then a VIF analysis to check for variance inflation. Pay attention to collinearity among the predictors.
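The sketch below shows one way to do this on simulated stand-in data. The column names and the model form (a Latitude by Elevation interaction) are assumptions, not the form used on the solutions page. The VIF is computed by hand as the diagonal of the inverse correlation matrix of the predictors; the vif() function in the car package is probably what the exercise intends, and it works from the fitted model instead, so its numbers may differ.

```r
# Simulated stand-in for the ant data; all names and values are hypothetical.
set.seed(1)
n <- 44
ants <- data.frame(
  Latitude  = runif(n, 42, 45),
  Elevation = runif(n, 1, 550),
  Habitat   = factor(rep(c("Bog", "Forest"), each = n / 2))
)
ants$Srich <- rpois(n, exp(12 - 0.25 * ants$Latitude - 0.001 * ants$Elevation +
                             0.6 * (ants$Habitat == "Forest")))

# Poisson GLM with an assumed Latitude x Elevation interaction.
m_raw <- glm(Srich ~ Habitat + Latitude * Elevation,
             family = poisson, data = ants)

# Hand-rolled VIF: diagonal of the inverse correlation matrix of the predictors.
X <- model.matrix(m_raw)[, -1]     # drop the intercept column
vif_raw <- diag(solve(cor(X)))
round(vif_raw, 1)
```

Large values for a main effect and its interaction term are the usual sign that an uncentered predictor is driving the collinearity.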
Exercise 3
If there are any issues with covariation, try centering the predictor variables.
Exercise 4
Re-run the VIF analysis with the new variables.
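To see why centering helps, the sketch below centers the continuous predictors and compares hand-computed VIFs before and after. It runs on simulated stand-in data; the column names, the cLatitude and cElevation variables and the vif_manual() helper are all hypothetical and introduced here only for illustration.

```r
# Simulated stand-in for the ant data; all names and values are hypothetical.
set.seed(1)
n <- 44
ants <- data.frame(
  Latitude  = runif(n, 42, 45),
  Elevation = runif(n, 1, 550),
  Habitat   = factor(rep(c("Bog", "Forest"), each = n / 2))
)
ants$Srich <- rpois(n, exp(12 - 0.25 * ants$Latitude - 0.001 * ants$Elevation +
                             0.6 * (ants$Habitat == "Forest")))

# Centre the continuous predictors on their means (no rescaling).
ants$cLatitude  <- as.numeric(scale(ants$Latitude,  scale = FALSE))
ants$cElevation <- as.numeric(scale(ants$Elevation, scale = FALSE))

# Hypothetical helper: VIF as the diagonal of the inverse correlation matrix.
vif_manual <- function(formula, data) {
  X <- model.matrix(formula, data)[, -1]   # predictors without the intercept
  diag(solve(cor(X)))
}

vif_raw      <- vif_manual(~ Habitat + Latitude * Elevation, ants)
vif_centered <- vif_manual(~ Habitat + cLatitude * cElevation, ants)
round(vif_raw, 1)
round(vif_centered, 1)
```

Centering does not change the fit of the model; it only removes the artificial correlation between a main effect and its product term.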
Exercise 5
Check for influential data points and outliers using influence measures (Cook's distance) and plot the values. If every value is less than 1, it is OK to proceed.
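A minimal sketch of this check with base R's cooks.distance() follows. It uses simulated stand-in data with hypothetical column names and an assumed additive model, so the values will not match those from the real dataset.

```r
# Simulated stand-in for the ant data; all names and values are hypothetical.
set.seed(1)
n <- 44
ants <- data.frame(
  Latitude  = runif(n, 42, 45),
  Elevation = runif(n, 1, 550),
  Habitat   = factor(rep(c("Bog", "Forest"), each = n / 2))
)
ants$Srich <- rpois(n, exp(12 - 0.25 * ants$Latitude - 0.001 * ants$Elevation +
                             0.6 * (ants$Habitat == "Forest")))

m <- glm(Srich ~ Habitat + Latitude + Elevation, family = poisson, data = ants)

cd <- cooks.distance(m)            # one value per observation
plot(cd, type = "h", xlab = "Observation", ylab = "Cook's distance")
abline(h = 1, lty = 2)             # rule-of-thumb threshold used in this exercise
which(cd > 1)                      # observations that would need a closer look
```

influence.measures(m) returns Cook's distance together with several other diagnostics if a fuller table is wanted.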
Exercise 6
Check for overdispersion. The dispersion statistic needs to be around 1 before going to the next step.
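One common dispersion statistic is the sum of squared Pearson residuals divided by the residual degrees of freedom. The sketch below computes it on simulated stand-in data (hypothetical column names, assumed additive model); the solutions page may use a different but related statistic.

```r
# Simulated stand-in for the ant data; all names and values are hypothetical.
set.seed(1)
n <- 44
ants <- data.frame(
  Latitude  = runif(n, 42, 45),
  Elevation = runif(n, 1, 550),
  Habitat   = factor(rep(c("Bog", "Forest"), each = n / 2))
)
ants$Srich <- rpois(n, exp(12 - 0.25 * ants$Latitude - 0.001 * ants$Elevation +
                             0.6 * (ants$Habitat == "Forest")))

m <- glm(Srich ~ Habitat + Latitude + Elevation, family = poisson, data = ants)

# Pearson chi-square divided by the residual degrees of freedom.
dispersion <- sum(residuals(m, type = "pearson")^2) / df.residual(m)
dispersion

# The same quantity is what a quasi-Poisson fit reports as its dispersion.
summary(update(m, family = quasipoisson))$dispersion
```

Values well above 1 point to overdispersion, in which case a quasi-Poisson or negative binomial model is the usual next step.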
Exercise 7
Check the model summary. What can we infer?
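As a hint for reading the output, the sketch below pulls the coefficient table out of the summary and back-transforms the estimates. It runs on simulated stand-in data with hypothetical column names, so the numbers say nothing about the real ants.

```r
# Simulated stand-in for the ant data; all names and values are hypothetical.
set.seed(1)
n <- 44
ants <- data.frame(
  Latitude  = runif(n, 42, 45),
  Elevation = runif(n, 1, 550),
  Habitat   = factor(rep(c("Bog", "Forest"), each = n / 2))
)
ants$Srich <- rpois(n, exp(12 - 0.25 * ants$Latitude - 0.001 * ants$Elevation +
                             0.6 * (ants$Habitat == "Forest")))

m <- glm(Srich ~ Habitat + Latitude + Elevation, family = poisson, data = ants)

coef_table <- coef(summary(m))     # estimates, standard errors, z values, p-values
round(coef_table, 4)

# With a log link, exp(estimate) is the multiplicative change in the expected
# count for a one-unit increase in that predictor, holding the others fixed.
round(exp(coef(m)), 3)
```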
Exercise 8
Since we have lots of variables, we will use model averaging. The first step is to set the base R option that controls how missing values are handled. Then try to assess which variables have a significant influence on the response variable. Here we include the latitude, elevation and habitat variables to produce the best model.
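The sketch below illustrates the idea behind model averaging using base R only: it fits every additive subset of three predictors, turns the AIC values into Akaike weights, and sums the weights per variable. This is an assumption about the workflow, not the method on the solutions page. The option being set is probably na.action = "na.fail", which model-selection packages such as MuMIn ask for so that all candidate models are fitted to the same rows. The data are simulated and every name is hypothetical; plain AIC is used here, whereas a small-sample correction (AICc) is often preferred.

```r
# Simulated stand-in for the ant data; all names and values are hypothetical.
set.seed(1)
n <- 44
ants <- data.frame(
  Latitude  = runif(n, 42, 45),
  Elevation = runif(n, 1, 550),
  Habitat   = factor(rep(c("Bog", "Forest"), each = n / 2))
)
ants$Srich <- rpois(n, exp(12 - 0.25 * ants$Latitude - 0.001 * ants$Elevation +
                             0.6 * (ants$Habitat == "Forest")))

old_opts <- options(na.action = "na.fail")   # fail loudly instead of dropping rows

# All additive subsets of the three predictors, including the null model.
rhs <- c("1", "Habitat", "Latitude", "Elevation",
         "Habitat + Latitude", "Habitat + Elevation",
         "Latitude + Elevation", "Habitat + Latitude + Elevation")
fits <- lapply(rhs, function(r) {
  glm(as.formula(paste("Srich ~", r)), family = poisson, data = ants)
})
names(fits) <- rhs

aic    <- sapply(fits, AIC)
delta  <- aic - min(aic)                          # distance from the best model
weight <- exp(-delta / 2) / sum(exp(-delta / 2))  # Akaike weights
round(cbind(AIC = aic, delta = delta, weight = weight)[order(delta), ], 3)

# Sum of weights of the models that contain each variable.
terms_in <- lapply(fits, function(f) attr(terms(f), "term.labels"))
importance <- sapply(c("Habitat", "Latitude", "Elevation"), function(v) {
  sum(weight[sapply(terms_in, function(tl) v %in% tl)])
})
round(importance, 3)

options(old_opts)                                 # restore the previous setting
```

Variables whose summed weight is close to 1 appear in nearly all well-supported models; values near 0 suggest little support.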
Exercise 9
Check the validation plots.
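A quick way to get validation plots in base R is the plot() method for fitted models, sketched here on simulated stand-in data with hypothetical column names. The solutions page may draw its own residual plots instead.

```r
# Simulated stand-in for the ant data; all names and values are hypothetical.
set.seed(1)
n <- 44
ants <- data.frame(
  Latitude  = runif(n, 42, 45),
  Elevation = runif(n, 1, 550),
  Habitat   = factor(rep(c("Bog", "Forest"), each = n / 2))
)
ants$Srich <- rpois(n, exp(12 - 0.25 * ants$Latitude - 0.001 * ants$Elevation +
                             0.6 * (ants$Habitat == "Forest")))

m <- glm(Srich ~ Habitat + Latitude + Elevation, family = poisson, data = ants)

op <- par(mfrow = c(2, 2))   # 2 x 2 grid for the four default diagnostic plots
plot(m)                      # residuals vs fitted, Q-Q, scale-location, leverage
par(op)                      # restore the previous plotting layout
```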
Exercise 10
Produce a base plot of the data and add the points of the predicted values.
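The sketch below shows one possible layout: observed richness against latitude, with predictions from the model overlaid for one habitat at mean elevation. The data are simulated, and the column names, the new_dat grid and the choice of "Forest" are assumptions made for illustration.

```r
# Simulated stand-in for the ant data; all names and values are hypothetical.
set.seed(1)
n <- 44
ants <- data.frame(
  Latitude  = runif(n, 42, 45),
  Elevation = runif(n, 1, 550),
  Habitat   = factor(rep(c("Bog", "Forest"), each = n / 2))
)
ants$Srich <- rpois(n, exp(12 - 0.25 * ants$Latitude - 0.001 * ants$Elevation +
                             0.6 * (ants$Habitat == "Forest")))

m <- glm(Srich ~ Habitat + Latitude + Elevation, family = poisson, data = ants)

# Base plot of the observed data, one colour per habitat.
plot(ants$Latitude, ants$Srich, pch = 16, col = as.integer(ants$Habitat),
     xlab = "Latitude", ylab = "Species richness")

# Prediction grid: a latitude gradient at mean elevation in forest habitat.
new_dat <- data.frame(
  Latitude  = seq(min(ants$Latitude), max(ants$Latitude), length.out = 50),
  Elevation = mean(ants$Elevation),
  Habitat   = factor("Forest", levels = levels(ants$Habitat))
)
new_dat$pred <- predict(m, newdata = new_dat, type = "response")  # count scale

points(new_dat$Latitude, new_dat$pred, pch = 1, col = "red")
```

Using type = "response" returns predictions as expected counts; the default type = "link" would return them on the log scale.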