Veterinary Epidemiologic Research: GLM – Logistic Regression
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
We continue to explore the book Veterinary Epidemiologic Research and today we’ll have a look at generalized linear models (GLM), specifically the logistic regression (chapter 16). In veterinary epidemiology, often the outcome is dichotomous (yes/no), representing the presence or absence of disease or mortality. We code 1 for the presence of the outcome and 0 for its absence. This way, the mean of the outcome represents the proportion of subjects with the event, and its expectation represents the probability for the event to occur. By definition, the expectation must lie in the range 0 to 1. We can’t use a linear regression to analyse these data because:
- the error terms () from this model will not be normally distributed. They can only take two values,
- the probability of the outcome occurring, , follows a binomial distribution and depends on the value of the predictor variables, . Since the variance of a binomial distribution is a function of the probability, the error variance will also vary with the level of and as such the assumption of homoscedasticity will be violated,
- the mean responses responses are constrained to lie within 0 and 1. With a linear regression model, the predicted values might fall outside of this range.
One way of getting around the problems described above is to use a suitable transformation, the logistic or logit function of and model this as a linear function of a set of predictor variables. This leads to the model:
curve(log(x / (1 - x)), 0, 1)
The logit transform is . The logit of a probability is the log of the odds of the response taking the value 1. Then the above equation can be rewritten as
.
While the logit function can take any real value, the associated probability always lie within the bounds of 0 and 1. Within a logistic regression model, the parameter associated with the explanatory variable is such that is the odds that the response variable takes the value 1 when increases by 1, conditional on the other explanatory variables remaining constant.
As with the linear regression, the logistic regression model involves a linear combination of explanatory variables, even if the binary response is not modelled directly but through a logistic transformation. The expression of predictors on the response from a linear combination of predictors can be extended to models for other types of response than continous or binary and define the wider class of generalized linear models described by Nelder and Wedderburn. GLMs have 3 main features:
- an error distribution giving the distribution of the response around its mean. For ANOVA and linear regression it is the normal, for logistic regression it is the binomial. These distributons come from the same exponential family of probability distributions,
- a link function: how the linear function of the explanatory variables is related to the expected value of the response. For logistic regression it is the logit function (but can also be probit ( where is the inverse normal cumulative distribution function) or complementary log-log () ); for analysis of variance and regression it is the identity function,
- the variance function: how the variance of the response variable depends on the mean.
The first example is from the Nocardia dataset: a case-control study of Nocardia spp. mastitis, from 54 case herds and 54 control herds, with the predictors related to the management of the cows during the dry period.
temp <- tempfile() download.file( "http://ic.upei.ca/ver/sites/ic.upei.ca.ver/files/ver2_data_R.zip", temp) load(unz(temp, "ver2_data_R/nocardia.rdata")) unlink(temp) library(Hmisc) nocardia <- upData(nocardia, labels = c(id = 'Herd identification number', casecont = 'Case/control status of herd', numcow = 'Number of cows milked', prod = 'Average milk production of the herd', bscc = 'Average bulk tank SCC over the first 6 months of 1988', dbarn = 'Type of barn dry cows kept in', dout = 'Type of outdoor area used for dry cows', dcprep = 'Method of teat end preparation prior to dry cow therapy administration', dcpct = 'Percent of dry cows treated with dry-cow therapy', dneo = 'Dry-cow product containing neomycin used on farm in last year', dclox = 'Dry cow product containing cloxacillin used on farm in last year', doth = 'Other dry cow products used on farm in last year'), levels = list(casecont = list('Control' = 0, 'Case' = 1), dbarn = list('freestall' = 1, 'tiestall' = 2, 'other' = 3), dout = list('pasture' = 1, 'yard/drylot' = 2, 'none' = 3, 'other' = 4), dcprep = list('no prep.' = 1, 'washed only' = 2, 'washed and disinfected' = 3, 'dry cow therapy not used' = 4), dneo = list('No' = 0, 'Yes' = 1), dclox = list('No' = 0, 'Yes' = 1)), units = c(prod = "kg/cow/day", bscc = "'000s of cells/ml", dcpct = "%"))
Example 16.1 is a logistic regression with 3 predictor variables. We can first have a look at a conditional density plot of the response variable given the proportion of dry cows treated with dry cow therapy.
cdplot(as.factor(casecont) ~ dcpct, data = nocardia)
The logistic regression model is:
mod1 <- glm(casecont ~ dcpct + dneo + dclox, family = binomial("logit"), data = nocardia) # "logit" can be omitted as it is the default (mod1.sum <- summary(mod1)) Call: glm(formula = casecont ~ dcpct + dneo + dclox, family = binomial("logit"), data = nocardia) Deviance Residuals: Min 1Q Median 3Q Max -1.8425 -0.8915 0.1610 0.6361 2.3745 Coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) -2.984272 0.772196 -3.865 0.000111 *** dcpct 0.022668 0.007187 3.154 0.001610 ** dneoYes 2.212564 0.578012 3.828 0.000129 *** dcloxYes -1.412516 0.557197 -2.535 0.011243 * --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 (Dispersion parameter for binomial family taken to be 1) Null deviance: 149.72 on 107 degrees of freedom Residual deviance: 107.99 on 104 degrees of freedom AIC: 115.99 Number of Fisher Scoring iterations: 4 c(0.022668-1.96*0.007187, 0.022668+1.96*0.007187) [1] 0.00858148 0.03675452 ### or can also do: confint.default(mod1, parm = "dcpct") 2.5 % 97.5 % dcpct 0.00858227 0.03675422
All variables are significant at the 5% level. An increase of one unit in the proportion of cows receiving dry cow therapy increases the log-odds to be a case herd by 0.02. The confidence interval can be constructed using normal approximation for the parameter estimate: . The 95% confidence interval is (0.009, 0.037). But it is better to construct a profile likelihood-based confidence interval (from MASS package):
confint(mod1, parm = "dcpct") Waiting for profiling to be done... 2.5 % 97.5 % 0.009211164 0.037694948
We can get the odds by exponentiating the estimates:
cbind(exp(coef(mod1)), exp(confint(mod1))) 2.5 % 97.5 % (Intercept) 0.05057629 0.009587881 0.2007101 dcpct 1.02292712 1.009253717 1.0384144 dneoYes 9.13911806 3.139924972 31.3596110 dcloxYes 0.24352979 0.078283239 0.7094844
The statistic to determine the overall significance of a logistic model is the likelihood ratio test. It compares the likelihood of the full model (with all the predictors included) with the likelihood of the null model (which contains only the intercept). It is analogous to the overall F-test in linear regression. The likelihood ratio test statistic is: where is the likelihood of the full model and is the likelihood of the null model. The likelihood ratio test statistic ( = 41.73) can be compared to a distribution with 3 degrees of freedom. The predictor variables are significant predictors of case-control status.
mod1.null <- glm(casecont ~ 1, family = binomial, data = nocardia) lr.mod1 <- -(deviance(mod1) / 2) lr.mod1.null <- -(deviance(mod1.null) / 2) (lr <- 2 * (lr.mod1 - lr.mod1.null)) [1] 41.73248 1 - pchisq(lr, 2) [1] 8.667767e-10
We can also compare the likelihood of the model under investigation to the likelihood of a fully saturated model (one in whcih there would be one parameter fit for each data point):
pchisq(deviance(mod1), df.residual(mod1), lower = FALSE) [1] 0.3748198
The p-value is over 0.05 and we can conclude this model fits sufficiently well.
Finally we can evaluate a single coefficient with a Wald test. If we had barn type to the model below, we can test for an overall effect of barn type (b gives the coefficient, Sigma the variance covariance matrix of the error terms, Terms indicates which terms in the model to test (in the order they appear in the table of coefficients). The overall effect of barn is significant. We can also test for other barn against tie-stall barn.
mod1b <- glm(casecont ~ dcpct + dneo + dclox + as.factor(dbarn), family = binomial, data = nocardia) summary(mod1b) Call: glm(formula = casecont ~ dcpct + dneo + dclox + as.factor(dbarn), family = binomial, data = nocardia) Deviance Residuals: Min 1Q Median 3Q Max -1.6949 -0.7853 0.1021 0.7692 2.6801 Coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) -2.445696 0.854328 -2.863 0.00420 ** dcpct 0.021604 0.007657 2.821 0.00478 ** dneoYes 2.685280 0.677273 3.965 7.34e-05 *** dcloxYes -1.235266 0.580976 -2.126 0.03349 * as.factor(dbarn)tiestall -1.333732 0.631925 -2.111 0.03481 * as.factor(dbarn)other -0.218350 1.154293 -0.189 0.84996 --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 (Dispersion parameter for binomial family taken to be 1) Null deviance: 149.72 on 107 degrees of freedom Residual deviance: 102.32 on 102 degrees of freedom AIC: 114.32 Number of Fisher Scoring iterations: 5 library(aod) wald.test(b = coef(mod1b), Sigma = vcov(mod1b), Terms = 4:5) Wald test: ---------- Chi-squared test: X2 = 10.2, df = 2, P(> X2) = 0.0062 comp <- cbind(0, 0, 0, 0, 1, -1) wald.test(b = coef(mod1b), Sigma = vcov(mod1b), L = comp) Wald test: ---------- Chi-squared test: X2 = 1.1, df = 1, P(> X2) = 0.29
That’s it for now…
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.