Dataset – The Giant Marmots of Moscow

Dr. Jack Davis

3 years ago

[This article was first published on Statistics et al., and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Stat 403/640/890 Analysis Assignment 3: Polluted Giant Marmots

Due Wednesday, April 3^rd

Drop off in the dropbox by the stats workshop, or hand in in class.

For this assignment, use the Marmots_Real.csv dataset.

Main goal: The giant marmots of Moscow have a pollution problem. Find a model to predict the pollutant concentration (mg per kg) in the local population without resorting to measuring it directly. (It turns out that measuring this pollutant requires some invasive measures like looking at bone marrow).

The dataset Marmots_real.csv has the data from 60 such marmots, including many variables that are easier measure:

Variable Name	Type	Description
Species	Categorical, Unordered	One of five species of giant marmot
Region	Categorical, Unordered	One of five regions around Moscow where the subject is captured
Age	Numerical, Continuous	Age in years
Pos_x	Numerical, Continuous	Longitude, recoded to (0,1000), of capture
Pos_y	Numerical, Continuous	Latitude, recoded to (0,1000), of capture
Long_cm	Numerical, Continuous	Length nose to tail in cm
Wide_cm	Numerical, Continuous	Width between front paws, outstretched
Sex	Binary	M or F
Lesions	Numerical, Count	Number of skin lesions (cuts, open sores) found upon capture
Injured	Binary	0 or 1, 1 if substantial injury was observed upon capture.
Teeth_Condition	Categorical, Ordered	Condition of teeth upon capture, listed as Very Bad, Bad, Average, or Good.
Weight	Numerical, Continuous	Mass of subject in 100g
Antibody	Numerical, Continuous	Count of CD4 antibody in blood per mL
Pollutant	Numerical, Continuous	mg/kg of selenium found in bone marrow

There are no sampling weights. There is no missing data. There should be little to no convergence or computational issues with this data.

Assignment parts:

1) Build at least three models for pollutant and compare them (e.g. r-squared, general parsimony). Be sure to try interactions and polynomial terms.

Select one to be your ‘model to beat’.

2) Check the diagnostics of your model to beat. Specifically, normality of residuals, influential outliers, and the Variance Inflation Factor. Comment.

3) Try a transformation of the response in your model to beat, and see if you can improve the r-squared.

4) Try a PCA-based model and see if it comes close to you model to beat.

5) Take your ‘model to beat’ and add some terms to it. Call this the ‘full model’, and use that as a basis for model selection using stepwise and the AIC criterion. Is the stepwise-produced model better (r-squared, AIC) than your ‘model to beat’?

6) If you haven’t already, try a random effect of something appropriate, and see if it beats the AIC of the stepwise model. Use the AIC() function to see the AIC of most models.

Useful sample code:

######## Preamble / Setup

## Load the .csv file into R. Store it as ‘dat’

dat = read.csv(“marmots_real.csv”)

dat$region = as.factor(dat$region)

library(car) # for vif() and boxcox()

library(MASS) # for stepAIC()

library(ks) # for kde()

library(lme4) # lmer and glmer

##### Try some models, with interactions

### Saturated. Not enough DoF

mod = lm(antibody ~ species*region*age*weight + long_cm, data=dat)

summary(mod)

### Another possibility, enough DoF, but efficient?

mod = lm(antibody ~ species + region + age*weight*long_cm, data=dat)

summary(mod)

vif(mod)

plot(mod)

AIC(mod)

### With polynomials

mod = lm(pollutant ~ age + long_cm + wide_cm + I(sqrt(age)) + I((long_cm*wide_cm)^3), data=dat)

summary(mod) ## High r-sq and little significance? How?

vif(mod) ## Oh, that’s how.

#### Model selection,

mod_full = lm(antibody ~ species + region + age*weight*long_cm + I(log(wide_cm)) + lesions, data=dat)

### Stepwise selection based on AIC

stepAIC(mod_full)

### What if we do a BIC penalty

## Try without trace=FALSE so we can see what’s going on.

stepAIC(mod_full, k=log(nrow(dat)))

######### Transformations

### Start with the classic tranforms

### Another possibility, enough DoF, but efficient?

mod = lm(sqrt(pollutant) ~ species + region + age*weight*long_cm, data=dat)

summary(mod)

mod = lm(log(pollutant) ~ species + region + age*weight*long_cm, data=dat)

summary(mod)

### Box-cox to find the range of best ones through

boxcox(pollutant ~ species + region + age*weight*long_cm, data = dat,

lambda = seq(-2, 3, length = 30))

boxcox(antibody ~ species + region + age*weight*long_cm, data = dat,

lambda = seq(-2, 3, length = 30))

### Anything above the 95% line is perfectly fine. Anything close is probably fine too.

### Reminder:

### Lambda = -1 is 1/x (inverse) transform

### lambda = 0 is log tranform

### Lambda = 1/2 is sqrt tranform

### Lambda = 1 is no tranform

### Lambda = 2 is square transform

#################

### MANOVA

### First, Are the two responses related?

cor(dat$antibody, dat$pollutant)

plot(dat$antibody, dat$pollutant)

### Start with the simple ANOVAs

mod_anti = lm(antibody ~ species + region + age*weight*lesions, data=dat)

mod_poll = lm(pollutant ~ species + region + age*weight*lesions, data=dat)

aov_anti = anova(mod_anti)

aov_anti

summary(aov_anti)

aov_poll = anova(mod_poll)

aov_poll

summary(aov_poll)

### Now try the multiple ANOVA

aov_mult = manova(cbind(antibody, pollutant) ~ species + region + age*weight*lesions)

aov_mult

summary(aov_mult) ### Your job: Make a model that balances simplicity with fit.

## Residual standard errors: Lower = better fit

###################

# PCA

### convert the relevant categorical variables

dat$teeth_num = as.numeric(factor(x = dat$teeth_condition, levels=c(“Very Bad”,”Bad”,”Average”,”Good”)))

dat$sex_num = as.numeric(factor(x = dat$sex, levels=c(“F”,”M”)))

PCA_all = prcomp( ~ age + weight + lesions + long_cm + wide_cm + injured + teeth_num + sex_num,

data = dat,

scale = TRUE)

summary(PCA_all)

plot(PCA_all, type=”lines”)

### Add the Principal components to the marmots dataset

dat = cbind(dat, PCA_all$x)

head(dat)

### Try a few models of the responses using the PCAs

mod_PCA1 = lm(antibody ~ PC1 + PC2 + PC3 + PC4 + PC5 + PC6, data=dat)

summary(mod_PCA1)

mod_PCA2 = lm(antibody ~ PC1 * PC2 * PC3 , data=dat)

summary(mod_PCA2)

mod_PCA3 = lm(antibody ~ PC1 + PC2 + PC3 , data=dat)

summary(mod_PCA3)

mod_PCA4 = lme(pollutant ~ PC1 + PC2 + PC3 + (1|region), data=dat)

summary(mod_PCA4) ### Why non-zero correlations? Adjustments for region were made

### Fixed vs Random

marmots$region = as.factor(marmots$region)

summary(lm(pollutant ~ region, data=dat))

summary(lmer(pollutant ~ (1|region), data=dat))

summary(lmer(pollutant ~ PC1 + PC2 + (age|region), data=dat))

summary(lmer(pollutant ~ age + (1|region), data=dat))$logLik ### Higher LogLik is better

summary(lmer(pollutant ~ age + (1|region), data=dat))$AICtab ### Lower REML (AIC calculated by REML) is better

### Compare the $AICtab value to the result from stepAIC

To leave a comment for the author, please follow the link and comment on their blog: Statistics et al..

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.