Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Exploring Gaussian Mixture Models
This week in the Empirical Research Methods course, we've been talking a lot about measurement error. The idea of having some latent variable of interest, coupled with 'flawed' measures reminded me of a section of Cosma's course I really enjoyed, but haven't gotten a change to go back to – mixture models.
The rough idea here is that we have some observed measure for a population, but suspect the population is composed of a number of distict types, which themselves have different distributions on our measure. Cosma has a nice example using precipitation data – here, we might suspect that snowy days result in a different amount of precipitation than rainy days. Total precipitation, then, is just the sum of a few distributions – one for each type.
Interested in trying to fit my own mixture models, I thought modeling weight by gender might be a reasonable domain for application.
Looking around a bit for open datasets, I found CrossFit data, which needed only a bit of cleaning before I had a data frame composed of weights and genders for about 14000 CrossFit athletes.
Before fitting my model, I wanted to make sure the data were fairly bimodal, so I decided to create a density plot of weight:
This turned out to be not nearly as bimodal as I'd hoped. Inspecting the gender distribution shows that men make up vast majority of CrossFit athletes:
## ## F M ## 0.27 0.73
Regardless of the comparatively smaller share of women, I was optimistic that a mixture model would still reasonably place gaussian distributions near the mean weights for each gender. Before I fit the model, I again plotted the weight densities, this time including gender information:
This result looked especially good – finally, before running the model, I used the plyr package to find the mean and standard deviation for each gender:
## gender mean.weight sd.weight ## 1 F 135.7 17.64 ## 2 M 184.4 24.26
Next, I used the mixtools package to fit two gaussians to the data. Note that application of the EM algorithm will not yield the same estimates each time – the results reported come from an interative fitting process.. Here are a few draws from the two-component model:
## number of iterations= 210 ## summary of normalmixEM object: ## comp 1 comp 2 ## lambda 0.983262 0.0167383 ## mu 170.985903 197.1966562 ## sigma 30.039475 68.0108365 ## loglik at estimate: -69972 ## number of iterations= 256 ## summary of normalmixEM object: ## comp 1 comp 2 ## lambda 0.0167324 0.983268 ## mu 197.1994160 170.986019 ## sigma 68.0174677 30.039619 ## loglik at estimate: -69972 ## number of iterations= 109 ## summary of normalmixEM object: ## comp 1 comp 2 ## lambda 0.913756 0.0862443 ## mu 170.277126 183.5844313 ## sigma 32.336239 8.7492152 ## loglik at estimate: -70004
Here, I was quite surprised. From the plot of the weight densities by gender, I was fairly sure the mixture model would work.. surprisingly, the first few times I ran the model, though, the EM algorithm resulted in a mixture of two gaussians, one around 170, and another around 195 (For a bit of background on this algorithm, and mixture models in general, I highly recommend Cosma's Notes).
When I instead moved to a model with three types, I started reliably picking up a gaussian that's fairly close to the mean female weight:
## number of iterations= 403 ## summary of normalmixEM object: ## comp 1 comp 2 comp 3 ## lambda 0.0669021 0.144703 0.788395 ## mu 191.9951136 128.521504 177.554028 ## sigma 50.1322568 10.480415 24.355337 ## loglik at estimate: -69672 ## number of iterations= 294 ## summary of normalmixEM object: ## comp 1 comp 2 comp 3 ## lambda 0.14469 0.788445 0.0668648 ## mu 128.52087 177.553775 191.9985466 ## sigma 10.48005 24.356546 50.1385850 ## loglik at estimate: -69672 ## number of iterations= 125 ## summary of normalmixEM object: ## comp 1 comp 2 comp 3 ## lambda 0.911272 0.0719816 0.0167468 ## mu 170.170622 181.9974482 194.2480171 ## sigma 32.328408 7.6615752 8.4118644 ## loglik at estimate: -70003
It's a bit difficult to tell what to make of all this.
The most obvious question in my mind is the failure of the model to pick up the types (i.e. gender) that I expected.
Taking a look at my assumptions, though, I've got to admit that I know essentially nothing about CrossFit. From a bit of browsing around, I found a few blog posts that suggested certain body types would be better for different exercises.
Returning to the model fits, it might make sense to think of different subpopulations based on weight – burly athletes who have a competitive advantage for strength-related competition, thinner athletes who have an advantage with speed-related events.
I should also mention that I originally chose two components based on my expectations about weight distribution following from gender. Cosma has a nice section in the notes on using cross-validation to choose the number of components, and that might be a reasonable way to extend the analysis.
As a final note, I've got this voice ringing in my head that dissuades the reification of the result – the 'types' our model uncovers are only useful insofar as they relate to other variables of interest.. it'd be ambitious to conclude here that there actually are these two subpopulations.. instead, we might do better to use this information to think heuristically about CrossFit athletes.
Here's the code I used for this post:
rm(list = ls()) library(mixtools) library(ggplot2) # Note: data source: # http://dl.dropbox.com/u/23802677/xfit2011.csv dat = read.csv("http://dl.dropbox.com/u/23802677/xfit2011.csv", stringsAsFactors = FALSE, header = FALSE) # Creating gender variable: dat$gender = substr(dat$V5,1,1) correct.gender.codes = (dat$gender %in% c("F","M")) dat$gender[which(correct.gender.codes == FALSE)] = NA #table(dat$gender) #Eyeballing the data, it looks like weights are reported in either kilograms or pounds. kilog.vec = grep(dat$V7, pattern = "Kilograms") Lb.vec = grep(dat$V7, pattern = "Pounds") # Defining function to excise number from string: excise.number = function(string){ # Note: assumes number starts the string as.numeric(unlist(strsplit(string,split = " "))[[1]][1]) } dat$weight = rep(NA,nrow(dat)) dat$weight[Lb.vec] = sapply(dat$V7[Lb.vec],excise.number) dat$weight[kilog.vec] = sapply(dat$V7[kilog.vec],excise.number) #note: weight in kilograms still needs to be converted: kilog.to.lb = function(x){round(x*2.20462,0)} dat$weight[kilog.vec] = sapply(dat$weight[kilog.vec],kilog.to.lb) # Final (Clean) data: dat = na.omit(dat[,c("weight","gender")]) # Weight density: p = ggplot(dat,aes(x = weight)) + geom_density() p # Gender Distribution: round(prop.table(table(dat$gender)),2) # Weight density by gender: p = ggplot(dat,aes(x = weight, color = factor(gender))) + geom_density() p # mean and sd for weight by gender: ddply(dat,.(gender),summarise, mean.weight = mean(weight), sd.weight = sd(weight)) # Mixture model with 2 components: weight.k2 = normalmixEM(dat$weight,k=2,maxit = 1000, epsilon = 0.001) summary(weight.k2) # Mixture model with 3 components: weight.k3 = normalmixEM(dat$weight,3=2,maxit = 1000, epsilon = 0.001) summary(weight.k2)
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.