Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Occasionally, I will get requests from clients to calculate the mean. Most of the time it’s a simple request
but from time-to-time the data was originally from grouped data. A common approach is to take the midpoint of each of the groups and just assume that all respondents within that group average out to the midpoint. The biggest problem I have run into has been trying to decide what to do with two extremes. What does one do with the age ’65+’ category or the ‘over $100,000′ category. Since there is really no mid-point one must make a (somewhat arbitrary) determination on what value to use for those categories. Over the years I have toyed around with various approaches but have found using this Monte Carlo approach has worked the best and seems to be most defensible. In this example I use voters who were asked their age and were given categorical response options. The final deliverable will be the average age and standard deviation.
Group | Frequency |
18-30 | 25 |
30-45 | 43 |
45-60 | 42 |
60+ | 31 |
I have written some code that I have traditionally used for this type of work but have since found that the LearnBayes package contains the necessary functions as well. For the purpose of making this code usable in other settings I have combined some of the code I have written with some of the functions from the LearnBayes package. This way I have been able to modify some of the algorithms and graphs to work under a variety of settings.
I welcome any comment on other approaches people have taken to solve similar problems using grouped, categorical data.
###Note that this function is also available through the VGAM package. ###However, it is displayed here as part of the example. library(graphics); laplace.dist <- function (logpost, mode, ...) { options(warn = -1) fit <- optim(mode, logpost, gr = NULL, ..., hessian = TRUE, control = list(fnscale = -1)) options(warn = 0) mode <- fit$par hess <- -solve(fit$hessian) l.mode <- length(mode) int <- l.mode/2 * log(2 * pi) + 0.5 * log(det(hess)) + logpost(mode, ...) ret.val <- list(mode = mode, var = hess, int = int, converge = fit$convergence == 0) return(ret.val) } metropolis.randwalk <- function (logpost, proposal, start, m, ...) { pb <- length(start) Mpar <- array(0, c(m, pb)) b <- matrix(t(start)) lb <- logpost(start, ...) a <- chol(proposal$var) scale <- proposal$scale accept <- 0 for (i in 1:m) { bc <- b + scale * t(a) %*% array(rnorm(pb), c(pb, 1)) lbc <- logpost(t(bc), ...) prob <- exp(lbc - lb) if (is.na(prob) == FALSE) { if (runif(1) < prob) { lb <- lbc b <- bc accept <- accept + 1 } } Mpar[i, ] <- b } accept <- accept/m ret.val <- list(par = Mpar, accept = accept) return(ret.val) } DataGroup <- function(theta, data){ cpoints=data$b freq <- data$f nbins <- length(cpoints) m <- theta[1] logsigma <- theta[2] z <- 0*m s <- exp(logsigma) z <- freq[1]*log(pnorm(cpoints[1],m,s)) for(j in 1:(nbins-1)){ z <- z+freq[j+1]*log(pnorm(cpoints[j+1],m,s)-pnorm(cpoints[j],m,s)) } z <- z+freq[nbins+1]*log(1-pnorm(cpoints[nbins],m,s)) return(z) } contour.plot = function (logf, limits, data, ...) { log.f = function(theta, data) { if (is.matrix(theta) == TRUE) { val = matrix(0, c(dim(theta)[1], 1)) for (j in 1:dim(theta)[1]) { val[j] = logf(theta[j,], data) } } else { val = logf(theta, data) } return(val) } ng <- 50 x0 <- seq(limits[1], limits[2], len = ng) y0 <- seq(limits[3], limits[4], len = ng) X <- outer(x0, rep(1, ng)) Y <- outer(rep(1, ng), y0) n2 <- ng^2 Z <- log.f(cbind(X[1:n2], Y[1:n2]), data) Z <- Z - max(Z) Z <- matrix(Z, c(ng, ng)) contour(x0, y0, Z, levels = seq(-6.9, 0, by = 2.3), lwd = 2) } lo <- c(18,30,45,60) hi <- c(30,45,60, Inf) d <- list(int.lo=lo, int.hi=hi, b = c(30,45,60), f=c(25,43,42,31)) y <- c(rep((18+30)/2,25),rep((30+45)/2,43),rep((45+60)/2,42),rep(61,31)) mean.y <- mean(y) log.sd.y <- log(sd(y)) start <- c(mean.y,log.sd.y) fit <- laplace.dist(DataGroup,start,d) fit modal.sds <- sqrt(diag(fit$var)) proposal <- list(var=fit$var,scale=2) fit2 <- metropolis.randwalk(DataGroup,proposal,start,10000,d) fit2$accept post.means <- apply(fit2$par,2,mean) post.sds <- apply(fit2$par,2,sd) cbind(c(fit$mode),modal.sds) ##These should be similar cbind(post.means,post.sds) contour.plot(DataGroup,c(min(post.means[1])-10,max(post.means[1]+10),min(post.means[2]-.5),max(post.means[2]+.5)),d, xlab="mu",ylab="log sigma") points(fit2$par[1:2000,1],fit2$par[1:2000,2])
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.