Visualizing categorical data in R

[This article was first published on Maximize Productivity with Industrial Engineer and Operations Research Tools, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

I came across an interesting SAS macro that was used for visualizing log odds relationships of data.  This type of chart is helpful for visualizing the relationship between a binary dependent variable and a continuous independent variable.  I don’t use SAS on a daily basis as I prefer to use R.  So I got to thinking that I could recreate this macro using only R.  I thought this would be a good tutorial for R on developing functions, using different plot techniques, and overlapping chart types.

The following picture is the result of the logodds function in R.  The chart is really close but not quite exact.  For the histogram points I decided to use the default squares of the stripchart plot and used a grey color to make it look a little faded.


The following is the R script.

logoddsFnc <- function(data_ind, data_dep, ind_varname, min.count=1){

  # Assumptions: x & y are numeric vectors of the same
  # length, y is 0/1 varible.  This returns a vector
  # of breaks of the x variable where each bin has at
  # least min.countnumber of y’s
  bin.by.other.count <- function(x, other, min.cnt=1) {
    csum <- cumsum(tapply(other, x, sum))
    breaks <- numeric(0)
 
    i <- 1
    breaks[i] <- as.numeric(names(csum)[1])
    cursum <- csum[1]
 
    for ( a in names(csum) ) {
      if ( csum[a] – cursum >= min.cnt ) {
        i <- i + 1
        breaks[i] <- as.numeric(a)
        cursum <- csum[a]
      }
    }
 
    breaks
  }
 
  brks <- bin.by.other.count(data_ind, data_dep, min.cnt=min.count)
 
  # Visualizing binary categorical data
  var_cut <- cut(data_ind, breaks=brks, include.lowest=T)
  var_mean <- tapply(data_dep, var_cut, mean)
  var_median <- tapply(data_ind, var_cut, median)
 
  mydf <- data.frame(ind=data_ind, dep=data_dep)
  fit <- glm(dep ~ ind, data=mydf, family=binomial())
  pred <- predict(fit, data.frame(ind=min(data_ind):max(data_ind)),
          type=”response”, se.fit=T)
 
  # Plot
  plot(x=var_median, y=var_mean, ylim=c(0,1.15),
       xlab=ind_varname, ylab=”Exp Prob”, pch=21, bg=”black”)
  stripchart(data_ind[data_dep==0], method=”stack”,
             at=0, add=T, col=”grey”)
  stripchart(data_ind[data_dep==1], method=”stack”,
             at=1, add=T, col=”grey”)
 
  lines(x=min(data_ind):max(data_ind),
        y=pred$fit, col=”blue”, lwd=2)
  lines(lowess(x=var_median,
               y=var_mean, f=.30), col=”red”)
 
  lines(x=min(data_ind):max(data_ind),
        y=pred$fit – 1.96*pred$se.fit, lty=2, col=”blue”)
  lines(x=min(data_ind):max(data_ind),
        y=pred$fit + 1.96*pred$se.fit, lty=2, col=”blue”)
}




 


logoddsFnc(icu$age, icu$died, “age”, min.count=3)



The ICU data  for this example can be found in the R package “vcdExtra”.  Special thanks to David of Univ. of Dallas for providing me with a way to develop breaks in the independent variable as seen by the bin.by.other.count function. 

The author of the SAS macro is also the author of Visualizing Categorical Data by M. Friendly which is a great reference for analyzing and visualizing data in factored groups.





To leave a comment for the author, please follow the link and comment on their blog: Maximize Productivity with Industrial Engineer and Operations Research Tools.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)