Converting Individual Binary vectors to a Value based on Column Names

William Doane

3 years ago

[This article was first published on R – William Doane, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

When processing data downloaded from popular survey engines, it’s not uncommon for multiple choice questions to be represented as one column per possible response coded as 0/1. So, a question with just two responses might be downloaded as part of a CSV with one column for q1_1 and another for q1_2. If the responses are mutually exclusive, then (q1_1 == 0 iff q1_2 == 1) and (q1_1 == 1 iff q1_2 == 0). If the responses are part of a “choose all that apply” question, then it’s possible to have multiple 1s.

How can these individual binary indicator variables be reassembled into a single response variable?

First, let’s simulate some response data for non-mutually exclusive questions—each row represents one respondent’s choices:

df <- data.frame(
  q1_1 = round(runif(5), 0),
  q1_2 = round(runif(5), 0),
  q1_3 = round(runif(5), 0),
  
  q2_1 = round(runif(5), 0),
  q2_2 = round(runif(5), 0),
  
  q3_1 = round(runif(5), 0),
  q3_2 = round(runif(5), 0),
  q3_3 = round(runif(5), 0),
  q3_4 = round(runif(5), 0)
)

df

  q1_1 q1_2 q1_3 q2_1 q2_2 q3_1 q3_2 q3_3 q3_4
1    1    0    0    1    0    1    1    0    1
2    0    1    0    0    1    1    1    0    1
3    1    1    1    1    1    0    0    0    1
4    0    1    0    1    1    1    1    0    1
5    0    0    1    1    0    1    0    0    0

R’s dplyr package offers the coalesce function, which doesn’t suit my needs when the data contains 0s for non-selected response choices. Notice below in row 2, for example, that q1 and q2 select the first non-NA values, which is 0:

library(dplyr)
 
df %>%
  mutate(q1 = coalesce(q1_1, q1_2, q1_3)) %>%
  mutate(q2 = coalesce(q2_1, q2_2)) %>%
  mutate(q3 = coalesce(q3_1, q3_2, q3_3, q3_4))

  q1_1 q1_2 q1_3 q2_1 q2_2 q3_1 q3_2 q3_3 q3_4 q1 q2 q3
1    1    0    0    1    0    1    1    0    1  1  1  1
2    0    1    0    0    1    1    1    0    1  0  0  1
3    1    1    1    1    1    0    0    0    1  1  1  1
4    0    1    0    1    1    1    1    0    1  0  1  1
5    0    0    1    1    0    1    0    0    0  0  1  1

If you replace all 0s with NA, you can get closer to what you need:

df %>%
  mutate_all(~ifelse(. == 0, NA_real_, .)) %>%
  mutate(q1 = coalesce(q1_1, q1_2, q1_3)) %>%
  mutate(q2 = coalesce(q2_1, q2_2)) %>%
  mutate(q3 = coalesce(q3_1, q3_2, q3_3, q3_4))

  q1_1 q1_2 q1_3 q2_1 q2_2 q3_1 q3_2 q3_3 q3_4 q1 q2 q3
1    1   NA   NA    1   NA    1    1   NA    1  1  1  1
2   NA    1   NA   NA    1    1    1   NA    1  1  1  1
3    1    1    1    1    1   NA   NA   NA    1  1  1  1
4   NA    1   NA    1    1    1    1   NA    1  1  1  1
5   NA   NA    1    1   NA    1   NA   NA   NA  1  1  1

Unfortunately, the q1 vector here only tells us that there was some response by each respondent, not which response they gave for q1.

It would be nice to have a version of coalesce that gathered not the first non-NA value, but the column name of the first non-NA value. Here, I’ll use the structure of dplyr’s coalesce as a model:

coalesce_colname <-
  function(...) {
    if (missing(..1)) {
      abort("At least one argument must be supplied")
    }
   
    colnames <- as.character(as.list(match.call()))[-1]
   
    values <- list(...)
   
    x <- values[[1]]
    x[!is.na(x)] <- colnames[1]
   
    values <- values[-1]
    colnames <- colnames[-1]
   
    for (i in seq_along(values)) {
      x <- ifelse(is.na(x) & !is.na(values[[i]]), colnames[i], x)
    }
   
    x
  }

With this, you have a drop-in replacement for coalesce that captures the column name:

df %>%
  mutate_all(~ifelse(. == 0, NA_real_, .)) %>%
  mutate(q1 = coalesce_colname(q1_1, q1_2, q1_3)) %>%
  mutate(q2 = coalesce_colname(q2_1, q2_2)) %>%
  mutate(q3 = coalesce_colname(q3_1, q3_2, q3_3, q3_4))

  q1_1 q1_2 q1_3 q2_1 q2_2 q3_1 q3_2 q3_3 q3_4   q1   q2   q3
1    1   NA   NA    1   NA    1    1   NA    1 q1_1 q2_1 q3_1
2   NA    1   NA   NA    1    1    1   NA    1 q1_2 q2_2 q3_1
3    1    1    1    1    1   NA   NA   NA    1 q1_1 q2_1 q3_4
4   NA    1   NA    1    1    1    1   NA    1 q1_2 q2_1 q3_1
5   NA   NA    1    1   NA    1   NA   NA   NA q1_3 q2_1 q3_1

and with a little effort, you can wrangle the column name to extract the response value:

df %>%
  mutate_all(~ifelse(. == 0, NA_real_, .)) %>%
  mutate(q1 = coalesce_colname(q1_1, q1_2, q1_3)) %>%
  mutate(q2 = coalesce_colname(q2_1, q2_2)) %>%
  mutate(q3 = coalesce_colname(q3_1, q3_2, q3_3, q3_4)) %>%
  
  mutate_at(c("q1", "q2", "q3"), ~stringr::str_extract(., "\\d+$"))

  q1_1 q1_2 q1_3 q2_1 q2_2 q3_1 q3_2 q3_3 q3_4 q1 q2   q3
1    1   NA   NA    1   NA    1    1   NA    1  1  1    1
2   NA    1   NA   NA    1    1    1   NA    1  2  2    1
3    1    1    1    1    1   NA   NA   NA    1  1  1    4
4   NA    1   NA    1    1    1    1   NA    1  2  1    1
5   NA   NA    1    1   NA    1   NA   NA   NA  3  1    1

To leave a comment for the author, please follow the link and comment on their blog: R – William Doane.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.