R: ifelse statements with multiple variables and NAs
[This article was first published on R in the Antipodes, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
ifelse statements in R are the bread and butter of recoding variables. Normally these are pretty easy to do, particularly when we are recoding off one variable, and that variable contains no missing values. There are lots of examples on how to do this simple coding already available, so I will simply redirect you to the posts here and here, which are excellent summaries of how to do simple ifelse coding.Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Things get more complicated when:
- the recoding is defined off more than one variable, and
- the variables contain missing values (NA in R speak)
I’ve been working on this type of coding for the past day, and I thought I would share my tips with you as it took me quite a while to figure out how to do this, with numerous internet searches. Hopefully I can save someone else the time!
My data frame is a survey of the general population, looking at consumption of a particular type of food (we’ll call it Widgets), because there was insufficient information about Widget-eaters to be able to cut a better sample frame. The end result is that I have a data frame where only around 10% of respondents say they have eaten Widgets. This means that I have lots of information about the relevant 10% of the population, and a much smaller amount of information about the other 90% (age, location, and sex only). Currently I am holding, and working on, all the data in one data frame without subsetting, although I may subset later. The upshot is that I have a data frame where 90% of the data is NA.
The current problem I am faced with is that there were a number of open-ended response options to questions because we had “Other, please specify” type options. These other responses were selected by only a subset of the Widget eaters (who are 10% of my respondents). These “Other” response options, which are answers to questions like “Where did you last buy Widgets”, “Where did you obtain information about the benefits of eating Widgets” were coded into updated categories by subject matter experts and provided to me as Excel files. My role here is to merge the updated coding into the existing response option coding for each question where there is an open-ended response.
Normally, that’s been pretty easy. Most of the questions have required a “pick one only” type response so it’s just been a matter of updating an existing variable, from a response of “Other” to a better coded response. That’s simply replacement coding, where an existing variable level is overwritten by a new one.
However, I have just been recoding response options to a “choose all that apply” question. And this is where my ifelse has smacked up against the NA problem. Here, I am recoding “Other” responses to the question “Where did you obtain information about the benefits of eating Widgets?”
There were two recodes I had to make:
- update an existing variable to reflect that the open-ended response should have been coded to an existing response option (e.g. because of post-hoc decision to expand a response option, e.g. so that the response option “TV” now also includes “radio”)
- to construct new response option categories because the existing ones are not sufficient to capture the “other” response and there are a sufficiently large number of similar response options to justify creating a new response option.
There are 5001 respondents in the data frame. Of the 460 Widget eaters, 73 provided “other” responses to the question. The data came to me with 21 response options precoded, and the 73 responses to the “other” option were split off into Excel for recoding by a subject matter expert. I received back the recodes in the Excel file. I saved a copy of the Excel file to csv format, read that into R, and created a new variable “Other.Source” to capture all of the numeric recodes. (I have created a data dictionary that shows what each numeric code stands for.) Inside the master R data frame (“foo”), the 20 non-“Other” responses are coded as separate variables (because this is an “all that apply” question), from C1a to C1t, with C1r as the “Other” category. The codes are:
- 1 if the Widget eater gave that response
- 0 if the Widget eater did not give that response
- NA if the respondent is not a Widget eater
The Other.Source value equals the response option number for question C1 in the survey. For example, where the response should have been coded to response option 1 (i.e. variable C1a in foo), the value in Other.Source has been set to 1.
I have 4 recodes to apply to existing response option categories, and 4 new response option categories that need to be constructed as new variables in the data frame (taking the range C1u to C1x to keep with the extant metadata framework). I have had to use a slightly different approach to recode, depending on whether I am updating an existing variable, or creating a new one. I show both approaches below.
Other.Source has been merged into foo, ahead of the work below. This means that all respondents who did not have an “other” response for the question, regardless of whether or not they are Widget eaters, are NA for Other.Source in foo.
First, updating an existing variable, based on values to two variables. This shows how I updated the 7th response option, which was coded to variable C1g in foo. I had been constructing a temporary variable to take the recodes (foo$temp) because I was having so many issues getting my code correct)
table(foo$C1g)
#code below is complicated because a missing value screws up the counts, so need to exclude missing values
#explicitly, hence all the !is.na() syntax = ensuring that the value is NOT missing
foo$temp <- ifelse(!is.na(foo$C1g) & (foo$C1g==1) | (!is.na(foo$Other.Source) & foo$Other.Source==7),1,
ifelse(foo$C1g==0 & !is.na(foo$Other.Source) & foo$Other.Source!=7,0,
ifelse(foo$C1g==0 & is.na(foo$Other.Source),0,NA)))
foo$C1g <- foo$temp
Walking through this code:
- I produce a table of the original responses coded to C1g so I can make sure that the update means that I have at least the same number of “1” responses after recoding. If I have less, something has gone wrong.
- Some comments to remind myself of why I have such complicated code: it’s all down to the NAs!
- The first ifelse is giving “1” codes to temp where C1g contains 1 or where Other.Source contains 7 (the value indicating the “other” response should have been coded to the 7th response option, i.e. C1g). Because both C1g and Other.Source contain NAs, !is.na() needs to be added as an extra condition for both variable checks, otherwise NA values interfere with getting the correct test result. Note that this contains an OR (|) condition: if either C1g OR Other.Source indicates that the 7th response option is valid for the respondent, set temp to “1”.
- The second ifelse puts temp to 0 if C1g is 0 (i.e. the respondent is a Widget eater and they were not coded to the 7th response option originally), and that they gave an “other” response that was not related to the 7th response option. Note how the !is.na() condition has been included for Other.Source, again it is included because only using foo$Other.Source!=7 without the !is.na would give incorrect results.
- The third ifelse is there because only a subset of Widget eaters gave “other” responses, so most Widget eaters will be NA for Other.Source, and we want to set them to “0” (did not give the 7th response option) and not NA (were not Widget eaters).
- The NA result will therefore only be assigned where the respondent is not a Widget eater.
- The last line replaces the original (and incorrect) C1g variable with the updated (and correct) results stored in the temp variable.
What about creating a new variable? Well, as I mentioned above I had to do this four times. The code is a little different because we don’t have an extant response option variable to match off. What I do have is a UseMentioned variable that indicates whether the respondent is a Widget eater (value=”Yes”) or not (value=”No”). So there are no NAs in the UseMentioned variable, which is part of foo. The code to do the new variable construction is below. We are constructing the 24th variable, which is named C1x*:
foo$temp <- ifelse(!is.na(foo$Other.Source) & foo$Other.Source==25,1,
ifelse((is.na(foo$Other.Source) | foo$Other.Source!=25) & foo$UseMentioned==”Yes”,0,NA))
foo$C1x <- Aussie$temp
Walking through this code:
- The first ifelse gives the value 1 to the new variable if the respondent picked this response option, with the !is.na() preventing problems arising from NA values in Other.Source.
- The second ifelse gives the value 0 to the new variable if the respondent is a Widget eater (UseMentioned==”Yes”) and the respondent did not provide this response option. Note the use of is.na() and not the !is.na()that was all that was used in the previous recoding example. While it would appear that all that is required is the UseMentioned==”Yes” criterion, the presence of NAs in the Other.Source variable causes issues so such a simple criterion cannot be used.
* The recode value is 25 and not 24, because of some complexity in the metadata where I have one variable that takes different values depending on which of two countries the respondent lives in, but the letter of the alphabet used for the variable has been made the same. So the variable letters are one out.
I hope you find this useful.
To leave a comment for the author, please follow the link and comment on their blog: R in the Antipodes.
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.