R pitfall #3: friggin’ factors
I received an email from one of my students expressing deep frustration with a seemingly simple problem. He had a factor containing names of potato lines and wanted to set some levels to NA. Using simple letters as example names, he was baffled by the result of the following code:
lines = factor(LETTERS)
lines
# [1] A B C D E F G H...
# Levels: A B C D E F G H...

linesNA = ifelse(lines %in% c('C', 'G', 'P'), NA, lines)
linesNA
# [1] 1 2 NA 4 5 6 NA 8...
The factor has been converted to its underlying integer codes and there is no trace of the level names. Even forcing the result back into a factor does not recover them: the codes simply become the new levels. Newbie frustration guaranteed!
linesNA = factor(ifelse(lines %in% c('C', 'G', 'P'), NA, lines))
linesNA
# [1] 1 2 <NA> 4 5 6 <NA> 8...
# Levels: 1 2 4 5 6 8...
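One way to see where those numbers come from is to inspect the factor itself. The sketch below uses only base R (`typeof`, `as.integer`, `levels`, `class`) on the same `lines` factor; the names `codes`, `labels` and `res` are illustrative and not part of the student's code.

```r
lines <- factor(LETTERS)

# A factor is stored as integer codes plus a character vector of levels
typeof(lines)                # "integer"
codes  <- as.integer(lines)  # 1, 2, 3, ...
labels <- levels(lines)      # "A", "B", "C", ...

# ifelse() builds its result from the test vector and does not keep the
# factor attributes, so only the integer codes survive
res <- ifelse(lines %in% c('C', 'G', 'P'), NA, lines)
class(res)                   # "integer"

# The labels can still be recovered by indexing the levels with the codes
labels[res]
```

Indexing `labels` with an `NA` code returns `NA`, so the missing entries stay missing when the labels are recovered this way.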
Under the hood, factors are integer vectors (of class factor) with an associated character vector that describes the levels (see Patrick Burns’s R Inferno PDF for details). We can deal directly with the levels using this:
linesNA = lines
levels(linesNA)[levels(linesNA) %in% c('C', 'G', 'P')] = NA
linesNA
# [1] A B <NA> D E F <NA> H...
# Levels: A B D E F H...
We could operate directly on lines (without creating linesNA), which is there to maintain consistency with the previous code. Another way of doing the same would be:
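# Convert to character before ifelse() combines the values;
# converting afterwards would only turn the integer codes into strings
linesNA = factor(ifelse(lines %in% c('C', 'G', 'P'), NA, as.character(lines)))
linesNA
# [1] A B <NA> D E F <NA> H...
# Levels: A B D E F H...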
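As a sanity check, the two approaches can be compared side by side. This is a minimal sketch, assuming the character conversion is applied to `lines` before `ifelse()` combines the values; the names `drop`, `byLevels` and `byChar` are hypothetical and only used here.

```r
lines <- factor(LETTERS)
drop  <- c('C', 'G', 'P')   # levels to blank out (illustrative name)

# Approach 1: set the unwanted levels to NA
byLevels <- lines
levels(byLevels)[levels(byLevels) %in% drop] <- NA

# Approach 2: work on the labels as character, then rebuild the factor
byChar <- factor(ifelse(lines %in% drop, NA, as.character(lines)))

# Compare the remaining levels and the element-wise labels
identical(levels(byLevels), levels(byChar))
identical(as.character(byLevels), as.character(byChar))
```

The level-based version only touches the (usually short) vector of levels, which may matter for long factors; the character-based version is closer to what the student originally tried.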
I can believe that there are good reasons for the default behavior of operations on factors, but the results can drive people crazy (at least rhetorically speaking).