R Tip: Use stringsAsFactors = FALSE
R tip: use stringsAsFactors = FALSE.
R often re-encodes strings as factors: categorical values stored as integer codes that index a fixed set of levels. This conversion can happen too early and too aggressively. Sometimes a string is just a string.
Sigmund Freud, it is often claimed, said: “Sometimes a cigar is just a cigar.”
To avoid problems, delay the re-encoding of strings by using stringsAsFactors = FALSE when creating data.frames.
Example (in R versions before 4.0.0, where data.frame() defaults to stringsAsFactors = TRUE):
d <- data.frame(label = rep("tbd", 5))
d$label[[2]] <- "north"
#> Warning in `[[<-.factor`(`*tmp*`, 2, value = structure(c(1L, NA, 1L, 1L, :
#> invalid factor level, NA generated
print(d)
#>   label
#> 1   tbd
#> 2  <NA>
#> 3   tbd
#> 4   tbd
#> 5   tbd
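Notice our new value was not copied in! The column was converted to a factor whose only level is "tbd", so assigning the unknown value "north" produces NA.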
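To see what happened to the column, it can help to inspect it directly. The following is a minimal sketch and not part of the original example; it passes stringsAsFactors = TRUE explicitly, an assumption made here so that the behavior does not depend on the default of the installed R version.

```r
# Assumption: request factor conversion explicitly so this runs the same way
# regardless of the default in the installed R version.
d <- data.frame(label = rep("tbd", 5), stringsAsFactors = TRUE)

label_class  <- class(d$label)       # the column is a factor, not character
label_levels <- levels(d$label)      # the complete set of permitted values
label_codes  <- as.integer(d$label)  # integer codes stored underneath

print(label_class)
print(label_levels)
print(label_codes)

# "north" is not among the levels, so R warns and stores NA instead.
d$label[[2]] <- "north"
print(is.na(d$label[[2]]))
```

The level set is fixed at the moment the factor is created, which is why a later assignment of a new string cannot succeed without first extending the levels.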
The fix is easy: use stringsAsFactors = FALSE.
d <- data.frame(label = rep("tbd", 5),
                stringsAsFactors = FALSE)
d$label[[2]] <- "north"
print(d)
#>   label
#> 1   tbd
#> 2 north
#> 3   tbd
#> 4   tbd
#> 5   tbd
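The same "delay the re-encoding" idea can also be applied to a data.frame that already contains a factor column. This is a hedged sketch rather than something from the original post: it assumes the column was built with stringsAsFactors = TRUE, converts it back to character with as.character(), edits it, and only then re-encodes it with factor().

```r
# Assumption: a data.frame that already has a factor column, for example one
# built with stringsAsFactors = TRUE.
d <- data.frame(label = rep("tbd", 5), stringsAsFactors = TRUE)

# Convert back to plain strings before editing values.
d$label <- as.character(d$label)
d$label[[2]] <- "north"

# Re-encode only once the set of values is final.
d$label <- factor(d$label)
print(levels(d$label))
print(d)
```

Converting late means the factor's levels are derived from the finished data, so no value is lost along the way.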
As is often the case: base R works okay in default mode and works very well if you judiciously change a few defaults. There is much less need to whole-hog replace R functionality than some claim.
Note: the above pattern of pre-building a data.frame and filling in values by addressing row/column index sets is a very effective (and underappreciated) way to build up data (often easier and quicker than binding rows or columns).
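The pre-build-and-fill pattern might look like the sketch below. The column names id and score, the value "south", and the numbers are hypothetical and chosen only for illustration; the point is that the frame is allocated once at its final size and then filled by index.

```r
# Hypothetical example data: allocate the full frame up front with placeholders.
n <- 5
d <- data.frame(
  id    = seq_len(n),
  label = rep("tbd", n),
  score = rep(NA_real_, n),
  stringsAsFactors = FALSE
)

# Fill individual cells by row/column index.
d$label[[2]] <- "north"
d[4, "label"] <- "south"

# Fill whole sets of rows at once using logical indexes.
d$score[d$label == "tbd"] <- 0
d[d$label != "tbd", "score"] <- c(1.5, 2.5)

print(d)
```

Because label stays a character column, new values such as "south" can be written without declaring them first.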