Quickly Create Dummy Variables in a Data Frame

[This article was first published on randyzwitch.com » R, and kindly contributed to R-bloggers.]
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

On Quora, someone asked how to fix the error that the randomForest package in R raises when a categorical variable has more than 32 levels. I've seen the same question on the Kaggle forums, StackOverflow and elsewhere, so here's the answer: code your own dummy variables instead of relying on factors!

Code snippet
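A minimal sketch of the idea, not necessarily the author's original code: it assumes a hypothetical data frame `df` with a categorical `state` column and builds one 1/0 column per distinct value.

```r
# Hypothetical example data: `df` and its `state` column are illustrative names.
df <- data.frame(
  id    = 1:6,
  state = c("PA", "NJ", "PA", "NY", "NJ", "PA"),
  stringsAsFactors = FALSE
)

# One 1/0 column per distinct value of the categorical variable.
for (level in sort(unique(df$state))) {
  df[[paste0("state_", level)]] <- as.integer(df$state == level)
}

# Drop the original column once the dummies exist, so the model
# never sees a high-cardinality factor.
df$state <- NULL

print(df)
```

Base R's `model.matrix(~ state - 1, data = df)` produces a similar set of indicator columns in one call, if a matrix rather than a data frame is acceptable.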

As the code above shows, it's trivial to generate your own 1/0 columns of data instead of relying on factors. There are two things to keep in mind when creating your own dummy variables:

  1. The problem you are trying to solve
  2. How much RAM you have available

While it may make sense to generate dummy variables for Customer State (~50 for the United States), if you were to use the code above on City Name, you’d likely either run out of RAM or find out that there are too many levels to be useful.
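One way to gauge this before expanding anything is to count the levels and estimate the size of the resulting columns. The sketch below assumes a hypothetical `city` vector and integer 1/0 columns (4 bytes per value in R); the names are illustrative only.

```r
# Hypothetical inputs: a categorical vector and the number of rows in the data.
city   <- c("Philadelphia", "Newark", "Albany", "Philadelphia", "Trenton")
n_rows <- length(city)

# Number of dummy columns that would be created.
n_levels <- length(unique(city))

# Approximate size of the dense 1/0 columns: integers take 4 bytes each in R.
approx_bytes <- n_rows * n_levels * 4

cat("levels:", n_levels, "\n")
cat("approx. MB for dummy columns:", approx_bytes / 1024^2, "\n")
```

By the same arithmetic, one million rows with 10,000 distinct cities would need roughly 40 GB for the dummy columns alone.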

Of course, with any qualitative statement such as “too many levels to be useful”, often the only way to know for sure is to try it! Save your work before running this code in case you run out of RAM. Or use someone else’s computer for testing.

Edit 1/2/14: John Myles White brought up a good point via Twitter about RAM usage:

@randyzwitch If you're running out of RAM with dummy variables, you probably want to use a sparse matrix instead of a data.frame.

— John Myles White (@johnmyleswhite) January 2, 2014
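A rough sketch of that suggestion, assuming the Matrix package and a hypothetical `state` factor: `sparse.model.matrix()` builds the same indicator columns but stores only the non-zero entries. Whether a particular modeling function accepts a sparse matrix depends on the package, so check its documentation first.

```r
library(Matrix)  # recommended package shipped with standard R installations

# Hypothetical example data; `state` is a factor for the formula interface.
df <- data.frame(
  state = factor(c("PA", "NJ", "PA", "NY", "NJ", "PA"))
)

# `~ state - 1` drops the intercept so every level gets its own 1/0 column.
dummies <- sparse.model.matrix(~ state - 1, data = df)

print(dim(dummies))    # rows x number of levels
print(class(dummies))  # a sparse matrix class from the Matrix package
```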

Quickly Create Dummy Variables in a Data Frame is an article from randyzwitch.com, a blog dedicated to helping newcomers to Digital Analytics & Data Science. If you liked this post, please visit randyzwitch.com to read more. Or better yet, tell a friend: the best compliment is to share with others!
