The X-Factors: Where 0 means 1
[This article was first published on eKonometrics, and kindly contributed to R-bloggers].
Hadley Wickham in a recent blog post mentioned that “Factors have a bad rap in R because they often turn up when you don’t want them.” I believe Factors are an even bigger concern. They not only turn up where you don’t want them, but they also turn things around when you don’t want them to.
Consider the following example, a data set with two variables, x and y. Here y is age in years and x is gender coded as a binary (0/1) variable, where 1 represents males.
I compute the means for the two variables as follows:
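As a minimal sketch of this step, assume a small made-up data frame named df. Its values are purely illustrative and are not the data behind the figures quoted in this post.

```r
# Hypothetical toy data: x is gender (1 = male, 0 = female), y is age in years.
df <- data.frame(
  x = c(0, 1, 1, 0, 1),
  y = c(30, 40, 50, 60, 70)
)

# Column means: mean(x) is the share of males, mean(y) is the average age.
means <- colMeans(df)
print(means)
```

Because x is coded 0/1, its mean can be read directly as a proportion.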
The average age is 43.6 years, and the mean of x, 0.454, indicates that 45.4% of the sample is male. So far so good.
Now let’s see what happens when I convert x into a factor variable using the following syntax:
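One plausible form of that syntax, sketched on a hypothetical toy data frame (the name df and its values are assumptions for illustration):

```r
# Hypothetical toy data: x is gender (1 = male, 0 = female), y is age in years.
df <- data.frame(
  x = c(0, 1, 1, 0, 1),
  y = c(30, 40, 50, 60, 70)
)

# Create a factor from x: level 0 is labelled "female", level 1 "male".
df$male <- factor(df$x, levels = c(0, 1), labels = c("female", "male"))

str(df)
```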
The above code adds a new variable male to the data set, and assigns labels female and male to the categories 0 and 1 respectively.
I compute the average age for males and females as follows:
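A group-wise mean like this can be sketched with tapply(). The data frame below is a made-up example, so the numbers it prints are not the post's results.

```r
# Hypothetical toy data with the factor variable already attached.
df <- data.frame(
  x = c(0, 1, 1, 0, 1),
  y = c(30, 40, 50, 60, 70)
)
df$male <- factor(df$x, levels = c(0, 1), labels = c("female", "male"))

# Mean of y (age) within each level of the factor.
age_by_gender <- tapply(df$y, df$male, mean)
print(age_by_gender)
```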
See what happens when I try to compute the mean of the variable male.
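A short sketch on hypothetical toy data shows the behavior: mean() refuses to average a factor, issues a warning, and returns NA.

```r
# Hypothetical toy data with the factor variable attached.
df <- data.frame(
  x = c(0, 1, 1, 0, 1),
  y = c(30, 40, 50, 60, 70)
)
df$male <- factor(df$x, levels = c(0, 1), labels = c("female", "male"))

# mean() on a factor warns that the argument is not numeric or logical
# and returns NA instead of a number.
result <- mean(df$male)
print(result)
```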
Once you convert a variable to a factor, you can't compute statistics such as the mean or standard deviation on it directly. To do so, you need to convert the factor to a numeric variable. I create a new variable, gender, that converts the male variable to a numeric one.
I recompute the means below.
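The conversion and the recomputed means might look like the following sketch. It assumes the conversion is done with as.numeric() and uses made-up data, so only the pattern matters, not the numbers.

```r
# Hypothetical toy data with the factor variable attached.
df <- data.frame(
  x = c(0, 1, 1, 0, 1),
  y = c(30, 40, 50, 60, 70)
)
df$male <- factor(df$x, levels = c(0, 1), labels = c("female", "male"))

# as.numeric() on a factor returns its internal integer codes (1, 2, ...),
# not the original 0/1 values.
df$gender <- as.numeric(df$male)

means <- colMeans(df[, c("x", "y", "gender")])
print(means)
```

In this toy example the mean of gender is exactly one more than the mean of x.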
Note that the mean of gender is 1.45 and not 0.45. Why? A factor stores its levels internally as integer codes starting at 1, so converting the factor to numeric turned the zeros into ones and the ones into twos. Let's look at the data set below:
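The shift is easy to see side by side. The sketch below uses hypothetical toy data, and the column gender01 is an illustrative name for one possible way to get the 0/1 coding back.

```r
# Hypothetical toy data with the factor and its numeric conversion.
df <- data.frame(
  x = c(0, 1, 1, 0, 1),
  y = c(30, 40, 50, 60, 70)
)
df$male <- factor(df$x, levels = c(0, 1), labels = c("female", "male"))
df$gender <- as.numeric(df$male)

# Compare the original coding, the factor labels, and the converted codes.
print(df[, c("x", "male", "gender")])

# One way to recover a 0/1 variable: test the label explicitly.
df$gender01 <- as.integer(df$male == "male")
print(mean(df$gender01))
```

Subtracting 1 from the integer codes would also work for a two-level factor, but comparing against the label makes the intent explicit.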
Several functions in R expect a binary numeric variable to be coded 0/1. If this condition is not satisfied, the call returns an error. For instance, when I try to estimate a logit model with gender (now coded 1/2) as the dependent variable and y as the explanatory variable, R generates the following error:
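A sketch of such a call on hypothetical toy data, assuming the model is fitted with glm() and a binomial family. The error is caught with tryCatch() so that its message can be printed.

```r
# Hypothetical toy data; gender holds the factor's integer codes (1/2).
df <- data.frame(
  x = c(0, 1, 1, 0, 1),
  y = c(30, 40, 50, 60, 70)
)
df$male <- factor(df$x, levels = c(0, 1), labels = c("female", "male"))
df$gender <- as.numeric(df$male)

# A binomial glm requires a numeric response within [0, 1];
# a response coded 1/2 triggers an error, which is captured here.
msg <- tryCatch(
  {
    glm(gender ~ y, data = df, family = binomial(link = "logit"))
    "no error"
  },
  error = function(e) conditionMessage(e)
)
print(msg)
```

Note that glm() does accept a factor response with a binomial family, treating the first level as failure, so passing the factor male itself would be one way around the problem.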