In R, the model.matrix function is used to create the design matrix for regression. In particular, it expands factor variables into dummy variables (also known as “one-hot encoding”).
Let’s see this in action on the iris dataset:
data(iris)
str(iris)
# 'data.frame': 150 obs. of 5 variables:
#  $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
#  $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
#  $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
#  $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
#  $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...

x <- model.matrix(Sepal.Length ~ Species, iris)
head(x)
#   (Intercept) Speciesversicolor Speciesvirginica
# 1           1                 0                0
# 2           1                 0                0
# 3           1                 0                0
# 4           1                 0                0
# 5           1                 0                0
# 6           1                 0                0
model.matrix returns a column of ones labeled (Intercept) by default. Also note that while the Species factor has 3 levels (“setosa”, “versicolor” and “virginica”), the return value of model.matrix only has dummy variables for the latter two levels. For a factor variable, under the default treatment contrasts, model.matrix treats the first level of the factor (the first element of levels(), which is in alphabetical order unless set otherwise) as the “baseline” level and does not produce a dummy variable for it. This avoids perfect multicollinearity: with an intercept present, the dummy columns for all levels would sum to the column of ones.
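To make the role of the baseline level concrete, here is a small sketch showing that the baseline follows the order of levels() rather than the order of the rows, and that it can be changed with relevel. The names iris_rev and iris2 are illustrative copies introduced here, not part of the original example.

```r
# The baseline is the first element of levels(), whatever the row order is.
levels(iris$Species)
# [1] "setosa"     "versicolor" "virginica"

# Reverse the rows so that "virginica" is the first value in the data.
iris_rev <- iris[nrow(iris):1, ]
colnames(model.matrix(Sepal.Length ~ Species, iris_rev))
# [1] "(Intercept)"       "Speciesversicolor" "Speciesvirginica"

# Make "virginica" the baseline instead by moving it to the front of levels().
iris2 <- iris
iris2$Species <- relevel(iris2$Species, ref = "virginica")
colnames(model.matrix(Sepal.Length ~ Species, iris2))
# [1] "(Intercept)"       "Speciessetosa"     "Speciesversicolor"
```

Reversing the rows leaves the columns unchanged, while relevel drops the dummy variable for “virginica” instead of “setosa”.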
However, there are situations where we might want dummy variables to be produced for all levels, including the baseline level. (For example, in regularized regression, multicollinearity no longer implies that the model is unidentifiable.) We can induce this behavior by passing a specific value to the contrasts.arg argument:
x <- model.matrix(
  Sepal.Length ~ Species,
  data = iris,
  contrasts.arg = list(Species = contrasts(iris$Species, contrasts = FALSE)))
head(x)
#   (Intercept) Speciessetosa Speciesversicolor Speciesvirginica
# 1           1             1                 0                0
# 2           1             1                 0                0
# 3           1             1                 0                0
# 4           1             1                 0                0
# 5           1             1                 0                0
# 6           1             1                 0                0
Let’s have a closer look at what we passed as the value of Species in the list:
contrasts(iris$Species, contrasts = FALSE)
#            setosa versicolor virginica
# setosa          1          0         0
# versicolor      0          1         0
# virginica       0          0         1
Notice that there are 3 columns, one for each level. If we didn’t pass this special value in, the default would have had just 2 columns, one for each of the levels we see in the output:
contrasts(iris$Species, contrasts = TRUE)
#            versicolor virginica
# setosa              0         0
# versicolor          1         0
# virginica           0         1
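As a quick check on the collinearity point, the sketch below compares the number of columns with the numerical rank of the full-dummy matrix. It also shows a common alternative for a single factor, which is to drop the intercept. The names x_full and x_noint are introduced here for illustration only.

```r
# Full set of dummy variables plus an intercept.
x_full <- model.matrix(
  Sepal.Length ~ Species,
  data = iris,
  contrasts.arg = list(Species = contrasts(iris$Species, contrasts = FALSE)))

ncol(x_full)      # 4 columns: intercept plus three dummies
qr(x_full)$rank   # rank 3: one column is redundant

# The three dummy columns add up to the intercept column.
all(rowSums(x_full[, -1]) == x_full[, "(Intercept)"])

# Without an intercept, a single factor is fully expanded by default.
x_noint <- model.matrix(Sepal.Length ~ 0 + Species, data = iris)
colnames(x_noint)
qr(x_noint)$rank  # rank 3: full column rank
```

Note that dropping the intercept only fully expands the first factor in the formula. With several factors, the remaining ones still lose their baseline level, so the contrasts.arg approach is the more general one.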
It’s easy to modify the code above to include the baseline level for a different factor variable in another data frame. The code below is an example of how you can include the baseline level for all factor variables in the data frame.
df <- data.frame(x = factor(rep(c("a", "b", "c"), times = 3)),
                 y = factor(rep(c("d", "e", "f"), times = 3)),
                 z = 1:9)
str(df)
# 'data.frame': 9 obs. of 3 variables:
#  $ x: Factor w/ 3 levels "a","b","c": 1 2 3 1 2 3 1 2 3
#  $ y: Factor w/ 3 levels "d","e","f": 1 2 3 1 2 3 1 2 3
#  $ z: int 1 2 3 4 5 6 7 8 9

# default: no dummy variable for baseline level
x <- model.matrix(~ ., data = df)
head(x)
#   (Intercept) xb xc ye yf z
# 1           1  0  0  0  0 1
# 2           1  1  0  1  0 2
# 3           1  0  1  0  1 3
# 4           1  0  0  0  0 4
# 5           1  1  0  1  0 5
# 6           1  0  1  0  1 6

# dummy variables for baseline levels included
x <- model.matrix(
  ~ .,
  data = df,
  contrasts.arg = lapply(df[, sapply(df, is.factor), drop = FALSE],
                         contrasts, contrasts = FALSE))
head(x)
#   (Intercept) xa xb xc yd ye yf z
# 1           1  1  0  0  1  0  0 1
# 2           1  0  1  0  0  1  0 2
# 3           1  0  0  1  0  0  1 3
# 4           1  1  0  0  1  0  0 4
# 5           1  0  1  0  0  1  0 5
# 6           1  0  0  1  0  0  1 6
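If this pattern is needed in several places, it could be wrapped in a small helper. The function below is a hypothetical sketch (model_matrix_all_levels is not part of base R or any package) that assumes every factor column of the data frame appears in the formula. It falls back to the default call when there are no factors, so that an empty list is never passed to contrasts.arg.

```r
# Hypothetical helper: build a model matrix in which every factor column
# of `data` gets a dummy variable for each of its levels.
model_matrix_all_levels <- function(formula, data) {
  is_fac <- vapply(data, is.factor, logical(1))
  if (!any(is_fac)) {
    # No factors: nothing to expand, so use the default behavior.
    return(model.matrix(formula, data = data))
  }
  # One identity-like contrast matrix per factor (all levels kept).
  full_contrasts <- lapply(data[, is_fac, drop = FALSE],
                           contrasts, contrasts = FALSE)
  model.matrix(formula, data = data, contrasts.arg = full_contrasts)
}

df <- data.frame(x = factor(rep(c("a", "b", "c"), times = 3)),
                 y = factor(rep(c("d", "e", "f"), times = 3)),
                 z = 1:9)

# Same columns as the explicit call: (Intercept), xa, xb, xc, yd, ye, yf, z
x_all <- model_matrix_all_levels(~ ., df)
head(x_all, 3)
```

One limitation of this sketch: if the formula uses only some of the factor columns, model.matrix may warn about contrasts supplied for variables that are absent from the model, so the list would need to be restricted to the variables actually used.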
References:
- StackOverflow. All Levels of a Factor in a Model Matrix in R.