Random Forests in R
Introduction
As the name suggests, a random forest model is essentially an ensemble of decision tree models, with each decision tree predicting the same response variable. The response may be categorical, in which case we have a classification problem, or continuous / numerical, in which case we have a regression problem.
In this short tutorial, we will go through the use of tree-based methods (decision tree, bagging model, and random forest) for both classification and regression problems.
This tutorial is divided into two sections. We will first use tree-based methods for classification on the spam dataset from the kernlab package. Subsequently, we will apply these methods to a regression problem, using the imports85 dataset from the randomForest package.
Tree-based methods for classification
Preparation
Let’s start by loading the spam dataset and doing some preparations:
# packages that we will need:
# @ kernlab: for the spam dataset
# @ tree: for decision tree construction
# @ randomForest: for bagging and RF
# @ beepr: for a little beep
# @ pROC: for plotting of ROC

# code snippet to install and load multiple packages at once
# (note: require() needs character.only = TRUE when the
# package name is passed as a string)
# pkgs <- c("kernlab", "tree", "randomForest", "beepr", "pROC")
# sapply(pkgs, FUN = function(p) {
#   print(p)
#   if (!require(p, character.only = TRUE)) install.packages(p)
#   require(p, character.only = TRUE)
# })

# load required packages
suppressWarnings(library(kernlab))
suppressWarnings(library(tree))
suppressWarnings(library(randomForest))
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
suppressWarnings(library(beepr)) # try it! beep()
suppressWarnings(library(pROC))
## Type 'citation("pROC")' for a citation.
## 
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
## 
##     cov, smooth, var

# load dataset
data(spam)
# take a look
str(spam)
## 'data.frame': 4601 obs. of 58 variables:
##  $ make             : num 0 0.21 0.06 0 0 0 0 0 0.15 0.06 ...
##  $ address          : num 0.64 0.28 0 0 0 0 0 0 0 0.12 ...
##  $ all              : num 0.64 0.5 0.71 0 0 0 0 0 0.46 0.77 ...
##  $ num3d            : num 0 0 0 0 0 0 0 0 0 0 ...
##  $ our              : num 0.32 0.14 1.23 0.63 0.63 1.85 1.92 1.88 0.61 0.19 ...
##  $ over             : num 0 0.28 0.19 0 0 0 0 0 0 0.32 ...
##  $ remove           : num 0 0.21 0.19 0.31 0.31 0 0 0 0.3 0.38 ...
##  $ internet         : num 0 0.07 0.12 0.63 0.63 1.85 0 1.88 0 0 ...
##  $ order            : num 0 0 0.64 0.31 0.31 0 0 0 0.92 0.06 ...
##  $ mail             : num 0 0.94 0.25 0.63 0.63 0 0.64 0 0.76 0 ...
##  $ receive          : num 0 0.21 0.38 0.31 0.31 0 0.96 0 0.76 0 ...
##  $ will             : num 0.64 0.79 0.45 0.31 0.31 0 1.28 0 0.92 0.64 ...
##  $ people           : num 0 0.65 0.12 0.31 0.31 0 0 0 0 0.25 ...
##  $ report           : num 0 0.21 0 0 0 0 0 0 0 0 ...
##  $ addresses        : num 0 0.14 1.75 0 0 0 0 0 0 0.12 ...
##  $ free             : num 0.32 0.14 0.06 0.31 0.31 0 0.96 0 0 0 ...
##  $ business         : num 0 0.07 0.06 0 0 0 0 0 0 0 ...
##  $ email            : num 1.29 0.28 1.03 0 0 0 0.32 0 0.15 0.12 ...
##  $ you              : num 1.93 3.47 1.36 3.18 3.18 0 3.85 0 1.23 1.67 ...
##  $ credit           : num 0 0 0.32 0 0 0 0 0 3.53 0.06 ...
##  $ your             : num 0.96 1.59 0.51 0.31 0.31 0 0.64 0 2 0.71 ...
##  $ font             : num 0 0 0 0 0 0 0 0 0 0 ...
##  $ num000           : num 0 0.43 1.16 0 0 0 0 0 0 0.19 ...
##  $ money            : num 0 0.43 0.06 0 0 0 0 0 0.15 0 ...
##  $ hp               : num 0 0 0 0 0 0 0 0 0 0 ...
##  $ hpl              : num 0 0 0 0 0 0 0 0 0 0 ...
##  $ george           : num 0 0 0 0 0 0 0 0 0 0 ...
##  $ num650           : num 0 0 0 0 0 0 0 0 0 0 ...
##  $ lab              : num 0 0 0 0 0 0 0 0 0 0 ...
##  $ labs             : num 0 0 0 0 0 0 0 0 0 0 ...
##  $ telnet           : num 0 0 0 0 0 0 0 0 0 0 ...
##  $ num857           : num 0 0 0 0 0 0 0 0 0 0 ...
##  $ data             : num 0 0 0 0 0 0 0 0 0.15 0 ...
##  $ num415           : num 0 0 0 0 0 0 0 0 0 0 ...
##  $ num85            : num 0 0 0 0 0 0 0 0 0 0 ...
##  $ technology       : num 0 0 0 0 0 0 0 0 0 0 ...
##  $ num1999          : num 0 0.07 0 0 0 0 0 0 0 0 ...
##  $ parts            : num 0 0 0 0 0 0 0 0 0 0 ...
##  $ pm               : num 0 0 0 0 0 0 0 0 0 0 ...
##  $ direct           : num 0 0 0.06 0 0 0 0 0 0 0 ...
##  $ cs               : num 0 0 0 0 0 0 0 0 0 0 ...
##  $ meeting          : num 0 0 0 0 0 0 0 0 0 0 ...
##  $ original         : num 0 0 0.12 0 0 0 0 0 0.3 0 ...
##  $ project          : num 0 0 0 0 0 0 0 0 0 0.06 ...
##  $ re               : num 0 0 0.06 0 0 0 0 0 0 0 ...
##  $ edu              : num 0 0 0.06 0 0 0 0 0 0 0 ...
##  $ table            : num 0 0 0 0 0 0 0 0 0 0 ...
##  $ conference       : num 0 0 0 0 0 0 0 0 0 0 ...
##  $ charSemicolon    : num 0 0 0.01 0 0 0 0 0 0 0.04 ...
##  $ charRoundbracket : num 0 0.132 0.143 0.137 0.135 0.223 0.054 0.206 0.271 0.03 ...
##  $ charSquarebracket: num 0 0 0 0 0 0 0 0 0 0 ...
##  $ charExclamation  : num 0.778 0.372 0.276 0.137 0.135 0 0.164 0 0.181 0.244 ...
##  $ charDollar       : num 0 0.18 0.184 0 0 0 0.054 0 0.203 0.081 ...
##  $ charHash         : num 0 0.048 0.01 0 0 0 0 0 0.022 0 ...
##  $ capitalAve       : num 3.76 5.11 9.82 3.54 3.54 ...
##  $ capitalLong      : num 61 101 485 40 40 15 4 11 445 43 ...
##  $ capitalTotal     : num 278 1028 2259 191 191 ...
##  $ type             : Factor w/ 2 levels "nonspam","spam": 2 2 2 2 2 2 2 2 2 2 ...
In this example, we will attempt to predict whether an email is spam or nonspam. To do so, we will construct models on one subset of the data (the training data), and evaluate the constructed models on a separate subset (the testing data). This holdout split is a simple form of cross validation.
# preparation for cross validation:
# split the dataset into 2 halves,
# 2300 samples for training and 2301 for testing
num.samples <- nrow(spam)              # 4,601
num.train   <- round(num.samples / 2)  # 2,300
num.test    <- num.samples - num.train # 2,301
num.var     <- ncol(spam)              # 58

# set up the indices
set.seed(150715)
idx       <- sample(1:num.samples)
train.idx <- idx[seq(num.train)]
test.idx  <- setdiff(idx, train.idx)

# subset the data
spam.train <- spam[train.idx, ]
spam.test  <- spam[test.idx, ]
Taking a quick glance at the type variable:
table(spam.train$type)
## 
## nonspam    spam 
##    1397     903
table(spam.test$type)
## 
## nonspam    spam 
##    1391     910
Decision tree
Now that we are done with the preparation, let’s start by constructing a decision tree model, using the tree package:
tree.mod <- tree(type ~ ., data = spam.train)
Here’s what our model looks like:
plot(tree.mod)
title("Decision tree")
text(tree.mod, cex = 0.75)
The model may be overly complicated. Typically, after constructing a decision tree model, we may want to prune it, collapsing certain edges, nodes and leaves together without much loss of performance. This is done by comparing subtrees of different sizes (numbers of leaf nodes) on their estimated performance, obtained by k-fold cross validation within the training set.
cv.prune <- cv.tree(tree.mod, FUN = prune.misclass)
plot(cv.prune$size, cv.prune$dev,
     pch = 20, col = "red", type = "b",
     main = "Decision tree: Cross validation to find optimal size of tree",
     xlab = "Size of tree", ylab = "Misclassifications")
A tree with 9 leaf nodes looks like a good choice (maximising performance while minimising complexity).
best.tree.size <- 9
# pruning (cost-complexity pruning)
pruned.tree.mod <- prune.misclass(tree.mod, best = best.tree.size)
# here's the new tree model
plot(pruned.tree.mod)
title(paste("Pruned decision tree (", best.tree.size, " leaf nodes)", sep = ""))
text(pruned.tree.mod, cex = 0.75)
Now with our new model, let’s make some predictions on the testing data.
tree.pred <- predict(pruned.tree.mod, subset(spam.test, select = -type),
                     type = "class")
# confusion matrix
# rows are the predicted classes
# columns are the actual classes
print(tree.pred.results <- table(tree.pred, spam.test$type))
## 
## tree.pred nonspam spam
##   nonspam    1308  164
##   spam         83  746
# What is the accuracy of our tree model?
print(tree.accuracy <- (tree.pred.results[1,1] + tree.pred.results[2,2]) /
        sum(tree.pred.results))
## [1] 0.8926554
Our decision tree model is able to predict spam vs. nonspam emails with about 89.27% accuracy. We will make comparisons of accuracies with other models later.
Bagging
Next, we turn our attention to the bagging model. Recall that bagging, a.k.a. bootstrap aggregating, is the process of repeatedly drawing samples (with replacement) from the training data. Each of these bootstrap samples is known as a bag; we construct an individual decision tree model on each bag and, to make a classification prediction, take the majority vote across the ensemble of decision trees.
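To make the bootstrap-and-vote mechanics concrete, here is a minimal hand-rolled sketch of bagging built on the tree package. This is for intuition only; the names (n.bags, bags, votes, bag.by.hand) are our own, and the randomForest call below is what we actually use:

n.bags <- 25
# draw one bootstrap sample ("bag") per tree, and grow a tree on each bag
bags <- lapply(seq(n.bags), function(b) {
  boot.idx <- sample(nrow(spam.train), replace = TRUE)
  tree(type ~ ., data = spam.train[boot.idx, ])
})
# collect each tree's class prediction for the test set...
votes <- sapply(bags, function(t)
  as.character(predict(t, newdata = spam.test, type = "class")))
# ...and classify by majority vote across the ensemble
bag.by.hand <- apply(votes, 1, function(v) names(which.max(table(v))))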
bg.mod <- randomForest(type ~ ., data = spam.train,
                       mtry = num.var - 1, # try all variables at each split, except the response variable
                       ntree = 300,
                       proximity = TRUE,
                       importance = TRUE)
In the bagging model, and also in the random forest model, there are often only two hyperparameters that we are interested in: mtry, the number of candidate variables considered at each split of each tree, and ntree, the number of trees in the ensemble. Tuning the number of trees is relatively easy: we can look at the out-of-bag (OOB) error estimate of the ensemble as trees are added. For more details, refer to the slides. We set proximity = TRUE and importance = TRUE in order to get some form of visualization of the model and the variable importances, respectively.
plot(bg.mod$err.rate[,1], type = "l", lwd = 3, col = "blue",
     main = "Bagging: OOB estimate of error rate",
     xlab = "Number of Trees", ylab = "OOB error rate")
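As a quick sanity check on what this curve shows (our own addition, not part of the original workflow): for classification, bg.mod$predicted contains each training sample’s OOB prediction, i.e. the majority vote over only those trees whose bags did not include that sample. The last point of the curve can therefore be recomputed directly:

# OOB error recomputed from the OOB predictions;
# this matches the final value of bg.mod$err.rate[,1]
mean(bg.mod$predicted != spam.train$type)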
Here, 300 trees seems more than sufficient. One advantage of bagging and random forest models is that they provide a way of doing feature or variable selection, by considering the importance of each variable in the model. For exact details on how these importance measures are defined, refer to the slides.
varImpPlot(bg.mod, main = "Bagging: Variable importance")
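If the plot alone is not enough, the raw numbers behind varImpPlot() can be extracted with importance(). As a small illustration (a snippet of our own), here are the top five variables by mean decrease in accuracy:

# type = 1 requests the permutation-based mean decrease in accuracy
imp.mat <- importance(bg.mod, type = 1)
head(imp.mat[order(imp.mat, decreasing = TRUE), , drop = FALSE], 5)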
In addition, we can visualize the classification done by the model using a multidimensional plot on the proximity matrix. The green samples in the figure represent nonspams, while the red samples are spams.
MDSplot(bg.mod, fac = spam.train$type, palette = c("green","red"), main = "Bagging: MDS")
Finally, let’s make some predictions on the testing data:
bg.pred <- predict(bg.mod, subset(spam.test, select = -type), type = "class")
# confusion matrix
# rows are the predicted classes
# columns are the actual classes
print(bg.pred.results <- table(bg.pred, spam.test$type))
## 
## bg.pred   nonspam spam
##   nonspam    1336   87
##   spam         55  823
# what is the accuracy of our bagging model?
print(bg.accuracy <- sum(diag(bg.pred.results)) / sum(bg.pred.results))
## [1] 0.9382877
Our bagging model predicts whether an email is spam or not with about 93.83% accuracy.
Random forest
The only difference between the bagging model and the random forest model is that the latter chooses from only a random subset of the variables when splitting at each node of each tree. In other words, only the mtry argument differs between bagging and random forest.
rf.mod <- randomForest(type ~ ., data = spam.train,
                       mtry = floor(sqrt(num.var - 1)), # 7; only difference from bagging is here
                       ntree = 300,
                       proximity = TRUE,
                       importance = TRUE)

# Out-of-bag (OOB) error rate as a function of num. of trees:
plot(rf.mod$err.rate[,1], type = "l", lwd = 3, col = "blue",
     main = "Random forest: OOB estimate of error rate",
     xlab = "Number of Trees", ylab = "OOB error rate")
Besides tuning the ntree hyperparameter, we might also be interested in tuning the mtry hyperparameter in random forest. The random forest model may be built using the mtry value that minimises the OOB error.
tuneRF(subset(spam.train, select = -type), spam.train$type, ntreeTry = 100)
## mtry = 7  OOB error = 5.52% 
## Searching left ...
## mtry = 4  OOB error = 6.26% 
## -0.1338583 0.05 
## Searching right ...
## mtry = 14  OOB error = 5.83% 
## -0.05511811 0.05 
##        mtry   OOBError
## 4.OOB     4 0.06260870
## 7.OOB     7 0.05521739
## 14.OOB   14 0.05826087
title("Random forest: Tuning the mtry hyperparameter")
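To act on the tuning result, we can rebuild the forest with the mtry value that minimised the OOB error. This is a small convenience snippet of our own (tune.res, best.mtry and rf.tuned are hypothetical names); tuneRF returns a matrix with columns mtry and OOBError, and alternatively offers doBest = TRUE to return the refitted forest directly:

tune.res  <- tuneRF(subset(spam.train, select = -type), spam.train$type,
                    ntreeTry = 100, plot = FALSE)
best.mtry <- tune.res[which.min(tune.res[, "OOBError"]), "mtry"]
rf.tuned  <- randomForest(type ~ ., data = spam.train,
                          mtry = best.mtry, ntree = 300)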
# variable importance
varImpPlot(rf.mod, main = "Random forest: Variable importance")
# multidimensional scaling plot
# green samples are non-spam,
# red samples are spam
MDSplot(rf.mod, fac = spam.train$type, palette = c("green","red"),
        main = "Random forest: MDS")
# now, let's make some predictions
rf.pred <- predict(rf.mod, subset(spam.test, select = -type), type = "class")
# confusion matrix
print(rf.pred.results <- table(rf.pred, spam.test$type))
## 
## rf.pred   nonspam spam
##   nonspam    1353   82
##   spam         38  828
# Accuracy of our RF model:
print(rf.accuracy <- sum(diag(rf.pred.results)) / sum(rf.pred.results))
## [1] 0.9478488
Our random forest model predicts whether an email is spam or not with about 94.78% accuracy.
Visualization of performances
Let’s go ahead and compare the performances of our models. For comparison’s sake, let’s also construct a logistic regression model.
log.mod <- glm(type ~ ., data = spam.train, family = binomial(link = logit))
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

# predictions
log.pred.prob <- predict(log.mod, subset(spam.test, select = -type),
                         type = "response")
log.pred.class <- factor(sapply(log.pred.prob, FUN = function(x) {
  if (x >= 0.5) return("spam") else return("nonspam")
}))

# confusion matrix
log.pred.results <- table(log.pred.class, spam.test$type)
# Accuracy of logistic regression model:
print(log.accuracy <- sum(diag(log.pred.results)) / sum(log.pred.results))
## [1] 0.9135159
Now, let’s compare the performances, considering first the model accuracies.
barplot(c(tree.accuracy, bg.accuracy, rf.accuracy, log.accuracy),
        main = "Accuracies of various models",
        names.arg = c("Tree", "Bagging", "RF", "Logistic"))
We can see here that the ensemble models (bagging and random forest) outperform the single decision tree, as well as the logistic regression model. The bagging and random forest models have roughly the same classification performance here, with random forest slightly ahead. Understanding the rationale of random subspace sampling (refer to the slides) should allow us to appreciate the potential improvement of random forest over the bagging model.
Finally, let’s plot the ROC curves of the various models. An ROC curve is only defined for models that give probabilistic (or otherwise continuous) scores, so we compare the bagging, random forest, and logistic regression models here.
bg.pred.prob <- predict(bg.mod, subset(spam.test, select = -type), type = "prob")
rf.pred.prob <- predict(rf.mod, subset(spam.test, select = -type), type = "prob")
plot.roc(spam.test$type, bg.pred.prob[,1], col = "blue", lwd = 3,
         print.auc = TRUE, print.auc.y = 0.3,
         main = "ROC-AUC of various models")
## Setting levels: control = nonspam, case = spam
## Setting direction: controls > cases
plot.roc(spam.test$type, rf.pred.prob[,1], col = "green", lwd = 3,
         print.auc = TRUE, print.auc.y = 0.2, add = TRUE)
## Setting levels: control = nonspam, case = spam
## Setting direction: controls > cases
plot.roc(spam.test$type, log.pred.prob, col = "red", lwd = 3,
         print.auc = TRUE, print.auc.y = 0.1, add = TRUE)
## Setting levels: control = nonspam, case = spam
## Setting direction: controls < cases
legend(x = 0.6, y = 0.8,
       legend = c("Bagging", "Random forest", "Logistic regression"),
       col = c("blue", "green", "red"), lwd = 1)
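The AUC values printed on the plot can also be extracted numerically with pROC’s auc() function, which is convenient for tabulating results (a small addition of ours):

auc(spam.test$type, bg.pred.prob[,1])  # bagging
auc(spam.test$type, rf.pred.prob[,1])  # random forest
auc(spam.test$type, log.pred.prob)     # logistic regression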
Tree-based methods for regression
In this section, we will consider the use of tree-based methods for regression. The material that follows closely parallels the classification workflow above.
Preparation
library(tree)
library(randomForest)
data(imports85)
imp <- imports85

# The following data preprocessing steps on
# the imports85 dataset are suggested by
# the authors of the randomForest package
# look at
# > ?imports85
imp <- imp[,-2]  # Too many NAs in normalizedLosses.
imp <- imp[complete.cases(imp), ]
# Drop empty levels for factors
imp[] <- lapply(imp, function(x) if (is.factor(x)) x[, drop = TRUE] else x)

# Also removing the numOfCylinders and fuelSystem
# variables due to sparsity of data
# to see this, run the following lines:
# > table(imp$numOfCylinders)
# > table(imp$fuelSystem)
# This additional step is only necessary because we will be
# making comparisons between the tree-based models
# and linear regression, and linear regression cannot
# handle sparse data well
imp <- subset(imp, select = -c(numOfCylinders, fuelSystem))
# also removing the make variable
imp <- subset(imp, select = -make)

# Preparation for cross validation:
# split the dataset into 2 halves,
# 96 samples for training and 97 for testing
num.samples <- nrow(imp)               # 193
num.train   <- round(num.samples / 2)  # 96
num.test    <- num.samples - num.train # 97
num.var     <- ncol(imp)               # 22

# set up the indices
set.seed(150715)
idx       <- sample(1:num.samples)
train.idx <- idx[seq(num.train)]
test.idx  <- setdiff(idx, train.idx)

# subset the data
imp.train <- imp[train.idx, ]
imp.test  <- imp[test.idx, ]

str(imp.train)
## 'data.frame': 96 obs. of 22 variables:
##  $ symboling       : int 1 0 0 3 2 1 1 1 3 2 ...
##  $ fuelType        : Factor w/ 2 levels "diesel","gas": 2 2 2 2 1 2 2 2 2 2 ...
##  $ aspiration      : Factor w/ 2 levels "std","turbo": 2 1 2 1 1 1 1 1 2 2 ...
##  $ numOfDoors      : Factor w/ 2 levels "four","two": 1 1 1 2 2 2 1 2 2 1 ...
##  $ bodyStyle       : Factor w/ 5 levels "convertible",..: 4 4 4 3 4 4 4 3 3 4 ...
##  $ driveWheels     : Factor w/ 3 levels "4wd","fwd","rwd": 2 2 3 3 2 3 2 2 2 2 ...
##  $ engineLocation  : Factor w/ 2 levels "front","rear": 1 1 1 1 1 1 1 1 1 1 ...
##  $ wheelBase       : num 96.3 97.2 108 102.9 97.3 ...
##  $ length          : num 172 173 187 184 172 ...
##  $ width           : num 65.4 65.2 68.3 67.7 65.5 64 63.8 66.5 65.4 66.5 ...
##  $ height          : num 51.6 54.7 56 52 55.7 52.6 54.5 53.7 49.4 56.1 ...
##  $ curbWeight      : int 2403 2302 3130 2976 2261 2265 1971 2385 2370 2847 ...
##  $ engineType      : Factor w/ 5 levels "dohc","l","ohc",..: 3 3 2 1 3 1 3 3 3 1 ...
##  $ engineSize      : int 110 120 134 171 97 98 97 122 110 121 ...
##  $ bore            : num 3.17 3.33 3.61 3.27 3.01 3.24 3.15 3.39 3.17 3.54 ...
##  $ stroke          : num 3.46 3.47 3.21 3.35 3.4 3.08 3.29 3.39 3.46 3.07 ...
##  $ compressionRatio: num 7.5 8.5 7 9.3 23 9.4 9.4 8.6 7.5 9 ...
##  $ horsepower      : int 116 97 142 161 52 112 69 84 116 160 ...
##  $ peakRpm         : int 5500 5200 5600 5200 4800 6600 5200 4800 5500 5500 ...
##  $ cityMpg         : int 23 27 18 20 37 26 31 26 23 19 ...
##  $ highwayMpg      : int 30 34 24 24 46 29 37 32 30 26 ...
##  $ price           : int 9279 9549 18150 16558 7775 9298 7499 10595 9959 18620 ...
str(imp.test)
## 'data.frame': 97 obs. of 22 variables:
##  $ symboling       : int -1 1 -1 1 1 -2 0 0 2 2 ...
##  $ fuelType        : Factor w/ 2 levels "diesel","gas": 2 2 1 2 2 2 1 2 2 2 ...
##  $ aspiration      : Factor w/ 2 levels "std","turbo": 1 1 2 1 1 1 2 1 1 1 ...
##  $ numOfDoors      : Factor w/ 2 levels "four","two": 1 1 1 2 2 1 2 2 2 2 ...
##  $ bodyStyle       : Factor w/ 5 levels "convertible",..: 4 4 5 4 4 4 2 4 1 3 ...
##  $ driveWheels     : Factor w/ 3 levels "4wd","fwd","rwd": 3 3 3 3 2 3 3 3 3 2 ...
##  $ engineLocation  : Factor w/ 2 levels "front","rear": 1 1 1 1 1 1 1 1 1 1 ...
##  $ wheelBase       : num 115.6 103.5 110 94.5 94.5 ...
##  $ length          : num 203 189 191 169 165 ...
##  $ width           : num 71.7 66.9 70.3 64 63.8 67.2 70.3 70.6 65.6 64.4 ...
##  $ height          : num 56.5 55.7 58.7 52.6 54.5 56.2 54.9 47.8 53 50.8 ...
##  $ curbWeight      : int 3740 3055 3750 2169 1918 2935 3495 3950 2975 1944 ...
##  $ engineType      : Factor w/ 5 levels "dohc","l","ohc",..: 5 3 3 3 3 3 3 5 3 3 ...
##  $ engineSize      : int 234 164 183 98 97 141 183 326 146 92 ...
##  $ bore            : num 3.46 3.31 3.58 3.19 3.15 3.78 3.58 3.54 3.62 2.97 ...
##  $ stroke          : num 3.1 3.19 3.64 3.03 3.29 3.15 3.64 2.76 3.5 3.23 ...
##  $ compressionRatio: num 8.3 9 21.5 9 9.4 9.5 21.5 11.5 9.3 9.4 ...
##  $ horsepower      : int 155 121 123 70 69 114 123 262 116 68 ...
##  $ peakRpm         : int 4750 4250 4350 4800 5200 5400 4350 5000 4800 5500 ...
##  $ cityMpg         : int 16 20 22 29 31 24 22 13 24 31 ...
##  $ highwayMpg      : int 18 25 25 34 37 28 25 17 30 38 ...
##  $ price           : int 34184 24565 28248 8058 6649 15985 28176 36000 17669 6189 ...

# take a quick look
hist(imp.train$price)
hist(imp.test$price)
We will be predicting the price of imported automobiles in this example. While tree-based methods are invariant to monotone transformations of the predictor variables, the same is not true of the response variable. Since the price distribution is heavily right-skewed, let’s log-transform price here.
imp.train$price <- log(imp.train$price)
imp.test$price  <- log(imp.test$price)
# take a look again
hist(imp.train$price)
hist(imp.test$price)
Decision tree
With the preparation done, let’s begin with a decision tree.
# Construct decision tree model
tree.mod <- tree(price ~ ., data = imp.train)
# here's what the model looks like
plot(tree.mod)
title("Decision tree")
text(tree.mod, cex = 0.75)
# let's see if our decision tree requires pruning
cv.prune <- cv.tree(tree.mod, FUN = prune.tree)
plot(cv.prune$size, cv.prune$dev,
     pch = 20, col = "red", type = "b",
     main = "Cross validation to find optimal size of tree",
     xlab = "Size of tree", ylab = "Mean squared error")
# looks fine
# now let's make some predictions
tree.pred <- predict(tree.mod, subset(imp.test, select = -price),
                     type = "vector")
# Comparing our predictions with the test data:
plot(tree.pred, imp.test$price, main = "Decision tree: Actual vs. predicted")
abline(a = 0, b = 1)  # A prediction with zero error will lie on the y = x line
# What is the MSE of our model?
print(tree.mse <- mean((tree.pred - imp.test$price) ** 2))
## [1] 0.04716916
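Since we modelled log(price), this MSE is on the log scale. One rough way to interpret it (a back-of-envelope calculation of our own): the RMSE is sqrt(0.0472) ≈ 0.217, and exp(0.217) ≈ 1.24, so a typical prediction is off by a multiplicative factor of about 1.24, i.e. roughly 24% of the price.

# typical multiplicative error on the original price scale
# (about 1.24 here)
exp(sqrt(tree.mse))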
Our decision tree model predicts the (log) price of imported automobiles with a mean squared error of 0.0472. As in the previous section, we will make comparisons of model performances later.
Bagging
Next, bagging.
bg.mod <- randomForest(price ~ ., data = imp.train,
                       mtry = num.var - 1, # try all variables at each split, except the response variable
                       ntree = 300,
                       importance = TRUE)

# Out-of-bag (OOB) error rate as a function of num. of trees
# here, the error is the mean squared error,
# not classification error
plot(bg.mod$mse, type = "l", lwd = 3, col = "blue",
     main = "Bagging: OOB estimate of error rate",
     xlab = "Number of Trees", ylab = "OOB error rate")
# variable importance
varImpPlot(bg.mod, main = "Bagging: Variable importance")
# let's make some predictions
bg.pred <- predict(bg.mod, subset(imp.test, select = -price))
# Comparing our predictions with test data:
plot(bg.pred, imp.test$price, main = "Bagging: Actual vs. predicted")
abline(a = 0, b = 1)
# MSE of bagged model
print(bg.mse <- mean((bg.pred - imp.test$price) ** 2))
## [1] 0.03004431
Our bagging model predicts the (log) price of imported automobiles with a mean squared error of 0.0300.
Random forest
Finally, the random forest model.
rf.mod <- randomForest(price ~ ., data = imp.train,
                       mtry = floor((num.var - 1) / 3), # 7; only difference from bagging is here
                       ntree = 300,
                       importance = TRUE)

# Out-of-bag (OOB) error rate as a function of num. of trees
plot(rf.mod$mse, type = "l", lwd = 3, col = "blue",
     main = "Random forest: OOB estimate of error rate",
     xlab = "Number of Trees", ylab = "OOB error rate")
# tuning the mtry hyperparameter:
# model may be rebuilt if desired
tuneRF(subset(imp.train, select = -price), imp.train$price, ntreeTry = 100)
## mtry = 7  OOB error = 0.02543962 
## Searching left ...
## mtry = 4  OOB error = 0.03064481 
## -0.2046095 0.05 
## Searching right ...
## mtry = 14  OOB error = 0.02643948 
## -0.03930348 0.05 
##    mtry   OOBError
## 4     4 0.03064481
## 7     7 0.02543962
## 14   14 0.02643948
title("Random forest: Tuning the mtry hyperparameter")
# variable importance
varImpPlot(rf.mod, main = "Random forest: Variable importance")
# let's make some predictions
rf.pred <- predict(rf.mod, subset(imp.test, select = -price))
# Comparing our predictions with test data:
plot(rf.pred, imp.test$price, main = "Random forest: Actual vs. predicted")
abline(a = 0, b = 1)
# MSE of RF model
print(rf.mse <- mean((rf.pred - imp.test$price) ** 2))
## [1] 0.03139744
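As a sanity check (our own addition), the OOB estimate from training should be in the same ballpark as the test-set MSE above. For a regression forest, rf.mod$mse holds the running OOB MSE, so its last entry is the OOB estimate after all 300 trees:

# final OOB MSE, to be compared with the test MSE rf.mse
rf.mod$mse[rf.mod$ntree]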
Our random forest model incurs a mean squared error of 0.0314 when predicting the (log) prices of imported automobiles.
Visualization of performances
For comparison purposes, let’s also construct an ordinary least squares (linear regression) model.
ols.mod <- lm(price ~ ., data = imp.train)
# predictions
ols.pred <- predict(ols.mod, subset(imp.test, select = -price))
# comparisons with test data:
plot(ols.pred, imp.test$price, main = "OLS: Actual vs. predicted")
abline(a = 0, b = 1)
# MSE
print(ols.mse <- mean((ols.pred - imp.test$price) ** 2))
## [1] 0.03556617
Now, let’s compare their performances.
# Comparing MSEs of various models:
barplot(c(tree.mse, bg.mse, rf.mse, ols.mse),
        main = "Mean squared errors of various models",
        names.arg = c("Tree", "Bagging", "RF", "OLS"))
Judging by the mean squared errors, our top performer here is the bagging model, followed closely by the random forest model; both ensembles outperform OLS and the single decision tree.