Random Forests in R
Introduction
As the name suggests, a random forest model is essentially an ensemble of decision tree models, with each decision tree predicting the same response variable. The response may be categorical, in which case we have a classification problem, or continuous / numerical, in which case we have a regression problem.
In this short tutorial, we will go through the use of tree-based methods (decision tree, bagging model, and random forest) for both classification and regression problems.
This tutorial is divided into two sections. We will first use tree-based methods for classification on the spam dataset from the kernlab package. Subsequently, we will apply these methods to a regression problem, using the imports85 dataset from the randomForest package.
Tree-based methods for classification
Preparation
Let’s start by loading the spam dataset and doing some preparations:
# packages that we will need:
# @ kernlab: for the spam dataset
# @ tree: for decision tree construction
# @ randomForest: for bagging and RF
# @ beepr: for a little beep
# @ pROC: for plotting of ROC

# code snippet to install and load multiple packages at once
# (note: require() needs character.only = TRUE when the
# package name is passed as a string)
# pkgs <- c("kernlab", "tree", "randomForest", "beepr", "pROC")
# sapply(pkgs, FUN = function(p) {
#   print(p)
#   if (!require(p, character.only = TRUE)) install.packages(p)
#   require(p, character.only = TRUE)
# })

# load required packages
suppressWarnings(library(kernlab))
suppressWarnings(library(tree))
suppressWarnings(library(randomForest))
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
suppressWarnings(library(beepr)) # try it! beep()
suppressWarnings(library(pROC))
## Type 'citation("pROC")' for a citation.
## 
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
## 
##     cov, smooth, var

# load dataset
data(spam)
# take a look
str(spam)
## 'data.frame': 4601 obs. of 58 variables:
##  $ make             : num 0 0.21 0.06 0 0 0 0 0 0.15 0.06 ...
##  $ address          : num 0.64 0.28 0 0 0 0 0 0 0 0.12 ...
##  $ all              : num 0.64 0.5 0.71 0 0 0 0 0 0.46 0.77 ...
##  $ num3d            : num 0 0 0 0 0 0 0 0 0 0 ...
##  $ our              : num 0.32 0.14 1.23 0.63 0.63 1.85 1.92 1.88 0.61 0.19 ...
##  $ over             : num 0 0.28 0.19 0 0 0 0 0 0 0.32 ...
##  $ remove           : num 0 0.21 0.19 0.31 0.31 0 0 0 0.3 0.38 ...
##  $ internet         : num 0 0.07 0.12 0.63 0.63 1.85 0 1.88 0 0 ...
##  $ order            : num 0 0 0.64 0.31 0.31 0 0 0 0.92 0.06 ...
##  $ mail             : num 0 0.94 0.25 0.63 0.63 0 0.64 0 0.76 0 ...
##  $ receive          : num 0 0.21 0.38 0.31 0.31 0 0.96 0 0.76 0 ...
##  $ will             : num 0.64 0.79 0.45 0.31 0.31 0 1.28 0 0.92 0.64 ...
##  $ people           : num 0 0.65 0.12 0.31 0.31 0 0 0 0 0.25 ...
##  $ report           : num 0 0.21 0 0 0 0 0 0 0 0 ...
##  $ addresses        : num 0 0.14 1.75 0 0 0 0 0 0 0.12 ...
##  $ free             : num 0.32 0.14 0.06 0.31 0.31 0 0.96 0 0 0 ...
##  $ business         : num 0 0.07 0.06 0 0 0 0 0 0 0 ...
##  $ email            : num 1.29 0.28 1.03 0 0 0 0.32 0 0.15 0.12 ...
##  $ you              : num 1.93 3.47 1.36 3.18 3.18 0 3.85 0 1.23 1.67 ...
##  $ credit           : num 0 0 0.32 0 0 0 0 0 3.53 0.06 ...
##  $ your             : num 0.96 1.59 0.51 0.31 0.31 0 0.64 0 2 0.71 ...
##  $ font             : num 0 0 0 0 0 0 0 0 0 0 ...
##  $ num000           : num 0 0.43 1.16 0 0 0 0 0 0 0.19 ...
##  $ money            : num 0 0.43 0.06 0 0 0 0 0 0.15 0 ...
##  $ hp               : num 0 0 0 0 0 0 0 0 0 0 ...
##  $ hpl              : num 0 0 0 0 0 0 0 0 0 0 ...
##  $ george           : num 0 0 0 0 0 0 0 0 0 0 ...
##  $ num650           : num 0 0 0 0 0 0 0 0 0 0 ...
##  $ lab              : num 0 0 0 0 0 0 0 0 0 0 ...
##  $ labs             : num 0 0 0 0 0 0 0 0 0 0 ...
##  $ telnet           : num 0 0 0 0 0 0 0 0 0 0 ...
##  $ num857           : num 0 0 0 0 0 0 0 0 0 0 ...
##  $ data             : num 0 0 0 0 0 0 0 0 0.15 0 ...
##  $ num415           : num 0 0 0 0 0 0 0 0 0 0 ...
##  $ num85            : num 0 0 0 0 0 0 0 0 0 0 ...
##  $ technology       : num 0 0 0 0 0 0 0 0 0 0 ...
##  $ num1999          : num 0 0.07 0 0 0 0 0 0 0 0 ...
##  $ parts            : num 0 0 0 0 0 0 0 0 0 0 ...
##  $ pm               : num 0 0 0 0 0 0 0 0 0 0 ...
##  $ direct           : num 0 0 0.06 0 0 0 0 0 0 0 ...
##  $ cs               : num 0 0 0 0 0 0 0 0 0 0 ...
##  $ meeting          : num 0 0 0 0 0 0 0 0 0 0 ...
##  $ original         : num 0 0 0.12 0 0 0 0 0 0.3 0 ...
##  $ project          : num 0 0 0 0 0 0 0 0 0 0.06 ...
##  $ re               : num 0 0 0.06 0 0 0 0 0 0 0 ...
##  $ edu              : num 0 0 0.06 0 0 0 0 0 0 0 ...
##  $ table            : num 0 0 0 0 0 0 0 0 0 0 ...
##  $ conference       : num 0 0 0 0 0 0 0 0 0 0 ...
##  $ charSemicolon    : num 0 0 0.01 0 0 0 0 0 0 0.04 ...
##  $ charRoundbracket : num 0 0.132 0.143 0.137 0.135 0.223 0.054 0.206 0.271 0.03 ...
##  $ charSquarebracket: num 0 0 0 0 0 0 0 0 0 0 ...
##  $ charExclamation  : num 0.778 0.372 0.276 0.137 0.135 0 0.164 0 0.181 0.244 ...
##  $ charDollar       : num 0 0.18 0.184 0 0 0 0.054 0 0.203 0.081 ...
##  $ charHash         : num 0 0.048 0.01 0 0 0 0 0 0.022 0 ...
##  $ capitalAve       : num 3.76 5.11 9.82 3.54 3.54 ...
##  $ capitalLong      : num 61 101 485 40 40 15 4 11 445 43 ...
##  $ capitalTotal     : num 278 1028 2259 191 191 ...
##  $ type             : Factor w/ 2 levels "nonspam","spam": 2 2 2 2 2 2 2 2 2 2 ...
In this example, we will attempt to predict whether an email is spam or nonspam. To do so, we will construct models on one subset of the data (the training data), and evaluate the constructed models on a separate subset (the testing data). This holdout split is a simple form of cross validation.
# preparation for cross validation:
# split the dataset into 2 halves,
# 2300 samples for training and 2301 for testing
num.samples <- nrow(spam)              # 4,601
num.train   <- round(num.samples / 2)  # 2,300
num.test    <- num.samples - num.train # 2,301
num.var     <- ncol(spam)              # 58

# set up the indices
set.seed(150715)
idx       <- sample(1:num.samples)
train.idx <- idx[seq(num.train)]
test.idx  <- setdiff(idx, train.idx)

# subset the data
spam.train <- spam[train.idx, ]
spam.test  <- spam[test.idx, ]
Taking a quick glance at the type variable:
table(spam.train$type)
## 
## nonspam    spam 
##    1397     903
table(spam.test$type)
## 
## nonspam    spam 
##    1391     910
Decision tree
Now that we are done with the preparation, let’s start by constructing a decision tree model, using the tree package:
tree.mod <- tree(type ~ ., data = spam.train)
Here’s what our model looks like:
plot(tree.mod)
title("Decision tree")
text(tree.mod, cex = 0.75)
The model may be overly complicated. Typically, after constructing a decision tree model, we may want to prune it, collapsing certain edges, nodes and leaves together without much loss of performance. This is done by comparing subtrees of different sizes (numbers of leaf nodes) on their estimated performance, obtained by k-fold cross validation within the training set.
cv.prune <- cv.tree(tree.mod, FUN = prune.misclass)
plot(cv.prune$size, cv.prune$dev,
     pch = 20, col = "red", type = "b",
     main = "Decision tree: Cross validation to find optimal size of tree",
     xlab = "Size of tree", ylab = "Misclassifications")
A tree with 9 leaf nodes looks like a good choice (maximising performance while minimising complexity).
best.tree.size <- 9
# pruning (cost-complexity pruning)
pruned.tree.mod <- prune.misclass(tree.mod, best = best.tree.size)
# here's the new tree model
plot(pruned.tree.mod)
title(paste("Pruned decision tree (", best.tree.size, " leaf nodes)", sep = ""))
text(pruned.tree.mod, cex = 0.75)
Now with our new model, let’s make some predictions on the testing data.
tree.pred <- predict(pruned.tree.mod, subset(spam.test, select = -type),
                     type = "class")
# confusion matrix
# rows are the predicted classes
# columns are the actual classes
print(tree.pred.results <- table(tree.pred, spam.test$type))
## 
## tree.pred nonspam spam
##   nonspam    1308  164
##   spam         83  746
# What is the accuracy of our tree model?
print(tree.accuracy <- (tree.pred.results[1,1] + tree.pred.results[2,2]) /
        sum(tree.pred.results))
## [1] 0.8926554
Our decision tree model is able to predict spam vs. nonspam emails with about 89.27% accuracy. We will make comparisons of accuracies with other models later.
Bagging
Next, we turn our attention to the bagging model. Recall that bagging, a.k.a. bootstrap aggregating, is the process of repeatedly drawing samples (with replacement) from the training data. Each of these bootstrap samples is known as a bag; we construct an individual decision tree model on each bag and, to make a classification prediction, take the majority vote across the ensemble of decision trees.
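To make the bootstrap-and-vote mechanics concrete, here is a minimal hand-rolled sketch of bagging built on the tree package. This is for intuition only; the names (n.bags, bags, votes, bag.by.hand) are our own, and the randomForest call below is what we actually use:

n.bags <- 25
# draw one bootstrap sample ("bag") per tree, and grow a tree on each bag
bags <- lapply(seq(n.bags), function(b) {
  boot.idx <- sample(nrow(spam.train), replace = TRUE)
  tree(type ~ ., data = spam.train[boot.idx, ])
})
# collect each tree's class prediction for the test set...
votes <- sapply(bags, function(t)
  as.character(predict(t, newdata = spam.test, type = "class")))
# ...and classify by majority vote across the ensemble
bag.by.hand <- apply(votes, 1, function(v) names(which.max(table(v))))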
bg.mod <- randomForest(type ~ ., data = spam.train,
                       mtry = num.var - 1, # try all variables at each split, except the response variable
                       ntree = 300,
                       proximity = TRUE,
                       importance = TRUE)
In the bagging model, and also in the random forest model, there are often only two hyperparameters that we are interested in: mtry, the number of candidate variables considered at each split of each tree, and ntree, the number of trees in the ensemble. Tuning the number of trees is relatively easy: we can look at the out-of-bag (OOB) error estimate of the ensemble as trees are added. For more details, refer to the slides. We set proximity = TRUE and importance = TRUE in order to get some form of visualization of the model and the variable importances, respectively.
plot(bg.mod$err.rate[,1], type = "l", lwd = 3, col = "blue",
     main = "Bagging: OOB estimate of error rate",
     xlab = "Number of Trees", ylab = "OOB error rate")
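As a quick sanity check on what this curve shows (our own addition, not part of the original workflow): for classification, bg.mod$predicted contains each training sample’s OOB prediction, i.e. the majority vote over only those trees whose bags did not include that sample. The last point of the curve can therefore be recomputed directly:

# OOB error recomputed from the OOB predictions;
# this matches the final value of bg.mod$err.rate[,1]
mean(bg.mod$predicted != spam.train$type)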
Here, 300 trees seems more than sufficient. One advantage of bagging and random forest models is that they provide a way of doing feature or variable selection, by considering the importance of each variable in the model. For exact details on how these importance measures are defined, refer to the slides.
varImpPlot(bg.mod, main = "Bagging: Variable importance")
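If the plot alone is not enough, the raw numbers behind varImpPlot() can be extracted with importance(). As a small illustration (a snippet of our own), here are the top five variables by mean decrease in accuracy:

# type = 1 requests the permutation-based mean decrease in accuracy
imp.mat <- importance(bg.mod, type = 1)
head(imp.mat[order(imp.mat, decreasing = TRUE), , drop = FALSE], 5)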
In addition, we can visualize the classification done by the model using a multidimensional plot on the proximity matrix. The green samples in the figure represent nonspams, while the red samples are spams.
MDSplot(bg.mod, fac = spam.train$type, palette = c("green","red"), main = "Bagging: MDS")
Finally, let’s make some predictions on the testing data:
bg.pred <- predict(bg.mod, subset(spam.test, select = -type), type = "class")
# confusion matrix
# rows are the predicted classes
# columns are the actual classes
print(bg.pred.results <- table(bg.pred, spam.test$type))
## 
## bg.pred   nonspam spam
##   nonspam    1336   87
##   spam         55  823
# what is the accuracy of our bagging model?
print(bg.accuracy <- sum(diag(bg.pred.results)) / sum(bg.pred.results))
## [1] 0.9382877
Our bagging model predicts whether an email is spam or not with about 93.83% accuracy.
Random forest
The only difference between the bagging model and the random forest model is that the latter chooses from only a random subset of the variables when splitting at each node of each tree. In other words, only the mtry argument differs between bagging and random forest.
rf.mod <- randomForest(type ~ ., data = spam.train,
                       mtry = floor(sqrt(num.var - 1)), # 7; only difference from bagging is here
                       ntree = 300,
                       proximity = TRUE,
                       importance = TRUE)

# Out-of-bag (OOB) error rate as a function of num. of trees:
plot(rf.mod$err.rate[,1], type = "l", lwd = 3, col = "blue",
     main = "Random forest: OOB estimate of error rate",
     xlab = "Number of Trees", ylab = "OOB error rate")
Besides tuning the ntree hyperparameter, we might also be interested in tuning the mtry hyperparameter in random forest. The random forest model may be built using the mtry value that minimises the OOB error.
tuneRF(subset(spam.train, select = -type), spam.train$type, ntreeTry = 100)
## mtry = 7  OOB error = 5.52% 
## Searching left ...
## mtry = 4  OOB error = 6.26% 
## -0.1338583 0.05 
## Searching right ...
## mtry = 14  OOB error = 5.83% 
## -0.05511811 0.05 
##        mtry   OOBError
## 4.OOB     4 0.06260870
## 7.OOB     7 0.05521739
## 14.OOB   14 0.05826087
title("Random forest: Tuning the mtry hyperparameter")
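To act on the tuning result, we can rebuild the forest with the mtry value that minimised the OOB error. This is a small convenience snippet of our own (tune.res, best.mtry and rf.tuned are hypothetical names); tuneRF returns a matrix with columns mtry and OOBError, and alternatively offers doBest = TRUE to return the refitted forest directly:

tune.res  <- tuneRF(subset(spam.train, select = -type), spam.train$type,
                    ntreeTry = 100, plot = FALSE)
best.mtry <- tune.res[which.min(tune.res[, "OOBError"]), "mtry"]
rf.tuned  <- randomForest(type ~ ., data = spam.train,
                          mtry = best.mtry, ntree = 300)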
# variable importance
varImpPlot(rf.mod, main = "Random forest: Variable importance")
# multidimensional scaling plot
# green samples are non-spam,
# red samples are spam
MDSplot(rf.mod, fac = spam.train$type, palette = c("green","red"),
        main = "Random forest: MDS")
# now, let's make some predictions
rf.pred <- predict(rf.mod, subset(spam.test, select = -type), type = "class")
# confusion matrix
print(rf.pred.results <- table(rf.pred, spam.test$type))
## 
## rf.pred   nonspam spam
##   nonspam    1353   82
##   spam         38  828
# Accuracy of our RF model:
print(rf.accuracy <- sum(diag(rf.pred.results)) / sum(rf.pred.results))
## [1] 0.9478488
Our random forest model predicts whether an email is spam or not with about 94.78% accuracy.
Visualization of performances
Let’s go ahead and compare the performances of our models. For comparison’s sake, let’s also construct a logistic regression model.
log.mod <- glm(type ~ ., data = spam.train, family = binomial(link = logit))
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

# predictions
log.pred.prob <- predict(log.mod, subset(spam.test, select = -type),
                         type = "response")
log.pred.class <- factor(sapply(log.pred.prob, FUN = function(x) {
  if (x >= 0.5) return("spam") else return("nonspam")
}))

# confusion matrix
log.pred.results <- table(log.pred.class, spam.test$type)
# Accuracy of logistic regression model:
print(log.accuracy <- sum(diag(log.pred.results)) / sum(log.pred.results))
## [1] 0.9135159
Now, let’s compare the performances, considering first the model accuracies.
barplot(c(tree.accuracy, bg.accuracy, rf.accuracy, log.accuracy),
        main = "Accuracies of various models",
        names.arg = c("Tree", "Bagging", "RF", "Logistic"))
We can see here that the ensemble models (bagging and random forest) outperform the single decision tree, as well as the logistic regression model. The bagging and random forest models have roughly the same classification performance here, with random forest slightly ahead. Understanding the rationale of random subspace sampling (refer to the slides) should allow us to appreciate the potential improvement of random forest over the bagging model.
Finally, let’s plot the ROC curves of the various models. An ROC curve is only defined for models that give probabilistic (or otherwise continuous) scores, so we compare the bagging, random forest, and logistic regression models here.
bg.pred.prob <- predict(bg.mod, subset(spam.test, select = -type), type = "prob")
rf.pred.prob <- predict(rf.mod, subset(spam.test, select = -type), type = "prob")
plot.roc(spam.test$type, bg.pred.prob[,1], col = "blue", lwd = 3,
         print.auc = TRUE, print.auc.y = 0.3,
         main = "ROC-AUC of various models")
## Setting levels: control = nonspam, case = spam
## Setting direction: controls > cases
plot.roc(spam.test$type, rf.pred.prob[,1], col = "green", lwd = 3,
         print.auc = TRUE, print.auc.y = 0.2, add = TRUE)
## Setting levels: control = nonspam, case = spam
## Setting direction: controls > cases
plot.roc(spam.test$type, log.pred.prob, col = "red", lwd = 3,
         print.auc = TRUE, print.auc.y = 0.1, add = TRUE)
## Setting levels: control = nonspam, case = spam
## Setting direction: controls < cases
legend(x = 0.6, y = 0.8,
       legend = c("Bagging", "Random forest", "Logistic regression"),
       col = c("blue", "green", "red"), lwd = 1)
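The AUC values printed on the plot can also be extracted numerically with pROC’s auc() function, which is convenient for tabulating results (a small addition of ours):

auc(spam.test$type, bg.pred.prob[,1])  # bagging
auc(spam.test$type, rf.pred.prob[,1])  # random forest
auc(spam.test$type, log.pred.prob)     # logistic regression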
Tree-based methods for regression
In this section, we will consider the use of tree-based methods for regression. The material that follows closely parallels the classification workflow above.
Preparation
library(tree)
library(randomForest)
data(imports85)
imp <- imports85

# The following data preprocessing steps on
# the imports85 dataset are suggested by
# the authors of the randomForest package
# look at
# > ?imports85
imp <- imp[,-2]  # Too many NAs in normalizedLosses.
imp <- imp[complete.cases(imp), ]
# Drop empty levels for factors
imp[] <- lapply(imp, function(x) if (is.factor(x)) x[, drop = TRUE] else x)

# Also removing the numOfCylinders and fuelSystem
# variables due to sparsity of data
# to see this, run the following lines:
# > table(imp$numOfCylinders)
# > table(imp$fuelSystem)
# This additional step is only necessary because we will be
# making comparisons between the tree-based models
# and linear regression, and linear regression cannot
# handle sparse data well
imp <- subset(imp, select = -c(numOfCylinders, fuelSystem))
# also removing the make variable
imp <- subset(imp, select = -make)

# Preparation for cross validation:
# split the dataset into 2 halves,
# 96 samples for training and 97 for testing
num.samples <- nrow(imp)               # 193
num.train   <- round(num.samples / 2)  # 96
num.test    <- num.samples - num.train # 97
num.var     <- ncol(imp)               # 22

# set up the indices
set.seed(150715)
idx       <- sample(1:num.samples)
train.idx <- idx[seq(num.train)]
test.idx  <- setdiff(idx, train.idx)

# subset the data
imp.train <- imp[train.idx, ]
imp.test  <- imp[test.idx, ]

str(imp.train)
## 'data.frame': 96 obs. of 22 variables:
##  $ symboling       : int 1 0 0 3 2 1 1 1 3 2 ...
##  $ fuelType        : Factor w/ 2 levels "diesel","gas": 2 2 2 2 1 2 2 2 2 2 ...
##  $ aspiration      : Factor w/ 2 levels "std","turbo": 2 1 2 1 1 1 1 1 2 2 ...
##  $ numOfDoors      : Factor w/ 2 levels "four","two": 1 1 1 2 2 2 1 2 2 1 ...
##  $ bodyStyle       : Factor w/ 5 levels "convertible",..: 4 4 4 3 4 4 4 3 3 4 ...
##  $ driveWheels     : Factor w/ 3 levels "4wd","fwd","rwd": 2 2 3 3 2 3 2 2 2 2 ...
##  $ engineLocation  : Factor w/ 2 levels "front","rear": 1 1 1 1 1 1 1 1 1 1 ...
##  $ wheelBase       : num 96.3 97.2 108 102.9 97.3 ...
##  $ length          : num 172 173 187 184 172 ...
##  $ width           : num 65.4 65.2 68.3 67.7 65.5 64 63.8 66.5 65.4 66.5 ...
##  $ height          : num 51.6 54.7 56 52 55.7 52.6 54.5 53.7 49.4 56.1 ...
##  $ curbWeight      : int 2403 2302 3130 2976 2261 2265 1971 2385 2370 2847 ...
##  $ engineType      : Factor w/ 5 levels "dohc","l","ohc",..: 3 3 2 1 3 1 3 3 3 1 ...
##  $ engineSize      : int 110 120 134 171 97 98 97 122 110 121 ...
##  $ bore            : num 3.17 3.33 3.61 3.27 3.01 3.24 3.15 3.39 3.17 3.54 ...
##  $ stroke          : num 3.46 3.47 3.21 3.35 3.4 3.08 3.29 3.39 3.46 3.07 ...
##  $ compressionRatio: num 7.5 8.5 7 9.3 23 9.4 9.4 8.6 7.5 9 ...
##  $ horsepower      : int 116 97 142 161 52 112 69 84 116 160 ...
##  $ peakRpm         : int 5500 5200 5600 5200 4800 6600 5200 4800 5500 5500 ...
##  $ cityMpg         : int 23 27 18 20 37 26 31 26 23 19 ...
##  $ highwayMpg      : int 30 34 24 24 46 29 37 32 30 26 ...
##  $ price           : int 9279 9549 18150 16558 7775 9298 7499 10595 9959 18620 ...
str(imp.test)
## 'data.frame': 97 obs. of 22 variables:
##  $ symboling       : int -1 1 -1 1 1 -2 0 0 2 2 ...
##  $ fuelType        : Factor w/ 2 levels "diesel","gas": 2 2 1 2 2 2 1 2 2 2 ...
##  $ aspiration      : Factor w/ 2 levels "std","turbo": 1 1 2 1 1 1 2 1 1 1 ...
##  $ numOfDoors      : Factor w/ 2 levels "four","two": 1 1 1 2 2 1 2 2 2 2 ...
##  $ bodyStyle       : Factor w/ 5 levels "convertible",..: 4 4 5 4 4 4 2 4 1 3 ...
##  $ driveWheels     : Factor w/ 3 levels "4wd","fwd","rwd": 3 3 3 3 2 3 3 3 3 2 ...
##  $ engineLocation  : Factor w/ 2 levels "front","rear": 1 1 1 1 1 1 1 1 1 1 ...
##  $ wheelBase       : num 115.6 103.5 110 94.5 94.5 ...
##  $ length          : num 203 189 191 169 165 ...
##  $ width           : num 71.7 66.9 70.3 64 63.8 67.2 70.3 70.6 65.6 64.4 ...
##  $ height          : num 56.5 55.7 58.7 52.6 54.5 56.2 54.9 47.8 53 50.8 ...
##  $ curbWeight      : int 3740 3055 3750 2169 1918 2935 3495 3950 2975 1944 ...
##  $ engineType      : Factor w/ 5 levels "dohc","l","ohc",..: 5 3 3 3 3 3 3 5 3 3 ...
##  $ engineSize      : int 234 164 183 98 97 141 183 326 146 92 ...
##  $ bore            : num 3.46 3.31 3.58 3.19 3.15 3.78 3.58 3.54 3.62 2.97 ...
##  $ stroke          : num 3.1 3.19 3.64 3.03 3.29 3.15 3.64 2.76 3.5 3.23 ...
##  $ compressionRatio: num 8.3 9 21.5 9 9.4 9.5 21.5 11.5 9.3 9.4 ...
##  $ horsepower      : int 155 121 123 70 69 114 123 262 116 68 ...
##  $ peakRpm         : int 4750 4250 4350 4800 5200 5400 4350 5000 4800 5500 ...
##  $ cityMpg         : int 16 20 22 29 31 24 22 13 24 31 ...
##  $ highwayMpg      : int 18 25 25 34 37 28 25 17 30 38 ...
##  $ price           : int 34184 24565 28248 8058 6649 15985 28176 36000 17669 6189 ...

# take a quick look
hist(imp.train$price)
hist(imp.test$price)
We will be predicting the price of imported automobiles in this example. While tree-based methods are invariant to monotone transformations of the predictor variables, the same is not true of the response variable. Since the price distribution is heavily right-skewed, let’s log-transform price here.
imp.train$price <- log(imp.train$price)
imp.test$price  <- log(imp.test$price)
# take a look again
hist(imp.train$price)
hist(imp.test$price)
Decision tree
With the preparation done, let’s begin with a decision tree.
# Construct decision tree model
tree.mod <- tree(price ~ ., data = imp.train)
# here's what the model looks like
plot(tree.mod)
title("Decision tree")
text(tree.mod, cex = 0.75)
# let's see if our decision tree requires pruning
cv.prune <- cv.tree(tree.mod, FUN = prune.tree)
plot(cv.prune$size, cv.prune$dev,
     pch = 20, col = "red", type = "b",
     main = "Cross validation to find optimal size of tree",
     xlab = "Size of tree", ylab = "Mean squared error")
# looks fine
# now let's make some predictions
tree.pred <- predict(tree.mod, subset(imp.test, select = -price),
                     type = "vector")
# Comparing our predictions with the test data:
plot(tree.pred, imp.test$price, main = "Decision tree: Actual vs. predicted")
abline(a = 0, b = 1)  # A prediction with zero error will lie on the y = x line
# What is the MSE of our model?
print(tree.mse <- mean((tree.pred - imp.test$price) ** 2))
## [1] 0.04716916
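Since we modelled log(price), this MSE is on the log scale. One rough way to interpret it (a back-of-envelope calculation of our own): the RMSE is sqrt(0.0472) ≈ 0.217, and exp(0.217) ≈ 1.24, so a typical prediction is off by a multiplicative factor of about 1.24, i.e. roughly 24% of the price.

# typical multiplicative error on the original price scale
# (about 1.24 here)
exp(sqrt(tree.mse))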
Our decision tree model predicts the (log) price of imported automobiles with a mean squared error of 0.0472. As in the previous section, we will make comparisons of model performances later.
Bagging
Next, bagging.
bg.mod <- randomForest(price ~ ., data = imp.train,
                       mtry = num.var - 1, # try all variables at each split, except the response variable
                       ntree = 300,
                       importance = TRUE)

# Out-of-bag (OOB) error rate as a function of num. of trees
# here, the error is the mean squared error,
# not classification error
plot(bg.mod$mse, type = "l", lwd = 3, col = "blue",
     main = "Bagging: OOB estimate of error rate",
     xlab = "Number of Trees", ylab = "OOB error rate")
# variable importance
varImpPlot(bg.mod, main = "Bagging: Variable importance")
# let's make some predictions
bg.pred <- predict(bg.mod, subset(imp.test, select = -price))
# Comparing our predictions with test data:
plot(bg.pred, imp.test$price, main = "Bagging: Actual vs. predicted")
abline(a = 0, b = 1)
# MSE of bagged model
print(bg.mse <- mean((bg.pred - imp.test$price) ** 2))
## [1] 0.03004431
Our bagging model predicts the (log) price of imported automobiles with a mean squared error of 0.0300.
Random forest
Finally, the random forest model.
rf.mod <- randomForest(price ~ ., data = imp.train,
                       mtry = floor((num.var - 1) / 3), # 7; only difference from bagging is here
                       ntree = 300,
                       importance = TRUE)

# Out-of-bag (OOB) error rate as a function of num. of trees
plot(rf.mod$mse, type = "l", lwd = 3, col = "blue",
     main = "Random forest: OOB estimate of error rate",
     xlab = "Number of Trees", ylab = "OOB error rate")
# tuning the mtry hyperparameter:
# model may be rebuilt if desired
tuneRF(subset(imp.train, select = -price), imp.train$price, ntreeTry = 100)
## mtry = 7  OOB error = 0.02543962 
## Searching left ...
## mtry = 4  OOB error = 0.03064481 
## -0.2046095 0.05 
## Searching right ...
## mtry = 14  OOB error = 0.02643948 
## -0.03930348 0.05 
##    mtry   OOBError
## 4     4 0.03064481
## 7     7 0.02543962
## 14   14 0.02643948
title("Random forest: Tuning the mtry hyperparameter")
# variable importance
varImpPlot(rf.mod, main = "Random forest: Variable importance")
# let's make some predictions
rf.pred <- predict(rf.mod, subset(imp.test, select = -price))
# Comparing our predictions with test data:
plot(rf.pred, imp.test$price, main = "Random forest: Actual vs. predicted")
abline(a = 0, b = 1)
# MSE of RF model
print(rf.mse <- mean((rf.pred - imp.test$price) ** 2))
## [1] 0.03139744
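As a sanity check (our own addition), the OOB estimate from training should be in the same ballpark as the test-set MSE above. For a regression forest, rf.mod$mse holds the running OOB MSE, so its last entry is the OOB estimate after all 300 trees:

# final OOB MSE, to be compared with the test MSE rf.mse
rf.mod$mse[rf.mod$ntree]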
Our random forest model incurs a mean squared error of 0.0314 when predicting the (log) prices of imported automobiles.
Visualization of performances
For comparison purposes, let’s also construct an ordinary least squares (linear regression) model.
ols.mod <- lm(price ~ ., data = imp.train)
# predictions
ols.pred <- predict(ols.mod, subset(imp.test, select = -price))
# comparisons with test data:
plot(ols.pred, imp.test$price, main = "OLS: Actual vs. predicted")
abline(a = 0, b = 1)
# MSE
print(ols.mse <- mean((ols.pred - imp.test$price) ** 2))
## [1] 0.03556617
Now, let’s compare their performances.
# Comparing MSEs of various models:
barplot(c(tree.mse, bg.mse, rf.mse, ols.mse),
        main = "Mean squared errors of various models",
        names.arg = c("Tree", "Bagging", "RF", "OLS"))
Judging by the mean squared errors, our top performer here is the bagging model, followed closely by the random forest model; both ensembles outperform OLS and the single decision tree.