Site icon R-bloggers

Predictive modelling fun with the caret package

[This article was first published on Data Until I Die!, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

I’m back!  6 months after my second child was born, I’ve finally made it back to my blog with something fun to write about.  I recently read through the excellent Machine Learning with R ebook and was impressed by the caret package and how easy it made it seem to do predictive modelling that was a little more than just the basics.

With that in mind, I went searching through the UCI machine learning repository and found a dataset about leaves that looked promising for a classification problem.  The dataset comprises of leaves from almost 40 different plant species, and has 14 numerical attributes describing each leaf.  It comes with a pdf file that shows pretty pictures of each leaf for the botanists out there, and some very mathematics heavy descriptions of each of the attributes which I couldn’t even hope to understand with my lack of education on the matter!

Seeing that it didn’t look overly complex to process, I decided to load it in and set up the overall training parameters:

library(caret)
leaf = read.csv("leaf.csv", colClasses = c(Class = "factor"))
ctrl = trainControl(method="repeatedcv", number=10, repeats=5, selectionFunction = "oneSE")
in_train = createDataPartition(leaf$Class, p=.75, list=FALSE)

First, I made sure that the Class variable remained a factor, even though it’s coded with integers in the incoming data.  This way once I split the data into a test set, I won’t get any complaints about missing outcome values if the sampling doesn’t pick up one of those values!

You’ll notice I’ve tried repeated cross validation here, with 5 repeats, and have used the ‘oneSE’ selection function.  This ensures that for whichever model I choose, the model gets tested on 10 different parts of my data, repeated 5 times over, and then I’ve chosen the ‘oneSE’ function to hopefully select a model that is not the most complex.  Finally, I use createDataPartition to create a a training sample of 75% of the data.

trf = train(Class ~ Eccentricity + Aspect_Ratio + Elongation +
              Solidity + Stoch_Convexity + Isoperimetric + 
              Max_Ind_Depth + Lobedness + Avg_Intensity + 
              Avg_Contrast + Smoothness + Third_Moment + 
              Uniformity + Entropy, data=leaf, method="rf", metric="Kappa",
            trControl=ctrl, subset = in_train)

tgbm = train(Class ~ Eccentricity + Aspect_Ratio + Elongation +
              Solidity + Stoch_Convexity + Isoperimetric + 
              Max_Ind_Depth + Lobedness + Avg_Intensity + 
              Avg_Contrast + Smoothness + Third_Moment + 
              Uniformity + Entropy, data=leaf, method="gbm", metric="Kappa",
            trControl=ctrl, subset = in_train, verbose=FALSE)

I’ve chosen to use a random forest and a generalized boosted model to try to model leaf class.  Notice how I’ve referred to the training parameters in the trControl argument, and have selected the training subset by referring to in_train.  Also, the ‘verbose=FALSE’ argument in the gbm model is important!!  Let’s look at results:

For the trf model:

Random Forest
340 samples
15 predictors
30 classes: '1', '10', '11', '12', '13', '14', '15', '2', '22', '23', '24', '25', '26', '27', '28', '29', '3', '30', '31', '32', '33', '34', '35', '36', '4', '5', '6', '7', '8', '9'

No pre-processing
Resampling: Cross-Validated (10 fold, repeated 5 times)

Summary of sample sizes: 228, 231, 233, 233, 232, 229, ...

Resampling results across tuning parameters:

mtry Accuracy Kappa Accuracy SD Kappa SD
2 0.7341953 0.7230754 0.07930583 0.08252806
8 0.7513803 0.7409347 0.08873493 0.09237854
14 0.7481404 0.7375215 0.08438226 0.08786254

Kappa was used to select the optimal model using the one
SE rule.
The final value used for the model was mtry = 8.

So as you can see it’s selected a random forest model that tries 8 random predictors at each split, and it seems to be doing pretty well with a Kappa of .74. Now let’s move on to the next results:

For the tgbm model:

Stochastic Gradient Boosting 

340 samples
 15 predictors
 30 classes: '1', '10', '11', '12', '13', '14', '15', '2', '22', '23', '24', '25', '26', '27', '28', '29', '3', '30', '31', '32', '33', '34', '35', '36', '4', '5', '6', '7', '8', '9' 

No pre-processing
Resampling: Cross-Validated (10 fold, repeated 5 times) 

Summary of sample sizes: 226, 231, 229, 231, 228, 231, ... 

Resampling results across tuning parameters:

  interaction.depth  n.trees  Accuracy   Kappa      Accuracy SD  Kappa SD  
  1                   50      0.6550713  0.6406862  0.07735511   0.08017461
  1                  100      0.6779153  0.6646128  0.07461615   0.07739666
  1                  150      0.6799633  0.6667613  0.08291638   0.08592416
  2                   50      0.7000791  0.6876577  0.08467911   0.08771728
  2                  100      0.6984858  0.6860858  0.08711523   0.09041647
  2                  150      0.6886874  0.6759011  0.09157694   0.09494201
  3                   50      0.6838721  0.6708396  0.08850382   0.09166051
  3                  100      0.6992044  0.6868055  0.08423577   0.08714577
  3                  150      0.6976292  0.6851841  0.08414035   0.08693979

Tuning parameter 'shrinkage' was held constant at a value of 0.1
Kappa was used to select the optimal model using  the one SE rule.
The final values used for the model were n.trees = 50, interaction.depth = 2 and shrinkage = 0.1.

Here we see that it has chosen a gbm model with an interaction depth of 2 and 50 trees. This has a kappa of .69, which appears somewhat worse than the random forest model. Let’s do a direct comparison:

resampls = resamples(list(RF = trf,
                          GBM = tgbm))

difValues = diff(resampls)
summary(difValues)

Call:
summary.diff.resamples(object = difValues)

p-value adjustment: bonferroni 
Upper diagonal: estimates of the difference
Lower diagonal: p-value for H0: difference = 0

Accuracy 
    RF        GBM    
RF            0.05989
GBM 0.0003241        

Kappa 
    RF        GBM    
RF            0.06229
GBM 0.0003208  

Sure enough, the difference is statistically significant. The GBM value ends up being less accurate than the random forest model. Now let’s go to the testing stage! You’ll notice I’ve now stuck with the random forest model.

test = leaf[-in_train,]
test$pred.leaf.rf = predict(trf, test, "raw")
confusionMatrix(test$pred.leaf.rf, test$Class)

...
Overall Statistics
                                         
               Accuracy : 0.7381         
                 95% CI : (0.6307, 0.828)
    No Information Rate : 0.0833         
    P-Value [Acc > NIR] : < 2.2e-16      
                                         
                  Kappa : 0.7277         
 Mcnemar's Test P-Value : NA      
...

Please excuse the ellipses above as the confusionMatrix command generates voluminous output! Anyway, sure enough the Kappa statistic was not that far off in the test sample as it was from the training sample (recall it was .74). Also of interest to me (perhaps it’s boring to you!) is the No Information Rate. Allow me to explain: If I take all of the known classes in the testing sample, and just randomly guess which records to which they belong, I will probably get some right. And this is exactly what the No Information Rate is; the proportion of classes that you would guess right if you randomly allocated them. Obviously an accuracy of .74 and a Kappa of .73 are way higher than the No Information Rate, and so I’m happy that the model is doing more than just making lucky guesses!

Finally, caret has a function to calculate variable importance so that you can see which variables were the most informative in making distinctions between classes.  The results for the random forest model follow:

varImp(trf, scale=FALSE)
rf variable importance

                Overall
Solidity         31.818
Aspect_Ratio     26.497
Eccentricity     23.300
Elongation       23.231
Isoperimetric    20.001
Entropy          18.064
Lobedness        15.608
Max_Ind_Depth    14.828
Uniformity       14.092
Third_Moment     13.148
Stoch_Convexity  12.810
Avg_Intensity    12.438
Smoothness       10.576
Avg_Contrast      9.481

As I have very little clue what these variables mean from their descriptions, someone much wiser than me in all things botanical would have to chime in and educate me.

Well, that was good fun! If you have any ideas to keep the good times rolling and get even better results, please chime in by commenting 🙂


To leave a comment for the author, please follow the link and comment on their blog: Data Until I Die!.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.