Model Segmentation with Cubist
[This article was first published on Yet Another Blog in Statistical Computing » S+/R, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Cubist is a tree-based model with an OLS regression attached to each terminal node and is somewhat similar to the mob() function in the party package (https://statcompute.wordpress.com/2014/10/26/model-segmentation-with-recursive-partitioning). Below is a demonstration of the cubist() model with the classic Boston housing data.
pkgs <- c('MASS', 'Cubist', 'caret')
lapply(pkgs, require, character.only = TRUE)

data(Boston)
X <- Boston[, 1:13]
Y <- log(Boston[, 14])

### TRAIN THE MODEL ###
mdl <- cubist(x = X, y = Y, control = cubistControl(unbiased = TRUE, label = "log_medv", seed = 2015, rules = 5))
summary(mdl)
# Rule 1: [94 cases, mean 2.568824, range 1.609438 to 3.314186, est err 0.180985]
#
#   if
#     nox > 0.671
#   then
#     log_medv = 1.107315 + 0.588 dis + 2.92 nox - 0.0287 lstat - 0.2 rm
#                - 0.0065 crim
#
# Rule 2: [39 cases, mean 2.701933, range 1.94591 to 3.314186, est err 0.202473]
#
#   if
#     nox <= 0.671
#     lstat > 19.01
#   then
#     log_medv = 3.935974 - 1.68 nox - 0.0076 lstat + 0.0035 rad - 0.00017 tax
#                - 0.013 dis - 0.0029 crim + 0.034 rm - 0.011 ptratio
#                + 0.00015 black + 0.0003 zn
#
# Rule 3: [200 cases, mean 2.951007, range 2.116256 to 3.589059, est err 0.100825]
#
#   if
#     rm <= 6.232
#     dis > 1.8773
#   then
#     log_medv = 2.791381 + 0.152 rm - 0.0147 lstat + 0.00085 black
#                - 0.031 dis - 0.027 ptratio - 0.0017 age + 0.0031 rad
#                - 0.00013 tax - 0.0025 crim - 0.12 nox + 0.0002 zn
#
# Rule 4: [37 cases, mean 3.038195, range 2.341806 to 3.912023, est err 0.184200]
#
#   if
#     dis <= 1.8773
#     lstat <= 19.01
#   then
#     log_medv = 5.668421 - 1.187 dis - 0.0469 lstat - 0.0122 crim
#
# Rule 5: [220 cases, mean 3.292121, range 2.261763 to 3.912023, est err 0.093716]
#
#   if
#     rm > 6.232
#     lstat <= 19.01
#   then
#     log_medv = 2.419507 - 0.033 lstat + 0.238 rm - 0.0089 crim + 0.0082 rad
#                - 0.029 dis - 0.00035 tax + 0.0006 black - 0.024 ptratio
#                - 0.0006 age - 0.12 nox + 0.0002 zn
#
# Evaluation on training data (506 cases):
#
#   Average  |error|           0.100444
#   Relative |error|               0.33
#   Correlation coefficient        0.94
#
# Attribute usage:
#   Conds  Model
#
#    71%    94%    rm
#    50%   100%    lstat
#    40%   100%    dis
#    23%    94%    nox
#          100%    crim
#           78%    zn
#           78%    rad
#           78%    tax
#           78%    ptratio
#           78%    black
#           71%    age

### VARIABLE IMPORTANCE ###
varImp(mdl)
#         Overall
# rm         82.5
# lstat      75.0
# dis        70.0
# nox        58.5
# crim       50.0
# zn         39.0
# rad        39.0
# tax        39.0
# ptratio    39.0
# black      39.0
# age        35.5
# indus       0.0
# chas        0.0
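To make the idea of "a regression attached to each segment" concrete, here is a minimal hand-rolled sketch in base R. It is not how Cubist builds its rules: the single split on lstat at 19.01 is simply borrowed from the rule conditions printed above, and the predictor set, the segment labels, and the object names (boston, fits, seg_coefs, pred, seg_mae) are assumptions made for illustration only.

```r
# Hand-rolled illustration of "segment first, then fit OLS per segment".
# NOT Cubist's algorithm: the split and the predictors are chosen by hand.
library(MASS)   # provides the Boston data set
data(Boston)

boston <- Boston
boston$log_medv <- log(boston$medv)

# Hypothetical single split, using the lstat threshold seen in the rules above
boston$segment <- ifelse(boston$lstat > 19.01, "high_lstat", "low_lstat")

# Fit one OLS model per segment on a few hand-picked predictors
fits <- lapply(split(boston, boston$segment), function(d) {
  lm(log_medv ~ rm + lstat + dis + crim, data = d)
})

# One column of coefficients per segment, comparable in spirit to the
# per-rule equations in the Cubist summary
seg_coefs <- sapply(fits, coef)
print(round(seg_coefs, 4))

# Predict every row with the model that belongs to its own segment
pred <- numeric(nrow(boston))
for (s in names(fits)) {
  idx <- boston$segment == s
  pred[idx] <- predict(fits[[s]], newdata = boston[idx, ])
}

# In-sample mean absolute error on the log scale
seg_mae <- mean(abs(boston$log_medv - pred))
print(seg_mae)
```

Comparing the two columns of seg_coefs shows how the slope of a predictor can differ between segments, which is the behavior the rule-specific equations in the Cubist output capture. Cubist additionally searches for the splits itself and smooths predictions across rules, so its numbers should not be expected to match this sketch.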