Model Segmentation with Cubist

[This article was first published on Yet Another Blog in Statistical Computing » S+/R, and kindly contributed to R-bloggers.]

Cubist is a tree-based model with an OLS regression attached to each terminal node, and is somewhat similar to the mob() function in the party package (https://statcompute.wordpress.com/2014/10/26/model-segmentation-with-recursive-partitioning). Below is a demonstration of the cubist() model with the classic Boston housing data.

# Load the required packages (MASS supplies the Boston data, caret supplies varImp)
pkgs <- c('MASS', 'Cubist', 'caret')
lapply(pkgs, require, character.only = TRUE)

data(Boston)
X <- Boston[, 1:13]       # the 13 predictors
Y <- log(Boston[, 14])    # log of medv, the response

### TRAIN THE MODEL ###
mdl <- cubist(x = X, y = Y,
              control = cubistControl(unbiased = TRUE, label = "log_medv",
                                      seed = 2015, rules = 5))
summary(mdl)
#  Rule 1: [94 cases, mean 2.568824, range 1.609438 to 3.314186, est err 0.180985]
#
#    if
#	nox > 0.671
#    then
#	log_medv = 1.107315 + 0.588 dis + 2.92 nox - 0.0287 lstat - 0.2 rm
#	           - 0.0065 crim
#
#  Rule 2: [39 cases, mean 2.701933, range 1.94591 to 3.314186, est err 0.202473]
#
#    if
#	nox <= 0.671
#	lstat > 19.01
#    then
#	log_medv = 3.935974 - 1.68 nox - 0.0076 lstat + 0.0035 rad - 0.00017 tax
#	           - 0.013 dis - 0.0029 crim + 0.034 rm - 0.011 ptratio
#	           + 0.00015 black + 0.0003 zn
#
#  Rule 3: [200 cases, mean 2.951007, range 2.116256 to 3.589059, est err 0.100825]
#
#    if
#	rm <= 6.232
#	dis > 1.8773
#    then
#	log_medv = 2.791381 + 0.152 rm - 0.0147 lstat + 0.00085 black
#	           - 0.031 dis - 0.027 ptratio - 0.0017 age + 0.0031 rad
#	           - 0.00013 tax - 0.0025 crim - 0.12 nox + 0.0002 zn
#
#  Rule 4: [37 cases, mean 3.038195, range 2.341806 to 3.912023, est err 0.184200]
#
#    if
#	dis <= 1.8773
#	lstat <= 19.01
#    then
#	log_medv = 5.668421 - 1.187 dis - 0.0469 lstat - 0.0122 crim
#
#  Rule 5: [220 cases, mean 3.292121, range 2.261763 to 3.912023, est err 0.093716]
#
#    if
#	rm > 6.232
#	lstat <= 19.01
#    then
#	log_medv = 2.419507 - 0.033 lstat + 0.238 rm - 0.0089 crim + 0.0082 rad
#	           - 0.029 dis - 0.00035 tax + 0.0006 black - 0.024 ptratio
#	           - 0.0006 age - 0.12 nox + 0.0002 zn
#
# Evaluation on training data (506 cases):
#
#    Average  |error|           0.100444
#    Relative |error|               0.33
#    Correlation coefficient        0.94
#
#	Attribute usage:
#	  Conds  Model
#
#	   71%    94%    rm
#	   50%   100%    lstat
#	   40%   100%    dis
#	   23%    94%    nox
#	         100%    crim
#	          78%    zn
#	          78%    rad
#	          78%    tax
#	          78%    ptratio
#	          78%    black
#	          71%    age
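
Each rule in the summary pairs a set of conditions with its own linear equation. As a minimal sketch of how to read one, the snippet below applies Rule 4 by hand to a made-up observation. The values of dis, lstat, and crim are hypothetical, not a row from Boston. It uses the rounded coefficients as printed, so it only approximates what predict() would return for such a case. It also looks at a single rule in isolation, whereas Cubist is generally described as averaging the predictions when an observation satisfies more than one rule.

```r
# Hypothetical observation (illustrative values only, not taken from Boston)
obs <- data.frame(dis = 1.5, lstat = 10, crim = 5)

# Conditions of Rule 4 as printed in the summary
rule4_applies <- obs$dis <= 1.8773 & obs$lstat <= 19.01

# Linear model of Rule 4, using the rounded coefficients from the summary
log_medv_hat <- 5.668421 - 1.187 * obs$dis - 0.0469 * obs$lstat - 0.0122 * obs$crim

rule4_applies       # TRUE: both conditions hold for this observation
log_medv_hat        # about 3.358 on the log scale
exp(log_medv_hat)   # back-transformed to the original medv scale
```

Because the model was trained on log(medv), the exp() call is needed to put the prediction back on the original scale.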

### VARIABLE IMPORTANCE ###
varImp(mdl)
#        Overall
# rm         82.5
# lstat      75.0
# dis        70.0
# nox        58.5
# crim       50.0
# zn         39.0
# rad        39.0
# tax        39.0
# ptratio    39.0
# black      39.0
# age        35.5
# indus       0.0
# chas        0.0
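
The importance scores line up with the attribute usage table in the summary: each Overall value equals the simple average of the Conds and Model percentages, which appears consistent with an equal weighting of the two in caret's varImp() for Cubist models. The sketch below re-enters those percentages by hand (copied from the output above, with blank cells treated as 0) and reproduces the Overall column in base R. The data frame and its column names are illustrative only.

```r
# Attribute usage copied from summary(mdl); blank cells are entered as 0
usage <- data.frame(
  variable = c("rm", "lstat", "dis", "nox", "crim", "zn", "rad",
               "tax", "ptratio", "black", "age", "indus", "chas"),
  conds    = c(71, 50, 40, 23, 0, 0, 0, 0, 0, 0, 0, 0, 0),
  model    = c(94, 100, 100, 94, 100, 78, 78, 78, 78, 78, 71, 0, 0)
)

# Equal-weight average of usage in rule conditions and in the linear models
usage$overall <- 0.5 * usage$conds + 0.5 * usage$model

# Sort from most to least important, as varImp() output is usually read
usage[order(-usage$overall), c("variable", "overall")]
```

Read this way, indus and chas score 0 because they appear neither in any rule condition nor in any of the five linear models.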
