Classification Trees using the rpart function

[This article was first published on Software for Exploratory Data Analysis and Statistical Modelling, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

In a previous post on classification trees we considered using the tree package to fit a classification tree to data divided into known classes. In this post we will look at the alternative function rpart that is available within the base R distribution.

Fast Tube
Fast Tube by Casper

A classification tree can be fitted using the rpart function using a similar syntax to the tree function. For the ecoli data set discussed in the previous post we would use:

> require(rpart)
> ecoli.df = read.csv("ecoli.txt")

followed by

> ecoli.rpart1 = rpart(class ~ mcv + gvh + lip + chg + aac + alm1 + alm2, 
  data = ecoli.df)

We would then consider whether the tree could be simplified by pruning and make use of the plotcp function:

> plotcp(ecoli.rpart1)

Once the amount of pruning has been determined from this graph or by looking at the output from the printcp function:

> printcp(ecoli.rpart1)
 
Classification tree:
rpart(formula = class ~ mcv + gvh + lip + chg + aac + alm1 + 
    alm2, data = ecoli.df)
 
Variables actually used in tree construction:
[1] aac  alm1 gvh  mcv 
 
Root node error: 193/336 = 0.5744
 
n= 336 
 
        CP nsplit rel error  xerror     xstd
1 0.388601      0   1.00000 1.00000 0.046959
2 0.207254      1   0.61140 0.61658 0.045423
3 0.062176      2   0.40415 0.45596 0.041758
4 0.051813      3   0.34197 0.38342 0.039359
5 0.031088      4   0.29016 0.36269 0.038571
6 0.015544      5   0.25907 0.30570 0.036136
7 0.010000      6   0.24352 0.31088 0.036375

The prune function is used to simplify the tree based on a cp identified from the graph or printed output threshold.

> ecoli.rpart2 = prune(ecoli.rpart1, cp = 0.02)

The classification tree can be visualised with the plot function and then the text function adds labels to the graph:

> plot(ecoli.rpart2, uniform = TRUE)
> text(ecoli.rpart2, use.n = TRUE, cex = 0.75)

Other useful resources are provided on the Supplementary Material page.

To leave a comment for the author, please follow the link and comment on their blog: Software for Exploratory Data Analysis and Statistical Modelling.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)