[This article was first published on Fear and Loathing in Data Science, and kindly contributed to R-bloggers.]
Last week I dabbled in building classification trees with the party and rpart packages. Now I want to put together a series where I apply those basic trees along with advanced techniques like bagging, boosting, and random forests. Additionally, I've taken the leap and started using the RStudio interface. I must say I'm quite pleased with the features, and once you get comfortable it can be quite a time-saver. Have a look at http://www.rstudio.com/.
Having worked in oncology (sales rep and market research) in a previous lifetime, I wanted to apply the techniques to the Wisconsin Breast Cancer data set, available at the UC-Irvine Machine Learning Repository. Here is the direct link: http://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Original%29 (the repository also hosts Diagnostic and Prognostic variants; the 699-observation set used below is the Original one).
All I intend to do here in this entry, is get the data loaded, create test and training data sets, build a classification tree model using the “party” package and finally, write the code to be able to produce ROC and Lift charts. In further iterations, I will create additional tree models on this data set and examine which method is the best at predicting whether a tumor is benign or malignant.
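For anyone following along, here is one hedged sketch of how the raw file can be pulled into R; the file name and column mapping are my assumptions and should be checked against the repository's data folder:

```r
# Read the raw comma-separated file from the UCI repository
# (file name assumed -- verify against the repository listing)
url <- "http://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data"
breastcancer <- read.csv(url, header = FALSE, na.strings = "?")
names(breastcancer) <- c("ID", "thickness", "Usize", "Ushape", "adhesion",
                         "SECS", "nuclei", "chromatin", "nucleoli",
                         "mitoses", "class")
# In the raw file the class is coded 2 = benign, 4 = malignant
breastcancer$class <- factor(breastcancer$class, levels = c(2, 4),
                             labels = c("benign", "malignant"))
```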
This code drops the annoying ID variable, which is of no use, and then eliminates the handful of rows with missing values, which are of no consequence.
> #dropping ID variable
> breastcancer$ID = NULL
> str(breastcancer)
'data.frame': 699 obs. of 10 variables:
$ thickness: int 8 6 1 1 1 5 3 3 3 8 ...
$ Usize : int 4 6 1 1 1 1 1 1 1 8 ...
$ Ushape : int 5 6 1 3 2 1 4 1 3 8 ...
$ adhesion : int 1 9 1 1 1 1 1 1 1 1 ...
$ SECS : int 2 6 1 2 3 2 2 2 2 2 ...
$ nuclei : int NA NA NA NA NA NA NA NA NA NA ...
$ chromatin: int 7 7 2 2 1 3 3 3 2 6 ...
$ nucleoli : int 3 8 1 1 1 1 1 1 1 10 ...
$ mitoses : int 1 1 1 1 1 1 1 1 1 1 ...
$ class : Factor w/ 2 levels "benign","malignant": 2 1 1 1 1 1 1 1 1 2 ...
> bcancer = na.omit(breastcancer) # delete missing observations
> str(bcancer)
'data.frame': 683 obs. of 10 variables:
$ thickness: int 5 8 1 10 7 5 6 5 5 9 ...
$ Usize : int 4 10 1 7 3 4 10 2 3 4 ...
$ Ushape : int 4 10 1 7 2 4 10 3 4 5 ...
$ adhesion : int 5 8 1 6 10 9 2 1 1 10 ...
$ SECS : int 7 7 2 4 5 2 8 6 8 6 ...
$ nuclei : int 10 10 10 10 10 10 10 10 10 10 ...
$ chromatin: int 3 9 3 4 5 5 7 5 4 4 ...
$ nucleoli : int 2 7 1 1 4 6 3 1 9 8 ...
$ mitoses : int 1 1 1 2 4 1 3 1 1 1 ...
$ class : Factor w/ 2 levels "benign","malignant": 1 2 1 2 2 2 2 2 2 2 ...
- attr(*, "na.action")=Class 'omit' Named int [1:16] 1 2 3 4 5 6 7 8 9 10 ...
.. ..- attr(*, "names")= chr [1:16] "1" "2" "3" "4" ...
Now we create the train and test sets…
> set.seed(1234)
> ind = sample(2,nrow(bcancer), replace=TRUE, prob=c(0.7, 0.3))
> train = bcancer[ind==1,]
> test = bcancer[ind==2,]
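A quick sanity check (not in the original workflow, just a sketch) confirms that the roughly 70/30 split preserved the class balance and that nothing was lost along the way:

```r
# Proportions of benign vs. malignant in each partition
round(prop.table(table(train$class)), 3)
round(prop.table(table(test$class)), 3)
# The two partitions should add back up to the full data set
nrow(train) + nrow(test) == nrow(bcancer)
```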
Then, we roll up our sleeves, pour a glass of Bourbon and get to work…
> #conditional inference tree with the "party" package
> library(party)
> train_tree = ctree(class ~ ., data=train)
> table(predict(train_tree), train$class)
benign malignant
benign 289 6
malignant 12 165
So, the model predicts that 177 patients (12 + 165) had malignant tumors, whereas in reality it was only 171 (6 + 165).
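The same confusion matrix can be boiled down to an overall training accuracy; a quick sketch:

```r
# Rows are predictions, columns are actual classes
cm <- table(predict(train_tree), train$class)
# Correct predictions sit on the diagonal
sum(diag(cm)) / sum(cm)  # (289 + 165) / 472, roughly 0.96
```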
Here is a plot of the tree:
> plot(train_tree, type="simple")
OK, now it is time to see how we did on the test data…
> testpred = predict(train_tree, newdata=test)
> table(testpred, test$class)
testpred benign malignant
benign 138 3
malignant 5 65
Quite consistent with what we saw on the training data.
Finally, I will start the code that lets me compare multiple methods. We first have to gin up the predicted probabilities for the test set from our model with this annoying bit of code (this will be easier with other methods).
> test$probabilities = 1 - unlist(treeresponse(train_tree, newdata=test), use.names=F)[seq(1,nrow(test)*2,2)]
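If that indexing trick looks opaque: treeresponse() returns one probability vector per observation, in factor-level order (benign first, malignant second), so the same malignant-class probabilities can be pulled more readably with sapply:

```r
# Extract P(malignant) for each test observation
probs_list <- treeresponse(train_tree, newdata = test)
test$probabilities <- sapply(probs_list, function(p) p[2])
```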
> library(ROCR) #loading the package to help us calculate and visualize the classifiers
> pred = prediction(test$probabilities, test$class) #create an object of the predicted probabilities
#create and plot the ROC chart with true positive rate and false positive rate
> perf = performance(pred, "tpr", "fpr")
> plot(perf, main="ROC", colorize=T)
#create and plot the lift chart with lift and rate of positive predictions
> perf_lift = performance(pred, "lift", "rpp")
> plot(perf_lift, main="lift chart", colorize=T)
Oh, and for good measure I decided to include the area under the curve…
> auc.curve = performance(pred, "auc")
> auc.curve
An object of class "performance"
Slot "x.name":
[1] "None"
Slot "y.name":
[1] "Area under the ROC curve"
Slot "alpha.name":
[1] "none"
Slot "x.values":
list()
Slot "y.values":
[[1]]
[1] 0.9854484
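Since performance() returns an S4 object, the bare AUC number lives in the y.values slot:

```r
# Pull the numeric AUC out of the S4 performance object
auc <- unlist(auc.curve@y.values)
auc  # 0.9854484, matching the print-out above
```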
Next time I will introduce random forest and bagging and/or boosting, based on my further research into the subject.
Cheers,
Meister