R talks to Weka about Data Mining
R provides us with excellent resources to mine data, and there are some good overviews out there:
- Yanchang’s website with Examples and a nice reference card
- The rattle-package that introduces a nice GUI for R, and Graham Williams’s compendium of tools
- The caret-package that offers a unified interface to running a multitude of model builders.
And there are other tools out there for data mining, like Weka.
Weka has a GUI, can also be driven from the command line via Java, and ships with a large variety of algorithms. If, for whatever reason, the algorithm you need is not implemented in R, Weka might be the place to go. The RWeka package marries R and Weka.
I am not an expert in R, in Weka, or in data mining. But I do play around with them, and I’d like to share a starter on how to work with them. There is good documentation out there (e.g. Open-Source Machine Learning: R Meets Weka or RWeka Odds and Ends), but sometimes you want to document your own steps and ways of working, and that is what I do here.
So, I want to build a classification model for the iris dataset, based on a tree classifier. My choice is the C4.5 algorithm, which I did not find implemented in any standard R package (can anybody help me out?).
We want to predict the species of a flower from its attributes, namely sepal and petal width and length. The three species we have are “setosa”, “versicolor” and “virginica”. A short summary is given below.
summary(iris)
##   Sepal.Length   Sepal.Width    Petal.Length   Petal.Width
##  Min.   :4.30   Min.   :2.00   Min.   :1.00   Min.   :0.1
##  1st Qu.:5.10   1st Qu.:2.80   1st Qu.:1.60   1st Qu.:0.3
##  Median :5.80   Median :3.00   Median :4.35   Median :1.3
##  Mean   :5.84   Mean   :3.06   Mean   :3.76   Mean   :1.2
##  3rd Qu.:6.40   3rd Qu.:3.30   3rd Qu.:5.10   3rd Qu.:1.8
##  Max.   :7.90   Max.   :4.40   Max.   :6.90   Max.   :2.5
##        Species
##  setosa    :50
##  versicolor:50
##  virginica :50
Prediction with J48 (aka C4.5)
We next load the RWeka package.
library(RWeka)
We now build the classifier, and this works with the J48(.)-function:
iris_j48 <- J48(Species ~ ., data = iris)
iris_j48
## J48 pruned tree
## ------------------
##
## Petal.Width <= 0.6: setosa (50.0)
## Petal.Width > 0.6
## |   Petal.Width <= 1.7
## |   |   Petal.Length <= 4.9: versicolor (48.0/1.0)
## |   |   Petal.Length > 4.9
## |   |   |   Petal.Width <= 1.5: virginica (3.0)
## |   |   |   Petal.Width > 1.5: versicolor (3.0/1.0)
## |   Petal.Width > 1.7: virginica (46.0/1.0)
##
## Number of Leaves  : 5
##
## Size of the tree : 9
summary(iris_j48)
##
## === Summary ===
##
## Correctly Classified Instances         147               98      %
## Incorrectly Classified Instances         3                2      %
## Kappa statistic                          0.97
## Mean absolute error                      0.0233
## Root mean squared error                  0.108
## Relative absolute error                  5.2482 %
## Root relative squared error             22.9089 %
## Coverage of cases (0.95 level)          98.6667 %
## Mean rel. region size (0.95 level)      34      %
## Total Number of Instances              150
##
## === Confusion Matrix ===
##
##   a  b  c   <-- classified as
##  50  0  0 |  a = setosa
##   0 49  1 |  b = versicolor
##   0  2 48 |  c = virginica
plot(iris_j48)
We can assign the model to an object. Printing the object gives us the tree in Weka-style output, summary(.) gives us the summary of the classification on the training set (again, in Weka style), and plot(.) lets us plot the tree nicely.
Evaluation in Weka
Well, we used the whole dataset now for training, but we actually might want to perform cross-validation. This can be done like this:
eval_j48 <- evaluate_Weka_classifier(iris_j48, numFolds = 10, complexity = FALSE,
    seed = 1, class = TRUE)
eval_j48
## === 10 Fold Cross Validation ===
##
## === Summary ===
##
## Correctly Classified Instances         144               96      %
## Incorrectly Classified Instances         6                4      %
## Kappa statistic                          0.94
## Mean absolute error                      0.035
## Root mean squared error                  0.1586
## Relative absolute error                  7.8705 %
## Root relative squared error             33.6353 %
## Coverage of cases (0.95 level)          96.6667 %
## Mean rel. region size (0.95 level)      33.7778 %
## Total Number of Instances              150
##
## === Detailed Accuracy By Class ===
##
##                  TP Rate  FP Rate  Precision  Recall   F-Measure  MCC      ROC Area  PRC Area  Class
##                  0.980    0.000    1.000      0.980    0.990      0.985    0.990     0.987     setosa
##                  0.940    0.030    0.940      0.940    0.940      0.910    0.952     0.880     versicolor
##                  0.960    0.030    0.941      0.960    0.950      0.925    0.961     0.905     virginica
## Weighted Avg.    0.960    0.020    0.960      0.960    0.960      0.940    0.968     0.924
##
## === Confusion Matrix ===
##
##   a  b  c   <-- classified as
##  49  1  0 |  a = setosa
##   0 47  3 |  b = versicolor
##   0  2 48 |  c = virginica
We see slightly worse results now (96 % instead of 98 % correctly classified), as you would expect.
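To make the summary figures less of a black box, here is a small sketch in base R that recomputes the accuracy and the kappa statistic from the cross-validation confusion matrix printed above. The numbers are typed in by hand from that output, and the object names (conf, accuracy, kappa_stat) are mine, not something RWeka provides.

```r
# Confusion matrix copied by hand from the 10-fold CV output
# (rows = actual species, columns = predicted species).
species <- c("setosa", "versicolor", "virginica")
conf <- matrix(c(49,  1,  0,
                  0, 47,  3,
                  0,  2, 48),
               nrow = 3, byrow = TRUE,
               dimnames = list(actual = species, predicted = species))

n <- sum(conf)

# Observed agreement: share of instances on the diagonal.
accuracy <- sum(diag(conf)) / n

# Agreement expected by chance, from the row and column totals.
expected <- sum(rowSums(conf) * colSums(conf)) / n^2

# Cohen's kappa: agreement beyond chance, scaled to [.., 1].
kappa_stat <- (accuracy - expected) / (1 - expected)

round(c(accuracy = accuracy, kappa = kappa_stat), 2)
## accuracy    kappa
##     0.96     0.94
```

Both values agree with the "Correctly Classified Instances" and "Kappa statistic" lines of the Weka summary.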
Using Weka-controls
We used the standard options for the J48 classifier, but Weka allows more. You can access these with the WOW function:
WOW("J48")
## -U      Use unpruned tree.
## -O      Do not collapse tree.
## -C <pruning confidence>
##         Set confidence threshold for pruning.  (default 0.25)
##         Number of arguments: 1.
## -M <minimum number of instances>
##         Set minimum number of instances per leaf.  (default 2)
##         Number of arguments: 1.
## -R      Use reduced error pruning.
## -N <number of folds>
##         Set number of folds for reduced error pruning. One fold is
##         used as pruning set.  (default 3)
##         Number of arguments: 1.
## -B      Use binary splits only.
## -S      Don't perform subtree raising.
## -L      Do not clean up after the tree has been built.
## -A      Laplace smoothing for predicted probabilities.
## -J      Do not use MDL correction for info gain on numeric
##         attributes.
## -Q <seed>
##         Seed for random data shuffling (default 1).
##         Number of arguments: 1.
If, for example, we want to use a tree with minimum 10 instances in each leaf, we change the command as follows:
j48_control <- J48(Species ~ ., data = iris, control = Weka_control(M = 10))
j48_control
## J48 pruned tree
## ------------------
##
## Petal.Width <= 0.6: setosa (50.0)
## Petal.Width > 0.6
## |   Petal.Width <= 1.7: versicolor (54.0/5.0)
## |   Petal.Width > 1.7: virginica (46.0/1.0)
##
## Number of Leaves  : 3
##
## Size of the tree : 5
And you see the tree is different (well, it just does not go as deep as the other one).
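Options can also be combined. The following is only a sketch of how that might look: flags that take no argument in the WOW listing (such as -U) are switched on with TRUE, while options with an argument (such as -M) get their value directly. The object name j48_unpruned and the value M = 5 are arbitrary choices of mine for illustration.

```r
library(RWeka)

# -U (no argument): use an unpruned tree, so it is passed as U = TRUE.
# -M (one argument): minimum number of instances per leaf, here 5.
j48_unpruned <- J48(Species ~ ., data = iris,
                    control = Weka_control(U = TRUE, M = 5))

# Printing shows the resulting tree in Weka-style output.
j48_unpruned
```

Comparing the printed tree with iris_j48 and j48_control is a quick way to see what each option does.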
Building cost-sensitive classifiers
You might want to include a cost matrix, i.e. penalize some wrong classifications more than others. If, for example, you think misclassifying a versicolor is very harmful, you can penalize that error easily; you just have to choose a different classifier, namely the “cost-sensitive classifier” in Weka:
csc <- CostSensitiveClassifier(Species ~ ., data = iris,
    control = Weka_control(`cost-matrix` = matrix(c(0, 10, 0, 0, 0, 0, 0, 10, 0), ncol = 3),
                           W = "weka.classifiers.trees.J48",
                           M = TRUE))
But you have to tell the cost-sensitive classifier that you want to use J48 as the algorithm, and you have to give it the cost matrix you want to apply, namely a matrix of the form
matrix(c(0, 1, 0, 0, 0, 0, 0, 1, 0), ncol = 3)
##      [,1] [,2] [,3]
## [1,]    0    0    0
## [2,]    1    0    1
## [3,]    0    0    0
where the nonzero entries (10 in the call above, 1 in this illustration) penalize “versicolor” being falsely classified as one of the others.
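Because matrix(.) fills column by column, it is easy to lose track of which cell means what. The sketch below labels the cost matrix in base R. It assumes Weka's convention that rows stand for the actual class and columns for the predicted class; the dimnames are my own addition for readability and are not passed to Weka in the call above.

```r
# Same values as in the CostSensitiveClassifier call, filled column-wise.
species <- levels(iris$Species)
cost <- matrix(c(0, 10, 0, 0, 0, 0, 0, 10, 0), ncol = 3,
               dimnames = list(actual = species, predicted = species))
cost

# Cost of each possible prediction for an actual versicolor:
# 10 for setosa, 0 for versicolor (correct), 10 for virginica.
cost["versicolor", ]

# All other misclassifications carry a cost of 0 in this matrix.
sum(cost[c("setosa", "virginica"), ])
```

Note that only the versicolor row is nonzero, so under this matrix no other kind of error is penalized at all.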
And again we evaluate on 10-fold CV:
eval_csc <- evaluate_Weka_classifier(csc, numFolds = 10, complexity = FALSE,
    seed = 1, class = TRUE)
eval_csc
## === 10 Fold Cross Validation ===
##
## === Summary ===
##
## Correctly Classified Instances          98               65.3333 %
## Incorrectly Classified Instances        52               34.6667 %
## Kappa statistic                          0.48
## Mean absolute error                      0.2311
## Root mean squared error                  0.4807
## Relative absolute error                 52      %
## Root relative squared error            101.9804 %
## Coverage of cases (0.95 level)          65.3333 %
## Mean rel. region size (0.95 level)      33.3333 %
## Total Number of Instances              150
##
## === Detailed Accuracy By Class ===
##
##                  TP Rate  FP Rate  Precision  Recall   F-Measure  MCC      ROC Area  PRC Area  Class
##                  0.980    0.070    0.875      0.980    0.925      0.887    0.955     0.864     setosa
##                  0.980    0.450    0.521      0.980    0.681      0.517    0.765     0.518     versicolor
##                  0.000    0.000    0.000      0.000    0.000      0.000    0.500     0.333     virginica
## Weighted Avg.    0.653    0.173    0.465      0.653    0.535      0.468    0.740     0.572
##
## === Confusion Matrix ===
##
##   a  b  c   <-- classified as
##  49  1  0 |  a = setosa
##   1 49  0 |  b = versicolor
##   6 44  0 |  c = virginica
We see that the “versicolors” are now better predicted (only one wrong, compared with 3 in the plain J48 earlier). But this happened at the expense of “virginica”: all 50 are now misclassified (6 as setosa and 44 as versicolor), compared with 2 before. The cost matrix charges nothing for those errors, so the classifier has no reason to avoid them.
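To put the trade-off in numbers, the following base-R sketch tallies the total misclassification cost and the per-class recall for both cross-validation runs. The confusion matrices are copied by hand from the two outputs above, the helper as_conf is hypothetical (my own, not part of RWeka), and the cost matrix is again read as rows = actual, columns = predicted.

```r
species <- c("setosa", "versicolor", "virginica")

# Hypothetical helper: build a labelled confusion matrix from row-wise counts.
as_conf <- function(x) {
  matrix(x, nrow = 3, byrow = TRUE,
         dimnames = list(actual = species, predicted = species))
}

conf_j48 <- as_conf(c(49, 1, 0,   0, 47, 3,   0, 2, 48))  # plain J48
conf_csc <- as_conf(c(49, 1, 0,   1, 49, 0,   6, 44, 0))  # cost-sensitive

# Cost matrix from the CostSensitiveClassifier call.
cost <- matrix(c(0, 10, 0, 0, 0, 0, 0, 10, 0), ncol = 3)

# Total cost: each cell count times the cost of that kind of error.
total_cost <- c(J48 = sum(conf_j48 * cost), CSC = sum(conf_csc * cost))
total_cost

# Recall per class: correct predictions divided by actual class size.
recall <- rbind(J48 = diag(conf_j48) / rowSums(conf_j48),
                CSC = diag(conf_csc) / rowSums(conf_csc))
recall
```

The total cost drops from 30 to 10, which is what the cost-sensitive classifier optimizes, while the recall values reproduce the "TP Rate" column of each output and show what was given up in return.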
Alright, this is just a short starter. I suggest you check out the very good introductions I referred to earlier to explore the full wealth of RWeka… Have fun!