Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
One great thing about R is that has a wide diversity of packages written by many different people of many different viewpoints on how software should be designed. However, this does tend to bite us periodically.
When I teach newcomers about R and predictive modeling, I have a slide that illustrates one of the weaknesses of this system: heterogeneous interfaces. If you are building a classification model and want to generate class probabilities for new samples, the syntax can be… diverse. Here is a sample of syntax for different models:
That’s a lot of minutia to remember. I did a quick and dirty census of all the classification models used by caret to quantify the variability in this particular syntax. The train
utilizes 64 different models that can produce class probabilities. Of these, many were from the same package. For example, both nnet
and multinom
are in the nnet package and probably should not count twice since the latter is a wrapper for the former. As another example, the RWeka packages has at least six functions that all use probability
as the value for type
.
For this reason, I cooked the numbers down to one value of type
per package (using majority vote if there was more than one). There were 40 different packages once these redundancies were eliminated. Here is a histogram of the type
values for calculating probabilities:
The most frequent situation is no type
value at all. For example, the lda
package automatically generated predicted classes and posterior probabilities without requiring the user to specify anything. There were a handful of cases where the class did not have a predict
method to generate class probabilities (e.g. party and pamr) and these also counted as “none”.
For those of us that use R to create predictive models on a day-to-day basis, this is a lot of detail to remember (especially if we want to try different models). This is one of the reasons I created caret; it has a unified interface to models that eliminates the need to remember the name of the function, the value of type
and any other arguments. In case you are wondering, I chose `type = “prob”‘.
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.