Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Lately I have been rather productive in my programming and frustrated at the same time. Trying to create a demographics summary table proved to be a lesson in frustration with R. Since I love R, this was disheartening. I did eventually find the reporttools package, which does make a great LaTeX table, but only in LaTeX. The tables package also looks great, but it was not entirely what I was looking for, so I did the first logical thing for an R user faced with this sort of thing: I created a package to fill in the missing functionality.
The dostats package/function
The new package is dostats. The package has two purposes:
- Create summaries of vectors through the dostats function.
- Manipulate functions.
The package started out with the dostats function for creating more informative summary tables. It works much like tabular from the tables package, but it is designed to work with plyr functions. The idea is to pass in a vector as the first argument; the remaining arguments are functions that compute statistics on that vector. For example:
library(dostats)
set.seed(20120220)
dostats(rnorm(100), mean, sd, N = length)
##     mean     sd   N
## 1 0.0775 0.8975 100
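To make the idea concrete, here is a minimal base-R sketch of applying several functions to one vector and collecting the results in a one-row data frame. The helper name summarise_vector is hypothetical and this is not the package's implementation, only an illustration of the pattern under the assumption that each function returns a single value.

```r
# Hypothetical helper, not the dostats implementation: apply each function
# passed in `...` to `x` and collect the results in a one-row data frame.
summarise_vector <- function(x, ...) {
  funs <- list(...)
  # Unnamed arguments take their label from the expression that was passed in.
  labels <- vapply(
    as.list(substitute(list(...)))[-1],
    function(e) paste(deparse(e), collapse = ""),
    character(1)
  )
  given <- names(funs)
  if (is.null(given)) given <- rep("", length(funs))
  # Named arguments (e.g. N = length) keep the name the caller chose.
  names(funs) <- ifelse(given == "", labels, given)
  as.data.frame(lapply(funs, function(f) f(x)))
}

result <- summarise_vector(c(2, 4, 6, 8), mean, sd, N = length)
print(result)
```

The result has one column per function, so several such rows can be stacked into a summary table.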
A renaming construct is also built in: a named argument, such as N = length above, sets the name of the resulting column. This is convenient because dostats can then be passed directly as an argument to ldply, such as
library(plyr)
ldply(mtcars, dostats, mean, sd, IQR)
##     .id     mean       sd     IQR
## 1   mpg  20.0906   6.0269   7.375
## 2   cyl   6.1875   1.7859   4.000
## 3  disp 230.7219 123.9387 205.175
## 4    hp 146.6875  68.5629  83.500
## 5  drat   3.5966   0.5347   0.840
## 6    wt   3.2172   0.9785   1.029
## 7  qsec  17.8487   1.7869   2.008
## 8    vs   0.4375   0.5040   1.000
## 9    am   0.4062   0.4990   1.000
## 10 gear   3.6875   0.7378   1.000
## 11 carb   2.8125   1.6152   2.000
This makes for a more logical summary data.frame object that has usable columns, each with the same data type. Unfortunately this does not work for every data set. The above example only has numerical data. Any data frame with categorical data would have those columns run through the same functions. Another limitation is that the results of each function must have the same dimension for each variable. For this reason I introduced functions that filter by the variable class.
- class.stats creates a dostats function for a given class, tested by inherits.
- integer.stats is a predefined class stats for integer variables. This is defined as class.stats('integer').
- numeric.stats for numeric variables, which would also include integer variables.
- factor.stats for factors.
When a class.stats function is passed to ldply, variables not matching that class are silently removed.
ldply(iris, numeric.stats, mean, sd)
##            .id  mean     sd
## 1 Sepal.Length 5.843 0.8281
## 2  Sepal.Width 3.057 0.4359
## 3 Petal.Length 3.758 1.7653
## 4  Petal.Width 1.199 0.7622
ldply(iris, factor.stats, N = length)
##       .id   N
## 1 Species 150
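The filtering idea can be sketched in base R without plyr. The function stats_for_class below is a hypothetical stand-in, not the package's class.stats; it assumes every statistic is passed as a named argument and returns NULL for columns of another class so that they can be dropped before the rows are combined.

```r
# Hypothetical sketch of class filtering; `stats_for_class` is not the
# package's function, only an illustration of the idea.
stats_for_class <- function(class_name, ...) {
  funs <- list(...)  # statistics must be named here, e.g. N = length
  function(x) {
    # Return NULL for columns of another class so they can be dropped later.
    if (!inherits(x, class_name)) return(NULL)
    as.data.frame(lapply(funs, function(f) f(x)))
  }
}

factor_summary <- stats_for_class("factor", N = length, levels = nlevels)
per_column <- lapply(iris, factor_summary)

# Drop the NULL entries, then stack the remaining one-row data frames.
kept <- Filter(Negate(is.null), per_column)
result <- do.call(rbind, kept)
print(result)
```

Only the Species column of iris is a factor, so only that column survives the filter.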
You can also chain these calls together to compute on subsets using ddply and ldply.
ddply(iris, .(Species), ldply, numeric.stats, mean, median, sd)
##       Species          .id  mean median     sd
## 1      setosa Sepal.Length 5.006   5.00 0.3525
## 2      setosa  Sepal.Width 3.428   3.40 0.3791
## 3      setosa Petal.Length 1.462   1.50 0.1737
## 4      setosa  Petal.Width 0.246   0.20 0.1054
## 5  versicolor Sepal.Length 5.936   5.90 0.5162
## 6  versicolor  Sepal.Width 2.770   2.80 0.3138
## 7  versicolor Petal.Length 4.260   4.35 0.4699
## 8  versicolor  Petal.Width 1.326   1.30 0.1978
## 9   virginica Sepal.Length 6.588   6.50 0.6359
## 10  virginica  Sepal.Width 2.974   3.00 0.3225
## 11  virginica Petal.Length 5.552   5.55 0.5519
## 12  virginica  Petal.Width 2.026   2.00 0.2747
Function manipulations
Passing all these functions around also requires some extra function-manipulation functions. That is a mouthful, but it is something we do in R.
Composition
R lacks a function composition function, so I created one. function(x)any(is.na(x)) is just too long to type, and I find myself doing things like this far too often. The word “function” alone takes up lots of space. It is much easier to write any%.%is.na or compose(any, is.na), either of which creates a new function that tests whether there are any missing values. The two forms are:
- compose(...)
- fun1%.%fun2
compose takes any number of arguments and nests them, with the rightmost being the innermost and the leftmost being the outermost. The easy way to remember this is that they read in the same order as when they were written out as nested calls.
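The nesting order can be illustrated with a short base-R sketch. The name compose_sketch is hypothetical and this is not the package's code; it assumes each function takes the previous result as its only argument.

```r
# Hypothetical base-R sketch; `compose_sketch` is not the package's code.
compose_sketch <- function(...) {
  funs <- list(...)
  function(x) {
    # Apply the rightmost function first, then work outwards to the left.
    Reduce(function(value, f) f(value), rev(funs), x)
  }
}

any_missing <- compose_sketch(any, is.na)
with_na <- any_missing(c(1, NA, 3))  # same as any(is.na(c(1, NA, 3)))
without_na <- any_missing(1:3)       # same as any(is.na(1:3))
print(c(with_na, without_na))
```

Reversing the list before folding is what makes the rightmost function the innermost call.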
Argument Manipulations
Composition and dostats only operate on the first argument, which necessitates functions for manipulating arguments.
- wargs: creates a new function with changed defaults. For example, wargs(mean, na.rm=TRUE) creates a new function that automatically removes missing values.
- onarg: specifies the first argument for the function. For example, onarg(rep, 'times') makes the number of times to repeat the first argument.
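Both ideas can be approximated in base R with do.call. The names with_args and on_arg below are hypothetical and are not the package's implementation. Note one simplification: with_args fixes the supplied arguments outright rather than changing defaults that a caller could still override.

```r
# Hypothetical sketches; neither function is the package's implementation.

# Fix some arguments ahead of time, in the spirit of wargs.
with_args <- function(f, ...) {
  fixed <- list(...)
  function(...) do.call(f, c(list(...), fixed))
}

# Route the first positional argument to a named parameter, in the spirit of onarg.
on_arg <- function(f, arg_name) {
  function(x, ...) do.call(f, c(setNames(list(x), arg_name), list(...)))
}

mean_complete <- with_args(mean, na.rm = TRUE)
complete_mean <- mean_complete(c(1, NA, 3))  # calls mean(c(1, NA, 3), na.rm = TRUE)

rep_times <- on_arg(rep, "times")
repeated <- rep_times(3, "a")                # calls rep(times = 3, "a")

print(complete_mean)
print(repeated)
```

Wrappers like these let a function whose interesting parameter is not first be used wherever only the first argument is supplied.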
One example of this that is included in dostats is contains and %contains%, which take their arguments in the reverse order of %in%.
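As a rough illustration of what reversing %in% means, here is a hypothetical operator defined in base R. The name %contains_sketch% is made up for this example, and the package's own contains may behave differently in its details.

```r
# Hypothetical definition for illustration; the package's own version may differ.
`%contains_sketch%` <- function(x, y) y %in% x

letters_abc <- c("a", "b", "c")
found <- letters_abc %contains_sketch% "b"           # same as "b" %in% letters_abc
mixed <- letters_abc %contains_sketch% c("a", "z")   # one result per element of the right side
print(found)
print(mixed)
```

With the collection on the left, the operator can be used where the collection must be the first argument.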
Conclusion
There will likely be more functions as I come across the need for them. If you have an idea that should be included, submit it to the issue tracker.