Simple Tabulations Made Simple
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
The {tabit} package
This is a blog post announcing the brand new micro package {tabit} that just made it to CRAN.
Thanks to all CRAN people!
Motivation
{tabit} is a package for making simple tabulations simple. My motivation for writing it was the realization that I missed the way tabulation works in Stata: getting an idea of the data quickly and with very little typing.
While R of course has a built-in tabulation function, I was always struggling to get it to show what I wanted without specifying too many arguments.
The way I want it to work:
- I want frequencies.
- I want percentages.
- I do not want it to ignore NAs – never ever, no way.
- I want to see how things look without NAs.
- I want it in a format that is easy to use later on or to re-use for non-interactive data-glancing tasks.
- I want results to be sorted by decreasing frequency.
- I want it to be generic so I can make it work for vectors and lists and data.frames and all the things that might come up in the future.
- I want it to be configurable via parameters.
- I really do not want to have to touch those parameters ever.
- I want it to have zero dependencies.
- I want it all, I want it now.
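For comparison, here is a minimal sketch of how much of this wish list can be had from base R alone. It uses only `table()`, `sort()`, `prop.table()` and `round()` on the `airquality` data set that ships with R; it is an illustration of the extra arguments and steps involved, not how {tabit} is implemented.

```r
# Base R only; airquality ships with the datasets package.
x <- round(airquality$Solar.R, -2)

# Counts including NA, sorted by decreasing frequency.
counts <- sort(table(x, useNA = "always"), decreasing = TRUE)

# Percentages with NA counted in the denominator.
pct_all <- round(100 * prop.table(counts), 2)

# Percentages with NA excluded need a second tabulation.
counts_no_na <- table(x, useNA = "no")
pct <- round(100 * prop.table(counts_no_na), 2)

counts
pct_all
pct
```

The result is three separate table objects rather than one data.frame, which is exactly the kind of glue work the wish list tries to avoid.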
Over the last few years I realized I was rewriting more or less the same function over and over again, for projects and for packages alike. After several iterations I am now quite happy with the outcome, little code though it may be.
Giving it a try
At the moment only one-dimensional tables are implemented. Multidimensional tabulations are planned, but a well-balanced design for such a function is still under development.
Let's have a demo using the built-in “New York Air Quality” data set. The data set consists of several variables, some of which contain missing values. The variable of interest, Solar.R, measures solar radiation in langleys.
To get a quick overview I round the radiation measures to the nearest hundred and use ti_tab1() to get a frequency table. The result of a call to ti_tab1() is a data.frame with one line per distinct value, sorted by decreasing frequency and, by default, including the frequency of missing values as well. Since percentages differ depending on whether missing values (NA) are included, there are two percentage columns: pct excludes NAs and pct_all includes them.
library(tabit)

ti_tab1( x = round(airquality$Solar.R, -2) )
##   value count   pct pct_all
## 3   200    50 34.25   32.68
## 4   300    45 30.82   29.41
## 2   100    34 23.29   22.22
## 1     0    17 11.64   11.11
## 5  <NA>     7    NA    4.58
If sorting by frequency is not what I want, I can easily turn it off by setting the sort parameter to FALSE:
ti_tab1( x = round(airquality$Solar.R, -2), sort = FALSE )
##   value count   pct pct_all
## 1     0    17 11.64   11.11
## 2   100    34 23.29   22.22
## 3   200    50 34.25   32.68
## 4   300    45 30.82   29.41
## 5  <NA>     7    NA    4.58
The same goes for the number of digits shown in the percentage columns, which is controlled by the digits parameter:
ti_tab1( x = round(airquality$Solar.R, -2), digits = 0 )
##   value count pct pct_all
## 3   200    50  34      33
## 4   300    45  31      29
## 2   100    34  23      22
## 1     0    17  12      11
## 5  <NA>     7  NA       5

ti_tab1( x = round(airquality$Solar.R, -2), digits = 4 )
##   value count     pct pct_all
## 3   200    50 34.2466 32.6797
## 4   300    45 30.8219 29.4118
## 2   100    34 23.2877 22.2222
## 1     0    17 11.6438 11.1111
## 5  <NA>     7      NA  4.5752
Since ti_tab1() is implemented as a generic, it can handle multiple data types (vectors, data.frames, and lists) and can be extended to cover other data types as well.
Again, ti_tab1() returns a data.frame. This time a column named name has been added, which records the name of the column on which the frequencies and percentages are based.
ti_tab1( x = lapply(airquality, round, -2) )
##       name value count    pct pct_all
## 1    Ozone     0    82  70.69   53.59
## 2    Ozone  <NA>    37     NA   24.18
## 3    Ozone   100    33  28.45   21.57
## 4    Ozone   200     1   0.86    0.65
## 5  Solar.R   200    50  34.25   32.68
## 6  Solar.R   300    45  30.82   29.41
## 7  Solar.R   100    34  23.29   22.22
## 8  Solar.R     0    17  11.64   11.11
## 9  Solar.R  <NA>     7     NA    4.58
## 10    Wind     0   153 100.00  100.00
## 11    Wind  <NA>     0     NA    0.00
## 12    Temp   100   153 100.00  100.00
## 13    Temp  <NA>     0     NA    0.00
## 14   Month     0   153 100.00  100.00
## 15   Month  <NA>     0     NA    0.00
## 16     Day     0   153 100.00  100.00
## 17     Day  <NA>     0     NA    0.00
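To illustrate what "implemented as a generic" means in practice, here is a rough sketch of the same idea using plain S3 dispatch. The function name my_tab1 and everything inside it are made up for this example and are not taken from the {tabit} source; the sketch also assumes that lists passed in are named.

```r
# Hypothetical generic; my_tab1 is an illustrative name, not part of {tabit}.
my_tab1 <- function(x, ...) UseMethod("my_tab1")

# Default method: tabulate a single vector, keeping NA as its own value.
my_tab1.default <- function(x, digits = 2, ...) {
  counts  <- table(x, useNA = "always")
  value   <- names(counts)        # the NA category has an NA name
  count   <- as.integer(counts)
  n_valid <- sum(count[!is.na(value)])
  res <- data.frame(
    value   = value,
    count   = count,
    pct     = ifelse(is.na(value), NA_real_, round(100 * count / n_valid, digits)),
    pct_all = round(100 * count / sum(count), digits),
    stringsAsFactors = FALSE
  )
  # Sort by decreasing frequency.
  res <- res[order(res$count, decreasing = TRUE), ]
  rownames(res) <- NULL
  res
}

# List method: tabulate each element and record its name in a 'name' column.
my_tab1.list <- function(x, ...) {
  parts <- lapply(names(x), function(nm) {
    res <- my_tab1(x[[nm]], ...)
    data.frame(name = nm, res, stringsAsFactors = FALSE)
  })
  do.call(rbind, parts)
}

# A data.frame carries its own class attribute, so it needs an explicit
# method; here it simply reuses the list method column by column.
my_tab1.data.frame <- function(x, ...) my_tab1.list(as.list(x), ...)

my_tab1(lapply(airquality[c("Ozone", "Solar.R")], round, -2))
```

Supporting a further data type then only requires adding one more method, without touching the existing ones.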
Last but not least, the fact that ti_tab1() returns simple data.frames means that R provides a large array of things I can do with them (plotting, filtering, writing to file) and that every R user instantly knows how to handle them.
# get all counts
ti_tab1(x = airquality$Wind)$count
## [1] 15 11 11 11 10 9 8 8 8 8 8 6 5 4 4 3 3 3 3 3 2 1 1 1 1 1 1 1 1 1 1 0

# get the highest percentage
tab <- ti_tab1(x = round(airquality$Solar.R, -2))
tab$pct[1]
## [1] 34.25

# get percentage of NAs
tab$pct_all[is.na(tab$value)]
## [1] 4.58
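Filtering and writing to file work the same way as for any other data.frame. The sketch below rebuilds the Solar.R table shown earlier by hand so that it runs without {tabit} installed; treating the value column as numeric is an assumption made for this example, and the file path is just a temporary file.

```r
# Hand-built copy of the Solar.R table from above; with {tabit} installed,
# tab would come from ti_tab1() instead. A numeric 'value' column is assumed.
tab <- data.frame(
  value   = c(200, 300, 100, 0, NA),
  count   = c(50, 45, 34, 17, 7),
  pct     = c(34.25, 30.82, 23.29, 11.64, NA),
  pct_all = c(32.68, 29.41, 22.22, 11.11, 4.58)
)

# Filtering: keep values that cover at least a quarter of the non-missing data.
frequent <- subset(tab, !is.na(pct) & pct >= 25)

# Writing to file: a temporary CSV that can be read back in later.
path <- tempfile(fileext = ".csv")
write.csv(tab, path, row.names = FALSE)
tab_again <- read.csv(path)

frequent
```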
Things to come
As mentioned above, one of the things planned for this micro package is to add multidimensional tables. Another option is to extend the tabulation functions to allow for user-defined aggregation functions that produce statistics other than counts and percentages.
Other than that I think the package really has quite a narrow scope and we should keep it like that.