Cleaning up tables
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
This post is re-published from my blog
Context
One of things I have to do quite often is create tables for papers and presentations. Often the “Table 1” of a paper has descriptives about the study, broken down by subgroups. For presentation purposes, it doesn’t look good (to me, at least) that the name of each subgroup be repeated down one column of the table.
One way to deal with this is, of course, by hand. Save the table as a CSV or Excel file, open it up in your favorite spreadsheet program, and prettify things. But, of course, this being a R blog, I wanted to create a function that would fix this. I’ve created hack-y functions for this before, but a neat trick pointed out here gave me an idea for a more elegant solution. It also meant I had to use the tidyeval paradigm a bit, which I figured I should at least become familiar with.
Here’s what I want to do
Take a table like this:
Location | Gender | values |
---|---|---|
Rural | Male | 32.74 |
Rural | Female | 25.18 |
Urban | Male | 40.48 |
Urban | Female | 25.28 |
to something like this:
Location | Gender | values |
---|---|---|
Rural | Male | 32.74 |
Female | 25.18 | |
Urban | Male | 40.48 |
Female | 25.28 |
The point is that the first column has repeating values, and I just want the first row of the cluster of rows corresponding to Rural and Urban to have text, the rest being blank. I find this a cleaner look and more typical of tables I see in papers.
This is purely for presentation purposes.I would never do this for data frames I’ll still analyze, since the blank cells screw up things. Of course this could be fixed easily using last value carried forward imputation on the column.
A solution
I created this simple function to do this for a single column within a magrittr
pipeline:
clean_col = function(x, colvar){ require(dplyr) colv = enquo(colvar) x %>% group_by(!!colv) %>% mutate(rown = row_number()) %>% ungroup() %>% mutate_at(vars(!!colv), funs(ifelse(rown > 1, '', .))) %>% select (-rown) }
The first thing to note here is that I’m using quosures and quasiquotation to allow the pipeline to work with the function’s inputs, specifically the column name, which is provided as an unquoted name. Admittedly this was done without much understanding, following examples on Edwin Thoen’s excellent blog.
The second thing was the use of the dummy rown
column to identify the first row of each cluster of rows defined by the variable colvar
. This was inspired by this blog I read through R-Bloggers yesterday. This trick allowed me to easily “blank out” the appropriate cells in the colvar
column.
Desirable updates
There are two directions I want to take this, but I don’t understand tidyeval
or functions with variable numbers of arguments well enough yet to do it. The simpler extension is to do the same process using two or more columns, rather than one column. For example, taking
Location | Gender | AgeGrp | DeathRate |
---|---|---|---|
Rural | Male | 50-54 | 11.7 |
Rural | Male | 55-59 | 18.1 |
Rural | Male | 60-64 | 26.9 |
Rural | Male | 65-69 | 41.0 |
Rural | Male | 70-74 | 66.0 |
Rural | Female | 50-54 | 8.7 |
Rural | Female | 55-59 | 11.7 |
Rural | Female | 60-64 | 20.3 |
Rural | Female | 65-69 | 30.9 |
Rural | Female | 70-74 | 54.3 |
Urban | Male | 50-54 | 15.4 |
Urban | Male | 55-59 | 24.3 |
Urban | Male | 60-64 | 37.0 |
Urban | Male | 65-69 | 54.6 |
Urban | Male | 70-74 | 71.1 |
Urban | Female | 50-54 | 8.4 |
Urban | Female | 55-59 | 13.6 |
Urban | Female | 60-64 | 19.3 |
Urban | Female | 65-69 | 35.1 |
Urban | Female | 70-74 | 50.0 |
to
Location | Gender | AgeGrp | DeathRate |
---|---|---|---|
Rural | Male | 50-54 | 11.7 |
55-59 | 18.1 | ||
60-64 | 26.9 | ||
65-69 | 41.0 | ||
70-74 | 66.0 | ||
Rural | Female | 50-54 | 8.7 |
55-59 | 11.7 | ||
60-64 | 20.3 | ||
65-69 | 30.9 | ||
70-74 | 54.3 | ||
Urban | Male | 50-54 | 15.4 |
55-59 | 24.3 | ||
60-64 | 37.0 | ||
65-69 | 54.6 | ||
70-74 | 71.1 | ||
Urban | Female | 50-54 | 8.4 |
55-59 | 13.6 | ||
60-64 | 19.3 | ||
65-69 | 35.1 | ||
70-74 | 50.0 |
The second extension would be to create truly nested row labels, like this:
Location | Gender | AgeGrp | DeathRate |
---|---|---|---|
Rural | Male | 50-54 | 11.7 |
55-59 | 18.1 | ||
60-64 | 26.9 | ||
65-69 | 41.0 | ||
70-74 | 66.0 | ||
Female | 50-54 | 8.7 | |
55-59 | 11.7 | ||
60-64 | 20.3 | ||
65-69 | 30.9 | ||
70-74 | 54.3 | ||
Urban | Male | 50-54 | 15.4 |
55-59 | 24.3 | ||
60-64 | 37.0 | ||
65-69 | 54.6 | ||
70-74 | 71.1 | ||
Female | 50-54 | 8.4 | |
55-59 | 13.6 | ||
60-64 | 19.3 | ||
65-69 | 35.1 | ||
70-74 | 50.0 |
I can create these on a case-by-case basis, but I’m not sure how to do this in a function, yet. Looking forward to comments.
Update
I can do the first example with the following code (based on Edwin Thoen’s blog, again):
clean_col = function(x, ...){ require(dplyr) colvs = quos(...) x %>% group_by(!!!colvs) %>% mutate(rown = row_number()) %>% ungroup() %>% mutate_at(vars(!!!colvs), funs(ifelse(rown > 1, '', .))) %>% select (-rown) }
Update 2
The second example can be solved by extracting elements of quosures, which are essentially a list:
clean_cols = function(x, ...){ colvs = quos(...) for(i in 1:length(colvs)){ rowvar = rlang::sym(paste0('rown',i)) # Create dummy x = x %>% group_by(!!!colvs[1:i]) %>% mutate(!!rowvar := row_number()) %>% ungroup() } for(i in 1:length(colvs)){ rowvar = rlang::sym(paste0('rown',i)) x = x %>% mutate_at(vars(!!colvs[[i]]), funs(ifelse(!!rowvar > 1, '', .))) } x = x %>% select(-starts_with('rown')) # remove the dummies return(x) }
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.