Thoughts on nest()
I’ve been experimenting with the Tidyverse’s nest function, because it may be useful when, for example, using households together with benefit units. Below are some thoughts that I first posted as a comment to Hadley Wickham’s blog entry “tidyr 0.4.0”. More on this in future posts.
First, this is likely to be very useful to me. I’m translating a microeconomic model into R. Its input is a set of British households, where each household record contains data on income and expenditure. The model uses these to predict how their incomes will change if you change tax (e.g. by increasing income tax) or benefits (e.g. by increasing pensions or child benefit).
Our data splits households into “benefit units”. A benefit unit (http://www.poverty.org.uk/summary/households.shtml) is defined as an adult plus their spouse if they have one, plus any dependent children they are living with. So, for example, mum and dad plus 10-year-old Johnnie would be one benefit unit. But if Johnnie is over 18, he becomes an adult who just happens to live with his parents, and the household has two benefit units. The benefit system treats these more or less independently, for example if they receive money when out of work.
This matters because each dataset contains one table for household-wide data, and another for benefit-unit-wide data. I’ve been combining these with joins. But it might be cleaner to nest each household’s benefit units inside the household data frame, not least because we sometimes have to change data in a household, for example when simulating inflation, and then propagate the changes down to the benefit units.
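As a minimal sketch of what nesting benefit units inside households might look like: the tables, the column names (hh_id, bu_id, region, income) and the 2% uprating factor below are all invented for illustration and are not taken from the real datasets.

```r
library(dplyr)
library(tidyr)
library(purrr)

# Hypothetical household-level table (illustrative columns only).
households <- tibble(
  hh_id  = c(1, 2),
  region = c("London", "Wales")
)

# Hypothetical benefit-unit-level table; household 1 has two benefit units.
benefit_units <- tibble(
  hh_id  = c(1, 1, 2),
  bu_id  = c(1, 2, 1),
  income = c(300, 120, 250)
)

# Nest each household's benefit units into a list-column, then attach
# that list-column to the household table.
nested <- households %>%
  left_join(nest(benefit_units, benefit_units = -hh_id), by = "hh_id")

# Propagate a household-wide change (an assumed 2% uprating) down to
# every benefit unit held inside each household row.
inflation <- 1.02
uprated <- nested %>%
  mutate(benefit_units = map(benefit_units,
                             ~ mutate(.x, income = income * inflation)))
```

Here each row of nested is one household, and its benefit_units cell holds a small tibble with one row per benefit unit, so a household-level change can be pushed down with a single map over the list-column.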
Second, nest and unnest could be useful elsewhere in our data. Each household’s expenditure data is divided into categories, for example food, rent, and alcohol. We may want to group and ungroup these. For example, I make:
    library(tidyverse)

    # One row per household (ID) and expenditure category.
    d <- tibble(
      ID          = c(1, 1, 1, 2, 2, 3, 3, 4, 4, 4, 5, 5, 5, 6, 6),
      expensetype = c('food', 'alcohol', 'rent', 'food', 'rent', 'food', 'rent',
                      'food', 'cigarettes', 'rent', 'food', 'alcohol', 'rent',
                      'food', 'rent'),
      expenditure = c(100, 50, 400, 75, 300, 90, 400, 100, 30, 420, 75, 50, 550,
                      150, 600)
    )
Then

    d %>% group_by(ID) %>% nest %>% arrange(ID)

gives me a table

    1  tibble [3 x 2]
    2  tibble [2 x 2]
    3  tibble [2 x 2]
    4  tibble [3 x 2]
    5  tibble [3 x 2]
    6  tibble [2 x 2]

where the first column is the ID and the second is a table such as

    food     100
    alcohol   50
    rent     400

So in effect, it makes my original table into a function from ID to ℘(expensetype × expenditure), that is, from each ID to a set of (expensetype, expenditure) pairs.
Whereas if I do

    d %>% group_by(expensetype) %>% nest %>% arrange(expensetype)

I get the table

    alcohol     tibble [2 x 2]
    cigarettes  tibble [1 x 2]
    food        tibble [6 x 2]
    rent        tibble [6 x 2]

where the first column is expenditure category and the second holds tables of ID versus expenditure. In effect, a function from expensetype to ℘(ID × expenditure). This sort of reformatting may be very useful.
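To make the "table as a function" reading concrete, here is a small sketch. The helper apply_nested is hypothetical (it is not part of tidyr), the data is a cut-down version of d, and the sketch assumes the nested column has tidyr's default name data.

```r
library(dplyr)
library(tidyr)

# A cut-down version of the expenditure table: two households only.
d <- tibble(
  ID          = c(1, 1, 1, 2, 2),
  expensetype = c("food", "alcohol", "rent", "food", "rent"),
  expenditure = c(100, 50, 400, 75, 300)
)

by_id <- d %>% group_by(ID) %>% nest() %>% ungroup()

# Hypothetical helper: "apply" a nested table to a key, returning the
# inner tibble stored against that key.
apply_nested <- function(nested, key_col, key) {
  nested$data[[match(key, nested[[key_col]])]]
}

# The set of (expensetype, expenditure) pairs for household 2.
household_2 <- apply_nested(by_id, "ID", 2)
```

Because each key occurs only once after nesting, match finds at most one row, which is what lets the table behave like a function from the key to a table of values.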
Third, continuing from the above, I wrote this function:

    functionalise <- function( data, cols ) {
      data %>%
        group_by_( .dots=cols ) %>%
        nest %>%
        arrange_( .dots=cols )
    }

The idea is that functionalise( data, cols ) reformats data into a data frame that represents a function. The columns cols represent the function’s domain, and will never contain more than one occurrence of the same tuple. The remaining column represents the function’s codomain. Thus, functionalise(d,"expensetype") returns the data frame shown in the last example.
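The underscore verbs group_by_ and arrange_ are deprecated in recent dplyr releases. One possible equivalent, assuming dplyr 1.0 or later and tidyr 1.0 or later, selects the columns with across(all_of(cols)). This is a sketch of one way to do it rather than a definitive rewrite, and the example data is a cut-down version of d.

```r
library(dplyr)
library(tidyr)

# Same idea as functionalise above, written without the deprecated
# underscore verbs. `cols` is a character vector of column names that
# form the function's domain.
functionalise <- function(data, cols) {
  data %>%
    group_by(across(all_of(cols))) %>%
    nest() %>%
    ungroup() %>%                      # drop grouping before sorting
    arrange(across(all_of(cols)))
}

# Cut-down example data: two households only.
d <- tibble(
  ID          = c(1, 1, 1, 2, 2),
  expensetype = c("food", "alcohol", "rent", "food", "rent"),
  expenditure = c(100, 50, 400, 75, 300)
)

by_type <- functionalise(d, "expensetype")
by_both <- functionalise(d, c("ID", "expensetype"))
```

With this small table, by_type has one row per expenditure category, sorted alphabetically, and by_both has one row per (ID, expensetype) pair.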
Fourth, I note that I can write either

    d %>% group_by( expensetype ) %>% nest %>% arrange( expensetype )

or

    nest( d, ID, expenditure )

In the first, I have to name those columns that I want to act as the function’s domain. In the second, I have to name the others. I find the first more natural.
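In tidyr 1.0 and later, the columns to be nested are given as a named argument, and as far as I know the unnamed form shown above now produces a warning. The sketch below, on a cut-down version of d, shows both directions under that assumption: naming the codomain columns, or naming the domain column by excluding it.

```r
library(dplyr)
library(tidyr)

# Cut-down example data: two households only.
d <- tibble(
  ID          = c(1, 1, 1, 2, 2),
  expensetype = c("food", "alcohol", "rent", "food", "rent"),
  expenditure = c(100, 50, 400, 75, 300)
)

# Name the columns that go INSIDE the nested tables (the codomain).
by_codomain <- nest(d, data = c(ID, expenditure))

# Name the column that stays OUTSIDE (the domain) by excluding it.
by_domain <- nest(d, data = -expensetype)
```

Both calls should give one row per expensetype, in order of first appearance, with an inner table of ID and expenditure. The second form is closer to the group_by style, since it is the domain that gets named.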
Fifth, nest and unnest, as well as spread and gather, make it very easy to generate alternative but logically equivalent representations of a data frame. But every time I change representation in this way, I have to rewrite all the code that accesses the data. It would be really nice if either a representation-independent way of accessing it could be found, or if nest/unnest and spread/gather could be made to operate on the code as well as the data. Paging Douglas Ross and the Uniform Referent Principle…
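A small sketch of the problem: the same question, "what is household 1's total expenditure?", needs different access code in each representation. The data is a cut-down version of d, and pivot_wider is used here as the current tidyr counterpart of spread, which is an assumption about the reader's tidyr version (1.0 or later).

```r
library(dplyr)
library(tidyr)

# Cut-down example data: two households only.
d <- tibble(
  ID          = c(1, 1, 1, 2, 2),
  expensetype = c("food", "alcohol", "rent", "food", "rent"),
  expenditure = c(100, 50, 400, 75, 300)
)

# Three logically equivalent representations of the same data.
long   <- d
nested <- nest(d, data = -ID)
wide   <- pivot_wider(d, names_from = expensetype, values_from = expenditure)

# The same query has to be written three different ways.
total_long   <- sum(long$expenditure[long$ID == 1])
total_nested <- sum(nested$data[[which(nested$ID == 1)]]$expenditure)
total_wide   <- sum(unlist(wide[wide$ID == 1, -1]), na.rm = TRUE)
```

All three totals agree, but none of the three access expressions survives a change of representation, which is the rewriting cost described above.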