Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
I’ve been experimenting with by_row() from the R purrrlyr package. It’s a function that maps over the rows of a data frame, which I thought might be handy for processing our economic model’s parameter files. I didn’t find that the documentation for by_row() told me everything I wanted to know, so I made up and ran some examples. As these might be useful to others, I’m blogging them here.
purrrlyr was written mainly by Hadley Wickham. Regular readers of my posts will know that Hadley wrote the suite of programs known as the Tidyverse, which can be installed with the command install.packages("tidyverse"). However, purrrlyr is not quite part of this, and has to be installed separately, via install.packages("purrrlyr").
Why am I interested in by_row()? Our economic model reads what we call parameter files. These are spreadsheets describing the taxes and benefits it has to apply, plus various controls. Each spreadsheet contains several worksheets, most of which start with three columns making up a hierarchical parameter name, followed by one or more values. Here’s a fragment from one such sheet:

      Name1       Name2              Name3    Allowance_Spouse PrivatePension BankAccount NationalSavings
    1 IncomeLists IncomeSupport      <NA>                    1              1           0               0
    2 IncomeLists PensionCredit      Main                    1              1           0               0
    3 IncomeLists PensionCredit      Savings                 0              1           0               0
    4 IncomeLists NonContributoryESA Income                  1              1           0               0
    5 IncomeLists NonContributoryESA Earnings                0              0           0               0
    6 IncomeLists HBCTB              <NA>                    1              0           0               0

And here’s a screenshot of it:

[Screenshot: the same fragment as it appears in the spreadsheet.]

I need to process these sheets a row at a time, converting each row into an entry in a lookup table that maps parameter names to their values. Because by_row() maps over rows, it looked worth learning about. So the code below shows the calls I tried. In it, t is the tibble I passed to by_row(), and tbr is the result of the calls. I’ve commented each call explaining what it tells me, and also listed the things I wanted to know. One was what the documentation meant by saying that the .collate argument makes by_row() collate by row or by column. “Collate” is a very general verb, and didn’t tell me exactly how by_row()’s results were arranged. I was also puzzled by the .labels argument, for which, if TRUE, “the returned data frame is prepended with the labels of the slices (the columns in .d used to define the slices)”.

    # try_by_row.R
    #
    # Some experiments with
    # purrrlyr's by_row() function.
    # This may be useful when
    # converting segments of tibble
    # rows to lists that I can
    # store in a single cell.

    library( tidyverse )
    library( purrrlyr )

    t <- tribble( ~a, ~b, ~c
                ,  1,  2,  3
                ,  4,  5,  6
                ,  7,  8,  9
                , 10, 11, 12
                )
    #
    # I'll try various calls to by_row on this.


    # Let's try sum().

    tbr <- by_row( t, sum )
    #
    # tbr becomes t with a .out column added.
    #      a  b  c .out
    #   1  1  2  3 <dbl [1]>
    #   2  4  5  6 <dbl [1]>
    #   3  7  8  9 <dbl [1]>
    #   4 10 11 12 <dbl [1]>
    # Each of the <dbl [1]>'s is a list whose
    # single element is the sum of a, b, and c.

    tbr <- by_row( t, sum, .collate="rows" )
    tbr <- by_row( t, sum, .collate="cols" )
    #
    # In both of these, tbr becomes a table where
    # each .out cell holds a sum rather than a
    # list containing the sum:
    #      a  b  c .out
    #   1  1  2  3    6
    #   2  4  5  6   15
    #   3  7  8  9   24
    #   4 10 11 12   33


    # Let's try mean().

    tbr <- by_row( t, mean )
    # Warning messages:
    # 1: In mean.default(.d[[i]], ...) :
    #   argument is not numeric or logical: returning NA
    # 2: In mean.default(.d[[i]], ...) :
    #   argument is not numeric or logical: returning NA
    # 3: In mean.default(.d[[i]], ...) :
    #   argument is not numeric or logical: returning NA
    # 4: In mean.default(.d[[i]], ...) :
    #   argument is not numeric or logical: returning NA


    # Let's find out why sum() works but mean() doesn't.

    tbr <- by_row( t, show )
    #
    # I did this to see what by_row()'s function
    # argument gets passed. The command above
    # displays four one-row tibbles, each a row of t.
    # So that's what the function gets passed.

    sum( t[1,] )
    #
    # 6. So sum() can accept tibbles.

    mean( t[1,] )
    # Warning message:
    # In mean.default(t[1, ]) : argument is not numeric
    # or logical: returning NA
    # But mean() can't.

    # The above three results show why by_row gave
    # warnings (and NA results) with mean() but not
    # with sum(). It was passing one-row tibbles to
    # both. sum() accepts these but mean() doesn't.


    # Let's see whether I can create a column
    # containing copies of the original rows.

    tbr <- by_row( t, function(row)row )
    #
    # Shows that tbr$.out is a list where
    # each element is one of these one-row tibbles.
    # So this copies each original row into
    # the corresponding cell of .out .

    tbr <- by_row( t, as.list )
    #
    # This does as above but converts each row to
    # a list. So here,
    #   tbr$.out[[1]]
    # is
    #   list(a = 1, b = 2, c = 3) .
    # and so on. This will be useful when I want to
    # convert rows to look-up tables implemented
    # as named lists. And I presume it's more
    # efficient than storing an entire tibble.

    as.list( t[1,] )
    #
    # Just to show that conversion of one-row
    # tibble to named list is what as.list does,
    # and not some quirk of by_row(). You never
    # know with R.

    as.list( t )
    #
    # I try this for interest. It converts to a
    # named list where each element is a
    # column vector. Thus,
    #   as.list( t )
    # is
    #   list( a = c(1, 4, 7, 10)
    #       , b = c(2, 5, 8, 11)
    #       , c = c(3, 6, 9, 12)
    #       )


    # Let's try the .collate argument, because
    # I don't understand exactly what the
    # by_row() documentation means by collation.

    tbr <- by_row( t, function(row)c(10,20,30) )
    #
    # This makes tbr$.out a list column each of
    # whose elements is c(10,20,30). So it just
    # puts the entire result of the function
    # into the .out cells.

    tbr <- by_row( t, function(row)c(10,20,30), .collate="rows" )
    #
    # Makes tbr the table
    #      a  b  c .row .out
    #   1  1  2  3    1   10
    #   2  1  2  3    1   20
    #   3  1  2  3    1   30
    #   4  4  5  6    2   10
    #   5  4  5  6    2   20
    #   6  4  5  6    2   30
    #   7  7  8  9    3   10
    # etc. So appends one column. But copies
    # each original row as many times as there
    # are components in the function's result.
    # The i'th copy's .out cell holds the i'th
    # component. And we get a .row column holding
    # the number of the original row.

    tbr <- by_row( t, function(row)c(10,20,30), .collate="cols" )
    #
    # Makes tbr the table
    #      a  b  c .out1 .out2 .out3
    #   1  1  2  3    10    20    30
    #   2  4  5  6    10    20    30
    #   3  7  8  9    10    20    30
    #   4 10 11 12    10    20    30
    # So appends three columns, where the i'th
    # column contains the i'th component of the
    # function's result.

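
As an aside, the as.list experiment above hints at how rows might become a look-up table keyed by parameter name. What follows is only a sketch of that idea, not the code our model uses: the tibble params is a made-up, cut-down version of the fragment shown earlier, and make_key, make_values and lookup are names I have invented for illustration. It assumes, as the experiments suggest, that by_row() passes each row to the function with its column names intact, and it joins the non-missing name parts with "." purely as an example of a key format.

```r
library(tibble)
library(purrrlyr)

# Made-up, cut-down version of the parameter-sheet fragment.
params <- tribble(
  ~Name1,        ~Name2,          ~Name3,        ~PrivatePension, ~BankAccount,
  "IncomeLists", "IncomeSupport", NA_character_, 1,               0,
  "IncomeLists", "PensionCredit", "Main",        1,               0,
  "IncomeLists", "PensionCredit", "Savings",     1,               0
)

name_cols <- c("Name1", "Name2", "Name3")

# Join the non-missing parts of the hierarchical name into one key.
make_key <- function(row) {
  parts <- unlist(row[name_cols])
  paste(parts[!is.na(parts)], collapse = ".")
}

# Keep only the value columns of a row, as a named list.
make_values <- function(row) {
  as.list(row[setdiff(names(row), name_cols)])
}

# With the default collation, .out is a list column, so unlist() the keys.
keys   <- unlist(by_row(params, make_key, .labels = FALSE)$.out)
values <- by_row(params, make_values, .labels = FALSE)$.out

# A named list acting as the look-up table.
lookup <- setNames(values, keys)

lookup[["IncomeLists.PensionCredit.Main"]]$PrivatePension
```
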

    # Let's see whether .collate can produce
    # "Manhattan arrays", where the rows are
    # different lengths.

    tbr <- by_row( t, function(row)1:(sum(row)%%5), .collate="cols" )
    #
    # This gave an error:
    #   Error in by_row(t, function(row) 1:(sum(row)%%5), .collate = "cols") :
    #   .f should return equal length vectors or data frames
    #   for collating on `cols`
    # My function was intended to return rows of different
    # length. But, as the documentation says, they have to
    # be the same length.

    tbr <- by_row( t, function(row)1:(sum(row)%%5), .collate="rows" )
    #
    # However, this one does run, producing differing
    # numbers of rows for each original row:
    #       a  b  c .row .out
    #    1  1  2  3    1    1
    #    2  4  5  6    2    1
    #    3  4  5  6    2    0
    #    4  7  8  9    3    1
    #    5  7  8  9    3    2
    #    6  7  8  9    3    3
    #    7  7  8  9    3    4
    #    8 10 11 12    4    1
    #    9 10 11 12    4    2
    #   10 10 11 12    4    3


    # Let's see what happens with .collate when
    # the function returns a data frame,
    # because the documentation doesn't really
    # explain that.

    df <- tribble( ~X, ~Y, "$", "%", "^", "&" )

    tbr <- by_row( t, function(row)df )
    #
    # Each cell of tbr's .out column becomes a copy of
    # df.

    tbr <- by_row( t, function(row)df, .collate="rows" )
    #
    # tbr becomes a table
    #      a  b  c .row X Y
    #   1  1  2  3    1 $ %
    #   2  1  2  3    1 ^ &
    #   3  4  5  6    2 $ %
    #   4  4  5  6    2 ^ &
    # etc. So we get one copy of each original row
    # for each row in df. We get one column for
    # each column in df, plus again a .row column
    # holding the number of the original row.

    tbr <- by_row( t, function(row)df, .collate="cols" )
    #
    # I wasn't expecting that. df's cells get
    # spread into a single line, thus, and its
    # column names suffixed accordingly:
    #      a  b  c X1 X2 Y1 Y2
    #   1  1  2  3  $  ^  %  &
    #   2  4  5  6  $  ^  %  &
    #   3  7  8  9  $  ^  %  &
    #   4 10 11 12  $  ^  %  &


    # Let's see what the .labels argument does. The
    # documentation says: if TRUE, the returned data
    # frame is prepended with the labels of the slices
    # (the columns in .d used to define the slices).
    # But I don't know what it means by slices.

    tbr <- by_row( t, sum, .labels=FALSE )
    #
    # We just get
    #     .out
    #   1 <dbl [1]>
    #   2 <dbl [1]>
    #   3 <dbl [1]>
    #   4 <dbl [1]>

    tbr <- by_row( t, function(row)df, .collate="rows", .labels=FALSE )
    #
    # tbr becomes
    #     .row X Y
    #   1    1 $ %
    #   2    1 ^ &
    #   3    2 $ %
    #   4    2 ^ &
    # etc.

    tbr <- by_row( t, function(row)df, .collate="cols", .labels=FALSE )
    #
    #     X1 X2 Y1 Y2
    #   1  $  ^  %  &
    #   2  $  ^  %  &
    #   3  $  ^  %  &
    #   4  $  ^  %  &
    # So .labels=FALSE gives me what I think
    # of as a traditional mapping function,
    # where by_row() returns only the new
    # columns. Except that if .collate="rows",
    # it prepends a .row column holding the
    # original row number.

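
Going back to the mean() warnings earlier in the script: since by_row() hands its function a one-row tibble, one possible workaround is to flatten the row to a plain numeric vector before taking the mean. The sketch below assumes every column is numeric; row_mean is a helper name I made up, and the smaller two-row t is just for illustration.

```r
library(tibble)
library(purrrlyr)

t <- tribble( ~a, ~b, ~c
            ,  1,  2,  3
            ,  4,  5,  6
            )

# by_row() passes each row as a one-row tibble, which mean() rejects.
# unlist() turns it into a named numeric vector that mean() accepts.
row_mean <- function(row) mean(unlist(row))

# .collate="cols" puts the scalar results straight into .out,
# as it did for sum() above.
tbr <- by_row( t, row_mean, .collate="cols" )

tbr$.out
# 2 5
```

For all-numeric data, base R's rowMeans( t ) should give the same values without purrrlyr; the by_row() form is mainly of interest when the per-row function is something less standard.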