Computed columns for dataframes
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Everyone loves to aggregate data. Everyone loves to create new columns based on other columns. Everyone hates to do the same thing twice. In my continuing work on multilevel view of loss reserving, I reached a point where I realized that I needed a robust mechanism to aggregate computed columns. SQL server and (I’m assuming) other DBMSs have such a construct. Further, it’s something I use routinely in an Excel pivot table. Replicating this in R, doesn’t appear to be straightforward.
Here’s an example, that will be familiar to anyone working in insurance:
myData = data.frame(State = c("NY","NY", "TX", "TX") , Premium = c(100,200,150,75) , Loss = c(80,175,80,80) , ALAE = c(10, 20, 15, 5))
What I’d like to do is calculate loss and loss+ALAE ratios. The former is simply the losses divided by the premium. The latter is the sum of loss and ALAE divided by the premium. I’ll also calculate the ratio of ALAE to loss+ALAE.
The first thing I do is create something called a ComputedColumn. This is simply a list with two elements, 1) the column name and 2) an expression which defines the calculation. The expression will be evaluated later during construction of a new class which I’ve defined. The class is called a “Mesa”. For now, I’m using it to create computed columns and collapse along selected dimensions. In the future, I’d like to use it to generate nice looking TEX or other output for tables.
I’ll note that Ben Escoto did quite a bit of very good work on well-formatted dataframes in his package FAViR. If you haven’t seen it, take a look at it here: http://www.favir.net/start.
Mesa works by evaluating the expressions which are passed into it and constructing a new dataframe for the results. The expressions are kept in the Mesa object so that they may be reapplied whenever we want to alter the Mesa.
ComputedColumn = function(ColumnName, ColumnFormula) { arguments <- as.list(match.call()) ColumnFormula = arguments$ColumnFormula x = list(ColumnName = ColumnName, ColumnFormula = ColumnFormula) x } new_Mesa = function(df, ComputedColumns) { columnVals = lapply(ComputedColumns, function(x) { y = eval(x$ColumnFormula, df)}) newMesa = do.call("cbind", columnVals) colnames(newMesa) = sapply(ComputedColumns, "[[", "ColumnName") newMesa = cbind(df, newMesa) mesa = new("Mesa", MesaData = newMesa, MesaColumns = ComputedColumns) mesa }
When we run the following code, we'll get a table which has the first for columns of myData, augmented with three new ones.
col1 = ComputedColumn("LossRatio", Loss / Premium) col2 = ComputedColumn("ALAERatio", ALAE / (Loss+ALAE)) myCols = list(col1, col2, ComputedColumn("LossAndLAERatio", (Loss + ALAE) / Premium)) myMesa = new_Mesa(myData, myCols) rstudio::viewData(myMesa@MesaData)
State | Premium | Loss | ALAE | LossRatio | ALAERatio | LossAndLAERatio |
NY | 100 | 80 | 10 | 80% | 11% | 90% |
NY | 200 | 175 | 20 | 88% | 10% | 98% |
TX | 150 | 80 | 15 | 53% | 16% | 63% |
TX | 75 | 80 | 5 | 107% | 6% | 113% |
Cool. But what if I'd like to know how New York and Texas are doing in total?
CollapseMesa = function(x, Along) { computedCols = x@MesaColumns df = x@MesaData cols = colnames(df) computedColnames = sapply(x@MesaColumns, "[[", "ColumnName") cols = cols[!cols %in% computedColnames] cols = cols[!cols %in% Along] dfSum = aggregate(df[,cols], by=list(df[,Along]), sum) colnames(dfSum) = gsub("Group.1", Along, colnames(dfSum)) newMesa = new_Mesa(dfSum, x@MesaColumns) newMesa } stateMesa = CollapseMesa(myMesa, "State") rstudio::viewData(stateMesa@MesaData)
State | Premium | Loss | ALAE | LossRatio | ALAERatio | LossAndLAERatio |
NY | 300 | 255 | 30 | 85% | 11% | 95% |
TX | 225 | 160 | 20 | 71% | 11% | 80% |
At present, the grouping mechanism is extremely primitive. It presumes that I’m only interested in collapsing along one dimension and that I only ever want to take the sum. I’ll get around to fixing that at some point in the near future, but this fits my needs for now.
Have a look here: https://github.com/PirateGrunt/Mesa
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.