Site icon R-bloggers

Computed columns for dataframes

[This article was first published on PirateGrunt » R, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Everyone loves to aggregate data. Everyone loves to create new columns based on other columns. Everyone hates to do the same thing twice. In my continuing work on multilevel view of loss reserving, I reached a point where I realized that I needed a robust mechanism to aggregate computed columns. SQL server and (I’m assuming) other DBMSs have such a construct. Further, it’s something I use routinely in an Excel pivot table. Replicating this in R, doesn’t appear to be straightforward.

Here’s an example, that will be familiar to anyone working in insurance:

myData = data.frame(State = c("NY","NY", "TX", "TX")
                    , Premium = c(100,200,150,75)
                    , Loss = c(80,175,80,80)
                    , ALAE = c(10, 20, 15, 5))

What I’d like to do is calculate loss and loss+ALAE ratios. The former is simply the losses divided by the premium. The latter is the sum of loss and ALAE divided by the premium. I’ll also calculate the ratio of ALAE to loss+ALAE.

The first thing I do is create something called a ComputedColumn. This is simply a list with two elements, 1) the column name and 2) an expression which defines the calculation. The expression will be evaluated later during construction of a new class which I’ve defined. The class is called a “Mesa”. For now, I’m using it to create computed columns and collapse along selected dimensions. In the future, I’d like to use it to generate nice looking TEX or other output for tables.

I’ll note that Ben Escoto did quite a bit of very good work on well-formatted dataframes in his package FAViR. If you haven’t seen it, take a look at it here: http://www.favir.net/start.

Mesa works by evaluating the expressions which are passed into it and constructing a new dataframe for the results. The expressions are kept in the Mesa object so that they may be reapplied whenever we want to alter the Mesa.

ComputedColumn = function(ColumnName, ColumnFormula)
{
  arguments <- as.list(match.call())
  ColumnFormula = arguments$ColumnFormula
  x = list(ColumnName = ColumnName, ColumnFormula = ColumnFormula)
  x
}

new_Mesa = function(df, ComputedColumns)
{
  columnVals = lapply(ComputedColumns, function(x) {
    y = eval(x$ColumnFormula, df)})
  
  newMesa = do.call("cbind", columnVals)
  colnames(newMesa) = sapply(ComputedColumns, "[[", "ColumnName")
  newMesa = cbind(df, newMesa)
  
  mesa = new("Mesa", MesaData = newMesa, MesaColumns = ComputedColumns)
  mesa
}

When we run the following code, we’ll get a table which has the first for columns of myData, augmented with three new ones.

col1 = ComputedColumn("LossRatio", Loss / Premium)
col2 = ComputedColumn("ALAERatio", ALAE / (Loss+ALAE))
myCols = list(col1, col2, ComputedColumn("LossAndLAERatio", (Loss + ALAE) / Premium))

myMesa = new_Mesa(myData, myCols)
rstudio::viewData(myMesa@MesaData)
State Premium Loss ALAE LossRatio ALAERatio LossAndLAERatio
NY 100 80 10 80% 11% 90%
NY 200 175 20 88% 10% 98%
TX 150 80 15 53% 16% 63%
TX 75 80 5 107% 6% 113%

Cool. But what if I’d like to know how New York and Texas are doing in total?

CollapseMesa = function(x, Along)
{
  computedCols = x@MesaColumns
  df = x@MesaData

  cols = colnames(df)
  computedColnames = sapply(x@MesaColumns, "[[", "ColumnName")
  cols = cols[!cols %in% computedColnames]
  cols = cols[!cols %in% Along]
  
  dfSum = aggregate(df[,cols], by=list(df[,Along]), sum)
  colnames(dfSum) = gsub("Group.1", Along, colnames(dfSum))
  newMesa = new_Mesa(dfSum, x@MesaColumns)
  newMesa
}
stateMesa = CollapseMesa(myMesa, "State")
rstudio::viewData(stateMesa@MesaData)
State Premium Loss ALAE LossRatio ALAERatio LossAndLAERatio
NY 300 255 30 85% 11% 95%
TX 225 160 20 71% 11% 80%

At present, the grouping mechanism is extremely primitive. It presumes that I’m only interested in collapsing along one dimension and that I only ever want to take the sum. I’ll get around to fixing that at some point in the near future, but this fits my needs for now.

Have a look here: https://github.com/PirateGrunt/Mesa


To leave a comment for the author, please follow the link and comment on their blog: PirateGrunt » R.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.