R : a combined usage of split, lapply and do.call
[This article was first published on K & L Fintech Modeling, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
This post explains a combined usage of split(), lapply() and do.call() R functions, so called Split-Apply-Combine approach. These are frequently used for group based calculations such as weighted average or aggregation. It will be useful when complicated user-defined function is applied differently to each group such as currency. Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
lapply, split, and do.call
lapply R function takes a list as its input and apply built-in or use-defined function to its list members.
1 2 | lapply(list, function) | cs |
Raw data is typically the data.frame not the list. When we want to perform lapply() on data.frame, It is therefore needed to convert this data.frame to the corresponding list. For this purpose, we use split() R function, which take data.frame and a key column as input and return list object separated by key column.
1 2 | split(data.frame, key column of data.frame) | cs |
Although lapply() is very useful, it is somewhat annoying to deal with its returning list object. We convert this list object to the corresponding data.frame using do.call() R function in the following way.
1 2 | do.call(rbind, list) | cs |
We want to aggregate weighted sensitivities (ws) within its currency (currency) from input data.frame by using split() and lapply() R functions. Finally, output data.frame is constructed by using do.call() R function. This overall process is illustrated by the following figure.
Case 1) Single output
The following R code perform a summationof ws by currency group.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 | #=========================================================================# # Financial Econometrics & Derivatives, ML/DL using R, Python, Tensorflow # by Sang-Heon Lee # # https://kiandlee.blogspot.com #————————————————————————-# # A versatile usage of lapply #=========================================================================# graphics.off() # clear all graphs rm(list = ls()) # remove all files from your workspace #————————————————————————– # Input data.frame #————————————————————————– str.data <– “currency maturity ws USD 3m 285000000 USD 3m 456000000 USD 1y 112000000 USD 2y 56000000 EUR 3m 1785000000 EUR 6m 200000000 EUR 1y 250000000 EUR 1y 1855000000 CNY 6m 84000000 CNY 6m 42000000 CNY 6m 144000000 AUD 6m 213000000 AUD 2y 106000000 AUD 2y 214000000″ df.data <– read.table(text=str.data, header = TRUE) print(df.data) #————————————————– # 1) Single output #————————————————– # split input data by key columns lt.data <– split(df.data, df.data$currency) # single output lt.out <– lapply( lt.data, function(x){ data.frame(sum_ws = sum(x$ws))}) # concatenate rows df.out <– do.call(rbind,lt.out) rownames(df.out) <– NULL; print(df.out) ––––––––––––––––––––––––––––––––––––––––––––––––––– > print(df.out) sum_ws 1 5.33e+08 2 2.70e+08 3 4.09e+09 4 9.09e+08 | cs |
Case 2) Multiple output
The following R code calculate a summation and maximum of ws by currency group with each corresponding currency
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 | #————————————————– # 2) Multiple output #————————————————– # split input data by key column and do lapply lt.out <– lapply( # split by currency for outer lapply split(df.data, df.data$currency), function(x){ data.frame(curr = max(x$currency), sum_ws = sum(x$ws), max_ws = max(x$ws)) }) # concatenate rows df.out <– do.call(rbind,lt.out) rownames(df.out) <– NULL; print(df.out) ––––––––––––––––––––––––––––––––––––––––––––––––––– > print(df.out) curr sum_ws max_ws 1 AUD 5.33e+08 214000000 2 CNY 2.70e+08 144000000 3 EUR 4.09e+09 1855000000 4 USD 9.09e+08 456000000 | cs |
Case 3) Multiple output while original row is preserved
The following R code calculates a summation of ws by currency group with each corresponding currency and also calculate weight within its currency. This case happens when we need to calculate some variables with both ungrouped and grouped (aggregated) variables such as weights within group. For this purpose, we need to preserve the row of original input data. This can be done by returning key or other column variables without group operation such as curr = x$currency, ws = x$ws except for grouped variables such as sum_ws = sum(x$ws).
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 | #————————————————– # 3) Multiple output while original row is preserved #————————————————– # split input data by key column and do lapply lt.out <– lapply( # split by currency for lapply split(df.data, df.data$currency), function(x){ data.frame(curr = x$currency, ws = x$ws, sum_ws = sum(x$ws)) }) # concatenate rows df.out <– do.call(rbind,lt.out) rownames(df.out) <– NULL; print(df.out) # add another group based calculation df.out$group_wgt <– df.out$ws/df.out$sum_ws print(df.out) ––––––––––––––––––––––––––––––––––––––––––––––––––– > print(df.out) curr ws sum_ws group_wgt 1 AUD 213000000 5.33e+08 0.39962477 2 AUD 106000000 5.33e+08 0.19887430 3 AUD 214000000 5.33e+08 0.40150094 4 CNY 84000000 2.70e+08 0.31111111 5 CNY 42000000 2.70e+08 0.15555556 6 CNY 144000000 2.70e+08 0.53333333 7 EUR 1785000000 4.09e+09 0.43643032 8 EUR 200000000 4.09e+09 0.04889976 9 EUR 250000000 4.09e+09 0.06112469 10 EUR 1855000000 4.09e+09 0.45354523 11 USD 285000000 9.09e+08 0.31353135 12 USD 456000000 9.09e+08 0.50165017 13 USD 112000000 9.09e+08 0.12321232 14 USD 56000000 9.09e+08 0.06160616 | cs |
Case 4) Multiple output with multiple key columns
The following R code calculates a summation of ws by currency and maturity group with each corresponding currency. In this case, nested lapply() is used.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 | #————————————————– # 4) Multiple output with multiple key columns #————————————————– # outer lapply lt.out <– lapply( # split by currency for outer lapply split(df.data, df.data$currency ), function(x){ # inner lapply y <– lapply( # split by maturity for inner lapply split(x, x$maturity), function(x) { data.frame(curr = max(x$currency), mat = max(x$maturity), sum_ws = sum(x$ws), max_ws = max(x$ws))}) # concatenate inner rows do.call(rbind,y) }) # concatenate outer rows df.out <– do.call(rbind,lt.out) rownames(df.out) <– NULL; print(df.out) ––––––––––––––––––––––––––––––––––––––––––––––––––– > print(df.out) curr mat sum_ws max_ws 1 AUD 2y 320000000 214000000 2 AUD 6m 213000000 213000000 3 CNY 6m 270000000 144000000 4 EUR 1y 2105000000 1855000000 5 EUR 3m 1785000000 1785000000 6 EUR 6m 200000000 200000000 7 USD 1y 112000000 112000000 8 USD 2y 56000000 56000000 9 USD 3m 741000000 456000000 | cs |
Case 5) Multiple output with multiple key columns while original row is preserved
The following R code is the same version of case 3) when original row is preserved.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 | #————————————————– # 5) Multiple output with multiple key columns # while original row is preserved #————————————————– # outer lapply lt.out <– lapply( # split by currency for outer lapply split(df.data, df.data$currency ), function(x){ # inner lapply y <– lapply( # split by maturity for inner lapply split(x, x$maturity), function(x) { data.frame(curr = x$currency, mat = x$maturity, ws = x$ws, sum_ws = sum(x$ws))}) # concatenate inner rows do.call(rbind,y) }) # concatenate outer rows df.out <– do.call(rbind,lt.out) rownames(df.out) <– NULL # add another group based calculation df.out$group_wgt <– df.out$ws/df.out$sum_ws print(df.out) ––––––––––––––––––––––––––––––––––––––––––––––––––– > print(df.out) curr mat ws sum_ws group_wgt 1 AUD 2y 106000000 320000000 0.3312500 2 AUD 2y 214000000 320000000 0.6687500 3 AUD 6m 213000000 213000000 1.0000000 4 CNY 6m 84000000 270000000 0.3111111 5 CNY 6m 42000000 270000000 0.1555556 6 CNY 6m 144000000 270000000 0.5333333 7 EUR 1y 250000000 2105000000 0.1187648 8 EUR 1y 1855000000 2105000000 0.8812352 9 EUR 3m 1785000000 1785000000 1.0000000 10 EUR 6m 200000000 200000000 1.0000000 11 USD 1y 112000000 112000000 1.0000000 12 USD 2y 56000000 56000000 1.0000000 13 USD 3m 285000000 741000000 0.3846154 14 USD 3m 456000000 741000000 0.6153846 | cs |
From this post, we can find that combination of split, lapply and do.call R runctions deliver output which we want to generate for the group operations. \(\blacksquare\)
To leave a comment for the author, please follow the link and comment on their blog: K & L Fintech Modeling.
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.