Win-Vector LLC has been working on porting some significant large scale production systems from SAS to R.
From this experience we want to share how to simulate, in R with Apache Spark (via Sparklyr), a nifty SAS feature: the vectorized “block if(){}else{}” structure.
When porting code from one language to another you hope the expressive power and style of the languages are similar.
- If the source language is too weak then the original code will be very long (and essentially over-specified), so a direct transliteration is unlikely to be efficient, as it will not use the higher-order operators of the target language.
- If the source language is too strong you will have operators that don’t have direct analogues in the target language.
SAS has some strong and powerful operators. One such is what I am calling “the vectorized block if(){}else{}”. From SAS documentation:
The subsetting IF statement causes the DATA step to continue processing only those raw data records or those observations from a SAS data set that meet the condition of the expression that is specified in the IF statement.
That is a really wonderful operator!
R has some available related operators: base::ifelse(), dplyr::if_else(), and dplyr::mutate_if(). However, none of these has the full expressive power of the SAS operator, which can per data row:
- Conditionally choose which columns are assigned to (not just conditionally choose which values are taken).
- Conditionally specify blocks of assignments that happen together.
- Be efficiently nested and chained with other IF statements.
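To make the three points above concrete, here is a minimal base R sketch of a per-row block if/else built by hand. The data frame d and the temporary variable test are hypothetical names made up for this illustration, not part of any package. The idea is to land the condition once and then turn every assignment in each block into its own ifelse().

```r
# Hypothetical example data: three rows, two value columns and a flag.
d <- data.frame(a = c(0, 1, 1), b = c(1, 0, 1), edited = FALSE)

# Land the condition once, so every assignment in the block sees the same
# test even after the columns it was computed from have been changed.
test <- (d$a + d$b) > 1

# "then" block: two assignments that happen together where test is TRUE.
d$a      <- ifelse(test, 0, d$a)
d$edited <- ifelse(test, TRUE, d$edited)

# "else" block: a different column is assigned where test is FALSE.
d$b <- ifelse(!test, -1, d$b)

print(d)
```

This works, but each block costs one ifelse() per assignment plus a stored test, and nesting multiplies the bookkeeping. That bookkeeping is what a block-level operator is meant to hide.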
To help achieve such expressive power in R, Win-Vector is introducing seplyr::if_else_device(). When combined with seplyr::partition_mutate_se() you get a good, high-performance simulation of the SAS capability in R. These are now available in the open source R package seplyr.
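As a rough sketch of how the two functions might fit together, the example below builds a one-level block and applies it to a small local data frame. It assumes the seplyr package is installed, that if_else_device() accepts a character test expression plus named character vectors of assignments, and that partition_mutate_se() returns a list of blocks that seplyr::mutate_se() can apply one at a time. Those call shapes are our reading of the package help and may differ between seplyr versions, so please check help(if_else_device) before relying on them. The data frame d is hypothetical.

```r
library("seplyr")

# Hypothetical example data: three rows, two value columns and a flag.
d <- data.frame(a = c(0, 1, 1), b = c(1, 0, 1), edited = FALSE)

# Describe the block: where (a + b) > 1 set a to 0 and mark the row as
# edited; elsewhere set b to -1. Argument forms are assumed from the help.
program <- if_else_device(
  testexpr  = "(a + b) > 1",
  thenexprs = c("a" = "0", "edited" = "TRUE"),
  elseexprs = c("b" = "-1"))

# Split the assignments into stages that are safe to run as separate mutates.
plan <- partition_mutate_se(program)

# Apply the stages in order.
res <- d
for (block in plan) {
  res <- mutate_se(res, block)
}

# Keep only the original columns, dropping any temporary test columns.
res <- res[, colnames(d), drop = FALSE]
print(res)
```

The same plan should in principle be usable on a remote dplyr data source, since each stage is an ordinary mutate, though we have not shown that here.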
For more information please reach out to us here at Win-Vector or try help(if_else_device).
Also, we will publish more documentation and examples shortly (especially showing big data scale use with Apache Spark via Sparklyr).