Working with big SAS datasets using R and sparklyr
In general, R loads all data into memory, while SAS allocates memory dynamically and keeps data on disk. This makes SAS a better solution for handling very large datasets.
I often need to work with large SAS data files that are prepared in the information system of my department. However, I always try to fit everything into my R workflow, because I like to manipulate data with dplyr and run statistical analyses with all the packages available in R.
For this purpose, sparklyr turned out to be the perfect solution.
First of all, we need to install and load the packages.
library(sparklyr)
library(spark.sas7bdat)
library(dplyr)
spark_install(version = "2.0.1", hadoop_version = "2.7")
Then I connect to a local Spark instance:
sc <- spark_connect(master = "local")
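Since the whole point is working with big datasets, it can help to give the local Spark instance more driver memory than the default. A minimal sketch, assuming 4G suits your machine (adjust as needed):

config <- spark_config()
config$`sparklyr.shell.driver-memory` <- "4G"  # assumed value, tune to your hardware
sc <- spark_connect(master = "local", config = config)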
Finally, it is possible to read the SAS files, manipulate them via dplyr, and store the result in R memory via the collect command.
# path and table name are placeholders; point these at your own SAS file
df <- spark_read_sas(sc, path = "my_file.sas7bdat", table = "my_sas_table")
df_manipulated <- df %>%
  select(var1, var2)   # placeholder columns; any dplyr verbs can go here
df_manipulated_r <- collect(df_manipulated)
The command spark_read_sas returns an object of class tbl_spark, which is a reference to a Spark data frame on which dplyr functions can be executed.
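Note that dplyr verbs applied to a tbl_spark are evaluated lazily: they are translated into Spark SQL, and nothing is computed until the result is requested. A quick way to see this, using the df_manipulated object from the sketch above:

# Print the Spark SQL that the dplyr pipeline generates (no data is moved yet)
show_query(df_manipulated)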
The collect function pulls the manipulated Spark tibble into a local data frame, so the result can be stored in local memory.
This is the data frame on which to perform the data analysis and visualization steps.
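As a purely hypothetical illustration (var1 is a placeholder column from the sketch above, and ggplot2 is assumed to be installed), the collected data frame behaves like any local one:

library(ggplot2)

# Ordinary in-memory operations now work on the local data frame
summary(df_manipulated_r)

# e.g. a histogram of a placeholder numeric column
ggplot(df_manipulated_r, aes(x = var1)) +
  geom_histogram()

# Close the Spark connection when finished
spark_disconnect(sc)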
Here are some resources:
Importing 30GB of data into R with sparklyr
github.com/bnosac/spark.sas7bdat
sparklyr: R interface for Apache Spark