Memory Management in R, and SOAR
The more I’ve worked with my really large data set, the more cumbersome the work has become for my work computer. Keep in mind I’ve got a quad core with 8 gigs of RAM. Growing irritated at how slow my computer gets at times while working with these data, I set out to find better ways of managing memory in R.
The best/easiest solution I’ve found so far is a package called SOAR. Put simply, it lets you store specific R objects (data frames being the most important, for me) as RData files on your hard drive, and gives you the ability to analyze them in R without having them loaded into RAM. I emphasize the word analyze because every time I try to add variables to a data frame that I’ve stored, the data frame comes back into RAM and once again slows me down.
An example might suffice:
> r = data.frame(a=rnorm(10,2,.5),b=rnorm(10,3,.5))
> r
          a        b
1  1.914092 3.074571
2  2.694049 3.479486
3  1.684653 3.491395
4  1.318480 3.816738
5  2.025016 3.107468
6  1.851811 3.708318
7  2.767788 2.636712
8  1.952930 3.164896
9  2.658366 3.973425
10 1.809752 2.599830
> library(SOAR)
> Sys.setenv(R_LOCAL_CACHE="testsession")
> ls()
[1] "r"
> Store(r)
> ls()
character(0)
> mean(r[,1])
[1] 2.067694
> r$c = rnorm(10,4,.5)
> ls()
[1] "r"
So, the first thing I did was to make a data frame with some columns, which got stored in my workspace and thus loaded into RAM. Then I loaded the SOAR library and set my local cache to "testsession". The practical implication is that a directory gets created within R's current working directory (in my case, /home/inkhorn/testsession), and any objects passed to the Store command get saved as RData files in that directory.
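To make the disk side of that concrete, here's a minimal sketch (assuming the same "testsession" cache name as above) that peeks at the cache directory and the search path using base R; the exact file names SOAR writes may vary on your system:

library(SOAR)
Sys.setenv(R_LOCAL_CACHE = "testsession")

r <- data.frame(a = rnorm(10, 2, .5), b = rnorm(10, 3, .5))
Store(r)                   # r is written to the cache directory and dropped from the workspace

list.files("testsession")  # the stored object now lives here as an RData file
search()                   # the cache is attached to the search path, which is why r stays usable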
Sure enough, you see my workspace before and after I store the r object. Now you see the object, now you don’t! But then, as I show, even though the object is not in the workspace, you can still analyze it (in my case, calculate a mean from one of the columns). However, as soon as I try to make a new column in the data frame… voila … it’s back in my workspace, and thus RAM!
So, unless I’m missing something about how the package is used, it doesn’t function exactly as I would like, but it’s still an improvement. Every time I’m done making new columns in the data frame, I just have to pass the object to the Store command, and away to the hard disk it goes, out of my RAM. It’s quite liberating not having a stupendously heavy workspace: saving and loading a heavy workspace takes forever whenever I leave or enter R, but with the heavy stuff sitting on the hard disk, leaving and entering R go by a lot faster.
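To spell out that cycle, here's a minimal sketch continuing the session above (the column c and its distribution parameters are just the placeholders from the example):

# adding a column pulls the stored data frame back into the workspace (and RAM)
r$c <- rnorm(10, 4, .5)
ls()        # "r" is back

# once the new columns are done, send it back out to disk
Store(r)
ls()        # character(0) again -- the heavy object is out of RAM
mean(r$c)   # it can still be analyzed through the attached cache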
Another thing I noticed is that keeping the GLMs I’ve generated in my workspace also takes up a lot of RAM and slows things down. So, with the main data frame written to disk and the GLMs kept out of memory, R is flying again!
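The post doesn't show how the GLMs were handled, but here's a plausible sketch, assuming they get the same Store() treatment; the formula and data are made-up stand-ins:

# fitted glm objects carry a copy of the model frame, so they can get heavy
fit <- glm(a ~ b, data = r)

# either park the model in the SOAR cache like any other object...
Store(fit)

# ...or, if it's no longer needed, drop it and let R reclaim the memory
# rm(fit); gc()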