Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Can’t be bothered reading, tell me now
A simple one-line tweak can significantly speed up package installation and updates.
The wonder of CRAN
One of the best features of R is CRAN. When a package is submitted to CRAN, not only is it checked under three versions of R
- R-past, R-release and R-devel
but also three different operating systems
- Windows, Linux and Mac (with multiple flavours of each)
CRAN also checks that the updated package doesn’t break existing packages. This last part is particularly tricky when you consider all the reverse dependencies a package like ggplot2 or Rcpp has. Furthermore, it performs all these checks within 24 hours, ready for the next set of packages.
What many people don’t realise is that for CRAN to perform this miracle of package checking, it builds and checks these packages in parallel; so rather than installing a single package at a time, it checks multiple packages at once. Obviously some care has to be taken when checking/installing packages due to the connectivity between packages, but R takes care of these details.
Parallel package installation: Ncpus
If you examine the help page for install.packages (?install.packages), there’s a sneaky argument called Ncpus. From the help page:
Ncpus: The number of parallel processes to use for a parallel install of more than one source package.
The default value of this argument is
Ncpus = getOption("Ncpus", 1L)
The getOption() part determines if the value has been set in options(). If no value is found, the default number of processes to use is 1. If you haven’t heard of Ncpus, it’s almost certainly 1, but you can check using
getOption("Ncpus", 1L)
## [1] 6
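As a minimal sketch of how this lookup behaves, the snippet below temporarily clears the option, reads it with a default, sets it, and reads it again. The variable names (old, unset_value, set_value) are only for illustration, and the value 4 is an arbitrary choice.

```r
# Clear Ncpus for the demo; `old` remembers whatever was set before
old <- options(Ncpus = NULL)

# With no stored value, getOption() returns the supplied default
unset_value <- getOption("Ncpus", 1L)

# Once a value is stored, the default is ignored
options(Ncpus = 4L)
set_value <- getOption("Ncpus", 1L)

# Restore the previous setting so the demo leaves no trace
options(old)
```

Here unset_value is 1 and set_value is 4, which is the same fallback that install.packages() relies on for its Ncpus argument.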
Does it work?
To test if changing the value of Ncpus makes a difference, we’ll install the tidyverse package with all its associated dependencies. On my machine, all packages live in a directory called /rpackages/; for each test below I deleted /rpackages/ so all tidyverse dependencies were reinstalled.
My machine has eight cores:
parallel::detectCores()
# [1] 8
So it doesn’t make sense to set Ncpus above 8. Another point is that although R reports that I have 8 cores, I only have 4 physical cores; the other 4 are due to hyper-threading. In practice, this means that I’m only likely to get at most a 6-fold speed-up.
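One way to see the difference between logical and physical cores is the logical argument of parallel::detectCores(). This is only a sketch: according to the R documentation the argument is honoured on some platforms only (on others both calls return the same number), and the function can return NA when the count is unknown. The variable names are illustrative.

```r
library(parallel)

# Logical cores include hyper-threads; physical cores do not
logical_cores  <- detectCores(logical = TRUE)
physical_cores <- detectCores(logical = FALSE)

# detectCores() may return NA, so fall back to a single process
upper_bound <- if (is.na(logical_cores)) 1L else logical_cores

cat("logical:", logical_cores, " physical:", physical_cores, "\n")
```

upper_bound is then the largest value of Ncpus worth trying on the machine.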
For this experiment, I used the RStudio CRAN repository, set via
options(repos = c("CRAN" = "https://cran.rstudio.com/"))
To time the installation procedure, I just use the standard system.time() function.
After removing the /rpackages/ directory, I set Ncpus equal to 1 and installed the tidyverse package with dependencies:
options(Ncpus = 1)
system.time(install.packages("tidyverse"))
## Time in seconds
#    user  system elapsed
# 372.252  15.468 409.364
So a standard installation takes almost 7 minutes (409/60)!
Before we go on, it’s worth noting a couple of caveats:
- This timing also includes the download time of the packages; however, for simplicity I’m ignoring this. Downloading the packages takes around 20 seconds.
- The above number uses a sample size of 1 to estimate the time; repeating the experiment gave a remarkably consistent installation time of 410 seconds.
Repeating this experiment with different values of Ncpus gives the table below:
| Ncpus | Elapsed (secs) | Ratio (relative to Ncpus = 6) |
|---|---|---|
| 1 | 409 | 2.26 |
| 2 | 224 | 1.24 |
| 4 | 196 | 1.08 |
| 6 | 181 | 1.00 |
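The Ratio column is each elapsed time divided by the fastest time (181 seconds at Ncpus = 6). The short sketch below reproduces it from the timings in the table and, as an extra that is not in the original table, also computes the speed-up relative to Ncpus = 1.

```r
# Timings copied from the table above
timings <- data.frame(
  Ncpus   = c(1, 2, 4, 6),
  elapsed = c(409, 224, 196, 181)
)

# Ratio: elapsed time relative to the fastest run
timings$ratio <- round(timings$elapsed / min(timings$elapsed), 2)

# Speed-up: how many times faster than the single-process run
timings$speedup <- round(timings$elapsed[1] / timings$elapsed, 2)

print(timings)
```

The speed-up column comes out as 1.00, 1.83, 2.09 and 2.26, which shows the diminishing returns as more processes are added.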
So setting Ncpus to 2 allows us to roughly halve the installation time, from 409 seconds to around 224 seconds. Increasing Ncpus to 4 gives a further speed boost. Due to the dependencies between packages, we’ll never achieve a perfect speed-up, e.g. if package X depends on Y, then we have to install Y first. However, for a simple change we get an easy speed boost.
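To illustrate why dependencies cap the speed-up, here is a toy sketch that groups packages into "waves" that could be installed in parallel. The dependency map and the package names X, Y, Z and W are hypothetical, and this is not how R's installer is implemented; it only models the ordering constraint that a package's dependencies must be installed before the package itself.

```r
# Hypothetical dependency map: each package lists what it needs first
deps <- list(
  X = "Y",
  Y = character(0),
  Z = character(0),
  W = c("X", "Z")
)

installed <- character(0)
waves <- list()

while (length(installed) < length(deps)) {
  remaining <- setdiff(names(deps), installed)
  # A package is ready once all of its dependencies are installed
  ready <- remaining[vapply(deps[remaining],
                            function(d) all(d %in% installed),
                            logical(1))]
  if (length(ready) == 0) stop("circular dependency")
  waves[[length(waves) + 1]] <- ready
  installed <- c(installed, ready)
}

print(waves)
```

Only Y and Z can be installed at the same time; X and then W each have to wait, so extra processes sit idle for part of the run.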
Furthermore, setting Ncpus gives a speed boost when updating packages via update.packages().
A permanent change: .Rprofile
Clearly writing options(Ncpus = 6) before you install a package is a pain. However, you can just add it to your .Rprofile file. In a future blog post, we’ll cover the .Rprofile in more detail, but for the purposes of this post, your .Rprofile is a file that contains R code that runs whenever R starts. You can test whether you have an .Rprofile file using the command
file.exists("~/.Rprofile")
If you don’t have an .Rprofile file, create one in your home area, which you can locate with
Sys.getenv("HOME")
Then simply add options(Ncpus = XX) to your file.
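If you would rather add the line from within R, a sketch along these lines should work. It writes to a temporary stand-in file so that running it does not touch your real profile; for real use, the path would be "~/.Rprofile". The value 6 is the one used in this post, and the variable names are illustrative.

```r
# Stand-in path for the demo; the real file would be "~/.Rprofile"
rprofile <- tempfile(fileext = ".Rprofile")

# Append the setting without overwriting anything already in the file
cat("options(Ncpus = 6)\n", file = rprofile, append = TRUE)

# Run the file, roughly as R does at start-up, and check the option
old <- options(Ncpus = NULL)
source(rprofile)
ncpus_after <- getOption("Ncpus")
options(old)  # restore the previous setting
```

Using append = TRUE matters when the file already exists, since other settings in it are left untouched.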
The one remaining question is what value you should set XX to. I typically set it to six since I have eight cores. This allows packages to be installed in parallel, while giving me a little bit of wiggle room to check email and listen to music.
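If you share an .Rprofile across machines, one option is to compute the value instead of hard-coding it. The sketch below assumes a rule of "all cores minus two, but at least one"; the headroom of two is simply the choice described above, not a recommendation from the R documentation.

```r
# Number of cores R can see; may be NA on some platforms
cores <- parallel::detectCores()
if (is.na(cores)) cores <- 1L

# Leave two cores free for other work, but never go below one process
ncpus <- max(1L, cores - 2L)
options(Ncpus = ncpus)
```

On an eight-core machine this gives Ncpus = 6, matching the setting used in this post.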
References
If you are interested in how CRAN handles the phenomenal number of package submissions, check out this recent talk:
- Twenty years of CRAN by Uwe Ligges at useR! 2017 in Brussels.
- Image credit: Simson Petrol