R Documentation and R Manuals: Github and Bioconductor technicalities
[This article was first published on DataCamp Blog, and kindly contributed to R-bloggers.]
In our last blog post, we announced the addition of GitHub and Bioconductor R documentation and R manuals to Rdocumentation. For the more technically inclined among you, I'll give a short, high-level description of what's under the hood of Rdocumentation, and zoom in on some of the challenges I encountered while adding the GitHub and Bioconductor repositories.
Rdocumentation for R Documentation and R Manuals in a (technical) nutshell
In a nutshell, the Rdocumentation web server communicates with an R server that runs in the background. Using a cron job, this R server executes the following steps on a daily basis (a sketch follows the list):

- Check for all available packages and their version numbers using `available.packages()`.
- Compare these with the ones on Rdocumentation.
- Install/update the ones that are out of sync.
- Generate the documentation for the newly installed/updated packages and store it in a zip file.
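To make the flow concrete, here is a minimal sketch of such a daily sync job. This is not the actual Rdocumentation code: `rdocs_versions()` (returning the Package/Version table already known to the site) and the `docs/` output path are hypothetical placeholders.

```r
# Minimal sketch of the daily sync job; rdocs_versions() and the
# "docs/" output path are hypothetical, not Rdocumentation internals.
sync_packages <- function(repos = getOption("repos")) {
  avail <- as.data.frame(available.packages(repos = repos),
                         stringsAsFactors = FALSE)[, c("Package", "Version")]
  known <- rdocs_versions()  # hypothetical: data.frame(Package, Version)

  merged <- merge(avail, known, by = "Package",
                  all.x = TRUE, suffixes = c("", ".known"))
  # New packages (no known version) or packages whose version changed.
  stale <- merged$Package[is.na(merged$Version.known) |
                          merged$Version != merged$Version.known]

  for (pkg in stale) {
    install.packages(pkg, repos = repos)
    # Archive the installed package's files (including its help
    # database) as one zip per package.
    zip(zipfile = file.path("docs", paste0(pkg, ".zip")),
        files = system.file(package = pkg))
  }
}
```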
Adding GitHub and Bioconductor repositories
The first version of Rdocumentation only included the packages available on CRAN. Our latest update expanded the package portfolio with the R documentation and R manuals available on Bioconductor and GitHub.

Implementing Bioconductor packages was very similar to implementing CRAN packages, with a few caveats. The biggest one to overcome was that Bioconductor packages sometimes download massive datasets (> 1 GB) upon installation, which makes installing and updating very expensive in both time and storage space. To work around this, we used the `parallel` package to run package installations in separate child processes that were killed (with a SIGKILL signal to the process) if they didn't terminate after some time. This way we avoid cluttering our machine, and losing the few packages that hit this limit is worth the performance gain.
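For illustration, here is a minimal sketch of that timeout-and-kill approach, assuming a forked child process via `parallel::mcparallel()` (Unix-only) and `tools::pskill()` for the SIGKILL. The 600-second timeout and the plain `install.packages()` call (standing in for the Bioconductor installer) are placeholders, not the values Rdocumentation actually uses.

```r
library(parallel)  # mcparallel()/mccollect(), Unix-only
library(tools)     # pskill() and the SIGKILL constant

# Sketch: install a package in a forked child process and SIGKILL it
# if it hasn't finished within `timeout` seconds. The 600-second
# default is an assumption.
install_with_timeout <- function(pkg, timeout = 600) {
  job <- mcparallel(install.packages(pkg))
  # mccollect() returns NULL if the job produced no result
  # within `timeout` seconds.
  result <- mccollect(job, wait = FALSE, timeout = timeout)
  if (is.null(result)) {
    pskill(job$pid, SIGKILL)  # hard-kill the stuck installation
    mccollect(job)            # reap the killed child process
    warning(sprintf("Installation of '%s' killed after %ds", pkg, timeout))
    return(FALSE)
  }
  TRUE
}
```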
Adding GitHub support was very different. Credit goes to Hadley Wickham's r-on-github script, which uses the GitHub API to search for all R repositories and their details (owner, stars, latest update, etc.). We only made some minor changes to his script to filter repositories on the number of stars they have, to cut out the many test repositories. The following graph plots the number of R repositories by the number of stars they have.

[Graph: number of R repositories by number of stars]

We decided that 3 or more stars was an acceptable threshold to call a repository "popular enough" for Rdocumentation. An arbitrary measure, but given the counts shown in the graph above, even a threshold of 1 or more stars already discards the large majority of repositories. Once the repository information is collected, `install_github()` from `devtools` is used to install all of the packages on the server. After an initial install of all packages, only packages that have been updated/created on GitHub within the last week are considered, for obvious performance reasons.
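And a sketch of that GitHub side. Here `repos` stands for a hypothetical data frame assembled by the (modified) r-on-github script, with the `full_name`, `stargazers_count`, and `pushed_at` fields the GitHub API returns; the function itself is an illustration, not the production code.

```r
library(devtools)  # install_github()

# Sketch: keep "popular enough" repositories (>= min_stars) that were
# pushed to within the last week, then install them. `repos` is a
# hypothetical data frame built from GitHub API results.
install_popular_repos <- function(repos, min_stars = 3,
                                  since = Sys.time() - 7 * 24 * 3600) {
  pushed <- as.POSIXct(repos$pushed_at,
                       format = "%Y-%m-%dT%H:%M:%SZ", tz = "UTC")
  keep <- repos$stargazers_count >= min_stars & pushed >= since
  for (repo in repos$full_name[keep]) {  # full_name is "owner/repo"
    tryCatch(install_github(repo),
             error = function(e) message("Skipping ", repo, ": ",
                                         conditionMessage(e)))
  }
}
```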
Any questions or remarks? Drop me a line at [email protected]