Tracking progress in R
Admit it or not, we human beings become anxious and impatient when we have to wait, especially when we are blindfolded, that is, when we do not know how long the wait will last. As pointed out by Brad Allan Myers, arguably the designer of the progress bar in the 1980s, being able to track progress while waiting can significantly improve the user experience (Myers, 1985).
As an R programmer in bioinformatics research, my code is often not designed for the general public, but it is still important to keep my users, namely my colleagues and fellow researchers, as happy as possible. However, tracking progress in R can be tricky. In this article, I present some approaches and my own solution, pbmcapply.
The easiest way to handle progress tracking in R is to periodically print the percentage of completion to the output, which is the screen by default, or write it to your log file located somewhere on the disk. Needless to say, this is probably the least elegant way to solve the problem, but many people are still following this path nowadays.
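As a minimal sketch of this print-based approach, the loop below reports the percentage after each element. The variable names and the square-root workload are placeholders chosen for illustration, not anything prescribed by a package.

```r
# Placeholder workload: take the square root of ten numbers
nums <- 1:10
roots <- numeric(length(nums))

for (i in seq_along(nums)) {
  roots[i] <- sqrt(nums[i])
  # Print the percentage of completion to the console;
  # pass a file and append = TRUE to cat() to write to a log file instead
  cat(sprintf("%.0f%% done\n", 100 * i / length(nums)))
}
```

It works, but the reporting code is tangled up with the computation, which is part of why it feels inelegant.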
Pbapply
A better (and still easy) solution is to adopt a package named pbapply. According to its development page, the package has been very popular, with 90k downloads. The package is easy to use: wherever you would call a function from the apply family, call its pbapply counterpart instead. For example:
# Load pbapply (install it first with install.packages("pbapply"))
library(pbapply)
# Some numbers we are going to work with
nums <- 1:10
# Call sapply to get the square root of these numbers
roots <- sapply(nums, sqrt)
# Now track the progress using the pbapply package
roots <- pbsapply(nums, sqrt)
While the numbers are processed, a progress bar will be printed to the output and refreshed repeatedly.
Although pbapply is a great tool and I use it frequently, until recently it could not track the progress of mclapply, the parallel version of lapply. In September, the author of pbapply updated the package with support for snow-type clusters and multicore-type forking. However, this approach splits the elements into fractions and applies mclapply to them sequentially. One caveat is that if the number of elements is significantly higher than the number of cores, many mclapply calls are executed. Each mclapply call, which is built upon the fork() function in Unix/Linux, is expensive: forking into lots of child processes is time-consuming and creates memory overhead.
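To make the caveat concrete, here is a rough sketch of the chunk-by-chunk idea. It is not pbapply's actual implementation; the helper name chunked_mclapply and its chunk_size parameter are hypothetical and exist only for illustration.

```r
library(parallel)

# Hypothetical helper: process x in sequential batches, with one
# mclapply call (and therefore one round of forking) per batch
chunked_mclapply <- function(x, fun, cores = 2L, chunk_size = 2L) {
  # Forking is unavailable on Windows, so fall back to a single core there
  if (.Platform$OS.type == "windows") cores <- 1L
  # Group the indices of x into consecutive batches of chunk_size elements
  groups <- split(seq_along(x), ceiling(seq_along(x) / chunk_size))
  results <- vector("list", length(x))
  done <- 0
  for (idx in groups) {
    results[idx] <- mclapply(x[idx], fun, mc.cores = cores)
    # Progress can only be updated between batches
    done <- done + length(idx)
    cat(sprintf("\r%3.0f%% done", 100 * done / length(x)))
  }
  cat("\n")
  results
}

roots <- chunked_mclapply(1:10, sqrt)
```

With ten elements and batches of two, this sketch already issues five separate mclapply calls; the count grows with the number of elements, which is where the forking cost accumulates.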
Pbmcapply
Pbmcapply is my own solution to address this problem. Available as a CRAN package, it can be easily incorporated into your code:
# Install pbmcapply
install.packages("pbmcapply")
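Once the package is installed, a call might look like the sketch below. It assumes a Unix-like system, since forking is not available on Windows; the input vector and the choice of two cores are arbitrary values for illustration.

```r
library(pbmcapply)

# Placeholder workload
nums <- 1:10

# pbmclapply mirrors parallel::mclapply and draws a progress bar while it runs
roots <- pbmclapply(nums, sqrt, mc.cores = 2)

# Like mclapply, it returns a list, so flatten it for display
print(unlist(roots))
```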
As you might have guessed from its name, I was inspired by the pbapply package. Unlike pbapply, my solution does not rely on executing multiple mclapply calls. Instead, pbmcapply takes advantage of a package named future.
In computer science, a future is an object that will hold a value later. It allows the program to execute some code as a future and, without waiting for the result, proceed to the next step. In pbmcapply, the mclapply call is wrapped in a future. The future periodically updates the main program with its progress, and the main program maintains a progress bar to display the updates.
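The pattern can be sketched with the future package as follows. This is only an illustration of the idea, not pbmcapply's internal code: the temporary file used as a progress channel, the artificial delay, and the polling interval are all assumptions made for the example.

```r
library(future)

# Run futures in background R sessions so the main session stays free
plan(multisession, workers = 2)

nums <- 1:20
# Assumed progress channel: the worker appends one byte per finished element
progress_file <- tempfile()
file.create(progress_file)

f <- future({
  lapply(nums, function(n) {
    Sys.sleep(0.05)  # pretend each element takes a while
    cat("x", file = progress_file, append = TRUE)
    sqrt(n)
  })
})

# The main program does not block on the future; it polls the channel
# and redraws the percentage until the future is resolved
while (!resolved(f)) {
  done <- file.size(progress_file)
  cat(sprintf("\r%3.0f%% done", 100 * done / length(nums)))
  Sys.sleep(0.1)
}
cat("\r100% done\n")

roots <- unlist(value(f))
plan(sequential)  # restore the default plan
```

The key point is that the work is launched once, and only the lightweight polling loop repeats in the main session.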
Because the overhead of pbmcapply is small and does not grow linearly with the number of elements, a dramatic performance gain is seen when the number of elements to iterate over is significantly larger than the number of CPU cores. The single-threaded and multi-threaded apply functions from base R are used as references. Even with pbmcapply, performance is still affected by the time required to set up the monitor process.
Everything comes at a price. When enjoying the convenience of interactive progress tracking, please keep in mind that it slightly slows down the program.
Conclusion
As always, one size does not fit all. If performance is your top priority (e.g. when running a program on a cluster), printing may be the better way to track progress. On the other hand, if letting the program run for an extra second sounds reasonable, you are more than welcome to try either my solution (pbmcapply) or pbapply for a more user-friendly approach.
Reference
- Myers, B. A. (1985). The importance of percent-done progress indicators for computer-human interfaces. In ACM SIGCHI Bulletin (Vol. 16, No. 4, pp. 11–17). ACM.