Tired of Waiting for your R Scripts to Finish? Let AWS do the Work, Get Notified by E-Mail
The Setting: Avoiding 4 Weeks of Runtime
Recently, I was faced with a problem: I had written a rather complex simulation of a discrete-time queueing network, and I needed to let this simulation run
- with several repetitions of the entire simulation,
- for a range of different parameter values,
- with many observations (i.e. ~ 2,000,000 observations).
The goal was to verify that a new estimation procedure for such queueing networks provides sensible results. For more details on the matter, I refer interested readers to my previous articles on this topic, available here and here, or under the DOIs 10.1080/15326349.2015.1060862 and 10.1016/j.spa.2014.09.003 for all you fans of SciHub out there.
Anyway, the situation described above wouldn’t be problematic as such, but the runtime of a single simulation grew much faster than linearly with the number of observations: for 1,500 observations it ran ~ 0.3 secs, for 15,000 ~ 9 secs, and for 30,000 ~ 38 secs. My goal was to reach 1,500,000 observations at least 3 times, so I was facing (by very rough calculations, of course) 2,430,000 seconds, or a 4-week marathon, in which my laptop would be doing nothing but crunching numbers.
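To make the scaling concrete, here is a minimal sketch that only re-uses the numbers quoted above. The variable names are illustrative; nothing here is taken from the simulation code itself.

```r
# Timings reported in the text: number of observations and runtime in seconds.
n_obs   <- c(1500, 15000, 30000)
runtime <- c(0.3, 9, 38)

# Growth factors between consecutive measurements.
obs_factor     <- n_obs[-1] / n_obs[-length(n_obs)]        # 10 and 2
runtime_factor <- runtime[-1] / runtime[-length(runtime)]  # 30 and about 4.2

# The rough total quoted in the text, converted to days and weeks.
total_seconds <- 2430000
total_days    <- total_seconds / (60 * 60 * 24)  # 28.125
total_weeks   <- total_days / 7                  # just over 4
```

In both steps the runtime grows by a larger factor than the number of observations, which is what makes the long runs so expensive.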
Naturally, I wanted to avoid that and considered using Amazon's AWS for this task. I have described this in a recent blog post (and to make it clear: I don’t get any money from AWS, I just like their usability). However, my requirements were a bit stricter than last time: I needed a workflow that
- Starts up a decent sized AWS instance for my calculation,
- Puts my entire simulation code on this instance,
- Starts the simulation with a freely chosen parameter setting,
- Notifies me by mail as soon as the simulation is done.
The last point was important since I was planning on using larger instance types, and AWS bills by the hour. Obviously, I didn’t want to let precious money go to waste, so I needed a trigger to terminate the instance as soon as the simulation was over.
The Step-by-Step Guide:
Setting up the AWS instance
I have previously posted a detailed description of how to set up an AWS instance, and that walkthrough is still valid. In a nutshell, the core ingredients are
- an AWS account,
- the RStudio AMI of Louis Aslett,
- a security group that allows SSH and HTTP access,
- the correct specification of the instance.
Choosing the correct specification is, of course, not trivial. In my benchmark test for AWS instances I was aiming at a RAM-intensive task and ended up recommending the “r4” instance class. In my use case, however, CPU performance was decisive. For this reason, I opted for a c5.large instance, but your experience may vary.
Setting up E-Mail transmission from the instance
Once the AWS instance was up and running, I needed to enable it to send mails; for this I followed this post here. One first needs to install the necessary software by connecting to the instance via ssh and running
    sudo apt-get install ssmtp mpack
ssmtp handles the SMTP configuration, and mpack provides an easy CLI for sending e-mails with attachments. Next, we need to tell the instance which mail account to use for outgoing mail. In my case, I used my Google mail address for this purpose. Unfortunately, this approach requires the password of the Gmail account to be saved unencrypted in the file /etc/ssmtp/ssmtp.conf, like so:
    #
    # Config file for sSMTP sendmail
    #
    # The person who gets all mail for userids < 1000
    # Make this empty to disable rewriting.
    root=[email protected]

    # The place where the mail goes. The actual machine name is required no
    # MX records are consulted. Commonly mailhosts are named mail.domain.com
    mailhub=smtp.gmail.com:587

    # Where will the mail seem to come from?
    #rewriteDomain=

    # The full hostname
    hostname=*****

    # Are users allowed to set their own From: address?
    # YES - Allow the user to specify their own From: address
    # NO - Use the system generated From: address
    FromLineOverride=YES

    AuthUser=[email protected]
    AuthPass=**************
    UseSTARTTLS=YES
    UseTLS=YES
For Gmail to accept logins with this stored password, you need to enable the option that allows 'less secure apps' to access your mail, in the Google settings under 'Sign in and security'.
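Before relying on the mail setup, it can be worth checking from within R that both command-line tools are actually installed on the instance. The following is a minimal sketch; the helper name tools_available is hypothetical and not part of any package.

```r
# Hypothetical helper: report which of the required command-line tools
# can be found on the PATH of the machine running R.
tools_available <- function(tools = c("ssmtp", "mpack")) {
  paths <- Sys.which(tools)   # "" for every tool that is not found
  found <- nzchar(paths)
  names(found) <- tools
  found
}

tools_available()
```

If either entry comes back FALSE, the apt-get step above probably did not complete.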
Getting R Code on the Instance
This step is ostensibly simple, since you can just copy and paste the code to the RStudio window in your browser. But I did it differently: I wrote my simulation as an R package (a very nice and straightforward how-to can be found here). Writing a package comes with many advantages by forcing you to be clean on
- documentation
- testing
- putting functionality in functions
Furthermore, since I used Git with a remote repository on GitLab for version control, I can install and load my entire code with a simple
    install.packages("devtools")
    devtools::install_git(url = "https://gitlab.com/sastibe/spa_queueingnetwork", branch = "dev")
    library(queueingnetworkR)
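When the script is run more than once on the same instance, reinstalling devtools every time is unnecessary. A minimal sketch of a guard, assuming a hypothetical helper ensure_package that is not part of devtools or of my package:

```r
# Hypothetical helper: install a package only when it is not yet available.
# `installer` defaults to install.packages() and can be swapped out for testing.
ensure_package <- function(pkg, installer = utils::install.packages) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    installer(pkg)
    return(invisible(TRUE))   # an installation was attempted
  }
  invisible(FALSE)            # the package was already present
}
```

Calling ensure_package("devtools") before devtools::install_git() would then skip the installation on every run after the first.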
Start the Simulation and the E-Mail notification
After all these preliminary steps, this is the simplest of them all. I log on to the RStudio instance by navigating to its public DNS in the browser, using the user name "rstudio" and the (AWS) instance_id as password. Then I simply choose an appropriate set of parameter values for the simulation and call just one tailor-made function. Admittedly, this function was written with exactly this use case in mind, so your script length might vary. In my case, this was the entire run script I pasted into RStudio on the AWS instance:
    install.packages("devtools")
    devtools::install_git(url = "https://gitlab.com/sastibe/spa_queueingnetwork", branch = "master")
    library(queueingnetworkR)
    library(ggplot2)  # for ggsave(), in case queueingnetworkR does not attach it

    # Parameter values for the simulation
    p_12 = 0.5
    p_21 = 0.2
    n_obs = 1000
    burn_in = 5000
    lambda_1 = 0.8
    lambda_2 = 0.2
    G_1 = 0.6
    G_2 = 0.5

    # Run the simulation and the estimation
    firstrun <- present_estimates(n_reps = 1, max_lag = 10,
                                  p_12 = p_12, p_21 = p_21,
                                  n_obs = n_obs, burn_in = burn_in,
                                  lambda_1 = lambda_1, lambda_2 = lambda_2,
                                  G_1 = G_1, G_2 = G_2,
                                  progress = TRUE)

    # Save the results and the plot for G2
    save(firstrun, file = "resultate.zip")
    ggsave("result_G2_plot.png", plot = firstrun$plot_result_G2, device = "png")

    # Send both files by e-mail via mpack
    system(paste0("mpack -s 'Script finished with ", n_obs, " observations: plot G2' result_G2_plot.png [email protected]"))
    system(paste0("mpack -s 'Script finished with ", n_obs, " observations: data' resultate.zip [email protected]"))
The last two lines are simple system calls to mpack, which uses the credentials configured in the e-mail setup above to send two mails from my GMail account to "[email protected]": one with the plot result_G2_plot.png attached, the other with the saved results in resultate.zip.
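One weakness of pasting the mpack calls at the end of the script is that no mail is sent if the simulation stops with an error, so the instance would keep running unnoticed. The following is a minimal sketch of a wrapper that mails a short log file in both cases. The helper names build_mpack_command and notify_when_done, the log file name, and the address me@example.com are all hypothetical; only the mpack call pattern (subject, file, recipient) is taken from the script above.

```r
# Hypothetical helper: build an mpack call with shell-quoted arguments.
build_mpack_command <- function(subject, attachment, recipient) {
  paste("mpack -s", shQuote(subject), shQuote(attachment), shQuote(recipient))
}

# Hypothetical wrapper: run `job`, write the outcome to a log file,
# and mail that log file whether the job succeeded or failed.
# `send` defaults to system(); pass another function for a dry run.
notify_when_done <- function(job, recipient, log_file = "run_log.txt",
                             send = system) {
  status <- tryCatch({
    job()
    "finished"
  }, error = function(e) paste("failed:", conditionMessage(e)))
  writeLines(paste(format(Sys.time()), status), log_file)
  cmd <- build_mpack_command(paste("Script", status), log_file, recipient)
  send(cmd)
  invisible(cmd)
}

# Dry run: print the command instead of sending a mail.
notify_when_done(function() Sys.sleep(0.1), "me@example.com",
                 log_file = tempfile(fileext = ".txt"), send = print)
```

In the real run, the job would be a function wrapping the present_estimates() call, and send would be left at its default.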
Receive the Glorious Results
I started the script above on a Friday afternoon and received the notification e-mail on Monday.
Is it Worth the Effort?
In short: yes. Longer version: The answer to this question obviously depends on the metric. In terms of time elapsed, the calculation is simple: 4 weeks on my personal laptop vs. 3 days on AWS is a very one-sided competition. But let's try to look at the monetary aspect:
Calculation on | Duration | Cost per hour | Type of cost | Total cost |
---|---|---|---|---|
Laptop | 4 weeks | 0.87 euro cents | Electricity | 5.84 € |
AWS | 3 days | 9.7 US cents | AWS fee | 5.82 € |
Well, that's ... surprising? To be quite honest, I started calculating this comparison under the strong preconception that AWS must surely be more expensive. Yet the accumulation of runtime over 4 weeks is enough to give the Amazon server farm the edge over my old-school laptop solution, thus cloud computing takes the cake once more. Hooray for technological progress!
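The figures in the table can be re-derived with a few lines of R. This is a minimal sketch with illustrative variable names; the EUR/USD rate at the end is only what the table's total implies, not a quoted exchange rate.

```r
# Durations from the table, in hours.
hours_laptop <- 4 * 7 * 24   # 4 weeks = 672 hours
hours_aws    <- 3 * 24       # 3 days  = 72 hours

# Hourly costs from the table, converted from cents to full currency units.
cost_laptop_eur <- hours_laptop * 0.87 / 100   # 5.8464 EUR
cost_aws_usd    <- hours_aws * 9.7 / 100       # 6.984 USD

# Exchange rate implied by the table's 5.82 EUR for AWS (an inference).
implied_eur_per_usd <- 5.82 / cost_aws_usd
```

Both totals land within a few cents of each other, in line with the table.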