
Running an R Script on a Schedule: Gitlab

[This article was first published on Category R on Roel's R-tefacts, and kindly contributed to R-bloggers].

In this tutorial I have an R script that creates a plot and tweets it. It runs every day on gitlab runners.

The use case is this: You have a script and it needs to run on a schedule (for instance every day).

Other ways to schedule a script

I will create a new post for many of the other ways in which you can run an R script on a schedule. But in this case I will run the script on gitlab. Find all posts about scheduling an R script here.

Gitlab details

Gitlab is a complete version control system. I’m using the free version on gitlab.com but you can self-host gitlab too. And many companies do. That way it remains entirely under your control. For our purposes though, gitlab is exactly like github but with more private repos.

For gitlab you also have to specify configuration in a yaml file. The syntax is slightly different from github and you put it into a file called .gitlab-ci.yml. I found this slightly easier to set up, and easier to debug, because you specify which docker container the runner should use.

My version can be found on github here and gitlab here. The two repos (on github and gitlab) are identical because I have one repo on my computer that is connected to both of them.

On a high level this is what is going to happen:

(We want the code to run on a computer in the cloud)
You save your script locally in a git repository
You push everything to gitlab
# installation
the gitlab runner
- uses a docker container which has R installed
- installs the system dependencies
- and installs the correct packages
# running something
gitlab runner runs the script
we can schedule this action

I first explain what you need, what my R script does, and how to deal with credentials. If you are not interested, go straight to Steps.

What you need:

Example of a script

I have an R script that:

Of course you could create something that is actually useful, like downloading data, cleaning it and pushing it into a database. But this example is relatively small and you can actually see the results online.

Small diversion: credentials/ secrets

For many applications you need credentials, and you don’t want to put the credentials in the script: if you share the script with someone, they also have the credentials. If you put it on an open gitlab repo, the world has your secrets.

So how can you do it? R can read environment variables, and in gitlab you can enter environment variables that will be passed to the runner when it runs (there are better, more professional tools to do the same thing but this is good enough for me). So you create an environment variable called apikey with a value like aVerY5eCretKEy. In your script you use Sys.getenv("apikey") and the script will retrieve the value aVerY5eCretKEy and use that.
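A minimal sketch of reading such a secret in a script. The helper name get_secret is hypothetical (it is not part of the original script); it only wraps Sys.getenv() so that a missing variable stops the run with a clear message instead of silently passing an empty string along.

```r
# Hypothetical helper (not from the original script): read a secret from an
# environment variable and fail loudly when it is missing or empty.
get_secret <- function(name) {
  value <- Sys.getenv(name, unset = NA)
  if (is.na(value) || !nzchar(value)) {
    stop("Environment variable '", name, "' is not set", call. = FALSE)
  }
  value
}

# Usage: set the variable for this session only, then read it back.
# On the runner the variable would already be set, so only the last line
# would be needed there.
Sys.setenv(apikey = "aVerY5eCretKEy")
api_key <- get_secret("apikey")
```

Failing early like this is a design choice: on a scheduled job, an error in the log is easier to spot than a request that quietly fails authentication.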

How do you add them to your local environment?

How do you add them to gitlab?

You don’t need to do anything else: if you name the variables exactly as you did in your .Renviron file, it just works.
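For the local side, here is a small sketch of how a .Renviron file behaves. It assumes the usual key=value format and uses a temporary file as a stand-in, so it does not touch your real .Renviron. R reads that file at startup, and readRenviron() loads one by hand.

```r
# Write a throwaway file in .Renviron format (one key=value pair per line).
# In a real project this would be the .Renviron file in your project or home
# directory, and it should stay out of version control.
env_file <- tempfile(fileext = ".Renviron")
writeLines("apikey=aVerY5eCretKEy", env_file)

# R reads .Renviron when it starts; readRenviron() loads a file on demand.
readRenviron(env_file)

# The script sees the same variable name locally and on the runner.
local_key <- Sys.getenv("apikey")
```

Because the script only ever calls Sys.getenv("apikey"), it does not need to know whether the value came from this file or from the variables configured on gitlab.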

Steps

So what do you need to make this work?

Steps in order

Check if your script runs on your computer
Set up renv and snapshot
(optional) try a cache of your renv libraries for faster builds
install the correct packages on the runner
execute the script
set up a schedule
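The first two steps happen on your own computer. A rough sketch, assuming the renv package is installed. The file name run_job.R and the function prepare_project are placeholders, not names from the original repo.

```r
# Sketch of the local preparation steps. 'run_job.R' is a placeholder name.
prepare_project <- function(script = "run_job.R") {
  if (!requireNamespace("renv", quietly = TRUE)) {
    stop("Install renv first: install.packages('renv')", call. = FALSE)
  }

  # Step 1: check that the script runs on your computer.
  source(script, local = new.env())

  # Step 2: set up renv once; afterwards only update the lockfile.
  if (!file.exists("renv.lock")) {
    renv::init()                    # discovers packages and writes renv.lock
  } else {
    renv::snapshot(prompt = FALSE)  # records the current package versions
  }

  invisible(file.exists("renv.lock"))
}
```

Calling prepare_project() from the project root should leave a renv.lock file that you commit together with the script, so the runner can install the same package versions.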

Steps with explanation

The entire script contains 4 parts

The cache is optional and I don’t think it works as intended yet. Variables are used further in the process and the before_script runs before the script action in run. Wait, that doesn’t make it very clear…

The process starts with reading in the variables. It then starts the docker container rocker/r-ver:4.0.2 and copies the files from your repo to the container. The next step is executing the before_script, which installs some system libraries and sets some options. It then installs renv and also creates a directory that renv expects. Finally it ‘restores’ the library based on the renv.lock file (so it installs all the packages you need to run the script!).
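The R side of that before_script could look roughly like the sketch below. This is an illustration under assumptions, not the author's actual configuration: the function name restore_on_runner and the CRAN mirror URL are my own choices, and the directory-creation step is left out because the text does not say which path is meant.

```r
# Sketch of what the runner does in R before the real script starts.
restore_on_runner <- function(lockfile = "renv.lock") {
  # Without a lockfile there is nothing to restore, so stop early.
  stopifnot(file.exists(lockfile))

  # Install renv itself if the container does not have it yet.
  # The mirror URL is an assumption; any CRAN mirror would do.
  if (!requireNamespace("renv", quietly = TRUE)) {
    install.packages("renv", repos = "https://cloud.r-project.org")
  }

  # Install the package versions recorded in the lockfile, without prompting.
  renv::restore(lockfile = lockfile, prompt = FALSE)
}
```

In a CI configuration such a function would typically be called through Rscript in the before_script, after the system libraries have been installed.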

And then it executes the script part (which is the script I wanted to run in the first place).

Some details about the process:

I’m using

run:
  tags:
    - docker
  image: rocker/r-ver:4.0.2

So I’m telling gitlab to pick a runner tagged docker (tags: - docker), and to use the r-ver image from the rocker organization on docker hub. You could use :latest, and I would recommend that for building packages, because then it would always take the latest version of the rocker r-ver container. But I want this script to run the same way every time, so I pin it to version 4.0.2 (which is, at the moment of writing, identical to latest).

The step apt-get install -y --no-install-recommends ${APT_PKGS} makes use of the variable at the top of the script. It installs all the system libraries you define there.

And finally it executes the script (making use of the variables I defined in settings, and this exact same script works on my local computer too).

Scheduling

You can schedule a gitlab runner very easily by going to ‘CI/CD’ > ‘Schedules’.

You could even make it depend on your timezone!

Conclusion

So to run this script on gitlab we have to give instructions to the infrastructure: we tell it what docker container to use, what things to install and what commands to run. Then, finally, we can run our script.

And now it runs every day.

Building the container takes a long time here, just as on github actions (so any speedup tips you have, I would really appreciate!). To debug, you can run the docker container locally, but you have to execute the before_script steps manually.

References

Reproducibility

At the moment of creation (when I knitted this document) this was the state of my machine:
sessioninfo::session_info()

─ Session info ───────────────────────────────────────────────────────────────
 setting  value
 version  R version 4.0.2 (2020-06-22)
 os       macOS Catalina 10.15.6
 system   x86_64, darwin17.0
 ui       X11
 language (EN)
 collate  en_US.UTF-8
 ctype    en_US.UTF-8
 tz       Europe/Amsterdam
 date     2020-09-24

─ Packages ───────────────────────────────────────────────────────────────────
 package     * version date       lib source
 assertthat    0.2.1   2019-03-21 [1] CRAN (R 4.0.0)
 cli           2.0.2   2020-02-28 [1] CRAN (R 4.0.0)
 crayon        1.3.4   2017-09-16 [1] CRAN (R 4.0.0)
 digest        0.6.25  2020-02-23 [1] CRAN (R 4.0.0)
 evaluate      0.14    2019-05-28 [1] CRAN (R 4.0.0)
 fansi         0.4.1   2020-01-08 [1] CRAN (R 4.0.0)
 glue          1.4.1   2020-05-13 [1] CRAN (R 4.0.1)
 htmltools     0.5.0   2020-06-16 [1] CRAN (R 4.0.1)
 knitr         1.29    2020-06-23 [1] CRAN (R 4.0.1)
 magrittr      1.5     2014-11-22 [1] CRAN (R 4.0.0)
 rlang         0.4.7   2020-07-09 [1] CRAN (R 4.0.2)
 rmarkdown     2.3     2020-06-18 [1] CRAN (R 4.0.1)
 sessioninfo   1.1.1   2018-11-05 [1] CRAN (R 4.0.1)
 stringi       1.4.6   2020-02-17 [1] CRAN (R 4.0.0)
 stringr       1.4.0   2019-02-10 [1] CRAN (R 4.0.0)
 withr         2.2.0   2020-04-20 [1] CRAN (R 4.0.2)
 xfun          0.15    2020-06-21 [1] CRAN (R 4.0.2)
 yaml          2.2.1   2020-02-01 [1] CRAN (R 4.0.0)

[1] /Library/Frameworks/R.framework/Versions/4.0/Resources/library
