
On target


Here are some notes on getting started with {targets}.

The project I am working on involves several different reports, each at least 30 pages, and each with about 20 plots and 20 tables per document.

As well as a myriad of functions, I had 7 very large R scripts doing the data munging and processing.

I thought they were well ordered, but I had to burn everything down a couple of times, and it was quite nerve-wracking building it back up. The thought of adding further phases of the project to this code base made me uncomfortable. I decided I needed to learn {targets} to ensure the project remains reproducible a few years down the line.

The package comes with extensive documentation, but here are some edited highlights and explainers.

If you don't know what {targets} does: it keeps track of the objects you create and the relationships between those objects. If a file feeds into a function and the file updates, then the function needs to run again. You don't need to keep track of that in your head; {targets} does the work for you and produces a wonderful network plot showing the current status.
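To make that concrete, here is a minimal, hypothetical _targets.R; none of these names come from my project, it just shows the shape of a pipeline {targets} can track:

    # _targets.R -- a minimal, hypothetical pipeline (illustrative names only)
    library(targets)

    list(
      # watch the raw file itself, not just the path string
      tar_target(raw_file, "data/raw.csv", format = "file"),
      # re-runs whenever the contents of raw_file change
      tar_target(raw, read.csv(raw_file)),
      # re-runs whenever raw changes
      tar_target(summary_tbl, aggregate(value ~ group, data = raw, FUN = mean))
    )

tar_visnetwork() draws the dependency graph for whatever pipeline you have defined, which is the kind of plot shown below.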

For example, here is a very zoomed-out view of all my targets. It's hard to tell, but quite a lot are now out of date, as shown by the blue colour.

[Figure: targets network plot]

Here I’ve zoomed in, with particular focus on the localities target, which acts as an input to many other downstream targets.

The next time tar_make is run, the code that produces these outdated targets will run, and everything else will be skipped. Having broken everything down into small functions, there is no way I could track all this manually.
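In practice that loop is just a couple of calls:

    library(targets)
    tar_outdated() # list the targets that would rebuild
    tar_make()     # run only what is out of date, skip everything else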

Note – I’m using dataframe here as a generic term for data.frame, tibble, data.table, or whatever else you might be using.

For example, here I'm tracking a spreadsheet which holds a list of desired indicators. If the file changes in any way, then anything that depends on it becomes outdated, and {targets} will know to update those parts of the pipeline.

    tar_target(profile2_adult_indicators,                    # target name
               "./01-inputs/profile2_adultindicators.xlsx",  # command
               format = "file")                              # track the file itself
  • If you have a large script that generates several objects, you're going to need to break it down into functions so that one target is returned per function. It seems like a lot of work, but it's worth it.

  • In general, you will use tar_manifest, tar_visnetwork, and tar_make the most.

    tar_manifest creates a table of the targets and their inputs, and contains lots of info that will help you check that everything is working. If you run tar_make and your pipeline doesn't work as expected, run tar_manifest and examine the output in detail (you'll probably want to pipe the output straight into View()).

    You might also want to use tar_invalidate with a specific target to ensure any changes you make, e.g. as a result of a function change, are picked up prior to running tar_make.

    You can also use tar_destroy and set the destroy option to "all" to completely burn everything down and rebuild. Probably not something to use on a Friday afternoon, unless you're very confident in your pipeline, or you simply live for the danger. A short sketch of all three helpers follows this list.
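Here is a short sketch of those three helpers in use:

    library(targets)

    tar_manifest() |> View()      # table of target names and commands
    tar_invalidate(localities)    # force the localities target to rebuild
    tar_destroy(destroy = "all")  # burn the whole data store down (careful!)

Returning to the pipeline itself, here is one of my plot targets, which creates one plot per locality: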

    tar_target(
      fig_three_bar_comparison,                             # target name
      plot_three_bar_comparison(df = combined_populations,  # command
                                council = localities$area,
                                areaval = localities$areaname),
      pattern = map(localities), # localities = a 2-column df with area & areaname
      iteration = "list",
      format = "file"
    )
    

This code maps over the area and areaname columns in my localities dataframe and creates a plot for each combination, using the plot_three_bar_comparison function, with the existing target combined_populations as an input. The target name is fig_three_bar_comparison.
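Once the pipeline has run, the branch results can be pulled back with tar_read:

    # returns the saved file paths as a list, one branch per locality,
    # because the target uses format = "file" and iteration = "list"
    paths <- targets::tar_read(fig_three_bar_comparison)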

Here are the resulting plots:

    This is using dynamic branching . Static branching is also available, and I should probably have used it, as I know what I want my file names to be. I’m using static branching to generate each Word document with tar_render. This involves creating a tibble with the column values to map over, and an output vector. That may be the topic of a future post.
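As a taster, here is a minimal static-branching sketch; the template, params, and locality names are hypothetical stand-ins, not my actual setup:

    # hypothetical static branching with {tarchetypes}
    library(targets)
    library(tarchetypes)

    values <- tibble::tibble(
      area   = c("north", "south"),          # stand-in locality names
      report = c("north.docx", "south.docx") # desired output file names
    )

    # tar_map() stamps out one tar_render() target per row of values
    tar_map(
      values = values,
      tar_render(
        doc,
        "report_template.Rmd",     # parameterised R Markdown template
        output_file = report,      # one Word document per area
        params = list(area = area)
      )
    )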

I have many of these functions, and as a result I already have over 500 individual targets, from original source spreadsheets and CSVs to plots and documents.

I am much happier now about the foundations of the project. I had a draft phase 2 document up and running in a couple of days, and this is a much larger document with even more tables and plots. Combined with {renv} and git, we are in a good place for our first Reproducible Analytical Pipeline.
