ETC5513: Collaborative and Reproducible Practices

Workshop 10

Author

Michael Lydeamore

Published

20 May 2025

Part 1: Setup

Step 1: Create a new RStudio project

In RStudio, choose File > New Project > New Directory > New Project.
Name your project (e.g. penguin-pipeline).
This creates a local directory.

Step 2: Initialise git

In the R console:
```
usethis::use_git()
```
Create a repository on GitHub (you can leave it empty).
Copy the repository URL (e.g. git@github.com/yourusername/penguin-pipeline.git).

In the terminal:

git remote add origin git@github.com/yourusername/penguin-pipeline.git
git branch -M main
git push -u origin main

Step 1: Load & Clean Data

targets::tar_script()

This creates _targets.R, which we will now edit.

We’ll use the palmerpenguins dataset.

Edit your _targets.R file like this:

library(targets)
library(tarchetypes)
library(dplyr)
library(palmerpenguins)

tar_option_set(packages = c("dplyr", "broom", "palmerpenguins"))

list(
  tar_target(raw_data, penguins),
  
  tar_target(clean_data, raw_data |> 
               filter(!is.na(body_mass_g)) |> 
               mutate(species = as.factor(species)))
)

Now run the pipeline from the R console:

targets::tar_make()

You should be able to see your data with

tar_read(clean_data)

Step 2: Fit Models by Group

Let’s manually define separate targets to fit a linear model (body_mass_g ~ bill_length_mm) for each species. This approach gets repetitive — and we’ll fix that later using dynamic branching.

Add the following targets to your _targets.R file:

  tar_target(adelie_data, filter(clean_data, species == "Adelie")),
  tar_target(chinstrap_data, filter(clean_data, species == "Chinstrap")),
  tar_target(gentoo_data, filter(clean_data, species == "Gentoo")),

  tar_target(adelie_model, lm(body_mass_g ~ bill_length_mm, data = adelie_data)),
  tar_target(chinstrap_model, lm(body_mass_g ~ bill_length_mm, data = chinstrap_data)),
  tar_target(gentoo_model, lm(body_mass_g ~ bill_length_mm, data = gentoo_data))

Then, run:

tar_make()

You now have three separate models, one per species.

Tip

Try and plot these models from your console!

Step 3: Use Dynamic Branching

Now let’s replace our manual lapply step with dynamic branching from tarchetypes.

Replace the last two targets with:

  tar_group_by(
   grouped_data, 
   clean_data, species),

  tar_target(
   models,
   list(lm(body_mass_g ~ bill_length_mm, data = grouped_data)), 
   pattern = map(grouped_data)
  )

Tip

Note the list in the second tar_target. This is because tarchetypes wants to row bind items together, which you can’t do with raw lm objects.

Step 4: Check your network of functions

Run

tar_visnetwork()

to see what the combinations of functions looks like.

Step 4: Report with Quarto

Create a new report.qmd file with this content:

---
title: "Penguins Analysis"
format: html
execute:
  echo: true
---

```{r}
library(targets)
library(gtsummary)
tar_read(clean_data)
```

```{r}
#| results: asis
tar_load(models)

lapply(models, function(input) {
   tbl_regression(input)
})
```

You can render this qmd in the standard way

Challenge: Can you work out which species is for which model? Try and add it to the output chunk above.

Hint: tar_read(models)$tar_group()

Step 5: Pull the reporting out to `targets`

Try and extract the tbl_regression function out into the _targets.R file.

Here’s some starter code. Try to fill in everything in the angled brackets.

tar_target(
   reporting_tables,
   < >,
   pattern = < >,
   iteration = < >
)

Step 6: Version control

Add `.gitignore`

Create a .gitignore file with this content:

_targets
.Rhistory
.RData
.Rproj.user

Save, add and commit this file.

Tip

Remember to add the _targets folder to your .gitignore file!

Step 6: Invalidation

Change the model formula in _targets.R

lm(body_mass_g ~ bill_length_mm + flipper_length_mm, 
   data = grouped_data)

Run

tar_visnetwork()

What do you notice?

Rerun the pipeline with

targets::tar_make()

Notice that only the affected targets have re-run!

Your final `_targets.R` file

By the end of this workshop, you should’ve ended up with this:

# Load packages required to define the pipeline:
library(targets)
library(tarchetypes) # Load other packages as needed.

# Set target options:
tar_option_set(
  packages = c("tibble", "tidyverse", "palmerpenguins") # Packages that your targets need for their tasks.
  
)

# Run the R scripts in the R/ folder with your custom functions:
tar_source()
# tar_source("other_functions.R") # Source other scripts as needed.

# Replace the target list below with your own:
list(
  tar_target(raw_data, penguins),
  
  tar_target(clean_data, raw_data |> 
               filter(!is.na(body_mass_g)) |> 
               mutate(species = as.factor(species))),
  
  tar_target(
    penguins_plot,
    ggplot(clean_data, aes(x=bill_length_mm, y = bill_depth_mm, colour = species)) +
      geom_point()
  ),
  
  tar_group_by(
    grouped_data, 
    clean_data, species),
  
  tar_target(
    models,
    lm(body_mass_g ~ bill_length_mm, data = grouped_data), 
    pattern = map(grouped_data),
    iteration = "list"
  ),
  
  tar_quarto(
    report,
    "report.qmd",
    quiet = FALSE
  )
)

Part 1: Setup

Step 1: Create a new RStudio project

Step 2: Initialise git

Step 1: Load & Clean Data

Step 2: Fit Models by Group

Step 3: Use Dynamic Branching

Step 4: Check your network of functions

Step 4: Report with Quarto

Step 5: Pull the reporting out to targets

Step 6: Version control

Add .gitignore

Step 6: Invalidation

Your final _targets.R file

Step 5: Pull the reporting out to `targets`

Add `.gitignore`

Your final `_targets.R` file