ETC5513: Collaborative and Reproducible Practices
Workshop 10
Part 1: Setup
Step 1: Create a new RStudio project
- In RStudio, choose File > New Project > New Directory > New Project.
- Name your project (e.g.
penguin-pipeline
). - This creates a local directory.
Step 2: Initialise git
In the R console:
::use_git() usethis
Create a repository on GitHub (you can leave it empty).
Copy the repository URL (e.g.
git@github.com/yourusername/penguin-pipeline.git
).In the terminal:
git remote add origin git@github.com/yourusername/penguin-pipeline.git git branch -M main git push -u origin main
Step 1: Load & Clean Data
::tar_script() targets
This creates _targets.R
, which we will now edit.
We’ll use the palmerpenguins
dataset.
Edit your _targets.R
file like this:
library(targets)
library(tarchetypes)
library(dplyr)
library(palmerpenguins)
tar_option_set(packages = c("dplyr", "broom", "palmerpenguins"))
list(
tar_target(raw_data, penguins),
tar_target(clean_data, raw_data |>
filter(!is.na(body_mass_g)) |>
mutate(species = as.factor(species)))
)
Now run the pipeline from the R console:
::tar_make() targets
You should be able to see your data with
tar_read(clean_data)
Step 2: Fit Models by Group
Let’s manually define separate targets to fit a linear model (body_mass_g ~ bill_length_mm
) for each species. This approach gets repetitive — and we’ll fix that later using dynamic branching.
Add the following targets to your _targets.R
file:
tar_target(adelie_data, filter(clean_data, species == "Adelie")),
tar_target(chinstrap_data, filter(clean_data, species == "Chinstrap")),
tar_target(gentoo_data, filter(clean_data, species == "Gentoo")),
tar_target(adelie_model, lm(body_mass_g ~ bill_length_mm, data = adelie_data)),
tar_target(chinstrap_model, lm(body_mass_g ~ bill_length_mm, data = chinstrap_data)),
tar_target(gentoo_model, lm(body_mass_g ~ bill_length_mm, data = gentoo_data))
Then, run:
tar_make()
You now have three separate models, one per species.
Try and plot these models from your console!
Step 3: Use Dynamic Branching
Now let’s replace our manual lapply step with dynamic branching from tarchetypes
.
Replace the last two targets with:
tar_group_by(
grouped_data,
clean_data, species),
tar_target(
models,list(lm(body_mass_g ~ bill_length_mm, data = grouped_data)),
pattern = map(grouped_data)
)
Note the list
in the second tar_target
. This is because tarchetypes
wants to row bind items together, which you can’t do with raw lm
objects.
Step 4: Check your network of functions
Run
tar_visnetwork()
to see what the combinations of functions looks like.
Step 4: Report with Quarto
Create a new report.qmd
file with this content:
---
title: "Penguins Analysis"
format: html
execute:
echo: true
---
```{r}
library(targets)
library(gtsummary)
tar_read(clean_data)
```
```{r}
#| results: asis
tar_load(models)
lapply(models, function(input) {
tbl_regression(input)
})```
You can render this qmd
in the standard way
Challenge: Can you work out which species is for which model? Try and add it to the output chunk above.
Hint: tar_read(models)$tar_group()
Step 5: Pull the reporting out to targets
Try and extract the tbl_regression
function out into the _targets.R
file.
Here’s some starter code. Try to fill in everything in the angled brackets.
tar_target(
reporting_tables,< >,
pattern = < >,
iteration = < >
)
Step 6: Version control
Add .gitignore
Create a .gitignore
file with this content:
_targets
.Rhistory
.RData
.Rproj.user
Save, add and commit this file.
Remember to add the _targets
folder to your .gitignore
file!
Step 6: Invalidation
Change the model formula in _targets.R
lm(body_mass_g ~ bill_length_mm + flipper_length_mm,
data = grouped_data)
Run
tar_visnetwork()
What do you notice?
Rerun the pipeline with
::tar_make() targets
Notice that only the affected targets have re-run!
Your final _targets.R
file
By the end of this workshop, you should’ve ended up with this:
# Load packages required to define the pipeline:
library(targets)
library(tarchetypes) # Load other packages as needed.
# Set target options:
tar_option_set(
packages = c("tibble", "tidyverse", "palmerpenguins") # Packages that your targets need for their tasks.
)
# Run the R scripts in the R/ folder with your custom functions:
tar_source()
# tar_source("other_functions.R") # Source other scripts as needed.
# Replace the target list below with your own:
list(
tar_target(raw_data, penguins),
tar_target(clean_data, raw_data |>
filter(!is.na(body_mass_g)) |>
mutate(species = as.factor(species))),
tar_target(
penguins_plot,ggplot(clean_data, aes(x=bill_length_mm, y = bill_depth_mm, colour = species)) +
geom_point()
),
tar_group_by(
grouped_data,
clean_data, species),
tar_target(
models,lm(body_mass_g ~ bill_length_mm, data = grouped_data),
pattern = map(grouped_data),
iteration = "list"
),
tar_quarto(
report,"report.qmd",
quiet = FALSE
) )