ETC5513: Reproducible and Collaborative Practices

Reproducible analysis pipelines with targets

Lecturer: Michael Lydeamore

Department of Econometrics and Business Statistics




Recap

  • Repositories need structure as projects grow
  • Quarto makes reports reproducible
  • git records the history of the project
  • renv can record package versions for a project

Today we add the missing piece: a way to run the analysis itself in the right order.

Today’s plan

Aim

  • Understand why analysis pipelines are useful
  • Learn the core ideas of the targets package
  • Write a simple _targets.R file
  • Break a long script into functions and targets
  • Render Quarto reports with tar_quarto()
  • See where renv fits in

Why Pipelines?

The familiar pattern

# 01_download.R
# 02_clean.R
# 03_model.R
# 04_plots.R
# 05_report.R

This can work well at first.

But the project gets harder to trust when the order, dependencies, and outputs live mostly in our heads.

The real problem

After a few weeks, you may not know:

  • Which scripts need to run after a small data change
  • Whether the report uses the newest model
  • Which objects were created manually in the global environment
  • Whether a collaborator ran the same steps in the same order
  • How much expensive work can be skipped

A reproducible project should be able to explain itself.

What a pipeline gives us

A pipeline is a set of named steps with explicit dependencies.

Before:

source("01_download.R")
source("02_clean.R")
source("03_model.R")
source("04_plots.R")
quarto::quarto_render("report.qmd")

After:

targets::tar_make()

The pipeline works out:

  • What is out of date
  • What order to run in
  • What can be skipped

Pipelines are graphs

raw_file
   |
raw_data
   |
clean_data
   |
   +------> model ------> model_table
   |
   +------> plot  ------> report

In targets, this graph is built from your R code.
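
To make "what order to run in" concrete, here is a minimal base R sketch that derives one valid run order from the graph above. The deps list and the run_order() helper are hypothetical names used only for illustration; targets builds and walks its graph with its own internals.

```r
# Toy dependency graph: each step maps to the steps it depends on.
deps <- list(
  raw_file    = character(0),
  raw_data    = "raw_file",
  clean_data  = "raw_data",
  model       = "clean_data",
  model_table = "model",
  plot        = "clean_data",
  report      = "plot"
)

# Repeatedly schedule every step whose dependencies are already scheduled.
run_order <- function(deps) {
  done <- character(0)
  while (length(done) < length(deps)) {
    remaining <- setdiff(names(deps), done)
    is_ready <- vapply(deps[remaining], function(d) all(d %in% done), logical(1))
    if (!any(is_ready)) stop("The graph contains a cycle")
    done <- c(done, remaining[is_ready])
  }
  done
}

steps <- run_order(deps)
print(steps)
```

Every step appears after the steps it depends on, which is the only property a run order needs.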

A target is a promise

A target says:

  • Here is the name of an object or file I want
  • Here is the code to make it
  • Here are the upstream things it depends on
  • Store the result so it can be reused later

tar_target(
  clean_data,
  clean_sales(raw_data)
)

clean_data depends on raw_data because raw_data appears in the command.
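
The idea of reading dependencies from the command can be sketched with base R alone. This is only an illustration: targets runs its own static code analysis, and the variable names below are made up for the example.

```r
# The command from the target above, captured without evaluating it.
command <- quote(clean_sales(raw_data))

# Variables mentioned in the command: candidates for upstream targets.
vars_used <- all.vars(command)
print(vars_used)

# Every name in the command, including the function being called.
names_used <- all.names(command)
print(names_used)
```

Because the names come from the code itself, there is no separate list of dependencies to keep in sync.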

Why targets?

The targets package is a Make-like pipeline toolkit for R analysis projects.

It helps you:

  • Skip work that is already up to date
  • Rerun downstream results when inputs change
  • Store intermediate outputs in _targets/
  • Inspect the pipeline graph
  • Run analyses from a clean R process
  • Scale to larger projects later

Source: targets overview

The key habit change

Do not ask:

Warning

Which scripts do I need to remember to run?

Ask:

Tip

What are the important outputs of this analysis, and what does each one depend on?

Anatomy Of A Targets Project

Project layout

sales-analysis/
├── _targets.R
├── R/
│   ├── data.R
│   ├── model.R
│   └── plots.R
├── data/
│   └── raw/
│       └── sales.csv
├── report.qmd
├── renv.lock
└── README.md

_targets.R defines the pipeline. The R/ folder contains your functions.

Create the starter files

From the R console:

install.packages(c("targets", "tarchetypes"))
targets::use_targets()

This creates a starter _targets.R file.

You can also create the file by hand. It is just an R script with a special job.

The four parts of _targets.R

library(targets)
library(tarchetypes)

tar_source()

tar_option_set(
  packages = c("readr", "dplyr", "ggplot2", "broom")
)

list(
  tar_target(raw_file, "data/raw/sales.csv", format = "file"),
  tar_target(raw_data, read_sales(raw_file)),
  tar_target(clean_data, clean_sales(raw_data)),
  tar_target(model, fit_sales_model(clean_data)),
  tar_target(model_table, summarise_model(model)),
  tar_target(sales_plot, plot_sales(clean_data))
)

Part 1: Load pipeline tools

library(targets)
library(tarchetypes)

Use library() here for packages that define the pipeline.

targets provides tar_target(), tar_make(), tar_read(), and friends.

tarchetypes provides extra target factories, including tar_quarto().

Part 2: Load your functions

tar_source()

By default, tar_source() reads all .R files in the R/ folder.

This makes functions like clean_sales() and fit_sales_model() available to the pipeline.

The function bodies are tracked, so changing a function can invalidate the targets that use it.
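
A rough way to picture "tracking a function" is to compare a text fingerprint of its code before and after an edit. The sketch below uses base R only; clean_v1(), clean_v2(), and fingerprint() are hypothetical, and targets uses its own hashing rather than this approach.

```r
# Two versions of the same (hypothetical) cleaning function.
clean_v1 <- function(data) data[!is.na(data$revenue), ]
clean_v2 <- function(data) data[!is.na(data$revenue) & data$revenue > 0, ]

# A crude fingerprint: the deparsed code of the function as one string.
fingerprint <- function(f) paste(deparse(f), collapse = "\n")

# Same code gives the same fingerprint; edited code gives a different one.
unchanged <- identical(fingerprint(clean_v1), fingerprint(clean_v1))
changed <- !identical(fingerprint(clean_v1), fingerprint(clean_v2))
print(c(unchanged = unchanged, changed = changed))
```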

Part 3: Declare packages for targets

tar_option_set(
  packages = c("readr", "dplyr", "ggplot2", "broom")
)

These packages are loaded when targets run.

This is different from library(dplyr) in your interactive session. tar_make() should not rely on whatever you happened to load by hand.

Part 4: Return a list of targets

list(
  tar_target(raw_data, read_sales(raw_file)),
  tar_target(clean_data, clean_sales(raw_data)),
  tar_target(model, fit_sales_model(clean_data))
)

Each tar_target() has:

  • A name
  • A command
  • Optional settings such as format = "file"

Target names

Good target names are nouns:

  • raw_data
  • clean_data
  • model
  • model_table
  • report

Less helpful names:

  • step1
  • stuff
  • temp
  • final_final

The name should tell you what the target stores, not just where it sits in a script.

Target commands

tar_target(
  clean_data,
  clean_sales(raw_data)
)

The command should usually be a function call.

This keeps _targets.R readable and pushes the analysis details into testable functions.

File targets

Regular targets store R objects in _targets/objects/.

File targets track files on disk.

tar_target(
  raw_file,
  "data/raw/sales.csv",
  format = "file"
)

If sales.csv changes, targets knows downstream targets are out of date.
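
Watching a file comes down to noticing that its content changed. Here is a small base R sketch of that idea using an MD5 checksum on a temporary file. The file and its rows are made up, and targets keeps its own hashes and metadata rather than calling this function.

```r
# Write a small stand-in for sales.csv to a temporary file.
path <- tempfile(fileext = ".csv")
writeLines(c("date,revenue", "2024-01-15,100"), path)
hash_before <- unname(tools::md5sum(path))

# Append a row, as if the raw data had been updated.
cat("2024-02-03,250\n", file = path, append = TRUE)
hash_after <- unname(tools::md5sum(path))

# Different hashes signal that the file content changed.
print(hash_before != hash_after)
```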

Output file targets

If a target creates a file, return the path and set format = "file".

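save_sales_plot <- function(plot) {
  path <- "figures/sales-over-time.png"
  # Create figures/ if needed so ggsave() has somewhere to write
  dir.create(dirname(path), showWarnings = FALSE, recursive = TRUE)
  ggplot2::ggsave(path, plot, width = 8, height = 5)
  path
}

tar_target(
  plot_file,
  save_sales_plot(sales_plot),
  format = "file"
)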

The return value is the path that targets should watch.

Breaking Up Code

Start with the long script

library(dplyr)
library(ggplot2)

sales <- readr::read_csv("data/raw/sales.csv")

clean_sales <- sales |>
  filter(!is.na(revenue)) |>
  mutate(month = lubridate::floor_date(date, "month"))

model <- lm(revenue ~ ad_spend + region, data = clean_sales)

plot <- ggplot(clean_sales, aes(month, revenue)) +
  geom_line() +
  facet_wrap(~region)

This script mixes inputs, cleaning, modelling, and plotting.

Move verbs into functions

R/data.R

read_sales <- function(path) {
  readr::read_csv(path, show_col_types = FALSE)
}

clean_sales <- function(data) {
  data |>
    filter(!is.na(revenue)) |>
    mutate(month = lubridate::floor_date(date, "month"))
}

Functions should take inputs as arguments and return outputs explicitly.
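
One payoff of explicit inputs and outputs is that a function can be checked on a tiny data set, with no pipeline involved. The sketch below assumes dplyr and lubridate are installed and skips the check otherwise. The toy_sales data is made up, and the dplyr verbs are prefixed so the example does not rely on library() calls.

```r
# Same logic as clean_sales() above, with package prefixes added.
clean_sales <- function(data) {
  data |>
    dplyr::filter(!is.na(revenue)) |>
    dplyr::mutate(month = lubridate::floor_date(date, "month"))
}

# A made-up data set with one missing revenue value.
toy_sales <- data.frame(
  date = as.Date(c("2024-01-15", "2024-01-20", "2024-02-03")),
  revenue = c(100, NA, 250)
)

# Only run the check when both packages are available.
if (requireNamespace("dplyr", quietly = TRUE) &&
    requireNamespace("lubridate", quietly = TRUE)) {
  cleaned <- clean_sales(toy_sales)
  stopifnot(
    nrow(cleaned) == 2,
    all(cleaned$month == as.Date(c("2024-01-01", "2024-02-01")))
  )
}
```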

More functions

R/model.R

fit_sales_model <- function(data) {
  lm(revenue ~ ad_spend + region, data = data)
}

summarise_model <- function(model) {
  broom::tidy(model)
}

R/plots.R

plot_sales <- function(data) {
  ggplot(data, aes(month, revenue)) +
    geom_line() +
    facet_wrap(~region)
}

Then connect them with targets

list(
  tar_target(raw_file, "data/raw/sales.csv", format = "file"),
  tar_target(raw_data, read_sales(raw_file)),
  tar_target(clean_data, clean_sales(raw_data)),
  tar_target(model, fit_sales_model(clean_data)),
  tar_target(model_table, summarise_model(model)),
  tar_target(sales_plot, plot_sales(clean_data))
)

The target names are the nouns. The functions are the verbs.

What should become a target?

Good target candidates:

  • Important intermediate datasets
  • Expensive computations
  • Outputs used by multiple later steps
  • Published tables, figures, and reports
  • A clear boundary between stages of analysis

Not every line needs to be a target.

What should stay inside a function?

Keep these inside functions:

  • Small transformations that always happen together
  • Temporary variables
  • Repeated calculations with one clear purpose
  • Implementation details that do not need separate inspection

Tip

Targets make the workflow visible. Functions keep the code readable.

Avoid hidden state

This is fragile:

clean_sales <- function() {
  sales |>
    dplyr::filter(!is.na(revenue))
}

This is better:

clean_sales <- function(data) {
  data |>
    dplyr::filter(!is.na(revenue))
}

The second version says exactly what it needs.
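
The difference shows up as soon as the code runs somewhere new, such as the clean R process used by tar_make(). Here is a base R sketch with hypothetical function names, assuming no object called sales exists in the session.

```r
# Fragile: relies on an object called `sales` existing somewhere outside.
clean_hidden <- function() {
  sales[!is.na(sales$revenue), , drop = FALSE]
}

# Better: everything the function needs arrives as an argument.
clean_explicit <- function(data) {
  data[!is.na(data$revenue), , drop = FALSE]
}

toy_sales <- data.frame(revenue = c(100, NA, 250))

# The explicit version works with any data frame it is given.
rows_kept <- nrow(clean_explicit(toy_sales))

# The hidden version fails when no `sales` object is in scope;
# the error message is captured instead of stopping the script.
hidden_result <- tryCatch(
  clean_hidden(),
  error = function(e) conditionMessage(e)
)
print(rows_kept)
print(hidden_result)
```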

Avoid doing work in _targets.R

Prefer this:

tar_target(clean_data, clean_sales(raw_data))

Instead of this:

tar_target(
  clean_data,
  raw_data |>
    dplyr::filter(!is.na(revenue)) |>
    dplyr::mutate(month = lubridate::floor_date(date, "month"))
)

The second version is allowed, but it becomes hard to read as the project grows.

Running The Pipeline

Inspect before running

targets::tar_manifest(fields = "command")

Shows the commands for each target.

targets::tar_visnetwork()

Draws the dependency graph.

targets::tar_validate()

Checks that the pipeline is valid.

Run the pipeline

targets::tar_make()

tar_make():

  • Reads _targets.R
  • Runs targets in dependency order
  • Stores return values in _targets/
  • Skips targets already up to date

Source: tar_make() documentation

What you might see

+ raw_file dispatched
✔ raw_file completed
+ raw_data dispatched
✔ raw_data completed
+ clean_data dispatched
✔ clean_data completed
+ model dispatched
✔ model completed
+ model_table dispatched
✔ model_table completed
✔ ended pipeline

On the next run, up-to-date targets are skipped.
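
To watch skipping happen without touching a real project, the sketch below builds a two-target pipeline in a temporary directory. It assumes the targets package is installed and does nothing otherwise. The numbers and total targets are made up for the demonstration.

```r
if (requireNamespace("targets", quietly = TRUE)) {
  # Work in a throwaway directory so no real project is touched.
  project <- tempfile("toy-pipeline-")
  dir.create(project)
  old_wd <- setwd(project)

  # Write a two-target _targets.R file.
  targets::tar_script({
    list(
      targets::tar_target(numbers, c(1, 2, 3)),
      targets::tar_target(total, sum(numbers))
    )
  }, ask = FALSE)

  # First run: both targets are built and stored.
  targets::tar_make()
  outdated_after_run <- targets::tar_outdated()

  # Change the command of `total` only, then ask what is out of date.
  targets::tar_script({
    list(
      targets::tar_target(numbers, c(1, 2, 3)),
      targets::tar_target(total, sum(numbers) * 2)
    )
  }, ask = FALSE)
  outdated_after_edit <- targets::tar_outdated()

  setwd(old_wd)
  print(outdated_after_run)
  print(outdated_after_edit)
}
```

After the first run nothing is outdated. After the edit only total is listed, so a second tar_make() would skip numbers.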

Read outputs

targets::tar_read(model_table)

Returns the saved value of one target.

targets::tar_load(clean_data)

Loads one or more targets into your R session.

Use these to inspect results. Do not rerun expensive code manually just to look at an output.

Check what is out of date

targets::tar_outdated()

Common reasons a target becomes outdated:

  • Its command changed
  • An upstream target changed
  • A function it uses changed
  • A tracked file changed
  • The target was manually invalidated

What reruns?

Change                        Likely effect
data/raw/sales.csv changes    raw_file and downstream targets rerun
clean_sales() changes         clean_data and downstream targets rerun
fit_sales_model() changes     model and downstream targets rerun
report.qmd changes            the report target reruns
No relevant changes           everything is skipped
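
The "and downstream targets" part can be sketched in base R: start from the target that changed and keep collecting everything that depends on it. The deps list mirrors the example pipeline, and needs_rerun() is a hypothetical helper, not a targets function.

```r
# Each target maps to the targets it depends on.
deps <- list(
  raw_file    = character(0),
  raw_data    = "raw_file",
  clean_data  = "raw_data",
  model       = "clean_data",
  model_table = "model",
  sales_plot  = "clean_data",
  report      = c("model_table", "sales_plot")
)

# Collect the changed target plus everything downstream of it.
needs_rerun <- function(deps, changed) {
  stale <- changed
  repeat {
    is_affected <- vapply(deps, function(d) any(d %in% stale), logical(1))
    fresh <- setdiff(names(deps)[is_affected], stale)
    if (length(fresh) == 0) break
    stale <- c(stale, fresh)
  }
  stale
}

after_model_change <- needs_rerun(deps, "model")
after_clean_change <- needs_rerun(deps, "clean_data")
print(after_model_change)
print(after_clean_change)
```

This sketch is pessimistic. In targets, a downstream target is skipped when its rebuilt upstream target turns out to have the same value as before.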

Force a target to rerun

targets::tar_invalidate(model)
targets::tar_make()

This deletes the metadata for model, so targets treats it as out of date.

Warning

Use this when you need it, but first ask why the target was not already invalidated.

Clean old targets

If you remove targets from _targets.R, old stored results can remain in _targets/.

targets::tar_prune()

For a full reset:

targets::tar_destroy()

Warning

tar_destroy() deletes the whole target store. It is useful, but it is not gentle.

What goes in git?

Commit:

  • _targets.R
  • R/ scripts
  • Quarto source files
  • Small raw data files, if appropriate
  • renv.lock

Usually do not commit:

  • _targets/
  • Large generated outputs
  • Private credentials
  • The project-local package library

The pipeline should be rebuildable from the committed source files.

Dynamic Branching

Repeating the same step

Sometimes we need the same operation many times:

  • Fit one model per group
  • Render one report per scenario
  • Simulate many parameter combinations
  • Process many raw files

This is where dynamic branching becomes useful.

Branch over values

list(
  tar_target(raw_file, "data/raw/sales.csv", format = "file"),
  tar_target(raw_data, read_sales(raw_file)),
  tar_target(clean_data, clean_sales(raw_data)),
  tar_target(regions, sort(unique(clean_data$region))),
  tar_target(
    region_data,
    dplyr::filter(clean_data, region == regions),
    pattern = map(regions)
  ),
  tar_target(
    region_model,
    # Each branch holds one region, so region is left out of the formula
    lm(revenue ~ ad_spend, data = region_data),
    pattern = map(region_data)
  )
)

Each branch is tracked separately.

Why branches matter

If one group changes, targets can avoid rerunning unrelated groups.

Without branches:

all_data -> all_models

One small change can make all models rerun.

With branches:

region A -> model A
region B -> model B
region C -> model C

Only affected branches need work.
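
The shape of a branched computation is close to split-then-apply in base R. The sketch below uses made-up data and fits one model per region. Unlike targets, it stores nothing and tracks nothing, so every model is refitted on every run.

```r
# Made-up sales data: three regions, four observations each.
toy <- data.frame(
  region   = rep(c("A", "B", "C"), each = 4),
  ad_spend = rep(c(1, 2, 3, 4), times = 3),
  revenue  = c(2, 4, 6, 8, 3, 6, 9, 12, 5, 10, 15, 20)
)

# One data frame per region, then one model per data frame.
by_region <- split(toy, toy$region)
models <- lapply(by_region, function(d) lm(revenue ~ ad_spend, data = d))

# Pull out the ad_spend slope from each regional model.
slopes <- vapply(models, function(m) coef(m)[["ad_spend"]], numeric(1))
print(slopes)
```

Dynamic branching keeps this one-model-per-group structure and adds the per-branch bookkeeping that lets unaffected groups be skipped.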

Keep branching for later

Dynamic branching is powerful, but the first goal is simpler:

Tip

Build a clear non-branching pipeline first. Add branches once the repeated structure is obvious.

Quarto Reports In The Pipeline

The report should be an output

The common mistake:

# Run the analysis...
targets::tar_make()

# Then manually render the report...
quarto::quarto_render("report.qmd")

If the report is part of the project, make it part of the pipeline.

tar_quarto()

tar_quarto() comes from tarchetypes.

library(targets)
library(tarchetypes)

tar_source()
tar_option_set(packages = c("readr", "dplyr", "ggplot2", "broom"))

list(
  tar_target(raw_file, "data/raw/sales.csv", format = "file"),
  tar_target(raw_data, read_sales(raw_file)),
  tar_target(clean_data, clean_sales(raw_data)),
  tar_target(model, fit_sales_model(clean_data)),
  tar_target(model_table, summarise_model(model)),
  tar_target(sales_plot, plot_sales(clean_data)),
  tar_quarto(report, path = "report.qmd")
)

How the report sees targets

Inside report.qmd, read target outputs directly.

```{r}
model_table <- targets::tar_read(model_table)
model_table
```

```{r}
targets::tar_read(sales_plot)
```

The report consumes outputs. It should not repeat the whole analysis.

What tar_quarto() tracks

tar_quarto() creates a file target for the rendered document.

It also:

  • Detects tar_read() and tar_load() calls in active R chunks
  • Adds those upstream dependencies to the graph
  • Watches the Quarto source and output files
  • Returns relative file paths for portability

Source: tar_quarto() documentation

Quarto report pattern

Keep the report focused on communication:

  • Load finished targets
  • Make final display tables if needed
  • Arrange text, figures, and interpretation
  • Avoid long-running cleaning or modelling chunks

Tip

If a chunk takes a long time or produces an important object, consider making it a target.

Quarto gotchas

tar_quarto() can only detect dependencies it can see.

Prefer:

```{r}
targets::tar_read(model_table)
```

Avoid hiding the target read inside another function:

```{r}
make_my_report_table()
```

The hidden version may render, but the dependency graph can become incomplete.
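
Why the hidden version goes unnoticed can be shown with a base R scan of chunk code for tar_read() or tar_load(). This is only an illustration of static detection: mentions_target_read() is a hypothetical helper, and tarchetypes has its own detection logic.

```r
# Code from two hypothetical report chunks, held as text.
visible_chunk <- "targets::tar_read(model_table)"
hidden_chunk <- "make_my_report_table()"

# Does the parsed code mention tar_read() or tar_load() anywhere?
mentions_target_read <- function(code) {
  names_used <- all.names(parse(text = code))
  any(c("tar_read", "tar_load") %in% names_used)
}

visible_found <- mentions_target_read(visible_chunk)
hidden_found <- mentions_target_read(hidden_chunk)
print(c(visible = visible_found, hidden = hidden_found))
```

A scan of the chunk text never looks inside make_my_report_table(), so whatever that function reads stays invisible to the graph.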

Parameterised Quarto

You can pass target values into Quarto parameters.

tar_quarto(
  report,
  path = "report.qmd",
  execute_params = list(model_summary = model_table)
)

For most projects, directly using targets::tar_read() in the report is simpler.

Debugging And Practice

A debugging routine

When the pipeline breaks:

  1. Read the error message from tar_make()
  2. Inspect the target command in _targets.R
  3. Load upstream targets with tar_load()
  4. Run the function call interactively
  5. Fix the function, then rerun tar_make()

The bug is usually in your R function, not in targets.

Example

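library(dplyr)               # clean_sales() uses dplyr verbs
targets::tar_source()        # load the functions in R/
targets::tar_load(raw_data)  # load the upstream target

clean_sales(raw_data)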

If this fails interactively, fix clean_sales().

Then rerun:

targets::tar_make()

Useful commands

Task                            Command
Inspect target commands         tar_manifest(fields = "command")
Draw the graph                  tar_visnetwork()
Run the pipeline                tar_make()
List outdated targets           tar_outdated()
Read one target                 tar_read(target_name)
Load one target                 tar_load(target_name)
Remove unused stored targets    tar_prune()

Demo checklist

  1. Create a small project
  2. Put helper functions in R/
  3. Write _targets.R
  4. Run tar_validate()
  5. Draw the graph with tar_visnetwork()
  6. Run tar_make()
  7. Change one function and see what reruns
  8. Add tar_quarto()

Where renv Fits

Reproducibility has layers

Tool       What it records
git        Project history
Quarto     How the report is generated
targets    What runs, in what order, and what is up to date
renv       R package versions used by the project
Docker     Operating system and system dependencies

Today, targets handles the workflow. renv handles the package library.

What renv does

renv helps make R projects:

  • Isolated: project-specific package libraries
  • Portable: another machine can install the needed packages
  • Reproducible: package versions are recorded in renv.lock

Source: renv documentation

Basic renv workflow

renv::init()

Install or update packages as needed.

install.packages(c("targets", "tarchetypes", "dplyr"))

Record the current project library:

renv::snapshot()

Restore it later:

renv::restore()

targets plus renv

There is one extra helper:

targets::tar_renv()

This writes _targets_packages.R, a generated file that helps renv detect packages declared in:

  • tar_option_set(packages = ...)
  • tar_target(packages = ...)
  • storage formats that need extra packages

Source: tar_renv() documentation

A combined workflow

renv::init()

install.packages(c(
  "targets", "tarchetypes", "quarto",
  "readr", "dplyr", "lubridate", "ggplot2", "broom"
))

targets::tar_renv()

renv::snapshot()

Then commit:

  • _targets.R
  • _targets_packages.R
  • R/
  • report.qmd
  • renv.lock

On another computer: renv::restore(), then targets::tar_make().

What renv does not do

renv does not decide:

  • Which analysis steps need to rerun
  • Whether your report is up to date
  • Whether a data file changed
  • How objects depend on each other

That is the job of targets.

What targets does not do

targets does not guarantee:

  • Your collaborator has the same package versions
  • External system libraries are installed
  • Raw data can be downloaded again
  • Secrets and credentials are available

That is why reproducibility needs layers.

Pulling It Together

A complete minimal _targets.R

library(targets)
library(tarchetypes)

tar_source()

tar_option_set(
  packages = c("readr", "dplyr", "lubridate", "ggplot2", "broom")
)

list(
  tar_target(raw_file, "data/raw/sales.csv", format = "file"),
  tar_target(raw_data, read_sales(raw_file)),
  tar_target(clean_data, clean_sales(raw_data)),
  tar_target(model, fit_sales_model(clean_data)),
  tar_target(model_table, summarise_model(model)),
  tar_target(sales_plot, plot_sales(clean_data)),
  tar_quarto(report, path = "report.qmd")
)

What success looks like

A good targets project should let you:

  • Start a fresh R session
  • Run renv::restore() if needed
  • Run targets::tar_make()
  • Rebuild the whole analysis without manual steps
  • Explain the workflow with tar_visnetwork()

That is a much stronger claim than “it worked on my computer yesterday”.

Summary

  • targets turns an analysis into a dependency graph
  • Each target is a named output made by a command
  • Keep analysis details in functions under R/
  • Use file targets for inputs and generated files
  • Use tar_quarto() to make reports part of the pipeline
  • Use renv to record package versions

Further reading