Lecturer: Michael Lydeamore
Department of Econometrics and Business Statistics
- git records the history of the project
- renv can record package versions for a project

Today we add the missing piece: a way to run the analysis itself in the right order.
Aim
- targets package
- _targets.R file
- tar_quarto()
- renv fits in

This can work well at first.
But the project gets harder to trust when the order, dependencies, and outputs live mostly in our heads.
After a few weeks, you may not know:
A reproducible project should be able to explain itself.
A pipeline is a set of named steps with explicit dependencies.
In targets, this graph is built from your R code.
A target says:
clean_data depends on raw_data because raw_data appears in the command.
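To make the rule concrete, here is a rough base-R illustration. It is not how targets is implemented (targets does its own, more thorough static code analysis, and also tracks the function `clean_sales()` itself). `all.vars()` is used only as a stand-in to show that the names used in a command can be read off without running it.

```r
# The command of a hypothetical target, captured without being evaluated.
command <- quote(clean_sales(raw_data))

# all.vars() lists the variable names used in an expression.
# Names in function-call position, such as clean_sales, are left out by default.
inputs <- all.vars(command)
print(inputs)
```

The only variable the command mentions is `raw_data`, which is why `clean_data` sits downstream of it in the graph.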
What is targets?

The targets package is a Make-like pipeline toolkit for R analysis projects.
It helps you:
- _targets/

Source: targets overview
Do not ask:
Warning
Which scripts do I need to remember to run?
Ask:
Tip
What are the important outputs of this analysis, and what does each one depend on?
_targets.R defines the pipeline. The R/ folder contains your functions.
From the R console:
This creates a starter _targets.R file.
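The console call itself is not shown above. A likely candidate, stated here as an assumption, is `use_targets()` from the targets package, which writes a template `_targets.R` into the current project directory.

```r
# Assumption: the starter file comes from targets::use_targets().
# Run this once, from the project root, in an interactive R session.
# install.packages("targets")  # only if the package is not installed yet
targets::use_targets()
```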
You can also create the file by hand. It is just an R script with a special job.
_targets.R

    library(targets)
    library(tarchetypes)

    tar_source()

    tar_option_set(
      packages = c("readr", "dplyr", "ggplot2", "broom")
    )

    list(
      tar_target(raw_file, "data/raw/sales.csv", format = "file"),
      tar_target(raw_data, read_sales(raw_file)),
      tar_target(clean_data, clean_sales(raw_data)),
      tar_target(model, fit_sales_model(clean_data)),
      tar_target(model_table, summarise_model(model)),
      tar_target(sales_plot, plot_sales(clean_data))
    )

Use library() here for packages that define the pipeline.
targets provides tar_target(), tar_make(), tar_read(), and friends.
tarchetypes provides extra target factories, including tar_quarto().
By default, tar_source() reads all .R files in the R/ folder.
This makes functions like clean_sales() and fit_sales_model() available to the pipeline.
The function bodies are tracked, so changing a function can invalidate the targets that use it.
The packages listed in tar_option_set(packages = ...) are loaded when targets run.
This is different from library(dplyr) in your interactive session. tar_make() should not rely on whatever you happened to load by hand.
Each tar_target() has:
- format = "file"

Good target names are nouns:

- raw_data
- clean_data
- model
- model_table
- report

Less helpful names:

- step1
- stuff
- temp
- final_final

The name should tell you what the target stores, not just where it sits in a script.
The command should usually be a function call.
This keeps _targets.R readable and pushes the analysis details into testable functions.
Regular targets store R objects in _targets/objects/.
File targets track files on disk.
If sales.csv changes, targets knows downstream targets are out of date.
If a target creates a file, return the path and set format = "file".
The return value is the path that targets should watch.
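A minimal sketch of such a function follows. The helper name `write_model_table()`, the output path, and the one-row table are hypothetical; the `tar_target()` declaration is shown as a comment so the block runs without targets installed.

```r
# Hypothetical helper: writes a table to disk and returns the file path.
write_model_table <- function(model_table, path) {
  dir.create(dirname(path), recursive = TRUE, showWarnings = FALSE)
  write.csv(model_table, path, row.names = FALSE)
  path  # the path is the return value, so targets can watch the file
}

# In _targets.R this could be declared as (not run here):
# tar_target(
#   model_table_file,
#   write_model_table(model_table, "output/model_table.csv"),
#   format = "file"
# )

# Calling the function directly shows what the target would store.
example_path <- write_model_table(
  data.frame(term = "price", estimate = 1.5),
  file.path(tempdir(), "output", "model_table.csv")
)
print(example_path)
```

The important design choice is the last line of the function: returning the path, rather than the data or `NULL`, is what lets a file target know which file to hash.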
This script mixes inputs, cleaning, modelling, and plotting.
R/data.R
Functions should take inputs as arguments and return outputs explicitly.
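As an illustration of that rule, a hypothetical `R/data.R` might look like the sketch below. The column names `region`, `units` and `price` are assumptions, the numbers are made up, and base R is used so the block runs without extra packages (the lecture's pipeline loads readr and dplyr, which could do the same job).

```r
# R/data.R (illustrative sketch; column names are assumed)

# Read the raw sales file. The path is an argument, not a hard-coded value.
read_sales <- function(path) {
  read.csv(path, stringsAsFactors = FALSE)
}

# Drop incomplete rows and add a revenue column. Returns a new data frame.
clean_sales <- function(raw_data) {
  complete <- raw_data[!is.na(raw_data$units) & !is.na(raw_data$price), ]
  complete$revenue <- complete$units * complete$price
  complete
}

# Small check with made-up numbers.
toy <- data.frame(
  region = c("north", "south", "north"),
  units  = c(10, NA, 4),
  price  = c(2.5, 3, 5)
)
cleaned <- clean_sales(toy)
print(cleaned)
```

Neither function reads from or writes to the global environment, so each can be tested on a small data frame before it is wired into the pipeline.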
R/model.R
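The contents of this file are not reproduced here. The sketch below is one possible shape for it, under the assumption that the model is a simple linear regression of `units` on `price`. The formula, the column names and the toy data are all hypothetical, and base R stands in for broom.

```r
# R/model.R (illustrative sketch; formula and column names are assumed)

# Fit a simple linear model of units sold on price.
fit_sales_model <- function(clean_data) {
  lm(units ~ price, data = clean_data)
}

# Turn the fitted model into a plain coefficient table.
summarise_model <- function(model) {
  coefs <- coef(summary(model))
  data.frame(
    term      = rownames(coefs),
    estimate  = coefs[, "Estimate"],
    std_error = coefs[, "Std. Error"],
    row.names = NULL
  )
}

# Made-up data, only to show that the two functions chain together.
toy <- data.frame(units = c(12, 10, 9, 7, 4), price = c(1, 2, 3, 4, 5))
model_table <- summarise_model(fit_sales_model(toy))
print(model_table)
```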
R/plots.R
    list(
      tar_target(raw_file, "data/raw/sales.csv", format = "file"),
      tar_target(raw_data, read_sales(raw_file)),
      tar_target(clean_data, clean_sales(raw_data)),
      tar_target(model, fit_sales_model(clean_data)),
      tar_target(model_table, summarise_model(model)),
      tar_target(sales_plot, plot_sales(clean_data))
    )

The target names are the nouns. The functions are the verbs.
Good target candidates:
Not every line needs to be a target.
Keep these inside functions:
Tip
Targets make the workflow visible. Functions keep the code readable.
This is fragile:
This is better:
The second version says exactly what it needs.
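The two snippets being compared are not reproduced here. One common form of the contrast, offered as an assumption about what is meant, is a function that reaches for an object in the global environment versus one that receives everything through its arguments. All names below are hypothetical.

```r
# Fragile: relies on an object called sales_data existing in the workspace.
mean_units_fragile <- function() {
  mean(sales_data$units)
}

# Better: every input is an argument, so the dependency is explicit.
mean_units <- function(data) {
  mean(data$units)
}

toy <- data.frame(units = c(2, 4, 6))
result <- mean_units(toy)
print(result)

# The fragile version fails in a clean session because sales_data is missing.
fragile_result <- tryCatch(mean_units_fragile(), error = function(e) "error")
print(fragile_result)
```

A pipeline runs targets in a clean process, so a function that only works when the right objects happen to be in your workspace is likely to fail there.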
_targets.R

Prefer this:
Instead of this:
The second version is allowed, but it becomes hard to read as the project grows.
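As a hedged illustration of the two styles, the fragment below declares the same cleaning step twice. The target name `clean_data_inline` and the cleaning logic are made up for the comparison, and `raw_data` and `clean_sales()` are assumed to be defined elsewhere in the pipeline and in `R/`.

```r
library(targets)

list(
  # Preferred: the command is a single call to a function defined in R/.
  tar_target(clean_data, clean_sales(raw_data)),

  # Allowed, but harder to read: the analysis details live in _targets.R.
  tar_target(
    clean_data_inline,
    {
      out <- raw_data[!is.na(raw_data$units), ]
      out$revenue <- out$units * out$price
      out
    }
  )
)
```

With the first style, `_targets.R` stays a short table of contents, and the cleaning logic can be tested on its own.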
tar_manifest(fields = "command") shows the commands for each target.
tar_visnetwork() draws the dependency graph.
tar_validate() checks that the pipeline is valid.
tar_make():
- _targets.R
- _targets/

Source: tar_make() documentation
On the next run, up-to-date targets are skipped.
tar_read() returns the saved value of one target.
tar_load() loads one or more targets into your R session.
Use these to inspect results. Do not rerun expensive code manually just to look at an output.
Common reasons a target becomes outdated:
| Change | Likely effect |
|---|---|
| data/raw/sales.csv changes | raw_file and downstream targets rerun |
| clean_sales() changes | clean_data and downstream targets rerun |
| fit_sales_model() changes | model and downstream targets rerun |
| report.qmd changes | the report target reruns |
| No relevant changes | everything is skipped |
tar_invalidate(model) deletes the metadata for model, so targets treats it as out of date.
Warning
Use this when you need it, but first ask why the target was not already invalidated.
If you remove targets from _targets.R, old stored results can remain in _targets/. tar_prune() removes stored targets that the pipeline no longer uses.
For a full reset:
Warning
tar_destroy() deletes the whole target store. It is useful, but it is not gentle.
Commit:
- _targets.R
- R/ scripts
- renv.lock

Usually do not commit:

- _targets/

The pipeline should be rebuildable from the committed source files.
Sometimes we need the same operation many times:
This is where dynamic branching becomes useful.
    list(
      tar_target(raw_file, "data/raw/sales.csv", format = "file"),
      tar_target(raw_data, read_sales(raw_file)),
      tar_target(clean_data, clean_sales(raw_data)),
      tar_target(regions, sort(unique(clean_data$region))),
      tar_target(
        region_data,
        dplyr::filter(clean_data, region == regions),
        pattern = map(regions)
      ),
      tar_target(
        region_model,
        fit_sales_model(region_data),
        pattern = map(region_data)
      )
    )

Each branch is tracked separately.
If one group changes, targets can avoid rerunning unrelated groups.
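As a mental model only, `pattern = map(regions)` behaves a little like the base-R loop below: one result per region, each computed from its own slice of the data. The analogy is loose. Unlike `lapply()`, targets stores each branch and can skip the ones that are still up to date. The data and the body of `fit_sales_model()` are made up for the example.

```r
# Made-up data with two regions.
clean_data <- data.frame(
  region = c("north", "north", "north", "south", "south", "south"),
  units  = c(10, 8, 5, 20, 15, 11),
  price  = c(1, 2, 3, 1, 2, 3)
)

# Stand-in for the pipeline's model function.
fit_sales_model <- function(data) lm(units ~ price, data = data)

regions <- sort(unique(clean_data$region))

# One "branch" per region: filter, then fit.
region_model <- lapply(regions, function(r) {
  region_data <- clean_data[clean_data$region == r, ]
  fit_sales_model(region_data)
})
names(region_model) <- regions

print(names(region_model))
```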
Dynamic branching is powerful, but the first goal is simpler:
Tip
Build a clear non-branching pipeline first. Add branches once the repeated structure is obvious.
The common mistake:
If the report is part of the project, make it part of the pipeline.
tar_quarto()

tar_quarto() comes from tarchetypes.
    library(targets)
    library(tarchetypes)

    tar_source()

    tar_option_set(packages = c("readr", "dplyr", "ggplot2", "broom"))

    list(
      tar_target(raw_file, "data/raw/sales.csv", format = "file"),
      tar_target(raw_data, read_sales(raw_file)),
      tar_target(clean_data, clean_sales(raw_data)),
      tar_target(model, fit_sales_model(clean_data)),
      tar_target(model_table, summarise_model(model)),
      tar_target(sales_plot, plot_sales(clean_data)),
      tar_quarto(report, path = "report.qmd")
    )

Inside report.qmd, read target outputs directly.
The report consumes outputs. It should not repeat the whole analysis.
What tar_quarto() tracks

tar_quarto() creates a file target for the rendered document.
It also:
- tar_read() and tar_load() calls in active R chunks

Source: tar_quarto() documentation
Keep the report focused on communication:
Tip
If a chunk takes a long time or produces an important object, consider making it a target.
tar_quarto() can only detect dependencies it can see.
Prefer:
Avoid hiding the target read inside another function:
The hidden version may render, but the dependency graph can become incomplete.
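The reason is that dependency detection works by reading the chunk code rather than running it. The sketch below imitates that idea with base R's `all.names()`. It is a simplification for illustration, not the actual tarchetypes implementation, and `get_results()` is a hypothetical helper.

```r
# Code as it might appear in two report chunks, captured without running it.
visible_chunk <- quote(targets::tar_read(model_table))
hidden_chunk  <- quote(get_results())  # imagine get_results() calls tar_read() inside

# A naive scanner: does the chunk's own code mention tar_read or tar_load?
mentions_target_read <- function(code) {
  any(c("tar_read", "tar_load") %in% all.names(code))
}

visible_found <- mentions_target_read(visible_chunk)
hidden_found  <- mentions_target_read(hidden_chunk)
print(c(visible = visible_found, hidden = hidden_found))
```

A scanner that only looks at the chunk text has no way to know what `get_results()` does internally, so the dependency on `model_table` would be missed.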
You can pass target values into Quarto parameters.
For most projects, directly using targets::tar_read() in the report is simpler.
When the pipeline breaks:
- tar_make()
- _targets.R
- tar_load()
- tar_make()

The bug is usually in your R function, not in targets.
If this fails interactively, fix clean_sales().
| Task | Command |
|---|---|
| Inspect target commands | tar_manifest(fields = "command") |
| Draw the graph | tar_visnetwork() |
| Run the pipeline | tar_make() |
| List outdated targets | tar_outdated() |
| Read one target | tar_read(target_name) |
| Load one target | tar_load(target_name) |
| Remove unused stored targets | tar_prune() |

- R/
- _targets.R
- tar_validate()
- tar_visnetwork()
- tar_make()
- tar_quarto()

| Tool | What it records |
|---|---|
| git | Project history |
| Quarto | How the report is generated |
| targets | What runs, in what order, and what is up to date |
| renv | R package versions used by the project |
| Docker | Operating system and system dependencies |
Today, targets handles the workflow. renv handles the package library.
What renv does

renv helps make R projects:

- renv.lock

Source: renv documentation
renv workflow

Install or update packages as needed.
Record the current project library with renv::snapshot().
Restore it later with renv::restore().
targets plus renv

There is one extra helper, targets::tar_renv().
This writes _targets_packages.R, a generated file that helps renv detect packages declared in:

- tar_option_set(packages = ...)
- tar_target(packages = ...)

Source: tar_renv() documentation
Then commit:
- _targets.R
- _targets_packages.R
- R/
- report.qmd
- renv.lock

On another computer: renv::restore(), then targets::tar_make().
What renv does not do

renv does not decide:
That is the job of targets.
What targets does not do

targets does not guarantee:
That is why reproducibility needs layers.
_targets.R

    library(targets)
    library(tarchetypes)

    tar_source()

    tar_option_set(
      packages = c("readr", "dplyr", "lubridate", "ggplot2", "broom")
    )

    list(
      tar_target(raw_file, "data/raw/sales.csv", format = "file"),
      tar_target(raw_data, read_sales(raw_file)),
      tar_target(clean_data, clean_sales(raw_data)),
      tar_target(model, fit_sales_model(clean_data)),
      tar_target(model_table, summarise_model(model)),
      tar_target(sales_plot, plot_sales(clean_data)),
      tar_quarto(report, path = "report.qmd")
    )

A good targets project should let you:
- renv::restore() if needed
- targets::tar_make()
- tar_visnetwork()

That is a much stronger claim than “it worked on my computer yesterday”.

- targets turns an analysis into a dependency graph
- R/
- tar_quarto() to make reports part of the pipeline
- renv to record package versions
ETC5513 Week 10