ETC5513: Reproducible and Collaborative Practices

Reproducibility in the wild: a capstone case study

Lecturer: Michael Lydeamore

Department of Econometrics and Business Statistics

michael.lydeamore@monash.edu
Week 12
rcp.numbat.space

Open Frame

Recap

Quarto puts prose, code, and output in one document
git records project history
GitHub helps coordinate collaboration
renv records package versions
targets records the analysis workflow
Docker can record the computer environment

Today is about judgement: using the right layer at the right time.

Today’s plan

Aim

Inherit a broken analysis project
Audit its reproducibility claims
Repair the project structure
Turn a script into functions
Write a targets pipeline from scratch
Finish with a reproducible handoff

The Case

The handoff

You have inherited a project from a previous analyst.

The team needs a final report.

The README says:

quarto render report.qmd

That is a claim. Today we test it.

Files for today

week12/case-study/
|-- broken/
|   |-- README.md
|   |-- analysis.R
|   |-- data/raw/cafe_sales.csv
|   `-- report.qmd
`-- solution/
    |-- _targets.R
    |-- R/
    |-- data/raw/cafe_sales.csv
    `-- report.qmd

Instructor notes: week12/guided-notes.qmd

First run

From the project root:

cd week12/case-study/broken
quarto render report.qmd

If it fails, that is useful evidence.

Then inspect the fallback

The README also says:

source("analysis.R")

Before editing, read the script and ask: what assumptions does it make?

What we are looking for

Missing files
Wrong paths
Output names that do not match the report
Columns that do not exist
Manual state hidden in the analyst’s computer
Packages that are used but not recorded
Steps that must happen in a particular order
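
One of these symptoms, columns that do not exist, can be checked before any analysis code runs. A minimal sketch in base R: `missing_columns()` is a hypothetical helper, and of the column names used here only `date` and `cancelled` appear in the inherited script; `campus` and `revenue` are assumptions for illustration.

```r
# Report which required columns are absent from a data frame.
# `missing_columns()` is a hypothetical helper, not part of the case-study code.
missing_columns <- function(data, required) {
  setdiff(required, names(data))
}

# A small in-memory stand-in for the raw sales data (made-up rows).
sales <- data.frame(
  date = c("2024-03-04", "2024-03-05"),
  campus = c("A", "B"),
  cancelled = c("No", "Yes")
)

# Column names the analysis is assumed to need.
required <- c("date", "campus", "revenue", "cancelled")

missing_columns(sales, required)
# returns "revenue"
```

Failing early with a named column is easier to diagnose than an error several steps later in a summary.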

The deeper problem

The project does not have one clear answer to:

How do I rebuild the final report from the raw data?

That is the reproducibility question underneath all the symptoms.

Audit

Do not start by fixing code

Start by finding the shape of the project.

ls
find . -maxdepth 3 -type f

Then read:

cat README.md

The README is part of the software.

Audit checklist

Question	Evidence
Can the report render?	`quarto render report.qmd`
Can the script run?	`source("analysis.R")`
Are paths project-relative?	Look for absolute paths and mismatched folders
Are raw and derived files separate?	Inspect `data/`, `output/`, `outputs/`
Are packages recorded?	Look for `renv.lock`, `DESCRIPTION`, README
Is the workflow explicit?	Look for `_targets.R` or one rebuild command
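
Some of this evidence can be gathered by a short script rather than by eye. A rough sketch in base R, assuming the project keeps its code in `.R` and `.qmd` files; `audit_project()` is a hypothetical helper and its absolute-path pattern is a heuristic that will miss some cases.

```r
# Collect quick evidence for the audit checklist from a project folder.
# `audit_project()` is a hypothetical helper; the file names it looks for
# come from the checklist.
audit_project <- function(root = ".") {
  scripts <- list.files(root, pattern = "\\.(R|qmd)$", recursive = TRUE,
                        full.names = TRUE, ignore.case = TRUE)

  # Heuristic: a quoted string starting with "/", "~", or a drive letter.
  absolute <- "[\"'](/|~|[A-Za-z]:[/\\\\])"
  has_absolute <- vapply(
    scripts,
    function(f) any(grepl(absolute, readLines(f, warn = FALSE))),
    logical(1)
  )

  list(
    packages_recorded = any(
      file.exists(file.path(root, c("renv.lock", "DESCRIPTION")))
    ),
    workflow_explicit = file.exists(file.path(root, "_targets.R")),
    scripts_with_absolute_paths = basename(scripts[has_absolute])
  )
}
```

Running `audit_project("week12/case-study/broken")` from the repository root would give a starting list of things to read more closely; it does not replace rendering the report.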

A useful repair order

Make the current failure visible
Fix the smallest path and naming errors
Identify the real analysis steps
Move reusable work into functions
Write the pipeline
Make the report depend on generated outputs
Update the README

Small wins first; structure second.

What not to do yet

Do not rewrite the statistical analysis
Do not add Docker before the local workflow works
Do not hide manual steps in the README
Do not treat stale output files as evidence

First make the project honest.
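
An output file that exists is not proof that the current code produced it. One rough signal is whether an output was last written before the raw data changed. A sketch in base R; `stale_outputs()` is a hypothetical helper, and modification times are only a hint because copying or checking out files resets them.

```r
# List output files that were last written before the raw data file changed.
# `stale_outputs()` is a hypothetical helper, not part of the case-study code.
stale_outputs <- function(raw_file, output_dir) {
  outputs <- list.files(output_dir, full.names = TRUE)
  outputs[file.mtime(outputs) < file.mtime(raw_file)]
}
```

For example, `stale_outputs("data/raw/cafe_sales.csv", "outputs")` would list candidates to regenerate rather than trust.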

Repair

What the script is doing

The inherited script mixes several jobs:

Read data
Clean data
Summarise by week
Summarise by campus
Make a plot
Write report inputs

Those are natural pipeline targets.

From script to functions

Move the work into small functions:

R/
|-- data.R
|-- summaries.R
|-- plots.R
`-- report.R

Functions are easier to test, reuse, and connect.

A function boundary

Before:

sales <- read_csv("data/orders.csv")
sales$date <- as.Date(sales$date, format = "%d/%m/%Y")
clean_sales <- filter(sales, cancelled == "No")

After:

read_sales <- function(path) {
  readr::read_csv(path, show_col_types = FALSE)
}

clean_sales <- function(sales) {
  sales |>
    dplyr::mutate(date = as.Date(date)) |>
    dplyr::filter(cancelled == "No")
}
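
Because `clean_sales()` takes a data frame rather than a path, it can be checked without touching the disk. A small sketch with made-up rows, assuming the dplyr package is installed and that dates arrive in ISO form (`YYYY-MM-DD`); if the raw file stores day/month/year strings, the `format` argument from the "Before" version would still be needed.

```r
# Same definition as above, repeated so this block runs on its own.
clean_sales <- function(sales) {
  sales |>
    dplyr::mutate(date = as.Date(date)) |>
    dplyr::filter(cancelled == "No")
}

# Made-up rows; the ISO date format is an assumption about the raw file.
toy_sales <- data.frame(
  date = c("2024-03-04", "2024-03-05", "2024-03-06"),
  cancelled = c("No", "Yes", "No")
)

cleaned <- clean_sales(toy_sales)

# The function should parse dates and drop cancelled orders.
stopifnot(
  inherits(cleaned$date, "Date"),
  nrow(cleaned) == 2,
  all(cleaned$cancelled == "No")
)
```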

Name the outputs

The report needs:

outputs/campus_summary.csv
outputs/weekly_summary.csv
outputs/revenue_by_week.png

If the report depends on these files, the pipeline should create these files.

Now write `_targets.R`

Start with an empty file.

library(targets)

tar_source()

tar_option_set(
  packages = c("dplyr", "readr", "ggplot2", "scales", "knitr")
)

No half-built pipeline. We write the contract ourselves.

Add the raw file

list(
  tar_target(raw_sales_file, "data/raw/cafe_sales.csv", format = "file")
)

Setting format = "file" tells targets to track the contents of the file itself, so a change to the raw data invalidates everything downstream.
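
The principle can be illustrated outside targets: what matters is the file's contents, not its name. A minimal sketch using `tools::md5sum()` from base R as a stand-in; targets uses its own hashing internally, so this shows the idea rather than its implementation. The CSV rows are made up.

```r
# Write a small file and fingerprint its contents.
path <- tempfile(fileext = ".csv")
writeLines(c("campus,revenue", "A,10"), path)
before <- unname(tools::md5sum(path))

# Same path, different contents: the fingerprint changes.
writeLines(c("campus,revenue", "A,12"), path)
after <- unname(tools::md5sum(path))

changed <- !identical(before, after)
```

A changed fingerprint is the signal that everything depending on the file needs to be rebuilt.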

Add data objects

list(
  tar_target(raw_sales_file, "data/raw/cafe_sales.csv", format = "file"),
  tar_target(raw_sales, read_sales(raw_sales_file)),
  tar_target(clean_sales_data, clean_sales(raw_sales))
)

Each target is a named promise: this object can be rebuilt.

Add summaries

tar_target(campus_summary, summarise_by_campus(clean_sales_data)),
tar_target(weekly_summary, summarise_by_week(clean_sales_data))

The order comes from dependencies, not from memory.
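
The two summary functions are not shown on the slides. A sketch of what they might look like, assuming the dplyr package is installed and that the cleaned data has `campus`, `revenue`, and a `Date` column `date`; the column names and the choice of Monday as the start of a week are assumptions, not taken from the solution folder.

```r
# Total orders and revenue per campus (column names are assumptions).
summarise_by_campus <- function(sales) {
  sales |>
    dplyr::group_by(campus) |>
    dplyr::summarise(
      orders = dplyr::n(),
      revenue = sum(revenue),
      .groups = "drop"
    )
}

# Total revenue per week, with each week labelled by its Monday.
summarise_by_week <- function(sales) {
  sales |>
    dplyr::mutate(
      # "%u" is the weekday number with Monday = 1, so this steps back to Monday.
      week_start = date - (as.integer(format(date, "%u")) - 1L)
    ) |>
    dplyr::group_by(week_start) |>
    dplyr::summarise(revenue = sum(revenue), .groups = "drop")
}
```

Both functions take the same input and neither calls the other, so targets is free to build them in either order.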

Add output files

tar_target(
  campus_summary_file,
  write_output_csv(campus_summary, "outputs/campus_summary.csv"),
  format = "file"
)

Generated files can also be targets, provided the command returns the path of the file it wrote.
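
`write_output_csv()` is not defined on the slides. A minimal sketch of what such a helper might look like, using `utils::write.csv()` from base R (the solution may well use readr instead). The important detail for a `format = "file"` target is the last line: the function returns the path.

```r
# Write a data frame to CSV and return the path, as a file target requires.
# This body is an assumption; only the function name comes from the slides.
write_output_csv <- function(data, path) {
  # Create outputs/ (or any parent folder) on a fresh clone.
  dir.create(dirname(path), recursive = TRUE, showWarnings = FALSE)
  utils::write.csv(data, path, row.names = FALSE)
  path
}
```

Creating the folder inside the function means a fresh clone does not rely on an `outputs/` directory someone made by hand.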

Add the report

tar_target(
  report,
  render_report(
    input = "report.qmd",
    dependencies = c(
      campus_summary_file,
      weekly_summary_file,
      weekly_revenue_plot
    )
  ),
  format = "file"
)

The report should not render until its inputs exist.
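
`render_report()` is also left undefined on the slides. A sketch under several assumptions: the Quarto command-line tool is on the `PATH`, the report renders to HTML next to `report.qmd`, and every dependency is a file path. The `render` argument and `quarto_cli_render()` are hypothetical additions that let the function be checked without Quarto installed.

```r
# Call the Quarto CLI; returns the exit status (0 on success).
quarto_cli_render <- function(path) {
  system2("quarto", c("render", shQuote(path)))
}

# Render the report only if its inputs exist, then return the output path
# so the target can be tracked with format = "file".
render_report <- function(input, dependencies, render = quarto_cli_render) {
  missing <- dependencies[!file.exists(dependencies)]
  if (length(missing) > 0) {
    stop("Missing report inputs: ", paste(missing, collapse = ", "))
  }

  status <- render(input)
  if (!identical(as.integer(status), 0L)) {
    stop("Rendering failed for ", input)
  }

  # Assumption: HTML output written alongside the input file.
  paste0(tools::file_path_sans_ext(input), ".html")
}
```

Passing the output files as an argument is what makes targets see them as upstream of the report, even though the function only checks that they exist.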

Handoff

The final test

From the repaired project folder:

targets::tar_make()

Then run it again:

targets::tar_make()

The second run should skip work that is already up to date.
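
"Up to date" can be checked rather than read off the console. A small sketch, assuming the targets package is installed and the working directory contains `_targets.R`; `rebuild_is_current()` is a hypothetical helper built on `targets::tar_outdated()`, which returns the names of targets that would be rebuilt.

```r
# Build the pipeline, then confirm nothing is left to rebuild.
# `rebuild_is_current()` is a hypothetical helper name.
rebuild_is_current <- function() {
  targets::tar_make()
  length(targets::tar_outdated()) == 0
}
```

If this returns `FALSE` straight after a build, some target depends on something targets cannot see, which is worth finding before the handoff.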

Inspect the workflow

targets::tar_manifest()
targets::tar_visnetwork()

A good project can explain how it rebuilds itself.

Update the README

The README should say:

install.packages(c("targets", "dplyr", "readr", "ggplot2", "scales", "knitr"))
targets::tar_make()

And it should explain:

Where raw data lives
Which outputs are generated
What the one rebuild command is

What each layer did

Layer	In this case
Quarto	Final report
Git	History of the repair
GitHub Issues	Work that could be assigned
`renv`	Package versions, if this were being handed to a new machine
`targets`	Rebuild order and stale output detection
Docker	Only needed if R/system environment becomes the problem

What can we now claim?

Before:

The report worked on the previous analyst’s laptop.

After:

From the project folder, one command rebuilds the derived files and final report from the raw data.

That is a stronger and more testable claim.

Debrief questions

Which failure was easiest to diagnose?
Which failure was most dangerous?
What should have been in the README?
Where would renv add value?
When would Docker be justified?

Summary

Reproducibility is a claim you test
Broken projects are usually broken at the joins
Reports should not depend on mystery files
Scripts become safer when turned into functions
Pipelines make the rebuild contract explicit
A good handoff tells the next person exactly what to run

Files

Broken project: week12/case-study/broken
Working version: week12/case-study/solution
Guided notes: guided-notes.html
