ETC5513: Reproducible and Collaborative Practices

Keeping environments separate and reproducible

Lecturer: Michael Lydeamore

Department of Econometrics and Business Statistics



Environment reproduction

Why Do We Need This?

  • Projects become harder to manage as they grow
  • We want to ensure reproducibility:
    • Use the same packages and versions
    • Avoid redundant computation
  • renv handles package environments
  • targets handles workflow pipelines

📦 What is renv?

  • renv stands for “Reproducible ENVironments”
  • Captures your package dependencies
  • Makes your project portable and self-contained

Libraries and repositories

We first have to understand what a library is and what a repository is.

A library is a directory that contains all of your installed packages. So far, we have had a single library shared across all of our projects.

You can see your current libraries by running .libPaths()

Libraries and repositories

A repository is a source of packages, such as CRAN.

Other repositories include: Bioconductor, Posit Public Package Manager, and R Universe.

You can see your available repositories with getOption("repos")
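For example, you can run both checks in your console (the paths and repositories you see will differ by machine):

.libPaths()          # where your installed packages live
getOption("repos")   # where install.packages() downloads from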

Setting up renv

install.packages("renv")
renv::init()

This will set up a project-specific library, which isolates the packages for each project.

How renv Works

  • renv creates a lockfile (renv.lock)
  • Tracks your installed packages and versions
  • You can restore your environment later:
renv::restore()
  • Keeps packages project-specific (not global)

Key Files from renv

  • renv.lock — lists exact package versions
  • renv/ folder — stores the project library
  • .Rprofile — ensures renv is activated when you open the project

Make sure you check these files in to version control!

How to use renv

  • Install packages as normal
  • After each package install, run
renv::snapshot()
  • Push the changed renv.lock file to your repository
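In practice, the loop looks like this (a minimal sketch, using dplyr as a stand-in for whichever package you actually install):

install.packages("dplyr")   # install into the project library as usual
renv::snapshot()            # record the package and its version in renv.lock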

The lockfile

Here is an example lockfile:

{
  "R": {
    "Version": "4.4.3",
    "Repositories": [
      {
        "Name": "CRAN",
        "URL": "https://cloud.r-project.org"
      }
    ]
  },
  "Packages": {
    "markdown": {
      "Package": "markdown",
      "Version": "1.0",
      "Source": "Repository",
      "Repository": "CRAN",
      "Hash": "4584a57f565dd7987d59dda3a02cfb41"
    },
    "mime": {
      "Package": "mime",
      "Version": "0.12.1",
      "Source": "GitHub",
      "RemoteType": "github",
      "RemoteHost": "api.github.com",
      "RemoteUsername": "yihui",
      "RemoteRepo": "mime",
      "RemoteRef": "main",
      "RemoteSha": "1763e0dcb72fb58d97bab97bb834fc71f1e012bc",
      "Requirements": [
        "tools"
      ],
      "Hash": "c2772b6269924dad6784aaa1d99dbb86"
    }
  }
}

There are two main things in here: R and Packages.

It is Packages that has everything needed to reinstall an exact version of a package.

Reusing packages across projects

Often we use the same packages in most of our projects. Conveniently, renv reuses installed packages across projects by maintaining a global cache.

You’ll sometimes see a message that says:

Linked from Cache

when installing. This is the package being re-used!
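renv can also tell you where that cache lives (the exact path is machine-specific):

renv::paths$cache()   # location of the shared renv package cache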

Risks of renv

renv doesn’t solve everything for you:

  • Doesn’t manage R versions
  • When packages become “old” they have to be compiled from source
    • Which is slow
    • And error prone
  • Doesn’t track pandoc
  • Can’t help with the operating system or system-level dependencies

Sharing Projects

When someone else clones your project:

  1. Open the RStudio project
  2. Run:
renv::restore()

Using renv with GitHub Actions (Advanced)

  • GitHub Actions can install your project dependencies automatically
  • renv::restore() is run as part of the CI workflow
  • Good for checking that your project runs cleanly on fresh machines

Example GitHub Actions Setup

Add this to .github/workflows/ci.yaml:

name: R-CMD-check

on: [push, pull_request]

jobs:
  R-CMD-check:
    runs-on: ubuntu-latest

    steps:
      - uses: actions/checkout@v3
      
      - name: Set up R
        uses: r-lib/actions/setup-r@v2

      - name: Install system dependencies
        run: sudo apt-get update && sudo apt-get install -y libcurl4-openssl-dev libssl-dev libxml2-dev

      - name: Restore packages with renv
        run: Rscript -e 'install.packages("renv"); renv::restore()'

      - name: Run your script or tests
        run: Rscript your_script.R

Your GitHub Action will now use the same package versions as your local setup.

When Should You Use This?

  • To check your pipeline still runs after changes
  • When collaborating, to ensure a shared reproducible environment
  • For automated reports, like rendering Quarto documents

Pipeline management

The typical pipeline

Often, we end up with something like this:

01-data.R
02-model.R
03-plots.R

And then we source these in order each time.

This works OK for small projects, but scales very poorly.
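The “run everything” script then looks something like this (file names taken from the listing above):

# e.g. run_all.R: re-runs every step, every time, whether it is needed or not
source("01-data.R")
source("02-model.R")
source("03-plots.R")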

What is targets?

  • Think Makefile, but for R
  • Automates steps in your workflow
  • Re-runs only the steps that need updating
install.packages("targets")

Example Pipeline

library(targets)

# read_csv() comes from readr, so tell targets to load it when building
tar_option_set(packages = "readr")

list(
  tar_target(data, read_csv("data.csv")),
  tar_target(model, lm(y ~ x, data = data))
)
  • Each tar_target() defines a step
  • targets watches for changes

How It Works

Pipeline:

read_csv() → data ─┐
                   └──> lm() → model

If the data changes, targets knows to rerun read_csv() and everything downstream. (To have targets watch data.csv itself, track the file with its own target using format = "file".)

Running a Pipeline

tar_make()
  • Builds your pipeline
  • Skips steps if inputs haven’t changed
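For example, running it twice in a row shows the caching at work (a sketch based on the pipeline defined above; the exact messages will vary):

library(targets)
tar_make()   # first run: builds data, then model
tar_make()   # immediate second run: nothing changed, so both targets are skipped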

Project Structure Example

your-project/
├── _targets.R     # pipeline definition
├── data/
├── R/             # helper functions
├── renv.lock
├── renv/
└── analysis.qmd

How does it work?

When it builds a target, targets hashes the function (and inputs) used to create it.

If the function hasn’t changed, the hash stays the same, and the stored result is reused.

If the function has changed, then so too will the hash, and the target is rebuilt.
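A toy illustration of the idea using the digest package (this is not what targets uses internally, but the principle is the same):

library(digest)

f <- function(x) x + 1
digest(deparse(f))   # hash of the function's source code
f <- function(x) x + 2
digest(deparse(f))   # the code changed, so the hash changes too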

How does it work?

Results are stored on-disk in a compressed format.

Targets can be loaded using tar_load() or tar_read() (learn the keyboard shortcuts!)

Using targets with plots

_targets.R

library(targets)
# penguins comes from palmerpenguins; ggplot2 is needed for the plotting code
tar_option_set(packages = c("ggplot2", "palmerpenguins"))

list(
  tar_target(
    penguins_plot,
    ggplot(penguins, aes(x = flipper_length_mm, y = body_mass_g)) + geom_point())
)

In your console:

tar_make()
tar_read(penguins_plot)

tar_read() returns (and prints) the object; tar_load() loads the object into your workspace under its target name (like using <-).
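Side by side, using the plot target defined above:

p <- tar_read(penguins_plot)   # returns the stored ggplot object
tar_load(penguins_plot)        # creates penguins_plot in your workspace
penguins_plot                  # now available like any other object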

tarchetypes

tarchetypes contains useful functions for working with targets

Useful for:

  • Dynamic branching
  • Rendering Quarto documents
  • Utility functions

Dynamic branching

Often we have a dataset where we want to run a pipeline on the groups. In standard R, we do this as follows:

library(dplyr)
library(palmerpenguins)

penguins |>
  group_by(species) |>
  summarise(mean_flipper = mean(flipper_length_mm, na.rm = TRUE))
# A tibble: 3 × 2
  species   mean_flipper
  <fct>            <dbl>
1 Adelie            190.
2 Chinstrap         196.
3 Gentoo            217.

Dynamic branching

In targets-land, we do it like this instead:

# tar_group_by() comes from tarchetypes
list(
  tar_group_by(grouped_penguins, penguins, species),
  tar_target(
    mean_flipper,
    summarise(
      grouped_penguins,
      mean_flipper = mean(flipper_length_mm, na.rm = TRUE),
      species = unique(species)
    ),
    pattern = map(grouped_penguins)
  )
)

Dynamic branching

This results in

> tar_read(mean_flipper)
# A tibble: 3 × 2
  mean_flipper species  
         <dbl> <fct>    
1         190. Adelie   
2         196. Chinstrap
3         217. Gentoo   

Note: the branch outputs are row-bound together by default. If your output is not a vector or data frame, you will need iteration = "list" as an argument to the target. We will see this in the workshop!

Rendering Quarto documents

You can also use targets in your Quarto documents with

tar_quarto()

This will scan your .qmd file for tar_load() and tar_read() calls, treat those targets as dependencies, and re-render the document whenever they change.
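A minimal sketch of what this looks like in _targets.R, assuming a report.qmd in the project that calls tar_read(penguins_plot) (both names are just examples):

library(targets)
library(tarchetypes)
tar_option_set(packages = c("ggplot2", "palmerpenguins"))

list(
  tar_target(
    penguins_plot,
    ggplot(penguins, aes(x = flipper_length_mm, y = body_mass_g)) + geom_point()),
  tar_quarto(report, "report.qmd")   # re-renders when penguins_plot changes
)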

Checking if targets are up to date

tar_visnetwork()

will show you outdated, up-to-date, and not yet run targets.

Remember, targets only runs what it thinks it needs to!
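If you just want the names rather than the graph, targets also provides tar_outdated():

tar_outdated()   # character vector of the targets that tar_make() would rebuild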

How to use this with version control

You should check in your _targets.R file, but you typically ignore the cache.

To do that, add

_targets

to your .gitignore file.

Advanced targeting

We’ve scratched the surface of this package. You can also:

  • Export your results to an AWS S3 bucket
  • Push results to Posit pins for re-use elsewhere
  • Branch statically (with sensible names) or dynamically
  • Recombine results from branches

It also forces you to write your code in a “functional” way, which leads to easier code maintenance and readability down the track.

Recap

renv:

  • Tracks your installed packages and versions
  • Keeps packages project-specific (not global)

targets:

  • Automates steps in your workflow
  • Re-runs only the steps that need updating

Next week:

Docker!