Lecturer: Michael Lydeamore
Department of Econometrics and Business Statistics
- `renv` records the R package environment
- `targets` records the workflow and reruns only what is needed

Today we move one layer lower: the computer environment itself.
Aim
- `Dockerfile` for an R project

Tip
Docker is included to show you the current gold standard for portable computational environments.
You should understand the ideas and recognise the workflow.
You are not expected to memorise every Docker command for assessment.
But on someone else’s machine:
Reproducibility is not only about code.
| Layer | Tool we have used | What it records |
|---|---|---|
| Code history | `git` | What changed, when, and why |
| Documents | Quarto | Text, code, and outputs together |
| R packages | `renv` | Package versions for the project |
| Workflow | `targets` | What depends on what |
| Computer environment | Docker | Operating system, system libraries, and commands |
Docker does not replace the earlier tools. It wraps them in a portable environment.
Docker lets us describe and run a small computer environment.
That environment can include:
The description is plain text, so it can be committed to git.
Docker is not:
Tip
Use Docker when the environment itself is part of the reproducibility problem.
An image is like a saved template. A container is a running instance of that template.
| Term | Meaning |
|---|---|
| `Dockerfile` | A text file with instructions to build an image |
| Image | A reusable template for a container |
| Container | A running environment created from an image |
| Registry | A place to publish and download images |
| Mount | A way to connect host files to a container |
| Volume | Storage managed by Docker |
Containers are like virtual machines, but lighter.
| Virtual machine | Container |
|---|---|
| Simulates a whole computer | Shares the host system kernel |
| Usually larger | Usually smaller |
| Slower to start | Faster to start |
| Stronger isolation | Enough isolation for many data workflows |
For reproducible R projects, containers are usually the more practical option.
The container is isolated, but it can still connect to:
Docker Desktop gives us:
Download: https://www.docker.com/products/docker-desktop
After installation, open Docker Desktop before running Docker commands.
In the terminal:
If Docker is working, it downloads the tiny hello-world image and runs it.
This is the smallest possible “does my Docker setup work?” check.
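The commands for this check are not shown above. A minimal sketch, assuming the standard Docker CLI; the `docker info` guard is an addition so that the snippet prints a hint instead of failing when Docker Desktop is not open.

```shell
# Guard (an addition, not from the lecture): proceed only if the Docker daemon is reachable.
if docker info >/dev/null 2>&1; then
  docker_ready="yes"
  docker --version          # confirm the CLI is installed
  docker run hello-world    # pull the tiny test image and run it once
else
  docker_ready="no"
  echo "Docker is not running. Open Docker Desktop and try again."
fi
```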
The Rocker project provides Docker images for R.
Common examples include:
- `r-base`: a minimal R image
- `rocker/r-ver`: versioned R images
- `rocker/rstudio`: RStudio Server in a browser
- `rocker/verse`: RStudio plus many common data science tools

Rocker saves us from writing a full Linux and R installation from scratch.
Docker Desktop can search for images and download them.
For teaching, the terminal is still clearer because the command documents exactly what happened.
This starts R inside a container.
The prompt looks familiar, but R is running inside the container, not directly on your computer.
- `run` creates and starts a container
- `-t` gives the container a terminal
- `-i` keeps input open
- `--rm` removes the container after it stops
- `r-base` is the image name

Use q() to exit the R session.
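Put together, the flags described above give a command along these lines. This is a sketch reconstructed from that description, not necessarily the exact command from the slide; the guard is an addition so that it only starts when Docker is running and a terminal is attached.

```shell
image="r-base"   # image name as described above

# Interactive sessions need a running Docker daemon and a real terminal on stdin.
if docker info >/dev/null 2>&1 && [ -t 0 ]; then
  docker run -t -i --rm "$image"
else
  echo "Skipping: needs a running Docker daemon and an interactive terminal."
fi
```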
You can also check in the terminal:
For stopped containers:
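Both checks can be sketched with the standard `docker ps` command; the guard is an addition for machines where Docker is not running.

```shell
if docker info >/dev/null 2>&1; then
  docker ps       # running containers only
  docker ps -a    # all containers, including stopped ones
else
  echo "Docker is not running."
fi
```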
Start a container:
Install a package by hand inside it.
Exit the container.
Start the container again.
The package is gone. That is a feature, not a bug.
Disposable containers make it harder to accidentally depend on:
Tip
If you need something every time, put it in the image or mount it from the host.
A Dockerfile describes how to build an image.
This says: start from the r-base image.
Every image builds on another image, unless it starts from scratch.
RUN executes a command while the image is being built.
After the image is built, these packages are part of the image.
CMD says what should happen when a container starts.
Here the default behaviour is to open R.
In the folder containing the Dockerfile:
Then run it:
The . at the end of docker build means “use this folder as the build context”.
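The three instructions and the build step can be sketched end to end. Everything named here is illustrative: the scratch folder, the `dplyr` package, and the `my-r-image` tag are placeholders, not values from the lecture.

```shell
# Work in a scratch folder so nothing in a real project is overwritten.
demo_dir="$(mktemp -d)"

# FROM sets the base image, RUN installs at build time, CMD sets the default command.
# 'dplyr' is a placeholder package chosen for illustration.
cat > "$demo_dir/Dockerfile" <<'EOF'
FROM r-base
RUN Rscript -e "install.packages('dplyr', repos = 'https://cloud.r-project.org')"
CMD ["R"]
EOF

image_tag="my-r-image"   # hypothetical image name
if docker info >/dev/null 2>&1; then
  # The last argument is the build context: the folder holding the Dockerfile.
  docker build -t "$image_tag" "$demo_dir"
  # Start the image interactively only when a terminal is attached.
  if [ -t 0 ]; then docker run -t -i --rm "$image_tag"; fi
fi
```

From inside the folder that contains the Dockerfile, the same build is `docker build -t my-r-image .`.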
Tags are labels for images.
Avoid relying on latest when you need long-term reproducibility.
Less reproducible:
More reproducible:
The second version says exactly which R version family the image starts from.
For an R project using renv:
FROM rocker/r-ver:4.4.3
WORKDIR /project
ENV RENV_PATHS_LIBRARY=/opt/renv/library
ENV RENV_PATHS_CACHE=/opt/renv/cache
ARG QUARTO_VERSION=1.8.27
RUN apt-get update \
  && apt-get install -y --no-install-recommends curl ca-certificates \
  && arch="$(dpkg --print-architecture)" \
  && curl -L -o quarto.deb \
    "https://github.com/quarto-dev/quarto-cli/releases/download/v${QUARTO_VERSION}/quarto-${QUARTO_VERSION}-linux-${arch}.deb" \
  && apt-get install -y ./quarto.deb \
  && quarto --version \
  && rm quarto.deb \
  && rm -rf /var/lib/apt/lists/*
RUN Rscript -e "install.packages('renv')"
COPY .Rprofile .Rprofile
COPY renv.lock renv.lock
COPY renv/activate.R renv/activate.R
COPY renv/settings.json renv/settings.json
RUN Rscript -e "renv::restore(prompt = FALSE)"
COPY . .
CMD ["Rscript", "-e", "renv::load('/project'); targets::tar_make()"]

renv::restore() installs packages into /opt/renv/library. renv::load() makes sure that library is used when the container starts.
Putting the library outside /project matters because a bind mount can replace /project at runtime.
The Quarto CLI is installed separately because the R package quarto does not provide the quarto command.
Why copy renv.lock first?

Docker builds in layers.
If the package lockfile has not changed, Docker can reuse the package installation layer.
This makes rebuilds much faster when only your analysis code changes.
.dockerignore file

The build context should not include everything.
This keeps images smaller and avoids copying local machine state into the container.
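A possible `.dockerignore` for an R project is sketched below. The entries are assumptions about typical local state (git history, RStudio session files, a local renv library, the targets store, secrets), not a list taken from the lecture.

```shell
# Write an example .dockerignore into a scratch folder (not into a real project).
demo_dir="$(mktemp -d)"

cat > "$demo_dir/.dockerignore" <<'EOF'
.git
.Rproj.user
.Rhistory
.RData
renv/library/
_targets/
.env
EOF

cat "$demo_dir/.dockerignore"
```

Only `renv/library/` is excluded, so `renv/activate.R` and `renv/settings.json` remain available to the build.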
A containerised analysis image might include:
Usually avoid copying:
Tip
Commit the recipe, not the leftovers.
In git, keep:
- `Dockerfile`
- `.dockerignore`
- `renv.lock`

Do not commit the built image itself.
| Pattern | What happens | Good for |
|---|---|---|
| Container filesystem | Files live inside one container | Short experiments |
| Bind mount | Host folder appears inside container | Active development |
| Named volume | Docker manages persistent storage | Databases and service data |
For this unit, bind mounts are the most useful pattern.
This means:
- /project
- /project

Without a bind mount:
With a bind mount:
This is how we use Docker while still editing files normally.
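A bind mount is requested with `-v host_path:container_path`. The sketch below assumes an image tagged `my-r-image` (a hypothetical name) and mounts the current folder at `/project`.

```shell
image_tag="my-r-image"        # hypothetical image built from the project's Dockerfile
mount_arg="$(pwd):/project"   # host folder : folder inside the container

if docker info >/dev/null 2>&1 && [ -t 0 ]; then
  docker run -t -i --rm -v "$mount_arg" "$image_tag"
else
  echo "Would run: docker run -t -i --rm -v \"$mount_arg\" $image_tag"
fi
```

Files edited on the host appear immediately under `/project` in the container, and files written there by the container appear in the host folder.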
RStudio Server runs in the container and opens in your browser.
Then open:
- `-d` runs the container in the background
- `-p 8787:8787` maps a browser port to the container
- `-e PASSWORD=changeme` sets an environment variable
- `--rm` removes the container when it stops

Use a stronger password for anything that is not just local teaching work.
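Combining the flags described above with the `rocker/rstudio` image gives a command along these lines. It is a sketch reconstructed from that description; the login name `rstudio` is the image's default user.

```shell
rstudio_url="http://localhost:8787"

if docker info >/dev/null 2>&1; then
  docker run -d --rm -p 8787:8787 -e PASSWORD=changeme rocker/rstudio
  echo "Open $rstudio_url and log in as 'rstudio' with the password set above."
else
  echo "Docker is not running."
fi
```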
List running containers:
Stop one:
You only need enough of the container ID to make it unique.
Create a Docker-managed volume:
Use it:
Volumes are useful when the data belongs to the service, not directly to your project folder.
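As a hedged illustration, the sketch below creates a named volume and attaches it to a PostgreSQL container. The volume name, the `postgres:16` image, and the password are assumptions chosen for the example, not values from the lecture.

```shell
volume_name="project-db-data"   # hypothetical volume name

if docker info >/dev/null 2>&1; then
  docker volume create "$volume_name"
  # Mount the volume at the data directory used by the postgres:16 image.
  docker run -d --rm \
    -e POSTGRES_PASSWORD=changeme \
    -v "$volume_name":/var/lib/postgresql/data \
    postgres:16
  docker volume ls   # the new volume should be listed
fi
```

Because the data lives in the volume, it survives after this container is removed.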
| Feature | Bind mount | Named volume |
|---|---|---|
| Managed by | You | Docker |
| Easy to inspect in Finder or Explorer | Yes | Less directly |
| Portable across computers | Depends on path | Managed by Docker |
| Best for | Code and project files | Databases and service state |
Bind mounts can expose permission differences between:
Symptoms include:
This is normal Docker friction. It is not a sign that the project is broken.
This is useful, but hard to remember:
Docker Compose lets us write this configuration once.
compose.yaml
Run:
Older examples may use docker-compose. The newer command is docker compose.
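A `compose.yaml` for the RStudio container might look like the sketch below. The service name and the mount target `/home/rstudio/project` are assumptions; the image, port, and password variable follow the RStudio example earlier in the deck.

```shell
# Write an example compose.yaml into a scratch folder (not into a real project).
demo_dir="$(mktemp -d)"

cat > "$demo_dir/compose.yaml" <<'EOF'
services:
  rstudio:
    image: rocker/rstudio
    ports:
      - "8787:8787"
    environment:
      PASSWORD: changeme
    volumes:
      - .:/home/rstudio/project
EOF

# Start the service in the background when Docker is available.
if docker info >/dev/null 2>&1; then
  (cd "$demo_dir" && docker compose up -d)
fi
```

`docker compose down`, run from the same folder, stops and removes the service.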
Some projects need more than R.
A data science project might use:
Compose describes the small system, not just one container.
Environment variables configure software without hard-coding values.
Common uses:
Warning
Do not commit real passwords, API keys, or tokens to a repository.
For real projects, use:
- `.env` files that are not committed

The strongest projects use these tools together, each doing a different job.
FROM rocker/r-ver:4.4.3
WORKDIR /project
ENV RENV_PATHS_LIBRARY=/opt/renv/library
ENV RENV_PATHS_CACHE=/opt/renv/cache
ARG QUARTO_VERSION=1.8.27
RUN apt-get update \
  && apt-get install -y --no-install-recommends curl ca-certificates \
  && arch="$(dpkg --print-architecture)" \
  && curl -L -o quarto.deb \
    "https://github.com/quarto-dev/quarto-cli/releases/download/v${QUARTO_VERSION}/quarto-${QUARTO_VERSION}-linux-${arch}.deb" \
  && apt-get install -y ./quarto.deb \
  && quarto --version \
  && rm quarto.deb \
  && rm -rf /var/lib/apt/lists/*
RUN Rscript -e "install.packages('renv')"
COPY .Rprofile .Rprofile
COPY renv.lock renv.lock
COPY renv/activate.R renv/activate.R
COPY renv/settings.json renv/settings.json
RUN Rscript -e "renv::restore(prompt = FALSE)"
COPY . .
CMD ["Rscript", "-e", "renv::load('/project'); targets::tar_make()"]

Then:
Instead of saying:
You can say:
That is a much smaller surface area for mistakes.
A Docker image can still fail if:
Docker reduces environment drift. It does not remove the need for good project design.
Use Docker when:
For a small class project, you may be fine with:
- `renv.lock`
- `targets` pipeline for larger analyses

Tip
Choose the simplest tool that makes the project trustworthy.
A registry stores Docker images.
Common options:
This is like GitHub for built environments, but the object being shared is an image.
Login:
Build with a registry-ready name:
Push:
Someone else can run:
For long-term projects, also keep the Dockerfile in the repository so the image can be rebuilt.
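The login, build, and push steps can be wrapped in one small helper. `publish_image` is a hypothetical function written for this sketch; the `docker login`, `docker build`, and `docker push` commands inside it are the standard CLI.

```shell
# Hypothetical helper: log in, build the image under a registry-ready name, push it.
# Usage: publish_image <user>/<image>:<tag>   (run from the folder with the Dockerfile)
publish_image() {
  docker login &&
    docker build -t "$1" . &&
    docker push "$1"
}
```

For example, `publish_image yourname/penguins-analysis:1.0.0` (a placeholder name) publishes the image, after which someone else could start it with `docker run --rm yourname/penguins-analysis:1.0.0`.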
Images include operating system files, libraries, R, packages, and maybe project files.
Ways to keep them manageable:
- `.dockerignore`

Docker can accumulate:
Inspect first:
Clean cautiously:
Warning
docker system prune removes stopped containers, unused networks, dangling images, and unused build cache. Read the prompt before confirming.
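A sketch of the inspect-then-clean sequence using standard Docker commands; the guards are additions so that pruning only runs when a terminal is attached to answer the confirmation prompt.

```shell
if docker info >/dev/null 2>&1; then
  docker system df    # disk used by images, containers, volumes, and build cache
  docker image ls     # images stored locally
  docker ps -a        # containers, including stopped ones
  # Pruning asks for confirmation, so run it only from an interactive terminal.
  if [ -t 0 ]; then docker system prune; fi
fi
```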
Imagine this project:
The goal:
should rebuild the analysis.
_targets.R

library(targets)
library(tarchetypes)
tar_source()
tar_option_set(
  packages = c("palmerpenguins", "dplyr", "ggplot2")
)
list(
  tar_target(raw_data, palmerpenguins::penguins),
  tar_target(clean_data, clean_penguins(raw_data)),
  tar_target(species_plot, plot_species(clean_data)),
  tar_quarto(report, "report.qmd")
)

Dockerfile

FROM rocker/r-ver:4.4.3
WORKDIR /project
ENV RENV_PATHS_LIBRARY=/opt/renv/library
ENV RENV_PATHS_CACHE=/opt/renv/cache
ARG QUARTO_VERSION=1.8.27
RUN apt-get update \
  && apt-get install -y --no-install-recommends curl ca-certificates \
  && arch="$(dpkg --print-architecture)" \
  && curl -L -o quarto.deb \
    "https://github.com/quarto-dev/quarto-cli/releases/download/v${QUARTO_VERSION}/quarto-${QUARTO_VERSION}-linux-${arch}.deb" \
  && apt-get install -y ./quarto.deb \
  && quarto --version \
  && rm quarto.deb \
  && rm -rf /var/lib/apt/lists/*
RUN Rscript -e "install.packages('renv')"
COPY .Rprofile .Rprofile
COPY renv.lock renv.lock
COPY renv/activate.R renv/activate.R
COPY renv/settings.json renv/settings.json
RUN Rscript -e "renv::restore(prompt = FALSE)"
COPY . .
CMD ["Rscript", "-e", "renv::load('/project'); targets::tar_make()"]

If the report is written inside the container, remember:
Without a mount, the output stays inside the container and disappears when the container is removed.
Now rendered outputs are written to your project folder.
This is usually what you want while developing.
The package library still comes from the image because it is stored in /opt/renv/library, not under the mounted /project folder.
- `_targets/` into the image
- `renv/library/` into the image
- `renv` but not activating the project library at runtime
- `/project/renv/library` and then hiding them with a bind mount
- `quarto` but forgetting the Quarto CLI
- `latest` tags for long-term work
- `renv.lock`
- `compose.yaml`

Make the default path easy for a new person to follow.
Before sharing a Dockerised project, check:
- `Dockerfile` builds from a clean clone
- `.dockerignore` excludes local state
- `renv.lock` is current
- `targets::tar_make()` runs inside the container

Open a shell inside the image:
Then inspect:
Debug the environment first, then the analysis.
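One way to open that shell is to override the image's default command with `bash`. The image name `my-r-image` is a placeholder, and the checks listed in the comments are suggestions rather than the lecture's own list.

```shell
image_tag="my-r-image"   # hypothetical name of the project image

# Replace the default CMD with bash to get an interactive shell in the image.
if docker info >/dev/null 2>&1 && [ -t 0 ]; then
  docker run -t -i --rm "$image_tag" bash
fi

# Checks to try inside the container:
#   R --version                    which R is installed
#   quarto --version               whether the Quarto CLI is on the PATH
#   Rscript -e 'renv::status()'    whether the library matches renv.lock
#   ls /project                    whether the project files were copied
```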
Docker costs:
Docker buys:
- `Dockerfile` records how an image is built
- `git`, Quarto, `renv`, and `targets`

Tip
Reproducibility is a ladder. Docker is one of the higher rungs, not the first one.
Start with clear code, relative paths, version control, package management, and a runnable workflow.
Then use Docker when the environment needs to be part of the project.

ETC5513 Week 11