ETC5513: Reproducible and Collaborative Practices

Reproducible reports using Quarto

Lecturer: Michael Lydeamore

Department of Econometrics and Business Statistics




Recap

  1. Set the basis for the unit
  2. Unit structure
  3. Assessment
  4. Introduction to reproducibility
  5. Looked at R, RStudio and git

In the tutorial, you got to know more about R, and some of the available R and RStudio resources to help you through the semester.

You were also introduced to ChatGPT, which you can use to assist your learning. We will use ChatGPT ethically, in line with the University guidelines.

Today’s plan

Aim

  • Quarto documents
  • R Code Chunk Options
  • Including images and figures
  • Computer file architecture
  • RStudio Projects
  • Good coding practices

Second hour: hands on practice

Scaffolding of reproducible research & reporting

Think of reproducible reporting as a project

The project needs to contain all the resources needed to produce a reproducible output.

Definition: Computational Reproducibility

Obtaining consistent computational results using the same input data, computational steps, methods, code, and conditions of analysis.
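
As a small illustration of this definition, here is a minimal sketch (not part of the original slides). The function name run_analysis and the seed value 5513 are made up for the example; cars is a dataset that ships with R. Fixing the random seed is one way to make the "conditions of analysis" the same on every run.

```r
# Hypothetical analysis step: resample the stopping distances and take the mean.
run_analysis <- function(seed) {
  set.seed(seed)  # fix the state of the random number generator
  resampled <- sample(cars$dist, size = nrow(cars), replace = TRUE)
  mean(resampled)
}

first_run  <- run_analysis(seed = 5513)
second_run <- run_analysis(seed = 5513)

# Same data, same code, same seed: the two results should match exactly.
print(identical(first_run, second_run))
```

Without the call to set.seed(), the two runs would usually give different numbers, even though the code and data are unchanged.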

Elements of a reproducible project

We need to have a plan to organise, store and make all the project files available

  • All the elements of the project should be files
  • All files should be stored within the project location (typically a folder)
  • All your files should be explicitly tied together

Project organisation example

Workflow for reproducible research

Stages for reproducible data analysis and reporting

  • Clear research questions to be investigated
  • Clear objectives: what is the goal of this report?
  • Data gathering
  • Exploratory data analysis
  • Data analysis
  • Results presentation

All of the above needs to be documented and tied together

In this unit

We will create documents that are reproducible

  • Incorporate analyses that are reproducible
  • Include report text
  • All combined together

Our reproducible documents will be created using the scripting language R combined with Quarto.

What is Markdown?

Markdown is a lightweight markup language that you can use to add formatting elements to plain text documents.

It was created by John Gruber in 2004. Read more here

  • Markdown is a “text” formatting syntax
  • Can be rendered to more complex formats such as html, pdf, docx, …

Good news!

We can use Markdown inside a type of document called a Quarto file

Today, we’ll learn how to do it.

Main tools for combining R code and text

Our main tool is going to be R and its packages. We will be using R via RStudio.

  • R's functionality is organised into packages (often called libraries)
  • For reproducibility, two packages are crucial
  • Does anyone know which ones they are?

Quarto and knitr

  • Quarto is a new piece of software (with a corresponding R package) that allows us to create documents using Markdown
  • knitr is an R package that runs the code in a Quarto document so that it can be rendered to html, pdf, docx, etc.

Quarto documents

  • Quarto can be thought of as a file format for making dynamic documents with R.
  • Quarto documents have the extension .qmd

Quarto


  • Provides an environment where you can write your complete analysis, and combines your text and code together into a rich document
  • You write your code as code chunks, put your text around that, and then you get a fully reproducible document

Elements in a Quarto Document

There are three parts to a Quarto document

  1. Metadata (YAML)
  2. Text (formatted with Markdown)
  3. Code (code formatting)

Before we dive into the Quarto file structure, let’s talk about Markdown.

Dynamic documents

Quarto + knitr = Dynamic document

  • Quarto allows us to not only use Markdown to write the text in the report, it also allows us to include R code.
  • knitr combines with Pandoc to render documents that contain a mixture of these components
  • Pandoc is used by Quarto to convert the result into the output format.

Possible outputs

Quarto file structure

Three main components: YAML, text and R code chunks.

Component breakdown: YAML

Metadata is written at the top of the file, between --- in YAML.

---
title: "ETC5513"
author: "Michael Lydeamore"
format: html
---

Component breakdown: Text

Text is written in Markdown

# This is a section header

This is a section header

## This is a subsection header

This is a subsection header

In this section, something is **important**

In this section, something is important

Font types

We can write things in italic or bold:

Code:

__bold__, **bold**,

_italic_, *italic*

Result:

bold, bold,

italic, italic

Markdown example

Code:

# Header 1
## Header 2

* Unordered list 1

_This is italic_

*So is this*

**This is bold**

1. Ordered list 1

Result:

Header 1

Header 2

  • Unordered list 1

This is italic

So is this

This is bold

  1. Ordered list 1

Markdown component: code

R Code is included in chunks:

Code:

```{r}
#| echo: false

library(ggplot2)
ggplot(cars, 
       aes(x = speed, 
           y = dist)
       ) +
  geom_point()
```

Result:

R code

Code:

```{r}
#| echo: false

library(ggplot2)

data(InsectSprays)  # load the built-in dataset

head(InsectSprays)
```
...
```{r}
#| echo: false
ggplot(data = InsectSprays,
       aes(x = spray,
           y = count,
           fill = spray)
       ) +
  geom_boxplot(alpha = 0.6) +
  ggtitle("Insect sprays boxplots")
```

Result:

  count spray
1    10     A
2     7     A
3    20     A
4    14     A
5    14     A
6    12     A

R Code Chunks

You can quickly insert an R code chunk into your file with:

  • Keyboard shortcut Ctrl + Alt + I (Mac: Cmd + Option + I)
  • The Add Chunk command in the editor toolbar or
  • Typing the chunk delimiters (```)

Chunk output can be customised with Chunk execution options, which are at the top of a chunk, starting with #|.

  • include: false prevents code and results appearing in the finished file. The code is still run and results can be used in other chunks.
  • echo: false prevents code but not results appearing in the finished file. This is a useful way to embed figures.

More chunk options

  • eval: false does not evaluate (or run) this code chunk when rendering
  • message: false prevents messages that are generated by code appearing in the finished file
  • warning: false prevents warnings that are generated appearing in the finished file
  • fig-cap: "Text" adds a caption to a figure
  • fig-align: "center" sets the horizontal position at which the figure will appear

There are loads more of these - see the Quarto documentation for a complete list.
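
To see how several options combine, here is a sketch of the contents of a single chunk; in a qmd file it would sit between the usual chunk delimiters. The caption text is made up for illustration, and pressure is a dataset that ships with R. Treat it as one possible combination rather than a recommended default.

```r
#| echo: false
#| warning: false
#| message: false
#| fig-cap: "Vapour pressure of mercury against temperature."
#| fig-align: "center"

# With echo: false, only the figure (not this code) appears in the rendered file.
plot(pressure$temperature, pressure$pressure,
     xlab = "Temperature (deg C)",
     ylab = "Pressure (mm of mercury)")
```

Each option is on its own line starting with #|, so R itself treats these lines as ordinary comments.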

Global options

To set global options that apply to every chunk in your file, call knitr::opts_chunk$set() in a code chunk.

These will be treated as a global default that can be overridden by individual chunk options.

Example:

knitr::opts_chunk$set(echo = FALSE)

Caching

Long documents can take a long time to run. Quarto has a caching system that can help manage this long execution time.

You can set cache as either a chunk option (using #|) or globally in YAML:

execute:
  cache: true

More info on caching is in the Quarto manual

Use caching with care: it is easy to end up with stale output from a chunk that should have been re-run!

Example

Rendering Quarto to HTML

Use the “Render” button at the top

Rendering Quarto to PDF

We can also render to a PDF:

knitr and Pandoc

Pandoc: The document converter

https://pandoc.org/index.html

  • knitr executes the code and converts the .qmd to a .md
  • Pandoc renders the .md to the output format you want
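
The first of these two steps can be tried on its own. The sketch below assumes the knitr package is installed and uses a tiny made-up document held in memory rather than a real .qmd file; it only shows the knitr step, not the Pandoc step that Quarto runs afterwards.

```r
if (requireNamespace("knitr", quietly = TRUE)) {
  # Three backticks, built with strrep() so this example is easy to paste.
  fence <- strrep("`", 3)

  # Hypothetical document body: a heading and one R chunk.
  doc_lines <- c("# Demo", "", paste0(fence, "{r}"), "1 + 1", fence)

  # knitr runs the chunk and returns plain Markdown containing the code and its result.
  md_out <- knitr::knit(text = doc_lines, quiet = TRUE)
  cat(md_out)
}
```

The Markdown that comes back is what Pandoc would then convert to html, pdf or docx.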

Let’s learn about YAML

title: "R Notebook"
author: "Michael Lydeamore"
format: 
  html:
    toc: true
    theme: solar
  pdf:
    toc: true
  docx:
    toc: true

toc: Table of contents. You can read more about that here

This is the resulting HTML

Tables and Captions

Code:

```{r}
library(dslabs)
data(murders)

# Keep the first five rows for a compact table
table_data <- head(murders, 5)

knitr::kable(table_data,
             caption = "Gun murder data from FBI reports by state",
             digits = 2)
```

Result:

Gun murder data from FBI reports by state

state       abb  region  population  total
Alabama     AL   South      4779736    135
Alaska      AK   West        710231     19
Arizona     AZ   West       6392017    232
Arkansas    AR   South      2915918     93
California  CA   West      37253956   1257

For more information, type ?knitr::kable into your R console.

Figures and captions

Figures from R are created inside code chunks.

Typically, we will generate figures using ggplot2

Inside the code chunk, we use the fig-cap chunk option to generate a caption.

You will also want to include a label that starts with fig- so the figure gets a number and can be cross-referenced.

Figures and captions

```{r}
#| label: fig-cars-plot
#| fig-cap: "Distance taken for a car to stop, against its speed during the test."

library(ggplot2)
ggplot(cars, 
       aes(x = speed, 
           y = dist)
       ) +
  geom_point()
```

Distance taken for a car to stop, against its speed during the test.

Inserting external images/photos/figures

There are two different ways to include external pictures.

```{r}
#| out-width: "80%"
knitr::include_graphics("images/R.png")
```

or

![](images/R.png){width="80%"}

I recommend the latter, unless you need R to process the image in some way.

Note these don’t have to be local links. URLs work just fine!

![](https://media.giphy.com/media/JIX9t2j0ZTN9S/giphy.gif)

Now we know how to create a qmd file

But there is more to a project than that

A project might have:

  • Data,
  • Other R or Quarto scripts
  • Figures etc

All the documents related to a project should be in one folder, often under an RStudio Project.

Let’s talk about computer paths

And then RStudio Projects

Computer paths

Where are files and folders stored on our computer?

Computer paths

Definition: Path

A path is the complete location or name of where a computer file, directory, device, or web page is located

Some examples:

  • Windows: C:\Documents\ETC5513
  • Mac/Linux: /Users/Documents/ETC5513
  • Internet: http://rcp.numbat.space/

Absolute and Relative Paths

Definition: Absolute Path

An absolute or full path begins at the top level of the file system, typically a drive letter or the root (/)

Definition: Relative Path

A relative path refers to a location that is relative to the current directory. Relative paths typically start with a . (although this may be hidden from the user)

Examples:

  • Absolute path: C:\Documents\ETC5513-Assignment-Solutions
  • Relative path: ./assignment-solutions

Absolute and Relative paths

Absolute paths are generally to be avoided - it is extremely unlikely another person will have the same absolute path as you.

Relative paths can work on different systems.
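
The difference can be seen directly in R. This is a minimal sketch: the folder name etc5513-demo and the file name cars.csv are made up, and tempdir() is used so nothing is written to your real project folders.

```r
# Hypothetical project folder, created in a temporary location for this demo.
project_dir <- file.path(tempdir(), "etc5513-demo")
dir.create(file.path(project_dir, "data"), recursive = TRUE, showWarnings = FALSE)

# Work from inside the project folder, remembering where we started.
old_wd <- setwd(project_dir)

# A relative path: resolved against the current working directory.
rel_path <- file.path("data", "cars.csv")
write.csv(cars, rel_path, row.names = FALSE)

# The matching absolute path: this one differs from machine to machine.
abs_path <- normalizePath(rel_path)
print(rel_path)
print(abs_path)

# Both paths point at the same file while we are inside the project folder.
same_rows <- nrow(read.csv(rel_path)) == nrow(read.csv(abs_path))
print(same_rows)

setwd(old_wd)  # restore the original working directory
```

Code that only uses rel_path keeps working if the whole project folder is copied to another computer; code that hard-codes abs_path does not.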

It is essential you understand where directories and files are within your computer

Having clarity about that and the project's file architecture gives you total control over their organisation.

Order versus mess

Work projects

  • Give each project a unique working directory/folder
  • Clean file system: all files related to a single project should be in the same folder
    • data (typically a folder)
    • figures (typically a folder)
    • code
    • notes
  • All paths should be relative to the project folder. Why?
  • Remember, absolute paths are not reproducible

RStudio Project Example

  • Data folder: Contains all the data for the project
  • Images/Figures folder: Contains all pictures not produced by your code in the qmd file
  • .Rproj file: This gets added when we create an RStudio project
  • qmd file
  • Other R scripts etc…
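
A layout like this can also be set up from R. The sketch below is only an illustration: the project name my-report and the file names report.qmd and helpers.R are invented, the files are empty placeholders, and the .Rproj file is not created here because RStudio adds it when you create the project.

```r
# Hypothetical project name; tempdir() keeps the demo away from your real files.
project_dir <- file.path(tempdir(), "my-report")

# One folder per kind of file, all inside the project folder.
for (folder in c("data", "images", "R")) {
  dir.create(file.path(project_dir, folder), recursive = TRUE, showWarnings = FALSE)
}

# Empty placeholder files standing in for the report and a helper script.
invisible(file.create(file.path(project_dir, "report.qmd")))
invisible(file.create(file.path(project_dir, "R", "helpers.R")))

# List everything, with paths shown relative to the project folder.
project_files <- list.files(project_dir, recursive = TRUE, include.dirs = TRUE)
print(project_files)
```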

RStudio projects

RStudio projects automatically handle relative paths and working directories

You can create an RStudio project

  • In a brand new directory
  • In an existing directory where you already have R Code and data
  • From a version control repository

Read more on RStudio projects here

Creating a new project

File > New project > Fill out the Options

RStudio Project Advantages

When you make a new RStudio Project, it:

  • Creates a project file (with the .Rproj extension) within the project directory
    • This file can be used as a shortcut to open the project directly
  • Creates a hidden directory (.Rproj.user) where project-specific temporary files are stored
  • Loads the project into RStudio and displays its name in the Projects toolbar

Good Coding Style


Coding style is an opinion-based phenomenon

There are different styles and it is important to be careful about how you write your code.

Bad example:

```{r}
library(ggplot2)
data = data(InsectSprays)
ggplot(data=InsectSprays, aes(spray, count, fill=spray))+geom_boxplot(alpha=0.6)+ggtitle("Insect sprays boxplots")
```

Long lines, no spaces, no structure: makes it very hard to read and debug

Good example

```{r}
library(ggplot2)

ggplot(data = InsectSprays,
       aes(x = spray, 
           y = count,
           fill = spray)
       ) +
  geom_boxplot(alpha = 0.6) +
  ggtitle("Insect sprays boxplots")
```

We will (mostly) follow the Tidyverse style guide

Good coding principles

  • Source code should be readable by humans and self-explanatory
  • Long lines of code are not good (maximum 80-100 characters)
  • Inside R code chunks, the tidyverse style guide is a good guide:
    • Use spaces around <-, +, =, -, after , and before {
    • For comments inside your code, use #
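
A short sketch of these principles in practice, using only base R and the built-in InsectSprays data. The variable names and the threshold of 10 are chosen purely for illustration.

```r
# Mean insect count for each spray type.
mean_counts <- aggregate(count ~ spray, data = InsectSprays, FUN = mean)

# Keep only the sprays whose mean count is below 10.
low_count_sprays <- mean_counts[mean_counts$count < 10, ]

print(low_count_sprays)
```

Note the spaces around <-, = and <, the space after each comma, the short lines, and the comments that say what each step is for.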

Important

The more organised you are when writing your code, the easier it will be to read and debug

Practices for reproducible research

  • Have a plan to organise, store, and make your files available
  • Set up an RStudio Project for each of your projects
  • Make sure all the steps in your analysis are documented
  • All files should be human readable
  • All files related to a project should be explicitly tied together

Reproducible workflow

Week 2 Lesson

Summary

  • Quarto documents
  • R Code Chunk Options
  • Including figures, tables, captions
  • RStudio projects
  • Good Coding Practices