ETC5513: Collaborative and Reproducible Practices

Tutorial 2

Author

Michael Lydeamore

Published

20 May 2025

🎯 Objectives

  • Working on a reproducible RStudio Project
  • Working on an HTML report and exploring different YAML themes
  • Practice Markdown syntax
  • Practice R
  • Practice R chunk options
  • Gain experience on data wrangling using the tidyverse suite of packages
  • Producing exploratory data analysis figures using the package ggplot2
  • Learn how to add figure captions
  • Create HTML tables and learn how to add captions

Exercise 1: Hands-on practice with Avian Influenza Data

  1. The data for the tutorial is inside a folder called data, which is bundled with the RStudio Project you made in this week’s workshop. Find that folder in the Files pane (lower right), where all your files are listed.
  2. Create a new section heading in your qmd document to read the data with the title “Reading Avian Influenza Data”
    Hint: Use #
  3. Inside this new section, create an R Code Chunk with options echo: true, warning: false, message: false called “Reading data” and insert the following code:
dat <- read_csv("data/avian_influenza_numbers.csv")
  4. Insert a new R chunk and find out what information you can get from the command head(dat).
  5. Modify the head command to display 10 rows.
  6. Create another two R chunks and use glimpse() in one and str() in the other. What information can you get from those commands? Hint: For more information on R functions, type ?glimpse in the R console.
  7. Using an R inline command, write the dimension of the dataset in a sentence.
    Hint: Have a look at ncol and nrow.
  8. Add a new subsection heading (###) with “Why is it important to know the dimension of your dataset?” and write a brief sentence with the explanation.
  9. Add a new subsection heading (###) with “What are the variable names in the dataset?” and display the names of the dataset variables using R.
    Hint: ?names in the R console
  10. Select two variables and use a markdown list to briefly explain what each variable measures.
  3. The R chunk should look like this:
# Reading Avian Influenza Data

```{r loading_data}
#| echo: true
#| warning: false
#| message: false

dat <- read_csv("data/avian_influenza_numbers.csv")
```
  4. head(dat) will print the first six rows of the dataset.
  5. head(dat, n = 10)
  6. glimpse prints the columns as rows, with the data running across the screen. It shows the first few values and the type (class) of each column. str is similar but shows you detailed information about the data frame object (as opposed to just the data).
Rows: 370
Columns: 26
$ Range            <chr> "1997-1999", "1997-1999", "1997-1999", "1997-1999", "…
$ Month            <chr> "1/1/1997", "2/1/1997", "3/1/1997", "4/1/1997", "5/1/…
$ Vietnam          <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ Turkey           <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ Thailand         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ Iraq             <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ Indonesia        <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ Egypt            <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ Djibouti         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ China            <dbl> 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 4, 13, 0, 0, 0, 0, 0, 0…
$ Cambodia         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ Nigeria          <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ Azerbaijan       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ Pakistan         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ Myanmar          <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ Laos             <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ Bangladesh       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ Canada           <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ India            <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ Nepal            <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ `United Kingdom` <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ Spain            <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ `United States`  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ Ecuador          <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ Chile            <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ Australia        <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
spc_tbl_ [370 × 26] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
 $ Range         : chr [1:370] "1997-1999" "1997-1999" "1997-1999" "1997-1999" ...
 $ Month         : chr [1:370] "1/1/1997" "2/1/1997" "3/1/1997" "4/1/1997" ...
 $ Vietnam       : num [1:370] 0 0 0 0 0 0 0 0 0 0 ...
 $ Turkey        : num [1:370] 0 0 0 0 0 0 0 0 0 0 ...
 $ Thailand      : num [1:370] 0 0 0 0 0 0 0 0 0 0 ...
 $ Iraq          : num [1:370] 0 0 0 0 0 0 0 0 0 0 ...
 $ Indonesia     : num [1:370] 0 0 0 0 0 0 0 0 0 0 ...
 $ Egypt         : num [1:370] 0 0 0 0 0 0 0 0 0 0 ...
 $ Djibouti      : num [1:370] 0 0 0 0 0 0 0 0 0 0 ...
 $ China         : num [1:370] 0 0 0 0 1 0 0 0 0 0 ...
 $ Cambodia      : num [1:370] 0 0 0 0 0 0 0 0 0 0 ...
 $ Nigeria       : num [1:370] 0 0 0 0 0 0 0 0 0 0 ...
 $ Azerbaijan    : num [1:370] 0 0 0 0 0 0 0 0 0 0 ...
 $ Pakistan      : num [1:370] 0 0 0 0 0 0 0 0 0 0 ...
 $ Myanmar       : num [1:370] 0 0 0 0 0 0 0 0 0 0 ...
 $ Laos          : num [1:370] 0 0 0 0 0 0 0 0 0 0 ...
 $ Bangladesh    : num [1:370] 0 0 0 0 0 0 0 0 0 0 ...
 $ Canada        : num [1:370] 0 0 0 0 0 0 0 0 0 0 ...
 $ India         : num [1:370] 0 0 0 0 0 0 0 0 0 0 ...
 $ Nepal         : num [1:370] 0 0 0 0 0 0 0 0 0 0 ...
 $ United Kingdom: num [1:370] 0 0 0 0 0 0 0 0 0 0 ...
 $ Spain         : num [1:370] 0 0 0 0 0 0 0 0 0 0 ...
 $ United States : num [1:370] 0 0 0 0 0 0 0 0 0 0 ...
 $ Ecuador       : num [1:370] 0 0 0 0 0 0 0 0 0 0 ...
 $ Chile         : num [1:370] 0 0 0 0 0 0 0 0 0 0 ...
 $ Australia     : num [1:370] 0 0 0 0 0 0 0 0 0 0 ...
 - attr(*, "spec")=
  .. cols(
  ..   Range = col_character(),
  ..   Month = col_character(),
  ..   Vietnam = col_double(),
  ..   Turkey = col_double(),
  ..   Thailand = col_double(),
  ..   Iraq = col_double(),
  ..   Indonesia = col_double(),
  ..   Egypt = col_double(),
  ..   Djibouti = col_double(),
  ..   China = col_double(),
  ..   Cambodia = col_double(),
  ..   Nigeria = col_double(),
  ..   Azerbaijan = col_double(),
  ..   Pakistan = col_double(),
  ..   Myanmar = col_double(),
  ..   Laos = col_double(),
  ..   Bangladesh = col_double(),
  ..   Canada = col_double(),
  ..   India = col_double(),
  ..   Nepal = col_double(),
  ..   `United Kingdom` = col_double(),
  ..   Spain = col_double(),
  ..   `United States` = col_double(),
  ..   Ecuador = col_double(),
  ..   Chile = col_double(),
  ..   Australia = col_double()
  .. )
 - attr(*, "problems")=<externalptr> 

and, for good measure, the output of head(dat, n = 10):

# A tibble: 10 × 26
   Range     Month  Vietnam Turkey Thailand  Iraq Indonesia Egypt Djibouti China
   <chr>     <chr>    <dbl>  <dbl>    <dbl> <dbl>     <dbl> <dbl>    <dbl> <dbl>
 1 1997-1999 1/1/1…       0      0        0     0         0     0        0     0
 2 1997-1999 2/1/1…       0      0        0     0         0     0        0     0
 3 1997-1999 3/1/1…       0      0        0     0         0     0        0     0
 4 1997-1999 4/1/1…       0      0        0     0         0     0        0     0
 5 1997-1999 5/1/1…       0      0        0     0         0     0        0     1
 6 1997-1999 6/1/1…       0      0        0     0         0     0        0     0
 7 1997-1999 7/1/1…       0      0        0     0         0     0        0     0
 8 1997-1999 8/1/1…       0      0        0     0         0     0        0     0
 9 1997-1999 9/1/1…       0      0        0     0         0     0        0     0
10 1997-1999 10/1/…       0      0        0     0         0     0        0     0
# ℹ 16 more variables: Cambodia <dbl>, Nigeria <dbl>, Azerbaijan <dbl>,
#   Pakistan <dbl>, Myanmar <dbl>, Laos <dbl>, Bangladesh <dbl>, Canada <dbl>,
#   India <dbl>, Nepal <dbl>, `United Kingdom` <dbl>, Spain <dbl>,
#   `United States` <dbl>, Ecuador <dbl>, Chile <dbl>, Australia <dbl>
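
As a small check of how head() chooses the number of rows, the sketch below uses mtcars, a dataset that ships with base R, rather than the tutorial data. The variable names are made up for illustration.

```r
# mtcars ships with base R, so this runs without the tutorial data
default_rows <- nrow(head(mtcars))          # head() defaults to n = 6
ten_rows     <- nrow(head(mtcars, n = 10))  # ask for the first 10 rows
last_rows    <- nrow(tail(mtcars, n = 3))   # tail() works from the other end

print(c(default_rows, ten_rows, last_rows))
```

The same n argument applies to a tibble such as dat.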
  7. Code example:
The dataset has `r nrow(dat)` rows and `r ncol(dat)` variables.

Output: The dataset has 370 rows and 26 variables.
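
To see what the inline expressions evaluate to, here is a minimal sketch using a small made-up tibble (toy, which is not the tutorial data). It also shows that dim() returns both counts at once, rows first.

```r
library(tibble)

# A small made-up tibble standing in for `dat` (hypothetical values)
toy <- tibble(
  Month = c("1/1/1997", "2/1/1997", "3/1/1997"),
  Egypt = c(0, 1, 0)
)

n_rows <- nrow(toy)  # number of observations
n_cols <- ncol(toy)  # number of variables

# dim() returns both values together: rows first, then columns
dims <- dim(toy)

# The same sentence that the inline R code would produce in the report
sentence <- paste("The dataset has", n_rows, "rows and", n_cols, "variables.")
print(sentence)
```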

  8. Example markdown:
### Why is it important to know the dimension of your dataset?
It is important because it helps you to better understand the
structure of your dataset. It gives clear information
about how many variables and how many individual cases are
in your data.

### What are the variable names in the dataset?
```{r}
names(dat)
```

Output:

  • Range: Contains a range of years. Class character.
  • Month: Month of the data, format is m/d/y. Class character.
  • Remaining columns: Number of cases in each country, one column per country. Class double.

Why is it important to know the dimension of your dataset?

It is important because it helps you to better understand the structure of your dataset. It gives clear information about how many variables and how many individual cases are in your data.

What are the variable names in the dataset?

 [1] "Range"          "Month"          "Vietnam"        "Turkey"        
 [5] "Thailand"       "Iraq"           "Indonesia"      "Egypt"         
 [9] "Djibouti"       "China"          "Cambodia"       "Nigeria"       
[13] "Azerbaijan"     "Pakistan"       "Myanmar"        "Laos"          
[17] "Bangladesh"     "Canada"         "India"          "Nepal"         
[21] "United Kingdom" "Spain"          "United States"  "Ecuador"       
[25] "Chile"          "Australia"     
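
Two of these names contain a space, which affects how you refer to them in code. The sketch below uses a made-up tibble (toy, with hypothetical values) to illustrate that names() returns plain strings, while such a column has to be wrapped in backticks when used in code.

```r
library(tibble)
library(dplyr)

# Made-up data with one column name that contains a space
toy <- tibble(
  Month = c("1/1/1997", "2/1/1997"),
  `United States` = c(0, 1),
  Egypt = c(0, 0)
)

# names() returns the names as plain strings, without backticks
print(names(toy))

# Backticks are needed when referring to the column in code
us_only <- toy |> select(`United States`)

# The $ operator follows the same rule
us_total <- sum(toy$`United States`)
```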

Exercise 2: Data Wrangling

  1. Using the R package dplyr (which is loaded with tidyverse), and using the pipe (|>), create a new dataset called data_cleaned that only contains the following variables:
    • Month
    • Australia
    • Egypt
    • United States
data_cleaned <- dat |>
    select(Month, Australia, Egypt, `United States`)
  2. Inspect data_cleaned and describe using a markdown list the type of variables in this new dataset. Write the names of the variables in bold. Do you think the variable attributes are correct?
* `Month` is a character (`<chr>`)
* `Australia`, `Egypt` and `United States` are double (`<dbl>`)

We would expect Month to be a date.

  3. Convert the variable Month into a date vector using lubridate::mdy. What do you notice?

Let’s put it in a new object

data_monthly <- data_cleaned |>
    mutate(monthdate = lubridate::mdy(Month))

There are some missing values. If we filter for these rows,

data_monthly |> filter(is.na(monthdate))
# A tibble: 32 × 5
   Month Australia Egypt `United States` monthdate
   <chr>     <dbl> <dbl>           <dbl> <date>   
 1 <NA>         NA    NA              NA NA       
 2 <NA>         NA    NA              NA NA       
 3 <NA>         NA    NA              NA NA       
 4 1997          0     0               0 NA       
 5 1998          0     0               0 NA       
 6 1999          0     0               0 NA       
 7 2000          0     0               0 NA       
 8 2001          0     0               0 NA       
 9 2002          0     0               0 NA       
10 2003          0     0               0 NA       
# ℹ 22 more rows

The rows that failed to parse are either empty or yearly aggregates (the Month column holds only a year), so we can safely drop them.

data_monthly <- data_monthly |> filter(!is.na(monthdate))
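
To see the parsing behaviour in isolation, here is a minimal sketch on a made-up character vector (month_raw is hypothetical, not taken from the data file). It assumes lubridate is installed.

```r
library(lubridate)

# Made-up Month values: a valid m/d/y string, a yearly aggregate, a missing value
month_raw <- c("1/1/1997", "1997", NA)

# mdy() warns about strings it cannot parse and returns NA for them;
# suppressWarnings() only keeps this example quiet
month_parsed <- suppressWarnings(mdy(month_raw))

print(month_parsed)
print(is.na(month_parsed))

# Keep only the entries that parsed into valid dates
valid_months <- month_parsed[!is.na(month_parsed)]
```

In your own report it is usually better to leave the warning visible, because it tells you how many values failed to parse.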
  4. Remove cases for which the data is aggregated or doesn’t have a valid month.
data_monthly <- data_monthly |> filter(!is.na(monthdate))
  5. What is the dimension of this new dataset? Compare it with the dimension of data_cleaned. How many cases have we lost?
dim(data_monthly)
[1] 338   5
dim(data_cleaned)
[1] 370   4

We have lost 32 rows (370 - 338), which were either empty or yearly aggregates without a valid month. The extra column is the new monthdate variable.

  6. Provide a table summary of the three countries using the kable() function from the knitr package. Give it the caption “Summary of number of cases of Avian Influenza”.
library(knitr)
data_monthly |>
    select(Egypt, Australia, `United States`) |>
    summary() |>
    kable(caption = "Summary of number of cases of Avian Influenza")
Summary of number of cases of Avian Influenza
     Egypt             Australia          United States
Min.   : 0.000    Min.   :0.000000    Min.   : 0.0000
1st Qu.: 0.000    1st Qu.:0.000000    1st Qu.: 0.0000
Median : 0.000    Median :0.000000    Median : 0.0000
Mean   : 1.062    Mean   :0.002959    Mean   : 0.2101
3rd Qu.: 0.000    3rd Qu.:0.000000    3rd Qu.: 0.0000
Max.   :50.000    Max.   :1.000000    Max.   :30.0000
  7. Visualize the case counts using a histogram and explain what information a histogram conveys. In addition, change the x-axis label to describe the case counts and remove the y-axis label.
    Hint: As a first step, do this for just one country. To do multiple countries at once, you will need to pivot_longer your dataset.
library(ggplot2)
data_monthly |>
    ggplot(aes(x=Egypt)) +
    geom_histogram(binwidth = 5) +
    labs(x="Case counts in Egypt", y="")

library(ggplot2)
library(tidyr)
data_monthly |>
    # Drop the old month column
    select(!Month) |>
    # Pivot everything except monthdate
    pivot_longer(!monthdate) |>
    ggplot(aes(x=value, fill = name)) +
    geom_histogram(binwidth = 5, position = "dodge") +
    labs(x="Case counts of Avian Influenza", y="")
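
If the reshaping step is unclear, the sketch below shows what pivot_longer does to a tiny made-up tibble (toy_wide, with hypothetical counts). With no extra arguments, the old column names go into a column called name and the cell values into a column called value, which is why the plot code maps value and name.

```r
library(tibble)
library(tidyr)

# Made-up wide data: one row per month, one column per country
toy_wide <- tibble(
  monthdate = as.Date(c("2006-03-01", "2006-04-01")),
  Australia = c(0, 0),
  Egypt = c(4, 8),
  `United States` = c(0, 0)
)

# Pivot every column except monthdate into name/value pairs
toy_long <- toy_wide |>
  pivot_longer(!monthdate)

# Two months times three countries gives six rows
print(toy_long)
```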

  8. Extension: Change this plot to a time series plot, with one bar per month. As an extra challenge, split this out into three separate plots, one per country.
data_monthly |>
    # Drop the old month column
    select(!Month) |>
    # Pivot everything except monthdate
    pivot_longer(!monthdate) |>
    ggplot(aes(x=monthdate, y = value, fill = name)) +
    geom_col() +
    facet_wrap(~name, scales="free") +
    labs(x="Month", y="Case counts of Avian Influenza")

Important

At the end of this tutorial, you should have a full QMD file that renders, including your code and the outputs from it. This means you can read it from top to bottom and remember what you did.