ETC5513: Collaborative and Reproducible Practices

Tutorial 2

Author

Michael Lydeamore

Published

21 May 2024

🎯 Objectives

  • Working on a reproducible RStudio Project
  • Working on an HTML report and exploring different YAML themes
  • Practice Markdown syntax
  • Practice R
  • Practice R chunk options
  • Gain experience on data wrangling using the tidyverse suite of packages
  • Producing exploratory data analysis figures using the package ggplot2
  • Learn how to add figure captions
  • Create HTML tables and learn how to add captions
  1. To complete this tutorial, you’ll need the pre-built RStudio project. On Moodle, under Week 2, download the Tutorial 2 RStudio Project zip file.
    Save this file on your computer and unzip it. Then open the .Rproj file, which should launch RStudio.
  2. To render to PDF, you will need a LaTeX distribution such as TinyTeX. There are two ways to install it:
    Either run install.packages("tinytex") followed by tinytex::install_tinytex() in the R console, or type quarto install tinytex at your command line. Your tutors can help with this if you get stuck.

Exercise 1: RStudio Projects

  1. Render the Tutorial2.qmd file to both HTML and PDF.
  2. Write your name as an author in the YAML.
  3. Change the HTML theme to cerulean.

Exercise 2: YAML and R Chunk Options

Carefully inspect the YAML and the first R code chunk in your Quarto file.

  1. What is the first R chunk of code doing?
    Hint: Remember all the libraries used in an analysis should be listed together at the top of the file.
  2. Change the R chunk option from message: false to message: true and add the option warning: true. What happens when you render the file?
  3. Create a new section called Introduction and type using markdown the following:
    “In this tutorial we are looking at the Coronavirus cases detected within the Hubei area as reported in the Lancet Journal website as of March 12, 2020.”
    Hint: Think about the # character.
  4. Suppress all messages from the R chunk named Chunk 1, and write the following under that section using markdown:
    “In this section we are loading all the required libraries for the tutorial”
  5. For the same R Chunk, add the chunk option echo: false. What does this do?
  6. Using markdown, link the word “Lancet” to the website https://www.thelancet.com/coronavirus
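As a rough illustration of the chunk options in this exercise, the sketch below shows one way a setup chunk could look. The label load-libraries is hypothetical, and in the .qmd file the lines would sit inside a {r} chunk rather than a plain code block.

```r
#| label: load-libraries
#| message: false
#| warning: false
#| echo: false

# message: false hides package start-up messages in the rendered report.
# warning: false hides warnings raised while the chunk runs.
# echo: false hides the code itself; any output would still be shown.

# Load every package the analysis needs in one place, at the top of the file.
library(tidyverse)
```

Outside a Quarto chunk the `#|` lines are ordinary R comments, so the same code also runs unchanged in the console.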

Exercise 3: Hands on practice with COVID-19 Data

  1. The data for the tutorial is inside a folder called Data, which is bundled with the RStudio Project. Find that file in the lower right pane where all your files are listed.
  2. Create a new section heading in your qmd document to read the data with the title “Reading Coronavirus Data”
    Hint: Use #
  3. Inside this new section, create an R Code Chunk with options echo: true, warning: false, message: false called “Reading data” and insert the following code:
dat <- read_csv("Data/COVID19_March12_Hubei.csv")
  4. Insert a new R chunk and find out what information you can get from the command head(dat).
  5. Modify the head command to display 10 rows.
  6. Create another two R chunks and use the R functions glimpse() and str(), one in each chunk. What information can you get from those commands?
    Hint: For more information on R functions, type ?glimpse in the R console.
  7. Using an R inline command, write the dimension of the dataset in a sentence.
    Hint: Have a look at ncol and nrow.
  8. Add a new subsection heading (###) with “Why is it important to know the dimension of your dataset?” and write a brief sentence with the explanation.
  9. Add a new subsection heading (###) with “What are the variable names in the dataset?” and display the names of the dataset variables using R.
    Hint: ?names in the R console
  10. Select two variables and use a markdown list to briefly explain what each variable measures.
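The inspection functions used in this exercise can be tried on a small made-up table first. The sketch below assumes a hypothetical tibble called toy with invented values; it is not the tutorial data.

```r
library(tidyverse)

# Hypothetical toy data standing in for a real dataset; all values are invented.
toy <- tibble(
  id  = 1:12,
  age = c("34", "51", NA, "27", "62", "45", NA, "38", "70", "19", "55", "41"),
  sex = c("female", "male", "male", NA, "female", "male",
          "female", "male", "female", "female", "male", "male")
)

head(toy, n = 10)  # first 10 rows instead of the default 6
glimpse(toy)       # one line per column: name, type, first few values
str(toy)           # base R view of the same structure
names(toy)         # column names as a character vector

n_rows <- nrow(toy)
n_cols <- ncol(toy)

# In the body of a .qmd file, an inline expression such as `r nrow(toy)`
# is replaced by its value when the document is rendered.
sentence <- paste("The toy data has", n_rows, "rows and", n_cols, "columns.")
sentence
```

Note that age is stored as text here, which glimpse() and str() both report as a character column.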

Exercise 4: COVID19 Data Wrangling

  1. Using the R package dplyr (which is loaded with tidyverse), and using the pipe (|>), create a new dataset called data_cleaned that only contains the following variables:
    • country
    • age
    • sex
    • city
    • province
    • latitude
    • longitude
  2. Inspect data_cleaned and use a markdown list to describe the type of each variable in this new dataset. Write the names of the variables in bold. Do you think the variable types are correct?
  3. Convert the variable age into a numeric vector.
  4. Inspect the first 20 values of age. What do you observe? What proportion of the values of age are missing? Round the result to two decimal places.
  5. Remove cases for which we don’t have information on the person’s age and keep cases for which the gender of the patient is known. Name this new dataset data_filtered.
  6. What is the dimension of this new dataset? Compare it with the dimension of data_cleaned. How many cases have we lost?
  7. Examine the variable age using the function summary(). Do you see any problems in the data?
  8. Remove patient entries with age below 1. You can save the result back into data_filtered.
  9. Provide a summary table of the variable age using the kable() function from the kableExtra package. Give it the caption “COVID-19 Age Summary”.
  10. Visualize the age distribution using a histogram and explain what information a histogram conveys. In addition, change the x-axis label to Age and remove the y-axis label.
    Hint: You’ll need to use ggplot2 for this.
  11. Change the color of the histogram using geom_histogram(color = "blue", fill = "white").
  12. Visualize the age distribution for females and males separately, and add the following caption to the figure: “Age frequencies of COVID19 patients in China per gender”
    Hint: facet_wrap()
  13. Count the number of cases per province and display a table of the top 10 provinces. Store the results in an object called cases_by_province.
    Hint: Replace XXX with the appropriate variable names in the code below.
```{r}
cases_by_province <- data_filtered |>
  select(XXX) |>
  filter(!is.na(XXX)) |>
  group_by(province) |>
  summarise(cases = n()) |>
  arrange(-XXX)
```
  14. Recreate this table using the code below:
```{r}
cases_by_province_alternative <- data_filtered |>
  filter(!is.na(province)) |>
  count(province, name = "cases", sort = TRUE)
```
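The wrangling, counting, table, and plotting steps in this exercise follow a common pattern, sketched below on invented data. Everything here is an assumption made for illustration: the tibble toy, its values, and the object names are hypothetical, and kable() is called from knitr rather than kableExtra.

```r
library(tidyverse)

# Hypothetical toy data: column names echo the tutorial, values are invented.
toy <- tibble(
  age   = c("34", "51", NA, "0.5", "62", "45", NA, "38"),
  sex   = c("female", "male", "male", "female", NA, "male", "female", "male"),
  notes = "dropped by select()"
)

toy_cleaned <- toy |>
  select(age, sex) |>              # keep only the columns of interest
  mutate(age = as.numeric(age))    # text to numbers; NA stays NA

# Proportion of missing ages, rounded to two decimals.
prop_missing <- round(mean(is.na(toy_cleaned$age)), 2)

# Keep rows with a known age of at least 1 and a known sex.
toy_filtered <- toy_cleaned |>
  filter(!is.na(age), !is.na(sex), age >= 1)

rows_lost <- nrow(toy_cleaned) - nrow(toy_filtered)

# Two equivalent ways to count rows per group, largest group first.
by_group <- toy_filtered |>
  group_by(sex) |>
  summarise(n_people = n()) |>
  arrange(desc(n_people))

by_count <- toy_filtered |>
  count(sex, name = "n_people", sort = TRUE)

# A captioned table built from the counts.
count_table <- knitr::kable(by_count, caption = "Toy counts by sex")

# Histogram with one panel per group, a relabelled x axis, and no y label.
p <- ggplot(toy_filtered, aes(x = age)) +
  geom_histogram(color = "blue", fill = "white", bins = 5) +
  facet_wrap(~sex) +
  labs(x = "Age", y = NULL)
```

count() is shorthand for the group_by(), summarise(), arrange() sequence, which is why the two count tables match. In a Quarto chunk, a figure caption is usually supplied through the fig-cap chunk option rather than inside the plotting code.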