ETC5513: Collaborative and Reproducible Practices
Tutorial 2
🎯 Objectives
- Working on a reproducible RStudio Project
- Working on an HTML report and exploring different YAML themes
- Practice Markdown syntax
- Practice R
- Practice R chunk options
- Gain experience in data wrangling using the `tidyverse` suite of packages
- Producing exploratory data analysis figures using the package `ggplot2`
- Learn how to add figure captions
- Create HTML tables and learn how to add captions
Exercise 1: Hands-on practice with Avian Influenza Data
- The data for the tutorial is inside a folder called `data`, which is bundled with the RStudio Project you made in this week’s workshop. Find that folder in the lower right pane where all your files are listed.
- Create a new section heading in your `qmd` document to read the data with the title “Reading Avian Influenza Data”
  Hint: Use `#`
- Inside this new section, create an R code chunk with options `echo: true`, `warning: false`, `message: false` called “Reading data” and insert the following code:
dat <- read_csv("data/avian_influenza_numbers.csv")
- Insert a new R chunk and find out what information you can get from the command `head(dat)`.
- Modify the `head` command to display 10 rows.
- Create another two R chunks and use in each of them the R functions `glimpse()` and `str()`. What information can you get from those commands?
  Hint: For more information on R functions, type `?glimpse()` in the R console.
- Using an R inline command, write the dimension of the dataset in a sentence.
  Hint: Have a look at `ncol` and `nrow`.
- Add a new subsection heading (`###`) with “Why is it important to know the dimension of your dataset?” and write a brief sentence with the explanation.
- Add a new subsection heading (`###`) with “What are the variable names in the dataset?” and display the names of the dataset variables using R.
  Hint: `?names()` in the R console.
- Select two variables and use a markdown list to briefly explain what each of the variables is measuring.
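The inspection steps in this exercise could be sketched roughly as follows. This is only an illustration under stated assumptions: the real CSV is not reproduced here, so `dat` is a small made-up tibble whose values are invented, and the column names are borrowed from the variables listed in the data wrangling exercise. The `#|` lines show one way to write the chunk options in Quarto; the label `reading-data` is an assumed spelling of “Reading data” without a space.

```r
#| label: reading-data
#| echo: true
#| warning: false
#| message: false

library(tibble)
library(dplyr)

# Hypothetical stand-in for read_csv("data/avian_influenza_numbers.csv").
# All values are invented for illustration only.
dat <- tibble(
  Month = c("1/1/2004", "2/1/2004", "3/1/2004", "Total"),
  Australia = c(0, 0, 1, 1),
  Egypt = c(2, 5, 3, 10),
  `United States` = c(0, 1, 0, 1)
)

head(dat, n = 10)  # up to the first 10 rows
glimpse(dat)       # one line per column: name, type, first values
str(dat)           # base R view of the same structure
dim(dat)           # number of rows, then number of columns
names(dat)         # the variable names

# Inline R in the body of the qmd file could look like this:
# The dataset has `r nrow(dat)` rows and `r ncol(dat)` columns.
```

With the real file, replace the `tibble()` call with the `read_csv()` line from the exercise; the inspection calls stay the same.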
Exercise 4: Data Wrangling
- Using the R package `dplyr` (which is loaded with `tidyverse`), and using the pipe (`|>`), create a new dataset called `data_cleaned` that only contains the following variables:
  - `Month`
  - `Australia`
  - `Egypt`
  - `United States`
- Inspect `data_cleaned` and describe using a markdown list the type of variables in this new dataset. Write the names of the variables in bold. Do you think the variable attributes are correct?
- Convert the variable `date` into a date vector using `lubridate::mdy`. What do you notice?
- Remove cases for which the data is aggregated or doesn’t have a valid month.
- What is the dimension of this new data set? Compare it with the dimension of `data_cleaned`. How many cases have we lost?
- Provide a table summary of the three countries using the `kable()` function from the `kableExtra` package. Give it the caption “Summary of number of cases of Avian Influenza”.
- Visualize the case counts using a histogram and explain what information a histogram conveys. In addition, change the x label in the plot to Age and remove the y axis label.
  Hint: As a first step, do this for just one country. To do multiple countries at once, you will need to `pivot_longer` your dataset.
- Extension: Change this plot to a time series plot, with one bar per month. As an extra challenge, split this out into three separate plots - one per country.
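One possible shape for the wrangling pipeline is sketched below. It is not a model answer. The toy `dat` is invented (including the hypothetical `Notes` column and the "Total" row standing in for an aggregated record), and it assumes the date strings live in the `Month` column. It also calls `knitr::kable()`, the function that `kableExtra` builds on, so the sketch needs one package fewer.

```r
library(dplyr)
library(tidyr)
library(lubridate)
library(ggplot2)

# Invented toy data; the real file will have more rows and columns.
dat <- tibble::tibble(
  Notes = c("a", "b", "c", "d"),  # hypothetical extra column to drop
  Month = c("1/1/2004", "2/1/2004", "3/1/2004", "Total"),
  Australia = c(0, 0, 1, 1),
  Egypt = c(2, 5, 3, 10),
  `United States` = c(0, 1, 0, 1)
)

# Keep only the variables named in the exercise.
data_cleaned <- dat |>
  select(Month, Australia, Egypt, `United States`)

# mdy() gives NA (with a warning) for strings it cannot parse,
# so aggregated rows can be dropped by filtering on NA.
data_dated <- data_cleaned |>
  mutate(Month = suppressWarnings(mdy(Month))) |>
  filter(!is.na(Month))

nrow(data_cleaned) - nrow(data_dated)  # number of rows lost

# Summary table with a caption.
data_dated |>
  select(-Month) |>
  summary() |>
  knitr::kable(caption = "Summary of number of cases of Avian Influenza")

# Long format: one row per month and country.
data_long <- data_dated |>
  pivot_longer(-Month, names_to = "Country", values_to = "Cases")

# Histogram per country; x label as requested in the exercise text.
p_hist <- ggplot(data_long, aes(x = Cases)) +
  geom_histogram(bins = 10) +
  facet_wrap(~Country) +
  labs(x = "Age", y = NULL)

# Extension: one bar per month, one panel per country.
p_time <- ggplot(data_long, aes(x = Month, y = Cases)) +
  geom_col() +
  facet_wrap(~Country, ncol = 1)
```

Printing `p_hist` or `p_time` in a chunk draws the plot. Faceting is only one way to handle several countries; mapping `Country` to `fill` in a single panel is another.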
Important
At the end of this tutorial, you should have a full QMD file that renders, including your code and the outputs from it. This means you can read it from top to bottom and remember what you did.