ETC5513: Collaborative and Reproducible Practices
Tutorial 2
🎯 Objectives
- Working on a reproducible RStudio Project
- Working on an HTML report and exploring different YAML themes
- Practice Markdown syntax
- Practice R
- Practice R chunk options
- Gain experience in data wrangling using the `tidyverse` suite of packages
- Producing exploratory data analysis figures using the package `ggplot2`
- Learn how to add figure captions
- Create HTML tables and learn how to add captions
Exercise 1: Hands-on practice with Avian Influenza Data
- The data for the tutorial is inside a folder called `data`, which is bundled with the RStudio Project you made in this week’s workshop. Find that file in the lower right pane where all your files are listed.
- Create a new section heading in your `qmd` document to read the data with the title “Reading Avian Influenza Data”
Hint: Use `#`
- Inside this new section, create an R Code Chunk with options `echo: true`, `warning: false`, `message: false` called “Reading data” and insert the following code:
`dat <- read_csv("data/avian_influenza_numbers.csv")`
- Insert a new R Chunk and find out what information you can get from the command `head(dat)`
- Modify the `head` command to display 10 rows.
- Create another two R chunks and use one of the R functions `glimpse()` and `str()` in each of them. What information can you get from those commands?
Hint: For more information on R functions, type `?glimpse()` in the R console.
- Using an R inline command, write the dimension of the dataset in a sentence.
Hint: Have a look at `ncol` and `nrow`.
- Add a new subsection heading (`###`) with “Why is it important to know the dimension of your dataset?” and write a brief sentence with the explanation
- Add a new subsection heading (`###`) with “What are the variable names in the dataset?” and display the names of the dataset variables using R.
Hint: `?names()` in the R console
- Select two variables and use a markdown list to briefly explain what each variable measures.
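One possible way to approach the inspection steps in this exercise is sketched below. Because the real CSV is not reproduced here, the sketch builds a small made-up tibble in place of `read_csv("data/avian_influenza_numbers.csv")`; the values, the `Indonesia` column, and the format of `Month` are assumptions for illustration only. The `#|` lines show how the chunk options from the exercise could be written at the top of a chunk, and the hyphenated label `reading-data` is also an assumption.

```r
#| label: reading-data
#| echo: true
#| warning: false
#| message: false

library(tibble)
library(dplyr)

# Made-up stand-in for the tutorial data (ASSUMPTION: not the real values
# or the full set of columns in avian_influenza_numbers.csv).
dat <- tibble(
  Month = c("1/1/2004", "2/1/2004", "3/1/2004", "Total"),
  Australia = c(0, 1, 0, 1),
  Egypt = c(2, 0, 3, 5),
  `United States` = c(0, 0, 1, 1),
  Indonesia = c(1, 4, 2, 7)
)

head(dat)          # first rows (6 by default)
head(dat, n = 10)  # up to the first 10 rows

glimpse(dat)       # one line per column: type and first values
str(dat)           # base R view of the same structure

# Dimension of the dataset as a sentence. In the text of a .qmd file the
# same values can be inlined with `r nrow(dat)` and `r ncol(dat)`.
dim_sentence <- paste(
  "The dataset has", nrow(dat), "rows and", ncol(dat), "columns."
)
dim_sentence

names(dat)         # variable names
```

With the real data, only the `tibble()` call would be swapped for the `read_csv()` line given in the exercise; the inspection calls stay the same.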
Exercise 4: Data Wrangling
- Using the R package `dplyr` (which is loaded with `tidyverse`), and using the pipe (`|>`), create a new dataset called `data_cleaned` that only contains the following variables:
  - `Month`
  - `Australia`
  - `Egypt`
  - `United States`
- Inspect `data_cleaned` and describe using a markdown list the type of variables in this new dataset. Write the names of the variables in bold. Do you think the variable attributes are correct?
- Convert the variable `Month` into a date vector using `lubridate::mdy`. What do you notice?
- Remove cases for which the data is aggregated or doesn’t have a valid month.
- What is the dimension of this new dataset? Compare it with the dimension of `data_cleaned`. How many cases have we lost?
- Provide a summary table of the three countries using the `kable()` function from the `kableExtra` package. Give it the caption “Summary of number of cases of Avian Influenza”.
- Visualize the case counts using a histogram and explain what information a histogram conveys. In addition, change the x label in the plot to Age and remove the y axis label.
Hint: As a first step, do this for just one country. To do multiple countries at once, you will need to `pivot_longer` your dataset.
- Extension: Change this plot to a time series plot, with one bar per month. As an extra challenge, split this out into three separate plots - one per country.
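The wrangling and plotting steps above could be sketched roughly as follows. This is a minimal illustration, not a model answer: it uses a small made-up tibble instead of the real CSV, and it assumes that the date information lives in the `Month` column, that dates are written month/day/year, and that the aggregated row is labelled `"Total"`. The object names `data_dated`, `data_valid`, `cases_long`, `p_hist` and `p_series` are hypothetical.

```r
library(tibble)
library(dplyr)
library(tidyr)
library(lubridate)
library(ggplot2)

# Made-up stand-in for the tutorial data (ASSUMPTION: values, the extra
# Indonesia column and the "Total" row are illustrative only).
dat <- tibble(
  Month = c("1/1/2004", "2/1/2004", "3/1/2004", "Total"),
  Australia = c(0, 1, 0, 1),
  Egypt = c(2, 0, 3, 5),
  `United States` = c(0, 0, 1, 1),
  Indonesia = c(1, 4, 2, 7)
)

# Keep only the four variables named in the exercise.
data_cleaned <- dat |>
  select(Month, Australia, Egypt, `United States`)

# Parse the month column; strings that are not dates become NA.
data_dated <- data_cleaned |>
  mutate(Month = mdy(Month, quiet = TRUE))

# Drop rows without a valid month (here, the aggregated "Total" row).
data_valid <- data_dated |>
  filter(!is.na(Month))

dim(data_cleaned)
dim(data_valid)
rows_lost <- nrow(data_cleaned) - nrow(data_valid)

# Long format: one row per month and country, so all countries can be
# plotted at once.
cases_long <- data_valid |>
  pivot_longer(cols = -Month, names_to = "country", values_to = "cases")

# Histogram of case counts, with the axis labels requested in the exercise.
p_hist <- ggplot(cases_long, aes(x = cases)) +
  geom_histogram(binwidth = 1) +
  labs(x = "Age", y = NULL)

# Extension: one bar per month, one panel per country.
p_series <- ggplot(cases_long, aes(x = Month, y = cases)) +
  geom_col() +
  facet_wrap(~country, ncol = 1)
```

Printing `p_hist` or `p_series` draws the plots. With the real data, the number of rows lost and the look of the figures will of course differ from this toy example.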
Important
At the end of this tutorial, you should have a full QMD file that renders, including your code and the outputs from it. This means you can read it from top to bottom and remember what you did.