ETC5513: Reproducible and Collaborative Practices

Introduction to collaborative and reproducible practices

Lecturer: Michael Lydeamore

Department of Econometrics and Business Statistics



Aim

  • Understand the aims and structure of the unit
  • Explain the need for reproducible and collaborative practices
  • Introduce the main tools that we will be using during lectures and tutorials
  • Get to know your classmates

👩🏻‍🏫 ETC5513 Teaching Team

Dr. Michael Lydeamore

Lecturer & Chief Examiner

Naveen Kaushik

Tutor

David Wu

Tutor

Contacting the teaching team

  • For private matters, contact michael.lydeamore@monash.edu using your Monash student email and citing the unit name.
  • For non-private matters, post your question on the Moodle discussion board.

👩🏻‍🏫 ETC5513 Teaching Team

Most material in this course was developed by

Dr. Patricia Menendez


Patricia is a strong believer and trailblazer in reproducible research.

🎯 ETC5513 Learning Objectives

Learning objectives

  1. Develop skills to create reproducible data analyses, reports and presentations.
  2. Understand the operation of version control systems.
  3. Advance use of Git and GitHub.
  4. Utilize version control to integrate data analysis efforts of team members.
  5. Effectively work with a group to construct collaborative data science projects.

All that combined with the learning of statistical concepts!

Tip

Please participate during the lectures and tutorials. The success of the unit depends not only on the teaching team but also on you as part of this unit’s team.

ETC5513 Program

  1. Course introduction to collaborative and reproducible practices
  2. Reproducible reports using R markdown
  3. Introduction to version control systems: Git and GitHub
  4. Reproducible reporting using R markdown, Git and GitHub
  5. Deeper git knowledge, stashing and tools
  6. Reproducible reporting and version control systems
  7. Workflows for reproducible data analysis
  8. Reproducible reporting for specialized and broad audiences
  9. Advanced collaborative practices
  10. Reproducible workflows in consultancy
  11. Summary and Recap

🏛️ ETC5513 unit structure

Start with individual projects

Will continue with a class group project

Finally, you will work on your own projects

Unit structure and resources

  • 2 hour lectures are interactive sessions:
    • during the lecture we demonstrate, discuss and complete tasks in small groups
  • 1.5 hour tutorial → only go to the one you are assigned to!

The lectures will be a combination of presentations with interactive exercises.

Each lecture will commence with an open frame (5 minutes), where you can talk about your learning and share comments, issues and resources with the rest of the class.

That time can also be used for questions (as can any other time in the lecture).

The tutorials will be entirely based on computer practicals and you will be working individually as well as in groups.

Lecture structure

  • Open Frame
  • Recap from previous lecture
  • Summary of today’s lecture content
  • Lecture delivery

Lecture tips

  • Come prepared to be an active learner
  • Engage yourself in the lecture
  • Share responsibility for learning
  • Bring your computer

Tutorials

Go over the material before the tutorial

The goal is to practice the ideas covered in lectures by working through activities and exercises individually and in groups.

Tip

  • You will get instructions with the tasks that need to be completed during the tutorial
  • Your tutors will be there to guide and help you through the activities
  • Tutorials also provide a great opportunity for you to discuss and work with your peers

🪵 Materials

Unit website

  • Lecture slides and tutorial materials are available on the unit website
  • Lecture videos and assessments will be available on Moodle

Note

Materials are designed to develop your hard and soft skills.

✋ Consultation hours

  • Michael: Thursdays 3.00-4.00pm In Person (Building 6 Room 354) and on Zoom
  • Naveen Kaushik: Mondays 5.30-7.00pm on Zoom
  • David Wu: Tuesdays 4.00-5.00pm In Person (Menzies W9.20)

Please see Moodle for Zoom details

💯 Course assessments

  • 3 Assignments:
    • A1: Released week 3, due week 5: 30%
    • A2: Released week 7, due week 9: 30%
    • A3: Released week 10, due week 12: 30%
  • Oral interview: Based on A1 and W1-W12 content: Week 12: 10%

ETC5513 Code of Conduct

  • Please feel free to ask questions and share ideas with the class.
  • All questions, suggestions or comments are welcomed and must be respected by the group.
  • Remember, while working in teams, clarity, organisation and communication are extremely important.
  • Please let me know about suggestions, problems and/or complaints at any time.

Interactions with the teaching team

✅ Consultation hours: We are here to help you!

✅ Moodle discussion forum

Get used to using the forum - helping your peers is a fantastic way to learn.

Questions?

The classic analysis pipeline

  1. You carry out your analysis in R, Python or MATLAB (with some code), or perhaps you use Excel
  2. You paste your results into your Word document or Google Doc.

Question

What is the problem with this approach?

Critical Issue

How about…?

If one parameter or one number changes in your data?

GAME OVER

We start all over again 😭😭😭

Maybe we copy and paste into a new script

After a week, a month, a year… it gets very hard to remember all the steps!

Reproducible research and replicability

Definitions by the US National Academies of Sciences, Engineering, and Medicine:

  • Reproducibility (“computational reproducibility”) means obtaining consistent computational results using the same input data, computational steps, methods, code, and conditions of analysis.

  • Replicability means obtaining consistent results across studies aimed at answering the same scientific question, each of which has obtained its own data.

Reference: National Academies of Sciences, Engineering, and Medicine, Reproducibility and Replicability in Science (2019).
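
To make the first definition concrete, here is a minimal Python sketch of computational reproducibility. It is an illustration only and not part of the unit materials: the function `bootstrap_mean`, the seed value and the measurements are all made up. The point it shows is that the same input data, the same code and the same random seed give exactly the same result on every run.

```python
import random
import statistics


def bootstrap_mean(data, n_resamples=1000, seed=5513):
    """Estimate the mean of `data` by bootstrap resampling.

    A private random.Random(seed) makes the resampling deterministic,
    so the same data, code and seed always give the same estimate.
    (Hypothetical helper, written only for this illustration.)
    """
    rng = random.Random(seed)
    resample_means = [
        statistics.fmean(rng.choices(data, k=len(data)))
        for _ in range(n_resamples)
    ]
    return statistics.fmean(resample_means)


# Made-up measurements, used only for illustration.
measurements = [4.1, 5.3, 3.8, 6.0, 5.1, 4.7]

first_run = bootstrap_mean(measurements)
second_run = bootstrap_mean(measurements)

# Same data, same code, same seed: the two runs agree exactly.
print(first_run == second_run)
```

Without the fixed seed, each run would draw different resamples and the estimate would vary slightly from run to run. Replicability is a different matter: it would mean collecting new measurements and checking whether the conclusion still holds.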

Combining text and data analysis in the same document

Literate programming

Literate programming is an approach to writing reports using software that weaves together the source code and text at the time of creation.

Donald Knuth coined the term literate programming in the early 1980s to refer to a source file that could be both run by a computer and “woven” into a formatted presentation document.

Reproducibility

Reproducibility is a way of thinking and approaching projects

  • Requires planning
  • Needs extra upfront effort
  • Demands that we be organised
  • Challenges us to think more broadly

Reproducible research

  • Working to make your research reproducible does require extra upfront effort.
  • Making a project reproducible from the start encourages you to use better work habits.
  • It should push you to bring your data and source code up to a higher level of quality than you might if you “thought ‘no one was looking’” [Donoho, 2010, 386].
  • Reproducible research needs to be stored so that other researchers can actually access the data and source code.
  • Changes are easier to implement, especially when using dynamic reproducible documents.
  • Reproducible research has higher impact.

Reproducibility complexity

Complexity varies

Some projects require a single tool (be that R, Python, MATLAB or many others) and may only involve one person.

Others might involve different teams and require many different tools

Project example

Complex workflow example

Complex projects need more than literate programming

Reproducibility: How?

Using tools for reproducible research and reporting

Dynamic documents

Definition: Dynamic Documents

A dynamic document includes code used for data analysis and report text

These two things produce your report/paper/presentation

All in a sequential and dynamic way!
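
The idea can be sketched in a few lines of Python. This is an illustration under stated assumptions, not how R Markdown or Quarto work internally: `render_report` and the scores are hypothetical. Every number in the report text is computed from the data when the report is rendered, so a change in the data flows into the text without any copying and pasting.

```python
import statistics


def render_report(scores):
    """Build the report text directly from the data.

    Every number in the text is computed here, so nothing is
    copied and pasted by hand. (Hypothetical helper for illustration.)
    """
    n = len(scores)
    mean = statistics.fmean(scores)
    return (
        f"We analysed {n} observations. "
        f"The mean score was {mean:.1f}."
    )


# Hypothetical data, used only for illustration.
scores = [62, 71, 85, 90]
print(render_report(scores))

# One value is corrected: re-rendering updates the whole report.
scores[0] = 66
print(render_report(scores))
```

The first call reports a mean of 77.0 and the second a mean of 78.0. The text changed because the data changed, not because anyone edited the sentence.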

Let’s start from the beginning

Code?

R & RStudio?

They are related but they are not the same. Why?

Tools for reproducible research

R Programming Language

  • R enables researchers to read data, create data visualizations and run statistical analyses.
  • R has thousands of libraries
  • R has a very active development community that is constantly expanding.

R Libraries & Packages facilitate reproducibility

  • knitr and quarto allow us to connect R-based analyses to presentations, papers and reports created with markup languages such as LaTeX and Markdown.

R by itself can gather and analyse data and, with a little help from knitr, quarto and a markup language, present the results in a highly reproducible way.

RStudio

RStudio is an integrated development environment (IDE)

We don’t need RStudio, but it lets us do things more easily.

  • A happy medium between R’s text-based interface and a pure GUI
  • It is closely integrated with git (version control)

It has a cloud counterpart called RStudio Cloud (since renamed Posit Cloud)

Important distinction

R is the programming language

RStudio is the integrated development environment

RStudio Cloud

It’s RStudio, in the cloud.

Why?

  • Allows users to run reproducible reports without installing or configuring any additional software on their own computer. It looks just like RStudio, but it runs in the cloud and can be opened in any browser.
  • RStudio Cloud lets us all work in the same environment, regardless of the operating system on each of your computers.
  • RStudio Cloud lets us first focus on learning R and RStudio without having to worry about installing them locally on each computer (we’ll do that later, once you are more familiar with the language and the RStudio environment).

Version Control

Definition: Version Control

A system that records changes to a file or a set of files over time, so that you can recall specific versions later.
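
As a rough illustration of this definition, the Python sketch below records snapshots of a file's content and recalls an earlier one. It is a toy written for this example and is not how git works internally: the class `TinyHistory`, its method names and the file contents are all hypothetical.

```python
import hashlib


class TinyHistory:
    """Toy version store: NOT git's implementation, only the core idea."""

    def __init__(self):
        self._snapshots = {}  # version id -> saved content
        self.log = []         # (version id, message), oldest first

    def commit(self, content, message):
        """Record a snapshot of `content` and return its version id."""
        # The id is derived from the content itself (a SHA-1 hex digest).
        version_id = hashlib.sha1(content.encode("utf-8")).hexdigest()
        self._snapshots[version_id] = content
        self.log.append((version_id, message))
        return version_id

    def checkout(self, version_id):
        """Recall the content saved under `version_id`."""
        return self._snapshots[version_id]


history = TinyHistory()
v1 = history.commit("mean = 77.0\n", "First draft of results")
v2 = history.commit("mean = 78.0\n", "Correct one data value")

# The earlier version is still recoverable after the later change.
print(history.checkout(v1))
print([message for _, message in history.log])
```

The two properties to notice are the ones in the definition: every change is recorded along with a message, and any recorded version can be recalled later.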

git

Definition: git

Git is a distributed version-control system for tracking changes in source code during software development. It is designed for coordinating work among programmers, but it can be used to track changes in any set of files. Its goals include speed, data integrity, and support for distributed, non-linear workflows.

GitHub, Bitbucket, and others

  • These are cloud-based hosting services for managing git repositories
  • They are code hosting platforms for version control and collaboration

They let you and others work together on projects from anywhere

git and GitHub

Reproducibility setup depends on the project

We will learn general practical tips for reproducible workflows

There is no one-size-fits-all approach!

Recommendations summary

  1. Plan in advance
  2. Consider adequate file systems for the project
  3. Create accessible, connected workflows
  4. Document, document, document
  5. Consider using a code environment container
  6. Add a license for sharing your work

ETC5513 Ingredients

Main tools

  • R
  • RStudio
  • Command Line Interface
  • git
  • GitHub
  • VSCode

During this semester these tools will be essential for us to build reproducible and collaborative research practices.

Tutorial

This week the tutorial will focus on providing an introduction to different resources.

  • These slides are on Moodle and the course website
  • You will also find the tutorial for this week
  • Familiarise yourself with all the resources in the tutorial and get to know your colleagues (this is quite important!)
  • Overview of RStudio and an introduction to R

Week 1 Lesson

Summary

  • What are reproducible practices?
  • What tools are available to us for reproducibility?
  • When should we consider reproducible practices?

Resources