ETC5513: Reproducible and Collaborative Practices

Keeping environments separate and reproducible

Lecturer: Michael Lydeamore

Department of Econometrics and Business Statistics



Open Frame

Recap

  • Learn how to add references and bibliography
  • Dealing with large files
  • Tags
  • GitHub issues

Today’s plan

Aim

  • Create a git project from an existing local folder
  • Licensing a repository
  • Lightweight dependency management
  • Advanced collaborative practices
  • Templates for slides

Adding version control onto an existing project

Suppose you have a folder on your computer which is not version controlled, and you decide that you would like to start tracking it.

You go to GitHub and create a repo over there. You now have two options:

Approach 1:

  • Create a GitHub repo
  • Clone the repo locally
  • Move all the files and folders from your existing project into this folder
  • Stage, commit, push

Approach 2:

  • Open a terminal in the directory for your folder
  • git init to create a git repostiroy
  • Stage, commit
  • git remote add origin git@github.com:...
  • git push -u origin main

Adding version control onto an existing project

Option 2 is preferred because it reduces duplication.

GitHub even gives you instructions:

The -u flag says to link remote origin to branch main. it is a one time operation.

Remember you can verify your remotes using git remote -v

Demo

Repository Visibility

More info here

Licensing

Public repos in GitHub make your work publicly available and therefore it is important to establish how your work should be acknowledged if someone else wants to use it.

Public repositories on GitHub are often used to share open source software. For your repository to be truly open source, you’ll need to license it sot that others are free to use, change and distribute the software.

More info here

Available licenses in GitHub

More info here

Choosing a license

Source here

License examples

No license

Source here

Creative Commons License

Creative Commons License

More info here

Location of your license

You can add a license from:

  • GitHub when you first create a repo
  • Later by placing your license text in a file named LICENSE.md

Licenses go in the root of the directory. Some information about licenses is sometimes included in README.md as well, but this is not required.

More info here

Demo

Making sure your code is transportable

It used to work I’m sure!

Think about the following issue

  • My code ran 6 months ago (I’m sure it did) and now it is not working.
  • My figures in ggplot look funny now

Generally, this is because a version of a package has updated.

Package environments

Lightweight dependency management

The primary solution for dependency management (which is far from perfect) is renv

The idea is to create a project-specific library to ensure a project has a record of which version of R packages are used.

How does renv work?

  • Gives each R project it’s own local library
  • Provides an easy way to get R sessions to use a local library
  • Provides tools for managing the R packages installed in these local libraries

So what is a library path anyway?

When we load a library, what exactly is going on?

R searches in a collection of places for the dplyr package, and loads it.

These places are called library paths.

You can change the library path, and you can view them using .libPaths()

Library paths example

Your default library path is probably something like:

/Library/Frameworks/R.framework/Versions/4.0/Resources/library

If you want to find a package, you can use find.package("dplyr")

renv sets up a project-specific library path to keep your packages from interacting with eaach other.

renv workflow

Via the R console:

  • renv::init() to initialise the project
  • Work as usual, installing packages, writing code
  • renv::snapshot() to save the state of the local library to a lockfile
  • renv::restore() to reverty our packages to the state encoded in the lockfile

The lockfile contains all the information about which package version is being used, and makes your environment reproducible.

But why?

In collaborative projects, you may want to ensure everyone is working with the same environment

It helps protect against changes between different versions of packages.

By sharing the lockfile, your collaborators will be using the same version of packages that you are using, without breaking their own installs.

Demo

Observe the changes in libPaths, the new files, and the lockfile.

Managing the lockfile

“If you’re using a version control system with your project, then as you call renv::snapshot() and later commit new lockfiles to your repository, you may find it necessary later to recover older versions of your lockfiles. renv provides the functions renv::history() to list previous revisions of your lockfile, and renv::revert() to recover these older lockfiles.”

Currently, only Git repositories are supported by renv::history() and renv::revert().

Source here

Managing the lockfile

Each project contains only a single renv.lock

There are some helper functions:

  • renv::history(): find prior commtis in which the lockfile has changed
  • renv::revert(commit = SHA1): Revert the lockfile to a state at a previous commit

Important

Make sure to commit the lockfile often, and call renv::snapshot() when you’re updating packages! This is the only way the changes can be recorded and shared.

Workflow for collaboration

  • Select a version control system (i.e. git and GitHub)
  • Initialise the project: renv::init()
  • Periodically, and after adding new packages, use renv::snapshot()
  • Stage, commit and share renv.lock with others via version control.

Workflow for collaboration

Once collaborators clone the repository, they also run renv::init().

This will automatically install the packages declared in the lockfile.

By doing this, they can work in your project using exactly the same R packages that you used when the lockfile was generated.

Demo

Reproducible Presentations Demo

Collaborative project tips

  • Listen and learn.
  • Care about the statistical/data science methods. Know what you are doing so that you can explain it!
  • Establish protocols for your work to be reproducible and acknowledged.
  • Remember the data has an owner and it needs to be acknowledge and referenced.
  • Be careful with changes in data structure! Request that any updates to the data preserves the data structure.
  • Don’t feel pressure to do an analysis that you feel is not right.
  • Learn to disagree if you consider that the data treatment/statistical or computational methods are not adequate.
  • Learn what the methods are - not just how to use software functions. That will take you far!!

Interesting paper here

Week 10 Lesson

Important

  • Learn how to add references and bibliography
  • Dealing with large files
  • Tags
  • GitHub issues