Lecturer: Michael Lydeamore
Department of Econometrics and Business Statistics
Aim
git
project from an existing local folder
This is just for your information and it is not part of the material that is going to be examined.
Docker is a program that allows you to manipulate (launch and stop) multiple operating systems (called containers) on your machine (your machine is called the host).
Source here.
Docker is designed to enclose environments inside an image / a container
Definitions by the U.S. National Academies of Sciences, Engineering, and Medicine:
Tip
Reproducibility (“computational reproducibility”) means obtaining consistent computational results using the same input data, computational steps, methods, code, and conditions of analysis.
Replicability means obtaining consistent results across studies aimed at answering the same scientific question, each of which has obtained its own data
RStudio allows you to do all of these things more easily.
- It is also a happy medium between R’s text-based interface and a pure GUI.
- It is not the only IDE!
- It is closely integrated with the version control programs Git and SVN.
Tip
Version Control is a system that records changes to a file or set of files over time so that you can recall specific versions later.
“Git is a distributed version-control system for tracking changes in source code during software development. It is designed for coordinating work among programmers, but it can be used to track changes in any set of files. Its goals include speed, data integrity, and support for distributed, non-linear workflows”
https://en.wikipedia.org/wiki/Git
Reproducible Research with R and Rstudio by Christopher Gandrud
Tip
Markdown is a lightweight markup language that you can use to add formatting elements to plain-text documents.
It was created by John Gruber in 2004. Read more here
Three main components:
You can quickly insert R chunks into your file with
Chunk output can be customized with options, marked by the “hashpipe”: #|
include: false
prevents code and results from appearing in the finished file. Quarto still runs the code in the chunk, and the results can be used by other chunks.
echo: false
prevents code, but not the results, from appearing in the finished file. This is a useful way to embed figures.
eval: false
prevents the code from being evaluated, so no results are produced.
message: false
prevents messages that are generated by code from appearing in the finished file.
warning: false
prevents warnings that are generated by code from appearing in the finished file.
fig-cap: ...
adds a caption to graphical results.
To set global options that apply to every chunk in your file, call knitr::opts_chunk$set() in an R code chunk.
Knitr will treat each option that you pass to knitr::opts_chunk$set() as a global default that can be overridden in individual R code chunk headers.
cache: true is an option within the chunks, or you can set it as a global option.
Caution
Caching might prevent you from updating some results. Because of that, it is essential that you use it only when you are sure your R code chunks are working fine. Setting cache = TRUE as a global option might be dangerous, so be very careful.
knitr executes the code and converts the .qmd to .md
pandoc renders the .md file to the output format you want
When we have figures or plots in our reports it is a great idea to set up some global options at the beginning of our document:
---
title: "My Report"
author: "Michael Lydeamore"
format:
  html:
    keep-md: true
---
Using markdown syntax:
Using Knitr syntax
Options inside your R code chunks
fig-align
: alignment of the figures in the report with options default, center, left, or right
fig-cap
: captions
fig-height & fig-width
: size of the figure in inches
out-height & out-width
: size of your plot in the final file. Useful to resize your figures by, say, 50%
An absolute or full path starts from the root of the file system: a drive letter followed by a colon on Windows, such as D:, or a leading slash on Unix-like systems, such as /users.
C:\documents\charlie
/Users/documents/courses/ETC5513
A relative path refers to a location that is relative to a current directory:
ETC5513/exercise.Rmd
(no matter where the folder sits things can actually run)
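The difference between the two kinds of path can be sketched in the shell. This is a minimal illustration only: the temporary folders are hypothetical stand-ins for wherever a project might live, and only the ETC5513/exercise.Rmd path comes from the slide above.

```shell
set -e
# Hypothetical project folder, created in a temporary location for illustration
base="$(mktemp -d)"
mkdir -p "$base/ETC5513"
echo "# exercise" > "$base/ETC5513/exercise.Rmd"

# A relative path is resolved against the current working directory
cd "$base"
rel_path="ETC5513/exercise.Rmd"
abs_path="$(pwd)/$rel_path"      # the same file, written as an absolute path

# Move the whole folder somewhere else (e.g. another computer or drive)
moved="$(mktemp -d)"
mv "$base/ETC5513" "$moved/"
cd "$moved"

# The relative path still resolves; the old absolute path does not
[ -f "$rel_path" ] && rel_ok=yes || rel_ok=no
[ -f "$abs_path" ] && abs_ok=yes || abs_ok=no
echo "relative path works: $rel_ok; old absolute path works: $abs_ok"
```

As long as the working directory is the folder that contains the project, the relative path keeps working wherever that folder sits.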
It is essential to understand where your directories and files are within your computer. Having clarity about that and about the project’s file architecture gives you total control over its organization.
Each project has a unique working directory
Clean file system: all files related to a single project should be in the same folder
File path discipline: all paths should be relative to the project’s folder
Refer to the computer location where files and folders are.
Remember, absolute paths are not reproducible
RStudio Projects are associated with R working directories.
You can create an RStudio project:
Read more on Rstudio projects here
File > New project > (Few options)
When a new project is created RStudio:
Creates a project file (with an .Rproj extension) within the project directory.
Creates a hidden directory (named .Rproj.user) where project-specific temporary files (e.g. auto-saved source documents, window-state, etc.) are stored.
Allison Horst (@allison_horst)
We use a distributed version control system called Git.
Let’s think of the connections between the different versions of an R project as a tree (Git tree).
main (default branch)
Illustration source: Beginning Git and GitHub
The shell, also known as the command line interface (CLI) or terminal, is an interface for typing commands to interact directly with a computer’s operating system.
Learn how to use the shell/command line interface!
Why??
Git has three main states that your files can reside in: modified, staged, and committed:
Modified: you have changed the file but have not committed it to your database yet.
Staged: you have marked a modified file in its current version to go into your next commit snapshot.
Committed: the data is safely stored in your local database.
This leads us to the three main sections of a Git project: the working tree, the staging area, and the Git directory.
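The three states can be observed with git status --short. The sketch below assumes a throwaway repository and a hypothetical file report.txt; the user name and email are placeholders set only for this repository.

```shell
set -e
repo="$(mktemp -d)"                          # throwaway repository for illustration
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/main        # name the initial branch main on any Git version
git config user.name "Example Student"       # placeholder identity, local to this repo
git config user.email "student@example.com"

echo "first draft" > report.txt
git add report.txt
git commit -q -m "Add report"

# Modified: changed in the working tree, not yet staged
echo "second draft" >> report.txt
state_modified="$(git status --short report.txt)"

# Staged: marked to go into the next commit snapshot
git add report.txt
state_staged="$(git status --short report.txt)"

# Committed: safely stored in the local database, nothing left to report
git commit -q -m "Update report"
state_committed="$(git status --short report.txt)"

printf 'modified: [%s]\nstaged: [%s]\ncommitted: [%s]\n' \
  "$state_modified" "$state_staged" "$state_committed"
```

In the short format the first column describes the staging area and the second the working tree, so the M moves from the second column to the first when the file is staged.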
To interact between our projects and Git, we are going to use the shell/command line interface
git clone
: a Git command line utility used to target an existing repository and create a clone, or copy, of it.
git add
: adds a change in the working directory to the staging area.
git commit -m
: captures a snapshot of the project’s currently staged changes along the timeline of the project’s history (-m = message for the commit).
git push origin main
: uploads local repository content to a remote repository, in this case to the main branch.
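The four commands can be tried end to end without a GitHub account. In this sketch a local bare repository stands in for the remote; the folder names, the file notes.md and the identity are all hypothetical.

```shell
set -e
work="$(mktemp -d)"

# A local bare repository plays the role of the remote (e.g. GitHub)
git init -q --bare "$work/remote.git"
git -C "$work/remote.git" symbolic-ref HEAD refs/heads/main

# git clone: copy the (still empty) remote repository
git clone -q "$work/remote.git" "$work/project" 2>/dev/null
cd "$work/project"
git symbolic-ref HEAD refs/heads/main        # make sure the first commit lands on main
git config user.name "Example Student"       # placeholder identity, local to this repo
git config user.email "student@example.com"

# git add + git commit: stage a change and snapshot it
echo "hello" > notes.md
git add notes.md
git commit -q -m "Add notes"

# git push: upload the local main branch to the remote called origin
git push -q origin main

local_head="$(git rev-parse HEAD)"
remote_head="$(git -C "$work/remote.git" rev-parse main)"
echo "local: $local_head"
echo "remote: $remote_head"
```

After the push both repositories point at the same commit.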
In a Git repository, tracked files are those that are part of the repository.
However, we can also have untracked files, whose history is not recorded.
Tracked files are files that were in the last snapshot; they can be unmodified, modified, or staged. In short, tracked files are files that Git knows about.
Untracked files are everything else — any files in your working directory that were not in your last snapshot and are not in your staging area.
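One way to list the two groups is git ls-files. The sketch below assumes a throwaway repository with two hypothetical files: analysis.R, which is committed, and notes.txt, which is never added.

```shell
set -e
repo="$(mktemp -d)"                          # throwaway repository for illustration
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/main
git config user.name "Example Student"       # placeholder identity, local to this repo
git config user.email "student@example.com"

echo "x <- 1" > analysis.R
git add analysis.R
git commit -q -m "Add analysis script"

echo "scratch" > notes.txt                   # created but never added

tracked="$(git ls-files)"                                  # files Git knows about
untracked="$(git ls-files --others --exclude-standard)"    # everything else (minus ignored files)
echo "tracked: $tracked"
echo "untracked: $untracked"
```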
Once you have cloned the repo, each time you work on the project (via the terminal/command line):
git pull
: fetches and downloads content from a remote repository and immediately updates the local repository to match that content.
git status
: displays the state of the working directory and the staging area
git add file_name
: adds changes in the working directory to the staging area
git commit -m "Message"
: creates a snapshot of the staged changes along the timeline of the Git project history
git push origin branch_name
: uploads the local repository content to a remote repository in GitHub
An excellent summary of the commands that we will be using can be found here
The status/staging panel in RStudio:
RStudio keeps git constantly scanning the project directory to find any files that have changed or which are new.
By clicking a file’s little “check-box” you can stage it.
Understanding the symbols in the RStudio Git pane:
Using git branch and git checkout:
git branch
shows us the branches we have in our repo and marks our current branch with *
git branch newbranch_name
creates a new branch but does not move the HEAD of the repo there.
git checkout newbranch_name
moves the HEAD to newbranch_name
Using the checkout command:
git checkout -b newbranch_name
: creates a new branch and moves the repo HEAD to this branch
git branch
to see which branch you are currently in
Suppose we have two branches: main and new_development
Move to the main branch: git checkout main
Merge: git merge new_development -m "Merging new_development into main"
Remember that if VS Code is set as your Git editor and you run git merge new_development without -m, the VS Code editor will open so that you can type your message.
If those steps are successful, your new_development branch will be fully integrated into the main branch.
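The branch and merge steps can be run together as a small experiment. This is a sketch under stated assumptions: a throwaway repository, hypothetical files report.md and analysis.md, and one extra commit on main so that the merge produces a merge commit rather than a fast-forward.

```shell
set -e
repo="$(mktemp -d)"                          # throwaway repository for illustration
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/main
git config user.name "Example Student"       # placeholder identity, local to this repo
git config user.email "student@example.com"

echo "intro" > report.md
git add report.md
git commit -q -m "Start report"

# Create the branch and move HEAD to it in one step
git checkout -q -b new_development
echo "new analysis" > analysis.md
git add analysis.md
git commit -q -m "Add analysis"

# Back on main, add a commit so the two histories diverge
git checkout -q main
echo "conclusion" >> report.md
git commit -q -am "Extend report"

# Merge new_development into main with an explicit message
git merge -q new_development -m "Merging new_development into main"

merge_subject="$(git log -1 --format=%s)"
current_branch="$(git symbolic-ref --short HEAD)"
echo "on $current_branch: $merge_subject"
```

If main had no new commits of its own, Git would simply fast-forward main to new_development and the -m message would not be used.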
Before you push the new branch to the remote repo:
git branch -m original_name new_name
If you want to rename a branch that has already been pushed to the remote repo:
git branch -m old_name_branch new_name
git push origin -u new_name
git push origin --delete old_name_branch
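The three commands for renaming an already-pushed branch can be checked against a local stand-in for the remote. The bare repository, the file notes.md and the identity below are assumptions made for the sketch; the branch names are the ones used above.

```shell
set -e
work="$(mktemp -d)"
git init -q --bare "$work/remote.git"        # local stand-in for the remote repo
git -C "$work/remote.git" symbolic-ref HEAD refs/heads/main
git clone -q "$work/remote.git" "$work/project" 2>/dev/null
cd "$work/project"
git symbolic-ref HEAD refs/heads/main
git config user.name "Example Student"       # placeholder identity, local to this repo
git config user.email "student@example.com"

echo "notes" > notes.md
git add notes.md
git commit -q -m "Add notes"
git push -q origin main

# A branch that has already been pushed under its old name
git checkout -q -b old_name_branch
git push -q origin old_name_branch

# Rename locally, push the new name, then delete the old name on the remote
git branch -m old_name_branch new_name
git push -q origin -u new_name
git push -q origin --delete old_name_branch

remote_branches="$(git ls-remote --heads origin | sed 's|.*refs/heads/||' | sort | tr '\n' ' ')"
echo "branches on the remote: $remote_branches"
```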
The git stash command takes your uncommitted changes and saves them away in the Git repo for later use.
Bringing stash into the repo
git stash
git stash apply
git stash pop
git stash list
to see the list of the stashes
git stash apply
will take the changes saved in your stash and apply them to the working directory of your current branch. In addition, the changes are kept in the stash.
This might be useful when you want to apply the same changes to different branches.
git stash pop
will do the same as apply but will delete the stash after applying the changes.
Tip
Stash is not a substitute for committing changes
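The difference between apply and pop shows up in git stash list. The sketch below assumes a throwaway repository and a hypothetical tracked file analysis.R with one uncommitted change.

```shell
set -e
repo="$(mktemp -d)"                          # throwaway repository for illustration
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/main
git config user.name "Example Student"       # placeholder identity, local to this repo
git config user.email "student@example.com"

echo "v1" > analysis.R
git add analysis.R
git commit -q -m "Add analysis"

# An uncommitted change that we want to put aside
echo "v2" >> analysis.R
git stash -q
clean_after_stash="$(git status --short)"                    # working tree is clean again

# apply: changes come back AND the stash entry is kept
git stash apply -q
count_after_apply="$(git stash list | wc -l | tr -d ' ')"

# Discard the applied change so that the same stash can be applied again
git checkout -q -- analysis.R

# pop: changes come back and the stash entry is deleted
git stash pop -q
count_after_pop="$(git stash list | wc -l | tr -d ' ')"

echo "stashes after apply: $count_after_apply, after pop: $count_after_pop"
```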
Ignored files are tracked in a special file named .gitignore that is checked in at the root of your repository.
There is no explicit git ignore command: instead, the .gitignore file must be edited and committed by hand when you have new files that you wish to ignore.
.gitignore files contain patterns that are matched against file names in your repository to determine whether or not they should be ignored.
Assume the following history exists and the current branch is “Feature”
      A---B---C Feature
     /
D---E---F---G main
From this point, the result of either of the following commands:
git checkout Feature
git rebase main
or, equivalently,
git rebase main Feature
would be:
              A'--B'--C' Feature
             /
D---E---F---G main
Merging is a non-destructive operation. The existing branches are not changed in any way. This avoids all of the potential problems of rebasing.
Rebasing moves the entire Feature branch to begin on the tip of the main branch, incorporating all of the new commits in main.
Rebasing re-writes the project history by creating brand new commits for each commit in the original branch. Produces cleaner project history.
However, it creates problems with safety and traceability
Golden rule for rebase: Never use it on public branches (main in collaborative projects).
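A scaled-down version of the rebase picture can be reproduced locally. The sketch assumes a throwaway repository and a shorter history than the diagram: one shared commit D, one commit A on Feature and one later commit G on main, each touching its own hypothetical file.

```shell
set -e
repo="$(mktemp -d)"                          # throwaway repository for illustration
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/main
git config user.name "Example Student"       # placeholder identity, local to this repo
git config user.email "student@example.com"

echo "d" > d.txt && git add d.txt && git commit -q -m "D"

git checkout -q -b Feature
echo "a" > a.txt && git add a.txt && git commit -q -m "A"

git checkout -q main
echo "g" > g.txt && git add g.txt && git commit -q -m "G"

# Replay the Feature commits on top of the tip of main
git checkout -q Feature
old_tip="$(git rev-parse Feature)"
git rebase -q main
new_tip="$(git rev-parse Feature)"

fork_point="$(git merge-base main Feature)"   # where Feature now starts
main_tip="$(git rev-parse main)"
subjects="$(git log --format=%s | tr '\n' ' ')"
echo "history of Feature (newest first): $subjects"
```

The commit on Feature gets a new identifier after the rebase (A becomes A'), which is why rebasing a branch that other people already use causes trouble.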
A fork is a copy of a repository
Forking a repository allows you to freely experiment with changes without affecting the original project.
Most commonly, forks are used to either propose changes to someone else’s project or to use someone else’s project as a starting point for your own project.
Search/navigate to the repo from within our GitHub account.
A fork is a copy of someone else’s GitHub repository saved to your own GitHub account. It allows you to experiment with changes without affecting the original project.
A fork acts as a bridge between the original repository and your personal one.
It will also allow you to interact between your forked copy and the original repo
When you clone a GitHub repository, you are creating a local copy of that repo on your computer
That allows you to work on that repo locally and sync between both your local repo and your remote repo
We use GitHub to share our code and projects with others.
There are situations when another person makes changes to your code and wants you to consider those changes.
Examples: fixing a problem/bug or adding new functionality to the repo.
We achieve this by sending a request to the repo’s owner to pull/merge these changes into the owner’s original GitHub repo
That request is called a pull request
There are different ways:
preamble.tex
git log
: allows us to go back into our project history to see who contributed what, find past issues or problems, and revert problematic changes.
commit 8cfaee1e447d8e83d745b51ffcd310465afb76b1
Author: Patricia Menendez <patricia.menendez@monash.edu>
Date: Sat Apr 4 15:49:54 2020 +1000
Uploading Week4 slides
git log --oneline
: condenses each commit to a single line
3a5bc86 W3 cli updates
4d1b022 W3 shell update
We can also use git log --pretty=oneline
Detached HEAD state gives you the power to check out any commit and explore the older state of a repository without having to create a local branch.
Any commits made in a detached HEAD state do not belong to any branch, so they are effectively lost once you check out another branch.
Solution: Create a branch to keep commits
git checkout 8cfaee1e447d8e8
You are in 'detached HEAD' state. You can look around, make experimental
changes and commit them, and you can discard any commits you make in this
state without impacting any branches by performing another checkout.
git checkout -b new_branch_name
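The detached HEAD workflow, including the rescue step, can be sketched as follows. The repository, the file names and the commit messages are hypothetical; new_branch_name is the placeholder used above.

```shell
set -e
repo="$(mktemp -d)"                          # throwaway repository for illustration
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/main
git config user.name "Example Student"       # placeholder identity, local to this repo
git config user.email "student@example.com"

echo "week 1" > slides.md && git add slides.md && git commit -q -m "Week 1 slides"
first_commit="$(git rev-parse HEAD)"
echo "week 2" >> slides.md && git commit -q -am "Week 2 slides"

# Check out an older commit directly: HEAD no longer points at a branch
git checkout -q "$first_commit"
if git symbolic-ref -q HEAD >/dev/null; then detached=no; else detached=yes; fi

# An experimental commit made while detached
echo "experiment" > experiment.md
git add experiment.md
git commit -q -m "Experiment"

# Solution: create a branch here so that the commit is kept
git checkout -q -b new_branch_name
current_branch="$(git symbolic-ref --short HEAD)"
kept_subject="$(git log -1 --format=%s)"
echo "detached: $detached; now on $current_branch with '$kept_subject'"
```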
git reset filename
git reset
If you have not pushed the commit:
- git commit --amend: will open your VS Code editor so you can amend the commit
- git push origin main
git reset --soft HEAD~1
git reset --hard HEAD~1
git reset --mixed HEAD~1
HEAD~1
: you want to reset the HEAD (the last commit) to one commit before in the log history.
This can be extended to any commit and you can use the notation HEAD~1 or the commit SHA identifier
git reset --soft
Imagine that you have added two files in your latest commit and you want to make a modification in one of the files.
Use git reset --soft HEAD~1 to undo the last commit and include additional modifications to the file.
git reset --hard
Run the git reset command with the --hard option and a target commit (for example HEAD~1).
Caution
When we use git reset --hard all the changes will be removed from the working directory and from the index (staging area).
git reset --mixed
Run the git reset command with the --mixed option: git reset --mixed HEAD~1
The git reset --mixed option is a combination of soft and hard reset: like a hard reset it clears the staging area, but like a soft reset it keeps your changes in the working directory.
The git revert command can be considered an ‘undo’ type command, however, it is not a traditional undo operation. Instead of removing the commit from the project history, it figures out how to invert the changes introduced by the commit and appends a new commit with the resulting inverse content. This prevents Git from losing history, which is important for the integrity of your revision history and for reliable collaboration.
You can think of it as a “rollback”: it points your local environment back to a previous commit. Your “local environment,” refers to your local repository, staging area, and working directory
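The three reset modes and git revert can be compared side by side in a throwaway repository. This is a sketch with hypothetical files a.txt and b.txt; the same commit "Add b" is undone three times so that git status --short shows what each mode leaves behind.

```shell
set -e
repo="$(mktemp -d)"                          # throwaway repository for illustration
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/main
git config user.name "Example Student"       # placeholder identity, local to this repo
git config user.email "student@example.com"

echo "a" > a.txt && git add a.txt && git commit -q -m "Add a"
echo "b" > b.txt && git add b.txt && git commit -q -m "Add b"

# --soft: the commit is undone, its changes stay staged
git reset -q --soft HEAD~1
soft_status="$(git status --short)"
git commit -q -m "Add b"

# --mixed: the commit is undone, its changes stay in the working directory only
git reset -q --mixed HEAD~1
mixed_status="$(git status --short)"
git add b.txt && git commit -q -m "Add b"

# --hard: the commit is undone and its changes are removed everywhere
git reset -q --hard HEAD~1
hard_status="$(git status --short)"
if [ -e b.txt ]; then b_exists=yes; else b_exists=no; fi

# revert: history is kept and a new commit inverts the unwanted one
echo "bad change" >> a.txt
git commit -q -am "Bad change"
git revert --no-edit HEAD >/dev/null
history_length="$(git rev-list --count HEAD)"
last_subject="$(git log -1 --format=%s)"
content="$(cat a.txt)"
echo "commits: $history_length; last: $last_subject; a.txt: $content"
```

After the revert the unwanted commit is still visible in the log, followed by the commit that cancels it.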
git rm file.txt
git rm -r Data
git commit -m "Delete file.txt"
git status
One line commit (we need to main that a little bit more!)
We can add more text into any commit and many times we should be doing that
We can do that using VSCode
Commit structure:
First line
Blank line
Rest of the text
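The same first line / blank line / body structure can also be produced without opening an editor, by passing -m twice. This is an alternative to the VS Code route, shown as a sketch with a hypothetical file and message.

```shell
set -e
repo="$(mktemp -d)"                          # throwaway repository for illustration
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/main
git config user.name "Example Student"       # placeholder identity, local to this repo
git config user.email "student@example.com"

echo "x,y" > results.csv
git add results.csv

# The first -m becomes the first line; the second -m becomes the body after a blank line
git commit -q -m "Add model results" \
  -m "Results come from the latest model run. Figures are regenerated from results.csv."

subject="$(git log -1 --format=%s)"   # first line only
body="$(git log -1 --format=%b)"      # rest of the text
git log -1 --format=%B                # full message as stored
```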
Git Large File Storage lets you store large files on a remote server such as GitHub.
Tip
Git Large File Storage (LFS) replaces large files such as audio samples, videos, data sets, and graphics with text pointers inside Git, while storing the file contents on a remote server like GitHub.com or GitHub Enterprise.
git lfs install
: You only need to run this once per user account.
git lfs track "*.csv"
: In each Git repository where you want to use Git LFS, select the file types you would like Git LFS to manage
git add .gitattributes
: make sure “.gitattributes” is tracked
Then, continue as usual:
git add file.csv
git commit -m "Add data file"
git push origin main
If for some reason you have staged/committed a large file before you run the workflow above, you can use:
git reset --soft HEAD~1
Tags are references that point to specific points in Git history
Example: Specific report release, package release
A tag is like a branch that doesn’t change.
Unlike branches, tags (after being created) have no further history of commits
Great tutorial on tags here
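That a tag stays put while a branch moves on can be checked with a small experiment. The git tag command itself is not shown on the slides, so treat this as an assumed illustration: the tag name v1.0, its message and the file report.md are hypothetical.

```shell
set -e
repo="$(mktemp -d)"                          # throwaway repository for illustration
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/main
git config user.name "Example Student"       # placeholder identity, local to this repo
git config user.email "student@example.com"

echo "version 1" > report.md
git add report.md
git commit -q -m "Report v1"

# Mark this point in history, e.g. a specific report release
git tag -a v1.0 -m "First report release"
tagged_commit="$(git rev-parse 'v1.0^{commit}')"

# The branch moves on with a new commit...
echo "version 2" >> report.md
git commit -q -am "Report v2"

# ...but the tag still points at the commit it was created on
tag_now="$(git rev-parse 'v1.0^{commit}')"
main_now="$(git rev-parse main)"
echo "tag: $tag_now"
echo "main: $main_now"
```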
git pull is a combination of both commands git fetch and git merge
If you are working on your own, git pull would be ok in most cases.
However, if you are collaborating with other people who might be simultaneously working in the repo, using git pull might not be a good idea!
In that case, it is much better to use git fetch first to see what is happening in the remote repository and then to synchronize your repo by merging the changes.
git fetch downloaded the new B commit; however, our local working directory is not updated and the head of our main branch is still pointing to commit A!
We need to combine the main branch with the remote-tracking origin/main branch. How?
First we need to move into the main branch and then merge origin/main.
git checkout main
git merge origin/main
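The A and B situation can be recreated with two local clones of one stand-in remote. Everything named here is an assumption for the sketch: a bare repository plays the remote, and the clones alice and bob play two collaborators.

```shell
set -e
work="$(mktemp -d)"
git init -q --bare "$work/remote.git"        # local stand-in for the remote repo
git -C "$work/remote.git" symbolic-ref HEAD refs/heads/main

# First collaborator creates commit A and pushes it
git clone -q "$work/remote.git" "$work/alice" 2>/dev/null
cd "$work/alice"
git symbolic-ref HEAD refs/heads/main
git config user.name "Example Student"       # placeholder identity, local to this repo
git config user.email "student@example.com"
echo "A" > file.txt && git add file.txt && git commit -q -m "A"
git push -q origin main

# Second collaborator clones: local main is at commit A
git clone -q "$work/remote.git" "$work/bob"
commit_a="$(git -C "$work/bob" rev-parse main)"

# First collaborator pushes a new commit B
echo "B" >> file.txt && git commit -q -am "B"
git push -q origin main
commit_b="$(git rev-parse main)"

# Second collaborator fetches: origin/main moves to B, local main stays at A
cd "$work/bob"
git fetch -q origin
main_after_fetch="$(git rev-parse main)"
tracking_after_fetch="$(git rev-parse origin/main)"

# Merging origin/main brings the local main branch up to date
git checkout -q main
git merge -q origin/main
main_after_merge="$(git rev-parse main)"
echo "main after fetch: $main_after_fetch"
echo "main after merge: $main_after_merge"
```

Between the fetch and the merge there is a chance to inspect what changed on the remote (for example with git log main..origin/main) before integrating it.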
git remote
: lets you create, view, and delete connections to remote repositories
git branch -vv
: allows you to check the status of your local and remote branches in relation to each other.
git fetch origin
: fetches the changes from the remote origin
git branch -a
: all the branches available in the local repository + all the branches fetched from the remote.
The branches fetched from the remote origin are preceded by remotes/origin/
Public repos in GitHub make your work publicly available and therefore it is important to establish how your work should be acknowledged if someone else wants to use it.
“Public repositories on GitHub are often used to share open source software. For your repository to truly be open source, you’ll need to license it so that others are free to use, change, and distribute the software.”
The idea is to create a project-local library so that each project gets its own unique library of R packages!
In the R console:
renv::init()
to initialize a project with a project-local library
renv::snapshot()
to save the project-local library’s state
renv::restore()
to restore the project-local library’s state
You have learned version control: This works for any programming language!
It starts with you.
ETC5513 Week 10