Version Control and Repositories (video time: 52 minutes)
Before you get started with data-oriented work, it’s useful to have an organized record-keeping system. Version control, also known as revision control, was developed for exactly this purpose. The first version control system, named the Source Code Control System, was developed at Bell Labs in the 1970s. Today, there are many version control systems available. Some notable ones include:
We’re going to use GitLab, which is built on Git, partly because NPS has its own GitLab instance that’s ready for us to use, and primarily because Git is an excellent version control system.
Motivation
Imagine you’ve worked with data in the past and you started with a file called my_data.xlsx.
Have you ever had the experience of working on the file over a period of time and renaming it? Have you ever gone from my_data.xlsx to my_data-v1.xlsx to my_data-v2.xlsx to my_data-v55.xlsx? Or perhaps you’ve worked on a file with colleagues, you’ve emailed it back and forth, and renamed it from project_data.xlsx to project_data-my_name.xlsx to project_data-her_name-v27_10-18-2020.xlsx?
If you haven’t had that experience, lucky you. Those of us who have probably had good intentions but ended up with a strategy that isn’t ideal.
There are a few upsides to renaming files, most notably:
- You can track the latest version by the filename.
- You maintain earlier versions, in case you need to refer back to them.
However, there are also unintended downsides, especially as the number of versions increases:
- It’s hard to recall what’s changed from one version to another.
- You end up with lots of files that you never actually use.
- When you do refer back to old versions, you waste too much time looking for whatever you’re trying to find.
- You mess up filenames and version numbers, and start working on the wrong file.
- You accidentally overwrite one of your saved files.
- Team members aren’t all on the same page, so they lose track of versions.
We’re going to use a version control system, which provides all the upsides without any of the downsides (if used correctly).
GUI Desktop Clients
Git commands can be run on your computer through a terminal, but graphical user interface (GUI) clients provide a more user-friendly approach, especially for beginners who aren’t used to working with a command line. Some notable clients include:
We’re going to use GitHub Desktop because its interface is one of the simplest, though other clients may offer broader functionality.
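For comparison, here is a minimal sketch of what running Git commands in a terminal can look like. It assumes Git is installed. The file name my_data.csv, the user name, and the email address are made up for illustration, and a small text file stands in for the spreadsheet so the example can be created entirely from the command line.

```shell
# Work in a throwaway directory so nothing else on your computer is affected.
repo=$(mktemp -d)
cd "$repo"

# Create an empty repository and set an identity for this repository only.
git init --quiet
git config user.name "Example Student"        # hypothetical name
git config user.email "student@example.com"   # hypothetical address

# First version of the file: stage it, then record it with a message.
printf 'id,value\n1,10\n' > my_data.csv
git add my_data.csv
git commit --quiet -m "Add initial data"

# Second version: same filename, new commit instead of a renamed copy.
printf 'id,value\n1,10\n2,20\n' > my_data.csv
git add my_data.csv
git commit --quiet -m "Add second row"

# List the recorded versions, newest first.
git log --oneline
```

The last command should list two entries, one per commit, even though the folder only ever contains a single my_data.csv. A GUI client such as GitHub Desktop exposes the same kind of steps (choose changes, commit with a message, browse history) through its interface.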
MN3441 Technology for Managerial Data Analysis