Forges: Git and GitHub

“It is difficult to overstate the importance of version control. I believe that it is as important as the invention of the chalkboard and of the book for multiplying the power of people to create together.”

Foundries and Foundations

We spent the last couple of lessons defining abstract spaces, their benefits, and their limitations.

In the first lesson we saw a first space: our local machine. We generally have complete control over it, meaning that we can install software, create files and folders, and otherwise do as we please. Unlimited freedom may come at a cost. You are free to choose whether you back up your files, or not. You are free to run untrusted code, lose your files to ransomware, and hope that the scammers are true to their word after you liquidate your life savings into cryptocurrency.

In the second lesson we defined a second space: a remote server. These are shared resources: which means that we have to give up some of our absolute freedom in order to work together with others. We must agree on: what programs are installed there, how the directories get structured, how to share finite processing power, who has the authority to make changes, and how to audit and resolve conflicts.

What choice can we make between absolute freedom and shared governance? Infinite points exist between the two extremes. We can parallel these to sociological ideas about human societies and communities. The first place is the home (where each person’s home is their castle)¹ and the second place is the workplace. Third places are everything else: book stores, libraries, coffee shops, public parks, national forests, or fisheries.

If you’re building something for other people or with other people, that third space is called a forge or software forge.

Commons, Clubs, Forges, and Knowledge Work

A forge is a repository where people collaborate to create digital goods. In the early Internet: forges were individual websites or email lists. Since bandwidth—the means of transmitting bits over a wire—was limited, forges had to be augmented with existing physical and human infrastructure: such as a postal service that could physically deliver copies of software on discs from one part of the world to another.

But the story didn’t end there. Storage costs dropped, bandwidth got cheaper and faster, processor speeds compounded exponentially. You may have even read the phrase “postal service” and cringed: why would you send a file on a physical compact disc (CD) when you can download it over the network? The world went through a phase transition and the forges themselves became digital.

As of 2024, the term software forge is used synonymously with the term GitHub. This synonym is a lie:² but it’s a lie predicated on many of the same factors that led to the forges getting digitized.

No single source of truth: git, version control, and skateboards

Remember how we spent the last couple of lessons talking about Linux? Linux isn’t the only thing that Linus Torvalds invented. He also invented a little program called git.

git, or “the stupid content tracker”, was a response to challenges faced in the 1990s and early 2000s when a team of engineers distributed across planet Earth collaborated to build Linux. Software is malleable (hence the soft in software): any person who has a copy of the software’s source code can change it for better or worse. Either they know what they’re doing and they make it better, or they don’t know what they are doing and they break the code. But herein lies a question: if everyone has their own copy of the software, and everyone can make changes, which version is correct?

There did (and still does, c. 2024) exist a correct version of Linux—it’s the version that Linus Torvalds says is correct, and it’s the version that he points kernel.org at. But this leads us into more questions. Are all the past versions of Linux incorrect in some way? How do I know that I have the most up-to-date version? What if I discover a problem and fix it in my copy, how do I tell Linus about my fix? If you’re asking these questions, you’ve discovered the idea behind source control or version control.

Side note: Which skateboard is correct? 🛹

“Skateboarding” is a recent enough invention that we can draw a strong analogy between software and skateboarding if all this source control talk feels too abstract.

Skateboarding is an activity. Every skateboarder is a person who owns a skateboard, but if you go to a skate park you are not going to see every person skating the same or even using the same tools. There is a plethora of skateboard designs, equipment, and tweaks.

The “casual skater” may be satisfied with buying a board and using it however the manufacturer intended, but the “expert hobbyist skater” might not be. Becoming an expert in a craft often coincides with a desire to experiment: wanting to change what is in pursuit of what might someday be. What if I sand off the edges? What if I swap the wheels? How much grease is too much grease?

This evolutionary design among experts and hobbyists produces the skateboard—hobbyists learn from and copy one another, forming feedback loops that cause manufacturers to produce new editions based on what people want.

So which skateboard is correct? — Whichever is correct for you.

In plain speak, what are git and GitHub?

The first thing to learn is that git and GitHub are not the same things:

Git is an open source version control system—another program on our list of Linux commands. It is primarily used to track changes in source code, make backups of the code, and allow multiple programmers to work on code simultaneously.
GitHub is a website where people manage and collaborate on remote git repositories.

Indiana University has an internal GitHub called IU GitHub at https://github.iu.edu/ that is free to use for students. You can log in with your IU credentials.

Follow Along with the Instructor

Today: we’re doing all the practice steps together, so follow along with the video to practice with the instructor. Our goal is to get started with Git and GitHub—both of which will be required to do homework and projects.

Create a scratchspace repository on IU GitHub

The best way to learn is by doing.

Open https://github.iu.edu
Choose “New” (looks like a plus icon ➕)
Create a new repository called “scratchspace”
- Owner: (your username)
- Repository name: scratchspace
- Description: “Practicing with git and GitHub”
- ✅ Public
- ✅ Add a README file

Think about git as a series of snapshots of your files: at any point in time, what did your files look like?

The initial state, or initial commit of the repository might contain a single file called README.md. So the initial commit is a directory with a single file inside:

gitGraph
    commit id: "🎉 Initial commit"

But if you make changes to that README.md and make a commit, then we’ve created a new snapshot of the code:

gitGraph
    commit id: "🎉 Initial commit"
    commit id: "✨ Add a more descriptive title"

Every time we repeat this edit + commit step, we create a new node in a graph: a timeline or git history progressing from the left to the right:

gitGraph
    commit id: "🎉 Initial commit"
    commit id: "✨ Add a more descriptive title"
    commit id: "✏️ Fix typo in README"

This graph of commits—with older commits on the left and newer commits on the right—shows the entire history of a project. Every commit records what the code looked like at a point in time.

Back at the command line: First-time Git setup

We need to adjust some settings before using git.³

Replace yourUsername with your username, and run these in your shell:

git config --global user.name "yourUsername"
git config --global user.email "yourUsername@iu.edu"

Set the default branch name:

git config --global init.defaultBranch main

Set nano as the default editor when writing commit messages:

git config --global core.editor "nano"

Set a default strategy to follow when pulling changes from a remote repository:

git config --global pull.rebase false
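To double-check that settings like these were recorded, we can read them back. The following is a minimal sketch: it writes the same values to a scratch file with `git config --file` (the `scratch_config` variable and the temporary file are assumptions made for illustration), so nothing in a real `~/.gitconfig` is touched. With `--global` in place of `--file "$scratch_config"`, the same `--get` and `--list` options read your actual settings.

```shell
# Scratch config file so this sketch never touches the real ~/.gitconfig.
scratch_config="$(mktemp)"

# The same settings as above, written to the scratch file instead.
git config --file "$scratch_config" user.name "yourUsername"
git config --file "$scratch_config" user.email "yourUsername@iu.edu"
git config --file "$scratch_config" init.defaultBranch main
git config --file "$scratch_config" core.editor "nano"
git config --file "$scratch_config" pull.rebase false

# Read one setting back, then list everything that was written.
configured_name="$(git config --file "$scratch_config" --get user.name)"
echo "user.name is: $configured_name"
git config --file "$scratch_config" --list
```

If a value comes back wrong, running the matching `git config` command again overwrites it.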

Clone a git repository

Version control generalizes the directories and files we talked about previously. Instead of our files and folders being static: version control is a means of keeping track of their state over time.

Let’s clone a copy of our repository from IU GitHub:

git clone https://github.iu.edu/USERNAME/scratchspace.git

When we change into the directory, we should see it contains the same files from GitHub:

$ cd scratchspace/
$ tree .
.
└── README.md

The scratchspace we cloned is a special kind of directory called a git directory: meaning we can run git commands inside of it. How do we know it’s a git directory?

$ ls -a
.git  README.md

We know this is a git directory because there’s a special .git directory inside of it. For the purposes of this book: one should be aware of two things:

the .git directory exists
it represents the base of a repository: everything in the same folder as a .git directory is also part of a git repository

Exactly how this folder works is beyond the scope of what we plan to cover. But there are two implications:

every subdirectory in a repository is also part of that repository, which one might visualize by walking toward the root until one finds a .git directory (alternatively: finding the root—meaning that one is not in a git repository)
weird things happen if one puts a git repository inside a git repository

The takeaway is to be mindful of where one clones or otherwise creates repositories.⁴ A practice that we’ll follow is to have a common directory (e.g. i211/), and put git repositories (starter/, lecture/, project/) inside of it:

i211
├── starter
│   └── .git
├── lecture
│   └── .git
└── project
    └── .git
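If we are ever unsure which repository a directory belongs to, git can answer the question for us. The sketch below builds a throwaway copy of this layout in a temporary directory (the `workspace` variable and the `notes/week1` subdirectory are hypothetical names for illustration) and uses `git rev-parse` to find the repository root from a nested folder. It assumes the temporary directory is not itself inside another repository.

```shell
# Build a throwaway layout: a common directory holding one repository.
workspace="$(mktemp -d)"
mkdir -p "$workspace/i211/starter/notes/week1"
git init -q "$workspace/i211/starter"

# From a nested subdirectory, ask git where the repository root is.
cd "$workspace/i211/starter/notes/week1"
repo_root="$(git rev-parse --show-toplevel)"
echo "repository root: $repo_root"

# The common directory has no .git directory in it or above it
# (an assumption about where temporary directories live), so git
# reports that we are not inside a repository.
cd "$workspace/i211"
if git rev-parse --is-inside-work-tree >/dev/null 2>&1; then
    inside_repo="yes"
else
    inside_repo="no"
fi
echo "is i211/ inside a repository? $inside_repo"
```

Checking with `git rev-parse --show-toplevel` before cloning is one way to avoid accidentally putting a repository inside a repository.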

Plumbing and Porcelain 🚽

Plumbing and porcelain are two metaphors for thinking about abstractions.⁵ In an abstraction: the plumbing describes how something works on a technical level. The porcelain, by contrast, describes how something works to an end user.

Plumbing vs. porcelain is not necessarily the same as a distinction of complexity or difficulty. The ability to drive a car and the ability to repair a car are related skills, but proficiency in one does not guarantee the other. Driving is a porcelain skill: requiring one to learn how to steer, operate pedals, and actuate signals. But its porcelain nature does not trivialize the skill: driving a car in the United States requires extensive training before receiving a license.

All this to say: we focus on porcelain git. The mechanics describing exactly what the .git directory is, how git keeps track of changes, or how git communicates with remote repositories are details for another course.

Repo status and remotes

The first porcelain command we should know about is git status, which we can use to check the state of our local repository:

$ git status
On branch main
Your branch is up to date with 'origin/main'.

nothing to commit, working tree clean

Feedback from the status command leads us into some new vocabulary:

branch
origin/main
working tree

A branch is a line of development, and each commit occurs on the main branch by default. The working tree is git’s terminology for the file tree being tracked. In total: the phrase “nothing to commit, working tree clean” means that none of the files have been changed. Clean versus dirty are common metaphors when keeping track of changes: where a clean file is unchanged and a dirty file has been changed—and therefore needs to be inspected.

The origin/main is related to a concept called remotes or remote repositories. For this repository, running git remote -v shows:

$ git remote -v
origin  https://github.iu.edu/USERNAME/scratchspace.git (fetch)
origin  https://github.iu.edu/USERNAME/scratchspace.git (push)

This shows that the source of the information in this local git repository (its origin) is a remote repository on IU GitHub.
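A remote does not have to live on a website. As a sketch for experimenting without touching IU GitHub, the following uses a bare repository in a temporary directory as a stand-in remote (the `sandbox` path and the `remote.git` name are assumptions for illustration), clones it, and reads back where origin points.

```shell
# A bare repository in a temporary directory plays the role of the remote.
sandbox="$(mktemp -d)"
git init -q --bare "$sandbox/remote.git"

# Cloning an empty repository prints a warning, which is expected here.
git clone -q "$sandbox/remote.git" "$sandbox/scratchspace"
cd "$sandbox/scratchspace"

# List the remotes, then ask for the URL of origin specifically.
git remote -v
origin_url="$(git remote get-url origin)"
echo "origin points at: $origin_url"
```

Here origin is a path on the local disk rather than an https:// address, but git treats it the same way.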

The simplest git workflow: add, commit, push

Here’s our first goal: edit code on our local machine, and sync our code to GitHub. This requires three commands:

git add .
git commit -m "Message"
git push

These three commands are so common that people frequently report seeing them written out on sticky notes or taped to the side of developers’ monitors.⁶

What happens if we make a new file?

touch file1.txt

Whereas we previously saw “working tree clean”, we now see:

$ git status
On branch main
Untracked files:
  (use "git add <file>..." to include in what will be committed)
        file1.txt

nothing added to commit but untracked files present (use "git add"
to track)

Since file1.txt is new, it is untracked by default. Herein lies a key difference between version control systems like git and cloud backup systems like Dropbox, Google Drive, or Apple iCloud: just because a file is currently inside a git repository does not mean we want to track it. With git, you must opt in to files getting tracked.

The feedback does suggest we can use the git add subcommand to begin tracking this file. If we run git add .:

git add .

… then git status informs us that we’re ready to commit, and the file1.txt changes from red to green:

$ git status
On branch main
Changes to be committed:
  (use "git restore --staged <file>..." to unstage)
        new file:   file1.txt

Continuing to respond to feedback: git suggests we’re ready to commit. A commit (sometimes called a snapshot) represents the state of all our files and folders at some point in history. Every commit must have a commit message describing what the change accomplishes. Here our change is pretty simple, so we might say:

git commit -m "Add empty file1.txt"

… which gives us some immediate feedback:

[main a108582] Add empty file1.txt
 1 file changed, 0 insertions(+), 0 deletions(-)
 create mode 100644 file1.txt

… but something is different in the status (orange emphasis is ours):

$ git status
On branch main
Your branch is ahead of 'origin/main' by 1 commit.
  (use "git push" to publish your local commits)

nothing to commit, working tree clean

We said earlier that origin/main represents the version of our code on GitHub. Just as we had to opt in to tracking files, we also have to opt in to synchronizing our code with the remote repository on GitHub. This continues the trend where git does not perform any actions until it is commanded to.

$ git push
Enumerating objects: 3, done.
Counting objects: 100% (3/3), done.
Delta compression using up to 4 threads
Compressing objects: 100% (2/2), done.
Writing objects: 100% (2/2), 222 bytes | 74.00 KiB/s, done.
Total 2 (delta 1), reused 0 (delta 0), pack-reused 0
remote: Resolving deltas: 100% (1/1), completed with 1 local object.
To https://github.iu.edu/hayesall/scratchspace.git
   66e5a29..ab2ef11  main -> main

Finally, we’ve made it full circle and the status of our repository is back to a “working tree clean” state:

$ git status
On branch main
nothing to commit, working tree clean
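To rehearse the whole loop without touching IU GitHub, here is a minimal sketch that runs add, commit, and push against a stand-in remote. It assumes a bare repository in a temporary directory plays the role of GitHub (the `sandbox` and `remote.git` names are made up for illustration), and it uses `git ls-remote` to show what the remote knows before and after the push.

```shell
# Stand-in remote plus a local clone, both in a temporary directory.
sandbox="$(mktemp -d)"
git init -q --bare "$sandbox/remote.git"
git clone -q "$sandbox/remote.git" "$sandbox/scratchspace"
cd "$sandbox/scratchspace"

# Identity for this sandbox repository only (no --global).
git config user.name "yourUsername"
git config user.email "yourUsername@iu.edu"

# The three-step loop: stage, commit, push.
touch file1.txt
git add .
git commit -q -m "Add empty file1.txt"

local_commit="$(git rev-parse HEAD)"
remote_before="$(git ls-remote --heads origin)"   # empty: nothing pushed yet

git push -q origin HEAD                           # publish the current branch
remote_after="$(git ls-remote --heads origin)"    # now lists our commit

echo "local commit:  $local_commit"
echo "remote before: ${remote_before:-<no branches>}"
echo "remote after:  $remote_after"
```

The commit exists locally as soon as `git commit` finishes, but the remote only learns about it after `git push`.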

Staging: analyzing the core git loop

The first three subcommands: add, commit, and push are verbs. They are actions performed on files. If we instead take a file-centric view, we could give names for where our changes go each time we run a command. We hinted at the existence of three places: the working tree, the staging area, and the local git database. Each git subcommand relates a file to one of these locations:

graph LR
    A[Working Tree] -->|git add| B[Staging Area];
    B -->|git commit| C[Local Git Database];
    C -->|modify files| A;

Let’s add three more files and reason through how git commands affect the files and where they are in the loop. Assume that we start from a clean working tree similar to where we ended in the previous section:

$ touch file{2,3,4}.txt
$ ls
file1.txt  file2.txt  file3.txt  file4.txt  README.md

Test Yourself: What does status show? Hint: three files, red or green?

$ git status
On branch main
Your branch is up to date with 'origin/main'.

Untracked files:
  (use "git add <file>..." to include in what will be committed)
        file2.txt
        file3.txt
        file4.txt

nothing added to commit but untracked files present (use "git add"
to track)

Let’s add file2.txt to the staging area:

$ git add file2.txt

Test Yourself: What does status show? Hint: where does red become green?

$ git status
On branch main
Your branch is up to date with 'origin/main'.

Changes to be committed:
  (use "git restore --staged <file>..." to unstage)
        new file:   file2.txt

Untracked files:
  (use "git add <file>..." to include in what will be committed)
        file3.txt
        file4.txt

These highlight how the staging area is a place to assemble software before one permanently commits the changes to the project history. If one were building a rocket: it’s better to test and assemble pieces individually before moving the final product outside to the launch pad.

Git operates on the same principle: not everything that goes into building software needs to be permanently tracked. Software development frequently requires nonlinear turns to get correct, and could even pollute the working directory with irrelevant files which only exist to test out a specific idea. Therefore the slow, methodical approach should give one time to consider what changes are relevant and what changes are not.

If we also add file3.txt: two files will be in our staging area, with one being outside the staging area.

$ git add file3.txt
$ git status
On branch main
Your branch is up to date with 'origin/main'.

Changes to be committed:
  (use "git restore --staged <file>..." to unstage)
        new file:   file2.txt
        new file:   file3.txt

Untracked files:
  (use "git add <file>..." to include in what will be committed)
        file4.txt
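When color is not available to tell red from green, `git status --short` gives a compact view of the same information. The sketch below recreates this situation in a throwaway repository (the `sandbox` directory is an assumption for illustration): a leading `A` marks a new file in the staging area, and `??` marks an untracked file.

```shell
# Throwaway repository so we do not disturb scratchspace.
sandbox="$(mktemp -d)"
git init -q "$sandbox/scratchspace"
cd "$sandbox/scratchspace"

# Three new files, two of them staged.
touch file2.txt file3.txt file4.txt
git add file2.txt file3.txt

# One line per file: a two-character state code, then the file name.
short_status="$(git status --short)"
echo "$short_status"
# A  file2.txt
# A  file3.txt
# ?? file4.txt
```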

In git terminology, an upload is a push, a download is a pull, and all changes are local until they are synchronized with a remote. This completes our state machine for the concepts so far:

graph LR
    A[Working Tree] -->|git add| B[Staging Area];
    B -->|git commit| C[Local Git Database];
    C -->|modify files| A;
    C -->|git push| D[Remote Repository];
    D -->|git pull| C;

Versioning file content

Continuing the previous example, we have a version-controlled directory, and its state as of the most-recent commit is a directory with four files:

$ lt some-file-tree
some-file-tree
├── file2.txt
├── file3.txt
├── file4.txt
└── README.md

Let’s add some text to README.md. If the following listing looks too mysterious, you can achieve the same result using nano.⁷

echo '# Hello World!' > README.md

Previously: every file was blank. Now: we have a README.md with something new. We can query git to find out what the differences are with git diff, which shows us:

$ git diff
diff --git a/README.md b/README.md
index e69de29..cc0be1e 100644
--- a/README.md
+++ b/README.md
@@ -0,0 +1 @@
+# Hello World!

The same plus symbol + from earlier returns here: but what was previously a conventional way to represent changes is now something we can see and interact with.

We haven’t staged or committed our changes yet, so let’s compare what happens when we make even more changes:

echo -e '\nPractice makes perfect.' >> README.md

… then git will tell us we’ve made three additions:

$ git diff
diff --git a/README.md b/README.md
index e69de29..d9b5e51 100644
--- a/README.md
+++ b/README.md
@@ -0,0 +1,3 @@
+# Hello World!
+
+Practice makes perfect.

This feels like a good time to make a commit and bring ourselves back to a clean working tree state.

$ git add README.md
$ git commit -m "📝 Add starter notes"
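One detail worth practicing: plain `git diff` compares the working tree against the staging area, so it goes quiet once a change has been staged. The following sketch, run in a throwaway repository (the `sandbox` directory and the variable names are assumptions for illustration), uses `git diff --staged` to look at what is waiting in the staging area instead.

```shell
# Throwaway repository with one committed, empty README.md.
sandbox="$(mktemp -d)"
git init -q "$sandbox/scratchspace"
cd "$sandbox/scratchspace"
git config user.name "yourUsername"
git config user.email "yourUsername@iu.edu"
touch README.md
git add README.md
git commit -q -m "Initial commit"

# Change the file, but do not stage it yet.
echo '# Hello World!' > README.md
unstaged_before="$(git diff --name-only)"         # README.md

# Stage the change and compare the two views.
git add README.md
unstaged_after="$(git diff --name-only)"          # empty
staged_after="$(git diff --staged --name-only)"   # README.md

echo "unstaged before add: $unstaged_before"
echo "unstaged after add:  ${unstaged_after:-<nothing>}"
echo "staged after add:    $staged_after"
```

Dropping `--name-only` shows the full line-by-line diff in either case.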

What makes a good commit? 🤔

There isn’t a universal rule satisfying this question. But the inverse is easy: what makes a commit message bad? Imagine you read someone’s commits to find:

commit

commit

it works

done

… did you infer what these did?

Good commits are discrete units of work. Good commit messages tend to start with verbs, and good commits tend to be atomic: small, but also difficult to divide.

Compare the previous four commits with something like these:

Add add_user function

Set an index on user email addresses

Test for valid vs. invalid usernames

Document limitations of email validation

Which set do you prefer?

Descriptive commits that make small, incremental changes tend to be easier to understand than poorly-worded commits that make huge, sweeping changes. Even if you’re working alone and no other human being will ever see your commits, you will eventually have to face your past self. Was your past self helpful? Did your past self write helpful commit messages, or did they leave you a trail of hieroglyphics to decipher?
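Commit messages are what we see when we look back through a project's history, so it helps to know how to read them. As a sketch, the following records four commits in a throwaway repository and prints only their messages with `git log`. The `sandbox` directory is an assumption for illustration, and `--allow-empty` is used here only so the example can focus on messages without creating files.

```shell
# Throwaway repository with an identity set for this sandbox only.
sandbox="$(mktemp -d)"
git init -q "$sandbox/scratchspace"
cd "$sandbox/scratchspace"
git config user.name "yourUsername"
git config user.email "yourUsername@iu.edu"

# --allow-empty records a commit even though no files changed.
git commit -q --allow-empty -m "Add add_user function"
git commit -q --allow-empty -m "Set an index on user email addresses"
git commit -q --allow-empty -m "Test for valid vs. invalid usernames"
git commit -q --allow-empty -m "Document limitations of email validation"

# Print one message per line, newest commit first.
history="$(git log --format=%s)"
echo "$history"
```

Running `git log --oneline` gives a similar view with a short commit identifier in front of each message.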

Releasing Software

Version control so far has been a behind-the-scenes tool. Why should any end user (someone who interacts with an app we build) care about whether we’re following a version control approach at all? Releases (or release versions) are something we will accomplish with tags, following a semantic versioning approach.

You may already have seen semantic versioning without realizing it or without knowing what it meant. Semantic versioning (as defined in the Semantic Versioning 2.0.0 specification) is a simple X.Y.Z numbering approach for communicating version compatibility to users.

Project milestones might correspond to things that end users want: usually features and bug fixes. Each of these can be a commit.

Every release is described by three numbers: MAJOR.MINOR.PATCH.

MAJOR represents major changes, often those that are backwards-incompatible with whatever versions preceded it
MINOR represents features that were added in a backwards-compatible way
PATCH represents bug fixes
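These three rules can be turned into a small check. The sketch below defines a hypothetical shell function, `satisfies_minimum`, that decides whether an installed version should be acceptable when a minimum version is required. It assumes plain MAJOR.MINOR.PATCH strings with no `v` prefix or pre-release suffix, and it is an illustration of the rules rather than a complete implementation of the specification.

```shell
# Hypothetical helper: does the installed version satisfy a required minimum?
# Usage: satisfies_minimum REQUIRED INSTALLED
satisfies_minimum() {
    required="$1"
    installed="$2"

    # Split each version on the dots into three numbers.
    IFS=. read -r req_major req_minor req_patch <<EOF
$required
EOF
    IFS=. read -r ins_major ins_minor ins_patch <<EOF
$installed
EOF

    # A different MAJOR signals backwards-incompatible changes.
    [ "$ins_major" -eq "$req_major" ] || return 1
    # Same MAJOR: a newer MINOR only adds features, an older one lacks them.
    [ "$ins_minor" -gt "$req_minor" ] && return 0
    [ "$ins_minor" -lt "$req_minor" ] && return 1
    # Same MINOR: the installed PATCH must be at least the required one.
    [ "$ins_patch" -ge "$req_patch" ]
}

if satisfies_minimum "2.3.0" "2.4.1"; then
    echo "2.4.1 satisfies a minimum of 2.3.0"
fi
if ! satisfies_minimum "2.3.0" "3.0.0"; then
    echo "3.0.0 may break code written against 2.x"
fi
```

Treating a newer MAJOR as incompatible is the cautious reading of the rules: it might still work, but the version number no longer promises that it will.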

Now that you know what these three numbers mean: test your knowledge on the following:

A piece of software requires a minimum version of v1.1.0. You have v1.2.0 installed. Is your installed version compatible with the requirements?
You have v1.5.1 installed. Should it generally be safe to upgrade to v1.5.8?
Alexander has Python v3.11.2 installed. Your friend has Python v3.9.0 installed. Would you expect a Python program that works for your friend to work for Alexander? Why or why not?

With git, a release is created by tagging a commit with git tag. As an example, creating release v1.0.0:

git tag -a v1.0.0 -m "Version 1.0.0 Release"
git push origin v1.0.0
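To confirm that a tag exists locally and has reached the remote, we can list tags on both sides. The following sketch does this in a sandbox, assuming a bare repository in a temporary directory as a stand-in remote (the `sandbox` and `remote.git` names are made up for illustration) and an empty first commit so that there is something to tag.

```shell
# Stand-in remote and a local clone in a temporary directory.
sandbox="$(mktemp -d)"
git init -q --bare "$sandbox/remote.git"
git clone -q "$sandbox/remote.git" "$sandbox/scratchspace"
cd "$sandbox/scratchspace"
git config user.name "yourUsername"
git config user.email "yourUsername@iu.edu"

# A tag needs a commit to point at.
git commit -q --allow-empty -m "Initial commit"
git push -q origin HEAD

# Create the annotated tag, then publish it.
git tag -a v1.0.0 -m "Version 1.0.0 Release"
git push -q origin v1.0.0

# Tags known locally, and tags known to the remote.
local_tags="$(git tag --list)"
remote_tags="$(git ls-remote --tags origin)"
echo "local tags:  $local_tags"
echo "remote tags:"
echo "$remote_tags"
```

Tags are not sent along with an ordinary `git push`, which is why the tag name is pushed explicitly.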

So great: we have a version-controlled directory on our local machine, and it contains every noteworthy change that we’ve ever made. But remember how we started off this whole discussion with lofty ideas about sharing ideas, using a forge, and communicating with others toward the betterment of the commons? This database exists on our local machine, but we haven’t explored a means of uploading or downloading these versions.

The more-complete way is to think of these as a finite state machine. Using git subcommands will move a file between the locations:

graph LR
    A[Working Tree] -->|add| B[Staging Area];
    B -->|commit| C[Local Git Database];
    C -->|push| D[Remote Repository];
    D -->|pull| C;
    D -->|clone| A;

TL;DR git terminology

Version control with git is a deep topic. Since we assume you’re getting started with git, we want you to be comfortable with a core set of terms and operations. The following are a sufficient set of terms and commands to get you started in a single-developer git workflow using tagged releases. When you work on a team: you’ll want to be comfortable with the git branching model, and merge versus rebase strategies. If you go deeper into operations (GitOps or DevOps), you’ll want a working knowledge of the plumbing-porcelain dichotomy. Right now: practice your fundamentals, and layer in more complexity when you are ready.


staging area
commit
remote repository
`.git` directory
`.gitignore`
`git status`
`git diff`
`git add [file]`
`git commit -m [message]`
`git remote -v`
`git pull`
`git push`
`git clone [url]`
`git tag -a [version] -m [message]`

TL;DR What is our workflow?

Before we even begin, we must ask: where do we want to work today?

$ cd to/an/i211/repository

Make sure that our repository is up-to-date.

$ git pull
Already up to date.
This means our local and remote repositories are in sync with each other, and we’re ready to start working.

Open in Visual Studio Code

$ code .
Earlier: we said that the dot represents the current folder. So code . opens the current folder in Visual Studio Code.

Edit, stage, and make commits as you accomplish tasks

git add [file-name]       # stage
git commit -m "[message]" # commit
git push
Should we git push every time? Maybe! Pushing effectively “backs up” your code to a remote location, so committing and pushing frequently means we’re less likely to suffer a data loss.

Once we’re confident in our code and we’ve reached a major milestone, we’ll tag the commit with a version number. For example: create v1.0.0 and push the release to GitHub:

git tag -a v1.0.0 -m "Version 1.0.0 Release"
git push origin v1.0.0

Conclusion: a distributed, asynchronous, multi-user model of collaboration

Version control in general, and git in particular, are tools that help to answer the questions that we set out with at the beginning of this lesson.

Are all the past versions of the software incorrect in some way? — yes, but we have them in the history if we need to refer back to them.

How do we know we have the most up-to-date copy of something? — we pull.

I fixed a bug in my copy, how do I tell everyone about it? — we commit and push. Technically: we push our copy into a public branch and open a pull request (or merge request) linking back to a maintainer—but that’s a detail we’ll have to explore at some other time. The key idea is that a version control system (VCS) defines a set of primitive operations. In concert, these primitive operations may be combined into a protocol that groups of people use to communicate with one another. The “single developer tagged release workflow” that we described here is one workflow out of many that you may see out there in “the real world”.

flowchart LR

  subgraph "Person 2"
    direction LR
    G[Local Git Database];
  end

  subgraph "Person 1"
    direction LR
    C[Local Git Database];
  end

  C -->|push| D[Remote Repository];
  D -->|pull| C;
  G -->|push| D;
  D -->|pull| G;
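The diagram can be acted out on a single machine. The sketch below assumes a bare repository in a temporary directory stands in for the remote, and two clones named `person1` and `person2` stand in for two people (all names are hypothetical). Person 1 pushes a fix, and Person 2 receives it with a pull. The `--ff-only` option is used so that the pull only fast-forwards and never needs a merge decision.

```shell
# One shared remote, in a temporary directory.
sandbox="$(mktemp -d)"
git init -q --bare "$sandbox/remote.git"

# Person 1 clones, makes the first commit, and pushes it.
git clone -q "$sandbox/remote.git" "$sandbox/person1"
cd "$sandbox/person1"
git config user.name "Person One"
git config user.email "person1@example.com"
echo "# scratchspcae" > README.md
git add README.md
git commit -q -m "Add README"
git push -q origin HEAD

# Person 2 clones and receives everything pushed so far.
git clone -q "$sandbox/remote.git" "$sandbox/person2"

# Person 1 fixes a typo and pushes again.
cd "$sandbox/person1"
echo "# scratchspace" > README.md
git add README.md
git commit -q -m "Fix typo in README"
git push -q origin HEAD

# Person 2 pulls to catch up with the remote.
cd "$sandbox/person2"
git pull -q --ff-only
git log --format='%an: %s'
```

Neither person ever talks to the other's machine directly: every exchange goes through the remote repository.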

We’ve established the three spaces: our home, our workplace, and the commons. In the following lessons, we’ll orchestrate the three into a common workflow: where we write code (Python) on our local machine, push it into a remote version control system (GitHub), and deploy those changes onto a public server that anyone in the world may interact with (Linux).

How much git should I learn?

Git has over a hundred subcommands to handle almost any asynchronous collaboration workflow. This large surface area makes git one of the most-complex programming tools that isn’t itself a programming language.

If we had more time in this class, we would spend time on branching and merging in a git feature branching workflow. But i211 does not spend time in group projects, so features used for multi-developer workflows would have little utility here. However, one of the key reasons to use git and GitHub is collaboration, so one should pursue collaboration features once one is comfortable with single-user flows.

As a data-driven approach toward which commands or which parts of git to explore: here are Alexander’s Top-30 git subcommands based on frequency of use:

git add
git status
git commit
git push
git switch
git clone
git checkout
git merge
git branch
git log
git remote
git rm
git diff
git mv
git restore
git pull
git reset
git cat-file
git show
git rev-list
git stash
git grep
git submodule
git revert
git cherry-pick
git fetch
git rebase
git blame
git tag

Footnotes

¹

Semayne’s Case (1604-01-01) 5 Coke Rep. 91. See also: Steve Sheppard (editor), “The Selected Writings of Sir Edward Coke” (2005), Liberty Fund, Inc. Carmel, IN 46032-4564, USA.

²

Every course is limited in what it can cover and what it cannot cover. GitHub is a private company that produces a closed-source forge on an internal version of its own proprietary forge. Despite this sounding like the setup of a logical paradox, GitHub is the largest software forge, and from my (Alexander’s) experience it’s the one that people have heard of even when they know nothing about software. Some other forges in no particular order: GitLab, Bitbucket, SourceForge, Gitea, soft-serve, Kernel.org, Savannah.

³

Covering the choice of settings we show here is a bit more technical than I (Alexander) want to get into by default. Some are mundane: git config user.email is a matter of bookkeeping, and configures git with the email address that it should write to its internal database. Others, like git config pull.rebase, exist for historical reasons and because different developers use git in different ways; since its initial release on 2005-04-07 the tool overall has stayed aggressively backwards-compatible to keep the entire ecosystem from fracturing. One is practical: vim is the default text editor in many Linux environments, but nano is beginner-friendly and less prone to causing panic if someone forgets to type a message. Still others, like git config init.defaultBranch main, are social: after George Floyd’s murder many developers re-evaluated earlier naming choices. Master is a word with historical use in trade skills: the master copy, a masterpiece, a master of science degree. But equally present is the word master in a master-slave exploitationship. Devoid of context: which one do you read?

⁴

There do exist situations where developers put repositories-inside-repositories: submodules. The git submodule command is outside what we cover, but it provides one way to represent one repository depending on another. Pro Git Chapter 7 covers aspects of this problem.

⁵

Scott Chacon and Ben Straub (2014), “Pro Git: Second Edition”. Chapter 10.1 Git Internals - Plumbing and Porcelain. Online: https://git-scm.com/book/en/v2/Git-Internals-Plumbing-and-Porcelain

⁶

Rachel M. Carmena (2018), “How to teach Git”

⁷

The command: echo '# Hello World!' > README.md does several things. echo behaves like a print() statement in other programming languages: it repeats whatever is sent into it. The greater-than sign > is a standard output (STDOUT) redirect, which sends the output of one command somewhere else. In this case, the combination of these can be thought of as sending data into a file.

An Introduction to Information Infrastructure II