Deep Git: What’s all this then about switch, submodules and git lfs object not found

Well, it’s been a long time, but git has got to be amongst the most mysterious of all software tools. I’ve spent the last five years going through tutorials and most of them are obsolete. The git command is undergoing really significant improvement and lots of the confusing things are because the base tutorials get you completely on the wrong path. Kind of like unlearning the “snowplow or pizza” when skiing. It is good, but then you spend the rest of the time unlearning the trick.

Tl;dr

When you learn git, the typical sequence in most tutorials looks like this:

git clone git@github.com:richtong/src
cd src
git checkout -b rich-dev
git pull
# make some changes (the first push will need a -u)
git checkout main
# Get the changes that may have happened remotely from other devs
git pull
git checkout rich-dev
git merge main

Well it turns out that most of these commands are actually conglomerations of others and can get you in real trouble from rebase errors to conflicts, so here is a better sequence that has emerged in the last two years as the great folks at the git project have made things clearer:

# Deals with potential Windows users
git config --global core.autocrlf input
git clone --recursive --remote-submodules git@github.com:richtong/src
cd src
git lfs install
git fetch -p --all
git switch -c rich-dev
# make your changes
git fetch -p --all
git switch main
git pull --rebase
git push
git switch rich-dev
git rebase main
git rebase -i main
# In the editor, squash commits as necessary by changing 'pick' to 's'

The next sections explain why you need this, but the short answer is that when you clone a new repo, this sequence is safer and handles the optional features (Git Large File Storage and submodules) as well as cross-platform developers.

Windows and MacOS/Linux file differences

Ok, pretty soon you will end up having to work with folks who are on Windows machines (or maybe you are one yourself). The biggest problem you will run into is what counts as the end of a line of text. For reasons in the distant past, Windows/DOS ends each line with two characters, a carriage return and then a linefeed, that is \r\n. This is called CR/LF. But Linux/Unix/MacOS files end each line with just a line feed, or \n.

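As with all things Git, this is handled in a bunch of different ways. The one used in the sequence above is core.autocrlf input, which converts CR/LF to LF when you commit and leaves files alone when you check them out.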
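Here is a minimal sketch of what that setting does, assuming a throwaway repo in a temp directory. The file name windows.txt and the demo identity are made up for illustration, and the setting is applied per-repo here rather than with --global so the demo does not touch your real config.

```shell
set -eu
repo=$(mktemp -d)                         # throwaway repository for the demo
cd "$repo"
git init -q .
git config user.name "Demo User"          # hypothetical identity so the commit works anywhere
git config user.email "demo@example.com"
git config core.autocrlf input            # convert CRLF to LF on commit, never the other way

printf 'line one\r\nline two\r\n' > windows.txt   # a file with Windows line endings
git add windows.txt 2>/dev/null                    # git warns that CRLF will be replaced by LF
git commit -q -m "Add a file that was saved on Windows"

# Count carriage returns in the working copy and in the committed blob
cr_in_worktree=$(tr -cd '\r' < windows.txt | wc -c | tr -d ' ')
cr_in_commit=$(git cat-file -p HEAD:windows.txt | tr -cd '\r' | wc -c | tr -d ' ')
echo "working copy: $cr_in_worktree CRs, committed: $cr_in_commit CRs"
```

The working copy keeps its carriage returns until the file is checked out again, but what lands in the repository is LF only, which is what your Mac and Linux teammates will see.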

What is all this then about -- in git commands?

This one really confused me for a long time. For instance, you see two dashes used on their own in git quite a bit:

git restore -- hello.c

Well, it turns out this is a convention in complex Unix commands which means stop looking for flags (aka options); everything after it is a parameter (aka a file). So most of the time it doesn’t matter, but suppose that for some reason you create a file named --newfile.c. The problem is that git cannot distinguish this from a flag, so you need to use the double dash so git knows you are naming a file:

git restore -- --newfile.c

There’s a similar problem with strangely named revisions, so with git 2.30 (Q1 2021) git rev-parse gained one more way to delimit things. There are actually three sets of things you can feed git: options, revisions and then paths.

# what if the revision looks like an option, "--local-env" for instance
git rev-parse "$rev" -- "$path"
# so this is safe
git rev-parse --end-of-options "$rev" -- "$path"
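To see the double dash earn its keep, here is a small sketch, assuming a throwaway repo in a temp directory with a demo identity (both made up for illustration). It creates the awkwardly named file from the example, commits it, damages it, and restores it.

```shell
set -eu
repo=$(mktemp -d); cd "$repo"             # throwaway repository for the demo
git init -q .
git config user.name "Demo User"          # hypothetical identity
git config user.email "demo@example.com"

printf 'original\n' > ./--newfile.c       # a file whose name looks like an option
git add -- --newfile.c                    # "--" marks everything after it as a path
git commit -q -m "Add oddly named file"

printf 'oops\n' > ./--newfile.c           # damage the file
git restore -- --newfile.c                # without "--" git would try to parse an option
restored=$(cat ./--newfile.c)
echo "$restored"
```

Drop the double dash from either git command and git stops with an unknown option error instead of touching the file.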

Why can’t I just git pull and git push? It’s all about conflicts baby

Well, this is a case where those demonstrations are just too simple for the real world. Here are two examples of what can happen:

If you do a git pull and you have changes and the other branch has changes made by others, you will create a merge commit. That means you have a commit that is “noise”. It just says we merged two commit points together.
For most projects, what you really want to do is a rebase and then a squash. This sounds very strange, but the key idea is that every time you commit to git, you create a unique record of *every* file in the repository. This is called a commit point and has a unique identifier called the SHA (this is actually a generated hash, a gigantic string of numbers and letters that is, for practical purposes, globally unique).
The big concept is the idea of branches; these are strings of commit points going from the oldest to the newest changes.
There is a default branch which is the released version, the thing that, hopefully, works. In the old days this was called the master branch, but now it is called main by convention.
The second big concept is that you have the files on your local machine and then there are files that are remote, that is on the server. The most typical server is https://github.com but it could be gitlab.com or even a private server in your company. By convention, the name of that remote place where you put things is origin but you can actually have multiple remotes. A common one is upstream but that’s covered later when we talk about forking.
So when you make changes, you check out a new branch starting from main and make your changes. Every so often, to save your work, you will create a new commit with git commit and give it a message. By convention, branch names are typically your name and then something that explains what you are working on, so for instance rich-install-fix might mean that the user rich is working on fixing installation.

The net of all this is that you typically have to worry about keeping four different branches in sync:

Local main. This is what is running on your local machine and was copied from a remote sometime ago. You have to manually sync this and that is what all this git push and pull nonsense is about.
Remote main, typically origin. This is the “truth”. That is for the project with many developers, this is what the latest version is.
Local development branch. This is the thing that you are working on.
Remote development branch. There is a copy of this on the server too.

The main point is that you can run for a long time locally and never push anything up to the remote location. You have a local copy of everything and this allows really great async development.

But what happens when you want to save your changes up to the remote? Well, the tutorials tell you to just git pull to copy the remote stuff down to your local and then git push to copy your local stuff to the remote site.

That all sounds great, but in reality there are a couple of problems:

Since literally hundreds of developers can be working, there could be a zillion commits that the remote is ahead of your main.
Also since you’ve been working like crazy, these zillion commits by others can conflict with the changes you have made.

The typical git pull and git push deals with this in a simple way. If for instance, you do a git checkout main && git pull if the main branch at the top is

Now the second thing is that git pull is really short for two things: “git fetch”, which means download all the new commits from the remote branch, and then “git merge”. This last thing is to be avoided at all costs. It just creates a new commit point that is the merge. Instead, you need to do a git rebase. What this does is a little subtle: it takes all the remote changes first and puts all your local changes “on top of it”, so in essence the new base of your changes is all the changes in remote.

The nice thing about this is that you will never get a merge commit. The bad thing is that you can get rebase conflicts. That is because if you change, say, a line in README.md, this might also have been changed by the new changes in the other branch. So you go through a cycle where git will say both sides have changed the file and enter a mode where you get to pick which version to keep: the current HEAD, that is the new base, or the change you are putting in.

This is also why people say “rebase early and often”, you do not want to be doing this for 100 commits. If you are 100 commits different from the new base, then you have a potential rebase conflict for every one of them.

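Now the most subtle thing is that you don’t just want to do “git fetch && git rebase”, because git pull --rebase is actually more sophisticated than that: it tries to figure out which commits are really yours and which ones came from an earlier fetch, which matters when the upstream branch itself has been rebased since you last fetched.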
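To make the "on top of it" idea concrete, here is a minimal local sketch, assuming a throwaway repo in a temp directory. The file names and the demo identity are made up, and a second commit on main stands in for the commits other developers pushed while you were working.

```shell
set -eu
repo=$(mktemp -d); cd "$repo"             # throwaway repository for the demo
git init -q .
git symbolic-ref HEAD refs/heads/main     # name the first branch main on any git version
git config user.name "Demo User"          # hypothetical identity
git config user.email "demo@example.com"

echo base > base.txt;     git add base.txt;   git commit -q -m "Base commit"
git switch -q -c rich-dev
echo mine > mine.txt;     git add mine.txt;   git commit -q -m "My change"
git switch -q main
echo theirs > theirs.txt; git add theirs.txt; git commit -q -m "Someone else's change"

git switch -q rich-dev
git rebase -q main                        # replay "My change" on top of the new main

merges=$(git rev-list --merges --count main..rich-dev)   # merge commits on rich-dev
ahead=$(git rev-list --count main..rich-dev)              # commits rich-dev has beyond main
git log --oneline
```

After the rebase the history is a straight line: the base commit, then the other person's change, then yours, with no merge commit anywhere.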

But why do I do a git switch and not a git checkout

Git switch was just recently invented because it takes apart git checkout and solves a common problem. As an example, you are on the main branch and you start typing away merrily.

Oops, now you realize that those changes should happen on your rich-dev branch. Oh no, now when you do a git checkout, it can complain that your local changes would be overwritten by the checkout. Worse, the very same git checkout command, pointed at a file instead of a branch, really will overwrite your edits without asking.

So, in 2019 (Git 2.23), git checkout was separated into two commands: git switch, which changes the branch but does *not* do a git restore, and git restore, which only restores files, so you won’t have to worry about one command doing the wrong thing. The only small complexity is that creating a new branch with git checkout -b rich-new is now git switch -c rich-new, where -c is for create. And that itself is really short for git branch rich-new && git switch rich-new. The other use of git checkout was git checkout hello.c, for when you mess up hello.c and want to recover it; now the easy way to do this is with git restore hello.c.

Most of the time you will use git switch and then git restore to get files. The only use left for git checkout is then to checkout specific commit points like git checkout 237fb to get the commit point that starts there.
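Here is a small sketch of the "oops, I was typing on main" situation, assuming a throwaway repo in a temp directory with a demo identity (both made up for illustration). It shows that git switch -c carries the uncommitted edit over to the new branch, and that git restore is the separate, explicit step that throws an edit away.

```shell
set -eu
repo=$(mktemp -d); cd "$repo"             # throwaway repository for the demo
git init -q .
git symbolic-ref HEAD refs/heads/main     # name the first branch main on any git version
git config user.name "Demo User"          # hypothetical identity
git config user.email "demo@example.com"

printf 'int main(void) { return 0; }\n' > hello.c
git add hello.c
git commit -q -m "Add hello.c"

printf '/* work in progress */\n' >> hello.c   # edit made on main by mistake
git switch -q -c rich-dev                       # new branch; the uncommitted edit comes along
branch=$(git branch --show-current)
dirty=$(git status --porcelain)                 # " M hello.c" means the edit survived

git restore hello.c                             # discard the edit, touching only this file
clean=$(git status --porcelain)
echo "on $branch, before restore: '$dirty', after restore: '$clean'"
```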

What’s all this then about Submodules

I finally think I understand submodules decently. The main issue is that except for toy repos, nearly every project I’ve used eventually has to keep different bodies of code in sync. A simple example is some shared libraries or shell scripts. It sure would be nice to share them across different repositories. The old way has been to compile them or to use a package manager or more often, just copy them, but then updates get out of sync. In fact, for my personal work, I have bin, lib, and docker images that I use as submodules across different organizations.

Net, net, Submodules are a way to keep multiple repositories in sync. There are others, but since git has native support for it, we mainly use this to manage things. Here are some things that are puzzling and confusing. But the normal tutorial sequence is:

# create a submodule in the current working directory
git submodule add git@github.com:richtong/bin
# list all the submodules
git submodule

By default this puts the new repo into the current working directory. Many repos have submodules scattered all over the place, so it is a mess and hard to figure out what is where.
The native submodule has no idea about branches. So when you do a git submodule add, the parent repo records one specific commit point of the submodule (the latest one at that moment). This makes sense when you are working with “foreign” submodules where you don’t do any development on them and just want a known commit, but doesn’t work as well if you are developing these submodules.
The result is that after a git submodule update or a recursive clone you are in a strange state called “detached HEAD” mode. This basically means that while you can make a change, any commit you make is not on a branch, so it is easy to lose and awkward to push.
The other thing that is confusing is that when you git clone a repo with submodules, you do *not* by default get any of the contents of the submodules. They are naked. Again, this makes sense, since if you just want to look at a repo, there might be gigabytes of other things pulled in, so it just gets you that repo. You have to do something special to get all the data.
The last thing is that you will just get the single remote, which is the usual “origin”, so if the submodule is actually a fork then you can’t pull from the upstream repo either.

So what if you are developing in those with submodules

The net is that you have to be more disciplined with all this, so here are some recommendations, but you will have much less trouble if you:


# set the default branch
git submodule add -b main git@github.com:richtong/bin
# for example the clone of a mono repo called richtong/src
git clone --recursive --remote-submodules git@github.com:richtong/src
cd src
# when you are ready to check in all the fixes make sure the branches are right and everything is checked in
git checkout main
git submodule update --remote --merge
git push

Be disciplined about where you put submodules. Normally what we do is to have a single “mono-repo” and then put all the sub repos into a single ‘./extern’ directory so you know where they all live.
Make sure you install git lfs because you don’t know which modules require it. With the Mac and Homebrew this is pretty simple: ‘brew install git-lfs; git lfs install’ cranks it all up. Then any repos that need it will have it work automatically.
When you create a submodule, use git submodule add -b, which adds a branch = entry to .gitmodules. You can also add it after the fact by editing .gitmodules.
When you clone a new repository, instead of the normal “git clone”, you should do “git clone --recursive --remote-submodules”. What this says is to recursively find all the submodules and then pass the “--remote” flag, which checks out every submodule at the tip of its tracked branch (the default branch, typically main) rather than the commit recorded in the parent. Then when you are updating them, you don’t have to pull or know the default branch.
Now the second thing is that when you are working with a bunch of submodules and you finally want to check them into “main” of the mono repo, they will be all over the place, so there is a way to force each to its default branch. We are right now in a transition where the default branch used to be master, but for new repos it is now main. Either way, running git submodule update --remote --merge will ensure that everything is brought up to date against each submodule’s tracked branch.
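The first and third recommendations can be sketched locally, assuming a throwaway temp directory in which a local repo called bin stands in for git@github.com:richtong/bin and src stands in for the mono repo. The identity and file names are made up, and protocol.file.allow=always is only there because newer versions of git refuse to clone submodules from a local path without it.

```shell
set -eu
work=$(mktemp -d); cd "$work"                    # throwaway directory for the demo
export GIT_AUTHOR_NAME="Demo User" GIT_AUTHOR_EMAIL="demo@example.com"        # hypothetical identity
export GIT_COMMITTER_NAME="Demo User" GIT_COMMITTER_EMAIL="demo@example.com"

# A local stand-in for the shared repo
git init -q bin
git -C bin symbolic-ref HEAD refs/heads/main
echo 'echo hi' > bin/hello.sh
git -C bin add hello.sh
git -C bin commit -q -m "Add hello.sh"

# The mono repo, with the submodule placed under ./extern and pinned to a branch
git init -q src
git -C src symbolic-ref HEAD refs/heads/main
cd src
git -c protocol.file.allow=always submodule --quiet add -b main "$work/bin" extern/bin
git commit -q -m "Add bin as a submodule under extern/"

tracked=$(git config -f .gitmodules submodule.extern/bin.branch)   # the branch entry -b wrote
echo "extern/bin tracks: $tracked"
git submodule status
```

Against a real remote you would drop the -c protocol.file.allow=always part and pass the SSH URL instead of the local path.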

But what if you need to set a submodule default branch for existing repos

Well, that is a little trickier. Most of it turns out to be pretty easy; the main tricky part is that setting the default branch has to be done in the parent of the submodule, so what you do is use git submodule foreach and then cd up and out to set it. The second tricky part is that you need to parse the default branch, which can be either main or master, and then feed that in, so here is one hellacious one-liner.

The other tricky part is that you need to quote a single quote, and it turns out to be hard to do this. You have to use ANSI-C quoting, which you get by putting a $ in front of the string, and then you can backslash-escape a single quote. The other complexity is that you have to use various special variables that the submodule command supports, namely $toplevel, which is the root of the parent repo, and $sm_path, which is the submodule path from there.

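# set-head asks origin for its default branch; symbolic-ref then reads it back as origin/<branch>
git submodule foreach \
  $'git remote set-head origin -a >/dev/null &&
    default=$(git symbolic-ref --short refs/remotes/origin/HEAD | sed \'s|^origin/||\') &&
    cd "$toplevel" &&
    git submodule set-branch -b "$default" -- "$sm_path"'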

First, you need to go to each submodule and find the default branch, which is normally master or lately main. Finding the current branch is pretty simple with git symbolic-ref --short HEAD, but what you need is the default one. In the old days, this was pretty easy because --branch . meant to use the same branch as the parent repo, which was typically master. But with the current mix of master and main, you need an easy way to find the default branch. If you do a git remote set-head origin -a, it actually asks the origin for its current head, as this can be different from what is on the local branch.
Then for each submodule, you need to record this as the default branch with git submodule set-branch, and the complexity is that you need this parsing magic to figure out if it is main or master.
Also, I don’t quite have the quotes right. Still messing with that, but you get the general idea. The main issue is that you don’t want $toplevel and $sm_path to be evaluated by bash; they are actually set by the submodule command and the replacement happens there.
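The "find the default branch" step can be tried on its own. This is a minimal sketch, assuming a throwaway temp directory where a local repo named upstream plays the part of origin; the names and identity are made up. It uses git symbolic-ref on the remote HEAD rather than parsing the message that set-head prints, since the wording of that message differs between git versions.

```shell
set -eu
work=$(mktemp -d); cd "$work"                    # throwaway directory for the demo
export GIT_AUTHOR_NAME="Demo User" GIT_AUTHOR_EMAIL="demo@example.com"        # hypothetical identity
export GIT_COMMITTER_NAME="Demo User" GIT_COMMITTER_EMAIL="demo@example.com"

# A local stand-in for the remote, whose default branch is main
git init -q upstream
git -C upstream symbolic-ref HEAD refs/heads/main
echo hi > upstream/README.md
git -C upstream add README.md
git -C upstream commit -q -m "First commit"

git clone -q upstream clone
cd clone
git remote set-head origin -a >/dev/null         # ask origin which branch its HEAD points to
default=$(git symbolic-ref --short refs/remotes/origin/HEAD | sed 's|^origin/||')
echo "default branch on origin: $default"
```

The same two commands work unchanged whether the remote's default is main or master, which is the whole point of asking instead of guessing.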

Now that this is set, all you need to do is to update it with the new --remote option. Now, this does not appear to actually switch you to the branch, but at least you are up to the latest on the default branch. If you are interested in internals, that complicated command just adds a branch entry to .gitmodules.

git submodule update --init --recursive --remote

Oops how do I remove a submodule

This is another example of where it used to be super hard. You would have to do a git submodule deinit and then all kinds of fixups, but it looks like a simple git rm in the latest git takes care of this problem. I need to experiment to see if this is true.
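Here is one way to run that experiment, as a sketch under assumptions: a throwaway temp directory, a local repo named bin standing in for the real submodule, a made-up identity, and protocol.file.allow=always only because the submodule is cloned from a local path.

```shell
set -eu
work=$(mktemp -d); cd "$work"                    # throwaway directory for the demo
export GIT_AUTHOR_NAME="Demo User" GIT_AUTHOR_EMAIL="demo@example.com"        # hypothetical identity
export GIT_COMMITTER_NAME="Demo User" GIT_COMMITTER_EMAIL="demo@example.com"

git init -q bin                                  # local stand-in for the submodule's repo
git -C bin symbolic-ref HEAD refs/heads/main
echo 'echo hi' > bin/hello.sh
git -C bin add hello.sh
git -C bin commit -q -m "Add hello.sh"

git init -q src
cd src
git symbolic-ref HEAD refs/heads/main
git -c protocol.file.allow=always submodule --quiet add "$work/bin" extern/bin
git commit -q -m "Add submodule"

git rm -q extern/bin                             # removes the working tree and the .gitmodules entry
git commit -q -m "Remove submodule"

entry=$(git config -f .gitmodules --get submodule.extern/bin.url || true)
if [ -d .git/modules/extern/bin ]; then leftover=yes; else leftover=no; fi
echo ".gitmodules entry: '${entry}', cached clone still in .git/modules: $leftover"
```

In this sketch the plain git rm is enough for the working tree and .gitmodules. The cached clone under .git/modules stays behind, so delete that directory by hand if you want the disk space back.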

Ugh what is all this rebase -i stuff and what is a rebase conflict.

OK, the hardest part about dealing with Git is making your commits look neat and nice. This is what git rebase -i is for.
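Since an interactive rebase normally opens an editor, here is a sketch that runs without one, under a few assumptions: a throwaway repo in a temp directory, a made-up identity, and the --fixup / --autosquash pair standing in for typing 's' by hand (a fixup behaves like a squash that throws away the second commit message). Setting GIT_SEQUENCE_EDITOR=true just accepts the todo list git generates.

```shell
set -eu
repo=$(mktemp -d); cd "$repo"             # throwaway repository for the demo
git init -q .
git symbolic-ref HEAD refs/heads/main     # name the first branch main on any git version
git config user.name "Demo User"          # hypothetical identity
git config user.email "demo@example.com"

echo base > base.txt; git add base.txt; git commit -q -m "Base commit"
git switch -q -c rich-dev
echo "step 1" > feature.txt;  git add feature.txt; git commit -q -m "Add feature"
echo "step 2" >> feature.txt; git add feature.txt
git commit -q --fixup=HEAD                # recorded as "fixup! Add feature"

# Accept the generated todo list as-is instead of opening an editor
GIT_SEQUENCE_EDITOR=true git rebase -q -i --autosquash main

commits=$(git rev-list --count main..rich-dev)   # commits left on rich-dev beyond main
subject=$(git log -1 --format=%s)
git log --oneline main..rich-dev
```

The two work-in-progress commits collapse into a single "Add feature" commit containing both steps, which is the neat history you want before pushing.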

What’s all this then about Git LFS

The second thing that any decent sized project is going to use is Git Large File Storage. The reason for this is that if you have JPGs or large blobs, Git really doesn’t know how to deal with them. As an example, if you check in an EXE, even if you git rm it, it never leaves: git keeps a copy of *every* file you have ever checked in, in the history that you download. That is why my original personal repo ended up being 5GB, because I didn’t understand this.

You still want to have version control though, so the solution is Git Large File Storage. What this does is replace those big files with tiny pointer files (roughly 130 bytes each), and then your client does the magic of looking for the binary files in a separate storage mechanism, so they are not copied all the time.

If these submodules use git lfs and you don’t have git lfs installed, you are not going to get real files but lots of tiny pointer files instead, which is pretty useless.

Getting it working is pretty easy:

# assuming you are on a Mac
brew install git-lfs
cd _your_git_repo_
git lfs install
# now name all your binary objects 
# Note that we need quotes
git lfs track "*.psd" "*.jpg" "*.bmp" "*.h5"
# this creates a .gitattributes file
git add .gitattributes
# commit and push it
git commit
git push

Git LFS Object not found

Well, this is a weird bug: a repo I hadn’t used for a while reported this error on a pull.

There is a reported fix for this where you push the object up by knowing its LFS id, but this didn’t work for me with a git lfs push --object-id origin _whatever_that_is. Or you can push all the local lfs objects up with git lfs push --all origin. However, when I tried that, the local git reported all kinds of objects missing. Argh.

The other fix is just to start over with another clone. So that’s where I’m heading 🙂

Rich's Tongfamily (Richtong.net add-on)