The Internal Craziness That Is Git Submodules and LFS: A Reflection
Well, these are really powerful, but incredibly hard to understand. It’s taken me five years to get a sense of how to really use them. The trick is to first understand, at an architectural level, what is really happening inside Git; then it makes sense how this whole thing works. Here is how Git “thinks”, and hopefully this will help you too:
- Everything in Git is an object. File contents are stored as blobs, and directories and commits are objects too. Every object has a unique SHA-1 hash, a huge number that identifies it. So inside the repo there is no file system in the usual sense; everything is an object with its own ID. Every file, every commit point, everything.
- These objects are stitched together into orgs and repos and branches. That is, say you are in an organization, for example my personal one, `richtong`, and you are in my `src` repo on the
- And when you do a `git fetch` you are actually pulling *every* object in the repo that you don't already have (unless you asked for a shallow or partial clone). That means every file ever changed, every JPG, even the ones you’ve deleted, is carried in the repo forever. This is great because it means it is unlikely that you will lose something, but terrible in that an accidental check-in can literally live forever. There are tools, of course, that can rewrite history so that you can get rid of things, but this is pretty scary!
- Once you get this, you can see how Git LFS works. Instead of the commit leading to the real file contents, it leads to a small pointer into Git LFS storage, so you never put
- The most confusing thing is that submodules add another layer of complexity. When you add a submodule repo, there is caching and bookkeeping that happens in the parent, in the magic `.gitmodules` file and the `.git/modules` directory, which I really don’t quite understand.
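
To make the first point in the list concrete, here is a minimal sketch of how Git arrives at the ID for a file's contents in a default SHA-1 repo: it hashes a short header plus the raw bytes. The function name `git_blob_id` is my own, not anything Git ships.

```python
import hashlib


def git_blob_id(content: bytes) -> str:
    """Return the SHA-1 object ID Git would give a blob with this content."""
    # Git hashes a small header ("blob <size>\0") followed by the raw bytes.
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()


# The ID depends only on the bytes, never on the file name or location.
print(git_blob_id(b""))
print(git_blob_id(b"hello world\n"))
```

You can compare the result with `git hash-object <file>` on any file. Two files with identical bytes get the same ID, which is why Git only ever stores that content once.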
The net result is that most of the time you think you are working on a long line of edits, but that’s just not what is happening. Git flattens everything into objects, and every time you commit a change, new objects are written and a new commit points at them; nothing is edited in place. It’s actually an incredible universe if you realize that Git has every change you’ve ever committed all floating in free space.
At least that helps me understand what is really going on. Submodules are another complexity here, but the basic idea is that with a submodule, you get yet another free-floating universe. The parent repo has a cache of everything, but fundamentally, the only relationship between the parent and the sub-repos is which commit of each sub-repo the parent is pointing at.
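
You can see that pin for yourself: in the parent's tree a submodule shows up as an entry with mode `160000` and type `commit`, and the ID is the sub-repo commit the parent points at. Here is a rough sketch that picks those entries out of `git ls-tree` output. The sample text, IDs, and paths are made up for illustration, and the helper names are mine.

```python
from typing import Dict, List, NamedTuple


class TreeEntry(NamedTuple):
    mode: str
    kind: str
    object_id: str
    path: str


def parse_ls_tree(output: str) -> List[TreeEntry]:
    """Parse `git ls-tree` lines of the form '<mode> <type> <object>\\t<path>'."""
    entries = []
    for line in output.splitlines():
        if not line.strip():
            continue
        meta, path = line.split("\t", 1)
        mode, kind, object_id = meta.split(" ")
        entries.append(TreeEntry(mode, kind, object_id, path))
    return entries


def submodule_pins(entries: List[TreeEntry]) -> Dict[str, str]:
    """Return {path: pinned commit ID} for the submodule entries only."""
    return {
        e.path: e.object_id
        for e in entries
        if e.mode == "160000" and e.kind == "commit"
    }


# Hypothetical sample output; real IDs would be actual hashes.
SAMPLE = (
    "100644 blob 1111111111111111111111111111111111111111\tREADME.md\n"
    "040000 tree 2222222222222222222222222222222222222222\tdocs\n"
    "160000 commit 3333333333333333333333333333333333333333\tlib\n"
)

print(submodule_pins(parse_ls_tree(SAMPLE)))
```

In a real repo you would feed it the text from `git ls-tree HEAD`. Notice that the parent stores nothing about the submodule's files, just that one commit ID per submodule.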
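The next level is branches, which are ways to stitch together commit points: a branch is just a name that points at a commit, and each commit points back at its parents. So you can be in a detached state (a detached HEAD), which means you are not on a branch. If you are on a branch, which you get onto with `git switch`, then new commits are added to that branch, and a `git push` sends that branch to the remote. So the “history” is really arbitrary; it is just a set of links between this sea of commits.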
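
Here is a deliberately tiny toy model of that idea, assuming nothing about how Git really stores things: commits only know their parents, and a branch is just a name that follows along when you commit while on it. The class `ToyRepo` and the commit labels `c1`, `c2`, `c3` are invented for the sketch.

```python
from typing import Dict, List, Optional


class ToyRepo:
    """A toy model of commits and branches (not Git's real storage)."""

    def __init__(self) -> None:
        self.parents: Dict[str, List[str]] = {}     # commit ID -> parent IDs
        self.branches: Dict[str, str] = {}          # branch name -> tip commit
        self.current_branch: Optional[str] = None   # None means detached
        self.head: Optional[str] = None             # commit checked out now

    def switch(self, branch: str) -> None:
        """Get onto a branch, creating it at the current commit if needed."""
        if branch not in self.branches and self.head is not None:
            self.branches[branch] = self.head
        self.current_branch = branch
        self.head = self.branches.get(branch, self.head)

    def detach(self, commit_id: str) -> None:
        """Check out a commit directly, so that no branch is current."""
        self.current_branch = None
        self.head = commit_id

    def commit(self, commit_id: str) -> None:
        """Add a commit whose parent is whatever is checked out now."""
        self.parents[commit_id] = [self.head] if self.head else []
        self.head = commit_id
        if self.current_branch is not None:
            # On a branch: the branch name moves forward to the new commit.
            self.branches[self.current_branch] = commit_id

    def history(self, start: str) -> List[str]:
        """Follow first-parent links back from a commit."""
        out: List[str] = []
        cur: Optional[str] = start
        while cur is not None:
            out.append(cur)
            parents = self.parents[cur]
            cur = parents[0] if parents else None
        return out


repo = ToyRepo()
repo.switch("main")
repo.commit("c1")
repo.commit("c2")      # main now points at c2
repo.detach("c1")
repo.commit("c3")      # detached: c3 exists, but no branch points at it

print(repo.branches)
print(repo.history("c3"))
```

The commit made while detached is still in the sea of commits, it just has no branch name leading to it, which is why “history” is only whatever you can reach by following links from a name.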
And it also explains what is happening with Git Large File Storage. Normally every object is in the Git repo. As you can see from this explanation, when you clone, for instance, a 20-year-old repo, you are in fact getting *every file change ever made in the history of the repo*, and that is the power and glory of it.
But if that object is, say, a 10MB JPEG or a 200MB PDF, it is pretty painful to keep that around for the rest of time. So the solution is that instead of storing the file itself, you just stick a small pointer in the repo, and that points to a big file store, so the repo itself doesn’t have to carry all that around. The net is that when you clone, you don’t have to get everything.
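
To give a feel for how small that pointer is, here is a sketch that builds the text of a Git LFS pointer for some bytes, following the pointer layout in the Git LFS spec (a version line, a SHA-256 `oid`, and a `size`). The function name `lfs_pointer` and the stand-in "JPEG" bytes are my own assumptions for illustration.

```python
import hashlib


def lfs_pointer(content: bytes) -> str:
    """Build the text of a Git LFS pointer file for the given content."""
    oid = hashlib.sha256(content).hexdigest()
    return (
        "version https://git-lfs.github.com/spec/v1\n"
        f"oid sha256:{oid}\n"
        f"size {len(content)}\n"
    )


# Stand-in for a roughly 10 MB image; not a real JPEG.
big_file = b"\xff\xd8" + b"\0" * (10 * 1024 * 1024)
pointer = lfs_pointer(big_file)

print(pointer)
print(len(big_file), "bytes of content vs", len(pointer), "bytes of pointer")
```

The repo only ever stores those three lines per version of the file, and the real bytes live in the LFS store under that `oid`.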
Of course that is where lots of bugs come from. If Git LFS isn’t tracking the files it should, you can get all kinds of “object not found” errors. Also, if you clone a repo that uses Git LFS and you don’t have it installed, suddenly all those JPEGs are going to be unusable; they will just be little pointer files.
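
When an image mysteriously won't open, a quick check is whether the file on disk is really just a pointer. This is a rough sketch of such a check, not the logic Git LFS itself uses. The name `parse_lfs_pointer` is made up, and the 1024-byte cutoff is my reading of the pointer size limit in the LFS spec.

```python
from typing import Dict, Optional, Union

LFS_VERSION_LINE = "version https://git-lfs.github.com/spec/v1"


def parse_lfs_pointer(data: bytes) -> Optional[Dict[str, Union[str, int]]]:
    """Return {'oid': ..., 'size': ...} if data looks like an LFS pointer."""
    # Pointer files are tiny text files; real media is far bigger (assumed cutoff).
    if len(data) >= 1024:
        return None
    try:
        text = data.decode("utf-8")
    except UnicodeDecodeError:
        return None  # binary content, so not a pointer
    lines = text.splitlines()
    if not lines or lines[0] != LFS_VERSION_LINE:
        return None
    # Remaining lines are "key value" pairs such as "oid sha256:..." and "size N".
    fields = dict(line.split(" ", 1) for line in lines[1:] if " " in line)
    if "oid" not in fields or not fields.get("size", "").isdigit():
        return None
    return {"oid": fields["oid"], "size": int(fields["size"])}


# Hypothetical pointer text; the oid is a placeholder, not a real hash.
sample = (
    "version https://git-lfs.github.com/spec/v1\n"
    "oid sha256:" + "ab" * 32 + "\n"
    "size 10485760\n"
).encode("utf-8")

print(parse_lfs_pointer(sample))
print(parse_lfs_pointer(b"\xff\xd8\xff\xe0 some real image bytes"))
```

If the first call's shape is what you get back for your "JPEG", the fix is usually to install Git LFS and run `git lfs pull` in the repo.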