Splitting a big repo into two with git-filter-repo and git submodule


OK, so I made a mistake a while back. I was taking a bunch of ML classes and ended up creating a subdirectory `course` in my personal repo `rich`, which means I'm carrying around 1GB of data that I basically never use.

In the old days, this was really hard to fix, but things keep getting better, and with the newer git tooling it is actually super easy to do. The process is:

  1. Clone the repo twice. One clone stays the original `rich` repo and the other becomes the new `course` repo: `git clone git@github.com:richtong/rich rich` and then `git clone git@github.com:richtong/rich course`.
  2. Now make sure all the Git LFS objects are downloaded. A plain clone doesn't fetch them all, so `cd course && git lfs fetch --all` will do the trick; otherwise you will get missing object messages.
  3. Still in the `course` repo, run the magic command. First `brew install git-filter-repo`, then `git filter-repo --subdirectory-filter course`. This prunes all the non-course content and promotes the files in `course` to the root, which is exactly what you want.
  4. If you have two directories, say `course` and `tutorial`, you can combine them instead with `git filter-repo --path course/ --path tutorial/ --path-rename course/: --path-rename tutorial/:`. This works assuming there are no name collisions, such as two README files; otherwise you can leave the renames off and the directories keep their names.
  5. Now, from the command line, `brew install gh` to get the GitHub CLI and run `gh repo create`, which gives you a new repo. Since filter-repo removes the old `origin` remote, the new repo can take that name, and then it's a simple `git push --set-upstream origin main`.
  6. Then do the same for every other branch you want to keep in the new repo, for example `git switch dev-branch && git push -u origin dev-branch`.
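
The steps above can be collected into one script. This is a minimal sketch under some assumptions, not a tested recipe: the new repo name `richtong/course`, the `main` branch, and the `DRY_RUN`, `run`, and `split_course` names are all invented here for illustration, and the `--private --source=. --remote=origin` flags are my non-interactive stand-in for answering the `gh repo create` prompts. By default it only prints the commands so you can read the plan first.

```shell
#!/bin/sh
# Sketch of the split-out steps. Prints each command; set DRY_RUN=0 to execute.
set -eu

SRC_URL="${SRC_URL:-git@github.com:richtong/rich}"  # repo from the post
SUBDIR="${SUBDIR:-course}"                          # directory being split out
NEW_REPO="${NEW_REPO:-richtong/course}"             # assumed name for the new repo
BRANCHES="${BRANCHES:-main}"                        # space-separated branches to keep (assumed)
DRY_RUN="${DRY_RUN:-1}"

# Hypothetical helper: echo the command, and run it only when DRY_RUN=0.
run() {
  printf '+ %s\n' "$*"
  if [ "$DRY_RUN" = "0" ]; then
    "$@"
  fi
}

split_course() {
  # Two clones: one stays the main repo, one becomes the new repo.
  run git clone "$SRC_URL" rich
  run git clone "$SRC_URL" "$SUBDIR"
  run cd "$SUBDIR"
  # A plain clone does not download every LFS object.
  run git lfs fetch --all
  # Keep only the subdirectory and promote its files to the root.
  run git filter-repo --subdirectory-filter "$SUBDIR"
  # Assumed flags: create the GitHub repo and register it as origin.
  run gh repo create "$NEW_REPO" --private --source=. --remote=origin
  for branch in $BRANCHES; do
    run git switch "$branch"
    run git push --set-upstream origin "$branch"
  done
}

split_course
```

Once the printed plan looks right, something like `DRY_RUN=0 BRANCHES="main dev-branch" sh split.sh` would run it for real (`split.sh` being whatever you named the file).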

That takes care of the new separate repo. Now you have to clean all references to it out of the main repo and then add the new repo as a submodule:

  1. Go to the main repo with, for example, `cd ../rich`, and then use a trick: you specify the thing you are getting rid of and then invert the selection, so that you keep everything that is *not* the directory being split: `git filter-repo --path course/ --invert-paths`.
  2. With that done, it's pretty easy from here. Create a new repo with, for example, `gh repo create <your org>/new-repo`, and it will set up a new `origin` for this repo, which is nice.
  3. Now it's a simple matter to add the fat repo as a submodule. Just run `git submodule add git@github.com:<yourorg>/new-repo original-repo`, where the URL is the repo you split out and the last argument is the directory it gets checked out into, then `git commit -a && git push`, and you are done!
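
The main-repo half can be sketched the same way. Again, this is only an illustration under assumptions: `richtong/rich-slim` and `richtong/course` are made-up repo names, `DRY_RUN`, `run`, and `trim_main` are hypothetical helpers, and the `gh repo create` flags are one non-interactive way to get a new `origin`. I also use `git push --set-upstream origin main` rather than a bare `git push`, since a freshly added remote has no upstream branch configured yet.

```shell
#!/bin/sh
# Sketch of the main-repo cleanup. Prints each command; set DRY_RUN=0 to execute.
set -eu

SUBDIR="${SUBDIR:-course}"                      # directory that was split out
MAIN_REPO="${MAIN_REPO:-richtong/rich-slim}"    # assumed name for the slimmed-down repo
COURSE_REPO="${COURSE_REPO:-richtong/course}"   # assumed name of the split-out repo
DRY_RUN="${DRY_RUN:-1}"

# Hypothetical helper: echo the command, and run it only when DRY_RUN=0.
run() {
  printf '+ %s\n' "$*"
  if [ "$DRY_RUN" = "0" ]; then
    "$@"
  fi
}

trim_main() {
  run cd rich
  # Name the directory to drop, then invert so everything else is kept.
  run git filter-repo --path "$SUBDIR/" --invert-paths
  # Assumed flags: create the GitHub repo and register it as origin.
  run gh repo create "$MAIN_REPO" --private --source=. --remote=origin
  # Bring the split-out repo back at the old path as a submodule.
  run git submodule add "git@github.com:$COURSE_REPO" "$SUBDIR"
  run git commit -a -m "Add $SUBDIR as a submodule"
  run git push --set-upstream origin main
}

trim_main
```

Checking the submodule out at the old `course` path is a design choice on my part; it keeps any scripts that expect that directory working, but any other path works too.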
