Managing Python installations for reproducibility

Wow, this is a complicated topic, but one thing I learned from Sam is that reproducibility is one of the unsung cores of good software development. On any machine, you should be able to run your code without software dependencies tripping you up.

But there is a cost to full reproducibility. At the limit, having to set up an entire new virtual machine for every little project doesn't make sense, so it is a good idea to think of it as a tradeoff. For quick things, do the least amount possible.

So here is a good example: how to handle dependencies when you are working with Python. I have been using the language a lot for machine learning, and here are some of the tradeoffs to make, from easy to hard:

Fastest Way: Use requirements.txt

OK, the easiest thing on a Mac is just to do a brew install python and then pip install package1 package2 ... The advantage of this approach is that it is quick, you get the latest version of each package, and it is easy to code up. Upgrading to the latest is easy too, with brew upgrade and pip install --upgrade.

The disadvantage is that this is user-wide, so everything you install has to work together. If you are doing tutorials from different sources, they may not work. And as upgrades happen, things can break. For instance, when I took the courses, they were using TensorFlow 1.2 and some things just broke with TensorFlow 1.16.

The simple way out is that after some code is working, go to the project's top-level directory and list all the packages and their versions with pip freeze > requirements.txt, which puts all the versions into a text file. Check that into GitHub, and when you run the code sometime later, just do a pip install -r requirements.txt and you will get back to where things were working.
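
A requirements file only gets you back to a known state if every entry is pinned to an exact version. As a minimal sketch (the function name unpinned_requirements is made up for illustration and is not a pip feature), this shell helper prints any entries that lack an == pin, skipping blank lines and comments:

```shell
# Print requirement lines that are not pinned to an exact version.
# Usage: unpinned_requirements [path/to/requirements.txt]
unpinned_requirements() {
    local file="${1:-requirements.txt}"
    # First grep drops blank lines and comment lines;
    # second grep keeps only the entries without an "==" pin.
    # "|| true" keeps the exit status at 0 when nothing is unpinned.
    grep -Ev '^[[:space:]]*(#|$)' "$file" | grep -v '==' || true
}
```

No output means everything is pinned. Editable installs and direct-URL entries, which pip freeze can also emit, will show up here too, since they are not tied to a version number.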

When things break: Add venv

Now things will break, and the most likely culprits are the pip packages. So if you do a pip install and things fail, you will need to isolate things. I had this problem where a package, music21, refused to work. The solution is to create an environment with the new Python 3 venv module:
cd your_project_name
# Call the venv module and put all Python packages into a directory called venv
python3 -m venv venv
# Do not check in venv; it can hold gigabytes of packages, so add it to .gitignore
# if it is not already listed there
grep -qx 'venv/' .gitignore 2>/dev/null || echo 'venv/' >> .gitignore
source ./venv/bin/activate
# This puts you into a fresh Python environment with zero packages
pip install -r requirements.txt
# Do your work, and when done leave the environment
deactivate

Then the next time you work on it, it is as easy as:

source ./venv/bin/activate
# work on the project
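
It is easy to forget the activate step and install into the user-wide Python by accident. One possible guard, sketched below, relies on the VIRTUAL_ENV variable that the venv activate script exports; the function names in_venv and safe_pip_install are hypothetical helpers, not part of pip or venv:

```shell
# Succeeds only when a virtual environment is active.
# The activate script exports VIRTUAL_ENV; deactivate unsets it.
in_venv() {
    [ -n "${VIRTUAL_ENV:-}" ]
}

# Hypothetical wrapper: refuse to install unless a venv is active.
safe_pip_install() {
    if ! in_venv; then
        echo "No active venv; run: source ./venv/bin/activate" >&2
        return 1
    fi
    # Inside an activated venv, "python" is the environment's interpreter
    python -m pip install "$@"
}
```

With these defined in your shell, safe_pip_install -r requirements.txt behaves like the plain pip command inside the environment and stops with a reminder outside it.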

When things go really topsy turvy: Docker and conda

Now this will solve many problems, but there are times when you depend on a particular version of Python, on system libraries, or even on the operating system. While you can use something like conda to set up different Python libraries, if things get this bad you will have to manage both conda and pip, and getting to reproducible is harder because now you also need to script conda. I find that you might as well take the next step and use a Docker container, which will take care of many dependencies at the system level.

The solution here is to use a Docker image. I normally clone the Dockerfile that makes the image as well and make that a submodule of the project, so you can reproduce it. You don't just want to use some random base image, so follow the Dockerfiles back far enough that you are confident you are down to, say, the Ubuntu core image.

Finally, if you are doing this and it is all scripted in that sense, it makes sense to convert to Anaconda Python (since you won't be using Homebrew) and then use the conda package installer. This installer is better than pip because it does a better job of looking at all the dependencies together, so you want to put all the packages on a single conda line and it will look for things like circular dependencies.

Conda only has about 1,500 packages at the time of writing, even when you include things like conda-forge. So if there are packages that are not there, then after RUN conda install _packages_ you can have another line, RUN pip install _more_packages_, to get the rest.
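
To make the conda-then-pip layering concrete, here is a rough sketch that writes a small Dockerfile from the shell. The base image continuumio/miniconda3 and the package names numpy and music21 are assumptions picked for illustration, not choices made in the text above; substitute a base image you have traced back yourself and your own package lists.

```shell
# Write a minimal Dockerfile (overwrites any Dockerfile in the current directory).
# The quoted 'EOF' keeps the shell from expanding anything inside the heredoc.
cat > Dockerfile <<'EOF'
# Assumption: a Miniconda base image; swap in one you have audited
FROM continuumio/miniconda3
# One conda line so the solver sees every package at once
RUN conda install --yes numpy
# Anything conda does not carry comes from pip afterwards
RUN pip install music21
# Copy the project into the image
WORKDIR /app
COPY . /app
EOF
```

From the same directory, docker build -t my-project . would then build the image (the tag my-project is a placeholder).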
