Hi all! We have a version controlled codebase (even the executable, actually) and all its textual and binary inputs. The workflow is that you change your inputs and run the executable, which produces, say, half a gigabyte of output when compressed (this happens maybe 10 times per day). There are two issues when it comes to reproducibility: 1) We don't version control each state of our inputs; one can tinker with a parameter in a text file and keep rerunning the model. 2) The outputs are too big to be conveniently version controlled alongside our codebase and inputs.
So when we stumble upon some outputs from some time ago, we are often not sure which model version created them, because the run may well have happened between commits. I have two ideas for making this (more) reproducible:
1. Before each run (it's a bash script, so just add a line), we create a patch of the current state of the model and store this patch together with our outputs. That way, in combination with our VCS, we can always get back to the state the model was in when those outputs were produced. The two pieces (patch, outputs) then get archived somewhere together (see the first sketch after this list).
2. Before each run, we commit our current state to a separate branch, take its hash/revision number and rename our outputs to it, so that outputs named 36add13.tar.gz can be reproduced by simply checking that revision out. This would be a separate (and perhaps non-linear) branch; these commits would only be merged into master once in a while (see point 1 in the first paragraph). A second sketch of this follows below.
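Something like the following could implement idea 1, assuming the VCS is git; run_model.sh, output/ and archive/ are placeholder names for our actual batch script and directories:

```bash
#!/usr/bin/env bash
# Sketch of idea 1: snapshot the uncommitted state as a patch, run, archive everything together.
set -euo pipefail

run_id=$(date +%Y%m%d-%H%M%S)

# Record the commit the patch applies to, plus all uncommitted changes to tracked files
# (--binary so the binary inputs end up in the patch too). Untracked files are not captured.
git rev-parse HEAD       > "run-${run_id}.base-commit"
git diff --binary HEAD   > "run-${run_id}.patch"

./run_model.sh           # the existing batch script that produces output/

# Archive patch, base commit and outputs together so they cannot drift apart.
mkdir -p archive
tar czf "archive/run-${run_id}.tar.gz" \
    "run-${run_id}.base-commit" "run-${run_id}.patch" output/
```

Reproducing a run would then be a matter of checking out the recorded base commit and running git apply on the stored patch.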
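And a sketch of idea 2, again assuming git; the branch name "runs" plus run_model.sh, output/ and archive/ are placeholders. It uses plumbing commands so the snapshot commit is created without touching the working tree or the current branch (and it assumes output/ is gitignored so old outputs don't get snapshotted):

```bash
#!/usr/bin/env bash
# Sketch of idea 2: commit the current state onto a side branch, name the outputs after the hash.
set -euo pipefail

# Build the snapshot in a throwaway index so the real index and working tree stay untouched.
export GIT_INDEX_FILE=$(mktemp -u)
git add -A                               # stage tracked changes and untracked files
tree=$(git write-tree)

# Chain the snapshot onto the previous tip of the runs branch, if there is one.
msg="snapshot for run at $(date)"
if parent=$(git rev-parse -q --verify refs/heads/runs); then
    commit=$(git commit-tree "$tree" -p "$parent" -m "$msg")
else
    commit=$(git commit-tree "$tree" -m "$msg")
fi
git update-ref refs/heads/runs "$commit"
rm -f "$GIT_INDEX_FILE"; unset GIT_INDEX_FILE

hash=$(git rev-parse --short "$commit")

./run_model.sh                           # the existing batch script that produces output/

# Name the archived outputs after the snapshot commit that produced them.
mkdir -p archive
tar czf "archive/${hash}.tar.gz" output/
```

Outputs named e.g. 36add13.tar.gz can then be reproduced by checking that commit out; git worktree add is a convenient way to do so without disturbing the main checkout.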
Notice that neither of these workflows requires the user to manually keep track of anything; it all gets version controlled from the batch script. I would like to keep this as low tech as possible (no Docker, no orchestration, ideally no cloud services).
Thanks for any comments on these suggestions and/or your own experience.