A couple of times in the past I wanted to port open source ML models from CUDA/Python to a better technology stack. I have ported Whisper https://github.com/Const-me/Whisper/ and Mistral https://github.com/Const-me/Cgml/ to D3D11. I don’t remember exactly how much time I spent, but since both were unpaid part-time hobby projects, probably under 160 hours each.
These projects were great for validating the technology choices, but note I only did the bare minimum to implement specific ML models. Implementing a complete PyTorch backend is going to involve dramatically more work. I can’t even estimate how much more, because I’m not an expert in Python or these Python-based ML libraries.
To go on a tangent, I noticed your custom 'BCML1' 5-bit-per-weight compression codec and your hand-optimised AVX2 code to encode it... was that really needed? Are the weights encoded on every startup? Why not do it once and save to disk?
Not really: that code only runs while importing the PyTorch format. See the readme for the frontend app: https://github.com/Const-me/Cgml/tree/master/Mistral/Mistral... When loading the model from *.cgml, the model file already contains compressed tensors. That’s how that file is only 4.55 GB, versus 13.4 GB for the original model.
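(As a sanity check on those numbers, assuming the original weights are 16-bit: 13.4 GB × 5/16 ≈ 4.2 GB, so 4.55 GB is roughly the ratio you’d expect once per-block metadata is included.)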
> was that really needed?
On desktops with many CPU cores, a simpler scalar version would probably work equally well. Still, low-end computers don’t always have spare cores for background encoding tasks like this one. Also, on laptops, CPU usage translates into battery drain.
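To illustrate what I mean by a simpler scalar version, here’s a minimal sketch of just the bit-packing step. It assumes the weights were already quantized into 5-bit codes, and it skips BCML1’s actual block layout and scaling, so treat it as the shape of the approach rather than the real codec:

    #include <cstdint>
    #include <cstdio>
    #include <vector>

    // Pack 5-bit codes into a byte stream through a small bit accumulator.
    // Portable scalar code: no intrinsics, works anywhere.
    std::vector<uint8_t> packCodes5( const std::vector<uint8_t>& codes )
    {
        std::vector<uint8_t> out;
        out.reserve( ( codes.size() * 5 + 7 ) / 8 );
        uint32_t acc = 0; // bit accumulator
        int bits = 0;     // count of valid bits in the accumulator
        for( uint8_t c : codes )
        {
            acc |= uint32_t( c & 0x1F ) << bits;
            bits += 5;
            while( bits >= 8 )
            {
                out.push_back( (uint8_t)( acc & 0xFF ) );
                acc >>= 8;
                bits -= 8;
            }
        }
        if( bits > 0 )
            out.push_back( (uint8_t)acc );
        return out;
    }

    int main()
    {
        // 5 codes * 5 bits = 25 bits, which packs into 4 bytes
        const std::vector<uint8_t> codes = { 1, 2, 3, 30, 31 };
        const auto packed = packCodes5( codes );
        printf( "%zu codes -> %zu bytes\n", codes.size(), packed.size() );
        return 0;
    }

The AVX2 version does the same thing several codes at a time, which is what helps on machines that don’t have idle cores to throw at the import.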