A couple of times in the past I wanted to port open source ML models from CUDA/Python to a better technology stack. I have ported Whisper https://github.com/Const-me/Whisper/ and Mistral https://github.com/Const-me/Cgml/ to D3D11. I don’t remember exactly how much time I spent, but since both were unpaid part-time hobby projects, probably under 160 hours each.
These projects were great for validating the technology choices, but note I only did the bare minimum to implement specific ML models. Implementing a complete PyTorch backend is going to involve dramatically more work. I can’t even estimate how much more, because I’m not an expert in Python or these Python-based ML libraries.
To go on a tangent, I noticed your custom 'BCML1' 5-bit-per-weight compression codec and your hand-optimised AVX2 code to encode it... was that really needed? Are the weights encoded on every startup? Why not do it once and save to disk?
Not really: that code only runs while importing the PyTorch format. See the readme for the frontend app: https://github.com/Const-me/Cgml/tree/master/Mistral/Mistral... When loading the model from *.cgml, the model file already contains compressed tensors. That’s how that file is only 4.55 GB, versus 13.4 GB for the original model.
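(As a sanity check on those numbers, assuming the original weights are 16-bit: 13.4 GB × 5/16 ≈ 4.2 GB, so 4.55 GB is roughly the ratio you’d expect once per-block metadata is included.)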
> was that really needed?
On desktops with many CPU cores, a simpler scalar version would probably work equally well. Still, low-end computers don’t always have spare cores for background encoding tasks like this one. Also, on laptops, CPU usage translates into battery drain.
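To illustrate what I mean by a simpler scalar version, here’s a minimal sketch of just the bit-packing step. It assumes the weights were already quantized into 5-bit codes, and it skips BCML1’s actual block layout and scaling, so treat it as the shape of the approach rather than the real codec:

    #include <cstdint>
    #include <cstdio>
    #include <vector>

    // Pack 5-bit codes into a byte stream through a small bit accumulator.
    // Portable scalar code: no intrinsics, works anywhere.
    std::vector<uint8_t> packCodes5( const std::vector<uint8_t>& codes )
    {
        std::vector<uint8_t> out;
        out.reserve( ( codes.size() * 5 + 7 ) / 8 );
        uint32_t acc = 0; // bit accumulator
        int bits = 0;     // count of valid bits in the accumulator
        for( uint8_t c : codes )
        {
            acc |= uint32_t( c & 0x1F ) << bits;
            bits += 5;
            while( bits >= 8 )
            {
                out.push_back( (uint8_t)( acc & 0xFF ) );
                acc >>= 8;
                bits -= 8;
            }
        }
        if( bits > 0 )
            out.push_back( (uint8_t)acc );
        return out;
    }

    int main()
    {
        // 5 codes * 5 bits = 25 bits, which packs into 4 bytes
        const std::vector<uint8_t> codes = { 1, 2, 3, 30, 31 };
        const auto packed = packCodes5( codes );
        printf( "%zu codes -> %zu bytes\n", codes.size(), packed.size() );
        return 0;
    }

The AVX2 version does the same thing several codes at a time, which is what helps on machines that don’t have idle cores to throw at the import.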