
> It's not even close to a 45B model. They trained 8 different fine-tunes on the same base model. This means the 8 models differ only by a couple of layers and share the rest of their layers.

No, Mixture-of-Experts is not stacking finetunes of the same base model.



Do you have any more information on the topic? I remember reading about significant memory savings achieved by reusing most of the layers.

It made sense to me at first glance, because you don't need to train things like syntax and grammar 8 times in 8 different ways.

It would also explain why inference with two 7B experts has roughly the cost of running a 12B model.
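Rough back-of-the-envelope (a sketch only, with an assumed shared/expert parameter split, not figures from any released model): in a top-2 MoE transformer every token still passes through the shared attention and embedding weights once, but through only 2 of the 8 expert feed-forward blocks, so the active parameter count per token sits far below 8 × 7B.

    # Illustrative numbers only; the shared/expert split is an assumption.
    shared_params = 2.0e9      # attention, embeddings, norms - used by every token
    expert_ffn_params = 5.0e9  # feed-forward block, replicated once per expert

    n_experts, top_k = 8, 2

    total_params = shared_params + n_experts * expert_ffn_params   # parameters stored
    active_params = shared_params + top_k * expert_ffn_params      # parameters used per token

    print(f"stored: {total_params / 1e9:.0f}B, active per token: {active_params / 1e9:.0f}B")
    # -> stored: 42B, active per token: 12B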


The original paper by Shazeer et al. suffices. What you are describing is possible in theory and may have been done in practice here, but in the general case an MoE model is trained from scratch, and the specializations that develop in the expert layers are learned rather than the product of a design choice.
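For the routing mechanics, here is a minimal sketch of a top-2 MoE feed-forward layer in PyTorch (layer sizes and names are made up; it only shows the gating pattern, where the router and the experts are trained jointly rather than assembled from pre-trained fine-tunes):

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class TopKMoE(nn.Module):
        """Minimal top-k mixture-of-experts feed-forward layer (illustrative only)."""

        def __init__(self, d_model=512, d_ff=2048, n_experts=8, top_k=2):
            super().__init__()
            self.top_k = top_k
            self.gate = nn.Linear(d_model, n_experts, bias=False)   # router
            self.experts = nn.ModuleList([
                nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
                for _ in range(n_experts)
            ])

        def forward(self, x):                        # x: (tokens, d_model)
            scores = self.gate(x)                    # router logits, (tokens, n_experts)
            weights, idx = scores.topk(self.top_k, dim=-1)
            weights = F.softmax(weights, dim=-1)     # normalize over the chosen experts
            out = torch.zeros_like(x)
            for slot in range(self.top_k):
                for e, expert in enumerate(self.experts):
                    mask = idx[:, slot] == e         # tokens routed to expert e in this slot
                    if mask.any():
                        out[mask] += weights[mask, slot, None] * expert(x[mask])
            return out

Only the selected experts run per token, which is where the stored-vs-active parameter gap in MoE models comes from.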



