
Is it really a different 37B parameters for each token? Even with the "multi-token prediction system" that the article mentions?
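
For intuition, here's a back-of-envelope sketch of how top-k MoE routing arrives at a number like that. The routing constants are loosely DeepSeek-V3-style (top-8 of 256 routed experts plus 1 shared expert); the dense and per-expert parameter counts are made-up round numbers chosen to land near the reported ~671B total / ~37B active, not the real breakdown:

    # Illustrative top-k MoE arithmetic. Parameter counts below are
    # assumptions picked to land near ~671B total / ~37B active; the
    # actual per-expert breakdown is in the model's tech report.
    n_routed_experts = 256     # routed experts per MoE layer
    top_k            = 8       # routed experts activated per token
    n_shared_experts = 1       # always-active shared expert

    dense_params = 15.0e9      # attention/embeddings/dense blocks (assumed)
    per_expert   = 2.55e9      # params in one expert, summed over layers (assumed)

    total  = dense_params + per_expert * (n_routed_experts + n_shared_experts)
    active = dense_params + per_expert * (top_k + n_shared_experts)

    print(f"total:  ~{total / 1e9:.0f}B")           # ~670B
    print(f"active: ~{active / 1e9:.0f}B per token")  # ~38B

The router picks a different top-k subset for every token, so which ~37B slice gets touched changes from token to token.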


I don't think anyone uses MTP for inference right now. Even if you use MTP for drafting, you need a batched forward pass in the next round to "verify" that it drafted the right tokens, and that verification pass activates more experts (see the sketch below).
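
Roughly what that draft-and-verify loop looks like; a minimal sketch where `full_model` and `mtp_draft` are hypothetical stand-in callables, not a real API:

    import torch

    # Sketch of draft-and-verify (speculative decoding) with an MTP-style
    # drafter. `full_model` maps a 1-D token tensor to per-position
    # next-token logits; `mtp_draft` cheaply drafts k tokens.
    def speculative_step(full_model, mtp_draft, prefix, k=4):
        # 1) Draft k tokens cheaply from the MTP head(s).
        draft = mtp_draft(prefix, num_tokens=k)          # shape (k,)

        # 2) Verify with ONE batched pass of the full model over the
        #    prefix plus the k drafted tokens. In an MoE, each of those
        #    k positions is routed independently, so the union of
        #    activated experts grows with k -- the pass touches more
        #    weights than single-token decoding would.
        candidate = torch.cat([prefix, draft])
        logits = full_model(candidate)                   # (len(candidate), vocab)

        # 3) Accept the longest drafted prefix the full model agrees with
        #    (greedy acceptance; sampling-based acceptance is analogous).
        accepted = []
        for i, tok in enumerate(draft):
            pred = logits[len(prefix) - 1 + i].argmax()
            accepted.append(pred)
            if pred != tok:   # first disagreement: keep the correction, stop
                break
        return torch.cat([prefix, torch.stack(accepted)])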

DELETED: If you don't use MTP for drafting and instead use MTP to skip generation steps, sure. But you also need to evaluate your use case to make sure you don't get penalized on quality for doing that. Their evaluation in the paper doesn't use MTP for generation.

EDIT: Actually, you cannot use MTP for anything other than drafting, because you still need to fill in the KV caches for the predicted positions. So, during generation, you cannot save compute with MTP (you save memory bandwidth, but even that is more complicated for MoE models because more experts get activated). See the back-of-envelope below.
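
To make that trade-off concrete, here's an illustrative back-of-envelope comparison (the ~2 FLOPs per parameter per token is the usual matmul estimate; numbers are assumptions, not measurements):

    # Where the saving from batching drafted tokens actually is. Assume a
    # model with P active parameters per token verifying k drafted tokens.
    P = 37e9    # active params per token (illustrative)
    k = 4       # drafted tokens verified in one batched pass

    # Plain one-token-at-a-time decoding: k separate forward passes.
    weight_reads_plain = k * P        # weights streamed from memory k times
    flops_plain        = k * 2 * P    # ~2 FLOPs per param per token

    # Batched verify pass over all k tokens: weights streamed once,
    # FLOPs unchanged. (For an MoE this is optimistic: if each token
    # routes to different experts, weight traffic creeps back toward k*P.)
    weight_reads_batch = P
    flops_batch        = k * 2 * P

    print(f"memory traffic: {weight_reads_plain / weight_reads_batch:.0f}x less")
    print(f"FLOPs:          {flops_plain / flops_batch:.0f}x less (i.e. none)")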



