
Inference cost should be closer to Llama 13B, since it only runs 2 of the 8 experts for each token.
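Rough back-of-the-envelope math, assuming a Mixtral-style 8x7B config (hidden size 4096, SwiGLU FFN size 14336, 32 layers, grouped-query attention with 8 KV heads, 8 experts, top-2 routing; the comment doesn't name the model, so all dimensions here are assumptions):

    # Parameter count for an 8-expert, top-2 MoE transformer.
    # All dimensions are assumed (Mixtral-like), not taken from the thread.
    d_model, d_ff, n_layers, vocab = 4096, 14336, 32, 32000
    n_heads, n_kv_heads, head_dim = 32, 8, 128
    n_experts, top_k = 8, 2

    attn = d_model * n_heads * head_dim          # Q projection
    attn += 2 * d_model * n_kv_heads * head_dim  # K and V projections (GQA)
    attn += n_heads * head_dim * d_model         # output projection
    expert_ffn = 3 * d_model * d_ff              # SwiGLU: gate, up, down
    embeddings = 2 * vocab * d_model             # input embedding + LM head

    total = n_layers * (attn + n_experts * expert_ffn) + embeddings
    active = n_layers * (attn + top_k * expert_ffn) + embeddings
    print(f"total:  {total / 1e9:.1f}B params")    # ~46.7B
    print(f"active: {active / 1e9:.1f}B / token")  # ~12.9B

The attention weights and embeddings are shared across all routing decisions; only the FFN experts are switched per token, which is why the per-token FLOPs land near a dense 13B model while all ~47B weights still have to sit in memory.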
Does it have to run them sequentially? I'd guess the compute cost will be at the 12-13B level either way, but latency could be lower if the experts run in parallel?
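No, it doesn't have to. The two experts selected for a token are independent matmuls, so an implementation can dispatch them concurrently; the usual trick is to group tokens by expert so each expert runs a single batched matmul. A minimal sketch of top-2 routing in PyTorch (the class and its names are hypothetical, not any model's actual code):

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class Top2MoE(nn.Module):
        def __init__(self, d_model, d_ff, n_experts=8, top_k=2):
            super().__init__()
            # Router scores every token against every expert.
            self.router = nn.Linear(d_model, n_experts, bias=False)
            self.experts = nn.ModuleList([
                nn.Sequential(nn.Linear(d_model, d_ff), nn.SiLU(),
                              nn.Linear(d_ff, d_model))
                for _ in range(n_experts)
            ])
            self.top_k = top_k

        def forward(self, x):                # x: (n_tokens, d_model)
            logits = self.router(x)          # (n_tokens, n_experts)
            weights, idx = logits.topk(self.top_k, dim=-1)
            weights = F.softmax(weights, dim=-1)
            out = torch.zeros_like(x)
            # Group tokens by expert: each expert does ONE batched matmul
            # over every token routed to it. The loop iterations are
            # independent, so a real kernel can launch them concurrently.
            for e, expert in enumerate(self.experts):
                tok, slot = (idx == e).nonzero(as_tuple=True)
                if tok.numel():
                    out[tok] += weights[tok, slot, None] * expert(x[tok])
            return out

The sequential loop here is only for clarity: with the expert calls overlapped, per-token latency can approach that of a dense ~13B model rather than paying for the two experts back to back.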