
You only need 40GB of RAM for the largest model, and inference latency depends mostly on single-core performance and memory bus speed, because the CPU has to read the whole 40GB of weights for every token it produces.

If it's slower than you want, figure out which of those is your bottleneck: even 64GB of faster, cheap RAM could be a ~50% speedup if your CPU isn't the problem.
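
A rough back-of-envelope sketch of that reasoning (the bandwidth figures below are illustrative placeholders, not benchmarks): if generation is memory-bound, tokens per second are capped at roughly memory bandwidth divided by bytes read per token.

    # Hypothetical numbers for illustration only.
    model_size_gb = 40          # weights read once per generated token
    mem_bandwidth_gb_s = 50     # e.g. dual-channel DDR4-class bandwidth

    # Upper bound on throughput if memory is the bottleneck
    tokens_per_sec = mem_bandwidth_gb_s / model_size_gb
    print(f"~{tokens_per_sec:.2f} tokens/sec ceiling from memory bandwidth")

    # Faster RAM raises the ceiling proportionally (again, assumed figure)
    faster_bandwidth_gb_s = 75
    print(f"~{faster_bandwidth_gb_s / model_size_gb:.2f} tokens/sec with faster RAM")

If the CPU can't keep up with that ceiling, upgrading RAM won't help; if it can, the bandwidth number is what to improve.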


