
Sure, it's easy: you can use benchmarks like HumanEval, which Stability did. They just didn't compare against Codex or GPT-4. Of course, such benchmarks don't capture every aspect of an LLM's capabilities, but they're a lot better than nothing!

