
Not enough samples to overcome variance. Only 714 hands were played for Meta's Llama 4. That's noise in a dashboard.


(author of PokerBattle here)

That’s true. The original goal was to see which model performs statistically better than the others, but I quickly realized that would be neither practical nor particularly entertaining.

A proper benchmark would require things like:

- Tens of thousands of hands played
- Strict heads-up format (only two models compared at a time)
- Each hand played twice with positions swapped
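
To give a rough sense of why the sample size matters, here is a minimal back-of-the-envelope sketch. It treats per-hand results as independent with a fixed standard deviation, so the standard error of the average win rate shrinks as 1/sqrt(n). The values of `SIGMA` and `EDGE` are illustrative assumptions, not measurements from PokerBattle, and the function names are hypothetical.

```python
import math

def standard_error(sigma_per_hand: float, hands: int) -> float:
    """Standard error of the mean result per hand, assuming independent hands."""
    return sigma_per_hand / math.sqrt(hands)

def hands_needed(sigma_per_hand: float, edge_per_hand: float, z: float = 1.96) -> int:
    """Hands required before an edge of the given size exceeds z standard errors."""
    return math.ceil((z * sigma_per_hand / edge_per_hand) ** 2)

# Assumed, illustrative numbers (in big blinds per hand), not taken from the site.
SIGMA = 10.0   # assumed standard deviation of a single hand's result
EDGE = 0.10    # assumed true edge to detect (10 big blinds per 100 hands)

se_714 = standard_error(SIGMA, 714)
needed = hands_needed(SIGMA, EDGE)

print(f"Standard error after 714 hands: {se_714 * 100:.1f} bb/100")
print(f"Hands needed to resolve a {EDGE * 100:.0f} bb/100 edge: {needed}")
```

Under these assumptions the uncertainty after 714 hands is several times larger than the edge being measured, and the required sample lands in the tens of thousands of hands. Playing each hand twice with positions swapped would reduce the effective per-hand variance, which this simple sketch does not model.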

The current setup is mainly useful for observing common reasoning failure modes and how often they occur.



