This is surprisingly competent. A couple of months ago I evaluated some leading models on a bunch of text adventures[1]. Typical regression coefficients were around +0.02 for top-level models like Sonnet and Gemini 2.5 Pro, but notably also Gemini 2.5 Flash. (The baseline is GPT-5 Chat, i.e. the one where OpenAI routes to a thinking model only when they determine it's needed.)
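For anyone wondering what these coefficients are: roughly, the idea is to regress per-game scores on model dummies, with GPT-5 Chat as the omitted baseline, so each model's coefficient is its average score difference vs. that baseline. A minimal sketch of that shape in Python, with made-up scores and game names (not my actual data or code):

    import pandas as pd
    import statsmodels.formula.api as smf

    # Hypothetical results: one row per (model, game) attempt.
    df = pd.DataFrame({
        "model": ["gpt5-chat", "sonnet", "gemini-2.5-pro", "haiku-4.5"] * 3,
        "game":  ["lockout", "dungeon", "anchorhead"] * 4,
        "score": [0.40, 0.42, 0.43, 0.45,
                  0.30, 0.33, 0.31, 0.36,
                  0.50, 0.52, 0.51, 0.54],
    })

    # Treatment coding: each model's coefficient is its average score
    # difference vs. the gpt5-chat baseline, with game fixed effects.
    fit = smf.ols(
        "score ~ C(model, Treatment(reference='gpt5-chat')) + C(game)",
        data=df,
    ).fit()
    print(fit.params)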
When I include an attempt from Haiku 4.5 in the mix, most coefficients stay similar, but Haiku itself gets a +0.05. This must be a statistical fluke, because that would be insanely impressive, especially for a cheaper model. I guess I'm adding samples to some of these after all...
[1]: https://entropicthoughts.com/evaluating-llms-playing-text-ad...
Edit: It was a fluke. Back to +0.01 after one more go at all the games.