
Those numbers don’t mean anything without average token usage stats.


Exactly. Tokens-per-dollar rates are useful, but without knowing the typical input/output token distribution for each model on this specific task, the numbers alone don't give a full picture of cost.
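
Purely as an illustration, a back-of-the-envelope calculation shows how much the input/output split moves the effective cost per task. All prices and token counts below are made-up placeholders, not actual Haiku 4.5 or GPT-5 rates:

    # Rough sketch: effective cost per task depends on the input/output token
    # split, not just the advertised per-million-token prices.
    # All prices and token counts are placeholders, not real rates.

    def cost_per_task(input_tokens, output_tokens,
                      price_in_per_mtok, price_out_per_mtok):
        """Dollar cost of one task given token usage and per-MTok prices."""
        return (input_tokens / 1e6) * price_in_per_mtok \
             + (output_tokens / 1e6) * price_out_per_mtok

    # Hypothetical model A: cheap tokens, but chews through lots of input.
    a = cost_per_task(400_000, 8_000, price_in_per_mtok=1.0, price_out_per_mtok=5.0)
    # Hypothetical model B: pricier tokens, but ingests far less of the repo.
    b = cost_per_task(120_000, 8_000, price_in_per_mtok=2.0, price_out_per_mtok=10.0)

    print(f"A: ${a:.3f}/task, B: ${b:.3f}/task")
    # A: $0.440/task, B: $0.320/task -> the "cheaper" model costs more here.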


That’s how they lie to us. Companies can advertise cheap prices to lure you in, but they know very well how many tokens you’ll use on average, so they still make more profit than ever, especially if you’re using any kind of reasoning model, which is basically a blank check for them to print money.


I don’t think any of them are profitable, are they? We’re in the lose-money-to-gain-market-share phase of this industry.


Fair point of course, and it is still far too early to make a definitive statement, but in my still limited experience throughout the night, I have seen Haiku 4.5 do far better at keeping input token usage to what I'd consider a justifiable amount than e.g. the GPT-5 models. Recent Sonnet versions had also been better on this front than OpenAI's current best, but I try (and don't always succeed) to take prior experience and expectation out of the equation when evaluating models.

Additionally, the Artificial Analysis cost-to-run-the-benchmark-suite numbers are very encouraging [0], and Haiku 4.5 without reasoning is always an option too. I've tested that even less, but there is some indication that reasoning may not be necessary for reasonable output performance [1][2][3].

In retrospect, I perhaps would have been better served starting with "reasoning" disabled; I'll have to do some self-blinded comparisons between model outputs over the coming weeks to rectify that. I'm trying my best not to make a judgement yet, but compared to other recent releases, Haiku 4.5 has a very interesting, even distribution of strengths.
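
For anyone who wants to run the same with/without "reasoning" comparison, a minimal sketch of what I mean using the Anthropic Python SDK; the model ID string and the thinking budget are assumptions on my part, so check the current docs before relying on them:

    # Minimal sketch (Anthropic Python SDK): the same prompt with extended
    # thinking on vs off, so output quality and token usage can be compared.
    # The model ID and thinking budget are assumptions, not verified values.
    import anthropic

    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
    prompt = "Refactor this function to avoid the nested loops: ..."

    with_thinking = client.messages.create(
        model="claude-haiku-4-5",  # assumed model ID
        max_tokens=4096,
        thinking={"type": "enabled", "budget_tokens": 2048},
        messages=[{"role": "user", "content": prompt}],
    )

    without_thinking = client.messages.create(
        model="claude-haiku-4-5",  # assumed model ID
        max_tokens=4096,
        messages=[{"role": "user", "content": prompt}],  # thinking simply omitted
    )

    # usage.input_tokens / usage.output_tokens let you compare cost directly.
    print(with_thinking.usage, without_thinking.usage)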

The GPT-5 models were and continue to be encouraging on price/performance, with a reliable 400k window and good prompt adherence even on multi-minute (beyond 10) runs, but from the start they weren't the fastest and they ingest every token there is in a code base with reckless abandon.

No Grok model has ever performed for me the way it seemed to during the initial hype.

GLM-4.6 is great value but still not solid enough for tool calls, not that fast, etc., so if you can afford something more reliable I'd go for that, but it's encouraging.

Recent Anthropic releases have been good on code output quality, but not as reliable beyond 200k as GPT-5, not exactly fast either when looking at tokens/sec (though task completion generally takes less time due to more efficient ingestion than GPT-5), and of course rather expensive.

Haiku 4.5, if they can continue to offer it at such speeds, with such low latency and at this price, coupled with encouraging initial output quality and efficient ingestion of repos, seems to be designed in a far more balanced manner, which I welcome. Of course, 200k being a hard limit is a clear downside compared to GPT-5 (and Gemini 2.5 Pro, though that has its own reliability issues in tool calling), and I have yet to test whether it can go beyond 8 min on chains of tool calls with intermittent code changes without suffering similar degradation to other recent Anthropic models, but I am seeing the potential for solid value here.

[0] https://artificialanalysis.ai/?models=gpt-5-codex%2Cgpt-5-mi...

[1] Claude 4.5 Haiku 198.72 tok/sec 2382 tokens Time-to-First: 1.0 sec https://t3.chat/share/35iusmgsw9

[2] Claude 4.5 Haiku 197.51 tok/sec 3128 tokens Time-to-First: 0.91 sec https://t3.chat/share/17mxerzlj1

[3] Claude 4.5 Haiku 154.75 tok/sec 2341 tokens Time-to-First: 0.50 sec https://t3.chat/share/96wfkxzsdk
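
The tok/sec and time-to-first figures in [1][2][3] come from t3.chat, but something in the same spirit can be measured directly against the API with streaming. A rough sketch; the model ID is again an assumption and this won't exactly match t3.chat's methodology:

    # Rough sketch of measuring time-to-first-token and throughput over a
    # streamed response (Anthropic Python SDK). Model ID is an assumption,
    # and the numbers will depend on network, region, and load.
    import time
    import anthropic

    client = anthropic.Anthropic()
    start = time.monotonic()
    first_token_at = None

    with client.messages.stream(
        model="claude-haiku-4-5",  # assumed model ID
        max_tokens=2048,
        messages=[{"role": "user", "content": "Explain TCP slow start briefly."}],
    ) as stream:
        for _ in stream.text_stream:
            if first_token_at is None:
                first_token_at = time.monotonic()
        final = stream.get_final_message()

    elapsed = time.monotonic() - start
    out_tokens = final.usage.output_tokens
    print(f"TTFT: {first_token_at - start:.2f}s, "
          f"~{out_tokens / elapsed:.1f} tok/s over the whole response")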


> GLM-4.6 is great value but still not solid enough for tool calls, not that fast, etc., so if you can afford something more reliable I'd go for that, but it's encouraging.

Funny you should say that, because while it is a large model, GLM-4.5 is at the top of Berkeley's Function Calling Leaderboard [0] and has one of the lowest costs. I can't comment on speed compared to those smaller models, but the Air version of 4.5 is similarly highly ranked.

[0] https://gorilla.cs.berkeley.edu/leaderboard.html
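
For context on what BFCL actually scores: it's largely whether the model picks the right function and produces well-formed arguments against a schema, roughly like the sketch below. This uses the generic OpenAI-compatible client shape; the base URL and model name are placeholders, not verified Z.AI values:

    # Illustrative single-turn function-calling task of the kind BFCL scores:
    # pick the right tool and produce well-formed arguments.
    # base_url, api_key and model name are placeholders/assumptions.
    from openai import OpenAI

    client = OpenAI(base_url="https://example.invalid/v1", api_key="...")

    tools = [{
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the current weather for a city.",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }]

    resp = client.chat.completions.create(
        model="glm-4.6",  # assumed model name
        messages=[{"role": "user", "content": "What's the weather in Berlin?"}],
        tools=tools,
    )
    # A "passing" answer is a well-formed call like get_weather(city="Berlin").
    print(resp.choices[0].message.tool_calls)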


Gorilla is a great resource, and it isn't unreasonable to suspect Z.AI has it in their data sets. I'd suspect most other frontier labs do as well (pure speculation, but why not use it as a resource).

Problem is, while Gorilla was an amazing resource back in 2023 and continues to be a great dataset to lean on, most of the ways we use LLMs in multi-step tasks have since evolved greatly, not just with structured JSON (which the GorillaOpenFunctionsV2/v4 eval also covers, including multiple calls), but more with the scaffolding around models (Claude Code vs Codex vs OpenCode, etc.). That's likely why good performance on Gorilla doesn't necessarily map onto multi-step workloads with day-to-day tooling, which is what I tend to go for, and why, despite there being FOSS options already, most labs either built their own coding assistant tooling (and most open source that too) or feel the need to fork others' (Qwen with Gemini's repo).
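
To make that distinction concrete: the scaffolding is essentially a loop in which the model requests tools, gets results fed back, and decides the next step, repeated across many turns. A bare-bones sketch of that loop using the Anthropic tool-use flow; the model ID and the read_file tool are made up for illustration:

    # Bare-bones sketch of the multi-step tool loop coding agents run, as
    # opposed to a single scored function call. The model ID and the
    # read_file tool are illustrative assumptions.
    import anthropic

    client = anthropic.Anthropic()
    tools = [{
        "name": "read_file",
        "description": "Return the contents of a file in the repo.",
        "input_schema": {
            "type": "object",
            "properties": {"path": {"type": "string"}},
            "required": ["path"],
        },
    }]
    messages = [{"role": "user", "content": "Find and fix the bug in utils.py"}]

    while True:
        resp = client.messages.create(
            model="claude-haiku-4-5",  # assumed model ID
            max_tokens=4096,
            tools=tools,
            messages=messages,
        )
        if resp.stop_reason != "tool_use":
            break  # model is done; the final answer is in resp.content
        messages.append({"role": "assistant", "content": resp.content})
        results = []
        for block in resp.content:
            if block.type == "tool_use":
                # Run the requested tool locally (stand-in implementation).
                output = open(block.input["path"]).read()
                results.append({"type": "tool_result",
                                "tool_use_id": block.id,
                                "content": output})
        messages.append({"role": "user", "content": results})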

Purely anecdotal, but I evaluated GLM-4.6 on the same tasks as the other models via Claude Code with their endpoint, as that is what they advertise as the official way to use the model, for the same reason I use e.g. Codex for GPT-5. I'm more focused on best-case results than on e.g. using OpenCode for all models to give a more level playing field.



