Exactly! In https://matrix.dev/blog-2026-04-04-2.html#questions-this-rai..., we raised exactly the same concerns. In particular, we actually saw that a hot swap caused a 100% cache miss. If it's a session filled with 800k tokens, rebuilding the cache is very expensive.
Also looking back at their claim: "Token counts may include tokens added automatically by Anthropic for system optimizations. You are not billed for system-added tokens. Billing reflects only your content."
A/B testing sounds like a different thing, though. Do they really count it as "system-added tokens" and not charge for the extra cost? If you treat the model you requested as the baseline, then yes. But technically it's an A/B test of a different model, so they could quietly charge 130% on the grounds that "we didn't add any system prompt, we just routed you to a better model."
Great example of why people are yearning for CSS in TypeScript. Something as simple as if() only works in Chrome, and CSS has no good shim story the way a more complete language does, so you end up with this:
> The problem: CSS can compute a number – 0 for visible and 1 for hidden – but you can’t directly use that number to set visibility. There is a new feature coming to CSS that solves this: if(), but right now it only just shipped in Chrome.
> So I used a trick called type grinding. You create a paused animation that toggles visibility between visible and hidden. Then you set the animation-delay based on the computed value to determine which keyframe is used:
> A negative animation delay on a paused animation jumps to that point in the timeline. So a delay of 0s lands in the visible range, and -0.5s lands in the hidden range. It’s a hack, but a functional one. When CSS if() gets wider support, we can replace this with a clean conditional.
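For anyone who hasn't seen the trick, here is roughly what it looks like. This is my own minimal sketch, not the article's code: the `.item` selector, the `--hidden` custom property, and the `toggle-visibility` keyframe name are all made up for illustration, and I'm assuming a 1s animation so that -0.5s lands exactly on the halfway keyframe.

```css
/* First half of the timeline is visible, second half is hidden. */
@keyframes toggle-visibility {
  0%, 49.9% { visibility: visible; }
  50%, 100% { visibility: hidden; }
}

.item {
  /* Hypothetical input: 0 means visible, 1 means hidden.
     In practice this would be computed elsewhere with calc(). */
  --hidden: 0;

  /* Paused, so the animation never advances on its own. */
  animation: toggle-visibility 1s linear both paused;

  /* A negative delay jumps into the timeline:
     0 * -0.5s = 0s (visible range), 1 * -0.5s = -0.5s (hidden range). */
  animation-delay: calc(var(--hidden) * -0.5s);
}
```

The number never gets converted to a keyword directly; it only picks a point on a frozen timeline, and the keyframe at that point supplies the keyword.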
I think Anthropic just heavily RLs its models to work best with Claude Code's particular way of doing things.
All the background capability Claude Code now has makes things way more complex, and I saw a meaningful improvement with 4.6 versus 4.5, so I imagine other harnesses will take time to catch up.