That's what I thought when I was starting out, and it works so poorly that I think they should remove it from their docs. You can enforce a schema by creating a tool definition whose JSON schema is the exact shape you want the output in, then setting "tool_choice" to "any" (rough sketch below). They have a picture in the docs that helps.
Unfortunately it doesn't support the full JSON Schema spec. You can't use unions or do other things you would expect. It's manageable, since you can just create another tool for the model to choose from that fits the other case.
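Roughly, a call looks like the sketch below (the tool name, schema, and model id are placeholders, not anything official):

import anthropic

client = anthropic.Anthropic()

# The input_schema is plain JSON Schema, written in the exact shape you want back.
record_person = {
    "name": "record_person",
    "description": "Record a person's details in a fixed shape.",
    "input_schema": {
        "type": "object",
        "properties": {
            "name": {"type": "string"},
            "age": {"type": "integer"},
        },
        "required": ["name", "age"],
    },
}

response = client.messages.create(
    model="claude-haiku-4-5",              # placeholder model id
    max_tokens=1024,
    tools=[record_person],
    tool_choice={"type": "any"},           # force the model to call some tool
    messages=[{"role": "user", "content": "Alice is 30 years old."}],
)

# The structured output comes back as the tool call's input.
tool_use = next(block for block in response.content if block.type == "tool_use")
print(tool_use.input)                      # e.g. {"name": "Alice", "age": 30}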
LLM 'neurons' are not single-input/single-output functions. Most 'neurons' are matrix-vector computations that combine the products of dozens or hundreds of prior activations and their weights.
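A toy sketch of what one such 'neuron' computes (made-up sizes, NumPy just for brevity):

import numpy as np

prev_activations = np.random.rand(768)   # outputs of every unit in the prior layer
weights = np.random.randn(768)           # one row of this layer's weight matrix
bias = 0.1

# A single "neuron" output combines all 768 prior values in one dot product,
# i.e. one row of the layer's matrix-vector multiply, followed by a nonlinearity.
neuron_output = np.maximum(0.0, weights @ prev_activations + bias)
print(neuron_output)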
In our lane the only important question to ask is "Of what value are the tokens these models output?", not "How closely can we emulate an organic brain?"
Regarding the article, I disagree with the thesis that AGI research is a waste. AGI is the moonshot goal. It's what motivated the fairly expensive experiment that produced the GPT models, and we can point to all sorts of other harebrained goals that ended up producing revolutionary changes.
You can address the issue by putting the report and the code base in a sandbox with an agent that tries to reproduce it. If the agent can't reproduce it, that should count as a strike against the reporter. OSS projects should absolutely ban accounts that repeatedly file reports of such low quality that they can't be reproduced. IMO the HackerOne reputation mechanism is a good idea because it incentivizes users who operate in good faith and can consistently produce findings.
Sandbox a third AI that just bets on AI stocks and crypto. Add a fourth AI to check the third AI's bets, and a fifth one to go on forums and pump the relevant equities. A sixth AI can short sell when the fourth AI gets overheated.
(1) JSON requires lots of escape characters (plus hex/unicode escapes) that mangle the strings, and (2) it's much easier for the model's attention to track where a semantic block begins and ends when it's wrapped in the name of that section:
<instructions>
...
...
</instructions>
can be much easier than
{
"instructions": "..\n...\n"
}
especially when there are newlines, quotes, and unicode (quick demo of point (1) below).
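Here's the mangling, using Python's json module (the example string is arbitrary):

import json

instructions = 'Say "hello".\nThen print: café → done\n'

# Embedded in JSON: quotes and newlines get escaped, non-ASCII becomes \uXXXX.
print(json.dumps({"instructions": instructions}))
# {"instructions": "Say \"hello\".\nThen print: caf\u00e9 \u2192 done\n"}

# Wrapped in a tag: the content stays readable as-is.
print(f"<instructions>\n{instructions}</instructions>")
# <instructions>
# Say "hello".
# Then print: café → done
# </instructions>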
Thanks for the reply, that part about the model's attention is pretty interesting!
I would suspect that a single attention layer won't be able to figure out which token an opening-bracket token should attend to most. Think of
{"x": {"y": 1}}: with only one layer of attention, can the token for the first opening bracket successfully attend to exactly the matching closing bracket?
I wonder if RNNs work better with JSON or XML. Or maybe they are just fine with both, because an RNN can keep some stack-like internal state that matches brackets?
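The stack version in plain Python, just to show the kind of state the matching needs (nothing architecture-specific):

def match_brackets(s):
    pairs = {}
    stack = []
    for i, ch in enumerate(s):
        if ch == "{":
            stack.append(i)          # remember where each block opened
        elif ch == "}":
            pairs[stack.pop()] = i   # pair it with the most recent open bracket
    return pairs

print(match_brackets('{"x": {"y": 1}}'))   # {6: 13, 0: 14}

That push/pop bookkeeping is exactly the kind of state the question above is asking about.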
It would probably be a really cool research direction to measure how well Transformer-Mamba hybrid models like Jamba perform on structured input/output formats like JSON and XML and compare them. For the LLM era, I could only find papers that run this evaluation on transformer-based LLMs. Damn, I'd love to work at a place that does this kind of research, but I guess I'm stuck with my current boring job for now :D Born to do cutting-edge research, forced to write CRUD apps with some "AI sprinkled in". Anyone hiring here?
90% as good as Sonnet 4 or 4.5?
OpenRouter just started reporting numbers, and it shows Haiku at roughly 2x the throughput (125 tps vs 60 tps) and 2-3x lower latency (1 s vs 2-3 s).
Sonnet 4.5 is an excellent model for my startup's use case. Chatting with Haiku, it looks promising too, and it may be a great drop-in replacement for some of the inference tasks that have a lot of input tokens but don't require 4.5-level intelligence.
I think a lot of people judge these models purely off of what they personally want to use for coding and forget about enterprise use. For white-label chatbots that use completely custom harnesses + tools, Sonnet 4.5 is much easier to work with than GPT-5. And like you, I was really pleased to see this release today. For our usage, speed/cost matter more than pure model IQ above a certain threshold. We'll likely switch over to Haiku 4.5 after some testing to confirm it does what it says on the tin.
There is a deep literature on this in the High Performance Computing (HPC) field, where researchers traditionally needed to design simulations to run on hundreds to thousands of nodes with up to hundreds of CPU threads each. Computation can be defined as a dependency graph at the function or even variable level (depending on how granular you can make your threads). Languages built on top of LLVM, or interpreters that expose an AST, can get you a long way there.
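A minimal sketch of the dependency-graph idea using only the standard library (toy tasks, arbitrary pool size):

from concurrent.futures import ThreadPoolExecutor
from graphlib import TopologicalSorter

def load_a():   return 2
def load_b():   return 3
def multiply(): return results["load_a"] * results["load_b"]
def report():   return f"product = {results['multiply']}"

# Each task lists the tasks it depends on; independent tasks can run in parallel.
deps = {"load_a": [], "load_b": [], "multiply": ["load_a", "load_b"], "report": ["multiply"]}
funcs = {"load_a": load_a, "load_b": load_b, "multiply": multiply, "report": report}
results = {}

ts = TopologicalSorter(deps)
ts.prepare()
with ThreadPoolExecutor(max_workers=4) as pool:
    while ts.is_active():
        ready = list(ts.get_ready())                  # all tasks whose dependencies are done
        for name, value in zip(ready, pool.map(lambda n: funcs[n](), ready)):
            results[name] = value
            ts.done(name)

print(results["report"])   # product = 6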
I disagree with this model because it assumes processing occurs at a point and memory is (optimally) distributed across the space around it in every direction, analogous to a von Neumann CPU architecture. However, it is entirely possible to co-locate compute with memory. For example, Samsung has a technology called PIM (Processing in Memory) where simple compute units are inserted inside the HBM memory layers. Algorithms that can take advantage of this run much faster and at much lower power because they skip the bus entirely. More importantly, the compute scales in proportion to the memory size/space.
The article says exactly this in bold at the bottom:
> If you can break up a task into many parts, each of which is highly local, then memory access in each part will be O(1). GPUs are already often very good at getting precisely these kinds of efficiencies. But if the task requires a lot of memory interdependencies, then you will get lots of O(N^⅓) terms. An open problem is coming up with mathematical models of computation that are simple but do a good job of capturing these nuances.
https://docs.claude.com/en/docs/agents-and-tools/tool-use/im...