- Avoiding building something that turns the universe to paper clips in order to satisfy a prompt is a problem they are genuinely struggling with now.
- They do it by spying on the words generated during CoT. "I can do this quickly by turning the Universe into paper clips. Wait - they won't like that. But there is no need to mention it." - SMACK!
- But you can speed things up immensely (3 orders of magnitude!) by skipping the output layer (and I guess compressing the context window / KV cache, otherwise 3 orders of magnitude seem impossible) which would give someone who pulled it off a huge advantage.
- Downside is humans can't see the CoT anymore, so they can't see what the machine is planning. Keeping the final output layer to spy doesn't work because the model uses its hidden reasoning to sanitise it.
Because it doesn't work like how you think at all. You're still thinking it works like Chain of Thought. It doesn't. And the difference is key!
It works by introducing probabilistic noise, and exploring N paths fully (each with noise) in parallel (all compressed).
It's reasoning at a much, much smaller (probabilistic) level than running everything through the expensive large model (deterministic) and sometimes catching that it said, "I think 1.12 is greater than 1.9 because 12 is bigger than 9, final answer".
The easiest way to think about it is: if you understand how hyper words work, it's as if it's searching for different versions of the hyper words that probabilistically would lead to better outcomes IF it fed them to the LLM, before it ever does.
That's not actually how it works exactly. But I think it is close enough to be helpful to understand where the gain is, a rough idea of what's happening (searching paths), and how it can potentially have huge orders of magnitude improvements (doing so without paying the full price of exploring the paths through the expensive and huge model).
And also why it is so much harder to determine what it's "thinking".
The general idea is that a token is a multidimensional vector to represent a word -> think like "man" is a [noun, singular, English, pronoun, masculine, contemporary, ...]. Each time it sees a new word, it mutates this word into some new token (often never before seen) that means something. That's how it can roll up a 1M line context into a shorter context, and somehow keep most of the meaning. Because it mutates all the words into different words that individually mean nothing, but when put next to each other represent the thing you likely want to do, that the LLM can somehow make sense of.
Similarly, GRAM operates entirely in a latent space that doesn't mean anything to us, but it's able to predict N different full paths WITHOUT actually exploring them fully through the LLM before it sends the one it "thinks" is best to the LLM.
If you understand how hyper words work, you can understand the noise injection... It's like it's saying, if instead of the user saying "The quick round fox" it said "The quick brown fox" -> I could probably give a response that's more like the answer they want. It's obviously far more sophisticated in the ways it can help than just a simple typo.
Something may have pushed a hyper word for "man" to somehow become a lot more like "woman", and GRAM allows it to look at the different hyper words and say... Hmm... Maybe if I changed this one gender dimension over here on this one word, maybe the entire outcome would be dramatically better. Let's try it!
Standard models compute these "hyper words" internally but immediately decode them into human language text tokens to form a Chain of Thought. Once decoded into a rigid real word, the multidimensional nuance of the continuous vector is lost!
Hyper words are the exact thing that makes LLMs able to actually be smart! They can add so much more meaning to a word than a human ever could imagine - try to put 10,000 dimensions on the word "the"... Forcing them to decode them back into our dumb, un-contextualized, rudimentary language and losing all the valuable information they have - just so we can inspect it - OBVIOUSLY makes them enormously less intelligent!
It's like if we forced your eyeballs to turn everything they saw into words, before feeding it to your optic nerves, just so your optic nerves could check that you didn't see something harmful, before they sent the words to your brain... Instead of just sending light signals directly.
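To make the "decoding loses information" point concrete, here is a toy sketch. Everything in it is made up for illustration: the three-word VOCAB, the 3-dimensional vectors, and the decode and dist2 helpers are hypothetical and have nothing to do with how any real model stores its embeddings. It only shows that snapping a continuous vector to the nearest word throws away whatever sat between the words.

```rust
/// Toy vocabulary: each word gets a made-up 3-dimensional embedding.
/// Real models use thousands of dimensions; these values are invented.
const VOCAB: [(&str, [f32; 3]); 3] = [
    ("man", [1.0, 0.0, 0.0]),
    ("woman", [0.0, 1.0, 0.0]),
    ("the", [0.0, 0.0, 1.0]),
];

/// Squared Euclidean distance between two vectors.
fn dist2(a: &[f32; 3], b: &[f32; 3]) -> f32 {
    a.iter().zip(b.iter()).map(|(x, y)| (x - y) * (x - y)).sum()
}

/// Decode a continuous vector to the nearest vocabulary entry,
/// returning the word and the embedding it snaps to.
fn decode(v: &[f32; 3]) -> (&'static str, [f32; 3]) {
    let mut best = VOCAB[0];
    for entry in VOCAB.iter() {
        if dist2(v, &entry.1) < dist2(v, &best.1) {
            best = *entry;
        }
    }
    best
}
```

Feeding it a hidden vector such as [0.7, 0.6, 0.1] decodes to "man", and the fact that the vector was also most of the way towards "woman" is gone once only the word is kept.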
Just paying the fine without arguing on time gets you a 50% discount in state voting. And it's a token fine - $20 or so from memory for federal ballots. Besides you don't have to vote. The requirement is you turn up, or give them a piece of paper if you post it. This is deemed so important when we designed our own voting machines (which were never deployed), they had an explicit "I decline to vote" option.
The paper can be blank, but people are often more imaginative. I can't find the reference to it now, but one paper had penises of different lengths drawn beside each option. The Australian Electoral Commission is required by law to "save" votes, which means that even if it wasn't marked strictly according to the rules if a reasonable person could infer the intention, it counted. This particular vote worked its way through the courts, where it was eventually struck down. Reason: it was impossible to know if a longer penis meant it was more or less favourable to the candidate.
About 8% of ballots can't be saved. Of those, about 2% are deliberately spoilt - the rest are mistakes. If Vanessa Teague's voting machines (with the decline button) had been deployed, the remaining 6% would have gone away.
I get the impression the hive mind hasn't come to terms with the point that a model is optimised for certain tasks. It's like having someone ask you "is that a good hammer?". Good for what? There are claw hammers, sledgehammers, ball-peen hammers, club hammers, mallets, .... Yes, in a pinch, they can all bang in nails, but you wouldn't choose a dead blow hammer for that if you had a choice.
Gemini Flash is very good at searches. Just about any low end model can toss out a poem. All the higher end models (open source and otherwise) seem to be able to churn out code that passes tests. The smaller, "less capable" ones are much faster at it, which means in the hands of a skilled practitioner they are the best choice for that task. But they rapidly fall apart where there isn't a hard source of truth (like a good test suite) to grind against. Because of that you have to use a bigger model for bug finding. In that task the open source models tend to fail on larger code bases, where something like Opus still shines. I gather Mythos is an absolute monster, and unparalleled, and unavailable. I'm sure one of the reasons for that is it's so expensive to run.
Or to put it another way - you don't use a 100 tonne crane to pick up the shopping. And ... the smaller models will happily run on in-house hardware. You may not do it today because of the current DRAM price and integrated NPUs have just started shipping, but in 5 years time models will be running on your phone.
Yes exactly, we will have specialized models soon. These will be trained with a plugin architecture, with a core reasoning model asking plugin models to do stuff on its behalf. I don't need Chinese or Russian knowledge in my workflow.
    assert!(a.len() >= 32);
    for i in 0..32 {
        a[i] = 0;
    }
Or:
    for i in 0..std::cmp::min(a.len(), 32) {
        a[i] = 0;
    }
I confess I hadn't thought about the implications of any of this before reading the article. If you need to squeeze the last 10% of performance out of your code, I'd consider it required reading.
As for the speed comparisons with C++, the OP says at the end that if you tell the C++ compiler to be as strict as Rust using "-D_FORTIFY_SOURCE=3 -fsanitize=bounds,object-size" and a hardened STL, then it slows to below Rust speeds for the same safety unless you use the same techniques.
It's a shame the other optimisation techniques you need to bring Rust in line with C++ aren't as easy to apply.
If a.len() == 16, the indexed loop writes a[0]..a[15] and then panics at a[16]. By contrast, both assert!(a.len() >= 32); and a[0..32].iter_mut().for_each(|el| *el = 1) fail before any writes occur: the former at the explicit assertion, the latter while creating the a[0..32] subslice. That difference is observable if the panic is caught, and the panic location/message may also differ. This is why these are valid manual rewrites only when the intended precondition is "the slice has length at least 32," not generally valid compiler rewrites of the original loop.
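A minimal sketch of that observable difference, assuming the default panic = "unwind" setting. The function names (zero_indexed, zero_asserted, zeros_written_before_panic) are mine, invented for the illustration; only std::panic::catch_unwind and AssertUnwindSafe are real library items.

```rust
use std::panic::{self, AssertUnwindSafe};

/// Indexed loop: every a[i] is checked on its own, so writes
/// happen up to the first out-of-range index.
fn zero_indexed(a: &mut [u8]) {
    for i in 0..32 {
        a[i] = 0;
    }
}

/// Precondition stated up front: a short slice panics at the
/// assertion, before any element is written.
fn zero_asserted(a: &mut [u8]) {
    assert!(a.len() >= 32);
    for i in 0..32 {
        a[i] = 0;
    }
}

/// Runs `f` on a 16-byte buffer filled with 0xFF, catches the
/// panic, and counts how many bytes were zeroed before it fired.
fn zeros_written_before_panic(f: fn(&mut [u8])) -> usize {
    let mut buf = [0xFFu8; 16];
    let result = panic::catch_unwind(AssertUnwindSafe(|| f(&mut buf[..])));
    assert!(result.is_err()); // both versions panic on a 16-byte slice
    buf.iter().filter(|&&b| b == 0).count()
}
```

Calling zeros_written_before_panic(zero_indexed) reports 16 zeroed bytes, while zeros_written_before_panic(zero_asserted) reports 0. The panic messages printed to stderr differ too.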
The GitHub issue discussion is directly about these concerns and discusses whether bounds checks may fail early, whether intermediate writes are observable after catch_unwind and whether panic behavior must be preserved.
> The GitHub issue discussion is directly about these concerns and discusses whether bounds checks may fail early, whether intermediate writes are observable after catch_unwind and whether panic behavior must be preserved.
No argument about the point of the issue. But this is a discussion about the relative efficiency of C, C++ and Rust. My point is there is a way in Rust to say "I don't care about observable writes, hoist the bounds check out of the loop", so that the efficiency is the same.
Admittedly, it's not part of the language definition. You're relying on intimate knowledge of how the optimiser works. In fact, you are probably pasting the code into godbolt, and looking at the assembler produced. But if you care about cycles that much, that's true for all three languages.
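For what it's worth, here is a sketch of two other ways to spell "check once, then write" in source. The function names are hypothetical; slice indexing with a range and the slice fill method are standard library. Whether the per-element checks actually vanish is up to the optimiser, so this is the kind of thing you would still confirm on godbolt rather than take on faith.

```rust
/// Take the subslice once. The range check happens on this line
/// and panics before any write if a.len() < 32.
fn zero_first_32(a: &mut [u8]) {
    let head = &mut a[..32];
    for x in head.iter_mut() {
        *x = 0;
    }
}

/// Same precondition, using the standard slice `fill` method.
fn zero_first_32_fill(a: &mut [u8]) {
    a[..32].fill(0);
}
```

Both versions leave elements past index 31 untouched, and both have "fail before writing" semantics, so they are only equivalent to the indexed loop when the slice really does have at least 32 elements.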
That's relevant if we're talking about the compiler automatically rewriting the code, but the chances are if you're writing this code yourself that the array will always have >= 32 elements.
> It is not worth switching to Pi except as a hobbyist.
Permit me to paraphrase slightly. "It is not worth switching to Linux except as a hobbyist. Something that is overlooked: the mainstream OSs have a huge advantage ....".
You are in good company. In 1999, Bill Gates confidently dismissed Linux as a threat, arguing it lacked the central control, features, and graphical interface needed to compete in the commercial market.
Back to the article, quoting:
> Pi might be built with Pi, but we’re quite far off today from where Bun and OpenClaw already are: fully detached, automated software engineering.
Please don't call it software engineering. I've been programming for 40 years, and most of that time had to put up with the derision from the other engineering disciplines: "If civil engineering built things like software engineers, the first woodpecker that came along would destroy civilisation". It hurt because it was true. It's still often true for things like web pages, but for the things I use like Linux and vim, it hasn't been true for a long, long while. We have finally mastered how to repeatedly build solid, reliable software.
Which is why I'm an Anthropic refugee. Opus is definitely the best for coding, but claude-cli + bun is the most unreliable piece of crap I've had the misfortune to come across in a while. Sadly I can't afford their API pricing, so either my principles or Opus had to give. I went to pi and an open-source model. The difference between the top open-source models and Opus is noticeable, but not drastic, unlike the difference between pi and claude-cli.
pi has proved to be solid, fast, have a transparent design, and be customisable in the old Linux way ("do one thing, and do it well"). I pray that will never change.
> But this argument that Rust's memory management isn't more cognitively demanding than Go's memory management --- that isn't true.
It's not far from true. The fights you get into with the borrow checker can be legendary, but lifetimes serve more as gentle reminders. If you get stuck, you can always just use Rc, which is pretty close to opt-in GC. But it's rare to have to resort to Rc, because ownership is just not that much of a problem. In fact, I very rarely use Box either. All heap memory allocation is done by containers, not manually by me. I guess the main friction point for lifetimes is Rust's closures and async, but if you avoid them life is pretty simple.
In return for wearing this near non-problem, you almost never have to think about releasing a whole pile of other things - like closing files, sockets, and locks. They are guaranteed to be released by the same mechanism.
On balance, I would not be surprised if the cognitive balance tips Rust's way once you allow for the fact that Rust's memory management also gives you robust resource management for free.
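A small sketch of what "released by the same mechanism" looks like. Resource is a hypothetical stand-in for a file, socket, or lock guard, and the shared log exists only so the release order can be observed; Drop, Rc and RefCell are the real standard library pieces.

```rust
use std::cell::RefCell;
use std::rc::Rc;

/// Hypothetical stand-in for a file, socket, or lock guard.
struct Resource {
    name: &'static str,
    log: Rc<RefCell<Vec<String>>>,
}

impl Drop for Resource {
    /// Runs automatically when the value goes out of scope.
    fn drop(&mut self) {
        self.log.borrow_mut().push(format!("released {}", self.name));
    }
}

fn use_resources(log: Rc<RefCell<Vec<String>>>) {
    let _file = Resource { name: "file", log: Rc::clone(&log) };
    let _lock = Resource { name: "lock", log: Rc::clone(&log) };
    log.borrow_mut().push("working".to_string());
    // No explicit close or unlock: both values are dropped here,
    // in reverse order of declaration.
}

/// Returns the events recorded while `use_resources` ran.
fn release_order() -> Vec<String> {
    let log = Rc::new(RefCell::new(Vec::new()));
    use_resources(Rc::clone(&log));
    let events = log.borrow().clone();
    events
}
```

release_order() yields "working", then "released lock", then "released file". The Rc here is also the "opt-in GC" escape hatch mentioned above: the log is freed when the last clone goes away.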
> But US unions seem to exist nearly exclusively to protect people who don’t want to work.
It's a similar dynamic here in Australia, but it seems not in Germany and Japan.
I think the difference is that in the USA and Australia the unions are organised on a craft or occupational basis. You see this in the sorts of laws they want passed - they are invariably after laws stipulating that only union members can do a certain job. It isn't always so direct. Doctors for example insist on certain qualifications, which seems fine and necessary at first glance, but then the doctors' associations somehow manage to gate the institutions that can issue the qualifications. It's interesting how doctors have no trouble with that, but computer engineers can't bring themselves to do it.
Anyway, the outcome of "guild based unions" ends up being what others here noted. Instead of everyone in the same firm cooperating to get work done, they are all fighting to preserve their patch. In Germany and Japan, unions look to be organised around large companies. If the company goes broke the union disappears with it, so the company's incentives and the union's are more aligned. So much so that the union reps are given seats on the board, and are expected to make a contribution. The unions are still focused on ensuring the distribution of profits goes more towards the employees than the shareholders of course, but nonetheless the outcome seems to be far less dysfunctional than the USA and Australian systems.
He's got control of the executive branch. So he's safe, for now.
His plans for when someone else inevitably has control of the executive branch boggle the mind. Does he have no foresight whatsoever?
I've come across others who live only in the present. I recall an inmate breaking out of jail a week before his release because his sister was sick and needed help. They are a rare breed. Life can't be easy for them. I would have thought it was impossible for someone like that to ascend to the level of President.
Yet even now, when a disaster is looming in the midterms, he manages to keep the Republican party cowed. It is something to behold. I still have no idea how he does it. Do these people have no sense of self-preservation?
- Avoiding building something that turns the universe to paper clips in order to satisfy a prompt is a problem they are genuinely struggling with now.
- They do it by spying on the words generated during CoT. "I can do this quickly by turning the Universe into paper clips. Wait - they won't like that. But there is no need to mention it." - SMACK!
- But you can speed things up immensely (3 orders of magnitude!) by skipping the output layer (and I guess compressing the context window / KV cache, otherwise 3 orders of magnitude seem impossible) which would give someone who pulled it off a huge advantage.
- Downside is humans can't see the CoT anymore, so they can't see what the machine is planning. Keeping the final output layer to spy doesn't work because the model uses its hidden reasoning to sanitise it.
How can this possibly go wrong?