So if I claim I am a communist who doesn't want to ever get rich and then someone dangles a billion shiny dollars in front of me to just simply grab and own, you think I'd still be a communist then?
If you go around saying “I’m a communist, I believe in communism, I think it’s very important that we establish communism”? Sure, absolutely. Engels was pretty rich.
Replace the cash with Apple or some other trillion dollar corporation and you're given the CEO's seat and voting control on the BoD. Can I be Tim Cook and preach communism and expect anyone to believe it?
The other "cheating" examples are even worse. It's wild to me that people keep designing benchmarks where the answer is lying around on disk or in the git history. "Hardening" the benchmark with strongly worded prompt instructions is bizarre. There are so many agent sandbox solutions. Why not use one and give it only access to the code it should see?
And I'm not sure how they can rule out other solutions also benefiting from being in the training data, just not reproduced exactly. Seems like it should focus on only CVEs from the last 30 days or something.
To be fair, it is good to know that it disobeys simple instructions like "don't examine my git history" far more than other models. (It should of course be a different benchmark, so as not to conflate things.)
Obviously they could just delete .git for their test if they wanted to. But consider telling the LLM not to use git commands the same as if you have keys in a .env file, and you tell the LLM not to read it, you might be concerned.
The user asks for details of the last transaction, the user gets back the amount, the source, and the description in a safely quoted format with the LLM never reading it.
You can't inject the LLM if it doesn't see the data.
An architecture like this won't work in many situations, but it can work for a lot of simple questions.
And if you want the LLM to summarize things, you run an isolated instance that makes a summary and you never show that summary to the LLM that's following the user's instructions.
You can do this, it is useful, but it's just not the same as where the goalposts are now which is: the AI is a person in a box and can do everything a person can.
If we actually limit them to "only accepts tiny ultra well defined problems and ultra well defined outputs" then theycease being a $10T/year idea and become a merely $10B/year idea.
> The user asks for details of the last transaction, the user gets back the amount, the source, and the description in a safely quoted format
What's "safely quoted format" when prompt injection is already safe in the description?
> You can't inject the LLM if it doesn't see the data.
How doesn't it see the data when you literally say "The user asks for details of the last transaction, the user gets back the amount, the source, and the description"?
> And if you want the LLM to summarize things, you run an isolated instance that makes a summary
> How doesn't it see the data when you literally say "The user asks for details of the last transaction, the user gets back the amount, the source, and the description"?
The above post said how. The LLM writes code to do it. The code has a function to send text to the user. The LLM is not allowed to see the text, only the user is.
> And it will make a summary exactly how?
The second summarizing-only LLM is fed the raw data and allowed to output summary text. This is then sent directly to the user and put in a box with some hazard lines on it. The main LLM is not allowed to see the summary, only the user is.
This is very reminiscent of the "everyone's a Russian bot" era of social media, where everyone would just lob that accusation at people without any real proof.
Neat. The frontier models have gotten pretty impressive, but they're all a bit too slow for interactive, human-in-the-loop coding. It incentivizes vibecoding and running multiple agents in parallel. A fast agent feels more like a partner.
For a while I was running Cerebras GLM 4.7 for a bunch of tasks. Not a very smart model, but it's fantastic to be have a live prototype of a site up and be able to type "make the fonts bigger. No not that big" and see it change in real time. And MiMo 2.5 is a lot more capable than GLM 4.7.
i tried glm 4.7 for agents that write code. simple scripts 200-1000 LOC. extremely bad . Had to abandon cerebras oferning, their smart models are only on enterprise plan.
"Lying" is not supported by the evidence. In the context of bot traffic on the web, looking at only GETs for HTML is a reasonable approach. If you're counting all requests for all assets then a single page view of nytimes.com would count 100x as much as one for HN.
I would assume a lot of people running websites tend to think in pageviews, especially when dealing with bots because images and CSS files tend to be "cheap" static content but HTML requests are often dynamically generated.
It's also a single tweet that links to the data used to "disprove" it. Would be a weird way to lie.
reply