Right? All high-quality coffee makers use a proper method, so there is absolutely zero downside to decaf. Just make sure to check which method they use (all the big ones state it on their website or elsewhere).
I’m building something similar with https://github.com/LabLeaks/special (apologies for the desultory slop-laden README, need to give that a lot more human attention) but I’ve gone in a slightly different direction: a “spec” is a product contract claim supported by attached tests that verify it. It’s a little Cucumber-y, if anyone remembers that, but a lot more flexible — you just write stuff like
@spec LINT_COMMAND.ORPHAN_VERIFIES
linter reports blocks that do not attach to a supported owned item.
Then
#[test]
// @verifies SPECIAL.LINT_COMMAND.ORPHAN_VERIFIES
fn rejects_orphan_verifies_blocks() {
    let block = block_with_path("src/example.rs", &["@verifies EXPORT.ORPHAN"]);
    let parsed = parse_current(&block);
    assert!(parsed.verifies.is_empty());
    assert_eq!(parsed.diagnostics.len(), 1);
    assert!(
        parsed.diagnostics[0]
            .message
            .contains("@verifies must attach to the next supported item")
    );
}
And then the CLI command “special specs” pulls your specs and all attached verification + test code so you (or your LLM) can analyze whether the (hopefully passing!) test actually supports the product claim.
There’s also a bunch of other code quality commands and source annotations in there for architectural design & analysis, fuzzy-checking for DRY opportunities, and general codebase health. But on the overall principle, this article is dead-on: when developing with LLMs, your source of truth should be in your code, or at least co-located with it.
There is no evidence of this. Evals are quite different from "self-evals". The only robust way of determining if LLM instructions are "good" is to run them through the intended model lots of times and see if you consistently get the result you want. Asking the model if the instructions are good shows a very deep misunderstanding of how LLMs work.
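The "run it lots of times" approach can be sketched roughly as below. This is an illustration only: `pass_rate` is a hypothetical name, `model` stands in for whatever call actually sends the prompt to the intended model, and `check` stands in for whatever test decides that an output is the result you want.

```rust
/// Run `model` on the same `prompt` `runs` times and return the fraction of
/// outputs accepted by `check`. Returns 0.0 when `runs` is zero.
fn pass_rate<M, C>(prompt: &str, runs: u32, mut model: M, check: C) -> f64
where
    M: FnMut(&str) -> String,
    C: Fn(&str) -> bool,
{
    if runs == 0 {
        return 0.0;
    }
    // Each iteration is an independent attempt with the identical prompt.
    let passes = (0..runs).filter(|_| check(&model(prompt))).count();
    passes as f64 / f64::from(runs)
}
```

A rate that stays high across many runs is the consistency signal described above. A single passing run says very little.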
When you give prompt P to model M, when your goal is for the model to actually execute those instructions, the model will be in state S.
When you give the same prompt to the same model, when your goal is for the model to introspect on those instructions, the model is still in state S. It's the exact same input, and therefore the exact same model state as the starting point.
Introspection-mode state only diverges from execution-mode state at the point at which you subsequently give it an introspection command.
At that point, asking the model to e.g. note any ambiguities about the task at hand is exactly equivalent to asking it to evaluate any input, and there is overwhelming evidence that frontier models do this very well, and have for some time.
Asking the model, while it's in state S, to introspect and surface any points of confusion or ambiguities it's experiencing about what it's being asked to do, is an extremely valuable part of the prompt engineering toolkit.
I didn't, and don't, assert that "asking the model if the instructions are good" is a replacement for evals – that's a strawman argument you seem to be constructing on your own and misattributing to me.
> At that point, asking the model to e.g. note any ambiguities about the task at hand is exactly equivalent to asking it to evaluate any input
This point is load-bearing for your position, and it is completely wrong.
Prompt P at state S leads to a new state SP'. The "common jumping off point" you describe is effectively useless, because we instantly diverge from it by using different prompts.
And even if it weren't useless for that reason, LLMs don't "query" their "state" in the way that humans reflect on their state of mind.
The idea that hallucinations are somehow less likely because you're asking meta-questions about LLM output is completely without basis.
Nicely put. I haven't seen anyone say that the introspection abilities of LLMs are up to much, but claiming that it's completely impossible to get a glimpse behind the curtain is untrue.
Is that based on your "deep understanding" of how LLMs work or have you actually tried it? If you watch the execution trace of a Skill in action, you can see that it's doing exactly this inspection when the skill runs - how could it possibly work any other way?
Skills are just textual instructions, and LLMs are perfectly capable of spotting inconsistencies, gaps and contradictions in them. Is that sufficient to create a good skill? No, of course not, you need to actually test them. To use an analogy, asking an LLM to critique a skill is like running lint on C code first to pick up egregious problems; running test cases is still vital.
You could introduce teleportation boots to humanity and within a few weeks we'd be complaining that sometimes we still have to walk the last 20 meters.
Contrary to popular opinion, the "verbosity void" of the article is not a hopeless expanse without any useful information at all; it just has a much lower density than average.
nyuk nyuk
Anyhow it barely touched on dark matter... Like, are the voids themselves where the dark matter is, or is it spread out, um, orthogonally?
Along these lines, I’m working on a tool called Spotless[0] that takes a more HTTP proxy-based approach to make statefulness something the agent doesn’t have to worry about. It directly reads & writes to the messages array going to and from Anthropic, so you don’t have to rely on the agent calling an MCP or using a skill. Still buggy and early, but it’s definitely a very interesting way of working with agents.
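For readers wondering what "writes to the messages array" could look like, here is a rough sketch. It is not Spotless's implementation: the `Message` struct is a simplified stand-in for a chat message (role plus plain-text content), `inject_notes` is a hypothetical name, and prepending notes to the first user message is just one assumed strategy a proxy might use.

```rust
/// Simplified stand-in for one entry of a chat "messages" array.
#[derive(Debug, Clone, PartialEq)]
struct Message {
    role: String,
    content: String,
}

/// Prepend remembered notes to the first user message, in place.
/// Returns true if a message was rewritten. (Hypothetical helper.)
fn inject_notes(messages: &mut [Message], notes: &[&str]) -> bool {
    if notes.is_empty() {
        return false;
    }
    match messages.iter_mut().find(|m| m.role == "user") {
        Some(first_user) => {
            let preamble = notes.join("\n");
            first_user.content = format!("{}\n\n{}", preamble, first_user.content);
            true
        }
        // No user message to attach the notes to; leave the array untouched.
        None => false,
    }
}
```

Because the rewrite happens on the request itself, the agent never has to decide to call a tool to recall state, which is the appeal of the proxy approach.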
That and Brooks’ underrated “The Design of Design” are notable for having an almost impossible density of quotable aphorisms on every page. They’re all so relevant today that it’s hard to believe that he’s talking about problems he faced half a century ago.
Never heard of "The Design of Design" but I bought it off this comment chain.
I think our industry would do well to take a moment, breathe, and understand what we have collectively done since its inception. I often wonder whether, thousands of years from now, people will look back on the heavily corporatized era of our industry as its dark ages. The idea that private enterprise should shape the direction of our industry is deeply problematic; there needs to be a public option, and I doubt many devs would disagree.