> My job went from connecting these two things being the hard and reward part, to just mopping up how poorly they’ve been connected.
That’s only half of the transition.
The other half - and how you know you’ve made it through the “AI sux” phase - is when you learn to automate the mopping up. Give the agent the info it needs to know whether it did good work - and if it didn’t, give it enough information to know what to fix. Trust that it wants to fix those things. Automate how that info is provided (using code!) and suddenly you are out of the loop. The amount of code needed is surprisingly small, and your agent can write it! Hook a few hundred lines of script up to your harness at key moments and you will never see dumb AI mistakes again (because your script told the agent about the mistakes while you were off doing something else, and it fixed them before presenting the work to you).
Think of it like linting but far more advanced - your script can walk the code AST and assess anything, or use regex; your agent will make that call when you ask for the script. If the script exits with code 2, stderr is shown to the agent! So you (via your script) can print to stderr what the agent did wrong - what line, what file, what mistake.
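As a minimal sketch of the kind of hook script this describes (the bare-`except:` rule and file-path argument are made up for illustration; the exit-code-2/stderr convention is the one the comment above describes for its harness):

```python
#!/usr/bin/env python3
# Hypothetical hook script: walk the AST of a changed file, flag bare
# `except:` clauses, and report them to the agent via stderr + exit code 2.
import ast
import sys

def find_bare_excepts(source: str) -> list[int]:
    """Return line numbers of `except:` handlers with no exception type."""
    tree = ast.parse(source)
    return [
        node.lineno
        for node in ast.walk(tree)
        if isinstance(node, ast.ExceptHandler) and node.type is None
    ]

if __name__ == "__main__" and len(sys.argv) > 1:
    path = sys.argv[1]
    lines = find_bare_excepts(open(path).read())
    if lines:
        for ln in lines:
            print(f"{path}:{ln}: bare `except:` swallows errors; "
                  "catch a specific exception instead", file=sys.stderr)
        sys.exit(2)  # exit code 2 => harness shows stderr to the agent
```

Swap the AST check for whatever invariant matters in your codebase; the scaffolding stays the same.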
Fascinating how many people are stuck at “I test manually because I know of no way to express an executable, repeated, and detailed procedure such as testing” ;p
I think MDI style interfaces went out of fashion because there were a lot of programs that implemented an entire window manager and then only really needed one window. This added a ton of complexity that wasn't needed most of the time, because window management is really a job for a different program.
It’s great to see this pattern of people realising that agents can specify the desired behavior then write code to conform to the specs.
TDD, verification, whatever your tool: verification suites of all sorts accrue over time into a very detailed repository of documentation of how things are supposed to work - and, being executable, they put zero tokens in the context when the code is correct.
It’s more powerful than reams upon reams of markdown specs. That’s because it encodes details, not intent. Your intent is helpful at the leading edge of the process, but the codified result needs shoring up to prevent regression. That’s the area software engineering has always ignored because we have gotten by on letting teams hold context in their heads and docs.
As software gets more complex we need better solutions than “go ask Jim about that, bloke’s been in the code for years”.
Be careful here - make sure you encode the right details. I've seen many cases where the tests encode the details of how something was implemented rather than what it is intended to do. This means that you can't refactor anything, because your tests are enforcing a design. (Refactoring is changing code without deleting tests; the trick is making design changes without deleting tests - which means you should test as much as possible at points where that part of the design can no longer change anyway.)
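A toy contrast of the two kinds of test (the `Cart` class and test names are made up for illustration):

```python
# Hypothetical class: a cart that computes a total.
class Cart:
    def __init__(self):
        self._items = []          # internal detail: a list of (name, price)

    def add(self, name: str, price: float) -> None:
        self._items.append((name, price))

    def total(self) -> float:
        return sum(price for _, price in self._items)

# Implementation-coupled test: breaks if you switch the list to a dict,
# even though observable behavior is unchanged.
def test_internal_list():
    cart = Cart()
    cart.add("apple", 2.0)
    assert cart._items == [("apple", 2.0)]

# Behavior test: survives any refactor that keeps totals correct.
def test_total():
    cart = Cart()
    cart.add("apple", 2.0)
    cart.add("pear", 3.0)
    assert cart.total() == 5.0
```

The first test pins the design; the second pins the intent.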
While you are right that you need to be encoding the right details, I disagree on the tests enforcing a design point.
As part of the proper testing strategy, you will have tests that cover individual behavior of a small block/function (real "unit" tests), tests that cover integration points only up to the integration itself, and a small number of end-to-end or multi-component integration tests.
Only the last category should stay mostly unchanged under refactoring, depending on the type of refactor you are doing.
Integration tests will obviously be affected when you are refactoring the interfaces between components, and unit tests will be affected when you are refactoring the components themselves. Yes, you should apply a strategy that keeps this incremental - a kind of reverse TDD (do the refactor but keep the old interface, potentially by calling into the new API from the old; then, in a second step, replace uses of the old API as well, including in tests).
Tests generally define both behavior and implementation in a TDD approach: it'd be weird if they did not need changing at all when you are changing the implementation.
Fine, but don't check in the tests that prove the implementation, since they will be deleted soon anyway. The only tests to check in are the ones that - by failing - informed you that you broke something. We don't know which those tests are, and because most tests run fast, we tend to check in lots of tests that will never fail in a useful way.
Taken to its logical conclusion, what you are saying is do not write (or commit? but in practice, why write them if not to run in CI) any tests except for end-to-end tests covering actual use cases. In theory, even make them generic enough so they are not affected by the implementation. Perhaps even employ LLMs there ("check that a customer can provide their address for their order by using a headless browser").
It is a strong disagree from me: end-to-end tests have always been fragile and slow, and feedback-loop time is the constraint any coder (agentic or human) operates under. If your agents need to wait 2h to see whether every change is valid, you'll be beat by humans doing properly structured "just enough" testing.
That isn't the logical conclusion though. I specifically said to find places where change would be too complex to attempt anyway, which breaks your conclusion. This lets you find plenty of places to jump in and write a test. (You will still be wrong, but less often, and generally you know it will be a hard change before you start making it.)
Though I find in practice end-to-end tests are not that fragile. It took us a decade of effort to find and mitigate all the little issues that make them fragile, though, so perhaps you don't want to go that far. I wish I could make the end-to-end tests faster, but fragile they are not.
I am a bit perplexed at your claim that end-to-end tests are not fragile in general (my claim), countered by how you spent a decade making them not fragile in one particular case?
I am not disagreeing most projects evolve test suites which have duplicated, useless tests as a majority. But it can be done better.
I feel like the difference is minimal, if not entirely dismissable. Code in this sense is just a representation of the same information as someone would write in an .md file. The resolution changes, and that's where both detail and context are lost.
I'm not against TDD or verification-first development, but I don't think writing that as code is the end-goal. I'll concede that there's millions of lines of tests that already exist, so we should be using those as a foundation while everything else catches up.
Tests (and type-checkers, linters, formal specs, etc.) ground the model in reality: they show it that it got something wrong (without needing a human in the loop). It's empiricism, "nullius in verba" - the scientific approach, which led to remarkable advances in a few hundred years that over a thousand years of ungrounded philosophy couldn't achieve.
The scientific approach is not only, or even primarily, empiricism. We didn't test our way to understanding. The scientific approach starts with a theory that does its best to explain some phenomenon. Then the theory is criticized by experts. Finally, if it seems to be a promising theory, tests are constructed. The tests can help verify the theory, but it is the theory that provides the explanation, which is the important part. Once we have an explanation we have understanding, which allows us to play around with the model to come up with new things, diagnose problems, etc.
The scientific approach is theory driven, not test driven. Understanding (and the power that gives us) is the goal.
> The scientific approach starts with a theory that does its best to explain some phenomenon
At the risk of stretching the analogy, the LLM's internal representation is that theory: gradient-descent has tried to "explain" its input corpus (+ RL fine-tuning), which will likely contain relevant source code, documentation, papers, etc. to our problem.
I'd also say that a piece of software is a theory too (quite literally, if we follow Curry-Howard). A piece of software generated by an LLM is a more-specific, more-explicit subset of its internal NN model.
Tests, and other real CLI interactions, allow the model to find out that it's wrong (~empiricism); compared to going round and round in chain-of-thought (~philosophy).
Of course, test failures don't tell us how to make it actually pass; the same way that unexpected experimental/observational results don't tell us what an appropriate explanation/theory should be (see: Dark matter, dark energy, etc.!)
The AI is just pattern matching. Vibing is not understanding, whether done by humans or machines. Vibe programmers (of which there are many) make a mess of the codebase, piling on patch after patch. But they get the tests to pass!
Vibing gives you something like the geocentric model of the solar system. It kind of works, but it's much more complicated and hard to work with.
No, the theory comes from the author's knowledge, culture, and inclinations, not from the facts.
Obviously the author has to do much work in selecting the correct bits from this baggage to get a structure that makes useful predictions, that is to say predictions that reproduce observable facts. But ultimately the theory comes from the author, not from the facts; it would be hard to imagine how one could come up with a theory that doesn't fit all the facts known to the author, if the theory truly "emanated" from the facts in any sense strict enough to matter.
It most certainly is not. All your tests are doing is seeding the context with tokens that increase the probability of tokens related to solving the problem being selected next. One small problem: if the dataset doesn't have sufficiently well-represented answers to the specific problem, no amount of finessing the probability of token selection is going to lead to LLMs solving the problem. The scientific method is grounded in the ability to reason, not probabilistically retrieve random words that are statistically highly correlated with appearing near other words.
This only holds if you understand what's in the tests, and the tests are realistic. The moment you let the LLM write the tests without understanding them, you may as well just let it write the code directly.
> The moment you let the LLM write the tests without understanding them, you may as well just let it write the code directly.
I disagree. Having tests (even if the LLM wrote them itself!) gives the model some grounding, and exposes some of its inconsistencies. LLMs are not logically-omniscient; they can "change their minds" (next-token probabilities) when confronted with evidence (e.g. test failure messages). Chain-of-thought allows more computation to happen; but it doesn't give the model any extra evidence (i.e. Shannon information; outcomes that are surprising, given its prior probabilities).
I disagree to some degree. Tests have value even beyond whether they test the right thing. At the very least they show that something worked and now doesn't, or vice versa. That has value in itself.
Say you describe your kitchen as “I want a kitchen” - where are the knives? Where’s the stove? Answer: you abdicated control over those details, so it’s wherever the stochastic parrot decided to put them, which may or may not be where they ended up last time you pulled your LLM generate-me-a-kitchen lever. And it may not be where you want.
Don’t like the layout? Let’s reroll! Back to the generative kitchen agent for a new one! ($$$)
The big labs will gladly let you reroll until you’re happy. But software - and kitchens - should not be generated in a casino.
A finished software product - like a working kitchen - is a fractal collection of tiny details. Keeping your finished software from falling apart under its own weight means upholding as many of those details as possible.
As with a kitchen, a few wrong details are all that stand between software that works and software that’s hell. In software, the probability that an agent will get 100% of the details right is very, very small.
If it is fast enough, and cheap enough, people would very happily reroll specific subsets of decisions until happy, and then lock that down. And specify in more detail the corner cases it doesn't get just how they want.
People metaphorically do that all the time when designing rooms, in the form of endless browsing of magazines or TikTok or similar to find something they like, instead of starting from first principles and designing exactly what they want - because usually they don't know exactly what they want.
A lot of the time we'd be happier with a spec at the end of the process than at the beginning. A spec that nails down the current understanding of what is intentional vs. what is an accident we haven't addressed yet would be valuable. Locking it all down at the start, on the other hand, is often impossible and/or inadvisable.
Agreed; often you don’t know quite what you want until you’ve seen it.
Spec is an overloaded term in software :) because there are design specs (the plan, alternatives considered etc) and engineering style specs (imagine creating a document with enough detail that someone overseas could write your documentation from it while you’re building it)
Those need distinct names or we are all at risk of talking past each other :)
I've seen this sentiment and am a big fan of it, but I was confused by the blog post, and based on your comment you might be able to help: how does Lean help me? FWIW, context is: code Dart/Flutter day to day.
I can think of some strawmen: for example, prove a state machine in Lean, then port the proven version to Dart? But I'm not familiar enough with Lean to know if that's like saying "prove moon made of cheese with JavaScript, then deploy to the US mainframe"
Yesterday I had to tell a frontier model to translate my code to TLA+ to find a tricky cache-invalidation bug which nothing could find - gpt 5.4, gemini 3.1, opus 4.6 all failed. Translation took maybe 5 mins, the bug was found in seconds; total time from idea to commit: about 15 mins.
if you can get a model to quickly translate a relevant subset of your code to lean to find tricky bugs and map lean fixes back to your codebase space, you've got yourself a huge unlock. (spoiler alert: you basically can, today)
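To make the shape of this concrete, here is a toy Lean 4 sketch (hypothetical code, not the actual bug above, and not checked against a specific Lean/Mathlib version): model the cache as an association list and state the invalidation invariant as a theorem the kernel must accept.

```lean
-- Toy sketch: a cache as an association list of (key, value) pairs.
def invalidate (cache : List (Nat × Nat)) (k : Nat) : List (Nat × Nat) :=
  cache.filter (fun kv => kv.1 != k)

-- The invariant: after invalidating k, no entry for k survives.
theorem invalidate_removes (cache : List (Nat × Nat)) (k v : Nat) :
    (k, v) ∉ invalidate cache k := by
  intro h
  simp [invalidate, List.mem_filter] at h
```

The point is that the property is stated once and checked mechanically; a bug in the real invalidation logic would show up as an unprovable goal rather than a flaky test.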
Thanks for following up on this: I was really surprised by how much air this paean to, idk, TDD, took out of the comments by getting off-topic.
Before you commented, I started poking at what you described for 15 minutes, then forgot about it and fell asleep. Now I remembered, and I know it's viable; IIUC it's almost certainly going to make a big difference in my work practice moving forward. Cheers.
I don't think he's referring to Lean specifically, but any sort of executable testing methodology. It removes the human in the loop in the confidence assurance story, or at least greatly reduces their labor. You cannot ever get such assurance just by saying, "Well this model seems really smart to me!" At best, you would wind up with AI-Jim.
(One way Lean or Rocq could help you directly, though, would be if you coded your program in it and then compiled it to C via their built-in support for it. Such is very difficult at the moment, however, and in the industry is mostly reserved for low-level, high-consequence systems.)
What do you mean? It's a nice and simple language. Way easier to get started than OCaml or Haskell for example. And LLMs write programs in Lean4 with ease as well. Only issue is that there are not as many libraries (for software, for math proofs there is plenty).
But for example I worked with Claude Code and implemented a shell + most of unix coreutils in like a couple of hours. Claude did some simple proofs as well, but that part is obvs harder. But when the program is already in Lean4, you can start moving up the verification ladder up piece by piece.
Well, if you do not need to care about performance, everything can be extremely simple indeed. Let me show you some data structures in Coq/Rocq while switching off notations and displaying the low-level content.
You know, you could just define the verified specs in Lean and, if performance is a problem, use the Lean spec to extract an interface and tests for a more performant language like Rust. You could, at least in theory, use Lean as an orchestrator of verified interfaces.
In Lean, strings are packed arrays of bytes, encoded as UTF-8. Lean is very careful about performance; after all, a self-hosted system that can't generate fast code would not scale.
I don't think so? Lean is formal methods, so it makes sense to discuss the boons of formal and semiformal methods more generally.
I used to think that the only way we would be able to trust AI output would be by leaning heavily into proof-carrying code, but I've come to appreciate the other approaches as well.
But that's exactly my point. "It's natural to discuss the broader category" is doing a lot of heavy lifting here. The blog post is making a very specific claim: that formal proof, checked by Lean's kernel, is qualitatively different from testing - it lets you skip the human review loop entirely. cadamsdotcom's comment rounds that down to "executable specs good, markdown specs bad," which... sure, but that's been the TDD elevator pitch for 20 years.
If someone posted a breakthrough in cryptographic verification and the top comment was "yeah, unit tests are great," we'd all recognize that as missing the point. I don't think it's unrelated, I think it's almost related, which is worse, because it pattern-matches onto agreement while losing the actual insight.
Not just TDD. Amazon, for instance, is heading towards something between TDD and lightweight formal methods.
They are embracing property-based specifications and testing à la Haskell's QuickCheck: https://kiro.dev
Then, already in formal methods territory, refinement types (e.g. Dafny, Liquid Haskell) are great and less complex than dependent types (e.g. Lean, Agda).
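The property-based idea can be sketched without a library (real tools like QuickCheck or Python's Hypothesis generate and shrink cases for you; `dedupe` here is a made-up function under test):

```python
import random

def dedupe(xs: list[int]) -> list[int]:
    """Hypothetical function under test: drop duplicates, keep first occurrences."""
    seen: set[int] = set()
    out: list[int] = []
    for x in xs:
        if x not in seen:
            seen.add(x)
            out.append(x)
    return out

def check_property(trials: int = 200) -> None:
    """Property: the output has no duplicates and the same membership as the input."""
    rng = random.Random(0)  # fixed seed for reproducibility
    for _ in range(trials):
        xs = [rng.randrange(10) for _ in range(rng.randrange(20))]
        ys = dedupe(xs)
        assert len(ys) == len(set(ys)), f"duplicates survived: {xs} -> {ys}"
        assert set(ys) == set(xs), f"membership changed: {xs} -> {ys}"

check_property()
```

Instead of asserting one hand-picked example, you assert an invariant over hundreds of generated inputs - a halfway point between example tests and a formal spec.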
Setting aside that “model” means something different now … MDD never really worked because the tooling never really dealt with intent. You would get so far with your specifications (models), but the semantic rigidity of the tooling meant that at some point your solution would have to part ways. LLMs are the missing piece that finally makes this approach viable, where the intent can be inferred dynamically and guides the implementation specifics. Arguably the purpose of TDD/BDD was to shore up the gaps in communicating intent, and people came to understand that was its purpose, whereas the key intent in the original XP setting was to capture and preserve “known good” operation and guard against regression (in the XP mindset, perhaps fatefully, clear intent was assumed).
That matches what I’ve seen as well — generation is the easy part, validation is the bottleneck.
I’ve been experimenting with a small sparse-regression system that infers governing equations from raw data, and it can produce a lot of plausible candidates quickly. The hard part is filtering out the ones that look right but violate underlying constraints.
For example, it recovered the Sun’s rotation (~25.1 days vs 27 actual) from solar wind data, but most candidate equations were subtly wrong until you enforced consistency checks.
Feels like systems that treat verification as the source of truth (not just an afterthought) are the ones that will actually scale.
I've wrestled with this idea. Do you think the general population will all be vibe coding finance apps? I have to think that most will still just pay the big players.
(I say this as someone who vibe coded a finance app, and it works!)...now I'm not sure what to do with it, it works for me - do I open it to the world or just keep making it great for me.
What if all the good stories have been told? I conjecture Hollywood is out of ideas but there’s plenty new if you look elsewhere.
Recently watched a fantastic Chinese movie: Upstream (2024) - a dramatized view of a culture driven by algorithms where everyone is plugged in but opportunity does exist if you work hard. Optimistic and pessimistic, with an underdog you want to see win, and a bunch of beautiful human and touching moments. Highly recommended.
Hollywood will keep going, maybe in a smaller form. It’s ok for industries to change or run out of steam, and it’s ok for the new to replace the old. It’s ok for a place to run out of stories to tell, because new stories will get told by others in other places.
I feel something like this. All the popular movies are very similar, like all cars look the same. They have been optimised for efficiency, to the lowest common denominator of buyers.
I have started watching movies and series in foreign languages, particularly Korean and some Thai, that have very novel story lines.
It’s working! ~200k LOC python/typescript codebase built from scratch as I’ve grown out the framework. I probably wrote 500-1000 lines of that, so ~99.5% written by Claude Code. I commit 10k-30k loc per week, code-reviewed and industrial strength quality (mainly thanks to rigid TDD)
I review every line of code but the TDD enforcement and self-reflection have now put both the process and continual improvement to said process more or less on autopilot.
It’s a software factory - I don’t build software any more, I walk around the machine with a clipboard optimizing and fixing constraints. My job is to input the specs and prompts and give the factory its best chance of producing a high quality result, then QA that for release.
I keep my operational burden minimal by using managed platforms - more info in the framework.
One caveat; I am a solo dev; my cofounder isn’t writing code. So I can’t speak to how it is to be in a team of engineers with this stuff.
How many pages of architecture / constraints did you write? I guess I’m curious what type of text input renders 200K lines of code output. It must be a similar level of tokens in just docs / prompting. Have you verified all of that? Was that AI generated?
Would be very interested to see whether it’s not just… regular LLM snowballing a paragraph into 12 pages of “technical design documents” and 10K lines of code. Not sure what kind of niche you’re in or what the business logic is, but it sounds to me like you’ve built a machine that… generates code you don’t need to look at??
There was a 200-word architecture doc that lasted about 3 weeks before it drifted, so it got deleted. I no longer keep architecture docs - tests and code are enough for the agent to answer questions when we have them.
Probably wrote 2000+ words of prompts per day to the agent, Monday to Friday, for like 9 months. Dozens to hundreds of prompts a day back and forth with anywhere from 1-7 concurrent agents at a time.
This is not something anyone would ever one-shot. There are thousands of commits. My commit log looks like a normal squash-merge-to-main-and-deploy workflow.
Thanks for sharing that insight! I hope you learned something in the process as well! I hope to find time soon to incorporate all the ideas found in the comments.