Extensible coding agent written in TypeScript. It’s exactly what you (I’m projecting) want out of Claude Code if you’re okay investing time into building your harness or prompting an agent to build it.
Literally everyone is desperately trying to figure out why it's so bad and how to make it work consistently using harnesses, etc. But despite this massive effort, things always go awry after a while. Maybe in a year or two someone will figure it out.
Frankly, I created a dozen such projects in the last few weeks. Recently I deleted them all. I feel like there's no point. I cancelled my Claude subscription, too.
I got back to learning from books, and occasionally use LLMs for "review my code in depth and show me its weak points".
I can't comment about the quality of the code you delivered for your client so I checked your side project. Unfortunately it looks like there is only a landing page (very nice!) but the way from a vibe-coded project to production is usually quite long.
Not wrong at all, that’s why I’m building my own platform for this. That’s also why I haven’t publicly done much on First Cut yet. I’m using my platform to actually build the product, so the intent is that I use my expertise and oversight to ensure it’s not just slop code. So most of the effort has gone into building that platform, which has made building First Cut itself slower. But I’ve actually got my platform running well-enough that now my team is able to get involved, and I can start to work on First Cut again, which means that I should be able to answer your “concern” definitively. I share it.
> A smarter model would be great but there are bigger productivity gains to be had with a good set up, a faster model, and abstracting away the need to think about agents or context usage. I’m still figuring out a good set up. Something with the speed of Haiku with the reasoning of Opus without the overhead of having to think about the management of agents or context would be sweet.
I was thinking about this recently. This kind of setup is the Holy Grail everyone is searching for: make the damn tool produce the right output more of the time. And yet, despite testing the methods provided by people who claim they get excellent results, I still reach the point where it goes off the rails. Nevertheless, since practically everybody is working on this particular issue, and huge amounts of money have been poured into getting it right, I hope that in the next year or so we will finally have something we can reliably use.
> If you aren't doing this level of work by now, you will be automated soon.
It's harder and harder to detect sarcasm these days but in case you're being serious, I've tested a similar setup and I noticed Claude produces perfectly plausible code that has very subtle bugs that get harder and harder to notice. In the end, the initial speedup was gone and I decided to rewrite everything by hand. I'm working on a product where we need to understand the code base very well.
When you write the code yourself you are slowly building up a mental model of how said thing should work. If you end up introducing a subtle bug during that process, at least you already have a good understanding of the code, so it shouldn't be much of an issue to work backwards to find out what assumptions turned out to be incorrect.
But now with Claude, the mental model of how your code works is not in your head, but resides behind a chain of reasoning from Claude Code that you are not privy to. When something breaks, you either have to spend much longer trying to piece together what your agent has made, or continue throwing Claude at it and hope it doesn't spiral into more subtle bugs.
Everybody produces bugs, but Claude is good at producing code that looks like it solves the problem but doesn't. Developers worth working with grow out of this in a new project. Claude doesn't.
An example I have of this is when I asked Claude to copy some functionality from a front-end application to a back-end application. It got all of the function signatures right but then hallucinated the contents of the functions. Part of this functionality included a lookup map for some values. The new version had entirely hallucinated keys and values, but the values sounded correct if you didn't compare them with the original. A human would have literally copied the original lookup map.
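To make the failure mode concrete, here is a hypothetical sketch of that kind of lookup map (all names are invented, not from the original project). The safe "port" to the back end is a byte-for-byte copy of the object; a regenerated-from-memory version is where plausible-but-wrong entries creep in, and a simple structural comparison catches it.

```typescript
// Hypothetical front-end lookup map (names invented for illustration).
// Porting it safely means copying this object verbatim, not re-deriving it.
const STATUS_LABELS: Record<string, string> = {
  draft: "Draft",
  in_review: "In review",
  approved: "Approved",
  archived: "Archived",
};

// Checks that two maps agree on every key and value exactly.
// A hallucinated copy may "sound right" while failing this check.
function sameMap(
  a: Record<string, string>,
  b: Record<string, string>,
): boolean {
  const keysA = Object.keys(a);
  const keysB = Object.keys(b);
  return keysA.length === keysB.length && keysA.every((k) => a[k] === b[k]);
}
```

A check like this (or a snapshot test against the original map) is cheap insurance when asking an agent to duplicate data structures across codebases.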
I asked Claude to help me figure out some statistical calculation in Apple Numbers. It helpfully provided the results of the calculation. I ignored it and implemented it in the spreadsheet myself, and got completely different (correct) results. Claude did help me figure out how to do it correctly, though!
> Developers worth working with grow out of this in a new project. Claude doesn't.
There is no way this is true. People make fewer bugs with time and guidance, but no human makes zero bugs. Also, bugs are not planned; it's always easy to say in hindsight "A human would have literally copied the original lookup map," but every bug stems from some mistake that deviates from what was expected. That's why it's a bug.
No, it's broadly true. Also, that's why we have code review and tests, so that it has to pass a couple of filters.
LLMs don't make mistakes like humans make mistakes.
If you're a SWE at my company, I can assume you have a baseline of skill and you tested the code yourself, so I'm trying to look for any edge cases or gaps or whatever that you might have missed. Do you have good enough tests to make both of us feel confident the code does what it appears to do?
With LLMs, I have to treat its code like it's a hostile adversary trying to sneak in subtle backdoors. I can't trust anything to be done honestly.
Sorry, perhaps I should have been clearer. They don't grow completely out of making bugs (although they do tend to make fewer over time), they grow out of making solutions that look right but don't actually solve the problem. This is because they understand the problem space better over time.
Claude will happily generate tons of useless code and you will be charged appropriately. The output of LLMs has nothing to do with payment rates; otherwise you end up with absurdities like valuating useless CCC that was very expensive to build using LOCs as a metric, whereas in reality it is a toy product nobody in their right mind would ever use.
My metrics are really simple - I don’t do staff augmentation. I get a contract (SOW) with a known set of requirements and acceptance criteria.
The only metrics that matter: is it done on time, on budget, and does it meet the requirements?
But if Claude Code is generating “useless code” for you, you’re doing it wrong.
And I assure you that my implementations from six years of working with consulting departments/companies (including almost four as blue badge, RSU earning consultant at AWS ProServe) have never gone unused.
Using a mix of models - GLM5, MinMax 2.5, and Claude Sonnet/Opus - since they find different issues.
Spending a fair bit of time spec'ing things out and running all three models over the spec to suggest improvements and flaws, iterating until all three are happy. Same at the end: look at the code and suggest stability improvements. The actual code-writing is GLM5 - once things are properly spec'd out, it can generally just hammer away until it's done.
And doing a lot of microservice-style architecture. Think chains of containers talking to each other over APIs.
Sorry, what is pi and how are you using it with ChatGPT for agentic coding?