
Maybe some day, but as a Claude Code user I find it makes enough pretty serious screw-ups, even with a very clearly defined plan, that I review everything it produces.

You might be able to get away without the review step for a bit, but eventually (and before long) you will be bitten.




I use that to feed back into my spec development, prompting, and CI harnesses, not to steer in real time.

Every mistake is a chance to fix the system so that mistake is less likely or impossible.

I rarely fix anything in real time - you review, see issues, fix them in the spec, reset the branch back to zero and try again. Generally, the spec is the part I develop interactively, and then set it loose to go crazy.
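
To make the loop concrete, here's a minimal sketch of the reset-and-retry cycle. The headless agent invocation (`claude -p`), the SPEC.md file name, and pytest as the acceptance gate are all illustrative assumptions, not a prescribed setup:

```python
# Minimal sketch of the "fix the spec, reset, retry" loop. The headless
# agent invocation (claude -p), the SPEC.md file name, and pytest as the
# acceptance gate are illustrative assumptions, not a prescribed setup.
import subprocess

def run_attempt(spec_path: str = "SPEC.md") -> bool:
    # Throw away the previous attempt so the agent starts from a clean branch.
    subprocess.run(["git", "reset", "--hard", "origin/main"], check=True)
    subprocess.run(["git", "clean", "-fd"], check=True)

    # Hand the current spec to the agent and let it run to completion.
    spec = open(spec_path).read()
    subprocess.run(["claude", "-p", f"Implement this spec:\n\n{spec}"], check=True)

    # Gate the result on the acceptance criteria, not on watching it work.
    return subprocess.run(["pytest", "-q"]).returncode == 0

if __name__ == "__main__":
    if not run_attempt():
        print("Review the failure, fix SPEC.md, and go again.")
```

The spec and the harness are the durable artifacts; each attempt is disposable.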

This feels, initially, incredibly painful. You're no longer developing software, you're doing therapy for robots. But it delivers enormous compounding gains, and you can use your agent to do significant parts of it for you.


> You're no longer developing software, you're doing therapy for robots.

Or, really, hacking in "learning", building your knowhow-base.

> But it delivers enormous compounding gains, and you can use your agent to do significant parts of it for you.

Strong yes to both, so strong that it's curious Claude Code, Codex, Claude Cowork, etc., don't yet bake in an explicit knowledge evolution agent curating and evolving their markdown knowledge base:

https://github.com/anthropics/knowledge-work-plugins

Unlikely to help with benchmarks. Very likely to improve utility ratings (as rated by outcome improvements over time) from teams using the tools together.

For those following along at home:

This is the return of the "expert system", now running on a generalized "expert system machine".
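
For what it's worth, a hedged sketch of what such a knowledge-evolution step could look like: after each task, distill the mistakes into a curated markdown knowhow-base that future prompts include. The file name, entry format, and helper functions here are all hypothetical:

```python
# Hypothetical sketch of a "knowledge evolution" pass: after each task,
# distill what went wrong into a curated markdown knowhow-base that gets
# folded into future prompts. File name, format, and helpers are illustrative.
from pathlib import Path

KNOWHOW = Path("KNOWHOW.md")

def record_lesson(task: str, mistake: str, rule: str) -> None:
    """Append one short lesson so the file stays cheap to keep in context."""
    entry = f"- ({task}) {mistake} -> rule: {rule}\n"
    existing = KNOWHOW.read_text() if KNOWHOW.exists() else ""
    KNOWHOW.write_text(existing + entry)

def build_prompt(spec: str) -> str:
    """Prepend the curated knowhow-base to the next task's prompt."""
    knowhow = KNOWHOW.read_text() if KNOWHOW.exists() else ""
    return f"Known pitfalls and house rules:\n{knowhow}\nTask:\n{spec}"

if __name__ == "__main__":
    record_lesson("billing-refactor", "ignored naming conventions",
                  "follow the names in CONVENTIONS.md")
    print(build_prompt("Add invoice export."))
```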


I assumed you'd build such a massive set of rules (which Claude often does not obey) that you'd eat up your context very quickly. I've actually removed all plugins/MCPs because they chewed up way too much context.

It's as much about what to remove as what to add. Curation is the key. Skills also give you some levers for the kind of context-sensitive instruction you need, though I haven't delved too deeply into them. My current total instruction set is around 2,500 tokens.
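
If you want to keep that number honest, a rough gauge is easy to script. tiktoken ships OpenAI encodings, so it only approximates Claude's tokenizer, and the `.claude/` layout below is an assumption about where your instruction files live; treat the count as a ballpark:

```python
# Rough gauge of how many tokens the standing instructions consume.
# tiktoken ships OpenAI encodings, so this only approximates Claude's
# tokenizer, and the .claude/ layout is an assumption about where
# skills and commands live.
from pathlib import Path
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

total = 0
for path in ["CLAUDE.md", *Path(".claude").rglob("*.md")]:
    p = Path(path)
    if p.exists():
        n = len(enc.encode(p.read_text()))
        total += n
        print(f"{p}: {n} tokens")
print(f"total: {total} tokens")
```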

Reviewing what it produces once it thinks it has met the acceptance criteria and the test suite passes is very different from wasting time babysitting every tiny change.

True, and that's usually what I'm doing now, but to be honest I'm also giving all of its code at least a cursory glance.

Some of the things it occasionally does:

- Ignores conventions (even when emphasized in the CLAUDE.md)

- Decides to just not implement tests if it spins out on them too much (it tells you, but only as it happens, and that scrolls by pretty quickly)

- Writes badly performing code (e.g., N+1 queries; see the sketch after this list)

- Does more than you asked (in a bad way, changing UIs or adding cruft)

- Makes generally bad assumptions
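
On the N+1 point, the shape is usually obvious once you look at the finished change: one query per parent row where a single join would do. A small sqlite3 illustration with made-up table names:

```python
# The N+1 shape in plain sqlite3: one query per order (N+1 round trips)
# versus a single join. Table and column names are made up for illustration.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER PRIMARY KEY);
    CREATE TABLE items (order_id INTEGER, sku TEXT);
    INSERT INTO orders VALUES (1), (2);
    INSERT INTO items VALUES (1, 'a'), (1, 'b'), (2, 'c');
""")

# N+1: what the generated code often looks like.
for (order_id,) in conn.execute("SELECT id FROM orders").fetchall():
    items = conn.execute(
        "SELECT sku FROM items WHERE order_id = ?", (order_id,)
    ).fetchall()

# Single query: what a reviewer would push it toward.
rows = conn.execute(
    "SELECT o.id, i.sku FROM orders o JOIN items i ON i.order_id = o.id"
).fetchall()
```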

I'm not trying to be overly negative, but in my experience to date you still need to babysit it. I'm interested, though, in the idea of using multiple models to perform independent reviews, to at least flag spots that could use human intervention/review.


Sure, but none of those things requires you to watch it work. They're all easy to pick up on when reviewing a finished change, which ideally comes after its instructions have had it run linters, run sub-agents that verify it has added tests, and run sub-agents doing a code review.

I don't want to waste my time reviewing a change the model can still significantly improve all by itself. My time costs far more than the model's.
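
A hedged sketch of that kind of pre-review gate: don't surface the change to a human until the diff actually adds tests and the linter and suite pass. The base branch, the tests/ path convention, and the tool choices (ruff, pytest) are assumptions:

```python
# Sketch of a pre-review gate: only hand the change to a human once the
# diff actually adds tests and the linter and suite pass. The base branch,
# the tests/ path convention, and the tools (ruff, pytest) are assumptions.
import subprocess

def changed_files(base: str = "origin/main") -> list[str]:
    out = subprocess.run(["git", "diff", "--name-only", base],
                         capture_output=True, text=True, check=True)
    return out.stdout.splitlines()

def ready_for_human_review() -> bool:
    files = changed_files()
    has_tests = any(f.startswith("tests/") or "test_" in f for f in files)
    lint_ok = subprocess.run(["ruff", "check", "."]).returncode == 0
    tests_ok = subprocess.run(["pytest", "-q"]).returncode == 0
    return has_tests and lint_ok and tests_ok

if __name__ == "__main__":
    print("ready for review" if ready_for_human_review()
          else "send it back to the agent")
```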


then you're using it wrong, to be frank with you.

you give it tools so it can compile and run the code. then you give it more tools so it can decide between iterations if it got closer to the goal or not. let it evaluate itself. if it can't evaluate something, let it write tests and benchmark itself.

I guarantee that if the criteria are very well defined and benchmarkable, it will do the right thing in X iterations.

(I don't do UI development. I do end-to-end system performance on two very large code bases. my tests can be measured. the measure is very simple and binary: better or not. it works.)
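
A minimal sketch of that binary better-or-not measure, assuming a stored baseline file and a placeholder workload standing in for the real end-to-end scenario:

```python
# Minimal sketch of a binary "better or not" check: time the candidate,
# compare against a stored baseline, and answer yes or no. The baseline
# file and the workload being timed are placeholders.
import json
import time
from pathlib import Path

BASELINE = Path("baseline.json")

def workload() -> None:
    # Stand-in for the real end-to-end scenario being optimized.
    sum(i * i for i in range(1_000_000))

def measure(repeats: int = 5) -> float:
    times = []
    for _ in range(repeats):
        start = time.perf_counter()
        workload()
        times.append(time.perf_counter() - start)
    return min(times)

def better_or_not() -> bool:
    current = measure()
    if not BASELINE.exists():
        BASELINE.write_text(json.dumps({"seconds": current}))
        return True  # first run establishes the baseline
    baseline = json.loads(BASELINE.read_text())["seconds"]
    return current < baseline  # binary: either it beat the baseline or it didn't

if __name__ == "__main__":
    print("better" if better_or_not() else "not better")
```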


That’s what oh-my-open-code does.


