More

jumploops · 2026-03-19T08:45:07 1773909907

Funnily enough, with the most recent models (having reduced sycophancy), putting in the wrong assumptions often still leads to the right output.

jumploops · 2026-03-19T08:08:25 1773907705

In my experience with “agentic engineering” the spec docs are often longer than the code itself.

Natural language is imperfect, code is exact.

The goal of specs is largely to maintain desired functionality over many iterations, something that pure code handles poorly.

I’ve tried inline comments, tests, etc. but what works best is waterfall-style design docs that act as a second source of truth to the running code.

Using this approach, I’ve been able to seamlessly iterate on “fully vibecoded” projects, refactor existing codebases, transform repositories from one language to another, etc.

Obviously ymmv, but it feels like we’re back in the 70s-80s in terms of dev flow.

stanac · 2026-03-19T08:34:23 1773909263

> The goal of specs is largely to maintain desired functionality over many iterations, something that pure code handles poorly.

IMHO this could be achieved with large set of tests, but the problem is if you prompt an agent to fix tests, you can't be sure it won't "fix the test". Or implement something just to make the test pass without looking at a larger picture.

abustamam · 2026-03-19T17:01:39 1773939699

I find myself babysitting agent-derived tests unless I specifically say what the in variants and edge cases are. Sometimes I'll ask it if I missed anything and it'll be helpful. But I have to be proactive.

yes_man · 2026-03-19T08:19:35 1773908375

> In my experience with “agentic engineering” the spec docs should be longer than the code itself. Natural language is imperfect, code is exact.

The latter notion probably is true, but the prior isn’t necessarily true because you can map natural language to strict schemas. ”Implement an interface for TCP in <language>” is probably shorter than the actual implementation in code.

And I understand my example is pedantic, but it extends to any unambiguous definitions. Of course one can argue that TCP spec is not determimistic by nature because natural language isn’t. But that is not very practical. We have to agree to trust some axioms for compilers to work in the first place.

jumploops · 2026-03-19T08:43:22 1773909802

Thanks, I updated my comment to say “are often longer” because that’s what I see in practice.

To your point, there are some cases where a short description is sufficient and may have equal or less lines than code (frequently with helper functions utilizing well known packages).

In either case, we’re entering a new era of “compilers” (transpilers?), where they aren’t always correct/performant yet, but the change in tides is clear.

jumploops · 2026-03-16T08:00:14 1773648014

So much of practical CS is abiding by standards created by solo programmers in the past.

My university frowned on any industry-related classes (i.e. teaching software engineering tools vs. theoretical CS), but I was fortunate enough to know a passionate grad student who created a 1-credit seminar course on this exact topic.

This course covered CLIs/git/Unix/shell/IDEs/vim/emacs/regex/etc. and, although I had experience with Linux/git already, was invaluable to my early education (and adoption of Vim!).

It makes sense that this isn't a core topic, as a CS education should be as pure as possible, but when you're learning/building, you're forced to live within an operating system and architecture that are built on decades of trade-offs and technical debt.

kgwxd · 2026-03-16T10:09:36 1773655776

If you've made it that far in life without learning how to use a screwdriver, engineering would be a bad choice of major. And paying insane amounts of money for someone to explain how to use one would be an even poorer choice.

randusername · 2026-03-16T13:49:37 1773668977

It was amazing to me how many people I met in college that pursued majors they didn't even like. It was even more sad when it was clear they or their parents had fallen in love with the idea of a career path and not the realities of it.

Lots of my engineering cohort landed in sales because they didn't like building or fixing things. I guess that's a win for them, but I always felt nauseated that practical kids might be cut from the program instead of the book-smart but uninterested ones.

lokar · 2026-03-16T14:37:15 1773671835

I asked my CS peers why they were doing that major since they clearly disliked programming and theory.

They all said they would either be a consultant or a manager.

aleph_minus_one · 2026-03-17T00:21:19 1773706879

> They all said they would either be a consultant or a manager.

To be a good consultant, one must be exceptional in the area in which one consults.

Similarly: if they actually want to become a manager, why don't they study business administration instead. And because lots of people want to become managers: why don't they hang all day and night about textbooks and texts about economic topics and analyze reportings of companies or business case studies?

ferguess_k · 2026-03-16T13:05:54 1773666354

Same reason I always wonder whether I should go for an electrician/mechanic/avion mechanic education if I'm laid off (and cannot find a job).

I'm really not a handyman -- quite the opposite -- it took me and my father 30 minutes to change the car battery last time -- and most of the time was spent on pushing a component dropped to the bottom out of the car. I used to think that more practices bring some sort of linear growth of the skill in the beginning, but now I tend to believe that for certain people (who are not suitable for the trade), the beginning is totally random -- I could practice 100 times and fail 100 tiles randomly, without really learning anything -- because there are an unlimited number of ways to do one thing, theoretically.

Software suits me way more. Soldering is also OK albeit more confusing. Unfortunately there is no trade that primarily deals with microcontrollers, except in military/defense.

Pay08 · 2026-03-16T15:06:27 1773673587

That's a terrible comparison. You could've at least compared Git to a lathe. Imagine if educators had this attitude, no one would want to learn anything.

kace91 · 2026-03-16T10:43:52 1773657832

Universities consider themselves pure and isolated from lowly industry.

Industry demands specifically university degrees to gatekeep positions.

And then we leave teenagers to figure out the puzzle by themselves. I think it's a disservice to the youth.

someprick · 2026-03-16T12:13:03 1773663183

Universities produce research, and students; Students produce industry, and the body politic; Industry and polity produce university funding.

A cycle I like to call, the "ring-bugger."

I'm not saying it's right, or acceptable, or particularly moral… But I agree that by obscuring the facts, we only serve to confound the decent and good-willed of our students.

Edit: derp.

pjmlp · 2026-03-17T10:21:27 1773742887

In Portugal it depends on which university degree you go to, there are for all levels.

If you want a higher education degree focus on what the industry is using today, you go to a politécnico, or técnico superior school.

If you want more focus to learning to learn, with more broader horizons, then you go into plain old university.

If you want to broaden the horizons, but still have some contact with what is industry is using today, you go into an applied engineering degree.

Additionally, similar to Germany with their Azubis program, you can just go to a technical school, with focus on being industry prepared, learning on the job during summer internships, and still leave the door open for going into the university later on, e.g. técnico professional.

The problem are the places that only have one way to approach higher education.

Pay08 · 2026-03-16T14:59:17 1773673157

While I have my issues with the system, many Soviet-controlled countries implemented a two-tier higher education system that solved this by having one tier be focused on practical subjects and the other on theoretical ones.

kjellsbells · 2026-03-16T18:53:53 1773687233

Britain used to have this too. Sadly it was strangled to death by the UK class system, but the replacement didnt help.

Once upon a time the white collar track was to go to University. One of the old ones if your class situation was pushing you towards executive roles in the Civil Service or banking or some big corporation. One of the newer, redbrick ones if your horizon was more like running a textile mill in the North. You were trained to think and had a fairly Great Books style of curriculum.

For the people who needed advanced education to keep the electric grokulator working, there were polytechnics. People came out of here with practical skills. In some areas, like mathematics, there would have been overlap between University and Polytechnic courses.

Then there were technical colleges where working class people could get skills to help them in their jobs, like rebuilding engines or CNC machining.

Then, people got antsy that university was so elite and only 5% of highschoolers were going. why not let polys be universities? After all, we need to keep up in a global economy. And so there was a massive gold rush and places that had no business or capability became A University overnight.

But...Brits being how they are, they still stratified themselves into class layers. You're far more likely to find a Russell Group university graduate in a fancy job than someone from a former poly in the North. The class system persisted despite everything, and attempts to broaden educational access ultimately did not simultaneously keep the quality uniformly high.

aleph_minus_one · 2026-03-17T00:16:18 1773706578

> While I have my issues with the system, many Soviet-controlled countries implemented a two-tier higher education system that solved this by having one tier be focused on practical subjects and the other on theoretical ones.

In Germany, there exist even more tiers for tertiary education:

- vocational training

- universities (academic training)

- Fachhochschule (instutions of tertiary education that offer study programs that is more focused on skills that are needed by industry)

- in some parts of Germany: Berufsakademie: even more applied than Fachhochschule; you absolve half of your tertiary education at a company

Pay08 · 2026-03-17T06:43:28 1773729808

Those exist elsewhere too, but at least in Hungary, they aren't separate institutions with different legal statuses (except for vocational schools), unlike the system I was talking about.

coldpie · 2026-03-16T13:47:05 1773668825

Yeah, I got duped by this. Did a CS degree because that's what you're "supposed" to do to get a programming job, and it was almost all theoretical junk I had no interest in. I hated it. I think I learned useful things in like, two of my classes. I knew more about programming than all but one of my instructors. It was awful and going through that degree program is one of the biggest regrets in my life. But hey, I get to stick "CS Degree from University" as the very last line on my resumes, I guess. Woo.

slfnflctd · 2026-03-16T14:11:04 1773670264

I was directly told by senior staff at a large org I worked for that I'd be eligible for a managerial position-- the only thing I was missing was a degree. Unfortunately, getting a degree while working full time for the income I needed was impossible for me at the time.

My entire career would've been different if I had that "very last line on my resumes" and I'd be better off financially. I just couldn't pull it off. I hope yours pays you back eventually, it seems like you worked hard to get it.

coldpie · 2026-03-16T14:32:29 1773671549

That sucks and is super unfair.

For my career path specifically I don't think it has made a difference. I've only had two software jobs in my 17 year career, the first definitely didn't need a degree and I think my current one would've let me in without a degree as I was referred by an employee. I doubt my next job will still be in software, so I'll probably have gotten largely nothing out of the time & money I blew on getting that useless degree.

zabzonk · 2026-03-16T14:34:17 1773671657

Where exactly did this "supposed to" come from? I've never met anyone who expected (or needed) a CS degree to teach them programming.

coldpie · 2026-03-16T14:41:52 1773672112

From the post I was replying to:

> Industry demands specifically university degrees to gatekeep positions.

At the time (mid-2000s), people who wanted to get programming positions got CS degrees, so that's what I did. I didn't expect it to teach me anything, it was just the path I was told I was expected to take. In retrospect I should have done literally anything else, but like that same post said:

> And then we leave teenagers to figure out the puzzle by themselves. I think it's a disservice to the youth.

I was a teenager. I made a bad call and wasted 4 years on a degree program I hated because everyone said a degree is required to get a good job, and the degree that programmers get is CS. Sucks.

raw_anon_1111 · 2026-03-16T15:44:39 1773675879

So do you think most people get into tens of thousands of debt to be “a better citizen of the world” or to learn what they need to know for some company to allow them to exchange labor for money to support their addictions to food and shelter?

zabzonk · 2026-03-17T01:09:39 1773709779

What has that got to do with learning programming? Or not learning programming?

raw_anon_1111 · 2026-03-17T01:33:50 1773711230

Really? If you don’t know how to program, why would a company hire you to program?

loughnane · 2026-03-16T13:04:22 1773666262

Sounds like MIT's missing semester https://news.ycombinator.com/item?id=46273762

ErroneousBosh · 2026-03-16T15:12:35 1773673955

> So much of practical CS is abiding by standards created by solo programmers in the past.

I wonder if this shows up in other disciplines? Do surgeons do this? I'm thinking in particular of the bit in Richard Heller's book M * A * S * H (you're probably more familiar with the TV series) where one of the old hands is reviewing the Young And Enthusiastic Newbie's work, and says something like "Your work is absolutely perfect and it's the neatest job I've ever seen, but you're going to kill a patient doing that because it took you two hours and some of these kids don't have two hours".

godelski · 2026-03-17T02:59:11 1773716351

  > This course covered CLIs/git/Unix/shell/IDEs/vim/emacs/regex/etc.

Fwiw I just graduated grad school and our lower division courses taught most of this stuff, though not as the main subject. Most upper division classes required you to submit your git repo. Most of this was fairly rudimentary but it existed. Though we didn't cover vim/emacs and I'd argue shell and bash were very lacking.

That said, several of us grad students and a few professors lived in the terminal. The students that wanted to learn this stuff frequented our offices, even outside office hours. I can certainly say every single one of those students was consistently at the top of the class (but not the only ones). The students who lived in the terminal but didn't frequent office hours tended to do well in class but honestly I think several were bored do didn't get straight A's but I could generally tell they knew more than most. Though I'm biased. I think more people should live in the shell

eru · 2026-03-16T09:53:24 1773654804

> It makes sense that this isn't a core topic, as a CS education should be as pure as possible, [...]

I don't think that's a good goal. Otherwise, why let you near a computer at all, and not restrict you to chalk and blackboards?

ratorx · 2026-03-16T14:32:48 1773671568

Pure CS is not necessarily equivalent to pure maths. For the “science” bit of CS, you do need to do the equivalent of experiments (for more applied topics).

For example, a physics degree is expected to have experiments. You are not required, expected (and possibly do not want) to know the tools required to professionally build a bridge because you did courses on mechanics. But you might do an experiment on the structural integrity and properties of small structures.

Whether this is a good split is an entirely different question.

jumploops · 2026-03-16T07:16:40 1773645400

After "fully vibecoding" (i.e. I don't read the code) a few projects, the important aspect of this isn't so much the different agents, but the development process.

Ironically, it resembles waterfall much more so than agile, in that you spec everything (tech stack, packages, open questions, etc.) up front and then pass that spec to an implementation stage. From here you either iterate, or create a PR.

Even with agile, it's similar, in that you have some high-level customer need, pass that to the dev team, and then pass their output to QA.

What's the evidence? Admittedly anecdotal, as I'm not sure of any benchmarks that test this thoroughly, but in my experience this flow helps avoid the pitfall of slop that occurs when you let the agent run wild until it's "done."

"Done" is often subjective, and you can absolutely reach a done state just with vanilla codex/claude code.

Note: I don't use a hierarchy of agents, but my process follows a similar design/plan -> implement -> debug iteration flow.

jumploops · 2026-03-16T05:59:51 1773640791

This is similar to how I use LLMs (architect/plan -> implement -> debug/review), but after getting bit a few times, I have a few extra things in my process:

The main difference between my workflow and the authors, is that I have the LLM "write" the design/plan/open questions/debug/etc. into markdown files, for almost every step.

This is mostly helpful because it "anchors" decisions into timestamped files, rather than just loose back-and-forth specs in the context window.

Before the current round of models, I would religiously clear context and rely on these files for truth, but even with the newest models/agentic harnesses, I find it helps avoid regressions as the software evolves over time.

A minor difference between myself and the author, is that I don't rely on specific sub-agents (beyond what the agentic harness has built-in for e.g. file exploration).

I say it's minor, because in practice the actual calls to the LLMs undoubtedly look quite similar (clean context window, different task/model, etc.).

One tip, if you have access, is to do the initial design/architecture with GPT-5.x Pro, and then take the output "spec" from that chat/iteration to kick-off a codex/claude code session. This can also be helpful for hard to reason about bugs, but I've only done that a handful of times at this point (i.e. funky dynamic SVG-based animation snafu).

lelele · 2026-03-16T06:23:20 1773642200

> The main difference between my workflow and the authors, is that I have the LLM "write" the design/plan/open questions/debug/etc. into markdown files, for almost every step. > > This is mostly helpful because it "anchors" decisions into timestamped files, rather than just loose back-and-forth specs in the context window.

Would you please expand on this? Do you make the LLM append their responses to a Markdown file, prefixed by their timestamps, basically preserving the whole context in a file? Or do you make the LLM update some reference files in order to keep a "condensed" context? Thank you.

aix1 · 2026-03-16T07:08:50 1773644930

Not the GP, but I currently use a hierarchy of artifacts: requirements doc -> design docs (overall and per-component) -> code+tests. All artifacts are version controlled.

Each level in the hierarchy is empirically ~5X smaller than the level below. This, plus sharding the design docs by component, helps Claude navigate the project and make consistent decision across sessions.

My workflow for adding a feature goes something like this:

1. I iterate with Claude on updating the requirements doc to capture the desired final state of the system from the user's perspective.

2. Once that's done, a different instance of Claude reads the requirements and the design docs and updates the latter to address all the requirements listed in the former. This is done interactively with me in the loop to guide and to resolve ambiguity.

3. Once the technical design is agreed, Claude writes a test plan, usually almost entirely autonomously. The test plan is part of each design doc and is updated as the design evolves.

3a. (Optionally) another Claude instance reviews the design for soundness, completeness, consistency with itself and with the requirements. I review the findings and tell it what to fix and what to ignore.

4. Claude brings unit tests in line with what the test plan says, adding/updating/removing tests but not touching code under test.

4a. (Optionally) the tests are reviewed by another instance of Claude for bugs and inconsistencies with the test plan or the style guide.

5. Claude implements the feature.

5a. (Optionally) another instance reviews the implementation.

For complex changes, I'm quite disciplined to have each step carried out in a different session so that all communinications are done via checked-in artifacts and not through context. For simple changes, I often don't bother and/or skip the reviews.

From time to time, I run standalone garbage collection and consistency checks, where I get Claude to look for dead code, low-value tests, stale parts of the design, duplication, requirements-design-tests-code drift etc. I find it particularly valuable to look for opportunities to make things simpler or even just smaller (fewer tokens/less work to maintain).

Occasionally, I find that I need to instruct Claude to write a benchmark and use it with a profiler to opimise something. I check these in but generally don't bother documenting them. In my case they tend to be one-off things and not part of some regression test suite. Maybe I should just abandon them & re-create if they're ever needed again.

I also have a (very short) coding style guide. It only includes things that Claude consistently gets wrong or does in ways that are not to my liking.

stavros · 2026-03-16T11:24:16 1773660256

I don't know if I explained this clearly enough in the article, but I have the LLM write the plan to a file as well. The architect's end result is a plan file in the repo, and the developer reads that.

You can see one here: https://github.com/skorokithakis/sleight-of-hand/blob/master...

Havoc · 2026-03-16T08:40:56 1773650456

Yeah same. The markdown thing also helps with the multi model thing. Can wipe context and have another model look at the code and markdown plan with fresh eyes easily

jumploops · 2026-03-16T04:37:41 1773635861

A lot of these resonate with me, particularly the mental fatigue. It feels like normal coding forced me to slow my brain down, whereas now my mind is the limit.

For context, I started an experiment to rebuild a previous project entirely with LLMs back in June '25 ("fully vibecoded" - not even reading the source).

After iterating and finally settling on a design/plan/debug loop that works relatively well, I'm now experiencing an old problem like new: doing too much!

As a junior engineer, it's common to underestimate the scope of some task, and to pile on extra features/edge cases/etc. until you miss your deadline. A valuable lesson any new programmer/software engineer necessarily goes though.

With "agentic engineering," it's like I'm right back at square one. Code is so cheap/fast to write, I find myself doing it the "right way" from the get go, adding more features even though I know I shouldn't, and ballooning projects until they reach a state of never launching.

I feel like a kid again (:

hackit2 · 2026-03-16T04:49:10 1773636550

I spent more time correcting LLMs or agentics systems than just learning the domain and doing the coding myself. I mainly leave LLM to the boring work of doing tedious repetitive code.

If I give it anything resembling anything that I'm not an expert on, it will make a mess of things.

jumploops · 2026-03-16T07:40:10 1773646810

Yeah the old adage "what you put in is what you get out" is highly relevant here.

Admittedly I'm knowledgable in most of the domains I use LLMs for, but even so, my prompts are much longer now than they used to be.

LLMs are token happy, especially Claude, so if you give it a short 1-2 sentence prompt, your results will be wildly variable.

I now spend a lot of mental energy on my prompting, and resist the urge to use less-than-professional language.

Instead of "build me an app to track fitness" it's more like:

> "We're building a companion app for novice barbell users, roughly inspired by the book 'Starting Strength.' The app should be entirely local, with no back-end. We're focusing on iOS, and want to use SwiftUI. Users should [..] Given this high-level description, let's draft a high-level design doc, including implementation decisions, open questions, etc. Before writing any code, we'll review and iterate on this spec."

I've found success in this method for building apps/tools in languages I'm not proficient in (Rust, Swift, etc.).

hackit2 · 2026-03-17T02:29:13 1773714553

Yup, it makes me think that the whole bubble/marketing about how AI is going to revolutionize business and managers can just fire or make redundant 80% of their developers because they can replace them with a single Claude subscription to be hyperbolic, and very short sighted. Even from a Business standpoint - most of the cost that is associated with running a business isn't in the people but in the marketing and material cost - developers are probably the least costly part of a business. This is just due to the fact that developer have such a high ROI that it is silly to make the case that your developers are such a significant cost factor for running your business.

That being said, I may get to that stage. How-ever there is still a lot more growing pains to be had with LLM/AI before it reaches that point - if it ever does.

apsurd · 2026-03-16T05:57:57 1773640677

What do you mean doing it the "right way" from the get-go, as then paired with more features, ballooning projects, and never launching?

Is that why it's in quotes because it's the opposite of the right way?

If there's one thing I learned in a decade+ of professional programming, it's that we can't predict the future. That's it, that simple. YANGNI. (also: model the data, but I'm trying to make a point here)

We got into coding because we like to code; we invent reasons and justifications to code more, ship more, all the world's problems can be solved if only developers shipped more code.

Nirvana is reached when they that love and care about the shipping of the code know also that it's not the shipping of the code that matters.

jumploops · 2026-03-16T06:06:31 1773641191

Yeah exactly, "right way" is in quotes because there is no right way.

The most important thing is shipping/getting feedback, everything else is theatre at best, or a project-killing distraction at worst.

As a concrete example, I wanted to update my personal website to show some of these fully-vibecoded projects off. That seemed too simple, so instead I created a Rotten Tomatoes-inspired web app where I could list the projects. Cool, should be an afternoon or two.

A few yak shaves later, and I'm adding automatic repo import[0] from Github...

Totally unnecessary, because I don't actually expect anyone to use the site other than me!

[0]https://github.com/jumploops/slop.haus/pull/9

apsurd · 2026-03-16T06:15:30 1773641730

lol. for whatever reason what came to mind is "it's like alcoholics anonymous". it's so liberating to be self-aware that we have a problem.

I JUST WANT TO CODE!

It gets us all. And it makes us better I think, to care about the craft. LLM people seem split on that. But it's both to me: gotta care about the craft, also as a professional, it's not the code, it's business outcomes. All good. hold two truths.

jumploops · 2026-03-02T08:03:27 1772438607

I do something similar, but across three doc types: design, plan, and debug

Design works similar to your project.md file, but on a per feature request. I also explicitly ask it to outline open questions/unknowns.

Once the design doc (i.e. design/[feature].md) has been sufficiently iterated on, we move to the plan doc(s).

The plan docs are structured like `plan/[feature]/phase-N-[description].md`

From here, the agent iterates until the plan is "done" only stopping if it encounters some build/install/run limitation.

At this point, I either jump back to new design/plan files, or dive into the debug flow. Similar to the plan prompting, debug is instructed to review the current implementation, and outline N-M hypotheses for what could be wrong.

We review these hypotheses, sometimes iterate, and then tackle them one by one.

An important note for debug flows, similar to manual debugging, it's often better to have the agent instrument logging/traces/etc. to confirm a hypothesis, before moving directly to a fix.

Using this method has led to a 100% vibe-coded success rate both on greenfield and legacy projects.

Note: my main complaint is the sheer number of markdown files over time, but I haven't gotten around to (or needed to) automate this yet, as sometimes these historic planning/debug files are useful for future changes.

miki123211 · 2026-03-02T09:58:35 1772445515

My "heavy" workflow for large changes is basically as follows:

0. create a .gitignored directory where agents can keep docs. Every project deserves one of these, not just for LLMs, but also for logs, random JSON responses you captured to a file etc.

1. Ask the agent to create a file for the change, rephrase the prompt in its own words. My prompts are super sloppy, full of typos, with 0 emphasis put on good grammar, so it's a good first step to make sure the agent understands what I want it to do. It also helps preserve the prompt across sessions.

2. Ask the agent to do research on the relevant subsystems and dump it to the change doc. This is to confirm that the agent correctly understands what the code is doing and isn't missing any assumptions. If something goes wrong here, it's a good opportunity to refactor or add comments to make future mistakes less likely.

3. Spec out behavior (UI, CLI etc). The agent is allowed to ask for decisions here.

4. Given the functional spec, figure out the technical architecture, same workflow as above.

5. High-level plan.

6. Detailed plan for the first incomplete high-level step.

7. Implement, manually review code until satisfied.

8. Go to 6.

jedberg · 2026-03-02T08:28:22 1772440102

> At this point, I either jump back to new design/plan files, or dive into the debug flow. Similar to the plan prompting, debug is instructed to review the current implementation, and outline N-M hypotheses for what could be wrong.

I'm biased because my company makes a durable execution library, but I'm super excited about the debug workflow we recently enabled when we launched both a skill and MCP server.

You can use the skill to tell your agent to build with durable execution (and it does a pretty great job the first time in most cases) and then you can use the MCP server to say things like "look at the failed workflows and find the bug". And since it has actual checkpoints from production runs, it can zero in on the bug a lot quicker.

We just dropped a blog post about it: https://www.dbos.dev/blog/mcp-agent-for-durable-workflows

zknill · 2026-03-02T09:20:34 1772443234

Why an MCP? dbos already ships a cli that appears to have the same features. Why an MCP over a skill that gives context on using the cli?

https://docs.dbos.dev/python/reference/cli

jumploops · 2026-03-02T09:51:34 1772445094

> we launched both a skill and MCP server.

My guess is that the MCP was easy enough to add, and some tools only support MCP.

Personal opinion: MCP is just codified context pollution.

jumploops · 2026-03-02T08:41:29 1772440889

This is great, giving agents access to logs (dev or prod) tightens the debug flow substantially.

With that said, I often find myself leaning on the debug flow for non-errors e.g. UI/UX regressions that the models are still bad at visualizing.

As an example, I added a "SlopGoo" component to a side project, which uses an animated SVG to produce a "goo" like effect. Ended up going through 8 debug docs[0] until I was satisified.

[0]https://github.com/jumploops/slop.haus/tree/main/debug

nubinetwork · 2026-03-02T10:43:34 1772448214

> giving agents access to logs (dev or prod) tightens the debug flow substantially.

Unless the agent doesn't know what it's doing... I've caught Gemini stuck in an edit-debug loop making the same 3-4 mistakes over and over again for like an hour, only to take the code over to Claude and get the correct result in 2-3 cycles (like 5-10 minutes)... I can't really blame Gemini for that too much though, what I have it working on isn't documented very well, which is why I wanted the help in the first place...

danenania · 2026-03-02T16:16:17 1772468177

I have a similar process and have thought about committing all the planning files, but I've found that they tend to end up in an outdated state by the time the implementation is done.

Better imo is to produce a README or dev-facing doc at the end that distills all the planning and implementation into a final authoritative overview. This is easier for both humans and agents to digest than bunch of meandering planning files.

frumiousirc · 2026-03-02T11:51:44 1772452304

> Note: my main complaint is the sheer number of markdown files over time, but I haven't gotten around to (or needed to) automate this yet, as sometimes these historic planning/debug files are useful for future changes.

FWIW, what you describe maps well to Beads. Your directory structure becomes dependencies between issues, and/or parent/children issue relationship and/or labels ("epic", "feature", "bug", etc). Your markdown moves from files to issue entries hidden away in a JSONL file with local DB as cache.

Your current file-system "UI" vs Beads command line UI is obviously a big difference.

Beads provides a kind of conceptual bottleneck which I think helps when using with LLMs. Beads more self-documenting while a file-system can be "anything".

wek · 2026-03-02T14:21:52 1772461312

Similar, but we have the agent write the test cases after writing the plan and then iterate until it passes the test cases.

jumploops · 2026-03-02T06:43:38 1772433818

I've been experimenting with a few ways to keep the "historical context" of the codebase relevant to future agent sessions.

First, I tried using simple inline comments, but the agents happily (and silently) removed them, even when prompted not to.

The next attempt was to have a parallel markdown file for every code file. This worked OK, but suffered from a few issues:

1. Understanding context beyond the current session

2. Tracking related files/invocations

3. Cold start problem on an existing codebases

To solve 1 and 3, I built a simple "doc agent" that does a poor man's tree traversal of the codebase, noting any unknowns/TODOs, and running until "done."

To solve 2, I explored using the AST directly, but this made the human aspect of the codebase even less pronounced (not to mention a variety of complex edge-cases), and I found the "doc agent" approach good enough for outlining related files/uses.

To improve the "doc agent" cold start flow, I also added a folder level spec/markdown file, which in retrospect seems obvious.

The main benefit of this system, is that when the agent is working, it not only has to change the source code, but it has to reckon with the explanation/rationale behind said source code. I haven't done any rigorous testing, but in my anecdotal experience, the models make fewer mistakes and cause less regressions overall.

I'm currently toying around with a more formal way to mark something as a human decision vs. an agent decision (i.e. this is very important vs. this was just the path of least resistance), however the current approach seems to work well enough.

If anyone is curious what this looks like, I ran the cold start on OpenAI's Codex repo[0].

[0]https://github.com/jumploops/codex/blob/file-specs/codex-rs/...

jumploops · 2026-02-24T09:35:03 1771925703

In the context of traditional SaaS, using dynamic secrets loaded at runtime (KMS+Dynamo, etc.).

For agentic tools and pure agents, a proxy is the safest approach. The agent can even think it has a real API key, but said key is worthless outside of the proxy setting.

berkes · 2026-02-24T13:29:38 1771939778

It suprises me how often I see some Dockerfile, Helm, Kubernetes, Ansible etc write .env files to disk in some production-alike environment.

The OS, especially linux - most common for hosting production software - is perfectly capable of setting and providing ENV vars. Almost all common devops and older sysadmin tooling can set ENV vars. Really no need to ever write these to disk.

I think this comes from unaware developers that think a .env file, and runtime logic that reads this file (dotenv libs) in the app are required for this to work. I certainly see this misconception a lot with (junior) developers working on windows.

- you don't need dotenv libraries searching files, parsing them, etc in your apps runtime. Please just leave it to the OS to provide the ENV vars and read those, in your app.

- Yes, also on your development machine. Plenty of tools from direnv to the bazillion "dotenv" runners will do this for you. But even those aren't required, you could just set env vars in .bashrc, /etc/environment (Don't put them there, though) etc.

- Yes, even for windows, plenty of options, even when developers refuse to or cannot use wsl. Various tools, but in the end, just `set foo=bar`.

Arrowmaster · 2026-02-25T01:07:15 1771981635

The problem isn't the .env file itself but using environment variables at all to pass secrets is insecure.

berkes · 2026-02-25T12:36:02 1772022962

I strongly disagree.

Environment variables are -by far- the securest AND most practical way to provide configuration and secrets to apps.

Any other way is less secure: files on disk, (cli)arguments, a database, etc. Or about as secure but far more complex and convoluted. I've seen enterprise hosting with a (virtual) mount (nfs, etc) that provides config files - read only - tight permissions, served from a secure vault. A lot of indirection for getting secrets into an app that will still just read them plain text. More secure than env vars? how?

Or some encrypted database/vault that the app can read from using - a shared secret provided as env var or on-disk config file.

Arrowmaster · 2026-02-25T20:59:50 1772053190

Disagree, the best way to pass secrets is by using mount namespaces (systemd and docker do this under /run/secrets/) so that the can program can access the secrets as needed but they don't exist in the environment. The process is not complicated, many system already implement it. By keeping them out of ENV variables you no longer have to worry about the entire ENV getting written out during a crash or debugging and exposing the secrets.

berkes · 2026-02-26T08:08:15 1772093295

How does a mounted secret (vault) protect against dumping secrets on crash or debugging?

The app still has it. It can dump it. It will dump it. Django for example (not a security best practice in itself, btw) will indeed dump ENV vars but will also dump its settings.

The solution to this problem lies not in how you get the secrets into the app, but in prohibiting them getting out of it. E.g. builds removing/stubbing tracing, dumping entirely. Or with proper logging and tracing layers that filter stuff.

There really is no difference, security wise, between logger.debug(system.env) and logger.debug(app.conf)

wswin · 2026-02-24T11:39:16 1771933156

These are from AWS right, what about simple, no cloud setups with just docker compose or even bare proccesses on a VPS?

jumploops · 2026-01-27T22:43:55 1769553835

I’ve found that LLMs seem to work better on LLM-generated codebases.

Commercial codebases, especially private internal ones, are often messy. It seems this is mostly due to the iterative nature of development in response to customer demands.

As a product gets larger, and addresses a wider audience, there’s an ever increasing chance of divergence from the initial assumptions and the new requirements.

We call this tech debt.

Combine this with a revolving door of developers, and you start to see Conway’s law in action, where the system resembles the organization of the developers rather than the “pure” product spec.

With this in mind, I’ve found success in using LLMs to refactor existing codebases to better match the current requirements (i.e. splitting out helpers, modularizing, renaming, etc.).

Once the legacy codebase is “LLMified”, the coding agents seem to perform more predictably.

YMMV here, as it’s hard to do large refactors without tests for correctness.

(Note: I’ve dabbled with a test first refactor approach, but haven’t gone to the lengths to suggest it works, but I believe it could)

mh2266 · 2026-01-28T03:10:47 1769569847

are LLM codebases not messy?

Claude by default, unless I tell it not to, will write stuff like:

    // we need something to be true
    somethingPasses = something()
    if (!somethingPasses) {
        return false
    }

    // we need somethingElse to be true
    somethingElsePasses = somethingElse()
    if (!somethingElsePasses) {
        return false
    }

    return true

instead of the very simple boolean logic that could express this in one line, with the "this code does what it obviously does" comments added all over the place.

generally unless you tell it not to, it does things in very verbose ways that most humans would never do, and since there's an infinite number of ways that it can invent absurd verbosity, it is hard to preemptively prompt against all of them.

to be clear, I am getting a huge amount of value out of it for executing a bunch of large refactors and "modernization" of a (really) big legacy codebase at scale and in parallel. but it's not outputting the sort of code that I see when someone prompts it "build a new feature ...", and a big part of my prompts is screaming at it not to do certain things or to refuse the task if it at any point becomes unsure.

jumploops · 2026-01-28T03:54:54 1769572494

Yeah to be clear it will have the same issues as a flyby contributor if prompted to.

Meaning if you ask it “handle this new condition” it will happily throw in a hacky conditional and get the job done.

I’ve found the most success in having it reason about the current architecture (explicitly), and then to propose a set of changes to accomplish the task (2-5 ways), review, and then implement the changes that best suit the scope of the larger system.

dexdal · 2026-01-28T04:08:47 1769573327

The failure mode is missing constraints, not “coding skill”. Treat the model as a generator that must operate inside an explicit workflow: define the invariant boundaries, require a plan/diff before edits, run tests and static checks, and stop when uncertainty appears. That turns “hacky conditional” behaviour into controlled change.

jumploops · 2026-01-28T05:25:47 1769577947

Yes, exactly.

The LLM is onboarding to your codebase with each context window, all it knows is what it’s seen already.

dexdal · 2026-01-28T23:42:35 1769643755

Right. Each context window is a partial view, so it cannot “know the codebase” unless you supply stable artefacts. Treat project state as inputs: invariants, interfaces, constraints, and a small set of must-keep facts. Then force changes through a plan and a diff, and gate with tests and checks. That turns context limits into a controlled boundary instead of a surprise.

olig15 · 2026-01-27T23:49:06 1769557746

Surely because LLM generated code is part of the training data for the model, so code/patterns it can work with is closer to its training data.