That is because they are all using non-deterministic approaches, i.e. expecting that a single detailed 10,000-word prompt is going to generate a stable application. Prompts don't have replay value: the same prompt can produce different output on every run. So you have to split the work into one microtask per agent and validate each output, with a deterministic fallback as and when required.
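To make that concrete, here is a rough sketch of the pattern in Python, not a definitive implementation. Everything named here (`Microtask`, `run_microtask`, `flaky_slug_agent`, and so on) is made up for illustration, and the "agent" is a plain function standing in for whatever model call you actually use. Each microtask carries its own deterministic validator and a deterministic fallback that runs when the agent's output fails validation.

```python
import re
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass
class Microtask:
    # All fields are hypothetical; adapt them to your own agent framework.
    name: str
    agent: Callable[[str], str]      # stand-in for a non-deterministic model call
    validate: Callable[[str], bool]  # deterministic check on the agent's output
    fallback: Callable[[str], str]   # deterministic code path, no model involved
    max_attempts: int = 2


def run_microtask(task: Microtask, payload: str) -> Tuple[str, str]:
    """Return (result, source), where source is 'agent' or 'fallback'."""
    for _ in range(task.max_attempts):
        try:
            candidate = task.agent(payload)
        except Exception:
            continue  # treat an agent error like a failed attempt
        if task.validate(candidate):
            return candidate, "agent"
    # Every attempt failed validation, so use the deterministic path.
    return task.fallback(payload), "fallback"


def flaky_slug_agent(title: str) -> str:
    # Simulates a model that ignores the requested output format.
    return f"Sure! Here is your slug: {title}"


def is_valid_slug(text: str) -> bool:
    return re.fullmatch(r"[a-z0-9]+(-[a-z0-9]+)*", text) is not None


def slug_fallback(title: str) -> str:
    return re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")


slug_task = Microtask("slugify", flaky_slug_agent, is_valid_slug, slug_fallback)
result, source = run_microtask(slug_task, "Hello, World! 2024")
print(result, source)
```

The point of the design is that the validator and the fallback are ordinary code, so the pipeline's worst case is predictable even when the agent's output is not.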