
zambelli · 2026-05-24T00:59:16 1779584356

Thanks for the thoughtful comment! Let me try to unpack some of what's there and what's missing.

Forge is at its core a mechanical reliability layer, whereas a lot of memory/skill management would be more of an orchestration component that the consumer would own.

That split, with Forge stopping at the mechanical layer, was an intentional design decision, but there's no reason it couldn't grow into more. I think a lot of what you're thinking about is a big model/small model split similar to how CC does it - but that's an orchestrator.

Now, where Forge can help with what you're suggesting: I think most of it is there, but it needs some wiring from the consumer/orchestrator:
- Forge surfaces information about which guardrails fired: InferenceResult.new_messages carries a typed MessageMeta.type (RETRY_NUDGE, STEP_NUDGE, PREREQUISITE_NUDGE, CONTEXT_WARNING, SUMMARY). So every nudge that fired during a run is observable per-step. A consumer could capture that and compare it to workflow steps to reconstruct what success looked like.
- Combined with Guardrails.check() > CheckResult, you would have a lot of the journey the model took to get to the answer.
- Forge lets you (actually, requires you to) define the system prompt, any workflow restrictions, and the tools. So if you know something about how your task will behave with a small model, you can include that in the system prompt, or as a tool that's a required step, etc.
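A minimal sketch of the consumer-side capture described in the first bullet. The dataclasses here are hypothetical stand-ins, not Forge's real classes: only the names InferenceResult.new_messages, MessageMeta.type, and the five type names come from the comment above, and TOOL_CALL is an assumed non-guardrail type added for contrast.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum


# Hypothetical stand-ins for Forge's types; shapes are assumed for illustration.
class MessageType(Enum):
    RETRY_NUDGE = "retry_nudge"
    STEP_NUDGE = "step_nudge"
    PREREQUISITE_NUDGE = "prerequisite_nudge"
    CONTEXT_WARNING = "context_warning"
    SUMMARY = "summary"
    TOOL_CALL = "tool_call"  # assumed: an ordinary, non-guardrail message


@dataclass
class MessageMeta:
    type: MessageType


@dataclass
class Message:
    meta: MessageMeta
    content: str


@dataclass
class InferenceResult:
    new_messages: list[Message]


# The message types that indicate a guardrail fired.
GUARDRAIL_TYPES = {
    MessageType.RETRY_NUDGE,
    MessageType.STEP_NUDGE,
    MessageType.PREREQUISITE_NUDGE,
    MessageType.CONTEXT_WARNING,
    MessageType.SUMMARY,
}


def guardrail_events(results: list[InferenceResult]) -> list[tuple[int, str]]:
    """Return (step_index, type_name) for every guardrail message, in order."""
    events = []
    for step, result in enumerate(results):
        for msg in result.new_messages:
            if msg.meta.type in GUARDRAIL_TYPES:
                events.append((step, msg.meta.type.name))
    return events


def summarize(events: list[tuple[int, str]]) -> Counter:
    """Count how often each guardrail type fired across the run."""
    return Counter(name for _, name in events)
```

The (step, type) pairs are the part a memory system would likely want: they can be lined up against the workflow's expected steps to see where the model needed help.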

For integrations into MCPs/etc that house memories and skills, those can be surfaced to the model with Forge in place. Prompt the model to search for tools in the MCP/surface an MCP tool, etc. I've built a consumer that follows this pattern: main agent gets task > main agent eyeballs whether it can be solved on its own > if not, sends to a subagent specialized on that topic (that has access to more tools related to that) - which allows me to keep context lean for each agent.

You could do something similar where the model is prompted to use its toolset, but if it's unsure or needs a tool it doesn't have, to call the get_mcp() tool or something to look for better options.
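A rough sketch of that escape-hatch pattern, assuming a keyword search over a tool catalog. ToolCatalog and Agent are hypothetical names made up for illustration, and a real MCP integration would query the server rather than an in-memory dict.

```python
from dataclasses import dataclass


@dataclass
class ToolCatalog:
    """Hypothetical stand-in for an MCP server's tool listing."""
    tools: dict[str, str]  # tool name -> description

    def search(self, query: str) -> list[str]:
        """Return names of tools whose name or description matches any query word."""
        words = query.lower().split()
        return sorted(
            name
            for name, desc in self.tools.items()
            if any(w in name.lower() or w in desc.lower() for w in words)
        )


@dataclass
class Agent:
    """Keeps a lean active toolset and widens it only on request."""
    active_tools: set[str]
    catalog: ToolCatalog

    def get_mcp(self, query: str) -> list[str]:
        """The escape-hatch tool: find matching tools and add them to the active set."""
        found = self.catalog.search(query)
        self.active_tools.update(found)
        return found


catalog = ToolCatalog({
    "read_file": "Read a file from disk",
    "sql_query": "Run a SQL query against the warehouse",
    "send_email": "Send an email",
})
agent = Agent(active_tools={"read_file"}, catalog=catalog)
```

The point of the design is that the default context only carries the small toolset; the larger catalog costs tokens only when the model asks for it.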

Big model v small model now. A couple of ways I think about it:
- You could use big models to go through your workflow a few times, see common patterns, and then use those to define prerequisite and required steps in Forge guardrails when using small models.
- You could use small models the same way Claude Code uses the ANTHROPIC_SMALL_FAST_MODEL env var (this is what the Explore subagent uses, I think). The big model is effectively an orchestrator, and when it recognizes a task is easy, it dispatches a small model to do it, where Forge might make it viable.
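One way the first idea could look, as a sketch: treat each successful big-model run as an ordered list of tool names, take the tools present in every run as "required", and take the required tools that always come first as "prerequisites". The function names and the trace format are assumptions for illustration; how the result maps onto Forge's guardrail config is left open.

```python
def required_steps(traces: list[list[str]]) -> set[str]:
    """Tools that appear in every successful trace."""
    sets = [set(trace) for trace in traces]
    return set.intersection(*sets) if sets else set()


def prerequisites(traces: list[list[str]]) -> dict[str, set[str]]:
    """Map each required tool to the required tools that always precede its first use."""
    required = required_steps(traces)
    # Start by assuming every other required tool is a prerequisite, then prune.
    prereqs = {tool: set(required) - {tool} for tool in required}
    for trace in traces:
        first_seen: dict[str, int] = {}
        for index, tool in enumerate(trace):
            first_seen.setdefault(tool, index)
        for tool in required:
            prereqs[tool] = {
                p for p in prereqs[tool] if first_seen[p] < first_seen[tool]
            }
    return prereqs


# Tool-call sequences from three hypothetical successful runs.
traces = [
    ["search", "read", "edit", "test"],
    ["search", "read", "read", "edit", "test"],
    ["read", "search", "edit", "lint", "test"],
]
```

With only a handful of traces this will overfit to incidental ordering, so the output is better treated as candidates to review than as rules to enforce blindly.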

Hopefully that helps! Forge could certainly elevate some of this to be more native - and I might do that - like a mode that packages up results for you so you don't need to reconstruct the nudge events from hooks firing. But everything should be there to integrate with a memory system with the information required, or with an API/MCP that has more tools or skills for the agent to read.

Would love to see the integration if you do it! You'd just need a consumer that captures the events forge returns and packages them up into whatever your memory system is looking for!

If you're looking for other ways of ingesting those memories/skills that aren't system prompt, message, or tool result, then that's something I can look into.

zambelli · 2026-05-21T04:13:58 1779336838

Merged! Thanks for that catch. I'll try to sequence the in-flight work ASAP to get the vllm branch merged in as a whole.

zambelli · 2026-05-20T20:16:05 1779308165

Yeah I got it working as a quick test run to confirm a model issue vs backend issue on a consumer app. It worked on my dual-5070 Ti rig, but I didn't have time to formalize all the way and merge it in. Thanks for linking it!

somethingsome · 2026-05-20T21:33:10 1779312790

Thanks, I just tried, for me it worked on 2x L40S with vLLM. I had some issues due to the model name, forge was forwarding 'default' instead of the real model name 'Qwen2.5-Coder-14B-Instruct'.

If someone else struggles on this step, I added in vLLM args: --served-model-name "Qwen2.5-Coder-14B-Instruct" --served-model-name "default"

So default becomes an alias.

I didn't yet test Forge, I was just happy that it worked at the moment ;)

zambelli · 2026-05-20T22:05:37 1779314737

Oh that's a good find, I'll bookmark this for a GitHub issue.

Glad to hear it's working!

zambelli · 2026-05-20T19:04:04 1779303844

Nice symmetry with tool call failures being sent to LLM that made the call without bugging the user. The artifact-generating entity gets the error back, effectively.

100% correct, and stackable. You could have topic refusal in the LLM training itself, Forge at the tool call layer, and SDLC gates at the workflow level.

mrothroc · 2026-05-21T13:45:49 1779371149

Definitely stacks. The thing that made it clear for me was being explicit about the stages, and where/what you can verify with a guardrail, or gate. I wrote up the framework I use here: https://michael.roth.rocks/research/trust-topology/

Being explicit about the space between the stages is critical, because that's your enforcement point.

zambelli · 2026-05-21T18:49:30 1779389370

This is a really neat writeup, and the empirical data for coding agents is super useful. Will take a closer read and see if there's anything I can easily lift into my harness!

mrothroc · 2026-05-22T16:09:10 1779466150

Thanks, glad you find it useful! Feel free to ping me if you have any questions.

zambelli · 2026-05-20T16:55:00 1779296100

Interesting, catching the problem upstream, effectively. How did you enforce the grammar?

peer0 · 2026-05-20T17:43:04 1779298984

https://github.com/ggml-org/llama.cpp/blob/master/grammars/R...

llama.cpp supports grammar limiting using either GBNF or JSON schema (it just translates it to GBNF behind the scenes, I think). So I have my harness generate a tool schema on the fly (based on what tools are possible for the current task) and pass it in at request time.
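A sketch of what generating that schema on the fly might look like. This is not peer0's harness: the function names, the tool definitions, and the {"name", "arguments"} call shape are all assumptions. The `json_schema` request field is what I understand the llama.cpp server to accept, and its schema-to-grammar conversion covers only a subset of JSON Schema, so check the server docs before relying on either.

```python
def tool_call_schema(tools: dict[str, dict]) -> dict:
    """Build a JSON schema that admits exactly one call to one of the given tools.

    tools maps each tool name to the JSON schema of its arguments.
    """
    if not tools:
        raise ValueError("at least one tool is required")
    return {
        "oneOf": [
            {
                "type": "object",
                "properties": {
                    "name": {"const": name},  # pin the tool name
                    "arguments": args_schema,  # constrain its arguments
                },
                "required": ["name", "arguments"],
                "additionalProperties": False,
            }
            for name, args_schema in sorted(tools.items())
        ]
    }


def build_request(prompt: str, tools: dict[str, dict]) -> dict:
    """Request body with the schema attached (field name assumed, see above)."""
    return {"prompt": prompt, "json_schema": tool_call_schema(tools)}


# Hypothetical tools available for the current task.
current_tools = {
    "read_file": {
        "type": "object",
        "properties": {"path": {"type": "string"}},
        "required": ["path"],
    },
    "list_dir": {
        "type": "object",
        "properties": {"path": {"type": "string"}},
        "required": ["path"],
    },
}
request_body = build_request("List the files in /tmp", current_tools)
```

Rebuilding the schema per request is what keeps the constraint tight: a tool that is not valid for the current step simply is not in the grammar.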

zambelli · 2026-05-20T17:59:29 1779299969

Oh, interesting - thanks for the link. I really haven't explored this but it should slot in fairly easily I think? Gotta dig into it more.

tmzt · 2026-05-22T19:35:43 1779478543

It's basically restricting what logits are allowed when sampling the model to conform with the JSON (or whatever) shape. It can also cause the model to get "confused" though and doesn't always result in the output you want.
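A toy illustration of that masking step, in plain Python with made-up logits and a four-token vocabulary. Real implementations do this over the full vocabulary at every decoding step, with the allowed set computed from the grammar state; none of the names below come from llama.cpp.

```python
import math


def mask_logits(logits: list[float], allowed_ids: set[int]) -> list[float]:
    """Set every token the grammar forbids to -inf so it can never be sampled."""
    return [x if i in allowed_ids else -math.inf for i, x in enumerate(logits)]


def softmax(logits: list[float]) -> list[float]:
    """Convert logits to probabilities; -inf entries become exactly 0."""
    top = max(logits)
    if top == -math.inf:
        raise ValueError("no token is allowed at this step")
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def argmax(values: list[float]) -> int:
    """Index of the largest value (greedy pick)."""
    return max(range(len(values)), key=values.__getitem__)


logits = [2.0, 5.0, 1.0, 0.5]  # the model strongly prefers token 1
allowed = {0, 2}               # but the grammar only permits tokens 0 and 2

unconstrained_pick = argmax(softmax(logits))
constrained_probs = softmax(mask_logits(logits, allowed))
constrained_pick = argmax(constrained_probs)
```

This also shows where the "confusion" can come from: the model's preferred token is ruled out, so generation continues from a token it rated much lower, and later tokens are conditioned on that forced choice.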

zambelli · 2026-05-20T15:35:01 1779291301

Oh, awesome! I'll take a look.

zambelli · 2026-05-20T14:50:18 1779288618

Thank you! I've been trying to catch those replies and redirect people, but hopefully your comment will be upvoted for others. Very embarrassing to put up the post with the wrong link lol.

zambelli · 2026-05-20T13:26:48 1779283608

This is not an agentic coding harness. It's a generic tool-calling guardrail stack. I have since built a coding harness on top of Forge, but that's not what this is.

zambelli · 2026-05-20T13:24:23 1779283463

I know :( - I posted the wrong link and now it's there forever.

Dashboard is in here: https://github.com/antoinezambelli/forge/tree/main/docs/resu...

zambelli · 2026-05-20T13:15:46 1779282946

Yeah I would think so!

A lot of current tooling is layered mostly at the workflow level. Auth for the agent, or memory management for the agent (like some smart skills stuff), but Forge sits below that.

In most cases I've looked at, it could be slotted in with other work without much disruption. Forge just increases mechanical reliability of tool-calling, it shouldn't disrupt your workflow-level layers much.