This is interesting. I’ve had the same issue trying to build an agentic system with the smaller ChatGPT models, almost regardless of the prompt (“think aloud” is a magic phrase that helps a lot, but it’s still flaky). Most of the time it would either perform the tool call before explaining it (the default), or explain it but then never actually make the call.
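For context, the pattern I’ve been trying looks roughly like this (a minimal sketch against the OpenAI chat completions API; the get_weather tool, the model name, and the exact prompt wording are just placeholders, not my actual setup):

    # Sketch: ask the model to explain before calling a tool, then inspect
    # whether it actually did both. get_weather is a hypothetical tool.
    from openai import OpenAI

    client = OpenAI()

    tools = [{
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the current weather for a city.",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }]

    messages = [
        {"role": "system", "content": (
            "Think aloud: before every tool call, first write one short "
            "sentence explaining what you are about to do and why, then "
            "make the call."
        )},
        {"role": "user", "content": "What's the weather in Oslo?"},
    ]

    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=messages,
        tools=tools,
    )

    msg = resp.choices[0].message
    # The two failure modes I keep seeing: content is empty (a call with no
    # explanation), or tool_calls is None (an explanation with no call).
    print("explanation:", msg.content)
    print("tool calls:", msg.tool_calls)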
I’ve been wondering how Cursor et al solved this problem (having the LLM explain what it will do before doing it is vitally important IMO), but maybe it’s just not a problem with the big models.
Your experience seems to suggest that smaller models are just generally worse at tool calling when asked to reason first (were you using Gemini Flash?).