I think the researchers agree with your premise. The “evidence” is not that chicks have more language understanding than previously thought, but rather that the universality of bouba/kiki stems from something more primitive than built-in human language hardware.
Modern LLMs are certainly fine-tuned on data that includes examples of tool use: mostly the tools built into their respective harnesses, but also external/mock tools so they don't overfit on only the toolset they expect to see in their harnesses.
IDK the current state, but I remember that, last year, open-source coding harnesses needed to provide exactly the tools the LLM expected, or the error rate went through the roof. Some models, like Grok and Gemini, only recently managed to make tool calls somewhat reliable.
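To make the "exact tools" point concrete, here's a rough sketch of a tool definition in the common OpenAI-style function-calling format (the `read_file` tool and its fields are made up for illustration, not from any particular harness):

```python
# Hypothetical tool definition in the OpenAI-style function-calling schema.
tools = [
    {
        "type": "function",
        "function": {
            "name": "read_file",
            "description": "Read a file from the workspace and return its contents.",
            "parameters": {
                "type": "object",
                "properties": {
                    "path": {
                        "type": "string",
                        "description": "Path relative to the repo root.",
                    },
                },
                "required": ["path"],
            },
        },
    },
]
# The harness sends `tools` with every request. If the model was fine-tuned
# expecting a differently named tool or a different argument schema, it tends
# to emit calls that don't match, and the harness has to reject or repair them.
```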
Depends on your goals, of course, but it's worth mentioning there are plenty of narrowish tasks (think text-to-SQL and other less general language tasks) where Llama 8B or Phi-4 (14B), or even models up to ~30B with quantization, can be trained on 8xA100 with great results. Plus these smaller models can be served on a single A100, or even an L4 with post-training quantization, with wicked fast generation thanks to the lighter model.
On a related note, at what point are people going to get tired of waiting 20s for an LLM to answer their questions? I wish it were more common for smaller models to be used when they're sufficient.
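As a rough illustration of the single-GPU serving point: loading a mid-sized model with 4-bit post-training quantization via Hugging Face transformers + bitsandbytes looks something like the sketch below (the model id and prompt are just examples, and the exact memory fit depends on the GPU):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# Example model id; any similar-sized causal LM loads the same way.
model_id = "microsoft/phi-4"

# 4-bit NF4 post-training quantization so the weights fit on a single 24-40GB GPU.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)

# Toy narrow-task prompt (text-to-SQL style).
prompt = "Translate to SQL: list all customers who placed an order in 2024."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```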
The subtext is that the solution is always the best possible move sequence. OP's comment is clarifying that sometimes, after executing that best move sequence, the puzzle ends with a capture, and sometimes it ends with a checkmate (“winning”).