
> the behavior of LLMs is not thanks to a learned world model, but to brute force memorization of incomprehensibly abstract rules governing the behavior of symbols, i.e. a model of syntax.

I think reinforcing this distinction between syntax and semantics is wrong. What LLMs have shown, I think, is that semantics reduces to the network of associations between syntactic units (symbols). So LLMs do learn a world model if trained long enough, past the "grokking" threshold, but it's a model of the world as we've described it linguistically, and natural language is ambiguous, imprecise, and not always consistent. In other words, it's a relatively anemic world model.
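
For what it's worth, "grokking" here refers to the empirical observation (Power et al., 2022) that a small network trained well past the point of memorizing its training set can abruptly start generalizing to held-out cases. A toy sketch of that setup on modular addition (my own illustration, with hyperparameters I picked for the example, not anything from the article):

    # Toy illustration: grokking on modular addition.
    # A small network trained well past memorization can abruptly generalize.
    import torch
    import torch.nn as nn

    P = 97  # modulus; the task is predicting (a + b) mod P
    pairs = torch.cartesian_prod(torch.arange(P), torch.arange(P))
    labels = (pairs[:, 0] + pairs[:, 1]) % P
    perm = torch.randperm(len(pairs))
    split = len(pairs) // 2  # train on half the pairs, hold out the rest
    train_idx, val_idx = perm[:split], perm[split:]

    model = nn.Sequential(
        nn.Embedding(P, 128),   # (N, 2) operand ids -> (N, 2, 128)
        nn.Flatten(),           # -> (N, 256)
        nn.Linear(256, 256), nn.ReLU(),
        nn.Linear(256, P),
    )
    opt = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1.0)
    loss_fn = nn.CrossEntropyLoss()

    def accuracy(idx):
        with torch.no_grad():
            return (model(pairs[idx]).argmax(-1) == labels[idx]).float().mean().item()

    for step in range(20_000):
        opt.zero_grad()
        loss_fn(model(pairs[train_idx]), labels[train_idx]).backward()
        opt.step()
        if step % 1000 == 0:
            # train accuracy saturates early; val accuracy jumps much later
            print(step, accuracy(train_idx), accuracy(val_idx))

Whether and when the generalization jump shows up is sensitive to the weight decay, training fraction, and step budget, but the qualitative pattern is the point: memorization comes first, the more structured representation later.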

Different modalities help because they provide more complete pictures of the concepts we've named in language, which fleshes out relationships that may not have been fully described in natural language. But to do this well, the semantic network built by an LLM must be able to map different modalities to the same "concept". I think Meta's Large Concept Models work is promising for this reason:

https://arxiv.org/abs/2412.08821
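
To make the "map different modalities to the same concept" idea concrete, here's a generic sketch of the standard way to do it: train modality-specific encoders so that paired inputs land near each other in a shared embedding space. (This is CLIP-style contrastive alignment for illustration only, not the LCM paper's actual approach, which models sequences of sentence-level embeddings; the encoders and dimensions below are placeholders.)

    # Generic sketch: mapping two modalities into a shared "concept" space
    # with a contrastive objective. Not Meta's LCM implementation.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class Encoder(nn.Module):
        """Stand-in for a modality-specific encoder (text, image, audio, ...)."""
        def __init__(self, in_dim, concept_dim=256):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(in_dim, 512), nn.GELU(), nn.Linear(512, concept_dim))

        def forward(self, x):
            return F.normalize(self.net(x), dim=-1)  # unit-norm concept vectors

    def contrastive_loss(za, zb, temperature=0.07):
        # InfoNCE: each item's positive is its paired item in the other modality.
        logits = za @ zb.t() / temperature
        targets = torch.arange(len(za))
        return (F.cross_entropy(logits, targets)
                + F.cross_entropy(logits.t(), targets)) / 2

    # Toy usage with random features standing in for real text/image inputs.
    text_enc, image_enc = Encoder(300), Encoder(2048)
    loss = contrastive_loss(text_enc(torch.randn(32, 300)),
                            image_enc(torch.randn(32, 2048)))
    loss.backward()

The design point is that the "concept" lives in the shared space rather than in any one modality's tokens, which is what lets relationships underdescribed in language get filled in from the other modality.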

This is on the path to what the article describes humans doing, i.e. many of our skills developing out of "overlapping cognitive structures". But it is still a sort of "multimodal LLM", so I'm not persuaded by the article's argument that embodiment and the like are needed.

> but it is clear that there are many problems in the physical world that cannot be fully represented by a system of symbols and solved with mere symbol manipulation.

That's not clear at all, and the link provided in that sentence doesn't suggest any such thing.


