You're right about "reasoning". It's just trying to steer the conversation in a more relevant direction in vector space, hopefully to generate more relevant output tokens. I find it easier to conceptualize this in three dimensions. 3blue1brown has a good video series which covers the overall concept of LLM vectors in machine learning: https://youtube.com/playlist?list=PLZHQObOWTQDNU6R1_67000Dx_...
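To make the vector-space picture concrete, here's a minimal sketch with made-up 3-D vectors (real embeddings have hundreds or thousands of dimensions, and every number here is purely illustrative): mixing a context vector into an ambiguous word vector pulls it toward one cluster.

```python
# Toy 3-D "embeddings" -- invented values, not from any real model.
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

queen   = np.array([0.5, 0.5, 0.0])   # ambiguous: between both senses
monarch = np.array([1.0, 0.0, 0.0])   # royalty sense
bee     = np.array([0.0, 1.0, 0.0])   # insect sense
hive    = np.array([0.0, 0.9, 0.3])   # bee-related context token

print(cosine(queen, monarch), cosine(queen, bee))   # ~0.71 vs ~0.71: a tie

# Averaging in the context vector steers "queen" toward the bee cluster.
steered = (queen + hive) / 2
print(cosine(steered, monarch), cosine(steered, bee))  # ~0.33 vs ~0.92
```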
To give a concrete example, say we're generating the next token from the word "queen". Is this the monarch, the bee, the playing card, the drag entertainer? By adding more relevant tokens (honey, worker, hive, beeswax) we steer the token generation to the place in the "word cloud" where our next token is more likely to exist.
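If you want to watch that happen with a real model, here's a rough sketch using the Hugging Face transformers library, with GPT-2 purely as a small stand-in (any causal LM would do, and the exact top tokens will vary by model; the point is the shift in the distribution):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def top_next_tokens(prompt, k=5):
    ids = tok(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits[0, -1]        # logits for the next token
    probs = torch.softmax(logits, dim=-1)
    top = torch.topk(probs, k)
    return [(tok.decode(int(i)), round(p.item(), 3))
            for i, p in zip(top.indices, top.values)]

print(top_next_tokens("The queen"))                       # ambiguous
print(top_next_tokens("Honey, worker, hive. The queen"))  # steered toward bees
```

Same word, two different neighborhoods in the distribution, just from a few extra context tokens.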
I don't see LLMs as "lossy compression" of text. To me, that framing implies retrieval, and Transformers are a prediction device, not a retrieval device. If one needs retrieval, use a database.
> You're right about "reasoning". It's just trying to steer the conversation in a more relevant direction in vector space, hopefully to generate more relevant output tokens.
I like to frame it as a theater script cycling through the LLM. The "reasoning" difference is just changing the style so that each character has film noir monologues. The underlying process hasn't really changed, and the monologue text isn't fundamentally different from dialogue or stage direction... but more data still means more guidance for each improv cycle.
> say we're generating the next token from the word "queen". Is this the monarch, the bee, the playing card, the drag entertainer?
I'd like to point out that this scheme can result in things that look better to humans in the end... even when the "clarifying" choice is entirely arbitrary and irrational.
In other words, we should be alert to the difference between "explaining what you were thinking" versus "picking a firm direction so future improv makes nicer rationalizations."
It makes sense if you think of the LLM as building a data-aware model that compresses the noisy data by parsimony (the principle that the simplest explanation that fits is best). Typical text compression algorithms are not data-aware and not robust to noise.
In lossy compression the compression itself is the goal. In prediction, compression is the road that leads to parsimonious models.
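As a quick illustration of the "not robust to noise" half of that claim, using zlib as the stand-in for a typical compressor: corrupting even a small fraction of bytes in highly regular text wrecks the compression ratio, because the compressor matches exact byte patterns rather than modeling the underlying regularity.

```python
import random
import zlib

random.seed(0)
clean = ("the queen bee returned to the hive " * 200).encode()

# Corrupt roughly 5% of the bytes with random printable characters.
noisy = bytearray(clean)
for _ in range(len(noisy) // 20):
    noisy[random.randrange(len(noisy))] = random.randrange(32, 127)

print(len(zlib.compress(clean)))         # tiny: the repetition is exploited
print(len(zlib.compress(bytes(noisy))))  # far larger after a little noise
```

A model that had learned the sentence's structure would shrug off those flipped bytes; the pattern-matching compressor can't.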