Well said!
That is my impression of his Twitter feed from what I remember.
They've long since lost that advantage.
then you could treat the codebook entries as tokens and treat audio generation as a next-token prediction task
you then take the generated codebook entries and run them through the codec’s decoder to get audio
it works surprisingly well
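rough sketch of that loop in python, in case it helps. everything here is a made-up toy, not a real codec or llm: `CODEBOOK`, `encode`, `decode`, `train` and `generate` are hypothetical names, the "codec" is just nearest-neighbour lookup over five amplitude levels, and the "model" is a count table over the previous two tokens standing in for a transformer. the shape of the pipeline is the point: audio -> codebook indices -> next-token prediction -> decoder -> audio.

```python
from collections import Counter, defaultdict

# Toy stand-in for a neural codec: five scalar amplitude levels (assumption).
CODEBOOK = [-1.0, -0.5, 0.0, 0.5, 1.0]


def encode(samples):
    """Quantize each sample to the index of its nearest codebook entry."""
    return [
        min(range(len(CODEBOOK)), key=lambda i: abs(CODEBOOK[i] - s))
        for s in samples
    ]


def decode(tokens):
    """Codec 'decoder': look each token id back up in the codebook."""
    return [CODEBOOK[t] for t in tokens]


def train(tokens, context=2):
    """Count which token follows each context of `context` previous tokens."""
    counts = defaultdict(Counter)
    for i in range(context, len(tokens)):
        counts[tuple(tokens[i - context:i])][tokens[i]] += 1
    return dict(counts)


def generate(counts, prompt, n_new, context=2):
    """Greedy next-token prediction: always pick the most frequent follower."""
    out = list(prompt)
    for _ in range(n_new):
        followers = counts.get(tuple(out[-context:]))
        if not followers:
            break  # unseen context, nothing to predict
        out.append(followers.most_common(1)[0][0])
    return out


# "Training audio": a slightly noisy triangle wave, repeated four times.
wave = [0.02, 0.48, 0.97, 0.55, -0.03, -0.46, -1.02, -0.52] * 4

tokens = encode(wave)                        # audio -> codebook indices
model = train(tokens)                        # fit the toy next-token model
generated = generate(model, tokens[:2], 6)   # continue from a 2-token prompt
audio = decode(generated)                    # codebook indices -> audio
print(generated, audio)
```

a real setup swaps the lookup table for a learned codec (usually with several codebooks per frame) and the count table for an llm, but the token-in, token-out loop is the same idea.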
speech-text models (tts models with an llm as the backbone) are the current meta