Not pixels, but percels. Pixels are points in the image, while a "percel" is a unit of perceptual information. It might be a pixel with an associated sound, at a given moment in time. In the case of humans, percels include the other senses as well, and they can also be annotated with your own thoughts (i.e. percels can also include tokens or embeddings).
Of course, NNs like LLMs never process a percel in isolation, but always as a group of neighboring percels (aka context), with an initial focus on one of them.
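To make that concrete, here is a minimal sketch of what a percel might look like as a data structure (purely illustrative and my own invention; the Percel class, its fields, and the channel shapes are all made up):

    # Hypothetical sketch: a percel as a bundle of per-sense channels sampled
    # at one moment, with optional annotations (tokens, i.e. "own thoughts").
    from dataclasses import dataclass, field
    import numpy as np

    @dataclass
    class Percel:
        rgb: np.ndarray | None = None    # pixel colour, shape (3,); None = channel missing
        audio: np.ndarray | None = None  # local audio features at this moment, or None
        timestamp: float = 0.0           # when this unit of perception was sampled
        annotations: list[int] = field(default_factory=list)  # optional token ids

    # A model would attend over a context of neighboring percels,
    # with an initial focus on one of them:
    context = [Percel(rgb=np.random.rand(3), timestamp=t * 0.04) for t in range(16)]
    focus = context[8]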
Grant reviews are blind reviews - so you don’t know.
Also - and even worse - there is no rebuttal process. It gets rejected without you having a chance to clarify / convince reviewers.
Instead you’d need to resubmit and start the entire process from scratch. What a waste of resources …
It’s the final nail that made me quit pursuing a scientific career path, despite having good pubs & a PhD with honours.
That's unfortunate. My personal sense is that while agentic LLMs are not going to get us close to AGI, a few relatively modest architectural changes to the underlying models might actually do it, and I do think mimicry of our own self-referential attention is a very important component of that.
While the current AI boom is a bubble, I actually think the AGI nut could get cracked quietly by a company with even modest resources, if they get lucky with the right fundamental architectural changes.
I agree - and I think an interdisciplinary approach is going to increase the odds here. There is a ton of useful knowledge in related disciplines - often just named differently - that turns out to be investigating the same problem from a different angle.
I love this idea, but can't find anything about it. Is this a neologism you just coined? If so, is there any particular paper or work that led you to think about it in those terms?
Yes, I just coined the neologism. It was meant to be partly sarcastic (why stop at pixels, why not go fully multimodal and treat the missing channels as missing information?), and I am kind of surprised it got so upvoted.
(IME, the comments I think are deep often get ignored, while silly things - where I was thinking "this is too much trolling, or too obvious" - get upvoted; but don't take it the wrong way, I am flattered you like it.)
Assuming channels can be effectively merged into a single percel vector, that would open up interesting channels even beyond human perception, e.g. lidar. It would also be interesting to train a model that feels at home in 4D space.
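As a rough sketch of that merging (my own, purely illustrative; the channel names and dimensions are made up), each channel could be concatenated into one vector along with a presence flag, so an absent channel reads as missing information rather than as silence:

    # Hypothetical sketch: fuse heterogeneous channels (including non-human
    # ones like lidar) into a single percel vector; a presence flag per
    # channel marks missing modalities explicitly instead of faking zeros.
    import numpy as np

    CHANNEL_DIMS = {"rgb": 3, "audio": 8, "lidar_depth": 1}  # assumed layout

    def to_percel_vector(channels: dict[str, np.ndarray]) -> np.ndarray:
        parts = []
        for name, dim in CHANNEL_DIMS.items():
            value = channels.get(name)
            present = value is not None
            parts.append(value if present else np.zeros(dim))
            parts.append(np.array([1.0 if present else 0.0]))  # missing-channel flag
        return np.concatenate(parts)

    # A percel with rgb and lidar but no audio: the flag tells the model
    # that audio is missing, not silent.
    v = to_percel_vector({"rgb": np.array([0.2, 0.5, 0.1]), "lidar_depth": np.array([12.3])})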
Deep things often, though not always, take more attention to appreciate than superficial ones. Attention is a precious resource people are seldom disposed to allocate much of when headline-surfing HN.
No, the latent space doesn't have to be made of percels, just like not every 2D array of 3-element vectors is an image made of pixels. Percels are tied to your sensors; they are the components of what you perceive, in totality.
Of course there is an interesting paradox: each layer of the NN doesn't know whether it's connected to the sensors directly or whether it works with some kind of abstraction in latent space. So the boundary between the mind and the sensors is blurred, and to some extent a subjective choice.
I'm not an ML expert or practitioner, so someone might need to correct me.
That said, I believe the percel's components together, as a whole, would capture the combined audio+visual+time state. However, I don't think the state of one particular mode (e.g. audio, visual, or time) is encoded in a specific subset of the percel's components. Rather, each component of the percel would represent a mixture (or a portion of a mixture) of audio+visual+time. So you couldn't isolate just the audio, visual, or time state by looking at some specific subset of the components, because each component is itself a mixture of the combined state.
I think the classic analogy is that if river 1 and river 2 combine to form river 3, you cannot take a cup of water from river 3 and separate out the portions from river 1 and river 2; they're irreversibly mixed.
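A toy numerical version of that mixing (my own illustration, not any specific model's architecture): a dense fusion layer with random weights shows why no subset of the fused components is "the audio part":

    # Illustrative sketch: after a dense fusion layer, every component of the
    # fused vector is a weighted mixture of all input modalities.
    import numpy as np

    rng = np.random.default_rng(0)
    audio, visual, time = rng.random(8), rng.random(3), rng.random(1)
    x = np.concatenate([audio, visual, time])  # (12,) raw multimodal input

    W = rng.normal(size=(16, 12))  # dense fusion weights (every column nonzero)
    fused = W @ x                  # each of the 16 components mixes audio, visual, time

    # Any single fused component depends on all modalities, like a cup of
    # water drawn from the merged river.
    print(fused[0], W[0])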