Progress in computer vision has been slow over the last ~5 years. We are still not close to human performance. This is in contrast to language understanding, which has been solved: LLMs understand text at a human level (even if they have other limitations). But vision isn't solved. Foundation models struggle to segment some objects, they don't generalize to domains such as scientific images, etc. I wonder what the models are missing. We have enough data in videos. Is it compute? Is the task not informative enough? Do we need agency in 3D?
I’m not an expert in the field, but intuitively, from my own experience, I’d say what’s missing is a world model. By trying to be more conscious of my own vision, I’ve started to notice how often I fail to recognize a shape and then use additional knowledge, context and extrapolation to deduce what it could be.
A few examples I encountered recently: If I take a picture of my living room many random objects would be impossible to identify by a stranger but easy by the household members. Or when driving, say at night, I see a big dark shape coming from the side of the road. If I’m a local I’ll know there are horses in that field and that it is fenced, or I might have read a warning sign earlier that lets me deduce what I’m seeing a few minutes later.
People are usually not conscious of this, but you can try to block out the additional information and process only what’s really coming from your eyes, and you'll realize how quickly it becomes insufficient.
> If I take a picture of my living room many random objects would be impossible to identify by a stranger but easy by the household members.
Uneducated question, so it may sound silly: a sufficiently complex vision model must have seen a million living rooms and the random objects in them, enough to make some good guesses, no?
The problem is the data. LLM data is self-supervised. Vision data is very sparsely annotated in the real world. Going a step further, robotics data is much sparser still. So getting these models to improve on this long-tail distribution will take time.
How far can you go by improving the curriculum?
Start simple. Find a shorter and shorter sequence of examples that gives you the best result. What is the shortest sequence to get to some perplexity? Why?
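To make the "shortest sequence to get to some perplexity" question concrete, here is a toy sketch. Everything in it is an assumption for illustration: the add-one-smoothed unigram model stands in for a real model, and the names `perplexity` and `shortest_prefix` are made up. It only searches over prefixes of one fixed ordering; searching over orderings or subsets would be the harder, more interesting version.

```python
import math
from collections import Counter


def perplexity(train_docs, eval_tokens, vocab):
    """Perplexity of an add-one-smoothed unigram model on eval_tokens.

    train_docs: list of token lists used as training data.
    vocab: set of all possible tokens (assumed to cover eval_tokens).
    """
    counts = Counter(tok for doc in train_docs for tok in doc)
    total = sum(counts.values())
    v = len(vocab)
    # Average negative log-likelihood per token, then exponentiate.
    log_prob = sum(math.log((counts[t] + 1) / (total + v)) for t in eval_tokens)
    return math.exp(-log_prob / len(eval_tokens))


def shortest_prefix(curriculum, eval_tokens, vocab, target):
    """Smallest k such that training on curriculum[:k] gives
    perplexity <= target, or None if no prefix reaches it."""
    for k in range(len(curriculum) + 1):
        if perplexity(curriculum[:k], eval_tokens, vocab) <= target:
            return k
    return None


# Hypothetical toy data: the first example is unhelpful for the eval text.
vocab = {"a", "b", "c", "d"}
curriculum = [["c", "d"], ["a", "a", "b"], ["a"]]
eval_tokens = ["a", "a", "b"]
print(shortest_prefix(curriculum, eval_tokens, vocab, target=3.5))
```

In this toy setup the first example actually makes perplexity worse, which is the kind of thing a curriculum search would surface: reordering or dropping it would shorten the sequence.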
Yes, cDNA is a DNA fragment that does not interact with the patient's DNA in this treatment.
Wikipedia:
"Once amplified, the sequence can be cut at each end with nucleases and inserted into one of many small circular DNA sequences known as expression vectors. Such vectors allow for self-replication, inside the cells, and potentially integration in the host DNA. They typically also contain a strong promoter to drive transcription of the target cDNA into mRNA, which is then translated into protein."