Hacker News | raindear's comments

Progress in computer vision has been slow over the last ~5 years. We are still not close to human performance. This is in contrast to language understanding, which has been solved: LLMs understand text on a human level (even if they have other limitations). But vision isn't solved. Foundation models struggle to segment some objects, they don't generalize to domains such as scientific images, etc. I wonder what's missing from these models. We have enough data in videos. Is it compute? Is the task not informative enough? Do we need agency in 3D?


I’m not an expert in the field, but intuitively, from my own experience, I’d say what’s missing is a world model. By trying to be more conscious of my own vision, I’ve started to notice how often I fail to recognize a shape and then use additional knowledge, context, and extrapolation to deduce what it might be.

A few examples I encountered recently: if I take a picture of my living room, many random objects would be impossible for a stranger to identify but easy for household members. Or when driving at night, say I see a big dark shape coming from the side of the road: if I’m a local I’ll know there are horses in that field and that it’s fenced, or I might have read a warning sign earlier that lets me deduce what I’m seeing a few minutes later.

People are usually not conscious of this, but you can try to block out that additional information and process only what’s really coming from your eyes, and realize how quickly it becomes insufficient.


> If I take a picture of my living room, many random objects would be impossible for a stranger to identify but easy for household members.

Uneducated question, so it may sound silly: a sufficiently complex vision model must have seen a million living rooms and the random objects in them, enough to make some good guesses, no?


> LLMs understand text on a human level (even if they have other limitations).

Limitations like understanding...


The problem is the data. LLM data is self-supervised. Vision data is very sparsely annotated in the real world. Going a step further, robotics data is much sparser. So getting these models to improve on this long-tail distribution will take time.


Are dead ReLUs still a problem today? Why not?


There are alternative activation functions, which are also widely used.
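
Not from the thread, just a minimal sketch of why the alternatives help: with plain ReLU the gradient is exactly zero for negative pre-activations, so a unit stuck in that regime stops learning ("dies"), while leaky ReLU and GELU keep a small nonzero gradient there. PyTorch is assumed here purely for illustration.

    import torch
    import torch.nn as nn

    # Inputs spanning negative and positive values.
    x = torch.tensor([-3.0, -0.5, 0.0, 0.5, 3.0], requires_grad=True)

    for act in (nn.ReLU(), nn.LeakyReLU(0.01), nn.GELU()):
        y = act(x).sum()
        (g,) = torch.autograd.grad(y, x)
        print(act.__class__.__name__, g.tolist())

    # ReLU yields a gradient of 0.0 for every negative input: no signal flows
    # back, which is what makes a "dead" unit stay dead. LeakyReLU and GELU
    # pass a small gradient through, so such a unit can still recover.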


How far can you go by improving the curriculum? Start simple. Find a shorter and shorter sequence of examples that gives you the best result. What is the shortest sequence to get to some perplexity? Why?
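
To make the question concrete, here is a hypothetical greedy sketch (my own illustration, not something the commenter describes running): grow the example sequence one step at a time, keeping whichever candidate most reduces validation perplexity, and stop at a target perplexity or when nothing helps. `train_on` and `perplexity` are placeholder callbacks, not a real API.

    def build_curriculum(candidates, model, val_set, target_ppl, train_on, perplexity):
        """Greedy curriculum search; train_on and perplexity are hypothetical callbacks."""
        curriculum = []
        best_ppl = perplexity(model, val_set)
        pool = list(candidates)
        while pool and best_ppl > target_ppl:
            # Try extending the curriculum with each remaining example.
            trials = [(perplexity(train_on(model, curriculum + [ex]), val_set), ex)
                      for ex in pool]
            ppl, ex = min(trials, key=lambda t: t[0])
            if ppl >= best_ppl:
                break  # no single example shortens the path any further
            best_ppl = ppl
            curriculum.append(ex)
            pool.remove(ex)
        return curriculum, best_ppl

The length of the returned sequence would then be the "shortest sequence to get to some perplexity", at least under this greedy approximation.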


Progress is what happens thanks to AI skeptics busy defining model limitations: the limitations set attractive bars to clear.


I read that DeepSeek was trained on Western LLM output, so it is expected to have the same biases.


Did the creators actually say so? I'd rather expect them to train on pirated books just like OpenAI and Meta.



Molecular biology research is much better funded than computer science, at least in academia.


Yes, cDNA is a DNA fragment that does not interact with the patient's DNA in this treatment. Wikipedia: "Once amplified, the sequence can be cut at each end with nucleases and inserted into one of many small circular DNA sequences known as expression vectors. Such vectors allow for self-replication, inside the cells, and potentially integration in the host DNA. They typically also contain a strong promoter to drive transcription of the target cDNA into mRNA, which is then translated into protein."


But why do transformers perform better than older language models, including other neural language models?


There are around 40k Hamas gunmen. So the reported numbers can get that high.

