VLMs have so far not been good at counting objects or at spatial relationships (e.g. "the coffee is to the right of the microwave").
There are ways to help the VLM out - Set of Marks [0] from Microsoft being the most prominent - which uses segmentation to outline and label regions of the image before sending it to the VLM.
Providing "speakable" labels to regions helps ground the visual abilities of VLMs and is why in this paper the performance is so much better when words are present in the grid for "Task 6: Counting the rows and columns of a grid"
I didn't know counting objects was a problem. That's pretty ironic, because the very first implementation of a neural net (AFAIK) was the Numa-Rete artificial retina developed at the Biological Computer Lab [0] circa 1960. It was a parallel analog computer composed of "neurons", each with a photocell, that could be arranged in a grid and count "the number of objects independent of their size, location and form, and independent of strength of illumination" [1]. This paper may be of interest to those in the field: "Perception of Form in Biological and Man Made Systems", Heinz von Foerster, 1962.
It really shouldn't be surprising that these models fail at things _they weren't trained to do_. It's trivially easy to train a model to count stuff. The wild thing about transformer-based models is that their capabilities are _way_ beyond what you'd expect from token prediction. Figuring out where their limitations actually lie is interesting precisely because nobody fully knows what they are.
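As a rough illustration of the "trivially easy" point, a few dozen lines of PyTorch can train a small CNN to regress the number of dots in synthetic images. Every choice below (image size, architecture, dot-count range) is an arbitrary sketch value, not a recipe from any paper.

```python
# Toy counting model: regress the number of dots in a synthetic image.
# Dots can occasionally overlap, so labels are slightly noisy - fine for a demo.
import torch
import torch.nn as nn
import torch.nn.functional as F

def make_batch(batch_size=64, size=32, max_dots=9):
    imgs = torch.zeros(batch_size, 1, size, size)
    counts = torch.randint(0, max_dots + 1, (batch_size,))
    for i, n in enumerate(counts):
        for x, y in torch.randint(1, size - 1, (int(n), 2)):
            imgs[i, 0, y - 1:y + 2, x - 1:x + 2] = 1.0  # 3x3 "dot"
    return imgs, counts.float()

model = nn.Sequential(
    nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
    nn.Conv2d(16, 16, 3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
    nn.Linear(16, 1),
)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

for step in range(2000):
    imgs, counts = make_batch()
    loss = F.mse_loss(model(imgs).squeeze(1), counts)
    opt.zero_grad()
    loss.backward()
    opt.step()

imgs, counts = make_batch(8)
print(model(imgs).squeeze(1).round(), counts)  # predictions vs. ground truth
```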
I agree that these open-ended transformers are way more interesting and impressive than a purpose-built count-the-polygons model, but if the model doesn't generalize well enough to figure out how to count the polygons, I can't be convinced that they'll perform usefully on a more sophisticated task.
I agree this research is really interesting, but I didn't have an a priori expectation of what token prediction could accomplish, so my reaction to a lot of the claims and counterclaims about this new tech is that it's good at fooling people and giving plausible but baseless results. That makes for good research, but it's dangerous in the hands of a market attempting to exploit it.
> I agree that these open-ended transformers are way more interesting and impressive than a purpose-built count-the-polygons model, but if the model doesn't generalize well enough to figure out how to count the polygons, I can't be convinced that they'll perform usefully on a more sophisticated task.
I think people get really wrapped up in the idea that a single model needs to be able to do all the things, and LLMs can do a _lot_, but there doesn't actually need to be _one model to rule them all_. If VLMs are kind of okay at image interpretation but not great at details, we can supplement them with something that _can_ handle the details.
Vision models use CLIP or something similar, which has no conception of anything specific in the image. It sees embeddings that are trained to line up with text embeddings: take an image, describe it ("there are birds sitting on a power line in front of a blue sky with some clouds"), then pull the embeddings of the caption and of the picture together. If you ask whether there are birds in the image, it would know, but not how many - unless counts of birds sitting on things appeared often enough in the training captions for the numbers to make it into the alignment. If you want to count objects, you want something like YOLO.
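For instance, here's a minimal counting sketch with the ultralytics package; the image path and choice of weights are placeholders.

```python
# Hedged sketch: count objects with a plain detector rather than a VLM.
# "birds.jpg" and the yolov8n weights are placeholder choices.
from collections import Counter
from ultralytics import YOLO

model = YOLO("yolov8n.pt")             # small COCO-pretrained model
result = model("birds.jpg")[0]         # one image in -> one Results object
labels = [result.names[int(c)] for c in result.boxes.cls]
print(Counter(labels))                 # e.g. Counter({'bird': 7})
```

The count is just the number of detection boxes per class, which is exactly the kind of detail a contrastively trained embedding model never has to represent.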
VLMs like PaliGemma and Florence-2 support object detection and segmentation, so it's becoming more common to have YOLO-like capabilities built into VLMs.
Another benefit of VLMs that support object detection is that they are open-vocabulary, meaning you don't have to define the classes ahead of time. Fine-tuning them also tends to preserve their existing detection capabilities, whereas fine-tuning YOLO erases all of its previous classes.
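As a hedged sketch of what that looks like in practice: PaliGemma's "detect ..." prompt and its `<locNNNN>` output convention (four tokens per box - ymin, xmin, ymax, xmax - each normalized to 0-1023) are documented by Google, but treat the details below, including the placeholder image and query, as assumptions to verify against the model card.

```python
# Open-vocabulary detection with a VLM (PaliGemma via transformers).
# The query class "coffee mug" needs no retraining, unlike a fixed-class YOLO.
import re
import torch
from PIL import Image
from transformers import AutoProcessor, PaliGemmaForConditionalGeneration

model_id = "google/paligemma-3b-mix-224"
model = PaliGemmaForConditionalGeneration.from_pretrained(model_id)
processor = AutoProcessor.from_pretrained(model_id)

image = Image.open("kitchen.jpg")  # placeholder image
inputs = processor(text="detect coffee mug", images=image, return_tensors="pt")
with torch.no_grad():
    generated = model.generate(**inputs, max_new_tokens=64, do_sample=False)
new_tokens = generated[0][inputs["input_ids"].shape[-1]:]
text = processor.decode(new_tokens, skip_special_tokens=True)

# Each box is four <locNNNN> tokens: ymin, xmin, ymax, xmax in [0, 1023],
# scaled to the image by dividing by 1024.
coords = [int(n) / 1024 for n in re.findall(r"<loc(\d{4})>", text)]
print([coords[i:i + 4] for i in range(0, len(coords), 4)])
```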