
Idk if this is a daft question, but in the Visual-Semantic Alignment section, are those objects in the colored boxes actually being directly recognized by the software, or are they provided in some other way?


From the paper: "Our core insight is that we can leverage these large image-sentence datasets by treating the sentences as weak labels, in which contiguous segments of words correspond to some particular, but unknown, location in the image. Our approach is to infer these alignments and use them to learn a generative model of descriptions."
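To make the weak-label idea concrete, here is a minimal sketch (mine, not code from the paper) of the alignment score that passage describes: each word picks the single image region that best explains it, and the sentence-image score is the sum of those best matches. It assumes region and word vectors already live in a shared embedding space; the embedding networks themselves are omitted and all names are illustrative.

    # Minimal sketch of the image-sentence alignment score, assuming region
    # and word vectors are already embedded into a common space.
    import numpy as np

    def alignment_score(region_vecs, word_vecs):
        """region_vecs: (R, D), one vector per detected image region.
        word_vecs:   (T, D), one vector per word in the sentence.

        Each word matches its best-scoring region (the inner max), and the
        sentence-image score sums those best matches -- a word only needs
        *some* region to explain it, which is the weak-label idea."""
        sims = word_vecs @ region_vecs.T      # (T, R) word-region dot products
        best_region = sims.argmax(axis=1)     # inferred region per word
        return sims.max(axis=1).sum(), best_region

    # Toy usage: 3 regions, 5 words, 4-dim embeddings.
    rng = np.random.default_rng(0)
    score, assignment = alignment_score(rng.normal(size=(3, 4)),
                                        rng.normal(size=(5, 4)))
    print(score, assignment)  # per-word region indices = the alignment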


This gets into the work in a little more detail: compared to other papers that have sprung up in this area recently, our paper slightly frowns on the idea of distilling a complex image into a single short sentence description. In that sense we are a little more ambitious: we're trying to produce snippets of text that cover the full image, with descriptions at the level of image regions. I would call our results encouraging, but there is certainly more work to be done here. And I think one of the limitations on doing a good job right now is the amount of training data available to us.
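For flavor, here is a hedged sketch of what per-region generation could look like: a tiny RNN decoder, conditioned on a region's embedding, greedily emits a short snippet for each detected region. The weights are random and untrained and the vocabulary is made up, so the output is gibberish; it only shows the shape of the computation, not the paper's actual multimodal RNN.

    # Illustrative only: an untrained RNN decoder conditioned on each region
    # vector, so every region gets its own snippet of text.
    import numpy as np

    rng = np.random.default_rng(1)
    VOCAB = ["<end>", "a", "dog", "ball", "grass", "red", "on"]
    D, H, V = 4, 8, len(VOCAB)

    # Hypothetical (untrained) decoder parameters.
    Wxh = rng.normal(scale=0.1, size=(D, H))  # input embedding -> hidden
    Whh = rng.normal(scale=0.1, size=(H, H))  # hidden -> hidden
    Why = rng.normal(scale=0.1, size=(H, V))  # hidden -> vocab logits
    Emb = rng.normal(scale=0.1, size=(V, D))  # word embeddings

    def describe_region(region_vec, max_words=5):
        """Greedy decode; the region vector is fed as the first 'word'."""
        h, x, words = np.zeros(H), region_vec, []
        for _ in range(max_words):
            h = np.tanh(x @ Wxh + h @ Whh)
            w = int((h @ Why).argmax())       # greedy pick of next word
            if VOCAB[w] == "<end>":
                break
            words.append(VOCAB[w])
            x = Emb[w]
        return " ".join(words)

    for i, region in enumerate(rng.normal(size=(3, D))):  # 3 toy regions
        print(f"region {i}: {describe_region(region)}")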


So basically you have a database of pre-existing sentences to match what's roughly likely to be in the image, rather than actually seeing individual objects and generating a grammatically correct and accurate description based on them?



