
Idk if this is a daft question, but in the Visual-Semantic Alignment section, are those objects in the colored boxes actually being directly recognized by the software, or are they provided in some other way?


From the paper: "Our core insight is that we can leverage these large image-sentence datasets by treating the sentences as weak labels, in which contiguous segments of words correspond to some particular, but unknown, location in the image. Our approach is to infer these alignments and use them to learn a generative model of descriptions."
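To make the weak-label idea concrete, here is a minimal sketch (mine, not code from the paper) of the alignment score that passage describes: each word picks the single image region that best explains it, and the sentence-image score is the sum of those best matches. It assumes region and word vectors already live in a shared embedding space; the embedding networks themselves are omitted and all names are illustrative.

    # Minimal sketch of the image-sentence alignment score, assuming region
    # and word vectors are already embedded into a common space.
    import numpy as np

    def alignment_score(region_vecs, word_vecs):
        """region_vecs: (R, D), one vector per detected image region.
        word_vecs:   (T, D), one vector per word in the sentence.

        Each word matches its best-scoring region (the inner max), and the
        sentence-image score sums those best matches -- a word only needs
        *some* region to explain it, which is the weak-label idea."""
        sims = word_vecs @ region_vecs.T      # (T, R) word-region dot products
        best_region = sims.argmax(axis=1)     # inferred region per word
        return sims.max(axis=1).sum(), best_region

    # Toy usage: 3 regions, 5 words, 4-dim embeddings.
    rng = np.random.default_rng(0)
    score, assignment = alignment_score(rng.normal(size=(3, 4)),
                                        rng.normal(size=(5, 4)))
    print(score, assignment)  # per-word region indices = the alignment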


This gets into the work in a little more detail: compared to other papers that have sprung up in this area recently, our paper slightly frowns on the idea of distilling a complex image into a single short sentence description. In that sense we are a little more ambitious: we're trying to produce snippets of text that cover the full image, with descriptions at the level of image regions. I would call our results encouraging, but there is certainly more work to be done here. And I think one of the limitations on doing a good job right now is the amount of training data available to us.
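For flavor, here is a hedged sketch of what per-region generation could look like: a tiny RNN decoder, conditioned on a region's embedding, greedily emits a short snippet for each detected region. The weights are random and untrained and the vocabulary is made up, so the output is gibberish; it only shows the shape of the computation, not the paper's actual multimodal RNN.

    # Illustrative only: an untrained RNN decoder conditioned on each region
    # vector, so every region gets its own snippet of text.
    import numpy as np

    rng = np.random.default_rng(1)
    VOCAB = ["<end>", "a", "dog", "ball", "grass", "red", "on"]
    D, H, V = 4, 8, len(VOCAB)

    # Hypothetical (untrained) decoder parameters.
    Wxh = rng.normal(scale=0.1, size=(D, H))  # input embedding -> hidden
    Whh = rng.normal(scale=0.1, size=(H, H))  # hidden -> hidden
    Why = rng.normal(scale=0.1, size=(H, V))  # hidden -> vocab logits
    Emb = rng.normal(scale=0.1, size=(V, D))  # word embeddings

    def describe_region(region_vec, max_words=5):
        """Greedy decode; the region vector is fed as the first 'word'."""
        h, x, words = np.zeros(H), region_vec, []
        for _ in range(max_words):
            h = np.tanh(x @ Wxh + h @ Whh)
            w = int((h @ Why).argmax())       # greedy pick of next word
            if VOCAB[w] == "<end>":
                break
            words.append(VOCAB[w])
            x = Emb[w]
        return " ".join(words)

    for i, region in enumerate(rng.normal(size=(3, D))):  # 3 toy regions
        print(f"region {i}: {describe_region(region)}")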


So basically you have a database of pre-existing sentences to match what's roughly likely to be in the image, rather than actually seeing individual objects and generating a grammatically correct and accurate description based on them?



