I'm curious to hear your thoughts on learning object saliency from these datasets. Most human-generated images have built-in biases toward framing things humans care about, and all of the captions reflect the relative importance (to humans) of the pictured objects.
Captioning images, for humans, is a subset of a much more general skill set. Humans can scan a broad visual scene for salient components, focus on those while ignoring non-salient objects, and then organize their thoughts about what they've seen so as to produce an extremely low-dimensional description of the scene (a descriptive sentence).
Humans also have the advantage of immediate feedback on their generated descriptions from peers or parents.
I haven't seen much work that tries to tackle datasets that aren't pre-framed by humans, or that tries to scale reinforcement learning for this. I'd love to hear your thoughts, or to get reading suggestions if any come to mind.
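To make the RL part concrete, here's a rough sketch of the kind of loop I have in mind, in the spirit of self-critical sequence training (Rennie et al., 2017): sample a caption, score it against a greedy-decoded baseline, and push up the log-probs of tokens that beat the baseline. Everything here (the toy model, the vocab size, the placeholder reward()) is made up for illustration; a real setup would plug in a proper image encoder and a learned or human-derived reward.

    # Minimal REINFORCE-style caption training sketch (self-critical flavor).
    # All names and sizes below are hypothetical, chosen just to run end to end.
    import torch
    import torch.nn as nn

    VOCAB, FEAT, HID, MAX_LEN = 1000, 512, 256, 12

    class TinyCaptioner(nn.Module):
        def __init__(self):
            super().__init__()
            self.embed = nn.Embedding(VOCAB, HID)
            self.init_h = nn.Linear(FEAT, HID)   # image feature -> initial state
            self.rnn = nn.GRUCell(HID, HID)
            self.out = nn.Linear(HID, VOCAB)

        def sample(self, feats, greedy=False):
            """Roll out a caption; return token ids and their log-probs."""
            h = torch.tanh(self.init_h(feats))
            tok = torch.zeros(feats.size(0), dtype=torch.long)  # <bos> = 0
            ids, logps = [], []
            for _ in range(MAX_LEN):
                h = self.rnn(self.embed(tok), h)
                dist = torch.distributions.Categorical(logits=self.out(h))
                tok = dist.probs.argmax(-1) if greedy else dist.sample()
                ids.append(tok)
                logps.append(dist.log_prob(tok))
            return torch.stack(ids, 1), torch.stack(logps, 1)

    def reward(captions):
        # Placeholder: a real reward would come from human feedback or a
        # learned scorer of description quality / saliency coverage.
        return (captions != 0).float().mean(1)  # e.g. penalize degenerate output

    model = TinyCaptioner()
    opt = torch.optim.Adam(model.parameters(), lr=1e-4)

    feats = torch.randn(8, FEAT)                 # stand-in for image features
    sampled, logps = model.sample(feats)
    with torch.no_grad():
        baseline, _ = model.sample(feats, greedy=True)
        advantage = reward(sampled) - reward(baseline)  # self-critical baseline
    loss = -(advantage.unsqueeze(1) * logps).mean()
    opt.zero_grad(); loss.backward(); opt.step()

The mechanics are the easy part; the open question, to my mind, is what the reward function should be once you drop human-framed captions as the target.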