Apart from the captioning work already mentioned, there's also visual analogy making work by Reed et al. http://www-personal.umich.edu/~reedscot/nips2015.pdf which is pretty exciting (sort of a converse to what you're proposing).
...for some definitions of "trivial" and "perfect". At this level, I suspect even a small advantage would result in winning the contest, which is the point here.
To be fair, I don't think OpenCV really solved computer vision. There's certainly no model out there that can do image-based question answering as well as a human can, or accurately interpret (parse, if you will) the contents of an image, except in a few special cases.
I'm curious to hear your thoughts about learning object saliency from these datasets. Most human-generated images have built-in biases toward framing things humans care about, and all of the captions will reflect the relative importance (to humans) of pictured objects.
Captioning images, for humans, is a subset of a much more general skill set. Humans can scan a broad visual scene for salient components, focus on those while ignoring non-salient objects, and then organize their thoughts about what has been seen in such a way as to produce an extremely low-dimensional description of the scene (a descriptive sentence).
Humans also have the advantage of immediate feedback on their generated descriptions from peers or parents.
I haven't seen much work that has attempted to tackle datasets that aren't pre-framed by humans, or ones that try to scale reinforcement learning. I'd love to hear your thoughts or get suggested reading if any pops to mind.
Just FYI, the only additional data used by the GoogLeNet entry was from the classification challenge (aka provided by the organizers), hardly something that would make you lose sleep at night.
I was not suggesting Google was pulling data from other sources in some sort of conspiratorial way, but rather pointing out for interest that its algorithmic superiority was weighted toward large data sets. Given the volume of data they see and store in their existing operations, I saw that as a potentially interesting correlation.
The Los Angeles office is right by Venice Beach, in a beautiful setting. It's a mid-sized office (~500 employees according to http://venice.patch.com/groups/business-news/p/silicon-beach...) in the same time zone as Mountain View, and less than a 1h flight from the mothership should you need to go there.
Unlike the main office, most people don't have to choose between commuting from SF or living in Mountain View, because Venice/Santa Monica is actually a nice area to live in :). Naturally, the breadth of projects is not as big as in Mountain View, but there are a number of exciting things happening here (computer vision, quantum AI, etc.).
It's actually already possible to train convolutional-network-like models to distinguish between a variety of dogs, cats, etc. with pretty much superhuman precision. The real problem is getting high-quality training data without involving tons of domain experts who would tell us with a high degree of confidence whether a given image is of a specific breed of dog (getting millions of images of dogs is easy, and so is building a classifier).
It's not immediately obvious to me how useful such an app would be, btw. Unless, of course, I misunderstood what a "real life pokedex app" is :).
Yes, though I think on public benchmarks this is still not the case. There's a dog-breed classification problem in this year's Fine-Grained challenge (https://sites.google.com/site/fgcomp2013/) so we'll see in December!
Yes, I'd be surprised if the straightforward implementation from https://code.google.com/p/cuda-convnet/, run on a GPU with lots of transformations, wasn't the winning entry.
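By "lots of transformations" I mean label-preserving data augmentation. Here's a minimal numpy sketch of the random-crop-plus-flip style of augmentation used in that family of pipelines (the function name and exact parameters are my own illustration, not the actual cuda-convnet code):

```python
import numpy as np

def augment(image, crop_size, rng):
    """Randomly crop and horizontally flip an (H, W, C) image array.

    A sketch of the kind of label-preserving transformations used to
    inflate a training set; the winning entry's exact pipeline may differ.
    """
    h, w, _ = image.shape
    # Random crop: pick a top-left corner uniformly at random.
    top = rng.integers(0, h - crop_size + 1)
    left = rng.integers(0, w - crop_size + 1)
    crop = image[top:top + crop_size, left:left + crop_size]
    # Horizontal flip with probability 0.5.
    if rng.random() < 0.5:
        crop = crop[:, ::-1]
    return crop

rng = np.random.default_rng(0)
img = np.arange(32 * 32 * 3, dtype=np.float32).reshape(32, 32, 3)
patch = augment(img, crop_size=24, rng=rng)
```

Each call yields a slightly different view of the same image, so one labeled example effectively becomes many.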
It's possible that the underlying model is just not particularly good at learning from data. 11B parameters is a lot of free parameters to learn -- for instance, the main competitor to that paradigm is the work by Krizhevsky et al., i.e. convolutional networks with lots of parameter sharing, and I think they get better performance (on a comparable task) with ~60M free parameters.