I often wonder how much the multi-sensory aspect plays into it. When I see an image, I don't really process it as an image; I map that visual cue onto the full gamut of sensory memories of that object. I could write a page's worth of these descriptive meatspace 'vectors' that are invoked when I see a banana and color my interpretation of its context.
If my understanding is remotely correct, an RNN's view of a banana is basically like the face in Aphex Twin's "Equation" - https://youtu.be/M9xMuPWAZW8?t=5m30s (headphone users beware). No qualitative or quantitative information about the object, just a certain tone of integer triplets in a cacophony of noise.
It seems like a many-dimensional view of the world around us is going to be necessary for systems to intuit more effectively about interacting with it. It could be something we synthetically inject, or we may need to give our models new senses they can use to extract their own meaning.
Well that's why I call it 4D. It's a multidimensional understanding of "banana" that crosses multiple sensory barriers.
As you more or less correctly point out, the way a DNN understands a 2D image of a banana is basically by compressing (convolving and pooling) the image into a mathematical "fingerprint" for which we provide a label. If the labeling process is consistent, we can then run inference relatively rapidly, matching the fingerprints of new images with high probability.
That is to say, the complexity of the "fingerprint" of a banana is several orders of magnitude greater in humans than it is for even our most advanced object detectors - if for no other reason than that the mapped data is multi-sensory.
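To make the "fingerprint" idea concrete, here's a minimal toy sketch of the convolve-pool-flatten pipeline described above. Everything here is illustrative (a hand-picked kernel and a tiny synthetic image, not a trained network); real detectors learn many kernels over many layers, but the compression principle is the same.

```python
# Toy sketch of the "fingerprint" pipeline: convolve a small grayscale
# "image" with one kernel, max-pool the result, and flatten it into a
# compact vector. All values and names here are illustrative assumptions.

def convolve2d(img, kernel):
    """Valid-mode 2D convolution (really cross-correlation, as in most DNN libs)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(img) - kh + 1
    out_w = len(img[0]) - kw + 1
    out = [[0.0] * out_w for _ in range(out_h)]
    for i in range(out_h):
        for j in range(out_w):
            out[i][j] = sum(img[i + a][j + b] * kernel[a][b]
                            for a in range(kh) for b in range(kw))
    return out

def max_pool(fmap, size=2):
    """Non-overlapping max pooling: keep only the strongest response per patch."""
    out = []
    for i in range(0, len(fmap) - size + 1, size):
        row = []
        for j in range(0, len(fmap[0]) - size + 1, size):
            row.append(max(fmap[i + a][j + b]
                           for a in range(size) for b in range(size)))
        out.append(row)
    return out

def fingerprint(img, kernel):
    """Convolve, pool, flatten: the compressed representation we attach a label to."""
    pooled = max_pool(convolve2d(img, kernel))
    return [v for row in pooled for v in row]

# A 6x6 "image" with a bright diagonal, and a simple diagonal-detecting kernel.
image = [[1.0 if i == j else 0.0 for j in range(6)] for i in range(6)]
kernel = [[1.0, -1.0], [-1.0, 1.0]]
print(fingerprint(image, kernel))  # → [2.0, 0.0, 0.0, 2.0]
```

The point of the sketch is the compression: a 36-value image collapses to a 4-value fingerprint, and the label we attach to that fingerprint is all the "meaning" the model ever gets - no taste, smell, or weight of a banana survives the squeeze.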