Well that's why I call it 4D. It's a multidimensional understanding of "banana" that crosses multiple sensory barriers.

As you more or less correctly point out, the way a DNN understands a 2D image of a banana is basically by compressing (convolving and pooling) the image into a mathematical "fingerprint" for which we provide a label. If the labeling is consistent, we can then run inference on new images fairly quickly and match their fingerprints to the right label with high probability.
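To make the "fingerprint" idea concrete, here's a rough sketch in plain Python of convolving and pooling a tiny image into a feature vector. The kernels, the toy image and the `fingerprint` helper are all made up for illustration; a real DNN learns its kernels from labeled data and stacks many such layers, so treat this as a toy, not how any particular detector works.

```python
def convolve2d(image, kernel):
    """Valid (no padding) 2D cross-correlation, the operation CNN layers call convolution."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j]
                for i in range(kh) for j in range(kw))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


def max_pool(feature_map, size=2):
    """Non-overlapping max pooling; rows/columns that don't fill a window are dropped."""
    out_h = len(feature_map) // size
    out_w = len(feature_map[0]) // size
    return [
        [
            max(feature_map[r * size + i][c * size + j]
                for i in range(size) for j in range(size))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


def fingerprint(image, kernels):
    """Hypothetical helper: convolve with each kernel, pool, and flatten into one vector."""
    vec = []
    for kernel in kernels:
        pooled = max_pool(convolve2d(image, kernel))
        vec.extend(value for row in pooled for value in row)
    return vec


# Hand-picked kernels (assumption: a trained network would learn these instead).
VERTICAL_EDGE = [[1, -1], [1, -1]]
HORIZONTAL_EDGE = [[1, 1], [-1, -1]]

# Toy 5x5 "image": a vertical bright stripe on a dark background.
image = [
    [0, 0, 1, 1, 0],
    [0, 0, 1, 1, 0],
    [0, 0, 1, 1, 0],
    [0, 0, 1, 1, 0],
    [0, 0, 1, 1, 0],
]

fp = fingerprint(image, [VERTICAL_EDGE, HORIZONTAL_EDGE])
print(fp)  # [0, 2, 0, 2, 0, 0, 0, 0]
```

The 25 pixels get squeezed down to 8 numbers: the vertical-edge kernel responds to the stripe, the horizontal-edge kernel stays at zero. A classifier then only has to associate that compact vector with a label.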

That is to say, the "fingerprint" of a banana is several orders of magnitude more complex in humans than in even our most advanced object detectors, if for no other reason than that the mapped data is multi-sensory.