
Yes, you have a concept of an object based on all your memories across all senses. It takes all of your senses and their multiple sources to form concepts.


This is a very interesting point that I think has more going for it than the poster you are replying to may realize. Related (apologies, I don't have a link): there have been a few stories circulated of adults who were blind from birth or a young age and had their sight restored in adulthood. They often had great difficulty, or even found it impossible, to correlate what they saw with their eyes with the concept of objects they knew by touch and their other senses.


I have no link at hand either, but I also read about blind people being unable to transfer the concepts of objects they built from touching them to seeing them when their sight was restored. They know what a sharp edge or a pointy thing feels like, but they don't know what it looks like.

But I think that actually strengthens my point: blind people are capable of learning how different objects feel, and they recognize them this way. They do just fine without vision, and if they suddenly gain access to vision, it does not help them.

In general, having more senses will help you identify objects more reliably because you have access to more features to differentiate them. For example, distinguishing materials from imitations by vision alone can be hard or maybe even impossible, but when you can also touch them you gain a lot of new information about surface structure, hardness, thermal conductivity, and so on, and you can easily distinguish between, say, real stone and plastic or wood with a stone print on it.
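
To make that concrete, here is a toy sketch (Python/NumPy, with invented feature values) of why a second sense adds separating features: under vision alone the two materials below are indistinguishable, while fusing in touch features such as hardness and thermal response makes them trivially separable.

    import numpy as np

    # Toy features; all numbers are invented for illustration.
    # [avg_grey_level, texture_contrast] -- identical under vision alone.
    vision = {"real_stone":      np.array([0.55, 0.80]),
              "printed_plastic": np.array([0.55, 0.80])}

    # [hardness, thermal_conductivity] -- touch tells them apart.
    touch = {"real_stone":      np.array([0.90, 0.90]),
             "printed_plastic": np.array([0.30, 0.20])}

    def fused(name):
        # Late fusion by concatenation: more senses, more features to separate on.
        return np.concatenate([vision[name], touch[name]])

    print(np.array_equal(vision["real_stone"], vision["printed_plastic"]))  # True
    print(np.array_equal(fused("real_stone"), fused("printed_plastic")))    # False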

But just because it is advantageous in the general case to have access to more than one sense, that does not imply that it is necessary for a specific task.


Apologies ahead of time. I'm typing this on a small-screen phone and trying not to ramble :-)

We might be more on the same page than I realized. On the one hand, I agree that it doesn't imply that it is necessary to have access to more than one sense.

On the other hand, I do think that having access to more senses may make the problem a LOT easier, especially during the training process, when teaching systems to do things like distinguish objects by sight.

Don't get me wrong, the progress made so far has been incredible, but it isn't surprising that we are eventually running into limits given the limited amount of data these systems have to train on.

Sort of riffing off of this: in particular, my sense is that when you are training on 2D pixel data only, it is no wonder that no matter how clever your network is at extracting high-level features from the data, you are going to run into these kinds of issues. We are asking the system to perform a task (describing, classifying, etc.) about a fundamentally 3D world, using only a 2D image, when it has never had a "concept" of the world in three dimensions.

I think we take for granted that when we are learning, we get information not only by seeing, but also by touching and manipulating objects and existing in the world around us. We can see things in different lighting conditions, and we can move our heads and bodies around the world, see things from new angles, and manipulate them to learn the rich set of correlations between what we see and what we experience... basically the rich structure of the world around us. A system trained on images alone doesn't have that integrated knowledge, which is what I was hoping to get at with the anecdote of adults being cured of blindness.

When you look at some of the work we do to augment visual datasets (scaling images, cropping them, rotating them, skewing them, etc.), it's basically a poor mimicry of something that humans and animals get just by being in the world: this idea of learning that things remain the same and have a certain structure, regardless of the viewpoint we see them from. (From what I understand, this idea is also part of what inspires capsule networks.)
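
For reference, a minimal sketch of such an augmentation pipeline, assuming torchvision as one common choice (the specific transforms and parameters here are illustrative, not anything from this thread):

    import torchvision.transforms as T

    # Each transform fakes a viewpoint or lighting change that an embodied
    # learner would get for free by moving through the world.
    augment = T.Compose([
        T.RandomResizedCrop(224, scale=(0.7, 1.0)),   # scaling + cropping
        T.RandomRotation(degrees=15),                 # rotation
        T.RandomAffine(degrees=0, shear=10),          # skewing
        T.RandomHorizontalFlip(),
        T.ColorJitter(brightness=0.3, contrast=0.3),  # crude lighting changes
        T.ToTensor(),
    ])

    # Usage: pass a PIL image through the pipeline to get an augmented tensor.
    # from PIL import Image
    # x = augment(Image.open("photo.jpg"))   # tensor of shape (3, 224, 224)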

Anyways, to come to my main point: yes, you can absolutely identify objects, even a completely novel object, after only seeing it once or twice. But I would argue that this is only because you already have a rich framework of the world as spatially three-dimensional, along with all of the other properties of objects that you have learned. So when you see a 2D picture, your brain can form a 3D image of the scene, identify the materials of objects, etc., which you can leverage. These are things that our current AI systems have no idea of... so perhaps it's a miracle they made it this far, and no wonder that we find they can get fooled by flipping pixels!
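
(As an aside on the pixel-flipping point: a standard illustration is the fast gradient sign method of Goodfellow et al., sketched below in PyTorch; model, inputs, and epsilon are placeholders, not a claim about any specific system.)

    import torch

    def fgsm(model, x, y, loss_fn, eps=0.01):
        # Perturb each pixel by a tiny amount in the direction that most
        # increases the loss; this is often enough to flip the prediction.
        x = x.clone().detach().requires_grad_(True)
        loss_fn(model(x), y).backward()
        return (x + eps * x.grad.sign()).detach()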

(I'll end my rant here with the caveat that I am a roboticist, so perhaps I have very skewed views of intelligence and feelings about embodied intelligence, but I'm willing to learn and be disabused of my notions!)


I think we pretty much agree: identification can be done just from images, but learning only from them asks for quite a lot. The issue is probably that annotating thousands of images is a comparatively easy task compared with doing the same for years of video from a camera moving through the world. And who has the time to watch a neural network ingest videos for several years? But something like that may be necessary if you want to achieve human-level performance by learning from scratch, watching billions of video frames. I guess you can take a shortcut, sort of, and get away with far fewer training examples, but only if you engineer a lot of the fundamentals about the world into the system instead of having to learn them, too, all in one process.


What I'm trying to say is that, AFAIK, AI generally solves a problem by training something akin to a black box. It's data -> function -> expected output. But I think we're able to get past silly things like a pixel being off, or other simply "odd" images, because our learning is much more involved. I'd liken it to something along the lines of: data (with much of it associated with our own, known input, i.e. pushing against a car, petting a cat, looking in a mirror and moving our arms and realizing we "did that", etc.) -> contextualizing based on memories -> identifying relevant concepts (chair, car, color/blue, etc.) -> thinking/doing something -> other steps -> repeat
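
To put the black-box view in code, here is a toy supervised loop (PyTorch assumed; all names and sizes are illustrative): the network f is the function, and training just nudges its parameters so f(data) moves toward the expected output, with no contextualizing or memory steps anywhere.

    import torch
    import torch.nn as nn

    # data -> function -> expected output, and nothing else.
    f = nn.Sequential(nn.Linear(784, 128), nn.ReLU(), nn.Linear(128, 10))
    opt = torch.optim.SGD(f.parameters(), lr=0.1)
    loss_fn = nn.CrossEntropyLoss()

    def train_step(x, y):
        opt.zero_grad()
        pred = f(x)              # data -> function
        loss = loss_fn(pred, y)  # compare with expected output
        loss.backward()          # nudge the black box
        opt.step()
        return loss.item()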

This is also where I think robots have the best chance at really becoming self aware (far down the road). They’re not going to be passive observers.


One example can be found in Oliver Sacks' "An Anthropologist on Mars", which has a chapter about a blind man named Virgil who regained his sight in adulthood; it took him a long time to make sense of visual stimuli.


This is true insofar as the huge amount of background knowledge I have, and use to build a model of what I see in an image, was almost inevitably built over my lifetime with all my senses. But this does not change the fact that I performed the task at hand only visually, i.e. without fusing different senses. The AI is lacking my background knowledge, not senses other than vision. One might argue that building human-level background knowledge requires more than just vision, but it seems at least not totally obvious that this is indeed the case.



