
Humans can build associations with very few samples

This to me is an example of whatever the opposite of anthropomorphism is - assuming that humans sample like computers, and then extrapolating to "low" relative to computing. It's also my #1 pet peeve in DL debates.

As someone who has also raised children, I can see how this conclusion (low sample rate) can be reached; however, as someone also deep into ML/RL, I see how wrong it is.

You say "very few" samples without a metric. I've seen people in the past cite 2 or 3 presentations of a stimulus to a child, for example in the form of a toy, and then state that the child has correctly visually identified the toy with a verbal label in subsequent tests.

Assuming that these 2 or 3 presentations correlate with 2 or 3 samples is wrong because it doesn't take into account sample rate.

Every presentation batch is a 4D (continuous time + three spatial dimensions) multi-sensory supervised labeling exercise at first (no RL until the first recitation/exploration). Using rough abstractions: at a 60 "frames per second" input rate, and let's assume there was a "supervised labeler" (aka parent/guardian) who said the word "toy" multiple times across a 5 minute play period, you have up to 18,000 "labeled" pieces of training data across multiple sensory inputs for one object.
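A quick back-of-envelope check of that arithmetic; the frame rate and session length are the same rough assumptions as above, nothing more:

```python
# Back-of-envelope: "labeled samples" per play session under the rough
# assumptions above (60 "frames per second", one 5 minute labeled session).
frames_per_second = 60
session_seconds = 5 * 60  # a 5 minute play period

labeled_samples = frames_per_second * session_seconds
print(labeled_samples)  # 18000 "labeled" multi-sensory samples
```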

If you blindfolded the child and had them identify the object by feel you may need more batches, similarly with other senses (smell for example).

Obviously this is a gross simplification - but the constant 1:1 batch comparison at the sampling rate between humans and [linear models/MDP/differentiable programs/Neural Networks] really is way off.



The problem is that when you show a single example of "a toy" (let's say a fire engine) to a child, they don't just learn to recognise the unique object you showed them as a "fire engine"; they learn the concept of "fire engine" that they can subsequently correctly recognise in different objects, with very different characteristics. Having learned what a fire engine is, they can then recognise fire engines of all shapes and sizes as belonging to the same category of "fire engine" as the original; a blue fire engine as a special case with a surprising colour; or a real fire engine as a different class of fire engine that is not a toy; and so on.

Machine vision classifiers can do nothing of the sort, no matter how many examples you give them and for how long they learn to look at them. If you label a fire engine toy as a "fire engine" then either the classifier will only be able to recognise toy fire engines, or it will have to mislabel real fire engines as "toy fire engine".

I agree that the difference between the sampling rate of humans and machine vision classifiers is not well defined, but it is obvious (and as far as I can tell there's a strong consensus on this) that machine vision algorithms are many orders of magnitude less sample efficient than humans.


when you show a single example of "a toy" (let's say- a fire engine) to a child, they don't just learn to recognise the unique object you showed them as a "fire engine"; they learn the concept of "fire engine" that they can subsequently correctly recognise in different objects

I don't have that same experience at all. In fact if anything it's the opposite. My kids called ambulances "fire trucks" until I - the supervised labeler - corrected them.

that machine vision algorithms are many orders of magnitude less sample efficient than humans.

I don't think anyone disputes that - but they are at least in the same ballpark in terms of structure, especially if you look at the way RL works.


As someone who raised children and grandchildren, I can't find any explanation for how fast they learn the language, based on very few samples (where your 4D argument doesn't apply). Sure, children learn from conversations with adults, but those are mostly trivial, and involve trivial concepts. And children seem to be able to learn not only from the very limited number of samples, but also (in a sense) learn more than these samples contain. BTW, did anyone try to analyze how many words/phrases the child has heard, say, by the age of 7, when they develop perfect understanding of the language and the ability to speak like adults? And after that age, one can spend 50 years learning a foreign language and still not get it.


Initially, kids don't learn language that fast; they spend a whole year getting samples from parents, where we try and try and try over and over to get them to say something, so there is a high sample rate going on for sure. However, it is also true that humans learn faster at some point by using other tools, like consciousness. Not sure how exactly that works in a toddler's brain, but in mine, if you ask me to remember a phone number, I will repeat it in my head several times and try to make associations; those higher-level learning processes seem to be the algorithms that we have yet to discover and implement successfully.


> (where your 4D argument doesn't apply)

The 4D argument is even more applicable to human language, IMO. Object recognition pretty exclusively involves sight and touch. Human language involves all the senses, frequently at once.

My Spanish is not great, but usually I can communicate pretty well despite that, partially because there are a lot of other contextual cues (body language, nonverbal vocalizations, known objects) I can use to figure things out.

It's amazing how frequently the words don't matter at all, and the meaning is almost entirely contained in tone and pacing of speech.


based on very few samples (where your 4D argument doesn't apply)

Again, define "few." Language development starts in utero [1] and is basically a constant stream thereafter.

Children who have more consistent exposure to directed language and singing from their parents learn language faster, so there is absolutely correlation between exposure rate (sample rate) and acquisition time.

Additionally, the idea that language isn't 4D is just completely missing the concept. There is no linguistic association with a "ball" if there is no physical (visual/tactile) representation of said ball. Assuming a child doesn't have a disability, there are no single-sense concepts that I can think of.

[1]https://www.washington.edu/news/2013/01/02/while-in-womb-bab...


I often wonder how much the multi-sensory aspect plays into it. When I see an image, I don't really process it as an image; I map that visual cue into the full gamut of sensory memories (?) of that object. I could write a page's worth of these descriptive meatspace 'vectors' that are invoked when I see a banana and that color my interpretation of its context.

If my understanding is remotely correct, an RNN's view of a banana is basically like the face in Aphex Twin's Equation - https://youtu.be/M9xMuPWAZW8?t=5m30s (headphone users beware). No qualitative or quantitative information about the object, just a certain tone of integer triplets in a cacophony of noise.

It seems like a many-dimensional view of the world around us is going to be necessary for systems to more effectively intuit about interacting with it. It could be something we synthetically inject or we may need to give our models new senses they can use to extract their own meaning.


Well that's why I call it 4D. It's a multidimensional understanding of "banana" that crosses multiple sensory barriers.

As you more or less correctly point out, the way a DNN understands a 2D image of a banana is basically by compressing (convolving and pooling) the image into a mathematical "fingerprint" for which we provide a label. If the labeling process is homogenized, then we can relatively rapidly generate high-probability inferences when testing the fingerprints against new images.

That is to say the complexity of the "fingerprint" of a banana is several orders of magnitude greater in humans than it is for even our most advanced object detectors - if for no other reason than the mapped data is multi-sensory.
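The convolve-and-pool "fingerprint" idea can be sketched in a few lines. This is a toy illustration with random numbers standing in for an image and a learned filter, not any real architecture:

```python
# Toy sketch of "compressing an image into a fingerprint": one
# convolution pass followed by max pooling. The image and kernel are
# random stand-ins, not trained values.
import numpy as np

rng = np.random.default_rng(0)
image = rng.random((8, 8))   # stand-in for a 2D banana photo
kernel = rng.random((3, 3))  # stand-in for one learned conv filter

# Valid convolution: slide the 3x3 kernel over the 8x8 image -> 6x6 map.
feature_map = np.array([
    [(image[i:i + 3, j:j + 3] * kernel).sum() for j in range(6)]
    for i in range(6)
])

# 2x2 max pooling: compress the 6x6 map into a 3x3 "fingerprint"
# that a classifier would then associate with the label "banana".
fingerprint = feature_map.reshape(3, 2, 3, 2).max(axis=(1, 3))

print(fingerprint.shape)  # (3, 3) -- a compressed summary we can label
```

Even in this toy form, the point above is visible: 64 input pixels collapse to a 9-number fingerprint, and that fingerprint carries far less information than a multi-sensory human representation of the same object.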


Also the pre-training takes 2 years


We are also rather prone to seeing patterns in noise.



