
How do you measure how much a person knows, and how do you objectively measure how much an LLM knows?

Here’s a very basic example of where an LLM is clearly more capable than a human: language translation. I would bet $10k at 10:1 that there are no humans who can reliably translate to and from as many languages as an LLM can.

It is very easy to measure knowledge: test the subject.

Personally, I can’t ever imagine scoring higher on a general knowledge test than a contemporary LLM.

Also, I don’t know of any humans that can run as fast as a car so I don’t know why any of this is surprising or farfetched.



I think you misinterpreted what I mean.

I'm not saying that they can't be more capable, I'm saying the guy can get a little overly excited about things which are hard to measure or quantify.

We're observing these systems and making up our own interpretations about how good they are at certain tasks, but it's not really easy to measure how much better or worse these things can be overall.

Your example about language translation is a good example of where these things aren't really "better", just different. I speak multiple languages, and while these systems are fantastic, they can fail in ways a professional translator wouldn't, and they don't seem to know when they've failed and should correct themselves.

The car example is also great because it again proves my point. We can easily measure a car and a person and work out that the car is faster, but we can also see that a car can't walk. So it's faster, but it's also entirely different.


>> I'm saying the guy can get a little overly excited about things which are hard to measure or quantify.

Let's back this up a little bit. We've got Marvin Minsky who comes along and destroys the perceptron. Then we have decades of knowledge systems that go nowhere. All the while Geoff Hinton is tirelessly working on neural networks. Finally, after decades of hard work, the fruits of his labor are recognized with ImageNet.

And then a bunch of people in a comment section criticize the guy for getting "a little overly excited" about the stunning range of things neural networks can do, which validates his life's work.

Great job, all around!


>> Here’s a very basic example of where an LLM is clearly more capable than a human: language translation. I would bet $10k at 10:1 that there are no humans who can reliably translate to and from as many languages as an LLM can.

See, translation is exactly the kind of domain where there are no good measures of performance and where performance is open to subjective interpretation, and a lot of it. That's because we don't know what is a "good translation" and, crucially, machine translation systems and language models have not helped us find out.

The way machine translation systems are evaluated is generally by a metric based on similarity to an arbitrarily chosen "gold standard" translation. In practice, that means we have some corpus of parallel texts, we train a machine translation system on part of the corpus, and then test it on the held-out test set. The way we test is that we take each sentence (say) in a text translated by the system and compare it, as a bag-of-words or a set of n-grams, to the corresponding sentence in the reference translation. If there is a high amount of overlap, the system scores highly. That's how BLEU works, and similar metrics like ROUGE.
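
Just to make the overlap idea concrete, here's a rough, sentence-level sketch in Python of a BLEU-like score. This is not the real BLEU implementation (which works at corpus level, with proper smoothing); the function name and the smoothing constant are made up for illustration.

    # Rough sketch: score a candidate translation by clipped n-gram overlap
    # with a single, arbitrarily chosen reference translation.
    from collections import Counter
    import math

    def ngrams(tokens, n):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    def bleu_like(candidate, reference, max_n=4):
        cand, ref = candidate.split(), reference.split()
        precisions = []
        for n in range(1, max_n + 1):
            c_counts, r_counts = ngrams(cand, n), ngrams(ref, n)
            overlap = sum((c_counts & r_counts).values())  # clipped n-gram matches
            total = max(sum(c_counts.values()), 1)
            precisions.append(max(overlap, 1e-9) / total)   # crude smoothing
        # Brevity penalty: punish candidates shorter than the reference.
        bp = 1.0 if len(cand) >= len(ref) else math.exp(1 - len(ref) / max(len(cand), 1))
        # Geometric mean of the n-gram precisions.
        return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)

    print(bleu_like("the cat sat on the mat", "the cat is on the mat"))

Note that the score depends entirely on which reference you happened to pick: a perfectly good translation that uses different words scores poorly.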

It is important to note how arbitrary this metric is: out of all possible translations we choose one to be the "reference" translation and compare machine translations to it. The only accepted alternative is eyeballing, where we give the machine translation to a bunch of humans and ask them how they feel about it.

My point is that we don't know how to measure knowledge, and language models are trained to maximise similarity, not knowledge. So there's no way to go from observations of their behaviour to a measure of their knowledge. All you can say about a language model is that it is good, or bad, at generating text that's similar to its training corpus. Everything else is an assumption.


Good god, people, we measure knowledge all the time with testing. We have a difficult time measuring intelligence but we have no problem measuring someone’s knowledge about the major events that led up to the Battle of Waterloo.

Just give the participants the final from my French 3 exam, but in 100 different language combinations. I bet you'd do worse than ChatGPT.


>> Good god, people, we measure knowledge all the time with testing.

In humans. Not in machines.

You're proposing to use a test of human knowledge as a test of computer knowledge, when the question in the first place is whether a computer can have knowledge at all. It's like giving an IQ test to a frog and concluding that the frog has no IQ because it can't answer the questions, only reversed: the machine answers the questions, therefore it has knowledge. Who cares about mechanisms, who cares how the answers are generated, if I see answers, that's knowledge.

Well, that is a pre-scientific way to look at the world. I observe the sun, it looks like it's moving around the Earth, therefore the sun revolves around the Earth. No room left for critical inquiry or understanding of the cause of phenomena. We have a test? Bash it against anything and we'll get some answers, and then we'll claim they're the right answers because that's the right test, since it gave us the right answers. And all that, not for some mysterious physical phenomenon that we're not responsible for, but for a machine, created and programmed by humans, and we know exactly how.

No no. That's not good engineering, and it's not good science: it doesn't explain the how, and it doesn't explain the why.



