> That's exactly how every human-vs-AI evaluation is measured.
No, it's not. For instance, there are comparisons on games with fixed, definite rules.
And even if it were, the fact that a methodological weakness is universal in a field doesn't make it not a weakness; it just makes the field problematic as a whole.
> For example, if you have an AI detect whether an image contains a cat, your first step is to have a person manually check every image.
Sure, but the thing is that if you are comparing humans to AI at finding cats in pictures, you are usually testing against lay humans with no special expertise, so the assumption on which the usefulness of the experiment rests is "an academic expert evaluating pictures for the presence of cats approximates perfect accuracy near enough that it is unlikely that the score of either lay humans or AI against that expert significantly misstates their accuracy."
With the legal case, the assumption is something like: "an academic expert evaluating the legal effect that a court would find in a set of contracts approximates perfect accuracy near enough that it is unlikely that the score of either experienced lawyers practicing in the field or AI against that expert significantly misstates their accuracy."
The former is, of course, not certain, but I think most people would accept it to be more likely than not to be true.
The latter is less believable (unless, for example, the expert used as the oracle is actually the Supreme Court of the jurisdiction whose law is to be applied in evaluating the contracts.)