The AI was clearly faster, which is no surprise. But since the "correct" answers for the test were assigned by one group of humans, and the correct answers in real practice are also determined by humans (a different group from either those running the test or those taking it), I'm not sure the result that the AI was more correct is meaningful in any sense.
That's exactly how every human-vs-AI evaluation is measured. For example, if you have an AI detect whether an image contains a cat, your first step is to have a person manually check every image. Of course it is possible for the test/evaluation labels to be incorrect. Nothing about this case is special, at least nothing you have pointed out.
> That's exactly how every human-vs-AI evaluation is measured.
No, it's not. For instance, there are comparisons on games with fixed, definite rules.
And even if it were, the fact that a methodological weakness is universal in a field doesn't make it not a weakness; it just makes the field problematic as a whole.
> For example, if you have an AI detect whether an image contains a cat, your first step is to have a person manually check every image.
Sure, but the thing is that when you are comparing humans to AI at finding cats in pictures, you are usually testing against lay humans with no special expertise, so the assumption on which the usefulness of the experiment rests is "an academic expert evaluating pictures for the presence of cats approximates perfect accuracy near enough that it is unlikely that the score of either lay humans or AI against that expert significantly misstates their accuracy."
With the legal case, the assumption is something like: "an academic expert evaluating the legal effect that a court would find in a set of contracts approximates perfect accuracy near enough that it is unlikely that the score of either experienced lawyers practicing in the field or AI against that panel significantly misstates their accuracy."
The former is, of course, not certain, but I think most people would accept it to be more likely than not to be true.
The latter is less believable (unless, for example, the expert used as the oracle is actually the Supreme Court of the jurisdiction whose law is to be applied in evaluating the contracts.)
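To make the point concrete, here's a toy simulation (my own sketch, not from the study being discussed): a hypothetical binary task where a predictor with a known true accuracy is scored against an "oracle" whose own labels carry some error rate, with errors independent. The measured score drifts away from the true accuracy as the oracle gets less reliable, which is exactly the worry when the oracle is an expert panel rather than something close to ground truth.

```python
import random

random.seed(0)

def measured_accuracy(true_acc, oracle_err, n=100_000):
    """Score a predictor against a noisy oracle on a binary task.

    Assumptions (purely illustrative): each item's true answer is 0/1;
    the predictor is right with probability true_acc; the oracle (the
    panel assigning the "correct" answers) is right with probability
    1 - oracle_err; all errors are independent.
    """
    agree = 0
    for _ in range(n):
        truth = random.randint(0, 1)
        pred = truth if random.random() < true_acc else 1 - truth
        oracle = truth if random.random() < 1 - oracle_err else 1 - truth
        agree += pred == oracle
    return agree / n

# The same predictor, truly 90% accurate, looks different
# depending on how good the oracle is:
print(measured_accuracy(0.90, 0.02))  # near-perfect oracle: ~0.88
print(measured_accuracy(0.90, 0.20))  # unreliable oracle:   ~0.74
```

The cat-labeling case corresponds to a small `oracle_err`, where the measured score stays close to the truth; the contracts case is the worry that `oracle_err` is large enough that the human-vs-AI gap could be an artifact of the panel's own errors.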
You would assume that the test/answer creation involved a lot more eyes and time to double-check everything. Hell, they might even have used an AI to make sure that they didn't miss anything.