Hacker News

One of the authors of the above paper here. There is a whole line of work supporting the notion that the progress we are making via this leaderboard (or competition) mechanism is in fact real progress -- and that we are not simply overfitting to the test set. This thread by Moritz Hardt does a good job laying out a few reasons why this may be the case: https://twitter.com/mrtz/status/1134158716251516928


I'd like to signal boost this; it's a very important line of research. Congrats on a groundbreaking paper! The first time I saw it, I was completely shocked by that perfect linear fit in Figure 1 of "Do ImageNet Classifiers Generalize to ImageNet?".

To elaborate for the other commenters: vaishaalshankar's team created a new ImageNet evaluation dataset from scratch and observed that the leaderboard positions of popular image recognition models barely changed under the new evaluation. The models' absolute accuracies dropped significantly, but the ranking was largely unaffected.
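A minimal sketch of that observation (the accuracy numbers below are made up for illustration, not figures from the paper): every model loses roughly the same number of points on the new test set, so the best-to-worst ordering survives even though absolute accuracy drops.

```python
# Hypothetical top-1 accuracies (%) on the original and new test sets.
old_acc = {"resnet50": 76.1, "vgg16": 71.6, "alexnet": 56.5}
new_acc = {"resnet50": 63.0, "vgg16": 59.1, "alexnet": 44.0}

def ranking(scores):
    """Model names sorted from best to worst accuracy."""
    return sorted(scores, key=scores.get, reverse=True)

# Leaderboard order is preserved, even though every score dropped.
assert ranking(old_acc) == ranking(new_acc)
print(ranking(old_acc))
print([round(old_acc[m] - new_acc[m], 1) for m in old_acc])  # per-model drop
```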

The OP starts from a fairly uncontroversial claim: there's a good chance that the winner of a Kaggle competition is not actually better than any of the other top-k contestants, for quite large values of k. But then he completely overplays his hand, and by the time he gets to ImageNet, he makes claims that were actually falsified by vaishaalshankar's paper.
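The Kaggle point can be made concrete with a toy simulation (all parameters below are made up for illustration): give every contestant identical true skill and let their leaderboard scores differ only through sampling noise on a finite test set. The "winner" still pulls a couple of points ahead, purely by luck.

```python
import random

random.seed(0)
n_contestants, test_size, true_acc = 200, 2_000, 0.80

# Identical true accuracy for everyone; observed leaderboard scores
# differ only through binomial noise on a 2,000-example test set.
scores = [sum(random.random() < true_acc for _ in range(test_size)) / test_size
          for _ in range(n_contestants)]

winner_score = max(scores)
# The winner's "edge" over the true accuracy is pure noise, so the
# top-k contestants are statistically indistinguishable from each other.
print(f"winner edge over true accuracy: {winner_score - true_acc:.3f}")
```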


I don't actually think ImageNet is anywhere near as susceptible to crowd-based overfitting as most Kaggle competitions, but I also don't think that paper falsifies the claim that it is.

That paper shows that ImageNet classifiers retain their ranking on data drawn from the same distribution as the training set. That isn't even close to the same thing as generalising to real-world data. Medical AI shows us over and over again that truly out-of-distribution, unseen data (external validation) is a completely different challenge from simply drawing multiple test sets from your home clinic.

Again, I don't actually think ImageNet was as problematic as other competitions, but there is better evidence for that: for the first half of ImageNet's life, the differences between models were large, the cumulative number of tests was fairly small, and the test set was huge. That is, what I wrote supports ImageNet as fairly reliable.


> That paper shows that imagenet classifiers retain their ranking on data drawn from the same distribution as the training set. That isn't even close to the same thing as generalising to real world data.

Not the same distribution, it's new data collected and processed according to the same recipe. A quite different distribution, demonstrated by the fact that the accuracy numbers drop sharply. That's why it's so surprising that the rankings do not change that much. (Okay, in principle, a possible explanation is that it is the exact same distribution, with a fixed percentage of mislabeled or impossibly hard-to-label datapoints added. Appendix B2 of the paper deals with this possibility.)

In any case, I fully agree that this kind of generalization is still much easier than generalizing to real world data.

> Again, I don't actually think ImageNet was as problematic as other competitions, but there is better evidence for that: for the first half of ImageNet's life, the differences between models were large, the cumulative number of tests was fairly small, and the test set was huge. That is, what I wrote supports ImageNet as fairly reliable.

CIFAR-10 is basically the opposite of your list of requirements: small train set, small public test set, small number of labels, grid-searched to death. And yet, look at the CIFAR-10 graph from that paper. The exact same pattern as ImageNet.




