The headline, apart from being clickbaity, accepts the null hypothesis, which is a STAT101 no-no.

The article makes several good points. But just because the testing isn't sufficient to prove that the winner didn't get lucky doesn't mean the winner did get lucky.



There's nothing problematic about accepting the null hypothesis; it's just that instead of controlling the Type I error rate, you need to control the Type II error rate, i.e. ensure sufficient power.
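
To put rough numbers on that, here's a back-of-the-envelope power calculation using the standard normal-approximation formula for comparing two proportions (a sketch; the 0.900 vs 0.905 accuracies, 5% alpha, and 80% power are made-up illustration values, not from the article):

    import math
    from scipy.stats import norm

    def n_per_model(p1, p2, alpha=0.05, power=0.8):
        """Test-set size needed per model to detect accuracy p1 vs p2
        with a two-sided two-proportion z-test (normal approximation)."""
        z_a = norm.ppf(1 - alpha / 2)
        z_b = norm.ppf(power)
        p_bar = (p1 + p2) / 2
        num = (z_a * math.sqrt(2 * p_bar * (1 - p_bar))
               + z_b * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
        return math.ceil(num / (p1 - p2) ** 2)

    # Two models whose true accuracies differ by half a percentage point:
    print(n_per_model(0.900, 0.905))  # ~55,000 labelled examples per model

Half a point of accuracy, which is often the whole gap between first place and the pack, already demands a test set far larger than most leaderboards use.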


Yeah, we can't disprove anything, yadda yadda.

If almost every competition on Kaggle has a winner that is not significantly better than the bulk of the field, then that is proof in itself. In any single competition, a failure to reject the null might be chance; across nearly all of them, chance can only take you so far.
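
A quick simulation makes the point (a sketch; the 1,000 models, 10,000-row test set, and 0.90 true accuracy are all invented for illustration):

    import numpy as np

    rng = np.random.default_rng(0)
    n_models, n_test, true_acc = 1000, 10_000, 0.90

    # Every model is literally identical: true accuracy 0.90 for all.
    # Observed leaderboard scores are just binomial noise around that.
    scores = rng.binomial(n_test, true_acc, size=n_models) / n_test

    print(scores.max() - scores.min())  # ~2 points of spread from chance alone
    print(int(scores.argmax()))         # the "winner" is an arbitrary index

Rerun it with a different seed and a different model "wins", even though all of them are exactly as good.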


I think the point is we are left with uncertainty. Your prior should be that we don't know which competitor is best, and after the competition we are still unsure.


Isn't that the same thing as "not producing useful models"? Like, sure, some of the models may work, but unless you know which ones, you can't make use of them.


Yes, very true, but if we're still unsure it may be worth testing them more, while if we've proven they don't work we can abandon them.


How do you test them more, and with which unbiased dataset? No such dataset exists.

What you could do is actually describe the kinds of errors the network makes. In the CT example: false positives, false negatives, wrong diagnoses. We can try to analyze what the network is actually detecting, rather than accept a score on some test set at face value.
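
For instance, a tiny sketch of that breakdown with sklearn's confusion_matrix (the labels below are made up; 1 = abnormality present):

    import numpy as np
    from sklearn.metrics import confusion_matrix

    # Hypothetical binary CT reads: 1 = abnormality flagged, 0 = clear.
    y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
    y_pred = np.array([1, 0, 0, 1, 1, 0, 1, 0])

    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    print(f"false positives: {fp}, false negatives: {fn}")
    # A false negative (missed finding) usually costs far more than a
    # false positive here, which a single leaderboard number hides.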

The "millions of trials" is an overstatement, but a few hundred thousand are indeed needed to actually discern a winner, presuming the network did not cheat by exploiting population statistics, e.g. certain cranium sizes being more likely to present with problems. Relying on population statistics derived from a small sample (even if representative, which it's not) is very risky...
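
On discerning a winner: scores on a shared test set are paired, so a paired test extracts more from each example than comparing raw accuracies. A minimal sketch of an exact McNemar test, assuming hypothetical boolean per-example correctness arrays for two models:

    import numpy as np
    from scipy.stats import binomtest

    def mcnemar_exact(correct_a, correct_b):
        """Exact McNemar test: looks only at discordant pairs, i.e.
        test cases where exactly one of the two models is right."""
        correct_a = np.asarray(correct_a, dtype=bool)
        correct_b = np.asarray(correct_b, dtype=bool)
        a_only = int(np.sum(correct_a & ~correct_b))
        b_only = int(np.sum(~correct_a & correct_b))
        # Under H0 (models equally good), discordant wins split 50/50.
        return binomtest(a_only, a_only + b_only, 0.5).pvalue

    # e.g. p = mcnemar_exact(preds_a == y_test, preds_b == y_test)

Even with pairing, two models a fraction of a point apart need test sets in the hundreds of thousands before the discordant pairs settle the question.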


It's also possible that, if you have a lot of models all scoring very close to each other, they just ALL work.



