Hacker News | lukeor's comments

Hi, author here. I'll try to answer here with two points.

1) race is not provided explicitly or intentionally during training, but medical practice is biased so it is reasonable to assume there is a signal in the data.

2) we know there must be a signal, since AI models learn it. The optimisation process should only discover features that correlate with the labels, but we see the ability to predict race in models trained to look for diseases like pneumonia (which do not appear different for people from different racial groups).


Thanks a lot for taking the time to reply. I have to admit that I have found your post unexpectedly super interesting. It's quite subtle, and while I can see all the facts, there are a few logical steps which don't quite make sense to me.

So you're saying that 1. Unlike what the blog post says, there is practice bias even in radiology. This would indeed explain why the model can learn 'racial bias'.

2. This is less clear to me. Just like humans can't see race in radiographic images, humans might be unable to see differences for diseases like pneumonia, but the NN could see them, no? In other words, how do you know that the differences have to come from hidden racial bias, and not from hidden pathological differences (that you don't know about, just as you've only just discovered the hidden racial bias)?


(replying to myself about 2) You say the race can't be seen in the image by humans because it's not there, and therefore the race information must come from bias introduced by the radiologist.

I guess it depends on whether all differences are visible to the human eye (and there's nothing pathological hidden that the NN 'sees'), and whether you can prove (a very hard thing in stats) that there's no way to extract race in some other manner.

Now, assuming the data is racially biased because of medical practice bias, forgive me for being naive, but why is this surprising / major?

Isn't this just another instance of models being trained on bad data? There are already plenty of examples of that in AI.

Edit: ... unless the actual finding becomes 'radiology is (unexpectedly) racially biased, so much so that an AI can learn it'?


I never said radiology didn't produce biased results, just that we don't know the race of our patients most of the time.

There are lots of ways bias can still occur, like in who gets referred for scans, when they get referred, what the referrer writes on the request form, how the technologist takes the images (I could tell you some horribly racist stories about a few ultrasonographers I've worked with), and so on.

And all of this is based on previous work that AI produces bias (when trained on these datasets). If it was useful differences that drove AI learning about race, the models would not produce disparities. We went looking for how it is interacting with race because we already knew it was producing unacceptable outcomes. The big news here is that it is so easy to learn race that this effect is almost certainly not isolated to the systems tested so far.


Thanks!


I don't actually think imagenet is anywhere near as susceptible to crowd-based overfitting as most kaggle competitions, but I also don't think that paper falsifies the claim that it is.

That paper shows that imagenet classifiers retain their ranking on data drawn from the same distribution as the training set. That isn't even close to the same thing as generalising to real world data. Medical AI shows us over and over again that truly out of distribution unseen data (external validation) is a completely different challenge to simply drawing multiple test sets from your home clinic.

Again, I don't actually think imagenet was as problematic as other competitions, but there is better evidence for that (not the least of which is that for the first half of imagenet's life, the differences in models were large, the number of tests was cumulatively fairly small, and the test set was huge: ie what I wrote supports imagenet as fairly reliable).


> That paper shows that imagenet classifiers retain their ranking on data drawn from the same distribution as the training set. That isn't even close to the same thing as generalising to real world data.

Not the same distribution: it's new data collected and processed according to the same recipe, and a quite different distribution in practice, as demonstrated by the sharp drop in accuracy numbers. That's why it's so surprising that the rankings do not change that much. (Okay, in principle, a possible explanation is that it is the exact same distribution with a fixed percentage of mislabeled or impossibly hard-to-label datapoints added. Appendix B2 of the paper deals with this possibility.)

In any case, I fully agree that this kind of generalization is still much easier than generalizing to real world data.

> Again, I don't actually think imagenet was as problematic as other competitions, but there is better evidence for that (not the least of which is that for the first half of imagenet's life, the differences in models were large, the number of tests was cumulatively fairly small, and the test set was huge: ie what I wrote supports imagenet as fairly reliable).

CIFAR-10 is basically the opposite of your list of requirements. Train set small, test set small, test set public, small number of labels, grid searched to death. And yet, look at the CIFAR-10 graph from that paper. The exact same pattern as ImageNet.


Hi, author here. There are a range of ways the estimates can be improved, although many require data that isn't available. The main point is that having a ballpark idea of how reliable your results are is good, and you can achieve that with this sort of simple napkin maths.

No statistician would do what I did for a formal publication, but I think what I did gets the point across.
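For what it's worth, the simplest version of that napkin maths is just the binomial standard error of an accuracy estimate. A minimal sketch, with illustrative numbers rather than figures from any particular competition:

```python
import math

def accuracy_se(p, n):
    """Standard error of an accuracy estimate, treating each of the
    n test cases as an independent Bernoulli trial with success rate p."""
    return math.sqrt(p * (1 - p) / n)

# Illustrative: 90% accuracy measured on a 10,000-image test set.
se = accuracy_se(0.90, 10_000)   # 0.003
ci_95 = 1.96 * se                # about +/- 0.6 percentage points

print(f"SE = {se:.4f}, 95% CI = +/-{ci_95:.4f}")
```

So on a 10k test set, two models within roughly half a percentage point of each other are statistically indistinguishable at the usual threshold, which is exactly the kind of ballpark check the post is advocating.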


Yes, it's a very thought-provoking article. I'm sure there are many competitions on Kaggle that were won due to the testing/training splits or other incidental choices rather than better machine learning.

But I don't think it's fair to complain about $30,000 prizes being awarded to first rather than second place in a specific competition without at least a little checking of whether that was actually the case. And the article reads a bit like cynicism that all machine learning is a waste, and that the algorithms are just producing random numbers that happen to be right some of the time and win competitions by chance.


All I can really say is that my usual readers understand that I am pro-ML; in fact I'm probably more gung-ho about the potential of deep learning than many of my compatriots.

I've fallen victim to getting a Twitter bump and assuming that people know I'm not anti-ML.

The blog post is meant to be educational, not argumentative. Since it has got wider exposure I'll do a follow up to clarify my position on imagenet.


It's a great post; I love ML, and I've spent many years trying to get value out of it, sometimes succeeding. But folks are applying it without any of the checks and balances that are needed to produce real value in a sustained way.

Two reasons:

1. It's harder to do this than to optimise the bejeezus out of a dataset and throw the best model over the wall (and this is often done in good faith, complete with a whole gamut of "standard practices" which are in fact information leaks from test to train, like checking which features are informative on the test set before training).

2. Folks don't know better, and best practice is sparsely documented or taught. This is because there are almost no practitioners turned teachers in comp sci. I'm not running down the great people who do great work pushing the field, they are my betters, but the next generation are being misled into thinking that the skills they pick up in their ML classes will keep them gainfully employed in the long term.
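The test-before-train feature-selection leak in point 1 is easy to demonstrate on pure noise. Here is a hypothetical toy simulation (made-up sizes, a nearest-centroid classifier chosen only for simplicity, not anyone's actual pipeline): the features are pure Gaussian noise and the labels are random coin flips, yet selecting "informative" features using the test rows makes held-out accuracy look well above chance.

```python
import numpy as np

def run(seed, n=200, d=20_000, k=10):
    """One trial: noise features, random labels, 50/50 train/test split."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, d))      # pure noise: no real signal anywhere
    y = rng.integers(0, 2, n)            # random binary labels
    train, test = np.arange(n // 2), np.arange(n // 2, n)

    def top_k(idx):
        # the k features most correlated with y over the rows in idx
        Xc = X[idx] - X[idx].mean(0)
        yc = y[idx] - y[idx].mean()
        corr = np.abs(Xc.T @ yc) / (np.linalg.norm(Xc, axis=0)
                                    * np.linalg.norm(yc) + 1e-12)
        return np.argsort(corr)[-k:]

    def centroid_acc(feats):
        # nearest-centroid classifier fit on train, scored on test
        mu0 = X[np.ix_(train[y[train] == 0], feats)].mean(0)
        mu1 = X[np.ix_(train[y[train] == 1], feats)].mean(0)
        Xt = X[np.ix_(test, feats)]
        pred = np.linalg.norm(Xt - mu1, axis=1) < np.linalg.norm(Xt - mu0, axis=1)
        return (pred.astype(int) == y[test]).mean()

    # leaky: features chosen using ALL rows (test included); clean: train only
    return centroid_acc(top_k(np.arange(n))), centroid_acc(top_k(train))

leaky, clean = map(np.mean, zip(*[run(s) for s in range(5)]))
print(f"leaky selection: {leaky:.2f}, clean selection: {clean:.2f}")
```

With the leak, "held-out" accuracy is substantially above 0.5 even though every feature is noise; done properly, it hovers around chance.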


This is the problem with the internet and links. You come in with inappropriate context and make judgements based on single pages of text.


Hi, author here. I didn't actually mean to suggest that the last 5 years of performance improvement could be spurious. That clearly isn't true. I use resnets/densenets etc in my day to day work!

What the picture was trying to say is that, within a given year, the "winner" becomes less likely to be truly better than the second place team. Alexnet was clearly better than the alternative, even with Bonferroni adjusted significance thresholds. Less so by 2016/17.
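To make the "clearly better, even with Bonferroni adjustment" point concrete, here is a rough sketch of that kind of comparison using an unpaired two-proportion z-test. The error rates and the number of competing entries are illustrative ballpark figures, not the actual leaderboard numbers, and a paired test on the shared test set would be more powerful than this approximation:

```python
import math

def two_sided_p(err_a, err_b, n):
    """Approximate two-sided p-value for the difference between two
    error rates measured on n test items (unpaired normal approximation)."""
    p_bar = (err_a + err_b) / 2
    se = math.sqrt(2 * p_bar * (1 - p_bar) / n)
    z = abs(err_a - err_b) / se
    return math.erfc(z / math.sqrt(2))

n = 100_000  # ILSVRC-scale test set

# 2012-style gap: the winner is roughly ten points ahead of the runner-up.
p_2012 = two_sided_p(0.26, 0.16, n)   # vanishingly small

# Late-competition-style gap: a fraction of a percentage point.
p_late = two_sided_p(0.0241, 0.0227, n)

alpha, m = 0.05, 30  # e.g. 30 entries -> Bonferroni threshold alpha / m
print(p_2012 < alpha / m)  # significant even after correction
print(p_late < alpha)      # nominally significant...
print(p_late < alpha / m)  # ...but not after Bonferroni
```

This is the shape of the argument: early gaps survive any reasonable multiple-comparison correction, while late-competition gaps are nominally significant at best.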

I'm writing a follow up on imagenet in particular to address some of the nuance. It is very clearly not a representative example of ML competitions, but the same effects still apply to some extent (imo).


Hey, thanks for liking the article. I was just looking at my blog stats and they had a bit of a spike when you wrote this.

I'm just getting down to writing some posts about the big question the New Yorker article introduces but doesn't really make any progress on: will machines actually replace doctors?

Hope you enjoy them :)


Hi, I'm the author of the blogpost (didn't expect to see it here!) and I largely agree with you.

I think I would say that the use of private datasets isn't actually the problem, though; the problem is that the methods and analysis in many of the papers are horribly flawed. I could reasonably trust a paper with good methodology; essentially all of medical research is on private data. But the majority is, as you say, pure hype.

I actually wrote another blog post a bit back that was popular on some simple rules to separate hype from reality in medical AI work - https://lukeoakdenrayner.wordpress.com/2016/11/27/do-compute...

