There has been no peer-reviewed paper calling the gaydar paper into question. There was a master's student who tried to replicate the study with his own crawled dataset and got results better than human guessing, though slightly below the paper's accuracy; news outlets ran with that to say the study was flawed. Another critique was by a Googler who claimed the neural net looked solely at eye shadow or glasses, but he, too, got better than random and human guessing on his own sanitized dataset. And one could argue that eye shadow and glasses are fair game when classifying from a face picture, as they are part of the picture, and these pictures were also shown to the human evaluators (a level playing field).
The Next Web article is by a journalist with a history degree, not an ML scientist. Even so, judged solely on the merits of his arguments, he agrees with the results of the paper:
> there’s nothing wrong with the paper and all the science (that can actually be reviewed) obviously checks out.
He seems to take more issue with the ethical considerations and with binary sexuality, and builds his point around the claim that humans have no functioning gaydar at all, so it is insignificant that a neural net could beat a coin flip. His point is weak: he gives no evidence that humans lack a gaydar, and the paper (which was not wrong, as claimed) includes human assessments that score higher than random guessing.
I think my contrarian view holds up on purely pragmatic grounds: Israel has the best airport security in the world and uses these Suspect Detection Systems extensively, seemingly improving them constantly and making enough profit for new players to enter the market. In other words, the people who actually do this for a living keep innovating on it, and I find that rather unlikely if all of this is tea-leaf reading.
I think, in general, that the HN crowd overreacts when it comes to controversial tech, and that a simplistic "this does not work, it is a sham and a fraud to take research money" is an uninformed, weak claim. It takes a lot of chutzpah to denounce many months of work by legitimate scientists as obviously flawed from behind your keyboard, when one probably has not even read the full paper. The authors, by picking such a controversial topic, are partly to blame for this pushback and the popular-media reporting, but that does not make it right.
I will not defend the use of plethysmographs and eye-tracking studies to measure a sexual response. I only claim that it is better than random guessing, that it allows for better treatment when measurements are out of line with self-reports, and that it is still in use and very similar to the Fruit Machine. The Fruit Machine is already back.
> My dowsing rod is better than my crystal ball at finding water,
I do not get what you are referring to here (I know you as an ML-knowledgeable person from your other comments, so I am afraid to assume things, but if your crystal ball is random and your dowsing rod is better than random, you are successfully doing predictive modeling, no? Not a sham? [1]). These systems do not need extremely high accuracy if they do not auto-deny a person, and it is moving the goalposts a bit to demand high accuracy when better-than-random guessing has been demonstrated (which is what the majority of the commenters here question).
> or they are irrelevant like the sources about the training of border agents
A user kindly requested sources for all of my claims. I claimed this and sourced it. My point was that we already have human Suspect Detection Systems in place, so either those must go (you have a fundamental problem with SDSs) or they can't be automated (because you don't trust AI research, or believe these systems need the common-sense problem solved first). I could then offer counter-arguments to both.
For the question about eye direction, look at the sourcing on telltale signs of lying that I posted in reply to another commenter. It depends on whether you are left- or right-handed.
[1] > A concept class is learnable (or strongly learnable) if, given access to a source of examples of the unknown concept, the learner with high probability is able to output an hypothesis that is correct on all but an arbitrarily small fraction of the instances. The concept class is weakly learnable if the learner can produce an hypothesis that performs only slightly better than random guessing. In this paper, it is shown that these two notions of learnability are equivalent. - The Strength of Weak Learnability
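The quoted equivalence is the basis of boosting: many barely-better-than-chance learners can be combined into a strong one. A minimal sketch with scikit-learn on synthetic data (the dataset and parameters here are illustrative assumptions, not anything from the papers under discussion):

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
# Toy concept: the label depends on the SUM of two features, so a single
# axis-aligned decision stump can only ever be a weak learner (~75%).
X = rng.normal(size=(3000, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
X_train, y_train = X[:2000], y[:2000]
X_test, y_test = X[2000:], y[2000:]

stump = DecisionTreeClassifier(max_depth=1).fit(X_train, y_train)
# AdaBoost's default base learner is exactly such a depth-1 stump.
boosted = AdaBoostClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)

stump_acc = stump.score(X_test, y_test)
boosted_acc = boosted.score(X_test, y_test)
print(stump_acc, boosted_acc)  # the boosted ensemble is clearly stronger
```

The single stump is a weak learner in exactly the quoted sense; boosting it yields a far stronger hypothesis, which is the theorem's point.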
Regarding the gaydar paper, yes, I have read the full paper (if memory serves,
I read two versions, a pre-print and the published paper). At the time, I
wanted to publish a rebuttal, perhaps a letter in a journal or something, but
in the end I didn't think I'd be adding much to the debate and the paper had
been widely discredited already anyway.
My objection with the methodology in the paper was that the authors had
assembled a dataset where the distribution of gay men and women was 50% of the
population, i.e. there were as many gay women as straight and as many gay men
as straight in the data. This was for one of their datasets, the one where
everyone had a picture. There were two more where the distribution was less
even but still nothing like what it's usually estimated to be. This despite
the fact that the paper itself cited a result that gay men and women are
around 7% of the population.
The reason for this discrepancy was clearly to improve the results by reducing
the number of false negatives which are expected when there are many more
negative than positive examples in binary classification.
That is from the point of view of machine learning. There were other flaws that
others pointed out, e.g. the choice of metric (I don't remember what it was
now; I can look it up if you like), the premising of the paper on prenatal
hormone theory, which is another piece of bunkum without any evidence to back it,
etc.
And of course there were the ethical considerations.
Sorry but I don't have the courage to reply to the rest of your comment. You
write way too much.
Rebalancing an imbalanced dataset is common in industry and academia. You use it when you focus on accuracy, to make claims like "we were 54% accurate at classifying the sexuality of women" easily interpretable without needing a distribution-matched benchmark (you simply know the baseline is a coin flip).
If there is signal in the rebalanced dataset, there should be signal in the imbalanced dataset. If they had switched to log loss or AUC and an imbalanced dataset, do you think their results would then be as good as random? Because that is what you are implying, and you are basically implying the research is fraudulent. That is a very strong claim to make in the absence of legitimate discrediting studies that failed to replicate any predictability, and it requires more than guessing that the authors' rebalancing act was "clearly" done to improve the accuracy (with a 7% negative class, you could get 93% accuracy by always predicting the positive class, so if they had wanted to inflate the accuracy, they shouldn't have rebalanced).
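The arithmetic behind that parenthesis is easy to check directly. A toy sketch (the class sizes are illustrative, not the paper's actual counts):

```python
import numpy as np

# Toy labels with a 7% minority class (0) and a 93% majority class (1).
y_imbalanced = np.array([1] * 930 + [0] * 70)
# A constant classifier that always predicts the majority class
# scores 93% accuracy without learning anything at all.
acc_imbalanced = (np.ones_like(y_imbalanced) == y_imbalanced).mean()

# On a rebalanced 50/50 benchmark, the same trick scores exactly 50%,
# so any accuracy above 0.5 is directly interpretable as signal.
y_balanced = np.array([1] * 500 + [0] * 500)
acc_balanced = (np.ones_like(y_balanced) == y_balanced).mean()

print(acc_imbalanced, acc_balanced)  # 0.93 0.5
```

This is why accuracy on an imbalanced set flatters a do-nothing model, while on a balanced set the chance baseline is unambiguous.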
The ethical considerations are moot/personal opinion, as the study passed Stanford's ethics board. Those are people who evaluate the ethics of academic research for a living; or are you saying they were also shoddy and wrong to give this a pass?
Magical thinking is not wanting something to be true because it would be an uncomfortable truth, and so deeming that something which is objectively true must be false, so you can continue to think happy thoughts in line with your worldview.
You keep talking about the paper being widely discredited, but you can't provide a single academic source for this. Instead, you question my sources (Business Insider?) while posting articles from The Next Web written by a journalist with a history degree who does not want the concept of binary sexuality to be true, or even allow it in constructing a dataset of gay and straight people by self-classification.
It takes more energy and letters to attack a point than to make a point. You made quite a lot of weak points.
>> Rebalancing an imbalanced dataset is common in industry and academicia. You use that when you focus on accuracy, to make claims like: We were 54% accurate on classifying sexuality of females easily interpretable, without needing a distribution-balanced benchmark (you simply know it is a coin flip).
You quoted The Strength of Weak Learnability, so I figured you must have at least a passing acquaintance with computational learning theory. In computational learning theory (such as it is), it is a foundational assumption that the distribution from which training examples are drawn is the same as the true distribution of the data; otherwise there can be no guarantees that a learned approximation is a good approximation of the true distribution.
The following is a good article on machine learning with unbalanced classes:
>> This is a very strong claim to make, in the absence of legit discrediting studies that failed to replicate any predictability, and requires more than guessing the authors rebalancing act was "clearly" to improve the accuracy (with 7% negative class, you could get 93% accuracy by always predicting positive class, so if they wanted to inflate the accuracy, they shouldn't have rebalanced).
The gay class was the positive class and the straight class the negative, in this case. If you did what you say and classified everyone as straight, you'd get a very high number of false negatives: you'd identify every gay man and woman as being straight. You'd get very high accuracy but zero recall (and undefined precision) on the gay class. The authors validated their models using AUC, and such an evaluation would immediately show the weakness of an always-say-straight classifier.
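For what it's worth, the behaviour of an always-say-straight classifier is easy to verify with standard metrics (toy numbers below, assuming a 7% base rate, not the paper's actual data):

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score, roc_auc_score

# 1 = gay (positive class), 0 = straight; a toy 7% base rate.
y_true = np.array([1] * 70 + [0] * 930)
y_pred = np.zeros_like(y_true)    # always say "straight"
y_score = np.zeros(len(y_true))   # constant score, for ranking metrics

acc = accuracy_score(y_true, y_pred)   # looks impressive
rec = recall_score(y_true, y_pred)     # misses every positive example
auc = roc_auc_score(y_true, y_score)   # no ranking ability at all
print(acc, rec, auc)
```

Accuracy comes out at 0.93 while recall is 0 and AUC sits at chance level, which is exactly why a ranking metric exposes a degenerate classifier that accuracy hides.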
>> You keep talking about the paper being widely discredited, but can't provide a single academic source for this.
An "academic source", like a publication in a peer-reviewed journal, is not always necessary. For example, you won't find any peer-reviewed work debunking Uri Geller. In this case my instinct is that no reputable scientist would want to get anywhere near that controversy (which was one reason I also stayed away).
Some of the criticisms are technical, some are from the point of view of ethics. It would be a grave mistake to discount the ethical concerns, but if you prefer technical explanations there is quite a bit of meat there.
Thanks! That article has a lot of critique and I also like that the author collected the responses from one of the authors.
But, to me, most of the critiques seem uninformed (not made by ML practitioners) and focus on the ethics (where I agree with the authors: we need solid research into weaponized algorithms, to show what is currently possible for ML practitioners who may use such technology adversarially, and who can look at classifying profile pictures the same way we look at information about sexuality, religion, or political preference). By my estimation, most of the critiques are by people who find this research threatening to them, their friends, and their sexual identity. That may very well be the case, but it also leads people to conclude that the scientific study was flawed and that an automated gaydar can't possibly work. Two replications by scientists who took issue with the paper, and who lack any incentive to fudge the data or the metric to dress up their paper, also demonstrated a better-than-random automated gaydar. These systems work! (And that poses a problem we can now tackle, where before we did not even know this was possible, while the majority in this thread still thinks it is all bunkum.)
Many statistical assumptions are regularly broken, for pragmatic reasons (it just works better) or because the world is not static (so the IID assumption is broken). There is an entire subfield of learning on imbalanced datasets, which includes resampling, subsampling, oversampling, and algorithms like SMOTE. It is common to use these techniques to get better performance, including on unseen out-of-distribution data. Fraud, CTR, and medical-diagnosis models are regularly rebalanced for purposes other than trying to break assumptions or cheat one's way to a seemingly higher accuracy. Plus, the signal does not disappear when training only on originally balanced data. These systems do not work by the grace of a rebalancing trick alone, but they may work better (as is usually the case with neural nets, which do not even give convergence guarantees: something only a statistician would worry about).
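To make the oversampling idea concrete, here is a minimal SMOTE-style sketch in plain NumPy. Real SMOTE interpolates between k-nearest minority neighbours; this simplified version interpolates between random minority pairs, an assumption made purely for brevity:

```python
import numpy as np

rng = np.random.default_rng(0)

def oversample_minority(X_min, n_new, rng):
    """Create n_new synthetic minority points by linear interpolation
    between random pairs of existing minority points (SMOTE-like)."""
    i = rng.integers(0, len(X_min), size=n_new)
    j = rng.integers(0, len(X_min), size=n_new)
    t = rng.random((n_new, 1))  # interpolation weights in [0, 1]
    return X_min[i] + t * (X_min[j] - X_min[i])

# Toy imbalanced data: 930 majority points vs 70 minority points.
X_maj = rng.normal(0.0, 1.0, size=(930, 2))
X_min = rng.normal(2.0, 1.0, size=(70, 2))

X_new = oversample_minority(X_min, n_new=930 - 70, rng=rng)
X_min_balanced = np.vstack([X_min, X_new])
print(len(X_maj), len(X_min_balanced))  # 930 930
```

The synthetic points lie on segments between real minority points, so the minority region of feature space is filled in rather than invented, which is why rebalancing does not conjure signal out of nothing.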
You can switch the negative and positive class and my point remains: if the authors had wanted to fraudulently hack the accuracy score, that would be far easier with imbalanced data. The AUC metric is robust to class imbalance anyway: the ranking won't change for unseen out-of-distribution data, and you can simply adjust the threshold to match it.
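The ranking claim is easy to sanity-check: empirical ROC AUC is the probability that a random positive is scored above a random negative, so duplicating one class, as oversampling does, leaves it unchanged. A sketch with made-up scores (the score distributions are my own illustrative assumption):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
# Made-up scores: positives tend to score a bit higher than negatives.
y = np.array([1] * 70 + [0] * 930)
s = np.concatenate([rng.normal(0.6, 1.0, 70), rng.normal(0.0, 1.0, 930)])
auc_imbalanced = roc_auc_score(y, s)

# Oversample the positive class 10x by duplication: the set of
# (positive, negative) score comparisons is unchanged, so AUC is too.
y_bal = np.concatenate([np.repeat(y[:70], 10), y[70:]])
s_bal = np.concatenate([np.repeat(s[:70], 10), s[70:]])
auc_balanced = roc_auc_score(y_bal, s_bal)

print(auc_imbalanced, auc_balanced)  # identical values
```

Accuracy would shift with the class prior; the AUC does not, which is the sense in which it is robust to rebalancing.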
I'd say an academic source is necessary in this case, because you implicitly accuse these scientists of doing shoddy hyped up work, with fudging tricks to appear more accurate. I need more than popular media sources or previous HN discussions to admit this paper was "widely discredited".
Yes, of course many theoretical assumptions are broken, but that is because the people who break them either ignore them completely or deliberately violate them in order to produce better-looking results. That is more common in industry, where it's easier to pull the wool over the eyes of senior colleagues, but it's not unheard of in academia, quite the contrary. Anyway, just because people do shoddy work and then report impressive results doesn't mean we should accept poor methodology as if it were good.
In particular, about the gaydar paper: the authors cooked up their data to get good results and then used those results to claim they had found evidence for an actual natural phenomenon (hormones influencing haircuts, etc.). That's just... pseudoscience.
You seem to be under the assumption that rebalancing is always bad or ignorant, and that techniques such as SMOTE are only used to produce better-looking results and pull the wool over someone's eyes. This is simply not true. Rebalancing is not shoddy but accepted practice. It is certainly fair to question it, but not to jump to a conclusion of fraud or shoddy science (without making yourself look pretty silly).
Again, I do not think rebalancing data justifies the conclusion that the authors were cooking up their data to report better results. Take a step back and assume good faith: could there be any other reasons to resample data, other than wanting to commit fraud?
The Google Scholar link includes 10+ cited and peer-reviewed papers on the Uri Geller drama.
I don't know enough about hormone theory to say anything for or against their conclusion; I am just focusing on showing that working automated gaydars, performing better than average/random guessing, exist and have been scientifically demonstrated. I can agree with you that the connection is spurious without dropping my point that this controversial technology actually works (rebalanced or not).