Why is it reasonable to expect unbiased results from an LLM trained on internet content which almost everyone agrees is biased? Isn't this the expected outcome?

If anything, I'm surprised the results are as close as they are. For example, it rates criticism of trans and disabled people as only slightly worse than criticism of cisgender and non-disabled people. If this discrepancy were (as some in this thread seem to be suggesting) the result of some liberal OpenAI employees intervening to favor their own side, I'd expect those bars to be much farther apart.



Expecting unbiased results out of ChatGPT would indeed be unreasonable; it is pitched as a "research preview" of a language model, and I would completely expect it to have all kinds of weird biases. But the article isn't really concerned with ChatGPT's outputs; it's concerned with cases where ChatGPT refuses to answer, specifically because the prompt is scored as "hate" by the OpenAI moderation endpoint (which is only one of many possible reasons for ChatGPT to refuse).

That endpoint is pitched as follows: "The moderation endpoint is a tool you can use to check whether content complies with OpenAI's content policy. Developers can thus identify content that our content policy prohibits and take action, for instance by filtering it." [1] There is no mention of it being an LLM (it might well not be), a preview, or inaccurate or biased (though in fairness they do mention that they are working to improve it). I think it's completely fair to hold it to the expectation of being as unbiased as is reasonably possible, and the article is really talking about low-hanging fruit in terms of bias metrics.
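
For reference, this is roughly how the endpoint is called from the older (pre-1.0) Python SDK documented at [1]; take the exact field names as an approximation of the documented response rather than gospel:

    import openai  # pre-1.0 SDK; picks up OPENAI_API_KEY from the environment

    response = openai.Moderation.create(input="Sample text to check")
    result = response["results"][0]

    print(result["flagged"])                  # True if any category is triggered
    print(result["category_scores"]["hate"])  # the "hate" score the article is about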

[1] https://platform.openai.com/docs/guides/moderation/overview


The problem isn't fine-tuning the model; the problem is that there isn't an objective definition of bias. Is there an a priori reason to believe that "I hate disabled people" and "I hate non-disabled people" are equally hateful, and should receive equal hate scores from an unbiased algorithm? Is hating disabled people better or worse than hating Jews? What about "Jews control Hollywood" vs "Disabled people control Hollywood"?

I don't think we as a society have an answer to that, so it's hardly fair to expect ChatGPT to provide one. What it currently does is assign similar but not equal scores to sentences like those - maybe "I hate men" is 0.52 and "I hate women" is 0.73 - and if you flag anything higher than 0.4, then both get flagged, which seems about as unbiased as we're going to get.
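
A minimal sketch of that thresholding, reusing the moderation call from the earlier snippet; hate_score(), the 0.4 cutoff, and the example scores are illustrations, not anything OpenAI specifies:

    import openai  # pre-1.0 SDK, as in the earlier snippet

    def hate_score(text: str) -> float:
        # "hate" category score from the moderation endpoint
        # (field names per the pre-1.0 SDK docs).
        return openai.Moderation.create(input=text)["results"][0]["category_scores"]["hate"]

    FLAG_THRESHOLD = 0.4  # made-up cutoff from the example above

    def is_flagged(text: str) -> bool:
        # "I hate men" (0.52) and "I hate women" (0.73) both clear 0.4,
        # so both get flagged despite the unequal raw scores.
        return hate_score(text) > FLAG_THRESHOLD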


> about as unbiased as we're going to get.

You can easily force the model to be less biased. Just add a filter that flips the gendered words, evaluates the hate score for both the original and the flipped version, and averages the results.

Guaranteed to give the same score regardless of the gender mentioned.
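
A minimal sketch of that filter, assuming the hypothetical hate_score() from the snippet above and a toy word-level swap table (a real flip would have to handle case, inflection, names, pronouns in context, and so on):

    GENDER_SWAPS = {
        "he": "she", "she": "he",
        "him": "her", "her": "him",
        "man": "woman", "woman": "man",
        "men": "women", "women": "men",
    }

    def flip_gender(text: str) -> str:
        # Naive token-level swap; good enough to illustrate the idea.
        return " ".join(GENDER_SWAPS.get(word, word) for word in text.split())

    def symmetric_hate_score(text: str) -> float:
        # By construction, a sentence and its gender-flipped counterpart
        # receive exactly the same score.
        return (hate_score(text) + hate_score(flip_gender(text))) / 2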


Clever idea, but I don't think this would work very well on real posts. Consider a model that rates "typical woman driver" as hateful because that phrase appears in a lot of heavily downvoted argument threads. Your approach would average its score with that of "typical man driver", which will presumably be very low, not because it's less hateful but because it just rarely shows up in the training corpus.


If you're worried about the average score being too low, you could just take the maximum of the two scores instead?
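
In the sketch above, that is just a one-line change from the average to a max:

    def symmetric_hate_score(text: str) -> float:
        # A high score on either version is enough, so "typical woman driver"
        # keeps its high score even if the flipped phrase is rare in the
        # training data.
        return max(hate_score(text), hate_score(flip_gender(text)))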



