>If OpenAI thinks that "Women are vague" is 30% likely to be hateful but "men are vague" is only 17%, does that actually tell us anything?

If that held up over hundreds of "placebo epithets," it would tell you that the filter was treating the presence of the word "women" as a signal of hate speech, independently of its context. You couldn't discover that by looking at things that scored 99%: they are already at the top of the scale, so the ceiling hides any difference between them.
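To make the idea concrete, here is a rough sketch of what such a paired test could look like. Everything in it is an assumption for illustration: `toy_score` is a made-up stand-in with a bias deliberately baked in, not OpenAI's filter, and the placebo word list, the `SLUR` placeholder and the `paired_gap` helper are hypothetical. With a real classifier you would replace `toy_score` with a call to it and use far more placebo words.

```python
from statistics import mean

# Hypothetical placebo words: neutral adjectives that should not read as hateful.
PLACEBOS = ["vague", "tall", "punctual", "quiet", "early"]


def toy_score(text: str) -> float:
    """Made-up stand-in for a moderation score in [0, 1]. NOT a real filter.

    It deliberately leaks a bias on the word "women" so the test has
    something to find, and it saturates on the placeholder token "SLUR".
    """
    score = 0.10
    if "women" in text.lower().split():
        score += 0.15  # context-independent bump for the group word
    if "slur" in text.lower():
        score += 0.95  # overtly hateful content dominates the score
    return min(score, 1.0)  # scores are capped at the top of the scale


def paired_gap(score, template, group_a, group_b, words):
    """Mean of score(A version) minus score(B version) over all filler words."""
    return mean(
        score(template.format(group=group_a, word=w))
        - score(template.format(group=group_b, word=w))
        for w in words
    )


# Low-scoring placebo sentences: the bias on the group word is visible.
placebo_gap = paired_gap(toy_score, "{group} are {word}", "Women", "Men", PLACEBOS)

# Sentences already at the cap: both versions hit the ceiling, so the gap vanishes.
saturated_gap = paired_gap(
    toy_score, "{group} are SLUR and {word}", "Women", "Men", PLACEBOS
)

print(f"placebo gap:   {placebo_gap:.2f}")
print(f"saturated gap: {saturated_gap:.2f}")
```

In this toy setup the gap only shows up in the low-scoring placebo sentences. Once both versions are pinned at the cap, the same bias is still in the scorer but no longer shows in the output, which is why the sentences near the middle of the scale are the informative ones.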