I don't disagree that detecting hate is a hard problem. But I also feel it isn't as ambiguous as you make it out to be.
"X loves Y" - not hate speech.
"X should die" - hate speech.
Now, context could change either of these. But the question is: can two random people proficient in the same language similarly tell the two kinds of speech apart?
If a human can perform the task with pretty good accuracy, then it's doable.
Now, if we're saying that even people can't tell hate and violence apart, we're in more trouble. But as of now, I think it's a pretty strong claim that people can't actually decipher hateful tones in text.
I think the enforcement is way more imbalanced than you realize.
Here's an example: how about instead of "X should die", it's "lol, can all X just die?". They give it a little jocular spin with the "lol", but it's effectively the same comment.
Here's the deal: if X in that comment is "black people", you get banned on Twitter. If it's "white people" you don't get banned.
Now personally, I don't mind the comment in either form. I don't believe it's an actual threat to anyone. It's kind of edgelordy, but whatever. And beyond that, I'd be very unhappy if someone got banned for something like that, because it stomps all over whatever other, more legitimate commentary they might've had.
I'm interested in having conversations, even imperfect ones. All conversations are imperfect.
But OK, let's move past "X should die".
But it's not just that. In any discussion where one side can be spun as a vulnerable minority (merit be damned), that minority status can be used to silence the other side.
So for example, consider transwomen in women's sports. There's a not-insignificant push among trans activists to claim that transwomen are "biologically female". I think the logic with that is: 1. sex is fuzzier than most people realize (debatable) and 2. hormone treatments are sufficient to tip someone over the line into the other biological sex (also debatable).
Referring to a trans woman as male for the sake of argument runs you a serious risk of getting permabanned from Twitter if the person you're arguing with is particularly ornery.
Hmm, you claim that this will get you banned and that won't. What are those assumptions built on?
I'm not saying it's not the case, I don't use Twitter and don't know their processes.
That said, in my work, we deal with similar scenarios. Normally, you look for trends and you have different levels of enforcement for different risks.
So say you have automated systems doing monitoring, which flag and alert on possible bad behavior. Normally these flags are scored and categorized. For some score and category, you might take an automatic action. Say, block the comment from posting at all and tell the user to reword their idea in a non-hateful manner. Or you delete the comment. These infractions then count against the account. When there are many of them, indicating a trend, the account gets promoted for manual review.
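To make that flow concrete, here's a minimal sketch of the threshold-and-escalation logic. The score thresholds, category names, and infraction count are all hypothetical, illustrative numbers, not any real platform's values:

```python
from dataclasses import dataclass, field

# Hypothetical thresholds -- illustrative values, not any real platform's.
AUTO_BLOCK_SCORE = 0.9    # block the comment from posting at all
AUTO_DELETE_SCORE = 0.75  # let it post, then delete it
REVIEW_INFRACTIONS = 3    # this many infractions suggests a trend -> human review

@dataclass
class Account:
    name: str
    infractions: list = field(default_factory=list)
    needs_review: bool = False

def handle_flag(account: Account, score: float, category: str) -> str:
    """Apply the automatic action for one flagged comment and track the trend."""
    if score >= AUTO_BLOCK_SCORE:
        action = "block"   # never posted; user is told to reword
    elif score >= AUTO_DELETE_SCORE:
        action = "delete"  # posted, then removed
    else:
        return "none"      # signal too weak to act on automatically
    # The infraction counts against the account.
    account.infractions.append((score, category, action))
    # Many infractions indicate a trend: promote for manual review.
    if len(account.infractions) >= REVIEW_INFRACTIONS:
        account.needs_review = True
    return action
```

The point of the two thresholds is that the riskiest content never posts at all, while borderline content is only cleaned up after the fact; neither by itself bans anyone, it just accumulates evidence for a human to look at.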
At that point, a person does a holistic review of the account and its infractions. They can reverse some of them, feeding back into the automated monitoring and alerting to improve its false-positive rate, or confirm them, adding weight to those labels. Finally, they can choose to contact the account holder to ask them to justify themselves, give them a warning, take partial enforcement like deleting certain posts, or outright ban them.
Even the ban has degrees. Your account could be banned, but you're still allowed to create a new one. Or you could be banned with cross-account detection, so your IP, email, address, credit cards, etc. are all banned to make it hard for you to even sign up again.
And the processes in place and their rules are constantly adjusted and reevaluated. There are even backfill mechanisms: if rules are relaxed in the future, prior enforcement can be reversed when it no longer holds against the new rules.
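The backfill idea can be sketched in a few lines. This is a toy illustration under my own assumptions (the `still_violates` check standing in for whatever the current rule set is), not any platform's actual mechanism:

```python
def backfill(enforcements, still_violates):
    """Re-check past enforcement against the current, possibly relaxed rules.

    `enforcements` is a list of (comment, action) pairs that were enforced
    under the old rules; `still_violates` is the current rule check.
    Returns the enforcement actions that should now be reversed.
    """
    return [(comment, action)
            for (comment, action) in enforcements
            if not still_violates(comment)]
```

The useful property is that enforcement decisions are stored with enough context to be re-evaluated later, so relaxing a rule isn't just forward-looking: old penalties that the new rules wouldn't have triggered can be undone.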
Would you find it reasonable if, say, Facebook or Twitter operated in a somewhat similar fashion?
"X loves Y" - not hate speech.
"X should die" - hate speech.
Now, context to these could change it. But the question is, can two random person proficient in the same language tell apart similarly the two kinds of speech?
If a human can perform the task with pretty good accuracy, then it's doable.
Now if we're saying that even people can't tell apart hate and violence, we're in more trouble. But as of now, I think it's a pretty strong claim to make that people can't actually decipher hateful tones in text.