It’s not as though the designers of the system set out to train it on a biased dataset; we can assume they were trying to be balanced from the start.
And that’s the deeper problem here: “it’s just a biased dataset” is a misdiagnosis. It’s a whole system of biases that leads people to believe they are training with balanced data when they manifestly are not.
You’re never really going to achieve this mythical “balanced training data” until you untangle all of the other implicit personal and organizational biases. There is a whole host of ethical discussions that need to happen just to flesh out what “balanced” might even mean for, say, facial recognition software intended for use in law enforcement. But the same biases that lead people to skip right past those discussions and begin training are often the very ones that produce the biased data in the first place.