Weapons of Math Destruction is a good book on the topic. Using ML for tasks like evaluating employee performance (and triggering firings), issuing loans, insurance, etc. is affecting people's lives in the real world today.
Insurance has always been a statistically driven process and ML is just an advanced form of stats, so it makes perfect sense that the industry should use this technology.
I know of models that use things like typing speed and birth month to make loan decisions. Feed the AI inputs like race and it will make racist decisions.
I am not aware of models that use birth month to make loan decisions, but if that's a useful predictor, why not include it?
I'm not really sure exactly what you mean by "feed AI inputs like race and it will make racist decisions", as that's an absurdly general claim that doesn't even really say anything useful. If you mean the model reflects the underlying stats that correlate with race, class, gender, then sure, yes, I agree the predictor will do that. To say that it's racist is an entirely subjective claim.
The point of insurance isn't to avoid social ills, but to statistically minimize the impact of risk on individuals (while making a tidy profit for the insurer).
Because birth month distribution is highly correlated with climate zone; see the heat-map in [1]. Climate zone, in turn, is highly correlated with race (and probably also immigration status in this case).
Even if we set ethics aside, this is still a terrible idea because credit risk analysis is highly regulated in most cases. There is enormous regulatory risk to using birth month in a credit risk model. Particularly because when a jury asks "what's an alternative causal model that could explain the bank's incorporation of this data point into their models?" the answer is going to be "absolutely none".
> I am not aware of models that use birth month to make loan decisions, but if that's a useful predictor, why not include it?
Because it's fucking unfair, and humans hate feeling like they've been wronged for unfair reasons they cannot control. How would you feel if your loan got denied, you asked why, and you found out it would have been approved if you had been born in April instead of June? Don't pretend you'd love it and smile. You'd rant about it and tell all your friends how bad that lender is.
If the lender rejects you because your debts are too high or your income too low or something, or the zip code of the home you're mortgaging, that makes more sense and feels more relevant and in your control, at least.
Yeah, it's nice from the point of view of the lender, but the cost is too high to society.
Society feels so strongly about this that (in the USA) we've even passed all kinds of laws about what pieces of information are absolutely banned from being used to determine who gets a job or who gets a home. If you are a landlord or hiring manager you have to be aware of them so you can at least pretend to comply.
Sometimes I wonder if AI is already here secretly. Half of my replies on hacker news seem to basically boil down to teaching what humans are actually like.
> How would you feel if your loan got denied, you asked why, and you found out it would have been approved if you had been born in April instead of June?
I mean, I pay more for car insurance because I am immutably male. I don't feel great about this, but assuming that males my age are statistically more likely to be in crashes this makes total sense.
Likewise, I don't feel great that even if I were really good at basketball I'd be way less likely to make the short list of an NBA recruiter, because statistically I'm not going to be as good as someone two feet taller. It may not feel great, but it still absolutely feels fair too.
If people born in April instead of June were actually statistically more likely to be in crashes, I don't see how this is any different or unfair.
If I understand correctly, most modern insurance models contain both a general risk ("males are more dangerous so their premiums should be higher") and personal risk ("this male has driven without an accident for 20 years, which is better than the average male, so we categorise them as less risky").
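To make that concrete, here is a minimal sketch of one standard way the two parts can be blended (credibility weighting). The function name, the constant k, and the numbers are invented for illustration, not taken from any actual insurer's model.

    # Blend a group base rate with an individual's own claims history.
    # More personal history shifts weight from the group rate toward the
    # individual's record. All values here are made up.
    def blended_risk(group_rate, personal_rate, years_of_history, k=10):
        z = years_of_history / (years_of_history + k)   # credibility factor in [0, 1)
        return z * personal_rate + (1 - z) * group_rate

    # Group rate for young male drivers 8%, but this driver has 20 claim-free
    # years and a personal rate closer to 2%.
    print(blended_risk(group_rate=0.08, personal_rate=0.02, years_of_history=20))  # -> 0.04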
This is a naive answer. It neglects that (1) models for housing loans, insurance, etc. are trained on historical data, and (2) policy (in the USA) at the time that data was collected was racist, and that legacy is therefore still present in the data today (this is not subjective).
Historically, at least in the United States, loans were unavailable to PoC, especially black Americans. Districts in many American cities were "redlined"[1], that is to say certain districts of cities were deemed "unprofitable" for banks. Redlining was policy and was targeted towards discriminating against black Americans, who were often victims of predatory loans with unserviceable interest rates, which commonly resulted in defaults. The defaults caused worse scores for people in the neighborhood and created a positive feedback loop. Historical housing loan data (and insurance data) includes this data. People today are affected by this historical data and this absolutely must be taken into account by the developers of these models.
Because of the entangled nature of real data, dropping "race" as a feature for training wouldn't solve the problem. Factors like zip codes (think of the redlined districts) would also influence the outcome[2].
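A minimal synthetic sketch of that proxy effect, using scikit-learn and entirely invented data: the model is never shown a race column, but a zip-code feature that correlates with it carries the bias from the historical labels anyway.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)
    n = 10_000

    race = rng.integers(0, 2, n)               # never given to the model
    zip_group = np.where(rng.random(n) < 0.9,  # zip code mirrors race 90% of the
                         race, 1 - race)       # time (think redlined districts)

    # Historical approvals: same underlying risk, but group 1 was approved far less often.
    underlying_risk = rng.random(n)
    approved = ((underlying_risk < 0.7) &
                ~((race == 1) & (rng.random(n) < 0.5))).astype(int)

    # Train only on the "neutral" features: risk score and zip group. No race column.
    X = np.column_stack([underlying_risk, zip_group])
    model = LogisticRegression(max_iter=1000).fit(X, approved)

    scores = model.predict_proba(X)[:, 1]
    print(scores[race == 0].mean(), scores[race == 1].mean())  # group 1 scores much lower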
Creating models whose output can impact people so profoundly (e.g., whether Jane can get health insurance) calls for more reflection than just "numbers correlate!".
There isn't an "easy" solution, but step 0 is recognizing that there is an historic problem that is being dragged with us into the future because of the way our current systems work.
Political solutions are necessary, maybe something like subsidized loans for people from formerly redlined communities to purchase and restore homes, or start businesses. Urban planning projects, like increasing mixed-zoning, pedestrian traffic, and good public transportation would help keep money in the neighborhood. Then there is the question of how to deal with gentrification and increase the quality of living in a community without displacing people from that community. It takes a team of experts from various fields, the community itself, and goal-oriented cooperation.
> I am not aware of models that use birth month to make loan decisions, but if that's a useful predictor, why not include it?
Because it’s radically unfair to do that! You’re seriously suggesting that some people should be penalized for the day they were born?
Taking the point more seriously, it’s hard to believe that, for whatever you’re insuring, the populations born in each month differ so much in risk that birth month as a signal won’t discriminate unfairly almost as much as it helps. Naively, it’s likely to unfairly penalize almost half the population. Maybe you’re prepared to accept that because, hey, it’s right over half the time, but tell that to the vast population you’re penalizing.
And some things just shouldn’t be included. Health insurance, for example: isn’t it unfair to include risk of cancer caused by genetic factors in premium calculations? I guess you could say that’s different because parents could control birth month, but, man, that just seems so uncaring to me.
In the race example, let’s say there is a discrepancy in the underlying stats now, but it stems from structural problems in society created by racism in the past. Using that as a signal to discriminate now makes it less likely that the problems will ever be resolved (example: Black people can’t get loans as easily, so can’t start businesses as easily, so the wealth gap is reinforced). If you think that’s ethical I would suggest that you are considering statistics too much and people too little.
If there was a reason for people to be higher statistical risk based on birth month (say, "babies born in november had 0.5% higher risk of long-term respiratory disease") then, yes, that should be incorporated into a model.
Genetic risk of cancer cannot be included in insurance decisions (excluding voluntary life insurance) because a law was passed (GINA) preventing it.
It's not the role of the loan industry to make it easier for black people who appear to be a higher statistical risk to get loans. That's the purpose of government: to make such factors illegal to use, the same way it made it illegal to include genetic history in life insurance models, provided that society believes in that principle.
Another way to say that - one that appears less uncaring to readers - would be “yes, including birth month or race in insurance risk calculations would be unfair+, so it should not be done. If necessary, the government should step in to prevent this”.
+ you may squabble with the use of ‘unfair’ here - obviously it is not ‘unfair’ in the statistical sense, but that’s not what the word generally means in spoken English. ‘Inequitable’ would be better, but for some reason ‘equity’ seems to be evolving to mean ‘at all costs, produces equality of outcome’ and so is a victim of the same problem in the opposite direction. I chose the more common word.
[edit: I used insurance here, sorry - the same would apply to loans I think]
I’ve been thinking about this more, and it’s occurred to me that there is an underlying assumption in “It’s not the purpose of the loan industry…” that corporations are required to make decisions amorally, which I don’t think is or should be the case. That may be an underlying schism in our argument.
You are correct. I expect that corporations understand they exist in a competitive environment and incorporating morality represents a real risk that your competitor will replace you, because morality is not profitable.
I do agree that if you expect corporations to act morally, then it would be sensible for them to be more careful in understanding and correcting long-term social ills.
Is it the role of the loan industry to make it more difficult for black people to get loans, regardless of whether the individual is a higher statistical risk? (Apparently so, since that's the way it's always worked, right?)
It is trivial to bake an irrelevant distinction into a naive ML driven process.
An insurance model that uses data on financial stability in a society where a specific race of people have been, through bigotry alone, forced into a less stable position by default, will be bound to perpetuate the original bigoted patterns of behavior. It's not inaccurate to describe the model as therefore racist because it's modeling itself off a risk assessment where certain groups have historically been forced into the fringe and are therefore inherently risky.
I'm sure if Black Wall Street wasn't razed to the ground by a racist white mob (among several other attempts by black people to gain stability, wealth, etc. that were destroyed by racist white anger) then maybe the model wouldn't need to "reflect the underlying stats that correlate with race...". But those underlying stats, and those correlations, didn't just happen in a vacuum, hewn out of the aether like some magical, consequence-free thing.
The points you are making are entirely political and apply generally, not just to insurance. Insurance is a business and it's not the job of the insurer to correct long-standing social ills.
The fact that my points apply generally does not mean they do not apply to insurance. In fact, it very likely means insurance is a subset of "generally", and therefore insurance also has to deal with racism. You cannot extricate models of society from their ills by pretending the ills aren't relevant.
Generally when a loan company is training a model, they don't approve every application for a month and see if they're repaid to get unbiased data.
Instead, they look at who was approved, and maybe whether they repaid their loan.
So if a human making loan decisions isn't keen on giving out loans in View Park-Windsor Hills, a model trained to produce human-level performance won't be keen on doing it either.
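A minimal sketch of that feedback loop, with invented data and scikit-learn, assuming the new model is trained to reproduce the human's past approval decisions (the only labels on hand):

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(1)
    n = 20_000

    income = rng.normal(50, 15, n)             # arbitrary units
    disfavored_zip = rng.integers(0, 2, n)     # 1 = a neighborhood the old process avoided

    # Historical policy: almost never approve in the disfavored zip, whatever the income.
    human_approved = ((income > 40) &
                      ((disfavored_zip == 0) | (rng.random(n) < 0.05))).astype(int)

    # Train on the only labels available: the human's past decisions.
    X = np.column_stack([income, disfavored_zip])
    model = LogisticRegression(max_iter=1000).fit(X, human_approved)

    # Two applicants with identical income, different zip: the second scores far lower.
    print(model.predict_proba(np.array([[55.0, 0.0], [55.0, 1.0]]))[:, 1])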
This place was using any and all data that improved the fit. What browser you used, typing speeds, and birth month all improved fit in training and worked in the test set and got thrown in. That was the only hurdle, they didn’t care as to why, they just added the data.
I thought it was one of the most deceptive books I ever read! It contradicts itself enough that it is often obviously wrong even without conceptual knowledge of the field.