Most of our ML stack has been developed internally given the unique constraints we have for Radar. Among other things, we need to be able to:
- compute a huge number of features, many of which are quite complex (involving data collected from throughout the payment process), in real-time: e.g. how many distinct IP addresses have we seen this card from over its entire history on Stripe, how many distinct cards have we seen from the IP address over its history, and do payments from this card usually come from this IP address?
- train custom models for all Stripe users who have enough data to make this feasible, necessitating the ability to train large numbers of models in parallel,
- provide human-readable explanations as to why we think a payment has the score that it does (which involves building simpler “explanation models”—which are themselves machine learning models—on top of the core fraud models),
- surface model performance and history in the Radar dashboard,
- allow Radar for Fraud Teams users to customize the risk score thresholds at which we take action on payments,
- and so forth.
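To make the first bullet concrete, here's a toy, in-memory sketch of the kind of card/IP aggregate features described above. This is purely illustrative (the names and structure are hypothetical, not Stripe's actual system, which computes these at scale and in real time):

```python
from collections import defaultdict

class FeatureStore:
    """Toy store for card/IP aggregates, keyed on raw identifiers."""

    def __init__(self):
        self.ips_by_card = defaultdict(set)   # card -> distinct IPs seen
        self.cards_by_ip = defaultdict(set)   # IP -> distinct cards seen
        self.pair_counts = defaultdict(int)   # (card, ip) -> payment count
        self.card_counts = defaultdict(int)   # card -> total payment count

    def record_payment(self, card, ip):
        """Update aggregates after a payment is observed."""
        self.ips_by_card[card].add(ip)
        self.cards_by_ip[ip].add(card)
        self.pair_counts[(card, ip)] += 1
        self.card_counts[card] += 1

    def features(self, card, ip):
        """Features for an incoming payment, computed before recording it."""
        total = self.card_counts[card]
        from_this_ip = self.pair_counts[(card, ip)] / total if total else 0.0
        return {
            # "how many distinct IP addresses have we seen this card from?"
            "distinct_ips_for_card": len(self.ips_by_card[card]),
            # "how many distinct cards have we seen from this IP address?"
            "distinct_cards_for_ip": len(self.cards_by_ip[ip]),
            # "do payments from this card usually come from this IP?"
            "fraction_of_payments_from_this_ip": from_this_ip,
        }
```

The hard part in production isn't the logic, it's computing these over a card's entire history, at serving time, for every payment.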
We found that getting the data-ML-product interactions exactly right necessitated building most of the stack ourselves.
That said, we do use a number of open source tools: TensorFlow and PyTorch for our deep learning work, XGBoost for training boosted trees, and Scalding and Hadoop for our core data processing, among others.
Broadly speaking, what approach do you use to "build simpler 'explanation models'" from the more complicated "core fraud models"? Do you learn the models separately over the training data, or does the more complicated model somehow influence the training of the simpler model?
Why are you so stubborn about IP addresses? They're not a holy grail! I've been using a proxy for some years now, and many times when I want to buy something on a storefront “powered by Stripe”, my card is declined due to an “unknown error”. The moment I turn off my VPN, the transaction goes through. I expect this is a huge problem for Stripe, or for anyone basing fraud decisions heavily on IP. These days if I find a cool product and see “powered by Stripe”, I simply end up on Amazon purchasing the same product for a similar price. Worst part: your clients don't even know!
I’m sorry that you had this experience. We vehemently agree that any one signal (such as IP address or use of a proxy) is a pretty poor predictor of fraud in isolation. We are trying to move the industry towards holistic evaluation rather than inflexible blacklists; not everyone behind a TOR exit node is a fraudster, for example.
While we can’t fix the previous experience you had, we’ve rebuilt almost every component of our fraud detection stack over the past year. We’ve added hundreds of new signals to improve accuracy, each payment is now scored using thousands of signals, and we retrain models every day.
We hope these improvements will help. We want our customers to be able to provide you services; that’s what keeps the lights on here. We’d be happy to look into what happened if you have specific websites in mind—feel free to shoot me a note at mlm@stripe.com.
The rough idea is that you look at all the decisions made by the fraud model (sample 1 is fraud, sample 2 is not fraud) and the world of possible "predicates" ("feature 1 > x1", "feature 1 > x2", ..., "feature 10000 > z1," etc.) and try to find a collection of explanations (which are conjunctions of these predicates) that have high precision and recall over the fraud model's predictions. For example, if "feature X > X0 and feature Y < Y0" is true for 20% of all payments the fraud model thinks are fraudulent, and 95% of all payments matching those conditions are predicted by the fraud model to be fraud, that's a good "explanation" in terms of its recall and precision.
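A minimal sketch of that search, under simplifying assumptions (a small hand-built predicate set, exhaustive search over short conjunctions; the actual approach and thresholds are not specified here):

```python
from itertools import combinations

def precision_recall(conjunction, samples, flagged_ids):
    """Precision/recall of a conjunction of predicates, measured against
    the fraud model's predictions (flagged_ids = indices it called fraud)."""
    matched = [i for i, s in enumerate(samples)
               if all(pred(s) for _name, pred in conjunction)]
    if not matched or not flagged_ids:
        return 0.0, 0.0
    true_pos = sum(1 for i in matched if i in flagged_ids)
    return true_pos / len(matched), true_pos / len(flagged_ids)

def mine_explanations(samples, flagged_ids, predicates,
                      min_precision=0.95, min_recall=0.2, max_terms=2):
    """Find conjunctions of predicates that explain the model's fraud
    predictions with high precision and non-trivial recall."""
    explanations = []
    for k in range(1, max_terms + 1):
        for conjunction in combinations(predicates, k):
            p, r = precision_recall(conjunction, samples, flagged_ids)
            if p >= min_precision and r >= min_recall:
                names = [name for name, _pred in conjunction]
                explanations.append((names, p, r))
    return explanations
```

Note the target labels here are the fraud model's *predictions*, not ground truth: the explanation model approximates the core model's behavior, which is what makes its output a faithful explanation of the score.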
It's a little tough to talk about this in an HN comment but please feel free to shoot me an e-mail (mlm@stripe.com) if you'd like to talk more.