Counterfactual evaluation of machine learning models
Michael Manapat
@mlmanapat
Stripe
Stripe
• APIs and tools for commerce and payments
import stripe

stripe.api_key = "sk_test_BQokiaJOvBdiI2HdlWgHa4ds429x"

stripe.Charge.create(
    amount=40000,
    currency="usd",
    source={
        "number": "4242424242424242",
        "exp_month": 12,
        "exp_year": 2016,
        "cvc": "123",
    },
    description="PyData registration for mlm@stripe.com",
)
Machine Learning at Stripe
• Ten engineers/data scientists as of July 2015 (no distinction between roles)
• Focus has been on fraud detection
  • Continuous scoring of businesses for fraud risk
  • Synchronous classification of charges as fraud or not
Making a Charge
[Diagram: the tokenization and payment flow]
(1) The customer's browser sends the raw credit card information (CC: 4242..., Exp: 7/15) to the Stripe API
(2) The Stripe API returns a token to the browser (tokenization)
(3) The browser passes the order information and the token to your server
(4) Your server creates a charge with the token via the Stripe API (payment)
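A minimal sketch of the flow above using the Stripe Python bindings of that era; the card details and amount are the test placeholders from the earlier snippet, and in a real integration the tokenization call happens client-side (e.g. via Stripe.js) so raw card data never reaches your server.

import stripe

stripe.api_key = "sk_test_BQokiaJOvBdiI2HdlWgHa4ds429x"

# Steps (1)-(2): exchange raw card details for a single-use token.
token = stripe.Token.create(
    card={
        "number": "4242424242424242",
        "exp_month": 12,
        "exp_year": 2016,
        "cvc": "123",
    }
)

# Steps (3)-(4): your server receives only the token and uses it to create the charge.
stripe.Charge.create(
    amount=40000,
    currency="usd",
    source=token.id,
    description="PyData registration for mlm@stripe.com",
)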
Charge Outcomes
• Nothing (best case)
• Refunded
• Charged back ("disputed")
  • Fraudulent disputes result from "card testing" and "card cashing"
  • Can take > 60 days to get raised
Model Building
• December 31st, 2013
• Train a binary classifier for disputes on data from Jan 1st to Sep 30th
• Validate on data from Oct 1st to Oct 31st (need to wait ~60 days for labels)
• Based on validation data, pick a policy for actioning scores: block if score > 50 (see the sketch below)
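A minimal sketch of this setup; the data file, feature names, and label column are hypothetical placeholders.

import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Hypothetical charge-level dataset with a known "disputed" label.
charges = pd.read_csv("charges_2013.csv", parse_dates=["created"])

train = charges[(charges.created >= "2013-01-01") & (charges.created <= "2013-09-30")]
valid = charges[(charges.created >= "2013-10-01") & (charges.created <= "2013-10-31")]

features = ["amount", "card_country_risk", "email_age_days"]  # hypothetical features

model = RandomForestRegressor(n_estimators=10)
model.fit(train[features], train["disputed"])

# Scale scores to 0-100 and examine precision/recall on the validation month
# to pick a threshold policy such as "block if score > 50".
valid_scores = 100 * model.predict(valid[features])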
Questions
• Validation data is > 2 months old. How is the model doing?
• What are the production precision and recall?
• Business complains about high false positive rate: what would happen if we changed the policy to "block if score > 70"?
Next Iteration
• December 31st, 2014. We repeat the exercise from a year earlier
• Train a model on data from Jan 1st to Sep 30th
• Validate on data from Oct 1st to Oct 31st (need to wait ~60 days for labels)
• Validation results look ~ok (but not great)
Next Iteration
• We put the model into production, and the results are terrible
• From spot-checking and complaints from customers, the performance is worse than even the validation data suggested
• What happened?
Next Iteration
• We already had a model running that was blocking a lot of fraud
• We were training and validating only on data for which we had labels: charges that the existing model didn't catch
• That left far fewer examples of fraud, so we were essentially building a model to catch only the "hardest" fraud
• Possible solution: we could run both models in parallel...
Evaluation and Retraining Require the Same Data
• For evaluation, policy changes, and retraining, we want the same thing:
  • An approximation of the distribution of charges and outcomes that would exist in the absence of our intervention (blocking)
First attempt
• Let through some fraction of charges that we would ordinarily block
• Straightforward to compute precision

if score > 50:
    if random.random() < 0.05:
        allow()
    else:
        block()
Recall
• Total "caught" fraud = 4,000 * (1 / 0.05) = 80,000
• Total fraud = 80,000 + 10,000 = 90,000
• Recall = 80,000 / 90,000 = 89%

1,000,000 charges   Score < 50   Score > 50
Total                  900,000      100,000
Not Fraud              890,000        1,000
Fraud                   10,000        4,000
Unknown                      0       95,000
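Working through the table: of the 100,000 charges with score > 50, the 5% reversal let 5,000 through, of which 4,000 turned out to be fraud and 1,000 did not. A small arithmetic sketch of the resulting estimates:

reversal_rate = 0.05
fraud_allowed_above_50 = 4000      # observed on the reversed 5%
not_fraud_allowed_above_50 = 1000
fraud_below_50 = 10000

# Scale the reversed sample up by 1 / 0.05 = 20 to estimate the blocked traffic.
est_fraud_above_50 = fraud_allowed_above_50 / reversal_rate          # 80,000
est_not_fraud_above_50 = not_fraud_allowed_above_50 / reversal_rate  # 20,000

precision = est_fraud_above_50 / (est_fraud_above_50 + est_not_fraud_above_50)  # 0.80
recall = est_fraud_above_50 / (est_fraud_above_50 + fraud_below_50)             # ~0.89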
Training
• Train only on charges that were not blocked
• Include weights of 1/0.05 = 20 for charges that would have been blocked if not for the random reversal

from sklearn.ensemble import RandomForestRegressor
...
r = RandomForestRegressor(n_estimators=10)
r.fit(X, Y, sample_weight=weights)
Training
• Use weights in validation (on the hold-out set) as well

from sklearn import cross_validation  # sklearn.model_selection in current scikit-learn

X_train, X_test, y_train, y_test = cross_validation.train_test_split(
    data, target, test_size=0.2)
r = RandomForestRegressor(...)
...
r.score(X_test, y_test, sample_weight=weights)  # weights aligned with the rows of X_test
Better Approach
• We're letting through 5% of all charges we think are fraudulent. Policy:
[Diagram: score ranges labeled "Very likely to be fraud" and "Could go either way"]
Better Approach
• Propensity function: maps classifier scores to P(Allow)
• The higher the score, the lower the probability that we let the charge through
• Get information on the area we want to improve on
• Letting through less "obvious" fraud ("budget" for evaluation)
Better Approach

def propensity(score):
    # Piecewise linear/sigmoidal
    ...

ps = propensity(score)                   # P(Allow) for this charge's score
original_block = score > 50              # what the existing policy would have done
selected_block = random.random() >= ps   # block with probability 1 - P(Allow)

if selected_block:
    block()
else:
    allow()

log_record(id, score, ps, original_block, selected_block)
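The slides leave propensity unspecified beyond "piecewise linear/sigmoidal". One possible piecewise-linear sketch, with breakpoints chosen only for illustration (they happen to reproduce the P(Allow) values in the table that follows):

def propensity(score):
    """Map a 0-100 fraud score to P(Allow)."""
    if score <= 50:
        return 1.0                        # below the block threshold: always allow
    if score <= 80:
        # Linear decay near the decision boundary, where labels are most valuable.
        return 0.35 - 0.01 * (score - 50)
    return 0.0005                         # near-certain fraud: almost never allowed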
ID   Score   P(Allow)   Original Action   Selected Action   Outcome
1    10      1.0        Allow             Allow             OK
2    45      1.0        Allow             Allow             Fraud
3    55      0.30       Block             Block             -
4    65      0.20       Block             Allow             Fraud
5    100     0.0005     Block             Block             -
6    60      0.25       Block             Allow             OK
Analysis
• In any analysis, we only consider samples that were allowed (since we don't have labels otherwise)
• We weight each sample by 1 / P(Allow)
  • cf. weighting by 1/0.05 = 20 in the uniform probability case
ID   Score   P(Allow)   Weight   Original Action   Selected Action   Outcome
1    10      1.0        1        Allow             Allow             OK
2    45      1.0        1        Allow             Allow             Fraud
4    65      0.20       5        Block             Allow             Fraud
6    60      0.25       4        Block             Allow             OK

Evaluating the "block if score > 50" policy
Precision = 5 / 9 = 0.56
Recall = 5 / 6 = 0.83

Evaluating the "block if score > 40" policy
Precision = 6 / 10 = 0.60
Recall = 6 / 6 = 1.00
Evaluating the "block if score > 62" policy
Precision = 5 / 5 = 1.00
Recall = 5 / 6 = 0.83
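A sketch of how the policy evaluations above can be computed from the logged records; the record layout mirrors the table and is otherwise hypothetical:

# Logged (score, P(Allow), outcome) for the charges that were allowed through.
records = [
    {"score": 10, "p_allow": 1.0,  "fraud": False},
    {"score": 45, "p_allow": 1.0,  "fraud": True},
    {"score": 65, "p_allow": 0.20, "fraud": True},
    {"score": 60, "p_allow": 0.25, "fraud": False},
]

def evaluate(threshold, records):
    """Inverse-propensity-weighted precision and recall for "block if score > threshold"."""
    blocked_fraud = blocked_total = total_fraud = 0.0
    for r in records:
        weight = 1.0 / r["p_allow"]
        if r["fraud"]:
            total_fraud += weight
        if r["score"] > threshold:
            blocked_total += weight
            if r["fraud"]:
                blocked_fraud += weight
    return blocked_fraud / blocked_total, blocked_fraud / total_fraud

print(evaluate(50, records))  # ~ (0.56, 0.83)
print(evaluate(40, records))  # (0.60, 1.0)
print(evaluate(62, records))  # (1.0, ~0.83)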
Analysis
• The propensity function controls the "exploration/exploitation" tradeoff
• Precision, recall, etc. are estimators
  • Variance of the estimators decreases the more we allow through
• Bootstrap to get error bars (pick rows from the table uniformly at random, with replacement; see the sketch below)
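A sketch of that bootstrap, reusing the hypothetical records and evaluate function from the previous snippet:

import random

def bootstrap_precision_interval(threshold, records, n_boot=1000, seed=0):
    """Percentile error bars for the weighted precision estimate."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_boot):
        resample = [rng.choice(records) for _ in records]  # rows drawn with replacement
        try:
            precision, _ = evaluate(threshold, resample)
        except ZeroDivisionError:
            continue  # resample had no blocked (or no fraudulent) charges
        estimates.append(precision)
    estimates.sort()
    lo = estimates[int(0.025 * len(estimates))]
    hi = estimates[int(0.975 * len(estimates))]
    return lo, hi

print(bootstrap_precision_interval(50, records))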
New models
• Train on weighted data
• Weights => approximate distribution without any blocking
• Can test arbitrarily many models and policies offline
• Li, Chen, Kleban, Gupta: "Counterfactual Estimation and Optimization of Click Metrics for Search Engines"
Technicalities
• Independence and random seeds
• Application-specific issues
Conclusion
• If some policy is actioning model scores, you can inject randomness in production to understand the counterfactual
• Instead of a "champion/challenger" A/B test, you can evaluate arbitrarily many models and policies in this framework
Thanks!
• Work by Ryan Wang (@ryw90), Roban Kramer (@robanhk), and Alyssa Frazee (@acfrazee)
• We're hiring engineers/data scientists!
• mlm@stripe.com (@mlmanapat)