2. Data Science for Financial Fraud Detection
Denisa BANULESCU-RADU
University of Orléans, LEO
WiMLDS 13th of April 2021
Banulescu-Radu (LEO) WiMLDS 13/04/2021 2 / 39
3. Background
• Since 2015: Associate Professor – University of Orléans, LEO
• 2016: Young Researcher Award in Economics – Autorité des Marchés
Financiers
• 2015: Thesis Prize – Fondation Banque de France
• 2014-2015: Max Weber Postdoctoral Fellow – European University Institute
• 2011-2014: PhD in Economics – Maastricht University and University of
Orléans
Dissertation title: "Four essays in financial econometrics"
5. Outline
1 Econometrics vs Machine Learning
2 General aspects of fraud
3 Main challenges and solutions
4 Case studies
4.1 Case 1: Insurance fraud detection
4.2 Case 2: Social fraud detection
5 Conclusion
6. Econometrics vs Machine Learning
7. Econometrics vs Machine Learning
Econometrics vs Machine Learning
8. Econometrics vs Machine Learning
Econometrics vs Machine Learning
9. Econometrics vs Machine Learning
“there are a number of areas where there would be opportunities for fruitful collaboration between econometrics and machine learning”
Hal Varian (2014) - Professor of Economics (University of Michigan) & Chief Economist (Google)
10. General aspects of fraud
11. General aspects of fraud
Fraud detection - Why is it important?
12. General aspects of fraud
Definition of fraud
Definition
• Baesens et al. (2015)
Fraud is an uncommon, well-considered, imperceptibly
concealed, time-evolving, and often carefully organized crime
which appears in many types of forms.
13. General aspects of fraud
Typologies of fraud
14. Main challenges and solutions
15. Main challenges and solutions
Main CHALLENGES and solutions
16. Main challenges and solutions
Main CHALLENGES and solutions
17. Main challenges and solutions
Main CHALLENGES and solutions
18. Main challenges and solutions
Main CHALLENGES and solutions
19. Main challenges and solutions
Main challenges and SOLUTIONS
1. Main tools used to fight fraud
20. Main challenges and solutions
Main challenges and SOLUTIONS
2. Deal with imbalanced datasets
21. Main challenges and solutions
Main challenges and SOLUTIONS
2. Deal with imbalanced datasets
22. Main challenges and solutions
Main challenges and SOLUTIONS
23. Main challenges and solutions
Main challenges and SOLUTIONS
3. Evaluation of fraud detection models
24. Main challenges and solutions
Main challenges and SOLUTIONS
4. Improving the interpretability of fraud detection models
“if the users do not trust a model or a prediction, they will not use it”
(Ribeiro et al., 2016)
• LIME method
Ribeiro et al. (2016)
• SHAP (SHapley Additive exPlanations) value
Lundberg and Lee (2017)
BUT ... to what extent do we need fraud detection models to be interpretable?
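To make the idea behind SHAP values concrete, the sketch below computes exact Shapley values for one observation of a tiny scoring rule. It is an illustration only, not the SHAP library's implementation: the function `toy_score`, its three feature names and the all-zero baseline are hypothetical, and features outside a coalition are simply set to their baseline value, which is one common simplification.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, x, baseline):
    """Exact Shapley values of predict(x) relative to predict(baseline).

    Assumption of this sketch: a feature that is "absent" from a coalition
    takes its baseline value.
    """
    n = len(x)

    def value(coalition):
        z = [x[j] if j in coalition else baseline[j] for j in range(n)]
        return predict(z)

    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            # Weight of a coalition of this size in the Shapley average.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                phi[i] += weight * (value(s | {i}) - value(s))
    return phi


def toy_score(z):
    """Hypothetical fraud score with an interaction between features 0 and 1."""
    claim_amount, recent_contract, prior_claims = z
    return (0.1 * claim_amount
            + 0.3 * recent_contract * claim_amount
            + 0.2 * prior_claims)


x = [2.0, 1.0, 3.0]
baseline = [0.0, 0.0, 0.0]
phi = shapley_values(toy_score, x, baseline)
print(phi)
# The contributions add up to the gap between the score and the baseline score.
print(sum(phi), toy_score(x) - toy_score(baseline))
```

The enumeration over all coalitions grows exponentially with the number of features, which is why practical tools rely on approximations or model-specific algorithms.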
25. Case studies
26. Case studies Case 1: Insurance fraud detection
27. Case studies Case 1: Insurance fraud detection
General framework
• Fraudulent claims represented 10% of the total number of claims in 2019 (Insurance Europe)
• Negative record for France: €2.5 billion in 2014, of which only €219 million was recovered (ALFA)
28. Case studies Case 1: Insurance fraud detection
Methodology
DATA
• 45,954 home insurance claims over the period 2013 to 2017
• French insurance company
• 0.76% of claims are fraudulent
Technical tools
• Logistic LASSO (Cox, 1958; Tibshirani, 1996)
• Random forest (Breiman, 2001)
• Extreme Gradient Boosting or XGBoost (Chen and Guestrin, 2016)
Resampling techniques to deal with imbalanced data
• Random Oversampling
• Synthetic Minority Oversampling TEchnique or SMOTE (Chawla et al., 2002)
• ADAptive SYNthetic sampling or ADASYN (He et al., 2008)
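The first two resampling ideas listed above can be sketched in a few lines. This is a simplified illustration on hypothetical toy data, not the procedure used in the case study: `random_oversample` duplicates minority rows until the classes are balanced, and `smote_like` imitates the core SMOTE step (interpolating between a minority point and one of its nearest minority neighbours) without the refinements of the published algorithm. ADASYN, which additionally adapts the number of synthetic points to local difficulty, is not covered.

```python
import random


def random_oversample(X, y, minority=1, seed=0):
    """Duplicate minority rows at random until both classes have equal size."""
    rng = random.Random(seed)
    minority_rows = [row for row, label in zip(X, y) if label == minority]
    n_majority = sum(1 for label in y if label != minority)
    n_extra = max(n_majority - len(minority_rows), 0)
    extra = [rng.choice(minority_rows) for _ in range(n_extra)]
    return X + extra, y + [minority] * len(extra)


def smote_like(X, y, n_new, k=2, minority=1, seed=0):
    """Create n_new synthetic minority points by linear interpolation between
    a minority point and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    pts = [row for row, label in zip(X, y) if label == minority]
    synthetic = []
    for _ in range(n_new):
        p = rng.choice(pts)
        # Nearest minority neighbours of p by squared Euclidean distance.
        neighbours = sorted(
            (q for q in pts if q is not p),
            key=lambda q: sum((a - b) ** 2 for a, b in zip(p, q)),
        )[:k]
        q = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from p to q
        synthetic.append([a + gap * (b - a) for a, b in zip(p, q)])
    return synthetic


# Hypothetical toy data: 6 legitimate claims (0) and 3 fraudulent ones (1).
X = [[0.1, 1.0], [0.2, 0.9], [0.3, 1.1], [0.2, 1.2], [0.4, 0.8], [0.3, 0.9],
     [2.0, 3.0], [2.2, 3.1], [1.9, 2.8]]
y = [0, 0, 0, 0, 0, 0, 1, 1, 1]

X_ros, y_ros = random_oversample(X, y)
synthetic = smote_like(X, y, n_new=3)
print(len(y_ros), sum(y_ros))
print(synthetic)
```

Resampling is normally applied to the training data only, so that the evaluation set keeps the original class proportions.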
Performance metrics
• AUC-ROC, AUC-PR, Brier score, Log-Loss, F-measure
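As a rough guide to what these metrics measure, the sketch below computes each of them from scratch on a small, made-up set of labels and predicted fraud probabilities. The data, the 0.5 threshold and the use of average precision as the estimate of AUC-PR are assumptions of this example, not choices reported in the study.

```python
import math


def brier_score(y, p):
    """Mean squared difference between predicted probability and outcome."""
    return sum((pi - yi) ** 2 for yi, pi in zip(y, p)) / len(y)


def log_loss(y, p, eps=1e-15):
    """Average negative log-likelihood, with probabilities clipped to (0, 1)."""
    total = 0.0
    for yi, pi in zip(y, p):
        pi = min(max(pi, eps), 1 - eps)
        total -= yi * math.log(pi) + (1 - yi) * math.log(1 - pi)
    return total / len(y)


def f_measure(y, p, threshold=0.5, beta=1.0):
    """F-beta score after thresholding the predicted probabilities."""
    pred = [1 if pi >= threshold else 0 for pi in p]
    tp = sum(1 for yi, hi in zip(y, pred) if yi == 1 and hi == 1)
    fp = sum(1 for yi, hi in zip(y, pred) if yi == 0 and hi == 1)
    fn = sum(1 for yi, hi in zip(y, pred) if yi == 1 and hi == 0)
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)


def auc_roc(y, p):
    """Share of (fraud, non-fraud) pairs ranked correctly; ties count 0.5."""
    pos = [pi for yi, pi in zip(y, p) if yi == 1]
    neg = [pi for yi, pi in zip(y, p) if yi == 0]
    wins = sum(1.0 if a > b else 0.5 if a == b else 0.0
               for a in pos for b in neg)
    return wins / (len(pos) * len(neg))


def average_precision(y, p):
    """Average precision, one common estimate of the area under the PR curve.
    Assumes at least one positive label."""
    order = sorted(range(len(y)), key=lambda i: p[i], reverse=True)
    hits, total = 0, 0.0
    for rank, i in enumerate(order, start=1):
        if y[i] == 1:
            hits += 1
            total += hits / rank
    return total / hits


# Hypothetical labels (1 = fraud) and predicted fraud probabilities.
y = [0, 0, 0, 0, 0, 0, 0, 1, 0, 1]
p = [0.05, 0.10, 0.20, 0.15, 0.30, 0.02, 0.40, 0.35, 0.08, 0.90]

print("AUC-ROC:", auc_roc(y, p))
print("AUC-PR (average precision):", average_precision(y, p))
print("Brier score:", brier_score(y, p))
print("Log-loss:", log_loss(y, p))
print("F1 at 0.5:", f_measure(y, p))
```

With very few fraudulent cases, AUC-PR and the F-measure react much more strongly to false positives than AUC-ROC does, which is one reason to report several metrics side by side.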
29. Case studies Case 1: Insurance fraud detection
Methodology
30. Case studies Case 1: Insurance fraud detection
• Interpretation of results: SHAP value method (global/individual level)
Figure 1: Fraudulent case
Figure 2: Non-fraudulent case
31. Case studies Case 2: Social fraud detection
32. Case studies Case 2: Social fraud detection
General framework
• Controlling the risks of social and fiscal fraud and combating illegal work are
also important problems for social justice and economic efficiency
• French mutual organization
• systematically collects data from its beneficiaries
• organizes regular controls on a subsample of its taxpayers
• manages a fraud detection system to identify those who do not pay their contributions
33. Case studies Case 2: Social fraud detection
General framework
Objective: Estimate the tax shortfall.
Definition
The tax shortfall is defined as the potential sum of the tax adjustments that could have been imposed on companies that committed fraud or made erroneous social declarations, had they been audited, when in reality they were not.
34. Case studies Case 2: Social fraud detection
Remarks
• the two decisions are neither sequential nor conditional
• the decisions are linked
35. Case studies Case 2: Social fraud detection
36. Case studies Case 2: Social fraud detection
Methodology: Estimation by Maximum Likelihood
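Control decision
$$
C_i =
\begin{cases}
1 & \text{if } C_i^* = X_{c,i}\beta_c + \varepsilon_{c,i} > 0 \\
0 & \text{otherwise}
\end{cases}
\qquad \forall i = 1, \ldots, n \tag{1}
$$
Fraud decision
$$
\widetilde{D}_i =
\begin{cases}
1 & \text{if } D_i^* = X_{d,i}\beta_d + \varepsilon_{d,i} > 0 \\
0 & \text{otherwise}
\end{cases}
\qquad \forall i = 1, \ldots, n \tag{2}
$$
Potential tax shortfall
$$
M_i^* =
\begin{cases}
X_{m,i}\beta_m + \varepsilon_{m,i} & \text{if } \widetilde{D}_i = 1 \\
0 & \text{otherwise}
\end{cases}
\qquad \forall i = 1, \ldots, n \tag{3}
$$
$$
\begin{pmatrix} \varepsilon_{c,i} \\ \varepsilon_{d,i} \\ \varepsilon_{m,i} \end{pmatrix}
\sim N(0, \Sigma)
\quad \text{with} \quad \Sigma = DRD \tag{4}
$$
$$
D = \begin{pmatrix} \sigma_c & 0 & 0 \\ 0 & \sigma_d & 0 \\ 0 & 0 & \sigma_m \end{pmatrix},
\qquad
R = \begin{pmatrix} 1 & \rho_{cd} & \rho_{cm} \\ \rho_{cd} & 1 & \rho_{dm} \\ \rho_{cm} & \rho_{dm} & 1 \end{pmatrix} \tag{5}
$$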
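To see how equations (1) to (5) fit together, the following sketch simulates data from the three-equation system. It does not implement the maximum likelihood estimation. All numerical values (coefficients, standard deviations, correlations) are hypothetical, the same two regressors are used in every equation for simplicity, and the fraud indicator is assumed to follow the same "latent variable above zero" rule as the control indicator.

```python
import math
import random


def cholesky(a):
    """Lower-triangular L with L L' = a, for a small positive definite matrix."""
    n = len(a)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                L[i][j] = math.sqrt(a[i][i] - s)
            else:
                L[i][j] = (a[i][j] - s) / L[j][j]
    return L


def dot(x, beta):
    return sum(xi * bi for xi, bi in zip(x, beta))


def simulate(n, beta_c, beta_d, beta_m, sigma, rho, seed=0):
    """Draw (C_i, D_i, M_i*) for n firms with correlated normal errors."""
    rng = random.Random(seed)
    r_cd, r_cm, r_dm = rho
    R = [[1.0, r_cd, r_cm], [r_cd, 1.0, r_dm], [r_cm, r_dm, 1.0]]
    # Sigma = D R D with D = diag(sigma_c, sigma_d, sigma_m), as in (4)-(5).
    Sigma = [[sigma[i] * R[i][j] * sigma[j] for j in range(3)]
             for i in range(3)]
    L = cholesky(Sigma)
    rows = []
    for _ in range(n):
        x = [1.0, rng.gauss(0, 1)]  # intercept plus one covariate (assumption)
        z = [rng.gauss(0, 1) for _ in range(3)]
        eps = [sum(L[i][k] * z[k] for k in range(i + 1)) for i in range(3)]
        c = 1 if dot(x, beta_c) + eps[0] > 0 else 0           # equation (1)
        d = 1 if dot(x, beta_d) + eps[1] > 0 else 0           # equation (2)
        m = dot(x, beta_m) + eps[2] if d == 1 else 0.0        # equation (3)
        rows.append((c, d, m))
    return rows, Sigma


# Hypothetical parameter values; sigma_c = sigma_d = 1 is a choice of this sketch.
rows, Sigma = simulate(
    n=5000,
    beta_c=[-0.5, 0.8], beta_d=[-1.0, 0.6], beta_m=[10.0, 2.0],
    sigma=(1.0, 1.0, 3.0), rho=(0.3, 0.2, 0.5),
)

# Simulated tax shortfall: adjustments for firms that were not controlled.
shortfall = sum(m for c, d, m in rows if c == 0)
print("share controlled:", sum(c for c, _, _ in rows) / len(rows))
print("share fraudulent:", sum(d for _, d, _ in rows) / len(rows))
print("simulated tax shortfall:", shortfall)
```

In real data the adjustment is observed only for controlled firms, which is why the three equations have to be estimated jointly rather than one at a time.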
37. Conclusion