Securing Financial Transactions:
Credit Card Fraud Detection
Advancing Financial Security and Prevention
Through Machine Learning Innovations
Kamakshi Sharma
Data enthusiast and lifelong learner ✨
Did you know credit card fraud affects millions globally
each year?
This widespread criminal activity leads to financial losses and identity theft
for consumers, while businesses face chargebacks and reputational
damage. Secure financial transactions are the bedrock of trust in today's
digital economy.
This project tackles the critical challenge of credit card fraud detection and
prevention.
Our goal is to develop effective methods using machine learning, anomaly
detection, and deep learning to identify fraudulent activities.
Objective : Enhancing financial transaction security and minimizing
fraudulent losses.
DATASET DESCRIPTION
This project leverages a simulated credit card transaction dataset encompassing the
period from January 1st, 2019, to December 31st, 2020. The data provides valuable
insights into both legitimate and fraudulent transactions, enabling us to develop
robust fraud detection methods.
Key dataset specifications: 1,296,675 rows and 23 columns
The dataset includes these attributes:
Group | Column Names | Description
Transaction Details | trans_date_trans_time, trans_num, unix_time | Transaction date, time, number, and Unix timestamp
Card Information | cc_num | Credit card number
Merchant Details | merchant, category, amt, merch_lat, merch_long | Merchant's information and transaction details
Customer Details | first, last, gender, street, city, state, zip, lat, long, city_pop, job, dob | Customer's information and transaction details
Fraud Indicator | is_fraud | Indicates whether the transaction is fraudulent (1 for fraud, 0 for legitimate)
OVERVIEW
In this project, I aimed to enhance financial transaction security and minimize fraudulent
losses using machine learning, anomaly detection, and deep learning techniques.
I performed exploratory data analysis (EDA) to understand the characteristics of the
dataset and to guide data cleaning, then proceeded with data preprocessing, model
building and evaluation, and improvement of the best model.
I built 4 models using Machine Learning (Logistic Regression and Random Forest),
Anomaly Detection (Isolation Forest), and Deep Learning (a Multi-Layer Perceptron (MLP)
neural network), and evaluated them with several evaluation metrics
(classification report, ROC-AUC score and curve, and Precision-Recall curve).
After comparison, Random Forest emerged as the best fit for the problem statement,
which favors a model with high fraud detection while tolerating some false positives.
To further improve results, I implemented an ensemble combining Random Forest with
Isolation Forest. It leverages the strengths of both models: Random Forest maintains
good performance across classes, while Isolation Forest is designed to identify
outliers (potentially fraudulent transactions).
Overall, this project showcases the effectiveness of various techniques in combating credit
card fraud.
EDA (EXPLORATORY DATA ANALYSIS)
Data Cleaning:
• Removed the columns that are not required for model building.
• No nulls were present; corrected inappropriate data types.
Feature Engineering:
• Created new features as required, e.g. is_fraud_cat for categorical analysis, and
'age', 'trans_month', 'trans_year', 'month_name', etc. for numerical analysis.
Categorical Variable Analysis:
• Visualized transaction categories and gender distribution, both for the entire
dataset and specifically for fraudulent transactions.
• Visualized the top 10 jobs, cities, and states by fraudulent transactions.
Numerical Variable Analysis:
• Visualized overall skewness.
• Class balance: Not Fraud (99.4%), Fraud (0.6%).
• Bivariate analysis (visualization against 'is_fraud'): age groups, latitudinal and
longitudinal distance, and month and year.
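As an illustration of the feature engineering step, the sketch below derives 'age', 'trans_month', 'trans_year', and 'month_name' from a single record using only the standard library. The timestamp and date-of-birth formats, and the helper name derive_features, are assumptions for this example and may not match the project notebook.

```python
from datetime import datetime


def derive_features(trans_date_trans_time: str, dob: str) -> dict:
    """Derive age and calendar features from one transaction record."""
    # Assumed input formats; adjust them to match the actual CSV columns.
    ts = datetime.strptime(trans_date_trans_time, "%Y-%m-%d %H:%M:%S")
    born = datetime.strptime(dob, "%Y-%m-%d")

    # Age in whole years at the time of the transaction.
    had_birthday = (ts.month, ts.day) >= (born.month, born.day)
    age = ts.year - born.year - (0 if had_birthday else 1)

    return {
        "age": age,
        "trans_month": ts.month,
        "trans_year": ts.year,
        "month_name": ts.strftime("%B"),
    }


features = derive_features("2019-03-15 12:30:00", "1988-06-01")
print(features)
```

On a full dataset the same logic would normally be applied column-wise with a dataframe library rather than row by row.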
KEY FINDINGS OF EDA:
Data Quality:
• There are no missing values (nulls) in the dataset, but some data types need correction.
Categorical Variables:
• The shopping_net and grocery_pos categories have the highest number of fraudulent
transactions, despite gas_transport having the most overall transactions.
• Gender distribution is nearly balanced for both overall and fraudulent transactions.
• Top fraudulent transaction jobs include materials engineer, trading standards
officer, and naval architect. Cities with the most fraud are Houston, Warren, and
Huntsville. States with the most fraud are NY, TX, and PA.
Numerical Variables:
• The dataset is imbalanced, with a very small percentage of fraudulent transactions
compared to non-fraudulent ones.
• Age group 20-40 seems to be more targeted by fraudsters. There's a potential
location component to the fraud, with more cases closer to the equator and eastern
hemisphere.
• Most frauds occur in March, May, and February. 2019 has significantly more fraud
cases compared to 2020.
DATA PREPROCESSING
Encoding:
• Converted categorical variables into numerical ones.
• Binary encoding: gender.
• One-hot encoding: transaction category.
Standard Scaling:
• Performed standard scaling to standardize numerical features.
• Ensures all variables are on a similar scale, preventing features with larger
magnitudes from dominating the model.
Oversampling:
• Used to handle the imbalance of the dataset by adding more samples of the
minority class.
• SMOTE (Synthetic Minority Over-sampling Technique): a smarter way to oversample.
Rather than copying existing rows, it creates synthetic samples that are similar to
the existing minority class samples.
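The two numerical steps above can be sketched in plain Python. This is a simplified illustration, not the project's code: standard_scale computes z-scores, and smote_like_sample shows the core idea behind SMOTE (interpolating between a minority sample and one of its nearest minority neighbours). Both function names and the toy data are hypothetical; a real pipeline would more likely rely on a library implementation.

```python
import random
from statistics import mean, pstdev


def standard_scale(values):
    """Rescale values to zero mean and unit variance (z-scores)."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        # A constant feature carries no information; map it to zeros.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


def smote_like_sample(minority, rng, k=2):
    """Create one synthetic point between a minority sample and one of
    its k nearest minority neighbours (the core idea behind SMOTE)."""
    base = rng.choice(minority)
    others = [p for p in minority if p is not base]
    # Sort the remaining minority points by squared Euclidean distance.
    others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
    neighbour = rng.choice(others[:k])
    gap = rng.random()  # position along the segment base -> neighbour
    return [a + gap * (b - a) for a, b in zip(base, neighbour)]


amounts = [10.0, 20.0, 30.0, 40.0]
scaled = standard_scale(amounts)

# Toy minority-class (fraud) samples with two already-scaled features.
fraud_rows = [[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3]]
synthetic = smote_like_sample(fraud_rows, random.Random(42))
print(scaled, synthetic)
```

Because each synthetic point is a convex combination of two real fraud samples, it always lies between them, which is why SMOTE avoids exact duplicates.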
ALGORITHMS USED FOR MODEL BUILDING
Machine Learning Technique
• Logistic Regression:
• Interpretability: Provides straightforward interpretations of coefficients for
understanding feature impact on fraud likelihood.
• Simplicity: Easy implementation and understanding facilitate communication with
stakeholders.
• Random Forest:
• Complex Relationship Capture: Excels at capturing complex data relationships to
detect subtle fraud patterns.
• Minimal Feature Engineering: Requires minimal feature manipulation, suitable for
challenging feature selection scenarios.
Anomaly Detection Technique
• Isolation Forest:
• Efficient Anomaly Detection: Efficiently isolates anomalies (fraudulent transactions) in
high-dimensional data.
• Distribution Agnostic: Robust against various fraud patterns without assuming
specific data distributions.
Deep Learning Technique
• Neural Network (MLP Classifier):
• Nonlinear Pattern Detection: Captures nonlinear data relationships for sophisticated
fraud detection.
• Scalability: Handles large data volumes and adapts to real-time fraud detection needs.
EVALUATION METRICS USED
Classification Report:
• Precision: the proportion of correctly predicted instances of a class out of all
instances predicted as that class.
• Recall: the proportion of correctly predicted instances of a class out of all
instances that truly belong to that class.
• F1-score: the harmonic mean of precision and recall, combining them into a single
value. It gives a balanced measure of how well the model is performing.
• Accuracy: the proportion of correctly classified instances out of the total instances.
ROC-AUC Score:
• Receiver Operating Characteristic (ROC) Area Under Curve (AUC): a measure of the
classifier's ability to distinguish between classes. A higher AUC indicates better
classifier performance.
ROC-AUC Curve:
• Graphical representation of the true positive rate (recall) against the false
positive rate at various threshold settings. It illustrates the trade-off between
true positive rate and false positive rate.
Precision-Recall Curve (PR Curve):
• Graphical representation of the trade-off between precision and recall for
different threshold settings. It helps evaluate classifier performance when classes
are imbalanced.
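To make the classification-report definitions concrete, here is a small sketch that computes them from raw labels. The counts are invented for illustration (1,000 transactions with 0.6% fraud) and are not results from this project; the function name classification_metrics is hypothetical.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Compute precision, recall, F1 and accuracy for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    accuracy = (tp + tn) / len(pairs)
    return {"precision": precision, "recall": recall,
            "f1": f1, "accuracy": accuracy}


# Invented example: 6 frauds among 1,000 transactions.
# The classifier catches 5 frauds and raises 45 false alarms.
y_true = [1] * 6 + [0] * 994
y_pred = [1] * 5 + [0] * 1 + [1] * 45 + [0] * 949
metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

In this toy case accuracy is above 95% even though only one in ten fraud alerts is correct, which is why the fraud-class precision, recall, and PR curve matter more than accuracy on imbalanced data.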
LOGISTIC REGRESSION EVALUATION AND
INFERENCES
Inferences :
• This model achieves an accuracy of 89%, with high precision (1.00) for non-fraudulent
transactions but very low precision (0.04) for fraudulent ones.
• Recall is 0.76 for fraud and 0.89 for non-fraud, so roughly 11% of normal transactions
are wrongly flagged as fraud.
• The F1-scores are 0.94 for non-fraud and 0.07 for fraud; the low fraud F1 reflects the
large gap between precision and recall for fraudulent transactions.
• The ROC-AUC score is 0.9088, indicating good discriminative ability between fraudulent and
normal transactions.
• The ROC-AUC curve shows a good trade-off between TPR and FPR.
• The PR curve shows prioritization of capturing fraud (high recall) at the expense of
misclassifying normal transactions (low precision).
Overall, the model catches most fraud but misclassifies many normal transactions as fraud.
What does Logistic regression do ?
It creates a linear decision boundary by fitting a logistic function to the input features,
separating the data into two classes. It calculates the probability of a data point belonging to a
certain class based on its features.
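A minimal sketch of that probability calculation follows. The weights, bias, and feature values are hypothetical placeholders chosen for illustration, not coefficients from the trained model.

```python
import math


def predict_proba(features, weights, bias):
    """Return P(fraud) as the logistic (sigmoid) of a linear score."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical coefficients for two scaled features (e.g. amount, age).
weights = [1.5, -0.4]
bias = -2.0

low_risk = predict_proba([0.1, 0.5], weights, bias)
high_risk = predict_proba([3.0, 0.5], weights, bias)

# The linear decision boundary is where the probability crosses the
# chosen threshold (0.5 here, which corresponds to a score of z = 0).
print(low_risk, high_risk, high_risk >= 0.5)
```

Lowering the threshold below 0.5 trades precision for recall, which is the trade-off the PR curve visualizes.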
RANDOM FOREST EVALUATION AND INFERENCES
Inferences :
• Achieves an accuracy that rounds to 1.00, meaning nearly all transactions were classified
correctly (this might be due to overfitting on the training data).
• Both precision and recall are high for both fraudulent and non-fraudulent transactions.
• F1-scores are also high for both classes.
• ROC-AUC score (0.9930) suggests excellent discriminative ability between classes.
• ROC Curve: Close to top-left corner, indicating good TPR-FPR trade-off.
• Precision-Recall Curve: Fairly close to top-right corner, indicating good precision-recall
balance.
However, the near-perfect accuracy on the test data raises concerns about potential overfitting
and the model's ability to generalize to unseen data.
What does Random Forest do ?
It constructs multiple decision trees using bootstrapped samples of the dataset and randomly selected
subsets of features. Each tree "votes" on the class of an input, and the final prediction is determined by the
most common class among all trees. This ensemble approach helps capture complex relationships in the
data.
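The two ensemble ingredients described here, bootstrapped samples and majority voting, can be sketched as follows. Tree training itself is omitted, and the per-tree votes are made-up values for illustration.

```python
import random
from collections import Counter


def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement, as each tree's training set."""
    return [rng.choice(rows) for _ in rows]


def majority_vote(votes):
    """Return the most common class among the individual tree votes."""
    return Counter(votes).most_common(1)[0][0]


rows = list(range(10))  # stand-in for transaction row indices
sample = bootstrap_sample(rows, random.Random(0))

# Made-up votes from five trees for one transaction (1 = fraud).
tree_votes = [0, 1, 1, 0, 1]
prediction = majority_vote(tree_votes)
print(sample, prediction)
```

Using an odd number of trees avoids ties in a two-class vote.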
ISOLATION FOREST EVALUATION AND INFERENCES
Inferences :
• Achieves high accuracy (0.97) but with a significant imbalance in precision and recall.
• Very high precision (0.99) for non-fraudulent transactions but extremely low
precision (0.01) for fraudulent ones.
• Recall is also high for non-fraud (0.97) but very low for fraud (0.03).
• F1-score reflects the imbalance (0.98 for non-fraud, 0.01 for fraud).
• Doesn't provide probability predictions, so the ROC curve was not plotted.
• Precision-Recall Curve: PR curve not close to top-right corner, indicating poor
performance.
While it identifies most normal transactions correctly, it struggles to detect fraudulent
transactions.
What does Isolation Forest do ?
It isolates anomalies by recursively partitioning the data into subsets. It randomly selects a feature and a
split value, aiming to isolate outliers quickly. Anomalies are identified as instances that require fewer
partitions to isolate, as they are different from the majority of the data.
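The partition-counting idea can be illustrated with a one-dimensional toy version. This is a simplified sketch: a real Isolation Forest also subsamples the data and picks a random feature at each split, and the amounts below are invented.

```python
import random


def isolation_depth(point, data, rng, depth=0, max_depth=50):
    """Count the random splits needed until `point` is alone."""
    if len(data) <= 1 or depth >= max_depth:
        return depth
    lo, hi = min(data), max(data)
    if lo == hi:
        return depth
    split = rng.uniform(lo, hi)
    # Keep only the side of the split that contains the point.
    if point < split:
        side = [x for x in data if x < split]
    else:
        side = [x for x in data if x >= split]
    return isolation_depth(point, side, rng, depth + 1, max_depth)


def mean_depth(point, data, n_trees=200, seed=0):
    """Average isolation depth over many random partitionings."""
    rng = random.Random(seed)
    return sum(isolation_depth(point, data, rng)
               for _ in range(n_trees)) / n_trees


# Invented amounts: 50 typical values between 10 and 15, one extreme value.
amounts = [10.0 + 0.1 * i for i in range(50)] + [100.0]
outlier_depth = mean_depth(100.0, amounts)
inlier_depth = mean_depth(amounts[25], amounts)
print(outlier_depth, inlier_depth)
```

The extreme value is usually separated by the very first split, while a typical value needs several, so a shorter average depth marks a point as more anomalous.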
NEURAL NETWORK EVALUATION AND INFERENCES
Inferences :
• Achieves high accuracy (0.98), higher than Logistic Regression (0.89).
• High precision (1.00) for non-fraudulent transactions but low precision for
fraud (0.20), although this is higher than Logistic Regression (0.04).
• Recall is high for fraud (0.89) but lower than Random Forest.
• F1-score highlights the class imbalance (0.99 for non-fraud, 0.32 for fraud).
• ROC-AUC score (0.9919) indicates good discriminative ability.
• ROC Curve: Close to top-left corner, confirming good performance.
• Precision-Recall Curve: Reasonably close to top-right corner, suggesting good
precision-recall trade-off.
What does Neural Network (MLP Classifier) do ?
It consists of layers of interconnected neurons that process input data. In the case of MLP Classifier, multiple
layers of neurons process the input through nonlinear activation functions. These layers learn to represent
the data in a hierarchical manner, capturing intricate patterns and relationships. The network adjusts its
weights through backpropagation, minimizing prediction errors during training.
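The forward pass of such a network can be sketched for a single hidden layer. The ReLU activation, the layer sizes, and every weight below are assumptions made for illustration; the training step (backpropagation) is not shown.

```python
import math


def relu(x):
    """Nonlinear activation used in the hidden layer (assumed here)."""
    return max(0.0, x)


def sigmoid(z):
    """Squash the output score into a probability."""
    return 1.0 / (1.0 + math.exp(-z))


def mlp_forward(x, hidden_w, hidden_b, out_w, out_b):
    """One hidden layer followed by a single sigmoid output neuron."""
    hidden = [relu(b + sum(w * xi for w, xi in zip(ws, x)))
              for ws, b in zip(hidden_w, hidden_b)]
    return sigmoid(out_b + sum(w * h for w, h in zip(out_w, hidden)))


# Hypothetical weights: 2 input features, 3 hidden neurons, 1 output.
hidden_w = [[0.5, -0.2], [-0.3, 0.8], [1.0, 1.0]]
hidden_b = [0.0, 0.1, -0.5]
out_w = [1.2, -0.7, 0.9]
out_b = -1.0

fraud_probability = mlp_forward([2.0, 0.5], hidden_w, hidden_b, out_w, out_b)
print(fraud_probability)
```

During training, backpropagation would adjust hidden_w, hidden_b, out_w, and out_b to reduce the error between this output and the true label.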
MODELS COMPARISON
Selecting Best Model
Considering the importance of maximizing
fraud detection while tolerating some false
positives, Random Forest emerges as a
promising choice.
Overall Conclusion
• All models achieved high overall
accuracy, but Random Forest and MLP
might be overfitting on the training
data.
• Logistic Regression and MLP struggle
with precision for fraudulent
transactions, while Random Forest
offers a more balanced approach.
• Isolation Forest excels at identifying
normal transactions but fails to capture
most fraudulent ones.
Hence, Best Model out of these 4:
Random Forest
ENSEMBLE METHOD - RANDOM FOREST & ISOLATION FOREST
Because Random Forest might be overfitting, it was combined with Isolation Forest:
• Random Forest maintains good performance in fraud detection and normal transaction
classification.
• Isolation Forest is designed to identify outliers, potentially fraudulent transactions, that
Random Forest might miss.
By combining them, a wider range of fraudulent activities can be captured.
Evaluation:
Final Classification Report (Random Forest + Isolation Forest):
• Achieves an accuracy of 0.97, which may indicate less overfitting
compared to Random Forest alone.
• Lower precision (0.15) for fraudulent transactions but higher
recall (0.80) compared to Random Forest. This means it flags more
legitimate transactions as fraud but catches a larger share of actual fraud.
Inferences:
• The ensemble method shows promising results, achieving high
accuracy and improved recall for fraudulent transactions.
• By leveraging the strengths of both Random Forest and Isolation
Forest, a more comprehensive fraud detection system is
established.
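The slides do not state how the two models' outputs were merged. One plausible rule, assumed here purely for illustration, is a logical OR: a transaction is flagged if Random Forest predicts fraud or Isolation Forest marks it as an anomaly. The sketch assumes the scikit-learn convention in which IsolationForest.predict returns -1 for anomalies and 1 for normal points; the prediction lists are made up.

```python
def combine_predictions(rf_pred, iso_pred):
    """OR-combine two detectors.

    rf_pred:  1 = fraud, 0 = legitimate (classifier output)
    iso_pred: -1 = anomaly, 1 = normal (scikit-learn IsolationForest
              convention, assumed here)
    """
    return [1 if rf == 1 or iso == -1 else 0
            for rf, iso in zip(rf_pred, iso_pred)]


# Made-up outputs for four transactions.
rf_pred = [0, 1, 0, 0]
iso_pred = [1, 1, -1, 1]
final_pred = combine_predictions(rf_pred, iso_pred)
print(final_pred)
```

An OR rule can only add fraud flags, never remove them, which would be consistent with the higher recall and lower precision reported for the ensemble.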
CONCLUSION
While Random Forest performs well on its own, the Ensemble Method (Random Forest
+ Isolation Forest) seems to be a better choice for credit card fraud detection in this case
as it offers:
• Reduced Overfitting Risk
• Improved Fraud Detection
This analysis explored various machine learning models for credit card fraud detection.
The ensemble method combining Random Forest and Isolation Forest emerged as the
most promising choice due to its balanced performance, reduced overfitting risk,
and improved fraud detection capabilities.
GitHub Link:
For further details and access to the project code, visit my GitHub
repository:
Project_Fraud_Detection.ipynb
REAL-TIME IMPLEMENTATION CHALLENGES
Challenges
Model Interpretability:
• Explanation of model decisions is crucial for compliance.
• Complex models may lack interpretability.
Computational Efficiency:
• Real-time systems require fast inference.
• Complex models may cause latency issues.
Handling Concept Drift:
• Fraud patterns change over time, leading to concept drift.
• Models must adapt to maintain effectiveness.
Considerations
Model Explainability:
• Use interpretable models alongside complex ones.
• Implement techniques like SHAP values.
Computational Optimization:
• Optimize model architecture and feature engineering.
• Use model compression techniques.
Conclusion
Real-time implementation of fraud detection models poses challenges related to
interpretability, computational efficiency, and concept drift. By addressing these
challenges and applying the considerations above, organizations can deploy effective
fraud detection systems in real-time payment processing environments.
Detecting Credit Card Fraud: An AI-driven Approach

More Related Content

Similar to Detecting Credit Card Fraud: An AI-driven Approach

MIS637_Final_Project_Rahul_Bhatia
MIS637_Final_Project_Rahul_BhatiaMIS637_Final_Project_Rahul_Bhatia
MIS637_Final_Project_Rahul_Bhatia
Rahul Bhatia
 
network layer service models forwarding versus routing how a router works rou...
network layer service models forwarding versus routing how a router works rou...network layer service models forwarding versus routing how a router works rou...
network layer service models forwarding versus routing how a router works rou...
Ashish Gupta
 
MSI Value Proposition v2.2 (4-2-15)
MSI Value Proposition v2.2 (4-2-15)MSI Value Proposition v2.2 (4-2-15)
MSI Value Proposition v2.2 (4-2-15)
Joe Passafiume
 
network layer service models forwarding versus routing how a router works rou...
network layer service models forwarding versus routing how a router works rou...network layer service models forwarding versus routing how a router works rou...
network layer service models forwarding versus routing how a router works rou...
Ashish Gupta
 

Similar to Detecting Credit Card Fraud: An AI-driven Approach (20)

A Review of deep learning techniques in detection of anomaly incredit card tr...
A Review of deep learning techniques in detection of anomaly incredit card tr...A Review of deep learning techniques in detection of anomaly incredit card tr...
A Review of deep learning techniques in detection of anomaly incredit card tr...
 
A Novel Framework for Credit Card.
A Novel Framework for Credit Card.A Novel Framework for Credit Card.
A Novel Framework for Credit Card.
 
Machine Learning-Based Approaches for Fraud Detection in Credit Card Transact...
Machine Learning-Based Approaches for Fraud Detection in Credit Card Transact...Machine Learning-Based Approaches for Fraud Detection in Credit Card Transact...
Machine Learning-Based Approaches for Fraud Detection in Credit Card Transact...
 
MIS637_Final_Project_Rahul_Bhatia
MIS637_Final_Project_Rahul_BhatiaMIS637_Final_Project_Rahul_Bhatia
MIS637_Final_Project_Rahul_Bhatia
 
Machine_Learning.pptx
Machine_Learning.pptxMachine_Learning.pptx
Machine_Learning.pptx
 
network layer service models forwarding versus routing how a router works rou...
network layer service models forwarding versus routing how a router works rou...network layer service models forwarding versus routing how a router works rou...
network layer service models forwarding versus routing how a router works rou...
 
Alex Korbonits, "AUC at what costs?" Seattle DAML June 2016
Alex Korbonits, "AUC at what costs?" Seattle DAML June 2016Alex Korbonits, "AUC at what costs?" Seattle DAML June 2016
Alex Korbonits, "AUC at what costs?" Seattle DAML June 2016
 
CREDIT CARD FRAUD DETECTION USING MACHINE LEARNING
CREDIT CARD FRAUD DETECTION USING MACHINE LEARNINGCREDIT CARD FRAUD DETECTION USING MACHINE LEARNING
CREDIT CARD FRAUD DETECTION USING MACHINE LEARNING
 
Serano
SeranoSerano
Serano
 
MSI Value Proposition v2.2 (4-2-15)
MSI Value Proposition v2.2 (4-2-15)MSI Value Proposition v2.2 (4-2-15)
MSI Value Proposition v2.2 (4-2-15)
 
Tanvi_Sharma_Shruti_Garg_pre.pdf.pdf
Tanvi_Sharma_Shruti_Garg_pre.pdf.pdfTanvi_Sharma_Shruti_Garg_pre.pdf.pdf
Tanvi_Sharma_Shruti_Garg_pre.pdf.pdf
 
IRJET- Fraud Detection Algorithms for a Credit Card
IRJET- Fraud Detection Algorithms for a Credit CardIRJET- Fraud Detection Algorithms for a Credit Card
IRJET- Fraud Detection Algorithms for a Credit Card
 
MACHINE LEARNING ALGORITHMS FOR CREDIT CARD FRAUD DETECTION
MACHINE LEARNING ALGORITHMS FOR CREDIT CARD FRAUD DETECTIONMACHINE LEARNING ALGORITHMS FOR CREDIT CARD FRAUD DETECTION
MACHINE LEARNING ALGORITHMS FOR CREDIT CARD FRAUD DETECTION
 
Insurance Churn Prediction Data Analysis Project
Insurance Churn Prediction Data Analysis ProjectInsurance Churn Prediction Data Analysis Project
Insurance Churn Prediction Data Analysis Project
 
network layer service models forwarding versus routing how a router works rou...
network layer service models forwarding versus routing how a router works rou...network layer service models forwarding versus routing how a router works rou...
network layer service models forwarding versus routing how a router works rou...
 
Credit card fraud detection
Credit card fraud detectionCredit card fraud detection
Credit card fraud detection
 
CREDIT_CARD.ppt
CREDIT_CARD.pptCREDIT_CARD.ppt
CREDIT_CARD.ppt
 
Credit card fraud dection
Credit card fraud dectionCredit card fraud dection
Credit card fraud dection
 
ML & Graph algorithms to prevent financial crime in digital payments
ML & Graph  algorithms to prevent  financial crime in  digital paymentsML & Graph  algorithms to prevent  financial crime in  digital payments
ML & Graph algorithms to prevent financial crime in digital payments
 
Churn in the Telecommunications Industry
Churn in the Telecommunications IndustryChurn in the Telecommunications Industry
Churn in the Telecommunications Industry
 

More from Boston Institute of Analytics

More from Boston Institute of Analytics (20)

Predicting Power Consumption for a Greener Tomorrow: Machine Learning Project...
Predicting Power Consumption for a Greener Tomorrow: Machine Learning Project...Predicting Power Consumption for a Greener Tomorrow: Machine Learning Project...
Predicting Power Consumption for a Greener Tomorrow: Machine Learning Project...
 
Data Analysis Project Presentation : NYC Shooting Cluster Analysis
Data Analysis Project Presentation : NYC Shooting Cluster AnalysisData Analysis Project Presentation : NYC Shooting Cluster Analysis
Data Analysis Project Presentation : NYC Shooting Cluster Analysis
 
Credit Card Fraud Detection: Safeguarding Transactions in the Digital Age
Credit Card Fraud Detection: Safeguarding Transactions in the Digital AgeCredit Card Fraud Detection: Safeguarding Transactions in the Digital Age
Credit Card Fraud Detection: Safeguarding Transactions in the Digital Age
 
Sensing the Future: Anomaly Detection and Event Prediction in Sensor Networks
Sensing the Future: Anomaly Detection and Event Prediction in Sensor NetworksSensing the Future: Anomaly Detection and Event Prediction in Sensor Networks
Sensing the Future: Anomaly Detection and Event Prediction in Sensor Networks
 
Predictive Precipitation: Advanced Rain Forecasting Techniques
Predictive Precipitation: Advanced Rain Forecasting TechniquesPredictive Precipitation: Advanced Rain Forecasting Techniques
Predictive Precipitation: Advanced Rain Forecasting Techniques
 
Unveiling the Market: Predicting House Prices with Data Science
Unveiling the Market: Predicting House Prices with Data ScienceUnveiling the Market: Predicting House Prices with Data Science
Unveiling the Market: Predicting House Prices with Data Science
 
Beyond Thumbs Up/Down: Using AI to Analyze Movie Reviews
Beyond Thumbs Up/Down: Using AI to Analyze Movie ReviewsBeyond Thumbs Up/Down: Using AI to Analyze Movie Reviews
Beyond Thumbs Up/Down: Using AI to Analyze Movie Reviews
 
Fuel Efficiency Forecast: Predictive Analytics for a Greener Automotive Future
Fuel Efficiency Forecast: Predictive Analytics for a Greener Automotive FutureFuel Efficiency Forecast: Predictive Analytics for a Greener Automotive Future
Fuel Efficiency Forecast: Predictive Analytics for a Greener Automotive Future
 
Unveiling the Patterns: A Cluster Analysis of NYC Shootings
Unveiling the Patterns: A Cluster Analysis of NYC ShootingsUnveiling the Patterns: A Cluster Analysis of NYC Shootings
Unveiling the Patterns: A Cluster Analysis of NYC Shootings
 
Enhancing Cybersecurity: An In-depth Analysis of Travelblog.org
Enhancing Cybersecurity: An In-depth Analysis of Travelblog.orgEnhancing Cybersecurity: An In-depth Analysis of Travelblog.org
Enhancing Cybersecurity: An In-depth Analysis of Travelblog.org
 
Exploring Web Security Threats: A Practical Study on SQL Injection and CSRF
Exploring Web Security Threats: A Practical Study on SQL Injection and CSRFExploring Web Security Threats: A Practical Study on SQL Injection and CSRF
Exploring Web Security Threats: A Practical Study on SQL Injection and CSRF
 
Detecting Credit Card Fraud: A Machine Learning Approach
Detecting Credit Card Fraud: A Machine Learning ApproachDetecting Credit Card Fraud: A Machine Learning Approach
Detecting Credit Card Fraud: A Machine Learning Approach
 
Predicting House Prices: A Machine Learning Approach
Predicting House Prices: A Machine Learning ApproachPredicting House Prices: A Machine Learning Approach
Predicting House Prices: A Machine Learning Approach
 
Predicting Loan Approval: A Data Science Project
Predicting Loan Approval: A Data Science ProjectPredicting Loan Approval: A Data Science Project
Predicting Loan Approval: A Data Science Project
 
Decoding Loan Approval with Predictive Modeling in Action Discovering Weaknes...
Decoding Loan Approval with Predictive Modeling in Action Discovering Weaknes...Decoding Loan Approval with Predictive Modeling in Action Discovering Weaknes...
Decoding Loan Approval with Predictive Modeling in Action Discovering Weaknes...
 
HTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation StrategiesHTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation Strategies
 
E-Commerce Order PredictionShraddha Kamble.pptx
E-Commerce Order PredictionShraddha Kamble.pptxE-Commerce Order PredictionShraddha Kamble.pptx
E-Commerce Order PredictionShraddha Kamble.pptx
 
NLP Based project presentation: Analyzing Automobile Prices
NLP Based project presentation: Analyzing Automobile PricesNLP Based project presentation: Analyzing Automobile Prices
NLP Based project presentation: Analyzing Automobile Prices
 
Decoding Loan Approval: Predictive Modeling in Action
Decoding Loan Approval: Predictive Modeling in ActionDecoding Loan Approval: Predictive Modeling in Action
Decoding Loan Approval: Predictive Modeling in Action
 
Analyzing Movie Reviews : Machine learning project
Analyzing Movie Reviews : Machine learning projectAnalyzing Movie Reviews : Machine learning project
Analyzing Movie Reviews : Machine learning project
 

Recently uploaded

一比一原版(ucla文凭证书)加州大学洛杉矶分校毕业证学历认证官方成绩单
一比一原版(ucla文凭证书)加州大学洛杉矶分校毕业证学历认证官方成绩单一比一原版(ucla文凭证书)加州大学洛杉矶分校毕业证学历认证官方成绩单
一比一原版(ucla文凭证书)加州大学洛杉矶分校毕业证学历认证官方成绩单
aqpto5bt
 
NO1 Best Kala Jadu Expert Specialist In Germany Kala Jadu Expert Specialist I...
NO1 Best Kala Jadu Expert Specialist In Germany Kala Jadu Expert Specialist I...NO1 Best Kala Jadu Expert Specialist In Germany Kala Jadu Expert Specialist I...
NO1 Best Kala Jadu Expert Specialist In Germany Kala Jadu Expert Specialist I...
Amil baba
 
Data Analytics for Digital Marketing Lecture for Advanced Digital & Social Me...
Data Analytics for Digital Marketing Lecture for Advanced Digital & Social Me...Data Analytics for Digital Marketing Lecture for Advanced Digital & Social Me...
Data Analytics for Digital Marketing Lecture for Advanced Digital & Social Me...
Valters Lauzums
 
一比一原版(Monash毕业证书)莫纳什大学毕业证原件一模一样
一比一原版(Monash毕业证书)莫纳什大学毕业证原件一模一样一比一原版(Monash毕业证书)莫纳什大学毕业证原件一模一样
一比一原版(Monash毕业证书)莫纳什大学毕业证原件一模一样
yhavx
 
如何办理(Dalhousie毕业证书)达尔豪斯大学毕业证成绩单留信学历认证
如何办理(Dalhousie毕业证书)达尔豪斯大学毕业证成绩单留信学历认证如何办理(Dalhousie毕业证书)达尔豪斯大学毕业证成绩单留信学历认证
如何办理(Dalhousie毕业证书)达尔豪斯大学毕业证成绩单留信学历认证
zifhagzkk
 
Audience Researchndfhcvnfgvgbhujhgfv.pptx
Audience Researchndfhcvnfgvgbhujhgfv.pptxAudience Researchndfhcvnfgvgbhujhgfv.pptx
Audience Researchndfhcvnfgvgbhujhgfv.pptx
Stephen266013
 
Abortion Clinic in Kempton Park +27791653574 WhatsApp Abortion Clinic Service...
Abortion Clinic in Kempton Park +27791653574 WhatsApp Abortion Clinic Service...Abortion Clinic in Kempton Park +27791653574 WhatsApp Abortion Clinic Service...
Abortion Clinic in Kempton Park +27791653574 WhatsApp Abortion Clinic Service...
mikehavy0
 

Recently uploaded (20)

How to Transform Clinical Trial Management with Advanced Data Analytics
How to Transform Clinical Trial Management with Advanced Data AnalyticsHow to Transform Clinical Trial Management with Advanced Data Analytics
How to Transform Clinical Trial Management with Advanced Data Analytics
 
Aggregations - The Elasticsearch "GROUP BY"
Aggregations - The Elasticsearch "GROUP BY"Aggregations - The Elasticsearch "GROUP BY"
Aggregations - The Elasticsearch "GROUP BY"
 
Bios of leading Astrologers & Researchers
Bios of leading Astrologers & ResearchersBios of leading Astrologers & Researchers
Bios of leading Astrologers & Researchers
 
Identify Rules that Predict Patient’s Heart Disease - An Application of Decis...
Identify Rules that Predict Patient’s Heart Disease - An Application of Decis...Identify Rules that Predict Patient’s Heart Disease - An Application of Decis...
Identify Rules that Predict Patient’s Heart Disease - An Application of Decis...
 
What is Insertion Sort. Its basic information
What is Insertion Sort. Its basic informationWhat is Insertion Sort. Its basic information
What is Insertion Sort. Its basic information
 
Identify Customer Segments to Create Customer Offers for Each Segment - Appli...
Identify Customer Segments to Create Customer Offers for Each Segment - Appli...Identify Customer Segments to Create Customer Offers for Each Segment - Appli...
Identify Customer Segments to Create Customer Offers for Each Segment - Appli...
 
一比一原版(ucla文凭证书)加州大学洛杉矶分校毕业证学历认证官方成绩单
一比一原版(ucla文凭证书)加州大学洛杉矶分校毕业证学历认证官方成绩单一比一原版(ucla文凭证书)加州大学洛杉矶分校毕业证学历认证官方成绩单
一比一原版(ucla文凭证书)加州大学洛杉矶分校毕业证学历认证官方成绩单
 
Las implicancias del memorándum de entendimiento entre Codelco y SQM según la...
Las implicancias del memorándum de entendimiento entre Codelco y SQM según la...Las implicancias del memorándum de entendimiento entre Codelco y SQM según la...
Las implicancias del memorándum de entendimiento entre Codelco y SQM según la...
 
社内勉強会資料_Object Recognition as Next Token Prediction
社内勉強会資料_Object Recognition as Next Token Prediction社内勉強会資料_Object Recognition as Next Token Prediction
社内勉強会資料_Object Recognition as Next Token Prediction
 
NO1 Best Kala Jadu Expert Specialist In Germany Kala Jadu Expert Specialist I...
NO1 Best Kala Jadu Expert Specialist In Germany Kala Jadu Expert Specialist I...NO1 Best Kala Jadu Expert Specialist In Germany Kala Jadu Expert Specialist I...
NO1 Best Kala Jadu Expert Specialist In Germany Kala Jadu Expert Specialist I...
 
Jual Obat Aborsi Bandung (Asli No.1) Wa 082134680322 Klinik Obat Penggugur Ka...

Detecting Credit Card Fraud: An AI-driven Approach

  • 1. Securing Financial Transactions: Credit Card Fraud Detection Advancing Financial Security and Prevention Through Machine Learning Innovations Kamakshi Sharma Data enthusiast and lifelong learner ✨
  • 2. Did you know credit card fraud affects millions globally each year? This widespread criminal activity leads to financial losses and identity theft for consumers, while businesses face chargebacks and reputational damage. Secure financial transactions are the bedrock of trust in today's digital economy. This project tackles the critical challenge of credit card fraud detection and prevention. Our goal is to develop effective methods using machine learning, anomaly detection, and deep learning to identify fraudulent activities. Objective: Enhancing financial transaction security and minimizing fraudulent losses.
  • 3. DATASET DESCRIPTION This project leverages a simulated credit card transaction dataset covering the period from January 1st, 2019, to December 31st, 2020. The data provides valuable insights into both legitimate and fraudulent transactions, enabling us to develop robust fraud detection methods. Key dataset specifications: 1,296,675 rows and 23 columns. The dataset includes these attributes (group: column names, then description): • Transaction Details: trans_date_trans_time, trans_num, unix_time (transaction date, time, number, and Unix timestamp) • Card Information: cc_num (credit card number) • Merchant Details: merchant, category, amt, merch_lat, merch_long (merchant's information and transaction details) • Customer Details: first, last, gender, street, city, state, zip, lat, long, city_pop, job, dob (customer's information and transaction details) • Fraud Indicator: is_fraud (indicates whether the transaction is fraudulent: 1 for fraud, 0 for legitimate)
  • 4. OVERVIEW In this project, I aimed to enhance financial transaction security and minimize fraudulent losses using machine learning, anomaly detection, and deep learning techniques. I performed exploratory data analysis (EDA) to understand the characteristics of the dataset and to guide data cleaning, then proceeded with data preprocessing, model building and evaluation, and improvement of the best model. I built four models: Logistic Regression and Random Forest (machine learning), Isolation Forest (anomaly detection), and a Neural Network (MLP, Multi-Layer Perceptron) (deep learning), and evaluated their performance using several evaluation metrics (classification report, ROC-AUC score and curve, and precision-recall curve). After comparison, Random Forest emerged as the best choice for the problem statement, which prioritizes high fraud detection while tolerating some false positives. To further enhance results, I implemented an ensemble model combining Random Forest with Isolation Forest: Random Forest maintains good performance across classes, while Isolation Forest excels at identifying outliers (potentially fraudulent transactions). Overall, this project showcases the effectiveness of various techniques in combating credit card fraud.
  • 5. EDA (EXPLORATORY DATA ANALYSIS) Data Cleaning: • Removed the columns not required for model building • Found no nulls and corrected inappropriate data types. Feature Engineering: created new features as required, • e.g. is_fraud_cat for categorical analysis • and 'age', 'trans_month', 'trans_year', 'month_name', etc. for numerical analysis. Categorical Variable Analysis, visualized: • Transaction categories and gender distribution, both for the entire dataset and specifically for fraudulent transactions • Top 10 jobs, cities, and states by number of fraudulent transactions. Numerical Variable Analysis, visualized: • Overall skewness • Class balance: Not Fraud (99.4%), Fraud (0.6%). Bivariate Analysis, visualization against 'is_fraud': • age groups • latitudinal and longitudinal distance • month and year.
  • 6. KEY FINDINGS OF EDA Data Quality: • There are no missing values (nulls) in the dataset, but some data types needed correction. Categorical Variables: • The shopping_net and grocery_pos categories have the highest number of fraudulent transactions, even though gas_transport has the most transactions overall. • Gender distribution is nearly balanced for both overall and fraudulent transactions. • Jobs with the most fraudulent transactions include materials engineer, trading standards officer, and naval architect. Cities with the most fraud are Houston, Warren, and Huntsville. States with the most fraud are NY, TX, and PA. Numerical Variables: • The dataset is imbalanced, with a very small percentage of fraudulent transactions compared to non-fraudulent ones. • The 20-40 age group seems to be targeted more by fraudsters. • There is a potential location component to the fraud, with more cases closer to the equator and in the eastern hemisphere. • Most frauds occur in March, May, and February. 2019 has significantly more fraud cases than 2020.
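The class imbalance noted above is why accuracy alone is a weak yardstick for this problem. As a rough illustration, assuming a made-up batch of 1,000 transactions at the reported 99.4% / 0.6% split, a "model" that never predicts fraud still scores very high accuracy while catching nothing:

```python
# Illustrative counts only: 1,000 transactions at the reported 99.4% / 0.6% split.
n_legit, n_fraud = 994, 6

# A trivial "model" that labels every transaction as legitimate.
true_negatives = n_legit    # every legitimate transaction is labelled correctly
false_negatives = n_fraud   # every fraudulent transaction is missed
true_positives = 0

accuracy = (true_positives + true_negatives) / (n_legit + n_fraud)
fraud_recall = true_positives / (true_positives + false_negatives)

print(f"accuracy = {accuracy:.3f}, fraud recall = {fraud_recall:.2f}")
```

This is the reason the per-class precision, recall, and F1-score are reported for each model rather than accuracy alone.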
  • 7. DATA PREPROCESSING Encoding: converted categorical variables into numerical ones. • Binary Encoding: Gender • One-Hot Encoding: Transaction Category. Standard Scaling: performed standard scaling to normalize numerical features. This ensures all variables are on a similar scale, preventing features with larger magnitudes from dominating the model. Oversampling: used to handle the imbalance of the dataset by adding more samples of the minority class to balance it. • SMOTE (Synthetic Minority Over-sampling Technique): a smarter way to oversample; it creates synthetic samples that are similar to the existing minority class samples.
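The three preprocessing steps can be sketched in plain Python. This is a simplified illustration, not the project's code: the helper names (`one_hot`, `standard_scale`, `smote_like`) and the sample values are made up, and the oversampler interpolates toward the single nearest minority neighbour, whereas SMOTE proper picks one of k nearest neighbours at random.

```python
import random
from statistics import mean, pstdev

def one_hot(value, categories):
    """Return a 0/1 vector with a single 1 at the position of `value`."""
    return [1 if value == c else 0 for c in categories]

def standard_scale(values):
    """Shift to zero mean and rescale to unit (population) standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

def smote_like(minority, n_new, rng):
    """Create n_new synthetic points, each lying on the segment between a
    random minority sample and its nearest minority neighbour."""
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        neighbour = min(
            (p for p in minority if p is not base),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)),
        )
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return synthetic

categories = ["gas_transport", "grocery_pos", "shopping_net"]
encoded = one_hot("grocery_pos", categories)

scaled = standard_scale([10.0, 20.0, 30.0])  # hypothetical transaction amounts

minority = [[1.0, 1.0], [1.5, 1.2], [3.0, 3.0]]  # hypothetical fraud samples
new_points = smote_like(minority, 4, random.Random(0))
```

Because each synthetic point is an interpolation between two real minority samples, it stays inside the region already occupied by the fraud class, which is what distinguishes this from simply duplicating rows.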
  • 8. ALGORITHMS USED FOR MODEL BUILDING Machine Learning Technique • Logistic Regression: • Interpretability: Provides straightforward interpretations of coefficients for understanding feature impact on fraud likelihood. • Simplicity: Easy implementation and understanding facilitate communication with stakeholders. • Random Forest: • Complex Relationship Capture: Excels at capturing complex data relationships to detect subtle fraud patterns. • Minimal Feature Engineering: Requires minimal feature manipulation, suitable for challenging feature selection scenarios. Anomaly Detection Technique • Isolation Forest: • Efficient Anomaly Detection: Efficiently isolates anomalies (fraudulent transactions) in high-dimensional data. • Distribution Agnostic: Robust against various fraud patterns without assuming specific data distributions. Deep Learning Technique • Neural Network (MLP Classifier): • Nonlinear Pattern Detection: Captures nonlinear data relationships for sophisticated fraud detection. • Scalability: Handles large data volumes and adapts to real-time fraud detection needs.
  • 9. EVALUATION METRICS USED Classification Report • Precision: The proportion of correctly predicted instances of a class out of all instances predicted as that class. • Recall: The proportion of correctly predicted instances of a class out of all instances that truly belong to that class. • F1-score: The harmonic mean of precision and recall, combining them into a single value. It gives a balanced measure of how well the model is performing. • Accuracy: The proportion of correctly classified instances out of the total instances. ROC-AUC Score: • Receiver Operating Characteristic (ROC) Area Under Curve (AUC): A measure of the classifier's ability to distinguish between classes. A higher AUC indicates better classifier performance. ROC Curve: • Graphical representation of the true positive rate (recall) against the false positive rate at various threshold settings. It illustrates the trade-off between true positive rate and false positive rate. Precision-Recall Curve (PR Curve): • Graphical representation of the trade-off between precision and recall for different threshold settings. It helps evaluate classifier performance when classes are imbalanced.
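The definitions above can be written out directly. The following is a minimal sketch with made-up confusion-matrix counts and scores; the function names are illustrative, and the ROC-AUC is computed from its ranking interpretation (the chance that a random positive is scored above a random negative) rather than by integrating the curve.

```python
def classification_metrics(tp, fp, fn, tn):
    """Precision, recall, F1 and accuracy for the positive (fraud) class."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return precision, recall, f1, accuracy

def roc_auc(labels, scores):
    """Fraction of (positive, negative) pairs in which the positive has the
    higher score; ties count as half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical counts: 8 frauds caught, 12 false alarms, 2 frauds missed.
precision, recall, f1, accuracy = classification_metrics(tp=8, fp=12, fn=2, tn=978)

# Hypothetical labels and model scores for four transactions.
auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
```

With these counts the accuracy is 0.986 even though precision for fraud is only 0.40, which mirrors the pattern seen in the model results that follow.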
  • 10. LOGISTIC REGRESSION EVALUATION AND INFERENCES Inferences: • This model achieves an accuracy of 89%, with high precision (1.00) for non-fraudulent transactions but very low precision (0.04) for fraudulent ones. • Recall is 0.76 for fraud and 0.89 for non-fraud, so about 11% of normal transactions are wrongly flagged as fraud. • The F1-scores are 0.94 for non-fraud and 0.07 for fraud, reflecting the large gap between precision and recall for fraudulent transactions. • The ROC-AUC score is 0.9088, indicating good discriminative ability between fraudulent and normal transactions. • The ROC curve shows a good trade-off between TPR and FPR. • The PR curve shows that the model prioritizes capturing fraud (high recall) at the expense of misclassifying normal transactions (low precision). Overall, the model catches most fraud but misclassifies many normal transactions as fraud. What does Logistic Regression do? It creates a linear decision boundary by fitting a logistic function to the input features, separating the data into two classes. It calculates the probability of a data point belonging to a certain class based on its features.
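As a small illustration of how logistic regression turns features into a probability, the sketch below applies the logistic (sigmoid) function to a weighted sum. The weights, bias, and feature values are invented for the example and are not taken from the trained model.

```python
import math

def fraud_probability(features, weights, bias):
    """Logistic function applied to a linear combination of the features."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))

# Made-up coefficients for two scaled features.
weights, bias = [1.2, -0.8], -0.5

# z = -0.5 + 1.2 * 2.0 - 0.8 * 0.5 = 1.5
p = fraud_probability([2.0, 0.5], weights, bias)

# The linear decision boundary is where z = 0, i.e. where p = 0.5.
is_fraud = p >= 0.5
print(f"p(fraud) = {p:.4f}, flagged = {is_fraud}")
```

Lowering the 0.5 threshold flags more transactions, which raises recall for fraud and lowers precision; this is the trade-off traced out by the PR curve.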
  • 11. RANDOM FOREST EVALUATION AND INFERENCES Inferences: • Achieves near-perfect accuracy (reported as 1.00), meaning it classified nearly all transactions correctly (this might be due to overfitting on the training data). • Both precision and recall are high for both fraudulent and non-fraudulent transactions. • F1-scores are also high for both classes. • ROC-AUC score (0.9930) suggests excellent discriminative ability between classes. • ROC Curve: Close to the top-left corner, indicating a good TPR-FPR trade-off. • Precision-Recall Curve: Fairly close to the top-right corner, indicating a good precision-recall balance. However, the near-perfect accuracy on the test data raises concerns about potential overfitting and the model's ability to generalize to unseen data. What does Random Forest do? It constructs multiple decision trees using bootstrapped samples of the dataset and randomly selected subsets of features. Each tree "votes" on the class of an input, and the final prediction is determined by the most common class among all trees. This ensemble approach helps capture complex relationships in the data.
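The two ingredients described above, bootstrapped samples and majority voting, can be sketched as follows. The three "trees" are hand-written one-rule stand-ins, and the `hour` feature and all thresholds are hypothetical; a real forest learns its trees from the bootstrapped data.

```python
import random
from collections import Counter

def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement, as done for each tree."""
    return [rng.choice(rows) for _ in rows]

def majority_vote(tree_predictions):
    """Return the most common class among the trees' predictions."""
    return Counter(tree_predictions).most_common(1)[0][0]

# Stand-in "trees": each is a single made-up rule returning 1 (fraud) or 0.
trees = [
    lambda t: int(t["amt"] > 500),
    lambda t: int(t["category"] == "shopping_net"),
    lambda t: int(t["hour"] >= 22),
]

transaction = {"amt": 900.0, "category": "grocery_pos", "hour": 23}
votes = [tree(transaction) for tree in trees]
prediction = majority_vote(votes)

rows = list(range(10))
sample = bootstrap_sample(rows, random.Random(0))
```

Here two of the three rules vote for fraud, so the forest's prediction is fraud even though one tree disagrees.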
  • 12. ISOLATION FOREST EVALUATION AND INFERENCES Inferences: • Achieves high accuracy (0.97) but with a significant imbalance in precision and recall. • Very high precision (0.99) for non-fraudulent transactions but extremely low precision (0.01) for fraudulent ones. • Recall is also high for non-fraud (0.97) but very low for fraud (0.03). • F1-score reflects the imbalance (0.98 for non-fraud, 0.01 for fraud). • Does not output class probabilities, so no ROC curve was plotted. • Precision-Recall Curve: The PR curve is not close to the top-right corner, indicating poor performance. While it identifies most normal transactions correctly, it struggles to detect fraudulent transactions. What does Isolation Forest do? It isolates anomalies by recursively partitioning the data into subsets. It randomly selects a feature and a split value, aiming to isolate outliers quickly. Anomalies are identified as instances that require fewer partitions to isolate, as they are different from the majority of the data.
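The idea that anomalies need fewer partitions can be shown with a one-feature sketch. This is a simplification under stated assumptions: it uses a single made-up feature (transaction amount), no subsampling, and reports the raw average number of splits instead of the normalized anomaly score that Isolation Forest computes.

```python
import random

def isolation_depth(point, values, rng, max_depth=50):
    """Count the random splits needed until `point` is alone in its partition."""
    subset = list(values)
    depth = 0
    while len(subset) > 1 and depth < max_depth:
        lo, hi = min(subset), max(subset)
        if lo == hi:
            break  # identical values cannot be separated
        split = rng.uniform(lo, hi)
        # Keep only the side of the split that contains the point.
        if point < split:
            subset = [v for v in subset if v < split]
        else:
            subset = [v for v in subset if v >= split]
        depth += 1
    return depth

def average_depth(point, values, n_trees=200, seed=0):
    """Average isolation depth over many random partitionings."""
    rng = random.Random(seed)
    return sum(isolation_depth(point, values, rng) for _ in range(n_trees)) / n_trees

# Hypothetical amounts: seven ordinary purchases and one extreme value.
amounts = [20.0, 22.5, 25.0, 27.5, 30.0, 32.5, 35.0, 5000.0]

outlier_depth = average_depth(5000.0, amounts)
inlier_depth = average_depth(27.5, amounts)
print(f"outlier: {outlier_depth:.2f} splits, inlier: {inlier_depth:.2f} splits")
```

The extreme amount is usually separated by the first split, while the ordinary amount needs several more. The sketch also hints at the weakness seen in the results above: a fraudulent transaction that looks ordinary on every feature is not isolated quickly and so is not flagged.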
  • 13. NEURAL NETWORK EVALUATION AND INFERENCES Inferences: • Achieves high accuracy (0.98), well above Logistic Regression (0.89). • High precision (1.00) for non-fraudulent transactions but low precision for fraud (0.20), although this is higher than Logistic Regression (0.04). • Recall is high for fraud (0.89) but lower than Random Forest. • F1-score highlights the class imbalance (0.99 for non-fraud, 0.32 for fraud). • ROC-AUC score (0.9919) indicates good discriminative ability. • ROC Curve: Close to the top-left corner, confirming good performance. • Precision-Recall Curve: Reasonably close to the top-right corner, suggesting a good precision-recall trade-off. What does Neural Network (MLP Classifier) do? It consists of layers of interconnected neurons that process input data. In the case of the MLP Classifier, multiple layers of neurons process the input through nonlinear activation functions. These layers learn to represent the data in a hierarchical manner, capturing intricate patterns and relationships. The network adjusts its weights through backpropagation, minimizing prediction errors during training.
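To make the layer-by-layer processing concrete, here is a forward pass through a tiny MLP with one hidden layer. All weights and inputs are invented for illustration, the layer sizes are arbitrary, and training by backpropagation is not shown.

```python
import math

def relu(x):
    return max(0.0, x)

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def dense(inputs, weights, biases, activation):
    """One fully connected layer: activation(W @ inputs + b), row by row."""
    return [
        activation(b + sum(w * x for w, x in zip(row, inputs)))
        for row, b in zip(weights, biases)
    ]

# Made-up parameters: 2 inputs -> 2 hidden neurons (ReLU) -> 1 output (sigmoid).
W1, B1 = [[0.5, -1.0], [1.0, 1.0]], [0.0, -0.5]
W2, B2 = [[1.0, -2.0]], [0.1]

def mlp_forward(features):
    hidden = dense(features, W1, B1, relu)
    (probability,) = dense(hidden, W2, B2, sigmoid)
    return probability

# hidden = [relu(0.5), relu(2.0)] = [0.5, 2.0]; output z = 0.1 + 0.5 - 4.0 = -3.4
p = mlp_forward([2.0, 0.5])
print(f"p(fraud) = {p:.4f}")
```

The nonlinear activation between the layers is what lets the network represent decision boundaries that a single logistic regression cannot.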
  • 14. MODELS COMPARISON Selecting the Best Model: Considering the importance of maximizing fraud detection while tolerating some false positives, Random Forest emerges as a promising choice. Overall Conclusion: • All models achieved high overall accuracy, but Random Forest and MLP might be overfitting on the training data. • Logistic Regression and MLP struggle with precision for fraudulent transactions, while Random Forest offers a more balanced approach. • Isolation Forest excels at identifying normal transactions but fails to capture most fraudulent ones. Hence, the best model out of these four: Random Forest.
  • 15. ENSEMBLE METHOD - RANDOM FOREST & ISOLATION FOREST Considering that there might be overfitting in Random Forest, the two models were combined: • Random Forest maintains good performance in fraud detection and normal transaction classification. • Isolation Forest excels at identifying outliers, potentially fraudulent transactions, that Random Forest might miss. By combining them, a wider range of fraudulent activities can be captured. Evaluation: Final Classification Report (Random Forest + Isolation Forest): • Achieves an accuracy of 0.97, lower than Random Forest alone, which may indicate less overfitting. • Lower precision (0.15) for fraudulent transactions but higher recall (0.80) compared to Random Forest. This means it raises more false alarms but captures more fraud overall. Inferences: • The ensemble method shows promising results, achieving high accuracy and improved recall for fraudulent transactions. • By leveraging the strengths of both Random Forest and Isolation Forest, a more comprehensive fraud detection system is established.
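The slides do not state how the two models' outputs were combined. One plausible rule, assumed here purely for illustration, is a logical OR: a transaction is flagged if either model flags it. The labels and predictions below are made up, and the Isolation Forest output is assumed to have been mapped so that 1 means "anomaly".

```python
def combine_or(rf_pred, iso_pred):
    """Flag a transaction as fraud if either model flags it (assumed rule)."""
    return [int(r == 1 or i == 1) for r, i in zip(rf_pred, iso_pred)]

def precision_recall(y_true, y_pred):
    """Precision and recall for the fraud class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Hypothetical labels and predictions for eight transactions.
y_true   = [1, 1, 1, 0, 0, 0, 0, 0]
rf_pred  = [1, 1, 0, 0, 0, 0, 0, 0]   # misses the third fraud
iso_pred = [0, 0, 1, 1, 1, 0, 0, 0]   # catches it, with two false alarms

ensemble_pred = combine_or(rf_pred, iso_pred)
rf_precision, rf_recall = precision_recall(y_true, rf_pred)
ens_precision, ens_recall = precision_recall(y_true, ensemble_pred)
```

Under an OR rule recall can only stay the same or rise and precision typically falls, which matches the direction of the reported change (higher recall, lower precision) without confirming that this was the rule actually used.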
  • 16. CONCLUSION While Random Forest performs well on its own, the Ensemble Method (Random Forest + Isolation Forest) seems to be a better choice for credit card fraud detection in this case as it offers: • Reduced Overfitting Risk • Improved Fraud Detection. This analysis explored various machine learning models for credit card fraud detection. The ensemble method combining Random Forest and Isolation Forest emerged as the most promising choice due to its balanced performance, reduced overfitting risk, and improved fraud detection capabilities. GitHub Link: For further details and access to the project code, visit my GitHub repository: Project_Fraud_Detection.ipynb
  • 17. REAL-TIME IMPLEMENTATION CHALLENGES Challenges: Model Interpretability: • Explanation of model decisions is crucial for compliance. • Complex models may lack interpretability. Computational Efficiency: • Real-time systems require fast inference. • Complex models may cause latency issues. Handling Concept Drift: • Fraud patterns change over time, leading to concept drift. • Models must adapt to maintain effectiveness. Considerations: Model Explainability: • Use interpretable models alongside complex ones. • Implement techniques like SHAP values. Computational Optimization: • Optimize model architecture and feature engineering. • Use model compression techniques. Conclusion: Real-time implementation of fraud detection models poses challenges related to interpretability, computational efficiency, and concept drift. By addressing these challenges with the considerations above, organizations can deploy effective fraud detection systems in real-time payment processing environments.