CREDIT RISK PREDICTION
MODEL
Submitted by:
Rituparna Sarkar
Outline
1. Project Objective
2. Process Approach
3. Data Source and Variables
4. Data Analysis
5. Data Pre-processing
6. Exploratory Analysis
7. Model development
i. Training the model
ii. Validation
8. Conclusion & Limitations
Project Objective
To develop a prediction model to assess the credit risk of borrowers
• Do all borrowers have an equal probability of default?
• Is there a way to determine the risk of default before processing a credit request?
• Can we classify customers into two groups, i.e. Risky and Non-Risky, based on the nature of their financial data?
• Which key factors should be considered to assess the risk of lending to an individual, based on historic data?
Process Approach
Business Understanding
1. Develop a predictive model to assess the credit risk to borrowers
2. Develop business understanding of data, relationship between variables and data sources to be used
Data Understanding
1. Get data from relevant data sources
2. Explore data for missing values, outliers, invalid data through descriptive statistics and visualization techniques
3. Understand the business relevance of outliers, missing values and invalid data and formulate the approach to treat them accordingly
Data Preparation
1. Data splitting for training and test
2. Data clean up for missing values, outliers, invalid data
3. Data binning and imputation for outlier treatment
4. Binning independent variables as per business needs
5. Data exploration for patterns and collinearity test
Modeling
1. Develop logistic regression model to classify customers into two groups based on credit risk probability
2. Train the model using 80% of training data
Evaluation
1. Validate the trained model using rest 20% of training data
2. If satisfied with accuracy percentages, proceed to testing using test dataset, else go to previous step (modeling) and train the model again
Deployment
When satisfied with the test results, deploy the model to aid business take decisions based on predictions given by the model
* Software Used – Excel & SPSS
Data Source and Variables
• The data source is a dataset of 2,50,000 records taken from the Kaggle website. The dataset was split into two parts: 1,50,000 cases for training and validation and the remaining 1,00,000 cases for testing the model.
• Data dictionary for the variables in the dataset:
Variable Name | Description | Type
SeriousDlqin2yrs | Person experienced 90 days past due delinquency or worse | Y/N
RevolvingUtilizationOfUnsecuredLines | Total balance on credit cards and personal lines of credit except real estate and no installment debt like car loans divided by the sum of credit limits | percentage
age | Age of borrower in years | integer
NumberOfTime30-59DaysPastDueNotWorse | Number of times borrower has been 30-59 days past due but no worse in the last 2 years | integer
DebtRatio | Monthly debt payments, alimony, living costs divided by monthly gross income | percentage
MonthlyIncome | Monthly income | real
NumberOfOpenCreditLinesAndLoans | Number of open loans (installment like car loan or mortgage) and lines of credit (e.g. credit cards) | integer
NumberOfTimes90DaysLate | Number of times borrower has been 90 days or more past due | integer
NumberRealEstateLoansOrLines | Number of mortgage and real estate loans including home equity lines of credit | integer
NumberOfTime60-89DaysPastDueNotWorse | Number of times borrower has been 60-89 days past due but no worse in the last 2 years | integer
NumberOfDependents | Number of dependents in family excluding themselves (spouse, children etc.) | integer
DATA ANALYSIS
Descriptive Statistics
• There are 1,50,000 cases in the training dataset
• Of the 11 variables available, SeriousDlqin2yrs is the binary dependent variable for which the model has to be developed
• MonthlyIncome has a large number of missing values; NumberOfDependents also has some missing values
• RevolvingUtilizationOfUnsecuredLines, DebtRatio and MonthlyIncome have many extreme values (outliers), as indicated by their high standard deviations
Missing Value Analysis
Missing values in NumberOfDependents are about 2.6% of cases (less than 5%), hence these cases could be removed
MonthlyIncome has around 20% of its values missing, which is quite high, so these values need to be imputed
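The missing-value shares quoted above can be checked with a few lines of code. The following is a minimal sketch in Python with pandas (the deck itself used Excel and SPSS); the helper name `missing_share` and the toy data are illustrative assumptions, not part of the original analysis.

```python
import pandas as pd


def missing_share(df: pd.DataFrame) -> pd.Series:
    """Percentage of missing values per column, largest first."""
    return (df.isna().mean() * 100).sort_values(ascending=False)


# Tiny made-up frame for illustration only (not the Kaggle data).
toy = pd.DataFrame({
    "MonthlyIncome": [5000.0, None, 3200.0, None, 7100.0],
    "NumberOfDependents": [0, 1, None, 2, 0],
    "age": [45, 38, 61, 29, 52],
})

shares = missing_share(toy)
print(shares)
```

Columns whose share is small can then be handled by dropping the affected cases, while columns with a large share are candidates for imputation, as described above.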
DATA PRE-PROCESSING
Data Cleaning Steps
Invalid data identified below to be removed in the Excel sheet
• Age variable: one case showing 0
• Variables NumberOfTime30-59DaysPastDueNotWorse, NumberOfTimes90DaysLate and NumberOfTime60-89DaysPastDueNotWorse contain cases with values 96 and 98, which indicate ‘Don’t know’ and ‘Refused to Say’. These cases are very few in number and common to all three variables.
Data Formatting in Excel
Variables RevolvingUtilizationOfUnsecuredLines and DebtRatio to be changed from
General to Number format
Imputation in SPSS:
• Imputation for missing values in MonthlyIncome
• 5 imputations done using all independent variables and 5th imputation results
taken for training
Descriptive Statistics After Data Cleaning
• After data cleaning, total number of cases down to 145837
• Outliers in variables DebtRatio, MonthlyIncome and RevolvingUtilizationOfUnsecuredLines to be removed through binning
Variable Binning
Binning done for the following variables:
• Age: variable Age_Binning containing bins for age groups:
Age Group | Bin
21-30 | 1
31-40 | 2
41-50 | 3
51-60 | 4
>60 | 5
• DebtRatio & RevolvingUtilizationOfUnsecuredLines: created variables DebtRatio_Binning and RevolvingUtilizationOfUnsecuredLines_Binning with the following cut-off values:
Group | Bin | Remark
<= 0.25 | 1 | Good
0.25 - 0.50 | 2 | Low Risk
> 0.50 | 3 | High Risk
• MonthlyIncome: variable MonthlyIncome_Binning with 5 bins (described as equal width; the bin counts of roughly 20% of cases each in the MonthlyIncome crosstabulation indicate equal-frequency bins)
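The age and ratio cut-offs listed above translate directly into code. This is a minimal sketch using `pandas.cut`, not the Excel/SPSS procedure used in the deck; the function names are hypothetical, and the treatment of boundary values (0.25 and 0.50 fall in the lower bin, ages of 30 or below fall in bin 1) is an assumption read off the tables.

```python
import numpy as np
import pandas as pd

# Right-closed intervals: (-inf, 30], (30, 40], ... (assumed boundary handling)
AGE_EDGES = [-np.inf, 30, 40, 50, 60, np.inf]
# Right-closed intervals: (-inf, 0.25], (0.25, 0.50], (0.50, inf)
RATIO_EDGES = [-np.inf, 0.25, 0.50, np.inf]


def bin_age(age: pd.Series) -> pd.Series:
    """Map age in years to bins 1-5 (21-30, 31-40, 41-50, 51-60, >60)."""
    return pd.cut(age, bins=AGE_EDGES, labels=[1, 2, 3, 4, 5]).astype(int)


def bin_ratio(ratio: pd.Series) -> pd.Series:
    """Map DebtRatio or revolving utilization to bins 1-3."""
    return pd.cut(ratio, bins=RATIO_EDGES, labels=[1, 2, 3]).astype(int)


ages = bin_age(pd.Series([21, 30, 31, 60, 61]))
ratios = bin_ratio(pd.Series([0.0, 0.25, 0.26, 0.50, 0.51, 3.0]))
print(ages.tolist(), ratios.tolist())
```

Because the top ratio bin is open-ended, extreme values such as a ratio of 3.0 simply land in bin 3, which is how binning absorbs outliers without dropping cases.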
EXPLORATORY ANALYSIS
Exploratory Analysis (Using SPSS)
Delinquency over different categories
Age_Binning * SeriousDlqin2yrs Crosstabulation
Age_Binning | Count (SeriousDlqin2yrs = 0) | Count (SeriousDlqin2yrs = 1) | Count (Total) | % of SeriousDlqin2yrs = 0 | % of SeriousDlqin2yrs = 1 | % of Total
21 - 30 | 7374 | 940 | 8314 | 5.42% | 9.68% | 5.70%
31 - 40 | 20562 | 2285 | 22847 | 15.11% | 23.53% | 15.67%
41 - 50 | 31130 | 2828 | 33958 | 22.87% | 29.12% | 23.28%
51 - 60 | 32334 | 2213 | 34547 | 23.75% | 22.79% | 23.69%
60 + | 44725 | 1446 | 46171 | 32.86% | 14.89% | 31.66%
Total | 136125 | 9712 | 145837 | 100.00% | 100.00% | 100.00%
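Disproportionate share of samples across the two classes of the dependent variable: only 9712 of 145837 cases (about 6.7%) have SeriousDlqin2yrs = 1. Sampling of the training dataset is required to remove bias in model development
• Most customers are in the 60+ age group (31.66%)
• The 41-50 age group accounts for the largest share of delinquent cases (29.12%) and the 21-30 group the smallest (9.68%). The delinquency rate within a group, however, is highest for 21-30 (940 of 8314, about 11.3%) and lowest for 60+ (1446 of 46171, about 3.1%)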
a) Age
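The age crosstab supports two different readings: each group's share of all delinquent cases, and the delinquency rate within each group. The sketch below computes both from the counts in the table; the function names are illustrative, and the counts are copied from the crosstab.

```python
# (non-delinquent, delinquent) counts per age bin, copied from the crosstab
AGE_COUNTS = {
    "21-30": (7374, 940),
    "31-40": (20562, 2285),
    "41-50": (31130, 2828),
    "51-60": (32334, 2213),
    "60+": (44725, 1446),
}


def delinquency_rates(counts):
    """Delinquent cases divided by all cases within each group."""
    return {g: bad / (good + bad) for g, (good, bad) in counts.items()}


def share_of_delinquent(counts):
    """Each group's delinquent cases divided by all delinquent cases."""
    total_bad = sum(bad for _, bad in counts.values())
    return {g: bad / total_bad for g, (_, bad) in counts.items()}


rates = delinquency_rates(AGE_COUNTS)
shares = share_of_delinquent(AGE_COUNTS)
for group in AGE_COUNTS:
    print(f"{group}: rate {rates[group]:.1%}, share {shares[group]:.1%}")
```

The two measures rank the groups differently, so it matters which one a statement about "highest risk" refers to.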
Exploratory Analysis (Contd.)
Around 60% of cases have 0 dependents; this group also accounts for the largest count (4992) and share (51.40%) of delinquent cases
Cases with more than 3 dependents make up only around 2.6% of the data
NumberOfDependents * SeriousDlqin2yrs Crosstabulation
NumberOfDependents | Count (SeriousDlqin2yrs = 0) | Count (SeriousDlqin2yrs = 1) | Count (Total) | % of SeriousDlqin2yrs = 0 | % of SeriousDlqin2yrs = 1 | % of Total
0 | 81722 | 4992 | 86714 | 60.03% | 51.40% | 59.46%
1 | 24372 | 1921 | 26293 | 17.90% | 19.78% | 18.03%
2 | 17930 | 1571 | 19501 | 13.17% | 16.18% | 13.37%
3 | 8646 | 833 | 9479 | 6.35% | 8.58% | 6.50%
4 | 2564 | 296 | 2860 | 1.88% | 3.05% | 1.96%
5 | 677 | 68 | 745 | 0.50% | 0.70% | 0.51%
6 | 134 | 24 | 158 | 0.10% | 0.25% | 0.11%
7 | 46 | 5 | 51 | 0.03% | 0.05% | 0.03%
8 | 22 | 2 | 24 | 0.02% | 0.02% | 0.02%
9 | 5 | 0 | 5 | 0.00% | 0.00% | 0.00%
10 | 5 | 0 | 5 | 0.00% | 0.00% | 0.00%
13 | 1 | 0 | 1 | 0.00% | 0.00% | 0.00%
20 | 1 | 0 | 1 | 0.00% | 0.00% | 0.00%
Total | 136125 | 9712 | 145837 | 100.00% | 100.00% | 100.00%
b) Number of Dependents
Exploratory Analysis (Contd.)
DebtRatio (Binned) * SeriousDlqin2yrs Crosstabulation
DebtRatio (Binned) | Count (SeriousDlqin2yrs = 0) | Count (SeriousDlqin2yrs = 1) | Count (Total) | % of SeriousDlqin2yrs = 0 | % of SeriousDlqin2yrs = 1 | % of Total
<= 0.25 | 24825 | 1472 | 26297 | 36.47% | 30.31% | 36.06%
0.26 - 0.50 | 19181 | 1256 | 20437 | 28.18% | 25.86% | 28.03%
0.51+ | 24057 | 2128 | 26185 | 35.35% | 43.82% | 35.91%
Total | 68063 | 4856 | 72919 | 100.00% | 100.00% | 100.00%
Around 44% of delinquent cases come from the group with DebtRatio > 0.5
c) Debt Ratio
RevolvingUtilizationOfUnsecuredLines (Binned) * SeriousDlqin2yrs Crosstabulation
RevolvingUtilizationOfUnsecuredLines (Binned) | Count (SeriousDlqin2yrs = 0) | Count (SeriousDlqin2yrs = 1) | Count (Total) | % of SeriousDlqin2yrs = 0 | % of SeriousDlqin2yrs = 1 | % of Total
<= 0.25 | 41954 | 912 | 42866 | 61.64% | 18.78% | 58.79%
0.26 - 0.50 | 9680 | 573 | 10253 | 14.22% | 11.80% | 14.06%
0.51+ | 16429 | 3371 | 19800 | 24.14% | 69.42% | 27.15%
Total | 68063 | 4856 | 72919 | 100.00% | 100.00% | 100.00%
Around 69% of delinquent cases come from the group with RevolvingUtilizationOfUnsecuredLines > 0.5
d) RevolvingUtilizationOfUnsecuredLines
Exploratory Analysis (Contd.)
MonthlyIncome (Binned) * SeriousDlqin2yrs Crosstabulation
MonthlyIncome (Binned) | Count (SeriousDlqin2yrs = 0) | Count (SeriousDlqin2yrs = 1) | Count (Total) | % of SeriousDlqin2yrs = 0 | % of SeriousDlqin2yrs = 1 | % of Total
<= 3100.00 | 26699 | 2494 | 29193 | 19.61% | 25.68% | 20.02%
3100.01 - 5000.00 | 29083 | 2518 | 31601 | 21.36% | 25.93% | 21.67%
5000.01 - 7083.00 | 25214 | 1766 | 26980 | 18.52% | 18.18% | 18.50%
7083.01 - 10823.00 | 27435 | 1461 | 28896 | 20.15% | 15.04% | 19.81%
10823.01+ | 27694 | 1473 | 29167 | 20.34% | 15.17% | 20.00%
Total | 136125 | 9712 | 145837 | 100.00% | 100.00% | 100.00%
• More than 50% of defaulters (51.61%) fall in the two lowest income bins, which cover about 40% of cases
• The other 3 groups each account for a broadly similar share of defaulters (about 15% to 18%)
e) Monthly Income
Exploratory Analysis
Monthly Income vs. Other Financial Variables
Exploratory Analysis (Contd.)
All parameters below show a similar pattern: the low income range is associated with high values of the debt indicators
i) RevolvingUtilizationOfUnsecuredLines,
ii) DebtRatio,
iii) NumberOfTime30-59DaysPastDueNotWorse,
iv) NumberOfTimes90DaysLate,
v) NumberOfTime60-89DaysPastDueNotWorse,
vi) NumberOfOpenCreditLinesAndLoans,
vii) NumberRealEstateLoansOrLines
Collinearity Diagnostics
Sample collinearity diagnostic results for Age vs. the other 9 independent variables shown here
Similar diagnostics were performed for each of the 10 variables against the other variables
The Condition Index was always less than 15, indicating no notable collinearity between the independent variables
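For readers who want to reproduce a condition-index check outside SPSS, here is a minimal NumPy sketch. It assumes the usual recipe (add an intercept column, scale every column to unit length, then divide the largest singular value by each singular value); this may differ in detail from the SPSS output shown in the deck, and the data below are randomly generated for illustration.

```python
import numpy as np


def condition_indices(X, add_intercept=True):
    """Condition indices of a design matrix with unit-length columns."""
    X = np.asarray(X, dtype=float)
    if add_intercept:
        X = np.column_stack([np.ones(len(X)), X])
    X = X / np.linalg.norm(X, axis=0)        # scale each column to length 1
    sv = np.linalg.svd(X, compute_uv=False)  # singular values, descending
    return sv.max() / sv


rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = rng.normal(size=500)
near_copy_of_a = a + 1e-3 * rng.normal(size=500)

independent = condition_indices(np.column_stack([a, b]))
collinear = condition_indices(np.column_stack([a, b, near_copy_of_a]))
print(independent.max(), collinear.max())
```

With unrelated predictors the largest index stays close to 1, while a near-duplicate column pushes it far above the threshold of 15 used in the deck.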
MODEL DEVELOPMENT
Logistic Regression Model
• The model is developed to classify the SeriousDlqin2yrs variable as 1 or 0
  - 1 indicates risk of defaulting
  - 0 indicates no risk
• As the proportion of cases with SeriousDlqin2yrs = 1 is just 6.7% of the total, a 50:50 stratified sampling approach is followed to build the model
• The pre-processed training dataset is used to draw samples for training and validation of the model
• 80% random samples are drawn from the training dataset with equal proportions of SeriousDlqin2yrs equal to 0 and 1 and used for developing and training the model
• 20% random samples are drawn from the same dataset with equal proportions of SeriousDlqin2yrs equal to 0 and 1 and used for validation
• The final model is tested using the given test dataset
• Logistic regression models were developed and compared using three different approaches:
  - With binned variables (Model 1)
  - Binned as in Model 1, but with missing data binned into a separate category instead of clean-up/imputation, wherever applicable (Model 2)
  - Without binning, using the variables directly (Model 3)
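The 50:50 sampling with an 80/20 split can be sketched as follows. This is an illustration in pandas rather than the SPSS procedure actually used; the function name `balanced_split`, its parameters and the toy data are assumptions, and the sketch relies on the data frame having a unique index.

```python
import pandas as pd


def balanced_split(df, target="SeriousDlqin2yrs", n_per_class=4500,
                   train_frac=0.8, seed=0):
    """Draw n_per_class rows per class, then split each class 80/20."""
    train_parts, valid_parts = [], []
    for value in (0, 1):
        sample = df[df[target] == value].sample(n=n_per_class,
                                                random_state=seed)
        train = sample.sample(frac=train_frac, random_state=seed)
        train_parts.append(train)
        valid_parts.append(sample.drop(train.index))  # the remaining 20%
    return pd.concat(train_parts), pd.concat(valid_parts)


# Made-up imbalanced data: 7% of rows have the target equal to 1.
toy = pd.DataFrame({
    "SeriousDlqin2yrs": [1] * 70 + [0] * 930,
    "row_id": list(range(1000)),
})
train, valid = balanced_split(toy, n_per_class=50)
print(len(train), len(valid), train["SeriousDlqin2yrs"].mean())
```

Sampling each class separately keeps both the training and the validation part at exactly equal proportions of the two classes.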
MODEL 1 – WITH BINNING
• The model has been developed considering business needs, and the bins have therefore been created using business cut-offs.
• In the current model, missing values for NumberOfDependents, NumberOfTime30-59DaysPastDueNotWorse, NumberOfTimes90DaysLate and NumberOfTime60-89DaysPastDueNotWorse have been removed, as they formed 2% of the data, and missing values in MonthlyIncome have been imputed.
• Since RevolvingUtilizationOfUnsecuredLines and DebtRatio are percentages, bins have been created for them. Bins were created for the Age variable as well.
• Dummy variables were created for the categories in the binned variables, clubbing insignificant bins together to have better control of the model.
• The training dataset comprised a stratified sample of 9000 records (4500 with SeriousDlqin2yrs = 1 and 4500 with SeriousDlqin2yrs = 0).
• The model comprises 10 variables, including 4 dummy variables.
Model 1 – Output
• The logit function equation for the model is:
logit(p) = -0.595
  + 0.597 * NumberOfTime3059DaysPastDueNotWorse
  + 1.029 * NumberOfTimes90DaysLate
  + 0.072 * NumberRealEstateLoansOrLines
  + 0.862 * NumberOfTime6089DaysPastDueNotWorse
  + 0.030 * NumberOfOpenCreditLinesAndLoans
  - 0.025 * Age
  + 0.825 * RU_0_.25(1)
  + 0.689 * RU_0(1)
  - 0.783 * RU_GT_.5(1)
  + 0.129 * DebtRatio_GT0.25_0.5(1)
• A cut off value of 0.5 gave optimal results
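To show how the logit equation and the cut-off turn into a Risky / Non-Risky decision, here is a minimal sketch. The coefficients are copied from the equation above; everything else is an assumption for illustration: each dummy is taken to be 1 when its condition holds (SPSS categorical coding may be the reverse), a probability at or above the cut-off is classified as 1, and the example borrower is made up.

```python
import math

INTERCEPT = -0.595
COEF = {
    "NumberOfTime3059DaysPastDueNotWorse": 0.597,
    "NumberOfTimes90DaysLate": 1.029,
    "NumberRealEstateLoansOrLines": 0.072,
    "NumberOfTime6089DaysPastDueNotWorse": 0.862,
    "NumberOfOpenCreditLinesAndLoans": 0.030,
    "Age": -0.025,
    "RU_0_.25": 0.825,              # dummy, assumed 1 when 0 < RU < 0.25
    "RU_0": 0.689,                  # dummy, assumed 1 when RU == 0
    "RU_GT_.5": -0.783,             # dummy, assumed 1 when RU >= 0.5
    "DebtRatio_GT0.25_0.5": 0.129,  # dummy, assumed 1 when 0.25 <= DebtRatio < 0.5
}


def logit(row):
    """Linear predictor; variables missing from `row` count as 0."""
    return INTERCEPT + sum(coef * row.get(name, 0) for name, coef in COEF.items())


def probability(row):
    """Logistic transform of the linear predictor."""
    return 1.0 / (1.0 + math.exp(-logit(row)))


def classify(row, cutoff=0.5):
    """1 = Risky, 0 = Non-Risky (>= cutoff is an assumed convention)."""
    return int(probability(row) >= cutoff)


# Hypothetical borrower: 40 years old, twice 90+ days late.
borrower = {"Age": 40, "NumberOfTimes90DaysLate": 2}
print(round(probability(borrower), 3), classify(borrower))
```

A cut-off of 0.5 on the probability is the same as checking whether the logit is at or above zero, which can make manual checks of individual cases easier.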
Model 1 - Variables Used
• Variables used
  - Age
  - NumberOfTime3059DaysPastDueNotWorse
  - NumberOfTime6089DaysPastDueNotWorse
  - NumberOfTimes90DaysLate
  - NumberOfOpenCreditLinesAndLoans
  - NumberRealEstateLoansOrLines
  - DebtRatio: dummy variable used for the range DebtRatio >= 0.25 & < 0.5
  - RevolvingUtilizationOfUnsecuredLines: 3 dummy variables used: RU_0 (where RU = 0), RU_0_.25 (where RU > 0 but < 0.25) and RU_GT_.5 (where RU >= 0.5)
• Observations
  - MonthlyIncome was a significant variable but had a beta coefficient of 0 and was therefore dropped from the model.
  - MonthlyIncome and DebtRatio were affecting each other
  - RevolvingUtilizationOfUnsecuredLines and DebtRatio seem to be correlated.
  - Although bins were created for the Age variable, all the bins contributed equally to the model, so the Age variable was used as such.
  - NumberOfDependents was initially thought to be a significant variable but turned out to be insignificant. Bins created for the NumberOfDependents variable were insignificant too.
Model 1 - Validation
• Validated the developed model on a non-stratified random sample of 40% of the data (comprising 29168 records).
• Overall accuracy: 78.62%; misclassification rate: 21.38%
• Prediction accuracy for Risky (= 1): 75.9%
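The three validation figures can all be derived from a classification table. The sketch below shows the arithmetic; "prediction accuracy for Risky" is read here as the share of actual risky cases that the model flags as risky, which is an assumption since the deck does not define it, and the counts are hypothetical rather than the deck's results.

```python
def validation_metrics(tp, fp, tn, fn):
    """Summary rates from confusion-matrix counts (1 = Risky is positive)."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total
    return {
        "accuracy": accuracy,
        "misclassification": 1 - accuracy,
        "risky_accuracy": tp / (tp + fn),  # assumed definition, see lead-in
    }


# Hypothetical counts for illustration only.
metrics = validation_metrics(tp=75, fp=200, tn=700, fn=25)
for name, value in metrics.items():
    print(f"{name}: {value:.2%}")
```

On a non-stratified sample where risky cases are rare, overall accuracy is dominated by the non-risky class, which is why the accuracy for the risky class is worth reporting separately.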
Model 1 – Pros and Cons
• Missing values amounting to 17% of the data have been imputed and only 2% have been removed, so data loss is minimal.
• The model has been developed taking into consideration widely used business cut-offs and significant parameters.
• Since the model has been built on data where missing values were treated, its accuracy may drop on data where missing values are present.
• Analyzing the top 10% (customers who are prone to default)
  - 67.4% of defaulters are in the age group 30-50
  - 67% of defaulters had Revolving Utilization and Debt Ratio less than 0.5
  - 70.6%, 78.7% and 74% of the defaulters made payments on time and did not go past 30 days, 60 days and 90 days respectively.
  - 70% of the defaulters had Monthly Income less than or equal to 7466 USD and 73.3% of the defaulters did not have any dependents.
• Analyzing the bottom 10% (customers who are safe)
  - 80% of non-defaulters are more than 40 years of age.
  - 61% of non-defaulters had Revolving Utilization and Debt Ratio less than 0.5
  - 85%, 96.9% and 97.5% of the non-defaulters made payments on time and did not go past 30 days, 60 days and 90 days respectively.
  - 70% of the non-defaulters had Monthly Income less than or equal to 8366 USD and 50.4% of the non-defaulters did not have any dependents.
MODEL 2 – CONSIDERING MISSING VALUES
• Missing values have not been imputed here; instead, an extra category has been added to the binned variables so that a missing value is treated as another category. (Example: NoOfDependents_Binned shown below)
• Selection of variables has been based on B, Exp(B) and Sig values
• Optimal binning has been used, based on the SeriousDlqin2yrs variable.
Model 2 – Output
• Final model:
(1.311*Age_1) + (1.107*Age_2) + (0.898*Age_3) + (0.479*Age_4)
+ (1.802*NoOf30_1) + (2.971*NoOf30_2) + (3.445*NoOf30_3) + (3.858*NoOf30_4) + (4.001*NoOf30_5)
+ (-1.784*NoOf60_1) + (-0.362*NoOf60_2)
+ (-3.125*NoOf90_1) + (-1.311*NoOf90_2) + (-0.549*NoOf90_3)
+ 1.442
• Training set: stratified sample of 4000 records with SeriousDlqin2yrs = 1 and another 4000 with SeriousDlqin2yrs = 0
• A cut off value of 0.4 gave optimal results
Model 2 - Variables Used
• Variables used
  - Age_OptimalBin
  - NumberOfTime3059DaysPastDueNotWorse_OptimalBin
  - NumberOfTime6089DaysPastDueNotWorse_OptimalBin
  - NumberOfTimes90DaysLate_OptimalBin
• Possible reasons why a few other variables are not significant
  - Age has a non-linear relationship with MonthlyIncome
  - The other 3 variables in the equation indicate the number of defaults committed by the customer, which is related to NumberOfOpenCreditLinesAndLoans and RevolvingUtilizationOfUnsecuredLines
  - MonthlyIncome will affect the DebtRatio
Model 2 - Validation
• Multiple test runs have been performed on different sample sizes
• The validation table given below was for a random sample of 90000.
• Overall accuracy 72.62% and misclassification 27.38%
• Risky (= 1) prediction accuracy of 75.1%
Model 2 – Pros and Cons
• Capable of handling missing values (including 98, 96)
• Intermediate processing required is minimal (only binning required)
• The model uses only 4 variables
• Optimal binning is used, not the industry-standard binning
• Other insights
  - Analyzing the top 10% (most risky customer segment)
    84% of the customers are below 56 years of age
    72% have 1 or more past 30 days defaults
  - Analyzing the bottom 10% (safest customer segment)
    All of them are 64 years of age or above
    Almost all of them have 0 defaults under any case.
MODEL 3 – USING VARIABLES DIRECTLY
• The final model has the following equation:
0.754 + (0.031*Age) + (0.766*NumberOfTime3059DaysPastDueNotWorse) + (1.179*NumberOfTime6089DaysPastDueNotWorse) + (1.417*NumberOfTimes90DaysLate)
• This model is the simplest, but business considerations were not accounted for, so robustness on deployment cannot be assured
• It cannot handle missing values
CONCLUSION & LIMITATIONS
• Model 1 and Model 2 give similar accuracy levels. Model 3 is not recommended. The choice of final model is left to the business, based on the pros and cons mentioned
• These models are to be further validated for scalability and robustness
• The test dataset given did not have delinquency values; hence, after validation with the 20% random samples from the training dataset, no further accuracy check could be performed on a totally new set of data using the test dataset.
• The assumptions made in binning financial variables could change the significance of different variables in the final model. This aspect is to be validated with the business
THANK YOU

BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort Service
 
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAl Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
 

Credit risk scoring model final

  • 2. Outline 1. Project Objective 2. Process Approach 3. Data Source and Variables 4. Data Analysis 5. Data Pre-processing 6. Exploratory Analysis 7. Model development i. Training the model ii. Validation 8. Conclusion & Limitations
  • 3. Project Objective To develop a prediction model to assess the credit risk of borrowers • Do all borrowers have an equal probability of default? • Is there a way to determine the risk of default before processing a credit request? • Can we classify customers into two groups, i.e. Risky and Non-Risky, based on the nature of their financial data? • Which key factors should be considered when assessing the risk of lending to an individual, based on historic data?
  • 4. Process Approach
      Business Understanding: 1. Develop a predictive model to assess the credit risk of borrowers 2. Develop a business understanding of the data, the relationships between variables and the data sources to be used
      Data Understanding: 1. Get data from relevant data sources 2. Explore the data for missing values, outliers and invalid data through descriptive statistics and visualization techniques 3. Understand the business relevance of outliers, missing values and invalid data, and formulate the approach to treat them accordingly
      Data Preparation: 1. Data splitting for training and test 2. Data clean-up for missing values, outliers and invalid data 3. Data binning and imputation for outlier treatment 4. Binning independent variables as per business needs 5. Data exploration for patterns and collinearity test
      Modeling: 1. Develop a logistic regression model to classify customers into two groups based on credit risk probability 2. Train the model using 80% of the training data
      Evaluation: 1. Validate the trained model using the remaining 20% of the training data 2. If satisfied with the accuracy percentages, proceed to testing using the test dataset; otherwise go back to the previous step (modeling) and train the model again
      Deployment: When satisfied with the test results, deploy the model to help the business take decisions based on its predictions
      * Software Used – Excel & SPSS
  • 5. Data Source and Variables • The data source is a dataset of 2,50,000 records taken from the Kaggle website. The dataset was split into two parts: 1,50,000 cases for training and validation, and the remaining 1,00,000 cases for testing the model. • Data dictionary for the variables in the dataset:
      Variable Name | Description | Type
      SeriousDlqin2yrs | Person experienced 90 days past due delinquency or worse | Y/N
      RevolvingUtilizationOfUnsecuredLines | Total balance on credit cards and personal lines of credit except real estate and no installment debt like car loans, divided by the sum of credit limits | percentage
      age | Age of borrower in years | integer
      NumberOfTime30-59DaysPastDueNotWorse | Number of times borrower has been 30-59 days past due but no worse in the last 2 years | integer
      DebtRatio | Monthly debt payments, alimony, living costs divided by monthly gross income | percentage
      MonthlyIncome | Monthly income | real
      NumberOfOpenCreditLinesAndLoans | Number of open loans (installment, like car loan or mortgage) and lines of credit (e.g. credit cards) | integer
      NumberOfTimes90DaysLate | Number of times borrower has been 90 days or more past due | integer
      NumberRealEstateLoansOrLines | Number of mortgage and real estate loans, including home equity lines of credit | integer
      NumberOfTime60-89DaysPastDueNotWorse | Number of times borrower has been 60-89 days past due but no worse in the last 2 years | integer
      NumberOfDependents | Number of dependents in family excluding themselves (spouse, children etc.) | integer
  • 7. Descriptive Statistics • There are 1,50,000 cases in the training dataset. • Of the 11 variables available, SeriousDlqin2yrs is the binary dependent variable for which the model has to be developed. • MonthlyIncome has a large number of missing values; NumberOfDependents also has some missing values. • RevolvingUtilizationOfUnsecuredLines, DebtRatio and MonthlyIncome have a high number of extreme values (outliers), as indicated by their high standard deviations.
  • 8. Missing Value Analysis • About 2.6% of NumberOfDependents values are missing (less than 5%), so these cases can be removed. • Around 20% of MonthlyIncome values are missing, which is quite high, so they need to be imputed.
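The rule applied on this slide (remove cases when under 5% of a variable is missing, otherwise impute) can be sketched as below. This is a minimal illustration and not the SPSS procedure used in the project; the function names and the toy columns are assumptions made up for the example.

```python
def missing_share(values):
    """Return the fraction of entries that are missing (None)."""
    return sum(v is None for v in values) / len(values)


def missing_value_treatment(share, removal_threshold=0.05):
    """Suggest removing cases when few values are missing, otherwise imputing."""
    return "remove cases" if share < removal_threshold else "impute"


# Toy columns (hypothetical data): 26 of 1000 and 200 of 1000 values missing,
# chosen to mirror the approximate 2.6% and 20% shares quoted on the slide.
dependents = [None] * 26 + [0] * 974
income = [None] * 200 + [5000.0] * 800

for name, column in (("NumberOfDependents", dependents), ("MonthlyIncome", income)):
    share = missing_share(column)
    print(f"{name}: {share:.1%} missing -> {missing_value_treatment(share)}")
```

The 5% threshold is exposed as a parameter because it is a rule of thumb rather than a fixed standard.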
  • 10. Data Cleaning Steps Invalid data, identified below, to be removed in the Excel sheet: • Age variable: one case showing 0 • Variables NumberOfTime30-59DaysPastDueNotWorse, NumberOfTimes90DaysLate and NumberOfTime60-89DaysPastDueNotWorse contain cases with values 96 and 98, which indicate ‘Don’t know’ and ‘Refused to say’. These cases are very few in number and are the same across all three variables. Data formatting in Excel: variables RevolvingUtilizationOfUnsecuredLines and DebtRatio to be changed from General to Number format. Imputation in SPSS: • Imputation of missing values in MonthlyIncome • 5 imputations done using all independent variables; the results of the 5th imputation were taken for training
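The invalid-data rules above were applied by hand in Excel; an equivalent filter could look like the following sketch. The record layout (a dict keyed by the data-dictionary names) and the helper names are assumptions for illustration.

```python
# Codes the slide treats as 'Don't know' / 'Refused to say'.
INVALID_CODES = {96, 98}

PAST_DUE_FIELDS = (
    "NumberOfTime30-59DaysPastDueNotWorse",
    "NumberOfTime60-89DaysPastDueNotWorse",
    "NumberOfTimes90DaysLate",
)


def is_valid_case(record):
    """A case is kept only if age is positive and no past-due count is 96 or 98."""
    if record["age"] <= 0:
        return False
    return all(record[field] not in INVALID_CODES for field in PAST_DUE_FIELDS)


def clean(records):
    """Return the records that pass the validity rules, preserving order."""
    return [record for record in records if is_valid_case(record)]


# Hypothetical sample: one valid case, one with age 0, one with code 98.
sample = [
    {"age": 45, PAST_DUE_FIELDS[0]: 1, PAST_DUE_FIELDS[1]: 0, PAST_DUE_FIELDS[2]: 0},
    {"age": 0, PAST_DUE_FIELDS[0]: 0, PAST_DUE_FIELDS[1]: 0, PAST_DUE_FIELDS[2]: 0},
    {"age": 30, PAST_DUE_FIELDS[0]: 98, PAST_DUE_FIELDS[1]: 98, PAST_DUE_FIELDS[2]: 98},
]
cleaned = clean(sample)
print(len(cleaned), "of", len(sample), "cases kept")
```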
  • 11. Descriptive Statistics After Data Cleaning • After data cleaning, the total number of cases is down to 145837 • Outliers in variables DebtRatio, MonthlyIncome and RevolvingUtilizationOfUnsecuredLines to be treated through binning
  • 12. Variable Binning Binning done for the following variables: • Age: variable Age_Binning containing bins for age groups • DebtRatio & RevolvingUtilizationOfUnsecuredLines: created variables DebtRatio_Binning and RevolvingUtilizationOfUnsecuredLines_Binning with the cut-off values listed here • MonthlyIncome: variable MonthlyIncome_Binning with 5 equal width bins
      Age Group | Bin
      21-30 | 1
      31-40 | 2
      41-50 | 3
      51-60 | 4
      >60 | 5
      Group (DebtRatio, RevolvingUtilizationOfUnsecuredLines) | Bin | Remark
      <=0.25 | 1 | Good
      0.25 - 0.50 | 2 | Low Risk
      > 0.50 | 3 | High Risk
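The age and ratio cut-offs from the binning tables translate directly into two small functions. This is a sketch only: the function names are hypothetical, and the handling of ages below 21 (returning None) is an assumption, since the slide does not define a bin for them.

```python
def age_bin(age):
    """Map an age in years to the bin number used on the slide (1 to 5)."""
    if age < 21:
        return None  # assumption: the slide defines no bin below 21
    if age <= 30:
        return 1
    if age <= 40:
        return 2
    if age <= 50:
        return 3
    if age <= 60:
        return 4
    return 5


def ratio_bin(value):
    """Map DebtRatio or RevolvingUtilizationOfUnsecuredLines to (bin, remark)."""
    if value <= 0.25:
        return 1, "Good"
    if value <= 0.50:
        return 2, "Low Risk"
    return 3, "High Risk"


for age in (25, 40, 61):
    print(age, "->", age_bin(age))
for ratio in (0.10, 0.50, 3.2):
    print(ratio, "->", ratio_bin(ratio))
```

Because every value above 0.50 falls into bin 3, extreme ratios are absorbed into a single category instead of being dropped.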
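  • 14. Exploratory Analysis (Using SPSS) Delinquency over different categories a) Age
      Age_Binning * SeriousDlqin2yrs Crosstabulation
      Age_Binning | Count (SeriousDlqin2yrs = 0) | Count (SeriousDlqin2yrs = 1) | Count (Total) | % of SeriousDlqin2yrs = 0 | % of SeriousDlqin2yrs = 1 | % Total
      21 - 30 | 7374 | 940 | 8314 | 5.42% | 9.68% | 5.70%
      31 - 40 | 20562 | 2285 | 22847 | 15.11% | 23.53% | 15.67%
      41 - 50 | 31130 | 2828 | 33958 | 22.87% | 29.12% | 23.28%
      51 - 60 | 32334 | 2213 | 34547 | 23.75% | 22.79% | 23.69%
      60 + | 44725 | 1446 | 46171 | 32.86% | 14.89% | 31.66%
      Total | 136125 | 9712 | 145837 | 100.00% | 100.00% | 100.00%
      • The two classes of the dependent variable are heavily imbalanced (9712 of 145837 cases have SeriousDlqin2yrs = 1), so the training dataset must be sampled to remove bias in model development • The largest number of customers is in the 60+ age group • The 41-50 age group accounts for the largest share of delinquent cases (29.12%) and the 21-30 age group for the smallest (9.68%). Measured as a rate within each group, however, delinquency is highest for 21-30 (940 of 8314, about 11.3%) and lowest for 60+ (1446 of 46171, about 3.1%)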
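A group's share of all delinquent cases (the column percentage in the crosstab) is a different quantity from the delinquency rate within that group. The sketch below recomputes both from the age counts on this slide; the helper names are assumptions for illustration.

```python
# (count with SeriousDlqin2yrs = 0, count with SeriousDlqin2yrs = 1) per age bin,
# copied from the Age_Binning crosstab.
AGE_COUNTS = {
    "21-30": (7374, 940),
    "31-40": (20562, 2285),
    "41-50": (31130, 2828),
    "51-60": (32334, 2213),
    "60+": (44725, 1446),
}


def delinquent_share(counts):
    """Share of all delinquent cases that falls in each group (column %)."""
    total_bad = sum(bad for _, bad in counts.values())
    return {group: bad / total_bad for group, (_, bad) in counts.items()}


def delinquency_rate(counts):
    """Fraction of each group's own cases that are delinquent (row %)."""
    return {group: bad / (good + bad) for group, (good, bad) in counts.items()}


share = delinquent_share(AGE_COUNTS)
rate = delinquency_rate(AGE_COUNTS)
for group in AGE_COUNTS:
    print(f"{group}: share of delinquents {share[group]:.2%}, rate within group {rate[group]:.2%}")
```

The share depends on how many customers a group contains, so the within-group rate is usually the more direct measure of risk.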
  • 15. Exploratory Analysis (Contd.) b) Number of Dependents
      NumberOfDependents * SeriousDlqin2yrs Crosstabulation
      NumberOfDependents | Count (SeriousDlqin2yrs = 0) | Count (SeriousDlqin2yrs = 1) | Count (Total) | % of SeriousDlqin2yrs = 0 | % of SeriousDlqin2yrs = 1 | % Total
      0 | 81722 | 4992 | 86714 | 60.03% | 51.40% | 59.46%
      1 | 24372 | 1921 | 26293 | 17.90% | 19.78% | 18.03%
      2 | 17930 | 1571 | 19501 | 13.17% | 16.18% | 13.37%
      3 | 8646 | 833 | 9479 | 6.35% | 8.58% | 6.50%
      4 | 2564 | 296 | 2860 | 1.88% | 3.05% | 1.96%
      5 | 677 | 68 | 745 | 0.50% | 0.70% | 0.51%
      6 | 134 | 24 | 158 | 0.10% | 0.25% | 0.11%
      7 | 46 | 5 | 51 | 0.03% | 0.05% | 0.03%
      8 | 22 | 2 | 24 | 0.02% | 0.02% | 0.02%
      9 | 5 | 0 | 5 | 0.00% | 0.00% | 0.00%
      10 | 5 | 0 | 5 | 0.00% | 0.00% | 0.00%
      13 | 1 | 0 | 1 | 0.00% | 0.00% | 0.00%
      20 | 1 | 0 | 1 | 0.00% | 0.00% | 0.00%
      Total | 136125 | 9712 | 145837 | 100.00% | 100.00% | 100.00%
      • Around 60% of cases have 0 dependents; this group also has the highest delinquency count and the largest share of delinquent cases • Cases with more than 3 dependents make up only about 2.6% of the data
  • 16. Exploratory Analysis (Contd.) c) Debt Ratio
      DebtRatio (Binned) * SeriousDlqin2yrs Crosstabulation
      DebtRatio (Binned) | Count (SeriousDlqin2yrs = 0) | Count (SeriousDlqin2yrs = 1) | Count (Total) | % of SeriousDlqin2yrs = 0 | % of SeriousDlqin2yrs = 1 | % Total
      <= 0.25 | 24825 | 1472 | 26297 | 36.47% | 30.31% | 36.06%
      0.26 - 0.50 | 19181 | 1256 | 20437 | 28.18% | 25.86% | 28.03%
      0.51+ | 24057 | 2128 | 26185 | 35.35% | 43.82% | 35.91%
      Total | 68063 | 4856 | 72919 | 100.00% | 100.00% | 100.00%
      • Around 44% of delinquent cases come from the group with DebtRatio > 0.5
      d) RevolvingUtilizationOfUnsecuredLines
      RevolvingUtilizationOfUnsecuredLines (Binned) * SeriousDlqin2yrs Crosstabulation
      RevolvingUtilizationOfUnsecuredLines (Binned) | Count (SeriousDlqin2yrs = 0) | Count (SeriousDlqin2yrs = 1) | Count (Total) | % of SeriousDlqin2yrs = 0 | % of SeriousDlqin2yrs = 1 | % Total
      <= 0.25 | 41954 | 912 | 42866 | 61.64% | 18.78% | 58.79%
      0.26 - 0.50 | 9680 | 573 | 10253 | 14.22% | 11.80% | 14.06%
      0.51+ | 16429 | 3371 | 19800 | 24.14% | 69.42% | 27.15%
      Total | 68063 | 4856 | 72919 | 100.00% | 100.00% | 100.00%
      • Around 69% of delinquent cases come from the group with RevolvingUtilizationOfUnsecuredLines > 0.5
  • 17. Exploratory Analysis (Contd.) e) Monthly Income
      MonthlyIncome (Binned) * SeriousDlqin2yrs Crosstabulation
      MonthlyIncome (Binned) | Count (SeriousDlqin2yrs = 0) | Count (SeriousDlqin2yrs = 1) | Count (Total) | % of SeriousDlqin2yrs = 0 | % of SeriousDlqin2yrs = 1 | % Total
      <= 3100.00 | 26699 | 2494 | 29193 | 19.61% | 25.68% | 20.02%
      3100.01 - 5000.00 | 29083 | 2518 | 31601 | 21.36% | 25.93% | 21.67%
      5000.01 - 7083.00 | 25214 | 1766 | 26980 | 18.52% | 18.18% | 18.50%
      7083.01 - 10823.00 | 27435 | 1461 | 28896 | 20.15% | 15.04% | 19.81%
      10823.01+ | 27694 | 1473 | 29167 | 20.34% | 15.17% | 20.00%
      Total | 136125 | 9712 | 145837 | 100.00% | 100.00% | 100.00%
      • More than 50% of defaulters are accounted for by the two lowest income bins, which hold roughly the lowest 40% of cases • The other 3 groups each account for a broadly similar share of defaulters (15% to 18%)
  • 19. Exploratory Analysis (Contd.) All parameters below show a similar pattern, with the low income range associated with high values of the debt indicators: i) RevolvingUtilizationOfUnsecuredLines, ii) DebtRatio, iii) NumberOfTime30-59DaysPastDueNotWorse, iv) NumberOfTimes90DaysLate, v) NumberOfTime60-89DaysPastDueNotWorse, vi) NumberOfOpenCreditLinesAndLoans, vii) NumberRealEstateLoansOrLines
  • 20. Collinearity Diagnostics • Sample collinearity diagnostic results for Age vs. the other 9 independent variables are shown here • Similar diagnostics were performed for each of the 10 variables against the other variables • The Condition Index was always less than 15, indicating no collinearity between the independent variables
  • 22. Logistic Regression Model • The model is developed to classify the SeriousDlqin2yrs variable as 1 or 0 (1 indicates risk of defaulting, 0 indicates no risk) • As the proportion of cases with SeriousDlqin2yrs = 1 is just 6.7% of the total, a 50:50 stratified sampling approach is followed to build the model • The pre-processed training dataset is used to draw samples for training and validation of the model • 80% random samples are drawn from the training dataset, with equal proportions of SeriousDlqin2yrs equal to 0 and 1, and used for developing and training the model • 20% random samples are drawn from the same dataset, with equal proportions of SeriousDlqin2yrs equal to 0 and 1, and used for validation • The final model is tested using the test dataset given • Logistic regression models were developed and compared using three different approaches: with binned variables (Model 1); binned as in Model 1, but with missing data binned into another category instead of clean-up/imputation, wherever applicable (Model 2); and without binning, using the variables directly (Model 3)
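A 50:50 stratified sample with an 80/20 train/validation split, as described on this slide, can be sketched with the standard library. The project drew its samples in SPSS; the function name, the record layout and the toy population here are assumptions for illustration.

```python
import random


def balanced_split(records, label_key, n_per_class, train_fraction=0.8, seed=0):
    """Draw n_per_class cases for each label (0 and 1), then split each
    class train_fraction / (1 - train_fraction) into training and validation."""
    rng = random.Random(seed)
    cut = int(round(n_per_class * train_fraction))
    train, validation = [], []
    for label in (0, 1):
        pool = [record for record in records if record[label_key] == label]
        sample = rng.sample(pool, n_per_class)  # sampling without replacement
        train.extend(sample[:cut])
        validation.extend(sample[cut:])
    rng.shuffle(train)
    rng.shuffle(validation)
    return train, validation


# Toy population (hypothetical): 1000 cases, 67 of them with label 1 (6.7%).
population = [{"id": i, "SeriousDlqin2yrs": 1 if i < 67 else 0} for i in range(1000)]
train, validation = balanced_split(population, "SeriousDlqin2yrs", n_per_class=50)

print(len(train), "training cases,", len(validation), "validation cases")
print(sum(r["SeriousDlqin2yrs"] for r in train), "risky cases in training")
```

Splitting inside each class keeps both subsets at exactly 50:50, which a single random split of the pooled sample would not guarantee.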
  • 23. MODEL 1 – WITH BINNING
  • 24. Model 1 – With binning • The model has been developed considering business needs, and the bins have therefore been created using business cut-offs. • In the current model, cases with missing or invalid values for NumberOfDependents, NumberOfTime30-59DaysPastDueNotWorse, NumberOfTimes90DaysLate and NumberOfTime60-89DaysPastDueNotWorse have been removed, as they formed 2% of the data, and missing values in MonthlyIncome have been imputed. • RevolvingUtilizationOfUnsecuredLines and DebtRatio are percentages, so bins have been created for them. Bins were created for the Age variable as well. • Dummy variables were created for the categories in the binned variables, clubbing insignificant bins together to have better control of the model. • The training dataset comprised a stratified sample of 9000 records (4500 with SeriousDlqin2yrs = 1 and 4500 with SeriousDlqin2yrs = 0). • The model comprises 10 variables, including 4 dummy variables.
  • 25. Model 1 – Output • The logit function equation for the model is:
      logit = -0.595
        + 0.597 * NumberOfTime3059DaysPastDueNotWorse
        + 1.029 * NumberOfTimes90DaysLate
        + 0.072 * NumberRealEstateLoansOrLines
        + 0.862 * NumberOfTime6089DaysPastDueNotWorse
        + 0.030 * NumberOfOpenCreditLinesAndLoans
        - 0.025 * Age
        + 0.825 * RU_0_.25(1)
        + 0.689 * RU_0(1)
        - 0.783 * RU_GT_.5(1)
        + 0.129 * DebtRatio_GT0.25_0.5(1)
      • A cut-off value of 0.5 gave optimal results
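To show how the Model 1 equation turns into a prediction, the sketch below evaluates the logit with the coefficients from this slide and applies the logistic function and the 0.5 cut-off. It assumes the equation gives the log-odds of SeriousDlqin2yrs = 1 and that the dummy variables are coded 0/1 exactly as named; the function names and the example applicant are hypothetical.

```python
import math

INTERCEPT = -0.595
COEFFICIENTS = {
    "NumberOfTime3059DaysPastDueNotWorse": 0.597,
    "NumberOfTimes90DaysLate": 1.029,
    "NumberRealEstateLoansOrLines": 0.072,
    "NumberOfTime6089DaysPastDueNotWorse": 0.862,
    "NumberOfOpenCreditLinesAndLoans": 0.030,
    "Age": -0.025,
    "RU_0_.25": 0.825,
    "RU_0": 0.689,
    "RU_GT_.5": -0.783,
    "DebtRatio_GT0.25_0.5": 0.129,
}


def logit(features):
    """Linear predictor; features absent from the dict are treated as 0."""
    return INTERCEPT + sum(
        coefficient * features.get(name, 0)
        for name, coefficient in COEFFICIENTS.items()
    )


def probability(features):
    """Logistic transform of the linear predictor."""
    return 1.0 / (1.0 + math.exp(-logit(features)))


def classify(features, cutoff=0.5):
    """Return 1 (risky) when the predicted probability reaches the cut-off."""
    return 1 if probability(features) >= cutoff else 0


# Hypothetical applicant: 40 years old, 90+ days late twice, all else 0.
applicant = {"Age": 40, "NumberOfTimes90DaysLate": 2}
print(f"logit = {logit(applicant):.3f}, p = {probability(applicant):.3f}, class = {classify(applicant)}")
```

Treating absent features as 0 is a convenience for the example only; Model 1 itself was built on data with missing values already removed or imputed.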
  • 26. Model 1 - Variables Used Variables used: • Age • NumberOfTime3059DaysPastDueNotWorse • NumberOfTime6089DaysPastDueNotWorse • NumberOfTimes90DaysLate • NumberOfOpenCreditLinesAndLoans • NumberRealEstateLoansOrLines • DebtRatio: one dummy variable used, for the range DebtRatio >= 0.25 & < 0.5 • RevolvingUtilizationOfUnsecuredLines: 3 dummy variables used, RU_0 (where RU = 0), RU_0_.25 (where RU > 0 but < 0.25) and RU_GT_.5 (where RU >= 0.5). Observations: • MonthlyIncome was a significant variable but had a beta coefficient of 0 and was therefore dropped from the model. • MonthlyIncome and DebtRatio were affecting each other. • RevolvingUtilizationOfUnsecuredLines and DebtRatio seem to be correlated. • Bins were created for the Age variable, but all the bins contributed equally to the model, so the Age variable was used as such. • NumberOfDependents was initially thought to be a significant variable but turned out to be insignificant. Bins were created for NumberOfDependents, but the bins too were insignificant.
  • 27. Model 1 - Validation • The developed model was validated on a non-stratified random sample of 40% of the data (29168 records). • Overall accuracy: 78.62%; misclassification rate: 21.38% • Prediction accuracy for Risky (= 1) is 75.9%
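The three figures reported here (overall accuracy, misclassification rate and accuracy on the Risky class) all derive from a 2x2 classification table. The transcript does not include the counts from the slide's table, so the counts in this sketch are made up purely to show the arithmetic; the function name is also an assumption.

```python
def validation_metrics(true_risky, missed_risky, false_alarms, true_safe):
    """Compute summary metrics from the four cells of a classification table.

    true_risky:   actual 1, predicted 1
    missed_risky: actual 1, predicted 0
    false_alarms: actual 0, predicted 1
    true_safe:    actual 0, predicted 0
    """
    total = true_risky + missed_risky + false_alarms + true_safe
    return {
        "accuracy": (true_risky + true_safe) / total,
        "misclassification": (missed_risky + false_alarms) / total,
        "risky_accuracy": true_risky / (true_risky + missed_risky),
    }


# Hypothetical counts, NOT the values from the slide's validation table.
metrics = validation_metrics(true_risky=75, missed_risky=25, false_alarms=200, true_safe=700)
for name, value in metrics.items():
    print(f"{name}: {value:.2%}")
```

Accuracy and misclassification rate always sum to 100%, which gives a quick consistency check on any reported pair.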
  • 28. Model 1 – Pros and Cons • Values missing in 17% of cases have been imputed and only 2% of cases have been removed, so data loss is minimal. • The model has been developed taking into consideration widely used business cut-offs and significant parameters. • Since the model has been built on data where missing values were treated, its accuracy may drop on data where missing values are present. • Analyzing the top 10% (customers who are prone to default): 67.4% of defaulters are in the age group 30-50; 67% of defaulters had Revolving Utilization and Debt Ratio less than 0.5; 70.6%, 78.7% and 74% of the defaulters made payments on time and did not go past 30 days, 60 days and 90 days respectively; 70% of the defaulters had Monthly Income less than or equal to 7466 USD and 73.3% of the defaulters did not have any dependents. • Analyzing the bottom 10% (customers who are safe): 80% of non-defaulters are more than 40 years of age; 61% of non-defaulters had Revolving Utilization and Debt Ratio less than 0.5; 85%, 96.9% and 97.5% of the non-defaulters made payments on time and did not go past 30 days, 60 days and 90 days respectively; 70% of the non-defaulters had Monthly Income less than or equal to 8366 USD and 50.4% of the non-defaulters did not have any dependents.
  • 29. MODEL 2 – CONSIDERING MISSING VALUES
  • 30. Model 2 – Considering Missing Values • Missing values have not been imputed here; instead, an extra category has been added to the binned variables so that a missing value is treated as another category (example: NoOfDependents_Binned, shown on the slide) • Selection of variables has been based on the B, Exp(B) and Sig values • Optimal Binning has been used, based on the SeriousDlqin2yrs variable.
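Treating "missing" as a category of its own amounts to reserving one bin for absent values before dummy coding. The sketch below illustrates the idea; the cut points, function names and the choice of the last bin as reference category are all assumptions, since the transcript does not include the bins produced by SPSS Optimal Binning.

```python
def bin_with_missing(value, upper_bounds):
    """Return 0 for a missing value; otherwise the 1-based index of the first
    bin whose upper bound is >= value (values above all bounds go to the last bin)."""
    if value is None:
        return 0
    for index, bound in enumerate(upper_bounds, start=1):
        if value <= bound:
            return index
    return len(upper_bounds) + 1


def dummy_variables(bin_index, n_categories):
    """0/1 indicators for categories 0 .. n_categories - 2; the last category
    is the reference and is encoded as all zeros."""
    return [1 if bin_index == category else 0 for category in range(n_categories - 1)]


# Hypothetical cut points for NumberOfDependents: 0 | 1-2 | 3 or more.
BOUNDS = (0, 2)
N_CATEGORIES = len(BOUNDS) + 2  # missing + three value bins

for dependents in (None, 0, 2, 5):
    category = bin_with_missing(dependents, BOUNDS)
    print(dependents, "-> bin", category, "dummies", dummy_variables(category, N_CATEGORIES))
```

With this encoding a record with a missing value still produces a complete feature vector, so no case has to be dropped or imputed before scoring.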
  • 31. Model 2 – Output • Final Model:
      (1.311*Age_1) + (1.107*Age_2) + (0.898*Age_3) + (0.479*Age_4)
      + (1.802*NoOf30_1) + (2.971*NoOf30_2) + (3.445*NoOf30_3) + (3.858*NoOf30_4) + (4.001*NoOf30_5)
      + (-1.784*NoOf60_1) + (-0.362*NoOf60_2)
      + (-3.125*NoOf90_1) + (-1.311*NoOf90_2) + (-0.549*NoOf90_3)
      + 1.442
      • Training set: stratified sample of 4000 records with SeriousDlqin2yrs = 1 and another 4000 with SeriousDlqin2yrs = 0 • A cut-off value of 0.4 gave optimal results
  • 32. Model 2 - Variables Used Variables used: • Age_OptimalBin • NumberOfTime3059DaysPastDueNotWorse_OptimalBin • NumberOfTime6089DaysPastDueNotWorse_OptimalBin • NumberOfTimes90DaysLate_OptimalBin Possible reasons why some other variables are not significant: • Age has a non-linear relationship with MonthlyIncome • The other 3 variables in the equation are indicators of the number of defaults committed by the customer, which is related to NumberOfOpenCreditLinesAndLoans and RevolvingUtilizationOfUnsecuredLines • MonthlyIncome will affect the DebtRatio
  • 33. Model 2 - Validation • Multiple test runs were performed on different sample sizes • The validation table shown was for a random sample of 90000 records • Overall accuracy: 72.62%; misclassification rate: 27.38% • Prediction accuracy for Risky (= 1) is 75.1%
  • 34. Model 2 – Pros and Cons • Capable of handling missing values (including the codes 98 and 96) • The intermediate processing required is minimal (only binning is required) • The model uses only 4 variables • Optimal binning is used rather than the industry standard binning • Other insights: analyzing the top 10% (most risky customer segment), 84% of the customers are below 56 years of age and 72% have 1 or more past-30-days defaults; analyzing the bottom 10% (safest customer segment), all of them are 64 years of age or above and almost all of them have 0 defaults in every category.
  • 35. MODEL 3 – USING VARIABLES DIRECTLY
  • 36. Model 3 – Using Variables Directly • The final model has the following equation:
      0.754 + (0.031*Age) + (0.766*NumberOfTime3059DaysPastDueNotWorse) + (1.179*NumberOfTime6089DaysPastDueNotWorse) + (1.417*NumberOfTimes90DaysLate)
      • This model is the simplest, but business considerations were not accounted for, so its robustness on deployment cannot be assured • It cannot handle missing values
  • 38. Conclusion & Limitations • Model 1 and Model 2 give similar accuracy levels. Model 3 is not recommended. The choice of final model is left to the business, based on the pros and cons mentioned • These models need to be further validated for scalability and robustness • The test dataset given did not have delinquency values; hence, after validation with 20% random samples from the training dataset, no further accuracy check could be performed on a totally new set of data using the test dataset • The assumptions made when binning financial variables could change the significance of different variables in the final model. This aspect needs to be validated with the business