2. Outline
1. Project Objective
2. Process Approach
3. Data Source and Variables
4. Data Analysis
5. Data Pre-processing
6. Exploratory Analysis
7. Model development
i. Training the model
ii. Validation
8. Conclusion & Limitations
3. Project Objective
To develop a predictive model that assesses the credit risk of lending to
borrowers
• Do all borrowers have an equal probability to default?
• Is there a way to determine risk of defaulting before
processing a credit request?
• Can we classify customers into two groups, i.e. Risky and
Non-Risky, based on the nature of their financial data?
• Which are the key factors to be considered to assess risk
of lending to an individual based on historic data?
4. Process Approach
Business Understanding
1. Develop a predictive model to assess the credit risk to borrowers
2. Develop business understanding of data, relationship between variables and data sources to be used
Data Understanding
1. Get data from relevant data sources
2. Explore data for missing values, outliers, invalid data through descriptive statistics and visualization techniques
3. Understand the business relevance of outliers, missing values and invalid data and formulate the approach to treat them accordingly
Data Preparation
1. Data splitting for training and test
2. Data clean up for missing values, outliers, invalid data
3. Data binning and imputation for outlier treatment
4. Binning independent variables as per business needs
5. Data exploration for patterns and collinearity test
Modeling
1. Develop logistic regression model to classify customers into two groups based on credit risk probability
2. Train the model using 80% of training data
Evaluation
1. Validate the trained model using rest 20% of training data
2. If satisfied with accuracy percentages proceed to testing using test dataset, else go to previous step (modeling) and train the model again
Deployment
When satisfied with the test results, deploy the model to aid business take decisions based on predictions given by the model
* Software Used – Excel & SPSS
5. Data Source and Variables
• Data source is a dataset with 2,50,000 records taken from the Kaggle website. The dataset
was split into two parts: 1,50,000 cases for training and validation, and the remaining
1,00,000 cases for testing the model.
• Data Dictionary for variables in dataset:
Variable Name | Description | Type
SeriousDlqin2yrs | Person experienced 90 days past due delinquency or worse | Y/N
RevolvingUtilizationOfUnsecuredLines | Total balance on credit cards and personal lines of credit except real estate and no installment debt like car loans divided by the sum of credit limits | percentage
age | Age of borrower in years | integer
NumberOfTime30-59DaysPastDueNotWorse | Number of times borrower has been 30-59 days past due but no worse in the last 2 years | integer
DebtRatio | Monthly debt payments, alimony, living costs divided by monthly gross income | percentage
MonthlyIncome | Monthly income | real
NumberOfOpenCreditLinesAndLoans | Number of open loans (installment like car loan or mortgage) and lines of credit (e.g. credit cards) | integer
NumberOfTimes90DaysLate | Number of times borrower has been 90 days or more past due | integer
NumberRealEstateLoansOrLines | Number of mortgage and real estate loans including home equity lines of credit | integer
NumberOfTime60-89DaysPastDueNotWorse | Number of times borrower has been 60-89 days past due but no worse in the last 2 years | integer
NumberOfDependents | Number of dependents in family excluding themselves (spouse, children etc.) | integer
7. Descriptive Statistics
• There are 1,50,000 cases in the training dataset.
• Out of 11 variables available, SeriousDlqin2yrs is the binary dependent variable for which the model has to be developed.
• MonthlyIncome has a large number of missing values. NumberOfDependents also has some missing values.
• There are many extreme values (outliers) for RevolvingUtilizationOfUnsecuredLines, DebtRatio and MonthlyIncome, as indicated by their high standard deviations.
8. Missing Value Analysis
NumberOfDependents: about 2.6% of values are missing (less than 5%), hence these
cases could be removed
MonthlyIncome: around 20% of values are missing, which is quite high, so these
values need to be imputed
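The decision rule implied above (remove cases when fewer than 5% of values are missing, impute otherwise) can be sketched in Python. This is only an illustration: the original work was done in Excel and SPSS, the function names are hypothetical, and missing values are assumed to be represented as `None`.

```python
def missing_share(values):
    """Fraction of entries in a column that are missing (represented as None)."""
    return sum(v is None for v in values) / len(values)


def treatment(share, threshold=0.05):
    """Rule of thumb from the slide: remove cases below the threshold, impute otherwise."""
    return "remove cases" if share < threshold else "impute"


# Shares quoted on the slide: about 2.6% for NumberOfDependents, around 20% for MonthlyIncome.
for name, share in [("NumberOfDependents", 0.026), ("MonthlyIncome", 0.20)]:
    print(name, "->", treatment(share))
```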
10. Data Cleaning Steps
Invalid data identified below to be removed in the Excel sheet:
• Age variable: one case showing 0
• Variables NumberOfTime30-59DaysPastDueNotWorse, NumberOfTimes90DaysLate
and NumberOfTime60-89DaysPastDueNotWorse contain cases with values 96
and 98, which indicate ‘Don’t know’ and ‘Refused to say’. They are very few in
number and common to all three variables.
Data Formatting in Excel
Variables RevolvingUtilizationOfUnsecuredLines and DebtRatio to be changed from
General to Number format
Imputation in SPSS:
• Imputation for missing values in MonthlyIncome
• 5 imputations done using all independent variables; the results of the 5th imputation
were taken for training
11. Descriptive Statistics After Data Cleaning
• After data cleaning, the total number of cases is down to 145837
• Outliers in variables DebtRatio, MonthlyIncome and RevolvingUtilizationOfUnsecuredLines to be removed through binning
12. Variable Binning
Binning done for the following variables:
• Age: variable Age_Binning containing bins for age groups:
Age Group | Bin
21-30 | 1
31-40 | 2
41-50 | 3
51-60 | 4
>60 | 5
• DebtRatio & RevolvingUtilizationOfUnsecuredLines: created variables
DebtRatio_Binning and RevolvingUtilizationOfUnsecuredLines_Binning with the
following cut-off values:
Group | Bin | Remark
<= 0.25 | 1 | Good
0.25 - 0.50 | 2 | Low Risk
> 0.50 | 3 | High Risk
• MonthlyIncome: variable MonthlyIncome_Binning with 5 bins of roughly equal frequency (about 20% of cases each)
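The age and ratio binning rules above can be sketched in Python as follows. This is an illustrative sketch, not the Excel/SPSS procedure actually used: the function names `age_bin` and `ratio_bin` are hypothetical, values exactly on a cut-off (30, 0.25, 0.50) are assumed to belong to the lower bin, and ages below 21 are assumed to fall into bin 1.

```python
from bisect import bisect_left

# Inclusive upper edges of the age bins 21-30, 31-40, 41-50, 51-60; anything above is bin 5.
AGE_UPPER_EDGES = [30, 40, 50, 60]

# Inclusive upper edges for DebtRatio / RevolvingUtilizationOfUnsecuredLines bins.
RATIO_UPPER_EDGES = [0.25, 0.50]
RATIO_REMARKS = {1: "Good", 2: "Low Risk", 3: "High Risk"}


def age_bin(age):
    """Return bin 1-5 for an age in years (assumption: ages under 21 map to bin 1)."""
    return bisect_left(AGE_UPPER_EDGES, age) + 1


def ratio_bin(value):
    """Return bin 1-3 for a ratio such as DebtRatio (assumption: 0.25 and 0.50 go to the lower bin)."""
    return bisect_left(RATIO_UPPER_EDGES, value) + 1


print(age_bin(45), ratio_bin(0.62), RATIO_REMARKS[ratio_bin(0.62)])
```

Keeping the cut-offs in plain lists makes it easy to swap in different business thresholds without touching the lookup logic.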
14. Exploratory Analysis (Using SPSS)
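Delinquency over different categories
a) Age
Age_Binning * SeriousDlqin2yrs Crosstabulation
Age_Binning | SeriousDlqin2yrs = 0 (count) | SeriousDlqin2yrs = 1 (count) | Total (count) | % of SeriousDlqin2yrs = 0 | % of SeriousDlqin2yrs = 1 | % of Total
21 - 30 | 7374 | 940 | 8314 | 5.42% | 9.68% | 5.70%
31 - 40 | 20562 | 2285 | 22847 | 15.11% | 23.53% | 15.67%
41 - 50 | 31130 | 2828 | 33958 | 22.87% | 29.12% | 23.28%
51 - 60 | 32334 | 2213 | 34547 | 23.75% | 22.79% | 23.69%
60 + | 44725 | 1446 | 46171 | 32.86% | 14.89% | 31.66%
Total | 136125 | 9712 | 145837 | 100.00% | 100.00% | 100.00%
• Disproportionate percentage of samples for the dependent variable (only 9712 of 145837 cases have SeriousDlqin2yrs = 1). Sampling of the training dataset is required to remove bias in model development
• Maximum customers are from the 60+ age group
• Delinquency count is highest for the 41-50 age group and lowest for the 21-30 age group. Relative to group size, however, the delinquency rate is highest for 21-30 (940 of 8314, about 11.3%) and lowest for 60+ (1446 of 46171, about 3.1%)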
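The difference between a group's share of all delinquencies and its own delinquency rate can be checked directly from the crosstabulation counts. The sketch below copies the counts from the Age table; the helper name `delinquency_rates` is hypothetical.

```python
# age group -> (count with SeriousDlqin2yrs = 0, count with SeriousDlqin2yrs = 1)
AGE_CROSSTAB = {
    "21-30": (7374, 940),
    "31-40": (20562, 2285),
    "41-50": (31130, 2828),
    "51-60": (32334, 2213),
    "60+": (44725, 1446),
}


def delinquency_rates(crosstab):
    """Return {group: percentage of that group's cases with SeriousDlqin2yrs = 1}."""
    return {
        group: 100.0 * bad / (good + bad)
        for group, (good, bad) in crosstab.items()
    }


rates = delinquency_rates(AGE_CROSSTAB)
total_bad = sum(bad for _, bad in AGE_CROSSTAB.values())
total_all = sum(good + bad for good, bad in AGE_CROSSTAB.values())
overall_rate = 100.0 * total_bad / total_all

for group, rate in rates.items():
    print(f"{group}: {rate:.2f}%")
print(f"overall: {overall_rate:.2f}%")
```

With these counts the rate falls steadily with age, which is why the group with the most delinquent cases (41-50) is not the group with the highest risk.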
15. Exploratory Analysis (Contd.)
b) Number of Dependents
• Around 60% of the data have 0 dependents; this group also has the highest delinquency count and the largest share of all delinquencies (51.40%)
• Total share of data with more than 3 dependents is only around 2.6%
NumberOfDependents * SeriousDlqin2yrs Crosstabulation
NumberOfDependents | SeriousDlqin2yrs = 0 (count) | SeriousDlqin2yrs = 1 (count) | Total (count) | % of SeriousDlqin2yrs = 0 | % of SeriousDlqin2yrs = 1 | % of Total
0 | 81722 | 4992 | 86714 | 60.03% | 51.40% | 59.46%
1 | 24372 | 1921 | 26293 | 17.90% | 19.78% | 18.03%
2 | 17930 | 1571 | 19501 | 13.17% | 16.18% | 13.37%
3 | 8646 | 833 | 9479 | 6.35% | 8.58% | 6.50%
4 | 2564 | 296 | 2860 | 1.88% | 3.05% | 1.96%
5 | 677 | 68 | 745 | 0.50% | 0.70% | 0.51%
6 | 134 | 24 | 158 | 0.10% | 0.25% | 0.11%
7 | 46 | 5 | 51 | 0.03% | 0.05% | 0.03%
8 | 22 | 2 | 24 | 0.02% | 0.02% | 0.02%
9 | 5 | 0 | 5 | 0.00% | 0.00% | 0.00%
10 | 5 | 0 | 5 | 0.00% | 0.00% | 0.00%
13 | 1 | 0 | 1 | 0.00% | 0.00% | 0.00%
20 | 1 | 0 | 1 | 0.00% | 0.00% | 0.00%
Total | 136125 | 9712 | 145837 | 100.00% | 100.00% | 100.00%
16. Exploratory Analysis (Contd.)
c) Debt Ratio
DebtRatio (Binned) * SeriousDlqin2yrs Crosstabulation
DebtRatio (Binned) | SeriousDlqin2yrs = 0 (count) | SeriousDlqin2yrs = 1 (count) | Total (count) | % of SeriousDlqin2yrs = 0 | % of SeriousDlqin2yrs = 1 | % of Total
<= 0.25 | 24825 | 1472 | 26297 | 36.47% | 30.31% | 36.06%
0.26 - 0.50 | 19181 | 1256 | 20437 | 28.18% | 25.86% | 28.03%
0.51+ | 24057 | 2128 | 26185 | 35.35% | 43.82% | 35.91%
Total | 68063 | 4856 | 72919 | 100.00% | 100.00% | 100.00%
Around 44% of delinquencies come from the group with DebtRatio > 0.5
d) RevolvingUtilizationOfUnsecuredLines
RevolvingUtilizationOfUnsecuredLines (Binned) * SeriousDlqin2yrs Crosstabulation
RevolvingUtilizationOfUnsecuredLines (Binned) | SeriousDlqin2yrs = 0 (count) | SeriousDlqin2yrs = 1 (count) | Total (count) | % of SeriousDlqin2yrs = 0 | % of SeriousDlqin2yrs = 1 | % of Total
<= 0.25 | 41954 | 912 | 42866 | 61.64% | 18.78% | 58.79%
0.26 - 0.50 | 9680 | 573 | 10253 | 14.22% | 11.80% | 14.06%
0.51+ | 16429 | 3371 | 19800 | 24.14% | 69.42% | 27.15%
Total | 68063 | 4856 | 72919 | 100.00% | 100.00% | 100.00%
Around 69% of delinquencies come from the group with RevolvingUtilizationOfUnsecuredLines > 0.5
17. Exploratory Analysis (Contd.)
e) Monthly Income
MonthlyIncome (Binned) * SeriousDlqin2yrs Crosstabulation
MonthlyIncome (Binned) | SeriousDlqin2yrs = 0 (count) | SeriousDlqin2yrs = 1 (count) | Total (count) | % of SeriousDlqin2yrs = 0 | % of SeriousDlqin2yrs = 1 | % of Total
<= 3100.00 | 26699 | 2494 | 29193 | 19.61% | 25.68% | 20.02%
3100.01 - 5000.00 | 29083 | 2518 | 31601 | 21.36% | 25.93% | 21.67%
5000.01 - 7083.00 | 25214 | 1766 | 26980 | 18.52% | 18.18% | 18.50%
7083.01 - 10823.00 | 27435 | 1461 | 28896 | 20.15% | 15.04% | 19.81%
10823.01+ | 27694 | 1473 | 29167 | 20.34% | 15.17% | 20.00%
Total | 136125 | 9712 | 145837 | 100.00% | 100.00% | 100.00%
• More than 50% of defaulters are accounted for by the two lowest income bins (about 40% of cases)
• The other 3 groups have more or less the same percentage of defaulters
19. Exploratory Analysis (Contd.)
All parameters below show a similar pattern: the low
income range is associated with high values of the debt
indicators
i) RevolvingUtilizationOfUnsecuredLines,
ii) DebtRatio,
iii) NumberOfTime30-59DaysPastDueNotWorse,
iv) NumberOfTimes90DaysLate,
v) NumberOfTime60-89DaysPastDueNotWorse,
vi) NumberOfOpenCreditLinesAndLoans,
vii) NumberRealEstateLoansOrLines
20. Collinearity Diagnostics
Sample collinearity diagnostic results for Age
vs. the other 9 independent variables shown here
Similar diagnostics were performed for each of the
10 variables against the other variables
The Condition Index was always less than 15,
indicating no collinearity problem among the
independent variables
22. Logistic Regression Model
The model is developed to classify the SeriousDlqin2yrs variable as 1 or 0
• 1 indicates risk of defaulting
• 0 indicates no risk
As the proportion of cases with SeriousDlqin2yrs = 1 is just 6.7% of the total, a 50:50 stratified sampling approach is
followed to build the model
The pre-processed training dataset is used to draw samples for training and validation of the model
80% random samples are drawn from the training dataset with equal proportions of SeriousDlqin2yrs equal to 0 and 1
and used for developing and training the model
20% random samples are drawn from the same dataset with equal proportions of SeriousDlqin2yrs equal to 0 and 1 and
used for validation
The final model is tested using the test dataset given
Logistic regression models were developed and compared using three different approaches:
• With binned variables (Model 1)
• Binned variables as in Model 1, but with missing data binned into a separate category instead of clean-up/imputation,
wherever applicable (Model 2)
• A model without binning, using the variables directly (Model 3)
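One possible reading of the 50:50 stratified sampling with an 80/20 training/validation split is sketched below in Python. It is an assumption-laden illustration, not the SPSS procedure used: the function name `balanced_train_validation`, the `per_class` parameter and the fixed random seed are all hypothetical, and records are assumed to be dictionaries keyed by variable name.

```python
import random


def balanced_train_validation(records, label_key="SeriousDlqin2yrs",
                              per_class=None, train_fraction=0.8, seed=0):
    """Draw the same number of 0 and 1 cases, then split each class into
    training and validation parts (80/20 by default)."""
    rng = random.Random(seed)
    pools = {label: [r for r in records if r[label_key] == label]
             for label in (0, 1)}
    # Never ask for more cases than the smaller class can supply.
    n = min(len(pools[0]), len(pools[1]))
    if per_class is not None:
        n = min(n, per_class)

    train, validation = [], []
    for label in (0, 1):
        sample = rng.sample(pools[label], n)
        cut = int(round(train_fraction * n))
        train.extend(sample[:cut])
        validation.extend(sample[cut:])
    # Shuffle so that the two classes are interleaved.
    rng.shuffle(train)
    rng.shuffle(validation)
    return train, validation
```

Sampling each class separately guarantees the 50:50 balance in both parts, which a single random split of the full dataset would not.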
24. Model 1 – With binning
• The model has been developed considering business needs, and therefore the bins have been
created using business cut-offs.
• In the current model, cases with missing values for NumberOfDependents, NumberOfTime30-59DaysPastDueNotWorse,
NumberOfTimes90DaysLate and NumberOfTime60-89DaysPastDueNotWorse have been removed, as they formed 2% of the data, and
missing values in MonthlyIncome have been imputed.
• Since RevolvingUtilizationOfUnsecuredLines and DebtRatio are percentages, bins have
been created for them. Bins were created for the Age variable as well.
• Dummy variables were created for the categories in the binned variables, clubbing insignificant
bins together to have better control of the model.
• The training dataset comprised a stratified sample of 9000 records (4500 with SeriousDlqin2yrs = 1 and
4500 with SeriousDlqin2yrs = 0).
• The model comprises 10 variables, including 4 dummy variables.
25. Model 1 – Output
• The logit function equation for the model, where p is the predicted probability that SeriousDlqin2yrs = 1, is:
logit(p) = -0.595
  + 0.597 * NumberOfTime3059DaysPastDueNotWorse
  + 1.029 * NumberOfTimes90DaysLate
  + 0.072 * NumberRealEstateLoansOrLines
  + 0.862 * NumberOfTime6089DaysPastDueNotWorse
  + 0.030 * NumberOfOpenCreditLinesAndLoans
  - 0.025 * Age
  + 0.825 * RU_0_.25(1)
  + 0.689 * RU_0(1)
  - 0.783 * RU_GT_.5(1)
  + 0.129 * DebtRatio_GT0.25_0.5(1)
• A cut-off value of 0.5 gave optimal results
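To show how the logit equation and the cut-off turn into a Risky / Non-Risky decision, here is a minimal Python sketch using the coefficients quoted on this slide. Several things are assumptions: the function names are hypothetical, a probability equal to the cut-off is treated as Risky, and the slide does not state how the SPSS `(1)` dummy suffix maps to 0/1 inputs, so the sketch simply multiplies each coefficient by whatever value the caller supplies.

```python
import math

# Coefficients copied from the Model 1 logit equation.
MODEL1_INTERCEPT = -0.595
MODEL1_COEFFICIENTS = {
    "NumberOfTime3059DaysPastDueNotWorse": 0.597,
    "NumberOfTimes90DaysLate": 1.029,
    "NumberRealEstateLoansOrLines": 0.072,
    "NumberOfTime6089DaysPastDueNotWorse": 0.862,
    "NumberOfOpenCreditLinesAndLoans": 0.030,
    "Age": -0.025,
    "RU_0_.25": 0.825,
    "RU_0": 0.689,
    "RU_GT_.5": -0.783,
    "DebtRatio_GT0.25_0.5": 0.129,
}


def model1_logit(features):
    """Linear predictor; variables missing from `features` are treated as 0 (assumption)."""
    return MODEL1_INTERCEPT + sum(
        coef * features.get(name, 0.0)
        for name, coef in MODEL1_COEFFICIENTS.items()
    )


def model1_probability(features):
    """Logistic transform of the linear predictor."""
    return 1.0 / (1.0 + math.exp(-model1_logit(features)))


def classify(features, cutoff=0.5):
    """Return 1 (Risky) when the probability reaches the cut-off, else 0 (Non-Risky)."""
    return 1 if model1_probability(features) >= cutoff else 0


# Hypothetical borrower: age 40, two 90-day delinquencies, everything else 0.
print(classify({"Age": 40, "NumberOfTimes90DaysLate": 2}))
```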
26. Model 1 - Variables Used
Variables used
• Age
• NumberOfTime3059DaysPastDueNotWorse
• NumberOfTime6089DaysPastDueNotWorse
• NumberOfTimes90DaysLate
• NumberOfOpenCreditLinesAndLoans
• NumberRealEstateLoansOrLines
• DebtRatio – Dummy variable used for the range DebtRatio >= 0.25 & < 0.5
• RevolvingUtilizationOfUnsecuredLines – Used 3 dummy variables: RU_0 (where RU = 0), RU_0_.25
(where RU > 0 but < 0.25) and RU_GT_.5 (where RU >= 0.5).
Observations
• MonthlyIncome was a significant variable but had a Beta coefficient of 0, therefore it was dropped from
the model.
• MonthlyIncome and DebtRatio were affecting each other.
• RevolvingUtilizationOfUnsecuredLines and DebtRatio seem to be correlated.
• Though bins were created for the Age variable, all the bins were contributing equally to the model,
therefore the Age variable was used as such.
• NumberOfDependents was initially thought to be a significant variable but turned out to be insignificant.
Bins were created for the NumberOfDependents variable, but the bins too were insignificant.
27. Model 1 - Validation
• Validated the developed model on a non-stratified random sample of 20% of the data
(29168 records).
• Overall accuracy: 78.62% and misclassification rate: 21.38%
• Prediction accuracy for Risky (= 1) is 75.9%
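The three validation figures reported above can be computed from actual and predicted labels as sketched below. The function name is hypothetical, and "prediction accuracy for Risky (= 1)" is assumed here to mean the share of actual Risky cases that the model predicts as Risky; the slide does not define it.

```python
def validation_metrics(actual, predicted):
    """Overall accuracy, misclassification rate and accuracy on the Risky (= 1) class."""
    if not actual or len(actual) != len(predicted):
        raise ValueError("actual and predicted must be non-empty and of equal length")
    pairs = list(zip(actual, predicted))
    accuracy = sum(a == p for a, p in pairs) / len(pairs)
    risky = [p for a, p in pairs if a == 1]
    # Assumption: share of actual Risky cases predicted as Risky.
    risky_accuracy = sum(p == 1 for p in risky) / len(risky) if risky else float("nan")
    return {
        "accuracy": accuracy,
        "misclassification": 1.0 - accuracy,
        "risky_accuracy": risky_accuracy,
    }


# Toy labels for illustration only; these are not the project's validation data.
print(validation_metrics([1, 1, 0, 0, 0], [1, 0, 0, 0, 1]))
```

Reporting the Risky-class figure alongside overall accuracy matters here because only a small minority of cases are delinquent, so a model that always predicts 0 would still score a high overall accuracy.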
28. Model 1 – Pros and Cons
Pros:
• 17% of the missing values have been imputed and only 2% have been removed, so data loss is minimal.
• The model has been developed taking into consideration widely used business cut-offs and significant
parameters.
Cons:
• Since the model has been built on data where missing values were treated, its accuracy may drop
on data where missing values are present.
Analyzing Top 10% (customers who are prone to default)
• 67.4% of defaulters are in the age group 30-50
• 67% of defaulters had Revolving Utilization and Debt Ratio less than 0.5
• 70.6%, 78.7% and 74% of the defaulters made payments on time and did not go past 30 days, 60 days and
90 days respectively.
• 70% of the defaulters had Monthly Income less than or equal to 7466 USD and 73.3% of the defaulters did
not have any dependents.
Analyzing Bottom 10% (customers who are safe)
• 80% of non-defaulters are more than 40 years of age.
• 61% of non-defaulters had Revolving Utilization and Debt Ratio less than 0.5
• 85%, 96.9% and 97.5% of the non-defaulters made payments on time and did not go past 30 days, 60 days
and 90 days respectively.
• 70% of the non-defaulters had Monthly Income less than or equal to 8366 USD and 50.4% of the
non-defaulters did not have any dependents.
30. Model 2 – Considering Missing Values
• Missing values have not been imputed here; instead, an extra category has been added to
the binned variables so that a missing value is treated as another category. (Example:
NoOfDependents_Binned shown below)
• Selection of variables has been based on B, Exp(B) and Sig. values
• Optimal binning has been used, based on the SeriousDlqin2yrs variable.
31. Model 2 – Output
• Final model:
logit(p) = 1.442
  + 1.311 * Age_1 + 1.107 * Age_2 + 0.898 * Age_3 + 0.479 * Age_4
  + 1.802 * NoOf30_1 + 2.971 * NoOf30_2 + 3.445 * NoOf30_3 + 3.858 * NoOf30_4 + 4.001 * NoOf30_5
  - 1.784 * NoOf60_1 - 0.362 * NoOf60_2
  - 3.125 * NoOf90_1 - 1.311 * NoOf90_2 - 0.549 * NoOf90_3
• Training set: stratified sample of 4000 records with SeriousDlqin2yrs = 1 and another
4000 with SeriousDlqin2yrs = 0
• A cut-off value of 0.4 gave optimal results
32. Model 2 - Variables Used
Variables used
• Age_OptimalBin
• NumberOfTime3059DaysPastDueNotWorse_OptimalBin
• NumberOfTime6089DaysPastDueNotWorse_OptimalBin
• NumberOfTimes90DaysLate_OptimalBin
Possible reasons why the other variables are not significant
• Age has a non-linear relationship with MonthlyIncome
• The other 3 variables in the equation are indicators of the number of defaults committed by
the customer, which are related to NumberOfOpenCreditLinesAndLoans and
RevolvingUtilizationOfUnsecuredLines
• MonthlyIncome affects the DebtRatio
33. Model 2 - Validation
• Multiple test runs have been performed on different sample sizes
• The validation table given below was for a random sample of 90000.
• Overall accuracy 72.62% and misclassification 27.37%
• Risky (= 1) prediction accuracy of 75.1%
34. Model 2 – Pros and Cons
Capable of handling missing values (including the codes 98 and 96)
Intermediate processing required is minimal (only binning required)
The model uses only 4 variables
Optimal binning is used rather than the industry-standard binning
Other insights
• Analyzing top 10% (most risky customer segment)
84% of the customers are below 56 years of age
72% have 1 or more past-30-days defaults
• Analyzing bottom 10% (safest customer segment)
All of them are 64 years of age or above
Almost all of them have 0 defaults in every past-due category.
36. Model 3 – Using Variables Directly
• Final model has the following equation:
logit(p) = 0.754
  + 0.031 * Age
  + 0.766 * NumberOfTime3059DaysPastDueNotWorse
  + 1.179 * NumberOfTime6089DaysPastDueNotWorse
  + 1.417 * NumberOfTimes90DaysLate
• This model is the simplest, but business considerations were not accounted for, hence its
robustness on deployment cannot be assured
• It cannot handle missing values
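Because Model 3 uses the raw variables, its coefficients can be read directly as odds ratios, the Exp(B) column in SPSS output. The Python sketch below assumes the equation above is the logit of the probability that SeriousDlqin2yrs = 1; the function names are hypothetical.

```python
import math

# Coefficients copied from the Model 3 equation.
MODEL3_INTERCEPT = 0.754
MODEL3_COEFFICIENTS = {
    "Age": 0.031,
    "NumberOfTime3059DaysPastDueNotWorse": 0.766,
    "NumberOfTime6089DaysPastDueNotWorse": 1.179,
    "NumberOfTimes90DaysLate": 1.417,
}


def odds_ratios(coefficients):
    """Exp(B): multiplicative change in the odds for a one-unit increase in each variable."""
    return {name: math.exp(b) for name, b in coefficients.items()}


def model3_probability(features):
    """Predicted probability; variables missing from `features` are treated as 0 (assumption)."""
    z = MODEL3_INTERCEPT + sum(
        b * features.get(name, 0.0) for name, b in MODEL3_COEFFICIENTS.items()
    )
    return 1.0 / (1.0 + math.exp(-z))


for name, ratio in odds_ratios(MODEL3_COEFFICIENTS).items():
    print(f"{name}: odds multiplied by {ratio:.2f} per unit")
```

Under this reading, each additional past-due event multiplies the odds of delinquency by Exp(B) while the other variables are held fixed.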
38. Conclusion & Limitations
• Model 1 and Model 2 give similar accuracy levels. Model 3 is not
recommended. The choice of final model is left to the business, based on the
pros and cons mentioned.
• These models are to be further validated for scalability and robustness.
• The test dataset given did not have delinquency values; hence, after
validation with 20% random samples from the training dataset, no further
accuracy check could be performed on a totally new set of data using
the test dataset.
• Assumptions made in binning financial variables could change the
significance of different variables in the final model. This aspect is to be
validated with the business.