A detailed look at the analytical steps required to generate reliable insights: univariate, bivariate, and multivariate analysis, OLS and logistic models, etc.
Put together to train friends and mentees. Based on personal learning and research, with no proprietary information and no claim of 100% accuracy. Every institution, organization, and team uses its own steps and methodologies, so please use whichever is relevant for you; this is for training purposes only.
Intended for Knowledge Sharing only
ANALYTICAL PROCESS OVERVIEW
Business Problem Characterization → Data Consolidation → Data Treatment → Modeling & Analysis* → Recommendations & Implementation Strategy
● TRANSFORMATION: Conversion of field formats from other types to numeric
● MISSING VALUE TREATMENTS: Imputation of missing values based on mean, etc.
● CAPPING TREATMENTS: Capping of extreme and nonsensical values
● NORMALIZATION of all the variables to remove the effect of each variable's distribution on subsequent analytical steps
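As a sketch of the normalization step (the deck's tooling is SAS; this is an equivalent illustration in Python with made-up loan amounts), z-score scaling centers a variable and rescales it to unit variance so that scale differences between variables do not dominate later steps:

```python
import numpy as np

# Hypothetical loan amounts (illustrative values only)
loanamt = np.array([19900.0, 22100.0, 42000.0, 21760.0, 18000.0])

# Z-score normalization: subtract the mean, divide by the standard deviation
z = (loanamt - loanamt.mean()) / loanamt.std()
```

After this, every normalized variable has mean 0 and standard deviation 1.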
● The relationship between the variable of interest and the drivers has to be established with significant confidence and stability through mathematical modeling techniques like regression, decision trees, etc.
● Based on the understanding of the relationships between the events of interest and their drivers, suitable business strategies can be developed to address the business problem.
● TRANSLATION OF BUSINESS PROBLEM INTO A STATISTICAL FRAMEWORK: Decision on the analytical technique, data processing, and final outcomes
● HYPOTHESIZE the predictor variables' relationships with the dependent variable**
Note:
* Modeling & Analysis is generally preceded by a Clustering phase where all observations are grouped into homogeneous clusters, similar in characteristics within and dissimilar from the other clusters, to remove exogenous errors in findings.
** THE DEPENDENT VARIABLE has to be defined keeping in mind the business objective, data availability, and forecast period.
● RECONCILIATION OF DATA FROM VARIOUS SOURCES into an Analysis Master Dataset
CONTENTS
METHODOLOGY
OVERVIEW
DATA COLLECTION
DATA PREPARATION
MODELING & ANALYSIS
PERFORMANCE DIAGNOSTICS
BEST PRACTICES
REFERENCES
Data Specification Document
•Hypothesized predictor variables necessary
for solving the business problem
•Availability of Data in the various data
sources
•Form of the data
Data Integration Plan
•Reconciliation of the data from various
sources into one single analysis master
dataset
•The Data Integration (DI) Report describes the presence of data across the merged tables
Data Gap Analysis
•Information that is critical per the hypothesis but not available in the data sources is listed here, so that it can be captured in the future
A thorough understanding of the data sources is essential to plan the extraction in the fastest and most efficient
way….
….DI Report assumes significance in the later stages of Data preparation where the missing information
because of unavailability of data has a particular meaning and so should not be imputed as others.
DATA COLLECTION
Master Dataset, reconciled from: NBP, CLV Acxiom, Warehouse Data, Analytics Tables, Bureau Data
• Customer • Payments • Click-Stream • Cards
→ Subsequent Data Treatments and Analytical Steps
CONTENTS
METHODOLOGY
OVERVIEW
DATA COLLECTION
DATA PREPARATION
MODELING & ANALYSIS
PERFORMANCE DIAGNOSTICS
BEST PRACTICES
REFERENCES
….Variable reduction is a very critical step in arriving at the best predictors for the subsequent analytical steps
DATA PREPARATION
Univariate Analysis Bivariate Analysis Variable Reduction
Certain thumb rules:
%Missing <= 5: single-value imputation
5 < %Missing <= 20: bivariate-based value imputation
20 < %Missing <= 40: imputation based on modeling with other independent variables
%Missing > 40: drop the variable
•Removal of extreme and nonsensical values to achieve a better distribution in the variables
•Variable transformations (log, exp, etc.) depending on the degree of relationship observed in bivariate plots
•Dummy/binning variable creation depending on the nature of the relationship
•Selection/dropping of variables based on the strength of relationship, by trend and/or significance of a chi-square test
•Redundancy checks and removal using indicators like VIF, CI, and factor loadings
•Business sense is also used in selecting variables for modeling
Data preparation begins with data distribution studies, needed for the missing and capping treatments; followed by data sanity checks on groups of variables….
Missing Treatment | Capping Treatment | Variable Transformations | Selection/Dropping | Multi-collinearity Checks | PCA/FA/Varclus
•Selection/dropping based on the factor loadings of the variables on the significant PCs/factors
Capping treatment is another critical treatment, where nonsensical and extreme observations are removed to achieve stability in parameter estimates….
….It should always precede missing treatment, so that the imputed values for missing observations follow better distributions.
DATA PREPARATION
Capping treatment has to consider:
i. Distribution: if it's a categorical variable, it should not be capped, etc.
ii. Niche characteristics: if the outlier values describe a niche group of customers who are outliers in other variables as well, they should not be capped
iii. Business information: certain nonsensical values signify something (like missing, etc.); they should be capped to a value nearest to the most sensible end of the range, but kept outside it, so that the actual information is not lost
CAPPING TREATMENT
Back to Dataprep
Capping treatment is necessary to remove the following two types of observations:
i. Outliers: extreme observations in the dependent variable, leading to high residuals in predictions
ii. Influential observations: outliers on the independent-variable side, leading to unstable/wrong parameter estimates
(Chart: amount transacted ($) vs. count of transactions, with outliers and influential observations marked)
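A minimal capping sketch (the deck's examples use SAS; this is an equivalent illustration in Python). The values echo the slide's table, including one nonsensical amount; the IQR-based thresholds are one common rule of thumb, and the actual cutoffs are a business decision:

```python
import numpy as np

# Hypothetical loan amounts with one nonsensical value (9999999)
amounts = np.array([19900, 22100, 42000, 21760, 18000, 15500,
                    25150, 17800, 9999999, 31200], dtype=float)

# Cap at Q1 - 1.5*IQR and Q3 + 1.5*IQR (Tukey's rule; thresholds vary in practice)
q1, q3 = np.percentile(amounts, [25, 75])
iqr = q3 - q1
capped = np.clip(amounts, q1 - 1.5 * iqr, q3 + 1.5 * iqr)
```

Note that capping runs before missing-value imputation, so that means/medians computed for imputation are not distorted by the extremes.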
Missing treatment is inevitable, since the entire record is deleted if any variable has missing information….
….It is also the most complex treatment, as each variable has to be treated differently based on its meaning, missing content, and data-integrity issues.
DATA PREPARATION
No. Pyoffflg Prin0105 Loanamt Term Fixed Agnsttr Bbctrad Nummortt Rvoptbal Numminq Numminq3
1 0 2324.9 19900 360 1 21 282 1 282 0 0
2 0 3796.5 22100 240 0 6 6911 1 33978 1 1
3 1 12523.2 42000 360 1 1 36350 . 36732 1 1
4 0 5190.9 21760 349 1 42 885 1 911 0 0
5 1 53.6 18000 360 1 5 8851 1 9506 0 0
6 0 1256.9 15500 360 . 13 409 1 760 0 0
7 0 4403.3 25150 900 1 3 21417 5 23579 3 1
8 0 3137.2 17800 240 1 4 4528 2 5967 1 0
9 0 4256.5 9999999 360 1 9 18179 47 130683 4 1
10 0 6442.4 31200 360 1 34 33177 1 0 2 0
(In the table above: '.' marks missing observations; values like 9999999 are unrealistic)
Missing value imputation has to be done based on:
i. Meaning of the variable: e.g., a flag can take either 1 or 0, depending on the coding
ii. Distribution: for a continuous variable like an amount with low missing content, use the mean, etc.
iii. Missing due to a merging issue: it depends on whether the value was available in a particular table; e.g., if a record is not present in the Restrictions table, a missing "freq_restrictions" can take the value 0, but if it was present in the Restrictions table and is still missing, it should take the median value
iv. Correlation with other predictors: the missing value in a variable can also depend on other variables in the dataset; e.g., if "amount_received" is missing, the imputed amount can depend on the size of the merchant, the average amount received in prior months, the type of products sold, the industry average, etc.
MISSING VALUE TREATMENT
Back to Dataprep
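The imputation rules above can be sketched in Python on a toy frame (the deck's tooling is SAS; column names echo the slide's table, and the Restrictions-table scenario is hypothetical):

```python
import numpy as np
import pandas as pd

# Toy data: a payoff flag, a continuous loan amount, and a count that is
# missing because the record had no match in a hypothetical Restrictions table
df = pd.DataFrame({
    "pyoffflg": [0.0, 1.0, np.nan, 0.0],
    "loanamt": [19900.0, np.nan, 42000.0, 21760.0],
    "freq_restrictions": [2.0, np.nan, 1.0, np.nan],
})

df["pyoffflg"] = df["pyoffflg"].fillna(0)                    # flag: impute the non-event code
df["loanamt"] = df["loanamt"].fillna(df["loanamt"].mean())   # continuous, low %missing: mean
df["freq_restrictions"] = df["freq_restrictions"].fillna(0)  # no match in merge => truly 0
```

Each column gets a different rule, which is exactly why this treatment is the most labor-intensive.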
(Chart: mean transaction amount ($) vs. mean months-on-book (MOB), an example bivariate plot)
Bivariate analysis explores the nature and degree of relationship between the independent and dependent
variables ….
….and is necessary to achieve stable and accurate predictions apart from arriving at the correct recommendations
DATA PREPARATION
BIVARIATE ANALYSIS
Back to Dataprep
Dep Var = f(Indep Var, Log(Indep Var), Sin(Indep Var),….)
Significant estimate with large
magnitude
Insignificant estimate
Transformations required
Bivariate chart analysis: mean dependent-variable value vs. class → dummy creation for certain classes; variable dropping if no trend or relationship
(Chart: mean transaction amount ($) vs. mean count of restrictions, showing no relationship beyond the split; Dummy = (count_rest<=2))
Multivariate analysis helps remove interrelationships between the predictors to achieve stable and correct estimates at the individual-variable level, which is necessary for correct strategy creation….
….the variance/correlation based reductions are not mutually exclusive and might be applied judgmentally in
different sequences to achieve the best set of predictors
DATA PREPARATION
MULTIVARIATE ANALYSIS
Back to Dataprep
Inter-correlations amongst predictors:
• Linear relations → collinearity removal based on VIF and CI values
• Common variances → Factor Analysis; total variances → Principal Component Analysis
• Significant predictors: factors/PCs with eigenvalues > 1, or those which capture 70% of variance
• Variables should be grouped by the information they capture, and reductions performed at the group level
• Factor loadings are used to decompose the significant factors/PCs to the variable level
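The VIF-based collinearity check can be sketched with a hand-rolled computation (the deck obtains the same quantity from SAS; this Python version assumes only numpy, and the predictors are simulated so one pair is nearly collinear):

```python
import numpy as np

def vif(X):
    """Variance Inflation Factor per column: 1 / (1 - R^2_j), where R^2_j
    comes from regressing column j on the remaining columns."""
    n, k = X.shape
    out = []
    for j in range(k):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(n), others])      # intercept + other predictors
        beta, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ beta
        r2 = 1.0 - resid.var() / y.var()
        out.append(1.0 / (1.0 - r2))
    return np.array(out)

rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = x1 + rng.normal(scale=0.1, size=200)  # nearly collinear with x1 -> high VIF
x3 = rng.normal(size=200)                  # independent -> VIF near 1
vifs = vif(np.column_stack([x1, x2, x3]))
```

A common working rule is to investigate predictors with VIF above roughly 5-10; the exact cutoff is a judgment call.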
CONTENTS
METHODOLOGY
OVERVIEW
DATA COLLECTION
DATA PREPARATION
MODELING & ANALYSIS
PERFORMANCE DIAGNOSTICS
BEST PRACTICES
REFERENCES
The business need defines the nature of the dependent variable and the analysis time windows in which the predictors and the performance are observed.….
Note:
*Population sizes and business dynamics have to be taken into account while deciding the Analysis Time windows and
the form of dependent variable
Analysis windows*: Observation Window | Performance Window | Out-of-Time Validation Window
The Dependent Variable captures the behavior of interest, and it can be:
o Continuous or categorical
o Raw or transformed (log, growth), etc.
The statistical technique used for analysis depends on the type and form of this variable.
The Observation Window is the time window in which the various predictors are observed.
The Performance Window is the time window in which the dependent variable is defined.
The Out-of-Time Validation Window is the time window in which the model's performance and stability are checked.
Definition of Dependent Variable
MODELING & ANALYSIS
DEFINITION OF DEPENDENT VARIABLE
Every finding from the analysis has to be validated for reliability and accuracy across samples of data.….
MODELING & ANALYSIS
A NOTE ON SAMPLING
Define the Population
Determine the Sampling Frame
Select Sampling Technique(s)
Determine the Sample Size
Execute the Sampling Process
SAMPLING TECHNIQUES
SIMPLE RANDOM SAMPLING
STRATIFIED SAMPLING
All records are randomly assigned a selection
probability between 0 and 1.
STRENGTHS
Easily understood and implemented
WEAKNESSES
Lower precision and no assurance of
representativeness
All records are assigned to a particular sub-population, the proportion of which is to be maintained in the final sample. SRS is then used to select records from each sub-population.
STRENGTHS
Increases representativeness
WEAKNESSES
Not effective for very large/small strata
….The nature of the business problem and the population decide the sampling technique and sample sizes
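A stratified-sampling sketch in Python on a hypothetical population with a 90/10 defaulter mix (SRS within each stratum, as described above, so the mix is preserved exactly in the sample):

```python
import pandas as pd

# Hypothetical population: 900 non-defaulters, 100 defaulters.
# A plain 20% SRS could drift from the 90/10 mix; sampling within each
# stratum preserves it exactly.
pop = pd.DataFrame({"default_flag": [0] * 900 + [1] * 100})
sample = pop.groupby("default_flag").sample(frac=0.2, random_state=42)
```

For rare events, the same mechanism also supports oversampling the rare stratum by giving it a larger fraction.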
Segmentation of customers into homogeneous groups, similar within the clusters and different from those in other clusters, based on a set of behavioral characteristics..….
MODELING & ANALYSIS
SOME TIDBITS ABOUT CLUSTERING
….Identifies the structural breaks in the data, on either side of which the characteristics are fundamentally
different, and hence is necessary to arrive at the real relationship of predictors with dependent variable
Most used methods of clustering:
Hierarchical Clustering: assigns observations to a cluster progressively, one at a time, based on a distance measure.
Advantages: good for small datasets, as the algorithm finds the number of clusters.
Limitations: fails with large datasets as a result of memory issues.
K-means Clustering: a set of k cluster origins is selected at random; then all the remaining records are assigned to one of them based on a distance measure.
Advantages: simplicity and speed.
Limitations: it does not yield the same result with each run, since the resulting clusters depend on the initial random assignments. It minimizes intra-cluster variance, but does not ensure that the result is a global minimum of variance.
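A minimal K-means sketch (plain Lloyd's algorithm in Python on two made-up customer blobs) showing the assign-then-update loop described above; running it with different seeds can reproduce the instability noted in the limitations:

```python
import numpy as np

def kmeans(X, k, seed, iters=50):
    """Lloyd's algorithm: pick k random records as initial centers, then
    alternate (1) assigning each record to its nearest center and
    (2) moving each center to the mean of its assigned records."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
    return labels

# Two well-separated blobs of 50 "customers" each
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 0.3, size=(50, 2)),
               rng.normal(5.0, 0.3, size=(50, 2))])
labels = kmeans(X, k=2, seed=0)
```

On separated data like this the loop converges to the two blobs; on real overlapping data, multiple runs with different seeds are the usual safeguard.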
Modeling is the establishment of a relationship between the variable of interest and its various predictors and
hence the technique depends on the distribution of the dependent variable, the business problem and data
quality and quantity available for modeling..….
MODELING & ANALYSIS
MODELING TECHNIQUES
….Findings of a stable and accurate model elucidates the degree and nature of the drivers of the dependent
variable and thus defines the strategy to be taken for solving the business problem.
Final Analysis Dataset →
• Non-Parametric: does not depend on the distribution of the dependent variable
• Parametric: depends on the distribution of the dependent variable
Sl.No. | Target Variable Distribution | Modeling Approach | Model Output
1 | Continuous | OLS Regression | A typical model: Y = f(X) = f(X1, X2, .., Xn)
2 | Nominal | Logistic Regression | (as above)
3 | Categorical positive values | Poisson/Gamma | (as above)
4 | Unidentified | Decision Trees | Segments with increasing proportion of the dependent variable
Form of the fitting function (how are they mathematically related?):
y = α + β1X1 + β2X2
Predicted = mean + (relationship with predictor 1) × predictor 1 + (relationship with predictor 2) × predictor 2
Assumptions for the modeling: residuals are independent, normally distributed with mean 0, and have uniform variance throughout
What is OLS? Ordinary Least Squares (the explained variance, R², is maximized)
Type of predicted (dependent) variable: continuous (−∞ to +∞)
Business question: What loan amount take-up can we expect from a customer?
SAS procedure: Proc Reg
Performance Diagnostics (indicators of a good model):
•R-square (0 to 1): how well the model explains variance in the predicted variable
•MSE (Mean Square Error): average squared difference between predicted and actual; MSE = Σ(actual − predicted)² / (count of obs), and RMSE is its square root
•Significance of parameter estimates: probability of the null hypothesis (no relationship) is < 0.001
•Sign of parameter estimates: should be intuitive, and repeated in the validation sample
•Model validation: the model should be stable on both in-time and out-of-time validation samples
•Rank ordering: predicts high values when actuals are high, and vice versa
•AIC/SIC: parsimony (or efficiency), i.e., minimum predictors, maximum prediction; compare across models
GENERALIZED LINEAR MODELS
OLS REGRESSION (LINEAR)
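The deck runs OLS in SAS (Proc Reg); as a hedged illustration, here is the same fit in Python on simulated data (the true coefficients and noise level are made up), computing the R-square and MSE/RMSE diagnostics listed above:

```python
import numpy as np

# Simulated data obeying y = 2 + 1.5*x1 - 0.5*x2 + noise
rng = np.random.default_rng(7)
n = 500
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 2.0 + 1.5 * x1 - 0.5 * x2 + rng.normal(scale=0.3, size=n)

A = np.column_stack([np.ones(n), x1, x2])      # design matrix with intercept
beta, *_ = np.linalg.lstsq(A, y, rcond=None)   # (alpha, beta1, beta2)
pred = A @ beta

r2 = 1 - ((y - pred) ** 2).sum() / ((y - y.mean()) ** 2).sum()
mse = ((y - pred) ** 2).mean()
rmse = np.sqrt(mse)
```

With well-behaved simulated data, the recovered coefficients land near the true values and RMSE near the noise scale, which is what the diagnostics above are checking for.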
GENERALIZED LINEAR MODELS
OLS REGRESSION (LINEAR)- SAMPLE MODEL OUTPUT
The REG Procedure
Model: MODEL1
Dependent Variable: censor_po
Number of Observations Read 40162
Number of Observations Used 40162
Analysis of Variance
Source DF Sum of Mean F Value Pr > F
Squares Square
Model 12 610.91533 50.90961 219.02 <.0001
Error 40149 9332.36401 0.23244
Corrected Total 40161 9943.27934
Root MSE 0.48212 R-Square 0.0614
Dependent Mean 0.5492 Adj R-Sq 0.0612
Coeff Var 87.78642
Parameter Estimates
Variable DF Parameter Standard t Value Pr > |t| Variance
Estimate Error Inflation
Intercept 1 1.24953 0.20693 6.04 <.0001 0
APPLICATION_PRIM_CB_SCR_NBR 1 -0.000216 0.00028377 -0.76 0.4465 1.0205
log_APPL_ADV_RATIO 1 -0.1166 0.0117 -9.96 <.0001 1.09417
log_APPL_PYMT_TO_INCOME_RATIO 1 -0.01966 0.00517 -3.8 0.0001 1.17587
Collinearity Diagnostics
Number Eigenvalue Condition Index, with Proportion of Variation for: Intercept, APPLICATION_PRIM_CB_SCR_NBR, log_APPL_ADV_RATIO, log_APPL_PYMT_TO_INCOME_RATIO
1 8.3631 1 0.00000188 0.00000202 0.00002708 0.00057815
2 1.01345 2.87264 8.65E-09 8.73E-09 1.04E-07 5.68E-06
3 0.96895 2.93787 2.42E-11 5.60E-14 1.68E-09 0.0000019
8 0.22138 6.14626 0.00000754 0.00000817 0.00009252 0.00396
9 0.20341 6.41212 0.00001611 0.00001745 0.00020511 0.01911
10 0.05087 12.82208 0.00000322 0.00000279 0.00011988 0.26143
11 0.02578 18.01153 0.00082432 0.00088072 0.00992 0.68574
12 0.00137 78.10783 0.01375 0.01859 0.96941 0.02085
13 0.00007104 343.097 0.98539 0.98048 0.02008 0.00000173
What is Logistic? It predicts the log odds (event/non-event):
Log(odds) = α + β1X1 + β2X2
Predicted probability of event = e^(α + β1X1 + β2X2) / (1 + e^(α + β1X1 + β2X2))
Predicted probability of non-event = 1 / (1 + e^(α + β1X1 + β2X2))
→ Therefore, the total probability (event + non-event) at the observation level is 1
Type of predicted (dependent) variable: binary (1/0); one level is the event, the other the 'reference'
Business Question: What is the probability of a customer defaulting?
SAS procedure: Proc Logistic (with various link functions)
Performance Diagnostics (indicators of a good model):
•Concordance/discordance: if all event/non-event observations were paired, in what percentage of pairs the actual event observation is given the higher probability
•Significance of parameter estimates: probability of the null hypothesis (no relationship) is < 0.001
•Sign of parameter estimates: should be intuitive, and repeated in the validation sample
•Model validation: the model should be stable on both in-time and out-of-time validation samples
•Rank ordering: predicts high values when actuals are high, and vice versa
•Gains chart (KS statistic): the highest probabilities should be assigned to actual events
•AIC: parsimony (or efficiency), i.e., minimum predictors, maximum prediction; compare across models
Note:
* The Hosmer-Lemeshow goodness-of-fit test is useful, but fails when the model sample size is large
GENERALIZED LINEAR MODELS
LOGISTIC REGRESSION
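The log-odds formulas above can be sketched directly in Python (the coefficients here are illustrative, made up for the example, not taken from the sample output):

```python
import math

# Illustrative coefficients for log(odds) = alpha + b1*x1 + b2*x2
alpha, b1, b2 = -2.0, 0.8, -1.2

def p_event(x1, x2):
    """Predicted probability of the event: e^z / (1 + e^z)."""
    z = alpha + b1 * x1 + b2 * x2
    return math.exp(z) / (1.0 + math.exp(z))

def p_nonevent(x1, x2):
    """Predicted probability of the non-event: 1 / (1 + e^z)."""
    z = alpha + b1 * x1 + b2 * x2
    return 1.0 / (1.0 + math.exp(z))
```

For any observation the two probabilities sum to 1, which is the identity stated above.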
GENERALIZED LINEAR MODELS
LOGISTIC REGRESSION - SAMPLE MODEL OUTPUT
The LOGISTIC Procedure
Model Information
Data Set MODOUT.TU60_VAL_FICO
_690_719_EXP
Response Variable outcome
Number of Response Levels 3
Model generalized logit
Optimization Technique Fisher's scoring
Number of Observations Read 607592
Number of Observations Used 607592
Response Profile
Ordered outcome Total
Value Frequency
1 0 597504
2 1 9432
Logits modeled use outcome=0 as the reference
category.
Model Fit Statistics
Criterion Intercept only Intercept &
Covariates
AIC 107549.99 106661.99
SC 107572.63 106956.24
-2 Log L 107545.99 106609.99
Testing Global Null Hypothesis: BETA=0
Test Chi-Square DF Pr > ChiSq
Likelihood Ratio 935.9990 24 <.0001
Score 902.4392 24 <.0001
Wald 892.8763 24 <.0001
GENERALIZED LINEAR MODELS
LOGISTIC REGRESSION - SAMPLE MODEL OUTPUT contd…
Type 3 Analysis of Effects
Effect DF Wald Pr > ChiSq
Chi-Square
APPLICATION_PRIM_CB_ 2 14.5230 0.0007
log_APPL_ADV_RATIO 2 126.6605 <.0001
log_APPL_PYMT_TO_INC 2 83.5880 <.0001
Analysis of Maximum Likelihood Estimates
Parameter DF | Development Model Estimate | Validation Model Estimate | Standard Error | Wald Chi-Square | Pr > ChiSq
Intercept 1 1.1321 -0.4085 0.8909 0.2102 0.6466
APPLICATION_PRIM_CB_ 1 -0.00349 -0.00220 0.00122 3.2494 0.0715
log_APPL_ADV_RATIO 1 -0.3934 -0.2839 0.0485 34.2834 <.0001
log_APPL_PYMT_TO_INC 1 -0.1206 -0.0900 0.0221 16.5920 <.0001
Odds Ratio Estimates
Effect outcome Point Estimate 95% Wald Confidence Limits
APPLICATION_PRIM_CB_ 1 0.998 0.995 1.000
log_APPL_ADV_RATIO 1 0.753 0.685 0.828
log_APPL_PYMT_TO_INC 1 0.914 0.875 0.954
Percent Concordant 65.9 Somers' D 0.338
Percent Discordant 32.1 Gamma 0.345
Percent Tied 2.0 Tau-a 0.074
Pairs 1806529536 c 0.669
The higher the percent concordant, the better the model.
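Concordance can be computed by brute force on a tiny made-up example, pairing every event with every non-event and counting the pairs where the event got the higher predicted probability (ties count half, giving the c statistic):

```python
# Made-up predicted probabilities for 3 events and 4 non-events
event_scores = [0.9, 0.7, 0.6]
nonevent_scores = [0.8, 0.4, 0.3, 0.2]

concordant = sum(e > ne for e in event_scores for ne in nonevent_scores)
tied = sum(e == ne for e in event_scores for ne in nonevent_scores)
pairs = len(event_scores) * len(nonevent_scores)

c = (concordant + 0.5 * tied) / pairs  # c statistic (0.5 = random, 1.0 = perfect)
```

In the sample output above the same idea is applied to ~1.8 billion pairs, yielding c = 0.669.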
CAPTURING OF THE MODEL
The column "cumpct" in the rank-ordering output indicates the percentage of responders captured up to the given decile.
The model captures about 22.5% of responders in the first decile and about 71.07% of the responders in the top 5 deciles.
(Gains chart: responders captured (%) vs. population (%), model capturing vs. random capturing)
The higher the capturing in the initial deciles, the better the model performance.
GENERALIZED LINEAR MODELS
LOGISTIC REGRESSION – GAINS CHART contd…
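The decile capturing ("cumpct") can be sketched in Python on a hypothetical scored sample where the event rate rises with the score, so the model should capture responders faster than random:

```python
import numpy as np

# Hypothetical scores and responses: the probability of responding equals the score
rng = np.random.default_rng(3)
n = 1000
score = rng.uniform(size=n)
responder = (rng.uniform(size=n) < score).astype(int)

order = np.argsort(-score)                       # rank records by descending score
deciles = np.array_split(responder[order], 10)   # 10 equal population groups
cumpct = np.cumsum([d.sum() for d in deciles]) / responder.sum() * 100.0
```

A useful model shows cumpct well above 10%, 20%, ... in the early deciles (the random-capturing diagonal), reaching 100% by the last decile.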
CRITERIA FOR FINE-TUNING
Fine-tuning is based on applying the model to both the development and validation samples. The following criteria are considered when fine-tuning the model.
Fine Tuning
Rank Ordering
Coefficient Stability
Concordance
Highest KS
Goodness-of-fit
Validation
Capturing
RECAP
Phase III
Decide on the number of models and identify the dependent variables for each model
Identify the statistical method suitable for each predictive model: OLS Regression, Logistic Regression etc.
Hypothesize Predictor variables
TRANSLATE THE BUSINESS PROBLEM INTO A STATISTICAL
PROBLEM BASED ON IBCVM FRAMEWORK
UNDERSTAND THE
BUSINESS PROBLEM
PREPARE DATA
SPECIFICATIONS
& GET DATA
MODEL IMPLEMENTATION
Prepare Scoring Code
Track model performance at regular intervals
Redevelop/ Rebuild models on a need
basis
UNIVARIATE ANALYSIS
- Treatment of Outliers
BIVARIATE ANALYSIS
-Treatment of Missing Value
- Variable Transformations
DEVELOPMENT SAMPLE
(Sub sample of raw data)
MODEL DEVELOPMENT
-OLS / Logistic Regression
-Fine Tuning
VALIDATION SAMPLE
(Sub sample of raw data)
MULTIVARIATE ANALYSIS
- Removal of Multicollinearity
- Removal of Insignificant variables
RAW DATA
Model validation
Refinement
based on
Client Feedback
VALIDATION SAMPLE
(out of time)
Phase II
Phase I
REMAINING SLIDES
PENDING SLIDES:
OTHER TESTS(t tests, ANOVA, CHI-SQUARE, etc.)
PITFALLS IN STATISTICS
SPURIOUS CORRELATION
ENDOGENOUS & EXOGENOUS ERRORS
ACCURACY vs. RANKING
CAUSATION vs. CORRELATION
POPULATION STABILITY INDEX
OTHER THINGS TO BE ADDED:
BEST PRACTICES DOCUMENT
SAS & EXCEL MACROS
REFERENCES
SAMPLE DATA, CODE, OUTPUT
CHEAT SHEET