SIMPLIFYD ANALYTICS
Predictive Analytics: Process details
Sep 27, 2011
METHODOLOGY
OVERVIEW
DATA COLLECTION
DATA PREPARATION
MODELING & ANALYSIS
PERFORMANCE DIAGNOSTICS
BEST PRACTICES
REFERENCES
Intended for Knowledge Sharing
only
CONTENTS
ANALYTICAL PROCESS OVERVIEW
Business Problem Characterization
● TRANSLATION OF THE BUSINESS PROBLEM INTO A STATISTICAL FRAMEWORK: Decision on the analytical technique, data processing and final outcomes
● HYPOTHESIZE the predictor variables' relationships with the dependent variable**
Data Consolidation
● RECONCILIATION OF DATA FROM VARIOUS SOURCES into an Analysis Master Dataset
Data Treatment
● TRANSFORMATION: Conversion of field formats from other types to numeric
● MISSING VALUE TREATMENTS: Imputation of missing values based on Mean, etc.
● CAPPING TREATMENTS: Capping of extreme and nonsensical values
● NORMALIZATION of all the variables, to remove the effect of the variables' distributions on subsequent analytical steps
Modeling & Analysis*
● The relationship between the variable of interest and its drivers has to be established with significant confidence and stability through mathematical modeling techniques like Regression, Decision Trees, etc.
Recommendations & Implementation Strategy
● Based on the understanding of the relationships between the event of interest and its drivers, suitable business strategies can be developed to address the business problem.
Note:
* Modeling & Analysis is generally preceded by a Clustering phase, where all observations are grouped into homogeneous clusters, similar in characteristics within a cluster and dissimilar from the other clusters, to remove exogenous errors in the findings.
** The DEPENDENT VARIABLE has to be defined keeping in mind the business objective, data availability and forecast period.
Data Specification Document
•Hypothesized predictor variables necessary for solving the business problem
•Availability of data in the various data sources
•Form of the data
Data Integration Plan
•Reconciliation of the data from various sources into a single analysis master dataset
•The Data Integration (DI) Report records the presence of data across the merged tables
Data Gap Analysis
•Information that was critical to the hypothesis but not available in the data sources is listed here, so that it can be captured in future
A thorough understanding of the data sources is essential to plan the fastest and most efficient extraction.
The DI Report becomes significant in the later stages of data preparation: information that is missing because the data was unavailable has a particular meaning and so should not be imputed like other missing values.
DATA COLLECTION
[Diagram: the data sources (NBP, CLV, Acxiom, Warehouse Data, Analytics Tables, Bureau Data; covering Customer, Payments, Click-Stream and Cards) feed the Master Dataset, which feeds the Subsequent Data Treatments and Analytical Steps]
Data preparation begins with the data distribution studies needed for the missing and capping treatments, followed by data sanity checks on groups of variables. Variable reduction is a critical step for arriving at the best predictors for the subsequent analytical steps.
DATA PREPARATION
Univariate Analysis
Missing Treatment
Certain thumb rules:
%Missing<=5: Single Value Imputation
5<%Missing<=20: Bivariate based Value Imputation
20<%Missing<=40: Imputation based on Modeling with other independent variables
%Missing>40: Drop the variable
Capping Treatment
•Removal of extreme and nonsensical values to achieve a better distribution in the variables
Bivariate Analysis
Variable Transformations
•Variable transformation (log, exp, etc.), with the form depending on the degree of relationship observed in bivariate plots
•Dummy/binning variable creation, depending on the nature of the relationship
Selection/Dropping
•Selection/dropping of variables based on the strength of the relationship, judged by trend and/or the significance of the chi-square test
Variable Reduction
Multi-collinearity Checks
•Redundancy checks and removal using indicators like VIF, CI and Factor Loading
•Business sense is also used in the selection of variables for modeling
PCA/FA/Varclus
•Selection/dropping of variables based on their Factor Loadings on the significant PCs/Factors
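The missing-value thumb rules can be sketched as a small lookup. This is an illustration only: it assumes the last rule is meant to read %Missing > 40 (drop the variable), and the function names and the use of None to mark a missing entry are hypothetical.

```python
def missing_treatment(pct_missing):
    """Map a variable's missing percentage to a treatment (thumb rules only)."""
    if not 0 <= pct_missing <= 100:
        raise ValueError("pct_missing must be between 0 and 100")
    if pct_missing <= 5:
        return "single value imputation"
    if pct_missing <= 20:
        return "bivariate based value imputation"
    if pct_missing <= 40:
        return "model based imputation"
    # Assumption: above 40% missing the variable is dropped.
    return "drop the variable"


def pct_missing(values):
    """Percentage of entries equal to None (hypothetical missing marker)."""
    return 100.0 * sum(v is None for v in values) / len(values)
```

The thresholds are rules of thumb, so business meaning can still override the treatment suggested for a given variable.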
Capping treatment is another critical treatment, in which nonsensical and extreme observations are removed to achieve stability in the parameter estimates. It should always precede the missing treatment, so that the values imputed for missing observations follow better distributions.
DATA PREPARATION
CAPPING TREATMENT
Capping treatment is necessary to remove the following two types of incidents:
i. Outliers - Extreme observations in the dependent variable, leading to high residuals in predictions
ii. Influential Observations - Outliers on the independent-variable side, leading to unstable or wrong parameter estimates
[Chart: AmountTransacted$ vs. Count of Transactions, with Outliers and Influential Observations marked]
Capping treatment has to consider:
i. Distribution - A categorical variable should not be capped, etc.
ii. Niche characteristics - If the outlier values describe a niche group of customers who have outliers in other variables as well, they should not be capped.
iii. Business information - Certain nonsensical values signify something, such as Missing. They should be capped to a value near the most sensible end value but kept outside the sensible range, so that the actual information is not lost.
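A minimal sketch of capping a continuous variable at percentile bounds. The slides do not say how the caps are chosen, so the 1st and 99th percentiles and the function name are assumptions made for illustration.

```python
import numpy as np


def cap_percentiles(values, lower_pct=1.0, upper_pct=99.0):
    """Cap values below/above the given percentiles (assumed cut-offs)."""
    x = np.asarray(values, dtype=float)
    # nanpercentile ignores missing values when computing the bounds
    low, high = np.nanpercentile(x, [lower_pct, upper_pct])
    # Values outside [low, high] are pulled back to the nearest bound
    return np.clip(x, low, high)
```

Categorical variables and values that carry business meaning would be excluded before calling such a function, in line with the considerations listed above.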
Missing treatment is unavoidable, since the entire record is deleted if any one variable has missing information. It is also the most complex treatment, as each variable has to be treated differently based on its meaning, missing content and data integrity issues.
DATA PREPARATION
No.  Pyoffflg  Prin0105  Loanamt  Term  Fixed  Agnsttr  Bbctrad  Nummortt  Rvoptbal  Numminq  Numminq3
1    0         2324.9    19900    360   1      21       282      1         282       0        0
2    0         3796.5    22100    240   0      6        6911     1         33978     1        1
3    1         12523.2   42000    360   1      1        36350    .         36732     1        1
4    0         5190.9    21760    349   1      42       885      1         911       0        0
5    1         53.6      18000    360   1      5        8851     1         9506      0        0
6    0         1256.9    15500    360   .      13       409      1         760       0        0
7    0         4403.3    25150    900   1      3        21417    5         23579     3        1
8    0         3137.2    17800    240   1      4        4528     2         5967      1        0
9    0         4256.5    9999999  360   1      9        18179    47        130683    4        1
10   0         6442.4    31200    360   1      34       33177    1         0         2        0
Highlighted on the slide: missing observations (shown as ".") and unrealistic values.
Missing value imputation has to be based on:
i. Meaning of the variable - E.g., a flag can take either 1 or 0, depending on the coding.
ii. Distribution - E.g., a continuous variable like Amount with little missing content can take the mean, etc.
iii. Missing due to a merging issue - The value depends on whether the record was available in a particular table. E.g., if the record is not present in the Restrictions table, a missing "freq_restrictions" can take the value 0; if it is present in the Restrictions table but the value is still missing, it should take the median value.
iv. Correlation with other predictors - The missing value in a variable can depend on other variables in the dataset. E.g., if "amount_received" is missing, the missing amount depends on the size of the merchant, the average amount received in prior months, the type of products sold, the industry average, etc.
MISSING VALUE TREATMENT
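Rule iii can be sketched as follows. The record layout and the flag name in_restrictions_table are hypothetical; the flag stands in for the presence information that a Data Integration report would provide.

```python
from statistics import median


def impute_freq_restrictions(records):
    """Fill missing freq_restrictions: 0 if the record was absent from the
    Restrictions table, otherwise the median of the observed values."""
    observed = [r["freq_restrictions"] for r in records
                if r["freq_restrictions"] is not None]
    fill = median(observed)
    imputed = []
    for r in records:
        value = r["freq_restrictions"]
        if value is None:
            # Hypothetical flag: was the record present in the source table?
            value = fill if r["in_restrictions_table"] else 0
        imputed.append({**r, "freq_restrictions": value})
    return imputed
```

Keeping the presence flag next to the variable is what lets the two kinds of missingness be told apart at this stage.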
[Chart: MeanTxnAmt$ vs. Mean MOB]
Bivariate analysis explores the nature and degree of the relationship between each independent variable and the dependent variable. It is necessary for stable and accurate predictions, as well as for arriving at the correct recommendations.
DATA PREPARATION
BIVARIATE ANALYSIS
Dep Var = f(Indep Var, Log(Indep Var), Sin(Indep Var), ….)
●Significant estimate with large magnitude: transformations required
●Insignificant estimate: bivariate chart analysis (mean dep var value vs. class), leading to
oDummy creation for certain classes
oVariable dropping if there is no trend or relationship
[Chart: MeanTxnAmt$ vs. Mean Count of Restrictions; annotations: Dummy = (count_rest<=2), No relationship]
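The bivariate chart analysis amounts to computing the mean of the dependent variable per class of a predictor and then coding a dummy where the trend breaks. A minimal sketch follows; the function names are hypothetical, and the cut-off of 2 is taken from the Dummy = (count_rest<=2) annotation.

```python
from collections import defaultdict


def mean_by_class(classes, dep_values):
    """Mean dependent-variable value for each class of a predictor."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for c, y in zip(classes, dep_values):
        sums[c] += y
        counts[c] += 1
    return {c: sums[c] / counts[c] for c in sorted(sums)}


def low_restriction_dummy(count_rest):
    """Dummy = (count_rest <= 2), coded as 1/0."""
    return [1 if c <= 2 else 0 for c in count_rest]
```

If the class means show no trend at all, the variable is a candidate for dropping rather than for dummy creation.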
Multivariate analysis helps remove interrelationships between the predictors, to achieve stable and correct estimates at the individual variable level, which is necessary for correct strategy creation. The variance-based and correlation-based reductions are not mutually exclusive and may be applied judgmentally, in different sequences, to achieve the best set of predictors.
DATA PREPARATION
MULTIVARIATE ANALYSIS
Inter-correlations amongst Predictors
●Linear relations: collinearity removal based on VIF and CI values
●Common Variances: Factor Analysis
●Total Variances: Principal Component Analysis
●Significant Factors/PCs are those with Eigen Values > 1 or those which capture 70% of the variance
●Factor loadings are used to decompose the significant Factors/PCs to variable level
Variables should be grouped as per the information that they capture, and reductions are performed at the group level.
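A minimal sketch of the two checks, assuming the predictors sit in a numeric matrix with one column per variable. The VIFs are read off the diagonal of the inverse correlation matrix, which equals 1/(1 - R²) of each predictor regressed on the others; the function names are hypothetical.

```python
import numpy as np


def vif(X):
    """Variance inflation factor of each column of X."""
    corr = np.corrcoef(np.asarray(X, dtype=float), rowvar=False)
    # Diagonal of the inverse correlation matrix gives 1 / (1 - R_j^2)
    return np.diag(np.linalg.inv(corr))


def n_significant_components(X, min_eigenvalue=1.0):
    """Count principal components whose eigenvalue exceeds the threshold."""
    corr = np.corrcoef(np.asarray(X, dtype=float), rowvar=False)
    eigenvalues = np.linalg.eigvalsh(corr)  # corr is symmetric
    return int(np.sum(eigenvalues > min_eigenvalue))
```

A predictor that is nearly a linear combination of the others shows a very large VIF; the level at which a variable is removed is a judgment call that the slides do not fix.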
Business need defines the nature of the dependent variable and the analysis time windows in which the predictors and the performance are observed.
MODELING & ANALYSIS
DEFINITION OF DEPENDENT VARIABLE
The Dependent Variable captures the behavior of interest. It can be
oContinuous or categorical
oRaw or transformed (log, growth), etc.
and the statistical technique used for analysis depends on the type and form of this variable.
Analysis windows*
●Observation Window: the time window in which the various predictors are observed
●Performance Window: the time window in which the dependent variable is defined
●Out-of-Time Validation Window: the time window in which model performance and stability are checked
Note:
*Population sizes and business dynamics have to be taken into account when deciding the analysis time windows and the form of the dependent variable.
Every finding from the analysis has to be validated for reliability and accuracy across samples of data. The nature of the business problem and of the population decides the sampling technique and the sample sizes.
MODELING & ANALYSIS
A NOTE ON SAMPLING
●Define the Population
●Determine the Sampling Frame
●Select Sampling Technique(s)
●Determine the Sample Size
●Execute the Sampling Process
SAMPLING TECHNIQUES
SIMPLE RANDOM SAMPLING (SRS)
Each record is assigned a random number between 0 and 1, and records are selected on that number, so every record has the same chance of selection.
STRENGTHS
Easily understood and implemented
WEAKNESSES
Lower precision and no assurance of representativeness
STRATIFIED SAMPLING
All records are assigned to a particular sub-population (stratum), the proportion of which is to be maintained in the final samples. SRS is used to select records from the sub-populations.
STRENGTHS
Increases representativeness
WEAKNESSES
Not effective for very large or very small strata
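Stratified sampling can be sketched with the standard library alone: records are grouped by stratum and a simple random sample of the same fraction is drawn from each group. The function name, the stratum_of callback and the fixed seed are assumptions made for illustration.

```python
import random
from collections import defaultdict


def stratified_sample(records, stratum_of, fraction, seed=0):
    """Draw the same fraction from every stratum using simple random sampling."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for rec in records:
        strata[stratum_of(rec)].append(rec)
    sample = []
    for key in sorted(strata):
        members = strata[key]
        # Keeping the fraction constant preserves each stratum's proportion
        n = round(len(members) * fraction)
        sample.extend(rng.sample(members, n))
    return sample
```

With very small strata the rounded sample size can fall to zero, which is one way the weakness noted above shows up in practice.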
Clustering segments customers into homogeneous groups, similar within a cluster and different from those in other clusters, based on a set of behavioral characteristics. It identifies the structural breaks in the data, on either side of which the characteristics are fundamentally different, and hence is necessary to arrive at the real relationship of the predictors with the dependent variable.
MODELING & ANALYSIS
SOME TIDBITS ABOUT CLUSTERING
Most used methods of clustering:
●Hierarchical Clustering - Assigns observations to clusters progressively, one at a time, based on a distance measure.
oAdvantages: Good for small datasets, as the number of clusters does not have to be fixed in advance.
oLimitations: Fails with large datasets as a result of memory issues.
●K-means Clustering - A chosen number (k) of cluster origins is selected at random; all the remaining records are then assigned to one of them based on a distance measure.
oAdvantages: Simplicity and speed
oLimitations: Does not yield the same result on each run, since the resulting clusters depend on the initial random assignments. It minimizes intra-cluster variance but does not ensure that the result is a global minimum of variance.
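A minimal k-means sketch, assuming Euclidean distance and random selection of k records as the initial origins. It is illustrative rather than a production implementation; the seed parameter makes the dependence on the initial choice explicit.

```python
import numpy as np


def kmeans(X, k, seed=0, n_iter=100):
    """Basic k-means: returns (labels, centers) for an (n, d) array X."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    # Initial origins: k distinct records picked at random
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assign every record to its nearest center
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Recompute each center; keep the old one if a cluster is empty
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers
```

Running it with several seeds and keeping the solution with the lowest intra-cluster variance is a common way to soften the local-minimum limitation.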
Modeling is the establishment of a relationship between the variable of interest and its various predictors. The technique therefore depends on the distribution of the dependent variable, the business problem, and the quality and quantity of data available for modeling. The findings of a stable and accurate model elucidate the degree and nature of the drivers of the dependent variable and thus define the strategy to be taken for solving the business problem.
MODELING & ANALYSIS
MODELING TECHNIQUES
Final Analysis Dataset
●Parametric: depends on the distribution of the dependent variable
●Non-Parametric: does not depend on the distribution of the dependent variable
Sl.No.  Target Variable Distribution        Modeling Approach    Model Output
1       Continuous                          OLS Regression       A typical model: Y = f(X) = f(X1, X2, .., Xn)
2       Nominal                             Logistic Regression
3       Count or positive continuous values Poisson/Gamma
4       Unidentified                        Decision Trees       Segments with increasing proportion of dependent variable
What is OLS? Ordinary Least Squares: the sum of squared residuals is minimized (equivalently, the explained variance, R2, is maximized)
Type of Predicted (dependent) Variable: Continuous Variable (-∞ to +∞)
Business Question: What loan amount take off can we expect from a customer?
SAS procedure: Proc Reg
Form of fitting function (how are they mathematically related?):
y = α + β1X1 + β2X2
Predicted = Intercept + relationship with Predictor 1 * Predictor 1 + relationship with Predictor 2 * Predictor 2
Assumption for the modeling: Residuals are independent, normally distributed with mean 0, and have uniform variance throughout
Performance Diagnostics (indicators of a good model):
•R-square (0 to 1): How well does the model explain the variance in the predicted variable?
•MSE (Mean Square Error): How large is the average squared difference between predicted and actual?
MSE = summation of (actual value – predicted value)^2 / (count of obs); Root MSE = sqrt of MSE (Proc Reg divides by the error degrees of freedom rather than the count of obs)
•Significance of parameter estimates: The p-value under the null hypothesis (no relationship) should be small, e.g. <0.001
•Sign of parameter estimates: Should be intuitive, or repeated in the validation sample
•Model validation: Model should be stable on both in-time and out-of-time validation samples
•Rank Ordering: Predict a high value when the actual is high, and vice versa
•AIC/SIC: Parsimony (or efficiency) - minimum predictors, maximum predictive power; compare across models
GENERALIZED LINEAR MODELS
OLS REGRESSION (LINEAR)
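The fitting function and the first two diagnostics can be sketched with NumPy. This is not the Proc Reg implementation: fit_ols is a hypothetical helper, and the MSE divides by the error degrees of freedom, as the Root MSE line in regression output does.

```python
import numpy as np


def fit_ols(X, y):
    """Least-squares fit of y = alpha + X @ beta, with R-square and MSE."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    design = np.column_stack([np.ones(len(y)), X])  # intercept column first
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    residuals = y - design @ coef
    sse = float(residuals @ residuals)              # unexplained variation
    sst = float(((y - y.mean()) ** 2).sum())        # total variation
    mse = sse / (len(y) - design.shape[1])          # error degrees of freedom
    return {"coef": coef, "r_square": 1.0 - sse / sst,
            "mse": mse, "root_mse": mse ** 0.5}
```

The first entry of coef is the intercept α and the remaining entries are β1, β2, and so on, in column order.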
GENERALIZED LINEAR MODELS
OLS REGRESSION (LINEAR)- SAMPLE MODEL OUTPUT
The REG Procedure
Model: MODEL1
Dependent Variable: censor_po
Number of Observations Read 40162
Number of Observations Used 40162
Analysis of Variance
Source DF Sum of Squares Mean Square F Value Pr > F
Model 12 610.91533 50.90961 219.02 <.0001
Error 40149 9332.36401 0.23244
Corrected Total 40161 9943.27934
Root MSE 0.48212 R-Square 0.0614
Dependent Mean 0.5492 Adj R-Sq 0.0612
Coeff Var 87.78642
Parameter Estimates
Variable DF Parameter Estimate Standard Error t Value Pr > |t| Variance Inflation
Intercept 1 1.24953 0.20693 6.04 <.0001 0
APPLICATION_PRIM_CB_SCR_NBR 1 -0.000216 0.00028377 -0.76 0.4465 1.0205
log_APPL_ADV_RATIO 1 -0.1166 0.0117 -9.96 <.0001 1.09417
log_APPL_PYMT_TO_INCOME_RATIO 1 -0.01966 0.00517 -3.8 0.0001 1.17587
Collinearity Diagnostics
Number Eigenvalue Condition Index Proportion of Variation: Intercept APPLICATION_PRIM_CB_SCR_NBR log_APPL_ADV_RATIO log_APPL_PYMT_TO_INCOME_RATIO
1 8.3631 1 0.00000188 0.00000202 0.00002708 0.00057815
2 1.01345 2.87264 8.65E-09 8.73E-09 1.04E-07 5.68E-06
3 0.96895 2.93787 2.42E-11 5.60E-14 1.68E-09 0.0000019
8 0.22138 6.14626 0.00000754 0.00000817 0.00009252 0.00396
9 0.20341 6.41212 0.00001611 0.00001745 0.00020511 0.01911
10 0.05087 12.82208 0.00000322 0.00000279 0.00011988 0.26143
11 0.02578 18.01153 0.00082432 0.00088072 0.00992 0.68574
12 0.00137 78.10783 0.01375 0.01859 0.96941 0.02085
13 0.00007104 343.097 0.98539 0.98048 0.02008 0.00000173
What is Logistic? Predicts the log odds of the event (event/non-event)
Log (odds) = α + β1X1 + β2X2
Predicted probability of event = e^(α + β1X1 + β2X2)/(1+e^(α + β1X1 + β2X2))
Predicted probability of non-event = 1/(1+e^(α + β1X1 + β2X2))
->Therefore, the total probability (event + non-event) at an observation level is 1
Type of Predicted (dependent) Variable: Binary (1/0) - one is the event, the other is the 'reference'
Business Question: What is the probability of a customer defaulting?
SAS procedure: Proc Logistic (with various link functions)
Performance Diagnostics (indicators of a good model):
•Concordance/Discordance: Of all pairs of one event and one non-event observation, the percentage in which the actual event observation is given the higher predicted probability
•Significance of parameter estimates: The p-value under the null hypothesis (no relationship) should be small, e.g. <0.001
•Sign of parameter estimates: Should be intuitive, or repeated in the validation sample
•Model validation: Model should be stable on both in-time and out-of-time validation samples
•Rank Ordering: Predict a high value when the actual is high, and vice versa
•Gains Chart (KS Statistic): The highest probabilities should be assigned to the actual events
•AIC: Parsimony (or efficiency) - minimum predictors, maximum predictive power; compare across models
Note:
*The Hosmer-Lemeshow test is a good goodness-of-fit check, but it tends to fail when the model sample size is large
GENERALIZED LINEAR MODELS
LOGISTIC REGRESSION
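The probability formulas and the concordance measure can be sketched as below. The function names are hypothetical, and the pairwise loop is only practical for small samples.

```python
import math


def event_probability(alpha, betas, xs):
    """P(event) = e^z / (1 + e^z), where z is the predicted log odds."""
    log_odds = alpha + sum(b * x for b, x in zip(betas, xs))
    # 1 / (1 + e^-z) is algebraically the same as e^z / (1 + e^z)
    return 1.0 / (1.0 + math.exp(-log_odds))


def percent_concordant(event_probs, non_event_probs):
    """Share of (event, non-event) pairs where the event scores higher."""
    pairs = [(e, n) for e in event_probs for n in non_event_probs]
    concordant = sum(e > n for e, n in pairs)
    return 100.0 * concordant / len(pairs)
```

The probability of the non-event is one minus the value returned by event_probability, which is why the two probabilities total 1 for every observation.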
GENERALIZED LINEAR MODELS
LOGISTIC REGRESSION - SAMPLE MODEL OUTPUT
The LOGISTIC Procedure
Model Information
Data Set MODOUT.TU60_VAL_FICO_690_719_EXP
Response Variable outcome
Number of Response Levels 3
Model generalized logit
Optimization Technique Fisher's scoring
Number of Observations Read 607592
Number of Observations Used 607592
Response Profile
Ordered Value outcome Total Frequency
1 0 597504
2 1 9432
Logits modeled use outcome=0 as the reference
category.
Model Fit Statistics
Criterion Intercept Only Intercept & Covariates
AIC 107549.99 106661.99
SC 107572.63 106956.24
-2 Log L 107545.99 106609.99
Testing Global Null Hypothesis: BETA=0
Test Chi-Square DF Pr > ChiSq
Likelihood Ratio 935.9990 24 <.0001
Score 902.4392 24 <.0001
Wald 892.8763 24 <.0001
GENERALIZED LINEAR MODELS
LOGISTIC REGRESSION - SAMPLE MODEL OUTPUT contd…
Type 3 Analysis of Effects
Effect DF Wald Chi-Square Pr > ChiSq
APPLICATION_PRIM_CB_ 2 14.5230 0.0007
log_APPL_ADV_RATIO 2 126.6605 <.0001
log_APPL_PYMT_TO_INC 2 83.5880 <.0001
Analysis of Maximum Likelihood Estimates
Parameter DF Development Model Estimate Validation Model Estimate Standard Error Wald Chi-Square Pr > ChiSq
Intercept 1 1.1321 -0.4085 0.8909 0.2102 0.6466
APPLICATION_PRIM_CB_ 1 -0.00349 -0.00220 0.00122 3.2494 0.0715
log_APPL_ADV_RATIO 1 -0.3934 -0.2839 0.0485 34.2834 <.0001
log_APPL_PYMT_TO_INC 1 -0.1206 -0.0900 0.0221 16.5920 <.0001
Odds Ratio Estimates
Effect outcome Point Estimate 95% Wald Confidence Limits
APPLICATION_PRIM_CB_ 1 0.998 0.995 1.000
log_APPL_ADV_RATIO 1 0.753 0.685 0.828
log_APPL_PYMT_TO_INC 1 0.914 0.875 0.954
Percent Concordant 65.9 Somers' D 0.338
Percent Discordant 32.1 Gamma 0.345
Percent Tied 2.0 Tau-a 0.074
Pairs 1806529536 c 0.669
Note: The higher the percent concordant, the better the model.
GENERALIZED LINEAR MODELS
LOGISTIC REGRESSION – RANK ORDERING OUTPUT contd…
predgroup obs minpred maxpred avgpred totact avgact cumact predrank cumpct actrank KS
1 12551 0.2069 1 0.275689 3632 0.289379 3632 1 22.51984 1 14.5573
2 13077 0.172384 0.206895 0.190708 2565 0.196146 6197 2 38.42386 2 21.07661
3 12932 0.163982 0.172383 0.165289 2179 0.168497 8376 3 51.93452 3 24.98741
4 12696 0.118382 0.163978 0.142257 1727 0.136027 10103 4 62.64261 4 25.9028
5 12814 0.096125 0.118381 0.105572 1360 0.106134 11463 5 71.07515 5 24.10965
6 12814 0.086392 0.096124 0.091463 1151 0.089824 12614 6 78.21181 6 20.83402
7 12814 0.077738 0.086391 0.081861 1061 0.0828 13675 7 84.79043 7 16.92002
8 11344 0.07317 0.077737 0.075261 811 0.071492 14486 8 89.81895 8 12.54508
9 14284 0.069614 0.073168 0.072034 894 0.062588 15380 9 95.3621 9 6.134163
10 12814 0.03382 0.069613 0.060393 748 0.058374 16128 10 100 10 0
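The cumpct and KS columns can be recomputed from the obs and totact columns, assuming KS is the gap between the cumulative percentage of events and the cumulative percentage of non-events. The function name is hypothetical; the input figures are copied from the table above.

```python
def ks_by_decile(obs, events):
    """Return (cumpct, ks) per decile from counts of records and events."""
    total_events = sum(events)
    total_non_events = sum(obs) - total_events
    cum_events = 0
    cum_non_events = 0
    rows = []
    for n, e in zip(obs, events):
        cum_events += e
        cum_non_events += n - e
        cumpct = 100.0 * cum_events / total_events
        # KS: cumulative % of events minus cumulative % of non-events
        ks = cumpct - 100.0 * cum_non_events / total_non_events
        rows.append((cumpct, ks))
    return rows


# obs and totact columns of the rank-ordering output
obs = [12551, 13077, 12932, 12696, 12814, 12814, 12814, 11344, 14284, 12814]
totact = [3632, 2565, 2179, 1727, 1360, 1151, 1061, 811, 894, 748]
rows = ks_by_decile(obs, totact)
```

On these inputs the sketch should agree with the cumpct and KS columns of the table, with KS peaking in the fourth decile.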
CAPTURING OF THE MODEL
The column "cumpct" in the rank-ordering output gives the cumulative percentage of responders captured up to the given decile.
The model captures about 22.5% of the responders in the first decile and about 71.07% of the responders in the top 5 deciles.
[Gains chart: Responders captured vs. Population (%); series: Model capturing, Random capturing]
The higher the capturing in the initial deciles, the better the model performance.
GENERALIZED LINEAR MODELS
LOGISTIC REGRESSION – GAINS CHART contd…
CRITERIA FOR FINE-TUNING
Fine tuning is based on applying the model to both the development and validation samples. The following criteria are considered when fine tuning the model:
●Rank Ordering
●Coefficient Stability
●Concordance
●Highest KS
●Goodness-of-fit
●Validation
●Capturing
RECAP
Phase III
Decide on the number of models and identify the dependent variables for each model
Identify the statistical method suitable for each predictive model: OLS Regression, Logistic Regression etc.
Hypothesize Predictor variables
TRANSLATE THE BUSINESS PROBLEM INTO A STATISTICAL PROBLEM BASED ON IBCVM FRAMEWORK
UNDERSTAND THE BUSINESS PROBLEM
PREPARE DATA SPECIFICATIONS & GET DATA
MODEL IMPLEMENTATION
- Prepare Scoring Code
- Track model performance at regular intervals
- Redevelop/rebuild models on a need basis
UNIVARIATE ANALYSIS
- Treatment of Outliers
BIVARIATE ANALYSIS
-Treatment of Missing Value
- Variable Transformations
DEVELOPMENT SAMPLE
(Sub sample of raw data)
MODEL DEVELOPMENT
-OLS / Logistic Regression
-Fine Tuning
VALIDATION SAMPLE
(Sub sample of raw data)
MULTIVARIATE ANALYSIS
- Removal of Multicollinearity
- Removal of Insignificant variables
RAW DATA
Model validation
Refinement based on Client Feedback
VALIDATION SAMPLE
(out of time)
Phase II
Phase I
REMAINING SLIDES
PENDING SLIDES:
OTHER TESTS(t tests, ANOVA, CHI-SQUARE, etc.)
PITFALLS IN STATISTICS
SPURIOUS CORRELATION
ENDOGENOUS & EXOGENOUS ERRORS
ACCURACY vs. RANKING
CAUSATION vs. CORRELATION
POPULATION STABILITY INDEX
OTHER THINGS TO BE ADDED:
BEST PRACTICES DOCUMENT
SAS & EXCEL MACROS
REFERENCES
SAMPLE DATA, CODE, OUTPUT
CHEAT SHEET

More Related Content

What's hot

What is Binary Logistic Regression Classification and How is it Used in Analy...
What is Binary Logistic Regression Classification and How is it Used in Analy...What is Binary Logistic Regression Classification and How is it Used in Analy...
What is Binary Logistic Regression Classification and How is it Used in Analy...Smarten Augmented Analytics
 
Exploratory data analysis data visualization
Exploratory data analysis data visualizationExploratory data analysis data visualization
Exploratory data analysis data visualizationDr. Hamdan Al-Sabri
 
What Is Generalized Linear Regression with Gaussian Distribution And How Can ...
What Is Generalized Linear Regression with Gaussian Distribution And How Can ...What Is Generalized Linear Regression with Gaussian Distribution And How Can ...
What Is Generalized Linear Regression with Gaussian Distribution And How Can ...Smarten Augmented Analytics
 
Gradient Boosting Regression Analysis Reveals Dependent Variables and Interre...
Gradient Boosting Regression Analysis Reveals Dependent Variables and Interre...Gradient Boosting Regression Analysis Reveals Dependent Variables and Interre...
Gradient Boosting Regression Analysis Reveals Dependent Variables and Interre...Smarten Augmented Analytics
 
Jacobs Kiefer Bayes Guide 3 10 V1
Jacobs Kiefer Bayes Guide 3 10 V1Jacobs Kiefer Bayes Guide 3 10 V1
Jacobs Kiefer Bayes Guide 3 10 V1Michael Jacobs, Jr.
 
What Is Multilayer Perceptron Classifier And How Is It Used For Enterprise An...
What Is Multilayer Perceptron Classifier And How Is It Used For Enterprise An...What Is Multilayer Perceptron Classifier And How Is It Used For Enterprise An...
What Is Multilayer Perceptron Classifier And How Is It Used For Enterprise An...Smarten Augmented Analytics
 
Satisfaction and loyalty
Satisfaction and loyaltySatisfaction and loyalty
Satisfaction and loyaltyTheDataNation
 
Approach to BSA/AML Rule Thresholds
Approach to BSA/AML Rule ThresholdsApproach to BSA/AML Rule Thresholds
Approach to BSA/AML Rule ThresholdsMayank Johri
 
Exam Short Preparation on Data Analytics
Exam Short Preparation on Data AnalyticsExam Short Preparation on Data Analytics
Exam Short Preparation on Data AnalyticsHarsh Parekh
 
What Is a Model, Anyhow?
What Is a Model, Anyhow?What Is a Model, Anyhow?
What Is a Model, Anyhow?Bill Cassill
 
Presentation Title
Presentation TitlePresentation Title
Presentation Titlebutest
 
Lobsters, Wine and Market Research
Lobsters, Wine and Market ResearchLobsters, Wine and Market Research
Lobsters, Wine and Market ResearchTed Clark
 

What's hot (18)

What is Binary Logistic Regression Classification and How is it Used in Analy...
What is Binary Logistic Regression Classification and How is it Used in Analy...What is Binary Logistic Regression Classification and How is it Used in Analy...
What is Binary Logistic Regression Classification and How is it Used in Analy...
 
Exploratory data analysis data visualization
Exploratory data analysis data visualizationExploratory data analysis data visualization
Exploratory data analysis data visualization
 
All About Big Data
All About Big Data All About Big Data
All About Big Data
 
What Is Generalized Linear Regression with Gaussian Distribution And How Can ...
What Is Generalized Linear Regression with Gaussian Distribution And How Can ...What Is Generalized Linear Regression with Gaussian Distribution And How Can ...
What Is Generalized Linear Regression with Gaussian Distribution And How Can ...
 
Gradient Boosting Regression Analysis Reveals Dependent Variables and Interre...
Gradient Boosting Regression Analysis Reveals Dependent Variables and Interre...Gradient Boosting Regression Analysis Reveals Dependent Variables and Interre...
Gradient Boosting Regression Analysis Reveals Dependent Variables and Interre...
 
Data Analysis
Data AnalysisData Analysis
Data Analysis
 
Predictive data analytics models and their applications
Predictive data analytics models and their applicationsPredictive data analytics models and their applications
Predictive data analytics models and their applications
 
Jacobs Kiefer Bayes Guide 3 10 V1
Jacobs Kiefer Bayes Guide 3 10 V1Jacobs Kiefer Bayes Guide 3 10 V1
Jacobs Kiefer Bayes Guide 3 10 V1
 
What Is Multilayer Perceptron Classifier And How Is It Used For Enterprise An...
What Is Multilayer Perceptron Classifier And How Is It Used For Enterprise An...What Is Multilayer Perceptron Classifier And How Is It Used For Enterprise An...
What Is Multilayer Perceptron Classifier And How Is It Used For Enterprise An...
 
Satisfaction and loyalty
Satisfaction and loyaltySatisfaction and loyalty
Satisfaction and loyalty
 
Approach to BSA/AML Rule Thresholds
Approach to BSA/AML Rule ThresholdsApproach to BSA/AML Rule Thresholds
Approach to BSA/AML Rule Thresholds
 
Doc 20190909-wa0025
Doc 20190909-wa0025Doc 20190909-wa0025
Doc 20190909-wa0025
 
Exam Short Preparation on Data Analytics
Exam Short Preparation on Data AnalyticsExam Short Preparation on Data Analytics
Exam Short Preparation on Data Analytics
 
What Is a Model, Anyhow?
What Is a Model, Anyhow?What Is a Model, Anyhow?
What Is a Model, Anyhow?
 
Presentation Title
Presentation TitlePresentation Title
Presentation Title
 
Classes of Model
Classes of ModelClasses of Model
Classes of Model
 
Lobsters, Wine and Market Research
Lobsters, Wine and Market ResearchLobsters, Wine and Market Research
Lobsters, Wine and Market Research
 
Kevin Swingler: Introduction to Data Mining
Kevin Swingler: Introduction to Data MiningKevin Swingler: Introduction to Data Mining
Kevin Swingler: Introduction to Data Mining
 

Viewers also liked

Foundational Methodology for Data Science
Foundational Methodology for Data ScienceFoundational Methodology for Data Science
Foundational Methodology for Data ScienceJohn B. Rollins, Ph.D.
 
CRISP-DM: a data science project methodology
CRISP-DM: a data science project methodologyCRISP-DM: a data science project methodology
CRISP-DM: a data science project methodologySergey Shelpuk
 
Model building in credit card and loan approval
Model building in credit card and loan approval Model building in credit card and loan approval
Model building in credit card and loan approval Venkata Reddy Konasani
 
Delopment and testing of a credit scoring model
Delopment and testing of a credit scoring modelDelopment and testing of a credit scoring model
Delopment and testing of a credit scoring modelMattia Ciprian
 

Viewers also liked (7)

Foundational Methodology for Data Science
Foundational Methodology for Data ScienceFoundational Methodology for Data Science
Foundational Methodology for Data Science
 
Credit scorecard
Credit scorecardCredit scorecard
Credit scorecard
 
CRISP-DM: a data science project methodology
CRISP-DM: a data science project methodologyCRISP-DM: a data science project methodology
CRISP-DM: a data science project methodology
 
Model building in credit card and loan approval
Model building in credit card and loan approval Model building in credit card and loan approval
Model building in credit card and loan approval
 
Delopment and testing of a credit scoring model
Delopment and testing of a credit scoring modelDelopment and testing of a credit scoring model
Delopment and testing of a credit scoring model
 
Credit Risk Model Building Steps
Credit Risk Model Building StepsCredit Risk Model Building Steps
Credit Risk Model Building Steps
 
Agile Data Science
Agile Data ScienceAgile Data Science
Agile Data Science
 

Step by Step guide to executing an analytics project

  • 1. SIMPLIFYD ANALYTICS Predictive Analytics: Process details Sep 27, 2011
  • 2. CONTENTS: METHODOLOGY OVERVIEW, DATA COLLECTION, DATA PREPARATION, MODELING & ANALYSIS, PERFORMANCE DIAGNOSTICS, BEST PRACTICES, REFERENCES (Intended for Knowledge Sharing only)
  • 3. ANALYTICAL PROCESS OVERVIEW
    Business Problem Characterization
      ● TRANSLATION OF BUSINESS PROBLEM INTO A STATISTICAL FRAMEWORK: Decision on the analytical technique, data processing and final outcomes
      ● HYPOTHESIZE the predictor variables' relationships with the dependent variable **
    Data Consolidation
      ● RECONCILIATION OF DATA FROM VARIOUS SOURCES into an Analysis Master Dataset
    Data Treatment
      ● TRANSFORMATION: Conversion of field formats from other types to numeric
      ● MISSING VALUE TREATMENTS: Imputation of missing values based on the mean, etc.
      ● CAPPING TREATMENTS: Capping of extreme and nonsensical values
      ● NORMALIZATION of all the variables to remove the effect of the distribution of variables on subsequent analytical steps
    Modeling & Analysis *
      ● The relationship between the variable of interest and its drivers has to be established with significant confidence and stability through mathematical modeling techniques such as Regression, Decision Trees, etc.
    Recommendations & Implementation Strategy
      ● Based on the understanding of the relationships between the events of interest and their drivers, suitable business strategies can be developed to address the business problem.
    Note:
      * Modeling & Analysis is generally preceded by a Clustering phase, where all observations are grouped into homogeneous clusters (similar in characteristics within a cluster and dissimilar from the other clusters) to remove exogenous errors in findings.
      ** The DEPENDENT VARIABLE has to be defined keeping in mind the business objective, data availability and forecast period.
  • 4. CONTENTS: METHODOLOGY OVERVIEW, DATA COLLECTION, DATA PREPARATION, MODELING & ANALYSIS, PERFORMANCE DIAGNOSTICS, BEST PRACTICES, REFERENCES
  • 5. DATA COLLECTION
    A thorough understanding of the data sources is essential to plan the extraction in the fastest and most efficient way. The DI Report assumes significance in the later stages of data preparation, where information that is missing because the data was unavailable has a particular meaning and so should not be imputed like other missing values.
    Data Specification Document
      • Hypothesized predictor variables necessary for solving the business problem
      • Availability of data in the various data sources
      • Form of the data
    Data Integration Plan
      • Reconciliation of the data from various sources into one single analysis master dataset
      • The Data Integration (DI) Report describes the presence of data across the merged tables
    Data Gap Analysis
      • Information that was critical as per the hypothesis but not available in the data sources is listed here so that it can be captured in future
    [Diagram: data sources (NBP, CLV, Acxiom, Warehouse Data, Analytics Tables, Bureau Data; Customer, Payments, Click-Stream, Cards) feed the Master Dataset, which feeds the subsequent data treatments and analytical steps]
  • 6. CONTENTS: METHODOLOGY OVERVIEW, DATA COLLECTION, DATA PREPARATION, MODELING & ANALYSIS, PERFORMANCE DIAGNOSTICS, BEST PRACTICES, REFERENCES
  • 7. DATA PREPARATION
    Data preparation begins with data distribution studies needed for missing and capping treatments, followed by data sanity checks on groups of variables. Variable reduction is a very critical step to achieve the best predictors for the subsequent analytical steps.
    Stages: Univariate Analysis, Bivariate Analysis, Variable Reduction
    Steps: Missing Treatment, Capping Treatment, Variable Transformations, Selection/Dropping, Multi-collinearity Checks, PCA/FA/Varclus
    Missing Treatment (thumb rules)
      • %Missing <= 5: Single value imputation
      • 5 < %Missing <= 20: Bivariate-based value imputation
      • 20 < %Missing <= 40: Imputation based on modeling with other independent variables
      • %Missing > 40: Drop the variable
    Capping Treatment
      • Removal of extreme and nonsensical values to achieve better distribution in the variables
    Variable Transformations
      • Variable transformation (log, exp, etc.) depending on the degree of relationship observed in bivariate plots
      • Dummy/binning variable creation depending on the nature of the relationship
    Selection/Dropping
      • Selection/dropping of variables based on the strength of relationship by trend and/or significance of the chi-square test
      • Business sense is also used in the selection of variables for modeling
    Multi-collinearity Checks
      • Redundancy checks and removal using indicators like VIF (Variance Inflation Factor), CI (Condition Index) and Factor Loading
    PCA/FA/Varclus
      • Selection/dropping of variables based on the Factor Loadings of the variables on the significant PCs/Factors
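The missing-value thumb rules can be sketched as a small lookup. This is a minimal illustration only: the function names are hypothetical, missing values are assumed to be represented as `None`, and the last rule is assumed to apply above 40% missing (the bands would otherwise overlap).

```python
def pct_missing(values):
    """Percentage of entries that are missing (None) in a column."""
    if not values:
        raise ValueError("column is empty")
    return 100.0 * sum(v is None for v in values) / len(values)


def missing_treatment(pct):
    """Map a missing percentage to the thumb-rule treatment from the slide."""
    if not 0 <= pct <= 100:
        raise ValueError("percentage must be between 0 and 100")
    if pct <= 5:
        return "single value imputation"
    if pct <= 20:
        return "bivariate-based imputation"
    if pct <= 40:
        return "model-based imputation"
    # Assumption: variables with more than 40% missing are dropped.
    return "drop variable"


# Hypothetical column with 1 missing value out of 10 (10% missing).
column = [360, 240, 360, 349, 360, None, 900, 240, 360, 360]
print(missing_treatment(pct_missing(column)))
```

The thresholds are rules of thumb from the slide, so in practice they would likely be tuned per project rather than hard-coded.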
  • 8. DATA PREPARATION: CAPPING TREATMENT
    Capping treatment is another critical treatment, where nonsensical and extreme observations are removed to achieve stability in parameter estimates. It should always precede missing treatment, so that the imputed values for missing observations follow better distributions.
    Capping treatment is necessary to remove the following two types of incidents:
      i. Outliers: Extreme observations in the dependent variable, leading to high residuals in predictions
      ii. Influential Observations: Outliers on the independent variable side, leading to unstable or wrong parameter estimates
    Capping treatment has to consider:
      i. Distribution: A categorical variable should not be capped, etc.
      ii. Niche characteristics: If the outlier values describe a niche group of customers who have outliers in other variables as well, they should not be capped.
      iii. Business information: Certain nonsensical values signify something, such as "missing". They should be capped to a value close to the nearest sensible end of the range but kept outside it, so that the actual information is not lost.
    [Chart: Amount Transacted ($) vs. Count of Transactions, marking Outliers and Influential Observations]
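One common way to cap extreme values is to clip a variable at chosen percentiles. The sketch below assumes percentile-based limits (the slide does not specify how the caps are chosen), uses hypothetical function names, and leaves missing values (`None`) untouched so that capping can run before imputation.

```python
def percentile(sorted_vals, pct):
    """Percentile with linear interpolation between neighbouring values."""
    k = (len(sorted_vals) - 1) * pct / 100
    lo = int(k)
    hi = min(lo + 1, len(sorted_vals) - 1)
    return sorted_vals[lo] + (sorted_vals[hi] - sorted_vals[lo]) * (k - lo)


def cap(values, lower_pct=1, upper_pct=99):
    """Clip non-missing values to the [lower_pct, upper_pct] percentile range."""
    present = sorted(v for v in values if v is not None)
    if not present:
        return list(values)
    low = percentile(present, lower_pct)
    high = percentile(present, upper_pct)
    # Missing values stay missing; they are handled later by imputation.
    return [None if v is None else min(max(v, low), high) for v in values]


# Loan amounts from the sample table; 9999999 is an extreme value.
loan_amounts = [19900, 22100, 42000, 21760, 18000,
                15500, 25150, 17800, 9999999, 31200]
print(cap(loan_amounts, lower_pct=10, upper_pct=90))
```

This sketch caps every extreme value the same way. The slide's caveats (categorical variables, niche customer groups, coded values such as "missing") would need separate handling before calling it.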
  • 9. DATA PREPARATION: MISSING VALUE TREATMENT
    Missing treatment is inevitable, since the entire record is deleted if a certain variable has missing information. It is also the most complex treatment, as each variable has to be treated differently based on its meaning, missing content and data integrity issues.
    Sample data (callouts on the slide mark missing observations, shown as ".", and unrealistic values):
    No.  Pyoffflg  Prin0105  Loanamt  Term  Fixed  Agnsttr  Bbctrad  Nummortt  Rvoptbal  Numminq  Numminq3
      1         0    2324.9    19900   360      1       21      282         1       282        0         0
      2         0    3796.5    22100   240      0        6     6911         1     33978        1         1
      3         1   12523.2    42000   360      1        1    36350         .     36732        1         1
      4         0    5190.9    21760   349      1       42      885         1       911        0         0
      5         1      53.6    18000   360      1        5     8851         1      9506        0         0
      6         0    1256.9    15500   360      .       13      409         1       760        0         0
      7         0    4403.3    25150   900      1        3    21417         5     23579        3         1
      8         0    3137.2    17800   240      1        4     4528         2      5967        1         0
      9         0    4256.5  9999999   360      1        9    18179        47    130683        4         1
     10         0    6442.4    31200   360      1       34    33177         1         0        2         0
    Missing value imputation has to be done based on:
      i. Meaning of the variable: For example, a flag can take either 1 or 0 depending on the coding.
      ii. Distribution: For a continuous variable like Amount with little missing content, use the mean, etc.
      iii. Missing due to a merging issue: The value depends on whether the record was available in a particular table. For example, if the record is not present in the Restrictions table, a missing "freq_restrictions" can take the value 0; if it was present in the Restrictions table but the value is still missing, it should take the median value.
      iv. Correlation with other predictors: The missing value in a variable can depend on other variables in the dataset. For example, if "amount_received" is missing, the missing amount depends on the size of the merchant, the average amount received in the prior months, the type of products sold, the industry average, etc.
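Rule iii (missing because of a merge) can be illustrated with the "freq_restrictions" example from the slide. This is a sketch under assumptions: the record layout, the `id` key and the function name are hypothetical, and missing values are represented as `None`.

```python
from statistics import median


def impute_freq_restrictions(customers, restriction_ids):
    """Impute missing 'freq_restrictions' based on why the value is missing.

    customers: list of dicts with 'id' and 'freq_restrictions' (None if missing).
    restriction_ids: ids that actually appear in the Restrictions table.
    """
    observed = [c["freq_restrictions"] for c in customers
                if c["freq_restrictions"] is not None]
    fallback = median(observed) if observed else 0
    imputed = []
    for c in customers:
        value = c["freq_restrictions"]
        if value is None:
            # Present in the table but missing -> median; absent from the
            # table means the customer never had a restriction -> 0.
            value = fallback if c["id"] in restriction_ids else 0
        imputed.append({**c, "freq_restrictions": value})
    return imputed


# Hypothetical merged records: ids 3 and 4 have missing values.
customers = [
    {"id": 1, "freq_restrictions": 2},
    {"id": 2, "freq_restrictions": 4},
    {"id": 3, "freq_restrictions": None},  # present in Restrictions table
    {"id": 4, "freq_restrictions": None},  # absent from Restrictions table
]
result = impute_freq_restrictions(customers, restriction_ids={1, 2, 3})
print([c["freq_restrictions"] for c in result])
```

The point of the design is that the same missing marker gets two different treatments, which is why the Data Integration report on table presence matters at this stage.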
  • 10. DATA PREPARATION: BIVARIATE ANALYSIS
    Bivariate analysis explores the nature and degree of relationship between the independent and dependent variables. It is necessary to achieve stable and accurate predictions, apart from arriving at the correct recommendations.
    Dep Var = f(Indep Var, Log(Indep Var), Sin(Indep Var), ...)
      • Significant estimate with large magnitude
      • Insignificant estimate
      • Transformations required
    Bivariate Chart Analysis: mean dependent variable value vs. class
      • Dummy creation for certain classes
      • Variable dropping if there is no trend or relationship
    [Chart: Mean Txn Amt ($) vs. Mean MOB]
    [Chart: Mean Txn Amt ($) vs. Mean Count of Restrictions; annotations: Dummy = (count_rest<=2), No relationship]
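The bivariate chart analysis (mean dependent variable value per class, followed by dummy creation) can be sketched as below. The field names `count_rest` and `txn_amt` and the sample rows are hypothetical; the threshold of 2 mirrors the "Dummy = (count_rest<=2)" annotation on the slide.

```python
from collections import defaultdict


def mean_by_class(rows, class_key, target_key):
    """Mean of the dependent variable for each class of a predictor."""
    totals = defaultdict(lambda: [0.0, 0])
    for row in rows:
        totals[row[class_key]][0] += row[target_key]
        totals[row[class_key]][1] += 1
    return {cls: s / n for cls, (s, n) in sorted(totals.items())}


def add_dummy(rows, source_key, threshold, dummy_key):
    """Add a 0/1 dummy that is 1 when the source value is <= threshold."""
    return [{**row, dummy_key: int(row[source_key] <= threshold)}
            for row in rows]


# Hypothetical transactions: restriction count vs. transaction amount.
rows = [
    {"count_rest": 0, "txn_amt": 40.0},
    {"count_rest": 0, "txn_amt": 44.0},
    {"count_rest": 1, "txn_amt": 41.0},
    {"count_rest": 3, "txn_amt": 10.0},
    {"count_rest": 4, "txn_amt": 12.0},
]
print(mean_by_class(rows, "count_rest", "txn_amt"))
print([r["low_rest"] for r in add_dummy(rows, "count_rest", 2, "low_rest")])
```

When the class means are flat up to a cutoff and then shift, as in this made-up data, a single dummy captures the relationship more simply than the raw count.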
  • 11. DATA PREPARATION: MULTIVARIATE ANALYSIS
    Multivariate analysis helps remove interrelationships between the predictors to achieve stable and correct estimates at the individual variable level, which is necessary for correct strategy creation. The variance-based and correlation-based reductions are not mutually exclusive and might be applied judgmentally in different sequences to achieve the best set of predictors.
    Inter-correlations amongst predictors:
      • Linear relations: Collinearity removal based on VIF and CI values
      • Common variances: Factor Analysis
      • Total variances: Principal Component Analysis
    Significant predictors: those with eigenvalues > 1 or which capture 70% of the variance
    Variables should be grouped as per the information that they capture, and reductions are performed at the group level.
    Factor loadings are used to decompose the significant Factors/PCs to the variable level.
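The two retention rules on this slide (eigenvalue > 1, or enough components to capture 70% of the variance) can be compared with a short helper. This is a sketch: the function name and the eigenvalues are hypothetical, and the eigenvalues are assumed to come from a PCA or factor analysis run elsewhere.

```python
def select_components(eigenvalues, min_eigenvalue=1.0, variance_target=0.70):
    """Return (count by eigenvalue rule, count by cumulative-variance rule)."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    # Rule 1: keep components whose eigenvalue exceeds the threshold.
    by_eigenvalue = sum(1 for ev in ordered if ev > min_eigenvalue)
    # Rule 2: keep the fewest components reaching the variance target.
    by_variance = len(ordered)
    cumulative = 0.0
    for count, ev in enumerate(ordered, start=1):
        cumulative += ev
        if cumulative / total >= variance_target:
            by_variance = count
            break
    return by_eigenvalue, by_variance


# Hypothetical eigenvalues from a PCA on four standardized variables.
print(select_components([2.5, 1.2, 0.8, 0.5]))
```

The two rules can disagree, which is one reason the slide describes these reductions as being applied judgmentally.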
  • 12. CONTENTS: METHODOLOGY OVERVIEW; DATA COLLECTION; DATA PREPARATION; MODELING & ANALYSIS; PERFORMANCE DIAGNOSTICS; BEST PRACTICES; REFERENCES
  • 13. MODELING & ANALYSIS: DEFINITION OF DEPENDENT VARIABLE
    Business need defines the nature of the dependent variable and the analysis time windows in which the predictors and the performance are observed.
    The dependent variable captures the behavior of interest. It can be continuous or categorical, and raw or transformed (log, growth, etc.); the statistical technique used for analysis depends on the type and form of this variable.
    Analysis windows*:
    o Observation Window: the time window in which the various predictors are observed.
    o Performance Window: the time window in which the dependent variable is defined.
    o Out-of-Time Validation Window: the time window in which model performance and stability are checked.
    Note: *Population sizes and business dynamics have to be taken into account while deciding the analysis time windows and the form of the dependent variable.
  • 14. MODELING & ANALYSIS: A NOTE ON SAMPLING
    Every finding from the analysis has to be validated for reliability and accuracy across samples of data. The nature of the business problem and of the population decides the sampling technique and the sample sizes.
    Sampling process: Define the Population; Determine the Sampling Frame; Select Sampling Technique(s); Determine the Sample Size; Execute the Sampling Process.
    SAMPLING TECHNIQUES
    o SIMPLE RANDOM SAMPLING (SRS): every record is assigned a random number between 0 and 1, which determines whether it is selected. STRENGTHS: easily understood and implemented. WEAKNESSES: lower precision and no assurance of representativeness.
    o STRATIFIED SAMPLING: every record is assigned to a particular sub-population (stratum), the proportion of which is to be maintained in the final samples. SRS is used to select records from the sub-populations. STRENGTHS: increases representativeness. WEAKNESSES: not effective for very large or very small strata.
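Stratified sampling as described on this slide (simple random sampling within each sub-population, keeping the population proportions) can be sketched as below. The function name, the record layout and the 25% sampling fraction are assumptions made for the example.

```python
import random
from collections import defaultdict

def stratified_sample(records, stratum_of, fraction, seed=0):
    """Draw the same fraction from every stratum by simple random sampling."""
    rng = random.Random(seed)          # fixed seed so the draw is repeatable
    strata = defaultdict(list)
    for rec in records:
        strata[stratum_of(rec)].append(rec)
    sample = []
    for members in strata.values():
        k = round(len(members) * fraction)
        sample.extend(rng.sample(members, k))   # SRS within the stratum
    return sample

# Made-up population: 80 records in segment A, 20 in segment B
records = [{"id": i, "segment": "A" if i < 80 else "B"} for i in range(100)]
sample = stratified_sample(records, lambda r: r["segment"], fraction=0.25)

n_a = sum(1 for r in sample if r["segment"] == "A")
n_b = sum(1 for r in sample if r["segment"] == "B")
```

The 80:20 split of the population is preserved in the sample (20 and 5 records), which plain SRS does not guarantee.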
  • 15. MODELING & ANALYSIS: SOME TIDBITS ABOUT CLUSTERING
    Clustering is the segmentation of customers into homogeneous groups, similar within a cluster and different from those in other clusters, based on a set of behavioral characteristics. It identifies the structural breaks in the data, on either side of which the characteristics are fundamentally different, and hence is necessary to arrive at the real relationship of the predictors with the dependent variable.
    Most used methods of clustering:
    o Hierarchical Clustering: assigns observations to a cluster progressively, one at a time, based on a distance measure.
      Advantages: good for small datasets, as the algorithm finds the number of clusters.
      Limitations: it fails with large datasets as a result of memory issues.
    o K-means Clustering: a chosen number of cluster origins is selected at random; then all the remaining records are assigned to one of them based on a distance measure.
      Advantages: simplicity and speed.
      Limitations: it does not yield the same result with each run, since the resulting clusters depend on the initial random assignments. It minimizes intra-cluster variance, but does not ensure that the result is a global minimum of variance.
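A minimal one-dimensional k-means sketch shows the mechanics described on this slide: random initial origins, assignment by distance, then recomputing each origin as the mean of its cluster. The function name and the data are made up, and real use would involve several variables and a library implementation.

```python
import random

def kmeans_1d(values, k, seed=0, max_iter=100):
    """Plain k-means on a list of numbers; returns the sorted cluster centers."""
    rng = random.Random(seed)
    centers = rng.sample(values, k)            # random initial origins
    for _ in range(max_iter):
        clusters = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # Recompute each center; keep the old one if a cluster is empty
        new_centers = [sum(c) / len(c) if c else centers[i]
                       for i, c in enumerate(clusters)]
        if new_centers == centers:             # assignments have stabilized
            break
        centers = new_centers
    return sorted(centers)

values = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
centers = kmeans_1d(values, k=2)
```

On data this well separated every seed ends at the same two centers; on less clearly separated data, different seeds can give different clusters, which is the limitation noted on the slide.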
  • 16. MODELING & ANALYSIS: MODELING TECHNIQUES
    Modeling is the establishment of a relationship between the variable of interest and its various predictors. The technique therefore depends on the distribution of the dependent variable, the business problem, and the quality and quantity of data available for modeling. The findings of a stable and accurate model elucidate the degree and nature of the drivers of the dependent variable, and thus define the strategy to be taken for solving the business problem.
    Final Analysis Dataset:
    o Non-Parametric: does not depend on the distribution of the dependent variable.
    o Parametric: depends on the distribution of the dependent variable.
    Sl.No.  Target Variable Distribution  Modeling Approach    Model Output
    1       Continuous                    OLS Regression       A typical model: Y = f(X) = f(X1, X2, ..., Xn)
    2       Nominal                       Logistic Regression
    3       Categorical positive values   Poisson/Gamma
    4       Unidentified                  Decision Trees       Segments with increasing proportion of dependent variable
  • 17. GENERALIZED LINEAR MODELS: OLS REGRESSION (LINEAR)
    What is OLS? Ordinary Least Squares: the coefficients are chosen to minimize the sum of squared residuals, which is equivalent to maximizing the explained variance, R².
    Type of predicted (dependent) variable: continuous (-∞ to +∞).
    Business question: What loan amount take-off can we expect from a customer?
    Form of fitting function (how are they mathematically related?): y = α + β1X1 + β2X2, i.e. Predicted = Intercept + (relationship with predictor 1 × predictor 1) + (relationship with predictor 2 × predictor 2).
    Assumption for the modeling: residuals are independent, normally distributed with mean 0, and have uniform variance throughout.
    SAS procedure: Proc Reg
    Performance diagnostics (indicators of a good model):
    • R-square (0 to 1): how well does the model explain the variance in the predicted variable?
    • MSE (Mean Square Error): how large is the average squared difference between predicted and actual? MSE = Σ(actual value - predicted value)² / (count of obs); its square root, Root MSE, is in the units of the dependent variable.
    • Significance of parameter estimates: the p-value for the null hypothesis (no relationship) should be < 0.001.
    • Sign of parameter estimates: should be intuitive, or repeated in the validation sample.
    • Model validation: the model should be stable on both in-time and out-of-time validation samples.
    • Rank ordering: predict a high value when the actual is high, and vice versa.
    • AIC/SIC: parsimony (or efficiency), i.e. minimum predictors for maximum predictive power; compare across models.
  • 18. GENERALIZED LINEAR MODELS: OLS REGRESSION (LINEAR), SAMPLE MODEL OUTPUT
    The REG Procedure
    Model: MODEL1
    Dependent Variable: censor_po
    Number of Observations Read  40162
    Number of Observations Used  40162
    Analysis of Variance
    Source              DF  Sum of Squares  Mean Square  F Value  Pr > F
    Model               12       610.91533     50.90961   219.02  <.0001
    Error            40149      9332.36401      0.23244
    Corrected Total  40161      9943.27934
    Root MSE         0.48212   R-Square  0.0614
    Dependent Mean   0.5492    Adj R-Sq  0.0612
    Coeff Var        87.78642
    Parameter Estimates
    Variable                       DF  Parameter Estimate  Standard Error  t Value  Pr > |t|  Variance Inflation
    Intercept                       1             1.24953         0.20693     6.04    <.0001                   0
    APPLICATION_PRIM_CB_SCR_NBR     1           -0.000216      0.00028377    -0.76    0.4465              1.0205
    log_APPL_ADV_RATIO              1             -0.1166          0.0117    -9.96    <.0001             1.09417
    log_APPL_PYMT_TO_INCOME_RATIO   1            -0.01966         0.00517     -3.8    0.0001             1.17587
    Collinearity Diagnostics (the last four columns give the Proportion of Variation)
    Number  Eigenvalue  Condition Index   Intercept  APPLICATION_PRIM_CB_SCR_NBR  log_APPL_ADV_RATIO  log_APPL_PYMT_TO_INCOME_RATIO
         1      8.3631                1  0.00000188                   0.00000202          0.00002708                     0.00057815
         2     1.01345          2.87264    8.65E-09                     8.73E-09            1.04E-07                       5.68E-06
         3     0.96895          2.93787    2.42E-11                     5.60E-14            1.68E-09                      0.0000019
         8     0.22138          6.14626  0.00000754                   0.00000817          0.00009252                        0.00396
         9     0.20341          6.41212  0.00001611                   0.00001745          0.00020511                        0.01911
        10     0.05087         12.82208  0.00000322                   0.00000279          0.00011988                        0.26143
        11     0.02578         18.01153  0.00082432                   0.00088072             0.00992                        0.68574
        12     0.00137         78.10783     0.01375                      0.01859             0.96941                        0.02085
        13  0.00007104          343.097     0.98539                      0.98048             0.02008                     0.00000173
  • 19. GENERALIZED LINEAR MODELS: LOGISTIC REGRESSION
    What is Logistic? It predicts the log odds (event/non-event):
    Log(odds) = α + β1X1 + β2X2
    Predicted probability of event = e^(α + β1X1 + β2X2) / (1 + e^(α + β1X1 + β2X2))
    Predicted probability of non-event = 1 / (1 + e^(α + β1X1 + β2X2))
    Therefore, the total probability (event + non-event) at an observation level is 1.
    Type of predicted (dependent) variable: binary (1/0); one level is the event, the other is the 'reference'.
    Business question: What is the probability of a customer defaulting?
    SAS procedure: Proc Logistic (with various link functions)
    Performance diagnostics (indicators of a good model):
    • Concordance/Discordance: across all pairs of one event and one non-event observation, in what percentage of pairs is the actual event given the higher predicted probability?
    • Significance of parameter estimates: the p-value for the null hypothesis (no relationship) should be < 0.001.
    • Sign of parameter estimates: should be intuitive, or repeated in the validation sample.
    • Model validation: the model should be stable on both in-time and out-of-time validation samples.
    • Rank ordering: predict a high value when the actual is high, and vice versa.
    • Gains chart (KS statistic): the highest probabilities should be assigned to actual events.
    • AIC: parsimony (or efficiency), i.e. minimum predictors for maximum predictive power; compare across models.
    Note: *The Hosmer-Lemeshow test is a good check but fails when the model sample size is large.
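The link between log odds and predicted probability can be checked numerically. In this minimal sketch the coefficient values are hypothetical, chosen only to illustrate the formulas on the slide.

```python
from math import exp, log

def event_probability(alpha, betas, xs):
    """P(event) = e^z / (1 + e^z), with z = alpha + sum(beta_i * x_i)."""
    z = alpha + sum(b * x for b, x in zip(betas, xs))
    return 1.0 / (1.0 + exp(-z))   # algebraically the same as e^z / (1 + e^z)

# Hypothetical intercept, coefficients and predictor values
alpha, betas, xs = -2.0, [0.5, -0.3], [2.0, 1.0]

p_event = event_probability(alpha, betas, xs)   # z = -1.3
p_non_event = 1.0 - p_event
log_odds = log(p_event / p_non_event)           # recovers z
```

Taking the log of the odds returns the linear predictor, which is why the coefficients are read as effects on the log odds.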
  • 20. GENERALIZED LINEAR MODELS: LOGISTIC REGRESSION, SAMPLE MODEL OUTPUT
    The LOGISTIC Procedure
    Model Information
    Data Set                     MODOUT.TU60_VAL_FICO_690_719_EXP
    Response Variable            outcome
    Number of Response Levels    3
    Model                        generalized logit
    Optimization Technique       Fisher's scoring
    Number of Observations Read  607592
    Number of Observations Used  607592
    Response Profile
    Ordered Value  outcome  Total Frequency
                1        0           597504
                2        1             9432
    Logits modeled use outcome=0 as the reference category.
    Model Fit Statistics
    Criterion  Intercept only  Intercept & Covariates
    AIC             107549.99               106661.99
    SC              107572.63               106956.24
    -2 Log L        107545.99               106609.99
    Testing Global Null Hypothesis: BETA=0
    Test              Chi-Square  DF  Pr > ChiSq
    Likelihood Ratio    935.9990  24      <.0001
    Score               902.4392  24      <.0001
    Wald                892.8763  24      <.0001
  • 21. GENERALIZED LINEAR MODELS: LOGISTIC REGRESSION, SAMPLE MODEL OUTPUT contd.
    Type 3 Analysis of Effects
    Effect                DF  Wald Chi-Square  Pr > ChiSq
    APPLICATION_PRIM_CB_   2          14.5230      0.0007
    log_APPL_ADV_RATIO     2         126.6605      <.0001
    log_APPL_PYMT_TO_INC   2          83.5880      <.0001
    Analysis of Maximum Likelihood Estimates
    Parameter             DF  Development Model Estimate  Validation Model Estimate  Standard Error  Wald Chi-Square  Pr > ChiSq
    Intercept              1                      1.1321                    -0.4085          0.8909           0.2102      0.6466
    APPLICATION_PRIM_CB_   1                    -0.00349                   -0.00220         0.00122           3.2494      0.0715
    log_APPL_ADV_RATIO     1                     -0.3934                    -0.2839          0.0485          34.2834      <.0001
    log_APPL_PYMT_TO_INC   1                     -0.1206                    -0.0900          0.0221          16.5920      <.0001
    Odds Ratio Estimates
    Effect                outcome  Point Estimate  95% Wald Confidence Limits
    APPLICATION_PRIM_CB_        1           0.998  0.995  1.000
    log_APPL_ADV_RATIO          1           0.753  0.685  0.828
    log_APPL_PYMT_TO_INC        1           0.914  0.875  0.954
    Association of predicted probabilities and observed responses
    Percent Concordant        65.9   Somers' D  0.338
    Percent Discordant        32.1   Gamma      0.345
    Percent Tied               2.0   Tau-a      0.074
    Pairs               1806529536   c          0.669
    The higher the percent concordant, the better the model.
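The association statistics at the bottom of this output are all derived from pair counts. The sketch below, with a hypothetical function name and made-up scores, counts concordant, discordant and tied pairs over every (event, non-event) pair and derives Somers' D, Gamma and c using their standard definitions.

```python
def association(scores, labels):
    """Pair-based association statistics for a binary outcome."""
    events = [s for s, y in zip(scores, labels) if y == 1]
    non_events = [s for s, y in zip(scores, labels) if y == 0]
    conc = disc = tied = 0
    for e in events:
        for n in non_events:
            if e > n:
                conc += 1        # event scored higher: concordant
            elif e < n:
                disc += 1        # event scored lower: discordant
            else:
                tied += 1
    pairs = conc + disc + tied
    return {
        "pairs": pairs,
        "pct_concordant": 100.0 * conc / pairs,
        "pct_discordant": 100.0 * disc / pairs,
        "somers_d": (conc - disc) / pairs,
        "gamma": (conc - disc) / (conc + disc),
        "c": (conc + 0.5 * tied) / pairs,
    }

# Made-up predicted probabilities and actual outcomes
scores = [0.9, 0.7, 0.4, 0.6, 0.4, 0.2]
labels = [1, 1, 1, 0, 0, 0]
result = association(scores, labels)
```

The same relationships hold in the sample output: (65.9 - 32.1) / 100 = 0.338 is Somers' D, (65.9 - 32.1) / (65.9 + 32.1) is about 0.345 for Gamma, and (65.9 + 2.0 / 2) / 100 = 0.669 is c.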
  • 22. GENERALIZED LINEAR MODELS: LOGISTIC REGRESSION, RANK ORDERING OUTPUT contd.
    predgroup    obs   minpred   maxpred   avgpred  totact    avgact  cumact  predrank    cumpct  actrank        KS
            1  12551    0.2069         1  0.275689    3632  0.289379    3632         1  22.51984        1   14.5573
            2  13077  0.172384  0.206895  0.190708    2565  0.196146    6197         2  38.42386        2  21.07661
            3  12932  0.163982  0.172383  0.165289    2179  0.168497    8376         3  51.93452        3  24.98741
            4  12696  0.118382  0.163978  0.142257    1727  0.136027   10103         4  62.64261        4   25.9028
            5  12814  0.096125  0.118381  0.105572    1360  0.106134   11463         5  71.07515        5  24.10965
            6  12814  0.086392  0.096124  0.091463    1151  0.089824   12614         6  78.21181        6  20.83402
            7  12814  0.077738  0.086391  0.081861    1061    0.0828   13675         7  84.79043        7  16.92002
            8  11344   0.07317  0.077737  0.075261     811  0.071492   14486         8  89.81895        8  12.54508
            9  14284  0.069614  0.073168  0.072034     894  0.062588   15380         9   95.3621        9  6.134163
           10  12814   0.03382  0.069613  0.060393     748  0.058374   16128        10       100       10         0
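The cumpct and KS columns can be recomputed from the obs and totact columns alone. The sketch assumes that KS in each group is the cumulative percentage of events captured minus the cumulative percentage of non-events captured, which matches the values in the table; the function name is hypothetical.

```python
# obs and totact columns from the rank-ordering table, groups 1 to 10
obs = [12551, 13077, 12932, 12696, 12814, 12814, 12814, 11344, 14284, 12814]
totact = [3632, 2565, 2179, 1727, 1360, 1151, 1061, 811, 894, 748]

def gains_and_ks(obs, events):
    """Return (cumpct, ks) per group: cumulative % of events, and the gap
    between cumulative % of events and cumulative % of non-events."""
    total_events = sum(events)
    total_non_events = sum(obs) - total_events
    cum_events = cum_non_events = 0
    rows = []
    for n, e in zip(obs, events):
        cum_events += e
        cum_non_events += n - e
        cumpct = 100.0 * cum_events / total_events
        ks = cumpct - 100.0 * cum_non_events / total_non_events
        rows.append((cumpct, ks))
    return rows

rows = gains_and_ks(obs, totact)
ks_values = [ks for _, ks in rows]
best_group = ks_values.index(max(ks_values)) + 1   # group with the highest KS
```

The model's KS is the maximum over the groups, reached here in the fourth group, which is the "Highest KS" criterion used when fine-tuning.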
  • 23. GENERALIZED LINEAR MODELS: LOGISTIC REGRESSION, GAINS CHART contd.
    CAPTURING OF THE MODEL
    The column "cumpct" in the rank-ordering output gives the cumulative percentage of responders captured up to the given decile. The model captures about 22.5% of the responders in the first decile and about 71.07% of the responders in the top 5 deciles.
    [Gains chart: Responders captured (%) vs. Population (%); series: Model capturing, Random capturing]
    The higher the capturing in the initial deciles, the better the model performance.
  • 24. CRITERIA FOR FINE-TUNING
    Fine-tuning is based on applying the model to both the development and the validation samples. The following criteria are considered for fine-tuning the model: Rank Ordering; Coefficient Stability; Concordance; Highest KS; Goodness-of-fit; Validation; Capturing.
  • 25. RECAP
    Process flow shown on the slide, organized into Phase I, Phase II and Phase III:
    o UNDERSTAND THE BUSINESS PROBLEM
    o TRANSLATE THE BUSINESS PROBLEM INTO A STATISTICAL PROBLEM BASED ON IBCVM FRAMEWORK: decide on the number of models and identify the dependent variables for each model; identify the statistical method suitable for each predictive model (OLS Regression, Logistic Regression, etc.); hypothesize predictor variables.
    o PREPARE DATA SPECIFICATIONS & GET DATA
    o RAW DATA: DEVELOPMENT SAMPLE (sub-sample of raw data); VALIDATION SAMPLE (sub-sample of raw data); VALIDATION SAMPLE (out of time)
    o UNIVARIATE ANALYSIS: treatment of outliers
    o BIVARIATE ANALYSIS: treatment of missing values; variable transformations
    o MULTIVARIATE ANALYSIS: removal of multicollinearity; removal of insignificant variables
    o MODEL DEVELOPMENT: OLS / Logistic Regression; fine-tuning
    o Model validation; refinement based on client feedback
    o MODEL IMPLEMENTATION: prepare scoring code; track model performance at regular intervals; redevelop/rebuild models on a need basis
  • 26. REMAINING SLIDES
    PENDING SLIDES: OTHER TESTS (t-tests, ANOVA, CHI-SQUARE, etc.); PITFALLS IN STATISTICS; SPURIOUS CORRELATION; ENDOGENOUS & EXOGENOUS ERRORS; ACCURACY vs. RANKING; CAUSATION vs. CORRELATION; POPULATION STABILITY INDEX
    OTHER THINGS TO BE ADDED: BEST PRACTICES DOCUMENT; SAS & EXCEL MACROS; REFERENCES; SAMPLE DATA, CODE, OUTPUT; CHEAT SHEET