DIY Driver Analysis Webinar slides

TIM BOCK PRESENTS
Examples in Q5 (beta)
If you have any questions, enter them into the Questions field.
Questions will be answered at the end. If we do not have time to get to
your question, we will email you.
We will email you a link to the slides and data.
Get a free one-month trial of Q from www.q-researchsoftware.com.
DIY Driver Analysis

Overview
• Objectives of (key) driver analysis
• Overview of techniques
• 13 assumptions that need to be checked when doing QA for driver analysis
(there are workarounds for most of these)
1: There are 15 or fewer predictors (if using Shapley)
2: The outcome variable is monotonically increasing
3: The outcome variable is numeric (if using Shapley)
4: The predictor variables are numeric or binary
5: People do not differ in their needs/wants (segmentation)
6: The causal model is plausible
7: There is no multicollinearity/correlations between predictors (if using GLMs)
8: There are no unexpected correlations between the predictors and the outcome variable
9: The signs of the importance scores are correct
10: The predictor variables have no missing values
11: There are no outliers/influential data points
12: There is no serial correlation (aka autocorrelation)
13: The residuals have constant variance (i.e., no heteroscedasticity in a model with a linear outcome variable) 2

The basic objective of (key) driver analysis
The basic objective: work out the relative importance of a series of predictor
variables in predicting an outcome variable. For example:
• NPS: comfort vs customer service vs price.
• Customer satisfaction: wait time vs staff friendliness vs comfort.
• Brand preference: modernity vs friendliness vs youthfulness.
What driver analysis is not: predictive analysis (e.g., predicting sales, customer
churn). Although, you can use driver analysis to make strategic predictions (e.g., if I
improve, say, fun, then preference will increase.)
3

Likelihood to
recommend
This brand is
fun
This brand is
exciting
This brand is
youthful
6 1 1 1
9 0 1 0
7 0 0 0
6 1 1 1
9 0 1 0
7 0 0 1
7 0 0 0
What the data looks like
This data shows 7
observations
1 outcome
variable
4
Predictor variables
(Typically there will be more than 3.)

Case study 1: Technology
Outcome variable(s) Predictor variable(s)
Likelihood to recommend:
• Apple
• Microsoft
• IBM
• Google
• Intel
• Hewlett-Packard
• Sony
• Dell
• Yahoo
• Nokia
• Samsung
• LG
• Panasonic
Brand associations:
• Fun
• Worth what you pay for
• Innovative
• Good customer service
• Stylish
• Easy-to-use
• High quality
• High performance
• Low prices
5

ID
Apple
Microsoft
IBM
Apple
Microsoft
IBM
Apple
Microsoft
IBM
1 6 9 7 1 0 0 1 1 0
2 8 7 7 1 0 0 1 0 0
3 0 9 8 0 1 0 0 0 0
4 0 0 0 0 0 0 0 0 0
This brand is
fun
This brand is
exciting
Likelihood to
recommend
ID Brand
Likelihood to
recommend
This brand is
fun
This brand is
exciting
1 Apple 6 1 1
1 Microsoft 9 0 1
1 IBM 7 0 0
2 Apple 6 1 1
2 Microsoft 9 0 1
2 IBM 7 0 0
3 Apple 6 1 1
3 Microsoft 9 0 1
3 IBM 7 0 0
4 Apple 6 1 1
4 Microsoft 9 0 1
4 IBM 7 0 0
The data (stacked)
From: one row per respondent
To: one row per brand per respondent
6

Tips for stacking
• Get an SPSS .SAV data file. If you do not have an SPSS file:
• Import your data the usual way
• Tools > Save Data as SPSS/CSV and Save as type: SPSS
• Re-import
• Tools > Stack SPSS .sav Data File
• Set the labels for the stacking variable (in Q: observation) in Value Attributes
• Delete any None of these data (e.g., brand associations where respondents were
able to select None of these
7

Case study 2: Cola brand attitude
Outcome
variable(s)
34 Predictor
variable(s)
If the brand was a person, what would
its personality be?
Hate/Dislike/Neither/
Like/Love/Don’t know:
• Coke Zero
• Coke
• Diet Coke
• Diet Pepsi
• Pepsi Max
• Pepsi
Brand associations:
• Beautiful
• Carefree
• Charming
• Confident
• Down-to-earth
• Feminine
• Fun
• Health-conscious
• Hip
• Honest
• Humorous
• Imaginative
• Individualistic
• Innocent
• Intelligent
• Masculine
• Older
• Open to new
experiences
• Outdoorsy
• Rebellious
• Reckless
• Reliable
• Sexy
• Sleepy
• Tough
• Traditional
• Trying to be cool
• Unconventional
• Up-to-date
• Upper-class
• Urban
• Weight-conscious
• Wholesome
• Youthful
8

9
Example output:
Importance scores
Key
drivers of
cola
preference

10
Example output:
Performance-
Importance Chart
(aka Quad Chart)

Example output: Correspondence Analysis with Importance
11

Overview
13: The residuals have constant variance (i.e., no heteroscedasticity in a model with a linear outcome variable) 12

Standard “best practice”
recommendation for
driver analysis:
LMG
Lindeman, Merenda, Gold (1980)
=
Kruskal
Kruskal (1987)
=
Dominance Analysis
Budescu (1993)
=
Shapley / Shapley Value
Lipovetsky and Conklin(2001)
The average
improvement in R² that a
predictor makes across
all possible models (aka
“Shapley”)

14
Best practice:
Bespoke models
(e.g., Bayesian
multilevel model)
Bivariate metrics
E.g., Correlations,
Jaccard
Coefficients
Shapley,
Relative
Importance
Analysis
Much too hard Too hard Too Soft Just Right
GLMs
(e.g., linear
regression)

What makes bespoke models and GLMs too hard?
To estimate an OK bespoke model,
you need to have a few week, and
know lots of things, including:
• Joint interpretation of parameter
estimates, the predictor covariance
matrix, and the parameter
covariance matrix
• Conditional effects
• Multicollinearity
• Confounding (e.g., suppressor
effects)
• Estimation (ML, Bayesian)
• Specification of informative priors
• Specification of random effects
15
To understand importance in a GLM (e.g., linear
regression), you need to know quite a lot about:
• Joint interpretation of parameter estimates, the
predictor covariance matrix, and the parameter
covariance matrix
• Conditional effects
• Multicollinearity
• Confounding (e.g., suppressor effects)
Shapley and similar methods allow
us to be less careful when
interpreting results

16
Bespoke models
& GLMs
Proportional
Marginal Variance
Decomposition
Shapley
With coefficient adjustment
Lipovetsky and Conklin(2001)
Random Forest
(for importance analysis)
Kruskal’s Squared
partial correlation
Called Kruskal in Q
Relative Importance
Analysis
AKA Relative Weight: Johnson (2000)
Shapley

Shapley case study
• Open Initial.Q. This already contains the cola data.
• File > Data Sets > Add to Project > From File > Stacked Technology
• Create > Regression > Driver (Importance) Analysis > Shapley
• Dependent variable: Q3. Likelihood to recommend [Stacked Technology]
• Dependent variable: Q4 variables from Stacked Technology
• No when asked about confidence intervals (clicking Yes is OK as well)
• Note that High Quality is the most important, with a score of 18.2
• Right-click: Reference name: shapley
Everything I demonstrate in this webinar is described on a slide like this. The rest of them
are hidden in this deck, but you can get them if you download the slides. So, there is no
need to take detailed notes.
17
Instructions for the case studies

Overview
13: The residuals have constant variance (i.e., no heteroscedasticity in a model with a linear outcome variable)18

19
Options (ranked from best to worst) Comments
Relative Importance Analysis
Tends to give almost identical results
to Shapley regression, and much,
much, faster to compute.
Shapley using only a subset of models (e.g.,
all predictors, leave 1 out predictor, 2
predictors, 1 predictor)
There is no evidence that this
method produces accurate estimates
of Shapley importance.
This is likely to be less similar to a
proper Shapley than using Relative
Importance Analysis.
Dimension reduction (e.g., PCA) Hard to interpret.
Issue
Shapley Regression cannot be
computed with more 15
predictors in Q (it reverts to
Relative Importance Analysis)
If you run Shapley Regression in
the R package relimpo with
this many variables you get a
crash.

• With the cola study, we have 34 variables, and that will take an infinite amount of time to
compute, so using Shapley is not an option and we have to use Relative Importance Analysis.
• We can use the technology data set, which only has 9 predictors, to explore how similar the
techniques are.
• Create > Regression > Linear Regression
• Reference name: relative.importance
• Select variables
• Output: Relative importance analysis
• Check Automatic Note that High Quality is again most important
• Right-click: Add R Output:
comparison = cbind(shapley = shapley[-10],
"Relative Importance" =
relative.importance$relative.importance$importance)
• Calculate
• Change shapley to shapley[-10]
• Calculate
• Right-click: Add R Output: correlation = cor(comparison)
• Increase number of decimal places. Note the correlation is 0.999
• Rename output: Correlation
• Insert > Charts > Visualization > Labeled Scatterplot,
• Table: comparison
• Automatic
20

21
Options (not mutually exclusive) Comments
Set Don’t Knows to missing
Merge categories
• Do this when there are categories that
have ambiguous orderings (e.g., OK
and Good).
• The more categories you merge, the
less significant the results will be.
Recode the data in some meaningful way
(e.g., reverse the scale, Likelihood to
recommend, recoded as NPS)
The specific values tend to make little
difference, so using a recoding that is
easy to explain to stakeholders, such as
NPS, is often desirable.
Issue
All the standard driver analysis
algorithms assume that the
outcome variable contains
categories ordered from lowest
to highest, and which are
believed to be associated with
greater levels of preference.
Test
This is usually best checked by
creating a summary table.

• Right-click: Add Table
• Blue drop-down: Likelihood to recommend [Stacked Technology.sav]
Other than the NET, which is ignored by analyses, the order is correct.
• Click on table of outcome variables in Technology. Note that there is no problem.
• Click on Brand preference Note that here we have a Don’t Know.
• Right-click on Don’t Know and press Remove. This sets it as missing values.
• Click on two grey outputs (just setting up a later analysis)
22

23
Use limited dependent variable versions of
Relative Importance Analysis (e.g., Ordered
Logit)
• The less numeric the variable, the
better this option is.
• This approach is also preferable
because it can take non-linear
relationships into account
automatically.
Ignore the problem and use Shapley.
Where the variable is close to being
numeric, there is probably little lost
by this approach.
Issue
Shapley assumes that the
outcome variable is numeric
(theoretically, it can deal with
non-numeric outcome variables,
but for more than about 10 or so
variables, it is impractical).

3: Numeric outcome
• Select relative.importance
• Type: Ordered logit
• Select chart We can see that the correlation between the models is smaller than
before. This is because correctly modeling the data type has a bigger impact on
our results than the difference between Shapley and Relative importance analysis.
• Note that the changes are pretty trivial.
24

25
Set Don’t Knows to missing
This can be problematic as the variables as the
missing values may not be missing at random.
This is discussed later.
Merge categories
• Do this when there are categories that have
ambiguous orderings (e.g., OK and Good).
• The more categories you merge, the less
significant the results will be.
Recode the data in some meaningful way
(midpoint recoding)
Use a bespoke or Generalized Linear Model
(GLM), with dummy variables and/or splines,
computing importance as the difference
between the lowest and largest effect sizes for
each variable.
In theory this is the best approach to dealing
with non-numeric data, but it requires quite a
lot to get right and, when interpreting the data,
the sampling error of the categorical and spline
effects will make them hard to compare.
Issue
Both Shapley and
Relative Importance
Analysis assume that
the predictor
variables are
numeric or binary.

26
Estimate an appropriate bespoke model
(e.g., latent class analysis) and then
estimate the driver analysis models
within each segment
In Q: In a non-stacked data file, set up the
data as an Experiment, and use Create >
Segment > Latent Class Analysis
Form segments by judgment, and
estimate separate relative importance
analyses for each segment.
Ignore the problem, interpreting results
as “average” effects
Rightly-or-wrongly, this is how 99.9%* of
all modelling is done.
* Made-up number
Issue
Traditional driver analysis
techniques assume that people
have the same needs/wants, and
apply these consistently from
situation to situation.
How to test
• Compare by brand
• Compare by other data
• Latent class analysis

• Select the Shapley Importance for Q3… table and press Duplicate
• Brown drop-down: Brand [Stacked Technology.sav]
• Note that this table is showing difference in important scores by segment. It is not
showing difference in perceptions.
• The colors and arrows are telling us that there are differences by brand. Look at
Fun. This is significantly lower for Intel, HP, and Dell. But, really important in the
evaluation of Google.
• What do we do now? Let’s return to the previous slide and look at the options.
27

28
Build a bespoke model This is usually too hard
Include all the relevant (non-outcome)
variables and cross your fingers (if you
have not collected the data, you cannot
magic it into existence)
Rightly-or-wrongly, this is how 99.9%* of
all modelling is done
* Made-up number
Issue
All driver analysis techniques
assume that the analysis is a
plausible explanation of the
causal relationship between the
predictor variables and the
outcome variable.
This assumption is never true.
How to test
Common sense. Four common
examples are shown on the next
slides.

Example causality problem: Omitted variable bias
If we fail to include a relevant predictor variable, and that variable is correlated with the
predictor variables that we do include, the estimates of importance will be wrong. If
your R-square is less than 0.9, you may have this problem (a typical R-square is closer to
0.2 than 0.9).
Predictor 1
Predictor 2
Predictor 3
Predictor 4
E.g., price
Outcome 1
Assumed
predictor
variables
Arrows denote the true causal relationship
29

Example causality problem:
Outcome variable included as a predictor
30
Predictor 1
Predictor 2
Predictor 3
Outcome 2
E.g., Satisfaction
Outcome 1
E.g., NPS
Assumed predictor variables
If we include a predictor variable that is really an outcome variable, the
estimates of importance will be wrong.

6: Outcome variable a predictor
• Click on relative.importance
• Worth what paid for is clearly an example of this problem, so should be deleted.
• Remove Worth what paid for from the model
• Click on Shapley Importance for Q3. Likelihood to recommend
• Click on Comparisons. Note
• The warning
• The R-square is back in the model.
• Change in the R code -10 to -9
31

Example causality problem: Backdoor path
32
Predictor 1
E.g., price perception
Predictor 2
E.g., quality
Predictor 3
E,g., packaging
Variable
E.g., Attitude
Outcome 1
E.g., NPS
Assumed
predictor
variables
If backdoor path exists from the predictors to the outcome variable, the
estimates of importance will be wrong (spurious).

Example causality problem: Functional form
33
Assumed functional form
If we have the wrong functional form (i.e., assumed equation), the
estimates of importance will be wrong.
Outcome = Predictor 1 + Predictor 2 + Predictor 3
Outcome = Predictor 1 × Predictor 2 + Predictor 3
True functional form

7: There is no multicollinearity/correlations between
predictors (if using GLMs, e.g., linear regression)
34
Take all the relevant theory into
account when interpreting the
results.
This requires a strong technical and
intuitive understanding of the
underlying maths. Even if you possess
that understanding, it is really difficult to
explain to clients (particularly if it is a
tracking study and they are seeing
results fluctuate from period-to-period)
Use Shapley or Relative Importance
Analysis.
These techniques are designed to
address this problem. They are not
perfect, but they are easier to interpret
than linear regression and other GLMs
when predictor variables are correlated.
Issue
The bigger the correlations between predictors,
the more difficult it is to accurately interpret
estimates from traditional GLMs (e.g., linear
regression)
Test
1. Inspect the Variance Inflation Factors (VIF)
or Generalized Variance Inflation Factors
(GVIF). Q automatically computes these and
warns you if they are high.
2. Inspect the coefficients. Do they make
sense?
3. Look at the correlations.

7: There is no multicollinearity
• Uncheck Automatic
• Change it back to Summary. Now it is showing an order logit regression.
• Type to Linear.
• Note that we have a negative effect for Stylish. That is, the model says that if we hold everything else
constant, an improvement in Stylish will make a reduction in likelihood to recommend. This doesn’t
seem to make sense.
• Create > Correlation > Correlation Matrix. Select
• Q3. Likelihood to recommend
• Q4.
• Note that there is a moderate correlation between Stylish and Likelihood to recommend. And, Stylish is
correlated with everything out.
• My intuition is that this is all some weird and uninteresting quirk in the data. But, in truth I don’t know.
• Conclusion: we should be using Shapley or Relative Importance Analysis. It is designed to remove such
headaches.
• Output: Relative Importance Analysis
35

8: There are no unexpected correlations between the
predictors and the outcome variable
36
Options (ranked from best to worst)
Investigate the data to make sense of the
unexpected relationships.
Remove problematic variables from the
analysis.
Issue
When people interpret
importance scores, they assume
that higher means better. This is
assumption is not always right.
Test
Correlate each predictor variable
with the outcome variable

8: There are no unexpected correlations
• Click back on the correlation matrix. As we discussed earlier, these all look fine.
• Now, let’s look at the cola correlations. I computed them before, but they will
need to re-run because we removed the Don’t Know category from the Outcome
Variable.
• Click on cola.correlations
• You can see we have some negative correlations in the first column, which shows
the correlations with the outcome variable.
37

38
Recommendation
If all the effects should be
positive, select the Absolute
importance scores option.
Otherwise, manually change
the results when reporting.
Issue
The underlying Shapley and Relative Importance Analysis algorithms always
compute a positive importance scores.
However, the true effect of a predictor can be negative, resulting in people
misinterpreting the results.
Test
Compute a GLM (e.g., linear regression). Any negative coefficients warrant
investigation. For this reason, Q automatically does this and puts the signs of the
multiple regression coefficients onto the driver analysis outputs (both Shapley and
Relative Importance Analysis).
If the correlation is also negative, it means that the effect is negative. If positive, it
suggests that the multiple regression is picking up a non-interesting artefact.

• Stylish is negative. As discussed, it is because of the multiple regression. This is
really just a reminder that we need to be careful. We have already looked at this,
and we know that the correlation is positive, so we can ignore this.
• Check Absolute importance scores
• Go to comparison, and replace shapley[-9] with abs(shapley[-9])
39

40
Create a bespoke model that appropriately
models the process(es) that cause the
values to be missing.
This is really hard!
Multiple imputation of missing values
If using Relative Importance Analysis, set Missing
Data to Multiple Imputation
Leave out observations with missing values
from the analysis (i.e., complete case
analysis)
This implicitly assumes that the data is Missing
Completely At Random (MCAR; i.e., other than
that some variables have more missing values than
others, there is no pattern of any kind in the
missing data).
Test this assumption using Automate > Browse
Online Library > Missing Data > Little’s MCAR Test
Issue
There are missing
values of predictor
variables (e.g.,
some attributes
were not collected
for some
respondents, or
there were “don’t
know” response)

11: There are no outliers/unusual data points
41
Inspect each unusual observation, and
understand if it is an error or not
Difficult/time consuming
Filter out all the unusual observations, and
check to see if the model has changed. If it
has changed, and the number of unusual
observations is small, use the new model.
Ignore the problem
This is, by far, the most common
approach.
Issue
A few outliers/unusual
observations can skew the
results of importance analysis.
Test
• Hat/influence scores
• Standardized residuals
• Cook’s distance

11: There are no outliers/unusual data points
• Click on 3 more (warnings)
• Note that we have a warning about Unusual Observations. Let’s
explore this further.
• Create > Regression > Diagnostic > Plot > Influence Index
• Uncheck Automatic (so that it does not update when we later
change the model)
• You can see the various unusual observations have been marked:
379, 382, 295, and so on
• Go to Notes tab and copy code
• Variables and Questions – Stacked Technology.sav
• Right-click on a row: Insert > Variable(s) > R Variable:
• !(1:4056 %in% c(295, 1744, 1749, 2764, 3420, 246,
2364, 2829, 2830, 4029, 295, 2380, 2764, 3049,
3050)
• Name: Unusual observations removed
• Check the F in the Tags column. Make the variable available as a
filter.
• Go back to Output: relative.importance
• Apply the filter to the table
42

43
Create a bespoke model that addresses the
serial correlation (e.g., a random effects model
if the serial correlation is due to repeated
measures, or a time series model if it is
measures over time)
This is a lot of work.
Don’t report statistical test results (i.e., p-
values).
The importance scores will be OK. The
significance tests will be misleading to
an unknown extent.
Issue
The standard tests for the
significance of a predictor
assume that there is no serial
correlation/autocorrelation (a
particular type of pattern in the
residuals).
Whenever you stack data you
are highly likely to have this
problem.
Test
Regression > Diagnostic > Serial
Correlation (Durbin-Watson)

12: There is no serial correlation (aka auto…)
• Regression > Diagnostic > Serial Correlation (Durbin-Watson)
• These are significant. As mentioned, this is always the case.
44

13: The residuals have constant variance (i.e., no
heteroscedasticity in a model with a linear outcome variable)
45
Use a more appropriate model (e.g., ordered
logit)
This is not possible with Shapley.
This models make other,
hopefully less problematic,
assumptions (beyond the scope
of this webinar)
Use robust standard errors
This is not possible with Shapley.
In Q: check Robust standard
error
Issue
The standard tests for the significance
of a predictor in a linear model assume
that the variance of the residuals is
constant.
This is rarely the case in driver analysis,
as usually the data is from a bounded
scale (e.g., if it is a rating out of 10, it is
impossible for a value to be observed
that is greater than 10).
Test
Displayr automatically performs the
Breusch-Pagen Test Type = Linear

13: The residuals have constant variance
• Click on relative.importance. Note that there is a warning.
• Change the Type to Ordered Logit
46

Creating the donut chart (earlier in the presentation)
• The first step is to create a new table containing the cola importance scores. Add a new R
Output, with code:
ColaImportance = cola.importance$relative.importance$importance
# Removing "Performance: " from the labels
names(ColaImportance) = flipFormat::TidyLabels(names(ColaImportance))
ColaImportance
• Then, sort the cola importance scores, in a descending order:
• Add a new R Output, with code:
SortedColaImportance = sort(ColaImportance, decreasing = TRUE)
• Create > Charts > Visualization > Donut Chart
• Select SortedColaImportance as the Table
• Data value suffix: %
• Data label decimal places: 1
• Change the name in the report tree to Donut
• Check Automatic (at the top)
47

Creating the performance-importance chart
(earlier in the presentation)
• Right-click on donut in the Report tree and select Add Table
• In the blue drop-down menu, select Performance
• Click on Brand preference, at the top of the Report tree
• Add another table and select Brand [Stacked Cola Brand Associations.sav] in the blue drop-down menu
• Right-click on the 17% next to Diet Coke and select Create Filter
• Make sure that Apply filter to the current table is not selected, and press OK
• Click the Performance table, and apply the Diet Coke filter (using the Filter drop-down, at the bottom-left of the
screen)
• Right-click on Performance and select Reference Name and change this to performance
• Right-click on the Performance table in the Report tree and select Add R Output, entering the code
PerformanceImportance = cbind("Importance (%)" = ColaImportance,
"Diet Coke Brand Associations (%)" = table.Performance[-
nrow(table.Performance)])
• Create > Charts > Visualization > Labeled Scatterplot
• Set Table to PerformanceImportance
• Rename to Perfomance Importance Chart
• Press Calculate
48

Creating the correspondence analysis chart
(earlier in the presentation)
• Right-click on Performance Importance Chartin the Report tree and select Add
Table
• In the blue drop-down menu, select Performance
• In the brown drop-down select Brand [Stacked Cola Brand Associations.sav]
• Right-click on None of these and select Remove
• Create > Dimension Reduction > Correspondence Analysis of a Table
• Table: Performance by Brand
• Output: Bubble Chart
• Bubble sizes: ColaImportance
• Legend title: Importance
49

TIM BOCK PRESENTS
Q&A Session
Type questions into the Questions fields in GoToWebinar.
If we do not get to your question during the webinar, we will write back
via email.
We will email you a link to the slides and data.
Get a free one-month trial of Q from www.q-researchsoftware.com.

DIY Driver Analysis Webinar slides

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Viewers also liked

Viewers also liked (20)

Similar to DIY Driver Analysis Webinar slides

Similar to DIY Driver Analysis Webinar slides (20)

Recently uploaded

Recently uploaded (20)

DIY Driver Analysis Webinar slides