4. Idea: Don’t explain the whole model, just one prediction

Complex models are inherently complex! But a single prediction involves only a small piece of that complexity.

[Figure: a complex model mapping an input value to an output value]
5. Goal: Model-agnostic interpretability

What if we could view the model as a black box… …and yet still be able to explain predictions?

[Diagram: data → model → prediction, with a "magic explanation" box attached to the prediction]

Interpretable, accurate: choose two!
6. If only we had this magic box…

[Diagram: data → model → prediction → magic explanation]

Predictions from any complex model could be explained.
Prediction would be decoupled from explanation, reducing method lock-in.
Explanations from very different model types could be easily compared.
7. So let’s build it!

[Diagram: data → model → prediction, with the magic explanation box]
8. How much money is someone likely to make?

model → 31% chance of making > $50K annually
9. How much money is someone likely to make?

Capital losses: $0
Weekly hours: 40
Occupation: Protective-serv
Capital gains: $0
Age: 28
Marital status: Married-civ-spouse

model → 31% chance of making > $50K annually
10. How did we get here?

[Figure: chance of making > $50K annually on a 15%–40% scale, with the base rate at 26% and the model prediction at 31%]
11. When no attributes are given to the model, it predicts the base rate:

Occupation: Exec-managerial
Age: 37
Relationship: Wife
Years in school: 14
Sex: Female
Marital status: Married-civ-spouse

model → 26% chance of making > $50K annually
12. Now only one attribute is given to the model:

Capital losses: $0 (given to the model)
Weekly hours: 40
Occupation: Protective-serv
Capital gains: $0
Age: 28
Marital status: Married-civ-spouse

model → 25% chance of making > $50K annually
13. Model prediction if we only know they had no capital losses: 25%

[Figure: chance of making > $50K annually on the 15%–40% scale, with the base rate at 26% and the "no capital losses" prediction at 25%]
14. Prediction if we know they had no capital losses and work in police/fire: 24%

[Figure: chance of making > $50K annually on the same 15%–40% scale, with the base rate at 26% and the police/fire prediction at 24%]
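Carried through the remaining attributes, this walk ends at the model's 31% prediction for this person. The bookkeeping is purely additive, which a few lines of Python make concrete. The first two contributions below come from slides 13–14; the rest are hypothetical placeholders, chosen only so the total matches the 31% prediction:

```python
# Additive walk from the base rate to the model's prediction.
base_rate = 0.26

# Per-feature contributions. The first two match slides 13-14;
# the remaining values are hypothetical placeholders that sum
# with them to the model's 31% output.
contributions = {
    "Capital losses = $0":                 -0.01,  # 26% -> 25% (slide 13)
    "Occupation = Protective-serv":        -0.01,  # 25% -> 24% (slide 14)
    "Capital gains = $0":                  -0.02,  # hypothetical
    "Weekly hours = 40":                   +0.01,  # hypothetical
    "Age = 28":                            -0.04,  # hypothetical
    "Marital status = Married-civ-spouse": +0.12,  # hypothetical
}

prediction = base_rate + sum(contributions.values())
print(f"{prediction:.0%}")  # 31%
```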
24. Samples clustered by explanation similarity

[Figure: (A) predicted probability of making ≥ $50K, where bar width is equal to the ES value for that input; (B) samples sorted by ES-value similarity, forming clusters such as "Large capital loss", "Large capital gain", "Young and single", "Highly educated and married", "Highly educated and single", "Young and married", "Divorced women", and "Married, typical education"]
25. Explaining a single prediction from a model with 500 decision trees

[Figure: chance of making > $50K annually on a 15%–40% scale, decomposed into per-feature contributions]

Unique optimal explanation under basic axioms from cooperative game theory.
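A usage note, not from the slides: this kind of single-prediction explanation is what the author's shap Python package produces. A minimal sketch, assuming a trained model and data that you supply:

```python
import shap

# Assumptions (not from the slides): `model` is a trained classifier
# exposing predict_proba, `X_background` is a small sample of training
# rows used to represent "missing" features, and `x_row` is the single
# input to explain.
explainer = shap.KernelExplainer(model.predict_proba, X_background)
phi = explainer.shap_values(x_row)  # per-feature contributions for one prediction
```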
33. Car salesman example

Age: known to be 25
Weight: known to be 150
Is student: known to be 1

Imagine the explanation g_x is a linear model of x′:
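Spelled out (a reconstruction, using the convention formalized on slide 35 that x′_i = 1 when feature i is observed and 0 when missing):

g_x(x′) = φ_0 + φ_1·x′_1 + φ_2·x′_2 + φ_3·x′_3

Here all three features are known, so x′ = (1, 1, 1) and the explanation sums to φ_0 + φ_Age + φ_Weight + φ_IsStudent.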
35. SHAP class of explanation methods

• Axiom 1 (Binarization): the inputs x′ are binary, where 0 means "missing" and 1 means "observed".
• Axiom 2 (Linearity): an explanation is a linear model.

All methods that satisfy Axioms 1 and 2 are in the Shapley Additive Explanation (SHAP) class.
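As a formula, every method in the SHAP class writes its explanation as the additive model

g(z′) = φ_0 + Σ_{i=1}^M φ_i·z′_i,  with z′ ∈ {0, 1}^M,

so the explanation of a prediction is just the set of per-feature coefficients φ_i.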
36. Given two natural axioms, there is only one possible magic box in the SHAP class!

[Diagram: a method m takes a model f and an input x (with prediction f(x)) and produces an explanation g_x]

m is uniquely determined for all methods in the SHAP class under two axioms.
38. Monotonicity axiom: if we make a new model f′(x) that is larger than f(x) whenever x′_i = 1, then φ_i(f′, x) ≥ φ_i(f, x).

The i'th SHAP value for f, φ_i(f, x), aggregates the output-value differences produced by adding x′_i (input feature i is the second position in each vector):

f([0, 1, 1, 0, 0]) − f([0, 0, 1, 0, 0])
f([0, 1, 1, 1, 1]) − f([0, 0, 1, 1, 1])
f([0, 1, 0, 0, 0]) − f([0, 0, 0, 0, 0])
f([1, 1, 1, 0, 1]) − f([1, 0, 1, 0, 1])
f([1, 1, 1, 1, 1]) − f([1, 0, 1, 1, 1])
f([1, 1, 0, 0, 1]) − f([1, 0, 0, 0, 1])
39. Monotonicity axiom: if we make a new model f′(x) that is larger than f(x) whenever x′_i = 1, then φ_i(f′, x) ≥ φ_i(f, x).

The i'th SHAP value for f′, φ_i(f′, x), aggregates the same output-value differences, now for f′:

f′([0, 1, 1, 0, 0]) − f′([0, 0, 1, 0, 0])
f′([0, 1, 1, 1, 1]) − f′([0, 0, 1, 1, 1])
f′([0, 1, 0, 0, 0]) − f′([0, 0, 0, 0, 0])
f′([1, 1, 1, 0, 1]) − f′([1, 0, 1, 0, 1])
f′([1, 1, 1, 1, 1]) − f′([1, 0, 1, 1, 1])
f′([1, 1, 0, 0, 1]) − f′([1, 0, 0, 0, 1])
40. Proofs from coalitional game theory show there is only one possible set of values φ_i that satisfies these axioms: the Shapley values.
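Concretely, φ_i(f, x) is a weighted average, with weights |S|!·(M − |S| − 1)!/M!, of exactly the output differences listed on slides 38–39, taken over all subsets S that do not contain feature i. A runnable sketch for toy models, where f accepts a tuple of 0/1 flags marking which features are observed:

```python
from itertools import combinations
from math import factorial

def shapley_values(f, M):
    """Exact Shapley values for a model f over M binary features.

    f maps a length-M tuple of 0/1 flags (1 = feature observed) to an
    output value. Runtime is exponential in M, so toy models only.
    """
    phi = []
    for i in range(M):
        others = [j for j in range(M) if j != i]
        total = 0.0
        for size in range(M):
            for S in combinations(others, size):
                # Shapley weight |S|! (M - |S| - 1)! / M!
                w = factorial(size) * factorial(M - size - 1) / factorial(M)
                with_i = tuple(1 if (j == i or j in S) else 0 for j in range(M))
                without_i = tuple(1 if j in S else 0 for j in range(M))
                total += w * (f(with_i) - f(without_i))
        phi.append(total)
    return phi

# Toy model: pays out 10 only when both features are observed.
f = lambda z: 10.0 if z == (1, 1) else 0.0
print(shapley_values(f, 2))  # [5.0, 5.0] -- equal credit, as symmetry demands
```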
41. The SHAP class is large

Model-agnostic:
• LIME: approximates the complex model near a given prediction. (Ribeiro et al. 2016)
• Shapley value sampling / Quantitative Input Influence: feature importance for a given prediction using game theory. (Štrumbelj et al. 2014; Datta et al. 2016)

Neural networks:
• DeepLIFT: explains neural networks via differences from a reference. (Shrikumar et al. 2016)
• Layer-wise relevance propagation: back-propagates neural network explanations. (Bach et al. 2015)

Linear:
• Shapley regression values: explain linear models in the presence of collinearity. (Grömping et al. 2012)
42. Surprising unity!

[Diagram: LIME, DeepLIFT, layer-wise relevance propagation, Shapley value sampling / Quantitative Input Influence, and Shapley regression values all fall inside one SHAP class spanning model-agnostic, neural-network, and linear methods]

The SHAP class has one optimum, in the sense that it is the only set of additive values satisfying several desirable properties.
44. The SHAP class unifies in three ways:

1. Extends Shapley value sampling and Quantitative Input Influence.
2. Provides theoretically justified improvements and motivation for other methods.
3. Adapts other methods' ideas to improve Shapley value estimation performance.
46. The LIME formalism: fitting a simple interpretable model to a complex model locally

• 𝒢: a class of interpretable models
• L: the loss function forcing g to approximate f well
• π_x′: a kernel specifying what "local" means
• Ω: optional regularization of g

But how do we pick 𝒢, L, Ω, and π_x′?
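The objective these pieces plug into (the slide's equation image did not survive extraction; this is the standard LIME objective from Ribeiro et al. 2016):

ξ(x) = argmin_{g ∈ 𝒢} [ L(f, g, π_x′) + Ω(g) ]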
47. Surprise: if 𝒢 is the class of linear models and x′ is binary, then we are in the SHAP class!

This means the Shapley values are the only possible solution satisfying efficiency and monotonicity.

Great! But what about the parameters L, Ω, and π_x′?
48. We found a kernel and loss function that cause a local approximation to reproduce the Shapley values: the Shapley kernel.

[Excerpt from the paper, shown as a screenshot on the slide; reconstructed here:]

Symmetry: if for all subsets S that do not contain i or j,
  f_x(S ∪ {i}) = f_x(S ∪ {j}),     (4)
then φ_i(f, x) = φ_j(f, x). This states that if two features contribute equally to the model, their effects must be the same.

Monotonicity: for any two models f and f′, if for all subsets S that do not contain i,
  f_x(S ∪ {i}) − f_x(S) ≥ f′_x(S ∪ {i}) − f′_x(S),     (5)
then φ_i(f, x) ≥ φ_i(f′, x). This states that if observing a feature increases f more than f′ in all situations, then that feature's effect should be larger for f than for f′.

Violating these axioms would lead to potentially confusing behavior. In 1985, Peyton Young showed that there is only one set of values that satisfies the above assumptions, and they are the Shapley values [7, 4]. ES values are Shapley values of expected value functions, so they are the only solution to Equation 1 that conforms to Equation 2 and satisfies the three axioms above. This optimality of ES values holds over a large class of possible models, including the examples used in the paper that originally proposed this formalism [3].

The specific forms of x′, L, and Ω that lead to Shapley values as the solution are:

  Ω(g) = 0
  π_x′(z′) = (M − 1) / [ (M choose |z′|) · |z′| · (M − |z′|) ]
  L(f, g, π_x′) = Σ_{z′ ∈ Z} [ f(h_x⁻¹(z′)) − g(z′) ]² · π_x′(z′)     (6)

Note that π_x′(z′) = ∞ when |z′| ∈ {0, M}, which enforces φ_0 = f_x(∅) and f(x) = Σ_i φ_i. In practice these infinite weights can be avoided during optimization by analytically eliminating two variables using these constraints. Figure 2A of the paper compares this Shapley kernel with kernels chosen heuristically. The intuitive connection between linear regression and classical Shapley value estimates is that classical estimates are computed as the mean of many output differences; since the mean is also the best least-squares point estimate for a set of data points, it is natural to search for a weighting kernel that causes linear least-squares regression to recapitulate the Shapley values.
There is no other kernel that satisfies the axioms and produces a different result.
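The whole construction fits in a few lines: weight every coalition by the Shapley kernel above, fit a weighted linear regression, and read the Shapley values off the coefficients. A minimal sketch under the same toy-model assumptions as before (it enumerates all 2^M coalitions, so toy-sized only, and uses a large finite weight in place of the infinite weights on the empty and full coalitions):

```python
import numpy as np
from itertools import product
from math import comb

def shapley_kernel(M, s):
    # pi_x'(z') = (M - 1) / (C(M, s) * s * (M - s)); infinite at s in {0, M}
    if s == 0 or s == M:
        return 1e9  # large stand-in for the infinite constraint weights
    return (M - 1) / (comb(M, s) * s * (M - s))

def kernel_shap(f, M):
    """Recover phi_0..phi_M by Shapley-kernel-weighted least squares."""
    Z = np.array(list(product([0, 1], repeat=M)))     # all coalitions z'
    y = np.array([f(tuple(z)) for z in Z])            # model output per coalition
    w = np.array([shapley_kernel(M, int(z.sum())) for z in Z])
    X = np.hstack([np.ones((len(Z), 1)), Z])          # column of ones -> phi_0
    # Weighted least squares: solve (X^T W X) beta = X^T W y
    beta = np.linalg.solve(X.T @ (X * w[:, None]), X.T @ (w * y))
    return beta[0], beta[1:]                          # phi_0, (phi_1..phi_M)

# Same toy model as before: pays out 10 only when both features are observed.
f = lambda z: 10.0 if z == (1, 1) else 0.0
phi0, phi = kernel_shap(f, 2)
print(phi0, phi)  # ~0.0 and ~[5.0, 5.0], matching the exact enumeration above
```

That the regression reproduces the exact enumeration is the point of the slide: the Shapley kernel is the unique choice for which LIME-style local fitting returns the Shapley values.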