What to Optimize? The Heart of Every Analytics Problem
1. What to Optimize?
The Heart of Every Analytics Problem
Predictive Analytics World
New York - October, 2017
John F. Elder, Ph.D.
elder@elderresearch.com
@johnelder4
Charlottesville, VA
Washington, DC
Baltimore, MD
Raleigh, NC
434-973-7673
www.elderresearch.com
2. Outline
• Squared error is convenient for the computer, but not for the client
• Lift (cumulative response) charts are great, but never optimize AUC (area under the curve)
• You may need to design a custom metric
• That may require a global search algorithm
• Brainstorm about the project goal
• And about what project to tackle in the first place
20. Bound by Random and Perfect Models
A random model (no predictive power) would be a diagonal line. A perfect model (right prediction every time) shoots up as fast as possible to 100%; the slope depends on event frequency.
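To make the chart's bounds concrete, here is a minimal sketch of computing a cumulative response curve from model scores; the function name `cumulative_gains` and the NumPy approach are my own, not from the slides.

```python
import numpy as np

def cumulative_gains(y_true, scores):
    """Cumulative response ("lift") curve: fraction of all positives
    captured (y) vs. fraction of cases contacted (x), ranked by score."""
    order = np.argsort(-np.asarray(scores, dtype=float))  # best scores first
    hits = np.cumsum(np.asarray(y_true)[order])           # positives found so far
    x = np.arange(1, len(order) + 1) / len(order)
    y = hits / hits[-1]
    return x, y

# A random model tracks the diagonal (y = x); a perfect model rises with
# slope 1/event_rate until it captures 100% of the positives.
```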
23. Truth Table (confusion matrix) with 25% Threshold

                   Actual OK   Actual BAD
    Predicted OK       1,352          136
    Predicted BAD        237          260
24. Truth table depends on threshold

The same model with a different cutoff threshold results in a different truth table (confusion matrix):

                   Actual OK   Actual BAD
    Predicted OK       1,540          246
    Predicted BAD         49          150

                   Actual OK   Actual BAD
    Predicted OK         846           47
    Predicted BAD        743          349
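A minimal sketch of how such a table is produced, assuming scores where higher means more likely BAD; the function name `truth_table` and the row/column layout (matching the tables above) are my own choices.

```python
def truth_table(y_is_bad, scores, threshold):
    """2x2 truth table at a given cutoff. Moving the threshold
    reassigns cases between the Predicted OK and Predicted BAD rows,
    so the same model yields different tables."""
    tn = fn = fp = tp = 0
    for bad, s in zip(y_is_bad, scores):
        if s >= threshold:      # predicted BAD
            tp += bad
            fp += not bad
        else:                   # predicted OK
            fn += bad
            tn += not bad
    # Rows: Predicted OK, Predicted BAD; columns: Actual OK, Actual BAD
    return [[tn, fn], [fp, tp]]
```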
26. Social Security Administration: Disability Approval Prediction

Text information in the "Allegation Field" proved most valuable.

Example allegation (Multiple Myeloma): "I have been diagnosed with Multiple Myeloma (cancer of the bone marrow) and am currently undergoing treatment to prepare me for an autologous stem cell transplant. There has been a brain tumor associated with this, for which I have had...."
27. Using a Prior: "non-zero initialization"

• Draw from Bayesian statistics and smooth the raw count with an empirical prior
  – Use the baseline probability of the most probable classification
    • For SSA, roughly 33% of applications are approved
  – Counts for each word are initialized with the baseline probability
• Similar to shrinkage, the James-Stein estimator, ridge regression, etc.
• Hypothetical example: Multiple Myeloma (see the sketch below)
  – Appears 5 times, 4 of which were approved = 80% predicted "yes"
  – The prior (given all data) is 33%. If we use an "initial mass" of 3 (2 "no" + 1 "yes"), then the total "yes" is 5/8 = 62.5%
• With no data, this yields the prior
• With lots of data, the measurement provides the probability
• In between, it compromises between the measured % and the prior %
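A minimal sketch of this smoothing, reproducing the slide's arithmetic; the function name `smoothed_rate` is my own.

```python
def smoothed_rate(approved, total, prior=1/3, prior_mass=3):
    """Shrink a word's raw approval rate toward the baseline prior.
    The prior contributes `prior_mass` pseudo-counts, so it dominates
    when data are scarce and fades as real counts accumulate."""
    return (approved + prior * prior_mass) / (total + prior_mass)

print(smoothed_rate(0, 0))  # 0.333...: with no data, the result is the prior
print(smoothed_rate(4, 5))  # 0.625: the Multiple Myeloma example (4 of 5 approved)
```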
28. Combining Weights

• Common aggregations don't match medical domain requirements (see the toy example below):
  – SUM: many symptoms increase the probability of predicting approval
  – MAX: ignores multiple serious symptoms
  – AVG: minor symptoms water down major symptoms
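A toy illustration of each failure mode; the per-concept probabilities are hypothetical values of my own, not from the slides.

```python
severe_pair = [0.90, 0.85]                 # two serious conditions
one_severe_plus_mild = [0.90, 0.20, 0.20]  # one serious plus minor complaints
many_mild = [0.20] * 10                    # many minor complaints

print(sum(many_mild) > sum(severe_pair))   # True: SUM lets mild ailments add up past severe ones
print(max(severe_pair) == max([0.90]))     # True: MAX ignores the second serious condition
avg = lambda xs: sum(xs) / len(xs)
print(avg(one_severe_plus_mild))           # ~0.43: AVG waters down the severe condition
```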
29. Business Understanding: Desired properties for joining evidence

• Applicants with multiple severe diseases should be more likely to be approved
• A large number of mild ailments should not add up to a high score that gets an applicant approved
• Mild ailments should not detract from severe ones
• Rare diseases should be included, but not with the same confidence as those with more evidence
• Calculation of disease severity must be self-adapting to accommodate rapid changes in the medical field

We designed a joint probability function meeting these constraints.
30. Our approach to combine evidence (SSA)

If (no data), then use the prior.
Else if (max(probability) < 0.5), then use that max.
Else:
  i. Ignore concepts with probability < 0.5
  ii. Combine the remaining ones with a log-likelihood formula and use the resulting joint probability.
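A minimal runnable sketch of this decision rule. The slide does not give the exact log-likelihood formula, so a log-odds (naive-Bayes-style) combination is assumed here; the function name `combine_evidence` is my own.

```python
import math

def combine_evidence(probs, prior=1/3):
    """Join per-concept approval probabilities per the rule above."""
    if not probs:                       # no data -> use the prior
        return prior
    if max(probs) < 0.5:                # no strong concept -> use that max
        return max(probs)
    # Ignore mild concepts; clamp to keep log() finite (assumption).
    strong = [min(p, 1 - 1e-9) for p in probs if p >= 0.5]
    log_odds = sum(math.log(p / (1 - p)) for p in strong)
    return 1 / (1 + math.exp(-log_odds))

print(combine_evidence([0.90, 0.85]))        # ~0.98: severe diseases reinforce each other
print(combine_evidence([0.90, 0.20, 0.20]))  # 0.90: mild ailments neither add up nor detract
```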
31. Higher-Level Optimization Issue: What is the Goal of the Project?

Aim at the right target.

Example: fraud detection for international phone calls. Daryl Pregibon and colleagues at Bell (Shannon) Labs: the normal approach would have been to attempt to classify fraud/non-fraud for calls in general. Instead, they characterized normal behavior for each account (phone), then flagged outliers. The model had features like the top 5 countries called, durations of calls, times of day, days of week, the "faxicity" of a call, etc. All features adapted slowly if changes occurred.

-> A brilliant success.
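A hypothetical sketch of that per-account approach; the slides give no formulas, so the single feature (call duration), the decay rate, the outlier score, and the class name `AccountProfile` are all assumptions of mine.

```python
class AccountProfile:
    """Track one account's "normal behavior" and score new calls
    against it, adapting slowly as the account's behavior changes."""

    def __init__(self, alpha=0.01):     # small alpha -> slow adaptation
        self.alpha = alpha
        self.mean_duration = None

    def outlier_score(self, duration):
        # Deviation of this call from the account's own norm.
        if self.mean_duration is None:
            return 0.0
        return abs(duration - self.mean_duration) / max(self.mean_duration, 1.0)

    def update(self, duration):
        # Exponentially weighted moving average: old behavior fades slowly.
        if self.mean_duration is None:
            self.mean_duration = duration
        else:
            self.mean_duration += self.alpha * (duration - self.mean_duration)
```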
32. Even Higher-Level Optimization Issue: What Project Should You Choose?

[Chart: candidate projects plotted by ROI vs. cost (disruption, technical effort); "phantom inventory" is one example project shown.]

Cost factors include:
• Time required
• Disruption effect
• Data availability
• Data quality
33. Summary

• Squared error gives undue power to outliers and is symmetric, but is very hard to escape.
• You can always do better than optimizing AUC (but it's correlated with success, so don't throw away its results).
• Think about what you're asking the computer to search for: to solve the hardest problems, you'll need to design a custom metric.
• Get at least a random global search capability ready.
• Work closely with the client and creative folk to brainstorm project goals and priorities.
• If your work isn't implemented, you failed.