What to Optimize? The Heart of Every Analytics Problem
1. What to Optimize?
The Heart of Every Analytics Problem
Predictive Analytics World
New York - October, 2017
John F. Elder, Ph.D.
elder@elderresearch.com
@johnelder4
Charlottesville, VA
Washington, DC
Baltimore, MD
Raleigh, NC
434-973-7673
www.elderresearch.com
2. Outline
• Squared error is convenient for the computer, but not for the client
• Lift (cumulative response) charts are great, but never optimize AUC (area under the curve)
• You may need to design a custom metric
• That may require a global search algorithm
• Brainstorm about the project goal
• And about what project to tackle in the first place
20. Bound by Random and Perfect Models
A random model (no predictive power) would be a diagonal line. A perfect model (right prediction every time) shoots up as fast as possible to 100%; the slope depends on event frequency.
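To make the chart's bounds concrete, here is a minimal sketch of computing a cumulative response curve from model scores; the function name `cumulative_gains` and the NumPy approach are my own, not from the slides.

```python
import numpy as np

def cumulative_gains(y_true, scores):
    """Cumulative response ("lift") curve: fraction of all positives
    captured (y) vs. fraction of cases contacted (x), ranked by score."""
    order = np.argsort(-np.asarray(scores, dtype=float))  # best scores first
    hits = np.cumsum(np.asarray(y_true)[order])           # positives found so far
    x = np.arange(1, len(order) + 1) / len(order)
    y = hits / hits[-1]
    return x, y

# A random model tracks the diagonal (y = x); a perfect model rises with
# slope 1/event_rate until it captures 100% of the positives.
```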
23. Truth Table (confusion matrix) with 25% Threshold

                   Actual OK   Actual BAD
    Predicted OK       1,352          136
    Predicted BAD        237          260
24. Truth table depends on threshold

The same model with a different cutoff threshold results in a different truth table (confusion matrix):

                   Actual OK   Actual BAD
    Predicted OK       1,540          246
    Predicted BAD         49          150

                   Actual OK   Actual BAD
    Predicted OK         846           47
    Predicted BAD        743          349
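A minimal sketch of how such a table is produced, assuming scores where higher means more likely BAD; the function name `truth_table` and the row/column layout (matching the tables above) are my own choices.

```python
def truth_table(y_is_bad, scores, threshold):
    """2x2 truth table at a given cutoff. Moving the threshold
    reassigns cases between the Predicted OK and Predicted BAD rows,
    so the same model yields different tables."""
    tn = fn = fp = tp = 0
    for bad, s in zip(y_is_bad, scores):
        if s >= threshold:      # predicted BAD
            tp += bad
            fp += not bad
        else:                   # predicted OK
            fn += bad
            tn += not bad
    # Rows: Predicted OK, Predicted BAD; columns: Actual OK, Actual BAD
    return [[tn, fn], [fp, tp]]
```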
26. Social Security Administration: Disability Approval Prediction

Text information in the "Allegation Field" proved most valuable.

Example allegation (Multiple Myeloma): "I have been diagnosed with Multiple Myeloma (cancer of the bone marrow) and am currently undergoing treatment to prepare me for an autologous stem cell transplant. There has been a brain tumor associated with this, for which I have had...."
27. Using a Prior: "non-zero initialization"

• Draw from Bayesian statistics and smooth the raw count with an empirical prior
  – Use the baseline probability of the most probable classification
    • For SSA, roughly 33% of applications are approved
  – Counts for each word are initialized with the baseline probability
• Similar to shrinkage, the James-Stein estimator, ridge regression, etc.
• Hypothetical example: Multiple Myeloma (see the sketch below)
  – Appears 5 times, 4 of which were approved = 80% predicted "yes"
  – The prior (given all data) is 33%. If we use an "initial mass" of 3 (2 "no" + 1 "yes"), then the total "yes" is 5/8 = 62.5%
• With no data, this yields the prior
• With lots of data, the measurement provides the probability
• In between, it compromises between the measured % and the prior %
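A minimal sketch of this smoothing, reproducing the slide's arithmetic; the function name `smoothed_rate` is my own.

```python
def smoothed_rate(approved, total, prior=1/3, prior_mass=3):
    """Shrink a word's raw approval rate toward the baseline prior.
    The prior contributes `prior_mass` pseudo-counts, so it dominates
    when data are scarce and fades as real counts accumulate."""
    return (approved + prior * prior_mass) / (total + prior_mass)

print(smoothed_rate(0, 0))  # 0.333...: with no data, the result is the prior
print(smoothed_rate(4, 5))  # 0.625: the Multiple Myeloma example (4 of 5 approved)
```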
28. Combining Weights

• Common aggregations don't match medical domain requirements (see the toy example below):
  – SUM: many symptoms increase the probability of predicting approval
  – MAX: ignores multiple serious symptoms
  – AVG: minor symptoms water down major symptoms
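A toy illustration of each failure mode; the per-concept probabilities are hypothetical values of my own, not from the slides.

```python
severe_pair = [0.90, 0.85]                 # two serious conditions
one_severe_plus_mild = [0.90, 0.20, 0.20]  # one serious plus minor complaints
many_mild = [0.20] * 10                    # many minor complaints

print(sum(many_mild) > sum(severe_pair))   # True: SUM lets mild ailments add up past severe ones
print(max(severe_pair) == max([0.90]))     # True: MAX ignores the second serious condition
avg = lambda xs: sum(xs) / len(xs)
print(avg(one_severe_plus_mild))           # ~0.43: AVG waters down the severe condition
```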
29. Business Understanding: Desired properties for joining evidence

• Applicants with multiple severe diseases should be more likely to be approved
• A large number of mild ailments should not add up to a high score that gets an applicant approved
• Mild ailments should not detract from severe ones
• Rare diseases should be included, but not with the same confidence as those with more evidence
• Calculation of disease severity must be self-adapting to accommodate rapid changes in the medical field

We designed a joint probability function meeting these constraints.
30. Our approach to combine evidence (SSA)

If (no data), then use the prior.
Else if (max(probability) < 0.5), then use that max.
Else:
  i. Ignore concepts with probability < 0.5
  ii. Combine the remaining ones with a log-likelihood formula and use the resulting joint probability.
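A minimal runnable sketch of this decision rule. The slide does not give the exact log-likelihood formula, so a log-odds (naive-Bayes-style) combination is assumed here; the function name `combine_evidence` is my own.

```python
import math

def combine_evidence(probs, prior=1/3):
    """Join per-concept approval probabilities per the rule above."""
    if not probs:                       # no data -> use the prior
        return prior
    if max(probs) < 0.5:                # no strong concept -> use that max
        return max(probs)
    # Ignore mild concepts; clamp to keep log() finite (assumption).
    strong = [min(p, 1 - 1e-9) for p in probs if p >= 0.5]
    log_odds = sum(math.log(p / (1 - p)) for p in strong)
    return 1 / (1 + math.exp(-log_odds))

print(combine_evidence([0.90, 0.85]))        # ~0.98: severe diseases reinforce each other
print(combine_evidence([0.90, 0.20, 0.20]))  # 0.90: mild ailments neither add up nor detract
```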
31. Higher-Level Optimization Issue: What is the Goal of the Project?

Aim at the right target.

Example: fraud detection for international phone calls. Daryl Pregibon and colleagues at Bell (Shannon) Labs: the normal approach would have been to attempt to classify fraud/non-fraud for calls in general. Instead, they characterized normal behavior for each account (phone), then flagged outliers. The model had features like the top 5 countries called, durations of calls, times of day, days of week, the "faxicity" of a call, etc. All features adapted slowly if changes occurred.

-> A brilliant success.
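A hypothetical sketch of that per-account approach; the slides give no formulas, so the single feature (call duration), the decay rate, the outlier score, and the class name `AccountProfile` are all assumptions of mine.

```python
class AccountProfile:
    """Track one account's "normal behavior" and score new calls
    against it, adapting slowly as the account's behavior changes."""

    def __init__(self, alpha=0.01):     # small alpha -> slow adaptation
        self.alpha = alpha
        self.mean_duration = None

    def outlier_score(self, duration):
        # Deviation of this call from the account's own norm.
        if self.mean_duration is None:
            return 0.0
        return abs(duration - self.mean_duration) / max(self.mean_duration, 1.0)

    def update(self, duration):
        # Exponentially weighted moving average: old behavior fades slowly.
        if self.mean_duration is None:
            self.mean_duration = duration
        else:
            self.mean_duration += self.alpha * (duration - self.mean_duration)
```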
32. Even Higher-Level Optimization Issue: What Project Should You Choose?

[Chart: candidate projects plotted by ROI vs. cost (disruption, technical effort); "phantom inventory" is one example project shown.]

Cost factors include:
• Time required
• Disruption effect
• Data availability
• Data quality
33. Summary

• Squared error gives undue power to outliers and is symmetric, but is very hard to escape.
• You can always do better than optimizing AUC (but it's correlated with success, so don't throw away its results).
• Think about what you're asking the computer to search for: to solve the hardest problems, you'll need to design a custom metric.
• Get at least a random global search capability ready.
• Work closely with the client and creative folk to brainstorm project goals and priorities.
• If your work isn't implemented, you failed.