Machine Learning/ Data Science: Boosting Predictive Analytics Model Performance

Scott.Clendaniel@MktgSciences.com
Machine Learning: Boosting
Analytics Model Performance

THE JOB OF DATA SCIENTISTS
Does this sound familiar to anyone?

AGENDA
1. Background: Explaining why boosting performance is relevant.
2. Strategy: How to design a strategy for boosting performance.
3. Features: How to use Feature Engineering to boost model performance.
4. Bonus Round: A collection of free resources for boosting model performance.
5. Questions: Time for questions from the audience.

BOOSTING MODEL PERFORMANCE
Section 1: Background


TIPS SOURCES
Where do the recommendations originate?
• 197 Kaggle Winner Interviews: How did they win?
• 50 In-depth Case Studies: Which factors mattered?
• 25,000 Head-to-Head Tests: What made the difference?

WHERE HAVE THESE TIPS WORKED?
IMPORTANT: All views expressed are solely my own, and should not be taken
as being those of current or past employers, clients or others.

TWO CATEGORIES OF TIPS
Presentation Focus
Part 1: Model Strategy
The plan, method, series of tactics or stratagems for building your model.
Part 2: Data Preparation
The process for identifying, building, developing, standardizing, normalizing and engineering the correct inputs for one or more analytics processes.

Section 2: Model Strategy


TIP 1: Leverage Extreme Ensembles
Ensembles of models with non-correlated errors consistently deliver a larger performance boost than single models or smaller ensembles.
2015 Liberty Mutual Contest: Owen Zhang
• 6-layer process
• 5 distinct data prep steps
• 31 combined feature sets
• 2 layers of 3 models each
Source: Owen Zhang, Chief Product Officer at DataRobot, https://www.slideshare.net/OwenZhang2/tips-for-data-science-competitions
2015 KDD CUP: Jeong-Yoon Lee
• 7 feature sets
• 64 component models
• 15 models in Level 1 Ensemble
• 2 models in Level 2 Ensemble
Source: Jeong-Yoon Lee, Chief Data Scientist at Conversion Logic, https://www.slideshare.net/jeongyoonlee/data-science-competition-72596610
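A minimal sketch of why Tip 1 favors models with non-correlated errors. It is not the multi-layer stacking used in the contest entries above; it only averages synthetic "models" whose errors are assumed to be independent Gaussian noise. The names simulate and rmse, and all data, are made up for illustration.

```python
import math
import random

def rmse(predictions, truth):
    """Root-mean-square error between two equal-length sequences."""
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(predictions, truth)) / len(truth))

def simulate(n_models, n_rows=2000, noise_sd=1.0, seed=42):
    """Return (single-model RMSE, averaged-ensemble RMSE) on synthetic data.

    Assumption: every model's error is independent Gaussian noise, the
    idealized "non-correlated errors" case.
    """
    rng = random.Random(seed)
    truth = [rng.uniform(0, 10) for _ in range(n_rows)]
    # Each "model" is the truth plus its own independent noise.
    models = [[t + rng.gauss(0, noise_sd) for t in truth] for _ in range(n_models)]
    # Simple average across models, row by row.
    ensemble = [sum(col) / n_models for col in zip(*models)]
    return rmse(models[0], truth), rmse(ensemble, truth)

if __name__ == "__main__":
    for k in (1, 4, 16, 64):
        single, blended = simulate(k)
        print(f"{k:>2} models: single RMSE={single:.3f}  ensemble RMSE={blended:.3f}")
```

With independent errors the averaged error should shrink as models are added. Real component models share data and features, so their errors are partly correlated and the gain is smaller, which is one reason diverse feature sets and algorithms matter.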

TIP 2: Reduce the Decision Space
MARKETING: Eliminate irrelevant populations
• Seed lists
• Old, unusable lead sources
• Discontinued markets
FRAUD: Eliminate “safer” populations
• Low dollar thresholds
• “Best” customers
• Higher authentication transactions
• “Standing” transactions
• Canceled transfers
GENERAL: Other instances
• What do you already know?
• What is beyond your influence?
• Which problems can be handled separately?

TIP 3: Use Targeted AUC Instead of Total AUC
Match model objective to organizational objective. Example courtesy of ORACLE.
TOTAL AUC
Optimizes overall model performance
• Traditional approach
• Perfect for many Kaggle competitions
• Sacrifices accuracy at lower threshold targets for overall accuracy
TARGETED AUC
Optimizes targeted model performance
• Less common approach
• Perfect for projects with target thresholds such as limited marketing budgets or maximum fraud referral/turndown rates
• Sacrifices overall accuracy for accuracy at lower threshold targets
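One way to read Tip 3 in code, offered as a sketch only. The slide does not define "targeted AUC"; the assumption here is that it means the area under the ROC curve restricted to false-positive rates at or below a cap (often called partial AUC). The function names and the max_fpr parameter are illustrative.

```python
def roc_points(labels, scores):
    """Return ROC curve points [(fpr, tpr), ...] from (0, 0) to (1, 1).

    labels are 0/1; a higher score means "more likely positive".
    Tied scores are grouped so they produce a single curve point.
    """
    pos = sum(labels)
    neg = len(labels) - pos
    if pos == 0 or neg == 0:
        raise ValueError("need at least one positive and one negative label")
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points, tp, fp = [(0.0, 0.0)], 0, 0
    for i, (score, label) in enumerate(ranked):
        tp += label
        fp += 1 - label
        # Emit a point only at the end of a group of tied scores.
        if i + 1 == len(ranked) or ranked[i + 1][0] != score:
            points.append((fp / neg, tp / pos))
    return points

def auc(labels, scores, max_fpr=1.0):
    """Trapezoidal area under the ROC curve for FPR in [0, max_fpr]."""
    area = 0.0
    points = roc_points(labels, scores)
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if x0 >= max_fpr:
            break
        if x1 > max_fpr:
            # Clip the last segment at max_fpr by linear interpolation.
            y1 = y0 + (y1 - y0) * (max_fpr - x0) / (x1 - x0)
            x1 = max_fpr
        area += (x1 - x0) * (y0 + y1) / 2
    return area

if __name__ == "__main__":
    # Made-up scores: model_a ranks better overall, model_b ranks better
    # among its very highest scores.
    labels = [1, 0, 1, 1, 1, 1, 0, 0, 0, 0]
    model_a = [0.95, 0.90, 0.85, 0.80, 0.75, 0.70, 0.40, 0.30, 0.20, 0.10]
    model_b = [0.95, 0.80, 0.90, 0.85, 0.20, 0.10, 0.75, 0.70, 0.65, 0.60]
    for name, scores in (("model_a", model_a), ("model_b", model_b)):
        print(name, "total:", auc(labels, scores), "targeted:", auc(labels, scores, max_fpr=0.2))
```

On this toy data the model with the higher total AUC is not the one with the higher targeted AUC, which is the trade-off the two lists above describe. The cap would be set from the business constraint, such as the share of cases a fraud team can review.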

TIP 4: Cross-Validate Everywhere
Reducing overfitting while extracting maximum learning from your data
OUT-OF-SAMPLE VALIDATION
Traditional methodology
CROSS-VALIDATION
Used to reduce both overfitting and outlier influence
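A small sketch of the k-fold cross-validation idea behind Tip 4, using only the standard library. The fit and score callables are placeholders for whatever model is being tested; the mean predictor in the demo is a hypothetical stand-in, not a recommended model.

```python
import random
import statistics

def k_fold_indices(n_rows, k=5, seed=0):
    """Yield (train_idx, valid_idx) pairs; each row is held out exactly once."""
    order = list(range(n_rows))
    random.Random(seed).shuffle(order)
    folds = [order[i::k] for i in range(k)]
    for i, valid in enumerate(folds):
        train = [row for j, fold in enumerate(folds) if j != i for row in fold]
        yield train, valid

def cross_validate(xs, ys, fit, score, k=5, seed=0):
    """Fit on k-1 folds, score on the held-out fold, and return all k scores."""
    scores = []
    for train, valid in k_fold_indices(len(xs), k, seed):
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        scores.append(score(model, [xs[i] for i in valid], [ys[i] for i in valid]))
    return scores

def fit_mean(xs, ys):
    """Hypothetical stand-in model: always predict the training mean."""
    return statistics.fmean(ys)

def mean_abs_error(model, xs, ys):
    """Mean absolute error of the constant prediction on held-out rows."""
    return statistics.fmean(abs(y - model) for y in ys)

if __name__ == "__main__":
    xs = list(range(20))
    ys = [2.0 * x + 1.0 for x in xs]
    fold_scores = cross_validate(xs, ys, fit_mean, mean_abs_error, k=5)
    print("per-fold MAE:", [round(s, 2) for s in fold_scores])
    print("mean MAE:", round(statistics.fmean(fold_scores), 2))
```

Because every row is scored exactly once while held out, the averaged score uses all of the data and is less sensitive to one unlucky split than a single out-of-sample holdout.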

TIP 5: Algorithm Arsenal
Leverage a diverse modeling arsenal
• Bayesian Network
• Gradient Boosting Machines
• Random Forests
• Logistic Regression
• Factorization Machines
• Neural Network
• Genetic Algorithms
• Support Vector Machines

Section 3: Features


TIP 7: Test Variable Transformation Functions
Features
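A rough sketch of what testing variable transformation functions (Tip 7) could look like. The slide gives only the title, so everything here is an assumption: the candidate transforms, the use of absolute Pearson correlation with the target as the yardstick, and the names pearson, TRANSFORMS and rank_transforms. Inputs are assumed non-negative and not constant.

```python
import math

def pearson(xs, ys):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Candidate transforms (illustrative; all assume x >= 0).
TRANSFORMS = {
    "identity": lambda x: x,
    "log1p": math.log1p,
    "sqrt": math.sqrt,
    "square": lambda x: x * x,
}

def rank_transforms(xs, ys):
    """Return [(name, |r|), ...] with the strongest linear relationship first."""
    results = [
        (name, abs(pearson([transform(x) for x in xs], ys)))
        for name, transform in TRANSFORMS.items()
    ]
    return sorted(results, key=lambda item: -item[1])

if __name__ == "__main__":
    xs = list(range(1, 51))
    ys = [math.log1p(x) for x in xs]  # synthetic target, logarithmic in x
    for name, strength in rank_transforms(xs, ys):
        print(f"{name:<8} |r| = {strength:.4f}")
```

Correlation only rewards linear fit, so it suits linear models best. For tree-based models a cross-validated score per transform would be a more faithful test.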

TIPS 8-13: Univariate Tree Feature Engineering
Features
1. Derive “Stumps”
“Stumps” represent the first split in decision trees, and make powerful “weak learners.” Create a derived feature for each input.
2. Bin Continuous Inputs
Using trees creates bin “boundaries” directly associated with the dependent variable, rather than a more arbitrary approach. Assign bins for each continuous input.
3. Handle Missing Values
Assigning missing values to a separate, unique category preserves information content and eliminates arbitrary replacement approaches.
4. Derive High-Impact Flags
Calling out tree nodes with uniquely powerful splitting capabilities as derived features leverages the most benefit from single inputs.
5. Normalize Scaling
Each input, regardless of data type, can have consistent, normalized scaling by using something like NORM Sigmoid or Yule’s Q for each terminal node from each univariate tree.
6. Overall Transformation
Re-coding the original input into the values from the terminal nodes makes interpretation much easier.
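A sketch of three ideas from Tips 8-13 for a single continuous input: deriving a stump, keeping missing values in their own category, and scoring a node with Yule's Q. The Gini split criterion is an assumption (the slides do not name a tree algorithm), and the function names and "low"/"high"/"missing" labels are illustrative. NORM Sigmoid is not shown.

```python
def gini(pos, n):
    """Gini impurity of a node holding n rows, pos of them positive."""
    if n == 0:
        return 0.0
    p = pos / n
    return 2 * p * (1 - p)

def best_stump(values, labels):
    """Return the threshold of the best single split (lowest weighted Gini).

    Rows whose value is None are ignored here; they get their own bin later.
    Returns None if no split is possible.
    """
    rows = sorted((v, y) for v, y in zip(values, labels) if v is not None)
    n, total_pos = len(rows), sum(y for _, y in rows)
    best_threshold, best_score = None, float("inf")
    left_pos = 0
    for i in range(1, n):
        left_pos += rows[i - 1][1]
        if rows[i][0] == rows[i - 1][0]:
            continue  # cannot split between identical values
        score = (i * gini(left_pos, i) + (n - i) * gini(total_pos - left_pos, n - i)) / n
        if score < best_score:
            best_score = score
            best_threshold = (rows[i][0] + rows[i - 1][0]) / 2
    return best_threshold

def stump_feature(value, threshold):
    """Recode one raw input into its node label, keeping missing separate."""
    if value is None:
        return "missing"
    return "low" if value <= threshold else "high"

def yules_q(in_node, labels):
    """Yule's Q = (ad - bc) / (ad + bc) for node membership vs. 0/1 target."""
    a = sum(1 for m, y in zip(in_node, labels) if m and y)          # in node, positive
    b = sum(1 for m, y in zip(in_node, labels) if m and not y)      # in node, negative
    c = sum(1 for m, y in zip(in_node, labels) if not m and y)      # outside, positive
    d = sum(1 for m, y in zip(in_node, labels) if not m and not y)  # outside, negative
    denom = a * d + b * c
    return (a * d - b * c) / denom if denom else 0.0

if __name__ == "__main__":
    values = [1, 2, 3, None, 10, 11, 12]  # made-up input with one missing value
    labels = [0, 0, 0, 1, 1, 1, 1]
    threshold = best_stump(values, labels)
    coded = [stump_feature(v, threshold) for v in values]
    print("threshold:", threshold)
    print("recoded:", coded)
    print("Q for 'high' node:", yules_q([c == "high" for c in coded], labels))
```

Yule's Q always falls between -1 and 1 regardless of the input's original units, which is what makes it usable as a common scale across inputs. Growing the tree beyond one split would give the multi-bin version described in step 2.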

TIP 14: Think “Crafts-person-ship”
Less “Assembly Line,” More “Fine Craftsmanship”
Moving Away From… Moving Toward…

Section 4: Bonus Round


Bonus Round: Patent-Application IMPACT Features
Patent application approach for transforming and combining model inputs
1. Univariate Tree Models
2. Create Common Table of Values for Each Node
3. Calculate Z-Score Across Entire Table
4. Assign New Value to New Derived Feature
5. Calculate Avg., High and Low
6. Gradient Boosting
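A speculative sketch of how steps 2 through 5 of the IMPACT outline might fit together. The slide lists only step names, so every detail is an assumption: that the "value" for each node is its observed target rate, that the z-score is taken over all nodes of all inputs, and that avg/high/low are computed per row. Step 1 (the univariate trees) is assumed done, with each row already mapped to a node label per input. The feature names in the demo are made up.

```python
import statistics
from collections import defaultdict

def node_table(node_rows, labels):
    """Step 2: map every (feature, node) pair to its observed target rate."""
    counts = defaultdict(lambda: [0, 0])  # (feature, node) -> [positives, rows]
    for row, y in zip(node_rows, labels):
        for feature, node in row.items():
            counts[(feature, node)][0] += y
            counts[(feature, node)][1] += 1
    return {key: pos / n for key, (pos, n) in counts.items()}

def z_scores(table):
    """Step 3: standardize the target rates across the entire table."""
    rates = list(table.values())
    mean = statistics.fmean(rates)
    sd = statistics.pstdev(rates)
    return {key: ((rate - mean) / sd if sd else 0.0) for key, rate in table.items()}

def derive(node_rows, z):
    """Steps 4-5: recode each input as its node's z-score, then summarize."""
    out = []
    for row in node_rows:
        scores = {f"{feature}_z": z[(feature, node)] for feature, node in row.items()}
        values = list(scores.values())
        scores.update(avg=statistics.fmean(values), high=max(values), low=min(values))
        out.append(scores)
    return out

if __name__ == "__main__":
    # Hypothetical node assignments from two univariate trees.
    node_rows = [
        {"age": "low", "income": "low"},
        {"age": "low", "income": "high"},
        {"age": "high", "income": "high"},
        {"age": "high", "income": "high"},
    ]
    labels = [0, 0, 1, 1]
    z = z_scores(node_table(node_rows, labels))
    for derived_row in derive(node_rows, z):
        print({k: round(v, 2) for k, v in derived_row.items()})
```

The derived columns would then be the inputs to step 6, gradient boosting, which is not shown. In practice the node table should be built on training folds only, so that target rates do not leak into validation rows.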

Get in Touch
See you soon....
MktgSciences
3719 Yolando Road
Baltimore, MD 21218
USA 1-443-810-8066

Source: Jeong-Yoon Lee, Chief Data Scientist at Conversion Logic,
https://www.slideshare.net/jeongyoonlee/data-science-competition-72596610
MODEL STRATEGY TIP 1
Cross-validate everywhere.

Source: Owen Zhang, Chief Product Officer at DataRobot,
https://www.slideshare.net/OwenZhang2/tips-for-data-science-competitions
MODEL STRATEGY TIP 1
Cross-validate everywhere.

THANK YOU...

Appendix

DEFINITIONS
performance
(noun):
“the manner in which or the efficiency
with which something reacts or
fulfills its intended purpose.”

Moving Away From… Moving Toward…
PERFORMANCE IS BEING MORE CLOSELY MEASURED

PERFORMANCE WILL DETERMINE COMPENSATION
Like it or not, Data Science compensation will become more closely tied to model performance.
