State-of-the-art techniques anyone can use to improve machine learning model performance. Includes several steps on model strategy, feature creation, Kaggle success secrets, and many other tips.
3. Scott.Clendaniel@MktgSciences.com
AGENDA
1. Background
Explaining why boosting performance is relevant.
2. Strategy
How to design a strategy for boosting performance.
3. Features
How to use Feature Engineering to boost model performance.
4. Bonus Round
A collection of free resources for boosting model performance.
5. Questions
Time for questions from the audience.
TIPS SOURCES
Where do the recommendations originate?
• 197 Kaggle Winner Interviews: How did they win?
• 50 In-depth Case Studies: Which factors mattered?
• 25,000 Head-to-Head Tests: What made the difference?
WHERE HAVE THESE TIPS WORKED?
IMPORTANT: All views expressed are solely my own, and should not be taken
as being those of current or past employers, clients or others.
TWO CATEGORIES OF TIPS
Presentation Focus
Part 1: Model Strategy
The plan, method, series of tactics or stratagems for building your model.
Part 2: Data Preparation
The process for identifying, building, developing, standardizing, normalizing and engineering the correct inputs for one or more analytics processes.
TIP 1: Leverage Extreme Ensembles
Ensembles of models with non-correlated errors consistently deliver a larger performance boost than single models or smaller ensembles.
2015 Liberty Mutual Contest
Owen Zhang
• 6-layer process
• 5 distinct data prep steps
• 31 combined feature sets
• 2 layers of 3 models each
Source: Owen Zhang, Chief Product Officer at DataRobot, https://www.slideshare.net/OwenZhang2/tips-for-data-science-competitions
2015 KDD CUP
Jeong-Yoon Lee
• 7 feature sets
• 64 component models
• 15 models in Level 1 Ensemble
• 2 models in Level 2 Ensemble
Source: Jeong-Yoon Lee, Chief Data Scientist at Conversion Logic, https://www.slideshare.net/jeongyoonlee/data-science-competition-72596610
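To see why non-correlated errors matter for ensembles, here is a minimal simulation sketch. It is not either competitor's pipeline: the "models" are hypothetical stand-ins (the true value plus independent Gaussian noise), and the function names, model count, and noise level are assumptions chosen for illustration only.

```python
import random
import statistics


def rmse(preds, truth):
    """Root mean squared error between two equal-length sequences."""
    return (sum((p - t) ** 2 for p, t in zip(preds, truth)) / len(truth)) ** 0.5


def simulate_ensemble(n_models=15, n_rows=2000, noise_sd=1.0, seed=0):
    """Compare single-model error with the error of a simple average."""
    rng = random.Random(seed)
    truth = [rng.uniform(0, 10) for _ in range(n_rows)]
    # Assumption: each hypothetical model is the truth plus its own
    # independent noise, so errors are uncorrelated across models.
    model_preds = [
        [t + rng.gauss(0, noise_sd) for t in truth] for _ in range(n_models)
    ]
    single_rmse = [rmse(preds, truth) for preds in model_preds]
    # Blend by averaging the predictions row by row.
    blended = [statistics.fmean(row) for row in zip(*model_preds)]
    return single_rmse, rmse(blended, truth)


single_rmse, ensemble_rmse = simulate_ensemble()
print(f"best single model RMSE: {min(single_rmse):.3f}")
print(f"15-model average RMSE:  {ensemble_rmse:.3f}")
```

With fully independent errors the averaged error shrinks roughly with the square root of the number of models. Real models share data and features, so their errors are partly correlated and the gain is smaller, which is why diversity across the ensemble matters.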
TIP 2: Reduce Decision Space
MARKETING
Eliminate irrelevant populations
• Seed lists
• Old, unusable lead sources
• Discontinued markets
FRAUD
Eliminate “safer” populations
• Low dollar thresholds
• “Best” customers
• Higher authentication transactions
• “Standing” transactions
• Canceled transfers
GENERAL
Other instances
• What do you already know?
• What is beyond your influence?
• Which problems can be handled separately?
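As a rough illustration of reducing the decision space for a fraud model, the sketch below drops records that a business rule can handle before any model sees them. The `Transaction` fields, the `needs_model_score` helper, and the 25.00 threshold are all hypothetical; the slides name the populations but not how to encode them.

```python
from dataclasses import dataclass


@dataclass
class Transaction:
    # Hypothetical fields, invented for this example.
    amount: float
    is_canceled: bool = False
    is_standing_order: bool = False


def needs_model_score(txn, min_amount=25.0):
    """Return True only for transactions left in the decision space."""
    # Canceled transfers and standing transactions are handled by rule.
    if txn.is_canceled or txn.is_standing_order:
        return False
    # Low-dollar transactions are excluded by an assumed threshold.
    return txn.amount >= min_amount


transactions = [
    Transaction(amount=12.50),
    Transaction(amount=480.00),
    Transaction(amount=900.00, is_canceled=True),
    Transaction(amount=75.00, is_standing_order=True),
    Transaction(amount=60.00),
]
to_score = [t for t in transactions if needs_model_score(t)]
print(f"{len(to_score)} of {len(transactions)} transactions go to the model")
```

The same filter would need to run both when building the training set and at scoring time, so the model is trained on the population it will actually judge.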
TIP 3: Use Targeted AUC Instead of Total AUC
Match model objective to organizational objective. Example courtesy of ORACLE.
TOTAL AUC
Optimizes overall model performance
• Traditional approach
• Perfect for many Kaggle competitions
• Sacrifices accuracy at lower threshold targets for overall accuracy
TARGETED AUC
Optimizes targeted model performance
• Less common approach
• Perfect for projects with target thresholds such as limited marketing budgets or maximum fraud referral/turndown rates
• Sacrifices overall accuracy for accuracy at lower threshold targets
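The slides do not define how targeted AUC is computed. One common reading, assumed here, is the partial area under the ROC curve up to a false-positive-rate cap that reflects the budget or referral limit. The sketch below computes both total and capped area in plain Python; `roc_points` and `auc_up_to` are illustrative names, and the labels and scores are made up.

```python
def roc_points(y_true, scores):
    """ROC (fpr, tpr) points, sweeping the threshold from high to low.

    Assumes binary 0/1 labels with at least one of each class.
    """
    pos = sum(y_true)
    neg = len(y_true) - pos
    ranked = sorted(zip(scores, y_true), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Process tied scores together so they yield a single point.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / neg, tp / pos))
    return points


def auc_up_to(points, max_fpr=1.0):
    """Trapezoidal area under the ROC curve for fpr in [0, max_fpr]."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if x0 >= max_fpr:
            break
        if x1 > max_fpr:
            # Cut the segment at the cap by linear interpolation.
            y1 = y0 + (y1 - y0) * (max_fpr - x0) / (x1 - x0)
            x1 = max_fpr
        area += (x1 - x0) * (y0 + y1) / 2
    return area


y_true = [1, 1, 0, 1, 0, 0, 1, 0, 0, 0]
scores = [0.95, 0.9, 0.85, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2]
points = roc_points(y_true, scores)
total_auc = auc_up_to(points)
targeted_auc = auc_up_to(points, max_fpr=0.2)
print(f"total AUC: {total_auc:.4f}")
print(f"area up to 20% FPR: {targeted_auc:.4f}")
```

Two models with the same total AUC can differ sharply in the capped area, so model selection on the capped figure favors the one that ranks best inside the region the organization can actually act on.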
TIP 4: Cross-Validate Everywhere
Reducing overfitting while extracting maximum learning from your data
OUT-OF-SAMPLE VALIDATION
Traditional methodology
CROSS-VALIDATION
Used to reduce both overfitting and outlier influence
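A minimal sketch of k-fold cross-validation follows, using only the standard library. The data are synthetic, and the one-variable least-squares line is just a stand-in model; `k_fold_indices`, `fit_line`, and `cross_validated_mae` are names invented for this example.

```python
import random


def k_fold_indices(n_rows, k=5, seed=0):
    """Yield (train, test) index lists; each row is tested exactly once."""
    idx = list(range(n_rows))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, folds[i]


def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, my - slope * mx


def cross_validated_mae(xs, ys, k=5):
    """Mean absolute error on each held-out fold."""
    fold_mae = []
    for train, test in k_fold_indices(len(xs), k):
        slope, intercept = fit_line([xs[i] for i in train],
                                    [ys[i] for i in train])
        errors = [abs(ys[i] - (slope * xs[i] + intercept)) for i in test]
        fold_mae.append(sum(errors) / len(errors))
    return fold_mae


# Synthetic data: a noisy straight line (assumed for illustration).
rng = random.Random(1)
xs = [float(i) for i in range(50)]
ys = [2.0 * x + 1.0 + rng.gauss(0, 0.5) for x in xs]
fold_mae = cross_validated_mae(xs, ys, k=5)
print("MAE per fold:", [round(m, 3) for m in fold_mae])
```

Unlike a single out-of-sample split, every row contributes to both fitting and evaluation, and the spread across folds shows how much a single unlucky split (for example, one holding the outliers) could have distorted the estimate.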
TIP 5: Algorithm Arsenal
Leverage diverse modeling arsenal
• Bayesian Network
• Gradient Boosting Machines
• Random Forests
• Logistic Regression
• Factorization Machines
• Neural Network
• Genetic Algorithms
• Support Vector Machines
SECTION 3
Features
TIPS 8-13: Univariate Tree Feature Engineering
1. Derive “Stumps”
“Stumps” represent the first split in decision trees, and make powerful “weak learners.” Create a derived feature for each input.
2. Bin Continuous Inputs
Using trees creates bin “boundaries” directly associated with the dependent variable, rather than a more arbitrary approach. Assign bins for each continuous input.
3. Handle Missing Values
Assigning missing values to a separate, unique category preserves information content and eliminates arbitrary replacement approaches.
4. Derive High-Impact Flags
Calling out tree nodes with uniquely powerful splitting capabilities as derived features leverages the most benefit from single inputs.
5. Normalize Scaling
Each input, regardless of data type, can have consistent, normalized scaling by using something like NORM Sigmoid or Yule’s Q for each terminal node from each univariate tree.
6. Overall Transformation
Re-coding the original input into the values from the terminal nodes makes interpretation much easier.
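The univariate tree steps (stump, bins, missing-value category, node re-coding) can be sketched for a single continuous input as follows. This is an illustration under several assumptions, not the presenter's implementation: the split criterion is Gini impurity, the tree is limited to one split, the target is binary 0/1, and each node is re-coded to its observed target rate rather than to NORM Sigmoid or Yule's Q. All names and data are hypothetical.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)


def best_stump_threshold(values, labels):
    """Threshold of the single split with the lowest weighted Gini.

    Missing values (None) are ignored when searching for the split.
    """
    pairs = sorted((v, y) for v, y in zip(values, labels) if v is not None)
    best_threshold, best_score = None, float("inf")
    for i in range(1, len(pairs)):
        if pairs[i][0] == pairs[i - 1][0]:
            continue  # cannot split between identical values
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(pairs)
        if score < best_score:
            best_score = score
            best_threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
    return best_threshold


def stump_feature(values, labels):
    """Return (threshold, node per row, node target rate per row)."""
    threshold = best_stump_threshold(values, labels)

    def node(v):
        if v is None:
            return "missing"  # missing values keep their own category
        return "low" if threshold is None or v <= threshold else "high"

    nodes = [node(v) for v in values]
    rate = {}
    for name in set(nodes):
        ys = [y for n, y in zip(nodes, labels) if n == name]
        rate[name] = sum(ys) / len(ys)
    # Re-code each row to the target rate of the node it falls in.
    return threshold, nodes, [rate[n] for n in nodes]


ages = [22, 25, 31, None, 47, 52, 58, None, 63]
bought = [0, 0, 0, 1, 1, 1, 1, 0, 1]
threshold, nodes, recoded = stump_feature(ages, bought)
print("split at:", threshold)
print("nodes:", nodes)
print("recoded:", recoded)
```

Because the bin boundary comes from the target itself, the same procedure works for any input and puts every derived feature on a common 0-to-1 scale. To avoid leaking the target, the threshold and node rates would be learned on training folds only and then applied to held-out rows.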
SECTION 4
Bonus Round
Bonus Round: Patent-Application IMPACT Features
Patent application approach for transforming and combining model inputs
1. Univariate Tree Models
2. Create Common Table of Values for Each Node
3. Calculate Z-Score Across Entire Table
4. Assign New Value to New Derived Feature
5. Calculate Avg., High and Low
6. Gradient Boosting
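The slide lists only step names, so the sketch below is one possible reading of steps 2 through 5, not the method described in the patent application. It assumes the "table of values" holds a target rate per tree node, that the z-score uses the mean and population standard deviation of every node value in the table, and that the avg/high/low summary is taken per row across its inputs. The function names and numbers are hypothetical, and step 6 (feeding the results to gradient boosting) is not shown.

```python
import statistics


def zscore_node_table(node_rates):
    """Z-score every node value against the whole table.

    node_rates maps feature -> {node name -> target rate}.
    Assumes the table values are not all identical (non-zero spread).
    """
    all_values = [r for nodes in node_rates.values() for r in nodes.values()]
    mean = statistics.fmean(all_values)
    sd = statistics.pstdev(all_values)
    return {
        feature: {name: (r - mean) / sd for name, r in nodes.items()}
        for feature, nodes in node_rates.items()
    }


def summarize_row(row_nodes, z_table):
    """Average, highest and lowest z-score across one row's nodes.

    row_nodes maps feature -> the node that row falls into.
    """
    z = [z_table[feature][name] for feature, name in row_nodes.items()]
    return {"avg": statistics.fmean(z), "high": max(z), "low": min(z)}


# Hypothetical node target rates from two univariate trees.
node_rates = {
    "age": {"low": 0.1, "high": 0.5},
    "income": {"low": 0.2, "high": 0.4},
}
z_table = zscore_node_table(node_rates)
summary = summarize_row({"age": "high", "income": "low"}, z_table)
print({k: round(v, 4) for k, v in summary.items()})
```

Putting every node on one z-score scale makes values comparable across inputs, so the per-row average, high, and low act as compact combined features.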