Using Bayesian Optimization to Tune Machine Learning Models
1. USING BAYESIAN OPTIMIZATION TO TUNE MACHINE LEARNING MODELS
Scott Clark
Co-founder and CEO of SigOpt
scott@sigopt.com @DrScottClark
2. TRIAL AND ERROR WASTES EXPERT TIME
Machine Learning is extremely powerful.
Tuning Machine Learning systems is extremely non-intuitive.
3. UNRESOLVED PROBLEM IN ML
https://www.quora.com/What-is-the-most-important-unresolved-problem-in-machine-learning-3
What is the most important unresolved problem in machine learning?
“...we still don't really know why some configurations of deep neural networks work in some cases and not others, let alone having a more or less automatic approach to determining the architectures and the hyperparameters.”
Xavier Amatriain, VP Engineering at Quora
(former Director of Research at Netflix)
5. COMMON APPROACH
Random Search for Hyper-Parameter Optimization, James Bergstra et al., 2012
1. Random search or grid search
2. Expert-defined grid search near “good” points
3. Refine the domain and repeat - “grad student descent”
6. COMMON APPROACH
● Expert intensive
● Computationally intensive
● Prone to finding only local optima
● Does not fully exploit useful information
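For concreteness, here is a minimal sketch of this common approach using scikit-learn's built-in grid and random search; the estimator, parameter grid, and data are illustrative placeholders, not from the talk.

    # Grid search exhaustively evaluates every combination; random search
    # samples a fixed budget of points from the same space.
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

    X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
    param_grid = {"n_estimators": [50, 100, 200], "max_depth": [4, 8, 16]}

    grid = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, cv=3)
    grid.fit(X, y)

    rand = RandomizedSearchCV(RandomForestClassifier(random_state=0), param_grid,
                              n_iter=5, cv=3, random_state=0)
    rand.fit(X, y)

    print(grid.best_params_, grid.best_score_)
    print(rand.best_params_, rand.best_score_)

Step 2 of the recipe (expert-defined search near “good” points) amounts to a human re-running this with a narrower param_grid, which is exactly the expert-intensive loop criticized above.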
7. OPTIMAL LEARNING
“… the challenge of how to collect information as efficiently as possible, primarily for settings where collecting information is time consuming and expensive.”
Prof. Warren Powell - Princeton
“What is the most efficient way to collect information?”
Prof. Peter Frazier - Cornell
“How do we make the most money, as fast as possible?”
Me - @DrScottClark
8. BAYESIAN GLOBAL OPTIMIZATION
● Optimize some Overall Evaluation Criterion (OEC)
○ Loss, Accuracy, Likelihood, Revenue
● Given tunable parameters
○ Hyperparameters, feature parameters
● In an efficient way
○ Sample the function as few times as possible
○ Training on big data is expensive
Details at https://sigopt.com/research
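As a rough illustration of the idea (not SigOpt's implementation), the open-source scikit-optimize package exposes the same pattern: a probabilistic model of the objective plus an acquisition function, spending as few evaluations as possible. The toy objective below stands in for an expensive training run.

    from skopt import gp_minimize
    from skopt.space import Real

    def objective(params):
        lr, momentum = params
        # Placeholder for training a model and returning a validation loss.
        return (lr - 0.01) ** 2 + (momentum - 0.9) ** 2

    result = gp_minimize(
        objective,
        dimensions=[Real(1e-4, 1e-1, name="lr"),
                    Real(0.5, 0.999, name="momentum")],
        acq_func="EI",   # Expected Improvement
        n_calls=20,      # sample the function as few times as possible
        random_state=0,
    )
    print(result.x, result.fun)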
13.-15. HOW DOES IT FIT IN THE STACK?
[Diagram, built up across three slides: Machine Learning Models with tunable parameters run against Big Data and report an Objective Metric; the optimizer optimally suggests new parameters; the loop yields Better Models.]
17. Ex: LOAN CLASSIFICATION (xgboost)
[Diagram: Loan Applications (Income, Credit Score, Loan Amount) feed a Default Prediction model with tunable ML parameters; the model reports Prediction Accuracy, the optimizer optimally suggests new parameters, and the loop yields Better Accuracy.]
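A hedged sketch of this setup in xgboost: the classifier's hyperparameters are the tunable knobs an optimizer would adjust. The synthetic data and feature names (income, credit score, loan amount) are placeholders.

    import numpy as np
    import xgboost as xgb
    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 3))  # columns: income, credit_score, loan_amount
    y = (X[:, 1] + rng.normal(scale=0.5, size=1000) < 0).astype(int)  # default flag

    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

    # These are the "tunable ML parameters" the optimizer would suggest.
    model = xgb.XGBClassifier(max_depth=6, learning_rate=0.1,
                              n_estimators=200, subsample=0.8)
    model.fit(X_tr, y_tr)
    print("accuracy:", accuracy_score(y_te, model.predict(X_te)))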
18. COMPARATIVE PERFORMANCE
● 8.2% better accuracy than baseline
● 100x faster than standard tuning methods
[Plot: AUC (.675 to .698) vs. tuning iterations (1,000 to 100,000), contrasting accuracy achieved with cost for grid search and random search.]
19. EXAMPLE: ALGORITHMIC TRADING
[Diagram: Market Data (Closing Prices, Day of Week, Market Volatility) feeds a Trading Strategy with tunable weights and thresholds; the strategy reports Expected Revenue, the optimizer optimally suggests new parameters, and the loop yields Higher Returns.]
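A toy sketch of what “tunable weights and thresholds” can mean here: a moving-average crossover strategy whose windows and entry threshold are the parameters, scored by simulated profit. All names and data are invented.

    import numpy as np

    rng = np.random.default_rng(0)
    prices = 100.0 + np.cumsum(rng.normal(size=500))  # fake closing prices

    def strategy_profit(short_win, long_win, threshold):
        # Hold a position whenever the short moving average leads the long
        # one by more than the threshold.
        short = np.convolve(prices, np.ones(short_win) / short_win, "valid")
        long_ = np.convolve(prices, np.ones(long_win) / long_win, "valid")
        n = min(len(short), len(long_))
        signal = (short[-n:] - long_[-n:]) > threshold
        returns = np.diff(prices[-n:])
        return float(np.sum(returns[signal[:-1]]))  # realized P&L

    # The optimizer's job: pick (short_win, long_win, threshold) to maximize this.
    print(strategy_profit(short_win=5, long_win=20, threshold=0.5))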
22. HOW DOES IT WORK?
1. Build Gaussian Process (GP) with points sampled so far
2. Optimize the fit of the GP (covariance hyperparameters)
3. Find the point(s) of highest Expected Improvement within parameter domain
4. Return optimal next best point(s) to sample
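A compact sketch of one iteration of steps 1-4, using scikit-learn's Gaussian process in place of SigOpt's internals; this is the textbook construction, not SigOpt's code.

    import numpy as np
    from scipy.stats import norm
    from sklearn.gaussian_process import GaussianProcessRegressor
    from sklearn.gaussian_process.kernels import Matern

    X_obs = np.array([[0.1], [0.4], [0.9]])  # points sampled so far
    y_obs = np.array([0.8, 0.3, 0.6])        # objective values (minimizing)

    # Steps 1-2: fit the GP; kernel hyperparameters are optimized inside fit().
    gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True)
    gp.fit(X_obs, y_obs)

    # Step 3: evaluate Expected Improvement over a grid of candidate points.
    X_cand = np.linspace(0, 1, 1000).reshape(-1, 1)
    mu, sigma = gp.predict(X_cand, return_std=True)
    best = y_obs.min()
    z = (best - mu) / np.maximum(sigma, 1e-12)
    ei = (best - mu) * norm.cdf(z) + sigma * norm.pdf(z)

    # Step 4: the next suggested sample is the EI maximizer.
    print("next point:", X_cand[np.argmax(ei)])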
23. HOW DOES IT WORK?
1. User reports data
2. SigOpt builds statistical model
(Gaussian Process)
3. SigOpt finds the points of
highest Expected Improvement
4. SigOpt suggests best
parameters to test next
5. User tests those parameters
and reports results to SigOpt
6. Repeat
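In code, the loop above maps onto the SigOpt Python client roughly as follows; evaluate_model is a placeholder for your own training-and-scoring function, and the exact client API may differ by version.

    from sigopt import Connection

    def evaluate_model(assignments):
        # Placeholder: train with these hyperparameters, return the metric.
        return -((assignments["lr"] - 0.01) ** 2)

    conn = Connection(client_token="YOUR_API_TOKEN")
    experiment = conn.experiments().create(
        name="Example tuning loop",
        parameters=[dict(name="lr", type="double",
                         bounds=dict(min=1e-4, max=1e-1))],
    )

    for _ in range(20):
        suggestion = conn.experiments(experiment.id).suggestions().create()
        value = evaluate_model(suggestion.assignments)      # steps 4-5
        conn.experiments(experiment.id).observations().create(
            suggestion=suggestion.id, value=value)          # step 1 (report)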
30. PROBLEM
● Classify house numbers with more training data and a more sophisticated model
31. CONVNET STRUCTURE
● TensorFlow makes it easier to design DNN architectures, but what structure works best on a given dataset?
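As a sketch of what “structure” means here, the block below builds a small convolutional net with the Keras API; each layer count, filter size, and the dropout rate are exactly the kind of knobs in question. This example is ours, not from the talk.

    import tensorflow as tf

    model = tf.keras.Sequential([
        tf.keras.layers.Conv2D(32, 3, activation="relu", input_shape=(32, 32, 3)),
        tf.keras.layers.MaxPooling2D(),
        tf.keras.layers.Conv2D(64, 3, activation="relu"),
        tf.keras.layers.MaxPooling2D(),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dropout(0.5),  # dropout rate: another tunable parameter
        tf.keras.layers.Dense(10, activation="softmax"),
    ])
    model.compile(optimizer="rmsprop",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])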
32. STOCHASTIC GRADIENT DESCENT
● Per-parameter adaptive SGD variants like RMSProp and Adagrad seem to work best
● They still require careful selection of learning rate (α), momentum (β), and decay (γ) terms
33. STOCHASTIC GRADIENT DESCENT
● Comparison of several RMSProp SGD parametrizations
● Not obvious which configuration will work best on a given dataset without experimentation
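For reference, the three knobs map onto the Keras RMSprop optimizer as below; the α/β/γ naming follows the slide, the mapping to Keras argument names is our assumption, and the values shown are just one parametrization among the many being compared.

    import tensorflow as tf

    optimizer = tf.keras.optimizers.RMSprop(
        learning_rate=1e-3,  # alpha: step size
        rho=0.9,             # gamma: decay of the squared-gradient average
        momentum=0.9,        # beta: momentum term
    )
    # Each argument is one dimension of the Bayesian optimizer's search space.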
35. PERFORMANCE
● Average hold-out accuracy after 5 optimization runs, each consisting of 80 objective evaluations
● Optimized on a single 80/20 CV fold of the training set; ACC reported on the test set as hold-out

Hold-out ACC:
SigOpt (TensorFlow CNN):        0.8130 (+315.2% over the untuned CNN)
Random Search (TensorFlow CNN): 0.5690
No Tuning (sklearn RF):         0.5278
No Tuning (TensorFlow CNN):     0.1958
37. EXAMPLE: TUNING DNN CLASSIFIERS
CIFAR10 Dataset
● Photos of objects
● 10 classes
● Metric: Accuracy
○ [0.1, 1.0]
Learning Multiple Layers of Features from Tiny Images, Alex Krizhevsky, 2009.
38. USE CASE: ALL CONVOLUTIONAL
http://arxiv.org/pdf/1412.6806.pdf
● All-convolutional neural network
● Multiple convolutional and dropout layers
● Hyperparameter optimization: a mixture of domain expertise and grid search (brute force)
39. MANY TUNABLE PARAMETERS...
● epochs: “number of epochs to run fit” - int [1, ∞)
● learning rate: influence of each step on the current weights - double (0, 1]
● momentum coefficient: “the coefficient of momentum” - double (0, 1]
● weight decay: parameter affecting how quickly weights decay - double (0, 1]
● depth: parameter affecting the number of layers in the net - int [1, 20(?)]
● gaussian scale: standard deviation of the initialization normal dist. - double (0, ∞)
● momentum step change: multiplicative amount to decrease momentum - double (0, 1]
● momentum step schedule start: epoch to start decreasing momentum - int [1, ∞)
● momentum schedule width: epoch stride for decreasing momentum - int [1, ∞)
...optimal values are non-intuitive
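One way to encode this domain for the optimizer, using SigOpt-style parameter definitions as in the earlier loop sketch; the finite upper bounds stand in for the slide's open-ended ranges and are our assumptions.

    parameters = [
        dict(name="epochs",         type="int",    bounds=dict(min=1, max=100)),
        dict(name="learning_rate",  type="double", bounds=dict(min=1e-6, max=1.0)),
        dict(name="momentum",       type="double", bounds=dict(min=1e-6, max=1.0)),
        dict(name="weight_decay",   type="double", bounds=dict(min=1e-6, max=1.0)),
        dict(name="depth",          type="int",    bounds=dict(min=1, max=20)),
        dict(name="gaussian_scale", type="double", bounds=dict(min=1e-6, max=10.0)),
        dict(name="momentum_step",  type="double", bounds=dict(min=1e-6, max=1.0)),
        dict(name="momentum_start", type="int",    bounds=dict(min=1, max=100)),
        dict(name="momentum_width", type="int",    bounds=dict(min=1, max=100)),
    ]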
40. COMPARATIVE PERFORMANCE
● Expert baseline: 0.8995 accuracy
○ (using neon)
● SigOpt best: 0.9011 accuracy
○ 1.6% reduction in error rate
○ No expert time wasted in tuning
41. USE CASE: DEEP RESIDUAL
http://arxiv.org/pdf/1512.03385v1.pdf
● Explicitly reformulates the layers as learning residual functions with reference to the layer inputs, instead of learning unreferenced functions
● Variable depth
● Hyperparameter optimization: a mixture of domain expertise and grid search (brute force)
42. COMPARATIVE PERFORMANCE
● Expert baseline (standard method): 0.9339 accuracy
○ (from the paper)
● SigOpt best: 0.9436 accuracy
○ 15% relative error rate reduction
○ No expert time wasted in tuning
44. TRY OUT SIGOPT FOR FREE
https://sigopt.com/getstarted
● Quick example and intro to SigOpt
● No signup required
● Visual and code examples
45. MORE EXAMPLES
https://github.com/sigopt/sigopt-examples
Examples of using SigOpt in a variety of languages and contexts.
Tuning Machine Learning Models (with code)
A comparison of different hyperparameter optimization methods.
Using Model Tuning to Beat Vegas (with code)
Using SigOpt to tune a model for predicting basketball scores.
Learn more about the technology behind SigOpt at
https://sigopt.com/research
48. USE CASE: CLASSIFICATION MODELS
Problem: Machine Learning models have many non-intuitive tunable hyperparameters
Before: standard methods use high resources for low performance
After: SigOpt finds better parameters with 10x fewer evaluations than standard methods
49. USE CASE: SIMULATIONS
BETTER RESULTS, +450% FASTER
Problem: expensive simulations require high resources for every run
Before: brute-force tuning is prohibitively expensive
After: SigOpt finds better results with fewer required simulations