Hyperparameter Tuning
Rohit Kumar Gupta
Advisor : Georg Heigold
Deep Learning Seminar
November 9, 2015
Agenda
• Introduction
• Problem Specification
• Simple approach
• Naïve Grid Search
• Sampling Random Combination
• Bayesian Optimization
• Gaussian Prior
• Acquisition function (utility function)
• Gaussian Posterior
• References
Introduction
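What is a hyperparameter?
• In Bayesian statistics: a parameter of a prior distribution, fixed before any evidence is taken into account
• It describes a probability distribution over probabilities
• More broadly: a setting of the model's external mechanics, chosen before training rather than learned from the data
• In our context of deep learning with neural networks, hyperparameters can be:
• Number of layers
• Number of hidden units
• Weight settings (for example initialization or decay; the learned weights themselves are parameters, not hyperparameters)
• Kernel parameter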
Problem Specification
[Figure: system diagram with inputs x1, x2, ..., xn and the training data, and output y]
x1:n -> hyperparameter values, y -> test error
Goal:
• For a given setting of hyperparameter values, the system returns the test error obtained after training
• Our task is to find the hyperparameter setting for which the test error is minimal
Plotting the problem:
[Figure: scatter plot of observed test error (vertical axis) against hyperparameter value (horizontal axis)]
• There is an underlying objective function that connects these dots
• We want to estimate it and find its minimum
• We have no knowledge of the form of the function we want to optimize
Simple Approach
Naïve Grid Search:
Basic idea: List all possible combinations of hyperparameter values and search them exhaustively for the best setting
• the number of combinations can be very large
• running each setting takes a very long time
Sampling Random Combinations:
Basic idea: Identify the hyperparameters that have little impact on the output, and eliminate redundant runs of the system by pruning combinations that differ only in those parameters
• we are still left with a large number of hyperparameter settings
• we need an approach that directs us towards an optimal setting
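The two simple approaches can be contrasted with a small sketch. The search space, the value lists and `toy_test_error` are all hypothetical stand-ins (a real run would train a network for every setting), and the random variant shown is plain uniform sampling rather than the pruning idea described on the slide.

```python
import itertools
import random

# Hypothetical search space: names and values are illustrative only.
search_space = {
    "num_layers": [1, 2, 3],
    "hidden_units": [64, 128, 256, 512],
    "learning_rate": [0.1, 0.01, 0.001],
}


def toy_test_error(setting):
    """Stand-in for an expensive training run; returns a made-up test error."""
    return (
        abs(setting["num_layers"] - 2) * 0.05
        + abs(setting["hidden_units"] - 256) / 5120
        + abs(setting["learning_rate"] - 0.01)
    )


def grid_search(space, objective):
    """Evaluate every combination in the Cartesian product of the value lists."""
    names = list(space)
    settings = [dict(zip(names, values))
                for values in itertools.product(*(space[n] for n in names))]
    return min(settings, key=objective), len(settings)


def random_search(space, objective, n_trials, seed=0):
    """Evaluate n_trials settings, each drawn independently and uniformly."""
    rng = random.Random(seed)
    settings = [{name: rng.choice(values) for name, values in space.items()}
                for _ in range(n_trials)]
    return min(settings, key=objective), len(settings)


grid_best, grid_runs = grid_search(search_space, toy_test_error)
rand_best, rand_runs = random_search(search_space, toy_test_error, n_trials=10)
print("grid:  ", grid_runs, "runs, best", grid_best)
print("random:", rand_runs, "runs, best", rand_best)
```

Grid search pays for every combination (3 x 4 x 3 here), while random search fixes the budget in advance but has no guarantee of hitting the best setting. Neither uses earlier results to decide where to look next.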
Bayesian Optimization
Motivation:
• We do not know the concrete form of the objective function
• We have only a few observations
• We therefore have to rely on priors to estimate the objective function
• In this setting, Bayesian optimization is a powerful strategy
• It lets us model the objective function and improve the estimate with each observation
A Bayesian approach works in two steps:
1. Developing a prior function, i.e. a probabilistic model of our current beliefs
2. Developing an acquisition function
Step 1:
• From a few initial runs of the system with different hyperparameter settings, we accumulate the observations
• $D = \{x_{1:t}, y_{1:t}\}$
• $x_{1:t}$ are the t different hyperparameter settings
• $y_{1:t}$ are the t corresponding test errors
• $P(f)$ is our prior estimate of the objective function
• When we observe a new evaluation, we can compute the posterior probability
• $P(f \mid D) \propto P(D \mid f)\, P(f)$
• We use Bayes' theorem to update our belief about the objective function
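The update rule in Step 1 can be made concrete with a deliberately tiny sketch. It assumes only three hypothetical candidate functions instead of a continuous family, made-up observations, and Gaussian observation noise with a fixed standard deviation. None of these choices come from the slides.

```python
import math

# Hypothetical candidate objective functions over a 1-D hyperparameter x.
candidates = {
    "flat": lambda x: 0.5,
    "bowl": lambda x: (x - 0.5) ** 2 + 0.1,
    "slope": lambda x: 0.2 + 0.6 * x,
}

# Prior P(f): all candidates equally plausible before seeing data.
prior = {name: 1.0 / len(candidates) for name in candidates}

# Observations D = {x_{1:t}, y_{1:t}} (made-up values for illustration).
xs = [0.1, 0.5, 0.9]
ys = [0.27, 0.11, 0.25]

NOISE_STD = 0.05  # assumed Gaussian observation noise


def likelihood(f, xs, ys, noise_std=NOISE_STD):
    """P(D | f): product of Gaussian densities of the residuals y - f(x)."""
    result = 1.0
    for x, y in zip(xs, ys):
        residual = y - f(x)
        result *= math.exp(-0.5 * (residual / noise_std) ** 2) / (
            noise_std * math.sqrt(2.0 * math.pi))
    return result


def posterior(prior, candidates, xs, ys):
    """P(f | D) is proportional to P(D | f) * P(f); normalise to sum to one."""
    unnormalised = {name: likelihood(f, xs, ys) * prior[name]
                    for name, f in candidates.items()}
    total = sum(unnormalised.values())
    return {name: value / total for name, value in unnormalised.items()}


post = posterior(prior, candidates, xs, ys)
for name, prob in sorted(post.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {prob:.4f}")
```

After three observations almost all posterior mass sits on the candidate that passes close to the data, which is the sense in which each evaluation sharpens the estimate.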
Motivation for step 2:
• Our next point of evaluation should be one where
• our estimate of the objective function is highly uncertain
• the improvement over the current best error is largest
• We can achieve this with a utility function that weighs both criteria and returns the next point of evaluation
[Figure: the utility function takes the current best error and the prior function as inputs and returns x(t+1), the next point of observation (the (t+1)-th observation). It explores areas of high uncertainty and exploits areas of maximum improvement.]
Gaussian Priors
Motivation:
• We assume that the objective function is smooth, i.e. a small change in the input produces a small change in the output
• It should be continuous
• Gaussian process (GP) priors are one way to formalize our prior, because they encode these assumptions well
• The idea is that a Gaussian connects these dots
• Many Gaussians can do so; we approximate them using a similarity measure between the given points
Formalizing our GP prior:
• Given the data, D, we want to model the f's with a multivariate Gaussian
$$k(x_i, x_j) = \exp\left(-\lVert x_i - x_j \rVert^2\right) \approx \begin{cases} 0, & \text{if } x_i \text{ and } x_j \text{ are far apart} \\ 1, & \text{if } x_i \text{ and } x_j \text{ are close} \end{cases}$$
$$G = \begin{bmatrix} f_1 \\ f_2 \\ f_3 \end{bmatrix} \sim \mathrm{GP}\left( \begin{bmatrix} 0 \\ 0 \\ 0 \end{bmatrix}, \begin{bmatrix} k_{11} & k_{12} & k_{13} \\ k_{21} & k_{22} & k_{23} \\ k_{31} & k_{32} & k_{33} \end{bmatrix} \right)$$
• K, the matrix with entries $k_{ij} = k(x_i, x_j)$, is the covariance matrix, which we define according to the similarity of the data points
• So, our prior is just a simulation of (a draw from) G
[Figure: test error f(x) against hyperparameter x, showing the observed values f1, f2, f3 at x1, x2, x3]
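As a rough illustration of the similarity measure, the sketch below builds the covariance matrix K for three hypothetical 1-D hyperparameter settings, two close together and one far away. The point values are made up, and the kernel is the squared-exponential form from the slide with no length-scale parameter.

```python
import math


def sq_exp_kernel(xi, xj):
    """k(xi, xj) = exp(-||xi - xj||^2) for points given as sequences of floats."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(xi, xj))
    return math.exp(-sq_dist)


def covariance_matrix(points):
    """Build K with K[i][j] = k(points[i], points[j])."""
    return [[sq_exp_kernel(p, q) for q in points] for p in points]


# Three hypothetical 1-D hyperparameter settings: two close together, one far away.
points = [(0.0,), (0.1,), (3.0,)]
K = covariance_matrix(points)
for row in K:
    print(["%.4f" % value for value in row])
```

The entry for the two nearby points is close to 1 and the entries involving the distant point are close to 0, so the prior ties together the function values at similar settings and leaves distant ones nearly independent.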
• The Gaussian G derived above is a multivariate Gaussian over function values
• Now suppose we want to predict the test error at a particular value of x (say the test point x*)
• We can assume f* comes from a Gaussian distribution such that
$$f_* \sim \mathrm{GP}\left(0, k(x_*, x_*)\right)$$
• Since x* can be assumed to come from the same set as the training data, f* is jointly Gaussian with the G defined earlier:
$$G' = \begin{bmatrix} f_1 \\ f_2 \\ f_3 \\ f_* \end{bmatrix} \sim \mathrm{GP}(\mu, K), \quad \mu = \text{mean vector}, \quad K = \text{covariance matrix}$$
• The whole problem can thus be viewed as cutting the multivariate Gaussian at x*: the cut is another Gaussian defined in a plane, and we look for its mean
[Figure: test error f(x) against hyperparameter x, with the observed values f1, f2, f3 at x1, x2, x3 and the prediction f* at x*, marked with μ* and σ*]
• Cutting a multivariate Gaussian is a conditional-distribution problem
• The mean and variance of the resulting Gaussian follow from the multivariate Gaussian theorem
• This theorem gives the conditional mean and variance from a given joint distribution
• The estimate of the objective function is obtained by combining these means at different cuts
• The shaded region gives the confidence interval around the mean
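For a zero-mean prior, the conditional ("cut") distribution at x* has the standard closed form $\mu_* = k_*^\top K^{-1} f$ and $\sigma_*^2 = k(x_*, x_*) - k_*^\top K^{-1} k_*$, where K is the covariance matrix of the observed points and $k_*$ is the vector of similarities $k(x_i, x_*)$. The sketch below evaluates these two expressions with the standard library only. The observations are made up, the small `jitter` term is an added numerical safeguard, and in practice a numerical library would replace the hand-written `solve`.

```python
import math


def kernel(a, b):
    """Squared-exponential similarity between two scalar inputs."""
    return math.exp(-(a - b) ** 2)


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]  # augmented copy
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(M[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (M[r][n] - tail) / M[r][r]
    return x


def gp_posterior(xs, fs, x_star, jitter=1e-9):
    """Conditional mean and variance of f* given observations (xs, fs)."""
    n = len(xs)
    K = [[kernel(xs[i], xs[j]) + (jitter if i == j else 0.0) for j in range(n)]
         for i in range(n)]
    k_star = [kernel(x, x_star) for x in xs]
    alpha = solve(K, fs)      # K^{-1} f
    beta = solve(K, k_star)   # K^{-1} k*
    mean = sum(k * a for k, a in zip(k_star, alpha))
    var = kernel(x_star, x_star) - sum(k * b for k, b in zip(k_star, beta))
    return mean, max(var, 0.0)


xs = [0.0, 1.0, 2.5]      # hypothetical hyperparameter settings
fs = [0.30, 0.12, 0.40]   # hypothetical test errors observed at xs
for x_star in (1.0, 1.7, 6.0):
    mean, var = gp_posterior(xs, fs, x_star)
    print(f"x*={x_star}: mean={mean:.3f}, variance={var:.3f}")
```

At an already observed setting the variance collapses to almost zero and the mean reproduces the observation. Far from all observations the prediction falls back to the prior (mean 0, variance 1), which is what the widening shaded region in the plot expresses.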
Acquisition function
Intuition:
• The confidence interval (variance) is a measure of uncertainty at a point
• So we look for the point where the variance is largest
• subject to the constraint that the mean at that point is lower than the best error known so far
Formalizing by Probability of Improvement:
$$PI(x) = P\left(f(x) \le f(x^+) - \xi\right)$$
PI(x) = probability of improvement at a point x
f(x) = test error at that point
f(x+) = best test error so far
ξ = trade-off parameter (ξ ≥ 0): larger values demand a bigger improvement and so favour exploring uncertain regions
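When the model's belief about f(x) at a point is Gaussian with mean μ and standard deviation σ, the probability of improvement for a minimization problem reduces to a normal CDF, Φ((f(x+) − ξ − μ) / σ). The sketch below scores three hypothetical candidate settings this way. The posterior means, standard deviations and the default `xi` are made-up values for illustration.

```python
import math


def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def probability_of_improvement(mean, std, best_error, xi=0.01):
    """P(f(x) <= best_error - xi) when f(x) ~ Normal(mean, std^2)."""
    target = best_error - xi
    if std <= 0.0:
        # No uncertainty left: improvement is either certain or impossible.
        return 1.0 if mean <= target else 0.0
    return normal_cdf((target - mean) / std)


best_error = 0.12  # hypothetical best test error so far
# Hypothetical posterior (mean, std) at three candidate settings.
candidates = {
    "near_best": (0.12, 0.01),
    "uncertain": (0.20, 0.30),
    "confident_bad": (0.40, 0.02),
}
scores = {name: probability_of_improvement(m, s, best_error)
          for name, (m, s) in candidates.items()}
next_point = max(scores, key=scores.get)
for name, score in scores.items():
    print(f"{name}: PI = {score:.4f}")
print("next point to evaluate:", next_point)
```

The uncertain candidate wins even though its mean is worse than the current best, because its wide spread leaves real probability mass below the target. A point that is confidently bad scores essentially zero.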
Gaussian Posterior
• Once we know the next hyperparameter setting, we can evaluate our system at that point
• This gives us a new piece of evidence
• With it we can update our belief and compute the posterior function
$$P\left(f_{\text{new}} \mid D, x_{t+1}\right)$$
• We repeat this whole cycle until we find a hyperparameter setting for which the test error is sufficiently low
Benefits of the Bayesian approach
• Effective when the objective function has no closed form
• Works when the problem may be non-convex, which we cannot know in advance
• Needs only a few evaluations of the objective function
[Figure: Bayesian Optimization Algorithm]
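Putting the pieces together, the cycle of fitting the model, choosing the next point and evaluating it can be sketched as below. This is a minimal illustration under several assumptions that are not from the slides: a 1-D hyperparameter, a fixed grid of candidate values, a made-up `toy_test_error` in place of real training, observations centred on their mean before fitting the zero-mean GP, and a hand-written linear solver so that only the standard library is needed.

```python
import math


def kernel(a, b):
    """Squared-exponential similarity between two scalar settings."""
    return math.exp(-(a - b) ** 2)


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(M[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (M[r][n] - tail) / M[r][r]
    return x


def gp_posterior(xs, ys, x_star, jitter=1e-6):
    """Posterior mean and standard deviation at x_star under a zero-mean GP."""
    n = len(xs)
    K = [[kernel(xs[i], xs[j]) + (jitter if i == j else 0.0) for j in range(n)]
         for i in range(n)]
    k_star = [kernel(x, x_star) for x in xs]
    mean = sum(k * a for k, a in zip(k_star, solve(K, ys)))
    var = 1.0 - sum(k * b for k, b in zip(k_star, solve(K, k_star)))
    return mean, math.sqrt(max(var, 0.0))


def probability_of_improvement(mean, std, best, xi=0.01):
    """P(f(x) <= best - xi) for f(x) ~ Normal(mean, std^2)."""
    if std <= 0.0:
        return 1.0 if mean <= best - xi else 0.0
    z = (best - xi - mean) / std
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def toy_test_error(x):
    """Stand-in for training a model; made-up function with its minimum at x = 2."""
    return 0.1 + 0.05 * (x - 2.0) ** 2


def bayesian_optimization(objective, candidates, initial_xs, n_iter=8):
    """Repeat: fit the GP, pick the candidate with the highest PI, evaluate it."""
    xs = list(initial_xs)
    ys = [objective(x) for x in xs]
    for _ in range(n_iter):
        untried = [c for c in candidates if c not in xs]
        if not untried:
            break
        best = min(ys)
        offset = sum(ys) / len(ys)          # centre data for the zero-mean prior
        centred = [y - offset for y in ys]

        def score(c):
            mean, std = gp_posterior(xs, centred, c)
            return probability_of_improvement(mean + offset, std, best)

        x_next = max(untried, key=score)
        xs.append(x_next)
        ys.append(objective(x_next))        # the new piece of evidence
    best_index = min(range(len(ys)), key=ys.__getitem__)
    return xs[best_index], ys[best_index], len(xs)


candidates = [i * 0.25 for i in range(17)]  # hypothetical grid on [0, 4]
best_x, best_y, n_evals = bayesian_optimization(
    toy_test_error, candidates, initial_xs=[0.0, 4.0])
print(f"best x = {best_x}, test error = {best_y:.3f}, evaluations = {n_evals}")
```

With only the two endpoints observed, the first proposal is the midpoint of the range, where the model is most uncertain. Later proposals concentrate near the lowest errors seen so far. The loop evaluates at most 10 of the 17 candidate settings.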
References
• Practical Bayesian Optimization of Machine Learning Algorithms
• Algorithms for Hyper-Parameter Optimization
• A Tutorial on Bayesian Optimization of Expensive Cost Functions, with Application to Active User Modeling and Hierarchical Reinforcement Learning
• Introduction to Gaussian Processes