3. Introduction
What is a Hyperparameter?
• a parameter of a prior distribution, set before any evidence is taken into account
• it parameterizes a probability distribution over the model's parameters
• it is external to the model's mechanics
• in our context of deep learning with neural networks, hyperparameters can be:
• Number of layers
• Number of hidden units
• Weight decay
• Kernel parameters
4. [Diagram: hyperparameter inputs x1, x2, …, xn are fed to the system, which is trained on the training data and outputs y]
x1:n → hyperparameter values, y → test error
Goal:
• For a given setting of hyperparameter values, the system returns the test error obtained after training
• Our task is to find the optimal hyperparameter setting for which the test error is minimum
Plotting the problem:
[Plot: observed test errors (dots) plotted against hyperparameter values]
• There is an underlying objective function that connects these dots
• We want to estimate it and find the minimum of that function
• We have no knowledge about the form of the function we want to optimize
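The setup above can be made concrete with a minimal Python sketch; the quadratic `train_and_evaluate` and its noise level are stand-ins for a real, expensive training run, not part of the slides:

```python
import random

def train_and_evaluate(learning_rate):
    """Hypothetical black-box objective: train a model with the given
    hyperparameter and return its test error. A noisy quadratic stands
    in for the real (unknown, expensive) training run."""
    true_error = (learning_rate - 0.1) ** 2 + 0.05
    return true_error + random.gauss(0, 0.001)

# Each evaluation is expensive, so we only ever see scattered (x, y) dots.
observations = [(x, train_and_evaluate(x)) for x in (0.01, 0.05, 0.2)]
```

The optimizer sees only these input/output pairs; the shape of the function between them is unknown.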
Problem Specification
5. Naïve Grid Search:
Basic idea: make a list of all possible combinations of hyperparameter values and do an exhaustive search for the best setting
• there can be a very large number of combinations
• running each setting takes a very long time
Simple Approach
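A minimal sketch of naïve grid search; the search space and the toy `evaluate` function are illustrative assumptions standing in for real training runs:

```python
from itertools import product

# Hypothetical search space: every combination gets evaluated exhaustively.
grid = {
    "learning_rate": [0.001, 0.01, 0.1],
    "num_layers": [2, 4, 8],
    "hidden_units": [64, 128],
}

def evaluate(setting):
    # Stand-in for an expensive training run returning a test error.
    return abs(setting["learning_rate"] - 0.01) + setting["num_layers"] / 100

keys = list(grid)
best = min(
    (dict(zip(keys, combo)) for combo in product(*grid.values())),
    key=evaluate,
)
# 3 * 3 * 2 = 18 runs already; real grids explode combinatorially.
```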
Sampling Random Combinations:
Basic idea: identify the hyperparameters that have little impact on the output, and eliminate redundant runs of the system by pruning combinations in which only those parameters change
• we are still left with a large number of hyperparameter settings
• we need an approach that directs us toward an optimal setting
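A corresponding sketch of random sampling over the same kind of space, again with a toy `evaluate` standing in for a real training run:

```python
import random

random.seed(0)  # reproducibility for this illustration

space = {
    "learning_rate": [0.001, 0.01, 0.1],
    "num_layers": [2, 4, 8],
    "hidden_units": [64, 128],
}

def evaluate(setting):
    # Stand-in for an expensive training run.
    return abs(setting["learning_rate"] - 0.01) + setting["num_layers"] / 100

# Sample a handful of random combinations instead of all 18.
samples = [{k: random.choice(v) for k, v in space.items()} for _ in range(6)]
best = min(samples, key=evaluate)
```

Random sampling cuts the number of runs, but the choice of where to evaluate next is still blind; nothing steers it toward the optimum.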
6. Bayesian Optimization
Motivation:
• Lack of knowledge about the concrete form of the objective function
• Few observed data points
• We can only rely on priors to best estimate the objective function
• Under these conditions, Bayesian Optimization is a powerful strategy
• It allows us to model the objective function and obtain a better estimate with each observation
A Bayesian approach works in two steps:
1. Developing a prior, i.e. a probabilistic model of our current beliefs
2. Developing an acquisition function
Step 1:
• With a few initial runs of the system at different hyperparameter settings, we accumulate observations
• D = {(x1:t, y1:t)}
• x1:t are t different hyperparameter settings
• y1:t are the t corresponding test errors
• P(f) is our prior estimate of the objective function
• When we observe a new evaluation, we can compute the posterior
• P(f | D) ∝ P(D | f) · P(f)
• We use Bayes' theorem to update our belief about the objective function
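The update P(f | D) ∝ P(D | f) · P(f) can be illustrated on a toy discrete hypothesis space; the three candidate functions, the Gaussian noise model, and the observed error below are illustrative assumptions:

```python
import math

# Three candidate "objective functions" (hypotheses), each predicting a
# test error at the queried hyperparameter setting.
candidates = {"f_a": 0.10, "f_b": 0.30, "f_c": 0.50}
prior = {name: 1 / 3 for name in candidates}  # uniform prior P(f)

def likelihood(predicted, observed, sigma=0.1):
    # Gaussian observation noise: P(D | f)
    return math.exp(-0.5 * ((observed - predicted) / sigma) ** 2)

observed_error = 0.12  # new evidence from one training run
unnorm = {n: likelihood(p, observed_error) * prior[n] for n, p in candidates.items()}
z = sum(unnorm.values())
posterior = {n: v / z for n, v in unnorm.items()}
# Posterior mass concentrates on f_a, whose prediction matches the evidence.
```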
7. Bayesian Optimization
Motivation for step 2:
• Our next point of evaluation should be one where
• our estimate of the objective function is highly uncertain, or
• the improvement over the current best error is expected to be maximal
• We can achieve this via a utility function that balances these constraints and returns the next point of evaluation
[Diagram: the utility function takes the prior function and the current best error and returns x(t+1), the next point of observation; it explores the areas of high uncertainty and exploits the areas of maximum improvement]
8. Gaussian Priors
Motivation:
• We hold an underlying assumption that the objective function is smooth, i.e. for a small change in input the change in output is small
• It should be continuous
• Gaussian Process (GP) priors are one approach to formalizing our prior, as they capture these assumptions well
• A GP says there is a Gaussian distribution connecting these dots
• Many Gaussians can do so; we approximate them using a similarity measure between the given points
Formalizing our GP prior:
• Given the data D, we model the function values f with a multivariate Gaussian
• k(xi, xj) = exp(−‖xi − xj‖²) ≈ 0 if xi and xj are far apart, ≈ 1 if they are close
• (f1, f2, f3)ᵀ ~ N( (0, 0, 0)ᵀ, K ), where

  K = | k11 k12 k13 |
      | k21 k22 k23 |
      | k31 k32 k33 |

• We call this joint Gaussian G; K is the covariance matrix, defined by us based on the similarity of the data points
• So our prior is just a simulation (a draw) from G
[Plot: test error vs hyperparameter, with function values f1, f2, f3 at points x1, x2, x3]
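A small NumPy sketch of this prior: build K from the similarity kernel and draw one plausible vector (f1, f2, f3) from G (the three example inputs are assumptions):

```python
import numpy as np

def k(xi, xj):
    # Squared-exponential kernel: ~1 for nearby points, ~0 for distant ones.
    return np.exp(-np.sum((xi - xj) ** 2))

x = np.array([0.1, 0.5, 0.9])  # three hyperparameter settings
K = np.array([[k(a, b) for b in x] for a in x])  # 3x3 covariance matrix

rng = np.random.default_rng(0)
# One draw from the prior G: a plausible vector (f1, f2, f3) ~ N(0, K)
f_sample = rng.multivariate_normal(np.zeros(3), K)
```

Each draw is one candidate objective function evaluated at the three points; nearby inputs get strongly correlated function values.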
9. Gaussian Priors
• The Gaussian G we derived is a multivariate Gaussian over function values
• Now suppose we want to predict the test error at a particular value of x (say the test point is x*)
• We can assume f* comes from a Gaussian distribution such that
  f* ~ N(0, k(x*, x*))
• Since x* can be assumed to come from the same distribution as the training data, f* is jointly Gaussian with the G defined earlier
• Thus the whole problem reduces to cutting the multivariate Gaussian at x*: the cut is another Gaussian defined in a plane, and we take the mean of that Gaussian
[Plot: GP posterior at the test point x*, with predictive mean μ* and standard deviation σ*]
(f1, f2, f3, f*)ᵀ ~ N(μ, K′), where μ is the mean vector and K′ is the extended covariance matrix
10. • Cutting a multivariate Gaussian is now a problem of conditional distributions
• We can find the mean and variance of this Gaussian by the multivariate Gaussian conditioning theorem
• By this theorem, given the joint distribution, the conditional mean and variance are μ* = k*ᵀ K⁻¹ y and σ*² = k(x*, x*) − k*ᵀ K⁻¹ k*
• The estimate of the objective function combines these means at different cuts
• The shaded region gives the confidence interval around the mean
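A minimal NumPy sketch of this conditioning step, using the standard conditional-Gaussian formulas μ* = k*ᵀK⁻¹y and σ*² = k(x*, x*) − k*ᵀK⁻¹k* (the data points are illustrative):

```python
import numpy as np

def k(a, b):
    # Squared-exponential similarity kernel.
    return np.exp(-np.sum((np.asarray(a) - np.asarray(b)) ** 2))

# Observed hyperparameter settings and their test errors
X = np.array([0.1, 0.5, 0.9])
y = np.array([0.30, 0.12, 0.25])

K = np.array([[k(a, b) for b in X] for a in X]) + 1e-8 * np.eye(3)  # jitter
x_star = 0.45
k_star = np.array([k(x_star, b) for b in X])

# Conditional (posterior) mean and variance at the cut x*:
K_inv = np.linalg.inv(K)
mu_star = k_star @ K_inv @ y
var_star = k(x_star, x_star) - k_star @ K_inv @ k_star
```

Since x* lies close to the observation at 0.5, the posterior mean stays near 0.12 and the variance is small; far from all observations, the variance approaches the prior value k(x*, x*).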
Gaussian Priors
11. Acquisition function
Intuition:
• The confidence interval (variance) is a measure of the uncertainty at a point
• So we intend to find a point where the variance is high
• With the constraint that the mean at that point should be less than the best error known so far
Formalizing via the Probability of Improvement:
PI(x) = P(f(x) ≤ f(x⁺) − ξ)
PI(x) = probability of improvement at point x
f(x) = test error at that point
f(x⁺) = best current test error
ξ = parameter that controls the exploration/exploitation trade-off
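Under the GP posterior f(x) ~ N(μ(x), σ(x)²), PI has a closed form via the normal CDF. A small sketch for the minimization case (subtracting ξ follows the minimization convention, so that already-evaluated points do not trivially score 1):

```python
import math

def normal_cdf(z):
    # Standard normal CDF built from the error function.
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def probability_of_improvement(mu, sigma, f_best, xi=0.01):
    """PI(x) = P(f(x) <= f_best - xi) under f(x) ~ N(mu, sigma^2),
    when minimizing the test error."""
    if sigma == 0.0:
        return float(mu <= f_best - xi)
    return normal_cdf((f_best - xi - mu) / sigma)
```

A point whose posterior mean clearly beats the current best error scores near 1; a confidently bad point scores near 0; a highly uncertain point scores near 0.5, which is what drives exploration.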
12. Gaussian Posterior
• Once we know the next hyperparameter setting, we can evaluate our system at that point
• In this way we obtain a new piece of evidence
• Knowing this, we can update our belief and compute the posterior
P(f_new | D, x(t+1))
• We can repeat this whole cycle until we find a hyperparameter setting for which the test error is sufficiently small
Benefits of the Bayesian approach
• Effective when the objective function has no closed form
• Works when the problem is non-convex, which we may not know in advance
• Requires only a few evaluations of the objective function
Bayesian Optimization Algorithm
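The whole cycle can be sketched end to end; this is a toy illustration under assumed choices (one-dimensional objective, squared-exponential kernel, PI acquisition over a candidate grid), not the algorithm from the referenced papers:

```python
import math
import numpy as np

def objective(x):
    # Hypothetical expensive black-box: test error vs one hyperparameter.
    return (x - 0.3) ** 2 + 0.05

def kernel(a, b):
    # Squared-exponential similarity measure (lengthscale is an assumption).
    return np.exp(-((a - b) ** 2) / 0.02)

def gp_posterior(X, y, x_star):
    # Conditional mean and std of the GP at x_star given observations (X, y).
    K = np.array([[kernel(a, b) for b in X] for a in X]) + 1e-6 * np.eye(len(X))
    k_s = np.array([kernel(x_star, b) for b in X])
    K_inv = np.linalg.inv(K)
    mu = k_s @ K_inv @ np.array(y)
    var = max(kernel(x_star, x_star) - k_s @ K_inv @ k_s, 1e-12)
    return mu, math.sqrt(var)

def pi(mu, sigma, f_best, xi=0.01):
    # Probability of improvement for minimization: P(f(x) <= f_best - xi).
    return 0.5 * (1.0 + math.erf((f_best - xi - mu) / (sigma * math.sqrt(2.0))))

candidates = np.linspace(0.0, 1.0, 101)
X, y = [0.0, 1.0], [objective(0.0), objective(1.0)]  # two initial runs
for _ in range(10):
    f_best = min(y)
    # Acquisition step: evaluate next where PI is highest.
    x_next = max(candidates, key=lambda c: pi(*gp_posterior(X, y, c), f_best))
    X.append(float(x_next))
    y.append(objective(float(x_next)))  # new evidence; posterior updates next round

best_x = X[int(np.argmin(y))]
```

Each iteration fits the GP to all evidence so far, picks the candidate with the highest probability of improving on the best error, evaluates it, and repeats.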
13. References
• Practical Bayesian Optimization of Machine Learning Algorithms
• Algorithms for Hyper-Parameter Optimization
• A Tutorial on Bayesian Optimization of Expensive Cost Functions, with Application to Active User Modeling and Hierarchical Reinforcement Learning
• Introduction to Gaussian Processes