Pattern Recognition and Machine Learning: Section 3.3
- 1. Reading Pattern Recognition and Machine Learning, §3.3 (Bayesian Linear Regression)
Christopher M. Bishop
Introduced by: Yusuke Oda (NAIST), @odashi_t
2013/6/5 © Yusuke Oda, AHC-Lab, IS, NAIST
- 2. Agenda
3.3 Bayesian Linear Regression
– 3.3.1 Parameter distribution
– 3.3.2 Predictive distribution
– 3.3.3 Equivalent kernel
- 4. Bayesian Linear Regression
Maximum Likelihood (ML)
– The number of basis functions (≃ model complexity) must be chosen according to the size of the data set.
– A regularization term is added to control model complexity.
– But how should we determine the coefficient of the regularization term?
- 5. Bayesian Linear Regression
Maximum Likelihood (ML)
– Using ML to determine the coefficient of the regularization term ... a bad choice
• This always leads to excessively complex models (= over-fitting): in the case on the previous slide, λ always becomes 0 when ML is used to determine λ.
– Using independent hold-out data to determine model complexity (see §1.3)
... computationally expensive
... wasteful of valuable data
- 6. Bayesian Linear Regression
Bayesian treatment of linear regression
– Avoids the over-fitting problem of ML.
– Leads to automatic methods of determining model complexity using the training data alone.
What do we do?
– Introduce a prior distribution p(\mathbf{w}) and a likelihood p(\mathbf{t} \mid \mathbf{w}).
• The model parameters \mathbf{w} are treated as random variables.
– Compute the posterior distribution using Bayes' theorem:
  p(\mathbf{w} \mid \mathbf{t}) \propto p(\mathbf{t} \mid \mathbf{w})\, p(\mathbf{w})
- 7. Agenda
3.3 Bayesian Linear Regression
– 3.3.1 Parameter distribution (next)
– 3.3.2 Predictive distribution
– 3.3.3 Equivalent kernel
- 8. Note: Marginal / Conditional Gaussians
Given a marginal Gaussian distribution for \mathbf{x} and a conditional Gaussian distribution for \mathbf{y} given \mathbf{x}:
  p(\mathbf{x}) = \mathcal{N}(\mathbf{x} \mid \boldsymbol\mu, \boldsymbol\Lambda^{-1})   (2.113)
  p(\mathbf{y} \mid \mathbf{x}) = \mathcal{N}(\mathbf{y} \mid \mathbf{A}\mathbf{x} + \mathbf{b}, \mathbf{L}^{-1})   (2.114)
then the marginal distribution of \mathbf{y} and the conditional distribution of \mathbf{x} given \mathbf{y} are:
  p(\mathbf{y}) = \mathcal{N}(\mathbf{y} \mid \mathbf{A}\boldsymbol\mu + \mathbf{b}, \mathbf{L}^{-1} + \mathbf{A}\boldsymbol\Lambda^{-1}\mathbf{A}^\top)   (2.115)
  p(\mathbf{x} \mid \mathbf{y}) = \mathcal{N}(\mathbf{x} \mid \boldsymbol\Sigma\{\mathbf{A}^\top\mathbf{L}(\mathbf{y} - \mathbf{b}) + \boldsymbol\Lambda\boldsymbol\mu\}, \boldsymbol\Sigma)   (2.116)
where
  \boldsymbol\Sigma = (\boldsymbol\Lambda + \mathbf{A}^\top\mathbf{L}\mathbf{A})^{-1}   (2.117)
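As a quick sanity check of (2.115), here is a minimal Monte Carlo sketch in NumPy; the dimensions and parameter values are made up for illustration, not taken from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([1.0, -2.0])                   # mean of p(x)
Lam = np.array([[2.0, 0.3], [0.3, 1.0]])     # precision of p(x)
A = np.array([[1.0, 2.0], [0.0, 1.0], [3.0, -1.0]])
b = np.array([0.5, 0.0, -1.0])
L = np.diag([4.0, 4.0, 4.0])                 # precision of p(y|x)

n = 200_000
x = rng.multivariate_normal(mu, np.linalg.inv(Lam), size=n)
y = x @ A.T + b + rng.multivariate_normal(np.zeros(3), np.linalg.inv(L), size=n)

# (2.115): p(y) = N(y | A mu + b, L^-1 + A Lam^-1 A^T)
print(np.allclose(y.mean(0), A @ mu + b, atol=0.02))
print(np.allclose(np.cov(y.T),
                  np.linalg.inv(L) + A @ np.linalg.inv(Lam) @ A.T, atol=0.05))
```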
- 9. Parameter Distribution
Remember the likelihood function given in §3.1.1:
  p(\mathbf{t} \mid \mathbf{w}) = \prod_{n=1}^{N} \mathcal{N}(t_n \mid \mathbf{w}^\top\boldsymbol\phi(\mathbf{x}_n), \beta^{-1})   (3.10)
– This is the exponential of a quadratic function of \mathbf{w} (the noise precision \beta is treated as a known parameter).
The corresponding conjugate prior is therefore a Gaussian distribution:
  p(\mathbf{w}) = \mathcal{N}(\mathbf{w} \mid \mathbf{m}_0, \mathbf{S}_0)   (3.48)
- 10. Parameter Distribution
Now, given the prior (3.48) and the likelihood (3.10), the posterior distribution is obtained by using (2.116):
  p(\mathbf{w} \mid \mathbf{t}) = \mathcal{N}(\mathbf{w} \mid \mathbf{m}_N, \mathbf{S}_N)   (3.49)
where
  \mathbf{m}_N = \mathbf{S}_N(\mathbf{S}_0^{-1}\mathbf{m}_0 + \beta\boldsymbol\Phi^\top\mathbf{t})   (3.50)
  \mathbf{S}_N^{-1} = \mathbf{S}_0^{-1} + \beta\boldsymbol\Phi^\top\boldsymbol\Phi   (3.51)
(A sketch of this computation follows below.)
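A minimal NumPy sketch of (3.49)–(3.51); the toy design matrix, targets, and hyperparameter values are assumptions for illustration only.

```python
import numpy as np

def posterior_params(Phi, t, m0, S0, beta):
    """Return (m_N, S_N) of the posterior p(w|t) = N(w | m_N, S_N)."""
    S0_inv = np.linalg.inv(S0)
    SN_inv = S0_inv + beta * Phi.T @ Phi           # (3.51)
    SN = np.linalg.inv(SN_inv)
    mN = SN @ (S0_inv @ m0 + beta * Phi.T @ t)     # (3.50)
    return mN, SN

# Toy usage: 5 data points, 2 basis functions (bias + identity).
x = np.array([0.1, 0.4, 0.5, 0.7, 0.9])
Phi = np.column_stack([np.ones_like(x), x])
t = 0.5 * x - 0.3 + 0.05 * np.random.default_rng(1).normal(size=x.shape)
mN, SN = posterior_params(Phi, t, m0=np.zeros(2), S0=np.eye(2), beta=400.0)
print(mN)   # close to the true weights (-0.3, 0.5)
```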
- 11. Online Learning - Parameter Distribution
If data points arrive sequentially, the design matrix for each update has only one row: \boldsymbol\Phi = \boldsymbol\phi(\mathbf{x}_n)^\top.
Treating the posterior after n−1 points as the prior for the n-th data point (\mathbf{x}_n, t_n), we obtain the formulas for online learning:
  p(\mathbf{w} \mid t_n) = \mathcal{N}(\mathbf{w} \mid \mathbf{m}_n, \mathbf{S}_n)
where
  \mathbf{m}_n = \mathbf{S}_n(\mathbf{S}_{n-1}^{-1}\mathbf{m}_{n-1} + \beta\boldsymbol\phi(\mathbf{x}_n)t_n)
  \mathbf{S}_n^{-1} = \mathbf{S}_{n-1}^{-1} + \beta\boldsymbol\phi(\mathbf{x}_n)\boldsymbol\phi(\mathbf{x}_n)^\top
In addition, after all N points have been presented, this coincides with the batch posterior (3.49)–(3.51). (See the sketch below.)
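A sketch of the online update, where the posterior after n−1 points becomes the prior for point n; the straight-line data model and hyperparameter values are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
beta = 25.0
m, S = np.zeros(2), np.eye(2) / 2.0      # prior N(0, alpha^-1 I), alpha = 2

for _ in range(100):
    x = rng.uniform(-1, 1)
    t = -0.3 + 0.5 * x + rng.normal(scale=0.2)
    phi = np.array([1.0, x])                          # phi(x_n)
    S_inv = np.linalg.inv(S) + beta * np.outer(phi, phi)
    S_new = np.linalg.inv(S_inv)
    m = S_new @ (np.linalg.inv(S) @ m + beta * phi * t)
    S = S_new

print(m)   # converges toward the true weights (-0.3, 0.5)
```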
- 12. Easy Gaussian Prior - Parameter Distribution
If the prior distribution is a zero-mean isotropic Gaussian governed by a single precision parameter \alpha:
  p(\mathbf{w} \mid \alpha) = \mathcal{N}(\mathbf{w} \mid \mathbf{0}, \alpha^{-1}\mathbf{I})   (3.52)
the corresponding posterior distribution is given by:
  p(\mathbf{w} \mid \mathbf{t}) = \mathcal{N}(\mathbf{w} \mid \mathbf{m}_N, \mathbf{S}_N)
where
  \mathbf{m}_N = \beta\mathbf{S}_N\boldsymbol\Phi^\top\mathbf{t}   (3.53)
  \mathbf{S}_N^{-1} = \alpha\mathbf{I} + \beta\boldsymbol\Phi^\top\boldsymbol\Phi   (3.54)
- 13. Relationship with MSSE - Parameter Distribution
The log of the posterior distribution is given by the sum of the log likelihood and the log of the prior. If the prior is given by (3.52), this takes the form:
  \ln p(\mathbf{w} \mid \mathbf{t}) = -\frac{\beta}{2}\sum_{n=1}^{N}\{t_n - \mathbf{w}^\top\boldsymbol\phi(\mathbf{x}_n)\}^2 - \frac{\alpha}{2}\mathbf{w}^\top\mathbf{w} + \mathrm{const}   (3.55)
Therefore the following two are equivalent:
– Maximization of (3.55) with respect to \mathbf{w}
– Minimization of the sum-of-squares error (MSSE) function with the addition of a quadratic regularization term, with coefficient \lambda = \alpha/\beta
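A numeric check of this equivalence: the posterior mean (3.53) coincides with the regularized least-squares solution for \lambda = \alpha/\beta. The toy data and hyperparameter values below are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
alpha, beta = 2.0, 25.0
x = rng.uniform(-1, 1, size=20)
Phi = np.column_stack([np.ones_like(x), x, x**2])
t = -0.3 + 0.5 * x + rng.normal(scale=0.2, size=x.shape)

# Posterior mean (3.53)-(3.54) under the isotropic prior (3.52).
SN = np.linalg.inv(alpha * np.eye(3) + beta * Phi.T @ Phi)
mN = beta * SN @ Phi.T @ t

# Regularized least squares: (lambda I + Phi^T Phi)^-1 Phi^T t.
lam = alpha / beta
w_ridge = np.linalg.solve(lam * np.eye(3) + Phi.T @ Phi, Phi.T @ t)
print(np.allclose(mN, w_ridge))   # True
```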
- 14. Example - Parameter Distribution
Straight-line fitting
– Model function: y(x, \mathbf{w}) = w_0 + w_1 x
– True function: f(x, \mathbf{a}) = a_0 + a_1 x with a_0 = -0.3, a_1 = 0.5
– Error: Gaussian noise with standard deviation 0.2
– Goal: to recover the values of a_0, a_1 from such data
– Prior distribution: (3.52) with \alpha = 2.0
- 15. Generalized Gaussian Prior - Parameter Distribution
We can generalize the Gaussian prior with respect to the exponent q:
  p(\mathbf{w} \mid \alpha) = \left[\frac{q}{2}\left(\frac{\alpha}{2}\right)^{1/q}\frac{1}{\Gamma(1/q)}\right]^{M}\exp\!\left(-\frac{\alpha}{2}\sum_{j}|w_j|^{q}\right)   (3.56)
q = 2 corresponds to the Gaussian, and only in this case is the prior conjugate to the likelihood (3.10).
- 16. Agenda
3.3 Bayesian Linear Regression
– 3.3.1 Parameter distribution
– 3.3.2 Predictive distribution (next)
– 3.3.3 Equivalent kernel
- 17. Predictive Distribution
Let's consider making predictions of t directly for new values of \mathbf{x}. To obtain them, we need to evaluate the predictive distribution:
  p(t \mid \mathbf{t}, \alpha, \beta) = \int p(t \mid \mathbf{w}, \beta)\, p(\mathbf{w} \mid \mathbf{t}, \alpha, \beta)\, \mathrm{d}\mathbf{w}   (3.57)
This is a marginalization over \mathbf{w} (summing out \mathbf{w}).
- 18. Predictive Distribution
The conditional distribution of the target variable is given by:
  p(t \mid \mathbf{x}, \mathbf{w}, \beta) = \mathcal{N}(t \mid y(\mathbf{x}, \mathbf{w}), \beta^{-1})   (3.8)
and the posterior weight distribution is given by (3.49). Accordingly, evaluating (3.57) by using (2.115) gives:
  p(t \mid \mathbf{x}, \mathbf{t}, \alpha, \beta) = \mathcal{N}(t \mid \mathbf{m}_N^\top\boldsymbol\phi(\mathbf{x}), \sigma_N^2(\mathbf{x}))   (3.58)
where
  \sigma_N^2(\mathbf{x}) = \frac{1}{\beta} + \boldsymbol\phi(\mathbf{x})^\top\mathbf{S}_N\boldsymbol\phi(\mathbf{x})   (3.59)
(A sketch follows below.)
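A minimal sketch of the predictive mean and variance (3.58)–(3.59), assuming the isotropic prior (3.52); all numeric values here are made-up illustrations.

```python
import numpy as np

rng = np.random.default_rng(4)
alpha, beta = 2.0, 25.0
x = rng.uniform(-1, 1, size=10)
Phi = np.column_stack([np.ones_like(x), x])        # bias + linear basis
t = -0.3 + 0.5 * x + rng.normal(scale=0.2, size=x.shape)

SN = np.linalg.inv(alpha * np.eye(2) + beta * Phi.T @ Phi)   # (3.54)
mN = beta * SN @ Phi.T @ t                                   # (3.53)

def predictive(x_new):
    """Mean and variance of p(t | x_new, t, alpha, beta)."""
    phi = np.array([1.0, x_new])
    return mN @ phi, 1.0 / beta + phi @ SN @ phi   # (3.58), (3.59)

print(predictive(0.3))
```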
- 19. Predictive Distribution
Now we discuss the variance (3.59) of the predictive distribution:
– The first term 1/\beta represents the additive noise governed by the precision parameter \beta.
– The second term \boldsymbol\phi(\mathbf{x})^\top\mathbf{S}_N\boldsymbol\phi(\mathbf{x}) represents the uncertainty in \mathbf{w} and depends on the mapped vector \boldsymbol\phi(\mathbf{x}) of each data point \mathbf{x}.
– As additional data points are observed, the posterior distribution becomes narrower: \sigma_{N+1}^2(\mathbf{x}) \le \sigma_N^2(\mathbf{x}).
– The second term of (3.59) goes to zero in the limit N \to \infty, so the predictive variance approaches the noise level 1/\beta alone. (See the sketch below.)
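A quick check of the shrinkage claim: because the posterior precision grows by positive-semidefinite rank-one updates, the predictive variance at a fixed query point never increases as data accumulate. The basis, data distribution, and hyperparameter values are assumptions.

```python
import numpy as np

alpha, beta = 2.0, 25.0
rng = np.random.default_rng(5)
phi = lambda x: np.array([1.0, x])
S_inv = alpha * np.eye(2)           # prior precision
q = phi(0.3)                        # fixed query point
prev = np.inf
for _ in range(50):
    p = phi(rng.uniform(-1, 1))
    S_inv = S_inv + beta * np.outer(p, p)
    var = 1.0 / beta + q @ np.linalg.inv(S_inv) @ q
    assert var <= prev + 1e-12      # sigma^2_{N+1}(x) <= sigma^2_N(x)
    prev = var
print(prev)    # approaches the noise floor 1/beta = 0.04
```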
- 21. Example - Predictive Distribution
Gaussian regression on a sine curve
– Basis functions: 9 Gaussian curves
(Figure: mean of the predictive distribution with its one-standard-deviation band; a sketch of this setup follows below.)
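A sketch of the slide's setup: 9 Gaussian basis functions fitted to noisy samples of a sine curve. The basis widths and centers, noise level, and hyperparameter values are assumptions, since the slides do not state them.

```python
import numpy as np

def gauss_basis(x, centers=np.linspace(0, 1, 9), s=0.1):
    """Evaluate 9 Gaussian basis functions at each point in x."""
    return np.exp(-(x[:, None] - centers[None, :])**2 / (2 * s**2))

rng = np.random.default_rng(6)
alpha, beta = 2.0, 25.0
x = rng.uniform(0, 1, size=25)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=x.shape)

Phi = gauss_basis(x)
SN = np.linalg.inv(alpha * np.eye(9) + beta * Phi.T @ Phi)   # (3.54)
mN = beta * SN @ Phi.T @ t                                   # (3.53)

xs = np.linspace(0, 1, 5)
Ps = gauss_basis(xs)
mean = Ps @ mN                                               # (3.58)
std = np.sqrt(1.0 / beta + np.einsum('ij,jk,ik->i', Ps, SN, Ps))  # (3.59)
print(np.c_[xs, mean, std])   # predictive mean and one-sigma band
```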
- 24. Problem of Localized Basis - Predictive Distribution
Which is better: polynomial regression or Gaussian regression?
(Figure: predictive distributions for the two models, side by side.)
- 25. Problem of Localized Basis - Predictive Distribution
If we use localized basis functions such as Gaussians, then in regions away from the basis-function centers the contribution from the second term of (3.59) goes to zero (a large contribution near the centers, a small contribution far from them). Accordingly, the predictive variance reduces to the noise contribution 1/\beta alone, so the model becomes very confident in its predictions exactly where it has seen no data. This is not a good result. (See the sketch below.)
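A sketch demonstrating the pathology: far from the Gaussian centers \boldsymbol\phi(\mathbf{x}) \approx \mathbf{0}, so (3.59) collapses to 1/\beta. The basis layout and hyperparameter values are assumptions.

```python
import numpy as np

centers, s = np.linspace(0, 1, 9), 0.1
phi = lambda x: np.exp(-(x - centers)**2 / (2 * s**2))
alpha, beta = 2.0, 25.0
rng = np.random.default_rng(7)
X = rng.uniform(0, 1, size=25)
Phi = np.stack([phi(x) for x in X])
SN = np.linalg.inv(alpha * np.eye(9) + beta * Phi.T @ Phi)

for x in (0.5, 5.0):    # inside the data region vs. far outside it
    var = 1.0 / beta + phi(x) @ SN @ phi(x)    # (3.59)
    print(x, var)        # at x = 5.0 the variance is essentially 1/beta
```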
- 26. Problem of Localized Basis - Predictive Distribution
This problem (arising from the choice of localized basis functions) can be avoided by adopting an alternative Bayesian approach to regression known as a Gaussian process.
– See §6.4.
- 27. Case of Unknown Precision - Predictive Distribution
If both \mathbf{w} and \beta are treated as unknown, we can introduce a conjugate prior distribution p(\mathbf{w}, \beta), and a corresponding posterior of the same functional form, given by a Gaussian-gamma distribution:
  p(\mathbf{w}, \beta) = \mathcal{N}(\mathbf{w} \mid \mathbf{m}_0, \beta^{-1}\mathbf{S}_0)\,\mathrm{Gam}(\beta \mid a_0, b_0)
The resulting predictive distribution is then a Student's t-distribution.
- 28. Agenda
3.3 Bayesian Linear Regression
– 3.3.1 Parameter distribution
– 3.3.2 Predictive distribution
– 3.3.3 Equivalent kernel (next)
- 29. Equivalent Kernel
If we substitute the posterior mean solution (3.53) into the expression (3.3), the predictive mean can be written:
  y(\mathbf{x}, \mathbf{m}_N) = \mathbf{m}_N^\top\boldsymbol\phi(\mathbf{x}) = \beta\,\boldsymbol\phi(\mathbf{x})^\top\mathbf{S}_N\boldsymbol\Phi^\top\mathbf{t}   (3.60)
This takes the form of a linear combination of the target values t_n:
  y(\mathbf{x}, \mathbf{m}_N) = \sum_{n=1}^{N} \beta\,\boldsymbol\phi(\mathbf{x})^\top\mathbf{S}_N\boldsymbol\phi(\mathbf{x}_n)\, t_n
- 30. Equivalent Kernel
The coefficient of each t_n is given by:
  k(\mathbf{x}, \mathbf{x}') = \beta\,\boldsymbol\phi(\mathbf{x})^\top\mathbf{S}_N\boldsymbol\phi(\mathbf{x}')   (3.62)
so that y(\mathbf{x}, \mathbf{m}_N) = \sum_n k(\mathbf{x}, \mathbf{x}_n)\, t_n. This function is called the smoother matrix or the equivalent kernel.
Regression functions that make predictions by taking linear combinations of the training-set target values are known as linear smoothers.
We can thus predict for a new input vector \mathbf{x} directly with the equivalent kernel, instead of first computing the parameters of the basis functions. (See the sketch below.)
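A sketch of prediction via the equivalent kernel, checking that \sum_n k(\mathbf{x}, \mathbf{x}_n) t_n equals \mathbf{m}_N^\top\boldsymbol\phi(\mathbf{x}); the basis and numeric values are assumptions.

```python
import numpy as np

rng = np.random.default_rng(8)
alpha, beta = 2.0, 25.0
X = rng.uniform(-1, 1, size=15)
t = -0.3 + 0.5 * X + rng.normal(scale=0.2, size=X.shape)
phi = lambda x: np.array([1.0, x, x**2])
Phi = np.stack([phi(x) for x in X])
SN = np.linalg.inv(alpha * np.eye(3) + beta * Phi.T @ Phi)
mN = beta * SN @ Phi.T @ t

x_new = 0.3
k = beta * Phi @ SN @ phi(x_new)            # k(x_new, x_n) for every n, (3.62)
print(np.isclose(k @ t, mN @ phi(x_new)))   # same as m_N^T phi(x): True
```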
- 31. Example 1 - Equivalent Kernel
Equivalent kernel with Gaussian regression.
The equivalent kernel depends both on the set of basis functions and on the data set.
- 32. Equivalent Kernel
The equivalent kernel expresses the contribution of each data point to the predictive mean: points near \mathbf{x} make a large contribution, distant points a small one.
The covariance between y(\mathbf{x}) and y(\mathbf{x}') can also be written in terms of the equivalent kernel:
  \mathrm{cov}[y(\mathbf{x}), y(\mathbf{x}')] = \boldsymbol\phi(\mathbf{x})^\top\mathbf{S}_N\boldsymbol\phi(\mathbf{x}') = \beta^{-1} k(\mathbf{x}, \mathbf{x}')   (3.63)
- 33. Properties of Equivalent Kernel - Equivalent Kernel
The equivalent kernel has a localization property even when the basis functions themselves are not localized (e.g. polynomial or sigmoidal bases).
The equivalent kernel sums to 1 for all \mathbf{x}:
  \sum_{n=1}^{N} k(\mathbf{x}, \mathbf{x}_n) = 1   (3.64)
(Figure: equivalent kernels for polynomial and sigmoidal basis functions; a numeric check follows below.)
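A numeric look at (3.64). With a bias basis function and a very broad prior (small \alpha), the sum is 1 to high accuracy, since the weighted prediction of all-ones targets is then fitted exactly by the bias term; the setup here is an assumption for illustration.

```python
import numpy as np

rng = np.random.default_rng(9)
alpha, beta = 1e-8, 25.0
X = rng.uniform(-1, 1, size=20)
phi = lambda x: np.array([1.0, x, x**2])
Phi = np.stack([phi(x) for x in X])
SN = np.linalg.inv(alpha * np.eye(3) + beta * Phi.T @ Phi)

for x in (-0.5, 0.0, 0.8):
    k = beta * Phi @ SN @ phi(x)    # equivalent kernel weights at x
    print(x, k.sum())               # ~ 1.0 for each x
```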
- 34. Example 2 - Equivalent Kernel
Equivalent kernel with polynomial regression
– Slides 34-36 step a moving parameter through successive values; the figures are not reproduced here.
- 37. Properties of Equivalent Kernel - Equivalent Kernel
The equivalent kernel satisfies an important property shared by kernel functions in general:
– A kernel function can be expressed as an inner product with respect to a vector \boldsymbol\psi(\mathbf{x}) of nonlinear functions:
  k(\mathbf{x}, \mathbf{z}) = \boldsymbol\psi(\mathbf{x})^\top\boldsymbol\psi(\mathbf{z})   (3.65)
– In the case of the equivalent kernel, \boldsymbol\psi(\mathbf{x}) is given by:
  \boldsymbol\psi(\mathbf{x}) = \beta^{1/2}\,\mathbf{S}_N^{1/2}\,\boldsymbol\phi(\mathbf{x})
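A numeric verification of (3.65), computing the symmetric square root of \mathbf{S}_N by eigendecomposition; the data and hyperparameter values are assumptions.

```python
import numpy as np

rng = np.random.default_rng(10)
alpha, beta = 2.0, 25.0
X = rng.uniform(-1, 1, size=15)
phi = lambda x: np.array([1.0, x, x**2])
Phi = np.stack([phi(x) for x in X])
SN = np.linalg.inv(alpha * np.eye(3) + beta * Phi.T @ Phi)

# Symmetric square root of S_N (positive definite, so eigh applies).
w, V = np.linalg.eigh(SN)
SN_half = V @ np.diag(np.sqrt(w)) @ V.T
psi = lambda x: np.sqrt(beta) * SN_half @ phi(x)   # psi(x) = beta^{1/2} S_N^{1/2} phi(x)

x, z = 0.2, -0.7
k = beta * phi(x) @ SN @ phi(z)          # (3.62)
print(np.isclose(k, psi(x) @ psi(z)))    # (3.65) holds: True
```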