Reading: Pattern Recognition and Machine Learning
§3.3 (Bayesian Linear Regression)
Christopher M. Bishop
Introduced by: Yusuke Oda (NAIST)
@odashi_t
Agenda
 3.3 Bayesian Linear Regression
– 3.3.1 Parameter distribution
– 3.3.2 Predictive distribution
– 3.3.3 Equivalent kernel
Bayesian Linear Regression
 Maximum Likelihood (ML)
– The number of basis functions (≃ model complexity) must be controlled according to the size of the data set.
– Adding a regularization term controls the effective model complexity.
– But how should we determine the coefficient of the regularization term?
Bayesian Linear Regression
 Maximum Likelihood (ML)
– Using ML to determine the coefficient of the regularization term
... Bad choice
• This always leads to excessively complex models (= over-fitting): in the case of the previous slide, λ always becomes 0 when ML is used to determine λ.
– Using independent hold-out data to determine model complexity (see §1.3)
... Computationally expensive
... Wasteful of valuable data
Bayesian Linear Regression
 Bayesian treatment of linear regression
– Avoids the over-fitting problem of ML.
– Leads to automatic methods of determining model complexity using the training data alone.
 What do we do?
– Introduce a prior distribution $p(\mathbf{w})$ and a likelihood $p(\mathbf{t}\mid\mathbf{w})$.
• Treat the model parameters $\mathbf{w}$ as random variables rather than fixed values.
– Calculate the posterior distribution using Bayes' theorem: $p(\mathbf{w}\mid\mathbf{t}) \propto p(\mathbf{t}\mid\mathbf{w})\,p(\mathbf{w})$.
Note: Marginal / Conditional Gaussians
 Given a marginal Gaussian distribution for $\mathbf{x}$ and a conditional Gaussian distribution for $\mathbf{y}$ given $\mathbf{x}$:
$p(\mathbf{x}) = \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}, \boldsymbol{\Lambda}^{-1})$ (2.113)
$p(\mathbf{y} \mid \mathbf{x}) = \mathcal{N}(\mathbf{y} \mid \mathbf{A}\mathbf{x} + \mathbf{b}, \mathbf{L}^{-1})$ (2.114)
 Then the marginal distribution of $\mathbf{y}$ and the conditional distribution of $\mathbf{x}$ given $\mathbf{y}$ are:
$p(\mathbf{y}) = \mathcal{N}(\mathbf{y} \mid \mathbf{A}\boldsymbol{\mu} + \mathbf{b}, \mathbf{L}^{-1} + \mathbf{A}\boldsymbol{\Lambda}^{-1}\mathbf{A}^{\top})$ (2.115)
$p(\mathbf{x} \mid \mathbf{y}) = \mathcal{N}(\mathbf{x} \mid \boldsymbol{\Sigma}\{\mathbf{A}^{\top}\mathbf{L}(\mathbf{y} - \mathbf{b}) + \boldsymbol{\Lambda}\boldsymbol{\mu}\}, \boldsymbol{\Sigma})$ (2.116)
where $\boldsymbol{\Sigma} = (\boldsymbol{\Lambda} + \mathbf{A}^{\top}\mathbf{L}\mathbf{A})^{-1}$ (2.117)
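The marginal formula (2.115) is easy to verify numerically. The following sketch draws samples from the linear-Gaussian model and compares the empirical moments of $\mathbf{y}$ with the analytic ones; all numbers are arbitrary, for illustration only:

```python
# Monte Carlo sanity check of the marginal formula (2.115).
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([1.0, -1.0])
Lam = np.array([[2.0, 0.3], [0.3, 1.0]])    # precision of p(x)
A = np.array([[1.0, 2.0]])
b = np.array([0.5])
L = np.array([[4.0]])                       # precision of p(y|x)

n = 200_000
x = rng.multivariate_normal(mu, np.linalg.inv(Lam), size=n)
noise = rng.multivariate_normal(np.zeros(1), np.linalg.inv(L), size=n)
y = x @ A.T + b + noise

print(y.mean(axis=0), A @ mu + b)           # empirical vs. analytic mean
print(np.var(y), (np.linalg.inv(L) + A @ np.linalg.inv(Lam) @ A.T)[0, 0])
```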
Parameter Distribution
 Remember the likelihood function given by §3.1.1 (with the noise precision $\beta$ treated as a known parameter):
$p(\mathbf{t} \mid \mathbf{X}, \mathbf{w}, \beta) = \prod_{n=1}^{N} \mathcal{N}(t_n \mid \mathbf{w}^{\top}\boldsymbol{\phi}(\mathbf{x}_n), \beta^{-1})$ (3.10)
– This is the exponential of a quadratic function of $\mathbf{w}$.
 The corresponding conjugate prior is therefore a Gaussian distribution:
$p(\mathbf{w}) = \mathcal{N}(\mathbf{w} \mid \mathbf{m}_0, \mathbf{S}_0)$ (3.48)
Parameter Distribution
 Now, given the likelihood (3.10) and the prior (3.48),
 the posterior distribution is obtained by using (2.116):
$p(\mathbf{w} \mid \mathbf{t}) = \mathcal{N}(\mathbf{w} \mid \mathbf{m}_N, \mathbf{S}_N)$ (3.49)
where
$\mathbf{m}_N = \mathbf{S}_N (\mathbf{S}_0^{-1}\mathbf{m}_0 + \beta \boldsymbol{\Phi}^{\top}\mathbf{t})$ (3.50)
$\mathbf{S}_N^{-1} = \mathbf{S}_0^{-1} + \beta \boldsymbol{\Phi}^{\top}\boldsymbol{\Phi}$ (3.51)
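As a concrete reference, here is a minimal numpy sketch of the batch update (3.49)–(3.51); the function and variable names are my own, not from the slides:

```python
import numpy as np

def posterior(Phi, t, m0, S0, beta):
    """Posterior p(w | t) = N(w | m_N, S_N) for design matrix Phi and targets t."""
    S0_inv = np.linalg.inv(S0)
    S_N = np.linalg.inv(S0_inv + beta * Phi.T @ Phi)   # (3.51): S_N^{-1} = S_0^{-1} + beta Phi^T Phi
    m_N = S_N @ (S0_inv @ m0 + beta * Phi.T @ t)       # (3.50)
    return m_N, S_N
```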
Online Learning – Parameter Distribution
 If data points arrive sequentially, the design matrix for each update has only one row.
 Treating the posterior after $N-1$ points as the prior for the $N$-th data point $(\mathbf{x}_N, t_N)$, we obtain the update formula for online learning:
$\mathbf{m}_N = \mathbf{S}_N (\mathbf{S}_{N-1}^{-1}\mathbf{m}_{N-1} + \beta \boldsymbol{\phi}_N t_N)$
$\mathbf{S}_N^{-1} = \mathbf{S}_{N-1}^{-1} + \beta \boldsymbol{\phi}_N \boldsymbol{\phi}_N^{\top}$
where $\boldsymbol{\phi}_N = \boldsymbol{\phi}(\mathbf{x}_N)$.
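This update is straightforward to implement: the posterior after $N-1$ points simply becomes the prior for point $N$. A minimal sketch, where the hyperparameter values and toy data are my own assumptions:

```python
import numpy as np

def online_update(m, S, phi_n, t_n, beta):
    """One sequential Bayesian update of p(w | t) = N(w | m, S)."""
    S_inv = np.linalg.inv(S)
    S_new = np.linalg.inv(S_inv + beta * np.outer(phi_n, phi_n))  # S_N^{-1} = S_{N-1}^{-1} + beta phi phi^T
    m_new = S_new @ (S_inv @ m + beta * phi_n * t_n)              # m_N = S_N (S_{N-1}^{-1} m_{N-1} + beta phi t)
    return m_new, S_new

alpha, beta = 2.0, 25.0                     # assumed hyperparameters
m, S = np.zeros(2), np.eye(2) / alpha       # isotropic prior (3.52): N(w | 0, alpha^{-1} I)
for x_n, t_n in [(0.3, -0.2), (0.8, 0.1)]:  # toy data points arriving one by one
    phi_n = np.array([1.0, x_n])            # straight-line basis [1, x]
    m, S = online_update(m, S, phi_n, t_n, beta)
```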
Simple Gaussian Prior – Parameter Distribution
 If the prior distribution is a zero-mean isotropic Gaussian governed by a single precision parameter $\alpha$:
$p(\mathbf{w} \mid \alpha) = \mathcal{N}(\mathbf{w} \mid \mathbf{0}, \alpha^{-1}\mathbf{I})$ (3.52)
 then the corresponding posterior is given by (3.49) with
$\mathbf{m}_N = \beta \mathbf{S}_N \boldsymbol{\Phi}^{\top}\mathbf{t}$ (3.53)
$\mathbf{S}_N^{-1} = \alpha \mathbf{I} + \beta \boldsymbol{\Phi}^{\top}\boldsymbol{\Phi}$ (3.54)
Relationship with MSSE – Parameter Distribution
 The log of the posterior distribution is given by:
$\ln p(\mathbf{w} \mid \mathbf{t}) = -\frac{\beta}{2} \sum_{n=1}^{N} \{t_n - \mathbf{w}^{\top}\boldsymbol{\phi}(\mathbf{x}_n)\}^2 - \frac{\alpha}{2}\mathbf{w}^{\top}\mathbf{w} + \text{const}$ (3.55)
 If the prior is given by (3.52), the following two are therefore equivalent:
– Maximization of (3.55) with respect to $\mathbf{w}$
– Minimization of the sum-of-squares error (MSSE) function with the addition of a quadratic regularization term, with $\lambda = \alpha/\beta$
Example – Parameter Distribution
 Straight-line fitting
– Model function: $y(x, \mathbf{w}) = w_0 + w_1 x$
– True function: $f(x, \mathbf{a}) = a_0 + a_1 x$ (in the book's example, $a_0 = -0.3$, $a_1 = 0.5$)
– Error: Gaussian noise added to the targets
– Goal: to recover the values of $a_0, a_1$ from such data
– Prior distribution: the isotropic Gaussian (3.52)
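A self-contained sketch of this experiment, using the parameter values from the book's Figure 3.7 setup ($a_0 = -0.3$, $a_1 = 0.5$, noise standard deviation $0.2$, $\alpha = 2.0$); the seed and data-set size are my own choices:

```python
import numpy as np

rng = np.random.default_rng(0)
a0, a1, sigma = -0.3, 0.5, 0.2              # true parameters and noise level
alpha, beta = 2.0, 1.0 / sigma**2           # prior precision and noise precision

x = rng.uniform(-1, 1, size=20)
t = a0 + a1 * x + rng.normal(0.0, sigma, size=x.shape)        # noisy targets

Phi = np.column_stack([np.ones_like(x), x])                   # phi(x) = [1, x]
S_N = np.linalg.inv(alpha * np.eye(2) + beta * Phi.T @ Phi)   # (3.54)
m_N = beta * S_N @ Phi.T @ t                                  # (3.53)
print(m_N)   # posterior mean approaches (a0, a1) as the data set grows
```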
Generalized Gaussian Prior – Parameter Distribution
 We can generalize the Gaussian prior with respect to the exponent $q$:
$p(\mathbf{w} \mid \alpha) = \left[ \frac{q}{2} \left( \frac{\alpha}{2} \right)^{1/q} \frac{1}{\Gamma(1/q)} \right]^{M} \exp\left( -\frac{\alpha}{2} \sum_{j=1}^{M} |w_j|^q \right)$ (3.56)
 $q = 2$ corresponds to the Gaussian, and only in this case is the prior conjugate to the likelihood (3.10).
Predictive Distribution
 Let's consider making predictions of $t$ directly for new values of $\mathbf{x}$.
 In order to do so, we need to evaluate the predictive distribution:
$p(t \mid \mathbf{t}, \alpha, \beta) = \int p(t \mid \mathbf{w}, \beta)\, p(\mathbf{w} \mid \mathbf{t}, \alpha, \beta)\, d\mathbf{w}$ (3.57)
 This is the marginalization over $\mathbf{w}$ (summing out $\mathbf{w}$).
Predictive Distribution
 The conditional distribution of the target variable is given by:
$p(t \mid \mathbf{x}, \mathbf{w}, \beta) = \mathcal{N}(t \mid y(\mathbf{x}, \mathbf{w}), \beta^{-1})$ (3.8)
 and the posterior weight distribution is given by (3.49).
 Accordingly, the result of (3.57) is obtained by using (2.115):
$p(t \mid \mathbf{x}, \mathbf{t}, \alpha, \beta) = \mathcal{N}(t \mid \mathbf{m}_N^{\top}\boldsymbol{\phi}(\mathbf{x}), \sigma_N^2(\mathbf{x}))$ (3.58)
where
$\sigma_N^2(\mathbf{x}) = \frac{1}{\beta} + \boldsymbol{\phi}(\mathbf{x})^{\top}\mathbf{S}_N \boldsymbol{\phi}(\mathbf{x})$ (3.59)
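A minimal sketch of (3.58)–(3.59), assuming $\mathbf{m}_N$, $\mathbf{S}_N$, and $\beta$ have been computed as on the previous slides, and that a basis map is supplied by the caller:

```python
import numpy as np

def predictive(phi_x, m_N, S_N, beta):
    """Mean and variance of p(t | x, t) = N(t | m_N^T phi(x), sigma_N^2(x))."""
    mean = m_N @ phi_x                       # m_N^T phi(x)
    var = 1.0 / beta + phi_x @ S_N @ phi_x   # (3.59)
    return mean, var

# usage with the straight-line posterior from the earlier sketch:
# mean, var = predictive(np.array([1.0, 0.5]), m_N, S_N, beta)
```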
Predictive Distribution
 Now we discuss the variance (3.59) of the predictive distribution:
– The 1st term $1/\beta$ represents the additive noise governed by the precision parameter $\beta$.
– The 2nd term depends on the mapping vector $\boldsymbol{\phi}(\mathbf{x})$ of each data point $\mathbf{x}$.
– As additional data points are observed, the posterior distribution becomes narrower: $\sigma_{N+1}^2(\mathbf{x}) \le \sigma_N^2(\mathbf{x})$.
– The 2nd term of (3.59) goes to zero in the limit $N \to \infty$, so the predictive variance approaches the noise variance $1/\beta$ alone.
Example – Predictive Distribution
 Gaussian regression with a sine curve
– Basis functions: 9 Gaussian curves
(Figures: mean and standard deviation of the predictive distribution as the data set grows)
Problem of Localized Basis – Predictive Distribution
 Polynomial regression vs. Gaussian regression: which is better?
(Figures: predictive distributions with polynomial and Gaussian basis functions)
Problem of Localized Basis – Predictive Distribution
 If we use localized basis functions such as Gaussians, then in regions away from the basis function centers the contribution from the 2nd term of (3.59) goes to zero.
 Accordingly, the predictive variance becomes only the noise contribution $1/\beta$: the model becomes very confident exactly where it has seen no data, which is not a good result.
(Figure labels: large contribution near the basis centers, small contribution away from them)
Problem of Localized Basis – Predictive Distribution
 This problem (arising from choosing localized basis functions) can be avoided by adopting an alternative Bayesian approach to regression known as a Gaussian process.
– See §6.4.
Case of Unknown Precision – Predictive Distribution
 If both $\mathbf{w}$ and $\beta$ are treated as unknown, then we can introduce a conjugate prior distribution, and the corresponding posterior distribution is a Gaussian-gamma distribution:
$p(\mathbf{w}, \beta) = \mathcal{N}(\mathbf{w} \mid \mathbf{m}_0, \beta^{-1}\mathbf{S}_0)\, \mathrm{Gam}(\beta \mid a_0, b_0)$
 The resulting predictive distribution is then a Student's t-distribution.
Equivalent Kernel
 If we substitute the posterior mean solution (3.53) into the expression (3.3), the predictive mean can be written as:
$y(\mathbf{x}, \mathbf{m}_N) = \mathbf{m}_N^{\top}\boldsymbol{\phi}(\mathbf{x}) = \beta \boldsymbol{\phi}(\mathbf{x})^{\top}\mathbf{S}_N \boldsymbol{\Phi}^{\top}\mathbf{t}$ (3.60)
 This formula takes the form of a linear combination of the target values $t_n$:
$y(\mathbf{x}, \mathbf{m}_N) = \sum_{n=1}^{N} k(\mathbf{x}, \mathbf{x}_n)\, t_n$
Equivalent Kernel
 where the coefficient of each $t_n$ is given by:
$k(\mathbf{x}, \mathbf{x}') = \beta \boldsymbol{\phi}(\mathbf{x})^{\top}\mathbf{S}_N \boldsymbol{\phi}(\mathbf{x}')$ (3.61)
 This function is called the smoother matrix or the equivalent kernel.
 Regression functions which make predictions by taking linear combinations of the training set target values are known as linear smoothers.
 We can also predict $t$ for a new input vector $\mathbf{x}$ using the equivalent kernel directly, instead of calculating the parameters of the basis functions.
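The following sketch illustrates (3.60)–(3.61) numerically: the linear-smoother prediction through the equivalent kernel matches the direct prediction $\mathbf{m}_N^{\top}\boldsymbol{\phi}(\mathbf{x})$. The basis, hyperparameters, and toy data are my own assumptions:

```python
import numpy as np

def phi(x):
    return np.array([1.0, x, x**2])          # assumed polynomial basis

alpha, beta = 2.0, 25.0                      # assumed hyperparameters
x_train = np.linspace(-1, 1, 10)
t_train = np.sin(np.pi * x_train)            # toy targets
Phi = np.stack([phi(x) for x in x_train])

S_N = np.linalg.inv(alpha * np.eye(3) + beta * Phi.T @ Phi)   # (3.54)
m_N = beta * S_N @ Phi.T @ t_train                            # (3.53)

x_new = 0.3
k = beta * phi(x_new) @ S_N @ Phi.T          # k(x, x_n) for all training points (3.61)
y_kernel = k @ t_train                       # linear-smoother prediction
y_direct = m_N @ phi(x_new)                  # direct prediction m_N^T phi(x)
print(np.isclose(y_kernel, y_direct))        # True: the two forms agree
print(k.sum())                               # approximately 1 inside the data region (cf. 3.64)
```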
Example 1 – Equivalent Kernel
 Equivalent kernel with Gaussian regression
 The equivalent kernel depends on the set of basis functions and on the data set.
Equivalent Kernel
 The equivalent kernel expresses the contribution of each data point to the predictive mean: data points close to $\mathbf{x}$ contribute more than distant ones.
 The covariance between $y(\mathbf{x})$ and $y(\mathbf{x}')$ can also be expressed with the equivalent kernel:
$\operatorname{cov}[y(\mathbf{x}), y(\mathbf{x}')] = \beta^{-1} k(\mathbf{x}, \mathbf{x}')$ (3.62)
Properties of Equivalent Kernel – Equivalent Kernel
 The equivalent kernel has a localization property even when the basis functions themselves are not localized (figures: polynomial and sigmoidal basis functions).
 The equivalent kernel sums to one for all $\mathbf{x}$:
$\sum_{n=1}^{N} k(\mathbf{x}, \mathbf{x}_n) = 1$ (3.64)
Example 2 – Equivalent Kernel
 Equivalent kernel with polynomial regression
– Moving parameter: (figures for several parameter values)
Properties of Equivalent Kernel – Equivalent Kernel
 The equivalent kernel satisfies an important property shared by kernel functions in general:
– A kernel function can be expressed as an inner product with respect to a vector $\boldsymbol{\psi}(\mathbf{x})$ of nonlinear functions:
$k(\mathbf{x}, \mathbf{z}) = \boldsymbol{\psi}(\mathbf{x})^{\top}\boldsymbol{\psi}(\mathbf{z})$
– In the case of the equivalent kernel, $\boldsymbol{\psi}(\mathbf{x})$ is given by:
$\boldsymbol{\psi}(\mathbf{x}) = \beta^{1/2}\mathbf{S}_N^{1/2}\boldsymbol{\phi}(\mathbf{x})$
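A quick numeric check of this inner-product form, reusing the polynomial-basis setup of the earlier sketch; the symmetric square root of $\mathbf{S}_N$ is taken via an eigendecomposition:

```python
import numpy as np

def phi(x):
    return np.array([1.0, x, x**2])          # assumed polynomial basis

alpha, beta = 2.0, 25.0
Phi = np.stack([phi(x) for x in np.linspace(-1, 1, 10)])
S_N = np.linalg.inv(alpha * np.eye(3) + beta * Phi.T @ Phi)

w, V = np.linalg.eigh(S_N)                   # S_N is symmetric positive definite
S_half = V @ np.diag(np.sqrt(w)) @ V.T       # S_N^{1/2}

def psi(x):
    return np.sqrt(beta) * S_half @ phi(x)   # psi(x) = beta^{1/2} S_N^{1/2} phi(x)

x, z = 0.2, -0.5
print(np.isclose(beta * phi(x) @ S_N @ phi(z),   # k(x, z) from (3.61)
                 psi(x) @ psi(z)))               # inner-product form: True
```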
Thank you!
zzz...