PRML 2.4-2.5 Exponential Family & Nonparametric Methods

PRML 2.4-2.5

The exponential family
&
Nonparametric methods

June 11, 2014
by Shinichi TAMURA

NONPARAMETRIC METHODS
THE EXPONENTIAL FAMILY

Today's topics

1. The exponential family

1.  What is the exponential family?

2.  Maximum likelihood for EF

3.  How to decide priors for EF

2. Nonparametric methods

1.  What is the point of nonparametric methods?

2.  Kernel density estimator

3.  Nearest-neighbour methods

June 11, 2014
PRML 2.4-2.5
Shinichi TAMURA


The Exponential Family
Almost all of the distributions we studied so far belong
to a single class, namely the exponential family.

Parametric distributions

Members of the exponential family:
Bernoulli, multinomial, Gaussian, beta, gamma, von Mises, etc.

Not members of the exponential family:
Gaussian mixture, etc.


The exponential family over x, given the parameter η, is the class of distributions of the form

$$p(x|\eta) = h(x)\, g(\eta) \exp\{\eta^T u(x)\}$$

•  η: natural parameter
•  g(η): normalizing constant
•  $\eta^T u(x)$: the term where x and η meet


E.g. 1) The Bernoulli Distribution

$$p(x|\eta) = \mu^x (1-\mu)^{1-x} = \sigma(-\eta) \exp(\eta x)$$

where

$$\eta = \ln\frac{\mu}{1-\mu}, \qquad u(x) = x, \qquad h(x) = 1, \qquad g(\eta) = \sigma(-\eta).$$
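A minimal sketch that checks the Bernoulli identity numerically. The function names and the value µ = 0.3 are illustrative choices, not part of the slides.

```python
import math

def sigmoid(a):
    """Logistic sigmoid sigma(a) = 1 / (1 + exp(-a))."""
    return 1.0 / (1.0 + math.exp(-a))

def bernoulli_standard(x, mu):
    """Usual parameterization: mu^x (1 - mu)^(1 - x)."""
    return mu ** x * (1.0 - mu) ** (1 - x)

def bernoulli_expfam(x, eta):
    """Exponential-family form with h(x) = 1, g(eta) = sigma(-eta), u(x) = x."""
    return sigmoid(-eta) * math.exp(eta * x)

mu = 0.3                          # illustrative value
eta = math.log(mu / (1.0 - mu))   # natural parameter
for x in (0, 1):
    print(x, bernoulli_standard(x, mu), bernoulli_expfam(x, eta))
```

Both columns should agree up to floating-point error, since σ(−η) = 1 − µ.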


E.g. 2) The Multinomial Distribution

$$p(x|\eta) = \prod_{k=1}^{M} \mu_k^{x_k} = \exp(\eta^T x)$$

where

$$\eta = (\ln\mu_1, \ldots, \ln\mu_M)^T, \qquad u(x) = x, \qquad h(x) = 1, \qquad g(\eta) = 1,$$

$$\Rightarrow\ \sum_{k=1}^{M} \exp(\eta_k) = \sum_{k=1}^{M} \mu_k = 1.$$

The components of η are tied together by this constraint. It's inconvenient!



Remove the constraint by

$$\mu_M = 1 - \sum_{k=1}^{M-1}\mu_k, \qquad x_M = 1 - \sum_{k=1}^{M-1} x_k$$

$$p(x|\mu) = \exp\left\{ \sum_{k=1}^{M-1} x_k \ln\mu_k + \left(1 - \sum_{k=1}^{M-1} x_k\right) \ln\left(1 - \sum_{k=1}^{M-1}\mu_k\right) \right\}$$

$$= \exp\left\{ \sum_{k=1}^{M-1} x_k \ln\frac{\mu_k}{1 - \sum_{j=1}^{M-1}\mu_j} + \ln\left(1 - \sum_{k=1}^{M-1}\mu_k\right) \right\}$$

$$= \left(1 - \sum_{k=1}^{M-1}\mu_k\right) \exp\left\{ \sum_{k=1}^{M-1} x_k \ln\frac{\mu_k}{1 - \sum_{j=1}^{M-1}\mu_j} \right\}.$$

Therefore...


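E.g. 2') The Multinomial Distribution

w/o constraint

$$p(x|\eta) = \prod_{k=1}^{M} \mu_k^{x_k} = \left(1 + \sum_{k=1}^{M-1}\exp(\eta_k)\right)^{-1} \exp(\eta^T x)$$

where

$$\eta = \left(\ln\frac{\mu_1}{1 - \sum_j \mu_j}, \ldots, \ln\frac{\mu_{M-1}}{1 - \sum_j \mu_j}, 0\right)^T \quad (j = 1, \ldots, M-1),$$

$$u(x) = x, \qquad h(x) = 1, \qquad g(\eta) = \left(1 + \sum_{k=1}^{M-1}\exp(\eta_k)\right)^{-1}.$$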
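The unconstrained form can be sanity-checked with a short sketch: evaluating it on each one-hot vector should return the original probabilities. The helper names and the example µ are assumptions made for illustration.

```python
import numpy as np

def natural_params(mu):
    """eta_k = ln(mu_k / (1 - sum_{j<M} mu_j)) for k < M, and eta_M = 0."""
    mu = np.asarray(mu, dtype=float)
    rest = 1.0 - mu[:-1].sum()          # this equals mu_M
    eta = np.zeros_like(mu)
    eta[:-1] = np.log(mu[:-1] / rest)
    return eta

def g(eta):
    """Normalizer g(eta) = (1 + sum_{k<M} exp(eta_k))^(-1)."""
    return 1.0 / (1.0 + np.exp(eta[:-1]).sum())

def multinomial_expfam(x, eta):
    """p(x|eta) = g(eta) exp(eta^T x) for a one-hot vector x."""
    return g(eta) * np.exp(eta @ np.asarray(x, dtype=float))

mu = np.array([0.2, 0.5, 0.3])          # illustrative probabilities
eta = natural_params(mu)
probs = [multinomial_expfam(np.eye(3)[k], eta) for k in range(3)]
print(probs)
```

Inverting the map gives µ_k = exp(η_k) / (1 + Σ_j exp(η_j)), which is the softmax function.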


E.g. 3) The Gaussian Distribution

$$p(x|\eta) = \frac{1}{(2\pi\sigma^2)^{1/2}} \exp\left\{-\frac{1}{2\sigma^2}(x-\mu)^2\right\}$$

$$= (2\pi)^{-1/2}\, (-2\eta_2)^{1/2} \exp\left(\frac{\eta_1^2}{4\eta_2}\right) \exp\left\{ (\eta_1, \eta_2) \begin{pmatrix} x \\ x^2 \end{pmatrix} \right\}$$

where

$$\eta = \left(\frac{\mu}{\sigma^2},\ -\frac{1}{2\sigma^2}\right)^T, \qquad u(x) = (x, x^2)^T,$$

$$h(x) = (2\pi)^{-1/2}, \qquad g(\eta) = (-2\eta_2)^{1/2} \exp\left(\frac{\eta_1^2}{4\eta_2}\right).$$
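As a rough check of the Gaussian case, the sketch below compares the usual density with the natural-parameter form, taking h(x) = (2π)^(-1/2) and g(η) = (−2η₂)^(1/2) exp(η₁²/(4η₂)). The test values of µ and σ² are arbitrary.

```python
import math

def gaussian_standard(x, mu, sigma2):
    """N(x | mu, sigma^2) in the usual parameterization."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma2)) / math.sqrt(2 * math.pi * sigma2)

def gaussian_expfam(x, eta1, eta2):
    """h(x) g(eta) exp(eta^T u(x)) with u(x) = (x, x^2)."""
    h = (2 * math.pi) ** -0.5
    g = math.sqrt(-2 * eta2) * math.exp(eta1 ** 2 / (4 * eta2))
    return h * g * math.exp(eta1 * x + eta2 * x * x)

mu, sigma2 = 1.5, 0.8                       # arbitrary test values
eta1, eta2 = mu / sigma2, -1.0 / (2 * sigma2)
for x in (-1.0, 0.3, 2.0):
    print(x, gaussian_standard(x, mu, sigma2), gaussian_expfam(x, eta1, eta2))
```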


Maximum likelihood for EF

OK, we know what EF looks like.

How, then, do we estimate the parameter?

Maximize the likelihood!

This is the frequentist way.



Suppose we have i.i.d. data $X = \{x_1, \ldots, x_N\}$.

The log-likelihood of η is

$$\ln p(X|\eta) = \ln \prod_{n=1}^{N} p(x_n|\eta)$$

$$= \ln \prod_{n=1}^{N} h(x_n)\, g(\eta) \exp\{\eta^T u(x_n)\}$$

$$= \sum_{n=1}^{N} \ln h(x_n) + N \ln g(\eta) + \eta^T \sum_{n=1}^{N} u(x_n).$$

$$\therefore\ \nabla_\eta \ln p(X|\eta) = N \nabla_\eta \ln g(\eta) + \sum_{n=1}^{N} u(x_n).$$

We set this gradient to zero.



Therefore

$$-\nabla_\eta \ln g(\eta_{ML}) = \frac{1}{N} \sum_{n=1}^{N} u(x_n).$$

Here, $\eta_{ML}$ depends on the data only through $\sum_n u(x_n)$, so this quantity is called the “sufficient statistic”.

We need to store only $\sum_n u(x_n)$ for estimation.



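E.g.) Gaussian distribution

With $g(\eta) = (-2\eta_2)^{1/2} \exp\left(\eta_1^2 / 4\eta_2\right)$ and $u(x) = (x, x^2)^T$,

$$-\nabla \ln g(\eta) = \begin{pmatrix} -\dfrac{\eta_1}{2\eta_2} \\[2mm] -\dfrac{1}{2\eta_2} + \dfrac{\eta_1^2}{4\eta_2^2} \end{pmatrix} = \begin{pmatrix} \mu \\ \sigma^2 + \mu^2 \end{pmatrix}.$$

$$\therefore\ \mu_{ML} = \frac{1}{N}\sum_n x_n, \qquad \sigma^2_{ML} = \frac{1}{N}\sum_n x_n^2 - \left(\frac{1}{N}\sum_n x_n\right)^2.$$

That's what we already know.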
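To illustrate why only the sufficient statistics need to be stored, here is a small sketch that accumulates Σx and Σx² in a single pass and recovers the maximum likelihood estimates from them. The synthetic data and variable names are assumptions for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=2.0, scale=1.5, size=1000)   # synthetic sample

# Accumulate sum_n u(x_n) = (sum x_n, sum x_n^2) one point at a time.
# Nothing else about the data has to be kept.
s1, s2, n = 0.0, 0.0, 0
for x in data:
    s1 += x
    s2 += x * x
    n += 1

mu_ml = s1 / n
sigma2_ml = s2 / n - mu_ml ** 2

print(mu_ml, data.mean())       # the two should agree
print(sigma2_ml, data.var())    # np.var uses the 1/N (ML) normalization by default
```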



By the way, we want to know the relation between the true η and $\eta_{ML}$.



Taking the gradient of the normalization condition

$$\int h(x)\, g(\eta) \exp\{\eta^T u(x)\}\, dx = 1$$

with respect to η gives

$$\nabla g(\eta) \int h(x) \exp\{\eta^T u(x)\}\, dx + g(\eta) \int h(x) \exp\{\eta^T u(x)\}\, u(x)\, dx = 0$$

$$\Leftrightarrow\ -\nabla \ln g(\eta) = E[u(x)].$$

This is similar to

$$-\nabla_\eta \ln g(\eta_{ML}) = \frac{1}{N} \sum_{n=1}^{N} u(x_n).$$



By the law of large numbers (LLN), the sample mean converges to the expectation, so $\eta_{ML}$ converges to the true η:

$$\frac{1}{N}\sum_{n=1}^{N} u(x_n) \to E[u(x)] \quad\Longrightarrow\quad -\nabla \ln g(\eta_{ML}) \to -\nabla \ln g(\eta) \quad\Longrightarrow\quad \eta_{ML} \to \eta.$$


Priors for EF

If you want to use Bayesian inference, a prior distribution is needed.

How should we choose it if we know nothing about the parameter?


Priors for EF

Three candidates:

1. Conjugate priors ... Easy to handle

2. Uniform distributions ... Principle of indifference

3. Noninformative priors ... Keep the effect of the prior small


Priors for EF – Conjugate priors




Distributions in the EF have the factor $g(\eta)\exp(\eta^T u)$, so the conjugate prior is

$$p(\eta|\chi, \nu) = f(\chi, \nu)\, \left\{ g(\eta) \exp(\eta^T \chi) \right\}^{\nu} = f(\chi, \nu)\, g(\eta)^{\nu} \exp\{\nu\, \eta^T \chi\}.$$

•  $g(\eta)^{\nu} \exp\{\nu\, \eta^T \chi\}$ corresponds to the factor $g(\eta)\exp(\eta^T u)$ of the likelihood.
•  $f(\chi, \nu)$ is the normalizing constant.

It gives the following posterior:

$$p(\eta|X, \chi, \nu) \propto \left[ \prod_{n=1}^{N} h(x_n)\, g(\eta) \exp\{\eta^T u(x_n)\} \right] \times g(\eta)^{\nu} \exp\{\nu\, \eta^T \chi\}$$

$$\propto g(\eta)^{N+\nu} \exp\left\{ \eta^T \left( \sum_{n=1}^{N} u(x_n) + \nu\chi \right) \right\},$$

which has the same form as the prior.
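The conjugate update can be sketched as plain bookkeeping on the hyperparameters: matching the posterior to the prior form gives ν' = ν + N and ν'χ' = νχ + Σ u(x_n). The Bernoulli-to-beta mapping below, a = νχ and b = ν(1 − χ), is an assumption worked out for this illustration (it includes the Jacobian of the change of variables from η to µ) and does not appear on the slides.

```python
def posterior_hyperparams(nu, chi, u_values):
    """Return (nu', chi') such that the posterior is g(eta)^nu' exp(nu' * eta * chi')."""
    n = len(u_values)
    nu_post = nu + n
    chi_post = (nu * chi + sum(u_values)) / nu_post
    return nu_post, chi_post

def to_beta(nu, chi):
    """Bernoulli case: express the prior over mu as Beta(a, b).

    g(eta)^nu exp(nu*chi*eta) = mu^(nu*chi) (1-mu)^(nu*(1-chi)); multiplying by
    |d eta / d mu| = 1 / (mu (1-mu)) gives Beta(nu*chi, nu*(1-chi)).
    """
    return nu * chi, nu * (1.0 - chi)

nu0, chi0 = 4.0, 0.5            # corresponds to Beta(2, 2); illustrative values
xs = [1, 0, 1, 1, 0, 1]         # Bernoulli observations, u(x) = x
nu_n, chi_n = posterior_hyperparams(nu0, chi0, xs)
a_n, b_n = to_beta(nu_n, chi_n)
print(nu_n, chi_n, a_n, b_n)
```

Under these assumptions the result agrees with the familiar beta update: four ones and two zeros take Beta(2, 2) to Beta(6, 4).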


Priors for EF – Uniform distributions




The uniform distribution is a common choice for a discrete, bounded variable.

Cf. the principle of insufficient reason (or principle of indifference)

But two problems arise when it is applied to continuous variables:

1.  The normalization problem

2.  The transformation problem



1. Normalization problem

If the parameter is unbounded, a constant prior cannot be normalized:

$$\int_{-\infty}^{\infty} p(\lambda)\, d\lambda = \int_{-\infty}^{\infty} \mathrm{const}\ d\lambda \to \infty$$

Such priors are called “improper”.

Note that these priors can still give proper posteriors, because the posterior is then proportional to the likelihood, which can be normalized.



2. Transformation problem

A non-linear transformation of the variable turns a constant prior into a non-constant one.

E.g.) With $p(\lambda) = 1$ and $\eta = \sqrt{\lambda}$,

$$p(\eta) = p(\lambda) \left| \frac{d\lambda}{d\eta} \right| = 2\eta,$$

which is not constant in η.

Think "constant for what?"

(Sometimes, the posteriors are not sensitive to the difference.)
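A quick Monte Carlo sketch of the transformation problem: draw λ uniformly on (0, 1), set η = √λ, and compare a histogram of η with the predicted density 2η. The sample size, bin count, and seed are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
lam = rng.uniform(0.0, 1.0, size=200_000)   # p(lambda) = 1 on (0, 1)
eta = np.sqrt(lam)                          # non-linear change of variable

# Empirical density of eta on 10 equal bins, compared with 2 * eta at the bin centres.
density, edges = np.histogram(eta, bins=10, range=(0.0, 1.0), density=True)
centers = 0.5 * (edges[:-1] + edges[1:])
max_err = np.max(np.abs(density - 2.0 * centers))

for c, d in zip(centers, density):
    print(f"eta = {c:.2f}: empirical {d:.3f}, predicted {2 * c:.3f}")
```

The empirical density rises linearly from near 0 to near 2, so a prior that is flat in λ is clearly not flat in η.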
Keep these problems in mind:

1.  The normalization problem

2.  The transformation problem


Priors for EF – Noninformative priors

Two examples of noninformative priors:

1. Priors for location parameters

2. Priors for scale parameters

These are constructed to influence the posterior as little as possible, so that the inference is as objective as possible.



1. Priors for location parameters

If the density has the form

$$p(x|\mu) = f(x - \mu),$$

a constant shift $\hat{x} = x + c$ gives a density of the same form, with $\hat{\mu} = \mu + c$:

$$p(\hat{x}|\hat{\mu}) = f(\hat{x} - \hat{\mu}).$$

This property is “translation invariance”, and such a parameter is a “location parameter”.



To reflect the translation invariance, the prior should satisfy

$$\int_B^A p(\mu)\, d\mu = \int_B^A p(\mu - c)\, d\mu \quad \text{for } \forall A, B$$

$$\Longleftrightarrow\ p(\mu) = p(\mu - c)$$

$$\Longleftrightarrow\ p(\mu) = \text{constant}.$$

We obtained the uniform distribution after all.
But unlike before, we know when to use it.



E.g.) The mean of a Gaussian

$$p(x|\mu) = \frac{1}{(2\pi\sigma^2)^{1/2}} \exp\left\{-\frac{1}{2\sigma^2}(x - \mu)^2\right\}$$

This has the form $f(x - \mu)$.

This prior is also obtained as a limit of conjugate priors:

$$p(\mu) = \mathcal{N}(\mu|\mu_0, \sigma_0^2) \xrightarrow{\ \sigma_0^2 \to \infty\ } \text{const.},$$

$$\mu_N = \frac{\sigma^2}{N\sigma_0^2 + \sigma^2}\,\mu_0 + \frac{N\sigma_0^2}{N\sigma_0^2 + \sigma^2}\,\mu_{ML} \to \mu_{ML},$$

$$\frac{1}{\sigma_N^2} = \frac{1}{\sigma_0^2} + \frac{N}{\sigma^2} \to \frac{N}{\sigma^2}.$$
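The limit can be watched numerically. This sketch evaluates the posterior mean and precision for increasingly broad Gaussian priors; every number in it (µ0, σ², N, µ_ML) is made up for the example.

```python
def posterior_mean_precision(mu0, sigma0_sq, sigma_sq, n, mu_ml):
    """Posterior mean mu_N and precision 1/sigma_N^2 for a Gaussian mean with known variance."""
    denom = n * sigma0_sq + sigma_sq
    mu_n = sigma_sq / denom * mu0 + n * sigma0_sq / denom * mu_ml
    precision_n = 1.0 / sigma0_sq + n / sigma_sq
    return mu_n, precision_n

mu0, sigma_sq, n, mu_ml = 0.0, 2.0, 10, 3.0      # illustrative values
for sigma0_sq in (1.0, 1e2, 1e4, 1e8):
    mu_n, prec_n = posterior_mean_precision(mu0, sigma0_sq, sigma_sq, n, mu_ml)
    print(f"sigma0^2 = {sigma0_sq:g}: mu_N = {mu_n:.6f}, 1/sigma_N^2 = {prec_n:.6f}")

mu_limit, prec_limit = posterior_mean_precision(mu0, 1e12, sigma_sq, n, mu_ml)
```

As σ0² grows, µ_N approaches µ_ML = 3 and the precision approaches N/σ² = 5, matching the limits above.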



2. Priors for scale parameters

If the density has the form

$$p(x|\sigma) = \frac{1}{\sigma} f\left(\frac{x}{\sigma}\right),$$

a constant scaling $\hat{x} = cx$ gives a density of the same form, with $\hat{\sigma} = c\sigma$:

$$p(\hat{x}|\hat{\sigma}) = \frac{1}{\hat{\sigma}} f\left(\frac{\hat{x}}{\hat{\sigma}}\right).$$

This property is “scale invariance”, and such a parameter is a “scale parameter”.



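To reflect the scale invariance, the prior should satisfy

$$\int_B^A p(\sigma)\, d\sigma = \int_B^A p\left(\frac{1}{c}\sigma\right) \frac{d\sigma}{d(c\sigma)}\, d\sigma \quad \text{for } \forall A, B$$

$$\Longleftrightarrow\ p(\sigma) = \frac{1}{c}\, p\left(\frac{1}{c}\sigma\right)$$

$$\Longleftrightarrow\ p(\sigma) \propto \frac{1}{\sigma}$$

$$\Longleftrightarrow\ p(\ln\sigma) = \text{const.}$$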
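One way to see what scale invariance buys is to compare the (unnormalized) prior mass of an interval [A, B] with that of the rescaled interval [cA, cB]. The midpoint-rule integrator, the interval, and c = 10 below are assumptions for illustration.

```python
import math

def prior_mass(density, a, b, steps=100_000):
    """Midpoint-rule integral of an unnormalized prior density over [a, b]."""
    width = (b - a) / steps
    return sum(density(a + (i + 0.5) * width) for i in range(steps)) * width

def scale_prior(sigma):
    """Unnormalized scale-invariant prior p(sigma) proportional to 1/sigma."""
    return 1.0 / sigma

def flat_prior(sigma):
    """Unnormalized constant prior, for comparison."""
    return 1.0

c = 10.0
m1 = prior_mass(scale_prior, 1.0, 2.0)
m2 = prior_mass(scale_prior, c * 1.0, c * 2.0)
f1 = prior_mass(flat_prior, 1.0, 2.0)
f2 = prior_mass(flat_prior, c * 1.0, c * 2.0)
print(m1, m2, math.log(2.0))   # 1/sigma prior: same mass, ln(B/A)
print(f1, f2)                  # flat prior: mass grows with c
```

The 1/σ prior gives the same mass to both intervals, while the flat prior gives the rescaled interval c times as much.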



E.g.) The standard deviation of a Gaussian

$$p(x|\sigma) = \frac{1}{(2\pi\sigma^2)^{1/2}} \exp\left\{-\frac{1}{2\sigma^2} x^2\right\}$$

This has the form $\frac{1}{\sigma} f\left(\frac{x}{\sigma}\right)$.

This prior is also obtained as a limit of conjugate priors. In terms of the precision $\lambda = 1/\sigma^2$:

$$p(\lambda) = \mathrm{Gam}(\lambda|a_0, b_0) \xrightarrow{\ a_0, b_0 \to 0\ } \frac{\text{const}}{\lambda},$$

$$a_N = a_0 + \frac{N}{2} \to \frac{N}{2},$$

$$b_N = b_0 + \frac{N}{2}\sigma^2_{ML} \to \frac{N}{2}\sigma^2_{ML}.$$



Two examples of noninformative priors:

1. Priors for location parameters

$$p(x|\mu) = f(x - \mu) \ \Longrightarrow\ p(\mu) = \text{const.}$$

2. Priors for scale parameters

$$p(x|\sigma) = \frac{1}{\sigma} f\left(\frac{x}{\sigma}\right) \ \Longrightarrow\ p(\sigma) \propto \frac{1}{\sigma}$$



We learned the “parametric approach”

vs.

We will learn the “nonparametric approach”

What is the difference?



Parametric                                     | Nonparametric
Assume a specific form of the distribution     | Make few assumptions about the form of the distribution
Simple                                         | Complex (depends on data size)
Poor                                           | Rich / Flexible
Efficient                                      | Inefficient



We will learn:

1. Histogram methods

2. Kernel density estimators

3. Nearest-neighbour methods




1. Histogram methods

Split the space into a grid of bins, and count the data points in each bin.





$$p(x) = p_i = \frac{n_i}{N\Delta_i} \quad (x \in i\text{-th bin}),$$

where

$\Delta_i$ = width of the i-th bin (usually the same for all i),
$n_i$ = number of observations assigned to the i-th bin,
N = total number of observations.

This is piecewise constant, hence discontinuous.
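A minimal sketch of the histogram estimator with equal-width bins. The bin origin `lo` and the toy data are assumptions made for the example.

```python
import numpy as np

def histogram_density(data, x, delta, lo=0.0):
    """p(x) = n_i / (N * delta), where i is the bin of width delta that contains x.

    Bins are [lo + i*delta, lo + (i+1)*delta).
    """
    data = np.asarray(data, dtype=float)
    i = np.floor((x - lo) / delta)            # index of the bin containing x
    idx = np.floor((data - lo) / delta)       # bin index of every data point
    n_i = np.sum(idx == i)                    # number of observations in that bin
    return n_i / (len(data) * delta)

data = [0.1, 0.15, 0.3, 0.9]                  # toy data set
for x in (0.2, 0.4, 0.6, 0.9):
    print(x, histogram_density(data, x, delta=0.25))
```

Because every point inside a bin gets the same value, the estimate jumps at the bin edges, which is the discontinuity noted above.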



1. Histogram methods – Example

[Figure: histogram density estimates of the same data set for Δ = 0.04, Δ = 0.08 and Δ = 0.25.]

Δ is...

•  Δ = 0.04: too narrow to catch enough points; too spiky (noisy)
•  Δ = 0.08: good intermediate value
•  Δ = 0.25: too wide to express the data; too smooth (less info)

Finding a good value of Δ is very important!

With M bins per dimension in D dimensions, # of bins = $M^D$ (curse of dimensionality)



Lessons from histogram methods

Estimate the density at a particular point from the data points in a small local region.

The regions are defined by a “smoothing parameter”, which controls the complexity in relation to the data size.

Other problems
•  Discontinuity
•  Not scalable (curse of dimensionality)




Let's consider a small local region R containing x. Then

$$\Pr(K \text{ out of } N \text{ data points} \in R) = \frac{N!}{K!(N-K)!}\, P^K (1 - P)^{N-K},$$

where $P = \int_R p(x)\, dx$.

If

1.  K is large enough (smoother not too small), so that K ≈ NP

2.  p(x) is constant over R (smoother small enough), so that P ≈ p(x)V, with V the volume of R

then

$$p(x) = \frac{K}{NV}.$$

The two conditions are contradictory, and the right balance depends on the data size.


Kernel density estimators

Fix a region (e.g., a hypercube centred on x with side h) and count the data points with a kernel function k(u) (Parzen window).

$$k(u) = \begin{cases} 1, & |u_i| \le 1/2 \quad (i = 1, \ldots, D) \\ 0, & \text{otherwise.} \end{cases}$$

This is the unit cube centred on the origin (side 1). It is a discontinuous kernel.

$$K = \sum_{n=1}^{N} k\left(\frac{x - x_n}{h}\right), \qquad V = h^D,$$

$$\therefore\ p(x) = \frac{1}{N} \sum_{n=1}^{N} \frac{1}{h^D}\, k\left(\frac{x - x_n}{h}\right).$$



The symmetry of k(u) lets us re-interpret the result:

•  a single cube centred on x, counting the data points inside it
•  N cubes, one centred on each data point $x_n$, counting the cubes that contain x



Other choice of k(u): Gaussian

$$k(u) = \frac{1}{(2\pi)^{D/2}} \exp\left\{-\frac{\|u\|^2}{2}\right\}.$$

This kernel gives a continuous density.

You can use any kernel as long as it satisfies

$$k(u) \ge 0, \qquad \int k(u)\, du = 1.$$
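The estimator p(x) = (1/N) Σ h^(-D) k((x − x_n)/h) can be sketched with both kernels as follows. The function names and the three-point toy data set are illustrative assumptions.

```python
import numpy as np

def cube_kernel(u):
    """Parzen window: 1 inside the unit cube centred on the origin, 0 outside."""
    return np.all(np.abs(u) <= 0.5, axis=-1).astype(float)

def gaussian_kernel(u):
    """Standard Gaussian kernel in D dimensions."""
    d = u.shape[-1]
    return np.exp(-0.5 * np.sum(u * u, axis=-1)) / (2.0 * np.pi) ** (d / 2.0)

def kde(x, data, h, kernel):
    """Kernel density estimate at a single point x."""
    x = np.atleast_1d(np.asarray(x, dtype=float))
    data = np.asarray(data, dtype=float)
    if data.ndim == 1:
        data = data[:, None]                  # treat a flat list as N points in 1-D
    n, d = data.shape
    u = (x - data) / h                        # shape (N, D)
    return kernel(u).sum() / (n * h ** d)

data = [0.0, 0.1, 1.0]                        # toy 1-D data set
print(kde(0.05, data, 0.2, cube_kernel))      # two of three points fall in the window
print(kde(0.05, data, 0.2, gaussian_kernel))

# Riemann-sum check that the Gaussian estimate integrates to about 1.
grid = np.linspace(-3.0, 4.0, 2001)
total = sum(kde(g, data, 0.2, gaussian_kernel) for g in grid) * (grid[1] - grid[0])
print(total)
```

Swapping the kernel changes only the smoothness of the estimate; the role of h as the smoothing parameter stays the same.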



Example

Again, we can see that the smoothing parameter h controls the outcome of the estimation.

[Figure: kernel density estimates of the same data set for h = 0.005, h = 0.07 and h = 0.2.]


Nearest-neighbour methods

Use as the region a sphere that is centred on x and contains K (a fixed number of) data points.

$$p(x) = \frac{K}{N V(x)},$$

where V(x) denotes the volume of the sphere.
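In one dimension the "sphere" is just the interval [x − r, x + r], so the estimator can be sketched in a few lines. The function name and the toy data are assumptions for the example.

```python
import numpy as np

def knn_density_1d(x, data, k):
    """K-nearest-neighbour density estimate p(x) = K / (N * V(x)) in one dimension."""
    data = np.asarray(data, dtype=float)
    dists = np.sort(np.abs(data - x))
    r = dists[k - 1]             # radius that just reaches the K-th nearest point
    volume = 2.0 * r             # length of the interval [x - r, x + r]
    return k / (len(data) * volume)

data = [0.0, 1.0, 2.0, 4.0]      # toy data set
print(knn_density_1d(0.5, data, k=2))
print(knn_density_1d(3.0, data, k=1))
```

With K = 1 and x placed exactly on a data point the radius is zero, so this estimate blows up there.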



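Note that this density cannot be normalized.

Beyond a point x* that is far away from all data points, the radius r(x) of the sphere grows linearly with x, so the density falls off only as $1/r(x)^D$ and the integral diverges. In one dimension:

$$\int_{-\infty}^{\infty} \frac{dx}{r(x)} \ \ge\ \int_{x^*}^{\infty} \frac{dx}{r(x)} \ =\ \int_{x^*}^{\infty} \frac{dx}{x - x^\dagger} \to \infty,$$

where $x^\dagger$ is presumably the K-th nearest data point as seen from x ≥ x*.

$$\therefore\ \int_{\mathbb{R}^D} \frac{K}{N V(x)}\, dx \ \propto\ \int_{\mathbb{R}^D} \frac{dx}{r(x)^D} \to \infty.$$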


Nearest-neighbour estimators

Example

Here again, the smoothing parameter K controls the outcome of the estimation.

Furthermore, we can observe that p(x) → ∞ in the K=1 case.

[Figure: K-nearest-neighbour density estimates of the same data set for K = 1, K = 5 and K = 30.]


Another problem of Kernels and NNs

These methods need all the observed data for estimation, so both time and space complexity are O(N), which is very inefficient.

In that respect, parametric methods are quite efficient (cf. sufficient statistics).

Histograms are also efficient.



               | Histograms | Kernels    | NNs
K              | Not fixed  | Not fixed  | Fixed
V              | Not fixed  | Fixed      | Not fixed
Smoother       | Δ          | h          | K
Continuity     | No         | It depends | Yes*
Dimensionality | Suffer     | Scalable   | Scalable
Normalization  | Proper     | Proper     | Improper
Data set       | Discard    | Keep       | Keep

* If K=1, not continuous



Use NNs as classifier

To do this, use the sphere that contains K points irrespective of their class.

$$p(x|C_k) = \frac{K_k}{N_k V}, \qquad p(x) = \frac{K}{NV},$$

where $K_k$ is the number of points of class k inside the sphere and $N_k$ is the total number of points of class k.
The class priors are $p(C_k) = N_k / N$, so

$$p(C_k|x) = \frac{p(x|C_k)\, p(C_k)}{p(x)} = \frac{K_k}{K}.$$

Therefore, x is assigned to the majority class among its K nearest neighbours.

If K=1, this is called the “nearest-neighbour rule”.
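The resulting rule, p(C_k|x) = K_k / K followed by picking the largest posterior, can be sketched as below. The Euclidean distance, the function name, and the toy points are assumptions; ties between equally distant neighbours are broken arbitrarily by the sort.

```python
from collections import Counter

import numpy as np

def knn_classify(x, points, labels, k):
    """Return (predicted class, {class: K_k / K}) from the K nearest neighbours of x."""
    points = np.asarray(points, dtype=float)
    dists = np.linalg.norm(points - np.asarray(x, dtype=float), axis=1)
    nearest = np.argsort(dists)[:k]                      # indices of the K nearest points
    counts = Counter(labels[i] for i in nearest)         # K_k for each class present
    posteriors = {c: cnt / k for c, cnt in counts.items()}
    return max(posteriors, key=posteriors.get), posteriors

points = [[0, 0], [0, 1], [1, 0], [5, 5], [5, 6]]        # toy training set
labels = ["a", "a", "a", "b", "b"]

print(knn_classify([0.2, 0.2], points, labels, k=3))
print(knn_classify([4.0, 4.0], points, labels, k=3))
```

Setting k=1 in this sketch gives the nearest-neighbour rule.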



Use NNs as classifier – Example

As in the discussion so far, K acts as the smoothing parameter here.

[Figure: K-nearest-neighbour classification in the (x6, x7) plane for K = 1, K = 3 and K = 31.]
