MDL/Bayesian Criteria based on Universal Coding/Measure

Joe Suzuki
Osaka University
November 30, 2011

Road Map
1 Problem
2 Density Functions
3 Generalized Density Functions
4 The Bayesian Solution
5 Summary

Problem
Warming-Up
Identify whether $X, Y$ are independent or not, from $n$ examples
$(x_1, y_1), \cdots, (x_n, y_n)$ independently emitted by $(X, Y)$.
$X \in A := \{0, 1\}$
$Y \in B := \{0, 1\}$
$p$: a prior probability that $X, Y$ are independent
$W_A, W_B, W_{AB}$: weights
$$Q^n(x^n) := \int P(x^n|\theta)\,dW_A(\theta), \qquad Q^n(y^n) := \int P(y^n|\theta)\,dW_B(\theta)$$
$$Q^n(x^n, y^n) := \int P(x^n, y^n|\theta)\,dW_{AB}(\theta)$$
The Bayesian answer:
$$p\,Q^n(x^n)\,Q^n(y^n) \ge (1 - p)\,Q^n(x^n, y^n) \iff X, Y \text{ are independent}$$
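The warm-up rule can be tried numerically. The following is a minimal sketch, assuming symmetric Dirichlet weights with parameter 1/2 for $W_A$, $W_B$, $W_{AB}$ (the slide leaves the weights open) and treating each pair $(x_i, y_i)$ as one symbol of a 4-letter alphabet. The names `log_q` and `independent` are hypothetical.

```python
from collections import Counter
from math import lgamma, log

def log_q(symbols, alphabet_size, a=0.5):
    """log Q^n of a sequence under a symmetric Dirichlet(a) mixture (assumed weight)."""
    n = len(symbols)
    m = alphabet_size
    # log of Gamma(m a) * prod_x Gamma(c[x] + a) / (Gamma(a)^m * Gamma(n + m a));
    # symbols that never occur contribute Gamma(a) / Gamma(a) = 1.
    out = lgamma(m * a) - lgamma(n + m * a)
    for c in Counter(symbols).values():
        out += lgamma(c + a) - lgamma(a)
    return out

def independent(xs, ys, p=0.5):
    """Return True when p Q(x^n) Q(y^n) >= (1 - p) Q(x^n, y^n), compared in logs."""
    lhs = log(p) + log_q(xs, 2) + log_q(ys, 2)
    rhs = log(1 - p) + log_q(list(zip(xs, ys)), 4)
    return lhs >= rhs

# Balanced, unrelated patterns versus y copied from x.
print(independent([0, 0, 1, 1] * 25, [0, 1, 0, 1] * 25))
print(independent([0, 1] * 50, [0, 1] * 50))
```

Working in the log domain avoids underflow, since $Q^n$ shrinks exponentially in $n$.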

Today’s Exercise
Identify whether $X, Y$ are independent or not, from $n$ examples
$(x_1, y_1), \cdots, (x_n, y_n)$ independently emitted by $(X, Y)$.
$X \in A := [0, 1)$ (continuous)
$Y \in B := \{1, 2, \cdots\}$ (discrete and infinite)
Problem:
Construct something like $Q^n(x^n)$, $Q^n(y^n)$, $Q^n(x^n, y^n)$.
Extend those quantities to general $X, Y$,
without assuming that they are either discrete or continuous.

Why can $Q^n(x^n)$, $Q^n(y^n)$, $Q^n(x^n, y^n)$ be used as probabilities?
$W^*_A, W^*_B, W^*_{AB}$: the true priors
$$P^n(x^n) := \int P(x^n|\theta)\,dW^*_A(\theta), \qquad P^n(y^n) := \int P(y^n|\theta)\,dW^*_B(\theta)$$
$$P^n(x^n, y^n) := \int P(x^n, y^n|\theta)\,dW^*_{AB}(\theta)$$
Known: use $W^*_A, W^*_B, W^*_{AB}$ to compare
$p\,P^n(x^n)\,P^n(y^n)$ and $(1 - p)\,P^n(x^n, y^n)$
Unknown: use $W_A, W_B, W_{AB}$ to compare
$p\,Q^n(x^n)\,Q^n(y^n)$ and $(1 - p)\,Q^n(x^n, y^n)$
The main issue:
Which $Q^n$ is qualified to be an alternative to $P^n$?

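What is the exact $Q^n$ for finite $A$?
$P(X = 1) = \theta$, $P(X = 0) = 1 - \theta$
If we weight
$$w(\theta) = \frac{1}{K\,\theta^{1-a}(1 - \theta)^{1-a}}, \qquad K := \int_0^1 \frac{d\theta}{\theta^{1-a}(1 - \theta)^{1-a}}$$
with $a > 0$, then for each $x^n = (x_1, \cdots, x_n) \in A^n$
$$Q^n(x^n) := \int w(\theta)\,P(x^n|\theta)\,d\theta = \frac{\Gamma(2a)\prod_{x \in A}\Gamma(c_n[x] + a)}{\Gamma(a)^2\,\Gamma(n + 2a)}$$
$c_i[x]$: the number of occurrences of $x \in A$ in $x^i = (x_1, \cdots, x_i) \in A^i$
$\Gamma$: the Gamma function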
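The closed form can be checked numerically. The sketch below evaluates it in the log domain for a binary sequence and compares it with the product of sequential predictive probabilities $(c_{i-1}[x_i] + a)/(i - 1 + 2a)$. That sequential form is a standard equivalent that does not appear on the slide, and the function names are hypothetical.

```python
from math import lgamma, log

def log_q_closed(xs, a=0.5):
    """log Q^n(x^n) from the Gamma-function closed form, for xs with entries in {0, 1}."""
    n = len(xs)
    c1 = sum(xs)          # occurrences of 1
    c0 = n - c1           # occurrences of 0
    return (lgamma(2 * a) + lgamma(c0 + a) + lgamma(c1 + a)
            - 2 * lgamma(a) - lgamma(n + 2 * a))

def log_q_sequential(xs, a=0.5):
    """log Q^n(x^n) as a sum of log predictive probabilities (c[x] + a) / (i + 2a)."""
    counts = [0, 0]
    total = 0.0
    for i, x in enumerate(xs):
        total += log((counts[x] + a) / (i + 2 * a))
        counts[x] += 1
    return total

xs = [1, 0, 0, 1, 1, 1]
print(log_q_closed(xs), log_q_sequential(xs))
```

The sequential form makes it plain that $Q^n$ is a probability on $A^n$: each factor is a conditional probability, so the values over all sequences of a fixed length sum to one.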

Universal Coding/Measures
If we choose $a = 1/2$ (Krichevsky-Trofimov) and $x^n$ is i.i.d. emitted by
$$P^n(x^n) = \prod_{i=1}^n P(x_i),$$
then, for any $P$, almost surely,
$$-\frac{1}{n}\log Q^n(x^n) \to H := \sum_{x \in A} -P(x)\log P(x)$$
From the law of large numbers (Shannon-McMillan-Breiman):
for any $P$, almost surely,
$$-\frac{1}{n}\log P^n(x^n) = \frac{1}{n}\sum_{i=1}^n -\log P(x_i) \to E[-\log P(X)] = H$$
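The convergence of $-\frac{1}{n}\log Q^n(x^n)$ to $H$ can be observed in a small simulation. This is only a sketch: the source probability `theta = 0.3`, the seed, and the sample sizes are arbitrary choices for illustration, and logarithms are natural (nats).

```python
import random
from math import lgamma, log

def kt_rate(xs):
    """-(1/n) log Q^n(x^n) in nats for a binary sequence, with a = 1/2."""
    n = len(xs)
    c1 = sum(xs)
    c0 = n - c1
    log_q = (lgamma(1.0) + lgamma(c0 + 0.5) + lgamma(c1 + 0.5)
             - 2 * lgamma(0.5) - lgamma(n + 1.0))
    return -log_q / n

def entropy(theta):
    """Entropy H (nats) of a binary source with P(X = 1) = theta, 0 < theta < 1."""
    return -theta * log(theta) - (1 - theta) * log(1 - theta)

rng = random.Random(0)   # fixed seed so the run is repeatable
theta = 0.3              # hypothetical true parameter
for n in (100, 1000, 10000):
    xs = [1 if rng.random() < theta else 0 for _ in range(n)]
    print(n, kt_rate(xs), entropy(theta))
```

The printed rate should approach the entropy as $n$ grows, with a gap of order $(\log n)/n$.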

The Essential Problem
For any $P$, almost surely,
$$\frac{1}{n}\log\frac{P^n(x^n)}{Q^n(x^n)} \to 0 \qquad (1)$$
(the reason why $P^n$ can be replaced by $Q^n$)
$X$ is neither discrete nor continuous:
Into what can $Q^n$ and (1) be generalized?

Density Functions
If X has a density function
$A$: the range of $X$
$A_0 := \{A\}$
$A_{k+1}$ is a refinement of $A_k$
Example 1: if $A_0 = \{[0, 1)\}$, the histogram sequence can be
$A_1 = \{[0, 1/2), [1/2, 1)\}$
$A_2 = \{[0, 1/4), [1/4, 1/2), [1/2, 3/4), [3/4, 1)\}$
. . .
$A_k = \{[0, 2^{-k}), [2^{-k}, 2 \cdot 2^{-k}), \cdots, [(2^k - 1)2^{-k}, 1)\}$
. . .
$s_k : A \to A_k$, $s_k^n : A^n \to A_k^n$ ($s_k(x)$: the element of $A_k$ that contains $x$)
$\lambda$: Lebesgue measure, $\lambda^n(s_k^n(x^n)) = \prod_{i=1}^n \lambda(s_k(x_i))$

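$\{\omega_k\}_{k=1}^\infty$: $\sum_k \omega_k = 1$, $\omega_k > 0$
$$g_k^n(x^n) := \frac{Q_k^n(s_k^n(x^n))}{\lambda^n(s_k^n(x^n))}, \qquad g^n(x^n) := \sum_{k=1}^\infty \omega_k\,g_k^n(x^n)$$
$$f_k^n(x^n) := \frac{P_k^n(s_k^n(x^n))}{\lambda^n(s_k^n(x^n))} = \prod_{i=1}^n \frac{P_k(s_k(x_i))}{\lambda(s_k(x_i))}$$
If we choose $\{A_k\}$ such that $f_k \to f$, then for any $f^n$, almost surely,
$$\frac{1}{n}\log\frac{f^n(x^n)}{g^n(x^n)} \to 0 \qquad (2)$$
B. Ryabko. IEEE Trans. on Inform. Theory, 55, 9, 2009.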
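The mixture $g^n(x^n)$ over histogram levels can be sketched for the dyadic partitions of Example 1. Several choices below are assumptions rather than part of the slides: $Q_k^n$ is taken to be a symmetric Dirichlet(1/2) mixture over the $2^k$ cells, the weights are $\omega_k = 1/(k(k+1))$, and the infinite sum is truncated at a hypothetical `depth`, which gives a lower bound on $g^n$.

```python
from collections import Counter
from math import lgamma, log, exp

def log_q_dirichlet(cells, m, a=0.5):
    """log Q_k^n of a cell sequence under a symmetric Dirichlet(a) mixture on m cells."""
    n = len(cells)
    out = lgamma(m * a) - lgamma(n + m * a)
    for c in Counter(cells).values():
        out += lgamma(c + a) - lgamma(a)
    return out

def log_g(xs, depth=12):
    """log of the truncated mixture sum_{k=1}^{depth} w_k g_k^n(x^n), for xs in [0, 1)."""
    n = len(xs)
    terms = []
    for k in range(1, depth + 1):
        m = 2 ** k                                     # cells in the dyadic partition A_k
        cells = [min(int(x * m), m - 1) for x in xs]   # s_k(x_i) as a cell index
        log_lambda = -n * k * log(2)                   # each cell has length 2^-k
        log_w = -log(k * (k + 1))                      # assumed weights, summing to 1 over k >= 1
        terms.append(log_w + log_q_dirichlet(cells, m) - log_lambda)
    top = max(terms)                                   # log-sum-exp for numerical stability
    return top + log(sum(exp(t - top) for t in terms))

# 200 points spread evenly over [0, 1/4): the density there is about 4.
xs = [0.25 * (i + 0.5) / 200 for i in range(200)]
print(log_g(xs) / len(xs))
```

Dividing $Q_k^n$ by $\lambda^n$ turns cell probabilities into density values, which is why the per-sample log value is positive for data concentrated on a short interval.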

Generalized Density Functions
Exactly when does a density function exist?
$\mathcal{B}$: the Borel set field of $\mathbb{R}$
$\mu(D)$: the probability of Borel set $D$
When a density function exists:
The following are equivalent:
for each $D \in \mathcal{B}$, $\lambda(D) = 0 \implies \mu(D) = 0$ ($\mu \ll \lambda$)
There exists $\frac{d\mu}{d\lambda} := f$ s.t. $\mu(D) = \int_{t \in D} f(t)\,d\lambda(t)$

Density Functions in a General Sense
Radon-Nikodym Theorem:
The following are equivalent:
for each $D \in \mathcal{B}$, $\eta(D) = 0 \implies \mu(D) = 0$ ($\mu \ll \eta$)
There exists $\frac{d\mu}{d\eta} := f$ s.t. $\mu(D) = \int_{t \in D} f(t)\,d\eta(t)$
Example 2: $\mu(\{j\}) > 0$, $\eta(\{j\}) := \frac{1}{j(j + 1)}$, $j \in B := \{1, 2, \cdots\}$
$\mu \ll \eta$
$$\mu(D) = \sum_{j \in D \cap B} f(j)\,\eta(\{j\})$$
$$\frac{d\mu}{d\eta}(j) = f(j) = \frac{\mu(\{j\})}{\eta(\{j\})}$$
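Example 2 can be made concrete with exact rational arithmetic. In the sketch below, $\eta$ is the measure from the example, while the geometric $\mu(\{j\}) = 2^{-j}$ is a hypothetical choice made only for illustration.

```python
from fractions import Fraction

def eta(j):
    """Reference measure of the singleton {j}: 1 / (j (j + 1))."""
    return Fraction(1, j * (j + 1))

def mu(j):
    """Hypothetical probability of {j}: a geometric law 2^-j."""
    return Fraction(1, 2 ** j)

def density(j):
    """Generalized density f(j) = mu({j}) / eta({j})."""
    return mu(j) / eta(j)

D = [2, 3, 5]
# mu(D) recovered as the sum of f(j) eta({j}) over j in D
print(sum(density(j) * eta(j) for j in D), sum(mu(j) for j in D))
# eta is itself a probability: the partial sums telescope to 1 - 1/(J + 1)
print(sum(eta(j) for j in range(1, 11)))
```

Since $\eta(\{j\}) > 0$ for every $j$, any probability $\mu$ on $B$ satisfies $\mu \ll \eta$, so the density always exists.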

In this work, ...
$B_1 := \{\{1\}, \{2, 3, \cdots\}\}$
$B_2 := \{\{1\}, \{2\}, \{3, 4, \cdots\}\}$
. . .
$B_k := \{\{1\}, \{2\}, \cdots, \{k\}, \{k + 1, k + 2, \cdots\}\}$
. . .
$s_k : B \to B_k$, $s_k^n : B^n \to B_k^n$
$$g_k^n(y^n) := \frac{Q_k^n(s_k^n(y^n))}{\eta^n(s_k^n(y^n))}, \qquad g^n(y^n) := \sum_{k=1}^\infty \omega_k\,g_k^n(y^n)$$
If we choose $\{B_k\}$ s.t. $f_k \to f$, then for any $f^n$, almost surely,
$$\frac{1}{n}\log\frac{f^n(y^n)}{g^n(y^n)} \to 0 \qquad (3)$$
($g^n(y^n)\prod_{i=1}^n \eta(\{y_i\})$ is an estimate of $P^n(y^n) = f^n(y^n)\prod_{i=1}^n \eta(\{y_i\})$)

Joint Density Functions
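Example 3: $A \times B$ (based on Examples 1, 2)
$\mu \ll \lambda\eta$
$A_0 \times B_0 = \{A\} \times \{B\} = \{[0, 1)\} \times \{\{1, 2, \cdots\}\}$
$A_1 \times B_1$
$A_2 \times B_2$
. . .
$A_k \times B_k$
. . .
$s_k : A \times B \to A_k \times B_k$

If $\{A_k \times B_k\}$ satisfies $f_k \to f$, then we can construct $g^n$ s.t., for any $f^n$, almost surely,
$$\frac{1}{n}\log\frac{f^n(x^n, y^n)}{g^n(x^n, y^n)} \to 0 \qquad (4)$$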

The Bayesian Solution
If we come back to "Today's Exercise", ...
Estimate $f_X^n(x^n)$, $f_Y^n(y^n)$, $f_{XY}^n(x^n, y^n)$ by $g_X^n(x^n)$, $g_Y^n(y^n)$, $g_{XY}^n(x^n, y^n)$.
The Bayesian answer (with $p_0 := p$ and $p_1 := 1 - p$ as in the warm-up):
$$p_0\,g_X^n(x^n)\,g_Y^n(y^n) \ge p_1\,g_{XY}^n(x^n, y^n) \iff X, Y \text{ are independent}$$
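The whole procedure for $X \in [0, 1)$ and $Y \in \{1, 2, \cdots\}$ can be sketched end to end. As in the warm-up rule, independence is declared when the product side is at least the joint side. The sketch rests on several assumptions that are not in the slides: $Q_k^n$ is a symmetric Dirichlet(1/2) mixture over the cells of level $k$, the weights are $\omega_k = 1/(k(k+1))$, the mixture is truncated at a hypothetical `depth`, and all function names are illustrative.

```python
from collections import Counter
from math import lgamma, log, exp

def log_q(cells, m, a=0.5):
    """log Q_k^n under a symmetric Dirichlet(a) mixture on m cells (assumed weight)."""
    n = len(cells)
    out = lgamma(m * a) - lgamma(n + m * a)
    for c in Counter(cells).values():
        out += lgamma(c + a) - lgamma(a)
    return out

def x_cell(x, k):
    """Index of the dyadic cell of A_k (length 2^-k) containing x in [0, 1)."""
    m = 2 ** k
    return min(int(x * m), m - 1)

def y_cell(y, k):
    """Cell of B_k containing y: {y} if y <= k, else the tail, labelled k + 1."""
    return min(y, k + 1)

def log_eta(cell, k):
    """log eta of a cell of B_k, with eta({j}) = 1/(j(j+1)); the tail has mass 1/(k+1)."""
    return -log(cell * (cell + 1)) if cell <= k else -log(k + 1)

def log_g(xs=None, ys=None, depth=8):
    """log of sum_{k<=depth} w_k Q_k^n / (reference measure), for x^n, y^n, or both."""
    n = len(xs if xs is not None else ys)
    terms = []
    for k in range(1, depth + 1):
        cx = [x_cell(x, k) for x in xs] if xs is not None else [0] * n
        cy = [y_cell(y, k) for y in ys] if ys is not None else [0] * n
        m = (2 ** k if xs is not None else 1) * (k + 1 if ys is not None else 1)
        log_ref = 0.0
        if xs is not None:
            log_ref += -n * k * log(2)                  # lambda^n of the x cells
        if ys is not None:
            log_ref += sum(log_eta(c, k) for c in cy)   # eta^n of the y cells
        log_w = -log(k * (k + 1))                       # assumed weights w_k
        terms.append(log_w + log_q(list(zip(cx, cy)), m) - log_ref)
    top = max(terms)
    return top + log(sum(exp(t - top) for t in terms))

def independent(xs, ys, p0=0.5):
    """True when p0 g_X g_Y >= (1 - p0) g_XY, compared in the log domain."""
    lhs = log(p0) + log_g(xs=xs) + log_g(ys=ys)
    rhs = log(1 - p0) + log_g(xs=xs, ys=ys)
    return lhs >= rhs

xs = [(i + 0.5) / 200 for i in range(200)]
print(independent(xs, [1 + (i % 2) for i in range(200)]))   # y alternates regardless of x
print(independent(xs, [1 if x < 0.5 else 2 for x in xs]))   # y determined by x
```

All three estimates are densities with respect to the same reference measure $\lambda^n \times \eta^n$, so the two sides of the inequality are directly comparable.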

In General, ...
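Given $n$ examples $z^n$ and a prior $\{p_m\}$ over models $m = 1, 2, \cdots$, estimate
$$f^n(z^n|m) = \frac{d\mu^n}{d\eta^n}(z^n|m)$$
w.r.t. model $m$ by
$$g^n(z^n|m) = \frac{d\nu^n}{d\eta^n}(z^n|m)$$
s.t.
$$\frac{1}{n}\log\frac{d\mu^n}{d\nu^n}(z^n|m) \to 0,$$
where $\mu \ll \eta$, $\nu \ll \eta$, and
$$\frac{d\mu^n}{d\nu^n}(z^n|m) = \frac{d\mu^n}{d\eta^n}(z^n|m) \Big/ \frac{d\nu^n}{d\eta^n}(z^n|m) = \frac{f^n(z^n|m)}{g^n(z^n|m)},$$
to find the model $m$ maximizing
$$p_m \cdot \frac{d\nu^n}{d\eta^n}(z^n|m)$$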
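The final maximization is usually carried out in the log domain, because $g^n(z^n|m)$ decays or grows exponentially in $n$. A minimal sketch follows, assuming the values $\log g^n(z^n|m)$ have already been computed for each candidate model. The function name and the dictionary layout are hypothetical.

```python
from math import log

def select_model(log_evidence, prior):
    """Return the model m maximizing p_m * g^n(z^n | m), using log p_m + log g^n(z^n | m)."""
    scores = {m: log(prior[m]) + log_evidence[m]
              for m in log_evidence if prior[m] > 0}
    return max(scores, key=scores.get)

# Hypothetical log g^n(z^n | m) values for two models.
log_evidence = {1: -100.0, 2: -98.0}
print(select_model(log_evidence, {1: 0.5, 2: 0.5}))
print(select_model(log_evidence, {1: 0.99, 2: 0.01}))
```

With a uniform prior the rule reduces to picking the largest $g^n(z^n|m)$, and a strongly skewed prior can overturn a small difference in the estimates.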

Summary
Summary and Discussion
Bayesian Measure:
Generalization without assuming discrete or continuous
Universality as Bayes as well as MDL
Many Applications:
Markov order estimation even when $\{X_i\}$ is continuous
Bayesian network structure estimation
