Universal Prediction without assuming either Discrete or Continuous

Joe Suzuki
Osaka University
November 13, 2012

Problem
What is the probability that the sun will rise tomorrow?
Predict $x_{n+1} \in \{0, 1\}$ given $x^n := (x_1, \cdots, x_n) \in \{0, 1\}^n$
Construct a computable $Q(x_{n+1} \mid x^n) \to P(x_{n+1} \mid x^n)$
such as
1. $Q(x_{n+1} \mid x^n) = \dfrac{c}{n}$
2. For $a, b > 0$, $Q(x_{n+1} \mid x^n) = \dfrac{c + a}{n + a + b}$

$c$: the number of occurrences of $x_{n+1}$ in $x^n$.
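The second rule can be sketched in a few lines of Python. This is a minimal illustration, not the author's code: the function name `predict_one` and the reading of $a$ as the pseudo-count for symbol 1 and $b$ as the pseudo-count for symbol 0 are assumptions made here so that the two predictive probabilities sum to one.

```python
def predict_one(xs, a=0.5, b=0.5):
    """Estimate Q(x_{n+1} = 1 | x^n) = (c + a) / (n + a + b) for a binary sequence.

    c is the number of ones in xs. Treating a as the pseudo-count of symbol 1
    and b as the pseudo-count of symbol 0 is an assumption of this sketch.
    """
    n = len(xs)
    c = sum(xs)
    return (c + a) / (n + a + b)

# The sun rose on each of 10 observed days (1 = rose, 0 = did not rise).
history = [1] * 10
p_rise = predict_one(history)
p_not = 1.0 - p_rise          # equals (n - c + b) / (n + a + b)
print(p_rise, p_not)
```

Unlike the first rule $c/n$, this estimate never assigns probability zero to a symbol that has not been observed yet.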

Problem
Open Problems raised by Tom Cover in 1975, Moscow
In each bet, you obtain 2 dollars if you win and lose 1 dollar otherwise.

Problem 1: Existence of a universal gambling scheme
Is there any $Q^n$ s.t.
$$\frac{1}{n} \log\left[2^n Q^n(x^n)\right] \to \frac{1}{n} \log\left[2^n P^n(x^n)\right]$$
a.s. as $n \to \infty$ for any unknown stationary ergodic $P^n$?
Betting without knowledge of $P^n$ then does asymptotically as well as betting with that knowledge
(a Bayesian strategy realizes this property)
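A small simulation can illustrate the claim for the simplest case. The sketch below assumes an i.i.d. Bernoulli source (a special case of stationary ergodic), reads $2^n Q^n(x^n)$ as the capital after $n$ rounds of proportional betting, and uses the Krichevsky-Trofimov estimator as the Bayesian $Q^n$. The function names, the source parameter 0.8, and the sample size are illustrative choices.

```python
import math
import random

def kt_log2_prob(xs):
    """log2 of the Krichevsky-Trofimov probability Q^n(x^n) of a binary sequence."""
    log_q, ones = 0.0, 0
    for i, x in enumerate(xs):
        p_one = (ones + 0.5) / (i + 1.0)      # (c + 1/2) / (i + 1) after i symbols
        log_q += math.log2(p_one if x == 1 else 1.0 - p_one)
        ones += x
    return log_q

def log2_growth_rates(xs, theta):
    """Return (1/n) log2[2^n Q^n(x^n)] and (1/n) log2[2^n P^n(x^n)]."""
    n = len(xs)
    ones = sum(xs)
    log_p = ones * math.log2(theta) + (n - ones) * math.log2(1.0 - theta)
    return (n + kt_log2_prob(xs)) / n, (n + log_p) / n

rng = random.Random(0)                         # fixed seed for repeatability
theta = 0.8                                    # hypothetical source parameter
xs = [1 if rng.random() < theta else 0 for _ in range(5000)]
universal, informed = log2_growth_rates(xs, theta)
print(universal, informed)
```

The two growth rates differ by $\frac{1}{n} \log_2 \frac{P^n(x^n)}{Q^n(x^n)}$, which for the Krichevsky-Trofimov estimator is of order $\frac{\log n}{n}$.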

Problem
Problem 2: Existence of a universal prediction scheme
Is there any $Q$ s.t. for $x \in \{0, 1\}$
$$Q(x \mid x_{-n}^{-1}) \to P(x \mid x_{-\infty}^{-1})$$
a.s. as $n \to \infty$ for any unknown stationary ergodic $P$?
Ornstein 1978 (discrete, non-Bayesian)
Algoet 1992 (extended to Polish spaces, non-Bayesian)
$x_{-\infty}^{-1} \in \{0, 1\}^\infty \mapsto (\{s_k\}, \{t_k\})$, $s_0 < s_1 < \cdots$, $t_0 < t_1 < \cdots$ s.t.
$$Q(x \mid x_{-t_k}^{-1}) = \frac{\# I_k(x) + 1/2}{\# I_k(0) + \# I_k(1) + 1}$$
$$I_k(x) = \{1 \le \tau \le s_k \mid x = x_{-\tau},\ x_{-t_k}^{-1} = x_{-\tau - t_k}^{-\tau - 1}\}$$
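The counting step in this estimator can be sketched as follows. The slide does not say how the sequences $\{s_k\}$ and $\{t_k\}$ are chosen, so this sketch simply takes one window length `s` and one context length `t` as inputs. The function names and the list layout (`past[-1]` holds $x_{-1}$) are assumptions for illustration.

```python
def match_counts(past, s, t):
    """Return [#I(0), #I(1)] for window length s and context length t.

    past[-1] is x_{-1}, past[-2] is x_{-2}, and so on.
    """
    size = len(past)
    if size < s + t:
        raise ValueError("need at least s + t past symbols")
    context = past[size - t:]                         # (x_{-t}, ..., x_{-1})
    counts = [0, 0]
    for tau in range(1, s + 1):
        window = past[size - tau - t:size - tau]      # (x_{-tau-t}, ..., x_{-tau-1})
        if window == context:
            counts[past[size - tau]] += 1             # symbol x_{-tau} that followed the match
    return counts

def estimate(x, past, s, t):
    """(#I(x) + 1/2) / (#I(0) + #I(1) + 1) for a fixed pair (s, t)."""
    counts = match_counts(past, s, t)
    return (counts[x] + 0.5) / (counts[0] + counts[1] + 1.0)

past = [0, 1] * 10                                    # alternating sequence ending in 1
print(match_counts(past, 8, 1), estimate(0, past, 8, 1))
```

On the alternating sequence, every earlier occurrence of the context (a single 1) was followed by 0, so the estimate puts most of its mass on 0.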

Problem
Bayesian for binary i.i.d. sources
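$$Q^n(x^n) = \int w(\theta) P(x^n \mid \theta)\, d\theta , \qquad P(x^n \mid \theta) = \theta^c (1 - \theta)^{n - c}$$
Here $c$ is the number of ones in $x^n$, and the predictive probability below is written for $x_{n+1} = 1$.
For $a, b > 0$,
$$w(\theta) \propto \theta^{a - 1} (1 - \theta)^{b - 1} \iff Q(x_{n+1} \mid x^n) = \frac{Q^{n+1}(x^{n+1})}{Q^n(x^n)} = \frac{c + a}{n + a + b}$$
For $a = b = 1/2$ (Krichevsky-Trofimov),
$$-\frac{1}{n} \log Q^n(x^n) \to H := \sum_{x \in A} -P(x) \log P(x)$$
$$-\frac{1}{n} \log P^n(x^n) = \frac{1}{n} \sum_{i=1}^n -\log P(x_i) \to E[-\log P(x_i)] = H$$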
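Both statements can be checked numerically. The sketch below assumes the prior is the Beta($a$, $b$) density, proportional to $\theta^{a-1}(1-\theta)^{b-1}$, which agrees with the slide at $a = b = 1/2$. Under that assumption the mixture has the closed form $Q^n(x^n) = B(c + a, n - c + b) / B(a, b)$, with $B$ the Beta function. All function names are illustrative.

```python
import math

def log_beta(p, q):
    """Natural log of the Beta function B(p, q)."""
    return math.lgamma(p) + math.lgamma(q) - math.lgamma(p + q)

def log_q(c, n, a=0.5, b=0.5):
    """ln Q^n(x^n) = ln B(c + a, n - c + b) - ln B(a, b), for c ones among n bits."""
    return log_beta(c + a, n - c + b) - log_beta(a, b)

def predictive_one(c, n, a=0.5, b=0.5):
    """Q(x_{n+1} = 1 | x^n) computed as the ratio Q^{n+1}(x^{n+1}) / Q^n(x^n)."""
    return math.exp(log_q(c + 1, n + 1, a, b) - log_q(c, n, a, b))

def entropy_nats(theta):
    """Entropy H of a Bernoulli(theta) source, in nats."""
    return -theta * math.log(theta) - (1.0 - theta) * math.log(1.0 - theta)

ratio = predictive_one(3, 10, a=2.0, b=1.0)    # closed form: (3 + 2) / (10 + 2 + 1)
rate = -log_q(30000, 100000) / 100000          # should be close to H(0.3)
print(ratio, rate, entropy_nats(0.3))
```

The ratio check works for any $a, b > 0$, while the entropy check uses the Krichevsky-Trofimov choice $a = b = 1/2$ on a sequence whose fraction of ones is 0.3.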

Problem
Universality
There exists $Q^n$ s.t. for any $P^n$
1. $Q(x \mid x_{-n}^{-1}) \to P(x \mid x_{-\infty}^{-1})$ (1)
2. $\dfrac{1}{n} \log \dfrac{P^n(x^n)}{Q^n(x^n)} \to 0$ (2)
$m$-ary ($m \ge 2$) rather than binary
stationary ergodic rather than i.i.d.
Ornstein 1978: (1)
Bayesian: (2) as well as (1)

Problem
Construct $Q^n$ satisfying (2) for the general case
$X^n$ should be stationary ergodic but can be
discrete,
continuous, or
neither of them
Counting how often $(X_{i+1} = x_{i+1}, X_i = x_i)$ occurs does not help.
Algoet 1992 does not imply (2) for the general case.

Density Functions
Suppose a density function f exists for X
$A$: the range of $X$
$A_0 := \{A\}$
$A_{j+1}$ is a refinement of $A_j$
Example 1: Quantize $f$ over $A = [0, 1)$ to obtain histogram approximations
$f_1$ over $A_1 = \{[0, 1/2), [1/2, 1)\}$
$f_2$ over $A_2 = \{[0, 1/4), [1/4, 1/2), [1/2, 3/4), [3/4, 1)\}$
. . .
$f_j$ over $A_j = \{[0, 2^{-j}), [2^{-j}, 2 \cdot 2^{-j}), \cdots, [(2^j - 1) 2^{-j}, 1)\}$
. . .
$P_j^n(a^n) = \prod_{i=1}^n P_j(a_i)$, the probability of $a^n = (a_1, \cdots, a_n) \in A_j^n$
$Q_j^n$: a Bayesian measure
$$\frac{1}{n} \log \frac{P_j^n(a^n)}{Q_j^n(a^n)} \to 0 \quad \text{as } n \to \infty$$
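The quantization in Example 1 can be sketched as below. Level $j$ is taken to have $2^j$ equal cells, matching the listed $A_1$ and $A_2$. The density $f(x) = 2x$ on $[0, 1)$ is a hypothetical example chosen here so that the cell probabilities have a closed form. The function names are illustrative.

```python
def histogram_density(x, j, cdf):
    """f_j(x) = P_j(a) / lambda(a), where a is the level-j cell containing x in [0, 1)."""
    m = 2 ** j
    k = int(x * m)                       # index of the cell [k / m, (k + 1) / m)
    lo, hi = k / m, (k + 1) / m
    return (cdf(hi) - cdf(lo)) / (hi - lo)

def cdf(t):
    """Hypothetical example: F(t) = t**2, i.e. density f(x) = 2x on [0, 1)."""
    return t * t

# f_1(0.3), f_2(0.3), f_10(0.3) approach f(0.3) = 0.6 as the partition is refined.
approximations = [histogram_density(0.3, j, cdf) for j in (1, 2, 10)]
print(approximations)
```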

Density Functions
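$\lambda : \mathcal{B} \to \mathbb{R}$ (Lebesgue measure, $a = [b, c) \implies \lambda(a) = c - b$)
If $x_i \in a_i$ for $i = 1, \cdots, n$ with $(a_1, \cdots, a_n) \in A_j^n$, then
$$f_j^n(x^n) := f_j(x_1) \cdots f_j(x_n) = \frac{P_j(a_1) \cdots P_j(a_n)}{\lambda(a_1) \cdots \lambda(a_n)}$$
$$g_j^n(x^n) := \frac{Q_j^n(a_1, \cdots, a_n)}{\lambda(a_1) \cdots \lambda(a_n)}$$
For $\{\omega_j\}_{j=1}^\infty$ with $\sum_j \omega_j = 1$, $\omega_j > 0$,
$$g^n(x^n) := \sum_{j=1}^\infty \omega_j g_j^n(x^n)$$
If we choose $\{A_j\}$ such that $f_j \to f$ as $j \to \infty$, then for any $f$, almost surely
$$\frac{1}{n} \log \frac{f^n(x^n)}{g^n(x^n)} \to 0 \qquad (3)$$
B. Ryabko. IEEE Trans. on Inform. Theory, 55, 9, 2009.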
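A rough numerical illustration of (3) follows. Several choices here are assumptions rather than part of the slide: the dyadic partitions of $[0, 1)$ with $2^j$ cells at level $j$, a Dirichlet(1/2) Bayesian measure for each $Q_j^n$, weights $\omega_j \propto 2^{-j}$, truncation of the infinite sum to 8 levels, and the test density $f(x) = 2x$.

```python
import math
import random

def log_g_level(xs, j):
    """ln g_j^n(x^n): Dirichlet(1/2) measure on the 2**j cells, divided by the cell widths."""
    m = 2 ** j
    counts = [0] * m
    log_q = 0.0
    for i, x in enumerate(xs):
        k = min(int(x * m), m - 1)                     # cell index; clamp x = 1.0 into the last cell
        log_q += math.log((counts[k] + 0.5) / (i + 0.5 * m))
        counts[k] += 1
    return log_q + len(xs) * j * math.log(2.0)         # each lambda(a_i) equals 2**-j

def log_g_mixture(xs, levels=8):
    """ln of sum_j w_j g_j^n(x^n), truncated to j = 1..levels (an assumption of this sketch)."""
    norm = sum(2.0 ** -j for j in range(1, levels + 1))
    terms = [math.log(2.0 ** -j / norm) + log_g_level(xs, j) for j in range(1, levels + 1)]
    top = max(terms)                                   # log-sum-exp for numerical stability
    return top + math.log(sum(math.exp(t - top) for t in terms))

rng = random.Random(0)
xs = [math.sqrt(1.0 - rng.random()) for _ in range(5000)]   # draws from f(x) = 2x
log_f = sum(math.log(2.0 * x) for x in xs)                   # ln f^n(x^n)
rate = (log_f - log_g_mixture(xs)) / len(xs)                 # (1/n) ln [f^n / g^n]
print(rate)
```

Working in the log domain avoids underflow, since $g_j^n(x^n)$ itself is far too small to represent as a float for large $n$.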

Generalized Density Functions
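Exactly when does a density function exist?
$\mathcal{B}$: the Borel sets of $\mathbb{R}$
$\mu(D)$: the probability of $D \in \mathcal{B}$
When a density function exists
The following are equivalent ($\mu \ll \lambda$):
for each $D \in \mathcal{B}$, $\lambda(D) = 0 \implies \mu(D) = 0$
$\exists$ $\mathcal{B}$-measurable $\dfrac{d\mu}{d\lambda} := f$ s.t. $\mu(D) = \int_D f(t)\, d\lambda(t)$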

Estimating generalized density functions
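Radon-Nikodym's Theorem
The following are equivalent ($\mu \ll \eta$):
for each $D \in \mathcal{B}$, $\eta(D) = 0 \implies \mu(D) = 0$
$\exists$ $\mathcal{B}$-measurable $\dfrac{d\mu}{d\eta} := f$ s.t. $\mu(D) = \int_D f(t)\, d\eta(t)$
Example 2: $\mu(\{k\}) > 0$, $\eta(\{k\}) := \dfrac{1}{k(k + 1)}$, $k \in B := \{1, 2, \cdots\}$
$$\mu(D) = \sum_{k \in D} f(k)\, \eta(\{k\}) , \quad D \subseteq B$$
$$\mu \ll \eta \implies \frac{d\mu}{d\eta}(k) = f(k) = \frac{\mu(\{k\})}{\eta(\{k\})} = k(k + 1)\mu(\{k\})$$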
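Example 2 can be checked numerically. In the sketch below the target $\mu$ is a geometric distribution, a hypothetical choice since the slide only requires $\mu(\{k\}) > 0$. The function names are illustrative.

```python
def eta(k):
    """Reference measure eta({k}) = 1 / (k (k + 1)) on k = 1, 2, ..."""
    return 1.0 / (k * (k + 1))

def mu(k, p=0.5):
    """Hypothetical target: geometric distribution mu({k}) = (1 - p)**(k - 1) * p."""
    return (1.0 - p) ** (k - 1) * p

def density(k, p=0.5):
    """Generalized density f(k) = (d mu / d eta)(k) = k (k + 1) mu({k})."""
    return mu(k, p) / eta(k)

K = 1000
eta_mass = sum(eta(k) for k in range(1, K + 1))               # telescopes to 1 - 1 / (K + 1)
mu_mass = sum(density(k) * eta(k) for k in range(1, K + 1))   # recovers the mass of mu
print(eta_mass, mu_mass, density(2))
```

The partial sums show that $\eta$ is itself a probability measure on the positive integers, and that summing $f(k)\eta(\{k\})$ gives back $\mu$.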

$f_1$ over $B_1 := \{\{1\}, \{2, 3, \cdots\}\}$
$f_2$ over $B_2 := \{\{1\}, \{2\}, \{3, 4, \cdots\}\}$
. . .
$f_k$ over $B_k := \{\{1\}, \{2\}, \cdots, \{k\}, \{k + 1, k + 2, \cdots\}\}$
. . .
If $y_i \in b_i$ for $i = 1, \cdots, n$ with $(b_1, \cdots, b_n) \in B_k^n$, then
$$g_k^n(y^n) := \frac{Q_k^n(b_1, \cdots, b_n)}{\eta(b_1) \cdots \eta(b_n)}$$
$$g^n(y^n) := \sum_{k=1}^\infty \omega_k g_k^n(y^n)$$
If we choose $\{B_k\}$ s.t. $f_k \to f$, then for any $f$, almost surely
$$\frac{1}{n} \log \frac{f^n(y^n)}{g^n(y^n)} \to 0 \qquad (4)$$
$g^n(y^n) \prod_{i=1}^n \eta(\{y_i\})$ estimates $P(y^n) = f^n(y^n) \prod_{i=1}^n \eta(\{y_i\})$
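The construction over the partitions $B_k$ can be sketched for integer-valued data. Assumptions made here, none of which come from the slide: $\eta(\{k\}) = 1/(k(k+1))$ as in Example 2 (so the tail cell $\{k+1, k+2, \cdots\}$ has $\eta$-mass $1/(k+1)$), a Dirichlet(1/2) Bayesian measure for each $Q_k^n$, weights $\omega_k \propto 2^{-k}$, truncation to 12 levels, and a geometric source with $P(y) = 2^{-y}$.

```python
import math
import random

def log_g_level(ys, k):
    """ln g_k^n(y^n) on B_k: cells {1}, ..., {k} and the tail {k + 1, k + 2, ...}."""
    counts = [0] * (k + 2)                    # index 0 unused; index k + 1 is the tail cell
    total = 0.0
    for i, y in enumerate(ys):
        cell = min(y, k + 1)
        # Q_k^n: Dirichlet(1/2) predictive over the k + 1 cells (an assumed Bayesian measure)
        total += math.log((counts[cell] + 0.5) / (i + 0.5 * (k + 1)))
        # eta of the cell: 1 / (y (y + 1)) for a singleton, 1 / (k + 1) for the tail
        eta_cell = 1.0 / (cell * (cell + 1)) if cell <= k else 1.0 / (k + 1)
        total -= math.log(eta_cell)
        counts[cell] += 1
    return total

def log_g(ys, levels=12):
    """ln g^n(y^n) for the mixture truncated to k = 1..levels, weights proportional to 2**-k."""
    norm = sum(2.0 ** -k for k in range(1, levels + 1))
    terms = [math.log(2.0 ** -k / norm) + log_g_level(ys, k) for k in range(1, levels + 1)]
    top = max(terms)
    return top + math.log(sum(math.exp(t - top) for t in terms))

def sample_geometric(rng):
    """Draw y in {1, 2, ...} with P(y) = 2**-y."""
    y = 1
    while rng.random() < 0.5:
        y += 1
    return y

rng = random.Random(0)
ys = [sample_geometric(rng) for _ in range(5000)]
log_eta = sum(-math.log(y * (y + 1)) for y in ys)
log_estimate = log_g(ys) + log_eta                  # ln of g^n(y^n) * prod_i eta({y_i})
log_true = sum(-y * math.log(2.0) for y in ys)      # ln P(y^n)
rate = (log_true - log_estimate) / len(ys)
print(rate)
```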

The original case is contained as a special case
For $C = \{0, 1, \cdots, m - 1\}$, if we quantize with
$C_1 = C_2 = \cdots = \{\{0\}, \{1\}, \cdots, \{m - 1\}\}$
$\eta(\{0\}) = \cdots = \eta(\{m - 1\}) = 1/m$
then $\mu \ll \eta$ and
$$z^n \in C^n \iff c^n \in C_1^n = C_2^n = \cdots$$
$$\implies f^n(z^n) = \frac{P^n(c^n)}{(1/m)^n}, \qquad g_1^n(z^n) = g_2^n(z^n) = \cdots = g^n(z^n) = \sum_{l=1}^\infty \omega_l g_l^n(z^n) = \frac{Q^n(c^n)}{(1/m)^n}$$
$$\implies \frac{1}{n} \log \frac{f^n(z^n)}{g^n(z^n)} = \frac{1}{n} \log \frac{P^n(c^n)}{Q^n(c^n)} \to 0$$

The Solution
Universality in the generalized sense
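If $\mu^n \ll \eta^n$, there exists $g^n$ that does not depend on $f^n$ s.t.
$$\frac{1}{n} \log \frac{f^n(z^n)}{g^n(z^n)} \to 0$$
$$\mu^n(D^n) := \int_{D^n} f^n(z^n)\, d\eta^n(z^n) , \qquad \nu^n(D^n) := \int_{D^n} g^n(z^n)\, d\eta^n(z^n)$$
$$\frac{f^n(z^n)}{g^n(z^n)} = \frac{d\mu^n}{d\eta^n}(z^n) \Big/ \frac{d\nu^n}{d\eta^n}(z^n) = \frac{d\mu^n}{d\nu^n}(z^n)$$
Theorem (Suzuki, 2011)
$$\frac{1}{n} \log \frac{d\mu^n}{d\nu^n}(z^n) \to 0$$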

The Solution
Universal Prediction in the generalized sense
The generalized universal density function tells everything:
$$g(x_{n+1} \mid x^n) = \frac{g^{n+1}(x^{n+1})}{g^n(x^n)} \to f(x_{n+1} \mid x^n) = \frac{f^{n+1}(x^{n+1})}{f^n(x^n)}$$
For any $D \in \mathcal{B}$,
$$\nu(D \mid x^n) = \int_D g(x \mid x^n)\, d\eta(x)$$
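The predictive density can be computed directly as a ratio of two mixture values. The sketch below is an illustration for the continuous case on $[0, 1)$ with $\eta$ the Lebesgue measure. It assumes dyadic partitions, a Dirichlet(1/2) Bayesian measure per level, weights proportional to $2^{-j}$, and a mixture truncated to `LEVELS` terms, none of which are fixed by the slide. It also checks that $\nu(D \mid x^n)$ has total mass 1 when $D = [0, 1)$.

```python
import math
import random

LEVELS = 5   # truncation of the infinite mixture; an assumption of this sketch

def log_g(xs):
    """ln g^n(x^n) for the truncated mixture over dyadic partitions of [0, 1)."""
    norm = sum(2.0 ** -j for j in range(1, LEVELS + 1))
    terms = []
    for j in range(1, LEVELS + 1):
        m = 2 ** j
        counts, log_q = [0] * m, 0.0
        for i, x in enumerate(xs):
            k = min(int(x * m), m - 1)
            log_q += math.log((counts[k] + 0.5) / (i + 0.5 * m))
            counts[k] += 1
        # g_j^n = Q_j^n / prod_i lambda(a_i), with lambda(a_i) = 2**-j
        terms.append(math.log(2.0 ** -j / norm) + log_q + len(xs) * j * math.log(2.0))
    top = max(terms)
    return top + math.log(sum(math.exp(t - top) for t in terms))

def predictive_density(x, xs):
    """g(x | x^n) = g^{n+1}(x^n, x) / g^n(x^n)."""
    return math.exp(log_g(xs + [x]) - log_g(xs))

rng = random.Random(0)
xs = [math.sqrt(rng.random()) for _ in range(200)]   # hypothetical data drawn from f(x) = 2x

# g(. | x^n) is constant on the finest cells, so a midpoint sum over them is the exact integral.
cells = 2 ** LEVELS
mass = sum(predictive_density((k + 0.5) / cells, xs) for k in range(cells)) / cells
print(mass, predictive_density(0.9, xs), predictive_density(0.1, xs))
```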

Summary
Summary and Discussion
Universal Prediction
Connection to universal Bayesian measures
Generalization without assuming discrete or continuous
Stronger universality in the Bayesian sense
Many applications beyond prediction
Bayesian network structure estimation (DCC 2012)
The Bayesian Chow-Liu algorithm (PGM 2012)
Markov order estimation even when $\{X_i\}$ is continuous
