Feature Extraction for Universal Hypothesis Testing via Rank-Constrained Optimization (ISIT 2010)
1. Feature Extraction for Universal Hypothesis Testing via Rank-Constrained Optimization
Dayu Huang and Sean Meyn
Department of Electrical and Computer Engineering
and Coordinated Science Laboratory
University of Illinois, Urbana-Champaign
June 18, 2010
3. Introduction
Universal Hypothesis Testing
Sequence of observations: Z₁ⁿ := (Z₁, . . . , Zₙ),
i.i.d. with marginal π⁰ under H0 and π¹ under H1.
π⁰: known; π¹: not known.
The observation space Z is finite.
Task: Design a test to decide in favor of H0 or H1.
The Hoeffding test
    φ_n^H = 1{D(Γⁿ ‖ π⁰) ≥ η},
where Γⁿ is the empirical distribution
    Γⁿ(A) = (1/n) Σ_{k=1}^n 1{Z_k ∈ A},  A ⊂ Z.
5. The Hoeffding Test
Theorem
1. The Hoeffding test achieves the optimal error exponent in the Neyman-Pearson criterion.¹
2. The asymptotic variance of the Hoeffding test depends on the size of the observation space.² When Z₁ⁿ has marginal π⁰, we have
    lim_{n→∞} Var[n D(Γⁿ ‖ π⁰)] = (1/2)(|Z| − 1).
The variance is large when |Z| is large.
1. Hoeffding 1963; 2. Unnikrishnan, Huang, Meyn, Surana & Veeravalli; Wilks 1938; Clarke & Barron 1990.
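To see the variance limit concretely, here is a minimal simulation sketch (not from the slides); the uniform π⁰, the alphabet size, and the sample sizes are illustrative assumptions. For i.i.d. draws from π⁰, the variance of n · D(Γⁿ ‖ π⁰) should approach (|Z| − 1)/2:

```python
import numpy as np

rng = np.random.default_rng(0)
Z = 19                      # |Z|, size of the observation space (assumed)
pi0 = np.full(Z, 1.0 / Z)   # null distribution (assumed uniform here)
n, trials = 5000, 2000

def kl(mu, pi):
    """D(mu || pi) on a finite alphabet, with the convention 0 log 0 = 0."""
    mask = mu > 0
    return float(np.sum(mu[mask] * np.log(mu[mask] / pi[mask])))

stats = np.empty(trials)
for t in range(trials):
    samples = rng.choice(Z, size=n, p=pi0)
    gamma = np.bincount(samples, minlength=Z) / n  # empirical distribution
    stats[t] = n * kl(gamma, pi0)

print("simulated Var[n D(Gamma_n || pi0)]:", stats.var())
print("theoretical (|Z| - 1) / 2         :", (Z - 1) / 2)
```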
6. Performance of the Hoeffding Test
[Figure: ROC curves, probability of detection Pr(φ = 1 | H1) versus probability of false alarm Pr(φ = 1 | H0), for |Z| = 19 and |Z| = 39. Red: better error exponent but larger variance.]
8. Mismatched Universal Test
Variational representation of KL divergence:
    D(µ ‖ π) = sup_f { ⟨µ, f⟩ − log(⟨π, e^f⟩) },
where ⟨µ, f⟩ = Σ_z µ(z) f(z).
Mismatched divergence¹:
    D_MM^F(µ ‖ π) := sup_{f ∈ F} { ⟨µ, f⟩ − log(⟨π, e^f⟩) }.
Mismatched universal test²:
    φ_n^MM = 1{D_MM^F(Γⁿ ‖ π⁰) ≥ η}.
1. Abbe, Médard, Meyn & Zheng 2007; 2. Unnikrishnan et al.
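The unconstrained sup is attained at the log-likelihood ratio f* = log(µ/π), since ⟨π, e^{f*}⟩ = 1. The sketch below, with assumed example distributions on a 3-point alphabet, checks this numerically:

```python
import numpy as np

# mu and pi are assumed example distributions, not from the slides.
mu = np.array([0.5, 0.3, 0.2])
pi = np.array([0.2, 0.3, 0.5])

def objective(f, mu, pi):
    """<mu, f> - log(<pi, e^f>), the quantity inside the sup."""
    return mu @ f - np.log(pi @ np.exp(f))

f_star = np.log(mu / pi)  # the LLR attains the unconstrained sup
print("objective at f*:", objective(f_star, mu, pi))
print("D(mu || pi)    :", np.sum(mu * np.log(mu / pi)))  # should match
```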
10. Function Class and Performance
Consider a linear function class:
    F = { f_r := Σ_{i=1}^d r_i ψ_i : r ∈ R^d }.
The choice of the function class F determines performance:
The mismatched divergence approximates the KL divergence and determines the error exponent of the mismatched universal test. When d is smaller than |Z|, the test is optimal for a restricted set of alternative distributions.
The dimension d determines the asymptotic variance¹: under H0,
    lim_{n→∞} Var[n D_MM^F(Γⁿ ‖ π⁰)] = (1/2) d.
Problem: How to choose the function class F?
1. Unnikrishnan et al.
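As a concrete illustration, here is a sketch (with assumed example data, not from the slides) that evaluates D_MM^F for a linear class spanned by basis functions ψ₁, . . . , ψ_d by maximizing over r. The objective is concave in r (linear term minus a log-sum-exp), so a generic optimizer suffices:

```python
import numpy as np
from scipy.optimize import minimize

def d_mm(mu, pi, Psi):
    """Mismatched divergence for the linear class spanned by the rows of Psi.

    Psi is a (d, |Z|) array; f_r = r @ Psi. The objective
    <mu, f_r> - log(<pi, exp(f_r)>) is concave in r.
    """
    def neg_obj(r):
        f = r @ Psi
        return -(mu @ f - np.log(pi @ np.exp(f)))
    res = minimize(neg_obj, np.zeros(Psi.shape[0]))
    return -res.fun

# Assumed example: a 4-point alphabet and a 2-dimensional class.
pi0 = np.array([0.25, 0.25, 0.25, 0.25])
pi1 = np.array([0.40, 0.30, 0.20, 0.10])
Psi = np.array([[1.0, -1.0, 0.0, 0.0],
                [0.0, 0.0, 1.0, -1.0]])
print("D_MM:", d_mm(pi1, pi0, Psi))
print("D_KL:", np.sum(pi1 * np.log(pi1 / pi0)))  # D_MM <= D_KL always
```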
11. Our Contribution
1 Mismatched test even with a small dimension d is optimal for a large
set of alternative distributions.
2 Framework to choose F for the mismatched test.
14. When Is MM Optimal?
When does D_MM^F(π¹ ‖ π⁰) = D(π¹ ‖ π⁰)?
Fact (1)
When F includes the log-likelihood ratio (LLR) log(π¹/π⁰).
Exponential family: E(F) = {µ : µ(z) ∝ exp(f(z)), f ∈ F}.
Fact (2)
When π⁰ and π¹ are in the same exponential family E(F).
How many distributions are there in a d-dimensional exponential family?
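Before moving on, a quick numerical check of Fact (1), with assumed example distributions: when F contains the LLR, even the one-dimensional class F = span(log(π¹/π⁰)) achieves D_MM = D.

```python
import numpy as np
from scipy.optimize import minimize

# Assumed example distributions on a 3-point alphabet.
pi0 = np.array([0.2, 0.3, 0.5])
pi1 = np.array([0.5, 0.3, 0.2])
llr = np.log(pi1 / pi0)  # Fact (1): put the LLR inside F

def neg_obj(r):
    f = r * llr          # F = span(llr), a one-dimensional class
    return -(pi1 @ f - np.log(pi0 @ np.exp(f)))

d_mm = -minimize(neg_obj, np.zeros(1)).fun
d_kl = np.sum(pi1 * np.log(pi1 / pi0))
print(d_mm, d_kl)        # should agree up to optimizer tolerance
```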
17. ε-Extremal Distributions
    π_θ(z) ∝ exp(θ f(z)) ∈ E(F).
Extremal distributions: the limits of π_θ as θ → ∞, i.e., distributions on the boundary of E(F).
Example
F = span(ψ), ψ = [5, −1, −1], i.e. ψ(z₁) = 5, ψ(z₂) = ψ(z₃) = −1.
What are the extremal distributions?
[1, 0, 0]: f = [5, −1, −1]
[0, 0.5, 0.5]: f = [−5, 1, 1]
[1/3, 1/3, 1/3]: f = [0, 0, 0]
Definition
    F_ε(π) := {z : π(z) ≥ max_z π(z) − ε}.
π is called ε-extremal if π(F_ε(π)) ≥ 1 − ε.
Example: [0.004, 0.499, 0.497].
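A tiny sketch of the definition (the ε below is an assumed tolerance): a distribution is ε-extremal when it puts at least 1 − ε of its mass on its near-maximal points.

```python
import numpy as np

def is_eps_extremal(pi, eps):
    """pi is eps-extremal if pi(F_eps(pi)) >= 1 - eps,
    where F_eps(pi) = {z : pi(z) >= max_z pi(z) - eps}."""
    F = pi >= pi.max() - eps
    return bool(pi[F].sum() >= 1 - eps)

print(is_eps_extremal(np.array([0.004, 0.499, 0.497]), 0.01))  # True
print(is_eps_extremal(np.array([0.3, 0.6, 0.1]), 0.01))        # False
```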
22. The Number of ε-Distinguishable ε-Extremal Distributions
Definition
N(E): the maximum N such that for any small ε > 0, there exist N distributions in E that are ε-extremal and pairwise ε-distinguishable.
Proposition
Denote
    N̄(d) := max{N(E) : E is d-dimensional}.
It admits the following lower and upper bounds:
    N̄(d) ≥ exp( (d/2) [log(|Z|) − log(d/2) − 1] ),
    N̄(d) ≤ exp( (d + 1)(1 + log(|Z|) − log(d + 1)) ).
Many alternative distributions can be distinguished even with a small dimension d.
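To illustrate the proposition's message numerically, here is a sketch that evaluates the bounds as reconstructed above; the |Z| and d values are assumed for illustration:

```python
import numpy as np

def log_bounds(Z_size, d):
    """Lower and upper bounds on log N(d), per the proposition above."""
    lower = (d / 2) * (np.log(Z_size) - np.log(d / 2) - 1)
    upper = (d + 1) * (1 + np.log(Z_size) - np.log(d + 1))
    return lower, upper

for d in (2, 4, 8):
    lo, hi = log_bounds(100, d)  # |Z| = 100, assumed
    print(f"d={d}: {lo:.1f} <= log N <= {hi:.1f}")
```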
24. A Framework for Choosing the Function Class
Scenario: The alternative distributions lie in a set S (not known to the algorithm). We observe p distributions from the set: π¹, . . . , πᵖ.
Objective function to be maximized:
    max_F (1/p) Σ_{i=1}^p γⁱ D_MM^F(πⁱ ‖ π⁰)
    subject to dim(F) ≤ d.
Rank-constrained optimization:
    max_X (1/p) Σ_{i=1}^p γⁱ [ ⟨πⁱ, X_i⟩ − log(⟨π⁰, e^{X_i}⟩) ]
    subject to rank(X) ≤ d,
where ⟨µ, f⟩ = Σ_z µ(z) f(z) and X_i denotes the i-th row of X.
25. Algorithm
Iterative gradient projection:
1. Y^{k+1} = X^k + α_k h(X^k).
2. X^{k+1} = P_S(Y^{k+1}),
where P_S(Y) = arg min { ‖Y − X‖ : rank(X) ≤ d }.
Provable local convergence.
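Here is a minimal sketch of the iteration, under some assumptions not stated on the slide: a constant step size α_k ≡ α, h(X) taken as the gradient of the rank-constrained objective from the previous slide, and P_S implemented by truncated SVD (the Frobenius-norm projection onto rank-d matrices, by the Eckart-Young theorem). The toy data at the bottom are assumed for illustration.

```python
import numpy as np

def project_rank(Y, d):
    """P_S(Y): closest matrix of rank <= d in Frobenius norm (truncated SVD)."""
    U, s, Vt = np.linalg.svd(Y, full_matrices=False)
    return (U[:, :d] * s[:d]) @ Vt[:d]

def grad(X, Pis, pi0, gammas):
    """Gradient of (1/p) sum_i gamma_i [<pi_i, X_i> - log <pi0, exp(X_i)>]:
    row i is gamma_i * (pi_i minus pi0 exponentially tilted by X_i), over p."""
    G = np.empty_like(X)
    for i in range(len(Pis)):
        w = pi0 * np.exp(X[i])
        G[i] = gammas[i] * (Pis[i] - w / w.sum())
    return G / len(Pis)

def gradient_projection(Pis, pi0, gammas, d, alpha=1.0, iters=300):
    """Alternate a gradient ascent step with the rank-d projection."""
    X = np.zeros((len(Pis), len(pi0)))
    for _ in range(iters):
        X = project_rank(X + alpha * grad(X, Pis, pi0, gammas), d)
    return X

# Assumed toy data: p = 3 alternatives on a 5-point alphabet, d = 1.
rng = np.random.default_rng(1)
pi0 = np.full(5, 0.2)
Pis = rng.dirichlet(np.ones(5), size=3)
X = gradient_projection(Pis, pi0, np.ones(3), d=1)
print("rank of X:", np.linalg.matrix_rank(X))
```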
26. Numerical Experiment
Distributions are drawn randomly from a set S:
π⁰;
π¹, . . . , πᵖ for feature extraction;
π¹ for testing.
Experiment steps:
Feature extraction: Extract a d-dimensional function class F based on π⁰ and π¹, . . . , πᵖ.
Test: The alternative distribution is π¹. Estimate the probability of error by simulation.
32. Conclusion and Future Work
Conclusions:
Variance is as important as the error exponent.
There is a balance between variance and error exponent.
Feature extraction algorithm: exploit prior information to optimize the performance of the mismatched test.
Future Work:
Bound the probability of error based on finer statistics.
Extend to processes with long memory.
Other heuristics (such as the nuclear norm) for algorithm design.