Feature Extraction for Universal Hypothesis Testing via Rank-Constrained Optimization (ISIT 2010)
1. Feature Extraction for Universal Hypothesis Testing via Rank-Constrained Optimization
Dayu Huang and Sean Meyn
Department of Electrical and Computer Engineering
and Coordinated Science Laboratory
University of Illinois, Urbana-Champaign
June 18, 2010
3. Introduction
Universal Hypothesis Testing
Sequence of observations: Z₁ⁿ := (Z₁, . . . , Zₙ),
i.i.d. with marginal π⁰ under H0 and π¹ under H1.
π⁰: known; π¹: not known.
The observation space Z is finite.
Task: Design a test to decide in favor of H0 or H1.
The Hoeffding test
    φ_n^H = 1{D(Γⁿ ‖ π⁰) ≥ η},
where Γⁿ is the empirical distribution
    Γⁿ(A) = (1/n) Σ_{k=1}^n 1{Z_k ∈ A},  A ⊂ Z.
5. The Hoeffding Test
Theorem
1. The Hoeffding test achieves the optimal error exponent in the Neyman-Pearson criterion.¹
2. The asymptotic variance of the Hoeffding test depends on the size of the observation space.² When Z₁ⁿ has marginal π⁰, we have
    lim_{n→∞} Var[n D(Γⁿ ‖ π⁰)] = (1/2)(|Z| − 1).
The variance is large when |Z| is large.
1. Hoeffding 1963; 2. Unnikrishnan, Huang, Meyn, Surana & Veeravalli; Wilks 1938; Clarke & Barron 1990.
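To see the variance limit concretely, here is a minimal simulation sketch (not from the slides); the uniform π⁰, the alphabet size, and the sample sizes are illustrative assumptions. For i.i.d. draws from π⁰, the variance of n · D(Γⁿ ‖ π⁰) should approach (|Z| − 1)/2:

```python
import numpy as np

rng = np.random.default_rng(0)
Z = 19                      # |Z|, size of the observation space (assumed)
pi0 = np.full(Z, 1.0 / Z)   # null distribution (assumed uniform here)
n, trials = 5000, 2000

def kl(mu, pi):
    """D(mu || pi) on a finite alphabet, with the convention 0 log 0 = 0."""
    mask = mu > 0
    return float(np.sum(mu[mask] * np.log(mu[mask] / pi[mask])))

stats = np.empty(trials)
for t in range(trials):
    samples = rng.choice(Z, size=n, p=pi0)
    gamma = np.bincount(samples, minlength=Z) / n  # empirical distribution
    stats[t] = n * kl(gamma, pi0)

print("simulated Var[n D(Gamma_n || pi0)]:", stats.var())
print("theoretical (|Z| - 1) / 2         :", (Z - 1) / 2)
```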
6. Performance of the Hoeffding Test
[Figure: ROC curves, probability of detection Pr(φ = 1 | H1) versus probability of false alarm Pr(φ = 1 | H0), for |Z| = 19 and |Z| = 39. Red: better error exponent but larger variance.]
8. Mismatched Universal Test
Variational representation of KL divergence:
    D(µ ‖ π) = sup_f { ⟨µ, f⟩ − log(⟨π, e^f⟩) },
where ⟨µ, f⟩ = Σ_z µ(z) f(z).
Mismatched divergence¹:
    D_MM^F(µ ‖ π) := sup_{f ∈ F} { ⟨µ, f⟩ − log(⟨π, e^f⟩) }.
Mismatched universal test²:
    φ_n^MM = 1{D_MM^F(Γⁿ ‖ π⁰) ≥ η}.
1. Abbe, Médard, Meyn & Zheng 2007; 2. Unnikrishnan et al.
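The unconstrained sup is attained at the log-likelihood ratio f* = log(µ/π), since ⟨π, e^{f*}⟩ = 1. The sketch below, with assumed example distributions on a 3-point alphabet, checks this numerically:

```python
import numpy as np

# mu and pi are assumed example distributions, not from the slides.
mu = np.array([0.5, 0.3, 0.2])
pi = np.array([0.2, 0.3, 0.5])

def objective(f, mu, pi):
    """<mu, f> - log(<pi, e^f>), the quantity inside the sup."""
    return mu @ f - np.log(pi @ np.exp(f))

f_star = np.log(mu / pi)  # the LLR attains the unconstrained sup
print("objective at f*:", objective(f_star, mu, pi))
print("D(mu || pi)    :", np.sum(mu * np.log(mu / pi)))  # should match
```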
10. Function Class and Performance
Consider a linear function class:
    F = { f_r := Σ_{i=1}^d r_i ψ_i : r ∈ R^d }.
The choice of the function class F determines performance:
The mismatched divergence approximates the KL divergence and determines the error exponent of the mismatched universal test. When d is smaller than |Z|, the test is optimal for a restricted set of alternative distributions.
The dimension d determines the asymptotic variance¹: under H0,
    lim_{n→∞} Var[n D_MM^F(Γⁿ ‖ π⁰)] = (1/2) d.
Problem: How to choose the function class F?
1. Unnikrishnan et al.
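As a concrete illustration, here is a sketch (with assumed example data, not from the slides) that evaluates D_MM^F for a linear class spanned by basis functions ψ₁, . . . , ψ_d by maximizing over r. The objective is concave in r (linear term minus a log-sum-exp), so a generic optimizer suffices:

```python
import numpy as np
from scipy.optimize import minimize

def d_mm(mu, pi, Psi):
    """Mismatched divergence for the linear class spanned by the rows of Psi.

    Psi is a (d, |Z|) array; f_r = r @ Psi. The objective
    <mu, f_r> - log(<pi, exp(f_r)>) is concave in r.
    """
    def neg_obj(r):
        f = r @ Psi
        return -(mu @ f - np.log(pi @ np.exp(f)))
    res = minimize(neg_obj, np.zeros(Psi.shape[0]))
    return -res.fun

# Assumed example: a 4-point alphabet and a 2-dimensional class.
pi0 = np.array([0.25, 0.25, 0.25, 0.25])
pi1 = np.array([0.40, 0.30, 0.20, 0.10])
Psi = np.array([[1.0, -1.0, 0.0, 0.0],
                [0.0, 0.0, 1.0, -1.0]])
print("D_MM:", d_mm(pi1, pi0, Psi))
print("D_KL:", np.sum(pi1 * np.log(pi1 / pi0)))  # D_MM <= D_KL always
```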
11. Our Contribution
1 Mismatched test even with a small dimension d is optimal for a large
set of alternative distributions.
2 Framework to choose F for the mismatched test.
14. When Is MM Optimal?
When does D_MM^F(π¹ ‖ π⁰) = D(π¹ ‖ π⁰)?
Fact (1)
When F includes the log-likelihood ratio (LLR) log(π¹/π⁰).
Exponential family: E(F) = {µ : µ(z) ∝ exp(f(z)), f ∈ F}.
Fact (2)
When π⁰ and π¹ are in the same exponential family E(F).
How many distributions are there in a d-dimensional exponential family?
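Before moving on, a quick numerical check of Fact (1), with assumed example distributions: when F contains the LLR, even the one-dimensional class F = span(log(π¹/π⁰)) achieves D_MM = D.

```python
import numpy as np
from scipy.optimize import minimize

# Assumed example distributions on a 3-point alphabet.
pi0 = np.array([0.2, 0.3, 0.5])
pi1 = np.array([0.5, 0.3, 0.2])
llr = np.log(pi1 / pi0)  # Fact (1): put the LLR inside F

def neg_obj(r):
    f = r * llr          # F = span(llr), a one-dimensional class
    return -(pi1 @ f - np.log(pi0 @ np.exp(f)))

d_mm = -minimize(neg_obj, np.zeros(1)).fun
d_kl = np.sum(pi1 * np.log(pi1 / pi0))
print(d_mm, d_kl)        # should agree up to optimizer tolerance
```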
17. ε-Extremal Distributions
    π_θ(z) ∝ exp(θ f(z)) ∈ E(F).
Extremal distributions: the limits of π_θ as θ → ∞, i.e., distributions on the boundary of E(F).
Example
F = span(ψ), ψ = [5, −1, −1], i.e. ψ(z₁) = 5, ψ(z₂) = ψ(z₃) = −1.
What are the extremal distributions?
[1, 0, 0]: f = [5, −1, −1]
[0, 0.5, 0.5]: f = [−5, 1, 1]
[1/3, 1/3, 1/3]: f = [0, 0, 0]
Definition
    F_ε(π) := {z : π(z) ≥ max_z π(z) − ε}.
π is called ε-extremal if π(F_ε(π)) ≥ 1 − ε.
Example: [0.004, 0.499, 0.497].
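A tiny sketch of the definition (the ε below is an assumed tolerance): a distribution is ε-extremal when it puts at least 1 − ε of its mass on its near-maximal points.

```python
import numpy as np

def is_eps_extremal(pi, eps):
    """pi is eps-extremal if pi(F_eps(pi)) >= 1 - eps,
    where F_eps(pi) = {z : pi(z) >= max_z pi(z) - eps}."""
    F = pi >= pi.max() - eps
    return bool(pi[F].sum() >= 1 - eps)

print(is_eps_extremal(np.array([0.004, 0.499, 0.497]), 0.01))  # True
print(is_eps_extremal(np.array([0.3, 0.6, 0.1]), 0.01))        # False
```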
22. The Number of ε-Distinguishable ε-Extremal Distributions
Definition
N(E): the maximum N such that for any small ε > 0, there exist N distributions in E that are ε-extremal and pairwise ε-distinguishable.
Proposition
Denote
    N̄(d) := max{N(E) : E is d-dimensional}.
It admits the following lower and upper bounds:
    N̄(d) ≥ exp( (d/2) [log(|Z|) − log(d/2) − 1] ),
    N̄(d) ≤ exp( (d + 1)(1 + log(|Z|) − log(d + 1)) ).
Many alternative distributions can be distinguished even with a small dimension d.
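To illustrate the proposition's message numerically, here is a sketch that evaluates the bounds as reconstructed above; the |Z| and d values are assumed for illustration:

```python
import numpy as np

def log_bounds(Z_size, d):
    """Lower and upper bounds on log N(d), per the proposition above."""
    lower = (d / 2) * (np.log(Z_size) - np.log(d / 2) - 1)
    upper = (d + 1) * (1 + np.log(Z_size) - np.log(d + 1))
    return lower, upper

for d in (2, 4, 8):
    lo, hi = log_bounds(100, d)  # |Z| = 100, assumed
    print(f"d={d}: {lo:.1f} <= log N <= {hi:.1f}")
```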
24. A Framework for Choosing the Function Class
Scenario: The alternative distributions lie in a set S (not known to the algorithm). We observe p distributions from the set: π¹, . . . , πᵖ.
Objective function to be maximized:
    max_F (1/p) Σ_{i=1}^p γⁱ D_MM^F(πⁱ ‖ π⁰)
    subject to dim(F) ≤ d.
Rank-constrained optimization:
    max_X (1/p) Σ_{i=1}^p γⁱ [ ⟨πⁱ, X_i⟩ − log(⟨π⁰, e^{X_i}⟩) ]
    subject to rank(X) ≤ d,
where ⟨µ, f⟩ = Σ_z µ(z) f(z) and X_i denotes the i-th row of X.
25. Algorithm
Iterative gradient projection:
1. Y^{k+1} = X^k + α_k h(X^k).
2. X^{k+1} = P_S(Y^{k+1}),
where P_S(Y) = arg min { ‖Y − X‖ : rank(X) ≤ d }.
Provable local convergence.
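Here is a minimal sketch of the iteration, under some assumptions not stated on the slide: a constant step size α_k ≡ α, h(X) taken as the gradient of the rank-constrained objective from the previous slide, and P_S implemented by truncated SVD (the Frobenius-norm projection onto rank-d matrices, by the Eckart-Young theorem). The toy data at the bottom are assumed for illustration.

```python
import numpy as np

def project_rank(Y, d):
    """P_S(Y): closest matrix of rank <= d in Frobenius norm (truncated SVD)."""
    U, s, Vt = np.linalg.svd(Y, full_matrices=False)
    return (U[:, :d] * s[:d]) @ Vt[:d]

def grad(X, Pis, pi0, gammas):
    """Gradient of (1/p) sum_i gamma_i [<pi_i, X_i> - log <pi0, exp(X_i)>]:
    row i is gamma_i * (pi_i minus pi0 exponentially tilted by X_i), over p."""
    G = np.empty_like(X)
    for i in range(len(Pis)):
        w = pi0 * np.exp(X[i])
        G[i] = gammas[i] * (Pis[i] - w / w.sum())
    return G / len(Pis)

def gradient_projection(Pis, pi0, gammas, d, alpha=1.0, iters=300):
    """Alternate a gradient ascent step with the rank-d projection."""
    X = np.zeros((len(Pis), len(pi0)))
    for _ in range(iters):
        X = project_rank(X + alpha * grad(X, Pis, pi0, gammas), d)
    return X

# Assumed toy data: p = 3 alternatives on a 5-point alphabet, d = 1.
rng = np.random.default_rng(1)
pi0 = np.full(5, 0.2)
Pis = rng.dirichlet(np.ones(5), size=3)
X = gradient_projection(Pis, pi0, np.ones(3), d=1)
print("rank of X:", np.linalg.matrix_rank(X))
```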
26. Numerical Experiment
Distributions are drawn randomly from a set S:
π⁰;
π¹, . . . , πᵖ for feature extraction;
π¹ for testing.
Experiment steps:
Feature extraction: Extract a d-dimensional function class F based on π⁰ and π¹, . . . , πᵖ.
Test: The alternative distribution is π¹. Estimate the probability of error by simulation.
32. Conclusion and Future Work
Conclusions:
Variance is as important as the error exponent.
There is a balance between variance and error exponent.
Feature extraction algorithm: exploit prior information to optimize the performance of the mismatched test.
Future Work:
Bound the probability of error based on finer statistics.
Extend to processes with long memory.
Other heuristics (such as the nuclear norm) for algorithm design.