This document introduces several concepts in estimation theory, including Bayesian parameter estimation, non-Bayesian parameter estimation, maximum likelihood estimation, and the Cramér-Rao lower bound. It works through examples of estimating parameters of linear and nonlinear models from observed data under different cost functions, and derives the mean-square-error, maximum a posteriori, and maximum-likelihood estimates.
1. Introduction to Estimation Theory
Bayesian (Random) Parameter Estimation
Nonrandom Parameter Estimation
2. In an estimation problem we assign a cost to all pairs [a, â(r)] over the range of interest.
In many cases of interest it is realistic to assume that the cost depends only on the error: c[a, â(r)] = c[â(r) − a].
4. Our goal is to find an estimate that minimizes the expected value of the cost:
R = E{c[a, â(r)]} = ∫_{−∞}^{∞} ∫_{−∞}^{∞} c[a, â(r)] p(a, r) dr da
R is the risk involved in estimating a from the observation(s) r.
5. R_MS = ∫_{−∞}^{∞} ∫_{−∞}^{∞} (â(r) − a)² p(a, r) dr da
= ∫_{−∞}^{∞} dr p(r) ∫_{−∞}^{∞} (â(r) − a)² p(a|r) da
Because p(r) ≥ 0 we can minimize the inner integral:
d/dâ ∫_{−∞}^{∞} (â(r) − a)² p(a|r) da = −2 ∫_{−∞}^{∞} a p(a|r) da + 2â(r) ∫_{−∞}^{∞} p(a|r) da = 0 ⇒
6. Then the mean-square estimate is:
â_MS(r) = ∫_{−∞}^{∞} a p(a|r) da
We have seen this before as the conditional mean.
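As a quick illustration (a minimal numerical sketch, not from the original slides; the Gaussian prior/likelihood and the observation value are assumptions for demonstration), the conditional mean can be approximated on a grid for any posterior:

```python
import numpy as np

# Minimal sketch: approximate the MMSE estimate (the conditional mean) on a grid.
# The Gaussian prior and likelihood below are illustrative assumptions.
a = np.linspace(-10, 10, 4001)          # grid over the parameter a
da = a[1] - a[0]
r = 1.7                                 # a single observation (assumed)

prior = np.exp(-a**2 / 2)               # p(a), unnormalized N(0, 1)
likelihood = np.exp(-(r - a)**2 / 2)    # p(r|a), unnormalized N(a, 1)
posterior = prior * likelihood
posterior /= posterior.sum() * da       # normalize so it integrates to 1

a_ms = (a * posterior).sum() * da       # â_MS = ∫ a p(a|r) da
print(a_ms)                             # ≈ r/2 = 0.85 for these equal variances
```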
7. d/dâ [ −2 ∫_{−∞}^{∞} a p(a|r) da + 2â(r) ∫_{−∞}^{∞} p(a|r) da ] = 2 > 0
Because the second derivative is positive, â_MS is the minimum.
8. The Bayes estimate for the absolute-value criterion:
R_abs = ∫_{−∞}^{∞} ∫_{−∞}^{∞} |â(r) − a| p(a, r) dr da = ∫_{−∞}^{∞} dr p(r) ∫_{−∞}^{∞} |â(r) − a| p(a|r) da
The inner integral:
I(r) = ∫_{−∞}^{â(r)} (â(r) − a) p(a|r) da + ∫_{â(r)}^{∞} (a − â(r)) p(a|r) da ⇒ dI(r)/dâ(r) = 0 ⇒
9. ∫_{−∞}^{â(r)} p(a|r) da = ∫_{â(r)}^{∞} p(a|r) da ⇒
This is the definition of the median. The absolute-error criterion leads to the estimate of a being the median of p(a|r) given the observation(s) r.
10. R_unif = ∫_{−∞}^{∞} dr p(r) [ 1 − ∫_{â_unif(r)−∆/2}^{â_unif(r)+∆/2} p(a|r) da ]
Minimizing R_unif amounts to maximizing
∫_{â_unif(r)−∆/2}^{â_unif(r)+∆/2} p(a|r) da ⇒
â_unif(r) occurs where p(a|r)|_{a=â(r)} is maximum.
11. This is MAP. A necessary, but not sufficient, condition for a maximum is
d/da {ln p(a|r)} |_{a=â_MAP} = 0
p(a|r) = p(r|a)p(a)/p(r) ⇒
ln p(a|r) = ln p(r|a) + ln p(a) − ln p(r)   (the last term is not a function of a) ⇒
max {ln p(a|r)} ≡ max {ln p(r|a) + ln p(a)}
16. Example
r_i = a + n_i, i = 1, 2, · · · , N, a ∼ N(0, σ_a²), n_i ∼ N(0, σ_n²) ⇒
p(r|a) = Π_{i=1}^{N} (1/(σ_n √(2π))) exp[ −(r_i − a)²/(2σ_n²) ]
p(a) = (1/(σ_a √(2π))) exp[ −a²/(2σ_a²) ]
We need to compute ∫_{−∞}^{∞} a p(a|r) da; one approach is p(a|r) = p(a)p(r|a)/p(r), but this is tedious. However, one can observe that p(a|r) is a PDF; then:
17. p(a|r) = (1/p(r)) { (1/((2π)^{N/2} σ_n^N)) (1/(√(2π) σ_a)) exp[ −Σ_{i=1}^{N} (r_i − a)²/(2σ_n²) ] exp[ −a²/(2σ_a²) ] }
p(a|r) = k(r) exp{ −(1/(2σ_p²)) [ a − (σ_a²/(σ_a² + σ_n²/N)) (1/N) Σ_{i=1}^{N} r_i ]² }
σ_p² = ( 1/σ_a² + N/σ_n² )^{−1}
18. We see that p(a|r) is Gaussian; then â_MS(r) is the conditional mean:
â_MS(r) = (σ_a²/(σ_a² + σ_n²/N)) (1/N) Σ_{i=1}^{N} r_i
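A quick simulation (an illustrative sketch, not part of the slides; all numbers are assumed) confirms that this shrinkage of the sample mean beats the raw sample mean in mean-square error when the prior is correct:

```python
import numpy as np

# Sketch: compare the Bayesian estimate â_MS (shrunk sample mean) with the
# plain sample mean, for a ~ N(0, sig_a^2) and r_i = a + n_i, n_i ~ N(0, sig_n^2).
rng = np.random.default_rng(0)
sig_a, sig_n, N, trials = 1.0, 2.0, 5, 100_000

a = rng.normal(0, sig_a, trials)                     # draw the random parameter
r = a[:, None] + rng.normal(0, sig_n, (trials, N))   # N noisy observations each
r_bar = r.mean(axis=1)

shrink = sig_a**2 / (sig_a**2 + sig_n**2 / N)        # σ_a²/(σ_a² + σ_n²/N)
a_ms = shrink * r_bar

print("MSE of sample mean:", np.mean((r_bar - a)**2))  # ≈ σ_n²/N = 0.8
print("MSE of â_MS:      ", np.mean((a_ms - a)**2))    # smaller: ≈ σ_p² ≈ 0.44
```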
19. If σ_a² ≫ σ_n²/N ⇒ the observed data are much more reliable than the a priori knowledge, and â_MS(r) ≈ (1/N) Σ_{i=1}^{N} r_i.
If σ_a² ≪ σ_n²/N ⇒ the a priori knowledge dominates, the estimate makes little use of the received data, and â_MS(r) ≈ 0.
20. For MAP:
The location where p(a|r) is maximum is the mean of the Gaussian ⇒
â_MAP(r) = â_MS(r) = (σ_a²/(σ_a² + σ_n²/N)) (1/N) Σ_{i=1}^{N} r_i
Also, because the median of a Gaussian occurs at its mean, for this problem:
â_MAP(r) = â_MS(r) = â_abs(r)
21. This invariance to the choice of cost function is important, because subjective judgments are frequently involved in choosing the cost function.
22. An example of a nonlinear problem:
r_m = a³ + ν_m, m = 1, 2, · · · , M, ν_m ∼ N(0, σ_n²), a ∼ N(0, σ_a²)
p(a|r) = k(r) exp{ −(1/2) [ Σ_{m=1}^{M} (r_m − a³)²/σ_n² + a²/σ_a² ] }
â_MAP(r) satisfies the stationarity condition
(1/σ_n²) Σ_{m=1}^{M} [r_m − a³](3a²) − a/σ_a² = 0 at a = â_MAP(r)
28. Example:
Pr(n events|a) = (aⁿ/n!) e^{−a}, n = 0, 1, 2, · · ·
p(a) = λ e^{−λa}, a > 0 ⇒
p(a|n) = Pr(n|a)p(a)/Pr(n) = k(n) (aⁿ/n!) e^{−a} λ e^{−λa}
Because ∫_0^∞ p(a|n) da = 1 ⇒
k(n) = (1 + λ)^{n+1}/λ
29. â_MS(n) = ∫_0^∞ a p(a|n) da = (n + 1)/(λ + 1)
â_MAP(n) = arg max {ln p(n|a) + ln p(a)} = n/(λ + 1)
â_abs(n) solves ∫_0^{â_abs} p(a|n) da = ∫_{â_abs}^{∞} p(a|n) da = 1/2: a polynomial equation with no closed-form solution.
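The following sketch (illustrative only; the λ and n values are assumed) evaluates all three estimates numerically against the closed forms above; the median is obtained from SciPy's gamma distribution, since p(a|n) is Gamma with shape n+1 and rate 1+λ:

```python
from scipy import stats

# Sketch: posterior p(a|n) ∝ a^n e^{-(1+λ)a} is Gamma(shape=n+1, rate=1+λ).
lam, n = 2.0, 5                        # assumed values for illustration
post = stats.gamma(a=n + 1, scale=1.0 / (1.0 + lam))

a_ms = post.mean()                     # (n+1)/(λ+1): mean-square (conditional mean)
a_map = n / (1.0 + lam)                # mode of the Gamma posterior
a_abs = post.median()                  # absolute-error estimate: the median

print(a_ms, a_map, a_abs)              # 2.0, 1.667, with the median in between
```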
30. The first measure of quality is the bias:
E{â(r)} = ∫_{−∞}^{∞} â(r) p(r|a) dr
1. If E{â(r)} = a: unbiased estimate.
2. If E{â(r)} = a + b: biased, but the bias is known.
3. If E{â(r)} = a + b(a): biased, and the bias is unknown.
Even an unbiased estimate can yield bad results.
31. Usually the PDF of the estimate is centered around a. Therefore, the second measure of quality is the variance of the estimate:
var[â(r) − a] = E{ [â(r) − a]² } − B²(a)
General strategy: we shall try to find an unbiased estimate with small variance.
32. Maximum Likelihood Estimation (MLE):
r = a + n, p(r|a) = N(a, σ_n²)
We choose the value of a that most likely caused the given value of r.
The likelihood function (LF) of the observation given a is p(r|a); the log-likelihood function (LLF) is ln p(r|a).
33. We maximize the LF or LLF with respect to the unknown parameter:
â_ML(r) is the value of a at which p(r|a) is maximum.
If â_ML(r) is interior to the range of a, and ln p(r|a) has a maximum there, then a = â_ML(r) is that value.
The ML estimate is the limiting value of MAP as the a priori knowledge → 0:
MAP: ∂/∂a ln p(r|a) + ∂/∂a ln p(a) = 0, where the second term carries the a priori knowledge.
41. If â(r) is any unbiased estimate of a ⇒
var[â(r) − a] ≥ ( E{ [∂/∂a ln p(r|a)]² } )^{−1}
var[â(r) − a] ≥ ( −E{ ∂²/∂a² ln p(r|a) } )^{−1}
These are called the Cramér-Rao lower bounds (CRLB).
42. Any estimate that satisfies the CRLB with equality is called an efficient estimate.
43. Because â(r) is unbiased:
E[â(r) − a] = ∫_{−∞}^{∞} p(r|a)[â(r) − a] dr = 0
∂/∂a {E[â(r) − a]} = ∫_{−∞}^{∞} (∂p(r|a)/∂a)[â(r) − a] dr − 1 = 0
∂p(r|a)/∂a = (∂ ln p(r|a)/∂a) p(r|a) ⇒
∫_{−∞}^{∞} (∂ ln p(r|a)/∂a) p(r|a)[â(r) − a] dr = 1
44. ∫_{−∞}^{∞} [ (∂ ln p(r|a)/∂a) √(p(r|a)) ] [ √(p(r|a)) (â(r) − a) ] dr = 1
Using the Schwarz inequality:
∫_{−∞}^{∞} [∂ ln p(r|a)/∂a]² p(r|a) dr · ∫_{−∞}^{∞} p(r|a)[â(r) − a]² dr ≥ 1
(the second factor is var[â(r) − a]) ⇒
var[â(r) − a] ≥ ( E{ [∂/∂a ln p(r|a)]² } )^{−1}
46. For the 2nd representation:
∫_{−∞}^{∞} p(r|a) dr = 1 ⇒
∫_{−∞}^{∞} ∂p(r|a)/∂a dr = ∫_{−∞}^{∞} (∂ ln p(r|a)/∂a) p(r|a) dr = 0
Differentiating again:
∫_{−∞}^{∞} (∂² ln p(r|a)/∂a²) p(r|a) dr + ∫_{−∞}^{∞} [∂ ln p(r|a)/∂a]² p(r|a) dr = 0
47. E{ ∂² ln p(r|a)/∂a² } = −E{ [∂ ln p(r|a)/∂a]² }
Then the 2nd representation results.
1. From the CRLB, any unbiased estimate must have a variance greater than a certain limit.
2. If ∂ ln p(r|a)/∂a = k(a)[â(r) − a], then â_ML(r) will satisfy the CRLB with equality.
52. Example:
r_i = a + n_i, i = 1, 2, · · · , N, n_i ∼ N(0, σ_n²)
∂ ln p(r|a)/∂a = (N/σ_n²) [ (1/N) Σ_{i=1}^{N} r_i − a ] = 0 ⇒ â_ML(r) = (1/N) Σ_{i=1}^{N} r_i
E[â_ML(r)] = (1/N) Σ_{i=1}^{N} E(r_i) = (1/N) Σ_{i=1}^{N} a = a ⇒
â_ML(r) is an unbiased estimator.
53. The variance of the estimator:
∂² ln p(r|a)/∂a² = −N/σ_n² ⇒
var[â_ML(r) − a] = σ_n²/N → 0 as N → ∞
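A minimal Monte Carlo sketch (all values assumed for illustration) confirms that the sample mean attains the CRLB σ_n²/N here:

```python
import numpy as np

# Sketch: verify empirically that var[â_ML] matches the CRLB σ_n²/N.
rng = np.random.default_rng(1)
a_true, sig_n, N, trials = 3.0, 2.0, 10, 200_000

r = a_true + rng.normal(0, sig_n, (trials, N))
a_ml = r.mean(axis=1)                 # â_ML = sample mean

print("empirical variance:", a_ml.var())      # ≈ 0.4
print("CRLB σ_n²/N:      ", sig_n**2 / N)     # 0.4
```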
54. Example:
Pr(n events|a) = (aⁿ/n!) e^{−a}, n = 0, 1, 2, · · ·
∂ ln p(n = N|a)/∂a = ∂/∂a [N ln a − a − ln N!] = N/a − 1 = (1/a)[N − a] ⇒ â_ML(N) = N
∂² ln p(n = N|a)/∂a² = −N/a² ⇒
var[â_ML(N) − a] = a
55. Example:
r_i = s(a) + n_i, i = 1, 2, · · · , N, n_i ∼ N(0, σ_n²)
p(r|a) = (1/(√(2π) σ_n))^N exp[ −Σ_{i=1}^{N} (r_i − s(a))²/(2σ_n²) ]
∂ ln p(r|a)/∂a = (1/σ_n²) Σ_{i=1}^{N} (r_i − s(a)) ∂s(a)/∂a
In general this cannot be written in the form required by
∂ ln p(r|a)/∂a = k(a)[â(r) − a]
56. Therefore, an unbiased efficient estimate does not exist.
62. Setting ∂ ln p(r|a)/∂a = 0 at a = â_ML(r) ⇒
â_ML(r) = s^{−1}( (1/N) Σ_{i=1}^{N} r_i )
∂² ln p(r|a)/∂a² = (1/σ_n²) Σ_{i=1}^{N} [r_i − s(a)] ∂²s(a)/∂a² − (N/σ_n²) [∂s(a)/∂a]² ⇒
E{ ∂² ln p(r|a)/∂a² } = −(N/σ_n²) [∂s(a)/∂a]²
63. Because E{ Σ [r_i − s(a)] } = 0 ⇒
var[â_ML(r) − a] ≥ σ_n² / ( N [∂s(a)/∂a]² )
64. Example of Bayesian Estimation
Suppose we collect n Poisson-distributed data points with mean θ:
Y_i ∼ Poiss(θ), p(y_i|θ) = e^{−θ} θ^{y_i}/y_i!, y_i ∈ {0, 1, 2, . . .}
The likelihood is:
p(y|θ) = Π_{i=1}^{n} e^{−θ} θ^{y_i}/y_i!
Suppose the prior is exponentially distributed with mean 1/b:
p(θ) = b exp(−bθ) u(θ)
65. The posterior is:
p(θ|y) = p(y|θ)p(θ)/p(y) = [ Π_{i=1}^{n} e^{−θ} θ^{y_i}/y_i! ] b exp(−bθ) / ∫_0^∞ [ Π_{i=1}^{n} e^{−θ̃} θ̃^{y_i}/y_i! ] b exp(−bθ̃) dθ̃
E[Θ|Y = y] = ∫_{−∞}^{∞} θ p(θ|y) dθ = ∫_0^∞ θ e^{−(n+b)θ} θ^T dθ / ∫_0^∞ e^{−(n+b)θ̃} θ̃^T dθ̃
66. ∫_0^∞ θ e^{−(n+b)θ} θ^T dθ = ∫_0^∞ e^{−(n+b)θ} θ^{T+1} dθ = Γ(T + 2)/(n + b)^{T+2}, where T ≡ Σ_{i=1}^{n} y_i
E[Θ|Y = y] = [ Γ(T + 2)/(n + b)^{T+2} ] / [ Γ(T + 1)/(n + b)^{T+1} ] = [ (T + 1)!/(n + b)^{T+2} ] / [ T!/(n + b)^{T+1} ]
E[Θ|Y = y] = ( 1 + Σ_{i=1}^{n} y_i ) / (n + b)
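As a sanity check (a sketch with assumed numbers), the posterior here is Gamma with shape T+1 and rate n+b, so its mean can be compared with (T+1)/(n+b) directly:

```python
import numpy as np
from scipy import stats

# Sketch: Poisson data with an exponential prior give a Gamma(T+1, n+b) posterior.
rng = np.random.default_rng(2)
theta_true, b, n = 4.0, 0.5, 20            # assumed values
y = rng.poisson(theta_true, n)
T = y.sum()

post = stats.gamma(a=T + 1, scale=1.0 / (n + b))
print("posterior mean:", post.mean())       # equals (T+1)/(n+b)
print("closed form:   ", (T + 1) / (n + b))
```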
67. MAP Estimate:
p(y|θ)p(θ) = e^{−nθ} θ^T / (Π_{i=1}^{n} y_i!) · b exp(−bθ)
ln p(θ|y) = −nθ + T ln θ − bθ + const
(d/dθ) ln p(θ|y) = −n + T/θ − b = 0
θ̂_MAP(y) = T/(n + b)
68. Minimum Absolute Error (MAE)
Must solve:
∫_{−∞}^{θ̂_MAE(y)} p(θ|y) dθ = 1/2
∫_0^{θ̂_MAE(y)} e^{−(n+b)θ} θ^T (n + b)^{T+1}/T! dθ = 1/2
∫_0^{θ̂_MAE(y)} e^{−(n+b)θ} θ^T dθ = T!/(2(n + b)^{T+1})
(the left-hand side is an incomplete Gamma function)
Because:
69. ∫_0^∞ e^{−(n+b)θ} θ^T dθ = Γ(T + 1)/(n + b)^{T+1} = T!/(n + b)^{T+1}
The solution for MAE is based on expressing the left-hand side as an "incomplete Gamma function" (gammainc in MATLAB); this has an inverse, so one can solve for θ̂_MAE(y).
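A sketch of that inversion in Python (SciPy's gammaincinv plays the role of MATLAB's inverse incomplete gamma; the T, n, b values below are assumed): since the posterior is Gamma with shape T+1 and rate n+b, θ̂_MAE is simply its median.

```python
from scipy.special import gammaincinv

# Sketch: θ̂_MAE solves P(T+1, (n+b)θ) = 1/2, where P is the regularized lower
# incomplete gamma function; gammaincinv inverts P in its second argument.
T, n, b = 42, 20, 0.5                       # assumed values
theta_mae = gammaincinv(T + 1, 0.5) / (n + b)
print(theta_mae)                            # the posterior median
```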
70. Suppose we instead want to estimate γ = √θ.
Can we just say
γ̂_CME(y) = √(θ̂_CME(y))   (CME = conditional mean estimate)
γ̂_MAP(y) = √(θ̂_MAP(y))  ?
We'll see the answer is NO.
71. Note that knowing γ is equivalent to knowing θ:
p_{Y|Γ}(y|γ) = p_{Y|Θ}(y|θ)
87. Computing the New MAP Estimate
p_{Y|Γ}(y|γ) = Π_{i=1}^{n} e^{−γ²} (γ²)^{y_i}/y_i!, p_Γ(γ) = 2γ b exp(−bγ²)
The new log-posterior:
H = ln p_{Γ|Y}(γ|y) = −nγ² + (2 ln γ)T + ln γ − bγ² + const
dH/dγ = −2nγ + 2T/γ + 1/γ − 2bγ = 0
88. γ̂²_MAP(y) = (2T + 1)/(2n + 2b) = (T + 1/2)/(n + b) ≠ T/(n + b) = θ̂_MAP(y)
γ̂_MAP = √( (T + 1/2)/(n + b) )
As an aside, recall:
θ̂_CME(y) = (T + 1)/(n + b), and similarly γ̂_CME(y) ≠ √(θ̂_CME(y))
If we went through the same exercise for the MMSE estimate, we would likely come to a similar conclusion.
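A one-line numeric check (assumed T, n, b) makes the gap concrete:

```python
import math

# Sketch: the MAP estimate of γ = √θ is not the square root of the MAP estimate of θ.
T, n, b = 42, 20, 0.5                      # assumed values
theta_map = T / (n + b)                    # MAP in the θ parameterization
gamma_map = math.sqrt((T + 0.5) / (n + b)) # MAP in the γ parameterization

print(math.sqrt(theta_map), gamma_map)     # 1.431... vs 1.439...: not equal
```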
89. In general, for Bayesian estimates, the estimate of f(θ) is not f(θ̂(y)), whether MAP, MMSE, MAE, or whatever.
For the special case of affine transformations,
γ = f(θ) = aθ + b ⇒ γ̂ = aθ̂ + b
γ = g(θ) = aθ + b, θ = g^{−1}(γ) = (γ − b)/a
94. θ = (γ − b)/a, p_Γ(γ) = (1/a) p_Θ((γ − b)/a)
MAP estimation:
ln p_{Γ|Y}(γ|y) = ln p_{Y|Θ}(y | (γ − b)/a) + ln(1/a) + ln p_Θ((γ − b)/a)
Similar arguments work for MMSE, MAE, etc.
99. 1. For ML estimates, as N → ∞, â_ML(r) → a in a probability sense; this is called a consistent estimate.
2. The ML estimate is asymptotically efficient:
lim_{N→∞} var[â_ML(r) − a] / ( −E{ ∂² ln p(r|a)/∂a² } )^{−1} = 1
3. The ML estimate is asymptotically Gaussian, N(a, σ_aε)
100. Method of Moments
To find the method-of-moments estimator of θ_1, · · · , θ_p we set up and solve the equations:
µ_1(θ_1, · · · , θ_p) = m_1
µ_2(θ_1, · · · , θ_p) = m_2
...
µ_p(θ_1, · · · , θ_p) = m_p
The kth sample moment is defined to be:
m_k = (1/n) Σ_{i=1}^{n} x_i^k
101. Example
Let x_1, · · · , x_n denote a sample from the uniform distribution from θ_1 to θ_2:
f(x; θ_1, θ_2) = 1/(θ_2 − θ_1) for θ_1 ≤ x ≤ θ_2, and 0 elsewhere.
102. The joint density of x_1, · · · , x_n is:
L(θ_1, θ_2) = f(x_1, . . . , x_n; θ_1, θ_2) = Π_{i=1}^{n} f(x_i; θ_1, θ_2)
= 1/(θ_2 − θ_1)^n if θ_1 ≤ x_1 ≤ θ_2, . . . , θ_1 ≤ x_n ≤ θ_2; 0 elsewhere
= 1/(θ_2 − θ_1)^n if θ_1 ≤ min_i(x_i) and max_i(x_i) ≤ θ_2; 0 elsewhere
103. To find the maximum likelihood estimates of θ_1 and θ_2 we determine ∂L(θ_1, θ_2)/∂θ_1 and ∂L(θ_1, θ_2)/∂θ_2:
∂L/∂θ_1 = n(θ_2 − θ_1)^{−(n+1)} if θ_1 ≤ min_i(x_i) and max_i(x_i) ≤ θ_2; 0 elsewhere
∂L/∂θ_2 = −n(θ_2 − θ_1)^{−(n+1)} if θ_1 ≤ min_i(x_i) and max_i(x_i) ≤ θ_2; 0 elsewhere
104. Note that ∂L(θ_1, θ_2)/∂θ_1 and ∂L(θ_1, θ_2)/∂θ_2 are never equal to zero:
∂L(θ_1, θ_2)/∂θ_1 is always positive and ∂L(θ_1, θ_2)/∂θ_2 is always negative.
105. Hence the maximum likelihood estimates of θ_1 and θ_2 are
θ̂_1 = min_i(x_i), θ̂_2 = max_i(x_i)
This compares with the method-of-moments estimators:
θ̃_1 = x̄ − √( (3/n) Σ_{i=1}^{n} (x_i − x̄)² )
θ̃_2 = x̄ + √( (3/n) Σ_{i=1}^{n} (x_i − x̄)² )
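A short sketch (sample size and true endpoints assumed) computes both sets of estimators on simulated data:

```python
import numpy as np

# Sketch: ML vs method-of-moments estimators for Uniform(θ1, θ2).
rng = np.random.default_rng(3)
theta1, theta2, n = 2.0, 7.0, 50
x = rng.uniform(theta1, theta2, n)

# ML: the sample extremes
ml = (x.min(), x.max())

# Method of moments: match the mean and variance of the uniform distribution
xbar = x.mean()
half_width = np.sqrt(3.0 * np.mean((x - xbar) ** 2))
mom = (xbar - half_width, xbar + half_width)

print("ML: ", ml)
print("MoM:", mom)
```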
106. The sampling distribution of
θ̂_1 = min_i(x_i), θ̂_2 = max_i(x_i)
Solution: we use the distribution-function method. Write θ̂_1 = min_i(x_i) = m and θ̂_2 = max_i(x_i) = M.
G_1(u) = P[m ≤ u] = P[min_i(x_i) ≤ u] = 1 − P[min_i(x_i) ≥ u] = 1 − P[x_1 ≥ u, · · · , x_n ≥ u]
107. = 1 − P[x_1 ≥ u] · · · P[x_n ≥ u] = 1 − ((θ_2 − u)/(θ_2 − θ_1))^n
Thus the density of m = θ̂_1 = min_i(x_i) is
g_1(u) = G_1′(u) = −n ((θ_2 − u)/(θ_2 − θ_1))^{n−1} · (−1/(θ_2 − θ_1)) = n(θ_2 − u)^{n−1}/(θ_2 − θ_1)^n
108. Is m = θ̂_1 = min_i(x_i) unbiased?
E[m] = E[θ̂_1] = E[min_i(x_i)] = ∫_{θ_1}^{θ_2} u g_1(u) du = ∫_{θ_1}^{θ_2} u n(θ_2 − u)^{n−1}/(θ_2 − θ_1)^n du
Put v = θ_2 − u; then the above integral becomes
∫_0^{θ_2−θ_1} (θ_2 − v) n v^{n−1}/(θ_2 − θ_1)^n dv
109. E[θ̂_1] = (n/(θ_2 − θ_1)^n) [ θ_2 ∫_0^{θ_2−θ_1} v^{n−1} dv − ∫_0^{θ_2−θ_1} v^n dv ]
= (n/(θ_2 − θ_1)^n) [ θ_2 (θ_2 − θ_1)^n/n − (θ_2 − θ_1)^{n+1}/(n + 1) ]
110. = θ_2 − n(θ_2 − θ_1)/(n + 1) = (n/(n + 1)) θ_1 + (1/(n + 1)) θ_2 = θ_1 + (θ_2 − θ_1)/(n + 1)
111. Is M = θ̂_2 = max_i(x_i) unbiased?
E[M] = E[θ̂_2] = E[max_i(x_i)] = ∫_{θ_1}^{θ_2} v g_2(v) dv = ∫_{θ_1}^{θ_2} v n(v − θ_1)^{n−1}/(θ_2 − θ_1)^n dv
Put w = v − θ_1; then the above integral becomes
112. ∫_0^{θ_2−θ_1} (w + θ_1) n w^{n−1}/(θ_2 − θ_1)^n dw
E[θ̂_2] = (n/(θ_2 − θ_1)^n) [ ∫_0^{θ_2−θ_1} w^n dw + θ_1 ∫_0^{θ_2−θ_1} w^{n−1} dw ]
= (n/(θ_2 − θ_1)^n) [ (θ_2 − θ_1)^{n+1}/(n + 1) + θ_1 (θ_2 − θ_1)^n/n ]
113. = n(θ_2 − θ_1)/(n + 1) + θ_1 = (1/(n + 1)) θ_1 + (n/(n + 1)) θ_2 = θ_2 − (θ_2 − θ_1)/(n + 1)
E[θ̂_2 − θ̂_1] = E[θ̂_2] − E[θ̂_1]
= [θ_2 − (θ_2 − θ_1)/(n + 1)] − [θ_1 + (θ_2 − θ_1)/(n + 1)]
= [1 − 2/(n + 1)](θ_2 − θ_1)
= ((n − 1)/(n + 1))(θ_2 − θ_1)
114. Hence
E[ ((n + 1)/(n − 1)) (θ̂_2 − θ̂_1) ] = θ_2 − θ_1
We can use this to remove the bias of θ̂_1 and θ̂_2:
T_1 = θ̂_1 − (1/(n + 1)) ((n + 1)/(n − 1)) [θ̂_2 − θ̂_1] = θ̂_1 − (1/(n − 1)) [θ̂_2 − θ̂_1] = m − (M − m)/(n − 1)
115. and
T_2 = θ̂_2 + (1/(n + 1)) ((n + 1)/(n − 1)) [θ̂_2 − θ̂_1] = θ̂_2 + (1/(n − 1)) [θ̂_2 − θ̂_1] = M + (M − m)/(n − 1)
Then T_1 and T_2 are unbiased estimators of θ_1 and θ_2.
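A Monte Carlo sketch (assumed endpoints and sample size) shows the inward bias of the raw extremes and the debiased T1, T2:

```python
import numpy as np

# Sketch: bias of the sample extremes vs the corrected estimators T1, T2.
rng = np.random.default_rng(4)
theta1, theta2, n, trials = 2.0, 7.0, 10, 100_000

x = rng.uniform(theta1, theta2, (trials, n))
m, M = x.min(axis=1), x.max(axis=1)

T1 = m - (M - m) / (n - 1)
T2 = M + (M - m) / (n - 1)

print("E[m], E[M]:  ", m.mean(), M.mean())    # biased inward by (θ2-θ1)/(n+1)
print("E[T1], E[T2]:", T1.mean(), T2.mean())  # ≈ 2.0 and 7.0: unbiased
```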
116. Uniformly Better
Let x = (x_1, x_2, · · · , x_n) denote the vector of observations having joint density f(x|θ), where the unknown parameter vector θ ∈ Ω. Let T(x) and T*(x) be estimators of the parameter φ(θ). Then T(x) is said to be uniformly better than T*(x) if:
MSE_{T(x)}(θ) ≤ MSE_{T*(x)}(θ), θ ∈ Ω
117. Uniformly Minimum Variance Unbiased Estimator
Let x = (x_1, x_2, · · · , x_n) denote the vector of observations having joint density f(x|θ), where the unknown parameter vector θ ∈ Ω. Then T*(x) is said to be the UMVU (uniformly minimum variance unbiased) estimator of φ(θ) if:
E[T*(x)] = φ(θ), θ ∈ Ω
Var[T*(x)] ≤ Var[T(x)] for every T(x) with E[T(x)] = φ(θ).
118. Multiple Parameter Estimation
â_ε(r) = [ â_1(r) − a_1, â_2(r) − a_2, . . . , â_K(r) − a_K ]ᵀ = â(r) − a
Cost function for the MSE criterion:
C(â_ε(r)) = Σ_{i=1}^{K} â²_{εi}(r) = â_εᵀ(r) â_ε(r)
119. Risk:
R_MSE = ∫_{−∞}^{∞} ∫_{−∞}^{∞} C(â_ε(r)) p(r, a) dr da
= ∫_{−∞}^{∞} p(r) dr ∫_{−∞}^{∞} [ Σ_{i=1}^{K} (â_i(r) − a_i)² ] p(a|r) da   (p(r) ≥ 0) ⇒
â_{MSE,i}(r) = ∫_{−∞}^{∞} a_i p(a|r) da, or in vector form â_MSE(r) = ∫_{−∞}^{∞} a p(a|r) da
120. The above estimates carry over under linear transformations:
b = D_{L×K} a, E{ b_εᵀ(r) b_ε(r) } = E[ Σ_{i=1}^{L} b²_{εi}(r) ]
b_MSE(r) = D â_MSE(r)
121. For MAP we find the a that maximizes p(a|r):
∂ ln p(a|r)/∂a_i |_{a=â_MAP} = 0, i = 1, 2, · · · , K
126. Bias:
E{a_ε(r)} = E{â(r) − a} = B(a)
If B(a) equals zero then â(r) is an unbiased estimate.
For vector variables the quantity analogous to the variance is the covariance matrix:
E{ (a_ε − E(a_ε)) (a_ε − E(a_ε))ᵀ } = Λ_ε, with E(a_ε) = B(a)
127. Let's consider any unbiased estimator of a:
σ²_{εi} = var[â_i(r) − a_i] ≥ J^{ii}
where J^{ii} is the ith diagonal element of the inverse of the K × K matrix J, with
J_{ij} = E[ (∂ ln p(r|a)/∂a_i) · (∂ ln p(r|a)/∂a_j) ]
J is called Fisher's information matrix (FIM).
128. For any estimator we are interested in:
1. Bias: E{â(r)}
2. Error covariance: E{ (a_ε − E(a_ε)) (a_ε − E(a_ε))ᵀ }
129. Example:
Consider the random variable Y with
E[Y] = g(U_1, U_2, · · · , U_k) = Σ_{i=1}^{p} β_i φ_i(U_1, U_2, · · · , U_k)
and var(Y) = σ²,
where β_i, i = 1, 2, · · · , p are unknown parameters; φ_i, i = 1, 2, · · · , p are known functions of the nonrandom variables U_i, i = 1, 2, · · · , k; and Y is further assumed to be normally distributed.
134. = h(y) g(β, σ²) exp{ −(1/(2σ²)) [y′y − 2y′Xβ] }
Thus f(y|β, σ²) is a member of the exponential family of distributions and
S = (y′y, X′y)
is a minimal complete set of sufficient statistics.
135. The maximum likelihood estimates of β and σ² are the values β̂ and σ̂² that maximize
L_y(σ², β) = (1/(2πσ²)^{n/2}) exp{ −(1/(2σ²)) [y − Xβ]′[y − Xβ] }
or equivalently
l_y(σ², β) = ln L_y(σ², β)
136. = −(n/2) ln(2π) − (n/2) ln σ² − (1/(2σ²)) [y − Xβ]′[y − Xβ]
= −(n/2) ln(2π) − (n/2) ln σ² − (1/(2σ²)) [y′y − 2y′Xβ + β′X′Xβ]
∂l_y(σ², β)/∂β = 0
yields the system of linear equations (the normal equations):
X′Xβ̂ = X′y
137. while ∂l_y(σ², β)/∂σ² = 0 yields the equation:
σ̂² = (1/n) [y − Xβ̂]′[y − Xβ̂]
If [X′X]^{−1} exists then the normal equations have the solution:
β̂ = (X′X)^{−1} X′y
138. and
σ̂² = (1/n) [y − Xβ̂]′[y − Xβ̂]
= (1/n) [y − X(X′X)^{−1}X′y]′ [y − X(X′X)^{−1}X′y]
= (1/n) [y′y − y′X(X′X)^{−1}X′y]
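A compact numerical sketch of those ML formulas (the design matrix, coefficients, and noise level are assumed); numpy.linalg.lstsq is used instead of forming (X′X)^{−1} explicitly, which is numerically safer:

```python
import numpy as np

# Sketch: ML estimates for the linear-Gaussian model y = Xβ + noise.
rng = np.random.default_rng(5)
n, p = 100, 3
X = rng.normal(size=(n, p))                  # assumed design matrix
beta_true = np.array([1.0, -2.0, 0.5])
y = X @ beta_true + rng.normal(0, 0.3, n)

beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)  # solves the normal equations
resid = y - X @ beta_hat
sigma2_hat = resid @ resid / n               # ML variance estimate

print(beta_hat, sigma2_hat)
```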
139. Almost all problems in statistics can be formulated as problems of making a decision. That is, given some data observed from a phenomenon, a decision has to be made about the phenomenon. Decisions are generally broken into two types:
Estimation decisions
Hypothesis-testing decisions
Probability theory plays a very important role in these decisions.
140. Besides the normal distribution, the following distributions play an important role in estimation and hypothesis testing:
Chi-squared distribution with ν degrees of freedom:
f(x) = (1/(Γ(ν/2) 2^{ν/2})) x^{(ν−2)/2} e^{−x/2}, x > 0
Comment: if z_1, z_2, · · · , z_ν are independent random variables each having a standard normal distribution, then U = Σ_{k=1}^{ν} z_k² has a chi-squared distribution with ν degrees of freedom.
141. F distribution with ν_1 degrees of freedom in the numerator and ν_2 degrees of freedom in the denominator:
f(x) = K x^{(ν_1−2)/2} (1 + (ν_1/ν_2) x)^{−(ν_1+ν_2)/2}, x > 0, K = Γ((ν_1 + ν_2)/2) (ν_1/ν_2)^{ν_1/2} / (Γ(ν_1/2) Γ(ν_2/2))
142. Comment: if U_1 and U_2 are independent random variables having chi-squared distributions with ν_1 and ν_2 degrees of freedom respectively, then
F = (U_1/ν_1)/(U_2/ν_2)
has an F distribution with ν_1 degrees of freedom in the numerator and ν_2 degrees of freedom in the denominator.
143. The t distribution with ν degrees of freedom:
f(x) = K (1 + x²/ν)^{−(ν+1)/2}, K = Γ((ν + 1)/2)/(Γ(ν/2) √(πν))
Comment: if Z and U are independent random variables, and Z has a standard normal distribution while U has a chi-squared distribution with ν degrees of freedom, then
t = Z/√(U/ν)
has a t distribution with ν degrees of freedom.
144. Goal: extract useful information out of messy data.
Strategy: formulate a probabilistic model of the data y, which depends on underlying parameter(s) θ.
Terminology depends on the parameter space:
Detection (simple hypothesis testing): θ ∈ {0, 1}, 0 = target absent, 1 = target present
Classification (multihypothesis testing): θ ∈ {0, 1, · · · , M}, e.g. θ ∈ {DC-9, 747, F-15, MiG-31}
145. Terminology
Suppose θ = (θ_1, θ_2).
If we are only interested in θ_1, then θ_2 are called nuisance parameters.
If θ_1 ∈ {0, 1} and θ_2 are nuisance parameters, we call it a composite hypothesis-testing problem.
146. Ex: Positron Emission Tomography
Simple, traditional linear DSP-based approach: Filtered Back Projection (FBP)
Advanced, estimation-theoretic approach:
Model the Poisson "likelihood" of the collected data
Markov Random Field (MRF) "prior" on the image
Find the estimate using the expectation-maximization algorithm (or a similar technique)
147. Tasks of Statistical Signal Processing: Estimation, Detection, . . .
1. Create a statistical model for the measured data
2. Find fundamental limitations on our ability to perform inference on the data
(a) Cramér-Rao bounds, Chernoff bounds, etc.
3. Develop an optimal (or suboptimal) estimator
4. Asymptotic analysis (i.e., assume we have lots and lots of data) of estimator performance to see if it approaches the bounds derived in (2)
5. Do simulations and experiments comparing algorithm performance to the lower bounds and to competing algorithms
148. A Bayesian analysis treats θ as a random variable with a "prior" density p(θ).
The data-generating machinery is specified by a conditional density p(y|θ), which gives the "likelihood" that the data y resulted from the parameters θ.
Inference usually revolves around the posterior density, derived from Bayes' theorem:
p(θ|y) = p(y|θ)p(θ)/p(y) = p(y|θ)p(θ) / ∫ p(y|θ)p(θ) dθ
149. Classical detection problem:
Design of optimum procedures for deciding between possible statistical situations given a random observation:
H_0 : Y_k ∼ P ∈ P_0, k = 1, · · · , n
H_1 : Y_k ∼ P ∈ P_1, k = 1, · · · , n
The model has the following components:
Parameter space (for parametric detection problems)
Probabilistic mapping from parameter space to observation space
Observation space
150. Parameter Space:
Completely characterizes the output given the mapping.
Each hypothesis corresponds to a point in the parameter space; this mapping is one-to-one.
Probabilistic Mapping from Parameter Space to Observation Space:
The probability law that governs the effect of a parameter on the observation.
151. Example:
Y_k = N_k, where N_k ∼ N(0, σ²) with probability p = 1/2, N_k ∼ N(−1, σ²) with probability p = 1/4, and N_k ∼ N(1, σ²) with probability p = 1/4.
µ = [−1 0 1]ᵀ is the parameter space; p = (1/2, 1/4, 1/4) is the probabilistic mapping.
152. Observation Space:
Finite dimensional, i.e. k = 1, 2, · · · , n where n is finite.
Detection Rule:
The mapping of the observation space into points of the parameter space is called a detection rule.
153. Classical estimation problem:
We are interested not in making a choice among several discrete situations, but rather in making a choice among a continuum of possible states.
Think of a family of distributions on the observation space, indexed by a set of parameters.
Given the observation, determine as accurately as possible the actual value of the parameter.
154. Example:
Y_k = N_k, N_k ∼ N(µ, σ²)
In this example, given the observations, the parameter µ is estimated. Its value is not chosen from a set of discrete values, but rather is estimated as accurately as possible.
155. The estimation problem has the same components as the detection problem:
Parameter space
Probabilistic mapping from parameter space to observation space
Observation space
Estimation rule
The detection problem can be thought of as a special case of the estimation problem. There are a variety of estimation procedures, differing basically in the amount of prior information about the parameter and in the performance criteria applied. Estimation theory is less structured than detection theory: "detection is science, estimation is art" (as put in the book Array Signal Processing by Johnson and Dudgeon).
156. Based on the a priori information about the parameter, there are two basic approaches to parameter estimation:
Bayesian Parameter Estimation: the parameter is assumed to be a random quantity related statistically to the observation.
Nonrandom Parameter Estimation: the parameter is a constant without any probabilistic structure.
157. Estimation theory relies on jargon to characterize the properties of estimators. The following definitions are used:
The set of n observations is represented by the n-dimensional vector y ∈ Γ (the observation space).
The values of the parameters are denoted by the vector θ ∈ Λ (the parameter space).
The estimate of this parameter vector is denoted by the mapping θ̂ : Γ → Λ.
158. Definitions (continued): The estimation error ε(y) (ε for short) is defined as the difference between the estimate and the actual parameter:
ε(y) = θ̂(y) − θ
The function C(a, θ) is the cost of estimating a true value of θ as a.
Given such a cost function C, the Bayes risk (average risk) of the estimator is defined by:
r(θ̂) = E{ E{ C[θ̂(Y), Θ] | Y = y } }
162. Example
Suppose we would like to minimize the Bayes risk
r(θ̂) = E{ E{ C[θ̂(Y), Θ] | Y = y } }
for a given cost function C. By inspection, one can see that the Bayes estimate of θ can be found (if it exists) by minimizing, for each y ∈ Γ, the posterior cost given Y = y:
E{ C[θ̂(Y), Θ] | Y = y }
169. An estimate is said to be unbiased if the expected value of the estimate equals the true value of the parameter:
E{θ̂|θ} = θ
Otherwise the estimate is said to be biased. The bias b(θ) is usually considered to be additive, so that:
b(θ) = E{θ̂|θ} − θ
170. An estimate is said to be asymptotically unbiased if the bias tends to zero as the number of observations tends to infinity.
An estimate is said to be consistent if the mean-squared estimation error tends to zero as the number of observations becomes large.
171. An efficient estimate has a mean-squared error that equals a particular lower bound: the Cramér-Rao bound. If an efficient estimate exists, it is optimum in the mean-squared sense: no other estimate has a smaller mean-squared error.
The following shorthand notations will also be used for brevity:
p_θ(y) = p_{y|θ}(y|θ) = probability density of y given θ
E_θ{y} = E{y|θ}
172. The following definitions and theorems will be useful later in the presentation:
Definition: Sufficiency
Suppose that Λ is an arbitrary set. A function T : Γ → Λ is said to be a sufficient statistic for the parameter set θ ∈ Λ if the distribution of y conditioned on T(y) does not depend on θ for θ ∈ Λ.
If knowing T(y) removes any further dependence on θ of the distribution of y, one can conclude that T(y) contains all the information in y that is useful for estimating θ. Hence, it is sufficient.
173. Definition: Minimal Sufficiency
A function T on Γ is said to be minimal sufficient for the parameter set θ ∈ Λ if it is a function of every other sufficient statistic for θ.
A minimal sufficient statistic represents the furthest reduction of the observation without destroying information about θ.
A minimal sufficient statistic does not necessarily exist for every problem. Even if it exists, it is usually very difficult to identify.
174. Let {x_1, x_2, · · · , x_n} denote a set of observations with joint density f(x_1, x_2, · · · , x_n; θ_1, θ_2, · · · , θ_p). Then the statistics S_1 = S_1(x_1, x_2, · · · , x_n), . . . , S_q = S_q(x_1, x_2, · · · , x_n) are called a set of sufficient statistics if the conditional distribution of {x_1, x_2, · · · , x_n} given S_1, S_2, · · · , S_q is functionally independent of the parameters {θ_1, θ_2, · · · , θ_p}.
175. Example
Suppose that we observe a success-failure experiment (Bernoulli trial) n = 3 times. Let π denote the probability of success. Let x_1, x_2, x_3 denote the observations, where
x_i = 1 if the ith trial is a success, 0 if the ith trial is a failure.
Let S = x_1 + x_2 + x_3 = the total number of successes.
177. The data x_1, x_2, x_3 can be thought of as generated in two ways:
1. Generate the data x_1, x_2, x_3 directly from the joint density f(x_1, x_2, · · · , x_n; θ_1, θ_2, · · · , θ_p), or
2. Generate the sufficient statistics S_1, S_2, · · · , S_q from their joint sampling distribution, then generate the observations x_1, x_2, x_3 from the conditional distribution of x_1, x_2, x_3 given S_1, S_2, · · · , S_q. Since the second step is independent of the parameters θ_1, θ_2, · · · , θ_p, all of the information about the parameters is determined by the results of the first step.
178. Principle of sufficiency
Any decision about the parameters θ_1, θ_2, · · · , θ_p should be made using the values of the sufficient statistics S_1, S_2, · · · , S_q and not otherwise on the data x_1, · · · , x_n.
179. The Likelihood Principle
Any decision about the parameters θ_1, θ_2, · · · , θ_p should be made using the values of the likelihood function L(θ_1, θ_2, · · · , θ_p) and not otherwise on the data x_1, · · · , x_n.
182. Let x_1, · · · , x_n denote a set of observations with joint density f(x_1, x_2, · · · , x_n; θ_1, θ_2, · · · , θ_p). Then S_1 = S_1(x_1, · · · , x_n), . . . , S_q = S_q(x_1, · · · , x_n) are a set of sufficient statistics if the joint density satisfies:
f(x_1, x_2, · · · , x_n; θ_1, θ_2, · · · , θ_p) = g(S_1, S_2, · · · , S_q; θ_1, θ_2, · · · , θ_p) h(x_1, · · · , x_n)
i.e., the dependence on the parameters factors out through the sufficient statistics.
183. Example
Let x_1, x_2, · · · , x_n denote a sample from the normal distribution with mean µ and variance σ². The density of x_i is:
f(x_i) = (1/(√(2π) σ)) e^{−(x_i−µ)²/(2σ²)}
And the joint density of (x_1, x_2, · · · , x_n) is:
f(x_1, . . . , x_n; µ, σ²) = Π_{i=1}^{n} (1/(√(2π) σ)) e^{−(x_i−µ)²/(2σ²)}
184. f(x_1, . . . , x_n; µ, σ²) = (1/(2πσ²)^{n/2}) exp[ −(1/(2σ²)) Σ_{i=1}^{n} (x_i − µ)² ]
= (1/(2πσ²)^{n/2}) exp[ −(1/(2σ²)) ( Σ_{i=1}^{n} x_i² − 2µ Σ_{i=1}^{n} x_i + nµ² ) ]
with
Σ_{i=1}^{n} x_i² = Σ_{i=1}^{n} (x_i − x̄)² + nx̄² = (n − 1)s² + nx̄², Σ_{i=1}^{n} x_i = nx̄
185. f(x_1, . . . , x_n; µ, σ²) = (1/(2πσ²)^{n/2}) e^{−(1/(2σ²))((n−1)s² + nx̄² − 2nµx̄ + nµ²)} = h(x_1, . . . , x_n) g(x̄, s; µ, σ²)
where
g(x̄, s; µ, σ²) = (1/(2πσ²)^{n/2}) e^{−(1/(2σ²))((n−1)s² + nx̄² − 2nµx̄ + nµ²)}, h(x_1, . . . , x_n) = 1
Thus x̄ and s are sufficient statistics.
186. The Factorization Theorem:
Suppose that the parameter set θ ∈ Λ has a corresponding family of densities p_θ. A statistic T is sufficient for θ iff there are functions g_θ and h such that
p_θ(y) = g_θ[T(y)] h(y)
∀y ∈ Γ and θ ∈ Λ.
187. Example
Consider the hypothesis-testing problem Λ = {0, 1} with densities p_0 and p_1. Noting that
p_θ(y) = p_0(y) if θ = 0; (p_1(y)/p_0(y)) p_0(y) if θ = 1,
the factorization p_θ(y) = g_θ[T(y)] h(y) is possible with
h(y) = p_0(y), T(y) = p_1(y)/p_0(y) ≡ L(y), g_θ(t) = 1 if θ = 0, t if θ = 1.
Thus the likelihood ratio L is a sufficient statistic for the binary hypothesis-testing problem.
188. Rao-Blackwell Theorem
Suppose that ĝ(y) is an unbiased estimate of g(θ) and that T is sufficient for θ. Define
g̃[T(y)] = E_θ{ĝ(Y)|T(Y) = T(y)}
Then g̃[T(y)] is also an unbiased estimate of g(θ). Furthermore,
Var_θ(g̃[T(Y)]) ≤ Var_θ(ĝ(Y)),
with equality iff P_θ(ĝ(Y) = g̃[T(Y)]) = 1.
189. Let (x_1, x_2, · · · , x_n) denote a set of observations with joint density f(x_1, x_2, · · · , x_n; θ_1, θ_2, · · · , θ_p). Let S_1 = S_1(x_1, x_2, · · · , x_n), . . . , S_q = S_q(x_1, x_2, · · · , x_n) denote a set of sufficient statistics. Let t(x_1, x_2, · · · , x_n) be any unbiased estimator of the parameter φ = g(θ_1, θ_2, · · · , θ_p). Then there exists an unbiased estimator T(S_1, · · · , S_q) of φ such that
Var(T) ≤ Var(t)
190. Proof
Let
T(S_1, . . . , S_k) = E(t(x_1, . . . , x_n)|S_1, . . . , S_k),
the conditional expectation of t given S_1, · · · , S_k:
T(S_1, . . . , S_k) = ∫ · · · ∫ t(x_1, . . . , x_n) g(x_1, . . . , x_n|S_1, . . . , S_k) dx_1 . . . dx_n
Now t is an unbiased estimator of φ = g(θ_1, · · · , θ_p); hence E[t] = φ. Also:
191. E[t] = E_{S_1,...,S_k}[ E[t|S_1, . . . , S_k] ] = E_{S_1,...,S_k}[ T(S_1, . . . , S_k) ] = φ
Thus T is also an unbiased estimator of φ = g(θ_1, θ_2, · · · , θ_p). Finally,
Var[t] = Var_{S_1,...,S_k}[ E[t|S_1, . . . , S_k] ] + E_{S_1,...,S_k}[ Var[t|S_1, . . . , S_k] ] ≥ Var_{S_1,...,S_k}[ T(S_1, . . . , S_k) ]
since E_{S_1,...,S_k}[ Var[t|S_1, . . . , S_k] ] ≥ 0. QED.
192. The Rao-Blackwell theorem states that if you have any unbiased estimator t of a parameter (one that depends arbitrarily on the observations), then you can find a better unbiased estimator (smaller variance) that is a function solely of the sufficient statistics. Thus the best unbiased estimator (minimum variance) has to be a function of the sufficient statistics.
193. Thus the search for the UMVU estimator (uniformly minimum variance unbiased estimator) is among functions that depend solely on the sufficient statistics.
194. Example
Suppose that Γ = {0, 1, · · · , n}, Λ = [0, 1], and
p_θ(y) = (n!/(y!(n − y)!)) θ^y (1 − θ)^{n−y}, y = 0, . . . , n, 0 < θ < 1
For any function f on Γ, we have
E_θ{f(Y)} = Σ_{y=0}^{n} (n!/(y!(n − y)!)) f(y) θ^y (1 − θ)^{n−y} = (1 − θ)^n Σ_{y=0}^{n} a_y x^y
where x = θ/(1 − θ) and a_y = (n!/(y!(n − y)!)) f(y).
The condition E_θ{f(Y)} = 0 for all θ ∈ Λ implies that Σ_{y=0}^{n} a_y x^y = 0, ∀x > 0.
195. However, an nth-order polynomial has at most n zeros unless all of its coefficients are zero. Hence every a_y, and thus every f(y), is zero, and the statistic Y is complete for θ ∈ Λ.
196. Let (x_1, x_2, · · · , x_n) denote a set of observations with joint density f(x_1, x_2, · · · , x_n; θ_1, θ_2, · · · , θ_p), and let S_1 = S_1(x_1, x_2, · · · , x_n), . . . , S_q = S_q(x_1, x_2, · · · , x_n) denote a set of sufficient statistics. Then S_1, · · · , S_q are called a set of complete sufficient statistics if
E(h(S_1, · · · , S_q)) = 0 ⇒ h(S_1, · · · , S_q) = 0
197. That is, S_1, · · · , S_q are a set of complete sufficient statistics if
∫ · · · ∫ h(S_1, . . . , S_k) g(S_1, . . . , S_k|θ_1, . . . , θ_p) dS_1 . . . dS_k = 0
implies h(S_1, . . . , S_k) = 0.
Completeness is sometimes difficult to prove.
198. Example
Suppose that we observe a success-failure experiment (Bernoulli trial) n = 3 times. Let π denote the probability of success, and let x_1, x_2, x_3 denote the observations, where x_i = 1 if the ith trial is a success and 0 if it is a failure. Let S = x_1 + x_2 + x_3 = the total number of successes.
200. S is a sufficient statistic. Is it a complete sufficient statistic?
The sampling distribution of S:
S    g(S; π)
0    (1 − π)³
1    3π(1 − π)²
2    3π²(1 − π)
3    π³
E[h(S)] = h(0)(1 − π)³ + h(1) 3π(1 − π)² + h(2) 3π²(1 − π) + h(3)π³
= h(0) + 3[h(1) − h(0)]π + 3[h(0) − 2h(1) + h(2)]π² + [h(3) − 3h(2) + 3h(1) − h(0)]π³
201. If E[h(S)] = 0 for all values of π, i.e.,
p(π) = h(0) + 3[h(1) − h(0)]π + 3[h(0) − 2h(1) + h(2)]π² + [h(3) − 3h(2) + 3h(1) − h(0)]π³ = 0, then
h(0) = 0, 3[h(1) − h(0)] = 0, h(0) − 2h(1) + h(2) = 0, h(3) − 3h(2) + 3h(1) − h(0) = 0
Thus h(0) = h(1) = h(2) = h(3) = 0.
202. S is a complete sufficient statistic.
203. If S_1, · · · , S_q are a set of complete sufficient statistics and T_1 = h_1(S_1, · · · , S_q) and T_2 = h_2(S_1, · · · , S_q) are unbiased estimators of φ, then E(T_1) = E(T_2) = φ, hence E(T_1 − T_2) = 0. By completeness, h_1(S_1, · · · , S_q) − h_2(S_1, · · · , S_q) = 0, and so T_1 = T_2.
Thus there is only one unbiased estimator of φ that is a function of complete sufficient statistics.
204. Lehmann-Scheffé Theorem
Let x_1, x_2, · · · , x_n denote a set of observations with joint density f(x_1, · · · , x_n; θ_1, · · · , θ_q). Let S_1 = S_1(x_1, · · · , x_n), . . . , S_q = S_q(x_1, · · · , x_n) denote a set of complete sufficient statistics.
205. Let T(S_1, · · · , S_q) be an unbiased estimator of the parameter φ = g(θ_1, · · · , θ_q). Then T is the uniformly minimum variance unbiased (UMVU) estimator of φ; that is, if t(x_1, x_2, · · · , x_n) is any unbiased estimator of φ, then
Var(T) ≤ Var(t)
206. Example
We observe a success-failure experiment (Bernoulli trial) n = 3 times; the probability of success is π. Let x_1, x_2, x_3 denote the observations (x_i = 1 for a success, 0 for a failure), and let S = x_1 + x_2 + x_3 = the total number of successes.
E[S/3] = E[(x_1 + x_2 + x_3)/3] = (E[x_1] + E[x_2] + E[x_3])/3 = (π + π + π)/3 = π
207. S/3 is an unbiased estimator of π.
S/3 is the uniformly minimum variance unbiased (UMVU) estimator of π.
208. The strategy to find the UMVU estimator:
1. Find a set of complete sufficient statistics S_1, S_2, · · · , S_k.
2. Find an unbiased estimator that depends only on the set of complete sufficient statistics, T(S_1, S_2, · · · , S_k).
3. Apply the Lehmann-Scheffé theorem.
Maximum likelihood estimators are functions of a set of complete sufficient statistics S_1, S_2, · · · , S_k.
210. 1. Finding maximum likelihood estimators.
2. Checking if there is a set of complete sufficient statistics S_1, · · · , S_k.
3. Checking if the maximum likelihood estimators are unbiased.
4. Making adjustments to these estimators if they are not unbiased.
This is the standard way of finding UMVU estimators.
211. Being Careful
• Suppose Y has the density
p(y; µ) = Π_{i=1}^{n} (1/(aσc)) exp[ −((y_i − µ)/(aσ))⁴ ]
where a ≈ 1.4464 and c is a constant which makes the density normalize to 1.
l(µ) = −Σ_{i=1}^{n} ((y_i − µ)/(aσ))⁴
• Try the usual "set derivative equal to zero":
dl(µ)/dµ = 4 Σ_{i=1}^{n} ((y_i − µ)/(aσ))³ = 0
• This will be a 3rd-order polynomial in µ, which in general has 3 solutions.
• We would have to compute the loglikelihood at each solution and see which one gives the biggest result.
• Suppose Y = Y_1, · · · , Y_n is i.i.d. with a Cauchy-like density (but not really
Cauchy!) (Maximum Penalized Likelihood Estimation):
p(y; c, γ) = Π_{i=1}^{n} γ / (2(γ + |y_i − c|)²)
• Loglikelihood:
l(c, γ) = n ln γ − 2 Σ_{i=1}^{n} ln(γ + |y_i − c|)
l(y; c, γ) = n ln γ − 2 Σ_{i∈{y_i≥c}} ln[γ + (y_i − c)] − 2 Σ_{i∈{y_i<c}} ln[γ + (c − y_i)]
• Suppose γ is given and we want to estimate c; taking the derivative:
dl(y; c)/dc = −2 Σ_{i∈{y_i≥c}} (−1)/(γ + y_i − c) − 2 Σ_{i∈{y_i<c}} 1/(γ + c − y_i) = 2 Σ_{i=1}^{n} sign(y_i − c)/(γ + |y_i − c|)
• We would be tempted to say that the ML
estimator of c is just the solution of
dl(y; c)/dc = 2 Σ_{i=1}^{n} sign(y_i − c)/(γ + |y_i − c|) = 0
• Well. . . there's actually more than one solution to this, so how about picking the solution which gives the greatest likelihood?
• But that is really a trap!!!
• Let's check out the second derivative:
d²l(y; c)/dc² = 2 Σ_{i=1}^{n} [sign(y_i − c)]²/(γ + |y_i − c|)² > 0
• Those critical points were really local minima!!!
• Where is the real maximum?
• Notice that |x| is not differentiable at x = 0; hence
l(y; c, γ) = n ln γ − 2 Σ_{i=1}^{n} ln(γ + |y_i − c|)
is not differentiable at c = any of the data points!
• To get the real ML estimate, try each y_i for c, and see which one gives the biggest likelihood.
• "It turns out" the ML estimate is one of the median points.
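A sketch of that brute-force search (the data and γ value are assumed) evaluates the loglikelihood at each data point and keeps the best:

```python
import numpy as np

# Sketch: for the Cauchy-like density, candidates for the ML estimate of c
# are the data points themselves; evaluate l(y; c) at each and keep the best.
rng = np.random.default_rng(6)
y = rng.standard_cauchy(15)                 # assumed heavy-tailed sample
gamma = 1.0                                 # γ treated as known

def loglik(c):
    return -2.0 * np.log(gamma + np.abs(y - c)).sum()  # n ln γ term is constant

c_ml = max(y, key=loglik)                   # best candidate among the data points
print(c_ml, np.median(y))                   # the ML estimate sits at a median point
```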
215. Expectation-Maximization Algorithm
• The EM procedure is a way of making iterative algorithms for maximizing loglikelihoods or Bayesian logposteriors when no closed-form solution is available.
• There's a more powerful and more general EM formulation by Csiszár based on information theory.
EM Algorithm
• Incomplete data Y: what we actually measure.
• Goal: maximize the incomplete-data loglikelihood (a function of the specific collected data):
l_id(θ) = log p_Y(y; θ)
• Complete data Z: a hypothetical data set.
• Tool: the complete-data loglikelihood (a function of the complete data as a random variable):
l_cd(θ) = log p_Z(z; θ)|_{z=Z} = log p_Z(Z; θ)
• The complete-data space must be "larger" and
determine the incomplete data, i.e. there must be a many-to-one mapping y = h(z).
The EM Recipe
• Step 1: Decide on a complete data space.
• Step 2: The expectation step:
Q(θ; θ̂_old) = E[l_cd(θ)|Y = y; θ̂_old]
• Step 3: The maximization step:
θ̂_new = arg max_θ Q(θ; θ̂_old)
• Start with a feasible initial guess θ̂_old, then iterate steps 2 and 3 (which can usually be combined).
What is that Expectation?
E[l_cd|Y = y; θ̂_old] = ∫ p_{Z|Y}(z|y; θ̂_old) log p_Z(z; θ) dz
p_{Z|Y}(z|y; θ̂_old) = p_Z(z; θ̂_old) / ∫_{Z(y)} p_Z(z̃; θ̂_old) dz̃ for z ∈ Z(y); 0 for z ∉ Z(y)
217. ∫_{Z(y)} p_Z(z; θ̂_old) dz = p_Y(y; θ̂_old)
Aspects of EM Algorithms
• The incomplete-data loglikelihood is guaranteed to increase with each EM iteration.
• Must be careful; it might converge to a local maximum which depends on the starting point.
• Often, the estimates naturally stay in the feasible space (i.e., nonnegativity constraints).
• In many problems, a candidate complete data space naturally suggests itself.
Ex: Poisson Signal in Additive Poisson Noise
Y = S + N, S ∼ Poisson(θ), N ∼ Poisson(λ_N)
• The incomplete-data loglikelihood is
l_id(θ) = −(θ + λ_N) + y ln(θ + λ_N) + const
218. • The ML estimator can be found in closed form:
θ̂(y) = max(0, y − λ_N)
Choose the Complete Data
• One can often choose the complete data in several different ways; try to choose to make the remaining steps easy.
• Different choices lead to different algorithms; some will converge "faster" than others.
• Here, take the complete data to be Z = (S, N); suppose we could magically measure the signal and noise counts separately!
• The complete-data loglikelihood is:
l_cd(θ) = [−θ + S ln(θ)] + [−λ_N + N ln(λ_N)]
The E-Step
Q(θ; θ̂_old) = E[l_cd(θ)|Y = y; θ̂_old] = E[−(θ + λ_N) + S ln(θ) + N ln(λ_N) | y; θ̂_old] = −(θ + λ_N) + E[S|y; θ̂_old] ln(θ) + E[N|y; θ̂_old] ln(λ_N)
219. • It is often convenient to leave explicit computation of the conditional expectation until the last minute.
• As with loglikelihoods, we sometimes drop terms which are constant w.r.t. θ.
The M-Step
θ̂_new = arg max_{θ≥0} Q(θ; θ̂_old)
• Take the derivative as usual:
(d/dθ) Q(θ; θ̂_old) = −1 + E[S|y; θ̂_old]/θ
• Setting this equal to zero yields
θ̂_new = E[S|y; θ̂_old]
• Now we just have to compute that expectation. (That's usually the hardest part.)
That Conditional Expectation
E[S|y; θ̂_old] = ∫ s p_S(s|y; θ̂_old) ds
220. • Let's look at the conditional density:
p_S(s|y; θ̂_old) = p_{Y|S}(y|s; θ̂_old) p_S(s; θ̂_old) / p_Y(y; θ̂_old)
= [ exp[−λ_N] λ_N^{y−s}/(y − s)! · I(y ≥ s) · exp[−θ̂_old] (θ̂_old)^s/s! ] / [ exp[−(θ̂_old + λ_N)] (θ̂_old + λ_N)^y/y! ]
= (y!/(s!(y − s)!)) · (λ_N^{y−s}/(θ̂_old + λ_N)^{y−s}) · ((θ̂_old)^s/(θ̂_old + λ_N)^s) · I(s ≤ y)
• We observe that the conditional density is just binomial. For 0 ≤ s ≤ y,
p_S(s|y; θ̂_old) = (y choose s) (θ̂_old/(θ̂_old + λ_N))^s (λ_N/(θ̂_old + λ_N))^{y−s}
E[S|y; θ̂_old] = y θ̂_old/(θ̂_old + λ_N)
• So this particular EM algorithm is:
θ̂_new = E[S|y; θ̂_old] = y θ̂_old/(θ̂_old + λ_N)
• Let's see if our analytic formula for the maximizer, θ̂ = max(0, y − λ_N), is a fixed point of the EM iteration.
221. • For y > λ_N: substituting θ̂_old = y − λ_N gives
θ̂_new = y (y − λ_N)/(y − λ_N + λ_N) = y − λ_N
• For y ≤ λ_N (fixed point θ̂ = 0), we immediately get 0 = 0.
• So everything is good.
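A tiny sketch of this EM iteration (y, λ_N, and the starting point are assumed) shows it converging to max(0, y − λ_N):

```python
# Sketch: EM iteration for a Poisson signal in additive Poisson noise.
y, lam_N = 10.0, 3.0        # assumed observation and known noise rate
theta = 1.0                 # feasible initial guess

for _ in range(200):
    theta = y * theta / (theta + lam_N)   # θ_new = E[S | y; θ_old]

print(theta)                # → 7.0 = max(0, y - λ_N), the closed-form ML estimate
```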
Back in Bayesian Land
• EM is also good for MAP estimation; just add the logprior to the Q-function:
θ̂_new = arg max_{θ≥0} Q_P(θ; θ̂_old)
Q_P(θ; θ̂_old) = E[l_cd|Y = y; θ̂_old] + log p(θ)
• Consider the previous example, with an exponential prior with mean 1/a:
Q_P(θ; θ̂_old) = −θ + E[S|y; θ̂_old] ln(θ) − aθ
dQ_P(θ; θ̂_old)/dθ = −1 + E[S|y; θ̂_old]/θ − a
222. θ̂_new = E[S|y; θ̂_old]/(1 + a) = (θ̂_old/(θ̂_old + λ_N)) y/(1 + a)
Convergence of the EM Algorithm (Theory)
• We'd like to prove that the likelihood goes up with each iteration:
L_id(θ̂_new) ≥ L_id(θ̂_old)
• Recall from the last pages:
p_{Z|Y}(z|y; θ) = p_Z(z; θ)/p_Y(y; θ), z ∈ Z(y) = {z : h(z) = y}
ln p_Y(y; θ) = ln p_Z(z; θ) − ln p_{Z|Y}(z|y; θ)
• Multiply both sides by p_{Z|Y}(z|y; θ̂_old) and integrate with respect to z:
∫ p_{Z|Y}(z|y; θ̂_old) ln p_Y(y; θ) dz = ∫ p_{Z|Y}(z|y; θ̂_old) ln p_Z(z; θ) dz − ∫ p_{Z|Y}(z|y; θ̂_old) ln p_{Z|Y}(z|y; θ) dz
223. • This simplifies to:
L_id(θ) = Q(θ; θ̂_old) − ∫ p_{Z|Y}(z|y; θ̂_old) ln p_{Z|Y}(z|y; θ) dz
• Evaluate at θ = θ̂_new and θ = θ̂_old:
L_id(θ̂_new) = Q(θ̂_new; θ̂_old) − ∫ p_{Z|Y}(z|y; θ̂_old) ln p_{Z|Y}(z|y; θ̂_new) dz   (♠)
L_id(θ̂_old) = Q(θ̂_old; θ̂_old) − ∫ p_{Z|Y}(z|y; θ̂_old) ln p_{Z|Y}(z|y; θ̂_old) dz   (♣)
• Subtract ♣ from ♠:
L_id(θ̂_new) − L_id(θ̂_old) = Q(θ̂_new; θ̂_old) − Q(θ̂_old; θ̂_old) − ∫ p_{Z|Y}(z|y; θ̂_old) ln [ p_{Z|Y}(z|y; θ̂_new)/p_{Z|Y}(z|y; θ̂_old) ] dz
• A really helpful inequality: ln x ≤ x − 1 ⇒
L_id(θ̂_new) − L_id(θ̂_old) ≥ Q(θ̂_new; θ̂_old) − Q(θ̂_old; θ̂_old) − ∫ p_{Z|Y}(z|y; θ̂_old) [ p_{Z|Y}(z|y; θ̂_new)/p_{Z|Y}(z|y; θ̂_old) − 1 ] dz
(focus on the last term)
224. ∫ p_{Z|Y}(z|y; θ̂_old) [ p_{Z|Y}(z|y; θ̂_new)/p_{Z|Y}(z|y; θ̂_old) − 1 ] dz
= ∫ p_{Z|Y}(z|y; θ̂_new) dz − ∫ p_{Z|Y}(z|y; θ̂_old) dz = 0
• Now we have:
L_id(θ̂_new) − L_id(θ̂_old) ≥ Q(θ̂_new; θ̂_old) − Q(θ̂_old; θ̂_old)
• Recall the definition of the M-step:
θ̂_new = arg max_θ Q(θ; θ̂_old)
• So, by definition,
Q(θ̂_new; θ̂_old) ≥ Q(θ̂_old; θ̂_old) ⇒ L_id(θ̂_new) ≥ L_id(θ̂_old)
• Notice we showed that the likelihood is nondecreasing; that doesn't automatically
225. imply that the parameter estimates converge.
• The parameter estimate could slide along a contour of constant loglikelihood.
• One can prove some things about parameter convergence in special cases.
Ex: EM Algorithm for Imaging from Poisson Data (i.e. Emission Tomography)
Generalized EM Algorithms
• Recall this line:
L_id(θ̂_new) − L_id(θ̂_old) ≥ Q(θ̂_new; θ̂_old) − Q(θ̂_old; θ̂_old)
• What if the M-step is too hard? Try a "generalized" EM algorithm:
θ̂_new = some easy-to-compute θ such that Q(θ; θ̂_old) ≥ Q(θ̂_old; θ̂_old)
• Problem: EM algorithms tend to be slow.
• Observation: "bigger" complete data spaces result in slower algorithms than "smaller" complete data spaces.
226. • SAGE (Space-Alternating Generalized Expectation-Maximization):
– Split the big complete data space into several smaller "hidden" data spaces
– Designed to yield faster convergence
• A generalization of the "ordered subsets" EM algorithm.
227. Wiener Filtering
• Context: Bayesian linear MMSE estimation for random sequences.
• Parameter sequence {Θ_k, k ∈ Z}
• Data sequence {Y_k, k ∈ I ⊂ Z}
• Goal: estimate {θ_k} as a linear function of the observations:
θ̂_k(y) = Σ_{j∈I} h(k, j) y_j
• Find h to minimize the mean square error.
• By the orthogonality principle,
E[(θ̂_k(Y) − Θ_k) Y*_i] = 0 for i ∈ I
E[(Σ_{j∈I} h(k, j) Y_j − Θ_k) Y*_i] = 0
Σ_{j∈I} h(k, j) E[Y_j Y*_i] = E[Θ_k Y*_i]
Σ_{j∈I} h(k, j) r_Y(j, i) = r_{ΘY}(k, i)   (this is the Wiener-Hopf equation)
228. • If the processes are stationary, we can write
Σ_{j∈I} h(k, j) r_Y(j − i) = r_{ΘY}(k − i)
• If I = Z it turns out the filter is LTI:
Σ_{j∈I} h(k − j) r_Y(j − i) = r_{ΘY}(k − i)
• Consider i = 0:
Σ_{j∈I} h(k − j) r_Y(j) = r_{ΘY}(k)
• Can solve W-H in the Z-transform domain:
H(z) S_Y(z) = S_{ΘY}(z) ⇒ H(z) = S_{ΘY}(z)/S_Y(z)
• MSE:
MSE = E[|Θ_k − θ̂_k(Y)|²] = E[(Θ_k − θ̂_k(Y))(Θ*_k − θ̂*_k(Y))]
= E[(Θ_k − θ̂_k(Y))Θ*_k] + E[(Θ_k − θ̂_k(Y))θ̂*_k(Y)]   (the second term is zero by orthogonality)
= E[Θ_k Θ*_k] − E[θ̂_k(Y)Θ*_k]
229. MSE = E[Θ_k Θ*_k] − E[θ̂_k(Y)Θ*_k] = E[Θ_k Θ*_k] − E[Σ_{j∈I} h(k − j) Y_j Θ*_k]
= E[Θ_k Θ*_k] − Σ_{j∈I} h(k − j) E[Y_j Θ*_k] = r_Θ(0) − Σ_{j∈I} h(k − j) r_{YΘ}(j − k)
• Since everything is stationary, we can just take k = 0:
MSE = r_Θ(0) − (h ∗ r_{YΘ})(0)
= ∫_{−π}^{π} [ S_Θ(ω) − H(ω) S_{YΘ}(ω) ] dω
= ∫_{−π}^{π} [ S_Θ(ω) − (S_{ΘY}(ω)/S_Y(ω)) S_{YΘ}(ω) ] dω
= ∫_{−π}^{π} [ S_Θ(ω) − |S_{ΘY}(ω)|²/S_Y(ω) ] dω
230. Deblurring
• Suppose the object is observed through a blurring point spread function f and additive noise W:
Y_k = (f ∗ Θ)_k + W_k
• Suppose Θ and W are uncorrelated and zero-mean:
S_Y = |F|² S_Θ + S_W, and S_{ΘY} = F* S_Θ
• So the Wiener filter is
H(z) = S_{ΘY}(z)/S_Y(z) = F*(z) S_Θ(z) / (|F(z)|² S_Θ(z) + S_W(z))
Interpretation of the Deblurring Filter
• If the noise is negligible, i.e. S_W(ω) ≈ 0:
H(ω) = F* S_Θ / (|F|² S_Θ + S_W) ≈ F* S_Θ / (F F* S_Θ) = 1/F
• Even if there is no noise, in implementation, straight division by F(ω) is often ill-posed and not a good idea (round-off errors, etc.)
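A minimal frequency-domain sketch of the deblurring filter (the signal, blur kernel, noise level, and spectra below are all assumptions; a real Wiener filter would estimate the spectra rather than use the true ones):

```python
import numpy as np

# Sketch: Wiener deblurring in the DFT domain with assumed (known) spectra.
rng = np.random.default_rng(7)
n = 256
theta = np.cumsum(rng.normal(size=n)); theta -= theta.mean()  # smooth-ish signal
f = np.exp(-0.5 * (np.arange(n) - 3) ** 2); f /= f.sum()      # assumed blur kernel

y = np.real(np.fft.ifft(np.fft.fft(f) * np.fft.fft(theta)))   # circular blur
y += rng.normal(0, 0.1, n)                                    # additive noise

F = np.fft.fft(f)
S_theta = np.abs(np.fft.fft(theta)) ** 2 / n   # assumed-known signal spectrum
S_w = 0.1 ** 2 * np.ones(n)                    # noise spectrum

H = np.conj(F) * S_theta / (np.abs(F) ** 2 * S_theta + S_w)   # Wiener filter
theta_hat = np.real(np.fft.ifft(H * np.fft.fft(y)))
print(np.mean((theta_hat - theta) ** 2), np.mean((y - theta) ** 2))
```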
231. Deblurring Error
MSE = ∫_{−π}^{π} [ S_Θ(ω) − |S_{ΘY}(ω)|²/S_Y(ω) ] dω
= ∫_{−π}^{π} [ S_Θ(ω) − |F(ω)|² |S_Θ(ω)|² / (|F(ω)|² S_Θ(ω) + S_W(ω)) ] dω
= ∫_{−π}^{π} [ ( S_Θ[|F|² S_Θ + S_W] − |F|² |S_Θ|² ) / (|F|² S_Θ + S_W) ] dω
= ∫_{−π}^{π} S_Θ S_W / (|F|² S_Θ + S_W) dω
Competing Approaches
• Competing approaches include iterative methods such as the "Richardson-Lucy" algorithm (an EM-style procedure):
– Computationally intensive
– Can naturally incorporate nonnegativity
– Sometimes a better match to real statistics
232. Discussion
• Advantage of the Wiener approach:
– LTI filtering implementation
• Disadvantages of the Wiener approach:
– No natural way to incorporate nonnegativity constraints (in image processing, for instance)
– Only truly optimal for Gaussian statistics
Real-Time Wiener Filtering
• What if we don't have "future" measurements?
• Must restrict h to be causal.
• Solution:
H(z) = (1/S_Y^−(z)) { S_{ΘY}(z)/S_Y^+(z) }_+
where the meaning of the plus and minus superscripts and subscripts is defined on the following slides.
233. Spectral Factorization
• If Y has a spectrum satisfying the Paley-Wiener criterion:
∫_{−π}^{π} log S_Y(ω) dω > −∞
then the spectrum can be factored as
S_Y(ω) = S_Y^+(ω) S_Y^−(ω)
where F^{−1}{S_Y^+} is causal and F^{−1}{S_Y^−} is anticausal.
Factoring Rational Spectra
• If the spectrum is a ratio of polynomials, we can factor as
S_Y(z) = S_Y^+(z) S_Y^−(z) = S_Y^+(z) S_Y^+(z^{−1})
where S_Y^+ has its poles and zeros inside the unit circle and S_Y^− has its poles and zeros outside the unit circle.
234. • Aside: spectral factorization into causal and anticausal factors is analogous to the Cholesky decomposition of a covariance matrix into lower and upper triangular factors.
Causal Part Extraction
• We can split f into its causal and anticausal parts:
f(k) = {f(k)}_+ + {f(k)}_−
{f(k)}_+ = f(k)u(k), {f(k)}_− = f(k)u(−k − 1)
• Use similar notation for the Z-transform domain:
F(z) = {F(z)}_+ + {F(z)}_−
{F}_+ = Z{Z^{−1}{F} u(k)}, {F}_− = Z{Z^{−1}{F} u(−k − 1)}
How to Extract Causal Parts
• If F is a ratio of polynomials one can usually
235. W
i
e
n
e
r
do a partial fraction expansion:
F(z) = {F(z)}+
| {z }
?
Poles and zeros
inside unit circle
+ {F(z)}−
| {z }
?
Poles and zeros
outside unit circle
• Can also do polynomial long division
9
236. Chernoff Bounds (Theory)
• General-purpose likelihood ratio test:
p(y; H_1)/p(y; H_0) or p(y|H_1)/p(y|H_0) ≷ λ   (decide H_1 if greater, H_0 otherwise)
• Consider the loglikelihood ratio test:
L ≡ ln Λ = ln [ p(y|H_1)/p(y|H_0) ] ≷ ln λ ≡ γ
• Conditional error probabilities:
P_D = ∫_γ^∞ p_{L|H_1}(ℓ|H_1) dℓ, P_FA = ∫_γ^∞ p_{L|H_0}(ℓ|H_0) dℓ
• It is often difficult, if not impossible, to find simple formulas for p_{L|H_1}(ℓ|H_1) and p_{L|H_0}(ℓ|H_0).
• This makes computing the probabilities of detection and false alarm difficult.
– We could use Monte Carlo simulations, but those are cumbersome.
237. – Alternative: find easy-to-compute, analytic bounds on the error probabilities.
• Discussion based on Van Trees.
A Moment Generating Function
Φ_{L|H_0}(s) = E[e^{sL}|H_0] = ∫_{−∞}^{∞} e^{sℓ} p_L(ℓ|H_0) dℓ
= ∫_Y exp[sL(y)] p_Y(y|H_0) dy
= ∫_Y exp[ s ln( p_Y(y|H_1)/p_Y(y|H_0) ) ] p_Y(y|H_0) dy
= ∫_Y [ p_Y(y|H_1)/p_Y(y|H_0) ]^s p_Y(y|H_0) dy
= ∫_Y [p_Y(y|H_1)]^s [p_Y(y|H_0)]^{1−s} dy
• Define a new random variable X_s (for various values of s) with the tilted density
p_{X_s}(x) ≡ e^{sx} p_{L|H_0}(x|H_0) / ∫_{−∞}^{∞} e^{sℓ} p_{L|H_0}(ℓ|H_0) dℓ
238. µ(s) ≡ ln Φ_{L|H_0}(s) = ln ∫_{−∞}^{∞} e^{sℓ} p(ℓ|H_0) dℓ
µ̇(s) = ∫_{−∞}^{∞} ℓ e^{sℓ} p(ℓ|H_0) dℓ / ∫_{−∞}^{∞} e^{sℓ} p(ℓ|H_0) dℓ = E[X_s]
µ̈(s) = var[X_s]
µ̇(0) = ∫_{−∞}^{∞} ℓ p(ℓ|H_0) dℓ = E[L|H_0]
µ̇(1) = ∫_{−∞}^{∞} ℓ (p(ℓ|H_1)/p(ℓ|H_0)) p(ℓ|H_0) dℓ / ∫_{−∞}^{∞} (p(ℓ|H_1)/p(ℓ|H_0)) p(ℓ|H_0) dℓ = E[L|H_1]
• Then, since p(ℓ|H_0) = exp[µ(s) − sℓ] p_{X_s}(ℓ),
P_FA = ∫_γ^∞ exp[µ(s) − sx] p_{X_s}(x) dx ≤ exp[µ(s) − sγ] for s ≥ 0 (since x ≥ γ inside the integral)
240. C
h
e
r
n
o
ff
µ̇(s) = γ
• Assuming everything worked (things exist,
equation for maximizing s solvable, etc.):
PF A ≤ exp[µ(s) − sµ̇(s)]
PM ≤ exp[µ(s) + (1 − s)γ]
• We want the s ≤ 1 which makes the RHS
as small as possible
d
ds
[µ(s) + (1 − s)γ] = µ̇(s) − γ
µ̇(s) = γ
• Assuming everything worked (things exist,
equation for maximizing s solvable, etc.):
PM ≤ exp[µ(s) + (1 − s)µ̇(s)]
PF A ≤ exp[µ(s) − sµ̇(s)], 0 6 s 6 1
PM ≤ exp[µ(s)+(1−s)µ̇(s)], where γ = µ̇(s)
µ̇(0) ≤ γ ≤ µ̇(1)
E[L|H0] ≤ γ ≤ E[L|H1]
Why is this useful? L can often be easily described by its moment generating function.
• Let $s = s_m$ satisfy $\dot\mu(s_m) = \gamma = 0$
$$P_e = \tfrac12 P_{FA} + \tfrac12 P_M \le \tfrac12\exp[\mu(s_m)]\int_0^{\infty} p_{X_{s_m}}(x)\,dx + \tfrac12\exp[\mu(s_m)]\int_{-\infty}^{0} p_{X_{s_m}}(x)\,dx$$
$$P_e \le \tfrac12\exp[\mu(s_m)]$$
$$P_{FA} = e^{\mu(s)}\int_{\dot\mu(s)}^{\infty} e^{-sx}\,p_{X_s}(x)\,dx = \exp[\mu(s)-s\dot\mu(s)]\int_{\dot\mu(s)}^{\infty}\exp[s(\dot\mu(s)-x)]\,p_{X_s}(x)\,dx$$
$$= \exp[\mu(s)-s\dot\mu(s)]\int_0^{\infty}\exp[-s\sqrt{\ddot\mu(s)}\,z]\,p_Z(z)\,dz$$
where
$$Z = \frac{X_s - E[X_s]}{\sqrt{\mathrm{var}[X_s]}} = \frac{X_s - \dot\mu(s)}{\sqrt{\ddot\mu(s)}}$$
$$\exp[\mu(s)-s\dot\mu(s)]\int_0^{\infty}\exp[-s\sqrt{\ddot\mu(s)}\,z]\,p_Z(z)\,dz$$
The original Chernoff inequality was formed by replacing the integral factor with 1. We can get a tighter constant in some asymptotic cases.
Asymptotic Gaussian Approximation
• In some cases, Z approaches a Gaussian random variable as the number of samples n grows large (e.g., data points i.i.d. with finite means and variances):
$$\int_0^{\infty}\exp[-s\sqrt{\ddot\mu(s)}\,z]\,\frac{1}{\sqrt{2\pi}}\exp\left[-\frac{z^2}{2}\right]dz = \exp\left[\frac{s^2\ddot\mu(s)}{2}\right]Q\!\left(s\sqrt{\ddot\mu(s)}\right)$$
$$P_{FA} = \exp[\mu(s)-s\dot\mu(s)]\int_0^{\infty}\exp[-s\sqrt{\ddot\mu(s)}\,z]\,p_Z(z)\,dz \approx \exp[\mu(s)-s\dot\mu(s)]\exp\left[\frac{s^2\ddot\mu(s)}{2}\right]Q\!\left(s\sqrt{\ddot\mu(s)}\right)$$
• If $s\sqrt{\ddot\mu(s)} > 3$ we can approximate Q(·) using an upper bound
$$Q(a) \le \frac{1}{a\sqrt{2\pi}}\exp\left[-\frac{a^2}{2}\right]$$
$$P_{FA} \approx \frac{1}{\sqrt{2\pi s^2\ddot\mu(s)}}\exp[\mu(s)-s\dot\mu(s)]$$
Similar Analysis Works for PM
$$P_M \approx e^{\mu(s)+(1-s)\dot\mu(s)}\exp\left[\frac{(s-1)^2\ddot\mu(s)}{2}\right]Q\!\left((1-s)\sqrt{\ddot\mu(s)}\right)$$
• If $(1-s)\sqrt{\ddot\mu(s)} > 3$ we can approximate Q(·) using the upper bound:
$$P_M \approx \frac{1}{\sqrt{2\pi(1-s)^2\ddot\mu(s)}}\exp[\mu(s)+(1-s)\dot\mu(s)]$$
Asymptotic Analysis for Pe
• For the case of equal priors and equal costs, if the conditions for the Q(·) approximation on the previous two slides hold, we have
$$P_e \approx \frac{1}{2s_m(1-s_m)\sqrt{2\pi\ddot\mu(s_m)}}\exp[\mu(s_m)]$$
Chernoff Bounds (Gaussian Examples)
Consider the loglikelihood ratio test
$$L \equiv \ln\Lambda = \ln\frac{p(y|H_1)}{p(y|H_0)} \;\underset{H_0}{\overset{H_1}{\gtrless}}\; \ln\lambda \equiv \gamma$$
Main object of interest: $\mu(s) \equiv \ln\Phi_{L|H_0}(s)$
$$\Phi_{L|H_0}(s) = E[e^{sL}|H_0] = \int_{-\infty}^{\infty} e^{s\ell}\,p_L(\ell|H_0)\,d\ell = \int_Y [p(y|H_1)]^s\,[p(y|H_0)]^{1-s}\,dy$$
Both representations will be useful
Gaussian, Equal Variances
$$H_1 \sim N(m, \sigma^2), \qquad H_0 \sim N(0, \sigma^2)$$
$$\mu(s) = \ln\int_Y [p(y|H_1)]^s\,[p(y|H_0)]^{1-s}\,dy$$
$$= \ln\int_{-\infty}^{\infty}\!\cdots\!\int_{-\infty}^{\infty}\left\{\prod_{i=1}^n \frac{1}{\sqrt{2\pi\sigma^2}}\exp\left[-\frac{(y_i-m)^2}{2\sigma^2}\right]\right\}^s \left\{\prod_{i=1}^n \frac{1}{\sqrt{2\pi\sigma^2}}\exp\left[-\frac{y_i^2}{2\sigma^2}\right]\right\}^{1-s} dy_1\cdots dy_n$$
$$= n\ln\int_{-\infty}^{\infty}\frac{1}{\sqrt{2\pi\sigma^2}}\exp\left[-\frac{(y-m)^2 s + y^2(1-s)}{2\sigma^2}\right]dy$$
Completing the Square
$$\int_{-\infty}^{\infty}\frac{1}{\sqrt{2\pi\sigma^2}}\exp\left[-\frac{(y-m)^2 s + y^2(1-s)}{2\sigma^2}\right]dy = \int_{-\infty}^{\infty}\frac{1}{\sqrt{2\pi\sigma^2}}\exp\left[-\frac{(y^2-2my+m^2)s + y^2(1-s)}{2\sigma^2}\right]dy$$
$$= \int_{-\infty}^{\infty}\frac{1}{\sqrt{2\pi\sigma^2}}\exp\left[-\frac{y^2 - 2msy + m^2 s}{2\sigma^2}\right]dy = \int_{-\infty}^{\infty}\frac{1}{\sqrt{2\pi\sigma^2}}\exp\left[-\frac{(y^2-2msy+m^2s^2) - m^2s^2 + m^2 s}{2\sigma^2}\right]dy$$
Finish Computing µ
$$= \int_{-\infty}^{\infty}\frac{1}{\sqrt{2\pi\sigma^2}}\exp\left[-\frac{(y-ms)^2}{2\sigma^2}\right]\exp\left[-\frac{m^2 s(1-s)}{2\sigma^2}\right]dy = \exp\left[\frac{m^2 s(s-1)}{2\sigma^2}\right]$$
$$\mu(s) = n\ln\exp\left[\frac{m^2 s(s-1)}{2\sigma^2}\right] = \frac{s(s-1)}{2}\,\frac{nm^2}{\sigma^2} \equiv \frac{s(s-1)}{2}\,d^2$$
Basic Bound on PFA
$$\mu(s) = \frac{s(s-1)}{2}d^2, \qquad \dot\mu(s) = \frac{2s-1}{2}d^2$$
$$P_{FA} \le \exp[\mu(s)-s\dot\mu(s)], \quad \text{for } 0 \le s \le 1$$
$$= \exp\left[\frac{s(s-1)}{2}d^2 - s\,\frac{2s-1}{2}d^2\right] = \exp\left[-\frac{s^2}{2}d^2\right]$$
where $\gamma = \dot\mu(s)$:
$$\gamma = \frac{2s-1}{2}d^2 \;\Rightarrow\; s = \frac{\gamma}{d^2} + \frac12$$
Basic Bound on PM
$$P_M \le \exp[\mu(s)+(1-s)\dot\mu(s)] = \exp\left[\frac{s(s-1)}{2}d^2 + (1-s)\,\frac{2s-1}{2}d^2\right]$$
$$= \exp\left[\frac{s^2-s}{2}d^2 + \frac{2s-1-2s^2+s}{2}d^2\right] = \exp\left[\frac{2s-1-s^2}{2}d^2\right] = \exp\left[-\frac{(1-s)^2}{2}d^2\right]$$
Where are the Bounds Meaningful? Recall we need
$$E[L|H_0] \le \gamma \le E[L|H_1], \qquad \dot\mu(0) \le \gamma \le \dot\mu(1)$$
$$\frac{2\cdot 0 - 1}{2}d^2 \le \gamma \le \frac{2\cdot 1 - 1}{2}d^2 \;\Rightarrow\; -\frac{d^2}{2} \le \gamma \le \frac{d^2}{2}$$
The Refined Bound for PFA
Recall the refined asymptotic bound:
$$P_{FA} \approx \exp[\mu(s)-s\dot\mu(s)]\exp\left[\frac{s^2\ddot\mu(s)}{2}\right]Q\!\left(s\sqrt{\ddot\mu(s)}\right)$$
$$\dot\mu(s) = \frac{2s-1}{2}d^2, \qquad \ddot\mu(s) = d^2$$
In this case, since L is a sum of Gaussian random variables, the expression is exact:
$$P_{FA} = \exp\left[-\frac{s^2}{2}d^2\right]\exp\left[\frac{s^2 d^2}{2}\right]Q(sd) = Q(sd)$$
The Refined Bound for PM
$$P_M \approx e^{\mu(s)+(1-s)\dot\mu(s)}\exp\left[\frac{(s-1)^2\ddot\mu(s)}{2}\right]Q\!\left((1-s)\sqrt{\ddot\mu(s)}\right) = \exp\left[-\frac{(1-s)^2}{2}d^2\right]\exp\left[\frac{(s-1)^2 d^2}{2}\right]Q((1-s)d)$$
Again, since L is Gaussian, the expression is exact:
$$P_M = Q((1-s)d)$$
Minimum Prob. of Error
For the minimum prob. of error test, γ = 0:
$$s_m = \frac{\gamma}{d^2} + \frac12 = \frac12$$
Recall the approximate expression for Pe from the last slide of the last lecture:
$$P_e \approx \frac{1}{2s_m(1-s_m)\sqrt{2\pi\ddot\mu(s_m)}}\exp[\mu(s_m)] = \frac{1}{2s_m(1-s_m)\sqrt{2\pi d^2}}\exp\left[\frac{s_m(s_m-1)}{2}d^2\right]$$
Min. Prob. of Error Cont.
$$P_e \approx \frac{2}{\sqrt{2\pi d^2}}\exp\left[-\frac{d^2}{8}\right]$$
Recall the exact expression is
$$P_e = Q(d/2)$$
Van Trees’ rule of thumb: the approximation is very good for d > 6.
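A quick numerical comparison of the three expressions for this example — the simple Chernoff bound $P_e \le \frac12 e^{-d^2/8}$, the refined approximation, and the exact $Q(d/2)$ — as a sketch:

```python
import numpy as np
from scipy.stats import norm

# Compare the Chernoff bound, refined asymptotic approximation, and exact P_e
# for the equal-variance Gaussian example (s_m = 1/2, mu(s_m) = -d^2/8).
Q = norm.sf  # the Q-function

for d in [2.0, 4.0, 6.0, 8.0]:
    chernoff = 0.5 * np.exp(-d**2 / 8)                        # (1/2) exp[mu(s_m)]
    refined = 2 / np.sqrt(2 * np.pi * d**2) * np.exp(-d**2 / 8)
    exact = Q(d / 2)
    print(f"d={d}: Chernoff={chernoff:.3e}, refined={refined:.3e}, exact={exact:.3e}")
```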
The Bhattacharyya Distance
If the criterion is the minimum prob. of error and µ(s) is symmetric about s = 1/2, then
$$\mu(1/2) = \ln\int_Y \sqrt{p(y|H_1)}\,\sqrt{p(y|H_0)}\,dy$$
and $-\mu(1/2)$ is called the Bhattacharyya distance
Gaussian, Equal Means
$$H_1 \sim N(0, \sigma_1^2), \qquad H_0 \sim N(0, \sigma_0^2)$$
$$\mu(s) = \frac{n}{2}\ln\frac{(\sigma_0^2)^s\,(\sigma_1^2)^{1-s}}{s\sigma_0^2 + (1-s)\sigma_1^2}$$
A common special case:
$$\sigma_1^2 = \sigma_s^2 + \sigma_n^2, \qquad \sigma_0^2 = \sigma_n^2$$
$$\mu(s) = \frac{n}{2}\left[(1-s)\ln\left(1 + \frac{\sigma_s^2}{\sigma_n^2}\right) - \ln\left(1 + (1-s)\frac{\sigma_s^2}{\sigma_n^2}\right)\right]$$
Gaussian, Equal Means (cont.)
$$\dot\mu(s) = \frac{n}{2}\left[-\ln\left(1 + \frac{\sigma_s^2}{\sigma_n^2}\right) + \frac{\sigma_s^2/\sigma_n^2}{1 + (1-s)\sigma_s^2/\sigma_n^2}\right]$$
$$\ddot\mu(s) = \frac{n}{2}\left[\frac{\sigma_s^2/\sigma_n^2}{1 + (1-s)\sigma_s^2/\sigma_n^2}\right]^2$$
Uniformly Most Powerful Tests
• Usual parametric data model p(y; θ)
• Consider a composite problem:
H0 : θ = θ0, H1 : θ ∈ S1
• A test φ∗ is uniformly most powerful (UMP) of level α = PFA if its PD is at least as good as that of any other α-level test:
$$P_D(\phi^*;\theta) = E_\theta[\phi^*] \ge E_\theta[\phi] = P_D(\phi;\theta) \quad \text{for all } \theta \in S_1 \text{ and all } \alpha\text{-level tests } \phi$$
Figure 1: UMP test — power curves $P_D(\phi^*,\theta)$, $P_D(\phi_A,\theta)$, $P_D(\phi_B,\theta)$ for tests with common level $\alpha = P_{FA}(\phi_B) = P_{FA}(\phi_A) = P_{FA}(\phi^*)$.
• Find the most powerful α-level (recall α = PFA) test for a fixed θ
• Just the Neyman-Pearson test
• If the decision regions do not vary with θ, then the test is UMP
Gaussian Mean Example
• Suppose we have n i.i.d. samples $Y_i \sim N(\mu, \sigma^2)$
• Assume $\sigma^2$ is known, but µ is not
• Consider three cases:
$$H_0: \mu = 0$$
$$\text{Case I}: H_1: \mu > 0, \qquad \text{Case II}: H_1: \mu < 0, \qquad \text{Case III}: H_1: \mu \ne 0$$
Suffices to use
$$\bar y = \frac{1}{n}\sum_{i=1}^n y_i$$
$$\bar Y \sim N(\mu, \sigma^2/n)$$
$$\Lambda(y;\mu) = \frac{p(y;\mu)}{p(y;0)} = \frac{\exp[-(\bar y-\mu)^2/(2\sigma^2/n)]}{\exp[-\bar y^2/(2\sigma^2/n)]} = \frac{\exp[(-\bar y^2 + 2\bar y\mu - \mu^2)/(2\sigma^2/n)]}{\exp[-\bar y^2/(2\sigma^2/n)]} = \exp\left[\frac{n\mu}{\sigma^2}\bar y - \frac{n\mu^2}{2\sigma^2}\right] \;\underset{H_0}{\overset{H_1}{\gtrless}}\; \tau$$
$$\frac{\sqrt{n}\,\mu}{\sigma}\,\bar y \;\underset{H_0}{\overset{H_1}{\gtrless}}\; \left[\ln\tau + \frac{n\mu^2}{2\sigma^2}\right]\frac{\sigma}{\sqrt{n}} \equiv \gamma$$
• Case I: µ > 0
$$\frac{\sqrt{n}\,\mu}{\sigma}\,\bar y \;\underset{H_0}{\overset{H_1}{\gtrless}}\; \gamma \;\longrightarrow\; \frac{\sqrt{n}}{\sigma}\,\bar y \;\underset{H_0}{\overset{H_1}{\gtrless}}\; \frac{\gamma}{\mu} \equiv \gamma^+$$
• Set the threshold to get the right “level”:
$$\alpha = P_{FA} = \Pr\left[\frac{\sqrt{n}}{\sigma}\bar Y > \gamma^+ ;\, H_0\right] \;\Rightarrow\; \gamma^+ = Q^{-1}(\alpha)$$
• For Case II (µ < 0) the inequality flips, giving a threshold $\gamma^- = Q^{-1}(1-\alpha)$ and
$$P_D = 1 - Q\!\left(\gamma^- - \frac{\sqrt{n}}{\sigma}\mu\right) = 1 - Q\!\left(Q^{-1}(1-\alpha) - \frac{\sqrt{n}}{\sigma}\mu\right) = 1 - Q(Q^{-1}(1-\alpha) - d)$$
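A minimal sketch of the Case I test in practice, with toy values for n, σ, and µ (assumed for illustration): the threshold on the normalized statistic is $\gamma^+ = Q^{-1}(\alpha)$ and the power is $Q(Q^{-1}(\alpha) - d)$:

```python
import numpy as np
from scipy.stats import norm

# Case I (mu > 0) UMP test: threshold and power at level alpha.
alpha, n, sigma, mu = 0.05, 25, 2.0, 1.0   # toy values, not from the notes
d = np.sqrt(n) * mu / sigma

gamma_plus = norm.isf(alpha)        # Q^{-1}(alpha): threshold on sqrt(n)*ybar/sigma
P_D = norm.sf(gamma_plus - d)       # power: Q(Q^{-1}(alpha) - d)

y = np.random.default_rng(0).normal(mu, sigma, n)   # simulated H1 data
decide_H1 = np.sqrt(n) / sigma * y.mean() > gamma_plus
print(f"threshold={gamma_plus:.3f}, P_D={P_D:.3f}, decide H1: {decide_H1}")
```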
• Case III: µ ≠ 0
$$\frac{\sqrt{n}\,\mu}{\sigma}\,\bar y \;\underset{H_0}{\overset{H_1}{\gtrless}}\; \gamma$$
• We can’t just absorb µ into the threshold anymore without affecting the inequalities!
• Decision region varies with the sign of µ
• No UMP test exists!!!
• Cauchy Median Example
• Suppose we have a single sample from the density
$$p(y;\theta) = \frac{1}{\pi}\,\frac{1}{1+(y-\theta)^2}$$
and we want to decide
$$H_0: \theta = 0, \qquad H_1: \theta > 0$$
• The likelihood ratio is
$$\frac{p(y;\theta)}{p(y;0)} = \frac{1+y^2}{1+(y-\theta)^2} \;\underset{H_0}{\overset{H_1}{\gtrless}}\; \tau$$
• The decision region depends on θ
• So no UMP exists!
• The Monotone Likelihood Ratio (MLR) Condition
$$H_0: \theta = \theta_0, \qquad H_1: \theta \in S_1$$
• Suppose we have a Fisher factorization
$$p(y;\theta) = g(T(y),\theta)\,h(y)$$
• A UMP test of any level α exists if the likelihood ratio is either monotone increasing or decreasing in T for all $\theta \in S_1$:
$$\Lambda(y;\theta) = \frac{p(y;\theta)}{p(y;\theta_0)} = \frac{g(T,\theta)}{g(T,\theta_0)} \equiv \Lambda(T;\theta)$$
• Densities Satisfying the MLR Condition
• Suppose we have a one-sided test:
$$H_0: \theta = \theta_0, \; H_1: \theta > \theta_0 \qquad\text{or}\qquad H_0: \theta = \theta_0, \; H_1: \theta < \theta_0$$
• The following satisfy the MLR condition:
– i.i.d. samples from a 1-D exponential family (Gaussian, Bernoulli, Exponential, Poisson, Gamma, Beta)
– i.i.d. samples from the uniform density U(0, θ)
– i.i.d. samples from a shifted Laplace
Also works for
$$H_0: \theta \le \theta_0, \qquad H_1: \theta > \theta_0$$
• Densities Not Satisfying MLR Condition
• Gaussian with single-sided H1 on mean but
unknown variance
• Cauchy density with single-sided H1 on
centrality parameter
• Exponential family with double-sided H1
Locally Most Powerful Tests
• Usual parametric data model p(y; θ)
• Consider a single-sided problem:
$$H_0: \theta = \theta_0, \qquad H_1: \theta > \theta_0$$
• What to do if there is no UMP test?
• The locally most powerful test of level α has a power curve that maximizes the slope of PD(θ) at θ = θ0:
$$\phi_{LMP} = \arg\max_{\phi\in\{\alpha\text{-level}\}} \frac{d}{d\theta}P_D(\phi;\theta) = \arg\max_{\phi\in\{\alpha\text{-level}\}} \frac{d}{d\theta}E_\theta[\phi]$$
Figure 1: Graphical interpretation of the LMP test — power curves $P_D(\phi_{LMP},\theta)$, $P_D(\phi_B,\theta)$, $P_D(\phi_A,\theta)$ versus θ near θ0, with common level $\alpha = P_{FA}(\phi_B) = P_{FA}(\phi_A) = P_{FA}(\phi_{LMP})$.
• Solution to the LMP Problem
• Using a proof similar to that for the Neyman-Pearson lemma, one can show the LMP test is
$$\frac{\left.\frac{d}{d\theta}p(y;\theta)\right|_{\theta=\theta_0}}{p(y;\theta_0)} \;\underset{H_0}{\overset{H_1}{\gtrless}}\; \lambda$$
where we pick λ to achieve $E_{\theta_0}[\phi] \le \alpha$
• As before, we may need a randomized test if there is a nonzero prob. of landing exactly on the threshold (but we won’t worry about that)
Suppose λ = 0 and the likelihood is unimodal.
Figure 2: Decide H1 if the slope of the likelihood at θ0 is positive, i.e. the ML estimate $\hat\theta_{ML}(y)$ gives evidence that θ > θ0.
• One-Sided Gaussian Mean Example
$$y_i \sim N(\theta, \sigma^2), \qquad H_0: \theta = 0, \quad H_1: \theta > 0$$
$$\frac{d}{d\theta}\ln p(y;\theta) = \frac{d}{d\theta}\left[c - \sum_{i=1}^n \frac{(y_i-\theta)^2}{2\sigma^2}\right]$$
$$\left.\frac{d}{d\theta}\ln p(y;\theta)\right|_{\theta=0} = \frac{1}{\sigma^2}\sum_{i=1}^n y_i \;\underset{H_0}{\overset{H_1}{\gtrless}}\; \lambda \;\Longrightarrow\; \frac{\sqrt{n}}{\sigma}\,\bar y \;\underset{H_0}{\overset{H_1}{\gtrless}}\; \lambda\,\frac{\sigma}{\sqrt{n}} \equiv \gamma$$
(just the UMP test we discussed before)
• Cauchy Median Example
• Suppose we have n Cauchy samples:
$$p(y;\theta) = \prod_{i=1}^n \frac{1}{\pi}\,\frac{1}{1+(y_i-\theta)^2}$$
and we want to decide
$$H_0: \theta = 0, \qquad H_1: \theta > 0$$
$$\frac{d}{d\theta}\ln p(y;\theta) = \frac{d}{d\theta}\sum_{i=1}^n\left\{c - \ln[1+(y_i-\theta)^2]\right\} = 2\sum_{i=1}^n \frac{y_i-\theta}{1+(y_i-\theta)^2}$$
$$\left.\frac{d}{d\theta}\ln p(y;\theta)\right|_{\theta=0} = 2\sum_{i=1}^n \frac{y_i}{1+y_i^2} \;\underset{H_0}{\overset{H_1}{\gtrless}}\; \lambda$$
• Step 1: Pass each data point through a memoryless nonlinearity
$$g(y) = \frac{y}{1+y^2}$$
Figure 3: The memoryless nonlinearity g(y) = y/(1 + y²), plotted for −5 ≤ y ≤ 5; it is odd, peaks at ±1/2 at y = ±1, and decays toward zero for large |y|.
• Step 2: Sum all the nonlinearity outputs
• Step 3: Compare to a threshold
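A minimal sketch of the three-step detector; the threshold value below is assumed for illustration (in practice λ would be calibrated, e.g. by Monte Carlo under H0, to meet the level α):

```python
import numpy as np

def cauchy_lmp_statistic(y):
    """LMP statistic for the Cauchy median problem: 2 * sum of g(y_i) = y_i/(1+y_i^2)."""
    return 2 * np.sum(y / (1 + y**2))   # Step 1 (nonlinearity) and Step 2 (sum)

# Step 3: compare to a threshold (hypothetical lam, assumed for illustration)
rng = np.random.default_rng(0)
y = 0.5 + rng.standard_cauchy(100)      # simulated samples with true theta = 0.5
lam = 5.0
print("decide H1" if cauchy_lmp_statistic(y) > lam else "decide H0")
```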
• Two-Sided LMP Tests
• Now consider a double-sided problem:
$$H_0: \theta = \theta_0, \qquad H_1: \theta \ne \theta_0$$
• The locally most powerful “unbiased” test of level α has a power curve which maximizes the curvature of PD(θ) at θ = θ0, subject to the slope there being zero
Generalized Likelihood Ratio Tests and Model Order Selection Criteria
• Usual parametric data model p(y; θ)
• In previous lectures on LMP tests, we assumed special structures like:
$$H_0: \theta = \theta_0, \; H_1: \theta > \theta_0 \qquad\text{or}\qquad H_0: \theta = \theta_0, \; H_1: \theta \ne \theta_0$$
• What should we do if we have a more general structure like:
$$H_0: \theta \in S_0, \qquad H_1: \theta \in S_1$$
• Often, we do something a bit ad hoc.
The GLRT
• Find parameter estimates θ̂0 and θ̂1 under H0 and H1
• Substituting the estimates into the likelihood ratio yields a generalized likelihood ratio test:
$$\Lambda_{GLR}(y) = \frac{p(y;\hat\theta_1)}{p(y;\hat\theta_0)} \;\underset{H_0}{\overset{H_1}{\gtrless}}\; \lambda$$
• If convenient, use ML estimates:
$$\frac{\max_{\theta\in S_1} p(y;\theta)}{\max_{\theta\in S_0} p(y;\theta)} \;\underset{H_0}{\overset{H_1}{\gtrless}}\; \lambda$$
Two-Sided Gaussian Mean Example (1)
$$Y_i \sim N(\theta, \sigma^2), \qquad H_0: \theta = 0, \quad H_1: \theta \ne 0$$
$$\ln\frac{p(y;\hat\theta)}{p(y;0)} = -\sum_{i=1}^n \frac{\left(y_i - \frac1n\sum_{j=1}^n y_j\right)^2}{2\sigma^2} + \sum_{i=1}^n \frac{y_i^2}{2\sigma^2} = \frac{\sum_{i=1}^n 2y_i\left(\frac1n\sum_{j=1}^n y_j\right) - n\left(\frac1n\sum_{i=1}^n y_i\right)^2}{2\sigma^2}$$
Two-Sided Gaussian Mean Example (2)
$$= \frac{2n\left(\frac1n\sum_{i=1}^n y_i\right)^2 - n\left(\frac1n\sum_{i=1}^n y_i\right)^2}{2\sigma^2} = \frac{n}{2\sigma^2}\,\bar y^2 \;\underset{H_0}{\overset{H_1}{\gtrless}}\; \lambda \;\Longrightarrow\; |\bar y| \;\underset{H_0}{\overset{H_1}{\gtrless}}\; \gamma$$
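A short sketch of this two-sided GLRT at level α, using the fact that under H0 $\sqrt{n}\,\bar Y/\sigma \sim N(0,1)$, so $\gamma = (\sigma/\sqrt{n})\,Q^{-1}(\alpha/2)$ (the data below are simulated for illustration):

```python
import numpy as np
from scipy.stats import norm

# Two-sided GLRT |ybar| >< gamma at level alpha.
alpha, sigma = 0.05, 1.0
rng = np.random.default_rng(1)
y = rng.normal(0.4, sigma, 50)       # simulated data, true theta = 0.4
n = len(y)

gamma = sigma / np.sqrt(n) * norm.isf(alpha / 2)   # (sigma/sqrt(n)) * Q^{-1}(alpha/2)
print("decide H1" if abs(y.mean()) > gamma else "decide H0")
```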
Some Gaussian Examples
• Single population:
– Tests on the mean, with unknown variance, yield “T-tests”
– The statistic has a Student-t distribution
– Asymptotically Gaussian
– A T-test is any statistical hypothesis test in which the test statistic has a Student’s t distribution if the null hypothesis is true. It is applied when sample sizes are small enough that using an assumption of normality and the associated z-test leads to incorrect inference. Suppose $X_1, \cdots, X_n$ are independent random variables that are normally distributed with expected value µ and variance σ². Let
$$\bar X_n = (X_1 + \cdots + X_n)/n$$
be the sample mean, and
$$S_n^2 = \frac{1}{n-1}\sum_{i=1}^n \left(X_i - \bar X_n\right)^2$$
be the sample variance. It is readily shown that the quantity
$$Z = \frac{\bar X_n - \mu}{\sigma/\sqrt{n}}$$
is normally distributed with mean 0 and variance 1, since the sample mean $\bar X_n$ is normally distributed with mean µ and standard deviation $\sigma/\sqrt{n}$. Gosset studied a related quantity,
$$T = \frac{\bar X_n - \mu}{S_n/\sqrt{n}},$$
which differs from Z in that the exact standard deviation σ is replaced by the random variable $S_n$. Technically, $(n-1)S_n^2/\sigma^2$ has a $\chi^2_{n-1}$ distribution by Cochran’s theorem. Gosset’s work showed that T has the probability density function
$$f(t) = \frac{\Gamma\left(\frac{\nu+1}{2}\right)}{\sqrt{\nu\pi}\,\Gamma\left(\frac{\nu}{2}\right)}\left(1 + \frac{t^2}{\nu}\right)^{-\frac{\nu+1}{2}}, \qquad \nu = n-1$$
Confidence Intervals Derived from Student’s t-Distribution
Suppose the number A is so chosen that
$$\Pr(-A < T < A) = 0.9$$
when T has a t-distribution with n − 1 degrees of freedom. This is the same as
$$\Pr(T < A) = 0.95,$$
so A is the “95th percentile” of this probability distribution, or $A = t_{(0.05,\,n-1)}$. Then
$$\Pr\left(-A < \frac{\bar X_n - \mu}{S_n/\sqrt{n}} < A\right) = 0.9,$$
$$\Pr\left(\bar X_n - A\frac{S_n}{\sqrt{n}} < \mu < \bar X_n + A\frac{S_n}{\sqrt{n}}\right) = 0.9$$
Therefore the interval whose endpoints are
$$\bar X_n \pm A\frac{S_n}{\sqrt{n}}$$
is a 90-percent confidence interval for µ.
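A minimal sketch of this t-based interval with hypothetical data (scipy’s t.ppf supplies the percentile A):

```python
import numpy as np
from scipy.stats import t

# 90% t-based confidence interval for the mean (hypothetical data).
x = np.array([4.9, 5.3, 5.1, 4.7, 5.6, 5.0, 5.2, 4.8])
n = len(x)
xbar, s = x.mean(), x.std(ddof=1)      # sample mean and sample std (S_n)

A = t.ppf(0.95, df=n - 1)              # 95th percentile of t_{n-1}
half_width = A * s / np.sqrt(n)
print(f"90% CI for mu: ({xbar - half_width:.3f}, {xbar + half_width:.3f})")
```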
• Two populations:
– Tests on equality of variances, with unknown means, yield a “Fisher F-test”
– The statistic has a Fisher F-distribution
An F-test is any statistical test in which the test statistic has an F-distribution if the null hypothesis is true. A random variate of the F-distribution arises as the ratio of two chi-squared variates,
$$\frac{U_1/d_1}{U_2/d_2},$$
where $U_1$ and $U_2$ have chi-square distributions with $d_1$ and $d_2$ degrees of freedom, respectively.
The probability density function of an $F(d_1, d_2)$-distributed random variable is given by
$$g(x) = \frac{1}{\beta(d_1/2,\, d_2/2)}\left(\frac{d_1 x}{d_1 x + d_2}\right)^{d_1/2}\left(1 - \frac{d_1 x}{d_1 x + d_2}\right)^{d_2/2} x^{-1}$$
for real x > 0, where $d_1$ and $d_2$ are positive integers, and β is the beta function, $\beta(x,y) = \int_0^1 t^{x-1}(1-t)^{y-1}\,dt$.
– Asymptotically Chi-Square
• Suppose n → ∞. Since the ML estimates
are asymptotically consistent, the GLRT is
asymptotically UMP
• If the GLRT is hard to analyze directly,
sometimes asymptotic results can help
• Assume a partition
$$\theta = (\varphi_1, \cdots, \varphi_p, \underbrace{\xi_1, \cdots, \xi_q}_{\text{nuisance parameters}})$$
• Consider the GLRT for a two-sided problem
$$H_0: \varphi = \varphi_0, \qquad H_1: \varphi \ne \varphi_0$$
where ξ is unknown, but we don’t care what it is
• When the density p(y; θ) is smooth under H0, it can be shown that for large n
$$2\ln\Lambda_{GLR}(Y) = 2\ln\frac{p(Y;\hat\theta)}{p(Y;\theta_0)} \sim \chi^2_p$$
where $\chi^2_p$ is chi-square with p degrees of freedom:
$$E[\chi^2_p] = p, \qquad \mathrm{var}(\chi^2_p) = 2p$$
Link to Bayesian Land
• Remember, if we had a prior p(θ), we could handle composite hypothesis tests by integrating and reducing things to a simple hypothesis test:
$$p(y) = \int_{\mathbb{R}^p} p(y|\theta)\,p(\theta)\,d\theta$$
• If p(θ) varies slowly compared to p(y|θ) around the MAP estimate, we can approximate
$$p(y) \approx p(\theta)\int_{\mathbb{R}^p}\exp[L(\theta)]\,d\theta$$
• Suppose the MAP and ML estimates are approximately equal
Laplace’s Approximation
• Do a Taylor series expansion:
$$\int_{\mathbb{R}^p}\exp\left[L(\hat\theta_{ML}) - \frac{(\theta-\hat\theta_{ML})^T F(y;\hat\theta_{ML})(\theta-\hat\theta_{ML})}{2}\right]d\theta = e^{L(\hat\theta_{ML})}\int_{\mathbb{R}^p}\exp\left[-\frac{(\theta-\hat\theta_{ML})^T F(y;\hat\theta_{ML})(\theta-\hat\theta_{ML})}{2}\right]d\theta$$
where
$$F(y;\hat\theta_{ML}) = \left[-\frac{d^2 L(\theta)}{d\theta_r\,d\theta_c}\right]_{\theta=\hat\theta_{ML}} \quad\text{(empirical Fisher info)}$$
• Recognize the quadratic form of the Gaussian:
$$\int_{\mathbb{R}^p}\exp\left[-\frac{(\theta-\hat\theta_{ML})^T F(y;\hat\theta_{ML})(\theta-\hat\theta_{ML})}{2}\right]d\theta = \frac{(2\pi)^{p/2}}{\sqrt{\det F(y;\hat\theta_{ML})}}$$
so
$$p(y) = p(\hat\theta_{ML})\,p(y|\hat\theta_{ML})\,\frac{(2\pi)^{p/2}}{\sqrt{\det F(y;\hat\theta_{ML})}}$$
Large Sample Sizes
• Consider the log density:
$$\ln p(y) \approx \ln p(\hat\theta_{ML}) + \ln p(y|\hat\theta_{ML}) + \frac{p}{2}\ln 2\pi - \frac12\ln\det F(y;\hat\theta_{ML})$$
• Suppose we have n i.i.d. samples. By the law of large numbers:
$$\ln\det F(y;\hat\theta_{ML}) \approx \ln\det F(\hat\theta_{ML}) = \ln\det[nI] + \ln\det F_1(\hat\theta_{ML}) = p\ln n + \ln\det F_1(\hat\theta_{ML})$$
Schwarz’s Result
• As n → ∞
$$\ln p(y) \approx \ln p(\hat\theta_{ML}) + L(\hat\theta_{ML}) + \frac{p}{2}\ln 2\pi - \frac12 p\ln n - \frac12\ln\det F_1(\hat\theta_{ML}) \approx L(\hat\theta_{ML}) - \frac{p}{2}\ln n$$
• Called the Bayesian Information Criterion (BIC) or Schwarz Information Criterion (SIC)
• Often used in model selection; the second term is a penalty on model complexity
Minimum Description Length
• BIC is related to Rissanen’s Minimum Description Length criterion; (p/2) ln(n) is viewed as the optimum number of “nats” (like bits, but a different base) used to encode the ML parameter estimate with limited precision
• Data is encoded with a string of length
$$\text{description length} = -L(\hat\theta_{ML}) + \frac{p}{2}\ln n$$
• Choose the model which describes the data using the smallest number of bits (or nats) — see the sketch below
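A sketch of BIC-style model order selection for polynomial regression, using the notes’ convention $L(\hat\theta_{ML}) - (p/2)\ln n$; the data, noise level, and the choice to count only the polynomial coefficients in p are assumptions for illustration:

```python
import numpy as np

# BIC-style model order selection for polynomial regression (hypothetical data).
rng = np.random.default_rng(2)
n = 200
x = np.linspace(-1, 1, n)
y = 1.0 + 2.0 * x - 1.5 * x**2 + 0.3 * rng.standard_normal(n)  # true order 2

for order in range(5):
    p = order + 1                                    # count polynomial coefficients only
    resid = y - np.polyval(np.polyfit(x, y, order), x)
    sigma2 = resid @ resid / n
    loglik = -n / 2 * (np.log(2 * np.pi * sigma2) + 1)   # Gaussian max loglikelihood
    bic = loglik - p / 2 * np.log(n)                     # notes' convention: larger is better
    print(f"order {order}: BIC = {bic:.1f}")
# The largest value should flag the true order 2.
```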
General Multivariate Gaussian
Detection Problems
• We have a data vector y = [y1, · · · , yn]T
distributed according to N(µ, R)
• Two hypotheses:
H0 : µ = µ0, R = R0, H1 : µ = µ1, R = R1
• Likelihood ratio:
$$\Lambda(y) = \frac{\sqrt{\det R_0}\,\exp\left[-\frac12(y-\mu_1)^T R_1^{-1}(y-\mu_1)\right]}{\sqrt{\det R_1}\,\exp\left[-\frac12(y-\mu_0)^T R_0^{-1}(y-\mu_0)\right]}$$
• The test looks like
$$T(y) \;\underset{H_0}{\overset{H_1}{\gtrless}}\; \ln\tau + \frac12\ln\frac{\det R_1}{\det R_0} \equiv \gamma$$
where
$$T(y) \equiv \frac{(y-\mu_0)^T R_0^{-1}(y-\mu_0)}{2} - \frac{(y-\mu_1)^T R_1^{-1}(y-\mu_1)}{2}$$
Mahalanobis Distance Interpretation
• Define a norm on $\mathbb{R}^n$: $\|z\|_R = \sqrt{z^T R^{-1} z}$
• Emphasizes components of z which are collinear with eigenvectors of R associated with small eigenvalues
• We can rewrite the test statistic as
$$T(y) = \frac{\|y-\mu_0\|_{R_0}^2}{2} - \frac{\|y-\mu_1\|_{R_1}^2}{2}$$
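A small sketch of the statistic T(y) in this Mahalanobis form, with toy means and covariances assumed for illustration (linear solves are used in place of explicit inverses):

```python
import numpy as np

# Quadratic test statistic T(y) for the general Gaussian problem.
def T_stat(y, mu0, R0, mu1, R1):
    d0 = y - mu0
    d1 = y - mu1
    # ||z||_R^2 = z^T R^{-1} z, via a linear solve
    return 0.5 * d0 @ np.linalg.solve(R0, d0) - 0.5 * d1 @ np.linalg.solve(R1, d1)

mu0, mu1 = np.zeros(3), np.ones(3)                       # toy hypothesis means
R0 = np.eye(3)
R1 = np.array([[2.0, 0.5, 0.0], [0.5, 2.0, 0.5], [0.0, 0.5, 2.0]])
y = np.array([0.8, 1.1, 0.9])
print("T(y) =", T_stat(y, mu0, R0, mu1, R1))             # compare to gamma to decide
```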
Quadratic Form Interpretation
• Alternatively, express using a new statistic
$$T'(y) \equiv \frac{y^T(R_0^{-1} - R_1^{-1})\,y}{2} + (\mu_1^T R_1^{-1} - \mu_0^T R_0^{-1})\,y$$
and a new threshold
$$\gamma' = \ln\tau + \frac12\ln\frac{\det R_1}{\det R_0} + \frac12\left(\mu_1^T R_1^{-1}\mu_1 - \mu_0^T R_0^{-1}\mu_0\right)$$
Four Kinds of Decision Regions
1. If the covariances are equal, i.e. R = R0 = R1, the test reduces to
$$(\mu_1^T - \mu_0^T)R^{-1}y = a^T y, \qquad a \equiv R^{-1}(\mu_1 - \mu_0)$$
and the decision boundary is a hyperplane
2. If $R_0 > R_1$, i.e., $R_0 - R_1$ is positive definite, the H1 decision region is the interior of an ellipsoid
3. If $R_0 < R_1$, i.e., $R_0 - R_1$ is negative definite, the H1 decision region is the exterior of an ellipsoid
4. If none of the above apply, i.e. $R_0 - R_1$ is neither singular, positive definite, nor negative definite, then the decision region has hyperbolic boundaries
Known Signal in White Noise
• Consider the familiar special case:
$$\mu_0 = 0, \qquad \mu_1 = s, \qquad R = \sigma^2 I$$
• The test can be expressed as
$$T(y) \equiv s^T y \;\underset{H_0}{\overset{H_1}{\gtrless}}\; \sigma^2\ln\tau + \frac{\|s\|^2}{2} \equiv \gamma$$
• Analysis from the univariate case now applies. The “deflection ratio” or “detectability index” is
$$d^2 = \|s\|^2/\sigma^2 \equiv \mathrm{SNR}$$
Known Signal in Colored Noise
• Now let R be general
• The Neyman-Pearson test has the form
$$\frac{s^T R^{-1} y}{\sqrt{s^T R^{-1} s}} \;\underset{H_0}{\overset{H_1}{\gtrless}}\; Q^{-1}(\alpha)$$
• Could transform to the white noise case by linearly preprocessing the data, ỹ = Hy, to give an equivalent test in terms of s̃ = Hs:
$$\frac{\tilde s^T \tilde y}{\|\tilde s\|} \;\underset{H_0}{\overset{H_1}{\gtrless}}\; Q^{-1}(\alpha)$$
The Prewhitening Transformation
• Use MATLAB (or whatever) to compute the eigendecomposition
$$R = UDU^T$$
• We can define $R^{1/2} = UD^{1/2}$
• Our prewhitening operation is
$$H = R^{-1/2} = D^{-1/2}U^{-1} = D^{-1/2}U^T$$
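The same computation in numpy rather than MATLAB, as a sketch with a toy covariance:

```python
import numpy as np

# Prewhitening transformation H = D^{-1/2} U^T, with a toy covariance R.
R = np.array([[2.0, 0.8], [0.8, 1.0]])
eigvals, U = np.linalg.eigh(R)          # R = U D U^T for symmetric R
H = np.diag(eigvals ** -0.5) @ U.T      # H = D^{-1/2} U^T = R^{-1/2}

print(np.round(H @ R @ H.T, 10))        # ~ identity: the noise is whitened
```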
Performance for Colored Noise
• The test statistic for the colored noise problem is still Gaussian, so we can again use formulas from the univariate case with
$$d^2 = \|\tilde s\|^2 = s^T R^{-1} s$$
• In the white noise case, performance depended only on the total power of s, not its shape
• Here, in the colored noise case, performance depends on the shape as well!
Signal Design for Colored Noise
• Problem: maximize $d^2 = s^T R^{-1} s$ subject to the constraint $\|s\|^2 = 1$
• The Rayleigh quotient theorem says:
$$\frac{s^T R^{-1} s}{s^T s} \le \frac{1}{\min_i \lambda_i}$$
furthermore, we have equality if s is a minimizing eigenvector of R
• So, to make d² big, pick s to be that minimizing eigenvector — see the sketch below
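A sketch of the resulting signal design rule with a toy R (numpy’s eigh returns eigenvalues in ascending order, so the first column is the minimizing eigenvector):

```python
import numpy as np

# Pick the signal as the eigenvector of R with the smallest eigenvalue,
# which maximizes d^2 = s^T R^{-1} s under ||s|| = 1.
R = np.array([[2.0, 0.8], [0.8, 1.0]])
eigvals, U = np.linalg.eigh(R)     # eigenvalues in ascending order
s = U[:, 0]                        # eigenvector for the minimum eigenvalue
d2 = s @ np.linalg.solve(R, s)
print("d^2 =", d2, "= 1/min eigenvalue =", 1 / eigvals[0])
```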
• For equal means, the test statistic is purely quadratic:
$$(y-\mu)^T(R_0^{-1} - R_1^{-1})(y-\mu)$$
• Analysis is simplified by prefiltering to diagonalize $R_0^{-1} - R_1^{-1}$
• Still a total pain; the test statistic is a mixture of chi-square random variables
• For unequal means, it’s even more of a pain; the test statistic is a mixture of noncentral chi-square random variables
A Simple Zero-Mean Case
$$H_0: Y_k = W_k, \qquad H_1: Y_k = S_k + W_k$$
$$W_k \sim N(0, \sigma_W^2), \qquad S_k \sim N(0, \sigma_{S_k}^2)$$
Note the signal power is time-varying.
• In our generic notation, we have
$$R_0 = \mathrm{diag}(\sigma_W^2), \qquad R_1 = \mathrm{diag}(\sigma_W^2 + \sigma_{S_k}^2)$$
$$R_0^{-1} - R_1^{-1} = \mathrm{diag}\!\left(\frac{1}{\sigma_W^2} - \frac{1}{\sigma_W^2 + \sigma_{S_k}^2}\right) = \mathrm{diag}\!\left(\frac{1}{\sigma_W^2}\,\frac{\sigma_{S_k}^2}{\sigma_W^2 + \sigma_{S_k}^2}\right), \qquad \kappa_k \equiv \frac{\sigma_{S_k}^2}{\sigma_W^2 + \sigma_{S_k}^2}$$
• The test statistic is
$$y^T(R_0^{-1} - R_1^{-1})\,y = \frac{1}{\sigma_W^2}\sum_{k=1}^n \kappa_k y_k^2$$
• Several different interpretations
Filter-Squarer Interpretation
$$\sum_{k=1}^n \kappa_k y_k^2 = \sum_{k=1}^n \left(\sqrt{\kappa_k}\,y_k\right)^2$$
Wide-Sense-Stationary Sequences
• Suppose Sk and Wk are wide-sense stationary Gaussian time series with power spectral densities
$$S_S(e^{j\omega}), \quad S_W(e^{j\omega}), \qquad \omega \in [0, 2\pi)$$
• The estimator-correlator structure generalizes: each $y_k$ is multiplied by a signal estimate $\hat S_k$ produced by the filter $\frac{S_S(\omega)}{S_S(\omega)+S_W(\omega)}$, the products are summed, and the sum is compared to γ to decide between H1 and H0
Discrete-Time Signal Detection in Additive Noise
Signal w/ Known Amplitude in IID Noise
• We have data samples y1, · · · , yn drawn from
$$H_0: Y_k = N_k \qquad\text{or}\qquad H_1: Y_k = s_k + N_k$$
• The loglikelihood ratio test compares a threshold to:
$$\sum_{k=1}^n \ln L_k(y_k) = \sum_{k=1}^n \ln\frac{p_N(y_k - s_k)}{p_N(y_k)}$$
• General detector for known amplitude: pass each $y_k$ through $\ln L_k$ (a function of $s_k$), sum the outputs, and compare to γ to decide between H1 and H0
Gaussian Case (Known Amplitude)
• If the noise is N(0, σ²),
$$\sum_{k=1}^n \ln L_k(y_k) = \sum_{k=1}^n s_k(y_k - s_k/2) = \sum_{k=1}^n s_k y_k - s_k^2/2$$
(absorbing the 1/σ² scale factor)
• In a Neyman-Pearson or minimax setting, we could absorb the $s_k^2/2$ term into the threshold, and the test statistic is just a correlation
$$\sum_{k=1}^n s_k y_k$$
Laplacian Case (Known Amplitude)
• Suppose the noise has a Laplacian density
$$p_N(x) = \frac{\alpha}{2}\exp(-\alpha|x|)$$
• The optimum detector: shift each sample to $y_k - s_k/2$, pass it through a limiter that clips at $\pm|s_k|/2$, multiply by $\mathrm{sgn}(s_k)$, sum over k, and compare to γ to decide between H1 and H0 — a sketch follows
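A minimal sketch of that shift–clip–sign–sum structure, with a hypothetical signal and an assumed threshold:

```python
import numpy as np

# Laplacian known-amplitude detector: shift, clip, sign-correct, sum.
# Equivalent to sum_k alpha*(|y_k| - |y_k - s_k|) up to a positive scale factor.
def laplacian_statistic(y, s):
    shifted = y - s / 2                                         # subtract s_k / 2
    clipped = np.clip(shifted, -np.abs(s) / 2, np.abs(s) / 2)   # limiter at +/- |s_k|/2
    return np.sum(np.sign(s) * clipped)                         # sign-correct and sum

rng = np.random.default_rng(3)
s = np.sin(0.3 * np.arange(50))                 # hypothetical known signal
y = s + rng.laplace(scale=1.0, size=50)         # simulated H1 data, Laplacian noise
gamma = 0.0                                     # assumed threshold for illustration
print("decide H1" if laplacian_statistic(y, s) > gamma else "decide H0")
```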