This document introduces several concepts in estimation theory, including Bayesian parameter estimation, non-Bayesian parameter estimation, maximum likelihood estimation, and the Cramér-Rao lower bound. It works through examples of estimating parameters of linear and nonlinear models from observed data under different cost functions, and derives the mean-square-error, maximum a posteriori, and maximum-likelihood estimates.
1. Introduction to Estimation Theory
Bayesian (Random) Parameter Estimation
Nonrandom Parameter Estimation
2. In an estimation problem we assign a cost to all pairs [a, â(r)] over the range of interest.
In many cases of interest it is realistic to assume that the cost depends only on the error: c[a, â(r)] = c[â(r) − a].
4. Our goal is to find an estimate that minimizes the expected value of the cost:
R = E{c[a, â(r)]} = ∫_{−∞}^{∞} ∫_{−∞}^{∞} c[a, â(r)] p(a, r) dr da
R is the risk involved in estimating a from the observation(s) r.
5. R_MS = ∫_{−∞}^{∞} ∫_{−∞}^{∞} (â(r) − a)² p(a, r) dr da
= ∫_{−∞}^{∞} dr p(r) ∫_{−∞}^{∞} (â(r) − a)² p(a|r) da
Because p(r) ≥ 0 we can minimize the inner integral:
d/dâ ∫_{−∞}^{∞} (â(r) − a)² p(a|r) da = −2 ∫_{−∞}^{∞} a p(a|r) da + 2â(r) ∫_{−∞}^{∞} p(a|r) da = 0 ⇒
6. Then the mean-square estimate is:
â_MS(r) = ∫_{−∞}^{∞} a p(a|r) da
We have seen this before as the conditional mean.
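As a quick illustration (a minimal numerical sketch, not from the original slides; the Gaussian prior/likelihood and the observation value are assumptions for demonstration), the conditional mean can be approximated on a grid for any posterior:

```python
import numpy as np

# Minimal sketch: approximate the MMSE estimate (the conditional mean) on a grid.
# The Gaussian prior and likelihood below are illustrative assumptions.
a = np.linspace(-10, 10, 4001)          # grid over the parameter a
da = a[1] - a[0]
r = 1.7                                 # a single observation (assumed)

prior = np.exp(-a**2 / 2)               # p(a), unnormalized N(0, 1)
likelihood = np.exp(-(r - a)**2 / 2)    # p(r|a), unnormalized N(a, 1)
posterior = prior * likelihood
posterior /= posterior.sum() * da       # normalize so it integrates to 1

a_ms = (a * posterior).sum() * da       # â_MS = ∫ a p(a|r) da
print(a_ms)                             # ≈ r/2 = 0.85 for these equal variances
```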
7. d/dâ [ −2 ∫_{−∞}^{∞} a p(a|r) da + 2â(r) ∫_{−∞}^{∞} p(a|r) da ] = 2 > 0
Because the second derivative is positive, â_MS is the minimum.
8. The Bayes estimate for the absolute-value criterion:
R_abs = ∫_{−∞}^{∞} ∫_{−∞}^{∞} |â(r) − a| p(a, r) dr da = ∫_{−∞}^{∞} dr p(r) ∫_{−∞}^{∞} |â(r) − a| p(a|r) da
The inner integral:
I(r) = ∫_{−∞}^{â(r)} (â(r) − a) p(a|r) da + ∫_{â(r)}^{∞} (a − â(r)) p(a|r) da ⇒ dI(r)/dâ(r) = 0 ⇒
9. ∫_{−∞}^{â(r)} p(a|r) da = ∫_{â(r)}^{∞} p(a|r) da ⇒
This is the definition of the median. The absolute-error criterion leads to the estimate of a being the median of p(a|r) given the observation(s) r.
10. R_unif = ∫_{−∞}^{∞} dr p(r) [ 1 − ∫_{â_unif(r)−∆/2}^{â_unif(r)+∆/2} p(a|r) da ]
Minimizing R_unif amounts to maximizing
∫_{â_unif(r)−∆/2}^{â_unif(r)+∆/2} p(a|r) da ⇒
â_unif(r) occurs where p(a|r)|_{a=â(r)} is maximum.
11. This is MAP. A necessary, but not sufficient, condition for a maximum is
d/da {ln p(a|r)} |_{a=â_MAP} = 0
p(a|r) = p(r|a)p(a)/p(r) ⇒
ln p(a|r) = ln p(r|a) + ln p(a) − ln p(r)   (the last term is not a function of a) ⇒
max {ln p(a|r)} ≡ max {ln p(r|a) + ln p(a)}
16. Example
r_i = a + n_i, i = 1, 2, · · · , N, a ∼ N(0, σ_a²), n_i ∼ N(0, σ_n²) ⇒
p(r|a) = Π_{i=1}^{N} (1/(σ_n √(2π))) exp[ −(r_i − a)²/(2σ_n²) ]
p(a) = (1/(σ_a √(2π))) exp[ −a²/(2σ_a²) ]
We need to compute ∫_{−∞}^{∞} a p(a|r) da; one approach is p(a|r) = p(a)p(r|a)/p(r), but this is tedious. However, one can observe that p(a|r) is a PDF; then:
17. p(a|r) = (1/p(r)) { (1/((2π)^{N/2} σ_n^N)) (1/(√(2π) σ_a)) exp[ −Σ_{i=1}^{N} (r_i − a)²/(2σ_n²) ] exp[ −a²/(2σ_a²) ] }
p(a|r) = k(r) exp{ −(1/(2σ_p²)) [ a − (σ_a²/(σ_a² + σ_n²/N)) (1/N) Σ_{i=1}^{N} r_i ]² }
σ_p² = ( 1/σ_a² + N/σ_n² )^{−1}
18. We see that p(a|r) is Gaussian; then â_MS(r) is the conditional mean:
â_MS(r) = (σ_a²/(σ_a² + σ_n²/N)) (1/N) Σ_{i=1}^{N} r_i
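A quick simulation (an illustrative sketch, not part of the slides; all numbers are assumed) confirms that this shrinkage of the sample mean beats the raw sample mean in mean-square error when the prior is correct:

```python
import numpy as np

# Sketch: compare the Bayesian estimate â_MS (shrunk sample mean) with the
# plain sample mean, for a ~ N(0, sig_a^2) and r_i = a + n_i, n_i ~ N(0, sig_n^2).
rng = np.random.default_rng(0)
sig_a, sig_n, N, trials = 1.0, 2.0, 5, 100_000

a = rng.normal(0, sig_a, trials)                     # draw the random parameter
r = a[:, None] + rng.normal(0, sig_n, (trials, N))   # N noisy observations each
r_bar = r.mean(axis=1)

shrink = sig_a**2 / (sig_a**2 + sig_n**2 / N)        # σ_a²/(σ_a² + σ_n²/N)
a_ms = shrink * r_bar

print("MSE of sample mean:", np.mean((r_bar - a)**2))  # ≈ σ_n²/N = 0.8
print("MSE of â_MS:      ", np.mean((a_ms - a)**2))    # smaller: ≈ σ_p² ≈ 0.44
```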
19. If σ_a² ≫ σ_n²/N ⇒ the observed data are much more reliable than the a priori knowledge, and â_MS(r) ≈ (1/N) Σ_{i=1}^{N} r_i.
If σ_a² ≪ σ_n²/N ⇒ the a priori knowledge dominates, the estimate makes little use of the received data, and â_MS(r) ≈ 0.
20. For MAP:
The location where p(a|r) is maximum is the mean of the Gaussian ⇒
â_MAP(r) = â_MS(r) = (σ_a²/(σ_a² + σ_n²/N)) (1/N) Σ_{i=1}^{N} r_i
Also, because the median of a Gaussian occurs at its mean, for this problem:
â_MAP(r) = â_MS(r) = â_abs(r)
21. This invariance to the choice of cost function is important, because subjective judgments are frequently involved in choosing the cost function.
22. An example of a nonlinear problem:
r_m = a³ + ν_m, m = 1, 2, · · · , M, ν_m ∼ N(0, σ_n²), a ∼ N(0, σ_a²)
p(a|r) = k(r) exp{ −(1/2) [ Σ_{m=1}^{M} (r_m − a³)²/σ_n² + a²/σ_a² ] }
â_MAP(r) satisfies the stationarity condition
(1/σ_n²) Σ_{m=1}^{M} [r_m − a³](3a²) − a/σ_a² = 0 at a = â_MAP(r)
28. Example:
Pr(n events|a) = (aⁿ/n!) e^{−a}, n = 0, 1, 2, · · ·
p(a) = λ e^{−λa}, a > 0 ⇒
p(a|n) = Pr(n|a)p(a)/Pr(n) = k(n) (aⁿ/n!) e^{−a} λ e^{−λa}
Because ∫_0^∞ p(a|n) da = 1 ⇒
k(n) = (1 + λ)^{n+1}/λ
29. â_MS(n) = ∫_0^∞ a p(a|n) da = (n + 1)/(λ + 1)
â_MAP(n) = arg max {ln p(n|a) + ln p(a)} = n/(λ + 1)
â_abs(n) solves ∫_0^{â_abs} p(a|n) da = ∫_{â_abs}^{∞} p(a|n) da = 1/2: a polynomial equation with no closed-form solution.
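The following sketch (illustrative only; the λ and n values are assumed) evaluates all three estimates numerically against the closed forms above; the median is obtained from SciPy's gamma distribution, since p(a|n) is Gamma with shape n+1 and rate 1+λ:

```python
from scipy import stats

# Sketch: posterior p(a|n) ∝ a^n e^{-(1+λ)a} is Gamma(shape=n+1, rate=1+λ).
lam, n = 2.0, 5                        # assumed values for illustration
post = stats.gamma(a=n + 1, scale=1.0 / (1.0 + lam))

a_ms = post.mean()                     # (n+1)/(λ+1): mean-square (conditional mean)
a_map = n / (1.0 + lam)                # mode of the Gamma posterior
a_abs = post.median()                  # absolute-error estimate: the median

print(a_ms, a_map, a_abs)              # 2.0, 1.667, with the median in between
```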
30. The first measure of quality is the bias:
E{â(r)} = ∫_{−∞}^{∞} â(r) p(r|a) dr
1. If E{â(r)} = a: unbiased estimate.
2. If E{â(r)} = a + b: biased, but the bias is known.
3. If E{â(r)} = a + b(a): biased, and the bias is unknown.
Even an unbiased estimate can yield bad results.
31. Usually the PDF of the estimate is centered around a. Therefore, the second measure of quality is the variance of the estimate:
var[â(r) − a] = E{ [â(r) − a]² } − B²(a)
General strategy: we shall try to find an unbiased estimate with small variance.
32. Maximum Likelihood Estimation (MLE):
r = a + n, p(r|a) = N(a, σ_n²)
We choose the value of a that most likely caused the given value of r.
The likelihood function (LF) of the observation given a is p(r|a); the log-likelihood function (LLF) is ln p(r|a).
33. We maximize the LF or LLF with respect to the unknown parameter:
â_ML(r) is the value of a at which p(r|a) is maximum.
If â_ML(r) is interior to the range of a, and ln p(r|a) has a maximum there, then a = â_ML(r) is that value.
The ML estimate is the limiting value of MAP as the a priori knowledge → 0:
MAP: ∂/∂a ln p(r|a) + ∂/∂a ln p(a) = 0, where the second term carries the a priori knowledge.
41. If â(r) is any unbiased estimate of a ⇒
var[â(r) − a] ≥ ( E{ [∂/∂a ln p(r|a)]² } )^{−1}
var[â(r) − a] ≥ ( −E{ ∂²/∂a² ln p(r|a) } )^{−1}
These are called the Cramér-Rao lower bounds (CRLB).
42. Any estimate that satisfies the CRLB with equality is called an efficient estimate.
43. Because â(r) is unbiased:
E[â(r) − a] = ∫_{−∞}^{∞} p(r|a)[â(r) − a] dr = 0
∂/∂a {E[â(r) − a]} = ∫_{−∞}^{∞} (∂p(r|a)/∂a)[â(r) − a] dr − 1 = 0
∂p(r|a)/∂a = (∂ ln p(r|a)/∂a) p(r|a) ⇒
∫_{−∞}^{∞} (∂ ln p(r|a)/∂a) p(r|a)[â(r) − a] dr = 1
44. ∫_{−∞}^{∞} [ (∂ ln p(r|a)/∂a) √(p(r|a)) ] [ √(p(r|a)) (â(r) − a) ] dr = 1
Using the Schwarz inequality:
∫_{−∞}^{∞} [∂ ln p(r|a)/∂a]² p(r|a) dr · ∫_{−∞}^{∞} p(r|a)[â(r) − a]² dr ≥ 1
(the second factor is var[â(r) − a]) ⇒
var[â(r) − a] ≥ ( E{ [∂/∂a ln p(r|a)]² } )^{−1}
46. For the 2nd representation:
∫_{−∞}^{∞} p(r|a) dr = 1 ⇒
∫_{−∞}^{∞} ∂p(r|a)/∂a dr = ∫_{−∞}^{∞} (∂ ln p(r|a)/∂a) p(r|a) dr = 0
Differentiating again:
∫_{−∞}^{∞} (∂² ln p(r|a)/∂a²) p(r|a) dr + ∫_{−∞}^{∞} [∂ ln p(r|a)/∂a]² p(r|a) dr = 0
47. E{ ∂² ln p(r|a)/∂a² } = −E{ [∂ ln p(r|a)/∂a]² }
Then the 2nd representation results.
1. From the CRLB, any unbiased estimate must have a variance greater than a certain limit.
2. If ∂ ln p(r|a)/∂a = k(a)[â(r) − a], then â_ML(r) will satisfy the CRLB with equality.
52. Example:
r_i = a + n_i, i = 1, 2, · · · , N, n_i ∼ N(0, σ_n²)
∂ ln p(r|a)/∂a = (N/σ_n²) [ (1/N) Σ_{i=1}^{N} r_i − a ] = 0 ⇒ â_ML(r) = (1/N) Σ_{i=1}^{N} r_i
E[â_ML(r)] = (1/N) Σ_{i=1}^{N} E(r_i) = (1/N) Σ_{i=1}^{N} a = a ⇒
â_ML(r) is an unbiased estimator.
53. The variance of the estimator:
∂² ln p(r|a)/∂a² = −N/σ_n² ⇒
var[â_ML(r) − a] = σ_n²/N → 0 as N → ∞
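A minimal Monte Carlo sketch (all values assumed for illustration) confirms that the sample mean attains the CRLB σ_n²/N here:

```python
import numpy as np

# Sketch: verify empirically that var[â_ML] matches the CRLB σ_n²/N.
rng = np.random.default_rng(1)
a_true, sig_n, N, trials = 3.0, 2.0, 10, 200_000

r = a_true + rng.normal(0, sig_n, (trials, N))
a_ml = r.mean(axis=1)                 # â_ML = sample mean

print("empirical variance:", a_ml.var())      # ≈ 0.4
print("CRLB σ_n²/N:      ", sig_n**2 / N)     # 0.4
```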
54. Example:
Pr(n events|a) = (aⁿ/n!) e^{−a}, n = 0, 1, 2, · · ·
∂ ln p(n = N|a)/∂a = ∂/∂a [N ln a − a − ln N!] = N/a − 1 = (1/a)[N − a] ⇒ â_ML(N) = N
∂² ln p(n = N|a)/∂a² = −N/a² ⇒
var[â_ML(N) − a] = a
55. Example:
r_i = s(a) + n_i, i = 1, 2, · · · , N, n_i ∼ N(0, σ_n²)
p(r|a) = (1/(√(2π) σ_n))^N exp[ −Σ_{i=1}^{N} (r_i − s(a))²/(2σ_n²) ]
∂ ln p(r|a)/∂a = (1/σ_n²) Σ_{i=1}^{N} (r_i − s(a)) ∂s(a)/∂a
In general this cannot be written in the form required by
∂ ln p(r|a)/∂a = k(a)[â(r) − a]
56. Therefore, an unbiased efficient estimate does not exist.
62. Setting ∂ ln p(r|a)/∂a = 0 at a = â_ML(r) ⇒
â_ML(r) = s^{−1}( (1/N) Σ_{i=1}^{N} r_i )
∂² ln p(r|a)/∂a² = (1/σ_n²) Σ_{i=1}^{N} [r_i − s(a)] ∂²s(a)/∂a² − (N/σ_n²) [∂s(a)/∂a]² ⇒
E{ ∂² ln p(r|a)/∂a² } = −(N/σ_n²) [∂s(a)/∂a]²
63. Because E{ Σ [r_i − s(a)] } = 0 ⇒
var[â_ML(r) − a] ≥ σ_n² / ( N [∂s(a)/∂a]² )
64. Example of Bayesian Estimation
Suppose we collect n Poisson-distributed data points with mean θ:
Y_i ∼ Poiss(θ), p(y_i|θ) = e^{−θ} θ^{y_i}/y_i!, y_i ∈ {0, 1, 2, . . .}
The likelihood is:
p(y|θ) = Π_{i=1}^{n} e^{−θ} θ^{y_i}/y_i!
Suppose the prior is exponentially distributed with mean 1/b:
p(θ) = b exp(−bθ) u(θ)
65. The posterior is:
p(θ|y) = p(y|θ)p(θ)/p(y) = [ Π_{i=1}^{n} e^{−θ} θ^{y_i}/y_i! ] b exp(−bθ) / ∫_0^∞ [ Π_{i=1}^{n} e^{−θ̃} θ̃^{y_i}/y_i! ] b exp(−bθ̃) dθ̃
E[Θ|Y = y] = ∫_{−∞}^{∞} θ p(θ|y) dθ = ∫_0^∞ θ e^{−(n+b)θ} θ^T dθ / ∫_0^∞ e^{−(n+b)θ̃} θ̃^T dθ̃
66. ∫_0^∞ θ e^{−(n+b)θ} θ^T dθ = ∫_0^∞ e^{−(n+b)θ} θ^{T+1} dθ = Γ(T + 2)/(n + b)^{T+2}, where T ≡ Σ_{i=1}^{n} y_i
E[Θ|Y = y] = [ Γ(T + 2)/(n + b)^{T+2} ] / [ Γ(T + 1)/(n + b)^{T+1} ] = [ (T + 1)!/(n + b)^{T+2} ] / [ T!/(n + b)^{T+1} ]
E[Θ|Y = y] = ( 1 + Σ_{i=1}^{n} y_i ) / (n + b)
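As a sanity check (a sketch with assumed numbers), the posterior here is Gamma with shape T+1 and rate n+b, so its mean can be compared with (T+1)/(n+b) directly:

```python
import numpy as np
from scipy import stats

# Sketch: Poisson data with an exponential prior give a Gamma(T+1, n+b) posterior.
rng = np.random.default_rng(2)
theta_true, b, n = 4.0, 0.5, 20            # assumed values
y = rng.poisson(theta_true, n)
T = y.sum()

post = stats.gamma(a=T + 1, scale=1.0 / (n + b))
print("posterior mean:", post.mean())       # equals (T+1)/(n+b)
print("closed form:   ", (T + 1) / (n + b))
```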
67. MAP Estimate:
p(y|θ)p(θ) = e^{−nθ} θ^T / (Π_{i=1}^{n} y_i!) · b exp(−bθ)
ln p(θ|y) = −nθ + T ln θ − bθ + const
(d/dθ) ln p(θ|y) = −n + T/θ − b = 0
θ̂_MAP(y) = T/(n + b)
68. Minimum Absolute Error (MAE)
Must solve:
∫_{−∞}^{θ̂_MAE(y)} p(θ|y) dθ = 1/2
∫_0^{θ̂_MAE(y)} e^{−(n+b)θ} θ^T (n + b)^{T+1}/T! dθ = 1/2
∫_0^{θ̂_MAE(y)} e^{−(n+b)θ} θ^T dθ = T!/(2(n + b)^{T+1})
(the left-hand side is an incomplete Gamma function)
Because:
69. ∫_0^∞ e^{−(n+b)θ} θ^T dθ = Γ(T + 1)/(n + b)^{T+1} = T!/(n + b)^{T+1}
The solution for MAE is based on expressing the left-hand side as an "incomplete Gamma function" (gammainc in MATLAB); this has an inverse, so one can solve for θ̂_MAE(y).
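A sketch of that inversion in Python (SciPy's gammaincinv plays the role of MATLAB's inverse incomplete gamma; the T, n, b values below are assumed): since the posterior is Gamma with shape T+1 and rate n+b, θ̂_MAE is simply its median.

```python
from scipy.special import gammaincinv

# Sketch: θ̂_MAE solves P(T+1, (n+b)θ) = 1/2, where P is the regularized lower
# incomplete gamma function; gammaincinv inverts P in its second argument.
T, n, b = 42, 20, 0.5                       # assumed values
theta_mae = gammaincinv(T + 1, 0.5) / (n + b)
print(theta_mae)                            # the posterior median
```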
70. Suppose we instead want to estimate γ = √θ.
Can we just say
γ̂_CME(y) = √(θ̂_CME(y))   (CME = conditional mean estimate)
γ̂_MAP(y) = √(θ̂_MAP(y))  ?
We'll see the answer is NO.
71. Note that knowing γ is equivalent to knowing θ:
p_{Y|Γ}(y|γ) = p_{Y|Θ}(y|θ)
87. Computing the New MAP Estimate
p_{Y|Γ}(y|γ) = Π_{i=1}^{n} e^{−γ²} (γ²)^{y_i}/y_i!, p_Γ(γ) = 2γ b exp(−bγ²)
The new log-posterior:
H = ln p_{Γ|Y}(γ|y) = −nγ² + (2 ln γ)T + ln γ − bγ² + const
dH/dγ = −2nγ + 2T/γ + 1/γ − 2bγ = 0
88. γ̂²_MAP(y) = (2T + 1)/(2n + 2b) = (T + 1/2)/(n + b) ≠ T/(n + b) = θ̂_MAP(y)
γ̂_MAP = √( (T + 1/2)/(n + b) )
As an aside, recall:
θ̂_CME(y) = (T + 1)/(n + b), and similarly γ̂_CME(y) ≠ √(θ̂_CME(y))
If we went through the same exercise for the MMSE estimate, we would likely come to a similar conclusion.
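A one-line numeric check (assumed T, n, b) makes the gap concrete:

```python
import math

# Sketch: the MAP estimate of γ = √θ is not the square root of the MAP estimate of θ.
T, n, b = 42, 20, 0.5                      # assumed values
theta_map = T / (n + b)                    # MAP in the θ parameterization
gamma_map = math.sqrt((T + 0.5) / (n + b)) # MAP in the γ parameterization

print(math.sqrt(theta_map), gamma_map)     # 1.431... vs 1.439...: not equal
```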
89. In general, for Bayesian estimates, the estimate of f(θ) is not f(θ̂(y)), whether MAP, MMSE, MAE, or whatever.
For the special case of affine transformations,
γ = f(θ) = aθ + b ⇒ γ̂ = aθ̂ + b
γ = g(θ) = aθ + b, θ = g^{−1}(γ) = (γ − b)/a
94. θ = (γ − b)/a, p_Γ(γ) = (1/a) p_Θ((γ − b)/a)
MAP estimation:
ln p_{Γ|Y}(γ|y) = ln p_{Y|Θ}(y | (γ − b)/a) + ln(1/a) + ln p_Θ((γ − b)/a)
Similar arguments work for MMSE, MAE, etc.
99. 1. For ML estimates, as N → ∞, â_ML(r) → a in a probability sense; this is called a consistent estimate.
2. The ML estimate is asymptotically efficient:
lim_{N→∞} var[â_ML(r) − a] / ( −E{ ∂² ln p(r|a)/∂a² } )^{−1} = 1
3. The ML estimate is asymptotically Gaussian, N(a, σ_aε)
100. Method of Moments
To find the method-of-moments estimator of θ_1, · · · , θ_p we set up and solve the equations:
µ_1(θ_1, · · · , θ_p) = m_1
µ_2(θ_1, · · · , θ_p) = m_2
...
µ_p(θ_1, · · · , θ_p) = m_p
The kth sample moment is defined to be:
m_k = (1/n) Σ_{i=1}^{n} x_i^k
101. Example
Let x_1, · · · , x_n denote a sample from the uniform distribution from θ_1 to θ_2:
f(x; θ_1, θ_2) = 1/(θ_2 − θ_1) for θ_1 ≤ x ≤ θ_2, and 0 elsewhere.
102. The joint density of x_1, · · · , x_n is:
L(θ_1, θ_2) = f(x_1, . . . , x_n; θ_1, θ_2) = Π_{i=1}^{n} f(x_i; θ_1, θ_2)
= 1/(θ_2 − θ_1)^n if θ_1 ≤ x_1 ≤ θ_2, . . . , θ_1 ≤ x_n ≤ θ_2; 0 elsewhere
= 1/(θ_2 − θ_1)^n if θ_1 ≤ min_i(x_i) and max_i(x_i) ≤ θ_2; 0 elsewhere
103. To find the maximum likelihood estimates of θ_1 and θ_2 we determine ∂L(θ_1, θ_2)/∂θ_1 and ∂L(θ_1, θ_2)/∂θ_2:
∂L/∂θ_1 = n(θ_2 − θ_1)^{−(n+1)} if θ_1 ≤ min_i(x_i) and max_i(x_i) ≤ θ_2; 0 elsewhere
∂L/∂θ_2 = −n(θ_2 − θ_1)^{−(n+1)} if θ_1 ≤ min_i(x_i) and max_i(x_i) ≤ θ_2; 0 elsewhere
104. Note that ∂L(θ_1, θ_2)/∂θ_1 and ∂L(θ_1, θ_2)/∂θ_2 are never equal to zero:
∂L(θ_1, θ_2)/∂θ_1 is always positive and ∂L(θ_1, θ_2)/∂θ_2 is always negative.
105. Hence the maximum likelihood estimates of θ_1 and θ_2 are
θ̂_1 = min_i(x_i), θ̂_2 = max_i(x_i)
This compares with the method-of-moments estimators:
θ̃_1 = x̄ − √( (3/n) Σ_{i=1}^{n} (x_i − x̄)² )
θ̃_2 = x̄ + √( (3/n) Σ_{i=1}^{n} (x_i − x̄)² )
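A short sketch (sample size and true endpoints assumed) computes both sets of estimators on simulated data:

```python
import numpy as np

# Sketch: ML vs method-of-moments estimators for Uniform(θ1, θ2).
rng = np.random.default_rng(3)
theta1, theta2, n = 2.0, 7.0, 50
x = rng.uniform(theta1, theta2, n)

# ML: the sample extremes
ml = (x.min(), x.max())

# Method of moments: match the mean and variance of the uniform distribution
xbar = x.mean()
half_width = np.sqrt(3.0 * np.mean((x - xbar) ** 2))
mom = (xbar - half_width, xbar + half_width)

print("ML: ", ml)
print("MoM:", mom)
```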
106. The sampling distribution of
θ̂_1 = min_i(x_i), θ̂_2 = max_i(x_i)
Solution: we use the distribution-function method. Write θ̂_1 = min_i(x_i) = m and θ̂_2 = max_i(x_i) = M.
G_1(u) = P[m ≤ u] = P[min_i(x_i) ≤ u] = 1 − P[min_i(x_i) ≥ u] = 1 − P[x_1 ≥ u, · · · , x_n ≥ u]
107. = 1 − P[x_1 ≥ u] · · · P[x_n ≥ u] = 1 − ((θ_2 − u)/(θ_2 − θ_1))^n
Thus the density of m = θ̂_1 = min_i(x_i) is
g_1(u) = G_1′(u) = −n ((θ_2 − u)/(θ_2 − θ_1))^{n−1} · (−1/(θ_2 − θ_1)) = n(θ_2 − u)^{n−1}/(θ_2 − θ_1)^n
108. Is m = θ̂_1 = min_i(x_i) unbiased?
E[m] = E[θ̂_1] = E[min_i(x_i)] = ∫_{θ_1}^{θ_2} u g_1(u) du = ∫_{θ_1}^{θ_2} u n(θ_2 − u)^{n−1}/(θ_2 − θ_1)^n du
Put v = θ_2 − u; then the above integral becomes
∫_0^{θ_2−θ_1} (θ_2 − v) n v^{n−1}/(θ_2 − θ_1)^n dv
109. E[θ̂_1] = (n/(θ_2 − θ_1)^n) [ θ_2 ∫_0^{θ_2−θ_1} v^{n−1} dv − ∫_0^{θ_2−θ_1} v^n dv ]
= (n/(θ_2 − θ_1)^n) [ θ_2 (θ_2 − θ_1)^n/n − (θ_2 − θ_1)^{n+1}/(n + 1) ]
110. = θ_2 − n(θ_2 − θ_1)/(n + 1) = (n/(n + 1)) θ_1 + (1/(n + 1)) θ_2 = θ_1 + (θ_2 − θ_1)/(n + 1)
111. Is M = θ̂_2 = max_i(x_i) unbiased?
E[M] = E[θ̂_2] = E[max_i(x_i)] = ∫_{θ_1}^{θ_2} v g_2(v) dv = ∫_{θ_1}^{θ_2} v n(v − θ_1)^{n−1}/(θ_2 − θ_1)^n dv
Put w = v − θ_1; then the above integral becomes
112. ∫_0^{θ_2−θ_1} (w + θ_1) n w^{n−1}/(θ_2 − θ_1)^n dw
E[θ̂_2] = (n/(θ_2 − θ_1)^n) [ ∫_0^{θ_2−θ_1} w^n dw + θ_1 ∫_0^{θ_2−θ_1} w^{n−1} dw ]
= (n/(θ_2 − θ_1)^n) [ (θ_2 − θ_1)^{n+1}/(n + 1) + θ_1 (θ_2 − θ_1)^n/n ]
113. = n(θ_2 − θ_1)/(n + 1) + θ_1 = (1/(n + 1)) θ_1 + (n/(n + 1)) θ_2 = θ_2 − (θ_2 − θ_1)/(n + 1)
E[θ̂_2 − θ̂_1] = E[θ̂_2] − E[θ̂_1]
= [θ_2 − (θ_2 − θ_1)/(n + 1)] − [θ_1 + (θ_2 − θ_1)/(n + 1)]
= [1 − 2/(n + 1)](θ_2 − θ_1)
= ((n − 1)/(n + 1))(θ_2 − θ_1)
114. Hence
E[ ((n + 1)/(n − 1)) (θ̂_2 − θ̂_1) ] = θ_2 − θ_1
We can use this to remove the bias of θ̂_1 and θ̂_2:
T_1 = θ̂_1 − (1/(n + 1)) ((n + 1)/(n − 1)) [θ̂_2 − θ̂_1] = θ̂_1 − (1/(n − 1)) [θ̂_2 − θ̂_1] = m − (M − m)/(n − 1)
115. and
T_2 = θ̂_2 + (1/(n + 1)) ((n + 1)/(n − 1)) [θ̂_2 − θ̂_1] = θ̂_2 + (1/(n − 1)) [θ̂_2 − θ̂_1] = M + (M − m)/(n − 1)
Then T_1 and T_2 are unbiased estimators of θ_1 and θ_2.
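A Monte Carlo sketch (assumed endpoints and sample size) shows the inward bias of the raw extremes and the debiased T1, T2:

```python
import numpy as np

# Sketch: bias of the sample extremes vs the corrected estimators T1, T2.
rng = np.random.default_rng(4)
theta1, theta2, n, trials = 2.0, 7.0, 10, 100_000

x = rng.uniform(theta1, theta2, (trials, n))
m, M = x.min(axis=1), x.max(axis=1)

T1 = m - (M - m) / (n - 1)
T2 = M + (M - m) / (n - 1)

print("E[m], E[M]:  ", m.mean(), M.mean())    # biased inward by (θ2-θ1)/(n+1)
print("E[T1], E[T2]:", T1.mean(), T2.mean())  # ≈ 2.0 and 7.0: unbiased
```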
116. Uniformly Better
Let x = (x_1, x_2, · · · , x_n) denote the vector of observations having joint density f(x|θ), where the unknown parameter vector θ ∈ Ω. Let T(x) and T*(x) be estimators of the parameter φ(θ). Then T(x) is said to be uniformly better than T*(x) if:
MSE_{T(x)}(θ) ≤ MSE_{T*(x)}(θ), θ ∈ Ω
117. Uniformly Minimum Variance Unbiased Estimator
Let x = (x_1, x_2, · · · , x_n) denote the vector of observations having joint density f(x|θ), where the unknown parameter vector θ ∈ Ω. Then T*(x) is said to be the UMVU (uniformly minimum variance unbiased) estimator of φ(θ) if:
E[T*(x)] = φ(θ), θ ∈ Ω
Var[T*(x)] ≤ Var[T(x)] for every T(x) with E[T(x)] = φ(θ).
118. Multiple Parameter Estimation
â_ε(r) = [ â_1(r) − a_1, â_2(r) − a_2, . . . , â_K(r) − a_K ]ᵀ = â(r) − a
Cost function for the MSE criterion:
C(â_ε(r)) = Σ_{i=1}^{K} â²_{εi}(r) = â_εᵀ(r) â_ε(r)
119. Risk:
R_MSE = ∫_{−∞}^{∞} ∫_{−∞}^{∞} C(â_ε(r)) p(r, a) dr da
= ∫_{−∞}^{∞} p(r) dr ∫_{−∞}^{∞} [ Σ_{i=1}^{K} (â_i(r) − a_i)² ] p(a|r) da   (p(r) ≥ 0) ⇒
â_{MSE,i}(r) = ∫_{−∞}^{∞} a_i p(a|r) da, or in vector form â_MSE(r) = ∫_{−∞}^{∞} a p(a|r) da
120. The above estimates carry over under linear transformations:
b = D_{L×K} a, E{ b_εᵀ(r) b_ε(r) } = E[ Σ_{i=1}^{L} b²_{εi}(r) ]
b_MSE(r) = D â_MSE(r)
121. For MAP we find the a that maximizes p(a|r):
∂ ln p(a|r)/∂a_i |_{a=â_MAP} = 0, i = 1, 2, · · · , K
126. Bias:
E{a_ε(r)} = E{â(r) − a} = B(a)
If B(a) equals zero then â(r) is an unbiased estimate.
For vector variables the quantity analogous to the variance is the covariance matrix:
E{ (a_ε − E(a_ε)) (a_ε − E(a_ε))ᵀ } = Λ_ε, with E(a_ε) = B(a)
127. Let's consider any unbiased estimator of a:
σ²_{εi} = var[â_i(r) − a_i] ≥ J^{ii}
where J^{ii} is the ith diagonal element of the inverse of the K × K matrix J, with
J_{ij} = E[ (∂ ln p(r|a)/∂a_i) · (∂ ln p(r|a)/∂a_j) ]
J is called Fisher's information matrix (FIM).
128. For any estimator we are interested in:
1. Bias: E{â(r)}
2. Error covariance: E{ (a_ε − E(a_ε)) (a_ε − E(a_ε))ᵀ }
129. Example:
Consider the random variable Y with
E[Y] = g(U_1, U_2, · · · , U_k) = Σ_{i=1}^{p} β_i φ_i(U_1, U_2, · · · , U_k)
and var(Y) = σ²,
where β_i, i = 1, 2, · · · , p are unknown parameters; φ_i, i = 1, 2, · · · , p are known functions of the nonrandom variables U_i, i = 1, 2, · · · , k; and Y is further assumed to be normally distributed.
134. = h(y) g(β, σ²) exp{ −(1/(2σ²)) [y′y − 2y′Xβ] }
Thus f(y|β, σ²) is a member of the exponential family of distributions and
S = (y′y, X′y)
is a minimal complete set of sufficient statistics.
135. The maximum likelihood estimates of β and σ² are the values β̂ and σ̂² that maximize
L_y(σ², β) = (1/(2πσ²)^{n/2}) exp{ −(1/(2σ²)) [y − Xβ]′[y − Xβ] }
or equivalently
l_y(σ², β) = ln L_y(σ², β)
136. = −(n/2) ln(2π) − (n/2) ln σ² − (1/(2σ²)) [y − Xβ]′[y − Xβ]
= −(n/2) ln(2π) − (n/2) ln σ² − (1/(2σ²)) [y′y − 2y′Xβ + β′X′Xβ]
∂l_y(σ², β)/∂β = 0
yields the system of linear equations (the normal equations):
X′Xβ̂ = X′y
137. while ∂l_y(σ², β)/∂σ² = 0 yields the equation:
σ̂² = (1/n) [y − Xβ̂]′[y − Xβ̂]
If [X′X]^{−1} exists then the normal equations have the solution:
β̂ = (X′X)^{−1} X′y
138. and
σ̂² = (1/n) [y − Xβ̂]′[y − Xβ̂]
= (1/n) [y − X(X′X)^{−1}X′y]′ [y − X(X′X)^{−1}X′y]
= (1/n) [y′y − y′X(X′X)^{−1}X′y]
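A compact numerical sketch of those ML formulas (the design matrix, coefficients, and noise level are assumed); numpy.linalg.lstsq is used instead of forming (X′X)^{−1} explicitly, which is numerically safer:

```python
import numpy as np

# Sketch: ML estimates for the linear-Gaussian model y = Xβ + noise.
rng = np.random.default_rng(5)
n, p = 100, 3
X = rng.normal(size=(n, p))                  # assumed design matrix
beta_true = np.array([1.0, -2.0, 0.5])
y = X @ beta_true + rng.normal(0, 0.3, n)

beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)  # solves the normal equations
resid = y - X @ beta_hat
sigma2_hat = resid @ resid / n               # ML variance estimate

print(beta_hat, sigma2_hat)
```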
139. Almost all problems in statistics can be formulated as problems of making a decision. That is, given some data observed from a phenomenon, a decision has to be made about the phenomenon. Decisions are generally broken into two types:
Estimation decisions
Hypothesis-testing decisions
Probability theory plays a very important role in these decisions.
140. Besides the normal distribution, the following distributions play an important role in estimation and hypothesis testing:
Chi-squared distribution with ν degrees of freedom:
f(x) = (1/(Γ(ν/2) 2^{ν/2})) x^{(ν−2)/2} e^{−x/2}, x > 0
Comment: if z_1, z_2, · · · , z_ν are independent random variables each having a standard normal distribution, then U = Σ_{k=1}^{ν} z_k² has a chi-squared distribution with ν degrees of freedom.
141. F distribution with ν_1 degrees of freedom in the numerator and ν_2 degrees of freedom in the denominator:
f(x) = K x^{(ν_1−2)/2} (1 + (ν_1/ν_2) x)^{−(ν_1+ν_2)/2}, x > 0, K = Γ((ν_1 + ν_2)/2) (ν_1/ν_2)^{ν_1/2} / (Γ(ν_1/2) Γ(ν_2/2))
142. Comment: if U_1 and U_2 are independent random variables having chi-squared distributions with ν_1 and ν_2 degrees of freedom respectively, then
F = (U_1/ν_1)/(U_2/ν_2)
has an F distribution with ν_1 degrees of freedom in the numerator and ν_2 degrees of freedom in the denominator.
143. The t distribution with ν degrees of freedom:
f(x) = K (1 + x²/ν)^{−(ν+1)/2}, K = Γ((ν + 1)/2)/(Γ(ν/2) √(πν))
Comment: if Z and U are independent random variables, and Z has a standard normal distribution while U has a chi-squared distribution with ν degrees of freedom, then
t = Z/√(U/ν)
has a t distribution with ν degrees of freedom.
144. Goal: extract useful information out of messy data.
Strategy: formulate a probabilistic model of the data y, which depends on underlying parameter(s) θ.
Terminology depends on the parameter space:
Detection (simple hypothesis testing): θ ∈ {0, 1}, 0 = target absent, 1 = target present
Classification (multihypothesis testing): θ ∈ {0, 1, · · · , M}, e.g. θ ∈ {DC-9, 747, F-15, MiG-31}
145. Terminology
Suppose θ = (θ_1, θ_2).
If we are only interested in θ_1, then θ_2 are called nuisance parameters.
If θ_1 ∈ {0, 1} and θ_2 are nuisance parameters, we call it a composite hypothesis-testing problem.
146. Ex: Positron Emission Tomography
Simple, traditional linear DSP-based approach: Filtered Back Projection (FBP)
Advanced, estimation-theoretic approach:
Model the Poisson "likelihood" of the collected data
Markov Random Field (MRF) "prior" on the image
Find the estimate using the expectation-maximization algorithm (or a similar technique)
147. Tasks of Statistical Signal Processing: Estimation, Detection, . . .
1. Create a statistical model for the measured data
2. Find fundamental limitations on our ability to perform inference on the data
(a) Cramér-Rao bounds, Chernoff bounds, etc.
3. Develop an optimal (or suboptimal) estimator
4. Asymptotic analysis (i.e., assume we have lots and lots of data) of estimator performance to see if it approaches the bounds derived in (2)
5. Do simulations and experiments comparing algorithm performance to the lower bounds and to competing algorithms
148. A Bayesian analysis treats θ as a random variable with a "prior" density p(θ).
The data-generating machinery is specified by a conditional density p(y|θ), which gives the "likelihood" that the data y resulted from the parameters θ.
Inference usually revolves around the posterior density, derived from Bayes' theorem:
p(θ|y) = p(y|θ)p(θ)/p(y) = p(y|θ)p(θ) / ∫ p(y|θ)p(θ) dθ
149. Classical detection problem:
Design of optimum procedures for deciding between possible statistical situations given a random observation:
H_0 : Y_k ∼ P ∈ P_0, k = 1, · · · , n
H_1 : Y_k ∼ P ∈ P_1, k = 1, · · · , n
The model has the following components:
Parameter space (for parametric detection problems)
Probabilistic mapping from parameter space to observation space
Observation space
150. Parameter Space:
Completely characterizes the output given the mapping.
Each hypothesis corresponds to a point in the parameter space; this mapping is one-to-one.
Probabilistic Mapping from Parameter Space to Observation Space:
The probability law that governs the effect of a parameter on the observation.
151. Example:
Y_k = N_k, where N_k ∼ N(0, σ²) with probability p = 1/2, N_k ∼ N(−1, σ²) with probability p = 1/4, and N_k ∼ N(1, σ²) with probability p = 1/4.
µ = [−1 0 1]ᵀ is the parameter space; p = (1/2, 1/4, 1/4) is the probabilistic mapping.
152. Observation Space:
Finite dimensional, i.e. k = 1, 2, · · · , n where n is finite.
Detection Rule:
The mapping of the observation space into points of the parameter space is called a detection rule.
153. Classical estimation problem:
We are interested not in making a choice among several discrete situations, but rather in making a choice among a continuum of possible states.
Think of a family of distributions on the observation space, indexed by a set of parameters.
Given the observation, determine as accurately as possible the actual value of the parameter.
154. Example:
Y_k = N_k, N_k ∼ N(µ, σ²)
In this example, given the observations, the parameter µ is estimated. Its value is not chosen from a set of discrete values, but rather is estimated as accurately as possible.
155. The estimation problem has the same components as the detection problem:
Parameter space
Probabilistic mapping from parameter space to observation space
Observation space
Estimation rule
The detection problem can be thought of as a special case of the estimation problem. There are a variety of estimation procedures, differing basically in the amount of prior information about the parameter and in the performance criteria applied. Estimation theory is less structured than detection theory: "detection is science, estimation is art" (as put in the book Array Signal Processing by Johnson and Dudgeon).
156. Based on the a priori information about the parameter, there are two basic approaches to parameter estimation:
Bayesian Parameter Estimation: the parameter is assumed to be a random quantity related statistically to the observation.
Nonrandom Parameter Estimation: the parameter is a constant without any probabilistic structure.
157. Estimation theory relies on jargon to characterize the properties of estimators. The following definitions are used:
The set of n observations is represented by the n-dimensional vector y ∈ Γ (the observation space).
The values of the parameters are denoted by the vector θ ∈ Λ (the parameter space).
The estimate of this parameter vector is denoted by the mapping θ̂ : Γ → Λ.
158. Definitions (continued): The estimation error ε(y) (ε for short) is defined as the difference between the estimate and the actual parameter:
ε(y) = θ̂(y) − θ
The function C(a, θ) is the cost of estimating a true value of θ as a.
Given such a cost function C, the Bayes risk (average risk) of the estimator is defined by:
r(θ̂) = E{ E{ C[θ̂(Y), Θ] | Y = y } }
162. Example
Suppose we would like to minimize the Bayes risk
r(θ̂) = E{ E{ C[θ̂(Y), Θ] | Y = y } }
for a given cost function C. By inspection, one can see that the Bayes estimate of θ can be found (if it exists) by minimizing, for each y ∈ Γ, the posterior cost given Y = y:
E{ C[θ̂(Y), Θ] | Y = y }
169. An estimate is said to be unbiased if the expected value of the estimate equals the true value of the parameter:
E{θ̂|θ} = θ
Otherwise the estimate is said to be biased. The bias b(θ) is usually considered to be additive, so that:
b(θ) = E{θ̂|θ} − θ
170. An estimate is said to be asymptotically unbiased if the bias tends to zero as the number of observations tends to infinity.
An estimate is said to be consistent if the mean-squared estimation error tends to zero as the number of observations becomes large.
171. An efficient estimate has a mean-squared error that equals a particular lower bound: the Cramér-Rao bound. If an efficient estimate exists, it is optimum in the mean-squared sense: no other estimate has a smaller mean-squared error.
The following shorthand notations will also be used for brevity:
p_θ(y) = p_{y|θ}(y|θ) = probability density of y given θ
E_θ{y} = E{y|θ}
172. The following definitions and theorems will be useful later in the presentation:
Definition: Sufficiency
Suppose that Λ is an arbitrary set. A function T : Γ → Λ is said to be a sufficient statistic for the parameter set θ ∈ Λ if the distribution of y conditioned on T(y) does not depend on θ for θ ∈ Λ.
If knowing T(y) removes any further dependence on θ of the distribution of y, one can conclude that T(y) contains all the information in y that is useful for estimating θ. Hence, it is sufficient.
173. Definition: Minimal Sufficiency
A function T on Γ is said to be minimal sufficient for the parameter set θ ∈ Λ if it is a function of every other sufficient statistic for θ.
A minimal sufficient statistic represents the furthest reduction of the observation without destroying information about θ.
A minimal sufficient statistic does not necessarily exist for every problem. Even if it exists, it is usually very difficult to identify.
174. Let {x_1, x_2, · · · , x_n} denote a set of observations with joint density f(x_1, x_2, · · · , x_n; θ_1, θ_2, · · · , θ_p). Then the statistics S_1 = S_1(x_1, x_2, · · · , x_n), . . . , S_q = S_q(x_1, x_2, · · · , x_n) are called a set of sufficient statistics if the conditional distribution of {x_1, x_2, · · · , x_n} given S_1, S_2, · · · , S_q is functionally independent of the parameters {θ_1, θ_2, · · · , θ_p}.
175. Example
Suppose that we observe a success-failure experiment (Bernoulli trial) n = 3 times. Let π denote the probability of success. Let x_1, x_2, x_3 denote the observations, where
x_i = 1 if the ith trial is a success, 0 if the ith trial is a failure.
Let S = x_1 + x_2 + x_3 = the total number of successes.
177. The data x_1, x_2, x_3 can be thought of as generated in two ways:
1. Generate the data x_1, x_2, x_3 directly from the joint density f(x_1, x_2, · · · , x_n; θ_1, θ_2, · · · , θ_p), or
2. Generate the sufficient statistics S_1, S_2, · · · , S_q from their joint sampling distribution, then generate the observations x_1, x_2, x_3 from the conditional distribution of x_1, x_2, x_3 given S_1, S_2, · · · , S_q. Since the second step is independent of the parameters θ_1, θ_2, · · · , θ_p, all of the information about the parameters is determined by the results of the first step.
178. Principle of sufficiency
Any decision about the parameters θ_1, θ_2, · · · , θ_p should be made using the values of the sufficient statistics S_1, S_2, · · · , S_q and not otherwise on the data x_1, · · · , x_n.
179. The Likelihood Principle
Any decision about the parameters θ_1, θ_2, · · · , θ_p should be made using the values of the likelihood function L(θ_1, θ_2, · · · , θ_p) and not otherwise on the data x_1, · · · , x_n.
182. Let x_1, · · · , x_n denote a set of observations with joint density f(x_1, x_2, · · · , x_n; θ_1, θ_2, · · · , θ_p). Then S_1 = S_1(x_1, · · · , x_n), . . . , S_q = S_q(x_1, · · · , x_n) are a set of sufficient statistics if the joint density satisfies:
f(x_1, x_2, · · · , x_n; θ_1, θ_2, · · · , θ_p) = g(S_1, S_2, · · · , S_q; θ_1, θ_2, · · · , θ_p) h(x_1, · · · , x_n)
i.e., the dependence on the parameters factors out through the sufficient statistics.
183. Example
Let x_1, x_2, · · · , x_n denote a sample from the normal distribution with mean µ and variance σ². The density of x_i is:
f(x_i) = (1/(√(2π) σ)) e^{−(x_i−µ)²/(2σ²)}
And the joint density of (x_1, x_2, · · · , x_n) is:
f(x_1, . . . , x_n; µ, σ²) = Π_{i=1}^{n} (1/(√(2π) σ)) e^{−(x_i−µ)²/(2σ²)}
184. f(x_1, . . . , x_n; µ, σ²) = (1/(2πσ²)^{n/2}) exp[ −(1/(2σ²)) Σ_{i=1}^{n} (x_i − µ)² ]
= (1/(2πσ²)^{n/2}) exp[ −(1/(2σ²)) ( Σ_{i=1}^{n} x_i² − 2µ Σ_{i=1}^{n} x_i + nµ² ) ]
with
Σ_{i=1}^{n} x_i² = Σ_{i=1}^{n} (x_i − x̄)² + nx̄² = (n − 1)s² + nx̄², Σ_{i=1}^{n} x_i = nx̄
185. f(x_1, . . . , x_n; µ, σ²) = (1/(2πσ²)^{n/2}) e^{−(1/(2σ²))((n−1)s² + nx̄² − 2nµx̄ + nµ²)} = h(x_1, . . . , x_n) g(x̄, s; µ, σ²)
where
g(x̄, s; µ, σ²) = (1/(2πσ²)^{n/2}) e^{−(1/(2σ²))((n−1)s² + nx̄² − 2nµx̄ + nµ²)}, h(x_1, . . . , x_n) = 1
Thus x̄ and s are sufficient statistics.
186. The Factorization Theorem:
Suppose that the parameter set θ ∈ Λ has a corresponding family of densities p_θ. A statistic T is sufficient for θ iff there are functions g_θ and h such that
p_θ(y) = g_θ[T(y)] h(y)
∀y ∈ Γ and θ ∈ Λ.
187. Example
Consider the hypothesis-testing problem Λ = {0, 1} with densities p_0 and p_1. Noting that
p_θ(y) = p_0(y) if θ = 0; (p_1(y)/p_0(y)) p_0(y) if θ = 1,
the factorization p_θ(y) = g_θ[T(y)] h(y) is possible with
h(y) = p_0(y), T(y) = p_1(y)/p_0(y) ≡ L(y), g_θ(t) = 1 if θ = 0, t if θ = 1.
Thus the likelihood ratio L is a sufficient statistic for the binary hypothesis-testing problem.
188. Rao-Blackwell Theorem
Suppose that ĝ(y) is an unbiased estimate of g(θ) and that T is sufficient for θ. Define
g̃[T(y)] = E_θ{ĝ(Y)|T(Y) = T(y)}
Then g̃[T(y)] is also an unbiased estimate of g(θ). Furthermore,
Var_θ(g̃[T(Y)]) ≤ Var_θ(ĝ(Y)),
with equality iff P_θ(ĝ(Y) = g̃[T(Y)]) = 1.
189. Let (x_1, x_2, · · · , x_n) denote a set of observations with joint density f(x_1, x_2, · · · , x_n; θ_1, θ_2, · · · , θ_p). Let S_1 = S_1(x_1, x_2, · · · , x_n), . . . , S_q = S_q(x_1, x_2, · · · , x_n) denote a set of sufficient statistics. Let t(x_1, x_2, · · · , x_n) be any unbiased estimator of the parameter φ = g(θ_1, θ_2, · · · , θ_p). Then there exists an unbiased estimator T(S_1, · · · , S_q) of φ such that
Var(T) ≤ Var(t)
190. Proof
Let
T(S_1, . . . , S_k) = E(t(x_1, . . . , x_n)|S_1, . . . , S_k),
the conditional expectation of t given S_1, · · · , S_k:
T(S_1, . . . , S_k) = ∫ · · · ∫ t(x_1, . . . , x_n) g(x_1, . . . , x_n|S_1, . . . , S_k) dx_1 . . . dx_n
Now t is an unbiased estimator of φ = g(θ_1, · · · , θ_p); hence E[t] = φ. Also:
191. E[t] = E_{S_1,...,S_k}[ E[t|S_1, . . . , S_k] ] = E_{S_1,...,S_k}[ T(S_1, . . . , S_k) ] = φ
Thus T is also an unbiased estimator of φ = g(θ_1, θ_2, · · · , θ_p). Finally,
Var[t] = Var_{S_1,...,S_k}[ E[t|S_1, . . . , S_k] ] + E_{S_1,...,S_k}[ Var[t|S_1, . . . , S_k] ] ≥ Var_{S_1,...,S_k}[ T(S_1, . . . , S_k) ]
since E_{S_1,...,S_k}[ Var[t|S_1, . . . , S_k] ] ≥ 0. QED.
192. The Rao-Blackwell theorem states that if you have any unbiased estimator t of a parameter (one that depends arbitrarily on the observations), then you can find a better unbiased estimator (smaller variance) that is a function solely of the sufficient statistics. Thus the best unbiased estimator (minimum variance) has to be a function of the sufficient statistics.
193. Thus the search for the UMVU estimator (uniformly minimum variance unbiased estimator) is among functions that depend solely on the sufficient statistics.
194. Example
Suppose that Γ = {0, 1, · · · , n}, Λ = [0, 1], and
p_θ(y) = (n!/(y!(n − y)!)) θ^y (1 − θ)^{n−y}, y = 0, . . . , n, 0 < θ < 1
For any function f on Γ, we have
E_θ{f(Y)} = Σ_{y=0}^{n} (n!/(y!(n − y)!)) f(y) θ^y (1 − θ)^{n−y} = (1 − θ)^n Σ_{y=0}^{n} a_y x^y
where x = θ/(1 − θ) and a_y = (n!/(y!(n − y)!)) f(y).
The condition E_θ{f(Y)} = 0 for all θ ∈ Λ implies that Σ_{y=0}^{n} a_y x^y = 0, ∀x > 0.
195. However, an nth-order polynomial has at most n zeros unless all of its coefficients are zero. Hence every a_y, and thus every f(y), is zero, and the statistic Y is complete for θ ∈ Λ.
196. Let (x_1, x_2, · · · , x_n) denote a set of observations with joint density f(x_1, x_2, · · · , x_n; θ_1, θ_2, · · · , θ_p), and let S_1 = S_1(x_1, x_2, · · · , x_n), . . . , S_q = S_q(x_1, x_2, · · · , x_n) denote a set of sufficient statistics. Then S_1, · · · , S_q are called a set of complete sufficient statistics if
E(h(S_1, · · · , S_q)) = 0 ⇒ h(S_1, · · · , S_q) = 0
197. That is, S_1, · · · , S_q are a set of complete sufficient statistics if
∫ · · · ∫ h(S_1, . . . , S_k) g(S_1, . . . , S_k|θ_1, . . . , θ_p) dS_1 . . . dS_k = 0
implies h(S_1, . . . , S_k) = 0.
Completeness is sometimes difficult to prove.
198. Example
Suppose that we observe a success-failure experiment (Bernoulli trial) n = 3 times. Let π denote the probability of success, and let x_1, x_2, x_3 denote the observations, where x_i = 1 if the ith trial is a success and 0 if it is a failure. Let S = x_1 + x_2 + x_3 = the total number of successes.
200. S is a sufficient statistic. Is it a complete sufficient statistic?
The sampling distribution of S:
S    g(S; π)
0    (1 − π)³
1    3π(1 − π)²
2    3π²(1 − π)
3    π³
E[h(S)] = h(0)(1 − π)³ + h(1) 3π(1 − π)² + h(2) 3π²(1 − π) + h(3)π³
= h(0) + 3[h(1) − h(0)]π + 3[h(0) − 2h(1) + h(2)]π² + [h(3) − 3h(2) + 3h(1) − h(0)]π³
201. If E[h(S)] = 0 for all values of π, i.e.,
p(π) = h(0) + 3[h(1) − h(0)]π + 3[h(0) − 2h(1) + h(2)]π² + [h(3) − 3h(2) + 3h(1) − h(0)]π³ = 0, then
h(0) = 0, 3[h(1) − h(0)] = 0, h(0) − 2h(1) + h(2) = 0, h(3) − 3h(2) + 3h(1) − h(0) = 0
Thus h(0) = h(1) = h(2) = h(3) = 0.
202. S is a complete sufficient statistic.
203. If S_1, · · · , S_q are a set of complete sufficient statistics and T_1 = h_1(S_1, · · · , S_q) and T_2 = h_2(S_1, · · · , S_q) are unbiased estimators of φ, then E(T_1) = E(T_2) = φ, hence E(T_1 − T_2) = 0. By completeness, h_1(S_1, · · · , S_q) − h_2(S_1, · · · , S_q) = 0, and so T_1 = T_2.
Thus there is only one unbiased estimator of φ that is a function of complete sufficient statistics.
204. Lehmann-Scheffé Theorem
Let x_1, x_2, · · · , x_n denote a set of observations with joint density f(x_1, · · · , x_n; θ_1, · · · , θ_q). Let S_1 = S_1(x_1, · · · , x_n), . . . , S_q = S_q(x_1, · · · , x_n) denote a set of complete sufficient statistics.
205. Let T(S_1, · · · , S_q) be an unbiased estimator of the parameter φ = g(θ_1, · · · , θ_q). Then T is the uniformly minimum variance unbiased (UMVU) estimator of φ; that is, if t(x_1, x_2, · · · , x_n) is any unbiased estimator of φ, then
Var(T) ≤ Var(t)
206. Example
We observe a success-failure experiment (Bernoulli trial) n = 3 times; the probability of success is π. Let x_1, x_2, x_3 denote the observations (x_i = 1 for a success, 0 for a failure), and let S = x_1 + x_2 + x_3 = the total number of successes.
E[S/3] = E[(x_1 + x_2 + x_3)/3] = (E[x_1] + E[x_2] + E[x_3])/3 = (π + π + π)/3 = π
207. S/3 is an unbiased estimator of π.
S/3 is the uniformly minimum variance unbiased (UMVU) estimator of π.
208. The strategy to find the UMVU estimator:
1. Find a set of complete sufficient statistics S_1, S_2, · · · , S_k.
2. Find an unbiased estimator that depends only on the set of complete sufficient statistics, T(S_1, S_2, · · · , S_k).
3. Apply the Lehmann-Scheffé theorem.
Maximum likelihood estimators are functions of a set of complete sufficient statistics S_1, S_2, · · · , S_k.
210. 1. Finding maximum likelihood estimators.
2. Checking if there is a set of complete sufficient statistics S_1, · · · , S_k.
3. Checking if the maximum likelihood estimators are unbiased.
4. Making adjustments to these estimators if they are not unbiased.
This is the standard way of finding UMVU estimators.
211. Being Careful
• Suppose Y has the density
p(y; µ) = Π_{i=1}^{n} (1/(aσc)) exp[ −((y_i − µ)/(aσ))⁴ ]
where a ≈ 1.4464 and c is a constant which makes the density normalize to 1.
l(µ) = −Σ_{i=1}^{n} ((y_i − µ)/(aσ))⁴
• Try the usual "set derivative equal to zero":
dl(µ)/dµ = 4 Σ_{i=1}^{n} ((y_i − µ)/(aσ))³ = 0
• This will be a 3rd-order polynomial in µ, which in general has 3 solutions.
• We would have to compute the loglikelihood at each solution and see which one gives the biggest result.
• Suppose Y = Y_1, · · · , Y_n is i.i.d. with a Cauchy-like density (but not really
Cauchy!) (Maximum Penalized Likelihood Estimation):
p(y; c, γ) = Π_{i=1}^{n} γ / (2(γ + |y_i − c|)²)
• Loglikelihood:
l(c, γ) = n ln γ − 2 Σ_{i=1}^{n} ln(γ + |y_i − c|)
l(y; c, γ) = n ln γ − 2 Σ_{i∈{y_i≥c}} ln[γ + (y_i − c)] − 2 Σ_{i∈{y_i<c}} ln[γ + (c − y_i)]
• Suppose γ is given and we want to estimate c; taking the derivative:
dl(y; c)/dc = −2 Σ_{i∈{y_i≥c}} (−1)/(γ + y_i − c) − 2 Σ_{i∈{y_i<c}} 1/(γ + c − y_i) = 2 Σ_{i=1}^{n} sign(y_i − c)/(γ + |y_i − c|)
• We would be tempted to say that the ML
estimator of c is just the solution of
dl(y; c)/dc = 2 Σ_{i=1}^{n} sign(y_i − c)/(γ + |y_i − c|) = 0
• Well. . . there's actually more than one solution to this, so how about picking the solution which gives the greatest likelihood?
• But that is really a trap!!!
• Let's check out the second derivative:
d²l(y; c)/dc² = 2 Σ_{i=1}^{n} [sign(y_i − c)]²/(γ + |y_i − c|)² > 0
• Those critical points were really local minima!!!
• Where is the real maximum?
• Notice that |x| is not differentiable at x = 0; hence
l(y; c, γ) = n ln γ − 2 Σ_{i=1}^{n} ln(γ + |y_i − c|)
is not differentiable at c = any of the data points!
• To get the real ML estimate, try each y_i for c, and see which one gives the biggest likelihood.
• "It turns out" the ML estimate is one of the median points.
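A sketch of that brute-force search (the data and γ value are assumed) evaluates the loglikelihood at each data point and keeps the best:

```python
import numpy as np

# Sketch: for the Cauchy-like density, candidates for the ML estimate of c
# are the data points themselves; evaluate l(y; c) at each and keep the best.
rng = np.random.default_rng(6)
y = rng.standard_cauchy(15)                 # assumed heavy-tailed sample
gamma = 1.0                                 # γ treated as known

def loglik(c):
    return -2.0 * np.log(gamma + np.abs(y - c)).sum()  # n ln γ term is constant

c_ml = max(y, key=loglik)                   # best candidate among the data points
print(c_ml, np.median(y))                   # the ML estimate sits at a median point
```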
215. Expectation-Maximization Algorithm
• The EM procedure is a way of making iterative algorithms for maximizing loglikelihoods or Bayesian logposteriors when no closed-form solution is available.
• There's a more powerful and more general EM formulation by Csiszár based on information theory.
EM Algorithm
• Incomplete data Y: what we actually measure.
• Goal: maximize the incomplete-data loglikelihood (a function of the specific collected data):
l_id(θ) = log p_Y(y; θ)
• Complete data Z: a hypothetical data set.
• Tool: the complete-data loglikelihood (a function of the complete data as a random variable):
l_cd(θ) = log p_Z(z; θ)|_{z=Z} = log p_Z(Z; θ)
• The complete-data space must be "larger" and
determine the incomplete data, i.e. there must be a many-to-one mapping y = h(z).
The EM Recipe
• Step 1: Decide on a complete data space.
• Step 2: The expectation step:
Q(θ; θ̂_old) = E[l_cd(θ)|Y = y; θ̂_old]
• Step 3: The maximization step:
θ̂_new = arg max_θ Q(θ; θ̂_old)
• Start with a feasible initial guess θ̂_old, then iterate steps 2 and 3 (which can usually be combined).
What is that Expectation?
E[l_cd|Y = y; θ̂_old] = ∫ p_{Z|Y}(z|y; θ̂_old) log p_Z(z; θ) dz
p_{Z|Y}(z|y; θ̂_old) = p_Z(z; θ̂_old) / ∫_{Z(y)} p_Z(z̃; θ̂_old) dz̃ for z ∈ Z(y); 0 for z ∉ Z(y)
217. ∫_{Z(y)} p_Z(z; θ̂_old) dz = p_Y(y; θ̂_old)
Aspects of EM Algorithms
• The incomplete-data loglikelihood is guaranteed to increase with each EM iteration.
• Must be careful; it might converge to a local maximum which depends on the starting point.
• Often, the estimates naturally stay in the feasible space (i.e., nonnegativity constraints).
• In many problems, a candidate complete data space naturally suggests itself.
Ex: Poisson Signal in Additive Poisson Noise
Y = S + N, S ∼ Poisson(θ), N ∼ Poisson(λ_N)
• The incomplete-data loglikelihood is
l_id(θ) = −(θ + λ_N) + y ln(θ + λ_N) + const
218. • The ML estimator can be found in closed form:
θ̂(y) = max(0, y − λ_N)
Choose the Complete Data
• One can often choose the complete data in several different ways; try to choose to make the remaining steps easy.
• Different choices lead to different algorithms; some will converge "faster" than others.
• Here, take the complete data to be Z = (S, N); suppose we could magically measure the signal and noise counts separately!
• The complete-data loglikelihood is:
l_cd(θ) = [−θ + S ln(θ)] + [−λ_N + N ln(λ_N)]
The E-Step
Q(θ; θ̂_old) = E[l_cd(θ)|Y = y; θ̂_old] = E[−(θ + λ_N) + S ln(θ) + N ln(λ_N) | y; θ̂_old] = −(θ + λ_N) + E[S|y; θ̂_old] ln(θ) + E[N|y; θ̂_old] ln(λ_N)
219. • It is often convenient to leave explicit computation of the conditional expectation until the last minute.
• As with loglikelihoods, we sometimes drop terms which are constant w.r.t. θ.
The M-Step
θ̂_new = arg max_{θ≥0} Q(θ; θ̂_old)
• Take the derivative as usual:
(d/dθ) Q(θ; θ̂_old) = −1 + E[S|y; θ̂_old]/θ
• Setting this equal to zero yields
θ̂_new = E[S|y; θ̂_old]
• Now we just have to compute that expectation. (That's usually the hardest part.)
That Conditional Expectation
E[S|y; θ̂_old] = ∫ s p_S(s|y; θ̂_old) ds
220. • Let's look at the conditional density:
p_S(s|y; θ̂_old) = p_{Y|S}(y|s; θ̂_old) p_S(s; θ̂_old) / p_Y(y; θ̂_old)
= [ exp[−λ_N] λ_N^{y−s}/(y − s)! · I(y ≥ s) · exp[−θ̂_old] (θ̂_old)^s/s! ] / [ exp[−(θ̂_old + λ_N)] (θ̂_old + λ_N)^y/y! ]
= (y!/(s!(y − s)!)) · (λ_N^{y−s}/(θ̂_old + λ_N)^{y−s}) · ((θ̂_old)^s/(θ̂_old + λ_N)^s) · I(s ≤ y)
• We observe that the conditional density is just binomial. For 0 ≤ s ≤ y,
p_S(s|y; θ̂_old) = (y choose s) (θ̂_old/(θ̂_old + λ_N))^s (λ_N/(θ̂_old + λ_N))^{y−s}
E[S|y; θ̂_old] = y θ̂_old/(θ̂_old + λ_N)
• So this particular EM algorithm is:
θ̂_new = E[S|y; θ̂_old] = y θ̂_old/(θ̂_old + λ_N)
• Let's see if our analytic formula for the maximizer, θ̂ = max(0, y − λ_N), is a fixed point of the EM iteration.
221. • For y > λ_N: substituting θ̂_old = y − λ_N gives
θ̂_new = y (y − λ_N)/(y − λ_N + λ_N) = y − λ_N
• For y ≤ λ_N (fixed point θ̂ = 0), we immediately get 0 = 0.
• So everything is good.
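A tiny sketch of this EM iteration (y, λ_N, and the starting point are assumed) shows it converging to max(0, y − λ_N):

```python
# Sketch: EM iteration for a Poisson signal in additive Poisson noise.
y, lam_N = 10.0, 3.0        # assumed observation and known noise rate
theta = 1.0                 # feasible initial guess

for _ in range(200):
    theta = y * theta / (theta + lam_N)   # θ_new = E[S | y; θ_old]

print(theta)                # → 7.0 = max(0, y - λ_N), the closed-form ML estimate
```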
Back in Bayesian Land
• EM is also good for MAP estimation; just add the logprior to the Q-function:
θ̂_new = arg max_{θ≥0} Q_P(θ; θ̂_old)
Q_P(θ; θ̂_old) = E[l_cd|Y = y; θ̂_old] + log p(θ)
• Consider the previous example, with an exponential prior with mean 1/a:
Q_P(θ; θ̂_old) = −θ + E[S|y; θ̂_old] ln(θ) − aθ
dQ_P(θ; θ̂_old)/dθ = −1 + E[S|y; θ̂_old]/θ − a
222. θ̂_new = E[S|y; θ̂_old]/(1 + a) = (θ̂_old/(θ̂_old + λ_N)) y/(1 + a)
Convergence of the EM Algorithm (Theory)
• We'd like to prove that the likelihood goes up with each iteration:
L_id(θ̂_new) ≥ L_id(θ̂_old)
• Recall from the last pages:
p_{Z|Y}(z|y; θ) = p_Z(z; θ)/p_Y(y; θ), z ∈ Z(y) = {z : h(z) = y}
ln p_Y(y; θ) = ln p_Z(z; θ) − ln p_{Z|Y}(z|y; θ)
• Multiply both sides by p_{Z|Y}(z|y; θ̂_old) and integrate with respect to z:
∫ p_{Z|Y}(z|y; θ̂_old) ln p_Y(y; θ) dz = ∫ p_{Z|Y}(z|y; θ̂_old) ln p_Z(z; θ) dz − ∫ p_{Z|Y}(z|y; θ̂_old) ln p_{Z|Y}(z|y; θ) dz
223. • This simplifies to:
L_id(θ) = Q(θ; θ̂_old) − ∫ p_{Z|Y}(z|y; θ̂_old) ln p_{Z|Y}(z|y; θ) dz
• Evaluate at θ = θ̂_new and θ = θ̂_old:
L_id(θ̂_new) = Q(θ̂_new; θ̂_old) − ∫ p_{Z|Y}(z|y; θ̂_old) ln p_{Z|Y}(z|y; θ̂_new) dz   (♠)
L_id(θ̂_old) = Q(θ̂_old; θ̂_old) − ∫ p_{Z|Y}(z|y; θ̂_old) ln p_{Z|Y}(z|y; θ̂_old) dz   (♣)
• Subtract ♣ from ♠:
L_id(θ̂_new) − L_id(θ̂_old) = Q(θ̂_new; θ̂_old) − Q(θ̂_old; θ̂_old) − ∫ p_{Z|Y}(z|y; θ̂_old) ln [ p_{Z|Y}(z|y; θ̂_new)/p_{Z|Y}(z|y; θ̂_old) ] dz
• A really helpful inequality: ln x ≤ x − 1 ⇒
L_id(θ̂_new) − L_id(θ̂_old) ≥ Q(θ̂_new; θ̂_old) − Q(θ̂_old; θ̂_old) − ∫ p_{Z|Y}(z|y; θ̂_old) [ p_{Z|Y}(z|y; θ̂_new)/p_{Z|Y}(z|y; θ̂_old) − 1 ] dz
(focus on the last term)
224. ∫ p_{Z|Y}(z|y; θ̂_old) [ p_{Z|Y}(z|y; θ̂_new)/p_{Z|Y}(z|y; θ̂_old) − 1 ] dz
= ∫ p_{Z|Y}(z|y; θ̂_new) dz − ∫ p_{Z|Y}(z|y; θ̂_old) dz = 0
• Now we have:
L_id(θ̂_new) − L_id(θ̂_old) ≥ Q(θ̂_new; θ̂_old) − Q(θ̂_old; θ̂_old)
• Recall the definition of the M-step:
θ̂_new = arg max_θ Q(θ; θ̂_old)
• So, by definition,
Q(θ̂_new; θ̂_old) ≥ Q(θ̂_old; θ̂_old) ⇒ L_id(θ̂_new) ≥ L_id(θ̂_old)
• Notice we showed that the likelihood is nondecreasing; that doesn't automatically
225. imply that the parameter estimates converge.
• The parameter estimate could slide along a contour of constant loglikelihood.
• One can prove some things about parameter convergence in special cases.
Ex: EM Algorithm for Imaging from Poisson Data (i.e. Emission Tomography)
Generalized EM Algorithms
• Recall this line:
L_id(θ̂_new) − L_id(θ̂_old) ≥ Q(θ̂_new; θ̂_old) − Q(θ̂_old; θ̂_old)
• What if the M-step is too hard? Try a "generalized" EM algorithm:
θ̂_new = some easy-to-compute θ such that Q(θ; θ̂_old) ≥ Q(θ̂_old; θ̂_old)
• Problem: EM algorithms tend to be slow.
• Observation: "bigger" complete data spaces result in slower algorithms than "smaller" complete data spaces.
226. • SAGE (Space-Alternating Generalized Expectation-Maximization):
– Split the big complete data space into several smaller "hidden" data spaces
– Designed to yield faster convergence
• A generalization of the "ordered subsets" EM algorithm.
227. Wiener Filtering
• Context: Bayesian linear MMSE estimation for random sequences.
• Parameter sequence {Θ_k, k ∈ Z}
• Data sequence {Y_k, k ∈ I ⊂ Z}
• Goal: estimate {θ_k} as a linear function of the observations:
θ̂_k(y) = Σ_{j∈I} h(k, j) y_j
• Find h to minimize the mean square error.
• By the orthogonality principle,
E[(θ̂_k(Y) − Θ_k) Y*_i] = 0 for i ∈ I
E[(Σ_{j∈I} h(k, j) Y_j − Θ_k) Y*_i] = 0
Σ_{j∈I} h(k, j) E[Y_j Y*_i] = E[Θ_k Y*_i]
Σ_{j∈I} h(k, j) r_Y(j, i) = r_{ΘY}(k, i)   (this is the Wiener-Hopf equation)
228. • If the processes are stationary, we can write
Σ_{j∈I} h(k, j) r_Y(j − i) = r_{ΘY}(k − i)
• If I = Z it turns out the filter is LTI:
Σ_{j∈I} h(k − j) r_Y(j − i) = r_{ΘY}(k − i)
• Consider i = 0:
Σ_{j∈I} h(k − j) r_Y(j) = r_{ΘY}(k)
• Can solve W-H in the Z-transform domain:
H(z) S_Y(z) = S_{ΘY}(z) ⇒ H(z) = S_{ΘY}(z)/S_Y(z)
• MSE:
MSE = E[|Θ_k − θ̂_k(Y)|²] = E[(Θ_k − θ̂_k(Y))(Θ*_k − θ̂*_k(Y))]
= E[(Θ_k − θ̂_k(Y))Θ*_k] + E[(Θ_k − θ̂_k(Y))θ̂*_k(Y)]   (the second term is zero by orthogonality)
= E[Θ_k Θ*_k] − E[θ̂_k(Y)Θ*_k]
229. MSE = E[Θ_k Θ*_k] − E[θ̂_k(Y)Θ*_k] = E[Θ_k Θ*_k] − E[Σ_{j∈I} h(k − j) Y_j Θ*_k]
= E[Θ_k Θ*_k] − Σ_{j∈I} h(k − j) E[Y_j Θ*_k] = r_Θ(0) − Σ_{j∈I} h(k − j) r_{YΘ}(j − k)
• Since everything is stationary, we can just take k = 0:
MSE = r_Θ(0) − (h ∗ r_{YΘ})(0)
= ∫_{−π}^{π} [ S_Θ(ω) − H(ω) S_{YΘ}(ω) ] dω
= ∫_{−π}^{π} [ S_Θ(ω) − (S_{ΘY}(ω)/S_Y(ω)) S_{YΘ}(ω) ] dω
= ∫_{−π}^{π} [ S_Θ(ω) − |S_{ΘY}(ω)|²/S_Y(ω) ] dω
230. Deblurring
• Suppose the object is observed through a blurring point spread function f and additive noise W:
Y_k = (f ∗ Θ)_k + W_k
• Suppose Θ and W are uncorrelated and zero-mean:
S_Y = |F|² S_Θ + S_W, and S_{ΘY} = F* S_Θ
• So the Wiener filter is
H(z) = S_{ΘY}(z)/S_Y(z) = F*(z) S_Θ(z) / (|F(z)|² S_Θ(z) + S_W(z))
Interpretation of the Deblurring Filter
• If the noise is negligible, i.e. S_W(ω) ≈ 0:
H(ω) = F* S_Θ / (|F|² S_Θ + S_W) ≈ F* S_Θ / (F F* S_Θ) = 1/F
• Even if there is no noise, in implementation, straight division by F(ω) is often ill-posed and not a good idea (round-off errors, etc.)
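A minimal frequency-domain sketch of the deblurring filter (the signal, blur kernel, noise level, and spectra below are all assumptions; a real Wiener filter would estimate the spectra rather than use the true ones):

```python
import numpy as np

# Sketch: Wiener deblurring in the DFT domain with assumed (known) spectra.
rng = np.random.default_rng(7)
n = 256
theta = np.cumsum(rng.normal(size=n)); theta -= theta.mean()  # smooth-ish signal
f = np.exp(-0.5 * (np.arange(n) - 3) ** 2); f /= f.sum()      # assumed blur kernel

y = np.real(np.fft.ifft(np.fft.fft(f) * np.fft.fft(theta)))   # circular blur
y += rng.normal(0, 0.1, n)                                    # additive noise

F = np.fft.fft(f)
S_theta = np.abs(np.fft.fft(theta)) ** 2 / n   # assumed-known signal spectrum
S_w = 0.1 ** 2 * np.ones(n)                    # noise spectrum

H = np.conj(F) * S_theta / (np.abs(F) ** 2 * S_theta + S_w)   # Wiener filter
theta_hat = np.real(np.fft.ifft(H * np.fft.fft(y)))
print(np.mean((theta_hat - theta) ** 2), np.mean((y - theta) ** 2))
```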
231. Deblurring Error
MSE = ∫_{−π}^{π} [ S_Θ(ω) − |S_{ΘY}(ω)|²/S_Y(ω) ] dω
= ∫_{−π}^{π} [ S_Θ(ω) − |F(ω)|² |S_Θ(ω)|² / (|F(ω)|² S_Θ(ω) + S_W(ω)) ] dω
= ∫_{−π}^{π} [ ( S_Θ[|F|² S_Θ + S_W] − |F|² |S_Θ|² ) / (|F|² S_Θ + S_W) ] dω
= ∫_{−π}^{π} S_Θ S_W / (|F|² S_Θ + S_W) dω
Competing Approaches
• Competing approaches include iterative methods such as the "Richardson-Lucy" algorithm (an EM-style procedure):
– Computationally intensive
– Can naturally incorporate nonnegativity
– Sometimes a better match to real statistics
232. Discussion
• Advantage of the Wiener approach:
– LTI filtering implementation
• Disadvantages of the Wiener approach:
– No natural way to incorporate nonnegativity constraints (in image processing, for instance)
– Only truly optimal for Gaussian statistics
Real-Time Wiener Filtering
• What if we don't have "future" measurements?
• Must restrict h to be causal.
• Solution:
H(z) = (1/S_Y^−(z)) { S_{ΘY}(z)/S_Y^+(z) }_+
where the meaning of the plus and minus superscripts and subscripts is defined on the following slides.
233. Spectral Factorization
• If Y has a spectrum satisfying the Paley-Wiener criterion:
∫_{−π}^{π} log S_Y(ω) dω > −∞
then the spectrum can be factored as
S_Y(ω) = S_Y^+(ω) S_Y^−(ω)
where F^{−1}{S_Y^+} is causal and F^{−1}{S_Y^−} is anticausal.
Factoring Rational Spectra
• If the spectrum is a ratio of polynomials, we can factor as
S_Y(z) = S_Y^+(z) S_Y^−(z) = S_Y^+(z) S_Y^+(z^{−1})
where S_Y^+ has its poles and zeros inside the unit circle and S_Y^− has its poles and zeros outside the unit circle.
234. • Aside: spectral factorization into causal and anticausal factors is analogous to the Cholesky decomposition of a covariance matrix into lower and upper triangular factors.
Causal Part Extraction
• We can split f into its causal and anticausal parts:
f(k) = {f(k)}_+ + {f(k)}_−
{f(k)}_+ = f(k)u(k), {f(k)}_− = f(k)u(−k − 1)
• Use similar notation for the Z-transform domain:
F(z) = {F(z)}_+ + {F(z)}_−
{F}_+ = Z{Z^{−1}{F} u(k)}, {F}_− = Z{Z^{−1}{F} u(−k − 1)}
How to Extract Causal Parts
• If F is a ratio of polynomials one can usually
235. W
i
e
n
e
r
do a partial fraction expansion:
F(z) = {F(z)}+
| {z }
?
Poles and zeros
inside unit circle
+ {F(z)}−
| {z }
?
Poles and zeros
outside unit circle
• Can also do polynomial long division
9
236. Chernoff Bounds (Theory)
• General-purpose likelihood ratio test:
p(y; H_1)/p(y; H_0) or p(y|H_1)/p(y|H_0) ≷ λ   (decide H_1 if greater, H_0 otherwise)
• Consider the loglikelihood ratio test:
L ≡ ln Λ = ln [ p(y|H_1)/p(y|H_0) ] ≷ ln λ ≡ γ
• Conditional error probabilities:
P_D = ∫_γ^∞ p_{L|H_1}(ℓ|H_1) dℓ, P_FA = ∫_γ^∞ p_{L|H_0}(ℓ|H_0) dℓ
• It is often difficult, if not impossible, to find simple formulas for p_{L|H_1}(ℓ|H_1) and p_{L|H_0}(ℓ|H_0).
• This makes computing the probabilities of detection and false alarm difficult.
– We could use Monte Carlo simulations, but those are cumbersome.
237. – Alternative: find easy-to-compute, analytic bounds on the error probabilities.
• Discussion based on Van Trees.
A Moment Generating Function
Φ_{L|H_0}(s) = E[e^{sL}|H_0] = ∫_{−∞}^{∞} e^{sℓ} p_L(ℓ|H_0) dℓ
= ∫_Y exp[sL(y)] p_Y(y|H_0) dy
= ∫_Y exp[ s ln( p_Y(y|H_1)/p_Y(y|H_0) ) ] p_Y(y|H_0) dy
= ∫_Y [ p_Y(y|H_1)/p_Y(y|H_0) ]^s p_Y(y|H_0) dy
= ∫_Y [p_Y(y|H_1)]^s [p_Y(y|H_0)]^{1−s} dy
• Define a new random variable X_s (for various values of s) with the tilted density
p_{X_s}(x) ≡ e^{sx} p_{L|H_0}(x|H_0) / ∫_{−∞}^{∞} e^{sℓ} p_{L|H_0}(ℓ|H_0) dℓ
238. µ(s) ≡ ln Φ_{L|H_0}(s) = ln ∫_{−∞}^{∞} e^{sℓ} p(ℓ|H_0) dℓ
µ̇(s) = ∫_{−∞}^{∞} ℓ e^{sℓ} p(ℓ|H_0) dℓ / ∫_{−∞}^{∞} e^{sℓ} p(ℓ|H_0) dℓ = E[X_s]
µ̈(s) = var[X_s]
µ̇(0) = ∫_{−∞}^{∞} ℓ p(ℓ|H_0) dℓ = E[L|H_0]
µ̇(1) = ∫_{−∞}^{∞} ℓ (p(ℓ|H_1)/p(ℓ|H_0)) p(ℓ|H_0) dℓ / ∫_{−∞}^{∞} (p(ℓ|H_1)/p(ℓ|H_0)) p(ℓ|H_0) dℓ = E[L|H_1]
• Then, since p(ℓ|H_0) = exp[µ(s) − sℓ] p_{X_s}(ℓ),
P_FA = ∫_γ^∞ exp[µ(s) − sx] p_{X_s}(x) dx ≤ exp[µ(s) − sγ] for s ≥ 0 (since x ≥ γ inside the integral)
240. C
h
e
r
n
o
ff
µ̇(s) = γ
• Assuming everything worked (things exist,
equation for maximizing s solvable, etc.):
PF A ≤ exp[µ(s) − sµ̇(s)]
PM ≤ exp[µ(s) + (1 − s)γ]
• We want the s ≤ 1 which makes the RHS
as small as possible
d
ds
[µ(s) + (1 − s)γ] = µ̇(s) − γ
µ̇(s) = γ
• Assuming everything worked (things exist,
equation for maximizing s solvable, etc.):
PM ≤ exp[µ(s) + (1 − s)µ̇(s)]
PF A ≤ exp[µ(s) − sµ̇(s)], 0 6 s 6 1
PM ≤ exp[µ(s)+(1−s)µ̇(s)], where γ = µ̇(s)
µ̇(0) ≤ γ ≤ µ̇(1)
E[L|H0] ≤ γ ≤ E[L|H1]
Why is this useful? L can often be easily described by its moment generating function.
• Let $s = s_m$ satisfy $\dot\mu(s_m) = \gamma = 0$
$$P_e = \tfrac12 P_{FA} + \tfrac12 P_M \le \tfrac12\exp[\mu(s_m)]\int_0^{\infty} p_{X_{s_m}}(x)\,dx + \tfrac12\exp[\mu(s_m)]\int_{-\infty}^{0} p_{X_{s_m}}(x)\,dx$$
$$P_e \le \tfrac12\exp[\mu(s_m)]$$
$$P_{FA} = e^{\mu(s)}\int_{\dot\mu(s)}^{\infty} e^{-sx}\,p_{X_s}(x)\,dx = \exp[\mu(s)-s\dot\mu(s)]\int_{\dot\mu(s)}^{\infty}\exp[s(\dot\mu(s)-x)]\,p_{X_s}(x)\,dx$$
$$= \exp[\mu(s)-s\dot\mu(s)]\int_0^{\infty}\exp[-s\sqrt{\ddot\mu(s)}\,z]\,p_Z(z)\,dz$$
where
$$Z = \frac{X_s - E[X_s]}{\sqrt{\mathrm{var}[X_s]}} = \frac{X_s - \dot\mu(s)}{\sqrt{\ddot\mu(s)}}$$
$$\exp[\mu(s)-s\dot\mu(s)]\int_0^{\infty}\exp[-s\sqrt{\ddot\mu(s)}\,z]\,p_Z(z)\,dz$$
The original Chernoff inequality was formed by replacing the integral factor with 1. We can get a tighter constant in some asymptotic cases.
Asymptotic Gaussian Approximation
• In some cases, Z approaches a Gaussian random variable as the number of samples n grows large (e.g., data points i.i.d. with finite means and variances):
$$\int_0^{\infty}\exp[-s\sqrt{\ddot\mu(s)}\,z]\,\frac{1}{\sqrt{2\pi}}\exp\left[-\frac{z^2}{2}\right]dz = \exp\left[\frac{s^2\ddot\mu(s)}{2}\right]Q\!\left(s\sqrt{\ddot\mu(s)}\right)$$
$$P_{FA} = \exp[\mu(s)-s\dot\mu(s)]\int_0^{\infty}\exp[-s\sqrt{\ddot\mu(s)}\,z]\,p_Z(z)\,dz \approx \exp[\mu(s)-s\dot\mu(s)]\exp\left[\frac{s^2\ddot\mu(s)}{2}\right]Q\!\left(s\sqrt{\ddot\mu(s)}\right)$$
• If $s\sqrt{\ddot\mu(s)} > 3$ we can approximate Q(·) using an upper bound
$$Q(a) \le \frac{1}{a\sqrt{2\pi}}\exp\left[-\frac{a^2}{2}\right]$$
$$P_{FA} \approx \frac{1}{\sqrt{2\pi s^2\ddot\mu(s)}}\exp[\mu(s)-s\dot\mu(s)]$$
Similar Analysis Works for PM
$$P_M \approx e^{\mu(s)+(1-s)\dot\mu(s)}\exp\left[\frac{(s-1)^2\ddot\mu(s)}{2}\right]Q\!\left((1-s)\sqrt{\ddot\mu(s)}\right)$$
• If $(1-s)\sqrt{\ddot\mu(s)} > 3$ we can approximate Q(·) using the upper bound:
$$P_M \approx \frac{1}{\sqrt{2\pi(1-s)^2\ddot\mu(s)}}\exp[\mu(s)+(1-s)\dot\mu(s)]$$
Asymptotic Analysis for Pe
• For the case of equal priors and equal costs, if the conditions for the Q(·) approximation on the previous two slides hold, we have
$$P_e \approx \frac{1}{2s_m(1-s_m)\sqrt{2\pi\ddot\mu(s_m)}}\exp[\mu(s_m)]$$
Chernoff Bounds (Gaussian Examples)
Consider the loglikelihood ratio test
$$L \equiv \ln\Lambda = \ln\frac{p(y|H_1)}{p(y|H_0)} \;\underset{H_0}{\overset{H_1}{\gtrless}}\; \ln\lambda \equiv \gamma$$
Main object of interest: $\mu(s) \equiv \ln\Phi_{L|H_0}(s)$
$$\Phi_{L|H_0}(s) = E[e^{sL}|H_0] = \int_{-\infty}^{\infty} e^{s\ell}\,p_L(\ell|H_0)\,d\ell = \int_Y [p(y|H_1)]^s\,[p(y|H_0)]^{1-s}\,dy$$
Both representations will be useful
Gaussian, Equal Variances
$$H_1 \sim N(m, \sigma^2), \qquad H_0 \sim N(0, \sigma^2)$$
$$\mu(s) = \ln\int_Y [p(y|H_1)]^s\,[p(y|H_0)]^{1-s}\,dy$$
$$= \ln\int_{-\infty}^{\infty}\!\cdots\!\int_{-\infty}^{\infty}\left\{\prod_{i=1}^n \frac{1}{\sqrt{2\pi\sigma^2}}\exp\left[-\frac{(y_i-m)^2}{2\sigma^2}\right]\right\}^s \left\{\prod_{i=1}^n \frac{1}{\sqrt{2\pi\sigma^2}}\exp\left[-\frac{y_i^2}{2\sigma^2}\right]\right\}^{1-s} dy_1\cdots dy_n$$
$$= n\ln\int_{-\infty}^{\infty}\frac{1}{\sqrt{2\pi\sigma^2}}\exp\left[-\frac{(y-m)^2 s + y^2(1-s)}{2\sigma^2}\right]dy$$
Completing the Square
$$\int_{-\infty}^{\infty}\frac{1}{\sqrt{2\pi\sigma^2}}\exp\left[-\frac{(y-m)^2 s + y^2(1-s)}{2\sigma^2}\right]dy = \int_{-\infty}^{\infty}\frac{1}{\sqrt{2\pi\sigma^2}}\exp\left[-\frac{(y^2-2my+m^2)s + y^2(1-s)}{2\sigma^2}\right]dy$$
$$= \int_{-\infty}^{\infty}\frac{1}{\sqrt{2\pi\sigma^2}}\exp\left[-\frac{y^2 - 2msy + m^2 s}{2\sigma^2}\right]dy = \int_{-\infty}^{\infty}\frac{1}{\sqrt{2\pi\sigma^2}}\exp\left[-\frac{(y^2-2msy+m^2s^2) - m^2s^2 + m^2 s}{2\sigma^2}\right]dy$$
Finish Computing µ
$$= \int_{-\infty}^{\infty}\frac{1}{\sqrt{2\pi\sigma^2}}\exp\left[-\frac{(y-ms)^2}{2\sigma^2}\right]\exp\left[-\frac{m^2 s(1-s)}{2\sigma^2}\right]dy = \exp\left[\frac{m^2 s(s-1)}{2\sigma^2}\right]$$
$$\mu(s) = n\ln\exp\left[\frac{m^2 s(s-1)}{2\sigma^2}\right] = \frac{s(s-1)}{2}\,\frac{nm^2}{\sigma^2} \equiv \frac{s(s-1)}{2}\,d^2$$
Basic Bound on PFA
$$\mu(s) = \frac{s(s-1)}{2}d^2, \qquad \dot\mu(s) = \frac{2s-1}{2}d^2$$
$$P_{FA} \le \exp[\mu(s)-s\dot\mu(s)], \quad \text{for } 0 \le s \le 1$$
$$= \exp\left[\frac{s(s-1)}{2}d^2 - s\,\frac{2s-1}{2}d^2\right] = \exp\left[-\frac{s^2}{2}d^2\right]$$
where $\gamma = \dot\mu(s)$:
$$\gamma = \frac{2s-1}{2}d^2 \;\Rightarrow\; s = \frac{\gamma}{d^2} + \frac12$$
Basic Bound on PM
$$P_M \le \exp[\mu(s)+(1-s)\dot\mu(s)] = \exp\left[\frac{s(s-1)}{2}d^2 + (1-s)\,\frac{2s-1}{2}d^2\right]$$
$$= \exp\left[\frac{s^2-s}{2}d^2 + \frac{2s-1-2s^2+s}{2}d^2\right] = \exp\left[\frac{2s-1-s^2}{2}d^2\right] = \exp\left[-\frac{(1-s)^2}{2}d^2\right]$$
Where are the Bounds Meaningful? Recall we need
$$E[L|H_0] \le \gamma \le E[L|H_1], \qquad \dot\mu(0) \le \gamma \le \dot\mu(1)$$
$$\frac{2\cdot 0 - 1}{2}d^2 \le \gamma \le \frac{2\cdot 1 - 1}{2}d^2 \;\Rightarrow\; -\frac{d^2}{2} \le \gamma \le \frac{d^2}{2}$$
The Refined Bound for PFA
Recall the refined asymptotic bound:
$$P_{FA} \approx \exp[\mu(s)-s\dot\mu(s)]\exp\left[\frac{s^2\ddot\mu(s)}{2}\right]Q\!\left(s\sqrt{\ddot\mu(s)}\right)$$
$$\dot\mu(s) = \frac{2s-1}{2}d^2, \qquad \ddot\mu(s) = d^2$$
In this case, since L is a sum of Gaussian random variables, the expression is exact:
$$P_{FA} = \exp\left[-\frac{s^2}{2}d^2\right]\exp\left[\frac{s^2 d^2}{2}\right]Q(sd) = Q(sd)$$
The Refined Bound for PM
$$P_M \approx e^{\mu(s)+(1-s)\dot\mu(s)}\exp\left[\frac{(s-1)^2\ddot\mu(s)}{2}\right]Q\!\left((1-s)\sqrt{\ddot\mu(s)}\right) = \exp\left[-\frac{(1-s)^2}{2}d^2\right]\exp\left[\frac{(s-1)^2 d^2}{2}\right]Q((1-s)d)$$
Again, since L is Gaussian, the expression is exact:
$$P_M = Q((1-s)d)$$
Minimum Prob. of Error
For the minimum prob. of error test, γ = 0:
$$s_m = \frac{\gamma}{d^2} + \frac12 = \frac12$$
Recall the approximate expression for Pe from the last slide of the last lecture:
$$P_e \approx \frac{1}{2s_m(1-s_m)\sqrt{2\pi\ddot\mu(s_m)}}\exp[\mu(s_m)] = \frac{1}{2s_m(1-s_m)\sqrt{2\pi d^2}}\exp\left[\frac{s_m(s_m-1)}{2}d^2\right]$$
Min. Prob. of Error Cont.
$$P_e \approx \frac{2}{\sqrt{2\pi d^2}}\exp\left[-\frac{d^2}{8}\right]$$
Recall the exact expression is
$$P_e = Q(d/2)$$
Van Trees’ rule of thumb: the approximation is very good for d > 6.
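A quick numerical comparison of the three expressions for this example — the simple Chernoff bound $P_e \le \frac12 e^{-d^2/8}$, the refined approximation, and the exact $Q(d/2)$ — as a sketch:

```python
import numpy as np
from scipy.stats import norm

# Compare the Chernoff bound, refined asymptotic approximation, and exact P_e
# for the equal-variance Gaussian example (s_m = 1/2, mu(s_m) = -d^2/8).
Q = norm.sf  # the Q-function

for d in [2.0, 4.0, 6.0, 8.0]:
    chernoff = 0.5 * np.exp(-d**2 / 8)                        # (1/2) exp[mu(s_m)]
    refined = 2 / np.sqrt(2 * np.pi * d**2) * np.exp(-d**2 / 8)
    exact = Q(d / 2)
    print(f"d={d}: Chernoff={chernoff:.3e}, refined={refined:.3e}, exact={exact:.3e}")
```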
The Bhattacharyya Distance
If the criterion is the minimum prob. of error and µ(s) is symmetric about s = 1/2, then
$$\mu(1/2) = \ln\int_Y \sqrt{p(y|H_1)}\,\sqrt{p(y|H_0)}\,dy$$
and $-\mu(1/2)$ is called the Bhattacharyya distance
Gaussian, Equal Means
$$H_1 \sim N(0, \sigma_1^2), \qquad H_0 \sim N(0, \sigma_0^2)$$
$$\mu(s) = \frac{n}{2}\ln\frac{(\sigma_0^2)^s\,(\sigma_1^2)^{1-s}}{s\sigma_0^2 + (1-s)\sigma_1^2}$$
A common special case:
$$\sigma_1^2 = \sigma_s^2 + \sigma_n^2, \qquad \sigma_0^2 = \sigma_n^2$$
$$\mu(s) = \frac{n}{2}\left[(1-s)\ln\left(1 + \frac{\sigma_s^2}{\sigma_n^2}\right) - \ln\left(1 + (1-s)\frac{\sigma_s^2}{\sigma_n^2}\right)\right]$$
Gaussian, Equal Means (cont.)
$$\dot\mu(s) = \frac{n}{2}\left[-\ln\left(1 + \frac{\sigma_s^2}{\sigma_n^2}\right) + \frac{\sigma_s^2/\sigma_n^2}{1 + (1-s)\sigma_s^2/\sigma_n^2}\right]$$
$$\ddot\mu(s) = \frac{n}{2}\left[\frac{\sigma_s^2/\sigma_n^2}{1 + (1-s)\sigma_s^2/\sigma_n^2}\right]^2$$
Uniformly Most Powerful Tests
• Usual parametric data model p(y; θ)
• Consider a composite problem:
H0 : θ = θ0, H1 : θ ∈ S1
• A test φ∗ is uniformly most powerful (UMP) of level α = PFA if its PD is at least as good as that of any other α-level test:
$$P_D(\phi^*;\theta) = E_\theta[\phi^*] \ge E_\theta[\phi] = P_D(\phi;\theta) \quad \text{for all } \theta \in S_1 \text{ and all } \alpha\text{-level tests } \phi$$
Figure 1: UMP test — power curves $P_D(\phi^*,\theta)$, $P_D(\phi_A,\theta)$, $P_D(\phi_B,\theta)$ for tests with common level $\alpha = P_{FA}(\phi_B) = P_{FA}(\phi_A) = P_{FA}(\phi^*)$.
• Find the most powerful α-level (recall α = PFA) test for a fixed θ
• Just the Neyman-Pearson test
• If the decision regions do not vary with θ, then the test is UMP
Gaussian Mean Example
• Suppose we have n i.i.d. samples $Y_i \sim N(\mu, \sigma^2)$
• Assume $\sigma^2$ is known, but µ is not
• Consider three cases:
$$H_0: \mu = 0$$
$$\text{Case I}: H_1: \mu > 0, \qquad \text{Case II}: H_1: \mu < 0, \qquad \text{Case III}: H_1: \mu \ne 0$$
Suffices to use
$$\bar y = \frac{1}{n}\sum_{i=1}^n y_i$$
$$\bar Y \sim N(\mu, \sigma^2/n)$$
$$\Lambda(y;\mu) = \frac{p(y;\mu)}{p(y;0)} = \frac{\exp[-(\bar y-\mu)^2/(2\sigma^2/n)]}{\exp[-\bar y^2/(2\sigma^2/n)]} = \frac{\exp[(-\bar y^2 + 2\bar y\mu - \mu^2)/(2\sigma^2/n)]}{\exp[-\bar y^2/(2\sigma^2/n)]} = \exp\left[\frac{n\mu}{\sigma^2}\bar y - \frac{n\mu^2}{2\sigma^2}\right] \;\underset{H_0}{\overset{H_1}{\gtrless}}\; \tau$$
$$\frac{\sqrt{n}\,\mu}{\sigma}\,\bar y \;\underset{H_0}{\overset{H_1}{\gtrless}}\; \left[\ln\tau + \frac{n\mu^2}{2\sigma^2}\right]\frac{\sigma}{\sqrt{n}} \equiv \gamma$$
• Case I: µ > 0
$$\frac{\sqrt{n}\,\mu}{\sigma}\,\bar y \;\underset{H_0}{\overset{H_1}{\gtrless}}\; \gamma \;\longrightarrow\; \frac{\sqrt{n}}{\sigma}\,\bar y \;\underset{H_0}{\overset{H_1}{\gtrless}}\; \frac{\gamma}{\mu} \equiv \gamma^+$$
• Set the threshold to get the right “level”:
$$\alpha = P_{FA} = \Pr\left[\frac{\sqrt{n}}{\sigma}\bar Y > \gamma^+ ;\, H_0\right] \;\Rightarrow\; \gamma^+ = Q^{-1}(\alpha)$$
• For Case II (µ < 0) the inequality flips, giving a threshold $\gamma^- = Q^{-1}(1-\alpha)$ and
$$P_D = 1 - Q\!\left(\gamma^- - \frac{\sqrt{n}}{\sigma}\mu\right) = 1 - Q\!\left(Q^{-1}(1-\alpha) - \frac{\sqrt{n}}{\sigma}\mu\right) = 1 - Q(Q^{-1}(1-\alpha) - d)$$
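A minimal sketch of the Case I test in practice, with toy values for n, σ, and µ (assumed for illustration): the threshold on the normalized statistic is $\gamma^+ = Q^{-1}(\alpha)$ and the power is $Q(Q^{-1}(\alpha) - d)$:

```python
import numpy as np
from scipy.stats import norm

# Case I (mu > 0) UMP test: threshold and power at level alpha.
alpha, n, sigma, mu = 0.05, 25, 2.0, 1.0   # toy values, not from the notes
d = np.sqrt(n) * mu / sigma

gamma_plus = norm.isf(alpha)        # Q^{-1}(alpha): threshold on sqrt(n)*ybar/sigma
P_D = norm.sf(gamma_plus - d)       # power: Q(Q^{-1}(alpha) - d)

y = np.random.default_rng(0).normal(mu, sigma, n)   # simulated H1 data
decide_H1 = np.sqrt(n) / sigma * y.mean() > gamma_plus
print(f"threshold={gamma_plus:.3f}, P_D={P_D:.3f}, decide H1: {decide_H1}")
```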
• Case III: µ ≠ 0
$$\frac{\sqrt{n}\,\mu}{\sigma}\,\bar y \;\underset{H_0}{\overset{H_1}{\gtrless}}\; \gamma$$
• We can’t just absorb µ into the threshold anymore without affecting the inequalities!
• Decision region varies with the sign of µ
• No UMP test exists!!!
• Cauchy Median Example
• Suppose we have a single sample from the density
$$p(y;\theta) = \frac{1}{\pi}\,\frac{1}{1+(y-\theta)^2}$$
and we want to decide
$$H_0: \theta = 0, \qquad H_1: \theta > 0$$
• The likelihood ratio is
$$\frac{p(y;\theta)}{p(y;0)} = \frac{1+y^2}{1+(y-\theta)^2} \;\underset{H_0}{\overset{H_1}{\gtrless}}\; \tau$$
• The decision region depends on θ
• So no UMP exists!
• The Monotone Likelihood Ratio (MLR) Condition
$$H_0: \theta = \theta_0, \qquad H_1: \theta \in S_1$$
• Suppose we have a Fisher factorization
$$p(y;\theta) = g(T(y),\theta)\,h(y)$$
• A UMP test of any level α exists if the likelihood ratio is either monotone increasing or decreasing in T for all $\theta \in S_1$:
$$\Lambda(y;\theta) = \frac{p(y;\theta)}{p(y;\theta_0)} = \frac{g(T,\theta)}{g(T,\theta_0)} \equiv \Lambda(T;\theta)$$
• Densities Satisfying the MLR Condition
• Suppose we have a one-sided test:
$$H_0: \theta = \theta_0, \; H_1: \theta > \theta_0 \qquad\text{or}\qquad H_0: \theta = \theta_0, \; H_1: \theta < \theta_0$$
• The following satisfy the MLR condition:
– i.i.d. samples from a 1-D exponential family (Gaussian, Bernoulli, Exponential, Poisson, Gamma, Beta)
– i.i.d. samples from the uniform density U(0, θ)
– i.i.d. samples from a shifted Laplace
Also works for
$$H_0: \theta \le \theta_0, \qquad H_1: \theta > \theta_0$$
• Densities Not Satisfying MLR Condition
• Gaussian with single-sided H1 on mean but
unknown variance
• Cauchy density with single-sided H1 on
centrality parameter
• Exponential family with double-sided H1
Locally Most Powerful Tests
• Usual parametric data model p(y; θ)
• Consider a single-sided problem:
$$H_0: \theta = \theta_0, \qquad H_1: \theta > \theta_0$$
• What to do if there is no UMP test?
• The locally most powerful test of level α has a power curve that maximizes the slope of PD(θ) at θ = θ0:
$$\phi_{LMP} = \arg\max_{\phi\in\{\alpha\text{-level}\}} \frac{d}{d\theta}P_D(\phi;\theta) = \arg\max_{\phi\in\{\alpha\text{-level}\}} \frac{d}{d\theta}E_\theta[\phi]$$
Figure 1: Graphical interpretation of the LMP test — power curves $P_D(\phi_{LMP},\theta)$, $P_D(\phi_B,\theta)$, $P_D(\phi_A,\theta)$ versus θ near θ0, with common level $\alpha = P_{FA}(\phi_B) = P_{FA}(\phi_A) = P_{FA}(\phi_{LMP})$.
• Solution to the LMP Problem
• Using a proof similar to that for the Neyman-Pearson lemma, one can show the LMP test is
$$\frac{\left.\frac{d}{d\theta}p(y;\theta)\right|_{\theta=\theta_0}}{p(y;\theta_0)} \;\underset{H_0}{\overset{H_1}{\gtrless}}\; \lambda$$
where we pick λ to achieve $E_{\theta_0}[\phi] \le \alpha$
• As before, we may need a randomized test if there is a nonzero prob. of landing exactly on the threshold (but we won’t worry about that)
Suppose λ = 0 and the likelihood is unimodal.
Figure 2: Decide H1 if the slope of the likelihood at θ0 is positive, i.e. the ML estimate $\hat\theta_{ML}(y)$ gives evidence that θ > θ0.
• One-Sided Gaussian Mean Example
$$y_i \sim N(\theta, \sigma^2), \qquad H_0: \theta = 0, \quad H_1: \theta > 0$$
$$\frac{d}{d\theta}\ln p(y;\theta) = \frac{d}{d\theta}\left[c - \sum_{i=1}^n \frac{(y_i-\theta)^2}{2\sigma^2}\right]$$
$$\left.\frac{d}{d\theta}\ln p(y;\theta)\right|_{\theta=0} = \frac{1}{\sigma^2}\sum_{i=1}^n y_i \;\underset{H_0}{\overset{H_1}{\gtrless}}\; \lambda \;\Longrightarrow\; \frac{\sqrt{n}}{\sigma}\,\bar y \;\underset{H_0}{\overset{H_1}{\gtrless}}\; \lambda\,\frac{\sigma}{\sqrt{n}} \equiv \gamma$$
(just the UMP test we discussed before)
• Cauchy Median Example
• Suppose we have n Cauchy samples:
$$p(y;\theta) = \prod_{i=1}^n \frac{1}{\pi}\,\frac{1}{1+(y_i-\theta)^2}$$
and we want to decide
$$H_0: \theta = 0, \qquad H_1: \theta > 0$$
$$\frac{d}{d\theta}\ln p(y;\theta) = \frac{d}{d\theta}\sum_{i=1}^n\left\{c - \ln[1+(y_i-\theta)^2]\right\} = 2\sum_{i=1}^n \frac{y_i-\theta}{1+(y_i-\theta)^2}$$
$$\left.\frac{d}{d\theta}\ln p(y;\theta)\right|_{\theta=0} = 2\sum_{i=1}^n \frac{y_i}{1+y_i^2} \;\underset{H_0}{\overset{H_1}{\gtrless}}\; \lambda$$
• Step 1: Pass each data point through a memoryless nonlinearity
$$g(y) = \frac{y}{1+y^2}$$
Figure 3: The memoryless nonlinearity g(y) = y/(1 + y²), plotted for −5 ≤ y ≤ 5; it is odd, peaks at ±1/2 at y = ±1, and decays toward zero for large |y|.
• Step 2: Sum all the nonlinearity outputs
• Step 3: Compare to a threshold
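A minimal sketch of the three-step detector; the threshold value below is assumed for illustration (in practice λ would be calibrated, e.g. by Monte Carlo under H0, to meet the level α):

```python
import numpy as np

def cauchy_lmp_statistic(y):
    """LMP statistic for the Cauchy median problem: 2 * sum of g(y_i) = y_i/(1+y_i^2)."""
    return 2 * np.sum(y / (1 + y**2))   # Step 1 (nonlinearity) and Step 2 (sum)

# Step 3: compare to a threshold (hypothetical lam, assumed for illustration)
rng = np.random.default_rng(0)
y = 0.5 + rng.standard_cauchy(100)      # simulated samples with true theta = 0.5
lam = 5.0
print("decide H1" if cauchy_lmp_statistic(y) > lam else "decide H0")
```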
• Two-Sided LMP Tests
• Now consider a double-sided problem:
$$H_0: \theta = \theta_0, \qquad H_1: \theta \ne \theta_0$$
• The locally most powerful “unbiased” test of level α has a power curve which maximizes the curvature of PD(θ) at θ = θ0, subject to the slope there being zero
Generalized Likelihood Ratio Tests and Model Order Selection Criteria
• Usual parametric data model p(y; θ)
• In previous lectures on LMP tests, we assumed special structures like:
$$H_0: \theta = \theta_0, \; H_1: \theta > \theta_0 \qquad\text{or}\qquad H_0: \theta = \theta_0, \; H_1: \theta \ne \theta_0$$
• What should we do if we have a more general structure like:
$$H_0: \theta \in S_0, \qquad H_1: \theta \in S_1$$
• Often, we do something a bit ad hoc.
The GLRT
• Find parameter estimates θ̂0 and θ̂1 under H0 and H1
• Substituting the estimates into the likelihood ratio yields a generalized likelihood ratio test:
$$\Lambda_{GLR}(y) = \frac{p(y;\hat\theta_1)}{p(y;\hat\theta_0)} \;\underset{H_0}{\overset{H_1}{\gtrless}}\; \lambda$$
• If convenient, use ML estimates:
$$\frac{\max_{\theta\in S_1} p(y;\theta)}{\max_{\theta\in S_0} p(y;\theta)} \;\underset{H_0}{\overset{H_1}{\gtrless}}\; \lambda$$
Two-Sided Gaussian Mean Example (1)
$$Y_i \sim N(\theta, \sigma^2), \qquad H_0: \theta = 0, \quad H_1: \theta \ne 0$$
$$\ln\frac{p(y;\hat\theta)}{p(y;0)} = -\sum_{i=1}^n \frac{\left(y_i - \frac1n\sum_{j=1}^n y_j\right)^2}{2\sigma^2} + \sum_{i=1}^n \frac{y_i^2}{2\sigma^2} = \frac{\sum_{i=1}^n 2y_i\left(\frac1n\sum_{j=1}^n y_j\right) - n\left(\frac1n\sum_{i=1}^n y_i\right)^2}{2\sigma^2}$$
Two-Sided Gaussian Mean Example (2)
$$= \frac{2n\left(\frac1n\sum_{i=1}^n y_i\right)^2 - n\left(\frac1n\sum_{i=1}^n y_i\right)^2}{2\sigma^2} = \frac{n}{2\sigma^2}\,\bar y^2 \;\underset{H_0}{\overset{H_1}{\gtrless}}\; \lambda \;\Longrightarrow\; |\bar y| \;\underset{H_0}{\overset{H_1}{\gtrless}}\; \gamma$$
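A short sketch of this two-sided GLRT at level α, using the fact that under H0 $\sqrt{n}\,\bar Y/\sigma \sim N(0,1)$, so $\gamma = (\sigma/\sqrt{n})\,Q^{-1}(\alpha/2)$ (the data below are simulated for illustration):

```python
import numpy as np
from scipy.stats import norm

# Two-sided GLRT |ybar| >< gamma at level alpha.
alpha, sigma = 0.05, 1.0
rng = np.random.default_rng(1)
y = rng.normal(0.4, sigma, 50)       # simulated data, true theta = 0.4
n = len(y)

gamma = sigma / np.sqrt(n) * norm.isf(alpha / 2)   # (sigma/sqrt(n)) * Q^{-1}(alpha/2)
print("decide H1" if abs(y.mean()) > gamma else "decide H0")
```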
Some Gaussian Examples
• Single population:
– Tests on the mean, with unknown variance, yield “T-tests”
– The statistic has a Student-t distribution
– Asymptotically Gaussian
– A T-test is any statistical hypothesis test in which the test statistic has a Student’s t distribution if the null hypothesis is true. It is applied when sample sizes are small enough that using an assumption of normality and the associated z-test leads to incorrect inference. Suppose $X_1, \cdots, X_n$ are independent random variables that are normally distributed with expected value µ and variance σ². Let
$$\bar X_n = (X_1 + \cdots + X_n)/n$$
be the sample mean, and
$$S_n^2 = \frac{1}{n-1}\sum_{i=1}^n \left(X_i - \bar X_n\right)^2$$
be the sample variance. It is readily shown that the quantity
$$Z = \frac{\bar X_n - \mu}{\sigma/\sqrt{n}}$$
is normally distributed with mean 0 and variance 1, since the sample mean $\bar X_n$ is normally distributed with mean µ and standard deviation $\sigma/\sqrt{n}$. Gosset studied a related quantity,
$$T = \frac{\bar X_n - \mu}{S_n/\sqrt{n}},$$
which differs from Z in that the exact standard deviation σ is replaced by the random variable $S_n$. Technically, $(n-1)S_n^2/\sigma^2$ has a $\chi^2_{n-1}$ distribution by Cochran’s theorem. Gosset’s work showed that T has the probability density function
$$f(t) = \frac{\Gamma\left(\frac{\nu+1}{2}\right)}{\sqrt{\nu\pi}\,\Gamma\left(\frac{\nu}{2}\right)}\left(1 + \frac{t^2}{\nu}\right)^{-\frac{\nu+1}{2}}, \qquad \nu = n-1$$
Confidence Intervals Derived from Student’s t-Distribution
Suppose the number A is so chosen that
$$\Pr(-A < T < A) = 0.9$$
when T has a t-distribution with n − 1 degrees of freedom. This is the same as
$$\Pr(T < A) = 0.95,$$
so A is the “95th percentile” of this probability distribution, or $A = t_{(0.05,\,n-1)}$. Then
$$\Pr\left(-A < \frac{\bar X_n - \mu}{S_n/\sqrt{n}} < A\right) = 0.9,$$
$$\Pr\left(\bar X_n - A\frac{S_n}{\sqrt{n}} < \mu < \bar X_n + A\frac{S_n}{\sqrt{n}}\right) = 0.9$$
Therefore the interval whose endpoints are
$$\bar X_n \pm A\frac{S_n}{\sqrt{n}}$$
is a 90-percent confidence interval for µ.
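A minimal sketch of this t-based interval with hypothetical data (scipy’s t.ppf supplies the percentile A):

```python
import numpy as np
from scipy.stats import t

# 90% t-based confidence interval for the mean (hypothetical data).
x = np.array([4.9, 5.3, 5.1, 4.7, 5.6, 5.0, 5.2, 4.8])
n = len(x)
xbar, s = x.mean(), x.std(ddof=1)      # sample mean and sample std (S_n)

A = t.ppf(0.95, df=n - 1)              # 95th percentile of t_{n-1}
half_width = A * s / np.sqrt(n)
print(f"90% CI for mu: ({xbar - half_width:.3f}, {xbar + half_width:.3f})")
```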
• Two populations:
– Tests on equality of variances, with unknown means, yield a “Fisher F-test”
– The statistic has a Fisher F-distribution
An F-test is any statistical test in which the test statistic has an F-distribution if the null hypothesis is true. A random variate of the F-distribution arises as the ratio of two chi-squared variates,
$$\frac{U_1/d_1}{U_2/d_2},$$
where $U_1$ and $U_2$ have chi-square distributions with $d_1$ and $d_2$ degrees of freedom, respectively.
The probability density function of an $F(d_1, d_2)$-distributed random variable is given by
$$g(x) = \frac{1}{\beta(d_1/2,\, d_2/2)}\left(\frac{d_1 x}{d_1 x + d_2}\right)^{d_1/2}\left(1 - \frac{d_1 x}{d_1 x + d_2}\right)^{d_2/2} x^{-1}$$
for real x > 0, where $d_1$ and $d_2$ are positive integers, and β is the beta function, $\beta(x,y) = \int_0^1 t^{x-1}(1-t)^{y-1}\,dt$.
– Asymptotically Chi-Square
• Suppose n → ∞. Since the ML estimates
are asymptotically consistent, the GLRT is
asymptotically UMP
• If the GLRT is hard to analyze directly,
sometimes asymptotic results can help
• Assume a partition
$$\theta = (\varphi_1, \cdots, \varphi_p, \underbrace{\xi_1, \cdots, \xi_q}_{\text{nuisance parameters}})$$
• Consider the GLRT for a two-sided problem
$$H_0: \varphi = \varphi_0, \qquad H_1: \varphi \ne \varphi_0$$
where ξ is unknown, but we don’t care what it is
• When the density p(y; θ) is smooth under H0, it can be shown that for large n
$$2\ln\Lambda_{GLR}(Y) = 2\ln\frac{p(Y;\hat\theta)}{p(Y;\theta_0)} \sim \chi^2_p$$
where $\chi^2_p$ is chi-square with p degrees of freedom:
$$E[\chi^2_p] = p, \qquad \mathrm{var}(\chi^2_p) = 2p$$
Link to Bayesian Land
• Remember, if we had a prior p(θ), we could handle composite hypothesis tests by integrating and reducing things to a simple hypothesis test:
$$p(y) = \int_{\mathbb{R}^p} p(y|\theta)\,p(\theta)\,d\theta$$
• If p(θ) varies slowly compared to p(y|θ) around the MAP estimate, we can approximate
$$p(y) \approx p(\theta)\int_{\mathbb{R}^p}\exp[L(\theta)]\,d\theta$$
• Suppose the MAP and ML estimates are approximately equal
Laplace’s Approximation
• Do a Taylor series expansion:
$$\int_{\mathbb{R}^p}\exp\left[L(\hat\theta_{ML}) - \frac{(\theta-\hat\theta_{ML})^T F(y;\hat\theta_{ML})(\theta-\hat\theta_{ML})}{2}\right]d\theta = e^{L(\hat\theta_{ML})}\int_{\mathbb{R}^p}\exp\left[-\frac{(\theta-\hat\theta_{ML})^T F(y;\hat\theta_{ML})(\theta-\hat\theta_{ML})}{2}\right]d\theta$$
where
$$F(y;\hat\theta_{ML}) = \left[-\frac{d^2 L(\theta)}{d\theta_r\,d\theta_c}\right]_{\theta=\hat\theta_{ML}} \quad\text{(empirical Fisher info)}$$
• Recognize the quadratic form of the Gaussian:
$$\int_{\mathbb{R}^p}\exp\left[-\frac{(\theta-\hat\theta_{ML})^T F(y;\hat\theta_{ML})(\theta-\hat\theta_{ML})}{2}\right]d\theta = \frac{(2\pi)^{p/2}}{\sqrt{\det F(y;\hat\theta_{ML})}}$$
so
$$p(y) = p(\hat\theta_{ML})\,p(y|\hat\theta_{ML})\,\frac{(2\pi)^{p/2}}{\sqrt{\det F(y;\hat\theta_{ML})}}$$
Large Sample Sizes
• Consider the log density:
$$\ln p(y) \approx \ln p(\hat\theta_{ML}) + \ln p(y|\hat\theta_{ML}) + \frac{p}{2}\ln 2\pi - \frac12\ln\det F(y;\hat\theta_{ML})$$
• Suppose we have n i.i.d. samples. By the law of large numbers:
$$\ln\det F(y;\hat\theta_{ML}) \approx \ln\det F(\hat\theta_{ML}) = \ln\det[nI] + \ln\det F_1(\hat\theta_{ML}) = p\ln n + \ln\det F_1(\hat\theta_{ML})$$
Schwarz’s Result
• As n → ∞
$$\ln p(y) \approx \ln p(\hat\theta_{ML}) + L(\hat\theta_{ML}) + \frac{p}{2}\ln 2\pi - \frac12 p\ln n - \frac12\ln\det F_1(\hat\theta_{ML}) \approx L(\hat\theta_{ML}) - \frac{p}{2}\ln n$$
• Called the Bayesian Information Criterion (BIC) or Schwarz Information Criterion (SIC)
• Often used in model selection; the second term is a penalty on model complexity
Minimum Description Length
• BIC is related to Rissanen’s Minimum Description Length criterion; (p/2) ln(n) is viewed as the optimum number of “nats” (like bits, but a different base) used to encode the ML parameter estimate with limited precision
• Data is encoded with a string of length
$$\text{description length} = -L(\hat\theta_{ML}) + \frac{p}{2}\ln n$$
• Choose the model which describes the data using the smallest number of bits (or nats) — see the sketch below
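A sketch of BIC-style model order selection for polynomial regression, using the notes’ convention $L(\hat\theta_{ML}) - (p/2)\ln n$; the data, noise level, and the choice to count only the polynomial coefficients in p are assumptions for illustration:

```python
import numpy as np

# BIC-style model order selection for polynomial regression (hypothetical data).
rng = np.random.default_rng(2)
n = 200
x = np.linspace(-1, 1, n)
y = 1.0 + 2.0 * x - 1.5 * x**2 + 0.3 * rng.standard_normal(n)  # true order 2

for order in range(5):
    p = order + 1                                    # count polynomial coefficients only
    resid = y - np.polyval(np.polyfit(x, y, order), x)
    sigma2 = resid @ resid / n
    loglik = -n / 2 * (np.log(2 * np.pi * sigma2) + 1)   # Gaussian max loglikelihood
    bic = loglik - p / 2 * np.log(n)                     # notes' convention: larger is better
    print(f"order {order}: BIC = {bic:.1f}")
# The largest value should flag the true order 2.
```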
General Multivariate Gaussian
Detection Problems
• We have a data vector y = [y1, · · · , yn]T
distributed according to N(µ, R)
• Two hypotheses:
H0 : µ = µ0, R = R0, H1 : µ = µ1, R = R1
• Likelihood ratio:
$$\Lambda(y) = \frac{\sqrt{\det R_0}\,\exp\left[-\frac12(y-\mu_1)^T R_1^{-1}(y-\mu_1)\right]}{\sqrt{\det R_1}\,\exp\left[-\frac12(y-\mu_0)^T R_0^{-1}(y-\mu_0)\right]}$$
• The test looks like
$$T(y) \;\underset{H_0}{\overset{H_1}{\gtrless}}\; \ln\tau + \frac12\ln\frac{\det R_1}{\det R_0} \equiv \gamma$$
where
$$T(y) \equiv \frac{(y-\mu_0)^T R_0^{-1}(y-\mu_0)}{2} - \frac{(y-\mu_1)^T R_1^{-1}(y-\mu_1)}{2}$$
Mahalanobis Distance Interpretation
• Define a norm on $\mathbb{R}^n$: $\|z\|_R = \sqrt{z^T R^{-1} z}$
• Emphasizes components of z which are collinear with eigenvectors of R associated with small eigenvalues
• We can rewrite the test statistic as
$$T(y) = \frac{\|y-\mu_0\|_{R_0}^2}{2} - \frac{\|y-\mu_1\|_{R_1}^2}{2}$$
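A small sketch of the statistic T(y) in this Mahalanobis form, with toy means and covariances assumed for illustration (linear solves are used in place of explicit inverses):

```python
import numpy as np

# Quadratic test statistic T(y) for the general Gaussian problem.
def T_stat(y, mu0, R0, mu1, R1):
    d0 = y - mu0
    d1 = y - mu1
    # ||z||_R^2 = z^T R^{-1} z, via a linear solve
    return 0.5 * d0 @ np.linalg.solve(R0, d0) - 0.5 * d1 @ np.linalg.solve(R1, d1)

mu0, mu1 = np.zeros(3), np.ones(3)                       # toy hypothesis means
R0 = np.eye(3)
R1 = np.array([[2.0, 0.5, 0.0], [0.5, 2.0, 0.5], [0.0, 0.5, 2.0]])
y = np.array([0.8, 1.1, 0.9])
print("T(y) =", T_stat(y, mu0, R0, mu1, R1))             # compare to gamma to decide
```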
Quadratic Form Interpretation
• Alternatively, express using a new statistic
$$T'(y) \equiv \frac{y^T(R_0^{-1} - R_1^{-1})\,y}{2} + (\mu_1^T R_1^{-1} - \mu_0^T R_0^{-1})\,y$$
and a new threshold
$$\gamma' = \ln\tau + \frac12\ln\frac{\det R_1}{\det R_0} + \frac12\left(\mu_1^T R_1^{-1}\mu_1 - \mu_0^T R_0^{-1}\mu_0\right)$$
Four Kinds of Decision Regions
1. If the covariances are equal, i.e. R = R0 = R1, the test reduces to
$$(\mu_1^T - \mu_0^T)R^{-1}y = a^T y, \qquad a \equiv R^{-1}(\mu_1 - \mu_0)$$
and the decision boundary is a hyperplane
2. If $R_0 > R_1$, i.e., $R_0 - R_1$ is positive definite, the H1 decision region is the interior of an ellipsoid
3. If $R_0 < R_1$, i.e., $R_0 - R_1$ is negative definite, the H1 decision region is the exterior of an ellipsoid
4. If none of the above apply, i.e. $R_0 - R_1$ is neither singular, positive definite, nor negative definite, then the decision region has hyperbolic boundaries
Known Signal in White Noise
• Consider the familiar special case:
$$\mu_0 = 0, \qquad \mu_1 = s, \qquad R = \sigma^2 I$$
• The test can be expressed as
$$T(y) \equiv s^T y \;\underset{H_0}{\overset{H_1}{\gtrless}}\; \sigma^2\ln\tau + \frac{\|s\|^2}{2} \equiv \gamma$$
• Analysis from the univariate case now applies. The “deflection ratio” or “detectability index” is
$$d^2 = \|s\|^2/\sigma^2 \equiv \mathrm{SNR}$$
Known Signal in Colored Noise
• Now let R be general
• The Neyman-Pearson test has the form
$$\frac{s^T R^{-1} y}{\sqrt{s^T R^{-1} s}} \;\underset{H_0}{\overset{H_1}{\gtrless}}\; Q^{-1}(\alpha)$$
• Could transform to the white noise case by linearly preprocessing the data, ỹ = Hy, to give an equivalent test in terms of s̃ = Hs:
$$\frac{\tilde s^T \tilde y}{\|\tilde s\|} \;\underset{H_0}{\overset{H_1}{\gtrless}}\; Q^{-1}(\alpha)$$
The Prewhitening Transformation
• Use MATLAB (or whatever) to compute the eigendecomposition
$$R = UDU^T$$
• We can define $R^{1/2} = UD^{1/2}$
• Our prewhitening operation is
$$H = R^{-1/2} = D^{-1/2}U^{-1} = D^{-1/2}U^T$$
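The same computation in numpy rather than MATLAB, as a sketch with a toy covariance:

```python
import numpy as np

# Prewhitening transformation H = D^{-1/2} U^T, with a toy covariance R.
R = np.array([[2.0, 0.8], [0.8, 1.0]])
eigvals, U = np.linalg.eigh(R)          # R = U D U^T for symmetric R
H = np.diag(eigvals ** -0.5) @ U.T      # H = D^{-1/2} U^T = R^{-1/2}

print(np.round(H @ R @ H.T, 10))        # ~ identity: the noise is whitened
```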
Performance for Colored Noise
• The test statistic for the colored noise problem is still Gaussian, so we can again use formulas from the univariate case with
$$d^2 = \|\tilde s\|^2 = s^T R^{-1} s$$
• In the white noise case, performance depended only on the total power of s, not its shape
• Here, in the colored noise case, performance depends on the shape as well!
Signal Design for Colored Noise
• Problem: maximize $d^2 = s^T R^{-1} s$ subject to the constraint $\|s\|^2 = 1$
• The Rayleigh quotient theorem says:
$$\frac{s^T R^{-1} s}{s^T s} \le \frac{1}{\min_i \lambda_i}$$
furthermore, we have equality if s is a minimizing eigenvector of R
• So, to make d² big, pick s to be that minimizing eigenvector — see the sketch below
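A sketch of the resulting signal design rule with a toy R (numpy’s eigh returns eigenvalues in ascending order, so the first column is the minimizing eigenvector):

```python
import numpy as np

# Pick the signal as the eigenvector of R with the smallest eigenvalue,
# which maximizes d^2 = s^T R^{-1} s under ||s|| = 1.
R = np.array([[2.0, 0.8], [0.8, 1.0]])
eigvals, U = np.linalg.eigh(R)     # eigenvalues in ascending order
s = U[:, 0]                        # eigenvector for the minimum eigenvalue
d2 = s @ np.linalg.solve(R, s)
print("d^2 =", d2, "= 1/min eigenvalue =", 1 / eigvals[0])
```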
• For equal means, the test statistic is purely quadratic:
$$(y-\mu)^T(R_0^{-1} - R_1^{-1})(y-\mu)$$
• Analysis is simplified by prefiltering to diagonalize $R_0^{-1} - R_1^{-1}$
• Still a total pain; the test statistic is a mixture of chi-square random variables
• For unequal means, it’s even more of a pain; the test statistic is a mixture of noncentral chi-square random variables
A Simple Zero-Mean Case
$$H_0: Y_k = W_k, \qquad H_1: Y_k = S_k + W_k$$
$$W_k \sim N(0, \sigma_W^2), \qquad S_k \sim N(0, \sigma_{S_k}^2)$$
Note the signal power is time-varying.
• In our generic notation, we have
$$R_0 = \mathrm{diag}(\sigma_W^2), \qquad R_1 = \mathrm{diag}(\sigma_W^2 + \sigma_{S_k}^2)$$
$$R_0^{-1} - R_1^{-1} = \mathrm{diag}\!\left(\frac{1}{\sigma_W^2} - \frac{1}{\sigma_W^2 + \sigma_{S_k}^2}\right) = \mathrm{diag}\!\left(\frac{1}{\sigma_W^2}\,\frac{\sigma_{S_k}^2}{\sigma_W^2 + \sigma_{S_k}^2}\right), \qquad \kappa_k \equiv \frac{\sigma_{S_k}^2}{\sigma_W^2 + \sigma_{S_k}^2}$$
• The test statistic is
$$y^T(R_0^{-1} - R_1^{-1})\,y = \frac{1}{\sigma_W^2}\sum_{k=1}^n \kappa_k y_k^2$$
• Several different interpretations
Filter-Squarer Interpretation
$$\sum_{k=1}^n \kappa_k y_k^2 = \sum_{k=1}^n \left(\sqrt{\kappa_k}\,y_k\right)^2$$
Wide-Sense-Stationary Sequences
• Suppose Sk and Wk are wide-sense stationary Gaussian time series with power spectral densities
$$S_S(e^{j\omega}), \quad S_W(e^{j\omega}), \qquad \omega \in [0, 2\pi)$$
• The estimator-correlator structure generalizes: each $y_k$ is multiplied by a signal estimate $\hat S_k$ produced by the filter $\frac{S_S(\omega)}{S_S(\omega)+S_W(\omega)}$, the products are summed, and the sum is compared to γ to decide between H1 and H0
Discrete-Time Signal Detection in Additive Noise
Signal w/ Known Amplitude in IID Noise
• We have data samples y1, · · · , yn drawn from
$$H_0: Y_k = N_k \qquad\text{or}\qquad H_1: Y_k = s_k + N_k$$
• The loglikelihood ratio test compares a threshold to:
$$\sum_{k=1}^n \ln L_k(y_k) = \sum_{k=1}^n \ln\frac{p_N(y_k - s_k)}{p_N(y_k)}$$
• General detector for known amplitude: pass each $y_k$ through $\ln L_k$ (a function of $s_k$), sum the outputs, and compare to γ to decide between H1 and H0
Gaussian Case (Known Amplitude)
• If the noise is N(0, σ²),
$$\sum_{k=1}^n \ln L_k(y_k) = \sum_{k=1}^n s_k(y_k - s_k/2) = \sum_{k=1}^n s_k y_k - s_k^2/2$$
(absorbing the 1/σ² scale factor)
• In a Neyman-Pearson or minimax setting, we could absorb the $s_k^2/2$ term into the threshold, and the test statistic is just a correlation
$$\sum_{k=1}^n s_k y_k$$
Laplacian Case (Known Amplitude)
• Suppose the noise has a Laplacian density
$$p_N(x) = \frac{\alpha}{2}\exp(-\alpha|x|)$$
• The optimum detector: shift each sample to $y_k - s_k/2$, pass it through a limiter that clips at $\pm|s_k|/2$, multiply by $\mathrm{sgn}(s_k)$, sum over k, and compare to γ to decide between H1 and H0 — a sketch follows
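A minimal sketch of that shift–clip–sign–sum structure, with a hypothetical signal and an assumed threshold:

```python
import numpy as np

# Laplacian known-amplitude detector: shift, clip, sign-correct, sum.
# Equivalent to sum_k alpha*(|y_k| - |y_k - s_k|) up to a positive scale factor.
def laplacian_statistic(y, s):
    shifted = y - s / 2                                         # subtract s_k / 2
    clipped = np.clip(shifted, -np.abs(s) / 2, np.abs(s) / 2)   # limiter at +/- |s_k|/2
    return np.sum(np.sign(s) * clipped)                         # sign-correct and sum

rng = np.random.default_rng(3)
s = np.sin(0.3 * np.arange(50))                 # hypothetical known signal
y = s + rng.laplace(scale=1.0, size=50)         # simulated H1 data, Laplacian noise
gamma = 0.0                                     # assumed threshold for illustration
print("decide H1" if laplacian_statistic(y, s) > gamma else "decide H0")
```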