

Statistical Signal Processing

By: Don Johnson

Online: <http://cnx.org/content/col11382/1.1/>

C O N N E X I O N S

Rice University, Houston, Texas


This selection and arrangement of content as a collection is copyrighted by Don Johnson. It is licensed under the Creative Commons Attribution 3.0 license (http://creativecommons.org/licenses/by/3.0/).

Collection structure revised: December 5, 2011

PDF generated: October 29, 2012

For copyright and attribution information for the modules contained in this collection, see p. 179.


Table of Contents

1 Probability and Stochastic Processes

1.1 Foundations of Probability Theory: Basic Definitions . . . . 1
1.2 Random Variables and Probability Density Functions . . . . 2
1.3 Jointly Distributed Random Variables . . . . 2
1.4 Random Vectors . . . . 4
1.5 The Gaussian Random Variable . . . . 5
1.6 The Central Limit Theorem . . . . 7
1.7 Basic Definitions in Stochastic Processes . . . . 9
1.8 The Gaussian Process . . . . 10
1.9 Sampling and Random Sequences . . . . 11
1.10 The Poisson Process . . . . 11
1.11 Linear Vector Spaces . . . . 15
1.12 Hilbert Spaces and Separable Vector Spaces . . . . 18
1.13 The Vector Space L Squared . . . . 21
1.14 A Hilbert Space for Stochastic Processes . . . . 22
1.15 Karhunen-Loeve Expansion . . . . 23
1.16 Probability and Stochastic Processes: Problems . . . . 26

2 Optimization Theory

2.1 Optimization Theory . . . . 47
2.2 Constrained Optimization . . . . 49

3 Estimation Theory

3.1 Introduction to Estimation Theory . . . . 55
3.2 Minimum Mean Squared Error Estimators . . . . 57
3.3 Maximum a Posteriori Estimators . . . . 59
3.4 Linear Estimators . . . . 60
3.5 Maximum Likelihood Estimators of Parameters . . . . 62
3.6 Cramer-Rao Bound . . . . 64
3.7 Signal Parameter Estimation . . . . 68
3.8 Maximum Likelihood Estimators of Signal Parameters . . . . 71
3.9 Time-Delay Estimation . . . . 73
3.10 Probability Density Estimation . . . . 78
3.11 Estimation Theory: Problems . . . . 83

4 Detection Theory

4.1 Detection Theory Basics . . . . 91
4.2 Criteria in Hypothesis Testing . . . . 94
4.3 Performance Evaluation . . . . 99
4.4 Beyond Two Models . . . . 102
4.5 Model Consistency Testing . . . . 103
4.6 Stein's Lemma . . . . 105
4.7 Sequential Hypothesis Testing . . . . 110
4.8 Detection in the Presence of Unknowns . . . . 116
4.9 Detection of Signals in Noise . . . . 116
4.10 White Gaussian Noise . . . . 121
4.11 Colored Gaussian Noise . . . . 127
4.12 Detection in the Presence of Uncertainties . . . . 131
4.13 Unknown Signal Delay . . . . 133
4.14 Unknown Signal Waveform . . . . 136


4.15 Unknown Noise Parameters . . . . 137
4.16 Partial Knowledge of Probability Distributions . . . . 140
4.17 Robust Hypothesis Testing . . . . 141
4.18 Non-Parametric Model Evaluation . . . . 146
4.19 Partially Known Signals and Noise . . . . 149
4.20 Partially Known Signal Waveform . . . . 149
4.21 Partially Known Noise Amplitude Distribution . . . . 150
4.22 Non-Gaussian Observations . . . . 151
4.23 Small-Signal Detection . . . . 152
4.24 Robust Small-Signal Detection . . . . 153
4.25 Non-Parametric Detection . . . . 154
4.26 Type-Based Detection . . . . 155
4.27 Discrete-Time Detection: Problems . . . . 157

5 Probability Distributions

6 Matrix Theory

7 Ali-Silvey Distances . . . . 169
Glossary . . . . 171
Bibliography . . . . 172
Index . . . . 176
Attributions . . . . 179


Chapter 1

Probability and Stochastic Processes

1.1 Foundations of Probability Theory: Basic Definitions

1.1.1 Basic Definitions

The basis of probability theory is a set of events - sample space - and a systematic set of numbers - probabilities - assigned to each event. The key aspect of the theory is the system of assigning probabilities. Formally, a sample space is the set Ω of all possible outcomes ωi of an experiment. An event is a collection of sample points ωi determined by some set-algebraic rules governed by the laws of Boolean algebra. Letting A and B denote events, these laws are

Union: A ∪ B = {ω | ω ∈ A ∨ ω ∈ B}

Intersection: A ∩ B = {ω | ω ∈ A ∧ ω ∈ B}

Complement: A′ = {ω | ω ∉ A}

(A ∪ B)′ = A′ ∩ B′

The null set ∅ is the complement of Ω. Events are said to be mutually exclusive if there is no element common to both events: A ∩ B = ∅.

Associated with each event Ai is a probability measure Pr [Ai], sometimes denoted by πi, that obeys the axioms of probability.

• Pr [Ai] ≥ 0
• Pr [Ω] = 1
• If A ∩ B = ∅, then Pr [A ∪ B] = Pr [A] + Pr [B].

The consistent set of probabilities Pr [·] assigned to events is known as the a priori probabilities. From the axioms, probability assignments for Boolean expressions can be computed. For example, simple Boolean manipulations (A ∪ B = A ∪ A′B) lead to

Pr [A ∪B] = Pr [A] + Pr [B]− Pr [A ∩B] (1.1)

Suppose Pr [B] ≠ 0 and that we know the event B has occurred; what is the probability that event A has also occurred? This calculation is known as the conditional probability of A given B and is

1. This content is available online at <http://cnx.org/content/m11245/1.2/>.


denoted by Pr [A | B]. To evaluate conditional probabilities, consider B to be the sample space rather than Ω. To obtain a probability assignment under these circumstances consistent with the axioms of probability, we must have

Pr [A | B] = Pr [A ∩ B] / Pr [B]   (1.2)

The event A is said to be statistically independent of B if Pr [A | B] = Pr [A]: the occurrence of the event B does not change the probability that A occurred. When independent, the probability of their intersection Pr [A ∩ B] is given by the product of the a priori probabilities Pr [A] Pr [B]. This property is necessary and sufficient for the independence of the two events. As Pr [A | B] = Pr [A ∩ B]/Pr [B] and Pr [B | A] = Pr [A ∩ B]/Pr [A], we obtain Bayes' Rule.

Pr [B | A] = Pr [A | B] Pr [B] / Pr [A]   (1.3)
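A minimal numerical sketch of Bayes' Rule (here applied in the form Pr [A | B] = Pr [B | A] Pr [A] / Pr [B]); the events and probabilities are hypothetical, chosen purely for illustration:

    # Hypothetical events: A = "target present", B = "detector fires"
    pr_A = 0.01                 # a priori probability Pr[A]
    pr_B_given_A = 0.95         # Pr[B | A]
    pr_B_given_notA = 0.05      # Pr[B | A']

    # Total probability yields Pr[B]; Bayes' Rule then yields Pr[A | B]
    pr_B = pr_B_given_A * pr_A + pr_B_given_notA * (1 - pr_A)
    pr_A_given_B = pr_B_given_A * pr_A / pr_B
    print(f"Pr[B] = {pr_B:.4f}, Pr[A | B] = {pr_A_given_B:.4f}")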

1.2 Random Variables and Probability Density Functions

A random variable X is the assignment of a number - real or complex - to each sample point in sample space. Thus, a random variable can be considered a function whose domain is the sample space and whose range is, most commonly, a subset of the real line. The probability distribution function or cumulative distribution function is defined to be

P X (x ) = Pr [X ≤ x] (1.4)

Note that X denotes the random variable and x denotes the argument of the distribution function. Probability distribution functions are increasing functions: if A = {ω | X (ω) ≤ x1} and B = {ω | x1 < X (ω) ≤ x2}, then Pr [A ∪ B] = Pr [A] + Pr [B] implies P X (x2) = P X (x1) + Pr [x1 < X ≤ x2], which means that P X (x2) ≥ P X (x1) for x1 ≤ x2.

The probability density function p X (x) is defined to be that function which, when integrated, yields the distribution function.

P X (x) = ∫_{−∞}^{x} p X (α) dα   (1.5)

As distribution functions may be discontinuous, we allow density functions to contain impulses. Furthermore, density functions must be non-negative since their integrals are increasing.

1.3 Jointly Distributed Random Variables

Two (or more) random variables can be defined over the same sample space. Just as with jointly defined events, the joint distribution function is easily defined.

P X,Y (x, y) ≡ Pr [X ≤ x ∩ Y ≤ y]   (1.6)

The joint probability density function p X,Y (x, y) is related to the distribution function via double integration.

P X,Y (x, y) = ∫_{−∞}^{x} ∫_{−∞}^{y} p X,Y (α, β) dα dβ   (1.7)

or

p X,Y (x, y) = ∂²P X,Y (x, y) / (∂x ∂y)

2. This content is available online at <http://cnx.org/content/m11246/1.2/>.
3. This content is available online at <http://cnx.org/content/m11248/1.4/>.


Since lim_{y→∞} P X,Y (x, y) = P X (x), the so-called marginal density functions can be related to the joint density function.

p X (x) = ∫_{−∞}^{∞} p X,Y (x, β) dβ   (1.8)

and

p Y (y) = ∫_{−∞}^{∞} p X,Y (α, y) dα

Extending the ideas of conditional probabilities, the conditional probability density function p X|Y (x|Y = y) is defined (when p Y (y) ≠ 0) as

p X|Y (x|Y = y) = p X,Y (x, y) / p Y (y)   (1.9)

Two random variables are statistically independent when p X|Y (x|Y = y) = p X (x), which is equivalent to the condition that the joint density function is separable: p X,Y (x, y) = p X (x) p Y (y).

For jointly defined random variables, expected values are defined just as with single random variables. Probably the most important joint moment is the covariance:

cov (X,Y ) ≡ E [XY ]− E [X]E [Y ] (1.10)

where

E [XY ] = ∫_{−∞}^{∞} ∫_{−∞}^{∞} x y p X,Y (x, y) dx dy

Related to the covariance is the (confusingly named) correlation coefficient: the covariance normalized by the standard deviations of the component random variables.

ρ X,Y = cov (X, Y) / (σX σY)

When two random variables are uncorrelated, their covariance and correlation coefficient equal zero, so that E [XY ] = E [X] E [Y ]. Statistically independent random variables are always uncorrelated, but uncorrelated random variables can be dependent.4
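The example in footnote 4 is easy to check numerically; a short Python sketch (assuming NumPy, with all values illustrative) shows a near-zero covariance for X uniform on [−1, 1] and Y = X², even though Y is completely determined by X:

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.uniform(-1.0, 1.0, size=1_000_000)   # X uniform on [-1, 1]
    y = x**2                                      # Y = X^2, a function of X

    # cov(X, Y) = E[XY] - E[X]E[Y]; E[X^3] = 0 by symmetry, so this is near zero
    print("cov(X, Y) ≈", np.mean(x * y) - np.mean(x) * np.mean(y))
    # Dependence is obvious: conditioning on |X| changes the distribution of Y
    print("E[Y | |X| > 0.9] ≈", y[np.abs(x) > 0.9].mean(), "versus E[Y] ≈", y.mean())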

A conditional expected value is the mean of the conditional density.

E [X | Y ] = ∫_{−∞}^{∞} x p X|Y (x|Y = y) dx   (1.11)

Note that the conditional expected value is now a function of Y and is therefore a random variable. Consequently, it too has an expected value, which is easily evaluated to be the expected value of X.

E [E [X | Y ]] = ∫_{−∞}^{∞} [∫_{−∞}^{∞} x p X|Y (x|Y = y) dx] p Y (y) dy = E [X]   (1.12)

More generally, the expected value of a function of two random variables can be shown to be the expected value of a conditional expected value: E [f (X, Y)] = E [E [f (X, Y) | Y ]]. This kind of calculation is frequently simpler to evaluate than trying to find the expected value of f (X, Y) "all at once." A particularly interesting example of this simplicity is the random sum of random variables. Let L be a random variable and Xl a sequence of random variables. We will find occasion to consider the quantity SL = ∑_{l=1}^{L} Xl.

4. Let X be uniformly distributed over [−1, 1] and let Y = X². The two random variables are uncorrelated, but are clearly not independent.


Assuming that each component of the sequence has the same expected value E [X], the expected value of the sum is found to be

E [SL] = E [E [∑_{l=1}^{L} Xl | L]] = E [L E [X]] = E [L] E [X]   (1.13)
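A short Monte Carlo sketch of (1.13), with arbitrarily chosen distributions (L Poisson with mean 5, Xl exponential with mean 2; NumPy assumed):

    import numpy as np

    rng = np.random.default_rng(1)
    n_trials = 200_000

    L = rng.poisson(5.0, size=n_trials)                              # random number of terms
    sums = np.array([rng.exponential(2.0, size=k).sum() for k in L]) # random sum S_L

    print("Monte Carlo E[S_L]:", sums.mean())    # should be near E[L] E[X] = 10
    print("E[L] * E[X]      :", 5.0 * 2.0)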

1.4 Random Vectors

A random vector X is an ordered sequence of random variables X = (X1, . . . , XL)^T. The density function of a random vector is defined in a manner similar to that for pairs of random variables considered previously. The expected value of a random vector is the vector of expected values.

E [X] = ∫_{−∞}^{∞} x p X (x) dx = (E [X1] , . . . , E [XL])^T   (1.14)

The covariance matrix KX is an L × L matrix consisting of all possible covariances among the random vector's components.

(KX)_{i,j} = cov (Xi, Xj) = E [Xi Xj] − E [Xi] E [Xj],   i, j ∈ {1, . . . , L}   (1.15)

Using matrix notation, the covariance matrix can be written as KX = E [(X − E [X]) (X − E [X])^T]. Using this expression, the covariance matrix is seen to be a symmetric matrix and, when the random vector has no zero-variance component, its covariance matrix is positive-definite. Note in particular that when the random variables are real-valued, the diagonal elements of a covariance matrix equal the variances of the components: (KX)_{i,i} = σ²_{Xi}. Circular random vectors are complex-valued with uncorrelated, identically distributed, real and imaginary parts. In this case, E [|Xi|²] = 2σ²_{Xi} and E [Xi²] = 0. By convention, σ²_{Xi} denotes the variance of the real (or imaginary) part. The characteristic function of a real-valued random vector is defined to be

ΦX (iν) = E [e^{iν^T X}]   (1.16)

The maximum of a random vector is a random variable whose probability density is usually quite different from the distributions of the vector's components. The probability that the maximum is less than some number µ is equal to the probability that all of the components are less than µ.

Pr [max X < µ] = P X (µ, . . . , µ ) (1.17)

Assuming that the components of X are statistically independent, this expression becomes

Pr [max X < µ] = ∏_{i=1}^{dim(X)} P Xi (µ)   (1.18)
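A simulation sketch of (1.18) for three independent standard Gaussian components (an arbitrary illustrative choice; NumPy assumed):

    import numpy as np
    from math import erf, sqrt

    rng = np.random.default_rng(2)
    mu, n_trials, dim = 0.5, 500_000, 3

    X = rng.standard_normal((n_trials, dim))          # independent N(0,1) components
    empirical = np.mean(X.max(axis=1) < mu)

    P_marginal = 0.5 * (1 + erf(mu / sqrt(2)))        # Pr[Xi <= mu] for one component
    print("empirical Pr[max X < mu]:", empirical)
    print("product of marginals    :", P_marginal ** dim)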

5. This content is available online at <http://cnx.org/content/m11249/1.2/>.


1.5 The Gaussian Random Variable

The random variable X is said to be a Gaussian random variable7 if its probability density function has the form

p X (x) = (1/√(2πσ²)) e^{−(x−m)²/(2σ²)}   (1.19)

The mean of such a Gaussian random variable is m and its variance σ². As a shorthand notation, this information is denoted by X ∼ N (m, σ²). The characteristic function ΦX (·) of a Gaussian random variable is given by

ΦX (iu) = e^{imu} e^{−σ²u²/2}

No closed-form expression exists for the probability distribution function of a Gaussian random variable. For a zero-mean, unit-variance Gaussian random variable N (0, 1), the probability that it exceeds the value x is denoted by Q (x).

Pr [X > x] = 1 − P X (x) = (1/√(2π)) ∫_{x}^{∞} e^{−α²/2} dα ≡ Q (x)

Figure 1.1: The function Q (·) is plotted on logarithmic coordinates. Beyond values of about two, this function decreases quite rapidly. Two approximations are also shown that correspond to the upper and lower bounds given by (1.21).

A plot of Q (·) is shown in Figure 1.1. When the Gaussian random variable has non-zero mean and/or non-unit variance, the probability of it exceeding x can also be expressed in terms of Q (·).

For X ∼ N (m, σ²):   Pr [X > x] = Q ((x − m)/σ)   (1.20)

6. This content is available online at <http://cnx.org/content/m11250/1.2/>.
7. Gaussian random variables are also known as normal random variables.


Integrating by parts, Q (·) is bounded (for x > 0) by

(1/√(2π)) (x/(1 + x²)) e^{−x²/2} ≤ Q (x) ≤ (1/(√(2π) x)) e^{−x²/2}   (1.21)

As x becomes large, these bounds approach each other and either can serve as an approximation to Q (·); the upper bound is usually chosen because of its relative simplicity. The lower bound can be improved: noting that the term x/(1 + x²) decreases as x decreases below one and that Q (x) increases as x decreases, the term can be replaced by its value at x = 1 without affecting the validity of the bound for x ≤ 1.

For x ≤ 1:   (1/(2√(2π))) e^{−x²/2} ≤ Q (x)   (1.22)
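The following Python sketch (standard library only) computes Q (x) through the complementary error function and compares it with the bounds of (1.21); the evaluation points are illustrative:

    from math import erfc, exp, pi, sqrt

    def Q(x):
        """Gaussian tail probability Q(x) = Pr[N(0,1) > x]."""
        return 0.5 * erfc(x / sqrt(2))

    def bounds(x):
        """Lower and upper bounds of (1.21), valid for x > 0."""
        g = exp(-x * x / 2) / sqrt(2 * pi)
        return (x / (1 + x * x)) * g, g / x

    for x in (0.5, 1.0, 2.0, 4.0):
        lo, hi = bounds(x)
        print(f"x={x}: lower={lo:.3e}  Q={Q(x):.3e}  upper={hi:.3e}")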

We will have occasion to evaluate the expected value of e^{aX+bX²} where X ∼ N (m, σ²) and a, b are constants. By definition,

E [e^{aX+bX²}] = (1/√(2πσ²)) ∫_{−∞}^{∞} e^{ax + bx² − (x−m)²/(2σ²)} dx

The argument of the exponential requires manipulation (i.e., completing the square) before the integral can be evaluated. This expression can be written as

−(1/(2σ²)) ((1 − 2bσ²) x² − 2 (m + aσ²) x + m²)

Completing the square, this expression can be written

−((1 − 2bσ²)/(2σ²)) (x − (m + aσ²)/(1 − 2bσ²))² + ((1 − 2bσ²)/(2σ²)) ((m + aσ²)/(1 − 2bσ²))² − m²/(2σ²)

We are now ready to evaluate the integral. Using this expression,

E [e^{aX+bX²}] = e^{((1−2bσ²)/(2σ²)) ((m+aσ²)/(1−2bσ²))² − m²/(2σ²)} · (1/√(2πσ²)) ∫_{−∞}^{∞} e^{−((1−2bσ²)/(2σ²)) (x − (m+aσ²)/(1−2bσ²))²} dx

Let

α = (x − (m + aσ²)/(1 − 2bσ²)) / (σ/√(1 − 2bσ²))

which implies that we must require 1 − 2bσ² > 0 (or b < 1/(2σ²)). We then obtain

E [e^{aX+bX²}] = e^{((1−2bσ²)/(2σ²)) ((m+aσ²)/(1−2bσ²))² − m²/(2σ²)} · (1/√(1 − 2bσ²)) · (1/√(2π)) ∫_{−∞}^{∞} e^{−α²/2} dα

The integral equals unity, leaving the result

For b < 1/(2σ²):

E [e^{aX+bX²}] = e^{((1−2bσ²)/(2σ²)) ((m+aσ²)/(1−2bσ²))² − m²/(2σ²)} / √(1 − 2bσ²)   (1.23)

Important special cases are

1. a = 0, X ∼ N (m, σ²):   E [e^{bX²}] = e^{bm²/(1−2bσ²)} / √(1 − 2bσ²)


2. a = 0, X ∼ N (0, σ²):   E [e^{bX²}] = 1 / √(1 − 2bσ²)

3. X ∼ N (0, σ²):   E [e^{aX+bX²}] = e^{a²σ²/(2(1−2bσ²))} / √(1 − 2bσ²)
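A Monte Carlo sketch checking (1.23) (NumPy assumed; the parameter values are arbitrary but satisfy b < 1/(2σ²)):

    import numpy as np

    rng = np.random.default_rng(3)
    m, sigma, a, b = 1.0, 2.0, 0.3, 0.05          # b < 1/(2 sigma^2) = 0.125
    x = rng.normal(m, sigma, size=2_000_000)

    mc = np.mean(np.exp(a * x + b * x**2))
    exponent = ((1 - 2*b*sigma**2) / (2*sigma**2)) \
               * ((m + a*sigma**2) / (1 - 2*b*sigma**2))**2 - m**2 / (2*sigma**2)
    closed = np.exp(exponent) / np.sqrt(1 - 2*b*sigma**2)
    print("Monte Carlo:", mc, "  closed form (1.23):", closed)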

The real-valued random vector X is said to be a Gaussian random vector if its joint distribution function has the form

p X (x) = (1/√(det (2πK))) e^{−(1/2) (x−m)^T K^{−1} (x−m)}   (1.24)

If complex-valued, the joint distribution of a circular Gaussian random vector is given by

p X (x) = (1/det (πKX)) e^{−(x−mX)^H KX^{−1} (x−mX)}   (1.25)

The vector mX denotes the expected value of the Gaussian random vector and KX its covariance matrix.

mX = E [X],   KX = E [X X^T] − mX mX^T

As in the univariate case, the Gaussian distribution of a random vector is denoted by X ∼ N (mX, KX). After applying a linear transformation to a Gaussian random vector, such as Y = AX, the result is also a Gaussian random vector (a random variable if the matrix is a row vector): Y ∼ N (A mX, A KX A^T). The characteristic function of a Gaussian random vector is given by

ΦX (iu) = e^{iu^T mX − (1/2) u^T KX u}

From this formula, the Nth-order moment formula for jointly distributed Gaussian random variables is easily derived.8

E [X1 · · · XN] = ∑_{PN} E [X_{PN(1)} X_{PN(2)}] · · · E [X_{PN(N−1)} X_{PN(N)}]   (N even)

E [X1 · · · XN] = ∑_{PN} E [X_{PN(1)}] E [X_{PN(2)} X_{PN(3)}] · · · E [X_{PN(N−1)} X_{PN(N)}]   (N odd)

where PN denotes a permutation of the first N integers and PN(i) the ith element of the permutation. For example, E [X1X2X3X4] = E [X1X2] E [X3X4] + E [X1X3] E [X2X4] + E [X1X4] E [X2X3].
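A simulation sketch of the linear-transformation property Y = AX ∼ N (A mX, A KX A^T), with an arbitrarily chosen mean, covariance, and matrix A (NumPy assumed):

    import numpy as np

    rng = np.random.default_rng(4)
    m_x = np.array([1.0, -2.0, 0.5])
    K_x = np.array([[2.0, 0.5, 0.0],
                    [0.5, 1.0, 0.3],
                    [0.0, 0.3, 1.5]])     # symmetric, positive definite
    A = np.array([[1.0, 2.0, 0.0],
                  [0.0, 1.0, -1.0]])

    X = rng.multivariate_normal(m_x, K_x, size=500_000)
    Y = X @ A.T                           # Y = A X, applied sample by sample

    print("sample mean of Y:", Y.mean(axis=0), " vs  A m_x:", A @ m_x)
    print("sample cov of Y:\n", np.cov(Y.T), "\n vs  A K_x A^T:\n", A @ K_x @ A.T)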

1.6 The Central Limit Theorem

Let Xl denote a sequence of independent, identically distributed random variables. Assuming they have zero means and finite variances (equaling σ²), the Central Limit Theorem states that the sum ∑_{l=1}^{L} Xl/√L converges in distribution to a Gaussian random variable.

(1/√L) ∑_{l=1}^{L} Xl → N (0, σ²)   as L → ∞

8. E [X1 · · · XN] = (1/i^N) ∂^N ΦX (iu) / (∂u1 ∂u2 · · · ∂uN) evaluated at u = 0.
9. This content is available online at <http://cnx.org/content/m11251/1.2/>.


Because of its generality, this theorem is often used to simplify calculations involving finite sums of non-Gaussian random variables. However, attention is seldom paid to the convergence rate of the Central Limit Theorem. Kolmogorov, the famous twentieth century mathematician, is reputed to have said, "The Central Limit Theorem is a dangerous tool in the hands of amateurs." Let's see what he meant.

Taking σ² = 1, the key result is that the magnitude of the difference between P (x), defined to be the probability that the sum given above exceeds x, and Q (x), the probability that a unit-variance Gaussian random variable exceeds x, is bounded by a quantity inversely related to the square root of L (Cramer: Theorem 24 [10]).

|P (x) − Q (x)| ≤ c (E [|X|³]/σ³) (1/√L)

The constant of proportionality c is a number known to be about 0.8 (Hall: p. 6 [17]). The ratio of the absolute third moment of Xl to the cube of its standard deviation, known as the skew and denoted by γX, depends only on the distribution of Xl and is independent of scale. This bound on the absolute error has been shown to be tight (Cramer: pp. 79 [10]). Using our lower bound for Q (·) (see (1.22)), we find that the relative error in the Central Limit Theorem approximation to the distribution of finite sums is bounded for x > 0 as

|P (x) − Q (x)| / Q (x) ≤ cγX √(2π/L) e^{x²/2} × { 2 if x ≤ 1;  (1 + x²)/x if x > 1 }   (1.26)

Suppose we require that the relative error not exceed some specific value ε. The normalized (by the standard deviation) boundary x at which the approximation is evaluated must not violate

L ε² / (2π c² γX²) ≥ e^{x²} × { 4 if x ≤ 1;  ((1 + x²)/x)² if x > 1 }

As shown in Figure 1.2, the right side of this equation is a monotonically increasing function.


Figure 1.2: The quantity which governs the limits of validity for numerically applying the Central Limit Theorem to finite amounts of data is shown over a portion of its range. To judge these limits, we must compute the quantity Lε²/(2πc²γX²), where ε denotes the desired percentage error in the Central Limit Theorem approximation and L the number of observations. Selecting this value on the vertical axis and determining the value of x yielding it, we find the normalized (x = 1 implies unit variance) upper limit on an L-term sum to which the Central Limit Theorem is guaranteed to apply. Note how rapidly the curve increases, suggesting that large amounts of data are needed for accurate approximation.

Example 1.1
If ε = 0.1 and taking cγX arbitrarily to be unity (a reasonable value), the upper limit of the preceding equation becomes 1.6 × 10^{−3} L. Examining Figure 1.2, we find that for L = 10000, x must not exceed 1.17. Because we have normalized to unit variance, this example suggests that the Gaussian approximates the distribution of a ten-thousand term sum only over a range corresponding to a 76% area about the mean. Consequently, the Central Limit Theorem, as a finite-sample distributional approximation, is only guaranteed to hold near the mode of the Gaussian, with huge numbers of observations needed to specify the tail behavior. Realizing this fact will keep us from being ignorant amateurs.
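The tail behavior described in Example 1.1 can be explored with a short simulation (NumPy assumed; the uniform summands, L, and the thresholds are illustrative choices). The relative error between the empirical tail probability P (x) and Q (x) grows noticeably with x:

    import numpy as np
    from math import erfc, sqrt

    def Q(x):
        return 0.5 * erfc(x / sqrt(2))

    rng = np.random.default_rng(5)
    L, n_trials, chunks = 64, 500_000, 50
    xs = np.array([1.0, 2.0, 3.0])
    exceed = np.zeros_like(xs)

    # Sums of L zero-mean, unit-variance uniform variables, normalized by sqrt(L)
    for _ in range(chunks):
        U = rng.uniform(-np.sqrt(3.0), np.sqrt(3.0), size=(n_trials // chunks, L))
        S = U.sum(axis=1) / np.sqrt(L)
        exceed += (S[:, None] > xs[None, :]).sum(axis=0)

    for x, p in zip(xs, exceed / n_trials):
        print(f"x={x}: P(x)={p:.3e}  Q(x)={Q(x):.3e}  relative error={abs(p - Q(x)) / Q(x):.1%}")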

1.7 Basic Definitions in Stochastic Processes

A random or stochastic process is the assignment of a function of a real variable to each sample point ω in a sample space. Thus, the process X (ω, t) can be considered a function of two variables. For each ω, the time function must be well-behaved and may or may not look random to the eye. Each time function of the process is called a sample function and must be defined over the entire domain of interest. For each t, we have a function of ω, which is precisely the definition of a random variable. Hence the amplitude of a random process is a random variable. The amplitude distribution of a process refers to the probability density function of the amplitude: p X(t) (x). By examining the process's amplitude at several instants, the joint amplitude distribution can also be defined. For the purposes of this module, a process is said to be stationary when the joint amplitude distribution depends on the differences between the selected time instants.

10. This content is available online at <http://cnx.org/content/m11252/1.2/>.


The expected value or mean of a process is the expected value of the amplitude at each t.

E [X (t)] = mX (t) = ∫_{−∞}^{∞} x p X(t) (x) dx

For the most part, we take the mean to be zero. The correlation function is the first-order joint moment between the process's amplitudes at two times.

RX (t1, t2) = ∫_{−∞}^{∞} ∫_{−∞}^{∞} x1 x2 p X(t1),X(t2) (x1, x2) dx1 dx2

Since the joint distribution for stationary processes depends only on the time difference, correlation functions of stationary processes depend only on |t1 − t2|. In this case, correlation functions are really functions of a single variable (the time difference) and are usually written as RX (τ) where τ = t1 − t2. Related to the correlation function is the covariance function KX (τ), which equals the correlation function minus the square of the mean.

KX (τ) = RX (τ) − mX²

The variance of the process equals the covariance function evaluated at the origin. The power spectrum of a stationary process is the Fourier Transform of the correlation function.

SX (ω) = ∫_{−∞}^{∞} RX (τ) e^{−iωτ} dτ

A particularly important example of a random process is white noise. The process X (t) is said to be white if it has zero mean and a correlation function proportional to an impulse.

E [X (t)] = 0,   RX (τ) = (N0/2) δ (τ)

The power spectrum of white noise is constant for all frequencies, equaling N0/2, which is known as the spectral height.11

When a stationary process X (t) is passed through a stable linear, time-invariant filter, the resulting output Y (t) is also a stationary process having power density spectrum

SY (ω) = |H (iω)|² SX (ω)

where H (iω) is the filter's transfer function.
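A discrete-time sketch of this input-output relation (NumPy assumed; the FIR filter h and all lengths are arbitrary choices): white noise of variance σ² is filtered, and an averaged periodogram of the output is compared with |H (e^{jω})|² σ².

    import numpy as np

    rng = np.random.default_rng(6)
    sigma2 = 1.0
    h = np.array([1.0, 0.5])                     # illustrative stable FIR filter

    N = 2**19
    x = rng.normal(0.0, np.sqrt(sigma2), size=N) # discrete-time white noise
    y = np.convolve(x, h)[:N]                    # filter output

    n_fft = 256
    segments = y[: (N // n_fft) * n_fft].reshape(-1, n_fft)
    S_est = np.mean(np.abs(np.fft.fft(segments, axis=1))**2, axis=0) / n_fft

    S_theory = np.abs(np.fft.fft(h, n_fft))**2 * sigma2   # |H|^2 * S_X, with S_X = sigma^2
    for k in (0, 64, 128):
        print(f"bin {k}: estimated {S_est[k]:.3f}   theory {S_theory[k]:.3f}")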

1.8 The Gaussian Process

A random process X (t) is Gaussian if, for any choice of time instants, the N amplitudes X (t1) , . . . , X (tN) comprise a Gaussian random vector. The elements of the required covariance matrix equal the covariance between the appropriate amplitudes: Ki,j = KX (ti, tj). Assuming the mean is known, the entire structure of the Gaussian random process is specified once the correlation function or, equivalently, the power spectrum is known. As linear transformations of Gaussian random processes yield another Gaussian process, linear operations such as differentiation, integration, linear filtering, sampling, and summation with other Gaussian processes result in a Gaussian process.

11. The curious reader can track down why the spectral height of white noise has the fraction one-half in it. This definition is the convention.
12. This content is available online at <http://cnx.org/content/m11253/1.2/>.


1.9 Sampling and Random Sequences

The usual Sampling Theorem applies to random processes, with the spectrum of interest being the power spectrum. If the stationary process X (t) is bandlimited - SX (ω) = 0 for |ω| > W - then as long as the sampling interval T satisfies the classic constraint T < π/W, the sequence X (lT) represents the original process. A sampled process is itself a random process defined over discrete time. Hence, all of the random process notions introduced in the previous section (Section 1.8) apply to the random sequence X̃ (l) ≡ X (lT). The correlation functions of these two processes are related as

R_X̃ (k) = E [X̃ (l) X̃ (l + k)] = RX (kT)

We note especially that for distinct samples of a random process to be uncorrelated, the correlation function RX (kT) must equal zero for all non-zero k. This requirement places severe restrictions on the correlation function (hence the power spectrum) of the original process. One correlation function satisfying this property is derived from the random process which has a bandlimited, constant-valued power spectrum over precisely the frequency region needed to satisfy the sampling criterion. No other power spectrum satisfying the sampling criterion has this property. Hence, sampling does not normally yield uncorrelated amplitudes, meaning that discrete-time white noise is a rarity. White noise has a correlation function given by R_X̃ (k) = σ² δ (k), where δ (·) is the unit sample. The power spectrum of white noise is a constant: S_X̃ (ω) = σ².

1.10 The Poisson Process

Some signals have no waveform. Consider the measurement of when lightning strikes occur within some region; the random process is the sequence of event times, which has no intrinsic waveform. Such processes are termed point processes, and have been shown (see Snyder [44]) to have simple mathematical structure. Define some quantities first. Let Nt be the number of events that have occurred up to time t (observations are by convention assumed to start at t = 0). This quantity is termed the counting process, and has the shape of a staircase function: the counting function consists of a series of plateaus always equal to an integer, with jumps between plateaus occurring when events occur. The increment Nt1,t2 = Nt2 − Nt1 corresponds to the number of events in the interval [t1, t2). Consequently, Nt = N0,t. The event times comprise the random vector W; the dimension of this vector is Nt, the number of events that have occurred. The occurrence of events is governed by a quantity known as the intensity λ (t; Nt; W) of the point process through the probability law

Pr [Nt,t+∆t = 1 | Nt; W] = λ (t; Nt; W) ∆t

for sufficiently small ∆t. Note that this probability is a conditional probability; it can depend on how many events occurred previously and when they occurred. The intensity can also vary with time to describe non-stationary point processes. The intensity has units of events per unit time, and it can be viewed as the instantaneous rate at which events occur.

The simplest point process from a structural viewpoint, the Poisson process, has no dependence on process history. A stationary Poisson process results when the intensity equals a constant: λ (t; Nt; W) = λ0. Thus, in a Poisson process, a coin is flipped every ∆t seconds, with a constant probability of heads (an event) occurring that equals λ0 ∆t and is independent of the occurrence of past (and future) events. When this probability varies with time, the intensity equals λ (t), a non-negative signal, and a nonstationary Poisson process results.15

From the Poisson process's definition, we can derive the probability laws that govern event occurrence. These fall into two categories: the count statistics Pr [Nt1,t2 = n], the probability of obtaining n events in

13. This content is available online at <http://cnx.org/content/m11254/1.2/>.
14. This content is available online at <http://cnx.org/content/m11255/1.4/>.
15. In the literature, stationary Poisson processes are sometimes termed homogeneous, nonstationary ones inhomogeneous.


an interval [t1, t2), and the time of occurrence statistics p W(n) (w), the joint distribution of the first n event times in the observation interval. These times form the vector W(n), the occurrence time vector of dimension n. From these two probability distributions, we can derive the sample function density.

1.10.1 Count Statistics

We derive a differentio-difference equation that Pr [Nt1,t2 = n], t1 < t2, must satisfy for event occurrence in an interval to be regular and independent of event occurrences in disjoint intervals. Let t1 be fixed and consider event occurrence in the intervals [t1, t2) and [t2, t2 + δ), and how these contribute to the occurrence of n events in the union of the two intervals. If k events occur in [t1, t2), then n − k must occur in [t2, t2 + δ). Furthermore, the scenarios for different values of k are mutually exclusive. Consequently,

Pr [Nt1,t2+δ = n] = ∑_{k=0}^{n} Pr [Nt1,t2 = k, Nt2,t2+δ = n − k]
= Pr [Nt2,t2+δ = 0 | Nt1,t2 = n] Pr [Nt1,t2 = n]
  + Pr [Nt2,t2+δ = 1 | Nt1,t2 = n − 1] Pr [Nt1,t2 = n − 1]
  + ∑_{k=2}^{n} Pr [Nt2,t2+δ = k | Nt1,t2 = n − k] Pr [Nt1,t2 = n − k]   (1.27)

Because of the independence of event occurrence in disjoint intervals, the conditional probabilities in this expression equal the unconditional ones. When δ is small, only the first two will be significant to the first order in δ. Rearranging and taking the obvious limit, we have the equation defining the count statistics.

(d/dt2) Pr [Nt1,t2 = n] = −λ (t2) Pr [Nt1,t2 = n] + λ (t2) Pr [Nt1,t2 = n − 1]

To solve this equation, we apply a z-transform to both sides. Defining the transform of Pr [Nt1,t2 = n] to be P (t2, z),16 we have

∂P (t2, z)/∂t2 = −λ (t2) (1 − z^{−1}) P (t2, z)

Applying the boundary condition that P (t1, z) = 1, this simple first-order differential equation has the solution

P (t2, z) = e^{−(1 − z^{−1}) ∫_{t1}^{t2} λ(α) dα}

To evaluate the inverse z-transform, we simply exploit the Taylor series expression for the exponential, and we find that a Poisson probability mass function governs the count statistics for a Poisson process.

Pr [Nt1,t2 = n] = ((∫_{t1}^{t2} λ (α) dα)^n / n!) e^{−∫_{t1}^{t2} λ(α) dα}   (1.28)

The integral of the intensity occurs frequently, and we succinctly denote it by Λ_{t1}^{t2}. When the Poisson process is stationary, the intensity equals a constant, and the count statistics depend only on the difference t2 − t1.
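A simulation sketch of the count statistics (1.28) in the stationary case (NumPy assumed; λ0 and the interval are arbitrary). Events are generated from independent exponential interevent intervals, a property established later in this section:

    import numpy as np
    from math import exp, factorial

    rng = np.random.default_rng(7)
    lam0, t1, t2 = 3.0, 0.0, 2.0
    Lam = lam0 * (t2 - t1)                  # integrated intensity
    n_trials = 100_000

    counts = np.empty(n_trials, dtype=int)
    for i in range(n_trials):
        t, n = t1, 0
        while True:
            t += rng.exponential(1.0 / lam0)    # exponential interevent interval
            if t >= t2:
                break
            n += 1
        counts[i] = n

    for n in range(7):
        print(f"n={n}: empirical {np.mean(counts == n):.4f}   "
              f"Poisson pmf {Lam**n / factorial(n) * exp(-Lam):.4f}")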

1.10.2 Time of occurrence statistics

To derive the multivariate distribution of W, we use the count statistics and the independence properties of the Poisson process. The density we seek satisfies

∫_{w1}^{w1+δ1} · · · ∫_{wn}^{wn+δn} p W(n) (v) dv = Pr [W1 ∈ [w1, w1 + δ1) , . . . , Wn ∈ [wn, wn + δn)]

The expression on the right equals the probability that no events occur in [t1, w1), one event in [w1, w1 + δ1), no event in [w1 + δ1, w2), etc. Because of the independence of event occurrence in these disjoint intervals, we can multiply together the probabilities of these event occurrences, each of which is given by the count statistics.

16. Remember, t1 is fixed and can be suppressed notationally.


Pr [W1 ∈ [w1, w1 + δ1) , . . . , Wn ∈ [wn, wn + δn)]
= e^{−Λ_{t1}^{w1}} Λ_{w1}^{w1+δ1} e^{−Λ_{w1}^{w1+δ1}} e^{−Λ_{w1+δ1}^{w2}} Λ_{w2}^{w2+δ2} e^{−Λ_{w2}^{w2+δ2}} · · · Λ_{wn}^{wn+δn} e^{−Λ_{wn}^{wn+δn}}
≈ ∏_{k=1}^{n} λ (wk) δk e^{−Λ_{t1}^{wn}}

for small δk. From this approximation, we find that the joint distribution of the first n event times equals

p W(n) (w) = ∏_{k=1}^{n} λ (wk) e^{−∫_{t1}^{wn} λ(α) dα}   if t1 ≤ w1 ≤ w2 ≤ · · · ≤ wn, and 0 otherwise   (1.29)

1.10.3 Sample function density

For Poisson processes, the sample function density describes the joint distribution of counts and event times within a specified time interval. Thus, it can be written as

Pr [Nt | t1 ≤ t < t2] = Pr [Nt1,t2 = n | W1 = w1, . . . ,Wn = wn] p W (n) (w )

The second term in the product equals the distribution derived previously for the time of occurrence statistics. The conditional probability equals the probability that no events occur between wn and t2; from the Poisson process's count statistics, this probability equals e^{−Λ_{wn}^{t2}}. Consequently, the sample function density for the Poisson process, be it stationary or not, equals

Pr [Nt | t1 ≤ t < t2] = ∏_{k=1}^{n} λ (wk) e^{−∫_{t1}^{t2} λ(α) dα}   (1.30)

1.10.4 Properties

From the probability distributions derived on the previous pages, we can discern many structural properties of the Poisson process. These properties set the stage for delineating other point processes from the Poisson. They, as described subsequently, have much more structure and are much more difficult to handle analytically.

1.10.4.1 The Counting Process

The counting process Nt is an independent increment process. For a Poisson process, the numbers of events in disjoint intervals are statistically independent of each other, meaning that we have an independent increment process. When the Poisson process is stationary, increments taken over equi-duration intervals are identically distributed as well as being statistically independent. Two important results obtain from this property. First, the counting process's covariance function KN (t, u) equals σ² min {t, u}. This close relation to the Wiener waveform process indicates the fundamental nature of the Poisson process in the world of point processes. Note, however, that the Poisson counting process is not continuous almost surely. Second, the sequence of counts forms an ergodic process, meaning we can estimate the intensity parameter from observations.

The mean and variance of the number of events in an interval can be easily calculated from the Poisson distribution. Alternatively, we can calculate the characteristic function and evaluate its derivatives. The characteristic function of an increment equals

Φ_{Nt1,t2} (v) = e^{(e^{iv} − 1) Λ_{t1}^{t2}}

The first two moments and variance of an increment of the Poisson process, be it stationary or not, equal

E [Nt1,t2] = Λ_{t1}^{t2}   (1.31)


E [Nt1,t2²] = Λ_{t1}^{t2} + (Λ_{t1}^{t2})²

σ (Nt1,t2)² = Λ_{t1}^{t2}

Note that the mean equals the variance here, a trademark of the Poisson process.

1.10.4.2 Poisson process event times form a Markov process

Consider the conditional density p Wn|Wn−1,...,W1 (wn|wn−1, . . . , w1). This density equals the ratio of the event time densities for the n- and (n − 1)-dimensional event time vectors. Simple substitution yields

For wn ≥ wn−1:   p Wn|Wn−1,...,W1 (wn|wn−1, . . . , w1) = λ (wn) e^{−∫_{wn−1}^{wn} λ(α) dα}   (1.32)

Thus the nth event time depends only on when the (n − 1)th event occurs, meaning that we have a Markov process. Note that event times are ordered: the nth event must occur after the (n − 1)th, etc. Thus, the values of this Markov process keep increasing, meaning that from this viewpoint, the event times form a nonstationary Markovian sequence. When the process is stationary, the evolutionary density is exponential. It is this special form of event occurrence time density that defines a Poisson process.

1.10.4.3 Interevent intervals in a Poisson process form a white sequence.

Exploiting the previous property, the duration of the nth interval τn = wn − wn−1 does not depend on the lengths of previous (or future) intervals. Consequently, the sequence of interevent intervals forms a "white" sequence. The sequence may not be identically distributed unless the process is stationary. In the stationary case, interevent intervals are truly white - they form an IID sequence - and have an exponential distribution.

For τ ≥ 0:   p τn (τ) = λ0 e^{−λ0 τ}   (1.33)

To show that the exponential density for a white sequence corresponds to the most "random" distribution, Parzen [38] proved that the ordered times of n events sprinkled independently and uniformly over a given interval form a stationary Poisson process. If the density of event sprinkling is not uniform, the resulting ordered times constitute a nonstationary Poisson process with an intensity proportional to the sprinkling density.

1.10.5 Doubly stochastic Poisson processes

Here, the intensity λ (t) equals a sample function drawn from some waveform process. In waveform processes, the analogous concept does not have nearly the impact it does here. Because intensity waveforms must be non-negative, the intensity process must be nonzero mean and non-Gaussian. We shall assume throughout that the intensity process is stationary for simplicity. This model arises in those situations in which the event occurrence rate clearly varies unpredictably with time. Such processes have the property that the variance-to-mean ratio of the number of events in any interval exceeds one. In the process of deriving this last property, we illustrate the typical way of analyzing doubly stochastic processes: condition on the intensity equaling a particular sample function, use the statistical characteristics of nonstationary Poisson processes, then "average" with respect to the intensity process. To calculate the expected number Nt1,t2 of events in an interval, we use conditional expected values:

E [Nt1,t2] = E [E [Nt1,t2 | λ (t), t1 ≤ t < t2]] = E [∫_{t1}^{t2} λ (α) dα] = (t2 − t1) E [λ (t)]   (1.34)


This result can also be written as the expected value of the integrated intensity: E [Nt1,t2] = E [Λ_{t1}^{t2}].

Similar calculations yield the increment's second moment and variance.

E [Nt1,t2²] = E [Λ_{t1}^{t2}] + E [(Λ_{t1}^{t2})²]

σ (Nt1,t2)² = E [Λ_{t1}^{t2}] + σ (Λ_{t1}^{t2})²

Using the last result, we find that the variance-to-mean ratio in a doubly stochastic process always exceeds unity, equaling one plus the variance-to-mean ratio of the intensity process.
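A toy simulation of this property (NumPy assumed): the intensity over each trial's interval is itself random (here a gamma-distributed constant rate, an arbitrary choice), and the counts' variance-to-mean ratio exceeds one by the variance-to-mean ratio of the integrated intensity:

    import numpy as np

    rng = np.random.default_rng(8)
    T, n_trials = 1.0, 300_000

    lam = rng.gamma(shape=4.0, scale=2.0, size=n_trials)   # random (constant) rate per trial
    Lam = lam * T                                          # integrated intensity
    counts = rng.poisson(Lam)                              # Poisson counts given the intensity

    print("variance-to-mean ratio of counts:", counts.var() / counts.mean())
    print("1 + var(Lambda)/mean(Lambda)    :", 1 + Lam.var() / Lam.mean())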

The approach of sample-function conditioning can also be used to derive the density of the number of events occurring in an interval for a doubly stochastic Poisson process. Conditioned on the occurrence of a sample function, the probability of n events occurring in the interval [t1, t2) equals (see (1.28))

Pr [Nt1,t2 = n | λ (t), t1 ≤ t < t2] = ((Λ_{t1}^{t2})^n / n!) e^{−Λ_{t1}^{t2}}

Because Λ_{t1}^{t2} is a random variable, the unconditional distribution equals this conditional probability averaged with respect to this random variable's density. This average is known as the Poisson Transform of the random variable's density.

Pr [Nt1,t2 = n] = ∫_{0}^{∞} (α^n / n!) e^{−α} p_{Λ_{t1}^{t2}} (α) dα   (1.35)

1.11 Linear Vector Spaces

One of the more powerful tools in statistical communication theory is the abstract concept of a linear vector space. The key result that concerns us is the representation theorem: a deterministic time function can be uniquely represented by a sequence of numbers. The stochastic version of this theorem states that a process can be represented by a sequence of uncorrelated random variables. These results will allow us to exploit the theory of hypothesis testing to derive the optimum detection strategy.

1.11.1 Basics

Rule 1.1:
A linear vector space S is a collection of elements called vectors having the following properties:

1. The vector-addition operation can be defined so that if x, y, z ∈ S:
   • (x + y) ∈ S (the space is closed under addition)
   • x + y = y + x (commutativity)
   • (x + y) + z = x + (y + z) (associativity)
   • The zero vector exists and is always an element of S. The zero vector is defined by x + 0 = x.
   • For each x ∈ S, a unique vector −x is also an element of S so that x + (−x) = 0, the zero vector.

2. Associated with the set of vectors is a set of scalars which constitute an algebraic field. A field is a set of elements which obey the well-known laws of associativity and commutativity for both addition and multiplication. If a, b are scalars, the elements x, y of a linear vector space have the properties that:
   • ax (multiplication by scalar a) is defined and ax ∈ S.


   • a (bx) = (ab) x.
   • If "1" and "0" denote the multiplicative and additive identity elements, respectively, of the field of scalars, then 1x = x and 0x = 0.
   • a (x + y) = ax + ay and (a + b) x = ax + bx.

17. This content is available online at <http://cnx.org/content/m11236/1.6/>.

There are many examples of linear vector spaces. A familiar example is the set of column vectors of length N. In this case, we define the sum of two vectors to be

(x1, x2, . . . , xN)^T + (y1, y2, . . . , yN)^T = (x1 + y1, x2 + y2, . . . , xN + yN)^T   (1.36)

and scalar multiplication to be a (x1, x2, . . . , xN)^T = (ax1, ax2, . . . , axN)^T. All of the properties listed above are satisfied.

A more interesting (and useful) example is the collection of square-integrable functions. A square-integrable function x (t) satisfies

∫_{Ti}^{Tf} |x (t)|² dt < ∞   (1.37)

One can verify that this collection constitutes a linear vector space. In fact, this space is so important that it has a special name - L2 (Ti, Tf) (read this as el-two); the arguments denote the range of integration.

Rule 1.2:
Let S be a linear vector space. A subspace T of S is a subset of S which is closed. In other words, if x, y ∈ T, then x, y ∈ S and all elements of T are elements of S, but some elements of S are not elements of T. Furthermore, the linear combination (ax + by) ∈ T for all scalars a, b. A subspace is sometimes referred to as a closed linear manifold.

1.11.2 Inner Product Spaces

A structure needs to be defined for linear vector spaces so that definitions for the length of a vector and for the distance between any two vectors can be obtained. The notions of length and distance are closely related to the concept of an inner product.

Rule 1.3:
An inner product of two real vectors x, y ∈ S is denoted by <x, y> and is a scalar assigned to the vectors x and y which satisfies the following properties:

1. <x, y> = <y, x>
2. <ax, y> = a<x, y>, a a scalar
3. <x + y, z> = <x, z> + <y, z>, z a vector
4. <x, x> > 0 unless x = 0, in which case <x, x> = 0.

As an example, an inner product for the space consisting of column matrices can be defined as

<x, y> = x^T y = ∑_{i=1}^{N} xi yi

The reader should verify that this is indeed a valid inner product (i.e., it satisfies all of the properties given above). It should be noted that this definition of an inner product is not unique: there are other inner product definitions which also satisfy all of these properties. For example, another valid inner product is

<x, y> = x^T K y


where K is an N × N positive-definite matrix. Choices of the matrix K which are not positive definite do not yield valid inner products (property 4 above is not satisfied). The matrix K is termed the kernel of the inner product. When this matrix is something other than an identity matrix, the inner product is sometimes written as <x, y>_K to denote explicitly the presence of the kernel in the inner product.

Rule 1.4:
The norm of a vector x ∈ S is denoted by ‖x‖ and is defined by

‖x‖ = <x, x>^{1/2}   (1.38)

Because of the properties of an inner product, the norm of a vector is always greater than zero unless the vector is identically zero. The norm of a vector is related to the notion of the length of a vector. For example, if the vector x is multiplied by the constant scalar a, the norm of the vector is multiplied by |a|.

‖ax‖ = <ax, ax>^{1/2} = |a| ‖x‖

In other words, "longer" vectors (a > 1) have larger norms. A norm can also be defined when the inner product contains a kernel. In this case, the norm is written ‖x‖_K for clarity.

Rule 1.5:
An inner product space is a linear vector space in which an inner product can be defined for all elements of the space and a norm is given by (1.38). Note in particular that every element of an inner product space must satisfy the axioms of a valid inner product.

For the space S consisting of column matrices, the norm of a vector is given by (consistent with the first choice of an inner product)

‖x‖ = (∑_{i=1}^{N} xi²)^{1/2}

This choice of a norm corresponds to the Cartesian definition of the length of a vector.

One of the fundamental properties of inner product spaces is the Schwarz inequality

|<x, y>| ≤ ‖x‖ ‖y‖   (1.39)

This is one of the most important inequalities we shall encounter. To demonstrate this inequality, consider the norm squared of x + ay.

‖x + ay‖² = <x + ay, x + ay> = ‖x‖² + 2a<x, y> + a²‖y‖²

Let a = −<x, y>/‖y‖². In this case,

‖x + ay‖² = ‖x‖² − 2 |<x, y>|²/‖y‖² + (|<x, y>|²/‖y‖⁴) ‖y‖² = ‖x‖² − |<x, y>|²/‖y‖²

As the left hand side of this result is non-negative, the right-hand side is lower-bounded by zero. The Schwarzinequality (1.39) is thus obtained. Note that the equality occurs only when x = − (ay), or equivalently whenx = cy, where c is any constant.

Rule 1.6: Two vectors are said to be orthogonal if the inner product of the vectors is zero: < x, y > = 0.

Consistent with these results is the concept of the "angle" between two vectors. The cosine of this angle is defined by

\cos(x, y) = \frac{\langle x, y \rangle}{\| x \| \, \| y \|}


Because of the Schwarz inequality, |cos(x, y)| ≤ 1. The angle between orthogonal vectors is ±π/2, and the angle between vectors satisfying the Schwarz inequality (1.39) with equality (x ∝ y) is zero: the vectors are parallel to each other.
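A brief numerical check (an added sketch, not from the original text; the vectors are randomly chosen) of the Schwarz inequality and of the "angle" formula:

import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(5)
y = rng.standard_normal(5)

ip = x @ y
norm_x, norm_y = np.linalg.norm(x), np.linalg.norm(y)

assert abs(ip) <= norm_x * norm_y + 1e-12   # Schwarz inequality holds
cos_angle = ip / (norm_x * norm_y)          # cosine of the angle between x and y
print(cos_angle)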

Rule 1.7: The distance between two vectors is taken to be the norm of the difference of the vectors:

d(x, y) = \| x - y \|

In our example of the normed space of column matrices, the distance between x and y would be

\| x - y \| = \left( \sum_{i=1}^{N} (x_i - y_i)^2 \right)^{1/2}

which agrees with the Cartesian notion of distance. Because of the properties of the inner product, this distance measure (or metric) has the following properties:

• d(x, y) = d(y, x) (distance does not depend on how it is measured)
• d(x, y) = 0 ⇒ x = y (zero distance means equality)
• d(x, z) ≤ d(x, y) + d(y, z) (triangle inequality)

We use this distance measure to define what we mean by convergence. When we say the sequence of vectors {x_n} converges to x (x_n → x), we mean

\lim_{n \to \infty} \| x_n - x \| = 0

1.12 Hilbert Spaces and Separable Vector Spaces

1.12.1 Hilbert Spaces

Definition 1.1: Hilbert Space
A Hilbert space H is a closed, normed linear vector space which contains all of its limit points: if {x_n} is any sequence of elements in H that converges to x, then x is also contained in H. x is termed the limit point of the sequence.

Example 1.2
Let the space consist of all rational numbers, and let the inner product be simple multiplication: < x, y > = xy. However, the limit point of the sequence x_n = 1 + 1 + 1/2! + ... + 1/n! is not a rational number. Consequently, this space is not a Hilbert space. However, if we define the space to consist of all finite numbers, we have a Hilbert space.

Definition 1.2: orthogonal
If Y is a subspace of H, the vector x is orthogonal to the subspace Y if, for every y ∈ Y, < x, y > = 0.

We now arrive at a fundamental theorem.

Theorem 1.1: Let H be a Hilbert space and Y a subspace of it. Any element x ∈ H has the unique decomposition x = y + z, where y ∈ Y and z is orthogonal to Y. Furthermore,

\| x - y \| = \min_{\nu \in Y} \| x - \nu \|

the distance between x and all elements of Y is minimized by the vector y. This element y is termed the projection of x onto Y.

Geometrically, Y is a line or a plane passing through the origin. Any vector x can be expressed as the linear combination of a vector lying in Y and a vector orthogonal to Y. This theorem is of extreme importance in linear estimation theory and plays a fundamental role in detection theory.
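The projection theorem can be illustrated numerically (an added sketch, not from the original text; the subspace basis B and the vector x are arbitrary choices). In R^3 with the ordinary inner product, the projection of x onto the subspace spanned by the columns of B is the least-squares solution, and the residual is orthogonal to the subspace:

import numpy as np

B = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])                 # columns span the subspace Y
x = np.array([2.0, -1.0, 4.0])

P = B @ np.linalg.inv(B.T @ B) @ B.T       # projection matrix onto Y
y = P @ x                                  # projection of x onto Y
z = x - y                                  # component orthogonal to Y
print(y, B.T @ z)                          # B^T z is numerically zero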


1.12.2 Separable Vector Spaces

Definition 1.3: separable
A Hilbert space H is said to be separable if there exists a set of vectors {φ_i}, i = 1, ..., elements of H, that express every element x ∈ H as

x = \sum_{i=1}^{\infty} x_i \phi_i    (1.40)

where x_i are scalar constants associated with φ_i and x, and where "equality" is taken to mean that the distance between each side becomes zero as more terms are taken on the right:

\lim_{m \to \infty} \left\| x - \sum_{i=1}^{m} x_i \phi_i \right\| = 0

The set of vectors {φ_i} is said to form a complete set if the above relationship is valid. A complete set is said to form a basis for the space H. Usually the elements of the basis for a space are taken to be linearly independent. Linear independence implies that the expression of the zero vector by a basis can only be made by zero coefficients:

\sum_{i=1}^{\infty} x_i \phi_i = 0 \iff x_i = 0 \ \text{for all } i    (1.41)

The representation theorem states simply that separable vector spaces exist. The representation of the vector x is the sequence of coefficients {x_i}.

Example 1.3
The space consisting of column matrices of length N is easily shown to be separable. Let the vector φ_i be given by a column matrix having a one in the ith row and zeros in the remaining rows: φ_i = (0, ..., 0, 1, 0, ..., 0)^T. This set of vectors {φ_i}, i = 1, ..., N, constitutes a basis for the space. Obviously, if the vector x is given by x = (x_1, x_2, ..., x_N)^T, it may be expressed as

x = \sum_{i=1}^{N} x_i \phi_i

using the basis vectors just defined.

In general, the upper limit on the sum in (1.40) is infinite. For the previous example (Example 1.3), the upper limit is finite. The number of basis vectors that is required to express every element of a separable space in terms of (1.40) is said to be the dimension of the space. In this example (Example 1.3), the dimension of the space is N. There exist separable vector spaces for which the dimension is infinite.

Definition 1.4: orthonormal
The basis for a separable vector space is said to be an orthonormal basis if the elements of the basis satisfy the following two properties:

• The inner product between distinct elements of the basis is zero (i.e., the elements of the basis are mutually orthogonal):

\langle \phi_i, \phi_j \rangle = 0, \quad i \ne j    (1.42)

• The norm of each element of a basis is one (normality):

\| \phi_i \| = 1, \quad i = 1, \ldots    (1.43)


For example, the basis given above for the space of N-dimensional column matrices is orthonormal. For clarity, two facts must be explicitly stated. First, not every basis is orthonormal. If the vector space is separable, a complete set of vectors can be found; however, this set does not have to be orthonormal to be a basis. Secondly, not every set of orthonormal vectors can constitute a basis. When the vector space L2 is discussed in detail, this point will be illustrated.

Despite these qualifications, an orthonormal basis exists for every separable vector space. There is an explicit algorithm - the Gram-Schmidt procedure - for deriving an orthonormal set of functions from a complete set. Let {φ_i} denote a basis; the orthonormal basis {ψ_i} is sought. The Gram-Schmidt procedure is:

1. ψ_1 = φ_1 / ‖φ_1‖. This step makes ψ_1 have unit length.
2. ψ'_2 = φ_2 − <ψ_1, φ_2> ψ_1. Consequently, the inner product between ψ'_2 and ψ_1 is zero. We obtain ψ_2 from ψ'_2 by forcing the vector to have unit length.
2'. ψ_2 = ψ'_2 / ‖ψ'_2‖.

The algorithm now generalizes.

k. ψ'_k = φ_k − Σ_{i=1}^{k−1} <ψ_i, φ_k> ψ_i
k'. ψ_k = ψ'_k / ‖ψ'_k‖

By construction, this new set of vectors is an orthonormal set. As the original set of vectors {φ_i} is a complete set, and as each ψ_k is just a linear combination of φ_i, i = 1, ..., k, the derived set {ψ_i} is also complete. Because of the existence of this algorithm, a basis for a vector space is usually assumed to be orthonormal.
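A minimal sketch of the procedure for column vectors with the ordinary inner product (added here for illustration; the input vectors are arbitrary and assumed linearly independent):

import numpy as np

def gram_schmidt(basis):
    """Return an orthonormal set spanning the same space as `basis`."""
    ortho = []
    for phi in basis:
        psi = phi.astype(float)
        for q in ortho:                          # subtract components along earlier psi's
            psi = psi - (q @ phi) * q
        ortho.append(psi / np.linalg.norm(psi))  # normalize to unit length
    return ortho

phis = [np.array([1.0, 1.0, 0.0]),
        np.array([1.0, 0.0, 1.0]),
        np.array([0.0, 1.0, 1.0])]
psis = gram_schmidt(phis)
print(np.round([[a @ b for b in psis] for a in psis], 10))   # Gram matrix is the identity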

A vector's representation with respect to an orthonormal basis {φ_i} is easily computed. The vector x may be expressed by

x = \sum_{i=1}^{\infty} x_i \phi_i    (1.44)

x_i = \langle x, \phi_i \rangle    (1.45)

This formula is easily confirmed by substituting (1.44) into (1.45) and using the properties of an inner product. Note that the exact element values of a given vector's representation depend upon both the vector and the choice of basis. Consequently, a meaningful specification of the representation of a vector must include the definition of the basis.

The mathematical representation of a vector (expressed by equations (1.44) and (1.45)) can be interpreted geometrically. This interpretation is a generalization of the Cartesian representation of numbers. Perpendicular axes are drawn; these axes correspond to the orthonormal basis vectors used in the representation. A given vector is represented as a point in the "plane," with the value of the component along the φ_i axis being x_i.

An important relationship follows from this mathematical representation of vectors. Let x and y be any two vectors in a separable space. These vectors are represented with respect to an orthonormal basis by {x_i} and {y_i}, respectively. The inner product < x, y > is related to these representations by

\langle x, y \rangle = \sum_{i=1}^{\infty} x_i y_i

This result is termed Parseval's Theorem. Consequently, the inner product between any two vectors can be computed from their representations. A special case of this result corresponds to the Cartesian notion of the length of a vector; when x = y, Parseval's relationship becomes

\| x \| = \sqrt{ \sum_{i=1}^{\infty} x_i^2 }


These two relationships are key results of the representation theorem. The implication is that any inner product computed from vectors can also be computed from their representations. There are circumstances in which the latter computation is more manageable than the former and, furthermore, of greater theoretical significance.
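A short numerical confirmation of Parseval's theorem (an added sketch, not from the original text; the orthonormal basis is generated randomly via a QR factorization):

import numpy as np

rng = np.random.default_rng(1)
Q, _ = np.linalg.qr(rng.standard_normal((4, 4)))
basis = Q.T                       # rows are orthonormal basis vectors phi_i

x = rng.standard_normal(4)
y = rng.standard_normal(4)
xi = basis @ x                    # representation x_i = <x, phi_i>
yi = basis @ y

print(np.isclose(x @ y, xi @ yi))                                # Parseval's theorem
print(np.isclose(np.linalg.norm(x), np.sqrt(np.sum(xi ** 2))))   # length of a vector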

1.13 The Vector Space L Squared

Special attention needs to be paid to the vector space L2(T_i, T_f): the collection of functions x(t) which are square-integrable over the interval (T_i, T_f):

\int_{T_i}^{T_f} |x(t)|^2 \, dt < \infty

An inner product can be defined for this space as

\langle x, y \rangle = \int_{T_i}^{T_f} x(t) \, y(t) \, dt    (1.46)

Consistent with this definition, the length of the vector x(t) is given by

\| x \| = \sqrt{ \int_{T_i}^{T_f} |x(t)|^2 \, dt }

Physically, ‖x‖² can be related to the energy contained in the signal over (T_i, T_f). This space is a Hilbert space. If T_i and T_f are both finite, an orthonormal basis is easily found which spans it. For simplicity of notation, let T_i = 0 and T_f = T. The set of functions defined by

\phi_{2i-1}(t) = \left( \frac{2}{T} \right)^{1/2} \cos\!\left( \frac{2\pi (i-1) t}{T} \right)    (1.47)

\phi_{2i}(t) = \left( \frac{2}{T} \right)^{1/2} \sin\!\left( \frac{2\pi i t}{T} \right)

is complete over the interval (0, T) and therefore constitutes a basis for L2(0, T). By demonstrating a basis, we conclude that L2(0, T) is a separable vector space. The representation of functions with respect to this basis corresponds to the well-known Fourier series expansion of a function. As most functions require an infinite number of terms in their Fourier series representation, this space is infinite dimensional.
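As an illustration (an added sketch, not from the original text), the expansion coefficients x_i = <x, φ_i> can be approximated numerically for a particular square-integrable function; here x(t) = e^{-t}, the integrals are replaced by Riemann sums, and the i = 1 (constant) cosine term is scaled to 1/sqrt(T) so that every basis function used has unit norm:

import numpy as np

T, N = 1.0, 4000
dt = T / N
t = np.arange(N) * dt
x = np.exp(-t)                                   # an example element of L^2(0, T)

def phi(k):
    i = (k + 1) // 2
    if k == 1:
        return np.full(N, 1.0 / np.sqrt(T))      # constant term, scaled to unit norm
    if k % 2 == 1:
        return np.sqrt(2.0 / T) * np.cos(2 * np.pi * (i - 1) * t / T)
    return np.sqrt(2.0 / T) * np.sin(2 * np.pi * i * t / T)

K = 41                                           # number of terms retained
coeffs = [np.sum(x * phi(k)) * dt for k in range(1, K + 1)]
x_hat = sum(c * phi(k) for k, c in zip(range(1, K + 1), coeffs))
print(np.sqrt(np.sum((x - x_hat) ** 2) * dt))    # L^2 distance; shrinks as K grows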

There also exist orthonormal sets of functions that do not constitute a basis. For example, consider the set {φ_i(t)} defined by

\phi_i(t) = \begin{cases} \frac{1}{\sqrt{T}} & iT \le t < (i+1)T \\ 0 & \text{otherwise} \end{cases}, \quad i = 0, 1, \ldots    (1.48)

over L2(0, ∞). The members of this set are normal (unit norm) and are mutually orthogonal (no member overlaps with any other). Consequently, this set is an orthonormal set. However, it does not constitute a basis for L2(0, ∞). Functions piecewise constant over intervals of length T are the only members of L2(0, ∞) which can be represented by this set. Other functions, such as e^{-t}u(t), cannot be represented by the φ_i(t) defined above. Consequently, orthonormality of a set of functions does not guarantee completeness.


While L2(0, T) is a separable space, examples can be given in which the representation of a vector in this space is not precisely equal to the vector. More precisely, let x(t) ∈ L2(0, T) and the set {φ_i(t)} be defined by (1.47). The fact that {φ_i(t)} constitutes a basis for the space implies

\left\| x(t) - \sum_{i=1}^{\infty} x_i \phi_i(t) \right\| = 0

where

x_i = \int_0^T x(t) \, \phi_i(t) \, dt

In particular, let x(t) be

x(t) = \begin{cases} 1 & 0 \le t \le T/2 \\ 0 & T/2 < t < T \end{cases}

Obviously, this function is an element of L2(0, T). However, the representation of this function is not equal to 1 at t = T/2. In fact, the peak error never decreases as more terms are taken in the representation. In the special case of the Fourier series, the existence of this "error" is termed the Gibbs phenomenon. However, this "error" has zero norm in L2(0, T); consequently, the Fourier series expansion of this function is equal to the function in the sense that the function and its expansion have zero distance between them. However, one of the axioms of a valid inner product is that (‖e‖ = 0) ⇒ (e = 0). The condition is satisfied, but the conclusion does not seem to be valid. Apparently, valid elements of L2(0, T) can be defined which are nonzero but have zero norm. An example is

e = \begin{cases} 1 & t = T/2 \\ 0 & \text{otherwise} \end{cases}

So as not to destroy the theory, the most common method of resolving the conflict is to weaken the definition of equality. The essence of the problem is that while two vectors x and y can differ from each other and be zero distance apart, the difference between them is "trivial": this difference has zero norm, which, in L2, implies that the magnitude of (x − y) integrates to zero. Consequently, the vectors are essentially equal. This notion of equality is usually written as x = y a.e. (x equals y almost everywhere). With this convention, we have

(\| e \| = 0) \Rightarrow (e = 0 \ \text{a.e.})

Consequently, the error between a vector and its representation is zero almost everywhere.

Weakening the notion of equality in this fashion might seem to compromise the utility of the theory. However, if one suspects that two vectors in an inner product space are equal (e.g., a vector and its representation), it is quite difficult to prove that they are strictly equal (and, as has been seen, this conclusion may not be valid). Usually, proving they are equal almost everywhere is much easier. While this weaker notion of equality does not imply strict equality, one can be assured that any difference between them is insignificant. The measure of "significance" for a vector space is expressed by the definition of the norm for the space.
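The Gibbs phenomenon described above can be seen numerically (an added sketch, not from the original text; the constant basis term is again normalized to 1/sqrt(T)): the peak pointwise error near t = T/2 does not shrink as terms are added, while the L^2 distance does go to zero.

import numpy as np

T, N = 1.0, 8000
dt = T / N
t = np.arange(N) * dt
x = (t <= T / 2).astype(float)                   # the step function above

def phi(k):
    i = (k + 1) // 2
    if k == 1:
        return np.full(N, 1.0 / np.sqrt(T))
    if k % 2 == 1:
        return np.sqrt(2.0 / T) * np.cos(2 * np.pi * (i - 1) * t / T)
    return np.sqrt(2.0 / T) * np.sin(2 * np.pi * i * t / T)

for K in (21, 81, 321):
    x_hat = sum((np.sum(x * phi(k)) * dt) * phi(k) for k in range(1, K + 1))
    err = x - x_hat
    print(K, np.max(np.abs(err)), np.sqrt(np.sum(err ** 2) * dt))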

1.14 A Hilbert Space for Stochastic Processes

The result of primary concern here is the construction of a Hilbert space for stochastic processes. The space consisting of random variables X having a finite mean-square value is (almost) a Hilbert space with inner product E[XY]. Consequently, the distance between two random variables X and Y is

d(X, Y) = \left( E\!\left[ (X - Y)^2 \right] \right)^{1/2}


Now (d(X, Y) = 0) ⇒ (E[(X − Y)²] = 0). However, this does not imply that X = Y; those sets with probability zero appear again. Consequently, we do not have a Hilbert space unless we agree that X = Y means Pr[X = Y] = 1.

Let X(t) be a process with E[X²(t)] < ∞. For each t, X(t) is an element of the Hilbert space just defined. Parametrically, X(t) is therefore regarded as a "curve" in a Hilbert space. This curve is continuous if

\lim_{t \to u} E\!\left[ (X(t) - X(u))^2 \right] = 0

Processes satisfying this condition are said to be continuous in the quadratic mean. The vector space of greatest importance is analogous to L2(T_i, T_f). Consider the collection of real-valued stochastic processes X(t) for which

\int_{T_i}^{T_f} E\!\left[ X^2(t) \right] dt < \infty

Stochastic processes in this collection are easily verified to constitute a linear vector space. Define an inner product for this space as

E\!\left[ \langle X(t), Y(t) \rangle \right] = E\!\left[ \int_{T_i}^{T_f} X(t) \, Y(t) \, dt \right]

While this equation is a valid inner product, the left-hand side will be used to denote the inner product instead of the notation previously defined. We take < X(t), Y(t) > to be the time-domain inner product given in (1.46). In this way, the deterministic portion of the inner product and the expected-value portion are explicitly indicated. This convention allows certain theoretical manipulations to be performed more easily.

One of the more interesting results of the theory of stochastic processes is that the normed vector space for processes previously defined is separable. Consequently, there exists a complete (and, by assumption, orthonormal) set {φ_i(t)}, i = 1, ..., of deterministic (nonrandom) functions which constitutes a basis. A process in the space of stochastic processes can be represented as

X(t) = \sum_{i=1}^{\infty} X_i \phi_i(t), \quad T_i \le t \le T_f    (1.49)

where {X_i}, the representation of X(t), is a sequence of random variables given by

X_i = \langle X(t), \phi_i(t) \rangle \quad \text{or} \quad X_i = \int_{T_i}^{T_f} X(t) \, \phi_i(t) \, dt

Strict equality between a process and its representation cannot be assured. Not only does the analogous issue in L2(0, T) occur with respect to representing individual sample functions, but also sample functions assigned a zero probability of occurrence can be troublesome. In fact, the ensemble of any stochastic process can be augmented by a set of sample functions that are not well-behaved (e.g., a sequence of pulses) but have probability zero. In a practical sense, this augmentation is trivial: such members of the process cannot occur. Therefore, one says that two processes X(t) and Y(t) are equal almost everywhere if the distance ‖X(t) − Y(t)‖ between them is zero. The implication is that any lack of strict equality between the processes (strict equality means the processes match on a sample-function-by-sample-function basis) is "trivial."
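To emphasize that the representation {X_i} is a sequence of random variables, the following sketch (added here, not part of the original text; the process is a discretized Wiener-like process and the basis is the sinusoidal set of (1.47), again with the constant term normalized) computes the first few coefficients for several sample functions; they differ from trial to trial:

import numpy as np

rng = np.random.default_rng(2)
T, N = 1.0, 2000
dt = T / N
t = np.arange(N) * dt

def phi(k):
    i = (k + 1) // 2
    if k == 1:
        return np.full(N, 1.0 / np.sqrt(T))
    if k % 2 == 1:
        return np.sqrt(2.0 / T) * np.cos(2 * np.pi * (i - 1) * t / T)
    return np.sqrt(2.0 / T) * np.sin(2 * np.pi * i * t / T)

for trial in range(3):
    X = np.cumsum(rng.standard_normal(N)) * np.sqrt(dt)    # one sample function
    Xi = [np.sum(X * phi(k)) * dt for k in range(1, 5)]    # X_i = <X, phi_i>
    print(np.round(Xi, 3))                                 # different each trial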

1.15 Karhunen-Loeve Expansion

The representation of the process X(t) is the sequence of random variables {X_i}. The choice of the basis {φ_i(t)} is unrestricted. Of particular interest is to restrict the basis functions to those which make the X_i uncorrelated random variables. When this requirement is satisfied, the resulting representation of X(t) is termed the Karhunen-Loève expansion. Mathematically, we require E[X_i X_j] = E[X_i] E[X_j] for all i ≠ j. This requirement can be expressed in terms of the correlation function of X(t).

E[X_i X_j] = E\!\left[ \int_0^T X(\alpha) \phi_i(\alpha) \, d\alpha \int_0^T X(\beta) \phi_j(\beta) \, d\beta \right] = \int_0^T \!\! \int_0^T \phi_i(\alpha) \phi_j(\beta) R_X(\alpha, \beta) \, d\alpha \, d\beta

As E[X_i] is given by

E[X_i] = \int_0^T m_X(\alpha) \phi_i(\alpha) \, d\alpha

our requirement becomes, for all i ≠ j,

\int_0^T \!\! \int_0^T \phi_i(\alpha) \phi_j(\beta) R_X(\alpha, \beta) \, d\alpha \, d\beta = \int_0^T m_X(\alpha) \phi_i(\alpha) \, d\alpha \int_0^T m_X(\beta) \phi_j(\beta) \, d\beta    (1.50)

Simple manipulations result in the expression, for all i ≠ j,

\int_0^T \phi_i(\alpha) \int_0^T K_X(\alpha, \beta) \phi_j(\beta) \, d\beta \, d\alpha = 0    (1.51)

When i = j, the quantity E[X_i²] − (E[X_i])² is just the variance of X_i. Our requirement is obtained by satisfying

\int_0^T \phi_i(\alpha) \int_0^T K_X(\alpha, \beta) \phi_j(\beta) \, d\beta \, d\alpha = \lambda_i \delta_{ij}

or, for all i ≠ j,

\int_0^T \phi_i(\alpha) \, g_j(\alpha) \, d\alpha = 0    (1.52)

where

g_j(\alpha) = \int_0^T K_X(\alpha, \beta) \phi_j(\beta) \, d\beta

Furthermore, this requirement must hold for each j which differs from the choice of i. A choice of a function g_j(α) satisfying this requirement is a function which is proportional to φ_j(α): g_j(α) = λ_j φ_j(α). Therefore,

\int_0^T K_X(\alpha, \beta) \phi_j(\beta) \, d\beta = \lambda_j \phi_j(\alpha)    (1.53)

The {φ_i} which allow the representation of X(t) to be a sequence of uncorrelated random variables must satisfy this integral equation. This type of equation occurs often in applied mathematics; it is termed the eigenequation. The sequences {φ_i} and {λ_i} are the eigenfunctions and eigenvalues of K_X(α, β), the covariance function of X(t). It is easily verified that

K_X(t, u) = \sum_{i=1}^{\infty} \lambda_i \phi_i(t) \phi_i(u)

This result is termed Mercer's Theorem.

The approach to solving for the eigenfunctions and eigenvalues of K_X(t, u) is to convert the integral equation into an ordinary differential equation which can be solved. This approach is best illustrated by an example.


Example 1.4
K_X(t, u) = σ² min{t, u}. The eigenequation can be written in this case as

\sigma^2 \left( \int_0^t u \, \phi(u) \, du + t \int_t^T \phi(u) \, du \right) = \lambda \phi(t)

Evaluating the first derivative of this expression,

\sigma^2 t \phi(t) + \sigma^2 \int_t^T \phi(u) \, du - \sigma^2 t \phi(t) = \lambda \frac{d\phi(t)}{dt}

or

\sigma^2 \int_t^T \phi(u) \, du = \lambda \frac{d\phi}{dt}

Evaluating the derivative of the last expression yields the simple equation

-\sigma^2 \phi(t) = \lambda \frac{d^2\phi}{dt^2}

This equation has a general solution of the form φ(t) = A sin(σt/√λ) + B cos(σt/√λ). It is easily seen that B must be zero. The amplitude A is found by requiring ‖φ‖ = 1. To find λ, one must return to the original integral equation. Substituting, we have

\sigma^2 A \int_0^t u \sin\!\left(\frac{\sigma u}{\sqrt{\lambda}}\right) du + \sigma^2 t A \int_t^T \sin\!\left(\frac{\sigma u}{\sqrt{\lambda}}\right) du = \lambda A \sin\!\left(\frac{\sigma t}{\sqrt{\lambda}}\right)

After some manipulation, we find that, for all t ∈ [0, T),

A \lambda \sin\!\left(\frac{\sigma t}{\sqrt{\lambda}}\right) - A \sigma t \sqrt{\lambda} \cos\!\left(\frac{\sigma T}{\sqrt{\lambda}}\right) = \lambda A \sin\!\left(\frac{\sigma t}{\sqrt{\lambda}}\right)

or

A \sigma t \sqrt{\lambda} \cos\!\left(\frac{\sigma T}{\sqrt{\lambda}}\right) = 0, \quad t \in [0, T)

Therefore, σT/√λ = (n − 1/2)π for n = 1, 2, ..., and we have

\lambda_n = \frac{\sigma^2 T^2}{(n - 1/2)^2 \pi^2}, \qquad \phi_n(t) = \left(\frac{2}{T}\right)^{1/2} \sin\!\left(\frac{(n - 1/2)\pi t}{T}\right)

The Karhunen-Loève expansion has several important properties.

• The eigenfunctions of a positive-definite covariance function constitute a complete set. One can easily show that these eigenfunctions are also mutually orthogonal with respect to both the usual inner product and the inner product derived from the covariance function.
• If X(t) is Gaussian, the X_i are Gaussian random variables. As the random variables X_i are uncorrelated and Gaussian, the X_i comprise a sequence of statistically independent random variables.
• Assume K_X(t, u) = (N_0/2) δ(t − u): the stochastic process X(t) is white. Then

\int \frac{N_0}{2} \delta(t - u) \, \phi(u) \, du = \lambda \phi(t)

for all φ(t). Consequently, if λ_i = N_0/2, this constraint equation is satisfied no matter what choice is made for the orthonormal set {φ_i(t)}. Therefore, the representation of white, Gaussian processes consists of a sequence of statistically independent, identically distributed (mean zero and variance N_0/2) Gaussian random variables. This example constitutes the simplest case of the Karhunen-Loève expansion.
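The eigenequation (1.53) can also be solved numerically by discretizing the covariance function on a grid and eigendecomposing the resulting matrix. The following sketch (added here, not part of the original text) does this for Example 1.4 with sigma = T = 1; the leading eigenvalues approach the analytic values sigma^2 T^2 / ((n - 1/2)^2 pi^2):

import numpy as np

sigma, T, N = 1.0, 1.0, 800
dt = T / N
t = (np.arange(N) + 0.5) * dt                 # midpoint grid on (0, T)
K = sigma ** 2 * np.minimum.outer(t, t)       # discretized K_X(t, u) = sigma^2 min(t, u)

# The integral operator phi -> int K(t, u) phi(u) du becomes the matrix K * dt.
evals = np.linalg.eigh(K * dt)[0][::-1]       # eigenvalues, largest first

analytic = [sigma ** 2 * T ** 2 / ((n - 0.5) ** 2 * np.pi ** 2) for n in range(1, 5)]
print(np.round(evals[:4], 5))
print(np.round(analytic, 5))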


1.16 Probability and Stochastic Processes: Problems

Exercise 1.16.1
Joe is an astronaut for project Pluto. The mission success or failure depends only on the behavior of three major systems. Joe feels that the following assumptions are valid and apply to the performance of the entire mission:

• The mission is a failure only if two or more major systems fail.
• System I, the Gronk system, fails with probability 0.1.
• System II, the Frab system, fails with probability 0.5 if at least one other system fails. If no other system fails, the probability the Frab system fails is 0.1.
• System III, the beer cooler (obviously, the most important), fails with probability 0.5 if the Gronk system fails. Otherwise the beer cooler cannot fail.

1.16.1

What is the probability that the mission succeeds but that the beer cooler fails?

1.16.2

What is the probability that all three systems fail?

1.16.3

Given that more than one system failed, determine the probability that:

1.16.3.1

The Gronk did not fail.

1.16.3.2

The beer cooler failed.

1.16.3.3

Both the Gronk and the Frab failed.

1.16.4

About the time Joe was due back on Earth, you overhear a radio broadcast about Joe while watching the Muppet Show. You are not positive what the radio announcer said, but you decide that it is twice as likely that you heard "mission a success" as opposed to "mission a failure". What is the probability that the Gronk failed?

Exercise 1.16.2
The random variables X_1 and X_2 have the joint pdf

p_{X_1, X_2}(x_1, x_2) = \frac{1}{\pi^2} \, \frac{b_1 b_2}{(b_1^2 + x_1^2)(b_2^2 + x_2^2)}, \qquad b_1, b_2 > 0    (1.54)


1.16.1

Show that X1 and X2 are statistically independent random variables with Cauchy density functions.

1.16.2

Show that ΦX1 (ju) = e−(b1|u|).

1.16.3

Define Y = X_1 + X_2. Determine p_Y(y).

1.16.4

Let {Z_i} be a set of N statistically independent Cauchy random variables with b_i = b, i ∈ {1, 2, ..., N}. Define

Z = \frac{1}{N} \sum_{i=1}^{N} Z_i

Determine p_Z(z). Is Z - the sample mean - a good estimate of the expected value E[Z_i]?

Exercise 1.16.3
Let X_1, ..., X_N be independent, identically distributed random variables. The density of each random variable is p_X(x). The order statistics X_0(1), ..., X_0(N) of this set of random variables is the set that results when the original one is ordered (sorted):

X0 (1) ≤ X0 (2) ≤ · · · ≤ X0 (N)

1.16.1

What is the joint density of the original set of random variables?

1.16.2

What is the density of X0 (N), the largest of the set?

1.16.3

Show that the joint density of the ordered random variables is

p_{X_0(1), \ldots, X_0(N)}(x_1, \ldots, x_N) = N! \, p_X(x_1) \cdots p_X(x_N)

1.16.4

Consider a Poisson process having intensity λ(t). N events are observed to occur in the interval [0, T). Show that the joint density of the times of occurrence W_1, ..., W_N is the same as the order statistics of a set of random variables. Identify the common density of these random variables.

Exercise 1.16.4
A crucial skill in developing simulations of communication systems is random variable generation. Most computers (and environments like Matlab) have software that generates statistically independent, uniformly distributed random sequences. In Matlab, the function is rand. We want to change the probability distribution to one required by the problem at hand. One technique is known as the distribution method.


1.16.1

If P_X(x) is the desired distribution, show that U = P_X(X) (applying the distribution function to a random variable having that distribution) is uniformly distributed over [0, 1). This result means that X = P_X^{-1}(U) has the desired distribution. Consequently, to generate a random variable having any distribution we want, we only need the inverse function of the distribution function.
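For concreteness, here is a small sketch of the distribution method (added for illustration, not part of the exercise, and using an exponential distribution rather than the ones asked for below): P_X(x) = 1 - e^{-x} for x >= 0, so P_X^{-1}(u) = -ln(1 - u).

import numpy as np

rng = np.random.default_rng(3)
u = rng.uniform(size=100000)       # uniform on [0, 1), like rand
x = -np.log(1.0 - u)               # X = P_X^{-1}(U) is exponentially distributed
print(x.mean(), x.var())           # both are approximately 1 for this example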

1.16.2

Write a Matlab function that simulates random throws of two dice. It should have the number of throws as its argument and return a vector containing the results. Use the Matlab function hist to plot the distribution of the sum.

1.16.3

Write Matlab functions that generate (i) Laplacian and (ii) Cauchy random variables. Again use hist to plot the distribution function.

Exercise 1.16.5
Determine the mean and correlation function of each of the processes X_t defined below.

1.16.1

X_t is defined by the following equally likely sample functions:

X_t(\omega_1) = 1, \quad X_t(\omega_2) = -2, \quad X_t(\omega_3) = \sin(\pi t), \quad X_t(\omega_4) = \cos(\pi t)

1.16.2

X_t is defined by X_t = cos(At + θ), where A and θ are statistically independent random variables. θ is uniformly distributed over [0, 2π) and A has the density function

p_A(A) = \frac{1}{\pi (1 + A^2)}

Exercise 1.16.6
The joint density of the amplitudes of a stochastic process X_t at the specific times t = t_1 and t = t_2 (t_2 > t_1) is found to be

p_{X_{t_1}, X_{t_2}}(x_1, x_2) = \begin{cases} \text{constant} & x_1 > x_2, \; 0 < x_1, x_2 < 1 \\ 0 & \text{otherwise} \end{cases}

This joint density is found to be a valid joint density for X_t and X_u when |t − u| = |t_2 − t_1|.

1.16.1

Find the correlation function RX (t, u) at the times t = t1 and u = t2.

1.16.2

Find the expected value of Xt for all t.


1.16.3

Is this process wide-sense stationary?

Exercise 1.16.7

1.16.1

Which of the following are valid correlation functions? Indicate your reasoning.

1. R_X(τ) = e^{−|τ|} − e^{−2|τ|}
2. R_X(τ) = 5 sin(1000τ)/τ
3. R_X(τ) = 1 − |τ|/T if |τ| ≤ T, and 0 otherwise
4. R_X(τ) = 1 if |τ| ≤ T, and 0 otherwise
5. R_X(τ) = δ(τ) + 25
6. R_X(τ) = δ(τ + 1) + δ(τ) + δ(τ − 1)

1.16.2

Which of the following are valid power density spectra? Indicate your reasoning.

1. S_X(f) = sin(πf)/(πf)
2. S_X(f) = (sin(πf)/(πf))²
3. S_X(f) = e^{−(f − f_0)²/4}
4. S_X(f) = e^{−|f|} − e^{−2|f|}
5. S_X(f) = 1 + 0.25 e^{−i2πf}
6. S_X(f) = 1 if |f| ≤ 1/T, and 0 otherwise

Exercise 1.16.8
Let {X_l} denote a sequence of independent, identically distributed random variables. This sequence serves as the input to a discrete-time system having an input-output relationship given by the difference equation

Yl = aYl−1 +Xl

1.16.1

If X_l ∼ N(0, σ²), find the probability density function of each element of the output sequence {Y_l}.

1.16.2

Show that |Φ_{X_l}(iu)| ≤ Φ_{X_l}(i0) for all choices of u, no matter what the amplitude distribution of X_l may be.


1.16.3

If X_l is non-Gaussian, the computation of the probability density of Y_l can be difficult. On the other hand, if the density of Y_l is known, the density of X_l can be found. How is the characteristic function of X_l related to the characteristic function of Y_l?

1.16.4

Show that if Y_l is uniformly distributed over [−1, 1), the only allowed values of the parameter a are those equalling 1/m, m = ±2, ±3, ±4, ....

Exercise 1.16.9

1.16.1

Show that a necessary and sufficient condition for the stochastic process X_t defined by X_t = cos(2πf_0 t + θ) to be wide-sense stationary is that the characteristic function Φ_θ(iu) satisfy

\Phi_\theta(i1) = 0 = \Phi_\theta(i2)

note: f0 is a constant and is not a random variable.

1.16.2

Let X_t = A cos(2πf_0 t) + B sin(2πf_0 t), where A and B are random variables and f_0 is a constant. Find necessary and sufficient conditions for X_t to be wide-sense stationary.

1.16.3

Consider the process defined by X_t = A cos(2πf_0 t), where A is a random variable and f_0 is a constant. Under what conditions is X_t wide-sense stationary?

Exercise 1.16.10

1.16.1

Show that correlation and covariance functions have the following properties:

1. R_X(t, u) = R_X(u, t)
2. R_X(τ) = R_X(−τ)
3. K_X²(t, u) ≤ K_X(t, t) K_X(u, u)
4. |K_X(t, u)| ≤ (1/2)(K_X(t, t) + K_X(u, u))
5. |R_X(τ)| ≤ R_X(0)

1.16.2

Let X_t be a wide-sense stationary random process. If s(t) is a deterministic function and we define Y_t = X_t + s(t), what are the expected value and correlation function of Y_t?


1.16.3

Which of the following are valid correlation functions? Justify your answer.

1. R_X(t, u) = e^{−(t − u)²}
2. R_X(t, u) = σ² max{t, u}
3. R_X(t, u) = e^{−(t − u)}
4. R_X(t, u) = cos(t) cos(u)

Exercise 1.16.11
It is desired to generate a wide-sense stationary process with correlation function

RX (τ) = e−|τ |

Two methods are proposed.

1. Let X_t = A cos(2πFt + θ), where A, F, and θ are statistically independent random variables.
2. Define X_t by

X_t = \int_0^{\infty} h(\alpha) \, N_{t-\alpha} \, d\alpha

where N_t is white and h(t) is the impulse response of the appropriate filter.

1.16.1

Find at least two impulse responses h (t) that will work in method 2.

1.16.2

Specify the densities for A, F , and θ in method 1 that yield the desired results.

1.16.3

Sketch sample functions generated by each method. Interpret your results. What are the technical differences between these processes?

Exercise 1.16.12
Let the stochastic process X_t be defined by

X_t = A \cos(2\pi F t + \theta)

where A, F, and θ are statistically independent random variables. The random variable θ is uniformly distributed over the interval [−π, π). The densities of the other random variables are to be determined.

1.16.1

Show that Xt is a wide-sense stationary process.

1.16.2

Is Xt strict-sense stationary? Why or why not?


1.16.3

The inventor of this process claims that X_t can have any correlation function one desires by manipulating the densities of A and F. Demonstrate the validity of this result and the requirements these densities must satisfy.

1.16.4

The inventor also claims that X_t can have any first-order density one desires so long as the desired density is bounded. Show that this claim is also valid. Furthermore, show that the requirements placed on the densities of A and F are consistent with those found in the previous part (Section 1.16.3).

1.16.5

Could this approach be fruitfully used in simulations to emulate a process having a specific correlation function and first-order density? In other words, would statistics computed from the results of the simulation be meaningful? Why or why not?

Exercise 1.16.13
Let X_t be a Gaussian random process with mean m_X(t) and covariance function K_X(t, u). The process is passed through the system depicted in Figure 1.3.

Figure 1.3

1.16.1

Is Yt a Gaussian process? If so, compute the pdf of Yt.

1.16.2

What are the mean and covariance functions of Yt?

1.16.3

If Xt is stationary, is Yt stationary?


1.16.4

Compute the cross-correlation function between Xt and Yt.

Exercise 1.16.14
A stochastic process X_t is defined to be

X_t = \cos(2\pi F t + \theta)

where F and θ are statistically independent random variables. The quantity θ is uniformly distributed over [−π, π) and F can assume one of the values 1, 2, or 3 with equal probability.

1.16.1

Compute the mean and correlation function of X_t. This process serves as the input to the system in Figure 1.4.

Figure 1.4

note: The signals are multiplied, not summed, at the node located just before the integrator. Y_t and Z_t are related by

Y_t = \int_{t-1}^{t} Z_\alpha \, d\alpha

1.16.2

Is the process Zt wide-sense stationary?

1.16.3

What is the expected value and correlation function of Yt?


Exercise 1.16.15
The white process X_t serves as the input to a system having output Y_t. The input-output relationship of this system is determined by the differential equation

Yt′ + 2Yt = Xt

1.16.1

Find the mean and correlation function of Yt.

1.16.2

Compute the cross-correlation function between Xt and Yt.

1.16.3

Show that the correlation function of Y_t obeys a homogeneous version of the differential equation governing the system for positive values of τ:

\frac{dR_Y(\tau)}{d\tau} + 2 R_Y(\tau) = 0, \quad \tau > 0    (1.55)

Do not use your answer from the part above (Section 1.16.1) to work this part. Rather, show the validity of this result in a more general fashion.

Exercise 1.16.16
Let X_t be a wide-sense stationary stochastic process. Let X_t' denote the derivative of X_t.

1.16.1

Compute the expected value and correlation function of X_t' in terms of the expected value and correlation function of X_t.

1.16.2

Under what conditions are X_t' and X_t orthogonal? In other words, when does < X', X > = 0, where < X', X > = E[X_t' X_t]?

1.16.3

Compute the mean and correlation function of Yt = Xt −Xt′.

1.16.4

The bandwidth of the process X_t can be defined by

B_X^2 = \frac{ \int_{-\infty}^{\infty} f^2 S_X(f) \, df }{ \int_{-\infty}^{\infty} S_X(f) \, df }

Express this definition in terms of the mean and correlation functions of X_t and X_t'.


1.16.5

The statistic U is used to count the average number of excursions of the stochastic process X_t across the level X_t = A in the interval [0, T]. One form of this statistic is

U = \frac{1}{T} \int_0^T \left| \frac{d \, u(X_t - A)}{dt} \right| dt

where u(·) denotes the unit step function. Find the expected value of U, using in your final expression the formula for B_X. Assume that the conditions found in the part above (Section 1.16.2) are met and that X_t is a Gaussian process.

Exercise 1.16.17
Let N_t be a Poisson process with intensity λ(t).

1.16.1

What are the expected value and variance of the number of events occurring in the time interval [t, u)?

1.16.2

Under what conditions is Nt a stationary, independent increment process?

1.16.3

Compute RN (t, u).

1.16.4

Assume that λ(t) = λ_0, a constant. What is the conditional density p_{W_n | W_{n-1}}(w_n | w_{n-1})? From this relationship, find the density of τ_n, the time interval between W_n and W_{n-1}.

Exercise 1.16.18
A stochastic process X_t is said to have stationary, independent increments if, for t_1 < t_2 < t_3 < t_4:

• The random variable X_{t_2} − X_{t_1} is statistically independent of the random variable X_{t_4} − X_{t_3}.
• The pdf of X_{t_2} − X_{t_1} is equal to the pdf of X_{t_2+T} − X_{t_1+T} for all t_1, t_2, T.

The process is identically equal to zero at t = 0 (Pr[X_0 = 0] = 1).

1.16.1

What is the expected value and variance of Xt1+t2?

note: Write Xt1+t2 = (Xt1+t2 −Xt1) + (Xt1 −X0).

1.16.2

Using the result of the previous part (Section 1.16.1), find expressions for E[X_t] and σ(X_t)².


1.16.3

Define Φ_{X_t}(iu) to be the characteristic function of the first-order density of the process X_t. Show that the characteristic function must be of the form

\Phi_{X_t}(iu) = e^{t f(u)}

where f(u) is a conjugate-symmetric function of u.

1.16.4

Compute KX (t, u).

1.16.5

The process X_t is passed through a linear, time-invariant filter having the transfer function H(f) = i2πf. Letting Y_t denote the output, determine K_Y(t, u).

Exercise 1.16.19
In optical communication systems, a photomultiplier tube is used to convert the arrival of photons into electric pulses so that each arrival can be counted by other electronics. Being overly clever, a clever Rice engineer bought a photomultiplier tube from AGGIE PMT, Inc. The AGGIE PMT device is unreliable. When it is working, each photon is properly converted to an electric pulse. When not working, it has "dead-time" effects: the conversion of a photon arrival blocks out the conversion of the next photon. After a photon arrival has been missed, the device converts the next arrival properly. To detect whether the Aggie device is working properly or not, the clever Rice engineer decides to use results from Detection Theory (ELEC 530) to help him. A calibrated light source is used to give an average arrival rate of λ photons/sec on the surface of the photomultiplier tube. Photon arrivals are described by a Poisson process.

1.16.1

Find the density of the time between electric pulses if the AGGIE device has these dead-time effects.

1.16.2

Can the times of occurrence of electric pulses be well described by a Poisson process when the dead-time effects are present? If so, find the parameters of the process; if not, state why.

1.16.3

Assuming the device is as likely to not be working as it is to be working, find a procedure to determine its mode of operation based on the observation of the time between two successive electric pulses.

Exercise 1.16.20
In this problem, assume that the process X_t has stationary, independent increments.

1.16.1

Define Y_t to be

Y_t = g(t) \, X_{h(t)/g(t)}

where g(t) and h(t) are deterministic functions and h(t)/g(t) is a strictly increasing function. Find the mean and covariance functions of Y_t.


1.16.2

A stochastic process X_t is said to be a Markov process if, for t_1 < t_2 < ... < t_{n-1} < t_n, the conditional density of X_{t_n} satisfies

p_{X_{t_n} | X_{t_1}, \ldots, X_{t_{n-1}}}(X_n | X_1, \ldots, X_{n-1}) = p_{X_{t_n} | X_{t_{n-1}}}(X_n | X_{n-1})

Show that all independent-increment processes are Markov processes.

Exercise 1.16.21
Let X_t be a stochastic process with correlation function R_X(t, u). X_t is said to be mean-square continuous if

\lim_{t \to u} E\!\left[ (X_t - X_u)^2 \right] = 0 \quad \text{for all } t, u    (1.56)

1.16.1

Show that X_t is mean-square continuous if and only if the correlation function R_X(t, u) is continuous at t = u.

1.16.2

Show that if RX (t, u) is continuous at t = u, it is continuous for all t and u.

1.16.3

Show that a zero-mean, independent-increment process with stationary increments is mean-square continuous.

1.16.4

Show that a stationary Poisson process is mean-square continuous.

note: This process has no continuous sample functions, but is continuous in the mean-square sense.

Exercise 1.16.22: The Bispectrum
The idea of a correlation function can be extended to higher-order moments. For example, the third-order "correlation" function R_X^{(3)}(t_1, t_2, t_3) of a random process X_t is defined to be

R_X^{(3)}(t_1, t_2, t_3) = E[X_{t_1} X_{t_2} X_{t_3}]

1.16.1

Show that if X_t is strict-sense stationary, then the third-order correlation function depends only on the time differences t_2 − t_1 and t_3 − t_1.

1.16.2

Find the third-order correlation function of X_t = A cos(2πf_0 t + Θ), where Θ ∼ U([−π, π)) and A, f_0 are constants.


1.16.3

Let Z_t = X_t + Y_t, where X_t is Gaussian, Y_t is non-Gaussian, and X_t, Y_t are statistically independent, zero-mean processes. Find the third-order correlation function of Z_t.

Exercise 1.16.23
A process X_t is said to be a martingale if it satisfies the relationship

E[X_t \mid X_\alpha, \, \alpha \le u] = X_u, \quad u \le t    (1.57)

In words, the expected value of a martingale at time t, given all values of the process that have been observed up to time u (u ≤ t), is equal to the most recently observed value of the process (X_u).

1.16.1

Show that the mean of a martingale is a constant.

1.16.2

Show that all zero-mean, stationary, independent-increment processes are martingales. Are all martingales independent-increment processes?

1.16.3

If X_t is a zero-mean martingale having a possibly time-varying variance σ²(t), show that its correlation function is given by

R_X(t, u) = \sigma^2(\min\{t, u\})

Exercise 1.16.24
Shot noise is noise measured in vacuum tube circuits which is due to the spontaneous emission of electrons from each tube's cathode. The electron emission is assumed to be described as a stationary Poisson process of intensity λ. The impact of each electron on a tube's anode causes a current to flow in the attached circuit equal to the impulse response of the circuit. Thus, shot noise is often modeled as a sequence of impulses (whose times of occurrence are a Poisson process) passing through a linear, time-invariant system.

1.16.1

What is the correlation function of Xt?

note: Relate X_t to the counting process N_t.

1.16.2

Show that for any wide-sense stationary process X_t for which lim_{τ→∞} R_X(τ) = 0, the mean of X_t is zero. Use this result to show that if lim_{τ→∞} R_X(τ) exists, the value of the limit equals the square of the mean of the process.

1.16.3

Find the power density spectrum of the shot noise process Y_t.


1.16.4

Evaluate the mean and variance of Yt.

Exercise 1.16.25
Noise reduction filters are used to reduce, as much as possible, the noise component of a noise-corrupted signal. Let the signal of interest be described as a wide-sense stationary process X_t. The observed signal is given by Y_t = X_t + N_t, where N_t is a process modeling the noise that is statistically independent of the signal.

1.16.1

Assuming that the noise is white, find a relationship that the transfer function of the filter must satisfy to maximize the signal-to-noise ratio (i.e., the ratio of the signal power to the noise power) in the filtered output.

1.16.2

Compute the resulting signal-to-noise ratio when the signal correlation function is

RX (τ) = σ2e−(a|τ |)

Exercise 1.16.26
In practice, one often wants to measure the power density of a stochastic process. For the purposes of this problem, assume the process X_t is wide-sense stationary, zero mean, and Gaussian. This measurement system is proposed (Figure 1.5).


Figure 1.5 (panels (a), (b), and (c))

where H_1(f) is the transfer function of an ideal bandpass filter and H_2(f) is an ideal lowpass filter. Assume that ∆f is small compared to the range of frequencies over which S_X(f) varies.


1.16.1

Find the mean and correlation function of Yt2 in terms of the second-order statistics of Xt.

1.16.2

Compute the power density spectrum of the process Zt.

1.16.3

Compute the expected value of Zt.

1.16.4

By considering the variance of Z_t, comment on the accuracy of this measurement of the power density of the process X_t.

Exercise 1.16.27
A student in freshman chemistry lab is frustrated; an experiment is not going well and limited time is available to perform the experiment again. The outcome of each experiment is a Gaussian random variable X having a mean equal to the value being sought and variance σ². The student decides to average his experimental outcomes:

Y_l = \frac{1}{l} \sum_{k=1}^{l} X_k, \quad l = 1, 2, \ldots    (1.58)

Each outcome X_i is uncorrelated with all other outcomes X_j, j ≠ i.

1.16.1

Find the mean and correlation function of the stochastic sequence Yl.

1.16.2

Is Yl stationary? Indicate your reasoning.

1.16.3

How large must l be to ensure that the probability that the relative error |Y_l − E[Y_l]| / σ is less than 0.1 equals 0.95?

Exercise 1.16.28
The price of a certain stock can fluctuate during the day while the true value is rising or falling. To facilitate financial decisions, a Wall Street broker decides to use stochastic process theory. The price P_t of a stock is described by

P_t = Kt + N_t, \quad 0 \le t < 1    (1.59)

where K is the constant our knowledgeable broker is seeking and N_t is a stochastic process describing the random fluctuations. N_t is a white, Gaussian process having spectral height N_0/2. The broker decides to estimate K according to

\hat{K} = \int_0^1 P_t \, g(t) \, dt

where the best function g (t) is to be found.


1.16.1

Find the probability density function of the estimate \hat{K} for any g(t) the broker might choose.

1.16.2

A simple-minded estimate of K is to use simple averaging (i.e., set g(t) = constant). Find the value of this constant which results in E[\hat{K}] = K. What is the resulting percentage error as expressed by

\frac{ \sqrt{ \sigma^2(\hat{K}) } }{ | E[\hat{K}] | }

1.16.3

Find the g(t) which minimizes the percentage error and yields E[\hat{K}] = K. How much better is this optimum choice than simple averaging?

Exercise 1.16.29
To determine the presence or absence of a constant voltage measured in the presence of additive, white Gaussian noise (spectral height N_0/2), an engineer decides to compute the average \bar{V} of the measured voltage V_t:

\bar{V} = \frac{1}{T} \int_0^T V_t \, dt

The value of the constant voltage, if present, is V_0. The presence and absence of the voltage are equally likely to occur.

1.16.1

Derive a good method by which the engineer can use the average to determine the presence or absence of the constant voltage.

1.16.2

Determine the probability that the voltage is present when the engineer's method announces it is.

1.16.3

The engineer decides to improve the performance of his technique by computing the more complicated quantity V given by

V = \int_0^T f(t) \, V_t \, dt

What function f(t) maximizes the probability found above (Section 1.16.2)?

Exercise 1.16.30
Let X_t be a stationary, zero-mean random process that serves as the input to three linear, time-invariant filters. The power density spectrum of X_t is S_X(f) = N_0/2. The impulse responses of the filters are

h_1(t) = \begin{cases} 1 & 0 \le t \le 1 \\ 0 & \text{otherwise} \end{cases}

h_2(t) = \begin{cases} 2e^{-t} & t \ge 0 \\ 0 & \text{otherwise} \end{cases}

h_3(t) = \begin{cases} 2\sin(2\pi t) & 0 \le t \le 2 \\ 0 & \text{otherwise} \end{cases}

The output of filter i is denoted by Y_i(t).

1.16.1

Compute E[Y_i(t)] and E[Y_i^2(t)] for i = 1, 2, 3.

1.16.2

Compute RXY2 (t, u). Interpret your result.

1.16.3

Is there any pair of processes for which E [Yi (t)Yj (t)] = 0 for all t?

1.16.4

Is there any pair of processes for which E [Yi (t)Yj (u)] = 0 for all t and u?

Exercise 1.16.31
An impulse is associated with the occurrence of each event in a stationary Poisson process. This derived process serves as the input to a linear, time-invariant filter having transfer function H(f), which is given by

H(f) = 1 - e^{-i\pi f T} + e^{-i 2\pi f T}

where T is a constant.

1.16.1

What are the mean and covariance function of the input to the filter?

1.16.2

What are the mean and covariance function of the output of the filter?

1.16.3

Now let the filter have any impulse response that has duration T (i.e., h(t) = 0 for t < 0 and for t > T). Find the impulse response that yields the smallest possible coefficient of variation v(t). The coefficient of variation is a measure of the percentage variation of a positive-valued process. It is defined to be the ratio of the standard deviation of the process at time t to the mean at time t.

Exercise 1.16.32
Do the following classes of stochastic processes constitute a linear vector space? If so, indicate the proof; if not, show why not.

1. All stochastic processes.
2. All wide-sense stationary processes.
3. All nonstationary stochastic processes.

Exercise 1.16.33
Show that the inner product of two vectors satisfies the following relationships.


1.16.1

| < x, y > | ≤‖ x ‖‖ y ‖, the Schwarz inequality.

1.16.2

‖ x+ y ‖≤‖ x ‖ + ‖ y ‖, the triangle inequality.

1.16.3

(‖ x+ y ‖)2 + (‖ x− y ‖)2 = 2(‖ x ‖)2 + 2(‖ y ‖)2, the parallelogram equality.

1.16.4

Find an expression for the inner product < x, y > using norms only.

Exercise 1.16.34
Let x and y be elements of a normed, linear vector space.

1.16.1

Determine whether the following are valid inner products for the indicated space.

1. < x, y > = x^T A y, where A is a nonsingular N × N matrix and x, y are elements of the space of N-dimensional column matrices.
2. < x, y > = x y^T, where x, y are elements of the space of N-dimensional column matrices.
3. < x, y > = \int_0^T x(t) \, y(T - t) \, dt, where x, y are finite-energy signals defined over [0, T].
4. < x, y > = \int_0^T w(t) \, x(t) \, y(t) \, dt, where w(t) is a non-negative function and x, y are finite-energy signals defined over [0, T].
5. E[XY], where X and Y are real-valued random variables having finite mean-square values.
6. cov(X, Y), the covariance of the real-valued random variables X and Y. Assume that the random variables have finite mean-square values.

1.16.2

Under what conditions is

\int_0^T \!\! \int_0^T Q(t, u) \, x(t) \, y(u) \, dt \, du

a valid inner product for the set of finite-energy functions defined over [0, T]?

Exercise 1.16.35
Let an inner product be defined with respect to the positive-definite, symmetric kernel Q:

< x, y >Q = xQy

where xQy is the abstract notation for the mapping of the two vectors to a scalar. For example, if x and y are column matrices, Q is a positive-definite square matrix and

< x, y >_Q = x^T Q y

If x and y are defined in L2, then

\langle x, y \rangle_Q = \int \!\! \int x(t) \, Q(t, u) \, y(u) \, dt \, du

Let v denote an eigenvector of Q: Qv = λv.


1.16.1

Show that the eigenvectors of a positive-definite, symmetric kernel are orthogonal:

\langle v_i, v_j \rangle = 0, \quad i \ne j    (1.60)

1.16.2

Show that these eigenvectors are orthogonal with respect to the inner product generated by Q. Consequently, the eigenvectors are orthogonal with respect to two different inner products.

1.16.3

Let \tilde{Q} be the inverse kernel associated with Q. If Q is a matrix, then Q\tilde{Q} = I. If Q is a continuous-time kernel, then

\int Q(t, u) \, \tilde{Q}(u, v) \, du = \delta(t - v)

Show that the eigenvectors of the inverse kernel are equal to those of the kernel. How are the associated eigenvalues of these kernels related to each other?

Exercise 1.16.36
The purpose of this problem is to derive a general result which describes conditions for an orthonormal basis to result in an uncorrelated representation of a process. Let X denote a stochastic process which has the expansion

X = \sum_{i=1}^{\infty} \langle X, \phi_i \rangle \, \phi_i

where {φ_i} denotes a complete, orthonormal basis with respect to the inner product < ·, · >:

\langle \phi_i, \phi_j \rangle = \delta_{ij}

The process X may be nonstationary and may have non-zero mean.

1.16.1

To require that the representation be an uncorrelated sequence is equivalent to requiring:

E[< X, φ_i > < X, φ_j >] − E[< X, φ_i >] E[< X, φ_j >] = λ_i δ_{ij}

Show that this requirement implies:

E[X < X − m_X, φ_i >] = λ_i φ_i (1.61)

where mX = E [X].

1.16.2

Let X be a finite-length stochastic sequence so that it can be considered a random vector. Define the inner product < X, Y > to be X^T Y. Show that (1.61) is equivalent to

K_X φ = λφ
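This eigenvalue relationship is easy to explore numerically. The following Python sketch, which assumes an arbitrarily chosen covariance matrix K_X, expands simulated random vectors onto the eigenvectors of K_X and checks that the expansion coefficients come out (to within sampling error) uncorrelated, with variances given by the eigenvalues.

    import numpy as np

    rng = np.random.default_rng(0)

    # Arbitrary positive-definite covariance matrix K_X (hypothetical values)
    B = rng.standard_normal((4, 4))
    K_X = B @ B.T + 4 * np.eye(4)

    # The eigenvectors of K_X play the role of the basis phi_i
    lam, Phi = np.linalg.eigh(K_X)

    # Draw zero-mean vectors with covariance K_X and expand them: c_i = <X, phi_i>
    X = rng.multivariate_normal(np.zeros(4), K_X, size=200000)
    C = X @ Phi

    # The sample covariance of the coefficients should be nearly diag(lambda_i)
    print(np.round(np.cov(C.T), 2))
    print(np.round(lam, 2))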


1.16.3

Let X be a continuous parameter process so that

< X, Y > = ∫_0^T X_t Y_t dt

Show that this inner product implies

∫_0^T K_X(t, u) φ(u) du = λφ(t)

1.16.4

Again let X be a continuous parameter process. However, define the inner product to be

< X, Y > = ∫_0^T ∫_0^T Q(t, u) X_t Y_u dt du

where Q(t, u) is a non-negative definite function. Find the equivalent relationship implied by the requirements of the Karhunen-Loève expansion. Under what conditions will the φ's satisfying this relationship not depend on the covariance function of X?

Exercise 1.16.37
Let the covariance function of a wide-sense stationary process be

K_X(τ) = 1 − |τ|  if |τ| ≤ 1
         0        otherwise

Find the eigenfunctions and eigenvalues associated with the Karhunen-Loève expansion of X_t over (0, T) with T < 1.


Chapter 2

Optimization Theory

2.1 Optimization Theory1

Optimization theory is the study of the extremal values of a function: its minima and maxima. Topics in this theory range from conditions for the existence of a unique extremal value to methods, both analytic and numeric, for finding the extremal values and for what values of the independent variables the function attains its extremes. In this book, minimizing an error criterion is an essential step toward deriving optimal signal processing algorithms. An appendix summarizing the key results of optimization theory is essential to understand optimal algorithms.

2.1.1 Unconstrained Optimization

The simplest optimization problem is to find the minimum of a scalar-valued function of a scalar variable f(x), the so-called objective function, and where that minimum is located. Assuming the function is differentiable, the well-known conditions for finding the minima, local and global, are²

d/dx f(x) = 0   and   d²/dx² f(x) > 0

All values of the independent variable x satisfying these relations are locations of local minima. Without the second condition, solutions to the first could be either maxima, minima, or inflection points.

Solutions to the first equation are termed the stationary points of the objective function. To find the global minimum, that value (or values) where the function achieves its smallest value, each candidate extremum must be tested: the objective function must be evaluated at each stationary point and the smallest selected. If, however, the objective function can be shown to be strictly convex, then only one solution of d/dx f(x) = 0 exists and that solution corresponds to the global minimum. The function f(x) is strictly convex if, for any choice of x_1, x_2, and any scalar a with 0 < a < 1, f(a x_1 + (1 − a) x_2) < a f(x_1) + (1 − a) f(x_2). Convex objective functions occur often in practice and are more easily minimized because of this property.

When the objective function f(·) depends on a complex variable z, subtleties enter the picture. If the function f(z) is differentiable, its extremes can be found in the obvious way: find the derivative, set it equal to zero, and solve for the locations of the extrema. However, there are many situations in which this function is not differentiable. In contrast to functions of a real variable, non-differentiable functions of a complex variable occur frequently. The simplest example is f(z) = |z|².

¹ This content is available online at <http://cnx.org/content/m11240/1.4/>.
² The maximum of a function is found by finding the minimum of its negative.


The minimum value of this function obviously occurs at the origin. To calculate this obvious answer, a complication arises: the function f(z) = z z̄ is not analytic with respect to z and hence not differentiable. More generally, the derivative of a function with respect to a complex-valued variable cannot be evaluated directly when the function depends on the variable's conjugate. See Churchill [7] for more about the analysis of functions of a complex variable.

This complication can be resolved with either of two methods tailored for optimization problems. The first is to express the objective function in terms of the real and imaginary parts of z and find the function's minimum with respect to these two variables.³ This approach is unnecessarily tedious but will yield the solution. The second, more elegant, approach relies on two results from complex variable theory. First, the quantities z and z̄ can be treated as independent variables, each considered a constant with respect to the other. A variable and its conjugate are thus viewed as the result of applying an invertible linear transformation to the variable's real and imaginary parts. Thus, if the real and imaginary parts can be considered as independent variables, so can the variable and its conjugate, with the advantage that the mathematics is far simpler. In this way, ∂|z|²/∂z = z̄ and ∂|z|²/∂z̄ = z. Seemingly, the next step to minimizing the objective function is to set the derivatives with respect to each quantity to zero and then solve the resulting pair of equations. As the following theorem suggests, that solution is overly complicated.

Theorem 2.1:
If the function f(z, z̄) is real-valued and analytic with respect to z and z̄, all stationary points can be found by setting the derivative (in the sense just given) with respect to either z or z̄ to zero [4].

Thus, to find the minimum of |z|², compute the derivative with respect to either z or z̄. In most cases, the derivative with respect to z̄ is the most convenient choice.⁴ Thus, ∂|z|²/∂z̄ = z and the stationary point is z = 0. As this objective function is strictly convex, the objective function's sole stationary point is its global minimum.
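The relations ∂|z|²/∂z̄ = z and ∂|z|²/∂z = z̄ can be checked numerically by building the conjugate derivative from the real and imaginary partial derivatives, ∂f/∂z̄ = (1/2)(∂f/∂x + i ∂f/∂y). A small Python sketch, using an arbitrarily chosen test point:

    import numpy as np

    def f(z):
        # objective function f(z) = |z|^2
        return np.abs(z) ** 2

    z0 = 0.7 - 1.3j          # arbitrary test point
    h = 1e-6                 # finite-difference step

    dfdx = (f(z0 + h) - f(z0 - h)) / (2 * h)              # partial w.r.t. real part
    dfdy = (f(z0 + 1j * h) - f(z0 - 1j * h)) / (2 * h)    # partial w.r.t. imaginary part

    # Wirtinger-style derivatives built from the real/imaginary partials
    df_dzbar = 0.5 * (dfdx + 1j * dfdy)   # should equal z0
    df_dz = 0.5 * (dfdx - 1j * dfdy)      # should equal conj(z0)

    print(df_dzbar, z0)
    print(df_dz, np.conj(z0))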

When the objective function depends on a vector-valued quantity x, the evaluation of the function's stationary points is a simple extension of the scalar-variable case. However, testing stationary points as possible locations for minima is more complicated [32]. The gradient of the scalar-valued function f(x) of a vector x (dimension N) equals an N-dimensional vector where each component is the partial derivative of f(·) with respect to each component of x.

∇_x f(x) = ( ∂f(x)/∂x_1, ..., ∂f(x)/∂x_N )^T

For example, the gradient of x^T A x is Ax + A^T x. This result is easily derived by expressing the quadratic form as a double sum (∑_{i,j} A_{i,j} x_i x_j) and evaluating the partials directly. When A is symmetric, which is often the case, this gradient becomes 2Ax.

The gradient "points" in the direction of the maximum rate of increase of the function f(·). This fact is often used in numerical optimization algorithms. The method of steepest descent is an iterative algorithm where a candidate minimum is augmented by a quantity proportional to the negative of the objective function's gradient to yield the next candidate.

x_k = x_{k−1} − α ∇_x f(x_{k−1}),   α > 0

If the objective function is sufficiently "smooth" (there aren't too many minima and maxima), this approach will yield the global minimum. Strictly convex functions are certainly smooth enough for this method to work.
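A minimal Python sketch of these two facts, assuming a small symmetric, positive definite A chosen only for illustration: the gradient formula for x^T A x is checked by finite differences, and steepest descent with a fixed step size α is run on the same objective.

    import numpy as np

    A = np.array([[3.0, 1.0],
                  [1.0, 2.0]])        # hypothetical symmetric, positive-definite matrix

    def f(x):
        return x @ A @ x

    def grad(x):
        return (A + A.T) @ x          # equals 2 A x when A is symmetric

    # Finite-difference check of the gradient formula at an arbitrary point
    x0 = np.array([1.0, -2.0])
    h = 1e-6
    fd = np.array([(f(x0 + h * e) - f(x0 - h * e)) / (2 * h) for e in np.eye(2)])
    print(fd, grad(x0))               # the two should agree closely

    # Steepest descent: x_k = x_{k-1} - alpha * grad f(x_{k-1})
    alpha = 0.1
    x = x0.copy()
    for _ in range(100):
        x = x - alpha * grad(x)
    print(x, f(x))                    # converges toward the global minimum at the origin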

The gradient of the gradient of f(x), denoted by ∇²_x f(x), is a matrix where the jth column is the gradient of the jth component of f's gradient. This quantity is known as the Hessian, defined to be the matrix of all the second partials of f(·).

[∇²_x f(x)]_{i,j} = ∂²f(x) / (∂x_i ∂x_j)

³ The multi-variate minimization problem is discussed in a few paragraphs.
⁴ Why should this be? In the next few examples, try both and see which you feel is "easier".


The Hessian is always a symmetric matrix.

The minima of the objective function f(x) occur when

∇_x f(x) = 0   and   ∇²_x f(x) > 0

i.e., positive definite. Thus, for a stationary point to be a minimum, the Hessian evaluated at that point must be a positive definite matrix. When the objective function is strictly convex, this test need not be performed. For example, the objective function f(x) = x^T A x is convex whenever A is positive definite and symmetric.⁵

When the independent vector is complex-valued, the issues discussed in the scalar case also arise. Because of the complex-valued quantities involved, how to evaluate the gradient becomes an issue: is ∇_z or ∇_z̄ more appropriate? In contrast to the case of complex scalars, the choice in the case of complex vectors is unique.

Theorem 2.2:
Let f(z, z̄) be a real-valued function of the vector-valued complex variable z where the dependence on the variable and its conjugate is explicit. By treating z and z̄ as independent variables, the quantity pointing in the direction of the maximum rate of change of f(z, z̄) is ∇_z̄ f [4].

To show this result, consider the variation of f given by

δf = ∑_i [ (∂f/∂z_i) δz_i + (∂f/∂z̄_i) δz̄_i ] = (∇_z f)^T δz + (∇_z̄ f)^T δz̄

This quantity is concisely expressed as δf = 2 Re[ (∇_z̄ f)^H δz ]. By the Schwarz inequality, the maximum value of this variation occurs when δz is in the same direction as ∇_z̄ f. Thus, the direction corresponding to the largest change in the quantity f(z, z̄) is in the direction of its gradient with respect to z̄. To implement the method of steepest descent, for example, the gradient with respect to the conjugate must be used.

To find the stationary points of a scalar-valued function of a complex-valued vector, we must solve

∇_z̄ f(z) = 0 (2.1)

For solutions of this equation to be minima, the Hessian, defined to be the matrix of mixed partials given by ∇_z (∇_z̄ f(z)), must be positive definite. For example, the required gradient of the objective function z^H A z is given by Az, implying for positive definite A that a stationary point is z = 0. The Hessian of the objective function is simply A, confirming that the minimum of a quadratic form is always the origin.

2.2 Constrained Optimization6

Constrained optimization is the minimization of an objective function subject to constraints on the possible values of the independent variable. Constraints can be either equality constraints or inequality constraints. Because the scalar-variable case follows easily from the vector one, only the latter is discussed in detail here.

2.2.1 Equality Constraints

The typical constrained optimization problem has the form

min_x f(x)   subject to   g(x) = 0

⁵ Note that the Hessian of x^T A x is 2A.
⁶ This content is available online at <http://cnx.org/content/m11223/1.2/>.


where f(·) is the scalar-valued objective function and g(·) is the vector-valued constraint function. Strict convexity of the objective function is not sufficient to guarantee a unique minimum; in addition, each component of the constraint must be strictly convex to guarantee that the problem has a unique solution. Because of the constraint, stationary points of f(·) alone may not be solutions to the constrained problem: they may not satisfy the constraints. In fact, solutions to the constrained problem are often not stationary points of the objective function. Consequently, the ad hoc technique of searching for all stationary points of the objective function that also satisfy the constraint does not work.

The classical approach to solving constrained optimization problems is the method of Lagrange multipliers. This approach converts the constrained optimization problem into an unconstrained one, thereby allowing use of the techniques described in the previous section. The Lagrangian of a constrained optimization problem is defined to be the scalar-valued function

L(x, λ) = f(x) + λ^T g(x)

Essentially, the following theorem states that stationary points of the Lagrangian are potential solutions of the constrained optimization problem: as always, each candidate solution must be tested to determine which minimizes the objective function.

Theorem 2.3:
Let x denote a local solution to the constrained optimization problem given above where the gradients ∇_x g_1(x), . . ., ∇_x g_M(x) of the constraint function's components are linearly independent. There then exists a unique vector λ such that

∇_x L(x, λ) = 0

Furthermore, the quadratic form y^T ∇²_x L(x, λ) y is non-negative for all y satisfying ∇_x g(x)^T y = 0.

The latter result in the theorem says that the Hessian of the Lagrangian evaluated at its stationary points is non-negative definite with respect to all vectors orthogonal to the gradient of the constraint. This result generalizes the notion of a positive definite Hessian in unconstrained problems.

The rather abstract result of the preceding theorem has a simple geometric interpretation. As shown in Figure 2.1 (Geometric interpretation of Lagrange multipliers.), the constraint corresponds to a contour in the x plane.

Geometric interpretation of Lagrange multipliers.

Figure 2.1: The thick line corresponds to the contour of the values of x satisfying the constraint equation g(x) = 0. The thinner lines are contours of constant values of the objective function f(x). The contour corresponding to the smallest value of the objective function just tangent to the constraint contour is the solution to the optimization problem with equality constraints.

A contour map of the objective function indicates those values of x for which f(x) = c. In this figure, as c becomes smaller, the contours shrink to a small circle in the center of the figure. The solution to the


constrained optimization problem occurs when the smallest value of c is chosen for which the contour just touches the constraint contour. At that point, the gradient of the objective function and of the constraint contour are proportional to each other. This proportionality vector is λ, the so-called Lagrange multiplier. The Lagrange multiplier's exact value must be such that the constraint is exactly satisfied. Note that the constraint can be tangent to the objective function's contour map for larger values of c. These potential, but erroneous, solutions can be discarded only by evaluating the objective function.

Example 2.1
A typical problem arising in signal processing is to minimize x^T A x subject to the linear constraint c^T x = 1. A is a positive definite, symmetric matrix (a correlation matrix) in most problems. Clearly, the minimum of the objective function occurs at x = 0, but this solution cannot satisfy the constraint. The constraint g(x) = c^T x − 1 is a scalar-valued one; hence the theorem of Lagrange applies, as there are no multiple components in the constraint forcing a check of linear independence. The Lagrangian is

L(x, λ) = x^T A x + λ (c^T x − 1)

Its gradient is 2Ax + λc, with a solution x = −λ A^{-1} c / 2. To find the value of the Lagrange multiplier, this solution must satisfy the constraint. Imposing the constraint, λ c^T A^{-1} c = −2; thus, λ = −2 / (c^T A^{-1} c) and the total solution is

x = A^{-1} c / (c^T A^{-1} c)
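A quick numerical check of this solution, with illustrative values assumed for A and c: the closed form satisfies the constraint, and randomly generated vectors that also satisfy c^T x = 1 never achieve a smaller objective value.

    import numpy as np

    rng = np.random.default_rng(1)

    # Hypothetical problem data: A positive definite and symmetric, c a constraint vector
    A = np.array([[2.0, 0.3, 0.0],
                  [0.3, 1.5, 0.2],
                  [0.0, 0.2, 1.0]])
    c = np.array([1.0, -1.0, 2.0])

    Ainv_c = np.linalg.solve(A, c)
    x_star = Ainv_c / (c @ Ainv_c)       # x = A^{-1} c / (c^T A^{-1} c)

    print(c @ x_star)                    # the constraint c^T x = 1 is satisfied
    f_star = x_star @ A @ x_star

    # Any other vector satisfying the constraint should give a larger objective value
    for _ in range(5):
        v = rng.standard_normal(3)
        x_feas = v / (c @ v)             # rescale so that c^T x = 1
        print(f_star <= x_feas @ A @ x_feas)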

When the independent variable is complex-valued, the Lagrange multiplier technique can be used if care is taken to make the Lagrangian real. If it is not real, we cannot use the theorem (Theorem 2.3) that permits computation of stationary points by computing the gradient with respect to z alone. The Lagrangian may not be real-valued even when the constraint is real. Once insured real, the gradient of the Lagrangian with respect to the conjugate of the independent vector can be evaluated and the minimization procedure remains as before.

Example 2.2
Consider slight variations to the previous example: let the vector z be complex so that the objective function is z^H A z, where A is a positive definite, Hermitian matrix, and let the constraint be linear, but vector-valued (Cz = c). The Lagrangian is formed from the objective function and the real part of the usual constraint term.

L(z, λ) = z^H A z + λ^H (Cz − c) + λ^T conj(Cz − c)

For the Lagrange multiplier theorem to hold, the gradients of each component of the constraint must be linearly independent. As these gradients are the columns of C, their mutual linear independence means that each constraint vector must not be expressible as a linear combination of the others. We shall assume this portion of the problem statement true. Evaluating the gradient with respect to z̄, keeping z a constant, and setting the result equal to zero yields

Az + C^H λ = 0

The solution is z = −A^{-1} C^H λ. Applying the constraint, we find that C A^{-1} C^H λ = −c. Solving for the Lagrange multiplier and substituting the result into the solution, we find that the solution to the constrained optimization problem is

z = A^{-1} C^H (C A^{-1} C^H)^{-1} c

The indicated matrix inverses always exist: A is assumed invertible and C A^{-1} C^H is invertible because of the linear independence of the constraints.
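The same kind of numerical check applies here; the sketch below assumes arbitrarily chosen values for A, C, and c, verifies that Cz = c holds for the closed-form solution, and confirms that perturbing the solution within the nullspace of C cannot decrease the objective.

    import numpy as np

    rng = np.random.default_rng(2)

    # Hypothetical data: A Hermitian positive definite, C with linearly independent rows
    B = rng.standard_normal((3, 3)) + 1j * rng.standard_normal((3, 3))
    A = B @ B.conj().T + 3 * np.eye(3)
    C = rng.standard_normal((1, 3)) + 1j * rng.standard_normal((1, 3))
    c = np.array([1.0 + 0.5j])

    Ainv_CH = np.linalg.solve(A, C.conj().T)
    z_star = Ainv_CH @ np.linalg.solve(C @ Ainv_CH, c)   # A^{-1} C^H (C A^{-1} C^H)^{-1} c

    print(np.allclose(C @ z_star, c))                    # constraint Cz = c holds

    # Perturb z_star within the nullspace of C; the objective should not decrease
    _, _, Vh = np.linalg.svd(C)
    null_basis = Vh[1:].conj().T                         # columns spanning null(C)
    f_star = (z_star.conj() @ A @ z_star).real
    for _ in range(5):
        n = null_basis @ (rng.standard_normal(2) + 1j * rng.standard_normal(2))
        z = z_star + n
        print(f_star <= (z.conj() @ A @ z).real + 1e-12)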


2.2.2 Inequality Constraints

When some of the constraints are inequalities, the Lagrange multiplier technique can be used, but the solution must be checked carefully in its details. But first, the optimization problem with equality and inequality constraints is formulated as

min_x f(x)   subject to   g(x) = 0 and h(x) ≤ 0

As before, f(·) is the scalar-valued objective function and g(·) is the equality constraint function; h(·) is the inequality constraint function.

The key result which can be used to find the analytic solution to this problem is to first form the Lagrangian in the usual way as L(x, λ, µ) = f(x) + λ^T g(x) + µ^T h(x). The following theorem is the general statement of the Lagrange multiplier technique for constrained optimization problems.

Theorem 2.4:
Let x be a local minimum for the constrained optimization problem. If the gradients of g's components and the gradients of those components of h(·) for which h_i(x) = 0 are linearly independent, then

∇_x L(x, λ, µ) = 0

where µ ≥ 0 and µ_i h_i(x) = 0.

The portion of this result dealing with the inequality constraint differs substantially from that concerned with the equality constraint. Either a component of the constraint equals its maximum value (zero in this case) and the corresponding component of its Lagrange multiplier is non-negative (and is usually positive), or a component is less than the constraint and its component of the Lagrange multiplier is zero. This latter result means that some components of the inequality constraint are not as stringent as others and these lax ones do not affect the solution.

The rationale behind this theorem is a technique for converting the inequality constraint into an equality constraint: h_i(x) ≤ 0 is equivalent to h_i(x) + s_i² = 0. Since the new term, called a slack variable, is non-negative, the constraint must be non-positive. With the inclusion of slack variables, the equality constraint theorem can be used and the above theorem results. To prove the theorem, not only does the gradient with respect to x need to be considered, but also with respect to the vector s of slack variables. The ith component of the gradient of the Lagrangian with respect to s at the stationary point is 2 µ_i s_i = 0. If, in solving the optimization problem, s_i = 0, the inequality constraint was in reality an equality constraint and that component of the constraint behaves accordingly. As s_i = √(−h_i(x)), s_i = 0 implies that that component of the inequality constraint must equal zero. On the other hand, if s_i ≠ 0, the corresponding Lagrange multiplier must be zero.

Example 2.3
Consider the problem of minimizing a quadratic form subject to a linear equality constraint and an inequality constraint on the norm of the linear constraint vector's variation.

min_x x^T A x   subject to   (c + δ)^T x = 1 and ‖δ‖² ≤ ε

This kind of problem arises in robust estimation. One seeks a solution where one of the "knowns" of the problem, c in this case, is, in reality, only approximately specified. The independent variables are x and δ. The Lagrangian for this problem is

L(x, δ, λ, µ) = x^T A x + λ ((c + δ)^T x − 1) + µ (‖δ‖² − ε)

Evaluating the gradients with respect to the independent variables yields

2Ax + λ(c + δ) = 0
λx + 2µδ = 0


The latter equation is key. Recall that either µ = 0 or the inequality constraint is satisfied with equality. If µ is zero, that implies that x must be zero, which will not allow the equality constraint to be satisfied. The inescapable conclusion is that ‖δ‖² = ε and that δ is parallel to x: δ = −(λ/(2µ)) x. Using the first equation, x is found to be

x = (−λ/2) (A − (λ²/(4µ)) I)^{-1} c

Imposing the constraints on this solution results in a pair of equations for the Lagrange multipliers.

(λ²/(4µ))² c^T (A − (λ²/(4µ)) I)^{-2} c = ε

(−λ/2) c^T (A − (λ²/(4µ)) I)^{-1} c − 2µε/λ = 1

Multiple solutions are possible and each must be checked. The rather complicated completion of this example is left to the (numerically oriented) reader.
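One way to carry out that numerical completion is to bypass the multiplier equations entirely and hand the problem to a general-purpose constrained solver. The sketch below does so with scipy.optimize.minimize (SLSQP), using made-up values for A, c, and ε; at the reported solution, ‖δ‖² typically sits at the boundary value ε, as the analysis above predicts.

    import numpy as np
    from scipy.optimize import minimize

    # Hypothetical problem data
    A = np.array([[2.0, 0.4],
                  [0.4, 1.0]])
    c = np.array([1.0, 1.0])
    eps = 0.01

    def objective(v):
        x = v[:2]
        return x @ A @ x

    def equality(v):                  # (c + delta)^T x - 1 = 0
        x, d = v[:2], v[2:]
        return (c + d) @ x - 1.0

    def inequality(v):                # eps - ||delta||^2 >= 0 (SciPy convention)
        d = v[2:]
        return eps - d @ d

    # Start from the equality-constrained solution with delta = 0
    x0 = np.linalg.solve(A, c)
    x0 = x0 / (c @ x0)
    v0 = np.concatenate([x0, np.zeros(2)])

    res = minimize(objective, v0, method="SLSQP",
                   constraints=[{"type": "eq", "fun": equality},
                                {"type": "ineq", "fun": inequality}])
    x_opt, d_opt = res.x[:2], res.x[2:]
    print(res.fun, d_opt @ d_opt)     # ||delta||^2 typically ends up at eps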


Chapter 3

Estimation Theory

3.1 Introduction to Estimation Theory1

In searching for methods of extracting information from noisy observations, this chapter describes estimation theory, which has the goal of extracting from noise-corrupted observations the values of disturbance parameters (noise variance, for example), signal parameters (amplitude or propagation direction), or signal waveforms. Estimation theory assumes that the observations contain an information-bearing quantity, thereby tacitly assuming that detection-based preprocessing has been performed (in other words, do I have something in the observations worth estimating?). Conversely, detection theory often requires estimation of unknown parameters: Signal presence is assumed, parameter estimates are incorporated into the detection statistic, and consistency of observations and assumptions tested. Consequently, detection and estimation theory form a symbiotic relationship, each requiring the other to yield high-quality signal processing algorithms.

Despite a wide variety of error criteria and problem frameworks, the optimal detector is characterized by a single result: the likelihood ratio test. Surprisingly, optimal detectors thus derived are usually easy to implement, not often requiring simplification to obtain a feasible realization in hardware or software. In contrast to detection theory, no fundamental result in estimation theory exists to be summoned to attack the problem at hand. The choice of error criterion and its optimization heavily influences the form of the estimation procedure. Because of the variety of criterion-dependent estimators, arguments frequently rage about which of several optimal estimators is "better." Each procedure is optimum for its assumed error criterion; thus, the argument becomes which error criterion best describes some intuitive notion of quality. When more ad hoc, noncriterion-based procedures² are used, we cannot assess the quality of the resulting estimator relative to the best achievable. As shown later (Section 3.6), bounds on the estimation error do exist, but their tightness and applicability to a given situation are always issues in assessing estimator quality. At best, estimation theory is less structured than detection theory. Detection is science, estimation art. Inventiveness coupled with an understanding of the problem (what types of errors are critically important, for example) are key elements to deciding which estimation procedure "fits" a given problem well.

3.1.1 Terminology in Estimation Theory

More so than detection theory, estimation theory relies on jargon to characterize the properties of estimators. Without knowing any estimation technique, let's use parameter estimation as our discussion prototype. The parameter estimation problem is to determine from a set of L observations, represented by the L-dimensional vector r, the values of parameters denoted by the vector θ. We write the estimate of this parameter vector as θ̂(r), where the "hat" denotes the estimate, and the functional dependence on r explicitly denotes the

¹ This content is available online at <http://cnx.org/content/m11263/1.2/>.
² This governmentese phrase concisely means guessing.


dependence of the estimate on the observations. This dependence is always present³, but we frequently denote the estimate compactly as θ̂. Because of the probabilistic nature of the problems considered in this chapter, a parameter estimate is itself a random vector, having its own statistical characteristics. The estimation error ε(r) equals the estimate minus the actual parameter value: ε(r) = θ̂(r) − θ. It too is a random quantity and is often used in the criterion function. For example, the mean-squared error is given by E[ε^T ε]; the minimum mean-squared error estimate would minimize this quantity. The mean-squared error matrix is E[ε ε^T]; on the main diagonal, its entries are the mean-squared estimation errors for each component of the parameter vector, whereas the off-diagonal terms express the correlation between the errors. The mean-squared estimation error E[ε^T ε] equals the trace of the mean-squared error matrix: tr(E[ε ε^T]).

3.1.1.1 Bias

An estimate is said to be unbiased if the expected value of the estimate equals the true value of the parameter: E[θ̂ | θ] = θ. Otherwise, the estimate is said to be biased: E[θ̂ | θ] ≠ θ. The bias b(θ) is usually considered to be additive, so that b(θ) = E[θ̂ | θ] − θ. When we have a biased estimate, the bias usually depends on the number of observations L. An estimate is said to be asymptotically unbiased if the bias tends to zero for large L: lim_{L→∞} b = 0. An estimate's variance equals the mean-squared estimation error only if the estimate is unbiased.

An unbiased estimate has a probability distribution where the mean equals the actual value of the parameter. Should the lack of bias be considered a desirable property? If many unbiased estimates are computed from statistically independent sets of observations having the same parameter value, the average of these estimates will be close to this value. This property does not mean that the estimate has less error than a biased one; there exist biased estimates whose mean-squared errors are smaller than unbiased ones. In such cases, the biased estimate is usually asymptotically unbiased. Lack of bias is good, but that is just one aspect of how we evaluate estimators.

3.1.1.2 Consistency

We term an estimate consistent if the mean-squared estimation error tends to zero as the number of observations becomes large: lim_{L→∞} E[ε^T ε] = 0. Thus, a consistent estimate must be at least asymptotically unbiased. Unbiased estimates do exist whose errors never diminish as more data are collected: Their variances remain nonzero no matter how much data are available. Inconsistent estimates may provide reasonable estimates when the amount of data is limited, but have the counterintuitive property that the quality of the estimate does not improve as the number of observations increases. Although appropriate in the proper circumstances (smaller mean-squared error than a consistent estimate over a pertinent range of values of L), consistent estimates are usually favored in practice.

3.1.1.3 Efficiency

As estimators can be derived in a variety of ways, their error characteristics must always be analyzed and compared. In practice, many problems and the estimators derived for them are sufficiently complicated to render analytic studies of the errors difficult, if not impossible. Instead, numerical simulation and comparison with lower bounds on the estimation error are frequently used to assess the estimator performance. An efficient estimate has a mean-squared error that equals a particular lower bound: the Cramér-Rao bound (Section 3.6). If an efficient estimate exists (the Cramér-Rao bound is the greatest lower bound), it is optimum in the mean-squared sense: No other estimate has a smaller mean-squared error (see Maximum Likelihood Estimators (Section 3.5) for details).

³ Estimating the value of a parameter given no data may be an interesting problem in clairvoyance, but not in estimation theory.


For many problems no efficient estimate exists. In such cases, the Cramér-Rao bound remains a lower bound, but its value is smaller than that achievable by any estimator. How much smaller is usually not known. However, practitioners frequently use the Cramér-Rao bound in comparisons with numerical error calculations. Another issue is the choice of mean-squared error as the estimation criterion; it may not suffice to pointedly assess estimator performance in a particular problem. Nevertheless, every problem is usually subjected to a Cramér-Rao bound computation and the existence of an efficient estimate considered.

3.2 Minimum Mean Squared Error Estimators4

In terms of the densities involved in scalar random-parameter problems, the mean-squared error is given by

E[ε²] = ∫∫ (θ̂ − θ)² p(r, θ) dr dθ (3.1)

where p(r, θ) is the joint density of the observations and the parameter. To minimize this integral with respect to θ̂, we rewrite it using the laws of conditional probability as

E[ε²] = ∫ p(r) [ ∫ (θ − θ̂(r))² p(θ | r) dθ ] dr (3.2)

The density p(r) is nonnegative. To minimize the mean-squared error, we must minimize the inner integral for each value of r because the integral is weighted by a positive quantity. We focus attention on the inner integral, which is the conditional expected value of the squared estimation error. The condition, a fixed value of r, implies that we seek that constant θ̂(r) derived from r that minimizes the second moment of the random parameter θ. A well-known result from probability theory states that the minimum of E[(x − c)²] occurs when the constant c equals the expected value of the random variable x (see Expected Values of Probability Functions⁵). The inner integral and thereby the mean-squared error is minimized by choosing the estimator to be the conditional expected value of the parameter given the observations.

θ̂_MMSE(r) = E[θ | r] (3.3)

Thus, a parameter's minimum mean-squared error (MMSE) estimate is the parameter's a posteriori (after the observations have been obtained) expected value.

The associated conditional probability density p(θ | r) is not often directly stated in a problem definition and must somehow be derived. In many applications, the likelihood function p(r | θ) and the a priori density of the parameter are a direct consequence of the problem statement. These densities can be used to find the joint density of the observations and the parameter, enabling us to use Bayes's Rule to find the a posteriori density if we knew the unconditional probability density of the observations.

p(θ | r) = p(r | θ) p(θ) / p(r) (3.4)

This density p(r) is often difficult to determine. Be that as it may, to find the a posteriori conditional expected value, it need not be known. The numerator entirely expresses the a posteriori density's dependence on θ; the denominator only serves as the scaling factor to yield a unit-area quantity. The expected value is the center-of-mass of the probability density and does not depend directly on the "weight" of the density, bypassing calculation of the scaling factor. If not, the MMSE estimate can be exceedingly difficult to compute.

⁴ This content is available online at <http://cnx.org/content/m11267/1.5/>.
⁵ "Expected Values of Probability Functions" <http://cnx.org/content/m11247/latest/>


Example 3.1
Let L statistically independent observations be obtained, each of which is expressed by r(l) = θ + n(l). Each n(l) is a Gaussian random variable having zero mean and variance σ_n². Thus, the unknown parameter in this problem is the mean of the observations. Assume it to be a Gaussian random variable a priori (mean m_θ and variance σ_θ²). The likelihood function is easily found to be

p(r | θ) = ∏_{l=0}^{L−1} (1/√(2πσ_n²)) exp[ −(1/2) ((r(l) − θ)/σ_n)² ] (3.5)

so that the a posteriori density is given by

p(θ | r) = { (1/√(2πσ_θ²)) exp[ −(1/2) ((θ − m_θ)/σ_θ)² ] ∏_{l=0}^{L−1} (1/√(2πσ_n²)) exp[ −(1/2) ((r(l) − θ)/σ_n)² ] } / p(r) (3.6)

In an attempt to find the expected value of this distribution, lump all terms that do not depend explicitly on the quantity θ into a proportionality term.

p(θ | r) ∝ exp[ −(1/2) ( ∑(r(l) − θ)²/σ_n² + (θ − m_θ)²/σ_θ² ) ] (3.7)

After some manipulation, this expression can be written as

p(θ | r) ∝ exp[ −(1/(2σ²)) ( θ − σ² ( m_θ/σ_θ² + ∑ r(l)/σ_n² ) )² ] (3.8)

where σ² is a quantity that succinctly expresses the ratio σ_n² σ_θ² / (σ_n² + L σ_θ²). The form of the a posteriori density suggests that it too is Gaussian; its mean, and therefore the MMSE estimate of θ, is given by

θ̂_MMSE(r) = σ² ( m_θ/σ_θ² + ∑ r(l)/σ_n² ) (3.9)

More insight into the nature of this estimate is gained by rewriting it as

θ̂_MMSE(r) = [ (σ_n²/L) / (σ_θ² + σ_n²/L) ] m_θ + [ σ_θ² / (σ_θ² + σ_n²/L) ] (1/L) ∑_{l=0}^{L−1} r(l) (3.10)

The term σ_n²/L is the variance of the averaged observations for a given value of θ; it expresses the squared error encountered in estimating the mean by simple averaging. If this error is much greater than the a priori variance of θ (σ_n²/L ≫ σ_θ²), implying that the observations are noisier than the variation of the parameter, the MMSE estimate ignores the observations and tends to yield the a priori mean m_θ as its value. If the averaged observations are less variable than the parameter, the second term dominates, and the average of the observations is the estimate's value. This estimate behavior between these extremes is very intuitive. The detailed form of the estimate indicates how the squared error can be minimized by a linear combination of these extreme estimates.

The conditional expected value of the estimate equals

E[θ̂_MMSE | θ] = [ (σ_n²/L) / (σ_θ² + σ_n²/L) ] m_θ + [ σ_θ² / (σ_θ² + σ_n²/L) ] θ (3.11)

This estimate is biased because its expected value does not equal the value of the sought-after parameter. It is asymptotically unbiased as the squared measurement error σ_n²/L tends to zero as L becomes large. The consistency of the estimator is determined by investigating the expected


value of the squared error. Note that the variance of the a posteriori density is the quantity σ²; as this quantity does not depend on r, it also equals the unconditional variance. As the number of observations increases, this variance tends to zero. In concert with the estimate being asymptotically unbiased, the expected value of the squared estimation error thus tends to zero, implying that we have a consistent estimate.
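This behavior is easy to reproduce by simulation. The sketch below, using arbitrarily chosen values for m_θ, σ_θ², σ_n², and L, draws many (θ, r) pairs, forms the estimate of equation (3.10), and compares its empirical mean-squared error with that of the plain sample average and with the posterior variance σ².

    import numpy as np

    rng = np.random.default_rng(3)

    # Hypothetical parameter values
    m_theta, var_theta = 1.0, 0.5     # a priori mean and variance of theta
    var_n = 2.0                       # noise variance
    L = 10
    trials = 100000

    theta = m_theta + np.sqrt(var_theta) * rng.standard_normal(trials)
    r = theta[:, None] + np.sqrt(var_n) * rng.standard_normal((trials, L))
    sample_mean = r.mean(axis=1)

    # MMSE estimate of equation (3.10): a weighted combination of m_theta and the sample mean
    w = var_theta / (var_theta + var_n / L)
    theta_mmse = (1 - w) * m_theta + w * sample_mean

    print(np.mean((theta_mmse - theta) ** 2))           # empirical MSE of the MMSE estimate
    print(np.mean((sample_mean - theta) ** 2))          # larger: MSE of simple averaging
    print(var_theta * var_n / (var_n + L * var_theta))  # theoretical posterior variance sigma^2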

3.3 Maximum a Posteriori Estimators6

In those cases in which the expected value of the a posteriori density cannot be computed, a related but simpler estimate, the maximum a posteriori (MAP) estimate, can usually be evaluated. The estimate θ̂_MAP(r) equals the location of the maximum of the a posteriori density. Assuming that this maximum can be found by evaluating the derivative of the a posteriori density, the MAP estimate is the solution of the equation

∂p(θ | r)/∂θ |_{θ = θ̂_MAP} = 0 (3.12)

Any scaling of the density by a positive quantity that depends on r does not change the location of the maximum. Symbolically, p(θ | r) = p(r | θ) p(θ) / p(r); the derivative does not involve the denominator, and this term can be ignored. Thus, the only quantities required to compute θ̂_MAP are the likelihood function and the parameter's a priori density.

Although not apparent in its definition, the MAP estimate does satisfy an error criterion. Define a criterion that is zero over a small range of values about ε = 0 and a positive constant outside that range. Minimization of the expected value of this criterion with respect to θ̂ is accomplished by centering the criterion function at the maximum of the density. The region having the largest area is thus "notched out," and the criterion is minimized. Whenever the a posteriori density is symmetric and unimodal, the MAP and MMSE estimates coincide. In Gaussian problems, such as the last example, this equivalence is always valid. In more general circumstances, they differ.

Example 3.2
Let the observations have the same form as the previous example, but with the modification that the parameter is now uniformly distributed over the interval [θ_1, θ_2]. The a posteriori mean cannot be computed in closed form. To obtain the MAP estimate, we need to find the location of the maximum of

p(r | θ) p(θ) = (1/(θ_2 − θ_1)) ∏_{l=0}^{L−1} (1/√(2πσ_n²)) exp[ −(1/2) ((r(l) − θ)/σ_n)² ],   θ_1 ≤ θ ≤ θ_2 (3.13)

Evaluating the logarithm of this quantity does not change the location of the maximum and simplifies the manipulations in many problems. Here, the logarithm is

ln( p(r | θ) p(θ) ) = −ln(θ_2 − θ_1) − (1/2) ∑_{l=0}^{L−1} ((r(l) − θ)/σ_n)² + ln(C),   θ_1 ≤ θ ≤ θ_2 (3.14)

where C is a constant with respect to θ. Assuming that the maximum is interior to the domain of the parameter, the MAP estimate is found to be the sample average ∑ r(l)/L. If the average lies outside this interval, the corresponding endpoint of the interval is the location of the maximum.

⁶ This content is available online at <http://cnx.org/content/m11268/1.3/>.


To summarize,

θ̂_MAP(r) = θ_1,              if (1/L) ∑_l r(l) < θ_1
            (1/L) ∑_l r(l),   if θ_1 ≤ (1/L) ∑_l r(l) ≤ θ_2
            θ_2,              if θ_2 < (1/L) ∑_l r(l)        (3.15)

The a posteriori density is not symmetric because of the finite domain of θ. Thus, the MAP estimate is not equivalent to the MMSE estimate, and the accompanying increase in the mean-squared error is difficult to compute. When the sample average is the estimate, the estimate is unbiased; otherwise it is biased. Asymptotically, the variance of the average tends to zero, with the consequence that the estimate is unbiased and consistent.
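A short simulation, with illustrative values for θ_1, θ_2, σ_n², and L, shows the clipping of equation (3.15) in action; because θ is known to lie in [θ_1, θ_2], clipping the sample average can only reduce the error.

    import numpy as np

    rng = np.random.default_rng(4)

    # Hypothetical values for the uniform prior and the noise
    theta1, theta2 = 0.0, 1.0
    var_n, L, trials = 4.0, 5, 100000

    theta = rng.uniform(theta1, theta2, size=trials)
    r = theta[:, None] + np.sqrt(var_n) * rng.standard_normal((trials, L))

    sample_mean = r.mean(axis=1)
    theta_map = np.clip(sample_mean, theta1, theta2)    # equation (3.15)

    print(np.mean((sample_mean - theta) ** 2))   # MSE of the unclipped average
    print(np.mean((theta_map - theta) ** 2))     # clipping reduces the error here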

3.4 Linear Estimators7

We derived the minimum mean-squared error estimator in the previous section (Section 3.2) with no constraint on the form of the estimator. Depending on the problem, the computations could be a linear function of the observations (which is always the case in Gaussian problems) or nonlinear. Deriving this estimator is often difficult, which limits its application. We consider here a variation of MMSE estimation by constraining the estimator to be linear while minimizing the mean-squared estimation error. Such linear estimators may not be optimum; the conditional expected value may be nonlinear and it always has the smallest mean-squared error. Despite this occasional performance deficit, linear estimators have well-understood properties, they interact well with other signal processing algorithms because of linearity, and they can always be derived, no matter what the problem.

Let the parameter estimate θ̂(r) be expressed as L(r), where L(·) is a linear operator: L(a_1 r_1 + a_2 r_2) = a_1 L(r_1) + a_2 L(r_2), where a_1, a_2 are scalars. Although all estimators of this form are obviously linear, the term linear estimator denotes that member of this family that minimizes the mean-squared error.

θ̂_LIN(r) = argmin_{L(·)} E[ε^T ε] (3.16)

Because of the transformation's linearity, the theory of linear vector spaces can be fruitfully used to derive the estimator and to specify its properties. One result of that theoretical framework is the well-known Orthogonality Principle (Papoulis, pp. 407-414) [37]: The linear estimator is that particular linear transformation that yields an estimation error orthogonal to all linear transformations of the data. The orthogonality of the error to all linear transformations is termed the universality constraint. This principle provides us not only with a formal definition of the linear estimator but also with the mechanism to derive it. To demonstrate this intriguing result, let < ·, · > denote the abstract inner product between two vectors and ‖ · ‖ the associated norm.

‖x‖² = < x, x > (3.17)

For example, if x and y are each column matrices having only one column,⁸ their inner product might be defined as < x, y > = x^T y. Thus, the linear estimator as defined by the Orthogonality Principle must satisfy

for all linear transformations L(·):  E[ < θ̂_LIN(r) − θ, L(r) > ] = 0 (3.18)

⁷ This content is available online at <http://cnx.org/content/m11276/1.5/>.
⁸ There is a confusion as to what a vector is. "Matrices having one column" are colloquially termed vectors, as are field quantities such as electric and magnetic fields. "Vectors" and their associated inner products are taken to be much more general mathematical objects than these. Hence the prose in this section is rather contorted.


To see that this principle produces the MMSE linear estimator, we express the mean-squared estimation error E[ε^T ε] = E[‖ε‖²] for any choice of linear estimator θ̂ as

E[ ‖θ̂ − θ‖² ] = E[ ‖ θ̂_LIN − θ − (θ̂_LIN − θ̂) ‖² ]
             = E[ ‖θ̂_LIN − θ‖² ] + E[ ‖θ̂_LIN − θ̂‖² ] − 2 E[ < θ̂_LIN − θ, θ̂_LIN − θ̂ > ] (3.19)

As θ̂_LIN − θ̂ is the difference of two linear transformations, it too is linear and is orthogonal to the estimation error resulting from θ̂_LIN. As a result, the last term is zero and the mean-squared estimation error is the sum of two squared norms, each of which is, of course, nonnegative. Only the second norm varies with estimator choice; we minimize the mean-squared estimation error by choosing the estimator θ̂ to be the estimator θ̂_LIN, which sets the second term to zero.

The estimation error for the minimum mean-squared linear estimator can be calculated to some degree without knowledge of the form of the estimator. The mean-squared estimation error is given by

E[ ‖θ̂_LIN − θ‖² ] = E[ < θ̂_LIN − θ, θ̂_LIN − θ > ]
                  = E[ < θ̂_LIN − θ, θ̂_LIN > ] + E[ < θ̂_LIN − θ, −θ > ] (3.20)

The first term is zero because of the Orthogonality Principle. Rewriting the second term yields a general expression for the MMSE linear estimator's mean-squared error.

E[ ‖ε‖² ] = E[ ‖θ‖² ] − E[ < θ̂_LIN, θ > ] (3.21)

This error is the difference of two terms. The first, the mean-squared value of the parameter, represents the largest value that the estimation error can be for any reasonable estimator. That error can be obtained by the estimator that ignores the data and has a value of zero. The second term reduces this maximum error and represents the degree to which the estimate and the parameter agree on the average.

Note that the definition of the minimum mean-squared error linear estimator makes no explicit assumptions about the parameter estimation problem being solved. This property makes this kind of estimator attractive in many applications where neither the a priori density of the parameter vector nor the density of the observations is known precisely. Linear transformations, however, are homogeneous: A zero-valued input yields a zero output. Thus, the linear estimator is especially pertinent to those problems where the expected value of the parameter is zero. If the expected value is nonzero, the linear estimator would not necessarily yield the best result (see this problem (Exercise 3.11.9)).

Example 3.3
Express the first example (Example 3.1) in vector notation so that the observation vector is written as

r = Aθ + n

where the matrix A has the form A = (1, . . . , 1)^T. The expected value of the parameter is zero. The linear estimator has the form θ̂_LIN = Lr, where L is a 1 × L matrix. The Orthogonality Principle states that the linear estimator satisfies

for all 1 × L matrices M:  E[ (Lr − θ)^T M r ] = 0

To use the Orthogonality Principle to derive an equation implicitly specifying the linear estimator, the "for all linear transformations" phrase must be interpreted. Usually the quantity specifying the linear transformation must be removed from the constraining inner product by imposing a very stringent but equivalent condition. In this example, this phrase becomes one about matrices. The


elements of the matrix M can be such that each element of the observation vector multiplies each element of the estimation error. Thus, in this problem the Orthogonality Principle means that the expected value of the matrix consisting of all pairwise products of these elements must be zero.

E[ (Lr − θ) r^T ] = 0

Thus, two terms must equal each other: E[L r r^T] = E[θ r^T]. The second term equals E[θ²] A^T, as the additive noise and the parameter are assumed to be statistically independent quantities. The quantity E[r r^T] in the first term is the correlation matrix of the observations, which is given by A A^T E[θ²] + K_n. Here, K_n is the noise covariance matrix, and E[θ²] is the parameter's variance. The quantity A A^T is an L × L matrix with each element equaling 1. The noise vector has independent components; the covariance matrix thus equals σ_n² I. The equation that L must satisfy is therefore given by

(L_1 · · · L_L)  ( σ_n² + σ_θ²   σ_θ²          · · ·   σ_θ²
                   σ_θ²          σ_n² + σ_θ²   · · ·   σ_θ²
                   · · ·                       · · ·   σ_θ²
                   σ_θ²          · · ·         σ_θ²    σ_n² + σ_θ² )  =  (σ_θ²  · · ·  σ_θ²)

The components of L are equal and are given by L_i = σ_θ² / (σ_n² + L σ_θ²). Thus, the minimum mean-squared error linear estimator has the form

θ̂_LIN(r) = [ σ_θ² / (σ_θ² + σ_n²/L) ] (1/L) ∑_l r(l)

Note that this result equals the minimum mean-squared error estimate derived earlier (Section 3.2) under the condition that E[θ] = 0. Mean-squared error, linear estimators, and Gaussian problems are intimately related to each other. The linear minimum mean-squared error solution to a problem is optimal if the underlying distributions are Gaussian.
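The weights L_i can also be obtained numerically straight from the orthogonality condition E[L r r^T] = E[θ r^T]. The sketch below builds both correlation matrices from the model, using illustrative values of σ_θ², σ_n², and L, and compares the resulting weights with σ_θ² / (σ_n² + L σ_θ²).

    import numpy as np

    # Hypothetical values
    var_theta, var_n, L = 2.0, 1.0, 5

    A = np.ones((L, 1))                               # r = A theta + n
    K_r = var_theta * (A @ A.T) + var_n * np.eye(L)   # E[r r^T]
    K_theta_r = var_theta * A.T                       # E[theta r^T], a 1 x L matrix

    L_opt = K_theta_r @ np.linalg.inv(K_r)            # solves L E[r r^T] = E[theta r^T]
    print(L_opt)
    print(var_theta / (var_n + L * var_theta))        # each weight should equal this value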

3.5 Maximum Likelihood Estimators of Parameters9

When the a priori density of a parameter is not known or the parameter itself is inconveniently described as a random variable, techniques must be developed that make no presumption about the relative possibilities of parameter values. Lacking this knowledge, we can expect the error characteristics of the resulting estimates to be worse than those which can use it.

The maximum likelihood estimate θ̂_ML(r) of a nonrandom parameter is, simply, that value which maximizes the likelihood function (the a priori density of the observations). Assuming that the maximum can be found by evaluating a derivative, θ̂_ML(r) is defined by

∂p(r | θ)/∂θ |_{θ = θ̂_ML} = 0 (3.22)

The logarithm of the likelihood function may also be used in this maximization.

Example 3.4
Let r(l) be a sequence of independent, identically distributed Gaussian random variables having an unknown mean θ but a known variance σ_n². Often, we cannot assign a probability density to

⁹ This content is available online at <http://cnx.org/content/m11269/1.4/>.


a parameter of a random variable's density; we simply do not know what the parameter's value is. Maximum likelihood estimates are often used in such problems. In the specific case here, the derivative of the logarithm of the likelihood function equals

∂ln(p(r | θ))/∂θ = (1/σ_n²) ∑_{l=0}^{L−1} (r(l) − θ)

The solution of this equation is the maximum likelihood estimate, which equals the sample average.

θ̂_ML = (1/L) ∑_{l=0}^{L−1} r(l)

The expected value of this estimate, E[θ̂_ML | θ], equals the actual value θ, showing that the maximum likelihood estimate is unbiased. The mean-squared error equals σ_n²/L and we infer that this estimate is consistent.

3.5.1 Parameter Vectors

The maximum likelihood procedure (as well as the others being discussed) can be easily generalized to situations where more than one parameter must be estimated. Letting θ denote the parameter vector, the likelihood function is now expressed as p(r | θ). The maximum likelihood estimate θ̂_ML of the parameter vector is given by the location of the maximum of the likelihood function (or equivalently of its logarithm). Using derivatives, the calculation of the maximum likelihood estimate becomes

∇_θ ( ln(p(r | θ)) ) |_{θ = θ̂_ML} = 0 (3.23)

where ∇_θ denotes the gradient with respect to the parameter vector. This equation means that we must estimate all of the parameters simultaneously by setting the partial of the likelihood function with respect to each parameter to zero. Given P parameters, we must solve in most cases a set of P nonlinear, simultaneous equations to find the maximum likelihood estimates.

Example 3.5
Let's extend the previous example to the situation where neither the mean nor the variance of a sequence of independent Gaussian random variables is known. The likelihood function is, in this case,

p(r | θ) = ∏_{l=0}^{L−1} (1/√(2πθ_2)) exp[ −(1/(2θ_2)) (r(l) − θ_1)² ]

Evaluating the partial derivatives of the logarithm of this quantity, we find the following set of two equations to solve for θ_1, representing the mean, and θ_2, representing the variance.¹⁰

(1/θ_2) ∑_{l=0}^{L−1} (r(l) − θ_1) = 0

−L/(2θ_2) + (1/(2θ_2²)) ∑_{l=0}^{L−1} (r(l) − θ_1)² = 0

¹⁰ The variance rather than the standard deviation is represented by θ_2. The mathematics is messier and the estimator has less attractive properties in the latter case. This problem (Exercise 3.11.5) illustrates this point.


The solution of this set of equations is easily found to be

θ̂_ML,1 = (1/L) ∑_{l=0}^{L−1} r(l)

θ̂_ML,2 = (1/L) ∑_{l=0}^{L−1} ( r(l) − θ̂_ML,1 )²

The expected value of θ̂_ML,1 equals the actual value of θ_1; thus, this estimate is unbiased. However, the expected value of the estimate of the variance equals θ_2 (L − 1)/L. The estimate of the variance is biased, but asymptotically unbiased. This bias can be removed by replacing the normalization of L in the averaging computation for θ̂_ML,2 by L − 1.
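The bias factor (L − 1)/L is easy to see by simulation; the sketch below assumes illustrative values of the true mean, the true variance, and L.

    import numpy as np

    rng = np.random.default_rng(5)

    mean_true, var_true, L, trials = 1.0, 2.0, 8, 200000   # illustrative values
    r = mean_true + np.sqrt(var_true) * rng.standard_normal((trials, L))

    theta1_ml = r.mean(axis=1)                                    # ML estimate of the mean
    theta2_ml = np.mean((r - theta1_ml[:, None]) ** 2, axis=1)    # ML estimate of the variance

    print(np.mean(theta1_ml))                   # close to the true mean: unbiased
    print(np.mean(theta2_ml))                   # close to var_true * (L-1)/L: biased
    print(np.mean(theta2_ml) * L / (L - 1))     # bias removed by the L-1 normalization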

3.6 Cramér-Rao Bound¹¹

The mean-squared error for any estimate of a nonrandom parameter has a lower bound, the Cramér-Rao Bound (Cramér (1946), pp. 474-477) [8], which defines the ultimate accuracy of any estimation procedure. This lower bound, as shown later, is intimately related to the maximum likelihood estimator.

We seek a "bound" on the mean-squared error matrix M defined to be

M = E[ (θ̂ − θ)(θ̂ − θ)^T ] = E[ε ε^T]

A matrix is "lower bounded" by a second matrix if the difference between the two is a non-negative definite matrix. Define the column matrix x to be

x = ( θ̂ − θ − b(θ)
      ∇_θ ln p_{r|θ}(r) )

where b(θ) denotes the column matrix of estimator biases. To derive the Cramér-Rao bound, evaluate E[x x^T].

E[x x^T] = ( M − b b^T           I + ∇_θ(b)
             (I + ∇_θ(b))^T      F          )

where ∇_θ(b) represents the matrix of partial derivatives of the bias, ∂b_i/∂θ_j, and the matrix F is the Fisher information matrix

F = E[ ∇_θ( ln p_{r|θ}(r) ) ∇_θ( ln p_{r|θ}(r) )^T ] (3.24)

Note that this matrix can alternatively be expressed as

F = −E[ ∇_θ ∇_θ^T ( ln p_{r|θ}(r) ) ]

The notation ∇_θ ∇_θ^T means the matrix of all second partials of the quantity it operates on (the gradient of the gradient). This matrix is known as the Hessian. Demonstrating the equivalence of these two forms for the Fisher information is quite easy. Because ∫ p_{r|θ}(r) dr = 1 for all choices of the parameter vector, the

¹¹ This content is available online at <http://cnx.org/content/m11266/1.5/>.


gradient of the expression equals zero. Furthermore, ∇_θ( ln p_{r|θ}(r) ) = ∇_θ( p_{r|θ}(r) ) / p_{r|θ}(r). Combining these results yields

∫ ∇_θ( ln p_{r|θ}(r) ) p_{r|θ}(r) dr = 0

Evaluating the gradient of this quantity (using the chain rule) also yields zero.

∫ [ ∇_θ ∇_θ^T ( ln p_{r|θ}(r) ) p_{r|θ}(r) + ∇_θ( ln p_{r|θ}(r) ) ∇_θ( ln p_{r|θ}(r) )^T p_{r|θ}(r) ] dr = 0

or

E[ ∇_θ( ln p_{r|θ}(r) ) ∇_θ( ln p_{r|θ}(r) )^T ] = −E[ ∇_θ ∇_θ^T ( ln p_{r|θ}(r) ) ]

Calculating the expected value of the Hessian is somewhat easier than finding the expected value of the outer product of the gradient with itself. In the scalar case, we have

E[ ( ∂ln p_{r|θ}(r) / ∂θ )² ] = −E[ ∂²ln p_{r|θ}(r) / ∂θ² ]

Returning to the derivation, the matrix E[x x^T] is non-negative definite because it is a correlation matrix. Thus, for any column matrix α, the quadratic form α^T E[x x^T] α is non-negative. Choose a form for α that simplifies the quadratic form. A convenient choice is

α = ( β
      −F^{-1} (I + ∇_θ(b))^T β )

where β is an arbitrary column matrix. The quadratic form becomes in this case

α^T E[x x^T] α = β^T [ M − b b^T − (I + ∇_θ(b)) F^{-1} (I + ∇_θ(b))^T ] β

As this quadratic form must be non-negative, the matrix expression enclosed in brackets must be non-negative definite. We thus obtain the well-known Cramér-Rao bound on the mean-square error matrix.

E[ε ε^T] ≥ b(θ) b(θ)^T + (I + ∇_θ(b)) F^{-1} (I + ∇_θ(b))^T (3.25)

This form for the Cramér-Rao Bound does not mean that each term in the matrix of squared errors is greater than the corresponding term in the bounding matrix. As stated earlier, this expression means that the difference between these matrices is non-negative definite. For a matrix to be non-negative definite, each term on the main diagonal must be non-negative. The elements of the main diagonal of E[ε ε^T] are the squared errors of the estimate of the individual parameters. Thus, for each parameter, the mean-squared estimation error can be no smaller than

E[ (θ̂_i − θ_i)² ] ≥ b_i²(θ) + ( (I + ∇_θ(b)) F^{-1} (I + ∇_θ(b))^T )_{i,i}

This bound simplifies greatly if the estimator is unbiased (b = 0). In this case, the Cramér-Rao bound becomes

E[ (θ̂_i − θ_i)² ] ≥ (F^{-1})_{i,i}

Thus, the mean-squared error for each parameter in a multiple-parameter, unbiased-estimator problem can be no smaller than the corresponding diagonal term in the inverse of the Fisher information matrix. In such problems, the estimate's error characteristics of any parameter become intertwined with the other

Available for free at Connexions <http://cnx.org/content/col11382/1.1>

Page 72: Statistical Signal Processing · 2012. 10. 29. · 1.1 oundationsF of Probability Theory: Basic De nitions 1 1.1.1 Basic De nitions The basis of probability theory is a set of events

66 CHAPTER 3. ESTIMATION THEORY

parameters in a complicated way. Any estimator satisfying the Cramér-Rao bound with equality is said tobe ecient.

Example 3.6
Let's evaluate the Cramér-Rao bound for the example we have been discussing: the estimation of the mean and variance of a length-L sequence of statistically independent Gaussian random variables. Let the estimate of the mean θ₁ be the sample average, θ̂₁ = ∑_l r(l)/L; as shown in the last example, this estimate is unbiased. Let the estimate of the variance θ₂ be the unbiased estimate θ̂₂ = ∑_l (r(l) − θ̂₁)² / (L − 1). Each term in the Fisher information matrix F is given by the expected value of the paired products of derivatives of the logarithm of the likelihood function.

F_{i,j} = E[ (∂ ln p_{r|θ}(r)/∂θ_i) (∂ ln p_{r|θ}(r)/∂θ_j) ]

The logarithm of the likelihood function is

ln p_{r|θ}(r) = −(L/2) ln(2πθ₂) − (1/(2θ₂)) ∑_{l=0}^{L−1} (r(l) − θ₁)²

its partial derivatives are

∂ ln p_{r|θ}(r)/∂θ₁ = (1/θ₂) ∑_{l=0}^{L−1} (r(l) − θ₁)   (3.26)

∂ ln p_{r|θ}(r)/∂θ₂ = −L/(2θ₂) + (1/(2θ₂²)) ∑_{l=0}^{L−1} (r(l) − θ₁)²   (3.27)

and its second partials are

∂² ln p_{r|θ}(r)/∂θ₁² = −L/θ₂

∂² ln p_{r|θ}(r)/∂θ₁∂θ₂ = −(1/θ₂²) ∑_{l=0}^{L−1} (r(l) − θ₁)

∂² ln p_{r|θ}(r)/∂θ₂∂θ₁ = −(1/θ₂²) ∑_{l=0}^{L−1} (r(l) − θ₁)

∂² ln p_{r|θ}(r)/∂θ₂² = L/(2θ₂²) − (1/θ₂³) ∑_{l=0}^{L−1} (r(l) − θ₁)²

The Fisher information matrix has the surprisingly simple form

F = [ L/θ₂     0          ]
    [ 0        L/(2θ₂²)   ]

its inverse is also a diagonal matrix with the elements on the main diagonal equaling the reciprocals of those in the original matrix. Because of the zero-valued off-diagonal entries in the Fisher information matrix, the errors between the corresponding estimates are not interdependent. In this problem, the mean-squared estimation errors can be no smaller than

E[(θ̂₁ − θ₁)²] ≥ θ₂/L

E[(θ̂₂ − θ₂)²] ≥ 2θ₂²/L
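A minimal Monte Carlo sketch of this example, assuming numpy and arbitrarily chosen values of θ₁, θ₂, and L (none taken from the text), compares the empirical mean-squared errors of the two estimates with the bounds θ₂/L and 2θ₂²/L.

```python
import numpy as np

# Compare Monte Carlo mean-squared errors of the sample mean and the
# unbiased variance estimate with the Cramer-Rao bounds of Example 3.6.
rng = np.random.default_rng(1)
theta1, theta2, L, trials = 0.5, 2.0, 25, 100_000

r = rng.normal(theta1, np.sqrt(theta2), size=(trials, L))
mean_est = r.mean(axis=1)              # theta1_hat: sample average
var_est = r.var(axis=1, ddof=1)        # theta2_hat: unbiased variance estimate

print("mean:     MSE =", np.mean((mean_est - theta1) ** 2), " bound =", theta2 / L)
print("variance: MSE =", np.mean((var_est - theta2) ** 2), " bound =", 2 * theta2**2 / L)
```

The sample average meets its bound, while the unbiased variance estimate sits slightly above 2θ₂²/L, consistent with the efficiency discussion that follows.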

Note that nowhere in the preceding example did the form of the estimator enter into the computation of the bound. The only quantity used in the computation of the Cramér-Rao bound is the logarithm of the likelihood function, which is a consequence of the problem statement, not of how it is solved. Only in the case of unbiased estimators is the bound independent of the estimator used (which is why the example assumed an unbiased estimator for the variance). Because of this property, the Cramér-Rao bound is frequently used to assess the performance limits that can be obtained with an unbiased estimator in a particular problem. When bias is present, the exact form of the estimator's bias explicitly enters the computation of the bound. All too frequently, the unbiased form is used in situations where the existence of an unbiased estimator can be questioned. As we shall see, one such problem is time-delay estimation, presumably of some importance to the reader. This misapplication of the unbiased Cramér-Rao bound arises from desperation: the estimator is so complicated and nonlinear that computing the bias is nearly impossible. As shown in this problem (Exercise 3.11.6), biased estimators can yield mean-squared errors smaller as well as larger than the unbiased version of the Cramér-Rao bound. Consequently, desperation can yield misinterpretation when a general result is misapplied.

In the single-parameter estimation problem, the Cramér-Rao bound incorporating bias has the well-known form (this form differs somewhat from that originally given by Cramér (1946), p. 480 [8]; his derivation ignores the additive bias term bb^T)

E[ε²] ≥ b² + (1 + db/dθ)² / E[ (∂ ln p_{r|θ}(r)/∂θ)² ]   (3.28)

Note that the sign of the bias's derivative determines whether this bound is larger or potentially smaller than the unbiased version, which is obtained by setting the bias term to zero.

3.6.1 Efficiency

An interesting question arises: when, if ever, is the bound satisfied with equality? Recalling the details of the derivation of the bound, equality results when the quantity E[α^T x x^T α] equals zero. As this quantity is the expected value of the square of α^T x, it can only equal zero if α^T x = 0. Substituting in the form of the column matrices α and x, equality in the Cramér-Rao bound results whenever

∇_θ(ln p_{r|θ}(r)) = ( (I + ∇_θ(b))^T )⁻¹ F (θ̂(r) − θ − b)   (3.29)

This complicated expression means that only if the estimation problem (as expressed by the conditional density) has the form of the right side of this equation can the mean-squared error equal the Cramér-Rao bound. In particular, the gradient of the log-likelihood function can depend on the observations only through the estimator. In all other problems, the Cramér-Rao bound is a lower bound but not a tight one: no estimator can have error characteristics that equal it. In such cases, we have limited insight into the ultimate limitations on estimation error size with the Cramér-Rao bound. However, consider the case where the estimator is unbiased (b = 0). In addition, note that the maximum likelihood estimate occurs when the gradient of the logarithm of the likelihood function equals zero: ∇_θ(ln p_{r|θ}(r)) = 0 when θ = θ̂_ML. In this case, the condition for equality in the Cramér-Rao bound becomes

F (θ̂ − θ̂_ML) = 0

As the Fisher information matrix is positive-definite, we conclude that if the Cramér-Rao bound can be satisfied with equality, only the maximum likelihood estimate will achieve it. To use estimation-theoretic terminology, if an efficient estimate exists, it is the maximum likelihood estimate. This result stresses the importance of maximum likelihood estimates, despite the seemingly ad hoc manner by which they are defined.

Example 3.7
Consider the Gaussian example being examined so frequently in this section. The components of the gradient of the logarithm of the likelihood function were given earlier by (3.26) and (3.27). These expressions can be rearranged to reveal

[ ∂ ln p_{r|θ}(r)/∂θ₁ ]   [ (L/θ₂) ( (1/L) ∑_l r(l) − θ₁ )               ]
[ ∂ ln p_{r|θ}(r)/∂θ₂ ] = [ −L/(2θ₂) + (1/(2θ₂²)) ∑_l (r(l) − θ₁)²       ]   (3.30)

The first component, which corresponds to the estimate of the mean, is expressed in the form required for the existence of an efficient estimate. The second component, the partial with respect to the variance θ₂, cannot be rewritten in a similar fashion. No unbiased, efficient estimate of the variance exists in this problem. The mean-squared error of the variance's unbiased estimate, but not the maximum likelihood estimate, is lower-bounded by 2θ₂²/(L − 1). This error is strictly greater than the Cramér-Rao bound of 2θ₂²/L. As no unbiased estimate of the variance can have a mean-squared error equal to the Cramér-Rao bound (no efficient estimate exists for the variance in the Gaussian problem), one presumes that the closeness of the error of our unbiased estimator to the bound implies that it possesses the smallest squared error of any estimate. This presumption may, of course, be incorrect.

3.6.2 Properties of the Maximum Likelihood Estimator

The maximum likelihood estimate is the most used estimation technique for nonrandom parameters, not only because of its close linkage to the Cramér-Rao bound, but also because it has desirable asymptotic properties in the context of any problem (Cramér (1946), pp. 500-506) [8].

1. The maximum likelihood estimate is at least asymptotically unbiased. It may be unbiased for any number of observations (as in the estimation of the mean of a sequence of independent random variables) for some problems.
2. The maximum likelihood estimate is consistent.
3. The maximum likelihood estimate is asymptotically efficient. As more and more data are incorporated into an estimate, the Cramér-Rao bound accurately projects the best attainable error and the maximum likelihood estimate has those optimal characteristics.
4. Asymptotically, the maximum likelihood estimate is distributed as a Gaussian random variable. Because of the previous properties, the mean asymptotically equals the parameter and the covariance matrix is (L F(θ))⁻¹.

Most would agree that a "good" estimator should have these properties. What these results do not provide is an assessment of how many observations are needed for the asymptotic results to apply to some specified degree of precision. Consequently, they should be used with caution; for instance, some other estimator may have a smaller mean-squared error than the maximum likelihood estimate for a modest number of observations.
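The following sketch, with illustrative parameter values and numpy assumed, hints at the first, second, and fourth properties for the maximum likelihood variance estimate of the running Gaussian example: its bias shrinks and its error variance approaches the Cramér-Rao value 2θ₂²/L as L grows.

```python
import numpy as np

# Bias and error variance of the ML variance estimate (1/L normalization)
# for increasing record lengths L.
rng = np.random.default_rng(2)
theta2, trials = 2.0, 50_000

for L in (5, 50, 500):
    r = rng.normal(0.0, np.sqrt(theta2), size=(trials, L))
    ml_var = np.mean((r - r.mean(axis=1, keepdims=True)) ** 2, axis=1)
    print(f"L={L:3d}  bias={ml_var.mean() - theta2:+.4f}  "
          f"error variance={ml_var.var():.4f}  2*theta2^2/L={2 * theta2**2 / L:.4f}")
```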

3.7 Signal Parameter Estimation

One extension of parametric estimation theory necessary for its application to array processing is the estimation of signal parameters. We assume that we observe a signal s(l, θ), whose characteristics are known save a few parameters θ, in the presence of noise. Signal parameters, such as amplitude, time origin, and frequency if the signal is sinusoidal, must be determined in some way. In many cases of interest, we would find it difficult to justify a particular form for the unknown parameters' a priori density. Because of such uncertainties, the minimum mean-squared error and maximum a posteriori estimators cannot be used in many cases. The minimum mean-squared error linear estimator does not require this density, but it is most fruitfully used when the unknown parameter appears in the problem in a linear fashion (such as signal amplitude, as we shall see).

3.7.1 Linear Minimum Mean-Squared Error Estimator

The only parameter that is linearly related to a signal is the amplitude. Consider, therefore, the problem where the observations at an array's output are modeled as

∀l, l ∈ {0, …, L − 1}: r(l) = θ s(l) + n(l)   (3.31)

The signal waveform s(l) is known and its energy is normalized to unity (∑_l s²(l) = 1). The linear estimate of the signal's amplitude is assumed to be of the form θ̂ = ∑_l h(l) r(l), where h(l) minimizes the mean-squared error. To use the Orthogonality Principle expressed by equation (3.16), an inner product must be defined for scalars. Little choice avails itself but multiplication as the inner product of two scalars. The Orthogonality Principle states that the estimation error must be orthogonal to all linear transformations defining the kind of estimator being sought.

∀h(·): E[ ( ∑_{l=0}^{L−1} h_LIN(l) r(l) − θ ) ∑_{k=0}^{L−1} h(k) r(k) ] = 0

Manipulating this equation to make the universality constraint more transparent results in

∀h(·): ∑_{k=0}^{L−1} h(k) E[ ( ∑_{l=0}^{L−1} h_LIN(l) r(l) − θ ) r(k) ] = 0

Written in this way, the expected value must be 0 for each value of k to satisfy the constraint. Thus, the quantity h_LIN(·) of the estimator of the signal's amplitude must satisfy

∀k: ∑_{l=0}^{L−1} h_LIN(l) E[r(l) r(k)] = E[θ r(k)]

Assuming that the signal's amplitude has zero mean and is statistically independent of the zero-mean noise, the expected values in this equation are given by

E[r(l) r(k)] = σ_θ² s(l) s(k) + K_n(k, l)

E[θ r(k)] = σ_θ² s(k)

where K_n(k, l) is the covariance function of the noise. The equation that must be solved for the unit-sample response h_LIN(·) of the optimal linear MMSE estimator of signal amplitude becomes

∀k: ∑_{l=0}^{L−1} h_LIN(l) K_n(k, l) = σ_θ² s(k) ( 1 − ∑_{l=0}^{L−1} h_LIN(l) s(l) )   (3.32)

This equation is easily solved once phrased in matrix notation. Letting K_n denote the covariance matrix of the noise, s the signal vector, and h_LIN the vector of coefficients, this equation becomes

K_n h_LIN = σ_θ² (1 − s^T h_LIN) s


The matched filter for colored-noise problems consisted of the dot product between the vector of observations and K_n⁻¹ s (see the detector result (4.24)). Assume that the solution to the linear estimation problem is proportional to the detection-theoretic one: h_LIN = c K_n⁻¹ s, where c is a scalar constant. This proposed solution satisfies the equation; the MMSE estimate of signal amplitude corresponds to applying a matched filter to the observations with

h_LIN = ( σ_θ² / (1 + σ_θ² s^T K_n⁻¹ s) ) K_n⁻¹ s   (3.33)

The mean-squared estimation error of signal amplitude is given by

E[ε²] = σ_θ² − E[ θ ∑_{l=0}^{L−1} h_LIN(l) r(l) ]

Substituting the vector expression for h_LIN yields the result that the mean-squared estimation error equals the proportionality constant c defined earlier.

E[ε²] = σ_θ² / (1 + σ_θ² s^T K_n⁻¹ s)

Thus, the linear filter that produces the optimal estimate of signal amplitude is equivalent to the matched filter used to detect the signal's presence. We have found this situation to occur when estimates of unknown parameters are needed to solve the detection problem (see Detection in the Presence of Uncertainties (Section 4.12)). If we had not assumed the noise to be Gaussian, however, this detection-theoretic result would be different, but the estimator would be unchanged. To repeat, this invariance occurs because the linear MMSE estimator requires no assumptions on the noise's amplitude characteristics.
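A minimal sketch of the estimator in (3.33), assuming numpy, a unit-energy sinusoidal waveform, and a first-order (geometric) noise covariance chosen purely for illustration; it checks by simulation that the mean-squared error equals the proportionality constant c.

```python
import numpy as np

# Linear MMSE amplitude estimator h_LIN = c * Kn^{-1} s with
# c = sigma_theta^2 / (1 + sigma_theta^2 * s' Kn^{-1} s), applied to
# simulated observations r = theta * s + n with colored Gaussian noise.
rng = np.random.default_rng(3)
L, sigma_theta2, trials = 32, 1.0, 50_000

s = np.sin(2 * np.pi * 0.1 * np.arange(L))
s /= np.linalg.norm(s)                               # unit energy
Kn = 0.5 ** np.abs(np.subtract.outer(np.arange(L), np.arange(L)))
Kn_inv_s = np.linalg.solve(Kn, s)
c = sigma_theta2 / (1 + sigma_theta2 * (s @ Kn_inv_s))
h_lin = c * Kn_inv_s

theta = rng.normal(0.0, np.sqrt(sigma_theta2), trials)
r = theta[:, None] * s + rng.multivariate_normal(np.zeros(L), Kn, trials)
theta_hat = r @ h_lin

print("empirical mean-squared error =", np.mean((theta_hat - theta) ** 2))
print("predicted error (constant c) =", c)
```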

Example 3.8
Let the noise be white so that its covariance matrix is proportional to the identity matrix (K_n = σ_n² I). The weighting factor in the minimum mean-squared error linear estimator is proportional to the signal waveform.

h_LIN(l) = ( σ_θ² / (σ_n² + σ_θ²) ) s(l)

θ̂_LIN = ( σ_θ² / (σ_n² + σ_θ²) ) ∑_{l=0}^{L−1} s(l) r(l)

This proportionality constant depends only on the relative variances of the noise and the parameter. If the noise variance can be considered to be much smaller than the a priori variance of the amplitude, then this constant does not depend on these variances and equals unity. Otherwise, the variances must be known.

We find the mean-squared estimation error to be

E[ε²] = σ_θ² / (1 + σ_θ²/σ_n²)

This error is significantly reduced from its nominal value σ_θ² only when the variance of the noise is small compared with the a priori variance of the amplitude. Otherwise, this admittedly optimum amplitude estimate performs poorly, and we might as well have ignored the data and "guessed" that the amplitude was zero (in other words, the problem is difficult in this case).


3.8 Maximum Likelihood Estimators of Signal Parameters

Many situations are either not well suited to linear estimation procedures, or the parameter is not well described as a random variable. For example, signal delay is observed nonlinearly and usually no a priori density can be assigned. In such cases, maximum likelihood estimators are more frequently used. Because of the Cramér-Rao bound (Section 3.6), fundamental limits on parameter estimation performance can be derived for any signal parameter estimation problem where the parameter is not random.

Assume that the data are expressed as a signal observed in the presence of additive Gaussian noise.

∀l, l ∈ {0, …, L − 1}: r(l) = s(l, θ) + n(l)   (3.34)

The vector of observations r is formed from the data in the obvious way. Evaluating the logarithm of the observation vector's joint density,

ln p_{r|θ}(r) = −(1/2) ln det(2πK_n) − (1/2) (r − s(θ))^T K_n⁻¹ (r − s(θ))

where s(θ) is the signal vector having P unknown parameters, and K_n is the covariance matrix of the noise. The partial derivative of this likelihood function with respect to the ith parameter θ_i is, for real-valued signals,

∂ ln p_{r|θ}(r)/∂θ_i = (r − s(θ))^T K_n⁻¹ ∂s(θ)/∂θ_i

and, for complex-valued ones,

∂ ln p_{r|θ}(r)/∂θ_i = Re[ (r − s(θ))^H K_n⁻¹ ∂s(θ)/∂θ_i ]

If the maximum of the likelihood function can be found by setting its gradient to 0, the maximum likelihood estimate of the parameter vector is the solution of the set of equations

∀i, i ∈ {1, …, P}: (r − s(θ))^T K_n⁻¹ ∂s(θ)/∂θ_i |_{θ = θ̂_ML} = 0   (3.35)

The Cramér-Rao bound depends on the evaluation of the Fisher information matrix F. The elements of this matrix are found to be

∀i, j, i, j ∈ {1, …, P}: F_{i,j} = (∂s(θ)/∂θ_i)^T K_n⁻¹ (∂s(θ)/∂θ_j)   (3.36)

Further computation of the Cramér-Rao bound's components is problem dependent if more than one parameter is involved and the off-diagonal terms of F are nonzero. If only one parameter is unknown, the Cramér-Rao bound is given by

E[ε²] ≥ b²(θ) + (1 + db(θ)/dθ)² / ( (∂s(θ)/∂θ)^T K_n⁻¹ (∂s(θ)/∂θ) )   (3.37)

When the signal depends on the parameter nonlinearly (which constitutes the interesting case), the maximum likelihood estimate is usually biased. Thus, the numerator of the expression for the bound cannot be ignored. One interesting special case occurs when the noise is white. The Cramér-Rao bound becomes

E[ε²] ≥ b²(θ) + σ_n² (1 + db(θ)/dθ)² / ∑_{l=0}^{L−1} (∂s(l, θ)/∂θ)²


The derivative of the signal with respect to the parameter can be interpreted as the sensitivity of the signalto the parameter. The mean-squared estimation error depends on the "integrated" squared sensitivity: Thegreater this sensitivity, the smaller the bound.
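To illustrate how the white-noise form of the bound is evaluated, the sketch below (an assumption-laden example, not from the text) takes the unknown parameter to be the frequency of a sinusoid and computes σ_n²/∑(∂s(l, θ)/∂θ)² for a few frequencies, ignoring the bias term.

```python
import numpy as np

# White-noise Cramer-Rao bound (unbiased form) for the frequency omega of
# s(l, omega) = cos(omega * l): the bound is sigma_n^2 / sum (ds/domega)^2.
L, sigma_n2 = 100, 0.1
l = np.arange(L)

def crb_frequency(omega):
    ds = -l * np.sin(omega * l)          # sensitivity of the signal to omega
    return sigma_n2 / np.sum(ds ** 2)

for omega in (0.1, 0.5, 1.0):
    print(f"omega = {omega:.1f}   bound = {crb_frequency(omega):.2e}")
```

The bound is driven entirely by the accumulated squared sensitivity in the denominator, as noted above.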

For an efficient estimate of a signal parameter to exist, the estimate must satisfy the condition we derived earlier (3.29).

∇_θ(s(θ))^T K_n⁻¹ (r − s(θ)) ?= ( (I + ∇_θ(b))^T )⁻¹ ( ∇_θ(s(θ))^T K_n⁻¹ ∇_θ(s(θ)) ) (θ̂(r) − θ − b)

Because of the complexity of this requirement, we quite rightly question the existence of any efficient estimator, especially when the signal depends nonlinearly on the parameter (see this problem (Exercise 3.11.10)).

Example 3.9
Let the unknown parameter be the signal's amplitude; the signal is expressed as θ s(l) and is observed in an array's output in the presence of additive noise. The maximum likelihood estimate of the amplitude is the solution of the equation

(r − θ̂_ML s)^T K_n⁻¹ s = 0

The form of this equation suggests that the maximum likelihood estimate is efficient. The amplitude estimate is given by

θ̂_ML = r^T K_n⁻¹ s / (s^T K_n⁻¹ s)

The form of this estimator is precisely that of the matched filter derived in the colored-noise situation (see equation (4.24)). The expected value of the estimate equals the actual amplitude. Thus the bias is zero and the Cramér-Rao bound is given by

E[ε²] ≥ (s^T K_n⁻¹ s)⁻¹

The condition for an efficient estimate becomes

s^T K_n⁻¹ (r − θ s) ?= s^T K_n⁻¹ s (θ̂_ML − θ)

whose veracity we can easily verify.

In the special case where the noise is white, the estimator has the form θ̂_ML = r^T s, and the Cramér-Rao bound equals σ_n² (the nominal signal is assumed to have unit energy). The maximum likelihood estimate of the amplitude has fixed error characteristics that do not depend on the actual signal amplitude. A signal-to-noise ratio for the estimate, defined to be θ²/E[ε²], equals the signal-to-noise ratio of the observed signal.

When the amplitude is well described as a random variable, its linear minimum mean-squared error estimator has the form

θ̂_LIN = σ_θ² r^T K_n⁻¹ s / (1 + σ_θ² s^T K_n⁻¹ s)

which in the white-noise case becomes a weighted version of the maximum likelihood estimate (see Example 3.8).

θ̂_LIN = ( σ_θ² / (σ_θ² + σ_n²) ) r^T s

Seemingly, these two estimators are being used to solve the same problem: estimating the amplitude of a signal whose waveform is known. They make very different assumptions, however, about the nature of the unknown parameter; in one it is a random variable (and thus has a variance), whereas in the other it is not (and variance makes no sense). Despite this fundamental difference, the computations for each estimator are equivalent. It is reassuring that different approaches to solving similar problems yield similar procedures.
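A short simulation sketch of this example, assuming numpy and an arbitrary colored-noise covariance: it computes θ̂_ML = r^T K_n⁻¹ s / (s^T K_n⁻¹ s), confirms that its average equals the true amplitude, and compares its error variance with the bound (s^T K_n⁻¹ s)⁻¹.

```python
import numpy as np

# ML amplitude estimate in colored Gaussian noise and its Cramer-Rao bound.
rng = np.random.default_rng(4)
L, theta, trials = 32, 1.5, 50_000

s = np.cos(2 * np.pi * 0.2 * np.arange(L))
s /= np.linalg.norm(s)                               # unit energy
Kn = 0.8 ** np.abs(np.subtract.outer(np.arange(L), np.arange(L)))
Kn_inv_s = np.linalg.solve(Kn, s)
crb = 1.0 / (s @ Kn_inv_s)

r = theta * s + rng.multivariate_normal(np.zeros(L), Kn, trials)
theta_ml = (r @ Kn_inv_s) / (s @ Kn_inv_s)

print("average estimate =", theta_ml.mean(), "  (true value", theta, ")")
print("error variance   =", theta_ml.var(), "  CRB =", crb)
```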

3.9 Time-Delay Estimation

An important signal parameter estimation problem is time-delay estimation. Here the unknown is the time origin of the signal: s(l, θ) = s(l − θ). The duration of the signal (the domain over which the signal is defined) is assumed brief compared with the observation interval L. Although in continuous time the signal delay is a continuous-valued variable, in discrete time it is not. Consequently, the maximum likelihood estimate cannot be found by differentiation, and we must determine the maximum likelihood estimate of signal delay by the most fundamental expression of the maximization procedure. Assuming Gaussian noise, the maximum likelihood estimate of delay is the solution of

min_θ (r − s(θ))^T K_n⁻¹ (r − s(θ))

The term s^T K_n⁻¹ s is usually assumed not to vary with the presumed time origin of the signal because of the signal's short duration. If the noise is white, this term is constant except near the "edges" of the observation interval. If not white, the kernel of this quadratic form is equivalent to a whitening filter. As discussed later (Section 4.11), this filter may be time varying. For noise spectra that are rational and have only poles, the whitening filter's unit-sample response varies only near the edges (see the example (4.24)). Thus, near the edges, this quadratic form varies with the presumed delay and the maximization is analytically difficult. Taking the "easy way out" by ignoring edge effects, the estimate is the solution of

max_θ r^T K_n⁻¹ s(θ)

Thus, the delay estimate is the signal time origin that maximizes the matched filter's output.

In addition to the complexity of finding the maximum likelihood estimate, the discrete-valued nature of the parameter also calls into question the use of the Cramér-Rao bound. One of the fundamental assumptions of the bound's derivation is the differentiability of the likelihood function with respect to the parameter. Mathematically, a sequence cannot be differentiated with respect to the integers. A sequence can be differentiated with respect to its argument if we consider the variable to be continuous valued. This approximation can be used only if the sampling interval, unity for the integers, is dense with respect to variations of the sequence. This condition means that the signal must be oversampled to apply the Cramér-Rao bound in a meaningful way. Under these conditions, the mean-squared estimation error for unbiased estimators can be no smaller than the Cramér-Rao bound, which is given by

E[ε²] ≥ 1 / ( ∑_{k,l} (K_n⁻¹)_{k,l} s′(k − θ) s′(l − θ) )

which, in the white-noise case, becomes

E[ε²] ≥ σ_n² / ∑_l (s′(l))²   (3.38)

Here, s′ (·) denotes the "derivative" of the discrete-time signal. To justify using this Cramér-Rao bound, wemust face the issue of whether an unbiased estimator for time delay exists. No general answer exists; eachestimator, including the maximum likelihood one, must be examined individually.


Example 3.10
Assume that the noise is white. Because of this assumption, we determine the time delay by maximizing the match-filtered observations.

θ̂_ML = argmax_θ ∑_l r(l) s(l − θ)

The number of terms in the sum equals the signal duration. Figure 3.1 illustrates the match-filtered output in two separate situations; in one the signal has a relatively low-frequency spectrum as compared with the second.

Figure 3.1: The matched filter outputs are shown for two separate signal situations. In each case, the observation interval (100 samples), the signal's duration (50 samples), and energy (unity) are the same. The difference lies in the signal waveform; both are sinusoids, with the first having a frequency of (2π)0.04 and the second (2π)0.25. Each output is the signal's autocorrelation function. A few broad peaks characterize the low-frequency example, whereas many narrow peaks are found in the high-frequency one.

Because of the symmetry of the autocorrelation function, the estimate should be unbiased so long as the autocorrelation function is completely contained within the observation interval. Direct proof of this claim is left to the masochistic reader. For sinusoidal signals of energy E and frequency ω₀, the Cramér-Rao bound is given by

E[ε²] ≥ σ_n² / (ω₀² E)

This bound on the error is accurate only if the measured maximum frequently occurs in the dominant peak of the signal's autocorrelation function. Otherwise, the maximum likelihood estimate "skips" a cycle and produces values concentrated near one of the smaller peaks. The interval between zero crossings of the dominant peak is π/(2ω₀); for the estimate to stay within the dominant peak, the root-mean-squared error predicted by the bound must be smaller than this interval, so the signal-to-noise ratio E/σ_n² must exceed 4/π² (about 0.5). Remember that this result implicitly assumed a low-frequency sinusoid. The second example demonstrates that cycle skipping occurs more frequently than this guideline suggests when a high-frequency sinusoid is used.
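The sketch below, with assumed waveform, noise level, and record lengths (none taken from the figure), mimics this example: it forms the matched-filter delay estimate for a high-frequency sinusoidal pulse in white Gaussian noise and compares the empirical mean-squared error with the bound (3.38).

```python
import numpy as np

# Matched-filter delay estimation in white noise versus the bound (3.38).
rng = np.random.default_rng(5)
L, dur, f0, sigma_n, trials = 100, 50, 0.25, 0.5, 2000

s = np.cos(2 * np.pi * f0 * np.arange(dur))
s /= np.linalg.norm(s)                        # unit energy
true_delay = 25
s_prime = np.diff(s, append=0.0)              # discrete "derivative" of the signal
crb = sigma_n**2 / np.sum(s_prime**2)

errors = []
for _ in range(trials):
    r = sigma_n * rng.standard_normal(L)
    r[true_delay:true_delay + dur] += s
    corr = [r[d:d + dur] @ s for d in range(L - dur + 1)]
    errors.append(np.argmax(corr) - true_delay)

errors = np.asarray(errors, dtype=float)
print("empirical MSE =", np.mean(errors**2), "   CRB =", crb)
# Cycle skips onto neighboring correlation peaks inflate the mean-squared
# error well beyond the Cramer-Rao prediction at this signal-to-noise ratio.
```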

The size of the errors encountered in the time-delay estimation problem can be more accurately assessed by a bounding technique tailored to the problem: the Ziv-Zakai bound (Weiss and Weinstein [53], Ziv and Zakai [56]). The derivation of this bound relies on results from detection theory (Chazan, Zakai, and Ziv [11]); it is an example of detection and estimation theory complementing each other to advantage. Consider the detection problem in which we must distinguish the signals s(l − τ) and s(l − τ − Δ) while observing them in the presence of white noise that is not necessarily Gaussian. Let hypothesis M₀ represent the case in which the delay, denoted by our parameter symbol θ, is τ and M₁ the case in which θ = τ + Δ. The suboptimum test statistic consists of estimating the delay, then determining the closest a priori delay to the estimate.

θ̂ ≷_{M₀}^{M₁} τ + Δ/2

By using this ad hoc hypothesis test as an essential part of the derivation, the bound can apply to many situations. Furthermore, by not restricting the type of parameter estimate, the bound applies to any estimator. The probability of error for the optimum hypothesis test (derived from the likelihood ratio) is denoted by P_e(τ, Δ). Assuming equally likely hypotheses, the probability of error resulting from the ad hoc test must be greater than that of the optimum.

P_e(τ, Δ) ≤ (1/2) Pr[ε > Δ/2 | M₀] + (1/2) Pr[ε < −Δ/2 | M₁]

Here, ε denotes the estimation error appropriate to the hypothesis.

ε = θ̂ − τ        under M₀
ε = θ̂ − τ − Δ    under M₁

The delay is assumed to range uniformly between 0 and L. Combining this restriction with the hypothesized delays yields bounds on both τ and Δ: 0 ≤ τ < L − Δ and 0 ≤ Δ < L. Simple manipulations show that the integral of this inequality with respect to τ over the possible range of delays is given by

∫₀^{L−Δ} P_e(τ, Δ) dτ ≤ (1/2) ∫₀^L Pr[ |ε| > Δ/2 | M₀ ] dτ

(Here again, the discrete nature of the delay becomes a consideration; this step in the derivation implicitly assumes that the delay is continuous valued. This approximation can be greeted more readily as it involves integration rather than differentiation, as in the Cramér-Rao bound.)

Note that if we define (L/2) P̃(Δ/2) to be the right side of this equation, so that

P̃(Δ/2) = (1/L) ∫₀^L Pr[ |ε| > Δ/2 | M₀ ] dτ

then P̃(·) is the complementary distribution function of the magnitude of the average estimation error (the complementary distribution function of a probability distribution function P(x) is defined to be P̃(x) = 1 − P(x), the probability that a random variable exceeds x). Multiplying P̃(Δ/2) by Δ and integrating, the result is

∫₀^L Δ P̃(Δ/2) dΔ = −2 ∫₀^{L/2} x² (dP̃(x)/dx) dx

The reason for these rather obscure manipulations is now revealed: because P̃(·) is related to the probability distribution function of the absolute error, the right side of this equation is twice the mean-squared error E[ε²]. The general Ziv-Zakai bound for the mean-squared estimation error of signal delay is thus expressed as

E[ε²] ≥ (1/L) ∫₀^L Δ ∫₀^{L−Δ} P_e(τ, Δ) dτ dΔ

In many cases, the optimum probability of error P_e(τ, Δ) does not depend on τ, the time origin of the observations. This lack of dependence is equivalent to ignoring edge effects and simplifies calculation of the bound. Thus, the Ziv-Zakai bound for time-delay estimation relates the mean-squared estimation error for delay to the probability of error incurred by the optimal detector that is deciding whether a nonzero delay is present or not.

E[ε²] ≥ (1/L) ∫₀^L Δ (L − Δ) P_e(Δ) dΔ ≥ (L²/6) P_e(L) − ∫₀^L ( Δ²/2 − Δ³/(3L) ) (dP_e/dΔ) dΔ   (3.39)

To apply this bound to time-delay estimates (unbiased or not), the optimum probability of error for the type of noise and the relative delay between the two signals must be determined. Substituting this expression into either integral yields the Ziv-Zakai bound.

The general behavior of this bound at parameter extremes can be evaluated in some cases. Note that the Cramér-Rao bound in this problem approaches infinity as either the noise variance grows or the observation interval shrinks to 0 (either forces the signal-to-noise ratio to approach 0). This result is unrealistic, as the actual delay is bounded, lying between 0 and L. In this very noisy situation, one should ignore the observations and "guess" any reasonable value for the delay; the estimation error is smaller. The probability of error approaches 1/2 in this situation no matter what the delay Δ may be. Considering the simplified form of the Ziv-Zakai bound, the integral in the second form is 0 in this extreme case.

E[ε²] ≥ L²/12

This is exactly the variance of a random variable uniformly distributed over [0, L]. The Ziv-Zakai bound thus predicts the size of mean-squared errors more accurately than does the Cramér-Rao bound.

Example 3.11
Let the noise be Gaussian of variance σ_n² and the signal have energy E. The probability of error resulting from the likelihood ratio test is given by

P_e(Δ) = Q( √( (E/(2σ_n²)) (1 − ρ(Δ)) ) )

The quantity ρ(Δ) is the normalized autocorrelation function of the signal evaluated at the delay Δ.

ρ(Δ) = (1/E) ∑_l s(l) s(l − Δ)

Evaluation of the Ziv-Zakai bound for a general signal is very difficult in this Gaussian noise case. Fortunately, the normalized autocorrelation function can be bounded by a relatively simple expression to yield a more manageable result. The key quantity 1 − ρ(Δ) in the probability of error expression can be rewritten using Parseval's Theorem.

1 − ρ(Δ) = (1/(2πE)) ∫₀^π 2 |S(ω)|² (1 − cos(ωΔ)) dω

Using the inequality 1 − cos(x) ≤ x²/2, 1 − ρ(Δ) is bounded from above by min{Δ²β²/2, 2}, where β is the root-mean-squared (RMS) signal bandwidth.

β² = ∫_{−π}^{π} ω² |S(ω)|² dω / ∫_{−π}^{π} |S(ω)|² dω   (3.40)

(3.40)

Because Q (·) is a decreasing function, we have Pe (∆) ≥ Q (µmin ∆,∆∗), where µ is a combina-

tion of all of the constants involved in the argument of Q (·): µ =√

Eβ2

4σn2 . This quantity varies with

the product of the signal-to-noise ratio Eσn2 and the squared RMS bandwidth β2. The parameter

Available for free at Connexions <http://cnx.org/content/col11382/1.1>

Page 83: Statistical Signal Processing · 2012. 10. 29. · 1.1 oundationsF of Probability Theory: Basic De nitions 1 1.1.1 Basic De nitions The basis of probability theory is a set of events

77

∆∗ = 2β is known as the critical delay and is twice the reciprocal RMS bandwidth. We can use

this lower bound for the probability of error in the Ziv-Zakai bound to produce a lower bound on themean-squared estimation error. The integral in the rst form of the bound yields the complicated,but computable result

E[ε²] ≥ (L²/6) Q(μ min{L, Δ*}) + (1/(4μ²)) P_{χ₃²}(μ² min{L², Δ*²}) − (2/(3√(2π) L μ³)) ( 1 − (1 + (μ²/2) min{L², Δ*²}) e^{−μ² min{L², Δ*²}/2} )

The quantity P_{χ₃²}(·) is the probability distribution function of a χ² random variable having three degrees of freedom; it has the "closed-form" expression P_{χ₃²}(x) = 1 − Q(√x) − √(x/2) e^{−x/2}. Thus, the threshold effects in this expression for the mean-squared estimation error depend on the relation between the critical delay and the signal duration. In most cases, the minimum equals the critical delay Δ*, with the opposite choice possible for very low bandwidth signals.

Figure 3.2: The Ziv-Zakai bound and the Cramér-Rao bound for the estimation of the time delay of a signal observed in the presence of Gaussian noise, shown as a function of the signal-to-noise ratio. For this plot, L = 20 and β = (2π)0.2. The Ziv-Zakai bound is much larger than the Cramér-Rao bound for signal-to-noise ratios less than 13 dB; the Ziv-Zakai bound can be as much as 30 times larger.

The Ziv-Zakai bound and the Cramér-Rao bound for the time-delay estimation problem are shown in Figure 3.2. Note how the Ziv-Zakai bound matches the Cramér-Rao bound only for large signal-to-noise ratios, where they both equal 1/(4μ²) = σ_n²/(Eβ²). For smaller values, the former bound is much larger and provides a better indication of the size of the estimation errors. These errors are due to the "cycle skipping" phenomenon described earlier. The Ziv-Zakai bound describes them well, whereas the Cramér-Rao bound ignores them.
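The comparison in Figure 3.2 can be approximated numerically. The sketch below, assuming numpy, a particular low-frequency pulse, and Q(x) computed from the complementary error function, evaluates the first integral form of (3.39) with the Gaussian-noise P_e(Δ) of this example and lists it next to the white-noise Cramér-Rao value; all waveform and SNR choices are illustrative.

```python
import numpy as np
from math import erfc, sqrt

# Ziv-Zakai bound (first form of (3.39)) versus the Cramer-Rao value for
# delay estimation of a sinusoidal pulse in white Gaussian noise.
def Q(x):
    return 0.5 * erfc(x / sqrt(2.0))

L, dur, f0 = 100, 50, 0.04
s = np.cos(2 * np.pi * f0 * np.arange(dur))
E = np.sum(s**2)
s_prime = np.diff(s, append=0.0)

lags = np.arange(L)
rho = np.array([np.sum(s[d:] * s[:dur - d]) / E if d < dur else 0.0 for d in lags])

for snr_db in (0, 10, 20):
    sigma_n2 = E / 10 ** (snr_db / 10)
    pe = np.array([Q(sqrt(E * (1 - rd) / (2 * sigma_n2))) for rd in rho])
    zzb = np.sum(lags * (L - lags) * pe) / L       # unit-spaced Riemann sum
    crb = sigma_n2 / np.sum(s_prime**2)
    print(f"SNR = {snr_db:2d} dB   ZZB >= {zzb:8.3f}   CRB = {crb:8.3f}")
```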

3.10 Probability Density Estimation

3.10.1 Probability Density Estimation

Many signal processing algorithms, implicitly or explicitly, assume that the signal and the observation noise are each well described as Gaussian random sequences. Virtually all linear estimation and prediction filters minimize the mean-squared error while not explicitly assuming any form for the amplitude distribution of the signal or noise. In many formal waveform estimation theories where the probability density is, for better or worse, specified, the mean-squared error criterion arises from Gaussian assumptions. A similar situation occurs explicitly in detection theory. The matched filter is provably the optimum detection rule only when the observation noise is Gaussian. When the noise is non-Gaussian, the detector assumes some other form. Much of what has been presented in this chapter is based implicitly on a Gaussian model for both the signal and the noise. When non-Gaussian distributions are assumed, the quantities upon which optimal linear filtering theory is based, covariance functions, no longer suffice to characterize the observations. While the joint amplitude distribution of any zero-mean, stationary Gaussian stochastic process is entirely characterized by its covariance function, non-Gaussian processes require more. Optimal linear filtering results can be applied in non-Gaussian problems, but we should realize that other informative aspects of the process are being ignored.

This discussion would seem to be leading to a formulation of optimal filtering in a non-Gaussian setting. Would that such theories were easy to use; virtually all of them require knowledge of process characteristics that are difficult to measure, and the resulting filters are typically nonlinear [Liptser and Shiryayev: Chapter 8] [43]. Rather than present preliminary results, we take the tack that knowledge is better than ignorance: at least the first-order amplitude distribution of the observed signals should be considered during the signal processing design. If the signal is found to be Gaussian, then linear filtering results can be applied with the knowledge that no other filtering strategy will yield better results. If non-Gaussian, linear filtering can still be used, and the engineer must be aware that future systems might yield "better" results. (Note that linear filtering optimizes the mean-squared error whether the signals involved are Gaussian or not; other error criteria might better capture unexpected changes in signal characteristics, and non-Gaussian processes contain internal statistical structure beyond that described by the covariance function.)

3.10.2 Types

When the observations are discrete-valued or made so by analog-to-digital converters, estimating the probability mass function is straightforward: count the relative number of times each value occurs. Let {r(0), …, r(L−1)} denote a sequence of observations, each of which takes on values from the set A = {a₁, …, a_N}. This set is known as an alphabet and each a_n is a letter in that alphabet. We estimate the probability that an observation equals one of the letters according to

P̂(a_n) = (1/L) ∑_{l=0}^{L−1} I(r(l) = a_n)

where I(·) is the indicator function, equaling one if its argument is true and zero otherwise. This kind of estimate is known in information theory as a type [Cover and Thomas: Chapter 12] [49], and types have remarkable properties. For example, if the observations are statistically independent, the probability that a given sequence occurs equals

Pr[r = {r(0), …, r(L−1)}] = ∏_{l=0}^{L−1} Pr(r(l))

Evaluating the logarithm, we find that

log Pr[r] = ∑_{l=0}^{L−1} log Pr(r(l))

Converting to a sum over letters reveals

log Pr[r] = ∑_{n=0}^{N−1} L P̂(a_n) log Pr(a_n)
          = L ∑_{n=0}^{N−1} P̂(a_n) ( log P̂(a_n) − log( P̂(a_n)/Pr(a_n) ) )
          = −L ( H(P̂) + D(P̂ ‖ Pr) )   (3.41)

which yields

Pr[r] = e^{−L ( H(P̂) + D(P̂ ‖ Pr) )}   (3.42)

Here we introduce the entropy [Cover and Thomas: §2.1] [49] and the Kullback-Leibler distance [see Stein's Lemma (4.14)].

H(P) = −∑_{n=0}^{N−1} P(a_n) log P(a_n)

D(P₁ ‖ P₀) = ∑_{n=0}^{N−1} P₁(a_n) log( P₁(a_n)/P₀(a_n) )

Because the Kullback-Leibler distance is non-negative, equaling zero only when the two probability distributions equal each other, (3.42) is maximized with respect to the assumed distribution Pr by choosing Pr = P̂: the type estimator is the maximum likelihood estimator of Pr.

The number of length-L observation sequences having a given type P̂ approximately equals e^{L H(P̂)}. The probability that a given sequence has a given type approximately equals e^{−L D(P̂ ‖ Pr)}, which means that the probability that a given sequence has a type not equal to the true distribution decays exponentially with the number of observations. Thus, while the coin-flip sequences H,H,H,H,H and T,T,H,H,T are equally likely (assuming a fair coin), the second is more typical because its type is closer to the true distribution.

3.10.3 Histogram Estimators

By far the most used technique for estimating the probability distribution of a continuous-valued random variable is the histogram; more sophisticated techniques are discussed in Silverman [5]. For real-valued data, subdivide the real line into N intervals (r_i, r_{i+1}] having widths δ_i = r_{i+1} − r_i, i = 1, …, N. These regions are called bins and they should encompass the range of values assumed by the data. For large values, the "edge bins" can extend to infinity to catch the overflows. Given L observations of a stationary random sequence {r(l), l = 0, …, L−1}, the histogram estimate h(i) is formed by simply forming a type from the number L_i of these observations that fall into the ith bin and dividing by the binwidth δ_i.

p̂_r(r) = h(1) = L₁/(L δ₁)    if r₁ < r ≤ r₂
         h(2) = L₂/(L δ₂)    if r₂ < r ≤ r₃
         ⋮
         h(N) = L_N/(L δ_N)   if r_N < r ≤ r_{N+1}


The histogram estimate resembles a rectangular approximation to the density. Unless the underlying density has the same form (a rare event), the histogram estimate does not converge to the true density as the number L of observations grows. Presumably, the value of the histogram at each bin converges to the probability that an observation lies in that bin.

lim_{L→∞} L_i/L = ∫_{r_i}^{r_{i+1}} p_r(r) dr

To demonstrate this intuitive feeling, we compactly denote the histogram estimate by using indicator functions. An indicator function I_i(r(l)) for the ith bin equals one if the observation r(l) lies in the bin and is zero otherwise. The estimate is simply the average of the indicator functions across the observations.

h(i) = (1/(Lδ_i)) ∑_{l=0}^{L−1} I_i(r(l))

The expected value of I_i(r(l)) is simply the probability P_i that an observation lies in the ith bin. Thus, the expected value of each histogram value equals the integral of the actual density over the bin divided by the binwidth, showing that the histogram is an unbiased estimate of this quantity. Convergence can be tested by computing the variance of the estimate. The variance of one bin in the histogram is given by

σ²(h(i)) = (P_i − P_i²)/(Lδ_i²) + (1/(L²δ_i²)) ∑_{k≠l} ( E[I_i(r(k)) I_i(r(l))] − P_i² )

To simplify this expression, the correlation between the observations must be specified. If the values are statistically independent (we have white noise), each term in the sum becomes zero and the variance is given by σ²(h(i)) = (P_i − P_i²)/(Lδ_i²). Thus, the variance tends to zero as L → ∞ and the histogram estimate is consistent, converging to P_i/δ_i. If the observations are not white, convergence becomes problematical. Assume, for example, that I_i(r(k)) and I_i(r(l)) are correlated in a first-order, geometric fashion.

E[I_i(r(k)) I_i(r(l))] − P_i² = P_i² ρ^{|k−l|}

The variance does increase with this presumed correlation until, at the extreme (ρ = 1), the variance is a constant independent of L! In summary, if the observations are mutually correlated and the histogram estimate converges, the estimate converges to the proper value but more slowly than if the observations were white. The estimate may not converge if the observations are heavily dependent from index to index. This type of dependence structure occurs when the power spectrum of the observations is lowpass with an extremely low cutoff frequency.

Convergence to the density rather than to its integral over a region can occur if, as the amount of data grows, we reduce the binwidth δ_i and increase N, the number of bins. However, if we choose the binwidth too small for the amount of available data, few bins contain data and the estimate is inaccurate. Letting r′ denote the midpoint of a bin, a Taylor expansion about this point reveals that the mean-squared error between the histogram and the density at that point is [Thompson and Tapia: 44-59] [27]

E[ (p_r(r′) − h(i))² ] = p_r(r′)/(2Lδ_i) + (δ_i⁴/36) ( (d²p_r(r)/dr²)|_{r=r′} )² + O(1/L) + O(δ_i⁵)

This mean-squared error becomes zero only if L → ∞, Lδ_i → ∞, and δ_i → 0. Thus, the binwidth must decrease more slowly than the rate of increase of the number of observations. We find the "optimum" compromise between the decreasing binwidth and the increasing amount of data to be

δ_i = ( 9 p_r(r′) / ( 2 ( (d²p_r(r)/dr²)|_{r=r′} )² ) )^{1/5} L^{−1/5}

(This result assumes that the second derivative of the density is nonzero. If it is not, either the Taylor series expansion brings higher-order terms into play or, if all the derivatives are zero, no optimum binwidth can be defined for minimizing the mean-squared error.)


Using this binwidth, we find the mean-squared error to be proportional to L^{−4/5}. We have thus discovered the famous "4/5" rule of density estimation; this is one of the few cases where the variance of a convergent statistic decreases more slowly than the reciprocal of the number of observations. In practice, this optimal binwidth cannot be used because the proportionality constant depends on the unknown density being estimated. Roughly speaking, wider bins should be employed where the density is changing slowly. How the optimal binwidth varies with L can be used to adjust the histogram estimate as more data become available.
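A minimal sketch of a histogram density estimator whose binwidth shrinks as L^{−1/5}, in the spirit of the rule above; the reference binwidth δ₀ is an arbitrary assumption (the optimal constant depends on the unknown density), and the Gaussian test data are illustrative.

```python
import numpy as np

# Histogram density estimate with binwidth proportional to L^(-1/5).
rng = np.random.default_rng(6)

def histogram_density(r, delta0=1.0):
    L = len(r)
    delta = delta0 * L ** (-1 / 5)                    # slowly shrinking binwidth
    edges = np.arange(r.min(), r.max() + delta, delta)
    counts, edges = np.histogram(r, bins=edges)
    return counts / (L * delta), edges                # h(i) = L_i / (L * delta)

for L in (100, 1000, 10000):
    r = rng.standard_normal(L)
    h, edges = histogram_density(r)
    centers = 0.5 * (edges[:-1] + edges[1:])
    true = np.exp(-centers**2 / 2) / np.sqrt(2 * np.pi)
    print(f"L={L:6d}  bins={len(h):3d}  mean squared error={np.mean((h - true)**2):.5f}")
```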

3.10.4 Density Verification

Once a density estimate is produced, the class of density that best coincides with the estimate remains an issue: is the density just estimated statistically similar to a Gaussian? The histogram estimate can be used directly in a hypothesis test to determine similarity with any proposed density. Assume that the observations are obtained from a white, stationary, stochastic sequence. Let M₀ denote the hypothesis that the data have an amplitude distribution equal to the presumed density and M₁ the dissimilarity hypothesis. If M₀ is true, the estimate for each bin should not deviate greatly from the probability of a randomly chosen datum lying in the bin. We determine this probability from the presumed density by integrating over the bin. Summing these deviations over the entire estimate, the result should not exceed a threshold. The theory of standard hypothesis testing requires us to produce a specific density for the alternative hypothesis M₁. We cannot rationally assign such a density; consistency is being tested, not whether either of two densities provides the best fit. However, taking inspiration from the Neyman-Pearson approach to hypothesis testing [see Neyman-Pearson Criterion (Section 4.2.2)], we can develop a test statistic and require its statistical characteristics only under M₀. The typically used, but ad hoc, test statistic S(L, N) is related to the histogram estimate's mean-squared error [Cramér: 416-417] [18].

S(L, N) = ∑_{i=1}^{N} (L_i − L P_i)² / (L P_i) = ∑_{i=1}^{N} L_i²/(L P_i) − L

This statistic sums over the various bins the squared error of the number of observations relative to the expected number. For large L, S(L, N) has a χ² probability distribution with N − 1 degrees of freedom [Cramér: 417] [18]. Thus, for a given number of observations L, we establish a threshold η_N by picking a false-alarm probability P_F and using tables to solve Pr[χ²_{N−1} > η_N] = P_F. To enhance the validity of this approximation, statisticians recommend selecting the binwidth so that each bin contains at least ten observations. In practice, we fulfill this criterion by merging adjacent bins until a sufficient number of observations occur in the new bin and defining its binwidth as the sum of the merged bins' widths. Thus, the number of bins is reduced to some number N′, which determines the degrees of freedom in the hypothesis test. The similarity test between the histogram estimate of a probability density function and an assumed ideal form becomes

S(L, N′) ≷_{M₀}^{M₁} η_{N′}

In many circumstances, the formula for the density is known but not some of its parameters. In the Gaussian case, for example, the mean or variance are usually unknown. These parameters must be determined from the same data used in the consistency test before the test can be used. Doesn't the fact that we use estimates rather than actual values affect the similarity test? The answer is "yes," but in an interesting way: the similarity test changes only in that the number of degrees of freedom of the χ² random variable used to establish the threshold is reduced by one for each estimated parameter. If a Gaussian density is being tested, for example, the mean and variance usually need to be found. The threshold should then be determined according to the distribution of a χ²_{N′−3} random variable.
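The sketch below implements the test just described under stated assumptions: numpy and scipy are available, 100 equal-width starting bins, merging until at least ten observations per bin, and a nominal Gaussian whose mean and variance come from the data. It is an illustration of the procedure, not the code used to produce Table 3.1.

```python
import numpy as np
from scipy.stats import chi2, norm

# Chi-squared density-verification test with bin merging and a threshold
# based on N' - 3 degrees of freedom (mean and variance are estimated).
rng = np.random.default_rng(7)
L, PF = 2000, 0.1
r = rng.standard_normal(L)                       # data under test

m, v = r.mean(), r.var()                         # estimated mean and variance
edges = np.linspace(r.min(), r.max(), 101)       # 100 equal-width bins
counts, _ = np.histogram(r, bins=edges)
cdf = norm.cdf(edges, loc=m, scale=np.sqrt(v))
probs = np.diff(cdf)
probs[0] += cdf[0]                               # fold the tails into the edge bins
probs[-1] += 1.0 - cdf[-1]

# Merge adjacent bins until each merged bin holds at least ten observations.
merged_counts, merged_probs = [], []
c_acc, p_acc = 0, 0.0
for c, p in zip(counts, probs):
    c_acc += c
    p_acc += p
    if c_acc >= 10:
        merged_counts.append(c_acc)
        merged_probs.append(p_acc)
        c_acc, p_acc = 0, 0.0
merged_counts[-1] += c_acc                       # leftovers join the last bin
merged_probs[-1] += p_acc

Li = np.array(merged_counts)
Pi = np.array(merged_probs)
S = np.sum((Li - L * Pi) ** 2 / (L * Pi))
N_prime = len(Li)
threshold = chi2.ppf(1 - PF, df=N_prime - 3)
verdict = "consistent with the Gaussian" if S < threshold else "declared non-Gaussian"
print(f"N'={N_prime}  S={S:.1f}  threshold={threshold:.1f}  -> {verdict}")
```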

N ′−3 random variable.Example 3.12Three sets of observations are considered: Two are drawn from a Gaussian distribution and theother not. The rst Gaussian example is white noise, a signal whose characteristics match the

Available for free at Connexions <http://cnx.org/content/col11382/1.1>

Page 88: Statistical Signal Processing · 2012. 10. 29. · 1.1 oundationsF of Probability Theory: Basic De nitions 1 1.1.1 Basic De nitions The basis of probability theory is a set of events

82 CHAPTER 3. ESTIMATION THEORY

assumptions of this section. The second is non-Gaussian, which should not pass the test. Finally,the last test consists of colored Gaussian noise that, because of dependent samples, does not have asmany degrees of freedom as would be expected. The number of data available in each case is 2000.The histogram estimator uses xed-width bins and the χ2 test demands at least ten observationsper merged bin. The mean and variance estimates are used in constructing the nominal Gaussiandensity. The histogram estimates and their approximation by the nominal density whose mean andvariance were computed from the data are shown in the Figure 3.3.

[Figure 3.3: three histogram panels labeled White Gaussian, White Sech, and Colored Gaussian, each with vertical axis reaching about 0.5 and horizontal axis x ranging from −4 to 4.]

Figure 3.3: Three histogram density estimates are shown and compared with Gaussian densities having the same mean and variance. The histogram on the top is obtained from Gaussian data that are presumed to be white. The middle one is obtained from a non-Gaussian distribution related to the hyperbolic secant, p_r(r) = (1/(2σ)) sech²(πr/(2σ)). This density resembles a Gaussian about the origin but decreases exponentially in the tails. The bottom histogram is taken from a first-order autoregressive Gaussian signal. Thus, these data are correlated, but yield a histogram resembling the true amplitude distribution. In each case, 2000 data points were used and the histogram contained 100 bins.


The chi-squared test with P_F = 0.1 yielded the following results.

Density             N′    Threshold η_{N′} (χ²_{N′−3})    S(2000, N′)
White Gaussian      70    82.2                             78.4
White Sech          65    76.6                             232.6
Colored Gaussian    65    76.6                             77.8

Table 3.1

The white Gaussian noise example clearly passes the χ² test. The test correctly evaluated the non-Gaussian example, but declared the colored Gaussian data to be non-Gaussian, yielding a value near the threshold. Failing in the latter case to correctly determine the data's Gaussianity, we see that the χ² test is sensitive to the statistical independence of the observations.

3.11 Estimation Theory: Problems

Exercise 3.11.1
Estimates for identical parameters are heavily dependent on the assumed underlying probability densities. To understand this sensitivity better, consider the following variety of problems, each of which asks for estimates of quantities related to variance. Determine the bias and consistency in each case.

3.11.1

Compute the maximum a posteriori and maximum likelihood estimates of θ based on L statistically independent observations of a Maxwellian random variable r.

p_{r|θ}(r) = √(2/π) θ^{−3/2} r² e^{−r²/(2θ)},   r > 0, θ > 0

p_θ(θ) = λ e^{−λθ},   θ > 0

3.11.2

)3.11.2

Find the maximum a posteriori estimate of the variance σ2 from L statistically independent obser-vations having the exponential density

∀r, r > 0 :(p r (r ) =

1√σ2e− r√

σ2

)where the variance is uniformly distributed over the interval

[0, σ2

max

).


3.11.3

Find the maximum likelihood estimate of the variance of L identically distributed, but dependent, Gaussian random variables. Here, the covariance matrix is written K_r = σ² K̃_r, where the normalized covariance matrix has trace tr[K̃_r] = L.

Exercise 3.11.2
Imagine yourself idly standing on the corner in a large city when you note the serial number of a passing beer truck. Because you are idle, you wish to estimate (guess may be more accurate here) how many beer trucks the city has from this single observation.

3.11.1

Making appropriate assumptions, namely that the beer truck's number is drawn from a uniform probability density ranging between zero and some unknown upper limit, find the maximum likelihood estimate of the upper limit.

3.11.2

Show that this estimate is biased.

3.11.3

In one of your extraordinarily idle moments, you observe throughout the city L beer trucks. Assuming them to be independent observations, now what is the maximum likelihood estimate of the total?

3.11.4

Is this estimate of θ biased? asymptotically biased? consistent?

Exercise 3.11.3
We make L observations r₁, …, r_L of a parameter θ corrupted by additive noise (r_l = θ + n_l). The parameter θ is a Gaussian random variable [θ ∼ N(0, σ_θ²)] and the n_l are statistically independent Gaussian random variables [n_l ∼ N(0, σ_n²)].

3.11.1

Find the MMSE estimate of θ.

3.11.2

Find the maximum a posteriori estimate of θ.

3.11.3

Compute the resulting mean-squared error for each estimate.


3.11.4

Consider an alternate procedure based on the same observations r_l. Using the MMSE criterion, we estimate θ immediately after each observation. This procedure yields the sequence of estimates θ̂₁(r₁), θ̂₂(r₁, r₂), . . ., θ̂_L(r₁, . . . , r_L). Express θ̂_l as a function of θ̂_{l−1}, σ²_{l−1}, and r_l. Here, σ²_l denotes the variance of the estimation error of the l-th estimate. Show that

1/σ²_l = 1/σ²_θ + l/σ²_n

Exercise 3.11.4
Although the maximum likelihood estimation procedure was not clearly defined until early in the 20th century, Gauss showed in 1805 that the Gaussian density²⁶ was the sole density for which the maximum likelihood estimate of the mean equaled the sample average. Let r₀, . . . , r_{L−1} be a sequence of statistically independent, identically distributed random variables.

3.11.1

What equation defines the maximum likelihood estimate m̂_ML of the mean m when the common probability density function of the data has the form p(r − m)?

3.11.2

The sample average is, of course, (1/L) ∑_l r_l. Show that it minimizes the mean-squared error ∑_l (r_l − m)².

3.11.3

Equating the sample average to m̂_ML, combine this equation with the maximum likelihood equation to show that the Gaussian density uniquely satisfies the equations.

note: Because both equations equal 0, they can be equated. Use the fact that they must hold for all L to derive the result. Gauss thus showed that mean-squared error and the Gaussian density were closely linked, presaging ideas from modern robust estimation theory.

Exercise 3.11.5
In this example (Example 3.5), we derived the maximum likelihood estimate of the mean and variance of a Gaussian random vector. You might wonder why we chose to estimate the variance σ² rather than the standard deviation σ. Using the same assumptions provided in the example, let's explore the consequences of estimating a function of a parameter (van Trees: Probs 2.4.9, 2.4.10 [52]).

3.11.1

Assuming that the mean is known, find the maximum likelihood estimates of first the variance, then the standard deviation.

3.11.2

Are these estimates biased?

26 It wasn't called the Gaussian density in 1805; this result is one of the reasons why it is.


3.11.3

Describe how these two estimates are related. Assuming that f(·) is a monotonic function, how are θ̂_ML and f(θ)_ML related in general? These results suggest a general question. Consider the problem of estimating some function of a parameter θ, say f₁(θ). The observed quantity is r and the conditional density p_{r|θ}(r|θ) is known. Assume that θ is a nonrandom parameter.

3.11.4

What are the conditions for an efficient estimate f̂₁(θ) to exist?

3.11.5

What is the lower bound on the variance of the error of any unbiased estimate of f1 (θ)?

3.11.6

Assume an efficient estimate of f₁(θ) exists; when can an efficient estimate of some other function f₂(θ) exist?

Exercise 3.11.6
Let the observations r(l) consist of statistically independent, identically distributed Gaussian random variables having zero mean but unknown variance. We wish to estimate σ², their variance.

3.11.1

Find the maximum likelihood estimate σ̂²_ML and compute the resulting mean-squared error.

3.11.2

Show that this estimate is efficient.

3.11.3

Consider a new estimate σ̂²_NEW given by σ̂²_NEW = α σ̂²_ML, where α is a constant. Find the value of α that minimizes the mean-squared error for σ̂²_NEW. Show that the mean-squared error of σ̂²_NEW is less than that of σ̂²_ML. Is this result compatible with the previous part (p. 86)?

Exercise 3.11.7
Let the observations be of the form r = Hθ + n where θ and n are statistically independent Gaussian random vectors.

θ ∼ N(0, K_θ)

n ∼ N(0, K_n)

The vector θ has dimension M; the vectors r and n have dimension N.

3.11.1

Derive the minimum mean-squared error estimate of θ, θ̂_MMSE, from the relationship

θ̂_MMSE = E[θ | r]


3.11.2

Show that this estimate and the optimum linear estimate θ̂_LIN derived by the Orthogonality Principle are equal.

3.11.3

Find an expression for the mean-squared error when these estimates are used.

Exercise 3.11.8
To illustrate the power of importance sampling, let's consider a somewhat naïve example. Let r have a zero-mean Laplacian distribution; we want to employ importance sampling techniques to estimate Pr[r > γ] (despite the fact that we can calculate it easily). Let the importance sampling density for r be Laplacian having mean γ.

3.11.1

Find the weight cl that must be applied to each decision based on the variable r.

3.11.2

Find the importance sampling gain. Show that this gain means that a fixed number of simulations are needed to achieve a given percentage estimation error (as defined by the coefficient of variation). Express this number as a function of the criterion value for the coefficient of variation.

3.11.3

Now assume that the importance sampling density for r is Laplacian, but with mean m. Optimize m by finding the value that maximizes the importance sampling gain.
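As an illustration of the setup, here is a minimal Monte Carlo sketch of the importance-sampling estimate of Pr[r > γ]; the unit scale parameter, the value of γ, and the sample size are assumptions made only for this sketch.

import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
gamma, N = 3.0, 10_000                           # threshold and sample size are illustrative

target   = stats.laplace(loc=0.0, scale=1.0)     # density of r under the model (unit scale assumed)
proposal = stats.laplace(loc=gamma, scale=1.0)   # shifted simulation density from the exercise

x = proposal.rvs(size=N, random_state=rng)
weights = target.pdf(x) / proposal.pdf(x)        # the weights c_l of part 3.11.1
estimate = np.mean(weights * (x > gamma))        # importance-sampling estimate of Pr[r > gamma]

print(estimate, target.sf(gamma))                # compare with the exact tail probability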

Exercise 3.11.9
Suppose we consider an estimate of the parameter θ having the form θ̂ = L(r) + C, where r denotes the vector of the observables and L(·) is a linear operator. The quantity C is a constant. This estimate is not a linear function of the observables unless C = 0. We are interested in finding applications for which it is advantageous to allow C ≠ 0. Estimates of this form we term "quasi-linear".

3.11.1

Show that the optimum (minimum mean-squared error) quasi-linear estimate satisfies

E[⟨θ̂_QLIN − θ, L(r) + C⟩] = 0

for all L(·) and C, where θ̂_QLIN = L(r) + C.

3.11.2

Find a general expression for the mean-squared error incurred by the optimum quasi-linear estimate.


3.11.3

Such estimates yield a smaller mean-squared error when the parameter θ has a nonzero mean. Let θ be a scalar parameter with mean m. The observables comprise a vector r having components given by r_l = θ + n_l, l ∈ {1, . . . , N}, where the n_l are statistically independent Gaussian random variables [n_l ∼ N(0, σ²_n)] independent of θ. Compute expressions for θ̂_QLIN and θ̂_LIN. Verify that θ̂_QLIN yields a smaller mean-squared error when m ≠ 0.

Exercise 3.11.10
In this section (Section 3.8), we questioned the existence of an efficient estimator for signal parameters. We found in the succeeding example that an unbiased efficient estimator exists for the signal amplitude. Can a nonlinearly represented parameter, such as time delay, have an efficient estimator?

3.11.1

Simplify the condition for the existence of an efficient estimator by assuming it to be unbiased. Note carefully the dimensions of the matrices involved.

3.11.2

Show that the only solution in this case occurs when the signal depends "linearly" on the parameter vector.

Exercise 3.11.11
In Poisson problems, the number of events n occurring in the interval [0, T) is governed by the probability distribution (see The Poisson Process (Section 1.10))

Pr[n] = ((λT)ⁿ / n!) e^{−λT}

where λ is the average rate at which events occur.

3.11.1

What is the maximum likelihood estimate of average rate?

3.11.2

Does this estimate satisfy the Cramér-Rao bound?

Exercise 3.11.12
In the "classic" radar problem, not only is the time of arrival of the radar pulse unknown but also the amplitude. In this problem, we seek methods of simultaneously estimating these parameters. The received signal r(l) is of the form

r(l) = θ₁ s(l − θ₂) + n(l)

where θ₁ is Gaussian with zero mean and variance σ₁² and θ₂ is uniformly distributed over the observation interval. Find the receiver that computes the maximum a posteriori estimates of θ₁ and θ₂ jointly. Draw a block diagram of this receiver and interpret its structure.

Exercise 3.11.13
We state without derivation the Cramér-Rao bound for estimates of signal delay (see equation (3.38)).


3.11.1

The parameter θ is the delay of the signal s(·) observed in additive, white Gaussian noise: r(l) = s(l − θ) + n(l), l ∈ {0, . . . , L − 1}. Derive the Cramér-Rao bound for this problem.

3.11.2

In Time-Delay Estimation (Section 3.9), this bound is claimed to be given by σ²_n / (E β²), where β² is the mean-squared bandwidth. Derive this result from your general formula. Does the bound make sense for all values of the signal-to-noise ratio E/σ²_n?

3.11.3

Using optimal detection theory, derive the expression (see Time-Delay Estimation (Section 3.9)) for the probability of error incurred when trying to distinguish between a delay of τ and a delay of τ + ∆. Consistent with the problem posed for the Cramér-Rao bound, assume the delayed signals are observed in additive, white Gaussian noise.

Exercise 3.11.14
In formulating detection problems, the signal as well as the noise are sometimes modeled as Gaussian processes. Let's explore what differences arise in the Cramér-Rao bound derived when the signal is deterministic. Assume that the signal contains unknown parameters θ, that it is statistically independent of the noise, and that the noise covariance matrix is known.

3.11.1

What forms do the conditional densities of the observations take under the two assumptions? What are the two covariance matrices?

3.11.2

Assuming the stochastic signal model, show that each element of the Fisher information matrix has the form

F_{i,j} = (1/2) tr[ K⁻¹ (∂K/∂θ_i) K⁻¹ (∂K/∂θ_j) ]

where K denotes the covariance matrix of the observations. Make this expression more explicit by assuming the noise component has no unknown parameters.

3.11.3

Compare the stochastic and deterministic bounds, the latter given by equation (3.36), when the unknown signal parameters are amplitude and delay. Assume the noise covariance matrix equals σ²_n I. Do these bounds have similar dependence on the signal-to-noise ratio?

Exercise 3.11.15
The histogram probability density estimator is a special case of a more general class of estimators known as kernel estimators.

p̂_r(x) = (1/L) ∑_{l=0}^{L−1} k(x − r(l))

Here, the kernel k(·) is usually taken to be a density itself.
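For illustration, a minimal Python sketch of such a kernel estimator follows; the Gaussian kernel and its width are illustrative choices, not part of the problem statement.

import numpy as np

def kernel_estimate(x, r, width=0.3):
    # p_hat(x) = (1/L) * sum_l k(x - r[l]), here with a Gaussian kernel of the given width
    diffs = (x[:, None] - r[None, :]) / width
    k = np.exp(-0.5 * diffs**2) / (np.sqrt(2.0 * np.pi) * width)
    return k.mean(axis=1)

rng = np.random.default_rng(2)
r = rng.standard_normal(200)            # data r(l)
x = np.linspace(-4.0, 4.0, 101)         # points at which the density estimate is evaluated
p_hat = kernel_estimate(x, r)
print(p_hat.max())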


3.11.1

What is the kernel for the histogram estimator?

3.11.2

Interpret the kernel estimator in signal processing terminology. Predict what the most time consuming computation of this estimate might be. Why?

3.11.3

Show that the sample average equals the expected value of a random variable having the density p̂_r(x) regardless of the choice of kernel.

Exercise 3.11.16
Random variables can be generated quite easily if the probability distribution function is "nice." Let X be a random variable having distribution function P_X(·).

3.11.1

Show that the random variable U = P X (X ) is uniformly distributed over (0, 1).

3.11.2

Based on this result, how would you generate a random variable having a specific density with a uniform random variable generator, which is commonly supplied with most computer and calculator systems?

3.11.3

How would you generate random variables having the hyperbolic secant density p_X(x) = (1/2) sech(πx/2)?

3.11.4

Why is the Gaussian not in the class of "nice" probability distribution functions? Despite this fact, the Gaussian and other similarly unfriendly random variables can be generated using tabulated rather than analytic forms for the distribution function.
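A minimal Python sketch of this inverse-distribution-function method for the hyperbolic secant density of part 3.11.3; the closed-form inverse used here follows from integrating p_X(x) = (1/2) sech(πx/2), and the sample size is arbitrary.

import numpy as np

rng = np.random.default_rng(3)
u = rng.uniform(size=10_000)                       # U, uniform on (0, 1), as in part 3.11.1
x = (2.0 / np.pi) * np.log(np.tan(np.pi * u / 2))  # X = P_X^{-1}(U) for the sech density

# sanity check: this density has zero mean and unit variance
print(x.mean(), x.var())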


Chapter 4

Detection Theory

4.1 Detection Theory Basics¹

Detection theory concerns making decisions from data. Decisions are based on presumptive models that may have produced the data. Making a decision involves inspecting the data and determining which model was most likely to have produced them. In this way, we are detecting which model was correct. Decision problems pervade signal processing. In digital communications, for example, determining if the current bit received in the presence of channel disturbances was a zero or a one is a detection problem.

More concretely, we denote by M_i the i-th model that could have generated the data R. A "model" is captured by the conditional probability distribution of the data, which is denoted by the vector R. For example, model i is described by p_{R|M_i}(r). Given all the models that can describe the data, we need to choose which model best matched what was observed. The word "best" is key here: what is the optimality criterion, and does the detection processing and the decision rule depend heavily on the criterion used? Surprisingly, the answer to the second question is "No." All of detection theory revolves around the likelihood ratio test, which, as we shall see, emerges as the optimal detector under a wide variety of optimality criteria.

4.1.1 The Likelihood Ratio Test

In a binary detection problem in which we have two models, four possible decision outcomes can result. Model M_0 did in fact represent the best model for the data and the decision rule said it was (a correct decision) or said it wasn't (an erroneous decision). The other two outcomes arise when model M_1 was in fact true with either a correct or incorrect decision made. The decision process operates by segmenting the range of observation values into two disjoint decision regions Z_0 and Z_1. All values of R fall into either Z_0 or Z_1. If a given R lies in Z_0, we will announce our decision "model M_0 was true"; if in Z_1, model M_1 would be proclaimed. To derive a rational method of deciding which model best describes the observations, we need a criterion to assess the quality of the decision process so that optimizing this criterion will specify the decision regions.

The Bayes' decision criterion seeks to minimize a cost function associated with making a decision. Let C_ij be the cost of mistaking model j for model i (i ≠ j) and C_ii the presumably smaller cost of correctly choosing model i: C_ij > C_ii, i ≠ j. Let π_i be the a priori probability of model i. The so-called Bayes' cost C̄ is the average cost of making a decision.

C̄ = ∑_{i,j} C_ij Pr[say M_i when M_j true] = ∑_{i,j} C_ij π_j Pr[say M_i | M_j true]    (4.1)

1 This content is available online at <http://cnx.org/content/m16250/1.6/>.


The Bayes' cost can be expressed as

C̄ = ∑_{i,j} C_ij π_j Pr[R ∈ Z_i | M_j true] = ∑_{i,j} C_ij π_j ∫_{Z_i} p_{R|M_j}(r) dr
  = ∫_{Z_0} ( C_00 π_0 p_{R|M_0}(r) + C_01 π_1 p_{R|M_1}(r) ) dr + ∫_{Z_1} ( C_10 π_0 p_{R|M_0}(r) + C_11 π_1 p_{R|M_1}(r) ) dr    (4.2)

To minimize this expression with respect to the decision regions Z_0 and Z_1, ponder which integral would yield the smallest value if its integration domain included a specific value of the observation vector. To minimize the sum of the two integrals, whichever integrand is smaller should include that value of r in its integration domain. We conclude that we choose M_0 for those values of r yielding a smaller value for the first integral.

π_0 C_00 p_{R|M_0}(r) + π_1 C_01 p_{R|M_1}(r) < π_0 C_10 p_{R|M_0}(r) + π_1 C_11 p_{R|M_1}(r)

We choose M_1 when the inequality is reversed. This expression is easily manipulated to obtain the crowning result of detection theory: the likelihood ratio test.

p_{R|M_1}(r) / p_{R|M_0}(r)  ≷_{M_0}^{M_1}  π_0 (C_10 − C_00) / (π_1 (C_01 − C_11))    (4.3)

The comparison relation means selecting model M_1 if the left-hand ratio exceeds the value on the right; otherwise, M_0 is selected. The likelihood ratio p_{R|M_1}(r) / p_{R|M_0}(r), symbolically represented by Λ(r), encapsulates the signal processing performed by the optimal detector on the observations r. The optimal decision rule then compares that scalar-valued result with a threshold η equaling π_0 (C_10 − C_00) / (π_1 (C_01 − C_11)). The likelihood ratio test can be succinctly expressed as the comparison of the likelihood ratio with a threshold.

Λ(r)  ≷_{M_0}^{M_1}  η    (4.4)

The data processing operations are captured entirely by the likelihood ratio Λ(r). However, the calculations required by the likelihood ratio can be simplified in many cases. Note that only the value of the likelihood ratio relative to the threshold matters. Consequently, we can perform any positively monotonic transformation simultaneously on the likelihood ratio and the threshold without affecting the result of the comparison. For example, we can multiply by a positive constant, add any constant or apply a monotonically increasing function to reduce the complexity of the expressions. We single out one such function, the logarithm, because it often simplifies likelihood ratios that commonly occur in signal processing applications. Known as the log-likelihood, we explicitly express the likelihood ratio test with it as

ln(Λ(r))  ≷_{M_0}^{M_1}  ln(η)    (4.5)

What simplifying transformations are useful are problem-dependent. But, by laying bare what aspect of the observations is essential to the model-testing problem, we reveal the sufficient statistic Υ(r): the scalar quantity which best summarizes the data for detection purposes. The likelihood ratio test is best expressed in terms of the sufficient statistic.

Υ(r)  ≷_{M_0}^{M_1}  γ    (4.6)

We denote the threshold value for the sufficient statistic by γ or by η when the likelihood ratio is used in the comparison.

The likelihood ratio is comprised of the quantities p_{R|M_i}(r), which are known as likelihood functions and play an important role in estimation theory. It is the likelihood function that portrays the probabilistic model describing data generation. The likelihood function completely characterizes the kind of "world" assumed by each model. For each model, we must specify the likelihood function so that we can solve the hypothesis testing problem.

A complication, which arises in some cases, is that the sufficient statistic may not be monotonic. If it is monotonic, the decision regions Z_0 and Z_1 are simply connected: all portions of a region can be reached without crossing into the other region. If not, the regions are not simply connected and decision region islands are created. Disconnected regions usually complicate calculations of decision performance. Monotonic or not, the decision rule proceeds as described: the sufficient statistic is computed for each observation vector and compared to a threshold.
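To make the computation concrete, here is a minimal Python sketch of the likelihood ratio test of (4.3) and (4.4); the two Gaussian likelihood functions, the priors, and the costs are illustrative stand-ins rather than part of the development above.

import numpy as np
from scipy import stats

def likelihood_ratio_test(r, p1, p0, pi0=0.5, pi1=0.5, C00=0.0, C11=0.0, C01=1.0, C10=1.0):
    # Likelihood ratio of the whole observation vector (independent components assumed)
    Lambda = np.prod(p1(r)) / np.prod(p0(r))
    eta = (pi0 * (C10 - C00)) / (pi1 * (C01 - C11))   # threshold from (4.3)
    return 1 if Lambda > eta else 0                   # index of the chosen model

r = np.array([0.4, 1.1, -0.2, 0.9])
decision = likelihood_ratio_test(r,
                                 p1=stats.norm(loc=1.0).pdf,   # stand-in for p_R|M1
                                 p0=stats.norm(loc=0.0).pdf)   # stand-in for p_R|M0
print(decision)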

Example 4.1
The coach of a soccer team suspects his goalie has been less than attentive to his training regimen. The coach focuses on the kicks the goalie makes to send the ball down the field. The data r he observes is the length of a kick. The coach defines the models as

• M_0: not maintaining a training regimen
• M_1: is maintaining a training regimen

The conditional densities of the kick length under the two models are shown in Figure 4.1.

Figure 4.1: Conditional densities for the distribution of the lengths of soccer kicks assuming that the goalie has not attended to his training (M_0) or did (M_1) are shown in the top row. The lower portion depicts the likelihood ratio formed from these densities.

Based on knowledge of soccer player behavior, the coach assigns a priori probabilities of π_0 = 1/4 and π_1 = 3/4. The costs C_ij are chosen to reflect the coach's sensitivity to the goalie's feelings: C_01 = 1 = C_10 (an erroneous decision either way is given the same cost) and C_00 = 0 = C_11. The likelihood ratio is plotted in Figure 4.1 and the threshold value η, which is computed from the a priori probabilities and the costs to be 1/3, is indicated. The calculations of this comparison can be simplified in an obvious way.

r/50  ≷_{M_0}^{M_1}  1/3

or

r  ≷_{M_0}^{M_1}  50/3 = 16.7


The multiplication by the factor of 50 is a simple illustration of the reduction of the likelihood ratio to a sufficient statistic. Based on the assigned costs and a priori probabilities, the optimum decision rule says the coach must assume that the goalie did not train if a kick is less than 16.7; if greater, the goalie is assumed to have trained despite producing an abysmally short kick such as 20. Note that the densities given by each model overlap entirely: the possibility of making the wrong interpretation always haunts the coach. However, no other procedure will be better (produce a smaller Bayes' cost)!

4.2 Criteria in Hypothesis Testing²

The criterion used in the previous section - minimize the average cost of an incorrect decision - may seem to be a contrived way of quantifying decisions. Well, often it is. For example, the Bayesian decision rule depends explicitly on the a priori probabilities; a rational method of assigning values to these - either by experiment or through true knowledge of the relative likelihood of each model - may be unreasonable. In this section, we develop alternative decision rules that try to answer such objections. One essential point will emerge from these considerations: the fundamental nature of the decision rule does not change with choice of optimization criterion. Even criteria remote from error measures can result in the likelihood ratio test (see this problem³). Such results do not occur often in signal processing and underline the likelihood ratio test's significance.

4.2.1 Maximum Probability of a Correct Decision

As only one model can describe any given set of data (the models are mutually exclusive), the probability of being correct P_c for distinguishing two models is given by

P_c = Pr[say M_0 when M_0 true] + Pr[say M_1 when M_1 true]

We wish to determine the optimum decision region placement. Expressing the probability correct in terms of the likelihood functions p_{r|M_i}(r), the a priori probabilities, and the decision regions,

P_c = ∫_{ℜ_0} π_0 p_{r|M_0}(r) dr + ∫_{ℜ_1} π_1 p_{r|M_1}(r) dr

We want to maximize P_c by selecting the decision regions ℜ_0 and ℜ_1. The probability correct is maximized by associating each value of r with the largest term in the expression for P_c. Decision region ℜ_0, for example, is defined by the collection of values of r for which the first term is largest. As all of the quantities involved are non-negative, the decision rule maximizing the probability of a correct decision is

note: Given r, choose M_i for which the product π_i p_{r|M_i}(r) is largest.

Simple manipulations lead to the likelihood ratio test.

p_{r|M_1}(r) / p_{r|M_0}(r)  ≷_{M_0}^{M_1}  π_0 / π_1

Note that if the Bayes' costs were chosen so that C_ii = 0 and C_ij = C (i ≠ j), we would have the same threshold as in the previous section.

2 This content is available online at <http://cnx.org/content/m11228/1.4/>.
3 "Statistical Hypothesis Testing: Problems", Exercise 3 <http://cnx.org/content/m11271/latest/#problem3>


To evaluate the quality of the decision rule, we usually compute the probability of error P_e rather than the probability of being correct. This quantity can be expressed in terms of the observations, the likelihood ratio, and the sufficient statistic.

P_e = π_0 ∫_{ℜ_1} p_{r|M_0}(r) dr + π_1 ∫_{ℜ_0} p_{r|M_1}(r) dr
    = π_0 ∫_{η}^{∞} p_{Λ|M_0}(Λ) dΛ + π_1 ∫_{0}^{η} p_{Λ|M_1}(Λ) dΛ
    = π_0 ∫_{γ}^{∞} p_{Υ|M_0}(Υ) dΥ + π_1 ∫_{−∞}^{γ} p_{Υ|M_1}(Υ) dΥ    (4.7)

When the likelihood ratio is non-monotonic, the first expression is most difficult to evaluate. When monotonic, the middle expression proves the most difficult. Furthermore, these expressions point out that the likelihood ratio and the sufficient statistic can be considered a function of the observations r; hence, they are random variables and have probability densities for each model. Another aspect of the resulting probability of error is that no other decision rule can yield a lower probability of error. This statement is obvious as we minimized the probability of error in deriving the likelihood ratio test. The point is that these expressions represent a lower bound on performance (as assessed by the probability of error). This probability will be non-zero if the conditional densities overlap over some range of values of r, such as occurred in the previous example. In this region of overlap, the observed values are ambiguous: either model is consistent with the observations. Our "optimum" decision rule operates in such regions by selecting that model which is most likely (has the highest probability) of generating any particular value.

4.2.2 Neyman-Pearson Criterion

Situations occur frequently where assigning or measuring the a priori probabilities P_i is unreasonable. For example, just what is the a priori probability of a supernova occurring in any particular region of the sky? We clearly need a model evaluation procedure which can function without a priori probabilities. This kind of test results when the so-called Neyman-Pearson criterion is used to derive the decision rule. The ideas behind and decision rules derived with the Neyman-Pearson criterion (Neyman and Pearson [36]) will serve us well in sequel; their result is important!

Using nomenclature from radar, where model M_1 represents the presence of a target and M_0 its absence, the various types of correct and incorrect decisions have the following names (Woodward, pp. 127-129 [55]).⁴

Detection - we say it's there when it is; P_D = Pr(say M_1 | M_1 true)
False-alarm - we say it's there when it's not; P_F = Pr(say M_1 | M_0 true)
Miss - we say it's not there when it is; P_M = Pr(say M_0 | M_1 true)

4 In hypothesis testing, a false-alarm is known as a type I error and a miss a type II error.

The remaining probability Pr[say M_0 | M_0 true] has historically been left nameless and equals 1 − P_F. We should also note that the detection and miss probabilities are related by P_M = 1 − P_D. As these are conditional probabilities, they do not depend on the a priori probabilities, and the two probabilities P_F and P_D characterize the errors when any decision rule is used.

These two probabilities are related to each other in an interesting way. Expressing these quantities in terms of the decision regions and the likelihood functions, we have

P_F = ∫_{ℜ_1} p_{r|M_0}(r) dr

P_D = ∫_{ℜ_1} p_{r|M_1}(r) dr

As the region ℜ_1 shrinks, both of these probabilities tend toward zero; as ℜ_1 expands to engulf the entire range of observation values, they both tend toward unity. This rather direct relationship between P_D and P_F does not mean that they equal each other; in most cases, as ℜ_1 expands, P_D increases more rapidly than P_F (we had better be right more often than we are wrong!). However, the "ultimate" situation where a rule is always right and never wrong (P_D = 1, P_F = 0) cannot occur when the conditional distributions overlap. Thus, to increase the detection probability we must also allow the false-alarm probability to increase. This behavior represents the fundamental tradeoff in hypothesis testing and detection theory.

One can attempt to impose a performance criterion that depends only on these probabilities with the consequent decision rule not depending on the a priori probabilities. The Neyman-Pearson criterion assumes that the false-alarm probability is constrained to be less than or equal to a specified value α while we attempt to maximize the detection probability P_D.

max_{ℜ_1} P_D   subject to   P_F ≤ α

A subtlety of the succeeding solution is that the underlying probability distribution functions may not be continuous, with the result that P_F can never equal the constraining value α. Furthermore, an (unlikely) possibility is that the optimum value for the false-alarm probability is somewhat less than the criterion value. Assume, therefore, that we rephrase the optimization problem by requiring that the false-alarm probability equal a value α′ that is less than or equal to α.

This optimization problem can be solved using Lagrange multipliers (see Constrained Optimization (Section 2.2)); we seek to find the decision rule that maximizes

F = P_D + λ (P_F − α′)

where λ is the Lagrange multiplier. This optimization technique amounts to finding the decision rule that maximizes F, then finding the value of the multiplier that allows the criterion to be satisfied. As is usual in the derivation of optimum decision rules, we maximize these quantities with respect to the decision regions. Expressing P_D and P_F in terms of them, we have

F = ∫_{ℜ_1} p_{r|M_1}(r) dr + λ ( ∫_{ℜ_1} p_{r|M_0}(r) dr − α′ )
  = −λα′ + ∫_{ℜ_1} ( p_{r|M_1}(r) + λ p_{r|M_0}(r) ) dr    (4.8)

To maximize this quantity with respect to ℜ_1, we need only to integrate over those regions of r where the integrand is positive. The region ℜ_1 thus corresponds to those values of r where p_{r|M_1}(r) > −λ p_{r|M_0}(r), and the resulting decision rule is

p_{r|M_1}(r) / p_{r|M_0}(r)  ≷_{M_0}^{M_1}  −λ

The ubiquitous likelihood ratio test again appears; it is indeed the fundamental quantity in hypothesis testing. Using the logarithm of the likelihood ratio or the sufficient statistic, this result can be expressed as either

ln(Λ(r))  ≷_{M_0}^{M_1}  ln(−λ)

or

Υ(r)  ≷_{M_0}^{M_1}  γ

We have not as yet found a value for the threshold. The false-alarm probability can be expressed in terms of the Neyman-Pearson threshold in two (useful) ways.

P_F = ∫_{−λ}^{∞} p_{Λ|M_0}(Λ) dΛ = ∫_{γ}^{∞} p_{Υ|M_0}(Υ) dΥ    (4.9)

One of these implicit equations must be solved for the threshold by setting P_F equal to α′. The selection of which to use is usually based on pragmatic considerations: the easiest to compute. From the previous discussion of the relationship between the detection and false-alarm probabilities, we find that to maximize P_D we must allow α′ to be as large as possible while remaining less than α. Thus, we want to find the smallest value of −λ (note the minus sign) consistent with the constraint. Computation of the threshold is problem-dependent, but a solution always exists.

Example 4.2
An important application of the likelihood ratio test occurs when r is a Gaussian random vector for each model. Suppose the models correspond to Gaussian random vectors having different mean values but sharing the same covariance σ²I.

• M_0: r ∼ N(0, σ²I)
• M_1: r ∼ N(m, σ²I)

Thus, r is of dimension L and has statistically independent, equal variance components. The vector of means m = (m_0, . . . , m_{L−1})^T distinguishes the two models. The likelihood functions associated with this problem are

p_{r|M_0}(r) = ∏_{l=0}^{L−1} (1/√(2πσ²)) e^{−r_l²/(2σ²)}

p_{r|M_1}(r) = ∏_{l=0}^{L−1} (1/√(2πσ²)) e^{−(r_l − m_l)²/(2σ²)}

The likelihood ratio Λ(r) becomes

Λ(r) = ∏_{l=0}^{L−1} e^{−(r_l − m_l)²/(2σ²)} / ∏_{l=0}^{L−1} e^{−r_l²/(2σ²)}

This expression for the likelihood ratio is complicated. In the Gaussian case (and many others), we use the logarithm to reduce the complexity of the likelihood ratio and form a sufficient statistic.

ln(Λ(r)) = ∑_{l=0}^{L−1} [ −(r_l − m_l)²/(2σ²) + r_l²/(2σ²) ]
         = (1/σ²) ∑_{l=0}^{L−1} m_l r_l − (1/(2σ²)) ∑_{l=0}^{L−1} m_l²    (4.10)

The likelihood ratio test then has the much simpler, but equivalent form

∑_{l=0}^{L−1} m_l r_l  ≷_{M_0}^{M_1}  σ² ln(η) + (1/2) ∑_{l=0}^{L−1} m_l²

To focus on the model evaluation aspects of this problem, let's assume the means are equal to a positive constant: m_l = m (> 0).⁵

∑_{l=0}^{L−1} r_l  ≷_{M_0}^{M_1}  (σ²/m) ln(η) + Lm/2

Note that all that need be known about the observations r_l is their sum. This quantity is the sufficient statistic for the Gaussian problem: Υ(r) = ∑ r_l and γ = (σ²/m) ln(η) + Lm/2.

When trying to compute the probability of error or the threshold in the Neyman-Pearson criterion, we must find the conditional probability density of one of the decision statistics: the likelihood ratio, the log-likelihood, or the sufficient statistic. The log-likelihood and the sufficient statistic are quite similar in this problem, but clearly we should use the latter. One practical property of the sufficient statistic is that it usually simplifies computations. For this Gaussian example, the sufficient statistic is a Gaussian random variable under each model.

5 Why did the authors assume that the mean was positive? What would happen if it were negative?


• M_0: Υ(r) ∼ N(0, Lσ²)
• M_1: Υ(r) ∼ N(Lm, Lσ²)

To find the probability of error from (4.7), we must evaluate the area under a Gaussian probability density function. These integrals are succinctly expressed in terms of Q(x), which denotes the probability that a unit-variance, zero-mean Gaussian random variable exceeds x (see Probability and Stochastic Processes (Section 1.1)). As 1 − Q(x) = Q(−x), the probability of error can be written as

P_e = π_1 Q( (Lm − γ) / (√L σ) ) + π_0 Q( γ / (√L σ) )

An interesting special case occurs when π_0 = 1/2 = π_1. In this case, γ = Lm/2 and the probability of error becomes

P_e = Q( √L m / (2σ) )

As Q(·) is a monotonically decreasing function, the probability of error decreases with increasing values of the ratio √L m / (2σ). However, as shown in this figure (Figure 1.1), Q(·) decreases in a nonlinear fashion. Thus, increasing m by a factor of two may decrease the probability of error by a larger or a smaller factor; the amount of change depends on the initial value of the ratio.

To find the threshold for the Neyman-Pearson test from the expressions given in (4.9), we need the area under a Gaussian density.

P_F = Q( γ / (√L σ) ) = α′    (4.11)

As Q(·) is a monotonic and continuous function, we can now set α′ equal to the criterion value α with the result

γ = √L σ Q⁻¹(α)

where Q⁻¹(·) denotes the inverse function of Q(·). The solution of this equation cannot be performed analytically as no closed form expression exists for Q(·) (much less its inverse function); the criterion value must be found from tables or numerical routines. Because Gaussian problems arise frequently, the accompanying table (Table 4.1) provides numeric values for this quantity at the decade points.

x        Q⁻¹(x)
10⁻¹     1.281
10⁻²     2.326
10⁻³     3.090
10⁻⁴     3.719
10⁻⁵     4.265
10⁻⁶     4.754

Table 4.1: The table displays interesting values for Q⁻¹(·) that can be used to determine thresholds in the Neyman-Pearson variant of the likelihood ratio test. Note how little the inverse function changes for decade changes in its argument; Q(·) is indeed very nonlinear.


The detection probability is given by

P_D = Q( Q⁻¹(α) − √L m / σ )

4.3 Performance Evaluation⁶

We alluded earlier (Section 4.2.2: Neyman-Pearson Criterion) to the relationship between the false-alarm probability P_F and the detection probability P_D as one varies the decision region. Because the Neyman-Pearson criterion depends on specifying the false-alarm probability to yield an acceptable detection probability, we need to examine carefully how the detection probability is affected by a specification of the false-alarm probability. The usual way these quantities are discussed is through a parametric plot of P_D versus P_F: the receiver operating characteristic or ROC.

As we discovered in the Gaussian example (Example 4.2), the sufficient statistic provides the simplest way of computing these probabilities; thus, they are usually considered to depend on the threshold parameter γ. In these terms, we have

P_D = ∫_{γ}^{∞} p_{Υ|M_1}(Υ) dΥ    (4.12)

and

P_F = ∫_{γ}^{∞} p_{Υ|M_0}(Υ) dΥ    (4.13)

These densities and their relationship to the threshold γ are shown in Figure 4.2 (Densities of the sufficient statistic).

Densities of the sufficient statistic

Figure 4.2: The densities of the sufficient statistic Υ(r) conditioned on two hypotheses are shown for the Gaussian example. The threshold γ used to distinguish between the two models is indicated. The false-alarm probability is the area under the density corresponding to M_0 to the right of the threshold; the detection probability is the area under the density corresponding to M_1.

6 This content is available online at <http://cnx.org/content/m11274/1.3/>.


We see that the detection probability is greater than or equal to the false-alarm probability. Since these probabilities must decrease monotonically as the threshold is increased, the ROC curve must be concave-down and must always exceed the equality line (Figure 4.3).⁷

Figure 4.3: A plot of the receiver operating characteristic for the densities shown in the previous figure. Three ROC curves are shown corresponding to different values for the parameter √L m / σ.

The degree to which the ROC departs from the equality line P_D = P_F measures the relative distinctiveness between the two hypothesized models for generating the observations. In the limit, the two models can be distinguished perfectly if the ROC is discontinuous and consists of the point (1,0). The two are totally confused if the ROC lies on the equality line (this would mean, of course, that the two models are identical); distinguishing the two in this case would be "somewhat difficult".

Example 4.3
Consider the Gaussian example we have been discussing where the two models differ only in the means of the conditional distributions. In this case, the two model-testing probabilities are given by

P_F = Q( γ / (√L σ) )

and

P_D = Q( (γ − Lm) / (√L σ) )

7 This seemingly haughty claim is proved when we consider the sequential hypothesis test.


By re-expressing γ as (σ²/m) γ′ + Lm/2, we discover that these probabilities depend only on the ratio √L m / σ.

P_F = Q( γ′ / (√L m / σ) + √L m / (2σ) )

P_D = Q( γ′ / (√L m / σ) − √L m / (2σ) )

As this signal-to-noise ratio increases, the ROC curve approaches its "ideal" form: the northwest corner of a square, as illustrated in Figure 4.3 by the value of 7.44 for √L m / σ, which corresponds to a signal-to-noise ratio of 7.44² ≃ 17 dB. If a small false-alarm probability (say 10⁻⁴) is specified, a large detection probability (0.9999) can result. Such values of signal-to-noise ratios can thus be considered "large" and the corresponding model evaluation problem relatively easy. If, however, the signal-to-noise ratio equals 4 (6 dB), the figure illustrates the worsened performance: a 10⁻⁴ specification on the false-alarm probability would result in a detection probability of essentially zero. Thus, in a fairly small signal-to-noise ratio range, the likelihood ratio test's performance capabilities can vary dramatically. However, no other decision rule can yield better performance.
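A minimal Python sketch that traces these ROC curves numerically; the values chosen for √L m/σ are illustrative:

import numpy as np
from scipy import stats

Q, Qinv = stats.norm.sf, stats.norm.isf

PF = np.logspace(-6, 0, 200)                 # sweep of false-alarm probabilities
for d in (2.0, 4.0, 7.44):                   # d stands for sqrt(L)*m/sigma
    PD = Q(Qinv(PF) - d)                     # P_D as a function of P_F (Example 4.3)
    print(d, PD[0], PD[-1])                  # endpoints of each ROC curve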

Specification of the false-alarm probability for a new problem requires experience. Choosing a "reasonable" value for the false-alarm probability in the Neyman-Pearson criterion depends strongly on the problem difficulty. Too small a number will result in small detection probabilities; too large and the detection probability will be close to unity, suggesting that fewer false alarms could have been tolerated. Problem difficulty is assessed by the degree to which the conditional densities p_{r|M_0}(r) and p_{r|M_1}(r) overlap, a problem dependent measurement. If we are testing whether a distribution has one of two possible mean values as in our Gaussian example, a quantity like a signal-to-noise ratio will probably emerge as determining performance. The performance in this case can vary drastically depending on whether the signal-to-noise ratio is large or small. In other kinds of problems, the best possible performance provided by the likelihood ratio test can be poor. For example, consider the problem of determining which of two zero-mean probability densities describes a given set of data consisting of statistically independent observations (see this problem⁸). Presumably, the variances of these two densities are equal as we are trying to determine which density is most appropriate. In this case, the performance probabilities can be quite low, especially when the general shapes of the densities are similar. Thus a single quantity, like the signal-to-noise ratio, does not emerge to characterize problem difficulty in all hypothesis testing problems. In sequel, we will analyze each model evaluation and detection problem in a standard way. After the sufficient statistic has been found, we will seek a value for the threshold that attains a specified false-alarm probability. The detection probability will then be determined as a function of "problem difficulty", the measure of which is problem-dependent. We can control the choice of false-alarm probability; we cannot control problem difficulty. Confusingly, the detection probability will vary with both the specified false-alarm probability and the problem difficulty.

We are implicitly assuming that we have a rational method for choosing the false-alarm probability criterion value. In signal processing applications, we usually make a sequence of decisions and pass them to systems making more global determinations. For example, in digital communications problems the model evaluation formalism could be used to "receive" each bit. Each bit is received in sequence and then passed to the decoder which invokes error-correction algorithms. The important notions here are that the decision-making process occurs at a given rate and that the decisions are presented to other signal processing systems. The rate at which errors occur in system input(s) greatly influences system design. Thus, the selection of a false-alarm probability is usually governed by the error rate that can be tolerated by succeeding systems. If the decision rate is one per day, then a moderately large (say 0.1) false-alarm probability might be appropriate. If the decision rate is a million per second as in a one megabit communication channel, the false-alarm probability should be much lower: 10⁻¹² would suffice for the one-tenth per day error rate.

8 "Statistical Hypothesis Testing: Problems", Exercise 2 <http://cnx.org/content/m11271/latest/#problem2>


4.4 Beyond Two Models⁹

Frequently, more than two viable models for data generation can be defined for a given situation. The classification problem is to determine which of several models best "fits" a set of measurements. For example, determining the type of airplane from its radar returns forms a classification problem. The model evaluation framework has the right structure if we can allow more than two models. We happily note that in deriving the likelihood ratio test we did not need to assume that only two possible descriptions exist. Go back and examine the expression for the maximum probability correct decision rule (Section 4.2.1: Maximum Probability of a Correct Decision). If K models seem appropriate for a specific problem, the decision rule maximizing the probability of making a correct choice is

max_{i ∈ {1, . . . , K}}  π_i p_{r|M_i}(r)

To determine the largest of K quantities, exactly K − 1 numeric comparisons need be made. When we have two possible models (K = 2), this decision rule reduces to the computation of the likelihood ratio and its comparison to a threshold. In general, K − 1 likelihood ratios need to be computed and compared to a threshold. Thus the likelihood ratio test can be viewed as a specific method for determining the largest of the decision statistics π_i p_{r|M_i}(r).

Since we need only the relative ordering of the K decision statistics to make a decision, we can apply any transformation T(·) to them that does not affect ordering. In general, possible transformations must be positively monotonic to satisfy this condition. For example, the needless common additive components in the decision statistics can be eliminated, even if they depend on the observations. Mathematically, "common" means that the quantity does not depend on the model index i. The transformation in this case would be of the form T(z_i) = z_i − a, clearly a monotonic transformation. A positive multiplicative factor can also be "canceled"; if negative, the ordering would be reversed and that cannot be allowed. The simplest resulting expression becomes the sufficient statistic Υ_i(r) for the model. Expressed in terms of the sufficient statistic, the maximum probability correct or the Bayesian decision rule becomes

max_{i ∈ {1, . . . , K}}  C_i + Υ_i(r)

where C_i summarizes all additive terms that do not depend on the observation vector r. The quantity Υ_i(r) is termed the sufficient statistic associated with model i. In many cases, the functional form of the sufficient statistic varies little from one model to another and expresses the necessary operations that summarize the observations. The constants C_i are usually lumped together to yield the threshold against which we compare the sufficient statistic. For example, in the binary model situation, the decision rule becomes

Υ_1(r) + C_1  ≷_{M_0}^{M_1}  Υ_0(r) + C_0

or

Υ_1(r) − Υ_0(r)  ≷_{M_0}^{M_1}  C_0 − C_1

Thus, the sufficient statistic for the decision rule is Υ_1(r) − Υ_0(r) and the threshold γ is C_0 − C_1.

Example 4.4
In the Gaussian problem just discussed, the logarithm of the likelihood function is

ln( p_{r|M_i}(r) ) = −(L/2) ln(2πσ²) − (1/(2σ²)) ∑_{l=0}^{L−1} (r_l − m^{(i)})²

where m^{(i)} is the mean under model i. After appropriate simplification that retains the ordering, we have

Υ_i(r) = (m^{(i)}/σ²) ∑_{l=0}^{L−1} r_l

9 This content is available online at <http://cnx.org/content/m11278/1.3/>.


C_i = −(1/2) L (m^{(i)})² / σ² + c_i

The term c_i is a constant defined by the error criterion; for the maximum probability correct criterion, this constant is ln(π_i).
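A minimal Python sketch of this K-ary decision rule - choose the model maximizing Υ_i(r) + C_i with c_i = ln(π_i) - where the candidate means, the variance, and the priors are illustrative values:

import numpy as np

def classify(r, means, sigma2, priors):
    # Upsilon_i(r) + C_i for each candidate mean m^(i), with c_i = ln(pi_i)
    L, s = len(r), r.sum()
    stats_i = means * s / sigma2 - 0.5 * L * means**2 / sigma2 + np.log(priors)
    return int(np.argmax(stats_i))

rng = np.random.default_rng(4)
means  = np.array([0.0, 1.0, 2.0])          # candidate model means (illustrative)
priors = np.array([1/3, 1/3, 1/3])
r = 1.0 + rng.standard_normal(50)           # data generated under the model with mean 1
print(classify(r, means, sigma2=1.0, priors=priors))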

When employing the Neyman-Pearson test, we need to specify the various error probabilities Pr[say M_i | M_j true]. These specifications amount to determining the constants c_i when the sufficient statistic is used. Since K − 1 comparisons will be used to home in on the optimal decision, only K − 1 error probabilities need be specified. Typically, the quantities Pr[say M_i | M_0 true], i ∈ {1, . . . , K − 1}, are used, particularly when the model M_0 represents the situation when no signal is present (see this problem¹⁰).

10 "Statistical Hypothesis Testing: Problems", Exercise 5 <http://cnx.org/content/m11271/latest/#problem5>
11 This content is available online at <http://cnx.org/content/m11277/1.5/>.

4.5 Model Consistency Testing¹¹

In many situations, we seek to check consistency of the observations with some preconceived model. Alternative models are usually difficult to describe parametrically since inconsistency may be beyond our modeling capabilities. We need a test that accepts consistency of observations with a model or rejects the model without pronouncing a more favored alternative. Assuming we know (or presume to know) the probability distribution of the observations under M_0, the models are

• M_0: r ∼ p_{r|M_0}(r)
• M_1: r ≁ p_{r|M_0}(r)

Null hypothesis testing seeks to determine if the observations are consistent with this description. The best procedure for consistency testing amounts to determining whether the observations lie in a highly probable region as defined by the null probability distribution. However, no one region defines a probability that is less than unity. We must restrict the size of the region so that it best represents those observations maximally consistent with the model while satisfying a performance criterion. Letting P_F be a false-alarm probability established by us, we define the decision region ℜ_0 to satisfy

Pr[r ∈ ℜ_0 | M_0] = ∫_{ℜ_0} p_{r|M_0}(r) dr = 1 − P_F

and

min_{ℜ_0} ∫_{ℜ_0} dr

Usually, this region is located about the mean, but may not be symmetrically centered if the probability density is skewed. Our null hypothesis test for model consistency becomes

r ∈ ℜ_0 ⇒ "say observations are consistent"

r ∉ ℜ_0 ⇒ "say observations are not consistent"

Example 4.5
Consider the problem of determining whether the sequence r_l, l ∈ {1, . . . , L}, is white and Gaussian with zero mean and unit variance. Stated this way, the alternative model is not provided: is this model correct or not? We could estimate the probability density function of the observations and test the estimate for consistency. Here we take the null-hypothesis testing approach of converting this problem into a one-dimensional one by considering the statistic r = ∑_l r_l², which has a χ²_L distribution.

Because this probability distribution is unimodal, the decision region can be safely assumed to be an interval [r′, r′′].¹² In this case, we can find an analytic solution to the problem of determining the decision region. Letting R = r′′ − r′ denote the width of the interval, we seek the solution of the constrained optimization problem

min_{r′} R   subject to   P_r(r′ + R) − P_r(r′) = 1 − P_F

We convert the constrained problem into an unconstrained one using Lagrange multipliers.

min_{r′} R + λ ( P_r(r′ + R) − P_r(r′) − (1 − P_F) )

Evaluation of the derivative of this quantity with respect to r′ yields the result p_r(r′ + R) = p_r(r′): to minimize the interval's width, the probability density function's values at the interval's endpoints must be equal. Finding these endpoints to satisfy the constraints amounts to searching the probability distribution at such points for increasing values of R until the required probability is contained within. For L = 100 and P_F = 0.05, the optimal decision region for the χ²_L distribution is [72.82, 128.5]. Figure 4.4 demonstrates ten testing trials for observations that fit the model and for observations that don't.

Figure 4.4: Ten trials of testing a 100-element sequence for consistency with a white, Gaussian model - r_l ∼ N(0, 1) - for three situations. In the first (shown by the circles), the observations do conform to the model. In the second (boxes), the observations are zero-mean Gaussian but with variance two. Finally, the third example (crosses) has white observations with a density closely resembling the Gaussian: a hyperbolic secant density having zero mean and unit variance. The sum of squared observations for each example is shown with the optimal χ²_100 interval displayed. Note how dramatically the test statistic departs from the decision interval when parameters disagree.

12 This one-dimensional result for the consistency test may extend to the multi-dimensional case in the obvious way.
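A minimal Python sketch of the consistency test in Example 4.5; the interval endpoints and L are taken from the example, while the simulated data are an illustrative assumption.

import numpy as np
from scipy import stats

r_lo, r_hi, L = 72.82, 128.5, 100            # interval and length from Example 4.5

rng = np.random.default_rng(5)
r = rng.standard_normal(L)                   # observations to be tested
statistic = np.sum(r**2)                     # chi-squared with L degrees of freedom under M0
print(statistic, r_lo <= statistic <= r_hi)  # True: "observations are consistent"

# the interval contains probability close to 1 - P_F = 0.95 under the null model
print(stats.chi2.cdf(r_hi, L) - stats.chi2.cdf(r_lo, L))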


4.6 Stein's Lemma¹³

As important as understanding the false-alarm, miss and error probabilities of the likelihood ratio test might be, no general expression exists for them. In analyzing the Gaussian problem, we find these in terms of Q(·), which has no closed form expression. In the general problem, the situation is much worse: No expression of any kind can be found! The reason is that we don't have the probability distribution of the sufficient statistic in the likelihood ratio test, which is needed in these expressions (4.12). We are faced with the curious situation that while knowing the decision rule that optimizes the performance probabilities, we usually don't know what the resulting performance probabilities will be.

Some general expressions are known for the asymptotic form of these error probabilities: limiting expressions as the number of observations L becomes large. These results go by the generic name of Stein's Lemma (Cover and Thomas, 12.8 [50]).¹⁴ To develop these, we need to define the Kullback-Leibler and Chernoff distances. Respectively,

D(p_1 ‖ p_0) = ∫ p_1(x) log( p_1(x) / p_0(x) ) dx    (4.14)

C(p_0, p_1) = −min_{0 ≤ s ≤ 1} log ∫ p_0^{1−s}(x) p_1^{s}(x) dx    (4.15)

These distances are important special cases of the Ali-Silvey distances (Chapter 7) and have the following properties.

1. D (p1 ‖ p0) ≥ 0 and C (p0, p1) ≥ 0. Furthermore, these distances equal zero only when p0 (r) = p1 (r).The Kullback-Leibler and Cherno distances are always non-negative, with zero distance occurringonly when the probability distributions are the same.

2. D (p1 ‖ p0) = ∞ whenever, for some r, p0 (r) = 0 and p1 (r) 6= 0. If p1 (r) = 0, the value of

p1 (r) log p1(r)p0(r) is dened to be zero.

3. When the underlying stochastic quantities are random vectors having statistically independent compo-nents with respect to both p0 and p1, the Kullback-Leibler distance equals the sum of the componentdistances. Stated mathematically, if p0 (r) =

∏l p0 (rl) and p1 (r) =

∏l p1 (rl),

D (p1 (r) ‖ p0 (r)) =∑ll

D (p1 (rl) ‖ p0 (rl)) (4.16)

The Cherno distance does not have this property unless the components are identically distributed.4. D (p1 ‖ p0) 6= D (p0 ‖ p1); C (p0, p1) = C (p1, p0). The Kullback-Leibler distance is usually not a sym-

metric quantity. In some special cases, it can be symmetric (like the just described Gaussian example),but symmetry cannot, and should not, be expected. The Cherno distance is always symmetric.

5. D (p (r1, r2) ‖ p (r1) p (r2)) = I (r1; r2). The Kullback-Leibler distance between a joint probabilitydensity and the product of the marginal distributions equals what is known in information theory asthe mutual information between the random variables r1, r2. From the properties of the Kullback-Leibler distance, we see that the mutual information equals zero only when the random variables arestatistically independent.

These quantities are not actually distances. The Kullback-Leibler distance D (· ‖ ·) is not symmetric in itsarguments and the Cherno distance C (·, ·) does not obey the triangle inequality. Nonetheless, the name"distance" is frequently attached to them for reasons that will become clear later.
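These definitions lend themselves to direct numerical evaluation. The sketch below is a Python illustration only (the grid, parameter values, and helper names are assumptions, not part of the text); it computes the Kullback-Leibler distance (4.14) and the Chernoff distance (4.15) by numerical integration for two Gaussian densities p0 = N(0, σ²) and p1 = N(m, σ²), and the results can be checked against the closed forms m²/(2σ²) and m²/(8σ²) derived later in Example 4.6.

    import numpy as np
    from scipy.stats import norm
    from scipy.optimize import minimize_scalar

    def kl_distance(p1, p0, x):
        # D(p1 || p0) = integral of p1 log(p1/p0) dx, evaluated on the grid x
        return np.trapz(p1 * np.log(p1 / p0), x)

    def chernoff_distance(p0, p1, x):
        # C(p0, p1) = -min_{0<=s<=1} log integral of p0^(1-s) p1^s dx
        def neg_log_moment(s):
            return np.log(np.trapz(p0**(1 - s) * p1**s, x))
        res = minimize_scalar(neg_log_moment, bounds=(0.0, 1.0), method='bounded')
        return -res.fun, res.x                      # distance and the optimizing s*

    m, sigma = 1.0, 1.0                             # assumed example values
    x = np.linspace(-10, 11, 20001)
    p0 = norm.pdf(x, 0, sigma)
    p1 = norm.pdf(x, m, sigma)

    print(kl_distance(p1, p0, x), m**2 / (2 * sigma**2))   # these should agree
    C, s_star = chernoff_distance(p0, p1, x)
    print(C, m**2 / (8 * sigma**2), s_star)                # s* is 1/2 for this symmetric case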

13 This content is available online at <http://cnx.org/content/m11275/1.7/>.
14 The attribution to statistician Charles Stein is probably incorrect. Herman Chernoff wrote a paper (Chernoff [6]) providing a derivation of this result. A reviewer stated that he thought Stein had derived the result in a technical report, which Chernoff had not seen. Chernoff modified his paper to include the reference without checking it. Chernoff's paper provided the link to Stein. However, Stein later denied he had proved the result; so much for not questioning reviewers! Stein's Lemma should be known as Chernoff's Lemma.


In the case that the L observations are statistically independent according to both models, Stein's Lemma states that the false-alarm probability resulting from using the Neyman-Pearson hypothesis test has the form (Johnson and Orsak [26])

lim_{L→∞} (1/L) log PF = −D(p_{r|M1} ‖ p_{r|M0})   (4.17)

or

PF → f(L) e^{−L D(p_{r|M1} ‖ p_{r|M0})} as L → ∞   (4.18)

where f(L) is a slowly varying function compared to the exponential: lim_{L→∞} [log f(L)]/L = 0. This function is problem- and PM-dependent, and thus usually not known. What this asymptotic formula says is that as the number of observations increases, the false-alarm probability of a Neyman-Pearson hypothesis test plotted on semilogarithmic coordinates will eventually become a straight line, the slope of which is −D(p1 ‖ p0). Figure 4.5 demonstrates this effect.


Figure 4.5: Using the Gaussian and Poisson classification problems as examples, we plot the false-alarm probability (left panels) and average probability of error (right panels) for each as a function of the amount of statistically independent data used by the optimal classifier. The miss-probability criterion was that it be less than or equal to 0.1. The a priori probabilities are 1/2 in the right-column examples. As shown here, the average error probability produced by the minimum Pe classifier typically decays more slowly than the false-alarm probability for the classifier that fixes the miss probability. The dashed lines depict the behavior of the error probabilities as predicted by asymptotic theory (4.17). In each case, these theoretical lines have been shifted vertically for ease of comparison.

A similar result holds for the miss probability if it is optimized and the false-alarm probability is constrained. Thus, in all model evaluation problems solved using the Neyman-Pearson approach, the optimized probability will always (eventually) decay exponentially in the number of observations. For this reason, D(p_{r|M1} ‖ p_{r|M0}) is known as the exponential rate.

Showing this result relies on, believe it or not, the law of large numbers. Define the decision region ℜ1(L) according to

ℜ1(L) = { r | e^{L(D(p_{r|M1} ‖ p_{r|M0}) − δ)} ≤ p_{r|M1}/p_{r|M0} ≤ e^{L(D(p_{r|M1} ‖ p_{r|M0}) + δ)} }

This decision region will vary with the number of observations, as is typical of Neyman-Pearson decision rules. First of all, we must show that the decision rule this region yields does indeed satisfy a criterion on the miss probability.

1 − PM = Pr[ (1/L) Σ_l log ( p(rl|M1)/p(rl|M0) ) ∈ ( D(p_{r|M1} ‖ p_{r|M0}) − δ, D(p_{r|M1} ‖ p_{r|M0}) + δ ) | M1 ]

This sum is the average of the log-likelihood ratios computed at each observation, assuming that M1 is true. Because of the strong law of large numbers, as the number of observations increases, this sum converges to its expected value, which is D(p_{r|M1} ‖ p_{r|M0}). Therefore, lim_{L→∞} (1 − PM) = 1 for any δ, which means PM → 0. Thus, this decision region guarantees not only that the miss probability is less than some specified number, but also that it decreases systematically as the number of observations increases.

To analyze how the false-alarm probability varies with L, we note that in this decision region

p_{r|M0} ≤ p_{r|M1} e^{−L(D(p_{r|M1} ‖ p_{r|M0}) − δ)}

p_{r|M0} ≥ p_{r|M1} e^{−L(D(p_{r|M1} ‖ p_{r|M0}) + δ)}

Integrating these over ℜ1(L) yields upper and lower bounds on PF.

e^{−L(D(p_{r|M1} ‖ p_{r|M0}) + δ)} (1 − PM) ≤ PF ≤ e^{−L(D(p_{r|M1} ‖ p_{r|M0}) − δ)} (1 − PM)

or

−D(p_{r|M1} ‖ p_{r|M0}) − δ + (1/L) log(1 − PM) ≤ (1/L) log PF ≤ −D(p_{r|M1} ‖ p_{r|M0}) + δ + (1/L) log(1 − PM)

Because δ can be made arbitrarily small as the number of observations increases, we obtain Stein's Lemma. For the average probability of error, the Chernoff distance is its exponential rate.

Pe → f(L) e^{−L C(p_{r|M1}, p_{r|M0})} as L → ∞

Showing this result leads to more interesting results. For the minimum-probability-of-error detector, the decision region ℜ1 is defined according to π1 p_{r|M1} > π0 p_{r|M0}. Using this fact, we can write the expression for Pe as

Pe = π0 ∫_{ℜ1} p_{r|M0}(r) dr + π1 ∫_{ℜ0} p_{r|M1}(r) dr = ∫ min{ π0 p_{r|M0}(r), π1 p_{r|M1}(r) } dr   (4.19)

The minimum function has the property min{a, b} ≤ a^{1−s} b^s, 0 ≤ s ≤ 1, for non-negative quantities a and b.15 With this bound, we can find an upper bound for the average error probability.

Pe ≤ ∫ ( π0 p_{r|M0}(r) )^{1−s} ( π1 p_{r|M1}(r) )^s dr ≤ ∫ ( p_{r|M0}(r) )^{1−s} ( p_{r|M1}(r) )^s dr

When r has statistically independent and identically distributed components,

Pe ≤ ∫ ( ∏_l p_{rl|M0}(rl) )^{1−s} ( ∏_l p_{rl|M1}(rl) )^s dr = ∏_l ∫ ( p_{rl|M0}(rl) )^{1−s} ( p_{rl|M1}(rl) )^s drl = ( ∫ ( p_{rl|M0}(rl) )^{1−s} ( p_{rl|M1}(rl) )^s drl )^L   (4.20)

Consequently, we have that

∀s, 0 ≤ s ≤ 1 : (1/L) log Pe ≤ log ∫ ( p_{rl|M0}(rl) )^{1−s} ( p_{rl|M1}(rl) )^s drl

15 To see this, simply plot the minimum function versus one of its arguments with the other one fixed. Plotting a^{1−s} b^s the same way shows that it indeed is an upper bound.


Asymptotically, this bound is tight (the Gaussian example shows this to be true). Thus, the exponential rate for the average probability of error is the minimum of this bound with respect to s. This quantity is the negative of the Chernoff distance (4.15) and we arrive at the expression for the average probability of error.

The Kullback-Leibler and Chernoff distances are related in an important way. Define ps to be

ps = p0^{1−s} p1^s / J(s),

where J(s) is the normalization constant, the quantity optimized in the definition of the Chernoff distance. The Chernoff distance equals the Kullback-Leibler distance between ps, the member of this family equi-distant from p0 and p1, and either of them. Call the value of s corresponding to this equi-distant distribution s*; this value corresponds to the result of the optimization, and

C(p0, p1) = D(ps* ‖ p0) = D(ps* ‖ p1)

Figure 4.5 shows that the exponential rates of minimum Pe and Neyman-Pearson hypothesis tests are not necessarily the same. In other words, the distance to the equi-distant distribution need not equal the total distance, which seems to make sense.

The larger the distance, the greater the rate at which the error probability decreases, which corresponds to an easier problem (smaller error probabilities). However, Stein's Lemma does not allow a precise calculation of error probabilities. The quantity f(·) depends heavily on the underlying probability distributions in ways that are problem dependent. All a distance calculation gives us is the exponential rate.

Example 4.6
For the Gaussian example we have been considering, set the miss probability's criterion value to be α. Because PM = Q((Lm − γ)/(√L σ)), we find the threshold value γ to be Lm − √L σ Q⁻¹(α), which yields a false-alarm probability of Q(√L m/σ − Q⁻¹(α)). The Kullback-Leibler distance between two Gaussians p0 = N(0, σ²) and p1 = N(m, σ²) equals

D(p1 ‖ p0) = ∫ (1/√(2πσ²)) e^{−(x−m)²/(2σ²)} ( −(x−m)²/(2σ²) + x²/(2σ²) ) dx
           = ∫ (1/√(2πσ²)) e^{−(x−m)²/(2σ²)} (2mx − m²)/(2σ²) dx
           = m²/(2σ²)   (4.21)

In this Gaussian case the Kullback-Leibler distance is symmetric: D(p1 ‖ p0) = D(p0 ‖ p1). Verify this atypical (most cases are asymmetric) result for yourself. The Chernoff distance equals m²/(8σ²) (the value of s* is 1/2). Stein's Lemma predicts that the false-alarm probability has the asymptotic form f(L) e^{−Lm²/(2σ²)}. Thus, does Q(√L m/σ − Q⁻¹(α)) → f(L) e^{−Lm²/(2σ²)}, where f(L) is slowly varying compared to the exponential? An asymptotic formula (1.19) for Q(x) is

Q(x) → (1/(√(2π) x)) e^{−x²/2} as x → ∞

As L increases, so does the argument of Q (·) in the expression for the false-alarm probability. Thus

PF → [1/(√(2π)(√L m/σ − Q⁻¹(α)))] e^{−(√L m/σ − Q⁻¹(α))²/2}
   = [1/(√(2π)(√L m/σ − Q⁻¹(α)))] e^{√L (m/σ) Q⁻¹(α) − (Q⁻¹(α))²/2} e^{−Lm²/(2σ²)} as L → ∞   (4.22)

The quantity multiplying the final exponential corresponds to f(L), which satisfies the slowly varying property. We have verified Stein's Lemma prediction for the asymptotic behavior of the false-alarm probability in the Gaussian case. Note how the criterion value α occurs only in the expression for f(L), and hence does not affect the exponential rate; Stein's Lemma says this situation always occurs.
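This asymptotic behavior can be checked numerically. The following sketch (parameter values and the grid of L are arbitrary assumptions made for illustration) evaluates the exact false-alarm probability Q(√L m/σ − Q⁻¹(α)) alongside f(L) e^{−Lm²/(2σ²)} with f(L) taken from (4.22); the ratio of the two tends to one as L grows.

    import numpy as np
    from scipy.stats import norm

    m, sigma, alpha = 0.5, 1.0, 0.1                     # assumed example values
    L = np.array([25, 100, 400, 1600])

    x = np.sqrt(L) * m / sigma - norm.isf(alpha)        # argument of Q(.)
    P_F_exact = norm.sf(x)                              # exact Neyman-Pearson false-alarm probability

    # Slowly varying factor f(L) from (4.22) times the exponential-rate term
    f_L = np.exp(np.sqrt(L) * (m / sigma) * norm.isf(alpha)
                 - norm.isf(alpha)**2 / 2) / (np.sqrt(2 * np.pi) * x)
    P_F_asym = f_L * np.exp(-L * m**2 / (2 * sigma**2))

    print(P_F_exact / P_F_asym)                         # ratios approach 1 as L grows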


4.7 Sequential Hypothesis Testing16

In many circumstances, the observations to be used in evaluating models arrive sequentially rather than all at once. For example, passive sonar systems may well "listen" over a period of time to an array's output while the array is steered in a particular direction. The decision rules we have derived implicitly assume the entire block of data - the array output observed over a long period of time - is available. You might wonder whether a hypothesis test could be developed that takes the sequential arrival of data into account, making decisions as the data arrive, with the possibility of determining early in the data collection procedure the validity of one model, while maintaining the same performance specifications. Answering this question leads to the formulation of sequential hypothesis testing (Poor: 136-156 [25], Wald [2]). Not only do sequential tests exist, they can provide performance superior to that of block tests in certain cases.

To make decisions as the data become available, we must generalize the decision-making process. Assume as before that the observed data comprise an observation vector r of length L. The decision rule (in the two-model case) now consists of determining which model is valid or that more data are required. Thus, the range of values of r is partitioned into three regions ℜ0, ℜ1, and ℜ?. Making the latter decision implies that the data gathered to that point is insufficient to meet the performance requirements. More data must be obtained to achieve the required performance and the test re-applied once these additional data become available. Thus, a variable number of observations are required to make a decision. An issue in this kind of procedure is the number of observations required to satisfy the performance criteria: for a common set of performance specifications, does this procedure result in a decision rule requiring, on the average, fewer observations than does a fixed-length block test?

4.7.1 Sequential Likelihood Ratio Test

In a manner similar to the Neyman-Pearson criterion, we specify the false-alarm probability PF; in addition, we need to specify the detection probability PD. These constraints over-specify the model evaluation problem where the number of observations is fixed: enforcing one constraint forces violation of the other. In contrast, both may be specified in the sequential test, as we shall see.

Assuming a likelihood ratio test, two thresholds are required to define the three decision regions.

ΛL(r) < η0                    say M0
η0 < ΛL(r) < η1               say "need more data"
η1 < ΛL(r)                    say M1

where ΛL(r) is the usual likelihood ratio where the dimension L of the vector r is explicitly denoted. The threshold values η0 and η1 are found from the constraints, which are expressed as

PF = ∫_{ℜ1} p_{r|M0}(r) dr = α

and

PD = ∫_{ℜ1} p_{r|M1}(r) dr = β

Here, α and β are design constants that you choose according to the application. Note that the probabilities PF, PD are associated not with what happens on a given trial, but with what the sequential test yields in terms of performance when a decision is made. Thus, PM = 1 − PD although the probability of correctly saying M1 on a given trial does not equal one minus the probability of incorrectly saying M0 is true: The "need more data" region must be accounted for on an individual trial but not when considering the sequential test's performance when it terminates.

16This content is available online at <http://cnx.org/content/m11242/1.5/>.


Rather than explicitly attempting to relate thresholds to performance probabilities, we obtain simpler results by using bounds and approximations. Note that the expression for PD may be written as

PD = ∫_{ℜ1} [ p_{r|M1}(r) / p_{r|M0}(r) ] p_{r|M0}(r) dr = ∫_{ℜ1} ΛL(r) p_{r|M0}(r) dr

In the decision region ℜ1, ΛL(r) ≥ η1; thus, a lower bound on the detection probability can be established by substituting this inequality into the integral.

PD ≥ η1 ∫_{ℜ1} p_{r|M0}(r) dr

The integral is the false-alarm probability PF of the test when it terminates. In this way, we find that PD/PF ≥ η1. Using similar arguments on the miss probability, we obtain a similar bound on the threshold η0.

These inequalities are summarized as

η0 ≥ (1 − PD)/(1 − PF)

and

η1 ≤ PD/PF

These bounds, which relate the thresholds in the sequential likelihood ratio test with the false-alarm and detection probabilities, are general, applying even when sequential tests are not being used. In the usual likelihood ratio test, there is a single threshold η; these bounds apply to it as well, implying that in a likelihood ratio test the error probabilities will always satisfy

PD/PF ≥ (1 − PD)/(1 − PF)   (4.23)

This relationship can be manipulated to show PD ≥ PF, indicating that the likelihood ratio test is, at the very least, a reasonable decision rule and that the ROC curves (p. 99) have the right general form.

Only with difficulty can we solve the inequality constraints on the sequential test's thresholds in the general case. Surprisingly, by approximating the inequality constraints by equality constraints we can obtain a result having pragmatically important properties. As an approximation, we thus turn to solving for η0 and η1 under the conditions

η0 = (1 − β)/(1 − α)

and

η1 = β/α

In this way, the threshold values are explicitly specified in terms of the desired performance probabilities. We use the criterion values for the false-alarm and detection probabilities because when we use these equalities, the test's resulting performance probabilities PF and PD usually do not satisfy the design criteria. For example, equating η1 to a value potentially larger than its desired value might result in a smaller detection probability and a larger false-alarm rate. We will want to understand how much actual performance departs from what we want.

The relationships derived above between the performance levels and the thresholds apply no matter how the thresholds are chosen.

(1 − β)/(1 − α) ≥ (1 − PD)/(1 − PF)

β/α ≤ PD/PF


From these inequalities, two important results follow:

PF ≤ α/β

1 − PD ≤ (1 − β)/(1 − α)

and

PF + 1 − PD ≤ α + 1 − β

The first result follows directly from the threshold bounds. To derive the second result, we must work a little harder. Multiplying the first inequality by (1 − α)(1 − PF) yields (1 − β)(1 − PF) ≥ (1 − PD)(1 − α). Considering the reciprocal of the second inequality and multiplying it by β PD yields PD α ≥ β PF. Adding the two inequalities yields the second result.

The first set of inequalities suggests that the false-alarm and miss (which equals 1 − PD) probabilities will increase only slightly from their specified values: the denominators on the right sides are very close to unity in the interesting cases (e.g., small error probabilities like 0.01). The second inequality suggests that the sum of the false-alarm and miss probabilities obtained in practice will be less than the sum of the specified error probabilities. Taking these results together, one of two situations will occur when we approximate the inequality criterion by equality: either the false-alarm probability will decrease and the detection probability increase (a most pleasing but unlikely circumstance) or one of the error probabilities will increase while the other decreases. The false-alarm and miss probabilities cannot both increase. Furthermore, whichever one increases, the first inequalities suggest that the incremental change will be small. Our ad hoc approximation to the thresholds does indeed yield a level of performance close to that specified.

Usually, the likelihood is manipulated to derive a sufficient statistic. The resulting sequential decision rule is

ΥL(r) < γ0(L)                   say M0
γ0(L) < ΥL(r) < γ1(L)           say "need more data"
γ1(L) < ΥL(r)                   say M1

Note that the thresholds γ0(L) and γ1(L), derived from the thresholds η0 and η1, usually depend on the number of observations used in the decision rule.

Example 4.7
Let r be a Gaussian random vector as in our previous examples with statistically independent components.

M0 : r ∼ N(0, σ²I)
M1 : r ∼ N(m, σ²I)

The mean vector m is assumed for simplicity to consist of equal positive values: m = (m, . . . , m)^T, m > 0. Using the previous derivations, our sequential test becomes

Σ_{l=0}^{L−1} rl < (σ²/m) log η0 + Lm/2                                       say M0
(σ²/m) log η0 + Lm/2 < Σ_{l=0}^{L−1} rl < (σ²/m) log η1 + Lm/2                say "need more data"
(σ²/m) log η1 + Lm/2 < Σ_{l=0}^{L−1} rl                                       say M1

Starting with L = 1, we gather the data and compute the sum. The sufficient statistic will lie in the middle range between the two thresholds until one of them is exceeded, as shown in Figure 4.6.


Figure 4.6: Example of the sequential likelihood ratio test. The sufficient statistic wanders between the two thresholds in a sequential decision rule until one of them is crossed by the statistic. The number of observations used to obtain a decision is L0.

The model evaluation procedure then terminates and the chosen model is announced. Note how the thresholds depend on the amount of data available (as expressed by L). This variation typifies sequential hypothesis tests.
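A minimal simulation sketch of this sequential test for the Gaussian example follows (all numerical values and the function name are assumptions made for illustration, not part of the text). The thresholds come from the approximations η0 = (1 − β)/(1 − α) and η1 = β/α; observations are drawn one at a time until the running sum crosses a threshold, and the truncation fallback mirrors the midpoint rule described at the end of this section.

    import numpy as np

    def sequential_gaussian_test(m, sigma, alpha, beta, rng, true_model, max_L=10000):
        """Sequential likelihood ratio test for N(0, sigma^2) vs N(m, sigma^2) observations.
           Returns the decision (0 or 1) and the number of observations used."""
        log_eta0 = np.log((1 - beta) / (1 - alpha))
        log_eta1 = np.log(beta / alpha)
        mean = m if true_model == 1 else 0.0
        running_sum = 0.0
        for L in range(1, max_L + 1):
            running_sum += rng.normal(mean, sigma)
            lower = (sigma**2 / m) * log_eta0 + L * m / 2
            upper = (sigma**2 / m) * log_eta1 + L * m / 2
            if running_sum < lower:
                return 0, L
            if running_sum > upper:
                return 1, L
        # Truncated test: pick the model whose boundary is nearer the statistic
        return int(running_sum > (lower + upper) / 2), max_L

    rng = np.random.default_rng(0)
    trials = [sequential_gaussian_test(0.5, 1.0, 0.01, 0.99, rng, true_model=1)
              for _ in range(1000)]
    print(np.mean([d for d, _ in trials]),        # fraction deciding M1 (near the design P_D)
          np.mean([L for _, L in trials]))        # average number of observations used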

4.7.2 Average Number of Required Observations

The awake reader might wonder whether the sequential likelihood ratio test just derived has the disturbing property that it may never terminate: can the likelihood ratio wander between the two thresholds forever? Fortunately, the sequential likelihood ratio test has been shown to terminate with probability one (Wald [2]). Confident of eventual termination, we need to explore how many observations are required to meet performance specifications. The number of observations is variable, depending on the observed data and the stringency of the specifications. The average number of observations required can be determined in the interesting case when the observations are statistically independent.

Assuming that the observations are statistically independent and identically distributed, the likelihood ratio is equal to the product of the likelihood ratios evaluated at each observation. Considering ln Λ_{L0}(r), the logarithm of the likelihood ratio when a decision is made on observation L0, we have

ln Λ_{L0}(r) = Σ_{l=0}^{L0−1} ln Λ(rl)

where Λ(rl) is the likelihood ratio corresponding to the lth observation. We seek an expression for E[L0], the expected value of the number of observations required to make the decision. To derive this quantity, we evaluate the expected value of the logarithm of the likelihood ratio when the decision is made. This value will usually vary with which model is actually valid; we must consider both models separately. Using the laws of conditional expectation (see Joint Distributions (1.9)), we find that the expected value of ln Λ_{L0}(r), assuming that model M1 was true, is given by

E [ln (ΛL0 (r)) | M1 ] = E [E [ln (ΛL0 (r)) | M1, L0 ]]

The outer expected value is evaluated with respect to the probability distribution of L0; the inner expected value is the average value of the log-likelihood assuming that L0 observations were required to choose model M1. In the latter case, the log-likelihood is the sum of L0 component log-likelihood ratios

E [ln (ΛL0 (r)) | M1, L0 ] = L0E [ln (Λ (rl)) | M1 ]

Noting that the expected value on the right is a constant with respect to the outer expected value, we find that

E [ln (ΛL0 (r)) | M1 ] = E [L0 | M1 ]E [ln (Λ (rl)) | M1 ]

The average number of observations required to make a decision, correct or incorrect, assuming that M1 is true is thus expressed by

E[L0 | M1] = E[ln Λ_{L0}(r) | M1] / E[ln Λ(rl) | M1]

Assuming that the other model was true, we have the complementary result

E[L0 | M0] = E[ln Λ_{L0}(r) | M0] / E[ln Λ(rl) | M0]

The numerator is difficult to calculate exactly but easily approximated; assuming that the likelihood ratio equals its threshold value when the decision is made,

E[ln Λ_{L0}(r) | M0] ≃ PF ln η1 + (1 − PF) ln η0 = PF ln(PD/PF) + (1 − PF) ln((1 − PD)/(1 − PF))

E[ln Λ_{L0}(r) | M1] ≃ PD ln η1 + (1 − PD) ln η0 = PD ln(PD/PF) + (1 − PD) ln((1 − PD)/(1 − PF))

Note these expressions are not problem dependent; they depend only on the specified probabilities. The denominator cannot be approximated in a similar way with such generality; it must be evaluated for each problem.

Example 4.8
In the Gaussian example we have been exploring, the log-likelihood of each component observation rl is given by

ln Λ(rl) = m rl/σ² − m²/(2σ²)

The conditional expected values required to evaluate the expression for the average number of required observations are

E[ln Λ(rl) | M0] = −m²/(2σ²)

E[ln Λ(rl) | M1] = m²/(2σ²)

For simplicity, let's assume that the false-alarm and detection probabilities are symmetric (i.e., PF = 1 − PD). The expressions for the average number of observations are equal for each model and we have

E[L0 | M0] = E[L0 | M1] = f(PF) σ²/m²

where f(PF) is a function equal to (2 − 4PF) ln((1 − PF)/PF). Thus, the number of observations decreases with increasing signal-to-noise ratio m/σ and increases as the false-alarm probability is reduced.

Suppose we used a likelihood ratio test where all data were considered once and a decision made; how many observations would be required to achieve a specified level of performance and how would this fixed number compare with the average number of observations in a sequential test? In this example, we find from our earlier calculations (see equation (4.11)) that PF = Q(√L m/(2σ)), so that

L = 4 (Q⁻¹(PF))² σ²/m²

The duration of the sequential and block tests depends on the signal-to-noise ratio in the same way; however, the dependence on the false-alarm probability is quite different. As depicted in Figure 4.7, the disparity between these quantities increases rapidly as the false-alarm probability decreases, with the sequential test requiring correspondingly fewer observations on the average.

Figure 4.7: The numbers of observations required by the sequential test (on the average) and by the block test for Gaussian observations are proportional to σ²/m²; the coefficients of these expressions (f(PF) and 4(Q⁻¹(PF))², respectively) are shown.
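The disparity shown in Figure 4.7 can be tabulated directly; the sketch below (using an arbitrary grid of false-alarm probabilities, an assumption made for illustration) evaluates the two coefficients of σ²/m²: f(PF) = (2 − 4PF) ln((1 − PF)/PF) for the sequential test and 4(Q⁻¹(PF))² for the block test.

    import numpy as np
    from scipy.stats import norm

    P_F = np.logspace(-6, -1, 6)                            # 1e-6 ... 1e-1
    seq_coeff = (2 - 4 * P_F) * np.log((1 - P_F) / P_F)     # sequential test (average)
    block_coeff = 4 * norm.isf(P_F)**2                      # fixed-length block test

    for pf, cs, cb in zip(P_F, seq_coeff, block_coeff):
        print(f"P_F = {pf:.0e}:  sequential {cs:6.1f}   block {cb:6.1f}")

The sequential coefficient is always the smaller of the two, and the gap widens as PF decreases, consistent with the figure.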

We must not forget that these results apply to the average number of observations required to make a decision. Expressions for the distribution of the number of observations are complicated and depend heavily on the problem. When an extremely large number of observations is required to resolve a difficult case to the required accuracy, we are forced to truncate the sequential test, stopping when a specified number of observations have been used. A decision would then be made by dividing the region between the boundaries in half and selecting the model corresponding to the boundary nearest to the sufficient statistic. If this truncation point is larger than the expected number, the performance probabilities will change little. "Larger" is again problem dependent; analytic results are few, leaving the option of computer simulations to estimate the distribution of the number of observations required for a decision.


4.8 Detection in the Presence of Unknowns17

We assumed in the previous sections that we have a few well-specified models (hypotheses) for a set of observations. These models were probabilistic; to apply the techniques of statistical hypothesis testing, the models take the form of conditional probability densities. In many interesting circumstances, the exact nature of these densities may not be known. For example, we may know a priori that the mean is either zero or some constant (as in the Gaussian example). However, the variance of the observations may not be known or the value of the non-zero mean may be in doubt. In an array processing context, these respective situations could occur when the background noise level is unknown (a likely possibility in applications) or when the signal amplitude is not known because of far-field range uncertainties (the further the source of propagating energy, the smaller its received energy at each sensor). In an extreme case, we can question the exact nature of the probability densities (everything is not necessarily Gaussian!). The model evaluation problem can still be posed for these situations; we classify the "unknown" aspects of a model testing problem as either parametric (the variance is not known, for example) or nonparametric (the formula for the density is in doubt). The former situation has a relatively long history compared to the latter; many techniques can be used to approach parametric problems while the latter is a subject of current research (Gibson and Melsa [13]). We concentrate on parametric problems here.

We describe the dependence of the conditional density on a set of parameters by incorporating the parameter vector θ as part of the condition. We write the likelihood function as p_{r|Mi,θ}(r) for the parametric problem. In statistics, this situation is said to be a composite hypothesis (Cramér [9]). Such situations can be further categorized according to whether the parameters are random or nonrandom. For a parameter to be random, we have an expression for its a priori density, which could depend on the particular model. As stated many times, a specification of a density usually expresses some knowledge about the range of values a parameter may assume and the relative probability of those values. Saying that a parameter has a uniform distribution implies that the values it assumes are equally likely, not that we have no idea what the values might be and express this ignorance by a uniform distribution. If we are ignorant of the underlying probability distribution that describes the values of a parameter, we will characterize them simply as being unknown (not random). Once we have considered the random parameter18 case, nonrandom but unknown parameters19 will be discussed.

4.9 Detection of Signals in Noise20

Far and away the most common decision problem in signal processing is determining which of several signals occurs in data contaminated by additive noise. Specializing to the case when one of two possible signals is present, the data models are

• M0 : R(l) = s0(l) + N(l), 0 ≤ l < L
• M1 : R(l) = s1(l) + N(l), 0 ≤ l < L

where si(l) denotes the known signals and N(l) denotes additive noise modeled as a stationary stochastic process. This situation is known as the binary detection problem: distinguish between two possible signals present in a noisy waveform.

We form the discrete-time observations into a vector: R = (R(0), . . . , R(L − 1))^T. Now the models become

• M0 : R = s0 + N
• M1 : R = s1 + N

17 This content is available online at <http://cnx.org/content/m11229/1.4/>.
18 "Random Parameters" <http://cnx.org/content/m11605/latest/>
19 "Non-Random Parameters" <http://cnx.org/content/m11238/latest/>
20 This content is available online at <http://cnx.org/content/m16253/1.9/>.


To apply our detection theory results, we need the probability density of R under each model. As the only probabilistic component of the observations is the noise, the required density for the detection problem is given by

p_{R|Mi}(r) = p_N(r − si)

and the corresponding likelihood ratio by

Λ(r) = p_N(r − s1) / p_N(r − s0)

Much of detection theory revolves about interpreting this likelihood ratio and deriving the detection threshold.

4.9.1 Additive White Gaussian Noise

By far the easiest detection problem to solve occurs when the noise vector consists of statistically independent, identically distributed, Gaussian random variables, what is commonly termed white Gaussian noise. The mean of white noise is usually taken to be zero21 and each component's variance is σ². The equal-variance assumption implies the noise characteristics are unchanging throughout the entire set of observations. The probability density of the noise vector evaluated at r − si equals that of a Gaussian random vector having independent components with mean si.

p_N(r − si) = (1/(2πσ²))^{L/2} e^{−(1/(2σ²)) (r − si)^T (r − si)}

The resulting detection problem is similar to the Gaussian example we previously examined, with the difference here being a non-zero mean (the signal) under both models. The logarithm of the likelihood ratio becomes

(r − s0)^T (r − s0) − (r − s1)^T (r − s1)  ≷^{M1}_{M0}  2σ² ln η

The usual simplifications yield

r^T s1 − s1^T s1 / 2 − ( r^T s0 − s0^T s0 / 2 )  ≷^{M1}_{M0}  σ² ln η

The model-specific components on the left side express the signal processing operations for each model.22

Each term in the computations for the optimum detector has a signal processing interpretation. When expanded, the term si^T si equals Σ_{l=0}^{L−1} si²(l), the signal energy Ei. The remaining term, r^T si, is the only one involving the observations and hence constitutes the sufficient statistic Υi(r) for the additive white Gaussian noise detection problem.

Υi(r) = r^T si

An abstract, but physically relevant, interpretation of this important quantity comes from the theory of linear vector spaces. In that context, the quantity r^T si would be termed the projection of r onto si. From the Schwarz inequality, we know that the largest value of this projection occurs when these vectors are proportional to each other. Thus, a projection measures how much alike two vectors are: they are completely alike when they are parallel (proportional to each other) and completely dissimilar when orthogonal (the projection is zero). In effect, the projection operation removes those components from the observations which are orthogonal to the signal, thereby generalizing the familiar notion of filtering a signal contaminated by broadband noise. In filtering, the signal-to-noise ratio of a bandlimited signal can be drastically improved by lowpass filtering; the output would consist only of the signal and "in-band" noise. The projection serves a similar role, ideally removing those "out-of-band" components (the orthogonal ones) and retaining the "in-band" ones (those parallel to the signal).

21 The zero-mean assumption is realistic for the detection problem. If the mean were non-zero, simply subtracting it from the observed sequence results in a zero-mean noise component.
22 If more than two signals were assumed possible, quantities such as these would need to be computed for each signal and the largest selected.


4.9.2 Matched Filtering

When the projection operation is expanded as r^T si = Σ_{l=0}^{L−1} r(l) si(l), another signal processing interpretation emerges. The projection now describes a finite impulse response (FIR) filtering operation evaluated at a specific index. To demonstrate this interpretation, let h(l) be the unit-sample response of a linear, shift-invariant filter where h(l) = 0 for l < 0 and l ≥ L. Letting r(l) be the filter's input sequence, the convolution sum expresses the output.

r(k) ∗ h(k) = Σ_{l=k−(L−1)}^{k} r(l) h(k − l)

Letting k = L − 1, the index at which the unit-sample response's last value overlaps the input's value at the origin, we have

r(k) ∗ h(k) |_{k=L−1} = Σ_{l=0}^{L−1} r(l) h(L − 1 − l)

Suppose we set the unit-sample response equal to the index-reversed, then delayed signal.

h (l) = si (L− 1− l)

In this case, the filtering operation becomes a projection operation.

r(k) ∗ si(L − 1 − k) |_{k=L−1} = Σ_{l=0}^{L−1} r(l) si(l)

Figure 4.8 depicts these computations graphically.

Figure 4.8: The detector for signals contained in additive, white Gaussian noise consists of a matched filter, whose output is sampled at the duration of the signal and half of the signal energy is subtracted from it. The optimum detector incorporates a matched filter for each signal and compares their outputs to determine the largest.

The sufficient statistic for the ith signal is thus expressed in signal processing notation as r(k) ∗ si(L − 1 − k) |_{k=L−1} − Ei/2. The filtering term is called a matched filter because the observations are passed through a filter whose unit-sample response "matches" that of the signal being sought. We sample the matched filter's output at the precise moment when all of the observations fall within the filter's memory and then adjust this value by half the signal energy. The adjusted values for the two assumed signals are subtracted and compared to a threshold.


4.9.3 Detection Performance

To compute the performance probabilities, the expressions should be simplified in the ways discussed in previous sections. As the energy terms are known a priori, they can be incorporated into the threshold with the result

Σ_{l=0}^{L−1} r(l) (s1(l) − s0(l))  ≷^{M1}_{M0}  σ² ln η + (E1 − E0)/2

The left term constitutes the sufficient statistic for the binary detection problem. Because the additive noise is presumed Gaussian, the sufficient statistic is a Gaussian random variable no matter which model is assumed. Under Mi, the specifics of this probability distribution are

Σ_{l=0}^{L−1} r(l) (s1(l) − s0(l)) ∼ N(mi, var_i)

where the mean and variance of the Gaussian distribution are given respectively by

mi = Σ_l si(l) (s1(l) − s0(l))

var_i = σ² Σ_l (s1(l) − s0(l))²

Note that the variance does not depend on the model. The false-alarm probability is given by

PF = Q( [σ² ln η + (E1 − E0)/2 − m0] / var^{1/2} )

The signal-related terms in the numerator of this expression can be manipulated so that the false-alarm probability of the optimal white Gaussian noise detector is succinctly expressed by

PF = Q( [ln η + (1/(2σ²)) Σ_l (s1(l) − s0(l))²] / [ (1/σ) ( Σ_l (s1(l) − s0(l))² )^{1/2} ] )

Note that the only signal-related quantity affecting this performance probability (and all of the others as well) is the ratio of the energy in the difference signal to the noise variance. The larger this ratio, the better (i.e., smaller) the performance probabilities become. Note that the details of the signal waveforms do not greatly affect the energy of the difference signal. For example, consider the case where the two signal energies are equal (E0 = E1 = E); the energy of the difference signal is given by 2E − 2 Σ_l s0(l) s1(l). The largest value of this energy occurs when the signals are negatives of each other, with the difference-signal energy equaling 4E. Thus, equal-energy but opposite-signed signals such as sine waves, square-waves, Bessel functions, etc. all yield exactly the same performance levels. The essential signal properties that do yield good performance values are elucidated by an alternate interpretation. The term Σ_l (s1(l) − s0(l))² equals ‖s1 − s0‖², the square of the L2 norm of the difference signal. Geometrically, the difference-signal energy is the same quantity as the square of the Euclidean distance between the two signals. In these terms, a larger distance between the two signals means better performance.

Example 4.9: Detection, Gaussian example
A common detection problem is to determine whether a signal is present (M1) or not (M0). To model the latter case, the signal equals zero: s0(l) = 0. The optimal detector relies on filtering the data with a matched filter having a unit-sample response based on the signal that might be present. Letting the signal under M1 be denoted simply by s(l), the optimal detector consists of

r(l) ∗ s(L − 1 − l) |_{l=L−1} − E/2  ≷^{M1}_{M0}  σ² ln η


or

r(l) ∗ s(L − 1 − l) |_{l=L−1}  ≷^{M1}_{M0}  γ

The false-alarm and detection probabilities are given by

PF = Q( γ / (E^{1/2} σ) )

PD = Q( Q⁻¹(PF) − √E/σ )

Figure 4.9 displays the probability of detection as a function of the signal-to-noise ratio E/σ² for several values of false-alarm probability. Given an estimate of the expected signal-to-noise ratio, these curves can be used to assess the trade-off between the false-alarm and detection probabilities.

Figure 4.9: The probability of detection is plotted versus signal-to-noise ratio for various values of the false-alarm probability PF. False-alarm probabilities range from 10⁻¹ down to 10⁻⁶ by decades. The matched filter receiver was used since the noise is white and Gaussian. Note how the range of signal-to-noise ratios over which the detection probability changes shrinks as the false-alarm probability decreases. This effect is a consequence of the non-linear nature of the function Q(·).
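Curves like those of Figure 4.9 follow from PD = Q(Q⁻¹(PF) − √E/σ). A brief sketch (the signal-to-noise grid and false-alarm values are arbitrary choices, not those used in the figure) tabulates the detection probability versus the signal-to-noise ratio E/σ².

    import numpy as np
    from scipy.stats import norm

    snr_db = np.arange(0, 21, 5)                  # E / sigma^2 in dB (assumed grid)
    snr = 10 ** (snr_db / 10)
    for P_F in (1e-1, 1e-2, 1e-6):
        # P_D = Q( Q^{-1}(P_F) - sqrt(E)/sigma ), with E/sigma^2 = snr
        P_D = norm.sf(norm.isf(P_F) - np.sqrt(snr))
        print(P_F, np.round(P_D, 3))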

The important parameter determining detector performance derived in this example is the signal-to-noise ratio E/σ²: the larger it is, the smaller the false-alarm probability is (generally speaking). Signal-to-noise ratios can be measured in many different ways. For example, one measure might be the ratio of the rms signal amplitude to the rms noise amplitude. Note that the important one for the detection problem is much different. The signal portion is the sum of the squared signal values over the entire set of observed values - the signal energy; the noise portion is the variance of each noise component - the noise power. Thus, energy can be increased in two ways that increase the signal-to-noise ratio: the signal can be made larger or the observations can be extended to encompass a larger number of values.

To illustrate this point, how a matched filter operates is shown in Figure 4.10. The signal is very difficult to discern in the presence of noise. However, the signal-to-noise ratio that determines detection performance belies the eye. The matched filter output demonstrates an amazingly clean signal.



Figure 4.10: The signal consists of ten cycles of sin(ω0 l) with ω0 = 2π·0.1. The middle panel shows the signal with noise added. The lower portion depicts the matched-filter output. The detection threshold was set for a false-alarm probability of 10⁻². Even though the matched filter output crosses the threshold several times, only the output at l = L − 1 matters. For this example, it coincides with the peak output of the matched filter.

4.10 White Gaussian Noise23

By far the easiest detection problem to solve occurs when the noise vector consists of statistically independent, identically distributed, Gaussian random variables. In this book, a white sequence consists of statistically independent random variables. The white sequence's mean is usually taken to be zero24 and each component's variance is σ². The equal-variance assumption implies the noise characteristics are unchanging throughout the entire set of observations. The probability density of the zero-mean noise vector evaluated at r − si equals that of a Gaussian random vector having independent components (K = σ²I) with mean si.

23 This content is available online at <http://cnx.org/content/m11281/1.3/>.
24 The zero-mean assumption is realistic for the detection problem. If the mean were non-zero, simply subtracting it from the observed sequence results in a zero-mean noise component.

p_n(r − si) = (1/(2πσ²))^{L/2} e^{−(1/(2σ²)) (r − si)^T (r − si)}

The resulting detection problem is similar to the Gaussian example examined so frequently in the hypothesis testing sections, with the distinction here being a non-zero mean under both models. The logarithm of the likelihood ratio becomes

(r − s0)^T (r − s0) − (r − s1)^T (r − s1)  ≷^{M1}_{M0}  2σ² ln η

and the usual simplifications yield

r^T s1 − s1^T s1 / 2 − ( r^T s0 − s0^T s0 / 2 )  ≷^{M1}_{M0}  σ² ln η

The quantities in parentheses express the signal processing operations for each model. If more than two signals were assumed possible, quantities such as these would need to be computed for each signal and the largest selected. This decision rule is optimum for the additive, white Gaussian noise problem.

Each term in the computations for the optimum detector has a signal processing interpretation. When expanded, the term si^T si equals Σ_{l=0}^{L−1} si²(l), which is the signal energy Ei. The remaining term, r^T si, is the only one involving the observations and hence constitutes the sufficient statistic Υi(r) for the additive white Gaussian noise detection problem.

Υi(r) = r^T si

An abstract, but physically relevant, interpretation of this important quantity comes from the theory of linear vector spaces. There, the quantity r^T si would be termed the dot product between r and si or the projection of r onto si. By employing the Schwarz inequality, the largest value of this quantity occurs when these vectors are proportional to each other. Thus, a dot product computation measures how much alike two vectors are: they are completely alike when they are parallel (proportional) and completely dissimilar when orthogonal (the dot product is zero). More precisely, the dot product removes those components from the observations which are orthogonal to the signal. The dot product thereby generalizes the familiar notion of filtering a signal contaminated by broadband noise. In filtering, the signal-to-noise ratio of a bandlimited signal can be drastically improved by lowpass filtering; the output would consist only of the signal and "in-band" noise. The dot product serves a similar role, ideally removing those "out-of-band" components (the orthogonal ones) and retaining the "in-band" ones (those parallel to the signal).

Expanding the dot product as r^T si = Σ_{l=0}^{L−1} r(l) si(l), another signal processing interpretation emerges. The dot product now describes a finite impulse response (FIR) filtering operation evaluated at a specific index. To demonstrate this interpretation, let h(l) be the unit-sample response of a linear, shift-invariant filter where h(l) = 0 for l < 0 and l ≥ L. Letting r(l) be the filter's input sequence, the convolution sum expresses the output.

r(k) ∗ h(k) = Σ_{l=k−(L−1)}^{k} r(l) h(k − l)

Letting k = L − 1, the index at which the unit-sample response's last value overlaps the input's value at the origin, we have

r(k) ∗ h(k) |_{k=L−1} = Σ_{l=0}^{L−1} r(l) h(L − 1 − l)

If we set the unit-sample response equal to the index-reversed, then delayed signal h(l) = si(L − 1 − l), we have

r(k) ∗ si(L − 1 − k) |_{k=L−1} = Σ_{l=0}^{L−1} r(l) si(l)


which equals the observation-dependent component of the optimal detector's sufficient statistic. Figure 4.11 depicts these computations graphically.

Figure 4.11: The detector for signals contained in additive, white Gaussian noise consists of a matched filter, whose output is sampled at the duration of the signal and half of the signal energy is subtracted from it. The optimum detector incorporates a matched filter for each signal and compares their outputs to determine the largest.

The sufficient statistic for the ith signal is thus expressed in signal processing notation as r(k) ∗ si(L − 1 − k) |_{k=L−1} − Ei/2. The filtering term is called a matched filter because the observations are passed through a filter whose unit-sample response "matches" that of the signal being sought. We sample the matched filter's output at the precise moment when all of the observations fall within the filter's memory and then adjust this value by half the signal energy. The adjusted values for the two assumed signals are subtracted and compared to a threshold.

To compute the performance probabilities, the expressions should be simplified in the ways discussed in the hypothesis testing sections. As the energy terms are known a priori, they can be incorporated into the threshold with the result

Σ_{l=0}^{L−1} r(l) (s1(l) − s0(l))  ≷^{M1}_{M0}  σ² ln η + (E1 − E0)/2

The left term constitutes the sufficient statistic for the binary detection problem. Because the additive noise is presumed Gaussian, the sufficient statistic is a Gaussian random variable no matter which model is assumed. Under Mi, the specifics of this probability distribution are

Σ_{l=0}^{L−1} r(l) (s1(l) − s0(l)) ∼ N( Σ_l si(l) (s1(l) − s0(l)), σ² Σ_l (s1(l) − s0(l))² )

The false-alarm probability is given by

PF = Q( [σ² ln η + (E1 − E0)/2 − Σ_l s0(l) (s1(l) − s0(l))] / [σ ( Σ_l (s1(l) − s0(l))² )^{1/2}] )

The signal-related terms in the numerator of this expression can be manipulated so that the false-alarm probability (and the detection probability) for the optimal white Gaussian noise detector is succinctly expressed by

PF = Q( [ln η + (1/(2σ²)) Σ_l (s1(l) − s0(l))²] / [ (1/σ) ( Σ_l (s1(l) − s0(l))² )^{1/2} ] )

PD = Q( [ln η − (1/(2σ²)) Σ_l (s1(l) − s0(l))²] / [ (1/σ) ( Σ_l (s1(l) − s0(l))² )^{1/2} ] )

Note that the only signal-related quantity affecting this performance probability (and all of the others) is the ratio of energy in the difference signal to the noise variance. The larger this ratio, the better (smaller) the performance probabilities become. Note that the details of the signal waveforms do not greatly affect the energy of the difference signal. For example, consider the case where the two signal energies are equal (E0 = E1 = E); the energy of the difference signal is given by 2E − 2 Σ_l s0(l) s1(l). The largest value of this energy occurs when the signals are negatives of each other, with the difference-signal energy equaling 4E. Thus, equal-energy but opposite-signed signals such as sine waves, square-waves, Bessel functions, etc. all yield exactly the same performance levels. The essential signal properties that do yield good performance values are elucidated by an alternate interpretation. The term Σ_l (s1(l) − s0(l))² equals ‖s1 − s0‖², the square of the L2 norm of the difference signal. Geometrically, the difference-signal energy is the same quantity as the square of the Euclidean distance between the two signals. In these terms, a larger distance between the two signals will mean better performance.

Example 4.10: Detection, Gaussian example
A common detection problem in array processing is to determine whether a signal is present (M1) or not (M0) in the array output. In this case, s0(l) = 0. The optimal detector relies on filtering the array output with a matched filter having an impulse response based on the assumed signal. Letting the signal under M1 be denoted simply by s(l), the optimal detector consists of

r(l) ∗ s(L − 1 − l) |_{l=L−1} − E/2  ≷^{M1}_{M0}  σ² ln η

or

r(l) ∗ s(L − 1 − l) |_{l=L−1}  ≷^{M1}_{M0}  γ

The false-alarm and detection probabilities are given by

PF = Q( γ / (E^{1/2} σ) )

PD = Q( Q⁻¹(PF) − √E/σ )

Figure 4.12 displays the probability of detection as a function of the signal-to-noise ratio E/σ² for several values of false-alarm probability. Given an estimate of the expected signal-to-noise ratio, these curves can be used to assess the trade-off between the false-alarm and detection probabilities.


Figure 4.12: The probability of detection is plotted versus signal-to-noise ratio for various values of the false-alarm probability PF. False-alarm probabilities range from 10⁻¹ down to 10⁻⁶ by decades. The matched filter receiver was used since the noise is white and Gaussian. Note how the range of signal-to-noise ratios over which the detection probability changes shrinks as the false-alarm probability decreases. This effect is a consequence of the non-linear nature of the function Q(·).

The important parameter determining detector performance derived in this example is the signal-to-noise ratio E/σ²: the larger it is, the smaller the false-alarm probability is (generally speaking). Signal-to-noise ratios can be measured in many different ways. For example, one measure might be the ratio of the rms signal amplitude to the rms noise amplitude. Note that the important one for the detection problem is much different. The signal portion is the sum of the squared signal values over the entire set of observed values - the signal energy; the noise portion is the variance of each noise component - the noise power. Thus, energy can be increased in two ways that increase the signal-to-noise ratio: the signal can be made larger or the observations can be extended to encompass a larger number of values.

To illustrate this point, two signals having the same energy are shown in Figure 4.13. When these signals are shown in the presence of additive noise, the signal is visible on the left because its amplitude is larger; the one on the right is much more difficult to discern. The instantaneous signal-to-noise ratio - the ratio of signal amplitude to average noise amplitude - is the important visual cue. However, the kind of signal-to-noise ratio that determines detection performance belies the eye. The matched filter outputs have similar maximal values, indicating that total signal energy rather than amplitude determines the performance of a matched filter detector.


Figure 4.13: Two signals having the same energy are shown at the top of the figure. The one on the left equals one cycle of a sinusoid having ten samples/period (sin(ω0 l) with ω0 = 2π·0.1). On the right, ten cycles of a similar signal are shown, with an amplitude a factor of √10 smaller. The middle portion of the figure shows these signals with the same noise signal added; the duration of this signal is 200 samples. The lower portion depicts the outputs of matched filters for each signal. The detection threshold was set by specifying a false-alarm probability of 10⁻².

4.10.1 Validity of the White Noise Assumption

The optimal detection paradigm for the additive, white Gaussian noise problem has a relatively simple solution: construct FIR filters whose unit-sample responses are related to the presumed signals and compare the filtered outputs with a threshold. We may well wonder which assumptions made in this problem are most questionable in "real-world" applications. Noise is additive in most cases. In many situations, the additive noise present in observed data is Gaussian. Because of the Central Limit Theorem, if numerous noise sources impinge on a measuring device, their superposition will be Gaussian to a great extent. As we know from the discussion on the Central Limit Theorem (Section 1.6), glibly appealing to the Central Limit Theorem is not without hazards; the non-Gaussian detection problem will be discussed in some detail later. Interestingly, the weakest assumption is the "whiteness" of the noise. Note that the observation sequence is obtained as a result of sampling the sensor outputs. Assuming white noise samples does not mean that the continuous-time noise was white. White noise in continuous time has infinite variance and cannot be sampled; discrete-time white noise has a finite variance with a constant power spectrum. The Sampling Theorem suggests that a signal is represented accurately by its samples only if we choose a sampling frequency commensurate with the signal's bandwidth. One should note that fidelity of representation does not mean that the sample values are independent. In most cases, satisfying the Sampling Theorem means that the samples are correlated. As shown in Sampling and Random Sequences (Section 1.9), the correlation function of sampled noise equals samples of the original correlation function. For the sampled noise to be white, E[n(l1 T) n(l2 T)] = 0 for l1 ≠ l2: the samples of the correlation function at locations other than the origin must all be zero. While some correlation functions have this property, many examples satisfy the sampling theorem but do not yield uncorrelated samples. In many practical situations, undersampling the noise will reduce inter-sample correlation. Thus, we obtain uncorrelated samples either by deliberately undersampling, which wastes signal energy, or by imposing anti-aliasing filters that have a bandwidth larger than the signal and sampling at the signal's Nyquist rate. Since the noise power spectrum usually extends to higher frequencies than the signal, this intentional undersampling can result in larger noise variance. In either case, by trying to make the problem at hand match the solution, we are actually reducing performance! We need a direct approach to attacking the correlated noise issue that arises in virtually all sampled-data detection problems rather than trying to work around it.
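The correlation issue can be illustrated with a short sketch (the filter order, bandwidth, and decimation factors are arbitrary assumptions, not taken from the text): white Gaussian noise passed through a lowpass anti-aliasing filter has strongly correlated adjacent samples, and the correlation only disappears when the filtered sequence is decimated by roughly the reciprocal of the normalized bandwidth.

    import numpy as np
    from scipy.signal import butter, lfilter

    rng = np.random.default_rng(2)
    n = rng.normal(0.0, 1.0, 200000)               # discrete surrogate for broadband noise
    b, a = butter(4, 0.1)                          # lowpass filter, cutoff at 0.1 of Nyquist (assumed)
    filtered = lfilter(b, a, n)

    def lag_one_correlation(x):
        # normalized correlation between adjacent samples
        x = x - x.mean()
        return np.dot(x[:-1], x[1:]) / np.dot(x, x)

    for step in (1, 2, 10):                        # keep every 'step'-th sample
        samples = filtered[::step]
        print(step, round(lag_one_correlation(samples), 3))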

4.11 Colored Gaussian Noise

When the additive Gaussian noise in the sensors' outputs is colored (i.e., the noise values are correlated in some fashion), the linearity of beamforming algorithms means that the array processing output r also contains colored noise. The solution to the colored-noise, binary detection problem remains the likelihood ratio, but differs in the form of the a priori densities. The noise will again be assumed zero mean, but the noise vector has a non-trivial covariance matrix K: n ∼ N(0, K).

p_n(n) = \frac{1}{\sqrt{\det(2\pi K)}}\, e^{-\frac{1}{2} n^T K^{-1} n}

In this case, the logarithm of the likelihood ratio is

(r - s_0)^T K^{-1} (r - s_0) - (r - s_1)^T K^{-1} (r - s_1) \gtrless^{M_1}_{M_0} 2\ln\eta

which, after the usual simplifications, is written

\left( r^T K^{-1} s_1 - \frac{s_1^T K^{-1} s_1}{2} \right) - \left( r^T K^{-1} s_0 - \frac{s_0^T K^{-1} s_0}{2} \right) \gtrless^{M_1}_{M_0} \ln\eta

The sufficient statistic for the colored Gaussian noise detection problem is

\Upsilon_i(r) = r^T K^{-1} s_i \qquad (4.24)

The quantities computed for each signal have a similar, but more complicated, interpretation than in the white noise case. r^T K^{-1} s_i is a dot product, but with respect to the so-called kernel K^{-1}. The effect of the kernel is to weight certain components more heavily than others. A positive-definite symmetric matrix (the covariance matrix is one such example) can be expressed in terms of its eigenvectors and eigenvalues.

K^{-1} = \sum_{k=1}^{L} \frac{1}{\lambda_k}\, v_k v_k^T

The sufficient statistic can thus be written as the complicated summation

r^T K^{-1} s_i = \sum_{k=1}^{L} \frac{1}{\lambda_k} \left(r^T v_k\right)\left(v_k^T s_i\right)


where λ_k and v_k denote the kth eigenvalue and eigenvector of the covariance matrix K. Each of the constituent dot products is largest when the signal and the observation vectors have strong components parallel to v_k. However, the product of these dot products is weighted by the reciprocal of the associated eigenvalue. Thus, components in the observation vector parallel to the signal will tend to be accentuated; those components parallel to the eigenvectors having the smaller eigenvalues will receive greater accentuation than others. The usual notions of parallelism and orthogonality become "skewed" because of the presence of the kernel. A covariance matrix's eigenvalue has "units" of variance; these accentuated directions thus correspond to small noise variance. We can therefore view the weighted dot product as a computation that is simultaneously trying to select components in the observations similar to the signal, but concentrating on those where the noise variance is small.
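As a concrete check on this interpretation, the sufficient statistic r^T K^{-1} s_i can be computed either directly or through the eigendecomposition. The following sketch (Python with NumPy; the covariance K, signal s, and observation r are made-up illustrative values) verifies that the two computations agree.

    import numpy as np

    rng = np.random.default_rng(0)
    L = 8
    a = 0.8                                              # hypothetical first-order covariance K[i,j] = a^{|i-j|}
    K = a ** np.abs(np.subtract.outer(np.arange(L), np.arange(L)))
    s = np.sin(2 * np.pi * 0.1 * np.arange(L))           # assumed signal
    r = s + rng.multivariate_normal(np.zeros(L), K)      # observation under M1

    # Direct computation of the sufficient statistic r^T K^{-1} s
    stat_direct = r @ np.linalg.solve(K, s)

    # Equivalent computation from the eigendecomposition of K
    lam, V = np.linalg.eigh(K)
    stat_eig = sum((r @ V[:, k]) * (V[:, k] @ s) / lam[k] for k in range(L))

    print(stat_direct, stat_eig)      # the two values agree to numerical precision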

The second term in the expressions constituting the optimal detector is of the form s_i^T K^{-1} s_i. This quantity is a special case of the dot product just discussed. The two vectors involved in this dot product are identical; they are parallel by definition. The weighting of the signal components by the reciprocal eigenvalues remains. Recalling the units of the eigenvalues of K, s_i^T K^{-1} s_i has the units of a signal-to-noise ratio, which is computed in a way that enhances the contribution of those signal components parallel to the "low noise" directions.

To compute the performance probabilities, we express the detection rule in terms of the sufficient statistic.

r^T K^{-1} (s_1 - s_0) \gtrless^{M_1}_{M_0} \ln\eta + \frac{1}{2}\left(s_1^T K^{-1} s_1 - s_0^T K^{-1} s_0\right)

The distribution of the sufficient statistic on the left side of this equation is Gaussian because it is a linear transformation of the Gaussian random vector r. Assuming the ith model to be true,

r^T K^{-1} (s_1 - s_0) \sim \mathcal{N}\left( s_i^T K^{-1}(s_1 - s_0),\ (s_1 - s_0)^T K^{-1}(s_1 - s_0) \right)

The false-alarm probability for the optimal Gaussian colored noise detector is given by

P_F = Q\left( \frac{\ln\eta + \frac{1}{2}(s_1 - s_0)^T K^{-1}(s_1 - s_0)}{\left((s_1 - s_0)^T K^{-1}(s_1 - s_0)\right)^{1/2}} \right) \qquad (4.25)

As in the white noise case, the important signal-related quantity in this expression is the signal-to-noise ratio of the difference signal. The distance interpretation of this quantity remains, but the distance is now warped by the kernel's presence in the dot product.
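Equation (4.25) is straightforward to evaluate numerically. A minimal sketch (Python with NumPy and SciPy; the covariance, signals, and threshold η below are illustrative assumptions) computes P_F using Q(x) = norm.sf(x).

    import numpy as np
    from scipy.stats import norm

    L = 8
    a = 0.8
    K = a ** np.abs(np.subtract.outer(np.arange(L), np.arange(L)))   # illustrative covariance
    s0 = np.zeros(L)
    s1 = np.sin(2 * np.pi * 0.1 * np.arange(L))
    eta = 1.0

    d = s1 - s0
    d2 = d @ np.linalg.solve(K, d)                       # (s1 - s0)^T K^{-1} (s1 - s0)
    PF = norm.sf((np.log(eta) + 0.5 * d2) / np.sqrt(d2))  # Q(.) = norm.sf(.)
    print(PF)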

The sufficient statistic computed for each signal can be given two signal processing interpretations in the colored noise case. Both of these rest on considering the quantity r^T K^{-1} s_i as a simple dot product, but with different ideas on grouping terms. The simplest is to group the kernel with the signal so that the sufficient statistic is the dot product between the observations and a modified version of the signal s̃_i = K^{-1} s_i. This modified signal thus becomes equivalent to the unit-sample response of the matched filter. In this form, the observed data are unaltered and passed through a matched filter whose unit-sample response depends on both the signal and the noise characteristics. The size of the noise covariance matrix, equal to the number of observations used by the detector, is usually large: hundreds if not thousands of samples are possible. Thus, computation of the inverse of the noise covariance matrix becomes an issue. This problem needs to be solved only once if the noise characteristics are static; the inverse can be precomputed on a general purpose computer using well-established numerical algorithms. The signal-to-noise ratio term of the sufficient statistic is the dot product of the signal with the modified signal s̃_i. This view of the receiver structure is shown in Figure 4.14.


Figure 4.14: These diagrams depict the signal processing operations involved in the optimum detector when the additive noise is not white. The upper diagram shows a matched filter whose unit-sample response depends both on the signal and the noise characteristics. The lower diagram is often termed the whitening filter structure, where the noise components of the observed data are first whitened, then passed through a matched filter whose unit-sample response is related to the "whitened" signal.

A second and more theoretically powerful view of the computations involved in the colored noise detector emerges when we factor the covariance matrix. The Cholesky factorization of a positive-definite, symmetric matrix (such as a covariance matrix or its inverse) has the form K = LDL^T. With this factorization, the sufficient statistic can be written as

r^T K^{-1} s_i = \left(D^{-1/2} L^{-1} r\right)^T \left(D^{-1/2} L^{-1} s_i\right)

The components of the dot product are multiplied by the same matrix (D^{-1/2} L^{-1}), which is lower-triangular. If this matrix were also Toeplitz, the product of this kind between a Toeplitz matrix and a vector would be equivalent to the convolution of the components of the vector with the first column of the matrix. If the matrix is not Toeplitz (which, inconveniently, is the typical case), a convolution also results, but with a unit-sample response that varies with the index of the output: a time-varying, linear filtering operation. The variation of the unit-sample response corresponds to the different rows of the matrix D^{-1/2} L^{-1} running backwards from the main-diagonal entry. What is the physical interpretation of the action of this filter? The covariance of the random vector x = Ar is given by K_x = A K_r A^T. Applying this result to the current situation, we set A = D^{-1/2} L^{-1} and K_r = K = LDL^T, with the result that the covariance matrix K_x is the identity matrix! Thus, the matrix D^{-1/2} L^{-1} corresponds to a (possibly time-varying) whitening filter: we have converted the colored-noise component of the observed data to white noise! As the filter is always linear, the Gaussian observation noise remains Gaussian at the output. Thus, the colored noise problem is converted into a simpler one with the whitening filter: the whitened observations are first matched-filtered with the "whitened" signal s^+_i = D^{-1/2} L^{-1} s_i (whitened with respect to noise characteristics only), then half the energy of the whitened signal is subtracted (Figure 4.14).

Example 4.11
To demonstrate the interpretation of the Cholesky factorization of the covariance matrix as a time-varying whitening filter, consider the covariance matrix

K = \begin{pmatrix} 1 & a & a^2 & a^3 \\ a & 1 & a & a^2 \\ a^2 & a & 1 & a \\ a^3 & a^2 & a & 1 \end{pmatrix}


This covariance matrix indicates that the noise was produced by passing white Gaussian noise through a first-order filter having coefficient a: n(l) = a n(l-1) + w(l), where w(l) is unit-variance white noise. Thus, we would expect that if a whitening filter emerged from the matrix manipulations (derived just below), it would be a first-order FIR filter having a unit-sample response proportional to

h(l) = \begin{cases} 1 & l = 0 \\ -a & l = 1 \\ 0 & \text{otherwise} \end{cases}

Simple arithmetic calculations of the Cholesky decomposition suffice to show that the matrices L and D are given by

L = \begin{pmatrix} 1 & 0 & 0 & 0 \\ a & 1 & 0 & 0 \\ a^2 & a & 1 & 0 \\ a^3 & a^2 & a & 1 \end{pmatrix}, \qquad D = \begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & 1-a^2 & 0 & 0 \\ 0 & 0 & 1-a^2 & 0 \\ 0 & 0 & 0 & 1-a^2 \end{pmatrix}

and that their inverses are

L^{-1} = \begin{pmatrix} 1 & 0 & 0 & 0 \\ -a & 1 & 0 & 0 \\ 0 & -a & 1 & 0 \\ 0 & 0 & -a & 1 \end{pmatrix}, \qquad D^{-1} = \begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & \frac{1}{1-a^2} & 0 & 0 \\ 0 & 0 & \frac{1}{1-a^2} & 0 \\ 0 & 0 & 0 & \frac{1}{1-a^2} \end{pmatrix}

Because D is diagonal, the matrix D^{-1/2} equals the term-by-term square root of the inverse of D. The product of interest here is therefore given by

D^{-1/2} L^{-1} = \begin{pmatrix} 1 & 0 & 0 & 0 \\ \frac{-a}{\sqrt{1-a^2}} & \frac{1}{\sqrt{1-a^2}} & 0 & 0 \\ 0 & \frac{-a}{\sqrt{1-a^2}} & \frac{1}{\sqrt{1-a^2}} & 0 \\ 0 & 0 & \frac{-a}{\sqrt{1-a^2}} & \frac{1}{\sqrt{1-a^2}} \end{pmatrix}

Let r̃ express the product D^{-1/2} L^{-1} r. This vector's elements are given by

\tilde{r}_0 = r_0, \qquad \tilde{r}_1 = \frac{1}{\sqrt{1-a^2}}\left(r_1 - a r_0\right), \ \ldots

Thus, the expected FIR whitening filter emerges after the first term. The first term could not be of this form, as no observations were assumed to precede r_0. This edge effect is the source of the time-varying aspect of the whitening filter. If the system modeling the noise generation process has only poles, this whitening filter will always stabilize - not vary with time - once sufficient data are present within the memory of the FIR inverse filter. In contrast, the presence of zeros in the generation system would imply an IIR whitening filter. With finite data, the unit-sample response would then change on each output sample.
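The whitening interpretation of this example can be verified numerically. In the sketch below (Python with NumPy; the value of a is an arbitrary illustrative choice), note that D^{-1/2} L^{-1} is simply the inverse of the lower-triangular Cholesky factor of K, since that factor equals L D^{1/2}.

    import numpy as np

    a = 0.5                                          # illustrative AR(1) coefficient, |a| < 1
    idx = np.arange(4)
    K = a ** np.abs(np.subtract.outer(idx, idx))     # the covariance matrix of the example

    C = np.linalg.cholesky(K)                        # K = C C^T, with C = L D^{1/2}
    W = np.linalg.inv(C)                             # whitening matrix D^{-1/2} L^{-1}

    print(np.round(W, 3))
    # First row is [1, 0, 0, 0]; each later row has -a/sqrt(1-a^2) and 1/sqrt(1-a^2)
    # on the sub- and main diagonals, matching the first-order FIR whitening filter
    # after the edge effect.
    print(np.round(W @ K @ W.T, 6))                  # identity: the output noise is white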

4.12 Detection in the Presence of Uncertainties

4.12.1 Unknown Signal Parameters

Applying the techniques described in the previous section may be difficult to justify when the signal and/or noise models are uncertain. For example, we must "know" a signal down to the precise value of every sample. In other cases, we may know the signal's waveform, but not the waveform's amplitude as measured by a sensor. A ubiquitous example of this uncertainty is propagation loss: the range of a far-field signal can only be lower-bounded, which leads to the known waveform, unknown amplitude detection problem. Another uncertainty is the signal's time origin: without this information, we do not know when to start the matched filtering operation! In other circumstances, the noise may have a white power spectrum but its variance is unknown. Much worse situations (from the point of view of detection theory) can be imagined: the signal may not be known at all and one may want to detect the presence of any disturbance in the observations other than that of well-specified noise. These problems are very realistic, but the detection schemes as presented are inadequate to attack them. The detection results we have derived to date need to be extended to incorporate the presence of unknowns just as we did in hypothesis testing (Detection in the Presence of Unknowns (Section 4.8)).

4.12.1.1 Unknown Signal Amplitude

Assume that a signal's waveform is known exactly, but the amplitude is not. We need an algorithm to detect the presence or absence of this signal observed in additive noise at an array's output. The models can be formally stated as

M_0: \quad r(l) = n(l), \qquad l \in \{0, \ldots, L-1\}
M_1: \quad r(l) = A\, s(l) + n(l), \qquad l \in \{0, \ldots, L-1\}, \quad A = ?

As usual, L observations are available and the noise is Gaussian. This problem is equivalent to an unknown parameter problem described in Detection in the Presence of Unknowns (Section 4.8). We learned there that the first step is to ascertain the existence of a uniformly most powerful test. For each value of the unknown parameter A, the logarithm of the likelihood ratio is written

A\, r^T K^{-1} s - \frac{A^2}{2}\, s^T K^{-1} s \gtrless^{M_1}_{M_0} \ln\eta

Assuming that A > 0, a typical assumption in array processing problems, we write this comparison as

r^T K^{-1} s \gtrless^{M_1}_{M_0} \frac{1}{A}\ln\eta + \frac{A}{2}\, s^T K^{-1} s = \gamma

As the sufficient statistic does not depend on the unknown parameter and one of the models (M0) does not depend on this parameter, a uniformly most powerful test exists: the threshold term, despite its explicit dependence on a variety of factors, can be determined by specifying a false-alarm probability. If the noise is not white, the whitening filter or a spectral transformation may be used to simplify the computation of the sufficient statistic.

Example 4.12
Assume that the waveform, but not the amplitude, of a signal is known. The Gaussian noise is white with a variance of σ². The decision rule expressed in terms of a sufficient statistic becomes

r^T s \gtrless^{M_1}_{M_0} \gamma

The false-alarm probability is given by

P_F = Q\left(\frac{\gamma}{\sqrt{E\sigma^2}}\right)

where E is the assumed signal energy, which equals ‖s‖². The threshold γ is thus found to be

\gamma = \sqrt{E\sigma^2}\, Q^{-1}(P_F)

The probability of detection for the matched filter detector is given by

P_D = Q\left(\frac{\gamma - AE}{\sqrt{E\sigma^2}}\right) = Q\left(Q^{-1}(P_F) - \sqrt{\frac{A^2 E}{\sigma^2}}\right)

where A is the signal's actual amplitude relative to the assumed signal having energy E. Thus, the observed signal, when it is present, has energy A²E. The probability of detection is shown in Figure 4.15 as a function of the observed signal-to-noise ratio. For any false-alarm probability, the signal must be sufficiently energetic for its presence to be reliably determined.
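The threshold and detection probability of this example are easy to compute. A minimal sketch (Python with NumPy/SciPy; the signal, σ², P_F, and actual amplitude A are illustrative values) uses norm.isf for Q^{-1} and norm.sf for Q.

    import numpy as np
    from scipy.stats import norm

    sigma2 = 1.0                                        # noise variance (assumed)
    s = np.sin(2 * np.pi * 0.1 * np.arange(100))        # assumed unit-amplitude signal
    E = s @ s                                           # assumed signal energy ||s||^2
    PF = 1e-2
    A = 0.5                                             # actual (unknown) amplitude

    gamma = np.sqrt(E * sigma2) * norm.isf(PF)          # threshold: Q^{-1}(PF) = norm.isf(PF)
    PD = norm.sf((gamma - A * E) / np.sqrt(E * sigma2))
    print(gamma, PD)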

Figure 4.15: The false-alarm probability of the detector was fixed at 10^{-2}. The signal equaled A sin(ω0 l), l ∈ {0, ..., L-1}, where ω0 was 2π × 0.1 and L = 100; the noise was white and Gaussian. The detection probabilities that result from a matched filter detector, a sign detector, and a square-law detector are shown. These detectors make progressively fewer assumptions about the signal, consequently yielding progressively smaller detection probabilities.


All too many interesting problems exist where a uniformly most powerful decision rule cannot be found. Suppose in the problem just described that the amplitude is known (A = 1, for example), but the variance of the noise is not. Writing the covariance matrix as σ² K̃, where we normalize the covariance matrix to have unit-variance entries by requiring tr(K̃) = L, unknown values of σ² express the known-correlation-structure, unknown-noise-power problem. From the results just given, the decision rule can be written so that the sufficient statistic does not depend on the unknown variance.

r^T \tilde{K}^{-1} s \gtrless^{M_1}_{M_0} \sigma^2 \ln\eta + \frac{1}{2}\, s^T \tilde{K}^{-1} s = \gamma

However, as both models depend on the unknown parameter, performance probabilities cannot be computed and we cannot design a detection threshold.

Hypothesis testing ideas show the way out; estimate the unknown parameter(s) under each model separately and then use these estimates in the likelihood ratio (Non-Random Parameters27). Using the maximum likelihood estimates for the parameters results in the generalized likelihood ratio test for the detection problem (Kelly, et al[41], Kelly, et al[42], van Trees[51]). Letting θ denote the vector of unknown parameters, be they for the signal or the noise, the generalized likelihood ratio test for detection problems is expressed by

\Lambda(r) = \frac{\max_\theta p_{n|\theta}\left(r - s_1(\theta)\right)}{\max_\theta p_{n|\theta}\left(r - s_0(\theta)\right)} \gtrless^{M_1}_{M_0} \eta

Again, the use of separate estimates for each model (rather than for the likelihood ratio as a whole) must be stressed. Unknown signal-related parameter problems and unknown noise-parameter problems have different characteristics; the signal may not be present in one of the observation models. This simplification allows a threshold to be established objectively. In contrast, the noise is present in each model; establishing a threshold value objectively will force new techniques to be developed. We first continue our adventure in unknown-signal-parameter problems, deferring the more challenging unknown-noise-parameter ones to later (Section 4.15).

4.13 Unknown Signal Delay

A uniformly most powerful decision rule may not exist when an unknown parameter appears in a nonlinear way in the signal model. Most pertinent to array processing is the unknown time origin case: the signal has been subjected to an unknown delay (s(l - Δ), Δ = ?) and we must determine the signal's presence. The likelihood ratio cannot be manipulated so that the sufficient statistic can be computed without having a value for Δ. Thus, the search for a uniformly most powerful test ends in failure and other methods must be sought. As expected, we resort to the generalized likelihood ratio test.

More specifically, consider the binary test where a signal is either present (M1) or not (M0). The signal waveform is known, but its time origin is not. For all possible values of Δ, the delayed signal is assumed to lie entirely in the observations (Figure 4.16). This signal model is ubiquitous in active sonar and radar, where the reflected signal's exact time-of-arrival is not known and we want to determine whether a return is present or not and the value of the delay.29 Additive white Gaussian noise is assumed present. The conditional density of the observations made under M1 is

p_{r|M_1,\Delta}(r) = \frac{1}{(2\pi\sigma^2)^{L/2}}\, e^{-\frac{1}{2\sigma^2}\sum_{l=0}^{L-1}\left(r(l) - s(l-\Delta)\right)^2}

27"Non-Random Parameters", (1) <http://cnx.org/content/m11238/latest/#glr>28This content is available online at <http://cnx.org/content/m11283/1.2/>.29For a much more realistic (and harder) version of the active radar/sonar problem, see this problem (Exercise 4.27.16).

Available for free at Connexions <http://cnx.org/content/col11382/1.1>

Page 140: Statistical Signal Processing · 2012. 10. 29. · 1.1 oundationsF of Probability Theory: Basic De nitions 1 1.1.1 Basic De nitions The basis of probability theory is a set of events

134 CHAPTER 4. DETECTION THEORY

Figure 4.16: Despite uncertainties in the signal's delay Δ, the signal is assumed to lie entirely within the observation interval. Hence the signal's duration D, the duration L of the observation interval, and the maximum expected delay are assumed to be related by max Δ + D - 1 < L. The figure shows a signal falling properly within the allowed window and a grey one falling just outside.

The exponent contains the only portion of this conditional density that depends on the unknown quantity Δ. Maximizing the conditional density with respect to Δ is equivalent to maximizing

\sum_{l=0}^{L-1} \left[ r(l)\, s(l-\Delta) - \frac{1}{2}\, s^2(l-\Delta) \right]

As the signal is assumed to be contained entirely in the observations for all possible values of Δ, the second term does not depend on Δ and equals half of the signal energy E. Rather than analytically maximizing the first term now, we simply write the logarithm of the generalized likelihood ratio test as

\max_\Delta \sum_{l=\Delta}^{\Delta+D-1} r(l)\, s(l-\Delta) \gtrless^{M_1}_{M_0} \sigma^2 \ln\eta + \frac{E}{2}

where the non-zero portion of the summation is expressed explicitly. Using the matched filter interpretation of the sufficient statistic, this decision rule is expressed by

\max_\Delta \left[ r(l) * s(D-1-l) \right]\Big|_{l=D-1+\Delta} \gtrless^{M_1}_{M_0} \gamma

This formulation suggests that the matched filter having a unit-sample response equal to the zero-origin signal be evaluated for each possible value of Δ and that we use the maximum value of the resulting output in the decision rule. In the known-delay case, the matched-filter output is sampled at the "end" of the signal; here, the filter, which has a duration D less than the observation interval L, is allowed to continue processing over the allowed values of signal delay with the maximum output value chosen. The result of this procedure is illustrated in Figure 4.13. There, two signals, each having the same energy, are passed through the appropriate matched filter. Note that the index at which the maximum output occurs is the maximum likelihood estimate of Δ. Thus, the detection and the estimation problems are solved simultaneously. Furthermore, the amplitude of the signal need not be known as it enters into the expression for the sufficient statistic in a linear fashion and a UMP test exists in that case. We can easily find the threshold γ by establishing a criterion on the false-alarm probability; the resulting simple computation of γ can be traced to the lack of a signal-related quantity or an unknown parameter appearing in M0.
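The max-over-delay matched filter can be sketched in a few lines (Python with NumPy; the signal, delay, and noise level are illustrative assumptions). The index of the maximum output is the maximum likelihood delay estimate, and the maximum value is compared with the threshold γ.

    import numpy as np

    rng = np.random.default_rng(1)
    D, L = 50, 200
    s = np.sin(2 * np.pi * 0.1 * np.arange(D))          # known zero-origin signal
    true_delay = 60                                      # unknown to the detector
    r = rng.normal(0, 1, L)                              # white Gaussian noise, sigma = 1
    r[true_delay:true_delay + D] += s                    # observation under M1

    # Matched filter evaluated over all admissible delays; take the maximum output
    outputs = np.array([r[d:d + D] @ s for d in range(L - D + 1)])
    delay_hat = int(np.argmax(outputs))                  # ML estimate of the delay
    statistic = outputs[delay_hat]
    print(delay_hat, statistic)
    # 'statistic' is then compared with a threshold gamma set from the false-alarm criterion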

We have argued the doubtfulness of assuming that the noise is white in discrete-time detection problems. The approach for solving the colored noise problem is to use spectral detection. Handling the unknown delay problem in this way is relatively straightforward. Since a sequence can be represented equivalently by its values or by its DFT, maximization can be calculated in either the time or the frequency domain without affecting the final answer. Thus, the spectral detector's decision rule for the unknown delay problem is (from this equation30)

\max_\Delta \sum_{k=0}^{L-1} \left[ \frac{\Re\left( R(k)\, S(k)\, e^{-\frac{i 2\pi k \Delta}{L}} \right)}{\sigma_k^2} - \frac{1}{2}\, \frac{|S(k)|^2}{\sigma_k^2} \right] \gtrless^{M_1}_{M_0} \gamma \qquad (4.26)

30"Spectral Detection", (1) <http://cnx.org/content/m11282/latest/#eqn1>

Available for free at Connexions <http://cnx.org/content/col11382/1.1>

Page 141: Statistical Signal Processing · 2012. 10. 29. · 1.1 oundationsF of Probability Theory: Basic De nitions 1 1.1.1 Basic De nitions The basis of probability theory is a set of events

135

where, as usual in unknown delay problems, the observation interval captures the entire signal waveform no matter what the delay might be. The energy term is a constant and can be incorporated into the threshold. The maximization amounts to finding the best linear phase fit to the observations' spectrum once the signal's phase has been removed. A more interesting interpretation arises by noting that the sufficient statistic is itself a Fourier Transform; the maximization amounts to finding the location of the maximum of a sequence given by

\Re\left( \sum_{k=0}^{L-1} \frac{R(k)\, S(k)}{\sigma_k^2}\, e^{-\frac{i 2\pi k \Delta}{L}} \right)

The spectral detector thus becomes a succession of two Fourier Transforms with the final result determined by the maximum of a sequence!
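For white noise (constant σ_k²), the two-transform idea can be sketched with FFTs (Python with NumPy; all parameter values are illustrative assumptions). The inverse DFT of R(k) times the conjugated, zero-padded signal spectrum evaluates the time-domain correlation at every admissible delay simultaneously.

    import numpy as np

    rng = np.random.default_rng(2)
    D, L = 50, 200
    s = np.sin(2 * np.pi * 0.1 * np.arange(D))
    r = rng.normal(0, 1, L)
    r[60:60 + D] += s                         # signal hidden at an unknown delay

    # Frequency-domain computation of all matched-filter outputs at once
    # (white noise assumed, so the 1/sigma_k^2 weighting is a common constant)
    R = np.fft.fft(r)
    S = np.fft.fft(s, L)                      # zero-padded signal spectrum
    outputs = np.fft.ifft(R * np.conj(S)).real
    delay_hat = int(np.argmax(outputs[:L - D + 1]))
    print(delay_hat)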

Unfortunately, the solution to the unknown-signal-delay problem in either the time or frequency domains is confounded when two or more signals are present. Assume two signals are known to be present in the array output, each of which has an unknown delay: r(l) = s_1(l - Δ_1) + s_2(l - Δ_2) + n(l). Using arguments similar to those used in the one-signal case, the generalized likelihood ratio test becomes

\max_{\Delta_1, \Delta_2} \sum_{l=0}^{L-1} \left[ r(l)\, s_1(l-\Delta_1) + r(l)\, s_2(l-\Delta_2) - s_1(l-\Delta_1)\, s_2(l-\Delta_2) \right] \gtrless^{M_1}_{M_0} \sigma^2\ln\eta + \frac{E_1 + E_2}{2}

Not only do matched filter terms for each signal appear, but also a cross-term between the two signals. It is this latter term that complicates the multiple signal problem: if this term is not zero for all possible delays, a non-separable maximization process results and both delays must be varied in concert to locate the maximum. If, however, the two signals are orthogonal regardless of the delay values, the delays can be found separately and the structure of the single signal detector (modified to include matched filters for each signal) will suffice. This seemingly impossible situation can occur, at least approximately. Using Parseval's Theorem, the cross term can be expressed in the frequency domain.

\sum_{l=0}^{L-1} s_1(l-\Delta_1)\, s_2(l-\Delta_2) = \frac{1}{2\pi} \int_{-\pi}^{\pi} S_1(\omega)\, \overline{S_2(\omega)}\, e^{i\omega(\Delta_2 - \Delta_1)}\, d\omega

For this integral to be zero for all Δ_1, Δ_2, the product of the spectra must be zero. Consequently, if the two signals have disjoint spectral support, they are orthogonal no matter what the delays may be.31 Under these conditions, the detector becomes

\max_{\Delta_1} \left[ r(l) * s_1(D-1-l) \right]\Big|_{l=D-1+\Delta_1} + \max_{\Delta_2} \left[ r(l) * s_2(D-1-l) \right]\Big|_{l=D-1+\Delta_2} \gtrless^{M_1}_{M_0} \gamma

with the threshold again computed independently of the received signal amplitudes.32

P_F = Q\left(\frac{\gamma}{\sqrt{(E_1 + E_2)\sigma^2}}\right)

This detector has the structure of two parallel, independently operating matched filters, each of which is tuned to the specific signal of interest.

31 We stated earlier that this situation happens "at least approximately." Why the qualification?
32 Not to be boring, but we emphasize that E1 and E2 are the energies of the signals s_1(l) and s_2(l) used in the detector, not those of their received correlates A_1 s_1(l) and A_2 s_2(l).

Reality is insensitive to mathematically simple results. The orthogonality condition on the signals that yielded the relatively simple two-signal, unknown-delay detector is often elusive. The signals often share similar spectral supports, thereby violating the orthogonality condition. In fact, we may be interested in detecting the same signal repeated twice (or more) within the observation interval. Because of the complexity of incorporating inter-signal correlations, which are dependent on the relative delay, the idealistic detector is often used in practice. In the repeated signal case, the matched filter is operated over the entire observation interval and the number of excursions above the threshold noted. An excursion is defined to be a portion of the matched filter's output that exceeds the detection threshold over a contiguous interval. Because of the signal's non-zero duration, the matched filter's response to just the signal has a non-zero duration, implying that the threshold can be crossed at more than a single sample. When one signal is assumed, the maximization step automatically selects the peak value of an excursion. As shown in the lower panels of Figure 4.13, a low-amplitude excursion may have a peak value less than a non-maximal value in a larger excursion. Thus, when considering multiple signals, the important quantities are the times at which excursion peaks occur, not all of the times the output exceeds the threshold.

This figure (Figure 4.13) illustrates the two kinds of errors prevalent in multiple signal detectors. In the left panel, we find two excursions, the first of which is due to the signal, the second due to noise. This kind of error cannot be avoided; we never said that detectors could be perfect! The right panel illustrates a more serious problem: the threshold is crossed by four excursions, all of which are due to a single signal. Hence, excursions must be sorted through, taking into account the nature of the signal being sought. In the example, excursions surrounding a large one should be discarded if they occur in close proximity. This requirement means that closely spaced signals cannot be distinguished from a single one.

4.14 Unknown Signal Waveform

The most general unknown signal parameter problem occurs when the signal itself is unknown. This phrasing of the detection problem can be applied to two different sorts of situations. The signal's waveform may not be known precisely because of propagation effects (rapid multipath, for example) or because of source uncertainty. Another situation is the "Hello, is anyone out there?" problem: you want to determine if any non-noise-like quantity is present in the observations. These problems impose severe demands on the detector, which must function with little a priori knowledge of the signal's structure. Consequently, we cannot expect superlative performance.

M0 : r (l) = n (l)

M1 : r (l) = s (l) + n (l) , s (l) =?

The noise is assumed to be Gaussian with a covariance matrix K. The conditional density under M1 is given by

p_{r|M_1,s}(r) = \frac{1}{\sqrt{\det(2\pi K)}}\, e^{-\frac{1}{2}(r-s)^T K^{-1}(r-s)}

Using the generalized likelihood ratio test, the maximum value of this density with respect to the unknown "parameters" - the signal values - occurs when r = s.

\max_s p_{r|M_1,s}(r) = \frac{1}{\sqrt{\det(2\pi K)}}

The other model does not depend on the signal, and the generalized likelihood ratio test for the unknown signal problem, often termed the square-law detector, is

r^T K^{-1} r \gtrless^{M_1}_{M_0} \gamma \qquad (4.27)

For example, if the noise were white, the sufficient statistic is the sum of the squares of the observations.

\sum_{l=0}^{L-1} r^2(l) \gtrless^{M_1}_{M_0} \gamma


If the noise is not white, the detection problem can be formulated in the frequency domain, as shown here34, where the decision rule becomes

\sum_{k=0}^{L-1} \frac{|R(k)|^2}{\sigma_k^2} \gtrless^{M_1}_{M_0} \gamma

Computation of the false-alarm probability in, for example, the white noise case is relatively straightforward. The probability density of the sum of the squares of L statistically independent, zero-mean, unit-variance Gaussian random variables is termed chi-squared with L degrees of freedom (see Probability Distributions35): χ²(L). The percentiles of this density are tabulated in many standard statistical references (Abramowitz and Stegun[1]).

Example 4.13
Assume that the additive noise is white and Gaussian, having a variance σ². The sufficient statistic \Upsilon = \sum_l r^2(l) of the square-law detector has the probability density

\frac{\Upsilon}{\sigma^2} \sim \chi^2(L)

when no signal is present. The threshold γ for this statistic is established by \Pr\left[\chi^2(L) > \frac{\gamma}{\sigma^2}\right] = P_F. The probability of detection is found from the density of a non-central chi-squared random variable χ'²(L, λ) having L degrees of freedom and "centrality" parameter λ = \sum_{l=0}^{L-1} (E[r(l)])^2. In this example, λ = E, the energy of the observed signal. In Figure 4.15, the false-alarm probability was set to 10^{-2} and the resulting probability of detection is shown as a function of signal-to-noise ratio. Clearly, the inability to specify the signal waveform leads to a significant reduction in performance. In this example, roughly 10 dB more signal-to-noise ratio is required by the square-law detector than by the matched-filter detector, which assumes knowledge of the waveform but not of the amplitude, to yield the same performance.
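A minimal numerical sketch of this example (Python with SciPy's chi2 and ncx2; the signal, σ², and P_F values are illustrative, and the noncentrality is taken here as E/σ² in SciPy's parameterization) sets the threshold and evaluates the detection probability.

    import numpy as np
    from scipy.stats import chi2, ncx2

    sigma2 = 1.0
    L = 100
    s = 0.5 * np.sin(2 * np.pi * 0.1 * np.arange(L))    # illustrative observed signal
    E = s @ s                                            # observed signal energy
    PF = 1e-2

    gamma = sigma2 * chi2.isf(PF, df=L)                  # Pr[chi2(L) > gamma/sigma^2] = PF
    PD = ncx2.sf(gamma / sigma2, df=L, nc=E / sigma2)    # detection probability under M1
    print(gamma, PD)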

4.15 Unknown Noise Parameters

When aspects of the noise, such as the variance or power spectrum, are in doubt, the detection problem becomes more difficult to solve. Although a decision rule can be derived for such problems using the techniques we have been discussing, establishing a rational threshold value is impossible in many cases. The reason for this inability is simple: all models depend on the noise, thereby disallowing a computation of a threshold based on a performance probability. The solution is innovative: derive decision rules and accompanying thresholds that do not depend on false-alarm probabilities!

Consider the case in which the variance of the noise is not known and the noise covariance matrix is written as σ² K̃, where the trace of K̃ is normalized to L. The conditional density of the observations under a signal-related model is

p_{r|M_i,\sigma^2}(r) = \frac{1}{(\sigma^2)^{L/2}\sqrt{\det(2\pi \tilde{K})}}\, e^{-\frac{1}{2\sigma^2}(r - s_i)^T \tilde{K}^{-1}(r - s_i)}

Using the generalized-likelihood-ratio approach, the maximum value of this density with respect to σ² occurs when

\left(\sigma^{ML}\right)^2 = \frac{(r - s_i)^T \tilde{K}^{-1}(r - s_i)}{L}

34"Spectral Detection" <http://cnx.org/content/m11282/latest/>35"Probability Distributions": Distributions Related to the Gaussian <http://cnx.org/content/m11241/latest/#table2>36This content is available online at <http://cnx.org/content/m11285/1.2/>.


This seemingly complicated answer is easily interpreted. The presence of K̃^{-1} in the dot product can be considered a whitening filter. Under the ith model, the expected value of the observation vector is the signal. This computation amounts to subtracting the expected value from the observations, whitening the result, then averaging the squared values - the usual form for the estimate of a variance. Using this estimate for each model, the logarithm of the generalized likelihood ratio becomes

\frac{L}{2} \ln \frac{(r - s_0)^T \tilde{K}^{-1}(r - s_0)}{(r - s_1)^T \tilde{K}^{-1}(r - s_1)} \gtrless^{M_1}_{M_0} \ln\eta

Computation of the threshold remains. Both models depend on the unknown variance. However, a false-alarm probability, for example, can be computed if the probability density of the sufficient statistic does not depend on the variance of the noise. In this case, we would have what is known as a constant false-alarm rate or CFAR detector (Carlyle & Thomas[48]; Helstrom: p.317 [19]). If a detector has this property, the value of the statistic will not change if the observations are scaled about their presumed mean. Unfortunately, the statistic just derived does not have this property. Let there be no signal under M0. The scaling property can be checked in this zero-mean case by replacing r by cr. With this substitution, the statistic becomes

\frac{c^2\, r^T \tilde{K}^{-1} r}{(cr - s_1)^T \tilde{K}^{-1}(cr - s_1)}

The constant c cannot be eliminated and the detector does not have the CFAR property.

If, however, the amplitude of the signal is also assumed to be in doubt, a CFAR detector emerges. Express the signal component of model i as A s_i, where A is an unknown constant. The maximum likelihood estimate of this amplitude under model i is

A^{ML} = \frac{r^T \tilde{K}^{-1} s_i}{s_i^T \tilde{K}^{-1} s_i}

Using this estimate in the likelihood ratio, we find the decision rule for the CFAR detector.37

\frac{L}{2} \ln \frac{ r^T \tilde{K}^{-1} r - \frac{\left(r^T \tilde{K}^{-1} s_0\right)^2}{s_0^T \tilde{K}^{-1} s_0} }{ r^T \tilde{K}^{-1} r - \frac{\left(r^T \tilde{K}^{-1} s_1\right)^2}{s_1^T \tilde{K}^{-1} s_1} } \gtrless^{M_1}_{M_0} \ln\eta \qquad (4.28)

Now we find that when r is replaced by cr, the statistic is unchanged. Thus, the probability distribution of this statistic does not depend on the unknown variance σ². In most array processing applications, no signal is assumed present in model M0; in this case, M0 does not depend on the unknown amplitude A and a threshold can be found to ensure a specified false-alarm rate for any value of the unknown variance. For this specific problem, the likelihood ratio can be manipulated to yield the CFAR decision rule

\frac{\left(r^T \tilde{K}^{-1} s_1\right)^2}{\left(r^T \tilde{K}^{-1} r\right)\left(s_1^T \tilde{K}^{-1} s_1\right)} \gtrless^{M_1}_{M_0} \gamma

Example 4.14
Let's extend the previous example (Example 4.13) by applying the CFAR statistic just discussed to the white noise case. The sufficient statistic is

\Upsilon(r) = \frac{\left(\sum_l r(l)\, s(l)\right)^2}{\sum_l r^2(l)\, \sum_l s^2(l)}

37 K̃ is the normalized noise covariance matrix.


We first need to find the false-alarm probability as a function of the threshold γ. Using the techniques described in Wishner[54], the probability density of \Upsilon(r) under M0 is given by a Beta density (see Probability Distributions38), the parameters of which do not depend on either the noise variance (expectedly) or the signal values (unexpectedly).

p_{\Upsilon|M_0}(\Upsilon) = \beta\left(\Upsilon, \frac{1}{2}, \frac{L-1}{2}\right)

We express the false-alarm probability derived from this density as an incomplete Beta function (Abramowitz & Stegun[47]), resulting in the curves shown in Figure 4.17.

Figure 4.17: The false-alarm probability for the CFAR receiver is plotted against the threshold value γ for several values of L, the number of observations. Note that the test statistic, and thereby the threshold, does not exceed one.

The statistic's density under model M1 is related to the non-central F distribution, expressible by the fairly simple, quickly converging, infinite sum of Beta densities

p_{\Upsilon|M_1}(\Upsilon) = \sum_{k=0}^{\infty} \frac{e^{-d^2}\left(d^2\right)^k}{k!}\, \beta\left(\Upsilon, k + \frac{1}{2}, \frac{L-1}{2}\right)

where d² equals a signal-to-noise ratio: d^2 = \frac{\sum_l s^2(l)}{2\sigma^2}. The results of using this CFAR detector are shown in Figure 4.18.
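Because the statistic's density under M0 is Beta(1/2, (L-1)/2), the CFAR threshold follows directly from the Beta quantile function. A minimal sketch (Python with NumPy/SciPy; the signal shape, noise level, and P_F are illustrative assumptions) also checks that the statistic is unchanged by scaling the observations.

    import numpy as np
    from scipy.stats import beta

    rng = np.random.default_rng(3)
    L = 100
    s = np.sin(2 * np.pi * 0.1 * np.arange(L))    # assumed signal shape (amplitude unknown)
    PF = 1e-2

    # Threshold from the Beta(1/2, (L-1)/2) density of the statistic under M0
    gamma = beta.isf(PF, 0.5, (L - 1) / 2)

    def cfar_statistic(r):
        return (r @ s) ** 2 / ((r @ r) * (s @ s))

    r = rng.normal(0.0, 3.0, L)                   # noise-only record with unknown variance
    print(gamma, cfar_statistic(r), cfar_statistic(5.0 * r))   # statistic is scale-invariant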

38"Probability Distributions": Distributions Related to the Gaussian <http://cnx.org/content/m11241/latest/#table2>


Figure 4.18: The probability of detection for the CFAR detector and the matched filter detector is shown as a function of signal-to-noise ratio. The signal and false-alarm criterion are the same as in Figure 4.15. Note how little performance has been lost in this case!

4.16 Partial Knowledge of Probability Distributions

In previous chapters, we assumed we knew the mathematical form of the probability distribution for the observations under each model; some of these distributions' parameters were not known and we developed decision rules to deal with this uncertainty. A more difficult problem occurs when the mathematical form is not known precisely. For example, the data may be approximately Gaussian, containing slight departures from the ideal. More radically, so little may be known about an accurate model for the data that we are only willing to assume that they are distributed symmetrically about some value. We develop model evaluation algorithms in this section that tackle both kinds of problems. However, be forewarned that solutions to such general models come at a price: the more specific a model can be that accurately describes a given problem, the better the performance. In other words, the more specific the model, the more the signal processing algorithms can be tailored to fit it, with the obvious result that we enhance the performance. However, if our specific model is in error, our neatly tailored algorithms can lead us drastically astray. Thus, the best approach is to relax those aspects of the model which seem doubtful and to develop algorithms that will cope well with worst-case situations should they arise ("And they usually do," echoes every person experienced in the vagaries of data). These considerations lead us to consider nonparametric variations in the probability densities compatible with our assessment of model accuracy and to derive decision rules that minimize the impact of the worst-case situation.

4.16.1 Worst-Case Probability Distributions

In model evaluation problems, there are "optimally" hard problems, those where the models are the most difficult to distinguish. The impossible problem is to distinguish models that are identical. In this situation, the conditional densities of the observed data are equal and the likelihood ratio is constant for all possible values of the observations. It is obvious that identical models are indistinguishable; this elaboration suggests that, in terms of the likelihood ratio, hard problems are those in which the likelihood ratio is constant. Thus, "hard problems" are those in which the class of conditional probability densities has a constant ratio for wide ranges of observed data values.

The most relevant model evaluation problem for us is the discrimination between two models that differ only in the means of statistically independent observations: the conditional densities of each observation are related as p_{r_l|M_1}(r_l) = p_{r_l|M_0}(r_l - m). Densities that would make this model evaluation problem hard would satisfy the functional equation

p(x - m) = C(m)\, p(x) \qquad \forall x, m : x \geq m

where C(m) is a quantity depending on the mean m, but not the variable x.40 For the probability densities satisfying this equation, any observed datum having a value greater than m cannot be used to distinguish the two models. If one considers only those zero-mean densities p(·) which are symmetric about the origin, then by symmetry the likelihood ratio would also be constant for x ≤ 0. Hypotheses having these densities could only be distinguished when the observations lay in the interval (0, m); such model evaluation problems are hard!

From the functional equation, we see that the quantity C(m) must be inversely proportional to p(m) (substitute x = m into the equation). Incorporating this fact into our functional equation, we find that the only solution is the exponential function.

\forall z, z \geq 0: \quad \left(p(z - m) = C(m)\, p(z)\right) \Rightarrow \left(p(z) \propto e^{-z}\right)

If we insist that the density satisfying the functional equation be symmetric, the solution is the so-called Laplacian (or double-exponential) density.

p_z(z) = \frac{1}{\sqrt{2\sigma^2}}\, e^{-\frac{|z|}{\sqrt{\sigma^2/2}}}

When this density serves as the underlying density for our hard model-testing problem, the likelihood ratio has the form (Huber; 1965[20], Huber; 1981[22], Poor pp.175-187[39])

\ln\Lambda(r_l) = \begin{cases} -\frac{m}{\sqrt{\sigma^2/2}} & r_l < 0 \\ \frac{2 r_l - m}{\sqrt{\sigma^2/2}} & 0 < r_l < m \\ \frac{m}{\sqrt{\sigma^2/2}} & m < r_l \end{cases}

Indeed, the likelihood ratio is constant over much of the range of values of r_l, implying that the two models are very similar over those ranges. This worst-case result will appear repeatedly as we embark on searching for the model evaluation rules that minimize the effect of modeling errors on performance.
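The clipped form of this likelihood ratio is simple to implement. A minimal sketch (Python with NumPy; the sample values are arbitrary illustrations) evaluates the three-branch expression with a single clip of the linear term.

    import numpy as np

    def laplacian_llr(r, m, sigma2):
        # Per-sample log likelihood ratio for the Laplacian shift-in-mean problem:
        # constant for r < 0 and r > m, linear in between (a clipped statistic)
        b = np.sqrt(sigma2 / 2.0)                       # Laplacian scale parameter
        return np.clip((2.0 * np.asarray(r) - m) / b, -m / b, m / b)

    print(laplacian_llr([-3.0, 0.2, 0.8, 5.0], m=1.0, sigma2=1.0))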

4.17 Robust Hypothesis Testing

"Robust" is a technical word that implies insensitivity to modeling assumptions. As we have seen, some algorithms are robust while others are not. The intent of robust signal processing is to derive algorithms that are explicitly insensitive to the underlying signal and/or noise models. The way in which modeling uncertainties are described is typified by the approach we shall use in the following discussion of robust model evaluation.

40 The uniform density does not satisfy this equation as the domain of the function p(·) is assumed to be infinite.

We assume that two nominal models of the generation of the statistically independent observations are known; the "actual" conditional probability density that describes the data under the assumptions of each model is not known exactly, but is "close" to the nominal. Letting p(·) be the actual probability density for each observation and p_o(·) the nominal, we say that (Huber; 1981[23])

p(x) = (1 - \varepsilon)\, p_o(x) + \varepsilon\, p_d(x)

where p_d is the unknown disturbance density and ε is the uncertainty variable (0 ≤ ε < 1). The uncertainty variable specifies how accurate the nominal model is thought to be: the smaller ε, the smaller the contribution of the disturbance. It is assumed that some value for ε can be rationally assigned. The disturbance density is entirely unknown and is assumed to be any valid probability density function. The expression given above is normalized so that p(·) integrates to one. Examples of densities described this way are shown in Figure 4.19.

Figure 4.19: The nominal density, a Gaussian, is shown as a dashed line along with example densities derived from it having an uncertainty of 10% (ε = 0.1). The left plot illustrates a symmetric contamination and the right an asymmetric one.

The robust model evaluation problem is formally stated as

M_0: \quad p_{r|M_0}(r) = \prod_{l=0}^{L-1} \left[ (1-\varepsilon)\, p^o_{r_l|M_0}(r_l) + \varepsilon\, p^d_{r_l|M_0}(r_l) \right]

M_1: \quad p_{r|M_1}(r) = \prod_{l=0}^{L-1} \left[ (1-\varepsilon)\, p^o_{r_l|M_1}(r_l) + \varepsilon\, p^d_{r_l|M_1}(r_l) \right]

The nominal densities under each model correspond to the conditional densities that we have been using until now. The disturbance densities are intended to model imprecision of both descriptions; hence, they are assumed to be different in the context of each model. Note that the measure of imprecision ε is assumed to be the same under either model.

To solve this problem, we take what is known as a minimax approach: find the worst-case combinations of a priori densities (max), then minimize the consequences of this situation (mini) according to some criterion. In this way, bad situations are handled as well as can be expected while the more tolerable ones are (hopefully) processed well also. The "mini" phase of the minimax solution corresponds to the likelihood ratio for many criteria. Thus, the "max" phase amounts to finding the worst-case probability distributions for the likelihood ratio test as described in the previous section: find the disturbance densities that can result in a constant value for the ratio over large domains of functions. When the two nominal distributions scaled by 1 - ε can be brought together so that they are equal for some disturbance, then the likelihood ratio will be constant in that domain. Of most interest here is the case where the models differ only in the value of the mean, as shown in Figure 4.20. "Bringing the distributions together" means, in this case, scaling the distribution for M0 by 1 - ε while adding the constant ε to the scaled distribution for M1. One can show in general that if the ratio of the nominal densities is monotonic, this procedure finds the worst-case distribution (Huber; 1965[21]). The distributions overlap for small and for large values of the data with no overlap in a central region. As we shall see, the size of this central region depends greatly on the choice of ε. The tails of the worst-case distributions under each model are equal; conceptually, we consider that the worst-case densities have exponential tails in the model evaluation problem.

Figure 4.20: Nominal probability distributions for each model are shown. The worst-case distributions corresponding to these are also shown for the uncertainty variable ε equaling 0.1.

Letting p^ω denote the worst-case density, our minimax procedure results in the following densities for each model in the likelihood ratio test.

p^\omega_{r_l|M_i}(r_l) = \begin{cases} p^o_{r'_l|M_0}(r'_l)\, C'_i\, e^{-K'|r_l - r'_l|} & r_l < r'_l \\ p^o_{r_l|M_i}(r_l) & r'_l < r_l < r''_l \\ p^o_{r''_l|M_0}(r''_l)\, C''_i\, e^{-K''|r_l - r''_l|} & r_l > r''_l \end{cases}

The constants K' and K'' determine the rate of decay of the exponential tails of these worst-case distributions. Their specific values have not yet been determined, but since they are not needed to compute the likelihood ratio, they need not be found. The constants C'_i and C''_i are required so that a unit-area density results. The likelihood ratio for each observation in the robust model evaluation problem becomes

\Lambda(r_l) = \begin{cases} \frac{C'_1}{C'_0} & r_l < r'_l \\ \frac{p^o_{r_l|M_1}(r_l)}{p^o_{r_l|M_0}(r_l)} & r'_l < r_l < r''_l \\ \frac{C''_1}{C''_0} & r''_l < r_l \end{cases} \qquad (4.29)

The evaluation of the likelihood ratio depends entirely on determining values for r'_l and r''_l. The ratios C'_1/C'_0 = c' and C''_1/C''_0 = c'' are easily found; in the tails, the value of the likelihood ratio equals that at the edges of the central region for continuous densities.

c' = \frac{p^o_{r_l|M_1}(r'_l)}{p^o_{r_l|M_0}(r'_l)}, \qquad c'' = \frac{p^o_{r_l|M_1}(r''_l)}{p^o_{r_l|M_0}(r''_l)}

At the left boundary, for example, the distribution functions must satisfy (1 - ε) P_{r_l|M_0}(r'_l) = (1 - ε) P_{r_l|M_1}(r'_l) + ε. In terms of the nominal densities, we have

\int_{-\infty}^{r'_l} \left[ p_{r_l|M_0}(x) - p_{r_l|M_1}(x) \right] dx = \frac{\varepsilon}{1-\varepsilon}

This equation also applies at the right edge r''_l. Thus, for a given value of ε, the integral of the difference between the nominal densities should equal the ratio ε/(1-ε) for two values. Figure 4.21 illustrates this effect for a Gaussian example. The bi-valued nature of this integral may not be valid for some values of ε; the value chosen for ε can be too large, making it impossible to distinguish the models! This unfortunate circumstance means that the uncertainties, as described by the value of ε, swamp the characteristics that distinguish the models. Thus, the models must be made more precise (more must be known about the data) so that smaller deviations from the nominal models can describe the observations.

Figure 4.21: The quantity used to determine the thresholds in the robust decision rule is shown when m = 1 and σ² = 5. Given a value of ε, a value on the vertical axis is selected and the corresponding values on the horizontal axis yield the thresholds.


Returning to the likelihood ratio, the "robust" decision rule consists of computing a clipped function of each observed value, multiplying them together, and comparing the product computed over the observations with a threshold value. We assume that the nominal distributions of each of the L observations are equal; the values of the boundaries r'_l and r''_l then do not depend on the observation index l in this case. More simply, evaluating the logarithm of the quantities involved results in the decision rule

\sum_{l=0}^{L-1} f(r_l) \gtrless^{M_1}_{M_0} \gamma

where the function f(·) is the clipping function given by

f(r_l) = \begin{cases} \ln c' & r_l < r' \\ \ln\left(\frac{p^o_{r_l|M_1}(r_l)}{p^o_{r_l|M_0}(r_l)}\right) & r' < r_l < r'' \\ \ln c'' & r'' < r_l \end{cases}

If the observations were not identically distributed, then the clipping function would depend on the observation index.42

Determining the threshold γ that meets a specific performance criterion is difficult in the context of robust model evaluation. By the very nature of the problem formulation, some degree of uncertainty in the a priori densities exists. A specific false-alarm probability can be guaranteed by using the worst-case distribution under M0. This density has the disturbance term being an impulse at infinity. Thus, the expected value m_c of a clipped observation f(r_l) with respect to the worst-case density is (1 - ε) E[f(r_l)] + ε ln(c''), where the expected value in this expression is evaluated with respect to the nominal density under M0. Similarly, an expression for the variance σ_c² of the clipped observation can be derived. As the decision rule computes the sum of the clipped, statistically independent observations, the Central Limit Theorem can be applied to the sum, with the result that the worst-case false-alarm probability will approximately equal Q((γ - L m_c)/(√L σ_c)). The threshold γ can then be found which will guarantee a specified performance level. Usually, the worst-case situation does not occur and the threshold set by this method is conservative. We can assess the degree of conservatism by evaluating these quantities under the nominal density rather than the worst-case density.

Example 4.15
Let's consider the Gaussian model evaluation problem we have been using so extensively. The individual observations are statistically independent and identically distributed with variance five: σ² = 5. For model M0, the mean is zero; for M1, the mean is one. These nominal densities describe our best models for the observations, but we seek to allow slight deviations (10%) from them. The equation to be solved for the boundaries is the implicit equation

Q\left(\frac{z - m}{\sigma}\right) - Q\left(\frac{z}{\sigma}\right) = \frac{\varepsilon}{1-\varepsilon}

The quantity on the left side of the equation is shown in Figure 4.21. If the uncertainty in the Gaussian model, as expressed by the parameter ε, is larger than 0.15 (for the example values of m and σ), no solution exists. Assuming that ε equals 0.1, the quantity ε/(1-ε) = 0.11 and the clipping thresholds are r' = -1.675 and r'' = 2.675. Between these values, the clipping function is given by the logarithm of the likelihood ratio, which is given by (2m r_l - m²)/(2σ²). We can decompose the clipping operation into a cascade of two operations: a linear scaling and shifting (as described by the previous expression) followed by a clipper having unit slope (see Figure 4.22).

42 Note that we only need to require that the nominal density remain constant throughout the observations. The disturbance density, and through it the density of each observation, could vary without disturbing the validity of this result! Such generality is typical when one loosens modeling restrictions, but, as we have said, this generality is bought with diminished performance.


Figure 4.22: The robust decision rule for the case of a Gaussian nominal density is shown. The observations are first scaled and shifted by quantities that depend on the mean m and the variance σ². The resulting quantity is then passed through a symmetric unit-slope clipping function whose clipping thresholds also depend on the parameters of the distributions.

Let r̃_l denote the result of the scaling and shifting operation. This quantity has mean m²/(2σ²) and variance m²/σ² under M1 and the opposite-signed mean and the same variance under M0. The threshold values of the unit-slope clipping function are thus given by the solution of the equation

Q\left(\frac{z - \frac{m^2}{2\sigma^2}}{m/\sigma}\right) - Q\left(\frac{z + \frac{m^2}{2\sigma^2}}{m/\sigma}\right) = \frac{\varepsilon}{1-\varepsilon}

By substituting -z for z in this equation, we find that the two solutions are negatives of each other. We have now placed the unit-clipper's threshold values symmetrically about the origin; however, they do depend on the value of the mean m. In this example, the threshold is numerically given by z = 0.435. The expected value of the result of the clipping function with respect to the worst-case density is given by the complicated expression

E[f(r_l)] = (1-\varepsilon)\left( r'\, Q\left(\frac{-r'}{\sigma}\right) + r''\, Q\left(\frac{r''}{\sigma}\right) + \sqrt{\frac{\sigma^2}{2\pi}}\left( e^{-\frac{r'^2}{2\sigma^2}} - e^{-\frac{r''^2}{2\sigma^2}} \right) \right) + \varepsilon r''

The variance is found in a similar fashion and can be used to find the threshold γ on the sum of clipped observation values.
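The boundary equation and the resulting clipped statistic can be evaluated numerically. A minimal sketch (Python with NumPy/SciPy; it mirrors the values of Example 4.15 and uses brentq to find the two roots) computes r', r'' and the summed, clipped log likelihood ratio.

    import numpy as np
    from scipy.stats import norm
    from scipy.optimize import brentq

    m, sigma2, eps = 1.0, 5.0, 0.1
    sigma = np.sqrt(sigma2)
    target = eps / (1 - eps)

    # g(z) = Q((z-m)/sigma) - Q(z/sigma); solve g(z) = target for the two boundaries
    g = lambda z: norm.sf((z - m) / sigma) - norm.sf(z / sigma) - target
    r_lo = brentq(g, -20.0, m / 2)      # left boundary (about -1.675 for these values)
    r_hi = brentq(g, m / 2, 20.0)       # right boundary (about 2.675)

    def robust_statistic(r):
        # Clipped per-sample log likelihood ratio (2*m*r - m^2)/(2*sigma^2), summed
        llr = (2 * m * np.clip(r, r_lo, r_hi) - m ** 2) / (2 * sigma2)
        return llr.sum()

    rng = np.random.default_rng(4)
    print(r_lo, r_hi, robust_statistic(rng.normal(m, sigma, 100)))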

4.18 Non-Parametric Model Evaluation

If we are very uncertain about model accuracy, assuming a form for the nominal density may be questionable, or quantifying the degree of uncertainty may be unreasonable. In these cases, any formula for the underlying probability densities may be unjustified, but the model evaluation problem remains. For example, we may want to determine a signal's presence or absence in an array output (non-zero mean vs. zero mean) without much knowledge of the contaminating noise. If minimal assumptions can be made about the probability densities, non-parametric model evaluation can be used (Gibson and Melsa[14]). In this theoretical framework, no formula for the conditional densities is required; instead, we use worst-case densities which conform to the weak problem specification. Because few assumptions about the probability models are used, non-parametric decision rules are robust: they are insensitive to modeling assumptions because so few are used. The "robust" tests of the previous section are so-named because they explicitly encapsulate model imprecision. In either case, one should not expect greater performance (smaller error probabilities) from non-parametric decision rules than is possible from a "robust" one.

Two hypothesized models are to be tested; M0 is intended to describe the situation "the observed data have zero mean" and the other "a non-zero mean is present." We make the usual assumption that the L

43This content is available online at <http://cnx.org/content/m11304/1.4/>.


observed data values are statistically independent. The only assumption we will make about the probabilistic descriptions underlying these models is that the median of the observations is zero in the first instance and non-zero in the second. The median of a random variable is the "half-way" value: the probability that the random variable is less than the median is one-half, as is the probability that it is greater. The median and mean of a random variable are not necessarily equal; for the special case of a symmetric probability density they are. In any case, the non-parametric models will be stated in terms of the probability that an observation is greater than zero.

M0 : Pr[r_l ≥ 0] = 1/2
M1 : Pr[r_l ≥ 0] > 1/2

The first model is equivalent to a zero-median model for the data; the second implies that the median is greater than zero. Note that the form of the two underlying probability densities need not be the same to correspond to the two models; they can differ in more general ways than in their means.

To solve this model evaluation problem, we seek (as do all robust techniques) the worst-case density, the density satisfying the conditions for one model that is maximally difficult to distinguish from a given density under the other. Several interesting problems arise in this approach. First of all, we seek a non-parametric answer: the solution must not depend on unstated parameters (we should not have to specify how large the non-zero mean might be). Secondly, the model evaluation rule must not depend on the form of the given density. These seemingly impossible properties are easily satisfied. To find the worst-case density, first define p+_{r_l|M1}(r_l) to be the probability density of the lth observation assuming that M1 is true and that the observation is non-negative. A similar definition for negative values is needed.

p+_{r_l|M1}(r_l) = p_{r_l|M1, r_l≥0}(r_l)
p−_{r_l|M1}(r_l) = p_{r_l|M1, r_l<0}(r_l)

In terms of these quantities, the conditional density of an observation under M1 is given by

p_{r_l|M1}(r_l) = Pr[r_l ≥ 0 | M1] p+_{r_l|M1}(r_l) + (1 − Pr[r_l ≥ 0 | M1]) p−_{r_l|M1}(r_l)

The worst-case density under M0 would have exactly the same functional form as this one for positive and negative values while having a zero median.44 As depicted in Figure 4.23, a density meeting these requirements is

p_{r_l|M0}(r_l) = ( p+_{r_l|M1}(r_l) + p−_{r_l|M1}(r_l) ) / 2

44Don't forget that the worst-case density in model evaluation agrees with the given one over as large a range as possible.


Figure 4.23: For each density having a positive-valued median, the worst-case density having zero median would have exactly the same functional form as the given one on the positive and negative real lines, but with the areas adjusted to be equal. Here, a unit-mean, unit-variance Gaussian density and its corresponding worst-case density are shown. Note that the worst-case density is usually discontinuous at the origin; be that as it may, this rather bizarre worst-case density leads to a simple non-parametric decision rule.

The likelihood ratio for a single observation would be 2 Pr[r_l ≥ 0 | M1] for non-negative values and 2 (1 − Pr[r_l ≥ 0 | M1]) for negative values. While the likelihood ratio depends on Pr[r_l ≥ 0 | M1], which is not specified in our non-parametric model, the sufficient statistic will not depend on it! To see this, note that the likelihood ratio varies only with the sign of the observation. Hence, the optimal decision rule amounts to counting how many of the observations are positive; this count can be succinctly expressed with the unit-step function u(·) as Σ_{l=0}^{L−1} u(r_l).45 Thus, the likelihood ratio for the L statistically independent observations is written

2^L ( Pr[r_l ≥ 0 | M1] )^{Σ_l u(r_l)} ( 1 − Pr[r_l ≥ 0 | M1] )^{L − Σ_l u(r_l)}   ≷_{M0}^{M1}   η

Making the usual simplifications, the unknown probability Pr[r_l ≥ 0 | M1] can be maneuvered to the right side and merged with the threshold. The optimal non-parametric decision rule thus compares the sufficient statistic - the count of positive-valued observations - with a threshold determined by the Neyman-Pearson criterion.

Σ_{l=0}^{L−1} u(r_l)   ≷_{M0}^{M1}   γ    (4.30)

This decision rule is called the sign test as it depends only on the signs of the observed data. The sign test is uniformly most powerful and robust.

To find the threshold γ, we can use the Central Limit Theorem to approximate the probability distribution of the sum by a Gaussian. Under M0, the expected value of u(r_l) is 1/2 and the variance is 1/4. To the degree that the Central Limit Theorem reflects the false-alarm probability (see this problem46), PF is approximately

45 We define the unit-step function as u(x) = 1 if x > 0 and u(x) = 0 if x < 0, with the value at the origin undefined. We presume that the densities have no mass at the origin under either model. Although appearing unusual, Σ u(r_l) does indeed yield the number of positive-valued observations.
46 "Non-Gaussian Detection Theory: Problems", Exercise 2 <http://cnx.org/content/m11222/latest/#problem2>


given by

PF = Q( (γ − L/2) / √(L/4) )

and the threshold is found to be

γ = (√L / 2) Q^{−1}(PF) + L/2

As it makes no sense for the threshold to be greater than L (how many positive-valued observations can there be?), the specified false-alarm probability must satisfy PF ≥ Q(√L). This restriction means that increasingly stringent requirements on the false-alarm probability can only be met if we have sufficient data.
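A minimal Python sketch of this computation (an added illustration; the synthetic data and the false-alarm specification are assumptions), with scipy's norm.isf standing in for Q^{−1}:

    import numpy as np
    from scipy.stats import norm

    def sign_test(r, PF):
        """Sign test: count positive observations and compare with the
        Central-Limit-Theorem threshold gamma = (sqrt(L)/2) Qinv(PF) + L/2."""
        L = len(r)
        statistic = np.sum(r > 0)                       # sum of u(r_l)
        gamma = np.sqrt(L) / 2 * norm.isf(PF) + L / 2
        return statistic > gamma, statistic, gamma

    # Illustrative use with assumed unit-variance Gaussian noise and a small positive mean.
    rng = np.random.default_rng(1)
    decide_M1, stat, gamma = sign_test(0.3 + rng.standard_normal(100), PF=0.01)
    print(decide_M1, stat, gamma)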

4.19 Partially Known Signals and Noise47

Rather than assuming that aspects of the signal, such as its amplitude, are beyond any set of justifiable assumptions and are thus "unknown," we may have a situation where these signal aspects are "uncertain." For example, the amplitude may be known to be within ten percent of a nominal value. If that is the case, we would expect better performance characteristics from a detection strategy exploiting this partial knowledge than from one that doesn't. To derive detectors that use partial information about signal and noise models, we apply the approach used in robust model evaluation: find the worst-case combination of signal and noise consistent with the partial information, then derive the detection strategy that best copes with it. We have seen that the optimal detection strategy is found from the likelihood ratio: no matter what the signal and noise models are, the likelihood ratio yields the best decision rule. When applied to additive Gaussian noise problems, the performance of the likelihood ratio test increases with the signal-to-noise ratio of the difference between the two hypothesized signals. Since we focus on deciding whether a particular signal is present or not, performance is determined by that signal's SNR, and the worst-case situation occurs when this signal-to-noise ratio is smallest. The results from robust model evaluation taught us to design the detector for the worst-case situation, which in our case roughly means employing matched filters based on the worst-case signal. Employing this approach results in what are known as robust detectors.

4.20 Partially Known Signal Waveform48

The nominal signal waveform is known, but the actual signal present in the observed data may be slightly corrupted. Using the minimax approach (p. 142), we seek the worst possible signal that could be present consistent with constraints on the corruption. Once we know what signal that is, our detector should consist of a filter matched to that worst-case signal. Let the observed signal be of the form s(l) = s_o(l) + c(l): s_o(l) is the nominal signal and c(l) the corruption in the observed signal. The nominal-signal energy is E_o, and the signal corruption is assumed to have an energy less than E_c. Spectral techniques are assumed to be applicable so that the covariance matrix of the Fourier-transformed noise is diagonal. The signal-to-noise ratio is given by Σ_k |S(k)|^2 / σ_k^2. What corruption yields the smallest signal-to-noise ratio? The worst-case signal will have the largest amount of corruption possible. This constrained minimization problem - minimize the signal-to-noise ratio while forcing the energy of the corruption to be less than E_c - can then be solved using Lagrange multipliers.

min_{S(k)}  Σ_k |S(k)|^2 / σ_k^2 + λ ( Σ_k |C(k)|^2 − E_c )

47This content is available online at <http://cnx.org/content/m11305/1.2/>.48This content is available online at <http://cnx.org/content/m11308/1.4/>.


By evaluating the appropriate derivatives, we find the spectrum of the worst-case signal to be a frequency-weighted version of the nominal signal's spectrum.

S_w(k) = ( λσ_k^2 / (1 + λσ_k^2) ) S_o(k)

where

Σ_k ( 1 / (1 + λσ_k^2) )^2 |S_o(k)|^2 = E_c

The only unspecified parameter is the Lagrange multiplier λ, with the latter equation providing an implicit solution in most cases.

If the noise is white, the σ_k^2 equal a constant, implying that the worst-case signal is a scaled version of the nominal, equaling S_w(k) = (1 − √(E_c/E_o)) S_o(k). The robust decision rule derived from the likelihood ratio is given by

Re( Σ_{k=0}^{L−1} R(k) S_o(k) )   ≷_{M0}^{M1}   γ

By incorporating the scaling constant 1 − √(E_c/E_o) into the threshold γ, we find that the matched filter used in the white-noise, known-signal case is robust with respect to signal uncertainties. The threshold value is identical to that derived using the nominal signal, as model M0 does not depend on the uncertainties in the signal model. Thus, the detector used in the known-signal, white-noise case can also be used when the signal is partially corrupted. Note that in solving the general signal-corruption case, the imprecise-signal-amplitude situation was also solved.

If the noise is not white, the proportion between the nominal and worst-case signal spectral components is not constant. The decision rule expressed in terms of the frequency-domain sufficient statistic becomes

Re( Σ_{k=0}^{L−1} ( λσ_k^2 / (1 + λσ_k^2) ) R(k) S_o(k) )   ≷_{M0}^{M1}   γ

Thus, the detector derived for colored noise problems is not robust. The threshold depends on the noise spectrum and the energy of the corruption. Furthermore, calculating the value of the Lagrange multiplier in the colored noise problem is quite difficult, with multiple solutions to its constraint equation quite possible. Only one of these solutions will correspond to the worst-case signal.
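The following Python sketch (added for illustration, not the text's own code) carries out this computation numerically: it solves the implicit constraint equation for λ with a bracketed root search and forms the frequency-weighted worst-case spectrum S_w(k). The example spectrum, noise variances, and the root-search bracket are assumptions; as noted above, the constraint equation can have several solutions in general, and this sketch returns only the one found in the chosen bracket.

    import numpy as np
    from scipy.optimize import brentq

    def worst_case_spectrum(So, sigma2, Ec):
        """Solve sum_k (1/(1+lam*sigma_k^2))^2 |So(k)|^2 = Ec for lam >= 0 and return
        the frequency-weighted worst-case spectrum Sw(k) = lam*sigma_k^2/(1+lam*sigma_k^2) * So(k)."""
        def constraint(lam):
            return np.sum(np.abs(So)**2 / (1 + lam * sigma2)**2) - Ec
        # At lam = 0 the left side equals the nominal energy Eo and it shrinks toward 0
        # as lam grows, so a root exists whenever 0 < Ec < Eo (bracket chosen heuristically).
        lam = brentq(constraint, 0.0, 1e6)
        weight = lam * sigma2 / (1 + lam * sigma2)
        return weight * So, lam

    # Illustrative example: a hypothetical nominal spectrum and colored-noise variances.
    L = 16
    So = np.fft.fft(np.sin(2 * np.pi * 3 * np.arange(L) / L))
    sigma2 = 1.0 + 0.5 * np.cos(2 * np.pi * np.arange(L) / L)
    Sw, lam = worst_case_spectrum(So, sigma2, Ec=0.2 * np.sum(np.abs(So)**2))
    print(lam)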

4.21 Partially Known Noise Amplitude Distribution49

The previous sections assumed that the probability distribution of the noise was known precisely and, furthermore, that this distribution was Gaussian. Deviations from this assumption occur frequently in applications (Machell and Penrod[33], Middleton[34], Milne and Ganton[35]). We expect Gaussian noise in situations where the noise sources are many and of roughly equal strength: the Central Limit Theorem suggests that if these noise sources are independent (or even mildly dependent), their superposition will be Gaussian. As shown in this discussion (Section 1.6), the Central Limit Theorem converges very slowly, and deviations from this model, particularly in the tails of the distribution, are a fact of life. Furthermore, unexpected, deviant noise sources also occur, and these distort the noise amplitude distribution. Examples of the phenomenon are lightning (causing momentary, large, skewed changes in the nominal amplitude distribution in electromagnetic sensors) and ice fractures in polar seas (evoking similar distributional changes in acoustic noise). These changes are momentary and their effects on the amplitude distribution are, by and large, unpredictable. For these situations, we invoke ideas from robust model evaluation to design robust detectors,

49This content is available online at <http://cnx.org/content/m11316/1.5/>.


ones insensitive to deviations from the Gaussian model (El-Sawy and Vandelinde[12], Kassam and Poor[30], Poor pp. 175-187[40]).

We assume that the noise component in the observations consists of statistically independent elements, each having a probability amplitude density of the form

p_n(n) = (1 − ε) p_n^o(n) + ε p_n^d(n)

where p_n^o(·) is the nominal noise density, taken to be Gaussian, and p_n^d(·) is the deviation of the actual density from the nominal, also a density. This ε-contamination model (Huber, 1981[24]) is parameterized by ε, the uncertainty variable, a positive number less than one that defines how large the deviations from the nominal can be. As shown in Robust Hypothesis Testing (Section 4.17), the decision rule for the robust detector is

Σ_{l=0}^{L−1} f_l[ ( r(l) s(l) − s^2(l)/2 ) / σ^2 ]   ≷_{M0}^{M1}   γ

where f_l(·) is a memoryless nonlinearity having the form of a clipper (see Robust Hypothesis Testing (Section 4.17)). The block diagram of this receiver is shown in Figure 4.24.

Figure 4.24: The robust detector consists of a linear scaling and shifting operation followed by a unit-slope clipper, whose clipping thresholds depend on the value of the signal. The key element is the clipper, which serves to censor large excursions of the observed data from the value of the signal.

The clipping function threshold z′_l is related to the assumed variance σ^2 of the nominal Gaussian density, the deviation parameter ε, and the signal value at the lth sample by the positive-valued solution of

Q( (z′_l + s^2(l)/(2σ^2)) / (s(l)/σ) ) − Q( (z′_l − s^2(l)/(2σ^2)) / (s(l)/σ) ) = ε/(1 − ε)

An example of solving for the clipping threshold is shown in this figure (Figure 4.21). The characteristics of the clipper vary with each signal value: the clipper is exquisitely tuned to the proper signal value, ignoring values that deviate from the signal. Furthermore, note that this detector relies on large signal values relative to the noise. If the signal values are small, the above equation for z′_l has no solution. Essentially, the robust detector ignores those values since the signal is too weak to prevent the noise-density deviations from entirely confusing the models.50 The threshold can be established for the signal values used by the detector through the Central Limit Theorem, as described in our previous discussion (Section 4.17).
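An added sketch of this decision rule, assuming the per-sample clipping thresholds z′_l have already been found (and that, as in Section 4.17, the clipping is symmetric at ±z′_l); samples whose signal value is too weak to admit a solution are marked with NaN and ignored, as the text describes. The numerical values are made up for illustration.

    import numpy as np

    def robust_detector_statistic(r, s, sigma2, z):
        """Sum of clipped per-sample statistics f_l[(r(l)s(l) - s^2(l)/2)/sigma^2].
        z[l] is the clipping threshold for sample l, or np.nan when no solution
        for z'_l exists (that sample is skipped)."""
        x = (r * s - s**2 / 2) / sigma2          # per-sample scaled statistic
        keep = ~np.isnan(z)                      # drop samples with no valid threshold
        return np.sum(np.clip(x[keep], -z[keep], z[keep]))

    # Hypothetical use: the thresholds below are arbitrary numbers for illustration only.
    r = np.array([0.9, -0.2, 1.4, 0.1])
    s = np.array([1.0, 0.8, 1.2, 0.05])
    z = np.array([0.5, 0.4, 0.6, np.nan])
    print(robust_detector_statistic(r, s, sigma2=1.0, z=z))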

4.22 Non-Gaussian Observations51

The term non-Gaussian has a connotation similar to that of "non-linear;" rather than meaning those problems which are not simple (the simple ones being the Gaussian or linear ones), these terms refer instead to

50In the next section, we present a structure applicable in small signal-to-noise ratio cases.51This content is available online at <http://cnx.org/content/m11318/1.2/>.


the general problem - all possible stationary random sequences or all systems. In general, "non-Gaussian detection theory" makes no assumption as to the specific form of the noise amplitude distribution (Kassam[28]). This generality mimics situations where the additive noise is variable, having an unpredictable structure which makes a priori models of the noise difficult to justify. This section describes detection algorithms that make few assumptions about the joint probability density function of the noise amplitudes at several samples. For simplicity, the noise sequence is assumed in the sequel to be white.

4.23 Small-Signal Detection52

Before we introduce truly non-Gaussian detection strategies, we need to discuss the structure of the detector when the noise amplitude distribution is known. From the results presented while solving the general model evaluation problem, we find that the likelihood ratio test for statistically independent noise values is

Σ_{l=0}^{L−1} [ ln p_n( r(l) − s_1(l) ) − ln p_n( r(l) − s_0(l) ) ]   ≷_{M0}^{M1}   γ

The term in brackets is Υ_l, the sufficient statistic at the lth sample. As L is usually large and each term in the summation is statistically independent of the others, the threshold γ is found by approximating the distribution of the sufficient statistic by a Gaussian (the Central Limit Theorem (Section 1.6) again). Computing the mean and variance of each term, the false-alarm probability is found to be

PF = Q( ( ln η − Σ_l E[Υ_l | M0] ) / √( Σ_l σ^2(Υ_l | M0) ) )

This general result applies to any assumed noise density, including the Gaussian. The matched filter answer derived earlier is contained in the decision rule given above.
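As a sketch of how this threshold could be evaluated in practice (an added illustration assuming a user-supplied log-density log p_n and a way to draw noise samples), the per-sample moments under M0 can be estimated by Monte Carlo and combined exactly as in the Q-function expression above:

    import numpy as np
    from scipy.stats import norm

    def llr_statistic(r, s0, s1, log_pn):
        """Sufficient statistic: sum over l of ln p_n(r-s1) - ln p_n(r-s0)."""
        return np.sum(log_pn(r - s1) - log_pn(r - s0))

    def clt_threshold(s0, s1, log_pn, noise_sampler, PF, n_mc=100_000):
        """Gaussian (CLT) approximation of the threshold on the sufficient statistic:
        gamma = sum_l E[Y_l | M0] + Qinv(PF) * sqrt(sum_l var(Y_l | M0)),
        with the per-sample moments estimated by Monte Carlo under M0."""
        n = noise_sampler(n_mc)
        means, variances = [], []
        for l in range(len(s0)):
            y = log_pn(n + s0[l] - s1[l]) - log_pn(n)   # Y_l when r(l) = s0(l) + noise
            means.append(y.mean()); variances.append(y.var())
        return np.sum(means) + norm.isf(PF) * np.sqrt(np.sum(variances))

    # Illustration with an assumed Gaussian noise density and a weak sinusoidal signal.
    sigma = 1.0
    log_pn = lambda x: -0.5 * np.log(2 * np.pi * sigma**2) - x**2 / (2 * sigma**2)
    L = 32
    s1 = 0.5 * np.sin(2 * np.pi * 0.1 * np.arange(L)); s0 = np.zeros(L)
    rng = np.random.default_rng(2)
    gamma = clt_threshold(s0, s1, log_pn, lambda m: sigma * rng.standard_normal(m), PF=0.01)
    r = s1 + sigma * rng.standard_normal(L)
    print(llr_statistic(r, s0, s1, log_pn) > gamma)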

A matched-filter-like answer also results when small signal-to-noise ratio problems are considered for any amplitude distribution (Kassam pp. 5-8[29], Spaulding[45], Spaulding and Middleton[46]). Assume that ln p_n(r(l) − s(l)), considered as a function of the signal value s(l), can be expressed in a Taylor series centered at r(l).

ln p_n( r(l) − s(l) ) = ln p_n( r(l) ) − d/dn [ ln p_n(n) ] |_{n=r(l)} s(l) + (1/2) d^2/dn^2 [ ln p_n(n) ] |_{n=r(l)} s^2(l) + . . .

In the small signal-to-noise ratio case, second and higher order terms are neglected, leaving the decision rule

Σ_{l=0}^{L−1} [ −( d/dn [ ln p_n(n) ] |_{n=r(l)} ) s_1(l) + ( d/dn [ ln p_n(n) ] |_{n=r(l)} ) s_0(l) ]   ≷_{M0}^{M1}   γ

This rule says that in the small-signal case, the sufficient statistic for each signal results from matched-filtering the observations after they pass through a front-end memoryless non-linearity −d ln p_n(x)/dx, which depends only on the noise amplitude characteristics (see Figure 4.25).

52This content is available online at <http://cnx.org/content/m11319/1.5/>.


Figure 4.25: The decision rule for the non-Gaussian case when the signal is small can be expressed as a matched filter where the observations are first passed through a memoryless nonlinearity.

Because of the presence of the matched filter and the relationship of the matched filter to Gaussian noise problems, one might presume that the nonlinearity serves to transform each observation into a Gaussian random variable. In the case of zero-mean Gaussian noise, −d ln p_n(x)/dx = x/σ^2. Interestingly, this Gaussian case is the only instance where the "non-linearity" is indeed linear and yields a Gaussian output.

Now consider the case where the noise is Laplacian ( p_n(n) = (1/√(2σ^2)) e^{−|n|/√(σ^2/2)} ). The detector's front-end transformation is −d ln p_n(x)/dx = sign(x)/√(σ^2/2). This non-linearity is known as an infinite clipper, and the output consists only of the values ±1/√(σ^2/2), not a very Gaussian quantity.
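A brief sketch (added here for illustration) of the front-end nonlinearity g(x) = −d ln p_n(x)/dx for the two densities just discussed, together with the small-signal statistic that matched-filters its output; the example signal and noise parameters are arbitrary choices.

    import numpy as np

    def gaussian_score(x, sigma2):
        # -d/dx ln p_n(x) for zero-mean Gaussian noise: a linear function
        return x / sigma2

    def laplacian_score(x, sigma2):
        # -d/dx ln p_n(x) for Laplacian noise with variance sigma2: the infinite clipper
        return np.sign(x) / np.sqrt(sigma2 / 2)

    def small_signal_statistic(r, s1, s0, score):
        # matched filter applied after the memoryless front-end nonlinearity
        g = score(r)
        return np.sum(g * (s1 - s0))

    # Hypothetical example: a weak sinusoid versus no signal, in unit-variance Laplacian noise.
    rng = np.random.default_rng(3)
    L = 64
    s1 = 0.1 * np.sin(2 * np.pi * 0.05 * np.arange(L)); s0 = np.zeros(L)
    r = s1 + rng.laplace(scale=np.sqrt(0.5), size=L)
    print(small_signal_statistic(r, s1, s0, lambda x: laplacian_score(x, 1.0)))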

4.24 Robust Small-Signal Detection53

These results can be modified to derive a robust detector that accommodates small signals. To make this detector insensitive to deviations from the nominal noise probability density, the density should be replaced by the worst-case density consistent with the maximum allowed deviations. For unimodal, symmetric densities, the worst-case density equals that of the nominal within a region centered about the origin while being proportional to an exponential for values outside that region.

p_n^w(n) = (1 − ε) p_n^o(n),                    if |n| < n′
p_n^w(n) = (1 − ε) p_n^o(n′) e^{−a(|n| − n′)},  if |n| > n′

The parameter a controls the rate of decrease of the exponential tail; it is specified by requiring that the derivative of the density's logarithm be continuous at n = n′. Thus, a = −d ln p_n^o(n)/dn evaluated at n = n′. For the worst-case density to be a probability density, the boundary value n′ and the deviation parameter ε must be such that its integral is unity. As ε is fixed, this requirement reduces to a specification of n′.

∫_{−n′}^{n′} p_n^o(n) dn + (2/a) p_n^o(n′) = 1/(1 − ε)

For the case where the nominal density is Gaussian, a = n′/σ^2, and this equation becomes

1 − 2Q(n′/σ) + (2σ/(n′√(2π))) e^{−n′^2/(2σ^2)} = 1/(1 − ε)

The non-linearity required by the small-signal detection strategy thus equals the negative derivative of the logarithm of the worst-case density within the boundaries while equaling a constant outside that region.

f[r(l)] = −d ln p_n^o(n)/dn |_{n=r(l)},   if |r(l)| < n′
f[r(l)] = a sign[r(l)],                   if |r(l)| > n′        (4.31)

53This content is available online at <http://cnx.org/content/m11324/1.4/>.


The robust detector for small signals thus consists of a clipper followed by a matched filter for each signal, the results of which are compared with a threshold.
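For a Gaussian nominal density, the nonlinearity of (4.31) is simply a linear segment that saturates at ±n′/σ^2. A minimal sketch follows (added for illustration; n′ is treated as a given number rather than solved for from the integral condition above, and the example data are arbitrary).

    import numpy as np

    def robust_small_signal_nonlinearity(r, sigma2, n_prime):
        """Clipped score function for a Gaussian nominal density:
        -d ln p^o_n(n)/dn = n/sigma2 inside |n| < n', constant a = n'/sigma2 outside."""
        return np.clip(r / sigma2, -n_prime / sigma2, n_prime / sigma2)

    def robust_small_signal_statistic(r, s1, s0, sigma2, n_prime):
        # clipper followed by a matched filter for each signal
        g = robust_small_signal_nonlinearity(r, sigma2, n_prime)
        return np.sum(g * (s1 - s0))

    # Illustrative call with assumed values n' = 2.0 and sigma^2 = 1.
    rng = np.random.default_rng(4)
    L = 64
    s1 = 0.1 * np.cos(2 * np.pi * 0.05 * np.arange(L)); s0 = np.zeros(L)
    r = s1 + rng.standard_normal(L)
    print(robust_small_signal_statistic(r, s1, s0, sigma2=1.0, n_prime=2.0))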

4.25 Non-Parametric Detection54

In situations when no nominal density can be reasonably assigned or when the possible extent of deviations from the nominal cannot be assessed, non-parametric detection theory can rise to the occasion (Gibson and Melsa[15], Kassam and Thomas[31]). In this framework, first explored in Non-Parametric Model Evaluation (Section 4.18), little is assumed about the form of the noise density. Assume that model M0 corresponds to the noise-only situation and M1 to the presence of a signal. Moreover, assume that the noise density has zero median: any noise value is equally likely to be positive or negative. This assumption does not necessarily demand that the density be symmetric about the origin, but such densities do have zero median. Given these assumptions, the formalism of non-parametric model evaluation yields the sign test as the best decision rule. As described in the simpler model evaluation context, M1 had a constant, positive mean for each observation. Signal values are usually unequal and change sign; we must extend the sign test to this more realistic situation. Noting that the statistic of the sign test did not depend on the value of the mean but on its sign, the sign of each observation should be "matched" with the sign of each signal value. A kind of matched filter results, where sign[r(l)] is matched-filtered with sign[s(l)].

Σ_{l=0}^{L−1} u( sign[r(l)] sign[s(l)] )   ≷_{M0}^{M1}   γ    (4.32)

The quantity u(·) is the unit-step function; the sum counts the number of times the signal and observation signs match.

To find the threshold, the ubiquitous Central Limit Theorem (Section 1.6) can be invoked. Under M0, the expected value of the summation is L/2 and its variance L/4. The false-alarm probability for a given value of γ is therefore approximately given by

PF = Q( (γ − L/2) / √(L/4) )

and the threshold is easily found. We find the probability of detection with similar techniques, assuming a Gaussian density describes the distribution of the sum when a signal is present. Letting P_l denote the probability that the observation and the signal value agree in sign at the lth sample, the sum's expected value is Σ_l P_l and its variance Σ_l P_l(1 − P_l). Using the Central Limit Theorem approximation, the probability of detection is given by

PD = Q( (γ − Σ_l P_l) / √( Σ_l P_l(1 − P_l) ) )

For a symmetric as well as zero-median density for the noise amplitude, this probability is given by P_l = ∫_{−|s(l)|}^{∞} p_n(n) dn. If Gaussian noise were present, this probability would be 1 − Q(|s(l)|/σ), and for Laplacian noise 1 − (1/2) e^{−|s(l)|/√(σ^2/2)}.

The non-parametric detector expressed by (4.32) has many attractive properties for array processing applications. First, the detector does not require knowledge of the amplitude of the signal. In addition, note that the false-alarm probability does not depend on the variance of the noise; the sign detector is therefore CFAR. Another property of the sign detector is its robustness: we have implicitly assumed that the noise values have the worst-case probability density - the Laplacian. A more practical property is the one bit of precision required by the quantities used in the computation of the sufficient statistic: each observation is passed through an infinite clipper (a one-bit quantizer) and matched (ANDed) with a one-bit representation

54This content is available online at <http://cnx.org/content/m11325/1.5/>.


of the signal. A less desirable property is the dependence of the sign detector's performance on the signal waveform. A signal having a few dominant peak values may be less frequently detected than an equal-energy one having a more constant envelope. As the following example demonstrates, the loss in performance compared to a detector specially tailored to the signal and noise properties can be small.

Example 4.16
Let the signal to be detected be a sinusoid having, for simplicity, an integer number of cycles within the L observations [ s(l) = √(2E/L) sin(ω_o l) ]. Setting a false-alarm probability of 0.01 results in a threshold value of γ = 1.16 √L + L/2. This previous figure (Figure 4.15) depicts the probability of detection for various signal-to-noise ratios for the case ω_o = 2π · 0.1 and L = 100. The loss in performance compared to the matched filter is small - approximately 3 dB!
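A short Monte Carlo sketch of this example (added here for illustration; the signal energy E, the trial count, and unit-variance Gaussian noise are assumptions) that forms the sign-matched statistic of (4.32) and uses the example's threshold γ = 1.16√L + L/2:

    import numpy as np

    rng = np.random.default_rng(5)
    L, E = 100, 100.0
    s = np.sqrt(2 * E / L) * np.sin(2 * np.pi * 0.1 * np.arange(L))
    gamma = 1.16 * np.sqrt(L) + L / 2                 # threshold quoted for PF = 0.01

    def sign_detector(r, s):
        # count the samples where the observation and signal signs agree (u(sign r * sign s))
        return np.sum(np.sign(r) * np.sign(s) > 0)

    trials = 2000
    pf = np.mean([sign_detector(rng.standard_normal(L), s) > gamma for _ in range(trials)])
    pd = np.mean([sign_detector(s + rng.standard_normal(L), s) > gamma for _ in range(trials)])
    # Note: samples where s(l) = 0 (here l a multiple of 5) contribute nothing to the count,
    # so the simulated false-alarm rate tends to fall below the design value.
    print(pf, pd)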

4.26 Type-Based Detection55

Perhaps the ultimate non-parametric detector makes no assumptions about the observations' probability distribution under either model. Here, we assume that data representative of each model are available to train the detection algorithm. One approach uses artificial neural networks, which are difficult to analyze in terms of both optimality and performance. When the observations are discrete-valued, a provably optimal detection algorithm (Gutman[16]) can be derived using the theory of types (Section 3.10.2: Types).

For a two-model evaluation problem, let r̃ (length L̃) denote training data representative of some unknown probability distribution P. We assume that the data have statistically independent components. To derive a non-parametric detector, form a generalized likelihood ratio to distinguish whether a set of observations r (length L) has the same distribution as the training data or a different one Q.

log Λ(r) = log [ max_{P,Q} P(r̃) Q(r) / max_P P(r̃) P(r) ]

Because a type is the maximum likelihood estimate of the probability distribution (see Histogram Estimators (3.42)), we simply substitute types for the training data and observation probability distributions into the likelihood ratio. The probability of a set of observations having a probability distribution identical to its type equals e^{−L H(P)}. Thus, the log likelihood ratio becomes

log Λ(r) = log [ e^{−L̃ H(P_r̃)} e^{−L H(P_r)} / e^{−(L + L̃) H(P_{r,r̃})} ]

The denominator term means that the training and observed data are lumped together to form a type. This type equals the linear combination of the types for the training and observed data weighted by their relative lengths.

P_{r,r̃} = ( L P_r + L̃ P_r̃ ) / ( L + L̃ )

Returning to the log likelihood ratio, we have that

log Λ(r) = −L H(P_r) − L̃ H(P_r̃) + (L + L̃) H( (L P_r + L̃ P_r̃) / (L + L̃) )

Note that the last term equals

(L + L̃) H( (L P_r + L̃ P_r̃) / (L + L̃) ) = − Σ_a ( L P_r(a) + L̃ P_r̃(a) ) log( ( L P_r(a) + L̃ P_r̃(a) ) / ( L + L̃ ) )

55This content is available online at <http://cnx.org/content/m11329/1.3/>.


which means it can be combined with the other terms to yield the simple expression for the log likelihood ratio.

log Λ(r) = L D( P_r ‖ P_{r,r̃} ) + L̃ D( P_r̃ ‖ P_{r,r̃} )    (4.33)

When the training data and the observed data are drawn from the same distribution, the Kullback-Leibler distances will be small. When the distributions differ, the distances will be larger. Defining M0 to be the model that the training data and observations have the same distribution and M1 that they don't, Gutman[16] showed that when we use the decision rule

(1/L) log Λ(r)   ≷_{M0}^{M1}   γ

its false-alarm probability has an exponential rate at least as large as the threshold and the miss probability is the smallest among all decision rules based on training data.

limit_{L→∞} (1/L) log PF ≤ −γ

and PM is minimum. We can extend these results to the K-model case if we have training data r̃_i (each of length L̃) that

represent model Mi, i ∈ {0, . . . , K − 1}. Given observed data r (length L), we calculate the log likelihood function given above for each model to determine whether the observations closely resemble the tested training data or not. More precisely, define the sufficient statistics Υ_i according to

Υ_i = D( P_r ‖ P_{r,r̃_i} ) + (L̃/L) D( P_r̃_i ‖ P_{r,r̃_i} ) − γ

Ideally, this statistic would be negative for one of the training sets (matching it) and positive for all of the others (not matching them). However, we could also have the observations matching more than one training set. In all such cases, we define a rejection region R_? similar to what we defined in sequential model evaluation (Section 4.7). Thus, we define the ith decision region R_i according to Υ_i < 0 and Υ_j > 0, j ≠ i, and the rejection region as the complement of ∪_{i=0}^{K−1} R_i. Note that all decision regions depend on the value of

γ, a number we must choose. Regardless of the value chosen, the probability of confusing models - choosing some model other than the true one - has an exponential rate that is at least γ for all models. Because of the presence of a rejection region, another kind of "error" is to not choose any model. This decision rule is optimal in the sense that no other training-data-based decision rule has a smaller rejection region than the type-based one.

Because it controls the exponential rate of confusing models, we would like γ to be as large as possible. However, the rejection region grows as γ increases; choosing too large a value could make virtually all decisions rejections. What we want to ensure is that limit_{L→∞} Pr[R_? | Mi] = 0. Obtaining this behavior requires that limit_{L→∞} L̃/L > 0: as the length of the observations increases, so must the size of the training set.

In summary,

for some i : ( limit_{L→∞} L̃/L = 0 ) ∨ ( γ > γ0 )  ⇒  ( limit_{L→∞} Pr[R_? | Mi] = 1 )

∀i : ( limit_{L→∞} L̃/L > 0 ) ∧ ( γ < γ0 )  ⇒  ( limit_{L→∞} (1/L) log Pr[R_? | Mi] ≤ −β < 0 )

The critical value γ0 depends on the true distributions underlying the models. The exponential rate of the rejection probability β also depends on the true distributions. These results mean that if sufficient training data are available and the decision threshold is not too large, then we can perform optimal detection based entirely on data! As the number of observations increases (and the amount of training data as well), the critical threshold γ0 becomes the Kullback-Leibler distance between the unknown models. In other words, the type-based detector becomes optimal!
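For discrete-valued data, the two-model statistic (4.33) is straightforward to compute from histograms. The sketch below is an added illustration, not the text's code: types are formed as normalized histograms over a finite alphabet, and the alphabet size, data, and threshold are arbitrary choices.

    import numpy as np

    def type_hist(x, alphabet_size):
        # empirical type (histogram normalized to a probability distribution)
        return np.bincount(x, minlength=alphabet_size) / len(x)

    def kl(p, q):
        # Kullback-Leibler distance D(p || q); terms with p(a) = 0 contribute zero
        mask = p > 0
        return np.sum(p[mask] * np.log(p[mask] / q[mask]))

    def type_based_log_lr(r, r_train, alphabet_size):
        """log Lambda(r) = L D(P_r || P_mix) + Lt D(P_train || P_mix), with P_mix the
        length-weighted combination of the two types."""
        L, Lt = len(r), len(r_train)
        Pr, Pt = type_hist(r, alphabet_size), type_hist(r_train, alphabet_size)
        Pmix = (L * Pr + Lt * Pt) / (L + Lt)
        return L * kl(Pr, Pmix) + Lt * kl(Pt, Pmix)

    # Hypothetical example over a 4-letter alphabet; gamma = 0.1 is an arbitrary choice.
    rng = np.random.default_rng(6)
    r_train = rng.choice(4, size=2000, p=[0.4, 0.3, 0.2, 0.1])
    r = rng.choice(4, size=500, p=[0.1, 0.2, 0.3, 0.4])       # drawn from a different distribution
    print(type_based_log_lr(r, r_train, 4) / len(r) > 0.1)    # compare (1/L) log Lambda with gamma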


4.27 Discrete-Time Detection: Problems56

Exercise 4.27.1
Assume that observations of a sinusoidal signal s(l) = A sin(2πf l), l ∈ {0, . . . , L − 1}, are contaminated by first-order colored noise as described in the example.

4.27.1

Find the unit-sample response of the whitening filter.

4.27.2

Assuming that the alternative model is the sole presence of the colored Gaussian noise, what is the probability of detection?

4.27.3

How does this probability vary with signal frequency f when the first-order coefficient is positive? Does your result make sense? Why?

Exercise 4.27.2
In space-time decoding systems, a common bit stream is transmitted over several channels simultaneously but using different signals. r^(k) denotes the signal received from the kth channel, k ∈ {1, . . . , K}, and the received signal equals s^(k,i) + n^(k). Here i equals 0 or 1, corresponding to the bit being transmitted. Each signal has length L. n^(k) denotes a Gaussian random vector with statistically independent components having mean zero and variance σ_k^2 (the variance depends on the channel).

4.27.1

Assuming equally likely bit transmissions, find the minimum probability of error decision rule.

4.27.2

What is the probability that your decision rule makes an error?

4.27.3

Suppose each channel has its own decision rule, which is designed to yield the same miss probability as the others. Now what is the minimum probability of error decision rule of the system that combines the individual decisions into one?

Exercise 4.27.3
The performance of the optimal detector in white Gaussian noise problems depends only on the distance between the signals. Let's confirm this result experimentally. Define the signal under one hypothesis to be a unit-amplitude sinusoid having one cycle within the 50-sample observation interval. Observations of this signal are contaminated by additive white Gaussian noise having variance equal to 1.5. The hypotheses are equally likely.

4.27.1

Let the second hypothesis be a cosine of the same frequency. Calculate and estimate the detector's false-alarm probability.

56This content is available online at <http://cnx.org/content/m11300/1.3/>.


4.27.2

Now let the signals correspond to square waves constructed from the sinusoids used in the previous part (Section 4.27.1). Normalize them so that they have the same energy as the sinusoids. Calculate and estimate the detector's false-alarm probability.

4.27.3

Now let the noise be Laplacian with variance 1.5. Although no analytic expression for the detector performance can be found, do the simulated performances for the sinusoid and the square-wave signals change significantly?

4.27.4

Finally, let the second signal be the negative of the sinusoid. Repeat the calculations and the simulation for Gaussian noise.

Exercise 4.27.4
Physical constraints imposed on signals can change which signal-set choices result in the best detection performance. Let one of two equally likely discrete-time signals be observed in the presence of white Gaussian noise (variance/sample equals σ^2).

M0 : r(l) = s^(0)(l) + n(l),  l ∈ {0, . . . , L − 1}
M1 : r(l) = s^(1)(l) + n(l),  l ∈ {0, . . . , L − 1}

We are free to choose any signals we like, but there are constraints. The average signal power equals (1/L) Σ_l s^2(l), and the peak power equals max_l s^2(l).

4.27.1

Assuming the average signal power must be less than Pave, what are the optimal signal choices? Is your answer unique?

4.27.2

When the peak power Ppeak is constrained, what are the optimal signal choices?

4.27.3

If Pave = Ppeak, which constraint yields the best detection performance?

Exercise 4.27.5
In many signal detection problems, the signal itself is not known accurately; for example, the signal could have been the result of a measurement, in which case the "signal" used to specify the observations is actually the true signal plus noise. We want to determine how the measurement noise affects the detector.

The formal statement of the detection problem is

M0 : r(l) = s̃_0(l) + n(l),  l ∈ {0, . . . , L − 1}
M1 : r(l) = s̃_1(l) + n(l),  l ∈ {0, . . . , L − 1}

where the measured signal s_i(l) equals s̃_i(l) + w(l), with w(l) and n(l) comprising white Gaussian noise having variances σ_w^2 and σ_n^2, respectively. We know precisely what the signals s_i(l) are, but not the underlying "actual" signals s̃_i(l).


4.27.1

Find a detector for this problem.

4.27.2

Analyze, as much as possible, how this detector is affected by the measurement noise, contrasting its performance with that obtained when the signals are known precisely.

Exercise 4.27.6
One of the more interesting problems in detection theory is determining when the probability distribution of the observations differs from that in other portions of the observation interval. The most common form of the problem is that over the interval [0, C), the observations have one probability distribution, while in the remainder of the observation interval [C, L − 1] they have a different one. The change detection problem is to determine whether in fact a change has occurred and, if so, to estimate when that change occurs.

To explore the change detection problem, let's consider the simple situation where the mean of white Gaussian noise changes at the Cth sample.

M0 : r(l) ~ N(0, σ^2),  l ∈ {0, . . . , L − 1}
M1 : r(l) ~ N(0, σ^2)  if l ∈ {0, . . . , C − 1};  r(l) ~ N(m, σ^2)  if l ∈ {C, . . . , L − 1}

The observations in each case are statistically independent of the others.

4.27.1

Find a detector for this change problem when m is a known positive number.

4.27.2

Find an expression for the threshold in the detector using the Neyman-Pearson criterion.

4.27.3

How does the detector change when the value of m is not known?

Exercise 4.27.7
Noise generated by a system having zeros complicates the calculations for the colored noise detection problem. To illustrate these difficulties, assume the observation noise is produced as the output of a filter governed by the difference equation

∀ a, b, a ≠ −b : ( n(l) = a n(l − 1) + w(l) + b w(l − 1) )    (4.34)

where w(l) is white Gaussian noise. Assume an observation duration sufficiently long to capture the colored effects.

4.27.1

Find the covariance matrix of this noise process.

4.27.2

Calculate the Cholesky factorization of the covariance matrix.


4.27.3

Find the unit-sample response of the optimal detector's whitening filter. If it weren't for the finite observation interval, would it indeed have an infinite-duration unit-sample response as claimed here? Describe the edge effects of your filter, contrasting them with the case when b = 0.

Exercise 4.27.8
The calculation of the sufficient statistic in spectrally based detectors (see this equation) can be simplified using signal processing notions. When the observations are real-valued, their spectra are conjugate symmetric.

4.27.1

Exploiting this symmetry, how can the calculations of this equation be simplified? Be careful to note "spectral edge effects."

4.27.2

Can your formula be manipulated to depend on the power spectra of the signals and the observations? Why or why not?

Exercise 4.27.9
Here, we claimed that the relation between noise bandwidth and the reciprocal duration of the observation interval played a key role in determining whether DFT values were approximately uncorrelated. While the statements sound plausible, their veracity should be checked. Let the covariance function of the observation noise be K_n(l) = a^{|l|}.

4.27.1

How is the bandwidth (defined by the half-power point) of this noise's power spectrum related to the parameter a? How is the duration (defined to be two time constants) of the covariance function related to a?

4.27.2

Find the variance of the length-L DFT of this noise process as a function of frequency index k. This result should be compared with the power spectrum calculated in the previous part (p. 160); they should resemble each other when the "memory" of the noise - the duration of the covariance function - is much less than L, while demonstrating differences as the memory becomes comparable to or exceeds L.

4.27.3

Calculate the covariance between adjacent frequency indices. Under what conditions will they be approximately uncorrelated? Relate your answer to the relations of a to L found in the previous part (p. 160).

Exercise 4.27.10
The results derived in Exercise 4.27.9 assumed that a length-L Fourier Transform was computed from a length-L segment of the noise process. What will happen if the transform has length 2L with the observation interval remaining unchanged?

4.27.1

Find the variance of DFT values at index k.


4.27.2

Assuming the conditions in Exercise 4.27.9 for uncorrelated adjacent samples, now what is the correlation between adjacent DFT values?

Exercise 4.27.11
A common application of spectral detection is the detection of sinusoids in noise. Not only can one sinusoidal signal be present, but several. The problem is to determine which combination of sinusoids is present. Let's first assume that no more than two sinusoids can be present, the frequencies of which are integer fractions of the number of observations: f_i = k_i / L, i ∈ {1, 2}. Assume additive Gaussian noise is present.

4.27.1

What is the detection procedure for determining if one of the sinusoids is present while ignoring the second? The amplitude of the sinusoid is known.

4.27.2

How does the probability of detection vary with the ratio of the signal's squared amplitude to the noise variance at the frequency of the signal? Sketch this result for several false-alarm probabilities.

4.27.3

Now assume that none, either, or both of the signals can be present. Find the detector that best determines the combination present in the observations. Does your answer have a simple interpretation?

4.27.4

Derive a detector that determines the number of sinusoids present in the input.

Exercise 4.27.12
In a discrete-time detection problem, one of two equally likely, length-L sinusoids (each having a frequency equal to a known multiple of 1/L) is observed in additive colored Gaussian noise. Signal amplitudes are also unknown by the receiver. In addition, the power spectrum of the noise is uncertain: what is known is that the noise power spectrum is broadband and varies gently across frequency. Find a detector that performs well for this problem. What notable properties (if any) does your receiver have?

Exercise 4.27.13
A sampled signal is suspected of consisting of a periodic component and additive Gaussian noise. The signal, if present, has a known period L. The number of samples equals N, a multiple of L. The noise is white and has known variance σ^2. A consultant (you!) has been asked to determine the signal's presence.

4.27.1

Assuming the signal is a sinusoid with unknown phase and amplitude, what should be done to determine the presence of the sinusoid so that a false-alarm probability criterion of 0.1 is met?


4.27.2

Other than its periodic nature, now assume that the signal's waveform is unknown. What computations must the optimum detector perform?

Exercise 4.27.14
The QAM (Quadrature Amplitude Modulation) signal set consists of signals of the form

s_i(l) = A_i^c cos(2πf_c l) + A_i^s sin(2πf_c l)

where A_i^c and A_i^s are amplitudes that define each element of the signal set. These are chosen according to design constraints. Assume the signals are observed in additive Gaussian noise.

4.27.1

What is the optimal amplitude choice for the binary and quaternary (four-signal) signal sets when the noise is white and the signal energy is constrained (Σ_l s_i^2(l) < E)? Comment on the uniqueness of your answers.

4.27.2

Describe the optimal binary QAM signal set when the noise is colored.

4.27.3

Now suppose the peak amplitude (max_l |s_i(l)| < A_max) is constrained. What are the optimal signal sets (both binary and quaternary) for the white noise case? Again, comment on uniqueness.

Exercise 4.27.15
Because sources rarely remain still, a common problem in radar applications is uncertainty of the signal frequency because of Doppler shifts. This effect is usually small, but conceivably could be large enough for a sinusoidal signal to "wander" from its presumed frequency index.

4.27.1

Design a spectrally based detection strategy which examines a range of frequencies, testing for the presence of a sinusoid in any one of the bins or no sinusoid in any bin.

4.27.2

Predict the detection loss (or gain) relative to the known-frequency case (no Doppler shift) by contrasting the signal-to-noise ratio terms in the detection probability expression for each detector.

4.27.3

Contrast the performance of this detector with a square-law detector that only considers the frequency band in question.

Exercise 4.27.16
I confess that the unknown-delay problem given previously (Section 4.13) is not terribly relevant to active sonar and radar ranging problems. In these cases, the signal's delay, measured with respect to its emission by a source, represents the round-trip propagation time from the source to the energy-reflecting object and back. Hence, delay is proportional to twice the range. What makes the example overly simplistic is the assumed independence of the signal amplitude from the delay.


4.27.1

Because of attenuation due to spherical propagation, show that the received signal energy is inversely related to the fourth power of the range. This result is known as the radar equation.

4.27.2

Derive the detector that takes the dependence between delay and amplitude into account, thereby optimally determining the presence of a signal in active radar/sonar applications, and that produces a delay estimate, thereby simultaneously providing the object's range. Determine not only the detector's structure, but also how the threshold and false-alarm probability are related.

4.27.3

Does the object's reflectivity need to be known to implement your detector?

Exercise 4.27.17
Derive the frequency domain counterpart of the CFAR detector, explicitly indicating its independence of the noise variance. Contrast this detector with that derived in Exercise 4.27.15: how do they differ? How much penalty in performance is paid by uncertainty in the noise variance?

Exercise 4.27.18
CFAR detectors are extremely important in applications because they automatically adapt to the value of the noise variance during the observations, allowing them to be used in varying noise situations. However, as described in Exercise 4.27.16, unknown delays must also be considered in realistic problems.

4.27.1

Derive a CFAR detector that also takes unknown signal delay into account.

4.27.2

Show that your detector automatically incorporates amplitude uncertainties.

4.27.3

Under the no-signal model, what is the distribution of the sufficient statistic?


Chapter 5

Probability Distributions


Chapter 6

Matrix Theory


Chapter 7

Ali-Silvey Distances1

Ali-Silvey distances comprise a family of quantities that depend on the likelihood ratio Λ(r) and on the model-describing densities p0, p1 in the following way.

d(p0, p1) = f( E0[ c( Λ(r) ) ] )    (7.1)

Here, f(·) is an increasing function, c(·) is a convex function, and E0[·] means expectation with respect to p0. Where applicable, π0, π1 denote the a priori probabilities of the models. Basseville[3] is a good reference on distances in this class and many others. In all cases, the observations consist of L IID random variables.

Ali-Silvey Distances and Relation to Detection Performance

Name | c(·) | Performance | Comment

Kullback-Leibler D(p1 ‖ p0) | (·) log(·) | limit_{L→∞} −(1/L) log PF = d(p0, p1) | Neyman-Pearson error rate under both fixed and exponentially decaying constraints on PM (PF)

Kullback-Leibler D(p0 ‖ p1) | −log(·) | limit_{L→∞} −(1/L) log PM = d(p0, p1) |

J-Divergence | ((·) − 1) log(·) | π0 π1 e^{−d(p0,p1)/2} ≤ Pe | J(p0, p1) = D(p0 ‖ p1) + D(p1 ‖ p0)

Chernoff | (·)^s, s ∈ (0, 1) | limit_{L→∞} −(1/L) log Pe = inf_{s ∈ (0,1)} d(p0, p1) | Independent of a priori probabilities

M-Hypothesis Chernoff | (·)^s, s ∈ (0, 1) | limit_{L→∞} −(1/L) log Pe = min_{i ≠ j} inf_{s ∈ (0,1)} d(p_i, p_j) |

Bhattacharyya | (·)^{1/2} | π0 π1 d^2(p0, p1) ≤ Pe ≤ √(π0 π1) d(p0, p1) | Minimizing d(p0, p1) will tend to minimize Pe

Orsak | |π1 (·) − π0| | Pe = 1/2 − (1/2) d(p0, p1) | Exact formula for average error probability

Kolmogorov | (1/2) |(·) − 1| | If π0 = π1, Pe = 1/2 − (1/2) d(p0, p1) |

Hellinger | ((·)^{1/2} − 1)^2 | |

Table 7.1

1 This content is available online at <http://cnx.org/content/m11218/1.4/>.
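As a small added illustration of (7.1), the two Kullback-Leibler rows of the table can be estimated by Monte Carlo for a pair of hypothetical Gaussian models, using c(x) = x log x and c(x) = −log x with f the identity; the models and the sample size are arbitrary choices.

    import numpy as np

    # Two hypothetical scalar Gaussian models: p0 = N(0,1), p1 = N(1,1).
    rng = np.random.default_rng(7)
    m = 1.0
    r = rng.standard_normal(500_000)                  # samples drawn from p0
    log_ratio = m * r - m**2 / 2                      # log Lambda(r) = log p1(r) - log p0(r)
    Lam = np.exp(log_ratio)

    kl_p1_p0 = np.mean(Lam * np.log(Lam))             # E0[c(Lambda)] with c(x) = x log x
    kl_p0_p1 = np.mean(-np.log(Lam))                  # E0[c(Lambda)] with c(x) = -log x
    print(kl_p1_p0, kl_p0_p1)                         # both should be near m^2/2 = 0.5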


Glossary

H Hilbert Spaces

A Hilbert space H is a closed, normed linear vector space which contains all of its limit points: if {x_n} is any sequence of elements in H that converges to x, then x is also contained in H. x is termed the limit point of the sequence.

O orthogonal

If Y is a subspace of H, the vector x is orthogonal to the subspace Y if, for every y ∈ Y, <x, y> = 0.

orthonormal

The basis for a separable vector space is said to be an orthonormal basis if the elements of the basis satisfy the following two properties:

• The inner product between distinct elements of the basis is zero (i.e., the elements of the basis are mutually orthogonal).

∀ i, j, i ≠ j : ( <φ_i, φ_j> = 0 )    (1.42)

• The norm of each element of a basis is one (normality).

∀ i, i = 1, . . . : ( ‖φ_i‖ = 1 )    (1.43)

S separable

A Hilbert space H is said to be separable if there exists a set of vectors {φ_i}, i = 1, . . ., elements of H, that express every element x ∈ H as

x = Σ_{i=1}^{∞} x_i φ_i    (1.40)

where the x_i are scalar constants associated with φ_i and x, and where "equality" is taken to mean that the distance between the two sides becomes zero as more terms are taken on the right.

limit_{m→∞} ‖ x − Σ_{i=1}^{m} x_i φ_i ‖ = 0


Bibliography

[1] M. Abramowitz and I. A. Stegun, editors. Handbook of Mathematical Functions. U.S. Government Printing Office, 1968.

[2] A. Wald. Sequential Analysis. Wiley and Sons, New York, 1947.

[3] M. Basseville. Distance measures for signal processing and pattern recognition. Signal Processing, 18:349-369, 1989.

[4] D.H. Brandwood. A complex gradient operator and its application in adaptive array theory. IEE Proc., Pts. F and H, 130:11-16, 1983.

[5] B.W. Silverman. Density Estimation. Chapman and Hall, London, 1986.

[6] H. Chernoff. A measure of asymptotic efficiency for tests of a hypothesis based on the sum of observations. Ann. Math. Stat., 23:493-507, 1952.

[7] R.V. Churchill and J.W. Brown. Complex Variables and Applications. McGraw-Hill, 1989.

[8] H. Cramér. Mathematical Methods of Statistics. Princeton University Press, Princeton, NJ, 1946.

[9] H. Cramér. Mathematical Methods of Statistics. Princeton University Press, Princeton, NJ, 1946.

[10] H. Cramér. Random Variables and Probability Distributions. Cambridge University Press, third edition, 1970.

[11] D. Chazan, M. Zakai, and J. Ziv. Improved lower bounds on signal parameter estimation. IEEE Trans. Info. Th., IT-21:90-93, January 1975.

[12] A.H. El-Sawy and V.D. Vandelinde. Robust detection of known signals. IEEE Trans. Info. Th., IT-23:722-727, 1977.

[13] J.D. Gibson and J.L. Melsa. Introduction to Non-Parametric Detection with Applications. Academic Press, New York, 1975.

[14] J.D. Gibson and J.L. Melsa. Introduction to Non-Parametric Detection with Applications. Academic Press, New York, 1975.

[15] J.D. Gibson and J.L. Melsa. Introduction to Non-Parametric Detection with Applications. Academic Press, New York, 1975.

[16] M. Gutman. Asymptotically optimal classification for multiple tests with empirically observed statistics. IEEE Trans. Info. Theory, 35:401-408, 1989.

[17] P. Hall. Rates of convergence in the central limit theorem. In Research Notes in Mathematics, volume 62, page 6. Pitman Advanced Publishing Program, 1982.

[18] H. Cramér. Mathematical Methods of Statistics. Princeton University Press, Princeton, NJ, 1946.


[19] C. W. Helstrom. Statistical Theory of Signal Detection. Pergamon Press, Oxford, second edition edition,1968.

[20] P.J. Huber. A robust version of the probability ratio test. Ann. Math. Stat., 36:17531758, 1965.

[21] P.J. Huber. A robust version of the probability ratio test. Ann. Math. Stat., 36:17531758, 1965.

[22] P.J. Huber. Robust Statistics. John Wiley and Sons, New York, 1981.

[23] P.J. Huber. Robust Statistics. John Wiley and Sons, New York, 1981.

[24] P.J. Huber. Robust Statistics. John Wiley and Sons, New York, 1981.

[25] H.V.Poor. An Introduction to Signal Detection and Estimation. Springer-Verlag, New York, 1988.

[26] D.H. Johnson and G.C. Orsak. Relation of signal set choice to the performance of optimal non-gaussiandetectors. IEEE Trans. Comm., 41:13191328, 1993.

[27] R.A.Tapia J.R.Thompson. Nonparametric Function Estimation, Modeling and Simulation. SIAM,Philadelphia, PA, 1990.

[28] S.A. Kassam. Signal Detection in Non-Gaussian Noise. Springer-Verlag, New York, 1988.

[29] S.A. Kassam. Signal Detection in Non-Gaussian Noise. Springer-Verlag, New York, 1988.

[30] S.A. Kassam and H.V. Poor. Robust techniques for signal processing: A survey. Proc. IEEE, 73:433481,1985.

[31] S.A. Kassam and J.B. Thomas, editors. Nonparametric Detection: Theory and Applications. Dowden,Hutchinson and Ross, Stroudsburg, PA, 1980.

[32] D.G. Luenberger. Optimization by Vector Space Methods. Wiley, 1969.

[33] F.W. Machell and C.X. Penrod. Probability density functions of ocean acoustic noise processes. In E.J.Wegman and J.G. Smith, editors, Statistical Signal Processing, pages 211221. Marcel Dekker, NewYork, 1984.

[34] D. Middleton. Statistical-physical models of electromagnetic interference. IEEE Trans. Electromag.

Compat., EMC-17:106127, 1977.

[35] A.R. Milne and J.H. Ganton. Ambient noise under arctic sea ice. J. Acoust. Soc. Am., 36:855863,1964.

[36] J. Neyman and E.S. Pearson. On the problem of the most ecient tests of statistical hypotheses. Phil.Trans. Roy. Soc. Ser. A, 231:289337, 1933.

[37] A. Papoulis. Probability, Random Variables, and Stochastic Processes. McGraw-Hill, New York, secondedition edition, 1984.

[38] E. Parzen. Stochastic Processes. Holden-Day, San Francisco, 1962.

[39] H.V. Poor. An Introduction to Signal Detection and Estimation. Springer-Verlag, New York, 1988.

[40] H.V. Poor. An Introduction to Signal Detection and Estimation. Springer-Verlag, New York, 1988.

[41] E. J. Kelly; I. S. Reed; W. L. Root. The detection of radar echoes in noise. i. J. Soc. Indust. Appl.

Math., 8:309341, June 1960.

[42] E. J. Kelly; I. S. Reed; W. L. Root. The detection of radar echoes in noise. ii. J. Soc. Indust. Appl.

Math., 8:481507, September 1960.

Available for free at Connexions <http://cnx.org/content/col11382/1.1>

Page 181: Statistical Signal Processing · 2012. 10. 29. · 1.1 oundationsF of Probability Theory: Basic De nitions 1 1.1.1 Basic De nitions The basis of probability theory is a set of events

BIBLIOGRAPHY 175

[43] A.N.Shiryayev R.S. Lipster. Statistics of Random Processes I: General Theory. Springer-Verlag, NewYork, 1977.

[44] D. L. Snyder. Random Point Processes. Wiley, New York, 1975.

[45] A.D. Spaulding. Locally optimum and suboptimum detector performance in a non-gaussian interferenceenvironment. IEEE Trans. Comm., COM-33:509517, 1985.

[46] A.D. Spaulding and D. Middleton. Optimum reception in an impulsive interference environmentparti: Coherent detection. IEEE Trans. Comm., COM-25:910923, 1977.

[47] M. Abramowitz & I. A. Stegun, editor. Handbook of Mathematical Functions. U.S. Government PrintingOce, 1968.

[48] J. W. Carlyle & J. B. Thomas. On nonparametric signal detectors. IEEE Trans. Info. Th., IT-10:146152, April 1964.

[49] J.A.Thomas T.M.Cover. Elements of Information Theory. John Wiley and Sons, Inc., 1991.

[50] J.A.Thomas T.M.Cover. Elements of Information Theory. John Wiley and Sons, Inc., 1991.

[51] H. L. van Trees. Detection, Estimation, and Modulation Theory, Part I. John Wiley & Sons, New York,1968.

[52] H.L. van Trees. Detection, Estimation, and Modulation Theory, Part I. John Wiley and Sons, NewYork, 1968.

[53] A.J. Weiss and E. Weinstein. Fundamental limitations in passive time delay estimation: I. narrow-bandsystems. IEEE Trans. Acoustics, Speech and Signal Processing, ASSP-31:472486, April 1983.

[54] R. P. Wishner. Distribution of the normalized periodogram detector. IRE Trans. Info. Th., IT-8:342349, September 1962.

[55] P.M. Woodward. Probability and Information Theory, with Applications to Radar. Pergamon Press,Oxford, second edition edition, 1964.

[56] J. Ziv and M. Zakai. Some lower bounds on signal parameter estimation. IEEE Trans. Info. Th.,IT-15:386391, May 1969.

Available for free at Connexions <http://cnx.org/content/col11382/1.1>

Page 182: Statistical Signal Processing · 2012. 10. 29. · 1.1 oundationsF of Probability Theory: Basic De nitions 1 1.1.1 Basic De nitions The basis of probability theory is a set of events

176 INDEX

Index of Keywords and Terms

Keywords are listed by the section with that keyword (page numbers are in parentheses). Keywords do not necessarily appear in the text of the page; they are merely associated with that section. Ex. apples, 1.1 (1). Terms are referenced by the page they appear on. Ex. apples, 1.

" "quasi-linear", 87

A  a Posteriori, 3.3(59); a priori, 4.2(94); a priori probabilities, 1.1(1), 1; a priori probability, 4.1(91); Ali-Silvey, 4.6(105); alphabet, 78; amplitude, 9, 131; amplitude distribution, 9; asymptotically unbiased, 3.1(55), 56; autocorrelation function, 3.9(73); axioms of probability, 1.1(1), 1

B  basis, 19; Bayes, 4.1(91), 4.2(94); Bayes', 4.2(94); Bayes' cost, 4.1(91), 91; Bayes' decision criterion, 4.1(91), 91; Bayes' Rule, 1.1(1), 2; bias, 3.1(55), 56, 3.9(73); biased, 56; binary detection problem, 116; bins, 79

C  CFAR, 138; Chernoff, 4.6(105), 105; Chernoff distance, 4.6(105); Chi-squared Probability Distribution, 3.10(78); Cholesky factorization, 129; Circular random vectors, 4; classification, 102; closed linear manifold, 16; complete set, 19; complex optimization, 2.1(47); composite hypothesis, 116; conditional expected value, 3; conditional probability, 1.1(1), 1; conditional probability density function, 3; consistency, 3.1(55); consistent, 3.1(55), 56; constant false-alarm rate, 138; constrained optimization, 2.2(49), 49; constraint function, 2.2(49), 50; continuous in the quadratic mean, 23; contour map, 50; convergence, 18; convergence rate, 8; converges, 18; convex, 2.1(47); correlation coefficient, 3; correlation function, 10; cost, 4.1(91); count, 11; covariance, 3; covariance function, 10; covariance matrix, 4; Cramer-Rao bound, 3.9(73); Cramér-Rao Bound, 64; criteria, 4.2(94); critical delay, 3.9(73), 77; cumulative, 2; cycle skipping, 3.9(73)

D  decision criterion, 4.1(91); decision region, 4.1(91), 4.2(94); decision regions, 91; Density Estimation, 3.10(78); detection, 3.9(73), 4.2(94); detection probability, 4.2(94), 4.7(110); dimension, 19; discrete-time white noise, 11; distance, 4.6(105); distance between two vectors, 18; distinctiveness, 100; distribution method, 27; dot product, 122

E  efficiency, 3.1(55); efficient, 3.1(55); efficient estimate, 56; eigenequation, 24; Entropy, 3.10(78), 79

equality constraints, 49; error probability, 4.2(94); error rate, 4.3(99), 101; Estimation, 3.1(55), 3.9(73); Estimation Error, 3.1(55), 56; Estimation Theory, 3.1(55), 55; estimator, 3.7(68); Estimators, 3.3(59), 3.4(60); event, 1.1(1), 1; excursion, 136; excursions, 136; expected value, 10; exponential rate, 107

F  false-alarm probability, 4.2(94), 4.3(99), 4.6(105), 4.7(110); field, 15; Fisher information matrix, 64

G  Gaussian Distribution, 3.10(78); Gaussian random variable, 5; Gaussian random vector, 7; Gibbs phenomenon, 22; global minimum, 2.1(47), 47; gradient, 2.1(47), 48; Gram-Schmidt procedure, 20

H  Hessian, 2.1(47), 48, 64; Hilbert Spaces, 18; histogram, 79; Histogram Estimators, 3.10(78); hypothesis testing, 4.2(94)

I  indicator functions, 80; inequality constraint function, 52; inequality constraints, 49; infinite clipper, 153; inner product, 16; inner product space, 17; intensity, 11

J  joint distribution function, 2; joint probability density function, 2

K  Karhunen-Loève, 24; kernel, 17, 127; kernel estimators, 89; Kullback-Leibler, 4.6(105), 105; Kullback-Leibler Distance, 3.10(78), 79

L  Lagrange multiplier, 51, 4.2(94); Lagrange multipliers, 2.2(49), 50; Lagrangian, 50; likelihood functions, 92; likelihood ratio, 4.1(91), 92, 4.2(94), 4.4(102), 4.7(110); likelihood ratio test, 91, 92; limit point, 171; linear, 1.11(15), 3.4(60); linear estimator, 60; linear estimators, 60; Linear independence, 19; linear minimum, 3.7(68); linear vector space, 15; log-likelihood, 92

M  marginal density functions, 3; Markov, 37; martingale, 38; matched filter, 118, 123; matched-filter, 3.7(68); maximum probability correct, 4.2(94); mean, 10; Mean Squared Error, 3.10(78); mean-square continuous, 37; mean-squared error, 3.1(55), 56; mean-squared estimation error, 56; median, 147; Mercer's Theorem, 24; method of steepest descent, 2.1(47), 48; metric, 18; minimax approach, 142; minimum error probability, 4.2(94); Minimum Mean Squared Error, 3.2(57); miss probability, 4.2(94); ML estimate, 3.9(73); MMSE, 3.3(59); MMSE estimate, 3.2(57); monotonic transformation, 4.4(102); monotonicity, 4.2(94); mutual information, 4.6(105), 105; mutually exclusive, 1.1(1), 1

N  Neyman-Pearson, 4.2(94), 4.6(105); Neyman-Pearson criteria, 4.7(110); Neyman-Pearson criterion, 4.2(94), 4.3(99); non-Gaussian, 151; Non-Gaussian Distribution, 3.10(78); non-parametric model evaluation, 146; nonparametric, 116; nonstationary, 14; norm, 1.11(15), 17; normal random variables, 5; Null hypothesis testing, 103

O  objective function, 2.1(47), 47; optimization, 2.1(47), 2.2(49), 4.2(94); optimization theory, 2.1(47); order statistics, 27; orthogonal, 17, 18; Orthogonality Principle, 60, 3.7(68); orthonormal, 19

P  parametric, 116; Parseval's Theorem, 20, 3.9(73); point processes, 11; positive-definite, 2.1(47); power spectrum, 10; probability density function, 2; probability distribution function, 2; probability measure, 1.1(1), 1; probability of error, 4.2(94), 95; projection, 117, 122

R  radar equation, 163; random, 9; random sum of random variables, 3; random variable, 2; random variable generation, 27; random vector, 4; receiver operating characteristic, 4.3(99), 99; rejection region, 156; representation theorem, 15, 19; RMS, 76; robust detectors, 149; robust signal processing, 141; ROC, 4.3(99)

S  sample function, 9; sample space, 1.1(1), 1; Schwarz, 1.11(15); Schwarz inequality, 17, 2.1(47); separable, 19; Sequential data, 4.7(110); sequential hypothesis testing, 110; Shot noise, 38; sign test, 148; signal classification, 4.4(102); signal energy, 117, 122; signal parameter estimation, 3.7(68); signal-to-noise ratio, 120, 125; slack variable, 52; space, 1.11(15); spectral height, 10; square integrable functions, 16; square-law detector, 136; stationary, 9; stationary point, 2.1(47); stationary points, 47; stationary, independent increments, 35; statistically independent, 1.1(1), 2, 3; Stein's Lemma, 4.6(105), 105; stochastic, 9; strictly convex, 47; suboptimum, 3.7(68); sufficient statistic, 4.1(91), 92, 4.2(94), 4.4(102); sufficient statistic associated with model i, 102

T  The Bispectrum, 37; threshold, 92; time of occurrence, 12; time origin, 131; time-delay, 3.9(73); time-delay estimation, 3.7(68), 3.9(73); time-domain inner product, 23; tradeoff, 4.2(94); type, 78; Type Estimators, 3.10(78); type I error, 95; type II error, 95

U  unbiased, 3.1(55), 56; unbiased estimators, 3.7(68); unconstrained, 2.1(47); uncorrelated, 3; universality constraint, 60

V vector, 1.11(15)

W  white Gaussian noise, 117; white noise, 10; white sequence, 121; whitening filter, 3.9(73), 129

Z Ziv-Zakai bound, 3.9(73)

ε ε-contamination model, 151

Attributions

Collection: Statistical Signal Processing
Edited by: Don Johnson
URL: http://cnx.org/content/col11382/1.1/
License: http://creativecommons.org/licenses/by/3.0/

Module: "Foundations of Probability Theory: Basic Definitions"
By: Don Johnson
URL: http://cnx.org/content/m11245/1.2/
Pages: 1-2
Copyright: Don Johnson
License: http://creativecommons.org/licenses/by/1.0

Module: "Random Variables and Probability Density Functions"
By: Don Johnson
URL: http://cnx.org/content/m11246/1.2/
Page: 2
Copyright: Don Johnson
License: http://creativecommons.org/licenses/by/1.0

Module: "Jointly Distributed Random Variables"
By: Don Johnson
URL: http://cnx.org/content/m11248/1.4/
Pages: 2-4
Copyright: Don Johnson
License: http://creativecommons.org/licenses/by/1.0

Module: "Random Vectors"
By: Don Johnson
URL: http://cnx.org/content/m11249/1.2/
Page: 4
Copyright: Don Johnson
License: http://creativecommons.org/licenses/by/1.0

Module: "The Gaussian Random Variable"
By: Don Johnson
URL: http://cnx.org/content/m11250/1.2/
Pages: 5-7
Copyright: Don Johnson
License: http://creativecommons.org/licenses/by/1.0

Module: "The Central Limit Theorem"
By: Don Johnson
URL: http://cnx.org/content/m11251/1.2/
Pages: 7-9
Copyright: Don Johnson
License: http://creativecommons.org/licenses/by/1.0

Module: "Basic Denitions in Stochastic Processes"By: Don JohnsonURL: http://cnx.org/content/m11252/1.2/Pages: 9-10Copyright: Don JohnsonLicense: http://creativecommons.org/licenses/by/1.0

Module: "The Gaussian Process"By: Don JohnsonURL: http://cnx.org/content/m11253/1.2/Page: 10Copyright: Don JohnsonLicense: http://creativecommons.org/licenses/by/1.0

Module: "Sampling and Random Sequences"By: Don JohnsonURL: http://cnx.org/content/m11254/1.2/Page: 11Copyright: Don JohnsonLicense: http://creativecommons.org/licenses/by/1.0

Module: "The Poisson Process"By: Don JohnsonURL: http://cnx.org/content/m11255/1.4/Pages: 11-15Copyright: Don JohnsonLicense: http://creativecommons.org/licenses/by/1.0

Module: "Linear Vector Spaces"By: Don JohnsonURL: http://cnx.org/content/m11236/1.6/Pages: 15-18Copyright: Don JohnsonLicense: http://creativecommons.org/licenses/by/1.0

Module: "Hilbert Spaces and Separable Vector Spaces"By: Don JohnsonURL: http://cnx.org/content/m11256/1.3/Pages: 18-21Copyright: Don JohnsonLicense: http://creativecommons.org/licenses/by/1.0

Module: "The Vector Space L Squared"By: Don JohnsonURL: http://cnx.org/content/m11257/1.4/Pages: 21-22Copyright: Don JohnsonLicense: http://creativecommons.org/licenses/by/1.0

Module: "A Hilbert Space for Stochastic Processes"By: Don JohnsonURL: http://cnx.org/content/m11258/1.3/Pages: 22-23Copyright: Don JohnsonLicense: http://creativecommons.org/licenses/by/1.0

Module: "Karhunen-Loeve Expansion"By: Don JohnsonURL: http://cnx.org/content/m11259/1.3/Pages: 23-25Copyright: Don JohnsonLicense: http://creativecommons.org/licenses/by/1.0

Module: "Probability and Stochastic Processes: Problems"By: Don JohnsonURL: http://cnx.org/content/m11261/1.3/Pages: 26-46Copyright: Don JohnsonLicense: http://creativecommons.org/licenses/by/1.0

Module: "Optimization Theory"By: Don JohnsonURL: http://cnx.org/content/m11240/1.4/Pages: 47-49Copyright: Don JohnsonLicense: http://creativecommons.org/licenses/by/1.0

Module: "Constrained Optimization"By: Don JohnsonURL: http://cnx.org/content/m11223/1.2/Pages: 49-53Copyright: Don JohnsonLicense: http://creativecommons.org/licenses/by/1.0

Module: "Introduction to Estimation Theory"By: Don JohnsonURL: http://cnx.org/content/m11263/1.2/Pages: 55-57Copyright: Don JohnsonLicense: http://creativecommons.org/licenses/by/1.0

Module: "Minimum Mean Squared Error Estimators"By: Don JohnsonURL: http://cnx.org/content/m11267/1.5/Pages: 57-59Copyright: Don JohnsonLicense: http://creativecommons.org/licenses/by/1.0

Module: "Maximum a Posteriori Estimators"By: Don JohnsonURL: http://cnx.org/content/m11268/1.3/Pages: 59-60Copyright: Don JohnsonLicense: http://creativecommons.org/licenses/by/1.0

Module: "Linear Estimators"By: Don JohnsonURL: http://cnx.org/content/m11276/1.5/Pages: 60-62Copyright: Don JohnsonLicense: http://creativecommons.org/licenses/by/1.0

Module: "Maximum Likelihood Estimators of Parameters"By: Don JohnsonURL: http://cnx.org/content/m11269/1.4/Pages: 62-64Copyright: Don JohnsonLicense: http://creativecommons.org/licenses/by/1.0

Module: "Cramer-Rao Bound"By: Don JohnsonURL: http://cnx.org/content/m11266/1.5/Pages: 64-68Copyright: Don JohnsonLicense: http://creativecommons.org/licenses/by/1.0

Module: "Signal Parameter Estimation"By: Don JohnsonURL: http://cnx.org/content/m11264/1.4/Pages: 68-70Copyright: Don JohnsonLicense: http://creativecommons.org/licenses/by/1.0

Module: "Maximum Likelihood Estimators of Signal Parameters"By: Don JohnsonURL: http://cnx.org/content/m11237/1.4/Pages: 71-73Copyright: Don JohnsonLicense: http://creativecommons.org/licenses/by/1.0

Module: "Time-Delay Estimation"By: Don JohnsonURL: http://cnx.org/content/m11243/1.5/Pages: 73-78Copyright: Elizabeth GregoryLicense: http://creativecommons.org/licenses/by/1.0

Module: "Probability Density Estimation"By: Don JohnsonURL: http://cnx.org/content/m11291/1.8/Pages: 78-83Copyright: Don JohnsonLicense: http://creativecommons.org/licenses/by/1.0

Module: "Estimation Theory: Problems"By: Don JohnsonURL: http://cnx.org/content/m11221/1.5/Pages: 83-90Copyright: Don JohnsonLicense: http://creativecommons.org/licenses/by/1.0

Module: "Detection Theory Basics"By: Don JohnsonURL: http://cnx.org/content/m16250/1.6/Pages: 91-94Copyright: Don JohnsonLicense: http://creativecommons.org/licenses/by/2.0/Based on: The Likelihood Ratio TestBy: Don JohnsonURL: http://cnx.org/content/m11234/1.6/

Module: "Criteria in Hypothesis Testing"By: Don JohnsonURL: http://cnx.org/content/m11228/1.4/Pages: 94-99Copyright: Don JohnsonLicense: http://creativecommons.org/licenses/by/1.0

Module: "Performance Evaluation"By: Don JohnsonURL: http://cnx.org/content/m11274/1.3/Pages: 99-101Copyright: Don JohnsonLicense: http://creativecommons.org/licenses/by/1.0

Module: "Beyond Two Models"By: Don JohnsonURL: http://cnx.org/content/m11278/1.3/Pages: 102-103Copyright: Don JohnsonLicense: http://creativecommons.org/licenses/by/1.0

Module: "Model Consistency Testing"By: Don JohnsonURL: http://cnx.org/content/m11277/1.5/Pages: 103-104Copyright: Don JohnsonLicense: http://creativecommons.org/licenses/by/1.0

Module: "Stein's Lemma"By: Don JohnsonURL: http://cnx.org/content/m11275/1.7/Pages: 105-109Copyright: Don JohnsonLicense: http://creativecommons.org/licenses/by/1.0

Module: "Sequential Hypothesis Testing"By: Don JohnsonURL: http://cnx.org/content/m11242/1.5/Pages: 110-115Copyright: Don JohnsonLicense: http://creativecommons.org/licenses/by/1.0

Module: "Detection in the Presence of Unknowns"By: Don JohnsonURL: http://cnx.org/content/m11229/1.4/Page: 116Copyright: Don JohnsonLicense: http://creativecommons.org/licenses/by/1.0

Module: "Detection of Signals in Noise"By: Don JohnsonURL: http://cnx.org/content/m16253/1.9/Pages: 116-121Copyright: Don JohnsonLicense: http://creativecommons.org/licenses/by/2.0/Based on: White Gaussian NoiseBy: Don JohnsonURL: http://cnx.org/content/m11281/1.3/

Module: "White Gaussian Noise"By: Don JohnsonURL: http://cnx.org/content/m11281/1.3/Pages: 121-127Copyright: Don JohnsonLicense: http://creativecommons.org/licenses/by/1.0

Module: "Colored Gaussian Noise"By: Don JohnsonURL: http://cnx.org/content/m11260/1.2/Pages: 127-131Copyright: Eileen KrauseLicense: http://creativecommons.org/licenses/by/1.0

Module: "Detection in the Presence of Uncertainties"By: Don JohnsonURL: http://cnx.org/content/m11262/1.2/Pages: 131-133Copyright: Eileen KrauseLicense: http://creativecommons.org/licenses/by/1.0

Module: "Unknown Signal Delay"By: Don JohnsonURL: http://cnx.org/content/m11283/1.2/Pages: 133-136Copyright: Eileen KrauseLicense: http://creativecommons.org/licenses/by/1.0

Module: "Unknown Signal Waveform"By: Don JohnsonURL: http://cnx.org/content/m11284/1.2/Pages: 136-137Copyright: Eileen KrauseLicense: http://creativecommons.org/licenses/by/1.0

Module: "Unknown Noise Parameters"By: Don JohnsonURL: http://cnx.org/content/m11285/1.2/Pages: 137-140Copyright: Eileen KrauseLicense: http://creativecommons.org/licenses/by/1.0

Module: "Partial Knowledge of Probability Distributions"By: Don JohnsonURL: http://cnx.org/content/m11293/1.4/Pages: 140-141Copyright: Don JohnsonLicense: http://creativecommons.org/licenses/by/1.0

Module: "Robust Hypothesis Testing"By: Don JohnsonURL: http://cnx.org/content/m11299/1.5/Pages: 141-146Copyright: Don JohnsonLicense: http://creativecommons.org/licenses/by/1.0

Module: "Non-Parametric Model Evaluation"By: Don JohnsonURL: http://cnx.org/content/m11304/1.4/Pages: 146-149Copyright: Don JohnsonLicense: http://creativecommons.org/licenses/by/1.0

Module: "Partially Known Signals and Noise"By: Don JohnsonURL: http://cnx.org/content/m11305/1.2/Page: 149Copyright: Don JohnsonLicense: http://creativecommons.org/licenses/by/1.0

Module: "Partially Known Signal Waveform"By: Don JohnsonURL: http://cnx.org/content/m11308/1.4/Pages: 149-150Copyright: Don JohnsonLicense: http://creativecommons.org/licenses/by/1.0

Module: "Partially Known Noise Amplitude Distribution"By: Don JohnsonURL: http://cnx.org/content/m11316/1.5/Pages: 150-151Copyright: Don JohnsonLicense: http://creativecommons.org/licenses/by/1.0

Module: "Non-Gaussian Observations"By: Don JohnsonURL: http://cnx.org/content/m11318/1.2/Pages: 151-152Copyright: Don JohnsonLicense: http://creativecommons.org/licenses/by/1.0

Module: "Small-Signal Detection"By: Don JohnsonURL: http://cnx.org/content/m11319/1.5/Pages: 152-153Copyright: Don JohnsonLicense: http://creativecommons.org/licenses/by/1.0

Module: "Robust Small-Signal Detection"By: Don JohnsonURL: http://cnx.org/content/m11324/1.4/Pages: 153-154Copyright: Don JohnsonLicense: http://creativecommons.org/licenses/by/1.0

Module: "Non-Parametric Detection"By: Don JohnsonURL: http://cnx.org/content/m11325/1.5/Pages: 154-155Copyright: Don JohnsonLicense: http://creativecommons.org/licenses/by/1.0

Module: "Type-Based Detection"By: Don JohnsonURL: http://cnx.org/content/m11329/1.3/Pages: 155-156Copyright: Don JohnsonLicense: http://creativecommons.org/licenses/by/1.0

Module: "Discrete-Time Detection: Problems"By: Don JohnsonURL: http://cnx.org/content/m11300/1.3/Pages: 157-163Copyright: Don JohnsonLicense: http://creativecommons.org/licenses/by/1.0

Module: "Ali-Silvey Distances"By: Don JohnsonURL: http://cnx.org/content/m11218/1.4/Pages: 169-170Copyright: Jerey SilvermanLicense: http://creativecommons.org/licenses/by/1.0

Statistical Signal Processing
Course notes for ELEC 531, Statistical Signal Processing, at Rice University

About Connexions
Since 1999, Connexions has been pioneering a global system where anyone can create course materials and make them fully accessible and easily reusable free of charge. We are a Web-based authoring, teaching and learning environment open to anyone interested in education, including students, teachers, professors and lifelong learners. We connect ideas and facilitate educational communities.

Connexions's modular, interactive courses are in use worldwide by universities, community colleges, K-12 schools, distance learners, and lifelong learners. Connexions materials are in many languages, including English, Spanish, Chinese, Japanese, Italian, Vietnamese, French, Portuguese, and Thai. Connexions is part of an exciting new information distribution system that allows for Print on Demand Books. Connexions has partnered with innovative on-demand publisher QOOP to accelerate the delivery of printed course materials and textbooks into classrooms worldwide at lower prices than traditional academic publishers.