

ECE 8072, Statistical Signal Processing, Fall 2010
Course Information Sheet - 8/26/10 draft

Instructor: Kevin Buckley, Tolentine 433a, 610-519-5658 (Office), 610-519-4436 (Fax), 610-519-5864 (CEER307), [email protected], www.ece.villanova.edu/user/buckley

Office Hours: M 12:30 - 1:30pm; W 4:30-6:00pm; R 2:30 - 3:30pm; F 10:30 - 11:30am; by appointment; stop by my office anytime I’m available.

Prerequisites: Undergrad level probability; undergrad level signals/systems; some linear algebra

Grading Policy:
* Regular Homeworks (some will have computer problems): due before most classes - 20%
* 1.5 hour Quiz (Oct. 6, after Part I) - 20%
* 1.5 hour Quiz (Nov. 17, after Part II) - 20%
* Three Computer Projects - 20% total
* Take Home Final - 20%

Text: Probability, Random Variables and Stochastic Processes, 4th ed., by Papoulis and Pillai, 2002.

References:
1. Peebles, Probability, Random Variables, and Random Signal Principles, 4th ed., McGraw Hill, 2001.
2. Van Trees, Detection, Estimation and Modulation Theory, Part I, Wiley, 1968.
3. Strang, Linear Algebra and Its Applications, 3rd ed., Harcourt Brace Jovanovich, 1976.

Course Description: This course covers, for beginning graduate level ECE majors, random signals and statistical signal processing. The general objectives are to provide:

1. a review of probability and random variables. This will be directed towards issues most pertinent to signal processing and communications.

2. an understanding of statistical signal representation. Representation is mainly in the discrete-time context so that a random vector observation is a principal focus. Relationships to/for continuous-time processes are developed as needed. The role of random process modeling is explored.

3. an introduction to statistical signal processing problems and methods. Basic signal processing methods for detection, parameter estimation, optimum filtering and spectrum estimation problems are introduced.

4. experience in the use of standard mathematical tools for the processing of random signals. We exemplify the use of Fourier transforms, linear algebra, probability and optimization theory.

Throughout, the relevance of course topics within the context of digital communications and other applications will be emphasized.


ECE 8072, Statistical Signal Processing, Fall 2010

Course Outline

Part I: Random Variables and Vectors

[1 ] Introduction (Lect. 1)

1.1 Basic Set Theory

1.2 Basic Probability

1.3 The Gaussian Function

[2 ] A Random Variable (Lects. 1,2)

2.1 The Definition of a Random Variable, and its Extensions

2.2 The Probability Distribution and Density Functions

2.3 Some Commonly Occurring Random Variables

2.4 Basic Detection Problems

2.5 Conditional Probability Density Functions

[3 ] Functions of a Random Variable (Lects. 2,3)

3.1 The pdf of a Function of a Random Variable

3.2 Random Number Generation

3.3 The Expectation Operator

3.4 Moments of a Random Variable

3.5 Conditional Expectation

3.6 Characteristic and Moment Generating Functions

[4 ] Multiple Random Variables (Lects. 3-5)

4.1 The Joint pdf of Multiple Random Variables

4.2 Functions of Multiple Random Variables

4.3 Moments of Multiple Random Variables

4.4 Complex-Valued Random Variables

4.5 Signal in Noise, and SNR

4.6 Conditional Expectation

4.7 Central Limit Theorem

[5 ] Random Vectors (Lects. 5,6)

5.1 Expectation & Moments

5.2 Linear Transformations

5.3 Vector Observations & the Observation Space

5.4 Diagonalization of Rx: Eigenstructure Transformation

5.5 Diagonalization of Cx and Decorrelation of the Observation

5.6 Gaussian Vector Observation and pdf Contours

5.7 Sample Estimates of Observation Mean and Correlation Matrix (and SVD)


Part II: Random Processes and Linear Time Invariant Systems

[6 ] Random Processes (Lects. 6-10)

6.1 Introduction: Definitions and Basic Concepts

6.2 Correlation Functions & Power Spectral Density

6.2.1 DT Correlation Function

6.2.2 DT Power Spectral Density

6.2.3 CT Correlation Functions & Power Spectral Density

6.2.4 Sampling Wide-Sense Stationary CT Random Processes

6.3 Note on Cyclostationary Processes

6.4 Correlation and Covariance Matrices

6.4.1 Random Vector Observation

6.4.2 Wide-Sense Stationary Random Processes

6.5 Discrete Karhunen Loeve Transformation (DKLT)

6.5.1 Orthogonal expansions of a random vector

6.5.2 DKLT

6.6 Narrowband Signals in Additive White Noise

6.6.1 Correlation Matrix Eigenstructure

6.6.2 An Example

6.7 Whitening

[7 ] Linear Time-Invariant (LTI) Systems (Lects. 10-11)

7.1 Discrete Time LTI System Review

7.2 Wide-Sense Stationary Random Processes and DT LTI Systems (mean, correlation functions, power density spectra, examples)

7.3 Wide-Sense Stationary Random Processes and CT LTI Systems

7.4 Matched filters (various cases)

7.5 Introduction to Linear Modeling of Random Processes


Part III: Estimation and Optimum Filtering

[8 ] Parameter Estimation (Lects. 11-13)

8.1 The Problem

8.2 Ad Hoc Mean and Variance Estimators

8.3 Maximum Likelihood (ML) Parameter Estimation

8.4 Cramer Rao Bounds & the Fisher Information Matrix

8.5 Overview of Bayesian Estimation

8.6 Discrete-Valued Parameter Estimation (a.k.a. Detection)

[9 ] Optimum Filtering (Lects. 13,14)

9.1 Problem Statement and Examples

9.2 Minimum Mean Squared Error Filtering (optimum filter, orthogonality principle, mean-square error surface)

9.3 Least Squares Filtering

[10 ] Overview of Spectrum Estimation (Lect. 14)

10.1 Problem Statement

10.2 Classical Spectrum Estimation (correlation function, periodogram, averaged/windowed periodogram, computation)

10.3 Autoregressive Spectrum Estimation (model, AR coefficient estimation, spectrum estimation)

10.4 Optimum Filter Based Spectrum Estimation (filter banks, the “ML” approach)

10.5 MUSIC: an Eigenstructure Approach (model, spectrum estimation)


ECE8072 – Statistical Signal Processing

Fall 2010

Villanova University

ECE Department

Prof. Kevin M. Buckley

Part 1a



Contents

1 Introduction

1.1 Basic Set Theory
1.2 Basic Probability
1.3 The Gaussian Function

2 A Random Variable

2.1 The Definition of a Random Variable, and its Extensions
2.2 The Probability Distribution and Density Functions
2.3 Some Commonly Occurring Random Variables
2.4 Basic Detection Problems
2.5 Conditional Probability Density Functions

List of Figures

1 Venn diagrams.
2 Set partitioning for representation of digital communications symbols.
3 For a continuous and a discrete valued random variable, X and Y respectively: (a) Probability Distribution Functions (PDF’s); and (b) Probability Density Functions (pdf’s).
4 The continuous-valued uniform pdf.
5 The Gaussian pdf.
6 Illustration of event conditioning.


1 Introduction

This is a first level graduate course on probability which is directed towards establishing the fundamentals that are most pertinent to signal processing and communications applications. An undergraduate level background is assumed in: probability; and signals & systems theory.

In order to establish a required baseline of knowledge, and since multiple exposures to theoretical material at different levels is an effective pedagogy, we will begin with a directed review of probability and random variables. As reviews go, this will be somewhat extensive (i.e. almost 4 weeks; Sections 1 through 5 of the Course Notes). However it will not be presented at the more extensive and rigorous level sometimes encountered in a beginning level graduate course on probability for signal processing and communications (e.g. see Chapters 1 through 7 of the Course Text by Papoulis and Pillai). In a one semester course, the more rigorous development of probability presented by Papoulis and Pillai would not be possible while additionally covering more advanced statistical signal processing topics. In this Course we intend to establish a basic understanding of statistical signal processing functions such as detection, parameter estimation and optimum filtering. Thus we will cover only selected topics from Chapters 1 through 7 of the Course Text. Throughout these Course Notes, we will use the Course Text as a reference, relating discussions in the Course Notes to Sections in the Course Text as much as possible. The interested student is encouraged to study additional material in the Course Text which is not covered in the Course Notes. Note, however, that students are only responsible for topics covered in the Notes, Homeworks and Computer Assignments.

Part I of this Course, consisting of Sections 1 through 5 of these Notes, establishes a working understanding of random variables. We will first define, establish characteristics, and consider the utility of a single random variable. We will then generalize this to multiple random variables (a.k.a. a random vector). As noted above, this corresponds to selected topics from Chapters 1 through 7 of the Course Text. Random vectors are at the center of modern signal processing and communications problems, since observations (i.e. data) are often of this form. To become proficient with signal processing and communications engineering at a masters level, it is important to study this topic in depth. Our coverage of random vectors in Section 5 of these Notes is more detailed than the coverage in the Course Text.

Part II of the Course focuses on random signals and the processing of them with Linear Time-Invariant (LTI) systems. It consists of Section 6 on Random Processes and Section 7 on LTI systems. The coverage in Section 6 is somewhat extensive (i.e. about 3 weeks), while Section 7 is more of an overview with a closer look at an important type of LTI system - the matched filter.

Part III of this Course consists of Sections on Parameter Estimation (Section 8), Optimum Filtering (Section 9) and Spectrum Estimation (Section 10). Each of these topics, being both broad and deep, could occupy an entire course. Our objective in this Course is to provide a student with: an appreciation of the considerations involved; and an overview of some of the more important techniques. Selected techniques are covered in more detail in higher level signal processing and communications courses.


The remainder of this Section is a review of basic probability. This review is directed towards only the specific topics necessary for this Course. In Subsection 1.1 we overview set theory. In Subsection 1.2 we formalize the idea of probability (that is, we define it) and list a few properties of probability. In Subsection 1.3 we define the Gaussian function. Later, in Section 2 of this Course, set theory and probability will be combined to establish the basic model of randomness that we will use throughout this Course – the random variable and its probability density function. We will use probability properties, the Gaussian function and the probability density function throughout this Course.

1.1 Basic Set Theory

This Section corresponds to Section 2.1 of the Course Text. Our interest here is to develop only what is needed to define a random variable. A random variable is the basic representation of random phenomena that we will use throughout this Course.

Set theory begins with a set of elements. Typical examples for a beginning level probability course include the set of possible outcomes of: a coin flip, a card draw, and the roll of a die. Examples of sets of interest to signal processing and communication engineers are: the symbols for a digital communication system; the possible voltages across a preamp output; the phases on a sinusoid; the possible depths of an oil well; and the latency (delay) of a heartbeat.

Let S denote a set and ζ an element in the set. Note that a set may contain elements which are: countable and finite; countable and infinite; or uncountably infinite. The set of outcomes of a coin toss, S = {ζ1, ζ2} = {heads, tails}, is countable and finite, as is the set of possible symbols for a digital communication system. The set whose elements are all positive integers, S = {ζi = i; i = 1, 2, 3, · · ·}, is countable and infinite. The set of possible voltages across an ideal preamp output, S = {−∞ < ζ < ∞}, is uncountably infinite.

Consider a general set S = {ζ}. A subset of S is any collection of its elements. These subsets range from the subset of no elements, denoted ø and called the null set, to the subset of all elements, i.e. S, which is called the universal set.


Example 1.1: Consider the set S of all possible phases of a complex number, i.e.

S = {θ : 0 ≤ θ < 2π} .

List four possible subsets. Denote these as Ai; i = 1, 2, 3, 4.

Solution:

A1 = {θ : 0 ≤ θ ≤ π} ,  A2 = {θ : π/2 ≤ θ ≤ 3π/2} ,

A3 = {θ : 0 ≤ θ < 2π} (= S) ,

A4 = ø (null set, no outcome) .

A Venn diagram is a visualization of a set of elements which is useful in developing an understanding of basic set theory concepts. Figure 1(a) is an example of a Venn diagram. In this figure, S is the entire set, and A1, A2, A3 are three subsets. As shown, A1 and A2 share some elements, while subset A3 does not share elements with either A1 or A2. Not all elements of S are covered by the three subsets shown.


Figure 1: Venn diagrams.

A partition of set S is a group of subsets that cover all elements of S while not sharing any elements. Figure 1(b) illustrates a partition of a set S into the n subsets A1, A2, · · · , An.


Example 1.2: Phase Shift Keying (PSK) is a popular digital communication modulation scheme. In this approach, a sinusoidal pulse

s(t) = A cos(ω0 t) for 0 ≤ t < T , and s(t) = 0 otherwise ,

of carrier frequency ω0 and pulse width T is imprinted with binary information by using different phases, i.e. for 8-PSK the symbols are

sm(t) = A cos(ω0 t + φm) ,  m = 1, 2, · · · , 8 ,  0 ≤ t < T ,    (1)

where φm = (2π/8)(m − 1); m = 1, 2, · · · , 8. (With 8 symbols, all combinations of 3 bits can be uniquely represented.) A constellation plot for this modulation scheme is the plot of the sinusoidal amplitudes/phases for each of the 8 symbols. In the figure below, the first row shows the 8-PSK constellation. (Each symbol is represented as a point. Since all 8 symbols have the same amplitude, all points are equispaced from the origin. The phase of each point is the symbol’s phase.)

Consider the set S of 8 symbols. The constellation plot represents the 8 elements of this set, and serves as a more useful visualization than the Venn diagram since it conveys sinusoidal symbol amplitude/phase information. In the constellation, the distances between points determine how easy it is to differentiate between symbols at the receiver.


Figure 2: Set partitioning for representation of digital communications symbols.

a) Determine a partition of S into two subsets, each with 4 elements, such that the minimum distance between elements, one from each subset, is maximized.


b) Determine a partition of S into four subsets, each with 2 elements, such that the minimum distance between elements, one from each subset, is maximized.

Solution:

The second row of the figure above shows two subsets, B1 and B2, which are the solution to problem a). The third row of the figure above shows four subsets, C1, C2, C3 and C4, which are the solution to problem b). The last row shows the individual elements of the set.

Example 1.2 is a pretty straightforward illustration of sets, subsets and partitions. The idea of identifying symbol subsets that have well separated or maximally separated elements is an important consideration in the development of high performance digital modulation/demodulation schemes. In the digital communications literature this idea is referred to as set partitioning.

As mentioned earlier, our main objective here is to develop a minimum number of ideas from set theory required to develop the concept of a random variable. In this sense, Example 1.2 is a distraction. What Example 1.2 does suggest is that, depending on the specific signal processing application under consideration, set theory may warrant further consideration.

Set Operators and Rules: For our purposes, the following are the most important set operations and rules.

• Union, denoted A1 ∪ A2 or A1 + A2: The subset of all elements in subset A1 or subset A2.

• Intersection, denoted A1 ∩ A2 or A1 A2: The subset of all elements in both subsets A1 and A2.

• Complement, denoted Ā or A^c: The subset of all elements not in subset A.

• Subset, denoted A1 ⊂ A2: Every element of subset A1 is also an element of subset A2.

• Distributivity of Intersection over Union: given subsets A, B and C,

(A ∪ B) ∩ C = (A ∩ C) ∪ (B ∩ C) (2)

• De Morgan’s theorem: given subsets A1 and A2,

(A1 ∪ A2)^c = A1^c ∩ A2^c .    (3)

The first four bullets above are (definitions of) basic operations. The last two are set algebra rules. These rules can be easily proven using Venn diagrams.

Also note that given a set S, a partition of S, as described earlier, can now be formally defined as a collection of subsets Ai; i = 1, 2, · · · , n such that:

a) ∪_{i=1}^{n} Ai = S (i.e. the Ai are inclusive; they cover all elements of S); and

b) Ai ∩ Aj = ø , i ≠ j (i.e. the Ai are disjoint or mutually exclusive).
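The set operations and partition conditions above lend themselves to a quick programmatic check. The following is a minimal sketch, not part of the original Course Notes, using Python’s built-in set type on a small hypothetical universal set (the integer labels are for illustration only).

# A small sketch of the set operations, De Morgan's theorem and the partition
# conditions above, using Python's built-in set type on a hypothetical finite set S.
S = set(range(8))                             # e.g. labels for the 8 symbols of an 8-PSK set
A1, A2 = {0, 1, 2, 3}, {2, 3, 4, 5}

print(A1 | A2)                                # union A1 U A2
print(A1 & A2)                                # intersection of A1 and A2
print(S - A1)                                 # complement of A1 within S
print(S - (A1 | A2) == (S - A1) & (S - A2))   # De Morgan's theorem: True

# Partition conditions a) and b) for {B1, B2}: the union covers S and the subsets are disjoint.
B1, B2 = {0, 2, 4, 6}, {1, 3, 5, 7}
print((B1 | B2) == S and not (B1 & B2))       # True: {B1, B2} is a partition of S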


Fields: The following terminology and notation is used on occasion in the Course Text. Though its basic use is not difficult, we will not rely on it in this Course. We introduce it so that its use in the Text will not create a problem. (See p. 22 of the Course Text.)

• Field: A collection of subsets that includes the null and universal sets, and is closed under union and intersection.

• Sigma (σ) field: A field that is closed under any countable number of unions and intersections.

• Borel field: A σ-field on the real line.

1.2 Basic Probability

This Section of the Course Notes corresponds to Chapter 1 and Sections 2.2 & 2.3 of the Course Text. Here we build on Section 1.1 so as to lay the foundation for defining a random variable, which we will do in the next Section. In this Section we also establish a number of probability rules which will be useful later in this Course.

Consider the occurrence of something random. We refer to this occurrence as a random experiment or a random trial. Examples of random experiments that are of interest in signal processing and communications include: an information bit; a digital communication transmitted symbol; the output of a receiving electromagnetic antenna or an audio microphone at a specific time; the voltage across a preamp output due to random electron motion at its input; the starting time or duration of a QRST complex in an ECG signal; or the intensity of an image pixel. Just as the list of applications of signal processing could go on-and-on, so could this list of random experiments of interest to us.

The possible outcomes of a random experiment form a set. This is where basic set theory comes in. The set of all possible outcomes is called the universal set. It is usually denoted as S or Ω. Any given outcome is an element of Ω. Let A ⊂ Ω be some subset of outcomes. In probability terms, this is called an event. Ω is the certain event, ø is the impossible event (i.e. there must be some outcome), and an outcome is called an elementary event.

We wish to associate with any event of a random experiment a measure which indicates its likelihood of occurrence. This measure should have both intuitive appeal and mathematical utility. Let P(A) denote this measure for some event A. For example, intuitively, P(Ω) = 1 and P(ø) = 0. It turns out to be useful for the set of events with which probabilities are associated to be a field. Formally, given the universal set Ω, a field F of all events, and assigned probabilities for F which we will denote as P, the combined information P = {Ω, F, P} is called the probability space of the random experiment. With the concepts of universal set Ω and field F already established via set theory, what remains is to establish a formal theory of probability, i.e. what are the assigned probabilities P?

There are various foundations on which to build a theory of probability. Some of these are discussed in Chapter 1 of the Course Text, so you should spend a few minutes reading that material. One example is the concept of relative frequency. Another, with a more powerful foundation, is axiomatic probability, for which three axioms form the basis of modern probability theory. All other probability concepts can be derived from these axioms. Below, we first list these axioms. We then begin to develop a broad theory based on them.


Axioms of Probability: Let Ω be the universal set associated with a random experiment. Let A represent any event associated with this experiment. Let P(A) denote the probability of A. That is, P(·) is a set function (i.e. a function that assigns a probability to each event A). The three axioms of probability are:

I. P (A) ≥ 0.

II. P (Ω) = 1.

III. If A1 and A2 share no outcomes (i.e. they are mutually exclusive, A1 ∩ A2 = ø), then

P (A1 ∪ A2) = P (A1) + P (A2) . (4)

Note that by the nature of a random experiment (i.e. one and only one outcome will occur), if A1 and A2 are elementary events, Axiom III will hold.

Another axiom from the Course Text, that can be derived from Axiom III by induction, is

IIIa. If, for any integer 0 < n ≤ ∞, the events Ai; i = 1, 2, · · · , n are mutually exclusive, then

P( ∪_{i=1}^{n} Ai ) = Σ_{i=1}^{n} P(Ai) .    (5)

Probability Relationships: Any of the following properties of interest in this Course can be derived directly from Axioms I, II and III.

• The null event: P(ø) = 0 .    (6)

• The complement event: P(Ā) = 1 − P(A) .    (7)

• Union and intersection:

P (A1 ∪ A2) = P (A1) + P (A2) − P (A1 ∩ A2) ≤ P (A1) + P (A2) . (8)

Equivalently,

P (A1 ∩ A2) = P (A1) + P (A2) − P (A1 ∪ A2) ≤ P (A1) + P (A2) . (9)

• Union bound:

P( ∪_{i=1}^{n} Ai ) ≤ Σ_{i=1}^{n} P(Ai) .    (10)

• Subset: given events A and B, with B ⊂ A,

P(A) = P(B) + P(A ∩ B̄) ≥ P(B) .    (11)


Probabilities for Countable and Uncountable Outcomes: As noted earlier, all possible outcomes of a random experiment form the universal set Ω. Events are defined as individual outcomes (called elementary events) or as collections of outcomes.

First consider an experiment with a countable number of elementary events, Ai; i = 1, 2, · · · , n where n ≤ ∞. These events do not include ø. (If n = ∞ then there are a countably infinite number of elementary events.) For such an experiment, we have that 0 < P(Ai) ≤ 1. That is, the elementary events have finite probability bounded by one.

Example 1.3: Consider a random experiment with countably infinite elementary events Ai = {i}; i = 0, 1, 2, · · ·. Consider P(Ai) = 0.1 (0.9)^i. Is this a valid set of probabilities?

Solution: Yes, this is a valid set of elementary event (outcome) probabilities.

• Axiom I holds since 0.1 (0.9)^i > 0 for i ≥ 0.

• Axiom III holds since the events are outcomes (i.e. they share no outcomes).

• Axiom II holds since Ω = ∪_{i=0}^{∞} Ai (i.e. the Ai are all the elementary events), so, since Axiom III and thus Axiom IIIa hold,

P(Ω) = Σ_{i=0}^{∞} P(Ai) = 0.1 Σ_{i=0}^{∞} (0.9)^i = 0.1 · 1/(1 − 0.9) = 1 .
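As a quick numerical sanity check of Example 1.3 (a sketch, not part of the original Course Notes), a truncated partial sum of the assigned probabilities can be computed directly; the probabilities are positive and the sum approaches one, consistent with Axioms I and II.

# Numerical check of Example 1.3: P(Ai) = 0.1 * (0.9)**i, i = 0, 1, 2, ...
probs = [0.1 * 0.9**i for i in range(500)]    # truncated; countably infinite in principle
print(all(p > 0 for p in probs))              # Axiom I: True
print(sum(probs))                             # approaches 1 (equals 1.0 to machine precision here)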

Generally, for an experiment with a countable number of elementary events Ai (i.e. outcomes), if P(Ai) > 0 and Σ_i P(Ai) = 1, the Probability Axioms hold.

Now consider an experiment with an uncountably infinite set of outcomes. As noted on p. 24 of the Course Text, probabilities can not be represented in terms of the elementary events. In other words, individual outcomes can not have finite (appreciable) probabilities. So we will represent probabilities in terms of nonelementary events. As we will see when we introduce the concept of a random variable, we will be interested in outcomes which are elements on the real line (i.e. experiments with outcomes that are real numbers). In this case, let x represent values on the real line. We will define probabilities in terms of events {x1 < x ≤ x2}, where x2 > x1 are real numbers. Using the notation on p. 25 of the Course Text, we will use a nonnegative function α(x) to represent probabilities as follows:

P(x1 < x ≤ x2) = ∫_{x1}^{x2} α(x) dx .    (12)

Note that, since α(x) > 0, Axiom I holds. Since in general Ω = {−∞ < x ≤ ∞}, if we require

∫_{−∞}^{∞} α(x) dx = 1 ,    (13)

then Axiom II holds. So using a functional mapping of probability to events can be effective, especially for random experiments with uncountably infinite outcomes. Moving forward, this is the basic approach to defining a random variable.


Conditional Probabilities and Statistical Independence: Consider a random experiment with events A and M and their respective probabilities P(A) and P(M). These probabilities convey our understanding of the relative frequency of the outcomes, independent of the running of the experiment. We will find it very useful to be able to refine this understanding (i.e. these probabilities) given additional observed information. Consider, for example, the transmission of a digital communication symbol. Let A represent a particular symbol, and P(A) the understood probability of that symbol being transmitted. This can be useful information in determining which symbol is actually transmitted, especially if P(A) ≈ 0 or P(A) ≈ 1. More importantly, say M represents a particular value of receiver data for this experiment. We need to be able to use the fact that we received M to refine our understanding of P(A). For example, does M make A more or less likely than previously understood? Conditional probability effectively addresses this need.

Conditional probability of event A, given that event M has occurred, is defined as

P(A/M) = P(A ∩ M) / P(M) ,    (14)

where P(A/M) reads – the probability of A given M. So, in general, knowing M changes our understanding of the likelihood of A. The obvious question is: does this defining equation of conditional probability make sense? That is, is it useful? Consider the following:

• It can be shown that P(A/M) is a probability. That is, as shown on pp. 28-29 of the Course Text, it adheres to the three Axioms of Probability.

• If A ∩ M = ø, i.e. if A and M share no outcomes, then P(A ∩ M) = 0 so P(A/M) = 0. This makes sense since if M occurs, i.e. if the actual outcome is an element of M, then that actual outcome can not be in A.

• Consider A = M. Then A ∩ M = M, P(A ∩ M) = P(M), and P(A/M) = 1. Occurrence of M guarantees that A has occurred.

So conditional probability P(A/M) provides a useful indication of the probability of A given M.

We define statistical independence as follows: events A and M are statistically independent if

P (A/M) = P (A) . (15)

So statistical independence means that given that event M has occurred, our understanding of the probability of A is unaltered. In general, from Eq. (14),

P (A ∩ M) = P (A/M) · P (M) . (16)

Combining Eqs. (14) and (15), we have that for statistically independent events A and M,

P(A ∩ M) = P(A) · P(M) .    (17)


Total Probability and Bayes’ Theorem: Let Ai; i = 1, 2, · · · , n be a partition of Ω, i.e.

∪_{i=1}^{n} Ai = Ω ;  Ai ∩ Aj = ø , i ≠ j .    (18)

Then, for any event B,

P(B) = Σ_{i=1}^{n} P(B ∩ Ai) = Σ_{i=1}^{n} P(B/Ai) P(Ai) .    (19)

This is called the total probability (of B in terms of its components as partitioned by the Ai).

Consider events B and Ai. By the conditional probability equation,

P (B ∩ Ai) = P (B/Ai) P (Ai) = P (Ai/B) P (B) . (20)

From this, we have Bayes’ theorem:

P(Ai/B) = P(B/Ai) P(Ai) / P(B) .    (21)

In terms of the total probability of B with respect to the partition Ai; i = 1, 2, · · · , n, Bayes’ theorem is

P(Ai/B) = P(B/Ai) P(Ai) / [ Σ_{j=1}^{n} P(B/Aj) P(Aj) ] .    (22)
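The following is a small, self-contained numerical illustration of total probability, Eq (19), and Bayes’ theorem, Eq (22); it is not from the Course Text, and the prior and conditional probabilities used are made-up illustration values for a hypothetical binary symbol A1 = “1”, A2 = “0” and a received event B (e.g. “detector output above threshold”).

# Total probability and Bayes' theorem for a hypothetical two-event partition.
p_A = {"1": 0.6, "0": 0.4}                    # prior symbol probabilities (a partition of Omega)
p_B_given_A = {"1": 0.9, "0": 0.2}            # assumed conditional probabilities P(B/Ai)

# Total probability, Eq (19): P(B) = sum_i P(B/Ai) P(Ai).
p_B = sum(p_B_given_A[a] * p_A[a] for a in p_A)

# Bayes' theorem, Eq (22): P(Ai/B) = P(B/Ai) P(Ai) / P(B).
posterior = {a: p_B_given_A[a] * p_A[a] / p_B for a in p_A}
print(p_B)                                    # 0.62
print(posterior)                              # {'1': ~0.871, '0': ~0.129}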

Chapter 3 of the Course Text: The purpose of this Chapter of the Course Text appears to be twofold: to use some basic probability problems to reinforce and modestly expand the concepts covered to this point; and to develop some specific results (e.g. Bernoulli trials) that are useful for engineering applications such as reliability and computer networking. We will skip the topics of this Chapter, since they are not needed for the primary topics of this Course, and move directly to a discussion on random variables and to Chapter 4 of the Course Text.


1.3 The Gaussian Function

In the previous Section we suggested that for random experiments with uncountably infinite outcomes on the real line, a functional mapping α(x) of probabilities to the elementary events (i.e. outcomes) can be an effective way to represent probabilities. Furthermore, this function should be nonnegative with a total area of one. As formalized in the next Section of these Course Notes, the probabilities associated with random phenomena occurring in nature are frequently characterized by scaled/shifted versions of the following Gaussian or normal function:

α(x) = φ(x) = (1/√(2π)) e^(−x²/2) ,    (23)

or equivalently

G(x) = ∫_{−∞}^{x} φ(y) dy .    (24)

If x represents a numerical outcome of a random experiment, then the probability that x will be in a certain range of values will often be obtained by integrating φ(x) over a related range. Thus, we often find the need to integrate the Gaussian function φ(x). Unfortunately, there is no analytical functional expression for this integral. We must resort to numerical integration approaches. Note that G(−∞) = 0, and it can be shown that G(0) = 0.5 and G(∞) = 1.0.

Extensive tables exist for several functions related to G(x) and the integral over φ(x). For example, tables of the Q-function

Q(x) = ∫_{x}^{∞} φ(y) dy = (1/√(2π)) ∫_{x}^{∞} e^(−y²/2) dy ;  x ≥ 0    (25)

are often found in basic books on probability and communications (see Table 1, generated using Matlab).

Noting that, for x < 0,  Q(x) = 1 − Q(−x) ,    (26)

for any real-valued a and b, such that b > a, we have that

∫_{a}^{b} φ(x) dx = Q(a) − Q(b) .    (27)

Another function used for evaluating Gaussian integrals is the error function¹

erf(x) = (2/√π) ∫_{0}^{x} e^(−y²) dy ;  x > 0 .    (28)

Noting that, for x < 0,  erf(x) = − erf(−x) ,    (29)

for any real-valued a and b, such that b > a, we have that

∫_{a}^{b} φ(x) dx = (1/2) erf(b/√2) − (1/2) erf(a/√2) .    (30)

Matlab has a built-in function “erf(x)” which can be used for −∞ < x < ∞. Table 4.1, p. 106 of the Course Text provides samples of erf(x).

¹ This function is defined slightly differently in the Course Text. The definition we use here is more standard. For example, it is the form used for the Matlab erf function.
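In practice the Q-function is usually evaluated numerically rather than from tables. A minimal sketch follows (it assumes Python with NumPy/SciPy rather than Matlab, and is not part of the original Course Notes), using the identity Q(x) = (1/2) erfc(x/√2), where erfc(x) = 1 − erf(x).

# Numerical evaluation of the Q-function and of Gaussian interval probabilities.
import numpy as np
from scipy.special import erfc

def qfunc(x):
    # Gaussian tail probability Q(x) = integral of phi(y) from x to infinity.
    return 0.5 * erfc(x / np.sqrt(2.0))

print(qfunc(1.0))                 # approx 0.158655, matching the x = 1.0 entry of Table 1
a, b = 0.0, 1.0
print(qfunc(a) - qfunc(b))        # P(a < x <= b) for a standard Gaussian, as in Eq (27)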


Table 1: Q-function Table

x Q(x) x Q(x)

0.0000000e+000 5.0000000e-001 2.7000000e+000 3.4669738e-003

1.0000000e-001 4.6017216e-001 2.8000000e+000 2.5551303e-003

2.0000000e-001 4.2074029e-001 2.9000000e+000 1.8658133e-003

3.0000000e-001 3.8208858e-001 3.0000000e+000 1.3498980e-003

4.0000000e-001 3.4457826e-001 3.1000000e+000 9.6760321e-004

5.0000000e-001 3.0853754e-001 3.2000000e+000 6.8713794e-004

6.0000000e-001 2.7425312e-001 3.3000000e+000 4.8342414e-004

7.0000000e-001 2.4196365e-001 3.4000000e+000 3.3692927e-004

8.0000000e-001 2.1185540e-001 3.5000000e+000 2.3262908e-004

9.0000000e-001 1.8406013e-001 3.6000000e+000 1.5910859e-004

1.0000000e+000 1.5865525e-001 3.7000000e+000 1.0779973e-004

1.1000000e+000 1.3566606e-001 3.8000000e+000 7.2348044e-005

1.2000000e+000 1.1506967e-001 3.9000000e+000 4.8096344e-005

1.3000000e+000 9.6800485e-002 4.0000000e+000 3.1671242e-005

1.4000000e+000 8.0756659e-002 4.5000000e+000 3.3976731e-006

1.5000000e+000 6.6807201e-002 5.0000000e+000 2.8665157e-007

1.6000000e+000 5.4799292e-002 5.5000000e+000 1.8989562e-008

1.7000000e+000 4.4565463e-002 6.0000000e+000 9.8658765e-010

1.8000000e+000 3.5930319e-002 6.5000000e+000 4.0160006e-011

1.9000000e+000 2.8716560e-002 7.0000000e+000 1.2798125e-012

2.0000000e+000 2.2750132e-002 7.5000000e+000 3.1908917e-014

2.1000000e+000 1.7864421e-002 8.0000000e+000 6.2209606e-016

2.2000000e+000 1.3903448e-002 8.5000000e+000 9.4795348e-018

2.3000000e+000 1.0724110e-002 9.0000000e+000 1.1285884e-019

2.4000000e+000 8.1975359e-003 9.5000000e+000 1.0494515e-021

2.5000000e+000 6.2096653e-003 1.0000000e+001 7.6198530e-024

2.6000000e+000 4.6611880e-003 1.0500000e+001 4.3190063e-026


2 A Random Variable

This Chapter of the Course Notes corresponds to Chapter 4 of the Course Text. In this Chapter we use set theory and probability to introduce the concept of a random variable. A random variable is a mathematical formulation used in science and engineering to represent the probabilities associated with random experiments. We will see that the probability density function (pdf) of a random variable, which encompasses the probabilistic information of the outcomes represented by the random variable, is our primary mathematical representation of a random experiment. In this Chapter we consider a single real-valued random variable and its pdf. Later, in Chapter 5 of these Notes, we will generalize to multiple and complex-valued random variables.

2.1 The Definition of a Random Variable, and its Extensions

Formally, a random variable is defined as a mapping from the outcomes (i.e. the elementary events) of a random experiment to the real line. An example would be the mapping of the tails and heads outcomes of a tossed coin to the numbers “0” and “1”. Another example is the mapping of the M symbols of a digital communication modulation scheme to the numbers 0, 1, · · · , M − 1.

For engineers, a random experiment of interest is often a sensor measurement from a random system. The sensor output is already numeric, so the mapping is trivial. As a result, as a practical matter we often think of the outcomes for a numerical random experiment to be directly the values of a random variable, and we think of a random variable as the measurement (and forget about the idea of a mapping). Even with the digital communication symbol example just mentioned, it is natural to think of the symbols themselves as the integers 0, 1, · · · , M − 1. In any event, a random variable is characterized by its possible numerical values and a probabilistic function (e.g. the pdf) which describes the probabilities of these possible values.

A random variable will be discrete-valued (if the random experiment it represents has a countable number of outcomes), continuous-valued (if the experiment has uncountable outcomes), or mixed. We will see that the structure of the pdf reflects this.

As engineers, we are often interested in the joint outcomes of several random experiments. For example, we may be interested in the outputs of one or more sensors at more than one time, or we may be interested in a sequence of transmitted symbols. In general, the probabilities associated with these joint experiments are related, and it may be that this relationship is of primary importance. For example, if one random variable represents a transmitted symbol and a second represents a received data point which is to be used to decide which symbol was transmitted, the understanding of the relationship between the two random variables is critical. So we need a joint representation that preserves probabilistic relationships. The joint pdf of multiple random variables, which will serve this purpose, will be introduced in Chapter 5. Complex-valued random data is often encountered in signal processing and communications applications. For example, the outputs of an FFT or a quadrature receiver are complex-valued. In Chapter 5 we will see that when such data is random, we can use joint pdf’s to represent them.


2.2 The Probability Distribution and Density Functions

Let x denote a random variable which generally takes on values x over the range −∞ ≤ x ≤ ∞. Its probability distribution function (PDF) is defined as²

Fx(x) = P (x ≤ x) . (1)

That is, for a given value x, the PDF Fx(x) of the random variable x is the probability that the random variable x will be less than or equal to the value x. The PDF is a function of values x. The PDF has the following properties:

1. Fx(−∞) = 0. This is equivalent to P (ø) = 0.

2. Fx(∞) = 1. This is equivalent to P (Ω) = 1.

3. Fx(x) is a nondecreasing function of x. As x increases, the range of elementary events represented by {x ≤ x} increases, so the probability Fx(x) = P(x ≤ x) can not decrease.

4. P(a < x ≤ b) = Fx(b) − Fx(a). Since we are often interested in the probability that a random variable takes on values within a certain range (i.e. the probability of a certain subset of elementary events), this property is very useful.

These properties are the subset of those listed on pp. 78-79 of the Course Text which are the most important and pertinent to our needs.

Figure 3(a) illustrates two basic types of PDF’s. The first, for random variable y, is piecewise constant with several instantaneous jumps. Probability increases only at certain points in y, suggesting that probability exists only at these points. So only a discrete set of values of the random variable are possible, implying that the underlying random experiment has a countable (finite or infinite) number of outcomes. We call this type of random variable a discrete-valued or discrete-type random variable. The second, for a random variable x, is continuously increasing with increasing values of x but has no discontinuities. As x increases, probability steadily increases, suggesting that in some sense probability exists for all values of x. This implies that the underlying random experiment has an uncountably infinite number of outcomes. We call this type of random variable a continuous-valued or continuous-type random variable. Note that for both continuous-valued and discrete-valued random variables properties 1.-4. of the PDF hold.

² Note that we follow the Course Text notation. To denote a random variable, i.e. to represent a random experiment, we use bold face notation (e.g. x). We use lower case (e.g. x) to represent values that the random variable takes on. The PDF is a function of x, the values the random variable can take on.



Figure 3: For a continuous and a discrete valued random variable, X and Y respectively: (a) Probability Distribution Functions (PDF’s); and (b) Probability Density Functions (pdf’s).

The probability density function (pdf) of a random variable x is defined as

fx(x) = (d/dx) Fx(x) .    (2)

The pdf has the following properties:

1. fx(−∞) = fx(∞) = 0.

2. fx(x) ≥ 0 ; ∀ x.

3. ∫_{−∞}^{∞} fx(x) dx = 1.

4. P(a < x ≤ b) = ∫_{a}^{b} fx(x) dx,

which follow directly from the PDF properties. Property 4. of the pdf is the reason that fx(x) is referred to as a probability density function - probabilities are computed by integrating over it (the area under the curve is the probability). Note, from calculus or by property 4., that

Fx(x) = ∫_{−∞}^{x} fx(y) dy .    (3)

Figure 3(b) illustrates pdf’s for discrete-valued and continuous-valued random variables. Since probability is calculated by integrating the pdf, the impulses in the pdf of discrete-valued random variable y indicate that for a discrete-valued random variable only certain discrete values of y have non-zero probability³. By contrast, for continuous-valued random variable x, the probability of any specific value x is zero (without impulses, there is no area

³ For discrete-valued random variables, a probability mass function (PMF) is sometimes used instead of a pdf to represent probabilities. A PMF is simply a plot of P(yi) vs. yi.


under a point). Note that the pdf clearly shows the possible values that the random variable can take on. We refer to the range of possible values as the region-of-support (ROS) of the random variable. In the Figure 3 illustrations, the ROS for continuous-valued x is all values of x, while for discrete-valued y the ROS is only a few values of y.

2.3 Some Commonly Occurring Random Variables

A random variable is characterized by the functional form of its pdf. Depending on the application, different pdf functions tend to recur, and thus certain pdf’s tend to be important for a given application. It is thus important to become familiar with some commonly occurring pdf’s. A number of important pdf’s are introduced in Section 4.3 of the Course Text. So look that Section over, along with the pdf lists on the inside front and back covers of the Text⁴. Throughout this Course, beginning directly below, we consider Examples of important discrete-valued and continuous-valued random variables.

Example 2.1 – the continuous-valued uniform random variable: Consider the random variable x, described on p. 90 of the Course Text, with pdf

fx(x) = 1/(b − a)  for a < x ≤ b ;  fx(x) = 0 otherwise ,

with b > a.


Figure 4: The continuous-valued uniform pdf.

Let a < x1 < x2 < b. Determine an expression for P (x1 ≤ x < x2).

Solution: Since P(x1 ≤ x < x2) is the area under the pdf over the range x1 ≤ x < x2, by inspection

P(x1 ≤ x < x2) = (x2 − x1)/(b − a) .

A uniform random variable is used to model the quantization of sampled data.

⁴ For a number of important random variable types, the inside front cover of the Course Text is a list of pdf’s along with certain of their characteristics which we will consider in Chapter 4 of this Course. The inside back cover of the Text is a list of functional relationships between different types of random variables. This list in part indicates how certain pdf’s are encountered in signal processing applications as functions of other random variables. That is what signal processing is – generating functions of data that is typically random. We will consider functions of random variables in Chapter 3 of this Course.


Example 2.2 – the Gaussian (a.k.a. normal) random variable: Consider the random variable x, introduced on p. 84 of the Course Text, with pdf

fx(x) = N(µ, σ²) = (1/√(2πσ²)) e^(−(x−µ)²/2σ²) ,  σ² > 0 .

Note that x is a continuous-valued random variable. The notation fx(x) = N(µ, σ²) is standard for a Gaussian pdf. For reasons we will see later, µ and σ² are termed, respectively, the mean and variance of x. Assume x1 < x2. Determine the expression for P(x1 ≤ x ≤ x2).


Figure 5: The Gaussian pdf.

Solution: Using the Q-function introduced in Section 1.3 of these Notes, specifically applying changes-of-variables as needed to Eq (25) of Chapter 1, we have

P(x1 < x ≤ x2) = ∫_{x1}^{x2} (1/√(2πσ²)) e^(−(x−µ)²/2σ²) dx = Q((x1 − µ)/σ) − Q((x2 − µ)/σ) .

This is as far as we can go without specific values of µ, σ², x1 and x2.

A Gaussian is the most common type of random variable occurring in nature and engineering systems. We will see why later in this Course.
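A Monte Carlo cross-check of the Example 2.2 result is sketched below (not part of the original Course Notes); the values µ = 1, σ² = 4, x1 = 0 and x2 = 3 are assumed here purely for illustration.

# Compare P(x1 < x <= x2) = Q((x1 - mu)/sigma) - Q((x2 - mu)/sigma) with simulation.
import numpy as np
from scipy.special import erfc

def qfunc(x):
    return 0.5 * erfc(x / np.sqrt(2.0))

mu, sigma, x1, x2 = 1.0, 2.0, 0.0, 3.0
analytic = qfunc((x1 - mu) / sigma) - qfunc((x2 - mu) / sigma)

rng = np.random.default_rng(0)
samples = mu + sigma * rng.standard_normal(1_000_000)
empirical = np.mean((samples > x1) & (samples <= x2))
print(analytic, empirical)        # both approximately 0.533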

Example 2.3: Let v be Gaussian with pdf fv(v) = N(−2, 3). Determine P(0 ≤ v ≤ ∞).

Solution:


Example 2.4 – the Laplacian (two-sided continuous-valued exponential) random variable: Consider the random variable x, introduced on p. 92 of the Course Text, with pdf

fx(x) = c e^(−α|x|) , ∀x ;  c, α > 0 .

Determine conditions on c and α so that fx(x) is a valid pdf. Determine, as a function of the variable x, P(|x| < x).

Solution:


Example 2.5 – the Rayleigh random variable: Consider two random variables xr and xi. Say that each is Gaussian with “zero-mean” (i.e. µ = 0) and equal “variance” (σ²). Now consider

x = √(xr² + xi²) .

It will be shown later, in Section 3 of this Course, that, under a certain assumption concerning the relationship between xr and xi, the pdf of x is of the form⁵

fx(x) = (x/σ²) e^(−x²/2σ²)  for x ≥ 0 ;  fx(x) = 0 for x < 0 ;  σ² > 0 .

This is the pdf of a Rayleigh random variable (see p. 90 and the 8th entry of the table on the inside cover of the Course Text). Let σ = 1/4. Determine the probability that

xr² + xi² > 1 .

Solution:

⁵ See line 3 of the “Interrelationship among Random Variables” table at the back of the Course Text.


Example 2.6 – the continuous-valued exponential random variable: Consider the exponential random variable x, introduced on p. 85 of the Course Text, with pdf

fx(x) = λ e^(−λx) u(x) ,  λ > 0 ,

where u(x) = 1 for x ≥ 0 and u(x) = 0 for x < 0 is the step function. Consider a new random variable y = √x. Use the table inside the back cover of the Course Text (and the substitution 2σ² = 1/λ) along with the table on the inside front cover of the Course Text to determine fy(y).

Solution: Although we will not learn how to derive the pdf of a function of a random variable until Chapter 3 of this Course, this problem is set up to effectively do this using available tables.

Using the “Interrelationship among Random Variables” table in the back cover of the Course Text, and specifically

x ∼ exponential(λ)  −→  √x ∼ Rayleigh ,

and making the substitution 2σ² = 1/λ for the Rayleigh pdf in the “Random Variable pdf Table” on the inside front cover of the Text, we have that

fy(y) = 2λ y e^(−λy²)  for y ≥ 0 ;  fy(y) = 0 for y < 0 .
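A quick Monte Carlo check of this table-based result is sketched below (not part of the original Course Notes); the value of λ is an arbitrary illustration choice.

# If x ~ exponential(lambda), then y = sqrt(x) should have pdf 2*lam*y*exp(-lam*y^2), y >= 0.
import numpy as np

rng = np.random.default_rng(0)
lam = 2.0                                     # assumed illustrative value of lambda
y = np.sqrt(rng.exponential(scale=1.0 / lam, size=200_000))

edges = np.linspace(0.0, 2.0, 41)
hist, _ = np.histogram(y, bins=edges, density=True)
centers = 0.5 * (edges[:-1] + edges[1:])
claimed = 2.0 * lam * centers * np.exp(-lam * centers**2)
print(np.max(np.abs(hist - claimed)))         # small (sampling/binning error only)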

Example 2.7 – the Chi-Squared Random Variable with n degrees of freedom: Consider the random variable x, introduced on p. 89 of the Course Text, with pdf

fx(x) = (1/(2^(n/2) Γ(n/2))) x^((n/2)−1) e^(−x/2) u(x) ,  n = 1, 2, 3, 4, · · ·

where the Gamma function Γ(·) is described on p. 87 of the Course Text. Determine specific expressions for the pdf’s of the Chi-Squared Random Variables with 2 & 4 degrees of freedom. Do these specific pdf’s correspond to any other pdf’s in the table on the inside front cover of the Course Text?

Solution: For n = 2, noting that Γ(1) = 1, we have

fx(x) = (1/2) e^(−x/2) u(x) .

This is the exponential pdf, from Example 2.6, with parameter λ = 1/2. Note that it is also the Gamma pdf listed in the Text’s Random Variable pdf Table (with parameters α = 1 and β = 2). Also from this Table, it is the Erlang pdf (with parameters k = 1 and λ = 1/2). (Note that for the Erlang entry in this table there is a u(x) missing on the pdf function.) It is also the Weibull pdf from this Table (with parameters β = 1 and α = 1/2).


For n = 4, noting that Γ(2) = 1, we have

fx(x) = (1/4) x e^(−x/2) u(x) .

In the Text’s Random Variable pdf Table, this is the Gamma pdf (with parameters α = 2 and β = 2) and the Erlang pdf (with parameters k = 2 and λ = 1/2).
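These identifications can be cross-checked numerically; the sketch below (not part of the original Course Notes) uses SciPy’s standard parameterizations, in which the exponential pdf with λ = 1/2 corresponds to scale = 2 and the Gamma pdf G(α, β) to shape a = α, scale = β.

# Chi-squared with 2 degrees of freedom vs. exponential(lambda = 1/2), and
# chi-squared with 4 degrees of freedom vs. Gamma(alpha = 2, beta = 2).
import numpy as np
from scipy import stats

x = np.linspace(0.01, 10.0, 50)
print(np.allclose(stats.chi2(df=2).pdf(x), stats.expon(scale=2.0).pdf(x)))         # True
print(np.allclose(stats.chi2(df=4).pdf(x), stats.gamma(a=2.0, scale=2.0).pdf(x)))  # True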

Example 2.8 – the Gamma Random Variable: Consider the random variable x, introduced on p. 87 of the Course Text, with pdf

fx(x) = G(α, β) = (1/(Γ(α) β^α)) x^(α−1) e^(−x/β) ,  α > 0 ,  β > 0 .

The G(α, β) notation for a Gamma pdf, established on p. 87 of the Course Text, suggests that a Gamma pdf is parameterized by the positive constants α and β. Using the “Interrelationship among Random Variables” table at the back of the Course Text, what can you say about the pdf of the sum of two “independent” Gamma random variables? (The concept of “independent” random variables has not yet been established, since we have yet to address the issue of multiple random variables. However, at this point we can use the table to solve this problem, since we state “independence” as a condition in the problem statement.) What can you say about the sum of multiple Gamma random variables?

Solution: Let x and y be two independent Gamma random variables, with respective pdf’s G(α, β) and G(α0, β). (Note that the random variables must share the same β parameter.) Then, by this table, z = x + y is Gamma distributed with pdf G(α + α0, β).

Concerning the sum of Gamma random variables xi; i = 1, 2, · · · , n, assume:

1) they all share the same β parameter and their individual α parameters are denoted as αi; i = 1, 2, · · · , n; and

2) they are all “independent” of one another, so that the sum of some of them will be “independent” of the rest.

Then, by induction, z = Σ_{i=1}^{n} xi will be Gamma with pdf G(Σ_{i=1}^{n} αi, β).
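A Monte Carlo sanity check of this sum property is sketched below (not part of the original Course Notes); the shape and scale values are arbitrary illustration choices, and the check uses the facts that a G(α, β) variable has mean αβ and variance αβ².

# Sum of independent Gammas with common scale beta: shapes add.
import numpy as np

rng = np.random.default_rng(0)
alpha1, alpha2, beta = 1.5, 2.5, 2.0          # assumed illustrative parameters
z = rng.gamma(alpha1, beta, 500_000) + rng.gamma(alpha2, beta, 500_000)

alpha_sum = alpha1 + alpha2
print(z.mean(), alpha_sum * beta)             # both approximately 8.0
print(z.var(), alpha_sum * beta**2)           # both approximately 16.0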

There are many other examples similar to Examples 2.6 through 2.8 above that we could derive from the information in Section 4.4 and the Tables on the front/back covers of the Course Text. That is, we could further explore interrelationships between pdf’s of the different types of random variables, even though we do not yet know how to derive these interrelationships. As in Examples 2.1 through 2.5, we could also spend more time deriving probabilities from pdf’s. Concerning interrelationships, as noted earlier we will treat this topic more formally in Chapters 3 and 5 of this Course. Concerning deriving probabilities from pdf’s, this is essentially an integration problem. Often, for realistic pdf’s such as those listed on the inside front cover of the Course Text, this can be quite a challenge. An extensive table of integrals can be very helpful, as can the motivation of getting paid maybe $70.00/hour to do this. I’ll strive to keep the integrals reasonable in this Course.


A number of pdf’s for discrete-valued random variables are also introduced in Section 4.4 and the inside front cover table of the Course Text. Several of these are useful in signal processing and communication problems, including the simple Bernoulli pdf that can be used to represent the probabilities of a random bit, and the Poisson pdf which is an accurate model of the output voltage of a photodetector for low intensity optical communications systems. The discrete-valued uniform and exponential pdf’s are important because they are easy to work with, and thus make for good examples.

Example 2.9: Consider the discrete-valued exponential random variable x with pdf

fx(x) = (1 − a) Σ_{k=0}^{∞} a^k δ(x − k) ,  0 < a < 1 .

As noted earlier in this Course, we use a pdf to describe the probabilities of discrete values of discrete random variables. Thus we use impulses so that specific random variable values can have nonzero probabilities. This is as opposed to the alternative probability mass function (PMF) description of probabilities of a discrete random variable (see the earlier footnote on the PMF in Section 2.2 of these Notes).

Show that fx(x) is a valid pdf. For a = 0.5, determine P (x ≤ 3).

Solution:


Example 2.10: Consider a mixed discrete/continuous-valued random variable y with pdf

fy(y) = (1/2) δ(y) + (1/2) e^(−y) u(y) .    (4)

Determine P (|y| < 2).

Solution:

2.4 Basic Detection Problems

Signal detection is a common problem in digital communications, Radar, Sonar, biomedical signal processing, astrophysical/geophysical exploration and machine diagnostics, to name just a few of its many applications. In this Course, we treat the detection problem from time to time to illustrate other topics in random variables and processes. Here, with two examples, we introduce the basic detection problem and illustrate interesting and important applications of basic random variable concepts.


Example 2.11 – digital communications: Consider a Gaussian binary communications channel. Specifically, consider a “received” real-valued Gaussian random variable x which, given that a “1” bit has been transmitted, has pdf

fx(x/1) = (1/√(2π·4)) e^(−(x−1)²/(2·4)) .

That is, if a bit “1” is sent, x is Gaussian with mean µ = 1 and variance σ² = 4. Given that a “0” bit has been transmitted, x has pdf

fx(x/0) = (1/√(2π·4)) e^(−(x+1)²/(2·4)) ,

i.e. the same variance but with a mean µ = −1. Here, the receiver detection problem is to decide, given an observation x, whether a “1” or a “0” bit was sent. This can be done by assigning a threshold T which is used as follows:

x ≷ T .

That is, if x > T, a “1” is received. If x < T, a “0” is received.

Let T = 0. Determine P(1/1), the probability of receiving a “1” given that a “1” was transmitted. Determine P(0/1). Assuming the probabilities that a “1” and a “0” are transmitted are equal, determine the probability P(e) that an error is made in receiving a bit.

Solution:
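The worked solution is left for class. The following is only a Monte Carlo sketch (not part of the original Course Notes) of how, under the stated channel model (means ±1, variance 4, threshold T = 0, equally likely bits), a computed error probability P(e) could be checked by simulation.

# Simulate the binary Gaussian channel and the threshold detector with T = 0.
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000
bits = rng.integers(0, 2, n)                  # equally likely "0"/"1" bits
means = np.where(bits == 1, 1.0, -1.0)        # mean +1 for "1", -1 for "0"
x = means + 2.0 * rng.standard_normal(n)      # additive Gaussian noise, sigma = 2

decisions = (x > 0.0).astype(int)             # decide "1" if x > T = 0, else "0"
print(np.mean(decisions != bits))             # empirical P(e); compare with the Q-function result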


Example 2.12 – surveillance: Consider the signal detection problem for which we receive a scalar, real-valued random variable x under the following two alternative hypotheses (cases):

H0 : x = n (noise only) ;
H1 : x = S + n (signal plus noise) .

Let S be a positive number (related to the energy of the received signal). Assumen is zero-mean Gaussian with variance σ2 = 4. The surveillance problem is todecide from an observation x either hypothesis H0 or hypothesis H1. Again thisis accomplished by comparing the observed x with a threshold T : i.e.

x>H1

<H0T .

Define the false alarm probability as Pfa = P (H1/H0), the probability of decid-ing H1 given H0. Similarly, Pd = P (H1/H1) is the probability of detection andPm = P (H0/H1) is the miss probability. P (H0/H0), the quiescent situation,means all’s actually quiet on the Western front.

a) Determine T so that Pfa = 0.01.

b) Given T from a), select S so that Pd = 0.09.

Solution:
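A numerical aid (a sketch, assuming the noise-only statistic is N(0, 4) as stated): both parts can be solved with the inverse Gaussian tail function. Part b) asks for Pd = 0.09 as printed; the code simply solves for whatever targets are supplied.

    # Sketch for Example 2.12: pick T for a target Pfa, then S for a target Pd.
    from scipy.stats import norm

    sigma = 2.0            # noise standard deviation (variance 4)
    Pfa_target = 0.01
    Pd_target = 0.09       # value as printed in the problem statement

    # Pfa = P(x > T | H0), x ~ N(0, sigma^2)  =>  T = sigma * Q^{-1}(Pfa)
    T = sigma * norm.isf(Pfa_target)

    # Pd = P(x > T | H1), x ~ N(S, sigma^2)  =>  S = T - sigma * Q^{-1}(Pd)
    S = T - sigma * norm.isf(Pd_target)
    print(T, S)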


2.5 Conditional Probability Density Functions

This corresponds to Section 4.4 of the Course Text.

We have previously observed, within the context of probability, that the concept of a conditional measure is very useful in incorporating additional information into a probabilistic characterization. In Section 1.2 of the Course, we studied conditional probability as a mechanism for adjusting our understanding about the probability of one event given (conditioned on) the occurrence of another event. In this Section we begin to extend this idea of conditioning probabilistic measures to pdf's. We will continue this extension later when dealing with multiple random variables.

Event Conditioning: Consider a random variable x with pdf f_x(x), and B some event associated with x. For example, consider the event B = {a < x ≤ b}. Conditioned on B, our pdf of x is

f_x(x/B) = f_x(x) [u(x − a) − u(x − b)] / ∫_a^b f_x(x) dx .   (5)

Figure 6 illustrates the original and event conditioned pdf’s.

Figure 6: Illustration of event conditioning.

In words, the pdf of x conditioned on event B having occurred is restricted to the range of event B and normalized by ∫_a^b f_x(x) dx so that it is a valid pdf. This is because, with the knowledge that event B has occurred (i.e. given event B), we now know that x is restricted to the range of B. Note that within the range dictated by B, the shape of the pdf does not change, since our understanding of the probabilities in this range is not altered.

Within the context of a single random variable, this is a simple and intuitive concept. We will see throughout this Course that conditional probability is much more interesting and important within the context of multiple random variables.


ECE 8072
Statistical Signal Processing
Villanova University
ECE Department
Prof. Kevin M. Buckley

Part 1b
Function of a Random Variable


Contents

3 Functions of a Random Variable
3.1 The pdf of a Function of a Single Random Variable
3.2 Random Number Generation
3.3 The Expectation Operator
3.4 The Moments of a Random Variable
3.5 Conditional Expectation
3.6 Characteristic and Moment Generating Functions
3.6.1 The Characteristic Function
3.6.2 Moment Generating Function

List of Figures

7 General illustration of a function of a random variable.
8 Illustration of a monotonically increasing function of a random variable.
9 Illustration of a non monotonic function of a random variable.
10 Non monotonic function of a random variable for Example 3.5.
11 Non monotonic function of a random variable for Example 3.5.


3 Functions of a Random Variable

In this Chapter we begin to consider a widely applicable topic – the processing of random variables. In engineering, we often deal with systems that process or analyze random data. For example, we amplify, clip, or perform trigonometric operations on single data points. We filter, transform, or extract parameter estimates from multiple data points. In other words, we take functions of input random variables to generate output random variables. We are interested in deriving probabilistic characterizations of these outputs.

We begin this Chapter by considering the problem of identifying the pdf of a random variable which is formed as a function of some other random variable. That is, we consider a function of a single random variable. As you might expect, we will see that the pdf of the new random variable will depend both on the original pdf and on the function applied to the original random variable. We will deal with functions of multiple random variables in Chapter 4.

In this Chapter we also introduce the expectation operator, and we learn how to use it to extract information about a random variable from its pdf. This information can be considered partial probabilistic information about the random variable, whereas the pdf itself is considered the complete probabilistic characterization of a random variable. This will lead to the concept of the moments of a random variable. Examples of moments are the mean and the variance of a random variable. We will see that we can also use the expectation operator to identify partial information about a function of a single random variable, as opposed to deriving its complete probabilistic characterization (i.e. its pdf). In Chapter 4 we will apply this expectation operator to identify partial probabilistic characterizations of functions of multiple random variables. This will lead to the concept of joint moments. The correlation between two random variables is an important example of a joint moment.


3.1 The pdf of a Function of a Single Random Variable

This discussion corresponds to Sections 5.1 & 5.2 of the Course Text. Several examples are included in this Section of the Notes. There is also a good collection of examples in Section 5.2 of the Text.

Consider a random variable x and some function (a.k.a. transformation) g(·). Let

y = g(x) . (1)

We can expect y to be random since it is a function of random variable x. We are interested in deriving a probabilistic characterization of random variable y (i.e. the pdf fy(y)) in terms of that of x. Of course, we can expect that the form of the function g(·) will influence fy(y) as well. Figure 7 provides an illustration which we can use to start to develop an anticipation of what the mapping of fx(x) to fy(y) will look like.

Figure 7: General illustration of a function of a random variable.

First note that in this illustration, because the function g(x) is not monotonic, both regions X1 and X2 of values of x are mapped to region Y1 of y. Thus,

P (Y1) = P (X1) + P (X2) , (2)

where P(Xi) is the probability that x is in region Xi. So we can expect that, in general, if the function y = g(x) is not monotonic, the pdf of x at several points in x can impact fy(y) at a single point in y. If the function y = g(x) is monotonic, then the situation should be simpler.

Also note that in this illustration fy(y) will have different forms for different regions of y. For example, in this illustration fy(y) = 0 for y < 0 because no values of x map to y < 0. That is, independent of what fx(x) is, y < 0 is not possible. Whether or not fy(y) = 0 will also depend on the region of support (ROS) of x. So generally we can expect that we will have to identify different expressions of fy(y) for different regions of y.

Below we consider a sequence of more challenging problems.


Example 3.1: continuous x to discrete y. Let f_x(x) = (1/2) e^{−|x|} and consider the function

y = {  3A ,   x > 1
    {   A ,   0 < x ≤ 1
    {  −A ,  −1 < x ≤ 0
    { −3A ,   x ≤ −1 .

This is the quantization operation. Determine f_y(y).

Solution:

Example 3.2: discrete r to discrete p. Let f_r(r) = (1/3) ∑_{k=−∞}^{∞} 0.5^{|k|} δ(r − k) and p = g(r) = r². Determine f_p(p).

Solution:

These first two examples illustrate that the problem of identifying the pdf of a function of a random variable is straightforward when the resulting random variable is discrete-valued. The problem reduces to computing the probabilities of the different values of the new random variable. For each value of the new random variable, this involves determining the range of values of the original random variable that map to this new value, and then using techniques established in Chapter 2 to determine the probabilities of the original random variable over that range. This covers all cases where the original random variable is discrete-valued, and all cases where the function is piecewise constant.


Now consider the case of a general monotonically increasing transformation y = g(x) and a continuous-valued random variable input x. This is illustrated below in Figure 8.

Figure 8: Illustration of a monotonically increasing function of a random variable.

First, let’s develop a qualitative feel for the problem of characterizing the pdf of outputrandom variable y. Consider the range of input x1 < x ≤ x2. As illustrated, the slope of thetransformation is small in this range. Therefore, the probability of x in this range will bemapped onto a relatively small range of output, y1 < y ≤ y2. The probability is condensed.On the other hand, consider input range x3 < x ≤ x4. Now, as illustrated, the slope is large,and any probability associated with this range of input is spread out over the correspondingoutput range y3 < y ≤ y4. We should expect any general relationship between input pdf

fx(x) and output pdf fy(y) to reflect this dependence on transformation slope ddx

g(x) (or aswe shall see, the slope of the inverse function, d

dxg−1(y)).

Note that

P(y1 < y ≤ y2) = P(x1 < x ≤ x2) = ∫_{x1}^{x2} fx(x) dx = ∫_{g^{−1}(y1)}^{g^{−1}(y2)} fx(x) dx .   (3)

We need to somehow modify this input/output probability relationship into an expression involving pdf's.


To derive the desired relationship between fx(x) and fy(y), consider for some value y0 the probability

Fy(y0) = P(y ≤ y0) = P(x ≤ x0) = Fx(x0) ,   (4)

where x0 = g^{−1}(y0). Alternatively, we can write this as

∫_{−∞}^{y0} fy(y) dy = ∫_{−∞}^{x0 = g^{−1}(y0)} fx(x) dx .   (5)

What we now need to do is take the derivative of Eq (5) with respect to y0, so that the left side of Eq (5) becomes fy(y0).

The following is Leibniz’s rule: for

G(u) =∫ β(u)

α(u)H(x, u) dx (6)

we have

d

duG(u) = H(β(u), u)

d

duβ(u) − H(α(u), u)

d

duα(u) +

∫ β(u)

α(u)

d

duH(x, u) dx . (7)

Setting u to y0, G(y0) to ∫_{−∞}^{y0} fy(y) dy, H(x, y0) to fx(x), β(y0) to g^{−1}(y0) and α(y0) = −∞, Leibniz's rule will serve our needs. Differentiating Eq. (5) with respect to y0 (assumed a variable), using Leibniz's rule, we have

fy(y0) = fx(g^{−1}(y0)) (d/dy0) g^{−1}(y0)   (8)

(note that H(α(u), u) = 0 and (d/du) H(x, u) = 0). Setting y = y0, we have that

fy(y) = fx(g^{−1}(y)) (d/dy) g^{−1}(y) .   (9)

This is the desired relationship.
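A quick simulation (a sketch, not from the Notes) can confirm Eq. (9) for an illustrative monotonically increasing choice such as y = e^x with x ~ N(0, 1); then g^{−1}(y) = ln(y) and the formula gives the familiar lognormal pdf.

    # Monte Carlo check of Eq. (9) for y = exp(x), x ~ N(0,1) (illustrative choice).
    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.standard_normal(200_000)
    y = np.exp(x)

    # Eq. (9): f_y(y) = f_x(g^{-1}(y)) * d/dy g^{-1}(y), with g^{-1}(y) = ln(y)
    grid = np.linspace(0.05, 6.0, 200)
    fx = lambda t: np.exp(-t**2 / 2.0) / np.sqrt(2.0 * np.pi)
    fy_formula = fx(np.log(grid)) * (1.0 / grid)

    hist, edges = np.histogram(y, bins=100, range=(0.05, 6.0), density=True)
    centers = 0.5 * (edges[:-1] + edges[1:])
    # The histogram and the formula should agree closely (small maximum mismatch):
    print(np.max(np.abs(np.interp(centers, grid, fy_formula) - hist)))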


Example 3.3: continuous x to continuous y; monotonically increasing linear transformation. Consider continuous random variable x and linear transformation y = α x + β, where α and β are constants, with α > 0 so that the function is monotonically increasing.

1. Find fy(y) in terms of fx(x).

2. Let fx(x) = u(x + 1/2) − u(x − 1/2). Find fy(y).

3. Let fx(x) = (1/√(2πσ_X²)) e^{−(x−µ_X)²/(2σ_X²)}. Find fy(y).

Solution:


For a monotonically decreasing function, with continuous-valued input x and output y, paralleling the development above for monotonically increasing functions, it can be shown that the input/output pdf relation is:

fy(y) = fx(g^{−1}(y)) |(d/dy) g^{−1}(y)| .   (10)

Note that for the monotonically decreasing function case, (d/dy) g^{−1}(y) is negative for all y. For the monotonically increasing function case, (d/dy) g^{−1}(y) is positive for all y. Given this and Eq. (9), note that Eq. (10) is valid for both monotonically increasing and decreasing transformations.

Example 3.4: continuous x to continuous y; monotonically decreasing transformation. Let y = g(x) = −a tan(x) with a > 0. Assume that fx(x) = 0 for |x| > π/2, so that g(x) is monotonically decreasing over the region of support of x.

1. Determine fy(y) in terms of fx(x).

2. Let fx(x) = (1/π) [u(x + π/2) − u(x − π/2)]. Determine fy(y).

Solution:


Now let’s consider non monotonic functions. First note that the function in Example 3.4 isnon monotonic. However, since the region of support of the input is limited to a monotonicregion of the function, the problem can be treated as monotonic, and Eq. (10) is applicable.

Consider the general non monotonic case illustrated below. For this illustration, we seethat three disjoint regions of x are mapped to output region y1 < y ≤ y2. On the otherhand, only one region x is mapped onto the output region y3 < y ≤ y4. This illustrationsuggests that to derive the output pdf fy(y) for the general non monotonic function case,different regions of y must be considered separately.

Figure 9: Illustration of a non monotonic function of a random variable.

Let n be the number of disjoint regions of x that map onto an output region y1 < y ≤ y2. Then,

fy(y) = ∑_{i=1}^{n} fx(g_i^{−1}(y)) |(d/dy) g_i^{−1}(y)| ,   y1 < y ≤ y2 ,   (11)

where g_i^{−1}(y) is the inverse transformation governing the ith region of x.


Example 3.5: continuous x to continuous y; non monotonic transformation. Let y = g(x) = a x⁴; a > 0.

a) Determine fy(y) in terms of fx(x).

b) Let fx(x) = (1/2)[u(x + 1) − u(x − 1)]. Determine fy(y).

Solution: The figure below shows this transformation. For x ≥ 0 (region 1) the inverse is x = (y/a)^{1/4} = g_1^{−1}(y); for x < 0 (region 2) it is x = −(y/a)^{1/4} = g_2^{−1}(y).

Figure 10: Non monotonic function of a random variable for Example 3.5.

a)

fy(y) = { fx(g_1^{−1}(y)) |(d/dy) g_1^{−1}(y)| + fx(g_2^{−1}(y)) |(d/dy) g_2^{−1}(y)| ,   y ≥ 0
        { 0 ,   y < 0

      = { fx((y/a)^{1/4}) (1/(4 a^{1/4})) y^{−3/4} + fx(−(y/a)^{1/4}) (1/(4 a^{1/4})) y^{−3/4} ,   y ≥ 0
        { 0 ,   y < 0

b) The figure below shows this transformation specifically for the uniformly distributed input, which has region of support −1 ≤ x ≤ 1 so that 0 ≤ y ≤ a.

Figure 11: Non monotonic function of a random variable for Example 3.5.

Note: fx(g_1^{−1}(y)) = 1/2 for 0 ≤ g_1^{−1}(y) ≤ 1, i.e. 1/2 for 0 ≤ (y/a)^{1/4} ≤ 1, i.e. 1/2 for 0 ≤ y/a ≤ 1, i.e. 1/2 for 0 ≤ y ≤ a. The approach is similar for fx(g_2^{−1}(y)).

fy(y) = (1/(4 a^{1/4})) y^{−3/4} (1/2) [u(y) − u(y − a)] + (1/(4 a^{1/4})) y^{−3/4} (1/2) [u(y) − u(y − a)]

      = (1/(4 a^{1/4})) y^{−3/4} [u(y) − u(y − a)]

Check to confirm that this is a valid pdf.
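A simulation sketch (not part of the Notes) that checks the derived pdf through its CDF, (y/a)^{1/4} on (0, a]; the value a = 2 is an arbitrary choice.

    # Monte Carlo check of Example 3.5 b): y = a*x^4 with x ~ Uniform(-1, 1).
    import numpy as np

    rng = np.random.default_rng(1)
    a = 2.0
    x = rng.uniform(-1.0, 1.0, 500_000)
    y = a * x**4

    # Derived pdf: f_y(y) = (1/(4 a^{1/4})) y^{-3/4} on (0, a]; its CDF is (y/a)^{1/4}.
    for t in (0.1 * a, 0.5 * a, 0.9 * a):
        print(np.mean(y <= t), (t / a) ** 0.25)   # empirical vs analytic CDF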


3.2 Random Number Generation

You will explore this function of a random variable application as part of Computer Assignment #2.
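The computer assignment itself is not reproduced here, but the standard idea behind it is worth a sketch: if u is uniform on (0, 1) and F is a target CDF, then x = F^{−1}(u) has pdf F'. The example below is an illustration under that assumption, not the assignment; it generates exponential samples from uniform ones.

    # Inverse-transform sampling sketch: exponential samples from uniform ones.
    import numpy as np

    rng = np.random.default_rng(2)
    a = 1.5                                    # exponential parameter, f(x) = a e^{-a x} u(x)
    u = rng.uniform(0.0, 1.0, 100_000)
    x = -np.log(1.0 - u) / a                   # F^{-1}(u) for F(x) = 1 - e^{-a x}

    print(x.mean(), 1.0 / a)                   # sample mean vs E{x} = 1/a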

3.3 The Expectation Operator

Let x be a real random variable with pdf fx(x). The expected value of x is defined as

E{x} = ∫_{−∞}^{∞} x fx(x) dx .   (12)

E{·} = ∫_{−∞}^{∞} (·) fx(x) dx is the expectation operator. (In this case we are simply considering the expected value of x.) Evaluating Eq. (12), observe that E{x} is a weighted average of the values x, where the weighting function is the pdf. This probabilistic weighting emphasizes values x which are more probable. This makes sense.

Example 3.6: Let v be a discrete-valued random variable with exponential pdf

f_v(v) = 0.1 ∑_{n=0}^{∞} 0.9^n δ(v − n) .

Determine E{v}.

Solution:


Example 3.7: Let continuous-valued random variable y have exponential pdf

f_y(y) = a e^{−ay} u(y) .

Find E{y}.

Solution:

Now let x be a random variable, and let g(x) be some function of x. The expected value of this function of x is

E{g(x)} = ∫_{−∞}^{∞} g(x) fx(x) dx .   (13)

That is, E{g(x)} is calculated as the expectation operator applied to g(x). Again, the expectation operator forms a weighted average of the values g(x) using the pdf as the weighting function. Note that the expectation operator is linear, so that

E{c1 g1(x) + c2 g2(x)} = c1 E{g1(x)} + c2 E{g2(x)} .   (14)

Example 3.8: Recall the linear transformation example from Section 3.1 of these Notes, where y = g(x) = α x + β and fx(x) = u(x + 1/2) − u(x − 1/2). To find E{y}, we could first find fy(y) from fx(x) by applying the rule for transformation of a monotonic random variable, and then derive E{y} from it using the expectation operator. Alternatively, we can directly evaluate E{α x + β}. Use the latter approach.

Solution:


Example 3.9: Consider a continuous-valued random variable Φ which is uniformly distributed with pdf

p_Φ(φ) = { 1/(2π) ,   0 ≤ φ < 2π
         { 0 ,   otherwise

Let this be the phase of the following discrete time complex sinusoidal signal

x[n] = A e^{j(Ωn+Φ)}

where A and Ω are constants. Also, consider n to be a constant. The notation suggests that for different values of n we have different random variables, and as n varies we trace through a signal. We will study random signals later in this Course. For now just assume that for each sample time n, x[n] is a function of random variable Φ. Note that each x[n] is a random variable since it is a function of random Φ.

a) Determine E{x[n]}.
b) Determine E{x[n] x∗[m]}.

Solution:

a)

E{x[n]} = E{A e^{j(Ωn+Φ)}} = A e^{jΩn} E{e^{jΦ}} = A e^{jΩn} (1/(2π)) ∫_0^{2π} e^{jφ} dφ = 0 .

It is interesting to note that this result does not depend on n. Every random variable x[n]; −∞ ≤ n ≤ ∞ has zero expected value.

b) This might be considered a 2 random variable problem, since x[n] and x[m] are random. However, we can also just consider the product x[n] x∗[m] to be a single function of random variable Φ.

E{x[n] x∗[m]} = E{A e^{j(Ωn+Φ)} A e^{−j(Ωm+Φ)}} = A² e^{jΩ(n−m)} E{e^{jΦ} e^{−jΦ}} = A² e^{jΩ(n−m)} E{1} = A² e^{jΩ(n−m)} .
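As a numerical cross-check (a sketch; A, Ω, n and m are arbitrary choices not taken from the Notes), the two expectations in Example 3.9 can be approximated by averaging over many random phases.

    # Monte Carlo check of Example 3.9.
    import numpy as np

    rng = np.random.default_rng(3)
    A, Omega, n, m = 2.0, 0.7, 5, 2
    phi = rng.uniform(0.0, 2.0 * np.pi, 200_000)

    xn = A * np.exp(1j * (Omega * n + phi))
    xm = A * np.exp(1j * (Omega * m + phi))

    print(np.mean(xn))                              # should be near 0
    print(np.mean(xn * np.conj(xm)))                # should be near A^2 e^{j Omega (n-m)}
    print(A**2 * np.exp(1j * Omega * (n - m)))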


3.4 The Moments of a Random Variable

This Section of the Notes corresponds to Sections 5.3 & 5.4 of the Course Text. Consider a random variable x and, for positive integer n, the class of functions g(x) = x^n. The moments about the origin (a.k.a. absolute moments) are defined as:

m_n = E{x^n} = ∫_{−∞}^{∞} x^n fx(x) dx .   (15)

For example, the 1-st moment about the origin of x, m_1 = E{x}, is the mean of x. For the mean, we use the simplified notation m_1 = η. It is useful to think of the 2-nd moment about the origin, m_2 = E{x²}, as the energy (or power) of the random variable.

Again for positive integer n, consider the class of functions g(x) = (x − η)^n. The central moments (a.k.a. the moments about the mean) are defined as:

µ_n = E{(x − η)^n} = ∫_{−∞}^{∞} (x − η)^n fx(x) dx .   (16)

The most commonly considered central moment is the 2-nd order central moment,

µ_2 = σ² = E{(x − η)²} = ∫_{−∞}^{∞} (x − η)² fx(x) dx .   (17)

µ_2 = σ² is termed the variance of the random variable.

Example 3.10: Determine the mean and variance of the binomial random variable x which has pdf

fx(x) = ∑_{n=0}^{N} (N choose n) p^n (1 − p)^{N−n} δ(x − n) .

Solution:


Example 3.11: Determine the mean and variance of the Gaussian random variable x which has pdf

fx(x) = (1/√(2π c2)) e^{−(x−c1)²/(2 c2)} .

Solution:

a) For the mean,

η = E{x} = ∫_{−∞}^{∞} x (1/√(2π c2)) e^{−(x−c1)²/(2 c2)} dx ;   y = x − c1

  = (1/√(2π c2)) ∫_{−∞}^{∞} (y + c1) e^{−y²/(2 c2)} dy

  = (1/√(2π c2)) ∫_{−∞}^{∞} y e^{−y²/(2 c2)} dy + c1 ∫_{−∞}^{∞} (1/√(2π c2)) e^{−y²/(2 c2)} dy .

The first term above is the integral over −∞ ≤ y ≤ ∞ of the odd symmetric function y e^{−y²/(2 c2)}. So this first term is zero. The integral in the second term is the integral over a Gaussian pdf. This integral is equal to one, so

η = E{x} = c1 .

So c1 is the mean of this Gaussian random variable.

b) For the variance,

σ² = E{(x − c1)²} = ∫_{−∞}^{∞} (x − c1)² (1/√(2π c2)) e^{−(x−c1)²/(2 c2)} dx ;   y = x − c1

   = (1/√(2π c2)) ∫_{−∞}^{∞} y² e^{−y²/(2 c2)} dy ;   note: y² e^{−y²/(2 c2)} is even symmetric

   = (√2/√π) (1/√c2) ∫_0^{∞} y² e^{−y²/(2 c2)} dy ;   z = y/√(2 c2)

   = (4/√π) c2 ∫_0^{∞} z² e^{−z²} dz .

From a table of definite integrals, we have that ∫_0^{∞} z² e^{−z²} dz = √π/4, so

σ² = E{(x − c1)²} = c2 .

This proves that c2 is the variance of this Gaussian random variable.

This example justifies our use back in Section 2 of these Notes of the terms mean and variance for, respectively, η and σ² in the Gaussian pdf function

fx(x) = (1/√(2πσ²)) e^{−(x−η)²/(2σ²)} .   (18)


Of the higher order moments (i.e. n > 2), the skew

µ_3 = E{(x − η)³}   (19)

and the kurtosis

µ_4 = E{(x − η)⁴}   (20)

have been found to be useful. For example, the skew is a measure of the asymmetry of the pdf, and the kurtosis is a measure of the peakedness of the pdf.

Example 3.12: Determine the skew of the Gaussian random variable x.

Solution:

µ_3 = ∫_{−∞}^{∞} (x − η)³ (1/√(2πσ²)) e^{−(x−η)²/(2σ²)} dx ;   y = x − η

    = (1/√(2πσ²)) ∫_{−∞}^{∞} y³ e^{−y²/(2σ²)} dy = 0 ,

since y³ e^{−y²/(2σ²)} is an odd-symmetric function.

This result points to the fact that the skewness of a random variable is sometimes used as a measure of the random variable's dissimilarity from a Gaussian pdf.

Generalizing the Result from Example 3.12, note that:

1. the skew for any pdf which is even-symmetric about its mean will be zero;

2. all odd higher order central moments for any pdf which is even-symmetric about its mean will be zero.

3.5 Conditional Expectation

In Section 1.2 of this Course, we studied conditional probability as a mechanism for adjusting our understanding about the probability of one event given (conditioned on) the occurrence of another event. In Section 2.5 we considered an event conditioned pdf, which is simply the new pdf of a random variable which is constrained by the occurrence of an event concerning that random variable. Here we extend this idea of conditioning probabilistic measures to expectation.

Consider a random variable x with pdf fx(x), and B some event associated with x. For example, consider the event B = {a < x ≤ b}. In Section 2.5 of these Notes we saw that, conditioned on B, our pdf of x is

fx(x/B) = fx(x) [u(x − a) − u(x − b)] / ∫_a^b fx(x) dx .   (21)

The expectation of x conditioned on B is

E{x/B} = ∫_{−∞}^{∞} x fx(x/B) dx   (22)

       = ∫_a^b x fx(x) dx / ∫_a^b fx(x) dx .   (23)
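As a small numerical illustration (not from the Notes), Eq. (23) can be evaluated for a standard Gaussian x conditioned on B = {a < x ≤ b}; the interval below is an arbitrary choice.

    # E{x | a < x <= b} for x ~ N(0,1), via Eq. (23).
    from scipy.integrate import quad
    from scipy.stats import norm

    a, b = 0.5, 2.0
    num, _ = quad(lambda x: x * norm.pdf(x), a, b)   # numerator of Eq. (23)
    den, _ = quad(lambda x: norm.pdf(x), a, b)       # denominator of Eq. (23)
    print(num / den)                                 # conditional mean; lies inside (a, b)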


3.6 Characteristic and Moment Generating Functions

This Section of the Notes corresponds to Section 5.5 of the Course Text. We will see that the characteristic function and the moment generating function are essentially the Continuous Time Fourier Transform (CTFT) and the Laplace transform of the pdf, respectively. These functions are useful for a number of reasons. We will see in this Section that the moment generating function can be used to generate what? Moments of course. One use of the characteristic function, as shown on pp. 161,164 of the Course Text, is as an approach for determining the pdf of a function of a random variable. You are not responsible for this approach. Also, we will see in Chapter 4 of these Notes that the characteristic function can sometimes be used to determine the pdf of a function of multiple random variables.

3.6.1 The Characteristic Function

Consider a random variable x with pdf fx(x). Let ω be a real-valued variable. The characteristic function of x is defined as

Φx(ω) = E{e^{jωx}} = ∫_{−∞}^{∞} fx(x) e^{jωx} dx .   (24)

So the characteristic function Φx(ω) of a random variable is a function of ω. This function looks a lot like the Continuous-Time Fourier Transform (CTFT) of fx(x), where in the CTFT context ω is the frequency domain variable and x plays the role that “time” plays in the more familiar signals & systems analysis applications. The only difference is that E{e^{−jωx}} is the CTFT of fx(x), so that Φx(ω) differs from the CTFT of fx(x) only in that a ”-” sign is missing from the exponent of the e^{jωx} function.

Example 3.13: Determine the characteristic function of the Laplacian pdf

fx(x) = (b/2) e^{−b|x|} .

Solution:

Φx(ω) = (b/2) [ ∫_0^{∞} e^{−bx} e^{jωx} dx + ∫_{−∞}^0 e^{bx} e^{jωx} dx ]

       = (b/2) [ ∫_0^{∞} e^{(jω−b)x} dx + ∫_{−∞}^0 e^{(jω+b)x} dx ]

       = (b/2) [ (1/(jω−b)) e^{(jω−b)x} |_{x=0}^{∞} + (1/(jω+b)) e^{(jω+b)x} |_{x=−∞}^{0} ]

       = (b/2) [ (1/(jω−b)) (−1) + (1/(jω+b)) (1) ]

       = b²/(ω² + b²) .

Note that since this pdf is even symmetric, Φx(ω) = F(ω) where F(ω) is the CTFT of fx(x). So if we were to have a CTFT table that included the function e^{−b|t|} ←→ 2b/(ω² + b²), we could have used this entry, plus the CTFT linearity property, to solve this problem.


Example 3.14: Given the CTFT pair e^{−at²} ←→ √(π/a) e^{−ω²/4a}, determine the characteristic function of a Gaussian random variable x with pdf N(η, σ²).

Solution: Consider first the zero mean pdf fx(x) = N(0, σ²). Since this pdf is even symmetric, its characteristic function is its CTFT. Letting a = 1/(2σ²) and using the CTFT linearity property, we have

e^{−ax²} ←→ √(π/a) e^{−ω²/4a}
e^{−x²/2σ²} ←→ √(2πσ²) e^{−ω²σ²/2}
(1/√(2πσ²)) e^{−x²/2σ²} ←→ e^{−ω²σ²/2} .

So, for η = 0, the characteristic function is

Φx(ω) = e^{−ω²σ²/2} .

For the non zero mean Gaussian case, i.e. for general mean η, note that this shift of the pdf is analogous to a delay of a signal. The delay property of the CTFT indicates that this corresponds to an additional linear phase term in the ω domain. The shift property for the characteristic function will look a little different from the delay property of the CTFT because of the slight difference between the characteristic function equation and the CTFT. It can be shown that the shift property of the characteristic function is as follows: given

fx(x) ←→ Φx(ω) ,

then a shift of the mean by η results in

fx(x − η) ←→ Φx(ω) e^{jηω} .

Thus, the characteristic function of a general Gaussian pdf is

N(η, σ²) ←→ e^{−ω²σ²/2} e^{jηω} = e^{jηω − ω²σ²/2} .

Note that this corresponds to entry #1 of Table 5.2, p. 162 of the Course Text.

Example 3.14 illustrates the following points:

• we can use CTFT tables and properties to assist us in deriving a characteristic function from a pdf (though we must be careful about the slight difference between the CTFT and characteristic function equations); and

• if you need the characteristic function for a given pdf, first see if you can find it in a characteristic function table (e.g. Table 5.2 of the Text).

Also, note that finding the pdf corresponding to a given characteristic function is analogous to finding an inverse CTFT. Start by trying to use tables and properties.


3.6.2 Moment Generating Function

Consider a random variable x with pdf fx(x). The moment generating function of x is defined as

Φ(s) = E{e^{sx}} = ∫_{−∞}^{∞} fx(x) e^{sx} dx .   (25)

Note that Eq (25) looks a lot like the Laplace transform of fx(x), where s is the complex-valued transform domain variable and x plays the role that “time” plays in the more familiar system analysis application of the Laplace transform. The only difference is that E{e^{−sx}} is the Laplace transform of fx(x), so that Φ(s) differs from the Laplace transform of fx(x) only in that a ”-” sign is missing from the exponent of the e^{sx} function.

Why is Φ(s) termed the moment generating function? Using the Taylor series expansion

e^{sx} = 1 + sx + (sx)²/2! + (sx)³/3! + · · · ,   (26)

and the fact that the expectation in Φ(s) = E{e^{sx}} is a linear operator, we have

Φ(s) = 1 + s E{x} + (s²/2!) E{x²} + (s³/3!) E{x³} + · · · .   (27)

So a Taylor series expansion of the moment generating function Φ(s) yields the moments about the origin of x from the coefficients of the expansion.

Using Φ(s), the following is a method for generating the moments of x about the origin:

1. Determine Φ(s) = E{e^{sx}}. For example, this can be done using Laplace transform tables. Φ(s) is the Laplace transform of fx(−x).

2. Expand Φ(s) as a Taylor series.

3. Pick off the moments about the origin from the expansion coefficients.

Alternatively, note that the kth Taylor series coefficient of Φ(s) is

( (d^k/ds^k) Φ(s)|_{s=0} ) / k! = Φ^{(k)}(0) / k! .   (28)

So, the kth moment about the origin is

m_k = Φ^{(k)}(0) .   (29)


Example 3.15: Find the mean, variance, skewness and kurtosis of the Laplace random variable x with pdf

fx(x) = (b/2) e^{−b|x|} .

Solution:

Φ(s) = (b/2) ∫_{−∞}^{∞} e^{sx} e^{−b|x|} dx = (b/2) [ 1/(b − s) + 1/(b + s) ] = b²/(b² − s²) .

Then,

η = m_1 = Φ^{(1)}(0) = (b/2) [ 1/(b − s)² − 1/(b + s)² ]|_{s=0} = 0 .

This makes sense since fx(x) is symmetric about x = 0. Since η = 0,

σ² = m_2 = Φ^{(2)}(0) = (b/2) [ 2/(b − s)³ + 2/(b + s)³ ]|_{s=0} = 2/b²

skew = m_3 = Φ^{(3)}(0) = (b/2) [ 6/(b − s)⁴ − 6/(b + s)⁴ ]|_{s=0} = 0

kurtosis = m_4 = Φ^{(4)}(0) = (b/2) [ 24/(b − s)⁵ + 24/(b + s)⁵ ]|_{s=0} = 24/b⁴ .

The fact that skew = 0 makes sense since fx(x) is symmetric.
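The derivative recipe of Eqs. (28)-(29) is easy to automate symbolically; the sketch below (assuming sympy is available) recovers the moments just computed for the Laplace pdf.

    # Symbolic check of Example 3.15 using m_k = Phi^{(k)}(0), Phi(s) = b^2/(b^2 - s^2).
    import sympy as sp

    s, b = sp.symbols('s b', positive=True)
    Phi = b**2 / (b**2 - s**2)

    for k in (1, 2, 3, 4):
        mk = sp.diff(Phi, s, k).subs(s, 0)
        print(k, sp.simplify(mk))     # expect 0, 2/b**2, 0, 24/b**4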


ECE 8072
Statistical Signal Processing
Villanova University
ECE Department
Prof. Kevin M. Buckley

Part 1c
Multiple Random Variables


Contents

4 Multiple Random Variables
4.1 The Joint pdf of Multiple Random Variables
4.2 Functions of Multiple Random Variables
4.3 Moments of Multiple Random Variables
4.4 Complex-Valued Random Variables
4.5 Signal in Noise and SNR
4.6 Conditional Expectation
4.7 The Central Limit Theorem

List of Figures

12 Illustration of the Region-of-Support and joint pdf of two random variables.
13 The integral region for computing Fy(y) given y = x1 + x2.


4 Multiple Random Variables

In this Chapter of the Course Notes we extend the representation and analysis presented in Chapter 3 for a single random variable to multiple random variables. So now we are interested in the outcomes of more than one random variable. These random variables could, for example, be several outputs of an antenna or microphone preamp, or several parameters (i.e. amplitude, frequency, phase, source location) of a signal of interest. This discussion parallels and extends that of the previous Chapter of these Notes. We first define a joint pdf as the complete joint statistical representation of multiple random variables. We then use this description to define and consider functions of multiple random variables, and joint moments of several random variables. We then use these descriptions to consider several more specific topics of interest.

If you look at the Course Outline, you will notice that Chapter 5 of the Course is entitled “Random Vectors”. In that Section we will exploit linear algebra to study joint characteristics of multiple random variables represented as a random vector. In this Chapter we introduce random vector notation simply as a compact notation for the representation of multiple random variables. We start with the general random vector notation, and often use the two random variable case for illustration.

This Chapter of the Notes corresponds to topics in Chapter 6 of the Course Text, which is entitled “Two Random Variables”. To some extent below we extend the discussions in Chapter 6 of the Text in a straightforward manner to include more than two random variables. Note that Chapter 7 of the Course Text, entitled “Sequences of Random Variables”, also deals with multiple (i.e. more than two) random variables. Most topics in Chapter 7 of the Course Text are more closely aligned with the next Chapter of these Notes (on random vectors) and with some topics we will cover later (i.e. in Parts II and III of this Course). So below we will use the Course Text as a reference. Do not feel the need to follow the rationale of the progression of topics in the Text. As we proceed, I will point out supporting discussions from the Course Text. Sometimes these discussions from the Text are somewhat terse.

4.1 The Joint pdf of Multiple Random Variables

This Section of the Course Notes corresponds to topics in Section 6.1 and on p. 243 of the Course Text.

Let xi; i = 1, 2, · · · , N be N random variables. Let x = [x1, x2, · · · , xN]^T be an N-dimensional column vector representation of these. We call x a random vector (i.e. a vector of random variables). Let x = [x1, x2, · · · , xN]^T be the N-dimensional column vector of values of the random vector x. The joint probability density function (joint pdf) of x is denoted

f_x(x) = f_{x1,x2,···,xN}(x1, x2, · · · , xN) .   (1)

It is an N-dimensional function of the values x.


Properties of the joint pdf:

1. f_x(x) = 0 if any element of x is either ∞ or −∞.

2. f_x(x) ≥ 0 ; ∀ x.

3. ∫_{−∞}^{∞} f_x(x) dx = 1.

4. P(a < x < b) = ∫_a^b f_x(x) dx.

As was the case for a single random variable, note that property 4. indicates why f_x(x) is termed a probability density - probabilities are computed by integrating over it. For example, for the N = 2 random variable case, the volume under the f_x(x) surface is the probability. In general, we say the hyper volume under f_x(x) over a certain range of x is the probability that the random variables in x are jointly in that range.

The joint probability distribution function (joint PDF) is

F_x(x) = P(−∞ < x < x) = P(x1 < x1 ∩ x2 < x2 ∩ · · · ∩ xN < xN)

       = ∫_{−∞}^{x} f_x(y) dy = ∫_{−∞}^{x1} ∫_{−∞}^{x2} · · · ∫_{−∞}^{xN} f_x(y) dyN · · · dy2 dy1 .   (2)

Note that

f_x(x) = (d/dx) F_x(x) = (d/dx1)(d/dx2) · · · (d/dxN) F_x(x) .   (3)

For two random variables, consider x and y. Their joint pdf is denoted f_{x,y}(x, y). It is a 2-dimensional function of joint values of x and y.

Properties:

1. f_{x,y}(x, y) = 0 if either x or y are ∞ or −∞.

2. f_{x,y}(x, y) ≥ 0 ; ∀ x, y.

3. ∫_{−∞}^{∞} ∫_{−∞}^{∞} f_{x,y}(x, y) dy dx = 1.

4. P(a1 ≤ x < b1 and a2 ≤ y < b2) = ∫_{a1}^{b1} ∫_{a2}^{b2} f_{x,y}(x, y) dy dx.

The joint PDF is

F_{x,y}(x, y) = P(−∞ < x < x and −∞ < y < y) = P(x < x ∩ y < y)

             = ∫_{−∞}^{x} ∫_{−∞}^{y} f_{x,y}(ξ1, ξ2) dξ2 dξ1 ,   (4)

and

f_{x,y}(x, y) = (d/dx)(d/dy) F_{x,y}(x, y) .   (5)


Example 4.1: Let N = 2 and x = [x1, x2]^T. Let

f_x(x) = { 8 x1 x2 ,   0 ≤ x1 ≤ 1, 0 ≤ x2 ≤ x1
         { 0 ,   otherwise .

Figure 12: Illustration of the Region-of-Support and joint pdf of two random variables.

Determine P (0.5 ≤ x1 < 1, 0.5 ≤ x2 < 1).

Solution:

P(0.5 ≤ x1 ≤ 1, 0.5 ≤ x2 ≤ 1) = ∫_{.5}^{1} ∫_{.5}^{x1} f_{x1,x2}(x1, x2) dx2 dx1 = ∫_{.5}^{1} ∫_{.5}^{x1} 8 x1 x2 dx2 dx1

= 8 ∫_{.5}^{1} x1 ( ∫_{.5}^{x1} x2 dx2 ) dx1 = 8 ∫_{.5}^{1} x1 ( x1²/2 − 1/8 ) dx1

= 8 ∫_{.5}^{1} ( x1³/2 − x1/8 ) dx1 = [ x1⁴ − x1²/2 ]_{.5}^{1} = 9/16

Try P (0 ≤ x1 < 0.5, 0 ≤ x2 < 0.5) and P (0.5 ≤ x1 < 1, 0 ≤ x2 < 0.5) yourself.
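These probabilities (including the two left as an exercise) can be checked by numerical double integration; a sketch assuming scipy is available:

    # Numerical check of Example 4.1: f(x1,x2) = 8*x1*x2 on 0 <= x2 <= x1 <= 1.
    from scipy.integrate import dblquad

    f = lambda x2, x1: 8.0 * x1 * x2     # dblquad integrates the inner (first) variable first

    # P(0.5 <= x1 < 1, 0.5 <= x2 < 1): inner x2 from 0.5 to x1 (support requires x2 <= x1)
    p1, _ = dblquad(f, 0.5, 1.0, lambda x1: 0.5, lambda x1: x1)
    print(p1)                            # 9/16 = 0.5625

    # P(0 <= x1 < 0.5, 0 <= x2 < 0.5): inner x2 from 0 to x1
    p2, _ = dblquad(f, 0.0, 0.5, lambda x1: 0.0, lambda x1: x1)
    print(p2)

    # P(0.5 <= x1 < 1, 0 <= x2 < 0.5): inner x2 from 0 to 0.5
    p3, _ = dblquad(f, 0.5, 1.0, lambda x1: 0.0, lambda x1: 0.5)
    print(p3)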


Example 4.2: Consider the random variables x = [x1, x2]^T with joint pdf

f_x(x) = { 1/(ab) ,   0 ≤ x1 ≤ a, 0 ≤ x2 ≤ b
         { 0 ,   otherwise .

Assume 0 < a < b. Determine P(x1 + x2 ≤ (3/4) a).

Solution:

Example 4.3: Consider the random variables x = [x1, x2]^T with joint pdf f_x(x) = f_{x1}(x1) f_{x2}(x2) where f_{xi}(xi) = N(ηi, σ²) (i.e. the individual random variables are Gaussian, and in this case the joint pdf is the product of the individual pdf's). Let ηi = 0; i = 1, 2. Determine the expression for P(x1 > 0 ∩ x2 > 0) in terms of the Q(·) function.

Solution:


Marginal pdf’s: This is discussed, for two random variables, on p. 171 of the Course Text.Given the joint pdf fx(x) of random vector x = [x1,x2, · · · ,xN ]

T , if we wish the joint pdf ofsome subset of x, we integrate out from fx(x) all but that subset of random variables. Forexample,

fx1(x1) =∫ ∞

−∞· · ·

∫ ∞

−∞fx(x) dx2 · · ·dxN (6)

fx1,xN(x1, xN ) =

∫ ∞

−∞· · ·

∫ ∞

−∞fx(x) dx2 · · · dxN−1 . (7)

In some applications where there is a set of unknown parameters modeled as random vari-ables, a subset of these are of primary interest and the rest are what are called nuisanceparameters. An effective approach to dealing with nuisance parameters is to marginalizethem out (assuming you know the joint pdf’s).

Conditional pdf’s: This is discussed, for two random variables, in Section 6.6 of the CourseText. First recall the conditional probability relationship

P (A/B) =P (A ∩B)

P (B)(8)

that we established back in Section 1.2 of these Course Notes. Consider random variablesx1 and x2 with joint pdf fx1,x2(x1, x2). Let events A and B be defined as

A : x1 < x1 ≤ x1 +∆x1 (9)

B : x2 < x2 ≤ x2 +∆x2 .

Letting ∆x1 → 0 and ∆x2 → 0, we have

P (A/B) = fx1/x2(x1/x2) ·∆x1 =

P (A ∩ B)

P (B)=

fx1,x2(x1, x2) ·∆x1 ·∆x2

fx2(x2) ·∆x2

, (10)

or

fx1/x2(x1/x2) =fx1,x2(x1, x2)

fx2(x2). (11)

(See Section 6.6 of the Course Text for a rigorous explanation of this proof.)

Note that:

• f_{x1/x2}(x1/x2) is a valid pdf for each possible x2.

• f_{x1/x2}(x1/x2) does not exist for values of x2 for which f_{x2}(x2) = 0. To be conditioned on a value of x2, that value of x2 must be in the region-of-support of x2.

We will address the relevance of conditional pdf's below, after we introduce Bayes' theorem for conditional probabilities.


Now consider the extension of conditional pdf's to the general multivariate case. Consider x = [x1, x2, · · · , xN]^T partitioned, for example, as x1 = [x1, x2, · · · , xP]^T and x2 = [xP+1, xP+2, · · · , xN]^T. The joint pdf of x1 conditioned on values x2 of x2 is

f_{x1/x2}(x1/x2) = f_x(x) / f_{x2}(x2) .   (12)

Statistical Independence: Again consider x = [x1, x2, · · · , xN]^T partitioned, for example, as x1 = [x1, x2, · · · , xP]^T and x2 = [xP+1, xP+2, · · · , xN]^T. Then by definition the random variable sets x1 and x2 are statistically independent of each other if and only if

f_x(x) = f_{x1}(x1) · f_{x2}(x2) ,   (13)

or equivalently

f_{x1/x2}(x1/x2) = f_{x1}(x1)   (14)
f_{x2/x1}(x2/x1) = f_{x2}(x2) .

Example 4.4: For x = [x1,x2]T , let

fx(x) = e−(x1+x2) u(x1) u(x2) .

Are x1 and x2 statistically independent?

Solution: It’s pretty easy to see that fx(x) factors into fx(x) = fx1(x1) ·fx2(x2),where

fx1(x1) = e−x1 u(x1)

fx2(x2) = e−x2 u(x2) .

(Use marginalization if you wish to verify these.) Thus x1 and x2 are statisticallyindependent.


Example 4.5: Now let

f_x(x) = { 2 e^{−(x1+x2)} ,   0 ≤ x1 ≤ x2 ≤ ∞
         { 0 ,   otherwise .

Determine if x1 and x2 are statistically independent.

Solution:

Example 4.6: Given

f_{x1,x2}(x1, x2) = x1 e^{−x1(x2+1)} u(x1) u(x2) ,

determine: a) f_{x1}(x1) and f_{x2}(x2); and b) f_{x2/x1}(x2/x1). Are x1 and x2 statistically independent?

Solution:


Total pdf’s: Yet again consider x = [x1,x2, · · · ,xN ]T partitioned as x1 = [x1,x2, · · · ,xP ]

T

and x2 = [xp+1,xp+2, · · · ,xN ]T . Then

fx1(x1) =

∫ ∞

−∞fx1/x2

(x1/x2) · fx2(x2) dx2 . (15)

Note that, since fX1/X2(x1/x2) · fX2

(x2) = fX(x), Eq (15) is just a marginalization.We have already mentioned why marginalization can be useful. This total pdf form of it,

where conditional pdf’s are used instead of the of the entire joint pdf fx(x), is used in manyapplications because the conditional pdf’s are easier to identify.

Bayes’ Theorem: Again consider x = [x1,x2, · · · ,xN ]T partitioned as x1 = [x1,x2, · · · ,xP ]

T

and x2 = [xp+1,xp+2, · · · ,xN ]T . It directly follows from the conditional pdf equation, Eq (12)

above, that

fx1/x2(x1/x2) =

fx2/x1(x2/x1) fx1

(x1)

fx2(x2)

. (16)

This is the pdf version of Bayes’ theorem. The following example illustrates the importanceof Bayes’ theorem for pdf’s.

Example 4.7: Let x be a received data point which is composed of Gaussian noise n, with pdf fn(n) = N(0, σ²), superimposed with random variable s with some pdf fs(s), i.e. x = s + n. A common problem is to determine as best as possible (i.e. estimate) the value of s given an observation x (i.e. the data) of x.

To do this effectively, we often first identify f_{s/x}(s/x), then plug into it the observed value x, and then find the value s that maximizes f_{s/x}(s/x). This effectively finds the most likely value s given the data x. So what is f_{s/x}(s/x)?

Solution: Note that since x is functionally related to s, these two random variables are not statistically independent. Thus f_{x,s}(x, s) can be difficult to identify. Assume that fs(s) can be identified. Then, using Bayes' theorem,

f_{s/x}(s/x) = f_{x,s}(x, s) / fx(x) = f_{x/s}(x/s) fs(s) / fx(x) .

Both forms of f_{s/x}(s/x) in the above equation involve fx(x), which can be difficult to identify. However, if the objective is to determine the value s that maximizes f_{s/x}(s/x), we don't need fx(x) since it is not a function of s.

The middle term in the equation above would be difficult to identify directly since f_{x,s}(x, s) is not known. However, for the last term of this equation, noting that given (conditioned on) s, f_{x/s}(x/s) = N(s, σ²), the last term of the equation above can be identified.
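A small numerical illustration of this idea (a sketch, with an assumed Gaussian prior fs(s) = N(0, σs²) and arbitrary parameter values, none of which come from the Notes): maximize f_{x/s}(x/s) fs(s) over s for one observed x; fx(x) is never needed.

    # Sketch for Example 4.7: x = s + n, n ~ N(0, sigma^2), assumed prior s ~ N(0, sigma_s^2).
    import numpy as np
    from scipy.stats import norm

    sigma, sigma_s = 1.0, 2.0        # assumed values for illustration
    x_obs = 1.5                      # one observed data point

    s_grid = np.linspace(-10.0, 10.0, 4001)
    posterior_unnorm = norm.pdf(x_obs, loc=s_grid, scale=sigma) * norm.pdf(s_grid, scale=sigma_s)

    s_hat = s_grid[np.argmax(posterior_unnorm)]
    print(s_hat)                                          # numerical maximizer
    print(x_obs * sigma_s**2 / (sigma_s**2 + sigma**2))   # closed-form Gaussian-prior answer, for comparison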


4.2 Functions of Multiple Random Variables

Let xi; i = 1, 2, · · · , N be N random variables, and let

x = [x1, x2, · · · , xN]^T .   (17)

Here we are interested in the pdf fy(y), where

y = g(x) .   (18)

g(·) is a scalar function of the multiple random variables in x. Note that for each value y, y ≤ y corresponds to some region of x. Denote this region as x ∈ R^N_y, where R^N is N-dimensional real space. Then,

Fy(y) = ∫_{x ∈ R^N_y} f_x(x) dx ,   (19)

from which

fy(y) = (d/dy) Fy(y) .   (20)

This leads to a generalization of the procedure for a function of a single random variable which can be dealt with systematically, albeit tediously. Below, we will look at several cases.

Sum of two statistically independent random variables:

Let x1 and x2 be two random variables with joint pdf f_{x1,x2}(x1, x2), and consider the sum y = x1 + x2. From Figure 13 we see that

Fy(y) = ∫_{−∞}^{∞} ∫_{−∞}^{y−x1} f_{x1,x2}(x1, x2) dx2 dx1 .   (21)

Figure 13: The integral region for computing Fy(y) given y = x1 + x2.

Let x1 and x2 be statistically independent. Then,

Fy(y) = ∫_{−∞}^{∞} ∫_{−∞}^{y−x1} f_{x1}(x1) f_{x2}(x2) dx2 dx1

      = ∫_{−∞}^{∞} f_{x1}(x1) ( ∫_{−∞}^{y−x1} f_{x2}(x2) dx2 ) dx1 .   (22)

Taking the derivative with respect to y, using Leibniz's rule, we have

fy(y) = ∫_{−∞}^{∞} f_{x1}(x1) f_{x2}(y − x1) dx1

      = f_{x1}(y) ∗ f_{x2}(y) .   (23)

That is, the pdf's convolve.


Example 4.8: Consider two statistically independent random variables x1 and x2 with pdf's

f_{x1}(x1) = a e^{−a x1} u(x1) ,
f_{x2}(x2) = b e^{−b x2} u(x2) .

Determine fy(y) for y = x1 + x2.

Solution:


Example 4.9: Consider two independent identically distributed (iid) random variables x1 and x2 with f_{xi}(xi) = N(0, σ²). Determine fy(y) for y = x1² + x2².

Solution 1: Let yi = xi²; i = 1, 2. For this function and Gaussian xi we know that

f_{yi}(yi) = (1/(2√yi)) f_{xi}(√yi) + (1/(2√yi)) f_{xi}(−√yi) ,   yi ≥ 0,

and since the f_{xi}(xi) are symmetric, this reduces to

f_{yi}(yi) = (1/√yi) f_{xi}(√yi) ,   yi > 0
           = (1/√yi) (1/√(2πσ²)) e^{−yi/(2σ²)} ,   yi > 0.

Then, for y = y1 + y2,

fy(y) = f_{y1}(y) ∗ f_{y2}(y) .

This convolution would be a challenge, so let's try another approach.

Solution 2: The PDF of y is

Fy(y) = ∫∫_{x1²+x2² ≤ y} f_{x1,x2}(x1, x2) dx1 dx2

      = ∫∫_{x1²+x2² ≤ y} (1/(2πσ²)) e^{−(x1²+x2²)/(2σ²)} dx1 dx2 .

To change to polar coordinates, let x1 = r cos(θ), x2 = r sin(θ) and dx1 dx2 = r dr dθ. Then

Fy(y) = ∫_0^{2π} ∫_0^{√y} (1/(2πσ²)) e^{−r²/(2σ²)} r dr dθ

      = (1/(2πσ²)) ∫_0^{2π} dθ ∫_0^{√y} r e^{−r²/(2σ²)} dr

      = (1/σ²) ∫_0^{√y} r e^{−r²/(2σ²)} dr

      = [1 − e^{−y/(2σ²)}] u(y) .

This is the exponential PDF. The corresponding pdf is

fy(y) = (1/(2σ²)) e^{−y/(2σ²)} u(y) .

Note that this is also chi-squared with 2 degrees of freedom.
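As a quick Monte Carlo check of Example 4.9 (a sketch; σ is an arbitrary choice), the sum of two squared zero-mean iid Gaussians should behave as an exponential random variable with mean 2σ².

    # Monte Carlo check of Example 4.9: y = x1^2 + x2^2, xi ~ N(0, sigma^2) iid.
    import numpy as np

    rng = np.random.default_rng(5)
    sigma = 1.5
    x1 = sigma * rng.standard_normal(300_000)
    x2 = sigma * rng.standard_normal(300_000)
    y = x1**2 + x2**2

    print(y.mean(), 2 * sigma**2)                              # exponential with mean 2 sigma^2
    t = 3.0
    print(np.mean(y <= t), 1 - np.exp(-t / (2 * sigma**2)))    # empirical vs F_y(t)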


Example 4.10: Consider the two iid Gaussian random variables x1 and x2 of Example 4.9. Let

z = (x1² + x2²)^{1/2} = y^{1/2}   (24)

where y is the exponential random variable from that example. Determine fz(z).

Solution:

Weighted sum of multiple random variables:

Consider statistically independent xi; i = 1, 2, · · · , N. Let

y = ∑_{i=1}^{N} wi xi = x^T w   (25)

where w = [w1, w2, · · · , wN]^T is an N-dimensional vector of weights (i.e. coefficients, multipliers) and x is the random vector. Consider finding the pdf of y.

Let yi = wi xi; i = 1, 2, · · · , N. Then we know that

f_{yi}(yi) = f_{xi}(yi / wi) |1/wi|   (26)

(i.e. yi is a linear function of xi). We then have that

y = ∑_{i=1}^{N} yi   (27)

and

fy(y) = f_{y1}(y) ∗ f_{y2}(y) ∗ · · · ∗ f_{yN}(y) .   (28)

Alternatively, using the convolution property of the Continuous Time Fourier Transform (CTFT), and denoting as Fy(ω) the CTFT of fy(y), we have

Fy(ω) = F_{y1}(ω) · F_{y2}(ω) · · · F_{yN}(ω) .   (29)

Note that we could have used characteristic functions instead of CTFT's.


Example 4.11: Let xi; i = 1, 2, · · · , N be iid random variables with Gaussian pdf's f_{xi}(xi) = N(0, σ²). Let y = (1/N) 1^T x (i.e. y is the average of the xi's). Determine fy(y).

Solution:

Example 4.11 suggests a very important fact concerning Gaussian random variables. It specifically establishes that an average of zero-mean iid Gaussians is Gaussian. If you consider the solution procedure for this Example, there is nothing that prohibits its application to a general weighted sum of independent but not necessarily zero-mean or identical Gaussians. So, more generally, this Example suggests that any weighted sum of independent Gaussians is Gaussian. Later, in Example 4.18, we will make use of this more general fact. Even more generally, we will show in Chapter 5 that a weighted sum of Gaussians is Gaussian even if the Gaussians are not independent.


We previously showed that, given zero-mean, iid, Gaussian random variables x1 & x2:

1. y = x1² + x2² is exponentially distributed; and

2. z = √y = √(x1² + x2²) is Rayleigh distributed.

We now expand these results with two examples.

Example 4.12 – the sum of squared iid Gaussian random variables: Let xi; i = 1, 2, · · · , N be zero mean, iid, Gaussian random variables, each with variance σ². Determine the pdf of

y = ∑_{i=1}^{N} xi² .

Solution:


Example 4.13 – the sum of two squared non-zero-mean Gaussian random variables: Let xi; i = 1, 2 be independent Gaussian distributed with pdf's N(ηi, σ²). Determine the pdf of z = √y = √(x1² + x2²).

Solution: We just sketch the solution derivation. First, let yi = xi². It is a challenge to show that

f_{yi}(yi) = (1/√(2πσ²)) yi^{−1/2} cosh(√yi ηi / σ²) e^{−(yi + ηi²)/(2σ²)} u(yi) .

The corresponding characteristic function is

Φ_{yi}(ω) = (1/√(1 − j2ωσ²)) e^{j ηi² ω/(1 − j2ωσ²)} .

Next, let y = y1 + y2. It is easy to show that its characteristic function is

Φ_y(ω) = (1/(1 − j2ωσ²)) e^{j(η1² + η2²) ω/(1 − j2ωσ²)} ,

but a challenge to show that the corresponding pdf is

fy(y) = (1/(2σ²)) I_0( (√(η1² + η2²)/σ²) y^{1/2} ) e^{−(y + (η1² + η2²))/(2σ²)} u(y) ,

where I_0(·) is the zeroth-order Bessel function of the first kind.

Finally, let z = y^{1/2}, which is a simple monotonically increasing function over the region of support of y. It is straightforward to show that

fz(z) = (z/σ²) I_0( z √(η1² + η2²)/σ² ) e^{−(z² + (η1² + η2²))/(2σ²)} u(z) .

This is a Rician pdf. Some of this Example is covered in Example 6.16 of the Course Text.


4.3 Moments of Multiple Random Variables

In this Section of the Course we introduce expectation and moments for multiple random variables. At this point we focus on joint moments of two random variables, and consider the extension to multiple random variables in Chapter 5 of the Course (on random vectors). This Section of the Notes corresponds to Section 6.4 of the Course Text.

Expectation
Consider two random variables x and y, and let z = g(x, y) be some function of them. The expectation of z is

E{z} = E{g(x, y)} = ∫_{−∞}^{∞} ∫_{−∞}^{∞} g(x, y) f_{x,y}(x, y) dx dy .   (30)

This generalizes to g(xi; i = 1, 2, · · · , N) in an obvious manner.

Joint Moments
Given two random variables x and y, the ijth moment about the origin is

m_{ij} = E{x^i y^j} = ∫_{−∞}^{∞} ∫_{−∞}^{∞} x^i y^j f_{x,y}(x, y) dx dy .   (31)

For example, starting with the joint pdf, the mean of x is

η_x = m_{10} = E{x¹ y⁰} = E{x} = ∫_{−∞}^{∞} ∫_{−∞}^{∞} x f_{x,y}(x, y) dx dy

    = ∫_{−∞}^{∞} x ∫_{−∞}^{∞} f_{x,y}(x, y) dy dx

    = ∫_{−∞}^{∞} x fx(x) dx .   (32)

Correlation, an important joint moment about the origin, is defined for random variables x and y as

R_{xy} = m_{11} = E{x y} .   (33)

We say that x and y are uncorrelated if R_{xy} = η_x η_y.

We say that x and y are orthogonal if R_{xy} = 0.

Note the mixed up terminology.


Given two random variables x and y, the ijth joint central moment is

µ_{ij} = E{(x − η_x)^i (y − η_y)^j} = ∫_{−∞}^{∞} ∫_{−∞}^{∞} (x − η_x)^i (y − η_y)^j f_{x,y}(x, y) dx dy .   (34)

For both joint central moments µ_{ij} and joint moments about the origin m_{ij}, the order of the moment is i + j.

The covariance between x and y, a 2-nd order central moment, is

C_{xy} = µ_{11} = E{(x − η_x)(y − η_y)} .   (35)

Example 4.14: Show that

Cxy = Rxy − ηx ηy .

Solution:

The other 2-nd order central moments are

µ_{20} = E{(x − η_x)²} = σ²_x   (36)
µ_{02} = E{(y − η_y)²} = σ²_y .   (37)

The correlation coefficient is defined as

ρ_{xy} = µ_{11} / (√µ_{20} √µ_{02}) = C_{xy} / (σ_x σ_y) .   (38)

As shown on p. 212 of the Course Text, |ρ_{xy}| ≤ 1.


Example 4.15: Consider two statistically independent random variables x and y. Determine expressions for R_{xy}, C_{xy} and ρ_{xy}.

Solution:

Note that statistically independent random variables are uncorrelated (and orthogonal only if η_x = 0 and/or η_y = 0). In general, uncorrelated does not necessarily imply statistically independent. Statistical independence says something about the entire joint pdf. Uncorrelated is only a 2-nd order characteristic.


Two Joint Gaussian Random Variables
Two random variables x and y are jointly Gaussian if their joint pdf is of the form:

f_{x,y}(x, y) = (1/(2π σ_x σ_y √(1 − ρ²_{xy}))) exp{ −(1/(2(1 − ρ²_{xy}))) [ ((x − η_x)/σ_x)² − 2ρ_{xy} ((x − η_x)(y − η_y)/(σ_x σ_y)) + ((y − η_y)/σ_y)² ] }   (39)

Example 4.16: Show that if x and y are jointly Gaussian and uncorrelated, then x and y are statistically independent.

Solution: Uncorrelated means that ρ_{xy} = 0. Thus

f_{x,y}(x, y) = (1/(2π σ_x σ_y)) exp{ −(1/2) [ ((x − η_x)/σ_x)² + ((y − η_y)/σ_y)² ] }

             = (1/√(2πσ²_x)) e^{−(x−η_x)²/(2σ²_x)} · (1/√(2πσ²_y)) e^{−(y−η_y)²/(2σ²_y)}

             = fx(x) · fy(y) .   (40)

Note the significance of the result of this example. In general, uncorrelated random variables can not be expected to be statistically independent. Uncorrelated says something only about the joint second order statistics of two random variables, whereas statistical independence pertains to the entire joint pdf (i.e. all of the moments). However, if the random variables are Gaussian, then uncorrelated implies statistical independence. This is fortunate, since statistically independent random variables are easier to work with, and Gaussian random variables are so common.

Example 4.17: Show that the joint pdf of two Gaussian random variables x = [x1, x2]^T has the following form:

f_x(x) = (1/(2π |C|^{1/2})) e^{−(1/2)(x − η_x)^T C^{−1} (x − η_x)}

where η_x is the mean of x, C is a symmetric 2 × 2 matrix, |C| is its determinant, and C^{−1} is its inverse. Determine C in terms of moments of x1 and x2.


Solution: Let

C = [ a  c
      c  b ] ,

so that

|C| = ab − c² ;   C^{−1} = (1/(ab − c²)) [  b  −c
                                           −c   a ] .

Let's postulate that η_x = [η_{x1}, η_{x2}]^T is the vector of the means of x = [x1, x2]^T. Then the joint pdf proposed above is

f_x(x) = (1/(2π √(ab − c²))) exp{ −(1/(2(ab − c²))) [ b(x1 − η_{x1})² + a(x2 − η_{x2})² − 2c(x1 − η_{x1})(x2 − η_{x2}) ] }

       = (1/(2π √a √b √(1 − c²/(ab)))) exp{ −(1/(2(1 − c²/(ab)))) (1/(ab)) [ b(x1 − η_{x1})² + a(x2 − η_{x2})² − 2c(x1 − η_{x1})(x2 − η_{x2}) ] }

       = (1/(2π √a √b √(1 − c²/(ab)))) exp{ −(1/(2(1 − c²/(ab)))) [ (x1 − η_{x1})²/a + (x2 − η_{x2})²/b − (2c/(√a √b)) (x1 − η_{x1})(x2 − η_{x2})/(√a √b) ] } .

If we let a = σ²_{x1}, b = σ²_{x2}, and c = ρ σ_{x1} σ_{x2}, then

f_x(x) = (1/(2π σ_{x1} σ_{x2} √(1 − ρ²))) exp{ −(1/(2(1 − ρ²))) [ (x1 − η_{x1})²/σ²_{x1} + (x2 − η_{x2})²/σ²_{x2} − 2ρ (x1 − η_{x1})(x2 − η_{x2})/(σ_{x1} σ_{x2}) ] } ,

which is the joint pdf of two Gaussian random variables.

In Example 4.17, note that the matrix

C = [ a  c    =  [ σ²_{x1}            ρ σ_{x1} σ_{x2}
      c  b ]       ρ σ_{x1} σ_{x2}    σ²_{x2} ]   (41)

is the matrix of variances/covariances between the random variables in x. This matrix,

C = E{ [x − η_x] [x − η_x]^T } = E{ [ (x1 − η_{x1})
                                      (x2 − η_{x2}) ] [ (x1 − η_{x1}), (x2 − η_{x2}) ] }

  = E{ [ (x1 − η_{x1})²               (x1 − η_{x1})(x2 − η_{x2})
         (x1 − η_{x1})(x2 − η_{x2})   (x2 − η_{x2})² ] } ,   (42)

is a compact representation of these variances/covariances. It is termed the covariance matrix. Note that it is symmetric. In Chapter 5 of these Notes we will see that for a general random vector x, its covariance matrix is an important partial statistical characterization.


Weighted Sums (Linear Combinations) of Multiple Random Variables
Let x = [x1, x2, · · · , xN]^T be a vector of N random variables. Consider the weighted sum

y = w^T x ,   (43)

where w = [w1, w2, · · · , wN]^T is a vector of constants.

1. Determine an expression for the mean of y.

2. Determine an expression for the variance of y.

3. Determine an expression for the pdf of y.

Results:


Example 4.18: Let xi; i = 1, 2, · · · , N be uncorrelated Gaussian random variables, with means η_{xi}; i = 1, 2, · · · , N and variances σ²_{xi}; i = 1, 2, · · · , N. Let

y = w^T x ,

where w = [w1, w2, · · · , wN]^T. Determine fy(y).

Solution: In Example 4.11 we showed that the average of iid zero mean Gaussians, each with pdf N(0, σ²_x), was Gaussian with pdf N(0, σ²_x/N). After that Example we commented that the weighted sum of nonzero mean, non identically distributed but independent Gaussians was Gaussian, but we did not identify expressions for its mean or variance. We now know how to determine the mean and variance. So,

fy(y) = (1/√(2πσ²_y)) e^{−(y−η_y)²/(2σ²_y)} ,

where

η_y = w^T η_x = ∑_{i=1}^{N} wi η_{xi}

and

σ²_y = w^T C w = ∑_{i=1}^{N} wi² σ²_{xi} .

We now know that the η_y expression in Example 4.18 holds even if the xi are correlated (and thus not statistically independent). We also now know that even if the xi are correlated, the variance of y is still σ²_y = w^T C w, although we can not say that σ²_y = ∑_{i=1}^{N} wi² σ²_{xi} unless C is diagonal (i.e. the xi are uncorrelated). Also, as noted after Example 4.11, we will show in Chapter 5 that even if the xi are correlated, y is Gaussian. So, the result of Example 4.18 extends to any linear combination of Gaussians.
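A quick Monte Carlo sketch (weights, means and variances below are arbitrary choices, not from the Notes) confirms the mean and variance expressions of Example 4.18 for independent Gaussians.

    # Monte Carlo check of Example 4.18: y = w^T x for independent Gaussians.
    import numpy as np

    rng = np.random.default_rng(6)
    w   = np.array([1.0, -2.0, 0.5])
    eta = np.array([0.0,  1.0, 3.0])            # means
    sig = np.array([1.0,  0.5, 2.0])            # standard deviations

    x = eta + sig * rng.standard_normal((200_000, 3))
    y = x @ w

    print(y.mean(), w @ eta)                    # eta_y = sum_i w_i eta_xi
    print(y.var(),  np.sum(w**2 * sig**2))      # sigma_y^2 = sum_i w_i^2 sigma_xi^2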


4.4 Complex-Valued Random Variables

In this Section of the Course we introduce the concept of a complex-valued random variable. Though it may be argued that such a random variable is hard to imagine in nature, they are frequently encountered in signal processing systems. For example, an output of an fft (i.e. a fast Fourier transform), commonly used in spectrum analyzers and filter banks, is in general complex-valued. Quadrature receivers, commonly used in communications and Radar systems, have complex-valued outputs. When these outputs are random, we need to model them as complex-valued random variables. This type of random variable is introduced on p. 143 of the Course Text.

Let x = xr + jxi be a complex-valued random variable. That is, both xr and xi arereal-valued random variables, used in conjunction to form complex-valued x. The pdf of xis just the 2-dimensional joint pdf of (xr,xi). That is, a complex-valued random variablecan be simply viewed as two real-valued random variables which will be employed in tandemaccording to

x = xr + j xi . (44)

The extension to multiple random variables is straightforward. The joint pdf of N complex-valued random variables is the 2N -dimensional joint pdf of all the real and imaginary parts.

Example 4.19: Consider the complex-valued random variable with pdf

fx(x) =

|x|2 ≤ 10 otherwise

.

Show that fx(x) is a valid pdf. Determine P (Rex > 0). Determine P (|x| < 12).

Solution:

Page 79: ECE 8072, Statistical Signal Processing, Fall 2010

Kevin Buckley - 2010 69

Let x be a complex-valued random variable, and g(x) be a function of it. Since x =xr + j xi where xr and xi are real-valued random variables with joint pdf fxr,xi

(xr, xi),

Eg(x) =∫ ∞

−∞g(x) fxr ,xi

(xr, xi) dxr dxi =∫ ∞

−∞g(xr + j xi) fxr,xi

(xr, xi) dxr dxi .(45)

The mean of x is

Ex = Exr + j xi =∫ ∞

−∞

∫ ∞

−∞(xr + j xi) fxr,xi

(xr, xi) dxr dxi

=∫ ∞

−∞

∫ ∞

−∞xr fxr ,xi

(xr, xi) dxr dxi + j∫ ∞

−∞

∫ ∞

−∞xi fxr,xi

(xr, xi) dxr dxi

=∫ ∞

−∞xr

∫ ∞

−∞fxr ,xi

(xr, xi) dxi

dxr + j∫ ∞

−∞xi

∫ ∞

−∞fxr ,xi

(xr, xi) dxr

dxi

=∫ ∞

−∞xr fxr(xr) dxr + j

∫ ∞

−∞xi fxi

(xi) dxi

= ηxr + j ηxi, (46)

or, simply using the linearity property of the expectation operator,

Ex = Exr + jExi = ηxr + j ηxi. (47)

The 2-nd order moment about the origin is defined as

E|x|2 = Ex x∗ = Ex2r + x2

i = Ex2r + Ex2

i . (48)

The variance of complex-valued x is

σ2x = E|(x− ηx)|2 = E|x|2 − ηx Ex∗ − η∗x Ex + |ηx|2

= E|x|2 − |ηx|2 . (49)

It is almost always the case that in signal processing applications involving a complex-valuedrandom variable x, we will have that

σ2xr

= σ2xi

and Exr x∗i = 0. (50)

Page 80: ECE 8072, Statistical Signal Processing, Fall 2010

Kevin Buckley - 2010 70

4.5 Signal in Noise and SNR

In many application where data is observed using a sensor, the data is a superposition of asignal-of-interest (a.k.a. signal, desired signal) and noise. Typically, the signal and noise arestatistically independent, and the noise is random and typically additive and often Gaussian.The signal may or may not be random. Consider data

x = s + n , (51)

where s is the signal and n is the noise. The Signal-to-Noise Ratio (SNR) is defined asthe signal power or energy over the noise power or energy. First, let the signal s = S be aconstant. Then,

SNR =|S|2

E|n|2 (52)

andfx(x) = fn(x− S) . (53)

E|n|2 = σ2n if n is zero mean. If both s and n are random, then

SNR =E|s|2E|n|2 , (54)

and if s and n are statistically independent,

fx(x) = fs(x) ∗ fn(x) . (55)

In Subsection 7.4 of the Course we will see how we can use filters to improve SNR when thesensor output is a random process (i.e. a random time-varying signal).

4.6 Conditional Expectation

Consider two random variables, y and y. Recall that, in terms of the joint pdf fx,y(x, y), theconditional pdf of x given a value of y is

fx/y(x/y) =fx,y(x, y)

fy(y), (56)

where it is assumed that fy(y) 6= 0. From this, we have that

Ex/y = y =∫ ∞

−∞x fx/y(x/y) dx . (57)

Page 81: ECE 8072, Statistical Signal Processing, Fall 2010

Kevin Buckley - 2010 71

4.7 The Central Limit Theorem

Broadly and qualitatively speaking, the sum of a large number of random variables ap-proaches a Gaussian random variable in distribution. This is true for almost any realisticcase. It may not apply if there is too great of a disparity in the spread (variance) of therandom variables to be summed (i.e. if a few of the random variables have very high variancerelative to the others), or if all of the random variables are highly statistically dependent.Here we prove the Central Limit Theorem for the sum of iid random variables. This proof,which parallels that on pp. 125-128 of Peebles 4-th edition, is based on the characteristicfunction representation of pdf’s, and illustrates that the theorem is basically the result ofproperties of convolutions of multiple non-negative functions. See Chapter 7, pp. 278-284,for for a further discussion on this topic.

Theorem: Let xi; i = 1, 2, · · · , n be iid random variables and let

yn =n∑

i=1

xi (58)

Then in the limit as n −→ ∞, fyn(y) is Gaussian.

Proof: Let ηx and σ2x denote, respectively, the mean and variance of each of the

xi. We know that ηyn = n ηx and σ2yn = n σ2

x. Consider normalized yn,

wn =yn − ηyn

σ2yn

=

∑ni=1 (xi − ηx)√

n σx. (59)

Its characteristic function is

Φwn(ω) = Eejωwn = E

e(jω/√nσx)

∑n

i=1(xi−ηx)

= E

n∏

i=1

e(jω/√nσx)(xi−ηx)

=[

E

e(jω/√nσx)(xi−ηx)

]n. (60)

Consider the Taylor series expansion of the expectation in Eq (60),

E 1 +

(

jω√n σx

)

(xi − ηx) +

(

jω√n σx

)2(xi − ηx)

2

2+ · · ·

= 1 +

(

jω√nσx

)2σ2x

2+ · · ·

= 1 − ω2

2n+ · · · (61)

So,

E

e(jω/√nσx)(xi−ηx)

= 1 −(

ω2

2n

)

+ R (62)

where R, which represents all the higher terms of the Taylor series expansion, isO(

( 1n)3/2

)

(i.e. “order 32in 1

n” – it contains additive terms which are powers of

1ngreater than or equal to 3/2).

Page 82: ECE 8072, Statistical Signal Processing, Fall 2010

Kevin Buckley - 2010 72

The characteristic function of Wn is then

Φwn(ω) =

[

1−(

ω2

2n

)

+ R

]n

. (63)

Its natural log, which we will use next, is

lnΦwn(ω) = n ln

1−(

ω2

2n

)

+ R

. (64)

You may recall from calculus that

ln(1− z) = −z − z2

2− z3

3− · · · , (65)

which means that

lnΦwn(ω) = −n

(

ω2

2n− R

)

+

(

ω2

2n− R

)2

/2 + · · ·

. (66)

So,

limn→∞

lnΦwn(ω) = − ω2

2, (67)

limn→∞

Φwn(ω) = e−ω2/2 , (68)

limn→∞ fwn(w) =

1√2π

e−w2/2 , (69)

i.e. limn→∞wn is normalized Gaussian, and

limn→∞

fyn(y) =1

2π(nσ2x)

e−(y−nµx)2/2(nσ2x) . (70)

This completes the proof of the Central Limit Theorem for the sum of IID randomvariables.

Page 83: ECE 8072, Statistical Signal Processing, Fall 2010

Kevin Buckley - 2010 71

ECE8072Statistical Signal Processing

Villanova UniversityECE Department

Prof. Kevin M. Buckley

Part1d

Random Vectors

e1

e

e

x

x

θ

φ

3

2

Page 84: ECE 8072, Statistical Signal Processing, Fall 2010

Kevin Buckley - 2010 72

Contents

5 Random Vectors 735.1 Expectation and Moments . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74

5.1.1 Gaussian Random Vectors . . . . . . . . . . . . . . . . . . . . . . . . 785.2 Linear Transformations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 805.3 Vector Observation & The Observation Space . . . . . . . . . . . . . . . . . 855.4 Diagonalization of the Correlation Matrix: Eigenstructure Transformation . 86

5.4.1 Eigenstructure of the Correlation Matrix . . . . . . . . . . . . . . . . 865.4.2 Properties of the Eigenstructure of the Correlation Matrix . . . . . . 875.4.3 Orthogonalization of the Observation Vector . . . . . . . . . . . . . . 90

5.5 Diagonalization of the Covariance matrix and Decorrelation of the Observation 915.6 Gaussian Random Observation and pdf Contours . . . . . . . . . . . . . . . 925.7 Sample Estimate of the Mean, Correlation & Covariance and the Singular

Value Decomposition (SVD) . . . . . . . . . . . . . . . . . . . . . . . . . . . 94

List of Figures

14 An N = 3 dimensional observation vector is R3 space. . . . . . . . . . . . . . 7415 Illustration of a N = 2 dimensional, real-valued random vector observation x. 85

Page 85: ECE 8072, Statistical Signal Processing, Fall 2010

Kevin Buckley - 2010 73

5 Random Vectors

The topics in this Chapter of the Course extend several topics discussed in Chapter 4. Ourfocus here will be on the representation of multiple random variables as an N -dimensionalrandom vector which we term an observation vector. We interpret this random vector asresiding in an N -dimensional observation vector space. Our characterization of this obser-vation vector will mainly be a 2-nd order statistical characterization (i.e. the correlationmatrix). So, in other words, in this Chapter we will not in general be interested in thecomplete statistical characterization of a random vector (i.e. the joint pdf of its elements),though on occasion we will refer to the complete characterization. In Chapter 6 we will finetune this representation specifically for multiple random variables which are samples of arandom signal.

The material in this Chapter of the Course overlaps material in Chapter 6 and Section7.1 of the Course Text. Note however that the emphasis here is significantly different fromthat of the Text since we are more focused on laying a probabilistic foundation specificallyfor signal processing.

In Chapter 4 of these Notes we introduced vector representation of multiple random vari-ables. That is, let x = [x1,x2, · · · ,xN ]

T be an N dimensional random vector that takes onvalues x = [x1, x2, · · · , xN ]

T . Its joint pdf is denoted fx(x), which is N -dimensional if x isreal-valued and 2N -dimensional if it is complex-valued. All fx(x) properties are joint pdf

properties, e.g.

P (a < x ≤ b) =∫ b

afx(x) dx . (1)

Marginal, conditional, total, statistically independent pdf issues are all the same as discussedearlier for multiple random variables.

The emphasis in this Chapter of the Course is that of this vector of N random variables isan observation in an N -dimensional space (RN for real-valued random variables and CN forcomplex-valued random variables). We will think of x more as a random vector observationthan as a vector of multiple scalar random variables. As a vector in an N -dimensional space,we think for example of the Euclidean norm (i.e. the length) of a vector,

||x|| = xH x =

(

N∑

i=1

|xi|2

)1/2

, (2)

or its orientation in space relative to a set of basis vectors which provide a meaningfulreference. This is illustrated in Figure 14 for R3 space, where the three basis vectors aredenoted ei; i = 1, 2, 3.

Naturally, linear algebra will play a central role in the development and utilization of thischaracterization.

Page 86: ECE 8072, Statistical Signal Processing, Fall 2010

Kevin Buckley - 2010 74

e1

e

e

x

x

θ

φ

3

2

Figure 14: An N = 3 dimensional observation vector is R3 space.

5.1 Expectation and Moments

Let x be an N -dimensional random vector with pdf fx(x), and let g(x) be a scalar, vector ormatrix function of it. Then

Eg(x) =∫ ∞

−∞g(x) fx(x) dx . (3)

If, for example, g(x) is an P ×Q dimensional function, then Eg(x) is P ×Q dimensionaltoo. The expectation operator operates separately on each element of the matrix g(x) (i.e.the expectation of a matrix is the matrix of expectations). That is,

Eg(x) = E

g11(x) g12(x) · · · g1Q(x)g21(x) g22(x) · · · g2Q(x)

......

. . ....

gP1(x) gP2(x) · · · gPQ(x)

(4)

=

Eg11(x) Eg12(x) · · · Eg1Q(x)Eg21(x) Eg22(x) · · · Eg2Q(x)

......

. . ....

EgP1(x) EgP2(x) · · · EgPQ(x)

. (5)

The First Central Moment (Mean):Let g(x) = x (i.e. a simple N -dimensional vector function). Then the mean of X is defined

as

ηx

= Eg(x) = Ex , (6)

which can be determined as

Ex = [ηx1, ηx2

, · · · , ηxN]T or Ex =

∫ ∞

−∞x fX(x) dx . (7)

Page 87: ECE 8072, Statistical Signal Processing, Fall 2010

Kevin Buckley - 2010 75

Example 5.1: Let x = [x1,x2, · · · ,xN ]T be a real-valued vector with pdf

fx(x) =N∏

n=1

fxn(xn)

withfxn

(xn) = 0.5 e−0.5xn u(xn)

(i.e. iid exponential). Find ηx.

Solution:

Example 5.2: Consider v = [v1,v2]T with pdf

fv(v) = v1 e−v1(v2+1) u(v1) u(v2) .

Find ηv.

Solution:

Page 88: ECE 8072, Statistical Signal Processing, Fall 2010

Kevin Buckley - 2010 76

Correlation Matrix:Let x be an N -dimensional random vector. The correlation matrix of x is

Rxx = Ex xH . (8)

Since,

x xH =

x1x∗1 x1x

∗2 · · · x1x

∗N

x2x∗1 x2x

∗2 · · · x2x

∗N

......

. . ....

xNx∗1 xNx

∗2 · · · xNx

∗N

, (9)

we have that

Rxx =

E|x1|2 Ex1x

∗2 · · · Ex1x

∗N

Ex2x∗1 E|x2|

2 · · · Ex2x∗N

......

. . ....

ExNx∗1 ExNx

∗2 · · · E|xN |

2

, (10)

where recall that E|xi|2 = σ2

xi+ |ηxi

|2 is the power of xi. Also note thatExix

∗j = Exjx

∗i

∗.

Property 1 of the Correlation Matrix: Rxx is Hermitian (complex) symmetric, i.e.

RHxx = Rxx . (11)

See Eq. (10) above, and the notes following it.

Property 2 of the Correlation Matrix: Rxx is positive semidefinite. Let a be anN -dimensional constant vector. A matrix Rxx is positive semidefinite if, for anya,

aH Rxx a ≥ 0 . (12)

Any correlation matrix is positive semidefinite since

aH Rxx a = aH ExxH a = EaHxxH a = E|aHx|2 ≥ 0 . (13)

Note: as we will see later, for most practical x, Rxx is positive definite, i.e.

aH Rxx a > 0 ; ∀ a 6= 0N . (14)

For example, if x = [x(n), x(n − 1), · · · , x(n − N + 1)]T and x(n) is a fullbandwidth signal, then we will see that Rxx is positive definite.

Page 89: ECE 8072, Statistical Signal Processing, Fall 2010

Kevin Buckley - 2010 77

Covariance Matrix:Let x be an N -dimensional random vector. The covariance matrix of x is

Cxx = E(x− ηx) (x− η

x)H . (15)

Note that for ηx= 0N , we have Cxx = Rxx. Also, since Cxx is itself a correlation matrix

(of the random vector x− ηx), it is both Hermitian and positive semidefinite.

Example 5.3:

Let x be an N -dimensional zero-mean random vector. Assume thatExix

∗j = σ2

xiδ[i− j] (i.e. the elements of x are mutually orthogonal). Then

Cxx = Rxx = Diagσ2x1, σ2

x2, · · · , σ2

xN . (16)

If σ2xi= σ2

x; i = 1, 2, · · · , N , then

Cxx = σ2x IN . (17)

Cross-Correlation & Cross-Covariance Matrices:Let x and y be random vectors. Then

Rxy = Ex yH (18)

is the cross-correlation matrix, and

Cxy = E(x− ηx) (y − η

y)H (19)

is the cross-covariance matrix.x and y are orthogonal if Rxy = 0.x and y are uncorrelated if Cxy = 0.

Page 90: ECE 8072, Statistical Signal Processing, Fall 2010

Kevin Buckley - 2010 78

5.1.1 Gaussian Random Vectors

Recall that a real-valued Gaussian random variable x has pdf

fX(x) =1

2πσ2x

e−(x−ηx)2/2σ2x . (20)

This pdf is completely parameterized by the random variable’s mean ηx and variance σ2x.

To generalize this to the multivariable case, consider an N−dimensional real-valued ran-dom vector x. This vector is Gaussian if its pdf is on the form

fx(x) =1

(2π)N/2|Cxx|1/2e−

1

2(x−η

x)TC−1

xx (x−ηx) , (21)

where, employing notation we developed earlier, ηx= Ex is the mean of x, and

Cxx = E(x− ηx)(x− η

x)T is its covariance matrix. This pdf is completely parameterized

by the random vector’s mean vector and covariance matrix. We’ve seen Eq (21) before, forthe N = 2 case.

If the random variables in x are mutually uncorrelated, then

Cxx = Diagσ2x1, σ2

x2, · · · , σ2

xN (22)

where σ2xi

is the variance of the ith element of x. If this is the case, then

C−1xx = Diag

1

σ2x1

,1

σ2x2

, · · · ,1

σ2xN

(23)

so that, in the argument of the exponential in the pdf,

−1

2(x− η

x)TC−1

xx (x− ηx) = −

N∑

i=1

(xi − ηxi)2/2σ2

xi(24)

and the determinant of Cxx reduces to

|Cxx| =N∏

i=1

σ2xi

. (25)

Then,

fx(x) =1

∏Ni=1

2πσ2xi

e−∑N

i=1(xi−ηxi)

2/2σ2xi (26)

=N∏

i=1

1√

2πσ2xi

e−(xi−ηxi )2/2σ2

xi (27)

=N∏

i=1

fxi(xi) . (28)

Page 91: ECE 8072, Statistical Signal Processing, Fall 2010

Kevin Buckley - 2010 79

If the random variables in a Gaussian vector are all mutually uncorrelated, then they areall statistically independent. Again, we have already seen this for the N = 2 case.

For a complex Gaussian random variable x = xr + jxi, note from previous discussionsthat the pdf is

fx(x) =1

πσ2x

e−|x−ηx|2/σ2x (29)

=1

πσ2x

e−[(xr−ηxr )2+(xi−ηxi )

2]/σ2x (30)

=1

2π σ2x

2

e−(xr−ηxr )2/2(σ2

x/2)1

2π σ2x

2

e−(xi−ηxi )2/2(σ2

x/2) (31)

where ηx = ηxr+ jηxi

. This is the product of pdf’s fxr(xr) and fxi

(xi). From this we see thatthe real and imaginary parts of complex-valued X are uncorrelated (and thus statistically

independent), each with variance σ2x

2, and with means ηxr

and ηxirespectively. Note that if

we define x = [xr,xi]T , its joint pdf is

fx(x) =1

2π|Cxx|1/2

e−1

2(x−η

x)TC−1

xx (x−ηx) , (32)

where Cxx = Diagσ2x

2, σ2

x

2 and η

x= [ηxr

, ηxi]T . Note that Eqs (31) and (32) are equivalent.

So the pdf of complex-valued Gaussian x, Eq (29), reduces to Eq (31). Really it worksthe other way around. For the complex Gaussian random variables we observe in signalprocessing applications, the joint pdf of the real and imaginary parts are of the form Eq (31)which is equivalent to Eq (29).

Consider an N−dimensional complex-valued random vector x. This vector is Gaussian ifits pdf is on the form

fx(x) =1

πN |Cxx|e−(x−η

x)HC−1

xx (x−ηx) . (33)

Eq (33) is the N -dimensional generalization of Eq (29), where ηxand Cxx are, respectively

the mean vector and covariance matrix of complex-valued x.

Page 92: ECE 8072, Statistical Signal Processing, Fall 2010

Kevin Buckley - 2010 80

5.2 Linear Transformations

This Section of the Course corresponds to material at the beginning of Section 7.1 of theCourse Text. After studying the linear case, we will briefly consider a generalization of it.

Let x be an N × 1 random vector with mean ηx, correlation matrix Rxx and covariance

matrix Cxx. Let T be an M ×N dimensional matrix. Then,

y = T x (34)

is a linear transformation of x that results in the M ×1 dimensional random vector y. (Thisis an M-dimensional linear function of N -dimensional random vector x. Note that in Section4.2 of this Course we considered the more specific M = 1 (scalar) case, but we allowed (atleast at the beginning) the transformation to be general, i.e. y = g(x).)

The mean of y:

ηy

= Ey = ET x = T Ex = T ηx. (35)

The correlation matrix of y:

Ryy = Ey yH = ET x xH TH = T Ex xH TH = T Rxx TH . (36)

The covariance matrix of y:Cyy = T Cxx TH . (37)

Example 5.4: Consider the transformation T = wH where w is and N ×1 vector.Determine the mean vector, the correlation matrix and the covariance matrix ofy = T x for both the general x case and the uncorrelated, zero-mean x case.

Solution:

Page 93: ECE 8072, Statistical Signal Processing, Fall 2010

Kevin Buckley - 2010 81

Linear algebra identities:

• Transpose of a matrix product: For any A and B of compatible dimensions,

(A B)H = BH AH . (38)

• Trace: the trace of an N ×N (square) matrix is

TrA =N∑

i=1

A[i, i] . (39)

Note that for a correlation matrix Rxx, TrRxx is the total energy (or power) of thevector observation.

• Inverse of a matrix product: For any invertible N ×N matrices A and B,

(A B)−1 = B−1 A−1 . (40)

For example,(T Cxx TH)−1 = (TH)−1 C−1

xx T−1 . (41)

• Euclidean norm of vector w:

||w|| =(

wH w)1/2

=

(

N∑

i=1

|wi|2

)1/2

(42)

• Euclidean norm of matrix T :

||T || = max||x||=1

||T x|| (43)

• Frobinius norm of matrix T :

||T ||F =

M∑

i=1

N∑

j=1

|ti,j |2

1/2

=(

TrT TH)1/2

(44)

• Determinant/trace of matrix product: For N ×N dimensional T , andRyy = T Rxx TH ,

|Ryy| = |T |∗ |T | |Rxx| ; |Cyy| = |T |∗ |T | |Cxx| (45)

TrRyy = TrTH T Rxx ; TrCyy = TrTH T Cxx . (46)

Page 94: ECE 8072, Statistical Signal Processing, Fall 2010

Kevin Buckley - 2010 82

We now turn to the problem of deriving the joint pdf of the transformed random vectory in terms of the joint pdf of x. It is here that we will start out with a somewhat generaltransformation formulation, and then focus on the linear case.

Recall that, for a single random variable case, the pdf fy(y) of the random variabley = g(x), in terms of the input pdf fx(x) and a monotonic function g(·), is

fy(y) =

d

dyg−1(y)

fx(g−1(y)) . (47)

If the function g(·) is non-monotonic (i.e. if a one-to-one mapping between x and y does notexist), then the expression for fy(y) becomes a more involved but systematically manageablegeneralization of (47).

Consider two N × 1-dimensional random vectors x = [x1,x2, · · · ,xN ]T and

y = [y1,y2, · · · ,yN ]T and a one-to-one mapping between them

y = g(x) =

g1(x)g2(x)...

gN(x)

. (48)

The gi(x) are not necessarily linear, but they are monotonic. Since they are monotonic, wehave the inverse transformation

x = φ(x) =

φ1(y)φ2(y)

...φN(x)

. (49)

Page 95: ECE 8072, Statistical Signal Processing, Fall 2010

Kevin Buckley - 2010 83

Let

J =

∂φ1

∂y1

∂φ2

∂y1· · · ∂φN

∂y1∂φ1

∂y2

∂φ2

∂y2· · · ∂φN

∂y2...

.... . .

...∂φ1

∂yN

∂φ2

∂yN· · · ∂φN

∂yN

=

∂g1∂x1

∂g2∂x1

· · · ∂gN∂x1

∂g1∂x2

∂g2∂x2

· · · ∂gN∂x2

......

. . ....

∂g1∂xN

∂g2∂xN

· · · ∂gN∂xN

−1

= J−1 (50)

where

J =

∂g1∂x1

∂g2∂x1

· · · ∂gN∂x1

∂g1∂x2

∂g2∂x2

· · · ∂gN∂x2

......

. . ....

∂g1∂xN

∂g2∂xN

· · · ∂gN∂xN

(51)

is the Jacobian (the determinant of the matrix of partial derivatives) of the transformationy = g(x). J is the Jacobian of x = φ(y).

Then, it can be shown1 that

fy(y) =1

|J |fx(φ(y)) = |J | fx(φ(y)) (52)

(i.e. fy(y) is fx(φ(y)) times the absolute value of the Jacobian J). Note that for the 1 × 1case (i.e. for scalar x and y), Eq (52) reduces to Eq (47).

For linear g(x), i.e. for gi(x) = [ti1, ti2, · · · , tiN ] x, we have y = T x. Then the JacobianJ is just the determinant of T , i.e.

J =

∂g1∂x1

∂g2∂x1

· · · ∂gN∂x1

∂g1∂x2

∂g2∂x2

· · · ∂gN∂x2

......

. . ....

∂g1∂xN

∂g2∂xN

· · · ∂gN∂xN

=

t11 t12 · · · t1Nt21 t22 · · · t2N...

.... . .

...tN1 tN2 · · · tNN

= |T | . (53)

With g−1(y) = T−1y, we have

fy(y) =1

abs |T |fx(T

−1y) . (54)

1on p. 244 of the Course Text, where the author sketches out a proof

Page 96: ECE 8072, Statistical Signal Processing, Fall 2010

Kevin Buckley - 2010 84

Consider real-valued Gaussian x with

fx(x) =1

(2π)N/2|Cxx|1/2

e−1

2(x−η

x)TC−1

xx (x−ηx) . (55)

Then, for linear transformation y = T x, we have

fy(y) =1

abs |T |

1

(2π)N/2|Cxx|1/2

e−1

2(T−1y−η

x)TC−1

xx (T−1y−η

x) (56)

=1

(2π)N/2|Cyy|1/2

e− 1

2(y−η

y)TC−1

yy (y−ηy), (57)

where Cyy = T Cxx T T and ηy= T η

x.

Proof:

So, y is Gaussian. Note that its mean vector and covariance matrix are related to those of yas derived previously in this Subsection. Also note that since every element of y is Gaussian,this proves that a linear combination of Gaussian random variables is also Gaussian, even ifthe original random variables are correlated.

Similarly, for complex-valued Gaussian x,

fy(y) =1

πN |Cyy|e−(y−y)

HC−1

yy (y−y) , (58)

where Cyy = T Cxx TH and ηy= T η

x.

Page 97: ECE 8072, Statistical Signal Processing, Fall 2010

Kevin Buckley - 2010 85

5.3 Vector Observation & The Observation Space

Let x be an N -dimensional (real or complex) random vector where each element of itsrealization x is a data point and where the elements of x are to be processed jointly. We termx an observation and the N -dimensional space (real or complex) is termed the observationspace. Figure 15 provides a visualization of an observation vector in the observation spaceof the N = 2, real-valued case.

x

x1

x2

observationspace

Figure 15: Illustration of a N = 2 dimensional, real-valued random vector observation x.

In signal processing, understanding and exploiting the probabilistic locations of observa-tions ofX in its observation space is very useful. This requires a marriage between probabilityand linear algebra.

Page 98: ECE 8072, Statistical Signal Processing, Fall 2010

Kevin Buckley - 2010 86

5.4 Diagonalization of the Correlation Matrix: EigenstructureTransformation

In this Section we begin to study, in some depth, characteristics of the correlation matrixRxx = Ex xH of a random vector observation. We will continue this in Subsection 6.3of this Course when we consider the covariance matrix of a vector observation of a randomprocess.

5.4.1 Eigenstructure of the Correlation Matrix

The eigenstructure of an N × N -dimensional (i.e. square) matrix Rxx consists of its eigen-values and eigenvectors, which are the N solutions to

Rxx ei = λi ei i = 1, 2, · · · , N , (59)

where the λi; i = 1, 2, · · · , N are the eigenvalues and ei; i = 1, 2, · · · , N the eigenvectors ofRxx. We will assume that the eigenvalues are ordered in descending order of magnitude. Also,note from (59) that the magnitude of the eigenvectors do not matter. It’s their directions inthe N-dimensional space that are important in their definition and utility. For convenience,we will assume that eigenvectors are normalized, i.e. that eHi ei = 1. The issue of how tocompute the eigenstructure of a matrix will not be covered here. It’s the utility of it thatwe will focus on. Students should know how to find the eigenstructure of a matrix, and maybe required to solve a 2× 2-dimensional problem by hand. For larger problems, we will relyon Matlab.

Concerning (59), we have the following equivalent expression:

Rxx E = E Λ (60)

where E = [e1, e2, · · · , eN ] is the eigenvector matrix and Λ = Diagλ1, λ2, · · · , λN is thematrix of eigenvalues. For invertible E, we have

Rxx = E Λ E−1 (61)

Λ = E−1 Rxx E . (62)

Eq. (62) is the diagonalization of matrix Rxx.

Page 99: ECE 8072, Statistical Signal Processing, Fall 2010

Kevin Buckley - 2010 87

5.4.2 Properties of the Eigenstructure of the Correlation Matrix

The following properties of the eigenstructure of the correlation matrix are a result of thecomplex-symmetry and positive-definite properties of the correlation matrix.

• Property 1 - Function of Matrices: First consider a square matrix Rxx raised to a powerk:

Rkxx = (E Λ E−1) (E Λ E−1) · · · (E Λ E−1) = E Λk E−1 , (63)

since E−1E = IN . From Eq. (63) and Taylor series expansion, the function

F(Rxx) = E F(Λ) E−1 , (64)

whereF(Λ) = DiagF(λ1),F(λ2), · · · ,F(λN) . (65)

So, for exampleR−1

xx = E Λ−1 E−1 , (66)

where Λ−1 = Diagλ−11 , λ−1

2 , · · · , λ−1N , and

R−1/2xx = E Λ−1/2 E−1 , (67)

where Λ−1/2 = Diagλ−1/21 , λ

−1/22 , · · · , λ

−1/2N .

• Property 2 - Eigenvalues of a Correlation Matrix: The eigenvalues of a correlationmatrix Rxx are real and non-negative. As proof, if we premultiply Eq(59) by eHi , weget

eHi Rxx ei = λi eHi ei (68)

so that, since the eigenvectors are normalized, and correlation matrix Rxx is positivesemi-definite,

λi = eHi Rxx ei ≥ 0 . (69)

Note that |Rxx| =∏N

i=1 λi. So, if Rxx is positive definite, |Rxx| > 0, and R−1xx exists,

since if the determinant of a matrix is non zero, its inverse exists. (Concerning theexistence of the inverse of a matrix, see Property 1.)

Page 100: ECE 8072, Statistical Signal Processing, Fall 2010

Kevin Buckley - 2010 88

• Property 3 - Orthogonal Eigenvectors: For distinct eigenvalues (i.e. forλi 6= λj , i 6= j), the eigenvectors of the correlation matrix Rxx are orthonormal, i.e.

eHi ej = δ(i− j) . (70)

Proof:

• Property 3a - Repeated Eigenvalues: For all λi (i.e. distinct or not), a set of orthonormal

eigenvectors exist. One proof of this uses the Schure decomposition of Rx, which isbeyond the scope of this Course.

One important consequence of Property 3a is that

EH E = IN (71)

which means thatE−1 = EH . (72)

That is, E is unitary.

Page 101: ECE 8072, Statistical Signal Processing, Fall 2010

Kevin Buckley - 2010 89

• Property 4 - Spectral Decomposition:

Proof:

• Property 5 - Trace:

Proof:

Page 102: ECE 8072, Statistical Signal Processing, Fall 2010

Kevin Buckley - 2010 90

5.4.3 Orthogonalization of the Observation Vector

Consider the N ×N linear transformation

y = EH x (73)

where E is the eigenvector matrix of Rxx, the correlation matrix of x. Then we know that

ηy

= EH ηx

, (74)

which is no big deal. We also have that

Ryy = EEHx xHE = EH Rxx E = Λ . (75)

This means thatEyiy

∗j = λi δ(i− j) . (76)

Since

x = E y =N∑

i=1

yi ei , (77)

we see that λi, the power of yi, is the power of the observation x projected onto ei. Thisis a useful result! This eigenvector transformation orthogonalizes the observation, and givesa decomposition of the observation power with respect to the orthogonal eigenvector basis.The eigenvalues are the powers in the directions of the eigenvectors.

Illustration:

Page 103: ECE 8072, Statistical Signal Processing, Fall 2010

Kevin Buckley - 2010 91

5.5 Diagonalization of the Covariance matrix and Decorrelationof the Observation

Often a random observation will be zero mean, so that the correlation matrix Rxx is equalto the covariance matrix Cxx. In this case, we know that the eigenstructure of Cxx is that ofRxx and all of the properties follow. Generally, if x has mean η

x, then the eigenstructure of

Rxx and Cxx will differ. However, since as is the case with Rxx, Cxx is complex-symmetricand positive semidefinite, the eigenstructure of Cxx will have all the properties of that ofRxx. The only exception concerns Property 5 above, where TrCxx does not represent thepower of the observation, but the power of its variation from its mean.

The eigenstructure of an N×N -dimensional covariance matrix Cxx is the set ofN solutionsto

Cxx ei = λi ei i = 1, 2, · · · , N , (78)

where the λi; i = 1, 2, · · · , N are the ordered eigenvalues (λi ≥ λi+1) andei; i = 1, 2, · · · , N the orthonormal eigenvectors of Cx. Equivalently,

Cxx E = E Λ , (79)

soCxx = E Λ E

H. (80)

Often, we assume that the observation is zero mean, and we use the notation E, Λ forthe covariance matrix.

Page 104: ECE 8072, Statistical Signal Processing, Fall 2010

Kevin Buckley - 2010 92

5.6 Gaussian Random Observation and pdf Contours

The way of thinking presented here is very useful in understanding, for example, performanceof adaptive filtering and estimation problems.

In Subsection 5.4.3 above, we showed that the transformation y = EH x results in anorthogonal observation. This can result in design, analysis and processing advantages. Herewe show how a similar transformation simplifies the Gaussian observation problems.

Let x be an N -dimensional real-valued Gaussian observation with pdf

fx(x) =1

(2π)N/2|Cxx|1/2e−

1

2(x−η

x)TC−1

xx (x−ηx) . (81)

fx(x) is an N -dimensional function. All values of x such that

(x− ηx)TC−1

xx (x− ηx) = c , (82)

where c is a constant, form an equi-level contour of fx(x). The idea is similar for complex-valued x.

Example 5.5 - an N = 2 real-valued example:

Solution:

Page 105: ECE 8072, Statistical Signal Processing, Fall 2010

Kevin Buckley - 2010 93

The pdf of fy(y) will be easier to work with than fx(x). Example 5.5 establishes that

the transformation y = EH

(x − ηx) is useful in designing, analyzing and processing

multiple Gaussian random variables. Now let’s develop a visualization of the effect of thistransformation on the pdf.

Example 5.6 - An N = 2 real-valued detection example:

Solution:

We can deduce from Subsection 5.4.3 and Section 5.4 that the eigenvalues of the covariancematrix of x (i.e. the eigenvalues of the correlation matrix of x−η

x) give the powers along the

directions of the corresponding eigenvectors of the variation of x from its mean. With thisSection, we now see how, for Gaussian pdfs, these eigenvalues and eigenvectors also dictatethe shape of the pdf fx(x).

Page 106: ECE 8072, Statistical Signal Processing, Fall 2010

Kevin Buckley - 2010 94

5.7 Sample Estimate of the Mean, Correlation & Covariance andthe Singular Value Decomposition (SVD)

Consider M realizations of an N × 1 dimensional observation x:

xi ; i = 1, 2, · · · ,M . (83)

So all the xi are drawn from the same pdf. The sample mean is defined as

ηx

=1

M

M∑

i=1

xi . (84)

This is an estimator of ηx. (Later in the Course we will study the performance of this

estimator and approaches to designing estimators.) Similarly we have, as an estimator ofRxx, the sample correlation

Rxx =1

M

M∑

i=1

xi xHi , (85)

and as an estimator of the Cxx, the sample covariance

Cxx =1

M − 1

M∑

i=1

(xi − ηx) (xi − η

x)H . (86)

Let W = [x1, x2, · · · , xM ] be the N ×M data matrix. Then

Rxx = S =1

MW WH . (87)

We wish to consider the eigenstructure of S:

S = E Λ EH

. (88)

Though Λ and E are estimates of the eigenstructure of Rxx, we will not dwell on this here.Instead, also consider

S′

=1

MWH W = E

Λ′

E′H

. (89)

Finally consider the SVD of W :

W = V 1 Σ V H2 , (90)

where V 1 is the N ×N unitary matrix whose columns are the left singular vectors, V 2 is theM×M unitary matrix whose columns are the right singular vectors, and, assuming M > N ,

Σ =[

Σ1 0N×(M−N)

]

, (91)

whereΣ1 = Diagσ1, σ2, · · · , σN (92)

is the singular value matrix, σi ≥ σi+1 ≥ 0, and 0N×(M−N) is the N × (M − N) matrix ofzeros.

The point here is to relate V 1, V 2 and Σ to E, E′

, Λ and Λ′

.

Page 107: ECE 8072, Statistical Signal Processing, Fall 2010

Kevin Buckley - 2010 95

• The left singular vectors of W are the eigenvectors of S. The eigenvalues of S are theσ2

i

M.

Solution:

• The right singular vectors of W are the eigenvectors of S′

. The eigenvalues of S′

are

theσ2

i

M.

Solution:

• The non-zero eigenvalues of S′

are the eigenvalues of S.

Solution:

Page 108: ECE 8072, Statistical Signal Processing, Fall 2010

Kevin Buckley - 2010 94

ECE8072Statistical Signal Processing

Villanova UniversityECE Department

Prof. Kevin M. Buckley

Part2a

Random Processes

n

.... ....

n[n]

Page 109: ECE 8072, Statistical Signal Processing, Fall 2010

Kevin Buckley - 2010 95

Contents

6 Random Processes 966.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 966.2 Correlation Functions and Power Spectral Density . . . . . . . . . . . . . . . 104

6.2.1 DT Correlation Function . . . . . . . . . . . . . . . . . . . . . . . . . 1046.2.2 DT Power Spectral Density . . . . . . . . . . . . . . . . . . . . . . . 1096.2.3 CT Correlation Function & Power Spectral Density . . . . . . . . . . 1186.2.4 Sampling Wide-Sense Stationary CT Random Processes . . . . . . . 123

6.3 Note on Cyclostationary Random Processes . . . . . . . . . . . . . . . . . . 1266.4 Correlation & Covariance Matrices . . . . . . . . . . . . . . . . . . . . . . . 129

6.4.1 Random Vector Observation . . . . . . . . . . . . . . . . . . . . . . . 1296.4.2 Wide-Sense Stationary Random Processes . . . . . . . . . . . . . . . 131

6.5 Discrete Karhunen-Loeve Transformation (DKLT) . . . . . . . . . . . . . . . 1346.5.1 Orthogonal Expansion of a Random Vector . . . . . . . . . . . . . . . 1346.5.2 The Discrete Karhunen-Loeve Transformation (DKLT) . . . . . . . . 138

6.6 Narrowband Signals in Additive White Noise . . . . . . . . . . . . . . . . . . 1406.6.1 Correlation Matrix Eigenstructure . . . . . . . . . . . . . . . . . . . . 1436.6.2 Two Narrowband Signals in White Noise Example . . . . . . . . . . . 148

6.7 Whitening . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 151

List of Figures

16 Illustration of a realization of a white-noise random process. . . . . . . . . . 9717 Illustration of the correlation function for a wide-sense stationary process. . . 10018 Illustration of the correlation function for a cyclostationary process: (a) wide-

sense stationary; (b) cyclostationary. . . . . . . . . . . . . . . . . . . . . . . 12619 Illustration of formation of the averaged correlation function of a cyclosta-

tionary process from its 2-D corelation function. . . . . . . . . . . . . . . . . 128

Page 110: ECE 8072, Statistical Signal Processing, Fall 2010

Kevin Buckley - 2010 96

6 Random Processes

In this Chapter of the Course we consider random processes (a.k.a. random signals, stochasticprocesses, stochastic signals). We consider discrete-time (DT) random processes (a.k.a.random sequences) and continuous-time (CT) random processes, and we briefly discuss thesampling of an underlying CT random process to obtain a corresponding DT random process.We begin in Sections 6.1 & 6.2 by defining, exemplifying and probabilistically characterizingrandom processes. In the rest of the Chapter, i.e. in Section 6.3 through 6.7, we focus on therepresentation of a finite dimensional vector observation of a random process. In these latterSection, we basically relate topics in Chapter 5 of the Course (on random vectors) to topicsin Section 6.1 & 6.2. The emphasis here is that in most applications and implementations,random signals are first samples and represented as a finite dimensional vector, so that thesubsequent processing of the signal e.g. FIR filtering, FFT calculation) is based on a randomvector representation of it. In relating the results of these calculations to characteristics ofthe underlying CT signal, it is critical to understand the relationship between probabilisticcharacteristics of the CT signal to those of its vector representation.

Concerning the Course Text, the emphasis in Chapter 9 is on CT random processes, withonly a brief discussion on DT processes appearing in Section 9.4. Our emphasis in thisChapter of the Course is more on DT random processes. So, although some of the materialin this Chapter of the Course is complemented by parallel discussions in Chapter 9 of theText, emphasis in this Chapter on the Course is significantly different from the Text. As withthe rest of this Course, you are responsible only for the topics in these Notes. Supportingdiscussions in the text will serve to reinforce and broaden your understanding. SubsequentChapters in the Course Text deal with more advanced issues (some of which we will coverin Part 3 of this Course).

6.1 Introduction

A Random Process is a mathematical representation (a model) of a continuous random signalor discrete random sequence. We usually call the signal or sequence a random process. Inthe context of this Course, a random process will most often be a function of time. However,the concept of random process is more general than that, being applicable to function ofdirection on an image or any other independent variable of interest. It also generalized tomultiple dimensions (e.g. in biomedical imaging we can consider a 4-dimensional randomfunction of time and 3-dimensional space).

A CT random process x(t) is a signal that, at every point in time, is a random variable.A DT random process x[n] is a random variable at each sample time n. Since definitionsand terminology are the same, below we treat CT and DT at the same time.

Page 111: ECE 8072, Statistical Signal Processing, Fall 2010

Kevin Buckley - 2010 97

The Ensemble, and Realizations

The ensemble is the set of all possible outcomes (i.e. the sample space) of a random experi-ment that generates a random process. A realization is an outcome.

Example 6.1 - Discrete-time ”white” noise: You have likely heard the expres-sion white noise before. Qualitatively, this term suggest totally random in somesense. Figure 16 below illustrates one possible realization of a white noise randomprocess n[n].

n

.... ....

n[n]

Figure 16: Illustration of a realization of a white-noise random process.

It is drawn to give a visual sense of randomness. We will see below that, bydefinition, white noise means that En[n] = 0; ∀n, E|n[n]|2 = σ2

n, andEn[n] · n∗[m] = 0; ∀n 6= m. That is, all the random variables that constitutethe random process are zero-mean. They all have the same variance. And allpairs of these random variables are uncorrelated.

So why the term “white”? We will answer this question as an Example later on.

Page 112: ECE 8072, Statistical Signal Processing, Fall 2010

Kevin Buckley - 2010 98

Example 6.2 - A deterministic random process: given the CT random signalx(t) = A cos(ω0t + Φ), where A and ω0 are constants, and Φ is random withpdf

fφ(φ) =

12π

0 ≤ φ < 2π0 otherwise

,

determine fx(t1)(x(t1)), the pdf of the random variable x(t1), for any time t1.

Solution: This is a scalar function of a random variable problem, with a slighttwist in that the desired pdf could end up being parameterized by time t1 (i.e.we could end up with different pdfs for different times). Remember that, forany time t1, our notation for the random variable of interest is x(t1), and x(t1)represents the values it can take on.

Using techniques developed in Chapter 3 of the Course, it is not too difficult toshow that

fx(t1)(x(t1)) =

1π(A − x(t1))−1/2 |x(t1)| ≤ A

0 otherwise.

There are a couple of interesting issues suggested with Examples 6.1 & 6.2 that we will beexploring below. The first has to do with the fact that we sometimes call random processeslike the one in Example 6.2 deterministic random processes. We do this not because wesimply like to use oxymorons when introducing new concepts so as to confuse students toas great of an extent as possible. In addition to that, perhaps, with Examples 6.1 & 6.2we want to introduce random processes with clearly different attributes so as to develop ananticipation of what we are looking for in characterizations of a random process. Generally,we are interested in probabilistic characteristics of random processes that somehow exhibitthe degree of randomness across time. The x(t) in Example 6.2 is random in some sensebecause of the randomness of its phase, but any realization is deterministicly describedby an equation that is parameterized by (in this case) a single random variable. In somesense it is not as random as the signal n(t) described in Example 6.1. We desire randomprocess probabilistic characterizations that are both reasonable to work with and effectivein representing some useful indicators of randomness across time.

A second interesting issue suggested by both Examples 6.1 & 6.2 is that of the consistency(or lack there of) of characteristics across time. The random process n[n] in Example 6.1has a constant mean and variance across time, and random variables from different times areuncorrelated regardless of where they are in time. Generally speaking, some probabilisticcharacteristics are constant across time – i.e. they are stationary. Similarly, in Example6.2, we see that the pdfs that describe the random variables at different times are in somesense stationary across time. Combined, these two Examples suggest that there may bedifferent kinds of stationarity. It also can be expected that some random processes maynot be stationary in some sense, but that those that are stationary will likely be easier tocharacterize and process.

Page 113: ECE 8072, Statistical Signal Processing, Fall 2010

Kevin Buckley - 2010 99

The Complete Probabilistic Characterization

The complete probabilistic characterization of a random process x(t) is considered to bethe set of all joint pdf’s

fx(t1),x(t2),···,x(tN )(x(t1), x(t2), · · · , x(tN)) (1)

for all positive integers N and all sets of times t1, t2, · · · , tN. For most random processes,this characterization is completely impractical.

Partial Characterization: Moments

Examples:

• The mean function (described here for a DT random process) is

ηx[n] = Ex[n] =∫ ∞

−∞x[n] fx[n](x[n]) dx[n] . (2)

Note that dx[n] denotes integration across the variable x[n]. In general, the mean canbe time varying.

• The correlation function (described here for a complex-valued CT random process) is

Rxx(t1, t2) = Ex(t1) x∗(t2) =∫ ∞

−∞

∫ ∞

−∞x(t1) x

∗(t2) fx(t1),x(t2)(x(t1), x(t2)) dx(t1) dx(t2) .

(3)This is a two dimensional function of the times t1 and t2 of the two samples which arebeing correlated.

Note that these functions are computed as ensemble averages. If you wanted to determine(e.g. estimate) them from observations, you would have to average over different realizations.Often, only one realization is available. Then, what can you do?

Page 114: ECE 8072, Statistical Signal Processing, Fall 2010

Kevin Buckley - 2010 100

StationarityQualitatively, stationarity of a random process means that its probabilistic characteris-

tics do not change with time. There are different types of stationarity corresponding todifferent probabilistic characteristics considered. Here we look at several important types ofstationarity (described for either DT or CT, but applicable to both):

• Strict-sense stationarity (SSS) means that

fx(t1),x(t2),···,x(tN )(x(t1), x(t2), · · · , x(tN )) = fx(t1+T ),x(t2+T ),···,x(tN+T )(x(t1+T ), x(t2+T ), · · · , x(tN+T ))(4)

for all T , positive integers N , and all sets of t1, t2, · · · , tN.

• Stationary in the mean means that

Ex[n] = ηx[n] = ηx . (5)

• Wide-sense stationarity (WSS) is defined as stationarity in the mean plus stationarityin correlation. Stationarity in correlation means that

Rxx(t1, t2) = Ex(t1) x∗(t2) = Ex(t1 + T ) x∗(t2 + T ) (6)

= Rxx(t1 − t2) (7)

= Rxx(τ) (8)

for all T , t1, t2, where τ = t1 − t2 is the lag between t1 and t2. Figure 17 illustratesthis property of Rxx(t1, t2) for wide-sense stationary random process x(t).

t 1

t 2

21t −t = −1 21t −t = 0

21t −t = 1

1−1

the correlation function is constant along any of these lines

For Wide−Sense Stationary (WSS) X(t),

completely describes the autocorrelation function

For WSS X(t), for any t , 0

xxR (t ,t )1 2

xxe.g. R ( ,0)τ

xx τ0 0R (t + ,t ); all τ

Figure 17: Illustration of the correlation function for a wide-sense stationary process.

Page 115: ECE 8072, Statistical Signal Processing, Fall 2010

Kevin Buckley - 2010 101

Temporal Averages

Temporal averages are averages over time of one realization of a random process. This isas opposed to ensemble averages which are averages over realizations. For example, sometime averaged means are

< x[n] >n0,n1=

1

n1 − n0 + 1

n1∑

n=n0

x[n] (9)

< x[n] > = limM→∞

1

2M + 1

M∑

n=−M

x[n] (10)

< x(t) > = limT→∞

1

2T

∫ T

−Tx(t) dt . (11)

Ergodicity

Qualitatively, a random process is ergodic if temporal averages give ensemble averages.

• Ergodic in mean:< x[n] > = Ex[n] = ηx . (12)

• Ergodic in correlation:

< x[n]x∗[n−m] > = limM→∞

1

2M + 1

M∑

n=−M

x[n] x∗[n−m] (13)

= Rxx[m] . (14)

Note that for a random process to be ergodic in some sense, it must be stationary in thatsense.

Page 116: ECE 8072, Statistical Signal Processing, Fall 2010

Kevin Buckley - 2010 102

Example 6.3 - two discrete time Gaussian random process: this example illus-trates two DT random processes. The input is wide-sense stationary. Strictlyspeaking the output, called a Gauss-Markov process, is not wide-sense stationary,though asymptotically (as n −→ ∞) it behaves like one.

Let w[n] be a sequence of real-valued, zero-mean, uncorrelated Gaussian randomvariables, each with variance σ2

w. Since the w[n] are uncorrelated, they are sta-tistically independent. Since they are also all zero-mean with the same variance,they are iid (i.e. independent and identically distributed). Let

w[n] = [w[n], w[n+ 1], · · · , w[n+N − 1]]T

be a random vector obtained from w[n]. We say that w[n] is a vector observationof w[n].

For real-valued constant ρ with |ρ| < 1, consider a new random process

x[n] =n∑

k=0

w[n] ρn−k ,

and corresponding random vector observation

x[n] = [x[n], x[n + 1], · · · , x[n+N − 1]]T .

Determine the joint pdfs of w[n] and x[n].

Solution: Concerning random vector w[n], we know its elements are jointly Gaus-sian, and its mean vector and covariance matrix are

ηx[n] = η

w= 0N ; Cww[n] = Cww = σ2

w IN .

Thus, its pdf is

fw[n](w[n]) =1

(2π)N/2 |Cww|1/2e−wT [n] C−1

ww w[n]/2

=N−1∏

i=0

1√

2πσ2w

e−w2[n−1]/2σ2w .

Since

x[n] = T x[n]; T =

1 0 0 · · · 0ρ 1 0 · · · 0ρ2 ρ 1 · · · 0...

......

. . ....

ρN−1 ρN−2 ρN−3 · · · 1

,

we have that ηx= T η

w= 0N . This shows that, like w[n], the mean of random

process x[n] does not change with time. They are both stationary in the mean.

Page 117: ECE 8072, Statistical Signal Processing, Fall 2010

Kevin Buckley - 2010 103

It can be shown that

Cxx = T Cww T T =σ2w

1− ρ2

1 ρ ρ2 · · · ρN−1

ρ 1 ρ · · · ρN−2

ρ2 ρ 1 · · · ρN−3

......

.... . .

...ρN−1 ρN−2 ρN−3 · · · 1

+σ2wρ

2

1− ρ2

1 ρ ρ2 · · · ρρ ρ2 ρ3 · · · ρρ2 ρ3 ρ4 · · · ρ...

......

. . ....

ρN−1 ρN ρN+1 · · · ρ

Note that the diagonal elements of Cxx, i.e. the individual random variablecovariances Cxx[n, n] = Cx[n],x[n]; n = 0, 1, 2, · · ·, vary with n. Thus, the randomprocess x[n] does not have constant variance over time. It is not WSS. However,

as n −→ ∞, Cx[n],x[n] −→ σ2w

1−ρ2. It is asymptotically WSS.

Finally, since we know that x[n] is Gaussian,

fx[n](x[n]) =1

(2π)N/2 |Cxx|1/2e−xT [n] C−1

xx x[n]/2 .

The input random process w[n] in Example 6.3 is an example of the “white-noise” processintroduced in Example 6.1. As noted at the beginning of Example 6.3, the output randomprocess x[n] is called a Gauss-Markov process. Strictly speaking, it is not wide-sense sta-tionary, however asymptotically it is. The two matrices presented above in the description ofthe covariance matrix Cxx have special structures, termed respectively, a Toeplitz (constantalong any diagonal) structure and a Hankel (increasing powers along rows and columns)structure.

The purpose of Example 6.3 is twofold: 1) to provide an example of a commonly en-countered random process which is more complex than the simple processes described inExamples 6.1 & 6.2 – one which shows varying probabilistic attributes across time; and 2) toreinforce the idea that the covariance matrix of a vector observation of a random process canhave substantial structure that may be exploited (i.e. to deduce information of the randomprocess or to derive efficient algorithms for processing it). As we proceed through the restof this Course, we will at times focus on the first purpose – on exploiting the structure ofthe covariance matrix of a vector observation of a random process to extract information ofinterest about that random process.

Page 118: ECE 8072, Statistical Signal Processing, Fall 2010

Kevin Buckley - 2010 104

6.2 Correlation Functions and Power Spectral Density

This Section corresponds to Chapters 8 & 9 of the Course Text.

We consider wide-sense stationary random processes. We know that for such processes themean is constant, e.g.

Ex[n] = ηx[n] = ηx , (15)

and the correlation function reduces to a one-dimensional function of lag (the distance intime between random process samples).

Here we focus on important properties of wide-sense stationary random processes, andon the relationship between the correlation function of a wide-sense stationary process andits frequency content. We begin with a consideration of DT processes, and then move toCT processes. We also cover the issue of sampling CT processes to form corresponding DTprocesses.

6.2.1 DT Correlation Function

This Section corresponds to Section 9.4 of the Course Text.

The correlation function (also called the autocorrelation function) of a DT process is

Rxx[n1, n0] = E x[n1] x∗[n0] . (16)

The (auto) covariance function is

Cxx[n1, n0] = E (x[n1]− ηx[n1]) (x[n0]− ηx[n0])∗) . (17)

For wide-sense stationary x[n], in addition to

ηx[n] = ηx , (18)

we have that

Rxx[n1 +m,n1] = Rxx[n1, n1 −m] (19)

= Rxx[n1 − (n1 −m)] (20)

= Rxx[m] (21)

andCxx[n1 +m,n1] = Cxx[n1, n1 −m] = Cxx[m] . (22)

Also,Rxx[m] = Cxx[m] + |ηx|2 . (23)

The independent variable for these functions, m, is the distance in time between samplesand is termed the lag.

Page 119: ECE 8072, Statistical Signal Processing, Fall 2010

Kevin Buckley - 2010 105

Example 6.4 - the complex sinusoidal process described in Example 6.2:

Page 120: ECE 8072, Statistical Signal Processing, Fall 2010

Kevin Buckley - 2010 106

Example 6.5 - a zero-mean, wide-sense stationary, uncorrelated ”noise” througha simple FIR filter:

Page 121: ECE 8072, Statistical Signal Processing, Fall 2010

Kevin Buckley - 2010 107

Properties of the autocorrelation and autocovariance functions of a wide-sense stationaryDT random process:

1. Symmetry:

Rxx[−m] = Ex[n] x∗[n +m] (24)

= Ex[n−m] x∗[n] = Ex[n] x∗[n−m]∗ (25)

= R∗xx[m] . (26)

Similarly,Cxx[−m] = C∗

xx[m] . (27)

The autocorrelation and autocovariance functions are complex symmetric. Check re-sults from Examples 6.4, 6.5 – see Example 6.6 below.

2. Power and variance: The power of a wide-sense stationary random process is

E|x[n]|2 = Rxx[0] . (28)

Since the correlation coefficient between any two samples of x[n] is bounded in mag-nitude by one, it follows that

Rxx[0] ≥ |Rxx[m]| ∀m . (29)

Similarly,Cxx[0] ≥ |Cxx[m]| ∀m , (30)

where Cxx[0] is the variance of x[n]. What is the power and variance of each signalconsidered in Examples 6.4, 6.5? See Example 6.6 below.

3. Positive Semidefinite: By definition, any discrete function f [m] is positive semidefiniteif, for any function a[n],

∞∑

n1=−∞

∞∑

n0=−∞a∗[n1] f [n0 − n1] a[n0] ≥ 0 . (31)

Consider the autocorrelation function Rxx[m] of a wide-sense stationary process. LetxM = [x(−M),x(−M+1), · · · ,x(0), · · · ,x(M−1),x(M)]T . Denote the correlation ma-trix of xM as RM . Also, let aM = [a(−M), a(−M +1), · · · , a(0), · · · , a(M − 1), a(M)]T

be a vector of samples of an arbitrary function a[n]. Then,

∞∑

n1=−∞

∞∑

n0=−∞a∗[n1] Rxx[n0 − n1] a[n0] = lim

M→∞aHM RM aM . (32)

By the positive semidefinite property of the correlation matrix RM , this proves that theautocorrelation function Rxx[m] is positive semidefinite. An alternative proof, basedon the power spectral density, will be presented in Subsection 6.2.2.

Similarly, the autocovariance function Cxx[m] is positive semidefinite.

Page 122: ECE 8072, Statistical Signal Processing, Fall 2010

Kevin Buckley - 2010 108

Example 6.6 - check autocorrelation function properties for Examples 6.4, 6.5.

Solution:

Page 123: ECE 8072, Statistical Signal Processing, Fall 2010

Kevin Buckley - 2010 109

6.2.2 DT Power Spectral Density

Let x[n] be a wide-sense stationary random process, and let x[n] be a realization. Denotethe Discrete-Time Fourier Transform (DTFT) of a 2N + 1 sample window of it as

XN(ejω) = XN(ω) =

N∑

n=−N

x[n] e−jnω . (33)

XN(ω) is a realization of the random function

XN(ejω) =

N∑

n=−N

x[n] e−jnω . (34)

The power spectral density of x[n] is defined as

Sxx(ejω) = lim

N→∞

1

2N + 1E|XN(e

jω)|2 . (35)

The power spectral density is the expected value of the magnitude-squared of the DTFT ofa window of the random process, as the window width approaches infinity. This definitionof the power spectral density captures what we want as a measure of the frequency contentof a random discrete-time sequence.

Page 124: ECE 8072, Statistical Signal Processing, Fall 2010

Kevin Buckley - 2010 110

Let’s take an alternative view of Sxx(ω). First consider the term on the right of Eq. (35),without the limit and expectation:

1

2N + 1|XN(e

jω)|2 =1

2N + 1

N∑

n=−N

x[n] e−jnωN∑

l=−N

x∗[l] ejlω (36)

=1

2N + 1

N∑

n=−N

N∑

l=−N

x[n] x∗[l] e−j(n−l)ω (37)

=1

2N + 1

N∑

n=−N

n+N∑

m=n−N

x[n] x∗[n−m] e−jmω . (38)

Taking the expected value, we have

1

2N + 1E|XN(e

jω)|2 =1

2N + 1

N∑

n=−N

n+N∑

m=n−N

Rxx[m] e−jmω . (39)

Now, taking the limit as N → ∞, we have

Sxx(ejω) = lim

N→∞

1

2N + 1E|XN(e

jω)|2 (40)

= limN→∞

1

2N + 1

N∑

n=−N

∞∑

m=−∞Rxx[m] e−jmω (41)

=∞∑

m=−∞Rxx[m] e−jmω lim

N→∞

1

2N + 1

N∑

n=−N

1 (42)

=∞∑

m=−∞Rxx[m] e−jmω . (43)

Thus, the power spectral density and the correlation function of a wide-sense stationaryprocess form the DTFT pair

Sxx(ejω) =

∞∑

m=−∞Rxx[m] e−jmω (44)

Rxx[m] =1

∫ π

−πSxx(e

jω) ejmω dω . (45)

Page 125: ECE 8072, Statistical Signal Processing, Fall 2010

Kevin Buckley - 2010 111

Table 6.1 - Discrete Time Fourier Transform (DTFT) Pairs

Signal DTFT 1

(∀n) (−π ≤ ω ≤ π)

δ[n− k]; integer k e−jωk

anu[n]; |a| < 1 11−ae−jω

(n+ 1)anu[n]; |a| < 1 1(1−ae−jω)2

(n+r−1)!n!(r−1)!

anu[n]; |a| < 1 1(1−ae−jω)r

a|n|; |a| < 1 1−a2

1+a2−2a cos(ω)

pN [n] = u[n]− u[n−N ] e−jωN−1

2

sin(Nω2)

sin(ω2)

u[n+N1]− u[n− (N1 + 1)]sin(ω(N1+

1

2))

sin(ω2)

sin(Wn)πn

; 0 < W ≤ π

1 0 ≤ |ω| ≤ W0 W < |ω| ≤ π

1 2πδ(ω); |ω| ≤ π

ejnω0; −π ≤ ω0 ≤ π 2πδ (ω − ω0) ; |ω| ≤ π

N−1∑

k=0

(

X [k]

N

)

ej(2π/N)nkN−1∑

k=0

(

X [k]

N

)

δ(

ω − 2π

Nk)

; |ω| ≤ π

1. The DTFT, X(ejω), is periodic in ω with period 2π. In this table we describe X(ejω)over the one period −π ≤ ω ≤ π. The DTFT is the periodic extension of this over allω.

Page 126: ECE 8072, Statistical Signal Processing, Fall 2010

Kevin Buckley - 2010 112

Example 6.7 - Given the correlation function Rxx[m] = σ2x 0.5|m|, determine and

sketch the power spectral density Sxx(ejω).

Solution:

Table 6.2 - Discrete Time Fourier Transform (DTFT) Properties

Property Time Domain Frequency Domain

Periodicity x[n] X(ejω) = X(ej(ω+2π)) ∀ω

Symmetry real-valued x[n] X(e−jω) = X∗(ejω)

Delay x[n− k] X(ejω) e−jωk = |X(ejω)| ej[ 6 X(ejω)−ωk]

Linearity a1x1[n] + a2x2[n] a1X1(ejω) + a2X2(e

jω)

Convolution x[n] ∗ h[n] X(ejω) H(ejω)

Parseval’s Theorem E =∞∑

n=−∞|x[n]|2 E = 1

∫ π

−π|X(ejω)|2dω

Parseval’s Theorem P = 1N

N−1∑

n=0

|x[n]|2 P = 1N2

N−1∑

k=0

|X [k]|2

Multiplication x[n] w[n] 12π

∫ π

−πX(ejλ)W (ej(ω−λ))dλ

Modulation x[n] ejω0n X(ej(ω−ω0))

Page 127: ECE 8072, Statistical Signal Processing, Fall 2010

Kevin Buckley - 2010 113

Properties of the power spectral density of a wide-sense stationary DT random process:

1. Sxx(ejω) is real-valued. This is by definition of Sxx(e

jω).

2. Sxx(ejω) ≥ 0. This is also by definition of Sxx(e

jω).

Alternative proof that R_xx[m] is positive semidefinite:

Define C(a[n]) as

C(a[n]) = Σ_{n_1=−∞}^{∞} Σ_{n_0=−∞}^{∞} a*[n_1] R_xx[n_0 − n_1] a[n_0]    (46)

where R_xx[m] = (1/(2π)) ∫_{−π}^{π} S_xx(e^{jω}) e^{jmω} dω. For R_xx[m] to be positive semidefinite, C(a[n]) must be ≥ 0 for all a[n]. We have that

C(a[n]) = Σ_{n_1=−∞}^{∞} Σ_{n_0=−∞}^{∞} a*[n_1] { (1/(2π)) ∫_{−π}^{π} S_xx(e^{jω}) e^{j[n_0−n_1]ω} dω } a[n_0]

= (1/(2π)) ∫_{−π}^{π} { Σ_{n_1=−∞}^{∞} a*[n_1] e^{−jn_1ω} · Σ_{n_0=−∞}^{∞} a[n_0] e^{jn_0ω} } S_xx(e^{jω}) dω

= (1/(2π)) ∫_{−π}^{π} |A(e^{jω})|² S_xx(e^{jω}) dω    (47)

where A(e^{jω}) is the DTFT of a[n]. Since both |A(e^{jω})|² ≥ 0 and S_xx(e^{jω}) ≥ 0 for all ω and any a[n], C(a[n]) ≥ 0 for any a[n]. Thus R_xx[m] is positive semidefinite. Also note that if S_xx(e^{jω}) > 0 for all ω (i.e. if the random process x[n] is full bandwidth), then R_xx[m] is positive definite.

Corollary:

If S_xx(e^{jω}) < 0 for some ω, R_xx[m] is not a valid correlation function.


3. S_xx(e^{j(ω+2π)}) = S_xx(e^{jω}). This is by the periodicity property of the DTFT.

4. For real-valued R_xx[m], S_xx(e^{−jω}) = S_xx(e^{jω}). This is by a symmetry property of the DTFT.

5. R_xx[0] = (1/(2π)) ∫_{−π}^{π} S_xx(e^{jω}) dω. This is from the inverse DTFT equation, Eq. (45). This indicates why S_xx(e^{jω}) is called the power spectral density – power because it gives us R_xx[0], spectral because it is a function of frequency, and density because power is computed by integrating over it.


6. Consider the window function w[m], so called because it is assumed zero outside some range 0 ≤ m ≤ N. Let W(e^{jω}) be its DTFT. By the multiplication property of the DTFT, the DTFT of the product R_xx[m] w[m] is equal to

(1/(2π)) ∫_{−π}^{π} S_xx(e^{jλ}) W(e^{j(ω−λ)}) dλ = (1/(2π)) S_xx(e^{jω}) ⊛ W(e^{jω}) .    (48)

The DTFT of the windowed autocorrelation function is the circular convolution, divided by 2π, of the power spectral density and the DTFT of the window. W(e^{jω}) is called the smearing function, since through the circular convolution it smears S_xx(e^{jω}).

7. By the convolution property of the DTFT, R_xx[m] ∗ h[m] has DTFT S_xx(e^{jω}) H(e^{jω}), where H(e^{jω}) is the DTFT of h[m]. Also, by the convolution, fold and conjugate properties of the DTFT, the DTFT of R_xx[m] ∗ h[m] ∗ h*[−m] is S_xx(e^{jω}) |H(e^{jω})|².


Example 6.8 - Given x[n] with correlation function R_xx[m] = σ_x² (0.5)^{|m|}, determine the percentage of power in the frequency band |ω| ≤ π/2.

Solution:
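As a numerical cross-check (an illustrative sketch, not the official solution, assuming NumPy is available): integrating the PSD from Example 6.7 over |ω| ≤ π/2 and dividing by the total power R_xx[0] gives roughly 80% for this correlation.

```python
# Cross-check: fraction of power of the process with R_xx[m] = sigma^2 (0.5)^{|m|}
# lying in |w| <= pi/2, computed by numerically integrating the PSD.
import numpy as np

omega = np.linspace(-np.pi, np.pi, 200001)
S = 0.75 / (1.25 - np.cos(omega))                 # PSD / sigma_x^2, from Table 6.1 (a = 0.5)

total_power = np.trapz(S, omega) / (2 * np.pi)    # equals R_xx[0]/sigma_x^2 = 1
band = np.abs(omega) <= np.pi / 2
band_power = np.trapz(S[band], omega[band]) / (2 * np.pi)

print(total_power)                      # ~1.0
print(100 * band_power / total_power)   # percentage of power in |w| <= pi/2 (~80%)
```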

Example 6.9 - White (uncorrelated) noise:

Solution:

Example 6.10 - Band limited signal:

Solution:


Superposition of Random Signals & Periodic Signals:

Consider a random process x[n] that is composed as a superposition of two uncorrelated components: a general wide-sense stationary process x_0[n]; and a periodic component which consists of N mutually uncorrelated, wide-sense stationary complex sinusoidal processes, x_k[n]; k = 1, 2, ..., N, the kth of which has correlation A_k² e^{jω_k m} as illustrated in Example 6.4. Assume |ω_k| ≤ π; k = 1, 2, ..., N.

First note that the x_k[n]; k = 0, 1, ..., N are assumed uncorrelated. As a consequence,

E{|x[n]|²} = E{|x_0[n]|²} + E{|x_1[n]|²} + ··· + E{|x_N[n]|²} ,    (49)

i.e. the expectations of all of the cross terms, E{x_i[n] x_j*[n]}; i ≠ j, are zero. More generally,

R_xx[m] = Σ_{k=0}^{N} R_{x_k x_k}[m] .    (50)

Next note that since each component is wide-sense stationary, and by the linearity property of the DTFT, we have

S_xx(e^{jω}) = S_{x_0 x_0}(e^{jω}) + S_{x_1 x_1}(e^{jω}) + ··· + S_{x_N x_N}(e^{jω}) .    (51)

Noting that the DTFT of a complex sinusoid e^{jω_k m} is 2π δ(ω − ω_k); |ω| ≤ π,

S_xx(e^{jω}) = S_{x_0 x_0}(e^{jω}) + 2π Σ_{k=1}^{N} A_k² δ(ω − ω_k) ,  |ω| ≤ π .    (52)

Example 6.11 - Periodic random processes:

Solution:


6.2.3 CT Correlation Function & Power Spectral Density

This Subsection corresponds to Sections 9.1 & 9.3 (up to p. 412) of the Course Text.

Let x(t) be a continuous-time random process. Its mean is

η_x(t) = E{x(t)} .    (53)

Its correlation function is

R_xx(t_1, t_2) = E{x(t_1) x*(t_2)} .    (54)

Its covariance function is

C_xx(t_1, t_2) = E{(x(t_1) − η_x(t_1)) (x(t_2) − η_x(t_2))*} .    (55)

For wide-sense stationary x(t),

η_x(t) = η_x ,    (56)

R_xx(t + τ, t) = R_xx(t, t − τ) = R_xx(τ) ,    (57)

C_xx(t, t − τ) = C_xx(τ) = R_xx(τ) − |η_x|² .    (58)

Properties of R_xx(τ) and C_xx(τ) (assuming wide-sense stationary x(t)):

1. Symmetry: R_xx(−τ) = R_xx*(τ), C_xx(−τ) = C_xx*(τ).

2. R_xx(τ) and C_xx(τ) are positive semidefinite.

3. Power: E{|x(t)|²} = R_xx(0) ≥ |R_xx(τ)|.


Power Spectral Density:

Let x(t) be a wide-sense stationary random process. Define

X_T(jΩ) = ∫_{−T}^{T} x(t) e^{−jΩt} dt .    (59)

The power spectral density is defined as

S_xx(jΩ) = lim_{T→∞} (1/(2T)) E{|X_T(jΩ)|²} .    (60)

Starting with

(1/(2T)) |X_T(jΩ)|² = (1/(2T)) ∫_{−T}^{T} ∫_{−T}^{T} x(t_1) x*(t_2) e^{−jΩ(t_1−t_2)} dt_1 dt_2 ,    (61)

taking the expected value, and letting T go to infinity, it can be shown that the correlation function R_xx(τ) and the power spectral density S_xx(jΩ) are related as the continuous-time Fourier transform (CTFT) pair:

S_xx(jΩ) = ∫_{−∞}^{∞} R_xx(τ) e^{−jΩτ} dτ    (62)

R_xx(τ) = (1/(2π)) ∫_{−∞}^{∞} S_xx(jΩ) e^{jΩτ} dΩ .    (63)

See Table 9.1, p. 409, of the Course Text.


Example 6.12 - A bandlimited signal: Let S_xx(jΩ) = 1 for |Ω| < B, and S_xx(jΩ) = 0 for |Ω| > B. Find R_xx(τ).

Solution:
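The notes leave the solution blank; as a worked check (not the official solution), the inverse CTFT of Eq. (63) applied to the bandlimited model above gives

R_xx(τ) = (1/(2π)) ∫_{−B}^{B} e^{jΩτ} dΩ = (e^{jBτ} − e^{−jBτ})/(2πjτ) = sin(Bτ)/(πτ) ,

with R_xx(0) = B/π by continuity.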

Example 6.13 - A sinusoidal random process:

Solution:


Properties of S_xx(jΩ):

1. S_xx(jΩ) is real-valued.

2. S_xx(jΩ) ≥ 0.

3. For real-valued R_xx(τ), S_xx(−jΩ) = S_xx(jΩ).

4. R_xx(0) = (1/(2π)) ∫_{−∞}^{∞} S_xx(jΩ) dΩ.

5. Consider the window function w(τ), and let W(jΩ) be its CTFT. The product R_xx(τ) w(τ) has CTFT

(1/(2π)) ∫_{−∞}^{∞} S_xx(jλ) W(j(Ω−λ)) dλ = (1/(2π)) S_xx(jΩ) ∗ W(jΩ) .    (64)

W(jΩ) is called the smearing function, since through the convolution it smears S_xx(jΩ).

6. By the convolution property of the CTFT, R_xx(τ) ∗ h(τ) has CTFT S_xx(jΩ) H(jΩ), where H(jΩ) is the CTFT of h(τ).


Example 6.14 - A windowed correlation function:

Solution:


6.2.4 Sampling Wide-Sense Stationary CT Random Processes

This Section corresponds to Section 10.5 of the Course Text.

Assume that x_a(t) is a wide-sense stationary CT random process with correlation function R_{x_a x_a}(τ), corresponding power spectral density S_{x_a x_a}(jΩ) and mean η_{x_a}. Recall that

R_{x_a x_a}(τ) = E{x_a(t) x_a*(t − τ)} .    (65)

Let x[n] = x_a(nT) be a DT random process obtained by sampling x_a(t) at sample rate f_s = 1/T, where T is the sampling interval. The mean function of x[n] is

η_x[n] = E{x[n]} = E{x_a(nT)} = η_{x_a} .    (66)

We see that x[n] is first-order stationary since, by assumption, x_a(t) is. The correlation function of x[n] is

R_xx[n, n−m] = E{x[n] x*[n−m]} = E{x_a(nT) x_a*([n−m]T)}    (67)

= R_{x_a x_a}(mT) = R_xx[m] .    (68)

Note that x[n] is wide-sense stationary because x_a(t) is assumed to be. Also note that the DT signal correlation function R_xx[m] is a sampling of the CT signal correlation function R_{x_a x_a}(τ).

Paralleling the development for deterministic signals of the relationship between the DT signal Fourier transform and the underlying (sampled) CT signal Fourier transform, we have that, since R_xx[m] = R_{x_a x_a}(mT), the power spectral density of x[n] is

S_xx(e^{jω}) = (1/T) Σ_{k=−∞}^{∞} S_{x_a x_a}(j(ω − k2π)/T) ,    (69)

where S_{x_a x_a}(jΩ) is the power spectral density of x_a(t).


Example 6.15 - An illustration of sampling:

Solution:


Having established that the correlation function of the wide-sense stationary DT random process is the sampling of the correlation function of the underlying wide-sense stationary CT random process, i.e. R_xx[m] = R_{x_a x_a}(mT), we know that the sampling theorem applies. That is, if the wide-sense stationary CT random process is bandlimited by B = 2πW, i.e.

S_{x_a x_a}(jΩ) = 0 ,  |Ω| > B ,    (70)

and if the sampling rate is f_s ≥ 2W (i.e. if we sample at at least twice the highest frequency), then we can recover R_{x_a x_a}(τ) from its samples R_xx[m] as

R_{x_a x_a}(τ) = Σ_{m=−∞}^{∞} R_xx[m] · sin((π/T)(τ − mT)) / ((π/T)(τ − mT)) .    (71)

This result is the sampling theorem for the (deterministic) correlation function R_{x_a x_a}(τ). f_s = 2W is called the Nyquist rate.

But what about the CT random process x_a(t) itself? We know that, theoretically, a nonrandom CT signal can be exactly reconstructed from its samples as long as it is ideally sampled at at least the Nyquist rate. The sampling theorem for the wide-sense stationary process x_a(t) is:

For x_a(t) bandlimited by B = 2πW, and for T ≤ 1/(2W), given the realization samples

x[n] = x_a(nT) ,    (72)

and the reconstruction

x̂_a(t) = Σ_{n=−∞}^{∞} x[n] · sin((π/T)(t − nT)) / ((π/T)(t − nT)) ,    (73)

we have that

E{|x_a(t) − x̂_a(t)|²} = 0  ∀t .    (74)

The more general bandpass version of the sampling theorem states that we can reconstruct a CT signal from its samples as long as we sample at a rate which is greater than or equal to the bandwidth of the CT signal. So, for example, given a complex-valued CT signal with a 20 kHz bandwidth centered at 2 GHz, we can reconstruct this signal from its samples as long as we sample it at a rate f_s > 20k samples/second. This applies to both deterministic and random signals.


6.3 Note on Cyclostationary Random Processes

This Section corresponds to Section 10.4 of the Course Text.

A CT random process x(t) is cyclostationary if its mean and correlation functions are periodic, i.e. if there exists a T such that

η_x(t) = η_x(t + T)  ∀t    (75)

R_xx(t_1, t_2) = R_xx(t_1 + T, t_2 + T)  ∀t_1, t_2 .    (76)

Figure 18 illustrates this property of R_xx(t_1, t_2) for a cyclostationary random process x(t).

[Figure 18: Illustration of the correlation function R_xx(t_1, t_2) for (a) a wide-sense stationary process, where the correlation function is constant along any line t_1 − t_2 = constant, and (b) a cyclostationary process, where the correlation function repeats with period T, so the shaded region of one period completely characterizes it.]

Examples of cyclostationary processes are transmitted digital communications signals and sampled-and-held random processes. A cyclostationary random process is not stationary (except for the trivial case of a constant x(t)).


Example 6.16: Given wide-sense stationary x[n] with mean η_x and correlation function R_xx[m], consider the CT signal

x_a(t) = x[n] ,  nT ≤ t < (n+1)T .

Determine η_{x_a}(t) and R_{x_a x_a}(t_1, t_2).


Example 6.17 - PSK example²

For cyclostationary random processes, the time-averaged correlation function

R̄_xx(τ) = (1/T) ∫_{−T/2}^{T/2} R_xx(t, t − τ) dt    (77)

is useful. The corresponding averaged power spectral density is

S̄_xx(Ω) = ∫_{−∞}^{∞} R̄_xx(τ) e^{−jΩτ} dτ .    (78)

Figure 19 illustrates how this averaged correlation function is formed.

[Figure 19: Illustration of the formation of the averaged correlation function of a cyclostationary process from its 2-D correlation function: for a given lag τ (e.g. τ = 0), R_xx(t, t − τ) is integrated over t to form R̄_xx(τ).]

²From Stark and Woods, Probability and Random Processes with Applications to Signal Processing, 3rd ed., Prentice Hall, 2002. Also see Proakis & Salehi, Digital Communications, 5th ed., McGraw Hill, 2001.


6.4 Correlation & Covariance Matrices

6.4.1 Random Vector Observation

Let x[n] be an N-dimensional random vector process. In general form,

x[n] = [x_1[n], x_2[n], ..., x_N[n]]^T ,    (79)

where each x_i[n] is a random process. A realization of this random vector process (at time n, or over n) is denoted

x[n] = [x_1[n], x_2[n], ..., x_N[n]]^T .    (80)

Examples:

• Data in an FIR filter delay line:

x[n] = [x[n], x[n−1], ..., x[n−N+1]]^T ,    (81)

i.e. x_i[n] = x[n−i+1], where x[n] is a random process.

• Data operated on with an FFT:

x[n] = [x[n], x[n+1], ..., x[n+N−1]]^T ,    (82)

i.e. x_i[n] = x[n+i−1], where x[n] is a random process.

• Output data from an array of N sensors:

x[n] = [x_1[n], x_2[n], ..., x_N[n]]^T ,    (83)

where x_i[n] is the random process output of the ith sensor.


We are interested in the following probabilistic characterizations of random vector processes:

• The joint pdf of x[n], f_{x[n]}(x[n]): Beyond this, we will not directly exploit the complete probabilistic description of a random vector process, which is the set of all joint pdf's of all combinations over time of samples of x[n].

• The Mean Vector, η_x[n] = E{x[n]}: For x[n] which is stationary in the mean, η_x[n] = η_x. For example, if x[n] = [x[n], x[n−1], ..., x[n−N+1]]^T and x[n] is a wide-sense stationary random process with mean η_x, then η_x[n] = η_x = η_x 1_N.

• The Correlation Matrix, R_xx[n] = E{x[n] x^H[n]}: More generally, we could consider R_xx[n_1, n_2] = E{x[n_1] x^H[n_2]}. Then, given our definition above of R_xx[n], we would have R_xx[n] = R_xx[n, n]. This R_xx[n] notation that we will use should not be confused with the correlation matrix function of a wide-sense stationary random vector process. Let x[n] be wide-sense stationary. Then R_xx[n, n−m] = E{x[n] x^H[n−m]} = R_xx[m], where here m denotes the correlation lag. Then for wide-sense stationary x[n], E{x[n] x^H[n]} = R_xx[m = 0] = R_xx.

Again, in this Course, we denote R_xx[n] = E{x[n] x^H[n]}. Though it is certainly of interest in some applications (e.g. broadband sensor array processing), in this course we will not focus on the correlation matrix function of a wide-sense stationary vector random process.

• The Covariance Matrix, C_xx[n] = E{(x[n] − η_x[n]) (x[n] − η_x[n])^H} = R_xx[n] − η_x[n] η_x^H[n].

Note that R_xx[n] and C_xx[n] are Hermitian (conjugate symmetric) and positive semidefinite.

Recall that the eigenstructure decomposition of R_xx[n] is:

R_xx[n] = E[n] Λ[n] E^H[n] .    (84)

When there is no need to keep the eigenstructure temporal notation (e.g. when the vector random process is wide-sense stationary), we will simply use

R_xx[n] = E Λ E^H .    (85)

Note that this eigenstructure possesses all of the properties of the eigenstructure of a correlation matrix identified in Section 5.4 of the Course Notes.


6.4.2 Wide-Sense Stationary Random Processes

Let x[n] = [x[n], x[n−1], ..., x[n−N+1]]^T. For x[n] wide-sense stationary,

η_x[n] = η_x = η_x 1 ,    (86)

and

R_xx[n] = R_xx =
[ R_xx[0]       R_xx[1]       ···  R_xx[N−1]
  R_xx[−1]      R_xx[0]       ···  R_xx[N−2]
  ⋮             ⋮             ⋱    ⋮
  R_xx[−N+1]    R_xx[−N+2]    ···  R_xx[0]    ] ,    (87)

where R_xx[m] is the correlation function of x[n]. Note that this correlation matrix R_xx, for the observation x[n] = [x[n], x[n−1], ..., x[n−N+1]]^T, has a Toeplitz structure. That is, it is constant along any diagonal. This Toeplitz structure turns out to be very important when considering computationally efficient algorithms (e.g. for matrix inversion). This is a property that we will not have time to explore in this Course.

The covariance matrix C_xx[n] = C_xx for wide-sense stationary x[n] = [x[n], x[n−1], ..., x[n−N+1]]^T also has this Toeplitz structure.
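As an illustrative sketch (not from the notes, assuming NumPy/SciPy), the Toeplitz correlation matrix of Eq. (87) can be built directly from correlation-function samples; here for the correlation model of Example 6.7:

```python
# Build the Toeplitz correlation matrix of Eq. (87) from R_xx[m] = sigma^2 (0.5)^{|m|}
# and verify it is Hermitian and positive semidefinite.
import numpy as np
from scipy.linalg import toeplitz

N = 8
sigma2 = 1.0
r = sigma2 * 0.5 ** np.arange(N)      # R_xx[0], R_xx[1], ..., R_xx[N-1]

# For this real-valued R_xx[m], R_xx[-m] = R_xx[m], so the first column and row are both r.
Rxx = toeplitz(r, r)

print(np.allclose(Rxx, Rxx.conj().T))         # Hermitian
print(np.min(np.linalg.eigvalsh(Rxx)) >= 0)   # nonnegative eigenvalues -> positive semidefinite
```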

Example 6.18 - White Noise:


Example 6.19 - Complex Sinusoidal Process:


Example 6.20 - Bandlimited White Noise:

Example 6.21 - A Simple Autoregressive (AR) Process:

Example 6.22 - A Simple Moving Average (MA) Process (see Example 6.5):


6.5 Discrete Karhunen-Loeve Transformation (DKLT)

Previously, in Chapter 5 of the Course Notes on random vectors, we introduced the theme of a vector observation (Section 5.3) and its orthogonalization using the eigenstructure of its correlation matrix (Subsection 5.4.3). In this Section of the Course Notes, we continue and expand on this theme. Specifically, we further consider the correlation matrix eigenstructure decomposition, and the corresponding random vector orthogonal decomposition, for several important vector observations of random processes. We will extend this discussion in Section 6.6 of the Course Notes with an in-depth treatment of vector observations of narrowband signals in additive white noise.

As a review of what we have already discussed concerning the orthogonal decomposition of a random N-dimensional vector observation x, let x be a realization of x. Let λ_i; i = 1, 2, ..., N and e_i; i = 1, 2, ..., N be the eigenvalues and corresponding eigenvectors of the correlation matrix R_xx of x. Then we can expand the realization x in terms of the eigenvector basis as:

x = Σ_{i=1}^{N} y_i e_i = E y ,    (88)

where y = [y_1, y_2, ..., y_N]^T, E = [e_1, e_2, ..., e_N], and y_i = e_i^H x (i.e. y = E^H x). The coefficients of this expansion, the y_i, have the property

E{y_i y_j*} = λ_i δ[i − j] .    (89)

That is, they are statistically orthogonal to one another, and their powers, the eigenvalues λ_i, represent the powers of x in the eigenvector e_i directions.

6.5.1 Orthogonal Expansion of a Random Vector

Orthogonality will have two distinct meanings here:

• in the linear algebra sense, x^H y = 0; and

• in the probabilistic sense, E{x y*} = 0.

Consider a general random vector process x[n] observed in the N-dimensional complex vector space C^N, i.e. x[n] ∈ C^N. We wish to represent x[n] and its position in C^N relative to useful orthogonal bases. Since x[n] is random, we are interested in its position in a probabilistic sense.

Let φ_i; i = 1, 2, ..., N be an orthonormal basis for C^N, i.e.

φ_i^H φ_j = δ[i − j] .    (90)

So, for Φ = [φ_1, φ_2, ..., φ_N], we have Φ^H Φ = I_N. Then, representing x[n] in terms of this basis, we have

x[n] = Σ_{i=1}^{N} c_i φ_i = Φ c ,    (91)

where c = [c_1, c_2, ..., c_N]^T, and c = Φ^H x[n]. Note that c is implicitly a function of n.


We now consider expansion of x[n] with respect to several bases of interest.

1. Standard Basis: We start with this trivial case, in order to emphasize that we implicitly represent vectors in terms of a basis representation. Let

φ_i = [0, 0, ..., 0, 1, 0, ..., 0]^T  (1 in the ith position) .    (92)

Then

x[n] = Σ_{i=1}^{N} c_i φ_i = Φ c ,    (93)

with c = Φ^H x[n] = I_N x[n] = x[n]. That is, c_i = x_i[n].

Illustration:


2. Fourier Basis: Consider x[n] = [x[n], x[n+1], ..., x[n+N−1]]^T. Let v(ω) = (1/√N) [1, e^{−jω}, e^{−j2ω}, ..., e^{−j(N−1)ω}]^T be a normalized Fourier vector, and consider

v(ω_i) ;  ω_i = (2π/N) i ;  i = 0, 1, 2, ..., N−1 .    (94)

The set of vectors v*(ω_i); i = 0, 1, 2, ..., N−1 forms an orthonormal basis for C^N, i.e.

v^H(ω_i) v(ω_j) = (1/N) Σ_{n=0}^{N−1} e^{j(ω_i − ω_j)n}    (95)

= (1/N) Σ_{n=0}^{N−1} e^{j(2π/N)(i−j)n}    (96)

= 0 for i ≠ j ;  = 1 for i = j .    (97)

The Fourier basis expansion of x[n] is

x[n] = Σ_{i=0}^{N−1} c_i v*(ω_i) = V* c ,    (98)

where V = [v(ω_0), v(ω_1), ..., v(ω_{N−1})], and

c = V^T x[n] .    (99)

The correlation matrix of this expansion coefficient vector c is

R_cc = E{c c^H} = E{V^T x[n] x^H[n] V*} = V^T R_xx V* .    (100)

In general, R_cc is not diagonal, i.e. E{c_i c_j*} ≠ E{|c_i|²} δ[i − j], since the Fourier basis is not the eigenvector basis. That is, in general V^T does not diagonalize R_xx.

Note that

c_i = v^T(ω_i) x[n] = (1/√N) Σ_{k=0}^{N−1} x[n+k] e^{−j(2π/N)ki} ,  i = 0, 1, 2, ..., N−1 ,    (101)

is the DFT of x[n], normalized by 1/√N. So, in general the DFT outputs c_i; i = 0, 1, ..., N−1 are not probabilistically orthogonal.


3. Eigenvector Basis: Consider general x[n]. Let R_xx = E{x[n] x^H[n]}, and consider the eigenstructure decomposition

R_xx = E Λ E^H ,    (102)

E = [e_1, e_2, ..., e_N], Λ = Diag{λ_1, λ_2, ..., λ_N}; λ_i ≥ λ_{i+1} ≥ 0. Consider the eigenvector expansion

x[n] = Σ_{i=1}^{N} k_i e_i = E k    (103)

where k = [k_1, k_2, ..., k_N]^T. Since the e_i are orthonormal,

k_i = e_i^H x[n]    (104)

or

k = E^H x[n] .    (105)

The correlation matrix of the coefficient vector k is

R_kk = E{k k^H}    (106)

= E{E^H x[n] x^H[n] E}    (107)

= E^H R_xx E .    (108)

So, R_kk = Λ. The k_i expansion coefficients are probabilistically orthogonal, i.e.

E{k_i k_j*} = λ_i δ[i − j] .    (109)

The eigenstructure basis expansion is the orthonormal basis expansion of x[n] for which the expansion coefficients are probabilistically orthogonal.

Why is this important? First, it is often easier to process independent random variables and processes. Second, as we will see next, the eigenvector transformation is optimum in a power concentration or mean-square error sense.
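The following sketch (an illustration, not part of the notes, assuming NumPy/SciPy and the Toeplitz correlation model used earlier) contrasts items 2 and 3 numerically: the Fourier-basis coefficient correlation V^T R_xx V* of Eq. (100) has nonzero off-diagonal terms, while the eigenvector-basis coefficient correlation E^H R_xx E of Eq. (108) is diagonal.

```python
# Fourier basis vs. eigenvector basis for a WSS vector observation.
import numpy as np
from scipy.linalg import toeplitz

N = 8
Rxx = toeplitz(0.5 ** np.arange(N))        # Toeplitz correlation matrix (Example 6.7 model)

# Fourier basis: column i is v(w_i), w_i = 2*pi*i/N, normalized by 1/sqrt(N) (Eq. (94))
n = np.arange(N)
V = np.exp(-1j * 2 * np.pi * np.outer(n, n) / N) / np.sqrt(N)
Rcc = V.T @ Rxx @ V.conj()                 # Eq. (100)

# Eigenvector basis
lam, E = np.linalg.eigh(Rxx)
Rkk = E.conj().T @ Rxx @ E                 # Eq. (108); equals Diag{lambda_i}

off = lambda M: np.max(np.abs(M - np.diag(np.diag(M))))
print(off(Rcc))   # generally not ~0: Fourier (DFT) coefficients are correlated
print(off(Rkk))   # ~0 up to roundoff: DKLT coefficients are uncorrelated
```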


6.5.2 The Discrete Karhunen-Loeve Transformation (DKLT)

Let x[n] ∈ C^N and consider the eigenvector transformation

k = E^H x[n] ,    (110)

E = [e_1, e_2, ..., e_N], Λ = Diag{λ_1, λ_2, ..., λ_N}; λ_i ≥ λ_{i+1} ≥ 0. This is called the DKLT because it is optimum in the mean-squared error (MSE) sense.

To understand this, consider the general orthonormal transformation (expansion)

x[n] = Σ_{i=1}^{N} k_i φ_i ;  k_i = φ_i^H x[n] ,    (111)

where φ_i^H φ_j = δ[i − j]. Consider the "low-rank" or rank-M (M < N) approximation

x_M[n] = Σ_{i=1}^{M} k_i φ_i    (112)

and the approximation error, for varying M,

e_M = x[n] − x_M[n] ;  M = 1, 2, ..., N .    (113)

Define E_M = E{e_M^H e_M} as the MSEs. The DKLT is comprised of the solutions to the problems: for M = 1, 2, ..., N,

min_{φ_M}  E_M    (114)

subj. to  φ_M^H φ_M = 1 ,  φ_M^H φ_j = 0 ;  j = 1, 2, ..., M−1 .    (115)


For M = 1, the constrained minimization problem of Eqs. (114)-(115) is

min_{φ_1}  E_1 = E{ |x[n] − φ_1 φ_1^H x[n]|² }    (116)

subj. to  φ_1^H φ_1 = 1 .    (117)

Since φ_1 φ_1^H x[n] = φ_1 k_1, minimizing the above expression is equivalent to maximizing the energy of k_1. Thus an equivalent optimization problem is

max_{φ_1}  E{|k_1|²} = φ_1^H R_xx φ_1 = φ_1^H E Λ E^H φ_1 = Σ_{i=1}^{N} λ_i p_i    (118)

subj. to  φ_1^H φ_1 = 1 ,    (119)

where the p_i = |e_i^H φ_1|² are nonnegative. Let p = [p_1, p_2, ..., p_N]^T. The optimization problem can be written as

max_p  Σ_{i=1}^{N} p_i λ_i = p^T λ    (120)

subj. to  φ_1^H φ_1 = φ_1^H E E^H φ_1 = Σ_{i=1}^{N} p_i = 1 .    (121)

Clearly, since both the λ_i ≥ 0 and the p_i ≥ 0, p_opt = [1, 0, 0, ..., 0]^T, so that φ_{1,opt} = e_1. That is, the best 1-dimensional basis to represent x[n], in the MSE sense, is the eigenvector e_1.

For M = 2, we start with φ_{1,opt} = e_1, and the optimization problem is

min_{φ_2}  E_2    (122)

subj. to  [φ_1, φ_2]^H φ_2 = [0, 1]^T .    (123)

This is equivalent to

max_{φ_2}  E{|k_2|²} = φ_2^H R_xx φ_2    (124)

subj. to  [φ_1, φ_2]^H φ_2 = [0, 1]^T .    (125)

The solution to this is φ_{2,opt} = e_2.

Continuing for M > 2, by induction we get that φ_{M,opt} = e_M ;  M = 1, 2, ..., N.
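A brief numerical sketch (an illustration under the same assumed Toeplitz correlation model used above, not from the notes): for the DKLT, the rank-M approximation MSE equals the sum of the N−M smallest eigenvalues of R_xx.

```python
# Rank-M DKLT approximation: its MSE equals the sum of the N-M smallest eigenvalues of R_xx.
import numpy as np
from scipy.linalg import toeplitz

rng = np.random.default_rng(1)
N, M, trials = 8, 3, 20000
Rxx = toeplitz(0.5 ** np.arange(N))

# Generate zero-mean Gaussian vectors with correlation matrix Rxx.
L = np.linalg.cholesky(Rxx)
X = L @ rng.standard_normal((N, trials))

lam, E = np.linalg.eigh(Rxx)               # ascending eigenvalues
Ed = E[:, ::-1]                            # reorder columns: e_1 (largest eigenvalue) first
K = Ed.conj().T @ X                        # DKLT coefficients, Eq. (110)
Xhat = Ed[:, :M] @ K[:M, :]                # rank-M approximation, Eq. (112)

mse_dklt = np.mean(np.sum(np.abs(X - Xhat) ** 2, axis=0))
print(mse_dklt)                            # ~ sum of the N-M smallest eigenvalues
print(np.sum(lam[:N - M]))
```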


6.6 Narrowband Signals in Additive White Noise

Consider a general vector random process x[n] = [x_1[n], x_2[n], ..., x_N[n]]^T, which could be, for example, a sampling of a single time series or the observation at the output of an array of sensors. A rank-1 x[n] is of the form

x[n] = s[n] a    (126)

where s[n] is a scalar random process and a is a constant vector. Such a vector random process is called rank-1 because it varies along a single dimension in the observation space C^N.

A complex narrowband vector random process is rank-1. We will illustrate this directly below with some specific examples. But first note that in this case a will be a constant vector whose elements are the relative phases and perhaps amplitudes across the observation, and s[n] will be a complex-valued time-varying random signal. At any instant in time x[n] will appear sinusoidal, but observations at different times may have different magnitudes and phases. Imagine an oscilloscope observation of a narrowband signal – say the output of a very narrow bandpass filter with input white noise. For any given trigger, the signal on the oscilloscope screen (the observation) will look like a sinusoid. If the oscilloscope is continuously triggered at constant time intervals, then the signal on the screen will appear sinusoidal each time, but each successive display will have a different phase and amplitude, because the signal is random with some finite bandwidth. For a narrowband observation, the analogy is that a is the sinusoidal shape of the signal and s[n] is the different magnitude and phase for each observation.

Illustration:


Consider the vector random process x[n] = [x[n], x[n−1], ..., x[n−N+1]]^T. Recall our use in Section 6.5 of the Fourier vector v(ω) = [1, e^{−jω}, e^{−j2ω}, ..., e^{−j(N−1)ω}]^T (note that this time it is not normalized). For narrowband x[n] of frequency ω_0, we have

x[n] = s[n] v(ω_0) .    (127)

v(ω_0) is the vector of linearly progressing relative phases across the observation. s[n] will be a narrowband scalar random process.

Now consider x[n] = [x_1[n], x_2[n], ..., x_N[n]]^T, the complex-valued observation across a narrowband array of sensors. The sensor array can have any configuration. The individual array elements (sensors) can have different responses. However, by definition of a narrowband array, the observation of a signal impinging on the array will be of the form

x[n] = s[n] a    (128)

where a is the vector of relative amplitudes and phases across the array of the observation of the signal. Just as v(ω) varies as a function of frequency for the single random process case discussed above, this vector a will be a function of the frequency and also the location of the signal source (i.e. signals from different locations will impinge on the array resulting in observations with different relative phases and magnitudes across the array; again, s[n] is a narrowband scalar random process). In both cases, we can see that x[n] is rank-1.

Illustration:


Now consider D rank-1 source signals superimposed along with zero-mean additive white noise. Now,

x[n] = Σ_{k=1}^{D} s_k[n] a_k + n[n]    (129)

= A s[n] + n[n] ,    (130)

where A = [a_1, a_2, ..., a_D], s[n] = [s_1[n], s_2[n], ..., s_D[n]]^T, and n[n] is the noise vector with correlation matrix R_nn = σ_n² I_N. Assume that the noise n[n] is uncorrelated³ with the signals s_k[n]; k = 1, 2, ..., D, and assume, at least for now, that the signals are mutually uncorrelated.⁴ Also assume that the noise and signals are wide-sense stationary.

Under the assumptions stated above, the correlation matrix of the observation x[n] is

R_xx[n] = R_xx = E{x[n] x^H[n]}    (131)

= E{(A s[n] + n[n]) (A s[n] + n[n])^H}    (132)

= A P A^H + σ_n² I_N ,    (133)

where

P = E{s[n] s^H[n]} = Diag{σ_1², σ_2², ..., σ_D²} ,    (134)

σ_k² = E{|s_k[n]|²}.

³To slightly simplify notation and discussions, in this section we assume all random processes are zero-mean. Thus uncorrelated implies statistically orthogonal.

⁴In general, in practice the signals will not necessarily be uncorrelated. For a single random process observation they often will be. Many of the results discussed below will hold for correlated signals. If signals are correlated, however, a few additional issues must be considered.


6.6.1 Correlation Matrix Eigenstructure

Assume that a_k^H a_k = N; k = 1, 2, ..., D. This is not a restriction; it is a definition that fixes the variance σ_k²; k = 1, 2, ..., D. Since R_xx is a correlation matrix, its eigenstructure

λ_i ;  i = 1, 2, ..., N ;  λ_i ≥ λ_{i+1}    (135)

e_i ;  i = 1, 2, ..., N ;  e_i^H e_i = 1    (136)

has all the properties established earlier for the eigenstructure of a correlation matrix. Consider the following assumptions concerning the rank-1 signals:

A1 - Rank{P} = D

A2 - D < N

A3 - The a_k are linearly independent.

Assumption A1 is true since the signals s_k[n]; k = 1, 2, ..., D are, by assumption, uncorrelated. Even if the s_k[n] are correlated, A1 will often be true. Assumption A2 says that there are fewer narrowband signals in the observation than there are dimensions in the observation. Given knowledge of the physical environment in which the signals are observed, observations are usually "designed" to assure A2. Assumption A3 says, basically, that we can measure a difference between a signal observation a_k and the observation of any superposition of the D−1 other signal observations a_j; j ≠ k. For example, consider the single random process observation x[n] = [x[n], x[n−1], ..., x[n−N+1]]^T, so that a_k = v(ω_k). If sinusoids are sampled without aliasing, then A3 holds. On the other hand, if there is the possibility of aliasing, this means that there are some v(ω_k), v(ω_j); ω_k ≠ ω_j which are colinear (thus linearly dependent).

Illustration:


We are interested in characteristics of the eigenstructure of the observation correlation matrix R_xx, for narrowband sources in white noise and for other types of signal observations. As we will see in Part 3 of the Course, an understanding of these characteristics is important for understanding the motivation and performance of what might be called "modern" signal processing methods, which take advantage of observation characteristics (models) of signals of interest.

Consider the eigenstructure

R_xx e_i = λ_i e_i ,  λ_i ≥ λ_{i+1} ,  i = 1, 2, ..., N ,    (137)

where R_xx is the correlation matrix of a sum of D narrowband sources in white noise. Let λ_i = ν_i + σ_n², where σ_n² is the variance of the additive white noise. Using the structure of R_xx developed above, we then have that

(A P A^H + σ_n² I_N) e_i = (ν_i + σ_n²) e_i    (138)

or

(A P A^H) e_i = ν_i e_i .    (139)

Note that this indicates two very important relationships between the eigenstructure of R_xx and that of the noiseless correlation matrix A P A^H, as expressed in terms of the observation vectors a_k and the signal powers σ_k². First, both R_xx and A P A^H have the same eigenvectors. Second, their eigenvalues are related as λ_i = ν_i + σ_n². Thus the two matrices have essentially the same eigenstructure. This leads to the following two useful results:


1. Eigenvalues and Observation Power Distribution: With assumptions A1-A3, it can be shown that Rank{A P A^H} = D. Thus, ν_i = 0; i = D+1, D+2, ..., N. Also, since A P A^H is a correlation matrix and thus positive semidefinite, ν_i > 0; i = 1, 2, ..., D. Additionally, from the trace property of correlation matrices (and a_k^H a_k = N),

N Σ_{i=1}^{D} σ_i² = Σ_{i=1}^{D} ν_i .    (140)

Therefore, given that the eigenvalues are the powers of the observation along the eigenvector directions, the ν_i; i = 1, 2, ..., D eigenvalues are a distribution of the signal powers along the e_i; i = 1, 2, ..., D eigenvector directions. Note that there is no signal power along the e_i; i = D+1, D+2, ..., N directions. Recall that λ_i = ν_i + σ_n². So,

λ_i = σ_n² ,  i = D+1, D+2, ..., N ;   λ_i > σ_n² ,  i = 1, 2, ..., D .    (141)

We call σ_n² the noise floor. We can see from the relation λ_i = ν_i + σ_n² that the observed white noise power is spread evenly across the eigenvalues, and therefore across the eigenvector directions in the observation space.

Illustration:


2. Eigenvectors and Narrowband Signal Observation Orientation: We have that

A (P A^H e_i) = ν_i e_i ;  i = 1, 2, ..., D .    (142)

Note that the v_i = (P A^H e_i) on the left side of Eq. (142) are D × 1 vectors, and that the A v_i = ν_i e_i are linear combinations of the columns of A. So, the subspace spanned by the signal observation vectors a_k; k = 1, 2, ..., D is equal to the subspace spanned by the eigenvectors e_i; i = 1, 2, ..., D. Thus, we call the e_i; i = 1, 2, ..., D the signal eigenvectors, and the subspace they span the D-dimensional signal subspace. Given that the set of all N eigenvectors forms an orthonormal basis, we also have that the eigenvectors e_i; i = D+1, D+2, ..., N form an orthonormal basis for the orthogonal complement of the signal subspace. We call these eigenvectors the noise-only eigenvectors and their span the noise-only subspace. In summary,

Span{e_1, e_2, ..., e_D} = Span{a_1, a_2, ..., a_D} ≡ Signal Subspace    (143)

Span{e_{D+1}, e_{D+2}, ..., e_N} ⊥ Span{a_1, a_2, ..., a_D} ≡ Noise-Only Subspace .    (144)

The bottom line here is that the eigenvectors e_1, e_2, ..., e_D span the same space as the narrowband signal observations, and thus knowledge of these eigenvectors tells us a lot about the narrowband signals (e.g. about their frequencies) in an observation x[n]. The eigenvalues λ_i; i = 1, 2, ..., D provide information about the narrowband signal powers. As an example of the utility of this, say we have an estimate of the correlation matrix R_xx of an N-dimensional observation of D < N rank-1 signals in white noise, from which we compute an estimated eigenstructure. We can look at the ordered eigenvalues to determine an estimate D̂ of the number of signals. We can then partition the eigenstructure into signal and noise-only components. From either the signal or the noise-only eigenvector estimates, we can consider deriving estimates of the signal frequencies. From the signal eigenvalue estimates, we might derive estimates of the signal powers. In fact, we will look at ways to do this in Part 3 of the course.


Example 6.23 - Consider a DT WSS random process x[n] = s[n] + n[n], where n[n] is zero-mean, uncorrelated noise with variance σ_n². s[n] is a real-valued sinusoidal process s[n] = A cos(ω_0 n + Φ), where ω_0 is a constant, Φ is uniformly distributed over 0 ≤ φ < 2π, and A is zero-mean Gaussian with variance σ_s². Assume A, Φ and the samples of n[n] are independent. Consider the random vector x[n] = [x[n], x[n−1], ..., x[n−15]]^T.

a) Using Euler's identity, write the correlation matrix R_xx in the form

R_xx = A P A^H + σ_n² I_16

where A is not a function of A or Φ, and P is a 2 × 2 signal correlation matrix.

b) For ω_0 = π/4, describe as completely as possible the eigenstructure of R_xx.
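A numerical sketch for part b) (an illustration under assumed values σ_s² = 1 and σ_n² = 0.1, not the official solution, assuming NumPy/SciPy): the real sinusoid corresponds to two Fourier-vector columns at ±ω_0, so R_xx should show two eigenvalues above the noise floor and fourteen equal to σ_n².

```python
# Example 6.23 b) sketch: R_xx[m] = (sigma_s^2/2) cos(w0 m) + sigma_n^2 delta[m], 16x16 Toeplitz.
import numpy as np
from scipy.linalg import toeplitz

N, w0 = 16, np.pi / 4
sigma_s2, sigma_n2 = 1.0, 0.1      # assumed values for illustration

m = np.arange(N)
r = 0.5 * sigma_s2 * np.cos(w0 * m)
r[0] += sigma_n2
Rxx = toeplitz(r)

lam = np.sort(np.linalg.eigvalsh(Rxx))[::-1]
print(lam)
# For w0 = pi/4 the columns v(+w0) and v(-w0) are orthogonal over 16 samples, so the two
# signal eigenvalues are both N*sigma_s^2/4 + sigma_n^2 = 4.1, and the remaining 14
# eigenvalues equal the noise floor sigma_n^2 = 0.1.
```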


6.6.2 Two Narrowband Signals in White Noise Example

Here we consider a basic example illustrating the concepts developed in the previous Subsection. You will look at related examples as Homework problems.

Consider two uncorrelated, equi-powered complex sinusoidal signals in additive white noise. Determine the eigenvalues and eigenvectors of the correlation matrix R_xx of the N-dimensional observation (N > 2) in terms of the power of each signal, σ_s², the noise power σ_n², and the signal observation vectors a(ω_k); k = 1, 2. Assume a^H(ω_1) a(ω_2) = c, and a^H(ω_k) a(ω_k) = N; k = 1, 2.

At this point you may be thinking that, in general, the problem of finding the eigenstructure of an N-dimensional matrix is very difficult and time consuming, if say N > 3. There must be some trick to this.

Solution: Let A = [a_1, a_2] and P = Diag{σ_s², σ_s²} = σ_s² I_2. From the above assumptions, we have that

A^H A =
[ N    c
  c*   N ] .    (145)

Note that |c| indicates how "close" a(ω_1) and a(ω_2) are in angle. That is, 0 ≤ |c| ≤ N, where |c| = N implies a(ω_1) and a(ω_2) are colinear, while |c| = 0 implies a(ω_1) and a(ω_2) are orthogonal. Also note that in this example

A P A^H = σ_s² A A^H .    (146)


The trick to solving this problem is to note that the two nonzero eigenvalues of σ_s² A A^H are the two eigenvalues of σ_s² A^H A. To see this, consider the SVD of A,⁵

A = U Σ V^H ,    (147)

where U is an N × N unitary matrix, Σ is the N × D diagonal matrix of singular values σ_a, σ_b, and V is a 2 × 2 unitary matrix. The two nonzero eigenvalues of A A^H and the two eigenvalues of A^H A are σ_a², σ_b².

So, for the two nonzero eigenvalues of A P A^H, we can simply solve for the roots of the determinant

|σ_s² A^H A − v I_2| = det [ Nσ_s² − v    σ_s² c
                             σ_s² c*      Nσ_s² − v ]    (148)

= v² − 2Nσ_s² v + σ_s⁴ (N² − |c|²) ,    (149)

i.e. solve for the eigenvalues of σ_s² A^H A. These eigenvalues (roots) are

v_1 = σ_s² (N + |c|)    (150)

v_2 = σ_s² (N − |c|) .    (151)

Given these eigenvalues of σ_s² A^H A, the established relationship between these and the nonzero eigenvalues of σ_s² A A^H, and the established relationship between these and the eigenvalues of R_xx = σ_s² A A^H + σ_n² I_N, we have that

λ_i = σ_s² (N + |c|) + σ_n² for i = 1 ;  σ_s² (N − |c|) + σ_n² for i = 2 ;  σ_n² otherwise .    (152)

⁵Earlier in the course, for correlation matrices, we considered the relationships between the eigenstructure and the SVD.


Concerning the eigenvectors of R_xx, note that we are really interested in the eigenvectors for the two distinct eigenvalues λ_1 and λ_2. These are the signal subspace eigenvalues. The important feature of the eigenvectors of the N−2 repeated noise-only eigenvalues is that they span the noise-only subspace, which is the orthogonal complement of the signal subspace.

Recall from an earlier discussion involving the SVD of a matrix that the left singular vectors of A (i.e. the columns of the U matrix in Eq. (147)) are the eigenvectors of A A^H, which are the same as the eigenvectors of σ_s² A A^H, which we are interested in. That is, U = E, where E is the notation we have been using for the eigenvector matrix of R_xx. Starting with

A = E Σ V^H ,    (153)

and exploiting the fact that V is unitary, we have

A V = E Σ .    (154)

V is the 2 × 2 matrix of eigenvectors of A^H A. Eq. (154) indicates that

e_i ∝ A v_i ,  i = 1, 2 ,    (155)

where e_i; i = 1, 2 are the two eigenvectors we are looking for. Since, with eigenvectors, it is the direction not the length that counts, by computing the 2-dimensional v_i (which is easy), we have the N-dimensional e_i's we want. We can normalize the columns of A v_i to generate normalized e_i if need be.
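As a quick numerical check (an illustration, not from the notes, assuming NumPy and arbitrarily chosen frequencies and powers), the closed-form eigenvalues of Eq. (152) can be compared against a direct eigendecomposition for two Fourier-vector signal observations:

```python
# Check Eq. (152): eigenvalues of R_xx = sigma_s^2 A A^H + sigma_n^2 I for two
# equi-powered complex sinusoids with observation vectors a(w1), a(w2).
import numpy as np

N, w1, w2 = 10, 0.5, 0.9
sigma_s2, sigma_n2 = 2.0, 0.3

n = np.arange(N)
a1, a2 = np.exp(-1j * w1 * n), np.exp(-1j * w2 * n)   # a^H a = N
A = np.column_stack([a1, a2])
Rxx = sigma_s2 * (A @ A.conj().T) + sigma_n2 * np.eye(N)

lam = np.sort(np.linalg.eigvalsh(Rxx))[::-1]
c = np.abs(a1.conj() @ a2)
predicted = [sigma_s2 * (N + c) + sigma_n2, sigma_s2 * (N - c) + sigma_n2]
print(lam[:3])                    # two signal eigenvalues, then the noise floor
print(predicted, sigma_n2)
```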


6.7 Whitening

Here we only discuss whitening across a random vector observation x[n], not the whitening of a random process x[n]. (Earlier, in a Homework problem, we explored whitening random processes.) We briefly consider observation whitening and noise whitening.

Whitening the observation is straightforward. The process, shown below, employs the eigenstructure of the observation correlation matrix, R_xx = E Λ E^H.

Illustration:

For noise whitening, let x[n] be composed of a signal component and additive noise,

x[n] = x_s[n] + n[n] .    (156)

Assume that the signal and noise components are uncorrelated, so that

R_xx = R_ss + σ_n² R_nn ,    (157)

where Tr{R_nn} = N. Consider the transformation

y[n] = R_nn^{−1/2} x[n] .    (158)

Illustration:

Then

R_yy = E{y[n] y^H[n]}    (159)

= R_nn^{−1/2} E{x[n] x^H[n]} R_nn^{−1/2}    (160)

= R_nn^{−1/2} (R_ss + σ_n² R_nn) R_nn^{−1/2}    (161)

= R_nn^{−1/2} R_ss R_nn^{−1/2} + σ_n² I_N ,    (162)

which shows that we have whitened the noise across the observation.
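A minimal sketch of the noise-whitening transformation (an illustration with an assumed colored-noise correlation, not from the notes); R_nn^{−1/2} is formed from the eigenstructure of R_nn:

```python
# Noise whitening: form R_nn^{-1/2} from the eigenstructure of R_nn and verify
# that the transformed noise correlation is sigma_n^2 * I (Eq. (162)).
import numpy as np
from scipy.linalg import toeplitz

N, sigma_n2 = 6, 1.0
Rnn = toeplitz(0.7 ** np.arange(N))          # assumed colored-noise correlation, Tr{Rnn} = N

lam, E = np.linalg.eigh(Rnn)
Rnn_inv_sqrt = E @ np.diag(lam ** -0.5) @ E.conj().T

Ryy_noise = Rnn_inv_sqrt @ (sigma_n2 * Rnn) @ Rnn_inv_sqrt
print(np.allclose(Ryy_noise, sigma_n2 * np.eye(N)))   # True: the noise is white after the transform
```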


Example 6.24 - Consider the N > 2 dimensional WSS random vector process

x[n] = s_1[n] a_1 + s_2[n] a_2 + n[n]

where s_1[n], s_2[n] and n[n] are all zero-mean and mutually uncorrelated. Assume R_nn = 2 I_N, and that a_1 and a_2 are orthonormal. Let E{|s_i[n]|²} = 3; i = 1, 2, and assume that the "desired signal" is s_1[n], so that

n_1[n] = s_2[n] a_2 + n[n]

is the "noise".

a) Describe, in as much detail as possible given the assumptions, the transformation T which whitens n_1[n].

b) Given your T, in as much detail as possible given the listed assumptions, describe the observation of the "desired signal" at the whitening transformation output.


ECE 8072
Statistical Signal Processing

Villanova University
ECE Department

Prof. Kevin M. Buckley

Part 2b
Linear Time-Invariant (LTI) Systems


Contents

7 Linear Time-Invariant (LTI) Systems ..................................... 153
  7.1 Discrete-Time LTI System Review ..................................... 153
  7.2 Wide-Sense Random Processes and DT LTI Systems ...................... 155
  7.3 Wide-Sense Random Processes and CT LTI Systems ...................... 161
  7.4 Matched Filters ..................................................... 162
      7.4.1 DT, Known Signal and Unconstrained DT LTI System .............. 163
      7.4.2 DT, Known and Finite-Length Signal, FIR Filter ................ 165
      7.4.3 DT, Wide-Sense Stationary Random Signal, FIR Filter ........... 166
  7.5 Introduction to Linear Modeling of Random Processes ................. 167

List of Figures

  20 Input/output relationship for a DT LTI system: the convolution sum. ............................ 153
  21 Input/output relationship for a DT LTI system: correlation function and power spectral density of WSS random processes. 156
  22 Input/output relationship for a CT LTI system: the convolution integral. ....................... 161
  23 The DT matched filter problem. .................................................................. 162
  24 The DT matched filter problem. .................................................................. 164
  25 Linear model of a DT wide-sense stationary random process. ..................................... 167


7 Linear Time-Invariant (LTI) Systems

In this Chapter of the Course, we cover processing random processes with LTI systems. We are principally interested in two topics:

1. characterization of the output random process of an LTI system, given a wide-sense stationary input; and

2. matched filtering for various types of signal in additive noise.

7.1 Discrete-Time LTI System Review

For a DT LTI system with input x[n], output y[n], and impulse response¹ h[n], the input/output (I/O) relationship is developed in Figure 20.

Figure 20: Input/output relationship for a DT LTI system: the convolution sum.

This I/O relationship is called a convolution sum.

¹The impulse response of a DT system is the output y[n] = h[n] due to the impulse input x[n] = δ[n].


Convolution Sum:

y[n] = x[n] ∗ h[n] = Σ_{k=−∞}^{∞} x[k] h[n−k]
     = h[n] ∗ x[n] = Σ_{k=−∞}^{∞} h[k] x[n−k] .    (1)

This is a fold, shift, multiply and sum operation. The sum variable k represents memory time.

Frequency Response:

Let X(e^{jω}) and Y(e^{jω}) be, respectively, the DTFT of the input and output. By the convolution property of the DTFT,

Y(e^{jω}) = X(e^{jω}) H(e^{jω}) ,    (2)

where

H(e^{jω}) = Σ_{n=−∞}^{∞} h[n] e^{−jωn} ,    (3)

the DTFT of the impulse response h[n], is the frequency response of the LTI system. By the inverse DTFT,

h[n] = (1/(2π)) ∫_{−π}^{π} H(e^{jω}) e^{jωn} dω .    (4)

Transfer Function:

Let X(z) and Y(z) be, respectively, the z-transform of the input and output. By the convolution property of the z-transform,

Y(z) = X(z) H(z) ,    (5)

where

H(z) = Σ_{n=−∞}^{∞} h[n] z^{−n} ,    (6)

the z-transform of the impulse response h[n], is the transfer function of the LTI system. By the inverse z-transform,

h[n] = (1/(2πj)) ∮_C H(z) z^{n−1} dz ,    (7)

a line integral where ∮_C represents a z-plane closed contour in the region of convergence of the z-transform of h[n].


7.2 Wide-Sense Random Processes and DT LTI Systems

This Section of the Course Notes corresponds to material in Section 9.4 of the Course Text.

Consider a wide-sense stationary random process x[n] with a realization x[n] as the input to a DT LTI system. Input/output relationships from Section 7.1 hold for individual realizations, but we are more interested in probabilistic input/output relationships.

In this Section we assume wide-sense stationary x[n] and an LTI DT system, and we start with the input/output relationship:

y[n] = Σ_{k=−∞}^{∞} h[k] x[n−k] .    (8)

The Output Mean:

E{y[n]} = Σ_{k=−∞}^{∞} h[k] E{x[n−k]}    (9)

η_y = η_x Σ_{k=−∞}^{∞} h[k] .    (10)

So the output of a DT LTI system is stationary in the mean if the input is.

The Output Correlation Functions:

R_yy[m] = E{y[n] y*[n−m]}    (11)

= Σ_{l=−∞}^{∞} Σ_{k=−∞}^{∞} h[k] h*[l] E{x[n−k] x*[(n−m)−l]}    (12)

= Σ_{l=−∞}^{∞} h*[l] Σ_{k=−∞}^{∞} h[k] R_xx[(m+l)−k]    (13)

= Σ_{l=−∞}^{∞} h*[l] (h[m] ∗ R_xx[m+l])    (14)

= h[m] ∗ ( Σ_{l=−∞}^{∞} h*[l] R_xx[m+l] )    (15)

= h[m] ∗ ( Σ_{i=−∞}^{∞} h*[i−m] R_xx[i] )    (16)

= h[m] ∗ h*[−m] ∗ R_xx[m] .    (17)

So the output of a DT LTI system is wide-sense stationary if the input is.


Similarly, for the cross-correlation function between the input and output random processes, we have

R_yx[m] = E{y[n] x*[n−m]}    (18)

= Σ_{l=−∞}^{∞} h[l] E{x[n−l] x*[n−m]}    (19)

= Σ_{l=−∞}^{∞} h[l] R_xx[m−l]    (20)

= h[m] ∗ R_xx[m] ,    (21)

and

R_xy[m] = h*[−m] ∗ R_xx[m] = R_yx*[−m] .    (22)

For covariance functions, we have the similar results

C_yy[m] = h[m] ∗ h*[−m] ∗ C_xx[m]    (23)

C_yx[m] = h[m] ∗ C_xx[m]    (24)

C_xy[m] = C_yx*[−m] .    (25)

Power Spectral Density:

From the correlation and covariance function results, and the convolution property of the DTFT, we have

S_yy(e^{jω}) = S_xx(e^{jω}) |H(e^{jω})|²    (26)

S_yx(e^{jω}) = S_xx(e^{jω}) H(e^{jω})    (27)

S_xy(e^{jω}) = S_xx(e^{jω}) H*(e^{jω}) .    (28)

Figure 21: Input/output relationship for a DT LTI system: correlation function and power spectral density of WSS random processes.


Example 7.1: Let the wide-sense stationary input x[n] be zero-mean white noise with variance σ_n², and let the DT LTI system impulse response be h[n] = δ[n] + δ[n−1]. Determine the correlation function and power spectral density of the output y[n]. (This is a reworking of Example 6.5.)

Solution:
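A simulation sketch (an illustration, not the official solution, assuming NumPy/SciPy): pass white noise through h[n] = δ[n] + δ[n−1] and compare an averaged periodogram of the output with S_yy(e^{jω}) = σ_n² |H(e^{jω})|² = 2σ_n²(1 + cos ω) from Eq. (26).

```python
# Simulation check of Eq. (26) for Example 7.1: white noise through h = [1, 1].
import numpy as np
from scipy import signal

rng = np.random.default_rng(2)
sigma_n2, nfft, segments = 1.0, 256, 400

h = np.array([1.0, 1.0])
acc = np.zeros(nfft)
for _ in range(segments):
    x = rng.normal(scale=np.sqrt(sigma_n2), size=nfft)
    y = signal.lfilter(h, [1.0], x)
    acc += np.abs(np.fft.fft(y, nfft)) ** 2 / nfft    # periodogram of one segment

S_est = acc / segments
omega = 2 * np.pi * np.fft.fftfreq(nfft)
S_true = 2 * sigma_n2 * (1 + np.cos(omega))           # sigma_n^2 |H(e^{jw})|^2
print(np.max(np.abs(S_est - S_true)))                 # fluctuates around 0; shrinks as segments grows
```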


Example 7.2: Let the input x[n] be as in Example 7.1, and let the DT LTI system impulse response be h[n] = a^n u[n], where |a| < 1. Determine the correlation function and power spectral density of the output y[n].

Solution:


Example 7.3: Let h[n] = (1/N)(u[n] − u[n−N]), and let x[n] be a wide-sense stationary complex sinusoidal process with R_xx[m] = σ_x² e^{jω_0 m}, where |ω_0| ≤ π. Determine the correlation function and power spectral density of the output y[n].

Solution:


7.3 Wide-Sense Random Processes and CT LTI Systems

This Section corresponds to material in Section 9.3, p. 412, of the Course Text.

For a CT LTI system with input x(t), output y(t), and impulse response² h(t), the input/output (I/O) relationship is developed in Figure 22.

Figure 22: Input/output relationship for a CT LTI system: the convolution integral.

This I/O relationship is called a convolution integral.

Similar to the DT case, for a wide-sense stationary random process x_a(t) input to a CT LTI system with impulse response h(t), the output random process y_a(t) is also wide-sense stationary, with correlation function

R_{y_a y_a}(τ) = h(τ) ∗ h*(−τ) ∗ R_{x_a x_a}(τ)    (29)

and power spectral density

S_yy(jΩ) = S_xx(jΩ) |H(jΩ)|² .    (30)

²The impulse response of a CT system is the output y(t) = h(t) due to the impulse input x(t) = δ(t).


7.4 Matched Filters

This Section of the Course Notes is related to material from Section 10.6 of the Course Text. Here we focus on DT matched filters, and we cover the topic more broadly.

The problem we address in this Section is the following:

Given an input signal s[n] in additive wide-sense stationary noise n[n], where s[n] and n[n] are uncorrelated if s[n] is random, determine the LTI system that maximizes the output signal-to-noise ratio (SNR) at time n = n_p. For reasons that will become clear below, this is termed the matched filter problem.

Although this problem is stated here in terms of discrete-time signals, maximizing SNR at a point in time is of interest for continuous-time applications as well. In fact, there are numerous variations of this matched filtering problem, for different types of signals and LTI systems. We will differentiate problem variations in terms of the following characterizations:

• continuous-time vs. discrete-time;

• known (or deterministic) s[n] vs. random s[n];

• constrained LTI system structure vs. unconstrained LTI system structure.

An important class of applications of matched filters is detection problems, for which specific applications include: symbol detection in digital communications; sinusoidal signal detection in SONAR; acoustic emission detection in machine diagnostics; seismic pulse detection in geophysical exploration; and biomedical signal processing. Maximizing the SNR at a detection point (in time) often optimizes some detection performance measure of interest (such as probability of decision error).

Figure 23 illustrates the matched filter problem for the discrete-time, random (and stationary) s[n], unconstrained LTI system case. Below we derive matched filters for several important cases.

Figure 23: The DT matched filter problem.


7.4.1 DT, Known Signal and Unconstrained DT LTI System

Here we assume that the DT signal s[n] is a known energy signal with DTFT S(e^{jω}), and n[n] is a zero-mean, wide-sense stationary noise process with correlation function R_nn[m] and corresponding power spectral density S_nn(e^{jω}). We assume that the matched filter has frequency response H(e^{jω}), and we put no restriction on this.

First consider a general noise power spectral density S_nn(e^{jω}). The output SNR, at time n = n_p, is

SNR = |y_s[n_p]|² / R_{y_n y_n}[0]    (31)

where y_s[n_p] is the signal portion of the output, y_s[n], evaluated at time n = n_p, and R_{y_n y_n}[m] is the correlation function of the noise portion of the output, y_n[n]. In terms of spectral representations, the components of Eq. (31) are

y_s[n_p] = (1/(2π)) ∫_{−π}^{π} S(e^{jω}) H(e^{jω}) e^{jωn_p} dω ,    (32)

and

R_{y_n y_n}[0] = (1/(2π)) ∫_{−π}^{π} S_nn(e^{jω}) |H(e^{jω})|² dω .    (33)

The maximum SNR design problem is

max_{H(e^{jω})}  SNR = | (1/(2π)) ∫_{−π}^{π} S(e^{jω}) H(e^{jω}) e^{jωn_p} dω |² / { (1/(2π)) ∫_{−π}^{π} S_nn(e^{jω}) |H(e^{jω})|² dω } .    (34)

To solve this design problem we use Schwarz's inequality: for two functions A(ω) and B(ω),

| ∫_{−π}^{π} A(ω) B(ω) dω |² ≤ ∫_{−π}^{π} |A(ω)|² dω · ∫_{−π}^{π} |B(ω)|² dω ,    (35)

with equality if A(ω) ∝ B*(ω).

If we factor the term in the integral of the numerator of Eq. (34) as

(1/(2π)) S(e^{jω}) H(e^{jω}) e^{jωn_p} = √(S_nn(e^{jω})) H(e^{jω}) · S(e^{jω}) e^{jωn_p} / (2π √(S_nn(e^{jω})))    (36)

then, letting A(ω) = √(S_nn(e^{jω})) H(e^{jω}) and B(ω) = S(e^{jω}) e^{jωn_p} / (2π √(S_nn(e^{jω}))), by Schwarz's inequality the numerator of Eq. (34) is bounded as

| (1/(2π)) ∫_{−π}^{π} S(e^{jω}) H(e^{jω}) e^{jωn_p} dω |² ≤ ∫_{−π}^{π} S_nn(e^{jω}) |H(e^{jω})|² dω · (1/(2π)²) ∫_{−π}^{π} |S(e^{jω})|² / S_nn(e^{jω}) dω .    (37)

Since the first term on the right of Eq. (37) is proportional to the denominator of Eq. (34), and since the second term is not a function of H(e^{jω}), we have that

√(S_nn(e^{jω})) H(e^{jω}) ∝ S*(e^{jω}) e^{−jωn_p} / (2π √(S_nn(e^{jω})))    (38)


maximizes Eq. (34). So the maximum SNR filter frequency response is

H_opt(e^{jω}) = K (S*(e^{jω}) / S_nn(e^{jω})) e^{−jωn_p} ,    (39)

where K is an arbitrary constant.

Figure 24 illustrates this optimum filter, implemented as a cascade of a whitening filter and a filter "matched" to the signal at the output of the whitening filter.

Figure 24: The DT matched filter problem.

For white noise, i.e. for S_nn(e^{jω}) = σ_n², Eq. (39) reduces to

H_opt(e^{jω}) = K_1 S*(e^{jω}) e^{−jωn_p} ,    (40)

where K_1 is an arbitrary constant different from K. Taking the inverse DTFT, we have that the optimum filter impulse response is

h_opt[n] = K_1 s*[n_p − n] .    (41)

That is, the optimum filter impulse response is "matched" to the signal in that it is the conjugate of the signal, folded and shifted by n_p. This is why these maximum SNR filters are called matched filters.

Example 7.4: White noise, and s[n] = n (u[n] − u[n−p]).

Solution:


7.4.2 DT, Known and Finite-Length Signal, FIR Filter

Consider known, deterministic s[n], and say that s[n] = 0 for n ≠ n_0, n_0+1, ..., n_0+P−1, with n_0 = n_p − P + 1. Assume that the matched filter is FIR, with length equal to that of the signal. Define h = [h[0], h[1], ..., h[P−1]]^T as the vector of FIR filter coefficients. At time n = n_p, the FIR filter output due to the signal is

y_s[n_p] = h^T s ;  s = [s[n_p], s[n_p−1], ..., s[n_0]]^T .    (42)

The maximum output SNR problem is then:

max_h  SNR = |y_s[n_p]|² / R_{y_n y_n}[0] = (h^T s s^H h*) / (h^T R_nn h*) ,    (43)

where R_nn is the P × P noise covariance matrix. Let h′ = R_nn^{1/2} h, so that h = R_nn^{−1/2} h′. Then,

SNR = h′^T (R_nn^{−1/2} s) (s^H R_nn^{−1/2}) h′* / (h′^T h′*) .    (44)

Clearly, the magnitude of h′ doesn't matter, and the SNR is maximized when h′ is colinear with R_nn^{−1/2} s* (i.e. look at the numerator of Eq. (44)). Therefore,

h′_opt ∝ R_nn^{−1/2} s*    (45)

= K_2 R_nn^{−1/2} s*    (46)

where K_2 is an arbitrary constant, so that

h_opt = K_2 R_nn^{−1} s* .    (47)

We sometimes normalize h_opt (i.e. scale the impulse response vector such that the output noise power is normalized), so that

h_opt = R_nn^{−1} s* / √(s^H R_nn^{−1} s) .    (48)

For white input noise, R_nn = σ_n² I_P, and

h_opt ∝ s* .    (49)

The filter coefficient vector is matched to the input observation vector. (It is the folded, conjugated signal observation vector.)
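A brief sketch of Eq. (48) (an illustration with an assumed ramp pulse and an assumed colored-noise covariance, not from the notes, assuming NumPy/SciPy):

```python
# FIR matched filter in colored noise, Eq. (48): h_opt = Rnn^{-1} s* / sqrt(s^H Rnn^{-1} s*).
import numpy as np
from scipy.linalg import toeplitz

P = 8
s = np.arange(P, dtype=float)[::-1]        # s = [s[n_p], ..., s[n_0]] for a ramp pulse
Rnn = toeplitz(0.8 ** np.arange(P))        # assumed colored-noise covariance

Rinv_s = np.linalg.solve(Rnn, s.conj())
h_opt = Rinv_s / np.sqrt(np.real(s.conj() @ Rinv_s))

# Output SNR of this filter vs. the white-noise matched filter h = s* (Eq. (49)):
snr = lambda h: np.abs(h @ s) ** 2 / np.real(h @ Rnn @ h.conj())
print(snr(h_opt), snr(s.conj()))           # h_opt gives the larger SNR in colored noise
```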


7.4.3 DT, Wide-Sense Stationary Random Signal, FIR Filter

In this case we assume that s[n], as well as n[n], is a wide-sense stationary random process. Then,

SNR = E{|y_s[n_p]|²} / E{|y_n[n_p]|²} = (h^T R_ss h*) / (h^T R_nn h*) .    (50)

The matched filter problem is then:

max_{h′}  h′^T (R_nn^{−1/2} R_ss R_nn^{−1/2}) h′* / (h′^T h′*)    (51)

where again h = R_nn^{−1/2} h′.

The solution to this problem is

h′_opt ∝ e_1    (52)

where e_1 is the eigenvector of R_nn^{−1/2} R_ss R_nn^{−1/2} associated with the largest eigenvalue. Then

h_opt ∝ R_nn^{−1/2} e_1 .    (53)

h_opt is also the generalized eigenvector associated with the largest generalized eigenvalue of the matrix pencil {R_ss, R_nn}, i.e. it solves

R_ss h_opt = λ_1 R_nn h_opt .    (54)


7.5 Introduction to Linear Modeling of Random Processes

In signal processing, we try to incorporate prior information about the signals we are interested in so as to give us an advantage in the task at hand – e.g. filtering, detection, parameter estimation. An effective approach to incorporating prior information is through signal modeling. In this Section we briefly introduce this topic by describing one widely used model, a linear model. We do this here because the model is in terms of wide-sense stationary processes and LTI systems. We will use this model later, in Section 10 of the Course, when studying high performance spectrum estimators. The models described below are also described on pages 507-509 of the Course Text.

Figure 25 illustrates a wide-sense stationary, zero-mean white noise process w[n], with variance σ_w², passed through a general LTI system. The output, x[n], is the signal we are interested in. We envision x[n] as having been generated as shown.

Figure 25: Linear model of a DT wide-sense stationary random process.

The frequency response of the system is of the form

H(e^{jω}) = (b_0 + b_1 e^{−jω} + b_2 e^{−j2ω} + ··· + b_Q e^{−jQω}) / (1 + a_1 e^{−jω} + a_2 e^{−j2ω} + ··· + a_P e^{−jPω}) ,    (55)

and thus

S_xx(e^{jω}) = σ_w² |H(e^{jω})|² = σ_w² | (b_0 + b_1 e^{−jω} + b_2 e^{−j2ω} + ··· + b_Q e^{−jQω}) / (1 + a_1 e^{−jω} + a_2 e^{−j2ω} + ··· + a_P e^{−jPω}) |² .    (56)


An example of the advantage of this model is spectrum estimation. Say we are interested in estimating the function

S_xx(e^{jω}) ;   −π ≤ ω < π .   (57)

This is a continuous-function estimation problem. In Section 10 of the Course Notes we will consider methods which generate densely sampled estimates of this continuous function. Alternatively, exploiting the linear-model spectral density expression, Eq (56), we can consider estimating the P + Q + 1 discrete parameters

b_0, b_1, · · · , b_Q, a_1, a_2, · · · , a_P .   (58)

Examples:

1. The all-pole (autoregressive - AR) model:

   H_ar(e^{jω}) = b_0 / ( 1 + a_1 e^{−jω} + a_2 e^{−j2ω} + · · · + a_P e^{−jPω} ) = b_0 / A(e^{jω}) ;   (59)

2. The all-zero (moving average - MA) model:

   H_ma(e^{jω}) = b_0 + b_1 e^{−jω} + b_2 e^{−j2ω} + · · · + b_Q e^{−jQω} = B(e^{jω}) ;   (60)

3. The autoregressive, moving average – ARMA model:

   H_arma(e^{jω}) = B(e^{jω}) / A(e^{jω}) .   (61)
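As an illustration of Eq (56) (a sketch, not part of the notes), the ARMA model power spectral density can be evaluated numerically with scipy.signal.freqz; the b and a coefficients and σ_w^2 below are arbitrary example values.

import numpy as np
from scipy.signal import freqz

b = [1.0, 0.5]                    # b_0 ... b_Q  (MA part, example values)
a = [1.0, -1.2, 0.81]             # 1, a_1 ... a_P (AR part, poles at radius 0.9)
sigma_w2 = 1.0                    # white-noise input variance (example)

w, H = freqz(b, a, worN=1024, whole=True)   # H(e^{jw}) sampled on [0, 2*pi)
Sxx = sigma_w2 * np.abs(H) ** 2             # Eq (56)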


ECE 8072 Statistical Signal Processing
Villanova University, ECE Department
Prof. Kevin M. Buckley

Part 3b
Optimum Filtering & Spectrum Estimation



Contents

9 Optimum Filtering
  9.1 The Problem Statement and Examples
  9.2 Minimum Mean Squared Error Filtering
    9.2.1 Characterization of the Mean-Squared Error Surface
    9.2.2 The Least-Mean Squared (LMS) Adaptive Algorithm
  9.3 Least Squares Filtering

10 Spectrum Estimation
  10.1 The Problem Statement
  10.2 Classical Spectrum Estimation
    10.2.1 Estimation of Rxx[m]
    10.2.2 The Periodogram
    10.2.3 The Averaged Periodogram
    10.2.4 Windowing
    10.2.5 Averaging and Windowing
    10.2.6 Computation
  10.3 AR (Autoregressive) Spectrum Estimation
    10.3.1 Model
    10.3.2 Estimation of the AR Coefficients
    10.3.3 AR Spectrum Estimation
  10.4 Optimum Filter Based Spectrum Estimation
    10.4.1 A Swept Filter or Filter Bank Formulation
    10.4.2 The MVDR Spectrum Estimator
  10.5 MUSIC: An Eigenstructure Approach
    10.5.1 The Model & Correlation Matrix Eigenstructure
    10.5.2 The MUSIC Spectrum Estimator

List of Figures

33 The optimum FIR filter design problem.
34 The general noise canceler structure.
35 A linear predictor used to separate narrowband from wideband input components – a.k.a. the Adaptive Line Enhancer (ALE) and AutoRegressive (AR) model generator.
36 The narrowband Multiple Sidelobe Canceler (MSC) spatial filter problem.
37 A block diagram of the Generalized Sidelobe Canceler (GSC) which can be used to convert array data to a MSC type observation.
38 A RADAR Space-Time Adaptive Processor (STAP).
39 Linear equalizer.
40 Decision Feedback Equalizer (DFE).
41 Fractionally spaced DFE - K = 2 cuts; or a K = 2 antenna array DFE.
42 The MSE surface for N = 2 and real-valued data: (a) 3-D plot; (b) contour plot; (c) gradient search.
43 Sketches of the MSE surface for the N = 2 real-valued data case.
44 An illustration of the orientation of the error surface.
45 Translation of the MSE surface to the coordinate origin.
46 Translation/rotation of the MSE surface to the coordinate origin.
47 The original MSE surface characterized by the eigenstructure of Rxx.
48 Gradient search of the MSE surface.
49 FIR least squares design problem.
50 The AR model.
51 A typical AR spectrum.
52 The MVDR filter.


9 Optimum Filtering

The coverage in this Section of the Course Notes relates to Section 7.3 and Chapter 13 of the Course Text. The discussions in the Text are more general and more directed to CT signal processing. Here, we discuss discrete-time optimum filters only, and restrict the structure of the optimum discrete-time filter to be Finite Impulse Response (FIR). This is the easiest case to comprehend, and thus is a good case to focus on in this introductory discussion. Also, the FIR filter structure is predominately used in optimum filtering, due to its design simplicity, its flexibility, its guaranteed stability, and the fact that simple, effective data adaptive algorithms exist for approximating the optimum filter.

We assume the signal to be filtered is random and Wide-Sense Stationary (WSS). We will consider two common filter optimization formulations:

1) Minimum Mean-Squared Error (MMSE) filtering; and

2) Least-Squares (LS) filtering.

These formulations lead to design algorithms which we will introduce. The resulting design algorithms generate FIR filter coefficients which are a function of the statistics (for MMSE) or the averages (for LS) of the WSS signal the filter is processing. They are time-invariant filters in that their coefficients are fixed over time. In many applications we wish to process nonstationary random signals, and/or WSS signals for which we do not have the statistics or averages required by the optimum filter design algorithm. Thus, optimum MMSE and LS filters are often implemented as adaptive filters, for which coefficients are computed using existing data as opposed to known statistics or averages. The two most commonly employed adaptive filtering algorithms are:

3) the Least Mean Squares (LMS) algorithm which approximates the MMSE filter; and

4) the Recursive Least Squares (RLS) algorithm, which is exactly the LS filter.

These algorithms are called data adaptive because the FIR filter coefficients are updated as a function of the data they process. So, simple adaptive versions of these exist which are applicable in time varying signal environments. We will lay the foundation necessary to understand these adaptive filters, and only briefly describe them. Another popular optimum FIR filter, which is time-varying but not data adaptive,

5) the Kalman filter,

will not be covered here because of time constraints.


9.1 The Problem Statement and Examples

Consider a random vector observation sequence,

x[n]; n = 0, 1, 2, · · · , (1)

assumed to be wide-sense stationary. We wish to process this vector sequence, linearly, to provide an estimate

d̂[n] = h^H x[n]   (2)

of a wide-sense stationary signal d[n] which is embedded in x[n]. We assume that x[n] and d[n] are zero mean (i.e. if necessary, the means have been subtracted off). The objective is to design a filter h (i.e. a linear combiner) to minimize a cost related to the error e[n] = d[n] − d̂[n]. Figure 33 below illustrates this filtering objective.


Figure 33: The optimum FIR filter design problem.

In terms of optimum filter terminology, d[n] is referred to as the primary or desired signal, and x[n] is called the reference signal. In the remainder of this Chapter we will use lower case (non-bold) notation for signals, i.e. the discussion will be in terms of realizations of random processes.

Page 218: ECE 8072, Statistical Signal Processing, Fall 2010

Kevin Buckley - 2010 197

Example 9.1 - A General FIR Noise Canceler:

The idea is that you have a primary signal d[n] = s[n] + n[n] which consists of a signal of interest s[n] superimposed with an interference signal n[n]. This primary signal plays the role of the "desired" signal. s[n] and n[n] may share the same bandwidth, so they can not be separated with frequency selective filters. The reference signal x[n] is related to the interferer n[n] such that it can be filtered to closely replicate n[n]. So samples of x[n] play the role of the data to be linearly combined – i.e. they form the observation vector x[n].

If the filter coefficient vector h is selected such that d̂[n] ≈ n[n], then the "error" signal e[n] ≈ s[n]. So e[n] turns out to be what we are looking for. The key here is to design h to do this. If the reference x[n] is uncorrelated with the signal of interest s[n] but highly correlated with the interference n[n], then designing h to minimize the MSE E{|e[n]|^2} will do the job.

Alternatively, x[n] might be a signal similar to the signal of interest s[n]. Then d̂[n] turns out to be what we are looking for.


Figure 34: The general noise canceler structure.


Example 9.2: Linear Predictor & Adaptive Line Enhancer

Consider the structure shown in Figure 35, which processes a single input signal x[n]. In terms of the basic optimum FIR filter terminology, this input signal plays the roles of both the desired signal d[n] and the reference x[n] used to estimate the desired signal. Specifically, the desired signal is the input, i.e. d[n] = x[n]. The reference vector is composed of delayed inputs, i.e.

x[n−D] = [x[n−D], x[n−D−1], · · · , x[n−D−N+1]]^T   (3)

where D is a delay length. By selecting the coefficient vector h to minimize the power of the error signal e[n], delayed input samples in x[n−D] are used to form an estimate of x[n]. That is, the past input samples x[n−k]; k = D, D+1, · · · , D+N−1 are linearly combined to predict the current input x[n] – thus the term linear predictor. The signal d̂[n] is then the predictable part of x[n] (the narrowband component) while e[n] is the unpredictable (wideband) component. The structure thus separates the narrowband from wideband input components, without knowledge of the narrowband component frequencies. When this structure and an adaptive coefficient algorithm are used to separate sinusoids from wideband noise, the structure is called an Adaptive Line Enhancer (ALE). When D = 1 the coefficients in h form the coefficients of the AutoRegressive (AR) model (see Section 7.5 of the Course Text) used, for example, for spectrum estimation.


Figure 35: A linear predictor used to separate narrowband from wideband input components– a.k.a. the Adaptive Line Enhancer (ALE) and AutoRegressive (AR) model generator.


Example 9.3: The Multiple Sidelobe Canceler:

Optimum and adaptive filtering techniques are very commonly used in the collaborative processing of signals from a number of sensors. To see how this can be done, first consider the specialized situation illustrated in Figure 36. The sensor array configuration consists of one highly directional sensor and N nondirectional sensors. In this case the highly directional sensor, sensor #1, is pointed towards a source of interest. This sensor also receives energy from interferers, as main-lobe or sidelobe leakage. Assuming the interferers are stronger than the signal of interest, they can dominate the sensor #1 output d[n]. d[n] is used as the primary input to a noise canceler.

The nondirectional sensor outputs are used as reference signals. They mainly pick up the interferer signals, which again are assumed to be much stronger than the signal of interest. These reference signals are linearly combined so as to minimize the power of e[n]. This structure is called a Multiple Sidelobe Canceler (MSC) since it can cancel multiple strong interferers picked up through the sidelobes of the directional antenna.


Figure 36: The narrowband Multiple Sidelobe Canceler (MSC) spatial filter problem.


Example 9.4: The Generalized Sidelobe Canceler:

As with many arrays used in applications such as Radar, Sonar, communications or ultrasound imaging, the array typically does not contain a highly directional sensor and does not operate in an environment where interferers dominate the signal of interest. Usually, energy is impinging on an array of similar sensors. The signal of interest is a signal impinging from a selected location, and interferers are signals from all other locations. So how can useful primary and reference signals be generated? An effective array preprocessor for this is called a Generalized Sidelobe Canceler (GSC). This is illustrated in Figure 37. The input vector x[n] is the vector of K sensor outputs at sample time n. h_q is the coefficient vector of a spatial filter which passes the signal of interest from the selected location with unit gain, while attenuating signals from other locations. The output of this spatial filter, d[n], plays the role of the output of the directional sensor of the MSC (i.e. the primary input to the optimum FIR filter). The K × N matrix B, where N < K, represents a bank of spatial filters pointed to other locations which all block the signal of interest. The N outputs of these spatial filters play the role of the MSC reference sensor outputs. An optimum filter can then be applied to provide the same effect as the MSC.


Figure 37: A block diagram of the Generalized Sidelobe Canceler (GSC) which can be usedto convert array data to a MSC type observation.

Example 9.5: A RADAR Space-Time Filter:

RADAR is an active system in which electromagnetic pulses are transmitted onto an environment of interest and data is collected and processed to detect, localize (i.e. position and motion) and classify returns of these pulses reflected back from certain objects. Objects that reflect radiated pulses are both those of interest, called targets, and those which interfere. Returns from those objects which interfere with the targets are called clutter. In addition to returns, the RADAR receiver data contains additive noise and other interfering signals such as active jammers transmitted by others in an attempt to defeat the RADAR.

Received data across a RADAR array is gated in numerous ways. For example, pulse gating involves sampling the data over the duration of the possible returns due to one transmitted pulse. Range gating involves dividing a pulse gate into different times corresponding to returns from different ranges. A space-time observation can be formed at several points along the gating process. For example, consider the range-gated RADAR array output x_r^i(t), where i indicates the i-th pulse gate and r the r-th range gate within a pulse gate. Stacking the same range gate, r, over L consecutive pulse gates, the KL dimensional vector

X_r^i(t) = [ x_r^i(t)^T , x_r^{i−1}(t)^T , · · · , x_r^{i−L+1}(t)^T ]^T   (4)

is referred to as a space-time observation. Figure 38 shows a general Space-Time Adaptive Processor (STAP) which processes this space-time observation. Note that this is a broadband spatial filter, where the delay length T is the pulse separation (i.e. the length of the pulse gate). The objective of this STAP processor is to remove unwanted components from the data (e.g. clutter, interference and jammers) so that targets of interest can be more effectively detected and classified. Since the observation is spatial-temporal, the STAP processor can exploit spatial-temporal differences between different components of the received data. For example, the STAP processor can form pencil beam beampatterns that can be successively swept over azimuth and elevation angles.


Figure 38: A RADAR Space-Time Adaptive Processor (STAP).


Example 9.6: Channel Equalization for Digital Communications:

The following three figures show progressively more sophisticated channel equalization processors for receiving digital communications signals at the output of a multipath intersymbol interference channel. The equalizer is employed to mitigate the effect of the channel. Figure 39 illustrates the linear equalizer structure which is most commonly employed. Compared to the linear equalizer, the decision feedback equalizer is more effective for channels with spectral nulls. As shown in Figure 40, with this structure past detected symbols are fed back to assist in canceling the interference of past symbols when estimating a current symbol. Multiple cut equalizers exploit multiple received samples per symbol obtained, for example, by oversampling the output of a single receiver or by using multiple receivers. Figure 41 shows a 2 cut DFE. The design problem for all of these structures is commonly formulated in terms of an FIR optimum filtering problem. These equalizers operate in two modes: a training mode for which the desired signal d[n] consists of known transmitted symbols which are available at the receiver; and a decision directed mode for which detected symbols are used as d[n].


Figure 39: Linear equalizer.



Figure 40: Decision Feedback Equalizer (DFE).


Figure 41: Fractionally spaced DFE - K = 2 cuts; or a K = 2 antenna array DFE.


9.2 Minimum Mean Squared Error Filtering

Consider the general linear combiner optimum filtering problem shown in Figure 33. The linear filter output is d̂[n] = h^H x[n], and the error is e[n] = d[n] − d̂[n]. The mean-squared error (MSE), as a function of the filter coefficient vector h, is

σ_e^2(h) = σ_e^2 = E{|e[n]|^2} = E{|d[n] − h^H x[n]|^2}   (5)
         = σ_d^2 − r_{x,d}^H h − h^H r_{x,d} + h^H R_xx h   (6)
         = ( σ_d^2 − r_{x,d}^H R_xx^{−1} r_{x,d} ) + ( h − R_xx^{−1} r_{x,d} )^H R_xx ( h − R_xx^{−1} r_{x,d} ) ,   (7)

where σ_d^2 = E{|d[n]|^2}, r_{x,d} = E{d^*[n] x[n]}, and R_xx = E{x[n] x^H[n]}. The minimum MSE (MMSE) filter design problem, also known as the Wiener filter problem, is

min_h  σ_e^2(h) .   (8)

Assuming that R_xx is full rank, so that it is positive definite, from Eq. (7) we have

h_opt = R_xx^{−1} r_{x,d} .   (9)

This is the Wiener filter (or MMSE filter) equation. The Wiener filter is the solution to the Wiener-Hopf (or normal) equations

R_xx h_opt = r_{x,d} .   (10)

Also from Eq. (7), the MMSE is

σ_{e,min}^2 = σ_d^2 − r_{x,d}^H R_xx^{−1} r_{x,d} .   (11)
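A minimal sketch of Eqs (9)-(11) in Python/NumPy, assuming the second-order statistics are given; the numerical values of R_xx, r_{x,d} and σ_d^2 are illustrative placeholders, not from the notes.

import numpy as np

Rxx = np.array([[4.0, 2.6],
                [2.6, 4.0]])      # example 2 x 2 data correlation matrix (assumed)
r_xd = np.array([2.0, 1.6])       # example cross-correlation E{d*[n] x[n]} (assumed)
sigma_d2 = 2.0                    # example desired-signal power (assumed)

h_opt = np.linalg.solve(Rxx, r_xd)      # Eqs (9)/(10): Rxx h_opt = r_{x,d}
mmse = sigma_d2 - r_xd @ h_opt          # Eq (11), using r^H R^{-1} r = r^H h_opt
print("h_opt =", h_opt, " MMSE =", mmse)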

Example 9.7 - An illustration: Figures 42 and 43 provide a visualization of the MSE error vs. the Wiener filter coefficient vector for the N = 2, real-valued data case.


Figure 42: The MSE surface for N = 2 and real-valued data: (a) 3-D plot; (b) contour plot;(c) gradient search.


Figure 43: Sketches of the MSE surface for the N = 2 real-valued data case.

The Orthogonality Principle

Consider the normal equations, rearranged as

Rxx hopt − rx,d = 0N . (12)

This can be rewritten as

E{ x[n] x^H[n] h_opt − x[n] d^*[n] } = 0_N ,   (13)

which leads to

E{ x[n] ( d^*[n] − d̂^*[n] ) } = E{ x[n] e^*[n] } = 0_N .   (14)

This is the orthogonality principle. It states that with the MMSE solution, the error e[n] is statistically orthogonal to the data x[n]. Think of it this way. The optimum filter output d̂[n] (which is formed from the data x[n]), when subtracted from d[n] (i.e. e[n] = d[n] − d̂[n]), extracts everything possible in x[n] out of d[n].

Note that E{d̂[n] d^*[n]} = E{d̂[n] d̂^*[n]} = σ_{d̂}^2, the power of the optimum filter output. With this, the MMSE can be written as

σ_{e,min}^2 = σ_d^2 − h_opt^H r_{x,d} = σ_d^2 − E{h_opt^H x[n] d^*[n]} = σ_d^2 − σ_{d̂}^2 .   (15)


Example 9.8: Given an observation signal of the form

x[n] = s[n] + n[n] (16)

where s[n] and n[n] are orthogonal and

R_ss[m] = 2 (0.8)^{|m|} ;   R_nn[m] = 2 (0.5)^{|m|}   (17)

find the FIR optimum MMSE filter and the corresponding mean-squared error for estimating d[n] = s[n].

Do this for both N = 2 and N = 3.
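One way to set up Example 9.8 numerically (a sketch, not the notes' worked solution): since s[n] and n[n] are orthogonal, R_xx[m] = R_ss[m] + R_nn[m], and with d[n] = s[n] the cross-correlation vector has elements r_{x,d}[k] = R_ss[k].

import numpy as np
from scipy.linalg import toeplitz

Rss = lambda m: 2.0 * 0.8 ** np.abs(m)   # given signal correlation
Rnn = lambda m: 2.0 * 0.5 ** np.abs(m)   # given noise correlation

for N in (2, 3):
    m = np.arange(N)
    Rxx = toeplitz(Rss(m) + Rnn(m))      # N x N observation correlation matrix
    r_xd = Rss(m)                        # cross-correlation vector for d[n] = s[n]
    h_opt = np.linalg.solve(Rxx, r_xd)   # Eq (9)
    mmse = Rss(0) - r_xd @ h_opt         # Eq (11), with sigma_d^2 = Rss[0]
    print(f"N={N}:  h_opt={h_opt},  MMSE={mmse:.4f}")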


9.2.1 Characterization of the Mean-Squared Error Surface

Recall that the FIR filter mean-squared error, as a function of the filter coefficient vector h, is

σ_e^2(h) = σ_{e,min}^2 + ( h − R_xx^{−1} r_{x,d} )^H R_xx ( h − R_xx^{−1} r_{x,d} )   (18)

where

σ_{e,min}^2 = σ_d^2 − r_{x,d}^H h_opt ,   (19)

and h_opt = R_xx^{−1} r_{x,d}. The mean-squared error surface is the plot of σ_e^2(h) as a function of h. Characterization of this surface is important for understanding how variations from h_opt affect the mean-squared error σ_e^2(h). We now investigate this.

Consider Figure 44, which shows the mean-squared error surface as a function of h. It is shown for the N = 2 real-valued data case as a contour plot so as to illustrate its shape. Since σ_e^2(h) is quadratic in h, the surface is "bowl shaped" (assuming R_xx is full rank).

Concerning the orientation of this contour plot, notice the axes drawn from the h_opt point on the plot. Deviation from this point along the axis labeled "1" would be worse than deviation along "2". Our objective is to characterize this error surface in terms of how deviation from h_opt in different directions affects σ_e^2(h).

Figure 44: An illustration of the orientation of the error surface.


To characterize this orientation, first let f = h − h_opt. As illustrated in Figure 45 for the N = 2, real-valued case, this corresponds to a translation of the error surface such that its minimum is at the coordinate origin. In terms of f, the mean-squared error is

σ_e^2(f) = σ_{e,min}^2 + f^H R_xx f .   (20)

Figure 45: Translation of the MSE surface to the coordinate origin.

Next, using the correlation matrix eigenstructure decomposition R_xx = E Λ E^H, consider the transformation v = E^H f. Since E is a unitary matrix, this transformation is a rotation. To see what this rotation accomplishes, consider the MSE surface as a function of v:

σ_e^2(v) = σ_{e,min}^2 + v^H E^H E Λ E^H E v   (21)
         = σ_{e,min}^2 + v^H Λ v   (22)
         = σ_{e,min}^2 + Σ_{k=1}^{N} |v_k|^2 λ_k   (23)

where v_k is the kth element of v and λ_k is the kth ordered, positive eigenvalue of R_xx. From analytic geometry, we recognize Eq (23) as an ellipsoid with principal axes corresponding to the coordinate axes. The contour plot in Figure 46, for N = 2 and real-valued data, illustrates this translated, rotated surface.

Figure 46: Translation/rotation of the MSE surface to the coordinate origin.


In Figure 46 note that the eigenvalue spread (i.e. the deviation of the eigenvalues from equal values) determines the shape of the σ_e^2(v) surface. For all equal eigenvalues, the error surface "bowl" is round, and deviation of v from the origin in any direction has the same effect on the error. For a large eigenvalue spread, deviation from the origin in a direction corresponding to a large eigenvalue has a much more pronounced effect on σ_e^2(v) than deviation in the direction corresponding to a small eigenvalue. A large eigenvalue spread results in a very noncircular bowl.

Consider the principal axes of the equi-error contour ellipsoids. As noted earlier, in terms of v these axes are the coordinate axes. The figure below illustrates the principal axes in terms of f = E v. Let v_k represent a coordinate axis (and principal axis) for the v space. Each v_k is a vector of all zeros except that the kth element is a one (i.e. a coordinate axis vector). Since E is unitary, in terms of f, the principal axes f_k = E v_k are a rotation of the v_k. Note that f_k = e_k, the kth eigenvector of R_xx. Thus, in terms of the f space, the shape of the error surface is characterized by deviation from the origin along the eigenvector directions. Deviation in the direction of an eigenvector associated with a large eigenvalue is worse than deviation along an eigenvector associated with a small eigenvalue.

Finally, note that since h = f + h_opt (a translation of f from the origin to h_opt), the error surface is translated to be centered at h_opt. Otherwise, it has the same shape and orientation. So, in terms of h, deviation of the filter coefficient vector from h_opt is characterized by the eigenstructure of R_xx. This is illustrated in Figure 47.

Figure 47: The original MSE surface characterized by the eigenstructure of Rxx.


Example 9.9: Two unequal power sinusoidal signals, with orthogonal observations, in additive white noise:


Example 9.10: Two equal power sinusoidal signals, with non-orthogonal observations, in additive white noise:


9.2.2 The Least-Mean Squared (LMS) Adaptive Algorithm

MMSE adaptive algorithms operate by adjusting the FIR filter coefficient vector so as to converge to the optimum coefficient vector h_opt. To understand how this is done, we need to understand how the MSE σ_e^2(h) varies as h varies. That is, we need to characterize the MSE surface. MMSE adaptive algorithms search the MSE surface to find the minimum.

Consider the MSE equation

σ_e^2(h) = σ_{e,min}^2 + ( h − R_xx^{−1} r_{x,d} )^H R_xx ( h − R_xx^{−1} r_{x,d} ) .   (24)

This function is illustrated earlier in Figures 42 and 43 for the N = 2, real-valued data case. As a function of h, this surface is quadratic. This means that the function has a unique minimum, at h = h_opt, as long as the data correlation matrix R_xx is invertible (i.e. as long as it is full rank). A quadratic surface is particularly simple to search for a minimum.

One way to compute the MMSE filter coefficient vector h_opt is to search the cost function σ_e^2(h), by varying h, to find the minimizing value h_opt. An effective approach to such a search is the steepest descent coefficient vector update algorithm. Let h[k] be the FIR coefficient vector at iteration time k. The steepest descent update is

h[k + 1] = h[k] − (µ/2) ∇[k]   (25)

where ∇[k] is the gradient vector of the MSE surface at h[k]. The gradient vector points in the direction of steepest ascent of the surface at h[k], so the coefficient vector adjustment shown is in the steepest descent direction. Figure 48 illustrates steepest descent using an MSE contour plot for the N = 2 case.

Figure 48: Gradient search of the MSE surface.


The N dimensional gradient vector is

∇[k] = ∂σ_e^2(h)/∂h |_{h=h[k]} = 2 R_xx h[k] − 2 r_{x,d} .   (26)

As with direct computation of h_opt, these updates require knowledge of

r_{x,d} = E{d^*[n] x[n]}   (27)

and

R_xx = E{x[n] x^H[n]}   (28)

which in application are often not available. This motivates the development of adaptive algorithms which use available data to, in effect, estimate the required statistics.

To develop the LMS algorithm, in place of the required statistics consider estimating them as follows from available data:

R̂[n] = x[n] x^H[n]   (29)

and

r̂[n] = d^*[n] x[n] .   (30)

We term these instantaneous estimates since the expectations (i.e., the ensemble averages) are replaced by the actual data values at the current instant in time. If we use these instantaneous estimates in place of the actual statistical quantities, we get the following instantaneous estimate of the gradient for a current coefficient vector ĥ[n]:

∇̂[n] = −2 r̂[n] + 2 R̂[n] ĥ[n]
     = −2 x[n] d^*[n] + 2 x[n] x^H[n] ĥ[n]
     = −2 x[n] ( d^*[n] − x^H[n] ĥ[n] )
     = −2 x[n] e^*[n] ,

so that

ĥ[n + 1] = ĥ[n] + µ x[n] e^*[n] .   (31)

This is the update equation for the Least Mean Squared (LMS) algorithm. Note that the "hat" in the notation ĥ[n] signifies that the weight vector is an estimate of the steepest descent weight vector. It is a random vector since it is a function of the random data itself, not the second order statistics of the data.

The LMS algorithm is:

1) Select µ and set ĥ[0] = 0_N.
2) Given data x[n] and d[n]; n = 0, 1, 2, · · ·.
3) For n = 0, 1, 2, · · · compute

   d̂[n] = ĥ^H[n] x[n]   (32)
   e[n] = d[n] − d̂[n]   (33)
   ĥ[n+1] = ĥ[n] + µ x[n] e^*[n] .   (34)

end
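A minimal runnable sketch of the LMS recursion, Eqs (32)-(34). The system-identification test setup (an unknown FIR channel h_true plus a small amount of noise), the step size µ and the filter length N are illustrative choices, not part of the notes.

import numpy as np

rng = np.random.default_rng(0)
N, mu, Ns = 4, 0.01, 5000
h_true = np.array([1.0, -0.5, 0.25, 0.1])       # assumed unknown system to identify

x = rng.standard_normal(Ns)                     # reference (input) signal
d = np.convolve(x, h_true, mode="full")[:Ns]    # desired signal
d += 0.01 * rng.standard_normal(Ns)             # observation noise

h = np.zeros(N)                                 # h[0] = 0_N
for n in range(N - 1, Ns):
    xv = x[n::-1][:N]                           # x[n] = [x[n], x[n-1], ..., x[n-N+1]]
    d_hat = h @ xv                              # Eq (32)
    e = d[n] - d_hat                            # Eq (33)
    h = h + mu * xv * e                         # Eq (34); real data, so e* = e
print("estimated h:", h)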


9.3 Least Squares Filtering

Least Squares (LS) filtering is similar to MSE filtering in that it is based on minimization of a cost function which is concerned with the second order difference between the filter output and a desired signal. However, LS and MSE are fundamentally different in that the former is based on a deterministic cost while the latter, as we already know, is based on a probabilistic cost (i.e. the expectation of the squared error). Below we first present the linear LS filtering problem and solution. We then develop a computational algorithm for linear LS, termed Recursive Least Squares (RLS).

Let N be the length of h. For time n ≥ N, consider the linear filtering problem illustrated in Figure 49. Available for processing, we have an observation vector sequence x[k]; k = 0, 1, · · · , n and desired signal d[k]; k = 0, 1, · · · , n.

Figure 49: FIR least squares design problem.

The observation vector sequence is to be linearly filtered with a coefficient vector h which is to be designed to minimize the weighted sum of squares cost

J(h) = Σ_{k=0}^{n} λ^{n−k} |e[k]|^2   (35)

where

e[k] = d[k] − h^H x[k] ;   k = 0, 1, · · · , n   (36)

and λ, termed the forgetting factor, is chosen according to 0 ≪ λ < 1 so as to deemphasize errors further in the past. This cost is termed deterministic because it is based on averaging over available data as opposed to an expectation. Because of this, the optimum coefficient vector will be computed from temporal averages over available observations as opposed to observation statistics.

To solve for the optimum LS filter, consider the set of linear equations

[ λ^{n/2} x^H[0] ;  λ^{(n−1)/2} x^H[1] ;  · · · ;  λ^0 x^H[n] ]  h  =  [ λ^{n/2} d^*[0] ;  λ^{(n−1)/2} d^*[1] ;  · · · ;  λ^0 d^*[n] ] ,   (37)

where the semicolons denote row stacking, or, in matrix form,

X h = d .   (38)

Since n > N has been assumed, this is an overdetermined set of linear equations. So in general an h does not exist that solves (38).


Assume that (X^H X)^{−1} exists. From linear algebra, we have that

h_opt = (X^H X)^{−1} X^H d   (39)

minimizes the Equation (35) cost. That is, Equation (39) solves the linear LS problem

min_h  J(h) = Σ_{k=0}^{n} λ^{n−k} |e[k]|^2 .   (40)

Note that Equation (39) can be written as

h_opt = R^{−1} r   (41)

where

R = Σ_{k=0}^{n} λ^{n−k} x[k] x^H[k]   (42)

and

r = Σ_{k=0}^{n} λ^{n−k} x[k] d^*[k] .   (43)

Often, the way that linear LS filtering is implemented is that h_opt is computed at each time n and used to process only the observation x[n]. That is, a new coefficient vector is computed, as in Equation (41), each time sample. For this implementation, linear LS coefficient vector design, as computed directly by Equation (41), requires O(N^3) multiplies per sample time. The Recursive Least Squares (RLS) algorithm, described next, reduces this required computation.

Derivation of the RLS algorithm makes use of the following result from linear algebra, called the Matrix Inversion Lemma (MIL).

Given: A = B^{−1} + C D^{−1} C^H ,

the MIL states: A^{−1} = B − B C ( D + C^H B C )^{−1} C^H B.


RLS:

Given h_opt[n−1], R^{−1}[n−1] and r[n−1], along with new data x[n], d[n], the objective is to compute h_opt[n] efficiently. This is the recursive least squares (RLS) objective. First note that

r[n] = x[n] d^*[n] + Σ_{k=0}^{n−1} λ^{n−k} x[k] d^*[k]   (44)
     = x[n] d^*[n] + λ r[n−1] .

Similarly,

R[n] = x[n] x^H[n] + λ R[n−1] .   (45)

Let P_n = R^{−1}[n]. Using the matrix inversion lemma, we have that

P_n = (1/λ) P_{n−1} − [ (1/λ^2) P_{n−1} x[n] x^H[n] P_{n−1} ] / [ 1 + (1/λ) x^H[n] P_{n−1} x[n] ]   (46)
    = (1/λ) P_{n−1} − (1/λ) k_n x^H[n] P_{n−1} ,

where

k_n = (1/λ) P_{n−1} x[n] / ( 1 + (1/λ) µ_n )   (47)

is termed the gain vector, with

µ_n = x^H[n] P_{n−1} x[n] .   (48)

It can be shown that Equation (47) reduces to

k_n = P_n x[n] .   (49)

Combining Equation (41) with (46), (49) and (44), we get

h_opt[n] = R^{−1}[n] r[n]   (50)
         = P_n r[n]
         = λ P_n r[n−1] + P_n x[n] d^*[n]
         = h_opt[n−1] + k_n ε_n^*

where

ε_n = d[n] − h_opt^H[n−1] x[n]   (51)

and to get to the last line of (50), (46) was used and a few steps were skipped.


In summary, the RLS algorithm is:

• initialize:

  P_0 = c^{−1} I_N   (52)
  h_opt[0] = 0_N   (53)

  where c is a small constant.

• At time n, given h_opt[n−1], P_{n−1} and x[n], d[n],

  v_n = P_{n−1} x[n]   (54)
  k_n = (1/λ) v_n / ( 1 + (1/λ) x^H[n] v_n )   (55)
  P_n = (1/λ) P_{n−1} − (1/λ) k_n v_n^H   (56)
  ε_n = d[n] − h_opt^H[n−1] x[n]   (57)
  h_opt[n] = h_opt[n−1] + k_n ε_n^*   (58)
  d̂[n] = h_opt^H[n] x[n] .   (59)

The last equation implements the filter. A count of multiplies indicates that, for each time n, 3N^2 + 5N multiplies are required (i.e. O(N^2)).
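A minimal runnable sketch of the RLS recursion, Eqs (52)-(59), using the same illustrative system-identification setup as the LMS sketch; the values of λ and c are typical but arbitrary choices, not taken from the notes.

import numpy as np

rng = np.random.default_rng(1)
N, lam, c, Ns = 4, 0.99, 1e-3, 2000
h_true = np.array([1.0, -0.5, 0.25, 0.1])       # assumed unknown system

x = rng.standard_normal(Ns)
d = np.convolve(x, h_true, mode="full")[:Ns] + 0.01 * rng.standard_normal(Ns)

P = (1.0 / c) * np.eye(N)                       # Eq (52): P_0 = c^{-1} I_N
h = np.zeros(N)                                 # Eq (53)
for n in range(N - 1, Ns):
    xv = x[n::-1][:N]                           # data vector x[n]
    v = P @ xv                                  # Eq (54)
    k = (v / lam) / (1.0 + (xv @ v) / lam)      # Eq (55): gain vector
    P = (P - np.outer(k, v)) / lam              # Eq (56)
    eps = d[n] - h @ xv                         # Eq (57)
    h = h + k * eps                             # Eq (58); real data, so eps* = eps
print("estimated h:", h)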

The RLS algorithm is one popular example of a data adaptive filter algorithm. At each time, the filter coefficient vector h[n] is updated using the new data at time n. The RLS algorithm is based on the LS design criterion. Other popular adaptive filter algorithms exist. The least mean-squared (LMS) algorithm is another widely used adaptive filter algorithm, based on the MSE criterion.


10 Spectrum Estimation

In this Section of the Course we provide a broad overview of Spectrum Estimation. We focus on discrete-time methods, which are much more commonly employed than continuous-time methods. Spectrum estimation, the empirical study of the frequency content of a signal, has applications ranging from astrophysical exploration to SONAR, from machine diagnostics to biomedical engineering, and from pure science to economics.

We begin in Section 10.1 with the definition of the power spectral density, which describes the frequency content of a wide-sense stationary (WSS) random process. In Section 10.2 we study classical spectrum estimation, which is the direct extension of the windowing/FFT based procedure applied to deterministic signals. In studying the performance and limitations of this approach, we will see the need for more advanced methods, which we introduce in Sections 10.3 (AR Spectrum Estimation), 10.4 (Optimum Filter Based Spectrum Estimation) and 10.5 (Correlation Matrix Eigenstructure Based Spectrum Estimation).

10.1 The Problem Statement

Given a discrete-time, zero-mean, wide-sense stationary process x[n], we know that its power spectral density is

S_xx(e^{jω}) = Σ_{m=−∞}^{∞} R_xx[m] e^{−jmω}   (1)

i.e. the DTFT of the autocorrelation function R_xx[m]. We have also seen that, in terms of the random process x[n], the power spectral density is

S_xx(e^{jω}) = lim_{N→∞} (1/(2N+1)) E{|X_N(e^{jω})|^2} ,   (2)

where

X_N(e^{jω}) = Σ_{n=−N}^{N} x[n] e^{−jnω} .   (3)

The spectrum estimation problem is: given a finite duration sampling of a realization x[n] of the random process x[n], i.e.

x[n]; n = 0, 1, · · · , Ns − 1 , (4)

estimate Sxx(ejω).


Example 10.1 - Sinusoids in noise:

Example 10.2 - An AR process:


Issues:

• Resolution, dynamic range or detectability and SNR.

• Estimate bias and variance: let Ŝ(e^{jω}) be an estimate of S_xx(e^{jω}). The bias (as a function of ω) is

  B(ω) = E{Ŝ(e^{jω})} − S_xx(e^{jω}) .   (5)

  The covariance of Ŝ(e^{jω}) is

  Cov{Ŝ(e^{jω1}), Ŝ(e^{jω2})} = E{ ( Ŝ(e^{jω1}) − E{Ŝ(e^{jω1})} ) ( Ŝ(e^{jω2}) − E{Ŝ(e^{jω2})} ) }
                              = E{ Ŝ(e^{jω1}) Ŝ(e^{jω2}) } − E{Ŝ(e^{jω1})} E{Ŝ(e^{jω2})} .   (6)

• Var{Ŝ(e^{jω})} = Cov{Ŝ(e^{jω}), Ŝ(e^{jω})}.

• Ŝ(e^{jω}) will be real-valued (using established estimation methods).

• Ŝ(e^{jω}) should be non-negative.

10.2 Classical Spectrum Estimation

The basic approach is to directly approximate either

S_xx(e^{jω}) = Σ_{m=−∞}^{∞} R_xx[m] e^{−jmω}   (7)

or

S_xx(e^{jω}) = lim_{N→∞} (1/(2N+1)) E{|X_N(e^{jω})|^2} ,   (8)

X_N(e^{jω}) = Σ_{n=−N}^{N} x[n] e^{−jnω} .   (9)

10.2.1 Estimation of Rxx[m]

We start with the fact that

R_xx[m] = E{x[n] x^*[n−m]} = E{x[n+m] x^*[n]} .   (10)

Since R_xx[−m] = R_xx^*[m], we need only estimate R_xx[m] for m ≥ 0. The basic idea is to replace expectations (ensemble averages) with temporal averages. In this Subsection we consider two specific methods.


1. The correlation estimator R′_xx[m]: Given x[n]; n = 0, 1, · · · , Ns − 1, first consider the sample correlation function estimator

   R′_xx[m] = (1/(Ns − m)) Σ_{n=0}^{Ns−1−m} x[n+m] x^*[n]   for m = 0, 1, 2, · · · , Ns − 1 ,
   R′_xx[m] = R′_xx^*[−m]                                   for m = −1, −2, · · · , −Ns + 1 ,
   R′_xx[m] = 0                                             otherwise.   (11)

   Note that within the summation, only data which is available is used (the significance of this will become clear later). For each m = 0, 1, 2, · · · , Ns − 1, the averaging is over only the available data products.

   The mean of R′_xx[m]:

   The variance of R′_xx[m]:


2. The correlation estimator R̂_xx[m]: Given x[n]; n = 0, 1, · · · , Ns − 1, consider the alternative sample correlation function estimator

   R̂_xx[m] = (1/Ns) Σ_{n=0}^{Ns−1−m} x[n+m] x^*[n]   for m = 0, 1, 2, · · · , Ns − 1 ,
   R̂_xx[m] = R̂_xx^*[−m]                             for m = −1, −2, · · · , −Ns + 1 ,
   R̂_xx[m] = 0                                       otherwise.   (12)

   That is,

   R̂_xx[m] = ((Ns − |m|)/Ns) R′_xx[m] .   (13)

   The mean of R̂_xx[m]:

   The variance of R̂_xx[m]:
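A short sketch (not from the notes) computing both sample correlation estimators, Eqs (11) and (12), on an arbitrary correlated test sequence.

import numpy as np

rng = np.random.default_rng(0)
Ns = 256
x = rng.standard_normal(Ns)
for n in range(1, Ns):                          # simple correlated test sequence (assumed)
    x[n] += 0.8 * x[n - 1]

def corr_estimates(x, max_lag):
    Ns = len(x)
    Rp = np.zeros(max_lag + 1, dtype=complex)   # R'_xx[m], Eq (11), divides by Ns - m
    Rb = np.zeros(max_lag + 1, dtype=complex)   # R_xx[m] of Eq (12), divides by Ns
    for m in range(max_lag + 1):
        acc = np.sum(x[m:Ns] * np.conj(x[:Ns - m]))
        Rp[m] = acc / (Ns - m)
        Rb[m] = acc / Ns
    return Rp, Rb

Rp, Rb = corr_estimates(x, max_lag=10)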


10.2.2 The Periodogram

First consider the R̂_xx[m] correlation estimator introduced second in Subsection 10.2.1 directly above. The periodogram is the following power spectral density estimator:

I_Ns(e^{jω}) = DTFT{ R̂_xx[m] } = Σ_{m=−Ns+1}^{Ns−1} R̂_xx[m] e^{−jmω} .   (14)

This can also be expressed as

I_Ns(e^{jω}) = DTFT{ (1/Ns) x[n] ∗ x^*[−n] } ,   (15)

i.e. the DTFT of x[n]; n = 0, 1, 2, · · · , Ns − 1 convolved with its folded, conjugated version and scaled by 1/Ns. By the convolution and fold properties of the DTFT, we have that

I_Ns(e^{jω}) = (1/Ns) X_Ns(e^{jω}) X_Ns^*(e^{jω}) = (1/Ns) |X_Ns(e^{jω})|^2 ,   (16)

where

X_Ns(e^{jω}) = Σ_{n=0}^{Ns−1} x[n] e^{−jnω} .   (17)
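A minimal sketch of Eqs (16)-(17): the periodogram evaluated on a dense frequency grid via a zero-padded FFT (anticipating the computation note of Subsection 10.2.6). The sinusoid-plus-noise test signal and the lengths are illustrative choices.

import numpy as np

rng = np.random.default_rng(0)
Ns, Nfft = 128, 1024
n = np.arange(Ns)
x = np.cos(0.3 * np.pi * n) + 0.5 * rng.standard_normal(Ns)   # example test signal

X = np.fft.fft(x, Nfft)                  # samples of X_Ns(e^{jw}), Eq (17)
I = np.abs(X) ** 2 / Ns                  # periodogram, Eq (16)
w = 2 * np.pi * np.arange(Nfft) / Nfft   # frequency grid on [0, 2*pi)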

The mean of I_Ns(e^{jω}):


Notes:

• Since W_B(e^{jω}) is non-negative, and S_xx(e^{jω}) is also, E{I_Ns(e^{jω})} is non-negative. This is obvious also since

  I_Ns(e^{jω}) = (1/Ns) |X_Ns(e^{jω})|^2 .   (18)

• Consider R′_xx[m], the first autocorrelation estimator introduced in Subsection 10.2.1. The power spectral density estimator

  S′(e^{jω}) = Σ_{m=−Ns+1}^{Ns−1} R′_xx[m] e^{−jmω}   (19)

  is not necessarily non-negative for all ω, since

  E{S′(e^{jω})} = (1/2π) W_R(e^{jω}) ⊛ S_xx(e^{jω})   (20)

  and W_R(e^{jω}), the DTFT of a rectangular window, goes negative.

The variance of I_Ns(e^{jω}): Assuming Gaussian data,

Notes:

• Var{I_Ns(e^{jω})} is independent of Ns, i.e. more data does not help.

• Var{I_Ns(e^{jω})} is greater for ω where S_xx(e^{jω}) is greater.

These are problems.


10.2.3 The Averaged Periodogram

The variance problems identified above can be mitigated using the averaged periodogram, which is also referred to as the Bartlett procedure. In this subsection, we describe this.

Given x[n]; n = 0, 1, · · · , Ns − 1, where Ns = M · K, consider partitioning the data into K blocks of M data samples each, as follows:

x^{(i)}[n] = x[(i−1)M + n] ;   n = 0, 1, · · · , M−1 ;   i = 1, 2, · · · , K .   (21)

Now let

X_M^{(i)}(e^{jω}) = DTFT{ x^{(i)}[n] } ;   i = 1, 2, · · · , K   (22)

I_M^{(i)}(e^{jω}) = (1/M) |X_M^{(i)}(e^{jω})|^2   (23)

and

B(e^{jω}) = (1/K) Σ_{i=1}^{K} I_M^{(i)}(e^{jω}) = (1/Ns) Σ_{i=1}^{K} |X_M^{(i)}(e^{jω})|^2 .   (24)

This is the averaged periodogram.

The mean of the averaged periodogram is

E{B(e^{jω})} = (1/2π) W_B(e^{jω}) ⊛ S_xx(e^{jω}) ,   (25)

where W_B(e^{jω}), the smearing function, is the DTFT of the M dimensional triangular (Bartlett) window. Note the increased bias, compared to using a single block. This is due to the fact that the main lobe width of W_B(e^{jω}) with data block length M = Ns/K (i.e. the main lobe width is 4π/M) is K times larger than the main lobe width 4π/Ns for a single block of width Ns.

The variance of the averaged periodogram is

Var{B(e^{jω})} ≈ (1/K) S_xx^2(e^{jω}) .   (26)

So, compared to the (unaveraged) periodogram, the variance is reduced by K, the number of data blocks.

Note the classic bias/variance tradeoff. Bias is increased because we are using smaller block lengths. Variance is reduced via the averaging.
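A minimal sketch of the Bartlett (averaged) periodogram, Eqs (21)-(24); the record length, the number of blocks K, the FFT size, and the test signal are illustrative choices.

import numpy as np

rng = np.random.default_rng(0)
Ns, K, Nfft = 1024, 8, 1024
M = Ns // K
n = np.arange(Ns)
x = np.cos(0.3 * np.pi * n) + 0.5 * rng.standard_normal(Ns)   # example test signal

B = np.zeros(Nfft)
for i in range(K):
    xi = x[i * M:(i + 1) * M]                    # block x^{(i)}[n], Eq (21)
    B += np.abs(np.fft.fft(xi, Nfft)) ** 2 / M   # block periodogram, Eq (23)
B /= K                                           # average over the K blocks, Eq (24)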


10.2.4 Windowing

The following windowed periodogram approach is referred to as the Blackman-Tukey procedure. Let w[m] be a window of width Ns = 2L + 1. Common examples of windows for this purpose are Hamming, Blackman, Kaiser (eigenvector) and Dolph-Chebychev (equi-sidelobe level) windows. The windowed periodogram is

S_BT(e^{jω}) = DTFT{ w[m] · R̂_xx[m] } .   (27)

By the multiplication property of the DTFT,

S_BT(e^{jω}) = (1/2π) W(e^{jω}) ⊛ I_L(e^{jω})   (28)

where I_L(e^{jω}) is the periodogram derived from a data block of width L.

The mean of the windowed periodogram is

E{S_BT(e^{jω})} = (1/2π) W(e^{jω}) ⊛ E{I_L(e^{jω})}   (29)
                = (1/4π^2) W(e^{jω}) ⊛ W_B(e^{jω}) ⊛ S_xx(e^{jω}) .   (30)

The variance of the windowed periodogram is, for relatively smooth S_xx(e^{jω}),

Var{S_BT(e^{jω})} ≈ (1/C_w) Var{I_L(e^{jω})}   (31)

where

C_w = (1/(2L+1)) Σ_{m=−L+1}^{L−1} w^2[m] .   (32)

The effect on variance is typically to make it slightly larger. The effect on bias is both good and bad. For a typical window, the local smearing (local in frequency) is increased because the window main lobe is wider. However, again for a typical window, the overall smearing is decreased because of reduced side lobe levels.


10.2.5 Averaging and Windowing

Let w[n] be a window of width M. The averaged, windowed periodogram is

S_W(e^{jω}) = (1/K) Σ_{i=1}^{K} I_M^{′(i)}(e^{jω})   (33)

I_M^{′(i)}(e^{jω}) = (1/M) |X^{′(i)}(e^{jω})|^2   (34)

X^{′(i)}(e^{jω}) = DTFT{ w[n] · x^{(i)}[n] } .   (35)
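The averaged, windowed periodogram of Eqs (33)-(35) is essentially what scipy.signal.welch computes (Welch's method additionally allows block overlap and applies its own window normalization). A usage sketch with an illustrative signal and illustrative parameter choices:

import numpy as np
from scipy.signal import welch

rng = np.random.default_rng(0)
n = np.arange(4096)
x = np.cos(0.3 * np.pi * n) + 0.5 * rng.standard_normal(n.size)   # example test signal

# Hamming-windowed blocks of length 256, no overlap, two-sided spectrum
f, Sw = welch(x, window="hamming", nperseg=256, noverlap=0,
              return_onesided=False, scaling="density")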

10.2.6 Computation

Use an FFT to efficiently compute "samples" of the required DTFTs. Zero pad data blocks, before computing the FFT, to generate more samples of the underlying DTFT.


10.3 AR (Autoregressive) Spectrum Estimation

10.3.1 Model

Recall, from Subsection 5.5 of the Course Notes, that the AR model of a discrete-time, zero-mean, wide-sense stationary random process X[n] is the output of an all-pole filter with a white noise input. The generation of the AR model is illustrated below in Figure 50.

The transfer function of the AR model generating filter is

H(z) = 1 / ( 1 + Σ_{k=1}^{P} a_k z^{−k} ) ,   (36)

where P is the order of the AR process (that is, the number of poles). The a_k; k = 1, 2, · · · , P are the AR coefficients. The frequency response of the AR model generating filter is

H(e^{jω}) = 1 / A(e^{jω}) ;   A(e^{jω}) = 1 + Σ_{k=1}^{P} a_k e^{−jωk}   (37)

Figure 50: The AR model.


The white noise input to the model has correlation function and power spectral density

R_ww[m] = σ_w^2 δ[m] ;   S_ww(e^{jω}) = σ_w^2 .   (38)

Under these model assumptions,

S_xx(e^{jω}) = S_ww(e^{jω}) |H(e^{jω})|^2 = σ_w^2 / |A(e^{jω})|^2   (39)
            = σ_w^2 / | 1 + Σ_{k=1}^{P} a_k e^{−jωk} |^2 .   (40)

So, S_xx(e^{jω}) is parameterized by the AR coefficients a_k; k = 1, 2, · · · , P, along with the model input variance σ_w^2.

Note that for poles close to the unit circle of the complex z-plane, the power spectral density (the spectrum) of X[n] will have distinct peaks. Furthermore, if the poles are closely spaced in angle, the spectral peaks will be closely spaced in frequency, resulting in a challenging spectrum estimation resolution problem. It is for this type of wide-sense stationary process, in situations where there is limited data, that AR spectrum estimation is most effective. Figure 51 illustrates an AR spectrum for which a high resolution spectrum estimator may be desired.

Figure 51: A typical AR spectrum.


The spectrum estimation approach is:

1. assume the AR model, and pick the model order P ;

2. estimate, from available data, the model parameters ak; k = 1, 2, · · · , P ; σ2w; and

3. plug the AR coefficient estimates into the spectrum equation.

The model order is determined either through understanding the physical system that generates the random process X[n], or through analysis of the available data. The latter approach is beyond the scope of this Course. Below, we assume that the model order is known. Often, we are interested only in the shape of S_xx(e^{jω}) (e.g. where the spectral peaks are). Then, we need not estimate σ_w^2.

10.3.2 Estimation of the AR Coefficients

Given that the transfer function of the AR model generating filter is

H(z) = 1 / ( 1 + Σ_{k=1}^{P} a_k z^{−k} ) ,   (41)

the input/output difference equation of the generating filter is

x[n] + a_1 x[n−1] + a_2 x[n−2] + · · · + a_P x[n−P] = w[n] .   (42)

Multiplying by x^*[n−m], and taking the expectation, we get

E{w[n] x^*[n−m]} = Σ_{k=0}^{P} a_k E{x[n−k] x^*[n−m]}   (43)
                 = Σ_{k=0}^{P} a_k R_xx[m−k] ,   (44)

where a_0 = 1. For m > 0, E{w[n] x^*[n−m]} = 0, because for a white noise input w[n], past outputs x[n−m] are not correlated with the present input. For m = 0, E{w[n] x^*[n]} = E{ w[n] ( w^*[n] − Σ_{k=1}^{P} a_k^* x^*[n−k] ) } = σ_w^2. So,

E{w[n] x^*[n−m]} = σ_w^2 for m = 0, and 0 for m > 0 ,   (45)

and

Σ_{k=0}^{P} a_k R_xx[m−k] = σ_w^2 for m = 0, and 0 for m = 1, 2, · · · , P ,   (46)

or, conjugating both sides,

Σ_{k=0}^{P} a_k^* R_xx[k−m] = σ_w^2 for m = 0, and 0 for m = 1, 2, · · · , P .   (47)


In matrix form, Eq (47) is

[ R_xx[0]    R_xx[−1]    · · ·   R_xx[−P]    ] [ 1   ]     [ σ_w^2 ]
[ R_xx[1]    R_xx[0]     · · ·   R_xx[−P+1]  ] [ a_1 ]  =  [ 0     ]
[   ...        ...       . . .     ...       ] [ ... ]     [ ...   ]
[ R_xx[P]    R_xx[P−1]   · · ·   R_xx[0]     ] [ a_P ]     [ 0     ]   (48)

These are called the Yule-Walker equations.

Let w_k = −a_k. Then, since a_0 = 1,

Σ_{k=0}^{P} a_k R_xx[m−k] = 0 ,   m = 1, 2, · · · , P   (49)

becomes

R_xx[m] − Σ_{k=1}^{P} w_k R_xx[m−k] = 0 ,   m = 1, 2, · · · , P   (50)

or

Σ_{k=1}^{P} w_k R_xx[m−k] = R_xx[m] ,   m = 1, 2, · · · , P .   (51)

In matrix form, these are

[ R_xx[0]      R_xx[−1]     · · ·   R_xx[−P+1]  ] [ w_1 ]     [ R_xx[1] ]
[ R_xx[1]      R_xx[0]      · · ·   R_xx[−P+2]  ] [ w_2 ]  =  [ R_xx[2] ]
[   ...          ...        . . .      ...      ] [ ... ]     [   ...   ]
[ R_xx[P−1]    R_xx[P−2]    · · ·   R_xx[0]     ] [ w_P ]     [ R_xx[P] ]   (52)

or

R_xx w = r .   (53)

This is another form of the Yule-Walker equations which is easier to work with.


To use the Yule-Walker equations to estimate the AR coefficients, assume that P is known. Then, from the available data x[n]; n = 0, 1, · · · , Ns − 1, estimate the required correlation lags as (for example)

R̂_xx[m] = (1/(Ns − m)) Σ_{n=m}^{Ns−1} x[n] x^*[n−m] ,   m = 0, 1, · · · , P .   (54)

Then, plugging these correlation estimates into Eq (52), we have

ŵ = R̂_xx^{−1} r̂ .   (55)

Also, â = [1, −ŵ^T]^T.

10.3.3 AR Spectrum Estimation

Given data x[n]; n = 0, 1, · · · , Ns − 1, and the AR coefficient estimates described above, if required first estimate σ_w^2 as, for example,

σ̂_w^2 = Σ_{k=0}^{P} â_k R̂_xx[−k] .   (56)

Otherwise, set σ̂_w^2 = 1. Then, the estimated AR spectrum is

Ŝ_AR(e^{jω}) = σ̂_w^2 / | 1 + Σ_{k=1}^{P} â_k e^{−jωk} |^2   (57)
            = σ̂_w^2 / | v^H(e^{jω}) â |^2   (58)

where v(e^{jω}) = [1, e^{−jω}, · · · , e^{−jPω}]^T is the unnormalized Fourier vector. (Note that the denominator can be computed, for equi-spaced ω from −π to π, using an FFT.)
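A sketch of the full AR estimation chain of Subsections 10.3.2-10.3.3 (an illustration, not the notes' implementation): estimate the correlation lags as in Eq (54), solve Eq (55) for the coefficients, estimate σ_w^2 via Eq (56), and evaluate Eq (57) on a frequency grid. The AR(2) test process and the model order P are example choices.

import numpy as np
from scipy.linalg import toeplitz
from scipy.signal import lfilter, freqz

rng = np.random.default_rng(0)
Ns, P = 2048, 2
a_true = np.array([1.0, -1.2, 0.81])                  # example AR(2), poles at radius 0.9
x = lfilter([1.0], a_true, rng.standard_normal(Ns))   # synthesize AR(2) data

# correlation-lag estimates, Eq (54) (real data, division by Ns - m via np.mean)
R = np.array([np.mean(x[m:] * x[:Ns - m]) for m in range(P + 1)])

Rxx = toeplitz(R[:P])                 # P x P matrix of Eq (52)
r = R[1:P + 1]
w_coef = np.linalg.solve(Rxx, r)      # Eq (55)
a_hat = np.concatenate(([1.0], -w_coef))

sigma_w2 = np.sum(a_hat * R[:P + 1])  # Eq (56), real data
w, A = freqz(a_hat, [1.0], worN=1024, whole=True)   # A(e^{jw}) on a frequency grid
S_ar = sigma_w2 / np.abs(A) ** 2      # Eq (57)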


10.4 Optimum Filter Based Spectrum Estimation

The spectrum estimation algorithm introduced in this Section is based on an optimum filter formulation. In particular, the optimum filters used to derive the spectral estimator are optimum in a Minimum Variance, Distortionless Response (MVDR) sense. So, the spectrum estimator is often called the MVDR estimator. As mentioned above, filters are used in the formulation. As we will see, filters are not used for the implementation.

The spectrum estimator described here is often called the Maximum Likelihood (ML) estimator, although this name is misleading because the spectral estimate generated is not a maximum likelihood estimate in any sense.

10.4.1 A Swept Filter or Filter Bank Formulation

Figure 52 below illustrates a swept filter or filter bank approach to spectrum estimation. The input, x[n], assumed to be wide-sense stationary, is the signal for which we wish to compute a power spectrum estimate. In this approach, we consider processing this signal with a bank of narrowband filters, each pointed at a particular frequency. The plot of the powers of the outputs of these filters, as a function of frequency, is the power spectrum estimate.

Consider a narrowband filter pointed to frequency ω_0. Assume it is FIR, with impulse response vector h_{ω0} = [h_{ω0}[0], h_{ω0}[1], · · · , h_{ω0}[N−1]]^T. Let x[n] = [x[n], x[n−1], · · · , x[n−N+1]]^T be the data in the FIR filter delay line at time n. The filter output is y_{ω0}[n] = h_{ω0}^H x[n]. The filter output power is

E{|y_{ω0}[n]|^2} = E{|h_{ω0}^H x[n]|^2} = E{ h_{ω0}^H x[n] x^H[n] h_{ω0} }   (59)
                = h_{ω0}^H R_xx h_{ω0} .   (60)

The estimated power spectrum is the plot of E{|y_{ω0}[n]|^2} vs. ω_0. That is, the estimator is

Ŝ(ω) = h_ω^H R̂_xx h_ω ,   (61)

where R̂_xx is an estimate of R_xx. Note that the filters are not actually implemented. Their impulse response vectors are used on R̂_xx.

Figure 52: The MVDR filter.


10.4.2 The MVDR Spectrum Estimator

For a given frequency ω_0, consider an FIR filter which is designed to minimize the output variance (power, assuming a zero-mean input) subject to the constraint that H(e^{jω_0}) = 1. Such a filter would pass a narrowband signal with frequency ω_0 without distortion (i.e. with unit gain). The filter is therefore referred to as an MVDR filter. This filter provides an estimate of the power at frequency ω_0 which has minimum effect from power at other frequencies. The MVDR filter design problem is

min_{h_{ω0}}  h_{ω0}^H R_xx h_{ω0}   (62)
subj. to  h_{ω0}^H v(e^{jω_0}) = 1 ,   (63)

where v(e^{jω_0}) is the normalized Fourier vector. The solution to this problem, via Lagrange multipliers, is

h_{ω0} = R_xx^{−1} v(e^{jω_0}) / ( v^H(e^{jω_0}) R_xx^{−1} v(e^{jω_0}) ) ,   (64)

and the corresponding output power is

E{|y_{ω0}[n]|^2} = v^H(e^{jω_0}) R_xx^{−1} R_xx R_xx^{−1} v(e^{jω_0}) / ( v^H(e^{jω_0}) R_xx^{−1} v(e^{jω_0}) )^2   (65)
                = 1 / ( v^H(e^{jω_0}) R_xx^{−1} v(e^{jω_0}) ) .   (66)

As we sweep through frequency ω from −π to π, using an estimate of the correlation matrix, we get the MVDR spectrum estimator

S_MVDR(e^{jω}) = 1 / ( v^H(e^{jω}) R̂_xx^{−1} v(e^{jω}) ) .   (67)
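A sketch of the MVDR estimator, Eq (67), using a sample correlation matrix formed from length-N data snapshots; the two-complex-sinusoid test signal, the snapshot length N, and the snapshot construction are illustrative choices, not from the notes.

import numpy as np

rng = np.random.default_rng(0)
Ns, N = 1024, 16
n = np.arange(Ns)
x = (np.exp(1j * 0.30 * np.pi * n) + 0.5 * np.exp(1j * 0.36 * np.pi * n)
     + 0.1 * (rng.standard_normal(Ns) + 1j * rng.standard_normal(Ns)))   # example data

# sample correlation matrix from snapshots x[n] = [x[n], x[n-1], ..., x[n-N+1]]^T
X = np.array([x[m:m + N][::-1] for m in range(Ns - N + 1)]).T
Rxx = (X @ X.conj().T) / X.shape[1]
Rinv = np.linalg.inv(Rxx)

omega = np.linspace(-np.pi, np.pi, 512, endpoint=False)
V = np.exp(-1j * np.outer(np.arange(N), omega)) / np.sqrt(N)   # normalized Fourier vectors
S_mvdr = 1.0 / np.real(np.sum(V.conj() * (Rinv @ V), axis=0))  # Eq (67)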


10.5 MUSIC: An Eigenstructure Approach

MUSIC stands for MUltiple Signal Classification. For narrowband signals in additive white noise, it exploits the correlation matrix eigenstructure characteristics we identified previously in Subsection 4.6 of the Course.

10.5.1 The Model & Correlation Matrix Eigenstructure

This is a summary of results from Subsection 6.5. Given a vector observation of D narrowband signals in additive white noise,

x[n] = Σ_{i=1}^{D} s_i[n] v(e^{jω_i}) + n[n] ,   (68)

the correlation matrix is of the form

R_xx = A P A^H + σ_n^2 I_N .   (69)

Under assumptions stated in Subsection 4.6, the eigenstructure

R_xx = E Λ E^H   (70)

has the following properties:

1. eigenvalues (λ_i ≥ λ_{i+1}):

   λ_i > σ_n^2 for i = 1, 2, · · · , D ;   λ_i = σ_n^2 for i = D+1, D+2, · · · , N .   (71)

2. eigenvectors (orthonormal; E = [E_s, E_n]). The D columns of E_s, the signal eigenvectors, span the signal subspace, the span of {v(e^{jω_i}); i = 1, 2, · · · , D}. The N − D columns of E_n, the noise-only eigenvectors, span the noise-only subspace, the orthogonal complement of the signal subspace.

10.5.2 The MUSIC Spectrum Estimator

By the orthogonality properties of eigenvectors, the noise-only subspace is orthogonal to any Fourier vector corresponding to an actual signal. This motivates the following (MUSIC) spectrum estimator

S_MUSIC(e^{jω}) = 1 / | v^H(e^{jω}) Ê_n |^2 ,   (72)

where Ê_n is the N × (N − D) dimensional matrix of estimated noise-only eigenvectors, computed from an estimated correlation matrix R̂_xx. Ideally, if Ê_n = E_n, the spectrum will have infinitely high peaks at all ω = ω_i; i = 1, 2, · · · , D, and at no other frequencies.
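A sketch of the MUSIC estimator, Eq (72), under the same illustrative two-complex-sinusoid setup (D = 2); the eigenvectors associated with the N − D smallest eigenvalues of the sample correlation matrix form Ê_n.

import numpy as np

rng = np.random.default_rng(0)
Ns, N, D = 1024, 16, 2
n = np.arange(Ns)
x = (np.exp(1j * 0.30 * np.pi * n) + 0.5 * np.exp(1j * 0.36 * np.pi * n)
     + 0.1 * (rng.standard_normal(Ns) + 1j * rng.standard_normal(Ns)))   # example data

X = np.array([x[m:m + N][::-1] for m in range(Ns - N + 1)]).T
Rxx = (X @ X.conj().T) / X.shape[1]

evals, evecs = np.linalg.eigh(Rxx)       # eigenvalues returned in ascending order
En = evecs[:, :N - D]                    # noise-only eigenvectors (N - D smallest)

omega = np.linspace(-np.pi, np.pi, 1024, endpoint=False)
V = np.exp(-1j * np.outer(np.arange(N), omega))                # Fourier vectors v(e^{jw})
S_music = 1.0 / np.sum(np.abs(En.conj().T @ V) ** 2, axis=0)   # Eq (72)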