3/26/15
1
A6523 Signal Modeling, Statistical Inference and
Data Mining in Astrophysics
Spring 2015 http://www.astro.cornell.edu/~cordes/A6523
Lecture 17 – Maximum entropy applications
• General solution for PDFs (constraints, partition function)
• Entropy expression(s) for power spectra/images
• Maximum entropy spectrum for Gaussian processes
• Relationship to autoregressive model fitting
  – Other approaches (HRM, Cholesky decomposition)
  – Matched filtering
  – Interpolation: sampling theorem, missing data, etc.
Time Series to Spectrum Diagram

    x(t)                          ⟵FT⟶    X(f)
    irreversible ↓                         ↓ irreversible
    R_x(τ) = ⟨x(t) x*(t + τ)⟩    ⟵FT⟶    S(f) = ⟨|X(f)|²⟩

(Angle brackets denote the ensemble average.)
The autocorrelation function is closely related to the covariance matrix of the signal.
There are (at least) three spectral estimators that involve the covariance matrix, and they are all mathematically different!

How can this be if there is one and only one relationship between ACF and spectrum, as given by the Wiener-Khinchine theorem (or the imaging equivalent, the van Cittert-Zernike theorem)?
The answer is that the relationships in the diagram are all for ensemble average quantities and we do not know
any of them. We have finite data, so there are missing data in the form of gaps and finite data extent in time or
space.
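The ensemble-average relations in the diagram have an exact finite-sample counterpart that is easy to check numerically: for a length-N sampled series, the squared magnitude of the DFT equals the DFT of the circular autocorrelation. A minimal sketch (the data are an illustrative random draw, not from the notes):

```python
import numpy as np

# Discrete counterpart of the Wiener-Khinchine relation: |DFT(x)|^2 equals
# the DFT of the circular autocorrelation of x.
rng = np.random.default_rng(0)
N = 64
x = rng.standard_normal(N)

# Periodogram: squared magnitude of the Fourier transform.
P = np.abs(np.fft.fft(x)) ** 2

# Circular autocorrelation R[l] = sum_n x[n] x[(n + l) mod N].
R = np.array([np.sum(x * np.roll(x, -l)) for l in range(N)])

# Fourier transform of the autocorrelation: real and equal to P.
S = np.fft.fft(R)
assert np.allclose(S.real, P) and np.allclose(S.imag, 0.0, atol=1e-8)
print("max |S - P| =", np.max(np.abs(S.real - P)))
```

The subtleties discussed below arise precisely because, with real data, we only have estimates of R_x(τ) over a finite lag range, not the exact ensemble quantities used in this identity.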
Thus we have

    Data → R_x(τ) → ⎧ Fourier-transform-based spectral estimates
                     ⎨ maximum entropy spectral estimate
                     ⎩ high-resolution method

Missing data are treated differently in the different spectral estimators, and that is manifested in their different mathematical structure.
Matrix form of the Fourier-transform-based spectral estimate:

It is instructive to compare the matrix form for the maximum entropy spectrum with the power spectral estimate defined as the Fourier transform of the autocorrelation function. This is identical to the spectrum found by taking the squared magnitude of the Fourier transform of the time series and is sometimes called the Bartlett estimate, because the Bartlett lag window is a triangle function.
Let C_{n−n′} be the (n, n′) element of the covariance matrix; then the Bartlett estimate is

    S_B(f) = N^{−1} Σ_{ℓ=−(N−1)}^{N−1} C_ℓ (1 − |ℓ|/N) e^{−2πifℓΔτ},

which can also be written as

    S_B(f) = N^{−2} Σ_{n,n′} C_{n−n′} e^{−2πifΔτ(n−n′)}
           = N^{−2} Σ_{n,n′} e^{2πifn′Δτ} C_{n−n′} e^{−2πifnΔτ},

and as a matrix product

    S_B(f) = N^{−2} ε† C ε,   where ε_n = e^{2πifnΔτ}, n = 0, …, N − 1.
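The lag-sum and quadratic-form versions of the Bartlett estimate are the same thing, which is worth verifying once numerically. A minimal sketch (the damped-cosine covariance is an illustrative assumption, not from the notes):

```python
import numpy as np

# Check the two forms of the Bartlett estimate against each other:
#   S_B(f) = N^{-1} sum_l C_l (1 - |l|/N) e^{-2 pi i f l dt}
#          = N^{-2} eps^dagger C eps,   eps_n = e^{2 pi i f n dt}.
N, dt = 32, 1.0
lags = np.arange(-(N - 1), N)
C_l = 0.9 ** np.abs(lags) * np.cos(2 * np.pi * 0.1 * lags * dt)  # toy ACF
# Toeplitz covariance matrix: C[n, m] = C_{n-m}.
C = np.array([[C_l[(n - m) + (N - 1)] for m in range(N)] for n in range(N)])

f = 0.07  # any frequency inside the Nyquist band
eps = np.exp(2j * np.pi * f * np.arange(N) * dt)

S_matrix = (eps.conj() @ C @ eps).real / N**2
S_sum = np.sum(C_l * (1 - np.abs(lags) / N)
               * np.exp(-2j * np.pi * f * lags * dt)).real / N

assert np.isclose(S_matrix, S_sum) and S_matrix > 0
print(S_matrix, S_sum)
```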
This can be compared with the ME and high-resolution ("ML") estimates

    S_ME(f) = constant / |ε† C^{−1} δ|²

    S_ML(f) = 1 / (ε† C^{−1} ε),

where δ = (1, 0, …, 0)^T is the first unit column vector.
Comparison of Spectral Estimators

                 Bartlett                MLM                    MEM
    Estimator    N^{−2} ε† C ε          (ε† C^{−1} ε)^{−1}     ∝ |ε† C^{−1} δ|^{−2}
    Resolution   Δf = 1/(NΔτ) = 1/T     ≈ same or better       better resolution
                                                               (up to ×2 of Bartlett)
    Errors       100% error, large      lower sidelobes        lower sidelobes
                 sidelobes

Note that all estimators are real, because the quadratic form ε† C ε is real for C Toeplitz (Hermitian); in the MEM case the estimator is manifestly real.
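The three quadratic forms in the table can be compared side by side from the same covariance matrix. A sketch under assumed toy conditions (one complex sinusoid at f₀ = 0.2 cycles/sample in white noise, so C = P vv† + σ²I is Toeplitz; the MEM "constant" is set to 1, so only peak locations and widths are comparable, not absolute levels):

```python
import numpy as np

# Bartlett, MLM, and MEM quadratic forms evaluated from one covariance.
N, P, sigma2, f0 = 16, 1.0, 0.1, 0.2
n = np.arange(N)
v = np.exp(2j * np.pi * f0 * n)
C = P * np.outer(v, v.conj()) + sigma2 * np.eye(N)   # Toeplitz, Hermitian
Cinv = np.linalg.inv(C)
delta = np.zeros(N); delta[0] = 1.0                  # first unit vector

freqs = np.arange(0.0, 0.5, 0.001)
S_B, S_ML, S_ME = [], [], []
for f in freqs:
    eps = np.exp(2j * np.pi * f * n)
    S_B.append((eps.conj() @ C @ eps).real / N**2)          # Bartlett
    S_ML.append(1.0 / (eps.conj() @ Cinv @ eps).real)       # MLM
    S_ME.append(1.0 / np.abs(eps.conj() @ Cinv @ delta)**2) # MEM, const = 1
S_B, S_ML, S_ME = map(np.array, (S_B, S_ML, S_ME))

# All three peak at the sinusoid frequency ...
for S in (S_B, S_ML, S_ME):
    assert np.isclose(freqs[np.argmax(S)], f0)
# ... and the half-power widths order as in the table: MEM < MLM < Bartlett.
width = lambda S: np.sum(S > 0.5 * S.max())
assert width(S_ME) <= width(S_ML) <= width(S_B)
print("points above half power:", width(S_B), width(S_ML), width(S_ME))
```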
The philosophy of entropy maximization dictates that we make no assumptions about missing data.
We will consider the ME estimator in greatest detail and we will show that the ME spectral
estimator is equivalent to
1. fitting an autoregressive (AR) model to the data and finding the spectrum of the AR
model.
2. applying a linear, Wiener-Hopf prediction filter to the time series, thus extending the TS
by some allowable amount, and finding the spectrum of the extended TS.
3. extending the correlation function according to an AR model.
In effect, ME techniques fill in gaps (interpolate) or extend the data (extrapolate) in a way
that is consistent with the available data but also in such a way that entropy is maximized.
This maximizes the uncertainty about the missing data; i.e. the least amount of “structure” is
imposed on the missing data.
In some instances, one can "superresolve" features in the spectrum; this can be viewed as "beating" the uncertainty principle because the effective resolution in the frequency domain can be as fine as

    Δf_ME → 1/(2T)

(T = length of time series), i.e. twice the resolution of the conventional Bartlett-type estimate.
This occurs (for some cases) because the MEM effectively predicts missing data from known
data.
The regime in which this works best is where
T ∼ correlation time of the process
because, at best, one can predict ∼ one correlation time into the future.
According to the ME philosophy, the sidelobes (Gibbs phenomenon) associated with spectral estimators based on Fourier transforms (i.e. either the smoothed |F.T.|² of the time series data or the F.T. of a correlation function) are a penalty that is a consequence of "assuming" that unknown data are zero. (One is not so naive as to really assume the data are zero, but if one did, the same estimator would result.)
How can the data be extended?

We may have extensions of the form:

I. Predict x(t) forward by approximately one correlation time (W_x). The new resolution will be δf ∼ T′^{−1} < T^{−1}, where T′ = T + W_x.

II. Extend the correlation function, R_x(τ) or C_x(τ).
Data-adaptive methods help only when there really is significant missing data.

If T = actual time series duration and W_x = correlation time, then for T ≫ W_x we have T′ = T + W_x ≈ T ⇒ no advantage.

But if T ≈ W_x, so that T′ ∼ 2T, then a factor of 2 increase in resolution may result (superresolution).

e.g. the case of two sinusoids: x(t) = A sin ω₁t + B sin ω₂t + n(t).

If δf = T^{−1} ≪ |ω₂ − ω₁| then F.T. methods will suffice in resolving the spectral lines. However, if T^{−1} ∼ |ω₂ − ω₁|, then ME estimates may provide better resolution.
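The Δf ∼ 1/T limit for |F.T.|²-based estimates is easy to exhibit directly. A sketch with assumed illustrative parameters (two equal-strength complex tones on a symmetric time grid, so the Dirichlet kernel is real): a separation of 3/T leaves a dip between the two lines (resolved), while a separation of 0.4/T does not (unresolved).

```python
import numpy as np

# Resolution limit of the |FT|^2 estimate for two closely spaced tones.
N = 256                          # samples, dt = 1, so T = N
t = np.arange(N) - (N - 1) / 2   # symmetric grid: Dirichlet kernel is real

def power(x, f):
    """Squared magnitude of the DTFT of x at frequency f (cycles/sample)."""
    return np.abs(np.sum(x * np.exp(-2j * np.pi * f * t))) ** 2

f1 = 0.1
for sep, resolved in [(3.0 / N, True), (0.4 / N, False)]:
    f2 = f1 + sep
    x = np.exp(2j * np.pi * f1 * t) + np.exp(2j * np.pi * f2 * t)
    # Resolved means the spectrum dips between the two line frequencies.
    dip = power(x, (f1 + f2) / 2) < power(x, f1)
    assert dip == resolved
print("3/T separation resolved; 0.4/T separation not resolved")
```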
Core tenet of Maximum Entropy methods:
Out of all possible (hypotheses | PDFs | …)
that agree with constraints,
choose the one that is
maximally non-committal
with respect to missing information.
Maximum Entropy Methods

• Probabilities: H = − Σᵢ pᵢ log pᵢ
  – Combine with constraint equations to determine the pᵢ
  – No constraints ⇒ flat distribution
• Spectra (and images): H = ∫ df log S(f)
  – Constraints are from measurements
  – The spectrum is obtained by maximizing entropy subject to the constraints
  – For a Gaussian process, the resulting expression is the same as optimally fitting an AR model to a time series
  – Why Gaussian?
Maximum Entropy: General Solution for PDF with Constraints

First a simple case: The entropy for a discrete random variable X, which takes on values x_k, k = 0, …, N − 1, with associated probabilities p_k, is

    H = − Σ_k p_k ln p_k.

When there is a constraint on the nth moment ⟨xⁿ⟩, the PDF is found by maximizing the quantity

    J = H − λ₀ Σ_k p_k − λ₁ Σ_k x_kⁿ p_k,

which uses a Lagrange multiplier λ₀ for the normalization constraint and λ₁ for the constraint on the moment. This construction allows us to calculate the total variation with respect to the p_k.

Taking increments δp_k we obtain

    δH = − Σ_k (δp_k ln p_k + δp_k)

    δJ = δH − λ₀ Σ_k δp_k − λ₁ Σ_k x_kⁿ δp_k
       = − Σ_k δp_k (ln p_k + 1 + λ₀ + λ₁ x_kⁿ)
       = 0.

The factor in parentheses needs to vanish since δp_k is arbitrary (but small), so

    p_k = e^{−(λ₀+1)} e^{−λ₁ x_kⁿ},   k = 0, …, N − 1.
We can solve for the Lagrange multipliers by substitution into the constraint equations:

    Σ_k p_k = e^{−(λ₀+1)} Σ_k e^{−λ₁ x_kⁿ} = 1   ⟹   e^{−(λ₀+1)} = 1 / Σ_k e^{−λ₁ x_kⁿ}

and

    p_k = e^{−λ₁ x_kⁿ} / Σ_k e^{−λ₁ x_kⁿ}.

The Lagrange multiplier λ₁ needs to be obtained from the expression for the nth moment,

    ⟨xⁿ⟩ = Σ_k x_kⁿ p_k = Σ_k x_kⁿ e^{−λ₁ x_kⁿ} / Σ_k e^{−λ₁ x_kⁿ}.

n = 1: Constraint on the mean, with the x_k constrained to be positive:

    ⟨x⟩ = Σ_k x_k p_k = Σ_k x_k e^{−λ₁ x_k} / Σ_k e^{−λ₁ x_k}.

By inspection, we see that this is a discretized exponential distribution, so λ₁ has to be set by the mean ⟨x⟩ (in the continuous limit, λ₁ = 1/⟨x⟩).
We can show this formally for the case where the x_k are uniformly spaced, x_k = k δx, k = 0, …, N − 1.

Let a = e^{−λ₁ δx}. Then, using a trick to get the numerator,

    (1/δx) Σ_k x_k e^{−λ₁ x_k} = Σ_k k aᵏ = a (d/da) Σ_{k=0}^{N−1} aᵏ = a (d/da) [ (1 − aᴺ)/(1 − a) ]
                               = a [ (1 − a)(−N a^{N−1}) − (1 − aᴺ)(−1) ] / (1 − a)²,

and dividing by the normalization Σ_k aᵏ = (1 − aᴺ)/(1 − a) we obtain

    ⟨x⟩/δx = a/(1 − a) − N aᴺ/(1 − aᴺ).

If N → ∞ (with a < 1), only the leading term matters and we can solve for a:

    a = e^{−λ₁ δx} = (⟨x⟩/δx) / (1 + ⟨x⟩/δx).

A similar approach can be taken for a case where the variance is fixed. In the continuous limit, this yields a Gaussian PDF.
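The mean-constrained case can be checked numerically: solve the constraint for λ₁ by bisection on a long uniform grid, and compare a = e^{−λ₁δx} with the large-N closed form ⟨x⟩/(⟨x⟩ + δx). The grid size and target mean below are illustrative choices, not from the notes.

```python
import numpy as np

# Maximum-entropy distribution on x_k = k*dx with a fixed mean: p_k ∝ a^k,
# a = exp(-lambda_1 * dx).  Solve for lambda_1 numerically, then compare
# with the N -> infinity result a = <x>/(<x> + dx).
dx, N, target = 1.0, 2000, 3.0
k = np.arange(N)

def mean_of(lam):
    p = np.exp(-lam * k * dx)
    p /= p.sum()
    return np.sum(k * dx * p)

# mean_of is monotone decreasing in lambda_1: bisect for the constraint.
lo, hi = 1e-6, 10.0
for _ in range(100):
    mid = 0.5 * (lo + hi)
    if mean_of(mid) > target:
        lo = mid           # mean too large -> need larger lambda_1
    else:
        hi = mid
lam = 0.5 * (lo + hi)
a = np.exp(-lam * dx)

assert abs(mean_of(lam) - target) < 1e-9
assert abs(a - target / (target + dx)) < 1e-6   # a = <x>/(<x> + dx)
print("lambda_1 =", lam, "  a =", a)
```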
General case with arbitrary constraints on moments
(From Jaynes, Probability Theory: The Logic of Science, pp. 355-358)

Discrete RV: x_n, n = 0, …, N − 1.

Constraints on arbitrary functions of x: f_k(x), k = 1, …, M.

The constraint equations are the expectations:

    ⟨f_k(x)⟩ = F_k = Σ_{n=0}^{N−1} p_n f_k(x_n).

Maximize entropy subject to the constraints:

    J = H − λ₀ Σ_n p_n − Σ_k λ_k ⟨f_k(x)⟩
      = − Σ_n p_n [ ln p_n + λ₀ + Σ_k λ_k f_k(x_n) ].

Setting δJ = 0 for arbitrary δp_n and using normalization to unity to obtain λ₀, we get

    p_n = exp[ − Σ_k λ_k f_k(x_n) ] / Z(λ),

where the partition function is

    Z(λ) = Σ_n exp[ − Σ_k λ_k f_k(x_n) ].

Solving for the Lagrange multipliers clearly is nontrivial in general.
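With a single constraint the multiplier can be found by one-dimensional root finding. A sketch using Jaynes' dice example as an assumed illustration (not from the notes): x_n = 1, …, 6, one constraint f₁(x) = x with F₁ = 4.5, so p_n = e^{−λx_n}/Z(λ).

```python
import numpy as np

# Maximum-entropy distribution over die faces 1..6 with mean fixed at 4.5.
x = np.arange(1, 7)

def p_of(lam):
    w = np.exp(-lam * x)
    return w / w.sum()          # Z(lambda) = sum_n exp(-lambda x_n)

def mean_of(lam):
    return np.sum(x * p_of(lam))

# mean_of is monotone decreasing in lambda; a mean of 4.5 (> 3.5, the
# unconstrained value) requires lambda < 0.  Bisect on [-5, 5].
lo, hi = -5.0, 5.0
for _ in range(100):
    mid = 0.5 * (lo + hi)
    if mean_of(mid) > 4.5:
        lo = mid
    else:
        hi = mid
lam = 0.5 * (lo + hi)
p = p_of(lam)

assert abs(np.sum(x * p) - 4.5) < 1e-9
assert np.all(np.diff(p) > 0)   # probabilities tilt toward the high faces
print(np.round(p, 4))
```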
Maximum Entropy: Power Spectrum (short approach)

So far we know how to calculate the entropy of a random variable in terms of its PDF. For a univariate Gaussian PDF we have

    f_X(x) = (2πσ²)^{−1/2} e^{−x²/2σ²}

    H = − ∫ dx f_X(x) ln f_X(x)
      = ½ ln(2πσ²) + ⟨X²⟩/(2σ²)
      = ½ [ ln(2πσ²) + 1 ]
      = ½ ln(2πeσ²).

When we maximize the entropy subject to constraints (from data), we only care about terms in the entropy that depend on the relevant parameters. Here the only parameter is σ, so the constant term does not matter. Notice that larger σ implies larger entropy, as we would expect for a measure of uncertainty. When we maximize entropy, we may as well write it only in terms of the variance,

    H = ½ ln σ² + constant.
Heuristic "derivation" of the entropy rate expression:

Another way of viewing this is as follows. In calculating a power spectrum we are concerned with a second-order moment, by definition. Consequently, we can assume that the random process under consideration is Gaussian because:

1. we are maximizing the entropy (subject to constraints), and
2. given the second moment, the process with largest entropy is a Gaussian random process.

Note that while this assumption is satisfactory for estimating the power spectrum (a second moment), it is not necessarily accurate when we consider the estimation errors of the spectral estimate, which depend on fourth-order statistics. If the central limit theorem can be invoked then, of course, the Gaussian assumption becomes a good one once again.

Imagine that the process under study is constructed by passing white noise through a linear filter whose system function is S^{1/2}(f):

    n(t) → [ S^{1/2}(f) ] → x(t)

Consequently, the Fourier transforms are related as

    X(f) = N(f) S^{1/2}(f).

Now N(f) itself is a Gaussian random variable because it is the sum of GRVs. Therefore, X(f) is a GRV, and viewing it as a 1-D random variable we have that the entropy is

    H(f) = ½ ln[2πe σ²(f)]

but

    σ²(f) ≡ ⟨|N(f) S^{1/2}(f)|²⟩ = S(f) S_N.

Letting the white noise spectrum be S_N = 1, we have

    σ²(f) = S(f)

and

    H(f) = ½ ln[2πe S(f)].

Recall that white noise is uncorrelated between different frequencies (in the continuous limit),

    ⟨N(f) N*(f′)⟩ = S_N δ(f − f′).

Consequently, the information in different frequencies adds because of statistical independence and, therefore, to get the total entropy we simply integrate (add variances):

    H = ∫ df H(f) = ½ ∫ df ln[2πe S(f)].

Again we ignore additive constants, and we also consider the case where the signal is bandlimited to |f| ≤ f_N and sampled over a time [0, T]. Therefore, the number of degrees of freedom is 2 f_N T.

The entropy per degree of freedom is

    h = H/(2f_N) = (1/4f_N) ∫_{−f_N}^{f_N} df ln[S(f)].
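A numerical check of the entropy-rate expression on an assumed AR(1) spectrum, S(f) = σ²/|1 − a e^{−2πif}|² with Δτ = 1 (so f_N = 1/2): Jensen's formula gives ∫ ln|1 − a e^{−2πif}|² df = 0 over the Nyquist band for |a| < 1, so h = ½ ln σ² for every a. The entropy rate depends only on the innovation variance, not on the predictable AR structure, which is consistent with the ME philosophy of extending data without adding information.

```python
import numpy as np

# Entropy rate h = (1/4 f_N) * integral_{-f_N}^{f_N} ln S(f) df for an
# AR(1) spectrum; should equal (1/2) ln(sigma^2) regardless of a.
a, sigma2, fN = 0.9, 4.0, 0.5
M = 200_000
f = (np.arange(M) + 0.5) / M - 0.5            # midpoint grid on [-1/2, 1/2]
S = sigma2 / np.abs(1 - a * np.exp(-2j * np.pi * f)) ** 2
integral = np.mean(np.log(S)) * (2 * fN)      # band has width 2 f_N = 1
h = integral / (4 * fN)

assert abs(h - 0.5 * np.log(sigma2)) < 1e-6
print("h =", h, " expected:", 0.5 * np.log(sigma2))
```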
The derivation of the ME spectrum follows the logical flow:

    entropy in terms of the power spectrum S(f)
    ⇓
    maximize H subject to constraints:
    known values of the ACF ⟵FT⟶ S(f)
    ⇓
    ME spectral estimator

van den Bos shows that extrapolating the covariance function to larger lags while maximizing entropy yields a matrix equation that is identical to that obtained by fitting an autoregressive model to the data. This implies that the two procedures are identical mathematically.
Maximum Entropy Spectral Estimate

• The MESE is identical to fitting an autoregressive (AR) model to a time series.
• The unknowns are P_M and the α_j, j = 1, …, M:

    S(f) = P_M / | 1 − Σ_{j=1}^{M} α_j e^{2πifjΔτ} |²

    X_t = Σ_{j=1}^{M} α_j X_{t−j} + n_t

(The minus sign in the denominator follows from writing the AR recursion with a plus sign, as in the second equation.)
IEEE Trans. Inform. Theory (Corresp.), vol. IT-17, July 1971, pp. 493-494

[Lecturer's note: Proof that maximization of entropy for a Gaussian process is equivalent to extending the ACF using an AR model. Van den Bos (1971).]

Alternative Interpretation of Maximum Entropy Spectral Analysis

Abstract: Maximum entropy spectral analysis is a method for the estimation of power spectra with a higher resolution than can be obtained with conventional techniques. This is achieved by extrapolation of the autocorrelation function in such a way that the entropy of the corresponding probability density function is maximized in each step of the extrapolation. This correspondence also gives a simple interpretation of the method without entropy considerations.

I. INTRODUCTION

In the classical approach to calculation of power spectra the sample autocorrelation function is multiplied by some lag window and then transformed [1]. The window is applied to provide statistical stability and to avoid leakage from other parts of the spectrum. On the other hand, the window limits the spectral resolution.

In the literature, methods are found for the improvement of the spectral resolution through extrapolation of the correlation function beyond the maximum lag [2], [3]. One of these methods, the so-called maximum entropy spectral analysis [3], is discussed in this correspondence. The idea behind the method is to extrapolate the autocorrelation function in such a way that the entropy of the corresponding probability density function is maximized in each step of the extrapolation.

It is shown that the method is equivalent to least-squares fitting of a discrete-time all-pole model to the data.

II. MAXIMUM ENTROPY SPECTRAL ANALYSIS [3]

Suppose that the first N + 1 points r(0), r(1), …, r(N) of the autocorrelation function of a stationary Gaussian process are exactly known. The problem is now to find a value for r(N + 1). Therefore consider the (N + 2) × (N + 2) covariance matrix that includes the unknown r(N + 1) value:

    R(N + 1) = | r(0)      r(1)   …  r(N)      r(N + 1) |
               | r(1)      r(0)   …  r(N − 1)  r(N)     |
               | …                                      |
               | r(N + 1)  r(N)   …  r(1)      r(0)     |

The quantities r(0), r(1), …, r(N), r(N + 1) belong to an autocorrelation only if R(N + 1) is semipositive definite. It follows that det[R(N + 1)] must be nonnegative. det[R(N + 1)] is a quadratic function of r(N + 1). The second derivative of det[R(N + 1)] with respect to r(N + 1) is −2 det[R(N − 1)], which is always negative. This implies that det[R(N + 1)] as a function of r(N + 1) has a single maximum. Hence the allowable values of r(N + 1) are those making det[R(N + 1)] equal to zero and all values in between.

The basic idea of maximum entropy spectral analysis is to select the value of r(N + 1) that maximizes det[R(N + 1)]. The entropy of the (N + 2)-dimensional Gaussian probability density function with covariance matrix R(N + 1) is equal to [4]

    ln { (2πe)^{(N+2)/2} det[R(N + 1)]^{1/2} }.

It follows that the particular choice for r(N + 1) maximizes this quantity. In order to obtain r(N + 2), the value for r(N + 1) just found is substituted in R(N + 2) and after that det[R(N + 2)], and therefore the corresponding entropy, is maximized with respect to r(N + 2). Substitution of the values of r(N + 1) and r(N + 2) so obtained in det[R(N + 3)] and maximizing yields r(N + 3), and so on. Note that in general this procedure is different from maximizing det[R(N + 3)] with respect to r(N + 1), r(N + 2), r(N + 3).

For what follows it is important to note that maximizing det[R(N + 1)] with respect to r(N + 1) is equivalent to the statement

    det | r(1)      r(0)  …  r(N − 1) |
        | r(2)      r(1)  …  r(N − 2) |
        | …                           |
        | r(N + 1)  r(N)  …  r(1)     | = 0.   (1)

III. INTERPRETATION IN TERMS OF ALL-POLE MODELS

Consider the following all-pole model

    y(n) + a₁′ y(n − 1) + ⋯ + a_N′ y(n − N) = e(n)   (2)

where e(n) is a zero-mean random variable with E[e(i)e(j)] = 0 for i ≠ j. Since E[e(n)y(n − k)] = 0 for k > 0, multiplication of both sides of (2) with y(n − k) and taking expectations yields

    r′(k) + a₁′ r′(k − 1) + ⋯ + a_N′ r′(k − N) = 0   for k > 0,

with r′(k) = E[y(n)y(n − k)]. It follows that

    r′(1) + a₁′ r′(0) + ⋯ + a_N′ r′(N − 1) = 0
    r′(2) + a₁′ r′(1) + ⋯ + a_N′ r′(N − 2) = 0   (3)
    ⋮
    r′(N + 1) + a₁′ r′(N) + ⋯ + a_N′ r′(1) = 0

and hence

    det | r′(1)      r′(0)  …  r′(N − 1) |
        | r′(2)      r′(1)  …  r′(N − 2) |
        | …                              |
        | r′(N + 1)  r′(N)  …  r′(1)     | = 0.

Now, suppose that the first N + 1 exact values r(0), r(1), …, r(N) of any autocorrelation function r(k) are available. Substitution of the values in the first N equations of (3) and solving yields a set of coefficients a₁, …, a_N. The value for r(N + 1) for a model with these coefficients, however, is uniquely determined by

    det | r(1)      r(0)  …  r(N − 1) |
        | r(2)      r(1)  …  r(N − 2) |
        | …                           |
        | r(N + 1)  r(N)  …  r(1)     | = 0.

From (1) it follows that the same solution for r(N + 1) would have been obtained from maximum entropy spectral analysis. In the same way it can be proved that all values of the extrapolated autocorrelation function coincide with those corresponding to the model 1, a₁, …, a_N. So it is concluded that maximum entropy spectral analysis is equivalent to matching the coefficients of an Nth-order model to the N + 1 given points of the autocorrelation function. Once the coefficients of the scheme have been obtained, the computation of the corresponding spectrum is straightforward.

IV. SAMPLE CONSIDERATIONS

In practice only a finite record of the signal y(t) is available and the autocorrelation function cannot be determined exactly. The usual approach to fitting an Nth-order all-pole model to a record of length M is as follows [5]. The quantity

    Σ_m [ y(m) + Σ_{n=1}^{N} a_n y(m − n) ]²   (4)

is minimized with respect to a₁, a₂, …, a_N. The minimization of (4) yields a set of linear equations equivalent to the first N equations of (3) with the exact autocorrelation functions replaced by their estimates. It is clear that the maximum entropy procedure yields the same results when applied to these estimates. Therefore, the extrapolation beyond the (N + 1)th sample autocorrelation point is equivalent to least-squares fitting of an Nth-order all-pole model to the data.

Asymptotic sampling properties of the least-squares coefficient estimates were derived by Mann and Wald [6]. The asymptotic Nth-order covariance matrix of the estimates is expressed in terms of the sample autocorrelation matrix R_M(N), whose (i, j)th element is the sample autocorrelation function r_M(i − j) of y(m), and s², the minimum value of (4) divided by M − N. Moreover, the coefficient estimates are maximum-likelihood estimates if the e(n) in (2) are Gaussian.

V. CONCLUDING REMARKS

Maximum entropy spectral analysis is equivalent to least-squares fitting of an all-pole model to the available data. The order of the model is determined by the number of available mean-lagged products. This implies the possibility of introduction of poles that were not actually present in the process. On the other hand, if the process contains both poles and zeros, a relatively large number of poles is needed to approximate the true spectrum sufficiently [7]. In this case the number of available mean-lagged products may be too small.

Therefore, it seems more adequate to use parametric methods such as the minimum residual methods developed by Åström and Steiglitz [7], [8] for models with both poles and zeros, or conventional fitting of all-pole models, since these approaches provide in addition tests for the order of the model.

A. VAN DEN BOS
Dep. Appl. Phys.
Delft Technol. Univ.
Lorentzweg 1
Delft, Holland

Manuscript received March 25, 1970; revised November 23, 1970.

REFERENCES
[1] R. B. Blackman and J. W. Tukey, The Measurement of Power Spectra. New York: Dover, 1958.
[2] N. R. Zagalsky, "Exact spectral representation of truncated data," Proc. IEEE (Lett.), vol. 55, Jan. 1967, pp. 117-118.
[3] J. P. Burg, "Maximum entropy spectral analysis," presented at the 37th Annu. Meeting Soc. of Exploration Geophysicists, Oklahoma City, Okla., 1967.
[4] C. E. Shannon and W. Weaver, The Mathematical Theory of Communication. Urbana, Ill.: University of Illinois Press, 1949, pp. 54-57.
[5] M. G. Kendall and A. Stuart, The Advanced Theory of Statistics, vol. 3. London: Griffin, 1966, pp. 476-481.
[6] H. B. Mann and A. Wald, "On the statistical treatment of linear stochastic difference equations," Econometrica, vol. 11, July/Oct. 1943, pp. 173-219.
[7] S. A. Tretter and K. Steiglitz, "Power-spectrum identification in terms of rational models," IEEE Trans. Automat. Contr. (Short Papers), vol. AC-12, Apr. 1967, pp. 185-188.
[8] K. J. Åström and T. Bohlin, "Numerical identification of linear dynamic systems from normal operating records," Proc. IFAC Symp. Self-Adaptive Control Systems, Sept. 1966, pp. 3.2-1-3.2-9.
IEEE Trans. Inform. Theory (Corresp.), March 1973, pp. 232-234
Notes on Maximum-Entropy Processing

J. A. EDWARD AND M. M. FITELSON

Abstract: Maximum-entropy processing is a method for computing the power density spectrum from the first N lags of the autocorrelation function. Unlike the discrete Fourier transform, maximum-entropy processing does not assume that the other lag values are zero. Instead, one mathematically ensures that the fewest possible assumptions about unmeasured data are made by choosing the spectrum that maximizes the entropy for the process. The use of the maximum entropy approach to spectral analysis was introduced by Burg [1]. In this correspondence, the authors derive the maximum-entropy spectrum by obtaining a spectrum that is forced to maximize the entropy of a stationary random process.

I. INTRODUCTION

Maximum entropy processing is a method for computing the power density spectrum from the first N lags of the autocorrelation function. As such, it may be applied to spectral analysis problems normally approached by the use of the discrete Fourier transform, such as the computation of frequency power density spectra for stationary time series, or of wave number spectra for sonar or seismic arrays. Unlike the discrete Fourier transform, maximum entropy processing does not assume that the other lag values are zero. Instead, one chooses the spectrum that fits the measured data and maximizes the entropy of the process. This means one has mathematically ensured that the fewest assumptions regarding the information content of unmeasured data have been made.

The use of this technique for estimating power spectra was introduced by Burg [1]. His paper did not contain an explicit derivation of this result. However, in 1969, Bernard [2], and independently, Edward [3], presented similar derivations (neither of which are available in the open literature) of the maximum entropy spectrum. Subsequently, in 1971, Van den Bos [4] presented a derivation based on a least-squares fitting of an all-pole model to the data. In addition, Lacoss [5] presented quantitative results based on maximum-entropy spectrum analysis.

In this paper the authors present a more rigorous version of Edward's derivation [3]. The maximum entropy spectrum is found by obtaining a spectrum that is forced to maximize the entropy for a stationary random process and uniformly spaced data. The only assumptions that are necessary arise from demanding that the maximum-entropy spectrum have the analytical properties of the spectrum of the process in question.

The fit to measured data is obtained by requiring the second-order moments (i.e., the covariance matrix) for the spectrum to agree with estimates based on the measured data. A maximum entropy random process with constraints on only the first and second moments is readily shown to be a Gaussian process. The entropy for a stationary Gaussian process may be shown to be given by

    H = (1/λ_s) ∫ ln f(λ) dλ,   (1)

where the integral is over the support of f(λ), and λ_s is the measure of the support of f(λ). The maximum entropy estimate of f(λ) is derived in Section II.

Manuscript received March 22, 1972; revised September 1, 1972. The authors are with the General Electric Company, Syracuse, N.Y. 13201.

II. THE MAXIMUM-ENTROPY SPECTRUM

By definition,

    f(λ) = ∫_{−∞}^{∞} dτ exp(−2πiτλ) E[X(t)X(t + τ)] ≡ ∫_{−∞}^{∞} dτ exp(−2πiτλ) ρ(τ),   (2)

where X(t) is the stationary Gaussian process in question. Suppose one had the covariance matrix for X(t₁), X(t₂), …, X(t_n), where since the t_i are uniformly spaced it may be assumed that t_i = i. The covariance matrix satisfies

    R_ij = E(X_i X_j) = R_ji* = r_{i−j},   (3)

where

    r_k = ∫ dλ exp(2πiλk) f(λ).   (4)

Note that R is Hermitian and thus has 2n − 1 (n real and n − 1 imaginary) independent components. To find f(λ) so that H is maximized and (4) is satisfied, introduce the quantity

    L = − Σ_{k=−(n−1)}^{n−1} Θ_k* r_k,   (5)

where the Lagrange multipliers {Θ_k} satisfy

    Θ_k = Θ_{−k}*.   (6)

Then f(λ) is found by demanding that

    δ(L + H) = 0,   (7)

where the variation is with respect to f(λ). Using (1) and (4), (7) becomes

    ∫ dλ δf(λ) [ 1/(λ_s f(λ)) − Σ_{k=−(n−1)}^{n−1} Θ_k* exp(2πiλk) ] = 0.   (8)

Since δf(λ) is arbitrary one has

    f(λ) = 1 / [ λ_s Σ_{k=−(n−1)}^{n−1} Θ_k* exp(2πiλk) ].   (9)

Imposing the condition that f(λ) is positive and integrable, it is possible to express f(λ) as

    f(λ) = 1 / [ λ_s |Λ(λ)|² ],   (10)

where

    Λ(λ) = Σ_{k=0}^{n−1} γ_k* exp(2πiλk),   Θ_m* = Σ_{k=0}^{n−1−m} γ_k γ_{k+m}*,   (11)

and by Wold's method [6], [7], Λ(λ)^{−1} may be chosen to be analytic in the upper half plane. It now remains to solve for the γ_k, k = 0, …, n − 1, by satisfying the n constraint equations

    r_k = ∫_{−1/2}^{1/2} dλ f(λ) exp(−2πiλk),   k = 0, 1, …, n − 1,   (12)

where use has been made of the periodicity of Λ(λ), and λ_s has been taken to be 1. Multiplying both sides of (10) by Λ*(λ) exp(2πiλk) and integrating yields

    Σ_{l=0}^{n−1} R_{k+1,l+1} γ_l = ∫_{−1/2}^{1/2} dλ exp(2πiλk) / Σ_{m=0}^{n−1} γ_m* exp(2πiλm).   (13)

Since Λ(λ)^{−1} is analytic in the upper half-plane, the integral on the right-hand side of (13) may be performed by integration around the contour shown in Fig. 1.

[Fig. 1. Integration contour.]

Due to the periodicity of Λ(λ), the integrals over paths 2 and 4 cancel. This fact combined with Cauchy's theorem implies that the integrals over paths 1 and 3 are the negatives of each other. Therefore

    ∫_{−1/2}^{1/2} dλ exp(2πiλk)/Λ(λ) = ∫_{−1/2}^{1/2} dλ exp(2πiλk − 2πyk) / Σ_{m=0}^{n−1} γ_m* exp(2πiλm − 2πym).   (14)

One may take y to be arbitrarily large, so that

    ∫_{−1/2}^{1/2} dλ exp(2πiλk)/Λ(λ) = lim_{y→∞} ∫_{−1/2}^{1/2} dλ exp(2πiλk − 2πyk) / Σ_{m=0}^{n−1} γ_m* exp(2πiλm − 2πym) = (1/γ₀) δ_{k,0}.   (15)

Combining (15) with (13) yields

    Σ_{l=0}^{n−1} R_{k+1,l+1} γ_l = (1/γ₀) δ_{k,0},   (16)

which has the solution, in an obvious matrix notation,

    γ = (1/γ₀) R^{−1} δ.   (17)

This finally yields

    f(λ) = (R^{−1})₁₁ / |(R^{−1} δ)ᵀ e(λ)|²,   (18)

where eᵀ(λ) = [1, …, exp(−2πi(n − 1)λ)] and δ = (1, 0, …, 0)ᵀ. The right-hand side of (18) is the desired maximum entropy spectrum.
IEEE Transactions on Information Theory, March 1973

ACKNOWLEDGMENT

The authors would like to thank the referees and Prof. T. Kailath for their helpful suggestions, which improved the presentation of this correspondence.

REFERENCES
[1] J. P. Burg, "Maximum entropy spectral analysis," presented at the 37th Annual Meeting, Soc. Explor. Geophys., Oklahoma City, Okla., 1967.
[2] T. E. Barnard, "Analytical studies of techniques for the computation of high-resolution wave-number spectra," Texas Instruments Advan. Array Res., Spec. Rep. 9, 1969.
[3] J. A. Edward, "An approach to passive sonar array processing based on the information in the space/time covariance matrix," General Electric Co., Heavy Mil. Electron. Syst., TIS Rep. R69EMH38, 1969.
[4] A. van den Bos, "Alternative interpretation of maximum entropy spectral analysis," IEEE Trans. Inform. Theory (Corresp.), vol. IT-17, pp. 493-494, July 1971.
[5] R. T. Lacoss, "Data adaptive spectral analysis methods," Geophysics, vol. 36, pp. 661-675, Aug. 1971.
[6] E. A. Robinson, "Predictive decomposition of time series with applications to seismic exploration," Massachusetts Inst. Technol., GAG Rep. 7, sect. 2.8, pp. 185-186, July 12, 1954.
[7] H. Wold, A Study in the Analysis of Stationary Time Series. Uppsala: Almquist and Wiksells, 1938, p. 214.
Entropy Expressions for Images

Several alternative expressions are used, based on analogies with probability density functions and spectra:

H = − Σ_k I_k ln I_k

H = Σ_k ln I_k

These expressions enforce positivity because as I_k → 0, H is a minimum (0 for the first form and −∞ for the second), with an infinite slope at I_k = 0, and they have negative second derivatives (they are concave) that inhibit oscillatory solutions.

If these latter properties are emphasized and the "fundamental" aspects of entropy are ignored, other functions can be concocted for use in image restoration. These are generally referred to as "penalty" functions designed to achieve certain results and may be ad hoc in nature.

Often the entropy expressions are written in terms of a default image M_k:

H = − Σ_k I_k ln(I_k/M_k)

H = Σ_k ln(I_k/M_k)
Application with Imperfect Data

Let

S(f) = true spectrum
Ŝ(f) = spectral estimate from entropy maximization
R̂(τ) = estimated values of the ACF from Ŝ(f)
R_d(τ) = measured (or estimated) values of the ACF from the data

Assume that the data values of the ACF are imperfect, perhaps with additive or other kinds of errors. Acceptable spectral estimates are those that are consistent with the data to within the errors, which we can measure with χ²:

χ² = Σ_τ [ R̂(τ) − R_d(τ) ]².

The maximum entropy spectral estimate is obtained by maximizing the combined quantity

J = H − χ²

with respect to the parameters of the spectral estimate (e.g., the coefficients of an AR model).
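This penalized maximization can be sketched numerically. The following is an illustrative toy, not the lecture's code: the frequency grid, lag set, noise level sigma, the entropy form H = Σ_k ln S_k, and the log-spectrum parametrization u_k = ln S_k (which enforces positivity) are all assumptions.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)

# Frequency grid for the one-sided spectrum and a handful of ACF lags
f = np.linspace(0.0, 0.5, 16)
df = f[1] - f[0]
taus = np.arange(8)
basis = 2.0 * np.cos(2.0 * np.pi * np.outer(taus, f)) * df  # R(tau) = 2 * sum_k S_k cos(2 pi f_k tau) df

# "Measured" ACF: flat true spectrum plus small errors
Rd = basis @ np.ones(f.size) + 0.01 * rng.standard_normal(taus.size)
sigma = 0.05  # assumed ACF measurement error

def neg_J(u):
    """-J = chi^2 - H, with S_k = exp(u_k) guaranteeing S > 0."""
    S = np.exp(u)
    chi2 = np.sum(((basis @ S - Rd) / sigma) ** 2)
    H = np.sum(u)  # entropy form H = sum_k ln S_k
    return chi2 - H

res = minimize(neg_J, np.zeros(f.size), method="BFGS")
S_fit = np.exp(res.x)  # spectral estimate consistent with the data
```

The fitted spectrum reproduces R_d to within the assumed errors, while the entropy term keeps the estimate positive and smooth.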
Features of these functions:
1. Infinite slope at I = 0 ⇒ positivity of solutions
2. Negative second derivative that mitigates oscillations
They can be viewed as "penalty functions."
See Narayan & Nityananda 1986 ARAA for a comparison of results
In generalized approaches, the entropy expression is treated more as a penalty function to drive certain results rather than as a fundamental quantity in and of itself.
3/26/15
16
Maximum Entropy Spectra of Red Noise Processes
1. Generate a realization of red noise in the spectral domain:
– Generate complex white Gaussian noise
– Shape it according to [S(f)]^{1/2}
– Inverse FFT to the time domain
2. Find the best fit autoregressive model by minimizing fitting error against the ‘order’ of the model
• AR model fitting is equivalent to maximizing entropy
3. Find the Fourier spectrum of the AR model
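The three steps above can be sketched as follows. This is a minimal illustration under assumptions: the series length, AR order, and spectral index are arbitrary, and Yule-Walker estimation stands in for the Burg algorithm.

```python
import numpy as np
from scipy.linalg import toeplitz, solve

rng = np.random.default_rng(0)
n = 4096

# 1. Red noise: shape complex white Gaussian noise by sqrt(S), inverse FFT
freqs = np.fft.rfftfreq(n, d=1.0)
amp = np.zeros_like(freqs)
amp[1:] = freqs[1:] ** (-1.0)  # sqrt(S) for S(f) ~ f^-2
spec = (rng.standard_normal(freqs.size) + 1j * rng.standard_normal(freqs.size)) * amp
x = np.fft.irfft(spec, n)
x -= x.mean()

# 2. Fit an AR(p) model via the Yule-Walker equations on the sample ACF
p = 8
r = np.array([np.dot(x[: n - k], x[k:]) / n for k in range(p + 1)])
alpha = solve(toeplitz(r[:p]), r[1 : p + 1])  # AR coefficients
sig2 = r[0] - alpha @ r[1 : p + 1]            # innovation variance

# 3. Fourier spectrum of the AR model
def ar_spectrum(f):
    z = np.exp(-2j * np.pi * f * np.arange(1, p + 1))
    return sig2 / np.abs(1.0 - alpha @ z) ** 2
```

As expected for a steep red-noise process, the resulting AR spectrum is smooth and much larger at low frequencies than near the Nyquist frequency.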
[Figure: time series x(t) and spectral estimates vs. frequency bin for S(f) ∝ f^0. Black: generated spectrum; blue: periodogram; red: AR spectrum.]
[Figure: time series and spectral estimates for S(f) ∝ f^{-1}. Black: generated spectrum; blue: periodogram; red: AR spectrum.]
[Figure: time series and spectral estimates for S(f) ∝ f^{-2}. Black: generated spectrum; blue: periodogram; red: AR spectrum.]
[Figure: time series and spectral estimates for S(f) ∝ f^{-3}. Black: generated spectrum; blue: periodogram; red: AR spectrum.]
[Figure: time series and spectral estimates for S(f) ∝ f^{-4}. Black: generated spectrum; blue: periodogram; red: AR spectrum.]
[Figure: time series and spectral estimates for S(f) ∝ f^{-5}. Black: generated spectrum; blue: periodogram; red: AR spectrum.]
[Figure: time series and spectral estimates for S(f) ∝ f^{-6}. Black: generated spectrum; blue: periodogram; red: AR spectrum.]
[Figure: summary grid of time series and spectra for S(f) ∝ f^{-1} through f^{-6}. Black: generated spectrum; blue: periodogram; red: AR spectrum.]
On the Applicability of Maximum Entropy Spectral Estimate:
The spectral estimator:
Makes use of the 2M + 1 values of the covariance function that are known or estimated. There is no choice in M. If ρ_M → 0 then the spectral estimate will reflect that (Jaynes says that the corresponding Lagrange multiplier will be zero).
The AR Approach:
M appears to be a parameter that must be chosen according to some criterion.
Reconciliation:
Jaynes is correct so long as the expression used for the entropy is correct. It may not be in some cases. If it is OK, then simply use the 2M + 1 known values. If the entropy expression is not applicable, then one must view the situation as one where an AR model is being fitted to the data, with M an unknown parameter. The problem reduces to finding (i) the order of the AR process and (ii) the coefficients.
Estimates for AR coefficients

For all the nitty-gritty details of the calculation of AR coefficients, see Ulrych and Bishop, "Maximum Entropy Spectral Analysis and Autoregressive Decomposition," 1975, Rev. Geophys. and Space Phys., 13, 183. There are Fortran listings for the Yule-Walker and Burg algorithms for estimating coefficients. See also Numerical Recipes.
Two problems remain:
1. How does one calculate the order of the AR model?
2. What are the estimation errors of the spectral estimate?
The order of the AR Model can be estimated by looking at the prediction error as a function of
order M .
With N = number of data points and M = order of the AR process (or of a closely related prediction-error filter), evaluate the "final prediction error"

(FPE)_M = [ (N + (M + 1)) / (N − (M + 1)) ] × Σ_t (x_t − x̂_t)²,

where the first factor increases as M increases and the second factor (the prediction-error power) decreases. The order M is chosen as the one that minimizes the FPE (the Akaike criterion).
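The order selection can be sketched with the Levinson-Durbin recursion, which yields the prediction-error power at every order in one pass. This is an illustrative sketch, not the lecture's code: the use of the exact AR(1) ACF as input and the value of N are assumptions.

```python
import numpy as np

def levinson(r, M):
    """Levinson-Durbin: prediction-error power for AR orders 0..M given ACF r[0..M]."""
    a = np.zeros(M + 1)
    a[0] = 1.0
    err = r[0]
    errs = [err]
    for m in range(1, M + 1):
        acc = r[m] + np.dot(a[1:m], r[m - 1 : 0 : -1])
        k = -acc / err                       # reflection coefficient
        a_prev = a.copy()
        a[1 : m + 1] = a_prev[1 : m + 1] + k * a_prev[m - 1 :: -1]
        err *= 1.0 - k * k                   # prediction-error power shrinks with m
        errs.append(err)
    return np.array(errs)

# Exact ACF of an AR(1) process with coefficient 0.7, so the true order is 1
r = 0.7 ** np.arange(11)
N = 1000                                      # assumed number of data points
errs = levinson(r, 10)
M = np.arange(11)
fpe = errs * (N + (M + 1)) / (N - (M + 1))    # (FPE)_M as defined above
best_order = int(np.argmin(fpe))
```

The first factor grows with M while the error power shrinks and then plateaus, so the FPE bottoms out at the true order.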
FPE Examples for Red Processes with Power-Law Spectra
[Figure: final prediction error vs. AR order, with the corresponding time series and spectra, for S(f) ∝ f^0.]
[Figure: final prediction error vs. AR order, with the corresponding time series and spectra, for S(f) ∝ f^{-2}.]
Application of MEM
• Sinusoids + noise
• Noise only
• Δt = 0.01 yr
• Nyquist frequency f_N = 50 cy yr⁻¹

Points:
• MEM can give much better performance than the FFT-based power spectrum
• Using the wrong AR order, however, can give spurious results
[Figures: FFT power spectrum; MEM spectrum with the AR order determined empirically.]
[Figures: MEM spectra with the AR order forced to the indicated values.]
[Figure: MEM spectrum of timing residuals of a millisecond pulsar.]
Maximum Likelihood Spectral Estimation (MLSE)
MLSE is a misnomer; a better name is High Resolution Method because the method is derived
by explicitly maximizing the sensitivity to a given frequency while minimizing the effects (i.e.
leakage, a.k.a. bias) from other frequencies.
The MLSE was developed in the 1960s by Capon to analyze data from arrays of sensors to
maximize the response to one particular direction and minimize the response to others.
e.g. LASA = Large Aperture Seismic Array (test earthquakes vs. underground nuclear tests).
There is a close relationship to beam forming in acoustic arrays and beam forming in radio
interferometric arrays.
In the original development of the method discussed by Capon1
the spectral estimator is very
closely related to a filter that gives the ML estimate of a signal when it is corrupted by Gaussian
noise:
S + N −→ A −→ Ŝ
1see “Nonlinear Methods of Spectral Analysis”, Haykin, ed. pp. 154-179
This system involves:
a) a filter that gives the ML estimate of the signal when corrupted by Gaussian noise is also
. . .
b) the filter that generally gives the minimum variance and unbiased estimate of the signal
for arbitrary noise and . . .
c) has coefficients that yield an unbiased, high resolution spectral estimate for any signal.
The way the filter coefficients are derived (i.e. the constraints applied to the maximization
problem) imply that the spectral estimate minimizes leakage.
The HRM is sometimes described as a positive constrained reconstruction method which minimizes leakage.

Thus, the intent of the MLSE technique is much different from that of the MESE technique:

MLSE minimizes variance and bias (recall how spectral bias was related to resolution)

MESE in effect (via its relationship to prediction filters) tries to maximize resolution
We will derive the ML spectral estimate following the derivation of Lacoss.
Method: Construct a linear filter that
1. yields an unbiased estimate of a sinusoidal signal and
2. minimizes the variance of the output with respect to corrupting white noise.
Pass a signal y_n through a linear filter:

y_n −→ [a_k] −→ x_n

x_n = Σ_{k=1}^{N} a_k y_{n−k+1}   (causal)

where the input is of the form of a deterministic sinusoid added to zero-mean noise having an arbitrary spectrum:

y_n = A e^{iΩn} + n_n.

We will determine the coefficients a_k by invoking the above two criteria.
Goal: We want the filter to pass A e^{iΩn} undistorted but to reject the noise as much as possible. Thus, we require

1. No bias (in the mean):

⟨x_n⟩ ≡ ⟨ Σ_{k=1}^{N} a_k y_{n−k+1} ⟩
= Σ_{k=1}^{N} a_k [ A e^{iΩ(n−k+1)} + ⟨n_{n−k+1}⟩ ]
= Σ_{k=1}^{N} a_k A e^{iΩ(n−k+1)}
≡ A e^{iΩn}   (if no bias)

⇒ Σ_{k=1}^{N} a_k e^{iΩ(1−k)} = 1   (constraint equation)
This can be written in matrix form, using '†' to denote transpose conjugate:

ε†a = 1,   ε ≡ [1, e^{iΩ}, e^{i2Ω}, ..., e^{i(N−1)Ω}]ᵗ,   a = [a_1, a_2, ..., a_N]ᵗ
3/26/15
29
2. Minimum variance of the filter output:

σ² ≡ ⟨ |x_n − ⟨x_n⟩|² ⟩
= ⟨ | Σ_k a_k y_{n−k+1} − A e^{iΩn} |² ⟩
= ⟨ | Σ_k a_k [ A e^{iΩ(n−k+1)} + n_{n−k+1} ] − A e^{iΩn} |² ⟩   (the signal terms cancel by the constraint in 1.)
= ⟨ | Σ_k a_k n_{n−k+1} |² ⟩
= Σ_k Σ_{k′} a_k ⟨ n_{n−k+1} n*_{n−k′+1} ⟩ a*_{k′}
= a† C a,

where C is the covariance matrix of the noise, n.
3. Minimize σ2 w.r.t. a and subject to the constraint ε†a = 1.
By minimizing σ2 subject to the constraint, we get the smallest error and no bias.
Therefore we minimize L with respect to a:
L = σ2 + λ(ε †a− 1) = a† Ca + λ(ε †a− 1)
We can take ∂L/∂Re(a_j) and ∂L/∂Im(a_j) separately to derive equations for a, then recombine these equations to get

a† C + λ ε† = 0.

This is the same as we get by taking

∇_a L ≡ ∂L/∂a = (∂/∂a)(a† C a) + λ (∂/∂a)(ε† a) = a† C + λ ε† = 0   for a = a_0.
The solution for a_0 is

a_0† C = −λ ε†  ⇒  C† a_0 = −λ* ε  ⇒  a_0 = −λ* (C†)⁻¹ ε.

Now substitute back into the constraint equation ε†a_0 = 1 (the no-bias relation) to get

ε†a_0 = −λ* ε†(C†)⁻¹ε = 1,   or   −λ* = 1 / [ ε†(C†)⁻¹ε ]

⇒ a_0 = (C†)⁻¹ε / [ ε†(C†)⁻¹ε ].

Note that the denominator is real (a quadratic form): [ε†(C†)⁻¹ε]† = ε†C⁻¹ε.
4. Minimum variance: Substitute a_0 back into the expression for σ² to find the minimum variance:

σ²_min ≡ a_0† C a_0
= [ (C†)⁻¹ε / (ε†(C†)⁻¹ε) ]† C [ (C†)⁻¹ε / (ε†(C†)⁻¹ε) ]
= (ε†C⁻¹ε) / [ (ε†C⁻¹ε)(ε†(C†)⁻¹ε) ]
= 1 / (ε†C⁻¹ε)

σ²_min = 1 / (ε† C⁻¹ ε)

This is the power in the noise components with the same frequency as the signal, Ω. (Note that we have used the Hermitian relation C† ≡ C.)
Interpretation:
1. σ2min = portion of noise that leaks through the filter, which is attempting to estimate a
sinusoid corrupted by the noise.
2. Note that the filter coefficients and σ2min are functions of Ω and of the noise covariance
matrix. But they do not depend on the amplitude of the sinusoid.
3. The trick: now take away the signal but keep the noise. We allow Ω to vary across a range of frequencies we are interested in. Then σ²_min(Ω) is a spectral estimate for the noise spectrum (which was left arbitrary).

4. ⇒ maximum likelihood spectral estimator

S_ML(f) = 1 / (ε† C⁻¹ ε)   with Ω = 2πfΔτ
Further comments:
1. As used, the covariance matrix C is an ensemble-average quantity. Applications to actual data require use of some estimate for the covariance matrix.
2. The derivation is for equally spaced data.
3. The spectral estimate should work well on processes with steep power-law spectra because the estimator is derived explicitly to minimize bias.
10
Data-adaptive aspect of the MLSE spectral estimator:

Recall that the Fourier-transform based estimator has a fixed window. The MLSE has a data-adaptive spectral window, as we will show.

The filter coefficients are a function of the frequency of the sinusoid, Ω:

a_0(Ω) = C⁻¹ε(Ω) / [ ε†(Ω) C⁻¹ ε(Ω) ]

As Ω is varied, the coefficients a_0 vary, subject to the normalization constraint ε†a_0 = 1.

For a given Ω, which labels the frequency component we are attempting to estimate, what is the response to other frequencies, ω?

Define the window function

W(ω, Ω) = a_0†(Ω) ε(ω)

as the response to frequency ω of a filter designed to pass the frequency Ω. The window function satisfies the normalization

W(Ω, Ω) ≡ 1.

The equivalent quantity for a Fourier transform estimator might be

W(ω, Ω) = sin[(ω − Ω)T/2] / [(ω − Ω)T/2].
Simulating the HRM
Generate a process with a specified noise + signal spectrum, or just noise with an arbitrary spectrum, by passing white noise through a linear filter.

white noise −→ h(t) −→ x(t)

From one or more realizations of x(t), estimate the autocovariance and put it in the form of a covariance matrix, C.

For each frequency of interest (Ω), calculate the MLM/HRM filter coefficients

a_0 = C⁻¹ε / (ε† C⁻¹ ε).

Calculate the power-spectrum estimate as

S(Ω) = 1 / (ε† C⁻¹ ε).

The window function can be calculated as

W(ω, Ω) = a_0†(Ω) ε(ω).
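The recipe above can be sketched numerically. As an illustrative assumption (standing in for a covariance estimated from data), the covariance matrix below is the analytic AR(1) form C_ij = ρ^|i−j|, and the lag count is arbitrary.

```python
import numpy as np

M = 16          # number of lags retained in the covariance matrix
rho = 0.7       # assumed AR(1)-like correlation, standing in for an estimated ACF

# Toeplitz covariance matrix C_ij = rho^|i-j| and its inverse
C = rho ** np.abs(np.subtract.outer(np.arange(M), np.arange(M)))
Cinv = np.linalg.inv(C)

def steering(f, dtau=1.0):
    """The vector epsilon for frequency f (Omega = 2 pi f dtau)."""
    return np.exp(2j * np.pi * f * dtau * np.arange(M))

def capon_spectrum(f):
    """S(Omega) = 1 / (eps^dagger C^-1 eps)."""
    eps = steering(f)
    return 1.0 / np.real(eps.conj() @ Cinv @ eps)

def window(omega, Omega):
    """W(omega, Omega) = a0^dagger(Omega) eps(omega), a0 = C^-1 eps / (eps^dagger C^-1 eps)."""
    a0 = Cinv @ steering(Omega) * capon_spectrum(Omega)
    return a0.conj() @ steering(omega)
```

For this red covariance the estimate is largest at f = 0, and W(Ω, Ω) = 1 by construction of the no-bias constraint.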
Appendix
• Derivation of the entropy expression in terms of the power spectrum for a Gaussian process
• Summary of mathematical derivation
• Heuristic derivation
Maximum Entropy: Power Spectrum (short approach)

So far we know how to calculate the entropy of a random variable in terms of its PDF. For a univariate Gaussian PDF we have

f_X(x) = (2πσ²)^{−1/2} e^{−x²/2σ²}

H = − ∫ dx f_X(x) ln f_X(x)
= (1/2) ln(2πσ²) + ⟨X²⟩/2σ²
= (1/2) [ ln(2πσ²) + 1 ]
= (1/2) ln(2πeσ²)

When we maximize the entropy subject to constraints (from data), we only care about terms in the entropy that depend on relevant parameters. Here the only parameter is σ, so the constant term does not matter. Notice that larger σ implies larger entropy, as we would expect for a measure of uncertainty.

When we maximize entropy, we may as well write it only in terms of the variance,

H ≈ (1/2) ln σ² + constant.
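The result H = (1/2) ln(2πeσ²) is easy to verify by Monte Carlo, since H = −⟨ln f_X(X)⟩. A quick check (the sample size and σ below are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 2.0
x = rng.normal(0.0, sigma, 1_000_000)

# H = -<ln f_X(X)>, estimated from samples of X
log_f = -0.5 * np.log(2.0 * np.pi * sigma**2) - x**2 / (2.0 * sigma**2)
H_mc = -log_f.mean()

H_exact = 0.5 * np.log(2.0 * np.pi * np.e * sigma**2)  # (1/2) ln(2 pi e sigma^2)
```

With 10^6 samples the two agree to a few parts in 10^3.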
Maximum Entropy Spectral Estimate
Multivariate Case:

Consider a real Gaussian random process x_k, k = 1, ..., N, whose correlation function for N lags can be written as an N × N covariance matrix C_x. For the zero-mean case,

C_x = [ ⟨x_1²⟩     ⟨x_1x_2⟩  ...  ⟨x_1x_N⟩
        ...        ⟨x_2²⟩    ...  ⟨x_2x_N⟩
        ...        ...       ...  ...
        ⟨x_Nx_1⟩   ...       ...  ⟨x_N²⟩ ]

Since the random process is continuous, we use the integral expression for the relative entropy (dependent on the coordinate system)

H = − ∫ dx f_x(x) ln f_x(x)

with

f_x(x) = [ (2π)^N det C_x ]^{−1/2} exp[ −(1/2)(x − μ)ᵗ C_x⁻¹ (x − μ) ],

which yields

H = (1/2) ln[ (2π)^N det C_x ] + (1/2) ⟨ (x − μ)ᵗ C_x⁻¹ (x − μ) ⟩.
We will

1. ignore the factor (2π)^N because it does not depend on C_x;
2. ignore the second term because it is a constant independent of C_x. This is equivalent to the constant term we found for the univariate case.
Example of a bivariate Gaussian:

C_x = [ σ_1²         σ_1σ_2ρ_12
        σ_1σ_2ρ_12   σ_2²       ],

C_x⁻¹ = (det C_x)⁻¹ [ σ_2²          −σ_1σ_2ρ_12
                      −σ_1σ_2ρ_12    σ_1²        ]

C_x⁻¹ (δx_1, δx_2)ᵗ = (det C_x)⁻¹ ( σ_2² δx_1 − σ_1σ_2ρ_12 δx_2,  −σ_1σ_2ρ_12 δx_1 + σ_1² δx_2 )ᵗ

Q ≡ ⟨ (δx_1  δx_2) C_x⁻¹ (δx_1, δx_2)ᵗ ⟩
= (det C_x)⁻¹ [ σ_2²⟨δx_1²⟩ − 2σ_1σ_2ρ_12⟨δx_1δx_2⟩ + σ_1²⟨δx_2²⟩ ]
= 2σ_1²σ_2²(1 − ρ_12²) / [ σ_1²σ_2²(1 − ρ_12²) ]
= 2.

Since this is a constant, we will ignore it in taking derivatives of H.
We therefore use the entropy expression

H ≡ (1/2) ln(det C_x).

Unfortunately, as N → ∞, H → ∞, as can be seen for the uncorrelated case where the covariance matrix is diagonal:

⇒ H = (1/2) ln Π_{j=1}^{N} σ_j² = (1/2) Σ_{j=1}^{N} ln σ_j².

Define an entropy rate as

h = lim_{N→∞} H/(N + 1)
= lim_{N→∞} (1/2) [1/(N + 1)] ln(det C_x)
= lim_{N→∞} (1/2) ln (det C_x)^{1/(N+1)}
= (1/2) ln [ lim_{N→∞} (det C_x)^{1/(N+1)} ]   (1)
Entropy in terms of the spectrum:

To get a maximum entropy estimate of a spectrum, we need an expression for the entropy in terms of the spectrum. There is no general relation between the spectrum and the entropy. For Gaussian processes, however, there is a relation. This is appropriate since a Gaussian process is the one with maximum entropy out of all processes with the same variance, and the spectrum is the variance per unit frequency, so this conceptual step is important. A relation exists¹ between the determinant of the covariance matrix and the spectrum, which is assumed to be bandlimited in (−f_N, f_N):

lim_{N→∞} (det C_x)^{1/(N+1)} = 2f_N exp[ (1/2f_N) ∫_{−f_N}^{f_N} df ln S_x(f) ].

The theorem depends on C_x being a Toeplitz matrix [matrix element C_ij depends only on (i − j)], i.e., on the process being WSS.
1An arcane proof exists in “Prediction-Error Filtering and Maximum-Entropy Spectral Estimation” in Non-Linear Methods of Spectral Analysis, Haykin ed.
Springer-Verlag 1979, see Appendix A, pp. 62-67. It is also given by Smylie et al. 1973, Meth. Comp. Phys. 13, 391.
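The determinant relation can be checked numerically for a process whose ACF and spectrum are both known in closed form. For an AR(1) correlation sequence r_k = a^|k| (with unit sampling, so 2f_N = 1), S(f) = (1 − a²)/|1 − a e^{−2πif}|² and ∫ ln S df = ln(1 − a²), so the limit should equal 1 − a². A sketch, with a and N chosen arbitrarily:

```python
import numpy as np

a, N = 0.5, 400

# Toeplitz covariance matrix C_ij = a^|i-j| of a unit-variance AR(1) process
C = a ** np.abs(np.subtract.outer(np.arange(N), np.arange(N)))

# (det C)^(1/(N+1)) computed via the log-determinant for numerical stability
sign, logdet = np.linalg.slogdet(C)
lhs = np.exp(logdet / (N + 1))

# Theorem: 2 f_N exp[(1/2 f_N) \int ln S df] = exp[ln(1 - a^2)] = 1 - a^2
rhs = 1.0 - a**2
```

At N = 400 the finite-N value already matches the limit to about one part in a thousand.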
Thus

h = lim_{N→∞} (1/2) ln (det C_x)^{1/(N+1)}
= (1/2) ln [ lim_{N→∞} (det C_x)^{1/(N+1)} ]
= (1/2) ln 2f_N + (1/4f_N) ∫_{−f_N}^{f_N} df ln S_x(f).

Ignoring the first, constant term, we have

h = (1/4f_N) ∫_{−f_N}^{f_N} df ln S_x(f)
Heuristic "derivation" of the entropy rate expression:

Another way of viewing this is as follows. In calculating a power spectrum we are concerned with a second-order moment, by definition. Consequently, we can assume that the random process under consideration is Gaussian because:

1. we are maximizing the entropy (subject to constraints), and

2. given the second moment, the process with largest entropy is a Gaussian random process.

Note that while this assumption is satisfactory for estimating the power spectrum (a second moment), it is not necessarily accurate when we consider the estimation errors of the spectral estimate, which depend on fourth-order statistics. If the central limit theorem can be invoked then, of course, the Gaussian assumption becomes a good one once again.
Imagine that the process under study is constructed by passing white noise through a linear filter whose system function is [S(f)]^{1/2}:

n(t) −→ [S_x(f)]^{1/2} −→ x(t)

Consequently, the Fourier transforms are related as

N(f) [S_x(f)]^{1/2} = X(f).

Now N(f) itself is a Gaussian random variable because it is the sum of GRVs. Therefore, X(f) is a GRV, and viewing it as a 1-D random variable we have that the entropy is

H(f) = (1/2) ln[2πeσ²(f)]

but

σ²(f) ≡ ⟨ |N(f) [S(f)]^{1/2}|² ⟩ = S(f) S_N.

Letting the white noise spectrum be S_N = 1, we have

σ²(f) = S(f)

and

H(f) = (1/2) ln[2πeS(f)].

Recall that white noise is uncorrelated between different frequencies:

⟨N(f) N*(f′)⟩ = S_N δ(f − f′).
Consequently, the information in different frequencies adds because of statistical independence, and therefore to get the total entropy we simply integrate (add variances):

H = ∫ df H(f) = (1/2) ∫ df ln[2πeS(f)]
Again we ignore additive constants, and we also consider the case where the signal is bandlimited in |f| ≤ f_N and sampled over a time [0, T]. Therefore the number of degrees of freedom is 2f_N T.

The entropy per degree of freedom is

h = H/(2f_N) = (1/4f_N) ∫_{−f_N}^{f_N} df ln S(f)
The derivation of the ME spectrum follows the logical flow:

entropy in terms of the power spectrum S(f)
⇓
maximize H subject to constraints: ⟨C⟩ known ⟷ (F.T.) S(f)
⇓
ME spectral estimator

van den Bos shows that extrapolating the covariance function to larger lags while maximizing entropy yields a matrix equation that is identical to that obtained by fitting an autoregressive model to the data. This implies that the two procedures are mathematically identical.
Maximum Entropy Spectral Estimator

By maximizing the entropy rate expressed in terms of the spectrum, the spectral estimate can be written as (e.g., Edward & Fitelson 1973)

S(f) = (1/f_N) α_0² / | εᵗ C⁻¹ δ |²,

where C is the Toeplitz covariance matrix, which applies to WSS processes, and

ε = [ 1, e^{2πifΔτ}, ..., e^{2πifMΔτ} ]ᵗ,   δ = [ 1, 0, ..., 0 ]ᵗ,

C = [ C_00  C_01  ...  C_0M
      ...
      C_M0  ...   ...  C_MM ]

Toeplitz ⇒ C ≡ [ C_0  C_1  ...  C_M
                 ...  C_0
                 C_M  ...  ...  C_0 ]

Let

γ ≡ C⁻¹ = [ γ_00  ...  γ_0M
            ...
            γ_M0  ...  γ_MM ]
Then

εᵗ C⁻¹ δ = Σ_{j=0}^{M} γ_j0 e^{2πifΔτ j}

and

S(f) = (1/f_N) α_0² / | Σ_{j=0}^{M} γ_j0 e^{2πifΔτ j} |².

By rewriting the sum and redefining the constants this can be written

S(f) = (α_0²/f_N) | γ_00 + Σ_{j=1}^{M} γ_j0 e^{2πifΔτ j} |^{−2}
     = (α_0²/f_N γ_00²) | 1 + Σ_{j=1}^{M} (γ_j0/γ_00) e^{2πifΔτ j} |^{−2}
Thus, the ME spectral estimate can be put into the form

S(f) = P_M / | 1 + Σ_{j=1}^{M} α_j e^{2πifjΔτ} |²,

where P_M is a constant that properly normalizes the spectrum.

This is the same spectrum as for an M-th order AR process that can be fitted to the data, where the coefficients are determined by least squares.
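The equivalence can be checked numerically: for an AR(1) covariance, the matrix form 1/|εᵗC⁻¹δ|² is exactly proportional to the AR form 1/|1 − ρ e^{2πif}|². A small sketch, with ρ, M, and the frequency grid chosen arbitrarily:

```python
import numpy as np

rho, M = 0.6, 3

# Toeplitz covariance of a unit-variance AR(1) process, C_ij = rho^|i-j|
C = rho ** np.abs(np.subtract.outer(np.arange(M + 1), np.arange(M + 1)))
gamma = np.linalg.inv(C)[:, 0]          # first column of C^-1, i.e. C^-1 delta

def S_mem(f):
    """Matrix-form ME estimate (up to normalization): 1/|eps^t C^-1 delta|^2."""
    eps = np.exp(2j * np.pi * f * np.arange(M + 1))
    return 1.0 / np.abs(eps @ gamma) ** 2

def S_ar1(f):
    """AR(1) spectrum shape: 1/|1 - rho e^{2 pi i f}|^2."""
    return 1.0 / np.abs(1.0 - rho * np.exp(2j * np.pi * f)) ** 2

fgrid = np.linspace(0.0, 0.5, 11)
ratio = np.array([S_mem(f) / S_ar1(f) for f in fgrid])
```

The ratio is constant in f, i.e. the two forms agree up to an overall normalization, as claimed.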
Spectrum of an AR Process:

Consider the following M-th order AR process:

x_t = a_t [white noise] + Σ_{j=1}^{M} α_j x_{t−j} [autoregressive part]

A zeroth-order process would be x_t = a_t (i.e., white noise). Scargle would term the above definition a causal AR process. An acausal or two-sided process would allow negative values of j in the sum on the RHS.

The coefficients α_j, j = 1, ..., M are the AR coefficients. In fitting an AR model to the data, one must determine the order M as well as the M coefficients.
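Simulating such a process is a one-line recursive filter. A sketch with an AR(1) (the value α = 0.7 and the series length are arbitrary), checking the sample lag-1 autocorrelation against the Yule-Walker value ρ_1 = α:

```python
import numpy as np
from scipy.signal import lfilter

rng = np.random.default_rng(0)
alpha = 0.7
a = rng.standard_normal(200_000)          # white-noise innovations a_t

# x_t = a_t + alpha * x_{t-1}  (causal AR(1))
x = lfilter([1.0], [1.0, -alpha], a)

# Sample lag-1 autocorrelation, which should be close to alpha
r1 = np.dot(x[1:], x[:-1]) / np.dot(x, x)
```

Note the denominator-polynomial convention: the AR recursion corresponds to filtering the innovations with 1/(1 − α z⁻¹).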
Define the DFTs

X(f) ≡ Σ_{t=0}^{N−1} x_t e^{−2πift/N},   A(f) ≡ Σ_{t=0}^{N−1} a_t e^{−2πift/N}.

Substituting the definition of the AR process, we have

X(f) = Σ_{j=1}^{M} α_j X(f) e^{−2πifj/N} + A(f),

and, solving for X(f),

X(f) = A(f) / [ 1 − Σ_{j=1}^{M} α_j e^{−2πifj/N} ].
The spectrum is then

S(f) = |A(f)|² / | 1 − Σ_{j=1}^{M} α_j e^{−2πifj/N} |² ∝ 1 / | 1 − Σ_{j=1}^{M} α_j e^{−2πifj/N} |².

As is obvious, the AR spectrum has the same form as the maximum-entropy spectrum.