3/26/15
1
A6523 Signal Modeling, Statistical Inference and
Data Mining in Astrophysics
Spring 2015 http://www.astro.cornell.edu/~cordes/A6523
Lecture 17 – Maximum entropy applications
• General solution for PDFs (constraints, partition function)
• Entropy expression(s) for power spectra/images
• Maximum entropy spectrum for Gaussian processes
• Relationship to autoregressive model fitting
  – Other approaches (HRM, Cholesky decomposition)
  – Matched filtering
  – Interpolation: sampling theorem, missing data, etc.
Time Series to Spectrum Diagram

    x(t)                          ⟵FT⟶    X(f)
    irreversible ↓                         ↓ irreversible
    R_x(τ) = ⟨x(t) x*(t + τ)⟩    ⟵FT⟶    S(f) = ⟨|X(f)|²⟩

(Angle brackets denote the ensemble average.)
The autocorrelation function is closely related to the covariance matrix of the signal.
There are (at least) three spectral estimators that involve the covariance matrix, and they are all mathematically different!

How can this be if there is one and only one relationship between ACF and spectrum, as given by the Wiener-Khinchine theorem (or the imaging equivalent, the van Cittert-Zernike theorem)?
The answer is that the relationships in the diagram are all for ensemble average quantities and we do not know
any of them. We have finite data, so there are missing data in the form of gaps and finite data extent in time or
space.
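The ensemble-average relations in the diagram have an exact finite-sample counterpart that is easy to check numerically: for a length-N sampled series, the squared magnitude of the DFT equals the DFT of the circular autocorrelation. A minimal sketch (the data are an illustrative random draw, not from the notes):

```python
import numpy as np

# Discrete counterpart of the Wiener-Khinchine relation: |DFT(x)|^2 equals
# the DFT of the circular autocorrelation of x.
rng = np.random.default_rng(0)
N = 64
x = rng.standard_normal(N)

# Periodogram: squared magnitude of the Fourier transform.
P = np.abs(np.fft.fft(x)) ** 2

# Circular autocorrelation R[l] = sum_n x[n] x[(n + l) mod N].
R = np.array([np.sum(x * np.roll(x, -l)) for l in range(N)])

# Fourier transform of the autocorrelation: real and equal to P.
S = np.fft.fft(R)
assert np.allclose(S.real, P) and np.allclose(S.imag, 0.0, atol=1e-8)
print("max |S - P| =", np.max(np.abs(S.real - P)))
```

The subtleties discussed below arise precisely because, with real data, we only have estimates of R_x(τ) over a finite lag range, not the exact ensemble quantities used in this identity.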
Thus we have

    Data → R_x(τ) → ⎧ Fourier-transform-based spectral estimates
                     ⎨ maximum entropy spectral estimate
                     ⎩ high-resolution method

Missing data are treated differently in the different spectral estimators, and that is manifested in their different mathematical structure.
Matrix form of the Fourier-transform-based spectral estimate:

It is instructive to compare the matrix form for the maximum entropy spectrum with the power spectral estimate defined as the Fourier transform of the autocorrelation function. This is identical to the spectrum found by taking the squared magnitude of the Fourier transform of the time series and is sometimes called the Bartlett estimate, because the Bartlett lag window is a triangle function.
Let C_{n−n′} be the (n, n′) element of the covariance matrix; then the Bartlett estimate is

    S_B(f) = N^{−1} Σ_{ℓ=−(N−1)}^{N−1} C_ℓ (1 − |ℓ|/N) e^{−2πifℓΔτ},

which can also be written as

    S_B(f) = N^{−2} Σ_{n,n′} C_{n−n′} e^{−2πifΔτ(n−n′)}
           = N^{−2} Σ_{n,n′} e^{2πifn′Δτ} C_{n−n′} e^{−2πifnΔτ},

and as a matrix product

    S_B(f) = N^{−2} ε† C ε,   where ε_n = e^{2πifnΔτ}, n = 0, …, N − 1.
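The lag-sum and quadratic-form versions of the Bartlett estimate are the same thing, which is worth verifying once numerically. A minimal sketch (the damped-cosine covariance is an illustrative assumption, not from the notes):

```python
import numpy as np

# Check the two forms of the Bartlett estimate against each other:
#   S_B(f) = N^{-1} sum_l C_l (1 - |l|/N) e^{-2 pi i f l dt}
#          = N^{-2} eps^dagger C eps,   eps_n = e^{2 pi i f n dt}.
N, dt = 32, 1.0
lags = np.arange(-(N - 1), N)
C_l = 0.9 ** np.abs(lags) * np.cos(2 * np.pi * 0.1 * lags * dt)  # toy ACF
# Toeplitz covariance matrix: C[n, m] = C_{n-m}.
C = np.array([[C_l[(n - m) + (N - 1)] for m in range(N)] for n in range(N)])

f = 0.07  # any frequency inside the Nyquist band
eps = np.exp(2j * np.pi * f * np.arange(N) * dt)

S_matrix = (eps.conj() @ C @ eps).real / N**2
S_sum = np.sum(C_l * (1 - np.abs(lags) / N)
               * np.exp(-2j * np.pi * f * lags * dt)).real / N

assert np.isclose(S_matrix, S_sum) and S_matrix > 0
print(S_matrix, S_sum)
```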
This can be compared with the ME and high-resolution ("ML") estimates

    S_ME(f) = constant / |ε† C^{−1} δ|²

    S_ML(f) = 1 / (ε† C^{−1} ε),

where δ = (1, 0, …, 0)^T is the first unit column vector.
Comparison of Spectral Estimators

                 Bartlett                MLM                    MEM
    Estimator    N^{−2} ε† C ε          (ε† C^{−1} ε)^{−1}     ∝ |ε† C^{−1} δ|^{−2}
    Resolution   Δf = 1/(NΔτ) = 1/T     ≈ same or better       better resolution
                                                               (up to ×2 of Bartlett)
    Errors       100% error, large      lower sidelobes        lower sidelobes
                 sidelobes

Note that all estimators are real, because the quadratic form ε† C ε is real for C Toeplitz (Hermitian); in the MEM case the estimator is manifestly real.
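The three quadratic forms in the table can be compared side by side from the same covariance matrix. A sketch under assumed toy conditions (one complex sinusoid at f₀ = 0.2 cycles/sample in white noise, so C = P vv† + σ²I is Toeplitz; the MEM "constant" is set to 1, so only peak locations and widths are comparable, not absolute levels):

```python
import numpy as np

# Bartlett, MLM, and MEM quadratic forms evaluated from one covariance.
N, P, sigma2, f0 = 16, 1.0, 0.1, 0.2
n = np.arange(N)
v = np.exp(2j * np.pi * f0 * n)
C = P * np.outer(v, v.conj()) + sigma2 * np.eye(N)   # Toeplitz, Hermitian
Cinv = np.linalg.inv(C)
delta = np.zeros(N); delta[0] = 1.0                  # first unit vector

freqs = np.arange(0.0, 0.5, 0.001)
S_B, S_ML, S_ME = [], [], []
for f in freqs:
    eps = np.exp(2j * np.pi * f * n)
    S_B.append((eps.conj() @ C @ eps).real / N**2)          # Bartlett
    S_ML.append(1.0 / (eps.conj() @ Cinv @ eps).real)       # MLM
    S_ME.append(1.0 / np.abs(eps.conj() @ Cinv @ delta)**2) # MEM, const = 1
S_B, S_ML, S_ME = map(np.array, (S_B, S_ML, S_ME))

# All three peak at the sinusoid frequency ...
for S in (S_B, S_ML, S_ME):
    assert np.isclose(freqs[np.argmax(S)], f0)
# ... and the half-power widths order as in the table: MEM < MLM < Bartlett.
width = lambda S: np.sum(S > 0.5 * S.max())
assert width(S_ME) <= width(S_ML) <= width(S_B)
print("points above half power:", width(S_B), width(S_ML), width(S_ME))
```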
The philosophy of entropy maximization dictates that we make no assumptions about missing data.
We will consider the ME estimator in greatest detail and we will show that the ME spectral
estimator is equivalent to
1. fitting an autoregressive (AR) model to the data and finding the spectrum of the AR
model.
2. applying a linear, Wiener-Hopf prediction filter to the time series, thus extending the TS
by some allowable amount, and finding the spectrum of the extended TS.
3. extending the correlation function according to an AR model.
In effect, ME techniques fill in gaps (interpolate) or extend the data (extrapolate) in a way
that is consistent with the available data but also in such a way that entropy is maximized.
This maximizes the uncertainty about the missing data; i.e. the least amount of “structure” is
imposed on the missing data.
In some instances, one can "superresolve" features in the spectrum; this can be viewed as "beating" the uncertainty principle because the effective resolution in the frequency domain can be as fine as

    Δf_ME → 1/(2T)

(T = length of time series), i.e. twice the resolution of the conventional Bartlett-type estimate.
This occurs (for some cases) because the MEM effectively predicts missing data from known
data.
The regime in which this works best is where
T ∼ correlation time of the process
because, at best, one can predict ∼ one correlation time into the future.
According to the ME philosophy, the sidelobes (Gibbs phenomenon) associated with spectral estimators based on Fourier transforms (i.e. either the smoothed |F.T.|² of the time series data or the F.T. of a correlation function) are a penalty that is a consequence of "assuming" that unknown data are zero. (One is not so naive as to really assume the data are zero, but if one did, the same estimator would result.)
How can the data be extended?

We may have extensions of the form:

I. Predict x(t) forward by approximately one correlation time (W_x). The new resolution will be δf ∼ T′^{−1} < T^{−1}, where T′ = T + W_x.

II. Extend the correlation function, R_x(τ) or C_x(τ).
Data-adaptive methods help only when there really is significant missing data.

If T = actual time series duration and W_x = correlation time, then for T ≫ W_x we have T′ = T + W_x ≈ T ⇒ no advantage.

But if T ≈ W_x, so that T′ ∼ 2T, then a factor of 2 increase in resolution may result (superresolution).

e.g. the case of two sinusoids: x(t) = A sin ω₁t + B sin ω₂t + n(t).

If δf = T^{−1} ≪ |ω₂ − ω₁| then F.T. methods will suffice in resolving the spectral lines. However, if T^{−1} ∼ |ω₂ − ω₁|, then ME estimates may provide better resolution.
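The Δf ∼ 1/T limit for |F.T.|²-based estimates is easy to exhibit directly. A sketch with assumed illustrative parameters (two equal-strength complex tones on a symmetric time grid, so the Dirichlet kernel is real): a separation of 3/T leaves a dip between the two lines (resolved), while a separation of 0.4/T does not (unresolved).

```python
import numpy as np

# Resolution limit of the |FT|^2 estimate for two closely spaced tones.
N = 256                          # samples, dt = 1, so T = N
t = np.arange(N) - (N - 1) / 2   # symmetric grid: Dirichlet kernel is real

def power(x, f):
    """Squared magnitude of the DTFT of x at frequency f (cycles/sample)."""
    return np.abs(np.sum(x * np.exp(-2j * np.pi * f * t))) ** 2

f1 = 0.1
for sep, resolved in [(3.0 / N, True), (0.4 / N, False)]:
    f2 = f1 + sep
    x = np.exp(2j * np.pi * f1 * t) + np.exp(2j * np.pi * f2 * t)
    # Resolved means the spectrum dips between the two line frequencies.
    dip = power(x, (f1 + f2) / 2) < power(x, f1)
    assert dip == resolved
print("3/T separation resolved; 0.4/T separation not resolved")
```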
Core tenet of Maximum Entropy methods:
Out of all possible (hypotheses | PDFs | …)
that agree with constraints,
choose the one that is
maximally non-committal
with respect to missing information.
Maximum Entropy Methods

• Probabilities: H = − Σᵢ pᵢ log pᵢ
  – Combine with constraint equations to determine the pᵢ
  – No constraints ⇒ flat distribution
• Spectra (and images): H = ∫ df log S(f)
  – Constraints are from measurements
  – The spectrum is obtained by maximizing entropy subject to the constraints
  – For a Gaussian process, the resulting expression is the same as optimally fitting an AR model to a time series
  – Why Gaussian?
Maximum Entropy: General Solution for PDF with Constraints

First a simple case: The entropy for a discrete random variable X, which takes on values x_k, k = 0, …, N − 1, with associated probabilities p_k, is

    H = − Σ_k p_k ln p_k.

When there is a constraint on the nth moment ⟨xⁿ⟩, the PDF is found by maximizing the quantity

    J = H − λ₀ Σ_k p_k − λ₁ Σ_k x_kⁿ p_k,

which uses a Lagrange multiplier λ₀ for the normalization constraint and λ₁ for the constraint on the moment. This construction allows us to calculate the total variation with respect to the p_k.

Taking increments δp_k we obtain

    δH = − Σ_k (δp_k ln p_k + δp_k)

    δJ = δH − λ₀ Σ_k δp_k − λ₁ Σ_k x_kⁿ δp_k
       = − Σ_k δp_k (ln p_k + 1 + λ₀ + λ₁ x_kⁿ)
       = 0.

The factor in parentheses needs to vanish since δp_k is arbitrary (but small), so

    p_k = e^{−(λ₀+1)} e^{−λ₁ x_kⁿ},   k = 0, …, N − 1.
We can solve for the Lagrange multipliers by substitution into the constraint equations:

    Σ_k p_k = e^{−(λ₀+1)} Σ_k e^{−λ₁ x_kⁿ} = 1   ⟹   e^{−(λ₀+1)} = 1 / Σ_k e^{−λ₁ x_kⁿ}

and

    p_k = e^{−λ₁ x_kⁿ} / Σ_k e^{−λ₁ x_kⁿ}.

The Lagrange multiplier λ₁ needs to be obtained from the expression for the nth moment,

    ⟨xⁿ⟩ = Σ_k x_kⁿ p_k = Σ_k x_kⁿ e^{−λ₁ x_kⁿ} / Σ_k e^{−λ₁ x_kⁿ}.

n = 1: Constraint on the mean, with the x_k constrained to be positive:

    ⟨x⟩ = Σ_k x_k p_k = Σ_k x_k e^{−λ₁ x_k} / Σ_k e^{−λ₁ x_k}.

By inspection, we see that this is a discretized exponential distribution, so λ₁ has to be set by the mean ⟨x⟩ (in the continuous limit, λ₁ = 1/⟨x⟩).
We can show this formally for the case where the x_k are uniformly spaced, x_k = k δx, k = 0, …, N − 1.

Let a = e^{−λ₁ δx}. Then, using a trick to get the numerator,

    (1/δx) Σ_k x_k e^{−λ₁ x_k} = Σ_k k aᵏ = a (d/da) Σ_{k=0}^{N−1} aᵏ = a (d/da) [ (1 − aᴺ)/(1 − a) ]
                               = a [ (1 − a)(−N a^{N−1}) − (1 − aᴺ)(−1) ] / (1 − a)²,

and dividing by the normalization Σ_k aᵏ = (1 − aᴺ)/(1 − a) we obtain

    ⟨x⟩/δx = a/(1 − a) − N aᴺ/(1 − aᴺ).

If N → ∞ (with a < 1), only the leading term matters and we can solve for a:

    a = e^{−λ₁ δx} = (⟨x⟩/δx) / (1 + ⟨x⟩/δx).

A similar approach can be taken for a case where the variance is fixed. In the continuous limit, this yields a Gaussian PDF.
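The mean-constrained case can be checked numerically: solve the constraint for λ₁ by bisection on a long uniform grid, and compare a = e^{−λ₁δx} with the large-N closed form ⟨x⟩/(⟨x⟩ + δx). The grid size and target mean below are illustrative choices, not from the notes.

```python
import numpy as np

# Maximum-entropy distribution on x_k = k*dx with a fixed mean: p_k ∝ a^k,
# a = exp(-lambda_1 * dx).  Solve for lambda_1 numerically, then compare
# with the N -> infinity result a = <x>/(<x> + dx).
dx, N, target = 1.0, 2000, 3.0
k = np.arange(N)

def mean_of(lam):
    p = np.exp(-lam * k * dx)
    p /= p.sum()
    return np.sum(k * dx * p)

# mean_of is monotone decreasing in lambda_1: bisect for the constraint.
lo, hi = 1e-6, 10.0
for _ in range(100):
    mid = 0.5 * (lo + hi)
    if mean_of(mid) > target:
        lo = mid           # mean too large -> need larger lambda_1
    else:
        hi = mid
lam = 0.5 * (lo + hi)
a = np.exp(-lam * dx)

assert abs(mean_of(lam) - target) < 1e-9
assert abs(a - target / (target + dx)) < 1e-6   # a = <x>/(<x> + dx)
print("lambda_1 =", lam, "  a =", a)
```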
General case with arbitrary constraints on moments
(From Jaynes, Probability Theory: The Logic of Science, pp. 355-358)

Discrete RV: x_n, n = 0, …, N − 1.

Constraints on arbitrary functions of x: f_k(x), k = 1, …, M.

The constraint equations are the expectations:

    ⟨f_k(x)⟩ = F_k = Σ_{n=0}^{N−1} p_n f_k(x_n).

Maximize entropy subject to the constraints:

    J = H − λ₀ Σ_n p_n − Σ_k λ_k ⟨f_k(x)⟩
      = − Σ_n p_n [ ln p_n + λ₀ + Σ_k λ_k f_k(x_n) ].

Setting δJ = 0 for arbitrary δp_n and using normalization to unity to obtain λ₀, we get

    p_n = exp[ − Σ_k λ_k f_k(x_n) ] / Z(λ),

where the partition function is

    Z(λ) = Σ_n exp[ − Σ_k λ_k f_k(x_n) ].

Solving for the Lagrange multipliers clearly is nontrivial in general.
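With a single constraint the multiplier can be found by one-dimensional root finding. A sketch using Jaynes' dice example as an assumed illustration (not from the notes): x_n = 1, …, 6, one constraint f₁(x) = x with F₁ = 4.5, so p_n = e^{−λx_n}/Z(λ).

```python
import numpy as np

# Maximum-entropy distribution over die faces 1..6 with mean fixed at 4.5.
x = np.arange(1, 7)

def p_of(lam):
    w = np.exp(-lam * x)
    return w / w.sum()          # Z(lambda) = sum_n exp(-lambda x_n)

def mean_of(lam):
    return np.sum(x * p_of(lam))

# mean_of is monotone decreasing in lambda; a mean of 4.5 (> 3.5, the
# unconstrained value) requires lambda < 0.  Bisect on [-5, 5].
lo, hi = -5.0, 5.0
for _ in range(100):
    mid = 0.5 * (lo + hi)
    if mean_of(mid) > 4.5:
        lo = mid
    else:
        hi = mid
lam = 0.5 * (lo + hi)
p = p_of(lam)

assert abs(np.sum(x * p) - 4.5) < 1e-9
assert np.all(np.diff(p) > 0)   # probabilities tilt toward the high faces
print(np.round(p, 4))
```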
Maximum Entropy: Power Spectrum (short approach)

So far we know how to calculate the entropy of a random variable in terms of its PDF. For a univariate Gaussian PDF we have

    f_X(x) = (2πσ²)^{−1/2} e^{−x²/2σ²}

    H = − ∫ dx f_X(x) ln f_X(x)
      = ½ ln(2πσ²) + ⟨X²⟩/(2σ²)
      = ½ [ ln(2πσ²) + 1 ]
      = ½ ln(2πeσ²).

When we maximize the entropy subject to constraints (from data), we only care about terms in the entropy that depend on the relevant parameters. Here the only parameter is σ, so the constant term does not matter. Notice that larger σ implies larger entropy, as we would expect for a measure of uncertainty. When we maximize entropy, we may as well write it only in terms of the variance,

    H = ½ ln σ² + constant.
Heuristic "derivation" of the entropy rate expression:

Another way of viewing this is as follows. In calculating a power spectrum we are concerned with a second-order moment, by definition. Consequently, we can assume that the random process under consideration is Gaussian because:

1. we are maximizing the entropy (subject to constraints), and
2. given the second moment, the process with largest entropy is a Gaussian random process.

Note that while this assumption is satisfactory for estimating the power spectrum (a second moment), it is not necessarily accurate when we consider the estimation errors of the spectral estimate, which depend on fourth-order statistics. If the central limit theorem can be invoked then, of course, the Gaussian assumption becomes a good one once again.

Imagine that the process under study is constructed by passing white noise through a linear filter whose system function is S^{1/2}(f):

    n(t) → [ S^{1/2}(f) ] → x(t)

Consequently, the Fourier transforms are related as

    X(f) = N(f) S^{1/2}(f).

Now N(f) itself is a Gaussian random variable because it is the sum of GRVs. Therefore, X(f) is a GRV, and viewing it as a 1-D random variable we have that the entropy is

    H(f) = ½ ln[2πe σ²(f)]

but

    σ²(f) ≡ ⟨|N(f) S^{1/2}(f)|²⟩ = S(f) S_N.

Letting the white noise spectrum be S_N = 1, we have

    σ²(f) = S(f)

and

    H(f) = ½ ln[2πe S(f)].

Recall that white noise is uncorrelated between different frequencies (in the continuous limit),

    ⟨N(f) N*(f′)⟩ = S_N δ(f − f′).

Consequently, the information in different frequencies adds because of statistical independence and, therefore, to get the total entropy we simply integrate (add variances):

    H = ∫ df H(f) = ½ ∫ df ln[2πe S(f)].

Again we ignore additive constants, and we also consider the case where the signal is bandlimited to |f| ≤ f_N and sampled over a time [0, T]. Therefore, the number of degrees of freedom is 2 f_N T.

The entropy per degree of freedom is

    h = H/(2f_N) = (1/4f_N) ∫_{−f_N}^{f_N} df ln[S(f)].
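A numerical check of the entropy-rate expression on an assumed AR(1) spectrum, S(f) = σ²/|1 − a e^{−2πif}|² with Δτ = 1 (so f_N = 1/2): Jensen's formula gives ∫ ln|1 − a e^{−2πif}|² df = 0 over the Nyquist band for |a| < 1, so h = ½ ln σ² for every a. The entropy rate depends only on the innovation variance, not on the predictable AR structure, which is consistent with the ME philosophy of extending data without adding information.

```python
import numpy as np

# Entropy rate h = (1/4 f_N) * integral_{-f_N}^{f_N} ln S(f) df for an
# AR(1) spectrum; should equal (1/2) ln(sigma^2) regardless of a.
a, sigma2, fN = 0.9, 4.0, 0.5
M = 200_000
f = (np.arange(M) + 0.5) / M - 0.5            # midpoint grid on [-1/2, 1/2]
S = sigma2 / np.abs(1 - a * np.exp(-2j * np.pi * f)) ** 2
integral = np.mean(np.log(S)) * (2 * fN)      # band has width 2 f_N = 1
h = integral / (4 * fN)

assert abs(h - 0.5 * np.log(sigma2)) < 1e-6
print("h =", h, " expected:", 0.5 * np.log(sigma2))
```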
The derivation of the ME spectrum follows the logical flow:

    entropy in terms of the power spectrum S(f)
    ⇓
    maximize H subject to constraints:
    known values of the ACF ⟵FT⟶ S(f)
    ⇓
    ME spectral estimator

van den Bos shows that extrapolating the covariance function to larger lags while maximizing entropy yields a matrix equation that is identical to that obtained by fitting an autoregressive model to the data. This implies that the two procedures are identical mathematically.
Maximum Entropy Spectral Estimate

• The MESE is identical to fitting an autoregressive (AR) model to a time series.
• The unknowns are P_M and the α_j, j = 1, …, M:

    S(f) = P_M / | 1 − Σ_{j=1}^{M} α_j e^{2πifjΔτ} |²

    X_t = Σ_{j=1}^{M} α_j X_{t−j} + n_t

(The minus sign in the denominator follows from writing the AR recursion with a plus sign, as in the second equation.)
IEEE Trans. Inform. Theory (Corresp.), vol. IT-17, July 1971, pp. 493-494

[Lecturer's note: Proof that maximization of entropy for a Gaussian process is equivalent to extending the ACF using an AR model. Van den Bos (1971).]

Alternative Interpretation of Maximum Entropy Spectral Analysis

Abstract: Maximum entropy spectral analysis is a method for the estimation of power spectra with a higher resolution than can be obtained with conventional techniques. This is achieved by extrapolation of the autocorrelation function in such a way that the entropy of the corresponding probability density function is maximized in each step of the extrapolation. This correspondence also gives a simple interpretation of the method without entropy considerations.

I. INTRODUCTION

In the classical approach to calculation of power spectra the sample autocorrelation function is multiplied by some lag window and then transformed [1]. The window is applied to provide statistical stability and to avoid leakage from other parts of the spectrum. On the other hand, the window limits the spectral resolution.

In the literature, methods are found for the improvement of the spectral resolution through extrapolation of the correlation function beyond the maximum lag [2], [3]. One of these methods, the so-called maximum entropy spectral analysis [3], is discussed in this correspondence. The idea behind the method is to extrapolate the autocorrelation function in such a way that the entropy of the corresponding probability density function is maximized in each step of the extrapolation.

It is shown that the method is equivalent to least-squares fitting of a discrete-time all-pole model to the data.

II. MAXIMUM ENTROPY SPECTRAL ANALYSIS [3]

Suppose that the first N + 1 points r(0), r(1), …, r(N) of the autocorrelation function of a stationary Gaussian process are exactly known. The problem is now to find a value for r(N + 1). Therefore consider the (N + 2) × (N + 2) covariance matrix that includes the unknown r(N + 1) value:

    R(N + 1) = | r(0)      r(1)   …  r(N)      r(N + 1) |
               | r(1)      r(0)   …  r(N − 1)  r(N)     |
               | …                                      |
               | r(N + 1)  r(N)   …  r(1)      r(0)     |

The quantities r(0), r(1), …, r(N), r(N + 1) belong to an autocorrelation only if R(N + 1) is semipositive definite. It follows that det[R(N + 1)] must be nonnegative. det[R(N + 1)] is a quadratic function of r(N + 1). The second derivative of det[R(N + 1)] with respect to r(N + 1) is −2 det[R(N − 1)], which is always negative. This implies that det[R(N + 1)] as a function of r(N + 1) has a single maximum. Hence the allowable values of r(N + 1) are those making det[R(N + 1)] equal to zero and all values in between.

The basic idea of maximum entropy spectral analysis is to select the value of r(N + 1) that maximizes det[R(N + 1)]. The entropy of the (N + 2)-dimensional Gaussian probability density function with covariance matrix R(N + 1) is equal to [4]

    ln { (2πe)^{(N+2)/2} det[R(N + 1)]^{1/2} }.

It follows that the particular choice for r(N + 1) maximizes this quantity. In order to obtain r(N + 2), the value for r(N + 1) just found is substituted in R(N + 2) and after that det[R(N + 2)], and therefore the corresponding entropy, is maximized with respect to r(N + 2). Substitution of the values of r(N + 1) and r(N + 2) so obtained in det[R(N + 3)] and maximizing yields r(N + 3), and so on. Note that in general this procedure is different from maximizing det[R(N + 3)] with respect to r(N + 1), r(N + 2), r(N + 3).

For what follows it is important to note that maximizing det[R(N + 1)] with respect to r(N + 1) is equivalent to the statement

    det | r(1)      r(0)  …  r(N − 1) |
        | r(2)      r(1)  …  r(N − 2) |
        | …                           |
        | r(N + 1)  r(N)  …  r(1)     | = 0.   (1)

III. INTERPRETATION IN TERMS OF ALL-POLE MODELS

Consider the following all-pole model

    y(n) + a₁′ y(n − 1) + ⋯ + a_N′ y(n − N) = e(n)   (2)

where e(n) is a zero-mean random variable with E[e(i)e(j)] = 0 for i ≠ j. Since E[e(n)y(n − k)] = 0 for k > 0, multiplication of both sides of (2) with y(n − k) and taking expectations yields

    r′(k) + a₁′ r′(k − 1) + ⋯ + a_N′ r′(k − N) = 0   for k > 0,

with r′(k) = E[y(n)y(n − k)]. It follows that

    r′(1) + a₁′ r′(0) + ⋯ + a_N′ r′(N − 1) = 0
    r′(2) + a₁′ r′(1) + ⋯ + a_N′ r′(N − 2) = 0   (3)
    ⋮
    r′(N + 1) + a₁′ r′(N) + ⋯ + a_N′ r′(1) = 0

and hence

    det | r′(1)      r′(0)  …  r′(N − 1) |
        | r′(2)      r′(1)  …  r′(N − 2) |
        | …                              |
        | r′(N + 1)  r′(N)  …  r′(1)     | = 0.

Now, suppose that the first N + 1 exact values r(0), r(1), …, r(N) of any autocorrelation function r(k) are available. Substitution of the values in the first N equations of (3) and solving yields a set of coefficients a₁, …, a_N. The value for r(N + 1) for a model with these coefficients, however, is uniquely determined by

    det | r(1)      r(0)  …  r(N − 1) |
        | r(2)      r(1)  …  r(N − 2) |
        | …                           |
        | r(N + 1)  r(N)  …  r(1)     | = 0.

From (1) it follows that the same solution for r(N + 1) would have been obtained from maximum entropy spectral analysis. In the same way it can be proved that all values of the extrapolated autocorrelation function coincide with those corresponding to the model 1, a₁, …, a_N. So it is concluded that maximum entropy spectral analysis is equivalent to matching the coefficients of an Nth-order model to the N + 1 given points of the autocorrelation function. Once the coefficients of the scheme have been obtained, the computation of the corresponding spectrum is straightforward.

IV. SAMPLE CONSIDERATIONS

In practice only a finite record of the signal y(t) is available and the autocorrelation function cannot be determined exactly. The usual approach to fitting an Nth-order all-pole model to a record of length M is as follows [5]. The quantity

    Σ_m [ y(m) + Σ_{n=1}^{N} a_n y(m − n) ]²   (4)

is minimized with respect to a₁, a₂, …, a_N. The minimization of (4) yields a set of linear equations equivalent to the first N equations of (3) with the exact autocorrelation functions replaced by their estimates. It is clear that the maximum entropy procedure yields the same results when applied to these estimates. Therefore, the extrapolation beyond the (N + 1)th sample autocorrelation point is equivalent to least-squares fitting of an Nth-order all-pole model to the data.

Asymptotic sampling properties of the least-squares coefficient estimates were derived by Mann and Wald [6]. The asymptotic Nth-order covariance matrix of the estimates is expressed in terms of the sample autocorrelation matrix R_M(N), whose (i, j)th element is the sample autocorrelation function r_M(i − j) of y(m), and s², the minimum value of (4) divided by M − N. Moreover, the coefficient estimates are maximum-likelihood estimates if the e(n) in (2) are Gaussian.

V. CONCLUDING REMARKS

Maximum entropy spectral analysis is equivalent to least-squares fitting of an all-pole model to the available data. The order of the model is determined by the number of available mean-lagged products. This implies the possibility of introduction of poles that were not actually present in the process. On the other hand, if the process contains both poles and zeros, a relatively large number of poles is needed to approximate the true spectrum sufficiently [7]. In this case the number of available mean-lagged products may be too small.

Therefore, it seems more adequate to use parametric methods such as the minimum residual methods developed by Åström and Steiglitz [7], [8] for models with both poles and zeros, or conventional fitting of all-pole models, since these approaches provide in addition tests for the order of the model.

A. VAN DEN BOS
Dep. Appl. Phys.
Delft Technol. Univ.
Lorentzweg 1
Delft, Holland

Manuscript received March 25, 1970; revised November 23, 1970.

REFERENCES
[1] R. B. Blackman and J. W. Tukey, The Measurement of Power Spectra. New York: Dover, 1958.
[2] N. R. Zagalsky, "Exact spectral representation of truncated data," Proc. IEEE (Lett.), vol. 55, Jan. 1967, pp. 117-118.
[3] J. P. Burg, "Maximum entropy spectral analysis," presented at the 37th Annu. Meeting Soc. of Exploration Geophysicists, Oklahoma City, Okla., 1967.
[4] C. E. Shannon and W. Weaver, The Mathematical Theory of Communication. Urbana, Ill.: University of Illinois Press, 1949, pp. 54-57.
[5] M. G. Kendall and A. Stuart, The Advanced Theory of Statistics, vol. 3. London: Griffin, 1966, pp. 476-481.
[6] H. B. Mann and A. Wald, "On the statistical treatment of linear stochastic difference equations," Econometrica, vol. 11, July/Oct. 1943, pp. 173-219.
[7] S. A. Tretter and K. Steiglitz, "Power-spectrum identification in terms of rational models," IEEE Trans. Automat. Contr. (Short Papers), vol. AC-12, Apr. 1967, pp. 185-188.
[8] K. J. Åström and T. Bohlin, "Numerical identification of linear dynamic systems from normal operating records," Proc. IFAC Symp. Self-Adaptive Control Systems, Sept. 1966, pp. 3.2-1-3.2-9.
IEEE Trans. Inform. Theory (Corresp.), March 1973, pp. 232-234
Notes on Maximum-Entropy Processing

J. A. EDWARD AND M. M. FITELSON

Abstract: Maximum-entropy processing is a method for computing the power density spectrum from the first N lags of the autocorrelation function. Unlike the discrete Fourier transform, maximum-entropy processing does not assume that the other lag values are zero. Instead, one mathematically ensures that the fewest possible assumptions about unmeasured data are made by choosing the spectrum that maximizes the entropy for the process. The use of the maximum entropy approach to spectral analysis was introduced by Burg [1]. In this correspondence, the authors derive the maximum-entropy spectrum by obtaining a spectrum that is forced to maximize the entropy of a stationary random process.

I. INTRODUCTION

Maximum entropy processing is a method for computing the power density spectrum from the first N lags of the autocorrelation function. As such, it may be applied to spectral analysis problems normally approached by the use of the discrete Fourier transform, such as the computation of frequency power density spectra for stationary time series, or of wave number spectra for sonar or seismic arrays. Unlike the discrete Fourier transform, maximum entropy processing does not assume that the other lag values are zero. Instead, one chooses the spectrum that fits the measured data and maximizes the entropy of the process. This means one has mathematically ensured that the fewest assumptions regarding the information content of unmeasured data have been made.

The use of this technique for estimating power spectra was introduced by Burg [1]. His paper did not contain an explicit derivation of this result. However, in 1969, Bernard [2], and independently, Edward [3], presented similar derivations (neither of which are available in the open literature) of the maximum entropy spectrum. Subsequently, in 1971, Van den Bos [4] presented a derivation based on a least-squares fitting of an all-pole model to the data. In addition, Lacoss [5] presented quantitative results based on maximum-entropy spectrum analysis.

In this paper the authors present a more rigorous version of Edward's derivation [3]. The maximum entropy spectrum is found by obtaining a spectrum that is forced to maximize the entropy for a stationary random process and uniformly spaced data. The only assumptions that are necessary arise from demanding that the maximum-entropy spectrum have the analytical properties of the spectrum of the process in question.

The fit to measured data is obtained by requiring the second-order moments (i.e., the covariance matrix) for the spectrum to agree with estimates based on the measured data. A maximum entropy random process with constraints on only the first and second moments is readily shown to be a Gaussian process. The entropy for a stationary Gaussian process may be shown to be given by

    H = (1/λ_s) ∫ ln f(λ) dλ,   (1)

where the integral is over the support of f(λ), and λ_s is the measure of the support of f(λ). The maximum entropy estimate of f(λ) is derived in Section II.

Manuscript received March 22, 1972; revised September 1, 1972. The authors are with the General Electric Company, Syracuse, N.Y. 13201.

II. THE MAXIMUM-ENTROPY SPECTRUM

By definition,

    f(λ) = ∫_{−∞}^{∞} dτ exp(−2πiτλ) E[X(t)X(t + τ)] ≡ ∫_{−∞}^{∞} dτ exp(−2πiτλ) ρ(τ),   (2)

where X(t) is the stationary Gaussian process in question. Suppose one had the covariance matrix for X(t₁), X(t₂), …, X(t_n), where since the t_i are uniformly spaced it may be assumed that t_i = i. The covariance matrix satisfies

    R_ij = E(X_i X_j) = R_ji* = r_{i−j},   (3)

where

    r_k = ∫ dλ exp(2πiλk) f(λ).   (4)

Note that R is Hermitian and thus has 2n − 1 (n real and n − 1 imaginary) independent components. To find f(λ) so that H is maximized and (4) is satisfied, introduce the quantity

    L = − Σ_{k=−(n−1)}^{n−1} Θ_k* r_k,   (5)

where the Lagrange multipliers {Θ_k} satisfy

    Θ_k = Θ_{−k}*.   (6)

Then f(λ) is found by demanding that

    δ(L + H) = 0,   (7)

where the variation is with respect to f(λ). Using (1) and (4), (7) becomes

    ∫ dλ δf(λ) [ 1/(λ_s f(λ)) − Σ_{k=−(n−1)}^{n−1} Θ_k* exp(2πiλk) ] = 0.   (8)

Since δf(λ) is arbitrary one has

    f(λ) = 1 / [ λ_s Σ_{k=−(n−1)}^{n−1} Θ_k* exp(2πiλk) ].   (9)

Imposing the condition that f(λ) is positive and integrable, it is possible to express f(λ) as

    f(λ) = 1 / [ λ_s |Λ(λ)|² ],   (10)

where

    Λ(λ) = Σ_{k=0}^{n−1} γ_k* exp(2πiλk),   Θ_m* = Σ_{k=0}^{n−1−m} γ_k γ_{k+m}*,   (11)

and by Wold's method [6], [7], Λ(λ)^{−1} may be chosen to be analytic in the upper half plane. It now remains to solve for the γ_k, k = 0, …, n − 1, by satisfying the n constraint equations

    r_k = ∫_{−1/2}^{1/2} dλ f(λ) exp(−2πiλk),   k = 0, 1, …, n − 1,   (12)

where use has been made of the periodicity of Λ(λ), and λ_s has been taken to be 1. Multiplying both sides of (10) by Λ*(λ) exp(2πiλk) and integrating yields

    Σ_{l=0}^{n−1} R_{k+1,l+1} γ_l = ∫_{−1/2}^{1/2} dλ exp(2πiλk) / Σ_{m=0}^{n−1} γ_m* exp(2πiλm).   (13)

Since Λ(λ)^{−1} is analytic in the upper half-plane, the integral on the right-hand side of (13) may be performed by integration around the contour shown in Fig. 1.

[Fig. 1. Integration contour.]

Due to the periodicity of Λ(λ), the integrals over paths 2 and 4 cancel. This fact combined with Cauchy's theorem implies that the integrals over paths 1 and 3 are the negatives of each other. Therefore

    ∫_{−1/2}^{1/2} dλ exp(2πiλk)/Λ(λ) = ∫_{−1/2}^{1/2} dλ exp(2πiλk − 2πyk) / Σ_{m=0}^{n−1} γ_m* exp(2πiλm − 2πym).   (14)

One may take y to be arbitrarily large, so that

    ∫_{−1/2}^{1/2} dλ exp(2πiλk)/Λ(λ) = lim_{y→∞} ∫_{−1/2}^{1/2} dλ exp(2πiλk − 2πyk) / Σ_{m=0}^{n−1} γ_m* exp(2πiλm − 2πym) = (1/γ₀) δ_{k,0}.   (15)

Combining (15) with (13) yields

    Σ_{l=0}^{n−1} R_{k+1,l+1} γ_l = (1/γ₀) δ_{k,0},   (16)

which has the solution, in an obvious matrix notation,

    γ = (1/γ₀) R^{−1} δ.   (17)

This finally yields

    f(λ) = (R^{−1})₁₁ / |(R^{−1} δ)ᵀ e(λ)|²,   (18)

where eᵀ(λ) = [1, …, exp(−2πi(n − 1)λ)] and δ = (1, 0, …, 0)ᵀ. The right-hand side of (18) is the desired maximum entropy spectrum.
IEEE Transactions on Information Theory, March 1973

ACKNOWLEDGMENT

The authors would like to thank the referees and Prof. T. Kailath for their helpful suggestions, which improved the presentation of this correspondence.

REFERENCES
[1] J. P. Burg, "Maximum entropy spectral analysis," presented at the 37th Annual Meeting, Soc. Explor. Geophys., Oklahoma City, Okla., 1967.
[2] T. E. Barnard, "Analytical studies of techniques for the computation of high-resolution wave-number spectra," Texas Instruments Advan. Array Res., Spec. Rep. 9, 1969.
[3] J. A. Edward, "An approach to passive sonar array processing based on the information in the space/time covariance matrix," General Electric Co., Heavy Mil. Electron. Syst., TIS Rep. R69EMH38, 1969.
[4] A. van den Bos, "Alternative interpretation of maximum entropy spectral analysis," IEEE Trans. Inform. Theory (Corresp.), vol. IT-17, pp. 493-494, July 1971.
[5] R. T. Lacoss, "Data adaptive spectral analysis methods," Geophysics, vol. 36, pp. 661-675, Aug. 1971.
[6] E. A. Robinson, "Predictive decomposition of time series with applications to seismic exploration," Massachusetts Inst. Technol., GAG Rep. 7, sect. 2.8, pp. 185-186, July 12, 1954.
[7] H. Wold, A Study in the Analysis of Stationary Time Series. Uppsala: Almquist and Wiksells, 1938, p. 214.
Entropy Expressions for Images

Several alternative expressions are used, based on analogies with probability density functions and spectra:

H = − Σ_k I_k ln I_k

H = Σ_k ln I_k

These expressions enforce positivity because as I_k → 0, H is a minimum (0 for the first form and −∞ for the second), with an infinite slope at I_k = 0, and they have negative second derivatives (they are concave) that inhibit oscillatory solutions.

If these latter properties are emphasized and the "fundamental" aspects of entropy are ignored, other functions can be concocted for use in image restoration. These are generally referred to as "penalty" functions designed to achieve certain results and may be ad hoc in nature.

Often the entropy expressions are written in terms of a default image M_k:

H = − Σ_k I_k ln(I_k/M_k)

H = Σ_k ln(I_k/M_k)
Application with Imperfect Data

Let

S(f) = true spectrum
Ŝ(f) = spectral estimate from entropy maximization
R̂(τ) = estimated values of the ACF from Ŝ(f)
R_d(τ) = measured (or estimated) values of the ACF from the data

Assume that the data values of the ACF are imperfect, perhaps with additive or other kinds of errors. Acceptable spectral estimates are those that are consistent with the data to within the errors, which we can measure with χ²:

χ² = Σ_τ [ R̂(τ) − R_d(τ) ]².

The maximum entropy spectral estimate is obtained by maximizing the combined quantity

J = H − χ²

with respect to the parameters of the spectral estimate (e.g., the coefficients of an AR model).
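This penalized maximization can be sketched numerically. The following is an illustrative toy, not the lecture's code: the frequency grid, lag set, noise level sigma, the entropy form H = Σ_k ln S_k, and the log-spectrum parametrization u_k = ln S_k (which enforces positivity) are all assumptions.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)

# Frequency grid for the one-sided spectrum and a handful of ACF lags
f = np.linspace(0.0, 0.5, 16)
df = f[1] - f[0]
taus = np.arange(8)
basis = 2.0 * np.cos(2.0 * np.pi * np.outer(taus, f)) * df  # R(tau) = 2 * sum_k S_k cos(2 pi f_k tau) df

# "Measured" ACF: flat true spectrum plus small errors
Rd = basis @ np.ones(f.size) + 0.01 * rng.standard_normal(taus.size)
sigma = 0.05  # assumed ACF measurement error

def neg_J(u):
    """-J = chi^2 - H, with S_k = exp(u_k) guaranteeing S > 0."""
    S = np.exp(u)
    chi2 = np.sum(((basis @ S - Rd) / sigma) ** 2)
    H = np.sum(u)  # entropy form H = sum_k ln S_k
    return chi2 - H

res = minimize(neg_J, np.zeros(f.size), method="BFGS")
S_fit = np.exp(res.x)  # spectral estimate consistent with the data
```

The fitted spectrum reproduces R_d to within the assumed errors, while the entropy term keeps the estimate positive and smooth.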
Features of these functions:
1. Infinite slope at I = 0 ⇒ positivity of solutions
2. Negative second derivative that mitigates oscillations
They can be viewed as "penalty functions."
See Narayan & Nityananda 1986 ARAA for a comparison of results
In generalized approaches, the entropy expression is treated more as a penalty function to drive certain results rather than as a fundamental quantity in and of itself.
3/26/15
16
Maximum Entropy Spectra of Red Noise Processes
1. Generate a realization of red noise in the spectral domain:
– Generate complex white Gaussian noise
– Shape it according to [S(f)]^{1/2}
– Inverse FFT to the time domain
2. Find the best fit autoregressive model by minimizing fitting error against the ‘order’ of the model
• AR model fitting is equivalent to maximizing entropy
3. Find the Fourier spectrum of the AR model
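The three steps above can be sketched as follows. This is a minimal illustration under assumptions: the series length, AR order, and spectral index are arbitrary, and Yule-Walker estimation stands in for the Burg algorithm.

```python
import numpy as np
from scipy.linalg import toeplitz, solve

rng = np.random.default_rng(0)
n = 4096

# 1. Red noise: shape complex white Gaussian noise by sqrt(S), inverse FFT
freqs = np.fft.rfftfreq(n, d=1.0)
amp = np.zeros_like(freqs)
amp[1:] = freqs[1:] ** (-1.0)  # sqrt(S) for S(f) ~ f^-2
spec = (rng.standard_normal(freqs.size) + 1j * rng.standard_normal(freqs.size)) * amp
x = np.fft.irfft(spec, n)
x -= x.mean()

# 2. Fit an AR(p) model via the Yule-Walker equations on the sample ACF
p = 8
r = np.array([np.dot(x[: n - k], x[k:]) / n for k in range(p + 1)])
alpha = solve(toeplitz(r[:p]), r[1 : p + 1])  # AR coefficients
sig2 = r[0] - alpha @ r[1 : p + 1]            # innovation variance

# 3. Fourier spectrum of the AR model
def ar_spectrum(f):
    z = np.exp(-2j * np.pi * f * np.arange(1, p + 1))
    return sig2 / np.abs(1.0 - alpha @ z) ** 2
```

As expected for a steep red-noise process, the resulting AR spectrum is smooth and much larger at low frequencies than near the Nyquist frequency.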
[Figure: time series x(t) and spectral estimates vs. frequency bin for S(f) ∝ f^0. Black: generated spectrum; blue: periodogram; red: AR spectrum.]
[Figure: time series and spectral estimates for S(f) ∝ f^{-1}. Black: generated spectrum; blue: periodogram; red: AR spectrum.]
[Figure: time series and spectral estimates for S(f) ∝ f^{-2}. Black: generated spectrum; blue: periodogram; red: AR spectrum.]
[Figure: time series and spectral estimates for S(f) ∝ f^{-3}. Black: generated spectrum; blue: periodogram; red: AR spectrum.]
[Figure: time series and spectral estimates for S(f) ∝ f^{-4}. Black: generated spectrum; blue: periodogram; red: AR spectrum.]
[Figure: time series and spectral estimates for S(f) ∝ f^{-5}. Black: generated spectrum; blue: periodogram; red: AR spectrum.]
[Figure: time series and spectral estimates for S(f) ∝ f^{-6}. Black: generated spectrum; blue: periodogram; red: AR spectrum.]
[Figure: summary grid of time series and spectra for S(f) ∝ f^{-1} through f^{-6}. Black: generated spectrum; blue: periodogram; red: AR spectrum.]
On the Applicability of Maximum Entropy Spectral Estimate:
The spectral estimator:
Makes use of the 2M + 1 values of the covariance function that are known or estimated. There is no choice in M. If ρ_M → 0 then the spectral estimate will reflect that (Jaynes says that the corresponding Lagrange multiplier will be zero).
The AR Approach:
M appears to be a parameter that must be chosen according to some criterion.
Reconciliation:
Jaynes is correct so long as the expression used for the entropy is correct. It may not be in some cases. If it is OK, then simply use the 2M + 1 known values. If the entropy expression is not applicable, then one must view the situation as one where an AR model is being fitted to the data, with M an unknown parameter. The problem reduces to finding (i) the order of the AR process and (ii) the coefficients.
Estimates for AR coefficients

For all the nitty-gritty details of the calculation of AR coefficients, see Ulrych and Bishop, "Maximum Entropy Spectral Analysis and Autoregressive Decomposition," 1975, Rev. Geophys. and Space Phys., 13, 183. There are Fortran listings for the Yule-Walker and Burg algorithms for estimating coefficients. See also Numerical Recipes.
Two problems remain:
1. How does one calculate the order of the AR model?
2. What are the estimation errors of the spectral estimate?
The order of the AR Model can be estimated by looking at the prediction error as a function of
order M .
With N = number of data points and M = order of the AR process (or of a closely related prediction-error filter), evaluate the "final prediction error"

(FPE)_M = [ (N + (M + 1)) / (N − (M + 1)) ] × Σ_t (x_t − x̂_t)²,

where the first factor increases as M increases and the second factor (the prediction-error power) decreases. The order M is chosen as the one that minimizes the FPE (the Akaike criterion).
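The order selection can be sketched with the Levinson-Durbin recursion, which yields the prediction-error power at every order in one pass. This is an illustrative sketch, not the lecture's code: the use of the exact AR(1) ACF as input and the value of N are assumptions.

```python
import numpy as np

def levinson(r, M):
    """Levinson-Durbin: prediction-error power for AR orders 0..M given ACF r[0..M]."""
    a = np.zeros(M + 1)
    a[0] = 1.0
    err = r[0]
    errs = [err]
    for m in range(1, M + 1):
        acc = r[m] + np.dot(a[1:m], r[m - 1 : 0 : -1])
        k = -acc / err                       # reflection coefficient
        a_prev = a.copy()
        a[1 : m + 1] = a_prev[1 : m + 1] + k * a_prev[m - 1 :: -1]
        err *= 1.0 - k * k                   # prediction-error power shrinks with m
        errs.append(err)
    return np.array(errs)

# Exact ACF of an AR(1) process with coefficient 0.7, so the true order is 1
r = 0.7 ** np.arange(11)
N = 1000                                      # assumed number of data points
errs = levinson(r, 10)
M = np.arange(11)
fpe = errs * (N + (M + 1)) / (N - (M + 1))    # (FPE)_M as defined above
best_order = int(np.argmin(fpe))
```

The first factor grows with M while the error power shrinks and then plateaus, so the FPE bottoms out at the true order.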
FPE Examples for Red Processes with Power-Law Spectra
[Figure: final prediction error vs. AR order, with the corresponding time series and spectra, for S(f) ∝ f^0.]
[Figure: final prediction error vs. AR order, with the corresponding time series and spectra, for S(f) ∝ f^{-2}.]
Application of MEM
• Sinusoids + noise
• Noise only
• Δt = 0.01 yr
• Nyquist frequency f_N = 50 cy yr⁻¹

Points:
• MEM can give much better performance than the FFT-based power spectrum
• Using the wrong AR order, however, can give spurious results
[Figures: FFT power spectrum; MEM spectrum with the AR order determined empirically.]
[Figures: MEM spectra with the AR order forced to the indicated values.]
[Figure: MEM spectrum of timing residuals of a millisecond pulsar.]
Maximum Likelihood Spectral Estimation (MLSE)
MLSE is a misnomer; a better name is High Resolution Method because the method is derived
by explicitly maximizing the sensitivity to a given frequency while minimizing the effects (i.e.
leakage, a.k.a. bias) from other frequencies.
The MLSE was developed in the 1960s by Capon to analyze data from arrays of sensors to
maximize the response to one particular direction and minimize the response to others.
e.g. LASA = Large Aperture Seismic Array (test earthquakes vs. underground nuclear tests).
There is a close relationship to beam forming in acoustic arrays and beam forming in radio
interferometric arrays.
In the original development of the method discussed by Capon1
the spectral estimator is very
closely related to a filter that gives the ML estimate of a signal when it is corrupted by Gaussian
noise:
S + N −→ A −→ Ŝ
1see “Nonlinear Methods of Spectral Analysis”, Haykin, ed. pp. 154-179
This system involves:
a) a filter that gives the ML estimate of the signal when corrupted by Gaussian noise is also
. . .
b) the filter that generally gives the minimum variance and unbiased estimate of the signal
for arbitrary noise and . . .
c) has coefficients that yield an unbiased, high resolution spectral estimate for any signal.
The way the filter coefficients are derived (i.e. the constraints applied to the maximization
problem) imply that the spectral estimate minimizes leakage.
The HRM is sometimes described as a positive constrained reconstruction method which minimizes leakage.

Thus, the intent of the MLSE technique is much different from that of the MESE technique:

MLSE minimizes variance and bias (recall how spectral bias was related to resolution)

MESE in effect (via its relationship to prediction filters) tries to maximize resolution
We will derive the ML spectral estimate following the derivation of Lacoss.
Method: Construct a linear filter that
1. yields an unbiased estimate of a sinusoidal signal and
2. minimizes the variance of the output with respect to corrupting white noise.
Pass a signal y_n through a linear filter:

y_n −→ [a_k] −→ x_n

x_n = Σ_{k=1}^{N} a_k y_{n−k+1}   (causal)

where the input is of the form of a deterministic sinusoid added to zero-mean noise having an arbitrary spectrum:

y_n = A e^{iΩn} + n_n.

We will determine the coefficients a_k by invoking the above two criteria.
Goal: We want the filter to pass A e^{iΩn} undistorted but to reject the noise as much as possible. Thus, we require

1. No bias (in the mean):

⟨x_n⟩ ≡ ⟨ Σ_{k=1}^{N} a_k y_{n−k+1} ⟩
= Σ_{k=1}^{N} a_k [ A e^{iΩ(n−k+1)} + ⟨n_{n−k+1}⟩ ]
= Σ_{k=1}^{N} a_k A e^{iΩ(n−k+1)}
≡ A e^{iΩn}   (if no bias)

⇒ Σ_{k=1}^{N} a_k e^{iΩ(1−k)} = 1   (constraint equation)
This can be written in matrix form, using '†' to denote transpose conjugate:

ε†a = 1,   ε ≡ [1, e^{iΩ}, e^{i2Ω}, ..., e^{i(N−1)Ω}]ᵗ,   a = [a_1, a_2, ..., a_N]ᵗ
3/26/15
29
2. Minimum variance of the filter output:

σ² ≡ ⟨ |x_n − ⟨x_n⟩|² ⟩
= ⟨ | Σ_k a_k y_{n−k+1} − A e^{iΩn} |² ⟩
= ⟨ | Σ_k a_k [ A e^{iΩ(n−k+1)} + n_{n−k+1} ] − A e^{iΩn} |² ⟩   (the signal terms cancel by the constraint in 1.)
= ⟨ | Σ_k a_k n_{n−k+1} |² ⟩
= Σ_k Σ_{k′} a_k ⟨ n_{n−k+1} n*_{n−k′+1} ⟩ a*_{k′}
= a† C a,

where C is the covariance matrix of the noise, n.
3. Minimize σ2 w.r.t. a and subject to the constraint ε†a = 1.
By minimizing σ2 subject to the constraint, we get the smallest error and no bias.
Therefore we minimize L with respect to a:
L = σ2 + λ(ε †a− 1) = a† Ca + λ(ε †a− 1)
We can take ∂L/∂Re(a_j) and ∂L/∂Im(a_j) separately to derive equations for a, then recombine these equations to get

a† C + λ ε† = 0.

This is the same as we get by taking

∇_a L ≡ ∂L/∂a = (∂/∂a)(a† C a) + λ (∂/∂a)(ε† a) = a† C + λ ε† = 0   for a = a_0.
The solution for a_0 is

a_0† C = −λ ε†  ⇒  C† a_0 = −λ* ε  ⇒  a_0 = −λ* (C†)⁻¹ ε.

Now substitute back into the constraint equation ε†a_0 = 1 (the no-bias relation) to get

ε†a_0 = −λ* ε†(C†)⁻¹ε = 1,   or   −λ* = 1 / [ ε†(C†)⁻¹ε ]

⇒ a_0 = (C†)⁻¹ε / [ ε†(C†)⁻¹ε ].

Note that the denominator is real (a quadratic form): [ε†(C†)⁻¹ε]† = ε†C⁻¹ε.
4. Minimum variance: Substitute a_0 back into the expression for σ² to find the minimum variance:

σ²_min ≡ a_0† C a_0
= [ (C†)⁻¹ε / (ε†(C†)⁻¹ε) ]† C [ (C†)⁻¹ε / (ε†(C†)⁻¹ε) ]
= (ε†C⁻¹ε) / [ (ε†C⁻¹ε)(ε†(C†)⁻¹ε) ]
= 1 / (ε†C⁻¹ε)

σ²_min = 1 / (ε† C⁻¹ ε)

This is the power in the noise components with the same frequency as the signal, Ω. (Note that we have used the Hermitian relation C† ≡ C.)
Interpretation:
1. σ2min = portion of noise that leaks through the filter, which is attempting to estimate a
sinusoid corrupted by the noise.
2. Note that the filter coefficients and σ2min are functions of Ω and of the noise covariance
matrix. But they do not depend on the amplitude of the sinusoid.
3. The trick: now take away the signal but keep the noise. We allow Ω to vary across a range of frequencies we are interested in. Then σ²_min(Ω) is a spectral estimate for the noise spectrum (which was left arbitrary).

4. ⇒ maximum likelihood spectral estimator

S_ML(f) = 1 / (ε† C⁻¹ ε)   with Ω = 2πfΔτ
Further comments:
1. As used, the covariance matrix C is an ensemble-average quantity. Applications to actual data require use of some estimate for the covariance matrix.
2. The derivation is for equally spaced data.
3. The spectral estimate should work well on processes with steep power-law spectra because the estimator is derived explicitly to minimize bias.
10
Data-adaptive aspect of the MLSE spectral estimator:

Recall that the Fourier-transform based estimator has a fixed window. The MLSE has a data-adaptive spectral window, as we will show.

The filter coefficients are a function of the frequency of the sinusoid, Ω:

a_0(Ω) = C⁻¹ε(Ω) / [ ε†(Ω) C⁻¹ ε(Ω) ]

As Ω is varied, the coefficients a_0 vary, subject to the normalization constraint ε†a_0 = 1.

For a given Ω, which labels the frequency component we are attempting to estimate, what is the response to other frequencies, ω?

Define the window function

W(ω, Ω) = a_0†(Ω) ε(ω)

as the response to frequency ω of a filter designed to pass the frequency Ω. The window function satisfies the normalization

W(Ω, Ω) ≡ 1.

The equivalent quantity for a Fourier transform estimator might be

W(ω, Ω) = sin[(ω − Ω)T/2] / [(ω − Ω)T/2].
Simulating the HRM
Generate a process with a specified noise + signal spectrum, or just noise with an arbitrary spectrum, by passing white noise through a linear filter.

white noise −→ h(t) −→ x(t)

From one or more realizations of x(t), estimate the autocovariance and put it in the form of a covariance matrix, C.

For each frequency of interest (Ω), calculate the MLM/HRM filter coefficients

a_0 = C⁻¹ε / (ε† C⁻¹ ε).

Calculate the power-spectrum estimate as

S(Ω) = 1 / (ε† C⁻¹ ε).

The window function can be calculated as

W(ω, Ω) = a_0†(Ω) ε(ω).
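The recipe above can be sketched numerically. As an illustrative assumption (standing in for a covariance estimated from data), the covariance matrix below is the analytic AR(1) form C_ij = ρ^|i−j|, and the lag count is arbitrary.

```python
import numpy as np

M = 16          # number of lags retained in the covariance matrix
rho = 0.7       # assumed AR(1)-like correlation, standing in for an estimated ACF

# Toeplitz covariance matrix C_ij = rho^|i-j| and its inverse
C = rho ** np.abs(np.subtract.outer(np.arange(M), np.arange(M)))
Cinv = np.linalg.inv(C)

def steering(f, dtau=1.0):
    """The vector epsilon for frequency f (Omega = 2 pi f dtau)."""
    return np.exp(2j * np.pi * f * dtau * np.arange(M))

def capon_spectrum(f):
    """S(Omega) = 1 / (eps^dagger C^-1 eps)."""
    eps = steering(f)
    return 1.0 / np.real(eps.conj() @ Cinv @ eps)

def window(omega, Omega):
    """W(omega, Omega) = a0^dagger(Omega) eps(omega), a0 = C^-1 eps / (eps^dagger C^-1 eps)."""
    a0 = Cinv @ steering(Omega) * capon_spectrum(Omega)
    return a0.conj() @ steering(omega)
```

For this red covariance the estimate is largest at f = 0, and W(Ω, Ω) = 1 by construction of the no-bias constraint.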
Appendix
• Derivation of the entropy expression in terms of the power spectrum for a Gaussian process
• Summary of mathematical derivation
• Heuristic derivation
Maximum Entropy: Power Spectrum (short approach)

So far we know how to calculate the entropy of a random variable in terms of its PDF. For a univariate Gaussian PDF we have

f_X(x) = (2πσ²)^{−1/2} e^{−x²/2σ²}

H = − ∫ dx f_X(x) ln f_X(x)
= (1/2) ln(2πσ²) + ⟨X²⟩/2σ²
= (1/2) [ ln(2πσ²) + 1 ]
= (1/2) ln(2πeσ²)

When we maximize the entropy subject to constraints (from data), we only care about terms in the entropy that depend on relevant parameters. Here the only parameter is σ, so the constant term does not matter. Notice that larger σ implies larger entropy, as we would expect for a measure of uncertainty.

When we maximize entropy, we may as well write it only in terms of the variance,

H ≈ (1/2) ln σ² + constant.
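The result H = (1/2) ln(2πeσ²) is easy to verify by Monte Carlo, since H = −⟨ln f_X(X)⟩. A quick check (the sample size and σ below are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 2.0
x = rng.normal(0.0, sigma, 1_000_000)

# H = -<ln f_X(X)>, estimated from samples of X
log_f = -0.5 * np.log(2.0 * np.pi * sigma**2) - x**2 / (2.0 * sigma**2)
H_mc = -log_f.mean()

H_exact = 0.5 * np.log(2.0 * np.pi * np.e * sigma**2)  # (1/2) ln(2 pi e sigma^2)
```

With 10^6 samples the two agree to a few parts in 10^3.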
Maximum Entropy Spectral Estimate
Multivariate Case:

Consider a real Gaussian random process x_k, k = 1, ..., N, whose correlation function for N lags can be written as an N × N covariance matrix C_x. For the zero-mean case,

C_x = [ ⟨x_1²⟩     ⟨x_1x_2⟩  ...  ⟨x_1x_N⟩
        ...        ⟨x_2²⟩    ...  ⟨x_2x_N⟩
        ...        ...       ...  ...
        ⟨x_Nx_1⟩   ...       ...  ⟨x_N²⟩ ]

Since the random process is continuous, we use the integral expression for the relative entropy (dependent on the coordinate system)

H = − ∫ dx f_x(x) ln f_x(x)

with

f_x(x) = [ (2π)^N det C_x ]^{−1/2} exp[ −(1/2)(x − μ)ᵗ C_x⁻¹ (x − μ) ],

which yields

H = (1/2) ln[ (2π)^N det C_x ] + (1/2) ⟨ (x − μ)ᵗ C_x⁻¹ (x − μ) ⟩.
We will

1. ignore the factor (2π)^N because it does not depend on C_x;
2. ignore the second term because it is a constant independent of C_x. This is equivalent to the constant term we found for the univariate case.
Example of a bivariate Gaussian:

C_x = [ σ_1²         σ_1σ_2ρ_12
        σ_1σ_2ρ_12   σ_2²       ],

C_x⁻¹ = (det C_x)⁻¹ [ σ_2²          −σ_1σ_2ρ_12
                      −σ_1σ_2ρ_12    σ_1²        ]

C_x⁻¹ (δx_1, δx_2)ᵗ = (det C_x)⁻¹ ( σ_2² δx_1 − σ_1σ_2ρ_12 δx_2,  −σ_1σ_2ρ_12 δx_1 + σ_1² δx_2 )ᵗ

Q ≡ ⟨ (δx_1  δx_2) C_x⁻¹ (δx_1, δx_2)ᵗ ⟩
= (det C_x)⁻¹ [ σ_2²⟨δx_1²⟩ − 2σ_1σ_2ρ_12⟨δx_1δx_2⟩ + σ_1²⟨δx_2²⟩ ]
= 2σ_1²σ_2²(1 − ρ_12²) / [ σ_1²σ_2²(1 − ρ_12²) ]
= 2.

Since this is a constant, we will ignore it in taking derivatives of H.
We therefore use the entropy expression

H ≡ (1/2) ln(det C_x).

Unfortunately, as N → ∞, H → ∞, as can be seen for the uncorrelated case where the covariance matrix is diagonal:

⇒ H = (1/2) ln Π_{j=1}^{N} σ_j² = (1/2) Σ_{j=1}^{N} ln σ_j².

Define an entropy rate as

h = lim_{N→∞} H/(N + 1)
= lim_{N→∞} (1/2) [1/(N + 1)] ln(det C_x)
= lim_{N→∞} (1/2) ln (det C_x)^{1/(N+1)}
= (1/2) ln [ lim_{N→∞} (det C_x)^{1/(N+1)} ]   (1)
Entropy in terms of the spectrum:

To get a maximum entropy estimate of a spectrum, we need an expression for the entropy in terms of the spectrum. There is no general relation between the spectrum and the entropy. For Gaussian processes, however, there is a relation. This is appropriate since a Gaussian process is the one with maximum entropy out of all processes with the same variance, and the spectrum is the variance per unit frequency, so this conceptual step is important. A relation exists¹ between the determinant of the covariance matrix and the spectrum, which is assumed to be bandlimited in (−f_N, f_N):

lim_{N→∞} (det C_x)^{1/(N+1)} = 2f_N exp[ (1/2f_N) ∫_{−f_N}^{f_N} df ln S_x(f) ].

The theorem depends on C_x being a Toeplitz matrix [matrix element C_ij depends only on (i − j)], i.e., on the process being WSS.
1An arcane proof exists in “Prediction-Error Filtering and Maximum-Entropy Spectral Estimation” in Non-Linear Methods of Spectral Analysis, Haykin ed.
Springer-Verlag 1979, see Appendix A, pp. 62-67. It is also given by Smylie et al. 1973, Meth. Comp. Phys. 13, 391.
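The determinant relation can be checked numerically for a process whose ACF and spectrum are both known in closed form. For an AR(1) correlation sequence r_k = a^|k| (with unit sampling, so 2f_N = 1), S(f) = (1 − a²)/|1 − a e^{−2πif}|² and ∫ ln S df = ln(1 − a²), so the limit should equal 1 − a². A sketch, with a and N chosen arbitrarily:

```python
import numpy as np

a, N = 0.5, 400

# Toeplitz covariance matrix C_ij = a^|i-j| of a unit-variance AR(1) process
C = a ** np.abs(np.subtract.outer(np.arange(N), np.arange(N)))

# (det C)^(1/(N+1)) computed via the log-determinant for numerical stability
sign, logdet = np.linalg.slogdet(C)
lhs = np.exp(logdet / (N + 1))

# Theorem: 2 f_N exp[(1/2 f_N) \int ln S df] = exp[ln(1 - a^2)] = 1 - a^2
rhs = 1.0 - a**2
```

At N = 400 the finite-N value already matches the limit to about one part in a thousand.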
Thus

h = lim_{N→∞} (1/2) ln (det C_x)^{1/(N+1)}
= (1/2) ln [ lim_{N→∞} (det C_x)^{1/(N+1)} ]
= (1/2) ln 2f_N + (1/4f_N) ∫_{−f_N}^{f_N} df ln S_x(f).

Ignoring the first, constant term, we have

h = (1/4f_N) ∫_{−f_N}^{f_N} df ln S_x(f)
Heuristic "derivation" of the entropy rate expression:

Another way of viewing this is as follows. In calculating a power spectrum we are concerned with a second-order moment, by definition. Consequently, we can assume that the random process under consideration is Gaussian because:

1. we are maximizing the entropy (subject to constraints), and

2. given the second moment, the process with largest entropy is a Gaussian random process.

Note that while this assumption is satisfactory for estimating the power spectrum (a second moment), it is not necessarily accurate when we consider the estimation errors of the spectral estimate, which depend on fourth-order statistics. If the central limit theorem can be invoked then, of course, the Gaussian assumption becomes a good one once again.
Imagine that the process under study is constructed by passing white noise through a linear filter whose system function is [S(f)]^{1/2}:

n(t) −→ [S_x(f)]^{1/2} −→ x(t)

Consequently, the Fourier transforms are related as

N(f) [S_x(f)]^{1/2} = X(f).

Now N(f) itself is a Gaussian random variable because it is the sum of GRVs. Therefore, X(f) is a GRV, and viewing it as a 1-D random variable we have that the entropy is

H(f) = (1/2) ln[2πeσ²(f)]

but

σ²(f) ≡ ⟨ |N(f) [S(f)]^{1/2}|² ⟩ = S(f) S_N.

Letting the white noise spectrum be S_N = 1, we have

σ²(f) = S(f)

and

H(f) = (1/2) ln[2πeS(f)].

Recall that white noise is uncorrelated between different frequencies:

⟨N(f) N*(f′)⟩ = S_N δ(f − f′).
Consequently, the information in different frequencies adds because of statistical independence, and therefore to get the total entropy we simply integrate (add variances):

H = ∫ df H(f) = (1/2) ∫ df ln[2πeS(f)]
Again we ignore additive constants, and we also consider the case where the signal is bandlimited in |f| ≤ f_N and sampled over a time [0, T]. Therefore the number of degrees of freedom is 2f_N T.

The entropy per degree of freedom is

h = H/(2f_N) = (1/4f_N) ∫_{−f_N}^{f_N} df ln S(f)
The derivation of the ME spectrum follows the logical flow:

entropy in terms of the power spectrum S(f)
⇓
maximize H subject to constraints: ⟨C⟩ known ⟷ (F.T.) S(f)
⇓
ME spectral estimator

van den Bos shows that extrapolating the covariance function to larger lags while maximizing entropy yields a matrix equation that is identical to that obtained by fitting an autoregressive model to the data. This implies that the two procedures are mathematically identical.
Maximum Entropy Spectral Estimator

By maximizing the entropy rate expressed in terms of the spectrum, the spectral estimate can be written as (e.g., Edward & Fitelson 1973)

S(f) = (1/f_N) α_0² / | εᵗ C⁻¹ δ |²,

where C is the Toeplitz covariance matrix, which applies to WSS processes, and

ε = [ 1, e^{2πifΔτ}, ..., e^{2πifMΔτ} ]ᵗ,   δ = [ 1, 0, ..., 0 ]ᵗ,

C = [ C_00  C_01  ...  C_0M
      ...
      C_M0  ...   ...  C_MM ]

Toeplitz ⇒ C ≡ [ C_0  C_1  ...  C_M
                 ...  C_0
                 C_M  ...  ...  C_0 ]

Let

γ ≡ C⁻¹ = [ γ_00  ...  γ_0M
            ...
            γ_M0  ...  γ_MM ]
Then

εᵗ C⁻¹ δ = Σ_{j=0}^{M} γ_j0 e^{2πifΔτ j}

and

S(f) = (1/f_N) α_0² / | Σ_{j=0}^{M} γ_j0 e^{2πifΔτ j} |².

By rewriting the sum and redefining the constants this can be written

S(f) = (α_0²/f_N) | γ_00 + Σ_{j=1}^{M} γ_j0 e^{2πifΔτ j} |^{−2}
     = (α_0²/f_N γ_00²) | 1 + Σ_{j=1}^{M} (γ_j0/γ_00) e^{2πifΔτ j} |^{−2}
Thus, the ME spectral estimate can be put into the form

S(f) = P_M / | 1 + Σ_{j=1}^{M} α_j e^{2πifjΔτ} |²,

where P_M is a constant that properly normalizes the spectrum.

This is the same spectrum as for an M-th order AR process that can be fitted to the data, where the coefficients are determined by least squares.
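The equivalence can be checked numerically: for an AR(1) covariance, the matrix form 1/|εᵗC⁻¹δ|² is exactly proportional to the AR form 1/|1 − ρ e^{2πif}|². A small sketch, with ρ, M, and the frequency grid chosen arbitrarily:

```python
import numpy as np

rho, M = 0.6, 3

# Toeplitz covariance of a unit-variance AR(1) process, C_ij = rho^|i-j|
C = rho ** np.abs(np.subtract.outer(np.arange(M + 1), np.arange(M + 1)))
gamma = np.linalg.inv(C)[:, 0]          # first column of C^-1, i.e. C^-1 delta

def S_mem(f):
    """Matrix-form ME estimate (up to normalization): 1/|eps^t C^-1 delta|^2."""
    eps = np.exp(2j * np.pi * f * np.arange(M + 1))
    return 1.0 / np.abs(eps @ gamma) ** 2

def S_ar1(f):
    """AR(1) spectrum shape: 1/|1 - rho e^{2 pi i f}|^2."""
    return 1.0 / np.abs(1.0 - rho * np.exp(2j * np.pi * f)) ** 2

fgrid = np.linspace(0.0, 0.5, 11)
ratio = np.array([S_mem(f) / S_ar1(f) for f in fgrid])
```

The ratio is constant in f, i.e. the two forms agree up to an overall normalization, as claimed.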
Spectrum of an AR Process:

Consider the following M-th order AR process:

x_t = a_t [white noise] + Σ_{j=1}^{M} α_j x_{t−j} [autoregressive part]

A zeroth-order process would be x_t = a_t (i.e., white noise). Scargle would term the above definition a causal AR process. An acausal or two-sided process would allow negative values of j in the sum on the RHS.

The coefficients α_j, j = 1, ..., M are the AR coefficients. In fitting an AR model to the data, one must determine the order M as well as the M coefficients.
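Simulating such a process is a one-line recursive filter. A sketch with an AR(1) (the value α = 0.7 and the series length are arbitrary), checking the sample lag-1 autocorrelation against the Yule-Walker value ρ_1 = α:

```python
import numpy as np
from scipy.signal import lfilter

rng = np.random.default_rng(0)
alpha = 0.7
a = rng.standard_normal(200_000)          # white-noise innovations a_t

# x_t = a_t + alpha * x_{t-1}  (causal AR(1))
x = lfilter([1.0], [1.0, -alpha], a)

# Sample lag-1 autocorrelation, which should be close to alpha
r1 = np.dot(x[1:], x[:-1]) / np.dot(x, x)
```

Note the denominator-polynomial convention: the AR recursion corresponds to filtering the innovations with 1/(1 − α z⁻¹).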
Define the DFTs

X(f) ≡ Σ_{t=0}^{N−1} x_t e^{−2πift/N},   A(f) ≡ Σ_{t=0}^{N−1} a_t e^{−2πift/N}.

Substituting the definition of the AR process, we have

X(f) = Σ_{j=1}^{M} α_j X(f) e^{−2πifj/N} + A(f),

and, solving for X(f),

X(f) = A(f) / [ 1 − Σ_{j=1}^{M} α_j e^{−2πifj/N} ].
The spectrum is then

S(f) = |A(f)|² / | 1 − Σ_{j=1}^{M} α_j e^{−2πifj/N} |² ∝ 1 / | 1 − Σ_{j=1}^{M} α_j e^{−2πifj/N} |².

As is obvious, the AR spectrum has the same form as the maximum-entropy spectrum.