SIO239 Conductivity of the Deep Earth
Iterative Inverse Modeling
Sparsely Parameterized Problems
Bob has shown how one can compute the MT response function one would expect to observe over a one-dimensional (1D) earth. In general, processes such as this are called forward modeling, and can be written as

d = f(x, m) (1)

where d is the predicted response (i.e. the apparent resistivities and phases, or complex c); let's say we have M of them:

d = (d₁, d₂, d₃, ..., d_M) (2)

and m is the model. Even though we've described the earth as 1D, we note that m can have infinite dimension, in that the conductivity could be a continuous function of depth. Thus f is also sometimes called the forward functional, because it maps m onto a single data point.
Bob will show you how to handle 1D MT as an infinite dimensional problem, but in practice 1D models are often constructed from a stack of layers (say, L of them), described using a vector of layer resistivities and thicknesses

m = (ρ₁, ρ₂, ..., ρ_L, t₁, t₂, ..., t_{L−1})
  = (m₁, m₂, ..., m_N) (3)

where N = 2L − 1. The x are independent data variables (in our case, frequencies or periods) associated with the predicted responses:

x = (x₁, x₂, x₃, ..., x_M) . (4)
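Since everything that follows treats f as a black box, it may help to see one concrete realization. Below is a minimal sketch in Python/NumPy of a 1D MT forward model based on the standard Schmucker-Weidelt recursion for the complex c-response; the function name and interface are my own illustration, not part of these notes.

import numpy as np

MU0 = 4e-7 * np.pi  # magnetic permeability of free space

def mt1d_forward(freqs, rho, t):
    # rho: L layer resistivities (ohm-m), t: L-1 thicknesses (m),
    # freqs: frequencies (Hz); returns apparent resistivity and phase (deg)
    omega = 2.0 * np.pi * np.asarray(freqs, dtype=float)
    sigma = 1.0 / np.asarray(rho, dtype=float)
    rho_a = np.empty_like(omega)
    phase = np.empty_like(omega)
    for i, w in enumerate(omega):
        k = np.sqrt(1j * w * MU0 * sigma)    # complex wavenumber per layer
        c = 1.0 / k[-1]                      # start in the basement half-space
        for j in range(len(t) - 1, -1, -1):  # recurse from bottom layer up
            th = np.tanh(k[j] * t[j])
            c = (k[j] * c + th) / (k[j] * (1.0 + k[j] * c * th))
        rho_a[i] = w * MU0 * np.abs(c) ** 2              # apparent resistivity
        phase[i] = np.degrees(np.angle(1j * w * MU0 * c))  # impedance phase
    return rho_a, phase

With d taken as the stacked apparent resistivities and phases at the frequencies x, this plays the role of f(x, m) in (1).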
Forward modeling is fine for understanding how MT sounding curves will look over various structures, and can even be used for fitting data by trial-and-error guessing of layer resistivities and thicknesses, but what we'd really like is a scheme for finding a model m given a set of observed data d:

d = (d₁, d₂, d₃, ..., d_M) (5)

such that

d = f(x, m) . (6)
In practice, it is unlikely that any model exists that will fit the data perfectly, and it
is almost certainly true that your parameterization does not capture the real world.
Even if the earth were indeed constructed exactly from a stack of layers, the data will have errors associated with them:

σ = (σ₁, σ₂, ..., σ_M) (7)

preventing an exact fit for any but the sparsest data sets (note, however, that this is only true of nonlinear forward models). What we can do is generate a measure of how well a model m fits the data, and find some way to reduce, or even minimize, this misfit. For practical and theoretical reasons, the sum-squared misfit is a favored measure:
χ² = Σ_{i=1}^{M} (1/σᵢ²) [dᵢ − f(xᵢ, m)]² (8)

which equivalently may be written as

χ² = ||Wd − Wf(m)||² = ||W(d − d̂)||² (9)

where d̂ = f(x, m) is the vector of predicted responses and W is a diagonal matrix of reciprocal data errors

W = diag(1/σ₁, 1/σ₂, ..., 1/σ_M) . (10)
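As a concrete aside, the misfit of (8)-(10) is only a few lines of code; a minimal sketch, with d_obs, d_pred, and sigma as hypothetical arrays of observations, predictions, and errors:

import numpy as np

def chi_squared(d_obs, d_pred, sigma):
    # eq. (8): sum over i of (d_i - f(x_i, m))^2 / sigma_i^2,
    # i.e. ||W(d - d_hat)||^2 with W = diag(1/sigma) as in (9)-(10)
    r = (d_obs - d_pred) / sigma
    return float(r @ r)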
The least squares (LS) approach attempts to minimize χ² with respect to all the model parameters simultaneously. If the data errors σᵢ are Gaussian and independent, then least squares provides a maximum likelihood and unbiased estimate of m, and χ² is chi-squared distributed with M − N degrees of freedom. We see that parameterized LS requires that N be less than M; in practice, N ≪ M.
At this point the error in |Z| could be expressed as a fraction or percentage, but the phase errors must stay absolute, since the phase may go through zero.
The estimation procedure you used to get the complex impedances should have provided some sort of statistical error estimate, which needs to be propagated into amplitude and phase as above. If the statistical error estimate does not provide a full accounting of the uncertainties in the data (which is often the case), then when you increase it you should maintain the appropriate proportion between amplitude and phase: a fractional amplitude error of δ corresponds to a phase error of about δ radians, so 10% in amplitude corresponds to 5.7° in phase. Note, however, that the apparent resistivity is proportional to |Z|², so its fractional error is twice that of |Z|:

δρₐ/ρₐ = 2 δ|Z|/|Z| (14)

and so for apparent resistivity and phase the proportions are 10% in apparent resistivity and 2.9° in phase.
One should also note that sum-squared misfit measures are particularly unforgiving of outliers.
Most practical approaches to the inverse problem usually involve linearizing the forward solution f around an initial model guess m₀:

f(m₀ + Δm) = f(m₀) + JΔm + O(Δm²) (15)

where J is the matrix of partial derivatives of the data with respect to the model parameters

Jᵢⱼ = ∂f(xᵢ, m₀)/∂mⱼ (16)

(often called the Jacobian matrix) and

Δm = (Δm₁, Δm₂, ..., Δm_N) (17)

is a model parameter perturbation about m₀. Now our expression for χ² is

χ² ≈ ||W(d − f(m₀)) − WJΔm||² (18)
where the approximation is a recognition that we have dropped the higher order terms. We will proceed on the assumption that this linear approximation is a good one. (More on that later.) We can minimize χ² in the usual way by setting the derivatives of χ² with respect to Δm equal to zero:

∇χ² = −2(WJ)ᵀ[W(d − f(m₀)) − WJΔm] = 0 (19)

and solving for Δm:

Δm = [(WJ)ᵀWJ]⁻¹(WJ)ᵀ[W(d − f(m₀))] (20)

which may equivalently be written as N simultaneous equations:

β = αΔm (21)
where

β = (WJ)ᵀW(d − f(m₀)) (22)

α = (WJ)ᵀWJ . (23)

The matrix α is sometimes called the curvature matrix. This system can be solved for Δm by inverting α numerically. If the forward model f is truly linear, then the model m = m₀ + Δm is the least squares solution.
For non-linear problems a second model m₁ = m₀ + Δm is found and the process repeated, in the hope that it converges to a solution. Note that because J depends on m, J will need to be re-computed at every iteration. This is the general form of the Gauss-Newton approach to model fitting.
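A single Gauss-Newton update of (20)-(23) can be sketched as follows (solving the normal equations rather than explicitly inverting α; numpy assumed imported as np):

def gauss_newton_step(J, W, d, f0):
    # eq. (20)-(23): solve alpha @ dm = beta for the model update dm
    WJ = W @ J
    beta = WJ.T @ (W @ (d - f0))   # beta of eq. (22)
    alpha = WJ.T @ WJ              # curvature matrix, eq. (23)
    return np.linalg.solve(alpha, beta)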
Near the least squares solution the Gauss-Newton method will work, but any significant nonlinearity is likely to result in failure unless you start close enough to a solution for the linearization to hold. Long, thin valleys in the χ² hyper-surface are common, and produce search directions which diverge radically from the direction of the minimum.
Another approach, which does not depend on how large the second-order terms in the expansion of f are, is based on the expansion of χ², rather than of f:

χ²(m₀ + Δm) = χ²(m₀) + Δmᵀ∇χ²(m₀) + O(Δm²) (24)

into which we can substitute our expression (19) for ∇χ² (setting Δm = 0) and drop the higher order terms:

χ²(m₀ + Δm) = χ²(m₀) − 2Δmᵀ(WJ)ᵀ[W(d − f(m₀))] . (25)
If we choose Δm = ε(WJ)ᵀ[W(d − f(m₀))] for some scalar ε, then we get

χ²(m₀ + Δm) = χ²(m₀) − 2ε ||(WJ)ᵀ[W(d − f(m₀))]||² (26)

and there will always be an ε that keeps reducing χ² (not indefinitely, of course; at some point the higher order terms will take over). Solutions of the form

Δm = ε(WJ)ᵀ[W(d − f(m₀))] (27)

go down the slope of the χ² hyper-surface, and are thus called steepest descent methods. Although they are guaranteed to reduce χ², as they approach a least squares solution ∇χ² → 0 and so does Δm, and the method becomes very inefficient.
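For contrast with the Gauss-Newton step sketched above, the steepest descent update of (27) is simply (the choice of ε left to the caller):

def steepest_descent_step(J, W, d, f0, eps):
    # eq. (27): step of size eps down the gradient of chi^2
    return eps * (W @ J).T @ (W @ (d - f0))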
An algorithm to compensate for this behaviour was suggested by Marquardt (1963). Consider a model perturbation given by

Δm = [λI + (WJ)ᵀWJ]⁻¹(WJ)ᵀ[W(d − f(m₀))] (28)

for another scalar, λ. When λ = 0, it is easy to see that this reduces to the Gauss-Newton step. When λ is large, it reduces to a steepest descent step with ε = 1/λ.
Referring to our earlier, compact notation (21-23), this is the same as increasing the diagonal terms of the curvature matrix by a factor (1 + λ):

α′ⱼₖ = αⱼₖ(1 + λ) for j = k (29)

α′ⱼₖ = αⱼₖ for j ≠ k . (30)

By adjusting λ to be large far from the solution, and small as we approach the minimum, we have a fairly stable method of determining a model composed of a small number of parameters.
The classic Marquardt algorithm is as follows (a code sketch of this loop appears after the more efficient variant below):

i) Choose a starting m₀ and a fairly small λ (say, 0.001). Compute the χ² of the starting model.

ii) Compute m₁ = m₀ + Δm and a new χ². If

a) the χ² decreased, keep the model, reduce λ by a factor of 10, and go to (iii);

b) the χ² increased, increase λ by a factor of 10 and go to (ii).

iii) If the changes in χ² or Δm are very small, stop. Otherwise, go to (ii).
The assumption in the classic Marquardt algorithm is that forward model calculations are expensive. If they are not, or if the cost of computing J is high compared with the forward solution, then a more efficient algorithm would be:

i) Choose a starting m₀ and a fairly small λ (say, 0.001). Compute the χ² of the starting model.

ii) Do a line search over λ to find the minimum χ² and the associated model.

iii) If the changes in χ² or Δm are very small, stop. Otherwise, go to (ii).
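A minimal sketch of the classic loop, reusing chi_squared from the sketch after (10) and assuming user-supplied callables forward(x, m) and jacobian(x, m) (the names are mine):

def marquardt(forward, jacobian, x, d, sigma, m0, lam=1e-3,
              max_iter=50, tol=1e-6):
    # classic Marquardt iteration, steps (i)-(iii) in the text
    W = np.diag(1.0 / sigma)
    m = np.asarray(m0, dtype=float)
    chi2 = chi_squared(d, forward(x, m), sigma)
    for _ in range(max_iter):
        WJ = W @ jacobian(x, m)
        beta = WJ.T @ (W @ (d - forward(x, m)))
        alpha = WJ.T @ WJ
        while True:  # step (ii): adjust lambda until chi^2 decreases
            dm = np.linalg.solve(alpha + lam * np.eye(len(m)), beta)
            chi2_new = chi_squared(d, forward(x, m + dm), sigma)
            if chi2_new < chi2:
                lam /= 10.0   # success: accept the step, relax the damping
                break
            lam *= 10.0       # failure: damp harder and retry
            if lam > 1e12:    # cannot improve; we are at a minimum
                return m, chi2
        converged = chi2 - chi2_new < tol  # step (iii)
        m, chi2 = m + dm, chi2_new
        if converged:
            break
    return m, chi2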
What about calculating those derivatives in J? One approach is to do this numerically by first differences. Forward or backward differences,

df/dx ≈ [f(x + h) − f(x)]/h ≈ [f(x) − f(x − h)]/h , (31)

have a leading error term that goes as

(h/2) |f″(x)| (32)

but a central difference,

df/dx ≈ [f(x + h) − f(x − h)]/(2h) , (33)
has an error term that goes as

(h²/6) |f‴(x)| . (34)
How to choose h? The tradeoff is to generate a computationally significant difference in f while not being so large that nonlinearities compromise the result. Usually something like 5% works well enough, but make sure you don't take a percentage of anything that can go through zero.
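A central-difference Jacobian along the lines of (33), with the 5% rule of thumb; a sketch assuming forward(x, m) returns a NumPy array and the parameters stay away from zero (e.g. logarithms of resistivity):

def jacobian_fd(forward, x, m, h_rel=0.05):
    # central differences, eq. (33): J[i, j] = df(x_i, m)/dm_j
    d0 = np.asarray(forward(x, m))
    J = np.empty((d0.size, m.size))
    for j in range(m.size):
        h = h_rel * abs(m[j])   # 5% of the parameter; see caveat above
        mp, mm = m.copy(), m.copy()
        mp[j] += h
        mm[j] -= h
        J[:, j] = (forward(x, mp) - forward(x, mm)) / (2.0 * h)
    return J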
Alternatively, one can do the derivative calculations analytically, which is usually computationally much faster and more accurate, but may take some math and coding. If you do, check the analytical results against central differences; if your derivatives are wrong, no gradient-based inverse method is going to converge.
Neither the resistivities nor the layer thicknesses can go negative in the real world, but believe me, they almost certainly will in iterative inversion schemes. There are non-negative least squares algorithms out there, but the easy fix for this is to parameterize them as logarithms, and adjust f and J accordingly.
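If p = ln m, the chain rule gives ∂f/∂pⱼ = mⱼ ∂f/∂mⱼ, so converting an existing Jacobian is one line; a sketch:

def log_jacobian(J, m):
    # chain rule for p = ln(m): scale column j of J by m_j;
    # the inversion then updates p, keeping rho and t positive
    return J * m[np.newaxis, :]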
The iterative parameterized LS approach may converge to a local minimum rather than the global minimum, and it may simply diverge unless you start reasonably close to a solution. One approach to this is to run lots of inversions using a fairly large random set of starting models, assuming the computation is fast enough.
For nonlinear problems that are truly parameterized, the Marquardt method is pretty hard to beat. It also works fairly well for problems where the number of degrees of freedom, given by M − N when the M data are truly independent, is large and the starting model provides a rough fit to the data. In practice, this means 1D models which consist of a small number of layers. Any attempt to increase the number of layers significantly, to approximate the realistic case of a continuously varying conductivity, will result in wildly fluctuating parameters followed by failure of the algorithm. This is because there will be columns of J that are small, reflecting the fact that large variations in a model parameter may have negligible effects on the combined data. However, when you limit the parameterization, you need to remember that your solution will be as much constrained by the parameterization as by the data.
Infinite-Dimensional Models
When one admits that the model space is infinite dimensional, then geophysical
inversion becomes non-unique for finite, noisy data. That means that a single
misfit (which is, after all, only a single scalar) will map into an infinite number of
models (or none). It is also poorly constrained, in that a small difference in misfit
can correspond to a large distance in model space. Even if you have an infinite
dimensional model space, the true minimum misfit is probably outside your chosen
space.
As it happens, the 1D MT sounding problem is one of the few (2?) geophysical
problems for which an analytical least squares solution exists, derived by Parker
and Whaler (1981) and called the D+ solution. Unfortunately, the D+ solution,
although best fitting in a least squares sense, is pathological (Bob will tell you
more). However, the misfit measure obtained from D+ can be a useful guide to data
quality.
So the true LS solution is too extreme to be useful, and sparse parameterizations limit the solution to a fixed, small number of layers. What to do? One approach, suggested by Backus and Gilbert (1967), is to allow a large number of parameters but minimize the size of the model perturbation ||Δm||. This and related algorithms converge extremely slowly, and are called by Parker (1994) creeping methods. In any case, we have seen that trying to find a true least squares solution is folly. Almost all high-dimensional inversion today incorporates some type of regularization, an approach suggested by Tikhonov and Arsenin (1977), which explicitly penalizes bad behavior in the model. For example, instead of minimizing χ², we minimize an unconstrained functional

U = ||Rm₁||² + μ⁻¹{ ||Wd − Wf(m₁)||² − χ*² } (35)

where χ*² is a target misfit that is greater than the minimum possible, but statistically acceptable, and ||Rm|| is some measure of roughness in the model, often taken to be first differences between adjacent model parameters and easily generated by a matrix R consisting of (−1, 1) entries about the diagonal. That is,

R = | −1   1   0   0   0  ...   0 |
    |  0  −1   1   0   0  ...   0 |
    |  0   0  −1   1   0  ...   0 |
    |            ...              |
    |  0  ...            −1   1   | . (36)
For a second difference roughness you can use

R₂ = R² = |  1  −2   1   0   0  ...   0 |
          |  0   1  −2   1   0  ...   0 |
          |  0   0   1  −2   1  ...   0 |
          |            ...              |
          |  0  ...        1  −2   1    | . (37)
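Both roughening matrices are easy to build; a sketch (note that R₂ of (37) is the product of two first-difference operators of compatible sizes):

def first_difference(n):
    # R of eq. (36): (n-1) x n matrix with rows (..., -1, 1, ...)
    R = np.zeros((n - 1, n))
    i = np.arange(n - 1)
    R[i, i] = -1.0
    R[i, i + 1] = 1.0
    return R

def second_difference(n):
    # R2 of eq. (37): (n-2) x n matrix with rows (..., 1, -2, 1, ...)
    return first_difference(n - 1) @ first_difference(n)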
We can weight the rows of R with layer thickness, depth, or whatever we desire.
We can even neglect them or zero them out (which amounts to the same thing) if
we want to allow an un-penalized jump to appear in the model.
Minimizing U has the effect of minimizing model roughness as well as how far the data misfit is from being acceptable. To do this we substitute our linearization around f(m₀) for a starting model m₀ (i.e. (15) with Δm expanded as m₁ − m₀) into (35):

U = ||Rm₁||² + μ⁻¹{ ||Wd − W[f(m₀) + J(m₁ − m₀)]||² − χ*² }
and differentiate U with respect to m₁, rearranging the result to get m₁ directly:

m₁ = [μRᵀR + (WJ)ᵀWJ]⁻¹(WJ)ᵀW(d − f(m₀) + Jm₀) .
We need only to choose the tradeoff (Lagrange) multiplier μ. The approach of Constable et al. (1987) was to note that at each iteration χ² is a function of μ, and to use 1D optimization (simply a line search) to minimize χ² when χ² > χ*², and to find μ such that χ² = χ*² otherwise. Constable et al. called this approach Occam's inversion. Although the Occam algorithm is reliable and has good convergence behavior, the computation and storage of J for large models can be limiting, but for 1D models there is no problem.
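One Occam model update for a trial μ might look like the sketch below; sweeping μ with a line search each iteration, as Constable et al. describe, is left to the caller:

def occam_update(J, W, R, d, f0, m0, mu):
    # m1 = [mu R^T R + (WJ)^T WJ]^{-1} (WJ)^T W (d - f(m0) + J m0)
    WJ = W @ J
    rhs = WJ.T @ (W @ (d - f0 + J @ m0))
    A = mu * (R.T @ R) + WJ.T @ WJ
    return np.linalg.solve(A, rhs)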
References:

Backus, G.E., and Gilbert, J.F., 1967. Numerical applications of a formalism for geophysical inverse problems. Geophysical Journal of the Royal Astronomical Society, 13, 247–276.

Constable, S.C., Parker, R.L., and Constable, C.G., 1987. Occam's inversion: a practical algorithm for generating smooth models from electromagnetic sounding data. Geophysics, 52, 289–300.

Marquardt, D.W., 1963. An algorithm for least-squares estimation of nonlinear parameters. Journal of the Society for Industrial and Applied Mathematics, 11, 431–441.

Parker, R.L., 1994. Geophysical Inverse Theory. Princeton, NJ, Princeton University Press.

Parker, R.L., and Whaler, K.A., 1981. Numerical methods for establishing solutions to the inverse problem of electromagnetic induction. Journal of Geophysical Research, 86, 9574–9584.

Tikhonov, A.N., and Arsenin, V.Y., 1977. Solutions of Ill-Posed Problems. New York, John Wiley and Sons.