
    SIO239 Conductivity of the Deep Earth

    Iterative Inverse Modeling

    Sparsely Parameterized Problems

Bob has shown how one can compute the MT response function one would expect to observe over a one-dimensional (1D) earth. In general, processes such as this are called forward modeling, and can be written as

    d = f(x, m) (1)

where d is the predicted response (i.e. the apparent resistivities and phases, or the complex c); let's say we have M of them:

$d = (d_1, d_2, d_3, \ldots, d_M)$   (2)

and m is the model. Even though we've described the earth as 1D, we note that m can have infinite dimension, in that the conductivity could be a continuous function of depth. Thus f is also sometimes called the forward functional, because it maps m onto a single data point.

Bob will show you how to handle 1D MT as an infinite-dimensional problem, but in practice 1D models are often constructed from a stack of layers (say, L of them), described using a vector of layer thicknesses and resistivities

$m = (\rho_1, \rho_2, \ldots, \rho_L, t_1, t_2, \ldots, t_{L-1}) = (m_1, m_2, \ldots, m_N)$   (3)

where $N = 2L - 1$. The x are independent data variables (in our case, frequencies or periods) associated with the predicted responses:

$x = (x_1, x_2, x_3, \ldots, x_M)$ .  (4)
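To make the forward functional concrete, here is a minimal sketch of f(x, m) for this layered parameterization in Python, assuming the standard plane-wave impedance recursion for a stack of uniform layers over a terminating halfspace; the function name, the argument layout, and the use of frequency in Hz are my choices for illustration, not something fixed by the text.

```python
import numpy as np

MU0 = 4.0e-7 * np.pi  # free-space magnetic permeability

def mt1d_forward(freqs, rho, t):
    """Apparent resistivity (ohm-m) and phase (degrees) over a 1D layered earth.

    freqs : frequencies in Hz (the independent variables x)
    rho   : layer resistivities in ohm-m, length L (last entry is the halfspace)
    t     : layer thicknesses in m, length L - 1
    """
    rho = np.asarray(rho, dtype=float)
    t = np.asarray(t, dtype=float)
    rho_a = np.empty(len(freqs))
    phase = np.empty(len(freqs))
    for i, f in enumerate(freqs):
        omega = 2.0 * np.pi * f
        k = np.sqrt(1j * omega * MU0 / rho)   # wavenumber of each layer
        Zi = 1j * omega * MU0 / k             # intrinsic impedance of each layer
        Z = Zi[-1]                            # start from the basal halfspace
        for n in range(len(t) - 1, -1, -1):   # recurse upward through the stack
            th = np.tanh(k[n] * t[n])
            Z = Zi[n] * (Z + Zi[n] * th) / (Zi[n] + Z * th)
        rho_a[i] = np.abs(Z) ** 2 / (omega * MU0)
        phase[i] = np.degrees(np.angle(Z))
    return rho_a, phase
```

Unpacking $m = (\rho_1, \ldots, \rho_L, t_1, \ldots, t_{L-1})$ into rho and t, this plays the role of f(x, m) in (1); a uniform halfspace should return $\rho_a = \rho$ and a 45° phase, which makes a handy sanity check.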

Forward modeling is fine for understanding how MT sounding curves will look over various structures, and can even be used for fitting data by trial-and-error guessing of layer resistivities and thicknesses, but what we'd really like is a scheme for finding a model m given a set of observed data d:

$d = (d_1, d_2, d_3, \ldots, d_M)$   (5)

    such that

    d = f(x, m) . (6)

    In practice, it is unlikely that any model exists that will fit the data perfectly, and it

    is almost certainly true that your parameterization does not capture the real world.


    Even if the earth were indeed constructed exactly from a stack of layers, the data

    will have errors associated with them:

$\sigma = (\sigma_1, \sigma_2, \ldots, \sigma_M)$   (7)

preventing an exact fit for any but the sparsest data sets (note, however, that this is only true of nonlinear forward models). What we can do is generate a measure of how well a model m fits the data, and find some way to reduce, or even minimize, this misfit. For practical and theoretical reasons, the sum-squared misfit is a favored measure:

$\chi^2 = \sum_{i=1}^{M} \frac{1}{\sigma_i^2} \left[ d_i - f(x_i, m) \right]^2$   (8)

    which equivalently may be written as

$\chi^2 = \| Wd - Wf(m) \|^2 = \| W(d - \hat{d}\,) \|^2$   (9)

    where W is a diagonal matrix of reciprocal data errors

$W = \mathrm{diag}(1/\sigma_1, 1/\sigma_2, \ldots, 1/\sigma_M)$ .  (10)
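As a small illustration of (8)-(10), assuming the observed data, the predicted response f(m), and the errors $\sigma_i$ are available as arrays (the helper name is mine):

```python
import numpy as np

def chi_squared(d, f_of_m, sigma):
    """Sum-squared misfit chi^2 = || W (d - f(m)) ||^2, equations (8)-(10)."""
    W = np.diag(1.0 / np.asarray(sigma))          # reciprocal data errors
    r = W @ (np.asarray(d) - np.asarray(f_of_m))  # weighted residuals
    return float(r @ r)
```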

The least squares (LS) approach attempts to minimize $\chi^2$ with respect to all the model parameters simultaneously. If the data errors $\sigma_i$ are Gaussian and independent, then least squares provides a maximum likelihood and unbiased estimate of m, and $\chi^2$ is chi-squared distributed with $M - N$ degrees of freedom. We see that parameterized LS requires that N be less than M. In practice, N


). At this point |Z| could be expressed as a fraction or percentage, but the phase errors must stay absolute, since phase may go through zero.

The estimation procedure you used to get the complex impedances should have provided some sort of statistical error estimate, which needs to be propagated into amplitude and phase as above. If the statistical error estimate does not provide a full accounting of the uncertainties in the data (which is often the case), when you increase it you should maintain the appropriate proportion between amplitude and phase. That is, 10% in amplitude corresponds to 5.7° in phase. Note, however, that

$\partial |Z|^2 / \partial\,\mathrm{Real}(Z) = 2\,\mathrm{Real}(Z)\,, \qquad \partial |Z|^2 / \partial\,\mathrm{Imag}(Z) = 2\,\mathrm{Imag}(Z)$   (14)

and so for apparent resistivity and phase the proportions are 10% in apparent resistivity and 2.9° in phase.
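Here is one way the propagation described above might be sketched, assuming a single standard error $\sigma_Z$ applies to both the real and imaginary parts of Z and that the errors are small; the function name and calling convention are illustrative only.

```python
import numpy as np

MU0 = 4.0e-7 * np.pi

def impedance_errors_to_rho_phase(Z, sigma_Z, freq):
    """Propagate a standard error on complex Z into apparent resistivity and phase."""
    omega = 2.0 * np.pi * freq
    rho_a = np.abs(Z) ** 2 / (omega * MU0)   # apparent resistivity
    phase_deg = np.degrees(np.angle(Z))
    rel = sigma_Z / np.abs(Z)                # fractional error in |Z|
    sigma_rho = 2.0 * rel * rho_a            # |Z|^2 doubles the fractional error
    sigma_phase_deg = np.degrees(rel)        # phase error stays absolute
    return rho_a, sigma_rho, phase_deg, sigma_phase_deg
```

With rel = 0.05 this reproduces the 10% and 2.9° pairing quoted above.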

    One should also note that sum-square misfit measures are particularly unforgiving

    of outliers.

Most practical approaches to the inverse problem usually involve linearizing the forward solution f around an initial model guess $m_0$:

$f(m_0 + \Delta m) = f(m_0) + J\,\Delta m + O(\Delta m^2)$   (15)

    where J is a matrix of partial derivatives of data with respect to the model parameters

$J_{ij} = \dfrac{\partial f(x_i, m_0)}{\partial m_j}$   (16)

    (often called the Jacobian matrix) and

$\Delta m = (\Delta m_1, \Delta m_2, \ldots, \Delta m_N)$   (17)

is a model parameter perturbation about $m_0$. Now our expression for $\chi^2$ is

$\chi^2 \approx \| W(d - f(m_0)) - WJ\,\Delta m \|^2$   (18)

    where the approximation is a recognition that we have dropped the higher order

    terms. We will proceed on the assumption that this linear approximation is a good

one. (More on that later.) We can minimize $\chi^2$ in the usual way by setting the derivatives of $\chi^2$ with respect to $\Delta m$ equal to zero:

$\nabla \chi^2 = -2(WJ)^T \left[ W(d - f(m_0)) - WJ\,\Delta m \right] = 0$   (19)

and solving for $\Delta m$

$\Delta m = \left[ (WJ)^T WJ \right]^{-1} (WJ)^T \left[ W(d - f(m_0)) \right]$   (20)

    which may be equivalently written as N simultaneous equations:

$\beta = \alpha\,\Delta m$   (21)


    where

$\beta = (WJ)^T W (d - f(m_0))$   (22)

$\alpha = (WJ)^T WJ$ .  (23)

The matrix $\alpha$ is sometimes called the curvature matrix. This system can be solved for $\Delta m$ by inverting $\alpha$ numerically. If the forward model f is truly linear, then the model $m = m_0 + \Delta m$ is the least squares solution.

For non-linear problems a second model $m_1 = m_0 + \Delta m$ is found and the process repeated in the hope that it converges to a solution. Note that because J depends on m, J will need to be re-computed every iteration. This is the general form of the Gauss-Newton approach to model fitting.
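A minimal sketch of one Gauss-Newton update assembled from equations (20)-(23), assuming the forward response and the Jacobian at $m_0$ have already been computed; the names and calling convention are mine.

```python
import numpy as np

def gauss_newton_step(d, f_m0, J, sigma, m0):
    """One Gauss-Newton update: solve alpha * dm = beta and return m0 + dm.

    d     : observed data, length M
    f_m0  : forward response f(x, m0), length M
    J     : Jacobian df/dm at m0, shape (M, N)
    sigma : data standard errors, length M
    """
    W = np.diag(1.0 / np.asarray(sigma))                      # equation (10)
    WJ = W @ J
    beta = WJ.T @ (W @ (np.asarray(d) - np.asarray(f_m0)))    # equation (22)
    alpha = WJ.T @ WJ                                         # equation (23)
    dm = np.linalg.solve(alpha, beta)                         # equation (21)
    return np.asarray(m0) + dm
```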

Near the least squares solution the Gauss-Newton method will work, but any significant nonlinearity will result in likely failure unless you start linearly close to a solution. Long, thin valleys in the $\chi^2$ hyper-surface are common, and produce search directions which diverge radically from the direction of the minimum.

Another approach, which does not depend on how large the second-order terms in the expansion of f are, is based on the expansion of $\chi^2$, rather than f:

$\chi^2(m_0 + \Delta m) = \chi^2(m_0) + \Delta m^T \nabla \chi^2(m_0) + O(\Delta m^2)$   (24)

into which we can substitute our expression (19) for $\nabla \chi^2$ (setting $\Delta m = 0$) and drop the higher order terms:

$\chi^2(m_0 + \Delta m) = \chi^2(m_0) - 2\,\Delta m^T (WJ)^T \left[ W(d - f(m_0)) \right]$ .  (25)

If we choose $\Delta m = \epsilon\,(WJ)^T \left[ W(d - f(m_0)) \right]$ for some scalar $\epsilon$ then we get

$\chi^2(m_0 + \Delta m) = \chi^2(m_0) - 2\epsilon\, \left\| (WJ)^T \left[ W(d - f(m_0)) \right] \right\|^2$   (26)

and there will always be an $\epsilon$ which keeps reducing $\chi^2$ (not always; at some point the higher order terms will take over). Solutions of the form

$\Delta m = \epsilon\,(WJ)^T \left[ W(d - f(m_0)) \right]$   (27)

go down the slope of the $\chi^2$ hyper-surface, and are thus called steepest descent methods. Although they are guaranteed to reduce $\chi^2$, as they approach a least squares solution $\nabla \chi^2 \to 0$ and so does $\Delta m$, and the method becomes very inefficient.
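For comparison, the steepest descent step (27) looks like this in the same (assumed) notation, with eps playing the role of the scalar $\epsilon$:

```python
import numpy as np

def steepest_descent_step(d, f_m0, J, sigma, m0, eps):
    """Equation (27): move down the chi^2 gradient by a small scalar eps."""
    W = np.diag(1.0 / np.asarray(sigma))
    WJ = W @ J
    return np.asarray(m0) + eps * (WJ.T @ (W @ (np.asarray(d) - np.asarray(f_m0))))
```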

An algorithm to compensate for this behaviour was suggested by Marquardt (1963). Consider a model perturbation given by

$\Delta m = \left[ \lambda I + (WJ)^T WJ \right]^{-1} (WJ)^T \left[ W(d - f(m_0)) \right]$   (28)

for another scalar, $\lambda$. When $\lambda = 0$ it is easy to see that this reduces to the Gauss-Newton step. When $\lambda$ is large, it reduces to a steepest descent step with $\epsilon = 1/\lambda$.


Referring to our earlier, compact notation (21-23), this is the same as increasing the diagonal terms of the curvature matrix by a factor $\lambda$:

$\alpha'_{jk} = \alpha_{jk}(1 + \lambda)$ for $j = k$   (29)

$\alpha'_{jk} = \alpha_{jk}$ for $j \neq k$ .  (30)

By adjusting $\lambda$ to be large far from the solution, and small as we approach the minimum, we have a fairly stable method of determining a model composed of a small number of parameters.

The classic Marquardt algorithm is as follows (a minimal code sketch of the update loop is given after the two outlines below):

i) Choose a starting $m_0$ and a fairly small $\lambda$ (say, 0.001). Compute the $\chi^2$ of the starting model.

ii) Compute $m_1$ and a new $\chi^2$. If

a) the $\chi^2$ decreased, keep the model, reduce $\lambda$ by a factor of 10, and go to (iii);

b) the $\chi^2$ increased, increase $\lambda$ by a factor of 10 and go to (ii).

iii) If the changes in $\chi^2$ or m are very small, stop. Otherwise, go to (ii).

The assumption in the classic Marquardt algorithm is that forward model calculations are expensive. If they are not, or if the cost of computing J is high compared with the forward solution, then a more efficient algorithm would be:

i) Choose a starting $m_0$ and a fairly small $\lambda$ (say, 0.001). Compute the $\chi^2$ of the starting model.

ii) Do a line search over $\lambda$ to find the minimum $\chi^2$ and the associated model.

iii) If the changes in $\chi^2$ or m are very small, stop. Otherwise, go to (ii).
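Here is a minimal sketch of the classic loop above, using the $\lambda I$ damping of (28); forward(m) and jacobian(m) are assumed user-supplied callables, and the convergence test is deliberately crude.

```python
import numpy as np

def marquardt(forward, jacobian, d, sigma, m0, lam=1e-3, max_iter=50, tol=1e-4):
    """Classic Marquardt iteration: damped Gauss-Newton steps, equation (28)."""
    m = np.asarray(m0, dtype=float)
    d = np.asarray(d, dtype=float)
    sigma = np.asarray(sigma, dtype=float)
    W = np.diag(1.0 / sigma)
    fm = forward(m)
    chi2 = np.sum(((d - fm) / sigma) ** 2)
    for _ in range(max_iter):
        WJ = W @ jacobian(m)
        rhs = WJ.T @ (W @ (d - fm))
        # equation (28): lambda*I damps the Gauss-Newton normal equations
        trial = m + np.linalg.solve(lam * np.eye(len(m)) + WJ.T @ WJ, rhs)
        f_trial = forward(trial)
        chi2_trial = np.sum(((d - f_trial) / sigma) ** 2)
        if chi2_trial < chi2:                      # (ii a): keep model, relax damping
            small = chi2 - chi2_trial < tol
            m, fm, chi2 = trial, f_trial, chi2_trial
            lam /= 10.0
            if small:                              # (iii): change in chi^2 is tiny
                break
        else:                                      # (ii b): increase damping, retry
            lam *= 10.0
    return m, chi2
```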

What about calculating those derivatives in J? One approach is to do this numerically by first differences. A forward or backward difference

$\dfrac{d}{dx} f(x) \approx \dfrac{f(x + h) - f(x)}{h} \approx \dfrac{f(x) - f(x - h)}{h}$   (31)

has a leading error term that goes as

$\dfrac{h}{2} \left| f''(x) \right|$   (32)

but a central difference

$\dfrac{d}{dx} f(x) \approx \dfrac{f(x + h) - f(x - h)}{2h}$   (33)


has an error term that goes as

$\dfrac{h^2}{6} \left| f'''(x) \right|$ .  (34)

How to choose h? The tradeoff is to generate a computationally significant difference in f while not being so large that nonlinearities compromise the result. Usually something like 5% works well enough, but make sure you don't take a percentage of anything that can go through zero.
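A sketch of building J column by column with the central difference (33), using a fractional step as suggested above; it assumes the parameters stay away from zero (for example because they are logarithms of resistivities and thicknesses), and the function name is mine.

```python
import numpy as np

def numerical_jacobian(forward, m0, frac=0.05):
    """Central-difference Jacobian J_ij = df_i/dm_j, equation (33)."""
    m0 = np.asarray(m0, dtype=float)
    f0 = np.asarray(forward(m0))
    J = np.empty((len(f0), len(m0)))
    for j in range(len(m0)):
        # fractional step; fall back to an absolute step for a zero parameter
        h = frac * abs(m0[j]) if m0[j] != 0.0 else frac
        mp, mm = m0.copy(), m0.copy()
        mp[j] += h
        mm[j] -= h
        J[:, j] = (np.asarray(forward(mp)) - np.asarray(forward(mm))) / (2.0 * h)
    return J
```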

    Alternatively, one can do the derivative calculations analytically, which is usually

    computationally much faster and more accurate but may take some math and

coding. If you do, check the analytical results against the central differences; if your derivatives are wrong, no gradient-based inverse method is going to converge.

    Neither the resistivities nor layer thicknesses can go negative in the real world,

    but believe me they almost certainly will in iterative inversion schemes. There

    are non-negative least squares algorithms out there, but the easy fix for this is to

    parameterize them as logarithms, and adjust f and J accordingly.
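For the log parameterization the adjustment is just the chain rule (a standard identity, stated here for natural logarithms, not something spelled out in the text): if the working parameter is $p_j = \ln m_j$, then

$\dfrac{\partial f}{\partial p_j} = m_j \dfrac{\partial f}{\partial m_j}$

so each column of J is simply scaled by the corresponding (positive) parameter value.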

    The iterative parameterized LS approach may converge to a local minimum rather

    than a global minimum, and it may not work (diverge) unless you start reasonably

    close to a solution. One approach to this is to run lots of inversions using a fairly

    large random set of starting models, assuming the computation is fast enough.

For nonlinear problems that are truly parameterized the Marquardt method is pretty hard to beat. It also works fairly well for problems where the number of degrees of freedom is large, given by $M - N$ when the M data are truly independent, and the starting model provides a rough fit to the data. In practice, this means 1D models

    which consist of a small number of layers. Any attempt to increase the number

    of layers significantly, to approximate the realistic case of a continuously varying

    conductivity, will result in wildly fluctuating parameters followed by failure of the

    algorithm. This is because there will be columns of J that are small, reflecting

    the fact that large variations in a model parameter may relate to negligible effects

    on the combined data. However, when you limit the parameterization you need to

    remember that your solution will be as much constrained by the parameterization

    as the data.

    Infinite-Dimensional Models

    When one admits that the model space is infinite dimensional, then geophysical

    inversion becomes non-unique for finite, noisy data. That means that a single

    misfit (which is, after all, only a single scalar) will map into an infinite number of

    models (or none). It is also poorly constrained, in that a small difference in misfit

    can correspond to a large distance in model space. Even if you have an infinite

    dimensional model space, the true minimum misfit is probably outside your chosen

    space.

    As it happens, the 1D MT sounding problem is one of the few (2?) geophysical


    problems for which an analytical least squares solution exists, derived by Parker

    and Whaler (1981) and called the D+ solution. Unfortunately, the D+ solution,

    although best fitting in a least squares sense, is pathological (Bob will tell you

    more). However, the misfit measure obtained from D+ can be a useful guide to data

    quality.

So the true LS solution is too extreme to be useful, and sparse parameterizations limit the solution to a fixed, small number of layers. What to do? One approach, suggested by Backus and Gilbert (1967), is to allow a large number of parameters but minimize $\|\Delta m\|$. This and related algorithms converge extremely slowly and

are called by Parker (1994) creeping methods. In any case, we have seen that trying to find a true least squares solution is folly. Almost all high-dimensional inversion today incorporates some type of regularization, an approach suggested by Tikhonov and Arsenin (1977), which explicitly penalizes bad behavior in the model. For example, instead of minimizing $\chi^2$, we minimize an unconstrained functional

$U = \| R m_1 \|^2 + \mu^{-1} \left\{ \| Wd - Wf(m_1) \|^2 - \chi_*^2 \right\}$   (35)

where $\chi_*^2$ is a target misfit that is greater than the minimum possible, but statistically acceptable, and $\|Rm\|$ is some measure of roughness in the model, often taken to be first differences between adjacent model parameters and easily generated by a matrix R consisting of (-1, 1) entries around the diagonal. That is,

$R = \begin{pmatrix} -1 & 1 & 0 & 0 & 0 & \cdots & 0 \\ 0 & -1 & 1 & 0 & 0 & \cdots & 0 \\ 0 & 0 & -1 & 1 & 0 & \cdots & 0 \\ & & & \ddots & \ddots & & \\ & & & & & -1 & 1 \end{pmatrix}$ .  (36)

    For a second difference roughness you can use

$R_2 = R^2 = \begin{pmatrix} 1 & -2 & 1 & 0 & 0 & \cdots & 0 \\ 0 & 1 & -2 & 1 & 0 & \cdots & 0 \\ 0 & 0 & 1 & -2 & 1 & \cdots & 0 \\ & & & \ddots & \ddots & \ddots & \\ & & & & 1 & -2 & 1 \end{pmatrix}$ .  (37)

    We can weight the rows of R with layer thickness, depth, or whatever we desire.

    We can even neglect them or zero them out (which amounts to the same thing) if

    we want to allow an un-penalized jump to appear in the model.
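A sketch of building these roughening matrices, assuming the $(N-1) \times N$ and $(N-2) \times N$ shapes implied by the rows shown in (36) and (37); whether to pad R out to a square matrix, and how to weight its rows, is left to the user.

```python
import numpy as np

def roughness_matrix(N, order=1):
    """First-difference (36) or second-difference (37) roughening matrix."""
    if order == 1:
        R = np.zeros((N - 1, N))
        for j in range(N - 1):
            R[j, j], R[j, j + 1] = -1.0, 1.0      # rows of (-1, 1)
    else:
        R = np.zeros((N - 2, N))
        for j in range(N - 2):
            R[j, j:j + 3] = (1.0, -2.0, 1.0)      # rows of (1, -2, 1)
    return R
```

Zeroing out a chosen row removes the penalty across that interface, which is the un-penalized jump mentioned above.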

Minimizing U has the effect of minimizing model roughness as well as how far the data misfit is from being acceptable. To do this we substitute our linearization around $f(m_0)$ for a starting model $m_0$ (i.e. (15) with $\Delta m$ expanded as $m_1 - m_0$) into (35):

$U = \| R m_1 \|^2 + \mu^{-1} \left\{ \| Wd - W\left[ f(m_0) + J(m_1 - m_0) \right] \|^2 - \chi_*^2 \right\}$


    and differentiate U with respect to m1, rearranging the result to get m1 directly:

$m_1 = \left[ \mu R^T R + (WJ)^T WJ \right]^{-1} (WJ)^T W \left( d - f(m_0) + J m_0 \right)$ .

We need only to choose the tradeoff (Lagrange) multiplier $\mu$. The approach of Constable et al. (1987) was to note that for each iteration $\chi^2$ is a function of $\mu$, and to use 1D optimization (simply a line search) to minimize $\chi^2$ when $\chi^2 > \chi_*^2$, and to find $\mu$ such that $\chi^2 = \chi_*^2$ otherwise. Constable et al. called this approach Occam's inversion. Although the Occam algorithm is reliable and has good convergence behavior, the computation and storage of J for large models can be limiting, but for 1D models there is no problem.
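Finally, a minimal sketch of one Occam iteration under the linearization above: build $m_1(\mu)$ from the rearranged normal equations and choose $\mu$ with a crude grid search standing in for the 1D line search; forward(m) is an assumed callable, and the search bounds are arbitrary.

```python
import numpy as np

def occam_step(forward, J, d, sigma, m0, R, chi2_target, log10_mu=None):
    """One Occam iteration (Constable et al., 1987), linearized about m0."""
    d = np.asarray(d, dtype=float)
    sigma = np.asarray(sigma, dtype=float)
    W = np.diag(1.0 / sigma)
    WJ = W @ J
    rhs = WJ.T @ (W @ (d - forward(m0) + J @ np.asarray(m0)))
    if log10_mu is None:
        log10_mu = np.linspace(-6.0, 6.0, 121)   # crude stand-in for the line search
    best_m, best_chi2 = None, np.inf             # minimum-misfit fallback
    smooth_m, smooth_mu = None, -np.inf          # smoothest model meeting the target
    for lmu in log10_mu:
        mu = 10.0 ** lmu
        m1 = np.linalg.solve(mu * (R.T @ R) + WJ.T @ WJ, rhs)
        chi2 = np.sum(((d - forward(m1)) / sigma) ** 2)
        if chi2 < best_chi2:
            best_m, best_chi2 = m1, chi2
        if chi2 <= chi2_target and mu > smooth_mu:
            smooth_m, smooth_mu = m1, mu
    # take the smoothest model that reaches the target misfit if one exists,
    # otherwise the minimum-misfit model, and iterate again
    return smooth_m if smooth_m is not None else best_m
```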


    References:

Backus, G.E., and Gilbert, J.F., 1967. Numerical applications of a formalism for geophysical inverse problems. Geophysical Journal of the Royal Astronomical Society, 13, 247-276.

Constable, S.C., Parker, R.L., and Constable, C.G., 1987. Occam's inversion: A practical algorithm for generating smooth models from EM sounding data. Geophysics, 52, 289-300.

Marquardt, D.W., 1963. An algorithm for least-squares estimation of non-linear parameters. Journal of the Society for Industrial and Applied Mathematics, 11, 431-441.

Parker, R.L., 1994. Geophysical Inverse Theory. Princeton, NJ, Princeton University Press.

Tikhonov, A.N., and Arsenin, V.Y., 1977. Solutions of Ill-Posed Problems. New York, John Wiley and Sons.