
    SIO239 Conductivity of the Deep Earth

    Iterative Inverse Modeling

    Sparsely Parameterized Problems

Bob has shown how one can compute the MT response function one would expect to observe over a one-dimensional (1D) earth. In general, processes such as this are called forward modeling, and can be written as

    d = f(x, m) (1)

where d is the predicted response (i.e. the apparent resistivities and phases, or the complex c); let's say we have M of them:

$d = (d_1, d_2, d_3, \ldots, d_M)$   (2)

and m is the model. Even though we've described the earth as 1D, we note that m can have infinite dimension, in that the conductivity could be a continuous function of depth. Thus f is also sometimes called the forward functional, because it maps m onto a single data point.

Bob will show you how to handle 1D MT as an infinite-dimensional problem, but in practice 1D models are often constructed from a stack of layers (say, L of them), described using a vector of layer thicknesses and resistivities

$m = (\rho_1, \rho_2, \ldots, \rho_L, t_1, t_2, \ldots, t_{L-1}) = (m_1, m_2, \ldots, m_N)$   (3)

where $N = 2L - 1$. The x are independent data variables (in our case, frequencies or periods) associated with the predicted responses:

$x = (x_1, x_2, x_3, \ldots, x_M)$ .  (4)
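To make the forward functional concrete, here is a minimal sketch of f(x, m) for this layered parameterization in Python, assuming the standard plane-wave impedance recursion for a stack of uniform layers over a terminating halfspace; the function name, the argument layout, and the use of frequency in Hz are my choices for illustration, not something fixed by the text.

```python
import numpy as np

MU0 = 4.0e-7 * np.pi  # free-space magnetic permeability

def mt1d_forward(freqs, rho, t):
    """Apparent resistivity (ohm-m) and phase (degrees) over a 1D layered earth.

    freqs : frequencies in Hz (the independent variables x)
    rho   : layer resistivities in ohm-m, length L (last entry is the halfspace)
    t     : layer thicknesses in m, length L - 1
    """
    rho = np.asarray(rho, dtype=float)
    t = np.asarray(t, dtype=float)
    rho_a = np.empty(len(freqs))
    phase = np.empty(len(freqs))
    for i, f in enumerate(freqs):
        omega = 2.0 * np.pi * f
        k = np.sqrt(1j * omega * MU0 / rho)   # wavenumber of each layer
        Zi = 1j * omega * MU0 / k             # intrinsic impedance of each layer
        Z = Zi[-1]                            # start from the basal halfspace
        for n in range(len(t) - 1, -1, -1):   # recurse upward through the stack
            th = np.tanh(k[n] * t[n])
            Z = Zi[n] * (Z + Zi[n] * th) / (Zi[n] + Z * th)
        rho_a[i] = np.abs(Z) ** 2 / (omega * MU0)
        phase[i] = np.degrees(np.angle(Z))
    return rho_a, phase
```

Unpacking $m = (\rho_1, \ldots, \rho_L, t_1, \ldots, t_{L-1})$ into rho and t, this plays the role of f(x, m) in (1); a uniform halfspace should return $\rho_a = \rho$ and a 45° phase, which makes a handy sanity check.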

Forward modeling is fine for understanding how MT sounding curves will look over various structures, and can even be used for fitting data by trial-and-error guessing of layer resistivities and thicknesses, but what we'd really like is a scheme for finding a model m given a set of observed data d:

$d = (d_1, d_2, d_3, \ldots, d_M)$   (5)

    such that

    d = f(x, m) . (6)

    In practice, it is unlikely that any model exists that will fit the data perfectly, and it

    is almost certainly true that your parameterization does not capture the real world.


    Even if the earth were indeed constructed exactly from a stack of layers, the data

    will have errors associated with them:

$\sigma = (\sigma_1, \sigma_2, \ldots, \sigma_M)$   (7)

preventing an exact fit for any but the sparsest data sets (note, however, that this is only true of nonlinear forward models). What we can do is generate a measure of how well a model m fits the data, and find some way to reduce, or even minimize, this misfit. For practical and theoretical reasons, the sum-squared misfit is a favored measure:

$\chi^2 = \sum_{i=1}^{M} \frac{1}{\sigma_i^2} \left[ d_i - f(x_i, m) \right]^2$   (8)

    which equivalently may be written as

$\chi^2 = \| Wd - Wf(m) \|^2 = \| W(d - \hat{d}\,) \|^2$   (9)

    where W is a diagonal matrix of reciprocal data errors

$W = \mathrm{diag}(1/\sigma_1, 1/\sigma_2, \ldots, 1/\sigma_M)$ .  (10)
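As a small illustration of (8)-(10), assuming the observed data, the predicted response f(m), and the errors $\sigma_i$ are available as arrays (the helper name is mine):

```python
import numpy as np

def chi_squared(d, f_of_m, sigma):
    """Sum-squared misfit chi^2 = || W (d - f(m)) ||^2, equations (8)-(10)."""
    W = np.diag(1.0 / np.asarray(sigma))          # reciprocal data errors
    r = W @ (np.asarray(d) - np.asarray(f_of_m))  # weighted residuals
    return float(r @ r)
```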

The least squares (LS) approach attempts to minimize $\chi^2$ with respect to all the model parameters simultaneously. If the data errors $\sigma_i$ are Gaussian and independent, then least squares provides a maximum likelihood and unbiased estimate of m, and $\chi^2$ is chi-squared distributed with $M - N$ degrees of freedom. We see that parameterized LS requires that N be less than M. In practice, N


). At this point |Z| could be expressed as a fraction or percentage, but the phase errors must stay absolute, since phase may go through zero.

The estimation procedure you used to get the complex impedances should have provided some sort of statistical error estimate, which needs to be propagated into amplitude and phase as above. If the statistical error estimate does not provide a full accounting of the uncertainties in the data (which is often the case), when you increase it you should maintain the appropriate proportion between amplitude and phase. That is, 10% in amplitude corresponds to 5.7° in phase. Note, however, that

$\partial |Z|^2 / \partial\,\mathrm{Real}(Z) = 2\,\mathrm{Real}(Z)\,, \qquad \partial |Z|^2 / \partial\,\mathrm{Imag}(Z) = 2\,\mathrm{Imag}(Z)$   (14)

and so for apparent resistivity and phase the proportions are 10% in apparent resistivity and 2.9° in phase.
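Here is one way the propagation described above might be sketched, assuming a single standard error $\sigma_Z$ applies to both the real and imaginary parts of Z and that the errors are small; the function name and calling convention are illustrative only.

```python
import numpy as np

MU0 = 4.0e-7 * np.pi

def impedance_errors_to_rho_phase(Z, sigma_Z, freq):
    """Propagate a standard error on complex Z into apparent resistivity and phase."""
    omega = 2.0 * np.pi * freq
    rho_a = np.abs(Z) ** 2 / (omega * MU0)   # apparent resistivity
    phase_deg = np.degrees(np.angle(Z))
    rel = sigma_Z / np.abs(Z)                # fractional error in |Z|
    sigma_rho = 2.0 * rel * rho_a            # |Z|^2 doubles the fractional error
    sigma_phase_deg = np.degrees(rel)        # phase error stays absolute
    return rho_a, sigma_rho, phase_deg, sigma_phase_deg
```

With rel = 0.05 this reproduces the 10% and 2.9° pairing quoted above.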

    One should also note that sum-square misfit measures are particularly unforgiving

    of outliers.

Most practical approaches to the inverse problem usually involve linearizing the forward solution f around an initial model guess $m_0$:

$f(m_0 + \Delta m) = f(m_0) + J\,\Delta m + O(\Delta m^2)$   (15)

    where J is a matrix of partial derivatives of data with respect to the model parameters

$J_{ij} = \dfrac{\partial f(x_i, m_0)}{\partial m_j}$   (16)

    (often called the Jacobian matrix) and

$\Delta m = (\Delta m_1, \Delta m_2, \ldots, \Delta m_N)$   (17)

is a model parameter perturbation about $m_0$. Now our expression for $\chi^2$ is

$\chi^2 \approx \| W(d - f(m_0)) - WJ\,\Delta m \|^2$   (18)

    where the approximation is a recognition that we have dropped the higher order

    terms. We will proceed on the assumption that this linear approximation is a good

one. (More on that later.) We can minimize $\chi^2$ in the usual way by setting the derivatives of $\chi^2$ with respect to $\Delta m$ equal to zero:

$\nabla \chi^2 = -2(WJ)^T \left[ W(d - f(m_0)) - WJ\,\Delta m \right] = 0$   (19)

and solving for $\Delta m$

$\Delta m = \left[ (WJ)^T WJ \right]^{-1} (WJ)^T \left[ W(d - f(m_0)) \right]$   (20)

    which may be equivalently written as N simultaneous equations:

$\beta = \alpha\,\Delta m$   (21)


    where

$\beta = (WJ)^T W (d - f(m_0))$   (22)

$\alpha = (WJ)^T WJ$ .  (23)

The matrix $\alpha$ is sometimes called the curvature matrix. This system can be solved for $\Delta m$ by inverting $\alpha$ numerically. If the forward model f is truly linear, then the model $m = m_0 + \Delta m$ is the least squares solution.

For non-linear problems a second model $m_1 = m_0 + \Delta m$ is found and the process repeated in the hope that it converges to a solution. Note that because J depends on m, J will need to be re-computed every iteration. This is the general form of the Gauss-Newton approach to model fitting.
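A minimal sketch of one Gauss-Newton update assembled from equations (20)-(23), assuming the forward response and the Jacobian at $m_0$ have already been computed; the names and calling convention are mine.

```python
import numpy as np

def gauss_newton_step(d, f_m0, J, sigma, m0):
    """One Gauss-Newton update: solve alpha * dm = beta and return m0 + dm.

    d     : observed data, length M
    f_m0  : forward response f(x, m0), length M
    J     : Jacobian df/dm at m0, shape (M, N)
    sigma : data standard errors, length M
    """
    W = np.diag(1.0 / np.asarray(sigma))                      # equation (10)
    WJ = W @ J
    beta = WJ.T @ (W @ (np.asarray(d) - np.asarray(f_m0)))    # equation (22)
    alpha = WJ.T @ WJ                                         # equation (23)
    dm = np.linalg.solve(alpha, beta)                         # equation (21)
    return np.asarray(m0) + dm
```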

Near the least squares solution the Gauss-Newton method will work, but any significant nonlinearity will result in likely failure unless you start linearly close to a solution. Long, thin valleys in the $\chi^2$ hyper-surface are common, and produce search directions which diverge radically from the direction of the minimum.

Another approach, which does not depend on how large the second-order terms in the expansion of f are, is based on the expansion of $\chi^2$, rather than f:

$\chi^2(m_0 + \Delta m) = \chi^2(m_0) + \Delta m^T \nabla \chi^2(m_0) + O(\Delta m^2)$   (24)

into which we can substitute our expression (19) for $\nabla \chi^2$ (setting $\Delta m = 0$) and drop the higher order terms:

$\chi^2(m_0 + \Delta m) = \chi^2(m_0) - 2\,\Delta m^T (WJ)^T \left[ W(d - f(m_0)) \right]$ .  (25)

If we choose $\Delta m = \epsilon\,(WJ)^T \left[ W(d - f(m_0)) \right]$ for some scalar $\epsilon$ then we get

$\chi^2(m_0 + \Delta m) = \chi^2(m_0) - 2\epsilon\, \left\| (WJ)^T \left[ W(d - f(m_0)) \right] \right\|^2$   (26)

and there will always be an $\epsilon$ which keeps reducing $\chi^2$ (not always; at some point the higher order terms will take over). Solutions of the form

$\Delta m = \epsilon\,(WJ)^T \left[ W(d - f(m_0)) \right]$   (27)

go down the slope of the $\chi^2$ hyper-surface, and are thus called steepest descent methods. Although they are guaranteed to reduce $\chi^2$, as they approach a least squares solution $\nabla \chi^2 \to 0$ and so does $\Delta m$, and the method becomes very inefficient.
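For comparison, the steepest descent step (27) looks like this in the same (assumed) notation, with eps playing the role of the scalar $\epsilon$:

```python
import numpy as np

def steepest_descent_step(d, f_m0, J, sigma, m0, eps):
    """Equation (27): move down the chi^2 gradient by a small scalar eps."""
    W = np.diag(1.0 / np.asarray(sigma))
    WJ = W @ J
    return np.asarray(m0) + eps * (WJ.T @ (W @ (np.asarray(d) - np.asarray(f_m0))))
```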

An algorithm to compensate for this behaviour was suggested by Marquardt (1963). Consider a model perturbation given by

$\Delta m = \left[ \lambda I + (WJ)^T WJ \right]^{-1} (WJ)^T \left[ W(d - f(m_0)) \right]$   (28)

for another scalar, $\lambda$. When $\lambda = 0$ it is easy to see that this reduces to the Gauss-Newton step. When $\lambda$ is large, it reduces to a steepest descent step with $\epsilon = 1/\lambda$.


Referring to our earlier, compact notation (21-23), this is the same as increasing the diagonal terms of the curvature matrix by a factor $\lambda$:

$\alpha'_{jk} = \alpha_{jk}(1 + \lambda)$ for $j = k$   (29)

$\alpha'_{jk} = \alpha_{jk}$ for $j \neq k$ .  (30)

By adjusting $\lambda$ to be large far from the solution, and small as we approach the minimum, we have a fairly stable method of determining a model composed of a small number of parameters.

The classic Marquardt algorithm is as follows (a minimal code sketch of the update loop is given after the two outlines below):

i) Choose a starting $m_0$ and a fairly small $\lambda$ (say, 0.001). Compute the $\chi^2$ of the starting model.

ii) Compute $m_1$ and a new $\chi^2$. If

a) the $\chi^2$ decreased, keep the model, reduce $\lambda$ by a factor of 10, and go to (iii);

b) the $\chi^2$ increased, increase $\lambda$ by a factor of 10 and go to (ii).

iii) If the changes in $\chi^2$ or m are very small, stop. Otherwise, go to (ii).

The assumption in the classic Marquardt algorithm is that forward model calculations are expensive. If they are not, or if the cost of computing J is high compared with the forward solution, then a more efficient algorithm would be:

i) Choose a starting $m_0$ and a fairly small $\lambda$ (say, 0.001). Compute the $\chi^2$ of the starting model.

ii) Do a line search over $\lambda$ to find the minimum $\chi^2$ and the associated model.

iii) If the changes in $\chi^2$ or m are very small, stop. Otherwise, go to (ii).
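Here is a minimal sketch of the classic loop above, using the $\lambda I$ damping of (28); forward(m) and jacobian(m) are assumed user-supplied callables, and the convergence test is deliberately crude.

```python
import numpy as np

def marquardt(forward, jacobian, d, sigma, m0, lam=1e-3, max_iter=50, tol=1e-4):
    """Classic Marquardt iteration: damped Gauss-Newton steps, equation (28)."""
    m = np.asarray(m0, dtype=float)
    d = np.asarray(d, dtype=float)
    sigma = np.asarray(sigma, dtype=float)
    W = np.diag(1.0 / sigma)
    fm = forward(m)
    chi2 = np.sum(((d - fm) / sigma) ** 2)
    for _ in range(max_iter):
        WJ = W @ jacobian(m)
        rhs = WJ.T @ (W @ (d - fm))
        # equation (28): lambda*I damps the Gauss-Newton normal equations
        trial = m + np.linalg.solve(lam * np.eye(len(m)) + WJ.T @ WJ, rhs)
        f_trial = forward(trial)
        chi2_trial = np.sum(((d - f_trial) / sigma) ** 2)
        if chi2_trial < chi2:                      # (ii a): keep model, relax damping
            small = chi2 - chi2_trial < tol
            m, fm, chi2 = trial, f_trial, chi2_trial
            lam /= 10.0
            if small:                              # (iii): change in chi^2 is tiny
                break
        else:                                      # (ii b): increase damping, retry
            lam *= 10.0
    return m, chi2
```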

What about calculating those derivatives in J? One approach is to do this numerically by first differences. A forward or backward difference

$\dfrac{d}{dx} f(x) \approx \dfrac{f(x + h) - f(x)}{h} \approx \dfrac{f(x) - f(x - h)}{h}$   (31)

has a leading error term that goes as

$\dfrac{h}{2} \left| f''(x) \right|$   (32)

but a central difference

$\dfrac{d}{dx} f(x) \approx \dfrac{f(x + h) - f(x - h)}{2h}$   (33)


has an error term that goes as

$\dfrac{h^2}{6} \left| f'''(x) \right|$ .  (34)

How to choose h? The tradeoff is to generate a computationally significant difference in f while not being so large that nonlinearities compromise the result. Usually something like 5% works well enough, but make sure you don't take a percentage of anything that can go through zero.
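A sketch of building J column by column with the central difference (33), using a fractional step as suggested above; it assumes the parameters stay away from zero (for example because they are logarithms of resistivities and thicknesses), and the function name is mine.

```python
import numpy as np

def numerical_jacobian(forward, m0, frac=0.05):
    """Central-difference Jacobian J_ij = df_i/dm_j, equation (33)."""
    m0 = np.asarray(m0, dtype=float)
    f0 = np.asarray(forward(m0))
    J = np.empty((len(f0), len(m0)))
    for j in range(len(m0)):
        # fractional step; fall back to an absolute step for a zero parameter
        h = frac * abs(m0[j]) if m0[j] != 0.0 else frac
        mp, mm = m0.copy(), m0.copy()
        mp[j] += h
        mm[j] -= h
        J[:, j] = (np.asarray(forward(mp)) - np.asarray(forward(mm))) / (2.0 * h)
    return J
```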

    Alternatively, one can do the derivative calculations analytically, which is usually

    computationally much faster and more accurate but may take some math and

coding. If you do, check the analytical results against the central differences; if your derivatives are wrong, no gradient-based inverse method is going to converge.

    Neither the resistivities nor layer thicknesses can go negative in the real world,

    but believe me they almost certainly will in iterative inversion schemes. There

    are non-negative least squares algorithms out there, but the easy fix for this is to

    parameterize them as logarithms, and adjust f and J accordingly.
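For the log parameterization the adjustment is just the chain rule (a standard identity, stated here for natural logarithms, not something spelled out in the text): if the working parameter is $p_j = \ln m_j$, then

$\dfrac{\partial f}{\partial p_j} = m_j \dfrac{\partial f}{\partial m_j}$

so each column of J is simply scaled by the corresponding (positive) parameter value.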

    The iterative parameterized LS approach may converge to a local minimum rather

    than a global minimum, and it may not work (diverge) unless you start reasonably

    close to a solution. One approach to this is to run lots of inversions using a fairly

    large random set of starting models, assuming the computation is fast enough.

For nonlinear problems that are truly parameterized the Marquardt method is pretty hard to beat. It also works fairly well for problems where the number of degrees of freedom is large, given by $M - N$ when the M data are truly independent, and the starting model provides a rough fit to the data. In practice, this means 1D models

    which consist of a small number of layers. Any attempt to increase the number

    of layers significantly, to approximate the realistic case of a continuously varying

    conductivity, will result in wildly fluctuating parameters followed by failure of the

    algorithm. This is because there will be columns of J that are small, reflecting

    the fact that large variations in a model parameter may relate to negligible effects

    on the combined data. However, when you limit the parameterization you need to

    remember that your solution will be as much constrained by the parameterization

    as the data.

    Infinite-Dimensional Models

    When one admits that the model space is infinite dimensional, then geophysical

    inversion becomes non-unique for finite, noisy data. That means that a single

    misfit (which is, after all, only a single scalar) will map into an infinite number of

    models (or none). It is also poorly constrained, in that a small difference in misfit

    can correspond to a large distance in model space. Even if you have an infinite

    dimensional model space, the true minimum misfit is probably outside your chosen

    space.

    As it happens, the 1D MT sounding problem is one of the few (2?) geophysical


    problems for which an analytical least squares solution exists, derived by Parker

    and Whaler (1981) and called the D+ solution. Unfortunately, the D+ solution,

    although best fitting in a least squares sense, is pathological (Bob will tell you

    more). However, the misfit measure obtained from D+ can be a useful guide to data

    quality.

So the true LS solution is too extreme to be useful, and sparse parameterizations limit the solution to a fixed, small number of layers. What to do? One approach, suggested by Backus and Gilbert (1967), is to allow a large number of parameters but minimize $\|\Delta m\|$. This and related algorithms converge extremely slowly and

are called by Parker (1994) creeping methods. In any case, we have seen that trying to find a true least squares solution is folly. Almost all high-dimensional inversion today incorporates some type of regularization, an approach suggested by Tikhonov and Arsenin (1977), which explicitly penalizes bad behavior in the model. For example, instead of minimizing $\chi^2$, we minimize an unconstrained functional

$U = \| R m_1 \|^2 + \mu^{-1} \left\{ \| Wd - Wf(m_1) \|^2 - \chi_*^2 \right\}$   (35)

where $\chi_*^2$ is a target misfit that is greater than the minimum possible, but statistically acceptable, and $\|Rm\|$ is some measure of roughness in the model, often taken to be first differences between adjacent model parameters and easily generated by a matrix R consisting of (-1, 1) entries around the diagonal. That is,

$R = \begin{pmatrix} -1 & 1 & 0 & 0 & 0 & \cdots & 0 \\ 0 & -1 & 1 & 0 & 0 & \cdots & 0 \\ 0 & 0 & -1 & 1 & 0 & \cdots & 0 \\ & & & \ddots & \ddots & & \\ & & & & & -1 & 1 \end{pmatrix}$ .  (36)

    For a second difference roughness you can use

$R_2 = R^2 = \begin{pmatrix} 1 & -2 & 1 & 0 & 0 & \cdots & 0 \\ 0 & 1 & -2 & 1 & 0 & \cdots & 0 \\ 0 & 0 & 1 & -2 & 1 & \cdots & 0 \\ & & & \ddots & \ddots & \ddots & \\ & & & & 1 & -2 & 1 \end{pmatrix}$ .  (37)

    We can weight the rows of R with layer thickness, depth, or whatever we desire.

    We can even neglect them or zero them out (which amounts to the same thing) if

    we want to allow an un-penalized jump to appear in the model.
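A sketch of building these roughening matrices, assuming the $(N-1) \times N$ and $(N-2) \times N$ shapes implied by the rows shown in (36) and (37); whether to pad R out to a square matrix, and how to weight its rows, is left to the user.

```python
import numpy as np

def roughness_matrix(N, order=1):
    """First-difference (36) or second-difference (37) roughening matrix."""
    if order == 1:
        R = np.zeros((N - 1, N))
        for j in range(N - 1):
            R[j, j], R[j, j + 1] = -1.0, 1.0      # rows of (-1, 1)
    else:
        R = np.zeros((N - 2, N))
        for j in range(N - 2):
            R[j, j:j + 3] = (1.0, -2.0, 1.0)      # rows of (1, -2, 1)
    return R
```

Zeroing out a chosen row removes the penalty across that interface, which is the un-penalized jump mentioned above.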

Minimizing U has the effect of minimizing model roughness as well as how far the data misfit is from being acceptable. To do this we substitute our linearization around $f(m_0)$ for a starting model $m_0$ (i.e. (15) with $\Delta m$ expanded as $m_1 - m_0$) into (35):

$U = \| R m_1 \|^2 + \mu^{-1} \left\{ \| Wd - W\left[ f(m_0) + J(m_1 - m_0) \right] \|^2 - \chi_*^2 \right\}$


    and differentiate U with respect to m1, rearranging the result to get m1 directly:

$m_1 = \left[ \mu R^T R + (WJ)^T WJ \right]^{-1} (WJ)^T W \left( d - f(m_0) + J m_0 \right)$ .

We need only to choose the tradeoff (Lagrange) multiplier $\mu$. The approach of Constable et al. (1987) was to note that for each iteration $\chi^2$ is a function of $\mu$, and to use 1D optimization (simply a line search) to minimize $\chi^2$ when $\chi^2 > \chi_*^2$, and to find $\mu$ such that $\chi^2 = \chi_*^2$ otherwise. Constable et al. called this approach Occam's inversion. Although the Occam algorithm is reliable and has good convergence behavior, the computation and storage of J for large models can be limiting, but for 1D models there is no problem.
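Finally, a minimal sketch of one Occam iteration under the linearization above: build $m_1(\mu)$ from the rearranged normal equations and choose $\mu$ with a crude grid search standing in for the 1D line search; forward(m) is an assumed callable, and the search bounds are arbitrary.

```python
import numpy as np

def occam_step(forward, J, d, sigma, m0, R, chi2_target, log10_mu=None):
    """One Occam iteration (Constable et al., 1987), linearized about m0."""
    d = np.asarray(d, dtype=float)
    sigma = np.asarray(sigma, dtype=float)
    W = np.diag(1.0 / sigma)
    WJ = W @ J
    rhs = WJ.T @ (W @ (d - forward(m0) + J @ np.asarray(m0)))
    if log10_mu is None:
        log10_mu = np.linspace(-6.0, 6.0, 121)   # crude stand-in for the line search
    best_m, best_chi2 = None, np.inf             # minimum-misfit fallback
    smooth_m, smooth_mu = None, -np.inf          # smoothest model meeting the target
    for lmu in log10_mu:
        mu = 10.0 ** lmu
        m1 = np.linalg.solve(mu * (R.T @ R) + WJ.T @ WJ, rhs)
        chi2 = np.sum(((d - forward(m1)) / sigma) ** 2)
        if chi2 < best_chi2:
            best_m, best_chi2 = m1, chi2
        if chi2 <= chi2_target and mu > smooth_mu:
            smooth_m, smooth_mu = m1, mu
    # take the smoothest model that reaches the target misfit if one exists,
    # otherwise the minimum-misfit model, and iterate again
    return smooth_m if smooth_m is not None else best_m
```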


    References:

Backus, G.E., and Gilbert, J.F., 1967. Numerical applications of a formalism for geophysical inverse problems. Geophysical Journal of the Royal Astronomical Society, 13, 247-276.

Constable, S.C., Parker, R.L., and Constable, C.G., 1987. Occam's inversion: A practical algorithm for generating smooth models from EM sounding data. Geophysics, 52, 289-300.

Marquardt, D.W., 1963. An algorithm for least-squares estimation of non-linear parameters. Journal of the Society for Industrial and Applied Mathematics, 11, 431-441.

Parker, R.L., 1994. Geophysical Inverse Theory. Princeton, NJ, Princeton University Press.

Tikhonov, A.N., and Arsenin, V.Y., 1977. Solutions of Ill-Posed Problems. New York, John Wiley and Sons.