A Nearest Neighbor Bootstrap for Resampling Hydrologic Time Series


WATER RESOURCES RESEARCH, VOL. 32, NO. 3, PAGES 679-693, MARCH 1996

A nearest neighbor bootstrap for resampling hydrologic time series

Upmanu Lall and Ashish Sharma

Utah Water Research Laboratory, Utah State University, Logan

Abstract. A nonparametric method for resampling scalar or vector-valued time series is introduced. Multivariate nearest neighbor probability density estimation provides the basis for the resampling scheme developed. The motivation for this work comes from a desire to preserve the dependence structure of the time series while bootstrapping (resampling it with replacement). The method is data driven and is preferred where the investigator is uncomfortable with prior assumptions as to the form (e.g., linear or nonlinear) of dependence and the form of the probability density function (e.g., Gaussian). Such prior assumptions are often made in an ad hoc manner for analyzing hydrologic data. Connections of the nearest neighbor bootstrap to Markov processes as well as its utility in a general Monte Carlo setting are discussed. Applications to resampling monthly streamflow and some synthetic data are presented. The method is shown to be effective with time series generated by linear and nonlinear autoregressive models. The utility of the method for resampling monthly streamflow sequences with asymmetric and bimodal marginal probability densities is also demonstrated.

Introduction

Autoregressive moving average (ARMA) models for time series analysis are often used by hydrologists to generate synthetic streamflow and weather sequences to aid in the analysis of reservoir and drought management. Hydrologic time series can exhibit the following behaviors, which can be a problem for commonly used linear ARMA models: (1) asymmetric and/or multimodal conditional and marginal probability distributions; (2) persistent large amplitude variations at irregular time intervals; (3) amplitude-frequency dependence (e.g., the amplitude of the oscillations increases as the oscillation period increases); (4) apparent long memory (this could be related to (2) and/or (3)); (5) nonlinear dependence between x_t and x_{t-τ} for some lag τ; and (6) time irreversibility (i.e., the time series plotted in reverse time is "different" from the time series in forward time). The physics of most geophysical processes is time irreversible. Streamflow hydrographs often rise rapidly and attenuate slowly, leading to time irreversibility.

Kendall and Dracup [1991] have argued for simple resampling schemes, such as the index sequential method, for streamflow simulation in place of ARMA models, suggesting that the ARMA streamflow sequences usually do not "look" like real streamflow sequences. An alternative is presented in the work of Yakowitz [Yakowitz, 1973, 1979; Schuster and Yakowitz, 1979; Yakowitz, 1985, 1993; Karlsson and Yakowitz, 1987a, b], Smith et al. [1992], Smith [1991], and Tarboton et al. [1993], who consider the time series as the outcome of a Markov process and estimate the requisite probability densities using nonparametric methods. A resampling technique or bootstrap for scalar or vector-valued, stationary, ergodic time series data that recognizes the serial dependence structure of the time series is presented here. The technique is nonparametric; that is, no prior assumptions as to the distributional form of the underlying stochastic process are made.

Copyright 1996 by the American Geophysical Union.
Paper number 95WR02966.
0043-1397/96/95WR-02966$05.00

The bootstrap [Efron, 1979; Efron and Tibshirani, 1993] is a technique that prescribes a data resampling strategy using the random mechanism that generated the data. Its applications for estimating confidence intervals and parameter uncertainty are well known [see Härdle and Bowman, 1988; Tasker, 1987; Woo, 1989; Zucchini and Adamson, 1989]. Usually the bootstrap resamples with replacement from the empirical distribution function of independent, identically distributed data.

The contribution of this paper is the development of a bootstrap for dependent data that preserves the dependence in a probabilistic sense. This method should be useful for the Monte Carlo analysis of a variety of hydrologic design and operation problems where time series data on one or more interacting variables are available.

The underlying concept of the methodology is introduced through Figure 1. Consider that the serial dependence is limited to the two previous lags; that is, x_t depends on the two prior values x_{t-1} and x_{t-2}. Denote this ordered pair, or bituple, at a time t by D_t. Let the corresponding succeeding value be denoted by S. Consider the k nearest neighbors of D_t as the k bituples in the time series that are closest in terms of Euclidean distance to D_t. The first three nearest neighbors are marked as D_1, D_2, and D_3. The expected value of the forecast S can be estimated as an appropriate weighted average of the successors x_t (marked as 1, 2, and 3, respectively) to these three nearest neighbors. The weights may depend inversely on the distance between D_t and its k nearest neighbors D_1, D_2, ..., D_k. A conditional probability density f(x|D_t) may be evaluated empirically using a nearest neighbor density estimator [see Silverman, 1986, p. 96] with the successors x_1, ..., x_k. For simulation the x_i can be drawn randomly from one of the k successors to the D_1, D_2, ..., D_k using this estimated conditional density. Here, this operation will be done by resampling the original data with replacement. Hence the procedure developed is termed a nearest neighbor time series bootstrap.



Figure 1. A time series from the model x_{t+1} = 1 - 4(x_t - 0.5)^2. This is a deterministic, nonlinear model with a time series that looks random. A forecast of the successor to the bituple D_t, marked as S in the figure, is of interest. The "patterns" or bituples of interest are the filled circles, near the three nearest neighbors D_1, D_2, and D_3 to the pattern D_t. The successors to these bituples are marked as 1, 2, and 3, respectively. Note how the successor (1) to the closest nearest neighbor (D_1) is closest to the successor (S) of D_t. A sense of the marginal probability distribution of x_t is obtained by looking at the values of x_t shown on the right side of the figure. As the sample size n increases, the sample space of x gets filled in between 0 and 1, such that the sample values are arbitrarily close to each other, but no value is ever repeated exactly.

In summary, one finds k patterns in the data that are "similar" to the current pattern and then operates on their respective successors to define a local regression, conditional density, or resampling.

The nearest neighbor probability density estimator and its use with Markov processes are reviewed in the next section. The resampling algorithm is described after that. Applications to synthetic and streamflow data are then presented.

Background

It is natural to pursue nonparametric estimation of probability densities and regression functions through weighted local averages of the target function. This is the foundation for nearest neighbor methods. The recognition of the nonlinearity of the underlying dynamics of geophysical processes, gains in computational ability, and the availability of large data sets have spurred the growth of the nonparametric literature. The reader is referred to work by Silverman [1986], Eubank [1988], Härdle [1989, 1990], and Scott [1992] for accessible monographs. Györfi et al. [1989] provide a theoretical account that is relevant for time series analysis. Lall [1995] surveys hydrologic applications. For time series analysis a moving block bootstrap (MBB) was presented by Künsch [1989]. Here a block of m observations is resampled with replacement, as opposed to a single observation in the bootstrap. Serial dependence is preserved within, but not across, a block. The block length m determines the order of the serial dependence that can be preserved. Objective procedures for the selection of the block length m are evolving. Strategies for conditioning the MBB on other processes (e.g., runoff on rainfall) are not obvious. Our investigations indicated that the MBB may not be able to reproduce the sample statistics as well as the nearest neighbor bootstrap presented here.

The k nearest neighbor (k-nn) density estimator is defined as [Silverman, 1986, p. 96]

$$\hat{f}_{NN}(x) = \frac{k/n}{V_k(x)} = \frac{k/n}{c_d\, r_k^d(x)} \qquad (1)$$

where k is the number of nearest neighbors considered, d is the dimension of the space, c_d is the volume of a unit sphere in d dimensions (c_1 = 2, c_2 = π, c_3 = 4π/3, ..., c_d = π^{d/2}/Γ(d/2 + 1)), r_k(x) is the Euclidean distance to the kth-nearest data point, and V_k(x) is the volume of a d-dimensional sphere of radius r_k(x).

This estimator is readily understood by observing that for a sample of size n, we expect approximately {n f(x) V_k(x)} observations to lie in the volume V_k(x). Equating this to the number observed, that is, k, completes the definition.
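As a concrete illustration, equation (1) can be evaluated directly from a sample. The sketch below (Python with NumPy; the function name `knn_density` is ours, not from the paper) estimates a one-dimensional density at a point, where c_1 = 2 and hence V_k(x) = 2 r_k(x):

```python
import numpy as np

def knn_density(x, sample, k):
    """k-nearest-neighbor density estimate at x, equation (1), for d = 1.
    In one dimension c_1 = 2, so V_k(x) = 2 * r_k(x)."""
    n = len(sample)
    r_k = np.sort(np.abs(sample - x))[k - 1]  # distance to the kth-nearest point
    return (k / n) / (2.0 * r_k)

rng = np.random.default_rng(1)
sample = rng.normal(size=1000)
est = knn_density(0.0, sample, k=50)
print(est)  # should be near the standard normal density at 0
```

Larger k lowers the variance of the estimate but widens the ball, increasing bias, which is the trade-off discussed below.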

A generalized nearest neighbor density estimator [Silverman, 1986, p. 97], defined in (2), can improve the tail behavior of the nearest neighbor density estimator by using a monotonically and possibly rapidly decreasing smooth kernel function.

$$\hat{f}_{GNN}(x) = \frac{1}{n\, r_k^d(x)} \sum_{i=1}^{n} K\!\left(\frac{x - X_i}{r_k(x)}\right) \qquad (2)$$
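Equation (2) differs from (1) mainly in replacing the hard count of points inside V_k(x) by a smooth kernel scaled by the adaptive bandwidth r_k(x). A minimal one-dimensional sketch (Python/NumPy; the Gaussian kernel and the function name are our choices for illustration):

```python
import numpy as np

def gnn_density(x, sample, k):
    """Generalized nearest neighbor estimate, equation (2), for d = 1:
    a kernel density estimate whose bandwidth is the distance r_k(x)
    to the kth-nearest observation."""
    n = len(sample)
    r_k = np.sort(np.abs(sample - x))[k - 1]           # adaptive bandwidth
    u = (x - sample) / r_k
    weights = np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)  # Gaussian kernel K
    return weights.sum() / (n * r_k)

rng = np.random.default_rng(1)
sample = rng.normal(size=1000)
est = gnn_density(0.0, sample, k=50)
print(est)
```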

The "smoothing" parameter is the number of neighbors used, k, and the tail behavior is determined by the kernel K(·). The kernel has the role of a weight function (data vectors X_i closer to the point of estimate x are weighted more) and can be chosen to be any valid probability density function. Asymptotically, under optimal mean square error (MSE) arguments, k should be chosen proportional to n^{4/(d+4)} for a probability density that is twice differentiable. However, given a single sample from an unknown density, such a rule is of little practical utility. The sensitivity to the choice of k is somewhat lower when a kernel that is monotonically decreasing with r_k(x) is used.

A new kernel function that weights the jth neighbor of x_i, using a kernel that depends on the distance between x_i and its jth neighbor, is developed in the resampling methodology section.

Yakowitz (references cited earlier) developed a theoretical basis for using nearest neighbor and kernel methods for time series forecasting and applied them in a hydrologic context. In those papers Yakowitz considers a finite order, continuous parameter Markov chain as an appropriate model for hydrologic time series. He observes that discretization of the state space can quickly lead to either an unmanageable number of parameters (the curse of dimensionality) or poor approximation of the transition functions, while the ARMA approximations to such a process call for restrictive distributional and structural assumptions. Strategies for the simulation of daily flow sequences, one-step-ahead prediction, and the conditional probability of flooding (flow crossing a threshold) are exemplified with river flows and shown to be superior to ARMA models. Seasonality is accommodated by including the calendar date as one of the predictors. Yakowitz claims that this continuous parameter Markov chain approach is capable of reproducing any possible Hurst coefficient.

Classical ARMA models are optimal only under squared error loss and only for linear operations on the observables. The loss/risk functions associated with hydrologic decisions (e.g., whether to declare a flood warning or not) are usually asymmetric. The nonparametric framework allows attention to be focused directly on calculating these loss functions and evaluating the consequences.

The example of Figure 1 is now extended to show how the nearest neighbor method is used in the Markov framework. One-step Markov transition functions are considered. The relationship between x_{t+1} and x_t is shown in Figure 2. The correlation between x_t and x_{t+1} is 0, even though there is clear-cut dependence between the two variables.

Consider an approximation of the model in Figure 1 by a multistate, first-order Markov chain, where transitions from, say, state 1 for x_t (0 to 0.25 in Figure 2) to states 1, 2, 3, or 4 for x_{t+1} are of interest. The state i to state j transition probability p_{ij} is evaluated by counting the relative fraction of transitions from state i to state j. The estimated transition probabilities depend on the number of states chosen as well as their actual demarcation (e.g., one may need a nonuniform grid that recognizes variations in data density). For the nonlinear model used in our example, a fine discretization would be needed. Given a finite data set, estimates of the multistate transition probabilities may be unreliable. Clearly, this situation is exacerbated if one considers higher dimensions for the predictor space. Further, a reviewer has observed that a discretization of a continuous space Markov process is not necessarily Markov.
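The multistate approximation described above is easy to make concrete. The sketch below (Python/NumPy; the four equal-width states follow Figure 2, while the variable names and series length are our illustrative choices) estimates p_{ij} by counting transitions in a series generated from the model of Figure 1:

```python
import numpy as np

# Generate a series from the deterministic map of Figure 1:
# x_{t+1} = 1 - 4*(x_t - 0.5)**2 (equivalently the logistic map 4*x*(1 - x)).
n = 5000
x = np.empty(n)
x[0] = 0.2345
for t in range(n - 1):
    x[t + 1] = 1.0 - 4.0 * (x[t] - 0.5) ** 2

# Discretize [0, 1] into four equal states and count transitions i -> j.
states = np.minimum((x * 4).astype(int), 3)     # state index 0..3
counts = np.zeros((4, 4))
for i, j in zip(states[:-1], states[1:]):
    counts[i, j] += 1
P = counts / counts.sum(axis=1, keepdims=True)  # row-normalized p_ij
print(np.round(P, 2))
```

Even with thousands of observations, cells corresponding to rarely visited state pairs are estimated from few counts, which is exactly the reliability problem noted above.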

Now consider the nearest neighbor approach. Consider two conditioning points x*_A and x*_B. The k nearest neighbors of these points are in the dashed windows A and B, respectively. The neighborhoods are seen to adapt to variations in the sampling density of x_t. Since such neighborhoods represent moving windows (as opposed to fixed windows for the multistate Markov chain) at each point of estimate, we can expect reduced bias in the recovery of the target transition functions. The one-step transition probabilities at x* can be obtained through an application of the nearest neighbor density estimator to the x_{t+1} values that fall in windows like A and B. A conditional bootstrap of the data can be obtained by resampling from this set of x_{t+1} values. Since each transition probability estimate is based on k points, the problem faced in a multistate Markov chain model of sometimes not having an adequate number of events or state transitions to develop an estimate is circumvented.
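This window-based conditional bootstrap takes only a few lines. The sketch below (Python/NumPy; `nn_conditional_bootstrap` is an illustrative name of ours) resamples successors of the k historical values nearest a conditioning point x*, for the series of Figure 1:

```python
import numpy as np

def nn_conditional_bootstrap(series, x_star, k, size, rng):
    """Resample plausible successors of x_star: find the k historical values
    nearest to x_star and draw (with replacement) from their successors."""
    x_t, x_next = series[:-1], series[1:]
    nn = np.argsort(np.abs(x_t - x_star))[:k]   # k nearest neighbors of x_star
    return rng.choice(x_next[nn], size=size, replace=True)

rng = np.random.default_rng(7)
# Series from the map of Figure 1.
x = np.empty(2000)
x[0] = 0.2345
for t in range(1999):
    x[t + 1] = 1.0 - 4.0 * (x[t] - 0.5) ** 2

draws = nn_conditional_bootstrap(x, x_star=0.5, k=10, size=100, rng=rng)
print(draws.min())
```

Near x* = 0.5 the map's successors cluster near its maximum of 1, and the resampled values reflect that conditional structure rather than the marginal distribution.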

The Nearest Neighbor Resampling Algorithm

In this section a new algorithm for generating synthetic time series samples by bootstrapping (i.e., resampling the original time series with replacement) is presented. Denote the time series by x_t, t = 1, ..., n, and assume a known dependence structure, that is, which and how many lags the future flow will depend on. This conditioning set is termed a "feature vector," and the simulated or forecasted value, the "successor." The strategy is to find the historical nearest neighbors of the current feature vector and resample from their successors. Rather than resampling uniformly from the k successors, a discrete resampling kernel is introduced to weight the resamples to reflect the similarity of the neighbor to the conditioning point. This kernel decreases monotonically with distance and adapts to the local sampling density, to the dimension of the feature vector, and to boundaries of the sample space. An attractive probabilistic interpretation of this kernel consistent with the nearest neighbor density estimator is also offered. The resampling strategy is presented through the following flowchart:

Figure 2. A plot of x_{t+1} versus x_t for the time series generated from the model x_{t+1} = 1 - 4(x_t - 0.5)^2. The state space for x is discretized into four states, as shown. Also shown are windows A and B with "whiskers" located over two points x*_A and x*_B. These windows represent a k nearest neighborhood of the corresponding x_t. In general, these windows will not be symmetric about the x_t of interest. One can think of state transition probabilities using these windows in much the same way as with the multistate Markov chain. A value of x_{t+1} conditional to point A or B can be bootstrapped by appropriately sampling with replacement one of the values of x_{t+1} that fall in the corresponding window.

1. Define the composition of the "feature vector" D_t of dimension d, e.g.,

Case 1: D_t: (x_{t-1}, x_{t-2}); d = 2

Case 2: D_t: (x_{t-τ1}, ..., x_{t-M1·τ1}; x_{t-τ2}, ..., x_{t-M2·τ2}); d = M1 + M2

Case 3: D_t: (x1_{t-τ1}, ..., x1_{t-M1·τ1}; x2_t, x2_{t-τ2}, ..., x2_{t-M2·τ2}); d = M1 + M2 + 1

where τ1 (e.g., 1 month) and τ2 (e.g., 12 months) are lag intervals, and M1, M2 > 0 are the number of such lags considered in the model.

Case 1 represents dependence on two prior values. Case 2 permits direct dependence on multiple time scales, allowing one to incorporate monthly and interannual dependence. For case 3, x1 and x2 may refer to rainfall and runoff or to two different streamflow stations.

2. Denote the current feature vector as D_i and determine its k nearest neighbors among the D_t, using the weighted Euclidean distance

$$r_{it} = \left[ \sum_{j=1}^{d} w_j (v_{ij} - v_{tj})^2 \right]^{1/2} \qquad (3)$$

where v_{ij} is the jth component of D_i, and w_j are scaling weights (e.g., 1 or 1/s_j, where s_j is some measure of scale such as the standard deviation or range of v_j).

The weights w_j may be specified a priori, as indicated above, or may be chosen to provide the best forecast for a particular successor in a least squares sense [see Yakowitz and Karlsson, 1987]. Denote the ordered set of nearest neighbor indices by J_i^k. An element j(i) of this set records the time t associated with the jth closest D_t to D_i. Denote x_{j(i)} as the successor to D_{j(i)}.
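The scaled distance of equation (3) is a one-liner per candidate. A sketch (Python/NumPy; function and variable names are ours, and the tiny data matrix is made up), using w_j = 1/s_j with s_j the standard deviation of each component:

```python
import numpy as np

def weighted_distances(D, feat, s):
    """Equation (3): weighted Euclidean distance from the current feature
    vector `feat` to every historical feature vector (row of D), with
    weights w_j = 1/s_j for scale measures s_j."""
    w = 1.0 / s
    return np.sqrt((((D - feat) ** 2) * w).sum(axis=1))

D = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [3.0, 30.0]])
s = D.std(axis=0)                   # scale measure s_j for each component
dist = weighted_distances(D, np.array([1.0, 10.0]), s)
order = np.argsort(dist)            # ordered neighbor indices, nearest first
print(order)
```

The scaling keeps a component with large units (here the second column) from dominating the neighbor ordering.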


Figure 3. Illustration of resampling weights, K(j(i)), at selected conditioning points x_i, using k = 10 with a sample of size 100 from an exponential distribution with parameter of 1. The original sampled values are shown at the top of the figure. Note how the bandwidth and kernel shape vary with sampling density.

If the data are highly quantized, it is possible that a number of observations may be the same distance from the conditioning point. The resampling kernel defined in step 3 (below) is based on the order of elements in J_i^k. Where a number of observations are the same distance away, the original ordering of the data can impact the ordering in J_i^k. To avoid such artifacts, the time indices t are copied into a temporary array that is randomly permuted prior to distance calculations and creation of the list J_i^k.
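The random permutation used to break ties can be illustrated directly (Python/NumPy; the distance values are made up):

```python
import numpy as np

rng = np.random.default_rng(0)
d = np.array([0.3, 0.1, 0.1, 0.1, 0.5])   # distances from D_i; three exact ties

# A stable argsort breaks ties by original data order every time:
print(np.argsort(d, kind="stable")[:3])    # always [1, 2, 3]

# Randomly permute the indices first, then sort, so tied neighbors
# enter the list J_i^k in random order:
perm = rng.permutation(len(d))
order = perm[np.argsort(d[perm], kind="stable")]
print(order[:3])  # the tied indices 1, 2, 3 in some random order
```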

3. Define a discrete kernel K(j(i)) for resampling one of the x_{j(i)} as follows:

$$K(j(i)) = \frac{1/j}{\sum_{i=1}^{k} 1/i} \qquad (4)$$

where K(j(i)) is the probability with which x_{j(i)} is resampled. This resampling kernel is the same for any i and can be computed and stored prior to the start of the simulation.

4. Using the discrete probability mass function (p.m.f.) K(j(i)), resample an x_{j(i)}, update the current feature vector, and return to step 2 if additional simulated values are needed.

A similar strategy for time series forecasting is possible. An m-step-ahead forecast is obtained by using the corresponding generalized nearest neighbor regression estimator:

$$\hat{x}_{i,m} = \sum_{j=1}^{k} K(j(i))\, x_{j(i),m} \qquad (5)$$

where x̂_{i,m} and x_{j(i),m} denote the mth successor to i and j(i), respectively.

Parameters of the Nearest Neighbor Resampling Method

Choosing the Weight Function K(v)

The goals for designing a resampling kernel are to (1) reduce the sensitivity of the procedure to the actual choice of k, (2) keep the estimator local, (3) have k sufficiently large to avoid simulating nearly identical traces, and (4) develop a weight function that adapts automatically to boundaries of the domain and to the dimension d of the feature vector D_i. These criteria suggest a resampling kernel that decreases monotonically as r_{ij} increases.

Consider a d-dimensional ball of volume V(r) centered at D_i. The observation D_{j(i)} falls in this ball when the ball is exactly of volume V(r_{ij(i)}). Assuming that the observations are independent (which they may not be), the likelihood with which the j(i)th observation should be resampled as representative of D_i is proportional to 1/V(r_{ij(i)}).

Now, consider that in a small locale of D_i, the local density can be approximated as a Poisson process, with constant rate λ. Under this assumption the expected value of 1/V(r_{ij(i)}) is

$$E\{1/V(r_{ij(i)})\} = \lambda/j \qquad (6)$$

The kernel K(j(i)) is obtained by normalizing these weights over the k nearest neighborhood,

$$K(j(i)) = \frac{c}{j} = \frac{1/j}{\sum_{i=1}^{k} 1/i} \qquad (7)$$

where c is a constant of proportionality. These weights do not explicitly depend on the dimension d of the feature vector D_i. The dependence of the resampling scheme on d is implicit through the behavior of the distance calculations used to find nearest neighbors as d varies. Initially, we avoided making the assumption of a local Poisson distribution and defined K(j(i)) through a normalization of 1/V(r_{ij(i)}). This approach gave satisfactory results as well but was computationally more demanding. The results obtained using (7) were comparable for a given k.
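Putting steps 1 through 4 together with the kernel of equation (4): the sketch below (Python/NumPy; the function and variable names are ours, and it uses Case 1-style feature vectors of the d previous values with unit weights w_j = 1) simulates a synthetic trace by resampling an observed series:

```python
import numpy as np

def nn_bootstrap(x, d=2, k=10, length=500, seed=0):
    """Nearest neighbor bootstrap (steps 1-4) for a scalar series x, using
    the d previous values as the feature vector, unit weights w_j = 1, and
    the discrete kernel K(j(i)) = (1/j) / sum_{i=1}^{k} (1/i)."""
    rng = np.random.default_rng(seed)
    n = len(x)
    # Historical feature vectors D_t = (x_{t-1}, ..., x_{t-d}) and successors.
    D = np.column_stack([x[d - 1 - j: n - 1 - j] for j in range(d)])
    succ = x[d:]
    # Resampling kernel, computed once (step 3).
    K = 1.0 / np.arange(1, k + 1)
    K /= K.sum()
    out = list(x[:d])                # seed the simulation with observed values
    for _ in range(length - d):
        feat = np.array(out[-1:-d - 1:-1])           # current feature vector
        r = np.sqrt(((D - feat) ** 2).sum(axis=1))   # Euclidean distance (3)
        nn = np.argsort(r)[:k]                       # k nearest neighbors (step 2)
        j = rng.choice(k, p=K)                       # draw a neighbor rank (step 4)
        out.append(succ[nn[j]])
    return np.array(out)

# Resample a synthetic AR(1)-like series.
rng = np.random.default_rng(3)
z = np.empty(1000)
z[0] = 0.0
for t in range(999):
    z[t + 1] = 0.7 * z[t] + rng.normal()

sim = nn_bootstrap(z, d=1, k=10, length=500)
# Every simulated value is one of the original observations (a bootstrap),
# while the kernel-weighted neighbor selection preserves serial dependence.
print(np.isin(sim, z).all())
```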

The behavior of this kernel in the boundary region, the interior, and the tails is seen in Figures 3 and 4. From Figure 3, observe that the nearest neighbor method allows considerable variation in the "bandwidth" (in terms of a range of values of x) as a function of position and underlying density. The bandwidth becomes automatically larger as the density becomes sparser and flatter. In regions of high data density (left tail or interior) the kernel is nearly symmetric (the slight asymmetry follows the asymmetry in the underlying distribution). Along the sparse right tail the kernels are quite asymmetric, as expected. Some attributes of these kernels relative to a uniform kernel (with the same k), used by the ordinary nearest neighbor method, are shown in Figure 4.

For bounded data (e.g., streamflow that is constrained to be greater than 0), simulation of values across the boundary is often a concern. This problem is avoided in the method presented, since the resampling weights are defined only for the sample points. A second problem with bounded data is that bias in estimating the target function using local averages increases near the boundaries. This bias can be recognized by


Figure 4. Illustration of weights K(j(i)) versus weights from the uniform kernel applied at three points selected from Figure 3. The uniform kernel weights are 1/k for each j in J_i^k. The effective centroid corresponding to each kernel for each conditioning point i (in each case with the highest value of K(j(i))) is shown. For i in the interior of the data (Figure 4b) the centroids of both the uniform kernel and K(j(i)) coincide with i. Toward the edges of the data (Figures 4a and 4c) the centroid corresponding to K(j(i)) is closer to i than that for the uniform kernel. The K(j(i)) are thus less biased than the uniform kernel for a given k. The kernel K(j(i)) also has a lower variance than the uniform kernel for a given value of k.

    observingthat

    the

    centroid

    of the

    data

    in thewindowdoesn ot

    lie

    at the

    point

    ofestimate. From

    Figure

    4note that while the

    kernel

    is

    biased toward

    th e

    edges

    of the

    data, that

    is, the

    centroid of the kernel does not match the

    conditioning

    point,

    th e

    bias

    is much

    smaller than

    for the

    uniform kernel.

If one insists on kernels that are strictly positive with monotonically decreasing weights with distance, it may not be possible to devise kernels that are substantially better in terms of bias in the boundary region. The primary advantages of the kernel introduced here are that (1) it adapts automatically to the dimension of the data; (2) the resampling weights need to be computed just once for a given k (there is no need to recompute the weights or their normalization to 1); and (3) the bad effects of data quantization or clustering on the resampling strategy (clustering could lead to near zero values of r_ij at some points), which arise if one were to resample using a kernel that depends directly on distance (e.g., proportional to 1/r_ij), are avoided. These factors translate into considerable savings in computational time, which can be important for large data sets and high dimensional settings, and into improved stability of the resampling algorithm.
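In code, the resampling step is simple. The sketch below (Python, which is not used in the paper) assumes the rank-based weights K(j(i)) = (1/j) / sum_{i=1}^{k} (1/i) defined earlier in the paper; as noted above, the weights depend only on the neighbor rank, so they are computed and normalized once for a given k.

```python
import numpy as np

def knn_weights(k):
    """Rank-based resampling weights K(j) = (1/j) / sum_{i=1}^k (1/i).

    The weights depend only on the neighbor rank j, not on the distances
    themselves, so they are computed and normalized once for a given k.
    """
    w = 1.0 / np.arange(1, k + 1)
    return w / w.sum()

def resample_successor(series, current, k, rng):
    """One lag 1 k-nn bootstrap step: rank the predecessors by distance
    from `current`, pick a neighbor with probability K(j), and return
    that neighbor's successor."""
    predecessors = series[:-1]
    ranks = np.argsort(np.abs(predecessors - current))  # nearest first
    j = rng.choice(k, p=knn_weights(k))
    return series[ranks[j] + 1]

rng = np.random.default_rng(1)
x = rng.normal(size=500)
k = int(round(np.sqrt(len(x))))  # the prescriptive k = n^(1/2)
nxt = resample_successor(x, x[10], k, rng)
```

Because the weights never depend on the actual distances, quantized or clustered data cannot produce degenerate (near zero or unnormalized) weights.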

Choosing the Number of Neighbors k and Model Order d

The order of ARMA models is often picked [Loucks et al., 1981] using the Akaike information criterion (AIC). Such criteria estimate the variance of the residuals of time series forecasts from a model, appropriately penalized for the effective degrees of freedom in the model. A similar perspective based on cross validation is advocated here. Cross validation involves "fitting" the model by leaving out one value at a time from the data and forecasting it using the remainder. The model that yields the least predictive sum of squares of errors across all such forecasts is picked. One can approximate the average effect of such an exercise on the sum of squares of errors without going through the process. Here, the forecast is formed (equation (5)) as a weighted average of the successors. When using the full sample, define the weight used with the successor to the current point as w_jj. This weight recognizes the influence of that point on the estimate at the same location. Hence the influence of the rest of the points on the fit at that point is (1 - w_jj). This suggests that if the estimated full sample forecast error e_j is divided by (1 - w_jj), a measure of what the error may be if the data point (D_j, x_j) was not used in developing the estimate is provided.

Table 1. Statistical Comparison of k-nn and AR1 Model Simulations Applied to an AR1 Sample

                                    5% Quantile       Median        95% Quantile
  Statistic            Sample     k-nn     AR1     k-nn    AR1     k-nn     AR1
  Mean                  0.04     -0.14   -0.12     0.02    0.04    0.24    0.20
  Standard deviation    1.11      1.02    1.03     1.10    1.11    1.18    1.20
  Skew                 -0.17     -0.32   -0.25    -0.18    0.00   -0.03    0.21
  Lag 1 correlation     0.63      0.56    0.57     0.62    0.63    0.68    0.69

Note that the degrees of freedom of the estimate (e.g., 0 for k = 1 and 4/3 for k = 3 using a uniform kernel) are implicit in this idea. Craven and Wahba [1979] present a generalized cross validation (GCV) score function that considers the average influence of excluded observations for estimation at each sample point and approximates the predictive squared error of estimate. The GCV score is given as

    GCV = [ (1/n) sum_{j=1}^{n} e_j^2 ] / [ 1 - (1/n) sum_{j=1}^{n} w_jj ]^2        (8)

The GCV score function can be used to choose both k and d. For the kernel suggested in this paper, w_jj is a constant for a given k, and the GCV can be written as

    GCV = [ (1/n) sum_{j=1}^{n} e_j^2 ] / (1 - w_jj)^2        (9)

A prescriptive choice of k = n^{1/2} from experience is also suggested. This is a good choice for 1
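As a sketch of this selection procedure, equation (9) can be evaluated on a grid of k values and minimized. The sketch assumes a lag 1 model and assumes that, with the rank-based kernel, each point is its own rank 1 neighbor (distance zero), so that w_jj = K(1) is the same constant for all j; the AR1 test series and its coefficient below are illustrative only.

```python
import numpy as np

def knn_weights(k):
    # rank-based weights K(j) = (1/j) / sum_{i=1}^k (1/i)
    w = 1.0 / np.arange(1, k + 1)
    return w / w.sum()

def gcv_score(series, k):
    """GCV(k) = (sum_j e_j^2 / n) / (1 - w_jj)^2 for a lag 1 k-nn forecast.

    Assumption: each point is its own rank 1 neighbor (distance zero),
    so the full-sample weight on its own successor is w_jj = K(1),
    the same constant for every j.
    """
    w = knn_weights(k)
    pred, succ = series[:-1], series[1:]
    n = len(pred)
    resid = np.empty(n)
    for i in range(n):
        order = np.argsort(np.abs(pred - pred[i]))[:k]  # rank 1 is i itself
        resid[i] = succ[i] - np.dot(w, succ[order])     # full-sample error e_i
    return (resid ** 2).mean() / (1.0 - w[0]) ** 2

# illustrative AR1 test series (the coefficient 0.6 is arbitrary)
rng = np.random.default_rng(0)
x = np.zeros(300)
for t in range(1, 300):
    x[t] = 0.6 * x[t - 1] + rng.normal()

scores = {k: gcv_score(x, k) for k in range(2, 31)}
k_best = min(scores, key=scores.get)
```

Because w_jj is constant for a given k, the denominator is evaluated once per k, and only the residuals need to be recomputed.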


Figure 6. (a) An ASH estimate of the bivariate probability density f(x_t, x_{t-1}) for the SETAR sample, the thick curve denoting a LOWESS smooth, and (b) an average of the ASH estimates of the bivariate probability density f(x_t, x_{t-1}) across 100 k-nn resamples from the SETAR sample.

ings are reproduced. One can try various combinations of k and d and visually compare resampling attributes with historical sample attributes (sample moments and marginal or joint probability densities).

Applications

Two synthetic examples, one from a linear and one from a nonlinear model, are presented first. These are followed by an application to monthly flows from the Weber River near Oakley, Utah. In all cases a lag 1 model with k chosen as n^{1/2} was used. Comparative performance of the simulations is judged using sample moments and sample p.d.f.'s estimated using averaged shifted histograms (ASH; Scott [1992, chap. 5]). In all applications using the univariate ASH, a bin width of (x_max - x_min)/9.1, where x_max and x_min are the respective maximum and minimum values of the data, and five shifted histograms were used. For bivariate densities a bin width of (x_max - x_min)/3.6 and five shifted histograms in each coordinate direction were used. These are the default settings for the computer code distributed by D. Scott. Conditional expectations, E(x_t | x_{t-1}), are estimated using LOWESS [Cleveland, 1979]. LOWESS is a popular robust, locally weighted linear regression technique that allows a flexible curve to be fit between two variables. We used default parameter choices (three iterations for computing the robust estimates, based on two thirds of the data) with the "lowess" function available in S-Plus [Statistical Sciences, Inc., 1991].
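A univariate ASH of the kind used here can be sketched as follows. This is a plain re-implementation of the averaging idea (averaging m shifted histograms is equivalent to triangular weighting of fine-bin counts), not the code distributed by D. Scott; the bin width divisor 9.1 and the five shifts follow the text.

```python
import numpy as np

def ash_density(data, n_shifts=5, bin_divisor=9.1):
    """Univariate averaged shifted histogram (ASH).

    Bin width h = (x_max - x_min) / bin_divisor; averaging n_shifts
    histograms shifted by h/n_shifts is equivalent to weighting fine-bin
    counts (bin width h/n_shifts) by triangular weights (1 - |l|/m).
    """
    data = np.asarray(data, dtype=float)
    n = len(data)
    m = n_shifts
    h = (data.max() - data.min()) / bin_divisor
    delta = h / m
    lo = data.min() - h                      # pad so shifts stay in range
    nfine = int(np.ceil((data.max() + h - lo) / delta))
    counts = np.zeros(nfine)
    idx = np.minimum(((data - lo) / delta).astype(int), nfine - 1)
    np.add.at(counts, idx, 1.0)
    dens = np.zeros(nfine)
    for l in range(1 - m, m):
        shifted = np.roll(counts, l)
        if l > 0:                            # zero the wrapped-around bins
            shifted[:l] = 0.0
        elif l < 0:
            shifted[l:] = 0.0
        dens += (1.0 - abs(l) / m) * shifted
    dens /= n * h
    centers = lo + (np.arange(nfine) + 0.5) * delta
    return centers, dens

rng = np.random.default_rng(2)
sample = rng.normal(size=1000)
centers, dens = ash_density(sample)
```

The estimate integrates to one over the padded grid and is piecewise constant on bins of width h/m, which is what makes ASH a cheap compromise between a histogram and a kernel density estimate.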

Figure 7. The Weber River monthly streamflow time series. Flows for 2 years (1906 (year 2) and 1933 (year 29)) are shown in the lower panels. Note that the asymmetry about the peak monthly flow in 1906 clearly shows that the time series is irreversible; that is, its properties will be quite different in reverse time.

Example E1

The first data set considered is a sample of size 500 from a linear autoregressive model of order 1, AR1, defined as

    x_t = phi x_{t-1} + e_t        (10)

where e_t is a mean zero, variance 1, Gaussian random variable. One hundred realizations, each of length 500, were then generated from an AR1 model fitted to the sample and from the nearest neighbor (k-nn) bootstrap. Selected statistics from these simulations are compared in Table 1. The sample statistics considered are reproduced by the k-nn method, while the parametric AR1 model reproduces only the sample statistics used in fitting it. The AR1 simulations instead reproduce population or model statistics (e.g., skew = 0) for parameters that are not explicitly fitted to the sample. With repeated applications to a number of samples from the same distribution, the k-nn procedure reproduces the population statistics as well. On the other hand, a parametric model only reproduces fitted statistics. The ASH estimated median p.d.f. of x_t from the k-nn resamples matches the sample p.d.f., and the scatter of the estimated p.d.f.'s across resamples is comparable to the scatter from the AR1 samples. In order to save space, these results are not reproduced here.
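The Example E1 experiment can be sketched as below. The AR1 coefficient in equation (10) is not legible in this copy, so the value 0.6 used here is illustrative only; note that the replicate is a bootstrap in the strict sense, drawing only values present in the historical sample.

```python
import numpy as np

def knn_weights(k):
    w = 1.0 / np.arange(1, k + 1)   # K(j) = (1/j) / sum_{i=1}^k (1/i)
    return w / w.sum()

def knn_bootstrap(series, length, k, rng):
    """Generate a lag 1 k-nn bootstrap replicate of `series`."""
    w = knn_weights(k)
    pred, succ = series[:-1], series[1:]
    out = np.empty(length)
    out[0] = rng.choice(series)                       # random starting value
    for t in range(1, length):
        order = np.argsort(np.abs(pred - out[t - 1]))[:k]
        out[t] = succ[order[rng.choice(k, p=w)]]      # resample a successor
    return out

# illustrative AR1 sample; the coefficient 0.6 stands in for eq. (10)
rng = np.random.default_rng(3)
n = 500
x = np.zeros(n)
for t in range(1, n):
    x[t] = 0.6 * x[t - 1] + rng.normal()

rep = knn_bootstrap(x, n, k=int(round(np.sqrt(n))), rng=rng)

def lag1(s):
    return np.corrcoef(s[:-1], s[1:])[0, 1]
```

Comparing lag1(x) with lag1(rep) over many replicates gives the kind of quantile summary reported in Table 1.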

Example E2

A sample of size 200 was generated from a self-exciting threshold autoregressive (SETAR) model described by Tong [1990, pp. 99-101]. The general structure of such models is similar to that of a linear autoregressive model, with the difference being that the parameters of the model change upon crossing one or more thresholds. Such a model may be appropriate for daily streamflow, since crossing a flow threshold (defined on a single past flow or collectively on a set of past flows) with flow increasing may signal runoff response to rainfall or snow melt, and crossing the threshold with flow decreasing may signal a return to base flow or recession behavior. Here a lag 1 SETAR model was used:

    x_t = 0.4 + 0.5 x_{t-1} + e_t        if x_{t-1} <= r
    x_t = 1.5 - 0.5 x_{t-1} + e_t        if x_{t-1} > r

where r is the regime threshold; the exact specification follows Tong [1990, pp. 99-101].
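A two-regime lag 1 SETAR series of the kind described can be simulated as follows; the coefficients and threshold below are illustrative stand-ins (the exact values are only partly legible in this copy), not necessarily Tong's.

```python
import numpy as np

def setar_simulate(n, threshold, rng):
    """Two-regime lag 1 SETAR: the AR parameters switch when x_{t-1}
    crosses the threshold. Coefficients here are illustrative only."""
    x = np.zeros(n)
    for t in range(1, n):
        e = rng.normal()
        if x[t - 1] <= threshold:
            x[t] = 0.4 + 0.5 * x[t - 1] + e   # lower regime (illustrative)
        else:
            x[t] = 1.5 - 0.5 * x[t - 1] + e   # upper regime (illustrative)
    return x

rng = np.random.default_rng(6)
y = setar_simulate(200, threshold=1.0, rng=rng)
```

The conditional mean of such a series is piecewise linear in x_{t-1}, which is the nonlinearity the k-nn bootstrap is shown to track in Figure 6.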


Figure 10. (a) An ASH estimate of the bivariate probability density f(x_t, x_{t-1}) for the October/November historical flows, the thick curve denoting a LOWESS smooth; (b) an average of the ASH estimates of f(x_t, x_{t-1}) across 100 simulations from an AR1 model; and (c) an average of the ASH estimates of f(x_t, x_{t-1}) across 100 k-nn resamples. Values are given in cubic feet per second (1 cfs equals 0.028 m³/s).

snow melt. January is typically the month with the lowest flow. The 1905-1988 monthly time series and flows for two specific years are presented in Figure 7.

A monthly AR1 model was fitted to logarithmically transformed monthly flows, with monthly varying parameters estimated as described by Loucks et al. [1981, p. 285] to preserve moments in real space. This entails sequentially simulating monthly flow moving through the calendar year, using (for example) only the 83 monthly values for January and February over the 83-year record to simulate February flows given January flows. One hundred simulations of 83 years were generated in each case. The k-nn bootstrap is applied in a similar manner, using (for example) January flows to find neighbors for a given January flow and the corresponding February successor. The flow data are not logarithmically transformed for the k-nn bootstrap.

Selected results are presented. A comparison of means, standard deviations, skews, and lag 1 correlations is presented for the 12 months in Figure 8. Both the k-nn and the AR1 model seem to be comparable in reproducing the mean monthly flow and its variation across simulations. The standard deviations of monthly flows are somewhat more variable in k-nn simulations than those from the AR1 model. Some months (low flows) have a slight downward bias in the simulated standard deviation, while others (March and July, the months following a minimum and a maximum in monthly flow, respectively) show a larger spread in standard deviation than in the AR1 case. While both models seem to do well in reproducing the historical lag 1 correlation, the k-nn statistics appear to be more variable across realizations. A difference between the two simulators is apparent in Figure 8c, where the AR1 model fails to reproduce the monthly and the annual skews as well as the k-nn model does. Recall that the ad hoc prescriptive choice of k = n^{1/2} was used here, with no attempt at fine tuning the k-nn simulator.
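The month-by-month procedure described above can be sketched as follows, with synthetic data standing in for the Weber River record; neighbors of the current month's value are sought only among that month's historical values, and the corresponding next-month successor is resampled.

```python
import numpy as np

def knn_weights(k):
    w = 1.0 / np.arange(1, k + 1)   # K(j) = (1/j) / sum_{i=1}^k (1/i)
    return w / w.sum()

def monthly_knn_bootstrap(flows, n_years_out, k, rng):
    """Periodic lag 1 k-nn bootstrap for a (n_years, 12) flow matrix.

    To simulate month m+1 given month m, neighbors of the current
    month-m value are found only among the historical month-m values,
    and one of their month-(m+1) successors is resampled; December
    wraps to January of the following year.
    """
    w = knn_weights(k)
    sim = np.empty((n_years_out, 12))
    sim[0, 0] = rng.choice(flows[:, 0])
    prev = sim[0, 0]
    for t in range(1, n_years_out * 12):
        y, m = divmod(t, 12)
        pm = (m - 1) % 12                 # month being conditioned on
        if pm == 11:                      # December -> January pairs
            pred, succ = flows[:-1, 11], flows[1:, 0]
        else:
            pred, succ = flows[:, pm], flows[:, pm + 1]
        order = np.argsort(np.abs(pred - prev))[:k]
        prev = succ[order[rng.choice(k, p=w)]]
        sim[y, m] = prev
    return sim

rng = np.random.default_rng(4)
hist = np.exp(rng.normal(size=(83, 12)))   # synthetic positive "flows"
sim = monthly_knn_bootstrap(hist, n_years_out=83, k=9, rng=rng)
```

No transformation of the flows is needed, and every simulated value is a historical value for that calendar month, so monthly marginal behavior is preserved by construction.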

The marginal probability density functions were estimated by ASH for flow in each month. Selected results for simulations from the k-nn and for the AR1 model applied to logarithmically transformed flows for 3 months are illustrated in Figure 9. We see from Figure 9 that the k-nn samples are indeed a bootstrap; that is, the simulated marginal probability densities behave much like the empirical sample probability density. The usual shortcoming of the bootstrap in reproducing only historical sample values is also apparent. We see that while the lognormal density used by the AR1 model is plausible in a number of months (e.g., October), the lognormal model seems to be inappropriate in other months, for example, March, where the skew is too extreme for the AR1, and June, which exhibits a distinct bimodality that may be related to the timing or amount of snow melt. The latter is interesting, since the 100 simulations from the AR1 model fail to bracket the two prominent modes of the ASH density estimate, lending support to the idea of bimodality under a pseudohypothesis test obtained from this Monte Carlo experiment.

The bivariate probability density functions for flows in each consecutive pair of months (e.g., May and June) were also computed by ASH. Results for selected months are presented in Figures 10 through 12. Other month pairs were found to exhibit features similar to those in Figures 10 through 12. In each case we present a scatterplot of the flows (in cfs) in the 2 months, with a LOWESS [Cleveland, 1979] smooth of the conditional expectation E(x_t | x_{t-1}). An examination of the October/November density in Figure 10 reveals that the AR1 model may be quite appropriate for this pair of months. The ASH estimated density from the sample and the averages of the ASH estimated densities from the AR1 and the k-nn samples are all very similar. The LOWESS estimate of the conditional expectation of the November flow, given the October flow, is very nearly a straight line.

Figures 11 and 12 refer to the months of April/May and May/June, where runoff from snow melt becomes important. The timing of the start and of the peak rate of snow melt vary over this period. Consequently, one can expect some heterogeneity in the sampling distributions of flows in these months. From Figure 11a we see that the LOWESS estimate of E(x_t | x_{t-1}) exhibits some degree of nonlinearity for April/May. The slope of E(x_t | x_{t-1}) for x_{t-1} < 150 cfs (4.2 m³/s) is quite different from the slope for x_{t-1} > 150 cfs. This is reminiscent of the SETAR model examined earlier. We could belabor this point through formal tests of significance for difference in slope. Our purpose here is to show that the k-nn approach can adapt to such sample features, while the AR1 model may not. The average bivariate densities of the simulations based on ASH once again reinforce this difference between the k-nn and the AR1 models and their attributes. Note also that the AR1 simulations do not reproduce the sample skews for this period either.

Figure 11. (a) An ASH estimate of the bivariate probability density f(x_t, x_{t-1}) for the April/May historical flows, the thick curve denoting a LOWESS smooth; (b) an average of the ASH estimates of f(x_t, x_{t-1}) across 100 simulations from an AR1 model; and (c) an average of the ASH estimates of f(x_t, x_{t-1}) across 100 k-nn resamples. Values are given in cubic feet per second (1 cfs equals 0.028 m³/s).

The May/June analysis, shown in Figure 12, is marked by considerably increased variability in streamflow as the snow melt runoff develops. Once again, LOWESS shows some degree of nonlinearity in E(x_t | x_{t-1}), with the slope of the relationship smaller for high flows than for low flows. A comparison of the ASH bivariate density contours in Figures 12a and 12c reveals that the AR1 density is oriented quite differently from the ASH estimate for the raw sample and is unable to reproduce the degree of heterogeneity in the sample density. Recall that the marginal density of June flows was bimodal, with an antimode around 1000 cfs (28 m³/s). The antimode suggests that the data is clustered into two classes of events: those with flow below 1000 cfs (mode at 700 cfs (19.6 m³/s)) and those with flow above 1000 cfs (mode at 1300 cfs (36.4 m³/s)). The LOWESS fit suggests that the June flows have an expectation close to 1000 cfs for May flows greater than about 700 cfs. It appears that the conditional expectation averages across the two modes for June flows and that the conditional density (Figure 12b) of June flows, given May flow, may be bimodal, as seen in the marginal density plot for June flows.

The significance of the findings reported above is that the nearest neighbor bootstrap provides a rather flexible and adaptive method for reproducing the historical frequency distribution of streamflow. The possibly tenuous issue of choosing between a variety of candidate parametric models month by month is avoided. Matching the historical frequency distribution of flows properly is important for properly estimating storage requirements for a reservoir and analyzing reservoir release options. For snowmelt-driven streams in arid regions the timing and amount of melt is important in determining reservoir operation. The bimodality in the probability density of monthly streamflow during the melt months may be connected to structured low-frequency (interannual and interdecadal) climatic fluctuations in this area [see Lall and Mann, 1995]. This would be a significant factor for reservoir operation, since the timing and amount of snowmelt may be associated with a circulation pattern that corresponds to specific flow patterns in subsequent months as well. The nearest neighbor bootstrap would be an appropriate technique for simulating sequences conditioned on such factors. Work in this direction is in progress.

Summary and Discussion

A nearest neighbor method for a conditional bootstrap of time series was presented and exemplified. A corresponding forecasting strategy was indicated. Our contributions here lie primarily in the development of a new kernel, the suggestion of a parameter selection strategy, the application to a conditional bootstrap, and a demonstration of the methodology. It was shown that sample attributes are reproduced quite well by this nonparametric method for both synthetic and real data sets. Given the flexibility of these techniques, we consider them to have tremendous practical potential.

The parametric versus nonparametric statistical method debate often veers toward sample size requirements and statistical efficiency arguments. In the context of a resampling strategy, as espoused here, these arguments take on a somewhat different complexion. For some processes, such as daily streamflow, identification or even definition of an appropriate parametric model is problematic. In these cases, data are relatively plentiful. For such cases, methods such as those presented here are enticing. For monthly and annual flows, there is progressively less structure, and sample sizes are smaller. In these situations, parametric methods may indeed be statistically more efficient provided the correct model is identifiable and parsimonious. In our view, particularly for the snow-fed rivers of the western United States, this may not always be the case. Indeed, for the application presented here at the monthly timescale it is hard to justify choosing the parametric approach over the nearest neighbor method. The consideration of parameter uncertainty is justifiably considered a good idea in parametric time series resampling of streamflow [Grygier and Stedinger, 1990]. Likewise it may be useful to think about model uncertainty when developing parametric models. The latter consideration is implicit in the nonparametric approach, since a rather broad class of models is approximated.

Figure 12. (a) An ASH estimate of the bivariate probability density f(x_t, x_{t-1}) for the May/June historical flows, the thick curve denoting a LOWESS smooth; (b) an ASH estimate of the probability density of June flows conditional to a May flow of 867 cfs (24.3 m³/s); (c) an average of the ASH estimates of f(x_t, x_{t-1}) across 100 simulations from an AR1 model; and (d) an average of the ASH estimates of f(x_t, x_{t-1}) across 100 k-nn resamples. Values are given in cubic feet per second (1 cfs equals 0.028 m³/s).

The impact of varying the "parameters" k and the model order on specific attributes of the resamples bears further investigation. Our preliminary analyses suggest that the sensitivity of the scheme is limited over a range of k values near the "optimal" with the kernel used here. Formal investigations of this issue are being pursued.

One can devise a strategy that allows nearest neighbor resampling with perturbation of the historical data in the spirit of traditional autoregressive models, that is, a conditional expectation with an added random innovation. First, one evaluates the conditional expectation using the generalized nearest neighbor regression estimator for each vector D_i in the historical record. A residual e_i can be computed as the difference between the successor x_i of D_i and the nearest neighbor regression forecast. The simulation proceeds by estimating the nearest neighbor regression forecast relative to a conditioning vector D_i and then adding to this one of the e_j corresponding to a data point j that lies in the k nearest neighborhood J_{ik}. The innovation e_j is chosen using the resampling kernel K(j(i)). This scheme will perturb the historical data points in the series, with innovations that are representative of the neighborhood, and will thus "fill in" between the historical data values as well as extrapolate beyond the sample. The computational burden is increased, and there is a possibility that the bounds on the variables will be violated during simulation. However, there may be situations where the investigator may wish to adopt this strategy. Further exploration of this strategy is planned.
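The perturbed scheme described above might be sketched as follows (lag 1 case, scalar flows; the AR1 test series is illustrative): residuals about the nearest neighbor regression forecast are computed once, and simulation adds a neighborhood-resampled residual to each forecast.

```python
import numpy as np

def knn_weights(k):
    w = 1.0 / np.arange(1, k + 1)   # K(j) = (1/j) / sum_{i=1}^k (1/i)
    return w / w.sum()

def knn_forecast(pred, succ, value, k, w):
    """Nearest neighbor regression forecast: weighted mean of the
    successors of the k nearest predecessors of `value`."""
    order = np.argsort(np.abs(pred - value))[:k]
    return order, float(np.dot(w, succ[order]))

def perturbed_knn_simulate(series, length, k, rng):
    """k-nn resampling with perturbation: conditional expectation plus a
    residual resampled from the current neighborhood with weights K(j)."""
    w = knn_weights(k)
    pred, succ = series[:-1], series[1:]
    resid = np.empty(len(pred))          # e_j about the full-sample forecast
    for j in range(len(pred)):
        _, f = knn_forecast(pred, succ, pred[j], k, w)
        resid[j] = succ[j] - f
    out = np.empty(length)
    out[0] = rng.choice(series)
    for t in range(1, length):
        order, f = knn_forecast(pred, succ, out[t - 1], k, w)
        out[t] = f + resid[order[rng.choice(k, p=w)]]  # neighborhood innovation
    return out

rng = np.random.default_rng(5)
n = 300
x = np.zeros(n)
for t in range(1, n):                    # illustrative AR1 test series
    x[t] = 0.6 * x[t - 1] + rng.normal()
sim = perturbed_knn_simulate(x, n, k=int(round(np.sqrt(n))), rng=rng)
```

Unlike the pure bootstrap, the simulated values need not coincide with historical values, which is exactly the "filling in" (and the bound-violation risk) noted above.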

Issues such as disaggregation of streamflows bear further investigation. One strategy is trivial: resample the flow vector that aggregates to the aggregate flow simulated. A question that arises is whether there is even any need to work with models that disaggregate (especially in time) using these methods. One may wish to work directly with, say, the daily flows, conditioned on a sequence of past daily flows and weekly or monthly flows.

The real utility of the method presented here may lie in exploiting a dependence structure (e.g., in daily flows) that is difficult to treat by traditional methods, as well as complex relationships between variables, and in estimating confidence limits or risk in problems that have a time series structure. The traditional time series analysis framework directs the researcher's attention toward an efficient estimation of model parameters under some metric (e.g., least squares or maximum likelihood). The performance metric of interest to the hydrologist may not be the one optimal for the estimation of a certain set of parameters and selected model form. There is reason to directly explore other aspects of the problem that may be of direct interest for reservoir operation and flood control, using flexible, adaptive, data exploratory methods. Such investigations using the k-nn bootstrap are in progress.

Acknowledgments. The work reported here was supported in part by the USGS through grant number 1434-92-G-2265. Discussions with and review of this work by David Tarboton are acknowledged. The comments of the anonymous reviewers resulted in a significant improvement of the manuscript.

