
Journal of Machine Learning Research 1 (2017) 1-48    Submitted 20/6/2017; Published

Robust Estimation of Derivatives Using Locally Weighted Least Absolute Deviation Regression

WenWu Wang [email protected]
Ping Yu [email protected]
School of Economics and Finance
The University of Hong Kong
Pokfulam Road, Hong Kong

Lu Lin [email protected]
Zhongtai Securities Institute for Financial Studies
Shandong University
Jinan, 250100, China

Tiejun Tong [email protected]
Department of Mathematics
Hong Kong Baptist University
Kowloon Tong, Hong Kong

Editor:

    Abstract

In nonparametric regression, the derivative estimation has attracted much attention in recent years due to its wide applications. In this paper, we propose a new method for the derivative estimation using the locally weighted least absolute deviation regression. Different from the local polynomial regression, the proposed method does not require a finite variance for the error term and so is robust to the presence of heavy-tailed errors. Meanwhile, it does not require a zero median or a positive density at zero for the error term in comparison with the local median regression. We further show that the proposed estimator with random difference is asymptotically equivalent to the (infinitely) composite quantile regression estimator. The proposed method is also extended to estimate the derivatives at the boundaries and to estimate the higher-order derivatives. For the equidistant design, we derive theoretical results for the proposed estimators, including the asymptotic bias and variance, consistency, and asymptotic normality. Finally, we conduct simulation studies to demonstrate that the proposed method has better performance than the existing methods in the presence of outliers and heavy-tailed errors.

Keywords: Composite quantile regression, Differenced method, LowLAD, Outlier and heavy-tailed error, Robust nonparametric derivative estimation

    1. Introduction

The derivative estimation is an important problem in nonparametric regression and it has applications in a wide range of fields. For instance, when analyzing human growth data (Müller, 1988; Ramsay and Silverman, 2002) or maneuvering target tracking data (Li and Jilkov, 2003, 2010), the first- and second-order derivatives of the height as a function of time are two important parameters, with the first-order derivative representing the speed and the second-order derivative representing the acceleration. The derivative estimates are also needed in change-point problems, e.g., for exploring the structures of curves (Chaudhuri and Marron, 1999; Gijbels and Goderniaux, 2005), for detecting the extremum of derivatives (Newell et al., 2005), for characterizing submicroscopic nanoparticles from scattering data (Charnigo et al., 2007, 2011a), for comparing regression curves (Park and Kang, 2008), for detecting abrupt climate changes (Matyasovszky, 2011), and for inferring the cell growth rates (Swain et al., 2017).

In the existing literature, one usually obtains the derivative estimates as a by-product by taking the derivative of a nonparametric fit of the regression function. There are three main approaches for the derivative estimation: smoothing spline, local polynomial regression, and differenced estimation. For smoothing spline, the derivatives are estimated by taking derivatives of the spline estimation of the regression function (Stone, 1985; Zhou and Wolfe, 2000). For local polynomial regression, a polynomial using the Taylor expansion is fitted locally by the kernel method (Ruppert and Wand, 1994; Fan and Gijbels, 1996; Delecroix and Rosa, 1996). These two methods both require an estimate of the regression function. And as pointed out in Wang and Lin (2015), the convergence rates of their regression function estimator and their derivative estimators are often different from each other. In particular, when the regression function estimator achieves the optimal rate of convergence, the corresponding derivative estimators may not be able to achieve it. In other words, minimizing the mean square error of the regression function estimator does not necessarily guarantee that the derivatives are optimally estimated (Wahba and Wang, 1990; Charnigo et al., 2011b).

For the differenced estimation, Müller et al. (1987) and Härdle (1990) proposed a cross-validation method to estimate the first-order derivative without estimating the regression function. Unfortunately, their method may not perform well in practice as the variance of their estimator is proportional to $n^2$ when the design points are equally spaced. Inspired by this, Charnigo et al. (2011b) and De Brabanter et al. (2013) proposed a variance-reducing estimator for the derivative function called the empirical derivative that is essentially a linear combination of the symmetric difference quotients. They further derived the order of the asymptotic bias and variance, and established the consistency of the empirical derivative. Wang and Lin (2015) represented the empirical derivative as a local constant estimator in locally weighted least squares regression (LowLSR), and proposed a new estimator for the derivative function to reduce the estimation bias in both valleys and peaks of the true derivative functions. Dai et al. (2016) further proposed a method to minimize the mean square error of the derivative estimation.

The aforementioned differenced derivative estimators are all based on the least squares method. Although very elegant, the least squares method is not robust and may often be sensitive to outliers (Huber and Ronchetti, 2009). To overcome the problem, various robust methods have been proposed in the literature to improve the estimation of the regression function, see, for example, kernel M-smoother (Härdle and Gasser, 1984), local least absolute deviation (Fan and Hall, 1994; Wang and Scott, 1994), and locally weighted least squares (Cleveland, 1979; Ruppert and Wand, 1994). In contrast, little attention has been paid to improving the derivative estimation except for the parallel developments of the above remedies (Härdle and Gasser, 1985; Welsh, 1996; Boente and Rodriguez, 2006). In particular, the convergence rates of the regression function estimator and the derivative estimators remain different and so call for a better solution.

In this paper, we propose a locally weighted least absolute deviation (LowLAD) method by combining the differenced method and the L1 regression systematically. Over a neighborhood centered at a fixed point, we first obtain a sequence of linear regression representations in which the derivative is estimated as the intercept term. We then estimate the derivative in a linear model by minimizing the sum of weighted absolute errors. The proposed method does not require a finite variance for the error term and so is robust to the presence of heavy-tailed errors. Meanwhile, it does not require a zero median or a positive density at zero for the error term, since the differenced errors are guaranteed to have a zero median and a positive symmetric density in the neighborhood of zero, regardless of the density of the original errors. By repeating this local fitting over a grid of points, we can obtain the derivative estimates on a discrete set of points. Finally, the entire derivative function is obtained by applying the local polynomial regression or the cubic spline interpolation.

The rest of the paper is organized as follows. Section 2 presents the motivation, the first-order derivative estimator and its theoretical properties, including the asymptotic bias and variance, consistency, and asymptotic normality. Section 3 studies the relation between the LowLAD estimator and the existing estimators. In particular, we show that the LowLAD estimator with random difference is asymptotically equivalent to the (infinitely) composite quantile regression estimator. Section 4 derives the first-order derivative estimation at the boundaries of the domain, and Section 5 generalizes the proposed method to estimate the higher-order derivatives. We then conduct extensive simulation studies to assess the finite-sample performance of the proposed estimators and compare them with the existing competitors in Section 6. Finally, we conclude the paper with some discussions in Section 7, and provide the proofs of the theoretical results in the Appendices.

    2. First-Order Derivative Estimation

Combining the differenced method and the L1 regression, we propose the LowLAD regression to estimate the first-order derivative. The new method inherits the advantage of the differenced method and also the robustness of the L1 method.

    2.1 Motivation

    Consider the nonparametric regression model

$$Y_i = m(x_i) + \epsilon_i, \quad 1 \le i \le n, \qquad (1)$$

where $x_i = i/n$ is the design point, $Y_i$ is the response variable, $m(\cdot)$ is an unknown regression function, and $\epsilon_i$ are independent and identically distributed (iid) random errors with a continuous density $f(\cdot)$.

First-order symmetric (about $i$) difference quotients (Charnigo et al., 2011b; De Brabanter et al., 2013) are defined as

$$Y_{ij}^{(1)} = \frac{Y_{i+j} - Y_{i-j}}{x_{i+j} - x_{i-j}}, \quad 1 \le j \le k, \qquad (2)$$

where $k$ is a positive integer. We decompose $Y_{ij}^{(1)}$ into two parts as

$$Y_{ij}^{(1)} = \frac{m(x_{i+j}) - m(x_{i-j})}{2j/n} + \frac{\epsilon_{i+j} - \epsilon_{i-j}}{2j/n}, \quad 1 \le j \le k. \qquad (3)$$

On the right hand side of (3), the first term contains the bias information of the true derivative, and the second term contains the variance information. By Wang and Lin (2015), the first-order derivative estimation based on the third-order Taylor expansion usually outperforms the estimation based on the first-order Taylor expansion due to bias correction. For the same reason, we assume that $m(\cdot)$ is three times continuously differentiable on $[0, 1]$. By the Taylor expansion, we obtain

$$\frac{m(x_{i+j}) - m(x_{i-j})}{2j/n} = m^{(1)}(x_i) + \frac{m^{(3)}(x_i)}{6}\frac{j^2}{n^2} + o\!\left(\frac{j^2}{n^2}\right), \qquad (4)$$

    where the estimation bias is contained in the remainder term of the Taylor expansion.

    By (3) and (4), we have

$$Y_{ij}^{(1)} = m^{(1)}(x_i) + \frac{m^{(3)}(x_i)}{6}\frac{j^2}{n^2} + o\!\left(\frac{j^2}{n^2}\right) + \frac{\epsilon_{i+j} - \epsilon_{i-j}}{2j/n}. \qquad (5)$$

In Proposition 10 (see Appendix A), we show that the median of $\epsilon_{i+j} - \epsilon_{i-j}$ is always zero, no matter whether the median of $\epsilon_i$ is zero or not. As a result, for any fixed $k = o(n)$, we have

$$\mathrm{Median}[Y_{ij}^{(1)}] \approx m^{(1)}(x_i) + \frac{m^{(3)}(x_i)}{6} d_j^2, \quad 1 \le j \le k, \qquad (6)$$

where $d_j = j/n$. We treat (6) as a linear regression with $d_j^2$ and $Y_{ij}^{(1)}$ as the independent and dependent variables, respectively. In the presence of heavy-tailed errors, we propose to estimate $m^{(1)}(x_i)$ as the intercept of the linear regression using the LowLAD method.

    2.2 Estimation Methodology

In order to derive the estimation bias, we further assume that $m(\cdot)$ is five times continuously differentiable, that is, the regression function is two degrees smoother than our postulated model due to the equidistant design. Following the paradigm of Draper and Smith (1981) and Wang and Scott (1994), we discard the higher-order terms of $m(\cdot)$ and assume locally that the true model is

$$Y_{ij}^{(1)} \approx \beta_{i1} + \beta_{i3} d_j^2 + \beta_{i5} d_j^4 + \zeta_{ij},$$

where $\beta_i = (\beta_{i1}, \beta_{i3}, \beta_{i5})^T = (m^{(1)}(x_i), m^{(3)}(x_i)/6, m^{(5)}(x_i)/120)^T$ are the unknown coefficients of the true underlying quintic function, and $\zeta_{ij} = (\epsilon_{i+j} - \epsilon_{i-j})/(2j/n)$ with $\mathrm{Median}[\zeta_{ij}] = 0$. Under the assumption of the true model, $\beta_{i1}$ can be estimated as

$$\min_{b_i} \sum_{j=1}^{k} w_j \bigl|Y_{ij}^{(1)} - (b_{i1} + b_{i3} d_j^2 + b_{i5} d_j^4)\bigr| = \min_{b_i} \sum_{j=1}^{k} \bigl|\widetilde{Y}_{ij}^{(1)} - (b_{i1} d_j + b_{i3} d_j^3 + b_{i5} d_j^5)\bigr|,$$


where $w_j = d_j$ are the weights, $b_i = (b_{i1}, b_{i3}, b_{i5})^T$, and $\widetilde{Y}_{ij}^{(1)} = (Y_{i+j} - Y_{i-j})/2$. Accordingly,

    the true model can be rewritten as

$$\widetilde{Y}_{ij}^{(1)} \approx \beta_{i1} d_j + \beta_{i3} d_j^3 + \beta_{i5} d_j^5 + \widetilde{\zeta}_{ij},$$

where $\widetilde{\zeta}_{ij} = (\epsilon_{i+j} - \epsilon_{i-j})/2$ are iid random errors with $\mathrm{Median}[\widetilde{\zeta}_{ij}] = 0$ and a continuous, symmetric density $g(\cdot)$ which is positive in a neighborhood of zero (see Appendix A).

Rather than the best L1 quintic fitting, we search for the best L1 cubic fitting to $\widetilde{Y}_{ij}^{(1)}$.

    Specifically, we estimate the model by LowLAD:

$$(\hat{\beta}_{i1}, \hat{\beta}_{i3}) = \arg\min_{b_{i1}, b_{i3}} \sum_{j=1}^{k} \bigl|\widetilde{Y}_{ij}^{(1)} - b_{i1} d_j - b_{i3} d_j^3\bigr|,$$

and define the LowLAD estimator of $m^{(1)}(x_i)$ as

$$\hat{m}^{(1)}_{\mathrm{LowLAD}}(x_i) = \hat{\beta}_{i1}. \qquad (7)$$
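The minimization above is an L1 (median) regression of the half-differences $\widetilde{Y}_{ij}^{(1)}$ on $(d_j, d_j^3)$ without an intercept. The paper's own computations use the R package 'L1pack'; purely as an illustrative sketch (the function and variable names below are ours, not the authors'), the estimator can be obtained from any standard quantile-regression routine:

```python
import numpy as np
from statsmodels.regression.quantile_regression import QuantReg

def lowlad_first_derivative(Y, i, k, spacing):
    """LowLAD estimate of m'(x_i) at an interior point (illustrative sketch).

    Y       : responses on an equidistant design (spacing = x_{i+1} - x_i).
    i       : 0-based index of the estimation point, assumed k <= i <= len(Y)-1-k.
    k       : half-width of the local window.
    """
    j = np.arange(1, k + 1)
    d = j * spacing                              # d_j = x_{i+j} - x_i
    Ytilde = (Y[i + j] - Y[i - j]) / 2.0         # symmetric half-differences
    X = np.column_stack([d, d ** 3])             # regressors (d_j, d_j^3), no intercept
    fit = QuantReg(Ytilde, X).fit(q=0.5)         # median (L1) regression
    return fit.params[0]                         # beta_hat_{i1}, the m'(x_i) estimate
```

The weighted form on the left-hand side of the preceding display (with weights $w_j = d_j$) has the same minimizer, so the unweighted regression on the half-differences is what is actually solved.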

The following theorem reveals the asymptotic behavior of $\hat{\beta}_{i1}$.

Theorem 1  Assume that $\epsilon_i$ are iid random errors with a continuous bounded density $f(\cdot)$. Then as $k \to \infty$ and $k/n \to 0$, $\hat{\beta}_{i1}$ in (7) is asymptotically normally distributed with

$$\mathrm{Bias}[\hat{\beta}_{i1}] \approx -\frac{m^{(5)}(x_i)}{504}\frac{k^4}{n^4}, \qquad \mathrm{Var}[\hat{\beta}_{i1}] \approx \frac{75}{16 g(0)^2}\frac{n^2}{k^3},$$

where $g(0) = 2\int_{-\infty}^{\infty} f^2(x)\,dx$. The optimal $k$ that minimizes the asymptotic mean square error (AMSE) is

$$k_{\mathrm{opt}} \approx 3.26 \left(\frac{1}{g(0)^2 m^{(5)}(x_i)^2}\right)^{1/11} n^{10/11},$$

and, consequently, the minimum AMSE is given by

$$\mathrm{AMSE}[\hat{\beta}_{i1}] \approx 0.19 \left(\frac{m^{(5)}(x_i)^6}{g(0)^{16}}\right)^{1/11} n^{-8/11}.$$

In the local median regression (see Section 3.1 below for its definition), the density $f(\cdot)$ is usually assumed to have a zero median and a positive $f(0)$ value. While in Theorem 1, we only require a continuity condition on the density $f(\cdot)$. In addition, the variance of the LowLAD estimator depends on $g(0) = 2\int_{-\infty}^{\infty} f^2(x)\,dx$, which is always positive, while the variance of the local median estimator relies on the single value $f(0)$ only. In this sense, the LowLAD estimator is more robust than the local median estimator.

We now compare the LowLAD and LowLSR estimators using a simulated data set, where the LowLSR estimator of Wang and Lin (2015) is defined in Section 3.1. Figure 1 displays the LowLAD estimator (using the R package 'L1pack' of Osorio (2015)) and the LowLSR estimator with $k \in \{6, 12, 25, 30, 50\}$. The data set of size 300 is generated from model (1) with $x_i \in [0.25, 1]$, $m(x) = \sqrt{x(1-x)}\sin\bigl(2.1\pi/(x+0.05)\bigr)$, and $\epsilon_i \stackrel{iid}{\sim} N(0, 0.1^2)$.
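As an illustration of this design (a minimal sketch under the same settings, reusing the hypothetical lowlad_first_derivative helper sketched above; the random seed is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 300
x = np.linspace(0.25, 1.0, n)                               # equidistant design on [0.25, 1]
m = np.sqrt(x * (1 - x)) * np.sin(2.1 * np.pi / (x + 0.05))
Y = m + rng.normal(scale=0.1, size=n)                       # iid N(0, 0.1^2) errors

k = 25
spacing = x[1] - x[0]
interior = np.arange(k, n - k)
deriv_hat = [lowlad_first_derivative(Y, i, k, spacing) for i in interior]
```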


Figure 1: (a) Simulated data set of size 300 from model (1) with equidistant $x_i \in [0.25, 1]$, $m(x) = \sqrt{x(1-x)}\sin\bigl(2.1\pi/(x+0.05)\bigr)$, $\epsilon_i \stackrel{iid}{\sim} N(0, 0.1^2)$, and the true regression function (bold line). (b)-(f) The first-order LowLAD derivative estimators (green points) and the first-order LowLSR derivative estimators (red dashed line) for $k \in \{6, 9, 12, 25, 30, 50\}$. As a reference, the true first-order derivative function is also plotted (bold line).

We note that both estimators have the same variation trend, whereas our estimator has a slightly larger oscillation compared to the LowLSR estimator. When $k$ is small (see Figure 1(b) and (c)), both estimators are noise-corrupted versions of the true first-order derivative. As $k$ becomes larger (see Figure 1(d)-(f)), our estimator provides a similar performance as the LowLSR estimator. Furthermore, by combining the left part of Figure 1(d), the middle part of (e) and the right part of (f), more accurate derivative estimators can be obtained for practical use.

In what follows, we compare the LowLAD and LowLSR estimators from a theoretical point of view. Both estimators have the same bias $-\frac{m^{(5)}(x_i)}{504}\frac{k^4}{n^4}$, while their variances are $\frac{75}{16 g(0)^2}\frac{n^2}{k^3}$ and $\frac{75\sigma^2}{8}\frac{n^2}{k^3}$, respectively. When $\epsilon \sim N(0, \sigma^2)$, we have $g(0) = 1/(\sqrt{\pi}\sigma)$ and

$$\frac{75}{16 g(0)^2}\frac{n^2}{k^3} = \frac{\pi}{2}\cdot\frac{75\sigma^2}{8}\frac{n^2}{k^3} > \frac{75\sigma^2}{8}\frac{n^2}{k^3}.$$

Figure 1 coincides with these theoretical results: the common variation trend is mainly due to the identical bias, while the slightly larger oscillation indicates that the LowLAD estimator has a slightly larger variance.

To provide more quantitative evidence on the relative performance of these two estimators, we consider five distributions for the random errors: $N(0, 1^2)$, $0.9N(0, 1^2)+0.1N(0, 3^2)$, $0.9N(0, 1^2)+0.1N(0, 10^2)$, the Laplace distribution, and the Cauchy distribution, which were adopted in the robust location estimators by Koenker and Bassett (1978). Table 1 lists the ratios of variances between the LowLAD and LowLSR estimators for the five cases, where Ratio = Var(LowLSR)/Var(LowLAD). The table shows that the variance of the LowLAD estimator is smaller than that of the LowLSR estimator in the presence of sharp-peak and heavy-tailed errors.

          N(0,1)   0.9N(0,1)+0.1N(0,3^2)   0.9N(0,1)+0.1N(0,10^2)   Laplace   Cauchy
Ratio      0.64            0.92                     4.85               1         ∞

Table 1: The ratio of variances between the LowLSR and LowLAD derivative estimators.
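By Theorem 1 and Corollary 3 below, Ratio = Var(LowLSR)/Var(LowLAD) = $\bigl(\tfrac{75\sigma^2}{8}\tfrac{n^2}{k^3}\bigr)/\bigl(\tfrac{75}{16g(0)^2}\tfrac{n^2}{k^3}\bigr) = 2\sigma^2 g(0)^2$ with $g(0) = 2\int f^2$. The finite-variance entries of Table 1 can be checked numerically; a minimal sketch (not the authors' code):

```python
import numpy as np
from scipy import stats
from scipy.integrate import quad

def variance_ratio(pdf, var):
    """Var(LowLSR)/Var(LowLAD) = 2 * sigma^2 * g(0)^2, with g(0) = 2 * int f^2."""
    g0 = 2 * quad(lambda x: pdf(x) ** 2, -np.inf, np.inf)[0]
    return 2 * var * g0 ** 2

print(variance_ratio(stats.norm.pdf, 1.0))                      # ~0.64  (= 2/pi)

mix = lambda x: 0.9 * stats.norm.pdf(x) + 0.1 * stats.norm.pdf(x, scale=3)
print(variance_ratio(mix, 0.9 * 1 + 0.1 * 9))                   # ~0.92

print(variance_ratio(stats.laplace.pdf, 2.0))                   # ~1.0 (standard Laplace, variance 2)
```

The Cauchy entry is infinite because the LowLSR variance does not exist, while the LowLAD variance remains finite.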

For further illustration of the trade-off between the sharp-peak and heavy-tailed errors, we consider the mixed normal errors $\epsilon_i \sim (1-\alpha)N(0, \sigma_0^2) + \alpha N(0, \sigma_1^2)$ with $\sigma_1^2 > \sigma_0^2$ and $0 < \alpha \le 1/2$. Letting $\lambda = \sigma_1/\sigma_0$, we list in Table 2 the critical values of $\lambda$ such that the LowLAD and LowLSR estimators have the same variance. The overall comparison is provided in Appendix E.

α   0.001  0.002  0.003  0.004  0.005  0.006  0.007  0.008  0.009
λ   24.04  17.10  14.04  12.22  10.99  10.09   9.38   8.82   8.36

α   0.01   0.02   0.03   0.04   0.05   0.06   0.07   0.08   0.09
λ   7.96   5.88   4.99   4.47   4.12   3.87   3.69   3.54   3.42

α   0.10   0.15   0.20   0.25   0.30   0.35   0.40   0.45   0.50
λ   3.32   3.01   2.86   2.79   2.77   2.78   2.83   2.92   3.05

Table 2: The critical values of λ that equate the variances of the LowLAD and LowLSR derivative estimators with different contamination.
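Each critical value solves $2\sigma^2 g(0)^2 = 1$ for the contaminated normal with $\sigma_0 = 1$ and $\sigma_1 = \lambda$. A sketch of the root-finding step (illustrative only), using the closed form of $\int f^2$ for a two-component normal mixture:

```python
import numpy as np
from scipy.optimize import brentq

SQRT_PI = np.sqrt(np.pi)

def ratio(lam, alpha):
    """Var(LowLSR)/Var(LowLAD) = 2*sigma^2*g(0)^2 for (1-a)N(0,1) + a N(0,lam^2)."""
    var = (1 - alpha) + alpha * lam ** 2
    int_f2 = ((1 - alpha) ** 2 / (2 * SQRT_PI)
              + alpha ** 2 / (2 * lam * SQRT_PI)
              + 2 * alpha * (1 - alpha) / np.sqrt(2 * np.pi * (1 + lam ** 2)))
    return 2 * var * (2 * int_f2) ** 2

for alpha in (0.01, 0.10, 0.50):
    lam = brentq(lambda l: ratio(l, alpha) - 1.0, 1.01, 100.0)
    print(alpha, round(lam, 2))          # ~7.96, 3.32, 3.05, matching Table 2
```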

    3. Comparison to Existing First-Order Derivative Estimators

We first review some existing methods for the first-order derivative estimation, based on either the least squares estimation or the quantile regression. Both methods can be used to estimate the first-order derivative of $m(\cdot)$ due to the structure of model (1). Note that $E[Y_i|x_i] = m(x_i) + E[\epsilon_i]$, and $Q_\tau(Y_i|x_i) = m(x_i) + Q_\tau(\epsilon_i)$ because $\epsilon_i$ is independent of $x_i$, where $Q_\tau(\epsilon_i)$ is the $\tau$th unconditional quantile of $\epsilon_i$. Although $E[Y_i|x_i]$ may not equal $Q_\tau(Y_i|x_i)$, their derivatives must be the same as the corresponding derivatives of $m(x_i)$. To make the existing estimators comparable to our LowLAD estimator, we use the uniform kernel throughout this section.

    3.1 LS, LowLSR and LAD Estimators

    In Fan and Gijbels (1996), the first-order derivative is estimated by the least squares method:

$$(\hat{\alpha}^{LS}_{i0}, \hat{\alpha}^{LS}_{i1}, \hat{\alpha}^{LS}_{i2}, \hat{\alpha}^{LS}_{i3}, \hat{\alpha}^{LS}_{i4}) = \arg\min_{\alpha} \sum_{j=-k}^{k} \bigl(Y_{i+j} - \alpha_{i0} - \alpha_{i1} d_j - \alpha_{i2} d_j^2 - \alpha_{i3} d_j^3 - \alpha_{i4} d_j^4\bigr)^2,$$

where $\alpha = (\alpha_{i0}, \alpha_{i1}, \alpha_{i2}, \alpha_{i3}, \alpha_{i4})^T$. For ease of comparison, we exclude the point $j = 0$ so that the same $Y_i$'s are used as in the LowLAD estimation. Define the least squares (LS) estimator as

$$\hat{m}^{(1)}_{LS}(x_i) = \hat{\alpha}^{LS}_{i1}. \qquad (8)$$

Corollary 2  Under the assumptions of Theorem 1, the bias and variance of the LS estimator in (8) are, respectively,

$$\mathrm{Bias}[\hat{\alpha}^{LS}_{i1}] \approx -\frac{m^{(5)}(x_i)}{504}\frac{k^4}{n^4}, \qquad \mathrm{Var}[\hat{\alpha}^{LS}_{i1}] \approx \frac{75\sigma^2}{8}\frac{n^2}{k^3}.$$

To derive the optimal convergence rate of the derivative estimation, Wang and Lin (2015) proposed the LowLSR estimator:

$$\hat{m}^{(1)}_{\mathrm{LowLSR}}(x_i) = \hat{\alpha}^{\mathrm{LowLSR}}_{i1}, \qquad (9)$$

where

$$(\hat{\alpha}^{\mathrm{LowLSR}}_{i1}, \hat{\alpha}^{\mathrm{LowLSR}}_{i3}) = \arg\min_{\alpha_{i1}, \alpha_{i3}} \sum_{j=1}^{k} \bigl(\widetilde{Y}_{ij}^{(1)} - \alpha_{i1} d_j - \alpha_{i3} d_j^3\bigr)^2.$$
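Unlike the L1 fit, this least squares problem has a closed form. A minimal sketch mirroring the earlier LowLAD helper (names are ours, for illustration only):

```python
import numpy as np

def lowlsr_first_derivative(Y, i, k, spacing):
    """LowLSR estimate of m'(x_i): ordinary least squares of the symmetric
    half-differences on (d_j, d_j^3) with no intercept (illustrative sketch)."""
    j = np.arange(1, k + 1)
    d = j * spacing
    Ytilde = (Y[i + j] - Y[i - j]) / 2.0
    X = np.column_stack([d, d ** 3])
    coef, *_ = np.linalg.lstsq(X, Ytilde, rcond=None)
    return coef[0]
```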

Corollary 3  Under the assumptions of Theorem 1, the bias and variance of the LowLSR estimator in (9) are, respectively,

$$\mathrm{Bias}[\hat{\alpha}^{\mathrm{LowLSR}}_{i1}] \approx -\frac{m^{(5)}(x_i)}{504}\frac{k^4}{n^4}, \qquad \mathrm{Var}[\hat{\alpha}^{\mathrm{LowLSR}}_{i1}] \approx \frac{75\sigma^2}{8}\frac{n^2}{k^3}.$$

Wang and Scott (1994) proposed the local polynomial least absolute deviation (LAD) estimator:

$$\hat{m}^{(1)}_{LAD}(x_i) = \hat{\beta}^{LAD}_{i1}, \qquad (10)$$

where $\beta = (\beta_{i0}, \beta_{i1}, \beta_{i2}, \beta_{i3}, \beta_{i4})^T$ and

$$(\hat{\beta}^{LAD}_{i0}, \hat{\beta}^{LAD}_{i1}, \hat{\beta}^{LAD}_{i2}, \hat{\beta}^{LAD}_{i3}, \hat{\beta}^{LAD}_{i4}) = \arg\min_{\beta} \sum_{j=-k}^{k} \bigl|Y_{i+j} - \beta_{i0} - \beta_{i1} d_j - \beta_{i2} d_j^2 - \beta_{i3} d_j^3 - \beta_{i4} d_j^4\bigr|.$$


Corollary 4  Under the assumptions of Theorem 1, the bias and variance of the LAD estimator in (10) are, respectively,

$$\mathrm{Bias}[\hat{\beta}^{LAD}_{i1}] \approx -\frac{m^{(5)}(x_i)}{504}\frac{k^4}{n^4}, \qquad \mathrm{Var}[\hat{\beta}^{LAD}_{i1}] \approx \frac{75}{32 f(0)^2}\frac{n^2}{k^3}.$$

Now we can see one key difference between the LS method and the LAD method. For the LS method, the LS estimator and the LowLSR estimator are asymptotically equivalent, while for the LAD method, the asymptotic variances of the LAD estimator and the LowLAD estimator are very different, although their asymptotic biases are the same. To understand this discrepancy, we show here that the objective functions of the LS and LowLSR estimation are asymptotically equivalent, while those of the LAD and LowLAD estimation are not. Note that the objective function of the LowLSR estimation is equivalent to

$$4\sum_{j=1}^{k}\bigl(\widetilde{Y}_{ij}^{(1)} - \alpha_{i1} d_j - \alpha_{i3} d_j^3\bigr)^2 = \sum_{j=1}^{k}\bigl(Y_{i+j} - Y_{i-j} - 2\alpha_{i1} d_j - 2\alpha_{i3} d_j^3\bigr)^2$$
$$= \sum_{j=1}^{k}\bigl(Y_{i+j} - Y_{i-j} - (\alpha_{i0} - \alpha_{i0}) - \alpha_{i1}(d_j - d_{-j}) - \alpha_{i2}(d_j^2 - d_{-j}^2) - \alpha_{i3}(d_j^3 - d_{-j}^3) - \alpha_{i4}(d_j^4 - d_{-j}^4)\bigr)^2$$
$$= \sum_{j=1}^{k}\bigl(Y_{i+j} - \alpha_{i0} - \alpha_{i1} d_j - \alpha_{i2} d_j^2 - \alpha_{i3} d_j^3 - \alpha_{i4} d_j^4\bigr)^2 + \sum_{j=1}^{k}\bigl(Y_{i-j} - \alpha_{i0} - \alpha_{i1} d_{-j} - \alpha_{i2} d_{-j}^2 - \alpha_{i3} d_{-j}^3 - \alpha_{i4} d_{-j}^4\bigr)^2$$
$$\quad - 2\sum_{j=1}^{k}\bigl(Y_{i+j} - \alpha_{i0} - \alpha_{i1} d_j - \alpha_{i2} d_j^2 - \alpha_{i3} d_j^3 - \alpha_{i4} d_j^4\bigr)\bigl(Y_{i-j} - \alpha_{i0} - \alpha_{i1} d_{-j} - \alpha_{i2} d_{-j}^2 - \alpha_{i3} d_{-j}^3 - \alpha_{i4} d_{-j}^4\bigr),$$

where $d_{-j} = -j/n$. It can be shown that the cross term in the last equality is a higher-order term and is negligible, and the first two terms constitute exactly the objective function of the least squares estimation. More specifically, we can show that

$$\hat{m}^{(1)}_{LS}(x_i) - m^{(1)}(x_i) - \mathrm{bias} \approx (0,1,0,0,0)\Bigl(\sum_{j=-k}^{k} X_j X_j'\Bigr)^{-1}\sum_{j=-k}^{k} X_j \epsilon_{i+j} = \sum_{j=-k}^{k} D_j \epsilon_{i+j}$$

with $X_j = (1, d_j, d_j^2, d_j^3, d_j^4)^T$ and

$$D_j = n\,\frac{j(75k^4 + 150k^3 - 75k + 25) - j^3(105k^2 + 105k - 35)}{k(8k^6 + 28k^5 + 14k^4 - 35k^3 - 28k^2 + 7k + 6)} = -D_{-j},$$

and

$$\hat{m}^{(1)}_{\mathrm{LowLSR}}(x_i) - m^{(1)}(x_i) - \mathrm{bias} \approx (1,0)\left[\begin{pmatrix} 0 & 1 & 0 & 0 & 0 \\ 0 & 0 & 0 & 1 & 0 \end{pmatrix}\Bigl(\sum_{j=1}^{k} X_j X_j'\Bigr)\begin{pmatrix} 0 & 1 & 0 & 0 & 0 \\ 0 & 0 & 0 & 1 & 0 \end{pmatrix}^T\right]^{-1}\begin{pmatrix} 0 & 1 & 0 & 0 & 0 \\ 0 & 0 & 0 & 1 & 0 \end{pmatrix}\sum_{j=1}^{k} X_j \frac{\epsilon_{i+j} - \epsilon_{i-j}}{2}$$
$$= \sum_{j=1}^{k} 2 D_j \frac{\epsilon_{i+j} - \epsilon_{i-j}}{2} = \sum_{j=-k}^{k} D_j \epsilon_{i+j}.$$

The influence functions of $\hat{m}^{(1)}_{LS}(x_i)$ and $\hat{m}^{(1)}_{\mathrm{LowLSR}}(x_i)$ are exactly the same, so the contribution of the cross term is null.

    On the contrary, the objective function of the LowLAD estimator is equivalent to

$$2\sum_{j=1}^{k}\bigl|\widetilde{Y}_{ij}^{(1)} - \beta_{i1} d_j - \beta_{i3} d_j^3\bigr| = \sum_{j=1}^{k}\bigl|Y_{i+j} - Y_{i-j} - (\beta_{i0} - \beta_{i0}) - \beta_{i1}(d_j - d_{-j}) - \beta_{i2}(d_j^2 - d_{-j}^2) - \beta_{i3}(d_j^3 - d_{-j}^3) - \beta_{i4}(d_j^4 - d_{-j}^4)\bigr|$$
$$= \sum_{j=1}^{k}\bigl|Y_{i+j} - \beta_{i0} - \beta_{i1} d_j - \beta_{i2} d_j^2 - \beta_{i3} d_j^3 - \beta_{i4} d_j^4\bigr| + \sum_{j=1}^{k}\bigl|Y_{i-j} - \beta_{i0} - \beta_{i1} d_{-j} - \beta_{i2} d_{-j}^2 - \beta_{i3} d_{-j}^3 - \beta_{i4} d_{-j}^4\bigr| + \text{extra term},$$

    where the extra term is equal to

$$-2\max\Bigl(\mathrm{sign}\bigl(Y_{i+j} - \beta_{i0} - \beta_{i1} d_j - \beta_{i2} d_j^2 - \beta_{i3} d_j^3 - \beta_{i4} d_j^4\bigr)\cdot\mathrm{sign}\bigl(Y_{i-j} - \beta_{i0} - \beta_{i1} d_{-j} - \beta_{i2} d_{-j}^2 - \beta_{i3} d_{-j}^3 - \beta_{i4} d_{-j}^4\bigr),\, 0\Bigr)$$
$$\cdot\min\Bigl(\bigl|Y_{i+j} - \beta_{i0} - \beta_{i1} d_j - \beta_{i2} d_j^2 - \beta_{i3} d_j^3 - \beta_{i4} d_j^4\bigr|,\, \bigl|Y_{i-j} - \beta_{i0} - \beta_{i1} d_{-j} - \beta_{i2} d_{-j}^2 - \beta_{i3} d_{-j}^3 - \beta_{i4} d_{-j}^4\bigr|\Bigr),$$

and cannot be neglected, where $\mathrm{sign}(x) = 1$ if $x > 0$ and $\mathrm{sign}(x) = -1$ if $x < 0$. More specifically, we can show that

$$\hat{m}^{(1)}_{LAD}(x_i) - m^{(1)}(x_i) - \mathrm{bias} \approx (0,1,0,0,0)\Bigl(2f(0)\sum_{j=-k}^{k} X_j X_j'\Bigr)^{-1}\sum_{j=-k}^{k} X_j\,\mathrm{sign}(\epsilon_{i+j})$$
$$= \frac{1}{2f(0)}\sum_{j=-k}^{k} D_j\,\mathrm{sign}(\epsilon_{i+j}) = \frac{1}{f(0)}\sum_{j=1}^{k} D_j\,\frac{\mathrm{sign}(\epsilon_{i+j}) - \mathrm{sign}(\epsilon_{i-j})}{2},$$

    and

$$\hat{m}^{(1)}_{\mathrm{LowLAD}}(x_i) - m^{(1)}(x_i) - \mathrm{bias} \approx (1,0)\left[2g(0)\begin{pmatrix} 0 & 1 & 0 & 0 & 0 \\ 0 & 0 & 0 & 1 & 0 \end{pmatrix}\Bigl(\sum_{j=1}^{k} X_j X_j'\Bigr)\begin{pmatrix} 0 & 1 & 0 & 0 & 0 \\ 0 & 0 & 0 & 1 & 0 \end{pmatrix}^T\right]^{-1}\begin{pmatrix} 0 & 1 & 0 & 0 & 0 \\ 0 & 0 & 0 & 1 & 0 \end{pmatrix}\sum_{j=1}^{k} X_j\,\mathrm{sign}\Bigl(\frac{\epsilon_{i+j} - \epsilon_{i-j}}{2}\Bigr)$$
$$= \frac{1}{g(0)}\sum_{j=1}^{k} D_j\,\mathrm{sign}\Bigl(\frac{\epsilon_{i+j} - \epsilon_{i-j}}{2}\Bigr).$$

    So the contribution of the extra term to the influence function is

$$D_j\left[\frac{\mathrm{sign}\bigl(\tfrac{\epsilon_{i+j} - \epsilon_{i-j}}{2}\bigr)}{g(0)} - \frac{\mathrm{sign}(\epsilon_{i+j}) - \mathrm{sign}(\epsilon_{i-j})}{2f(0)}\right] = \frac{D_j}{g(0)}\begin{cases} \mathrm{sign}(\epsilon_{i+j} - \epsilon_{i-j}), & \text{if } \epsilon_{i+j}\epsilon_{i-j} > 0, \\ \bigl(1 - \tfrac{g(0)}{f(0)}\bigr)\mathrm{sign}(\epsilon_{i+j}), & \text{if } \epsilon_{i+j}\epsilon_{i-j} < 0, \end{cases}$$

which is not null, where we neglect the event that $\epsilon_{i+j}\epsilon_{i-j} = 0$ because the probability of such an event is zero.


    3.2 Quantile Regression Estimators

Quantile regression (Koenker and Bassett, 1978; Koenker, 2005) exploits the distribution information of the error term to improve the estimation efficiency. Following the composite quantile regression (CQR) in Zou and Yuan (2008), Kai et al. (2010) proposed the local polynomial CQR estimator. In general, the local polynomial CQR estimator of $m^{(1)}(x_i)$ is defined as

$$\hat{m}^{(1)}_{CQR}(x_i) = \hat{\gamma}^{CQR}_{i1}, \qquad (11)$$

where

$$\bigl(\{\hat{\gamma}^{CQR}_{i0h}\}_{h=1}^{q}, \hat{\gamma}^{CQR}_{i1}, \hat{\gamma}^{CQR}_{i2}, \hat{\gamma}^{CQR}_{i3}, \hat{\gamma}^{CQR}_{i4}\bigr) = \arg\min_{\gamma} \sum_{h=1}^{q}\left[\sum_{j=-k}^{k} \rho_{\tau_h}\bigl(Y_{i+j} - \gamma_{i0h} - \gamma_{i1} d_j - \gamma_{i2} d_j^2 - \gamma_{i3} d_j^3 - \gamma_{i4} d_j^4\bigr)\right],$$

with $\gamma = (\{\gamma_{i0h}\}_{h=1}^{q}, \gamma_{i1}, \gamma_{i2}, \gamma_{i3}, \gamma_{i4})^T$, $\rho_\tau(x) = \tau x - xI(x < 0)$ the check function, and $\tau_h = h/(q+1)$.

Corollary 5  Under the assumptions of Theorem 1, the bias and variance of the CQR estimator in (11) are, respectively,

$$\mathrm{Bias}[\hat{\gamma}^{CQR}_{i1}] \approx -\frac{m^{(5)}(x_i)}{504}\frac{k^4}{n^4}, \qquad \mathrm{Var}[\hat{\gamma}^{CQR}_{i1}] \approx \frac{75 R_1(q)}{8}\frac{n^2}{k^3},$$

where $R_1(q) = \sum_{l=1}^{q}\sum_{l'=1}^{q} \tau_{ll'} \big/ \{\sum_{l=1}^{q} f(c_l)\}^2$, $c_l = F^{-1}(\tau_l)$, and $\tau_{ll'} = \min\{\tau_l, \tau_{l'}\} - \tau_l \tau_{l'}$. As $q \to \infty$,

$$R_1(q) \to \frac{1}{12(E[f(\epsilon)])^2} = \frac{1}{3 g(0)^2}, \qquad \mathrm{Var}[\hat{\gamma}^{CQR}_{i1}] \approx \frac{75}{24 g(0)^2}\frac{n^2}{k^3}.$$
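For standard normal errors, $R_1(q)$ can be evaluated directly and compared with the limit $1/\{3g(0)^2\} = \pi/3 \approx 1.047$ (since $g(0) = 1/\sqrt{\pi}$). A small numerical check (a sketch, with illustrative names):

```python
import numpy as np
from scipy import stats

def R1(q, dist=stats.norm):
    """R1(q) = sum_{l,l'} tau_{ll'} / (sum_l f(c_l))^2 for tau_l = l/(q+1)."""
    tau = np.arange(1, q + 1) / (q + 1)
    c = dist.ppf(tau)                                   # c_l = F^{-1}(tau_l)
    tau_ll = np.minimum.outer(tau, tau) - np.outer(tau, tau)
    return tau_ll.sum() / dist.pdf(c).sum() ** 2

for q in (5, 9, 19, 99):
    print(q, R1(q))        # approaches pi/3 ~ 1.047 as q grows
```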

Zhao and Xiao (2014) proposed the weighted quantile average (WQA) estimator for the regression function in nonparametric regression, an idea originated from Koenker (1984). We now extend the WQA method to estimate $m^{(1)}(x_i)$ using the local polynomial quantile regression. Specifically, we define

$$\hat{m}^{(1)}_{WQA}(x_i) = \sum_{h=1}^{q} w_h \hat{\gamma}^{WQA}_{i1h}, \qquad (12)$$

where $\sum_{h=1}^{q} w_h = 1$, and

$$(\hat{\gamma}^{WQA}_{i0h}, \hat{\gamma}^{WQA}_{i1h}, \hat{\gamma}^{WQA}_{i2h}, \hat{\gamma}^{WQA}_{i3h}, \hat{\gamma}^{WQA}_{i4h}) = \arg\min_{\gamma_h} \sum_{j=-k}^{k} \rho_{\tau_h}\bigl(Y_{i+j} - \gamma_{i0h} - \gamma_{i1h} d_j - \gamma_{i2h} d_j^2 - \gamma_{i3h} d_j^3 - \gamma_{i4h} d_j^4\bigr)$$

with $\gamma_h = (\gamma_{i0h}, \gamma_{i1h}, \gamma_{i2h}, \gamma_{i3h}, \gamma_{i4h})^T$.


Corollary 6  Under the assumptions of Theorem 1, the bias and variance of the WQA estimator in (12) are, respectively,

$$\mathrm{Bias}[\hat{m}^{(1)}_{WQA}(x_i)] \approx -\frac{m^{(5)}(x_i)}{504}\frac{k^4}{n^4}, \qquad \mathrm{Var}[\hat{m}^{(1)}_{WQA}(x_i)] \approx \frac{75 R_2(q|w)}{8}\frac{n^2}{k^3},$$

where $R_2(q|w) = w^T H w$ with $w = (w_1, \dots, w_q)^T$, and $H = \bigl\{\frac{\tau_{ll'}}{f(c_l) f(c_{l'})}\bigr\}_{1 \le l, l' \le q}$. The optimal weights are given by $w^* = \frac{H^{-1} e_q}{e_q^T H^{-1} e_q}$, where $e_q = (1, \dots, 1)^T_{q \times 1}$, and $R_2(q|w^*) = (e_q^T H^{-1} e_q)^{-1}$. As $q \to \infty$, under the regularity assumption in Theorem 6.2 of Zhao and Xiao (2014), the variance of the optimal WQA is

$$\mathrm{Var}[\hat{m}^{(1)}_{WQA}(x_i)] \approx \frac{75 I(f)^{-1}}{8}\frac{n^2}{k^3},$$

where $I(f)$ is the Fisher information of $f$.

                  LS      LowLSR   LAD          LowLAD        CQR          WQA         ML
Asymptotic s^2    σ^2     σ^2      1/(4f(0)^2)  1/(2g(0)^2)   1/(3g(0)^2)  I(f)^{-1}   I(f)^{-1}
s^2               σ^2     σ^2      1/(4f(0)^2)  1/(2g(0)^2)   R_1(q)       R_2(q)      I(f)^{-1}

Table 3: The variances $s^2\,\frac{75n^2}{8k^3}$ of the existing first-order derivative estimators.

For the LS, LowLSR, LAD, CQR, WQA and LowLAD estimators with the same $k$, their asymptotic biases are all the same. In contrast, their asymptotic variances are $\frac{75n^2}{8k^3}s^2$ with $s^2$ being $\sigma^2$, $\sigma^2$, $\frac{1}{4f(0)^2}$, $R_1(q)$, $R_2(q)$ and $\frac{1}{2g(0)^2}$, respectively. From the kernel interpretation of the differenced estimator by Wang and Yu (2017), we expect an equivalence between the LS and LowLSR estimators. As $q \to \infty$, we have $R_1(q) \to 1/\{3g(0)^2\}$ and $R_2(q) \to I(f)^{-1}$, and hence the WQA estimator is the most efficient estimator as $q$ becomes large. Nevertheless, it requires the error density function to be known in advance to carry out the most efficient WQA estimation; otherwise, a two-step procedure is needed, where the first step estimates the error density. For a fixed $q$, the three quantile-based estimators (i.e., LAD, CQR, WQA) depend only on the density values of $f(\cdot)$ at finitely many quantile points, whose behaviors are uncertain and may not be reliable. In contrast, our new estimator relies on $g(0) = 2E[f(\epsilon)]$, which incorporates all information on the density $f(\cdot)$ and hence is more robust.

    3.3 Equivalence to the Infinitely CQR Estimator

From the asymptotic variances, we can see that LowLAD has a close relationship with CQR. We now show that the LowLAD estimator with random difference is asymptotically equivalent to the infinitely CQR estimator. First, define the first-order random difference sequence as

$$Y_{ijl} = Y_{i+j} - Y_{i+l}, \quad -k \le j, l \le k,$$

where we implicitly exclude $j$ and/or $l$ being 0 to be comparable with the LowLAD estimator. Second, define the LowLAD estimator with random difference, referred to as the


RLowLAD estimator of $m^{(1)}(x_i)$, as

$$\hat{m}^{(1)}_{\mathrm{RLowLAD}}(x_i) = \hat{\beta}^{\mathrm{RLowLAD}}_{i1}, \qquad (13)$$

where

$$(\hat{\beta}^{\mathrm{RLowLAD}}_{i1}, \hat{\beta}^{\mathrm{RLowLAD}}_{i2}, \hat{\beta}^{\mathrm{RLowLAD}}_{i3}, \hat{\beta}^{\mathrm{RLowLAD}}_{i4}) = \arg\min_{b} \sum_{j=-k}^{k} \sum_{l=-k,\, l \neq j}^{k}$$


where $\hat{\gamma}^{WQA}_{i1} = (\hat{\gamma}^{WQA}_{i11}, \cdots, \hat{\gamma}^{WQA}_{i1q})^T$, $W$ is a symmetric weight matrix, and $w = \frac{W e_q}{e_q^T W e_q}$.

When $W = I_q$, the $q \times q$ identity matrix, we get the equally-weighted WQA estimator; when $W = H^{-1}$ we get the optimally weighted WQA estimator; and when

$$W = a I_q + (1 - a) H^{-1} \neq I_q,$$

we get an estimator that is asymptotically equivalent to the CQR estimator, where

$$a = \frac{-(q - B)(1 - BC) + \sqrt{(q^2 - AB)(1 - BC)}}{A - (2q - B)(1 - BC) - q^2 C} \neq 1$$

with $A = e_q^T H e_q = q^2 R_2(q|w_E)$, $B = R_2(q|w^*)^{-1}$ and $C = R_1(q)$. For example, if $\epsilon \sim N(0, 1)$, then when $q = 5$, $a = -0.367$; when $q = 9$, $a = -0.165$; when $q = 19$, $a = -0.067$; and when $q = 99$, $a = -0.011$. This is why imposing constraints directly on the objective function (i.e., the CQR estimator) or on the resulting estimators (i.e., $e_q^T \hat{\gamma}^{WQA}_{i1}$) would generate different estimators. The RLowLAD estimator and the CQR estimator have the same asymptotic variance because both of them impose constraints directly on the objective function.

    4. Derivative Estimation at Boundaries

At the left boundary with $2 \le i \le k$, the bias and variance of the LowLAD estimator are $-m^{(5)}(x_i)(i-1)^4/(504 n^4)$ and $75 n^2/(16 g(0)^2 (i-1)^3)$, respectively. At the endpoint $i = 1$, the LowLAD estimator does not exist. Similar results hold for the estimation at the right boundary with $n - k + 1 \le i \le n - 1$. In the presence of outliers and heavy-tailed errors, the estimation variance at the boundaries may be even larger. In this section, we propose an asymmetrical LowLAD method (As-LowLAD) to further reduce the estimation variance as well as to improve the finite-sample performance at the boundaries.

Assume that $m(\cdot)$ is twice continuously differentiable on $[0, 1]$. For $1 \le i \le k$, we define the asymmetric lag-$j$ first-order difference quotients as

$$\bar{Y}_{ij} = \frac{Y_{i+j} - Y_i}{x_{i+j} - x_i}, \quad -(i-1) \le j \le k, \; j \neq 0.$$

By decomposing $\bar{Y}_{ij}$ into two parts, we have

$$\bar{Y}_{ij} = \frac{m(x_{i+j}) - m(x_i)}{j/n} + \frac{\epsilon_{i+j} - \epsilon_i}{j/n} \approx m^{(1)}(x_i) + (-\epsilon_i) d_j^{-1} + \frac{\epsilon_{i+j}}{j/n} + \frac{m^{(2)}(x_i)}{2!}\frac{j}{n},$$

where $\epsilon_i$ is fixed as $j$ increases. Thus, $\mathrm{Median}(\bar{Y}_{ij}\,|\,\epsilon_i) = m^{(1)}(x_i) + (-\epsilon_i) d_j^{-1} + \frac{m^{(2)}(x_i)}{2!}\frac{j}{n}$. Ignoring the last term, we can rewrite the above model as

$$\bar{Y}_{ij} \approx \beta_{i1} + \beta_{i0} d_j^{-1} + \delta_{ij}, \quad -(i-1) \le j \le k, \; j \neq 0,$$

where $(\beta_{i0}, \beta_{i1})^T = (-\epsilon_i, m^{(1)}(x_i))^T$. By the LowLAD method, the regression coefficients are estimated by


$$(\hat{\beta}_{i0}, \hat{\beta}_{i1}) = \arg\min_{\beta_{i0}, \beta_{i1}} \sum_{j=-(i-1)}^{k} \bigl|\bar{Y}_{ij} - \beta_{i1} - \beta_{i0} d_j^{-1}\bigr|\, w_j = \arg\min_{\beta_{i0}, \beta_{i1}} \sum_{j=-(i-1)}^{k} \bigl|\widetilde{Y}_{ij} - \beta_{i0} - \beta_{i1} d_j\bigr|, \qquad (14)$$

where $w_j = d_j$ and $\widetilde{Y}_{ij} = Y_{i+j} - Y_i$. The As-LowLAD estimator of $m^{(1)}(x_i)$ is $\hat{\beta}_{i1}$.
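A minimal sketch of this boundary fit, mirroring the interior helper above (illustrative names; indices are 0-based, and the weighted form is implemented through the second, unweighted representation in (14)):

```python
import numpy as np
from statsmodels.regression.quantile_regression import QuantReg

def as_lowlad_first_derivative(Y, i, k, spacing):
    """As-LowLAD estimate of m'(x_i) near the left boundary (illustrative sketch).

    Uses the nonzero offsets -i, ..., -1, 1, ..., k (the 0-based analogue of
    -(i-1) <= j <= k, j != 0) and an LAD regression of Y_{i+j} - Y_i on (1, d_j);
    the slope on d_j estimates m'(x_i), the intercept estimates -eps_i."""
    j = np.r_[np.arange(-i, 0), np.arange(1, k + 1)]
    d = j * spacing
    Ytilde = Y[i + j] - Y[i]
    X = np.column_stack([np.ones_like(d), d])
    fit = QuantReg(Ytilde, X).fit(q=0.5)
    return fit.params[1]
```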

Similarly to Theorem 1, we can prove the asymptotic normality for $\hat{\beta}_{i1}$. The following theorem derives its asymptotic bias and variance.

Theorem 8  Assume that $\epsilon_i$ are iid random errors with median 0 and a continuous, positive density $f(\cdot)$ in a neighborhood of zero. Furthermore, assume that $m(\cdot)$ is twice continuously differentiable on $[0, 1]$. Then for each $1 \le i \le k$, the bias and variance of $\hat{\beta}_{i1}$ in (14) are, respectively,

$$\mathrm{Bias}[\hat{\beta}_{i1}] \approx \frac{m^{(2)}(x_i)}{2}\,\frac{k^4 + 2k^3 i - 2k i^3 - i^4}{n(k^3 + 3k^2 i + 3k i^2 + i^3)}, \qquad \mathrm{Var}[\hat{\beta}_{i1}] \approx \frac{3}{f(0)^2}\,\frac{n^2}{k^3 + 3k^2 i + 3k i^2 + i^3}.$$

For the estimation at the boundaries, Wang and Lin (2015) proposed a one-side LowLSR (OS-LowLSR) estimator of the first-order derivative, with the bias and variance being $m^{(2)}(x_i)k/(2n)$ and $12\sigma^2 n^2/k^3$, respectively. In contrast, if we consider the one-side LowLAD (OS-LowLAD) estimator of the first-order derivative, its bias and variance are $m^{(2)}(x_i)k/(2n)$ and $3n^2/(f(0)^2 k^3)$, respectively. Note that the two biases are the same, while the variances satisfy the same proportional relationship as the LowLSR and LowLAD estimators at an interior point:

$$\frac{3}{f(0)^2}\frac{n^2}{k^3} = \frac{\pi}{2}\cdot\frac{12\sigma^2 n^2}{k^3} > \frac{12\sigma^2 n^2}{k^3}$$

when $\epsilon_i \sim N(0, \sigma^2)$.

Theorem 8 shows that our estimator has a smaller bias than the OS-LowLSR and OS-LowLAD estimators. In the special case $i = k$, the bias disappears (in fact it reduces to the higher-order term $O(k^2/n^2)$). With normal errors, when $1 < i < \lfloor 0.163k \rfloor$, the order of variances of the three estimators is $\mathrm{Var}(\hat{m}^{(1)}_{\mathrm{OS-LAD}}(x_i)) > \mathrm{Var}(\hat{m}^{(1)}_{\mathrm{AS-LAD}}(x_i)) > \mathrm{Var}(\hat{m}^{(1)}_{\mathrm{OS-LSR}}(x_i))$, where for a real number $x$, $\lfloor x \rfloor$ denotes the largest integer less than $x$; when $\lfloor 0.163k \rfloor < i < k$, the order of variances of the three estimators is $\mathrm{Var}(\hat{m}^{(1)}_{\mathrm{AS-LAD}}(x_i)) < \mathrm{Var}(\hat{m}^{(1)}_{\mathrm{OS-LAD}}(x_i)) < \mathrm{Var}(\hat{m}^{(1)}_{\mathrm{OS-LSR}}(x_i))$. As $i$ approaches $k$, the variance of the As-LowLAD estimator is reduced to one-eighth of the variance of the OS-LowLAD estimator, and is much smaller than the variance of the OS-LowLSR estimator.

Up to now, we have the first-order derivative estimators $\{\hat{m}^{(1)}(x_i)\}_{i=1}^{n}$ on the discrete points $\{x_i\}_{i=1}^{n}$. To estimate the first-order derivative function, we suggest two strategies for different noise levels of the derivative data: the cubic spline interpolation in Knott (2000) for 'good' derivative estimators, and the local polynomial regression in Brown and Levine (2007) and De Brabanter et al. (2013) for 'bad' derivative estimators. Here, the terms 'good' and 'bad' indicate small and large estimation variances of the derivative estimator, respectively.

    5. Second- and Higher-Order Derivative Estimation

In this section, we generalize the robust method for the first-order derivative estimation to the second- and higher-order derivatives estimation.

    5.1 Second-Order Derivative Estimation

Define the second-order difference quotients as

$$Y_{ij}^{(2)} = \frac{Y_{i-j} - 2Y_i + Y_{i+j}}{j^2/n^2}, \quad 1 \le j \le k. \qquad (15)$$

Assume that $m(\cdot)$ is six times continuously differentiable. We decompose (15) into two parts and simplify it by the Taylor expansion as

$$Y_{ij}^{(2)} = \frac{m(x_{i-j}) - 2m(x_i) + m(x_{i+j})}{j^2/n^2} + \frac{\epsilon_{i-j} - 2\epsilon_i + \epsilon_{i+j}}{j^2/n^2} \approx m^{(2)}(x_i) + \frac{m^{(4)}(x_i)}{12}\frac{j^2}{n^2} + \frac{m^{(6)}(x_i)}{360}\frac{j^4}{n^4} + \frac{\epsilon_{i-j} - 2\epsilon_i + \epsilon_{i+j}}{j^2/n^2}.$$

Since $i$ is fixed as $j$ varies, the conditional expectation of $Y_{ij}^{(2)}$ given $\epsilon_i$ is

$$E\bigl[Y_{ij}^{(2)}\,\big|\,\epsilon_i\bigr] \approx m^{(2)}(x_i) + \frac{m^{(4)}(x_i)}{12}\frac{j^2}{n^2} + \frac{m^{(6)}(x_i)}{360}\frac{j^4}{n^4} + (-2\epsilon_i)\frac{n^2}{j^2}.$$

This results in the true regression model as

$$Y_{ij}^{(2)} \approx \alpha_{i2} + \alpha_{i4} d_j^2 + \alpha_{i6} d_j^4 + \alpha_{i0} d_j^{-2} + \delta_{ij}, \quad 1 \le j \le k,$$

where $\alpha_i = (\alpha_{i0}, \alpha_{i2}, \alpha_{i4}, \alpha_{i6})^T = (-2\epsilon_i, m^{(2)}(x_i), m^{(4)}(x_i)/12, m^{(6)}(x_i)/360)^T$, and the errors $\delta_{ij} = \frac{\epsilon_{i+j} + \epsilon_{i-j}}{j^2/n^2}$. If $\epsilon_i$ has a symmetric density about zero, then $\widetilde{\delta}_{ij} = \frac{j^2}{n^2}\delta_{ij} = \epsilon_{i+j} + \epsilon_{i-j}$ has median 0 (see Appendix D). Following the similar estimation as in the first-order derivative estimation, our LowLAD estimator of $m^{(2)}(x_i)$ is defined as

$$\hat{m}^{(2)}_{\mathrm{LowLAD}}(x_i) = \hat{\alpha}_{i2}, \qquad (16)$$

where

$$(\hat{\alpha}_{i0}, \hat{\alpha}_{i2}, \hat{\alpha}_{i4})^T = \arg\min_{\alpha_{i0}, \alpha_{i2}, \alpha_{i4}} \sum_{j=1}^{k} \bigl|Y_{ij}^{(2)} - (\alpha_{i2} + \alpha_{i4} d_j^2 + \alpha_{i0} d_j^{-2})\bigr|\, W_j = \arg\min_{\alpha_{i0}, \alpha_{i2}, \alpha_{i4}} \sum_{j=1}^{k} \bigl|\widetilde{Y}_{ij}^{(2)} - (\alpha_{i0} + \alpha_{i2} d_j^2 + \alpha_{i4} d_j^4)\bigr|$$

with $W_j = d_j^2$ and $\widetilde{Y}_{ij}^{(2)} = Y_{i-j} - 2Y_i + Y_{i+j}$. The following theorem shows that $\hat{m}^{(2)}_{\mathrm{LowLAD}}(x_i)$ behaves similarly as $\hat{m}^{(1)}_{\mathrm{LowLAD}}(x_i)$.
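A minimal sketch of the second-order fit, in the same style as the first-order helper (illustrative names, not the authors' code):

```python
import numpy as np
from statsmodels.regression.quantile_regression import QuantReg

def lowlad_second_derivative(Y, i, k, spacing):
    """LowLAD estimate of m''(x_i) at an interior point (illustrative sketch)."""
    j = np.arange(1, k + 1)
    d = j * spacing
    Ytilde2 = Y[i - j] - 2 * Y[i] + Y[i + j]                 # second-order differences
    X = np.column_stack([np.ones_like(d), d ** 2, d ** 4])   # alpha_{i0}, alpha_{i2}, alpha_{i4}
    fit = QuantReg(Ytilde2, X).fit(q=0.5)                    # median (L1) regression
    return fit.params[1]                                     # alpha_hat_{i2} = m''(x_i) estimate
```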

Theorem 9  Assume that $\epsilon_i$ are iid random errors whose density $f(\cdot)$ is continuous and symmetric about zero. Then as $k \to \infty$ and $k/n \to 0$, $\hat{\alpha}_{i2}$ in (16) is asymptotically normally distributed with

$$\mathrm{Bias}[\hat{\alpha}_{i2}] \approx -\frac{m^{(6)}(x_i)}{792}\frac{k^4}{n^4}, \qquad \mathrm{Var}[\hat{\alpha}_{i2}] \approx \frac{2205}{16 h(0)^2}\frac{n^4}{k^5},$$

where $h(0) = \int_{-\infty}^{\infty} f^2(x)\,dx$. The optimal $k$ that minimizes the AMSE is

$$k_{\mathrm{opt}} \approx 3.93 \left(\frac{1}{h(0)^2 m^{(6)}(x_i)^2}\right)^{1/13} n^{12/13},$$

and, consequently, the minimum AMSE is given by

$$\mathrm{AMSE}[\hat{\alpha}_{i2}] \approx 0.24 \left(\frac{m^{(6)}(x_i)^{10}}{h(0)^{16}}\right)^{1/13} n^{-8/13}.$$

Following the simulation settings in Figure 1, Figure 2 displays the LowLAD and LowLSR estimators for the second-order derivative. The finite-sample performance of the two estimators is parallel to that in the first-order derivative estimation. For the theoretical results, the LowLAD and LowLSR estimators have the same asymptotic bias $-\frac{m^{(6)}(x_i)}{792}\frac{k^4}{n^4}$, while their asymptotic variances are $\frac{2205}{16 h(0)^2}\frac{n^4}{k^5}$ and $\frac{2205\sigma^2}{8}\frac{n^4}{k^5}$, respectively. When $\epsilon \sim N(0, \sigma^2)$, we have $h(0) = 1/(2\sqrt{\pi}\sigma)$, so

$$\frac{2205}{64 h(0)^2}\frac{n^4}{k^5} = \frac{\pi}{2}\cdot\frac{2205\sigma^2}{8}\frac{n^4}{k^5} > \frac{2205\sigma^2}{8}\frac{n^4}{k^5},$$

which matches the simulation results in Figure 2.

    5.2 Higher-Order Derivative Estimation

We now propose the robust method for estimating the higher-order derivatives $m^{(l)}(x_i)$ with $l > 2$ via a two-step procedure. In the first step, we construct a sequence of symmetric difference quotients in which the higher-order derivative is the intercept of the linear regression derived by the Taylor expansion, and in the second step, we estimate the higher-order derivative using the LowLAD method.

When $l$ is odd, let $d = (l+1)/2$. We linearly combine $m(x_{i\pm j})$ subject to

$$\sum_{h=1}^{d} \bigl[a_{jd+h}\, m(x_{i+jd+h}) + a_{-(jd+h)}\, m(x_{i-(jd+h)})\bigr] = m^{(l)}(x_i) + O\!\left(\frac{j}{n}\right), \quad 0 \le j \le k,$$

where $k$ is a positive integer. We can derive a total of $2d$ equations through the Taylor expansion to solve the $2d$ unknown parameters. Define

$$Y_{ij}^{(l)} = \sum_{h=1}^{d} \bigl[a_{jd+h}\, Y_{i+jd+h} + a_{-(jd+h)}\, Y_{i-(jd+h)}\bigr],$$


Figure 2: (a) Simulated data set of size 300 as in Figure 1. (b)-(d) The second-order LowLAD derivative estimator (green points) and the second-order LowLSR derivative estimator (red dashed line) for $k \in \{25, 35, 60\}$. As a reference, the true second-order derivative curve is also plotted (bold line).

and consider the linear regression

$$Y_{ij}^{(l)} = m^{(l)}(x_i) + \delta_{ij}, \quad 1 \le j \le k,$$

where $\delta_{ij} = \sum_{h=1}^{d} \bigl[a_{i,jd+h}\,\epsilon_{i+jd+h} + a_{i,-(jd+h)}\,\epsilon_{i-(jd+h)}\bigr] + O(j/n)$.

When $l$ is even, let $d = l/2$. We linearly combine $m(x_{i\pm j})$ subject to

$$b_j\, m(x_i) + \sum_{h=1}^{d} \bigl[a_{jd+h}\, m(x_{i+jd+h}) + a_{-(jd+h)}\, m(x_{i-(jd+h)})\bigr] = m^{(l)}(x_i) + O\!\left(\frac{j}{n}\right), \quad 0 \le j \le k,$$


where $k$ is a positive integer. We can derive a total of $2d+1$ equations through the Taylor expansion to solve the $2d+1$ unknown parameters. Define

$$Y_{ij}^{(l)} = b_j Y_i + \sum_{h=1}^{d} \bigl[a_{jd+h}\, Y_{i+jd+h} + a_{-(jd+h)}\, Y_{i-(jd+h)}\bigr],$$

and consider the linear regression

$$Y_{ij}^{(l)} = m^{(l)}(x_i) + b_j \epsilon_i + \delta_{ij}, \quad 1 \le j \le k,$$

where $\delta_{ij} = \sum_{h=1}^{d} \bigl[a_{i,jd+h}\,\epsilon_{i+jd+h} + a_{i,-(jd+h)}\,\epsilon_{i-(jd+h)}\bigr] + O(j/n)$.

When $k$ is large, we suggest keeping the $j^2/n^2$ term as in (6) to reduce the estimation bias. If $\sum_{h=1}^{d} \bigl[a_{i,jd+h}\,\epsilon_{i+jd+h} + a_{i,-(jd+h)}\,\epsilon_{i-(jd+h)}\bigr]$ has median zero, then we can obtain the higher-order derivative estimators by the LowLAD method and deduce their asymptotic properties by similar arguments as in the previous sections. To save space, we omit the technical details in this paper. A sketch of the coefficient construction is given below.
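The combination coefficients above are obtained by matching Taylor coefficients. As an illustration only (not the paper's exact construction), one can solve a Vandermonde-type linear system for weights on a chosen set of offsets so that the weighted combination of responses reproduces $m^{(l)}(x_i)$ up to higher-order terms; for odd $l$ the offsets are symmetric and exclude 0, matching the $2d$ unknowns described above:

```python
import numpy as np
from math import factorial

def derivative_weights(offsets, l, spacing):
    """Weights w_s such that sum_s w_s * m(x + s*spacing) ~ m^(l)(x) + higher-order terms.

    Solves sum_s w_s (s*spacing)^p / p! = 1{p == l} for p = 0, ..., len(offsets) - 1,
    which matches the Taylor expansion of m at x (requires l < len(offsets))."""
    offsets = np.asarray(offsets, dtype=float)
    p = np.arange(len(offsets))
    fact = np.array([factorial(int(q)) for q in p], dtype=float)
    A = (offsets[None, :] * spacing) ** p[:, None] / fact[:, None]
    e = np.zeros(len(offsets))
    e[l] = 1.0
    return np.linalg.solve(A, e)

# Example: weights for the third derivative from the offsets {-2, -1, 1, 2}
w = derivative_weights([-2, -1, 1, 2], l=3, spacing=0.01)
```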

    6. Simulation Studies

In this section, we conduct more simulations to further evaluate the finite-sample performance of the first- and second-order derivative estimators and compare them with three existing methods.

    6.1 First-Order Derivative Estimators

    We first consider the following two regression functions:

$$m_1(x) = \sin(2\pi x) + \cos(2\pi x) + \log(4/3 + x), \quad x \in [-1, 1],$$
$$m_2(x) = 32 e^{-8(1-2x)^2}(1 - 2x), \quad x \in [0, 1].$$

These two functions were also considered in Hall (2010) and De Brabanter et al. (2013). We simulate the data set from model (1) with $n = 500$ and the errors generated as follows: 90% of the errors come from $\epsilon \sim N(0, \sigma^2)$ with $\sigma = 0.1$, and the remaining 10% come from $\epsilon \sim N(0, \sigma_1^2)$ with $\sigma_1 = 1$ or 10, corresponding to the low or high contamination level. Figures 3 and 4 present the finite-sample performance of the first-order derivative estimators for the regression functions $m_1$ and $m_2$, respectively. They show that our estimated curves of the first-order derivative fit the true curves more accurately than the LowLSR estimator in the presence of sharp-peak and heavy-tailed errors. The heavier the tail, the more significant the improvement.

We also compute the mean absolute errors to further assess the performance of different methods. Since the oscillation of the periodic function depends on the frequency and amplitude, we consider the sine function as the regression function,

$$m_3(x) = A\sin(2\pi f x), \quad x \in [0, 1].$$

The errors are generated in the same way as in the previous simulation study. We consider three sample sizes: $n = 50$, 200 and 1000, two standard deviations: $\sigma = 0.1$ and 0.5, two frequencies: $f = 0.5$ and 1, and two amplitudes: $A = 1$ and 10. We consider the following adjusted mean absolute error (AMAE) as the criterion for simplicity and robustness:

$$\mathrm{AMAE}(k) = \frac{1}{n - 2k}\sum_{i=k+1}^{n-k} \bigl|\hat{m}^{(1)}(x_i) - m^{(1)}(x_i)\bigr|.$$

Based on the condition $k < n/4$ and the AMAE, we choose the optimal $k$ value as

$$\hat{k} = \min\Bigl\{\arg\min_{k} \mathrm{AMAE}(k),\ \frac{n}{4}\Bigr\}.$$
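A minimal sketch of this criterion and the selection of $k$ (simulation-only, since the true derivative is required; the 'estimator' argument can be any of the helpers sketched earlier):

```python
import numpy as np

def amae(Y, m_deriv_true, k, spacing, estimator):
    """AMAE(k): mean absolute error of derivative estimates over the interior points."""
    n = len(Y)
    idx = np.arange(k, n - k)                       # 0-based analogue of i = k+1, ..., n-k
    est = np.array([estimator(Y, i, k, spacing) for i in idx])
    return np.mean(np.abs(est - m_deriv_true[idx]))

def select_k(Y, m_deriv_true, spacing, estimator):
    """k_hat = min{argmin_k AMAE(k), n/4}, searching over k < n/4."""
    n = len(Y)
    k_grid = np.arange(2, n // 4)
    scores = [amae(Y, m_deriv_true, k, spacing, estimator) for k in k_grid]
    return min(int(k_grid[int(np.argmin(scores))]), n // 4)
```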

Table 4 reports the simulation results based on 1000 repetitions. The numbers outside and inside the brackets represent the mean and standard deviation of the AMAE, respectively. It is evident that our robust estimator performs better than the LowLSR estimator for the first-order derivative in most cases, except for the case with $\sigma_1 = 1$ and $\sigma = 0.5$. Also, for the cases with $\sigma_1 = 2\sigma$, the contamination is very light and thus the simulation result is similar to the cases with normal errors in Section 2.2. In the other cases, the contamination always dominates the variance.

                      LowLAD      LowLSR      LowLAD      LowLSR      LowLAD      LowLSR
A    f    σ           n = 50      n = 50      n = 200     n = 200     n = 1000    n = 1000

σ1 = 1
1    0.5  0.1         0.61(0.19)  1.13(0.53)  0.29(0.06)  0.62(0.20)  0.13(0.02)  0.28(0.09)
          0.5         2.45(0.56)  2.07(0.62)  1.27(0.25)  1.08(0.31)  0.58(0.11)  0.49(0.15)
     1    0.1         0.61(0.18)  1.12(0.53)  0.29(0.06)  0.61(0.21)  0.13(0.02)  0.28(0.08)
          0.5         2.45(0.54)  2.08(0.62)  1.28(0.25)  1.08(0.31)  0.58(0.11)  0.49(0.13)
10   0.5  0.1         0.45(0.14)  0.84(0.43)  0.29(0.06)  0.61(0.20)  0.13(0.02)  0.28(0.82)
          0.5         1.87(0.50)  1.58(0.56)  1.27(0.24)  1.07(0.31)  0.58(0.11)  0.49(0.13)
     1    0.1         0.62(0.15)  0.95(0.41)  0.33(0.06)  0.65(0.21)  0.21(0.03)  0.32(0.08)
          0.5         1.89(0.46)  1.60(0.51)  1.28(0.24)  1.08(0.30)  0.60(0.11)  0.51(0.14)

σ1 = 10
1    0.5  0.1         0.67(0.55)  7.82(4.50)  0.30(0.06)  5.95(2.10)  0.13(0.03)  2.69(0.78)
          0.5         2.43(0.82)  8.07(4.10)  1.47(0.29)  5.82(1.97)  0.66(0.12)  2.75(0.84)
     1    0.1         0.65(0.47)  7.85(4.42)  0.30(0.06)  5.74(1.97)  0.13(0.02)  2.69(0.78)
          0.5         2.43(0.80)  7.95(4.24)  1.47(0.30)  5.97(2.03)  0.66(0.13)  2.76(0.78)
10   0.5  0.1         0.67(0.62)  7.84(4.36)  0.30(0.06)  5.98(2.10)  0.13(0.02)  2.71(0.78)
          0.5         2.39(0.82)  7.97(4.28)  1.47(0.30)  5.92(2.05)  0.66(0.13)  2.75(0.79)
     1    0.1         0.78(0.47)  7.87(4.56)  0.34(0.06)  5.91(2.03)  0.21(0.03)  2.74(0.80)
          0.5         2.50(0.95)  8.04(4.42)  1.48(0.29)  5.78(1.97)  0.67(0.12)  2.64(0.75)

Table 4: AMAE for the first-order derivative estimation of the function m3.

    6.2 Second-Order Derivative Estimators

To assess the finite-sample performance of the second-order derivative estimators, we consider the same regression functions as in Section 6.1. Figures 5 and 6 present the estimated curves and the true curves of the second-order derivatives of $m_1$ and $m_2$, respectively. They show that our LowLAD estimator fits the true curves more accurately than the LowLSR estimator in all settings.

We further compare our method with two well-known methods: the local polynomial regression with $p = 5$ (using the R package 'locpol' of Cabrera (2012)) and the penalized smoothing splines with norder = 6 and method = 4 (using the R package 'pspline' of Ramsay and Ripley (2013)). For simplicity, we consider the simple version of $m_3$ with $A = 5$:

$$m_4(x) = 5\sin(2\pi x), \quad x \in [0, 1].$$

We let $n = 500$, and the errors are generated in the same way as in Section 6.1. With 1000 repetitions, the simulation results in Figures 7 and 8 indicate that our robust estimator is superior to the existing methods in the presence of sharp-peak and heavy-tailed errors.

    7. Conclusion and Extensions

In this paper, we propose a robust differenced method for estimating the first- and higher-order derivatives of the regression function in nonparametric models. The new method consists of two main steps: first construct a sequence of symmetric (or random) difference quotients, and then estimate the derivatives using the LowLAD regression. The new derivative estimator does not require a zero median or a positive density at zero for the error term. Also, the modified version of the new estimator with random difference is asymptotically equivalent to the (infinitely) composite quantile regression estimator. We further extend the robust differenced method to the derivative estimation at boundary points and the higher-order derivatives estimation.

In this paper we focus on the derivative estimation with fixed designs and iid errors. With minor technical extension, the proposed method can be extended to random designs with heteroskedastic errors. Further extensions to semiparametric estimation and change-point estimation are also possible. Following the discussions at the end of Section 3.3, it is also interesting to study which kinds of weights should be imposed on the objective functions in CQR and on the $Y_{ijl}$ terms in RLowLAD to generate the same efficiency as the optimal WQA estimator. These questions will be investigated in a separate paper.

8. Acknowledgments

    The research was supported by NNSF projects (11571204 and 11231005) of China.

    References

G. Boente and D. Rodriguez. Robust estimators of high order derivatives of regression functions. Statistics & Probability Letters, 76(13):1335–1344, 2006.

L.D. Brown and M. Levine. Variance estimation in nonparametric regression via the difference sequence method. The Annals of Statistics, 35(5):2219–2232, 2007.

J.L.O. Cabrera. locpol: Kernel local polynomial regression. R package version 0.6-0, 2012. URL http://mirrors.ustc.edu.cn/CRAN/web/packages/locpol/index.html.

R. Charnigo, M. Francoeur, M.P. Mengüç, A. Brock, M. Leichter, and C. Srinivasan. Derivatives of scattering profiles: tools for nanoparticle characterization. Journal of the Optical Society of America, 24(9):2578–2589, 2007.

R. Charnigo, M. Francoeur, M.P. Mengüç, B. Hall, and C. Srinivasan. Estimating quantitative features of nanoparticles using multiple derivatives of scattering profiles. Journal of Quantitative Spectroscopy and Radiative Transfer, 112:1369–1382, 2011a.

R. Charnigo, B. Hall, and C. Srinivasan. A generalized Cp criterion for derivative estimation. Technometrics, 53(3):238–253, 2011b.

P. Chaudhuri and J.S. Marron. SiZer for exploration of structures in curves. Journal of the American Statistical Association, 94(447):807–823, 1999.

W.S. Cleveland. Robust locally weighted regression and smoothing scatterplots. Journal of the American Statistical Association, 74(368):829–836, 1979.

W. Dai, T. Tong, and M.G. Genton. Optimal estimation of derivatives in nonparametric regression. Journal of Machine Learning Research, 17(164):1–25, 2016.

K. De Brabanter, J. De Brabanter, B. De Moor, and I. Gijbels. Derivative estimation with local polynomial fitting. Journal of Machine Learning Research, 14(1):281–301, 2013.

M. Delecroix and A.C. Rosa. Nonparametric estimation of a regression function and its derivatives under an ergodic hypothesis. Journal of Nonparametric Statistics, 6(4):367–382, 1996.

N.R. Draper and H. Smith. Applied Regression Analysis. Wiley and Sons, New York, 2nd edition, 1981.

J. Fan and I. Gijbels. Local Polynomial Modelling and Its Applications. Chapman & Hall, London, 1996.

J. Fan and P. Hall. On curve estimation by minimizing mean absolute deviation and its implications. The Annals of Statistics, 22(2):867–885, 1994.

S. Ghosal, A. Sen, and A.W. van der Vaart. Testing monotonicity of regression. The Annals of Statistics, 28(4):1054–1082, 2000.

I. Gijbels and A.C. Goderniaux. Data-driven discontinuity detection in derivatives of a regression function. Communications in Statistics-Theory and Methods, 33(4):851–871, 2005.

B. Hall. Nonparametric Estimation of Derivatives with Applications. PhD thesis, University of Kentucky, Lexington, Kentucky, 2010.

W. Härdle. Applied Nonparametric Regression. Cambridge University Press, Cambridge, 1990.

W. Härdle and T. Gasser. Robust non-parametric function fitting. Journal of the Royal Statistical Society, Series B, 46(1):42–51, 1984.

W. Härdle and T. Gasser. On robust kernel estimation of derivatives of regression functions. Scandinavian Journal of Statistics, 12(3):233–240, 1985.

P.J. Huber and E.M. Ronchetti. Robust Statistics. John Wiley & Sons, Inc., 2009.

B. Kai, R. Li, and H. Zou. Local composite quantile regression smoothing: an efficient and safe alternative to local polynomial regression. Journal of the Royal Statistical Society, Series B, 72(1):49–69, 2010.

G.D. Knott. Interpolating Cubic Splines. Springer, 1st edition, 2000.

R. Koenker. A note on L-estimation for linear models. Statistics & Probability Letters, 2:323–325, 1984.

R. Koenker. Quantile Regression. Cambridge University Press, New York, 2005.

R. Koenker and G. Bassett. Regression quantiles. Econometrica, 46:33–50, 1978.

X.R. Li and V.P. Jilkov. Survey of maneuvering target tracking. Part I: Dynamic models. IEEE Transactions on Aerospace and Electronic Systems, 39(4):1333–1364, 2003.

X.R. Li and V.P. Jilkov. Survey of maneuvering target tracking. Part II: Motion models of ballistic and space targets. IEEE Transactions on Aerospace and Electronic Systems, 46(1):96–119, 2010.

I. Matyasovszky. Detecting abrupt climate changes on different time scales. Theoretical and Applied Climatology, 105:445–454, 2011.

H.G. Müller. Nonparametric Regression Analysis of Longitudinal Data. Springer, New York, 1988.

H.G. Müller, U. Stadtmüller, and T. Schmitt. Bandwidth choice and confidence intervals for derivatives of noisy data. Biometrika, 74(4):743–749, 1987.

J. Newell, J. Einbeck, N. Madden, and K. McMillan. Model free endurance markers based on the second derivative of blood lactate curves. In Proceedings of the 20th International Workshop on Statistical Modelling, pages 357–364, Sydney, 2005.

F. Osorio. L1pack: Routines for L1 estimation. R package version 0.3, 2015. URL http://www.ies.ucv.cl/l1pack/.

A. Pakes and D. Pollard. Simulation and the asymptotics of optimization estimators. Econometrica, 57(5):1027–1057, 1989.

C. Park and K.H. Kang. SiZer analysis for the comparison of regression curves. Computational Statistics & Data Analysis, 52(8):3954–3970, 2008.

J. Porter and P. Yu. Regression discontinuity designs with unknown discontinuity points: Testing and estimation. Journal of Econometrics, 189:132–147, 2015.

J. Ramsay and B. Ripley. pspline: Penalized smoothing splines. R package version 1.0-16, 2013. URL http://mirrors.ustc.edu.cn/CRAN/web/packages/pspline/index.html.

J.O. Ramsay and B.W. Silverman. Applied Functional Data Analysis: Methods and Case Studies. Springer, New York, 2002.

D. Ruppert and M.P. Wand. Multivariate locally weighted least squares regression. The Annals of Statistics, 22(3):1346–1370, 1994.

C.J. Stone. Additive regression and other nonparametric models. The Annals of Statistics, 13(2):689–705, 1985.

P.S. Swain, K. Stevenson, A. Leary, L.F. Montano-Gutierrez, I.B.N. Clark, J. Vogel, and T. Pilizota. Inferring time derivatives including cell growth rates using Gaussian processes. Nature Communications, in press, 2017.

G. Wahba and Y. Wang. When is the optimal regularization parameter insensitive to the choice of the loss function? Communications in Statistics-Theory and Methods, 19:1685–1700, 1990.

F.T. Wang and D.W. Scott. The L1 method for robust nonparametric regression. Journal of the American Statistical Association, 89(425):65–76, 1994.

W.W. Wang and L. Lin. Derivative estimation based on difference sequence via locally weighted least squares regression. Journal of Machine Learning Research, 16:2617–2641, 2015.

W.W. Wang and P. Yu. Asymptotically optimal differenced estimators of error variance in nonparametric regression. Computational Statistics & Data Analysis, 105:125–143, 2017.

A.H. Welsh. Robust estimation of smooth regression and spread functions and their derivatives. Statistica Sinica, 6:347–366, 1996.

Z. Zhao and Z. Xiao. Efficient regression via optimally combining quantile information. Econometric Theory, 30(6):1272–1314, 2014.

S. Zhou and D.A. Wolfe. On derivative estimation in spline regression. Statistica Sinica, 10(1):93–108, 2000.

H. Zou and M. Yuan. Composite quantile regression and the oracle model selection theory. The Annals of Statistics, 36(3):1108–1126, 2008.


    Appendix A. Proof of Theorem 1

Proposition 10  If $\epsilon_i$ are iid with a continuous, positive density $f(\cdot)$ in a neighborhood of the median, then $\widetilde{\zeta}_{ij} = (\epsilon_{i+j} - \epsilon_{i-j})/2$ $(j = 1, \dots, k)$ are iid with median 0 and a continuous, positive, symmetric density $g(\cdot)$, where

$$g(x) = 2\int_{-\infty}^{\infty} f(2x + \epsilon) f(\epsilon)\, d\epsilon.$$

Proof of Proposition 10  The distribution of $\widetilde{\zeta}_{ij} = (\epsilon_{i+j} - \epsilon_{i-j})/2$ is

$$F_{\widetilde{\zeta}_{ij}}(x) = P\bigl((\epsilon_{i+j} - \epsilon_{i-j})/2 \le x\bigr) = \iint_{\epsilon_{i+j} \le 2x + \epsilon_{i-j}} f(\epsilon_{i+j}) f(\epsilon_{i-j})\, d\epsilon_{i+j}\, d\epsilon_{i-j}$$
$$= \int_{-\infty}^{\infty}\Bigl\{\int_{-\infty}^{2x + \epsilon_{i-j}} f(\epsilon_{i+j})\, d\epsilon_{i+j}\Bigr\} f(\epsilon_{i-j})\, d\epsilon_{i-j} = \int_{-\infty}^{\infty} F(2x + \epsilon_{i-j}) f(\epsilon_{i-j})\, d\epsilon_{i-j},$$

where $F(\cdot)$ denotes the distribution function of $\epsilon_i$.

    Then the density of ζ̃ij is

    g(x) ,dFζ̃ij (x)

    dx= 2

    ∫ ∞−∞

    f(2x+ �i−j)f(�i−j)d�i−j .

    The density g(·) is symmetric due to

    g(−x) = 2∫ ∞−∞

    f(−2x+ �i−j)f(�i−j)d�i−j

    = 2

    ∫ ∞−∞

    f(�i−j)f(�i−j + 2x)d�i−j

    = g(x).

    Therefore, we have

    Fζ̃ij (0) =

    ∫ ∞−∞

    f(�i−j)f(�i−j)d�i−j =1

    2f2(�i−j) |∞−∞=

    1

    2,

    g(0) = 2

    ∫ ∞−∞

    f2(�i−j)d�i−j .
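To make Proposition 10 concrete, the following R sketch (an illustration added for exposition, not part of the proof) simulates the half-differences $\tilde{\zeta}_{ij} = (\epsilon_{i+j} - \epsilon_{i-j})/2$ under an assumed standard normal error distribution and compares a kernel estimate of their density at zero with the closed form $g(0) = 2\int f^2(\epsilon)\,d\epsilon = 1/\sqrt{\pi}$.

\begin{verbatim}
## Illustrative check of Proposition 10 (assumed N(0,1) errors; not the authors' code).
set.seed(1)
n    <- 1e6
eps1 <- rnorm(n)                 # plays the role of epsilon_{i+j}
eps2 <- rnorm(n)                 # plays the role of epsilon_{i-j}
zeta <- (eps1 - eps2) / 2        # half-differences, with density g(.)
g0_closed <- 2 * integrate(function(e) dnorm(e)^2, -Inf, Inf)$value  # = 1/sqrt(pi)
d <- density(zeta)
g0_kde <- approx(d$x, d$y, xout = 0)$y                               # kernel estimate of g(0)
c(closed_form = g0_closed, kernel_estimate = g0_kde, target = 1 / sqrt(pi))
\end{verbatim}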

Proof of Theorem 1 Rewrite the objective function as
$$S_n(b) = \frac{1}{n} \sum_j f_n\big(\tilde{Y}_{ij} \mid b\big),$$


where
$$f_n\big(\tilde{Y}_{ij} \mid b\big) = \Big|\tilde{Y}^{(1)}_{ij} - b_{i1} d_j - b_{i3} d_j^3\Big|\, \frac{1}{h}\, 1(0 < d_j \le h)$$
with $b = (b_{i1}, b_{i3})^T$ and $h = k/n$. Define $X_j = (d_j, d_j^3)^T$ and $H = \mathrm{diag}\{h, h^3\}$. Note that $\arg\min_b S_n(b) = \arg\min_b [S_n(b) - S_n(\beta)]$, where $\beta = (m^{(1)}(x_i), m^{(3)}(x_i)/6)^T$.

We first show that $H(\hat{\beta} - \beta) = o_p(1)$, where $\hat{\beta} = (\hat{\beta}_{i1}, \hat{\beta}_{i3})^T$. We use Lemma 4 of Porter and Yu (2015) to show the consistency. Essentially, we need to show that

(i) $\sup_{b \in B} |S_n(b) - S_n(\beta) - E[S_n(b) - S_n(\beta)]| \stackrel{p}{\longrightarrow} 0$,

(ii) $\inf_{\|H(b - \beta)\| > \delta} E[S_n(b) - S_n(\beta)] > \varepsilon$ for $n$ large enough, where $B$ is a compact parameter space for $\beta$, and $\delta$ and $\varepsilon$ are fixed positive small numbers.

We use Lemma 2.8 of Pakes and Pollard (1989) to show (i), where
$$\mathcal{F}_n = \big\{ f_n(\tilde{Y} \mid b) - f_n(\tilde{Y} \mid \beta) : b \in B \big\}.$$
Note that $\mathcal{F}_n$ is Euclidean (see, e.g., Definition 2.7 of Pakes and Pollard (1989) for the definition of a Euclidean class of functions) by applying Lemma 2.13 of Pakes and Pollard (1989), where $\alpha = 1$, $f(\cdot, t_0) = 0$, $\phi(\cdot) = \|X_j\| \frac{1}{h} 1(0 < d_j \le h)$ and the envelope function is $F_n(\cdot) = M\phi(\cdot)$ for some finite constant $M$. Since $E[F_n] = E\big[\|X_j\| \frac{1}{h} 1(0 < d_j \le h)\big] = O(h)$, (i) follows. As to $\inf_{\|H(b-\beta)\| > \delta} E[S_n(b) - S_n(\beta)]$, by Proposition 1 of Wang and Scott (1994),
$$\begin{aligned}
E[S_n(b) - S_n(\beta)]
&\approx \frac{1}{n} \sum_j g(0) \big[X_j^T H^{-1} H(b - \beta)\big]^2 \frac{1}{h} 1(0 < d_j \le h) \\
&\quad - \frac{1}{n} \sum_j 2 g(0) \Big[\frac{m(d_{i+j}) - m(d_{i-j})}{2} - X_j^T \beta\Big] \big[X_j^T H^{-1} H(b - \beta)\big] \frac{1}{h} 1(0 < d_j \le h) \\
&\gtrsim \delta^2 - h^5 \delta,
\end{aligned}$$
where $\gtrsim$ means the left side is bounded below by a constant times the right side.

We then derive the asymptotic distribution of $\sqrt{nh}\, H(\hat{\beta} - \beta)$ by applying the empirical process technique. First, the first order conditions can be written as
$$\frac{1}{n} \sum_j \mathrm{sign}\big(\tilde{Y}^{(1)}_{ij} - Z_j^T H \hat{\beta}\big)\, Z_j\, \frac{\sqrt{h}}{h}\, 1(0 < d_j \le h) = o_p(1),$$


which is denoted as
$$\frac{1}{n} \sum_j f_n'(\tilde{Y}^{(1)}_{ij} \mid \hat{\beta}) \triangleq S_n'(\hat{\beta}) = o_p(1),$$
where $Z_j = H^{-1} X_j$. By Example 2.9 of Pakes and Pollard (1989), $\mathcal{F}_n'$ forms a Euclidean class of functions with envelope $F_n' = \|Z_j\| \frac{\sqrt{h}}{h} 1(0 < d_j \le h)$, where $\mathcal{F}_n' = \big\{ f_n'(\tilde{Y}^{(1)}_{ij} \mid b) : b \in B \big\}$, and $E[F_n'^2]$ ...


and the variance is
$$\mathrm{Var}[\hat{\beta}_{i1}] \approx \frac{1}{4 g(0)^2} [1, 0] \Big(\sum_j X_j X_j^T\Big)^{-1} \begin{bmatrix} 1 \\ 0 \end{bmatrix} \approx \frac{75}{16 g(0)^2} \frac{n^2}{k^3}.$$

Combining the squared bias and the variance, we obtain the AMSE
$$\mathrm{AMSE}[\hat{\beta}_{i1}] = \frac{m^{(5)}(x_i)^2}{504^2} \frac{k^8}{n^8} + \frac{75}{16 g(0)^2} \frac{n^2}{k^3}. \tag{17}$$

To minimize (17) with respect to $k$, we take the first-order derivative of (17), which gives
$$\frac{d\,\mathrm{AMSE}[\hat{\beta}_{i1}]}{dk} = \frac{m^{(5)}(x_i)^2}{31752} \frac{k^7}{n^8} - \frac{225}{16 g(0)^2} \frac{n^2}{k^4}.$$

Our optimization problem is to solve $\frac{d\,\mathrm{AMSE}[\hat{\beta}_{i1}]}{dk} = 0$. So we obtain
$$k_{\mathrm{opt}} = \left(\frac{893025}{2 g(0)^2 m^{(5)}(x_i)^2}\right)^{1/11} n^{10/11} \approx 3.26 \left(\frac{1}{g(0)^2 m^{(5)}(x_i)^2}\right)^{1/11} n^{10/11},$$
and
$$\mathrm{AMSE}[\hat{\beta}_{i1}] \approx 0.19 \big(m^{(5)}(x_i)^6 / g(0)^{16}\big)^{1/11} n^{-8/11}.$$
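As a numerical illustration of the bandwidth choice (our sketch, not the authors' code), one can evaluate the AMSE in (17) on a grid of $k$ and compare the grid minimizer with the closed-form $k_{\mathrm{opt}}$ above; the plugged-in values of $n$, $m^{(5)}(x_i)$, and $g(0)$ below are assumptions chosen only so that the optimum falls in the interior of the grid.

\begin{verbatim}
## Illustrative sketch: numerical vs. closed-form minimizer of the AMSE in (17).
amse17 <- function(k, n, m5, g0) (m5^2 / 504^2) * k^8 / n^8 + 75 / (16 * g0^2) * n^2 / k^3
n  <- 500
m5 <- 1e5                      # assumed value of m^(5)(x_i), for illustration only
g0 <- 1 / (0.1 * sqrt(pi))     # g(0) under N(0, 0.1^2) errors, via Proposition 10
k_grid <- 5:(n %/% 2)
k_star <- k_grid[which.min(amse17(k_grid, n, m5, g0))]
k_opt  <- (893025 / (2 * g0^2 * m5^2))^(1 / 11) * n^(10 / 11)
c(grid_minimizer = k_star, closed_form = k_opt)
\end{verbatim}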

    Appendix B. Proof of Theorem 7

Rewrite the objective function as a U-process,
$$S_n(b) = \sum_l \cdots$$
As in the proof of Theorem 1, we need to show that

(i) $\sup_{b \in B} |U_n(b) - U_n(\beta) - E[U_n(b) - U_n(\beta)]| \stackrel{p}{\longrightarrow} 0$,

(ii) $\inf_{\|H(b-\beta)\| > \delta} E[U_n(b) - U_n(\beta)] > \varepsilon$ for $n$ large enough, where $B$ is a compact parameter space for $\beta$, and $\delta$ and $\varepsilon$ are fixed positive small numbers.

We use Theorem A.2 of Ghosal et al. (2000) to show (i), where
$$\mathcal{F}_n = \{ f_n(Y_{i+j}, Y_{i+l} \mid b) - f_n(Y_{i+j}, Y_{i+l} \mid \beta) : b \in B \}.$$
Note that $\mathcal{F}_n$ forms a Euclidean class of functions by applying Lemma 2.13 of Pakes and Pollard (1989), where $\alpha = 1$, $f(\cdot, t_0) = 0$, $\phi(\cdot) = \|X_{jl}\| \frac{1}{h^2} 1(|d_j| \le h) 1(|d_l| \le h)$ and the envelope function is $F_n(\cdot) = M\phi(\cdot)$ for some finite constant $M$. It follows that
$$N\big(\varepsilon \|F_n\|_{Q,2}, \mathcal{F}_n, L_2(Q)\big) \lesssim \varepsilon^{-C}$$

for any probability measure $Q$ and some positive constant $C$, where $\lesssim$ means the left side is bounded by a constant times the right side. Hence,
$$\frac{1}{n} E\left[\int_0^\infty \log N\big(\varepsilon, \mathcal{F}_n, L_2(U_n^2)\big)\, d\varepsilon\right] \lesssim \frac{1}{n} \sqrt{E[F_n^2]} \int_0^\infty \log\frac{1}{\varepsilon}\, d\varepsilon = O\Big(\frac{1}{n}\Big),
$$

where $U_n^2$ is the random discrete measure putting mass $\frac{1}{n(n-1)}$ on each of the points $(Y_{i+j}, Y_{i+l})$. Next, by Lemma A.2 of Ghosal et al. (2000), the projections
$$\bar{f}_n(Y_{i+j} \mid b) = \int f_n(Y_{i+j}, Y_{i+l} \mid b)\, dF_{Y_{i+l}}(Y_{i+l})$$
satisfy
$$\sup_Q N\big(\varepsilon \|\bar{F}_n\|_{Q,2}, \bar{\mathcal{F}}_n, L_2(Q)\big) \lesssim \varepsilon^{-2C},$$
where $\bar{\mathcal{F}}_n = \{\bar{f}_n(Y_{i+j} \mid b) - \bar{f}_n(Y_{i+j} \mid \beta) : b \in B\}$, and $\bar{F}_n$ is an envelope of $\bar{\mathcal{F}}_n$. Thus

$$\frac{1}{\sqrt{n}} E\left[\int_0^\infty \log N\big(\varepsilon, \bar{\mathcal{F}}_n, L_2(P_n)\big)\, d\varepsilon\right] \lesssim \frac{1}{\sqrt{n}} \sqrt{E[\bar{F}_n^2]} \int_0^\infty \log\frac{1}{\varepsilon}\, d\varepsilon = O\Big(\frac{1}{\sqrt{n}}\Big).$$

By Theorem A.2 and Markov's inequality, $\sup_{b \in B} |U_n(b) - U_n(\beta) - E[U_n(b) - U_n(\beta)]| \stackrel{p}{\longrightarrow} 0$.

As to $\inf_{\|H(b-\beta)\|>\delta} E[U_n(b) - U_n(\beta)]$, by Proposition 1 of Wang and Scott (1994),
$$E[U_n(b) - U_n(\beta)] \approx \frac{2}{n(n-1)} \sum_l \cdots$$


We then derive the asymptotic distribution of $\sqrt{nh}\, H(\hat{\beta} - \beta)$. First, by Theorem A.1 of Ghosal et al. (2000), we approximate the first order conditions by an empirical process. Second, we derive the asymptotic distribution of $\sqrt{nh}\, H(\hat{\beta} - \beta)$ by applying the empirical process technique.

First, the first order conditions can be written as
$$\frac{2}{n(n-1)} \sum_l \cdots$$


with $F_\epsilon(\cdot)$ being the cumulative distribution function of $\epsilon$. In other words,
$$2\,\mathbb{G}_n\big(E_2[f_n'(Y_{i+j}, Y_{i+l} \mid \hat{\beta})]\big) + \sqrt{n}\, E[f_n'(Y_{i+j}, Y_{i+l} \mid \hat{\beta})] = o_p(1).$$

By Lemma 2.13 of Pakes and Pollard (1989), $\mathcal{F}_{1n}' = \{E_2[f_n'(Y_{i+j}, Y_{i+l} \mid b)] : b \in B\}$ is Euclidean with envelope $F_{1n} = \frac{\sqrt{h}}{nh^2} \sum_l \|Z_{jl}\|\, 1(0 < |d_j| \le h) 1(0 < |d_l| \le h)$, so by Lemma 2.17 of Pakes and Pollard (1989) and $H(\hat{\beta} - \beta) = o_p(1)$,
$$\mathbb{G}_n\big(E_2[f_n'(Y_{i+j}, Y_{i+l} \mid \hat{\beta})]\big) = \mathbb{G}_n\big(E_2[f_n'(Y_{i+j}, Y_{i+l} \mid \beta)]\big) + o_p(1).$$

As a result,
$$\begin{aligned}
& 2\mathbb{G}_n\big(E_2[f_n'(Y_{i+j}, Y_{i+l} \mid \beta)]\big) + \sqrt{n}\, E[f_n'(Y_{i+j}, Y_{i+l} \mid \hat{\beta})] \\
&= 2\sqrt{n}\, P_n\big(E_2[f_n'(Y_{i+j}, Y_{i+l} \mid \beta)]\big) - 2\sqrt{n}\, E[f_n'(Y_{i+j}, Y_{i+l} \mid \beta)] \\
&\qquad + \sqrt{n}\big(E[f_n'(Y_{i+j}, Y_{i+l} \mid \hat{\beta})] - E[f_n'(Y_{i+j}, Y_{i+l} \mid \beta)]\big) + \sqrt{n}\, E[f_n'(Y_{i+j}, Y_{i+l} \mid \beta)] \\
&= 2\sqrt{n}\, P_n E_2[f_n'(Y_{i+j}, Y_{i+l})] + 2\sqrt{n}\, P_n\big(E_2[f_n'(Y_{i+j}, Y_{i+l} \mid \beta)] - E_2[f_n'(Y_{i+j}, Y_{i+l})]\big) \\
&\qquad + \sqrt{n}\big(E[f_n'(Y_{i+j}, Y_{i+l} \mid \hat{\beta})] - E[f_n'(Y_{i+j}, Y_{i+l} \mid \beta)]\big) - \sqrt{n}\, E[f_n'(Y_{i+j}, Y_{i+l} \mid \beta)] \\
&= 2\sqrt{n}\, P_n E_2[f_n'(Y_{i+j}, Y_{i+l})] + \sqrt{n}\big(E[f_n'(Y_{i+j}, Y_{i+l} \mid \hat{\beta})] - E[f_n'(Y_{i+j}, Y_{i+l} \mid \beta)]\big) + \sqrt{n}\, E[f_n'(Y_{i+j}, Y_{i+l} \mid \beta)] \\
&= o_p(1),
\end{aligned}$$

where
$$E_2[f_n'(Y_{i+j}, Y_{i+l})] = \frac{\sqrt{h}}{nh^2} \sum_l [2F_\epsilon(\epsilon_{i+j}) - 1]\, Z_{jl}\, 1(0 < |d_j| \le h) 1(0 < |d_l| \le h)$$
satisfies $E\big[E_2[f_n'(Y_{i+j}, Y_{i+l})]\big] = 0$, and the second to last equality is from
$$\sqrt{n}\, P_n\big(E_2[f_n'(Y_{i+j}, Y_{i+l} \mid \beta)] - E_2[f_n'(Y_{i+j}, Y_{i+l})]\big) \approx \sqrt{n}\, E[f_n'(Y_{i+j}, Y_{i+l} \mid \beta)].$$

Since
$$E[f_n'(Y_{i+j}, Y_{i+l} \mid b)] = \frac{\sqrt{h}}{n^2 h^2} \sum_{l,j} Z_{jl}\, 1(0 < |d_j| \le h) 1(0 < |d_l| \le h) \cdot \int \Big[2 F_\epsilon\big(\epsilon + m(d_{i+j}) - m(d_{i+l}) - Z_{jl}^T H b\big) - 1\Big] f(\epsilon)\, d\epsilon,$$

we have
$$\sqrt{n}\big(E[f_n'(Y_{i+j}, Y_{i+l} \mid \hat{\beta})] - E[f_n'(Y_{i+j}, Y_{i+l} \mid \beta)]\big) \approx -\sqrt{nh}\, g(0) \Big(\frac{1}{n^2h^2} \sum_{l,j} Z_{jl} Z_{jl}^T\Big) H(\hat{\beta} - \beta),$$
$$\sqrt{n}\, E[f_n'(Y_{i+j}, Y_{i+l} \mid \beta)] \approx \sqrt{nh}\, g(0)\, \frac{1}{n^2h^2} \sum_{l,j} Z_{jl} \big(d_j^5 - d_l^5\big) \frac{m^{(5)}(x_i)}{5!}.$$


In summary,
$$\sqrt{nh}\left[ H(\hat{\beta} - \beta) - \Big(\frac{1}{n^2h^2}\sum_{l,j} Z_{jl}Z_{jl}^T\Big)^{-1} \frac{1}{n^2h^2}\sum_{l,j} Z_{jl}\big(d_j^5 - d_l^5\big)\frac{m^{(5)}(x_i)}{5!} \right] \approx 2 g(0)^{-1} \Big(\frac{1}{n^2h^2}\sum_{l,j} Z_{jl}Z_{jl}^T\Big)^{-1} \sqrt{n}\, P_n E_2[f_n'(Y_{i+j}, Y_{i+l})].$$
Thus, the asymptotic bias is
$$e^T H^{-1}\Big(\frac{1}{n^2h^2}\sum_{l,j} Z_{jl}Z_{jl}^T\Big)^{-1} \frac{1}{n^2h^2}\sum_{l,j} Z_{jl}\big(d_j^5 - d_l^5\big)\frac{m^{(5)}(x_i)}{5!} = -\frac{m^{(5)}(x_i)}{504}\frac{k^4}{n^4},$$

and the asymptotic variance is
$$\frac{4}{k g(0)^2}\, e^T H^{-1} G^{-1} V G^{-1} H^{-1} e = \frac{75}{24 g(0)^2} \frac{n^2}{k^3},$$
where $e = (1, 0, 0, 0)^T$, $G = \frac{1}{k^2}\sum_{l,j} Z_{jl} Z_{jl}^T$, and $V = \frac{1}{3k}\sum_{j=-k}^{k}\Big(\frac{1}{k}\sum_{l=-k}^{k} Z_{jl}\Big)\Big(\frac{1}{k}\sum_{l=-k}^{k} Z_{jl}\Big)^T$ with $\mathrm{Var}(2F_\epsilon(\epsilon_{i+j}) - 1) = 1/3$.

    Appendix C. Proof of Theorem 8

Proof of Theorem 8 Following the proof of Theorem 1, the asymptotic bias is
$$\mathrm{Bias}[\hat{\beta}_{i11}] \approx \frac{m^{(2)}(x_i)}{2}\, \frac{k^4 + 2k^3 i - 2k i^3 - i^4}{n(k^3 + 3k^2 i + 3k i^2 + i^3)},$$
and the asymptotic variance is
$$\mathrm{Var}[\hat{\beta}_{i11}] \approx \frac{3}{f(0)^2}\, \frac{n^2}{k^3 + 3k^2 i + 3k i^2 + i^3}.$$

    Appendix D. Proof of Theorem 9

Proposition 11 If the errors $\epsilon_i$ are iid with median 0 and a symmetric, continuous, positive density function $f(\cdot)$, then $\tilde{\delta}_{ij} = \epsilon_{i+j} + \epsilon_{i-j}$ ($j = 1, \ldots, k$) are iid with $\mathrm{Median}[\tilde{\delta}_{ij}] = 0$ and a continuous, positive density $h(\cdot)$ in a neighborhood of 0, where
$$h(x) = \int_{-\infty}^{\infty} f(x - \epsilon) f(\epsilon)\, d\epsilon.$$


Proof of Proposition 11 The distribution of $\tilde{\delta}_{ij} = \epsilon_{i+j} + \epsilon_{i-j}$ is
$$\begin{aligned}
F_{\tilde{\delta}_{ij}}(x) &= P(\epsilon_{i+j} + \epsilon_{i-j} \le x) \\
&= \iint_{\epsilon_{i+j} \le x - \epsilon_{i-j}} f(\epsilon_{i+j}) f(\epsilon_{i-j})\, d\epsilon_{i+j}\, d\epsilon_{i-j} \\
&= \int_{-\infty}^{\infty}\Big\{\int_{-\infty}^{x - \epsilon_{i-j}} f(\epsilon_{i+j})\, d\epsilon_{i+j}\Big\} f(\epsilon_{i-j})\, d\epsilon_{i-j} \\
&= \int_{-\infty}^{\infty} F(x - \epsilon_{i-j}) f(\epsilon_{i-j})\, d\epsilon_{i-j},
\end{aligned}$$
where $F(\cdot)$ denotes the cumulative distribution function of $\epsilon_i$. Then the density of $\tilde{\delta}_{ij}$ is
$$h(x) \triangleq \frac{dF_{\tilde{\delta}_{ij}}(x)}{dx} = \int_{-\infty}^{\infty} f(x - \epsilon_{i-j}) f(\epsilon_{i-j})\, d\epsilon_{i-j}.$$
By the symmetry of the density function, $F(-\epsilon_{i-j}) = 1 - F(\epsilon_{i-j})$, and hence
$$F_{\tilde{\delta}_{ij}}(0) = \int_{-\infty}^{\infty} F(-\epsilon_{i-j}) f(\epsilon_{i-j})\, d\epsilon_{i-j} = \int_{-\infty}^{\infty} \big(1 - F(\epsilon_{i-j})\big) f(\epsilon_{i-j})\, d\epsilon_{i-j} = \Big(F(\epsilon_{i-j}) - \frac{1}{2}F^2(\epsilon_{i-j})\Big)\Big|_{-\infty}^{\infty} = \frac{1}{2},$$
$$h(0) = \int_{-\infty}^{\infty} f^2(\epsilon_{i-j})\, d\epsilon_{i-j}.$$

Proof of Theorem 9 Following the proof of Theorem 1, the asymptotic bias is
$$\mathrm{Bias}[\hat{\alpha}_{i2}] \approx -\frac{m^{(6)}(x_i)}{792}\frac{k^4}{n^4},$$
and the asymptotic variance is
$$\mathrm{Var}[\hat{\alpha}_{i2}] \approx \frac{2205}{64 h(0)^2}\frac{n^4}{k^5}.$$

Combining the squared bias and the variance, we obtain the AMSE as
$$\mathrm{AMSE}[\hat{\alpha}_{i2}] = \frac{m^{(6)}(x_i)^2}{792^2}\frac{k^8}{n^8} + \frac{2205}{64 h(0)^2}\frac{n^4}{k^5}. \tag{18}$$

To minimize (18) with respect to $k$, we take the first-order derivative of (18), which gives
$$\frac{d\,\mathrm{AMSE}[\hat{\alpha}_{i2}]}{dk} = \frac{m^{(6)}(x_i)^2}{78408}\frac{k^7}{n^8} - \frac{11025}{64 h(0)^2}\frac{n^4}{k^6}.$$


Now the optimization problem is to solve $\frac{d\,\mathrm{AMSE}[\hat{\alpha}_{i2}]}{dk} = 0$. So we obtain
$$k_{\mathrm{opt}} = \left(\frac{108056025}{8 h(0)^2 m^{(6)}(x_i)^2}\right)^{1/13} n^{12/13} \approx 3.54\left(\frac{1}{h(0)^2 m^{(6)}(x_i)^2}\right)^{1/13} n^{12/13},$$
and
$$\mathrm{AMSE}[\hat{\alpha}_{i2}] \approx 0.29\big(m^{(6)}(x_i)^{10}/h(0)^{16}\big)^{1/13} n^{-8/13}.$$
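The same kind of check applies to the second-derivative bandwidth. The sketch below (ours, with assumed values of $m^{(6)}(x_i)$ and $h(0)$, not estimates from data) minimizes the AMSE in (18) over a grid of $k$ and compares the grid minimizer with the closed-form $k_{\mathrm{opt}}$.

\begin{verbatim}
## Illustrative sketch: numerical vs. closed-form minimizer of the AMSE in (18).
amse18 <- function(k, n, m6, h0) (m6^2 / 792^2) * k^8 / n^8 + 2205 / (64 * h0^2) * n^4 / k^5
n  <- 500
m6 <- 1e6                          # assumed value of m^(6)(x_i), for illustration only
h0 <- 1 / (2 * 0.1 * sqrt(pi))     # h(0) under N(0, 0.1^2) errors, via Proposition 11
k_grid <- 10:(n %/% 2)
k_star <- k_grid[which.min(amse18(k_grid, n, m6, h0))]
k_opt  <- (108056025 / (8 * h0^2 * m6^2))^(1 / 13) * n^(12 / 13)
c(grid_minimizer = k_star, closed_form = k_opt)
\end{verbatim}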

Appendix E. Variances of LowLAD and LowLSR Derivative Estimators with Mixed Normal Distributions

We consider the mixed normal contaminated errors $\epsilon_i \sim (1-\alpha)N(0, \sigma_0^2) + \alpha N(0, \sigma_1^2)$ with $\sigma_1^2 > \sigma_0^2$ and $0 < \alpha < 1/2$. Now the mixed density function is
$$f(x) = (1-\alpha)\frac{1}{\sqrt{2\pi}\sigma_0}e^{-\frac{x^2}{2\sigma_0^2}} + \alpha\frac{1}{\sqrt{2\pi}\sigma_1}e^{-\frac{x^2}{2\sigma_1^2}}.$$

Thus
$$\begin{aligned}
g(0) &= 2\int_{-\infty}^{\infty} f^2(x)\, dx \\
&= 2\Big\{(1-\alpha)^2\int_{-\infty}^{\infty}\frac{1}{2\pi\sigma_0^2}e^{-\frac{x^2}{\sigma_0^2}}\, dx + 2\alpha(1-\alpha)\int_{-\infty}^{\infty}\frac{1}{2\pi\sigma_0\sigma_1}e^{-\big(\frac{x^2}{2\sigma_0^2}+\frac{x^2}{2\sigma_1^2}\big)}\, dx + \alpha^2\int_{-\infty}^{\infty}\frac{1}{2\pi\sigma_1^2}e^{-\frac{x^2}{\sigma_1^2}}\, dx\Big\} \\
&= 2\Big\{(1-\alpha)^2\frac{1}{2\sqrt{\pi}\sigma_0} + 2\alpha(1-\alpha)\frac{1}{\sqrt{2\pi}\sqrt{\sigma_0^2+\sigma_1^2}} + \alpha^2\frac{1}{2\sqrt{\pi}\sigma_1}\Big\},
\end{aligned}$$
$$\mathrm{Var}(\epsilon_i) = (1-\alpha)\sigma_0^2 + \alpha\sigma_1^2 \triangleq \sigma^2.$$

Next, we compare $\sigma^2$ and $1/(2g^2(0))$. With $\lambda = \sigma_1/\sigma_0$ (and normalizing $\sigma_0 = 1$), the difference of these two variances is
$$\begin{aligned}
\mathrm{DV}(\alpha, \lambda) &= \sigma^2 - 1/(2g^2(0)) \\
&= (1-\alpha) + \alpha\lambda^2 - \left(8\Big\{\frac{(1-\alpha)^2}{2\sqrt{\pi}} + \frac{2\alpha(1-\alpha)}{\sqrt{2\pi(1+\lambda^2)}} + \frac{\alpha^2}{2\sqrt{\pi}\lambda}\Big\}^2\right)^{-1}.
\end{aligned}$$
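The following R sketch (ours, not the authors' code) evaluates $g(0)$ and $\mathrm{DV}(\alpha, \lambda)$ for the contaminated normal errors above with $\sigma_0 = 1$ and $\sigma_1 = \lambda$, and checks the closed form for $g(0)$ against numerical integration; the particular values $\alpha = 0.1$ and $\lambda = 10$ are illustrative choices only.

\begin{verbatim}
## Illustrative check of the mixed-normal formulas (sigma_0 = 1, sigma_1 = lambda).
g0_mixed <- function(alpha, lambda)
  2 * ((1 - alpha)^2 / (2 * sqrt(pi)) +
       2 * alpha * (1 - alpha) / sqrt(2 * pi * (1 + lambda^2)) +
       alpha^2 / (2 * sqrt(pi) * lambda))
f_mixed <- function(x, alpha, lambda)
  (1 - alpha) * dnorm(x, 0, 1) + alpha * dnorm(x, 0, lambda)
g0_num <- function(alpha, lambda)        # numerical version of g(0) = 2 * int f^2
  2 * integrate(function(x) f_mixed(x, alpha, lambda)^2, -Inf, Inf)$value
DV <- function(alpha, lambda)            # sigma^2 - 1 / (2 g(0)^2)
  (1 - alpha) + alpha * lambda^2 - 1 / (2 * g0_mixed(alpha, lambda)^2)
c(g0_closed = g0_mixed(0.1, 10), g0_numeric = g0_num(0.1, 10), DV = DV(0.1, 10))
\end{verbatim}

A positive value of $\mathrm{DV}(\alpha, \lambda)$ means the LowLAD variance factor $1/(2g^2(0))$ is smaller than the LowLSR factor $\sigma^2$; the equal-variance curve in Figure 9 corresponds to $\mathrm{DV}(\alpha, \lambda) = 0$.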


[Figure 3 appears here: four panels, (a) LowLAD (k=70), (b) LowLSR (k=70), (c) LowLAD (k=70), (d) LowLSR (k=100); vertical axis: first-order derivative.]

Figure 3: (a-b) The true first-order derivative function (bold line), LowLAD (green line) and LowLSR (red line) estimators. Simulated data set of size 500 from model (1) with equidistant $x_i \in [-1, 1]$, regression function $m_1$, and errors $\epsilon \sim 90\%N(0, 0.1^2) + 10\%N(0, 1^2)$. (c-d) The same designs except $\epsilon \sim 90\%N(0, 0.1^2) + 10\%N(0, 10^2)$.

[Figure 4 appears here: four panels, (a) LowLAD (k=90), (b) LowLSR (k=100), (c) LowLAD (k=90), (d) LowLSR (k=100).]

Figure 4: (a-d) The true first-order derivative function (bold line), LowLAD (green line) and LowLSR (red line) estimators. The same designs as in Figure 3 except the regression function being $m_2$.


[Figure 5 appears here: four panels, (a) LowLAD (k=50), (b) LowLSR (k=60), (c) LowLAD (k=50), (d) LowLSR (k=100).]

Figure 5: (a)-(d) The true second-order derivative function (bold line), LowLAD (green line) and LowLSR (red line) estimators based on the simulated data sets from Figure 3 correspondingly.

[Figure 6 appears here: four panels, (a) LowLAD (k=70), (b) LowLSR (k=80), (c) LowLAD (k=70), (d) LowLSR (k=100); vertical axis: second-order derivative.]

Figure 6: (a)-(d) The true second-order derivative function (bold line), LowLAD (green line) and LowLSR (red line) estimators based on the simulated data sets from Figure 4 correspondingly.


[Figure 7 appears here: boxplots for the LowLAD, LowLSR, locpol, and Pspline estimators.]

Figure 7: Boxplot of four estimators for the function $m_4$ with errors $\epsilon \sim 95\%N(0, 0.1^2) + 5\%N(0, 1^2)$.

[Figure 8 appears here: boxplots for the LowLAD, LowLSR, locpol, and Pspline estimators.]

Figure 8: Boxplot of four estimators for the function $m_4$ with errors $\epsilon \sim 95\%N(0, 0.1^2) + 5\%N(0, 10^2)$.


Figure 9: The curve along which the LowLAD and LowLSR estimators have the same variance, with errors $\epsilon_i \sim (1-\alpha)N(0, \sigma_0^2) + \alpha N(0, \sigma_1^2)$. The two horizontal lines are $y = 1$ (green) and $y = 2.77$ (blue).
