Smoothing Splines: Methods and Applications

Yuedong Wang

Monographs on Statistics and Applied Probability 121

Source: 221.114.158.246/~bunken/statistics/others_smoothingspline.pdf



A general class of powerful and flexible modeling techniques, spline smoothing has attracted a great deal of research attention in recent years and has been widely used in many application areas, from medicine to economics. Smoothing Splines: Methods and Applications covers basic smoothing spline models, including polynomial, periodic, spherical, thin-plate, L-, and partial splines, as well as more advanced models, such as smoothing spline ANOVA, extended and generalized smoothing spline ANOVA, vector spline, nonparametric nonlinear regression, semiparametric regression, and semiparametric mixed-effects models. It also presents methods for model selection and inference.
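The unifying idea behind these models is penalized estimation. As orientation for readers new to the area, the prototypical case, the cubic smoothing spline, minimizes a penalized least squares criterion (a standard formulation, not quoted from the book):

```latex
\min_{f}\; \frac{1}{n}\sum_{i=1}^{n}\bigl(y_i - f(x_i)\bigr)^2
  + \lambda \int_a^b \bigl(f''(x)\bigr)^2\,dx
```

where the smoothing parameter λ ≥ 0 balances fidelity to the data against roughness of the fit: λ = 0 interpolates the observations, while λ → ∞ forces a straight-line fit.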

The book provides unified frameworks for estimation, inference, and software implementation by using the general forms of nonparametric/semiparametric, linear/nonlinear, and fixed/mixed smoothing spline models. The theory of reproducing kernel Hilbert space (RKHS) is used to present various smoothing spline models in a unified fashion. Although this approach can be technical and difficult, the author makes the advanced smoothing spline methodology based on RKHS accessible to practitioners and students. He offers a gentle introduction to RKHS, keeps theory at a minimum level, and explains how RKHS can be used to construct spline models.

Smoothing Splines offers a balanced mix of methodology, computation, implementation, software, and applications. It uses R to perform all data analyses and includes a host of real data examples from astronomy, economics, medicine, and meteorology. The codes for all examples, along with related developments, can be found on the book’s web page.
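All data analyses in the book are performed in R with the author's assist package. The sketch below is not from the book; it is a minimal, self-contained Python illustration of the penalized least squares idea, substituting a discrete second-difference roughness penalty (the Whittaker-Eilers smoother) for the integral penalty, with invented data.

```python
import numpy as np

def whittaker_smooth(y, lam):
    """Penalized least squares with a discrete roughness penalty:
    minimize ||y - z||^2 + lam * ||D z||^2, where D takes second
    differences. The minimizer solves (I + lam * D'D) z = y."""
    n = len(y)
    D = np.diff(np.eye(n), n=2, axis=0)  # (n-2) x n second-difference matrix
    return np.linalg.solve(np.eye(n) + lam * D.T @ D, y)

# Invented illustration data: a smooth signal plus noise
rng = np.random.default_rng(1)
x = np.linspace(0, 1, 100)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=x.size)

z = whittaker_smooth(y, lam=10.0)

# lam = 0 reproduces the data; lam > 0 strictly reduces roughness
D = np.diff(np.eye(len(y)), n=2, axis=0)
assert np.allclose(whittaker_smooth(y, 0.0), y)
assert np.sum((D @ z) ** 2) < np.sum((D @ y) ** 2)
```

Increasing lam trades fidelity for smoothness, which is exactly the role of the smoothing parameter chosen by the UBR, GCV, and GML criteria discussed in Chapter 3.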


Smoothing Splines: Methods and Applications

MONOGRAPHS ON STATISTICS AND APPLIED PROBABILITY

General Editors

F. Bunea, V. Isham, N. Keiding, T. Louis, R. L. Smith, and H. Tong

1 Stochastic Population Models in Ecology and Epidemiology M.S. Bartlett (1960)
2 Queues D.R. Cox and W.L. Smith (1961)
3 Monte Carlo Methods J.M. Hammersley and D.C. Handscomb (1964)
4 The Statistical Analysis of Series of Events D.R. Cox and P.A.W. Lewis (1966)
5 Population Genetics W.J. Ewens (1969)
6 Probability, Statistics and Time M.S. Bartlett (1975)
7 Statistical Inference S.D. Silvey (1975)
8 The Analysis of Contingency Tables B.S. Everitt (1977)
9 Multivariate Analysis in Behavioural Research A.E. Maxwell (1977)
10 Stochastic Abundance Models S. Engen (1978)
11 Some Basic Theory for Statistical Inference E.J.G. Pitman (1979)
12 Point Processes D.R. Cox and V. Isham (1980)
13 Identification of Outliers D.M. Hawkins (1980)
14 Optimal Design S.D. Silvey (1980)
15 Finite Mixture Distributions B.S. Everitt and D.J. Hand (1981)
16 Classification A.D. Gordon (1981)
17 Distribution-Free Statistical Methods, 2nd edition J.S. Maritz (1995)
18 Residuals and Influence in Regression R.D. Cook and S. Weisberg (1982)
19 Applications of Queueing Theory, 2nd edition G.F. Newell (1982)
20 Risk Theory, 3rd edition R.E. Beard, T. Pentikäinen and E. Pesonen (1984)
21 Analysis of Survival Data D.R. Cox and D. Oakes (1984)
22 An Introduction to Latent Variable Models B.S. Everitt (1984)
23 Bandit Problems D.A. Berry and B. Fristedt (1985)
24 Stochastic Modelling and Control M.H.A. Davis and R. Vinter (1985)
25 The Statistical Analysis of Composition Data J. Aitchison (1986)
26 Density Estimation for Statistics and Data Analysis B.W. Silverman (1986)
27 Regression Analysis with Applications G.B. Wetherill (1986)
28 Sequential Methods in Statistics, 3rd edition G.B. Wetherill and K.D. Glazebrook (1986)
29 Tensor Methods in Statistics P. McCullagh (1987)
30 Transformation and Weighting in Regression R.J. Carroll and D. Ruppert (1988)
31 Asymptotic Techniques for Use in Statistics O.E. Barndorff-Nielsen and D.R. Cox (1989)
32 Analysis of Binary Data, 2nd edition D.R. Cox and E.J. Snell (1989)
33 Analysis of Infectious Disease Data N.G. Becker (1989)
34 Design and Analysis of Cross-Over Trials B. Jones and M.G. Kenward (1989)
35 Empirical Bayes Methods, 2nd edition J.S. Maritz and T. Lwin (1989)
36 Symmetric Multivariate and Related Distributions K.T. Fang, S. Kotz and K.W. Ng (1990)
37 Generalized Linear Models, 2nd edition P. McCullagh and J.A. Nelder (1989)
38 Cyclic and Computer Generated Designs, 2nd edition J.A. John and E.R. Williams (1995)
39 Analog Estimation Methods in Econometrics C.F. Manski (1988)
40 Subset Selection in Regression A.J. Miller (1990)
41 Analysis of Repeated Measures M.J. Crowder and D.J. Hand (1990)
42 Statistical Reasoning with Imprecise Probabilities P. Walley (1991)
43 Generalized Additive Models T.J. Hastie and R.J. Tibshirani (1990)
44 Inspection Errors for Attributes in Quality Control N.L. Johnson, S. Kotz and X. Wu (1991)
45 The Analysis of Contingency Tables, 2nd edition B.S. Everitt (1992)


46 The Analysis of Quantal Response Data B.J.T. Morgan (1992)
47 Longitudinal Data with Serial Correlation—A State-Space Approach R.H. Jones (1993)
48 Differential Geometry and Statistics M.K. Murray and J.W. Rice (1993)
49 Markov Models and Optimization M.H.A. Davis (1993)
50 Networks and Chaos—Statistical and Probabilistic Aspects O.E. Barndorff-Nielsen, J.L. Jensen and W.S. Kendall (1993)
51 Number-Theoretic Methods in Statistics K.-T. Fang and Y. Wang (1994)
52 Inference and Asymptotics O.E. Barndorff-Nielsen and D.R. Cox (1994)
53 Practical Risk Theory for Actuaries C.D. Daykin, T. Pentikäinen and M. Pesonen (1994)
54 Biplots J.C. Gower and D.J. Hand (1996)
55 Predictive Inference—An Introduction S. Geisser (1993)
56 Model-Free Curve Estimation M.E. Tarter and M.D. Lock (1993)
57 An Introduction to the Bootstrap B. Efron and R.J. Tibshirani (1993)
58 Nonparametric Regression and Generalized Linear Models P.J. Green and B.W. Silverman (1994)
59 Multidimensional Scaling T.F. Cox and M.A.A. Cox (1994)
60 Kernel Smoothing M.P. Wand and M.C. Jones (1995)
61 Statistics for Long Memory Processes J. Beran (1995)
62 Nonlinear Models for Repeated Measurement Data M. Davidian and D.M. Giltinan (1995)
63 Measurement Error in Nonlinear Models R.J. Carroll, D. Ruppert and L.A. Stefanski (1995)
64 Analyzing and Modeling Rank Data J.J. Marden (1995)
65 Time Series Models—In Econometrics, Finance and Other Fields D.R. Cox, D.V. Hinkley and O.E. Barndorff-Nielsen (1996)
66 Local Polynomial Modeling and its Applications J. Fan and I. Gijbels (1996)
67 Multivariate Dependencies—Models, Analysis and Interpretation D.R. Cox and N. Wermuth (1996)
68 Statistical Inference—Based on the Likelihood A. Azzalini (1996)
69 Bayes and Empirical Bayes Methods for Data Analysis B.P. Carlin and T.A. Louis (1996)
70 Hidden Markov and Other Models for Discrete-Valued Time Series I.L. MacDonald and W. Zucchini (1997)
71 Statistical Evidence—A Likelihood Paradigm R. Royall (1997)
72 Analysis of Incomplete Multivariate Data J.L. Schafer (1997)
73 Multivariate Models and Dependence Concepts H. Joe (1997)
74 Theory of Sample Surveys M.E. Thompson (1997)
75 Retrial Queues G. Falin and J.G.C. Templeton (1997)
76 Theory of Dispersion Models B. Jørgensen (1997)
77 Mixed Poisson Processes J. Grandell (1997)
78 Variance Components Estimation—Mixed Models, Methodologies and Applications P.S.R.S. Rao (1997)
79 Bayesian Methods for Finite Population Sampling G. Meeden and M. Ghosh (1997)
80 Stochastic Geometry—Likelihood and Computation O.E. Barndorff-Nielsen, W.S. Kendall and M.N.M. van Lieshout (1998)
81 Computer-Assisted Analysis of Mixtures and Applications—Meta-analysis, Disease Mapping and Others D. Böhning (1999)
82 Classification, 2nd edition A.D. Gordon (1999)
83 Semimartingales and their Statistical Inference B.L.S. Prakasa Rao (1999)
84 Statistical Aspects of BSE and vCJD—Models for Epidemics C.A. Donnelly and N.M. Ferguson (1999)
85 Set-Indexed Martingales G. Ivanoff and E. Merzbach (2000)
86 The Theory of the Design of Experiments D.R. Cox and N. Reid (2000)
87 Complex Stochastic Systems O.E. Barndorff-Nielsen, D.R. Cox and C. Klüppelberg (2001)
88 Multidimensional Scaling, 2nd edition T.F. Cox and M.A.A. Cox (2001)


89 Algebraic Statistics—Computational Commutative Algebra in Statistics G. Pistone, E. Riccomagno and H.P. Wynn (2001)
90 Analysis of Time Series Structure—SSA and Related Techniques N. Golyandina, V. Nekrutkin and A.A. Zhigljavsky (2001)
91 Subjective Probability Models for Lifetimes Fabio Spizzichino (2001)
92 Empirical Likelihood Art B. Owen (2001)
93 Statistics in the 21st Century Adrian E. Raftery, Martin A. Tanner, and Martin T. Wells (2001)
94 Accelerated Life Models: Modeling and Statistical Analysis Vilijandas Bagdonavicius and Mikhail Nikulin (2001)
95 Subset Selection in Regression, Second Edition Alan Miller (2002)
96 Topics in Modelling of Clustered Data Marc Aerts, Helena Geys, Geert Molenberghs, and Louise M. Ryan (2002)
97 Components of Variance D.R. Cox and P.J. Solomon (2002)
98 Design and Analysis of Cross-Over Trials, 2nd Edition Byron Jones and Michael G. Kenward (2003)
99 Extreme Values in Finance, Telecommunications, and the Environment Bärbel Finkenstädt and Holger Rootzén (2003)
100 Statistical Inference and Simulation for Spatial Point Processes Jesper Møller and Rasmus Plenge Waagepetersen (2004)
101 Hierarchical Modeling and Analysis for Spatial Data Sudipto Banerjee, Bradley P. Carlin, and Alan E. Gelfand (2004)
102 Diagnostic Checks in Time Series Wai Keung Li (2004)
103 Stereology for Statisticians Adrian Baddeley and Eva B. Vedel Jensen (2004)
104 Gaussian Markov Random Fields: Theory and Applications Håvard Rue and Leonhard Held (2005)
105 Measurement Error in Nonlinear Models: A Modern Perspective, Second Edition Raymond J. Carroll, David Ruppert, Leonard A. Stefanski, and Ciprian M. Crainiceanu (2006)
106 Generalized Linear Models with Random Effects: Unified Analysis via H-likelihood Youngjo Lee, John A. Nelder, and Yudi Pawitan (2006)
107 Statistical Methods for Spatio-Temporal Systems Bärbel Finkenstädt, Leonhard Held, and Valerie Isham (2007)
108 Nonlinear Time Series: Semiparametric and Nonparametric Methods Jiti Gao (2007)
109 Missing Data in Longitudinal Studies: Strategies for Bayesian Modeling and Sensitivity Analysis Michael J. Daniels and Joseph W. Hogan (2008)
110 Hidden Markov Models for Time Series: An Introduction Using R Walter Zucchini and Iain L. MacDonald (2009)
111 ROC Curves for Continuous Data Wojtek J. Krzanowski and David J. Hand (2009)
112 Antedependence Models for Longitudinal Data Dale L. Zimmerman and Vicente A. Núñez-Antón (2009)
113 Mixed Effects Models for Complex Data Lang Wu (2010)
114 Introduction to Time Series Modeling Genshiro Kitagawa (2010)
115 Expansions and Asymptotics for Statistics Christopher G. Small (2010)
116 Statistical Inference: An Integrated Bayesian/Likelihood Approach Murray Aitkin (2010)
117 Circular and Linear Regression: Fitting Circles and Lines by Least Squares Nikolai Chernov (2010)
118 Simultaneous Inference in Regression Wei Liu (2010)
119 Robust Nonparametric Statistical Methods, Second Edition Thomas P. Hettmansperger and Joseph W. McKean (2011)
120 Statistical Inference: The Minimum Distance Approach Ayanendranath Basu, Hiroyuki Shioya, and Chanseok Park (2011)
121 Smoothing Splines: Methods and Applications Yuedong Wang (2011)


Monographs on Statistics and Applied Probability 121

Smoothing Splines: Methods and Applications

Yuedong Wang
University of California
Santa Barbara, California, USA


CRC Press
Taylor & Francis Group
6000 Broken Sound Parkway NW, Suite 300
Boca Raton, FL 33487-2742

© 2011 by Taylor & Francis Group, LLC
CRC Press is an imprint of Taylor & Francis Group, an Informa business

No claim to original U.S. Government works
Version Date: 20110429

International Standard Book Number-13: 978-1-4200-7756-8 (eBook - PDF)

This book contains information obtained from authentic and highly regarded sources. Reasonable efforts have been made to publish reliable data and information, but the author and publisher cannot assume responsibility for the validity of all materials or the consequences of their use. The authors and publishers have attempted to trace the copyright holders of all material reproduced in this publication and apologize to copyright holders if permission to publish in this form has not been obtained. If any copyright material has not been acknowledged please write and let us know so we may rectify in any future reprint.

Except as permitted under U.S. Copyright Law, no part of this book may be reprinted, reproduced, transmitted, or utilized in any form by any electronic, mechanical, or other means, now known or hereafter invented, including photocopying, microfilming, and recording, or in any information storage or retrieval system, without written permission from the publishers.

For permission to photocopy or use material electronically from this work, please access www.copyright.com (http://www.copyright.com/) or contact the Copyright Clearance Center, Inc. (CCC), 222 Rosewood Drive, Danvers, MA 01923, 978-750-8400. CCC is a not-for-profit organization that provides licenses and registration for a variety of users. For organizations that have been granted a photocopy license by the CCC, a separate system of payment has been arranged.

Trademark Notice: Product or corporate names may be trademarks or registered trademarks, and are used only for identification and explanation without intent to infringe.

Visit the Taylor & Francis Web site at http://www.taylorandfrancis.com

and the CRC Press Web site at http://www.crcpress.com


TO

YAN, CATHERINE, AND KEVIN


Contents

1 Introduction 1
1.1 Parametric and Nonparametric Regression 1
1.2 Polynomial Splines 4
1.3 Scope of This Book 7
1.4 The assist Package 9

2 Smoothing Spline Regression 11
2.1 Reproducing Kernel Hilbert Space 11
2.2 Model Space for Polynomial Splines 14
2.3 General Smoothing Spline Regression Models 16
2.4 Penalized Least Squares Estimation 17
2.5 The ssr Function 20
2.6 Another Construction for Polynomial Splines 22
2.7 Periodic Splines 24
2.8 Thin-Plate Splines 26
2.9 Spherical Splines 29
2.10 Partial Splines 30
2.11 L-splines 39
2.11.1 Motivation 39
2.11.2 Exponential Spline 41
2.11.3 Logistic Spline 44
2.11.4 Linear-Periodic Spline 46
2.11.5 Trigonometric Spline 48

3 Smoothing Parameter Selection and Inference 53
3.1 Impact of the Smoothing Parameter 53
3.2 Trade-Offs 57
3.3 Unbiased Risk 62
3.4 Cross-Validation and Generalized Cross-Validation 64
3.5 Bayes and Linear Mixed-Effects Models 67
3.6 Generalized Maximum Likelihood 71
3.7 Comparison and Implementation 72
3.8 Confidence Intervals 75
3.8.1 Bayesian Confidence Intervals 75
3.8.2 Bootstrap Confidence Intervals 81
3.9 Hypothesis Tests 84
3.9.1 The Hypothesis 84
3.9.2 Locally Most Powerful Test 85
3.9.3 Generalized Maximum Likelihood Test 86
3.9.4 Generalized Cross-Validation Test 87
3.9.5 Comparison and Implementation 87

4 Smoothing Spline ANOVA 91
4.1 Multiple Regression 91
4.2 Tensor Product Reproducing Kernel Hilbert Spaces 92
4.3 One-Way SS ANOVA Decomposition 93
4.3.1 Decomposition of R^a: One-Way ANOVA 95
4.3.2 Decomposition of W_2^m[a, b] 96
4.3.3 Decomposition of W_2^m(per) 97
4.3.4 Decomposition of W_2^m(R^d) 97
4.4 Two-Way SS ANOVA Decomposition 98
4.4.1 Decomposition of R^a ⊗ R^b: Two-Way ANOVA 99
4.4.2 Decomposition of R^a ⊗ W_2^m[0, 1] 100
4.4.3 Decomposition of W_2^{m1}[0, 1] ⊗ W_2^{m2}[0, 1] 103
4.4.4 Decomposition of R^a ⊗ W_2^m(per) 106
4.4.5 Decomposition of W_2^{m1}(per) ⊗ W_2^{m2}[0, 1] 107
4.4.6 Decomposition of W_2^2(R^2) ⊗ W_2^m(per) 108
4.5 General SS ANOVA Decomposition 110
4.6 SS ANOVA Models and Estimation 111
4.7 Selection of Smoothing Parameters 114
4.8 Confidence Intervals 116
4.9 Examples 117
4.9.1 Tongue Shapes 117
4.9.2 Ozone in Arosa — Revisit 126
4.9.3 Canadian Weather — Revisit 131
4.9.4 Texas Weather 133

5 Spline Smoothing with Heteroscedastic and/or Correlated Errors 139
5.1 Problems with Heteroscedasticity and Correlation 139
5.2 Extended SS ANOVA Models 142
5.2.1 Penalized Weighted Least Squares 142
5.2.2 UBR, GCV and GML Criteria 144
5.2.3 Known Covariance 147
5.2.4 Unknown Covariance 148
5.2.5 Confidence Intervals 150
5.3 Variance and Correlation Structures 150
5.4 Examples 153
5.4.1 Simulated Motorcycle Accident — Revisit 153
5.4.2 Ozone in Arosa — Revisit 154
5.4.3 Beveridge Wheat Price Index 157
5.4.4 Lake Acidity 158

6 Generalized Smoothing Spline ANOVA 163
6.1 Generalized SS ANOVA Models 163
6.2 Estimation and Inference 164
6.2.1 Penalized Likelihood Estimation 164
6.2.2 Selection of Smoothing Parameters 167
6.2.3 Algorithm and Implementation 168
6.2.4 Bayes Model, Direct GML and Approximate Bayesian Confidence Intervals 170
6.3 Wisconsin Epidemiological Study of Diabetic Retinopathy 172
6.4 Smoothing Spline Estimation of Variance Functions 176
6.5 Smoothing Spline Spectral Analysis 182
6.5.1 Spectrum Estimation of a Stationary Process 182
6.5.2 Time-Varying Spectrum Estimation of a Locally Stationary Process 183
6.5.3 Epileptic EEG 185

7 Smoothing Spline Nonlinear Regression 195
7.1 Motivation 195
7.2 Nonparametric Nonlinear Regression Models 196
7.3 Estimation with a Single Function 197
7.3.1 Gauss–Newton and Newton–Raphson Methods 197
7.3.2 Extended Gauss–Newton Method 199
7.3.3 Smoothing Parameter Selection and Inference 201
7.4 Estimation with Multiple Functions 204
7.5 The nnr Function 205
7.6 Examples 206
7.6.1 Nonparametric Regression Subject to Positive Constraint 206
7.6.2 Nonparametric Regression Subject to Monotone Constraint 207
7.6.3 Term Structure of Interest Rates 212
7.6.4 A Multiplicative Model for Chickenpox Epidemic 218
7.6.5 A Multiplicative Model for Texas Weather 223

8 Semiparametric Regression 227
8.1 Motivation 227
8.2 Semiparametric Linear Regression Models 228
8.2.1 The Model 228
8.2.2 Estimation and Inference 229
8.2.3 Vector Spline 233
8.3 Semiparametric Nonlinear Regression Models 240
8.3.1 The Model 240
8.3.2 SNR Models for Clustered Data 241
8.3.3 Estimation and Inference 242
8.3.4 The snr Function 245
8.4 Examples 247
8.4.1 Canadian Weather — Revisit 247
8.4.2 Superconductivity Magnetization Modeling 254
8.4.3 Oil-Bearing Rocks 257
8.4.4 Air Quality 259
8.4.5 The Evolution of the Mira Variable R Hydrae 262
8.4.6 Circadian Rhythm 267

9 Semiparametric Mixed-Effects Models 273
9.1 Linear Mixed-Effects Models 273
9.2 Semiparametric Linear Mixed-Effects Models 274
9.2.1 The Model 274
9.2.2 Estimation and Inference 275
9.2.3 The slm Function 279
9.2.4 SS ANOVA Decomposition 280
9.3 Semiparametric Nonlinear Mixed-Effects Models 283
9.3.1 The Model 283
9.3.2 Estimation and Inference 284
9.3.3 Implementation and the snm Function 286
9.4 Examples 288
9.4.1 Ozone in Arosa — Revisit 288
9.4.2 Lake Acidity — Revisit 291
9.4.3 Coronary Sinus Potassium in Dogs 294
9.4.4 Carbon Dioxide Uptake 305
9.4.5 Circadian Rhythm — Revisit 310

A Data Sets 323
A.1 Air Quality Data 324
A.2 Arosa Ozone Data 324
A.3 Beveridge Wheat Price Index Data 324
A.4 Bond Data 324
A.5 Canadian Weather Data 325
A.6 Carbon Dioxide Data 325
A.7 Chickenpox Data 325
A.8 Child Growth Data 326
A.9 Dog Data 326
A.10 Geyser Data 326
A.11 Hormone Data 327
A.12 Lake Acidity Data 327
A.13 Melanoma Data 327
A.14 Motorcycle Data 328
A.15 Paramecium caudatum Data 328
A.16 Rock Data 328
A.17 Seizure Data 328
A.18 Star Data 329
A.19 Stratford Weather Data 329
A.20 Superconductivity Data 329
A.21 Texas Weather Data 330
A.22 Ultrasound Data 330
A.23 USA Climate Data 331
A.24 Weight Loss Data 331
A.25 WESDR Data 331
A.26 World Climate Data 332

B Codes for Fitting Strictly Increasing Functions 333
B.1 C and R Codes for Computing Integrals 333
B.2 R Function inc 336

C Codes for Term Structure of Interest Rates 339
C.1 C and R Codes for Computing Integrals 339
C.2 R Function for One Bond 341
C.3 R Function for Two Bonds 342

References 347

Author Index 355

Subject Index 359


List of Tables

2.1 Bases of null spaces and RKs for linear and cubic splines under the construction in Section 2.2 with X = [0, b] 21
2.2 Bases of null spaces and RKs for linear and cubic splines under the construction in Section 2.6 with X = [0, 1] 23
5.1 Standard varFunc classes 151
5.2 Standard corStruct classes for serial correlation structures 152
5.3 Standard corStruct classes for spatial correlation structures 153
A.1 List of all data sets 323


List of Figures

1.1 Geyser data, observations, the straight line fit, and resid-uals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2

1.2 Motorcycle data, observations, and a polynomial fit . . . 21.3 Geyser data, residuals, and the cubic spline fits . . . . . 31.4 Motorcycle data, observations, and the cubic spline fit . 41.5 Relationship between functions in the assist package and

some of the existing R functions . . . . . . . . . . . . . 10

2.1 Motorcycle data, the linear, and cubic spline fits . . . . 222.2 Arosa data, observations, and the periodic spline fits . . 252.3 USA climate data, the thin-plate spline fit . . . . . . . . 282.4 World climate data, the spherical spline fit . . . . . . . . 312.5 Geyser data, the partial spline fit, residuals, and the AIC

and GCV scores . . . . . . . . . . . . . . . . . . . . . . 332.6 Motorcycle data, the partial spline fit, and the AIC and

GCV scores . . . . . . . . . . . . . . . . . . . . . . . . . 342.7 Arosa data, the partial spline estimates of the month and

year effects . . . . . . . . . . . . . . . . . . . . . . . . . 362.8 Canadian weather data, estimate of the weight function,

and confidence intervals . . . . . . . . . . . . . . . . . . 382.9 Weight loss data, observations and the nonlinear regres-

sion, cubic spline, and exponential spline fits . . . . . . 432.10 Paramecium caudatum data, observations and the non-

linear regression, cubic spline, and logistic spline fits . . 452.11 Melanoma data, observations, and the cubic spline and

linear-periodic spline fits . . . . . . . . . . . . . . . . . . 482.12 Arosa data, the overall fits and their projections . . . . 51

3.1 Stratford weather data, observations, and the periodicspline fits with different smoothing parameters . . . . . 54

3.2 Weights of the periodic spline filter . . . . . . . . . . . . 573.3 Stratford data, degrees of freedom, and residual sum of

squares . . . . . . . . . . . . . . . . . . . . . . . . . . . 593.4 Squared bias, variance, and MSE from a simulation . . . 613.5 PSE and UBR functions . . . . . . . . . . . . . . . . . . 65

xvii

Page 19: Smoothing Splines - 221.114.158.246221.114.158.246/~bunken/statistics/others_smoothingspline.pdf · Applications covers basic smoothing spline models, including polynomial, periodic,

xviii

3.6 PSE, CV, and GCV functions
3.7 Geyser data, estimates of the smooth components in the cubic and partial spline models
3.8 Motorcycle data, partial spline fit, and t-statistics
3.9 Pointwise coverages and across-the-function coverages

4.1 Ultrasound data, 3-d plots of observations
4.2 Ultrasound data, observations, fits, confidence intervals, and the mean curves among three environments
4.3 Ultrasound data, the overall interaction
4.4 Ultrasound data, effects of environment
4.5 Ultrasound data, estimated tongue shapes as functions of length and time
4.6 Ultrasound data, the estimated time effect
4.7 Ultrasound data, estimated tongue shape as a function of environment, length and time
4.8 Ultrasound data, the estimated environment effect
4.9 Arosa data, estimates of the interactions and smooth component
4.10 Arosa data, estimates of the main effects
4.11 Canadian weather data, temperature profiles of stations in four regions and the estimated profiles
4.12 Canadian weather data, the estimated region effects to temperature
4.13 Texas weather data, observations as curves
4.14 Texas weather data, observations as surfaces
4.15 Texas weather data, the location effects for four selected stations
4.16 Texas weather data, the month effects for January, April, July, and October

5.1 WMSEs and coverages of Bayesian confidence intervals with the presence of heteroscedasticity
5.2 Cubic spline fits when data are correlated
5.3 Cubic spline fits and estimated autocorrelation functions for two simulations
5.4 Motorcycle data, estimates of the mean and variance functions
5.5 Arosa data, residuals variances and PWLS fit
5.6 Beveridge data, time series and cubic spline fits
5.7 Lake acidity data, effects of calcium and geological location


6.1 WESDR data, the estimated probability functions
6.2 Motorcycle data, estimates of the variance function based on three procedures
6.3 Motorcycle data, DGML function, and estimate of the variance and mean functions
6.4 Seizure data, the baseline and preseizure IEEG segments
6.5 Seizure data, periodograms, estimates of the spectra based on the iterative UBR method and confidence intervals
6.6 Seizure data, estimates of the time-varying spectra based on the iterative UBR method
6.7 Seizure data, estimates of the time-varying spectra based on the DGML method

7.1 Nonparametric regression under positivity constraint
7.2 Nonparametric regression under monotonicity constraint
7.3 Child growth data, cubic spline fit, fit under monotonicity constraint and estimate of the velocity
7.4 Bond data, unconstrained and constrained estimates of the discount functions, forward rates and credit spread
7.5 Chickenpox data, time series plot and the fits by multiplicative and SS ANOVA models
7.6 Chickenpox data, estimates of the mean and amplitude functions in the multiplicative model
7.7 Chickenpox data, estimates of the shape function in the multiplicative model and its projections
7.8 Texas weather data, estimates of the mean and amplitude functions in the multiplicative model
7.9 Texas weather data, temperature profiles
7.10 Texas weather data, the estimated interaction against the estimated main effect for two stations

8.1 Separate and joint fits from a simulation
8.2 Estimates of the differences
8.3 Canadian weather data, estimated region effects to precipitation
8.4 Canadian weather data, estimate of the coefficient function for the temperature effect
8.5 Canadian weather data, estimates of the intercept and weight functions
8.6 Superconductivity data, observations, and the fits by nonlinear regression, cubic spline, nonlinear partial spline, and L-spline


8.7 Superconductivity data, estimates of departures from the straight line model and the “interpolation formula”
8.8 Rock data, estimates of functions in the projection pursuit regression model
8.9 Air quality data, estimates of functions in SNR models
8.10 Star data, observations, and the overall fit
8.11 Star data, folded observations, estimates of the common shape function and its projection
8.12 Star data, estimates of the amplitude and period functions
8.13 Hormone data, cortisol concentrations for normal subjects and the fits based on an SIM
8.14 Hormone data, estimate of the common shape function in an SIM and its projection for normal subjects

9.1 Arosa data, the overall fit, seasonal trend, and long-term trend
9.2 Arosa data, the overall fit, seasonal trend, long-term trend and local stochastic trend
9.3 Lake acidity data, effects of calcium and geological location
9.4 Dog data, coronary sinus potassium concentrations over time
9.5 Dog data, estimates of the group mean response curves
9.6 Dog data, estimates of the group mean response curves under new penalty
9.7 Dog data, predictions for four dogs
9.8 Carbon dioxide data, observations and fits by the NLME and SNM models
9.9 Carbon dioxide data, overall estimate and projections of the nonparametric shape function
9.10 Hormone data, cortisol concentrations for normal subjects, and the fits based on a mixed-effects SIM
9.11 Hormone data, cortisol concentrations for depressed subjects, and the fits based on a mixed-effects SIM
9.12 Hormone data, cortisol concentrations for subjects with Cushing’s disease, and the fits based on a mixed-effects SIM
9.13 Hormone data, estimates of the common shape functions in the mixed-effect SIM
9.14 Hormone data, plot of the estimated 24-hour mean levels against amplitudes


Symbol Description

(x)+                    max{x, 0}
x ∧ z                   min{x, z}
x ∨ z                   max{x, z}
det+                    Product of the nonzero eigenvalues
kr(x)                   Scaled Bernoulli polynomials
(·, ·)                  Inner product
‖ · ‖                   Norm
X                       Domain of a function
S                       Unit sphere
H                       Function space
L                       Linear functional
N                       Nonlinear functional
P                       Projection
A                       Averaging operator
M                       Model space
R(x, z)                 Reproducing kernel
R^d                     Euclidean d-space
NS2m(t1, . . . , tk)    Natural polynomial spline space
W^m_2[a, b]             Sobolev space on [a, b]
W^m_2(per)              Sobolev space on unit circle
W^m_2(R^d)              Thin-plate spline model space
W^m_2(S)                Sobolev space on unit sphere
⊕                       Direct sum of function spaces
⊗                       Tensor product of function spaces


Preface

Statistical analysis often involves building mathematical models that examine the relationship between dependent and independent variables. This book is about a general class of powerful and flexible modeling techniques, namely, spline smoothing.

Research on smoothing spline models has attracted a great deal of attention in recent years, and the methodology has been widely used in many areas. This book provides an introduction to some basic smoothing spline models, including polynomial, periodic, spherical, thin-plate, L-, and partial splines, as well as an overview of more advanced models, including smoothing spline ANOVA, extended and generalized smoothing spline ANOVA, vector spline, nonparametric nonlinear regression, semiparametric regression, and semiparametric mixed-effects models. Methods for model selection and inference are also presented.

The general forms of nonparametric/semiparametric linear/nonlinear fixed/mixed smoothing spline models in this book provide unified frameworks for estimation, inference, and software implementation. This book draws on the theory of reproducing kernel Hilbert space (RKHS) to present various smoothing spline models in a unified fashion. On the other hand, the subject of smoothing spline in the context of RKHS and regularization is often regarded as technical and difficult. One of my main goals is to make the advanced smoothing spline methodology based on RKHS more accessible to practitioners and students. With this in mind, the book focuses on methodology, computation, implementation, software, and application. It provides a gentle introduction to the RKHS, keeps theory at the minimum level, and provides details on how the RKHS can be used to construct spline models.

User-friendly software is key to the routine use of any statistical method. The assist library in R implements methods presented in this book for fitting various nonparametric/semiparametric linear/nonlinear fixed/mixed smoothing spline models. The assist library can be obtained at

http://www.r-project.org

Much of the exposition is based on the analysis of real examples. Rather than formal analysis, these examples are intended to illustrate the power and versatility of the spline smoothing methodology. All data analyses are performed in R, and most of them use functions in the assist library. Codes for all examples and further developments related to this book will be posted on the web page

http://www.pstat.ucsb.edu/faculty/yuedong/book.html

This book is intended for those wanting to learn about smoothing splines. It can be a reference book for statisticians and scientists who need advanced and flexible modeling techniques. It can also serve as a text for an advanced-level graduate course on the subject. In fact, topics in Chapters 1–4 were covered in a quarter class at the University of California, Santa Barbara, and the University of Science and Technology of China.

I was fortunate indeed to have learned the smoothing spline from Grace Wahba, whose pioneering work has paved the way for much ongoing research and made this book possible. I am grateful to Chunlei Ke, my former student and collaborator, for developing the assist package. Special thanks go to Anna Liu for reading the draft carefully and correcting many mistakes. Several people have helped me over various phases of writing this book: Chong Gu, Wensheng Guo, David Hinkley, Ping Ma, and Wendy Meiring. I must thank my editor, David Grubbs, for his patience and encouragement. Finally, I would like to thank several researchers who kindly shared their data sets for inclusion in this book; they are cited where their data are introduced.

Yuedong Wang
Santa Barbara
December 2010


Chapter 1

Introduction

1.1 Parametric and Nonparametric Regression

Regression analysis builds mathematical models that examine the relationship of a dependent variable to one or more independent variables. These models may be used to predict responses at unobserved and/or future values of the independent variables. In the simple case when both the dependent variable y and the independent variable x are scalar variables, given observations (xi, yi) for i = 1, . . . , n, a regression model relates dependent and independent variables as follows:

yi = f(xi) + εi,   i = 1, . . . , n,   (1.1)

where f is the regression function and the εi are zero-mean independent random errors with a common variance σ². The goal of regression analysis is to construct a model for f and estimate it based on noisy data.

For example, for the Old Faithful geyser in Yellowstone National Park, consider the problem of predicting the waiting time to the next eruption using the length of the previous eruption. Figure 1.1(a) shows the scatter plot of waiting time to the next eruption (y = waiting) against duration of the previous eruption (x = duration) for 272 observations from the Old Faithful geyser. The goal is to build a mathematical model that relates the waiting time to the duration of the previous eruption. A first attempt might be to approximate the regression function f by a straight line

f(x) = β0 + β1x. (1.2)

The least squares straight line fit is shown in Figure 1.1(a). There is no apparent sign of lack-of-fit. Furthermore, there is no clear visible trend in the plot of residuals in Figure 1.1(b).

Often f is nonlinear in x. A common approach to dealing with a nonlinear relationship is to approximate f by a polynomial of order m

f(x) = β0 + β1x + · · · + βm−1 x^(m−1).   (1.3)


FIGURE 1.1 Geyser data, plots of (a) observations and the least squares straight line fit, and (b) residuals.

Figure 1.2 shows the scatter plot of acceleration (y = acceleration) against time after impact (x = time) from a simulated motorcycle crash experiment on the efficacy of crash helmets. It is clear that a straight line cannot explain the relationship between acceleration and time. Polynomials with m = 1, . . . , 20 are fitted to the data, and Figure 1.2 shows the best fit selected by Akaike’s information criterion (AIC). There are waves in the fitted curve at both ends of the range. The fit is still not completely satisfactory even when polynomials up to order 20 are considered. Unlike the linear regression model (1.2), except for small m, coefficients in model (1.3) no longer have nice interpretations.

FIGURE 1.2 Motorcycle data, plot of observations, and a polynomial fit.
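For readers who want to experiment with the order-selection step just described, here is an illustrative sketch in Python (the book itself uses R). The data are synthetic stand-ins for the motorcycle measurements, and the helper names solve and polyfit_aic are ours: polynomials of order m = 1, . . . , max_order are fitted by least squares via the normal equations, and the order minimizing a Gaussian AIC, n log(RSS/n) + 2m (up to an additive constant), is selected.

```python
import math
import random

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            k = M[r][c] / M[c][c]
            for j in range(c, n + 1):
                M[r][j] -= k * M[c][j]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][j] * x[j] for j in range(r + 1, n))) / M[r][r]
    return x

def polyfit_aic(xs, ys, max_order):
    """Fit orders m = 1..max_order; return (best order, its coefficients)."""
    n, best = len(xs), None
    for m in range(1, max_order + 1):
        X = [[x ** j for j in range(m)] for x in xs]        # design matrix, degree m - 1
        XtX = [[sum(r[i] * r[j] for r in X) for j in range(m)] for i in range(m)]
        Xty = [sum(X[i][j] * ys[i] for i in range(n)) for j in range(m)]
        beta = solve(XtX, Xty)
        rss = sum((ys[i] - sum(beta[j] * X[i][j] for j in range(m))) ** 2
                  for i in range(n))
        aic = n * math.log(rss / n) + 2 * m                 # Gaussian AIC, up to a constant
        if best is None or aic < best[0]:
            best = (aic, m, beta)
    return best[1], best[2]

random.seed(1)
xs = [i / 50 for i in range(50)]                            # synthetic design points in [0, 1)
ys = [math.sin(2 * math.pi * x) + random.gauss(0, 0.05) for x in xs]
order, beta = polyfit_aic(xs, ys, 6)                        # AIC favors a mid-order fit
```

Even in this toy setting the global nature of the polynomial basis is visible: the selected order trades bias near the boundaries against variance, which is exactly the behavior seen for the motorcycle data in Figure 1.2.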


In general, a parametric regression model assumes that the form of f is known except for finitely many unknown parameters. The specific form of f may come from scientific theories and/or approximations to mechanics under some simplified assumptions. The assumptions may be too restrictive and the approximations may be too crude for some applications. An inappropriate model can lead to systematic bias and misleading conclusions. In practice, one should always check the assumed form for the function f.

It is often difficult, if not impossible, to obtain a specific functional form for f. A nonparametric regression model does not assume a predetermined form. Instead, it makes assumptions on qualitative properties of f. For example, one may be willing to assume that f is “smooth”, which does not reduce to a specific form with a finite number of parameters. Rather, it usually leads to some infinite dimensional collection of functions. The basic idea of nonparametric regression is to let the data speak for themselves; that is, to let the data decide which function fits best without imposing any specific form on f. Consequently, nonparametric methods are in general more flexible. They can uncover structure in the data that might otherwise be missed.

FIGURE 1.3 Geyser data, plots of (a) residuals from the straight line fit and the cubic spline fit to the residuals, and (b) the cubic spline fit to the original data.

For illustration, we fit cubic splines to the geyser data. The cubic spline is a special nonparametric regression model that will be introduced in Section 1.2. A cubic spline fit to residuals from the linear model (1.2) reveals a nonzero trend in Figure 1.3(a). This raises the question of whether a simple linear regression model is appropriate for the geyser data. A cubic spline fit to the original data is shown in Figure 1.3(b). It reveals that there are two clusters in the independent variable, and a different linear model may be required for each cluster. Sections 2.10, 3.8, and 3.9 contain more analysis of the geyser data. A cubic spline fit to the motorcycle data is shown in Figure 1.4. It fits the data much better than the polynomial model. Sections 2.10, 3.8, 5.4.1, and 6.4 contain more analysis of the motorcycle data.

FIGURE 1.4 Motorcycle data, plot of observations, and the cubic spline fit.

The above simple exposition indicates that the nonparametric regression technique can be applied to different steps in regression analysis: data exploration, model building, testing parametric models, and diagnosis. In fact, as illustrated throughout the book, spline smoothing is a powerful and versatile tool for building statistical models to exploit structures in data.

1.2 Polynomial Splines

The polynomial (1.3) is a global model, which makes it less adaptive to local variations. Individual observations can have undue influence on the fit in remote regions. For example, in the motorcycle data, the behavior of the mean function varies drastically from one region to another. These local variations led to oscillations at both ends of the range in the polynomial fit. A natural solution to overcome this limitation is to use piecewise polynomials, the basic idea behind polynomial splines.

Let a < t1 < · · · < tk < b be fixed points called knots, and let t0 = a and tk+1 = b. Roughly speaking, polynomial splines are piecewise polynomials joined together smoothly at the knots. Formally, a polynomial spline of order r is a real-valued function f(t) on [a, b] such that

(i) f is a piecewise polynomial of order r on [ti, ti+1), i = 0, 1, . . . , k;

(ii) f has r − 2 continuous derivatives, and the (r − 1)st derivative is a step function with jumps at the knots.

Now consider even orders, represented as r = 2m. The function f is a natural polynomial spline of order 2m if, in addition to (i) and (ii), it satisfies the natural boundary conditions

(iii) f^(j)(a) = f^(j)(b) = 0,   j = m, . . . , 2m − 1.

The natural boundary conditions imply that f is a polynomial of order m on the two outside subintervals [a, t1] and [tk, b]. Denote the function space of natural polynomial splines of order 2m with knots t1, . . . , tk as NS2m(t1, . . . , tk).
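Conditions (i) and (ii) can be checked numerically on a standard building block of cubic splines (r = 4), the truncated power function g(x) = (x − t)³₊ with knot t. The sketch below is our own illustration (finite differences in Python, not from the book): g, g′, and g″ are continuous at the knot, while the third derivative is a step function that jumps by 3! = 6.

```python
def g(x, t=1.0):
    """Cubic truncated power function (x - t)_+^3 with knot t."""
    return max(x - t, 0.0) ** 3

def deriv(fun, x, k, h):
    """k-th derivative of fun at x by recursive central differences."""
    if k == 0:
        return fun(x)
    return (deriv(fun, x + h / 2, k - 1, h) - deriv(fun, x - h / 2, k - 1, h)) / h

t, eps = 1.0, 1e-3
# g, g', g'' evaluated just left and just right of the knot nearly agree ...
gaps = [abs(deriv(g, t + eps, k, 1e-4) - deriv(g, t - eps, k, 1e-4))
        for k in range(3)]
# ... while the third derivative is a step: 0 left of the knot, 3! = 6 right of it
jump = deriv(g, t + 0.5, 3, 1e-2) - deriv(g, t - 0.5, 3, 1e-2)
```

The gaps shrink with eps (continuity of g, g′, g″), whereas the jump stays at 6, matching condition (ii) for r = 4.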

One approach, known as regression spline, is to approximate f using a polynomial spline or natural polynomial spline. To get a good approximation, one needs to decide the number and locations of the knots. This book covers a different approach known as smoothing spline. It starts with a well-defined model space for f and introduces a penalty to prevent overfitting. We now describe this approach for polynomial splines.

Consider the regression model (1.1). Suppose f is “smooth”. Specifically, assume that f ∈ W^m_2[a, b], where the Sobolev space

W^m_2[a, b] = {f : f, f′, . . . , f^(m−1) are absolutely continuous, ∫_a^b (f^(m))² dx < ∞}.   (1.4)

For any a ≤ x ≤ b, Taylor’s theorem states that

f(x) = Σ_{ν=0}^{m−1} f^(ν)(a)(x − a)^ν / ν! + ∫_a^x (x − u)^(m−1) f^(m)(u) / (m − 1)! du,   (1.5)

where the first term is a polynomial of order m and the second term is the remainder, denoted Rem(x).
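As a quick numerical sanity check of (1.5) (our illustration, not from the book), take f = exp and a = 0, so that f^(ν) = exp for every ν. The order-m Taylor polynomial plus the integral remainder, computed with the composite trapezoidal rule, recovers f(x):

```python
import math

def taylor_plus_remainder(x, a=0.0, m=3, steps=2000):
    """Right-hand side of (1.5) for f = exp (so f^(v) = exp for all v)."""
    # polynomial of order m (degree m - 1)
    poly = sum(math.exp(a) * (x - a) ** v / math.factorial(v) for v in range(m))
    # Rem(x) = integral_a^x (x - u)^(m-1) / (m-1)! * f^(m)(u) du
    g = lambda u: (x - u) ** (m - 1) / math.factorial(m - 1) * math.exp(u)
    h = (x - a) / steps
    rem = h * (g(a) / 2 + sum(g(a + i * h) for i in range(1, steps)) + g(x) / 2)
    return poly + rem

value = taylor_plus_remainder(1.5)   # should be close to exp(1.5)
```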

It is clear that the polynomial regression model (1.3) ignores the remainder term Rem(x) in the hope that it is negligible. It is often difficult to verify this assumption in practice. The idea behind spline smoothing is to let the data decide how large Rem(x) should be. Since W^m_2[a, b] is an infinite dimensional space, a direct fit to f by minimizing the least squares (LS)

(1/n) Σ_{i=1}^{n} (yi − f(xi))²   (1.6)

leads to interpolation. Therefore, certain control over Rem(x) is necessary. One natural approach is to control how far f is allowed to depart from the polynomial model. Under appropriate norms defined later in Sections 2.2 and 2.6, one measure of the distance between f and polynomials is ∫_a^b (f^(m))² dx. It is then reasonable to estimate f by minimizing the LS (1.6) under the constraint

∫_a^b (f^(m))² dx ≤ ρ   (1.7)

for a constant ρ. By introducing a Lagrange multiplier, the constrained minimization problem (1.6) and (1.7) is equivalent to minimizing the penalized least squares (PLS):

(1/n) Σ_{i=1}^{n} (yi − f(xi))² + λ ∫_a^b (f^(m))² dx.   (1.8)
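To see the roles of the two terms in (1.8) concretely, here is a minimal discrete analogue (our illustration, not the book’s estimator): on an equally spaced grid with one observation per grid point and m = 2, replace f^(2) by second differences Df, and minimize ||y − f||²/n + λ||Df||², whose minimizer solves the linear system (I/n + λD′D)f = y/n.

```python
def pls_smooth(y, lam):
    """Minimize sum((y - f)^2)/n + lam * sum((second differences of f)^2)."""
    n = len(y)
    # second-difference matrix D ((n-2) x n): (Df)_i = f_i - 2 f_{i+1} + f_{i+2}
    D = [[0.0] * n for _ in range(n - 2)]
    for i in range(n - 2):
        D[i][i], D[i][i + 1], D[i][i + 2] = 1.0, -2.0, 1.0
    # normal equations: (I/n + lam * D'D) f = y/n
    A = [[(1.0 / n if i == j else 0.0) +
          lam * sum(Dk[i] * Dk[j] for Dk in D) for j in range(n)] for i in range(n)]
    M = [row[:] + [y[i] / n] for i, row in enumerate(A)]
    for c in range(n):                     # Gaussian elimination with pivoting
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            k = M[r][c] / M[c][c]
            for j in range(c, n + 1):
                M[r][j] -= k * M[c][j]
    f = [0.0] * n
    for r in range(n - 1, -1, -1):
        f[r] = (M[r][n] - sum(M[r][j] * f[j] for j in range(r + 1, n))) / M[r][r]
    return f

y = [0.0, 1.1, 1.9, 3.2, 3.9, 5.1]         # noisy points near a straight line
f_interp = pls_smooth(y, 1e-12)            # lam near 0: essentially interpolates
f_line = pls_smooth(y, 1e6)                # large lam: nearly a straight line
```

The two extreme fits illustrate the family of models indexed by λ: f_interp reproduces y, while f_line has vanishing second differences, i.e., it is numerically the least squares straight line, the discrete counterpart of the m = 2 polynomial model.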

In the remainder of this book, a polynomial spline refers to the solution of the PLS (1.8) in the model space W^m_2[a, b]. A cubic spline is a special case of the polynomial spline with m = 2. Since it measures the roughness of the function f, ∫_a^b (f^(m))² dx is often referred to as a roughness penalty. It is obvious that there is no penalty for polynomials of order less than or equal to m. The smoothing parameter λ balances the trade-off between the goodness-of-fit measured by the LS and the roughness of the estimate measured by ∫_a^b (f^(m))² dx.

Suppose that n ≥ m and a ≤ x1 < x2 < · · · < xn ≤ b. Then, for fixed 0 < λ < ∞, (1.8) has a unique minimizer f, and f ∈ NS2m(x1, . . . , xn) (Eubank 1988). This result indicates that even though we started with the infinite dimensional space W^m_2[a, b] as the model space for f, the solution to the PLS (1.8) belongs to a finite dimensional space. Specifically, the solution is a natural polynomial spline with knots at the distinct design points. One approach to computing the polynomial spline estimate is to represent f as a linear combination of a basis of NS2m(x1, . . . , xn). Several basis constructions are provided in Section 3.3.3 of Eubank (1988). In particular, the R function smooth.spline implements this approach for the cubic spline using the B-spline basis. For example, the cubic spline fit in Figure 1.4 is derived by the following statements:


> library(MASS); attach(mcycle)
> smooth.spline(times, accel, all.knots=TRUE)

This book presents a different approach. Instead of basis functions, representers of reproducing kernel Hilbert spaces will be used to represent the spline estimate. This approach allows us to deal with many different spline models in a unified fashion. Details of this approach for polynomial splines will be presented in Sections 2.2 and 2.6.

When λ = 0, there is no penalty, and the natural spline that interpolates the observations is the unique minimizer. When λ = ∞, the unique minimizer is the mth order polynomial. As λ varies from ∞ to 0, we have a family of models ranging from the parametric polynomial model to interpolation. The value of λ decides how far f is allowed to depart from the polynomial model. Thus the choice of λ holds the key to the success of a spline estimate. We discuss how to choose λ based on data in Chapter 3.

1.3 Scope of This Book

Driven by many sophisticated applications and fueled by modern computing power, many flexible nonparametric and semiparametric modeling techniques have been developed to relax parametric assumptions and to exploit possible hidden structure. There are many different nonparametric methods. This book concentrates on one of them, smoothing spline. Existing books on this topic include Eubank (1988), Wahba (1990), Green and Silverman (1994), Eubank (1999), Gu (2002), and Ruppert, Wand and Carroll (2003). The goals of this book are to (a) make the advanced smoothing spline methodology based on reproducing kernel Hilbert spaces more accessible to practitioners and students; (b) provide software and examples so that the spline smoothing methods can be routinely used in practice; and (c) provide a comprehensive coverage of recently developed smoothing spline nonparametric/semiparametric linear/nonlinear fixed/mixed models. We concentrate on the methodology, implementation, software, and application. Theoretical results are stated without proofs. All methods will be demonstrated using real data sets and R functions.

The polynomial spline in Section 1.2 concerns functions defined on the domain [a, b]. In many applications, the domain of the regression function is not a continuous interval. Furthermore, the regression function may only be observed indirectly. Chapter 2 introduces general smoothing spline regression models with reproducing kernel Hilbert spaces on general domains as model spaces. Penalized LS estimation, the Kimeldorf–Wahba representer theorem, computation, and the R function ssr will be covered. Explicit constructions of model spaces will be discussed in detail for some popular smoothing spline models, including polynomial, periodic, thin-plate, spherical, and L-splines.

Chapter 3 introduces methods for selecting the smoothing parameter and making inferences about the regression function. The impact of the smoothing parameter and basic concepts for model selection will be discussed and illustrated using an example. Connections between smoothing spline models and Bayes/mixed-effects models will be established. The unbiased risk, generalized cross-validation, and generalized maximum likelihood methods will be introduced for selecting the smoothing parameter. Bayesian and bootstrap confidence intervals will be introduced for the regression function and its components. The locally most powerful, generalized maximum likelihood, and generalized cross-validation tests will also be introduced to test the hypothesis of a parametric model versus a nonparametric alternative.

Analogous to multiple regression, Chapter 4 constructs models for multivariate regression functions based on smoothing spline analysis of variance (ANOVA) decompositions. The resulting models have hierarchical structures that facilitate model selection and interpretation. Smoothing spline ANOVA decompositions for tensor products of some commonly used smoothing spline models will be illustrated. Penalized LS estimation involving multiple smoothing parameters and componentwise Bayesian confidence intervals will be covered.

Chapter 5 presents spline smoothing methods for heterogeneous and correlated observations. The presence of heterogeneity and correlation may lead to a wrong choice of the smoothing parameters and erroneous inference. Penalized weighted LS will be used for estimation. The unbiased risk, generalized cross-validation, and generalized maximum likelihood methods will be extended for selecting the smoothing parameters. Variance and correlation structures will also be discussed.

Analogous to generalized linear models, Chapter 6 introduces smoothing spline ANOVA models for observations generated from a particular distribution in the exponential family, including the binomial, Poisson, and gamma distributions. Penalized likelihood will be used for estimation, and methods for selecting the smoothing parameters will be discussed. Nonparametric estimation of variance and spectral density functions will be presented.

Analogous to nonlinear regression, Chapter 7 introduces spline smoothing methods for nonparametric nonlinear regression models where some unknown functions are observed indirectly through nonlinear functionals. In addition to fitting theoretical and empirical nonlinear nonparametric regression models, methods in this chapter may also be used to deal with constraints on the nonparametric function such as positivity or monotonicity. Several algorithms based on the Gauss–Newton, Newton–Raphson, extended Gauss–Newton, and Gauss–Seidel methods will be presented for different situations. Computation and the R function nnr will be covered.

Chapter 8 introduces semiparametric regression models that involve both parameters and nonparametric functions. The mean function may depend on the parameters and the nonparametric functions linearly or nonlinearly. The semiparametric regression models include many well-known models, such as the partial spline, varying coefficients, projection pursuit, single index, multiple index, functional linear, and shape invariant models, as special cases. Estimation, inference, computation, and the R function snr will also be covered.

Chapter 9 introduces semiparametric linear and nonlinear mixed-effects models. Smoothing spline ANOVA decompositions are extended for the construction of semiparametric mixed-effects models that parallel the classical mixed models. Estimation and inference methods, computation, and the R functions slm and snm will be covered as well.

1.4 The assist Package

The assist package was developed for fitting the various smoothing spline models covered in this book. It contains five main functions, ssr, nnr, snr, slm, and snm, for fitting various smoothing spline models. The function ssr fits smoothing spline regression models in Chapter 2, smoothing spline ANOVA models in Chapter 4, extended smoothing spline ANOVA models with heterogeneous and correlated observations in Chapter 5, generalized smoothing spline ANOVA models in Chapter 6, and semiparametric linear regression models in Chapter 8, Section 8.2. The function nnr fits nonparametric nonlinear regression models in Chapter 7. The function snr fits semiparametric nonlinear regression models in Chapter 8, Section 8.3. The functions slm and snm fit semiparametric linear and nonlinear mixed-effects models in Chapter 9. The assist package is available at

http://cran.r-project.org

Figure 1.5 shows how the functions in assist generalize some of the existing R functions for regression analysis.


[Figure 1.5: a diagram connecting the existing R functions lm, glm, smooth.spline, nls, lme, gam, and nlme by arrows to the assist functions ssr, nnr, snr, slm, and snm.]

FIGURE 1.5 Functions in assist (dashed boxes) and some existing R functions (solid boxes). An arrow represents an extension to a more general model. lm: linear models. glm: generalized linear models. smooth.spline: cubic spline models. nls: nonlinear regression models. lme: linear mixed-effects models. gam: generalized additive models. nlme: nonlinear mixed-effects models. ssr: smoothing spline regression models. nnr: nonparametric nonlinear regression models. snr: semiparametric nonlinear regression models. slm: semiparametric linear mixed-effects models. snm: semiparametric nonlinear mixed-effects models.


Chapter 2

Smoothing Spline Regression

2.1 Reproducing Kernel Hilbert Space

Polynomial splines concern functions defined on a continuous interval. This is the most common situation in practice. Nevertheless, many applications require modeling functions defined on domains other than a continuous interval. For example, for spatial data with measurements on latitude and longitude, the domain of the function is the Euclidean space R². Specific spline models were developed for different applications. It is desirable to develop methodology and software on a general platform such that special cases are dealt with in a unified fashion. Reproducing kernel Hilbert space (RKHS) theory provides such a general platform.

This section provides a very brief review of RKHS. Throughout this book, important theoretical results are presented in italics without proofs. Details and proofs related to RKHS can be found in Aronszajn (1950), Wahba (1990), Gu (2002), and Berlinet and Thomas-Agnan (2004).

A nonempty set E of elements f, g, h, . . . forms a linear space if there are two operations: (1) addition: a mapping (f, g) → f + g from E × E into E; and (2) multiplication: a mapping (α, f) → αf from R × E into E, such that for any α, β ∈ R, the following conditions are satisfied: (a) f + g = g + f; (b) (f + g) + h = f + (g + h); (c) for every f, g ∈ E, there exists h ∈ E such that f + h = g; (d) α(βf) = (αβ)f; (e) (α + β)f = αf + βf; (f) α(f + g) = αf + αg; and (g) 1f = f. Property (c) implies that there exists a zero element, denoted as 0, such that f + 0 = f for all f ∈ E.

A finite collection of elements f1, . . . , fk in E is called linearly independent if the relation α1f1 + · · · + αkfk = 0 holds only in the trivial case with α1 = · · · = αk = 0. An arbitrary collection of elements A is called linearly independent if every finite subcollection is linearly independent. Let A be a subset of a linear space E. Define

span A ≜ {α1f1 + · · · + αkfk : f1, . . . , fk ∈ A, α1, . . . , αk ∈ R, k = 1, 2, . . . }.


A set B ⊂ E is called a basis of E if B is linearly independent and span B = E.

A nonnegative function || · || on a linear space E is called a norm if (a) ||f|| = 0 if and only if f = 0; (b) ||αf|| = |α| ||f||; and (c) ||f + g|| ≤ ||f|| + ||g||. If the function || · || satisfies (b) and (c) only, then it is called a seminorm. A linear space with a norm is called a normed linear space.

Let E be a linear space. A mapping (·, ·) : E × E → R is called an inner product in E if it satisfies (a) (f, g) = (g, f); (b) (αf + βg, h) = α(f, h) + β(g, h); and (c) (f, f) ≥ 0, and (f, f) = 0 if and only if f = 0. An inner product defines a norm: ||f|| ≜ √(f, f). A linear space with an inner product is called an inner product space.

Let E be a normed linear space and fn a sequence in E. The sequence fn is said to converge to f ∈ E if lim_{n→∞} ||fn − f|| = 0, and f is called the limit point. The sequence fn is called a Cauchy sequence if lim_{l,n→∞} ||fl − fn|| = 0. The space E is complete if every Cauchy sequence converges to an element in E. A complete inner product space is called a Hilbert space.

A functional L on a Hilbert space H is a mapping from H to R. L is a linear functional if it satisfies L(αf + βg) = αLf + βLg. L is said to be continuous if lim_{n→∞} Lfn = Lf whenever lim_{n→∞} fn = f. L is said to be bounded if there exists a constant M such that |Lf| ≤ M||f|| for all f ∈ H. L is continuous if and only if L is bounded. For every fixed h ∈ H, Lhf ≜ (h, f) defines a continuous linear functional. Conversely, every continuous linear functional L can be represented as an inner product with a representer.

Riesz representation theorem
Let L be a continuous linear functional on a Hilbert space H. There exists a unique hL ∈ H such that Lf = (hL, f) for all f ∈ H. The element hL is called the representer of L.

Let H be a Hilbert space of real-valued functions from X to R, where X is an arbitrary set. For a fixed x ∈ X, the evaluational functional Lx : H → R is defined as

Lxf ≜ f(x).

Note that the evaluational functional Lx maps a function to a real value, while the function f maps a point x to a real value. Lx applies to all functions in H with a fixed x. Evaluational functionals are linear since Lx(αf + βg) = αf(x) + βg(x) = αLxf + βLxg.

Definition A Hilbert space of real-valued functions H is an RKHS ifevery evaluational functional is continuous.


Let H be an RKHS. Then, for each x ∈ X, the evaluational functional Lxf = f(x) is continuous. By the Riesz representation theorem, there exists an element Rx in H such that

Lxf = f(x) = (Rx, f),

where the dependence of the representer on x is expressed explicitly as Rx. Consider Rx(z) as a bivariate function of x and z, and let R(x, z) ≜ Rx(z). The bivariate function R(x, z) is called the reproducing kernel (RK) of an RKHS H. The term reproducing kernel comes from the fact that (Rx, Rz) = R(x, z). It is easy to check that an RK is nonnegative definite. That is, R is symmetric, R(x, z) = R(z, x), and for any α1, . . . , αn ∈ R and x1, . . . , xn ∈ X,

Σ_{i,j=1}^n αiαjR(xi, xj) ≥ 0.

Therefore, every RKHS has a unique RK that is nonnegative definite. Conversely, an RKHS can be constructed based on a nonnegative definite function.

Moore–Aronszajn theorem
For every nonnegative definite function R on X × X, there exists a unique RKHS on X with R as its RK.

The above results indicate that there exists a one-to-one correspondence between RKHS's and nonnegative definite functions. For a finite-dimensional space H with an orthonormal basis φ1(x), . . . , φp(x), it is easy to see that

R(x, z) ≜ Σ_{i=1}^p φi(x)φi(z)

is the RK of H.

The following definitions and results are useful for the construction and decomposition of model spaces. S is called a subspace of a Hilbert space H if S ⊂ H and αf + βg ∈ S for every α, β ∈ R and f, g ∈ S. A closed subspace S is a Hilbert space. The orthogonal complement of S is defined as

S⊥ ≜ {f ∈ H : (f, g) = 0 for all g ∈ S}.

S⊥ is a closed subspace of H. If S is a closed subspace of a Hilbert space H, then every element f ∈ H has a unique decomposition of the form f = g + h, where g ∈ S and h ∈ S⊥. Equivalently, H is decomposed into two subspaces: H = S ⊕ S⊥. This decomposition is called a tensor sum decomposition, and the elements g and h are called projections onto S and S⊥, respectively. Sometimes the notation H ⊖ S will be used to denote the subspace S⊥. Tensor sum decompositions with more than two subspaces can be defined recursively.

All closed subspaces of an RKHS are RKHS's. If H = H0 ⊕ H1 and R, R0, and R1 are the RKs of H, H0, and H1, respectively, then R = R0 + R1. Conversely, suppose H is a Hilbert space and H = H0 ⊕ H1. If H0 and H1 are RKHS's with RKs R0 and R1, respectively, then H is an RKHS with RK R = R0 + R1.

2.2 Model Space for Polynomial Splines

Before introducing general smoothing spline models, it is instructive to see how the polynomial splines introduced in Section 1.2 can be derived under the RKHS setup. Again, consider the regression model

yi = f(xi) + εi,  i = 1, . . . , n,  (2.1)

where the domain of the function f is X = [a, b] and the model space for f is the Sobolev space Wᵐ₂[a, b] defined in (1.4). The smoothing spline estimate f̂ is the solution to the PLS (1.8).

Model space construction and decomposition of Wᵐ₂[a, b]

The Sobolev space Wᵐ₂[a, b] is an RKHS with the inner product

(f, g) = Σ_{ν=0}^{m−1} f^(ν)(a)g^(ν)(a) + ∫_a^b f^(m)g^(m)dx.  (2.2)

Furthermore, Wᵐ₂[a, b] = H0 ⊕ H1, where

H0 = span{1, (x − a), . . . , (x − a)^{m−1}/(m − 1)!},
H1 = {f : f^(ν)(a) = 0, ν = 0, . . . , m − 1, ∫_a^b (f^(m))²dx < ∞},  (2.3)

are RKHS's with corresponding RKs

R0(x, z) = Σ_{ν=1}^m [(x − a)^{ν−1}/(ν − 1)!][(z − a)^{ν−1}/(ν − 1)!],
R1(x, z) = ∫_a^b [(x − u)₊^{m−1}/(m − 1)!][(z − u)₊^{m−1}/(m − 1)!]du.  (2.4)


The function (x)₊ = max{x, 0}.

Details about the foregoing construction can be found in Schumaker (2007). It is clear that H0 contains the polynomial of order m in the Taylor expansion. Note that the basis listed in (2.3), φν(x) = (x − a)^{ν−1}/(ν − 1)! for ν = 1, . . . , m, is an orthonormal basis of H0. For any f ∈ H1, it is easy to check that

f(x) = ∫_a^x f′(u)du = · · ·
     = ∫_a^x dx1 ∫_a^{x1} dx2 · · · ∫_a^{x_{m−1}} f^(m)(u)du
     = ∫_a^x dx1 ∫_a^{x1} dx2 · · · ∫_a^{x_{m−2}} (x_{m−2} − u)f^(m)(u)du = · · ·
     = ∫_a^x [(x − u)^{m−1}/(m − 1)!] f^(m)(u)du.

Thus the subspace H1 contains the remainder term in the Taylor expansion.

Denote P1 as the orthogonal projection operator onto H1. From the definition of the inner product, the roughness penalty

∫_a^b (f^(m))²dx = ||P1f||².  (2.5)

Therefore, ∫_a^b (f^(m))²dx measures the distance between f and the parametric polynomial space H0. There is no penalty on functions in H0.

The PLS (1.8) can be rewritten as

(1/n) Σ_{i=1}^n (yi − f(xi))² + λ||P1f||².  (2.6)

The solution to (2.6) will be given for the general case in Section 2.4. The above setup for polynomial splines suggests the following ingredients for the construction of a general smoothing spline model:

1. An RKHS H as the model space for f

2. A decomposition of the model space into two subspaces, H = H0 ⊕ H1, where H0 consists of functions that are not penalized

3. A penalty ||P1f||²

Based on prior knowledge and the purpose of the study, different choices can be made for the model space, its decomposition, and the penalty. These options make the spline smoothing method flexible and versatile. Choices of these options will be illustrated throughout the book.


2.3 General Smoothing Spline Regression Models

A general smoothing spline regression (SSR) model assumes that

yi = f(xi) + εi,  i = 1, . . . , n,  (2.7)

where yi are observations of the function f evaluated at design points xi, and εi are zero-mean independent random errors with a common variance σ². To deal with different situations in a unified fashion, let the domain of the function f be an arbitrary set X, and the model space be an RKHS H on X with RK R(x, z). The choice of H depends on several factors, including the domain X and prior knowledge about the function f. Suppose H can be decomposed into two subspaces,

H = H0 ⊕ H1,  (2.8)

where H0 is a finite-dimensional space with basis functions φ1(x), . . . , φp(x), and H1 is an RKHS with RK R1(x, z). H0, often referred to as the null space, consists of functions that are not penalized. In addition to the construction for polynomial splines in Section 2.2, specific constructions of commonly used model spaces will be discussed in Sections 2.6–2.11. The decomposition (2.8) is equivalent to decomposing the function

f = f0 + f1, (2.9)

where f0 and f1 are the projections onto H0 and H1, respectively. The component f0 represents a linear regression model in the space H0, and the component f1 represents systematic variation not explained by f0. Therefore, the magnitude of f1 can be used to check or test whether the parametric model is appropriate. The projections f0 and f1 will be referred to as the “parametric” and “smooth” components, respectively.

Sometimes observations of f are made indirectly through linear functionals. For example, f may be observed in the form ∫_a^b wi(x)f(x)dx, where wi are known functions. Another example is when observations are taken on the derivatives f′(xi). Therefore, it is useful to consider an even more general SSR model

yi = Lif + εi,  i = 1, . . . , n,  (2.10)

where Li are bounded linear functionals on H. Model (2.7) is a special case of (2.10) with Li being evaluational functionals at the design points, defined as Lif = f(xi). By the definition of an RKHS, these evaluational functionals are bounded.


2.4 Penalized Least Squares Estimation

The estimation method will be presented for the general model (2.10).

The smoothing spline estimate of f, denoted by f̂, is the minimizer of the PLS

(1/n) Σ_{i=1}^n (yi − Lif)² + λ||P1f||²,  (2.11)

where λ is a smoothing parameter controlling the balance between the goodness-of-fit measured by the least squares and the departure from the null space H0 measured by ||P1f||². Functions in H0 are not penalized since ||P1f||² = 0 when f ∈ H0. Note that f̂ depends on λ even though the dependence is not expressed explicitly. Estimation procedures presented in this chapter assume that λ has been fixed. The impact of the smoothing parameter and methods for selecting it will be discussed in Chapter 3.

Since Li are bounded linear functionals, by the Riesz representation theorem, there exists a representer ηi ∈ H such that Lif = (ηi, f). For a fixed x, consider Rx(z) ≜ R(x, z) as a univariate function of z. Then, by properties of the reproducing kernel, we have

ηi(x) = (ηi, Rx) = LiRx = Li(z)R(x, z),  (2.12)

where Li(z) indicates that Li is applied to what follows as a function of z. Equation (2.12) implies that the representer ηi can be obtained by applying the operator to the RK R. Let ξi = P1ηi be the projection of ηi onto H1. Since R(x, z) = R0(x, z) + R1(x, z), where R0 and R1 are the RKs of H0 and H1, respectively, and P1 is self-adjoint such that (P1g, h) = (g, P1h) for any g, h ∈ H, we have

ξi(x) = (ξi, Rx) = (P1ηi, Rx) = (ηi, P1Rx) = Li(z)R1(x, z).  (2.13)

Equation (2.13) implies that the representer ξi can be obtained by applying the operator to the RK R1. Furthermore, (ξi, ξj) = Li(x)ξj(x) = Li(x)Lj(z)R1(x, z). Denote

T = {Liφν}, i = 1, . . . , n, ν = 1, . . . , p,
Σ = {Li(x)Lj(z)R1(x, z)}, i, j = 1, . . . , n,  (2.14)

where T is an n × p matrix and Σ is an n × n matrix. For the special case of evaluational functionals Lif = f(xi), we have ξi(x) = R1(x, xi), T = {φν(xi)}, i = 1, . . . , n, ν = 1, . . . , p, and Σ = {R1(xi, xj)}, i, j = 1, . . . , n.


Write the estimate f̂ as

f̂(x) = Σ_{ν=1}^p dνφν(x) + Σ_{i=1}^n ciξi(x) + ρ,

where ρ ∈ H1 and (ρ, ξi) = 0 for i = 1, . . . , n. Since ξi = P1ηi, ηi can be written as ηi = ζi + ξi, where ζi ∈ H0. Therefore,

Liρ = (ηi, ρ) = (ζi, ρ) + (ξi, ρ) = 0.  (2.15)

Let y = (y1, . . . , yn)ᵀ and f̂ = (L1f̂, . . . , Lnf̂)ᵀ be the vectors of observations and fitted values, respectively. Let d = (d1, . . . , dp)ᵀ and c = (c1, . . . , cn)ᵀ. From (2.15), we have

f̂ = Td + Σc.  (2.16)

Furthermore, ||P1f̂||² = ||Σ_{i=1}^n ciξi + ρ||² = cᵀΣc + ||ρ||². Then the PLS (2.11) becomes

(1/n)||y − Td − Σc||² + λcᵀΣc + λ||ρ||².  (2.17)

It is obvious that (2.17) is minimized when ρ = 0, which leads to the following result of Kimeldorf and Wahba (1971).

Kimeldorf–Wahba representer theorem
Suppose T is of full column rank. Then the PLS (2.11) has a unique minimizer given by

f̂(x) = Σ_{ν=1}^p dνφν(x) + Σ_{i=1}^n ciξi(x).  (2.18)

The above theorem indicates that the smoothing spline estimate f̂ falls in a finite-dimensional space. Equation (2.18) represents the smoothing spline estimate f̂ as a linear combination of the basis functions of H0 and the representers in H1. The coefficients c and d need to be estimated from the data. Based on (2.18), the PLS (2.17) reduces to

(1/n)||y − Td − Σc||² + λcᵀΣc.  (2.19)

Taking the first derivatives leads to the following equations for c and d:

(Σ + nλI)Σc + ΣTd = Σy,
TᵀΣc + TᵀTd = Tᵀy.  (2.20)


where I is the identity matrix. The equations in (2.20) are equivalent to

( Σ + nλI   ΣT  ) ( Σc )   ( Σy  )
(    Tᵀ     TᵀT ) ( d  ) = ( Tᵀy ).

There may be multiple sets of solutions for c when Σ is singular. Nevertheless, all sets of solutions lead to the same estimate of the function f̂ (Gu 2002). Therefore, it is only necessary to derive one set of solutions. Consider the following equations:

(Σ + nλI)c + Td = y,
Tᵀc = 0.  (2.21)

It is easy to see that a set of solutions to (2.21) is also a set of solutions to (2.20). The solutions to (2.21) are

d = (TᵀM⁻¹T)⁻¹TᵀM⁻¹y,
c = M⁻¹{I − T(TᵀM⁻¹T)⁻¹TᵀM⁻¹}y,  (2.22)

where M = Σ + nλI.

To compute the coefficients c and d, consider the QR decomposition of T,

T = (Q1 Q2) ( R )
            ( 0 ),

where Q1, Q2, and R are n × p, n × (n − p), and p × p matrices; Q = (Q1 Q2) is an orthogonal matrix; and R is upper triangular and invertible. Since Tᵀc = RᵀQ1ᵀc = 0, we have Q1ᵀc = 0 and c = QQᵀc = (Q1Q1ᵀ + Q2Q2ᵀ)c = Q2Q2ᵀc. Multiplying the first equation in (2.21) by Q2ᵀ and using the fact that Q2ᵀT = 0, we have Q2ᵀMQ2Q2ᵀc = Q2ᵀy. Therefore,

c = Q2(Q2ᵀMQ2)⁻¹Q2ᵀy.  (2.23)

1 (y −Mc). Thus,

d = R−1QT1 (y −Mc). (2.24)

Equations (2.23) and (2.24) will be used to compute coefficients c andd.

Based on (2.16), the first equation in (2.21), and equation (2.23), the fitted values are

f̂ = Td + Σc = y − nλc = H(λ)y,  (2.25)

where

H(λ) ≜ I − nλQ2(Q2ᵀMQ2)⁻¹Q2ᵀ  (2.26)


is the so-called hat (influence, smoothing) matrix. The dependence of the hat matrix on the smoothing parameter λ is expressed explicitly. Note that equation (2.25) provides the fitted values, while equation (2.18) can be used to compute estimates at any value of x.

2.5 The ssr Function

The R function ssr in the assist package is designed to fit SSR models. After deciding on the model space and the penalty, the estimate f̂ is completely decided by y, T, and Σ. Therefore, these terms need to be specified in the ssr function. A typical call is

ssr(formula, rk)

where formula and rk are required arguments. Together they specify y, T, and Σ. Suppose the vector y and the matrices T and Σ have been created in R. Then formula lists y on the left-hand side and the T matrix on the right-hand side of the operator ~. The argument rk specifies the matrix Σ.

In the most common situation where Li are evaluational functionals, the fitting can be greatly simplified since Li are decided by the design points xi. There is no need to compute the T and Σ matrices before calling the ssr function. Instead, they can be computed internally. Specifically, a direct approach to fit the standard SSR model (2.7) is to list y on the left-hand side and φ1(x), . . . , φp(x) on the right-hand side of the operator ~ in the formula, and to specify a function for computing R1 in the rk argument. Functions for computing the RKs of some commonly used RKHS's are available in the assist package. Users can easily write their own functions for computing RKs.

There are several optional arguments for the ssr function, some of which will be discussed in the following chapters. In particular, methods for selecting the smoothing parameter λ will be discussed in Chapter 3. For simplicity, unless explicitly specified, all examples in this chapter use the default method, which selects λ using the generalized cross-validation criterion. Bayesian and bootstrap confidence intervals for fitted functions are constructed based on the methods in Section 3.8.

We now show how to fit polynomial splines to the motorcycle data. Consider the construction of polynomial splines in Section 2.2. For simplicity, we first consider the special cases of polynomial splines with m = 1 and m = 2, which are called linear and cubic splines, respectively. Denote x ∧ z = min{x, z} and x ∨ z = max{x, z}. Based on (2.3) and (2.4), Table 2.1 lists the bases for the null spaces and the RKs of linear and cubic splines for the special domain X = [0, b].

TABLE 2.1 Bases of null spaces and RKs of linear and cubic splines under the construction in Section 2.2 with X = [0, b]

m  Spline  φν    R0      R1
1  Linear  1     1       x ∧ z
2  Cubic   1, x  1 + xz  (x ∧ z)²{3(x ∨ z) − (x ∧ z)}/6

The functions linear2 and cubic2 in the assist package compute evaluations of R1 in Table 2.1 for linear and cubic splines, respectively. Functions for higher-order polynomial splines are also available. Note that the domain for the functions linear2 and cubic2 is X = [0, b] for any fixed b > 0. The RK on the general domain X = [a, b] can be calculated by a translation, for example, cubic2(x-a).
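To make the closed form concrete, here is a small check (my own code with hypothetical function names, not assist's cubic2) that the cubic spline entry for R1 in Table 2.1 agrees with the integral definition in (2.4) with a = 0.

```python
import math
from scipy.integrate import quad

def R1_closed(x, z):
    # Cubic spline RK from Table 2.1: (x ∧ z)^2 {3(x ∨ z) − (x ∧ z)}/6
    s, t = min(x, z), max(x, z)
    return s**2 * (3 * t - s) / 6

def R1_integral(x, z, m=2, a=0.0, b=1.0):
    # Integral definition (2.4): ∫_a^b (x−u)₊^{m−1}/(m−1)! (z−u)₊^{m−1}/(m−1)! du
    g = lambda u: (max(x - u, 0.0) ** (m - 1) * max(z - u, 0.0) ** (m - 1)
                   / math.factorial(m - 1) ** 2)
    val, _ = quad(g, a, b, points=[x, z])
    return val

print(abs(R1_closed(0.3, 0.8) - R1_integral(0.3, 0.8)) < 1e-8)
```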

To fit a cubic spline to the motorcycle data, one may create the matrices T and Σ first and then call the ssr function:

> T <- cbind(1, times)

> Sigma <- cubic2(times)

> ssr(accel~T-1, rk=Sigma)

The intercept is automatically included in the formula statement. Therefore, T-1 is used to exclude the intercept since it is already included in the T matrix.

Since Li are evaluational functionals for the motorcycle example, the ssr function can be called directly:

> ssr(accel~times, rk=cubic2(times))

The inputs for formula and rk can be modified to fit polynomial splines of different orders. For example, the following statements fit linear, quintic (m = 3), and septic (m = 4) splines:

> ssr(accel~1, rk=linear2(times))

> ssr(accel~times+I(times^2), rk=quintic2(times))

> ssr(accel~times+I(times^2)+I(times^3), rk=septic2(times))

The linear and cubic spline fits are shown in Figure 2.1.


[Figure 2.1: two panels of the motorcycle data, acceleration (g) versus time (ms), titled “Linear spline fit” and “Cubic spline fit”.]

FIGURE 2.1 Motorcycle data: plots of the observations (circles), the linear spline fit (left) and the cubic spline fit (right) as solid lines, and 95% Bayesian confidence intervals (shaded regions).

2.6 Another Construction for Polynomial Splines

One construction of an RKHS for the polynomial spline was presented in Section 2.2. This section presents an alternative construction for Wᵐ₂[0, 1] on the domain X = [0, 1]. In practice, without loss of generality, a continuous interval [a, b] can always be transformed into [0, 1].

Let kr(x) = Br(x)/r! be the scaled Bernoulli polynomials, where Br are defined recursively by B0(x) = 1, B′r(x) = rBr−1(x), and ∫₀¹ Br(x)dx = 0 for r = 1, 2, . . . (Abramowitz and Stegun 1964). The scaled Bernoulli polynomials needed below are

k0(x) = 1,
k1(x) = x − 0.5,
k2(x) = (1/2){k1²(x) − 1/12},
k4(x) = (1/24){k1⁴(x) − k1²(x)/2 + 7/240}.  (2.27)

Alternative model space construction and decomposition of Wᵐ₂[0, 1]
The Sobolev space Wᵐ₂[0, 1] is an RKHS with the inner product

(f, g) = Σ_{ν=0}^{m−1} (∫₀¹ f^(ν)dx)(∫₀¹ g^(ν)dx) + ∫₀¹ f^(m)g^(m)dx.  (2.28)


Furthermore, Wᵐ₂[0, 1] = H0 ⊕ H1, where

H0 = span{k0(x), k1(x), . . . , k_{m−1}(x)},
H1 = {f : ∫₀¹ f^(ν)dx = 0, ν = 0, . . . , m − 1, ∫₀¹ (f^(m))²dx < ∞},  (2.29)

are RKHS's with corresponding RKs

R0(x, z) = Σ_{ν=0}^{m−1} kν(x)kν(z),
R1(x, z) = km(x)km(z) + (−1)^{m−1}k_{2m}(|x − z|).  (2.30)

The foregoing alternative construction was derived by Craven and Wahba (1979). Note that the inner product (2.28) is different from (2.2). Again, H0 contains polynomials, and the basis listed in (2.29), φν(x) = k_{ν−1}(x) for ν = 1, . . . , m, is an orthonormal basis of H0. Denote P1 as the orthogonal projection operator onto H1. From the definition of the inner product, the roughness penalty ∫₀¹ (f^(m))²dx = ||P1f||².

Based on (2.29) and (2.30), Table 2.2 lists the bases for the null spaces and the RKs of linear and cubic splines under the alternative construction in this section.

TABLE 2.2 Bases of the null spaces and RKs for linear and cubic splines under the construction in Section 2.6 with X = [0, 1]

m  Spline  φν        R0              R1
1  Linear  1         1               k1(x)k1(z) + k2(|x − z|)
2  Cubic   1, k1(x)  1 + k1(x)k1(z)  k2(x)k2(z) − k4(|x − z|)

The functions linear and cubic in the assist package compute evaluations of R1 in Table 2.2 for linear and cubic splines, respectively. Functions for higher-order polynomial splines are also available. Note that the domain under the construction in this section is restricted to [0, 1]. Thus the scale option is needed when the domain is not [0, 1]. For example, the following statements fit linear and cubic splines to the motorcycle data:


> ssr(accel~1, rk=linear(times), scale=T)

> ssr(accel~times, rk=cubic(times), scale=T)

The scale option scales the independent variable times into the interval [0, 1]. It is good practice to scale a variable first before fitting. For example, the following statements lead to the same cubic spline fit:

> x <- (times-min(times))/(max(times)-min(times))

> ssr(accel~x, rk=cubic(x))

2.7 Periodic Splines

Many natural phenomena follow a cyclic pattern. For example, many biochemical, physiological, or behavioral processes in living beings follow a daily cycle called the circadian rhythm, and many Earth processes follow an annual cycle. In these cases the mean function f is known to be a smooth periodic function. Without loss of generality, assume that the domain of the function is X = [0, 1] and f is a periodic function on [0, 1]. Since periodic functions can be regarded as functions defined on the unit circle, periodic splines are often referred to as splines on the circle.

The model space for a periodic spline of order m is

Wᵐ₂(per) = {f : f^(j) are absolutely continuous, f^(j)(0) = f^(j)(1), j = 0, . . . , m − 1, ∫₀¹ (f^(m))²dx < ∞}.  (2.31)

Craven and Wahba (1979) derived the following construction.

Model space construction and decomposition of Wᵐ₂(per)
The space Wᵐ₂(per) is an RKHS with the inner product

(f, g) = (∫₀¹ f dx)(∫₀¹ g dx) + ∫₀¹ f^(m)g^(m)dx.

Furthermore, Wᵐ₂(per) = H0 ⊕ H1, where

H0 = span{1},
H1 = {f ∈ Wᵐ₂(per) : ∫₀¹ f dx = 0},  (2.32)


are RKHS's with corresponding RKs

R0(x, z) = 1,
R1(x, z) = (−1)^{m−1}k_{2m}(|x − z|).  (2.33)

Again, the roughness penalty ∫₀¹ (f^(m))²dx = ||P1f||². The function periodic in the assist library calculates R1 in (2.33). The order m is specified by the argument order. The default is a cubic periodic spline with order=2.

We now illustrate how to fit a periodic spline using the Arosa data, which contain the monthly mean ozone thickness (Dobson units) in Arosa, Switzerland, from 1926 to 1971. Suppose we want to investigate how ozone thickness changes over the months in a year. It is reasonable to assume that the mean ozone thickness is a periodic function of month. Let thick be the dependent variable and x be the independent variable month scaled into the interval [0, 1]. The following statements fit a cubic periodic spline:

> data(Arosa); Arosa$x <- (Arosa$month-0.5)/12

> ssr(thick~1, rk=periodic(x), data=Arosa)

The fit of the periodic spline is shown in Figure 2.2.

[Figure 2.2: the Arosa data, thickness versus month (1–12).]

FIGURE 2.2 Arosa data: plot of the observations (points) and the periodic spline fit (solid line). The shaded region represents 95% Bayesian confidence intervals.


2.8 Thin-Plate Splines

Suppose f is a function of a multivariate independent variable x = (x_1, …, x_d) ∈ R^d, where R^d is the Euclidean d-space. Assume the regression model

y_i = f(x_i) + ε_i,   i = 1, …, n,   (2.34)

where x_i = (x_{i1}, …, x_{id}) and ε_i are zero-mean independent random errors with a common variance σ².

Define the model space for a thin-plate spline as

W_2^m(R^d) = {f : J_m^d(f) < ∞},   (2.35)

where

J_m^d(f) = ∑_{α_1+⋯+α_d=m} [m!/(α_1! ⋯ α_d!)] ∫_{−∞}^{∞} ⋯ ∫_{−∞}^{∞} (∂^m f / ∂x_1^{α_1} ⋯ ∂x_d^{α_d})² ∏_{j=1}^d dx_j.   (2.36)

Since J_m^d(f) is invariant under a rotation of the coordinates, the thin-plate spline is especially well suited for spatial data (Wahba 1990, Gu 2002).

Define an inner product as

(f, g) = ∑_{α_1+⋯+α_d=m} [m!/(α_1! ⋯ α_d!)] ∫_{−∞}^{∞} ⋯ ∫_{−∞}^{∞} (∂^m f / ∂x_1^{α_1} ⋯ ∂x_d^{α_d})(∂^m g / ∂x_1^{α_1} ⋯ ∂x_d^{α_d}) ∏_{j=1}^d dx_j.   (2.37)

Model space construction of W_2^m(R^d)

With the inner product (2.37), W_2^m(R^d) is an RKHS if and only if 2m − d > 0.

Details can be found in Duchon (1977) and Meinguet (1979). A thin-plate spline estimate is the minimizer of the PLS

(1/n) ∑_{i=1}^n (y_i − f(x_i))² + λ J_m^d(f)   (2.38)

in W_2^m(R^d). The null space H_0 of the penalty functional J_m^d(f) is the space spanned by polynomials in d variables of total degree up to m − 1.


Thus the dimension of the null space is p = C(d + m − 1, d). For example, when d = 2 and m = 2,

J_2^2(f) = ∫_{−∞}^{∞} ∫_{−∞}^{∞} {(∂²f/∂x_1²)² + 2(∂²f/∂x_1∂x_2)² + (∂²f/∂x_2²)²} dx_1 dx_2,

and the null space is spanned by φ_1(x) = 1, φ_2(x) = x_1, and φ_3(x) = x_2. In general, denote φ_1, …, φ_p as the p polynomials of total degree up to m − 1 that span H_0. Denote E_m as the Green function for the m-iterated Laplacian, E_m(x, z) = E(||x − z||), where ||x − z|| is the Euclidean distance and

E(u) = (−1)^{d/2+1+m} |u|^{2m−d} log|u|,  d even,
E(u) = |u|^{2m−d},  d odd.

Let T be the n × p matrix {φ_ν(x_i)} and K = {E_m(x_i, x_j)}_{i,j=1}^n. The bivariate function E_m is not the RK of W_2^m(R^d) since it is not nonnegative definite. Nevertheless, it is conditionally nonnegative definite in the sense that T^T c = 0 implies that c^T K c ≥ 0. Referred to as a semi-kernel, the function E_m is sufficient for the purpose of estimation. Assume that T is of full column rank. It can be shown that the unique minimizer of the PLS (2.38) is given by (Wahba 1990, Gu 2002)

f(x) = ∑_{ν=1}^p d_ν φ_ν(x) + ∑_{i=1}^n c_i ξ_i(x),   (2.39)

where ξ_i(x) = E_m(x_i, x). Therefore, E_m plays the same role as the RK R_1. The coefficients c and d are solutions to

(K + nλI)c + Td = y,
T^T c = 0.   (2.40)

The above equations have the same form as those in (2.21). Therefore, computations in Section 2.4 carry over with Σ being replaced by K. The semi-kernel E_m is calculated by the function tp.pseudo in the assist package. The order m is specified by the order argument with default order=2.
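The system (2.40) can be solved by the standard null-space trick: write c = Q_2 z, where the columns of Q_2 span the orthogonal complement of the column space of T, so the constraint T^T c = 0 holds by construction. An illustrative Python sketch (independent of the R/assist implementation; the matrices below are random stand-ins, not real spline quantities):

```python
import numpy as np

def solve_spline_system(K, T, y, nlam):
    """Solve (K + nlam*I) c + T d = y subject to T^T c = 0 (system (2.40))
    via the null-space (QR) approach.  Only the restriction of K to
    {c : T^T c = 0} needs to be nonnegative definite, which is exactly
    what the semi-kernel guarantees."""
    n, p = T.shape
    Q, R = np.linalg.qr(T, mode='complete')   # T = Q[:, :p] @ R[:p]
    Q1, Q2 = Q[:, :p], Q[:, p:]
    M = K + nlam * np.eye(n)
    z = np.linalg.solve(Q2.T @ M @ Q2, Q2.T @ y)   # reduced (n - p) system
    c = Q2 @ z
    d = np.linalg.solve(R[:p], Q1.T @ (y - M @ c)) # back out d from the Q1 block
    return c, d

# Small synthetic check that both equations in (2.40) are satisfied.
rng = np.random.default_rng(0)
n, p, nlam = 20, 3, 0.5
T = rng.standard_normal((n, p))
A = rng.standard_normal((n, n)); K = A @ A.T   # stand-in for the semi-kernel Gram matrix
y = rng.standard_normal(n)
c, d = solve_spline_system(K, T, y, nlam)
```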

The USA climate data contain average winter temperatures in 1981 from 1214 stations in the USA. To investigate how average winter temperature (temp) depends on geographic location (long and lat), we fit a thin-plate spline as follows:

> attach(USAtemp)

> ssr(temp~long+lat, rk=tp.pseudo(list(long,lat)))


FIGURE 2.3 USA climate data, contour plot of the thin-plate spline fit.

The contour plot of the fit is shown in Figure 2.3.

A genuine RK for W_2^m(R^d) is needed later in the computation of posterior variances in Chapter 3 and the construction of tensor product splines in Chapter 4. We now discuss briefly how to derive the genuine RK. Define the inner product

(f, g)_0 = ∑_{j=1}^J w_j f(u_j) g(u_j),   (2.41)

where u_j are fixed points in R^d, and w_j are fixed positive weights such that ∑_{j=1}^J w_j = 1. The points u_j and weights w_j are selected in such a way that the matrix {(φ_ν, φ_μ)_0}_{ν,μ=1}^p is nonsingular. Let φ̃_ν, ν = 1, …, p, be an orthonormal basis derived from φ_ν with φ̃_1(x) = 1. Let P_0 be the projection operator onto H_0 defined as P_0 f = ∑_{ν=1}^p (f, φ̃_ν)_0 φ̃_ν. Then it can be shown that (Gu 2002)

R_0(x, z) = ∑_{ν=1}^p φ̃_ν(x) φ̃_ν(z),
R_1(x, z) = (I − P_{0(x)})(I − P_{0(z)}) E(||x − z||)   (2.42)


are RKs of H_0 and H_1 ≜ W_2^m(R^d) ⊖ H_0, where P_{0(x)} and P_{0(z)} are the projections applied to the arguments x and z, respectively.

Assume that the n × p matrix T = {φ_ν(x_i)} is of full column rank. Let Σ = {R_1(x_i, x_j)}_{i,j=1}^n. One relatively simple approach to computing φ̃_ν and Σ is to let J = n, u_j = x_j, and w_j = n^{−1}. It is easy to see that {(φ_ν, φ_μ)_0}_{ν,μ=1}^p = n^{−1} T^T T, which is nonsingular. Let

T = (Q_1 Q_2) [R; 0]

be the QR decomposition of T, where R is p × p upper triangular and [R; 0] denotes R stacked above an (n − p) × p zero block. Then

(φ̃_1(x), …, φ̃_p(x)) = √n (φ_1(x), …, φ_p(x)) R^{−1}

and

Σ = Q_2 Q_2^T K Q_2 Q_2^T.

The function tp computes evaluations of R_1 in (2.42) with J = n, u_j = x_j, and w_j = n^{−1}.
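The last display amounts to sandwiching K between projections onto the orthogonal complement of the column space of T; the resulting Σ is symmetric and annihilated by T^T. A short illustrative Python sketch (not the assist implementation; the inputs are random stand-ins):

```python
import numpy as np

def genuine_rk_gram(K, T):
    """Sigma = Q2 Q2^T K Q2 Q2^T, with T = (Q1 Q2)[R; 0] the QR
    decomposition of T (the case J = n, u_j = x_j, w_j = 1/n)."""
    n, p = T.shape
    Q, _ = np.linalg.qr(T, mode='complete')
    P = Q[:, p:] @ Q[:, p:].T        # projection orthogonal to col(T)
    return P @ K @ P

rng = np.random.default_rng(1)
T = rng.standard_normal((15, 3))
K = rng.standard_normal((15, 15)); K = (K + K.T) / 2   # symmetric semi-kernel stand-in
Sigma = genuine_rk_gram(K, T)
```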

2.9 Spherical Splines

Spherical spline, also called spline on the sphere, is an extension of both the periodic spline defined on the unit circle and the thin-plate spline defined on R². Let the domain be X = S, where S is the unit sphere. Any point x on S can be represented as x = (θ, φ), where θ (0 ≤ θ ≤ 2π) is the longitude and φ (−π/2 ≤ φ ≤ π/2) is the latitude. Define

J(f) = ∫_0^{2π} ∫_{−π/2}^{π/2} (Δ^{m/2} f)² cos φ dφ dθ,   m even,

J(f) = ∫_0^{2π} ∫_{−π/2}^{π/2} {(Δ^{(m−1)/2} f)_θ² / cos² φ + (Δ^{(m−1)/2} f)_φ²} cos φ dφ dθ,   m odd,

where the notation (g)_z represents the partial derivative of g with respect to z, and Δf represents the surface Laplacian on the unit sphere defined as

Δf = (1/cos² φ) f_{θθ} + (1/cos φ)(cos φ f_φ)_φ.

Consider the model space

W_2^m(S) = {f : |∫_S f dx| < ∞, J(f) < ∞}.


Model space construction and decomposition of W_2^m(S)

W_2^m(S) is an RKHS when m > 1. Furthermore, W_2^m(S) = H_0 ⊕ H_1, where

H_0 = span{1},
H_1 = {f ∈ W_2^m(S) : ∫_S f dx = 0},

are RKHS's with corresponding RKs

R_0(x, z) = 1,
R_1(x, z) = ∑_{i=1}^{∞} [(2i + 1)/{i(i + 1)}^m] G_i(cos γ(x, z)),

where γ(x, z) is the angle between x and z, and G_i are the Legendre polynomials.

Details of the above construction can be found in Wahba (1981). The penalty ||P_1 f||² = J(f). The RK R_1 is in the form of an infinite series, which is inconvenient to compute. Closed-form expressions are available only when m = 2 and m = 3. Wahba (1981) proposed replacing J by a topologically equivalent seminorm Q under which closed-form RKs can be derived. The function sphere in the assist package calculates R_1 under the seminorm Q for 2 ≤ m ≤ 6. The argument order specifies m with default order=2.
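The series coefficients decay like i^{1−2m}, so truncation converges quickly; for m = 2 they telescope, (2i + 1)/{i(i + 1)}² = 1/i² − 1/(i + 1)², so the series at γ = 0 sums to 1. A Python sketch of the truncated series, assuming the form of R_1 displayed above (for illustration only; the closed forms used by sphere are based on the seminorm Q and differ):

```python
import numpy as np
from numpy.polynomial.legendre import legval

def sphere_rk(cos_gamma, m=2, nterms=200):
    """Truncated series R1 = sum_{i>=1} (2i+1)/{i(i+1)}^m G_i(cos gamma),
    with G_i the Legendre polynomials."""
    cos_gamma = np.asarray(cos_gamma, dtype=float)
    total = np.zeros_like(cos_gamma)
    for i in range(1, nterms + 1):
        coef = np.zeros(i + 1); coef[i] = 1.0     # coefficient vector picking out G_i
        total += (2*i + 1) / (i*(i + 1))**m * legval(cos_gamma, coef)
    return total

r0 = sphere_rk(np.array([1.0]))[0]   # gamma = 0: telescoping sum, close to 1
```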

The world climate data contain average winter temperatures in 1981 from 725 stations around the globe. To investigate how average winter temperature (temp) depends on geographic location (long and lat), we fit a spline on the sphere:

> data(climate)

> ssr(temp~1, rk=sphere(cbind(long,lat)), data=climate)

The contour plot of the spherical spline fit is shown in Figure 2.4.

2.10 Partial Splines

A partial spline model assumes that

y_i = s_i^T β + L_i f + ε_i,   i = 1, …, n,   (2.43)


FIGURE 2.4 World climate data, contour plot of the spherical spline fit.

where s is a q-dimensional vector of independent variables, β is a vector of parameters, L_i are bounded linear functionals, and ε_i are zero-mean independent random errors with a common variance σ². We assume that f ∈ H, where H is an RKHS on an arbitrary domain X. Model (2.43) contains two components: a parametric linear model and a nonparametric function f. The partial spline model is a special case of the semiparametric linear regression model discussed in Chapter 8.

Suppose H = H_0 ⊕ H_1, where H_0 = span{φ_1, …, φ_p} and H_1 is an RKHS with RK R_1. Denote P_1 as the projection onto H_1. The function f and the parameters β are estimated as minimizers of the following PLS:

(1/n) ∑_{i=1}^n (y_i − s_i^T β − L_i f)² + λ ||P_1 f||².   (2.44)

Let S = (s_1, …, s_n)^T, T = {L_i φ_ν} (an n × p matrix), X = (S T), and Σ = {L_{i(x)} L_{j(z)} R_1(x, z)}_{i,j=1}^n. Assume that X is of full column rank. Following similar arguments as in Section 2.4, it can be shown that the PLS (2.44) has a unique minimizer, and the solution of f is given in (2.18).


Therefore, the PLS (2.44) reduces to

(1/n) ||y − Xα − Σc||² + λ c^T Σ c,

where α = (β^T, d^T)^T. As in Section 2.4, we can solve for α and c from the following equations:

(Σ + nλI)c + Xα = y,
X^T c = 0.   (2.45)

The above equations have the same form as those in (2.21). Thus, computations in Section 2.4 carry over with T and d being replaced by X and α, respectively. The ssr function can be used to fit partial splines. When L_i are evaluational functionals, the partial spline model (2.43) can be fitted by adding s variables on the right-hand side of the formula argument. When L_i are not evaluational functionals, the matrices X and Σ need to be created and supplied in the formula and rk arguments.

One interesting application of the partial spline model is to fit a nonparametric regression model with potential change-points. A change-point is defined as a discontinuity in the mean function or one of its derivatives. Note that the function g(x) = (x − t)_+^k has a jump in its kth derivative at location t. Therefore, it can be used to model change-points. Specifically, consider the following model

y_i = ∑_{j=1}^J β_j (x_i − t_j)_+^{k_j} + f(x_i) + ε_i,   i = 1, …, n,   (2.46)

where x_i ∈ [a, b] are design points, t_j ∈ [a, b] are change-points, f is a smooth function, and ε_i are zero-mean independent random errors with a common variance σ². The mean function in model (2.46) has a jump at t_j in its k_jth derivative with magnitude β_j. The choice of the model space for f depends on the application. For example, the polynomial or periodic spline space may be used. When t_j and k_j are known, model (2.46) is a special case of a partial spline with q = J, s_{ij} = (x_i − t_j)_+^{k_j}, and s_i = (s_{i1}, …, s_{iJ})^T.

We now use the geyser data and motorcycle data to illustrate change-point detection using partial splines. For the geyser data, Figure 1.3(b) indicates that there may be a jump in the mean function between 2.5 and 3.5 minutes. Therefore, we consider the model

y_i = β (x_i − t)_+^0 + f(x_i) + ε_i,   i = 1, …, n,   (2.47)

where x_i are the duration variable scaled into [0, 1], t is a change-point, and (x − t)_+^0 = 0 when x ≤ t and 1 otherwise. We assume that f ∈ W_2^2[0, 1]. For a fixed t, say t = 0.397, we can fit the partial spline as follows:

> attach(faithful)

> x <- (eruptions-min(eruptions))/diff(range(eruptions))

> ssr(waiting~x+(x>.397), rk=cubic(x))

The partial spline fit is shown in Figure 2.5(a). No trend is shown in the residual plot in Figure 2.5(b).

The change-point was fixed at t = 0.397 in the above fit. Often it is unknown in practice. To search for the location of the change-point t, we compute the AIC and GCV (generalized cross-validation) criteria on a grid of points between 0.25 and 0.55:

> aic <- gcv <- NULL

> for (t in seq(.25,.55,by=.001)) {

fit <- ssr(waiting~x+(x>t), rk=cubic(x))

aic <- c(aic, length(x)*log(sum(fit$resi**2))+2*fit$df)

gcv <- c(gcv, fit$rkpk.obj$score)

}

The vector fit$resi contains residuals. The value fit$df represents the degrees of freedom (tr H(λ)) defined later in Chapter 3, Section 3.2. The GCV criterion is defined in (3.24). Figure 2.5(c) shows the AIC and GCV scores scaled into [0, 1]. The scaled scores are identical and reach the minimum in the same region with the middle point at t = 0.397.

FIGURE 2.5 Geyser data, plots of (a) observations (points), the partial spline fit (solid line) with 95% Bayesian confidence intervals (shaded region), (b) residuals (points) from the partial spline fit and the cubic spline fit (solid line) to the residuals, and (c) the AIC (dashed line) and GCV scores (solid line) scaled into [0, 1].


For the motorcycle data, it is apparent that the mean curve is flat on the left and there is a sharp corner around 15 ms. The linear and cubic splines fit this region with round corners (Figure 2.1). The sharp corner suggests that there may be a change-point in the first derivative of the mean function. Therefore, we consider the model

y_i = β (x_i − t)_+ + f(x_i) + ε_i,   i = 1, …, n,   (2.48)

where x is the time variable scaled into [0, 1] and t is the change-point in the first derivative. We assume that f ∈ W_2^2[0, 1]. Again, we use the AIC and GCV criteria to search for the location of the change-point t. Figure 2.6(b) shows the scaled AIC and GCV scores. They both reach the minimum at t = 0.214. For the fixed t = 0.214, the model (2.48) can be fitted as follows:

> t <- .214; s <- (x-t)*(x>t)

> ssr(accel~x+s, rk=cubic(x))

The partial spline fit is shown in Figure 2.6(a). The sharp corner around 15 ms is preserved. Chapter 3 contains more analysis of potential change-points for the motorcycle data.

FIGURE 2.6 Motorcycle data, plots of (a) observations (circles), the partial spline fit (solid line) with 95% Bayesian confidence intervals (shaded region), and (b) the AIC (dashed line) and GCV (solid line) scores scaled into [0, 1].

Often, in practice, there is enough knowledge to model some components in the regression function parametrically. For other, uncertain components, it may be desirable to leave them unspecified. Combining parametric and nonparametric components, partial spline (semiparametric) models are well suited to these situations.

As an illustration, consider the Arosa data. We have investigated how ozone thickness changes over months in a year by fitting a periodic spline (Figure 2.2). Suppose now we want to investigate how ozone thickness changes over time. That is, we need to consider both the month (seasonal) effect and the year (long-term) effect. Let x_1 and x_2 be the month and year variables scaled into the interval [0, 1]. For the purpose of illustration, suppose the seasonal trend can be well approximated by a simple sinusoidal function. The form of the long-term trend will be left unspecified. Therefore, we consider the following partial spline model

y_i = β_1 + β_2 sin(2π x_{i1}) + β_3 cos(2π x_{i1}) + f(x_{i2}) + ε_i,   i = 1, …, 518,   (2.49)

where y_i represents the average ozone thickness in month x_{i1} of year x_{i2}, and f ∈ W_2^2[0, 1] ⊖ {1}. Note that the constant functions are removed from the model space for f so that f is identifiable with the constant β_1. The partial spline model (2.49) can be fitted as follows:

> x1 <- (Arosa$month-0.5)/12; x2 <- (Arosa$year-1)/45

> ssr(thick~sin(2*pi*x1)+cos(2*pi*x1)+x2, rk=cubic(x2))

Estimates of the main effect of month, β_2 sin(2π x_1) + β_3 cos(2π x_1), and the main effect of year, f(x_2), are shown in Figure 2.7. The simple sinusoidal model for the seasonal trend is too restrictive. More general models for the Arosa data can be found in Chapter 4, Section 4.9.2.

Functional data are observations in the form of functions. The most common forms of functional data are curves defined on a continuous interval and surfaces defined on R². A functional linear model (FLM) is a linear model that involves functional data as (i) the independent variable, (ii) the dependent variable, or (iii) both independent and dependent variables. Many methods have been developed for fitting FLMs where functional data are curves or surfaces (Ramsay and Silverman 2005). We will use the Canadian weather data to illustrate how to fit FLMs using methods in this book. An FLM corresponding to situation (i) is discussed in this section. FLMs corresponding to situations (ii) and (iii) will be introduced in Chapter 4, Section 4.9.3, and Chapter 8, Section 8.4.1, respectively. Note that methods illustrated in these sections for functional data apply to general functions defined on arbitrary domains. In particular, they can be used to fit surfaces defined on R².

Now consider the Canadian weather data. To investigate the relationship between total annual precipitation and the temperature function,


FIGURE 2.7 Arosa data, plots of estimated main effects with 95% Bayesian confidence intervals. A circle in the left panel represents the average thickness for a particular month minus the overall mean. A circle in the right panel represents the average thickness for a particular year minus the overall mean.

consider the following FLM

y_i = β_1 + ∫_0^1 w_i(x) f(x) dx + ε_i,   i = 1, …, 35,   (2.50)

where y_i is the logarithm of total annual precipitation at station i, β_1 is a constant parameter, x is the variable month transformed into [0, 1], w_i(x) is the temperature function at station i, f(x) is an unknown weight function, and ε_i are random errors. Model (2.50) is the same as model (15.2) in Ramsay and Silverman (2005). It is an example where the independent variable is a curve. The goal is to estimate the weight function f. It is reasonable to assume that f is a smooth periodic function. Specifically, we model f using the cubic periodic spline space W_2^2(per). Write f(x) = β_2 + f_1(x), where f_1 ∈ W_2^2(per) ⊖ {1}. Then model (2.50) can be rewritten as a partial spline model

y_i = β_1 + β_2 ∫_0^1 w_i(x) dx + ∫_0^1 w_i(x) f_1(x) dx + ε_i ≜ s_i^T β + L_i f_1 + ε_i,   (2.51)

where s_i = (1, ∫_0^1 w_i(x) dx)^T, β = (β_1, β_2)^T, and L_i f_1 = ∫_0^1 w_i(x) f_1(x) dx. Assume that w_i are square integrable. Then L_i are bounded linear functionals. Let R_1 be the RK of W_2^2(per) ⊖ {1}. From (2.14), the (i, j)th


element of Σ equals

L_{i(x)} L_{j(z)} R_1(x, z) = ∫_0^1 ∫_0^1 w_i(s) w_j(t) R_1(s, t) ds dt
   ≈ (1/144) ∑_{k=1}^{12} ∑_{l=1}^{12} w_i(x_k) w_j(x_l) R_1(x_k, x_l)
   = (1/144) w_i^T R_1(x, x) w_j,   (2.52)

where x_k represents the middle point of month k, x = (x_1, …, x_{12})^T, R_1(x, x) = {R_1(x_k, x_l)}_{k,l=1}^{12}, w_i(x_k) is the temperature of month k at station i, and w_i = (w_i(x_1), …, w_i(x_{12}))^T. The rectangle rule is used to approximate the integrals. More accurate approximations may be used. Let W = (w_1, …, w_{35}). Then Σ ≈ W^T R_1(x, x) W / 144. The following statements fit the partial spline model (2.51):

> library(fda); attach(CanadianWeather)

> y <- log(apply(monthlyPrecip,2,sum))

> W <- monthlyTemp

> s <- apply(W,2,mean)

> x <- seq(0.5,11.5,1)/12

> Sigma <- t(W)%*%periodic(x)%*%W/144

> canada.fit1 <- ssr(y~s, rk=Sigma, spar="m")

where the vector s contains elements ∑_{j=1}^{12} w_i(x_j)/12, which are approximations of ∫_0^1 w_i(x) dx. The generalized maximum likelihood (GML) method in Chapter 3, Section 3.6, is used to select the smoothing parameter since the GCV estimate is too small due to the small sample size. This is accomplished by setting the option spar="m".

From equation (2.18), the estimate of the weight function is

f̂(x) = d̂_2 + ∑_{i=1}^{35} ĉ_i L_{i(z)} R_1(x, z),

where d̂_2 is the estimate of β_2. To compute f̂ at a set of points, say,


x_0 = (x_{01}, …, x_{0n_0})^T, we have

f̂(x_0) = d̂_2 1_{n_0} + ∑_{i=1}^{35} ĉ_i ∫_0^1 R_1(x_0, x) w_i(x) dx
   ≈ d̂_2 1_{n_0} + ∑_{i=1}^{35} ĉ_i ∑_{j=1}^{12} R_1(x_0, x_j) w_i(x_j)/12
   = d̂_2 1_{n_0} + ∑_{i=1}^{35} ĉ_i R_1(x_0, x) w_i /12
   = d̂_2 1_{n_0} + S ĉ,

where f̂(x_0) = (f̂(x_{01}), …, f̂(x_{0n_0}))^T, 1_m represents an m-vector of all ones, R_1(x_0, x) = (R_1(x_{01}, x), …, R_1(x_{0n_0}, x))^T for a scalar x, R_1(x_0, x) = {R_1(x_{0i}, x_j)} is the n_0 × 12 matrix of evaluations at the month midpoints, S = R_1(x_0, x) W/12, and ĉ = (ĉ_1, …, ĉ_{35})^T. We compute f̂(x_0) at 50 equally spaced points in [0, 1] as follows:

> S <- periodic(seq(0,1,len=50),x)%*%W/12
> fhat <- canada.fit1$coef$d[2]+S%*%canada.fit1$coef$c

Figure 2.8 displays the estimated weight function and 95% bootstrap confidence intervals. The shape of the weight function is similar to that in Figure 15.5 of Ramsay and Silverman (2005). Note that monthly data are used in this book, while daily data were used in Ramsay and Silverman (2005).

FIGURE 2.8 Canadian weather data, plot of the estimated weight function (solid line) and 95% bootstrap confidence intervals (shaded region).


2.11 L-splines

2.11.1 Motivation

In the construction of a smoothing spline model, one needs to decide on the penalty functional or, equivalently, the null space H_0 consisting of functions that are not penalized. The squared bias of the spline estimate satisfies the following inequality (Wahba 1990, p. 59):

(1/n) ∑_{i=1}^n {E(L_i f̂) − L_i f}² ≤ λ ||P_1 f||².

That is, the squared bias is bounded by the distance of f to the null space H_0. Therefore, selecting a penalty such that ||P_1 f||² is small can lead to low bias in the spline estimate of the function. This is equivalent to selecting a null space H_0 that is close to the true function. Ideally, one wants to choose H_0 such that ||P_1 f||² = 0, that is, f ∈ H_0. However, it is usually difficult, if not impossible, to specify such a parametric space in practice. Nevertheless, often there is prior information suggesting that f can be well approximated by a parametric model. That is, f is close to, but not necessarily in, the space H_0. Heckman and Ramsay (2000) called such a space a favored parametric model. L-splines allow us to incorporate this kind of indefinite information. L-splines can also be used to check or test parametric models (Chapter 3).

Consider functions on the domain X = [a, b]. Let L be the linear differential operator

L = D^m + ∑_{j=0}^{m−1} ω_j(x) D^j,   (2.53)

where m ≥ 1, D^j denotes the jth derivative operator, and ω_j are continuous real-valued functions. The minimizer of the following PLS

(1/n) ∑_{i=1}^n (y_i − L_i f)² + λ ∫_a^b (Lf)² dx

is called an L-spline. Note that the penalty is in general different from that of a polynomial or periodic spline. To utilize the general estimation procedure developed in Section 2.4, we need to construct an RKHS such that the penalty ∫_a^b (Lf)² dx = ||P_1 f||² with an appropriately defined inner product.


Suppose f ∈ W_2^m[a, b]. Then Lf exists and is square integrable. Let H_0 = {f : Lf = 0} be the kernel of L. Based on results from differential equations (Coddington 1961), there exist real-valued functions u_1, …, u_m ∈ W_2^m[a, b] such that u_1, …, u_m form a basis of H_0. Furthermore, the Wronskian matrix associated with u_1, …, u_m,

W(x) =
[ u_1(x)          u_2(x)          ⋯  u_m(x)          ]
[ u_1′(x)         u_2′(x)         ⋯  u_m′(x)         ]
[ ⋮               ⋮                  ⋮               ]
[ u_1^{(m−1)}(x)  u_2^{(m−1)}(x)  ⋯  u_m^{(m−1)}(x)  ],

is invertible for all x. Define the inner product

(f, g) = ∑_{ν=0}^{m−1} f^{(ν)}(a) g^{(ν)}(a) + ∫_a^b (Lf)(Lg) dx.   (2.54)

Equation (2.54) defines a proper inner product since f^{(ν)}(a) = 0, ν = 0, …, m − 1, and Lf = 0 together lead to f = 0. Denote u(x) = (u_1(x), …, u_m(x))^T. Let u^*(x) = (u_1^*(x), …, u_m^*(x))^T be the last column of W^{−1}(x) and

G(x, s) = u^T(x) u^*(s),  s ≤ x,
G(x, s) = 0,  s > x,

be the Green function associated with L.

Model space construction and decomposition of W_2^m[a, b]

The space W_2^m[a, b] is an RKHS with inner product (2.54). Furthermore, W_2^m[a, b] = H_0 ⊕ H_1, where

H_0 = span{u_1, …, u_m},
H_1 = {f ∈ W_2^m[a, b] : f^{(ν)}(a) = 0, ν = 0, …, m − 1},   (2.55)

are RKHS's with corresponding RKs

R_0(x, z) = u^T(x) {W^T(a) W(a)}^{−1} u(z),
R_1(x, z) = ∫_a^b G(x, s) G(z, s) ds.   (2.56)

The above construction of an L-spline is based on a given differential operator L. In practice, rather than a differential operator, prior knowledge may be in the form of basis functions for the null space H_0.


Specifically, suppose prior knowledge suggests that f can be well approximated by a parametric space spanned by a basis u_1, …, u_m. Then one can solve the following equations to derive the coefficients ω_j in (2.53):

(L u_ν)(x) = u_ν^{(m)}(x) + ∑_{j=0}^{m−1} ω_j(x) u_ν^{(j)}(x) = 0,   ν = 1, …, m.

Let W(x) be the Wronskian matrix, ω(x) = (ω_0(x), …, ω_{m−1}(x))^T, and u^{(m)}(x) = (u_1^{(m)}(x), …, u_m^{(m)}(x))^T. Then the above equations can be written in matrix form: W^T(x) ω(x) = −u^{(m)}(x). Assume that W(x) is invertible. Then ω(x) = −W^{−T}(x) u^{(m)}(x).
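This recipe can be carried out numerically for any given basis. A Python sketch for the exponential basis u_1 = 1, u_2 = exp(−γx) of the next subsection (illustrative; it should recover L = D² + γD):

```python
import numpy as np

def lspline_weights(x, gamma):
    """omega(x) = -W(x)^{-T} u^{(m)}(x) for the null-space basis
    u1 = 1, u2 = exp(-gamma*x), so m = 2 (procedure of Section 2.11.1)."""
    e = np.exp(-gamma * x)
    W = np.array([[1.0, e],
                  [0.0, -gamma * e]])       # Wronskian of (u1, u2)
    u_m = np.array([0.0, gamma**2 * e])     # second derivatives (u1'', u2'')
    return -np.linalg.solve(W.T, u_m)       # (omega_0(x), omega_1(x))

omega = lspline_weights(0.7, gamma=3.0)     # expect (0, gamma): L = D^2 + 3 D
```

The result does not depend on x, consistent with the constant-coefficient operator derived analytically below.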

It is clear that the polynomial spline is a special case with L = D^m. Some special L-splines are discussed in the following subsections. More details can be found in Schumaker (2007), Wahba (1990), Dalzell and Ramsay (1993), Heckman (1997), Gu (2002), and Ramsay and Silverman (2005).

2.11.2 Exponential Spline

Assume that f ∈ W_2^2[a, b]. Suppose prior knowledge suggests that f can be well approximated by a linear combination of 1 and exp(−γx) for a fixed γ ≠ 0. Consider the parametric model space

H_0 = span{1, exp(−γx)}.

It is easy to see that H_0 is the kernel of the differential operator L = D² + γD. Nevertheless, we derive this operator following the procedure discussed in Section 2.11.1. Let u_1(x) = 1 and u_2(x) = exp(−γx). The Wronskian matrix is

W(x) =
[ 1   exp(−γx)    ]
[ 0   −γ exp(−γx) ].

Then

( ω_0(x) )       ( 1     0               ) ( 0             )     ( 0 )
( ω_1(x) ) = −  ( 1/γ   −(1/γ) exp(γx)  ) ( γ² exp(−γx)   )  = ( γ ).

Thus we have

L = D² + γD.

Since

{W^T(a) W(a)}^{−1} =
[ 1 + 1/γ²           −(1/γ²) exp(γa)  ]
[ −(1/γ²) exp(γa)    (1/γ²) exp(2γa)  ],


the RK of H_0 is

R_0(x, z) = 1 + 1/γ² − (1/γ²) exp(−γx^*) − (1/γ²) exp(−γz^*) + (1/γ²) exp{−γ(x^* + z^*)},   (2.57)

where x^* = x − a and z^* = z − a. The Green function is

G(x, s) = (1/γ)[1 − exp{−γ(x − s)}],  s ≤ x,
G(x, s) = 0,  s > x.

Thus the RK of H_1 is

R_1(x, z) = ∫_a^b G(x, s) G(z, s) ds
   = ∫_0^{x^* ∧ z^*} (1/γ²)[1 − exp{−γ(x^* − s^*)}][1 − exp{−γ(z^* − s^*)}] ds^*
   = (1/γ³){γ(x^* ∧ z^*) + exp(−γx^*) + exp(−γz^*)
      − exp{γ(x^* ∧ z^* − x^*)} − exp{γ(x^* ∧ z^* − z^*)}
      − (1/2) exp{−γ(x^* + z^*)} + (1/2) exp[γ{2(x^* ∧ z^*) − x^* − z^*}]}.   (2.58)

Evaluations of the RK R_1 for some simple L-splines can be calculated using the lspline function in the assist package. The argument type specifies the type of L-spline. The option type="exp" computes R_1 in (2.58) for the special case when a = 0 and γ = 1. For general a and γ ≠ 0, the RK R_1 can be computed using the simple transformation x̃ = γ(x − a).
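The closed form (2.58) can be checked numerically against its defining integral ∫_a^b G(x, s) G(z, s) ds. A Python sketch using midpoint quadrature (illustrative only; lspline is the production route in R):

```python
import numpy as np

def exp_rk_closed(x, z, gamma, a=0.0):
    """R1(x, z) of (2.58) for L = D^2 + gamma*D."""
    xs, zs, g = x - a, z - a, gamma
    m = min(xs, zs)
    return (g*m + np.exp(-g*xs) + np.exp(-g*zs)
            - np.exp(g*(m - xs)) - np.exp(g*(m - zs))
            - 0.5*np.exp(-g*(xs + zs))
            + 0.5*np.exp(g*(2*m - xs - zs))) / g**3

def exp_rk_numeric(x, z, gamma, a=0.0, n=20000):
    """int_a^b G(x, s) G(z, s) ds; the integrand vanishes for s > min(x, z)."""
    upper = min(x, z)
    h = (upper - a) / n
    s = a + (np.arange(n) + 0.5) * h        # midpoint rule
    G = lambda t: (1 - np.exp(-gamma * (t - s))) / gamma
    return h * np.sum(G(x) * G(z))
```

The two agree to quadrature accuracy, which is a useful sanity check when adapting the formula to general a and γ via the transformation noted above.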

The weight loss data contain weight measurements of a male obese patient taken since the start of a weight rehabilitation program. We now illustrate how to fit an exponential spline to the weight loss data. Observations are shown in Figure 2.9(a). Let y = Weight and x = Days. We first fit a nonlinear regression model as in Venables and Ripley (2002):

y_i = β_1 + β_2 exp(−β_3 x_i) + ε_i,   i = 1, …, 51.   (2.59)

The following statements fit model (2.59):

> library(MASS); attach(wtloss)

> y <- Weight; x <- Days

> weight.nls <- nls(y~b1+b2*exp(-b3*x),

start=list(b1=81,b2=102,b3=.005))


The fit is shown in Figure 2.9(a).

FIGURE 2.9 Weight loss data, plots of (a) observations (circles), the nonlinear regression fit (dotted line), the cubic spline fit (dashed line), and the exponential spline fit (solid line); and (b) the cubic spline fit and 95% Bayesian confidence intervals minus the nonlinear regression fit as dashed lines, and the exponential spline fit and 95% Bayesian confidence intervals minus the nonlinear regression fit as solid lines.

Next we consider the nonparametric regression model (1.1). It is reasonable to assume that the regression function can be well approximated by H_0 = span{1, exp(−γx)}, where γ = β̂_3 = 0.0048 is the LS estimate of β_3 in model (2.59). We now fit the exponential spline:

> r <- coef(weight.nls)[3]

> ssr(y~exp(-r*x), rk=lspline(r*x,type="exp"))

The exponential spline fit in Figure 2.9(a) is essentially the same as that from the nonlinear regression model. The parameter β3 was fixed at β̂3 in the above construction of the exponential spline. One may instead treat β3 as a parameter and estimate it using the GCV criterion as in Gu (2002):

> gcv.fun <- function(r) ssr(y~exp(-r*x),

rk=lspline(r*x,type="exp"))$rkpk.obj$score

> nlm(gcv.fun,.001)$estimate

0.004884513

For comparison, the cubic spline fit is also shown in Figure 2.9(a). To look at the difference between the cubic and exponential splines more closely, we plot their fits and 95% Bayesian confidence intervals minus the fit from nonlinear regression in Figure 2.9(b). It is clear that the confidence intervals for the exponential spline are narrower.

Figure 2.9 is essentially the same as Figure 4.3 in Gu (2002). A different approach was used to fit the exponential spline in Gu (2002): it is shown that fitting the exponential spline is equivalent to fitting a cubic spline to the transformed variable x̃ = 1 − exp(−γx). Thus the following statements lead to the same exponential spline fit:

> tx <- 1-exp(-r*x)

> ssr(y~tx, rk=cubic2(tx))

2.11.3 Logistic Spline

Assume that f ∈ W_2^2[a, b]. Suppose prior knowledge suggests that f can be well approximated by a logistic model H0 = span{1/(1 + δ exp(−γx))} for some fixed δ > 0 and γ > 0. It is easy to see that H0 is the kernel of the differential operator

L = D − δγ exp(−γx) / {1 + δ exp(−γx)}.

The Wronskian is a 1 × 1 matrix W(x) = {1 + δ exp(−γx)}^{−1}. Since {W^T(a)W(a)}^{−1} = {1 + δ exp(−γa)}^2, the RK of H0 is

R0(x, z) = {1 + δ exp(−γa)}^2 / [{1 + δ exp(−γx)}{1 + δ exp(−γz)}]. (2.60)

The Green function is

G(x, s) = {1 + δ exp(−γs)} / {1 + δ exp(−γx)},  s ≤ x;  G(x, s) = 0,  s > x.

Thus the RK of H1 is

R1(x, z) = {1 + δ exp(−γx)}^{−1}{1 + δ exp(−γz)}^{−1}
           × {x ∧ z − a + 2δγ^{−1}[exp(−γa) − exp{−γ(x ∧ z)}]
              + δ^2(2γ)^{−1}[exp(−2γa) − exp{−2γ(x ∧ z)}]}. (2.61)
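As a sanity check on (2.61), the closed form can be compared against a numerical quadrature of the defining integral ∫ G(x, s)G(z, s) ds. The sketch below is in Python rather than the book's R, and the values a = 0, δ = 2, γ = 0.5 and the test points are arbitrary illustrative choices, not estimates from any data set.

```python
import numpy as np

# Arbitrary illustrative parameters (not from any data set)
a, d, r = 0.0, 2.0, 0.5

def G(x, s):
    # Green function: {1 + d e^{-r s}} / {1 + d e^{-r x}} for s <= x
    return (1 + d * np.exp(-r * s)) / (1 + d * np.exp(-r * x))

def R1(x, z):
    # Closed-form RK (2.61)
    m = min(x, z)
    c = (1 + d * np.exp(-r * x)) * (1 + d * np.exp(-r * z))
    return (m - a
            + 2 * d / r * (np.exp(-r * a) - np.exp(-r * m))
            + d**2 / (2 * r) * (np.exp(-2 * r * a) - np.exp(-2 * r * m))) / c

def R1_quad(x, z, n=100001):
    # Trapezoid-rule approximation of the integral of G(x,s)G(z,s) over [a, x^z]
    s = np.linspace(a, min(x, z), n)
    v = G(x, s) * G(z, s)
    return np.sum((v[1:] + v[:-1]) / 2 * np.diff(s))

print(abs(R1(1.3, 2.7) - R1_quad(1.3, 2.7)) < 1e-6)  # True
```

The agreement confirms that the RK is just the integrated product of Green functions.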

The paramecium caudatum data consist of growth measurements of a Paramecium caudatum population in the medium of Osterhout. We now illustrate how to fit a logistic spline to these data. Observations are shown in Figure 2.10. Let y = density and x = day. We first fit the following logistic growth model

yi = β1 / {1 + β2 exp(−β3 xi)} + εi,  i = 1, . . . , 25, (2.62)


using the statements

> data(paramecium); attach(paramecium)

> para.nls <- nls(density~b1/(1 + b2*exp(-b3*day)),

start=list(b1=202,b2=164,b3=0.74))

Initial values are the estimates in Neal (2004). The fit is shown in Figure 2.10.

FIGURE 2.10 Paramecium caudatum data, observations (circles), the nonlinear regression fit (dotted line), the cubic spline fit (dashed line), and the logistic spline fit (solid line).

Now we consider the nonparametric regression model (2.7). It is reasonable to assume that the regression function can be well approximated by H0 = span{1/(1 + δ exp(−γx))}, where δ = β̂2 = 705.9496 and γ = β̂3 = 0.9319 are the LS estimates in model (2.62). The option type="logit" in the lspline function computes R1 in (2.61) for the special case when a = 0, δ = 1, and γ = 1. It cannot be adapted to compute R1 in the general situation. We take this opportunity to show how to write a function for computing an RK.

> logit.rk <- function(x,a,d,r) {
    tmp1 <- x%o%rep(1,length(x))                 # matrix with x[i] in row i
    tmp2 <- (tmp1+t(tmp1)-abs(tmp1-t(tmp1)))/2   # pairwise minima x[i]^x[j]
    tmp3 <- exp(-r*a)-exp(-r*tmp2)
    tmp4 <- exp(-2*r*a)-exp(-2*r*tmp2)


    tmp5 <- 1/((1+d*exp(-r*x))%o%(1+d*exp(-r*x)))
    (tmp2-a+2*d*tmp3/r+d**2*tmp4/(2*r))*tmp5     # RK (2.61)

}

> bh <- coef(para.nls)

> ssr(density~I(1/(1+bh[2]*exp(-bh[3]*day)))-1,

rk=logit.rk(day,0,bh[2],bh[3]),spar="m")

The function logit.rk computes R1 in (2.61). Since the sample size is small, the GML method is used to select the smoothing parameter. The logistic spline fit is shown in Figure 2.10. The fit is essentially the same as that from the logistic growth model (2.62). For comparison, the cubic spline fit is also shown in Figure 2.10. The logistic spline fit smooths out oscillations after 10 days, while the cubic spline fit preserves them.

To include the constant function in H0, one may consider the operator

L = D {D − δγ exp(−γx) / (1 + δ exp(−γx))}.

Details of this situation can be found in Gu (2002).

2.11.4 Linear-Periodic Spline

Assume that f ∈ W_2^4[a, b]. Suppose prior knowledge suggests that f can be well approximated by a parametric model

H0 = span{1, x, cos x, sin x}.

It is easy to check that H0 is the kernel of the differential operator

L = D^4 + D^2.

The Wronskian matrix and its inverse are, respectively,

W(x) =
[ 1  x   cos x    sin x
  0  1  −sin x    cos x
  0  0  −cos x   −sin x
  0  0   sin x   −cos x ]

and

W^{−1}(x) =
[ 1  −x   1       −x
  0   1   0        1
  0   0  −cos x    sin x
  0   0  −sin x   −cos x ].

For simplicity, suppose a = 0. Then

{W^T(0)W(0)}^{−1} =
[  2   0  −1   0
   0   2   0  −1
  −1   0   1   0
   0  −1   0   1 ].


Therefore, the RK of H0 is

R0(x, z) = 2 − cos x − cos z + 2xz − x sin z − z sin x + cos x cos z + sin x sin z. (2.63)

The Green function is

G(x, s) = x − s − sin(x − s),  s ≤ x;  G(x, s) = 0,  s > x.

Thus the RK of H1 is

R1(x, z) = −(1/6)(x ∧ z)^3 + (1/2)xz(x ∧ z) − |x − z| − sin x − sin z
           + x cos z + z cos x + (1/2)(x ∧ z) cos(z − x)
           + (5/4) sin |x − z| − (1/4) sin(x + z). (2.64)

The option type="linSinCos" in the lspline function computes R1 in (2.64) for the special case when a = 0. The translation x − a may be used when a ≠ 0. When the null space is H0 = span{1, x, cos τx, sin τx} for a fixed τ ≠ 0, the corresponding differential operator is L = D^4 + τ^2 D^2. The linear-periodic spline with a general τ can be fitted using the transformation x̃ = τx.

The melanoma data contain numbers of melanoma cases per 100,000 people in Connecticut from 1936 to 1972. We now illustrate how to fit a linear-periodic spline to the melanoma data. The observations are shown in Figure 2.11. There are two apparent trends: a nearly linear long-term trend over the years, and a cycle of around 10 years corresponding to the sunspot cycle. Let y = cases and x = year. As in Heckman and Ramsay (2000), we fit a linear-periodic spline with L = D^4 + τ^2 D^2, where τ = 0.58.

> library(fda); attach(melanoma)

> x <- year-1936; y <- incidence

> tau <- .58; tx <- tau*x

> ssr(y~tx+cos(tx)+sin(tx),

rk=lspline(tx,type="linSinCos"), spar="m")

Again, since the sample size is small, the GML method is used to select the smoothing parameter. For comparison, the cubic spline fit with smoothing parameter selected by the GML method is also shown in Figure 2.11.


FIGURE 2.11 Melanoma data, observations (circles), the cubic spline fit (dashed line), and the linear-periodic spline fit (solid line).

2.11.5 Trigonometric Spline

Suppose f is a periodic function on X = [0, 1]. Assume that f ∈ W_2^m(per). Then we have the Fourier expansion

f(x) = a0 + Σ_{ν=1}^∞ aν cos 2πνx + Σ_{ν=1}^∞ bν sin 2πνx
     = a0 + Σ_{ν=1}^{m−1} aν cos 2πνx + Σ_{ν=1}^{m−1} bν sin 2πνx + Rem(x), (2.65)

where the first three elements in (2.65) represent a trigonometric polynomial of degree m − 1 and Rem(x) ≜ Σ_{ν=m}^∞ (aν cos 2πνx + bν sin 2πνx).

The penalty used in Section 2.7 for a periodic spline corresponds to L = D^m with kernel H0 = span{1}. That is, all nonconstant functions, including those in the trigonometric polynomial, are penalized. Analogous to the Taylor expansion and polynomial splines, one may want to include lower-degree trigonometric polynomials in the null space. The operator L = D^m, however, does not decompose W_2^m(per) into lower-degree trigonometric polynomials plus the remainder terms.

Now consider the null space

H0 = span{1, sin 2πνx, cos 2πνx, ν = 1, . . . , m − 1}, (2.66)

which includes trigonometric polynomials with degree up to m − 1. Assume that f ∈ W_2^{m+1}(per). It is easy to see that H0 is the kernel of the


differential operator

L = D Π_{ν=1}^{m−1} {D^2 + (2πν)^2}. (2.67)

Model space construction and decomposition of W_2^{m+1}(per)

The space W_2^{m+1}(per) is an RKHS with the inner product

(f, g) = (∫_0^1 f dx)(∫_0^1 g dx)
       + Σ_{ν=1}^{m−1} (∫_0^1 f cos 2πνx dx)(∫_0^1 g cos 2πνx dx)
       + Σ_{ν=1}^{m−1} (∫_0^1 f sin 2πνx dx)(∫_0^1 g sin 2πνx dx)
       + ∫_0^1 (Lf)(Lg) dx,

where L is defined in (2.67). Furthermore, W_2^{m+1}(per) = H0 ⊕ H1, where H0 is given in (2.66) and H1 = W_2^{m+1}(per) ⊖ H0. H0 and H1 are RKHS's with corresponding RKs

R0(x, z) = 1 + Σ_{ν=1}^{m−1} cos 2πν(x − z),
R1(x, z) = Σ_{ν=m}^∞ [2/(2π)^{4m+2}] {Π_{j=0}^{m−1} (j^2 − ν^2)^{−2}} cos 2πν(x − z). (2.68)

Sometimes f satisfies the constraint ∫_0^1 f dx = 0. This constraint can be handled easily by removing constant functions in the above construction. Specifically, let

H0 = span{sin 2πνx, cos 2πνx, ν = 1, . . . , m − 1}. (2.69)

Assume that f ∈ W_2^m(per) ⊖ {1}. Then H0 is the kernel of the differential operator

L = Π_{ν=1}^{m−1} {D^2 + (2πν)^2}. (2.70)


Model space construction and decomposition of W_2^m(per) ⊖ {1}

The space W_2^m(per) ⊖ {1} is an RKHS with the inner product

(f, g) = Σ_{ν=1}^{m−1} (∫_0^1 f cos 2πνx dx)(∫_0^1 g cos 2πνx dx)
       + Σ_{ν=1}^{m−1} (∫_0^1 f sin 2πνx dx)(∫_0^1 g sin 2πνx dx)
       + ∫_0^1 (Lf)(Lg) dx,

where L is defined in (2.70). Furthermore, W_2^m(per) ⊖ {1} = H0 ⊕ H1, where H0 is given in (2.69) and H1 = W_2^m(per) ⊖ {1} ⊖ H0. H0 and H1 are RKHS's with RKs

R0(x, z) = Σ_{ν=1}^{m−1} cos 2πν(x − z),
R1(x, z) = Σ_{ν=m}^∞ [2/(2π)^{4(m−1)}] {Π_{j=1}^{m−1} (j^2 − ν^2)^{−2}} cos 2πν(x − z). (2.71)

In the lspline function, the options type="sine1" and type="sine0" compute the RKs R1 in (2.68) and (2.71), respectively. We now use the Arosa data to show how to fit a trigonometric spline and illustrate its difference from periodic and partial splines. Suppose we want to investigate how ozone thickness changes over months in a year. We have fitted a cubic periodic spline with f ∈ W_2^2(per) in Section 2.7. Note that W_2^2(per) = H10 ⊕ H11, where H10 = span{1} and H11 = W_2^2(per) ⊖ {1}. Therefore, we can rewrite the periodic spline model as

yi = f10(xi) + f11(xi) + εi,  i = 1, . . . , n, (2.72)

where f10 and f11 are the projections onto H10 and H11, respectively. The estimates of the overall function, f10 + f11, and its projections f10 (parametric) and f11 (smooth) are shown in the first row of Figure 2.12.

It is apparent that the monthly pattern can be well approximated by a simple sinusoidal function in the model space

P = span{1, sin 2πx, cos 2πx}. (2.73)

Suppose we want to check the departure from the parametric model P. One approach is to add the sine and cosine functions to the null space


FIGURE 2.12 Arosa data, plots of the overall fits (left), the parametric components (middle), and the smooth components (right) of the periodic spline (top), the partial spline (middle), and the L-spline (bottom). Parametric components represent components in the spaces H10, P, and P for the periodic spline, partial spline, and L-spline, respectively. Smooth components represent components in the spaces H11, H11, and W_2^3(per) ⊖ P for the periodic spline, partial spline, and L-spline, respectively. Dotted lines are 95% Bayesian confidence intervals.


H10 of the periodic spline. This leads to the following partial spline model

yi = f20(xi) + f21(xi) + εi,  i = 1, . . . , n, (2.74)

where f20 ∈ P and f21 ∈ H11. The model (2.74) is fitted as follows:

> ssr(thick~sin(2*pi*x)+cos(2*pi*x), rk=periodic(x))

The estimates of the overall function and its two components f20 (parametric) and f21 (smooth) are shown in the second row of Figure 2.12. Another approach is to fit a trigonometric spline with m = 2:

yi = f30(xi) + f31(xi) + εi,  i = 1, . . . , n, (2.75)

where f30 ∈ P and f31 ∈ W_2^3(per) ⊖ P. The model (2.75) is fitted as follows:

> ssr(thick~sin(2*pi*x)+cos(2*pi*x),

rk=lspline(x,type="sine1"))

The estimates of the overall function and its two components f30 (parametric) and f31 (smooth) are shown in the third row of Figure 2.12.

All three models have similar overall fits. However, their components are quite different. The smooth component of the periodic spline reveals the departure from a constant function. To check the departure from the simple sinusoidal model space P, we can look at the smooth components from the partial and L-splines. Estimates of the smooth components from the partial and L-splines are similar. However, the confidence intervals based on the L-spline are narrower than those based on the partial spline. This is due to the fact that the two components f30 and f31 in the L-spline are orthogonal, while the two components f20 and f21 in the partial spline are not necessarily orthogonal. Therefore, we can expect inference based on the L-spline to be more efficient in general.


Chapter 3

Smoothing Parameter Selection and Inference

3.1 Impact of the Smoothing Parameter

The penalized least squares (2.11) represents a compromise between the goodness-of-fit and a penalty on the departure from the null space H0. The balance is controlled by the smoothing parameter λ. As λ varies from 0 to ∞, we have a family of estimates, with f̂ ∈ H0 when λ = ∞.

To illustrate the impact of the smoothing parameter, consider the Stratford weather data consisting of daily maximum temperatures in Stratford, Texas, during 1990. Observations are shown in Figure 3.1. Consider the regression model (1.1), where n = 73 and f represents the expected maximum temperature as a function of time in a year. Denote x as the time variable scaled into [0, 1]. It is reasonable to assume that f is a smooth periodic function. In particular, we assume that f ∈ W_2^2(per). For a fixed λ, say 0.001, one can fit the cubic periodic spline as follows:

> data(Stratford); attach(Stratford)

> ssr(y~1, rk=periodic(x), limnla=log10(73*.001))

where the argument limnla specifies a search range for log10(nλ). To see how a spline fit is affected by the choice of λ, periodic spline fits with six different values of λ are shown in Figure 3.1. It is obvious that the fit with λ = ∞ is a constant, that is, it lies in H0. The fit with λ = 0 interpolates the data. A larger λ leads to a smoother fit. Both λ = 0.0001 and λ = 0.00001 lead to visually reasonable fits.

In practice it is desirable to select the smoothing parameter using an objective method rather than visual inspection. In a sense, a data-driven choice of λ allows the data to speak for themselves. Thus, it is no exaggeration to say that the choice of λ is the spirit and soul of nonparametric regression.

We now inspect how λ controls the fit. Again, consider model (1.1) for the Stratford weather data. Let us first consider a parametric approach that approximates f using a trigonometric polynomial up to a certain


FIGURE 3.1 Stratford weather data, plot of observations and the periodic spline fits with different smoothing parameters.

degree, say k, where 0 ≤ k ≤ K and K = (n − 1)/2 = 36. Denote the corresponding parametric model space for f as

Mk = span{1, √2 sin 2πνx, √2 cos 2πνx, ν = 1, . . . , k}, (3.1)

where M0 = span{1}. For a fixed k, write the regression model based on Mk in matrix form as

y = Xk βk + ε,

where

Xk =
[ 1  √2 sin 2πx1  √2 cos 2πx1  · · ·  √2 sin 2πkx1  √2 cos 2πkx1
  1  √2 sin 2πx2  √2 cos 2πx2  · · ·  √2 sin 2πkx2  √2 cos 2πkx2
  ⋮       ⋮              ⋮        · · ·       ⋮              ⋮
  1  √2 sin 2πxn  √2 cos 2πxn  · · ·  √2 sin 2πkxn  √2 cos 2πkxn ]

is the design matrix, xi = i/n, βk = (β1, . . . , β_{2k+1})^T, and ε = (ε1, . . . , εn)^T. Since the design points are equally spaced, we have the following


orthogonality relations:

(2/n) Σ_{i=1}^n cos 2πνxi cos 2πµxi = δ_{ν,µ},  1 ≤ ν, µ ≤ K,
(2/n) Σ_{i=1}^n sin 2πνxi sin 2πµxi = δ_{ν,µ},  1 ≤ ν, µ ≤ K,
(2/n) Σ_{i=1}^n cos 2πνxi sin 2πµxi = 0,        1 ≤ ν, µ ≤ K,

where δ_{ν,µ} is the Kronecker delta. Therefore, Xk^T Xk = nI_{2k+1}, where I_{2k+1} is the identity matrix of size 2k + 1. Note that X_K/√n is an orthogonal matrix. Define the discrete Fourier transformation ỹ = X_K^T y/n. Then the LS estimate of βk is β̂k = (Xk^T Xk)^{−1} Xk^T y = Xk^T y/n = ỹk, where ỹk consists of the first 2k + 1 elements of the discrete Fourier transformation ỹ. More explicitly,

β̂1 = (1/n) Σ_{i=1}^n yi = ỹ1,
β̂_{2ν} = (√2/n) Σ_{i=1}^n yi sin 2πνxi = ỹ_{2ν},  1 ≤ ν ≤ k, (3.2)
β̂_{2ν+1} = (√2/n) Σ_{i=1}^n yi cos 2πνxi = ỹ_{2ν+1},  1 ≤ ν ≤ k.
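These identities are easy to verify numerically. The Python sketch below (the book's analyses use R) builds X_K for n = 73, checks the orthogonality X_K^T X_K = nI, and confirms that the LS estimate for M_k equals the first 2k + 1 discrete Fourier coefficients; the response vector is an arbitrary illustrative choice.

```python
import numpy as np

n, K = 73, 36                     # n odd, K = (n - 1)/2 as in the text
x = np.arange(1, n + 1) / n       # equally spaced design points x_i = i/n

# Design matrix X_K with columns 1, sqrt(2) sin(2 pi nu x), sqrt(2) cos(2 pi nu x)
cols = [np.ones(n)]
for nu in range(1, K + 1):
    cols.append(np.sqrt(2) * np.sin(2 * np.pi * nu * x))
    cols.append(np.sqrt(2) * np.cos(2 * np.pi * nu * x))
X = np.column_stack(cols)

print(np.allclose(X.T @ X, n * np.eye(2 * K + 1)))  # True: X/sqrt(n) orthogonal

# Discrete Fourier transformation and the LS estimate for a fixed k
rng = np.random.default_rng(0)
y = np.sin(2 * np.pi * x) + 0.5 * rng.standard_normal(n)  # arbitrary response
ytilde = X.T @ y / n
k = 3
beta_k = np.linalg.lstsq(X[:, :2 * k + 1], y, rcond=None)[0]
print(np.allclose(beta_k, ytilde[:2 * k + 1]))  # True: beta_k = ytilde_k
```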

Now consider modeling f using the cubic periodic spline space W_2^2(per). The exact solution was given in Chapter 2. To simplify the argument, let us consider the following PLS

min_{f∈M_K} { (1/n) Σ_{i=1}^n (yi − f(xi))^2 + λ ∫_0^1 (f'')^2 dx }, (3.3)

where the model space W_2^2(per) is approximated by M_K. The following discussion holds true for the exact solution in W_2^2(per) (Gu 2002). However, the approximation makes the argument simpler and more transparent.

Let

f̂(x) = α̂1 + Σ_{ν=1}^K (α̂_{2ν} √2 sin 2πνx + α̂_{2ν+1} √2 cos 2πνx)


be the solution to (3.3). Then f̂ ≜ (f̂(x1), . . . , f̂(xn))^T = X_K α̂, where α̂ = (α̂1, . . . , α̂_{2K+1})^T. The LS

(1/n)||y − f̂||^2 = (1/n)||(1/√n) X_K^T (y − f̂)||^2
                 = ||(1/n) X_K^T y − (1/n) X_K^T X_K α̂||^2
                 = ||ỹ − α̂||^2.

Thus (3.3) reduces to the following ridge regression problem:

(α1 − ỹ1)^2 + Σ_{ν=1}^K {(α_{2ν} − ỹ_{2ν})^2 + (α_{2ν+1} − ỹ_{2ν+1})^2}
  + λ Σ_{ν=1}^K (2πν)^4 (α_{2ν}^2 + α_{2ν+1}^2). (3.4)

The solutions to (3.4) are

α̂1 = ỹ1,
α̂_{2ν} = ỹ_{2ν} / {1 + λ(2πν)^4},  ν = 1, . . . , K, (3.5)
α̂_{2ν+1} = ỹ_{2ν+1} / {1 + λ(2πν)^4},  ν = 1, . . . , K.

Thus the periodic spline is essentially a low-pass filter: components at frequency ν are downweighted by a factor of 1 + λ(2πν)^4. Figure 3.2 shows how λ controls the nature of the filter: more high frequencies are filtered out as λ increases. Comparing (3.2) and (3.5), it is clear that selecting an order k for the trigonometric polynomial model may be viewed as hard thresholding, while selecting the smoothing parameter λ for the periodic spline may be viewed as soft thresholding.
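The hard/soft thresholding contrast can be sketched directly from the filter weights; a short Python illustration (the book's computations are in R):

```python
import numpy as np

K = 36
nu = np.arange(1, K + 1)

def soft_weights(lam):
    # Periodic-spline filter (3.5): frequency nu is downweighted by 1 + lam (2 pi nu)^4
    return 1.0 / (1.0 + lam * (2 * np.pi * nu)**4)

# lam = 0 keeps everything; larger lam suppresses high frequencies more
print(np.allclose(soft_weights(0.0), 1.0))               # True
print(np.all(soft_weights(1e-3) <= soft_weights(1e-6)))  # True

# Hard thresholding: the trigonometric model of order k keeps frequencies
# up to k with weight one and drops the rest entirely
k = 5
hard_weights = (nu <= k).astype(float)
print(int(hard_weights.sum()))                           # 5
```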

Now consider the general spline model (2.10). From (2.26), the hat matrix is H(λ) = I − nλ Q2 (Q2^T M Q2)^{−1} Q2^T. Let U E U^T be the eigendecomposition of Q2^T Σ Q2, where U is an (n − p) × (n − p) orthogonal matrix and E = diag(e1, . . . , e_{n−p}). The projection onto the space spanned by T is

P_T ≜ T(T^T T)^{−1} T^T = Q1 R(R^T R)^{−1} R^T Q1^T = Q1 Q1^T.

Then

H(λ) = I − nλ Q2 U (E + nλI)^{−1} U^T Q2^T (3.6)
     = Q1 Q1^T + Q2 Q2^T − nλ Q2 U (E + nλI)^{−1} U^T Q2^T
     = P_T + Q2 U V U^T Q2^T, (3.7)


FIGURE 3.2 Weights of the periodic spline filter, 1/(1 + λ(2πν)^4), plotted as a function of frequency ν. Six curves from top down correspond to six different λ: 0, 10^{−6}, 10^{−5}, 10^{−4}, 10^{−3}, and ∞.

where V = diag(e1/(e1 + nλ), . . . , e_{n−p}/(e_{n−p} + nλ)). The hat matrix is divided into two mutually orthogonal matrices: one is the projection onto the space spanned by T, and the other is responsible for shrinking the part of the signal that is orthogonal to T. The smoothing parameter shrinks eigenvalues in the form eν/(eν + nλ). The choices λ = ∞ and λ = 0 lead to the parametric model H0 and interpolation, respectively.

Equation (3.7) also indicates that the hat matrix H(λ) is nonnegative definite. However, unlike the projection matrix for a parametric model, H(λ) is usually not idempotent. H(λ) has p eigenvalues equal to one and the remaining eigenvalues less than one when λ > 0.

3.2 Trade-Offs

Before introducing methods for selecting the smoothing parameter, it is helpful to discuss some basic concepts and principles for model selection. In general, model selection boils down to compromises between different aspects of a model. Occam's razor has been the guiding principle for these compromises: the model that fits observations sufficiently well in the least complex way should be preferred. To be precise about fitting observations sufficiently well, one needs a quantity that measures how well a model fits the data. One such measure is the LS in (1.6). To be precise about the least complex way, one needs a quantity that measures the complexity of a model. For a parametric model, a common measure of model complexity is the number of parameters in the model, often called the degrees of freedom (df). For example, the df of model Mk in (3.1) equals 2k + 1.

What would be a good measure of model complexity for a nonparametric regression procedure? Consider the general nonparametric regression model (2.10). Let fi = Li f and f = (f1, . . . , fn)^T. Let f̂ be an estimate of f based on a modeling procedure M, and let f̂i = Li f̂. Ye (1998) defined the generalized degrees of freedom (gdf) of M as

gdf(M) ≜ Σ_{i=1}^n ∂E_f(f̂i)/∂fi. (3.8)

The gdf is an extension of the standard degrees of freedom to general modeling procedures. It can be viewed as the sum of the average sensitivities of the fitted values f̂i to a small change in the response. It is easy to check that (Efron 2004)

gdf(M) = (1/σ^2) Σ_{i=1}^n Cov(f̂i, yi),

where Σ_{i=1}^n Cov(f̂i, yi) is the so-called covariance penalty (Tibshirani and Knight 1999). For a spline estimate with a fixed λ, we have f̂ = H(λ)y based on (2.25). Denote the modeling procedure leading to f̂ as Mλ and write H(λ) = {hij}_{i,j=1}^n. Then

gdf(Mλ) = Σ_{i=1}^n ∂E_f(Σ_{j=1}^n hij yj)/∂fi = Σ_{i=1}^n ∂(Σ_{j=1}^n hij fj)/∂fi = tr H(λ),

where tr represents the trace of a matrix. Even though λ does not have a physical interpretation as k does, tr H(λ) is a useful measure of model complexity and will simply be referred to as the degrees of freedom.

For the Stratford weather data, Figure 3.3(a) depicts how tr H(λ) for the cubic periodic spline depends on the smoothing parameter λ. It is clear that the degrees of freedom decrease as λ increases.

To illustrate the interplay between the LS and model complexity, we fit trigonometric polynomial models from the smallest model with k = 0 to the largest model with k = K. The square roots of the residual sum of squares (RSS) are plotted against the degrees of freedom (2k + 1) as circles in Figure 3.3(b). Similarly, we fit the periodic spline with a wide range of values for the smoothing parameter λ. Again, we plot the square root of RSS against the degrees of freedom (tr H(λ)) as the solid line in Figure 3.3(b). Obviously, RSS decreases to zero (interpolation) as the degrees of freedom increase to n. The square root of RSS keeps

Page 84: Smoothing Splines - 221.114.158.246221.114.158.246/~bunken/statistics/others_smoothingspline.pdf · Applications covers basic smoothing spline models, including polynomial, periodic,

Smoothing Parameter Selection and Inference 59

FIGURE 3.3 Stratford data, plots of (a) degrees of freedom of the periodic spline against the smoothing parameter on the logarithm base 10 scale, and (b) square root of RSS from the trigonometric polynomial model (circles) and the periodic spline (line) against the degrees of freedom.

declining almost linearly after the initial big drop. It is quite clear that the constant model does not fit the data well. However, it is unclear which model fits observations sufficiently well.

Figure 3.3(b) shows that the LS and model complexity are two opposing aspects of a model: the approximation error decreases as the model complexity increases. Our goal is to find the "best" model that strikes a balance between these two conflicting aspects. To make the term "best" meaningful, we need a target criterion that quantifies a model's performance. It is clear that the LS cannot be used as the target because it would lead to the most complex model. Even though there is no universally accepted measure, some criteria are widely accepted and used in practice. We now introduce a criterion that is commonly used for regression models.

Consider the loss function

L(λ) = (1/n)||f̂ − f||^2. (3.9)

Define the risk function, also called mean squared error (MSE), as

MSE(λ) ≜ E L(λ) = E((1/n)||f̂ − f||^2). (3.10)

We want the estimate f̂ to be as close to the true function f as possible. Obviously, the MSE is the expectation of the Euclidean distance between the estimates and the true values. It can be decomposed into two components:

MSE(λ) = (1/n) E||(E f̂ − f) + (f̂ − E f̂)||^2
       = (1/n) E||E f̂ − f||^2 + (2/n) E(E f̂ − f)^T (f̂ − E f̂) + (1/n) E||f̂ − E f̂||^2
       = (1/n) ||E f̂ − f||^2 + (1/n) E||f̂ − E f̂||^2
       = (1/n) ||(I − H(λ))f||^2 + (σ^2/n) tr H^2(λ)
       ≜ b^2(λ) + v(λ), (3.11)

where b^2 and v represent the squared bias and variance, respectively. Note that the bias depends on the true function, while the variance does not. Based on the notation introduced in Section 3.1, let h = (h1, . . . , h_{n−p})^T ≜ U^T Q2^T f. From (3.6), we have

b^2(λ) = (1/n)||(I − H(λ))f||^2
       = (1/n) f^T Q2 U diag{(nλ/(e1 + nλ))^2, . . . , (nλ/(e_{n−p} + nλ))^2} U^T Q2^T f
       = (1/n) Σ_{ν=1}^{n−p} {nλ hν/(eν + nλ)}^2.

From (3.7), we have

v(λ) = (σ^2/n) tr H^2(λ) = (σ^2/n) tr(P_T + Q2 U V^2 U^T Q2^T)
     = (σ^2/n) {p + Σ_{ν=1}^{n−p} (eν/(eν + nλ))^2}.

The squared bias measures how well f̂ approximates the true function f, and the variance measures how well the function can be estimated. As λ increases from 0 to ∞, b^2(λ) increases from 0 to Σ_{ν=1}^{n−p} hν^2/n = ||Q2^T f||^2/n, while v(λ) decreases from σ^2 to pσ^2/n. Therefore, the MSE represents a trade-off between bias and variance. Note that Q2^T f represents the signal that is orthogonal to T.

It is easy to check that db^2(λ)/dλ|_{λ=0} = 0 and dv(λ)/dλ|_{λ=0} < 0. Thus, dMSE(λ)/dλ|_{λ=0} < 0, and MSE(λ) has at least one minimizer λ* > 0. Therefore, when MSE(0) ≤ MSE(∞), there exists at least one λ* such that the corresponding PLS estimate performs better in terms of MSE than the LS estimate in H0 and the interpolation. When MSE(0) > MSE(∞), considering the MSE as a function of δ = 1/λ, we have db^2(δ)/dδ|_{δ=0} < 0 and dv(δ)/dδ|_{δ=0} = 0. Then, again, there exists at least one δ* such that the corresponding PLS estimate performs better in terms of MSE than the LS estimate in H0 and the interpolation.

To calculate the MSE, one needs to know the true function f. The following simulation illustrates the bias-variance trade-off. Observations are generated from model (1.1) with f(x) = sin(4πx^2) and σ = 0.5. The same design points as in the Stratford weather data are used: xi = i/n for i = 1, . . . , n and n = 73. The true function and one realization of observations are shown in Figure 3.4(a). For a fixed λ, the bias, variance, and MSE can be calculated since the true function is known in the simulation. For the cubic periodic spline, Figure 3.4(b) shows b^2(λ), v(λ), and MSE(λ) as functions of log10(nλ). Obviously, as λ increases, the squared bias increases and the variance decreases. The MSE represents a compromise between bias and variance.
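Because f̂ = H(λ)y is linear in y, the bias and variance curves in this simulation can be computed exactly, with no Monte Carlo needed. The Python sketch below works under the M_K approximation of this section rather than the exact cubic periodic spline the book fits in R; the simulation setting f(x) = sin(4πx^2), σ = 0.5, n = 73 is taken from the text.

```python
import numpy as np

n, K, sigma = 73, 36, 0.5
x = np.arange(1, n + 1) / n
f = np.sin(4 * np.pi * x**2)       # true function of the simulation

cols = [np.ones(n)]
for nu in range(1, K + 1):
    cols.append(np.sqrt(2) * np.sin(2 * np.pi * nu * x))
    cols.append(np.sqrt(2) * np.cos(2 * np.pi * nu * x))
X = np.column_stack(cols)

def bias2_var(lam):
    nu = np.arange(1, K + 1)
    shrink = np.concatenate(([1.0],
                             np.repeat(1.0 / (1.0 + lam * (2 * np.pi * nu)**4), 2)))
    H = X @ np.diag(shrink) @ X.T / n          # hat matrix of the filter (3.5)
    b2 = np.sum(((np.eye(n) - H) @ f)**2) / n  # squared bias
    v = sigma**2 * np.trace(H @ H) / n         # variance
    return b2, v

lams = 10.0**np.arange(-8, -1)
b2s, vs = zip(*(bias2_var(l) for l in lams))
print(bool(np.all(np.diff(b2s) > 0)))   # True: bias grows with lam
print(bool(np.all(np.diff(vs) < 0)))    # True: variance shrinks with lam
```

At λ = 0 the filter interpolates in M_K, so the squared bias vanishes and the variance equals σ^2.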

FIGURE 3.4 Plots of (a) the true function (line) and observations (circles), and (b) the squared bias b^2(λ) (dashed line), variance v(λ) (dotted line), and MSE (solid line) for the cubic periodic spline.

Another closely related target criterion is the average predictive squared error (PSE)

PSE(λ) = E((1/n)||y^+ − f̂||^2), (3.12)

where y^+ = f + ε^+ are new observations of f, ε^+ = (ε1^+, . . . , εn^+)^T is independent of ε, and the εi^+ are independent and identically distributed with mean zero and variance σ^2. The PSE measures the performance of a model's prediction for new observations. We have

PSE(λ) = E((1/n)||(y^+ − f) + (f − f̂)||^2) = σ^2 + MSE(λ).

Thus the PSE differs from the MSE only by a constant σ^2. Ideally, one would want to select the λ that minimizes the MSE (PSE). This is, however, not practical because the MSE (PSE) depends on the unknown true function f that one wants to estimate in the first place. Instead, one may estimate the MSE (PSE) from the data and then minimize the estimated criterion. We discuss unbiased and cross-validation estimates of the PSE (MSE) in Sections 3.3 and 3.4, respectively.

3.3 Unbiased Risk

First consider the case when the error variance σ² is known. Since

$$E\left(\frac{1}{n}\|\mathbf{y} - \hat{\mathbf{f}}\|^2\right) = E\left\{\frac{1}{n}\|\mathbf{y} - \mathbf{f}\|^2 + \frac{2}{n}(\mathbf{y} - \mathbf{f})^T(\mathbf{f} - \hat{\mathbf{f}}) + \frac{1}{n}\|\mathbf{f} - \hat{\mathbf{f}}\|^2\right\} = \sigma^2 - \frac{2\sigma^2}{n}\mathrm{tr}\,H(\lambda) + \mathrm{MSE}(\lambda), \qquad (3.13)$$

then

$$\mathrm{UBR}(\lambda) \triangleq \frac{1}{n}\|(I - H(\lambda))\mathbf{y}\|^2 + \frac{2\sigma^2}{n}\mathrm{tr}\,H(\lambda) \qquad (3.14)$$

is an unbiased estimate of PSE(λ). Since PSE differs from MSE only by a constant σ², one may expect the minimizer of UBR(λ) to be close to the minimizer of the risk function MSE(λ). In fact, a stronger result holds: under certain regularity conditions, UBR(λ) is a consistent estimate of the relative loss function L(λ) + n⁻¹εᵀε (Gu 2002). The function UBR(λ) is referred to as the unbiased risk (UBR) criterion, and the minimizer of UBR(λ) is referred to as the UBR estimate of λ. UBR(λ) is an extension of Mallows' Cp criterion.
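The unbiasedness behind (3.14) can be checked by direct simulation: averaged over replicates, UBR(λ) should match PSE(λ) = σ² + MSE(λ). A Python sketch with the same illustrative discrete-penalty smoother (an assumption, not the book's cubic periodic spline):

```python
import numpy as np

rng = np.random.default_rng(0)
n, sigma, lam = 73, 0.5, 10.0
x = np.arange(1, n + 1) / n
f = np.sin(4 * np.pi * x ** 2)
D = np.diff(np.eye(n), n=2, axis=0)
H = np.linalg.inv(np.eye(n) + lam * D.T @ D)   # hat matrix of a linear smoother
I = np.eye(n)

def ubr(y):
    # UBR(lam) = ||(I - H(lam))y||^2 / n + 2 sigma^2 tr H(lam) / n,  eq. (3.14)
    return np.sum(((I - H) @ y) ** 2) / n + 2 * sigma**2 * np.trace(H) / n

# PSE(lam) = sigma^2 + MSE(lam), with MSE = squared bias + variance
mse = np.sum(((I - H) @ f) ** 2) / n + sigma**2 * np.trace(H @ H) / n
pse = sigma**2 + mse
ubr_mean = np.mean([ubr(f + sigma * rng.standard_normal(n)) for _ in range(2000)])
```

Over 2000 replicates the Monte Carlo average of UBR(λ) sits within simulation error of PSE(λ).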

The error variance σ² is usually unknown in practice. In general, there are two classes of estimators for σ²: residual-based and difference-based estimators. The first class of estimators is based on residuals from an estimate of f. For example, analogous to parametric regression, an


Smoothing Parameter Selection and Inference 63

estimator of σ² based on the fit f̂ = H(λ)y is

$$\hat{\sigma}^2 \triangleq \frac{\|(I - H(\lambda))\mathbf{y}\|^2}{n - \mathrm{tr}\,H(\lambda)}. \qquad (3.15)$$

The estimator σ̂² is consistent under certain regularity conditions (Gu 2002). However, it depends critically on the smoothing parameter λ. Thus, it cannot be used in the UBR criterion since the purpose of this criterion is to select λ. For choosing the amount of smoothing, it is desirable to have an estimator of σ² that does not require fitting the function f first.

The difference-based estimators of σ² do not require an estimate of the mean function f. The basic idea is to remove the mean function f by taking differences based on some well-chosen subsets of the data. Consider the general SSR model (2.10). Let Ij = {i(j, 1), . . . , i(j, Kj)} ⊂ {1, . . . , n} be a subset of indices and let d(j, k) be fixed coefficients such that

$$\sum_{k=1}^{K_j} d^2(j,k) = 1, \qquad \sum_{k=1}^{K_j} d(j,k)\mathcal{L}_{i(j,k)}f \approx 0, \qquad j = 1, \ldots, J.$$

Since

$$E\left\{\sum_{k=1}^{K_j} d(j,k)y_{i(j,k)}\right\}^2 \approx E\left\{\sum_{k=1}^{K_j} d(j,k)\epsilon_{i(j,k)}\right\}^2 = \sigma^2,$$

then

$$\tilde{\sigma}^2 \triangleq \frac{1}{J}\sum_{j=1}^{J}\left\{\sum_{k=1}^{K_j} d(j,k)y_{i(j,k)}\right\}^2 \qquad (3.16)$$

provides an approximately unbiased estimator of σ². The estimator σ̃² is referred to as a difference-based estimator since the d(j, k) are usually chosen to be contrasts such that $\sum_{k=1}^{K_j} d(j,k) = 0$. The specific choices of subsets and coefficients depend on factors including prior knowledge about f and the domain X.

Several methods have been proposed for the common situation when x is a univariate continuous variable, f is a smooth function, and the Li are evaluational functionals. Suppose the design points are ordered such that x1 ≤ x2 ≤ · · · ≤ xn. Since f is smooth, f(xj+1) − f(xj) ≈ 0 when neighboring design points are close to each other. Setting Ij = {j, j + 1} and d(j, 1) = −d(j, 2) = 1/√2 for j = 1, . . . , n − 1, we have the first-order difference-based estimator proposed by Rice (1984):

$$\hat{\sigma}^2_R = \frac{1}{2(n-1)}\sum_{i=2}^{n}(y_i - y_{i-1})^2. \qquad (3.17)$$


Hall, Kay and Titterington (1990) proposed the mth-order difference-based estimator

$$\hat{\sigma}^2_{HKT} = \frac{1}{n-m}\sum_{j=1}^{n-m}\left\{\sum_{k=0}^{m}\delta_k y_{j+k}\right\}^2, \qquad (3.18)$$

where the coefficients δk satisfy $\sum_{k=0}^{m}\delta_k = 0$, $\sum_{k=0}^{m}\delta_k^2 = 1$, and δ0δm ≠ 0. Optimal choices of δk are studied in Hall et al. (1990). It is easy to see that σ̂²HKT corresponds to Ij = {j, . . . , j + m} and d(j, k) = δk for j = 1, . . . , n − m.
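Both difference-based estimators are short to implement. The Python sketch below simulates data on a fine, equally spaced grid; the second-order coefficient sequence is one commonly cited choice from the difference-sequence literature and is stated here as an assumption rather than taken from this book:

```python
import numpy as np

rng = np.random.default_rng(1)
n, sigma = 1000, 0.5
x = np.arange(1, n + 1) / n
y = np.sin(4 * np.pi * x ** 2) + sigma * rng.standard_normal(n)

# Rice (1984) first-order estimator, eq. (3.17)
var_rice = np.sum(np.diff(y) ** 2) / (2 * (n - 1))

def var_hkt(y, delta):
    # Hall-Kay-Titterington estimator, eq. (3.18):
    # delta must sum to 0 and have unit squared norm.
    m = len(delta) - 1
    terms = [(delta @ y[j:j + m + 1]) ** 2 for j in range(len(y) - m)]
    return np.mean(terms)

delta2 = np.array([0.809, -0.500, -0.309])   # a second-order sequence (assumption)
var_hkt2 = var_hkt(y, delta2)
```

Both estimates should land near the true σ² = 0.25 without ever fitting f.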

Both σ̂²R and σ̂²HKT require an ordering of the design points, which could be problematic for multivariate independent variables. Tong and Wang (2005) proposed a different method for a general domain X. Suppose X is equipped with a norm. Collect squared distances, dij = ||xi − xj||², for all pairs {xi, xj}, and half squared differences, sij = (yi − yj)²/2, for all pairs {yi, yj}. Then E(sij) = {f(xi) − f(xj)}²/2 + σ². Suppose {f(xi) − f(xj)}²/2 can be approximated by βdij when dij is small. Then the LS estimate of the intercept in the simple linear model

$$s_{ij} = \alpha + \beta d_{ij} + \epsilon_{ij}, \qquad d_{ij} \le D, \qquad (3.19)$$

provides an estimate of σ². Denote this estimator by σ̂²TW. Theoretical properties and the choice of the bandwidth D were studied in Tong and Wang (2005).
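A minimal Python sketch of the Tong-Wang idea follows. It uses plain (unweighted) least squares over all close pairs and an ad hoc bandwidth D, both simplifying assumptions relative to the published method:

```python
import numpy as np

rng = np.random.default_rng(2)
n, sigma = 500, 0.5
x = rng.uniform(size=n)                      # no ordering of design points needed
y = np.sin(4 * np.pi * x ** 2) + sigma * rng.standard_normal(n)

# squared distances d_ij and half squared differences s_ij for all pairs
i, j = np.triu_indices(n, k=1)
d = (x[i] - x[j]) ** 2
s = (y[i] - y[j]) ** 2 / 2                   # E(s_ij) = {f(x_i)-f(x_j)}^2/2 + sigma^2

# LS fit of s = alpha + beta d on pairs with d <= D, eq. (3.19);
# the intercept alpha estimates sigma^2
D = 1e-4                                     # ad hoc bandwidth (assumption)
m = d <= D
X = np.column_stack([np.ones(m.sum()), d[m]])
alpha, beta = np.linalg.lstsq(X, s[m], rcond=None)[0]
```

The intercept absorbs σ² while the slope absorbs the average local squared derivative of f, so no estimate of f is ever formed.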

To illustrate the UBR criterion as an estimate of the PSE, we generate responses from model (1.1) with f(x) = sin(4πx²), σ = 0.5, xi = i/n for i = 1, . . . , n, and n = 73. For the cubic periodic spline, the UBR functions based on 50 replications of simulation data are shown in Figure 3.5, where the true variance is used in (a) and the Rice estimator is used in (b).

3.4 Cross-Validation and Generalized Cross-Validation

Equation (3.13) shows that the RSS underestimates the PSE by the amount 2σ²trH(λ)/n. The second term in the UBR criterion corrects this bias. The bias in the RSS is a consequence of using the same data for model fitting and model evaluation. Ideally, these two tasks should be separated using independent samples. This can be achieved by splitting the whole data set into two subsamples: a training (calibration) sample for


[Figure 3.5 appears here: two panels plotting PSE and UBR against log10(nλ); left panel "UBR with true variance", right panel "UBR with Rice estimator".]

FIGURE 3.5 Plots of the PSE function as solid lines, the UBR functions with the true σ² as dashed lines (left), and the UBR functions with the Rice estimator σ̂²R as dashed lines (right). The minimum point of the PSE is marked as long bars at the bottom. The UBR estimates of log10(nλ) are marked as short bars.

model fitting, and a test (validation) sample for model evaluation. This approach, however, is not efficient unless the sample size is large. The idea behind cross-validation is to recycle data by switching the roles of the training and test samples. For simplicity, we present leaving-out-one cross-validation only. That is, each time one observation is left out as the test sample, and the remaining n − 1 observations are used as the training sample.

Let f̂^[i] be the minimizer of the PLS based on all observations except yi:

$$\frac{1}{n}\sum_{j \ne i}(y_j - \mathcal{L}_j f)^2 + \lambda\|P_1 f\|^2. \qquad (3.20)$$

The cross-validation estimate of PSE is

$$\mathrm{CV}(\lambda) \triangleq \frac{1}{n}\sum_{i=1}^{n}\left(\mathcal{L}_i \hat{f}^{[i]} - y_i\right)^2. \qquad (3.21)$$

CV(λ) is referred to as the cross-validation criterion, and the minimizer of CV(λ) is called the cross-validation estimate of the smoothing parameter. Computation of f̂^[i] based on (3.21) for each i = 1, . . . , n would be costly. Fortunately, this is unnecessary due to the following lemma.


Leaving-out-one Lemma. For any fixed i, f̂^[i] is the minimizer of

$$\frac{1}{n}\left(\mathcal{L}_i \hat{f}^{[i]} - \mathcal{L}_i f\right)^2 + \frac{1}{n}\sum_{j \ne i}(y_j - \mathcal{L}_j f)^2 + \lambda\|P_1 f\|^2. \qquad (3.22)$$

[Proof] For any function f, we have

$$\begin{aligned}
&\frac{1}{n}\left(\mathcal{L}_i \hat{f}^{[i]} - \mathcal{L}_i f\right)^2 + \frac{1}{n}\sum_{j \ne i}(y_j - \mathcal{L}_j f)^2 + \lambda\|P_1 f\|^2 \\
&\qquad \ge \frac{1}{n}\sum_{j \ne i}(y_j - \mathcal{L}_j f)^2 + \lambda\|P_1 f\|^2 \\
&\qquad \ge \frac{1}{n}\sum_{j \ne i}\left(y_j - \mathcal{L}_j \hat{f}^{[i]}\right)^2 + \lambda\|P_1 \hat{f}^{[i]}\|^2 \\
&\qquad = \frac{1}{n}\left(\mathcal{L}_i \hat{f}^{[i]} - \mathcal{L}_i \hat{f}^{[i]}\right)^2 + \frac{1}{n}\sum_{j \ne i}\left(y_j - \mathcal{L}_j \hat{f}^{[i]}\right)^2 + \lambda\|P_1 \hat{f}^{[i]}\|^2,
\end{aligned}$$

where the second inequality holds since f̂^[i] is the minimizer of (3.20).

The above lemma indicates that the solution to the PLS (3.20) without

the ith observation, f̂^[i], is also the solution to the PLS (2.11) with the ith observation yi replaced by the fitted value Lif̂^[i]. Note that the hat matrix H(λ) depends on the model space and the operators Li only; it does not depend on observations of the dependent variable. Therefore, the fits based on (2.11) and (3.22) have the same hat matrix. That is, f̂ = H(λ)y and f̂^[i] = H(λ)y^[i], where f̂^[i] = (L1f̂^[i], . . . , Lnf̂^[i])ᵀ and y^[i] is the same as y except that the ith element is replaced by Lif̂^[i].

Denote H(λ) = {hij}, i, j = 1, . . . , n. Then

$$\mathcal{L}_i \hat{f} = \sum_{j=1}^{n} h_{ij}y_j, \qquad \mathcal{L}_i \hat{f}^{[i]} = \sum_{j \ne i} h_{ij}y_j + h_{ii}\mathcal{L}_i \hat{f}^{[i]}.$$

Solving for Lif̂^[i], we have

$$\mathcal{L}_i \hat{f}^{[i]} = \frac{\mathcal{L}_i \hat{f} - h_{ii}y_i}{1 - h_{ii}}.$$

Then

$$\mathcal{L}_i \hat{f}^{[i]} - y_i = \frac{\mathcal{L}_i \hat{f} - h_{ii}y_i}{1 - h_{ii}} - y_i = \frac{\mathcal{L}_i \hat{f} - y_i}{1 - h_{ii}}.$$


Plugging into (3.21), we have

$$\mathrm{CV}(\lambda) = \frac{1}{n}\sum_{i=1}^{n}\frac{(\mathcal{L}_i \hat{f} - y_i)^2}{(1 - h_{ii})^2}. \qquad (3.23)$$

Therefore, the cross-validation criterion can be calculated using the fit based on the whole sample and the diagonal elements of the hat matrix.
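The identity (3.23) can be verified numerically for a smoother defined by a quadratic penalty, where the brute-force leave-one-out fit solves a reweighted system. The Python sketch below (a discretized penalized least squares problem, an illustrative stand-in for the book's RKHS setting) compares the two computations:

```python
import numpy as np

rng = np.random.default_rng(3)
n, lam = 40, 1.0
D = np.diff(np.eye(n), n=2, axis=0)
K = D.T @ D                                    # discrete roughness penalty
H = np.linalg.inv(np.eye(n) + lam * K)         # hat matrix of the full-data fit
y = np.sin(2 * np.pi * np.arange(n) / n) + 0.3 * rng.standard_normal(n)
fhat = H @ y
h = np.diag(H)

# shortcut (3.23): CV from the full fit and the hat-matrix diagonal
cv_shortcut = np.mean(((fhat - y) / (1 - h)) ** 2)

# brute force: give observation i weight 0 and refit, then predict y_i
res = []
for i in range(n):
    W = np.eye(n)
    W[i, i] = 0.0
    f_i = np.linalg.solve(W + lam * K, W @ y)  # minimizes sum_{j!=i}(y_j-f_j)^2 + lam f'Kf
    res.append(f_i[i] - y[i])
cv_brute = np.mean(np.array(res) ** 2)
```

The two quantities agree to machine precision, so the n refits are never needed in practice.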

Replacing hii by the average of the diagonal elements, trH(λ)/n, we have the generalized cross-validation (GCV) criterion

$$\mathrm{GCV}(\lambda) \triangleq \frac{\frac{1}{n}\sum_{i=1}^{n}(\mathcal{L}_i \hat{f} - y_i)^2}{\left\{\frac{1}{n}\mathrm{tr}(I - H(\lambda))\right\}^2}. \qquad (3.24)$$

The GCV estimate of λ is the minimizer of GCV(λ). Since trH(λ)/n is usually small in the neighborhood of the optimal λ, we have

$$E\{\mathrm{GCV}(\lambda)\} \approx \left\{\frac{1}{n}\|(I - H(\lambda))\mathbf{f}\|^2 + \frac{\sigma^2}{n}\mathrm{tr}\,H^2(\lambda) + \sigma^2 - \frac{2\sigma^2}{n}\mathrm{tr}\,H(\lambda)\right\}\left\{1 + \frac{2}{n}\mathrm{tr}\,H(\lambda)\right\} = \mathrm{PSE}(\lambda)\{1 + o(1)\}.$$

The above approximation provides a very crude argument supporting the GCV criterion as a proxy for the PSE. More formally, under certain regularity conditions, GCV(λ) is a consistent estimate of the relative loss function. Furthermore, GCV(λ) is invariant to orthogonal transformations of y. See Wahba (1990) and Gu (2002) for details. One distinctive advantage of the GCV criterion over the UBR criterion is that the former does not require an estimate of σ².
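For a linear smoother, GCV selection reduces to a one-dimensional grid search. A Python sketch with the illustrative discrete-penalty smoother (an assumption, not the book's cubic periodic spline):

```python
import numpy as np

rng = np.random.default_rng(4)
n, sigma = 73, 0.5
x = np.arange(1, n + 1) / n
y = np.sin(4 * np.pi * x ** 2) + sigma * rng.standard_normal(n)
D = np.diff(np.eye(n), n=2, axis=0)
K = D.T @ D

def gcv(lam):
    # GCV(lam) = (RSS/n) / {tr(I - H(lam))/n}^2,  eq. (3.24)
    H = np.linalg.inv(np.eye(n) + lam * K)
    rss = np.sum(((np.eye(n) - H) @ y) ** 2)
    return (rss / n) / (np.trace(np.eye(n) - H) / n) ** 2

grid = 10.0 ** np.linspace(-6, 2, 41)
scores = np.array([gcv(l) for l in grid])
lam_gcv = grid[int(np.argmin(scores))]
```

Unlike UBR, no value of σ² enters the criterion at any point of the search.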

To illustrate the CV(λ) and GCV(λ) criteria as estimates of the PSE, we generate responses from model (1.1) with f(x) = sin(4πx²), σ = 0.5, xi = i/n for i = 1, . . . , n, and n = 73. For the cubic periodic spline, the CV and GCV scores for 50 replications of simulation data are shown in Figure 3.6.

3.5 Bayes and Linear Mixed-Effects Models

Assume a prior for f of the form

$$F(x) = \sum_{\nu=1}^{p} \zeta_\nu \phi_\nu(x) + \delta^{\frac{1}{2}} U(x), \qquad (3.25)$$


[Figure 3.6 appears here: two panels plotting PSE, CV, and GCV against log10(nλ); left panel "CV", right panel "GCV".]

FIGURE 3.6 Plots of the PSE function as solid lines, the CV functions as dashed lines (left), and the GCV functions as dashed lines (right). The minimum point of the PSE is marked as long bars at the bottom. The CV and GCV estimates of log10(nλ) are marked as short bars.

where ζ1, . . . , ζp are iid N(0, κ), U(x) is a zero-mean Gaussian stochastic process with covariance function R1(x, z), the ζν and U(x) are independent, and κ and δ are positive constants. Note that the bounded linear functionals Li are defined for elements in H; their application to the random process F(x) is yet to be defined. For simplicity, the subscript i in Li is dropped in the following definition. Define L(ζνφν) = ζνLφν. The definition of LU requires the duality between the Hilbert space spanned by a family of random variables and its associated RKHS.

Consider the linear space

$$\mathcal{U} = \left\{W : W = \sum_j \alpha_j U(x_j),\; x_j \in \mathcal{X},\; \alpha_j \in \mathbb{R}\right\}$$

with inner product (W1, W2) = E(W1W2). Let L2(U) be the Hilbert space that is the completion of U. Note that the RK R1 of H1 coincides with the covariance function of U(x). Consider a linear map Ψ : H1 → L2(U) such that Ψ{R1(xj, ·)} = U(xj). Since

$$(R_1(x,\cdot), R_1(z,\cdot)) = R_1(x,z) = E\{U(x)U(z)\} = (U(x), U(z)),$$

the map Ψ is inner product preserving. In fact, H1 is isometrically isomorphic to L2(U). See Parzen (1961) for details. Since L is a bounded linear functional on H1, by the Riesz representation theorem, there exists a representer h such that Lf = (h, f). Finally, we define

$$\mathcal{L}U \triangleq \Psi h.$$


Note that LU is a random variable in L2(U). The application of L to F is defined as

$$\mathcal{L}F = \sum_{\nu=1}^{p} \zeta_\nu \mathcal{L}\phi_\nu + \delta^{\frac{1}{2}}\mathcal{L}U.$$

When L is an evaluational functional, say Lf = f(x0) for a fixed x0, we have h(·) = R1(x0, ·). Consequently,

$$\mathcal{L}U = \Psi h = \Psi\{R_1(x_0, \cdot)\} = U(x_0),$$

the evaluation of U at x0. Therefore, as expected, LF = F(x0) when L is an evaluational functional.

Suppose observations are generated by

$$y_i = \mathcal{L}_i F + \epsilon_i, \qquad i = 1, \ldots, n, \qquad (3.26)$$

where the prior F is defined in (3.25) and the εi are iid N(0, σ²). Note that a normality assumption has been made for the random errors.

We now compute the posterior mean E(L0F|y) for a bounded linear

functional L0 on H. Note that L0 is arbitrary and could be quite different from the Li. For example, suppose f ∈ W₂ᵐ[a, b] and the Li are evaluational functionals; setting L0f = f′(x0) leads to an estimate of f′. Using the correspondence between H and L2(U), we have

$$\begin{aligned}
E(\mathcal{L}_i U \mathcal{L}_j U) &= (\mathcal{L}_i U, \mathcal{L}_j U) = (\mathcal{L}_{i(x)}R_1(x,\cdot), \mathcal{L}_{j(z)}R_1(z,\cdot)) = \mathcal{L}_{i(x)}\mathcal{L}_{j(z)}R_1(x,z), \\
E(\mathcal{L}_0 U \mathcal{L}_i U) &= (\mathcal{L}_0 U, \mathcal{L}_i U) = (\mathcal{L}_{0(x)}R_1(x,\cdot), \mathcal{L}_{i(z)}R_1(z,\cdot)) = \mathcal{L}_{0(x)}\mathcal{L}_{i(z)}R_1(x,z).
\end{aligned}$$

Let ζ = (ζ1, . . . , ζp)ᵀ, φ = (φ1, . . . , φp)ᵀ, and L0φ = (L0φ1, . . . , L0φp)ᵀ. Then F(x) = φᵀ(x)ζ + δ^(1/2)U(x). It is easy to check that

$$\mathbf{y} = T\boldsymbol{\zeta} + \delta^{\frac{1}{2}}(\mathcal{L}_1 U, \ldots, \mathcal{L}_n U)^T + \boldsymbol{\epsilon} \sim \mathrm{N}(0,\; \kappa TT^T + \delta\Sigma + \sigma^2 I), \qquad (3.27)$$

and

$$\mathcal{L}_0 F = (\mathcal{L}_0\phi)^T\boldsymbol{\zeta} + \delta^{\frac{1}{2}}\mathcal{L}_0 U \sim \mathrm{N}\big(0,\; \kappa(\mathcal{L}_0\phi)^T(\mathcal{L}_0\phi) + \delta\mathcal{L}_{0(x)}\mathcal{L}_{0(z)}R_1(x,z)\big). \qquad (3.28)$$

Furthermore,

$$E\{(\mathcal{L}_0 F)\mathbf{y}\} = \kappa T\mathcal{L}_0\phi + \delta\mathcal{L}_0\xi, \qquad (3.29)$$

where

$$\begin{aligned}
\xi(x) &= (\mathcal{L}_{1(z)}R_1(x,z), \ldots, \mathcal{L}_{n(z)}R_1(x,z))^T, \\
\mathcal{L}_0\xi &= (\mathcal{L}_{0(x)}\mathcal{L}_{1(z)}R_1(x,z), \ldots, \mathcal{L}_{0(x)}\mathcal{L}_{n(z)}R_1(x,z))^T.
\end{aligned} \qquad (3.30)$$


Let λ = σ²/(nδ) and η = κ/δ. Using properties of multivariate normal random variables and equations (3.27), (3.28), and (3.29), we have

$$E(\mathcal{L}_0 F|\mathbf{y}) = (\mathcal{L}_0\phi)^T\eta T^T(\eta TT^T + M)^{-1}\mathbf{y} + (\mathcal{L}_0\xi)^T(\eta TT^T + M)^{-1}\mathbf{y}. \qquad (3.31)$$

It can be shown (Wahba 1990, Gu 2002) that for any full-column-rank matrix T and symmetric nonsingular matrix M,

$$\begin{aligned}
\lim_{\eta\to\infty}(\eta TT^T + M)^{-1} &= M^{-1} - M^{-1}T(T^TM^{-1}T)^{-1}T^TM^{-1}, \\
\lim_{\eta\to\infty}\eta T^T(\eta TT^T + M)^{-1} &= (T^TM^{-1}T)^{-1}T^TM^{-1}.
\end{aligned} \qquad (3.32)$$

Combining results in (3.31), (3.32), and (2.22), we have

$$\begin{aligned}
\lim_{\kappa\to\infty}E(\mathcal{L}_0 F|\mathbf{y}) &= (\mathcal{L}_0\phi)^T(T^TM^{-1}T)^{-1}T^TM^{-1}\mathbf{y} + (\mathcal{L}_0\xi)^T\{M^{-1} - M^{-1}T(T^TM^{-1}T)^{-1}T^TM^{-1}\}\mathbf{y} \\
&= (\mathcal{L}_0\phi)^T\mathbf{d} + (\mathcal{L}_0\xi)^T\mathbf{c} = \mathcal{L}_0\hat{f}.
\end{aligned}$$

The above result indicates that the smoothing spline estimate f̂ is a Bayes estimator with a diffuse prior for ζ. From a frequentist perspective, the smoothing spline estimate may be regarded as the best linear unbiased prediction (BLUP) estimate of a linear mixed-effects (LME) model. We now present three corresponding LME models. The first LME model assumes that

$$\mathbf{y} = T\boldsymbol{\zeta} + \mathbf{u} + \boldsymbol{\epsilon}, \qquad (3.33)$$

where ζ = (ζ1, . . . , ζp)ᵀ are deterministic parameters, u = (u1, . . . , un)ᵀ are random effects with distribution u ∼ N(0, σ²Σ/(nλ)), ε = (ε1, . . . , εn)ᵀ are random errors with distribution ε ∼ N(0, σ²I), and u and ε are independent. The second LME model assumes that

$$\mathbf{y} = T\boldsymbol{\zeta} + \Sigma\mathbf{u} + \boldsymbol{\epsilon}, \qquad (3.34)$$

where ζ are deterministic parameters, u are random effects with distribution u ∼ N(0, σ²Σ⁺/(nλ)), Σ⁺ is the Moore–Penrose inverse of Σ, ε are random errors with distribution ε ∼ N(0, σ²I), and u and ε are independent.

It is inconvenient to use the above two LME models for computation since Σ may be singular. Write Σ = ZZᵀ, where Z is an n × m matrix with m = rank(Σ). The third LME model assumes that

$$\mathbf{y} = T\boldsymbol{\zeta} + Z\mathbf{u} + \boldsymbol{\epsilon}, \qquad (3.35)$$

where ζ are deterministic parameters, u are random effects with distribution u ∼ N(0, σ²I/(nλ)), ε are random errors with distribution ε ∼ N(0, σ²I), and u and ε are independent.

It can be shown that the BLUP estimates for each of the three LME models (3.33), (3.34), and (3.35) are the same as the smoothing spline estimate. See Wang (1998b) and Chapter 9 for more details.
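The BLUP correspondence is easy to check numerically: the fitted values of the penalized least squares problem behind LME model (3.35), minimizing ||y − Tζ − Zu||² + nλ||u||², coincide with the fit computed from M = Σ + nλI in the form f̂ = Td + Σc quoted from (2.22). A Python sketch with generic matrices (all settings are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(5)
n, p, nlam = 30, 2, 0.7
t = np.arange(1, n + 1) / n
T = np.column_stack([np.ones(n), t])          # null-space basis (constant, linear)
Z = rng.standard_normal((n, n - p))           # any full-rank Z; Sigma = Z Z'
Sigma = Z @ Z.T
y = np.sin(2 * np.pi * t) + 0.3 * rng.standard_normal(n)

# spline-type solution f = T d + Sigma c with M = Sigma + n*lam*I
M = Sigma + nlam * np.eye(n)
Mi = np.linalg.inv(M)
d = np.linalg.solve(T.T @ Mi @ T, T.T @ Mi @ y)
c = Mi @ (y - T @ d)
f_spline = T @ d + Sigma @ c

# BLUP of LME model (3.35): minimize ||y - T zeta - Z u||^2 + n*lam*||u||^2
A = np.hstack([T, Z])
P = np.zeros((p + Z.shape[1],) * 2)
P[p:, p:] = nlam * np.eye(Z.shape[1])         # penalty on the random effects only
coef = np.linalg.solve(A.T @ A + P, A.T @ y)
f_blup = A @ coef
```

The two fitted vectors agree to numerical precision, which is the content of the BLUP equivalence stated above.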

3.6 Generalized Maximum Likelihood

The connection between smoothing spline models and Bayes models can be exploited to develop a likelihood-based estimate of the smoothing parameter. From (3.27), the marginal distribution of y is N(0, δ(ηTTᵀ + M)). Consider the following transformation:

$$\begin{pmatrix}\mathbf{w}_1 \\ \mathbf{w}_2\end{pmatrix} = \begin{pmatrix}Q_2^T \\ \frac{1}{\sqrt{\eta}}T^T\end{pmatrix}\mathbf{y}. \qquad (3.36)$$

It is easy to check that

$$\begin{aligned}
\mathbf{w}_1 &= Q_2^T\mathbf{y} \sim \mathrm{N}(0,\; \delta Q_2^T M Q_2), \\
\mathrm{Cov}(\mathbf{w}_1, \mathbf{w}_2) &= \frac{\delta}{\sqrt{\eta}}Q_2^T(\eta TT^T + M)T \to 0, \qquad \eta \to \infty, \\
\mathrm{Var}(\mathbf{w}_2) &= \frac{\delta}{\eta}T^T(\eta TT^T + M)T \to \delta(T^TT)(T^TT), \qquad \eta \to \infty.
\end{aligned}$$

Note that the limiting distribution of w2 is independent of λ. Therefore, we consider the negative marginal log-likelihood of w1,

$$l(\lambda, \delta|\mathbf{w}_1) = \frac{1}{2}\log|\delta Q_2^T M Q_2| + \frac{1}{2\delta}\mathbf{w}_1^T(Q_2^T M Q_2)^{-1}\mathbf{w}_1 + C_1, \qquad (3.37)$$

where C1 is a constant. Minimizing l(λ, δ|w1) with respect to δ, we have

$$\hat{\delta} = \frac{\mathbf{w}_1^T(Q_2^T M Q_2)^{-1}\mathbf{w}_1}{n - p}. \qquad (3.38)$$

The profile negative log-likelihood is

$$l_p(\lambda|\mathbf{w}_1) = \frac{1}{2}\log|Q_2^T M Q_2| + \frac{n-p}{2}\log\hat{\delta} + C_2 = \frac{n-p}{2}\log\frac{\mathbf{w}_1^T(Q_2^T M Q_2)^{-1}\mathbf{w}_1}{\{\det(Q_2^T M Q_2)^{-1}\}^{\frac{1}{n-p}}} + C_2, \qquad (3.39)$$


where C2 is another constant. The foregoing profile negative log-likelihood is equivalent to

$$\mathrm{GML}(\lambda) \triangleq \frac{\mathbf{w}_1^T(Q_2^T M Q_2)^{-1}\mathbf{w}_1}{\{\det(Q_2^T M Q_2)^{-1}\}^{\frac{1}{n-p}}} = \frac{\mathbf{y}^T(I - H(\lambda))\mathbf{y}}{[\det{}^+\{I - H(\lambda)\}]^{\frac{1}{n-p}}}, \qquad (3.40)$$

where the second equality is based on (2.26), and det⁺ denotes the product of the nonzero eigenvalues. The function GML(λ) is referred to as the generalized maximum likelihood (GML) criterion, and the minimizer of GML(λ) is called the GML estimate of the smoothing parameter.

From (3.38), a likelihood-based estimate of σ² is

$$\hat{\sigma}^2 \triangleq \frac{n\lambda\,\mathbf{w}_1^T(Q_2^T M Q_2)^{-1}\mathbf{w}_1}{n - p} = \frac{\mathbf{y}^T(I - H(\lambda))\mathbf{y}}{n - p}. \qquad (3.41)$$

The GML criterion may also be derived from the connection between smoothing spline models and LME models. Consider any one of the three corresponding LME models (3.33), (3.34), and (3.35). The smoothing parameter λ is part of the variance component for the random effects. It is common practice in the mixed-effects literature to estimate variance components using the restricted likelihood based on an orthogonal contrast of the original observations, where the contrast eliminates the fixed effects. Note that w1 is one such orthogonal contrast since Q2 is orthogonal to T. Therefore, l(λ, δ|w1) in (3.37) is the negative log restricted likelihood, and the GML estimate of the smoothing parameter is the restricted maximum likelihood (REML) estimate. Furthermore, the estimate of the error variance in (3.41) is the REML estimate of σ². The connection between a smoothing spline estimate with GML estimate of the smoothing parameter and a BLUP estimate with REML estimate of the variance component in a corresponding LME model may be utilized to fit a smoothing spline model using software for LME models. This approach will be adopted in Chapters 5, 8, and 9 to fit smoothing spline models for correlated observations.
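A Python sketch of the GML criterion (3.40) for a linear smoother follows. Working in the eigenbasis of the penalty makes both the quadratic form yᵀ(I − H(λ))y and det⁺, the product over the n − p nonzero eigenvalues of I − H(λ), immediate (the discrete penalty and settings are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(6)
n, p = 73, 2                                  # p = dim of the penalty's null space
x = np.arange(1, n + 1) / n
y = np.sin(4 * np.pi * x ** 2) + 0.5 * rng.standard_normal(n)
D = np.diff(np.eye(n), n=2, axis=0)
K = D.T @ D                                   # null space: constants and linear trends

kvals, U = np.linalg.eigh(K)                  # kvals[0] = kvals[1] = 0
w = U.T @ y

def gml(lam):
    # GML(lam) = y'(I-H)y / [det+(I-H)]^{1/(n-p)},  eq. (3.40);
    # in the eigenbasis, I - H has eigenvalues lam*k/(1 + lam*k)
    g = lam * kvals[p:] / (1 + lam * kvals[p:])
    quad = np.sum(g * w[p:] ** 2)             # y'(I - H(lam))y
    return quad * np.exp(-np.mean(np.log(g)))

grid = 10.0 ** np.linspace(-5, 3, 33)
scores = np.array([gml(l) for l in grid])
lam_gml = grid[int(np.argmin(scores))]
g_sel = lam_gml * kvals[p:] / (1 + lam_gml * kvals[p:])
sigma2_hat = np.sum(g_sel * w[p:] ** 2) / (n - p)   # eq. (3.41)
```

The last line computes the likelihood-based variance estimate (3.41) at the selected λ.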

3.7 Comparison and Implementation

Theoretical properties of the UBR, GCV, and GML criteria can be found in Wahba (1990) and Gu (2002). The UBR criterion requires an estimate of the variance σ². No distributional assumptions are required for the UBR and GCV criteria, while the normality assumption is required in


the derivation of the GML criterion. Nevertheless, limited simulations suggest that the GML method is quite robust to departures from the normality assumption.

Theoretical comparisons between the UBR, GCV, and GML criteria have been studied using large-sample asymptotics (Wahba 1985, Li 1986, Stein 1990) and finite-sample arguments (Efron 2001). Conclusions based on different perspectives do not always agree with each other. In practice, all three criteria usually perform well and lead to similar estimates. Each method has its own strengths and weaknesses. The UBR and GCV criteria occasionally lead to gross undersmoothing (interpolation) when the sample size is small. Fortunately, this problem diminishes quickly as the sample size increases (Wahba and Wang 1995).

The argument spar in the ssr function specifies which method should be used for selecting the smoothing parameter λ. The options spar="v", spar="m", and spar="u" correspond to the GCV, GML, and UBR methods, respectively. The default choice is the GCV method.

We now use the motorcycle data to illustrate how to specify the spar option. For simplicity, the variable times is first scaled into [0, 1]. We first use the Rice method to estimate the error variance and use the estimated variance in the UBR criterion:

> x <- (times-min(times))/(max(times)-min(times))

> vrice <- mean((diff(accel))**2)/2

> mcycle.ubr.1 <- ssr(accel~x, rk=cubic(x),
                      spar="u", varht=vrice)

> summary(mcycle.ubr.1)

Smoothing spline regression fit by UBR method

...

UBR estimate(s) of smoothing parameter(s) : 8.60384e-07

Equivalent Degrees of Freedom (DF): 12.1624

Estimate of sigma: 23.09297

The option varht specifies the parameter σ² required for the UBR method. The summary function provides a synopsis including the estimate of the smoothing parameter, the degrees of freedom trH(λ), and the estimate of the standard deviation σ.

Instead of the Rice estimator, we can estimate the error variance using Tong and Wang's estimator σ̂²TW. Note that there are multiple observations at some time points. We use these replicates to estimate the error variance. That is, we select all pairs with zero distances:

> d <- s <- NULL
> for (i in 1:132) {
    for (j in (i+1):133) {
      d <- c(d, (x[i]-x[j])**2)
      s <- c(s, (accel[i]-accel[j])**2/2)
  }}

> vtw <- coef(lm(s~d,subset=d==0))[1]

> mcycle.ubr.2 <- ssr(accel~x, rk=cubic(x),
                      spar="u", varht=vtw)

> summary(mcycle.ubr.2)

Smoothing spline regression fit by UBR method

...

UBR estimate(s) of smoothing parameter(s) : 8.35209e-07

Equivalent Degrees of Freedom (DF): 12.24413

Estimate of sigma: 22.69913

Next we use the GCV method to select the smoothing parameter:

> mcycle.gcv <- ssr(accel~x, rk=cubic(x), spar="v")

> summary(mcycle.gcv)

Smoothing spline regression fit by GCV method

...

GCV estimate(s) of smoothing parameter(s) : 8.325815e-07

Equivalent Degrees of Freedom (DF): 12.25284

Estimate of sigma: 22.65806

Finally, we use the GML method to select the smoothing parameter:

> mcycle.gml <- ssr(accel~x, rk=cubic(x), spar="m")

> summary(mcycle.gml)

Smoothing spline regression fit by GML method

...

GML estimate(s) of smoothing parameter(s) : 4.729876e-07

Equivalent Degrees of Freedom (DF): 13.92711

Estimate of sigma: 22.57701

For the motorcycle data, all three methods lead to similar estimates of the smoothing parameter and the function f.


3.8 Confidence Intervals

3.8.1 Bayesian Confidence Intervals

Consider the Bayes model (3.25) and (3.26). The computation in Section 3.5 can be carried one step further to derive posterior distributions. In the following arguments, as in Section 3.5, a diffuse prior is assumed for ζ with κ → ∞. For simplicity of notation, the limit is not expressed explicitly.

Let F0ν = ζνφν for ν = 1, . . . , p, and F1 = δ^(1/2)U. Let L0, L01, and L02 be bounded linear functionals. Since the F0ν, F1, and εi are all normal random variables, the posterior distributions of L0F0ν and L0F1 are normal with the following means and covariances.

Posterior means and covariances. For ν, μ = 1, . . . , p, the posterior means are

$$E(\mathcal{L}_0 F_{0\nu}|\mathbf{y}) = (\mathcal{L}_0\phi_\nu)\,\mathbf{e}_\nu^T\mathbf{d}, \qquad E(\mathcal{L}_0 F_1|\mathbf{y}) = (\mathcal{L}_0\xi)^T\mathbf{c}, \qquad (3.42)$$

and the posterior covariances are

$$\begin{aligned}
\delta^{-1}\mathrm{Cov}(\mathcal{L}_{01}F_{0\nu}, \mathcal{L}_{02}F_{0\mu}|\mathbf{y}) &= (\mathcal{L}_{01}\phi_\nu)(\mathcal{L}_{02}\phi_\mu)\,\mathbf{e}_\nu^T A\,\mathbf{e}_\mu, \\
\delta^{-1}\mathrm{Cov}(\mathcal{L}_{01}F_{0\nu}, \mathcal{L}_{02}F_1|\mathbf{y}) &= -(\mathcal{L}_{01}\phi_\nu)\,\mathbf{e}_\nu^T B(\mathcal{L}_{02}\xi), \qquad (3.43) \\
\delta^{-1}\mathrm{Cov}(\mathcal{L}_{01}F_1, \mathcal{L}_{02}F_1|\mathbf{y}) &= \mathcal{L}_{01}\mathcal{L}_{02}R_1 - (\mathcal{L}_{01}\xi)^T C(\mathcal{L}_{02}\xi),
\end{aligned}$$

where eν is the vector of dimension p with the νth element equal to one and all other elements zero, the vectors c and d are given in (2.22), and the matrices A = (TᵀM⁻¹T)⁻¹, B = ATᵀM⁻¹, and C = M⁻¹(I − TB).

The vectors ξ and L0ξ are defined in (3.30). Proofs can be found in Wahba (1990) and Gu (2002). Note that H = span{φ1, . . . , φp} ⊕ H1. Then any f ∈ H can be represented as

$$f = f_{01} + \cdots + f_{0p} + f_1, \qquad (3.44)$$

where f0ν ∈ span{φν} for ν = 1, . . . , p, and f1 ∈ H1. The estimate f̂ can be decomposed similarly:

$$\hat{f} = \hat{f}_{01} + \cdots + \hat{f}_{0p} + \hat{f}_1, \qquad (3.45)$$

where f̂0ν = φνdν for ν = 1, . . . , p, and f̂1 = ξᵀc.


The functionals L0, L01, and L02 are arbitrary as long as they are well defined. The equations in (3.42) indicate that the posterior means of components of F equal the corresponding components of the spline estimate f̂. Equations (3.42) and (3.43) can be used to compute posterior means and variances for any combination of components of F. Specifically, consider the linear combination

$$F_\gamma(x) = \sum_{\nu=1}^{p}\gamma_\nu F_{0\nu}(x) + \gamma_{p+1}F_1(x), \qquad (3.46)$$

where γν equals 1 when the corresponding component of F is to be included and 0 otherwise, and γ = (γ1, . . . , γp+1)ᵀ. Then, for any linear functional L0,

$$\begin{aligned}
E(\mathcal{L}_0 F_\gamma|\mathbf{y}) &= \sum_{\nu=1}^{p}\gamma_\nu(\mathcal{L}_0\phi_\nu)d_\nu + \gamma_{p+1}(\mathcal{L}_0\xi)^T\mathbf{c}, \\
\mathrm{Var}(\mathcal{L}_0 F_\gamma|\mathbf{y}) &= \sum_{\nu=1}^{p}\sum_{\mu=1}^{p}\gamma_\nu\gamma_\mu\mathrm{Cov}(\mathcal{L}_0 F_{0\nu}, \mathcal{L}_0 F_{0\mu}|\mathbf{y}) \\
&\quad + 2\sum_{\nu=1}^{p}\gamma_\nu\gamma_{p+1}\mathrm{Cov}(\mathcal{L}_0 F_{0\nu}, \mathcal{L}_0 F_1|\mathbf{y}) + \gamma_{p+1}^2\mathrm{Var}(\mathcal{L}_0 F_1|\mathbf{y}).
\end{aligned} \qquad (3.47)$$

For various reasons it is often desirable to have interpretable confidence intervals for the function f and its components. For example, one may want to decide whether a nonparametric model is more suitable than a particular parametric model. A parametric regression model may be considered unsuitable if a large portion of its estimate falls outside the confidence intervals of a smoothing spline estimate.

Consider a collection of points x0j ∈ X, j = 1, . . . , J. For each j, the posterior mean E{Fγ(x0j)|y} and variance Var{Fγ(x0j)|y} can be calculated using the equations in (3.47) by setting L0F = F(x0j). Then 100(1 − α)% Bayesian confidence intervals for

$$f_\gamma(x_{0j}) = \sum_{\nu=1}^{p}\gamma_\nu f_{0\nu}(x_{0j}) + \gamma_{p+1}f_1(x_{0j}), \qquad j = 1, \ldots, J, \qquad (3.48)$$

are

$$E\{F_\gamma(x_{0j})|\mathbf{y}\} \pm z_{\frac{\alpha}{2}}\sqrt{\mathrm{Var}\{F_\gamma(x_{0j})|\mathbf{y}\}}, \qquad j = 1, \ldots, J, \qquad (3.49)$$

where z_{α/2} is the 1 − α/2 percentile of the standard normal distribution.


In particular, let F = (L1F, . . . , LnF)ᵀ. Applying (3.42) and (3.43), we have

$$E(\mathbf{F}|\mathbf{y}) = H(\lambda)\mathbf{y}, \qquad \mathrm{Cov}(\mathbf{F}|\mathbf{y}) = \sigma^2 H(\lambda). \qquad (3.50)$$

Therefore, the posterior variances of the fitted values are Var(LiF|y) = σ²hii, where the hii are the diagonal elements of the matrix H(λ). When the Li are evaluational functionals, Lif = f(xi), Wahba (1983) proposed the following 100(1 − α)% confidence intervals:

$$\hat{f}(x_i) \pm z_{\frac{\alpha}{2}}\hat{\sigma}\sqrt{h_{ii}}, \qquad (3.51)$$

where σ̂ is an estimate of σ. Note that confidence intervals for a linear combination of components of f can be constructed similarly.
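Given any linear smoother, the intervals (3.51) are one line of code once σ is estimated, for example by (3.15). A Python sketch with the illustrative discrete-penalty smoother and a fixed, hand-picked λ (both assumptions, not the book's fitted cubic periodic spline):

```python
import numpy as np

rng = np.random.default_rng(7)
n, sigma, lam = 73, 0.5, 5.0
x = np.arange(1, n + 1) / n
f = np.sin(4 * np.pi * x ** 2)
y = f + sigma * rng.standard_normal(n)
D = np.diff(np.eye(n), n=2, axis=0)
H = np.linalg.inv(np.eye(n) + lam * D.T @ D)
fhat = H @ y
h = np.diag(H)

# sigma estimated from residuals, eq. (3.15)
sigma_hat = np.sqrt(np.sum((y - fhat) ** 2) / (n - np.trace(H)))

# 95% Bayesian confidence intervals, eq. (3.51)
half = 1.96 * sigma_hat * np.sqrt(h)
lower, upper = fhat - half, fhat + half
coverage = np.mean((f >= lower) & (f <= upper))   # across-the-function coverage
```

The resulting coverage proportion should be read in the across-the-function (ACP) sense rather than pointwise.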

Though based on a Bayesian argument, the Bayesian confidence intervals have been found to have good frequentist properties provided that the smoothing parameter has been estimated properly. They must be interpreted as "across-the-function" rather than pointwise intervals. More precisely, define the average coverage probability (ACP) as

$$\mathrm{ACP} = \frac{1}{n}\sum_{i=1}^{n}P\{f(x_i) \in C(\alpha, x_i)\}$$

for some 100(1 − α)% confidence intervals {C(α, xi), i = 1, . . . , n}. Rather than considering a confidence interval for f(τ), where f(·) is the realization of a stochastic process and τ is fixed, one may consider confidence intervals for f(τn), where f is now a fixed function and τn is a point randomly selected from {xi, i = 1, . . . , n}. Then ACP = P{f(τn) ∈ C(α, τn)}. Note that the ACP coverage property is weaker than the pointwise coverage property. For polynomial splines with C(α, xi) being the Bayesian confidence intervals defined in (3.51), under certain regularity conditions, Nychka (1988) showed that ACP ≈ 1 − α.

The predict function in the assist package computes the posterior mean and standard deviation of Fγ(x) in (3.46). The option terms specifies the coefficients γ, and the option newdata specifies a data frame consisting of the values at which predictions are required. We now use the geyser, motorcycle, and Arosa data to illustrate how to use the predict function to compute posterior means and standard deviations.

For the geyser data, we fitted a cubic spline in Chapter 1, Section 1.1, and a partial spline in Chapter 2, Section 2.10. In the following we fit a cubic spline using the ssr function, compute posterior means and standard deviations for the estimate of the smooth component P1f using the predict function, and plot the estimate of the smooth component and 95% Bayesian confidence intervals:


> geyser.cub.fit <- ssr(waiting~x, rk=cubic(x))

> grid <- seq(0,1,len=200)

> geyser.cub.pred <- predict(geyser.cub.fit, pstd=T,

terms=c(0,0,1), newdata=data.frame(x=grid))

> grid1 <- grid*diff(range(eruptions))+min(eruptions)

> plot(eruptions, waiting, xlab="duration (mins)",
       ylim=c(-6,6), ylab="smooth component (mins)",
       type="n")

> polygon(c(grid1,rev(grid1)),

c(geyser.cub.pred$fit-1.96*geyser.cub.pred$pstd,

rev(geyser.cub.pred$fit+1.96*geyser.cub.pred$pstd)),

col=gray(0:8/8)[8], border=NA)

> lines(grid1, geyser.cub.pred$fit)

> abline(0,0,lty=2)

where the option pstd specifies whether the posterior standard deviations should be calculated. Note that the option pstd=T can be dropped in the above statement since it is the default. There are in total three components in the cubic spline fit: two basis functions φ1 (constant) and φ2 (linear) for the null space, and the smooth component in the space H1. In the order in which they appear in the ssr function, these three components correspond to the intercept (~1, which is automatically included), the linear basis specified by ~x, and the smooth component specified by rk=cubic(x). Therefore, the option terms=c(0,0,1) was used to compute the posterior means and standard deviations for the smooth component f1 in the space H1. The estimate of the smooth component and 95% Bayesian confidence intervals are shown in Figure 3.7(a). A large portion of the zero constant line is outside the confidence intervals, indicating the lack of fit of a linear model (the null space of the cubic spline).

For the partial spline model (2.43), consider the following Bayes model

y_i = s_i^T β + L_i F + ε_i,  i = 1, . . . , n,   (3.52)

where the prior for β is assumed to be N(0, κI_q), the prior for F is defined in (3.25), and ε_i are iid N(0, σ²). Again, it can be shown that the PLS estimates of the components in β and f based on (2.44) equal the posterior means of their corresponding components in the Bayes model as κ → ∞. Posterior covariances and Bayesian confidence intervals for β and f_γ can be calculated similarly.

We now refit the partial spline model (2.47) and compute posterior means and standard deviations for the estimate of the smooth component P1f :


FIGURE 3.7 Geyser data, plots of estimates of the smooth components, and 95% Bayesian confidence intervals for (a) the cubic spline and (b) the partial spline models. The constant zero is marked as the dotted line in each plot.

> t <- .397; s <- 1*(x>t)
> geyser.ps.fit <- ssr(waiting~x+s, rk=cubic(x))
> geyser.ps.pred <- predict(geyser.ps.fit,
    terms=c(0,0,0,1),
    newdata=data.frame(x=grid,s=1*(grid>t)))

Since there are in total four components in the partial spline fit, the option terms=c(0,0,0,1) was used to compute the posterior means and standard deviations for the smooth component f1.

Figure 3.7(b) shows the estimate of the smooth component and 95% Bayesian confidence intervals for the partial spline. The zero constant line is well inside the confidence intervals, indicating that this smooth component may be dropped from the partial spline model. That is, a simple linear change-point model may be appropriate for these data.

For the motorcycle data, based on visual inspection, we searched for a potential change-point t in the first derivative in the interval [0.2, 0.25] for the variable x in Section 2.10. To search for all possible change-points in the first derivative, we fit the partial spline model (2.48) repeatedly with t taking values on a grid in the interval [0.1, 0.9]. We then calculate the posterior mean and standard deviation for β. Define a t-statistic at point t as E(β|y)/√Var(β|y). The t-statistics were calculated as follows:

> tgrid <- seq(0.05,.95,len=200); tstat <- NULL
> for (t in tgrid) {
    s <- (x-t)*(x>t)
    tmp <- ssr(accel~x+s, rk=cubic(x))
    tmppred <- predict(tmp, terms=c(0,0,1,0),
      newdata=data.frame(x=.5,s=1))
    tstat <- c(tstat, tmppred$fit/tmppred$pstd)
  }

Note that terms=c(0,0,1,0) requests the posterior mean and standard deviation for the component β × s, where s = (x − t)+, and s is set to one in the newdata argument. The t-statistics are shown in the bottom panel of Figure 3.8.

FIGURE 3.8 Motorcycle data, plots of t-statistics (bottom), and the partial spline fit with 95% Bayesian confidence intervals (top).

There are three clusters of large t-statistics that suggest three potential change-points in the first derivative at t1 = 0.2128, t2 = 0.3666, and t3 = 0.5113, respectively (see also Speckman (1995)). So we fit the following partial spline model

y_i = Σ_{j=1}^{3} β_j (x_i − t_j)_+ + f(x_i) + ε_i,  i = 1, . . . , n,   (3.53)

where x is the variable time scaled into [0, 1], t_j are the change-points in the first derivative, and f ∈ W_2^2[0, 1]. We fit model (3.53) and compute posterior means and standard deviations as follows:

> t1 <- .2128; t2 <- .3666; t3 <- .5113
> s1 <- (x-t1)*(x>t1); s2 <- (x-t2)*(x>t2)
> s3 <- (x-t3)*(x>t3)
> mcycle.ps.fit2 <- ssr(accel~x+s1+s2+s3, rk=cubic(x))
> grid <- seq(0,1,len=100)
> mcycle.ps.pred2 <- predict(mcycle.ps.fit2,
    newdata=data.frame(x=grid, s1=(grid-t1)*(grid>t1),
      s2=(grid-t2)*(grid>t2), s3=(grid-t3)*(grid>t3)))

The fit and Bayesian confidence intervals are shown in Figure 3.8.

For the Arosa data, Figure 2.12 in Chapter 2 shows the estimates and 95% confidence intervals for the overall function and its decomposition. In particular, the posterior means and standard deviations for the periodic spline were calculated as follows:

> arosa.per.fit <- ssr(thick~1, rk=periodic(x))
> grid <- data.frame(x=seq(.5/12,11.5/12,length=50))
> arosa.per.pred <- predict(arosa.per.fit,grid,
    terms=matrix(c(1,0,0,1,1,1),nrow=3,byrow=T))

where the input for the terms argument is a 3 × 2 matrix with the first row (1,0) specifying the parametric component, the second row (0,1) specifying the smooth component, and the third row (1,1) specifying the overall function.

3.8.2 Bootstrap Confidence Intervals

Consider the general SSR model (2.10). Let f̂ and σ̂² be the estimates of f and σ², respectively. Let

y*_{i,b} = L_i f̂ + ε*_{i,b},  i = 1, . . . , n;  b = 1, . . . , B   (3.54)

be B bootstrap samples, where ε*_{i,b} are iid N(0, σ̂²). The random errors ε*_{i,b} may also be drawn from the residuals with replacement when the normality assumption is undesirable. Let f̂*_{γ,b} be the estimate of f_γ in (3.48) based on the bth bootstrap sample {y*_{i,b}, i = 1, . . . , n}. For any well-defined functional L0, there are B bootstrap estimates L0 f̂*_{γ,b} for b = 1, . . . , B. Then the 100(1 − α)% percentile bootstrap confidence interval of L0 f_γ is

(L0 f_{γ,L}, L0 f_{γ,U}),   (3.55)

where L0 f_{γ,L} and L0 f_{γ,U} are the lower and upper α/2 quantiles of {L0 f̂*_{γ,b}, b = 1, . . . , B}.
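The percentile interval (3.55) is easy to prototype outside of a spline fitter. The sketch below is in Python for illustration only (the book's own analyses use the assist package in R); the function name and the degenerate "smoother" in the toy example are hypothetical, and residuals are resampled with replacement as described above.

```python
import random
import statistics

def percentile_bootstrap_ci(y, fit, functional, B=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for a functional of the fit.

    y: observations; fit: fitted values from the original smooth;
    functional: maps a bootstrap response vector to the scalar L0 f*.
    Residuals are resampled with replacement (no normality assumed).
    """
    rng = random.Random(seed)
    resid = [yi - fi for yi, fi in zip(y, fit)]
    boot = []
    for _ in range(B):
        ystar = [fi + rng.choice(resid) for fi in fit]
        boot.append(functional(ystar))
    boot.sort()
    lo = boot[int((alpha / 2) * B)]           # lower alpha/2 quantile
    hi = boot[int((1 - alpha / 2) * B) - 1]   # upper alpha/2 quantile
    return lo, hi

# Toy example: the functional is the sample mean of the response and the
# "fit" is the overall mean (a degenerate smoother), so the interval
# should bracket the point estimate.
y = [1.0, 2.0, 3.0, 4.0, 5.0]
fbar = statistics.mean(y)
lo, hi = percentile_bootstrap_ci(y, [fbar] * len(y), statistics.mean)
print(lo < fbar < hi)
```

For an actual spline fit, `functional` would evaluate the refitted smooth at a point of interest, which is where all the computational cost lies.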

The percentile-t bootstrap confidence intervals can be constructed as follows. Let τ = (L0 f̂_γ − L0 f_γ)/σ̂, where division by σ̂ is introduced to reduce the dependence on σ. Let τ*_b = (L0 f̂*_{γ,b} − L0 f̂_γ)/σ̂*_b be the bootstrap estimates of τ for b = 1, . . . , B, where σ̂*_b is the estimate of σ based on the bth bootstrap sample. Then the 100(1 − α)% percentile-t bootstrap confidence interval of L0 f_γ is

(L0 f̂_γ − q_{1−α/2} σ̂, L0 f̂_γ − q_{α/2} σ̂),   (3.56)

where q_{α/2} and q_{1−α/2} are the lower and upper α/2 quantiles of {τ*_b, b = 1, . . . , B}. Note that the bounded linear condition is not required for L0 in the above construction of bootstrap confidence intervals. Other forms of bootstrap confidence intervals and a comparison between Bayesian and bootstrap approaches can be found in Wang and Wahba (1995).
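The studentize-then-invert step in (3.56) can be sketched directly once the bootstrap estimates and their scale estimates are available. Python for illustration; `percentile_t_ci` is a hypothetical helper, not part of any package used in the book.

```python
import random

def percentile_t_ci(theta_hat, sigma_hat, boot_thetas, boot_sigmas, alpha=0.05):
    """Percentile-t interval (3.56): studentize each bootstrap estimate,
    tau*_b = (theta*_b - theta_hat) / sigma*_b, then invert the empirical
    quantiles of the tau*_b around the original estimate."""
    taus = sorted((t - theta_hat) / s for t, s in zip(boot_thetas, boot_sigmas))
    B = len(taus)
    q_lo = taus[int((alpha / 2) * B)]           # lower alpha/2 quantile
    q_hi = taus[int((1 - alpha / 2) * B) - 1]   # upper alpha/2 quantile
    return theta_hat - q_hi * sigma_hat, theta_hat - q_lo * sigma_hat

# Toy check: with bootstrap estimates scattered around the point estimate,
# the interval should contain it.
rng = random.Random(1)
boot_thetas = [rng.gauss(2.0, 0.5) for _ in range(999)]
boot_sigmas = [0.5] * 999
lo, hi = percentile_t_ci(2.0, 0.5, boot_thetas, boot_sigmas)
print(lo < 2.0 < hi)
```

Note the quantile reversal: the upper quantile of τ* sets the lower interval endpoint, exactly as in (3.56).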

For example, the 95% percentile bootstrap confidence intervals in Figure 2.8 in Chapter 2 were computed as follows:

> nboot <- 9999
> fb <- NULL
> for (i in 1:nboot) {
    yb <- canada.fit1$fit +
      sample(canada.fit1$resi, 35, replace=T)
    bfit <- ssr(yb~s, rk=Sigma, spar="m")
    fb <- cbind(fb, bfit$coef$d[2]+S%*%bfit$coef$c)
  }
> lb <- apply(fb, 1, quantile, prob=.025)
> ub <- apply(fb, 1, quantile, prob=.975)

where random errors for bootstrap samples were drawn from the residuals with replacement, and lb and ub represent the lower and upper bounds.

We use the following simulation to show the performance of Bayesian and bootstrap confidence intervals. Observations are generated from model (1.1) with f(x) = exp{−64(x − 0.5)²}, σ = 0.1, x_i = i/n for i = 1, . . . , n, and n = 100. We fit a cubic spline and construct 95% Bayesian, percentile bootstrap (denoted as Per), and percentile-t bootstrap (denoted as T) confidence intervals. We set B = 1000 and repeat the simulation 100 times. Figures 3.9(a), (b), and (c) show average pointwise coverages for the Bayesian, percentile bootstrap, and percentile-t bootstrap confidence intervals, respectively. The absolute value of f'' is also plotted to show the curvature of the function. The pointwise coverage is usually smaller than the nominal value at high-curvature points. Boxplots of the across-the-function coverages for these three methods are shown in Figure 3.9(d). The average and median ACPs are close to the nominal value.

FIGURE 3.9 Plots of pointwise coverages (solid line) and nominal value (dotted line) for (a) the Bayesian confidence intervals, (b) the percentile bootstrap confidence intervals, and (c) the percentile-t bootstrap confidence intervals. Dashed lines in (a), (b), and (c) represent a scaled version of |f''|. Boxplots of ACPs are shown in (d) with mean coverages marked as pluses and the nominal value plotted as a dotted line.


3.9 Hypothesis Tests

3.9.1 The Hypothesis

One of the most useful applications of nonparametric regression models is to check or suggest a parametric model. When appropriate, parametric models, especially linear models, are preferred in practice because of their simplicity and interpretability. One important step in building a parametric model is to investigate potential departure from the specified model. Tests with specific alternatives are often performed in practice. Such tests may not perform well for other forms of departure from the parametric model, especially those orthogonal to the specific alternative. For example, to detect departure from a straight-line model, one may consider a quadratic polynomial as the alternative. Then departure in the form of higher-order polynomials may be missed. It is desirable to have tools that can detect general departures from a specific parametric model.

One approach to checking a parametric model is to construct confidence intervals for f or its smooth component in an SSR model using the methods in Section 3.8. The parametric model may be deemed unsuitable if a large portion of its estimate is outside the confidence intervals of f. When the null space H0 corresponds to the parametric model under consideration, one may check the magnitude of the estimate of the smooth component P1f since it represents the remaining systematic variation not explained by the parametric model. If a large portion of the confidence intervals for P1f does not contain zero, the parametric model may be deemed unsuitable. This approach was illustrated using the geyser data in Section 3.8.1.

Often the confidence intervals are all one needs in practice. Nevertheless, sometimes it may be desirable to conduct a formal test of the departure from a parametric model. In this section we consider the following hypothesis

H0 : f ∈ H0,  H1 : f ∈ H and f ∉ H0,   (3.57)

where H0 corresponds to the parametric model. Note that when the parametric model satisfies Lf = 0 with a differential operator L given in (2.53), the L-spline may be used to check or test the parametric model.

The alternative is equivalent to ||P1f|| > 0. Note that λ = ∞ in (2.11), or equivalently, δ = 0 in the corresponding Bayes model (3.25), leads to f ∈ H0. Thus the hypothesis (3.57) can be reexpressed as

H0 : λ = ∞,  H1 : λ < ∞,   (3.58)

or

H0 : δ = 0,  H1 : δ > 0.   (3.59)

3.9.2 Locally Most Powerful Test

Consider the hypothesis (3.59). As in Section 3.8, we use the marginal log-likelihood of w1, where w1 = Q_2^T y ∼ N(0, δ Q_2^T M Q_2). Since Q2 is orthogonal to T, it is clear that the transformation Q_2^T y eliminates the contribution from the model under the null hypothesis. Thus w1 reflects signals, if any, from H1. Note that M = Σ + nλI and Q_2^T Σ Q_2 = U E U^T, where the notations Q2, U, and E were defined in Section 3.2. Let z ≜ U^T w1 and denote z = (z1, . . . , z_{n−p})^T. Then z ∼ N(0, δE + σ²I). First assume that σ² is known. Note that λ = σ²/nδ. The negative log-likelihood of z is

l(δ|z) = (1/2) Σ_{ν=1}^{n−p} log(δe_ν + σ²) + (1/2) Σ_{ν=1}^{n−p} z_ν²/(δe_ν + σ²) + C1,   (3.60)

where C1 is a constant. Let U_δ(δ) and I_{δδ}(δ) be the score and Fisher information of δ. It is not difficult to check that the score test statistic (Cox and Hinkley 1974) is

t_score ≜ U_δ(0)/√I_{δδ}(0) = C2 ( Σ_{ν=1}^{n−p} e_ν z_ν² + C3 ),

where C2 and C3 are constants. Therefore, the score test is equivalent to the following test statistic:

t_LMP = Σ_{ν=1}^{n−p} e_ν z_ν².   (3.61)

For polynomial splines, Cox, Koh, Wahba and Yandell (1988) showed that there is no uniformly most powerful test and that t_LMP is the locally most powerful (LMP) test.

The variance is usually unknown in practice. Replacing σ² by its MLE under the null hypothesis (3.59), σ̂₀² = Σ_{ν=1}^{n−p} z_ν²/(n − p), leads to the approximate LMP test statistic

t_appLMP = Σ_{ν=1}^{n−p} e_ν z_ν² / Σ_{ν=1}^{n−p} z_ν².   (3.62)

The null hypothesis is rejected for large values of t_appLMP. The test statistic t_appLMP does not follow a simple distribution under H0. Nevertheless, it is straightforward to simulate the null distribution. Under H0, the z_ν are iid N(0, σ²). Without loss of generality, σ² can be set to one in the simulation of the null distribution since both the numerator and the denominator depend only on the z_ν². Specifically, samples z_{ν,j}, iid N(0, 1) for ν = 1, . . . , n − p and j = 1, . . . , N, are generated, and the statistics t_appLMP,j = Σ_{ν=1}^{n−p} e_ν z_{ν,j}² / Σ_{ν=1}^{n−p} z_{ν,j}² are computed. Note that the t_appLMP,j are N realizations of the statistic under H0. Then the proportion of t_appLMP,j greater than t_appLMP provides an estimate of the p-value. This approach usually requires a very large N. The p-value can also be calculated numerically using the algorithm in Davies (1980). The approximation method is very fast and agrees with the results from the Monte Carlo method (Liu and Wang 2004).
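The Monte Carlo recipe above takes only a few lines. The following Python sketch is illustrative; the decaying eigenvalue sequence `e` is a hypothetical stand-in for the eigenvalues e_ν of Q_2^T Σ Q_2, which in practice come from the fitted spline.

```python
import random

def t_app_lmp(e, z):
    """Approximate LMP statistic (3.62): weighted over unweighted sum of squares."""
    num = sum(ev * zv * zv for ev, zv in zip(e, z))
    den = sum(zv * zv for zv in z)
    return num / den

def lmp_pvalue(e, t_obs, N=2000, seed=0):
    """Monte Carlo p-value: under H0 the z_nu are iid N(0, sigma^2), and the
    statistic is scale free, so sigma is set to 1 in the simulation."""
    rng = random.Random(seed)
    n = len(e)
    count = 0
    for _ in range(N):
        z = [rng.gauss(0.0, 1.0) for _ in range(n)]
        if t_app_lmp(e, z) > t_obs:
            count += 1
    return count / N

# Hypothetical eigenvalues decaying like a smoothing spline spectrum.
e = [1.0 / (k ** 4) for k in range(1, 51)]
# A z with its energy concentrated on the first (smoothest) eigenvector
# should look extreme under H0, giving a tiny p-value.
z_signal = [5.0] + [0.1] * 49
p = lmp_pvalue(e, t_app_lmp(e, z_signal))
print(p < 0.05)
```

Because the statistic is a ratio of quadratic forms in z, no estimate of σ² is needed anywhere in the simulation, which is exactly why the text can set σ = 1.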

3.9.3 Generalized Maximum Likelihood Test

Consider the hypothesis (3.58). Since z ∼ N(0, δ(E + nλI)), the MLE of δ is

δ̂ = (1/(n − p)) Σ_{ν=1}^{n−p} z_ν²/(e_ν + nλ).

From (3.39), the profile likelihood of λ is

L_p(λ, δ̂|z) = C { [Σ_{ν=1}^{n−p} z_ν²/(e_ν + nλ)] / [Π_{ν=1}^{n−p} (e_ν + nλ)^{−1/(n−p)}] }^{−(n−p)/2},   (3.63)

where C is a constant. The GML estimate of λ, λ_GML, is the maximizer of (3.63). The GML test statistic for the hypothesis (3.58) is

t_GML ≜ { L_p(λ_GML|z)/L_p(∞|z) }^{−2/(n−p)} = [Σ_{ν=1}^{n−p} z_ν²/(e_ν + nλ_GML)] / [Π_{ν=1}^{n−p} (e_ν + nλ_GML)^{−1/(n−p)}] × 1/(Σ_{ν=1}^{n−p} z_ν²).   (3.64)

It is clear that the GML test is equivalent to a ratio of restricted likelihoods. The null hypothesis is rejected when t_GML is too small. The standard theory for likelihood ratio tests does not apply because the parameter λ lies on the boundary of the parameter space under the null hypothesis. Thus, it is difficult to derive the null distribution of t_GML. The Monte Carlo method described for the LMP test can be adapted to compute an estimate of the p-value. Note that t_GML involves the GML estimate of the smoothing parameter. Therefore, λ_GML needs to be estimated for each simulation sample, which makes this approach computationally intensive.


The null distribution of −(n − p) log t_GML can be well approximated by a mixture of χ²₁ and χ²₀, denoted by rχ²₀ + (1 − r)χ²₁. However, the ratio r is not fixed. It is difficult to derive a formula for r since it depends on many factors. One can approximate the ratio r first and then calculate the p-value based on the mixture of χ²₁ and χ²₀ with the approximated r. A relatively small sample size is required to approximate r. See Liu and Wang (2004) for details.

3.9.4 Generalized Cross-Validation Test

Consider the hypothesis (3.58). Let λ_GCV be the GCV estimate of λ. Similar to the GML test statistic, the GCV test statistic is defined as the ratio between GCV scores

t_GCV ≜ GCV(λ_GCV)/GCV(∞) = (n − p)² [Σ_{ν=1}^{n−p} z_ν²/(1 + e_ν/nλ_GCV)²] / {Σ_{ν=1}^{n−p} 1/(1 + e_ν/nλ_GCV)}² × 1/(Σ_{ν=1}^{n−p} z_ν²).   (3.65)

H0 is rejected when t_GCV is too small. Again, similar to the GML test, the Monte Carlo method can be used to compute an estimate of the p-value.

3.9.5 Comparison and Implementation

The LMP and GML tests, derived based on Bayesian arguments, perform well under deterministic models. In terms of the eigenvectors of Q_2^T Σ Q_2, the LMP test is more powerful in detecting departure in the direction of the first eigenvector, the GML test is more powerful in detecting departure in low frequencies, and the GCV test is more powerful in detecting departure in high frequencies. See Liu and Wang (2004) for more details.

These tests can be carried out using the anova function in the assist package. We now use the geyser and Arosa data to illustrate how to use this function.

For the geyser data, first consider the following hypotheses

H0 : f ∈ span{1, x},  H1 : f ∈ W_2^2[0, 1] and f ∉ span{1, x}.

We have fitted a cubic spline model where the null space corresponds to the model under H0. Therefore, we can test the hypothesis as follows:

> anova(geyser.cub.fit, simu.size=500)


Testing H_0: f in the NULL space
     test.value simu.size simu.p-value
LMP  0.02288057       500            0
GCV  0.00335470       500            0

where the option simu.size specifies the Monte Carlo sample size N. The simple linear model is rejected. Next we consider the hypothesis

H0 : f ∈ span{1, x, s},  H1 : f ∈ W_2^2[0, 1] and f ∉ span{1, x, s},

where s = (x − 0.397)_+^0. We have fitted a partial spline model where the null space corresponds to the model under H0. Therefore, we can test the hypothesis as follows:

> anova(geyser.ps.fit, simu.size=500)

Testing H_0: f in the NULL space
     test.value simu.size simu.p-value
LMP  0.000351964      500        0.602
GCV  0.003717477      500        0.602

The null hypothesis is not rejected. The conclusions are the same as those based on the Bayesian confidence intervals. To apply the GML test, we first fit using the GML method to select the smoothing parameter:

> geyser.ps.fit.m <- ssr(waiting~x+s, rk=cubic(x),
    spar="m")

> anova(geyser.ps.fit.m, simu.size=500)
Testing H_0: f in the NULL space
     test.value simu.size simu.p-value approximate.p-value
LMP  0.000352         500        0.634
GML  1.000001         500        0.634                 0.5

where the approximate.p-value was computed using the mixture of two chi-square distributions.

For the Arosa data, consider the hypothesis

H0 : f ∈ P,  H1 : f ∈ W_2^2(per) and f ∉ P,

where P = span{1, sin 2πx, cos 2πx} is the model space for the sinusoidal model. Two approaches can be used to test the above hypothesis: fit a partial spline or fit an L-spline:


> arosa.ps.fit <- ssr(thick~sin(2*pi*x)+cos(2*pi*x),
    rk=periodic(x), data=Arosa)
> anova(arosa.ps.fit,simu.size=500)
Testing H_0: f in the NULL space
     test.value simu.size simu.p-value
LMP  0.001262064      500            0
GCV  0.001832394      500            0

> arosa.ls.fit <- ssr(thick~sin(2*pi*x)+cos(2*pi*x),
    rk=lspline(x,type="sine1"))
> anova(arosa.ls.fit,simu.size=500)
Testing H_0: f in the NULL space
     test.value simu.size simu.p-value
LMP  2.539163e-06     500            0
GCV  0.001828071      500            0

The test based on the L-spline is usually more powerful since the parametric and smooth components are orthogonal.


Chapter 4

Smoothing Spline ANOVA

4.1 Multiple Regression

Consider the problem of building regression models that examine the relationship between a dependent variable y and multiple independent variables x1, . . . , xd. For generality, let the domain of each xk be an arbitrary set Xk. Denote x = (x1, . . . , xd). Given observations (xi, yi) for i = 1, . . . , n, where xi = (xi1, . . . , xid), a multiple regression model relates the dependent variable and the independent variables as follows:

y_i = f(x_i) + ε_i,  i = 1, . . . , n,   (4.1)

where f is a multivariate regression function, and ε_i are zero-mean independent random errors with a common variance σ². The goal is to construct a model for f and estimate it based on noisy data.

There exist many different methods to construct a model space for f: parametrically, semiparametrically, or nonparametrically. For example, a thin-plate spline model may be used when all xk are univariate continuous variables, and partial spline models may be used if a linear parametric model can be assumed for all but one variable. This chapter introduces a nonparametric approach called smoothing spline analysis of variance (smoothing spline ANOVA or SS ANOVA) decomposition for constructing model spaces for the multivariate function f.

The multivariate function f is defined on the product domain X = X1 × X2 × · · · × Xd. Note that each Xk is arbitrary: it may be a continuous interval, a discrete set, a unit circle, a unit sphere, or R^d. Construction of model spaces for a single variable was introduced in Chapter 2. Let H^(k) be an RKHS on Xk. The choice of the marginal space H^(k) depends on the domain Xk and prior knowledge about f as a function of xk. To model the joint function f, we start with the tensor product of these marginal spaces defined in the following section.


4.2 Tensor Product Reproducing Kernel Hilbert Spaces

First consider the simple case where d = 2. Denote the RKs for H^(1) and H^(2) as R^(1) and R^(2), respectively. It is known that the product of nonnegative definite functions is nonnegative definite (Gu 2002). As RKs, both R^(1) and R^(2) are nonnegative definite. Therefore, the bivariate function on X = X1 × X2

R((x1, x2), (z1, z2)) ≜ R^(1)(x1, z1) R^(2)(x2, z2)

is nonnegative definite. By the Moore–Aronszajn theorem, there exists a unique RKHS H on X = X1 × X2 such that R is its RK. The resulting RKHS H is called the tensor product RKHS and is denoted as H^(1) ⊗ H^(2). For d > 2, the tensor product RKHS of H^(1), . . . , H^(d) on the product domain X = X1 × X2 × · · · × Xd, H^(1) ⊗ H^(2) ⊗ · · · ⊗ H^(d), is defined recursively. Note that the RK for a tensor product space equals the product of the RKs of the marginal spaces. That is, the RK of H^(1) ⊗ H^(2) ⊗ · · · ⊗ H^(d) equals

R(x, z) = R^(1)(x1, z1) R^(2)(x2, z2) · · · R^(d)(xd, zd),

where x ∈ X, z = (z1, . . . , zd) ∈ X, X = X1 × X2 × · · · × Xd, and R^(k) is the RK of H^(k) for k = 1, . . . , d.

For illustration, consider the ultrasound data consisting of tongue shape measurements over time from ultrasound imaging. The data set contains observations on the response variable height (y) and three independent variables: environment (x1), length (x2), and time (x3). The variable x1 is a factor with three levels: x1 = 1, 2, 3 corresponding to 2words, cluster, and schwa, respectively. Both continuous variables x2 and x3 are scaled into [0, 1]. Interpolations of the raw data are shown in Figure 4.1.

In linguistic studies, researchers want to determine (1) how tongue shapes for an articulation differ under different environments, (2) how the tongue shape changes as a function of time, and (3) how changes over time differ under different environments. To address the first question at a fixed time point, we need to model a bivariate regression function f(x1, x2). Assume marginal spaces R³ and W_2^m[0, 1] for variables x1 and x2, respectively. Then we may consider the tensor product space R³ ⊗ W_2^m[0, 1] for the bivariate function f. To address the second question, for a fixed environment, we need to model a bivariate regression function f(x2, x3). Assume marginal spaces W_2^{m1}[0, 1] and W_2^{m2}[0, 1] for variables


FIGURE 4.1 Ultrasound data, 3-d plots of observations.

x2 and x3, respectively. Then we may consider the tensor product space W_2^{m1}[0, 1] ⊗ W_2^{m2}[0, 1] for the bivariate function. To address the third question, we need to model a trivariate regression function f(x1, x2, x3). We may consider the tensor product space R³ ⊗ W_2^{m1}[0, 1] ⊗ W_2^{m2}[0, 1] for the trivariate function. Analysis of the ultrasound data is given in Section 4.9.1.
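The product rule for tensor product RKs is easy to see numerically. The sketch below is in Python for illustration; the closed form used for `cubic_rk` is one common RK for the smooth subspace of W_2^2[0, 1] (functions with f(0) = f'(0) = 0) and should be treated as an assumption here, not as the exact kernel used by any particular package.

```python
def cubic_rk(x, z):
    # Assumed RK for the smooth subspace of W_2^2[0,1]:
    # R(x, z) = min(x,z)^2 * (3*max(x,z) - min(x,z)) / 6.
    a, b = (x, z) if x <= z else (z, x)
    return a * a * (3 * b - a) / 6

def tensor_rk(x, z):
    # x = (x2, x3), z = (z2, z3): the RK of the tensor product space
    # W^{m1}[0,1] (x) W^{m2}[0,1] is the product of the marginal RKs.
    return cubic_rk(x[0], z[0]) * cubic_rk(x[1], z[1])

# The Gram matrix of a valid RK must be symmetric (and nonnegative definite).
pts = [(0.1, 0.9), (0.5, 0.5), (0.8, 0.2)]
gram = [[tensor_rk(p, q) for q in pts] for p in pts]
print(all(gram[i][j] == gram[j][i] for i in range(3) for j in range(3)))
```

With a discrete marginal kernel such as (4.6) in place of one `cubic_rk` factor, the same product gives the RK for the R³ ⊗ W_2^m[0, 1] space used for the first question above.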

The SS ANOVA decomposition decomposes a tensor product space into subspaces with a hierarchical structure similar to the main effects and interactions in the classical ANOVA. The resulting hierarchical structure facilitates model selection and interpretation. Sections 4.3, 4.4, and 4.5 present SS ANOVA decompositions for a single space, the tensor product of two spaces, and the tensor product of d spaces. More SS ANOVA decompositions can be found in Sections 4.9, 5.4.4, 6.3, and 9.2.4.

4.3 One-Way SS ANOVA Decomposition

SS ANOVA decompositions of tensor product RKHS's are based on decompositions of the marginal spaces for each independent variable. Therefore, decompositions for a single space are introduced first in this section. Spline models for a single independent variable were introduced in Chapter 2. Denote the independent variable as x and the regression function as f. It was assumed that f belongs to an RKHS H. The function f was decomposed into a parametric and a smooth component, f = f0 + f1, or in terms of the model space, H = H0 ⊕ H1. This can be regarded as one form of the SS ANOVA decomposition. We now introduce general decompositions based on averaging operators. Consider a function space H on the domain X. An operator A is called an averaging operator if A = A². Instead of the more appropriate term idempotent operator, the term averaging operator is used since it is motivated by averaging in the classical ANOVA decomposition. Note that an averaging operator does not necessarily involve averaging. As we will see in the following subsections, the commonly used averaging operators are projection operators. Thus they are idempotent.

Suppose the model space H = H0 ⊕ H1, where H0 is a finite dimensional space with orthogonal basis φ1(x), . . . , φp(x). Let A_ν be the projection operator onto the subspace {φ_ν(x)} for ν = 1, . . . , p, and A_{p+1} be the projection operator onto H1. Then the function can be decomposed as

f = (A1 + · · · + A_p + A_{p+1})f ≜ f01 + · · · + f0p + f1.   (4.2)

Correspondingly, the model space is decomposed into

H = {φ1(x)} ⊕ · · · ⊕ {φp(x)} ⊕ H1.

For simplicity, {·} represents the space spanned by the basis functions inside the brackets. Some averaging operators A_ν (subspaces) can be combined. For example, combining A1, . . . , A_p leads to the decomposition f = f0 + f1, a parametric component plus a smooth component. When φ1(x) = 1, combining A2, . . . , A_{p+1} leads to the decomposition f = f01 + f̄1, where f01 is a constant independent of x, and f̄1 = f02 + · · · + f0p + f1 collects all components that depend on x. Therefore, f = f01 + f̄1 decomposes the function into a constant plus a nonconstant function.

For the same model space, different SS ANOVA decompositions may be constructed for different purposes. In general, we denote the one-way SS ANOVA decomposition as

f = A1f + · · · + A_r f,   (4.3)

where A1 + · · · + A_r = I, and I is the identity operator. The above equality always holds since f = If. Equivalently, in terms of the model space, the one-way SS ANOVA decomposition is denoted as

H = H^(1) ⊕ · · · ⊕ H^(r).

The following subsections provide one-way SS ANOVA decompositions for some special model spaces.


4.3.1 Decomposition of R^a: One-Way ANOVA

Suppose x is a discrete variable with a levels. The classical one-way mean model assumes that

y_ik = µ_i + ε_ik,  i = 1, . . . , a;  k = 1, . . . , n_i,   (4.4)

where y_ik represents the observation of the kth replication at level i of x, µ_i represents the mean at level i, and ε_ik represent random errors. Regarding µ_i as a function of i and writing it explicitly as f(i) ≜ µ_i, f is a function defined on the discrete domain X = {1, . . . , a}. It is easy to see that the model space for f is the Euclidean a-space R^a.

Model space construction and decomposition of R^a

The space R^a is an RKHS with the inner product (f, g) = f^T g. Furthermore, R^a = H0 ⊕ H1, where

H0 = {f : f(1) = · · · = f(a)},
H1 = {f : Σ_{i=1}^a f(i) = 0},   (4.5)

are RKHS's with corresponding RKs

R0(i, j) = 1/a,
R1(i, j) = δ_{i,j} − 1/a,   (4.6)

and δ_{i,j} is the Kronecker delta.
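The two kernels in (4.6) can be checked numerically: they add up to the Kronecker delta (the RK of R^a under the Euclidean inner product), and every row of R1 sums to zero, matching the sum-to-zero condition defining H1. A small Python sketch for illustration (a = 4 is an arbitrary choice):

```python
a = 4

def r0(i, j):
    # RK of H0, the constant functions, cf. (4.6).
    return 1.0 / a

def r1(i, j):
    # RK of H1, the sum-to-zero functions, cf. (4.6).
    return (1.0 if i == j else 0.0) - 1.0 / a

# R0 + R1 recovers the delta kernel of R^a itself.
print(all(r0(i, j) + r1(i, j) == (1.0 if i == j else 0.0)
          for i in range(1, a + 1) for j in range(1, a + 1)))
# Each row of R1 sums to zero, so sections of R1 lie in H1.
print(all(abs(sum(r1(i, j) for j in range(1, a + 1))) < 1e-12
          for i in range(1, a + 1)))
```

The second check reflects the reproducing property: the representer R1(i, ·) must itself be a member of H1.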

Details about the above construction can be found in Gu (2002). Define an averaging operator A1 : R^a → R^a such that

A1 f = (1/a) Σ_{i=1}^a f(i).

The operator A1 maps f to the constant function that equals the average over all indices. Let A2 = I − A1. The one-way ANOVA effect model is based on the following decomposition of the function f:

f = A1f + A2f ≜ µ + α_i,   (4.7)

where µ is the overall mean, and α_i is the effect at level i. From the definition of the averaging operator, the α_i satisfy the sum-to-zero side condition Σ_{i=1}^a α_i = 0. It is clear that A1 and A2 are the projection operators


from ℝ^a onto H_0 and H_1 defined in (4.5). Thus, they divide ℝ^a into H_0 and H_1.

Under the foregoing construction, ‖P_1 f‖² = ∑_{i=1}^a {f(i) − f̄}², where f̄ = ∑_{i=1}^a f(i)/a. For balanced designs with n_i = n, the solution to the PLS (2.11) is

f̂(i) = ȳ_{i·} − (aλ/(1 + aλ)) (ȳ_{i·} − ȳ_{··}),    (4.8)

where ȳ_{i·} = ∑_{k=1}^n y_{ik}/n and ȳ_{··} = ∑_{i=1}^a ȳ_{i·}/a. It is easy to check that when n = 1, σ² = 1, and λ = (a − 3)/{a(∑_{i=1}^a ȳ_{i·}² − a + 3)}, the spline estimate f̂ is the James–Stein estimator (shrinking toward the mean). Therefore, in a sense, Stein's shrinkage estimator can be regarded as a spline estimate on a discrete domain.

There exist other ways to decompose ℝ^a. For example, the averaging operator A_1 f = f(1) leads to the same decomposition as (4.7) with the set-to-zero side condition α_1 = 0 (Gu 2002).

4.3.2 Decomposition of W^m_2[a, b]

Under the construction in Section 2.2, let

A_ν f(x) = f^{(ν−1)}(a) (x − a)^{ν−1}/(ν − 1)!,  ν = 1, …, m.    (4.9)

It is easy to see that A_ν² = A_ν. Thus, they are averaging operators. In fact, A_ν is the projection operator onto {(x − a)^{ν−1}/(ν − 1)!}. Let A_{m+1} = I − A_1 − · · · − A_m. The decomposition

f = A_1 f + · · · + A_m f + A_{m+1} f

corresponds to the Taylor expansion (1.5). It decomposes the model space W^m_2[a, b] into

W^m_2[a, b] = {1} ⊕ {x − a} ⊕ · · · ⊕ {(x − a)^{m−1}/(m − 1)!} ⊕ H_1,

where H_1 is given in (2.3). For f ∈ H_1, the conditions f^{(ν)}(a) = 0 for ν = 0, …, m − 1 are analogous to the set-to-zero condition in the classical one-way ANOVA model.

Under the construction in Section 2.6 for W^m_2[0, 1], let

A_ν f(x) = {∫_0^1 f^{(ν−1)}(u)du} k_{ν−1}(x),  ν = 1, …, m.

Again, A_ν is an averaging (projection) operator extracting the polynomial of order ν. In particular, A_1 f = ∫_0^1 f(u)du is a natural extension of the averaging in the discrete domain. Let A_{m+1} = I − A_1 − · · · − A_m. The decomposition

f = A_1 f + · · · + A_m f + A_{m+1} f

decomposes the model space W^m_2[0, 1] into

W^m_2[0, 1] = {1} ⊕ {k_1(x)} ⊕ · · · ⊕ {k_{m−1}(x)} ⊕ H_1,

where H_1 is given in (2.29). For f ∈ H_1, the conditions ∫_0^1 f^{(ν)}dx = 0 for ν = 0, …, m − 1 are analogous to the sum-to-zero condition in the classical one-way ANOVA model.
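These averaging operators can be checked numerically for m = 2. In the illustrative Python sketch below (our own computation, not code from the book), A₁f is the mean of f, A₂f uses ∫₀¹ f′du = f(1) − f(0), and the remainder A₃f = f − A₁f − A₂f satisfies the two side conditions ∫₀¹ A₃f du = 0 and ∫₀¹ (A₃f)′du = 0.

```python
import numpy as np

x = np.linspace(0.0, 1.0, 2001)

def integral(y):
    # composite trapezoid rule on the uniform grid x
    dx = x[1] - x[0]
    return dx * (y[0] / 2 + y[1:-1].sum() + y[-1] / 2)

f = np.exp(x) * np.sin(3 * x)            # any smooth test function on [0, 1]

A1f = integral(f) * np.ones_like(x)      # A1 f = (int_0^1 f du) * 1
A2f = (f[-1] - f[0]) * (x - 0.5)         # int_0^1 f' du = f(1) - f(0)
A3f = f - A1f - A2f                      # remainder in H1
```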

4.3.3 Decomposition of W^m_2(per)

Under the construction in Section 2.7, let

A_1 f = ∫_0^1 f du

be an averaging (projection) operator. Let A_2 = I − A_1. The decomposition

f = A_1 f + A_2 f

decomposes the model space

W^m_2(per) = {1} ⊕ {f ∈ W^m_2(per) : ∫_0^1 f du = 0}.

4.3.4 Decomposition of W^m_2(ℝ^d)

For simplicity, consider the special case with d = 2 and m = 2. Decompositions for general d and m can be derived similarly. Let φ_1(x) = 1, φ_2(x) = x_1, and φ_3(x) = x_2 be polynomials of total degree less than m = 2. Let φ_1 = 1, and let φ_2 and φ_3 form an orthonormal basis such that (φ_ν, φ_µ)_0 = δ_{ν,µ} based on the norm (2.41). Define two averaging operators

A_1 f(x) = ∑_{j=1}^J w_j f(u_j),
A_2 f(x) = ∑_{j=1}^J w_j f(u_j){φ_2(u_j)φ_2(x) + φ_3(u_j)φ_3(x)},    (4.10)

where u_j are fixed points in ℝ², and w_j are fixed positive weights such that ∑_{j=1}^J w_j = 1. It is clear that A_1 and A_2 are projection operators onto the spaces {φ_1} and {φ_2, φ_3}, respectively. To see how they generalize averaging operators, define a probability measure µ on X = ℝ² by assigning probability w_j to the point u_j, j = 1, …, J. Then A_1 f(x) = ∫_{ℝ²} f φ_1 dµ and A_2 f(x) = (∫_{ℝ²} f φ_2 dµ)φ_2(x) + (∫_{ℝ²} f φ_3 dµ)φ_3(x). Therefore, A_1 and A_2 take averages with respect to the discrete probability measure µ. In particular, µ puts mass 1/n on the design points x_j when J = n, w_j = 1/n, and u_j = x_j. A continuous density on ℝ² may be used instead. However, the resulting integrals usually do not have closed forms, and approximations such as a quadrature formula would have to be used. This is essentially equivalent to using an approximate discrete probability measure.

Let A_3 = I − A_1 − A_2. The decomposition

f = A_1 f + A_2 f + A_3 f

divides the model space

W^2_2(ℝ²) = {1} ⊕ {φ_2, φ_3} ⊕ {f ∈ W^2_2(ℝ²) : J^2_2(f) = 0},

where J^d_m(f) is defined in (2.36).
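The construction above can be illustrated numerically. In the Python sketch below (our own illustration; the function names are hypothetical), {1, x_1, x_2} is orthonormalized with respect to the discrete measure that puts weight w_j on u_j, and A_1 and A_2 in (4.10) become weighted averages and projections onto the resulting basis.

```python
import numpy as np

def weighted_basis(U, w):
    """Gram-Schmidt orthonormalization of {1, x1, x2} under the discrete
    inner product (g, h)_0 = sum_j w_j g(u_j) h(u_j).
    Returns rows phi_1, phi_2, phi_3 evaluated at the points u_j."""
    X = np.column_stack([np.ones(len(U)), U[:, 0], U[:, 1]])
    basis = []
    for k in range(3):
        v = X[:, k].astype(float)
        for b in basis:
            v = v - (w * v * b).sum() * b        # remove component along b
        basis.append(v / np.sqrt((w * v * v).sum()))
    return np.array(basis)

def A1(fvals, w):
    # A1 f = sum_j w_j f(u_j): average w.r.t. the discrete measure
    return (w * fvals).sum()

def A2_coefs(fvals, w, P):
    # coefficients of A2 f in the phi_2, phi_3 directions, as in (4.10)
    return np.array([(w * fvals * P[1]).sum(), (w * fvals * P[2]).sum()])

rng = np.random.default_rng(0)
U = rng.uniform(size=(9, 2))        # fixed points u_j in R^2
w = np.full(9, 1.0 / 9)             # equal weights summing to one
P = weighted_basis(U, w)
```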

4.4 Two-Way SS ANOVA Decomposition

Suppose there are two independent variables, x_1 ∈ X_1 and x_2 ∈ X_2. Consider the tensor product space H^{(1)} ⊗ H^{(2)} on the product domain X_1 × X_2. For f as a marginal function of x_k, assume the following one-way decomposition based on Section 4.3,

f = A^{(k)}_1 f + · · · + A^{(k)}_{r_k} f,  k = 1, 2,    (4.11)

where the A^{(k)}_j are averaging operators on H^{(k)} and ∑_{j=1}^{r_k} A^{(k)}_j = I. Then, for the joint function, we have

f = {A^{(1)}_1 + · · · + A^{(1)}_{r_1}}{A^{(2)}_1 + · · · + A^{(2)}_{r_2}} f = ∑_{j_1=1}^{r_1} ∑_{j_2=1}^{r_2} A^{(1)}_{j_1} A^{(2)}_{j_2} f.    (4.12)

The above decomposition of the bivariate function f is referred to as the two-way SS ANOVA decomposition.

Denote

H^{(k)} = H^{(k)}_{(1)} ⊕ · · · ⊕ H^{(k)}_{(r_k)},  k = 1, 2,

as the one-way decomposition of H^{(k)} associated with (4.11). Then, (4.12) decomposes the tensor product space

H^{(1)} ⊗ H^{(2)} = {H^{(1)}_{(1)} ⊕ · · · ⊕ H^{(1)}_{(r_1)}} ⊗ {H^{(2)}_{(1)} ⊕ · · · ⊕ H^{(2)}_{(r_2)}}
  = ⊕_{j_1=1}^{r_1} ⊕_{j_2=1}^{r_2} H^{(1)}_{(j_1)} ⊗ H^{(2)}_{(j_2)}.

Consider the special case when r_k = 2 for k = 1, 2. Assume that A^{(k)}_1 f is independent of x_k, or equivalently, H^{(k)}_0 = {1}. Then the decomposition (4.12) can be written as

f = A^{(1)}_1 A^{(2)}_1 f + A^{(1)}_2 A^{(2)}_1 f + A^{(1)}_1 A^{(2)}_2 f + A^{(1)}_2 A^{(2)}_2 f
  ≜ µ + f_1(x_1) + f_2(x_2) + f_{12}(x_1, x_2),    (4.13)

where µ represents the grand mean, f_1(x_1) and f_2(x_2) represent the main effects of x_1 and x_2, respectively, and f_{12}(x_1, x_2) represents the interaction between x_1 and x_2.

For general r_k, assuming A^{(k)}_1 f is independent of x_k, the decomposition (4.13) can be derived by combining the operators A^{(k)}_2, …, A^{(k)}_{r_k} into one averaging operator Ã^{(k)}_2 = A^{(k)}_2 + · · · + A^{(k)}_{r_k}. Therefore, decomposition (4.13) combines components in (4.12) and reorganizes them into the overall main effects and interactions.

The following subsections provide two-way SS ANOVA decompositions for combinations of some special model spaces.

4.4.1 Decomposition of ℝ^a ⊗ ℝ^b: Two-Way ANOVA

Suppose both x_1 and x_2 are discrete variables with a and b levels, respectively. The classical two-way mean model assumes that

y_{ijk} = µ_{ij} + ε_{ijk},  i = 1, …, a; j = 1, …, b; k = 1, …, n_{ij},

where y_{ijk} represents the observation of the kth replication at level i of x_1 and level j of x_2, µ_{ij} represents the mean at level i of x_1 and level j of x_2, and ε_{ijk} represent random errors.

Regarding µ_{ij} as a bivariate function of (i, j) and letting f(i, j) ≜ µ_{ij}, f is a bivariate function defined on the product domain X = {1, …, a} × {1, …, b}. The model space for f is the tensor product space ℝ^a ⊗ ℝ^b. Define two averaging operators A^{(1)}_1 and A^{(2)}_1 such that

A^{(1)}_1 f = (1/a) ∑_{i=1}^a f(i, j),
A^{(2)}_1 f = (1/b) ∑_{j=1}^b f(i, j).

A^{(1)}_1 and A^{(2)}_1 map f to univariate functions by averaging over all levels of x_1 and x_2, respectively. Let A^{(1)}_2 = I − A^{(1)}_1 and A^{(2)}_2 = I − A^{(2)}_1. Then the classical two-way ANOVA effect model is based on the following SS ANOVA decomposition of the function f:

f = (A^{(1)}_1 + A^{(1)}_2)(A^{(2)}_1 + A^{(2)}_2) f
  = A^{(1)}_1 A^{(2)}_1 f + A^{(1)}_2 A^{(2)}_1 f + A^{(1)}_1 A^{(2)}_2 f + A^{(1)}_2 A^{(2)}_2 f
  ≜ µ + α_i + β_j + (αβ)_{ij},    (4.14)

where µ represents the overall mean, α_i represents the main effect of x_1, β_j represents the main effect of x_2, and (αβ)_{ij} represents the interaction between x_1 and x_2. The sum-to-zero side conditions are satisfied by the definition of the averaging operators.

Based on the one-way ANOVA decomposition in Section 4.3.1, we have ℝ^a = H^{(1)}_0 ⊕ H^{(1)}_1 and ℝ^b = H^{(2)}_0 ⊕ H^{(2)}_1, where H^{(1)}_0 and H^{(2)}_0 are subspaces containing constant functions, and H^{(1)}_1 and H^{(2)}_1 are the orthogonal complements of H^{(1)}_0 and H^{(2)}_0, respectively. The classical two-way ANOVA model decomposes the tensor product space

ℝ^a ⊗ ℝ^b = {H^{(1)}_0 ⊕ H^{(1)}_1} ⊗ {H^{(2)}_0 ⊕ H^{(2)}_1}
  = {H^{(1)}_0 ⊗ H^{(2)}_0} ⊕ {H^{(1)}_1 ⊗ H^{(2)}_0} ⊕ {H^{(1)}_0 ⊗ H^{(2)}_1} ⊕ {H^{(1)}_1 ⊗ H^{(2)}_1}.    (4.15)

The four subspaces in (4.15) contain the components µ, α_i, β_j, and (αβ)_{ij} in (4.14), respectively.
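The decomposition (4.14) can be computed directly for any a × b table of means. A minimal Python sketch (our own illustration; the book's examples use R):

```python
import numpy as np

def two_way_effects(F):
    """Decompose an a x b table f(i, j) as in (4.14):
    f = mu + alpha_i + beta_j + (alpha beta)_ij, with sum-to-zero sides."""
    mu = F.mean()                            # A1(1) A1(2) f
    alpha = F.mean(axis=1) - mu              # main effect of x1
    beta = F.mean(axis=0) - mu               # main effect of x2
    inter = F - mu - alpha[:, None] - beta[None, :]
    return mu, alpha, beta, inter

F = np.array([[2.0, 4.0],
              [6.0, 8.0],
              [1.0, 3.0]])                   # a = 3 levels by b = 2 levels
mu, alpha, beta, inter = two_way_effects(F)
# this particular table is additive, so the interaction is identically zero
```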

4.4.2 Decomposition of ℝ^a ⊗ W^m_2[0, 1]

Suppose x_1 is a discrete variable with a levels, and x_2 is a continuous variable in [0, 1]. A natural model space for x_1 is ℝ^a, and a natural model space for x_2 is W^m_2[0, 1]. Therefore, we consider the tensor product space ℝ^a ⊗ W^m_2[0, 1] for the bivariate regression function f(x_1, x_2). For simplicity, we derive SS ANOVA decompositions for m = 1 and m = 2 only. SS ANOVA decompositions for higher-order m can be derived similarly. In this and the remaining sections, the construction in Section 2.6 for the marginal space W^m_2[0, 1] will be used. Similar SS ANOVA decompositions can be derived under the construction in Section 2.2.

Consider the tensor product space ℝ^a ⊗ W^1_2[0, 1] first. Define two averaging operators A^{(1)}_1 and A^{(2)}_1 as

A^{(1)}_1 f = (1/a) ∑_{x_1=1}^a f,
A^{(2)}_1 f = ∫_0^1 f dx_2,

where A^{(1)}_1 and A^{(2)}_1 extract the constant term out of all possible functions for each variable. Let A^{(1)}_2 = I − A^{(1)}_1 and A^{(2)}_2 = I − A^{(2)}_1. Then

f = {A^{(1)}_1 + A^{(1)}_2}{A^{(2)}_1 + A^{(2)}_2} f
  = A^{(1)}_1 A^{(2)}_1 f + A^{(1)}_2 A^{(2)}_1 f + A^{(1)}_1 A^{(2)}_2 f + A^{(1)}_2 A^{(2)}_2 f
  ≜ µ + f_1(x_1) + f_2(x_2) + f_{12}(x_1, x_2).    (4.16)

Obviously, (4.16) is a natural extension of the classical two-way ANOVA decomposition (4.14) from the product of two discrete domains to the product of one discrete and one continuous domain. As in the classical ANOVA model, the components in (4.16) have nice interpretations: µ represents the overall mean, f_1(x_1) represents the main effect of x_1, f_2(x_2) represents the main effect of x_2, and f_{12}(x_1, x_2) represents the interaction. The components also have nice interpretations collectively: µ + f_2(x_2) represents the mean curve among all levels of x_1, and f_1(x_1) + f_{12}(x_1, x_2) represents the departure from the mean curve at level x_1. Write ℝ^a = H^{(1)}_0 ⊕ H^{(1)}_1 and W^1_2[0, 1] = H^{(2)}_0 ⊕ H^{(2)}_1, where H^{(1)}_0 and H^{(1)}_1 are given in (4.5), H^{(2)}_0 = {1}, and H^{(2)}_1 = {f ∈ W^1_2[0, 1] : ∫_0^1 f du = 0}. Then, in terms of the model space, (4.16) decomposes

ℝ^a ⊗ W^1_2[0, 1] = {H^{(1)}_0 ⊕ H^{(1)}_1} ⊗ {H^{(2)}_0 ⊕ H^{(2)}_1}
  = {H^{(1)}_0 ⊗ H^{(2)}_0} ⊕ {H^{(1)}_1 ⊗ H^{(2)}_0} ⊕ {H^{(1)}_0 ⊗ H^{(2)}_1} ⊕ {H^{(1)}_1 ⊗ H^{(2)}_1}
  ≜ H_0 ⊕ H_1 ⊕ H_2 ⊕ H_3.    (4.17)

To fit model (4.16), we need to find a basis for H_0 and RKs for H_1, H_2, and H_3. It is clear that H_0 contains all constant functions. Thus, H_0 is a one-dimensional space with the basis φ(x) = 1. The RKs of H^{(1)}_0 and H^{(1)}_1 are given in (4.6), and the RKs of H^{(2)}_0 and H^{(2)}_1 are given in Table 2.2. The RKs of H_1, H_2, and H_3 can be calculated using the fact that the RK of a tensor product space equals the product of the RKs of the involved marginal spaces. For example, the RK of H_3 = H^{(1)}_1 ⊗ H^{(2)}_1 equals (δ_{x_1,z_1} − 1/a){k_1(x_2)k_1(z_2) + k_2(|x_2 − z_2|)}.
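The product construction for RKs is easy to exercise numerically. The sketch below is an illustrative Python computation, not code from the book: the continuous-margin kernel is a simple stand-in (not the k₁k₁ + k₂ form from Table 2.2). It forms the Gram matrix of the product RK as the elementwise product of the marginal Gram matrices and checks that the result is still nonnegative definite, as the Schur product theorem guarantees.

```python
import numpy as np

a = 4                                          # levels of the discrete margin
x1 = np.array([1, 2, 3, 4, 1, 2])              # discrete coordinates of 6 points
x2 = np.array([0.1, 0.3, 0.5, 0.6, 0.8, 0.9])  # continuous coordinates

# marginal Gram matrices
R1 = (x1[:, None] == x1[None, :]).astype(float) - 1.0 / a   # RK R1 from (4.6)
R2 = np.minimum.outer(x2, x2)                  # stand-in RK on [0, 1]

R3 = R1 * R2            # product RK of the tensor product space, elementwise
eigvals = np.linalg.eigvalsh(R3)
```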

Now suppose we want to model the effect of x_2 using the cubic spline space W^2_2[0, 1]. Consider the tensor product space ℝ^a ⊗ W^2_2[0, 1]. Define three averaging operators A^{(1)}_1, A^{(2)}_1, and A^{(2)}_2 as

A^{(1)}_1 f = (1/a) ∑_{x_1=1}^a f,
A^{(2)}_1 f = ∫_0^1 f dx_2,
A^{(2)}_2 f = (∫_0^1 f′ dx_2)(x_2 − 0.5),

where A^{(1)}_1 and A^{(2)}_1 extract the constant function out of all possible functions for each variable, and A^{(2)}_2 extracts the linear function for x_2. Let A^{(1)}_2 = I − A^{(1)}_1 and A^{(2)}_3 = I − A^{(2)}_1 − A^{(2)}_2. Then

f = {A^{(1)}_1 + A^{(1)}_2}{A^{(2)}_1 + A^{(2)}_2 + A^{(2)}_3} f
  = A^{(1)}_1 A^{(2)}_1 f + A^{(1)}_1 A^{(2)}_2 f + A^{(1)}_1 A^{(2)}_3 f
    + A^{(1)}_2 A^{(2)}_1 f + A^{(1)}_2 A^{(2)}_2 f + A^{(1)}_2 A^{(2)}_3 f
  ≜ µ + β × (x_2 − 0.5) + f^s_2(x_2)
    + f_1(x_1) + γ_{x_1} × (x_2 − 0.5) + f^{ss}_{12}(x_1, x_2),    (4.18)

where µ represents the overall mean, f_1(x_1) represents the main effect of x_1, β × (x_2 − 0.5) represents the linear main effect of x_2, f^s_2(x_2) represents the smooth main effect of x_2, γ_{x_1} × (x_2 − 0.5) represents the smooth–linear interaction, and f^{ss}_{12}(x_1, x_2) represents the smooth–smooth interaction. The overall main effect of x_2 is

f_2(x_2) = β × (x_2 − 0.5) + f^s_2(x_2),

and the overall interaction between x_1 and x_2 is

f_{12}(x_1, x_2) = γ_{x_1} × (x_2 − 0.5) + f^{ss}_{12}(x_1, x_2).

It is obvious that f_2 and f_{12} result from combining the averaging operators A^{(2)}_2 and A^{(2)}_3. One may look at the components in the overall

Page 128: Smoothing Splines - 221.114.158.246221.114.158.246/~bunken/statistics/others_smoothingspline.pdf · Applications covers basic smoothing spline models, including polynomial, periodic,

Smoothing Spline ANOVA 103

main effects and interactions to decide whether to include them in the model. The first three terms in (4.18) represent the mean curve among all levels of x_1, and the last three terms represent the departure from the mean curve. The simple ANCOVA (analysis of covariance) model, with x_2 modeled by a straight line, is a special case of (4.18) with f^s_2 = f^{ss}_{12} = 0. Thus, checking whether f^s_2 and f^{ss}_{12} are negligible provides a diagnostic tool for the ANCOVA model. Write ℝ^a = H^{(1)}_0 ⊕ H^{(1)}_1 and W^2_2[0, 1] = H^{(2)}_0 ⊕ H^{(2)}_1 ⊕ H^{(2)}_2, where H^{(1)}_0 and H^{(1)}_1 are given in (4.5), H^{(2)}_0 = {1}, H^{(2)}_1 = {x_2 − 0.5}, and H^{(2)}_2 = {f ∈ W^2_2[0, 1] : ∫_0^1 f du = ∫_0^1 f′ du = 0}. Then, in terms of the model space, (4.18) decomposes

ℝ^a ⊗ W^2_2[0, 1] = {H^{(1)}_0 ⊕ H^{(1)}_1} ⊗ {H^{(2)}_0 ⊕ H^{(2)}_1 ⊕ H^{(2)}_2}
  = {H^{(1)}_0 ⊗ H^{(2)}_0} ⊕ {H^{(1)}_0 ⊗ H^{(2)}_1} ⊕ {H^{(1)}_0 ⊗ H^{(2)}_2}
    ⊕ {H^{(1)}_1 ⊗ H^{(2)}_0} ⊕ {H^{(1)}_1 ⊗ H^{(2)}_1} ⊕ {H^{(1)}_1 ⊗ H^{(2)}_2}
  ≜ H_0 ⊕ H_1 ⊕ H_2 ⊕ H_3 ⊕ H_4,    (4.19)

where H_0 = {H^{(1)}_0 ⊗ H^{(2)}_0} ⊕ {H^{(1)}_0 ⊗ H^{(2)}_1}, H_1 = H^{(1)}_0 ⊗ H^{(2)}_2, H_2 = H^{(1)}_1 ⊗ H^{(2)}_0, H_3 = H^{(1)}_1 ⊗ H^{(2)}_1, and H_4 = H^{(1)}_1 ⊗ H^{(2)}_2. It is easy to see that H_0 is a two-dimensional space with basis functions φ_1(x) = 1 and φ_2(x) = x_2 − 0.5. The RKs of H_1, H_2, H_3, and H_4 can be calculated from the RKs of H^{(1)}_0 and H^{(1)}_1 given in (4.6) and the RKs of H^{(2)}_0, H^{(2)}_1, and H^{(2)}_2 given in Table 2.2.

4.4.3 Decomposition of W^{m_1}_2[0, 1] ⊗ W^{m_2}_2[0, 1]

Suppose both x_1 and x_2 are continuous variables in [0, 1]. W^m_2[0, 1] is a natural model space for the effects of both x_1 and x_2. Therefore, we consider the tensor product space W^{m_1}_2[0, 1] ⊗ W^{m_2}_2[0, 1]. For simplicity, we derive SS ANOVA decompositions for the combinations m_1 = m_2 = 1 and m_1 = m_2 = 2 only. SS ANOVA decompositions for other combinations of m_1 and m_2 can be derived similarly.

Consider the tensor product space W^1_2[0, 1] ⊗ W^1_2[0, 1] first. Define two averaging operators as

A^{(k)}_1 f = ∫_0^1 f dx_k,  k = 1, 2,

where A^{(k)}_1 extracts the constant term out of all possible functions of x_k. Let A^{(k)}_2 = I − A^{(k)}_1 for k = 1, 2. Then

f = {A^{(1)}_1 + A^{(1)}_2}{A^{(2)}_1 + A^{(2)}_2} f
  = A^{(1)}_1 A^{(2)}_1 f + A^{(1)}_2 A^{(2)}_1 f + A^{(1)}_1 A^{(2)}_2 f + A^{(1)}_2 A^{(2)}_2 f
  ≜ µ + f_1(x_1) + f_2(x_2) + f_{12}(x_1, x_2).    (4.20)

Obviously, (4.20) is a natural extension of the classical two-way ANOVA decomposition (4.14) from the product of two discrete domains to the product of two continuous domains. The components µ, f_1(x_1), f_2(x_2), and f_{12}(x_1, x_2) represent the overall mean, the main effect of x_1, the main effect of x_2, and the interaction between x_1 and x_2, respectively. In terms of the model space, (4.20) decomposes

W^1_2[0, 1] ⊗ W^1_2[0, 1] = {H^{(1)}_0 ⊕ H^{(1)}_1} ⊗ {H^{(2)}_0 ⊕ H^{(2)}_1}
  = {H^{(1)}_0 ⊗ H^{(2)}_0} ⊕ {H^{(1)}_1 ⊗ H^{(2)}_0} ⊕ {H^{(1)}_0 ⊗ H^{(2)}_1} ⊕ {H^{(1)}_1 ⊗ H^{(2)}_1}
  ≜ H_0 ⊕ H_1 ⊕ H_2 ⊕ H_3,

where H^{(k)}_0 = {1} and H^{(k)}_1 = {f ∈ W^1_2[0, 1] : ∫_0^1 f dx_k = 0} for k = 1, 2, H_0 = H^{(1)}_0 ⊗ H^{(2)}_0, H_1 = H^{(1)}_1 ⊗ H^{(2)}_0, H_2 = H^{(1)}_0 ⊗ H^{(2)}_1, and H_3 = H^{(1)}_1 ⊗ H^{(2)}_1. H_0 is a one-dimensional space with basis φ(x) = 1. The RKs of H_1, H_2, and H_3 can be calculated from the RKs of H^{(k)}_0 and H^{(k)}_1 given in Table 2.2.

Now suppose we want to model both x_1 and x_2 using cubic splines. That is, we consider the tensor product space W^2_2[0, 1] ⊗ W^2_2[0, 1]. Define four averaging operators

A^{(k)}_1 f = ∫_0^1 f dx_k,
A^{(k)}_2 f = (∫_0^1 f′ dx_k)(x_k − 0.5),  k = 1, 2,

where A^{(1)}_1 and A^{(2)}_1 extract the constant function out of all possible functions for each variable, and A^{(1)}_2 and A^{(2)}_2 extract the linear function for each variable. Let A^{(k)}_3 = I − A^{(k)}_1 − A^{(k)}_2 for k = 1, 2. Then

f = {A^{(1)}_1 + A^{(1)}_2 + A^{(1)}_3}{A^{(2)}_1 + A^{(2)}_2 + A^{(2)}_3} f
  = A^{(1)}_1 A^{(2)}_1 f + A^{(1)}_1 A^{(2)}_2 f + A^{(1)}_1 A^{(2)}_3 f
    + A^{(1)}_2 A^{(2)}_1 f + A^{(1)}_2 A^{(2)}_2 f + A^{(1)}_2 A^{(2)}_3 f
    + A^{(1)}_3 A^{(2)}_1 f + A^{(1)}_3 A^{(2)}_2 f + A^{(1)}_3 A^{(2)}_3 f
  ≜ µ + β_2 × (x_2 − 0.5) + f^s_2(x_2)
    + β_1 × (x_1 − 0.5) + β_3 × (x_1 − 0.5) × (x_2 − 0.5) + f^{ls}_{12}(x_1, x_2)
    + f^s_1(x_1) + f^{sl}_{12}(x_1, x_2) + f^{ss}_{12}(x_1, x_2),    (4.21)

where µ represents the overall mean; β_1 × (x_1 − 0.5) and β_2 × (x_2 − 0.5) represent the linear main effects of x_1 and x_2; f^s_1(x_1) and f^s_2(x_2) represent the smooth main effects of x_1 and x_2; and β_3 × (x_1 − 0.5) × (x_2 − 0.5), f^{ls}_{12}(x_1, x_2), f^{sl}_{12}(x_1, x_2), and f^{ss}_{12}(x_1, x_2) represent the linear–linear, linear–smooth, smooth–linear, and smooth–smooth interactions between x_1 and x_2. The overall main effect of x_k is

f_k(x_k) = β_k × (x_k − 0.5) + f^s_k(x_k),  k = 1, 2,

and the overall interaction between x_1 and x_2 is

f_{12}(x_1, x_2) = β_3 × (x_1 − 0.5) × (x_2 − 0.5) + f^{ls}_{12}(x_1, x_2) + f^{sl}_{12}(x_1, x_2) + f^{ss}_{12}(x_1, x_2).

The simple regression model with both x_1 and x_2 modeled by straight lines is a special case of (4.21) with f^s_1 = f^s_2 = f^{ls}_{12} = f^{sl}_{12} = f^{ss}_{12} = 0. In terms of the model space, (4.21) decomposes

W^2_2[0, 1] ⊗ W^2_2[0, 1] = {H^{(1)}_0 ⊕ H^{(1)}_1 ⊕ H^{(1)}_2} ⊗ {H^{(2)}_0 ⊕ H^{(2)}_1 ⊕ H^{(2)}_2}
  = {H^{(1)}_0 ⊗ H^{(2)}_0} ⊕ {H^{(1)}_1 ⊗ H^{(2)}_0} ⊕ {H^{(1)}_2 ⊗ H^{(2)}_0}
    ⊕ {H^{(1)}_0 ⊗ H^{(2)}_1} ⊕ {H^{(1)}_1 ⊗ H^{(2)}_1} ⊕ {H^{(1)}_2 ⊗ H^{(2)}_1}
    ⊕ {H^{(1)}_0 ⊗ H^{(2)}_2} ⊕ {H^{(1)}_1 ⊗ H^{(2)}_2} ⊕ {H^{(1)}_2 ⊗ H^{(2)}_2}
  ≜ H_0 ⊕ H_1 ⊕ H_2 ⊕ H_3 ⊕ H_4 ⊕ H_5,

where H^{(k)}_0 = {1}, H^{(k)}_1 = {x_k − 0.5}, and H^{(k)}_2 = {f ∈ W^2_2[0, 1] : ∫_0^1 f dx_k = ∫_0^1 f′ dx_k = 0} for k = 1, 2, H_0 = {H^{(1)}_0 ⊗ H^{(2)}_0} ⊕ {H^{(1)}_1 ⊗ H^{(2)}_0} ⊕ {H^{(1)}_0 ⊗ H^{(2)}_1} ⊕ {H^{(1)}_1 ⊗ H^{(2)}_1}, H_1 = H^{(1)}_2 ⊗ H^{(2)}_0, H_2 = H^{(1)}_2 ⊗ H^{(2)}_1, H_3 = H^{(1)}_0 ⊗ H^{(2)}_2, H_4 = H^{(1)}_1 ⊗ H^{(2)}_2, and H_5 = H^{(1)}_2 ⊗ H^{(2)}_2. H_0 is a four-dimensional space with basis functions φ_1(x) = 1, φ_2(x) = x_1 − 0.5, φ_3(x) = x_2 − 0.5, and φ_4(x) = (x_1 − 0.5) × (x_2 − 0.5). The RKs of H_1, H_2, H_3, H_4, and H_5 can be calculated from the RKs of H^{(k)}_0, H^{(k)}_1, and H^{(k)}_2 given in Table 2.2.

4.4.4 Decomposition of ℝ^a ⊗ W^m_2(per)

Suppose x_1 is a discrete variable with a levels, and x_2 is a continuous variable in [0, 1]. In addition, suppose that f is a periodic function of x_2. A natural model space for x_1 is ℝ^a, and a natural model space for x_2 is W^m_2(per). Therefore, we consider the tensor product space ℝ^a ⊗ W^m_2(per).

Define two averaging operators A^{(1)}_1 and A^{(2)}_1 as

A^{(1)}_1 f = (1/a) ∑_{x_1=1}^a f,
A^{(2)}_1 f = ∫_0^1 f dx_2.

Let A^{(1)}_2 = I − A^{(1)}_1 and A^{(2)}_2 = I − A^{(2)}_1. Then

f = {A^{(1)}_1 + A^{(1)}_2}{A^{(2)}_1 + A^{(2)}_2} f
  = A^{(1)}_1 A^{(2)}_1 f + A^{(1)}_2 A^{(2)}_1 f + A^{(1)}_1 A^{(2)}_2 f + A^{(1)}_2 A^{(2)}_2 f
  ≜ µ + f_1(x_1) + f_2(x_2) + f_{12}(x_1, x_2),    (4.22)

where µ represents the overall mean, f_1(x_1) represents the main effect of x_1, f_2(x_2) represents the main effect of x_2, and f_{12}(x_1, x_2) represents the interaction between x_1 and x_2. Write ℝ^a = H^{(1)}_0 ⊕ H^{(1)}_1 and W^m_2(per) = H^{(2)}_0 ⊕ H^{(2)}_1, where H^{(1)}_0 and H^{(1)}_1 are given in (4.5), H^{(2)}_0 = {1}, and H^{(2)}_1 = {f ∈ W^m_2(per) : ∫_0^1 f du = 0}. Then, in terms of the model space, (4.22) decomposes

ℝ^a ⊗ W^m_2(per) = {H^{(1)}_0 ⊕ H^{(1)}_1} ⊗ {H^{(2)}_0 ⊕ H^{(2)}_1}
  = {H^{(1)}_0 ⊗ H^{(2)}_0} ⊕ {H^{(1)}_1 ⊗ H^{(2)}_0} ⊕ {H^{(1)}_0 ⊗ H^{(2)}_1} ⊕ {H^{(1)}_1 ⊗ H^{(2)}_1}
  ≜ H_0 ⊕ H_1 ⊕ H_2 ⊕ H_3,

where H_0 = H^{(1)}_0 ⊗ H^{(2)}_0, H_1 = H^{(1)}_1 ⊗ H^{(2)}_0, H_2 = H^{(1)}_0 ⊗ H^{(2)}_1, and H_3 = H^{(1)}_1 ⊗ H^{(2)}_1. H_0 is a one-dimensional space with basis function φ(x) = 1. The RKs of H_1, H_2, and H_3 can be calculated from the RKs of H^{(1)}_0 and H^{(1)}_1 given in (4.6) and the RKs of H^{(2)}_0 and H^{(2)}_1 given in (2.33).

4.4.5 Decomposition of W^{m_1}_2(per) ⊗ W^{m_2}_2[0, 1]

Suppose both x_1 and x_2 are continuous variables in [0, 1]. In addition, suppose f is a periodic function of x_1. A natural model space for x_1 is W^{m_1}_2(per), and a natural model space for x_2 is W^{m_2}_2[0, 1]. Therefore, we consider the tensor product space W^{m_1}_2(per) ⊗ W^{m_2}_2[0, 1]. For simplicity, we derive the SS ANOVA decomposition for m_2 = 2 only.

Define three averaging operators

A^{(1)}_1 f = ∫_0^1 f dx_1,
A^{(2)}_1 f = ∫_0^1 f dx_2,
A^{(2)}_2 f = (∫_0^1 f′ dx_2)(x_2 − 0.5).

Let A^{(1)}_2 = I − A^{(1)}_1 and A^{(2)}_3 = I − A^{(2)}_1 − A^{(2)}_2. Then

f = {A^{(1)}_1 + A^{(1)}_2}{A^{(2)}_1 + A^{(2)}_2 + A^{(2)}_3} f
  = A^{(1)}_1 A^{(2)}_1 f + A^{(1)}_1 A^{(2)}_2 f + A^{(1)}_1 A^{(2)}_3 f
    + A^{(1)}_2 A^{(2)}_1 f + A^{(1)}_2 A^{(2)}_2 f + A^{(1)}_2 A^{(2)}_3 f
  ≜ µ + β × (x_2 − 0.5) + f^s_2(x_2)
    + f_1(x_1) + f^{sl}_{12}(x_1, x_2) + f^{ss}_{12}(x_1, x_2),    (4.23)

where µ represents the overall mean, f_1(x_1) represents the main effect of x_1, β × (x_2 − 0.5) and f^s_2(x_2) represent the linear and smooth main effects of x_2, and f^{sl}_{12}(x_1, x_2) and f^{ss}_{12}(x_1, x_2) represent the smooth–linear and smooth–smooth interactions. The overall main effect of x_2 is

f_2(x_2) = β × (x_2 − 0.5) + f^s_2(x_2),

and the overall interaction between x_1 and x_2 is

f_{12}(x_1, x_2) = f^{sl}_{12}(x_1, x_2) + f^{ss}_{12}(x_1, x_2).

Write W^{m_1}_2(per) = H^{(1)}_0 ⊕ H^{(1)}_1 and W^2_2[0, 1] = H^{(2)}_0 ⊕ H^{(2)}_1 ⊕ H^{(2)}_2, where H^{(1)}_0 = {1}, H^{(1)}_1 = {f ∈ W^{m_1}_2(per) : ∫_0^1 f du = 0}, H^{(2)}_0 = {1}, H^{(2)}_1 = {x_2 − 0.5}, and H^{(2)}_2 = {f ∈ W^2_2[0, 1] : ∫_0^1 f du = ∫_0^1 f′ du = 0}. Then, in terms of the model space, (4.23) decomposes

W^{m_1}_2(per) ⊗ W^2_2[0, 1] = {H^{(1)}_0 ⊕ H^{(1)}_1} ⊗ {H^{(2)}_0 ⊕ H^{(2)}_1 ⊕ H^{(2)}_2}
  = {H^{(1)}_0 ⊗ H^{(2)}_0} ⊕ {H^{(1)}_0 ⊗ H^{(2)}_1} ⊕ {H^{(1)}_0 ⊗ H^{(2)}_2}
    ⊕ {H^{(1)}_1 ⊗ H^{(2)}_0} ⊕ {H^{(1)}_1 ⊗ H^{(2)}_1} ⊕ {H^{(1)}_1 ⊗ H^{(2)}_2}
  ≜ H_0 ⊕ H_1 ⊕ H_2 ⊕ H_3 ⊕ H_4,

where H_0 = {H^{(1)}_0 ⊗ H^{(2)}_0} ⊕ {H^{(1)}_0 ⊗ H^{(2)}_1}, H_1 = H^{(1)}_0 ⊗ H^{(2)}_2, H_2 = H^{(1)}_1 ⊗ H^{(2)}_0, H_3 = H^{(1)}_1 ⊗ H^{(2)}_1, and H_4 = H^{(1)}_1 ⊗ H^{(2)}_2. H_0 is a two-dimensional space with basis functions φ_1(x) = 1 and φ_2(x) = x_2 − 0.5. The RKs of H_1, H_2, H_3, and H_4 can be calculated from the RKs of H^{(1)}_0 and H^{(1)}_1 given in (2.33) and the RKs of H^{(2)}_0, H^{(2)}_1, and H^{(2)}_2 given in Table 2.2.

4.4.6 Decomposition of W^2_2(ℝ²) ⊗ W^m_2(per)

Suppose x_1 = (x_{11}, x_{12}) is a bivariate continuous variable in ℝ², and x_2 is a continuous variable in [0, 1]. In addition, suppose that f is a periodic function of x_2. We consider the tensor product space W^2_2(ℝ²) ⊗ W^m_2(per) for the joint regression function f(x_1, x_2).

Let φ_1(x_1) = 1, φ_2(x_1) = x_{11}, and φ_3(x_1) = x_{12} be polynomials of total degree less than 2. Define three averaging operators

A^{(1)}_1 f = ∑_{j=1}^J w_j f(u_j),
A^{(1)}_2 f = ∑_{j=1}^J w_j f(u_j){φ_2(u_j)φ_2 + φ_3(u_j)φ_3},
A^{(2)}_1 f = ∫_0^1 f dx_2,

where u_j are fixed points in ℝ², w_j are fixed positive weights such that ∑_{j=1}^J w_j = 1, φ_1 = 1, and φ_2 and φ_3 are orthonormal bases based on the norm (2.41). Let A^{(1)}_3 = I − A^{(1)}_1 − A^{(1)}_2 and A^{(2)}_2 = I − A^{(2)}_1. Then

f = {A^{(1)}_1 + A^{(1)}_2 + A^{(1)}_3}{A^{(2)}_1 + A^{(2)}_2} f
  = A^{(1)}_1 A^{(2)}_1 f + A^{(1)}_2 A^{(2)}_1 f + A^{(1)}_3 A^{(2)}_1 f
    + A^{(1)}_1 A^{(2)}_2 f + A^{(1)}_2 A^{(2)}_2 f + A^{(1)}_3 A^{(2)}_2 f
  ≜ µ + β_1 φ_2(x_1) + β_2 φ_3(x_1)
    + f^s_1(x_1) + f_2(x_2) + f^{ls}_{12}(x_1, x_2) + f^{ss}_{12}(x_1, x_2),    (4.24)

where µ is the overall mean, β_1 φ_2(x_1) + β_2 φ_3(x_1) is the linear main effect of x_1, f^s_1(x_1) is the smooth main effect of x_1, f_2(x_2) is the main effect of x_2, f^{ls}_{12}(x_1, x_2) is the linear–smooth interaction, and f^{ss}_{12}(x_1, x_2) is the smooth–smooth interaction. The overall main effect of x_1 is

f_1(x_1) = β_1 φ_2(x_1) + β_2 φ_3(x_1) + f^s_1(x_1),

and the overall interaction is

f_{12}(x_1, x_2) = f^{ls}_{12}(x_1, x_2) + f^{ss}_{12}(x_1, x_2).

Write W^2_2(ℝ²) = H^{(1)}_0 ⊕ H^{(1)}_1 ⊕ H^{(1)}_2 and W^m_2(per) = H^{(2)}_0 ⊕ H^{(2)}_1, where H^{(1)}_0 = {1}, H^{(1)}_1 = {φ_2, φ_3}, H^{(1)}_2 = {f ∈ W^2_2(ℝ²) : J^2_2(f) = 0}, H^{(2)}_0 = {1}, and H^{(2)}_1 = {f ∈ W^m_2(per) : ∫_0^1 f du = 0}. Then, in terms of the model space, (4.24) decomposes

W^2_2(ℝ²) ⊗ W^m_2(per) = {H^{(1)}_0 ⊕ H^{(1)}_1 ⊕ H^{(1)}_2} ⊗ {H^{(2)}_0 ⊕ H^{(2)}_1}
  = {H^{(1)}_0 ⊗ H^{(2)}_0} ⊕ {H^{(1)}_1 ⊗ H^{(2)}_0} ⊕ {H^{(1)}_2 ⊗ H^{(2)}_0}
    ⊕ {H^{(1)}_0 ⊗ H^{(2)}_1} ⊕ {H^{(1)}_1 ⊗ H^{(2)}_1} ⊕ {H^{(1)}_2 ⊗ H^{(2)}_1}
  ≜ H_0 ⊕ H_1 ⊕ H_2 ⊕ H_3 ⊕ H_4,    (4.25)

where H_0 = {H^{(1)}_0 ⊗ H^{(2)}_0} ⊕ {H^{(1)}_1 ⊗ H^{(2)}_0}, H_1 = H^{(1)}_2 ⊗ H^{(2)}_0, H_2 = H^{(1)}_0 ⊗ H^{(2)}_1, H_3 = H^{(1)}_1 ⊗ H^{(2)}_1, and H_4 = H^{(1)}_2 ⊗ H^{(2)}_1. The basis functions of H_0 are 1, φ_2, and φ_3. The RKs of H^{(1)}_0 and H^{(1)}_1 are 1 and φ_2(x_1)φ_2(z_1) + φ_3(x_1)φ_3(z_1), respectively. The RK of H^{(1)}_2 is given in (2.42). The RKs of H^{(2)}_0 and H^{(2)}_1 are given in (2.33). The RKs of H_1, H_2, H_3, and H_4 can be calculated as products of the RKs of the involved marginal spaces.


4.5 General SS ANOVA Decomposition

Consider the general case with d independent variables x_1 ∈ X_1, x_2 ∈ X_2, …, x_d ∈ X_d, and the tensor product space H^{(1)} ⊗ H^{(2)} ⊗ · · · ⊗ H^{(d)} on X = X_1 × X_2 × · · · × X_d. For f as a function of x_k, assume the following one-way decomposition as in (4.3),

f = A^{(k)}_1 f + · · · + A^{(k)}_{r_k} f,  1 ≤ k ≤ d,    (4.26)

where A^{(k)}_1 + · · · + A^{(k)}_{r_k} = I. Then, for the joint function,

f = {A^{(1)}_1 + · · · + A^{(1)}_{r_1}} · · · {A^{(d)}_1 + · · · + A^{(d)}_{r_d}} f
  = ∑_{j_1=1}^{r_1} · · · ∑_{j_d=1}^{r_d} A^{(1)}_{j_1} · · · A^{(d)}_{j_d} f.    (4.27)

The above decomposition of the function f is referred to as the SS ANOVA decomposition.

Denote

H^{(k)} = H^{(k)}_{(1)} ⊕ · · · ⊕ H^{(k)}_{(r_k)},  k = 1, 2, …, d,

as the one-way decomposition of H^{(k)} associated with (4.26). Then, (4.27) decomposes the tensor product space

H^{(1)} ⊗ H^{(2)} ⊗ · · · ⊗ H^{(d)} = {H^{(1)}_{(1)} ⊕ · · · ⊕ H^{(1)}_{(r_1)}} ⊗ · · · ⊗ {H^{(d)}_{(1)} ⊕ · · · ⊕ H^{(d)}_{(r_d)}}
  = ⊕_{j_1=1}^{r_1} · · · ⊕_{j_d=1}^{r_d} H^{(1)}_{(j_1)} ⊗ · · · ⊗ H^{(d)}_{(j_d)}.    (4.28)

The RK of H^{(1)}_{(j_1)} ⊗ · · · ⊗ H^{(d)}_{(j_d)} equals ∏_{k=1}^d R^{(k)}_{(j_k)}, where R^{(k)}_{(j_k)} is the RK of H^{(k)}_{(j_k)} for k = 1, …, d.

Consider the special case when r_k = 2 for all k = 1, …, d. Assume that A^{(k)}_1 f is independent of x_k, or equivalently, H^{(k)}_{(1)} = {1}. Then the decomposition (4.27) can be written as

f = ∑_{B ⊆ {1,…,d}} {∏_{k∈B} A^{(k)}_2 ∏_{k∈Bᶜ} A^{(k)}_1} f
  = µ + ∑_{k=1}^d f_k(x_k) + ∑_{k<l} f_{kl}(x_k, x_l) + · · · + f_{1…d}(x_1, …, x_d),    (4.29)

where µ represents the grand mean, f_k(x_k) represents the main effect of x_k, f_{kl}(x_k, x_l) represents the two-way interaction between x_k and x_l, and the remaining terms represent higher-order interactions.

For general r_k, assuming A^{(k)}_1 f is independent of x_k, the decomposition (4.29) can be derived by combining the operators A^{(k)}_2, …, A^{(k)}_{r_k} into one averaging operator Ã^{(k)}_2 = A^{(k)}_2 + · · · + A^{(k)}_{r_k}. Therefore, decomposition (4.29) combines components in (4.27) and reorganizes them into overall main effects and interactions.

When all x_k are discrete variables, the SS ANOVA decomposition (4.27) leads to the classical d-way ANOVA model. Therefore, SS ANOVA decompositions are natural extensions of classical ANOVA decompositions from discrete domains to general domains and from finite-dimensional spaces to infinite-dimensional spaces. They decompose tensor product RKHS's into meaningful subspaces. As with the classical ANOVA decompositions, SS ANOVA decompositions lead to hierarchical structures that are useful for model selection and interpretation.

Different SS ANOVA decompositions can be derived based on different averaging operators (or, equivalently, different decompositions of the marginal spaces). Therefore, the SS ANOVA decomposition should be regarded as a general prescription for building multivariate nonparametric models rather than as a collection of fixed models. It can also be used to construct submodels for components in more complicated models.
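The index structure of (4.29), one component per subset B of {1, …, d}, can be enumerated directly. A small Python fragment (our own illustration) lists the 2^d components:

```python
from itertools import combinations

def ss_anova_components(d):
    """Enumerate the components of (4.29): one term per subset B of
    {1, ..., d}. The empty set gives the grand mean, singletons the
    main effects, pairs the two-way interactions, and so on."""
    comps = []
    for size in range(d + 1):
        comps.extend(combinations(range(1, d + 1), size))
    return comps

comps = ss_anova_components(3)
# 2^3 = 8 components: (), (1,), (2,), (3,), (1, 2), (1, 3), (2, 3), (1, 2, 3)
```

The exponential growth of this list with d is exactly the curse of dimensionality discussed in the next section.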

4.6 SS ANOVA Models and Estimation

The curse of dimensionality is a major problem in dealing with multi-variate functions. From the SS ANOVA decomposition in (4.27), it isclear that the number of components in the decomposition increases ex-ponentially as the dimension d increases. To overcome this problem, asin classical ANOVA, high-order interactions are often deleted from themodel space.

A model containing any subset of components in the SS ANOVA de-composition (4.27) is referred to as an SS ANOVA model. The well-known additive model is a special case that contains main effects only(Hastie and Tibshirani 1990). Given an SS ANOVA model, we can re-group and write the model space as

M = H0 ⊕H1 ⊕ · · · ⊕ Hq, (4.30)

112 Smoothing Splines: Methods and Applications

where $\mathcal{H}_0$ is a finite-dimensional space collecting all functions that are not going to be penalized, and $\mathcal{H}_1, \ldots, \mathcal{H}_q$ are orthogonal RKHS's with RKs $R_j$ for $j = 1, \ldots, q$. The RK $R_j$ equals the product of the RKs of the subspaces involved in the tensor product space $\mathcal{H}_j$. The norms on the composite $\mathcal{H}_j$ are the tensor product norms induced by the norms on the component subspaces. Details about the induced norm can be found in Aronszajn (1950), and an illustrative example can be found in Chapter 10 of Wahba (1990). Note that

$$\|f\|^2 = \|P_0 f\|^2 + \sum_{j=1}^q \|P_j f\|^2,$$

where $P_j$ is the orthogonal projector in $\mathcal{M}$ onto $\mathcal{H}_j$.

For generality, suppose observations are generated by

$$y_i = \mathcal{L}_i f + \epsilon_i, \quad i = 1, \ldots, n, \qquad (4.31)$$

where $f$ is a multivariate function in the model space $\mathcal{M}$ defined in (4.30), the $\mathcal{L}_i$ are bounded linear functionals, and the $\epsilon_i$ are zero-mean independent random errors with common variance $\sigma^2$.

The PLS estimate of $f$ is the solution to

$$\min_{f \in \mathcal{M}} \left\{ \frac{1}{n} \sum_{i=1}^n (y_i - \mathcal{L}_i f)^2 + \sum_{j=1}^q \lambda_j \|P_j f\|^2 \right\}. \qquad (4.32)$$

Different smoothing parameters $\lambda_j$ allow different penalties for each component. For fixed smoothing parameters, the following rescaling allows us to derive and compute the solution to (4.32) using results in Chapter 2. Let $\mathcal{H}_1^* = \oplus_{j=1}^q \mathcal{H}_j$. Then, for any $f \in \mathcal{H}_1^*$,

$$f(x) = f_1(x) + \cdots + f_q(x), \quad f_j \in \mathcal{H}_j, \ j = 1, \ldots, q.$$

Write $\lambda_j \triangleq \lambda/\theta_j$. The set of parameters $\lambda$ and $\theta_j$ is overparameterized: the penalty is controlled by the ratio $\lambda/\theta_j$, that is, by $\lambda_j$. Define the inner product in $\mathcal{H}_1^*$ as

$$(f, g)_* = \sum_{j=1}^q \theta_j^{-1} (f_j, g_j). \qquad (4.33)$$

Then $\|f\|_*^2 = \sum_{j=1}^q \theta_j^{-1} \|f_j\|^2$. Let $R_1^* = \sum_{j=1}^q \theta_j R_j$. Since

$$(R_1^*(x, \cdot), f(\cdot))_* = \sum_{j=1}^q \theta_j^{-1} (\theta_j R_j(x, \cdot), f_j(\cdot)) = \sum_{j=1}^q f_j(x) = f(x),$$

$R_1^*$ is the RK of $\mathcal{H}_1^*$ with the inner product (4.33). Let $P_1^* = \sum_{j=1}^q P_j$ be the orthogonal projection in $\mathcal{M}$ onto $\mathcal{H}_1^*$. Then the minimization problem (4.32) reduces to

$$\min_{f \in \mathcal{M}} \left\{ \frac{1}{n} \sum_{i=1}^n (y_i - \mathcal{L}_i f)^2 + \lambda \|P_1^* f\|^2 \right\}. \qquad (4.34)$$

Page 138: Smoothing Splines - 221.114.158.246221.114.158.246/~bunken/statistics/others_smoothingspline.pdf · Applications covers basic smoothing spline models, including polynomial, periodic,

Smoothing Spline ANOVA 113

The PLS (4.34) has the same form as (2.11), with $\mathcal{H}_1$ and $P_1$ replaced by $\mathcal{H}_1^*$ and $P_1^*$, respectively. Therefore, results in Section 2.4 apply. Specifically, let $\boldsymbol\theta = (\theta_1, \ldots, \theta_q)$, let $\phi_1, \ldots, \phi_p$ be basis functions of $\mathcal{H}_0$, and define

$$T = \{\mathcal{L}_i \phi_\nu\}_{i=1}^{n}{}_{\nu=1}^{p}, \qquad \Sigma_k = \{\mathcal{L}_i \mathcal{L}_j R_k\}_{i,j=1}^{n}, \quad k = 1, \ldots, q, \qquad \Sigma_\theta = \theta_1 \Sigma_1 + \cdots + \theta_q \Sigma_q. \qquad (4.35)$$

Applying the Kimeldorf–Wahba representer theorem and noting that

$$\xi_i(x) = \mathcal{L}_{i(z)} R_1^*(x, z) = \sum_{j=1}^q \theta_j \mathcal{L}_{i(z)} R_j(x, z),$$

the solution can be represented as

$$f(x) = \sum_{\nu=1}^p d_\nu \phi_\nu(x) + \sum_{i=1}^n c_i \sum_{j=1}^q \theta_j \mathcal{L}_{i(z)} R_j(x, z), \qquad (4.36)$$

where $d = (d_1, \ldots, d_p)^T$ and $c = (c_1, \ldots, c_n)^T$ are solutions to

$$(\Sigma_\theta + n\lambda I)\, c + T d = y, \qquad T^T c = 0. \qquad (4.37)$$

The equations in (4.37) have the same form as those in (2.21), with $\Sigma$ replaced by $\Sigma_\theta$. Therefore, the coefficients $c$ and $d$ can be computed similarly.

Let $\hat{\boldsymbol f} = (\mathcal{L}_1 \hat f, \ldots, \mathcal{L}_n \hat f)^T$ be the vector of fitted values, let the QR decomposition of $T$ be

$$T = (Q_1\ Q_2) \begin{pmatrix} R \\ 0 \end{pmatrix},$$

and let $M = \Sigma_\theta + n\lambda I$. Then $\hat{\boldsymbol f} = H(\lambda, \boldsymbol\theta)\, y$, where

$$H(\lambda, \boldsymbol\theta) = I - n\lambda\, Q_2 (Q_2^T M Q_2)^{-1} Q_2^T \qquad (4.38)$$

is the hat matrix.

The ssr function in the assist package can be used to fit the SS ANOVA model (4.31). As in Chapter 2, the observations y and the matrix T can be specified using the argument formula. Instead of a single matrix Σ, we now have multiple matrices Σj for j = 1, ..., q; they are specified as elements of a list for the argument rk. Examples are given in Section 4.9.
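The computation behind a fit reduces to solving the system (4.37). A minimal base-R sketch under illustrative inputs (the matrix T, the list of Σj, θ, λ, and y below are made up; ssr performs an equivalent but numerically more careful computation):

```r
# Solve (Sigma_theta + n*lambda*I)c + Td = y subject to T'c = 0 via the QR
# decomposition of T, and return the fitted values f = y - n*lambda*c.
fit_ssanova_coef <- function(Tmat, Sigmas, theta, lambda, y) {
  n <- nrow(Tmat); p <- ncol(Tmat)
  Sigma_theta <- Reduce(`+`, Map(`*`, theta, Sigmas))   # theta1*Sigma1 + ...
  M <- Sigma_theta + n * lambda * diag(n)
  qrT <- qr(Tmat)
  Q2 <- qr.Q(qrT, complete = TRUE)[, (p + 1):n, drop = FALSE]
  cvec <- Q2 %*% solve(t(Q2) %*% M %*% Q2, t(Q2) %*% y)
  dvec <- qr.coef(qrT, y - M %*% cvec)   # solves T d = y - M c
  list(c = cvec, d = dvec, fitted = y - n * lambda * cvec)
}
```

Both equations of (4.37) then hold exactly, which is easy to verify numerically on a small example.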

Sometimes it may be desirable to use the same smoothing parameter for a subset of the penalties in the PLS (4.32). For illustration, suppose we want to solve the PLS (4.32) with $\lambda_{q-1} = \lambda_q$. This can be achieved by combining $\mathcal{H}_{q-1}$ and $\mathcal{H}_q$ into one space, say $\tilde{\mathcal{H}}_{q-1}$, and fitting the SS ANOVA model with model space $\mathcal{M} = \mathcal{H}_0 \oplus \mathcal{H}_1 \oplus \cdots \oplus \tilde{\mathcal{H}}_{q-1}$. The RK of $\tilde{\mathcal{H}}_{q-1}$ is $\tilde R_{q-1} = R_{q-1} + R_q$. The model can then be fitted by a call to the ssr function with the combined RK. The same approach can be applied to multiple subsets such that the penalties in each subset share the same smoothing parameter. When appropriate, this approach can greatly reduce the computation time when q is large. An example will be given in Section 4.9.1.

4.7 Selection of Smoothing Parameters

The set of parameters $\lambda$ and $\boldsymbol\theta$ is overparameterized. Therefore, even though the criteria in this section are presented as functions of $\lambda$ and $\boldsymbol\theta$, they should be understood as functions of $(\lambda_1, \ldots, \lambda_q)$, where $\lambda_j = \lambda/\theta_j$.

Define the mean squared error (MSE) as

$$\mathrm{MSE}(\lambda, \boldsymbol\theta) = E\left( \frac{1}{n} \|\hat{\boldsymbol f} - \boldsymbol f\|^2 \right), \qquad (4.39)$$

where $\boldsymbol f = (\mathcal{L}_1 f, \ldots, \mathcal{L}_n f)^T$. Following the same arguments as in Section 3.3, it is easy to check that the function

$$\mathrm{UBR}(\lambda, \boldsymbol\theta) \triangleq \frac{1}{n} \|(I - H(\lambda, \boldsymbol\theta))\, y\|^2 + \frac{2\sigma^2}{n} \operatorname{tr} H(\lambda, \boldsymbol\theta) \qquad (4.40)$$

is an unbiased estimate of $\mathrm{MSE}(\lambda, \boldsymbol\theta) + \sigma^2$. The function $\mathrm{UBR}(\lambda, \boldsymbol\theta)$ is referred to as the unbiased risk (UBR) criterion, and its minimizer is referred to as the UBR estimate of $(\lambda, \boldsymbol\theta)$. The UBR method requires an estimate of the error variance $\sigma^2$. Few methods are available for estimating $\sigma^2$ in a multivariate nonparametric model without estimating the function $f$ first. When the product domain $\mathcal{X}$ is equipped with a norm, the method in Tong and Wang (2005) may be used to estimate $\sigma^2$.
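Given the hat matrix, evaluating the UBR criterion is a one-liner; the sketch below uses illustrative inputs (H, y, and an externally supplied variance estimate sigma2):

```r
# UBR criterion (4.40): (1/n)||(I - H)y||^2 + (2*sigma2/n) tr(H).
ubr <- function(H, y, sigma2) {
  n <- length(y)
  sum((y - H %*% y)^2) / n + 2 * sigma2 * sum(diag(H)) / n
}
```

Minimizing this over (λ, θ) — for example over log λj with a general-purpose optimizer — gives the UBR estimate.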


A derivation parallel to that in Section 3.4 leads to the following extension of the GCV criterion:

$$\mathrm{GCV}(\lambda, \boldsymbol\theta) \triangleq \frac{\frac{1}{n} \sum_{i=1}^n (\mathcal{L}_i \hat f - y_i)^2}{\left\{ \frac{1}{n} \operatorname{tr}(I - H(\lambda, \boldsymbol\theta)) \right\}^2}. \qquad (4.41)$$

The GCV estimate of $(\lambda, \boldsymbol\theta)$ is the minimizer of $\mathrm{GCV}(\lambda, \boldsymbol\theta)$.

We now construct a Bayes model for the SS ANOVA model (4.31). Assume a prior for $f$ of the form

$$F(x) = \sum_{\nu=1}^p \zeta_\nu \phi_\nu(x) + \delta^{1/2} \sum_{j=1}^q \theta_j^{1/2} U_j(x), \qquad (4.42)$$

where $\zeta_\nu \overset{\mathrm{iid}}{\sim} N(0, \kappa)$; the $U_j(x)$ are independent, zero-mean Gaussian stochastic processes with covariance functions $R_j(x, z)$; the $\zeta_\nu$ and $U_j(x)$ are mutually independent; and $\kappa$ and $\delta$ are positive constants. Suppose observations are generated from

$$y_i = \mathcal{L}_i F + \epsilon_i, \quad i = 1, \ldots, n, \qquad (4.43)$$

where $\epsilon_i \overset{\mathrm{iid}}{\sim} N(0, \sigma^2)$.

Let $\mathcal{L}_0$ be a bounded linear functional on $\mathcal{M}$, and let $\lambda = \sigma^2 / n\delta$. The same arguments as in Section 3.6 hold when $M = \Sigma + n\lambda I$ is replaced by $M = \Sigma_\theta + n\lambda I$ in this chapter. Therefore,

$$\lim_{\kappa \to \infty} E(\mathcal{L}_0 F \mid y) = \mathcal{L}_0 \hat f.$$

That is, the PLS estimate $\hat f$ is a Bayes estimate. Furthermore, we have the following extension of the GML criterion:

$$\mathrm{GML}(\lambda, \boldsymbol\theta) \triangleq \frac{y^T (I - H(\lambda, \boldsymbol\theta))\, y}{\left[ \det{}^+ (I - H(\lambda, \boldsymbol\theta)) \right]^{\frac{1}{n-p}}}. \qquad (4.44)$$

All three forms of LME models for the SSR model in Section 3.5 can be extended to the SS ANOVA model. We present the extension of (3.35) only. Let $\Sigma_k = Z_k Z_k^T$, where $Z_k$ is an $n \times m_k$ matrix with $m_k = \mathrm{rank}(\Sigma_k)$. It is not difficult to see that the GML criterion is the REML criterion based on the following linear mixed-effects model:

$$y = T\zeta + \sum_{k=1}^q Z_k u_k + \epsilon, \qquad (4.45)$$

where $\zeta = (\zeta_1, \ldots, \zeta_p)^T$ are deterministic parameters, the $u_k$ are mutually independent random effects with $u_k \sim N(0, \sigma^2 \theta_k I_{m_k} / n\lambda)$, $\epsilon \sim N(0, \sigma^2 I)$, and the $u_k$ are independent of $\epsilon$. Details can be found in Chapter 9.


4.8 Confidence Intervals

Any function $f \in \mathcal{M}$ can be represented as

$$f = \sum_{\nu=1}^p f_{0\nu} + \sum_{j=1}^q f_{1j}, \qquad (4.46)$$

where $f_{0\nu} \in \mathrm{span}\{\phi_\nu\}$ for $\nu = 1, \ldots, p$, and $f_{1j} \in \mathcal{H}_j$ for $j = 1, \ldots, q$. Our goal is to construct Bayesian confidence intervals for

$$\mathcal{L}_0 f_\gamma = \sum_{\nu=1}^p \gamma_\nu \mathcal{L}_0 f_{0\nu} + \sum_{j=1}^q \gamma_{p+j} \mathcal{L}_0 f_{1j} \qquad (4.47)$$

for any bounded linear functional $\mathcal{L}_0$ and any $\boldsymbol\gamma = (\gamma_1, \ldots, \gamma_{p+q})^T$, where $\gamma_k = 1$ when the corresponding component in (4.46) is to be included and $\gamma_k = 0$ otherwise.

Let $F_{0\nu} = \zeta_\nu \phi_\nu$ for $\nu = 1, \ldots, p$, and $F_{1j} = \sqrt{\delta \theta_j}\, U_j$ for $j = 1, \ldots, q$. Let $\mathcal{L}_0$, $\mathcal{L}_{01}$, and $\mathcal{L}_{02}$ be bounded linear functionals.

Posterior means and covariances

For $\nu, \mu = 1, \ldots, p$ and $j, k = 1, \ldots, q$, the posterior means are

$$E(\mathcal{L}_0 F_{0\nu} \mid y) = (\mathcal{L}_0 \phi_\nu)\, e_\nu^T d, \qquad E(\mathcal{L}_0 F_{1j} \mid y) = \theta_j (\mathcal{L}_0 \boldsymbol\xi_j)^T c, \qquad (4.48)$$

and the posterior covariances are

$$
\begin{aligned}
\delta^{-1} \operatorname{Cov}(\mathcal{L}_{01} F_{0\nu}, \mathcal{L}_{02} F_{0\mu} \mid y) &= (\mathcal{L}_{01} \phi_\nu)(\mathcal{L}_{02} \phi_\mu)\, e_\nu^T A e_\mu, \\
\delta^{-1} \operatorname{Cov}(\mathcal{L}_{01} F_{0\nu}, \mathcal{L}_{02} F_{1j} \mid y) &= -\theta_j (\mathcal{L}_{01} \phi_\nu)\, e_\nu^T B (\mathcal{L}_{02} \boldsymbol\xi_j), \qquad (4.49) \\
\delta^{-1} \operatorname{Cov}(\mathcal{L}_{01} F_{1j}, \mathcal{L}_{02} F_{1k} \mid y) &= \delta_{j,k}\, \theta_j\, \mathcal{L}_{01} \mathcal{L}_{02} R_j - \theta_j \theta_k (\mathcal{L}_{01} \boldsymbol\xi_j)^T C (\mathcal{L}_{02} \boldsymbol\xi_k),
\end{aligned}
$$

where $e_\nu$ is a vector of dimension p with the νth element equal to one and all other elements zero, $\delta_{j,k}$ is the Kronecker delta, $c$ and $d$ are the solutions to (4.37), $\mathcal{L} \boldsymbol\xi_j = (\mathcal{L} \mathcal{L}_1 R_j, \ldots, \mathcal{L} \mathcal{L}_n R_j)^T$ for any well-defined $\mathcal{L}$, $M = \Sigma_\theta + n\lambda I$, $A = (T^T M^{-1} T)^{-1}$, $B = A T^T M^{-1}$, and $C = M^{-1}(I - TB)$.

As in Section 3.8.1, even though not explicitly expressed, a diffuse prior is assumed for $\zeta$ with $\kappa \to \infty$. The first two equations, in (4.48), state that the projections of $\hat f$ onto subspaces are the posterior means of the corresponding components in the Bayes model (4.42). The next three equations, in (4.49), can be used to compute posterior covariances of the spline estimates and their projections. Based on these posterior covariances, we construct Bayesian confidence intervals for the overall function $f$ and its components in (4.46). Specifically, the posterior mean and variance of $\mathcal{L}_0 f_\gamma$ in (4.47) can be calculated using the formulae in (4.48) and (4.49). Then a $100(1 - \alpha)\%$ Bayesian confidence interval for $\mathcal{L}_0 f_\gamma$ is

$$E\{\mathcal{L}_0 F_\gamma \mid y\} \pm z_{\alpha/2} \sqrt{\operatorname{Var}\{\mathcal{L}_0 F_\gamma \mid y\}},$$

where

$$F_\gamma(x) = \sum_{\nu=1}^p \gamma_\nu F_{0\nu}(x) + \sum_{j=1}^q \gamma_{p+j} F_{1j}(x).$$

Confidence intervals for a collection of points can be constructed similarly.
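In practice the posterior means and standard deviations come from predict (in the assist package they appear, by our reading of its interface, as the fit and pstd components of the returned object); turning them into intervals is then elementary. A sketch with placeholder vectors:

```r
# 100(1 - alpha)% Bayesian confidence intervals from pointwise posterior
# means and standard deviations (post_mean, post_sd are illustrative vectors).
bayes_ci <- function(post_mean, post_sd, alpha = 0.05) {
  z <- qnorm(1 - alpha / 2)
  cbind(lower = post_mean - z * post_sd,
        upper = post_mean + z * post_sd)
}
```

With alpha = 0.05 this reproduces the familiar mean ± 1.96 × sd intervals used throughout the examples below.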

The same approach as in Section 3.8.2 can be used to construct bootstrap confidence intervals. The extension is straightforward.

4.9 Examples

4.9.1 Tongue Shapes

Consider the ultrasound data. Let y be the response variable height; x1 be the index of environment, with x1 = 1, 2, 3 corresponding to 2words, cluster, and schwa, respectively; x2 be the variable length scaled into [0, 1]; and x3 be the variable time scaled into [0, 1].

We first investigate how tongue shapes for an articulation differ under different environments at a particular time, say, at time 60 ms. Observations are shown in Figure 4.2. Consider a bivariate regression function f(x1, x2), where x1 (environment) is a discrete variable with three levels and x2 (length) is a continuous variable in [0, 1]. Therefore, we model the joint function using the tensor product space $\mathbb{R}^3 \otimes W_2^m[0, 1]$. The SS ANOVA decompositions of $\mathbb{R}^a \otimes W_2^m[0, 1]$ are given in (4.16) for m = 1 and in (4.18) for m = 2.

The following statements fit the SS ANOVA model (4.16):

> data(ultrasound)

> ultrasound$y <- ultrasound$height

> ultrasound$x1 <- ultrasound$env

> ultrasound$x2 <- ident(ultrasound$length)

> ssr(y~1, data=ultrasound, subset=ultrasound$time==60,
      rk=list(shrink1(x1),
              linear(x2),
              rk.prod(shrink1(x1),linear(x2))))

[Figure 4.2 appears here: three panels (2words, cluster, schwa) plotting height (mm) against length (mm).]

FIGURE 4.2 Ultrasound data, plots of observations (circles), fits (thin lines), 95% Bayesian confidence intervals (shaded regions), and the mean curves among three environments (thicker lines). The tongue root is on the left, and the tongue tip is on the right in each plot.

where the ident function transforms a variable into [0, 1], the shrink1 function computes the RK of $\mathcal{H}_1^{(1)} = \{f : \sum_{i=1}^3 f(i) = 0\}$ using the formula in (4.6), and the rk.prod function computes the product of RKs.
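Because a pointwise product of RKs is again nonnegative definite (the Schur product theorem), multiplying RK matrices elementwise — which is what taking the product of RKs amounts to on the evaluated matrices — yields a valid RK for the tensor product space. A small numerical check with illustrative kernels (R1 and R2 are made up for the demonstration):

```r
# Elementwise (Schur) product of two RK matrices evaluated at common points.
x <- seq(0, 1, length.out = 6)
R1 <- exp(-abs(outer(x, x, "-")))    # an illustrative RK on [0, 1]
R2 <- 1 + outer(x - 0.5, x - 0.5)    # RK of a constant-plus-linear space
R12 <- R1 * R2                       # tensor product RK
min(eigen(R12, symmetric = TRUE, only.values = TRUE)$values)  # nonnegative
```

The smallest eigenvalue of the product matrix is nonnegative (up to rounding), confirming it is itself a legitimate RK matrix.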

The following statements fit the SS ANOVA model (4.18) and summarize the fit:

> ultra.el.c.fit <- ssr(y~x2, data=ultrasound,
      subset=ultrasound$time==60,
      rk=list(shrink1(x1),
              cubic(x2),
              rk.prod(shrink1(x1),kron(x2-.5)),
              rk.prod(shrink1(x1),cubic(x2))))
> summary(ultra.el.c.fit)

...

GCV estimate(s) of smoothing parameter(s) : 6.061108e-02

2.650966e-05 1.557379e-03 3.177218e-05

Equivalent Degrees of Freedom (DF): 13.57631

Estimate of sigma: 2.654545


where the function kron computes the RK, $k_1(x_2)k_1(z_2)$, of the space $\mathcal{H}_1^{(2)} = \{k_1(x_2)\}$.

Estimates of the smoothing parameters λ3 and λ4 are small, which indicates that both interactions $\gamma_{x_1} \times (x_2 - 0.5)$ and $f_{12}^{ss}(x_1, x_2)$ may not be negligible. We compute the posterior mean and standard deviation of the overall interaction $f_{12}(x_1, x_2) = \gamma_{x_1} \times (x_2 - 0.5) + f_{12}^{ss}(x_1, x_2)$ on grid points as follows:

> grid <- seq(0,1,len=40)

> predict(ultra.el.c.fit, terms=c(0,0,0,0,1,1),

newdata=expand.grid(x2=grid,x1=as.factor(1:3)))

The overall interaction and its 95% Bayesian confidence intervals are shown in Figure 4.3. It is clear that the interaction is nonzero.

[Figure 4.3 appears here: three panels (2words, cluster, schwa) plotting height (mm) against length (mm).]

FIGURE 4.3 Ultrasound data, plots of the overall interaction (solid lines) and 95% Bayesian confidence intervals (shaded regions). The dashed line in each plot represents the constant function zero.

The posterior mean and standard deviation of the function f(x1, x2) can be calculated as follows:

> predict(ultra.el.c.fit, terms=c(1,1,1,1,1,1),

newdata=expand.grid(x2=grid,x1=as.factor(1:3)))

The default for the option terms is a vector of all 1's. Therefore, this option can be dropped in the above statement. The fits and 95% Bayesian confidence intervals are shown in Figure 4.2.


Note that in model (4.18) the first three terms represent the mean curve among three environments and the last three terms represent the departure of a particular environment from the mean curve. For comparison, we compute the estimate of the mean curve among three environments:

> predict(ultra.el.c.fit, terms=c(1,1,0,1,0,0),

newdata=expand.grid(x2=grid,x1=as.factor(1)))

The estimate of the mean curve is also displayed in Figure 4.2. The difference between the tongue shape under a particular environment and the average tongue shape can be seen by comparing the two lines in each plot. To look at the effect of each environment more closely, we calculate the estimate of the departure from the mean curve for each environment:

> predict(ultra.el.c.fit, terms=c(0,0,1,0,1,1),

newdata=expand.grid(x2=grid,x1=as.factor(1:3)))

The estimates of the environment effects are shown in Figure 4.4. We can see that, compared to the average shape, the tongue shape for 2words is front-raising, and the tongue shape for cluster is back-raising. The tongue shape for schwa is close to the average shape.

[Figure 4.4 appears here: three panels (2words, cluster, schwa) plotting height (mm) against length (mm).]

FIGURE 4.4 Ultrasound data, plots of effects of environment and 95% Bayesian confidence intervals. The dashed line in each plot represents the constant function zero.

The model space of the SS ANOVA model (4.18) is $\mathcal{M} = \mathcal{H}_0 \oplus \mathcal{H}_1 \oplus \mathcal{H}_2 \oplus \mathcal{H}_3 \oplus \mathcal{H}_4$, where the $\mathcal{H}_j$ for $j = 0, \ldots, 4$ are defined in (4.19). In particular, the spaces $\mathcal{H}_3$ and $\mathcal{H}_4$ contain the smooth–linear and smooth–smooth interactions between x1 and x2. For illustration, suppose now that we want to fit model (4.18) with the same smoothing parameter for the penalties on functions in $\mathcal{H}_3$ and $\mathcal{H}_4$, that is, to set λ3 = λ4 in the PLS. As discussed in Section 4.6, this can be achieved by combining $\mathcal{H}_3$ and $\mathcal{H}_4$ into one space. The following statements fit the SS ANOVA model (4.18) with λ3 = λ4:

> ultra.el.c.fit1 <- ssr(y~x2, data=ultrasound,
      subset=ultrasound$time==60,
      rk=list(shrink1(x1),
              cubic(x2),
              rk.prod(shrink1(x1),kron(x2-.5))+
              rk.prod(shrink1(x1),cubic(x2))))
> summary(ultra.el.c.fit1)

...

GCV estimate(s) of smoothing parameter(s) : 5.648863e-02

2.739598e-05 3.766634e-05

Equivalent Degrees of Freedom (DF): 13.52603

Estimate of sigma: 2.65806

Next we investigate how the tongue shapes change over time for each environment. Figure 4.1 shows 3-d plots of the observations. For a fixed environment, consider a bivariate regression function f(x2, x3), where both x2 and x3 are continuous variables. Therefore, we model the joint function using the tensor product space $W_2^{m_1}[0, 1] \otimes W_2^{m_2}[0, 1]$. The SS ANOVA decompositions of $W_2^{m_1}[0, 1] \otimes W_2^{m_2}[0, 1]$ were presented in Section 4.4.3. Note that the variables x2 and x3 in this section correspond to x1 and x2 in Section 4.4.3.

The following statements fit the tensor product of linear splines with m1 = m2 = 1, that is, the SS ANOVA model (4.20), under environment 2words (x1 = 1):

> ultrasound$x3 <- ident(ultrasound$time)

> ssr(height~1, data=ultrasound, subset=ultrasound$env==1,

rk=list(linear(x2),

linear(x3),

rk.prod(linear(x2),linear(x3))))

The following statements fit the tensor product of cubic splines with m1 = m2 = 2, that is, the SS ANOVA model (4.21), under environment 2words, and calculate estimates at grid points:

> ultra.lt.c.fit[[1]] <- ssr(height~x2+x3+x2*x3,
      data=ultrasound, subset=ultrasound$env==1,
      rk=list(cubic(x2),
              cubic(x3),
              rk.prod(kron(x2),cubic(x3)),
              rk.prod(cubic(x2),kron(x3)),
              rk.prod(cubic(x2),cubic(x3))))
> grid <- seq(0,1,len=20)
> ultra.lt.c.pred <- predict(ultra.lt.c.fit[[1]],
      newdata=expand.grid(x2=grid,x3=grid))

[Figure 4.5 appears here: three 3-d surface panels (2words, cluster, schwa) of height (mm) over length (mm) and time (ms).]

FIGURE 4.5 Ultrasound data, 3-d plots of the estimated tongue shapes as functions of length and time based on the SS ANOVA model (4.21).

The SS ANOVA model (4.21) for environments cluster and schwa can be fitted similarly. The estimates for all three environments are shown in Figure 4.5. These surfaces show how the tongue shapes change over time. Note that $\mu + \beta_1 \times (x_2 - 0.5) + f_1^s(x_2)$ represents the mean tongue shape over the time period [0, 140], and the rest of (4.21), $\beta_2 \times (x_3 - 0.5) + f_2^s(x_3) + \beta_3 \times (x_2 - 0.5) \times (x_3 - 0.5) + f_{12}^{ls}(x_2, x_3) + f_{12}^{sl}(x_2, x_3) + f_{12}^{ss}(x_2, x_3)$, represents the departure from the mean shape at time x3. To look at the time effect, we compute posterior means and standard deviations of the departure on grid points:

> predict(ultra.lt.c.fit[[1]], terms=c(0,0,1,1,0,1,1,1,1),

newdata=expand.grid(x2=grid,x3=grid))


Figure 4.6 shows the contour plots of the estimated time effect for the three environments. Regions where the lower bounds of the 95% Bayesian confidence intervals are positive are shaded in dark grey, while regions where the upper bounds of the 95% Bayesian confidence intervals are negative are shaded in light grey.

[Figure 4.6 appears here: three contour panels (2words, cluster, schwa) of time (ms) versus length (mm).]

FIGURE 4.6 Ultrasound data, contour plots of the estimated time effect for three environments based on the SS ANOVA model (4.21). Regions where the lower bounds of the 95% Bayesian confidence intervals are positive are shaded in dark grey. Regions where the upper bounds of the 95% Bayesian confidence intervals are negative are shaded in light grey.

Finally, we investigate how the changes of tongue shapes over time differ among environments. Consider a trivariate regression function f(x1, x2, x3) in the tensor product space $\mathbb{R}^3 \otimes W_2^{m_1}[0, 1] \otimes W_2^{m_2}[0, 1]$. For simplicity, we derive the SS ANOVA decomposition for $m_1 = m_2 = 2$ only. Define averaging operators

$$A_1^{(1)} f = \frac{1}{3} \sum_{x_1=1}^{3} f, \qquad A_1^{(k)} f = \int_0^1 f\, dx_k, \qquad A_2^{(k)} f = \left( \int_0^1 f'\, dx_k \right) (x_k - 0.5), \quad k = 2, 3,$$

where $A_1^{(1)}$, $A_1^{(2)}$, and $A_1^{(3)}$ extract the constant function out of all possible functions for each variable, and $A_2^{(2)}$ and $A_2^{(3)}$ extract the linear functions for x2 and x3. Let $A_2^{(1)} = I - A_1^{(1)}$ and $A_3^{(k)} = I - A_1^{(k)} - A_2^{(k)}$ for $k = 2, 3$. Then

for k = 2, 3. Then

f ={A(1)

1 + A(1)2

}{A(2)

1 + A(2)2 + A(2)

3

}{A(3)

1 + A(3)2 + A(3)

3

}f

= A(1)1 A(2)

1 A(3)1 f + A(1)

1 A(2)1 A(3)

2 f + A(1)1 A(2)

1 A(3)3 f

+ A(1)1 A(2)

2 A(3)1 f + A(1)

1 A(2)2 A(3)

2 f + A(1)1 A(2)

2 A(3)3 f

+ A(1)1 A(2)

3 A(3)1 f + A(1)

1 A(2)3 A(3)

2 f + A(1)1 A(2)

3 A(3)3 f

+ A(1)2 A(2)

1 A(3)1 f + A(1)

2 A(2)1 A(3)

2 f + A(1)2 A(2)

1 A(3)3 f

+ A(1)2 A(2)

2 A(3)1 f + A(1)

2 A(2)2 A(3)

2 f + A(1)2 A(2)

2 A(3)3 f

+ A(1)2 A(2)

3 A(3)1 f + A(1)

2 A(2)3 A(3)

2 f + A(1)2 A(2)

3 A(3)3 f

, µ+ β2 × (x3 − 0.5) + fs3 (x3)

+ β1 × (x2 − 0.5) + β3 × (x2 − 0.5)(x3 − 0.5) + f ls23(x2, x3)

+ fs2 (x2) + fsl

23(x2, x3) + fss23 (x2, x3)

+ f1(x1) + fsl13(x1, x3) + fss

13 (x1, x3)

+ fsl12(x1, x2) + fsll

123(x1, x2, x3) + fsls123(x1, x2, x3)

+ fss12 (x1, x2) + fssl

123(x1, x2, x3) + fsss123(x1, x2, x3), (4.50)

where µ represents the overall mean; $f_1(x_1)$ represents the main effect of x1; $\beta_1 \times (x_2 - 0.5)$ and $\beta_2 \times (x_3 - 0.5)$ represent the linear main effects of x2 and x3; $f_2^s(x_2)$ and $f_3^s(x_3)$ represent the smooth main effects of x2 and x3; $f_{12}^{sl}(x_1, x_2)$ ($f_{13}^{sl}(x_1, x_3)$) represents the smooth–linear interaction between x1 and x2 (x3); $\beta_3 \times (x_2 - 0.5) \times (x_3 - 0.5)$, $f_{23}^{ls}(x_2, x_3)$, $f_{23}^{sl}(x_2, x_3)$, and $f_{23}^{ss}(x_2, x_3)$ represent the linear–linear, linear–smooth, smooth–linear, and smooth–smooth interactions between x2 and x3; and $f_{123}^{sll}(x_1, x_2, x_3)$, $f_{123}^{sls}(x_1, x_2, x_3)$, $f_{123}^{ssl}(x_1, x_2, x_3)$, and $f_{123}^{sss}(x_1, x_2, x_3)$ represent three-way interactions between x1, x2, and x3. The overall main effect of xk, $f_k(x_k)$, equals $\beta_{k-1} \times (x_k - 0.5) + f_k^s(x_k)$ for k = 2, 3. The overall interaction between x1 and xk, $f_{1k}(x_1, x_k)$, equals $f_{1k}^{sl}(x_1, x_k) + f_{1k}^{ss}(x_1, x_k)$ for k = 2, 3. The overall interaction between x2 and x3, $f_{23}(x_2, x_3)$, equals $\beta_3 \times (x_2 - 0.5) \times (x_3 - 0.5) + f_{23}^{ls}(x_2, x_3) + f_{23}^{sl}(x_2, x_3) + f_{23}^{ss}(x_2, x_3)$. The overall three-way interaction, $f_{123}(x_1, x_2, x_3)$, equals $f_{123}^{sll}(x_1, x_2, x_3) + f_{123}^{sls}(x_1, x_2, x_3) + f_{123}^{ssl}(x_1, x_2, x_3) + f_{123}^{sss}(x_1, x_2, x_3)$.

We fit model (4.50) as follows:

> ultra.elt.c.fit <- ssr(height~I(x2-.5)+I(x3-.5)+I((x2-.5)*(x3-.5)),
      data=ultrasound,
      rk=list(shrink1(x1),
              cubic(x2),
              cubic(x3),
              rk.prod(shrink1(x1),kron(x2-.5)),
              rk.prod(shrink1(x1),cubic(x2)),
              rk.prod(shrink1(x1),kron(x3-.5)),
              rk.prod(shrink1(x1),cubic(x3)),
              rk.prod(cubic(x2),kron(x3-.5)),
              rk.prod(kron(x2-.5),cubic(x3)),
              rk.prod(cubic(x2),cubic(x3)),
              rk.prod(shrink1(x1),kron(x2-.5),kron(x3-.5)),
              rk.prod(shrink1(x1),kron(x2-.5),cubic(x3)),
              rk.prod(shrink1(x1),cubic(x2),kron(x3-.5)),
              rk.prod(shrink1(x1),cubic(x2),cubic(x3))))

[Figure 4.7 appears here: three 3-d surface panels (2words, cluster, schwa) of height (mm) over length (mm) and time (ms).]

FIGURE 4.7 Ultrasound data, 3-d plots of the estimated tongue shape as a function of environment, length, and time based on the SS ANOVA model (4.50).

The estimates for all three environments are shown in Figure 4.7. Note that the first nine terms in (4.50) represent the mean tongue shape surface over time, and the last nine terms in (4.50) represent the departure of an environment from this mean surface. To look at the environment effect on the tongue shape surface over time, we calculate the posterior mean and standard deviation of the departure for each environment:

> pred <- predict(ultra.elt.c.fit,

newdata=expand.grid(x1=as.factor(1:3),x2=grid,x3=grid),

terms=c(0,0,0,0,1,0,0,1,1,1,1,0,0,0,1,1,1,1))


The contour plots of the estimated departures for the three environments are shown in Figure 4.8. Note that the significant regions at time 60 ms are similar to those in Figure 4.4.

[Figure 4.8 appears here: three contour panels (2words, cluster, schwa) of time (ms) versus length (mm).]

FIGURE 4.8 Ultrasound data, contour plots of estimated environment effect for three environments in the SS ANOVA model (4.50). Regions where the lower bounds of the 95% Bayesian confidence intervals are positive are shaded in dark grey. Regions where the upper bounds of the 95% Bayesian confidence intervals are negative are shaded in light grey.

4.9.2 Ozone in Arosa — Revisit

Suppose we want to investigate how ozone thickness changes over time by considering the effects of both month and year. In Section 2.10 we fitted an additive model (2.49) with the month effect modeled parametrically by a simple sinusoidal function and the year effect modeled nonparametrically by a cubic spline. The following analyses show how to model both effects nonparametrically and investigate their interaction using SS ANOVA decompositions. We also show how to check the partial spline model (2.49).

Let y be the response variable thick, x1 be the independent variable month scaled into [0, 1], and x2 be the independent variable year scaled into [0, 1]. It is reasonable to assume that the mean ozone thickness is a periodic function of x1. We model the effect of x1 using the periodic spline space $W_2^2(\mathrm{per})$ and the effect of x2 using the cubic spline space $W_2^2[0, 1]$.


Therefore, we consider the SS ANOVA decomposition (4.23) for the tensor product space $W_2^{m_1}(\mathrm{per}) \otimes W_2^2[0, 1]$. The following statements fit model (4.23) with $m_1 = 2$:

> Arosa$x1 <- (Arosa$month-0.5)/12

> Arosa$x2 <- (Arosa$year-1)/45

> arosa.ssanova.fit1 <- ssr(thick~I(x2-0.5), data=Arosa,

rk=list(periodic(x1),

cubic(x2),

rk.prod(periodic(x1),kron(x2-.5)),

rk.prod(periodic(x1),cubic(x2))))

> summary(arosa.ssanova.fit1)

...

GCV estimate(s) of smoothing parameter(s) : 5.442106e-06

2.154531e-09 3.387917e-06 2.961559e-02

Equivalent Degrees of Freedom (DF): 50.88469

Estimate of sigma: 14.7569

The mean function f(x) in model (4.23) evaluated at the design points, $\boldsymbol f = (f(x_1), \ldots, f(x_n))^T$, can be represented as

$$\boldsymbol f = \mu \boldsymbol 1 + \boldsymbol f_1 + \boldsymbol f_2 + \boldsymbol f_{12},$$

where $\boldsymbol 1$ is a vector of all ones, $\boldsymbol f_1 = (f_1(x_1), \ldots, f_1(x_n))^T$, $\boldsymbol f_2 = (f_2(x_1), \ldots, f_2(x_n))^T$, $\boldsymbol f_{12} = (f_{12}(x_1), \ldots, f_{12}(x_n))^T$, and $f_1(x)$, $f_2(x)$, and $f_{12}(x)$ are the main effect of x1, the main effect of x2, and the interaction between x1 and x2. Eliminating the constant term, we have

$$\boldsymbol f^* = \boldsymbol f_1^* + \boldsymbol f_2^* + \boldsymbol f_{12}^*,$$

where $\boldsymbol a^* = \boldsymbol a - \bar a \boldsymbol 1$ and $\bar a = \sum_{i=1}^n a_i / n$. Let $\hat{\boldsymbol f}^*$, $\hat{\boldsymbol f}_1^*$, $\hat{\boldsymbol f}_2^*$, and $\hat{\boldsymbol f}_{12}^*$ be the estimates of $\boldsymbol f^*$, $\boldsymbol f_1^*$, $\boldsymbol f_2^*$, and $\boldsymbol f_{12}^*$, respectively. To check the contributions of the main effects and the interaction, we compute the quantities $\pi_k = (\hat{\boldsymbol f}_k^*)^T \hat{\boldsymbol f}^* / \|\hat{\boldsymbol f}^*\|^2$ for $k = 1, 2, 12$, and the Euclidean norms of $\hat{\boldsymbol f}^*$, $\hat{\boldsymbol f}_1^*$, $\hat{\boldsymbol f}_2^*$, and $\hat{\boldsymbol f}_{12}^*$:

> f1 <- predict(arosa.ssanova.fit1, terms=c(0,0,1,0,0,0))

> f2 <- predict(arosa.ssanova.fit1, terms=c(0,1,0,1,0,0))

> f12 <- predict(arosa.ssanova.fit1, terms=c(0,0,0,0,1,1))

> fs1 <- scale(f1$fit, scale=F)

> fs2 <- scale(f2$fit, scale=F)

> fs12 <- scale(f12$fit, scale=F)

> ys <- fs1+fs2+fs12

> pi1 <- sum(fs1*ys)/sum(ys**2)

> pi2 <- sum(fs2*ys)/sum(ys**2)


> pi12 <- sum(fs12*ys)/sum(ys**2)

> print(round(c(pi1,pi2,pi12),4))

0.9375 0.0592 0.0033

> print(round(sqrt(c(sum(fs1**2),sum(fs2**2),

sum(fs12**2))),2))

768.31 186.21 30.25

See Gu (2002) for more details about the cosine diagnostics. It is clear that the contribution of the interaction to the total variation is negligible. We also compute the posterior mean and standard deviation of the interaction f12:

> grid <- seq(0,1,len=50)

> predict(arosa.ssanova.fit1, terms=c(0,0,0,0,1,1),

expand.grid(x1=grid,x2=grid))

The estimate of the interaction is shown in Figure 4.9(a). Except for a narrow region, the zero function is contained in the 95% Bayesian confidence intervals. Thus the interaction is negligible.

[Figure 4.9 appears here: panels (a) and (b) are contour plots of year versus month; panel (c) plots thickness against month.]

FIGURE 4.9 Arosa data, (a) plot of the estimate of the interaction in model (4.23), (b) plot of the estimate of the interaction in model (4.52), and (c) plot of the estimate of the smooth component $f_1^s$ in model (4.53). For plots (a) and (b), regions where the lower bounds of the confidence intervals are positive are shaded in dark grey, while regions where the upper bounds of the confidence intervals are negative are shaded in light grey. For plot (c), the solid line represents the estimate of $f_1^s$, the shaded region represents 95% Bayesian confidence intervals, and the dashed line represents the zero function.


We drop the interaction term and fit the additive model

$$f(x_1, x_2) = \mu + \beta \times (x_2 - 0.5) + f_2^s(x_2) + f_1(x_1) \qquad (4.51)$$

as follows:

> update(arosa.ssanova.fit1,

rk=list(periodic(x1),cubic(x2)))

The estimates of two main effects are shown in Figure 4.10.

[Figure 4.10 appears here: two panels, "Main effect of month" (thickness versus month) and "Main effect of year" (thickness versus year).]

FIGURE 4.10 Arosa data, plots of the estimates of the main effects in the SS ANOVA model (4.51). Circles in the left panel represent monthly average thickness minus the overall mean. Circles in the right panel represent yearly average thickness minus the overall mean. Shaded regions are 95% Bayesian confidence intervals.

Now suppose we want to check if the partial spline model (2.49) is appropriate for the Arosa data. For the month effect, consider the trigonometric spline model W^3_2(per) with L = D{D² + (2π)²} defined in Section 2.11.5. The null space span{1, sin(2πx), cos(2πx)} corresponds to the sinusoidal model space assumed for the partial spline model (2.49). Consider the tensor product space W^3_2(per) ⊗ W^2_2[0, 1]. Define averaging operators:

A^{(1)}_1 f = ∫_0^1 f dx_1,
A^{(1)}_2 f = ∫_0^1 f sin(2πx_1) dx_1,
A^{(1)}_3 f = ∫_0^1 f cos(2πx_1) dx_1,
A^{(2)}_1 f = ∫_0^1 f dx_2,
A^{(2)}_2 f = (∫_0^1 f′ dx_2)(x_2 − 0.5).

Let A^{(1)}_4 = I − A^{(1)}_1 − A^{(1)}_2 − A^{(1)}_3 and A^{(2)}_3 = I − A^{(2)}_1 − A^{(2)}_2. Then

f = {A^{(1)}_1 + A^{(1)}_2 + A^{(1)}_3 + A^{(1)}_4}{A^{(2)}_1 + A^{(2)}_2 + A^{(2)}_3} f
  = A^{(1)}_1 A^{(2)}_1 f + A^{(1)}_2 A^{(2)}_1 f + A^{(1)}_3 A^{(2)}_1 f + A^{(1)}_4 A^{(2)}_1 f
  + A^{(1)}_1 A^{(2)}_2 f + A^{(1)}_2 A^{(2)}_2 f + A^{(1)}_3 A^{(2)}_2 f + A^{(1)}_4 A^{(2)}_2 f
  + A^{(1)}_1 A^{(2)}_3 f + A^{(1)}_2 A^{(2)}_3 f + A^{(1)}_3 A^{(2)}_3 f + A^{(1)}_4 A^{(2)}_3 f
  ≜ µ + β_1 × sin(2πx_1) + β_2 × cos(2πx_1) + f^s_1(x_1)
  + β_3 × (x_2 − 0.5) + β_4 × sin(2πx_1) × (x_2 − 0.5)
  + β_5 × cos(2πx_1) × (x_2 − 0.5) + f^{sl}_{12}(x_1, x_2)
  + f^s_2(x_2) + f^{1s}_{12}(x_1, x_2) + f^{2s}_{12}(x_1, x_2) + f^{ss}_{12}(x_1, x_2),    (4.52)

where µ represents the overall mean; β_1 × sin(2πx_1) and β_2 × cos(2πx_1) represent the parametric main effects of x_1; f^s_1(x_1) represents the smooth main effect of x_1; β_3 × (x_2 − 0.5) represents the linear main effect of x_2; f^s_2(x_2) represents the smooth main effect of x_2; and β_4 × sin(2πx_1) × (x_2 − 0.5), β_5 × cos(2πx_1) × (x_2 − 0.5), f^{sl}_{12}(x_1, x_2), f^{1s}_{12}(x_1, x_2), f^{2s}_{12}(x_1, x_2), and f^{ss}_{12}(x_1, x_2) represent interactions. We fit model (4.52) and compute the posterior means and standard deviations for the overall interaction as follows:

> arosa.ssanova.fit3 <- ssr(thick~sin(2*pi*x1)+cos(2*pi*x1)
    +I(x2-0.5)+I(sin(2*pi*x1)*(x2-0.5))
    +I(cos(2*pi*x1)*(x2-0.5)), data=Arosa,
    rk=list(lspline(x1,type="sine1"), cubic(x2),
      rk.prod(kron(sin(2*pi*x1)),cubic(x2)),
      rk.prod(kron(cos(2*pi*x1)),cubic(x2)),
      rk.prod(lspline(x1,type="sine1"), kron(x2-.5)),
      rk.prod(lspline(x1,type="sine1"), cubic(x2))))
> ngrid <- 50
> grid1 <- seq(.5/12,11.5/12,length=ngrid)
> grid2 <- seq(0,1,length=ngrid)
> predict(arosa.ssanova.fit3,
    expand.grid(x1=grid1,x2=grid2),
    terms=c(0,0,0,0,1,1,0,0,1,1,1,1))

The estimate of the interaction is shown in Figure 4.9(b). Except for a narrow region, the zero function is contained in the 95% Bayesian confidence intervals. Therefore, we drop the interaction terms and consider the following additive model:

f(x_1, x_2) = µ + β_1 × sin(2πx_1) + β_2 × cos(2πx_1) + f^s_1(x_1)
            + β_3 × (x_2 − 0.5) + f^s_2(x_2).    (4.53)

Note that the partial spline model (2.49) is a special case of model (4.53) with f^s_1(x_1) = 0. We fit model (4.53) and compute posterior means and standard deviations for f^s_1(x_1):

> arosa.ssanova.fit4 <- ssr(thick~sin(2*pi*x1)
    +cos(2*pi*x1)+I(x2-0.5), data=Arosa,
    rk=list(lspline(x1,type="sine1"), cubic(x2)))

> predict(arosa.ssanova.fit4, expand.grid(x1=grid1,x2=0),

terms=c(0,0,0,0,1,0))

The estimate of f^s_1(x_1) is shown in Figure 4.9(c) with 95% Bayesian confidence intervals. It is clear that f^s_1(x_1) is nonzero, which indicates that the simple sinusoidal function is inadequate for modeling the month effect.

4.9.3 Canadian Weather — Revisit

Consider the Canadian weather data with annual temperature profiles from all 35 stations as functional data. To investigate how the weather patterns differ, Ramsay and Silverman (2005) divided Canada into four climatic regions: Atlantic, Continental, Pacific, and Arctic. Let y be the response variable temp, x1 be the independent variable region, and x2 be the independent variable month scaled into [0, 1]. The functional ANOVA (FANOVA) model (13.1) in Ramsay and Silverman (2005) assumes that

y_{k,x_1}(x_2) = η(x_2) + α_{x_1}(x_2) + ε_{k,x_1}(x_2),    (4.54)


where y_{k,x_1}(x_2) is the temperature profile of station k in climate region x_1, η(x_2) is the average temperature profile across all of Canada, α_{x_1}(x_2) is the departure of the region x_1 profile from the population average profile η(x_2), and ε_{k,x_1}(x_2) are random errors. It is clear that the FANOVA can be derived from the SS ANOVA decomposition (4.22) by letting η(x_2) = µ + f_2(x_2) and α_{x_1}(x_2) = f_1(x_1) + f_{12}(x_1, x_2). The side condition for the FANOVA model (4.54), ∑_{x_1=1}^{4} α_{x_1}(x_2) = 0 for all x_2, is satisfied by the construction of the SS ANOVA decomposition. Model (4.54) is an example of situation (ii) in Section 2.10 where the dependent variable involves functional data.

FIGURE 4.11 Canadian weather data, plots of temperature profiles of stations in four regions (thin lines), and the estimated profiles (thick lines).

Observed temperature profiles are shown in Figure 4.11. The following statements fit model (2.10) and compute posterior means and standard deviations for four regions:

> x1 <- rep(as.factor(region),rep(12,35))
> x2 <- (rep(1:12,35)-.5)/12
> y <- as.vector(monthlyTemp)
> canada.fit2 <- ssr(y~1,
    rk=list(shrink1(x1), periodic(x2),
      rk.prod(shrink1(x1),periodic(x2))))
> xgrid <- seq(.5/12,11.5/12,len=50)
> zone <- c("Atlantic","Pacific",
    "Continental","Arctic")
> grid <- data.frame(x1=rep(zone,rep(50,4)),
    x2=rep(xgrid,4))
> canada.fit2.p1 <- predict(canada.fit2, newdata=grid)

Estimates of the mean temperature functions for the four regions are shown in Figure 4.11. To look at the region effects α_{x_1} more closely, we compute their posterior means and standard deviations:

> canada.fit2.p2 <- predict(canada.fit2, newdata=grid,

terms=c(0,1,0,1))

Estimates of the region effects and 95% Bayesian confidence intervals are shown in Figure 4.12. These estimates are similar to those in Ramsay and Silverman (2005).

4.9.4 Texas Weather

Instead of dividing stations into climatic regions as in model (4.54), suppose we want to investigate how the weather patterns depend on geographical locations in terms of latitude and longitude. For illustration, consider the Texas weather data consisting of average monthly temperatures during 1961–1990 from 48 weather stations in Texas. Denote x1 = (lat, long) as the geographical location of a station, and x2 as the month variable scaled into [0, 1]. We want to investigate how the expected temperature, f(x_1, x_2), depends on both location and month variables. Average monthly temperatures are computed using monthly temperatures during 1986–1990 for all 48 stations and are used as observations of f(x_1, x_2). For each fixed station, the annual temperature profile can be regarded as functional data on a continuous interval. Figure 4.13 shows these curves for all 48 stations. For each fixed month, the temperature surface as a function of latitude and longitude can be regarded as functional data on R². Figure 4.14 shows contour plots of observed surfaces for January, April, July, and October.

A natural model space for the location variable is the thin-plate spline W^2_2(R²), and a natural model space for the month variable is W^2_2(per). Therefore, we fit the SS ANOVA model (4.24):


FIGURE 4.12 Canadian weather data, plots of the estimated region effects on temperature, and 95% Bayesian confidence intervals.

> data(TXtemp); TXtemp1 <- TXtemp[TXtemp$year>1985,]

> y <- gapply(TXtemp1, which=5,

FUN=function(x) mean(x[x!=-99.99]),

group=TXtemp1$stacod*TXtemp1$month)

> tx.dat <- data.frame(y=as.vector(t(matrix(y,48,12,

byrow=F))))

> tx.dat$x2 <- rep((1:12-0.5)/12, 48)

> lat <- TXtemp$lat[seq(1, nrow(TXtemp),by=360)]

> long <- TXtemp$long[seq(1, nrow(TXtemp),by=360)]

> tx.dat$x11 <- rep(scale(lat), rep(12,48))

> tx.dat$x12 <- rep(scale(long), rep(12,48))

> tx.dat$stacod <- rep(TXtemp$stacod[seq(1,nrow(TXtemp),

by=360)],rep(12,48))

> tx.ssanova <- ssr(y~x11+x12, data=tx.dat,

rk=list(tp(list(x11,x12)),

periodic(x2),

rk.prod(tp.linear(list(x11,x12)),periodic(x2)),

rk.prod(tp(list(x11,x12)), periodic(x2))))


FIGURE 4.13 Texas weather data, plot of temperature profiles for all 48 stations.

FIGURE 4.14 Texas weather data, contour plots of observations for January, April, July, and October. Dots represent station locations.


where tp.linear computes the RK, φ_2(x)φ_2(z) + φ_3(x)φ_3(z), of the subspace span{φ_2, φ_3}.

The location effect equals β_1φ_1(x_1) + β_2φ_2(x_1) + f^s_1(x_1) + f^{ls}_{12}(x_1, x_2) + f^{ss}_{12}(x_1, x_2). We compute posterior means and standard deviations of the location effect for the southmost (Rio Grande City 3W), northmost (Stratford), westmost (El Paso WSO AP), and eastmost (Marshall) stations:

> selsta <- c(tx.dat[tx.dat$x11==min(tx.dat$x11),7][1],

tx.dat[tx.dat$x11==max(tx.dat$x11),7][1],

tx.dat[tx.dat$x12==min(tx.dat$x12),7][1],

tx.dat[tx.dat$x12==max(tx.dat$x12),7][1])

> sellat <- sellong <- NULL

> for (i in 1:4) {

sellat <- c(sellat,

tx.dat$x11[tx.dat$stacod==selsta[i]][1])

sellong <- c(sellong,

tx.dat$x12[tx.dat$stacod==selsta[i]][1])

}

> grid <- data.frame(x11=rep(sellat,rep(40,4)),

x12=rep(sellong,rep(40,4)),

x2=rep(seq(0,1,len=40), 4))

> tx.pred1 <- predict(tx.ssanova, grid,

terms=c(0,1,1,1,0,1,1))

The estimates of these effects and 95% Bayesian confidence intervals are shown in Figure 4.15. The curve in each plot shows how the temperature profile of that particular station differs from the average profile among the 48 stations. It is clear that the temperature in Rio Grande City is higher than average, and the temperature in Stratford is lower than average, especially in the winter. The temperature in El Paso is close to average in the first half of the year, and lower than average in the second half. The temperature in Marshall is slightly above average.

The month effect equals f_2(x_2) + f^{ls}_{12}(x_1, x_2) + f^{ss}_{12}(x_1, x_2). We compute posterior means and standard deviations of the month effect:

> tx.pred2 <- predict(tx.ssanova, terms=c(0,0,0,0,1,1,1))

The estimates of these effects for January, April, July, and October are shown in Figure 4.16. Each plot in Figure 4.16 shows how the temperature pattern for that particular month differs from the average pattern among all 12 months. As expected, the temperature in January is colder than average, while the temperature in July is warmer than average. In general, the difference becomes smaller from north to south. The temperatures in April and October are close to average.


FIGURE 4.15 Texas weather data, plots of the location effect (solid lines) for four selected stations with 95% Bayesian confidence intervals (dashed lines).


FIGURE 4.16 Texas weather data, plots of the month effect for January, April, July, and October.


Chapter 5

Spline Smoothing with Heteroscedastic and/or Correlated Errors

5.1 Problems with Heteroscedasticity and Correlation

In the previous chapters we have assumed that observations are independent with equal variances. These assumptions may not be appropriate for many applications. This chapter presents spline smoothing methods for heteroscedastic and/or correlated observations. Before introducing these methods, we first illustrate potential problems associated with the presence of heteroscedasticity and correlation in spline smoothing.

We use the following simulation to illustrate potential problems with heteroscedasticity. Observations are generated from model (1.1) with f(x) = sin(4πx²), x_i = i/n for i = 1, . . . , n, and n = 100. Random errors are generated independently from the Gaussian distribution with mean zero and variance σ² exp{α|f(x)|}. Therefore, we have unequal variances when α ≠ 0. We set σ = 0.05 and α = 4. For each simulated data set, we first fit the cubic spline directly using PLS with GCV and GML choices of smoothing parameters. Note that these direct fits ignore heteroscedasticity. We then fit the cubic spline using the penalized weighted LS (PWLS) introduced in Section 5.2.1 with known weights W = diag(exp{−4|f(x_1)|}, . . . , exp{−4|f(x_n)|}). For each fit, we compute the weighted MSE (WMSE)

WMSE = (1/n) ∑_{i=1}^{n} w_i (f̂(x_i) − f(x_i))²,

where w_i = exp{−4|f(x_i)|}. We also construct 95% Bayesian confidence intervals for each fit. The simulation is repeated 100 times. Figure 5.1 shows the performances of the unweighted and weighted methods in terms of WMSE and coverages of Bayesian confidence intervals. Figure 5.1(a)



FIGURE 5.1 (a) Boxplots of WMSEs based on PLS with GCV and GML choices of the smoothing parameter and PWLS with GCV (labeled as GCVW) and GML (labeled as GMLW) choices of the smoothing parameter. Average WMSEs are marked as pluses; (b) boxplots of across-the-function coverages of 95% Bayesian confidence intervals for the PLS fits with GML choice of the smoothing parameter (labeled as unweighted) and PWLS fits with GML choice of the smoothing parameter (labeled as weighted). Average coverages are marked as pluses; (c) plot of pointwise coverages (solid line) of 95% Bayesian confidence intervals for the PLS fits with GML choice of the smoothing parameter; (d) plot of pointwise coverages (solid line) of 95% Bayesian confidence intervals for the PWLS fits with GML choice of the smoothing parameter. Dotted lines in (b), (c), and (d) represent the nominal value. Dashed lines in (c) and (d) represent a scaled version of the variance function exp{α|f(x)|}.

indicates that, even though they ignore heteroscedasticity, the unweighted methods provide good fits to the function. The weighted methods lead to better fits. Bayesian confidence intervals based on both methods provide the intended across-the-function coverages (Figure 5.1(b)). However, Figure 5.1(c) reveals the problem with heteroscedasticity: pointwise coverages in regions with larger variances are smaller than the nominal value, while pointwise coverages in other regions are larger than the nominal value. Obviously, this is caused by ignoring heteroscedasticity. Bayesian confidence intervals based on the PWLS method overcome this problem (Figure 5.1(d)).
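The variance model and the WMSE computation above are easy to reproduce numerically. The following is a minimal Python sketch (the book's own analyses use R); a weighted polynomial fit stands in for the penalized spline, and the polynomial degree is an arbitrary illustrative assumption, not from the text.

```python
import numpy as np

# Sketch of the heteroscedastic simulation: variance sigma^2*exp(alpha*|f(x)|),
# PWLS weights w_i = exp(-alpha*|f(x_i)|), and the WMSE criterion.
rng = np.random.default_rng(1)
n, sigma, alpha = 100, 0.05, 4
x = np.arange(1, n + 1) / n
f = np.sin(4 * np.pi * x**2)
sd = sigma * np.exp(alpha * np.abs(f) / 2)          # sd^2 = sigma^2 exp(alpha|f|)
y = f + sd * rng.standard_normal(n)

w = np.exp(-alpha * np.abs(f))                      # known weights W = diag(w)
deg = 12                                            # assumption: fixed polynomial degree
fit_u = np.polyval(np.polyfit(x, y, deg), x)                # unweighted fit
fit_w = np.polyval(np.polyfit(x, y, deg, w=np.sqrt(w)), x)  # weighted fit

wmse = lambda fh: np.mean(w * (fh - f)**2)          # weighted MSE as defined above
print(wmse(fit_u), wmse(fit_w))
```

The printed pair gives the two WMSEs; in this setup the weighted fit typically attains the smaller value, consistent with the pattern summarized in Figure 5.1(a).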

FIGURE 5.2 Plots of the true function (dashed lines), observations (circles), and the cubic spline fits (solid lines) with UBR, GCV, and GML choices of smoothing parameters. The true variance is used in the calculation of the UBR criterion.

Compared with heteroscedasticity, the potential problems associated with correlation are more fundamental and difficult to deal with. For illustration, again we simulate data from model (1.1) with f(x) = sin(4πx²), x_i = i/n for i = 1, . . . , n, and n = 100. Random errors ε_i are generated by a first-order autoregressive model (AR(1)) with mean zero, standard deviation 0.2, and first-order correlation 0.6. Figure 5.2 shows the simulated data, the true function, and three cubic spline fits with smoothing parameters chosen by the UBR, GCV, and GML methods, respectively. The true variance is used in the calculation of the UBR criterion. All fits are wiggly, which indicates that the estimated smoothing parameters are too small. This undersmoothing phenomenon is common in the presence of positive correlation. Ignoring correlation, the UBR, GCV, and GML methods perceive that all the trend (signal) in the data is due to the mean function f and attempt to incorporate that trend into the estimate. Correlated random errors may induce a local trend and thus fool these methods into selecting smaller smoothing parameters so that the local trend can be picked up.
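The AR(1) error process used in this simulation can be sketched as follows (a minimal Python sketch; Gaussian innovations and stationary initialization are illustrative assumptions). It generates errors with marginal standard deviation 0.2 and lag-1 correlation 0.6, and checks the sample lag-1 autocorrelation:

```python
import numpy as np

# AR(1) errors: e_i = rho*e_{i-1} + innovation, with innovation variance
# sd^2*(1 - rho^2) so that the marginal standard deviation stays equal to sd.
rng = np.random.default_rng(2)
n, rho, sd = 100, 0.6, 0.2
e = np.empty(n)
e[0] = rng.normal(0, sd)                 # stationary start
for i in range(1, n):
    e[i] = rho * e[i - 1] + rng.normal(0, sd * np.sqrt(1 - rho**2))

r1 = np.corrcoef(e[:-1], e[1:])[0, 1]    # sample lag-1 autocorrelation
print(round(r1, 2))
```

Runs of positively correlated errors like these look locally like trend, which is exactly what misleads the UBR, GCV, and GML criteria.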


5.2 Extended SS ANOVA Models

Suppose observations are generated by

y_i = L_i f + ε_i,    i = 1, . . . , n,    (5.1)

where f is a multivariate function in a model space M, and the L_i are bounded linear functionals. Let ε = (ε_1, . . . , ε_n)^T. We assume that E(ε) = 0 and Cov(ε) = σ²W^{−1}. In this chapter we consider the SS ANOVA model space

M = H_0 ⊕ H_1 ⊕ · · · ⊕ H_q,    (5.2)

where H_0 is a finite dimensional space collecting all functions that are not going to be penalized, and H_1, . . . , H_q are orthogonal RKHS's with RKs R_j for j = 1, . . . , q. Model (5.1) is an extension of the SS ANOVA model (4.31) with non-iid random errors. See Section 8.2 for a more general model involving linear functionals and non-iid errors.

5.2.1 Penalized Weighted Least Squares

Our goal is to estimate f as well as W when it is unknown. We first assume that W is fixed and consider the estimation of the nonparametric function f. A direct generalization of the PLS (4.32) is the following PWLS:

min_{f∈M} (1/n)(y − f)^T W (y − f) + λ ∑_{j=1}^{q} θ_j^{−1} ‖P_j f‖²,    (5.3)

where y = (y_1, . . . , y_n)^T, f = (L_1 f, . . . , L_n f)^T, and P_j is the orthogonal projection in M onto H_j.

Let θ = (θ_1, . . . , θ_q). Denote φ_1, . . . , φ_p as basis functions of H_0, T = {L_i φ_ν}_{i=1,...,n; ν=1,...,p}, Σ_k = {L_i L_j R_k}_{i,j=1}^{n} for k = 1, . . . , q, and Σ_θ = ∑_{j=1}^{q} θ_j Σ_j. As in the previous chapters, we assume that T is of full column rank. The same arguments in Section 2.4 apply to the PWLS. Therefore, by the Kimeldorf–Wahba representer theorem, the solution to (5.3) exists and is unique, and the solution can be represented as

f(x) = ∑_{ν=1}^{p} d_ν φ_ν(x) + ∑_{i=1}^{n} c_i ∑_{j=1}^{q} θ_j L_{i(z)} R_j(x, z).


Let c = (c_1, . . . , c_n)^T and d = (d_1, . . . , d_p)^T. Let f̂ = (L_1 f̂, . . . , L_n f̂)^T be the vector of fitted values. It is easy to check that f̂ = Td + Σ_θ c and ∑_{j=1}^{q} θ_j^{−1}‖P_j f‖² = c^T Σ_θ c. Then the PWLS (5.3) reduces to

(1/n)(y − Td − Σ_θ c)^T W (y − Td − Σ_θ c) + λ c^T Σ_θ c.    (5.4)

Taking the first derivatives leads to the following equations for c and d:

(Σ_θ W Σ_θ + nλΣ_θ)c + Σ_θ W T d = Σ_θ W y,
T^T W Σ_θ c + T^T W T d = T^T W y.    (5.5)

It is easy to check that a solution to

(Σ_θ + nλW^{−1})c + Td = y,
T^T c = 0,    (5.6)

is also a solution to (5.5). Let M = Σ_θ + nλW^{−1} and let

T = (Q_1 Q_2) ( R )
              ( 0 )

be the QR decomposition of T. Then the solutions to (5.6) are

c = Q_2(Q_2^T M Q_2)^{−1} Q_2^T y,
d = R^{−1} Q_1^T (y − Mc).    (5.7)

Based on the first equation in (5.6) and the fact that f̂ = Td + Σ_θ c, we have

f̂ = y − nλW^{−1}c = H(λ, θ)y,

where

H(λ, θ) = I − nλW^{−1}Q_2(Q_2^T M Q_2)^{−1}Q_2^T    (5.8)

is the hat matrix. The dependence of H on the smoothing parameters is expressed explicitly in (5.8). In the remainder of this chapter, for simplicity, the notation H will be used. Note that, different from the independent case, H may be asymmetric.

To solve (5.6), for fixed W and smoothing parameters, consider the transformations ỹ = W^{1/2}y, T̃ = W^{1/2}T, Σ̃_θ = W^{1/2}Σ_θ W^{1/2}, c̃ = W^{−1/2}c, and d̃ = d. Then the equations in (5.6) are equivalent to the following equations:

(Σ̃_θ + nλI)c̃ + T̃d̃ = ỹ,
T̃^T c̃ = 0.    (5.9)

Note that the equations in (5.9) have the same form as those in (2.21); thus the methods in Section 2.4 can be used to compute c̃ and d̃. Transforming back, we obtain the solutions c and d.


5.2.2 UBR, GCV and GML Criteria

We now extend the UBR, GCV, and GML criteria for the estimation of the smoothing parameters λ and θ, as well as W when it is unknown.

Define the weighted version of the loss function

L(λ, θ) = (1/n)(f̂ − f)^T W (f̂ − f)    (5.10)

and the weighted MSE as WMSE(λ, θ) ≜ EL(λ, θ). Then

WMSE(λ, θ) = (1/n) f^T (I − H^T)W(I − H)f + (σ²/n) tr(H^T W H W^{−1}).    (5.11)

It is easy to check that an unbiased estimate of WMSE(λ, θ) + σ² is

UBR(λ, θ) = (1/n) y^T (I − H^T)W(I − H)y + (2σ²/n) trH.    (5.12)

The UBR estimates of the smoothing parameters, and of W when it is unknown, are the minimizers of UBR(λ, θ).

To extend the CV and GCV methods, we first consider the special case when W is diagonal: W = diag(w_1, . . . , w_n). Denote f̂^{[i]} as the minimizer of the PWLS (5.3) based on all observations except y_i. It is easy to check that the leaving-out-one lemma in Section 3.4 still holds and L_i f̂^{[i]} − y_i = (L_i f̂ − y_i)/(1 − h_{ii}), where the h_{ii} are the diagonal elements of H. Then the cross-validation criterion is

CV(λ, θ) ≜ (1/n) ∑_{i=1}^{n} w_i (L_i f̂^{[i]} − y_i)² = (1/n) ∑_{i=1}^{n} w_i (L_i f̂ − y_i)² / (1 − h_{ii})².    (5.13)

Replacing h_{ii} by the average of the diagonal elements leads to the GCV criterion

GCV(λ, θ) ≜ (1/n)‖W^{1/2}(I − H)y‖² / {(1/n) tr(I − H)}².    (5.14)
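The leaving-out-one identity behind (5.13) holds for any penalized weighted least-squares fit that is linear in y. A minimal Python sketch checks it with a ridge fit on polynomial features standing in for the spline (an illustrative assumption, not the book's estimator):

```python
import numpy as np

# Verify L_i fhat^[i] - y_i = (L_i fhat - y_i)/(1 - h_ii) for a penalized
# weighted LS fit with diagonal W.
rng = np.random.default_rng(3)
n, lam = 40, 0.1
x = np.linspace(0, 1, n)
y = np.sin(2 * np.pi * x) + 0.2 * rng.standard_normal(n)
X = np.vander(x, 6)                      # polynomial design (assumption)
w = 0.5 + rng.random(n)                  # arbitrary positive weights
W = np.diag(w)

# Hat matrix of the full weighted ridge fit
H = X @ np.linalg.solve(X.T @ W @ X + lam * np.eye(X.shape[1]), X.T @ W)
fhat = H @ y

i = 7                                    # leave out one observation
keep = np.arange(n) != i
beta_i = np.linalg.solve(X[keep].T @ np.diag(w[keep]) @ X[keep]
                         + lam * np.eye(X.shape[1]),
                         X[keep].T @ np.diag(w[keep]) @ y[keep])
loo = X[i] @ beta_i                      # L_i fhat^[i], direct refit
shortcut = y[i] + (fhat[i] - y[i]) / (1 - H[i, i])
assert np.allclose(loo, shortcut)
```

The assertion shows that CV can be computed from a single fit, which is what makes (5.13) practical.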

Next consider the following model for clustered data:

y_ij = L_ij f + ε_ij,    i = 1, . . . , m;  j = 1, . . . , n_i,    (5.15)

where y_ij is the jth observation in cluster i, and the L_ij are bounded linear functionals. Let n = ∑_{i=1}^{m} n_i, y_i = (y_{i1}, . . . , y_{in_i})^T, y = (y_1^T, . . . , y_m^T)^T, ε_i = (ε_{i1}, . . . , ε_{in_i})^T, and ε = (ε_1^T, . . . , ε_m^T)^T. We assume that Cov(ε_i) = σ²W_i^{−1} and that observations between clusters are independent. Consequently, Cov(ε) = σ²W^{−1}, where W = diag(W_1, . . . , W_m) is a block diagonal matrix. Assume that f ∈ M, where M is the model space in


(5.2). We estimate f using the PWLS (5.3). Let f̂ be the PWLS estimate based on all observations, and f̂^{[i]} be the PWLS estimate based on all observations except those from the ith cluster y_i. Let f_i = (L_{i1}f, . . . , L_{in_i}f)^T, f = (f_1^T, . . . , f_m^T)^T, f̂_i = (L_{i1}f̂, . . . , L_{in_i}f̂)^T, f̂ = (f̂_1^T, . . . , f̂_m^T)^T, f̂_j^{[i]} = (L_{j1}f̂^{[i]}, . . . , L_{jn_j}f̂^{[i]})^T, and f̂^{[i]} = ((f̂_1^{[i]})^T, . . . , (f̂_m^{[i]})^T)^T.

Leaving-out-one-cluster Lemma

For any fixed i, f̂^{[i]} is the minimizer of

(1/n)(f̂_i^{[i]} − f_i)^T W_i (f̂_i^{[i]} − f_i) + (1/n) ∑_{j≠i} (y_j − f_j)^T W_j (y_j − f_j) + λ ∑_{j=1}^{q} θ_j^{−1} ‖P_j f‖².    (5.16)

[Proof] For any function f ∈ M, we have

(1/n)(f̂_i^{[i]} − f_i)^T W_i (f̂_i^{[i]} − f_i) + (1/n) ∑_{j≠i} (y_j − f_j)^T W_j (y_j − f_j) + λ ∑_{j=1}^{q} θ_j^{−1} ‖P_j f‖²

≥ (1/n) ∑_{j≠i} (y_j − f_j)^T W_j (y_j − f_j) + λ ∑_{j=1}^{q} θ_j^{−1} ‖P_j f‖²

≥ (1/n) ∑_{j≠i} (y_j − f̂_j^{[i]})^T W_j (y_j − f̂_j^{[i]}) + λ ∑_{j=1}^{q} θ_j^{−1} ‖P_j f̂^{[i]}‖²

= (1/n)(f̂_i^{[i]} − f̂_i^{[i]})^T W_i (f̂_i^{[i]} − f̂_i^{[i]}) + (1/n) ∑_{j≠i} (y_j − f̂_j^{[i]})^T W_j (y_j − f̂_j^{[i]}) + λ ∑_{j=1}^{q} θ_j^{−1} ‖P_j f̂^{[i]}‖².

The leaving-out-one-cluster lemma implies that f̂ = Hy and f̂^{[i]} = Hy^{[i]}, where H is the hat matrix and y^{[i]} is the same as y except that y_i is replaced by f̂_i^{[i]}. Divide the hat matrix H according to clusters such that H = {H_ik}_{i,k=1}^{m}, where H_ik is an n_i × n_k matrix. Then we have

f̂_i = ∑_{j=1}^{m} H_ij y_j,
f̂_i^{[i]} = ∑_{j≠i} H_ij y_j + H_ii f̂_i^{[i]}.

Assume that I − H_ii is invertible. Then

f̂_i^{[i]} − y_i = (I − H_ii)^{−1}(f̂_i − y_i).

The cross-validation criterion is

CV(λ, θ) ≜ (1/n) ∑_{i=1}^{m} (f̂_i^{[i]} − y_i)^T W_i (f̂_i^{[i]} − y_i)
         = (1/n) ∑_{i=1}^{m} (f̂_i − y_i)^T (I − H_ii)^{−T} W_i (I − H_ii)^{−1} (f̂_i − y_i).    (5.17)

Replacing I − H_ii by its generalized average G_i (Ma, Dai, Klein, Klein, Lee and Wahba 2010), we have

GCV(λ, θ) ≜ (1/n) ∑_{i=1}^{m} (f̂_i − y_i)^T G_i W_i G_i (f̂_i − y_i),    (5.18)

where G_i = a_i I_{n_i} − b_i 1_{n_i} 1_{n_i}^T, a_i = 1/(δ_i − γ_i), b_i = γ_i/[(δ_i − γ_i){δ_i + (n_i − 1)γ_i}], δ_i = (n − trH)/(mn_i), γ_i = 0 when n_i = 1 and γ_i = −∑_{i=1}^{m} ∑_{s≠t} h^i_{st}/{mn_i(n_i − 1)} when n_i > 1, h^i_{st} is the element in the sth row and tth column of the matrix H_ii, I_{n_i} is an identity matrix of size n_i, and 1_{n_i} is an n_i-vector of all ones. The CV and GCV estimates of the smoothing parameters, and of W when it is unknown, are the minimizers of the CV and GCV criteria.
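The leave-out-cluster shortcut underlying (5.17) can be checked with the same kind of sketch as before: a weighted ridge fit on polynomial features stands in for the spline (an illustrative assumption), and W is taken diagonal for simplicity, which is a special case of block diagonal.

```python
import numpy as np

# Verify fhat_i^[i] - y_i = (I - H_ii)^{-1}(fhat_i - y_i) for a penalized
# weighted LS fit, leaving out one whole cluster.
rng = np.random.default_rng(4)
m, ni, lam = 6, 4, 0.1                     # m clusters of size ni (assumption)
n = m * ni
x = np.linspace(0, 1, n)
y = np.cos(2 * np.pi * x) + 0.2 * rng.standard_normal(n)
X = np.vander(x, 5)
w = 0.5 + rng.random(n)
W = np.diag(w)

H = X @ np.linalg.solve(X.T @ W @ X + lam * np.eye(X.shape[1]), X.T @ W)
fhat = H @ y

i = 2                                      # leave out cluster i
idx = np.arange(i * ni, (i + 1) * ni)
keep = np.setdiff1d(np.arange(n), idx)
beta_i = np.linalg.solve(X[keep].T @ np.diag(w[keep]) @ X[keep]
                         + lam * np.eye(X.shape[1]),
                         X[keep].T @ np.diag(w[keep]) @ y[keep])
loo = X[idx] @ beta_i                      # fhat_i^[i], direct refit
Hii = H[np.ix_(idx, idx)]
shortcut = y[idx] + np.linalg.solve(np.eye(ni) - Hii, fhat[idx] - y[idx])
assert np.allclose(loo, shortcut)
```

As in the single-observation case, the shortcut lets (5.17) be evaluated from one fit of the full data.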

To derive the GML criterion, we first construct a Bayes model for the extended SS ANOVA model (5.1). Assume the same prior for f as in (4.42):

F(x) = ∑_{ν=1}^{p} ζ_ν φ_ν(x) + δ^{1/2} ∑_{j=1}^{q} θ_j^{1/2} U_j(x),    (5.19)

where ζ_ν iid∼ N(0, κ); the U_j(x) are independent, zero-mean Gaussian stochastic processes with covariance functions R_j(x, z); ζ_ν and U_j(x) are mutually independent; and κ and δ are positive constants. Suppose observations are generated from

y_i = L_i F + ε_i,    i = 1, . . . , n,    (5.20)


where ε = (ε_1, . . . , ε_n)^T ∼ N(0, σ²W^{−1}). Let L_0 be a bounded linear functional on M, and let λ = σ²/(nδ). The same arguments in Section 3.6 hold when M = Σ + nλI is replaced by M = Σ_θ + nλW^{−1} in this chapter (Wang 1998b). Therefore,

lim_{κ→∞} E(L_0 F | y) = L_0 f̂,

and an extension of the GML criterion is

GML(λ, θ) = y^T W(I − H)y / {det⁺(W(I − H))}^{1/(n−p)},    (5.21)

where det⁺ is the product of the nonzero eigenvalues. The GML estimates of the smoothing parameters, and of W when it is unknown, are the minimizers of GML(λ, θ).

The GML estimator of the variance σ² is (Wang 1998b)

σ̂² = y^T W(I − H)y / (n − p).    (5.22)

5.2.3 Known Covariance

In this section we discuss the implementation of the UBR, GCV, and GML criteria in Section 5.2.2 when W is known. In this situation we only need to estimate the smoothing parameters λ and θ. Consider the transformations discussed at the end of Section 5.2.1. Let f̃ = W^{1/2}f̂ denote the fits to the transformed data, and let H̃ be the hat matrix associated with the transformed data. Then H̃ỹ = f̃ = W^{1/2}f̂ = W^{1/2}Hy = W^{1/2}HW^{−1/2}ỹ for any ỹ. Therefore,

H = W^{−1/2}H̃W^{1/2}.    (5.23)

From (5.23), the UBR, GCV, and GML criteria in (5.12), (5.14), and (5.21) can be rewritten based on the transformed data as

UBR(λ, θ) = (1/n)||(I − H̃)ỹ||² + (2σ²/n) tr H̃, (5.24)

GCV(λ, θ) = (1/n)||(I − H̃)ỹ||² / {(1/n) tr(I − H̃)}², (5.25)

GML(λ, θ) = C ỹ^T(I − H̃)ỹ / {det+(I − H̃)}^{1/(n−p)}, (5.26)

where C in GML(λ, θ) is a constant independent of λ and θ. Equations (5.24), (5.25), and (5.26) indicate that, when W is known, the UBR,


148 Smoothing Splines: Methods and Applications

GCV, and GML estimates of smoothing parameters can be calculated based on the transformed data using the method described in Section 4.7.
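The identity (5.23) can also be checked numerically. The sketch below is an illustration under assumed quantities, not the ssr implementation: it uses the PWLS-type hat matrix H = (W + λΩ)^{−1}W for an arbitrary symmetric penalty matrix Ω, and verifies that H̃ = W^{1/2}HW^{−1/2} is symmetric, as a hat matrix for the transformed data should be:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 6
B = rng.standard_normal((n, n))
Om = B.T @ B                          # symmetric positive semidefinite penalty matrix
w = rng.uniform(0.5, 2.0, n)
W = np.diag(w)                        # known (diagonal) weight matrix
lam = 0.3
H = np.linalg.solve(W + lam * Om, W)  # PWLS hat matrix H = (W + lam*Om)^{-1} W

Wh, Whi = np.diag(np.sqrt(w)), np.diag(1 / np.sqrt(w))
Ht = Wh @ H @ Whi                     # candidate hat matrix of the transformed problem
```

Since Ht = W^{1/2}(W + λΩ)^{−1}W^{1/2}, symmetry follows from the symmetry of the inner inverse.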

5.2.4 Unknown Covariance

We now consider the case when W is unknown. When a separate method is available for estimating W, one approach is to estimate the function f and the covariance W iteratively. For example, the following two-step procedure is simple and easy to implement: (1) estimate the function with a “sensible” choice of the smoothing parameter; (2) estimate the covariance using residuals; and (3) estimate the function again using the estimated covariance. However, this approach may not work in certain situations due to the interplay between the smoothing parameter and correlation.

We use the following two simple simulations to illustrate the potential problem associated with the above iterative approach. In the first simulation, n = 100 observations are generated according to model (1.1) with f(x) = sin(4πx²), xi = i/n for i = 1, . . . , n, and ǫi iid∼ N(0, 0.2²).

We fit a cubic spline with a fixed smoothing parameter λ such that log10(nλ) = −3.5. Figure 5.3(a) shows the fit, which is slightly oversmoothed. The estimated autocorrelation function (ACF) of the residuals (Figure 5.3(b)) suggests an autoregressive structure even though the true random errors are independent. It is clear that the leftover signal due to oversmoothing shows up in the residuals. In the second simulation, n = 100 observations are generated according to model (1.1) with f(x) ≡ 0, xi = i/n for i = 1, . . . , n, and ǫi generated by an AR(1) model with mean zero, standard deviation 0.2, and first-order correlation 0.6. Figure 5.3(c) shows the cubic spline fit with the GCV choice of the smoothing parameter. The fit picks up local trends in the AR(1) process, and the estimated ACF of the residuals (Figure 5.3(d)) does not reveal any autoregressive structure. In both cases, the mean functions are incorrectly estimated, and the conclusions about the error structures are erroneous. These two simulations indicate that a wrong choice of the smoothing parameter in the first step will lead to a deceptive serial correlation in the second step.
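The first phenomenon is easy to reproduce. In the sketch below, a wide moving average stands in for the oversmoothed cubic spline fit (an assumption for simplicity); the residuals then exhibit clearly positive lag-1 autocorrelation even though the errors are independent:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100
x = np.arange(1, n + 1) / n
y = np.sin(4 * np.pi * x**2) + rng.normal(0, 0.2, n)   # independent errors

k = 15                                  # wide moving average: an oversmoothed fit
pad = np.r_[np.repeat(y[0], k // 2), y, np.repeat(y[-1], k // 2)]
fit = np.convolve(pad, np.ones(k) / k, mode="valid")
res = y - fit

r = res - res.mean()
lag1 = np.sum(r[1:] * r[:-1]) / np.sum(r * r)          # lag-1 autocorrelation
```

The leftover high-frequency part of sin(4πx²) dominates the residuals, so lag1 comes out clearly positive, mimicking an autoregressive error structure.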

In the most general setting, where no parametric shape is assumed for the mean or the correlation function, the model is essentially unidentifiable. In the following we model the correlation structure parametrically. Specifically, we assume that W depends on an unknown vector of parameters τ. Models for W^{−1} will be discussed in Section 5.3.

When there is no strong connection between the smoothing and correlation parameters, an iterative procedure as described earlier may be


FIGURE 5.3 For the first simulation, plots of (a) the true function (dashed line), observations (circles), and the cubic spline fit (solid line) with log10(nλ) = −3.5, and (b) the estimated ACF of the residuals. For the second simulation, plots of (c) the true function (dashed line), observations (circles), and the cubic spline fit (solid line) with the GCV choice of the smoothing parameter, and (d) the estimated ACF of the residuals.

used to estimate f and W. See Sections 5.4.1 and 6.4 for examples. When there is a strong connection, it is helpful to estimate the smoothing and correlation parameters simultaneously. One may use the UBR, GCV, or GML criterion for this purpose. It was found that the GML method performs better than the UBR and GCV methods (Wang 1998b). Therefore, in the following, we present the implementation of the GML method only.

To compute the minimizers of the GML criterion (5.21), we now construct a corresponding LME model for the extended SS ANOVA model (5.1). Let Σk = ZkZk^T, where Zk is an n × mk matrix with mk = rank(Σk). Consider the following LME model

y = Tζ + Σ_{k=1}^q Zk uk + ǫ, (5.27)

where ζ is a p-dimensional vector of deterministic parameters; uk are mutually independent random effects with uk ∼ N(0, σ²θk Imk/(nλ)), where Imk is the identity matrix of order mk; ǫ ∼ N(0, σ²W^{−1}); and uk are independent of ǫ. It is not difficult to show that the spline estimate based on the PWLS is the BLUP estimate, and that the GML criterion (5.21) is the REML criterion based on the LME model (5.27) (Opsomer, Wang and Yang 2001). Details for the more complicated semiparametric mixed-effects models will be given in Chapter 9. In the ssr function, we first calculate Zk through a Cholesky decomposition. Then we calculate the REML (GML) estimates of λ, θ, and τ using the function lme in the nlme library by Pinheiro and Bates (2000).

5.2.5 Confidence Intervals

Consider the Bayes model defined in (5.19) and (5.20). Conditional on W, it is not difficult to show that the formulae for posterior means and covariances in Section 4.8 hold when M = Σθ + nλI is replaced by M = Σθ + nλW^{−1}. Bayesian confidence intervals can be constructed similarly. When the matrix W involves unknown parameters τ, it is replaced by W(τ̂) in the construction of Bayesian confidence intervals. This simple plug-in approach does not account for the variation in estimating the covariance parameters τ. The bootstrap approach may also be used to construct confidence intervals. More research is necessary for inferences on the nonparametric function f as well as on the parameters τ.

5.3 Variance and Correlation Structures

In this section we discuss commonly used models for the variance–covariance matrix W^{−1}. The matrix W^{−1} can be decomposed as

W^{−1} = V C V,

where V is a diagonal matrix with positive diagonal elements, and C is a correlation matrix with all diagonal elements equal to one. The matrices V and C describe the variance and correlation, respectively.

Page 176: Smoothing Splines - 221.114.158.246221.114.158.246/~bunken/statistics/others_smoothingspline.pdf · Applications covers basic smoothing spline models, including polynomial, periodic,

Spline Smoothing with Heteroscedastic and/or Correlated Errors 151

The above decomposition allows us to develop separate models for the variance structure and the correlation structure. We now describe the structures available in the assist package. Details about these structures can be found in Pinheiro and Bates (2000).
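The decomposition W^{−1} = V C V can be assembled directly; the sketch below builds a small variance–covariance matrix from standard deviations and an AR(1)-type correlation matrix (illustrative choices, not tied to any data set):

```python
import numpy as np

v = np.array([1.0, 2.0, 0.5])          # diagonal of V: error standard deviations
rho = 0.6
lags = np.abs(np.subtract.outer(np.arange(3), np.arange(3)))
C = rho ** lags                        # AR(1)-type correlation matrix, ones on the diagonal
V = np.diag(v)
W_inv = V @ C @ V                      # variance-covariance matrix W^{-1}
```

By construction the diagonal of W^{−1} carries the variances v² and the off-diagonal entries carry the correlations scaled by the two standard deviations.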

Consider the variance structure first. Define the general variance function as

Var(ǫi) = σ²v²(µi, zi; ζ), (5.28)

where v is a known variance function of the mean µi = E(yi) and of a vector of covariates zi associated with the variance, and ζ is a vector of variance parameters. As in Pinheiro and Bates (2000), when v depends on µi, the pseudo-likelihood method is used to estimate all parameters and the function f.

In the ssr function, a variance matrix (or vector) or a model for the variance function is specified using the weights argument. The input to the weights argument may be the matrix W when it is known. Furthermore, when W is a known diagonal matrix, its diagonal elements may be specified as a vector by the weights argument. The input to the weights argument may also be a varFunc structure specifying a model for the variance function v in (5.28). All varFunc objects available in the nlme package are also available in the assist package. The varFunc structure is defined through the function v(s; ζ), where s can be either a variance covariate or the fitted value. Standard varFunc classes and their corresponding variance functions v are listed in Table 5.1. Two or more variance models may be combined using the varComb constructor.

TABLE 5.1 Standard varFunc classes

Class           v(s; ζ)
varFixed        √|s|
varIdent        ζ_s, s is a stratification variable
varPower        |s|^ζ
varExp          exp(ζs)
varConstPower   ζ1 + |s|^{ζ2}

Next consider the correlation structure. Assume the general isotropic correlation structure

cor(ǫi, ǫj) = h(d(pi, pj); ρ), (5.29)

where pi and pj are position vectors associated with observations i and j, respectively; h is a known correlation function of the distance d(pi, pj)

Page 177: Smoothing Splines - 221.114.158.246221.114.158.246/~bunken/statistics/others_smoothingspline.pdf · Applications covers basic smoothing spline models, including polynomial, periodic,

152 Smoothing Splines: Methods and Applications

such that h(0; ρ) = 1; and ρ is a vector of correlation parameters. In the ssr function, correlation structures are specified as corStruct objects through the correlation argument. There are two common families of correlation structures: serial correlation structures for time series and spatial correlation structures for spatial data.

For time series data, observations are indexed by a one-dimensional position vector, and d(pi, pj) represents a lag that takes nonnegative integer values. Therefore, the serial correlation is determined by the function h(k; ρ) for k = 1, 2, . . . . Standard corStruct classes for serial correlation structures and their corresponding correlation functions h are listed in Table 5.2.

TABLE 5.2 Standard corStruct classes for serial correlation structures

Class         h(k; ρ)
corCompSymm   ρ
corSymm       ρ_k
corAR1        ρ^k
corARMA       correlation function given in (5.30)

The classes corCompSymm and corSymm correspond to the compound symmetry and general correlation structures. An autoregressive moving-average (ARMA(p, q)) model assumes that

ǫt = Σ_{i=1}^p φi ǫ_{t−i} + Σ_{j=1}^q θj a_{t−j} + at,

where at are iid random variables with mean zero and a constant variance, φ1, . . . , φp are autoregressive parameters, and θ1, . . . , θq are moving-average parameters. The correlation function is defined recursively as

h(k; ρ) = φ1 h(|k − 1|; ρ) + · · · + φp h(|k − p|; ρ) + θ1 ψ(k − 1; ρ) + · · · + θq ψ(k − q; ρ), k ≤ q,
h(k; ρ) = φ1 h(|k − 1|; ρ) + · · · + φp h(|k − p|; ρ), k > q, (5.30)

where ρ = (φ1, . . . , φp, θ1, . . . , θq) and ψ(k; ρ) = E(ǫ_{t−k} at)/Var(ǫt). The continuous AR(1) correlation function is defined as

h(s; ρ) = ρ^s, s ≥ 0, ρ ≥ 0.
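The serial correlation functions above can be sketched directly (simple illustrations of the formulas, not the corStruct implementations):

```python
def cor_comp_symm(k, rho):
    # compound symmetry: the same correlation at every positive lag
    return 1.0 if k == 0 else rho

def cor_ar1(k, rho):
    # AR(1): correlation decays geometrically with the lag
    return rho ** abs(k)

def cor_car1(s, rho):
    # continuous AR(1): the lag s need not be an integer
    return rho ** s
```

Note that cor_car1 agrees with cor_ar1 at integer lags, which is what makes it the natural choice for unequally spaced or partially missing time series.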

The corCAR1 constructor specifies the continuous AR(1) structure. For spatial data, the locations pi ∈ R^r. Any well-defined distance, such as

Page 178: Smoothing Splines - 221.114.158.246221.114.158.246/~bunken/statistics/others_smoothingspline.pdf · Applications covers basic smoothing spline models, including polynomial, periodic,

Spline Smoothing with Heteroscedastic and/or Correlated Errors 153

the Euclidean distance d(pi, pj) = {Σ_{k=1}^r (pik − pjk)²}^{1/2}, the Manhattan distance d(pi, pj) = Σ_{k=1}^r |pik − pjk|, or the maximum distance d(pi, pj) = max_{1≤k≤r} |pik − pjk|, may be used. The correlation structure is defined through the function h(s; ρ), where s ≥ 0. Standard corStruct classes for spatial correlation structures and their corresponding correlation functions h are listed in Table 5.3. The function I(s < ρ) equals 1 when s < ρ and 0 otherwise. When desirable, the following correlation function allows a nugget effect:

hnugg(s, c0; ρ) = (1 − c0) hcont(s; ρ), s > 0,
hnugg(s, c0; ρ) = 1, s = 0,

where hcont is any standard correlation function that is continuous in s, and 0 < c0 < 1 is the nugget effect.

TABLE 5.3 Standard corStruct classes for spatial correlation structures

Class      h(s; ρ)
corExp     exp(−s/ρ)
corGaus    exp{−(s/ρ)²}
corLin     (1 − s/ρ)I(s < ρ)
corRatio   1/{1 + (s/ρ)²}
corSpher   {1 − 1.5(s/ρ) + 0.5(s/ρ)³}I(s < ρ)
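The spatial correlation functions in Table 5.3, together with the nugget construction, can be sketched as follows (illustrations of the formulas only):

```python
import numpy as np

def cor_exp(s, rho):
    return np.exp(-s / rho)

def cor_gaus(s, rho):
    return np.exp(-((s / rho) ** 2))

def cor_lin(s, rho):
    return (1 - s / rho) * (s < rho)       # I(s < rho) as a 0/1 factor

def cor_ratio(s, rho):
    return 1 / (1 + (s / rho) ** 2)

def cor_spher(s, rho):
    return (1 - 1.5 * (s / rho) + 0.5 * (s / rho) ** 3) * (s < rho)

def with_nugget(h_cont, s, c0, rho):
    # correlation drops by the nugget c0 as soon as s > 0
    return 1.0 if s == 0 else (1 - c0) * h_cont(s, rho)
```

Each function satisfies h(0; ρ) = 1, and corLin and corSpher vanish once the distance exceeds the range parameter ρ.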

5.4 Examples

5.4.1 Simulated Motorcycle Accident — Revisit

The plot of the observations in Figure 3.8 suggests that the variances may change over time. For the partial spline fit to model (3.53), we compute squared residuals and plot the logarithm of the squared residuals over time in Figure 5.4(a). A cubic spline fit to the logarithm of the squared residuals in Figure 5.4(a) indicates that the constant variance assumption may be violated. We then fit model (3.53) again using PWLS with weights fixed at the estimated variances:

> r <- residuals(mcycle.ps.fit2); y <- log(r**2)
> mcycle.v <- ssr(y~x, cubic(x), spar="m")
> mcycle.ps.fit3 <- update(mcycle.ps.fit2, weights=exp(mcycle.v$fit))
> predict(mcycle.ps.fit3,
    newdata=data.frame(x=grid, s1=(grid-t1)*(grid>t1),
      s2=(grid-t2)*(grid>t2), s3=(grid-t3)*(grid>t3)))

FIGURE 5.4 Motorcycle data, plots of (a) the logarithm of the squared residuals (circles) based on model (3.53) and the cubic spline fit (line) to the logarithm of the squared residuals; and (b) observations (circles), the new PWLS fit (line), and 95% Bayesian confidence intervals (shaded region).

The PWLS fit and 95% Bayesian confidence intervals are shown in Figure 5.4(b). The impact of the unequal variances is reflected in the widths of the confidence intervals. The two-step approach adopted here is crude, and the variation in the estimation of the variance function is ignored in the construction of the confidence intervals. Additional methods for estimating the mean and variance functions will be discussed in Section 6.4.

5.4.2 Ozone in Arosa — Revisit

Figure 2.2 suggests that the variances may not be constant. Based on the fit to the trigonometric spline model (2.75), we calculate residual variances for each month and plot them on the logarithmic scale in Figure 5.5(a). It is obvious that the variation depends on the time of the year. It seems that a simple sinusoidal function can be used to model the variance function.


> v <- sapply(split(arosa.ls.fit$resi,Arosa$month),var)

> a <- sort(unique(Arosa$x))

> b <- lm(log(v)~sin(2*pi*a)+cos(2*pi*a))

> summary(b)

...

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) 5.43715 0.05763 94.341 8.57e-15 ***

sin(2 * pi * a) 0.71786 0.08151 8.807 1.02e-05 ***

cos(2 * pi * a) 0.49854 0.08151 6.117 0.000176 ***

---

Residual standard error: 0.1996 on 9 degrees of freedom

Multiple R-squared: 0.9274, Adjusted R-squared: 0.9113

F-statistic: 57.49 on 2 and 9 DF, p-value: 7.48e-06

FIGURE 5.5 Arosa data, plots of (a) the logarithm of the residual variances (circles) based on the periodic spline fit in Section 2.7, the sinusoidal fit to the logarithm of the squared residuals (dashed line), and the fit from model (5.31) (solid line); and (b) observations (dots), the new PWLS fit (solid line), and 95% Bayesian confidence intervals (shaded region).

The fit of the simple sinusoidal model to the log variance is shown in Figure 5.5(a). We now assume the following variance function for the trigonometric spline model (2.75):

v(x) = exp(ζ1 sin 2πx + ζ2 cos 2πx), (5.31)

and fit the model as follows:


> arosa.ls.fit1 <- ssr(thick~sin(2*pi*x)+cos(2*pi*x),
    rk=lspline(x,type="sine1"), spar="m", data=Arosa,
    weights=varComb(varExp(form=~sin(2*pi*x)),
      varExp(form=~cos(2*pi*x))))

> summary(arosa.ls.fit1)

...

GML estimate(s) of smoothing parameter(s) : 3.675780e-09

Equivalent Degrees of Freedom (DF): 6.84728

Estimate of sigma: 15.22466

Combination of:

Variance function structure of class varExp representing

expon

0.3555942

Variance function structure of class varExp representing

expon

0.2497364

The estimated variance parameters, 0.3556 and 0.2497, are very close (up to a factor of 2, by the definition of v in (5.28)) to those in the sinusoidal model based on residual variances, 0.7179 and 0.4985. The fitted variance function is plotted in Figure 5.5(a); it is almost identical to the fit based on residual variances. Figure 5.5(b) plots the trigonometric spline fit to the mean function with 95% Bayesian confidence intervals. Note that these confidence intervals are conditional on the estimated variance parameters. Thus they may have smaller coverage than the nominal value, since the variation in the estimation of the variance parameters is not accounted for. Nevertheless, we can see that the unequal variances are reflected in the widths of these confidence intervals.
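The factor-of-2 relationship is easy to verify: since Var(ǫi) = σ²v², the coefficients of the least-squares fit to the log residual variances should be roughly twice the PWLS estimates of ζ1 and ζ2. A quick check with the printed values:

```python
import numpy as np

zeta = np.array([0.3556, 0.2497])       # PWLS variance parameters (scale of v)
lm_coef = np.array([0.71786, 0.49854])  # coefficients of the lm fit to log variances
ratio = lm_coef / zeta                  # should be about 2, since Var = sigma^2 v^2
```

Both ratios come out very close to 2, confirming that the two estimation routes agree.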

Observations close in time may be correlated. We now consider a first-order autoregressive structure for the random errors. Since some observations are missing, we use the continuous AR(1) correlation structure h(s; ρ) = ρ^s, where s represents the distance, in terms of calendar time, between observations. We refit model (2.75) using the variance structure (5.31) and the continuous AR(1) correlation structure as follows:

> Arosa$time <- Arosa$month+12*(Arosa$year-1)

> arosa.ls.fit2 <- update(arosa.ls.fit1,

corr=corCAR1(form=~time))

> summary(arosa.ls.fit2)

...

GML estimate(s) of smoothing parameter(s) : 3.575283e-09

Equivalent Degrees of Freedom (DF): 7.327834

Estimate of sigma: 15.31094

Correlation structure of class corCAR1 representing


Phi

0.3411414

Combination of:

Variance function structure of class varExp representing

expon

0.3602905

Variance function structure of class varExp representing

expon

0.3009282

where the variable time represents the continuous calendar time in months.

5.4.3 Beveridge Wheat Price Index

The Beveridge data contain the time series of an annual wheat price index from 1500 to 1869. The time series of the price index on the logarithmic scale is shown in Figure 5.6.

FIGURE 5.6 Beveridge data, observations (circles), the cubic spline fit under the assumption of independent random errors (dashed line), and the cubic spline fit under the AR(1) correlation structure (solid line) with 95% Bayesian confidence intervals (shaded region).


Let x be the year scaled to the interval [0, 1] and y be the logarithm of the price index. Consider the following nonparametric regression model

yi = f(xi) + ǫi, i = 1, . . . , n, (5.32)

where f ∈ W_2^2[0, 1] and ǫi are random errors. Under the assumption that the ǫi are independent with a common variance, model (5.32) can be fitted as follows:

> library(tseries); data(bev)

> y <- log(bev); x <- seq(0,1,len=length(y))

> bev.fit1 <- ssr(y~x, rk=cubic(x))

> summary(bev.fit1)

...

GCV estimate(s) of smoothing parameter(s) : 3.032761e-13

Equivalent Degrees of Freedom (DF): 344.8450

Estimate of sigma: 0.02865570

where the GCV method was used to select the smoothing parameter. The estimate of the smoothing parameter is essentially zero, which indicates that the cubic spline fit interpolates the observations (dashed line in Figure 5.6). The undersmoothing is likely caused by the autocorrelation in the time series. Now consider an AR(1) correlation structure for the random errors:

> bev.fit2 <- update(bev.fit1, spar="m",
    correlation=corAR1(form=~1))

> summary(bev.fit2)

GML estimate(s) of smoothing parameter(s) : 1.249805e-06

Equivalent Degrees of Freedom (DF): 8.091024

Estimate of sigma: 0.2243519

Correlation structure of class corAR1 representing

Phi

0.6936947

The estimate of f and the 95% Bayesian confidence intervals under the AR(1) correlation structure are shown in Figure 5.6 as the solid line and the shaded region.

5.4.4 Lake Acidity

The lake acidity data contain measurements of 112 lakes in the southern Blue Ridge Mountains area. It is of interest to investigate the dependence of the water pH level on calcium concentration and geological location. To match the notation in Chapter 4, we relabel calcium concentration t1


as variable x1, and geological location x1 (latitude) and x2 (longitude) as variables x21 and x22, respectively. Let x2 = (x21, x22).

First, we fit a one-dimensional thin-plate spline to the response variable ph using one independent variable x1 (calcium):

ph(xi1) = f(xi1) + ǫi, (5.33)

where f ∈ W_2^2(R), and ǫi are zero-mean independent random errors with a common variance.

> data(acid)

> acid$x21 <- acid$x1; acid$x22 <- acid$x2

> acid$x1 <- acid$t1

> acid.tp.fit1 <- ssr(ph~x1, rk=tp(x1), data=acid)

> summary(acid.tp.fit1)

...

GCV estimate(s) of smoothing parameter(s) : 3.433083e-06

Equivalent Degrees of Freedom (DF): 8.20945

Estimate of sigma: 0.281299

Number of Observations: 112

> anova(acid.tp.fit1)

Testing H_0: f in the NULL space

test.value simu.size simu.p-value

LMP 0.003250714 100 0.08

GCV 0.008239078 100 0.01

Both p-values, from the LMP and GCV tests, suggest that the departure from a straight-line model is borderline significant. The estimate of the function f in model (5.33) is shown in the left panel of Figure 5.7.

Observations of the pH level that are close in geological location are often correlated. Suppose that we want to model the potential spatial correlation among the random errors in model (5.33) using the exponential spatial correlation structure with a nugget effect for the location variable x2 = (x21, x22). That is, we assume that

hnugg(s, c0; ρ) = (1 − c0) exp(−s/ρ), s > 0,
hnugg(s, c0; ρ) = 1, s = 0,

where c0 is the nugget effect and s is the Euclidean distance between two geological locations. Model (5.33) with the above correlation structure can be fitted as follows:


FIGURE 5.7 Lake acidity data; the left panel includes observations (circles), the fit from model (5.33) (solid line), the fit from model (5.33) with the exponential spatial correlation structure (dashed line), and the estimate of the constant plus the main effect of x1 from model (5.36) (dotted line); the right panel shows the estimate of the main effect of x2 from model (5.36).

> acid.tp.fit2 <- update(acid.tp.fit1, spar="m",
    corr=corExp(form=~x21+x22, nugget=T))

> summary(acid.tp.fit2)

...

GML estimate(s) of smoothing parameter(s) : 53310.63

Equivalent Degrees of Freedom (DF): 2

Estimate of sigma: 0.3131702

Correlation structure of class corExp representing

range nugget

0.02454532 0.62321744

The GML estimate of the smoothing parameter is large, and the spline estimate is essentially a straight line (left panel of Figure 5.7). The smaller smoothing parameter in the first fit, under the independence assumption, might be caused by the spatial correlation. The equivalent degrees of freedom for f have been reduced from 8.2 to 2.

We can also model the effect of geological location directly in the mean function. That is, we consider the mean pH level as a bivariate function of x1 (calcium) and x2 (geological location):

ph(xi1,xi2) = f(xi1,xi2) + ǫi, (5.34)

where ǫi are zero-mean independent random errors with a common variance. One possible model space for x1 is W_2^2(R), and one possible model space for x2 is W_2^2(R²). Therefore, we consider the tensor product space W_2^2(R) ⊗ W_2^2(R²). Define the averaging operators

A_1^{(1)} f = Σ_{j=1}^{J1} w_{j1} f(u_{j1}),
A_2^{(1)} f = Σ_{j=1}^{J1} w_{j1} f(u_{j1}) φ12(u_{j1}) φ12,
A_1^{(2)} f = Σ_{j=1}^{J2} w_{j2} f(u_{j2}),
A_2^{(2)} f = Σ_{j=1}^{J2} w_{j2} f(u_{j2}) {φ22(u_{j2}) φ22 + φ23(u_{j2}) φ23},

where u_{j1} and u_{j2} are fixed points in R and R², and w_{j1} and w_{j2} are fixed positive weights such that Σ_{j=1}^{J1} w_{j1} = Σ_{j=1}^{J2} w_{j2} = 1; φ11 = 1 and φ12 are orthonormal bases for the null space in W_2^2(R) based on the norm (2.41); φ21 = 1, and φ22 and φ23 are orthonormal bases for the null space in W_2^2(R²) based on the norm (2.41). Let A_3^{(1)} = I − A_1^{(1)} − A_2^{(1)} and A_3^{(2)} = I − A_1^{(2)} − A_2^{(2)}. Then we have the following SS ANOVA decomposition:

f = {A_1^{(1)} + A_2^{(1)} + A_3^{(1)}}{A_1^{(2)} + A_2^{(2)} + A_3^{(2)}} f
  = A_1^{(1)}A_1^{(2)} f + A_2^{(1)}A_1^{(2)} f + A_3^{(1)}A_1^{(2)} f
  + A_1^{(1)}A_2^{(2)} f + A_2^{(1)}A_2^{(2)} f + A_3^{(1)}A_2^{(2)} f
  + A_1^{(1)}A_3^{(2)} f + A_2^{(1)}A_3^{(2)} f + A_3^{(1)}A_3^{(2)} f
  ≜ µ + β1 φ12(x1) + β2 φ22(x2) + β3 φ23(x2) + β4 φ12(x1)φ22(x2)
  + β5 φ12(x1)φ23(x2) + f_1^s(x1) + f_2^s(x2) + f_{12}^{ls}(x1, x2)
  + f_{12}^{sl}(x1, x2) + f_{12}^{ss}(x1, x2). (5.35)

Due to the small sample size, we ignore all interactions and consider the following additive model

yi = µ + β1 φ12(xi1) + β2 φ22(xi2) + β3 φ23(xi2) + f_1^s(xi1) + f_2^s(xi2) + ǫi, (5.36)

where ǫi are zero-mean independent random errors with a common variance. We fit model (5.36) as follows:


> acid.ssanova.fit <- ssr(ph~x1+x21+x22,
    rk=list(tp(x1), tp(list(x21,x22))),
    data=acid, spar="m")

> summary(acid.ssanova.fit)

...

GML estimate(s) of smoothing parameter(s) : 4.819636e-01

2.870235e-07

Equivalent Degrees of Freedom (DF): 8.768487

Estimate of sigma: 0.2560684

The estimate of the main effect of x1 plus the constant, µ + β1 φ12(x1) + f_1^s(x1), is shown in the left panel of Figure 5.7. This estimate is almost identical to that from model (5.33) with the exponential spatial correlation structure. The estimate of the main effect of x2 is shown in the right panel of Figure 5.7.

An alternative approach to modeling the effect of geological locationusing mixed-effects will be discussed in Section 9.4.2.


Chapter 6

Generalized Smoothing Spline ANOVA

6.1 Generalized SS ANOVA Models

Generalized linear models (GLMs) (McCullagh and Nelder 1989) provide a unified framework for the analysis of data from exponential families. Denote (xi, yi) for i = 1, . . . , n as independent observations on the independent variables x = (x1, . . . , xd) and the dependent variable y. Assume that the yi are generated from a distribution in the exponential family with density function

g(yi; fi, φ) = exp{[yi h(fi) − b(fi)]/ai(φ) + c(yi, φ)}, (6.1)

where fi = f(xi), h(fi) is a monotone transformation of fi known as the canonical parameter, and φ is a dispersion parameter. The function f models the effect of the independent variables x. Denote µi = E(yi), Gc as the canonical link such that Gc(µi) = h(fi), and G as the link function such that G(µi) = fi. Then h = Gc ◦ G^{−1}, and it reduces to the identity function when the canonical link is chosen for G. The last term c(yi, φ) in (6.1) is independent of f.

A GLM assumes that f(x) = x^T β. As with linear models for Gaussian data, the parametric GLM may be too restrictive for some applications. We consider the nonparametric extension of the GLM in this chapter. In addition to providing more flexible models, the nonparametric extension also provides model-building and diagnostic methods for GLMs.

Let the domain of each covariate xk be an arbitrary set Xk. Consider f as a multivariate function on the product domain X = X1 × X2 × · · · × Xd. The SS ANOVA decomposition introduced in Chapter 4 may be applied to construct candidate model spaces for f. In particular, we will assume that f ∈ M, where

M = H0 ⊕ H1 ⊕ · · · ⊕ Hq



is an SS ANOVA model space defined in (4.30), H0 = span{φ1, . . . , φp} is a finite-dimensional space collecting all functions that are not penalized, and Hj for j = 1, . . . , q are orthogonal RKHS's with RKs Rj. The same notation as in Chapter 4 will be used in this chapter.

We assume the same density function (6.1) for yi. However, for generality, we assume that f is observed through some linear functionals. Specifically, fi = Lif, where Li are known bounded linear functionals.

6.2 Estimation and Inference

6.2.1 Penalized Likelihood Estimation

Assume that ai(φ) = a(φ)/ωi for i = 1, . . . , n, where ωi are known constants. Denote σ² = a(φ), y = (y1, . . . , yn)ᵀ, and f = (f1, . . . , fn)ᵀ. Let

li(fi) = ωi{b(fi) − yi h(fi)},   i = 1, . . . , n,   (6.2)

and l(f) = Σ_{i=1}^n li(fi). Then the log-likelihood

Σ_{i=1}^n log g(yi; fi, φ) = Σ_{i=1}^n { [yi h(fi) − b(fi)]/ai(φ) + c(yi, φ) } = −(1/σ²) l(f) + C,   (6.3)

where C is independent of f. Therefore, up to an additive and a multiplying constant, l(f) is the negative log-likelihood. Note that l(f) is independent of the dispersion parameter.

For a GLM, the MLEs of the parameters β are the maximizers of the log-likelihood. For a generalized SS ANOVA model, as in the Gaussian case, a penalty term is necessary to avoid overfitting. We will use the same form of penalty as in Chapter 4. Specifically, we estimate f as the solution of the following penalized likelihood (PL) problem:

min_{f∈M} { l(f) + (n/2) Σ_{j=1}^q λj ‖Pj f‖² },   (6.4)

where Pj is the orthogonal projector in M onto Hj, and λj are smoothing parameters. The multiplying term 1/σ² is absorbed into the smoothing parameters, and the constant C is dropped since it is independent of f. The multiplying constant n/2 is added such that the PL reduces to the


Generalized Smoothing Spline ANOVA 165

PLS (4.32) for Gaussian data. Under the new inner product defined in (4.33), the PL can be rewritten as

min_{f∈M} { l(f) + (nλ/2) ‖P∗1 f‖²∗ },   (6.5)

where λj = λ/θj, and P∗1 = Σ_{j=1}^q Pj is the orthogonal projection in M onto H∗1 = ⊕_{j=1}^q Hj. Note that the RK of H∗1 under the new inner product is R∗1 = Σ_{j=1}^q θj Rj.

It is easy to check that l(f) is convex in f under the canonical link. In general, we assume that l(f) is convex in f and has a unique minimizer in H0. Then the PL (6.5) has a unique minimizer (Theorem 2.9 in Gu (2002)). We now show that the Kimeldorf–Wahba representer theorem holds. Let R0 be the RK of H0 and R = R0 + R∗1. Let ηi be the representer associated with Li. Then, from (2.12),

ηi(x) = Li(z)R(x, z) = Li(z)R0(x, z) + Li(z)R∗1(x, z) ≜ δi(x) + ξi(x).

That is, the representers ηi for Li belong to the finite dimensional subspace S = H0 ⊕ span{ξ1, . . . , ξn}. Let Sᶜ be the orthogonal complement of S. Any f ∈ H can be decomposed into f = ς1 + ς2, where ς1 ∈ S and ς2 ∈ Sᶜ. Then we have

Lif = (ηi, f) = (ηi, ς1) + (ηi, ς2) = (ηi, ς1) = Liς1.

Consequently, for any f ∈ H, the PL (6.5) satisfies

Σ_{i=1}^n li(Lif) + (nλ/2)‖P∗1 f‖²∗ = Σ_{i=1}^n li(Liς1) + (nλ/2)(‖P∗1 ς1‖²∗ + ‖P∗1 ς2‖²∗)
                                    ≥ Σ_{i=1}^n li(Liς1) + (nλ/2)‖P∗1 ς1‖²∗,

where equality holds iff ‖P∗1 ς2‖∗ = ‖ς2‖∗ = 0. Thus, the minimizer of the PL falls in the finite dimensional space S, and it can be represented as

f(x) = Σ_{ν=1}^p dν φν(x) + Σ_{i=1}^n ci ξi(x)
     = Σ_{ν=1}^p dν φν(x) + Σ_{i=1}^n ci Σ_{j=1}^q θj Li(z)Rj(x, z).   (6.6)


For simplicity of notation, the dependence of f on the smoothing parameters λ and θ = (θ1, . . . , θq) is not expressed explicitly. Let T = {Liφν} for i = 1, . . . , n and ν = 1, . . . , p, Σk = {LiLjRk} for i, j = 1, . . . , n and k = 1, . . . , q, and Σθ = θ1Σ1 + · · · + θqΣq. Let d = (d1, . . . , dp)ᵀ and c = (c1, . . . , cn)ᵀ. Denote f = (L1f, . . . , Lnf)ᵀ. Then f = Td + Σθc. Note that ‖P∗1 f‖²∗ = cᵀΣθc. Substituting (6.6) into (6.5), we need to solve for c and d by minimizing

I(c, d) = l(Td + Σθc) + (nλ/2) cᵀΣθc.   (6.7)

Except for the Gaussian distribution, the function l(f) is not quadratic, and (6.7) cannot be solved directly. For fixed λ and θ, we will apply the Newton–Raphson procedure to compute c and d. Let ui = dli/dfi and wi = d²li/dfi², where fi = Lif. Let u = (u1, . . . , un)ᵀ and W = diag(w1, . . . , wn). Then

∂I/∂c = Σθu + nλΣθc,   ∂I/∂d = Tᵀu,
∂²I/∂c∂cᵀ = ΣθWΣθ + nλΣθ,   ∂²I/∂c∂dᵀ = ΣθWT,   ∂²I/∂d∂dᵀ = TᵀWT.

The Newton–Raphson procedure iteratively solves the linear system

[ ΣθW−Σθ + nλΣθ   ΣθW−T ] [ c − c− ]   [ −Σθu− − nλΣθc− ]
[ TᵀW−Σθ          TᵀW−T ] [ d − d− ] = [ −Tᵀu−           ],   (6.8)

where the subscript minus indicates quantities evaluated at the previous Newton–Raphson iteration. The equations in (6.8) can be rewritten as

(ΣθW−Σθ + nλΣθ)c + ΣθW−Td = ΣθW−f− − Σθu−,
TᵀW−Σθc + TᵀW−Td = TᵀW−f− − Tᵀu−,   (6.9)

where f− = Td− + Σθc−. As discussed in Section 2.4, it is only necessary to derive one set of solutions. Let ỹ = f− − W−⁻¹u−. It is easy to see that a solution to

(Σθ + nλW−⁻¹)c + Td = ỹ,
Tᵀc = 0,   (6.10)


is also a solution to (6.9). Note that W− is known at the current Newton–Raphson iteration. Since the equations in (6.10) have the same form as those in (5.6), the methods in Section 5.2.1 can be used to solve (6.10). Furthermore, (6.10) corresponds to the minimizer of the following PWLS problem:

(1/n) Σ_{i=1}^n wi−(ỹi − fi)² + Σ_{j=1}^q λj‖Pjf‖²,   (6.11)

where ỹi is the ith element of ỹ. Therefore, at each iteration, the Newton–Raphson procedure solves the PWLS criterion with working variables ỹi and working weights wi−. Consequently, the procedure can be regarded as iteratively reweighted PLS.
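The claim that each Newton–Raphson update is a penalized weighted least squares fit to working data can be checked on a deliberately tiny example. The sketch below (Python; a one-coefficient logistic model with a ridge penalty standing in for the spline penalty, all names ours) computes one Newton step directly and then recomputes it as a weighted least squares step on the working responses ỹi = fi − ui/wi; the two agree to machine precision.

```python
import math

def logistic_newton_step(beta, x, y, nlam):
    # One Newton-Raphson step for the penalized likelihood
    #   l(beta) + (nlam/2)*beta^2, with l = sum_i {-y_i f_i + log(1+exp(f_i))}
    # and f_i = beta*x_i (a toy stand-in for the spline problem).
    p = [1 / (1 + math.exp(-beta * xi)) for xi in x]
    u = [pi - yi for pi, yi in zip(p, y)]        # u_i = dl_i/df_i
    w = [pi * (1 - pi) for pi in p]              # w_i = d2 l_i/df_i^2
    grad = sum(ui * xi for ui, xi in zip(u, x)) + nlam * beta
    hess = sum(wi * xi * xi for wi, xi in zip(w, x)) + nlam
    return beta - grad / hess, u, w

def reweighted_pls_step(beta, x, y, nlam):
    # The same update written as penalized *weighted* least squares on
    # the working responses ytilde_i = f_i - u_i/w_i with weights w_i.
    _, u, w = logistic_newton_step(beta, x, y, nlam)
    yt = [beta * xi - ui / wi for xi, ui, wi in zip(x, u, w)]
    num = sum(wi * xi * yi for wi, xi, yi in zip(w, x, yt))
    den = sum(wi * xi * xi for wi, xi in zip(w, x)) + nlam
    return num / den

x = [0.1, 0.4, 0.7, 1.0]
y = [0, 0, 1, 1]
b_newton, _, _ = logistic_newton_step(0.0, x, y, nlam=0.5)
b_pwls = reweighted_pls_step(0.0, x, y, nlam=0.5)
print(abs(b_newton - b_pwls) < 1e-12)
```

This only illustrates the identity; the actual algorithm solves the system (6.10) in the coefficients c and d.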

6.2.2 Selection of Smoothing Parameters

With the canonical link such that h(f) = f, we have E(yi) = b′(fi), Var(yi) = b′′(fi)ai(φ), ui = ωi{b′(fi) − yi}, and wi = ωib′′(fi). Therefore, E(ui/wi) = 0 and Var(ui/wi) = σ²wi⁻¹. Consequently, when f− is close to f and under some regularity conditions, it can be shown (Wang 1994, Gu 2002) that the working variables approximately have the same structure as in (5.1):

ỹi = Lif + ǫi + op(1),   (6.12)

where ǫi has mean 0 and variance σ²wi⁻¹. The Newton–Raphson procedure essentially reformulates the problem to model f on working variables at each iteration.

From the above discussion, and noting that W is known at the current Newton–Raphson iteration, we can use the UBR, GCV, and GML methods discussed in Section 5.2.3 to select smoothing parameters at each step of the Newton–Raphson procedure. Since the working data are reformulated at each iteration, the target criteria of the above iterative smoothing parameter selection methods change throughout the iterations. Therefore, the overall target criteria of these iterative methods are not explicitly defined. A justification for the UBR criterion can be found in Gu (2002).

One practical problem with the iterative methods for selecting smoothing parameters is that they are not guaranteed to converge. Nevertheless, extensive simulations indicate that, in general, the above algorithm works reasonably well in practice (Wang 1994, Wang, Wahba, Chappell and Gu 1995).

Nonconvergence may become a serious problem for certain applications (Liu, Tong and Wang 2007). Some direct noniterative methods


for selecting smoothing parameters have been proposed. For the Poisson and gamma distributions, it is possible to derive unbiased estimates of the symmetrized Kullback–Leibler discrepancy (Wong 2006, Liu et al. 2007). Xiang and Wahba (1996) proposed a direct GCV method. Details about the direct GCV method can be found in Gu (2002). A direct GML method will be discussed in Section 6.2.4. These direct methods are usually more computationally intensive. They have not been implemented in the current version of the assist package. It is not difficult to write R functions to implement these direct methods. A simple implementation of the direct GML method for the gamma distribution will be given in Sections 6.4 and 6.5.3.

6.2.3 Algorithm and Implementation

The whole procedure discussed in Sections 6.2.1 and 6.2.2 is summarized in the following algorithm.

Algorithm for generalized SS ANOVA models

1. Compute the matrices T and Σk for k = 1, . . . , q, and set an initial value for f.

2. Compute u−, W−, T̃ = W−^{1/2}T, and Σ̃k = W−^{1/2}ΣkW−^{1/2} for k = 1, . . . , q, form the transformed working data W−^{1/2}ỹ, and fit the transformed data with smoothing parameters selected by the UBR, GCV, or GML method.

3. Iterate step 2 until convergence.

The above algorithm is easy to implement. All we need to do is compute the quantities ui and wi. We now compute these quantities for some special distributions.

First consider logistic regression with the logit link function. Assume that y ∼ Binomial(m, p) with density function

g(y) = (m choose y) p^y (1 − p)^{m−y},   y = 0, . . . , m.

Then σ² = 1 and li = −yifi + mi log(1 + exp(fi)), where fi = log(pi/(1 − pi)). Consequently, ui = −yi + mipi and wi = mipi(1 − pi). Note that binary data are a special case with mi = 1.

Next consider Poisson regression with the log link function. Assume that y ∼ Poisson(µ) with density function

g(y) = µ^y exp(−µ)/y!,   y = 0, 1, . . . .


Then σ² = 1 and li = −yifi + exp(fi), where fi = log µi. Consequently, ui = −yi + µi and wi = µi.

Last, consider gamma regression with the log link function. Assume that y ∼ Gamma(α, β) with density function

g(y) = y^{α−1} exp(−y/β)/(Γ(α)β^α),   y > 0,

where α > 0 and β > 0 are the shape and scale parameters. We are interested in modeling the mean µ ≜ E(y) as a function of the independent variables. We assume that the shape parameter does not depend on the independent variables. Note that µ = αβ. The density function may be reparametrized as

g(y) = α^α y^{α−1} exp(−αy/µ)/(Γ(α)µ^α),   y > 0.

The canonical parameter −µ⁻¹ is negative. To avoid this constraint, we adopt the log link. Then σ² = α⁻¹ and li = yi exp(−fi) + fi, where fi = log µi. Consequently, ui = −yi/µi + 1 and wi = yi/µi. Since the wi are nonnegative, l(f) is a convex function of f.
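The ui and wi formulas above are the first and second derivatives of li, so they can be verified by finite differences. A hedged sketch (Python; ωi = mi = 1, function names ours):

```python
import math

def l_and_derivs(family, y, f):
    # l_i(f) with u_i = l_i' and w_i = l_i'' for the three families
    # discussed in the text (canonical logit link for the binomial,
    # log links for Poisson and gamma; omega_i = m_i = 1).
    if family == "binomial":          # f = logit(p)
        p = 1 / (1 + math.exp(-f))
        return -y * f + math.log(1 + math.exp(f)), -y + p, p * (1 - p)
    if family == "poisson":           # f = log(mu)
        return -y * f + math.exp(f), -y + math.exp(f), math.exp(f)
    if family == "gamma":             # f = log(mu)
        mu = math.exp(f)
        return y / mu + f, -y / mu + 1, y / mu
    raise ValueError(family)

def check(family, y, f, h=1e-5):
    # Compare u and w against central finite differences of l.
    l0, u, w = l_and_derivs(family, y, f)
    lp, _, _ = l_and_derivs(family, y, f + h)
    lm, _, _ = l_and_derivs(family, y, f - h)
    du = (lp - lm) / (2 * h)
    dw = (lp - 2 * l0 + lm) / h ** 2
    return abs(du - u) < 1e-5 and abs(dw - w) < 1e-4

print(all(check(fam, y, 0.3)
          for fam, y in [("binomial", 1), ("poisson", 4), ("gamma", 2.5)]))
```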

For the binomial and Poisson distributions, the dispersion parameter is fixed at σ² = 1. For the gamma distribution, the dispersion parameter is σ² = α⁻¹. Since this constant has been separated from the definition of li, we can set σ² = 1 in the UBR criterion. Therefore, for the binomial, Poisson, and gamma distributions, the UBR criterion reduces to

UBR(λ, θ) = (1/n)‖(I − H)W−^{1/2}ỹ‖² + (2/n) tr H,   (6.13)

where H is the hat matrix associated with the transformed data.

In general, the weighted average of residuals

σ²− = (1/n) Σ_{i=1}^n wi−(ui−/wi−)² = (1/n) Σ_{i=1}^n u²i−/wi−   (6.14)

provides an estimate of σ² at the current iteration when it is unknown. Then the UBR criterion reduces to

UBR(λ, θ) = (1/n)‖(I − H)W−^{1/2}ỹ‖² + (2σ²−/n) tr H.   (6.15)

There are two versions of the UBR criterion, given in equations (6.13) and (6.15). The first version is favorable when σ² is known.
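The structure of the UBR criterion can be illustrated with a toy smoother whose hat matrix is available in closed form. The sketch below (Python; a one-basis ridge smoother stands in for the smoothing spline hat matrix of the transformed data, and all names are ours) computes the residual term plus the 2σ² tr(H)/n penalty; as λ → ∞ the fit shrinks to zero and UBR approaches ‖y‖²/n.

```python
def ubr_toy(lam, x, y, sigma2=1.0):
    # UBR for a toy one-basis ridge smoother with hat matrix
    # H = x x^T / (x^T x + n*lam); H stands in for the smoothing
    # spline hat matrix appearing in (6.13)/(6.15).
    n = len(x)
    s = sum(xi * xi for xi in x) + n * lam
    xy = sum(xi * yi for xi, yi in zip(x, y))
    fhat = [xi * xy / s for xi in x]                 # H y
    rss = sum((yi - fi) ** 2 for yi, fi in zip(y, fhat))
    tr_h = sum(xi * xi for xi in x) / s
    return rss / n + 2 * sigma2 * tr_h / n

x = [1.0, 2.0, 3.0]
y = [1.1, 1.9, 3.2]
# As lam -> infinity the fit shrinks to zero and UBR -> ||y||^2 / n.
print(abs(ubr_toy(1e8, x, y) - sum(yi ** 2 for yi in y) / len(y)) < 1e-4)
```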


The above algorithm is implemented in the ssr function for the binomial, Poisson, and gamma distributions based on a collection of Fortran subroutines called GRKPACK (Wang 1997). The specific distribution is specified by the family argument. The method for selecting smoothing parameters is specified by the argument spar, with "u", "v", and "m" representing the UBR, GCV, and GML criteria defined in (6.15), (5.25), and (5.26), respectively. The UBR method with fixed dispersion parameter (6.13) is specified as spar="u" together with the option varht for specifying the fixed dispersion parameter. Specifically, for the binomial, Poisson, and gamma distributions with σ² = 1, the combination spar="u" and varht=1 is used.

6.2.4 Bayes Model, Direct GML, and Approximate Bayesian Confidence Intervals

Suppose observations yi are generated from (6.1) with fi = Lif. Assume the same prior for f as in (4.42):

F(x) = Σ_{ν=1}^p ζν φν(x) + δ^{1/2} Σ_{j=1}^q θj^{1/2} Uj(x),   (6.16)

where ζν iid∼ N(0, κ); Uj(x) are independent, zero-mean Gaussian stochastic processes with covariance functions Rj(x, z); ζν and Uj(x) are mutually independent; and κ and δ are positive constants. Conditional on ζ = (ζ1, . . . , ζp)ᵀ, f|ζ ∼ N(Tζ, δΣθ). Letting κ → ∞ and integrating out ζ, Gu (1992) showed that the marginal density of f is

p(f) ∝ exp{ −(1/2δ) fᵀ( Σθ⁺ − Σθ⁺T(TᵀΣθ⁺T)⁻¹TᵀΣθ⁺ )f },

where Σθ⁺ is the Moore–Penrose inverse of Σθ. The marginal density of y,

p(y) = ∫ p(y|f)p(f)df,   (6.17)

usually does not have a closed form since, except for the Gaussian distribution, the log-likelihood log p(y|f) is not quadratic in f. Note that

log p(y|f) = Σ_{i=1}^n log g(yi; fi, φ) = −(1/σ²) l(f) + C.

We now approximate l(f) by a quadratic function.

Let uic and wic be ui and wi evaluated at f̂. Let uc = (u1c, . . . , unc)ᵀ, Wc = diag(w1c, . . . , wnc), and yc = f̂ − Wc⁻¹uc. Note that ∂l(f)/∂f evaluated at f̂ equals uc, and ∂²l(f)/∂f∂fᵀ evaluated at f̂ equals Wc. Expanding l(f) as a function of f around the fitted values f̂ to the second order leads to

l(f) ≈ (1/2)(f − yc)ᵀWc(f − yc) + l(f̂) − (1/2)ucᵀWc⁻¹uc.   (6.18)
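For a single Poisson observation the quadratic expansion of l around the fitted value can be checked directly: the quadratic built from uc, Wc, and yc matches l at the expansion point and has the same slope there. A minimal sketch (Python, names ours):

```python
import math

def l_pois(y, f):
    # A single Poisson term: l(f) = -y*f + exp(f)  (log link)
    return -y * f + math.exp(f)

def quad_approx(y, fhat):
    # Second-order expansion of l around fhat written in the
    # working-response form with yc = fhat - u/w.
    u = -y + math.exp(fhat)          # l'(fhat)
    w = math.exp(fhat)               # l''(fhat)
    yc = fhat - u / w
    const = l_pois(y, fhat) - 0.5 * u * u / w
    return lambda f: 0.5 * w * (f - yc) ** 2 + const

y, fhat, h = 4.0, 1.2, 1e-6
q = quad_approx(y, fhat)
ok_val = abs(q(fhat) - l_pois(y, fhat)) < 1e-12
ok_slope = abs((q(fhat + h) - q(fhat - h)) -
               (l_pois(y, fhat + h) - l_pois(y, fhat - h))) < 1e-9
print(ok_val and ok_slope)
```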

Note that log p(f) is quadratic in f. Then it can be shown that, applying approximation (6.18), the marginal density of y in (6.17) is approximately proportional to (Liu, Meiring and Wang 2005)

p(y) ∝ |Wc|^{−1/2} |V|^{−1/2} |TᵀV⁻¹T|^{−1/2} p(y|f̂) exp{ (1/(2σ²)) ucᵀWc⁻¹uc }
     × exp{ −(1/2) ycᵀ( V⁻¹ − V⁻¹T(TᵀV⁻¹T)⁻¹TᵀV⁻¹ )yc },   (6.19)

where V = δΣθ + σ²Wc⁻¹. When Σθ is nonsingular, f̂ is the maximizer of the integrand p(y|f)p(f) in (6.17) (Gu 2002). In this case the foregoing approximation is simply the Laplace approximation.

Let ỹc = Wc^{1/2}yc, Σ̃θ = Wc^{1/2}ΣθWc^{1/2}, and T̃ = Wc^{1/2}T, and let the QR decomposition of T̃ be (Q1 Q2)(Rᵀ 0)ᵀ. Let UEUᵀ be the eigendecomposition of Q2ᵀΣ̃θQ2, where E = diag(e1, . . . , en−p) and e1 ≥ e2 ≥ · · · ≥ en−p are the eigenvalues. Let z = (z1, . . . , zn−p)ᵀ ≜ UᵀQ2ᵀỹc. Then it can be shown (Liu et al. 2005) that (6.19) is equivalent to

p(y) ∝ |R|⁻¹ p(y|f̂) exp{ (1/(2σ²)) ucᵀWc⁻¹uc }
     × Π_{ν=1}^{n−p} (δeν + σ²)^{−1/2} exp{ −(1/2) Σ_{ν=1}^{n−p} zν²/(δeν + σ²) }.

Let δ = σ²/nλ. Then an approximation of the negative log marginal likelihood is

DGML(λ, θ) = log|R| + (1/σ²) l(f̂) − (1/(2σ²)) ucᵀWc⁻¹uc
           + (1/2) Σ_{ν=1}^{n−p} { log(eν/nλ + 1) + zν²/[σ²(eν/nλ + 1)] }.   (6.20)

Notice that f̂, uc, Wc, R, eν, and zν all depend on λ and θ even though the dependencies are not expressed explicitly. The function DGML(λ, θ) is referred to as the direct generalized maximum likelihood (DGML) criterion, and the minimizers of DGML(λ, θ) are called the DGML estimates of the smoothing parameters. Section 6.4 shows a simple implementation of the DGML criterion for the gamma distribution.


Let yc = (y1c, . . . , ync)ᵀ. Based on (6.12), consider the approximation model at convergence

yic = Lif + ǫi,   i = 1, . . . , n,   (6.21)

where ǫi has mean 0 and variance σ²wic⁻¹. Assume prior (6.16) for f.

Then, as in Section 5.2.5, Bayesian confidence intervals can be constructed for f in the approximation model (6.21). They provide approximate confidence intervals for f in the generalized SS ANOVA model. The bootstrap approach may also be used to construct confidence intervals, and the extension is straightforward.

Connections between smoothing spline models and LME models are presented in Sections 3.5, 4.7, and 5.2.4. We now extend this connection to data from exponential families. Consider the following generalized linear mixed-effects model (GLMM) (Breslow and Clayton 1993):

G{E(y|u)} = Td+ Zu, (6.22)

where G is the link function, d are fixed effects, u = (u1ᵀ, . . . , uqᵀ)ᵀ are random effects, Z = (In, . . . , In), uk ∼ N(0, σ²θkΣk/nλ) for k = 1, . . . , q, and the uk are mutually independent. Then u ∼ N(0, σ²D/nλ), where D = diag(θ1Σ1, . . . , θqΣq). Setting uk = θkΣkc as in the Gaussian case (Opsomer et al. 2001) and noting that ZDZᵀ = Σθ, we have u = DZᵀc and uᵀ{Cov(u)}⁺u = nλcᵀZDD⁺DZᵀc/σ² = nλcᵀZDZᵀc/σ² = nλcᵀΣθc/σ². Note that Zu = Σθc. Therefore, the PL (6.7) is the same as the penalized quasi-likelihood (PQL) of the GLMM (6.22) (equation (6) in Breslow and Clayton (1993)).

6.3 Wisconsin Epidemiological Study of Diabetic Retinopathy

We use the Wisconsin Epidemiological Study of Diabetic Retinopathy (WESDR) data to illustrate how to fit an SS ANOVA model to binary responses. Based on Wahba, Wang, Gu, Klein and Klein (1995), we investigate how the probability of progression to diabetic retinopathy at the first follow-up (prg) depends on the following covariates at baseline: duration of diabetes (dur), glycosylated hemoglobin (gly), and body mass index (bmi).

Let y be the response variable prg, where y = 1 represents progression of retinopathy and y = 0 otherwise. Let x1, x2, and x3 be the covariates dur, gly, and bmi transformed into [0, 1]. Let x = (x1, x2, x3) and


f(x) = logit P(y = 1|x). We model f using the tensor product space W₂²[0, 1] ⊗ W₂²[0, 1] ⊗ W₂²[0, 1]. The three-way SS ANOVA decomposition can be derived similarly using the method in Chapter 4. For simplicity, we will ignore the three-way interaction and start with the following SS ANOVA model with all two-way interactions:

f(x) = µ + β1 × (x1 − 0.5) + β2 × (x2 − 0.5) + β3 × (x3 − 0.5)
     + β4 × (x1 − 0.5)(x2 − 0.5) + β5 × (x1 − 0.5)(x3 − 0.5)
     + β6 × (x2 − 0.5)(x3 − 0.5) + f^s_1(x1) + f^s_2(x2) + f^s_3(x3)
     + f^{ls}_{12}(x1, x2) + f^{sl}_{12}(x1, x2) + f^{ss}_{12}(x1, x2)
     + f^{ls}_{13}(x1, x3) + f^{sl}_{13}(x1, x3) + f^{ss}_{13}(x1, x3)
     + f^{ls}_{23}(x2, x3) + f^{sl}_{23}(x2, x3) + f^{ss}_{23}(x2, x3).   (6.23)

The following statements fit model (6.23) with smoothing parameters selected by the UBR method:

> data(wesdr); attach(wesdr)

> y <- prg

> x1 <- (dur-min(dur))/diff(range(dur))

> x2 <- (gly-min(gly))/diff(range(gly))

> x3 <- (bmi-min(bmi))/diff(range(bmi))

> wesdr.fit1 <- ssr(y~I(x1-.5)+I(x2-.5)+I(x3-.5)+
     I((x1-.5)*(x2-.5))+I((x1-.5)*(x3-.5))+
     I((x2-.5)*(x3-.5)),
     rk=list(cubic(x1), cubic(x2), cubic(x3),
        rk.prod(kron(x1-.5),cubic(x2)),
        rk.prod(kron(x2-.5),cubic(x1)),
        rk.prod(cubic(x1),cubic(x2)),
        rk.prod(kron(x1-.5),cubic(x3)),
        rk.prod(kron(x3-.5),cubic(x1)),
        rk.prod(cubic(x1),cubic(x3)),
        rk.prod(kron(x2-.5),cubic(x3)),
        rk.prod(kron(x3-.5),cubic(x2)),
        rk.prod(cubic(x2),cubic(x3))),
     family="binary", spar="u", varht=1)

> summary(wesdr.fit1)

...

UBR estimate(s) of smoothing parameter(s) :

6.913248e-06 1.920409e+04 9.516636e-01 2.966542e+03

6.005694e+02 6.345814e+01 5.602521e+02 2.472658e+02

1.816387e-07 9.820496e-07 1.481754e+03 2.789458e-07

Equivalent Degrees of Freedom (DF): 12.16


Components corresponding to large values of the smoothing parameters are small. As in Section 4.9.2, we compute the Euclidean norm of the estimate of each component centered around zero:

> norm.cen <- function(x) sqrt(sum((x-mean(x))**2))

> comp.est1 <- predict(wesdr.fit1, pstd=F,

terms=diag(rep(1,19))[-1,])

> comp.norm1 <- apply(comp.est1$fit, 2, norm.cen)

> print(round(comp.norm1,2))

9.13 44.02 15.25 5.40 14.29 21.88 8.51 0.00 0.00

0.00 0.00 0.00 0.00 0.00 5.17 5.39 0.00 2.41

Both the estimates of the smoothing parameters and the norms indicate that the interaction between dur (x1) and gly (x2) can be dropped. Therefore, we fit the SS ANOVA model (6.23) with the components f^{ls}_{12}, f^{sl}_{12}, and f^{ss}_{12} eliminated:

> wesdr.fit2 <- update(wesdr.fit1,

rk=list(cubic(x1), cubic(x2), cubic(x3),

rk.prod(kron(x1-.5),cubic(x3)),

rk.prod(kron(x3-.5),cubic(x1)),

rk.prod(cubic(x1),cubic(x3)),

rk.prod(kron(x2-.5),cubic(x3)),

rk.prod(kron(x3-.5),cubic(x2)),

rk.prod(cubic(x2),cubic(x3))))

> comp.est2 <- predict(wesdr.fit2,

terms=diag(rep(1,16))[-1,], pstd=F)

> comp.norm2 <- apply(comp.est2$fit, 2, norm.cen)

> print(round(comp.norm2,2))

9.13 44.02 15.25 5.40 14.29 21.88 8.51 0.00 0.00

0.00 0.00 5.17 5.39 0.00 2.41

Compared to the other components, the norms of the nonparametric interactions f^{ls}_{13}, f^{sl}_{13}, f^{ss}_{13}, f^{ls}_{23}, f^{sl}_{23}, and f^{ss}_{23} are relatively small. We further compute 95% Bayesian confidence intervals for the overall nonparametric interaction between x1 and x3, f^{ls}_{13} + f^{sl}_{13} + f^{ss}_{13}, the overall nonparametric interaction between x2 and x3, f^{ls}_{23} + f^{sl}_{23} + f^{ss}_{23}, and the proportion of design points for which zero is outside these confidence intervals:

> int.dur.bmi <- predict(wesdr.fit2,

terms=c(rep(0,10),rep(1,3),rep(0,3)))

> mean((int.dur.bmi$fit-1.96*int.dur.bmi$pstd>0)|

(int.dur.bmi$fit+1.96*int.dur.bmi$pstd<0))

0


> int.gly.bmi <- predict(wesdr.fit2,

terms=c(rep(0,13),rep(1,3)))

> mean((int.gly.bmi$fit-1.96*int.gly.bmi$pstd>0)|

(int.gly.bmi$fit+1.96*int.gly.bmi$pstd<0))

0

Therefore, for both overall nonparametric interactions, the 95% confidence intervals contain zero at all design points. This suggests that the nonparametric interactions may be dropped. We fit the SS ANOVA model (6.23) with all nonparametric interactions eliminated:

> wesdr.fit3 <- update(wesdr.fit2,

rk=list(cubic(x1), cubic(x2), cubic(x3)))

> summary(wesdr.fit3)

UBR estimate(s) of smoothing parameter(s) :

4.902745e-06 5.474122e+00 1.108322e-05

Equivalent Degrees of Freedom (DF): 11.78733

> comp.est3 <- predict(wesdr.fit3, pstd=F,

terms=diag(rep(1,10))[-1,])

> comp.norm3 <- apply(comp.est3$fit, 2, norm.cen)

> print(round(comp.norm3,2))

7.22 33.71 12.32 4.02 8.99 11.95 10.15 0.00 6.79

Based on the estimates of the smoothing parameters and the norms, the nonparametric main effect of x2, f^s_2, can be dropped. Therefore, we fit the final model:

> wesdr.fit4 <- update(wesdr.fit3,

rk=list(cubic(x1), cubic(x3)))

> summary(wesdr.fit4)

...

UBR estimate(s) of smoothing parameter(s) :

4.902693e-06 1.108310e-05

Equivalent Degrees of Freedom (DF): 11.78733

Estimate of sigma: 1

To look at the effect of dur, with gly and bmi fixed at the medians of their observed values, we compute estimates of the probabilities and posterior standard deviations on a grid of dur values. The estimated probability function and the approximate 95% Bayesian confidence intervals are shown in Figure 6.1(a). The risk of progression of retinopathy increases up to a duration of about 10 years and then decreases, possibly caused by censoring due to death in patients with longer durations. Similar plots for the effects of gly and bmi are shown in Figures 6.1(b) and


[Figure 6.1 here: three panels plotting probability (0–1) against (a) duration (yr), (b) gly. hemoglobin, and (c) body mass index (kg/m2).]

FIGURE 6.1 WESDR data, plots of (a) the estimated probability as a function of dur with gly and bmi fixed at the medians of their observed values, (b) the estimated probability as a function of gly with dur and bmi fixed at the medians of their observed values, and (c) the estimated probability as a function of bmi with dur and gly fixed at the medians of their observed values. Shaded regions are approximate 95% Bayesian confidence intervals. Rugs on the bottom and the top of each plot are observations of prg.

6.1(c). The risk of progression of retinopathy increases with increasing glycosylated hemoglobin, and the risk increases with increasing body mass index until a value of about 25 kg/m2, after which the trend is uncertain due to wide confidence intervals. As expected, the confidence intervals are wider in areas where observations are sparse.

6.4 Smoothing Spline Estimation of Variance Functions

Consider the following heteroscedastic SS ANOVA model

yi = L1if1 + exp{L2if2/2}ǫi, i = 1, . . . , n, (6.24)

where fk is a function on Xk = Xk1 × Xk2 × · · · × Xkdk with model space Mk = Hk0 ⊕ Hk1 ⊕ · · · ⊕ Hkqk for k = 1, 2; L1i and L2i are bounded linear functionals; and ǫi iid∼ N(0, 1). The goal is to estimate both the mean function f1 and the variance function f2. Note that both the mean and variance functions are modeled nonparametrically in this section. This


is in contrast to the parametric model (5.28) for variance functions in Chapter 5.

One simple approach to estimating the functions f1 and f2 is to use the following two-step procedure:

1. Estimate the mean function f1 as if the random errors were homoscedastic.

2. Estimate the variance function f2 based on the squared residuals from the first step.

3. Estimate the mean function again using the estimated variance function.

The PLS estimation method in Chapter 4 can be used in the first step. Denote the estimate from the first step as f̂1 and ri = yi − L1if̂1 as the residuals. Let zi = ri². Under suitable conditions, zi ≈ exp{L2if2}χ²i,1, where the χ²i,1 are iid chi-square random variables with one degree of freedom. Regarding the chi-square distribution as a special case of the gamma distribution, the PL method described in this chapter can be used to estimate the variance function f2 in the second step. Denote the estimate from the second step as f̂2. Then the PWLS method in Chapter 5 can be used in the third step with known covariance W⁻¹ = diag(exp{L21f̂2}, . . . , exp{L2nf̂2}).

We now use the motorcycle data to illustrate the foregoing two-step procedure. We have fitted a cubic spline to the logarithm of the squared residuals in Section 5.4.1. Based on model (3.53), consider the heteroscedastic partial spline model

yi = f1(xi) + exp{f2(xi)/2}ǫi, i = 1, . . . , n, (6.25)

where f1(x) = Σ_{j=1}^3 βj(x − tj)+ + g1(x); t1 = 0.2128, t2 = 0.3666, and t3 = 0.5113 are the change-points in the first derivative; and ǫi iid∼ N(0, 1). We model both g1 and f2 using the cubic spline model space W₂²[0, 1]. The following statements implement the two-step procedure for model (6.25):

# step 1

> t1 <- .2128; t2 <- .3666; t3 <- .5113;

> s1 <- (x-t1)*(x>t1); s2 <- (x-t2)*(x>t2)

> s3 <- (x-t3)*(x>t3)

> mcycle.ps.fit4 <- ssr(accel~x+s1+s2+s3, rk=cubic(x))

# step 2

> z1 <- residuals(mcycle.ps.fit4)**2

> mcycle.v.1 <- ssr(z1~x, cubic(x), limnla=c(-6,-1),
                    family="gamma", spar="u", varht=1)

# step 3

> mcycle.ps.fit5 <- update(mcycle.ps.fit4,

weights=mcycle.v.1$fit)

In the second step, the search range for log10(nλ) is set to [−6, −1] to avoid numerical problems. The actual estimate of the smoothing parameter is log10(nλ) = −6, which leads to a rough estimate of f2 (Figure 6.2(a)).

[Figure 6.2 here: three panels with x-axis time (ms); y-axes log(squared residual) in (a) and (c) and log(squared difference) in (b).]

FIGURE 6.2 Motorcycle data, plots of the PL cubic spline estimates of f2 (lines) and 95% approximate Bayesian confidence intervals (shaded regions) based on (a) the two-step procedure with squared residuals, (b) the two-step procedure with squared differences, and (c) the backfitting procedure. Circles are (a) the logarithm of the squared residuals based on model (3.53), (b) the logarithm of the squared differences, and (c) the logarithm of the squared residuals based on the final fit to the mean function in the backfitting procedure.

The first step in the two-step procedure may be replaced by a difference-based method. Note that x1 ≤ x2 ≤ · · · ≤ xn in the motorcycle data. Let zi = (yi+1 − yi)²/2 for i = 1, . . . , n − 1. When xi+1 − xi is small, yi+1 − yi = f1(xi+1) − f1(xi) + exp{f2(xi+1)/2}ǫi+1 − exp{f2(xi)/2}ǫi ≈ exp{f2(x̄i)/2}(ǫi+1 − ǫi), where x̄i = (xi+1 + xi)/2. Then zi ≈ exp{f2(x̄i)}χ²i,1, where the χ²i,1 are chi-square random variables with one degree of freedom. Ignoring correlations between neighboring observations, the following statements implement this difference-based method:

> z2 <- diff(accel)**2/2; z2[z2<.00001] <- .00001


> n <- length(x); xx <- (x[1:(n-1)]+x[2:n])/2

> mcycle.v.g <- ssr(z2~xx, cubic(xx), limnla=c(-6,-1),
                    family="gamma", spar="u", varht=1)

> w <- predict(mcycle.v.g, pstd=F,

newdata=data.frame(xx=x))

> mcycle.ps.fit6 <- update(mcycle.ps.fit4,

weights=exp(w$fit))

As in Yuan and Wahba (2004), we set zi = max{.00001, (yi+1 − yi)²/2} to avoid overfitting. Again, the actual estimate of the smoothing parameter reaches the lower bound, such that log10(nλ) = −6. The estimate of f2 is rough (Figure 6.2(b)).
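The difference-based construction can also be checked by simulation: when f1 varies slowly, successive differences cancel the mean, so zi = (yi+1 − yi)²/2 has expectation close to the local variance. A hedged sketch (Python, constant variance for simplicity, names ours):

```python
import math
import random

random.seed(1)
n, sigma = 100000, 1.5
# smooth mean function plus homoscedastic noise on an equally spaced grid
y = [math.sin(2 * math.pi * i / n) + random.gauss(0, sigma) for i in range(n)]
# squared half-differences; the smooth mean contributes only O(1/n^2)
z = [(y[i + 1] - y[i]) ** 2 / 2 for i in range(n - 1)]
mean_z = sum(z) / len(z)
print(abs(mean_z - sigma ** 2) < 0.1)
```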

A more formal procedure is to estimate f1 and f2 in model (6.24) jointly as the minimizers of the following doubly penalized likelihood (DPL) (Yuan and Wahba 2004):

(1/n) Σ_{i=1}^n {(yi − L1if1)² exp(−L2if2) + L2if2} + Σ_{k=1}^2 Σ_{j=1}^{qk} λkj‖Pkjfk‖²,   (6.26)

where the first part is the negative log-likelihood, Pkj is the orthogonal projector in Mk onto Hkj for k = 1, 2, and λkj are smoothing parameters. Following the same arguments as in Section 6.2.1, the minimizers of the DPL can be represented as

$$
f_k(x) = \sum_{\nu=1}^{p_k} d_{k,\nu}\phi_{k,\nu}(x) + \sum_{i=1}^{n} c_{k,i}\sum_{j=1}^{q_k}\theta_{kj}\mathcal{L}_{k,i(z)}R_{kj}(x,z), \qquad x, z \in \mathcal{X}_k,\ k = 1, 2, \qquad (6.27)
$$

where $\lambda_{kj} = \lambda_k/\theta_{kj}$, $\phi_{k,\nu}$ for $\nu = 1, \dots, p_k$ are basis functions of $\mathcal{H}_{k0}$, and $R_{kj}$ for $j = 1, \dots, q_k$ are RKs of $\mathcal{H}_{kj}$. It is difficult to solve for the coefficients $d_{k,\nu}$ and $c_{k,i}$ directly. However, it is easy to implement a backfitting procedure by iterating the following two steps until convergence: (a) conditional on current estimates of $d_{2,\nu}$ and $c_{2,i}$, update $d_{1,\nu}$ and $c_{1,i}$; (b) conditional on current estimates of $d_{1,\nu}$ and $c_{1,i}$, update $d_{2,\nu}$ and $c_{2,i}$. Note that, when $d_{2,\nu}$ and $c_{2,i}$ are fixed, $f_2$ in (6.26) is fixed, and the DPL reduces to the PWLS (5.3) with known weights. When $d_{1,\nu}$ and $c_{1,i}$ are fixed, $f_1$ in (6.27) is fixed, and the DPL reduces to the PL (6.4) for observations $z_i = \exp\{\mathcal{L}_{2i}f_2\}\chi^2_{i,1}$. Therefore, steps (a) and (b) correspond to steps 2 and 3 in the two-step procedure, and the backfitting procedure extends the two-step procedure by iterating steps 2 and 3 until convergence. The above backfitting procedure is essentially the same as the iterative procedure in Yuan and Wahba (2004), where different methods were used to select smoothing parameters. The following is a simple R function that implements the backfitting procedure for the motorcycle data:

> jemv <- function(x, y, prec=1e-6, maxit=30) {
    t1 <- .2128; t2 <- .3666; t3 <- .5113
    s1 <- (x-t1)*(x>t1); s2 <- (x-t2)*(x>t2)
    s3 <- (x-t3)*(x>t3)
    err <- 1e20; eta <- rep(1,length(x))
    while (err>prec&maxit>0) {
      fitf <- ssr(y~x+s1+s2+s3, cubic(x), weights=exp(eta))
      z <- fitf$resi**2; z[z<.00001] <- .00001
      fitv <- ssr(z~x, cubic(x), limnla=c(-5,-1),
                  family="gamma", spar="u", varht=1)
      oldeta <- eta
      eta <- fitv$rkpk$eta
      err <- sqrt(mean(((eta-oldeta)/(1+abs(oldeta)))**2))
      maxit <- maxit-1
    }
    return(list(fitf=fitf,fitv=fitv))
  }
> mcycle.mv <- jemv(x, accel)

For estimation of the variance function, a new search range for $\log_{10}(n\lambda)$ has to be set as [−5, −1] to avoid numerical problems. The backfitting algorithm converged in 13 iterations. The estimate of f2 is shown in Figure 6.2(c).
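The alternating structure of the backfitting procedure can be illustrated on a toy parametric analogue: a constant mean and a constant log-variance estimated by iterating steps (a) and (b). This Python sketch mirrors only the iteration skeleton of jemv (the model, tolerance, and names are ours; the actual procedure fits splines at each step):

```python
import math
import random

random.seed(7)

# Toy heteroscedastic sample: y_i = mu + exp(eta/2) * eps_i with mu = 3, eta = log 4
y = [3 + 2 * random.gauss(0, 1) for _ in range(4000)]

mu, eta = 0.0, 0.0
for it in range(30):
    # step (a): weighted estimate of the mean, given the current variance
    w = math.exp(-eta)                       # weight = 1 / current variance
    mu_new = sum(w * yi for yi in y) / (w * len(y))
    # step (b): log-variance estimate from squared residuals, given the mean
    z = [(yi - mu_new) ** 2 for yi in y]
    eta_new = math.log(sum(z) / len(z))
    err = abs(mu_new - mu) + abs(eta_new - eta)
    mu, eta = mu_new, eta_new
    if err < 1e-8:                           # stop when both steps stabilize
        break

print(mu, math.exp(eta))
```

In the spline setting, step (a) is a PWLS fit of the mean function and step (b) is a gamma-family fit to the squared residuals, but the convergence test on the relative change of the log-variance plays the same role as err here.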

The smoothing parameters in the above procedures are selected by the iterative UBR method. For chi-square distributed response variables with small degrees of freedom, nonconvergence of the iterative approach may become a serious problem (Liu et al. 2007). Note that the chi-square random variables in the above procedures have 1 degree of freedom. Direct methods such as the DGML criterion guarantee convergence. We now show a simple implementation of the DGML method for gamma regression. Instead of the Newton–Raphson method, we use the Fisher scoring method, which leads to $u_i = 1 - y_i/\mu_i$ and $w_i = 1$. Consequently, $W_c = I$, and there is no need for the transformation. Furthermore, $|R|$ is dropped since it is independent of the smoothing parameter. The following R function computes the DGML in (6.20) for gamma regression:

DGML <- function(th, nlaht, y, S, Q) {
  if (length(th)==1) { nlaht <- th; Qt <- Q }
  if (length(th)>1) {
    theta <- 10^th; Qt <- 0
    for (i in 1:dim(Q)[3]) Qt <- Qt + theta[i]*Q[,,i]
  }
  fit <- try(ssr(y~S-1, Qt, limnla=nlaht, family="gamma",
                 spar="u", varht=1))
  if (class(fit)=="ssr") {
    fht <- fit$rkpk$eta
    tmp <- y*exp(-fht)
    uc <- 1-tmp
    yt <- fht-uc
    qrq <- qr.Q(qr(S), complete=T)
    q2 <- qrq[ ,(ncol(S)+1):nrow(S)]
    V <- t(q2)%*%Qt%*%q2
    l <- eigen((V + t(V))/2)
    U <- l$vectors
    e <- l$values
    z <- t(U)%*%t(q2)%*%yt
    delta <- 10^{-nlaht}
    GML <- sum(tmp+fht)-sum(uc^2)/2+sum(log(delta*e+1))/2+
      sum(z^2/(delta*e+1))/2
    return(GML)
  }
  else return(1e10)
}

where fht, uc, yt, V, U, e, z, and delta correspond to $\hat{f}$, $u_c$, $\tilde{y}$, $Q_2^T\Sigma_\theta Q_2$, $U$, $(e_1, \dots, e_{n-p})^T$, $z$, and $\delta$, respectively, in the definition of the DGML in (6.20). Note that $\sigma^2 = 1$. The R functions qr and eigen are used to compute the QR decomposition and eigendecomposition. For an SS ANOVA model with q = 1, the input th corresponds to $\log_{10}(n\lambda)$, nlaht is not used, y are observations, S corresponds to the matrix T, and Q corresponds to the matrix Σ. For an SS ANOVA model with q > 1, the input th corresponds to $\log_{10}\theta$, nlaht corresponds to $\log_{10}(n\lambda)$, y are observations, S corresponds to the matrix T, and Q is an (n, n, q) array corresponding to the matrices $\Sigma_k$ for $k = 1, \dots, q$.

We now apply the above DGML method to the second step in the two-step procedure. We compute the DGML criterion on a grid of log10(nλ),find the DGML estimate of the smoothing parameter as the minimizer ofthe DGML criterion, and fit again with the smoothing parameter beingfixed as the DGML estimate:

> S <- cbind(1,x); Q <- cubic(x)
> lgrid <- seq(-6,-2,by=.1)
> gml <- sapply(lgrid, DGML, nlaht=0, y=z1, S=S, Q=Q)
> nlaht <- lgrid[order(gml)[1]]
> mcycle.v.g4 <- ssr(z1~x, cubic(x), limnla=nlaht,
                     family="gamma", spar="u", varht=1)
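The grid-search logic itself is generic; a minimal sketch in Python (with a toy stand-in criterion, not the DGML) looks like this:

```python
# Toy criterion with a unique minimum at log10(n*lambda) = -4.2; stands in
# for the DGML function evaluated by the fitting routine.
def criterion(nl):
    return (nl + 4.2) ** 2 + 850.0

# Evaluate on a grid of log10(n*lambda) values and take the minimizer,
# exactly as in the grid search above.
lgrid = [-6 + 0.1 * i for i in range(41)]   # -6.0, -5.9, ..., -2.0
vals = [criterion(nl) for nl in lgrid]
nlaht = lgrid[vals.index(min(vals))]
print(nlaht)
```

A grid search trades a little precision for robustness: unlike an iterative optimizer, it cannot diverge, which is the point of using the direct DGML criterion here.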


FIGURE 6.3 Motorcycle data, plots of (a) the DGML function, where the minimum point is marked with a short bar at the bottom; (b) logarithm of squared residuals (circles) based on model (3.53), PL cubic spline estimate of f2 (line) fitted to squared residuals with the DGML estimate of the smoothing parameter, and 95% approximate Bayesian confidence intervals (shaded region); and (c) observations (circles) and PWLS partial spline estimate of f1 (line) with 95% Bayesian confidence intervals (shaded region).

The DGML function is shown in Figure 6.3(a). It reaches the minimum at $\log_{10}(n\lambda) = -4.2$. The PL cubic spline estimate of f2 with the DGML choice of the smoothing parameter is shown in Figure 6.3(b). The PWLS partial spline estimate of f1 and 95% Bayesian confidence intervals are shown in Figure 6.3(c). The effect of unequal variances is reflected in the widths of the confidence intervals.

6.5 Smoothing Spline Spectral Analysis

6.5.1 Spectrum Estimation of a Stationary Process

The spectrum is often used to describe the power distribution of a time series. Consider a zero-mean stationary time series $X_t$, $t = 0, \pm 1, \pm 2, \dots$.


The spectrum is defined as
$$
S(\omega) = \sum_{u=-\infty}^{\infty} \gamma(u)\exp(-i2\pi\omega u), \qquad \omega \in [0,1], \qquad (6.28)
$$
where $\gamma(u) = \mathrm{E}(X_t X_{t+u})$ is the covariance function, and $i = \sqrt{-1}$. Let $X_0, X_1, \dots, X_{T-1}$ be a finite sample of the stationary process and define the periodogram at frequency $\omega_k = k/T$ as
$$
y_k = T^{-1}\Bigl|\sum_{t=0}^{T-1} X_t \exp(i2\pi\omega_k t)\Bigr|^2, \qquad k = 0, \dots, T-1. \qquad (6.29)
$$

The periodogram is an asymptotically unbiased but inconsistent estimator of the underlying true spectrum. Many different smoothing techniques have been proposed to overcome this problem. We now show how to estimate the spectrum using smoothing splines based on the observations $\{(\omega_k, y_k),\ k = 0, \dots, T-1\}$.

Under standard mixing conditions, the periodograms are asymptotically independent and distributed as
$$
y_k \sim \begin{cases} S(\omega_k)\chi_1^2, & \omega_k = 0, 1/2,\\ S(\omega_k)\chi_2^2/2, & \omega_k \ne 0, 1/2.\end{cases}
$$
Regarding the chi-square distribution as a special case of the gamma distribution, the method described in this chapter can be used to estimate the spectrum. Consider the logarithmic link function and let $f = \log(S)$ be the log spectrum. We model the function f using the periodic spline space $W_2^m(\mathrm{per})$. Note that $f(\omega)$ is symmetric about $\omega = 0.5$. Therefore, it suffices to estimate $f(\omega)$ for $\omega \in [0, 0.5]$. Nevertheless, to estimate f as a periodic function, we will use periodograms at all frequencies in Section 6.5.3.
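Definition (6.29) translates directly into code. The following pure-Python sketch (simulated white noise; all names are ours) computes the periodogram and illustrates the asymptotic unbiasedness: the average ordinate is close to the flat spectrum, even though individual ordinates do not converge:

```python
import cmath
import math
import random

random.seed(3)

# Periodogram at frequencies w_k = k/T, following (6.29):
#   y_k = T^{-1} | sum_t X_t exp(i 2 pi w_k t) |^2
def periodogram(x):
    T = len(x)
    return [abs(sum(xt * cmath.exp(2j * math.pi * (k / T) * t)
                    for t, xt in enumerate(x))) ** 2 / T
            for k in range(T)]

# White noise with variance 4 has flat spectrum S(w) = 4: the average
# periodogram ordinate is near 4 (asymptotic unbiasedness), although each
# individual ordinate stays chi-square distributed and never converges.
T = 256
x = [2 * random.gauss(0, 1) for _ in range(T)]
y = periodogram(x)
print(sum(y) / T)
```

By Parseval's identity the ordinates sum exactly to the sum of squares of the data, so the average ordinate equals the sample second moment; smoothing across frequencies is what repairs the inconsistency.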

6.5.2 Time-Varying Spectrum Estimation of a Locally Stationary Process

Many time series are nonstationary. Locally stationary processes have been proposed to approximate nonstationary time series. The time-varying spectrum of a locally stationary time series characterizes changes of stochastic variation over time.

A zero-mean stochastic process $\{X_t,\ t = 0, \dots, T-1\}$ is a locally stationary process if
$$
X_t = \int_0^1 A(\omega, t/T)\exp(i2\pi\omega t)\,dZ(\omega), \qquad (6.30)
$$
where $Z(\omega)$ is a zero-mean stochastic process on [0, 1], and $A(\omega, u)$ denotes a transfer function with continuous second-order derivatives for $(\omega, u) \in [0,1]\times[0,1]$. Define the time-varying spectrum as
$$
S(\omega, u) = \|A(\omega, u)\|^2. \qquad (6.31)
$$

Consider the logarithmic link function and let $f(\omega, u) = \log S(\omega, u)$. Since f is a periodic function of ω, we model the log spectrum f using the tensor product space $W_2^2(\mathrm{per}) \otimes W_2^2[0,1]$, where the SS ANOVA decomposition was derived in Section 4.4.5. The SS ANOVA model, using the notation of this section, can be represented as
$$
f(\omega, u) = \mu + \beta\times(u-0.5) + f_2^s(u) + f_1(\omega) + f_{12}^{sl}(\omega, u) + f_{12}^{ss}(\omega, u), \qquad (6.32)
$$
where $\beta\times(u-0.5)$ and $f_2^s(u)$ are the linear and smooth main effects of time, $f_1(\omega)$ is the smooth main effect of frequency, and $f_{12}^{sl}(\omega, u)$ and $f_{12}^{ss}(\omega, u)$ are the linear–smooth and smooth–smooth interactions between frequency and time.

To estimate the bivariate function f, we compute local periodograms on a time–frequency grid. Specifically, divide the time domain into J disjoint blocks $[b_j, b_{j+1})$, where $0 = b_1 < b_2 < \cdots < b_J < b_{J+1} = 1$. Let $u_j = (b_j + b_{j+1})/2$ be the middle points of these J blocks. Let $\omega_k$ for $k = 1, \dots, K$ be K frequencies in [0, 1]. Define the local periodograms as
$$
y_{(k-1)J+j} = \frac{\bigl|\sum_{t=b_j}^{b_{j+1}-1} X_t \exp(i2\pi\omega_k t)\bigr|^2}{|b_{j+1} - b_j|}, \qquad k = 1, \dots, K;\ j = 1, \dots, J. \qquad (6.33)
$$
Again, under regularity conditions, the local periodograms are asymptotically independent and distributed as (Guo, Dai, Ombao and von Sachs 2003)
$$
y_{(k-1)J+j} \sim \begin{cases} \exp\{f(\omega_k, u_j)\}\chi_1^2, & \omega_k = 0, 1/2,\\ \exp\{f(\omega_k, u_j)\}\chi_2^2/2, & \omega_k \ne 0, 1/2.\end{cases}
$$
Let $x = (\omega, u)$ and $x_{(k-1)J+j} = (\omega_k, u_j)$ for $k = 1, \dots, K$ and $j = 1, \dots, J$. Then the time-varying spectrum can be estimated based on observations $\{(x_i, y_i),\ i = 1, \dots, KJ\}$.

The SS ANOVA decomposition (6.32) may also be used to test whether a locally stationary process is stationary. The locally stationary process $X_t$ is stationary if $f(\omega, u)$ is independent of u. Let $h(u) = \beta\times(u-0.5) + f_2^s(u) + f_{12}^{sl}(\omega, u) + f_{12}^{ss}(\omega, u)$, which collects all terms involving time in (6.32). Then the hypotheses are
$$
H_0:\ h(u) = 0 \text{ for all } u, \qquad H_1:\ h(u) \ne 0 \text{ for some } u.
$$


The full SS ANOVA model (6.32) reduces to $f(\omega, u) = \mu + f_1(\omega)$ under $H_0$. Denote by $f^F$ and $f^R$ the estimates of f under the full and reduced models, respectively. Let
$$
D_F = \sum_{i=1}^{n}\bigl\{f_i^F + y_i\exp(-f_i^F) - \log y_i - 1\bigr\}, \qquad
D_R = \sum_{i=1}^{n}\bigl\{f_i^R + y_i\exp(-f_i^R) - \log y_i - 1\bigr\}
$$
be the deviances under the full and reduced models. We construct two test statistics,
$$
T_1 = D_R - D_F, \qquad T_2 = \int_0^1\!\!\int_0^1 \bigl\{f^F(\omega, u) - f^R(\omega, u)\bigr\}^2\, d\omega\, du,
$$
where $T_1$ corresponds to the chi-square statistics commonly used for generalized linear models, and $T_2$ computes the $L_2$ distance between $f^F$ and $f^R$. We reject $H_0$ when these statistics are large. It is difficult to derive the null distributions of these statistics. Note that f does not depend on u under $H_0$. Therefore, the null distribution can be approximated by permutation. Specifically, permutation samples are generated by shuffling the time grid, and the statistics $T_1$ and $T_2$ are computed for each permutation sample. The p-values are then approximated by the proportion of statistics based on permutation samples that are larger than those based on the original data.
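The permutation scheme is generic and can be sketched independently of the spline fits. In this Python illustration a toy statistic replaces $T_1$ and $T_2$, but the shuffle–recompute–compare structure is the same (all names and the toy statistic are ours):

```python
import random

random.seed(11)

# Between-group variance of time-block means: a toy stand-in for T1/T2 that
# is large when y depends on the time label and small otherwise.
def stat(y, labels, ngroups):
    means = []
    for g in range(ngroups):
        grp = [yi for yi, lab in zip(y, labels) if lab == g]
        means.append(sum(grp) / len(grp))
    grand = sum(means) / ngroups
    return sum((m - grand) ** 2 for m in means)

# Data whose mean truly drifts across four time blocks.
ngroups = 4
y = [random.gauss(g, 1) for g in range(ngroups) for _ in range(50)]
labels = [g for g in range(ngroups) for _ in range(50)]

observed = stat(y, labels, ngroups)
nperm = 200
count = 0
for _ in range(nperm):
    perm = labels[:]
    random.shuffle(perm)                 # shuffle the time labels
    if stat(y, perm, ngroups) >= observed:
        count += 1
pval = count / nperm                     # proportion of permuted stats >= observed
print(pval)
```

Shuffling the time labels enforces the null hypothesis of no time dependence, which is exactly why the reduced model need not be refitted for each permutation sample in the analysis below.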

6.5.3 Epileptic EEG

We now illustrate how to estimate the spectrum of a stationary process and the time-varying spectrum of a locally stationary process using the seizure data. The data contain two 5-minute intracranial electroencephalogram (IEEG) segments from a patient: one at baseline, extracted at least 4 hours before the seizure's onset (labeled as baseline), and one right before a seizure's clinical onset (labeled as preseizure). Observations are shown in Figure 6.4.

FIGURE 6.4 Seizure data, plots of 5-minute baseline segments collected hours away from the seizure (above), and 5-minute preseizure segments with the seizure's onset at the 5th minute (below). The sampling rate is 200 Hertz. The total number of time points is 60,000 for each segment.

First assume that both the baseline and preseizure series are stationary. Then the following statements compute the periodograms and fit a periodic spline model for the baseline series:

> data(seizure); attach(seizure)
> n <- nrow(seizure)
> x <- seq(1,n-1, by=120)/n
> y <- (abs(fft(seizure$base))^2/n)[-1][seq(1,n-1, by=120)]
> seizure.s.base <- ssr(y~x, periodic(x), spar="u",
      varht=1, family="gamma", limnla=c(-5,1))

> grid <- data.frame(x=seq(0,.5,len=100))

> seizure.s.base.p <- predict(seizure.s.base,grid)

where the function fft computes an unscaled discrete Fourier transform. A subset of periodograms is used. The UBR criterion is used to select the smoothing parameter with the dispersion parameter set to 1. We limit the search range for $\log_{10}(n\lambda)$ to avoid undersmoothing. Log periodograms, the periodic spline estimate of the log spectrum, and 95% Bayesian confidence intervals of the baseline series are shown in the left panel of Figure 6.5. The log spectrum of the preseizure series is estimated similarly. Log periodograms, the periodic spline estimate of the log spectrum, and 95% Bayesian confidence intervals of the preseizure series are shown in the right panel of Figure 6.5. Because the sampling rate is 200 Hz and the spectrum is symmetric around 100 Hz, we only show the estimated spectra in the frequency band 0–100 Hz.


FIGURE 6.5 Seizure data, plots of log periodograms (dots), estimates of the log spectra (lines) based on the iterative UBR method, and 95% Bayesian confidence intervals (shaded regions) of the baseline series (left) and preseizure series (right).

The IEEG series may become nonstationary before seizure events. Assume that both the baseline and preseizure series are locally stationary. We create time–frequency grid data with $\omega_k = k/(K+1)$ and $u_j = (j-.5)/J + 1/(2n)$ for $k = 1, \dots, K$ and $j = 1, \dots, J$, where n = 60,000 is the length of a time series and K = J = 32. We compute local periodograms for the baseline series:

> pgram <- function(x, freqs) {
    y <- numeric(length(freqs))
    tser <- seq(length(x))-1
    for(i in seq(length(freqs))) y[i] <-
      Mod(sum(x*complex(mod=1, arg=-2*pi*freqs[i]*tser)))^2
    y/length(x)
  }
> lpgram <- function(x, times, freqs) {
    nsub <- floor(length(x)/length(times))
    ymat <- matrix(0, length(freqs), length(times))
    for (j in seq(length(times)))
      ymat[,j] <- pgram(x[((j-1)*nsub+1):(j*nsub)], freqs)
    as.vector(ymat)
  }
> nf <- nt <- 32; freqs <- 1:nf/(nf+1)
> nsub <- floor(n/nt)
> times <- (seq(from=(1+nsub)/2, by=nsub, length=nt))/n
> y <- lpgram(seizure$base, times, freqs)

where the functions pgram and lpgram, respectively, compute periodograms for a stationary process and local periodograms for a locally stationary process. We now fit the SS ANOVA model (6.32) for the baseline series:

> ftgrid <- expand.grid(freqs,times)
> x1 <- ftgrid[,1]
> x2 <- ftgrid[,2]
> seizure.ls.base <- ssr(y~x2,
      rk=list(periodic(x1), cubic(x2),
              rk.prod(periodic(x1), kron(x2-.5)),
              rk.prod(periodic(x1), cubic(x2))),
      spar="u", varht=1, family="gamma")
> grid <- expand.grid(x1=seq(0,.5,len=50),
                      x2=seq(0,1,len=50))
> seizure.ls.base.p <- predict(seizure.ls.base,
                               newdata=grid)

The SS ANOVA model (6.32) for the preseizure series is fitted similarly. Figure 6.6 shows estimates of the time-varying spectra of the baseline series and preseizure series. The iterative UBR method was used in the above fits.

FIGURE 6.6 Seizure data, plots of estimates of the time-varying spectra of the baseline series (left) and preseizure series (right) based on the iterative UBR method.

The DGML method usually leads to a better estimate of the spectrum or the time-varying spectrum (Qin and Wang 2008). The following function spdest estimates a spectrum or a time-varying spectrum using the DGML method:

spdest <- function(y, freq, time, process, control=list(
    optfactr=1e10, limnla=c(-6,1), prec=1e-6, maxit=30))
{
  if (process=="S") {
    thhat <- rep(NA, 4)
    S <- as.matrix(rep(1,length(y)))
    Q <- periodic(freq)
    tmp <- try(optim(-2, DGML, y=y, S=S, Q=Q,
        method="L-BFGS-B",
        lower=control$limnla[1], upper=control$limnla[2],
        control=list(factr=control$optfactr)))

    if (class(tmp)=="try-error") {
      info <- "optim failure"
    }
    else {
      if (tmp$convergence==0) {
        nlaht <- tmp$par
        fit <- ssr(y~S-1, Q, limnla=tmp$par, family="gamma",
                   spar="u", varht=1)
        info <- "success"
      }
      else info <- paste("optim failure",
                         as.character(tmp$convergence))
    }
  }
  if (process=="LS") {
    S <- cbind(1,time-.5)
    Q <- array(NA,c(length(freq),length(freq),4))
    Q[,,1] <- periodic(freq)
    Q[,,2] <- cubic(time)
    Q[,,3] <- rk.prod(periodic(freq),kron(time-.5))
    Q[,,4] <- rk.prod(periodic(freq),cubic(time))
    # compute initial values for optimization of theta
    zz <- log(y)+.57721
    tmp <- try(ssr(zz~S-1, Q, spar="m"),T)
    if (class(tmp)=="ssr") {
      thini <- tmp$rkpk.obj$theta
      nlaht <- tmp$rkpk.obj$nlaht
    }
    else {
      thini <- rep(1,4)
      nlaht <- 1.e-8
    }
    tmp <- try(optim(thini, DGML, nlaht=nlaht,
        y=y, S=S, Q=Q, method="L-BFGS-B",
        control=list(factr=control$optfactr)))
    if (class(tmp)=="try-error") {info <- "optim failure"}
    else {
      if (tmp$convergence==0) {
        thhat <- tmp$par
        thetahat <- 10**thhat
        Qt <- 0
        for (i in 1:dim(Q)[3]) Qt <- Qt + thetahat[i]*Q[,,i]
        fit <- ssr(y~S-1, Qt, limnla=nlaht, family="gamma",
                   spar="u", varht=1)
        info <- "success"
      }
      else { info <- paste("optim failure",
                           as.character(tmp$convergence)) }
    }
  }
  return(list(fit=fit, nlaht=nlaht, thhat=thhat, info=info))
}

The argument process specifies the type of process, with "S" and "LS" corresponding to the stationary and locally stationary processes, respectively. For a stationary process, the argument y inputs periodograms at frequencies specified by freq; the argument time is irrelevant in this case. For a locally stationary process, the argument y inputs local periodograms at the time–frequency grid specified by time and freq. For locally stationary processes, we fit the logarithmically transformed periodograms to get initial values for the smoothing parameters (Wahba 1980, Qin and Wang 2008). The DGML function was presented in Section 6.4.

The following statements estimate the spectrum of the baseline series as a stationary process using the DGML method and compute posterior means and standard deviations:

> x <- seq(1,n-1,by=120)/n
> y <- pgram(seizure$base, x)
> tmp <- spdest(y, x, 0, "S")
> seizure.s.base.dgml <- ssr(y~x, periodic(x), spar="u",
      varht=1, family="gamma", limnla=tmp$nlaht)
> grid <- data.frame(x=seq(0,.5,len=100))
> seizure.s.base.dgml.p <- predict(seizure.s.base.dgml,
                                   grid)

Note that, to use the predict function, we need to call ssr again using the DGML estimate of the smoothing parameter. The spectrum of the preseizure series as a stationary process can be estimated similarly. The estimates of the spectra are similar to those in Figure 6.5.

Next we estimate the time-varying spectrum of the baseline series as a locally stationary process using the DGML method and compute posterior means:

> y <- lpgram(seizure$base, times, freqs)
> tmp <- spdest(y, x1, x2, "LS")
> th <- 10**tmp$thhat
> S <- cbind(1, x2-.5)
> Q1 <- periodic(x1)
> Q2 <- cubic(x2)
> Q3 <- rk.prod(periodic(x1), kron(x2-.5))
> Q4 <- rk.prod(periodic(x1), cubic(x2))
> Qt <- th[1]*Q1+th[2]*Q2+th[3]*Q3+th[4]*Q4
> seizure.ls.base.dgml <- ssr(y~x2, rk=Qt, spar="u",
      varht=1, family="gamma", limnla=tmp$nlaht)
> grid <- expand.grid(x1=seq(0,.5,len=50),
                      x2=seq(0,1,len=50))
> Sg <- cbind(1,grid$x2-.5)
> Qg1 <- periodic(grid$x1,x1)
> Qg2 <- cubic(grid$x2,x2)
> Qg3 <- rk.prod(periodic(grid$x1,x1),
                 kron(grid$x2-.5,x2-.5))
> Qg4 <- rk.prod(periodic(grid$x1,x1), cubic(grid$x2,x2))
> Qgt <- th[1]*Qg1+th[2]*Qg2+th[3]*Qg3+th[4]*Qg4
> seizure.ls.base.dgml.p <-
      Sg%*%seizure.ls.base.dgml$coef$d+
      Qgt%*%seizure.ls.base.dgml$coef$c


where the spdest function was used to find the DGML estimates of the smoothing parameters, and the ssr function was used to calculate the coefficients c and d. The time-varying spectrum of the preseizure series as a locally stationary process can be estimated similarly. The estimates of the time-varying spectra are shown in Figure 6.7. The estimate for the preseizure series (right panel of Figure 6.7) is smoother than that based on the iterative UBR method (right panel of Figure 6.6).


FIGURE 6.7 Seizure data, plots of estimates of the time-varying spectra based on the DGML method of the baseline series (left) and preseizure series (right).

It appears that the baseline spectrum does not vary much over time, while the preseizure spectrum varies over time. Therefore, the baseline series may be stationary, while the preseizure series may be nonstationary. The following statements perform the permutation test for the baseline series with 100 permutations:

z <- seizure$base
n <- length(z); nt <- 32; nf <- 32
freqs <- 1:nf/(nf+1); nsub <- floor(n/nt)
times <- (seq(from=(1+nsub)/2, by=nsub, length=nt))/n
y <- lpgram(z,times,freqs)
ftgrid <- expand.grid(freqs,times)
x1 <- ftgrid[,1]
x2 <- ftgrid[,2]
full <- spdest(y, x1, x2, "LS")
reduced <- spdest(y, x1, x2, "S")
fitf <- full$fit$rkpk$eta
fitr <- reduced$fit$rkpk$eta
devf <- sum(-1-log(y)+y*exp(-fitf)+fitf)
devr <- sum(-1-log(y)+y*exp(-fitr)+fitr)
cstat <- devr-devf
l2d <- mean((fitf-fitr)**2)
nperm <- 100; totperm <- 0
cstatp <- l2dp <- NULL
while (totperm < nperm) {
  x2p <- rep(sample(times),rep(nf,nt))
  full <- spdest(y, x1, x2p, "LS")
  if (full$info=="success") {
    totperm <- totperm+1
    fitf <- full$fit$rkpk$eta
    devf <- sum(-1-log(y)+y*exp(-fitf)+fitf)
    cstatp <- c(cstatp, devr-devf)
    l2dp <- c(l2dp, mean((fitf-fitr)**2))
  }
}
print(c(mean(cstatp>cstat),mean(l2dp>l2d)))

Note that there is no need to fit the reduced model for the permuted data since the log spectrum does not depend on time under the null hypothesis. The permutation test for the preseizure series is performed similarly. The p-values for testing stationarity based on the two statistics are 0.82 and 0.73 for the baseline series, and 0.03 and 0.06 for the preseizure series. Therefore, the processes far away from the seizure's clinical onset can be regarded as stationary, while the processes close to the seizure's clinical onset are nonstationary.


Chapter 7

Smoothing Spline Nonlinear Regression

7.1 Motivation

The general smoothing spline regression model (2.10) in Chapter 2 and the SS ANOVA model (4.31) in Chapter 4 assume that the unknown function is observed through some linear functionals. This chapter deals with situations where some unknown functions are observed indirectly through nonlinear functionals. We discuss some potential applications in this section. More examples can be found in Section 7.6.

In some applications, the theoretical models depend on the unknown functions nonlinearly. For example, in remote sensing, the satellite upwelling radiance measurements $R_v$ are related to the underlying atmospheric temperature distribution f through the following equation:
$$
R_v(f) = B_v(f(x_s))\tau_v(x_s) - \int_{x_0}^{x_s} B_v(f(x))\tau_v'(x)\,dx,
$$
where x is some monotone transformation of pressure p, for example, the kappa units $x(p) = p^{5/8}$; $x_0$ and $x_s$ are the x values at the surface and top of the atmosphere; $\tau_v(x)$ is the transmittance of the atmosphere above x at wavenumber v; and $B_v(t)$ is Planck's function, $B_v(t) = c_1 v^3/\{\exp(c_2 v/t) - 1\}$, with known constants $c_1$ and $c_2$. The goal is to estimate f as a function of x based on noisy observations of $R_v(f)$. Obviously, $R_v(f)$ is nonlinear in f. Other examples, involving reservoir modeling and three-dimensional atmospheric temperature distributions from satellite-observed radiances, can be found in Wahba (1987) and O'Sullivan (1986).

Very often there are certain constraints, such as positivity and monotonicity, on the function of interest, and sometimes nonlinear transformations may be used to relax those constraints. For example, consider the following nonparametric regression model:
$$
y_i = g(x_i) + \epsilon_i, \qquad x_i \in [0,1],\ i = 1, \dots, n. \qquad (7.1)
$$
Suppose g is known to be positive. The transformation $g = \exp(f)$ substitutes the original constrained estimation of g with the unconstrained estimation of f. The resulting transformed model,
$$
y_i = \exp\{f(x_i)\} + \epsilon_i, \qquad x_i \in [0,1],\ i = 1, \dots, n, \qquad (7.2)
$$
depends on f nonlinearly. Monotonicity is another common constraint. Consider model (7.1) again and suppose g is known to be strictly increasing with $g'(x) > 0$. Write $g'$ as $g'(x) = \exp\{f(x)\}$. Reexpressing g as $g(x) = g(0) + \int_0^x \exp\{f(s)\}\,ds$ leads to the following model:
$$
y_i = \beta + \int_0^{x_i} \exp\{f(s)\}\,ds + \epsilon_i, \qquad x_i \in [0,1],\ i = 1, \dots, n. \qquad (7.3)
$$
The function f is free of constraints and enters the model nonlinearly. Strictly speaking, model (7.3) is a semiparametric nonlinear regression model as in Section 8.3, since it contains a parameter β.
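The reason this reparameterization removes the monotonicity constraint can be checked numerically: whatever unconstrained f is chosen, the reconstructed g is strictly increasing. A small Python sketch (trapezoid-rule integration; the example f and grid are our choices):

```python
import math

# g(x) = g(0) + integral_0^x exp{f(s)} ds, via the trapezoid rule on a grid.
def g_from_f(f, g0, xs):
    vals = [g0]
    for a, b in zip(xs, xs[1:]):
        vals.append(vals[-1] + (b - a) * (math.exp(f(a)) + math.exp(f(b))) / 2)
    return vals

f = lambda s: math.sin(5 * s) - s      # an arbitrary unconstrained function
xs = [i / 200 for i in range(201)]
g = g_from_f(f, g0=-1.0, xs=xs)

# g is strictly increasing even though f is not monotone.
print(all(b > a for a, b in zip(g, g[1:])))   # True
```

Every increment adds a positive quantity (an average of exponentials), so monotonicity holds by construction rather than as a constraint imposed during estimation.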

Sometimes one may want to consider empirical models that depend on unknown functions nonlinearly. One such model, the multiplicative model, will be introduced in Section 7.6.4.

7.2 Nonparametric Nonlinear Regression Models

A general nonparametric nonlinear regression (NNR) model assumes that
$$
y_i = \mathcal{N}_i(\boldsymbol{f}) + \epsilon_i, \qquad i = 1, \dots, n, \qquad (7.4)
$$
where $\boldsymbol{f} = (f_1, \dots, f_r)$ are r unknown functions, $f_k$ belongs to an RKHS $\mathcal{H}_k$ on an arbitrary domain $\mathcal{X}_k$ for $k = 1, \dots, r$, $\mathcal{N}_i$ are known nonlinear functionals on $\mathcal{H}_1 \times \cdots \times \mathcal{H}_r$, and $\epsilon_i$ are zero-mean independent random errors with a common variance $\sigma^2$. Note that the domains $\mathcal{X}_k$ for different functions $f_k$ may be the same or different. The NNR model (7.4) is obviously an extension of both the SSR model (2.10) and the SS ANOVA model (4.31). It may also be considered an extension of the parametric nonlinear regression model, with functions in infinite-dimensional spaces as parameters.

An interesting special case of the NNR model (7.4) arises when $\mathcal{N}_i$ depends on $\boldsymbol{f}$ through a nonlinear function and some linear functionals. Specifically,
$$
y_i = \psi(\mathcal{L}_{1i}f_1, \dots, \mathcal{L}_{ri}f_r) + \epsilon_i, \qquad i = 1, \dots, n, \qquad (7.5)
$$


where ψ is a known nonlinear function, and $\mathcal{L}_{1i}, \dots, \mathcal{L}_{ri}$ are bounded linear functionals. Model (7.2) for the positivity constraint is a special case of (7.5) with r = 1, $\psi(z) = \exp(z)$, and $\mathcal{L}_i f = f(x_i)$. However, model (7.3) for the monotonicity constraint cannot be written in the form of (7.5).

7.3 Estimation with a Single Function

In this section we restrict our attention to the case r = 1 and drop the subscript k for simplicity of notation. Our goal is to estimate the nonparametric function f in $\mathcal{H}$.

Assume that $\mathcal{H} = \mathcal{H}_0 \oplus \mathcal{H}_1$, where $\mathcal{H}_0 = \mathrm{span}\{\phi_1, \dots, \phi_p\}$ is an RKHS with RK $R_0$, and $\mathcal{H}_1$ is an RKHS with RK $R_1$. We estimate f as the minimizer of the following PLS:
$$
\min_{f\in\mathcal{H}}\left\{\frac{1}{n}\sum_{i=1}^{n}(y_i - \mathcal{N}_i f)^2 + \lambda\|P_1 f\|^2\right\}, \qquad (7.6)
$$
where $P_1$ is a projection operator from $\mathcal{H}$ onto $\mathcal{H}_1$, and λ is a smoothing parameter. We assume that the solution to (7.6) exists and is unique (conditions can be found in the supplement document of Ke and Wang (2004)), and denote the solution by $\hat{f}$.

7.3.1 Gauss–Newton and Newton–Raphson Methods

We first consider the special NNR model (7.5) in this section. Let

$$\eta_i(x) = \mathcal{L}_{i(z)} R(x, z) = \mathcal{L}_{i(z)} R_0(x, z) + \mathcal{L}_{i(z)} R_1(x, z) \triangleq \delta_i(x) + \xi_i(x).$$

Then $\eta_i \in \mathcal{S}$, where $\mathcal{S} = \mathcal{H}_0 \oplus \mathrm{span}\{\xi_1, \dots, \xi_n\}$. Any $f \in \mathcal{H}$ can be decomposed into $f = \varsigma_1 + \varsigma_2$, where $\varsigma_1 \in \mathcal{S}$ and $\varsigma_2 \in \mathcal{S}^c$. Furthermore, $\mathcal{L}_i f = \mathcal{L}_i \varsigma_1$. Denote

$$\mathrm{LS}(\psi(\mathcal{L}_1 f), \dots, \psi(\mathcal{L}_n f)) = \frac{1}{n} \sum_{i=1}^n (y_i - \psi(\mathcal{L}_i f))^2$$

as the least squares. Then the penalized least squares

$$\begin{aligned}
\mathrm{PLS}(\psi(\mathcal{L}_1 f), \dots, \psi(\mathcal{L}_n f))
&\triangleq \mathrm{LS}(\psi(\mathcal{L}_1 f), \dots, \psi(\mathcal{L}_n f)) + \lambda \|P_1 f\|^2 \\
&= \mathrm{LS}(\psi(\mathcal{L}_1 \varsigma_1), \dots, \psi(\mathcal{L}_n \varsigma_1)) + \lambda (\|P_1 \varsigma_1\|^2 + \|P_1 \varsigma_2\|^2) \\
&\geq \mathrm{LS}(\psi(\mathcal{L}_1 \varsigma_1), \dots, \psi(\mathcal{L}_n \varsigma_1)) + \lambda \|P_1 \varsigma_1\|^2 \\
&= \mathrm{PLS}(\psi(\mathcal{L}_1 \varsigma_1), \dots, \psi(\mathcal{L}_n \varsigma_1)).
\end{aligned}$$


Equality holds iff $\|P_1 \varsigma_2\| = \|\varsigma_2\| = 0$. Thus the minimizer of the PLS falls in the finite-dimensional space $\mathcal{S}$, and can be represented as

$$f(x) = \sum_{\nu=1}^p d_\nu \phi_\nu(x) + \sum_{i=1}^n c_i \xi_i(x). \qquad (7.7)$$

The representation in (7.7) is an extension of the Kimeldorf–Wahba representer theorem. Let $c = (c_1, \dots, c_n)^T$ and $d = (d_1, \dots, d_p)^T$. Based on (7.7), the PLS (7.6) becomes

$$\frac{1}{n} \sum_{i=1}^n (y_i - \psi(\mathcal{L}_i f))^2 + \lambda c^T \Sigma c, \qquad (7.8)$$

where $\mathcal{L}_i f = \sum_{\nu=1}^p d_\nu \mathcal{L}_i \phi_\nu + \sum_{j=1}^n c_j \mathcal{L}_i \xi_j$, and $\Sigma = \{\mathcal{L}_i \mathcal{L}_j R_1\}_{i,j=1}^n$.

We need to find the minimizers $c$ and $d$. Standard nonlinear optimization procedures can be employed to solve (7.8). We now describe the Gauss–Newton and Newton–Raphson methods.

We first describe the Gauss–Newton method. Let $c_-$, $d_-$, and $f_-$ be the current approximations of $c$, $d$, and $f$, respectively. Replacing $\psi(\mathcal{L}_i f)$ by its first-order expansion at $\mathcal{L}_i f_-$,

$$\psi(\mathcal{L}_i f) \approx \psi(\mathcal{L}_i f_-) - \psi'(\mathcal{L}_i f_-) \mathcal{L}_i f_- + \psi'(\mathcal{L}_i f_-) \mathcal{L}_i f, \qquad (7.9)$$

the PLS (7.8) can be approximated by

$$\frac{1}{n} \|\tilde{y} - V(Td + \Sigma c)\|^2 + \lambda c^T \Sigma c, \qquad (7.10)$$

where $\tilde{y}_i = y_i - \psi(\mathcal{L}_i f_-) + \psi'(\mathcal{L}_i f_-) \mathcal{L}_i f_-$, $\tilde{y} = (\tilde{y}_1, \dots, \tilde{y}_n)^T$, $V = \mathrm{diag}(\psi'(\mathcal{L}_1 f_-), \dots, \psi'(\mathcal{L}_n f_-))$, and $T = \{\mathcal{L}_i \phi_\nu\}_{i=1,\ \nu=1}^{n,\ p}$. Assume that $V$ is invertible, and let $\tilde{T} = VT$, $\tilde{\Sigma} = V \Sigma V$, $\tilde{c} = V^{-1} c$, and $\tilde{d} = d$. Then the approximated PLS (7.10) reduces to

$$\frac{1}{n} \|\tilde{y} - \tilde{T} \tilde{d} - \tilde{\Sigma} \tilde{c}\|^2 + \lambda \tilde{c}^T \tilde{\Sigma} \tilde{c}, \qquad (7.11)$$

which has the same form as (2.19). Thus the Gauss–Newton method updates $c$ and $d$ by solving

$$(\tilde{\Sigma} + n\lambda I)\tilde{c} + \tilde{T} \tilde{d} = \tilde{y}, \qquad \tilde{T}^T \tilde{c} = 0. \qquad (7.12)$$

The equations in (7.12) have the same form as those in (2.21). Therefore, the same method in Section 2.4 can be used to compute $\tilde{c}$ and $\tilde{d}$. New estimates of $c$ and $d$ can then be derived from $c = V\tilde{c}$ and $d = \tilde{d}$. It is easy to see (substitute $\tilde{T} = VT$, $\tilde{\Sigma} = V\Sigma V$, and $\tilde{c} = V^{-1}c$ into (7.12) and multiply the first equation by $V^{-1}$) that the equations in (7.12) are equivalent to

$$(\Sigma + n\lambda V^{-2}) c + T d = V^{-1} \tilde{y}, \qquad T^T c = 0, \qquad (7.13)$$

which have the same form as those in (5.6).

We next describe the Newton–Raphson method. Let

$$I(c, d) = \frac{1}{2} \sum_{i=1}^n \left\{ y_i - \psi\left( \sum_{\nu=1}^p d_\nu \mathcal{L}_i \phi_\nu + \sum_{j=1}^n c_j \mathcal{L}_i \xi_j \right) \right\}^2.$$

Let $\psi = (\psi(\mathcal{L}_1 f_-), \dots, \psi(\mathcal{L}_n f_-))^T$, $u = -V(y - \psi)$, where $V$ is defined above, $O = \mathrm{diag}((y_1 - \psi(\mathcal{L}_1 f_-)) \psi''(\mathcal{L}_1 f_-), \dots, (y_n - \psi(\mathcal{L}_n f_-)) \psi''(\mathcal{L}_n f_-))$, and $W = V^2 - O$. Then $\partial I/\partial c = -\Sigma V(y - \psi) = \Sigma u$, $\partial I/\partial d = -T^T V(y - \psi) = T^T u$, $\partial^2 I/\partial c \partial c^T = \Sigma V^2 \Sigma - \Sigma O \Sigma = \Sigma W \Sigma$, $\partial^2 I/\partial c \partial d^T = \Sigma W T$, and $\partial^2 I/\partial d \partial d^T = T^T W T$. The Newton–Raphson iteration satisfies the following equations:

$$\begin{pmatrix} \Sigma W \Sigma + n\lambda \Sigma & \Sigma W T \\ T^T W \Sigma & T^T W T \end{pmatrix} \begin{pmatrix} c - c_- \\ d - d_- \end{pmatrix} = \begin{pmatrix} -\Sigma u - n\lambda \Sigma c_- \\ -T^T u \end{pmatrix}. \qquad (7.14)$$

Assume that $W$ is positive definite. It is easy to see that solutions to

$$(\Sigma + n\lambda W^{-1}) c + T d = f_- - W^{-1} u, \qquad T^T c = 0, \qquad (7.15)$$

are also solutions to (7.14), where $f_- = (\mathcal{L}_1 f_-, \dots, \mathcal{L}_n f_-)^T$. Again, the equations in (7.15) have the same form as those in (5.6). The methods in Section 5.2.1 can be used to solve (7.15).

Note that, when $O$ is ignored, $W = V^2$ and $V^{-1}\tilde{y} = f_- - W^{-1}u$. Then the equations in (7.13) are the same as those in (7.15), and the Newton–Raphson method coincides with the Gauss–Newton method.

7.3.2 Extended Gauss–Newton Method

We now consider the estimation of the general NNR model (7.4). In general, the solution $\hat{f}$ to the PLS (7.6) no longer falls in a finite-dimensional space. Therefore, a certain approximation is necessary.

Let $f$ be a fixed element in $\mathcal{H}$. A nonlinear functional $\mathcal{N}$ on $\mathcal{H}$ is called Fréchet differentiable at $f$ if there exists a bounded linear functional $D\mathcal{N}(f)$ such that (Debnath and Mikusinski 1999)

$$\lim_{\|h\| \to 0} \frac{|\mathcal{N}(f + h) - \mathcal{N}(f) - D\mathcal{N}(f)(h)|}{\|h\|} = 0.$$


Denote $f_-$ as the current approximation of $f$. Assume that the $\mathcal{N}_i$ are Fréchet differentiable at $f_-$ and denote $D_i = D\mathcal{N}_i(f_-)$. Note that the $D_i$ are known bounded linear functionals at the current approximation. The best linear approximation of $\mathcal{N}_i$ near $f_-$ is (Debnath and Mikusinski 1999)

$$\mathcal{N}_i f \approx \mathcal{N}_i f_- + D_i(f - f_-). \qquad (7.16)$$

Based on the linear approximation (7.16), the NNR model can be approximated by

$$\tilde{y}_i = D_i f + \epsilon_i, \quad i = 1, \dots, n, \qquad (7.17)$$

where $\tilde{y}_i = y_i - \mathcal{N}_i f_- + D_i f_-$. We minimize

$$\frac{1}{n} \sum_{i=1}^n (\tilde{y}_i - D_i f)^2 + \lambda \|P_1 f\|^2 \qquad (7.18)$$

to get a new approximation of $f$. Since the $D_i$ are bounded linear functionals, the solution to (7.18) has the form

$$f = \sum_{\nu=1}^p d_\nu \phi_\nu(x) + \sum_{i=1}^n c_i \xi_i(x), \qquad (7.19)$$

where $\xi_i(x) = D_{i(z)} R_1(x, z)$. The coefficients $c_i$ and $d_\nu$ can be calculated using the same method as in Section 2.4. An iterative algorithm can then be formed, with the convergent solution as the final approximation of $\hat{f}$. We refer to this algorithm as the extended Gauss–Newton (EGN) algorithm since it is an extension of the Gauss–Newton method to infinite-dimensional spaces.

Note that the linear functionals $D_i$ depend on the current approximation $f_-$; thus the representers $\xi_j$ change across iterations. This approach adaptively chooses representers to approximate the PLS estimate $\hat{f}$. As in nonlinear regression, the performance of this algorithm depends largely on the curvature of the nonlinear functional. Simulations indicate that the EGN algorithm works well and converges quickly for commonly used nonlinear functionals. For the special NNR model (7.5), it can be shown that the EGN algorithm is equivalent to the standard Gauss–Newton method presented in Section 7.3.1 (Ke and Wang 2004).

An alternative approach is to approximate $f$ by a finite series and solve for the coefficients using the Gauss–Newton or the Newton–Raphson algorithm. When a finite series with good approximation properties is available, this approach may be preferable since it is relatively easy to implement.


However, the choice of such a finite series may become difficult in certain situations. The EGN approach is fully automatic and adaptive. On the other hand, it is more difficult to implement and may become computationally intensive.

7.3.3 Smoothing Parameter Selection and Inference

The smoothing parameter $\lambda$ was fixed in Sections 7.3.1 and 7.3.2. We now discuss methods for selecting $\lambda$. Similar to Section 3.4, we define the leaving-out-one cross-validation (CV) criterion as

$$\mathrm{CV}(\lambda) = \frac{1}{n} \sum_{i=1}^n (\mathcal{N}_i \hat{f}^{[i]} - y_i)^2, \qquad (7.20)$$

where $\hat{f}^{[i]}$ is the minimizer of the following PLS:

$$\frac{1}{n} \sum_{j \neq i} (y_j - \mathcal{N}_j f)^2 + \lambda \|P_1 f\|^2. \qquad (7.21)$$

Again, computation of $\hat{f}^{[i]}$ for each $i = 1, \dots, n$ is costly. For fixed $i$ and $z$, let $h(i, z)$ be the minimizer of

$$\frac{1}{n} \Big\{ (z - \mathcal{N}_i f)^2 + \sum_{j \neq i} (y_j - \mathcal{N}_j f)^2 \Big\} + \lambda \|P_1 f\|^2.$$

That is, $h(i, z)$ is the solution to (7.6) when the $i$th observation is replaced by $z$. It is not difficult to check that the arguments in Section 3.4 still hold for nonlinear functionals. Therefore, we have the following lemma.

Leaving-out-one Lemma. For any fixed $i$, $h(i, \mathcal{N}_i \hat{f}^{[i]}) = \hat{f}^{[i]}$.

Note that $h(i, y_i) = \hat{f}$. Define

$$a_i \triangleq \frac{\partial \mathcal{N}_i h(i, y_i)}{\partial y_i} = D\mathcal{N}_i(h(i, y_i)) \frac{\partial h(i, y_i)}{\partial y_i}, \qquad (7.22)$$

where $D\mathcal{N}_i(h(i, y_i))$ is the Fréchet differential of $\mathcal{N}_i$ at $h(i, y_i)$. Let $y_i^{[i]} = \mathcal{N}_i \hat{f}^{[i]}$. Applying the leaving-out-one lemma, we have

$$\Delta_i(\lambda) \triangleq \frac{\mathcal{N}_i \hat{f} - \mathcal{N}_i \hat{f}^{[i]}}{y_i - \mathcal{N}_i \hat{f}^{[i]}} = \frac{\mathcal{N}_i h(i, y_i) - \mathcal{N}_i h(i, y_i^{[i]})}{y_i - y_i^{[i]}} \approx \frac{\partial \mathcal{N}_i h(i, y_i)}{\partial y_i} = a_i.$$

Then

$$y_i - \mathcal{N}_i \hat{f}^{[i]} = \frac{y_i - \mathcal{N}_i \hat{f}}{1 - \Delta_i(\lambda)} \approx \frac{y_i - \mathcal{N}_i \hat{f}}{1 - a_i}.$$


Subsequently, the CV criterion (7.20) is approximated by

$$\mathrm{CV}(\lambda) \approx \frac{1}{n} \sum_{i=1}^n \frac{(y_i - \mathcal{N}_i \hat{f})^2}{(1 - a_i)^2}. \qquad (7.23)$$

Replacing $a_i$ by the average $\sum_{i=1}^n a_i / n$, we have the GCV criterion

$$\mathrm{GCV}(\lambda) = \frac{\frac{1}{n} \sum_{i=1}^n (y_i - \mathcal{N}_i \hat{f})^2}{\left(1 - \frac{1}{n} \sum_{i=1}^n a_i\right)^2}. \qquad (7.24)$$

Note that $a_i$ in (7.22) depends on $\hat{f}$, and $\hat{f}$ may depend on $y$ nonlinearly. Therefore, unlike the linear case, in general there is no explicit formula for $a_i$, and it is impossible to compute the CV or GCV estimate of $\lambda$ by minimizing (7.24) directly. One approach is to replace $\hat{f}$ in (7.22) by its current approximation, $f_-$. This suggests estimating $\lambda$ at each iteration. Specifically, at each iteration, select $\lambda$ for the approximating SSR model (7.17) by the standard GCV criterion (3.4). This iterative approach is easy to implement. However, it does not guarantee convergence. Simulations indicate that convergence is achieved in most cases (Ke and Wang 2004).

We use the following connections between the NNR model (7.5) and a nonlinear mixed-effects (NLME) model to extend the GML method. Consider the following NLME model:

$$y = \psi(\gamma) + \epsilon, \quad \epsilon \sim \mathrm{N}(0, \sigma^2 I), \qquad \gamma = Td + \Sigma c, \quad c \sim \mathrm{N}(0, \sigma^2 \Sigma^+ / n\lambda), \qquad (7.25)$$

where $\gamma = (\gamma_1, \dots, \gamma_n)^T$, $\psi(\gamma) = (\psi(\gamma_1), \dots, \psi(\gamma_n))^T$, $d$ are fixed effects, $c$ are random effects, $\epsilon = (\epsilon_1, \dots, \epsilon_n)^T$ are random errors independent of $c$, and $\Sigma^+$ is the Moore–Penrose inverse of $\Sigma$. It is common practice to estimate $c$ and $d$ as minimizers of the following joint negative log-likelihood (Lindstrom and Bates 1990):

$$\frac{1}{n} \|y - \psi(Td + \Sigma c)\|^2 + \lambda c^T \Sigma c. \qquad (7.26)$$

The joint negative log-likelihood (7.26) and the PLS (7.8) lead to the same estimates of $c$ and $d$. In their two-step procedure, at the LME step, Lindstrom and Bates (1990) approximated the NLME model (7.25) by the following linear mixed-effects model:

$$w = Xd + Zc + \epsilon, \qquad (7.27)$$


where

$$X = \left. \frac{\partial \psi(Td + \Sigma c)}{\partial d} \right|_{c = c_-,\ d = d_-}, \qquad Z = \left. \frac{\partial \psi(Td + \Sigma c)}{\partial c} \right|_{c = c_-,\ d = d_-}, \qquad w = y - \psi(Td_- + \Sigma c_-) + X d_- + Z c_-.$$

The subscript minus indicates quantities evaluated at the current iteration. It is not difficult to see that $w = \tilde{y}$, $X = VT$, and $Z = V\Sigma$. Therefore, the REML estimate of $\lambda$ based on the LME model (7.27) is the same as the GML estimate for the approximate SSR model corresponding to (7.12). This suggests an iterative approach: estimate $\lambda$ for the approximate SSR model (7.17) at each iteration using the GML method.

The UBR method (3.3) may also be used to estimate $\lambda$ at each iteration. The following algorithm summarizes the above procedures.

Linearization Algorithm

1. Initialize: $f = f_0$.

2. Linearize: Update $f$ by finding the PLS estimate of an approximate SSR model with linear functionals. The smoothing parameter $\lambda$ is estimated using the GCV, GML, or UBR method.

3. Repeat Step 2 until convergence.

For the special model (7.5), the Gauss–Newton method (i.e., solve (7.12)) or the Newton–Raphson method (i.e., solve (7.15)) may be used at Step 2. For the general NNR model (7.4), the EGN method (i.e., fit model (7.17)) may be used at Step 2.

We estimate $\sigma^2$ by

$$\hat{\sigma}^2 = \frac{\sum_{i=1}^n (y_i - \mathcal{N}_i \hat{f})^2}{n - \sum_{i=1}^n a_i},$$

where the $a_i$ are defined in (7.22) and computed at convergence.

We now discuss how to construct Bayesian confidence intervals. At convergence, the extended Gauss–Newton method approximates the original model by

$$y_i^* = D_i^* f + \epsilon_i, \quad i = 1, \dots, n, \qquad (7.28)$$

where $D_i^* = D\mathcal{N}_i(\hat{f})$ and $y_i^* = y_i - \mathcal{N}_i \hat{f} + D_i^* \hat{f}$. Assume a prior distribution for $f$ of the form

$$F(x) = \sum_{\nu=1}^p \zeta_\nu \phi_\nu(x) + \delta^{1/2} U(x),$$


where $\zeta_\nu \stackrel{iid}{\sim} \mathrm{N}(0, \kappa)$ and $U(x)$ is a zero-mean Gaussian stochastic process with $\mathrm{Cov}(U(x), U(z)) = R_1(x, z)$. Assume that observations are generated from

$$y_i^* = D_i^* F + \epsilon_i, \quad i = 1, \dots, n. \qquad (7.29)$$

Since the $D_i^*$ are bounded linear functionals, the posterior mean of the Bayesian model (7.29) equals $\hat{f}$. Posterior variances and Bayesian confidence intervals for model (7.29) can be calculated as in Section 3.8. Because they rest on the first-order approximation (7.28), the performance of these approximate Bayesian confidence intervals depends largely on the accuracy of the linear approximation at convergence. The bootstrap method may also be used to construct confidence intervals.

7.4 Estimation with Multiple Functions

We now consider the case when $r > 1$. The goal is to estimate the nonparametric functions $f = (f_1, \dots, f_r)$ in model (7.4). Note that $f_k \in \mathcal{H}_k$ for $k = 1, \dots, r$. Assume that $\mathcal{H}_k = \mathcal{H}_{k0} \oplus \mathcal{H}_{k1}$, where $\mathcal{H}_{k0} = \mathrm{span}\{\phi_{k1}, \dots, \phi_{kp_k}\}$ and $\mathcal{H}_{k1}$ is an RKHS with RK $R_{k1}$. We estimate $f$ as the minimizer of the following PLS:

$$\frac{1}{n} \sum_{i=1}^n (y_i - \mathcal{N}_i(f))^2 + \sum_{k=1}^r \lambda_k \|P_{k1} f_k\|^2, \qquad (7.30)$$

where $P_{k1}$ are projection operators from $\mathcal{H}_k$ onto $\mathcal{H}_{k1}$, and $\lambda_1, \dots, \lambda_r$ are smoothing parameters.

The linearization procedures in Section 7.3 may be applied to all functions simultaneously. However, this is usually computationally intensive when $n$ and/or $r$ is large. We use a Gauss–Seidel-type algorithm to estimate the functions iteratively, one at a time.

Nonlinear Gauss–Seidel Algorithm

1. Initialize: $f_k = f_{k0}$, $k = 1, \dots, r$.

2. Cycle: For $k = 1, \dots, r, 1, \dots, r, \dots$, conditional on the current approximations of $f_1, \dots, f_{k-1}, f_{k+1}, \dots, f_r$, update $f_k$ using the linearization algorithm in Section 7.3.

3. Repeat Step 2 until convergence.


Step 2 involves an inner iteration of the linearization algorithm. Convergence of this inner iteration is usually unnecessary; a small number of iterations is usually good enough. The nonlinear Gauss–Seidel procedure is an extension of the backfitting procedure.
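The cycling scheme itself is ordinary block relaxation. The following Python sketch illustrates the idea with two blocks whose conditional updates are exact linear least-squares solves; in the NNR setting each update would instead be a few linearization steps of Section 7.3 (all names and data here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 60, 5
A1, A2 = rng.standard_normal((n, m)), rng.standard_normal((n, m))
y = A1 @ rng.standard_normal(m) + A2 @ rng.standard_normal(m) \
    + 0.1 * rng.standard_normal(n)

def update_block(Ak, y_partial):
    """Conditional update: fit block k's coefficients to the partial
    residual, holding the other block fixed."""
    return np.linalg.lstsq(Ak, y_partial, rcond=None)[0]

f1, f2 = np.zeros(m), np.zeros(m)
for _ in range(50):                       # outer Gauss-Seidel cycles
    f1 = update_block(A1, y - A2 @ f2)    # update f1 holding f2 fixed
    f2 = update_block(A2, y - A1 @ f1)    # update f2 holding f1 fixed

resid = y - A1 @ f1 - A2 @ f2             # joint least-squares residual
```

At convergence the residual is orthogonal to both blocks' column spaces, which is the stationarity condition that the cycling scheme targets.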

Denote $\hat{f}$ as the estimate of $f$ at convergence. Note here that $\hat{f}$ denotes the estimate of the vector of functions $f$ rather than the fitted values of a single function. Assume that the Fréchet differentials of $\mathcal{N}_i$ with respect to $f$ evaluated at $\hat{f}$, $D_i^* = D\mathcal{N}_i(\hat{f})$, exist. Then $D_i^* h = \sum_{k=1}^r D_{ki}^* h_k$, where $D_{ki}^*$ is the partial Fréchet differential of $\mathcal{N}_i$ with respect to $f_k$ evaluated at $\hat{f}$, $h = (h_1, \dots, h_r)$, and $h_k \in \mathcal{H}_k$ (Flett 1980). Using the linear approximation, the NNR model (7.4) may be approximated by

$$y_i^* = \sum_{k=1}^r D_{ki}^* f_k + \epsilon_i, \quad i = 1, \dots, n, \qquad (7.31)$$

where $y_i^* = y_i - \mathcal{N}_i(\hat{f}) + \sum_{k=1}^r D_{ki}^* \hat{f}_k$. The corresponding Bayes model for (7.31) is given in (8.12) and (8.13). Section 8.2.2 provides formulae for posterior means and covariances, and discusses how to construct Bayesian confidence intervals for the functions $f_k$ in model (7.31). These Bayesian confidence intervals provide approximate confidence intervals for $f_k$ in the NNR model. Again, the bootstrap method may also be used to construct confidence intervals.

7.5 The nnr Function

The function nnr in the assist package is designed to fit the special NNR model (7.5) when the $\mathcal{L}_{ki}$ are evaluation functionals. Sections 7.6.2 and 7.6.3 provide two example implementations of the EGN method for NNR models that cannot be fitted by the nnr function.

A typical call is

nnr(formula, func, start)

where formula is a two-sided formula specifying the response variable on the left side of a ~ operator and an expression for the function $\psi$ on the right side, with $f_k$ treated as parameters. The argument func inputs a list of formulae, each specifying the bases $\phi_{k1}, \dots, \phi_{kp_k}$ for $\mathcal{H}_{k0}$ and the RK $R_{k1}$ for $\mathcal{H}_{k1}$. Each formula in this list has the form

f ~ list(~ φ1 + ··· + φp, rk).


For example, suppose $f = (f_1, f_2)$, where $f_1$ and $f_2$ are functions of an independent variable x. Suppose both $f_1$ and $f_2$ are modeled using cubic splines. Then the bases and RKs of $f_1$ and $f_2$ can be specified using

func=list(f1(x)~list(~x,cubic(x)), f2(x)~list(~x,cubic(x)))

or simply

func=list(f1(x)+f2(x)~list(~x, cubic(x)))

The argument start inputs a vector or an expression that specifies the initial values of the unknown functions.

The method for selecting smoothing parameters is specified by the argument spar. The options spar="v", spar="m", and spar="u" correspond to the GCV, GML, and UBR methods, respectively, with GCV as the default. The option method in the argument control specifies the computational method, with method="GN" and method="NR" corresponding to the Gauss–Newton and Newton–Raphson methods, respectively. The default is the Newton–Raphson method.

An object of nnr class is returned. The generic function summary can be applied to extract further information. Approximate Bayesian confidence intervals can be constructed based on the output of the intervals function.

7.6 Examples

7.6.1 Nonparametric Regression Subject to Positive Constraint

The exponential transformation in (7.2) may be used to relax the positivity constraint for a univariate or multivariate regression function. In this section we use a simple simulation to illustrate how to fit model (7.2) and the advantage of the exponential transformation over a simple cubic spline fit.

We generate $n = 50$ samples from model (7.1) with $g(x) = \exp(-6x)$, $x_i$ equally spaced in $[0, 1]$, and $\epsilon_i \stackrel{iid}{\sim} \mathrm{N}(0, 0.1^2)$. We fit $g$ with a cubic spline and with the exponential transformation model (7.2), where $f$ is modeled by a cubic spline:

> n <- 50
> x <- seq(0,1,len=n)
> y <- exp(-6*x)+.1*rnorm(n)
> ssrfit <- ssr(y~x, cubic(x))
> nnrfit <- nnr(y~exp(f(x)), func=f(u)~list(~u,cubic(u)),
    start=list(f=log(abs(y)+0.001)))

where, for simplicity, we used $\log(|y_i| + 0.001)$ as initial values. We compute the mean squared error (MSE) as

$$\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^n (\hat{g}(x_i) - g(x_i))^2,$$

where $\hat{g}$ is either the cubic spline fit or the fit based on the exponential transformation model (7.2). The simulation was repeated 100 times. Ignoring the positive constraint, the cubic spline fits have larger MSEs than those based on the exponential transformation (Figure 7.1(a)). Figure 7.1(b) shows observations, the true function, and fits for a typical replication. The cubic spline fit has portions that are negative. The exponential transformation leads to a better fit.


FIGURE 7.1 (a) Boxplots of MSEs on the logarithmic scale for the cubic spline fits and the fits based on the exponential transformation; and (b) observations (circles), the true function, and estimates for a typical replication.

7.6.2 Nonparametric Regression Subject to Monotone Constraint

Consider model (7.1) and suppose that $g$ is known to be strictly increasing. We relax the constraint by considering the transformed model


(7.3). It is clear that model (7.3) cannot be written in the form of (7.5). Therefore, it cannot be fitted using the current version of the nnr function. We now illustrate how to apply the EGN method of Section 7.3.2 to estimate the function $f$ in (7.3). A similar procedure may be derived for other situations; see Section 7.6.3 for another example.

For simplicity, we model $f$ in (7.3) using the cubic spline model space $W_2^2[0,1]$. Let $\mathcal{N}_i f = \int_0^{x_i} \exp\{f(s)\}\,ds$. Then it can be shown that the Fréchet differential of $\mathcal{N}_i$ at $f_-$ exists, and $D_i f = D\mathcal{N}_i(f_-)(f) = \int_0^{x_i} \exp\{f_-(s)\} f(s)\,ds$ (Debnath and Mikusinski 1999). Then model (7.3) can be approximated by

$$\tilde{y}_i = \beta + D_i f + \epsilon_i, \quad i = 1, \dots, n, \qquad (7.32)$$

where $\tilde{y}_i = y_i - \mathcal{N}_i f_- + D_i f_-$. Model (7.32) is a partial spline model. Suppose we use the construction of $W_2^2[0,1]$ in Section 2.6, where $\phi_1(x) = k_0(x) = 1$ and $\phi_2(x) = k_1(x) = x - 0.5$ are basis functions for $\mathcal{H}_0$, and $R_1(x, z) = k_2(x) k_2(z) - k_4(|x - z|)$ is the RK for $\mathcal{H}_1$. To fit model (7.32) we need to compute $\tilde{y}_i$, $T = \{D_i \phi_\nu\}_{i=1,\ \nu=1}^{n,\ 2}$, and $\Sigma = \{D_{i(x)} D_{j(z)} R_1(x, z)\}_{i,j=1}^n$. It can be shown that

$$\begin{aligned}
\tilde{y}_i &= y_i - \int_0^{x_i} \exp\{f_-(s)\}\{1 - f_-(s)\}\,ds, \\
D_i \phi_1 &= \int_0^{x_i} \exp\{f_-(s)\}\,ds, \\
D_i \phi_2 &= \int_0^{x_i} \exp\{f_-(s)\}(s - 0.5)\,ds, \\
D_{i(x)} D_{j(z)} R_1(x, z) &= \int_0^{x_i}\!\!\int_0^{x_j} \exp\{f_-(s) + f_-(t)\} R_1(s, t)\,ds\,dt.
\end{aligned}$$

The above integrals do not have closed forms. We approximate them using the Gaussian quadrature method. For simplicity of notation, suppose that the $x$ values are distinct and ordered such that $x_1 < x_2 < \cdots < x_n$. Let $x_0 = 0$. Write

$$\int_0^{x_i} \exp\{f_-(s)\}\,ds = \sum_{j=1}^i \int_{x_{j-1}}^{x_j} \exp\{f_-(s)\}\,ds.$$

We then approximate each integral $\int_{x_{j-1}}^{x_j} \exp\{f_-(s)\}\,ds$ using Gaussian quadrature with three points. The approximation is quite accurate when $x_j - x_{j-1}$ is small. More points may be added for wider intervals. We use the same method to approximate the integrals $\int_0^{x_i} \exp\{f_-(s)\}\{1 - f_-(s)\}\,ds$ and $\int_0^{x_i} \exp\{f_-(s)\}(s - 0.5)\,ds$. Write the double integral

$$\int_0^{x_i}\!\!\int_0^{x_j} \exp\{f_-(s) + f_-(t)\} R_1(s, t)\,ds\,dt = \sum_{k=1}^i \sum_{l=1}^j \int_{x_{k-1}}^{x_k}\!\!\int_{x_{l-1}}^{x_l} \exp\{f_-(s) + f_-(t)\} R_1(s, t)\,ds\,dt.$$

We then approximate each integral $\int_{x_{k-1}}^{x_k}\!\int_{x_{l-1}}^{x_l} \exp\{f_-(s) + f_-(t)\} R_1(s, t)\,ds\,dt$ using the simple product rule (Evans and Swartz 2000).

The R function inc in Appendix B implements the EGN procedure

for model (7.3). The function inc is available in the assist package.

We now use a small-scale simulation to show the advantage of the transformed model (7.3) over a simple cubic spline fit. We generate $n = 50$ samples from model (7.1) with $g(x) = 1 - \exp(-6x)$, $x_i$ equally spaced in $[0, 1]$, and $\epsilon_i \stackrel{iid}{\sim} \mathrm{N}(0, 0.1^2)$. We fit $g$ with a cubic spline and with model (7.3), where $f$ is modeled by a cubic spline, both with the GML choice of the smoothing parameter:

> n <- 50; x <- seq(0,1,len=n)
> y <- 1-exp(-6*x)+.1*rnorm(n)
> ssrfit <- ssr(y~x, cubic(x), spar="m")
> incfit <- inc(y, x, spar="m")

MSEs are computed as in Section 7.6.1. The simulation was repeated 100 times. Ignoring the monotonicity constraint, the cubic spline fits have larger MSEs than those based on model (7.3) (Figure 7.2(a)). Figure 7.2(b) shows observations, the true function, and fits for a typical replication. The cubic spline fit is not monotone. The fit based on model (7.3) is closer to the true function.

If the function $g$ in model (7.1) is known to be both positive and strictly increasing, then we can consider model (7.3) with the additional constraint $\beta > 0$. Writing $\beta = \exp(\alpha)$, model (7.3) becomes a semiparametric nonlinear regression model as in Section 8.3. A similar EGN procedure can be derived to fit such a model.
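The piecewise three-point Gaussian quadrature used above can be sketched in Python as follows (the integrand $e^s$ stands in for $\exp\{f_-(s)\}$ so that the result can be checked against a closed form):

```python
import numpy as np

# 3-point Gauss-Legendre nodes and weights on [-1, 1]
NODES = np.array([-np.sqrt(3 / 5), 0.0, np.sqrt(3 / 5)])
WEIGHTS = np.array([5 / 9, 8 / 9, 5 / 9])

def piecewise_gauss(h, knots):
    """Approximate int_0^{x_i} h(s) ds for every knot x_i by summing
    3-point Gauss-Legendre rules over the subintervals [x_{j-1}, x_j]."""
    xs = np.concatenate([[0.0], knots])
    total, vals = 0.0, []
    for a, b in zip(xs[:-1], xs[1:]):
        s = 0.5 * (b - a) * NODES + 0.5 * (a + b)   # map nodes to [a, b]
        total += 0.5 * (b - a) * np.dot(WEIGHTS, h(s))
        vals.append(total)                           # cumulative integral
    return np.array(vals)

x = np.linspace(0.1, 1.0, 10)
approx = piecewise_gauss(np.exp, x)                  # int_0^{x_i} e^s ds
exact = np.exp(x) - 1.0
```

Because the subintervals here have length 0.1, the three-point rule is accurate to many digits, in line with the remark that the approximation is quite accurate when $x_j - x_{j-1}$ is small.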

Now consider the child growth data, consisting of height measurements of a child over one school year. Observations are shown in Figure 7.3(a). We first fit a cubic spline to model (7.1) with the GCV choice of the smoothing parameter. The cubic spline fit, shown in Figure 7.3(a) as the dashed line, is not monotone. It is reasonable to assume that the mean



FIGURE 7.2 (a) Boxplots of MSEs on the logarithmic scale for the cubic spline fits and the fits based on model (7.3); and (b) observations (circles), the true function, and estimates for a typical replication.


FIGURE 7.3 Child growth data: plots of (a) observations (circles), the cubic spline fit (dashed line), and the fit based on model (7.3) with 95% bootstrap confidence intervals (shaded region), and (b) the estimate of the velocity of growth with 95% bootstrap confidence intervals (shaded region).

growth function $g$ in (7.1) is a strictly increasing function. Therefore, we fit model (7.3) using the inc function and compute bootstrap confidence intervals with 10,000 bootstrap samples:

> library(fda); attach(onechild)
> x <- day/365; y <- height
> onechild.cub.fit <- ssr(y~x, cubic(x))
> onechild.inc.fit <- inc(y, x, spar="m")
> nboot <- 10000; set.seed(23057)
> yhat <- onechild.inc.fit$y.fit; resi <- y-yhat
> fb <- gb <- NULL; totboot <- 0
> while (totboot<nboot) {
    yb <- yhat+sample(resi, replace=T)
    tmp <- try(inc(yb, x, spar="m"))
    if(class(tmp)!="try-error"&tmp$iter[2]<1.e-6) {
      fb <- cbind(fb,
        (tmp$pred$f-onechild.inc.fit$pred$f)/tmp$sigma)
      gb <- cbind(gb,
        (tmp$pred$y-onechild.inc.fit$pred$y)/tmp$sigma)
      totboot <- totboot+1
    }
  }
> shat <- onechild.inc.fit$sigma
> gl <- onechild.inc.fit$pred$y-
    shat*apply(gb,1,quantile,prob=.975)
> gu <- onechild.inc.fit$pred$y-
    shat*apply(gb,1,quantile,prob=.025)
> hl <- exp(onechild.inc.fit$pred$f-
    shat*apply(fb,1,quantile,prob=.975))/365
> hu <- exp(onechild.inc.fit$pred$f-
    shat*apply(fb,1,quantile,prob=.025))/365

where gl and gu are lower and upper bounds for the mean function $g$, and hl and hu are lower and upper bounds for the velocity, defined as $h(x) = g'(x) = \exp\{f(x)\}$. Percentile-t bootstrap confidence intervals were constructed.

The constrained fit based on model (7.3) is shown in Figure 7.3(a) with 95% bootstrap confidence intervals. Figure 7.3(b) shows the estimate of $g'(x)$ with 95% bootstrap confidence intervals. As in Ramsay (1998), the velocity shows three short bursts. Note that Ramsay (1998) used a different transformation, $g(x) = \beta_1 + \beta_2 \int_0^x \exp\{\int_0^s w(u)\,du\}\,ds$, in model (7.1) to relax the monotone constraint. It is easy to see that $w(x) = f'(x)$, where $f(x) = \log g'(x)$. Therefore, with respect to the mean function $g$, the penalty in this section equals $\int_0^1 [\{\log g'(x)\}'']^2\,dx$, while the penalty in Ramsay (1998) equals $\int_0^1 \{g''(x)/g'(x)\}^2\,dx$.


7.6.3 Term Structure of Interest Rates

In this section we use the bond data to investigate the term structure of interest rates, a concept central to economic and financial theory.

Consider a set of $n$ coupon bonds from which the interest rate term structure is to be inferred. Denote the current time as zero. Let $y_i$ be the current price of bond $i$, and let $r_{ij}$ be the payment paid at a future time $x_{ij}$, $0 < x_{i1} < \cdots < x_{im_i}$, $i = 1, \dots, n$, $j = 1, \dots, m_i$. The pricing model assumes that (Fisher, Nychka and Zervos 1995, Jarrow, Ruppert and Yu 2004)

$$y_i = \sum_{j=1}^{m_i} r_{ij}\, g(x_{ij}) + \epsilon_i, \quad i = 1, \dots, n, \qquad (7.33)$$

where $g$ is a discount function, $g(x_{ij})$ represents the price of a dollar delivered at time $x_{ij}$, and $\epsilon_i$ are iid random errors with mean zero and variance $\sigma^2$. The goal is to estimate the discount function $g$ from the observations $y_i$. Assume that $g \in W_2^2[0, b]$ for a fixed time $b$, and define $\mathcal{L}_i g = \sum_{j=1}^{m_i} r_{ij} g(x_{ij})$. Then it is easy to see that the $\mathcal{L}_i$ are bounded linear functionals on $W_2^2[0, b]$. Therefore, model (7.33) is a special case of the general SSR model (2.10). Consider the cubic spline construction in Section 2.2. Two basis functions of the null space are $\phi_1(x) = 1$ and $\phi_2(x) = x$; denote the RK of $\mathcal{H}_1$ as $R_1$. To fit model (7.33) using the ssr function, we need to compute $T = \{\mathcal{L}_i \phi_\nu\}_{i=1,\ \nu=1}^{n,\ 2}$ and $\Sigma = \{\mathcal{L}_i \mathcal{L}_j R_1\}_{i,j=1}^n$. Define $r_i = (r_{i1}, \dots, r_{im_i})$ and $X = \mathrm{diag}(r_1, \dots, r_n)$. Then

$$T = \left\{ \sum_{j=1}^{m_i} r_{ij} \phi_\nu(x_{ij}) \right\}_{i=1,\ \nu=1}^{n,\ 2} = XS,$$

where $S = (\phi_1, \phi_2)$ and $\phi_\nu = (\phi_\nu(x_{11}), \dots, \phi_\nu(x_{1m_1}), \dots, \phi_\nu(x_{n1}), \dots, \phi_\nu(x_{nm_n}))^T$ for $\nu = 1, 2$. Similarly, it can be shown that

$$\Sigma = \left\{ \sum_{k=1}^{m_i} \sum_{l=1}^{m_j} r_{ik} r_{jl} R_1(x_{ik}, x_{jl}) \right\}_{i,j=1}^n = X \Lambda X^T,$$

where $\Lambda = \{\Lambda_{ij}\}_{i,j=1}^n$ and $\Lambda_{ij} = \{R_1(x_{ik}, x_{jl})\}_{k=1,\ l=1}^{m_i,\ m_j}$. Note that $R_1$ can be calculated by the cubic2 function.

The bond data set contains 78 Treasury bonds and 144 GE (General Electric Company) bonds. We first fit model (7.33) to the Treasury bonds and compute estimates of $g$ at grid points as follows:

> data(bond); attach(bond)
> group <- as.vector(table(name[type=="govt"]))
> y <- price[type=="govt"][cumsum(group)]
> x <- time[type=="govt"]
> r <- payment[type=="govt"]
> X <- assist:::diagComp(matrix(r,nrow=1),group)
> S <- cbind(1, x)
> T <- X%*%S
> Q <- X%*%cubic2(x)%*%t(X)
> bond.cub.fit.govt <- ssr(y~T-1, Q, spar="m",
    limnla=c(6,10))
> grid1 <- seq(min(x),max(x),len=100)
> g.cub.p.govt <- cbind(1,grid1)%*%bond.cub.fit.govt$coef$d+
    cubic2(grid1,x)%*%t(X)%*%bond.cub.fit.govt$coef$c

where diagComp is a function in the assist package that constructs the matrix $X = \mathrm{diag}(r_1, \dots, r_n)$. We compute bootstrap confidence intervals for $g$ at grid points as follows:

> boot.bond.one <- function(x, y, yhat, X, T, Q,
    spar="m", limnla=c(-3,6), grid, nboot, seed=0) {
    set.seed(seed)
    resi <- y-yhat
    gb <- NULL
    for(i in 1:nboot) {
      yb <- yhat + sample(resi, replace=TRUE)
      tmp <- try(ssr(yb~T-1, Q, spar=spar, limnla=limnla))
      if(class(tmp)!="try-error") gb <- cbind(gb,
        cbind(1,grid)%*%tmp$coef$d+
        cubic2(grid,x)%*%t(X)%*%tmp$coef$c)
    }
    return(gb)
  }
> g.cub.b.govt <- boot.bond.one(x, y,
    yhat=bond.cub.fit.govt$fit,
    X, T, Q, grid=grid1, nboot=5000, seed=3498)
> gl <- apply(g.cub.b.govt, 1, quantile, prob=.025)
> gu <- apply(g.cub.b.govt, 1, quantile, prob=.975)

where gl and gu are lower and upper bounds for the discount function g, and the 95% percentile bootstrap confidence intervals were computed based on 5,000 bootstrap samples. Model (7.33) for the GE bond can be fitted similarly. The estimates of the discount functions and bootstrap confidence intervals for the Treasury and GE bonds are shown in Figure 7.4(a).
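The residual (percentile) bootstrap performed by boot.bond.one can be summarized generically. The following is a minimal sketch in Python, not the book's R code: refit and grid_eval are hypothetical stand-ins for refitting the spline model to a bootstrap response and evaluating the refitted curve on a grid.

```python
import numpy as np

def residual_bootstrap_ci(yhat, resid, refit, grid_eval,
                          nboot=1000, level=0.95, seed=0):
    """Percentile bootstrap confidence band for a fitted curve.

    yhat: fitted values; resid: residuals y - yhat;
    refit: maps a bootstrap response vector to a fitted model;
    grid_eval: maps a fitted model to curve values on a grid.
    """
    rng = np.random.default_rng(seed)
    curves = []
    for _ in range(nboot):
        # Resample residuals with replacement and refit the model
        yb = yhat + rng.choice(resid, size=len(resid), replace=True)
        curves.append(grid_eval(refit(yb)))
    curves = np.asarray(curves)
    alpha = (1.0 - level) / 2.0
    return (np.quantile(curves, alpha, axis=0),
            np.quantile(curves, 1.0 - alpha, axis=0))
```

The two quantile calls play the role of the two apply(..., quantile, prob=.025/.975) lines above.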



[Figure 7.4 appears here: four panels (a)–(d) with x-axis time (yr); y-axes: discount rate in (a) and (b), forward rate in (c), and credit spread in (d).]

FIGURE 7.4 Bond data, plots of (a) unconstrained estimates of the discount function for Treasury bond (solid line) and GE bond (dashed line) based on model (7.33) with 95% bootstrap confidence intervals (shaded regions), (b) constrained estimates of the discount function for Treasury bond (solid line) and GE bond (dashed line) based on model (7.35) with 95% bootstrap confidence intervals (shaded regions), (c) estimates of the forward rate for Treasury bond (solid line) and GE bond (dashed line) based on model (7.35) with 95% bootstrap confidence intervals (shaded regions), and (d) estimates of the credit spread based on model (7.35) with 95% bootstrap confidence intervals (shaded region).

The discount function g is required to be positive, decreasing, and to satisfy g(0) = 1. These constraints are ignored in the above direct estimation. One simple approach to dealing with these constraints is to represent g by

$$g(x) = \exp\left\{ -\int_0^x f(s)\,ds \right\}, \qquad (7.34)$$


where $f(s) \geq 0$ is the so-called forward rate. Reparametrization (7.34) takes care of all constraints on g. The goal now is to estimate the forward rate f. Assuming $f \in W_2^2[0, b]$ and replacing g in (7.33) by (7.34) leads to the following NNR model

$$y_i = \sum_{j=1}^{m_i} r_{ij} \exp\left\{ -\int_0^{x_{ij}} f(s)\,ds \right\} + \epsilon_i, \quad i = 1, \ldots, n. \qquad (7.35)$$

For simplicity, the nonnegative constraint on f is not enforced since its estimate is not close to zero. When necessary, the nonnegative constraint can be enforced with a further reparametrization of f such as f = exp(h). The R function one.bond in Appendix C implements the EGN algorithm to fit model (7.35) for a single bond. For example, we can fit model (7.35) to the Treasury bond as follows:

bond.nnr.fit.govt <- one.bond(price=price[type=="govt"],
    payment=payment[type=="govt"],
    time=time[type=="govt"],
    name=name[type=="govt"])

In the following, we model the Treasury and GE bonds jointly. Assume that

$$y_{ki} = \sum_{j=1}^{m_{ki}} r_{kij} \exp\left[ -\int_0^{x_{kij}} \{ f_1(s) + f_2(s)\delta_{k,2} \}\,ds \right] + \epsilon_{ki}, \quad k = 1, 2;\ i = 1, \ldots, n_k, \qquad (7.36)$$

where k represents bond type with k = 1 and k = 2 corresponding to Treasury and GE bonds, respectively, $y_{ki}$ is the current price for bond i of type k, $r_{kij}$ are future payments for bond i of type k, $\delta_{\nu,\mu}$ is the Kronecker delta, $f_1$ is the forward rate for the Treasury bond, and $f_2$ represents the difference between GE and Treasury bonds (also called the credit spread). We assume that $f_1, f_2 \in W_2^2[0, b]$. Model (7.36) is a general NNR model with two functions $f_1$ and $f_2$. It cannot be fitted directly by the nnr function. We now describe how to implement the nonlinear Gauss–Seidel algorithm to fit model (7.36).

Denote the current estimates of $f_1$ and $f_2$ as $f_{1-}$ and $f_{2-}$, respectively. First consider updating $f_1$ for fixed $f_2$. The Fréchet differential of $N_{ki}$ with respect to $f_1$ at the current estimates $f_{1-}$ and $f_{2-}$ is

$$D_{ki}h = -\sum_{j=1}^{m_{ki}} r_{kij} \exp\left[ -\int_0^{x_{kij}} \{ f_{1-}(s) + f_{2-}(s)\delta_{k,2} \}\,ds \right] \int_0^{x_{kij}} h(s)\,ds
= -\sum_{j=1}^{m_{ki}} r_{kij} w_{kij} \int_0^{x_{kij}} h(s)\,ds,$$


where $w_{kij} = \exp\left[ -\int_0^{x_{kij}} \{ f_{1-}(s) + f_{2-}(s)\delta_{k,2} \}\,ds \right]$. Let

$$y_{ki,1} = \sum_{j=1}^{m_{ki}} r_{kij} w_{kij} (1 + f_{kij,1}) - y_{ki},$$

$$\mathcal{L}_{ki,1} f_1 = \sum_{j=1}^{m_{ki}} r_{kij} w_{kij} \int_0^{x_{kij}} f_1(s)\,ds,$$

where $f_{kij,1} = \int_0^{x_{kij}} f_{1-}(s)\,ds$. We need to fit the following SSR model

$$y_{ki,1} = \mathcal{L}_{ki,1} f_1 + \epsilon_{ki}, \quad k = 1, 2;\ i = 1, \ldots, n_k, \qquad (7.37)$$

to update $f_1$. Let $T_k = \{\mathcal{L}_{ki,1}\phi_\nu\}_{i=1,\,\nu=1}^{n_k,\,2}$ for $k = 1, 2$, $T = (T_1^T, T_2^T)^T$, $\Sigma_{uv} = \{\mathcal{L}_{ui,1}\mathcal{L}_{vj,1}R_1\}_{i=1,\,j=1}^{n_u,\,n_v}$ for $u, v = 1, 2$, and $\Sigma = \{\Sigma_{uv}\}_{u,v=1}^{2}$. To fit model (7.37) using the ssr function, we need to compute the matrices $T$ and $\Sigma$. Define $\boldsymbol{b}_{ki} = (r_{ki1}w_{ki1}, \ldots, r_{kim_{ki}}w_{kim_{ki}})$ and $X_k = \mathrm{diag}(\boldsymbol{b}_{k1}, \ldots, \boldsymbol{b}_{kn_k})$ for $k = 1, 2$. Define

$$\boldsymbol{\psi}_{k\nu} = \left( \int_0^{x_{k11}} \phi_\nu(s)\,ds,\ \ldots,\ \int_0^{x_{k1m_{k1}}} \phi_\nu(s)\,ds,\ \ldots,\ \int_0^{x_{kn_k1}} \phi_\nu(s)\,ds,\ \ldots,\ \int_0^{x_{kn_km_{kn_k}}} \phi_\nu(s)\,ds \right)^T, \quad \nu = 1, 2;\ k = 1, 2,$$

$$S_k = (\boldsymbol{\psi}_{k1}, \boldsymbol{\psi}_{k2}), \quad k = 1, 2,$$

$$\Lambda_{uv,ij} = \left\{ \int_0^{x_{uik}} \int_0^{x_{vjl}} R_1(s, t)\,ds\,dt \right\}_{k=1,\,l=1}^{m_{ui},\,m_{vj}}, \quad u, v = 1, 2;\ i = 1, \ldots, n_u;\ j = 1, \ldots, n_v,$$

$$\Lambda_{uv} = \left\{ \Lambda_{uv,ij} \right\}_{i=1,\,j=1}^{n_u,\,n_v}, \quad u, v = 1, 2.$$

Then it can be shown that

$$T_k = \left\{ \sum_{j=1}^{m_{ki}} r_{kij} w_{kij} \int_0^{x_{kij}} \phi_\nu(s)\,ds \right\}_{i=1,\,\nu=1}^{n_k,\,2} = X_k S_k, \quad k = 1, 2,$$

$$\Sigma_{uv} = \left\{ \sum_{k=1}^{m_{ui}} \sum_{l=1}^{m_{vj}} r_{uik} w_{uik} r_{vjl} w_{vjl} \int_0^{x_{uik}} \int_0^{x_{vjl}} R_1(s, t)\,ds\,dt \right\}_{i=1,\,j=1}^{n_u,\,n_v} = X_u \Lambda_{uv} X_v^T, \quad u, v = 1, 2.$$

As in Section 7.6.2, the integrals $\int_0^{x_{kij}} \phi_\nu(s)\,ds$ and $\int_0^{x_{uik}} \int_0^{x_{vjl}} R_1(s, t)\,ds\,dt$ are approximated using the Gaussian quadrature method. Note that these integrals do not change along iterations. Therefore, they only need to be computed once.


Next we consider updating $f_2$ with $f_1$ fixed. The Fréchet differential of $N_{ki}$ at the current estimates $f_{1-}$ and $f_{2-}$ is

$$D_{ki}h = -\sum_{j=1}^{m_{ki}} r_{kij}\,\delta_{k,2} \exp\left[ -\int_0^{x_{kij}} \{ f_{1-}(s) + f_{2-}(s)\delta_{k,2} \}\,ds \right] \times \int_0^{x_{kij}} h(s)\,ds.$$

Note that $D_{ki}h = 0$ when k = 1. Therefore, we use observations with k = 2 only at this step. Let

$$y_{2i,2} = \sum_{j=1}^{m_{2i}} r_{2ij} w_{2ij} (1 + f_{2ij,2}) - y_{2i},$$

$$\mathcal{L}_{2i,2} f = \sum_{j=1}^{m_{2i}} r_{2ij} w_{2ij} \int_0^{x_{2ij}} f(s)\,ds,$$

where $f_{2ij,2} = \int_0^{x_{2ij}} f_{2-}(s)\,ds$. We need to fit the following SSR model

$$y_{2i,2} = \mathcal{L}_{2i,2} f_2 + \epsilon_{2i}, \quad i = 1, \ldots, n_2, \qquad (7.38)$$

to update $f_2$. It can be shown that

$$\{\mathcal{L}_{2i,2}\phi_\nu\}_{i=1,\,\nu=1}^{n_2,\,2} = T_2, \qquad \{\mathcal{L}_{2i,2}\mathcal{L}_{2j,2}R_1\}_{i=1,\,j=1}^{n_2,\,n_2} = \Sigma_{22}.$$

The R function two.bond in Appendix C implements the nonlinear Gauss–Seidel algorithm to fit model (7.36) for two bonds. For example, we can fit model (7.36) to the bond data and compute 5,000 bootstrap samples as follows:

> bond.nnr.fit <- two.bond(price=price, payment=payment,
+     time=time, name=name, type=type, spar="m")
> boot.bond.two <- function(y, yhat, price, payment, time,
+     name, type, spar="m", limnla=c(-3,6), nboot, seed=0)
+ {
+   set.seed(seed)
+   resi <- y-yhat
+   group <- c(as.vector(table(name[type=="govt"])),
+     as.vector(table(name[type=="ge"])))
+   fb <- f2b <- gb <- NULL
+   for(i in 1:nboot) {
+     price.b <- rep(yhat+sample(resi,replace=TRUE), group)
+     tmp <- try(two.bond(price=price.b, payment=payment,
+       time=time, name=name, type=type))
+     if(class(tmp)!="try-error"&
+        tmp$iter[2]<1.e-4&tmp$iter[3]==0) {
+       fb <- cbind(fb,c(tmp$f.val$f1,tmp$f.val$f2))
+       f2b <- cbind(f2b,tmp$f2.val)
+       gb <- cbind(gb,c(tmp$dc[[1]],tmp$dc[[2]]))
+     }
+   }
+   list(fb=fb, f2b=f2b, gb=gb)
+ }

> gf.nnr.b <- boot.bond.two(y=bond.nnr.fit$y$y,
+     yhat=bond.nnr.fit$y$yhat, price=price, payment=payment,
+     time=time, name=name, type=type, nboot=5000,
+     seed=2394)

where the function boot.bond.two returns a list of estimates of the discount functions (gb), the forward rates (fb), and the credit spread (f2b). The 95% percentile bootstrap confidence intervals for the discount functions of the Treasury and GE bonds can be computed as follows:

> n1 <- nrow(bond[type=="govt",])

> gl1 <- apply(gf.nnr.b$gb[1:n1,],1,quantile,prob=.025)

> gu1 <- apply(gf.nnr.b$gb[1:n1,],1,quantile,prob=.975)

> gl2 <- apply(gf.nnr.b$gb[-(1:n1),],1,quantile,prob=.025)

> gu2 <- apply(gf.nnr.b$gb[-(1:n1),],1,quantile,prob=.975)

where gl1 and gu1 are lower and upper bounds for the discount function of the Treasury bond, and gl2 and gu2 are lower and upper bounds for the discount function of the GE bond. The 95% percentile bootstrap confidence intervals for the forward rates and credit spread can be computed similarly based on the objects gf.nnr.b$fb and gf.nnr.b$f2b.

Figures 7.4(b) and 7.4(c) show the estimates of the discount functions and forward rates, respectively. As expected, the GE discount rate is consistently smaller than that of the Treasury bonds, representing the higher risk associated with corporate bonds. To assess the difference between the two forward rates, we plot the estimated credit spread and its 95% bootstrap confidence intervals in Figure 7.4(d). The credit spread is positive when time is smaller than 11 years.

7.6.4 A Multiplicative Model for Chickenpox Epidemic

Consider the chickenpox epidemic data consisting of monthly numbers of chickenpox cases in New York City during 1931–1972. Denote y as the square root of reported cases in month $x_1$ of year $x_2$. Both $x_1$ and $x_2$ are transformed into the interval [0, 1]. Figure 7.5 shows the time series plot of the square root of the monthly case numbers.

[Figure 7.5 appears here: time series over years 31–71; y-axis y (square root of monthly case counts).]

FIGURE 7.5 Chickenpox data, plots of the square root of the number of cases (dotted line), and the fits from the multiplicative model (7.39) (solid line) and SS ANOVA model (7.40) (dashed line).

There is a clear seasonal variation within a year that has long been recognized (Yorke and London 1973, Earn, Rohani, Bolker and Grenfell 2000). The seasonal variation was mainly caused by the social behavior of children, who made close contacts when school was in session, and by temperature and humidity, which may affect the survival and transmission of dispersal stages. Thus the seasonal variations were similar over the years. The magnitude of this variation and the average number of cases may change over the years. Consequently, we consider the following multiplicative model

$$y(x_1, x_2) = g_1(x_2) + \exp\{g_2(x_2)\} \times g_3(x_1) + \epsilon(x_1, x_2), \qquad (7.39)$$

where $y(x_1, x_2)$ is the square root of reported cases in month $x_1$ of year $x_2$; and $g_1$, $g_3$, and $\exp(g_2)$ represent, respectively, the yearly mean, the seasonal trend in a year, and the magnitude of the seasonal variation for a particular year. The function $\exp(g_2)$ is referred to as the amplitude. A bigger amplitude corresponds to a bigger seasonal variation. For simplicity, we assume that the random errors $\epsilon(x_1, x_2)$ are independent with a constant variance.

To make model (7.39) identifiable, we use the following two side conditions:

(a) $\int_0^1 g_2(x_2)\,dx_2 = 0$. The exponential transformation of $g_2$ and this condition make $g_2$ identifiable with $g_3$: the exponential transformation makes $\exp(g_2)$ free of a sign change, and $\int_0^1 g_2(x_2)\,dx_2 = 0$ makes $\exp(g_2)$ free of a positive multiplying constant. This condition can be fulfilled by removing the constant functions from the model space of $g_2$.

(b) $\int_0^1 g_3(x_1)\,dx_1 = 0$. This condition eliminates the additive constant, making $g_3$ identifiable with $g_1$. This condition can be fulfilled by removing the constant functions from the model space of $g_3$.
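Both side conditions amount to removing the constant component from the corresponding function. On a grid, this is just centering, as the following sketch illustrates (the midpoint grid and the seasonal curve are made up for illustration; the book enforces the conditions by construction of the model space, not numerically):

```python
import numpy as np

# Month midpoints mapped to [0, 1], as in the transformation of x1.
x = (np.arange(12) + 0.5) / 12.0

# A hypothetical seasonal curve with a nonzero mean component.
g3 = np.sin(2 * np.pi * x) + 0.7

# Removing the constant component enforces the side condition
# that g3 integrates to zero, up to the accuracy of the Riemann sum.
g3_centered = g3 - g3.mean()

assert abs(g3_centered.mean()) < 1e-12
```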

We model $g_1$ and $g_2$ using cubic splines. Specifically, we assume that $g_1 \in W_2^2[0, 1]$ and $g_2 \in W_2^2[0, 1] \ominus \{1\}$, where constant functions are removed from the model space for $g_2$ to satisfy the side condition (a). Since the seasonal trend $g_3$ is close to a sinusoidal model, we model $g_3$ using a trigonometric spline with $L = D^2 + (2\pi)^2$ ($m = 2$ in (2.70)). That is, $g_3 \in W_2^2(\mathrm{per}) \ominus \{1\}$, where constant functions are removed from the model space for $g_3$ to satisfy the side condition (b). Obviously, the multiplicative model (7.39) is a special case of the NNR model with functions denoted as $g_1$, $g_2$, and $g_3$ instead of $f_1$, $f_2$, and $f_3$ (the notations $f_i$ are saved for an SS ANOVA model later). The multiplicative model (7.39) is fitted as follows:

> data(chickenpox)
> y <- sqrt(chickenpox$count)
> x1 <- (chickenpox$month-0.5)/12
> x2 <- ident(chickenpox$year)
> tmp <- ssr(y~1, rk=periodic(x1), spar="m")
> g3.ini <- predict(tmp, term=c(0,1))$fit
> chick.nnr <- nnr(y~g1(x2)+exp(g2(x2))*g3(x1),
+     func=list(g1(x)~list(~I(x-.5),cubic(x)),
+       g2(x)~list(~I(x-.5)-1,cubic(x)),
+       g3(x)~list(~sin(2*pi*x)+cos(2*pi*x)-1,
+         lspline(x,type="sine0"))),
+     start=list(g1=mean(y),g2=0,g3=g3.ini),
+     control=list(converg="coef"),spar="m")
> grid <- data.frame(x1=seq(0,1,len=50),
+     x2=seq(0,1,len=50))
> chick.nnr.pred <- intervals(chick.nnr, newdata=grid,
+     terms=list(g1=matrix(c(1,1,1,1,1,0,0,0,1),
+         nrow=3,byrow=T),
+       g2=matrix(c(1,1,1,0,0,1),nrow=3,byrow=T),
+       g3=matrix(c(1,1,1,1,1,0,0,0,1),nrow=3,byrow=T)))

We first fitted a periodic spline with the variable $x_1$ only and used the fitted values of the smooth component as initial values for $g_3$. We used the average of y as the initial value for $g_1$ and the constant zero as the initial value for $g_2$. The intervals function was used to compute approximate means and standard deviations for the functions $g_1$, $g_2$, and $g_3$ and their projections. The estimates of $g_1$ and $g_2$ are shown in Figure 7.6. We also superimpose yearly averages in Figure 7.6(a) and the logarithm of scaled ranges in Figure 7.6(b). The scaled range of a specific year was calculated as the difference between the maximum and the minimum monthly number of cases divided by the range of the estimated seasonal trend $g_3$. It is clear that $g_1$ captures the long-term trend in the mean, and $g_2$ captures the long-term trend in the range of the seasonal variation. From the estimate of $g_1$ in Figure 7.6(a), we can see that yearly averages peaked in the 1930s and 1950s, and gradually decreased in the 1960s after the introduction of mass vaccination. The amplitude reflects the seasonal variation in the transmission rate (Yorke and London 1973). From the estimate of $g_2$ in Figure 7.6(b), we can see that the magnitude of the seasonal variation peaked in the 1950s and then declined in the 1960s, possibly as a result of changing public health conditions including mass vaccination. Figure 7.7 shows the estimate of the seasonal trend $g_3$, its projection onto the null space $H_{30} = \mathrm{span}\{\sin 2\pi x, \cos 2\pi x\}$ (the simple sinusoidal model), and its projection onto the orthogonal complement of the null space $H_{31} = W_2^2(\mathrm{per}) \ominus \mathrm{span}\{1, \sin 2\pi x, \cos 2\pi x\}$. Since the projection onto the complement space is significantly different from zero, a simple sinusoidal model does not provide an accurate approximation.

To check the multiplicative model (7.39), we use the SS ANOVA decomposition in (4.21) and consider the following SS ANOVA model

$$y(x_1, x_2) = \mu + f_1(x_1) + f_2(x_2) + f_{12}(x_1, x_2) + \epsilon(x_1, x_2), \qquad (7.40)$$

where $\mu$ is a constant, $f_1$ and $f_2$ are the overall main effects of month and year, and $f_{12}$ is the overall interaction between month and year. Note that the side conditions $\int_0^1 f_1\,dx_1 = \int_0^1 f_2\,dx_2 = \int_0^1 f_{12}\,dx_1 = \int_0^1 f_{12}\,dx_2 = 0$ are satisfied by the SS ANOVA decomposition. It is not difficult to check


[Figure 7.6 appears here: panel (a) year vs. g1; panel (b) year vs. g2; circles mark yearly summaries.]

FIGURE 7.6 Chickenpox data, plots of (a) yearly averages (circles), estimate of g1 (line), and 95% Bayesian confidence intervals (shaded region), and (b) yearly scaled ranges on the logarithm scale (circles), estimate of g2 (line), and 95% Bayesian confidence intervals (shaded region).

[Figure 7.7 appears here: three panels (overall, parametric, smooth) of month vs. g3.]

FIGURE 7.7 Chickenpox data, estimated g3 (overall), and its projections onto H30 (parametric) and H31 (smooth). Dotted lines are 95% Bayesian confidence intervals.

that model (7.40) reduces to model (7.39) iff

$$\mu = \int_0^1 g_1\,dx_2,$$

$$f_1(x_1) = \left\{ \int_0^1 \exp(g_2)\,dx_2 \right\} g_3(x_1),$$

$$f_2(x_2) = g_1(x_2) - \int_0^1 g_1\,dx_2,$$

$$f_{12}(x_1, x_2) = \left[ \exp\{g_2(x_2)\} - \int_0^1 \exp(g_2)\,dx_2 \right] g_3(x_1).$$


Thus the multiplicative model assumes a multiplicative interaction. We fit the SS ANOVA model (7.40) and compare it with the multiplicative model (7.39) using the AIC, BIC, and GCV criteria:

> chick.ssanova <- ssr(y~x2,
+     rk=list(periodic(x1),cubic(x2),
+       rk.prod(periodic(x1),kron(x2-.5)),
+       rk.prod(periodic(x1),cubic(x2))))
> n <- length(y)
> rss <- c(sum(chick.ssanova$resi**2),
+     sum(chick.nnr$resi**2))/n
> df <- c(chick.ssanova$df, chick.nnr$df$f)
> gcv <- rss/(1-df/n)**2
> aic <- n*log(rss)+2*df
> bic <- n*log(rss)+log(n)*df
> print(round(rbind(gcv,aic,bic),2))
gcv    8.85   14.14
aic  826.75 1318.03
bic 1999.38 1422.83

The AIC and GCV criteria select the SS ANOVA model, while the BIC selects the multiplicative model. The SS ANOVA model captures local trends, particularly the biennial pattern from 1945 to 1955, better than the multiplicative model (Figure 7.5).
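The three criteria follow simple closed forms given the residual sum of squares and the effective degrees of freedom. A sketch mirroring the formulas in the R code above (the function name is illustrative):

```python
import numpy as np

def gcv_aic_bic(rss_over_n, df, n):
    """GCV = (RSS/n)/(1 - df/n)^2, AIC = n log(RSS/n) + 2 df,
    BIC = n log(RSS/n) + log(n) df, as computed in the comparison above."""
    gcv = rss_over_n / (1.0 - df / n) ** 2
    aic = n * np.log(rss_over_n) + 2.0 * df
    bic = n * np.log(rss_over_n) + np.log(n) * df
    return gcv, aic, bic
```

Since BIC multiplies the degrees of freedom by log n rather than 2, it penalizes complexity more heavily than AIC for any reasonably large n, which is one reason BIC can prefer the more parsimonious multiplicative model while AIC and GCV prefer the richer SS ANOVA model.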

7.6.5 A Multiplicative Model for Texas Weather

The domains of the variables $x_1$ and $x_2$ in the multiplicative model (7.39) are not limited to continuous intervals. As in the NNR model, these domains may be arbitrary sets. We now revisit the Texas weather data, to which an SS ANOVA model has been fitted in Section 4.9.4. We have defined $x_1$ = (lat, long) as the geographical location of a station, and $x_2$ as the month variable scaled into [0, 1]. For illustration purposes only, assuming that the temperature profiles of all stations have the same shape except for a vertical shift and scale transformation that may depend on geographical location, we consider the following multiplicative model

$$y(x_1, x_2) = g_1(x_1) + \exp\{g_2(x_1)\} \times g_3(x_2) + \epsilon(x_1, x_2), \qquad (7.41)$$

where $y(x_1, x_2)$ is the average temperature of month $x_2$ at location $x_1$; $g_1$ and $\exp(g_2)$ represent, respectively, the mean and the magnitude of the seasonal variation at location $x_1$; and $g_3(x_2)$ represents the seasonal trend in a year. For simplicity, we assume that the random errors $\epsilon(x_1, x_2)$ are independent with a constant variance. We model the functions $g_1$ and $g_2$ using thin-plate splines, and $g_3$ using a periodic spline. Specifically, for identifiability, we assume that $g_1 \in W_2^2(\mathbb{R}^2)$, $g_2 \in W_2^2(\mathbb{R}^2) \ominus \{1\}$, and $g_3 \in W_2^2(\mathrm{per}) \ominus \{1\}$. Model (7.41) is fitted as follows:

> g3.ini <- predict(ssr(y~1,rk=periodic(x2),data=tx.dat,
+     spar="m"),terms=c(0,1), pstd=F)$fit
> tx.nnr <- nnr(y~g1(x11,x12)+exp(g2(x11,x12))*g3(x2),
+     func=list(g1(x,z)~list(~x+z,tp(list(x,z))),
+       g2(x,z)~list(~x+z-1,tp(list(x,z))),
+       g3(x)~list(periodic(x))),
+     data=tx.dat, start=list(g1=mean(y),g2=0,g3=g3.ini))

The estimates of $g_1$ and $g_2$ are shown in Figure 7.8. From northwest to southeast, the temperature gets warmer and the variation within a year gets smaller.

[Figure 7.8 appears here: two contour maps, g1 (contour levels 57–71) and g2 (contour levels −0.16 to 0.16).]

FIGURE 7.8 Texas weather data, plots of estimates of g1 (left) and g2 (right).

In Section 4.9.4 we fitted the following SS ANOVA model

$$y(x_1, x_2) = \mu + f_1(x_1) + f_2(x_2) + f_{12}(x_1, x_2) + \epsilon(x_1, x_2), \qquad (7.42)$$

where $f_1$ and $f_2$ are the overall main effects of location and month, and $f_{12}$ is the overall interaction between location and month. The overall main effects and interaction are defined in Section 4.4.6. To compare the multiplicative model (7.41) with the SS ANOVA model fitted in Section 4.9.4, we compute the GCV, AIC, and BIC criteria:


> n <- length(y)

> rss <- c(sum(tx.ssanova$resi**2),sum(tx.nnr$resi**2))

> df <- c(tx.ssanova$df,tx.nnr$df$f)

> gcv <- rss/(1-df/n)**2

> aic <- n*log(rss)+2*df

> bic <- n*log(rss)+log(n)*df

> print(rbind(gcv,aic,bic))

gcv 131.5558 434.5253

aic 2661.9293 3481.2111

bic 3730.4315 3894.1836

All criteria choose the SS ANOVA model. One possible reason is that the assumption of a common temperature profile for all stations is too restrictive. To look at the variation among temperature profiles, we fit each station separately using a periodic spline and compute normalized temperature profiles such that all profiles integrate to zero and have a vertical range equal to one. These normalized temperature profiles are shown in Figure 7.9(a). The variation among these temperature profiles may be nonignorable. To further check whether the variation among temperature profiles may be accounted for by a horizontal shift, we align all profiles such that all of them reach their maximum at the point 0.5 and plot the aligned profiles in Figure 7.9(b). The variation among these aligned temperature profiles is again nonignorable.

[Figure 7.9 appears here: panels (a) and (b) of x2 vs. shape.]

FIGURE 7.9 Texas weather data, plots of (a) normalized temperature profiles, and (b) aligned normalized temperature profiles.


It is easy to see that the SS ANOVA model (7.42) reduces to the multiplicative model (7.41) iff

$$f_{12}(x_1, x_2) = \left[ \frac{\exp\{g_2(x_1)\}}{\sum_{j=1}^{J} w_j \exp\{g_2(u_j)\}} - 1 \right] f_2(x_2), \qquad (7.43)$$

where the $u_j$ are fixed points in $\mathbb{R}^2$, and the $w_j$ are fixed positive weights such that $\sum_{j=1}^{J} w_j = 1$. Equation (7.43) suggests the following simple approach to checking the multiplicative model (7.41): for a fixed station (thus a fixed $x_1$), compute estimates of $f_2$ and $f_{12}$ from the SS ANOVA fit and plot $f_{12}$ against $f_2$ to see whether the points fall on a straight line. Figure 7.10 shows plots of $f_{12}$ against $f_2$ for two selected stations. The patterns are quite different from straight lines, especially for the Liberty station. Again, this indicates that the multiplicative model may not be appropriate for this case.

[Figure 7.10 appears here: two scatterplots, Albany and Liberty, of f2 vs. f12.]

FIGURE 7.10 Texas weather data, plots of f12 against f2 for two stations.


Chapter 8

Semiparametric Regression

8.1 Motivation

Postulating strict relationships between dependent and independent variables, parametric models are, in general, parsimonious. Parameters in these models often have meaningful interpretations. On the other hand, based on minimal assumptions about the relationship, nonparametric models are flexible. However, nonparametric models lose the advantage of having interpretable parameters and may suffer from the curse of dimensionality. Often, in practice, there is enough knowledge to model some components in the regression function parametrically. For other vague and/or nuisance components, one may want to leave them unspecified. Combining both parametric and nonparametric components, a semiparametric regression model can overcome the limitations of parametric and nonparametric models while maintaining the advantages of interpretable parameters and flexibility.

Many specific semiparametric models have been proposed in the literature. The partial spline model (2.43) in Section 2.10 is perhaps the simplest semiparametric regression model. Other semiparametric models include the projection pursuit, single index, varying coefficients, functional linear, and shape invariant models. A projection pursuit regression model (Friedman and Stuetzle 1981) assumes that

$$y_i = \beta_0 + \sum_{k=1}^{r} f_k(\boldsymbol{\beta}_k^T \boldsymbol{x}_i) + \epsilon_i, \quad i = 1, \ldots, n, \qquad (8.1)$$

where $\boldsymbol{x}$ are independent variables, $\beta_0$ and $\boldsymbol{\beta}_k$ are parameters, and the $f_k$ are nonparametric functions. A partially linear single index model (Carroll, Fan, Gijbels and Wand 1997, Yu and Ruppert 2002) assumes that

$$y_i = \boldsymbol{\beta}_1^T \boldsymbol{s}_i + f(\boldsymbol{\beta}_2^T \boldsymbol{t}_i) + \epsilon_i, \quad i = 1, \ldots, n, \qquad (8.2)$$

where $\boldsymbol{s}$ and $\boldsymbol{t}$ are independent variables, $\boldsymbol{\beta}_1$ and $\boldsymbol{\beta}_2$ are parameters, and $f$ is a nonparametric function. A varying-coefficient model (Hastie and Tibshirani 1993) assumes that

$$y_i = \beta_1 + \sum_{k=1}^{r} u_{ik} f_k(x_{ik}) + \epsilon_i, \quad i = 1, \ldots, n, \qquad (8.3)$$

where $x_k$ and $u_k$ are independent variables, $\beta_1$ is a parameter, and the $f_k$ are nonparametric functions.

The semiparametric linear and nonlinear regression models in this chapter include all the foregoing models as special cases. The general form of these models provides a framework for unified estimation, inference, and software development.

8.2 Semiparametric Linear Regression Models

8.2.1 The Model

Define a semiparametric linear regression model as

$$y_i = \boldsymbol{s}_i^T \boldsymbol{\beta} + \sum_{k=1}^{r} \mathcal{L}_{ki} f_k + \epsilon_i, \quad i = 1, \ldots, n, \qquad (8.4)$$

where $\boldsymbol{s}$ is a q-dimensional vector of independent variables, $\boldsymbol{\beta}$ is a vector of parameters, the $\mathcal{L}_{ki}$ are bounded linear functionals, the $f_k$ are unknown functions, and $\boldsymbol{\epsilon} = (\epsilon_1, \ldots, \epsilon_n)^T \sim N(0, \sigma^2 W^{-1})$. We assume that $W$ depends on an unknown vector of parameters $\boldsymbol{\tau}$. Let $\mathcal{X}_k$ be the domain of $f_k$ and assume that $f_k \in \mathcal{H}_k$, where $\mathcal{H}_k$ is an RKHS on $\mathcal{X}_k$. Note that the domains $\mathcal{X}_k$ for different functions may be the same or different.

Model (8.4) is referred to as a semiparametric linear regression model since the mean response depends on $\boldsymbol{\beta}$ and the $f_k$ linearly. Extension to the nonlinear case will be introduced in Section 8.3. The semiparametric linear regression model (8.4) extends the partial spline model (2.43) by allowing more than one nonparametric function. It extends the additive models by including a parametric component and allowing general linear functionals. Note that some of the functions $f_k$ may represent main effects and interactions in an SS ANOVA decomposition. Therefore, model (8.4) is also an extension of the SS ANOVA model (5.1) that allows different linear functionals for different components. It is easy to see that the varying-coefficient model (8.3) is a special case of the semiparametric linear regression model with $q = 1$, $s_i = 1$, and $\mathcal{L}_{ki} f_k = u_{ik} f_k(x_{ik})$ for $k = 1, \ldots, r$. The functional linear models (FLM) are also a special case of the semiparametric linear regression model (see Section 8.4.1). In addition, the random errors are allowed to be correlated and/or have unequal variances.

8.2.2 Estimation and Inference

Assume that $\mathcal{H}_k = \mathcal{H}_{k0} \oplus \mathcal{H}_{k1}$, where $\mathcal{H}_{k0} = \mathrm{span}\{\phi_{k1}, \ldots, \phi_{kp_k}\}$ and $\mathcal{H}_{k1}$ is an RKHS with RK $R_{k1}$. Let $\boldsymbol{y} = (y_1, \ldots, y_n)^T$, $\boldsymbol{f} = (f_1, \ldots, f_r)$, and $\boldsymbol{\eta}(\boldsymbol{\beta}, \boldsymbol{f}) = (\boldsymbol{s}_1^T\boldsymbol{\beta} + \sum_{k=1}^r \mathcal{L}_{k1}f_k, \ldots, \boldsymbol{s}_n^T\boldsymbol{\beta} + \sum_{k=1}^r \mathcal{L}_{kn}f_k)^T$.

For a fixed $W$, we estimate $\boldsymbol{\beta}$ and $\boldsymbol{f}$ as the minimizers of the following PWLS

$$\frac{1}{n}(\boldsymbol{y} - \boldsymbol{\eta}(\boldsymbol{\beta}, \boldsymbol{f}))^T W (\boldsymbol{y} - \boldsymbol{\eta}(\boldsymbol{\beta}, \boldsymbol{f})) + \lambda \sum_{k=1}^{r} \theta_k^{-1} \|P_{k1} f_k\|^2, \qquad (8.5)$$

where $P_{k1}$ is the projection operator onto $\mathcal{H}_{k1}$ in $\mathcal{H}_k$. As in (4.32), different smoothing parameters $\lambda\theta_k^{-1}$ allow different penalties for each function.

For $k = 1, \ldots, r$, let $\xi_{ki}(x) = \mathcal{L}_{ki(z)} R_{k1}(x, z)$ and $\mathcal{S}_k = \mathcal{H}_{k0} \oplus \mathrm{span}\{\xi_{k1}, \ldots, \xi_{kn}\}$. Let

$$\mathrm{WLS}(\boldsymbol{\beta}, \mathcal{L}_1\boldsymbol{f}, \ldots, \mathcal{L}_n\boldsymbol{f}) = \frac{1}{n}(\boldsymbol{y} - \boldsymbol{\eta}(\boldsymbol{\beta}, \boldsymbol{f}))^T W (\boldsymbol{y} - \boldsymbol{\eta}(\boldsymbol{\beta}, \boldsymbol{f}))$$

be the weighted LS, where $\mathcal{L}_i\boldsymbol{f} = (\mathcal{L}_{1i}f_1, \ldots, \mathcal{L}_{ri}f_r)$. Then the PWLS (8.5) can be written as

$$\mathrm{PWLS}(\boldsymbol{\beta}, \mathcal{L}_1\boldsymbol{f}, \ldots, \mathcal{L}_n\boldsymbol{f}) = \mathrm{WLS}(\boldsymbol{\beta}, \mathcal{L}_1\boldsymbol{f}, \ldots, \mathcal{L}_n\boldsymbol{f}) + \lambda \sum_{k=1}^{r} \theta_k^{-1} \|P_{k1} f_k\|^2.$$

For any $f_k \in \mathcal{H}_k$, write $f_k = \varsigma_{k1} + \varsigma_{k2}$, where $\varsigma_{k1} \in \mathcal{S}_k$ and $\varsigma_{k2} \in \mathcal{S}_k^c$. As shown in Section 6.2.1, we have $\mathcal{L}_{ki}f_k = \mathcal{L}_{ki}\varsigma_{k1}$. Then for any $\boldsymbol{f}$,

$$\begin{aligned}
\mathrm{PWLS}(\boldsymbol{\beta}, \mathcal{L}_1\boldsymbol{f}, \ldots, \mathcal{L}_n\boldsymbol{f})
&= \mathrm{WLS}(\boldsymbol{\beta}, \mathcal{L}_1\boldsymbol{\varsigma}_1, \ldots, \mathcal{L}_n\boldsymbol{\varsigma}_1) + \lambda \sum_{k=1}^{r} \theta_k^{-1} \left( \|P_{k1}\varsigma_{k1}\|^2 + \|P_{k1}\varsigma_{k2}\|^2 \right) \\
&\geq \mathrm{WLS}(\boldsymbol{\beta}, \mathcal{L}_1\boldsymbol{\varsigma}_1, \ldots, \mathcal{L}_n\boldsymbol{\varsigma}_1) + \lambda \sum_{k=1}^{r} \theta_k^{-1} \|P_{k1}\varsigma_{k1}\|^2 \\
&= \mathrm{PWLS}(\boldsymbol{\beta}, \mathcal{L}_1\boldsymbol{\varsigma}_1, \ldots, \mathcal{L}_n\boldsymbol{\varsigma}_1),
\end{aligned}$$

where $\mathcal{L}_i\boldsymbol{\varsigma}_1 = (\mathcal{L}_{1i}\varsigma_{11}, \ldots, \mathcal{L}_{ri}\varsigma_{r1})$ for $i = 1, \ldots, n$. Equality holds iff $\|P_{k1}\varsigma_{k2}\| = \|\varsigma_{k2}\| = 0$ for all $k = 1, \ldots, r$. Thus the minimizers $f_k$ of the PWLS fall in the finite-dimensional spaces $\mathcal{S}_k$, which can be represented


as

$$f_k = \sum_{\nu=1}^{p_k} d_{k\nu}\phi_{k\nu} + \theta_k \sum_{i=1}^{n} c_{ki}\xi_{ki}, \quad k = 1, \ldots, r, \qquad (8.6)$$

where the multiplying constants $\theta_k$ make the solution and notation similar to those for the SS ANOVA models. Note that we have only used the fact that the WLS criterion depends on some bounded linear functionals $\mathcal{L}_{ki}$ in the foregoing arguments. Therefore, the Kimeldorf–Wahba representer theorem holds in general so long as the goodness-of-fit criterion depends on some bounded linear functionals.

Let $S = (\boldsymbol{s}_1, \ldots, \boldsymbol{s}_n)^T$, $T_k = \{\mathcal{L}_{ki}\phi_{k\nu}\}_{i=1,\,\nu=1}^{n,\,p_k}$ for $k = 1, \ldots, r$, $T = (T_1, \ldots, T_r)$, and $X = (S\ T)$. Let $\Sigma_k = \{\mathcal{L}_{ki}\mathcal{L}_{kj}R_{k1}\}_{i,j=1}^{n}$ for $k = 1, \ldots, r$, and $\Sigma_\theta = \sum_{k=1}^{r}\theta_k\Sigma_k$, where $\boldsymbol{\theta} = (\theta_1, \ldots, \theta_r)$. Let $\boldsymbol{f}_k = (\mathcal{L}_{k1}f_k, \ldots, \mathcal{L}_{kn}f_k)^T$, $\boldsymbol{c}_k = (c_{k1}, \ldots, c_{kn})^T$, and $\boldsymbol{d}_k = (d_{k1}, \ldots, d_{kp_k})^T$ for $k = 1, \ldots, r$. Based on (8.6), $\boldsymbol{f}_k = T_k\boldsymbol{d}_k + \theta_k\Sigma_k\boldsymbol{c}_k$ for $k = 1, \ldots, r$. Let $\boldsymbol{d} = (\boldsymbol{d}_1^T, \ldots, \boldsymbol{d}_r^T)^T$. The overall fit

$$\boldsymbol{\eta} = S\boldsymbol{\beta} + \sum_{k=1}^{r}\boldsymbol{f}_k = S\boldsymbol{\beta} + T\boldsymbol{d} + \sum_{k=1}^{r}\theta_k\Sigma_k\boldsymbol{c}_k = X\boldsymbol{\alpha} + \sum_{k=1}^{r}\theta_k\Sigma_k\boldsymbol{c}_k,$$

where $\boldsymbol{\alpha} = (\boldsymbol{\beta}^T, \boldsymbol{d}^T)^T$. Plugging back into (8.5), we have

$$\frac{1}{n}\Big(\boldsymbol{y} - X\boldsymbol{\alpha} - \sum_{k=1}^{r}\theta_k\Sigma_k\boldsymbol{c}_k\Big)^T W \Big(\boldsymbol{y} - X\boldsymbol{\alpha} - \sum_{k=1}^{r}\theta_k\Sigma_k\boldsymbol{c}_k\Big) + \lambda\sum_{k=1}^{r}\theta_k\boldsymbol{c}_k^T\Sigma_k\boldsymbol{c}_k. \qquad (8.7)$$

Taking the first derivatives with respect to $\boldsymbol{c}_k$ and $\boldsymbol{\alpha}$, we have

$$\begin{aligned}
&\Sigma_k W\Big(\boldsymbol{y} - X\boldsymbol{\alpha} - \sum_{k=1}^{r}\theta_k\Sigma_k\boldsymbol{c}_k\Big) - n\lambda\Sigma_k\boldsymbol{c}_k = 0, \quad k = 1, \ldots, r, \\
&X^T W\Big(\boldsymbol{y} - X\boldsymbol{\alpha} - \sum_{k=1}^{r}\theta_k\Sigma_k\boldsymbol{c}_k\Big) = 0.
\end{aligned} \qquad (8.8)$$

When all $\Sigma_k$ are nonsingular, from the first equation in (8.8), we must have $\boldsymbol{c}_1 = \cdots = \boldsymbol{c}_r$. Setting $\boldsymbol{c}_1 = \cdots = \boldsymbol{c}_r = \boldsymbol{c}$, it is easy to see that solutions to

$$(\Sigma_\theta + n\lambda W^{-1})\boldsymbol{c} + X\boldsymbol{\alpha} = \boldsymbol{y}, \qquad X^T\boldsymbol{c} = 0, \qquad (8.9)$$

are also solutions to (8.8). Equations in (8.9) have the same form asthose in (5.6). Therefore, a similar procedure as in Section 5.2 can be


Semiparametric Regression 231

used to compute the coefficients $c$ and $d$. Let the QR decomposition of $X$ be
$$X = (Q_1\ Q_2)\begin{pmatrix} R \\ 0 \end{pmatrix}$$
and $M = \Sigma_\theta + n\lambda W^{-1}$. Then the solutions are
$$\begin{aligned}
c &= Q_2(Q_2^T M Q_2)^{-1}Q_2^T y,\\
d &= R^{-1}Q_1^T(y - Mc).
\end{aligned}\qquad (8.10)$$

Furthermore,
$$\eta = H(\lambda, \theta, \tau)y,$$
where
$$H(\lambda, \theta, \tau) = I - n\lambda W^{-1}Q_2(Q_2^T M Q_2)^{-1}Q_2^T \qquad (8.11)$$
is the hat matrix.

When $W$ is known, the UBR, GCV, and GML criteria presented in

Section 5.2.3 can be used to estimate the smoothing parameters. We now extend the GML method to estimate the smoothing and covariance parameters simultaneously when $W$ is unknown.
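The solution (8.10) is straightforward to compute directly. The base-R sketch below builds a small simulated system (all matrices are illustrative stand-ins, not from the text) and checks that the resulting $c$ and $d$ satisfy both equations in (8.9).

```r
# Sketch of (8.9)-(8.10): solve (Sigma_theta + n*lambda*W^{-1})c + X alpha = y
# subject to X^T c = 0, via the QR decomposition of X. Simulated stand-ins.
set.seed(1)
n <- 20; p <- 2
X <- cbind(1, seq(0, 1, length.out = n))            # stand-in for X = (S T)
K <- outer(1:n, 1:n, function(i, j) exp(-abs(i - j) / 5))
Sigma <- K %*% t(K) / n                             # positive definite Sigma_theta
W <- diag(n); lambda <- 0.01
y <- sin(2 * pi * X[, 2]) + rnorm(n, sd = 0.1)

M  <- Sigma + n * lambda * solve(W)                 # M = Sigma_theta + n*lambda*W^{-1}
Qf <- qr.Q(qr(X), complete = TRUE)
Q1 <- Qf[, 1:p]; Q2 <- Qf[, (p + 1):n]
R  <- qr.R(qr(X))

c_hat <- Q2 %*% solve(t(Q2) %*% M %*% Q2, t(Q2) %*% y)  # c in (8.10)
d_hat <- solve(R, t(Q1) %*% (y - M %*% c_hat))          # d in (8.10)

max(abs(t(X) %*% c_hat))                            # X^T c = 0, numerically
```

Plugging these back, one can check numerically that $(\Sigma_\theta + n\lambda W^{-1})c + X\alpha$ reproduces $y$ exactly, which is the first equation in (8.9).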

We first introduce a Bayes model for the semiparametric linear regression model (8.4). Since the parametric component $s^T\beta$ can be absorbed into any one of the null spaces $\mathcal{H}_{k0}$ in the setup of a Bayes model, for simplicity of notation, we drop this term in the following discussion. Assume priors for $f_k$ as

$$F_k(x_k) = \sum_{\nu=1}^{p_k}\zeta_{k\nu}\phi_{k\nu}(x_k) + \sqrt{\delta\theta_k}\,U_k(x_k), \qquad k = 1, \ldots, r, \qquad (8.12)$$

where $\zeta_{k\nu} \overset{iid}{\sim} N(0, \kappa)$, $U_k(x_k)$ are independent zero-mean Gaussian stochastic processes with covariance function $R_{k1}(x_k, z_k)$, $\zeta_{k\nu}$ and $U_k$ are mutually independent, and $\kappa$ and $\delta$ are positive constants. Suppose that observations are generated by

$$y_i = \sum_{k=1}^{r}L_{ki}F_k + \epsilon_i, \qquad i = 1, \ldots, n. \qquad (8.13)$$

Let $L_{k0}$ be a bounded linear functional on $\mathcal{H}_k$. Let $\lambda = \sigma^2/n\delta$. The same arguments in Section 3.6 hold when $M = \Sigma + n\lambda I$ is replaced by $M = \Sigma_\theta + n\lambda W^{-1}$ in this chapter. Therefore,
$$\lim_{\kappa\to\infty} E(L_{k0}F_k \mid y) = \hat f_k, \qquad k = 1, \ldots, r,$$


and an extension of the GML criterion is
$$\mathrm{GML}(\lambda, \theta, \tau) = \frac{y^T W(I - H)y}{[\det{}^{+}(W(I - H))]^{1/(n-p)}}, \qquad (8.14)$$

where $\det^{+}$ is the product of the nonzero eigenvalues and $p = \sum_{k=1}^{r}p_k$. As in Section 5.2.4, we fit the corresponding LME model (5.27), with $T$ replaced by $X$ in this section, to compute the minimizers of the GML criterion (8.14).

For clustered data, the leaving-out-one-cluster approach presented in Section 5.2.2 may also be used to estimate the smoothing and covariance parameters.

We now discuss how to construct Bayesian confidence intervals. Any function $f_k \in \mathcal{H}_k$ can be represented as
$$f_k = \sum_{\nu=1}^{p_k}f_{0k\nu} + f_{1k}, \qquad (8.15)$$

where $f_{0k\nu} \in \mathrm{span}\{\phi_{k\nu}\}$ for $\nu = 1, \ldots, p_k$, and $f_{1k} \in \mathcal{H}_{k1}$. Our goal is to construct Bayesian confidence intervals for
$$L_{k0}f_{k,\gamma_k} = \sum_{\nu=1}^{p_k}\gamma_{k,\nu}L_{k0}f_{0k\nu} + \gamma_{k,p_k+1}L_{k0}f_{1k} \qquad (8.16)$$
for any bounded linear functional $L_{k0}$ on $\mathcal{H}_k$ and any combination $\gamma_k = (\gamma_{k,1}, \ldots, \gamma_{k,p_k+1})^T$, where $\gamma_{k,j} = 1$ when the corresponding component in (8.15) is to be included and $0$ otherwise.

Let $F_{0j\nu} = \zeta_{j\nu}\phi_{j\nu}$ for $\nu = 1, \ldots, p_j$, $F_{1j} = \sqrt{\delta\theta_j}\,U_j$, and $F_{1k} = \sqrt{\delta\theta_k}\,U_k$ for $j = 1, \ldots, r$ and $k = 1, \ldots, r$. Let $L_{0j}$, $L_{0j1}$ be bounded linear functionals on $\mathcal{H}_j$, and let $L_{0k2}$ be a bounded linear functional on $\mathcal{H}_k$.

Posterior means and covariances

For $j = 1, \ldots, r$, $\nu = 1, \ldots, p_j$, the posterior means are
$$\begin{aligned}
E(L_{0j}F_{0j\nu} \mid y) &= (L_{0j}\phi_{j\nu})e_{j,\nu}^T d,\\
E(L_{0j}F_{1j} \mid y) &= \theta_j(L_{0j}\xi_j)^T c.
\end{aligned}\qquad (8.17)$$

For $j = 1, \ldots, r$, $k = 1, \ldots, r$, $\nu = 1, \ldots, p_j$, $\mu = 1, \ldots, p_k$, the posterior covariances are
$$\begin{aligned}
\delta^{-1}\mathrm{Cov}(L_{0j1}F_{0j\nu}, L_{0k2}F_{0k\mu} \mid y) &= (L_{0j1}\phi_{j\nu})(L_{0k2}\phi_{k\mu})e_{j,\nu}^T A e_{k,\mu},\\
\delta^{-1}\mathrm{Cov}(L_{0j1}F_{0j\nu}, L_{0k2}F_{1k} \mid y) &= -\theta_k(L_{0j1}\phi_{j\nu})e_{j,\nu}^T B(L_{0k2}\xi_k),\\
\delta^{-1}\mathrm{Cov}(L_{0j1}F_{1j}, L_{0k2}F_{1k} \mid y) &= \delta_{j,k}\theta_k L_{0j1}L_{0k2}R_{k1} - \theta_j\theta_k(L_{0j1}\xi_j)^T C(L_{0k2}\xi_k),
\end{aligned}\qquad (8.18)$$


where $e_{j,\mu}$ is a vector of dimension $\sum_{l=1}^{r}p_l$ with the $\big(\sum_{l=1}^{j-1}p_l + \mu\big)$th element equal to one and all other elements equal to zero, $c$ and $d$ are given in equation (8.10), $L_{0j1}\xi_j = (L_{0j1}L_{j1}R_{j1}, \ldots, L_{0j1}L_{jn}R_{j1})^T$, $L_{0k2}\xi_k = (L_{0k2}L_{k1}R_{k1}, \ldots, L_{0k2}L_{kn}R_{k1})^T$, $M = \Sigma_\theta + n\lambda W^{-1}$, $A = (T^T M^{-1}T)^{-1}$, $B = AT^T M^{-1}$, and $C = M^{-1}(I - TB)$. For simplicity of notation, we define $\sum_{l=1}^{0}p_l = 0$.

Derivations of the above results can be found in Wang and Ke (2009). The posterior mean and variance of $L_{j0}f_{j,\gamma_j}$ in (8.16) can be calculated using the above formulae. Bayesian confidence intervals for $L_{j0}f_{j,\gamma_j}$ can then be constructed. Bootstrap confidence intervals can also be constructed as in previous chapters.

The semiparametric linear regression model (8.4) can be fitted by the ssr function. The independent variables s and the null-space bases $\phi_{k1}, \ldots, \phi_{kp_k}$ for $k = 1, \ldots, r$ are specified on the right-hand side of the formula argument, and the RKs $R_{k1}$ for $k = 1, \ldots, r$ are specified in the rk argument as a list. For non-iid random errors, variance and/or correlation structures are specified using the arguments weights and correlation. The argument spar specifies a method for selecting the smoothing parameter(s). The UBR, GCV, and GML methods are available when $W$ is known, and the GML method is available when $W$ needs to be estimated. The predict function can be used to compute posterior means and standard deviations. See Section 8.4.1 for examples.

8.2.3 Vector Spline

Suppose we have observations on $r$ dependent variables $z_1, \ldots, z_r$. Assume the following partial spline models:
$$z_{jk} = s_{jk}^T\beta_k + L_{kj}f_k + \varepsilon_{jk}, \qquad k = 1, \ldots, r;\ j = 1, \ldots, n_k, \qquad (8.19)$$
where $z_{jk}$ is the $j$th observation on $z_k$, $s_{jk}$ is the $j$th observation on a $q_k$-dimensional vector of independent variables $s_k$, $f_k \in \mathcal{H}_k$ is an unknown function, $\mathcal{H}_k$ is an RKHS on an arbitrary set $\mathcal{X}_k$, $L_{kj}$ is a bounded linear functional, and $\varepsilon_{jk}$ is a random error. Model (8.19) is a semiparametric extension of the linear seemingly unrelated regression model. For simplicity, it is assumed that the regression model for each dependent variable involves only one nonparametric function. The following discussions hold when the partial spline models in (8.19) are replaced by the semiparametric linear regression models (8.4).

There are two possible approaches to estimating the parameters $\beta = (\beta_1^T, \ldots, \beta_r^T)^T$ and the functions $f = (f_1, \ldots, f_r)$. The first approach is to fit the partial spline models in (8.19) separately, once for each dependent variable. The second approach is to fit all partial spline models


in (8.19) simultaneously, which can be more efficient when the random errors are correlated (Wang, Guo and Brown 2000, Smith and Kohn 2000). We now discuss how to accomplish the second approach using the methods in this section.

Let $m_1 = 0$ and $m_k = \sum_{l=1}^{k-1}n_l$ for $k = 2, \ldots, r$. Let $i = m_k + j$ for $j = 1, \ldots, n_k$ and $k = 1, \ldots, r$. Then there is a one-to-one correspondence between $i$ and $(j, k)$. Define $y_i = z_{jk}$ for $i = 1, \ldots, n$, where $n = \sum_{l=1}^{r}n_l$. Then the partial spline models in (8.19) can be written jointly as

$$\begin{aligned}
y_i &= s_{jk}^T\beta_k + L_{kj}f_k + \varepsilon_{jk}\\
&= \sum_{l=1}^{r}\delta_{k,l}s_{jl}^T\beta_l + \sum_{l=1}^{r}\delta_{k,l}L_{lj}f_l + \varepsilon_{jk}\\
&= s_i^T\beta + \sum_{l=1}^{r}L_{li}f_l + \epsilon_i, \qquad (8.20)
\end{aligned}$$

where $\delta_{k,l}$ is the Kronecker delta, $s_i^T = (\delta_{k,1}s_{j1}^T, \ldots, \delta_{k,r}s_{jr}^T)$, $L_{li} = \delta_{k,l}L_{lj}$, and $\epsilon_i = \varepsilon_{jk}$. Assume that $\epsilon = (\epsilon_1, \ldots, \epsilon_n)^T \sim N(0, \sigma^2 W^{-1})$. It is easy to see that the $L_{li}$ are bounded linear functionals. Thus the model (8.20) for all dependent variables is a special case of the semiparametric linear regression model (8.4). Therefore, the estimation and inference methods described in Section 8.2.2 can be used. In particular, all parameters $\beta$ and nonparametric functions $f$ are estimated jointly based on the PWLS (8.5). In comparison, the first approach, which fits model (8.19) separately for each dependent variable, is equivalent to fitting model (8.20) based on the PLS.
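The index map $i = m_k + j$ is trivial to implement; the following base-R sketch (with illustrative cluster sizes) shows the correspondence between $i$ and $(j, k)$.

```r
# Index map i = m_k + j: stack r response vectors of lengths n_1, ..., n_r
# into a single vector. Cluster sizes below are illustrative.
nk  <- c(3, 4, 2)                       # n_1, n_2, n_3
m   <- c(0, cumsum(nk)[-length(nk)])    # m_1 = 0, m_k = n_1 + ... + n_{k-1}
idx <- function(j, k) m[k] + j          # (j, k) -> i
idx(2, 3)                               # -> m_3 + 2 = 7 + 2 = 9
```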

As an interesting special case, consider the following SSR model for $r = 2$ dependent variables:
$$z_{jk} = f_k(x_{jk}) + \varepsilon_{jk}, \qquad k = 1, 2;\ j = 1, \ldots, n_k, \qquad (8.21)$$

where the model space of $f_k$ is an RKHS $\mathcal{H}_k$ on $\mathcal{X}_k$. Assume that $\mathcal{H}_k = \mathcal{H}_{k0}\oplus\mathcal{H}_{k1}$, where $\mathcal{H}_{k0} = \mathrm{span}\{\phi_{k1}, \ldots, \phi_{kp_k}\}$, and $\mathcal{H}_{k1}$ is an RKHS with RK $R_{k1}$. Then it is easy to check that
$$L_{ki}\phi_{k\nu} = \begin{cases} L_{1i}\phi_{1\nu}, & 1 \le i \le n_1,\\ L_{2i}\phi_{2\nu}, & n_1 < i \le n_1 + n_2,\end{cases}$$
and
$$L_{ki}L_{kj}R_{k1} = \begin{cases} L_{1i}L_{1j}R_{11}, & 1 \le i, j \le n_1,\\ L_{2i}L_{2j}R_{21}, & n_1 < i, j \le n_1 + n_2.\end{cases}$$

For illustration, we generate a data set from model (8.21) with $\mathcal{X}_1 = \mathcal{X}_2 = [0, 1]$, $n_1 = n_2 = 100$, $x_{i1} = x_{i2} = i/n$, $f_1(x) = \sin(2\pi x)$, $f_2(x) = \sin(2\pi x) + 2x$, and the paired random errors $(\varepsilon_{i1}, \varepsilon_{i2})$ iid bivariate normal with mean zero, $\mathrm{Var}(\varepsilon_{i1}) = 0.25$, $\mathrm{Var}(\varepsilon_{i2}) = 1$, and $\mathrm{Cor}(\varepsilon_{i1}, \varepsilon_{i2}) = 0.8$.

> n <- 100; s1 <- .5; s2 <- 1; r <- .8

> A <- diag(c(s1,s2))%*%matrix(c(sqrt(1-r**2),0,r,1),2,2)

> e <- NULL; for (i in 1:n) e <- c(e,A%*%rnorm(2))

> x <- 1:n/n

> y1 <- sin(2*pi*x) + e[seq(1,2*n,by=2)]

> y2 <- sin(2*pi*x) + 2*x + e[seq(2,2*n,by=2)]

> bisp.dat <- data.frame(y=c(y1,y2),x=rep(x,2),

id=as.factor(rep(c(0,1),rep(n,2))), pair=rep(1:n,2))

We model both $f_1$ and $f_2$ using the cubic spline space $W_2^2[0, 1]$ under the construction in Section 2.6. We first fit each SSR model in (8.21) separately and compute posterior means and standard deviations:

> bisp.fit1 <- ssr(y~I(x-.5), rk=cubic(x), spar="m",

data=bisp.dat[bisp.dat$id==0,])

> bisp.p1 <- predict(bisp.fit1)

> bisp.fit2 <- ssr(y~I(x-.5), rk=cubic(x), spar="m",

data=bisp.dat[bisp.dat$id==1,])

> bisp.p2 <- predict(bisp.fit2)

The functions $f_1$ and $f_2$, their estimates, and confidence intervals based on the separate fits are shown in the top panel of Figure 8.1.

Next we fit the SSR models in (8.21) jointly, compute posterior means and standard deviations, and compare the posterior standard deviations with those based on the separate fits:

> bisp.fit3 <- ssr(y~id/I(x-.5)-1,

rk=list(rk.prod(cubic(x),kron(id==0)),

rk.prod(cubic(x),kron(id==1))), spar="m",

weights=varIdent(form=~1|id),

correlation=corSymm(form=~1|pair), data=bisp.dat)

> summary(bisp.fit3)

...

Coefficients (d):

id0 id1 id0:I(x - 0.5) id1:I(x - 0.5)

-0.002441981 1.059626687 -0.440873366 2.086878037

GML estimate(s) of smoothing parameter(s) :

8.358606e-06 5.381727e-06

Equivalent Degrees of Freedom (DF): 13.52413

Estimate of sigma: 0.483716


FIGURE 8.1 Plots of the true functions (dotted lines), cubic spline estimates (solid lines), and 95% Bayesian confidence intervals (shaded regions) of $f_1$ (left) and $f_2$ (right). Plots in the top panel are based on the separate fits, and plots in the bottom panel are based on the joint fit.

Correlation structure of class corSymm representing

Correlation:

1

2 0.777

Variance function structure of class varIdent representing

0 1

1.000000 1.981937

> bisp.p31 <- predict(bisp.fit3,

newdata=bisp.dat[bisp.dat$id==0,],

terms=c(1,0,1,0,1,0))

> bisp.p32 <- predict(bisp.fit3,


newdata=bisp.dat[bisp.dat$id==1,],

terms=c(0,1,0,1,0,1))

> mean((bisp.p1$pstd-bisp.p31$pstd)/bisp.p1$pstd)

0.08417699

> mean((bisp.p2$pstd-bisp.p32$pstd)/bisp.p2$pstd)

0.04500096

An arbitrary pairwise variance–covariance structure was assumed, and it was specified with the combination of the weights and correlation options. On average, the posterior standard deviations based on the joint fit are smaller than those based on the separate fits. Estimates and confidence intervals based on the joint fit are shown in the bottom panel of Figure 8.1.

In many applications, the domains of $f_1$ and $f_2$ are the same; that is, $\mathcal{X}_1 = \mathcal{X}_2 = \mathcal{X}$. Then we can rewrite $f_j(x)$ as $f(j, x)$ and regard it as a bivariate function of $j$ and $x$ defined on the product domain $\{1, 2\}\otimes\mathcal{X}$. The joint approach described above for fitting the SSR models in (8.21) is equivalent to representing the original functions as
$$f(j, x) = \delta_{j,1}f_1(x) + \delta_{j,2}f_2(x). \qquad (8.22)$$

Sometimes the main interest is the difference between $f_1$ and $f_2$: $d(x) = f_2(x) - f_1(x)$. We may reparametrize $f(j, x)$ as
$$f(j, x) = f_1(x) + \delta_{j,2}d(x) \qquad (8.23)$$
or
$$f(j, x) = \frac{1}{2}\{f_1(x) + f_2(x)\} + \frac{1}{2}(\delta_{j,2} - \delta_{j,1})d(x). \qquad (8.24)$$

Models (8.23) and (8.24) correspond to the SS ANOVA decompositions of $f(j, x)$ with the set-to-zero and sum-to-zero side conditions, respectively. The following statements fit model (8.23) and compute posterior means and standard deviations of $d(x)$:

> bisp.fit4 <- update(bisp.fit3,

y~I(x-.5)+I(id==1)+I((x-.5)*(id==1)),

rk=list(cubic(x),rk.prod(cubic(x),kron(id==1))))

> bisp.p41 <- predict(bisp.fit4,

newdata=bisp.dat[bisp.dat$id==1,], terms=c(0,0,1,1,0,1))

where $d(x)$ is modeled using the cubic spline space $W_2^2[0, 1]$ under the construction in Section 2.6. The function $d(x)$ and its estimate are shown in the left panel of Figure 8.2. Model (8.24) can be fitted similarly. Sometimes it is of interest to check whether $f_1$ and $f_2$ are parallel rather than whether $d(x) = 0$. Let $d_1(x)$ be the projection of $d(x)$ onto $W_2^2[0, 1]\ominus\{1\}$. Then


$f_1$ and $f_2$ are parallel iff $d_1(x) = 0$. Similarly, the projection of $d(x)$ onto $W_2^2[0, 1]\ominus\{1\}\ominus\{x - .5\}$, $d_2(x)$, can be used to check whether $f_1$ and $f_2$ differ by a linear function. We compute posterior means and standard deviations of $d_1(x)$ and $d_2(x)$ as follows:

> bisp.p42 <- predict(bisp.fit4,

newdata=bisp.dat[bisp.dat$id==1,], terms=c(0,0,0,1,0,1))

> bisp.p43 <- predict(bisp.fit4,

newdata=bisp.dat[bisp.dat$id==1,], terms=c(0,0,0,0,0,1))

The functions $d_1(x)$ and $d_2(x)$, their estimates, and 95% Bayesian confidence intervals are shown in the middle and right panels of Figure 8.2. We can see that $f_1$ and $f_2$ are not parallel but differ by a linear function.

FIGURE 8.2 Plots of the functions $d(x)$ (left), $d_1(x)$ (middle), and $d_2(x)$ (right) as dotted lines. Estimates of these functions are plotted as solid lines, and 95% Bayesian confidence intervals are marked as the shaded regions.

We now introduce a more sophisticated SS ANOVA decomposition that can be used to investigate various relationships between $f_1$ and $f_2$. Define two averaging operators $\mathcal{A}^{(1)}$ and $\mathcal{A}^{(2)}$ such that
$$\mathcal{A}^{(1)}f = w_1 f(1, x) + w_2 f(2, x), \qquad \mathcal{A}^{(2)}f = \int_{\mathcal{X}} f(j, x)\,dP(x),$$
where $w_1 + w_2 = 1$, and $P$ is a probability measure on $\mathcal{X}$. Then we have


the following SS ANOVA decomposition:
$$f(j, x) = \mu + g_1(j) + g_2(x) + g_{12}(j, x), \qquad (8.25)$$

where
$$\begin{aligned}
\mu &= \mathcal{A}^{(1)}\mathcal{A}^{(2)}f = \int_{\mathcal{X}}\{w_1 f_1(x) + w_2 f_2(x)\}\,dP(x),\\
g_1(j) &= (I - \mathcal{A}^{(1)})\mathcal{A}^{(2)}f = \int_{\mathcal{X}} f_j(x)\,dP(x) - \mu,\\
g_2(x) &= \mathcal{A}^{(1)}(I - \mathcal{A}^{(2)})f = w_1 f_1(x) + w_2 f_2(x) - \mu,\\
g_{12}(j, x) &= (I - \mathcal{A}^{(1)})(I - \mathcal{A}^{(2)})f = f_j(x) - \mu - g_1(j) - g_2(x).
\end{aligned}$$

The constant $\mu$ is the overall mean, the marginal functions $g_1$ and $g_2$ are the main effects, and the bivariate function $g_{12}$ is the interaction.

The SS ANOVA decomposition (8.25) makes certain hypotheses more transparent. For example, it is easy to check that the following hypotheses are equivalent:
$$\begin{aligned}
H_0: f_1(x) = f_2(x) &\iff H_0: g_1(j) + g_{12}(j, x) = 0,\\
H_0: f_1(x) - f_2(x) = \text{constant} &\iff H_0: g_{12}(j, x) = 0,\\
H_0: \int_{\mathcal{X}}f_1(x)\,dP(x) = \int_{\mathcal{X}}f_2(x)\,dP(x) &\iff H_0: g_1(j) = 0,\\
H_0: w_1 f_1(x) + w_2 f_2(x) = \text{constant} &\iff H_0: g_2(x) = 0.
\end{aligned}$$
Furthermore, if $g_1(j) \neq 0$ and $g_2(x) \neq 0$,
$$H_0: a f_1(x) + b f_2(x) = c,\ |a| + |b| > 0 \iff H_0: g_{12}(j, x) = \beta g_1(j)g_2(x).$$

Therefore, the hypothesis that $f_1$ and $f_2$ are equal is equivalent to $g_1(j) + g_{12}(j, x) = 0$. The hypothesis that $f_1$ and $f_2$ are parallel is equivalent to the hypothesis that the interaction $g_{12} = 0$. The hypothesis that the integrals of $f_1$ and $f_2$ with respect to the probability measure $P$ are equal is equivalent to the hypothesis that the main effect $g_1(j) = 0$. The hypothesis that the weighted average of $f_1$ and $f_2$ is a constant is equivalent to the hypothesis that the main effect $g_2(x) = 0$. Note that the probability measure $P$ and weights $w_j$ are arbitrary, and can be selected for specific hypotheses. Under the specified conditions, the hypothesis that there exists a linear relationship between the functions $f_1$ and $f_2$ is equivalent to the hypothesis that the interaction is multiplicative. Thus, for these hypotheses, we can fit the SS ANOVA model (8.25) and perform tests on the corresponding components.
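The decomposition (8.25) and the parallelism equivalence above are easy to verify numerically. The base-R sketch below is illustrative (the integral with respect to a uniform $P$ is replaced by a grid average): it computes $\mu$, $g_1$, $g_2$, and $g_{12}$ for two functions that differ by a constant and confirms that the interaction vanishes.

```r
# Numerical check of (8.25) with w1 = w2 = 1/2 and P uniform on [0, 1]:
# if f1 and f2 differ by a constant, the interaction g12 is identically zero.
x   <- seq(0, 1, length.out = 1001)
f1  <- sin(2 * pi * x)
f2  <- sin(2 * pi * x) + 1              # parallel: f2 - f1 is constant
w   <- c(0.5, 0.5)
avg <- function(g) mean(g)              # grid average in place of the integral

mu    <- w[1] * avg(f1) + w[2] * avg(f2)
g1    <- c(avg(f1), avg(f2)) - mu       # main effect of j, j = 1, 2
g2    <- w[1] * f1 + w[2] * f2 - mu     # main effect of x
g12_1 <- f1 - mu - g1[1] - g2           # interaction at j = 1
g12_2 <- f2 - mu - g1[2] - g2           # interaction at j = 2

max(abs(g12_1), abs(g12_2))             # numerically zero
```

Replacing `f2` with a function whose difference from `f1` is not constant makes `g12_1` and `g12_2` nonzero, which is exactly the second equivalence above.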


8.3 Semiparametric Nonlinear Regression Models

8.3.1 The Model

A semiparametric nonlinear regression (SNR) model assumes that
$$y_i = N_i(\beta, f) + \epsilon_i, \qquad i = 1, \ldots, n, \qquad (8.26)$$
where $\beta = (\beta_1, \ldots, \beta_q)^T \in \mathbb{R}^q$ is a vector of parameters, $f = (f_1, \ldots, f_r)$ are unknown functions, $f_k$ belongs to an RKHS $\mathcal{H}_k$ on an arbitrary domain $\mathcal{X}_k$ for $k = 1, \ldots, r$, $N_i$ are known nonlinear functionals on $\mathbb{R}^q\times\mathcal{H}_1\times\cdots\times\mathcal{H}_r$, and $\epsilon = (\epsilon_1, \ldots, \epsilon_n)^T \sim N(0, \sigma^2 W^{-1})$.

The SNR model (8.26) is an extension of the NNR model (7.4) that allows parameters in the model. The mean function may depend on both the parameters $\beta$ and the nonparametric functions $f$ nonlinearly. The nonparametric functions $f$ are regarded as parameters just like $\beta$. Certain constraints may be required to make an SNR model identifiable. Specific conditions depend on the form of a model and the purpose of an analysis. Often, identifiability can be achieved by adding constraints on parameters, absorbing some parameters into $f$, and/or adding constraints on $f$ by removing certain components from the model spaces. Illustrations of how to make an SNR model identifiable can be found in Section 8.4.

The semiparametric linear regression model (8.4) is a special case of the SNR model in which the $N_i$ are linear in $\beta$ and $f$. When the $N_i$ are linear in $f$ for fixed $\beta$, model (8.26) can be expressed as
$$y_i = \alpha(\beta; x_i) + \sum_{k=1}^{r}L_{ki}(\beta)f_k + \epsilon_i, \qquad i = 1, \ldots, n, \qquad (8.27)$$
where $\alpha$ is a known linear or nonlinear function of independent variables $x = (x_1, \ldots, x_d)$, $x_i = (x_{i1}, \ldots, x_{id})$, and $L_{ki}(\beta)$ are bounded linear functionals that may depend on $\beta$. Model (8.27) will be referred to as a semiparametric conditional linear model.

One special case of model (8.27) is
$$y_i = \alpha(\beta; x_i) + \sum_{k=1}^{r}\delta_k(\beta; x_i)f_k(\gamma_k(\beta; x_i)) + \epsilon_i, \qquad i = 1, \ldots, n, \qquad (8.28)$$
where $\alpha$, $\delta_k$, and $\gamma_k$ are known functions. Containing many existing models as special cases, model (8.28) is interesting in its own right. It is obvious that both nonlinear regression and nonparametric regression


models are special cases of model (8.28). The projection pursuit regression model (8.1) is a special case with $\alpha(\beta; x) = \beta_0$, $\delta_k(\beta; x) \equiv 1$, and $\gamma_k(\beta; x) = \beta_k^T x$, where $\beta = (\beta_0, \beta_1^T, \ldots, \beta_r^T)^T$. The partially linear single index model (8.2) is a special case with $r = 1$, $x = (s^T, t^T)^T$, $\alpha(\beta; x) = \beta_1^T s$, $\delta_1(\beta; x) \equiv 1$, $\gamma_1(\beta; x) = \beta_2^T t$, and $\beta = (\beta_1^T, \beta_2^T)^T$.

Other special cases can be found in Section 8.4.

Sometimes one may want to investigate how the parameters $\beta$ depend on other covariates. One approach is to build a second-stage linear model, $\beta = A\vartheta$, where $A$ is a known matrix. See Section 7.5 in Pinheiro and Bates (2000) for details. The general form of models (8.26), (8.27), and (8.28) remains the same when the second-stage model is plugged in. Therefore, the estimation procedures in Section 8.3.3 also apply to the SNR model combined with a second-stage model, with $\vartheta$ as parameters.

Let $y = (y_1, \ldots, y_n)^T$, $\eta(\beta, f) = (N_1(\beta, f), \ldots, N_n(\beta, f))^T$, and $\epsilon = (\epsilon_1, \ldots, \epsilon_n)^T$. Then model (8.26) can be written in vector form as
$$y = \eta(\beta, f) + \epsilon. \qquad (8.29)$$

8.3.2 SNR Models for Clustered Data

Clustered data such as repeated measures, longitudinal, and multilevel data are common in practice. The SNR model for clustered data assumes that
$$y_{ij} = N_{ij}(\beta_i, f) + \epsilon_{ij}, \qquad i = 1, \ldots, m;\ j = 1, \ldots, n_i, \qquad (8.30)$$
where $y_{ij}$ is the $j$th observation in cluster $i$, $\beta_i = (\beta_{i1}, \ldots, \beta_{iq})^T \in \mathbb{R}^q$ is a vector of parameters for cluster $i$, $f = (f_1, \ldots, f_r)$ are unknown functions, $f_k$ belongs to an RKHS $\mathcal{H}_k$ on an arbitrary domain $\mathcal{X}_k$ for $k = 1, \ldots, r$, $N_{ij}$ are known nonlinear functionals on $\mathbb{R}^q\times\mathcal{H}_1\times\cdots\times\mathcal{H}_r$, and $\epsilon_{ij}$ are random errors. Let $\epsilon_i = (\epsilon_{i1}, \ldots, \epsilon_{in_i})^T$ and $\epsilon = (\epsilon_1^T, \ldots, \epsilon_m^T)^T$. We assume that $\epsilon \sim N(0, \sigma^2 W^{-1})$. Usually, observations are correlated within a cluster and independent between clusters. In this case, $W^{-1}$ is block diagonal.

Again, when the $N_{ij}$ are linear in $f$ for fixed $\beta_i$, model (8.30) can be expressed as
$$y_{ij} = \alpha(\beta_i; x_{ij}) + \sum_{k=1}^{r}L_{kij}(\beta_i)f_k + \epsilon_{ij}, \qquad i = 1, \ldots, m;\ j = 1, \ldots, n_i, \qquad (8.31)$$
where $\alpha$ is a known function of independent variables $x = (x_1, \ldots, x_d)$, $x_{ij} = (x_{ij1}, \ldots, x_{ijd})$, and $L_{kij}(\beta)$ are bounded linear functionals that may depend on $\beta$. Model (8.31) will be referred to as a semiparametric conditional linear model for clustered data.


Similar to model (8.28), one special case of model (8.31) is
$$y_{ij} = \alpha(\beta_i; x_{ij}) + \sum_{k=1}^{r}\delta_k(\beta_i; x_{ij})f_k(\gamma_k(\beta_i; x_{ij})) + \epsilon_{ij}, \qquad i = 1, \ldots, m;\ j = 1, \ldots, n_i, \qquad (8.32)$$

where $\alpha$, $\delta_k$, and $\gamma_k$ are known functions. Model (8.32) can be regarded as an extension of the self-modeling nonlinear regression (SEMOR) model proposed by Lawton, Sylvestre and Maggio (1972). In particular, the shape invariant model (SIM) (Lawton et al. 1972, Wang and Brown 1996),
$$y_{ij} = \beta_{i1} + \beta_{i2}f\left(\frac{x_{ij} - \beta_{i3}}{\beta_{i4}}\right) + \epsilon_{ij}, \qquad i = 1, \ldots, m;\ j = 1, \ldots, n_i, \qquad (8.33)$$
is a special case of model (8.32) with $d = 1$, $r = 1$, $q = 4$, $\alpha(\beta_i; x_{ij}) = \beta_{i1}$, $\delta_1(\beta_i; x_{ij}) = \beta_{i2}$, and $\gamma_1(\beta_i; x_{ij}) = (x_{ij} - \beta_{i3})/\beta_{i4}$. Again, a second-stage linear model may also be constructed for the parameters $\beta_i$, and the estimation procedures in Section 8.3.3 apply to the combined model.

Let $n = \sum_{i=1}^{m}n_i$, $y_i = (y_{i1}, \ldots, y_{in_i})^T$, $y = (y_1^T, \ldots, y_m^T)^T$, $\beta = (\beta_1^T, \ldots, \beta_m^T)^T$, $\eta_i(\beta_i, f) = (N_{i1}(\beta_i, f), \ldots, N_{in_i}(\beta_i, f))^T$, and $\eta(\beta, f) = (\eta_1^T(\beta_1, f), \ldots, \eta_m^T(\beta_m, f))^T$. Then model (8.30) can be written in the vector form (8.29).

8.3.3 Estimation and Inference

For simplicity, we present the estimation and inference procedures only for the SNR models in Section 8.3.1. The same methods apply to the SNR models in Section 8.3.2 for clustered data with a slight modification of notation.

Consider the vector form (8.29) and assume that $W$ depends on an unknown vector of parameters $\tau$. Assume that $f_k \in \mathcal{H}_k$ and $\mathcal{H}_k = \mathcal{H}_{k0}\oplus\mathcal{H}_{k1}$, where $\mathcal{H}_{k0} = \mathrm{span}\{\phi_{k1}, \ldots, \phi_{kp_k}\}$ and $\mathcal{H}_{k1}$ is an RKHS with RK $R_{k1}$. Our goal is to estimate the parameters $\beta$, $\tau$, $\sigma^2$, and the nonparametric functions $f$.

Let
$$l(y; \beta, f, \tau, \sigma^2) = \log|\sigma^2 W^{-1}| + \frac{1}{\sigma^2}(y - \eta)^T W(y - \eta) \qquad (8.34)$$

be twice the negative log-likelihood, where an additive constant is ignored for simplicity. We estimate $\beta$, $\tau$, and $f$ as minimizers of the penalized likelihood (PL)
$$l(y; \beta, f, \tau, \sigma^2) + \frac{n\lambda}{\sigma^2}\sum_{k=1}^{r}\theta_k^{-1}\|P_{k1}f_k\|^2, \qquad (8.35)$$


where $P_{k1}$ is the projection operator onto $\mathcal{H}_{k1}$ in $\mathcal{H}_k$, and $\lambda\theta_k^{-1}$ are smoothing parameters. The multiplying constant $n/\sigma^2$ is introduced in the penalty term so that, ignoring an additive constant, the PL (8.35) has the same form as the PWLS (8.5) when $N$ is linear in both $\beta$ and $f$.

We first develop a backfitting procedure for the semiparametric conditional linear model (8.27) and then develop an algorithm for the general SNR model (8.26).

Consider the semiparametric conditional linear model (8.27). We first consider the estimation of $f$ with fixed $\beta$ and $\tau$. When $\beta$ and $\tau$ are fixed, the PL (8.35) is equivalent to the PWLS (8.5). Therefore, the solutions of $f$ to (8.35) can be represented as those in (8.6). We use the same notation as in Section 8.2.2. Note that both $T$ and $\Sigma_\theta$ may depend on $\beta$, even though the dependence is not expressed explicitly for simplicity. Let $\alpha = (\alpha(\beta; x_1), \ldots, \alpha(\beta; x_n))^T$. We need to solve the equations

$$\begin{aligned}
(\Sigma_\theta + n\lambda W^{-1})c + Td &= y - \alpha,\\
T^T c &= 0,
\end{aligned}\qquad (8.36)$$

(8.36)

for the coefficients $c$ and $d$. Note that $\alpha$ and $W$ are fixed since $\beta$ and $\tau$ are fixed, and the equations in (8.36) have the same form as those in (5.6). Therefore, the methods in Section 5.2.1 can be used to solve (8.36), and the UBR, GCV, and GML methods in Section 5.2.3 can be used to estimate the smoothing parameters $\lambda$ and $\theta$.

Next we consider the estimation of $\beta$ and $\tau$ with fixed $f$. When $f$ is fixed, the PL (8.35) is equivalent to
$$\log|\sigma^2 W^{-1}| + \frac{1}{\sigma^2}\{y - \eta(\beta, f)\}^T W\{y - \eta(\beta, f)\}. \qquad (8.37)$$

We use the backfitting and Gauss–Newton algorithms in Pinheiro and Bates (2000) to find minimizers of (8.37) by updating $\beta$ and $\tau$ iteratively. Details about the backfitting and Gauss–Newton algorithms can be found in Section 7.5 of Pinheiro and Bates (2000).

Putting pieces together, we have the following algorithm.

Algorithm for semiparametric conditional linear models

1. Initialize: Set initial values for β and τ .

2. Cycle: Alternate between (a) and (b) until convergence:

(a) Conditional on current estimates of $\beta$ and $\tau$, update $f$ using the methods in Section 5.2.1 with smoothing parameters selected by the UBR, GCV, or GML method in Section 5.2.3.

Page 269: Smoothing Splines - 221.114.158.246221.114.158.246/~bunken/statistics/others_smoothingspline.pdf · Applications covers basic smoothing spline models, including polynomial, periodic,

244 Smoothing Splines: Methods and Applications

(b) Conditional on current estimates of $f$, update $\beta$ and $\tau$ by solving (8.37) alternately using the backfitting and Gauss–Newton algorithms.

Note that the smoothing parameters are estimated iteratively with fixed $\tau$ at step 2(a). The parameters $\tau$ are estimated at step 2(b), which makes the algorithm relatively easy to implement. An alternative, computationally more expensive approach is to estimate the smoothing parameters and $\tau$ jointly at step 2(a).

Finally, we consider estimation for the general SNR model (8.26). When $\eta$ is nonlinear in $f$, the solutions of $f$ to (8.35) usually do not fall in finite-dimensional spaces. Therefore, certain approximations are necessary. Again, first consider estimating $f$ with fixed $\beta$ and $\tau$. We now extend the EGN procedure in Section 7.3 to multiple functions. Let $f_-$ be the current estimate of $f$. For any fixed $\beta$, $N_i$ is a functional on $\mathcal{H}_1\times\cdots\times\mathcal{H}_r$. We assume that $N_i$ is Fréchet differentiable at $f_-$ and write $D_i = DN_i(f_-)$. Then $D_i h = \sum_{k=1}^{r}D_{ki}h_k$, where $D_{ki}$ is the partial Fréchet differential of $N_i$ with respect to $f_k$ evaluated at $f_-$, $h = (h_1, \ldots, h_r)$, and $h_k \in \mathcal{H}_k$ (Flett 1980). For $k = 1, \ldots, r$, $D_{ki}$ is a bounded linear functional on $\mathcal{H}_k$. Approximating $N_i(\beta, f)$ by its linear approximation

$$N_i(\beta, f) \approx N_i(\beta, f_-) + \sum_{k=1}^{r}D_{ki}(f_k - f_{k-}), \qquad (8.38)$$

we have an approximate semiparametric conditional linear model
$$\tilde y_i = \sum_{k=1}^{r}D_{ki}f_k + \epsilon_i, \qquad i = 1, \ldots, n, \qquad (8.39)$$
where $\tilde y_i = y_i - N_i(\beta, f_-) + \sum_{k=1}^{r}D_{ki}f_{k-}$. The functions $f$ in model (8.39) can be estimated using the method in Section 8.2.2. Consequently, we have the following algorithm for the general SNR model.
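The nature of the approximation (8.38) can be illustrated in the simplest possible setting: one function evaluated at one point, with the nonlinear functional $N(f) = \exp(f(x_0))$. Then the partial Fréchet differential acts as $Dh = \exp(f_-(x_0))\,h(x_0)$, and the linearization error is second order in the perturbation. (Toy numbers, not from the text.)

```r
# First-order linearization of N(f) = exp(f(x0)), as in (8.38):
# N(f) ~ N(f-) + exp(f-(x0)) * (f(x0) - f-(x0)).
f_minus <- 0.3                    # current value f-(x0)
h       <- 0.01                   # perturbation f(x0) - f-(x0)
exact   <- exp(f_minus + h)
lin     <- exp(f_minus) + exp(f_minus) * h
abs(exact - lin)                  # O(h^2)
```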

Algorithm for general SNR model

1. Initialize: Set initial values for β, τ , and f .

2. Cycle: Alternate between (a) and (b) until convergence:

(a) Conditional on current estimates of $\beta$, $\tau$, and $f$, compute $D_{ki}$ and $\tilde y_i$, and update $f$ by applying step 2(a) of the Algorithm for semiparametric conditional linear models to the approximate model (8.39). Repeat this step until convergence.


(b) Conditional on current estimates of $f$, update $\beta$ and $\tau$ by solving (8.37) alternately using the backfitting and Gauss–Newton algorithms.

Denote the final estimates of $\beta$, $\tau$, and $f$ as $\hat\beta$, $\hat\tau$, and $\hat f$. We estimate $\sigma^2$ by
$$\hat\sigma^2 = \frac{(y - \hat\eta)^T\widehat{W}(y - \hat\eta)}{n - d - \mathrm{tr}(H^*)}, \qquad (8.40)$$
where $\hat\eta = \eta(\hat\beta, \hat f)$, $\widehat{W}$ is the estimate of $W$ with $\tau$ replaced by $\hat\tau$, $d$ is the degrees of freedom for parameters, usually taken as the total number of parameters, and $H^*$ is the hat matrix for model (8.39) computed at convergence.

Conditional on $f$, inference for $\beta$ and $\tau$ can be made based on the approximate distributions of the maximum likelihood estimates. Conditional on $\beta$ and $\tau$, model (8.27) is a special case of the semiparametric linear regression model (8.4). Therefore, Bayesian confidence intervals can be constructed as in Section 8.2 for semiparametric conditional linear models. For the general SNR model, approximate Bayesian confidence intervals can be constructed based on the linear approximation (8.39) at convergence. The bootstrap approach may also be used to construct confidence intervals.

8.3.4 The snr Function

The function snr in the assist package is designed to fit the following special SNR models:
$$y_i = \psi(\beta, L_{1i}(\beta)f_1, \ldots, L_{ri}(\beta)f_r) + \epsilon_i, \qquad i = 1, \ldots, n, \qquad (8.41)$$
for cross-sectional data, and
$$y_{ij} = \psi(\beta_i, L_{1ij}(\beta_i)f_1, \ldots, L_{rij}(\beta_i)f_r) + \epsilon_{ij}, \qquad i = 1, \ldots, m;\ j = 1, \ldots, n_i, \qquad (8.42)$$
for clustered data, where $\psi$ is a known nonlinear function, and $L_{ki}(\beta)$ and $L_{kij}(\beta_i)$ are evaluational functionals on $\mathcal{H}_k$. Obviously, (8.41) includes model (8.28) as a special case, and (8.42) includes model (8.32) as a special case.

A modified procedure is implemented in the snr function. Note that models (8.41) and (8.42) reduce to the NNR model (7.5) when $\beta$ is fixed and the random errors are iid. For non-iid random errors, when both $\beta$ and $\tau$ are fixed, transformations similar to those in Section 5.2.1 may be used. The nonlinear Gauss–Seidel algorithm in Section 7.4 can then be used to update the nonparametric functions $f$. Therefore, we implement the following procedure in the snr function.

Algorithm in the snr function

1. Initialize: Set initial values for β, τ , and f .

2. Cycle: Alternate between (a) and (b) until convergence:

(a) Conditional on current estimates of $\beta$, $\tau$, and $f$, apply transformations as in Section 5.2.1 when the random errors are non-iid, and use the nonlinear Gauss–Seidel algorithm to update $f$.

(b) Conditional on current estimates of $f$, update $\beta$ and $\tau$ by solving (8.37) alternately using the backfitting and Gauss–Newton algorithms.

No initial values for f_k are necessary if ψ depends on f_k linearly in (8.41) or (8.42). The initial τ is set such that W equals the identity matrix. Step (b) is implemented using the gnls function in the nlme package.
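The alternating structure of step 2 — update each unknown component conditional on the current estimates of the others — is the classical backfitting/Gauss–Seidel idea. As a rough illustration (not the assist implementation; the data, the running-mean smoother, and all names here are invented for the sketch), a minimal Python version of backfitting for two additive components:

```python
import numpy as np

def smooth(x, r, bw=0.1):
    # crude running-mean smoother standing in for a spline smoother
    return np.array([r[np.abs(x - xi) < bw].mean() for xi in x])

rng = np.random.default_rng(0)
n = 200
x1, x2 = rng.uniform(size=n), rng.uniform(size=n)
y = np.sin(2 * np.pi * x1) + x2**2 + 0.1 * rng.normal(size=n)

# backfitting / Gauss-Seidel cycle: update each component
# conditional on the current estimate of the other
f1, f2 = np.zeros(n), np.zeros(n)
for _ in range(20):
    f1 = smooth(x1, y - f2); f1 -= f1.mean()
    f2 = smooth(x2, y - f1); f2 -= f2.mean()
mu = (y - f1 - f2).mean()
rss = np.mean((y - mu - f1 - f2) ** 2)
print(rss)  # small; the noise variance is 0.01
```

The cycle converges quickly here because the two components enter additively; the snr algorithm applies the same alternation with spline smoothers and a Gauss–Newton step for the parameters.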

A typical call is

snr(formula, func, params, start)

where formula is a two-sided formula specifying the response variable on the left side of a ~ operator and an expression for the function ψ in the model (8.41) or (8.42) on the right side with β and f_k treated as parameters. The argument func inputs a list of formulae, each specifying bases φ_{k1}, . . . , φ_{kp_k} for H_{k0} and RK R_{k1} for H_{k1} in the same way as the func argument in the nnr function. The argument params inputs a list of two-sided linear formulae specifying second-stage models for β. When there is no second-stage model for a parameter, it is specified as ~1. The argument start inputs initial values for all parameters. When ψ depends on functions f_k nonlinearly, initial values for those functions should also be provided in the start argument.

An object of snr class is returned. The generic function summary can be applied to extract further information. Predictions at covariate values can be computed using the predict function. Posterior means and standard deviations for f can be computed using the intervals function. See Sections 8.4.2–8.4.6 for examples.


Semiparametric Regression 247

8.4 Examples

8.4.1 Canadian Weather — Revisit

Consider the Canadian weather data again with annual temperature and precipitation profiles from all 35 stations as functional data. We now investigate how the monthly logarithm of rainfall depends on climate regions and temperature. Let y be the logarithm of prec, x_1 be the region indicator, and x_2 be the month variable scaled into [0, 1]. Consider the following FLM

y_{k,x_1}(x_2) = f_1(x_1, x_2) + w_{k,x_1}(x_2) f_2(x_2) + ǫ_{k,x_1}(x_2),   (8.43)

where y_{k,x_1}(x_2), w_{k,x_1}(x_2), and ǫ_{k,x_1}(x_2) are profiles of log precipitation, residual temperature after removing the region effect, and random error of station k in climate region x_1, respectively. Both annual log precipitation and temperature profiles can be regarded as functional data. Therefore, model (8.43) is an example of situation (iii) in Section 2.10 where both the independent and dependent variables involve functional data. We model the bivariate function f_1(x_1, x_2) using the tensor product space R^4 ⊗ W_2^2(per). Then, as in (4.22), f_1 admits the SS ANOVA decomposition

f_1(x_1, x_2) = µ + f_{1,1}(x_1) + f_{1,2}(x_2) + f_{1,12}(x_1, x_2).

Model (8.43) is the same as model (14.1) in Ramsay and Silverman (2005) with µ(x_2) = µ + f_{1,2}(x_2), α_{x_1}(x_2) = f_{1,1}(x_1) + f_{1,12}(x_1, x_2), and β(x_2) = f_2(x_2). The function f_2 is the varying coefficient function for the temperature effect. We model f_2 using the periodic spline space W_2^2(per). There are 12 monthly observations for each station. Collect all observations on y, x_1, x_2, and w for all 35 stations and denote them as {(y_i, x_{i1}, x_{i2}, w_i), i = 1, . . . , n} where n = 420. Denote the collection of random errors as ǫ_1, . . . , ǫ_n. Then model (8.43) can be rewritten as

y_i = µ + f_{1,1}(x_{i1}) + f_{1,2}(x_{i2}) + f_{1,12}(x_{i1}, x_{i2}) + L_{2i}f_2 + ǫ_i,   (8.44)

where L_{2i}f_2 = w_i f_2(x_{i2}). Model (8.44) is a special case of the semiparametric linear regression model (8.4). Model spaces for f_{1,1}, f_{1,2}, and f_{1,12} are H_1, H_2, and H_3, where H_1, H_2, and H_3 are defined in Section 4.4.4. The RKs R_1, R_2, and R_3 of H_1, H_2, and H_3 can be calculated as products of the RKs of the involved marginal spaces. Define Σ_1 = {R_1(x_{i1}, x_{j1})}_{i,j=1}^n, Σ_2 = {R_2(x_{i2}, x_{j2})}_{i,j=1}^n, and Σ_3 = {R_3((x_{i1}, x_{i2}), (x_{j1}, x_{j2}))}_{i,j=1}^n. For f_2, the model space H_4 =



W_2^2(per), where the construction of W_2^2(per) is given in Section 2.7. Specifically, write W_2^2(per) = H_{40} ⊕ H_{41}, where H_{40} = {1} and H_{41} = W_2^2(per) ⊖ {1}. Denote φ_4(x_2) = 1 as the basis of H_{40} and R_4 as the RK for H_{41}. Then, T_4 = (L_{21}φ_4, . . . , L_{2n}φ_4)^T = (w_1, . . . , w_n)^T = w, and Σ_4 = {L_{2i}L_{2j}R_4}_{i,j=1}^n = {w_i w_j R_4(x_{i2}, x_{j2})}_{i,j=1}^n = ww^T ◦ Λ, where Λ = {R_4(x_{i2}, x_{j2})}_{i,j=1}^n and ◦ represents elementwise multiplication of two matrices. Therefore, model (8.44) can be fitted as follows:

> x1 <- rep(as.factor(region),rep(12,35))
> x2 <- (rep(1:12,35)-.5)/12
> y <- log(as.vector(monthlyPrecip))
> w <- canada.fit2$resi
> canada.fit3 <- ssr(y~w,
    rk=list(shrink1(x1),
            periodic(x2),
            rk.prod(shrink1(x1),periodic(x2)),
            rk.prod(kron(w),periodic(x2))))

Estimates of the region effects α_{x_1}(x_2) and the coefficient function f_2 evaluated at grid points are computed as follows:

> xgrid <- seq(0,1,len=50)
> zone <- c("Atlantic","Pacific","Continental","Arctic")
> grid <- data.frame(x1=rep(zone,rep(50,4)),
    x2=rep(xgrid,4), w=rep(0,200))
> alpha <- predict(canada.fit3, newdata=grid,
    terms=c(0,0,1,0,1,0))
> grid <- data.frame(x1=rep(zone[1],50), x2=xgrid,
    w=rep(1,50))
> f2 <- predict(canada.fit3, newdata=grid,
    terms=c(0,0,0,0,0,1))

Those estimates of α_{x_1}(x_2) and f_2 are shown in Figures 8.3 and 8.4, respectively.

Ramsay and Silverman (2005) also considered the following FLM (model (16.1) in Ramsay and Silverman (2005)):

y_k(x_2) = f_1(x_2) + ∫_0^1 w_k(u) f_2(u, x_2) du + ǫ_k(x_2),   k = 1, . . . , 35,   (8.45)

where y_k(x_2) is the logarithm of the annual precipitation profile at station k, f_1(x_2) plays the part of an intercept as in standard regression, w_k(u) is the temperature profile at station k, f_2(u, x_2) is an unknown



FIGURE 8.3 Canadian weather data, plots of estimated region effects to precipitation (region effect versus month; panels: Atlantic, Pacific, Continental, Arctic), and 95% Bayesian confidence intervals.

FIGURE 8.4 Canadian weather data, estimate of the coefficient function for temperature effect f_2 (versus month), and 95% Bayesian confidence intervals.

Page 275: Smoothing Splines - 221.114.158.246221.114.158.246/~bunken/statistics/others_smoothingspline.pdf · Applications covers basic smoothing spline models, including polynomial, periodic,

250 Smoothing Splines: Methods and Applications

weight function at month x_2, and ǫ_k(x_2) are random error processes. Compared to model (8.44), the whole temperature profile is used to predict the current precipitation in (8.45). Note that w_k(x_2) in model (8.45) is the actual temperature, and w_k(x_2) in (8.44) is the residual temperature after removing the region effect.

The goal is to model and estimate the functions f_1 and f_2. It is reasonable to assume that f_1 and f_2 are smooth periodic functions. Specifically, we assume that f_1 ∈ W_2^2(per) and f_2 ∈ W_2^2(per) ⊗ W_2^2(per). Let H_0 = {1} and H_1 = W_2^2(per) ⊖ {1}. Then, we have a one-way SS ANOVA decomposition for W_2^2(per),

W_2^2(per) = H_0 ⊕ H_1,

and a two-way SS ANOVA decomposition for W_2^2(per) ⊗ W_2^2(per),

W_2^2(per) ⊗ W_2^2(per)
  = {H_0^{(1)} ⊕ H_1^{(1)}} ⊗ {H_0^{(2)} ⊕ H_1^{(2)}}
  = {H_0^{(1)} ⊗ H_0^{(2)}} ⊕ {H_1^{(1)} ⊗ H_0^{(2)}} ⊕ {H_0^{(1)} ⊗ H_1^{(2)}} ⊕ {H_1^{(1)} ⊗ H_1^{(2)}}
  ≜ H_0 ⊕ H_2 ⊕ H_3 ⊕ H_4,   (8.46)

where H_0^{(1)} = H_0^{(2)} = {1}, and H_1^{(1)} = H_1^{(2)} = W_2^2(per) ⊖ {1}. Equivalently, we have the following SS ANOVA decomposition for f_1 and f_2:

f_1(x_2) = µ_1 + f_{1,1}(x_2),
f_2(u, x_2) = µ_2 + f_{2,1}(u) + f_{2,2}(x_2) + f_{2,12}(u, x_2),

where f_{1,1} ∈ H_1, f_{2,1} ∈ H_2, f_{2,2} ∈ H_3, and f_{2,12} ∈ H_4. Then model (8.45) can be rewritten as

y_k(x_2) = µ_1 + µ_2 z_k + f_{1,1}(x_2) + ∫_0^1 w_k(u) f_{2,1}(u) du + z_k f_{2,2}(x_2)
         + ∫_0^1 w_k(u) f_{2,12}(u, x_2) du + ǫ_k(x_2),   (8.47)

where z_k = ∫_0^1 w_k(u) du. Let z = (z_1, . . . , z_{35})^T and s = (s_1, . . . , s_n)^T ≜ z ⊗ 1_{12}, where 1_k is a k-vector of all ones and ⊗ represents the Kronecker product. Denote {(y_i, x_{i2}), i = 1, . . . , n} as the collection of all observations on y and x_2, and ǫ_1, . . . , ǫ_n as the collection of random errors. Then model (8.47) can be rewritten as

y_i = µ_1 + µ_2 s_i + f_{1,1}(x_{i2}) + ∫_0^1 w_{[i]}(u) f_{2,1}(u) du + s_i f_{2,2}(x_{i2})
    + ∫_0^1 w_{[i]}(u) f_{2,12}(u, x_{i2}) du + ǫ_i,   i = 1, . . . , n,   (8.48)



where [i] represents the integer part of (11 + i)/12. Define a linear operator L_{1i} as the evaluational functional on H_1 = W_2^2(per) ⊖ {1} such that L_{1i}f_{1,1} = f_{1,1}(x_{i2}). Define linear operators L_{2i}, L_{3i}, and L_{4i} on subspaces H_2, H_3, and H_4 in (8.46) such that

L_{2i}f_{2,1} = ∫_0^1 w_{[i]}(u) f_{2,1}(u) du,
L_{3i}f_{2,2} = s_i f_{2,2}(x_{i2}),
L_{4i}f_{2,12} = ∫_0^1 w_{[i]}(u) f_{2,12}(u, x_{i2}) du.
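The integrals in these functionals are approximated below by the midpoint rule over the 12 monthly midpoints t_j = (j − 0.5)/12, which is where the factors 1/12 (single integrals) and 1/144 (double integrals of products) in the matrices that follow come from. A quick numerical illustration in Python (the test integrands are arbitrary):

```python
import numpy as np

t = (np.arange(1, 13) - 0.5) / 12        # monthly midpoints t_j on [0, 1]

# midpoint rule with 12 panels: integral_0^1 g(u) du ~ (1/12) * sum_j g(t_j)
g = lambda u: np.cos(2 * np.pi * u) + u  # smooth test integrand
approx = g(t).sum() / 12                 # midpoint-rule value
exact = 0.5                              # int cos(2*pi*u) du = 0, int u du = 1/2

# a double integral of a product of two such terms picks up (1/12)^2 = 1/144
w = np.sin(2 * np.pi * t)
double_approx = np.outer(w, w).sum() / 144   # ~ (int_0^1 w(u) du)^2 = 0

print(approx, double_approx)
```

For these periodic integrands the 12-panel midpoint rule happens to be exact, which is consistent with monthly data carrying exactly the first few Fourier frequencies.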

Assume that the functions w_k(u) for k = 1, . . . , 35 are square integrable. Then L_{2i}, L_{3i}, and L_{4i} are bounded linear functionals, and model (8.48) is a special case of the semiparametric linear regression model (8.4). Let t_i = (i − 0.5)/12 for i = 1, . . . , 12 be the middle point of month i, x_2 = (t_1, . . . , t_{12})^T, w_k = (w_k(x_{12}), . . . , w_k(x_{m2}))^T, where m = 12, and W = (w_1, . . . , w_{35}). Let R_1 be the RK of H_1 = W_2^2(per) ⊖ {1}. Introduce the notation R_1(u, v) = {R_1(u_k, v_l)}_{k=1,l=1}^{K,L} for any vectors u = (u_1, . . . , u_K)^T and v = (v_1, . . . , v_L)^T. It is easy to check that S = (1_n, s), Σ_1 = 1_{35} ⊗ 1_{35}^T ⊗ R_1(x_2, x_2), and Σ_3 = z ⊗ z^T ⊗ R_1(x_2, x_2). A similar approximation as in Section 2.10 leads to Σ_2 ≈ {W^T R_1(x_2, x_2) W} ⊗ 1_{12} ⊗ 1_{12}^T /144. Note that the RK of H_4 in (8.46) equals R_1(s, t)R_1(x, z). Then the (i, j)th element of the matrix Σ_4 is

Σ_4(i, j) = L_{4i}L_{4j} R_1(u, v)R_1(x, z)
          = R_1(x_{i2}, x_{j2}) ∫_0^1 ∫_0^1 w_{[i]}(u) w_{[j]}(v) R_1(u, v) du dv
          ≈ (1/144) R_1(x_{i2}, x_{j2}) w_{[i]}^T R_1(x_2, x_2) w_{[j]}.

Thus, Σ_4 ≈ Σ_1 ◦ Σ_2. We fit model (8.48) as follows:

> W <- monthlyTemp; z <- apply(W,2,mean)
> s <- rep(z,rep(12,35)); x <- seq(0.5,11.5,1)/12
> y <- log(as.vector(monthlyPrecip))
> Q1 <- kronecker(matrix(1,35,35),periodic(x))
> Q2 <- kronecker(t(W)%*%periodic(x)%*%W,
    matrix(1,12,12))/144
> Q3 <- kronecker(z%*%t(z),periodic(x))
> Q4 <- rk.prod(Q1,Q2)
> canada.fit4 <- ssr(y~s, rk=list(Q1,Q2,Q3,Q4))
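The Kronecker construction of Q1, Q2, and their elementwise product Q4 can be checked numerically. The following Python sketch (NumPy standing in for R's kronecker and rk.prod; R below is an arbitrary symmetric matrix in place of the periodic-spline RK, and W is random data) verifies that Σ_1 ∘ Σ_2 has the entries R_1(x_{i2}, x_{j2}) w_{[i]}^T R_1 w_{[j]} /144 derived above:

```python
import numpy as np

rng = np.random.default_rng(1)
m, K = 12, 35                      # months per station, number of stations
R = rng.normal(size=(m, m)); R = R @ R.T      # stand-in for R1(x2, x2)
W = rng.normal(size=(m, K))                   # column k is station k's profile

Sigma1 = np.kron(np.ones((K, K)), R)                      # Q1
Sigma2 = np.kron(W.T @ R @ W, np.ones((m, m))) / m**2     # Q2
Sigma4 = Sigma1 * Sigma2                                  # Q4 = rk.prod(Q1, Q2)

# elementwise definition: Sigma4[i, j] = R[i mod m, j mod m]
#                         * w_[i]^T R w_[j] / 144, with [i] = i // m
for i, j in [(0, 0), (37, 250), (419, 5)]:
    direct = R[i % m, j % m] * (W[:, i // m] @ R @ W[:, j // m]) / m**2
    assert np.isclose(Sigma4[i, j], direct)
print("Sigma4 matches the elementwise formula")
```

The block ordering of np.kron (station index outer, month index inner) matches the stacking of observations used in the R session.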



We now show how to compute the estimated functions evaluated at a set of points. From (8.6), the estimated functions are represented by

f_1(x_2) = d_1 + θ_1 Σ_{i=1}^n c_i R_1(x_2, x_{i2}),
f_{2,1}(u) = θ_2 Σ_{i=1}^n c_i ∫_0^1 w_{[i]}(v) R_1(u, v) dv,
f_{2,2}(x_2) = θ_3 Σ_{i=1}^n c_i {∫_0^1 w_{[i]}(u) du} R_1(x_2, x_{i2}),
f_{2,12}(u, x_2) = θ_4 Σ_{i=1}^n c_i {∫_0^1 w_{[i]}(v) R_1(u, v) dv} R_1(x_2, x_{i2}).

Let u_0 and x_0 be a set of points in [0, 1] for the variables u and x_2, respectively. For simplicity, assume that both u_0 and x_0 have length n_0. The following calculations can be extended to the case when u_0 and x_0 have different lengths. It is not difficult to check that

f_1(x_0) = d_1 1_{n_0} + θ_1 Σ_{i=1}^n c_i R_1(x_0, x_{i2}) = d_1 1_{n_0} + θ_1 S_1 c,
f_{2,1}(u_0) = θ_2 Σ_{i=1}^n c_i ∫_0^1 w_{[i]}(u) R_1(u_0, u) du
             ≈ (1/12) θ_2 Σ_{i=1}^n c_i Σ_{j=1}^{12} R_1(u_0, x_{j2}) w_{[i]}(x_{j2})
             = (1/12) θ_2 Σ_{i=1}^n c_i R_1(u_0, x_2) w_{[i]} = θ_2 S_2 c,
f_{2,2}(x_0) = θ_3 Σ_{i=1}^n c_i {∫_0^1 w_{[i]}(u) du} R_1(x_0, x_{i2}) = θ_3 S_3 c,

where S_1 = 1_{35}^T ⊗ R_1(x_0, x_2), S_2 = {R_1(u_0, x_2) W} ⊗ 1_{12}^T /12, and S_3 = z^T ⊗ R_1(x_0, x_2). The interaction f_{2,12} is a bivariate function. Thus, we evaluate it at a bivariate grid {(u_{0k}, x_{0l}) : k, l = 1, . . . , n_0}:

f_{2,12}(u_{0k}, x_{0l}) = θ_4 Σ_{i=1}^n c_i {∫_0^1 w_{[i]}(v) R_1(u_{0k}, v) dv} R_1(x_{0l}, x_{i2})
                         ≈ θ_4 Σ_{i=1}^n c_i S_2[k, i] S_1[l, i].



Then (f_{2,12}(u_{01}, x_{01}), . . . , f_{2,12}(u_{01}, x_{0n_0}), . . . , f_{2,12}(u_{0n_0}, x_{01}), . . . , f_{2,12}(u_{0n_0}, x_{0n_0}))^T = θ_4 S_4 c, where S_4 is an n_0^2 × n matrix with elements S_4[(k − 1)n_0 + l, i] = S_2[k, i] S_1[l, i] for k, l = 1, . . . , n_0 and i = 1, . . . , n.
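The row layout S_4[(k − 1)n_0 + l, i] = S_2[k, i]S_1[l, i] runs l fastest within each k. A small Python sketch (arbitrary sizes and random S_1, S_2) checks that the double loop used in the R session below is equivalent to a single vectorized broadcast:

```python
import numpy as np

rng = np.random.default_rng(2)
n0, n = 5, 9
S1 = rng.normal(size=(n0, n))
S2 = rng.normal(size=(n0, n))

# loop construction mirroring the R code:
# for (k ...) { for (l ...) S4 <- rbind(S4, S1[l,]*S2[k,]) }
rows = [S1[l] * S2[k] for k in range(n0) for l in range(n0)]
S4_loop = np.vstack(rows)

# vectorized equivalent: S4[k*n0 + l, i] = S2[k, i] * S1[l, i]
S4_vec = (S2[:, None, :] * S1[None, :, :]).reshape(n0 * n0, n)
print(np.allclose(S4_loop, S4_vec))
```

The broadcast form avoids repeated rbind calls, which grow quadratically in cost as the grid size increases.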

> ngrid <- 40; xgrid <- seq(0,1,len=ngrid)
> S1 <- kronecker(t(rep(1,35)), periodic(xgrid,x))
> S2 <- kronecker(periodic(xgrid,x)%*%W, t(rep(1,12)))/12
> S3 <- kronecker(t(z), periodic(xgrid,x))
> S4 <- NULL
> for (k in 1:ngrid) {
    for (l in 1:ngrid) S4 <- rbind(S4, S1[l,]*S2[k,])}
> the <- 10^canada.fit4$rkpk.obj$theta
> f1 <- canada.fit4$coef$d[1]+the[1]*S1%*%canada.fit4$coef$c
> mu2 <- canada.fit4$coef$d[2]
> f21 <- the[2]*S2%*%canada.fit4$coef$c
> f22 <- the[3]*S3%*%canada.fit4$coef$c
> f212 <- the[4]*S4%*%canada.fit4$coef$c
> f2 <- mu2+rep(f21,rep(ngrid,ngrid))+rep(f22,ngrid)+f212

Figure 8.5 displays the estimates of f_1 and f_2.

FIGURE 8.5 Canadian weather data, estimates of f_1 (left, versus x_2) and f_2 (right, surface over u and x_2).



8.4.2 Superconductivity Magnetization Modeling

In this section we use the superconductivity data to illustrate how to check a nonlinear regression model. Figure 8.6 displays magnetization versus time on a logarithmic scale.

FIGURE 8.6 Superconductivity data, observations (circles; magnetization (am2/kg) versus logarithm of time (min)), and the fits by nonlinear regression (NLR), cubic spline, nonlinear partial spline (NPS), and L-spline.

It seems that a straight line (Anderson and Kim model) can fit the data well (Yeshurun, Malozemoff and Shaulov 1996). Let y be the magnetization values, and x be the logarithm of time scaled into the interval [0, 1]. To check the Anderson and Kim model, we fit a cubic spline with the GML choice of the smoothing parameter:

> library(NISTnls)
> a <- Bennett5; x <- ident(a$x); y <- a$y
> super.cub <- ssr(y~x, cubic(x), spar="m")
> anova(super.cub, simu.size=1000)
Testing H_0: f in the NULL space

     test.value simu.size simu.p-value approximate.p-value
LMP  0.1512787       1000            0
GML 0.01622123       1000            0                   0

Let f_1 be the projection onto the space W_2^2[0, 1] ⊖ {1, x − 0.5}, which represents the systematic departure from the straight line model. Both the LMP and GML tests for the hypothesis H_0: f_1(x) = 0 conclude that the departure from the straight line model is statistically significant. Figure 8.7(a) shows the estimate of f_1 with 95% Bayesian confidence intervals. Those confidence intervals also indicate that, though small, the departure from a straight line is statistically significant. The deviation from the Anderson and Kim model has been noticed for high-temperature superconductors, and the following “interpolation formula” was proposed (Bennett, Swartzendruber, Blendell, Habib and Seyoum 1994, Yeshurun et al. 1996):

y = β_1(β_2 + x)^{−1/β_3} + ǫ.   (8.49)

The nonlinear regression model (8.49) is fitted as follows:

> b10 <- -1500*(max(a$x)-min(a$x))**(-1/.85)
> b20 <- (45+min(a$x))/(max(a$x)-min(a$x))
> super.nls <- nls(y~b1*(b2+x)**(-1/b3),
    start=list(b1=b10,b2=b20,b3=.85))

The initial values were computed based on one set of initial values provided in the help file of Bennett5. The fit of the above NLR model is shown in Figure 8.6. To check the “interpolation formula”, we can fit a nonlinear partial spline model

y = β_1(β_2 + x)^{−1/β_3} + f_2(x) + ǫ,   (8.50)

with f_2 ∈ W_2^2[0, 1]. Model (8.50) is a special case of the SNR model, which can be fitted as follows:

> bh <- coef(super.nls)
> super.snr <- snr(y~b1*(b2+x)**(-1/b3)+f(x),
    params=list(b1+b2+b3~1),
    func=f(u)~list(~I(u-.5), cubic(u)),
    start=list(params=bh), spar="m")
> summary(super.snr)

...
Coefficients:
       Value Std.Error   t-value p-value
b1 -466.4197  32.05006 -14.55285       0
b2   11.2296   0.21877  51.33033       0
b3    0.9322   0.01719  54.22932       0
...
GML estimate(s) of smoothing spline parameter(s):
0.0001381760
Equivalent Degrees of Freedom (DF) for spline function:
10.96524
Residual standard error: 0.001703358

We used the estimates of the parameters in (8.49), bh, as initial values for β_1, β_2, and β_3. The fit of model (8.50) is shown in Figure 8.6. To check if the departure from model (8.49), f_2, is significant, we compute posterior means and standard deviations for f_2:

> super.snr.pred <- intervals(super.snr)

The estimate of f_2 and its 95% Bayesian confidence intervals are shown in Figure 8.7(b). The magnitude of f_2 is very small. The zero line is outside the confidence intervals in some regions.

FIGURE 8.7 Superconductivity data, (a) estimate of the departure from the straight line model, (b) estimate of the departure from the “interpolation formula” based on the nonlinear partial spline, and (c) estimate of the departure from the “interpolation formula” based on the L-spline. Shaded regions are 95% Bayesian confidence intervals.

An alternative approach to checking the NLR model (8.49) is to use the L-spline introduced in Section 2.11. Consider the regression model (2.1) where f ∈ W_2^2[0, 1]. For fixed β_2 and β_3, it is clear that the space corresponding to the “interpolation formula”,

H_0 = span{(β_2 + x)^{−1/β_3}},

is the kernel of the differential operator

L = D + 1/{β_3(β_2 + x)}.
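That H_0 is annihilated by L can be verified numerically: applying L to (β_2 + x)^{−1/β_3} with a finite-difference derivative gives values at the level of rounding error. A Python check with illustrative parameter values (close to, but not exactly, the fitted estimates):

```python
import numpy as np

b2, b3 = 11.23, 0.93                       # illustrative beta_2, beta_3
f = lambda x: (b2 + x) ** (-1 / b3)        # the "interpolation formula" basis

x = np.linspace(0.1, 0.9, 50)
h = 1e-6
df = (f(x + h) - f(x - h)) / (2 * h)       # central-difference derivative Df
Lf = df + f(x) / (b3 * (b2 + x))           # apply L = D + 1/{b3 (b2 + x)}
print(np.max(np.abs(Lf)))                  # essentially zero
```

This is just the chain rule: D(β_2 + x)^{−1/β_3} = −(1/β_3)(β_2 + x)^{−1/β_3 − 1}, which the second term of L cancels exactly.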



Let H_1 = W_2^2[0, 1] ⊖ H_0. The Green function is

G(x, s) = ((β_2 + s)/(β_2 + x))^{1/β_3} for s ≤ x, and G(x, s) = 0 for s > x.

Therefore, the RK of H_1 is

R_1(x, z) = ∫_0^1 ((β_2 + s)/(β_2 + x))^{1/β_3} ((β_2 + s)/(β_2 + z))^{1/β_3} ds
          = C (β_2 + x)^{−1/β_3} (β_2 + z)^{−1/β_3},   (8.51)

where C = ∫_0^1 (β_2 + s)^{2/β_3} ds. We fit the L-spline model as follows:

> ip.rk <- function(x,b2,b3) ((b2+x)%o%(b2+x))**(-1/b3)
> super.l <- ssr(y~I((bh[2]+x)**(-1/bh[3]))-1,
    rk=ip.rk(x,bh[2],bh[3]), spar="m")

The function ip.rk computes the RK R_1 in (8.51) with the constant C being ignored since it can be absorbed by the smoothing parameter. The values of β_2 and β_3 are fixed at the estimates from the nonlinear regression model (8.49). Let f_3 be the projection onto the space H_1, which represents the systematic departure from the “interpolation formula”. Figure 8.7(c) shows the estimate of f_3 with 95% Bayesian confidence intervals. The systematic departure from the “interpolation formula” is essentially zero.
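The closed form in (8.51) can be confirmed numerically in Python: since the two factors share the common term (β_2 + s)^{1/β_3}, the integral factorizes into C times the separable product (the parameter values below are illustrative):

```python
import numpy as np

b2, b3 = 11.23, 0.93                     # illustrative beta_2, beta_3
x, z = 0.3, 0.7

# midpoint quadrature over s in [0, 1]
s = (np.arange(10000) + 0.5) / 10000
integrand = ((b2 + s) / (b2 + x)) ** (1 / b3) \
          * ((b2 + s) / (b2 + z)) ** (1 / b3)
lhs = integrand.mean()                   # the integral defining R1(x, z)

C = ((b2 + s) ** (2 / b3)).mean()        # C = int_0^1 (b2 + s)^{2/b3} ds
rhs = C * (b2 + x) ** (-1 / b3) * (b2 + z) ** (-1 / b3)
print(abs(lhs - rhs))                    # essentially zero
```

This separability is exactly why dropping C in ip.rk is harmless: it only rescales the kernel, which the smoothing parameter absorbs.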

8.4.3 Oil-Bearing Rocks

The rock data set contains measurements on four cross sections of each of 12 oil-bearing rocks. The aim is to predict permeability (perm) from three other measurements: the total area (area), total perimeter (peri), and a measure of “roundness” of the pores in the rock cross section (shape). Let y = log(perm), x_1 = area/10000, x_2 = peri/10000, and x_3 = shape. A full multivariate nonparametric model such as an SS ANOVA model is not desirable in this case since there are only 48 observations. Consider the projection pursuit regression model (8.1) with r = 2. Let β_k = (β_{k1}, β_{k2}, β_{k3})^T for k = 1, 2. For identifiability, we use spherical coordinates β_{k1} = sin(α_{k1}) cos(α_{k2}), β_{k2} = sin(α_{k1}) sin(α_{k2}), and β_{k3} = cos(α_{k1}), which satisfy the side condition β_{k1}^2 + β_{k2}^2 + β_{k3}^2 = 1. Then we have the following SNR model:

y = β_0 + f_1(sin(α_{11}) cos(α_{12})x_1 + sin(α_{11}) sin(α_{12})x_2 + cos(α_{11})x_3)
        + f_2(sin(α_{21}) cos(α_{22})x_1 + sin(α_{21}) sin(α_{22})x_2 + cos(α_{21})x_3) + ǫ.   (8.52)



Note that the domains of f_1 and f_2 are not fixed intervals since they depend on unknown parameters. We model f_1 and f_2 using the one-dimensional thin-plate spline space W_2^2(R). To make f_1 and f_2 identifiable with the constant β_0, we remove the constant functions from the model space. That is, we assume that f_1, f_2 ∈ W_2^2(R) ⊖ {1}. Random errors are assumed to be iid. Bounds on the parameters α_{kj} are ignored for simplicity.
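A quick Python check (with made-up angles) of the spherical parameterization: the side condition holds automatically, and the acos/atan recovery used to initialize the angles from the ppr direction vectors returns the original values when α_1 ∈ (0, π) and α_2 ∈ (−π/2, π/2):

```python
import numpy as np

a1, a2 = 1.2, -0.9                      # illustrative angles alpha_k1, alpha_k2
b = np.array([np.sin(a1) * np.cos(a2),  # beta_k1
              np.sin(a1) * np.sin(a2),  # beta_k2
              np.cos(a1)])              # beta_k3
print(np.sum(b**2))                     # side condition: equals 1

# initialization used before the snr fit: recover angles from a direction b
a1_hat = np.arccos(b[2] / np.sqrt(np.sum(b**2)))
a2_hat = np.arctan(b[1] / b[0])
print(a1_hat, a2_hat)                   # 1.2, -0.9 for angles in this range
```

Outside these ranges atan is ambiguous, which is one reason the ppr-based initial values matter for the nonlinear fit.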

> attach(rock)
> y <- log(perm)
> x1 <- area/10000; x2 <- peri/10000; x3 <- shape
> rock.ppr <- ppr(y~x1+x2+x3, nterms=2, max.terms=5)
> b.ppr <- rock.ppr$alpha
> a11ini <- acos(b.ppr[3,1]/sqrt(sum(b.ppr[,1]**2)))
> a12ini <- atan(b.ppr[2,1]/b.ppr[1,1])
> a21ini <- acos(b.ppr[3,2]/sqrt(sum(b.ppr[,2]**2)))
> a22ini <- atan(b.ppr[2,2]/b.ppr[1,2])
> rock.snr <- snr(y~b0
    +f1(sin(a11)*cos(a12)*x1+sin(a11)*sin(a12)*x2
    +cos(a11)*x3)
    +f2(sin(a21)*cos(a22)*x1+sin(a21)*sin(a22)*x2
    +cos(a21)*x3),
    func=list(f1(u)~list(~u-1,tp(u)),
              f2(v)~list(~v-1,tp(v))),
    params=list(b0+a11+a12+a21+a22~1), spar="m",
    start=list(params=c(mean(y),a11ini,a12ini,
                        a21ini,a22ini)),
    control=list(prec.out=1.e-3,maxit.out=50))
> summary(rock.snr)

...
Coefficients:
        Value  Std.Error   t-value p-value
b0   5.341747 0.31181572  17.13110  0.0000
a11  1.574432 0.04597406  34.24610  0.0000
a12 -1.221826 0.01709925 -71.45495  0.0000
a21  0.836215 0.26074376   3.20704  0.0025
a22 -1.010785 0.02919986 -34.61607  0.0000
...
GML estimate(s) of smoothing spline parameter(s):
1.173571e-05 1.297070e-05
Equivalent Degrees of Freedom (DF) for spline function:
8.559332
Residual standard error: 0.7337079



> a <- rock.snr$coef[-1]
> u <- sin(a[1])*cos(a[2])*x1+sin(a[1])*sin(a[2])*x2+cos(a[1])*x3
> v <- sin(a[3])*cos(a[4])*x1+sin(a[3])*sin(a[4])*x2+cos(a[3])*x3
> ugrid <- seq(min(u),max(u),len=50)
> vgrid <- seq(min(v),max(v),len=50)
> rock.snr.ci <- intervals(rock.snr,
    newdata=data.frame(u=ugrid,v=vgrid))

We fitted the projection pursuit regression model using the ppr function in R first (Venables and Ripley 2002) and used the estimates to compute initial values for the spherical coordinates. The estimates and posterior standard deviations of f_1 and f_2 were computed at grid points using the intervals function. Figure 8.8 shows the estimates of f_1 and f_2 with 95% Bayesian confidence intervals. It is interesting to note that the overall shapes of the estimated functions from ppr (Figure 8.9 in Venables and Ripley (2002)) and snr are comparable even though the estimation methods are quite different.

FIGURE 8.8 Rock data, estimates of f_1 (left, versus term 1) and f_2 (right, versus term 2). Shaded regions are 95% Bayesian confidence intervals.

8.4.4 Air Quality

The air quality data set contains daily measurements of the ozone concentration (Ozone) in parts per million and three meteorological variables: wind speed (Wind) in miles per hour, temperature (Temp) in degrees Fahrenheit, and solar radiation (Solar.R) in Langleys. The goal is to investigate how the air pollutant ozone concentration depends on the three meteorological variables. Let y = Ozone^{1/3}, x_1 = Wind, x_2 = Temp, and x_3 = Solar.R. Yu and Ruppert (2002) considered the following partially linear single index model

y = f_1(β_1 x_1 + β_2 x_2 + √(1 − β_1^2 − β_2^2) x_3) + ǫ.   (8.53)

Note that, for identifiability, √(1 − β_1^2 − β_2^2) is used to represent the coefficient for x_3 such that it is positive and the summation of all squared coefficients equals 1. Random errors are assumed to be iid. We model f_1 using the one-dimensional thin-plate spline space W_2^2(R). The single index model (8.53) is an SNR model that can be fitted as follows:

> air <- na.omit(airquality)
> attach(air)
> y <- Ozone^(1/3)
> x1 <- Wind; x2 <- Temp; x3 <- ident(Solar.R)
> air.snr.1 <- snr(y~f1(b1*x1+b2*x2+sqrt(1-b1^2-b2^2)*x3),
    func=f1(u)~list(~u, rk=tp(u)),
    params=list(b1+b2~1), spar="m",
    start=list(params=c(-0.8,.5)),
    control=list(maxit.out=50))

The condition β_1^2 + β_2^2 < 1 is ignored for simplicity. The estimate of f_1 and its 95% Bayesian confidence intervals are shown in Figure 8.9(a). The estimate of f_1 is similar to that in Yu and Ruppert (2002).

Yu and Ruppert (2002) also considered the following partially linear single index model

y = f_2(β_1 x_1 + √(1 − β_1^2) x_2) + β_2 x_3 + ǫ.   (8.54)

The following statement fits model (8.54) with f_2 ∈ W_2^2(R):

> air.snr.2 <- snr(y~f2(b1*x1+sqrt(1-b1^2)*x2)+b2*x3,
    func=f2(u)~list(~u, rk=tp(u)),
    params=list(b1+b2~1), spar="m",
    start=list(params=c(-0.8,1.3)))

The estimates of f_2 and β_2 x_3 are shown in Figure 8.9(b)(c). The effect of radiation may be nonlinear. So, we further fit the following SNR model

y = f_3(β_1 x_1 + √(1 − β_1^2) x_2) + f_4(x_3) + ǫ.   (8.55)

Again, we model f_3 using the space W_2^2(R). Since the domain of x_3 is [0, 1], we model f_4 using the cubic spline space W_2^2[0, 1] ⊖ {1}, where constant functions are removed for identifiability.



FIGURE 8.9 Air quality data, (a) observations (dots), and the estimate of f_1 (solid line) with 95% Bayesian confidence intervals; (b) partial residuals after removing the radiation effect (dots), the estimates of f_2 (dashed line) and f_3 (solid line), and 95% Bayesian confidence intervals (shaded region) for f_3; and (c) partial residuals after removing the index effects based on wind speed and temperature (dots), the estimates of β_2 x_3 in model (8.54) (dashed line) and f_4 (solid line), and 95% Bayesian confidence intervals (shaded region) for f_4.

> air.snr.3 <- snr(y~f3(a1*x1+sqrt(1-a1^2)*x2)+f4(x3),
    func=list(f3(u)~list(~u, rk=tp(u)),
              f4(v)~list(~v-1, rk=cubic(v))),
    params=list(a1~1), spar="m",
    start=list(params=c(-0.8)))

Estimates of f_3 and f_4 are shown in Figure 8.9(b)(c). The estimate of f_4 increases with increasing radiation until a value of about 250 Langleys, after which it is flat. The difference between f_4 and the linear estimate is not significant based on the confidence intervals, perhaps due to the small sample size. To compare the three models (8.53), (8.54), and (8.55), we compute the AIC, BIC, and GCV criteria:

> n <- 111
> rss <- c(sum((y-air.snr.1$fitted)**2),
    sum((y-air.snr.2$fitted)**2),
    sum((y-air.snr.3$fitted)**2))/n
> df <- c(air.snr.1$df$f+air.snr.1$df$para,
    air.snr.2$df$f+air.snr.2$df$para,
    air.snr.3$df$f+air.snr.3$df$para)
> gcv <- rss/(1-df/n)**2
> aic <- n*log(rss)+2*df
> bic <- n*log(rss)+log(n)*df



> print(round(rbind(aic, bic, gcv),4))
aic -156.9928 -176.5754 -177.3056
bic -139.6195 -157.8198 -155.7608
gcv    0.2439    0.2046    0.2035

AIC and GCV select model (8.55), while BIC selects model (8.54).
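The three criteria have simple closed forms given the mean squared residual and the total equivalent degrees of freedom. A toy Python illustration with made-up rss and df values (not the fitted ones), chosen so that AIC and GCV prefer the third model while BIC prefers the second, mirroring the pattern above:

```python
import numpy as np

n = 111
rss = np.array([0.225, 0.175, 0.166])    # made-up mean squared residuals
df = np.array([9.0, 12.0, 14.0])         # made-up equivalent degrees of freedom

gcv = rss / (1 - df / n) ** 2            # generalized cross-validation
aic = n * np.log(rss) + 2 * df           # Akaike information criterion
bic = n * np.log(rss) + np.log(n) * df   # Bayesian information criterion

# BIC's log(n) ~ 4.71 penalty per df exceeds AIC's 2, so it favors
# the smaller model when the rss gain from extra df is modest
print(np.argmin(aic), np.argmin(gcv), np.argmin(bic))
```

BIC penalizes each equivalent degree of freedom by log(n) rather than 2, so such disagreements are expected whenever the larger model buys only a modest reduction in rss.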

8.4.5 The Evolution of the Mira Variable R Hydrae

The star data set contains the magnitude (brightness) of the Mira variable R Hydrae during 1900–1950. Figure 8.10 displays the observations over time. The Mira variable R Hydrae is well known for its declining period and amplitude. We will consider three SNR models in this section to investigate the pattern of the decline.

FIGURE 8.10 Star data, plot of observations (points; magnitude versus time in days), and the fit based on model (8.58) (solid line).

Let y = magnitude and x = time. We first consider the following SNR model (Genton and Hall 2007)

y = a(x) f_1(t(x)) + ǫ,   (8.56)

where y is the magnitude on day x, a(x) = 1 + β_1 x is the amplitude function, f_1 is the common periodic shape function with unit period, t(x) = log(1 + β_3 x/β_2)/β_3 is a time transformation function, and ǫ is a random error. Random errors are assumed to be iid. The function 1/t′(x) can be regarded as the period function. Therefore, model (8.56) assumes that the amplitude and period evolve linearly. Since f_1 is close to a sinusoidal function, we model f_1 using the trigonometric spline with L = D{D^2 + (2π)^2} (m = 2 in (2.67)) and f_1 ∈ W_2^3(per). Model (8.56) can be fitted as follows:

> data(star); attach(star)
> star.fit.1 <- snr(y~(1+b1*x)*f1(log(1+b3*x/b2)/b3),
    func=list(f1(u)~list(~sin(2*pi*u)+cos(2*pi*u),
    lspline(u,type="sine1"))),
    params=list(b1+b2+b3~1), spar="m",
    start=list(params=c(0.0000003694342,419.2645,
    -0.00144125)))
> summary(star.fit.1)

...
Coefficients:
      Value  Std.Error   t-value p-value
b1   0.0000 0.00000035    1.0163  0.3097
b2 419.2485 0.22553623 1858.8966  0.0000
b3  -0.0014 0.00003203  -44.8946  0.0000
...
GML estimate(s) of smoothing spline parameter(s):
0.9999997
Equivalent Degrees of Freedom (DF) for spline function:
3.000037
Residual standard error: 0.925996

> grid <- seq(0,1,len=100)
> star.p.1 <- intervals(star.fit.1,
    newdata=data.frame(u=grid),
    terms=list(f1=matrix(c(1,1,1,1,0,0,0,1),
    ncol=4,byrow=T)))
> co <- star.fit.1$coef
> tx <- log(1+co[3]*x/co[2])/co[3]
> xfold <- tx-floor(tx)
> yfold <- y/(1+co[1]*x)

We computed posterior means and standard deviations of f_1 and of its projection P_1 f_1 onto the subspace W_2^3(per) ⊖ {1, sin 2πx, cos 2πx} using the intervals function. The estimate of f_1 and its 95% Bayesian confidence intervals are shown in Figure 8.11(a). We computed folded observations at day x as ỹ = y/(1 + β_1 x) and x̃ = t(x) − ⌊t(x)⌋, where z − ⌊z⌋ represents the fractional part of z. The folded observations are shown in Figure 8.11(a). The projection P_1 f_1 represents the departure from the sinusoidal model space span{1, sin 2πx, cos 2πx}. Figure 8.11(b) indicates that the function f_1 is not significantly different from a sinusoidal model since P_1 f_1 is not significantly different from zero.
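The folding computation mirrors the xfold/yfold lines of the R session: transform time, take the fractional part to get a phase in [0, 1), and divide out the linear amplitude trend. A Python sketch (the data values are invented; the parameters are near, but not exactly, the reported estimates):

```python
import numpy as np

b1, b2, b3 = 3.7e-7, 419.25, -0.00144   # near the reported b1, b2, b3
x = np.array([0.0, 5000.0, 15000.0])    # days (illustrative)
y = np.array([6.0, 7.5, 5.0])           # magnitudes (illustrative)

t = np.log(1 + b3 * x / b2) / b3        # time transformation t(x)
xfold = t - np.floor(t)                 # fractional part: phase in [0, 1)
yfold = y / (1 + b1 * x)                # remove the linear amplitude trend

print(xfold)                             # all values lie in [0, 1)
```

Note that t(x) is increasing (t′(x) = 1/(β_2 + β_3 x) > 0 over the observed range even though β_3 < 0), so folding preserves the time ordering within each cycle.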

FIGURE 8.11 Star data, (a) folded observations (points), estimate of f1 (solid line), and 95% Bayesian confidence intervals (shaded region); and (b) estimate of P1f1 (solid line) and 95% Bayesian confidence intervals (shaded region).

Assuming that the periodic shape function can be well modeled by a sinusoidal function, we can investigate the evolving amplitude and period nonparametrically. We first consider the following SNR model

y = β1 + exp{f2(x)} sin[2π{β2 + log(1 + β4x/β3)/β4}] + ǫ,  (8.57)

where β1 is a parameter for the mean, exp{f2(x)} is the amplitude function, and β2 is a phase parameter. Note that the exponential transformation is used to enforce the positivity constraint on the amplitude. We model f2 using the cubic spline model space W^2_2[0, b], where b equals the maximum of x. Model (8.57) can be fitted as follows:

> star.fit.2 <- snr(y~b1+
    exp(f2(x))*sin(2*pi*(b2+log(1+b4*x/b3)/b4)),
    func=list(f2(u)~list(~u,cubic2(u))),
    params=list(b1+b2+b3+b4~1), spar="m",
    start=list(params=c(7.1726,.4215,419.2645,-0.00144125),
               f=list(f2=log(diff(range(star.p.1$f1$fit[,1]))))),
    control=list(prec.out=.001,
                 rkpk.control=list(limnla=c(12,14))))
> summary(star.fit.2)
...
Coefficients:
      Value  Std.Error   t-value p-value
b1   7.2004  0.0273582  263.1886       0
b2   0.3769  0.0143172   26.3278       0
b3 416.8002  0.5796450  719.0612       0
b4  -0.0011  0.0000617  -18.4201       0

GML estimate(s) of smoothing spline parameter(s):
 1.000000e+12
Equivalent Degrees of Freedom (DF) for spline function:
 9.829385
Residual standard error: 0.9093264

We used the logarithm of the range of the estimate of f1 in model (8.56) as the initial value for f2. We also limited the search range for log10(nλ), where n is the number of observations. To look at the general pattern of the amplitudes, we compute the amplitude for each period by fitting a sinusoidal model to each period of the folded data containing more than five observations. The logarithms of these estimated amplitudes and the estimates of the amplitude functions based on models (8.56) and (8.57) are shown in Figure 8.12(a). Apparently, the amplitudes are underestimated based on these models.
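Fitting a sinusoid with known unit period, as done here for each period of the folded data, is a linear least-squares problem: writing the signal as c0 + c1 sin(2πx) + c2 cos(2πx), the amplitude is √(c1² + c2²). A minimal Python sketch of this idea (illustrative only, not the book's code; the synthetic data are made up):

```python
import numpy as np

def sinusoid_amplitude(x, y):
    # least-squares fit of c0 + c1*sin(2*pi*x) + c2*cos(2*pi*x);
    # the fitted amplitude is sqrt(c1**2 + c2**2)
    X = np.column_stack([np.ones_like(x),
                         np.sin(2 * np.pi * x),
                         np.cos(2 * np.pi * x)])
    c, *_ = np.linalg.lstsq(X, y, rcond=None)
    return float(np.hypot(c[1], c[2]))

# noiseless synthetic period: mean 7, amplitude 3, arbitrary phase
x = np.linspace(0.0, 1.0, 50, endpoint=False)
y = 7.0 + 3.0 * np.sin(2 * np.pi * x + 0.4)
amp = sinusoid_amplitude(x, y)
print(amp)  # 3.0 up to rounding error
```

The phase does not need to be estimated separately because sin(2πx + φ) expands into the sine and cosine basis used in the regression.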

Finally, we consider the following SNR model

y = β + exp{f3(x)} sin{2πf4(x)} + ǫ,  (8.58)

where both the amplitude and period functions are modeled nonparametrically. We model f3 and f4 using the cubic spline model space W^2_2[0, b]. Model (8.58) can be fitted as follows:

> co <- star.fit.2$coef
> f4ini <- co[2]+log(1+co[4]*x/co[3])/co[4]
> star.fit.3 <- snr(y~b+exp(f3(x))*sin(2*pi*f4(x)),
    func=list(f3(u)+f4(u)~list(~u,cubic2(u))),
    params=list(b~1), spar="m",
    start=list(params=c(7.1726),
               f=list(f3=star.fit.2$funcFitted,f4=f4ini)),
    control=list(prec.out=.001,
                 rkpk.control=list(limnla=c(12,14))))

FIGURE 8.12 Star data, (a) estimated amplitudes based on folded data (circles), estimate of the amplitude function a(x) in model (8.56) rescaled to match the amplitude function in model (8.57) (dashed line), estimate of the amplitude function exp(f2(x)) in model (8.57) (solid line), and 95% Bayesian confidence intervals (shaded region), all on logarithmic scale; (b) estimated amplitudes based on folded data (circles), estimate of the amplitude function exp(f3(x)) in model (8.58) (solid line), and 95% Bayesian confidence intervals (shaded region), all on logarithmic scale; (c) estimate of f4 based on model (8.58) (solid line) and 95% Bayesian confidence intervals (shaded region); and (d) estimated periods based on the CLUSTER method (circles) and estimate of the period function 1/f4′(x).

> n <- length(y)
> rss <- c(sum((y-star.fit.1$fitted)**2),
           sum((y-star.fit.2$fitted)**2),
           sum((y-star.fit.3$fitted)**2))/n
> df <- c(star.fit.1$df$f+star.fit.1$df$para,
          star.fit.2$df$f+star.fit.2$df$para,
          star.fit.3$df$f+star.fit.3$df$para)
> aic <- n*log(rss)+2*df
> bic <- n*log(rss)+log(n)*df
> gcv <- rss/(1-df/n)**2
> print(round(rbind(aic, bic, gcv),4))
aic -164.0240 -202.6215 -1520.0960
bic -134.0823 -133.6093 -1197.6866
gcv    0.8598    0.8299     0.2476

We used the fitted values of the corresponding components from models (8.57) and (8.56) as initial values for f3 and f4. We also computed the AIC, BIC, and GCV criteria for models (8.56), (8.57), and (8.58). Model (8.58) fits the data much better, and the overall fit is shown in Figure 8.10. AIC, BIC, and GCV all select model (8.58). Estimates of the functions f3 and f4 are shown in Figure 8.12(b) and (c). The confidence intervals for f4 are so narrow that they are indistinguishable from the estimate of f4. The amplitude function f3 fits the data much better. To look at the general pattern of the periods, we first identify peaks using the CLUSTER method (Yang, Liu and Wang 2005). Observed periods are estimated as the lengths between peaks, and they are shown in Figure 8.12(d). The estimate of the period function 1/f4′ in Figure 8.12(d) indicates that the evolution of the period may be nonlinear. By allowing a nonlinear period function, model (8.58) leads to a much improved overall fit with a less biased estimate of the amplitude function.
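The selection computation above follows the generic formulas AIC = n log(RSS/n) + 2 df, BIC = n log(RSS/n) + log(n) df, and GCV = (RSS/n)/(1 − df/n)². A Python sketch with made-up rss and df values (the actual values come from the fitted snr objects in the R code above):

```python
import math

# mean residual sums of squares (rss = RSS/n) and total equivalent degrees
# of freedom for three candidate models; all values made up for illustration
n = 600
rss = [0.85, 0.82, 0.24]
df = [6.0, 13.8, 25.0]

aic = [n * math.log(r) + 2 * d for r, d in zip(rss, df)]
bic = [n * math.log(r) + math.log(n) * d for r, d in zip(rss, df)]
gcv = [r / (1 - d / n) ** 2 for r, d in zip(rss, df)]

# each criterion is minimized; here index 2 (the third model) wins
best = {"aic": aic.index(min(aic)),
        "bic": bic.index(min(bic)),
        "gcv": gcv.index(min(gcv))}
print(best)  # {'aic': 2, 'bic': 2, 'gcv': 2}
```

Note that BIC penalizes the extra degrees of freedom more heavily than AIC (log(n) > 2 for n > 7), so agreement of all three criteria, as for model (8.58), is a reassuring sign.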

8.4.6 Circadian Rhythm

Many biochemical, physiological, and behavioral processes of living organisms follow a roughly 24-hour cycle known as the circadian rhythm. We use the hormone data to illustrate how to fit an SIM to investigate circadian rhythms. The hormone data set contains cortisol concentrations measured every 2 hours over a period of 24 hours from multiple subjects. In this section we use observations from normal subjects only. Cortisol concentrations on the log10 scale from nine normal subjects are shown in Figure 8.13.

It is usually assumed that there is a common shape function for all individuals. The time axis may be shifted and the magnitude of variation may differ between subjects; that is, there may be phase and amplitude differences between subjects. Therefore, we consider the following SIM

concij = βi1 + exp(βi2) f(timeij − alogit(βi3)) + ǫij,
i = 1, . . . , m, j = 1, . . . , ni,  (8.59)


FIGURE 8.13 Hormone data, plots of cortisol concentrations (circles) and the fitted curves (solid lines) based on model (8.59). Subjects' IDs are shown in the strips.


where m is the total number of subjects, ni is the number of observations from subject i, concij is the cortisol concentration (on the log10 scale) of the ith subject at the jth time point timeij, βi1 is the 24-hour mean of the ith subject, exp(βi2) is the amplitude of the ith subject, alogit(βi3) = exp(βi3)/{1 + exp(βi3)} is the phase of the ith subject, and ǫij are random errors. Note that the variable time is transformed into the interval [0, 1], the exponential transformation is used to enforce the positivity constraint on the amplitude, and the inverse logistic transformation alogit is used so that the phase is inside the interval [0, 1]. Compared with the SIM model (8.33), there is no scale parameter β4 in model (8.59) since the period is fixed to be 1. The function f is the common shape function. Since it is a periodic function with period 1, we model f using the trigonometric spline with L = D² + (2π)² (m = 2 in (2.70)) and f ∈ W^2_2(per) ⊖ {1}, where constant functions are removed from the model space to make f identifiable with βi1. In order to make βi2 and βi3 identifiable with f, we add the constraints β21 = β31 = 0. Model (8.59) is an SNR model for clustered data. Assuming random errors are iid, model (8.59) can be fitted as follows:

> data(horm.cort)
> nor <- horm.cort[horm.cort$type=="normal",]
> M <- model.matrix(~as.factor(ID), data=nor)
> nor.snr.fit1 <- snr(conc~b1+exp(b2)*f(time-alogit(b3)),
    func=f(u)~list(~sin(2*pi*u)+cos(2*pi*u)-1,
                   lspline(u,type="sine0")),
    params=list(b1~M-1, b2+b3~M[,-1]-1),
    start=list(params=c(mean(nor$conc),rep(0,24))),
    data=nor, spar="m",
    control=list(prec.out=0.001,converg="PRSS"))

Note that the second-stage models for the parameters are specified by the params argument. We removed the first column of the design matrix M to satisfy the side condition β21 = β31 = 0. We used the option converg="PRSS" instead of the default converg="COEF" because this option usually requires fewer iterations. We compute fitted curves for all subjects evaluated at grid points:

> nor.grid <- data.frame(ID=rep(unique(nor$ID),rep(50,9)),
                         time=rep(seq(0,1,len=50),9))
> M <- model.matrix(~as.factor(ID), data=nor.grid)
> nor.snr.p <- predict(nor.snr.fit1,newdata=nor.grid)

Note that the matrix M needs to be generated again for the grid points. The fitted curves for all subjects are shown in Figure 8.13. We further compute the posterior means and standard deviations of f and its projection onto W^2_2(per) ⊖ {1, sin 2πx, cos 2πx}, P1f, as follows:

> grid <- seq(0,1,len=100)
> nor.snr.p.f <- intervals(nor.snr.fit1,
    newdata=data.frame(u=grid),
    terms=list(f=matrix(c(1,1,1,0,0,1),nrow=2,byrow=T)))

The estimates of f and its projection P1f are shown in Figure 8.14. It is obvious that P1f is significantly different from zero. Thus a simple sinusoidal model may not be appropriate for these data.

FIGURE 8.14 Hormone data, (a) estimate of f (solid line) and 95% Bayesian confidence intervals (shaded region); and (b) estimate of P1f (solid line) and 95% Bayesian confidence intervals (shaded region).

Random errors in model (8.59) may be correlated. In the following we fit the model with an AR(1) within-subject correlation structure:

> M <- model.matrix(~as.factor(subject))
> nor.snr.fit2 <- update(nor.snr.fit1,
    cor=corAR1(form=~1|subject))
> summary(nor.snr.fit2)
...
Correlation Structure: AR(1)
 Formula: ~1 | subject
 Parameter estimate(s):
        Phi
 -0.1557776
...


The lag-1 autocorrelation coefficient is small. We will further discuss how to deal with possible correlation within each subject in Section 9.4.5.


Chapter 9

Semiparametric Mixed-Effects Models

9.1 Linear Mixed-Effects Models

Mixed-effects models include both fixed effects and random effects, where random effects are usually introduced to model correlation within a cluster and/or spatial correlation. They provide flexible tools to model both the mean and the covariance structures simultaneously.

The simplest mixed-effects model is perhaps the classical two-way mixed model. Suppose A is a fixed factor with a levels, B is a random factor with b levels, and the design is balanced. The two-way mixed model assumes that

yijk = µ + αi + βj + (αβ)ij + ǫijk,
i = 1, . . . , a; j = 1, . . . , b; k = 1, . . . , m,  (9.1)

where yijk is the kth observation at level i of factor A and level j of factor B, µ is the overall mean, αi and βj are main effects, (αβ)ij is the interaction, and ǫijk are random errors. Since factor B is random, βj and (αβ)ij are random effects. It is usually assumed that βj are iid N(0, σb²), (αβ)ij are iid N(0, σab²), ǫijk are iid N(0, σ²), and that they are mutually independent.

Another simple mixed-effects model is the linear growth curve model. Suppose a response variable is measured repeatedly over a period of time or a sequence of doses from multiple individuals. Assume that responses across time or doses for each individual can be described by a simple straight line, while the intercepts and slopes for different individuals may differ. Then, one may consider the following linear growth curve model

yij = β1 + β2 xij + b1i + b2i xij + ǫij,
i = 1, . . . , m; j = 1, . . . , ni,  (9.2)

where yij is the observation from individual i at time (or dose) xij, β1 and β2 are the population intercept and slope, b1i and b2i are random effects representing individual i's departures in intercept and slope from the population parameters, and ǫij are random errors. Let bi = (b1i, b2i)ᵀ. It is usually assumed that bi are iid N(0, σ²D) for a certain covariance matrix D, and that random effects and random errors are mutually independent.

A general linear mixed-effects (LME) model assumes that

y = Sβ + Zb + ǫ,  (9.3)

where y is an n-vector of observations on the response variable, S and Z are design matrices for fixed and random effects, respectively, β is a q1-vector of unknown fixed effects, b is a q2-vector of unobservable random effects, and ǫ is an n-vector of random errors. We assume that b ∼ N(0, σ²D), ǫ ∼ N(0, σ²Λ), and b and ǫ are independent. Note that y ∼ N(Sβ, σ²(ZDZᵀ + Λ)). Therefore, the mean structure is modeled by the fixed effects, and the covariance structure is modeled by the random effects and random errors.
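The marginal moments implied by model (9.3), E(y) = Sβ and Cov(y) = σ²(ZDZᵀ + Λ), can be checked by simulation. The following Python sketch uses small, arbitrary matrices (all dimensions and values hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)
n, q1, q2 = 6, 2, 3
S = rng.standard_normal((n, q1))     # fixed-effects design
Z = rng.standard_normal((n, q2))     # random-effects design
beta = np.array([1.0, -0.5])
sigma2 = 0.49
D = np.diag([2.0, 1.0, 0.5])         # Cov(b) = sigma2 * D
Lam = np.eye(n)                      # Cov(eps) = sigma2 * Lambda

# moments implied by the model
mean_y = S @ beta
cov_y = sigma2 * (Z @ D @ Z.T + Lam)

# Monte Carlo check: simulate y = S beta + Z b + eps many times
draws = np.empty((5000, n))
for s in range(draws.shape[0]):
    b = rng.multivariate_normal(np.zeros(q2), sigma2 * D)
    eps = rng.multivariate_normal(np.zeros(n), sigma2 * Lam)
    draws[s] = S @ beta + Z @ b + eps
emp_cov = np.cov(draws.T)
print(np.max(np.abs(emp_cov - cov_y)))  # small Monte Carlo error
```

The simulation illustrates the point made above: the fixed effects determine the mean while the random effects and random errors together determine the covariance.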

As discussed in Sections 3.5, 4.7, 5.2.2, and 7.3.3, smoothing spline estimates can be regarded as the BLUP estimates of the corresponding LME and NLME (nonlinear mixed-effects) models. Furthermore, the GML (generalized maximum likelihood) estimates of smoothing parameters are the REML (restricted maximum likelihood) estimates of variance components in the corresponding LME and NLME models. These connections between smoothing spline models and mixed-effects models will be utilized again in this chapter. In particular, details for the connection between semiparametric linear mixed-effects models and LME models will be given in Section 9.2.2.

9.2 Semiparametric Linear Mixed-Effects Models

9.2.1 The Model

A semiparametric linear mixed-effects (SLM) model assumes that

yi = siᵀβ + ∑_{k=1}^{r} Lki fk + ziᵀb + ǫi,  i = 1, . . . , n,  (9.4)

where s and z are independent variables for fixed and random effects, respectively, β is a q1-vector of parameters, b is a q2-vector of random effects, Lki are bounded linear functionals, fk are unknown functions, and ǫi are random errors. For k = 1, . . . , r, denote Xk as the domain of fk and assume that fk ∈ Hk, where Hk is an RKHS on Xk.


Let y = (y1, . . . , yn)ᵀ, S = (s1, . . . , sn)ᵀ, Z = (z1, . . . , zn)ᵀ, f = (f1, . . . , fr), γ(f) = (∑_{k=1}^{r} Lk1 fk, . . . , ∑_{k=1}^{r} Lkn fk)ᵀ, and ǫ = (ǫ1, . . . , ǫn)ᵀ. Then model (9.4) can be written in the vector form

y = Sβ + γ(f) + Zb + ǫ.  (9.5)

We assume that b ∼ N(0, σ²D), ǫ ∼ N(0, σ²Λ), and they are mutually independent. It is clear that model (9.5) is an extension of the LME model with an additional term for nonparametric fixed effects. We note that the random effects are general. Stochastic processes, including those based on smoothing splines, can be used to construct models for random effects. Therefore, in a sense, the random effects may also be modeled nonparametrically. See Section 9.2.4 for an example.

For clustered data, an SLM model assumes that

yij = sijᵀβ + ∑_{k=1}^{r} Lkij fk + zijᵀbi + ǫij,  i = 1, . . . , m; j = 1, . . . , ni,  (9.6)

where yij is the jth observation in cluster i, and bi are random effects for cluster i. Let n = ∑_{i=1}^{m} ni, yi = (yi1, . . . , yini)ᵀ, y = (y1ᵀ, . . . , ymᵀ)ᵀ, Si = (si1, . . . , sini)ᵀ, S = (S1ᵀ, . . . , Smᵀ)ᵀ, Zi = (zi1, . . . , zini)ᵀ, Z = diag(Z1, . . . , Zm), b = (b1ᵀ, . . . , bmᵀ)ᵀ, γi(f) = (∑_{k=1}^{r} Lki1 fk, . . . , ∑_{k=1}^{r} Lkini fk)ᵀ, γ(f) = (γ1ᵀ(f), . . . , γmᵀ(f))ᵀ, ǫi = (ǫi1, . . . , ǫini)ᵀ, and ǫ = (ǫ1ᵀ, . . . , ǫmᵀ)ᵀ. Then model (9.6) can be written in the same vector form as (9.5).

Similar SLM models may be constructed for multiple levels of grouping and other situations.

9.2.2 Estimation and Inference

For simplicity we present the estimation and inference procedures for the SLM model (9.4). The same methods apply to the SLM model (9.6) with a slight modification of notation.

We assume that the covariance matrices D and Λ depend on an unknown vector of covariance parameters τ. Our goal is to estimate β, f, b, τ, and σ². Assume that Hk = Hk0 ⊕ Hk1, where Hk0 = span{φk1, . . . , φkpk} and Hk1 is an RKHS with RK Rk1. Let η(β, f) = Sβ + γ(f) and W⁻¹ = ZDZᵀ + Λ. The marginal distribution of y is y ∼ N(η(β, f), σ²W⁻¹). For fixed τ, we estimate β and f as minimizers of the PWLS (penalized weighted least squares)

(1/n)(y − η(β, f))ᵀW(y − η(β, f)) + λ ∑_{k=1}^{r} θk⁻¹‖Pk1 fk‖²,  (9.7)

where Pk1 is the projection operator onto Hk1 in Hk, and λθk⁻¹ are smoothing parameters. Note that the PWLS (9.7) has the same form as (8.5). Therefore, the results in Section 8.2.2 hold for the SLM models. In particular,

fk = ∑_{ν=1}^{pk} dkν φkν + θk ∑_{i=1}^{n} ci ξki,  k = 1, . . . , r,  (9.8)

where ξki(x) = Lki(z)Rk1(x, z). For k = 1, . . . , r, let dk = (dk1, . . . , dkpk)ᵀ, Tk = {Lki φkν} (i = 1, . . . , n; ν = 1, . . . , pk), and Σk = {Lki Lkj Rk1} (i, j = 1, . . . , n). Let c = (c1, . . . , cn)ᵀ, d = (d1ᵀ, . . . , drᵀ)ᵀ, α = (βᵀ, dᵀ)ᵀ, T = (T1, . . . , Tr), X = (S T), θ = (θ1, . . . , θr), and Σθ = ∑_{k=1}^{r} θk Σk. We have Lki fk = ∑_{ν=1}^{pk} dkν Lki φkν + θk ∑_{j=1}^{n} cj Lki ξkj and fk = (Lk1 fk, . . . , Lkn fk)ᵀ = Tk dk + θk Σk c. Consequently,

γ(f) = ∑_{k=1}^{r} (Tk dk + θk Σk c) = Td + Σθ c

and

η(β, f) = Sβ + Td + Σθ c = Xα + Σθ c.

Furthermore,

∑_{k=1}^{r} θk⁻¹‖Pk1 fk‖² = ∑_{k=1}^{r} θk cᵀ Σk c = cᵀ Σθ c.

Then the PWLS (9.7) reduces to

(y − Xα − Σθ c)ᵀ W (y − Xα − Σθ c) + nλ cᵀ Σθ c.  (9.9)

Differentiating (9.9) with respect to α and c, we have

Xᵀ W X α + Xᵀ W Σθ c = Xᵀ W y,
Σθ W X α + (Σθ W Σθ + nλ Σθ) c = Σθ W y.  (9.10)

We estimate the random effects by (Wang 1998a)

b = D Zᵀ W (y − Xα − Σθ c).  (9.11)

The Henderson (hierarchical) likelihood is often used to justify the estimation of fixed and random effects in an LME model (Robinson 1991, Lee, Nelder and Pawitan 2006). We now show that the above PWLS estimates of β, f, and b can also be derived from the following penalized Henderson (hierarchical) likelihood of y and b,

(y − η(β, f) − Zb)ᵀ Λ⁻¹ (y − η(β, f) − Zb) + bᵀ D⁻¹ b + nλ ∑_{k=1}^{r} θk⁻¹‖Pk1 fk‖²,  (9.12)

where, ignoring some constant terms, the first two components in (9.12) equal twice the negative logarithm of the joint density function of y and b. Again, the solution for f in (9.12) can be represented by (9.8). Then the penalized Henderson (hierarchical) likelihood (9.12) reduces to

(y − Xα − Σθ c − Zb)ᵀ Λ⁻¹ (y − Xα − Σθ c − Zb) + bᵀ D⁻¹ b + nλ cᵀ Σθ c.  (9.13)

Differentiating (9.13) with respect to α, c, and b, we have

Xᵀ Λ⁻¹ X α + Xᵀ Λ⁻¹ Σθ c + Xᵀ Λ⁻¹ Z b = Xᵀ Λ⁻¹ y,
Σθ Λ⁻¹ X α + (Σθ Λ⁻¹ Σθ + nλ Σθ) c + Σθ Λ⁻¹ Z b = Σθ Λ⁻¹ y,
Zᵀ Λ⁻¹ X α + Zᵀ Λ⁻¹ Σθ c + (D⁻¹ + Zᵀ Λ⁻¹ Z) b = Zᵀ Λ⁻¹ y.  (9.14)

It is easy to check that

W = Λ⁻¹ − Λ⁻¹ Z (Zᵀ Λ⁻¹ Z + D⁻¹)⁻¹ Zᵀ Λ⁻¹.  (9.15)

Since I = (Z D Zᵀ + Λ) W = Z D Zᵀ W + Λ W, then

Z D Zᵀ W = I − Λ{Λ⁻¹ − Λ⁻¹ Z (D⁻¹ + Zᵀ Λ⁻¹ Z)⁻¹ Zᵀ Λ⁻¹}
         = Z (D⁻¹ + Zᵀ Λ⁻¹ Z)⁻¹ Zᵀ Λ⁻¹.  (9.16)

From (9.11) and (9.16), we have

Z b = Z (D⁻¹ + Zᵀ Λ⁻¹ Z)⁻¹ Zᵀ Λ⁻¹ (y − Xα − Σθ c).  (9.17)

From (9.15) and (9.17), the first equation in (9.10) is equivalent to

0 = Xᵀ W X α + Xᵀ W Σθ c − Xᵀ W y
  = Xᵀ Λ⁻¹ X α + Xᵀ Λ⁻¹ Σθ c − Xᵀ Λ⁻¹ y
    + Xᵀ Λ⁻¹ Z (D⁻¹ + Zᵀ Λ⁻¹ Z)⁻¹ Zᵀ Λ⁻¹ (y − Xα − Σθ c)
  = Xᵀ Λ⁻¹ X α + Xᵀ Λ⁻¹ Σθ c − Xᵀ Λ⁻¹ y + Xᵀ Λ⁻¹ Z b,

which is the same as the first equation in (9.14). Similarly, the second equation in (9.10) is equivalent to the second equation in (9.14). Equation (9.11) is equivalent to

D⁻¹ b = Zᵀ Λ⁻¹ (y − Xα − Σθ c)
        − Zᵀ Λ⁻¹ Z (D⁻¹ + Zᵀ Λ⁻¹ Z)⁻¹ Zᵀ Λ⁻¹ (y − Xα − Σθ c)
      = Zᵀ Λ⁻¹ (y − Xα − Σθ c) − Zᵀ Λ⁻¹ Z b,

which is the same as the third equation in (9.14). Therefore, the PWLS estimates of β, f, and b based on (9.7) and (9.11) can be regarded as penalized Henderson (hierarchical) likelihood estimates.
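Identity (9.15) is the Woodbury matrix identity applied to W = (ZDZᵀ + Λ)⁻¹, and it can be checked numerically. A quick Python sketch with small random matrices (positive definite D and Λ chosen only for illustration):

```python
import numpy as np

# check W = (Z D Z^T + Lambda)^{-1}
#         = Lambda^{-1} - Lambda^{-1} Z (Z^T Lambda^{-1} Z + D^{-1})^{-1} Z^T Lambda^{-1}
rng = np.random.default_rng(2)
n, q2 = 5, 3
Z = rng.standard_normal((n, q2))
D = np.eye(q2) + 0.1 * np.ones((q2, q2))      # positive definite
Lam = np.diag(rng.uniform(0.5, 1.5, size=n))  # positive definite diagonal

W_direct = np.linalg.inv(Z @ D @ Z.T + Lam)
Lam_inv = np.linalg.inv(Lam)
inner = np.linalg.inv(Z.T @ Lam_inv @ Z + np.linalg.inv(D))
W_woodbury = Lam_inv - Lam_inv @ Z @ inner @ Z.T @ Lam_inv
print(np.allclose(W_direct, W_woodbury))  # True
```

In practice the right-hand side is the useful form: it replaces an n × n inversion by a q2 × q2 one when q2 ≪ n.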

Let Zs = (In, . . . , In), where In denotes an n × n identity matrix. Consider the following LME model

y = Sβ + Td + ∑_{k=1}^{r} uk + Zb + ǫ = Xα + Zs u + Zb + ǫ,  (9.18)

where β and d are fixed effects, α = (βᵀ, dᵀ)ᵀ, uk are random effects with uk ∼ N(0, σ²θkΣk/nλ), u = (u1ᵀ, . . . , urᵀ)ᵀ, b are random effects with b ∼ N(0, σ²D), ǫ are random errors with ǫ ∼ N(0, σ²Λ), and random effects and random errors are mutually independent. Let Ds = diag(θ1Σ1, . . . , θrΣr) and b̃ = (uᵀ, bᵀ)ᵀ. Write

σ⁻² Cov(b̃) = diag((nλ)⁻¹ Ds, D)
            = {I_{nr+q2}} {diag(Ds, D)} {diag((nλ)⁻¹ I_{nr}, I_{q2})}.

Then equation (3.3) in Harville (1976) can be written as

Xᵀ Λ⁻¹ X α + Xᵀ Λ⁻¹ Zs Ds φ1 + Xᵀ Λ⁻¹ Z D φ = Xᵀ Λ⁻¹ y,
Ds Zsᵀ Λ⁻¹ X α + (nλ Ds + Ds Zsᵀ Λ⁻¹ Zs Ds) φ1 + Ds Zsᵀ Λ⁻¹ Z D φ = Ds Zsᵀ Λ⁻¹ y,
D Zᵀ Λ⁻¹ X α + D Zᵀ Λ⁻¹ Zs Ds φ1 + (D + D Zᵀ Λ⁻¹ Z D) φ = D Zᵀ Λ⁻¹ y.  (9.19)

Suppose α, c, and b are solutions to (9.14). Note that Zs Ds Zsᵀ = Σθ. When Σθ is invertible, multiplying Ds Zsᵀ Σθ⁻¹ on both sides of the second equation in (9.14), it is not difficult to see that α, φ1 = Zsᵀ c, and φ = D⁻¹ b are solutions to (9.19). From Theorem 2 in Harville (1976), the linear system (9.19) is consistent, and the BLUP estimate of u is u = Ds φ1 = Ds Zsᵀ c = (θ1 cᵀ Σ1, . . . , θr cᵀ Σr)ᵀ. Therefore, the PWLS estimate of each component, uk = θk Σk c, is a BLUP. When Σθ is not invertible, consider Zs u = ∑_{k=1}^{r} uk instead of each individual uk. Letting b̃ = (uᵀ Zsᵀ, bᵀ)ᵀ and following the same arguments, it can be shown that the overall fit Σθ c is a BLUP. Similarly, the estimate of b is also a BLUP.

We now discuss how to estimate the smoothing parameters λ and θ and the variance–covariance parameters σ² and τ. Let the QR decomposition of X be

X = (Q1 Q2) ( R )
            ( 0 ).

Consider the LME model (9.18) and the orthogonal contrast w1 = Q2ᵀ y. Then w1 ∼ N(0, δ Q2ᵀ M Q2), where δ = σ²/nλ and M = Σθ + nλ W⁻¹. The restricted likelihood based on w1 is given in (3.37) with a different M defined in this section. Following the same arguments as in Section 3.6, we have the GML criterion for λ, θ, and τ,

GML(λ, θ, τ) = w1ᵀ (Q2ᵀ M Q2)⁻¹ w1 / {det(Q2ᵀ M Q2)⁻¹}^{1/(n−q1−p)},  (9.20)

where p = ∑_{k=1}^{r} pk. Similar to (3.41), the REML (GML) estimate of σ² is

σ̂² = nλ̂ w1ᵀ (Q2ᵀ M̂ Q2)⁻¹ w1 / (n − q1 − p),  (9.21)

where M̂ = Σθ̂ + nλ̂ Ŵ⁻¹, and Ŵ⁻¹ is the estimate of W⁻¹ with τ replaced by its estimate τ̂.

For clustered data, the leaving-out-one-cluster approach presented in Section 5.2.2 may also be used to estimate the smoothing and variance–covariance parameters.

The SLM model (9.4) reduces to the semiparametric linear regression model (8.4) when the random effects are combined with the random errors. Therefore, the methods described in Section 8.2.2 can be used to draw inference about β and f. Specifically, posterior means and standard deviations can be calculated using formulae (8.17) and (8.18) with W⁻¹ = ZDZᵀ + Λ fixed at its estimate. Bayesian confidence intervals for the overall functions and their components can then be constructed as in Section 8.2.2. Covariances for random effects can be computed using Theorem 1 in Wang (1998a). The bootstrap method may also be used to construct confidence intervals.

9.2.3 The slm Function

The connections between SLM models and LME models suggest a relatively simple approach to fitting SLM models using existing software for LME models. Specifically, for k = 1, . . . , r, let Σk = Zsk Zskᵀ be the Cholesky decomposition, where Zsk is an n × mk matrix with mk = rank(Σk). Consider the following LME model

y = Sβ + Td + ∑_{k=1}^{r} Zsk bsk + Zb + ǫ,  (9.22)

where β and d are fixed effects, bsk are random effects with bsk ∼ N(0, σ²θk Imk/nλ), b are random effects with b ∼ N(0, σ²D), ǫ are random errors with ǫ ∼ N(0, σ²Λ), and random effects and random errors are mutually independent. Following the same arguments as in Section 9.2.2, it can be shown that the PWLS estimates of the parameters and nonparametric functions in the SLM model (9.5) correspond to the BLUP estimates in the LME model (9.22). Furthermore, the GML estimates of the smoothing and covariance parameters in (9.5) correspond to the REML estimates of the covariance parameters in (9.22). Therefore, the PWLS estimates of β and f, and the GML estimates of λ, θ, and τ in (9.5), can be calculated by fitting the LME model (9.22) with covariance parameters estimated by the REML method. This approach is implemented by the slm function in the assist package, where the LME model is fitted using the lme function in the nlme library.

A typical call to the slm function is

slm(formula, rk, random)

where formula and rk serve the same purposes as those in the ssr function. Combined, they specify the fixed effects. The random argument specifies the random effects in the same way as in lme. An object of slm class is returned. The generic function summary can be applied to extract further information. Predictions at different levels of random effects can be computed using the predict function, where the nonparametric functions are treated as part of the fixed effects. Posterior means and standard deviations for β and f can be computed using the intervals function. Examples can be found in Section 9.4.

9.2.4 SS ANOVA Decomposition

We have shown how to build multiple regression models using SS ANOVA decompositions in Chapter 4. The resulting SS ANOVA models have certain modular structures that parallel the classical ANOVA decompositions. In this section we show how to construct similar SS ANOVA decompositions with modular structures that parallel the classical mixed models. We illustrate how to construct SS ANOVA decompositions through two examples. More examples of SS ANOVA decompositions involving random effects can be found in Wang (1998a), Wang and Wahba (1998), and Section 9.4.3. As a general approach, the SS ANOVA decomposition may be employed to build mixed-effects models for other situations.

It is instructive to see how the classical two-way mixed model (9.1) can be derived via an SS ANOVA decomposition. In general, the factor B is considered random since the levels of the factor are chosen at random from a well-defined population of all factor levels. It is of interest to draw an inference about the general population using information from these observed (chosen) levels. Let X1 = {1, . . . , a} be the domain of factor A, and X2 be the population from which the levels of the random factor B are drawn. Assume the following model

yiwk = f(i, w) + ǫiwk,  i ∈ X1; w ∈ X2; k = 1, . . . , m,  (9.23)

where f(i, w) is a random variable since w is a random sample from X2. f(i, j) for j = 1, . . . , b are realizations of the true mean function defined on X1 × X2. Let P be the sampling distribution on X2. Define averaging operators A1^(1) on X1 and A1^(2) on X2 as

A1^(1) f = (1/a) ∑_{i=1}^{a} f(i, ·),
A1^(2) f = ∫_{X2} f(·, w) dP.

A1^(2) computes the population average with respect to the sampling distribution. Let A2^(1) = I − A1^(1) and A2^(2) = I − A1^(2). An SS ANOVA decomposition can be defined as

f = {A1^(1) + A2^(1)}{A1^(2) + A2^(2)} f
  = A1^(1)A1^(2) f + A2^(1)A1^(2) f + A1^(1)A2^(2) f + A2^(1)A2^(2) f
  ≜ µ + αi + βw + (αβ)iw.

Therefore, the SS ANOVA decomposition leads to the same structure as the classical two-way mixed model (9.1).
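For a finite table of values f(i, j), with the sampling distribution P replaced by the empirical distribution of the observed levels, the four projections above reduce to the familiar ANOVA identities. A Python sketch (the table is randomly generated for illustration):

```python
import numpy as np

# A1^(1) averages over the a levels of A; A1^(2) averages over the
# (sampled) levels of B; the four products give mu, alpha, beta, (alpha beta)
rng = np.random.default_rng(1)
a, b = 4, 5
f = rng.standard_normal((a, b))      # realizations of the mean function

mu = f.mean()
alpha = f.mean(axis=1) - mu                       # main effects of A
beta = f.mean(axis=0) - mu                        # main effects of B
inter = f - mu - alpha[:, None] - beta[None, :]   # interaction

# the decomposition is exact and the components satisfy side conditions
print(np.allclose(mu + alpha[:, None] + beta[None, :] + inter, f))  # True
print(abs(alpha.sum()) < 1e-10, abs(beta.sum()) < 1e-10)            # True True
```

The side conditions (effects summing to zero) are the finite-sample analogues of the orthogonality of the four components in the SS ANOVA decomposition.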

Next we discuss how to derive the SS ANOVA decomposition for re-peated measures data. As in Section 9.1, suppose we have repeatedmeasurements on a response variable over a period of time or a sequenceof doses from multiple individuals. Suppose that individuals are selectedat random from a well-defined population X1. Without loss of generality,denote the time period or dose range as X2 = [0, 1]. The linear growthcurve model (9.2) assumes that responses across time or doses for eachindividual can be well described by a simple straight line, which may betoo restrictive for some applications. Assume the following model

ywj = f(w, xwj) + ǫwj, w ∈ X1; xwj ∈ X2, j = 1, . . . , nw, (9.24)

where f(w, xwj) are random variables since w are random samples fromX1. f(i, xij), i = 1, . . . ,m, j = 1, . . . , ni are realizations of the truemean functions defined on X1 × X2. Suppose we want to model the

Page 307: Smoothing Splines - 221.114.158.246221.114.158.246/~bunken/statistics/others_smoothingspline.pdf · Applications covers basic smoothing spline models, including polynomial, periodic,

282 Smoothing Splines: Methods and Applications

mean function nonparametrically using the cubic spline space $W_2^2[0,1]$ under the construction in Section 2.2. Let P be the sampling distribution on X1. Define averaging operators
\[
A^{(1)}_1 f = \int_{\mathcal{X}_1} f(w,\cdot)\,dP, \qquad
A^{(2)}_1 f = f(\cdot,0), \qquad
A^{(2)}_2 f = f'(\cdot,0)\,x.
\]
Let $A^{(1)}_2 = I - A^{(1)}_1$ and $A^{(2)}_3 = I - A^{(2)}_1 - A^{(2)}_2$. Then
\begin{align*}
f &= \{A^{(1)}_1 + A^{(1)}_2\}\{A^{(2)}_1 + A^{(2)}_2 + A^{(2)}_3\}f \\
  &= A^{(1)}_1 A^{(2)}_1 f + A^{(1)}_1 A^{(2)}_2 f + A^{(1)}_1 A^{(2)}_3 f
   + A^{(1)}_2 A^{(2)}_1 f + A^{(1)}_2 A^{(2)}_2 f + A^{(1)}_2 A^{(2)}_3 f \\
  &= \beta_1 + \beta_2 x + f_2(x) + b_{1w} + b_{2w} x + f_{1,2}(w, x), \tag{9.25}
\end{align*}
where the first three terms are fixed effects representing the population mean function, and the last three terms are random effects representing the departure of individual w from the population mean function. Since both the first and the last three terms are orthogonal components in $W_2^2[0,1]$, both the population mean function and the individual departure are modeled by cubic splines.
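The operators $A^{(2)}_1 f = f(\cdot,0)$ and $A^{(2)}_2 f = f'(\cdot,0)x$ extract the value and slope at zero, so the remainder $I - A^{(2)}_1 - A^{(2)}_2$ vanishes to first order at 0. A small numerical sketch (toy function, not from the book; the derivative is taken by forward difference):

```python
import math

# Sketch of the time-direction operators behind (9.25):
# A1 f = f(0) (constant part), A2 f = f'(0) * x (linear part); the
# remainder f - A1 f - A2 f has value 0 and slope ~0 at x = 0, which is
# the orthogonality built into the construction of Section 2.2.

def f(x):
    return math.exp(x)          # toy curve on [0, 1]

h = 1e-6
f0 = f(0.0)                               # A1 f
slope = (f(h) - f(0.0)) / h               # f'(0) by forward difference

def remainder(x):
    return f(x) - f0 - slope * x

r0 = remainder(0.0)                        # exactly 0
rp0 = (remainder(h) - remainder(0.0)) / h  # ~0: no linear part left
```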

Based on the SS ANOVA decomposition (9.25), we may consider the following model for observed data
\[
y_{ij} = \beta_1 + \beta_2 x_{ij} + f_2(x_{ij}) + b_{1i} + b_{2i} x_{ij} + f_{1,2}(i, x_{ij}) + \epsilon_{ij},
\quad i = 1, \ldots, m;\; j = 1, \ldots, n_i.
\]
Let $b_i = (b_{1i}, b_{2i})^T$ and assume that $b_i \stackrel{iid}{\sim} N(0, \sigma^2 D)$. One possible model for the nonparametric random effects $f_{1,2}$ is to assume that $f_{1,2}(i, x)$ is a stochastic process on [0, 1] with mean zero and covariance function $\sigma_1^2 R_1(x, y)$, where $R_1(x, y)$ is the RK of $W_2^2[0,1] \ominus \{1, x\}$ defined in (2.4). It is obvious that the linear growth curve model (9.2) is a special case of model (9.25) with $f_2 = 0$ and $\sigma_1^2 = 0$.


Semiparametric Mixed-Effects Models 283

9.3 Semiparametric Nonlinear Mixed-Effects Models

9.3.1 The Model

Nonlinear mixed-effects (NLME) models extend LME models by allowing the regression function to depend on fixed and random effects through a nonlinear function. Consider the following NLME model proposed by Lindstrom and Bates (1990) for clustered data:
\begin{align}
y_{ij} &= \psi(\phi_{ij}; x_{ij}) + \epsilon_{ij}, \quad i = 1, \ldots, m;\; j = 1, \ldots, n_i, \nonumber \\
\phi_{ij} &= S_{ij}\beta + Z_{ij}b_i, \quad b_i \stackrel{iid}{\sim} N(0, \sigma^2 D), \tag{9.26}
\end{align}
where m is the number of clusters, $n_i$ is the number of observations from the ith cluster, $y_{ij}$ is the jth observation in cluster i, ψ is a known function of a covariate x, $\phi_{ij}$ is a q-vector of parameters, $\epsilon_{ij}$ are random errors, $S_{ij}$ and $Z_{ij}$ are design matrices for fixed and random effects, respectively, β is a $q_1$-vector of population parameters (fixed effects), and $b_i$ is a $q_2$-vector of random effects for cluster i. Let $\epsilon_i = (\epsilon_{i1}, \ldots, \epsilon_{in_i})^T$. We assume that $\epsilon_i \sim N(0, \sigma^2 \Lambda_i)$, $b_i$ and $\epsilon_i$ are mutually independent, and observations from different clusters are independent.

The first-stage model in (9.26) relates the conditional mean of the response variable to the covariate x and parameters $\phi_{ij}$. The second-stage model relates parameters $\phi_{ij}$ to fixed and random effects. Covariate effects can be incorporated into the second-stage model.

As an extension of the NLME model, Ke and Wang (2001) proposed the following class of semiparametric nonlinear mixed-effects (SNM) models:
\begin{align}
y_{ij} &= N_{ij}(\phi_{ij}, f) + \epsilon_{ij}, \quad i = 1, \ldots, m;\; j = 1, \ldots, n_i, \nonumber \\
\phi_{ij} &= S_{ij}\beta + Z_{ij}b_i, \quad b_i \stackrel{iid}{\sim} N(0, \sigma^2 D), \tag{9.27}
\end{align}
where $\phi_{ij}$ is a q-vector of parameters, β is a $q_1$-vector of fixed effects, $b_i$ is a $q_2$-vector of random effects for cluster i, $f = (f_1, \ldots, f_r)$ are unknown functions, $f_k$ belongs to an RKHS $\mathcal{H}_k$ on an arbitrary domain $\mathcal{X}_k$ for $k = 1, \ldots, r$, and $N_{ij}$ are known nonlinear functionals on $\mathbb{R}^q \times \mathcal{H}_1 \times \cdots \times \mathcal{H}_r$. As in the NLME model, we assume that $\epsilon_i = (\epsilon_{i1}, \ldots, \epsilon_{in_i})^T \sim N(0, \sigma^2 \Lambda_i)$, $b_i$ and $\epsilon_i$ are mutually independent, and observations from different clusters are independent.

It is clear that the SNM model (9.27) is an extension of the SNR model (8.30) with an additional mixed-effects second-stage model. Similar to the SNR model, certain constraints may be required to make an SNM model identifiable, and often these constraints can be achieved by removing certain components from the model spaces for parameters and f. The SLM model is a special case when N is linear in both φ and f.

Let $n = \sum_{i=1}^m n_i$, $y_i = (y_{i1}, \ldots, y_{in_i})^T$, $y = (y_1^T, \ldots, y_m^T)^T$, $\phi_i = (\phi_{i1}^T, \ldots, \phi_{in_i}^T)^T$, $\phi = (\phi_1^T, \ldots, \phi_m^T)^T$, $\eta_i(\phi_i, f) = (N_{i1}(\phi_{i1}, f), \ldots, N_{in_i}(\phi_{in_i}, f))^T$, $\eta(\phi, f) = (\eta_1^T(\phi_1, f), \ldots, \eta_m^T(\phi_m, f))^T$, $\epsilon = (\epsilon_1^T, \ldots, \epsilon_m^T)^T$, $b = (b_1^T, \ldots, b_m^T)^T$, $\Lambda = \mathrm{diag}(\Lambda_1, \ldots, \Lambda_m)$, $S = (S_{11}^T, \ldots, S_{1n_1}^T, \ldots, S_{m1}^T, \ldots, S_{mn_m}^T)^T$, $Z_i = (Z_{i1}^T, \ldots, Z_{in_i}^T)^T$, $Z = \mathrm{diag}(Z_1, \ldots, Z_m)$, and $\mathcal{D} = \mathrm{diag}(D, \ldots, D)$. Then model (9.27) can be written in the vector form
\begin{align}
y &= \eta(\phi, f) + \epsilon, \quad \epsilon \sim N(0, \sigma^2 \Lambda), \nonumber \\
\phi &= S\beta + Zb, \quad b \sim N(0, \sigma^2 \mathcal{D}). \tag{9.28}
\end{align}
Note that model (9.28) is more general than (9.27) in the sense that other SNM models may also be written in this form. We will discuss estimation and inference procedures for the general model (9.28).

9.3.2 Estimation and Inference

Suppose $\mathcal{H}_k = \mathcal{H}_{k0} \oplus \mathcal{H}_{k1}$, where $\mathcal{H}_{k0} = \mathrm{span}\{\phi_{k1}, \ldots, \phi_{kp_k}\}$ and $\mathcal{H}_{k1}$ is an RKHS with RK $R_{k1}$. Assume that D and Λ depend on an unknown parameter vector τ. We need to estimate β, f, τ, σ², and b. The marginal likelihood based on model (9.28) is
\[
L(\beta, f, \tau, \sigma^2) = (2\pi\sigma^2)^{-\frac{mq_2 + n}{2}} |\mathcal{D}|^{-\frac{1}{2}} |\Lambda|^{-\frac{1}{2}}
\int \exp\left\{ -\frac{1}{\sigma^2}\, g(b) \right\} db,
\]
where
\[
g(b) = \frac{1}{2}\left\{ (y - \eta(S\beta + Zb, f))^T \Lambda^{-1} (y - \eta(S\beta + Zb, f)) + b^T \mathcal{D}^{-1} b \right\}.
\]

For fixed τ and σ², we estimate β and f as the minimizers of the following penalized likelihood (PL)
\[
l(\beta, f, \tau, \sigma^2) + \frac{n\lambda}{\sigma^2} \sum_{k=1}^{r} \theta_k^{-1} \|P_{k1} f_k\|^2, \tag{9.29}
\]
where $l(\beta, f, \tau, \sigma^2) = -2\log L(\beta, f, \tau, \sigma^2)$, $P_{k1}$ is the projection operator onto $\mathcal{H}_{k1}$ in $\mathcal{H}_k$, and $\lambda\theta_k^{-1}$ are smoothing parameters.

The integral in the marginal likelihood is usually intractable because η may depend on b nonlinearly. We now derive an approximation to the


log-likelihood using the Laplace method. Let $G = \partial \eta(S\beta + Zb, f)/\partial b^T$ and $\tilde{b}$ be the solution to
\[
\frac{\partial g(b)}{\partial b} = -G^T \Lambda^{-1} (y - \eta(S\beta + Zb, f)) + \mathcal{D}^{-1} b = 0. \tag{9.30}
\]
Approximating the Hessian by $\partial^2 g(b)/\partial b \partial b^T \approx G^T \Lambda^{-1} G + \mathcal{D}^{-1}$ and applying the Laplace method for integral approximation, the function l can be approximated by
\begin{align}
\tilde{l}(\beta, f, \tau, \sigma^2) &= n\log 2\pi\sigma^2 + \log|\Lambda| + \log|I + \tilde{G}^T \Lambda^{-1} \tilde{G} \mathcal{D}| \nonumber \\
&\quad + \frac{1}{\sigma^2} (y - \eta(S\beta + Z\tilde{b}, f))^T \Lambda^{-1} (y - \eta(S\beta + Z\tilde{b}, f))
+ \frac{1}{\sigma^2} \tilde{b}^T \mathcal{D}^{-1} \tilde{b}, \tag{9.31}
\end{align}
where $\tilde{G} = G|_{b=\tilde{b}}$. Replacing l by $\tilde{l}$, ignoring the dependence of $\tilde{G}$ on β and f, and dropping constant terms, the PL (9.29) reduces to
\[
(y - \eta(S\beta + Z\tilde{b}, f))^T \Lambda^{-1} (y - \eta(S\beta + Z\tilde{b}, f))
+ \tilde{b}^T \mathcal{D}^{-1} \tilde{b}
+ n\lambda \sum_{k=1}^{r} \theta_k^{-1} \|P_{k1} f_k\|^2. \tag{9.32}
\]
It is not difficult to see that estimating b by equation (9.30), and β and f by equation (9.32), is equivalent to estimating b, β, and f jointly as minimizers of the following penalized Henderson (hierarchical) likelihood
\[
(y - \eta(S\beta + Zb, f))^T \Lambda^{-1} (y - \eta(S\beta + Zb, f))
+ b^T \mathcal{D}^{-1} b
+ n\lambda \sum_{k=1}^{r} \theta_k^{-1} \|P_{k1} f_k\|^2. \tag{9.33}
\]
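The accuracy of the Laplace step can be seen in one dimension, where $\int \exp\{-g(b)\}\,db$ is replaced by $\sqrt{2\pi/g''(\tilde{b})}\exp\{-g(\tilde{b})\}$ at the mode $\tilde{b}$. A Python sketch with a toy g, not tied to model (9.28):

```python
import math

# One-dimensional sketch of the Laplace approximation:
#   integral of exp(-g(b)) db  ~  sqrt(2*pi / g''(b_mode)) * exp(-g(b_mode)),
# where b_mode minimizes g.  Toy choice: g(b) = b^2/2 + 0.01*b^4, a mild
# convex perturbation of a Gaussian negative log-density.

def g(b):
    return 0.5 * b * b + 0.01 * b ** 4

def g2(b):
    """Second derivative of g."""
    return 1.0 + 0.12 * b * b

b_mode = 0.0   # g'(b) = b + 0.04*b^3 = 0 at b = 0
laplace = math.sqrt(2.0 * math.pi / g2(b_mode)) * math.exp(-g(b_mode))

# Brute-force trapezoidal integral for comparison.
h, lo, n = 0.001, -10.0, 20000
exact = sum(0.5 * h * (math.exp(-g(lo + k * h)) + math.exp(-g(lo + (k + 1) * h)))
            for k in range(n))
rel_err = abs(laplace - exact) / exact
```

For this g the relative error is a few percent; the approximation sharpens as the quadratic term dominates, which is the regime the derivation above relies on.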

Denote the estimates of β and f as $\hat{\beta}$ and $\hat{f}$. We now discuss the estimation of τ and σ² with β, f, and b fixed at $\hat{\beta}$, $\hat{f}$, and $\tilde{b}$. Let $W^{-1} = \Lambda + \tilde{G}\mathcal{D}\tilde{G}^T$, $\tilde{\eta} = \eta(S\beta + Z\tilde{b}, f)$, and $U = (\mathcal{D}^{-1} + \tilde{G}^T \Lambda^{-1} \tilde{G})^{-1}$. Since $W = \Lambda^{-1} - \Lambda^{-1}\tilde{G}U\tilde{G}^T\Lambda^{-1}$ and $\tilde{G}^T\Lambda^{-1}(y - \tilde{\eta}) = \mathcal{D}^{-1}\tilde{b}$, we have
\begin{align*}
&(y - \tilde{\eta} + \tilde{G}\tilde{b})^T W (y - \tilde{\eta} + \tilde{G}\tilde{b}) \\
&= (y - \tilde{\eta} + \tilde{G}\tilde{b})^T \Lambda^{-1} (y - \tilde{\eta} + \tilde{G}\tilde{b})
 - (y - \tilde{\eta} + \tilde{G}\tilde{b})^T \Lambda^{-1}\tilde{G}U\tilde{G}^T\Lambda^{-1}(y - \tilde{\eta} + \tilde{G}\tilde{b}) \\
&= (y - \tilde{\eta} + \tilde{G}\tilde{b})^T \Lambda^{-1} (y - \tilde{\eta} + \tilde{G}\tilde{b})
 - (y - \tilde{\eta} + \tilde{G}\tilde{b})^T \Lambda^{-1}\tilde{G}U(\mathcal{D}^{-1} + \tilde{G}^T\Lambda^{-1}\tilde{G})\tilde{b} \\
&= (y - \tilde{\eta} + \tilde{G}\tilde{b})^T \Lambda^{-1} (y - \tilde{\eta} + \tilde{G}\tilde{b})
 - (y - \tilde{\eta} + \tilde{G}\tilde{b})^T \Lambda^{-1}\tilde{G}\tilde{b} \\
&= (y - \tilde{\eta})^T \Lambda^{-1}(y - \tilde{\eta}) + \tilde{b}^T\tilde{G}^T\Lambda^{-1}(y - \tilde{\eta}) \\
&= (y - \tilde{\eta})^T \Lambda^{-1}(y - \tilde{\eta}) + \tilde{b}^T\mathcal{D}^{-1}\tilde{b}.
\end{align*}


Note that, based on Theorem 18.1.1 in Harville (1997), $|W^{-1}| = |\Lambda + \tilde{G}\mathcal{D}\tilde{G}^T| = |\Lambda|\,|I + \tilde{G}^T\Lambda^{-1}\tilde{G}\mathcal{D}|$. Then we can reexpress $\tilde{l}$ in (9.31) as
\[
\tilde{l}(\beta, f, \tau, \sigma^2) = n\log 2\pi + \log|\sigma^2 W^{-1}| + \frac{1}{\sigma^2} e^T W e, \tag{9.34}
\]
where $e = y - \eta(S\beta + Z\tilde{b}, f) + \tilde{G}\tilde{b}$. Plugging in estimates of β and f in (9.34), profiling with respect to σ², and dropping constant terms, we estimate τ as minimizers of the following approximate negative log-likelihood
\[
\log|W^{-1}| + \log(\tilde{e}^T W \tilde{e}), \tag{9.35}
\]
where $\tilde{e} = y - \eta(S\hat{\beta} + Z\tilde{b}, \hat{f}) + \tilde{G}\tilde{b}$. The resulting estimates of τ are denoted as $\hat{\tau}$. To account for the loss of degrees of freedom for estimating β and f, we estimate σ² by
\[
\hat{\sigma}^2 = \frac{\tilde{e}^T W(\hat{\tau}) \tilde{e}}{n - q_1 - \mathrm{df}(f)}, \tag{9.36}
\]
where $q_1$ is the dimension of β, and df(f) is a properly defined degrees of freedom for estimating f. As in Section 8.3.3, we set $\mathrm{df}(f) = \mathrm{tr}(H^*)$, where $H^*$ is the hat matrix computed at convergence.

Assume priors (8.12) for $f_k$, $k = 1, \ldots, r$. Usually the posterior distribution does not have a closed form. Expanding η at $\tilde{b}$, we consider the following approximate Bayes model
\[
y \approx \eta(S\beta + Z\tilde{b}, f) - \tilde{G}\tilde{b} + \tilde{G}b + \epsilon, \tag{9.37}
\]
where priors for f are given in (8.12), and priors for β are $N(0, \kappa I_{q_1})$. Ignoring the dependence of $\tilde{G}$ on β and f and combining random effects with random errors, model (9.37) reduces to an SNR model. Then methods discussed in Section 8.3.3 can be used to draw inference about β, τ, and f. In particular, approximate Bayesian confidence intervals can be constructed for f. The bootstrap approach may also be used to draw inference about β, τ, and f.

9.3.3 Implementation and the snm Function

Since f may interact with β and b in a complicated way, it is usually impossible to solve (9.33) and (9.35) directly. The following iterative procedure will be used.

Algorithm for SNM models

1. Initialize: Set initial values for β, f , b, and τ .


2. Cycle: Alternate between (a), (b), and (c) until convergence:

(a) Conditional on current estimates of β, b, and τ, update f by solving (9.33).

(b) Conditional on current estimates of f and τ, update β and b by solving (9.33).

(c) Conditional on current estimates of f, β, and b, update τ by solving (9.35).

Note that step (b) corresponds to the pseudo-data step, and step (c) corresponds to part of the LME step in the Lindstrom–Bates algorithm (Lindstrom and Bates 1990). Consequently, steps (b) and (c) can be implemented by the nlme function in the nlme library. We now discuss the implementation of step (a). Note that β, b, and τ are fixed at the current estimates, say, $\beta^-$, $b^-$, and $\tau^-$. To update f at step (a), we need to fit the first-stage model
\[
y = \eta(\phi^-, f) + \epsilon, \quad \epsilon \sim N(0, \sigma^2 \Lambda^-), \tag{9.38}
\]
where $\phi^- = S\beta^- + Zb^-$, and $\Lambda^-$ is the covariance matrix with τ fixed at $\tau^-$. First consider the special case when $N_{ij}$ in (9.27) are linear in f and involve evaluational functionals. Then model (9.27) can be rewritten as
\[
y_{ij} = \alpha(\phi^-; x_{ij}) + \sum_{k=1}^{r} \delta_k(\phi^-; x_{ij})\, f_k(\gamma_k(\phi^-; x_{ij})) + \epsilon_{ij}, \tag{9.39}
\]
which has the same form as the general SEMOR (self-modeling nonlinear regression) model (8.32). Since $\phi^-$ and $\Lambda^-$ are fixed, the procedure described in Section 8.2.2 can be used to update f. In particular, the smoothing parameters may be estimated by the UBR, GCV, or GML method at this step. When η is nonlinear in f, the EGN (extended Gauss–Newton) procedure in Section 8.3.3 can be used to update f.

The snm function in the assist library implements the algorithm when $N_{ij}$ in (9.27) are linear in f and involve evaluational functionals. A typical call to the snm function is

snm(formula, func, fixed, random, start)

where the arguments formula and func serve the same purposes as those in the nnr function and are specified in the same manner. Following the syntax in nlme, the fixed and random arguments specify the fixed and random effects models in the second-stage model. The option start specifies initial values for all parameters in the fixed effects. An object of snm class is returned. The generic function summary can be applied


to extract further information. Predictions at the population level can be computed using the predict function. At convergence, approximate Bayesian confidence intervals can be constructed as in Section 8.3.3. Posterior means and standard deviations for f can be computed using the intervals function. Examples can be found in Section 9.4.

9.4 Examples

9.4.1 Ozone in Arosa — Revisit

We have fitted SS ANOVA models in Section 4.9.2 and trigonometric spline models with heterogeneous and correlated errors in Section 5.4.2 to the Arosa data. An alternative approach is to consider observations as a long time series. Define a time variable as t = (month − 0.5)/12 + year − 1 with domain [0, b], where b = 45.46 is the maximum value of t. Denote $\{(y_i, t_i), i = 1, \ldots, 518\}$ as observations on the response variable thick and time t. Consider the following semiparametric linear regression model
\[
y_i = \beta_1 + \beta_2 \sin(2\pi t_i) + \beta_3 \cos(2\pi t_i) + \beta_4 t_i + f(t_i) + \epsilon_i, \tag{9.40}
\]
where sin(2πt) and cos(2πt) model the seasonal trend, $\beta_4 t + f(t)$ models the long-term trend, and $\epsilon_i$ are random errors. Note that the space for the parametric component, $\mathcal{H}_0 = \mathrm{span}\{1, t, \cos 2\pi t, \sin 2\pi t\}$, is the same as the null space of the linear-periodic spline defined in Section 2.11.4 with τ = 2π. Therefore, we model the nonparametric function f using the model space $W_2^4[0, b] \ominus \mathcal{H}_0$ (see Section 2.11.4 for more details).

the exponential correlation structure with a nugget effect. Specifically,write ǫi = ǫ(ti) and assume that ǫ(t) is a zero-mean stochastic processwith correlation structure

Cov(ǫ(s), ǫ(t)) =

{σ2(1 − c0) exp(−|s− t|/ρ), s 6= t,σ2, s = t,

(9.41)

where ρ is the range parameter and $c_0$ is the nugget effect.

In the following we fit model (9.40) with correlation structure (9.41) and compute the overall fit as well as estimates of the seasonal and long-term trends:

> Arosa$t <- (Arosa$month-0.5)/12+Arosa$year-1
> arosa.ls.fit3 <- ssr(thick~t+sin(2*pi*t)+cos(2*pi*t),
    rk=lspline(2*pi*t,type="linSinCos"), spar="m",
    corr=corExp(form=~t,nugget=T), data=Arosa)

> summary(arosa.ls.fit3)
...
GML estimate(s) of smoothing parameter(s) : 19461.83
Equivalent Degrees of Freedom (DF): 4.529238
Estimate of sigma: 17.32503
Correlation structure of class corExp representing
     range    nugget
 0.3529361 0.6366844
> tm <- matrix(c(1,1,1,1,1,1,0,1,1,0,0,1,0,0,1), ncol=5, byrow=T)
> grid3 <- data.frame(t=seq(0,max(Arosa$t)+0.001,len=500))
> arosa.ls.fit3.p <- predict(arosa.ls.fit3, newdata=grid3, terms=tm)
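The covariance structure (9.41) used in this fit can be sketched numerically. A minimal Python illustration with toy values for σ², c₀, and ρ (not the estimates reported above):

```python
import math

# Sketch of the covariance (9.41): exponential correlation with a nugget
# effect, discontinuous at s = t.  Toy parameter values.

def cov_exp_nugget(s, t, sigma2=1.0, c0=0.3, rho=0.5):
    if s == t:
        return sigma2                                   # variance at a point
    return sigma2 * (1.0 - c0) * math.exp(-abs(s - t) / rho)

ts = [0.0, 0.1, 0.5]
C = [[cov_exp_nugget(s, t) for t in ts] for s in ts]    # covariance matrix
```

The nugget makes the covariance jump from σ²(1 − c₀) to σ² as s → t, which is how measurement error at the design points is absorbed.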

Figure 9.1 shows the overall fit as well as estimates of the seasonal and long-term trends. We can see that the long-term trend is not significantly different from zero.

Sometimes it is desirable to use a stochastic process to model the autocorrelation and regard this process as part of the signal. Then we need to separate this process from other errors and predict it at desired points. Specifically, consider the following SLM model
\[
y_i = \beta_1 + \beta_2 \sin(2\pi t_i) + \beta_3 \cos(2\pi t_i) + \beta_4 t_i + f(t_i) + u(t_i) + \epsilon_i, \tag{9.42}
\]
where u(t) is a stochastic process independent of $\epsilon_i$ with mean zero and $\mathrm{Cov}(u(s), u(t)) = \sigma_1^2 \exp(-|s - t|/\rho)$ with range parameter ρ.

Let $t = (t_1, \ldots, t_{518})^T$ be the vector of design points for the variable t. Let $u = (u(t_1), \ldots, u(t_{518}))^T$ be the vector of the u process evaluated at the design points. Then u are random effects, and $u \sim N(0, \sigma_1^2 D)$, where D is a covariance matrix whose (i, j)th element equals $\exp(-|t_i - t_j|/\rho)$. The SLM model (9.42) cannot be fitted directly using the slm function since D depends on the unknown range parameter ρ nonlinearly. We fit model (9.42) in two steps. We first regard u as part of the random errors and estimate the range parameter. This is accomplished in arosa.ls.fit3. Then we calculate the estimate of D without the nugget effect and regard it as the true covariance matrix. We calculate the Cholesky decomposition of D as $D = ZZ^T$ and transform the random effects $u = Zb$, where $b \sim N(0, \sigma_1^2 I)$. Then we fit the transformed SLM model and compute the overall fit, estimate of the seasonal trend, and estimate of the long-term trend as follows:

> tau <- coef(arosa.ls.fit3$cor.est, F)


FIGURE 9.1 Arosa data, plots of (a) observations and the overall fit, (b) estimate of the seasonal trend, and (c) estimate of the long-term trend with 95% Bayesian confidence intervals (shaded region). Estimates are based on the fit to model (9.40) with correlation structure (9.41).


> D <- corMatrix(Initialize(corExp(tau[1],form=~t), data=Arosa))
> Z <- chol.new(D)
> arosa.ls.fit4 <- slm(thick~t+sin(2*pi*t)+cos(2*pi*t),
    rk=lspline(2*pi*t,type="linSinCos"),
    random=list(pdIdent(~Z-1)), data=Arosa)
> arosa.ls.fit4.p <- intervals(arosa.ls.fit4, newdata=grid3, terms=tm)

where chol.new is a function in the assist library for Cholesky decomposition. Suppose we want to predict u on grid points and denote v as the vector of the u process evaluated at these grid points. Let $R = \mathrm{Cov}(v, u)$. Note that Z is a square invertible matrix since D is invertible. Then $v = RD^{-1}u = RZ^{-T}b$. We compute the prediction as follows:

> newdata <- data.frame(t=c(Arosa$t,grid3$t))
> RD <- corMatrix(Initialize(corExp(tau[1],form=~t), data=newdata))
> R <- RD[(length(Arosa$t)+1):length(newdata$t), 1:length(Arosa$t)]
> b <- as.vector(arosa.ls.fit4$lme.obj$coef$random[[2]])
> u.new <- R%*%t(solve(Z))%*%b
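The identity $v = RD^{-1}u = RZ^{-T}b$ behind u.new can be checked numerically. A self-contained Python sketch with hand-rolled Cholesky and triangular solves (toy design points and coefficients, not the Arosa data):

```python
import math

# Numerical check of v = R D^{-1} u = R Z^{-T} b, where D = Z Z^T is the
# Cholesky factorization and u = Z b.

def cholesky(D):
    """Lower-triangular Z with Z Z^T = D."""
    n = len(D)
    Z = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(Z[i][k] * Z[j][k] for k in range(j))
            Z[i][j] = math.sqrt(D[i][i] - s) if i == j else (D[i][j] - s) / Z[j][j]
    return Z

def forward_solve(L, y):
    """Solve L x = y for lower-triangular L."""
    x = [0.0] * len(y)
    for i in range(len(y)):
        x[i] = (y[i] - sum(L[i][k] * x[k] for k in range(i))) / L[i][i]
    return x

def back_solve(U, y):
    """Solve U x = y for upper-triangular U."""
    n = len(y)
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (y[i] - sum(U[i][k] * x[k] for k in range(i + 1, n))) / U[i][i]
    return x

rho, ts, t_new = 0.5, [0.0, 0.4, 1.0], 0.2
D = [[math.exp(-abs(s - t) / rho) for t in ts] for s in ts]   # exp. correlation
R = [math.exp(-abs(t_new - t) / rho) for t in ts]             # cross-covariance

Z = cholesky(D)
b = [0.3, -0.1, 0.2]                                          # coefficients, u = Z b
u = [sum(Z[i][k] * b[k] for k in range(3)) for i in range(3)]

# Route 1: v = R D^{-1} u, solving Z Z^T x = u by two triangular solves.
Zt = [[Z[j][i] for j in range(3)] for i in range(3)]
x = back_solve(Zt, forward_solve(Z, u))
v1 = sum(R[i] * x[i] for i in range(3))

# Route 2: v = R Z^{-T} b, i.e. the dot product of Z^{-1} R^T with b.
v2 = sum(y * c for y, c in zip(forward_solve(Z, R), b))
```

The two routes agree because $D^{-1}u = Z^{-T}Z^{-1}Zb = Z^{-T}b$; route 2 is what the R call `R%*%t(solve(Z))%*%b` computes.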

Figure 9.2 shows the overall fit, estimate of the seasonal trend, estimate of the long-term trend, and prediction of u(t) (local stochastic trend). The bump during 1940 in Figure 4.10 shows up in the prediction of the local stochastic trend.

9.4.2 Lake Acidity — Revisit

We have shown how to investigate the geological location effect using an SS ANOVA model in Section 5.4.4. We now describe an alternative approach using random effects. We use the same notations defined in Section 5.4.4. Consider the following SLM model:
\[
\mathrm{pH}(x_{i1}, x_{i2}) = f(x_{i1}) + \beta_1 x_{i21} + \beta_2 x_{i22} + u(x_{i2}) + \epsilon(x_{i1}, x_{i2}), \tag{9.43}
\]
where $f \in W_2^2(\mathbb{R})$, $u(x_2)$ is a spatial process, and $\epsilon(x_1, x_2)$ are random errors independent of the spatial process. Model (9.43) separates the contribution of the spatial correlation from random errors and regards the spatial process as part of the signal. Assume that $u(x_2)$ is a zero-mean process independent of $\epsilon(x_1, x_2)$ with an exponential correlation structure $\mathrm{Cov}(u(x_2), u(z_2)) = \sigma_1^2 \exp\{-d(x_2, z_2)/\rho\}$ with range parameter ρ, where $d(x_2, z_2)$ is the Euclidean distance.


FIGURE 9.2 Arosa data, plots of (a) observations and the overall fit, (b) estimate of the seasonal trend, (c) estimate of the long-term trend with 95% Bayesian confidence intervals (shaded region), and (d) prediction of the local stochastic trend. Estimates are based on the fit to model (9.42).


Let u be the vector of the u process evaluated at the design points. Then u are random effects and $u \sim N(0, \sigma_1^2 D)$, where the covariance matrix $D = \{\exp(-d(x_{i2}, x_{j2})/\rho)\}_{i,j=1}^{n}$ depends on the unknown parameter ρ nonlinearly. Therefore, again, we fit model (9.43) in two steps. We first regard u as part of the random errors, estimate the range parameter, and calculate the estimated covariance matrix without the nugget effect:

> temp <- ssr(ph~x1+x21+x22, rk=tp(x1), data=acid,
    corr=corExp(form=~x21+x22, nugget=T), spar="m")
> tau <- coef(temp$cor.est, F)
> D <- corMatrix(Initialize(corExp(tau[1],form=~x21+x22), data=acid))

Consider the estimated D as the true covariance matrix. Then we calculate the Cholesky decomposition of D as $D = ZZ^T$ and transform the random effects $u = Zb$, where $b \sim N(0, \sigma_1^2 I)$. Now we are ready to fit the transformed SLM model:

> Z <- chol.new(D)

> acid.slm.fit <- slm(ph~x1+x21+x22, rk=tp(x1), data=acid,

random=list(pdIdent(~Z-1)))

We then calculate the estimated effect of calcium:

> grid1 <- data.frame(

x1=seq(min(acid$x1),max(acid$x1),len=100),

x21=min(acid$x21), x22=min(acid$x22))

> acid.slm.x1.p <- intervals(acid.slm.fit, newdata=grid1,

terms=c(0,1,0,0,1))

Let v be the vector of the u process evaluated at the grid points at which we wish to predict the location effect. Let $R = \mathrm{Cov}(v, u)$. Then $v = RD^{-1}u = RZ^{-T}b$.

> grid2 <- expand.grid(
    x21=seq(min(acid$x21)-.001,max(acid$x21)+.001,len=20),
    x22=seq(min(acid$x22)-.001,max(acid$x22)+.001,len=20))
> newdata <- data.frame(z1=c(acid$x21,grid2$x21),
    z2=c(acid$x22,grid2$x22))
> RD <- corMatrix(Initialize(corExp(tau[1], form=~z1+z2), data=newdata))
> R <- RD[(length(acid$x21)+1):length(newdata$z1), 1:length(acid$x21)]
> b <- as.vector(acid.slm.fit$lme.obj$coef$random[[2]])
> u.new <- R%*%t(solve(Z))%*%b
> acid.slm.x2.p <- u.new+
    acid.slm.fit$lme.obj$coef$fixed[3]*grid2$x21+
    acid.slm.fit$lme.obj$coef$fixed[4]*grid2$x22

Figure 9.3 plots the estimated calcium (x1) effect and prediction of the location (x2) effect.


pH

FIGURE 9.3 Lake acidity data, the left panel includes observationsand the estimate of the calcium effect f (the constant plus main effectof x1), and the right panel includes prediction of the location effect

β1x21 + β22x22 + u(x2).

9.4.3 Coronary Sinus Potassium in Dogs

The dog data set contains measurements of coronary sinus potassium concentration from dogs in four groups: control, extrinsic cardiac denervation three weeks prior to coronary occlusion, extrinsic cardiac denervation immediately prior to coronary occlusion, and bilateral thoracic sympathectomy and stellectomy three weeks prior to coronary occlusion. We are interested in (i) estimating the group (treatment) effects, (ii) estimating the group mean concentration as a function of time, and (iii) predicting the response over time for each dog. There are two categorical covariates, group and dog, and a continuous covariate, time. We code the group factor as 1 to 4, and the observed dog factor as 1 to 36. Coronary sinus potassium concentrations for all dogs are shown in Figure 9.4.

Let t be the time variable transformed into [0, 1]. We treat group and time as fixed factors. From the design, the dog factor is nested within the group factor. We treat dog as a random factor. For group k, denote


FIGURE 9.4 Dog data, coronary sinus potassium concentrations over time for each dog. Solid thick lines link within-group average concentrations at each time point.

$\mathcal{B}_k$ as the population from which the dogs in group k were drawn, and $P_k$ as the sampling distribution. Assume the following model
\[
y_{kwj} = f(k, w, t_j) + \epsilon_{kwj}, \quad k = 1, \ldots, 4;\; w \in \mathcal{B}_k;\; t_j \in [0, 1], \tag{9.44}
\]
where $y_{kwj}$ is the observed potassium concentration at time $t_j$ from dog w in the population $\mathcal{B}_k$, $f(k, w, t_j)$ is the true concentration at time $t_j$ of dog w in the population $\mathcal{B}_k$, and $\epsilon_{kwj}$ are random errors. The function f(k, w, t) is defined on $\{\{1\} \otimes \mathcal{B}_1, \{2\} \otimes \mathcal{B}_2, \{3\} \otimes \mathcal{B}_3, \{4\} \otimes \mathcal{B}_4\} \otimes [0, 1]$. Note that f(k, w, t) is a random variable since w is a random sample from $\mathcal{B}_k$. What we observe are realizations of this true mean function plus random errors. We use the label i to denote dogs we actually observe.

Suppose we want to model the time effect using the cubic spline model space $W_2^2[0, 1]$ under the construction in Section 2.6, and the group effect


using the classical one-way ANOVA model space $\mathbb{R}^4$ under the construction in Section 4.3.1. Define the following four averaging operators:
\begin{align*}
A_2 f &= \int_{\mathcal{B}_k} f(k, w, t)\,dP_k(w), &
A_1 f &= \frac{1}{4}\sum_{k=1}^{4} A_2 f(k, t), \\
A_3 f &= \int_0^1 f(k, w, t)\,dt, &
A_4 f &= \left(\int_0^1 \frac{\partial f(k, w, t)}{\partial t}\,dt\right)(t - 0.5).
\end{align*}

Then we have the following SS ANOVA decomposition:
\begin{align}
f &= \{A_1 + (A_2 - A_1) + (I - A_2)\}\{A_3 + A_4 + (I - A_3 - A_4)\}f \nonumber \\
  &= A_1 A_3 f + A_1 A_4 f + A_1(I - A_3 - A_4)f \nonumber \\
  &\quad + (A_2 - A_1)A_3 f + (A_2 - A_1)A_4 f + (A_2 - A_1)(I - A_3 - A_4)f \nonumber \\
  &\quad + (I - A_2)A_3 f + (I - A_2)A_4 f + (I - A_2)(I - A_3 - A_4)f \nonumber \\
  &= \mu + \beta(t - 0.5) + s_1(t) + \xi_k + \delta_k(t - 0.5) + s_2(k, t) \nonumber \\
  &\quad + \alpha_{w(k)} + \gamma_{w(k)}(t - 0.5) + s_3(k, w, t), \tag{9.45}
\end{align}

where μ is a constant, β(t − 0.5) is the linear main effect of time, $s_1(t)$ is the smooth main effect of time, $\xi_k$ is the main effect of group, $\delta_k(t - 0.5)$ is the smooth-linear interaction between time and group, $s_2(k, t)$ is the smooth-smooth interaction between time and group, $\alpha_{w(k)}$ is the main effect of dog, $\gamma_{w(k)}(t - 0.5)$ is the smooth-linear interaction between time and dog, and $s_3(k, w, t)$ is the smooth-smooth interaction between time and dog. The overall main effect of time equals $\beta(t - 0.5) + s_1(t)$, the overall interaction between time and group equals $\delta_k(t - 0.5) + s_2(k, t)$, and the overall interaction between time and dog equals $\gamma_{w(k)}(t - 0.5) + s_3(k, w, t)$. The first six terms are fixed effects. The last three terms are random effects since they depend on the random variable w. Depending on time only, the first three terms represent the mean curve for all dogs. The middle three terms measure the departure of the mean curve for a particular group from the mean curve for all dogs. The last three terms measure the departure of a particular dog from the mean curve of the population from which the dog was chosen.
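The time-direction operators can be checked numerically: $A_3 f = \int_0^1 f\,dt$ and $A_4 f = (f(1) - f(0))(t - 0.5)$, so the remaining smooth component $s = f - A_3 f - A_4 f$ integrates to zero and takes equal values at t = 0 and t = 1. A Python sketch with a toy curve (not the dog data):

```python
# Check of the time-direction operators in (9.45):
#   A3 f = \int_0^1 f dt  and  A4 f = (\int_0^1 f' dt)(t - 0.5)
#                              = (f(1) - f(0))(t - 0.5).
# The smooth remainder s = f - A3 f - A4 f then satisfies the side
# conditions \int_0^1 s dt = 0 and s(0) = s(1).

def f(t):
    return t * t                    # toy curve

def trapz(fun, n=2000):
    """Composite trapezoidal rule on [0, 1]."""
    h = 1.0 / n
    return sum(0.5 * h * (fun(k * h) + fun((k + 1) * h)) for k in range(n))

A3 = trapz(f)                       # constant component, ~1/3
A4_slope = f(1.0) - f(0.0)          # coefficient of (t - 0.5), = 1

def smooth(t):
    return f(t) - A3 - A4_slope * (t - 0.5)

side1 = trapz(smooth)                     # ~0: zero average
side2 = smooth(1.0) - smooth(0.0)         # ~0: equal endpoint values
```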

Based on the SS ANOVA decomposition (9.45), we will fit the following three models:

• Model 1 includes the first seven terms in (9.45). It has a different population mean curve for each group plus a random intercept for


each dog. We assume that αiiid∼ N(0, σ2

1), ǫkijiid∼ N(0, σ2), and the

random effects and random errors are mutually independent.

• Model 2 includes the first eight terms in (9.45). It has a different population mean curve for each group plus a random intercept and a random slope for each dog. We assume that (αi, γi) iid∼ N(0, σ^2 D1), where D1 is an unstructured covariance matrix, ǫkij iid∼ N(0, σ^2), and the random effects and random errors are mutually independent.

• Model 3 includes all nine terms in (9.45). It has a different population mean curve for each group plus a random intercept, a random slope, and a smooth random effect for each dog. We assume that (αi, γi) iid∼ N(0, σ^2 D1), s3(k, i, t) are stochastic processes that are independent between dogs with mean zero and covariance function σ2^2 R1(s, t), where R1 is the cubic spline RK given in Table 2.2, ǫkij iid∼ N(0, σ^2), and the random effects and random errors are mutually independent.

Model 1 and Model 2 can be fitted as follows:

> data(dog)

> dog.fit1 <- slm(y~time,

rk=list(cubic(time), shrink1(group),

rk.prod(kron(time-.5),shrink1(group)),

rk.prod(cubic(time),shrink1(group))),

random=list(dog=~1), data=dog)

> dog.fit1

Semi-parametric linear mixed-effects model fit by REML

Model: y ~ time

Data: dog

Log-restricted-likelihood: -180.4784

Fixed: y ~ time

(Intercept) time

3.8716210 0.4339335

Random effects:

Formula: ~1 | dog

(Intercept) Residual

StdDev: 0.4980355 0.3924478

GML estimate(s) of smoothing parameter(s) : 0.0002338082

0.0034079441 0.0038490075 0.0002048518


Equivalent Degrees of Freedom (DF): 13.00259

Estimate of sigma: 0.3924478

Number of Observations: 252

> dog.fit2 <- update(dog.fit1, random=list(dog=~time))

> dog.fit2

Semi-parametric linear mixed-effects model fit by REML

Model: y ~ time

Data: dog

Log-restricted-likelihood: -166.4478

Fixed: y ~ time

(Intercept) time

3.8767107 0.4196788

Random effects:

Formula: ~time | dog

Structure: General positive-definite,

Log-Cholesky parametrization

StdDev Corr

(Intercept) 0.4188186 (Intr)

time 0.5592910 0.025

Residual 0.3403256

GML estimate(s) of smoothing parameter(s) : 1.674916e-04

3.286885e-03 5.781563e-03 8.944897e-05

Equivalent Degrees of Freedom (DF): 13.83309

Estimate of sigma: 0.3403256

Number of Observations: 252

To fit Model 3, we need to find a way to specify the smooth (nonparametric) random effect s3. Let t = (t1, . . . , t7)^T be the time points at which measurements were taken for each dog (note that time points are the same for all dogs), ui = (s3(k, i, t1), . . . , s3(k, i, t7))^T, and u = (u1^T, . . . , u36^T)^T. Then ui iid∼ N(0, σ2^2 D2), where D2 is the RK of a cubic spline evaluated at the design points t. Let D2 = HH^T be the Cholesky decomposition of D2, D = diag(D2, . . . , D2), and Z = diag(H, . . . , H). Then ZZ^T = D. We can write u = Zb2, where b2 ∼ N(0, σ2^2 In) and n = 252. Then we can specify the random effects u using the matrix Z.

> D2 <- cubic(dog$time[1:7])

> H <- chol.new(D2)


> Z <- kronecker(diag(36), H)

> dog$all <- rep(1,36*7)

> dog.fit3 <- update(dog.fit2,

random=list(all=pdIdent(~Z-1),dog=~time))

> summary(dog.fit3)

Semi-parametric Linear Mixed Effects Model fit

Model: y ~ time

Data: dog

Linear mixed-effects model fit by REML

Data: dog

AIC BIC logLik

322.1771 360.9131 -150.0885

Random effects:

Formula: ~Z - 1 | all

Structure: Multiple of an Identity

Z1 Z2 Z3 Z4 Z5 Z6

StdDev: 3.90843 3.90843 3.90843 3.90843 3.90843 3.90843

...

Formula: ~time | dog %in% all

Structure: General positive-definite, Log-Cholesky

parametrization

StdDev Corr

(Intercept) 0.4671544 (Intr)

time 0.5716972 -0.083

Residual 0.2383448

Fixed effects: y ~ time

Value Std.Error DF t-value p-value

(Intercept) 3.885270 0.08408051 215 46.20892 0e+00

time 0.404652 0.10952837 215 3.69449 3e-04

Correlation:

(Intr)

time -0.224

...

Smoothing spline:

GML estimate(s) of smoothing parameter(s): 8.775858e-05

1.561206e-03 3.351111e-03 2.870198e-05

Equivalent Degrees of Freedom (DF): 15.37740
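The Cholesky-based construction behind this fit (ZZ^T = D, so u = Zb2 has the required covariance) can be sketched numerically. Below is an illustrative Python/NumPy sketch, not the book's R code; the "min" kernel is a hypothetical stand-in for the cubic spline RK of Table 2.2, and m = 4 subjects stand in for the 36 dogs.

```python
import numpy as np

# Stand-in for the cubic spline RK of Table 2.2: any symmetric
# positive-definite Gram matrix supports the same construction.
t = np.linspace(0.1, 1.0, 7)
D2 = np.minimum.outer(t, t)        # Brownian-motion ("min") kernel, PD here

H = np.linalg.cholesky(D2)         # D2 = H H^T
m = 4                              # number of subjects (36 dogs in the text)
Z = np.kron(np.eye(m), H)          # block-diagonal factor Z = diag(H, ..., H)
D = np.kron(np.eye(m), D2)         # block-diagonal covariance D = diag(D2, ...)

# Z Z^T = D, so u = Z b2 with b2 ~ N(0, sigma^2 I) has Cov(u) = sigma^2 D
print(np.allclose(Z @ Z.T, D))
```

Any factor Z with ZZ^T = D works here; Cholesky is simply a convenient choice when D2 is positive definite.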

The dimension of b2 associated with the smooth random effects s3 equals the total number of observations in the above construction. Therefore, the computation and/or memory required for larger sample sizes can be prohibitive. One approach to stabilize and speed up the computation is to use a low rank approximation (Wood 2003). For generality, suppose time points for different dogs may be different. Let tij for j = 1, . . . , ni be the time points for dog i, ui = (s3(k, i, ti1), . . . , s3(k, i, tini))^T, and u = (u1^T, . . . , um^T)^T, where m is the number of dogs. Then ui ∼ N(0, σ2^2 Πi), where Πi is the RK of a cubic spline evaluated at the design points ti = (ti1, . . . , tini). Let Πi = Ui Γi Ui^T be the eigendecomposition, where Ui is an ni × ni orthogonal matrix, Γi = diag(γi1, . . . , γini), and γi1 ≥ γi2 ≥ . . . ≥ γini are the eigenvalues. Usually some of the eigenvalues are much smaller than others. Discard the ni − ki smallest eigenvalues and let Hi1 = Ui Γi1, where Γi1 is an ni × ki matrix with diagonal elements equal to √γi1, . . . , √γiki and all other elements equal to zero. Then Πi ≈ Hi1 Hi1^T and D = diag(Π1, . . . , Πm) ≈ Z1 Z1^T, where Z1 = diag(H11, . . . , Hm1). We can approximate Model 3 using u1 = Z1 b2, where b2 ∼ N(0, σ2^2 IK) and K = Σ_{i=1}^m ki. The dimension K can be much smaller than n. Low rank approximations for other situations can be constructed similarly. The above approximation procedure is applied to the dog data as follows:

> chol.new1 <- function(Q, cutoff) {

tmp <- eigen(Q)

num <- sum(tmp$values<cutoff)

k <- ncol(Q)-num

t(t(as.matrix(tmp$vector[,1:k]))*sqrt(tmp$values[1:k]))

}

> H1 <- chol.new1(D2, 1e-3)

> Z1 <- kronecker(diag(36), H1)

> dog$all1 <- rep(1,nrow(Z1))

> dog.fit3.1 <- update(dog.fit2,

random=list(all1=pdIdent(~Z1-1), dog=~time))

> summary(dog.fit3.1)

Semi-parametric Linear Mixed Effects Model fit

Model: y ~ time

Data: dog

Linear mixed-effects model fit by REML

Data: dog

AIC BIC logLik

323.4819 362.2180 -150.7409

Random effects:


Formula: ~Z1 - 1 | all1

Structure: Multiple of an Identity

Z11 Z12 Z13 Z14 Z15 Z16

StdDev: 3.65012 3.65012 3.65012 3.65012 3.65012 3.65012

...

Formula: ~time | dog %in% all1

Structure: General positive-definite, Log-Cholesky

parametrization

StdDev Corr

(Intercept) 0.4635580 (Intr)

time 0.5690967 -0.07

Residual 0.2527408

Fixed effects: y ~ time

Value Std.Error DF t-value p-value

(Intercept) 3.883828 0.08418382 215 46.13509 0e+00

time 0.407229 0.11063259 215 3.68092 3e-04

Correlation:

(Intr)

time -0.228

...

Smoothing spline:

GML estimate(s) of smoothing parameter(s): 9.964788e-05

1.764225e-03 3.716363e-03 3.200021e-05

Equivalent Degrees of Freedom (DF): 15.39633

where eigenvalues smaller than 10^−3 are discarded and ki = 2. The R function chol.new1 computes the truncated Cholesky decomposition. The fitting criteria and parameter estimates are similar to those from the full model. Another fit based on low rank approximation, with the cutoff value 10^−3 replaced by 10^−4, produces almost identical results to those from the full model.
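The truncated eigendecomposition underlying chol.new1 can also be sketched outside R. A minimal Python/NumPy analogue follows; the squared-exponential kernel is a hypothetical stand-in for the cubic spline RK, chosen because its eigenvalues decay quickly, so several fall below the cutoff.

```python
import numpy as np

def chol_trunc(Q, cutoff):
    """Return H1 with Q ~ H1 H1^T, discarding eigenvalues below cutoff
    (a NumPy analogue of the chol.new1 R function)."""
    w, V = np.linalg.eigh(Q)              # eigenvalues in ascending order
    keep = w > cutoff
    return V[:, keep] * np.sqrt(w[keep])  # scale kept eigenvectors

t = np.linspace(0, 1, 7)
# Hypothetical stand-in kernel with fast eigenvalue decay
Q = np.exp(-(t[:, None] - t[None, :])**2 / 0.5)
H1 = chol_trunc(Q, 1e-3)

# H1 has fewer columns than the full rank, yet reproduces Q closely:
# the elementwise error is bounded by the largest discarded eigenvalue.
print(H1.shape[1], np.max(np.abs(Q - H1 @ H1.T)))
```

The number of retained columns plays the role of ki in the text, and stacking such factors block-diagonally gives the reduced design Z1.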

As discussed in Section 9.2.2, Models 1, 2, and 3 are connected with three LME models, and these connections are used to fit the SLM models. We can compare the three corresponding LME models as follows:

> anova(dog.fit1$lme.obj, dog.fit2$lme.obj,

dog.fit3$lme.obj)

Model df AIC BIC logLik

dog.fit1$lme.obj 1 8 376.9568 405.1285 -180.4784

dog.fit2$lme.obj 2 10 352.8955 388.1101 -166.4478

dog.fit3$lme.obj 3 11 322.1771 360.9131 -150.0885


Test L.Ratio p-value

dog.fit1$lme.obj

dog.fit2$lme.obj 1 vs 2 28.06128 <.0001

dog.fit3$lme.obj 2 vs 3 32.71845 <.0001

Even though they do not compare the three SLM models directly, these comparison results seem to indicate that Model 3 is more favorable. More research on model selection and inference for SLM and SNM models is necessary.

We can calculate estimates of the population mean curves for the four groups based on Model 3 as follows:

> dog.grid <- data.frame(time=rep(seq(0,1,len=50),4),

group=as.factor(rep(1:4,rep(50,4))))

> e.dog.fit3 <- intervals(dog.fit3, newdata=dog.grid,

terms=rep(1,6))

Figure 9.5 shows the estimated mean curves and 95% Bayesian confidence intervals based on the fit dog.fit3.


FIGURE 9.5 Dog data, estimates of the group mean response curves (solid lines) with 95% Bayesian confidence intervals (dotted lines) based on the fit dog.fit3. Squares are within-group average concentrations.

We have shrunk the group mean curves toward the overall mean curve. That is, we have penalized the group main effect ξk and the smooth-linear group-time interaction δk(t − 0.5) in the SS ANOVA decomposition (9.45). From Figure 9.5 we can see that the estimated population mean curve for group 2 is biased upward, while the estimated population mean curve for group 1 is biased downward. This is because responses from group 2 are smaller, while responses from group 1 are larger, than those from groups 3 and 4. Thus, their estimates are pulled toward the overall mean. Shrinkage estimates in this case may not be advantageous since group has only four levels. One may want to leave the ξk and δk(t − 0.5) terms unpenalized to reduce biases. We can rewrite the fixed effects in (9.45) as

fk(t) ≜ µ + β(t − 0.5) + s1(t) + ξk + δk(t − 0.5) + s2(k, t)
      = {µ + ξk} + {β(t − 0.5) + δk(t − 0.5)} + {s1(t) + s2(k, t)}
      ≜ ξ̃k + δ̃k(t − 0.5) + s̃2(k, t),                      (9.46)

where fk(t) is the mean curve for group k. Assume that fk ∈ W2^2[0, 1]. Define the penalty as ∫0^1 (fk''(t))^2 dt = ||s̃2(k, t)||^2. Then the constant term ξ̃k and the linear term δ̃k(t − 0.5) are not penalized. We can refit Model 1, Model 2, and Model 3 under this new form of penalty as follows:

> dog.fit4 <- slm(y~group*time,

rk=list(rk.prod(cubic(time),kron(group==1)),

rk.prod(cubic(time),kron(group==2)),

rk.prod(cubic(time),kron(group==3)),

rk.prod(cubic(time),kron(group==4))),

random=list(dog=~1), data=dog)

> dog.fit5 <- update(dog.fit4, random=list(dog=~time))

> dog.fit6 <- update(dog.fit5,

random=list(all=pdIdent(~Z-1),dog=~time))

> e.dog.fit6 <- intervals(dog.fit6, newdata=dog.grid,

terms=rep(1,12))

Figure 9.6 shows the estimated mean curves and 95% Bayesian confidence intervals based on the fit dog.fit6. The estimated mean functions are less biased.

We now show how to calculate predictions for all dogs based on the fit dog.fit6. Predictions based on other models may be derived similarly. For a particular dog i in group k, its prediction at time z can be computed as

ξ̂k + δ̂k(z − 0.5) + ŝ2(k, z) + α̂i + γ̂i(z − 0.5) + ŝ3(k, i, z).

Prediction of the fixed effects can be computed using the prediction function. Predictions of the random effects αi and γi can be extracted from the fit. Therefore, we only need to compute ŝ3(k, i, z). Suppose we want to predict s3 for dog i in group k on a vector of points zi = (zi1, . . . , zigi)^T. Let vi = (s3(k, i, zi1), . . . , s3(k, i, zigi))^T and Ci = Cov(vi, ui) = {R1(zik, tj)}, k = 1, . . . , gi, j = 1, . . . , 7, for i = 1, . . . , 36, where R1(z, t) is the cubic spline RK given in Table 2.2. Let v = (v1^T, . . . , v36^T)^T, R = diag(C1, . . . , C36), and û be the prediction of u. We then can compute the prediction for all dogs as v̂ = R D^−1 û. The smallest eigenvalue of D is close to zero, and thus D^−1 cannot be calculated precisely. We will use an alternative approach that does not require inverting D. Note that û = Z b̂2, where b̂2 denotes the estimate of b2. If we can find a vector r (which need not be unique) such that

Z^T r = b̂2,                                                (9.47)

then

v̂ = R D^−1 û = R D^−1 Z b̂2 = R D^−1 Z Z^T r = R r.

So the task now is to solve (9.47). Let

Z = (Q1 Q2) ( V )
            ( 0 )

be the QR decomposition of Z. We consider r in the space spanned by Q1: r = Q1 α. Then, from (9.47), α = V^−T b̂2. Thus, r = Q1 V^−T b̂2 is a solution to (9.47). This approach also applies to the situation when D is singular. In the following we calculate predictions for all 36 dogs on a set of grid points. Note that groups 1, 2, 3, and 4 have 9, 10, 8, and 9 dogs, respectively.

> dog.grid2 <- data.frame(time=rep(seq(0,1,len=50),36),

dog=rep(1:36,rep(50,36)))


> R <- kronecker(diag(36),

cubic(dog.grid2$time[1:50],dog$time[1:7]))

> b1 <- dog.fit6$lme.obj$coef$random$dog

> b2 <- as.vector(dog.fit6$lme.obj$coef$random$all)

> Z.qr <- qr(Z)

> r <- qr.Q(Z.qr)%*%solve(t(qr.R(Z.qr)))%*%b2

> tmp1 <- c(rep(e.dog.fit6$fit[dog.grid$group==1],9),

rep(e.dog.fit6$fit[dog.grid$group==2],10),

rep(e.dog.fit6$fit[dog.grid$group==3],8),

rep(e.dog.fit6$fit[dog.grid$group==4],9))

> tmp2 <- as.vector(rep(b1[,1],rep(50,36)))

> tmp3 <- as.vector(kronecker(b1[,2],dog.grid2$time[1:50]))

> u.new <- as.vector(R%*%r)

> p.dog.fit6 <- tmp1+tmp2+tmp3+u.new
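The QR-based solution of (9.47) used in the code above can be checked on a toy problem. A Python/NumPy sketch follows; the random Z and b2 are illustrative placeholders, not the dog-data quantities.

```python
import numpy as np

rng = np.random.default_rng(0)
Z = rng.standard_normal((12, 5))   # toy full-column-rank matrix in place of Z
b2 = rng.standard_normal(5)        # toy vector in place of the estimate of b2

# Reduced QR: Z = Q1 V with Q1^T Q1 = I and V upper triangular, so
# r = Q1 V^{-T} b2 gives Z^T r = V^T Q1^T Q1 V^{-T} b2 = b2, as in (9.47).
Q1, V = np.linalg.qr(Z)
r = Q1 @ np.linalg.solve(V.T, b2)

print(np.allclose(Z.T @ r, b2))
```

This mirrors the R line `r <- qr.Q(Z.qr)%*%solve(t(qr.R(Z.qr)))%*%b2`, and avoids forming D^−1 entirely.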

Predictions for dogs 1, 2, 26, and 27 are shown in Figure 9.7.


FIGURE 9.7 Dog data, predictions for dogs 1, 2, 26, and 27. Pluses are observations and solid lines are predictions.

9.4.4 Carbon Dioxide Uptake

The carbon dioxide data set contains five variables: Plant identifies each plant; Type has two levels, Quebec and Mississippi, indicating the origin of the plant; Treatment indicates two treatments, nonchilling and chilling; conc gives ambient carbon dioxide concentrations (mL/L); and uptake gives carbon dioxide uptake rates (umol/m^2 sec). Figure 9.8 shows the CO2 uptake rates for all plants.

The objective of the experiment was to evaluate the effect of plant type and chilling treatment on CO2 uptake. Pinheiro and Bates (2000) gave detailed analyses of this data set based on NLME models. They reached the following NLME model:

uptakeij = e^{φ1i} {1 − e^{−e^{φ2i}(concj − φ3i)}} + ǫij,
φ1i = β11 + β12 Type + β13 Treatment + β14 Treatment:Type + bi,
φ2i = β21,
φ3i = β31 + β32 Type + β33 Treatment + β34 Treatment:Type,
i = 1, . . . , 12; j = 1, . . . , 7,                        (9.48)

where uptakeij denotes the CO2 uptake rate of plant i at ambient CO2 concentration concj; Type equals 0 for plants from Quebec and 1 for plants from Mississippi; Treatment equals 0 for chilled plants and 1 for control plants; e^{φ1i}, e^{φ2i}, and φ3i denote, respectively, the asymptotic uptake rate, the uptake growth rate, and the maximum ambient CO2 concentration at which no uptake is verified for plant i; the random effects bi iid∼ N(0, σb^2); and the random errors ǫij iid∼ N(0, σ^2). Random effects and random errors are mutually independent. Note that we used exponential transformations to enforce the positivity constraints.
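To make the role of the transformed parameters concrete, the mean function in (9.48) can be evaluated directly. A small Python sketch follows; the parameter values are illustrative only, not fitted estimates.

```python
import math

def co2_mean(conc, phi1, phi2, phi3):
    # Mean uptake in (9.48): e^phi1 * (1 - exp(-e^phi2 * (conc - phi3))).
    # The exponentials keep the asymptote e^phi1 and rate e^phi2 positive.
    return math.exp(phi1) * (1.0 - math.exp(-math.exp(phi2) * (conc - phi3)))

# Illustrative values: asymptote e^3.5 (about 33), rate e^-4.6 (about 0.01),
# zero-uptake concentration 50; uptake rises toward the asymptote with conc.
for conc in (100, 400, 1000):
    print(round(co2_mean(conc, 3.5, -4.6, 50.0), 2))
```

Note that the mean is exactly zero at conc = φ3 and approaches e^{φ1} as conc grows, which is why those two parameters are interpretable directly.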

> data(CO2)

> co2.nlme <-

nlme(uptake~exp(a1)*(1-exp(-exp(a2)*(conc-a3))),

fixed=list(a1+a2~Type*Treatment,a3~1),

random=a1~1, groups=~Plant, data=CO2,

start=c(log(30),0,0,0,log(0.01),0,0,0,50))

> summary(co2.nlme)

Nonlinear mixed-effects model fit by maximum likelihood

Model: uptake ~ exp(a1)*(1-exp(-exp(a2)*(conc - a3)))

Data: CO2

AIC BIC logLik

393.2869 420.0259 -185.6434

...

Fits of model (9.48) are shown in Figure 9.8 as dotted lines. Based on model (9.48), one may conclude that the CO2 uptake is higher for plants from Quebec, that chilling, in general, results in lower uptake, and that its effect on Mississippi plants is much larger than on Quebec plants.

We use this data set to demonstrate how to fit an SNM model and how to check if an NLME model is appropriate. As an extension of (9.48),



FIGURE 9.8 Carbon dioxide data, plot of observations and fitted curves for each plant. Circles are CO2 uptake rates. Solid lines represent SNM model fits from co2.snm. Dotted lines represent NLME model fits from co2.nlme. Strip names represent IDs of plants, with “Q” indicating Quebec, “M” indicating Mississippi, “c” indicating chilled, “n” indicating nonchilled, and “1”, “2”, “3” indicating the replicate numbers.


we consider the following SNM model

uptakeij = e^{φ1i} f(e^{φ2i}(concj − φ3i)) + ǫij,
φ1i = β12 Type + β13 Treatment + β14 Treatment:Type + bi,
φ2i = β21,
φ3i = β31 + β32 Type + β33 Treatment + β34 Treatment:Type,
i = 1, . . . , 12; j = 1, . . . , 7,                        (9.49)

where f ∈ W2^2[0, b] for some fixed b > 0, and the second-stage model is similar to that in (9.48). In order to test if the parametric model (9.48) is appropriate, we use the exponential spline introduced in Section 2.11.2 with γ = 1. Then H0 = span{1, exp(−x)}, and H1 = W2^2[0, b] ⊖ H0 with RK given in (2.58). Note that β11 in (9.48) is excluded from (9.49) to make f free of constraint on the vertical scale. We need the side conditions that f(0) = 0 and f(x) ≠ 0 for x ≠ 0 to make β31 identifiable with f. The first condition reduces H0 to H0 = span{1 − exp(−x)} and is satisfied by all functions in H1. We do not enforce the second condition because it is satisfied by all reasonable estimates. Thus the model space for f is H0 ⊕ H1. It is clear that the NLME model is a special case of the SNM model with f ∈ H0. In the following we fit the SNM model (9.49) with initial values chosen from the NLME fit. The procedure converged after five iterations.

> M <- model.matrix(~Type*Treatment, data=CO2)[,-1]

> co2.snm <- snm(uptake~exp(a1)*f(exp(a2)*(conc-a3)),

func=f(u)~list(~I(1-exp(-u))-1,lspline(u,type="exp")),

fixed=list(a1~M-1,a3~1,a2~Type*Treatment),

random=list(a1~1), group=~Plant, verbose=T,

start=co2.nlme$coe$fixed[c(2:4,9,5:8)], data=CO2)

> summary(co2.snm)

Semi-parametric Nonlinear Mixed Effects Model fit

Model: uptake ~ exp(a1) * f(exp(a2) * (conc - a3))

Data: CO2

AIC BIC logLik

406.4865 441.625 -188.3760

...

GCV estimate(s) of smoothing parameter(s): 1.864814

Equivalent Degrees of Freedom (DF): 4.867183

Converged after 5 iterations

Fits of model (9.49) are shown in Figure 9.8 as solid lines. Since the data set is small, different initial values may lead to different estimates.


However, the overall fits are similar. We also fitted models with AR(1) within-subject correlations and covariate effects on φ3. None of these models improves the fit significantly. The estimates are comparable to the NLME fit, and the conclusions remain the same as those based on (9.48).

To check if the parametric NLME model (9.48) is appropriate, we calculate the posterior means and standard deviations using the function intervals. Note that the intervals function returns an object of class “bCI”, to which the generic function plot can be applied directly.

> co2.grid2 <- data.frame(u=seq(0.3, 11, len=50))

> co2.ci <- intervals(co2.snm, newdata=co2.grid2,

terms=matrix(c(1,1,1,0,0,1), ncol=2, byrow=T))

> plot(co2.ci,

type.name=c("overall","parametric","smooth"))


FIGURE 9.9 Carbon dioxide data, estimate of the overall function f (left) and its projections onto H0 (center) and H1 (right). Solid lines are fitted values. Dashed lines are approximate 95% Bayesian confidence intervals.

Figure 9.9 shows the estimate of f and its projections onto H0 and H1. The zero line is inside the Bayesian confidence intervals for the projection onto H1 (smooth component), which suggests that the parametric NLME model (9.48) is adequate.


9.4.5 Circadian Rhythm — Revisit

We have fitted an SIM (8.59) for normal subjects in Section 8.4.6, where the parameters βi1, βi2, and βi3 are deterministic. The fixed effects SIM (8.59) has several drawbacks: (1) potential correlations among observations from the same subject are ignored; (2) the number of deterministic parameters is large; and (3) it is difficult to investigate covariate effects on parameters and/or the common shape function. In this section we show how to fit the hormone data using SNM models. More details can be found in Wang, Ke and Brown (2003).

We first consider the following mixed-effects SIM for a single group

concij = µ+ b1i + exp(b2i)f(timeij − alogit(b3i)) + ǫij ,

i = 1, . . . ,m, j = 1, . . . , ni, (9.50)

where the fixed effect µ represents the 24-hour mean of the population, and the random effects b1i, b2i, and b3i represent deviations in 24-hour mean, amplitude, and phase of subject i. We assume that f ∈ W2^2(per) ⊖ {1} and bi = (b1i, b2i, b3i)^T iid∼ N(0, σ^2 D), where D is an unstructured positive-definite matrix. The assumption of zero population mean for amplitude and phase, and the removal of constant functions from the periodic spline space, take care of potential confounding between amplitude, phase, and the nonparametric common shape function f. We fit model (9.50) to cortisol measurements from normal subjects as follows:

> nor <- horm.cort[horm.cort$type=="normal",]

> nor.snm.fit <- snm(conc~b1+exp(b2)*f(time-alogit(b3)),

func=f(u)~list(periodic(u)),

data=nor, fixed=list(b1~1), random=list(b1+b2+b3~1),

start=c(mean(nor$conc)), groups=~ID, spar="m")

> summary(nor.snm.fit)

Semi-parametric Nonlinear Mixed Effects Model fit

Model: conc ~ b1 + exp(b2) * f(time - alogit(b3))

Data: nor

AIC BIC logLik

176.1212 224.1264 -70.07205

Random effects:

Formula: list(b1 ~ 1, b2 ~ 1, b3 ~ 1)

Level: ID

Structure: General positive-definite,

Log-Cholesky parametrization

StdDev Corr


b1 0.2462385 b1 b2

b2 0.1803665 -0.628

b3 0.2486114 0.049 -0.521

Residual 0.3952836

Fixed effects: list(b1 ~ 1)

Value Std.Error DF t-value p-value

b1 1.661412 0.07692439 98 21.59799 0

GML estimate(s) of smoothing parameter(s): 0.0001200191

Equivalent Degrees of Freedom (DF): 9.988554

Converged after 10 iterations
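The mean structure being fitted can be written as a small function. Below is a Python sketch of the shape-invariant mean in (9.50); alogit is assumed here to be the inverse logit e^x/(1 + e^x), and the zero-mean cosine is a hypothetical stand-in for the nonparametric periodic shape f.

```python
import math

def alogit(x):
    # Assumed inverse logit link for the phase shift in model (9.50)
    return math.exp(x) / (1.0 + math.exp(x))

def sim_mean(time, mu, b1, b2, b3, f):
    # Subject-level curve: population mean mu, 24-hour mean shift b1,
    # log-amplitude b2, and phase shift alogit(b3) around a common shape f
    return mu + b1 + math.exp(b2) * f(time - alogit(b3))

# Zero-mean periodic stand-in for the estimated common shape
f = lambda t: math.cos(2.0 * math.pi * t)
print(round(sim_mean(0.25, 1.66, 0.1, 0.2, 0.0, f), 3))
```

Because f has mean zero and b2, b3 have zero population means, the three subject-level effects are identifiable against the common shape, as noted in the text.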

We compute predictions for all subjects evaluated at grid points:

> nor.grid <- data.frame(ID=rep(unique(nor$ID),rep(50,9)),

time=rep(seq(0,1,len=50),9))

> nor.snm.p <- predict(nor.snm.fit, newdata=nor.grid)

The predictions are shown in Figure 9.10. In the following we also fit model (9.50) to the depression and Cushing groups. Observations and predictions based on model (9.50) for these two groups are shown in Figures 9.11 and 9.12, respectively.

> dep <- horm.cort[horm.cort$type=="depression",]

> dep.snm.fit <- snm(conc~b1+exp(b2)*f(time-alogit(b3)),

func=f(u)~list(periodic(u)),

data=dep, fixed=list(b1~1), random=list(b1+b2+b3~1),

start=c(mean(dep$conc)), groups=~ID, spar="m")

> cush <- horm.cort[horm.cort$type=="cushing",]

> cush.snm.fit <- snm(conc~b1+exp(b2)*f(time-alogit(b3)),

func=f(u)~list(periodic(u)),

data=cush, fixed=list(b1~1), random=list(b1+b2+b3~1),

start=c(mean(cush$conc)), groups=~ID, spar="m")

We calculate the posterior means and standard deviations of the common shape functions for all three groups:

> ci.grid <- data.frame(time=seq(0,1,len=50))

> nor.ci <- intervals(nor.snm.fit, newdata=ci.grid)

> dep.ci <- intervals(dep.snm.fit, newdata=ci.grid)

> cush.ci <- intervals(cush.snm.fit, newdata=ci.grid)

The estimated common shape functions and 95% Bayesian confidence intervals for the three groups are shown in Figure 9.13.



FIGURE 9.10 Hormone data, normal subjects, plots of cortisol concentrations (circles) and fitted curves based on model (9.50) (solid lines) and model (9.53) (dashed lines). Subjects’ IDs are shown in the strips.



FIGURE 9.11 Hormone data, depressed subjects, plots of cortisol concentrations (circles) and fitted curves based on model (9.50) (solid lines) and model (9.53) (dashed lines). Subjects’ IDs are shown in the strips.



FIGURE 9.12 Hormone data, subjects with Cushing’s disease, plots of cortisol concentrations (circles) and fitted curves based on model (9.50) (solid lines). Subjects’ IDs are shown in the strips.



FIGURE 9.13 Hormone data, estimates of the common shape function f (lines) and 95% Bayesian confidence intervals (shaded regions). The left three panels are estimates based on model (9.50) for the normal, depression, and Cushing groups, respectively. The right panel is the estimate based on model (9.53) for combined data from the normal and depression groups.

It is obvious that the common function for the Cushing group is almost zero, which suggests that, in general, circadian rhythms are lost for Cushing patients. It seems that the shape functions for the normal and depression groups are similar. We now test the hypothesis that the shape functions for the normal and depression groups are the same by fitting data from these two groups jointly. Consider the following model

concijk = µk + b1ik + exp(b2ik)f(k, timeijk − alogit(b3ik)) + ǫijk,

i = 1, . . . ,m, j = 1, . . . , ni, k = 1, 2, (9.51)

where k represents the group factor, with k = 1 and k = 2 corresponding to the depression and normal groups, respectively; the fixed effect µk is the population 24-hour mean of group k; and the random effects b1ik, b2ik, and b3ik represent the ith subject's deviations in 24-hour mean, amplitude, and phase. Note that subjects are nested within groups. We allow different variances for the random effects in each group. That is, we assume that bik = (b1ik, b2ik, b3ik)^T iid∼ N(0, σk^2 D), where D is an unstructured positive-definite matrix. We assume different common shape functions for each group. Thus f is a function of both group (denoted as k) and time. Since f is periodic in time, we model f using the tensor product space R^2 ⊗ W2^2(per). Specifically, consider the SS ANOVA decomposition (4.22). The constant and main effect of group are removed for identifiability with µk. Therefore, we assume the following model for f:

f(k, time) = f1(time) + f12(k, time), (9.52)


where f1(time) is the main effect of time, and f12(k, time) is the interaction between group and time. The hypothesis H0 : f(1, time) = f(2, time) is equivalent to H0 : f12(k, time) = 0 for all values of the time variable. Model (9.51) is fitted as follows:

> nordep <- horm.cort[horm.cort$type!="cushing",]

> nordep$type <- as.factor(as.vector(nordep$type))

> nordep.fit1 <- snm(conc~b1+exp(b2)*f(type,time-alogit(b3)),

func=f(g,u)~list(list(periodic(u),

rk.prod(shrink1(g),periodic(u)))),

data=nordep, fixed=list(b1~type), random=list(b1+b2+b3~1),

groups=~ID, weights=varIdent(form=~1|type),

spar="m", start=c(1.8,-.2))

> summary(nordep.fit1)

Semi-parametric Nonlinear Mixed Effects Model fit

Model: conc ~ b1 + exp(b2) * f(type, time - alogit(b3))

Data: nordep

AIC BIC logLik

441.4287 542.7464 -191.5463

Random effects:

Formula: list(b1 ~ 1, b2 ~ 1, b3 ~ 1)

Level: ID

Structure: General positive-definite,

Log-Cholesky parametrization

StdDev Corr

b1.(Intercept) 0.3403483 b1.(I) b2

b2 0.2936284 -0.781

b3 0.2962941 0.016 -0.159

Residual 0.4741110

Variance function:

Structure: Different standard deviations per stratum

Formula: ~1 | type

Parameter estimates:

depression normal

1.000000 0.891695

Fixed effects: list(b1 ~ type)

Value Std.Error DF t-value p-value

b1.(Intercept) 1.8389689 0.08554649 218 21.496719 0.0000

b1.typenormal -0.1179035 0.12360018 218 -0.953911 0.3412

Correlation:


Semiparametric Mixed-Effects Models 317

b1.(I)

b1.typenormal -0.692

GML estimate(s) of smoothing parameter(s): 4.022423e-04

2.256957e+02

Equivalent Degrees of Freedom (DF): 19.16807

Converged after 15 iterations

The smoothing parameter for the interaction term f12(k, time) is large, indicating that the interaction is negligible. We compute the posterior mean and standard deviation of the interaction term:

> u <- seq(0,1,len=50)

> nordep.inter <- intervals(nordep.fit1, terms=c(0,1),

newdata=data.frame(g=rep(c("normal","depression"),

c(50,50)),u=rep(u,2)))

> range(nordep.inter$fit)

-9.847084e-06 1.349702e-05

> range(nordep.inter$pstd)

0.001422883 0.001423004

The posterior means are on the magnitude of 10^-5, while the posterior standard deviations are on the magnitude of 10^-3. The estimate of f12 is essentially zero. Therefore, it is appropriate to assume the same shape function for the normal and depression groups.

Under the assumption of one shape function for both normal and depression groups, we can now investigate differences of 24-hour mean, amplitude, and phase between the two groups. For this purpose, consider the following model

concijk = µk + b1ik + exp(b2ik + δk,2 d1) × f(timeijk − alogit(b3ik + δk,2 d2)) + εijk,

i = 1, . . . , m,  j = 1, . . . , ni,  k = 1, 2,   (9.53)

where δk,2 is the Kronecker delta, and the parameters d1 and d2 account for the differences in amplitude and phase, respectively, between the normal and depression groups.

> nordep.fit2 <- snm(conc~b1+exp(b2+d1*I(type=="normal"))

*f(time-alogit(b3+d2*I(type=="normal"))),

func=f(u)~list(periodic(u)), data=nordep,

fixed=list(b1~type,d1+d2~1), random=list(b1+b2+b3~1),

groups=~ID, weights=varIdent(form=~1|type),


spar="m", start=c(1.9,-0.3,0,0))

> summary(nordep.fit2)

Semi-parametric Nonlinear Mixed Effects Model fit

Model: conc ~ b1 + exp(b2 + d1 * I(type == "normal")) *

f(time - alogit(b3 + d2 * I(type == "normal")))

Data: nordep

AIC BIC logLik

429.9391 503.4998 -193.7516

Random effects:

Formula: list(b1 ~ 1, b2 ~ 1, b3 ~ 1)

Level: ID

Structure: General positive-definite,

Log-Cholesky parametrization

StdDev Corr

b1.(Intercept) 0.3309993 b1.(I) b2

b2 0.2841053 -0.781

b3 0.2901979 0.030 -0.189

Residual 0.4655115

Variance function:

Structure: Different standard deviations per stratum

Formula: ~1 | type

Parameter estimates:

depression normal

1.0000000 0.8908655

Fixed effects: list(b1 ~ type, d1 + d2 ~ 1)

Value Std.Error DF t-value p-value

b1.(Intercept) 1.8919482 0.08590594 216 22.023485 0.0000

b1.typenormal -0.2558220 0.14361590 216 -1.781293 0.0763

d1 0.2102017 0.10783207 216 1.949343 0.0525

d2 0.0281460 0.09878700 216 0.284916 0.7760

Correlation:

b1.(I) b1.typ d1

b1.typenormal -0.598

d1 0.000 -0.509

d2 0.000 0.023 -0.159

GML estimate(s) of smoothing parameter(s): 0.0004723142

Equivalent Degrees of Freedom (DF): 9.217902

Converged after 9 iterations


The differences of 24-hour mean and amplitude are borderline significant, while the difference of phase is not. We refit without the d2 term:

> nordep.fit3 <- snm(conc~b1+

exp(b2+d1*I(type=="normal"))*f(time-alogit(b3)),

func=f(u)~list(periodic(u)), data=nordep,

fixed=list(b1~type,d1~1), random=list(b1+b2+b3~1),

groups=~ID, weights=varIdent(form=~1|type),

spar="m", start=c(1.9,-0.3,0))

> summary(nordep.fit3)

Semi-parametric Nonlinear Mixed Effects Model fit

Model: conc ~ b1 + exp(b2 + d1 * I(type == "normal")) *

f(time - alogit(b3))

Data: nordep

AIC BIC logLik

425.2350 495.3548 -192.4077

Random effects:

Formula: list(b1 ~ 1, b2 ~ 1, b3 ~ 1)

Level: ID

Structure: General positive-definite,

Log-Cholesky parametrization

StdDev Corr

b1.(Intercept) 0.3302233 b1.(I) b2

b2 0.2835421 -0.780

b3 0.2898165 0.033 -0.192

Residual 0.4647148

Variance function:

Structure: Different standard deviations per stratum

Formula: ~1 | type

Parameter estimates:

depression normal

1.0000000 0.8902948

Fixed effects: list(b1 ~ type, d1 ~ 1)

Value Std.Error DF t-value p-value

b1.(Intercept) 1.8919931 0.08574693 217 22.064849 0.0000

b1.typenormal -0.2567236 0.14327117 217 -1.791872 0.0745

d1 0.2148962 0.10620988 217 2.023316 0.0443

Correlation:

b1.(I) b1.typ

b1.typenormal -0.598


d1 0.000 -0.512

GML estimate(s) of smoothing parameter(s): 0.0004742399

Equivalent Degrees of Freedom (DF): 9.20981

Converged after 8 iterations

The predictions based on the final fit are shown in Figures 9.10 and 9.11. The estimate of the common shape function f is shown in the right panel of Figure 9.13. Data from the two groups are pooled to estimate the common shape function, which leads to narrower confidence intervals. The final model suggests that the depressed subjects have their mean cortisol level elevated and have a less profound circadian rhythm than normal subjects.

To take a closer look, we extract estimates of 24-hour mean levels and amplitudes for all subjects in the normal and depression groups and perform binary recursive partitioning using these two variables:

> nor.mean <- nordep.fit3$coef$fixed[1]+

nordep.fit3$coef$fixed[2]+

nordep.fit3$coef$random$ID[12:20,1]

> dep.mean <- nordep.fit3$coef$fixed[1]+

nordep.fit3$coef$random$ID[1:11,1]

> nor.amp <- exp(nordep.fit3$coef$fixed[3]+

nordep.fit3$coef$random$ID[12:20,2])

> dep.amp <- exp(nordep.fit3$coef$random$ID[1:11,2])

> u <- c(nor.mean, dep.mean)

> v <- c(nor.amp, dep.amp)

> s <- c(rep("n",9),rep("d",11))

> library(rpart)

> prune(rpart(s~u+v), cp=.1)

n= 20

node), split, n, loss, yval, (yprob)

* denotes terminal node

1) root 20 9 d (0.5500000 0.4500000)

2) u>=1.639263 12 2 d (0.8333333 0.1666667) *

3) u< 1.639263 8 1 n (0.1250000 0.8750000) *

Figure 9.14 shows the estimated 24-hour mean levels plotted against the estimated amplitudes. There is a negative relationship between the 24-hour mean and amplitude. The estimate of the correlation between b1ik and b2ik equals −0.781. The normal subjects and depressed patients can be well separated by the 24-hour mean level.


FIGURE 9.14 Hormone data, plot of the estimated 24-hour mean levels against amplitudes. Normal subjects and depressed patients are marked as "n" and "d", respectively. The dotted line represents the partition based on the tree method.


Appendix A

Data Sets

In this appendix we describe the data sets used for illustrations in this book. Table A.1 lists all data sets.

TABLE A.1 List of all data sets.

Air quality: New York air quality measurements
Arosa: Monthly ozone measurements from Arosa
Beveridge: Beveridge wheat price index
Bond: Treasury and GE bonds
Canadian weather: Monthly temperature and precipitation from 35 Canadian stations
Carbon dioxide: Carbon dioxide uptake in grass plants
Chickenpox: Monthly chickenpox cases in New York City
Child growth: Height of a child over one school year
Dog: Coronary sinus potassium in dogs
Geyser: Old Faithful geyser data
Hormone: Cortisol concentrations
Lake acidity: Acidity measurements from lakes
Melanoma: Melanoma incidence rates in Connecticut
Motorcycle: Simulated motorcycle accident data
Paramecium caudatum: Growth of paramecium caudatum population
Rock: Measurements on petroleum rock samples
Seizure: IEEG segments from a seizure patient
Star: Magnitude of the Mira variable R Hydrae
Stratford weather: Daily maximum temperatures in Stratford
Superconductivity: Superconductivity magnetization
Texas weather: Texas historical climate data
Ultrasound: Ultrasound imaging of the tongue shape
USA climate: Average winter temperatures in USA
Weight loss: Weight loss of an obese patient
WESDR: Wisconsin Epidemiological Study of Diabetic Retinopathy
World climate: Global average winter temperature


A.1 Air Quality Data

This data set contains daily air quality measurements in New York City, May to September of 1973. Four variables were measured: mean ozone in parts per billion from 1300 to 1500 hours at Roosevelt Island (denoted as Ozone), average wind speed in miles per hour at 0700 and 1000 hours at LaGuardia Airport (denoted as Wind), maximum daily temperature in degrees Fahrenheit at LaGuardia Airport (denoted as Temp), and solar radiation in Langleys in the frequency band 4000-7700 Angstroms from 0800 to 1200 hours at Central Park (denoted as Solar.R). The data set is available in R with the name airquality.

A.2 Arosa Ozone Data

This data set in Andrews and Herzberg (1985) contains monthly mean ozone thickness (Dobson units) in Arosa, Switzerland, from 1926 to 1971. It consists of 518 observations on three variables: thick for ozone thickness, month, and year. The data set is available in library assist with the name Arosa.

A.3 Beveridge Wheat Price Index Data

The data set contains the Beveridge wheat price index averaged over many locations in western and central Europe from 1500 to 1869. The data set is available in library tseries with the name bev.

A.4 Bond Data

144 GE (General Electric Company) bonds and 78 Treasury bonds were collected from Bloomberg. The data set contains four variables: name of a bond, current price, payment at future times, and type of the bond. The number of payments till due ranges from 1 to 27 with a median of 3


for the GE bonds, and from 1 to 58 with a median of 2.5 for Treasury bonds. The maximum time to maturity is 14.8 years for GE bonds and 28.77 years for Treasury bonds. The data set is available in library assist with the name bond.

A.5 Canadian Weather Data

The data set contains mean monthly temperature and precipitation at 35 Canadian weather stations (Ramsay and Silverman 2005). It consists of 420 observations on five variables: temp for temperature in Celsius, prec for precipitation in millimeters, station code number, geological zone, and month. The data set is available in library fda with the name CanadianWeather.

A.6 Carbon Dioxide Data

This data set comes from a study of cold tolerance of a C4 grass species, Echinochloa crus-galli. A total of 12 four-week-old plants were used in the study. There were two types of plants: six from Quebec and six from Mississippi. Two treatments, nonchilling and chilling, were assigned to three plants of each type. Nonchilled plants were kept at 26°C, and chilled plants were subject to 14 hours of chilling at 7°C. After 10 hours of recovery at 20°C, CO2 uptake rates (in µmol/m²s) were measured for each plant at seven concentrations of ambient CO2 in increasing, consecutive order. More details can be found in Pinheiro and Bates (2000). The data set is available in library nlme with the name CO2.

A.7 Chickenpox Data

The data set, downloaded at http://robjhyndman.com/tsdldata/epi/chicknyc.dat, contains the monthly number of reported cases of chickenpox in New York City from 1931 to the first 6 months of 1972. It


consists of 498 observations on three variables: count, month, and year. The data set is available in library assist with the name chickenpox.

A.8 Child Growth Data

Height measurements of a child were recorded over one school year (Ramsay 1998). The data set contains 83 observations on two variables: height in centimeters and day in a year. The data set is available in library fda with the name onechild.

A.9 Dog Data

A total of 36 dogs were assigned to four groups: control, extrinsic cardiac denervation 3 weeks prior to coronary occlusion, extrinsic cardiac denervation immediately prior to coronary occlusion, and bilateral thoracic sympathectomy and stellectomy 3 weeks prior to coronary occlusion. Coronary sinus potassium concentrations (milliequivalents per liter) were measured on each dog every 2 minutes from 1 to 13 minutes after occlusion. These data were originally presented by Grizzle and Allen (1969). The data set is available in library assist with the name dog.

A.10 Geyser Data

This data set contains 272 measurements from the Old Faithful geyser in Yellowstone National Park. Two variables were recorded: duration as the eruption time in minutes, and waiting as the waiting time in minutes to the next eruption. The data set is available in R with the name faithful.


A.11 Hormone Data

In an experiment to study immunological responses in humans, blood samples were collected every two hours for 24 hours from 9 healthy normal volunteers, 11 patients with major depression, and 16 patients with Cushing's syndrome. These blood samples were analyzed for parameters that measure immune functions and hormones of the Hypothalamic-Pituitary-Adrenal axis (Kronfol, Nair, Zhang, Hill and Brown 1997). We will concentrate on the hormone cortisol. The data set contains four variables: ID for subject index, time for time points when blood samples were taken, type as a group indicator for subjects, and conc for cortisol concentration on the log10 scale. The variable time is scaled into the interval [0, 1]. The data set is available in library assist with the name horm.cort.

A.12 Lake Acidity Data

This data set was derived by Douglas and Delampady (1990) from the Eastern Lakes Survey of 1984. The study involved measurements of 1789 lakes in three Eastern US regions: Northeast, Upper Midwest, and Southeast. We use a subset of 112 lakes in the southern Blue Ridge mountains area. The data set contains 112 observations on four variables: ph for water pH level, t1 for calcium concentration in log10 milligrams per liter, x1 for latitude, and x2 for longitude. The data set is available in library assist with the name acid.

A.13 Melanoma Data

This data set in Andrews and Herzberg (1985) contains numbers of melanoma cases per 100,000 in the state of Connecticut during 1936–1972. It consists of 37 observations on two variables: cases for numbers of melanoma cases per 100,000, and year. The data set is available in library fda with the name melanoma.


A.14 Motorcycle Data

These data come from a simulated motorcycle crash experiment on the efficacy of crash helmets. The data set contains 133 measurements on two variables: accel as head acceleration in g of a subject and time as time after impact in milliseconds. The data set is available in library MASS with the name mcycle.

A.15 Paramecium caudatum Data

This data set in Gause (1934) contains growth of a paramecium caudatum population in the medium of Osterhout. It consists of 25 observations on two variables: days since the start of the experiment, and density representing the mean number of individuals in 0.5 milliliter of medium of four different cultures started simultaneously. The data set is available in library assist with the name paramecium.

A.16 Rock Data

This data set contains measurements on 48 rock samples collected from a petroleum reservoir. Four variables were measured: area of pore space in pixels out of 256 by 256, perimeter in pixels (denoted as peri), shape in perimeter/sqrt(area), and permeability in milli-Darcies (denoted as perm). The data set is available in R with the name rock.

A.17 Seizure Data

This data set, provided by Li Qin, contains two 5-minute intracranial electroencephalogram (IEEG) segments from a seizure patient: base includes the baseline segment extracted at least 4 hours before the seizure's onset, and preseizure includes the segment right before a seizure's clinical onset. The sampling rate is 200 Hertz. Therefore, there are 60,000


time points in each segment. The data set is available in library assist with the name seizure.

A.18 Star Data

This data set, provided by Marc G. Genton, contains the magnitude (brightness) of the Mira variable R Hydrae during 1900–1950. It consists of two variables: magnitude and time in days. The data set is available in library assist with the name star.

A.19 Stratford Weather Data

This is part of a climate data set downloaded from the Carbon Dioxide Information Analysis Center at http://cdiac.ornl.gov/ftp/ndp070. Daily maximum temperatures from the station in Stratford, Texas, in the year 1990 were extracted. The year was divided into 73 five-day periods, and measurements on the third day in each period were selected as observations. Therefore, the data set consists of 73 observations on two variables: y as the observed maximum temperature in Fahrenheit, and x as the time scaled into [0, 1]. The data set is available in library assist with the name Stratford.

A.20 Superconductivity Data

The data come from a study involving superconductivity magnetization modeling conducted by the National Institute of Standards and Technology. The data set contains 154 observations on two variables: magnetization in ampere×meter²/kilogram, and log time in minutes. Temperature was fixed at 10 degrees Kelvin. The data set is available in library NISTnls with the name Bennett5.


A.21 Texas Weather Data

The data set contains average monthly temperatures during 1961–1990 from 48 weather stations in Texas. It also contains geographic locations of these stations in terms of longitude (long) and latitude (lat). The data set is available in library assist with the name TXtemp.

A.22 Ultrasound Data

Ultrasound imaging of the tongue provides real-time information about the shape of the tongue body at different stages in articulation. This data set comes from an experiment conducted in the Phonetics and Experimental Phonology Lab of New York University led by Professor Lisa Davidson. Three Russian speakers produced the consonant sequence /gd/ in three different linguistic environments:

2words: the g was at the end of one word followed by d at the beginning of the next word. For example, the Russian phrase pabjeg damoj;

cluster: the g and d were both at the beginning of the same word. For example, the phrase xot gdamam;

schwa: the g and d were at the beginning of the same word but are separated by the short vowel schwa (indicated by [ə]). For example, the phrase prətajitatjgəda'voj.

Details about the ultrasound experiment can be found in Davidson (2006). We use a subset from a single subject, with three replications for each environment, and 15 points recorded from each of 9 slices of tongue curves separated by 30 ms (milliseconds). The data set contains four variables: height as tongue height in mm (millimeters), length as tongue length in mm, time as the time in ms, and env as the environment with three levels: 2words, cluster, and schwa. The data set is available in library assist with the name ultrasound.


A.23 USA Climate Data

The data set contains average winter (December, January, and February) temperatures (temp) in 1981 from 1214 stations in the United States. It also contains geographic locations of these stations in terms of longitude (long) and latitude (lat). The data set is available in library assist with the name USAtemp.

A.24 Weight Loss Data

The data set contains 52 observations on two variables: Weight in kilograms of a male obese patient, and Days since the start of a weight rehabilitation program. The data set is available in library MASS with the name wtloss.

A.25 WESDR Data

The Wisconsin Epidemiological Study of Diabetic Retinopathy (WESDR) is an epidemiological study of a cohort of diabetic patients receiving their medical care in an 11-county area in southern Wisconsin. A number of medical, demographic, ocular, and other covariates were recorded at the baseline and later examinations along with a retinopathy score for each eye. Detailed descriptions of the study were given in Klein, Klein, Moss, Davis and DeMets (1988) and Klein, Klein, Moss, Davis and DeMets (1989). This subset contains 669 observations on five variables: num for subject ID, dur for duration of diabetes at baseline, gly for glycosylated hemoglobin, bmi for body mass index (weight in kilograms/(height in meters)²), and prg for progression status of diabetic retinopathy at the first follow-up (1 for progression and 0 for nonprogression). The data set is available in library assist with the name wesdr.


A.26 World Climate Data

The data were obtained from the Carbon Dioxide Information Analysis Center at Oak Ridge National Laboratory. The data set contains average winter (December, January, and February) temperatures (temp) in 1981 from 725 stations around the globe, and geographic locations of these stations in terms of longitude (long) and latitude (lat). The data set is available in library assist with the name climate.


Appendix B

Codes for Fitting Strictly Increasing Functions

B.1 C and R Codes for Computing Integrals

The following functions k2 and k4 compute the scaled Bernoulli polynomials k2(x) and k4(x) in (2.27), and the function rc computes the RK of the cubic spline in Table 2.2:

static double

k2(double x) {

double value;

x = fabs(x);

value = x - 0.5;

value *= value;

value = (value-1./12.)/2.;

return(value);

}

static double

k4(double x) {

double val;

x = fabs(x);

val = x - 0.5;

val *= val;

val = (val * val - val/2. + 7./240.)/24.;

return(val);

}

static double

rc(double x, double y) {

double value;

value = k2(x) * k2(y) - k4(x - y);

return(value);

}


The following functions integral_s, integral_f, and integral_1 compute three-point Gaussian quadrature approximations to the integrals ∫_0^x f(s)ds, ∫_0^x f(s)R1(s, y)ds, and ∫_0^x ∫_0^y f(s)f(t)R1(s, t)ds dt, respectively, where R1 is the RK of the cubic spline in Table 2.2:

void integral_s(double *f, double *x, long *n, double *res)

{

long i;

double sum=0.0;

for(i=0; i< *n; i++){

sum += (x[i+1]-x[i])*(0.2777778*(f[3*i]+f[3*i+2])+

0.4444444*f[3*i+1]);

res[i] = sum;

}

}

void integral_f(double *x, double *y, double *f,

long *nx, long *ny, double *res)

{

long i, j;

double x1, y1, sum=0.0;

for(i=0;i< *ny; i++){

sum = 0.0;

for(j=0; j< *nx; j++){

x1 = x[j+1]-x[j];

sum += x1*(0.2777778*(f[3*j]*

rc(x[j]+x1*0.1127017, y[i])+

f[3*j+2]*rc(x[j]+x1*0.8872983, y[i]))+

0.4444444*f[3*j+1]*rc(x[j]+x1*0.5,y[i]));

res[i*(*nx)+j] = sum;

}

}

}

void integral_1(double *x, double *y, double *f,

long *n1, long *n2, double *res)

{

long i, j, t, s;

double x1, y1, sum=0.0, sum_tmp;

for(i=0; i< *n1; i++){


x1 = x[i+1]-x[i];

sum = 0.0;

for(j=0; j< *n2; j++){

y1 = y[j+1]-y[j];

sum_tmp = 0.2777778*0.2777778*(f[3*i]*f[3*j])*

rc(x[i]+x1*0.1127017, y[j]+y1*0.1127017)+

0.2777778*0.4444444*((f[3*i]*f[3*j+1])*

rc(x[i]+x1*0.1127017, y[j]+y1*0.5)+

(f[3*i+1]*f[3*j])*rc(x[i]+x1*0.5,y[j]+y1*0.1127017));

sum_tmp += 0.4444444*0.4444444*((f[3*i+1]*f[3*j+1])*

rc(x[i]+x1*0.5,y[j]+y1*0.5))+

0.2777778*0.2777778*((f[3*i+2]*f[3*j+2])*

rc(x[i]+x1*0.8872983, y[j]+ y1*0.8872983));

sum_tmp += 0.2777778*0.2777778*((f[3*i]*f[3*j+2])*

rc(x[i]+x1*0.1127017, y[j]+y1*0.8872983)+

(f[3*i+2]*f[3*j])*

rc(x[i]+x1*0.8872983,y[j]+y1*0.1127017))+

0.4444444*0.2777778*((f[3*i+1]*f[3*j+2])*

rc(x[i]+x1*0.5, y[j]+y1*0.8872983)+

(f[3*i+2]*f[3*j+1])*rc(x[i]+x1*0.8872983, y[j]+y1*0.5));

sum += sum_tmp*x1*y1;

res[i*(*n2)+j] = sum;

}

}

}

The following R functions provide an interface to the C functions integral_s, integral_f, and integral_1:

int.s <- function(f, x, low=0) {

n <- length(x)

x <- c(low, x)

.C("integral_s", as.double(f), as.double(x),

as.integer(n), val = double(n))$val

}

int.f <- function(x, y, f, low=0) {

nx <- length(x)

ny <- length(y)

x <- c(low, x)

res <- .C("integral_f", as.double(x),

as.double(y), as.double(f), as.integer(nx),

as.integer(ny), val=double(nx*ny))$val

matrix(res, ncol=ny, byrow=F)


}

int1 <- function(x, f.val, low=0) {

n <- length(x)

x <- c(low, x)

if(length(f.val) != 3 * n) stop("input lengths do not match")

res <- matrix(.C("integral_1", as.double(x),

as.double(x), as.double(f.val), as.integer(n),

as.integer(n), val = double(n * n))$val, ncol = n)

apply(res, 1, cumsum)

}

B.2 R Function inc

The following R function implements the EGN procedure for model (7.3).

inc <- function(y, x, spar="v", grid=x, limnla=c(-6,0),

prec=1.e-6, maxit=50, verbose=F)

{

n <- length(x)

org.ord <- match(1:n, (1:n)[order(x)])

s.x <- sort(x)

s.y <- y[order(x)]

x1 <- c(0, s.x[-n])

x2 <- s.x-x1

q.x <- as.vector(rep(1,3)%o%x1+

c(0.1127017,0.5,0.8872983)%o%x2)

# function for computing derivatives

k1 <- function(x) x-.5

k2 <- function(x) ((x-.5)^2-1/12)/2

dk2 <- function(x) x-.5

dk4 <- function(x)

sign(x)*((abs(x)-.5)^3/6-(abs(x)-.5)/24)

drkcub <- function(x,z) dk2(x)%o%k2(z)-

dk4(x%o%rep(1,length(z))-rep(1,length(x))%o%z)

# compute starting value

ini.fit <- ssr(s.y~I(s.x-.5), cubic(s.x))

g.der <- ini.fit$coef$d[2]+drkcub(q.x,x)%*%ini.fit$coef$c


h.new <- abs(g.der)+0.005

# begin iteration

iter <- cover <- 1

h.old <- h.new

repeat {

if(verbose) cat("\n Iteration: ", iter)

yhat <- s.y-int.s(h.new*(1-log(h.new)),s.x)

smat <- cbind(int.s(h.new, s.x), int.s(h.new*q.x,s.x))

qmat <- int1(s.x,h.new)

fit <- ssr(yhat~smat, qmat, spar=spar, limnla=limnla)

if(verbose)

cat("\nSmoothing parameter: ", fit$rkpk$nlaht)

dd <- fit$coef$d

cc <- as.vector(fit$coef$c)

h.new <- as.vector(exp(cc%*%int.f(s.x,q.x,h.new)+

dd[2]+dd[3]*q.x))

cover <- mean((h.new-h.old)^2)

h.old <- h.new

if(verbose)

cat("\nConvergent Criterion: ", cover, "\n")

if(cover<prec || iter>(maxit-1)) break

iter <- iter + 1

}

if(iter>=maxit) print("convergence not achieved!")

y.fit <- (smat[,1]+fit$rkpk$d[1])[org.ord]

f.fit <- as.vector(cc%*%int.f(s.x,grid,h.new)+

dd[2]+dd[3]*grid)

x1 <- c(0, grid[-length(grid)])

x2 <- grid-x1

q.x <- as.vector(rep(1,3)%o%x1+

c(0.1127017,0.5,0.8872983)%o%x2)

h.new <- as.vector(exp(cc%*%int.f(s.x,q.x,h.new)+

dd[2]+dd[3]*q.x))

y.pre <- int.s(h.new,grid)+fit$rkpk$d[1]

sigma <- sqrt(sum((y-y.fit)^2)/(length(y)-fit$df))

list(fit=fit, iter=c(iter, cover),

pred=list(x=grid,y=y.pre,f=f.fit),

y.fit=y.fit, sigma=sigma)

}

where x and y are vectors of the independent and dependent variables; grid is a vector of grid points of the x variable used for assessing convergence and prediction; and the options spar and limnla are similar to those in the ssr function. Let h(x) = g′(x) = exp{f(x)}. To get the initial value for the function f, we first fit a cubic spline to model (7.1). Denote the fitted function as g0. We then use log(|g0′(x)| + δ) as the initial value for f, where δ = 0.005 is a small positive number for numerical stability. Since g0(x) = d1 φ1(x) + d2 φ2(x) + Σ_{i=1}^n c_i R1(x_i, x), we have g0′(x) = d2 + Σ_{i=1}^n c_i ∂R1(x_i, x)/∂x, where ∂R1(x_i, x)/∂x is computed by the function drkcub. Functions k1 and k2 compute the scaled Bernoulli polynomials defined in (2.27), and functions dk2 and dk4 compute k2′(x) and k4′(x), respectively. As a by-product, the line starting with g.der shows how to compute the first derivative of a cubic spline fit.


Appendix C

Codes for Term Structure of Interest Rates

C.1 C and R Codes for Computing Integrals

The following rc function computes the RK of the cubic spline in Table 2.1:

static double
rc(double x, double y) {
  double val, tmp;
  tmp = (x+y-fabs(x-y))/2.0;
  val = (tmp)*(tmp)*(3.0*(x+y-tmp)-tmp)/6.0;
  return(val);
}

In addition to the functions integral_s, integral_f, and integral_1 presented in Appendix B, we need the following function integral_2 for computing three-point Gaussian quadrature approximations to the integral ∫_0^x ∫_0^y f_1(s) f_2(t) R_1(s, t) ds dt, where R_1 is the RK of the cubic spline in Table 2.1:

void integral_2(double *x, double *y, double *fx,
                double *fy, long *n1, long *n2, double *res)
{
  long i, j;
  double x1, y1, sum=0.0, sum_tmp;
  for(i=0; i< *n1; i++){
    x1 = x[i+1]-x[i];
    sum = 0.0;
    for(j=0; j< *n2; j++){
      y1 = y[j+1]-y[j];
      sum_tmp = 0.2777778*0.2777778*(fx[3*i]*fy[3*j])*
                rc(x[i]+x1*0.1127017, y[j]+y1*0.1127017)+
                0.2777778*0.4444444*((fx[3*i]*fy[3*j+1])*
                rc(x[i]+x1*0.1127017, y[j]+y1*0.5)+
                (fx[3*i+1]*fy[3*j])*
                rc(x[i]+x1*0.5, y[j]+y1*0.1127017));
      sum_tmp += 0.4444444*0.4444444*((fx[3*i+1]*fy[3*j+1])*
                 rc(x[i]+x1*0.5, y[j]+y1*0.5)) +
                 0.2777778*0.2777778*((fx[3*i+2]*fy[3*j+2])*
                 rc(x[i]+x1*0.8872983, y[j]+y1*0.8872983));
      sum_tmp += 0.2777778*0.2777778*((fx[3*i]*fy[3*j+2])*
                 rc(x[i]+x1*0.1127017, y[j]+y1*0.8872983)+
                 (fx[3*i+2]*fy[3*j])*
                 rc(x[i]+x1*0.8872983, y[j]+y1*0.1127017))+
                 0.4444444*0.2777778*((fx[3*i+1]*fy[3*j+2])*
                 rc(x[i]+x1*0.5, y[j]+y1*0.8872983)+
                 (fx[3*i+2]*fy[3*j+1])*
                 rc(x[i]+x1*0.8872983, y[j]+y1*0.5));
      sum += sum_tmp*x1*y1;
      res[i*(*n2)+j] = sum;
    }
  }
}

Note that the rc function called inside integral_f, integral_1, and integral_2 in this appendix computes the RK R_1 of the cubic spline in Table 2.1.

The following R function int2 provides an interface to the C function integral_2:

int2 <- function(x, y, fx, fy, low.x=0, low.y=0) {
  nx <- length(x)
  ny <- length(y)
  if((length(fx) != 3 * nx) || (length(fy) != 3 * ny))
    stop("input does not match")
  x <- c(low.x, x)
  y <- c(low.y, y)
  res <- matrix(.C("integral_2", as.double(x),
                   as.double(y), as.double(fx), as.double(fy),
                   as.integer(nx), as.integer(ny),
                   val=double(nx*ny))$val,
                ncol=ny, byrow=TRUE)
  apply(res, 2, cumsum)
}


C.2 R Function for One Bond

The following one.bond function implements the EGN algorithm to fit model (7.35):

one.bond <- function(price, payment, time, name,
                     spar="m", limnla=c(-3,6)) {
  # pre-processing the data
  # the data has to be sorted by the name
  group <- as.vector(table(name))
  y <- price[cumsum(group)]
  n.time <- length(time)
  # create variables for 3-point Gaussian quadrature
  s.time <- sort(time)
  x1.y <- c(0, s.time[-n.time])
  x2.y <- s.time-x1.y
  org.ord <- match(1:n.time, (1:n.time)[order(time)])
  q.time <- as.vector(rep(1,3)%o%x1.y+
                      c(0.1127017,0.5,0.8872983)%o%x2.y)
  # initial values for f
  f0 <- function(x) rep(0.04, length(x))
  f.old <- f.val <- f0(q.time)
  # create s and q matrices
  S <- cbind(time,time*time/2.0)
  Lambda <- int1(s.time,
                 rep(1,3*length(s.time)))[org.ord,org.ord]
  Lint <- int.f(s.time,q.time,
                rep(1,3*length(s.time)))[org.ord,]
  # begin iteration
  iter <- cover <- 1
  repeat {
    fint <- int.s(f.val,s.time)[org.ord]
    X <- assist:::diagComp(matrix(payment*exp(-fint),nrow=1),
                           group)
    ytilde <- X%*%(1+fint)-y
    T <- X%*%S; Q <- X%*%Lambda%*%t(X)
    fit <- ssr(ytilde~T-1, Q, spar=spar, limnla=limnla)
    dd <- fit$coef$d; cc <- fit$coef$c
    f.val <- as.vector((cc%*%X)%*%Lint+dd[1]+dd[2]*q.time)
    cover <- mean((f.val-f.old)^2)
    if(cover<1.e-6 || iter>20) break
    iter <- iter+1; f.old <- f.val
  }
  tmp <- -int.s(f.val,s.time)[org.ord]
  yhat <- apply(assist:::diagComp(matrix(payment*exp(tmp),
                                         nrow=1),group),1,sum)
  sigma <- sqrt(sum((y-yhat)^2)/(length(y)-fit$df))
  list(fit=fit, iter=c(iter, cover), call=match.call(),
       f.val=f.val, q.time=q.time, dc=exp(tmp),
       y=list(y=y,yhat=yhat), sigma=sigma)
}

where variable names are self-explanatory.

C.3 R Function for Two Bonds

The following two.bond function implements the nonlinear Gauss–Seidel algorithm to fit model (7.36):

two.bond <- function(price, payment, time, name, type,
                     spar="m", limnla=c(-3,6), prec=1.e-6, maxit=20) {
  # pre-processing the data
  # the data in each group has to be sorted by the name
  group1 <- as.vector(table(name[type=="govt"]))
  y1 <- price[type=="govt"][cumsum(group1)]
  time1 <- time[type=="govt"]
  n1.time <- length(time1)
  payment1 <- payment[type=="govt"]
  group2 <- as.vector(table(name[type=="ge"]))
  y2 <- price[type=="ge"][cumsum(group2)]
  time2 <- time[type=="ge"]
  n2.time <- length(time2)
  payment2 <- payment[type=="ge"]
  y <- c(y1, y2)
  group <- c(group1, group2)
  payment <- c(payment1, payment2)
  error <- 0
  # create variables for 3-point Gaussian quadrature
  s.time1 <- sort(time1)
  x1.y1 <- c(0, s.time1[-n1.time])
  x2.y1 <- s.time1-x1.y1
  org.ord1 <- match(1:n1.time, (1:n1.time)[order(time1)])
  q.time1 <- as.vector(rep(1,3)%o%x1.y1+
                       c(0.1127017,0.5,0.8872983)%o%x2.y1)
  s.time2 <- sort(time2)
  x1.y2 <- c(0, s.time2[-n2.time])
  x2.y2 <- s.time2-x1.y2
  org.ord2 <- match(1:n2.time, (1:n2.time)[order(time2)])
  q.time2 <- as.vector(rep(1,3)%o%x1.y2+
                       c(0.1127017,0.5,0.8872983)%o%x2.y2)
  # initial values for f
  f10 <- function(x) rep(0.04, length(x))
  f20 <- function(x) rep(0.01, length(x))
  f1.val1 <- f10(q.time1)
  f1.val2 <- f10(q.time2)
  f1.old <- c(f1.val1,f1.val2)
  f2.val2 <- f20(q.time2)
  f2.old <- f2.val2
  # create s and q matrices
  S1 <- cbind(time1,time1*time1/2.0)
  S2 <- cbind(time2,time2*time2/2.0)
  L1 <- int1(s.time1,
             rep(1,3*length(s.time1)))[org.ord1,org.ord1]
  L2 <- int1(s.time2,
             rep(1,3*length(s.time2)))[org.ord2,org.ord2]
  L12 <- int2(s.time1,s.time2,rep(1,3*length(s.time1)),
              rep(1,3*length(s.time2)))
  L12 <- L12[org.ord1,org.ord2]
  Lambda <- rbind(cbind(L1,L12),cbind(t(L12),L2))
  L1int <- int.f(s.time1,c(q.time1,q.time2),
                 rep(1,3*length(s.time1)))[org.ord1,]
  L2int <- int.f(s.time2,c(q.time1,q.time2),
                 rep(1,3*length(s.time2)))[org.ord2,]
  Lint <- rbind(L1int,L2int)
  L2int2 <- int.f(s.time2,q.time2,
                  rep(1,3*length(s.time2)))[org.ord2,]
  # begin iteration
  iter <- cover <- 1
  repeat {
    # update f1
    f1int1 <- int.s(f1.val1,s.time1)[org.ord1]
    f1int2 <- int.s(f1.val2,s.time2)[org.ord2]
    f2int2 <- int.s(f2.val2,s.time2)[org.ord2]
    X <- assist:::diagComp(matrix(payment*
                                  exp(-c(f1int1,f1int2+f2int2)),
                                  nrow=1),group)
    ytilde1 <- X%*%(1+c(f1int1,f1int2))-y
    T <- X%*%rbind(S1,S2)
    Q <- X%*%Lambda%*%t(X)
    fit1 <- try(ssr(ytilde1~T-1,Q,spar=spar,limnla=limnla))
    if (inherits(fit1, "try-error")) {error <- 1; break}
    dd <- fit1$coef$d; cc <- fit1$coef$c
    f1.val <- as.vector((cc%*%X)%*%Lint+dd[1]+
                        dd[2]*c(q.time1,q.time2))
    f1.val1 <- f1.val[1:(3*n1.time)]
    f1.val2 <- f1.val[-(1:(3*n1.time))]
    # update f2
    f1int2 <- int.s(f1.val2,s.time2)[org.ord2]
    X2 <- assist:::diagComp(matrix(payment2*
                                   exp(-f1int2-f2int2),nrow=1),group2)
    ytilde2 <- X2%*%(1+f2int2)-y2
    T2 <- X2%*%S2
    Q22 <- X2%*%L2%*%t(X2)
    fit2 <- try(ssr(ytilde2~T2-1,Q22,spar=spar,
                    limnla=limnla))
    if (inherits(fit2, "try-error")) {error <- 1; break}
    dd <- fit2$coef$d; cc <- fit2$coef$c
    f2.val2 <- as.vector((cc%*%X2)%*%L2int2+
                         dd[1]+dd[2]*q.time2)
    cover <- mean((c(f1.val1,f1.val2,f2.val2)-
                   c(f1.old, f2.old))^2)
    if(cover<prec || iter>maxit) break
    iter <- iter + 1
    f1.old <- c(f1.val1,f1.val2)
    f2.old <- f2.val2
  }
  tmp1 <- -int.s(f1.val1,s.time1)[org.ord1]
  tmp2 <- -int.s(f1.val2+f2.val2,s.time2)[org.ord2]
  yhat <- apply(assist:::diagComp(matrix(payment*
                                         exp(c(tmp1,tmp2)),nrow=1),
                                  group),1,sum)
  sigma <- NA
  if (error==0) sigma <- sqrt(sum((y-yhat)^2)/
                              (length(y)-fit1$df-fit2$df))
  list(fit1=fit1, fit2=fit2, iter=c(iter, cover, error),
       call=match.call(),
       f.val=list(f1=f1.val1,f2=f1.val2+f2.val2),
       f2.val=f2.val2,
       q.time=list(q.time1=q.time1,q.time2=q.time2),
       dc=list(dc1=exp(tmp1),dc2=exp(tmp2)),
       y=list(y=y,yhat=yhat), sigma=sigma)
}

The matrices S1, S2, T, T2, L1, L2, L12, and Q represent S1, S2, T, T2, Λ1, Λ2, Λ12, and Σ, respectively, in the description of the Gauss–Seidel algorithm for model (7.36) in Section 7.6.3. In the output, f1 and f2 in the list f.val contain the estimated forward rates for the two bonds evaluated at the time points q.time1 and q.time2 in the list q.time; f2.val contains the estimated credit spread evaluated at the time points q.time2; and dc1 and dc2 in the list dc contain the estimated discount rates for the two bonds evaluated at the observed time points.


