Problem Definition & Interpolation — Univariate Smoothing
web.cecs.pdx.edu/~edam/Slides/UnivariateSmoothingx4.pdf

Interpolation versus Extrapolation

• Interpolation is technically defined only for inputs that are within the range of the data set, min_i x_i ≤ x ≤ max_i x_i

• If an input is outside of this range, the model is said to be extrapolating

• A good model should do reasonable things for both cases

• Extrapolation is a much harder problem
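A minimal sketch of the distinction, in Python/NumPy rather than the course's MATLAB (illustrative only; `np.interp` is my choice, not the slides'). Inside the data range the query is interpolated; outside it, `np.interp` clamps to the endpoint value, which is one crude extrapolation policy:

```python
import numpy as np

# Five training points on [0, 1]; the data range is [min(x), max(x)] = [0, 1]
x = np.linspace(0.0, 1.0, 5)
y = x**2

# Query inside the range: interpolation (0.5 is a knot, so this is exact)
print(np.interp(0.5, x, y))   # 0.25

# Query outside the range: extrapolation; np.interp simply clamps to y[-1],
# illustrating that behavior beyond the data is a modeling choice
print(np.interp(2.0, x, y))   # 1.0
```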

J. McNames Portland State University ECE 4/557 Univariate Smoothing Ver. 1.25 3

Univariate Smoothing Overview

• Problem definition

• Interpolation

• Polynomial smoothing

• Cubic splines

• Basis splines

• Smoothing splines

• Bayes’ rule

• Density estimation

• Kernel smoothing

• Local averaging

• Weighted least squares

• Local linear models

• Prediction error estimates


Example 1: Linear Interpolation

[Figure: chirp data points with a linear interpolant — "Chirp Linear Interpolation", Output y vs. Input x on [0, 1] × [−2, 2]]

Problem Definition & Interpolation

• Smoothing Problem: Given a data set with a single input variable x, find the best function ĝ(x) that minimizes the prediction error on new inputs (probably not in the data set)

• Interpolation Problem: Same as the smoothing problem except the model is subject to the constraint ĝ(x_i) = y_i for every input-output pair (x_i, y_i) in the data set

– Linear Interpolation: Use a line between each pair of points

– Nearest Neighbor Interpolation: Find the nearest input in the data set and use the corresponding output as an approximate fit

– Polynomial Interpolation: Fit a polynomial of order n − 1 to the input-output data: ĝ(x) = Σ_{i=1}^{n} w_i x^{i−1}

– Cubic Spline Interpolation: Fit a cubic polynomial with continuous second derivatives between each pair of points (more on this later)
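The four methods above can be sketched with SciPy (an illustrative Python translation, not the slides' MATLAB; the chirp-like data below is my own stand-in):

```python
import numpy as np
from scipy.interpolate import interp1d, CubicSpline

rng = np.random.default_rng(2)
x = np.sort(rng.random(8))               # inputs
y = np.sin(2 * np.pi * 2 * x**2)         # chirp-like outputs
xt = np.linspace(x.min(), x.max(), 200)  # test inputs within the data range

lin = interp1d(x, y, kind="linear")(xt)   # line between each pair of points
nn = interp1d(x, y, kind="nearest")(xt)   # output of the nearest input
# Order n-1 polynomial through all n points via a Vandermonde system
w = np.linalg.solve(np.vander(x, increasing=True), y)
poly = np.vander(xt, len(x), increasing=True) @ w
spl = CubicSpline(x, y)(xt)               # piecewise cubic, continuous g''

# All of these satisfy the interpolation constraint g(x_i) = y_i
assert np.allclose(np.vander(x, len(x), increasing=True) @ w, y)
```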


yh = At*w;
figure;
FigureSet(1,'LTX');
h = plot(xt,yh,'b',x,y,'r.');
set(h,'MarkerSize',8);
set(h,'LineWidth',1.2);
xlabel('Input x');
ylabel('Output y');
title('Chirp Polynomial Interpolation');
set(gca,'Box','Off');
grid on;
axis([0 1 -2 2]);
AxisSet(8);
print -depsc InterpolationPolynomial;

%================================================
% Cubic Spline Interpolation
%================================================
figure;
FigureSet(1,'LTX');
yh = spline(x,y,xt);
h = plot(xt,yh,'b',x,y,'r.');
set(h,'MarkerSize',8);
set(h,'LineWidth',1.2);
xlabel('Input x');
ylabel('Output y');
title('Chirp Cubic Spline Interpolation');
set(gca,'Box','Off');

Example 1: MATLAB Code

%function [] = Interpolation();
close all;
N = 15;
rand('state',2);
x = rand(N,1);
y = sin(2*pi*2*x.^2) + 0.2*randn(N,1);
xt = (0:0.0001:1)'; % Test inputs

%================================================
% Linear Interpolation
%================================================
figure;
FigureSet(1,'LTX');
yh = interp1(x,y,xt,'linear');
h = plot(xt,yh,'b',x,y,'r.');
set(h,'MarkerSize',8);
set(h,'LineWidth',1.2);
xlabel('Input x');
ylabel('Output y');
title('Chirp Linear Interpolation');
set(gca,'Box','Off');
grid on;
axis([0 1 -2 2]);
AxisSet(8);
print -depsc InterpolationLinear;

grid on;
axis([0 1 -2 2]);
AxisSet(8);
print -depsc InterpolationCubicSpline;

%================================================
% Optimal Model
%================================================
figure;
FigureSet(1,'LTX');
yt = sin(2*pi*2*xt.^2);
h = plot(xt,yt,'b',x,y,'r.');
set(h,'MarkerSize',8);
set(h,'LineWidth',1.2);
xlabel('Input x');
ylabel('Output y');
title('Chirp Optimal Model');
set(gca,'Box','Off');
grid on;
axis([0 1 -2 2]);
AxisSet(8);
print -depsc InterpolationOptimalModel;

%================================================
% Nearest Neighbor Interpolation
%================================================
figure;
FigureSet(1,'LTX');
yh = interp1(x,y,xt,'nearest');
h = plot(xt,yh,'b',x,y,'r.');
set(h,'MarkerSize',8);
set(h,'LineWidth',1.2);
xlabel('Input x');
ylabel('Output y');
title('Chirp Nearest Neighbor Interpolation');
set(gca,'Box','Off');
grid on;
axis([0 1 -2 2]);
AxisSet(8);
print -depsc InterpolationNearestNeighbor;

%================================================
% Polynomial Interpolation
%================================================
A = zeros(N,N);
for cnt = 1:size(A,2),
    A(:,cnt) = x.^(cnt-1);
end;
w = pinv(A)*y;
At = zeros(length(xt),N);
for cnt = 1:size(A,2),
    At(:,cnt) = xt.^(cnt-1);
end;


Example 3: Polynomial Interpolation

[Figure: chirp data points with an order n − 1 polynomial interpolant — "Chirp Polynomial Interpolation", Output y vs. Input x on [0, 1] × [−2, 2]]

Example 2: Nearest Neighbor Interpolation

[Figure: chirp data points with a nearest neighbor interpolant — "Chirp Nearest Neighbor Interpolation", Output y vs. Input x on [0, 1] × [−2, 2]]

Example 3: MATLAB Code

Same data set and test inputs as linear interpolation example.


Example 2: MATLAB Code

Same data set and test inputs as linear interpolation example.



Interpolation Comments

• There are an infinite number of functions that satisfy the interpolation constraint:

g(x_i) = y_i ∀i

• Of course, we would like to choose the model that minimizes the prediction error

• Given only data, there is no way to do this exactly

• Our data set only specifies what g(x) should be at specific points

• What should it be in between these points?

• In practice, the method of interpolation is usually chosen by the user

Example 4: Cubic Spline Interpolation

[Figure: chirp data points with a cubic spline interpolant — "Chirp Cubic Spline Interpolation", Output y vs. Input x on [0, 1] × [−2, 2]]

Smoothing

• For the smoothing problem, even this constraint is relaxed

g(x_i) ≈ y_i ∀i

• The data set merely suggests what the model output should be, approximately, at some specified points

• We need another constraint or assumption about the relationship between x and y to have enough constraints to uniquely specify the model

Example 4: MATLAB Code

Same data set and test inputs as linear interpolation example.



Smoothing

y_i = g(x_i) + ε_i

• When we add noise, we can drop the interpolation constraint g(x_i) = y_i ∀i

• But we still want g(·) to be consistent with (i.e., close to) the data: g(x_i) ≈ y_i

• The methods we will discuss are biased in favor of models that are smooth

Smoothing Assumptions and Statistical Model

y = g(x) + ε

• Generally we assume that the data was generated from the statistical model above

• ε_i is a random variable with the following assumed properties

– Zero mean: E[ε] = 0
– ε_i and ε_j are independently distributed for i ≠ j
– ε_i is identically distributed

• Two additional assumptions are usually made for the smoothing problem

– g(x) is continuous
– g(x) is smooth

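A short sketch of this generative model in Python/NumPy, mirroring the chirp example's constants (illustrative, not the original code):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 15
x = rng.random(n)                    # inputs
g = np.sin(2 * np.pi * 2 * x**2)     # true function g(x): continuous and smooth
eps = 0.2 * rng.standard_normal(n)   # i.i.d. zero-mean noise eps_i
y = g + eps                          # observations y_i = g(x_i) + eps_i
```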

Bias-Variance Tradeoff

Recall that

MSE(x) = E[(ĝ(x) − g(x))²]
       = (g(x) − E[ĝ(x)])² + E[(ĝ(x) − E[ĝ(x)])²]
       = Bias² + Variance

• Fundamental smoother tradeoff:

– Smoothness of the estimate ĝ(x)
– Fit to the data

• This can also be framed as a bias-variance tradeoff

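The decomposition can be checked numerically by refitting a smoother over many simulated data sets and measuring bias and variance at one input (a Monte Carlo sketch; the degree-3 polynomial smoother and all constants are my choices):

```python
import numpy as np

rng = np.random.default_rng(0)
g = lambda x: np.sin(2 * np.pi * x)   # true function
x0 = 0.3                              # input at which MSE(x0) is evaluated

fits = []
for _ in range(2000):                 # many independent data sets
    x = np.linspace(0, 1, 20)
    y = g(x) + 0.3 * rng.standard_normal(x.size)
    w = np.polyfit(x, y, deg=3)       # a degree-3 polynomial smoother
    fits.append(np.polyval(w, x0))
fits = np.array(fits)

bias2 = (g(x0) - fits.mean())**2      # squared bias of g_hat(x0)
var = fits.var()                      # variance of g_hat(x0)
mse = ((g(x0) - fits)**2).mean()
assert abs(mse - (bias2 + var)) < 1e-9   # MSE = Bias^2 + Variance
```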

Example 5: Interpolation Optimal Model

[Figure: chirp data points with the true function sin(2π·2x²) — "Chirp Optimal Model", Output y vs. Input x on [0, 1] × [−2, 2]]


Example 6: Smoothing Problem MATLAB Code

function [] = SmoothingProblem();
A = load('MotorCycle.txt');
x = A(:,1);
y = A(:,2);
figure;
FigureSet(1,'LTX');
h = plot(x,y,'r.');
set(h,'MarkerSize',6);
xlabel('Input x');
ylabel('Output y');
title('Motorcycle Data Set');
set(gca,'Box','Off');
grid on;
ymin = min(y);
ymax = max(y);
yrng = ymax-ymin;
ymin = ymin - 0.05*yrng;
ymax = ymax + 0.05*yrng;
axis([min(x) max(x) ymin ymax]);
AxisSet(8);
%print -depsc Test;
print -depsc SmoothingProblem;

Bias-Variance Tradeoff Continued

MSE(x) = (g(x) − E[ĝ(x)])² + E[(ĝ(x) − E[ĝ(x)])²]

• Smooth models

– Less sensitive to the data
– Less variance
– Potentially high bias since they don't fit the data well

• Flexible models

– Sensitive to the data
– In the most extreme case, they interpolate the data
– High variance since they are sensitive to the data
– Low bias

return;

%================================================
% Linear
%================================================
figure;
FigureSet(1,4.5,2.8);
A = [ones(N,1) x];
w = pinv(A)*y;
yh = [ones(size(xt)) xt]*w;
h = plot(xt,yh,'b',x,y,'r.');
set(h,'MarkerSize',8);
set(h,'LineWidth',1.2);
xlabel('Input x');
ylabel('Output y');
title('Chirp Linear Least Squares');
set(gca,'Box','Off');
grid on;
ymin = min(y);
ymax = max(y);
yrng = ymax-ymin;
ymin = ymin - 0.05*yrng;
ymax = ymax + 0.05*yrng;
axis([min(x) max(x) ymin ymax]);
AxisSet(8);
print -depsc LinearLeastSquares;

Example 6: Univariate Smoothing Data

[Figure: "Motorcycle Data Set", Output y vs. Input x, roughly [5, 55] × [−100, 50]]


Example 7: MATLAB Code

Matlab/PolynomialSmoothing.m


Polynomial Smoothing

• We can fit a polynomial g(x) = Σ_{i=0}^{p−1} w_i x^i to the data using the linear modeling methods

• Note that linear models are linear in the parameters w_i

• They need not be linear in the inputs

• Alternatively, you can think of this as a linear model with p different inputs where the ith input is given by x_i = x^i

• This model is smooth in the sense that all derivatives of g(x) are continuous

• This is one measure of model smoothness

• In general, this is a terrible smoother

– Terrible at extrapolation
– The matrix inverse is often poorly conditioned and regularization is necessary
– The user has to pick the order of the polynomial p − 1

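A least-squares polynomial fit is linear in the weights, so it reduces to one matrix solve (Python/NumPy sketch; `lstsq` plays the role of `pinv(A)*y` in the MATLAB examples, and the data is my stand-in):

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.sort(rng.random(50))
y = np.sin(2 * np.pi * 2 * x**2) + 0.2 * rng.standard_normal(50)

p = 6                                  # p coefficients -> order p-1 polynomial
A = np.vander(x, p, increasing=True)   # columns 1, x, ..., x^(p-1)
w, *_ = np.linalg.lstsq(A, y, rcond=None)   # least-squares weights
yh = A @ w                             # smoothed outputs at the inputs

# The design matrix conditioning degrades quickly as p grows, which is
# why regularization is often necessary for high-order fits
print(np.linalg.cond(A))
```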

Cubic Splines

• Cubic splines are modeled after the flexible rods ship designers used to draw smooth curves

• The rod would be rigidly constrained to go through specific points (interpolation)

• The rod smoothly bent from one point to the next

• The rod naturally minimized its bending energy (i.e., curvature)

• This can be approximated by a piecewise cubic polynomial

Example 7: Polynomial Smoothing

[Figure: "Motorcycle Linear Regression" — motorcycle data with linear, quadratic, cubic, 4th-order, and 5th-order polynomial fits; Output y vs. Input x on [−10, 70] × [−150, 50]]


Cubic Spline Constraints

[Figure: sketch of a cubic spline, y vs. x]

• Cubic splines are piecewise cubic

• This means g(x) = Σ_{i=0}^{3} w_i(x) x^i has different weights between each pair of points

• For the entire region between each pair of points, the weights are fixed

• Each polynomial is defined by 4 parameters w_i(x)

• We have n + 1 regions, where n is the number of points in the data set

• Thus, we need at least 4 × (n + 1) constraints for the weights in each region to be uniquely specified

Cubic Splines Functional Form

g(x) = Σ_{i=0}^{3} w_i(x) x^i

• Unlike polynomial regression, here the parameters w_i(x) are also a function of x

• Consider a class of functions g(x) that have the following properties

– Continuous
– Continuous 1st derivative
– Continuous 2nd derivative
– Interpolates the data: g(x_i) = y_i

Cubic Spline Constraints Continued

[Figure: sketch of a cubic spline, y vs. x]

Let p_k(x) be the polynomial between the points x_k and x_{k+1}. We need 4 × (n + 1) constraints for the problem to be well defined.

Property                     Expression                          Constraints
Interpolation                g(x_i) = y_i                        n
Continuous                   p_k(x_{k+1}) = p_{k+1}(x_{k+1})     n
Continuous derivative        p′_k(x_{k+1}) = p′_{k+1}(x_{k+1})   n
Continuous 2nd derivative    p″_k(x_{k+1}) = p″_{k+1}(x_{k+1})   n

Natural splines have 4 additional constraints:

p‴_0(x_1) = 0    p″_0(x_1) = 0
p‴_n(x_n) = 0    p″_n(x_n) = 0

Cubic Splines Smoothness Definition

• Out of all the functions that meet the above criteria, consider those that also minimize the approximate "curvature" of g(x)

C ≡ ∫_{x_min}^{x_max} (d²g(x)/dx²)² dx

• These are piecewise cubics and are called cubic splines

• In the sense of satisfying the criteria listed above and minimizing the curvature C, cubic splines are optimal

• Even with all of these constraints, g(x) is not uniquely specified

• There are several cubic splines that meet the strict criteria and have the same curvature

• The most popular additional constraints are

g″(x_min) = 0    g″(x_max) = 0

• These are called natural cubic splines

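SciPy's `CubicSpline` supports these boundary conditions directly via `bc_type="natural"`, which makes the two defining properties easy to verify (an illustrative Python check, not the course code; the data points are mine):

```python
import numpy as np
from scipy.interpolate import CubicSpline

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([0.0, 1.0, 0.0, -1.0, 0.0])
cs = CubicSpline(x, y, bc_type="natural")   # enforces g'' = 0 at both ends

assert np.allclose(cs(x), y)       # interpolates the data
assert abs(cs(x[0], 2)) < 1e-9     # zero second derivative at x_min
assert abs(cs(x[-1], 2)) < 1e-9    # zero second derivative at x_max
```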


Example 8: Basis Function 0

[Figure: "Basis Function B0(x)", Output y vs. Input x on [0, 10] × [0, 1]]

Basis Splines

• You could solve for the 4(n + 1) model coefficients by solving a set of 4(n + 1) linear equations

• This is cumbersome and mathematically very inefficient

• An easier way is to use basis functions

• Mathematically, each basis function is defined recursively

b_{i,j}(x) = (x − k_j)/(k_{i+j} − k_j) · b_{i−1,j}(x) + (k_{i+j+1} − x)/(k_{i+j+1} − k_{j+1}) · b_{i−1,j+1}(x)

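The recursion above (the Cox–de Boor recursion) can be implemented directly; this sketch follows the slide's b_{i,j} indexing on a uniform knot vector (the knots and test values are my own):

```python
import numpy as np

def bspline_basis(i, j, k, x):
    """Degree-i B-spline basis function b_{i,j} on knot vector k."""
    if i == 0:
        # Base case: indicator of the knot span [k_j, k_{j+1})
        return np.where((k[j] <= x) & (x < k[j + 1]), 1.0, 0.0)
    left = (x - k[j]) / (k[i + j] - k[j]) * bspline_basis(i - 1, j, k, x)
    right = ((k[i + j + 1] - x) / (k[i + j + 1] - k[j + 1])
             * bspline_basis(i - 1, j + 1, k, x))
    return left + right

k = np.arange(11.0)               # uniform knots 0, 1, ..., 10
x = np.linspace(0.0, 6.9, 50)
b3 = bspline_basis(3, 0, k, x)    # cubic basis, finite support [k_0, k_4)
assert np.all(b3 >= 0) and np.all(b3[x >= 4.0] == 0)
```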

Example 8: Basis Function 1

[Figure: "Basis Function B1(x)", Output y vs. Input x on [0, 10] × [0, 1]]

Basis Splines Continued

• The output of our model can then be written as

g(x) = Σ_{i=−2}^{n−1} w_i b_{3,i}(x)

• Numerically, this can be solved much more quickly (the cost is proportional to n)

• Since the basis functions have finite support (i.e., finite span), the equivalent A matrix is banded

• Basis splines also have the nice property that they sum to unity

Σ_{j=1−i}^{n−1} b_{i,j}(x) = 1    ∀x ∈ [k_1, k_n]


Example 8: MATLAB Code

Matlab/BasisFunctions.m


Example 8: Basis Function 2

[Figure: "Basis Function B2(x)", Output y vs. Input x on [0, 10] × [0, 1]]

Smoothing Splines

• For smoothing, we do not require g(x_i) = y_i

• But we would like it to be close: g(x_i) ≈ y_i

• How do we trade off smoothness (low variance) for a good fit to the data (low bias)?

• One way is to find the g(x) that minimizes the following performance criterion:

E_λ = Σ_{i=1}^{n} (y_i − g(x_i))² + λ ∫_{−∞}^{+∞} (g″(x))² dx

• Contrast this with cubic splines, for which we required the first term to be zero

• The second term is a roughness penalty

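SciPy's `make_smoothing_spline` minimizes exactly this penalized criterion, with `lam` playing the role of λ (available in SciPy ≥ 1.10, an assumption about your install; the data is my stand-in for the motorcycle set):

```python
import numpy as np
from scipy.interpolate import make_smoothing_spline

rng = np.random.default_rng(3)
x = np.sort(rng.random(40))
y = np.sin(2 * np.pi * 2 * x**2) + 0.2 * rng.standard_normal(40)

rough = make_smoothing_spline(x, y, lam=1e-8)   # small lambda: near interpolation
smooth = make_smoothing_spline(x, y, lam=1e-1)  # large lambda: near linear fit

# Raising the roughness penalty can only increase the residual sum of squares
rss = lambda g: float(np.sum((y - g(x))**2))
assert rss(rough) <= rss(smooth) + 1e-12
```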

Example 8: Basis Function 3

[Figure: "Basis Function B3(x)", Output y vs. Input x on [0, 10] × [0, 1]]


Example 9: Smoothing Spline

[Figure: "Motorcycle Data Smoothing Spline Regression" — fits for α = 1.0, α = 0.5, and α = 0.0; Output y vs. Input x on [−10, 70] × [−150, 50]]

Smoothing Splines Continued

E_λ = Σ_{i=1}^{n} (y_i − g(x_i))² + λ ∫_{−∞}^{+∞} (g″(x))² dx

• λ is a user-specified parameter that controls the tradeoff

• It turns out the optimal solution (in the sense of minimizing E_λ) is a smoothing spline

• A smoothing spline is identical to a cubic spline in form

– There is a cubic polynomial between each pair of points
– Same number of knots
– Same number of different sets of polynomials

• Unlike the cubic spline, we now drop the constraint that g(x_i) = y_i

• Instead, g(x_i) = ŷ_i for some set of values ŷ_i

Example 10: Smoothing Spline

[Figure: "Motorcycle Data Smoothing Spline Regression α = 0.0001", Output y vs. Input x on [−10, 70] × [−150, 50]]

Smoothing Splines Comments

E_λ = Σ_{i=1}^{n} (y_i − g(x_i))² + λ ∫_{−∞}^{+∞} (g″(x))² dx

• Smoothing splines are smooth in the same sense as cubic splines

– If cubic, the second derivative is continuous
– If quadratic, the first derivative is continuous
– If linear, the function is continuous

• For a cubic smoothing spline,

– As λ → ∞, g(x) approaches a linear least squares fit to the data (i.e., g″(x) → 0)
– As λ → 0, g(x) becomes an interpolating cubic spline

• This is implemented in MATLAB as csaps

• Instead of λ, it takes an equivalent parameter scaled between 0 (linear least squares fit) and 1 (cubic spline interpolation)


Example 10: Smoothing Spline

[Figure: "Motorcycle Data Smoothing Spline Regression α = 0.2000", Output y vs. Input x on [−10, 70] × [−150, 50]]

Example 10: Smoothing Spline

[Figure: "Motorcycle Data Smoothing Spline Regression α = 0.0010", Output y vs. Input x on [−10, 70] × [−150, 50]]

Example 10: Smoothing Spline

[Figure: "Motorcycle Data Smoothing Spline Regression α = 0.5000", Output y vs. Input x on [−10, 70] × [−150, 50]]

Example 10: Smoothing Spline

[Figure: "Motorcycle Data Smoothing Spline Regression α = 0.0100", Output y vs. Input x on [−10, 70] × [−150, 50]]


Example 10: MATLAB Code

function [] = SmoothingSplineEx();
close all;
A = load('MotorCycle.txt');
xr = A(:,1); % Raw values
yr = A(:,2); % Raw values
x = unique(xr);
y = zeros(size(x));
for cnt = 1:length(x),
    y(cnt) = mean(yr(xr==x(cnt)));
end;
N = size(A,1); % No. data set points
xt = (-10:0.2:70)';
NT = length(xt); % No. test points
NS = 3; % No. of different splines
yh = zeros(NT,NS);
yh(:,3) = csaps(x',y',0,xt')';
yh(:,2) = csaps(x',y',0.5,xt')';
yh(:,1) = csaps(x',y',1.0,xt')';
FigureSet(1,'LTX');
h = plot(xt,yh,x,y,'k.');
set(h,'MarkerSize',8);
set(h,'LineWidth',1.2);

Example 10: Smoothing Spline

[Figure: "Motorcycle Data Smoothing Spline Regression α = 0.9000", Output y vs. Input x on [−10, 70] × [−150, 50]]

xlabel('Input x');
ylabel('Output y');
title('Motorcycle Data Smoothing Spline Regression');
set(gca,'Box','Off');
grid on;
axis([-10 70 -150 90]);
AxisSet(8);
legend('\alpha = 1.0','\alpha = 0.5','\alpha = 0.0',4);
print -depsc SmoothingSplineEx;

L = [0.0001 0.001 0.01 0.2 0.5 0.9 0.99];
for cnt = 1:length(L),
    alpha = L(cnt);
    figure;
    FigureSet(1,'LTX');
    yh = csaps(x',y',alpha,xt')';
    h = plot(xt,yh,x,y,'k.');
    set(h,'MarkerSize',8);
    set(h,'LineWidth',1.2);
    xlabel('Input x');
    ylabel('Output y');
    st = sprintf('Motorcycle Data Smoothing Spline Regression \\alpha=%6.4f',alpha);
    title(st);
    set(gca,'Box','Off');
    grid on;
    axis([-10 70 -150 90]);
    AxisSet(8);
    st = sprintf('print -depsc SmoothingSplineEx%04d;',round(alpha*10000));
    eval(st);
end;

Example 10: Smoothing Spline

[Figure: "Motorcycle Data Smoothing Spline Regression α = 0.9900", Output y vs. Input x on [−10, 70] × [−150, 50]]


Density Estimation

• Then a kernel density estimator is simply expressed as

f̂(x) = (1/n) Σ_{i=1}^{n} b_σ(x − x_i)

• The width of the kernel is specified by σ. Typically

b_σ(u) = (1/σ) b(u/σ)

where it is easy to show that ∫ b_σ(u) du = 1 for any value of σ

• Bumps shaped like a Gaussian are popular

b(u) = (1/√(2π)) e^{−u²/2}

• Typically the bumps have even symmetry:

b(u) = b(−u) = b(|u|)

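With Gaussian bumps the estimator is a few lines in Python/NumPy (an illustrative sketch; the bandwidth and synthetic data are my choices):

```python
import numpy as np

def kde(xq, data, sigma):
    """Kernel density estimate f(x) = (1/n) sum_i b_sigma(x - x_i),
    with Gaussian bumps b_sigma(u) = (1/sigma) b(u/sigma)."""
    u = (xq[:, None] - data[None, :]) / sigma
    b = np.exp(-0.5 * u**2) / (np.sqrt(2 * np.pi) * sigma)
    return b.mean(axis=1)

rng = np.random.default_rng(4)
data = rng.standard_normal(500)
xq = np.linspace(-6.0, 6.0, 1201)
f = kde(xq, data, sigma=0.5)

# The estimate integrates to (approximately) 1, as a density must
area = np.sum((f[1:] + f[:-1]) / 2 * np.diff(xq))
assert abs(area - 1.0) < 1e-2
```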

Review of Bayes’ Rule

• Bayes' rule says that two discrete-valued random variables A and B have the following relationship

Pr{B|A} = Pr{A, B}/Pr{A} = Pr{A|B} Pr{B}/Pr{A}

• Recall that earlier we found that the g(x) that minimizes the MSE is given by

Ŷ = g*(x) = E[Y|X = x]

• For smoothing, we can use the continuous analog of Bayes' rule to estimate E[Y|X = x]

f(y|X = x) = f(x, y)/f(x) = f(x|Y = y) f(y)/f(x)

Example 11: Density Estimation

[Figure: "Motorcycle Data Density Estimation w = 0.1", Density p(x) vs. Input x on [−10, 70] × [0, 0.1]]

Continuous Bayes’ Rule

f(y|X = x) = f(x, y)/f(x) = f(x|Y = y) f(y)/f(x)

• E[Y|X = x] is given by

E[Y|X = x] = ∫_{−∞}^{+∞} y f(y|X = x) dy

• In order to estimate these equations we need a means of estimating the densities f(x) and f(x, y)

• A popular method of estimating a density is to add a series of "bumps" together

• The bumps are called kernels and should have the following property

∫_{−∞}^{+∞} b_σ(u) du = 1

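Substituting kernel density estimates for f(x, y) and f(x) into E[Y|X = x] reduces, after the integral over y, to a kernel-weighted average of the y_i (the Nadaraya–Watson estimator; the name and this Python sketch are mine, not the slides'):

```python
import numpy as np

def cond_mean(xq, x, y, sigma):
    """Estimate E[Y|X=x] as a kernel-weighted average of the outputs."""
    w = np.exp(-0.5 * ((xq[:, None] - x[None, :]) / sigma)**2)
    return (w @ y) / w.sum(axis=1)

rng = np.random.default_rng(5)
x = rng.uniform(0.0, 1.0, 300)
y = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(300)

xq = np.array([0.25, 0.75])          # sin(2*pi*x) is +1 and -1 here
yh = cond_mean(xq, x, y, sigma=0.05)
assert abs(yh[0] - 1.0) < 0.2 and abs(yh[1] + 1.0) < 0.2
```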


Example 11: Density Estimation

[Figure: "Motorcycle Data Density Estimation w = 1.0", Density p(x) vs. Input x on [−10, 70] × [0, 0.1]]

Example 11: Density Estimation

[Figure: "Motorcycle Data Density Estimation w = 0.2", Density p(x) vs. Input x on [−10, 70] × [0, 0.1]]

Example 11: Density Estimation

[Figure: "Motorcycle Data Density Estimation w = 5.0", Density p(x) vs. Input x on [−10, 70] × [0, 0.1]]

Example 11: Density Estimation

[Figure: "Motorcycle Data Density Estimation w = 0.5", Density p(x) vs. Input x on [−10, 70] × [0, 0.1]]


Density Estimation in Higher Dimensions

• Density estimation can be extended to higher dimensions in the obvious way

f̂(x) = (1/n) Σ_{i=1}^{n} Π_{j=1}^{p} b_σ(x_j − x_{i,j})

where x_j is the jth element of the input vector x and x_{i,j} is the jth element of the ith input vector in the data set

• Although you can use this for large values of p, it is not recommended

• The estimate becomes inaccurate very quickly as the number of dimensions grows

• For one or two dimensions this is a pretty good technique

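The product-kernel estimator vectorizes naturally (Python sketch with Gaussian bumps; the dimensions and synthetic data are my choices):

```python
import numpy as np

def kde_nd(xq, data, sigma):
    """Product-kernel estimate f(x) = (1/n) sum_i prod_j b_sigma(x_j - x_ij)."""
    u = (xq[:, None, :] - data[None, :, :]) / sigma        # (queries, n, p)
    b = np.exp(-0.5 * u**2) / (np.sqrt(2 * np.pi) * sigma)
    return b.prod(axis=2).mean(axis=1)   # product over dims, average over data

rng = np.random.default_rng(6)
data = rng.standard_normal((400, 2))     # n = 400 points in p = 2 dimensions
xq = np.array([[0.0, 0.0], [4.0, 4.0]])
f = kde_nd(xq, data, sigma=0.5)
assert f[0] > f[1]    # higher density near the bulk of the data
```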

Example 11: MATLAB Code

function [] = DensityEx();
close all;
A = load('MotorCycle.txt');
x = A(:,1); % Raw values
y = A(:,2); % Raw values
W = [0.1 0.2 0.5 1.0 5.0];
xt = (-10:0.05:70)';
for c1 = 1:length(W),
    w = W(c1);
    bs = zeros(size(xt)); % Bump sum
    for c2 = 1:length(x),
        bs = bs + exp(-(xt-x(c2)).^2/(2*w.^2))/sqrt(2*pi*w^2);
    end;
    bs = bs/length(x);
    figure;
    FigureSet(1,'LTX');
    h = plot(x,zeros(size(x)),'k.',xt,bs);
    set(h,'LineWidth',1.5);
    xlabel('Input x');
    ylabel('Density p(x)');
    st = sprintf('Motorcycle Data Density Estimation w=%5.1f',w);

Example 12: 2D Density Estimation

[Figure: "Motorcycle Data Input−Output Density Estimation w = 0.05", Output y vs. Input x with a density color scale]

    title(st);
    set(gca,'Box','Off');
    grid on;
    axis([-10 70 0 0.1]);
    AxisSet(8);
    st = sprintf('print -depsc DensityEx%02d;',round(w*10));
    eval(st);
end;


Example 12: 2D Density Estimation

[Figure: "Motorcycle Data Input−Output Density Estimation w = 0.50", Output y vs. Input x with a density color scale]

Example 12: 2D Density Estimation

[Figure: "Motorcycle Data Input−Output Density Estimation w = 0.10", Output y vs. Input x with a density color scale]

[Figure — Example 12: 2D Density Estimation. "Motorcycle Data Input-Output Density Estimation w = 1.00"; Output y vs. Input x.]

[Figure — Example 12: 2D Density Estimation. "Motorcycle Data Input-Output Density Estimation w = 0.20"; Output y vs. Input x.]


Density Estimation and Scaling

• In higher dimensions it is important to scale each input to have the same variance

• The following example shows the same data set without scaling

• Notice the oval-shaped bumps

Example 12: MATLAB Code

function [] = DensityEx2D();
close all;
A = load('MotorCycle.txt');
xr = A(:,1); % Raw values
yr = A(:,2); % Raw values
xm = mean(xr);
ym = mean(yr);
xs = std(xr);
ys = std(yr);
x = (xr-xm)/xs;
y = (yr-ym)/ys;
W = [0.05 0.1 0.2 0.5 1.0];
xst = -2.0:0.02:2.5; % X-test points
yst = -2.5:0.02:2.5; % Y-test points
[xmt,ymt] = meshgrid(xst,yst); % Grids of scaled test points
xt = xst*xs + xm; % Unscaled x-test values
yt = yst*ys + ym; % Unscaled y-test values
for c1 = 1:length(W),
    w = W(c1);
    bs = zeros(size(xmt)); % Bump sum

[Figure — Example 13: 2D Density Estimation. "Motorcycle Data No Scaling Density Estimation w = 0.50"; Output y vs. Input x.]

    for c2 = 1:length(x),
        bx = exp(-(xmt-x(c2)).^2/(2*w^2))/sqrt(2*pi*w^2);
        by = exp(-(ymt-y(c2)).^2/(2*w^2))/sqrt(2*pi*w^2);
        bs = bs + bx.*by;
    end;
    bs = bs/length(x);
    figure;
    FigureSet(1,'LTX');
    h = imagesc(xt,yt,bs);
    hold on;
    h = plot(xr,yr,'k.',xr,yr,'w.');
    set(h(1),'MarkerSize',4);
    set(h(2),'MarkerSize',2);
    hold off;
    set(gca,'YDir','Normal');
    xlabel('Input x');
    ylabel('Output y');
    st = sprintf('Motorcycle Data Input-Output Density Estimation w=%5.2f',w);
    title(st);
    set(gca,'Box','Off');
    colorbar;
    AxisSet(8);
    st = sprintf('print -depsc DensityEx2D%03d;',round(w*100));
    eval(st);
end;


[Figure — Example 13: 2D Density Estimation. "Motorcycle Data No Scaling Density Estimation w = 10.00"; Output y vs. Input x.]

[Figure — Example 13: 2D Density Estimation. "Motorcycle Data No Scaling Density Estimation w = 1.00"; Output y vs. Input x.]

[Figure — Example 13: 2D Density Estimation. "Motorcycle Data No Scaling Density Estimation w = 20.00"; Output y vs. Input x.]

[Figure — Example 13: 2D Density Estimation. "Motorcycle Data No Scaling Density Estimation w = 5.00"; Output y vs. Input x.]


    h = plot(x,y,'k.',x,y,'w.');
    set(h(1),'MarkerSize',4);
    set(h(2),'MarkerSize',2);
    hold off;
    set(gca,'YDir','Normal');
    xlabel('Input x');
    ylabel('Output y');
    st = sprintf('Motorcycle Data No Scaling Density Estimation w=%6.2f',w);
    title(st);
    set(gca,'Box','Off');
    colorbar;
    AxisSet(8);
    st = sprintf('print -depsc DensityEx2Db%03d;',round(w*10));
    eval(st);
end;

[Figure — Example 13: 2D Density Estimation. "Motorcycle Data No Scaling Density Estimation w = 50.00"; Output y vs. Input x.]

Kernel Smoothing Derivation

The following equations compose the Nadaraya-Watson estimator of $E[y|x]$:

$$E[y|x] = \int_{-\infty}^{\infty} y\, f(y|x)\, dy = \int_{-\infty}^{\infty} y\, \frac{f(x,y)}{f(x)}\, dy = \frac{\int_{-\infty}^{\infty} y\, f(x,y)\, dy}{f(x)}$$

The two densities can be estimated as follows:

$$\hat{f}(x,y) = \frac{1}{n}\sum_{i=1}^{n} b_\sigma(|x - x_i|)\, b_\sigma(|y - y_i|) \qquad \hat{f}(x) = \frac{1}{n}\sum_{i=1}^{n} b_\sigma(|x - x_i|)$$

Example 13: MATLAB Code

function [] = DensityEx2Db();
% This is the same as DensityEx2D, except no scaling is used.
close all;
A = load('MotorCycle.txt');
x = A(:,1); % Raw values
y = A(:,2); % Raw values
W = [0.5 1.0 2.0 5.0 10.0 20.0 50.0];
xt = 0:0.5:60; % X-test points
yt = -150:90; % Y-test points
[xmt,ymt] = meshgrid(xt,yt); % Grids of (unscaled) test points
for c1 = 1:length(W),
    w = W(c1);
    bs = zeros(size(xmt)); % Bump sum
    for c2 = 1:length(x),
        bx = exp(-(xmt-x(c2)).^2/(2*w^2))/sqrt(2*pi*w^2);
        by = exp(-(ymt-y(c2)).^2/(2*w^2))/sqrt(2*pi*w^2);
        bs = bs + bx.*by;
    end;
    bs = bs/length(x);
    figure;
    FigureSet(1,'LTX');
    h = imagesc(xt,yt,bs);
    hold on;


Kernel Smoothing Derivation Continued

Thus, by combining the equations on the previous slides we obtain

$$E[y|x] \approx \hat{g}(x) = \frac{\sum_{i=1}^{n} y_i\, b_\sigma(|x - x_i|)}{\sum_{i=1}^{n} b_\sigma(|x - x_i|)}$$

• Popular kernels include
  – Epanechnikov: $b(u) = c\,(1 - u^2)\, p(u)$
  – Biweight: $b(u) = c\,(1 - u^2)^2\, p(u)$
  – Triweight: $b(u) = c\,(1 - u^2)^3\, p(u)$
  – Triangular: $b(u) = c\,(1 - |u|)\, p(u)$
  – Gaussian: $b(u) = c\, e^{-u^2}$
  – Sinc: $b(u) = c\, \mathrm{sinc}(u)$

• Here $c$ is a constant chosen to meet the constraint $\int_{-\infty}^{\infty} b(u)\, du = 1$

• $p(u)$ is the unit pulse:

$$p(u) = \begin{cases} 1 & |u| \le 1 \\ 0 & \text{otherwise} \end{cases}$$
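The constants $c$ can be checked numerically. A short Python sketch (mine, not from the slides) that recovers the normalization constants for the compact-support kernels and the Gaussian:

```python
import numpy as np

def c_for(b, u):
    """Normalization constant c such that the integral of c*b(u) du is 1
    (simple Riemann sum; each kernel vanishes at the grid endpoints)."""
    return 1.0 / (b.sum() * (u[1] - u[0]))

u = np.linspace(-1, 1, 200001)           # unit-pulse support, |u| <= 1
c_epan = c_for(1 - u**2, u)              # Epanechnikov -> 3/4
c_biwt = c_for((1 - u**2)**2, u)         # Biweight     -> 15/16
c_triw = c_for((1 - u**2)**3, u)         # Triweight    -> 35/32
c_tria = c_for(1 - np.abs(u), u)         # Triangular   -> 1

ug = np.linspace(-10, 10, 200001)        # wide grid for the Gaussian e^{-u^2}
c_gauss = c_for(np.exp(-ug**2), ug)      # -> 1/sqrt(pi)
```

The closed forms $3/4$, $15/16$, $35/32$, $1$, and $1/\sqrt{\pi}$ all follow from the constraint $\int b(u)\,du = 1$.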

Kernel Smoothing Derivation Continued (1)

$$E[y|x] \approx \frac{\int_{-\infty}^{\infty} y\, \hat{f}(x,y)\, dy}{\hat{f}(x)}$$

$$\hat{f}(x)\, E[y|x] \approx \int_{-\infty}^{\infty} y \left( \frac{1}{n}\sum_{i=1}^{n} b_\sigma(|x - x_i|)\, b_\sigma(|y - y_i|) \right) dy$$

$$= \frac{1}{n}\sum_{i=1}^{n} b_\sigma(|x - x_i|) \int_{-\infty}^{\infty} y\, b_\sigma(|y - y_i|)\, dy$$

$$= \frac{1}{n}\sum_{i=1}^{n} b_\sigma(|x - x_i|) \int_{-\infty}^{\infty} (y - y_i + y_i)\, b_\sigma(|y - y_i|)\, dy$$

$$= \frac{1}{n}\sum_{i=1}^{n} b_\sigma(|x - x_i|) \left[ y_i \int_{-\infty}^{\infty} b_\sigma(|y - y_i|)\, dy + \int_{-\infty}^{\infty} (y - y_i)\, b_\sigma(|y - y_i|)\, dy \right]$$

[Figure — Example 14: Kernels. Six normalized kernels plotted over u ∈ [−2, 2]: Epanechnikov, Biweight, Triweight, Triangular, Gaussian, Sinc.]

Kernel Smoothing Derivation Continued (2)

$$\hat{f}(x)\, E[y|x] = \frac{1}{n}\sum_{i=1}^{n} b_\sigma(|x - x_i|) \left[ y_i \int_{-\infty}^{\infty} b_\sigma(|y - y_i|)\, dy + \int_{-\infty}^{\infty} (y - y_i)\, b_\sigma(|y - y_i|)\, dy \right]$$

$$= \frac{1}{n}\sum_{i=1}^{n} b_\sigma(|x - x_i|) \left[ y_i + \int_{-\infty}^{\infty} u\, b_\sigma(|u|)\, du \right]$$

$$= \frac{1}{n}\sum_{i=1}^{n} y_i\, b_\sigma(|x - x_i|)$$

since $\int b_\sigma(|y - y_i|)\, dy = 1$ and $\int u\, b_\sigma(|u|)\, du = 0$ for a symmetric kernel. Therefore

$$E[y|x] = \frac{\frac{1}{n}\sum_{i=1}^{n} y_i\, b_\sigma(|x - x_i|)}{\frac{1}{n}\sum_{i=1}^{n} b_\sigma(|x - x_i|)} = \frac{\sum_{i=1}^{n} y_i\, b_\sigma(|x - x_i|)}{\sum_{i=1}^{n} b_\sigma(|x - x_i|)}$$


Kernel Smoothing Comments

$$E[y|x] \approx \hat{g}(x) = \frac{\sum_{i=1}^{n} y_i\, b_\sigma(|x - x_i|)}{\sum_{i=1}^{n} b_\sigma(|x - x_i|)} = \sum_{i=1}^{n} y_i \left( \frac{b_\sigma(|x - x_i|)}{\sum_{j=1}^{n} b_\sigma(|x - x_j|)} \right)$$

• Kernel smoothing can be written as a weighted average

$$\hat{g}(x) = \sum_{i=1}^{n} y_i\, w_i(x) \qquad w_i(x) = \frac{b_\sigma(|x - x_i|)}{\sum_{j=1}^{n} b_\sigma(|x - x_j|)}$$

• Note that by definition $\sum_{i=1}^{n} w_i(x) = 1$

• If all the weights were equal, $w_i(x) = \frac{1}{n}$, then $\hat{g}(x) = \bar{y}$

• This occurs as $\sigma \to \infty$
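This weighted-average form is easy to implement directly. A minimal Python sketch (a Gaussian-bump Nadaraya-Watson smoother on made-up data, not the motorcycle set; `kernel_smooth` is my own name):

```python
import numpy as np

def kernel_smooth(x, y, xt, sigma):
    """Nadaraya-Watson estimate of E[y|x] at the query points xt
    using Gaussian bumps b_sigma(|x - x_i|)."""
    d = xt[:, None] - x[None, :]              # all query/data differences
    b = np.exp(-d**2 / (2 * sigma**2))        # unnormalized kernel weights
    w = b / b.sum(axis=1, keepdims=True)      # w_i(x): each row sums to 1
    return w @ y                              # weighted average of outputs

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 10, 100))
y = np.sin(x) + 0.3 * rng.standard_normal(100)
xt = np.linspace(0, 10, 50)

g = kernel_smooth(x, y, xt, sigma=0.5)
# As sigma grows, every w_i(x) approaches 1/n and g(x) approaches mean(y)
g_wide = kernel_smooth(x, y, xt, sigma=1e6)
```

Because $\hat{g}(x)$ is a convex combination of the $y_i$, it stays between $\min_i y_i$ and $\max_i y_i$, the boundedness property noted on a later slide.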

Example 14: MATLAB Code

function [] = Kernels();
ST = 0.01;
x = (-2.2:ST:2.2)';
u = abs(x);
I = (u <= 1);
kep = (1-u.^2).*I;       % Epanechnikov
kbw = (1-u.^2).^2.*I;    % Biweight
ktw = (1-u.^2).^3.*I;    % Triweight
ktr = (1-u).*I;          % Triangular
kga = exp(-u.^2);        % Gaussian
ksn = sinc(u);           % Sinc
kep = kep/(sum(kep)*ST); % Normalize
kbw = kbw/(sum(kbw)*ST); % Normalize
ktw = ktw/(sum(ktw)*ST); % Normalize
ktr = ktr/(sum(ktr)*ST); % Normalize
kga = kga/(sum(kga)*ST); % Normalize
ksn = ksn/(sum(ksn)*ST); % Normalize
K = [kep kbw ktw ktr kga ksn];
L = {'Epanechnikov','Biweight','Triweight','Triangular','Gaussian','Sinc'};
FigureSet(1,4.5,2.8);
for cnt = 1:6,
    subplot(2,3,cnt);

Kernel Smoothing Effect of Support

$$E[y|x] \approx \hat{g}(x) = \frac{\sum_{i=1}^{n} y_i\, b_\sigma(|x - x_i|)}{\sum_{i=1}^{n} b_\sigma(|x - x_i|)}$$

As the width decreases ($\sigma \downarrow$) one of two things happens

• If $b(u)$ has infinite support,
  – All of the equivalent weights become nearly equal to zero
  – The weight from the nearest neighbor dominates
  – Thus $\hat{g}(x)$ does nearest neighbor interpolation as $\sigma \to 0$

• If $b(u)$ has finite support,
  – At some values of $x$ all of the weights may be 0
  – If this happens, $\hat{g}(x)$ at these points is not defined (depends on implementation)

    h = plot([-5 5],[0 0],'k:',x,K(:,cnt));
    set(h(2),'LineWidth',1.5);
    title(char(L(cnt)));
    box off;
    axis([min(x) max(x) -0.3 1.2]);
end;
AxisSet(8);
print -depsc Kernels;


[Figure — Example 15: Kernel Smoothing. "Motorcycle Data Kernel Smoothing Epanechnikov Kernel w = 1.0000"; Output y vs. Input x.]

Kernel Smoothing Bias-Variance Tradeoff

$$E[y|x] \approx \hat{g}(x) = \frac{\frac{1}{n}\sum_{i=1}^{n} y_i\, b_\sigma(|x - x_i|)}{\frac{1}{n}\sum_{i=1}^{n} b_\sigma(|x - x_i|)}$$

• Thus, as with smoothing splines, there is a single parameter that controls the tradeoff of smoothness (high bias) for the ability of the model to fit the data (high variance)

• Kernel smoothers have bounded outputs

$$\min_i y_i \le \min_x \hat{g}(x) \le \hat{g}(x) \le \max_x \hat{g}(x) \le \max_i y_i$$

• In this sense, they are more stable than smoothing splines

• Recall that smoothing splines diverge outside of the data range

• However, kernel smoothers are more likely to round off sharp edges, peaks, and troughs

[Figure — Example 15: Kernel Smoothing. "Motorcycle Data Kernel Smoothing Epanechnikov Kernel w = 2.0000"; Output y vs. Input x.]

[Figure — Example 15: Kernel Smoothing. "Motorcycle Data Kernel Smoothing Epanechnikov Kernel w = 0.1000"; Output y vs. Input x.]


[Figure — Example 15: Kernel Smoothing. "Motorcycle Data Kernel Smoothing Gaussian Kernel w = 0.1000"; Output y vs. Input x.]

[Figure — Example 15: Kernel Smoothing. "Motorcycle Data Kernel Smoothing Epanechnikov Kernel w = 5.0000"; Output y vs. Input x.]

[Figure — Example 15: Kernel Smoothing. "Motorcycle Data Kernel Smoothing Gaussian Kernel w = 1.0000"; Output y vs. Input x.]

[Figure — Example 15: Kernel Smoothing. "Motorcycle Data Kernel Smoothing Epanechnikov Kernel w = 10.0000"; Output y vs. Input x.]


[Figure — Example 15: Kernel Smoothing. "Motorcycle Data Kernel Smoothing Gaussian Kernel w = 10.0000"; Output y vs. Input x.]

[Figure — Example 15: Kernel Smoothing. "Motorcycle Data Kernel Smoothing Gaussian Kernel w = 2.0000"; Output y vs. Input x.]

Example 15: MATLAB Code

function [] = KernelSmoothingEx();
close all;
A = load('MotorCycle.txt');
x = A(:,1); % Raw values
y = A(:,2); % Raw values
xt = (-10:0.05:70)';
W = [0.1 1.0 2.0 3.0 5.0 10.0];
% Epanechnikov Kernel
for cnt = 1:length(W),
    w = W(cnt);
    figure;
    FigureSet(1,'LTX');
    yh = Kernel(x,y,xt,w,2);
    h = plot(xt,yh,'b',x,y,'k.');
    set(h,'MarkerSize',8);
    set(h,'LineWidth',1.2);
    xlabel('Input x');
    ylabel('Output y');
    st = sprintf('Motorcycle Data Kernel Smoothing Epanechnikov Kernel w=%6.4f',w);
    title(st);
    set(gca,'Box','Off');
    grid on;
    axis([-10 70 -150 90]);

[Figure — Example 15: Kernel Smoothing. "Motorcycle Data Kernel Smoothing Gaussian Kernel w = 5.0000"; Output y vs. Input x.]


Local Averaging Concept

[Figure — "Motorcycle Data Set"; Head Acceleration (g) vs. Time (ms), with the neighborhood of a query point and its local average highlighted.]

    AxisSet(8);
    st = sprintf('print -depsc EKernelSmoothingEx%03d;',round(w*10));
    eval(st);
end;
% Gaussian Kernel
for cnt = 1:length(W),
    w = W(cnt);
    figure;
    FigureSet(1,'LTX');
    yh = Kernel(x,y,xt,w,1);
    h = plot(xt,yh,'b',x,y,'k.');
    set(h,'MarkerSize',8);
    set(h,'LineWidth',1.2);
    xlabel('Input x');
    ylabel('Output y');
    st = sprintf('Motorcycle Data Kernel Smoothing Gaussian Kernel w=%6.4f',w);
    title(st);
    set(gca,'Box','Off');
    grid on;
    axis([-10 70 -150 90]);
    AxisSet(8);
    st = sprintf('print -depsc GKernelSmoothingEx%03d;',round(w*10));
    eval(st);
end;

MATLAB Code

function [] = LocalAverageConcept();
D = load('Motorcycle.txt');
x = D(:,1);
y = D(:,2);
[x,is] = sort(x);
y = y(is);
q = 30; % Query point (test input)
k = 10; % No. of neighbors
d = (x-q).^2;
[ds,is] = sort(d);
xs = x(is);
ys = y(is);
xn = xs(1:k);
yn = ys(1:k);
[xsmin,imin] = min(xs(1:k));
[xsmax,imax] = max(xs(1:k));
imin = is(imin);
imax = is(imax);
xll = (x(imin) + x(imin-1))/2; % lower limit
xul = (x(imax) + x(imax+1))/2; % upper limit

Local Averaging

$$\hat{g}(x) = \sum_{i=1}^{n} w_i(x)\, y_i$$

• We saw that kernel smoothers can be viewed as a weighted average

• Instead, we could take a local average of the k-nearest neighbors of x

$$\hat{g}(x) = \frac{1}{k}\sum_{i=1}^{k} y_{c(i)}$$

where $c(i)$ is the data set index of the $i$th nearest point

• For this type of model, k controls the smoothness
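The k-nearest-neighbor average above can be sketched in a few lines of Python (illustrative data, not the motorcycle set; `local_average` is my own name):

```python
import numpy as np

def local_average(x, y, xt, k):
    """k-nearest-neighbor local average: g(x) = (1/k) * sum of the
    outputs y_c(i) of the k points nearest to each query point."""
    d = np.abs(xt[:, None] - x[None, :])   # distances to every data point
    idx = np.argsort(d, axis=1)[:, :k]     # indices c(1..k) of the nearest points
    return y[idx].mean(axis=1)

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([0.0, 1.0, 4.0, 9.0, 16.0])

# Query at 2.1: the two nearest inputs are x=2 and x=3, so g = (4+9)/2 = 6.5
print(local_average(x, y, np.array([2.1]), k=2))   # -> [6.5]
```

Note that the output is piecewise constant in the query point, which is why $\hat{g}(x)$ is discontinuous, as discussed on the next slide.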


Local Averaging Discussion

$$\hat{g}(x) = \frac{1}{k}\sum_{i=1}^{k} y_{c(i)}$$

• Local averaging has a number of disadvantages
  – The data set must be stored in memory (this is essentially true for kernel smoothers and smoothing splines also)
  – The output $\hat{g}(x)$ is discontinuous
  – Finding the k nearest neighbors can be computationally expensive

rg = max(y)-min(y);
ymin = min(y)-0.1*rg;
ymax = max(y)+0.1*rg;
xbox = [xll xul xul xll];
ybox = [ymin ymin ymax ymax];
yav = mean(yn)*[1 1];
xav = [xll xul];
A = [xn ones(k,1)];
b = yn;
v = pinv(A)*b;
xl1 = 0;
yl1 = [xl1 1]*v;
xl2 = 1.5;
yl2 = [xl2 1]*v;
xll = [xl1 xl2];
yll = [yl1 yl2];
figure;
FigureSet(1,'LTX');
h = patch(xbox,ybox,'g');
set(h,'FaceColor',.8*[1 1 1]);
set(h,'EdgeColor',.8*[1 1 1]);
hold on;
h = plot(x,y,'k.');
set(h,'MarkerSize',8);

[Figure — Example 16: Local Averaging. "Motorcycle Data Set k=2"; Head Acceleration (g) vs. Time (ms).]

h = plot(xav,yav,'r-');
set(h,'LineWidth',1.5);
%h = plot(xll,yll,'b:');
h = plot(q*[1 1],[ymin ymax],'b--');
set(h,'LineWidth',1.5);
hold off;
rg = max(y)-min(y);
axis([min(x) max(x) ymin ymax]);
xlabel('Time (ms)');
ylabel('Head Acceleration (g)');
title('Motorcycle Data Set');
set(gca,'Layer','top');
set(gca,'Box','off');
AxisSet(8);
print -depsc LocalAverageConcept;


[Figure — Example 16: Local Averaging. "Motorcycle Data Set k=20"; Head Acceleration (g) vs. Time (ms).]

[Figure — Example 16: Local Averaging. "Motorcycle Data Set k=5"; Head Acceleration (g) vs. Time (ms).]

[Figure — Example 16: Local Averaging. "Motorcycle Data Set k=50"; Head Acceleration (g) vs. Time (ms).]

[Figure — Example 16: Local Averaging. "Motorcycle Data Set k=10"; Head Acceleration (g) vs. Time (ms).]


Weighted Local Averaging

$$\hat{g}(x) = \frac{\sum_{i=1}^{k} b_k(|x - x_{c(i)}|)\, y_{c(i)}}{\sum_{i=1}^{k} b_k(|x - x_{c(i)}|)} = \frac{\sum_{i=1}^{k} b_i\, y_{c(i)}}{\sum_{i=1}^{k} b_i}$$

• Local averaging can be tweaked to produce a continuous $\hat{g}(x)$

• We simply take a weighted average where $b(u)$ is a smoothly decreasing function of the distance

• We can use our familiar (non-negative) kernels to achieve this

• My favorite is the biweight function

$$b_i = \left( 1 - \frac{d_i^2}{d_{k+1}^2} \right)^2$$

where $d_i = |x - x_{c(i)}|$ is the distance between the input and the $i$th nearest neighbor
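A Python sketch of this weighted k-nearest-neighbor average with the biweight bump, where the width is set by the distance to the (k+1)th neighbor as in the formula above (synthetic data; the function name is my own):

```python
import numpy as np

def weighted_local_average(x, y, xt, k):
    """Weighted local average with biweight bumps
    b_i = (1 - d_i^2 / d_{k+1}^2)^2 over the k nearest neighbors."""
    out = np.empty(len(xt))
    for j, q in enumerate(xt):
        d = np.abs(q - x)
        order = np.argsort(d)
        dn = d[order[:k]]              # distances to the k nearest neighbors
        dmax = d[order[k]]             # distance to the (k+1)th neighbor
        b = (1 - (dn / dmax)**2)**2    # biweight weights; 0 at the (k+1)th
        out[j] = np.sum(b * y[order[:k]]) / np.sum(b)
    return out

x = np.linspace(0, 10, 50)
y = x**2
g = weighted_local_average(x, y, np.array([5.0]), k=4)
```

Because the weights fall smoothly to zero exactly as a neighbor drops out of the k-nearest set, the estimate varies continuously with the query point, unlike the plain local average.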

Example 16: MATLAB Code

function [] = LocalAverageFit();
close all;
D = load('Motorcycle.txt');
x = D(:,1);
y = D(:,2);
[x,is] = sort(x);
y = y(is);
xt = (-10:0.1:70)';
yh = zeros(size(xt));
K = [2 5 10 20 50];
for c = 1:length(K),
    k = K(c);
    for cnt = 1:length(xt),
        d = (x-xt(cnt)).^2;
        [ds,is] = sort(d);
        xs = x(is);
        ys = y(is);
        xn = xs(1:k);
        yn = ys(1:k);

[Figure — Example 17: Local Averaging Weighting Functions. "Weighted Averaging Weighting Functions"; weighting function vs. distance u ∈ [0, 1.2] for the Epanechnikov, Biweight, Triweight, and Triangular kernels.]

        yh(cnt) = mean(yn);
    end;
    figure;
    FigureSet(1,'LTX');
    h = plot(x,y,'k.');
    set(h,'MarkerSize',8);
    hold on;
    h = stairs(xt,yh,'b');
    set(h,'LineWidth',1.2);
    hold off;
    axis([-10 70 -150 90]);
    xlabel('Time (ms)');
    ylabel('Head Acceleration (g)');
    st = sprintf('Motorcycle Data Set k=%d',k);
    title(st);
    set(gca,'Layer','top');
    set(gca,'Box','off');
    AxisSet(8);
    st = sprintf('print -depsc LocalAverageEx%02d;',k);
    eval(st);
end;


[Figure — Example 17: Weighted Averaging. "Motorcycle Data Set k=10"; Head Acceleration (g) vs. Time (ms).]

[Figure — Example 17: Weighted Averaging. "Motorcycle Data Set k=2"; Head Acceleration (g) vs. Time (ms).]

[Figure — Example 17: Weighted Averaging. "Motorcycle Data Set k=20"; Head Acceleration (g) vs. Time (ms).]

[Figure — Example 17: Weighted Averaging. "Motorcycle Data Set k=5"; Head Acceleration (g) vs. Time (ms).]


    w = (1-(dn/dmax)).^2; % biweight: dn and dmax are squared distances
    yt(cnt) = sum(w.*yn)/sum(w);
end;

[Figure — Example 17: Weighted Averaging. "Motorcycle Data Set k=50"; Head Acceleration (g) vs. Time (ms).]

Weighted Local Averaging Comments

$$\hat{g}(x) = \frac{\sum_{i=1}^{k} b_k(|x - x_{c(i)}|)\, y_{c(i)}}{\sum_{i=1}^{k} b_k(|x - x_{c(i)}|)}$$

• Like kernel smoothers, weighted local averaging models are stable (bounded)

$$\min_{i \in 1,\dots,k} y_{c(i)} \le \hat{g}(x) \le \max_{i \in 1,\dots,k} y_{c(i)}$$

• The key difference here is that the kernel width is determined by the distance to the (k+1)th nearest neighbor

• This is advantageous
  – In regions of dense data, the equivalent kernel width shrinks
  – In regions of sparse data, the equivalent kernel width expands

Example 17: MATLAB Code

function [yt] = WeightedAverage(x,y,xt,k);
xarg = x;
yarg = y;
x = unique(xarg);
y = zeros(size(x));
for cnt = 1:length(x),
    y(cnt) = mean(yarg(xarg==x(cnt)));
end;
yt = zeros(length(xt),1);
[Np,Ni] = size(x);
for cnt = 1:length(xt),
    d = zeros(Np,1);
    for cnt2 = 1:Ni,
        d = d + (x(:,cnt2)-xt(cnt,cnt2)).^2;
    end;
    [ds,is] = sort(d);
    xs = x(is);
    ys = y(is);
    dn = ds(1:k);
    dmax = ds(k+1);
    xn = xs(1:k);
    yn = ys(1:k);


Local Model Consistency

• Under general assumptions, it can be shown that smoothing splines, kernel smoothers, and local models are consistent

• Consistency means that as $n \to \infty$, if the following conditions are satisfied
  – $\int |b_\sigma(u)|\, du < \infty$
  – $\lim_{u \to \infty} u\, b_\sigma(u) = 0$
  – $E[\varepsilon_i^2] < \infty$
  – As $n \to \infty$, $\sigma \to 0$ and $n\sigma \to \infty$

then at every point where $g(x)$ and $f(x)$ are continuous with $f(x) > 0$,

$$\hat{g}(x) \to g(x)$$

with probability 1.

Local Model Optimality

It can be shown that for fixed weighting functions, $w(x)$, both kernel smoothers and weighted local averaging models minimize the weighted average squared error

$$\mathrm{ASE} \equiv \frac{1}{n}\sum_{i=1}^{n} (y_i - g(x))^2\, b(|x - x_i|)$$

$$\frac{d\,\mathrm{ASE}}{d\,g(x)} \propto \sum_{i=1}^{n} (y_i - g(x))\, b(|x - x_i|) = 0$$

$$0 = \sum_{i=1}^{n} y_i\, b(|x - x_i|) - \sum_{i=1}^{n} g(x)\, b(|x - x_i|) = \sum_{i=1}^{n} y_i\, b(|x - x_i|) - g(x)\sum_{i=1}^{n} b(|x - x_i|)$$

$$g(x) = \frac{\sum_{i=1}^{n} y_i\, b(|x - x_i|)}{\sum_{i=1}^{n} b(|x - x_i|)}$$

Bias-Variance Tradeoff

$$\mathrm{PE} \equiv E\!\left[(y - \hat{g}(x))^2\right] = \underbrace{(g(x) - E[\hat{g}(x)])^2}_{\text{Bias}^2} + \underbrace{E\!\left[(\hat{g}(x) - E[\hat{g}(x)])^2\right]}_{\text{Variance}}$$

• For each, we discussed a bias-variance tradeoff
  – Less smooth ⇒ More variance and less bias
  – More smooth ⇒ Less variance and more bias

• Recall that the prediction error can be written as shown above

• The expectation is taken over the distribution of data sets used to construct $\hat{g}(x)$

• Conceptually, this can be plotted

Local Model Optimality Continued

$$\mathrm{ASE} \equiv \frac{1}{n}\sum_{i=1}^{n} (y_i - g(x))^2\, b(|x - x_i|) \qquad g^*(x) = \frac{\sum_{i=1}^{n} y_i\, b(|x - x_i|)}{\sum_{i=1}^{n} b(|x - x_i|)}$$

• Thus, we have an alternative derivation of kernel smoothers and weighted local averaging models

• They are the models that minimize the weighted ASE

• The only difference between kernel smoothers, local averaging models, and weighted local averaging models are the weighting functions, $b(\cdot)$


Model Selection Continued

• The prediction error can then be estimated by

$$\mathrm{PE} \approx r\!\left(\frac{p}{n}\right) \times \mathrm{ASE}$$

where $r(\cdot)$ is a function that adjusts the ASE to be a more accurate estimate of the PE

• A number of different functions $r(\cdot)$ have been proposed
  – Final Prediction Error: $r(u) = \frac{1+u}{1-u}$
  – Schwarz's Criterion: $r(u) = 1 + \frac{\log_e n}{2}\,\frac{u}{1-u}$
  – Generalized CVE: $r(u) = \frac{1}{(1-u)^2}$
  – Shibata's Model Selector: $r(u) = 1 + 2u$

• In each case $u = \frac{p}{n}$
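The proposed corrections are easy to compare side by side; a small Python sketch (my own, not from the slides):

```python
import math

def fpe(u):            # Final Prediction Error
    return (1 + u) / (1 - u)

def schwarz(u, n):     # Schwarz's criterion (needs the sample size n)
    return 1 + (math.log(n) / 2) * u / (1 - u)

def gcv(u):            # Generalized cross-validation
    return 1 / (1 - u)**2

def shibata(u):        # Shibata's model selector
    return 1 + 2*u

# FPE, GCV, and Shibata all agree to first order: r(u) = 1 + 2u + O(u^2),
# so they behave similarly when p << n and diverge as u = p/n grows.
u = 0.1
print(fpe(u), gcv(u), shibata(u), schwarz(u, n=100))
```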

Bias-Variance Tradeoff Continued

[Figure — conceptual sketch: Prediction Error vs. Smoothness, with its Bias and Variance components.]

• Our goal is to minimize the prediction error

• How do we choose the best smoothing parameter?

• All of the methods we discussed had a single parameter that controlled smoothness
  – Smoothing splines had a smoothness penalty parameter λ
  – Kernel methods had the bump width σ
  – Local averaging models had the number of neighbors k

• How do we pick the best smoothness?

• We would like an accurate estimate of the prediction error

Resampling Techniques

• It is also possible to use resampling techniques
  – N-Fold Cross-Validation: Divide the data set into N different sets. Pick the first set as the test set and build the model using the remaining N − 1 sets of points. Calculate the ASE on the test set. Repeat for all of the sets and average all N estimates of the ASE.
  – Leave-one-out Cross-Validation: Same as above with N = n.
  – Bootstrap: Select n points from the data set with replacement and calculate the ASE.

• I personally prefer to use leave-one-out CVE

• A study I conducted last spring with weighted averaging models indicated that CVE and Generalized CVE were the best (for that type of model)
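Leave-one-out cross-validation can be sketched as follows (Python, with a k-NN local average standing in for the smoother and synthetic data; all names are mine):

```python
import numpy as np

def knn_predict(x, y, q, k):
    """k-NN local average prediction at a single query point q."""
    idx = np.argsort(np.abs(x - q))[:k]
    return y[idx].mean()

def loo_cve(x, y, k):
    """Leave-one-out CV error: hold out each point, predict it from
    the remaining n-1 points, and average the squared errors."""
    errs = []
    for i in range(len(x)):
        mask = np.ones(len(x), dtype=bool)
        mask[i] = False                       # leave point i out
        pred = knn_predict(x[mask], y[mask], x[i], k)
        errs.append((y[i] - pred)**2)
    return np.mean(errs)

rng = np.random.default_rng(1)
x = np.sort(rng.uniform(0, 10, 80))
y = np.sin(x) + 0.2 * rng.standard_normal(80)

# Sweep the smoothness parameter and keep the k with the smallest CVE
cve = {k: loo_cve(x, y, k) for k in (1, 3, 5, 10, 40)}
best_k = min(cve, key=cve.get)
```

The CVE curve is typically U-shaped in the smoothness parameter: very small k overfits the noise and very large k oversmooths, mirroring the bias-variance tradeoff above.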

Model Selection

• How do we estimate the prediction error with only one data set?

• The ASE won't work: it monotonically decreases as the smoothness decreases

• All of our smoothers can be written as

$$\hat{g}(x) = \hat{y} = H(x)\, y$$

for a given input vector $x$

• This is very similar to the hat matrix of linear models, except now the $H$ matrix is a function of $x$

• The equivalent degrees of freedom can be estimated by

$$p \approx \mathrm{trace}(H H^T)$$
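For a kernel smoother evaluated at the data points, each row of H holds the normalized weights $w_i$, so the slide's trace estimate can be computed directly. A hedged Python sketch on synthetic inputs (Gaussian kernel; names are mine):

```python
import numpy as np

def smoother_matrix(x, sigma):
    """H for a Gaussian-kernel smoother at the data points: row i holds
    the normalized weights used to predict y at x_i, so yhat = H @ y."""
    d = x[:, None] - x[None, :]
    b = np.exp(-d**2 / (2 * sigma**2))
    return b / b.sum(axis=1, keepdims=True)

x = np.linspace(0, 10, 50)

H_narrow = smoother_matrix(x, 0.01)   # nearly the identity matrix
H_wide = smoother_matrix(x, 1e3)      # every row nearly uniform (1/n)

p_narrow = np.trace(H_narrow @ H_narrow.T)   # close to n = 50 (interpolation)
p_wide = np.trace(H_wide @ H_wide.T)         # close to 1 (global average)
```

The two extremes match intuition: an interpolating smoother spends about n degrees of freedom, while a global average spends about one.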


Weighted Least Squares

$$\mathrm{ASE}_b = \frac{1}{n}\sum_{i=1}^{n} b_i^2\, (y_i - \hat{y}_i)^2 = \frac{1}{n}\,(y - Aw)^T B^T B\,(y - Aw)$$

where

$$B = \begin{bmatrix} b_1 & 0 & \dots & 0 \\ 0 & b_2 & \dots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \dots & b_n \end{bmatrix}$$

• When we discussed linear models, we found the weights that minimized the average squared error

• We can generalize this easily to find the best linear model that minimizes the weighted ASE

[Figure — Example: Weighted Averaging CVE. "Local Averaging Cross-Validation Error"; CVE vs. number of neighbors (k), 0 to 30.]

Weighted Least Squares Continued

$$\mathrm{ASE}_b = \frac{1}{n}\,(y - Aw)^T B^T B\,(y - Aw)$$

• This can be framed as a typical (unweighted) least squares problem if we add the following definitions

$$A_b \equiv BA \qquad y_b \equiv By$$

• Then the $\mathrm{ASE}_b$ can be written as

$$\mathrm{ASE}_b = \frac{1}{n}\,(y_b - A_b w)^T (y_b - A_b w)$$

which has the known optimal least squares solution

$$\hat{w} = (A_b^T A_b)^{-1} A_b^T y_b$$

• Now we can easily generalize kernel methods and local averaging models to create localized linear models

• We merely specify the weights so that points near the input have the most influence on the model output
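The reduction to ordinary least squares can be sketched in Python (synthetic data; the biweight-style weighting around x = 0.5 is an arbitrary illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 40
x = np.sort(rng.uniform(0, 1, n))
y = 2.0 + 3.0 * x + 0.1 * rng.standard_normal(n)

A = np.column_stack([x, np.ones(n)])        # local linear design matrix
b = (1 - (np.abs(x - 0.5) / 0.6)**2)**2     # weights emphasizing x near 0.5
B = np.diag(b)

# Weighted problem rewritten as ordinary least squares on Ab = B A, yb = B y
Ab, yb = B @ A, B @ y
w = np.linalg.solve(Ab.T @ Ab, Ab.T @ yb)   # (Ab' Ab)^{-1} Ab' yb

# Equivalent direct normal equations for the weighted problem
w2 = np.linalg.solve(A.T @ B.T @ B @ A, A.T @ B.T @ B @ y)
```

Both routes give the same coefficients, which is the point of the reduction: any ordinary least squares solver handles the weighted problem once the rows are scaled by B.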

[Figure — Example: Weighted Averaging CVE. "Motorcycle Data Local Averaging k_opt = 11"; Head Acceleration (g) vs. Time (ms).]


Example 18: Local Linear Model

Time (ms)

Hea

d A

ccel

erat

ion

(g)

Motorcycle Data Set k=10

−10 0 10 20 30 40 50 60 70−150

−100

−50

0

50

J. McNames Portland State University ECE 4/557 Univariate Smoothing Ver. 1.25 139

Example 18: Local Linear Model

Time (ms)

Hea

d A

ccel

erat

ion

(g)

Motorcycle Data Set k=2

−10 0 10 20 30 40 50 60 70−150

−100

−50

0

50

J. McNames Portland State University ECE 4/557 Univariate Smoothing Ver. 1.25 137

Example 18: Local Linear Model

Time (ms)

Hea

d A

ccel

erat

ion

(g)

Motorcycle Data Set k=20

−10 0 10 20 30 40 50 60 70−150

−100

−50

0

50

J. McNames Portland State University ECE 4/557 Univariate Smoothing Ver. 1.25 140

Example 18: Local Linear Model

[Figure: Motorcycle Data Set, local linear model with k=5; Time (ms) vs. Head Acceleration (g)]

Example 18: Local Linear Model

[Figure: Motorcycle Data Set, local linear model with k=93; Time (ms) vs. Head Acceleration (g)]

Example 18: Local Linear Model

[Figure: Motorcycle Data Set, local linear model with k=30; Time (ms) vs. Head Acceleration (g)]

Example 18: MATLAB Code

function [] = LocalLinearEx();
    close all;
    D = load('Motorcycle.txt');
    x = D(:,1);
    y = D(:,2);
    [x,is] = sort(x);
    y  = y(is);
    xt = (-10:0.05:70)';
    k  = [2 5 10 20 30 50 93];
    for cnt = 1:length(k),
        yh = LocalLinear(x,y,xt,k(cnt));
        figure;
        FigureSet(1,'LTX');
        h = plot(x,y,'k.',xt,yh,'b');
        set(h(1),'MarkerSize',8);
        set(h(2),'LineWidth',1.2);
        axis([-10 70 -150 90]);
        xlabel('Time (ms)');
        ylabel('Head Acceleration (g)');
        st = sprintf('Motorcycle Data Set k=%d',k(cnt));
        title(st);
        set(gca,'Layer','top');

Example 18: Local Linear Model

[Figure: Motorcycle Data Set, local linear model with k=50; Time (ms) vs. Head Acceleration (g)]

Univariate Smoothing Summary Continued

• All of the smoothing methods had a single parameter that controls the smoothness of the model

• For each, we discussed a bias-variance tradeoff

– Less smooth ⇒ More variance and less bias

– More smooth ⇒ Less variance and more bias

• We discussed several methods of estimating the “true” prediction error of the model

– Some of the methods were simple modifications of the ASE
– Other methods were based on resampling (cross-validation & the bootstrap)

• Most can be generalized to the multivariate case

set(gca ,’Box’,’off’);

AxisSet (8);

st = sprintf(’print -depsc LocalLinearEx %02d;’,k(cnt ));

eval(st);

end;
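The LocalLinear helper called in the listing above is not shown on the slides. A minimal Python sketch (not MATLAB, and only one plausible variant: an unweighted line fit to the k nearest neighbors of each query point) of what such a function might compute:

```python
def local_linear(x, y, xt, k):
    """For each query point in xt, fit a line by least squares to the
    k training points nearest in x, and evaluate the line at the query."""
    yh = []
    for q in xt:
        # Indices of the k training points nearest to the query
        idx = sorted(range(len(x)), key=lambda i: abs(x[i] - q))[:k]
        xs = [x[i] for i in idx]
        ys = [y[i] for i in idx]
        n, sx, sy = len(xs), sum(xs), sum(ys)
        sxx = sum(v * v for v in xs)
        sxy = sum(a * b for a, b in zip(xs, ys))
        det = n * sxx - sx * sx
        if det == 0:  # degenerate neighborhood (all x equal): use the mean
            yh.append(sy / n)
        else:
            w1 = (n * sxy - sx * sy) / det  # slope
            w0 = (sy - w1 * sx) / n         # intercept
            yh.append(w0 + w1 * q)
    return yh
```

The weighted version discussed earlier would replace the plain least squares fit in the loop with the weighted solution w = (A_b^T A_b)^{-1} A_b^T y_b.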

Univariate Smoothing Summary

• We discussed four methods of interpolation

– Linear Interpolation

– Nearest Neighbor Interpolation

– Polynomial Interpolation

– Cubic Spline Interpolation

• We discussed six methods of univariate smoothing

– Polynomial regression (generalization of linear models)

– Smoothing splines

– Kernel smoothing

– Local averaging

– Weighted local averaging

– Local linear models (weighted)

• Discussed one method of density estimation based on kernels
