Locally Weighted Regression for Control

Jo-Anne Ting (1), Franziska Meier (2), Sethu Vijayakumar (1,2), and Stefan Schaal (3,4)
(1) University of Edinburgh, Edinburgh, UK
(2) University of Southern California, Los Angeles, CA, USA
(3) Max Planck Institute for Intelligent Systems, Stuttgart, Germany
(4) Computer Science, University of Southern California, Los Angeles, CA, USA

Synonyms

Kernel shaping; Lazy learning; Locally weighted learning; Local distance metric adaptation; LWR; LWPR; Nonstationary kernels; Supersmoothing

Definition

This entry addresses two topics: learning control and locally weighted regression.

Learning control refers to the process of acquiring a control strategy for a particular control system and a particular task by trial and error. It is usually distinguished from adaptive control (Åström and Wittenmark 1989) in that the learning system is permitted to fail during the process of learning, resembling how humans and animals acquire new movement strategies.

In contrast, adaptive control emphasizes single-trial convergence without failure, fulfilling stringent performance constraints, e.g., as needed in life-critical systems like airplanes and industrial robots.

Locally weighted regression refers to supervised learning of continuous functions (otherwise known as function approximation or regression) by means of spatially localized algorithms, which are often discussed in the context of kernel regression, nearest neighbor methods, or lazy learning (Atkeson et al. 1997). Most regression algorithms are global learning systems. For instance, many algorithms can be understood in terms of minimizing a global loss function such as the expected sum squared error:

J = E\left[\frac{1}{2}\sum_{i=1}^{N}(t_i - y_i)^2\right] = E\left[\frac{1}{2}\sum_{i=1}^{N}\left(t_i - \phi(x_i)^T \beta\right)^2\right] \qquad (1)

where E[·] denotes the expectation operator, t_i the noise-corrupted target value for an input x_i, which is expanded by basis functions into a basis function vector \phi(x_i), and \beta is the vector of (usually linear) regression coefficients. Classical feedforward neural networks, radial basis function networks, mixture models, or Gaussian process regression are all global function approximators in the spirit of Eq. (1).


In contrast, local learning systems conceptually split up the global learning problem into multiple simpler learning problems. Traditional locally weighted regression approaches achieve this by dividing up the cost function into multiple independent local cost functions,

J = E\left[\frac{1}{2}\sum_{k=1}^{K}\sum_{i=1}^{N} w_{k,i}\left(t_i - x_i^T \beta_k\right)^2\right] = \frac{1}{2}\sum_{k=1}^{K} E\left[\sum_{i=1}^{N} w_{k,i}\left(t_i - x_i^T \beta_k\right)^2\right] = \frac{1}{2}\sum_{k=1}^{K} J_k, \qquad (2)

resulting in K (independent) local model learning problems. A different strategy for local learning starts out with the global objective (Eq. 1) and reformulates it to capture the idea of local models that cooperate to generate a (global) function fit. This is achieved by assuming there are K feature functions \phi_k, such that the kth feature function is \phi_k(x_i) = w_{k,i} x_i, resulting in

J = E\left[\frac{1}{2}\sum_{i=1}^{N}\left(t_i - \phi(x_i)^T \beta\right)^2\right] = E\left[\frac{1}{2}\sum_{i=1}^{N}\left(t_i - \sum_{k=1}^{K} w_{k,i}\,(x_i^T \beta_k)\right)^2\right]. \qquad (3)

In this setting, the local models are initially coupled, and approximations are introduced to decouple the learning of the local models' parameters.

Motivation and Background

Figure 1 illustrates why locally weighted regression methods are often favored over global methods when it comes to learning from incrementally arriving data, especially when dealing with nonstationary input distributions. The figure shows the division of the training data into two sets: the "original training data" and the "new training data" (in dots and crosses, respectively).

Locally Weighted Regression for Control, Fig. 1  Function approximation results for the function y = sin(2x) + 2 exp(−16x²) + N(0, 0.16) with (a) a sigmoidal neural network, (b) a locally weighted regression algorithm (note that the data traces "true y," "predicted y," and "predicted y after new training data" largely coincide), and (c) the organization of the (Gaussian) kernels of (b) after training. See Schaal and Atkeson (1998) for more details.


Initially, a sigmoidal neural network and a locally weighted regression algorithm are trained on the "original training data," using 20% of the data as a cross-validation set to assess convergence of the learning. In a second phase, both learning systems are trained solely on the "new training data" (again with a similar cross-validation procedure), but without using any data from the "original training data." While both algorithms generalize well on the "new training data," the global learner incurred catastrophic interference, unlearning what was learned initially, as seen in Fig. 1a. Figure 1b shows that the locally weighted regression algorithm does not have this problem since learning (along with generalization) is restricted to a local area.

Appealing properties of locally weighted regression include the following:

• Function approximation can be performed incrementally with nonstationary input and output distributions and without significant danger of interference. Locally weighted regression can provide posterior probability distributions, offer confidence assessments, and deal with heteroscedastic data.

• Locally weighted learning algorithms are computationally inexpensive and well suited for online computations (e.g., for online and incremental learning) in the fast control loop of a robot, typically on the order of 100–1000 Hz.

• Locally weighted regression methods can implement continual learning and learning from large amounts of data without running into severe computational problems on modern computing hardware.

• Locally weighted regression is a nonparametric method (i.e., it does not require that the user determine a priori the number of local models in the learning system), and the learning system grows with the complexity of the data it tries to model.

• Locally weighted regression can include feature selection, dimensionality reduction, and Bayesian inference, all of which are required for robust statistical inference.

• Locally weighted regression works favorably with locally linear models (Hastie and Loader 1993), and local linearizations are of ubiquitous use in control applications.

Background
Returning to Eqs. (1)–(3), the main differences between global methods that directly solve Eq. (1) and local methods that solve either Eq. (2) or Eq. (3) are listed below:

(i) A weight w_{k,i} is introduced that focuses either the function approximation fit in Eq. (2) or a local model's contribution toward the global function fit in Eq. (3) on only a small neighborhood around a point of interest c_k in input space (see Eq. 4 below).
(ii) The learning problem is split into K independent optimization problems.
(iii) Due to the restricted scope of the function approximation problem, we do not need a nonlinear basis function expansion and can, instead, work with simple local functions or local polynomials (Hastie and Loader 1993).

The weights w_{k,i} in Eq. (2) are typically computed from some kernel function (Atkeson et al. 1997) such as a squared exponential kernel:

w_{k,i} = \exp\left(-\frac{1}{2}(x_i - c_k)^T D_k (x_i - c_k)\right) \qquad (4)

with D_k denoting a positive semidefinite distance metric and c_k the center of the kernel. The number of kernels K is not finite. In many local learning algorithms, the kernels are never maintained in memory. Instead, for every query point x_q, a new kernel is centered at c_k = x_q, and the localized function approximation is solved with weighted regression techniques (Atkeson et al. 1997).

Locally weighted regression should not be confused with mixture of experts models (Jordan and Jacobs 1994). Mixture models are global learning systems since the experts compete globally to cover training data. Mixture models address the bias-variance dilemma (intuitively, the bias-variance dilemma addresses how many parameters to use for a function approximation problem to find an optimal balance between overfitting and oversmoothing of the training data) by finding the right number of local experts. Locally weighted regression addresses the bias-variance dilemma in a local way by finding the optimal distance metric for computing the weights in the locally weighted regression (Schaal and Atkeson 1998).

Structure of Learning System

All local learning approaches have three critical components in common:

(i) Optimizing the regression parameters β_k
(ii) Learning the distance metric D_k that defines a local model neighborhood
(iii) Choosing the location c_k of the receptive field(s)

Local learning methods can be separated into "lazy" approaches that require all training data to be stored and "memoryless" approaches that compress data into several local models and thus do not require storage of data points.

In the "lazy" approach, the computational burden of a prediction is deferred until the last moment, i.e., when a prediction is needed. Such a "compute-the-prediction-on-the-fly" approach is often called lazy learning and is a memory-based learning system where all training data is kept in memory for making predictions. A prediction is formed by optimizing the parameters β_q and distance metric D_q of one local model centered at the query point c_q = x_q.

Alternatively, in the "memoryless" approach, multiple kernels are created as needed to cover the input space, and the sufficient statistics of the weighted regression are updated incrementally with recursive least squares (Schaal and Atkeson 1998). This approach does not require storage of data points in memory. Predictions of neighboring local models can be blended, improving function fitting results in the spirit of committee machines.

We describe some algorithms of both flavors next.

Memory-Based Locally Weighted Regression (LWR)
The original locally weighted regression algorithm was introduced by Cleveland (1979) and popularized in the machine learning and learning control community by Atkeson (1989). The algorithm, categorized as a "lazy" approach, can be summarized as follows (for algorithmic pseudo-code, see Schaal et al. 2002):

• All training data is collected in the rows of the matrix X and the vector t. (For simplicity, only functions with a scalar output are addressed. Vector-valued outputs can be learned either by fitting a separate learning system for each output or by modifying the algorithms to fit multiple outputs, similar to multi-output linear regression.)

• For every query point x_q, the weighting kernel is centered at the query point.

• The weights are computed with Eq. (4), and all data points' weights are collected in the diagonal weight matrix W_q.

• The local regression coefficients are computed as

\beta_q = \left(X^T W_q X\right)^{-1} X^T W_q\, t \qquad (5)

• A prediction is formed with y_q = [x_q^T 1] β_q (a minimal code sketch of this procedure is given after the list).
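The following is a minimal sketch of memory-based LWR in Python/NumPy, illustrating Eqs. (4) and (5). It assumes a fixed, globally shared distance metric D, augments the inputs with a bias term so that the prediction y_q = [x_q^T 1] β_q applies, and adds a small ridge term for numerical stability; these implementation details are assumptions, not prescribed by this entry.

```python
import numpy as np

def lwr_predict(X, t, x_q, D, ridge=1e-8):
    """Memory-based LWR prediction at a query point x_q (Eqs. 4 and 5).

    X : (N, d) training inputs, t : (N,) targets,
    D : (d, d) positive semidefinite distance metric.
    """
    diff = X - x_q                                              # offsets from the query
    w = np.exp(-0.5 * np.einsum('nd,de,ne->n', diff, D, diff))  # Eq. (4): kernel weights
    Xa = np.hstack([X, np.ones((X.shape[0], 1))])               # augment with a bias column
    W = np.diag(w)
    A = Xa.T @ W @ Xa + ridge * np.eye(Xa.shape[1])             # regularized normal equations
    beta_q = np.linalg.solve(A, Xa.T @ W @ t)                   # Eq. (5)
    return np.append(x_q, 1.0) @ beta_q                         # y_q = [x_q^T 1] beta_q

# Example on the noisy test function of Fig. 1 (noise variance 0.16 assumed)
rng = np.random.default_rng(0)
X = rng.uniform(-2.5, 2.5, size=(200, 1))
t = np.sin(2 * X[:, 0]) + 2 * np.exp(-16 * X[:, 0] ** 2) + rng.normal(0, 0.4, 200)
D = np.array([[25.0]])   # a fairly narrow kernel; D controls the bias-variance trade-off
print(lwr_predict(X, t, np.array([0.5]), D))
```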

As in all kernel methods, it is important to optimize the kernel parameters in order to get optimal function fitting quality. For LWR, the critical parameter determining the bias-variance trade-off is the distance metric D_q. If the kernel is too narrow, it starts fitting noise. If it is too broad, oversmoothing will occur. D_q can be optimized with leave-one-out cross-validation to obtain a globally optimal value, i.e., the same D_q = D is used throughout the entire input space of the data. Alternatively, D_q can be optimized locally as a function of the query point, i.e., a separate D_q is obtained for each query point (as indicated by the subscript "q"). In the recent machine learning literature (in particular, work related to kernel methods), such input-dependent kernels are referred to as nonstationary kernels.

Locally Weighted Projection Regression (LWPR)
Schaal and Atkeson (1998) suggested a memoryless version of LWR, called RFWR, in order to avoid the expensive nearest neighbor computations of LWR (particularly for large training data sets) and to have fast real-time prediction performance. (In most robotic systems, "real time" means on the order of at most 1–10 ms computation time, corresponding to a 1000 to 100 Hz control loop.) The main ideas of the RFWR algorithm (Schaal and Atkeson 1998) are listed below:

• Create new kernels only if no existing kernel in memory covers a training point with some minimal activation weight.

• Keep all created kernels in memory and update the weighted regression with weighted recursive least squares for new training points {x, t}:

\beta_k^{n+1} = \beta_k^n + w\, P_k^{n+1}\, \tilde{x}\left(t - \tilde{x}^T \beta_k^n\right)

where

P_k^{n+1} = \frac{1}{\lambda}\left(P_k^n - \frac{P_k^n\, \tilde{x}\, \tilde{x}^T P_k^n}{\frac{\lambda}{w} + \tilde{x}^T P_k^n\, \tilde{x}}\right) \quad \text{and} \quad \tilde{x} = \left[x^T\ 1\right]^T \qquad (6)

(here λ denotes a forgetting factor).

• Adjust the distance metric D_k of each kernel with a gradient descent technique using leave-one-out cross-validation.

• Make a prediction for a query point by taking a weighted average of the predictions of all local models:

y_q = \frac{\sum_{k=1}^{K} w_{q,k}\, \hat{y}_{q,k}}{\sum_{k=1}^{K} w_{q,k}} \qquad (7)

Adjusting the distance metrics D_k with leave-one-out cross-validation without keeping all training data in memory is possible due to the PRESS residual. The PRESS residual allows the leave-one-out cross-validation error to be computed in closed form without needing to actually exclude a data point from the training data.
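The following is a minimal sketch of the memoryless flavor, assuming a fixed set of receptive-field centers: each local model keeps only its sufficient statistics (P_k, β_k), updates them with the weighted recursive least squares rule of Eq. (6) with forgetting factor λ, and predictions are blended as in Eq. (7). Kernel creation and the gradient-based distance metric adaptation of RFWR/LWPR are deliberately omitted.

```python
import numpy as np

class LocalModel:
    """One receptive field: center c, distance metric D, and RLS state (P, beta)."""
    def __init__(self, center, D, dim, prior=1e3):
        self.c, self.D = center, D
        self.P = prior * np.eye(dim + 1)           # inverse covariance for [x, 1]
        self.beta = np.zeros(dim + 1)

    def weight(self, x):
        d = x - self.c
        return np.exp(-0.5 * d @ self.D @ d)       # Eq. (4)

    def update(self, x, t, lam=0.999):
        """Weighted recursive least squares, Eq. (6)."""
        w = self.weight(x)
        if w < 1e-7:                               # negligible activation: skip the update
            return
        xt = np.append(x, 1.0)                     # x_tilde = [x^T 1]^T
        Px = self.P @ xt
        self.P = (self.P - np.outer(Px, Px) / (lam / w + xt @ Px)) / lam
        self.beta += w * (self.P @ xt) * (t - xt @ self.beta)

    def predict(self, x):
        return np.append(x, 1.0) @ self.beta

def blended_prediction(models, x_q):
    """Eq. (7): normalized, weighted average of the local predictions."""
    w = np.array([m.weight(x_q) for m in models])
    y = np.array([m.predict(x_q) for m in models])
    return (w @ y) / max(w.sum(), 1e-12)

# Usage sketch: receptive fields on a fixed 1-D grid of centers
models = [LocalModel(np.array([c]), np.array([[25.0]]), dim=1) for c in np.linspace(-2, 2, 9)]
rng = np.random.default_rng(0)
for _ in range(2000):
    x = rng.uniform(-2.5, 2.5, size=1)
    t = np.sin(2 * x[0]) + 2 * np.exp(-16 * x[0] ** 2) + rng.normal(0, 0.4)
    for m in models:
        m.update(x, t)
print(blended_prediction(models, np.array([0.5])))
```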

Another deficiency of LWR is its inability to scale well to high-dimensional input spaces since the covariance matrix inversion in Eq. (5) becomes severely ill-conditioned. Additionally, LWR becomes expensive to evaluate as the number of local models to be maintained increases. Vijayakumar et al. (2005) suggested local dimensionality reduction techniques to handle this problem. Partial least squares (PLS) regression is a useful dimensionality reduction method that is used in the LWPR algorithm (Vijayakumar et al. 2005). In contrast to PCA methods, PLS performs dimensionality reduction for regression, i.e., it eliminates subspaces of the input space that correlate minimally with the outputs, not just parts of the input space that have low variance.
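To illustrate the dimensionality reduction idea (though not LWPR's specific incremental implementation), here is a minimal batch PLS1 sketch: each projection direction is chosen by its covariance with the output, a one-dimensional regression is fit along that direction, and the data are deflated before the next direction is extracted.

```python
import numpy as np

def pls1_fit(X, t, n_components):
    """Batch PLS1 regression: regression-aware projections (illustrative only)."""
    x_mean, t_mean = X.mean(axis=0), t.mean()
    Xk, tk = X - x_mean, t - t_mean
    dirs, loads, slopes = [], [], []
    for _ in range(n_components):
        u = Xk.T @ tk                       # direction of maximal covariance with the output
        u /= np.linalg.norm(u)
        s = Xk @ u                          # one-dimensional projection (score)
        b = (s @ tk) / (s @ s)              # univariate regression along the projection
        p = Xk.T @ s / (s @ s)              # input loading used for deflation
        Xk = Xk - np.outer(s, p)            # remove the explained input subspace
        tk = tk - b * s                     # remove the explained output component
        dirs.append(u); loads.append(p); slopes.append(b)
    return x_mean, t_mean, dirs, loads, slopes

def pls1_predict(model, x):
    x_mean, t_mean, dirs, loads, slopes = model
    xk, y = x - x_mean, t_mean
    for u, p, b in zip(dirs, loads, slopes):
        s = xk @ u                          # project onto the current direction
        y += b * s
        xk = xk - s * p                     # deflate before the next projection
    return y
```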

While LWPR is typically used in conjunction with linear local models, the use of local nonparametric models, such as Gaussian processes, has also been explored (Nguyen-Tuong et al. 2008). Finally, LWPR is currently one of the best-developed locally weighted regression algorithms for control (Klanke et al. 2008) and has been applied to learning control problems with over 100 input dimensions.

A Full Bayesian Treatment of Locally Weighted Regression
Ting et al. (2008) proposed a fully probabilistic treatment of LWR in an attempt to avoid cross-validation procedures and minimize any manual parameter tuning (e.g., gradient descent rates, kernel initialization, forgetting rates, etc.). The resulting Bayesian algorithm learns the distance metric of the local linear model probabilistically (for simplicity, a local linear model is assumed, although local polynomials can be used as well), can cope with high input dimensions, and rejects data outliers automatically. The main ideas of Bayesian LWR are listed below (please see Ting 2009 for details):


• Introduce hidden variables z to the local linear model to decompose the statistical estimation problem into d individual estimation problems (where d is the number of input dimensions). The result is an iterative expectation-maximization (EM) algorithm that is of linear computational complexity in d and in the number of training data samples N, i.e., O(Nd).

• Associate a scalar weight w_i with each training data sample {x_i, t_i}, placing a Bernoulli prior probability distribution over a weight w_{im} for each input dimension m so that the weights are positive and between 0 and 1:

w_i = \prod_{m=1}^{d} w_{im} \quad \text{where} \quad w_{im} \sim \text{Bernoulli}(q_{im}) \quad \text{for } i = 1, \ldots, N;\ m = 1, \ldots, d \qquad (8)

The weight w_i indicates a training sample's contribution to the local model. The formulation of the parameter q_{im} determines the shape of the weighting function applied to the local model. The weighting function q_{im} used in Bayesian LWR is

q_{im} = \frac{1}{1 + (x_{im} - x_{qm})^2\, h_m} \quad \text{for } i = 1, \ldots, N;\ m = 1, \ldots, d \qquad (9)

where x_q ∈ ℜ^{d×1} is the query input point and h_m is the bandwidth parameter/distance metric of the local model in the m-th input dimension.

• Place a gamma prior probability distribution over the distance metric h_m:

h_m \sim \text{Gamma}\left(a_{h_{m0}}, b_{h_{m0}}\right) \qquad (10)

where \{a_{h_{m0}}, b_{h_{m0}}\} are the prior parameter values of the gamma distribution.

• Treat the model as an EM-like regression problem, using variational approximations to achieve analytically tractable inference of the posterior probability distributions (a small sketch of the weighting function of Eqs. (8) and (9) follows this list).
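A minimal sketch of this weighting function, showing how the per-dimension bandwidths h_m of Eqs. (8)–(10) shape the expected weight of each training sample; the full variational EM updates of Ting et al. (2008) are not reproduced here.

```python
import numpy as np

def bayesian_lwr_weights(X, x_q, h):
    """Expected sample weights of a local model centered at the query x_q.

    X   : (N, d) training inputs
    x_q : (d,)   query point
    h   : (d,)   per-dimension bandwidth parameters (Eq. 10 places Gamma priors on these)
    """
    q = 1.0 / (1.0 + (X - x_q) ** 2 * h)   # Eq. (9): per-dimension Bernoulli parameters q_im
    return np.prod(q, axis=1)              # Eq. (8): expected weight is the product over dimensions

# Example: wide kernel in dimension 0 (small h), narrow kernel in dimension 1 (large h);
# the bandwidth values here are purely hypothetical.
rng = np.random.default_rng(1)
X = rng.normal(size=(5, 2))
print(bayesian_lwr_weights(X, x_q=np.zeros(2), h=np.array([0.1, 100.0])))
```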

This Bayesian method can also be applied as a general kernel shaping algorithm for global kernel learning methods that are linear in the parameters (e.g., to realize nonstationary Gaussian processes (Ting et al. 2008), resulting in an augmented nonstationary Gaussian process).

Figure 2 illustrates Bayesian kernel shaping's bandwidth adaptation abilities on several synthetic data sets, comparing it to a stationary Gaussian process and the augmented nonstationary Gaussian process. For ease of visualization, the following one-dimensional functions are considered: (i) a function with a discontinuity, (ii) a spatially inhomogeneous function, and (iii) a straight line function. Figure 2 shows the predicted outputs of all three models trained on noisy data drawn from data sets (i)–(iii). The local kernel shaping algorithm smoothens over regions where a stationary Gaussian process overfits, and yet it still manages to capture regions of highly varying curvature, as seen in Fig. 2a, b. It correctly adjusts the bandwidths h with the curvature of the function. When the data look linear, the algorithm opens up the weighting kernel so that all data samples are considered, as Fig. 2c shows.

From the viewpoint of learning control, overfitting (as seen in the Gaussian process in Fig. 2) can be detrimental since learning control often relies on extracting local linearizations to derive controllers. Obtaining the wrong sign on a slope in a local linearization may destabilize a controller.

In contrast to LWPR, the Bayesian LWR method is a "lazy" learner, although memoryless versions could be derived. Future work will also have to address how to incorporate dimensionality reduction methods for robustness in high dimensions. Nevertheless, it is a first step toward a probabilistic locally weighted regression method with minimal parameter tuning required by the user.


Locally Weighted Regression for Control, Fig. 2  Predicted outputs using a stationary Gaussian process (GP), the augmented nonstationary GP, and local kernel shaping on three different data sets (functions (i)–(iii)). Figures on the bottom row show the bandwidths learned by local kernel shaping and the corresponding weighting kernels (in dotted black lines) for various input query points (shown in red circles).

From Global to Local: Local Regression with Coupling Between Local Models
Meier et al. (2014) offer an alternative approach to local learning. They start out with the global objective (Eq. 3) and reformulate it to capture the idea of local models that cooperate to generate a function fit, resulting in

J = E\left[\frac{1}{2}\sum_{i=1}^{N}\left(t_i - \sum_{k=1}^{K} w_{k,i}\,(x_i^T \beta_k)\right)^2\right] \qquad (11)

With this change, a local model's contribution ŷ_k = x_i^T β_k toward the fit of target t_i is localized through the weight w_{k,i}. However, this form of localization couples all local models. For efficient learning, local Gaussian regression (LGR) thus employs approximations to decouple the learning of the parameters. The main ideas of LGR are:

• Introduce Gaussian hidden variables f_k that form virtual targets for the weighted contribution of the kth local model:

f_{k,i} \sim \mathcal{N}\left(w_{k,i}\,(x_i^T \beta_k),\ \beta_m^{-1}\right) \qquad (12)

Assume that the target t is observed with Gaussian noise and that the hidden variables f_k need to sum up to the noisy target t_i:

t_i \sim \mathcal{N}\left(\sum_{k} f_{k,i},\ \beta_y^{-1}\right) \qquad (13)

In its exact form, this model learning procedure will couple all local models' parameters.

• Employ a variational approximation to decouple the local models. This results in an iterative (EM-style) learning procedure that alternates between posterior updates for the hidden variables f_k and posterior updates for the regression parameters β_k, for all local models k = 1, …, K.

• The updates over the hidden variables f_k turn out to be a form of message passing between local model predictions. This step allows the redistribution of virtual target values for each local model. This communication between local models is what distinguishes LGR from typical LWR approaches. This update is linear in the number of local models and in the number of data points.

Locally Weighted Regression for Control, Fig. 3  Local models trained on data from the 2D cross function for LWPR and LGR. Local models trained via LWPR (visualized in (b)) do not know of each other, while local models trained by LGR (visualized in (c)) collaborate to generate a function fit.

• The parameter updates (β_k and D_k) per local model become completely independent through the variational approximation, resulting in a localized learning algorithm, similar in spirit to LWR.

• Place Gaussian priors over the regression parameters, β_k ~ N(β_k; 0, diag(α_k)), that allow for automatic relevance determination of the input dimensions.

• For incrementally incoming data, apply recursive Bayesian updates that utilize the posterior over parameters at time step t − 1 as the prior over parameters at time step t (a generic sketch of such an update is given after this list). Furthermore, new local models are added if no existing local model is activated with some minimal activation weight, similar to LWPR.

• Prediction for a query input x_q becomes a weighted average of the local models' predictions:

y_q = \sum_{k=1}^{K} w_{k,q}\,(x_q^T \beta_k)
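A generic sketch of the recursive Bayesian update mentioned in the list above, for a single local model with a Gaussian prior over β_k and known noise precision; the posterior after each sample becomes the prior for the next sample. This is plain Bayesian linear regression, not the full (weighted, variational) LGR update of Meier et al. (2014).

```python
import numpy as np

def recursive_bayes_update(mu, Lam, x, t, beta_y):
    """One conjugate Gaussian update for beta ~ N(mu, Lam^{-1}).

    mu     : (d,)   current posterior mean of the regression parameters
    Lam    : (d, d) current posterior precision
    x, t   : new input vector and target
    beta_y : observation noise precision
    """
    Lam_new = Lam + beta_y * np.outer(x, x)                      # precision accumulates evidence
    mu_new = np.linalg.solve(Lam_new, Lam @ mu + beta_y * x * t)
    return mu_new, Lam_new                                       # posterior becomes the next prior

# Example: stream three samples through the update (prior and noise values are hypothetical)
mu, Lam = np.zeros(2), np.eye(2) / 10.0                          # broad prior N(0, 10 I)
for x, t in [([1.0, 1.0], 2.1), ([2.0, 1.0], 3.9), ([0.5, 1.0], 1.2)]:
    mu, Lam = recursive_bayes_update(mu, Lam, np.array(x), t, beta_y=25.0)
print(mu)
```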

More details and a pseudo-algorithm for incremental LGR can be found in Meier et al. (2014). Figure 3 illustrates the different shapes of local models being learned by LWPR and LGR. Local models learned by LGR collaborate to generate a good fit, as visualized in Fig. 3c. Compared to LWPR, this often allows LGR to achieve similar predictive performance while using fewer local models.

Finally, an interesting structural feature of local Gaussian regression is that it easily extends to a model with finitely many local nonparametric Gaussian process models.

Applications

Learning Internal Models with LWPR
Learning an internal model is one of the most typical applications of local regression methods for control. The model could be a forward model (e.g., the nonlinear differential equations of robot dynamics), an inverse model (e.g., the equations that predict the amount of torque to achieve a change of state in a robot), or any other function that models associations between input and output data about the environment. The models are used, subsequently, to compute a controller, e.g., an inverse dynamics controller similar to Eq. (16). Models for complex robots such as humanoids easily exceed a hundred input dimensions. In such high-dimensional spaces, it is hopeless to assume that a representative data set can be collected for offline training that can generalize sufficiently to related tasks.


[Figure 4a plots the MSE on a test set and the number of receptive fields against the number of training data points, for the parameter estimation baseline and LWPR.]

Locally Weighted Regression for Control, Fig. 4  Learning an inverse dynamics model in real time with a high-performance anthropomorphic robot arm. (a) Learning curve of LWPR online learning. (b) Seven degree-of-freedom Sarcos robot arm.

Thus, the local regression philosophy involves having a learning algorithm that can learn rapidly when entering a new part of the state space such that it can achieve acceptable generalization performance almost instantaneously. Both LWPR (Vijayakumar et al. 2005) and incremental LGR (Meier et al. 2014) have been applied to inverse dynamics learning tasks.

Figure 4 demonstrates online learning of an inverse dynamics model for the elbow joint (cf. Eq. 16) of a Sarcos Dexterous Robot Arm. The robot starts with no knowledge about this model, and it tracks some randomly varying desired trajectories with a proportional-derivative (PD) controller. During its movements, training data consisting of tuples (q, q̇, q̈, τ), which model a mapping from joint positions, joint velocities, and joint accelerations (q, q̇, q̈) to motor torques τ, are collected (at about every 2 ms). Here, every data point is used to train an LWPR function approximator, which generates a feedforward command for the controller. The learning curve is shown in Fig. 4a.
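A minimal sketch of such an online learning loop, assuming a hypothetical incremental regressor with update/predict methods (standing in for an LWPR or LGR implementation) and placeholder robot I/O functions; none of these names are taken from the entry or from the LWPR library.

```python
import numpy as np

class IncrementalModel:
    """Placeholder interface for an incremental regressor (e.g., LWPR or LGR)."""
    def update(self, x, y): ...
    def predict(self, x): ...

def control_cycle(model, robot, kp, kd):
    """One cycle of learning an inverse dynamics model while tracking a trajectory."""
    q, qd = robot.read_state()                        # current joint positions and velocities
    q_des, qd_des, qdd_des = robot.desired_state()    # reference trajectory at this time step
    x = np.concatenate([q_des, qd_des, qdd_des])      # model input (q, qdot, qddot)
    tau_ff = model.predict(x)                         # learned feedforward torque
    tau_fb = kp * (q_des - q) + kd * (qd_des - qd)    # PD feedback correction
    robot.send_torque(tau_ff + tau_fb)
    # The executed (q, qdot, qddot, tau) tuple becomes a new training sample.
    q, qd, qdd, tau = robot.read_executed_sample()
    model.update(np.concatenate([q, qd, qdd]), tau)
```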

Using a test set created beforehand, the model predictions of LWPR are compared every 1000 training points with those of a parameter estimation method. The parameter estimation approach fits the minimal number of parameters to an analytical model of the robot dynamics under idealized rigid body dynamics (RBD) assumptions, using all training data (i.e., not incrementally). Given that the Sarcos robot is a hydraulic robot, the RBD assumption is not very suitable and, as Fig. 4a shows, LWPR (in thick red line) outperforms the analytical model (in dotted blue line) after a rather short amount of training. After about 5 min of training (about 125,000 data points), very good performance is achieved, using about 350 local models. This example demonstrates (i) the quality of function approximation that can be achieved with LWPR and (ii) the online allocation of more local models as needed.

Learning Paired Inverse-Forward Models
Learning inverse models (such as inverse kinematics and inverse dynamics models) can be challenging since the inverse model problem is often a relation, not a function, with a one-to-many mapping. Applying any arbitrary nonlinear function approximation method to the inverse model problem can lead to unpredictably bad performance, as the training data can form non-convex solution spaces, in which averaging is inappropriate. Architectures such as mixture models (in particular, mixture density networks) have been proposed to address problems with non-convex solution spaces. A particularly interesting approach in control involves learning linearizations of a forward model (which is a proper function) and learning an inverse mapping within the local region of the forward model.

Locally Weighted Regression for Control, Fig. 5  SensAble Phantom haptic robotic arm.

Ting et al. (2008) demonstrated such a forward-inverse model learning approach with Bayesian LWR to learn an inverse kinematics model for a haptic robot arm (shown in Fig. 5) in order to control the end effector along a desired trajectory in task space. Training data was collected while the arm performed random sinusoidal movements within a constrained box volume of Cartesian space. Each sample consists of the arm's joint angles q, joint velocities q̇, end-effector position in Cartesian space x, and end-effector velocities ẋ. From these data, a forward kinematics model is learned:

\dot{x} = J(q)\,\dot{q} \qquad (14)

where J(q) is the Jacobian matrix. The transformation from q̇ to ẋ can be assumed to be locally linear at a particular configuration q of the robot arm. Bayesian LWR is used to learn the forward model, and, as in LWPR, local models are only added if a training point is not already sufficiently covered by an existing local model. Importantly, the kernel functions in LWR are localized only with respect to q, while the regression of each model is trained only on a mapping from q̇ to ẋ; these geometric insights are easily incorporated as priors in Bayesian LWR, as they are natural to locally linear models. Incorporating these priors in other function approximators, e.g., Gaussian process regression, is not straightforward.

The goal of the robot task is to track a desired trajectory (x, ẋ) specified only in terms of x and z positions and velocities, i.e., the movement is supposed to be in a vertical plane in front of the robot, but the exact position of the vertical plane is not given. Thus, the task has one degree of redundancy, and the learning system needs to generate a mapping from {x, ẋ} to q̇. Analytically, the inverse kinematics equation is

\dot{q} = J^{\#}(q)\,\dot{x} - \alpha\left(I - J^{\#}J\right)\frac{\partial g}{\partial q} \qquad (15)

where J^#(q) is the pseudo-inverse of the Jacobian. The second term is a gradient descent optimization term for redundancy resolution, specified here by a cost function g in terms of the joint angles q.

To learn an inverse kinematics model, the local regions of q from the forward model can be reused since any inverse of J is locally linear within these regions. Moreover, for locally linear models, all solution spaces for the inverse model are locally convex, such that an inverse can be learned without problems. The redundancy issue can be solved by applying an additional weight to each data point according to a reward function. Since the experimental task is specified in terms of {ẋ, ż}, a reward is defined based on a desired y coordinate, y_des, and enforced as a soft constraint. The resulting reward function is g = \exp\left(-\frac{1}{2} h \left(k\,(y_{des} - y) - \dot{y}\right)^2\right), where k is a gain and h specifies the steepness of the reward. This ensures that the learned inverse model chooses a solution that pushes y toward y_des. Each forward local model is inverted using a weighted linear regression, where each data point is weighted by the kernel weight from the forward model and additionally weighted by the reward. Thus, a piecewise locally linear solution to the inverse problem can be learned efficiently.
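A minimal sketch of this reward-weighted local inversion, assuming the data falling into one local region (one forward-model receptive field) are already collected; the function names and the regularized least-squares solver are illustrative choices, not taken from Ting et al. (2008).

```python
import numpy as np

def reward_weight(y, ydot, y_des, k=1.0, h=50.0):
    """Soft redundancy-resolution constraint g = exp(-0.5 h (k (y_des - y) - ydot)^2)."""
    return np.exp(-0.5 * h * (k * (y_des - y) - ydot) ** 2)

def invert_local_model(Qdot, Xdot, kernel_w, reward_w, ridge=1e-6):
    """Fit a local inverse mapping xdot -> qdot by doubly weighted linear regression.

    Qdot     : (N, n_joints) joint velocities observed in this local region
    Xdot     : (N, n_task)   corresponding task-space velocities
    kernel_w : (N,) kernel weights from the forward model's receptive field
    reward_w : (N,) reward weights enforcing the soft constraint above
    """
    w = kernel_w * reward_w                                   # combine both weightings per point
    Xa = np.hstack([Xdot, np.ones((Xdot.shape[0], 1))])       # affine local model
    WX = Xa * w[:, None]
    A = Xa.T @ WX + ridge * np.eye(Xa.shape[1])
    B = np.linalg.solve(A, WX.T @ Qdot)                       # weighted least squares solution
    return B                                                  # qdot ~= [xdot, 1] @ B in this region
```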

Figure 6 shows the performance of the learned inverse model (Learnt IK) in a figure-eight tracking task.


Locally Weighted Regression for Control, Fig. 6  Desired versus actual trajectories (z versus x, in meters) for the SensAble Phantom robot arm. (a) Analytical solution. (b) Learned solution.

Locally Weighted Regression for Control, Fig. 7  (a) Human arm model with six muscles; (b) optimized control sequence (left) and resulting trajectories (right) using the known analytic dynamics model. The control sequences (left target only) for each muscle (1–6) are drawn from bottom to top, with darker gray levels indicating stronger muscle activation.

The learned model performs as well as the analytical inverse kinematics solution (Analytical IK), with root-mean-squared tracking errors in positions and velocities very close to those of the analytical solution.

Learning Trajectory Optimizations
Mitrovic et al. (2008) have explored a theory for sensorimotor adaptation in humans, i.e., how humans replan their movement trajectories in the presence of perturbations. They rely on the iterative Linear Quadratic Gaussian (iLQG) algorithm (Todorov and Li 2004) to deal with the nonlinear and changing plant dynamics that may result from altered morphology, wear and tear, or external perturbations. They take advantage of the "on-the-fly" adaptation of locally weighted regression methods like LWPR to learn the forward dynamics of a simulated arm for the purpose of optimizing a movement trajectory between a start point and an end point.

Figure 7a shows the diagram of a two degrees-of-freedom planar human arm model, which is actuated by four single-joint and two double-joint antagonistic muscles. Although kinematically simple, the system is over-actuated and, therefore, an interesting test bed because large redundancies in the dynamics have to be resolved. The dimensionality of the control signals makes adaptation processes (e.g., to external force fields) quite demanding.


[Figure 8 block diagram components: iLQG, learned dynamics model, feedback controller, plant, cost function (incl. target), and perturbations; signals include the commands u and δu and the states x, ẋ.]

Locally Weighted Regression for Control, Fig. 8  Illustration of the learning and control scheme of the iterative Linear Quadratic Gaussian (iLQG) algorithm with learned dynamics.

The dynamics of the arm is, in part, based on standard RBD equations of motion:

\tau = M(q)\,\ddot{q} + C(q, \dot{q})\,\dot{q} \qquad (16)

where τ are the joint torques; q and q̇ are the joint angles and velocities, respectively; M(q) is the two-dimensional symmetric joint space inertia matrix; and C(q, q̇) accounts for Coriolis and centripetal forces. Given the antagonistic muscle-based actuation, it is not possible to command joint torques directly. Instead, the effective torques from the muscle activations u, which happen to be quadratic in u, should be used. As a result, in contrast to standard torque-controlled robots, the dynamics equation in Eq. (16) is nonlinear in the control signals u.

The iLQG algorithm (Todorov and Li 2004) is used to calculate solutions to "localized" linear and quadratic approximations, which are iterated to improve the global control solution. However, it relies on an analytical forward dynamics model ẋ = f(x, u) and finite difference methods to compute gradients. To alleviate this requirement and to make iLQG adaptive, LWPR can be used to learn an approximation of the plant's forward dynamics model. Figure 8 shows the control diagram, where the "learned dynamics model" (the forward model learned by LWPR) is updated in an online fashion with every iteration to cope with changes in dynamics. The resulting framework is called iLQG-LD (iLQG with learned dynamics).
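The overall iLQG-LD loop can be summarized with the following schematic sketch; the planner, model, plant, and cost interfaces are hypothetical placeholders, since the entry does not specify implementation details.

```python
def ilqg_ld_episode(plan_fn, model, plant, cost, x0, n_steps):
    """One episode of iLQG with learned dynamics (iLQG-LD), schematically."""
    # 1. Plan with the *learned* forward model in place of an analytic one.
    u_nom, x_nom, gains = plan_fn(model.predict, cost, x0, n_steps)

    transitions, x = [], x0
    for k in range(n_steps):
        # 2. Execute: nominal command plus state feedback around the planned trajectory.
        u = u_nom[k] + gains[k] @ (x - x_nom[k])
        x_next = plant.step(x, u)            # true (possibly perturbed) dynamics
        transitions.append((x, u, x_next))
        x = x_next

    # 3. Adapt: update the learned dynamics model with the executed transitions so that
    #    the next planning iteration accounts for perturbations (e.g., force fields).
    for x, u, x_next in transitions:
        model.update(x, u, x_next)
    return transitions
```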

Movements of the arm model in Fig. 7a are studied for fixed-time-horizon reaching movements. The manipulator starts at an initial position q_0 and reaches toward a target q_tar. The cost function to be optimized during the movement is a combination of target accuracy and amount of muscle activation (i.e., energy consumption). Figure 7b shows trajectories of generated movements for three reference targets (shown in red circles) using the feedback controller from iLQG with the analytical plant dynamics. The trajectories generated with iLQG-LD (where the forward plant dynamics are learned with LWPR) are omitted, as they are hardly distinguishable from the analytical solution.

A major advantage of iLQG-LD is that it does not rely on an accurate analytic dynamics model; this enables the framework to predict adaptation behavior under an ideal observer planning model. Reaching movements were studied where a constant unidirectional force field acting perpendicular to the reaching movement was generated as a perturbation (see Fig. 9, left). Using the iLQG-LD model, the manipulator gets strongly deflected when reaching for the target because the learned dynamics model cannot yet account for the "spurious" forces. However, when the deflected trajectory is used as training data and the dynamics model is updated online, the tracking improves with each new successive trial (Fig. 9, left). Please refer to Mitrovic et al. (2008) for more details. Aftereffects upon removing the force field, very similar to those observed in human experiments, are also observed.


Locally Weighted Regression for Control, Fig. 9  Adaptation to a unidirectional constant force field (indicated by the arrows). Darker lines indicate better trained models. In particular, the left-most trajectory corresponds to the "initial" control sequence, which was calculated using the LWPR model before the adaptation process. The fully "adapted" control sequence results in a nearly straight-line reaching movement.

Cross References

Bayesian Inference
Bias and Variance
Dimensionality Reduction
Direct and Indirect Control
Incremental Learning
Kernel Function
Kernel Methods
Lazy Learning
Linear Regression
Mixture Models
On-line Learning
Overfitting
Radial Basis Functions
Regression
Supervised Learning
Variational Approximations

Programs and Data

http://www-clmc.usc.edu/software
http://www.ipab.inf.ed.ac.uk/slmc/software/

Recommended Reading

Åström KJ, Wittenmark B (1989) Adaptive control. Addison-Wesley, Reading
Atkeson C (1989) Using local models to control movement. In: Proceedings of the advances in neural information processing systems, vol 1. Morgan Kaufmann, San Mateo, pp 157–183
Atkeson C, Moore A, Schaal S (1997) Locally weighted learning. AI Rev 11:11–73
Cleveland WS (1979) Robust locally weighted regression and smoothing scatterplots. J Am Stat Assoc 74:829–836
Hastie T, Loader C (1993) Local regression: automatic kernel carpentry. Stat Sci 8:120–143
Jordan MI, Jacobs R (1994) Hierarchical mixtures of experts and the EM algorithm. Neural Comput 6:181–214
Klanke S, Vijayakumar S, Schaal S (2008) A library for locally weighted projection regression. J Mach Learn Res 9:623–626
Meier F, Hennig P, Schaal S (2014) Incremental local Gaussian regression. In: Proceedings of advances in neural information processing systems, Montreal, vol 27
Mitrovic D, Klanke S, Vijayakumar S (2008) Adaptive optimal control for redundantly actuated arms. In: Proceedings of the 10th international conference on the simulation of adaptive behavior, Osaka. Springer, pp 93–102
Nguyen-Tuong D, Peters J, Seeger M (2008) Local Gaussian process regression for real-time online model learning. In: Proceedings of advances in neural information processing systems, Vancouver, vol 21
Schaal S, Atkeson CG (1998) Constructive incremental learning from only local information. Neural Comput 10(8):2047–2084


Schaal S, Atkeson CG, Vijayakumar S (2002) Scalable techniques from nonparametric statistics. Appl Intell 17:49–60
Ting J (2009) Bayesian methods for autonomous learning systems. PhD thesis, Department of Computer Science, University of Southern California
Ting J, Kalakrishnan M, Vijayakumar S, Schaal S (2008) Bayesian kernel shaping for learning control. In: Proceedings of advances in neural information processing systems, Vancouver, vol 21. MIT Press, pp 1673–1680
Todorov E, Li W (2004) A generalized iterative LQG method for locally-optimal feedback control of constrained nonlinear stochastic systems. In: Proceedings of the 1st international conference on informatics in control, automation and robotics, Setubal
Vijayakumar S, D'Souza A, Schaal S (2005) Incremental online learning in high dimensions. Neural Comput 17:2602–2634