
Neural Networks
Lecture 5

Regularized Learning Methods

Academic year 2013-2014

Simone Scardapane


Table of Contents

1 REGULARIZATION
  • Ridge Regression in Linear Models
  • Regularized ERM
  • Representer's Theorem

2 REGULARIZED LEARNING
  • Support Vector Machines
  • Consistency
  • Kernel Ridge Regression

3 SRM AND BAYES
  • Structural Risk Minimization
  • Bayesian Learning

4 MODEL SELECTION

5 REFERENCES


Ridge Regression in Linear Models

Ill-Posed Problems

In mathematics, a problem is well-posed if the solution:

• Exists,
• Is unique,
• Is stable with respect to the data of the problem.

Under this definition, we see that the learning problem, as formulated under the ERM principle, is highly ill-posed. This is a rather general property of inverse problems. It has been known since the work of Tikhonov and other mathematicians that an ill-posed problem can be solved by imposing some regularizing constraints on the solution, i.e., by penalizing "unwanted" behavior, such as excessive complexity or discontinuities.

Note: most of this lecture follows the exposition of [EPP00].


Ordinary Least Squares

Consider the simple case of f(x) = w^T x. Remember we are given a dataset in the form {(x_i, y_i)}, i = 1, ..., N. We define the matrix X = [x_1, ..., x_N]^T (whose rows are the samples) and the vector y = [y_1, ..., y_N]^T.

The ordinary least-squares (OLS) approach minimizes the cost:

$$I[w] = \|Xw - y\|^2 \tag{1}$$

Under some additional assumptions, the solution to (1) is given by:

$$w^* = (X^T X)^{-1} X^T y \tag{2}$$

Even if (2) can be computed, the matrix inversion can amount to a highly ill-posed problem whenever X^T X is badly conditioned.
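To make (1)-(2) concrete, here is a minimal NumPy sketch on synthetic data (the sizes and the solver choice are illustrative, not from the slides); np.linalg.lstsq is an SVD-based least-squares solver that avoids forming X^T X explicitly:

```python
import numpy as np

# Toy regression data (hypothetical sizes): N samples, d features.
rng = np.random.default_rng(0)
N, d = 50, 3
X = rng.normal(size=(N, d))            # rows are the samples x_i^T
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=N)

# OLS via the normal equations (2): w* = (X^T X)^{-1} X^T y.
w_normal_eq = np.linalg.solve(X.T @ X, X.T @ y)

# Numerically safer alternative: an SVD-based least-squares solver.
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(w_normal_eq)
print(w_lstsq)
```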


Ridge Regression

A possible solution is to penalize large weights, obtaining the so-called ridge regression estimate:

$$I_{\text{reg}}[w] = \|Xw - y\|^2 + \lambda\|w\|^2 \tag{3}$$

where λ is called a regularization factor. The solution to (3) is now given by:

$$w^* = (X^T X + \lambda I_d)^{-1} X^T y \tag{4}$$

where I_d is the d × d identity matrix, d being the number of features. As λ → 0 the term λI_d vanishes and we are left with standard OLS; for a sufficiently large λ, the matrix to be inverted is guaranteed to be well conditioned.
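A short sketch (again on synthetic, nearly collinear data chosen for illustration) showing how the term λI in (4) improves the conditioning of the matrix to be inverted:

```python
import numpy as np

# Ill-conditioned design: two nearly collinear features (hypothetical data).
rng = np.random.default_rng(1)
N = 100
x1 = rng.normal(size=N)
X = np.column_stack([x1, x1 + 1e-6 * rng.normal(size=N)])
y = X @ np.array([1.0, 1.0]) + 0.1 * rng.normal(size=N)

lam = 1e-2                        # regularization factor lambda
d = X.shape[1]
A_ols = X.T @ X
A_ridge = X.T @ X + lam * np.eye(d)

# Regularization drastically improves the condition number of the matrix to invert.
print("cond(X^T X)         =", np.linalg.cond(A_ols))
print("cond(X^T X + lam I) =", np.linalg.cond(A_ridge))

w_ridge = np.linalg.solve(A_ridge, X.T @ y)   # equation (4)
print("ridge weights:", w_ridge)
```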


Regularized ERM

Regularized Learning Methods

Generalizing the previous considerations, consider the following regularized version of ERM:

$$\min_{f \in \mathcal{H}} \sum_{i=1}^{N} L(y_i, f(x_i)) + \lambda\, \Phi(\|f\|) \tag{5}$$

The minimization is performed over a Reproducing Kernel Hilbert Space H. The additional term Φ(‖f‖) should be a monotonically increasing function of the norm of the function (in the following we will always take Φ(‖f‖) = ‖f‖²). As before, λ is called a regularization factor: for λ → +∞ we will choose a function with zero norm, while for λ → 0 we recover ERM.


Norms and Smoothness

An important point is that the actual definition of "smoothness" we are enforcing depends on the norm of the function, which in turn depends on the kernel we choose. For example, in the case of the linear kernel k(x, y) = x^T y we obtain ‖f‖² = w^T w, i.e., standard ridge regression. Consider instead the Gaussian kernel that we defined in the previous lecture. It can be shown that the associated norm is given by:

$$\|f\|^2 = \frac{1}{2\pi N} \int_{\mathcal{X}} |\tilde{f}(\omega)|^2 \exp\left\{\frac{\sigma^2 \omega^2}{2}\right\} d\omega$$

where f̃ is the Fourier transform of f. Hence, high-frequency components are penalized more heavily than low-frequency components.


Representer’s Theorem

Statement

Theorem 1 (Representer’s Theorem)

In equation (5), suppose Φ(‖f‖²) is a non-decreasing function. Then, a solution to (5) can always be expressed as:

$$f(x) = \sum_{i=1}^{N} \alpha_i\, k(x, x_i)$$

Moreover, if Φ(‖f‖²) is monotonically increasing, all solutions have this form.

The Representer's Theorem is fundamental: a possibly infinite-dimensional search (over the RKHS) reduces to a finite-dimensional search (over the N coefficients α_i).
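As an illustration of this finite-dimensional form, a small sketch that evaluates f(x) = Σ_i α_i k(x, x_i) with a Gaussian kernel; the training points and expansion coefficients below are hypothetical placeholders:

```python
import numpy as np

def gaussian_kernel(x, z, sigma=1.0):
    """Gaussian (RBF) kernel k(x, z) = exp(-||x - z||^2 / (2 sigma^2))."""
    return np.exp(-np.sum((x - z) ** 2) / (2.0 * sigma ** 2))

def f(x, X_train, alpha, sigma=1.0):
    """Representer-theorem form: f(x) = sum_i alpha_i k(x, x_i)."""
    return sum(a * gaussian_kernel(x, xi, sigma) for a, xi in zip(alpha, X_train))

# Hypothetical training inputs and expansion coefficients.
X_train = np.array([[0.0], [1.0], [2.0]])
alpha = np.array([0.5, -1.0, 0.3])
print(f(np.array([1.5]), X_train, alpha))
```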


Considering a Bias Term

It is now time to answer a question: where has the bias term b gone? Note that, practically, the inclusion of a bias amounts to a shift of the hyperplane in the feature space, and hence to a different decision boundary. Theoretically, this requires an extension of the Representer's Theorem to conditionally PSD kernels. It can be shown that using a bias term is equivalent to using a different kernel (and hence a different feature space) in which constant features are not penalized. See [PMR+01] for a lengthy discussion of the subject.


Support Vector Machines

C-SVM

The non-linear Support Vector Machine that we derived from a geometrical viewpoint fits into this framework, with the use of the hinge loss function:

$$L(y, f(x)) = (1 - y f(x))_+, \quad \text{with } (a)_+ = \max\{0, a\}$$

This can be shown by first demonstrating that ‖f‖² = α^T K α, where α = [α_1, ..., α_N]^T and K is the Gram matrix. The slack variables are then introduced to deal with the non-differentiability of the hinge loss.
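For concreteness, a minimal sketch of the resulting regularized hinge objective written in the expansion coefficients α (this only evaluates the objective on toy data; it is not how the SVM quadratic program is actually solved):

```python
import numpy as np

def gaussian_gram(X, sigma=1.0):
    """Gram matrix K with K[i, j] = exp(-||x_i - x_j||^2 / (2 sigma^2))."""
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq / (2.0 * sigma ** 2))

def csvm_objective(alpha, K, y, lam):
    """Regularized hinge objective: sum_i (1 - y_i f(x_i))_+ + lam * alpha^T K alpha."""
    f_vals = K @ alpha                     # f(x_i) = sum_j alpha_j k(x_i, x_j)
    hinge = np.maximum(0.0, 1.0 - y * f_vals)
    return hinge.sum() + lam * alpha @ K @ alpha

# Hypothetical toy classification problem.
X = np.array([[0.0], [1.0], [3.0], [4.0]])
y = np.array([-1.0, -1.0, 1.0, 1.0])
K = gaussian_gram(X)
alpha = np.zeros(len(y))
print(csvm_objective(alpha, K, y, lam=0.1))   # equals N at alpha = 0
```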


Consistency

Consistency of Regularized Learning Methods

The regularization framework can also be used to derive general theorems on the consistency of learning methods. As an example, for continuous kernels, it can be shown that the C-SVM is consistent if and only if:

• The kernel is universal (see next slide),
• The regularization factor λ is chosen "large enough".

The condition on λ is highly technical but can be simplified in some contexts. For example, for the Gaussian kernel, the C-SVM is consistent if λ is chosen such that:

$$\lambda = N^{\beta - 1}$$

for some 0 < β < 1/d (where d is the dimensionality of the input).


Universal Kernels

Consider the space of functions induced by the kernel:

$$\mathcal{F} = \operatorname{span}\{k(\cdot, x) : x \in \mathcal{X}\}$$

The kernel is said to be universal if F is dense in C[X], i.e., for every f ∈ C[X] and every ε > 0 there exists a g ∈ F such that:

$$\|f - g\| \le \epsilon$$

As an example, the Gaussian kernel is universal, while the polynomialkernel is not.


R-SVM

From the regularization framework it is also possible to directly derive a version of the SVM for regression, that we will call R-SVM. Consider the ε-insensitive loss function:

$$L(y, f(x)) = (|y - f(x)| - \epsilon)_+$$

which penalizes errors larger than ε linearly. By introducing two sets of slack variables we obtain the following differentiable cost function:

$$\begin{aligned}
\min_{\zeta_i^+,\, \zeta_i^-} \quad & \frac{1}{2}\sum_{i=1}^{N} (\zeta_i^+ + \zeta_i^-) + \lambda \|f\|^2 \\
\text{subject to} \quad & y_i - f(x_i) \le \epsilon + \zeta_i^+ \\
& f(x_i) - y_i \le \epsilon + \zeta_i^- \\
& \zeta_i^+,\, \zeta_i^- \ge 0
\end{aligned} \tag{6}$$


R-SVM (2)

The dual optimization problem of (6) is given by:

$$\begin{aligned}
\max_{\alpha_i, \beta_i} \quad & -\epsilon \sum_{i=1}^{N} (\beta_i + \alpha_i) + \sum_{i=1}^{N} (\beta_i - \alpha_i)\, y_i - \frac{1}{2} \sum_{i,j=1}^{N} (\beta_i - \alpha_i)(\beta_j - \alpha_j)\, k(x_i, x_j) \\
\text{s.t.} \quad & 0 \le \alpha_i, \beta_i \le \lambda \\
& \sum_{i=1}^{N} (\beta_i - \alpha_i) = 0
\end{aligned} \tag{7}$$

The final regression function is given by:

$$f(x) = \sum_{i=1}^{N} (\beta_i - \alpha_i)\, k(x, x_i)$$
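For reference, ε-insensitive regression is available in scikit-learn; a minimal usage sketch follows (note that sklearn's SVR is parameterized by C, which acts as an inverse regularization factor rather than λ, and the data here are synthetic placeholders):

```python
import numpy as np
from sklearn.svm import SVR

# Hypothetical 1-D regression data.
rng = np.random.default_rng(2)
X = np.sort(rng.uniform(0, 5, size=(40, 1)), axis=0)
y = np.sin(X).ravel() + 0.1 * rng.normal(size=40)

# epsilon is the width of the insensitive tube; C ~ 1/lambda controls regularization.
model = SVR(kernel="rbf", C=10.0, epsilon=0.1, gamma=1.0)
model.fit(X, y)
print("number of support vectors:", model.support_vectors_.shape[0])
print("prediction at x = 2.5:", model.predict([[2.5]]))
```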


Kernel Ridge Regression


Another important class of learning methods is obtained by considering the squared loss function (kernel ridge regression). As in linear ridge regression, it can be shown that the solution satisfies the following set of linear equations:

$$(K + \lambda I_N)\,\alpha = y$$

Although this is simpler to solve than the SVM optimization problem, sparseness of the solution is lost.
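A minimal NumPy sketch of kernel ridge regression on synthetic data: solve (K + λI)α = y, then predict through the representer expansion (the kernel width, λ, and the data are hypothetical choices):

```python
import numpy as np

def gaussian_gram(X, sigma=1.0):
    """Gram matrix of the Gaussian kernel."""
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq / (2.0 * sigma ** 2))

# Hypothetical training data.
rng = np.random.default_rng(3)
X = np.sort(rng.uniform(0, 5, size=(30, 1)), axis=0)
y = np.sin(X).ravel() + 0.1 * rng.normal(size=30)

lam = 1e-2
K = gaussian_gram(X)
alpha = np.linalg.solve(K + lam * np.eye(len(y)), y)   # (K + lambda I) alpha = y

# Predict at a new point via the representer expansion f(x) = sum_i alpha_i k(x, x_i).
x_new = np.array([[2.5]])
k_new = np.exp(-np.sum((x_new - X) ** 2, axis=1) / (2.0 * 1.0 ** 2))  # sigma = 1
print("prediction:", k_new @ alpha)
```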


Structural Risk Minimization


Let us explore the link between regularization and SRM in the case of the hinge loss function (for other loss functions, some technical problems arise [EPP00]). Consider the sequence of spaces H_1 ⊂ H_2 ⊂ · · · such that:

$$\|f\| \le A_i, \quad \forall f \in \mathcal{H}_i$$

where we have chosen the scalars A_i such that A_1 ≤ A_2 ≤ · · ·. Minimizing the empirical risk on the space H_k amounts to solving:

$$\max_{\lambda \ge 0}\; \min_{f \in \mathcal{H}_k} \sum_{i=1}^{N} L(y_i, f(x_i)) + \lambda\left(\|f\|^2 - A_k^2\right) \tag{8}$$


Structural Risk Minimization (2)

Solving (8) for every A_k gives us a sequence of optimal values λ*_1, λ*_2, . . ..

After minimizing the empirical risk, we choose the function that minimizes a given VC bound, with associated λ*_i = λ*. The overall operation is equivalent to directly solving:

$$\min_{f \in \mathcal{H}_k} \sum_{i=1}^{N} L(y_i, f(x_i)) + \lambda^* \|f\|^2 \tag{9}$$

Hence, regularization can be seen as an approximate solution to SRM, where the regularization factor is chosen based on knowledge of the VC dimension.


Bayesian Learning

Bayesian View

Another perspective on regularization comes from considering the Bayesian approach to learning. Suppose we are given the following elements:

• A prior probability distribution P(f), f ∈ H, that represents our a priori knowledge on the goodness of each function.

• A likelihood probability distribution P(S|f) that gives us the probability of observing a dataset, supposing that the true function is f.

According to Bayes' law, once we observe a dataset S, the posterior distribution is computed as:

$$P(f \mid S) = \frac{P(S \mid f)\, P(f)}{\sum_{f} P(S \mid f)\, P(f)} \tag{10}$$


Using the Posterior

The Bayes decision function is obtained by averaging over all possible functions:

$$f(x) = \int_{\mathcal{H}} f(x)\, dP(f \mid S)$$

In practice, simpler estimates can be considered:

• Maximum a Posteriori (MAP), i.e., maximizing the posterior.
• Maximum Likelihood (ML), i.e., maximizing the likelihood. This is equivalent to not making prior assumptions on the shape of the function (uninformative prior).


Regularization and Bayes

Suppose we penalize our models as:

$$P(f) \propto \exp\{-\|f\|^2\}$$

Additionally, suppose the noise in the system is normally distributed with variance σ²:

$$P(S \mid f) \propto \exp\left\{-\frac{1}{2\sigma^2} \sum_{i=1}^{N} (y_i - f(x_i))^2\right\}$$

The posterior is then proportional to:

$$P(f \mid S) \propto \exp\left\{-\frac{1}{2\sigma^2} \sum_{i=1}^{N} (y_i - f(x_i))^2 - \|f\|^2\right\}$$


Regularization and Bayes (2)

Taking the MAP estimate is equivalent to minimizing the negative of the exponent of the posterior:

$$\min_{f \in \mathcal{H}}\; \frac{1}{N}\sum_{i=1}^{N} (y_i - f(x_i))^2 + \frac{2\sigma^2}{N}\, \|f\|^2 \tag{11}$$

Hence, this amounts to making a specific, data-independent choice of the regularization factor. Similar considerations can be made for other choices of the loss function and of the regularization function.
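As a numerical check of this correspondence (an illustrative sketch, not from the slides), minimizing the negative log-posterior in the form (11) over the expansion coefficients recovers a kernel ridge solution; the Gaussian kernel and the assumed known noise level sigma are hypothetical choices:

```python
import numpy as np
from scipy.optimize import minimize

def gaussian_gram(X, sigma_k=1.0):
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq / (2.0 * sigma_k ** 2))

rng = np.random.default_rng(4)
N = 20
X = np.sort(rng.uniform(0, 5, size=(N, 1)), axis=0)
sigma = 0.2                                    # assumed known noise standard deviation
y = np.sin(X).ravel() + sigma * rng.normal(size=N)
K = gaussian_gram(X)

def map_objective(alpha):
    """Negative log-posterior (up to constants), written as in (11)."""
    f_vals = K @ alpha
    return np.sum((y - f_vals) ** 2) / N + (2 * sigma ** 2 / N) * alpha @ K @ alpha

alpha_map = minimize(map_objective, np.zeros(N), method="L-BFGS-B").x

# Same minimizer in closed form: multiplying (11) by N gives
# sum_i (y_i - f(x_i))^2 + 2*sigma^2 * ||f||^2, i.e. kernel ridge with lambda = 2*sigma^2.
alpha_krr = np.linalg.solve(K + 2 * sigma ** 2 * np.eye(N), y)
print("max |alpha_map - alpha_krr| =", np.max(np.abs(alpha_map - alpha_krr)))
```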


Holdout Method

Before concluding, we look at a practical issue: how can we test the accuracy of a trained model? The simplest idea is the so-called Holdout method:

• Subdivide the original dataset into a training set and a testing set.
• Train the model on the former and test it on the latter.
• Repeat steps 1-2 a number of times and average the results (possibly computing a confidence interval).

In general, however, k-fold cross-validation is preferable.
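A minimal holdout sketch with scikit-learn's train_test_split (the estimator, split size, and synthetic data are arbitrary placeholders):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR

# Hypothetical dataset.
rng = np.random.default_rng(5)
X = rng.uniform(0, 5, size=(100, 1))
y = np.sin(X).ravel() + 0.1 * rng.normal(size=100)

# Hold out 30% of the data for testing.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

model = SVR(kernel="rbf", C=10.0, epsilon=0.1).fit(X_tr, y_tr)
print("test R^2:", model.score(X_te, y_te))   # score() returns R^2 for regressors
```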


Cross-Validation

Here is how the method works:

• Subdivide the original dataset into k equally-sized subsets (folds).
• Repeat for i = 1, . . . , k:
  • Train the model on the union of all folds except fold i.
  • Test the obtained model on fold i.
• Average over the results as before.

Typical values of k are between 3 and 10. A special case is given by k = N, known as leave-one-out cross-validation, which possesses interesting theoretical properties but is computationally expensive.
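The same idea with k-fold cross-validation, using scikit-learn's cross_val_score (again with placeholder data and estimator):

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVR

# Hypothetical dataset.
rng = np.random.default_rng(6)
X = rng.uniform(0, 5, size=(100, 1))
y = np.sin(X).ravel() + 0.1 * rng.normal(size=100)

# 5-fold cross-validation; each fold is held out once for testing.
scores = cross_val_score(SVR(kernel="rbf", C=10.0), X, y, cv=5)
print("mean R^2 over folds:", scores.mean(), "+/-", scores.std())
```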


Model Selection

Cross-validation can be used for model selection, i.e., choosing the optimal parameters of the model. Suppose we have a set of M possible configurations. We can perform a k-fold cross-validation on every configuration and choose the optimal one. Note that in this case we can have two nested cross-validations, one for testing and one for validation. As an example, for the C-SVM with a polynomial kernel we may test all configurations for C = 2^{-15}, ..., 2^{5} and p = 1, ..., 15. More powerful methods exist for specific cases (such as the C-SVM with a Gaussian kernel [HRTZ05]).
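A grid-search sketch with scikit-learn's GridSearchCV; the grid loosely mirrors the C and p ranges mentioned above (truncated to keep it fast), with "degree" being sklearn's name for the polynomial order p, and the data being a synthetic placeholder:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Hypothetical binary classification data.
rng = np.random.default_rng(7)
X = rng.normal(size=(200, 2))
y = (X[:, 0] ** 2 + X[:, 1] ** 2 > 1.0).astype(int)

param_grid = {
    "C": [2.0 ** k for k in range(-5, 6)],   # subset of the 2^-15, ..., 2^5 grid
    "degree": [1, 2, 3, 4, 5],               # polynomial order p
}
search = GridSearchCV(SVC(kernel="poly"), param_grid, cv=5)
search.fit(X, y)
print("best parameters:", search.best_params_)
print("best CV accuracy:", search.best_score_)
```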


Bibliography I

O. Chapelle, V. Vapnik, O. Bousquet, and S. Mukherjee, Choosing multiple parameters for support vector machines, Machine Learning (2002), 131–159.

T. Evgeniou, M. Pontil, and T. Poggio, Regularization networks and support vector machines, Advances in Computational Mathematics 13 (2000), 1–50.

T. Hastie, S. Rosset, R. Tibshirani, and J. Zhu, The entire regularization path for the support vector machine, Journal of Machine Learning Research 5 (2005), 1391–1415.

C.A. Micchelli, Y. Xu, and H. Zhang, Universal kernels, Journal of Machine Learning Research 7 (2006), 2651–2667.

T. Poggio, S. Mukherjee, R. Rifkin, A. Rakhlin, and A. Verri, b, Tech. report, 2001.


Bibliography II

I. Steinwart, Support vector machines are universally consistent, Journal of Complexity 18 (2002), no. 3, 768–791.

I. Steinwart and A. Christmann, Support vector machines, 1st ed., Springer, 2008.