MIT 9.520/6.860, Fall 2018
Statistical Learning Theory and Applications
Class 02: Statistical Learning Setting
Lorenzo Rosasco
Learning from examples
- Machine learning deals with systems that are trained from data rather than being explicitly programmed.
- Here we describe the framework considered in statistical learning theory.
It all starts with DATA
- Supervised: {(x1, y1), . . . , (xn, yn)}.
- Unsupervised: {x1, . . . , xm}.
- Semi-supervised: {(x1, y1), . . . , (xn, yn)} ∪ {x1, . . . , xm}.
Supervised learning
Problem: given Sn = {(x1, y1), . . . , (xn, yn)}, find f such that f(xnew) ∼ ynew.
The supervised learning problem
- X × Y probability space, with measure P.
- ℓ : Y × Y → [0, ∞), measurable loss function.
Define the expected risk:

L(f) = ∫_{X×Y} ℓ(y, f(x)) dP(x, y).
Problem: solve

min_{f : X→Y} L(f),

given only Sn = (x1, y1), . . . , (xn, yn) ∼ P^n, sampled i.i.d. with P fixed but unknown.
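A minimal numerical sketch of the expected risk (the distribution, candidate f and loss below are illustrative assumptions; in the learning problem P is unknown and only Sn is available): if we could sample freely from P, L(f) for a candidate f could be approximated by a Monte Carlo average.

    import numpy as np

    rng = np.random.default_rng(0)

    def sample(n):                              # a hypothetical P(x, y) we can sample from
        x = rng.uniform(-1, 1, n)
        y = np.sin(3 * x) + rng.normal(0, 0.1, n)
        return x, y

    f = lambda x: 2 * x                         # some candidate function f : X -> Y
    loss = lambda y, fx: (y - fx) ** 2          # square loss
    x, y = sample(500_000)
    print("L(f) ~", np.mean(loss(y, f(x))))     # Monte Carlo estimate of the expected risk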
Data space
X × Y, where X is the input space and Y is the output space.
Input space
X input space:
- Linear spaces, e.g.:
  – vectors,
  – functions,
  – matrices/operators.
- “Structured” spaces, e.g.:
  – strings,
  – probability distributions,
  – graphs.
Output space
Y output space:
- Linear spaces, e.g.:
  – Y = R, regression,
  – Y = R^T, multitask regression,
  – Y a Hilbert space, functional regression.
- “Structured” spaces, e.g.:
  – Y = {−1, 1}, classification,
  – Y = {1, . . . , T}, multicategory classification,
  – strings,
  – probability distributions,
  – graphs.
Probability distribution
Reflects uncertainty and stochasticity of the learning problem,
P(x, y) = PX(x) P(y|x),

- PX: marginal distribution on X,
- P(y|x): conditional distribution on Y given x ∈ X.
Conditional distribution and noise
[Figure: samples (x1, y1), . . . , (x5, y5) scattered around the target function f∗]
Regression:

yi = f∗(xi) + εi,

- f∗ : X → Y a fixed function,
- ε1, . . . , εn zero-mean random variables, εi ∼ N(0, σ),
- x1, . . . , xn random,

so that P(y|x) = N(f∗(x), σ).
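A minimal sketch of this data-generating model in Python, with a hypothetical f∗ and arbitrary choices of n and σ:

    import numpy as np

    rng = np.random.default_rng(0)
    n, sigma = 20, 0.2                          # arbitrary sample size and noise level
    f_star = lambda x: np.cos(2 * np.pi * x)    # hypothetical fixed target f*
    x = rng.uniform(0, 1, n)                    # x_1, ..., x_n drawn at random
    eps = rng.normal(0, sigma, n)               # zero-mean Gaussian noise
    y = f_star(x) + eps                         # y_i = f*(x_i) + eps_i, so P(y|x) = N(f*(x), sigma)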
Conditional distribution and misclassification
Classification:

P(y|x) = {P(1|x), P(−1|x)}.
Noise in classification: overlap between the classes,
Δδ = { x ∈ X : |P(1|x) − 1/2| ≤ δ }.
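A minimal sketch computing Δδ for an assumed posterior P(1|x) (the logistic shape, the grid and δ are illustrative choices):

    import numpy as np

    x = np.linspace(-1, 1, 2001)                # grid over the input space (illustrative)
    p1 = 1 / (1 + np.exp(-8 * x))               # assumed posterior P(1|x)
    delta = 0.1
    region = x[np.abs(p1 - 0.5) <= delta]       # the noise region Delta_delta
    print(region.min(), region.max())           # a small interval around the decision boundary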
Marginal distribution and sampling
PX takes into account uneven sampling of the input space.
Marginal distribution, densities and manifolds
p(x) = dPX(x)/dx   ⇒   p(x) = dPX(x)/dvol(x),

i.e. the density with respect to the Lebesgue measure, or to the volume measure when the data lie on a manifold.
Loss functions
ℓ : Y × Y → [0, ∞)

- Cost of predicting f(x) in place of y.
- Measures the pointwise error ℓ(y, f(x)).
- Part of the problem definition, since L(f) = ∫_{X×Y} ℓ(y, f(x)) dP(x, y).
Note: sometimes it is useful to consider losses of the form

ℓ : Y × G → [0, ∞)

for some space G, e.g. G = R.
Loss for regression
ℓ(y, y′) = V(y − y′),   V : R → [0, ∞).

- Square loss: ℓ(y, y′) = (y − y′)².
- Absolute loss: ℓ(y, y′) = |y − y′|.
- ε-insensitive loss: ℓ(y, y′) = max(|y − y′| − ε, 0).
[Plot: the square, absolute and ε-insensitive losses as functions of y − y′]
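A minimal sketch of the three losses as functions of the residual r = y − y′ (the evaluation grid and the value of ε are arbitrary choices):

    import numpy as np

    def square_loss(r):                         # r = y - y'
        return r ** 2

    def absolute_loss(r):
        return np.abs(r)

    def eps_insensitive_loss(r, eps=0.25):      # eps chosen arbitrarily
        return np.maximum(np.abs(r) - eps, 0.0)

    r = np.linspace(-1.0, 1.0, 5)
    print(square_loss(r))
    print(absolute_loss(r))
    print(eps_insensitive_loss(r))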
Loss for classification
ℓ(y, y′) = V(−y y′),   V : R → [0, ∞).

- 0-1 loss: ℓ(y, y′) = Θ(−y y′), with Θ(a) = 1 if a ≥ 0 and 0 otherwise.
- Square loss: ℓ(y, y′) = (1 − y y′)².
- Hinge loss: ℓ(y, y′) = max(1 − y y′, 0).
- Logistic loss: ℓ(y, y′) = log(1 + exp(−y y′)).
[Plot: the 0-1, square, hinge and logistic losses as functions of y y′]
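A minimal sketch of the four losses as functions of the margin m = y y′ (the grid is an arbitrary choice):

    import numpy as np

    def zero_one_loss(m):                       # m = y * y'; Theta(-m)
        return np.where(-m >= 0, 1.0, 0.0)

    def square_loss(m):
        return (1.0 - m) ** 2

    def hinge_loss(m):
        return np.maximum(1.0 - m, 0.0)

    def logistic_loss(m):
        return np.log(1.0 + np.exp(-m))

    m = np.linspace(-2.0, 2.0, 5)
    for loss in (zero_one_loss, square_loss, hinge_loss, logistic_loss):
        print(loss.__name__, loss(m))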
Loss function for structured prediction
The loss is specific to each learning task, e.g.:

- Multiclass: square loss, weighted square loss, logistic loss, . . .
- Multitask: weighted square loss, absolute loss, . . .
- . . .
Expected risk
L(f) = ∫_{X×Y} ℓ(y, f(x)) dP(x, y),

with f ∈ F, F = { f : X → Y | f measurable }.
Example
Y = {−1, +1}, ℓ(y, f(x)) = Θ(−y f(x)), with Θ(a) = 1 if a ≥ 0 and 0 otherwise. Then

L(f) = P( { (x, y) ∈ X × Y | f(x) ≠ y } ).
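A minimal sketch checking this identity numerically, assuming a synthetic posterior P(1|x) and a fixed classifier (both illustrative choices):

    import numpy as np

    rng = np.random.default_rng(0)
    n = 200_000
    x = rng.uniform(-1, 1, n)
    eta = 1 / (1 + np.exp(-5 * x))                      # assumed posterior P(1|x)
    y = np.where(rng.uniform(size=n) < eta, 1, -1)
    f = np.sign(x)                                      # a fixed classifier, f(x) = sign(x)
    risk_01 = np.mean(np.where(-y * f >= 0, 1.0, 0.0))  # Monte Carlo estimate of E[Theta(-y f(x))]
    print(risk_01, np.mean(f != y))                     # agrees with P(f(x) != y)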
Target function
fP = arg min_{f∈F} L(f)

can be derived for many loss functions.
L(f) = ∫ ℓ(y, f(x)) dP(x, y) = ∫ dPX(x) ∫ ℓ(y, f(x)) dP(y|x),

where the inner integral Lx(f(x)) = ∫ ℓ(y, f(x)) dP(y|x) is called the inner risk.

It is possible to show that:

- inf_{f∈F} L(f) = ∫ dPX(x) inf_{a∈R} Lx(a).
- Minimizers of L(f) can be derived “pointwise” from the inner risk Lx(f(x)).
Target functions in regression
Square loss:

fP(x) = ∫_Y y dP(y|x).

Absolute loss:

fP(x) = median(P(y|x)),

where median(p(·)) is the value y such that ∫_{−∞}^{y} dp(t) = ∫_{y}^{+∞} dp(t).
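A minimal sketch, assuming a skewed conditional distribution P(y|x) (an exponential, chosen only for illustration): minimizing the inner risk over a grid of candidate values recovers the conditional mean for the square loss and the conditional median for the absolute loss.

    import numpy as np

    rng = np.random.default_rng(0)
    y = rng.exponential(1.0, 200_000)           # samples from an assumed (skewed) P(y|x)
    a = np.linspace(0.0, 3.0, 301)              # candidate values a = f(x)
    inner_square = np.array([np.mean((y - ai) ** 2) for ai in a])
    inner_absolute = np.array([np.mean(np.abs(y - ai)) for ai in a])
    print(a[inner_square.argmin()], y.mean())         # square loss   -> conditional mean (about 1.0)
    print(a[inner_absolute.argmin()], np.median(y))   # absolute loss -> conditional median (about 0.69)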
Target functions in classification
Misclassification loss:

fP(x) = sign( P(1|x) − P(−1|x) ).

Square loss:

fP(x) = P(1|x) − P(−1|x).

Logistic loss:

fP(x) = log( P(1|x) / P(−1|x) ).

Hinge loss:

fP(x) = sign( P(1|x) − P(−1|x) ).
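A minimal sketch evaluating these target functions for an assumed posterior P(1|x) (the logistic-shaped posterior and the grid are illustrative choices); note that they all share the sign of P(1|x) − 1/2.

    import numpy as np

    x = np.linspace(-1.0, 1.0, 5)
    eta = 1 / (1 + np.exp(-4 * x))              # assumed posterior P(1|x)
    f_misclass = np.sign(2 * eta - 1)           # sign(P(1|x) - P(-1|x))
    f_square = 2 * eta - 1                      # P(1|x) - P(-1|x)
    f_logistic = np.log(eta / (1 - eta))        # log P(1|x) / P(-1|x)
    f_hinge = np.sign(2 * eta - 1)
    print(f_misclass, f_square, f_logistic, f_hinge, sep="\n")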
Learning algorithms
Solve

min_{f∈F} L(f),

given only Sn = (x1, y1), . . . , (xn, yn) ∼ P^n.

Learning algorithm: Sn → f̂n = f̂Sn.

f̂n estimates fP given the observed examples Sn.
How to measure the error of an estimate?
Excess risk
Excess risk:

L(f̂) − min_{f∈F} L(f).

Consistency: for any ε > 0,

lim_{n→∞} P( L(f̂) − min_{f∈F} L(f) > ε ) = 0.
Other forms of consistency
Consistency in expectation:

lim_{n→∞} E[ L(f̂) − min_{f∈F} L(f) ] = 0.

Consistency almost surely:

P( lim_{n→∞} ( L(f̂) − min_{f∈F} L(f) ) = 0 ) = 1.

Note: the different notions of consistency correspond to different notions of convergence for random variables: in probability (weak), in expectation, and almost sure.
Sample complexity, tail bounds and error bounds
- Sample complexity: for any ε > 0 and δ ∈ (0, 1], when n ≥ nP,F(ε, δ),

  P( L(f̂) − min_{f∈F} L(f) ≥ ε ) ≤ δ.

- Tail bounds: for any ε > 0 and n ∈ N,

  P( L(f̂) − min_{f∈F} L(f) ≥ ε ) ≤ δP,F(n, ε).

- Error bounds: for any δ ∈ (0, 1] and n ∈ N,

  P( L(f̂) − min_{f∈F} L(f) ≤ εP,F(n, δ) ) ≥ 1 − δ.
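A minimal simulation sketch of a tail probability, assuming a synthetic one-dimensional problem and least squares over linear functions as the learning algorithm (all constants are illustrative): P( L(f̂) − min L ≥ ε ) is estimated by repeating the experiment on fresh samples of size n.

    import numpy as np

    rng = np.random.default_rng(0)
    w_star, sigma = 0.8, 0.3                        # hypothetical target f*(x) = w_star * x and noise level

    def sample(n):
        x = rng.uniform(-1, 1, n)
        return x, w_star * x + rng.normal(0, sigma, n)

    def risk(w):                                    # expected square risk, in closed form for this model
        return (w - w_star) ** 2 / 3 + sigma ** 2   # E[x^2] = 1/3 for x ~ U(-1, 1)

    eps, n, trials = 0.01, 50, 2000
    excess = []
    for _ in range(trials):
        x, y = sample(n)
        w_hat = (x @ y) / (x @ x)                   # least squares (ERM) over the linear class
        excess.append(risk(w_hat) - risk(w_star))
    print("estimated P(excess risk >= eps):", np.mean(np.array(excess) >= eps))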
No free-lunch theorem
A good algorithm should have small sample complexity for many distributions P.

No free lunch: is it possible to have an algorithm with small (finite) sample complexity for all problems? The no free lunch theorem provides a negative answer: given any algorithm, there exists a problem for which its learning performance is arbitrarily bad.
Algorithm design: complexity and regularization
The design of most algorithms proceeds as follows:

- Pick a (possibly large) class of functions H, ideally such that

  min_{f∈H} L(f) = min_{f∈F} L(f).

- Define a procedure Aγ(Sn) = f̂γ ∈ H to explore the space H.
Bias and variance
Let fγ be the solution obtained with an infinite number of examples.
Key error decomposition:

L(f̂γ) − min_{f∈H} L(f) = [ L(f̂γ) − L(fγ) ] (variance/estimation) + [ L(fγ) − min_{f∈H} L(f) ] (bias/approximation).

Small bias leads to a good data fit; high variance leads to possible instability.
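A minimal simulation sketch of the two terms, under assumptions chosen only for illustration: a synthetic problem, least squares over degree-1 polynomials as the procedure, degree-9 polynomials as a stand-in for the larger class H, and the infinite-data solution fγ approximated by fitting on a very large sample.

    import numpy as np

    rng = np.random.default_rng(0)
    f_star, sigma = (lambda x: np.sin(3 * x)), 0.3     # hypothetical target and noise level

    def sample(n):
        x = rng.uniform(-1, 1, n)
        return x, f_star(x) + rng.normal(0, sigma, n)

    x_te, y_te = sample(200_000)                       # held-out set to estimate expected risks

    def risk(coeffs):
        return np.mean((y_te - np.polyval(coeffs, x_te)) ** 2)

    x_big, y_big = sample(200_000)                     # very large training sample
    f_gamma = np.polyfit(x_big, y_big, 1)              # proxy for the infinite-data solution f_gamma
    f_hat = np.polyfit(*sample(50), 1)                 # the estimator f̂_gamma from n = 50 examples
    best_in_H = np.polyfit(x_big, y_big, 9)            # proxy for the best function in a larger class H

    print("variance / estimation term ~", risk(f_hat) - risk(f_gamma))
    print("bias / approximation term  ~", risk(f_gamma) - risk(best_in_H))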
ERM and structural risk minimization
A classical example.
Consider (Hγ)γ such that

H1 ⊂ H2 ⊂ . . . ⊂ Hγ ⊂ . . . ⊂ H.

Then, let

f̂γ = arg min_{f∈Hγ} L̂(f),   L̂(f) = (1/n) Σ_{i=1}^{n} ℓ(yi, f(xi)).

Example: Hγ are the functions f(x) = w⊤x (or f(x) = w⊤Φ(x)) such that ‖w‖ ≤ γ.
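A minimal sketch of constrained ERM over Hγ = { f(x) = w⊤x : ‖w‖ ≤ γ }, using projected gradient descent on the empirical square loss; the synthetic data, step size and γ are illustrative assumptions.

    import numpy as np

    rng = np.random.default_rng(0)
    n, d, gamma = 100, 5, 1.0                   # illustrative sizes and constraint radius
    X = rng.normal(size=(n, d))
    w_true = rng.normal(size=d)                 # hypothetical data-generating weights
    y = X @ w_true + rng.normal(0, 0.1, n)

    w = np.zeros(d)
    step = 0.1
    for _ in range(500):
        grad = 2 * X.T @ (X @ w - y) / n        # gradient of the empirical square risk
        w = w - step * grad
        norm = np.linalg.norm(w)
        if norm > gamma:                        # project back onto the ball of radius gamma
            w = w * (gamma / norm)
    print(np.linalg.norm(w), np.mean((X @ w - y) ** 2))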
Beyond constrained ERM
In this course we will see other algorithm design principles:
- Penalization
- Stochastic gradient descent
- Implicit regularization
- Regularization by projection