MIT 9.520/6.860, Fall 2019
Statistical Learning Theory and Applications
Class 02: Statistical Learning Setting
Lorenzo Rosasco
Learning from examples
- Machine Learning deals with systems that are trained from data rather than being explicitly programmed.
- Here we describe the framework considered in statistical learning theory.
L.Rosasco, 9.520/6.860 Fall 2018
All starts with DATA
- Supervised: {(x1, y1), . . . , (xn, yn)}.
- Unsupervised: {x1, . . . , xm}.
- Semi-supervised: {(x1, y1), . . . , (xn, yn)} ∪ {x1, . . . , xm}.
Supervised learning
Problem: given Sn = {(x1, y1), . . . , (xn, yn)}, find f such that f(xnew) ∼ ynew.
The supervised learning problem
- X × Y probability space, with measure P.
- ℓ : Y × Y → [0, ∞), measurable loss function.

Define the expected risk:

L(f) = E(x,y)∼P[ℓ(y, f(x))].

Problem: solve

min f : X→Y L(f),

given only

Sn = (x1, y1), . . . , (xn, yn) ∼ P^n,

i.e. n i.i.d. samples w.r.t. P, fixed but unknown.
Data space
X: input space. Y: output space.
Input space
X input space:
- Linear spaces, e.g.
  – vectors,
  – functions,
  – matrices/operators.
- “Structured” spaces, e.g.
  – strings,
  – probability distributions,
  – graphs.
Output space
Y output space:
- Linear spaces, e.g.
  – Y = R, regression,
  – Y = R^T, multitask regression,
  – Y a Hilbert space, functional regression.
- “Structured” spaces, e.g.
  – Y = {−1, 1}, classification,
  – Y = {1, . . . , T}, multicategory classification,
  – strings,
  – probability distributions,
  – graphs.
Probability distribution
Reflects uncertainty and stochasticity of the learning problem,
P(x, y) = PX(x) P(y|x),

- PX marginal distribution on X,
- P(y|x) conditional distribution on Y given x ∈ X.
Conditional distribution and noise
[Figure: noisy samples (x1, y1), . . . , (x5, y5) scattered around the target function f∗.]

Regression: yi = f∗(xi) + εi.

- f∗ : X → Y a fixed function,
- ε1, . . . , εn zero-mean random variables, εi ∼ N(0, σ),
- x1, . . . , xn random,

so that P(y|x) = N(f∗(x), σ).
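The additive-noise regression model above can be simulated in a few lines. A minimal sketch (not from the slides), assuming a hypothetical target f∗(x) = sin(2πx), σ = 0.1, and uniform inputs on [0, 1]:

```python
import numpy as np

rng = np.random.default_rng(0)

def f_star(x):
    # hypothetical target function, chosen only for this sketch
    return np.sin(2 * np.pi * x)

n, sigma = 10_000, 0.1
x = rng.uniform(0.0, 1.0, size=n)      # x_1, ..., x_n random
eps = rng.normal(0.0, sigma, size=n)   # zero-mean noise, eps_i ~ N(0, sigma)
y = f_star(x) + eps                    # y_i = f*(x_i) + eps_i

residuals = y - f_star(x)
print(abs(residuals.mean()))           # close to 0: the noise has zero mean
print(residuals.std())                 # close to sigma
```

The residuals y − f∗(x) recover the noise, whose empirical mean and standard deviation match the model's zero mean and σ.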
Conditional distribution and misclassification
Classification: P(y|x) = {P(1|x), P(−1|x)}.

[Figure: two overlapping class-conditional distributions.]

Noise in classification: overlap between the classes,

∆δ = { x ∈ X : |P(1|x) − 1/2| ≤ δ }.
Marginal distribution and sampling
PX takes into account uneven sampling of the input space.
Marginal distribution, densities and manifolds
p(x) = dPX(x)/dx  ⇒  p(x) = dPX(x)/dvol(x)

[Figure: two scatter plots over [−1, 1] × [−1, 1], contrasting samples spread over the plane with samples concentrated on a lower-dimensional manifold.]
Loss functions
ℓ : Y × Y → [0, ∞)

- Cost of predicting f(x) in place of y.
- Measures the pointwise error ℓ(y, f(x)).
- Part of the problem definition, since L(f) = ∫_{X×Y} ℓ(y, f(x)) dP(x, y).

Note: sometimes it is useful to consider losses of the form

ℓ : Y × G → [0, ∞)

for some space G, e.g. G = R.
Loss for regression
ℓ(y, y′) = V(y − y′), V : R → [0, ∞).

- Square loss: ℓ(y, y′) = (y − y′)².
- Absolute loss: ℓ(y, y′) = |y − y′|.
- ε-insensitive: ℓ(y, y′) = max(|y − y′| − ε, 0).

[Figure: square, absolute and ε-insensitive losses plotted as functions of y − y′.]
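The three regression losses are one-liners; all are functions V(y − y′) of the residual alone. A minimal sketch (the default ε = 0.25 is an arbitrary choice for illustration):

```python
import numpy as np

def square_loss(y, y_prime):
    return (y - y_prime) ** 2

def absolute_loss(y, y_prime):
    return np.abs(y - y_prime)

def eps_insensitive_loss(y, y_prime, eps=0.25):
    # residuals smaller than eps are not penalized at all
    return np.maximum(np.abs(y - y_prime) - eps, 0.0)

print(square_loss(1.0, 0.5))            # 0.25
print(absolute_loss(1.0, 0.5))          # 0.5
print(eps_insensitive_loss(1.0, 0.9))   # 0.0: the error is inside the eps tube
```

Note how the ε-insensitive loss is flat inside the tube |y − y′| ≤ ε and grows like the absolute loss outside it.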
Loss for classification
ℓ(y, y′) = V(−yy′), V : R → [0, ∞).

- 0-1 loss: ℓ(y, y′) = Θ(−yy′), where Θ(a) = 1 if a ≥ 0 and 0 otherwise.
- Square loss: ℓ(y, y′) = (1 − yy′)².
- Hinge loss: ℓ(y, y′) = max(1 − yy′, 0).
- Logistic loss: ℓ(y, y′) = log(1 + exp(−yy′)).

[Figure: 0-1, square, hinge and logistic losses plotted as functions of the margin yy′.]
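All four classification losses depend on y and y′ only through the margin yy′, which a short sketch makes explicit:

```python
import numpy as np

def zero_one_loss(y, y_prime):
    # Theta(-y y'): 1 exactly when -y y' >= 0, i.e. when the prediction errs
    return np.where(y * y_prime <= 0, 1.0, 0.0)

def square_loss(y, y_prime):
    return (1.0 - y * y_prime) ** 2

def hinge_loss(y, y_prime):
    return np.maximum(1.0 - y * y_prime, 0.0)

def logistic_loss(y, y_prime):
    return np.log(1.0 + np.exp(-y * y_prime))

print(hinge_loss(1, 2.0))    # 0.0: confident correct predictions cost nothing
print(hinge_loss(1, -1.0))   # 2.0: wrong-sign predictions are penalized
```

Square, hinge and logistic losses are convex surrogates of the (non-convex) 0-1 loss, which is what makes them computationally convenient.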
Loss function for structured prediction
Loss specific to each learning task, e.g.
- Multiclass: square loss, weighted square loss, logistic loss, …
- Multitask: weighted square loss, absolute loss, …
- …
Expected risk
L(f) = E(x,y)∼P[ℓ(y, f(x))] = ∫_{X×Y} ℓ(y, f(x)) dP(x, y),

with f ∈ F, F = {f : X → Y | f measurable}.

Example: Y = {−1, +1}, ℓ(y, f(x)) = Θ(−yf(x)), where Θ(a) = 1 if a ≥ 0 and 0 otherwise. Then

L(f) = P({(x, y) ∈ X × Y | f(x) ≠ y}).
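The example says the expected 0-1 risk is exactly the misclassification probability, which is easy to check by Monte Carlo. A sketch under an assumed distribution (P(1|x) = 0.9 for x > 0 and 0.1 otherwise, with the hypothetical classifier f(x) = sign(x)):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
x = rng.uniform(-1.0, 1.0, size=n)
# assumed conditional distribution: P(y = 1 | x) = 0.9 if x > 0, else 0.1
p1 = np.where(x > 0, 0.9, 0.1)
y = np.where(rng.uniform(size=n) < p1, 1, -1)

f = np.where(x > 0, 1, -1)             # candidate classifier f(x) = sign(x)
zero_one = (y * f <= 0).astype(float)  # Theta(-y f(x))
risk = zero_one.mean()                 # Monte Carlo estimate of L(f)
print(risk)                            # close to 0.1, the error probability
```

The average 0-1 loss coincides with the empirical fraction of points where f(x) ≠ y, matching the identity in the example.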
Target function
fP = argmin_{f∈F} L(f)

can be derived for many loss functions. Writing

L(f) = ∫ dP(x, y) ℓ(y, f(x)) = ∫ dPX(x) ∫ ℓ(y, f(x)) dP(y|x),

and denoting the inner integral by Lx(f(x)), it is possible to show that:

- inf_{f∈F} L(f) = ∫ dPX(x) inf_{a∈R} Lx(a).
- Minimizers of L(f) can be derived “pointwise” from the inner risk Lx(f(x)).
- Measurability of this pointwise definition of fP can be ensured.
Target functions in regression
fP(x) = argmin_{a∈R} Lx(a).

Square loss:

fP(x) = ∫_Y y dP(y|x).

Absolute loss:

fP(x) = median(P(y|x)),

where median(p(·)) = y s.t. ∫_{−∞}^{y} dp(t) = ∫_{y}^{+∞} dp(t).
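These two pointwise minimizers can be verified numerically: over samples from a fixed conditional distribution, the value a minimizing the average square loss is the mean, and the one minimizing the average absolute loss is the median. A sketch using an assumed skewed conditional P(y|x) (exponential), where mean and median differ:

```python
import numpy as np

rng = np.random.default_rng(0)
# samples from an assumed skewed conditional distribution P(y|x) at a fixed x
y_samples = rng.exponential(scale=1.0, size=20_001)

grid = np.linspace(0.0, 5.0, 501)  # candidate values a for f_P(x)

def inner_risk(a, loss):
    # empirical inner risk L_x(a) = mean of loss(y - a) over P(y|x)
    return loss(y_samples - a).mean()

a_square = grid[np.argmin([inner_risk(a, np.square) for a in grid])]
a_abs = grid[np.argmin([inner_risk(a, np.abs) for a in grid])]

print(a_square)  # close to the sample mean (about 1 for this distribution)
print(a_abs)     # close to the sample median (about log 2 here)
```

Since the exponential distribution is skewed, the two targets are visibly different, illustrating that each loss defines its own fP.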
Target functions in classification
Misclassification loss: fP(x) = sign(P(1|x) − P(−1|x)).

Square loss: fP(x) = P(1|x) − P(−1|x).

Logistic loss: fP(x) = log( P(1|x) / P(−1|x) ).

Hinge loss: fP(x) = sign(P(1|x) − P(−1|x)).
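The four classification targets above are all simple functions of P(1|x). A sketch (the helper `target` is hypothetical, written just to tabulate the formulas):

```python
import numpy as np

def target(p1, loss):
    """Optimal prediction f_P(x) at a point x where P(1|x) = p1."""
    p_diff = p1 - (1.0 - p1)          # P(1|x) - P(-1|x)
    if loss in ("misclassification", "hinge"):
        return np.sign(p_diff)        # the Bayes classifier
    if loss == "square":
        return p_diff                 # the conditional expectation E[y|x]
    if loss == "logistic":
        return np.log(p1 / (1.0 - p1))  # the conditional log-odds
    raise ValueError(loss)

print(target(0.9, "square"))     # 0.8: magnitude reflects confidence
print(target(0.9, "hinge"))      # 1.0: only the label is recovered
```

Note that the hinge target carries only the sign of P(1|x) − 1/2, while the square and logistic targets also encode how confident the prediction is.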
Different loss, different target
- Each loss function defines a different optimal target function. Learning enters the picture when the latter is impossible or hard to compute (as in simulations).
- As we will see in the following, loss functions also differ in terms of the computations they induce.
Learning algorithms
Solve

min_{f∈F} L(f),

given only

Sn = (x1, y1), . . . , (xn, yn) ∼ P^n.

Learning algorithm: Sn → f̂n = f̂Sn.

f̂n estimates fP given the observed examples Sn.

How to measure the error of an estimate?
Excess risk
Excess risk: L(f̂n) − min_{f∈F} L(f).

Consistency: for any ε > 0,

lim_{n→∞} P( L(f̂n) − min_{f∈F} L(f) > ε ) = 0.
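Consistency can be illustrated in the simplest possible setting, not from the slides: predict y ∈ R by a constant, with the square loss. ERM returns the sample mean, whose excess risk is (mean − μ)² and shrinks as n grows (here y ~ N(μ, σ²) is an assumed distribution):

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 2.0, 1.0   # assumed distribution: y ~ N(mu, sigma^2), no inputs x

def excess_risk(n):
    # ERM over constant predictors with the square loss returns the sample
    # mean; since L(a) = (a - mu)^2 + sigma^2, the excess risk is (mean - mu)^2
    y = rng.normal(mu, sigma, size=n)
    return (y.mean() - mu) ** 2

# average excess risk over 200 independent draws of S_n, for growing n
avg_excess = {n: np.mean([excess_risk(n) for _ in range(200)])
              for n in (10, 100, 1000)}
print(avg_excess)  # decreases roughly like sigma^2 / n
```

The expected excess risk here is exactly σ²/n, so the simulation shows both consistency and a 1/n rate for this toy problem.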
Other forms of consistency
Consistency in expectation:

lim_{n→∞} E[ L(f̂n) − min_{f∈F} L(f) ] = 0.

Consistency almost surely:

P( lim_{n→∞} L(f̂n) − min_{f∈F} L(f) = 0 ) = 1.

Note: the different notions of consistency correspond to different notions of convergence for random variables: in probability, in expectation and almost sure.
Sample complexity, tail bounds and error bounds
- Sample complexity: for any ε > 0, δ ∈ (0, 1], when n ≥ n_{P,F}(ε, δ),

  P( L(f̂n) − min_{f∈F} L(f) ≥ ε ) ≤ δ.

- Tail bounds: for any ε > 0, n ∈ N,

  P( L(f̂n) − min_{f∈F} L(f) ≥ ε ) ≤ δ_{P,F}(n, ε).

- Error bounds: for any δ ∈ (0, 1], n ∈ N,

  P( L(f̂n) − min_{f∈F} L(f) ≤ ε_{P,F}(n, δ) ) ≥ 1 − δ.
No free-lunch theorem
A good algorithm should have small sample complexity for many distributions P.

No free lunch: is it possible to have an algorithm with small (finite) sample complexity for all problems?

The no free lunch theorem provides a negative answer. In other words, given any algorithm, there exists a problem for which its learning performance is arbitrarily bad.
Algorithm design: complexity and regularization
The design of most algorithms proceeds as follows:

- Pick a (possibly large) class of functions H, ideally such that

  min_{f∈H} L(f) = min_{f∈F} L(f).

- Define a procedure Aγ(Sn) = f̂γ ∈ H to explore the space H.
Bias and variance
Let fγ be the solution obtained with an infinite number of examples.

Key error decomposition:

L(f̂γ) − min_{f∈H} L(f) = [ L(f̂γ) − L(fγ) ] + [ L(fγ) − min_{f∈H} L(f) ],

where the first term is the variance (estimation error) and the second the bias (approximation error).

A small bias leads to a good data fit; a high variance leads to possible instability.
ERM and structural risk minimization
A classical example.
Consider (Hγ)γ such that

H1 ⊂ H2 ⊂ . . . ⊂ Hγ ⊂ . . . ⊂ H.

Then, let

f̂γ = argmin_{f∈Hγ} L̂(f),  L̂(f) = (1/n) Σ_{i=1}^{n} ℓ(yi, f(xi)).

Example: Hγ are functions f(x) = w⊤x (or f(x) = w⊤Φ(x)) s.t. ‖w‖ ≤ γ.
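The example Hγ = {f(x) = w⊤x : ‖w‖ ≤ γ} admits a very direct implementation: minimize the empirical square loss by gradient descent, projecting w back onto the ball of radius γ after every step. A sketch under assumed synthetic data (projected gradient descent is one standard way to solve this constrained ERM; the slides do not prescribe a solver):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 5
X = rng.normal(size=(n, d))
w_true = np.array([1.0, -2.0, 0.5, 0.0, 3.0])  # assumed for this sketch
y = X @ w_true + 0.1 * rng.normal(size=n)

def constrained_erm(X, y, gamma, steps=5000, lr=0.05):
    """ERM over {f(x) = w.x : ||w|| <= gamma} with the square loss,
    via projected gradient descent."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        grad = 2.0 * X.T @ (X @ w - y) / len(y)   # gradient of L-hat
        w = w - lr * grad
        norm = np.linalg.norm(w)
        if norm > gamma:                          # project onto the gamma-ball
            w = w * (gamma / norm)
    return w

def empirical_risk(w):
    return np.mean((X @ w - y) ** 2)

w_small = constrained_erm(X, y, gamma=1.0)   # tightly constrained: more bias
w_large = constrained_erm(X, y, gamma=10.0)  # constraint inactive here
```

Since H1 ⊂ H10, the minimizer over the larger ball attains an empirical risk no larger than the constrained one, matching the nested-spaces picture above.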
Beyond constrained ERM
In this course we will see other algorithm design principles:
- Penalization
- Stochastic gradient descent
- Implicit regularization
- Regularization by projection
Beyond supervised learning
- Z probability space, with measure P.
- H a set.
- ℓ : Z × H → [0, ∞), measurable loss function.

Problem: solve

min_{h∈H} Ez∼P[ℓ(z, h)],

given only

Sn = z1, . . . , zn ∼ P^n,

i.e. n i.i.d. samples w.r.t. P, fixed but unknown.

- H is part of the definition of the problem.
- The above setting covers, for example, many unsupervised learning problems as well as decision theory problems (aka the general learning setting).