Statistical Learning Theory



    Statistical Learning Theory

Olivier Bousquet
Department of Empirical Inference

    Max Planck Institute of Biological Cybernetics

    [email protected]

    Machine Learning Summer School, August 2003


    Roadmap (1)

    Lecture 1: Introduction

    Part I: Binary Classification

    Lecture 2: Basic bounds

    Lecture 3: VC theory

    Lecture 4: Capacity measures

    Lecture 5: Advanced topics

    O. Bousquet Statistical Learning Theory 1


    Roadmap (2)

    Part II: Real-Valued Classification

    Lecture 6: Margin and loss functions

    Lecture 7: Regularization

    Lecture 8: SVM

    O. Bousquet Statistical Learning Theory 2


    Lecture 1

    The Learning Problem Context

    Formalization

    Approximation/Estimation trade-off

    Algorithms and Bounds

    O. Bousquet Statistical Learning Theory Lecture 1 3


    Learning and Inference

The inductive inference process:

1. Observe a phenomenon

    2. Construct a model of the phenomenon

    3. Make predictions

    This is more or less the definition of natural sciences !

    The goal of Machine Learning is to automate this process

    The goal of Learning Theory is to formalize it.

    O. Bousquet Statistical Learning Theory Lecture 1 4


    Pattern recognition

We consider here the supervised learning framework for pattern recognition:

    Data consists of pairs (instance, label)

Label is +1 or −1

Algorithm constructs a function (instance → label)

    Goal: make few mistakes on future unseen instances

    O. Bousquet Statistical Learning Theory Lecture 1 5


    Approximation/Interpolation

    It is always possible to build a function that fits exactly the data.

[Figure: a curve that passes exactly through every data point]

    But is it reasonable ?

    O. Bousquet Statistical Learning Theory Lecture 1 6


Occam's Razor

Idea: look for regularities in the observed phenomenon. These can be generalized from the observed past to the future

    choose the simplest consistent model

    How to measure simplicity ?

Physics: number of constants

Description length

    Number of parameters

    ...

    O. Bousquet Statistical Learning Theory Lecture 1 7


    No Free Lunch

If there is no assumption on how the past is related to the future,

    prediction is impossible

    if there is no restriction on the possible phenomena, generalization

    is impossible

We need to make assumptions

Simplicity is not absolute

Data will never replace knowledge

Generalization = data + knowledge

    O. Bousquet Statistical Learning Theory Lecture 1 8


    Assumptions

    Two types of assumptions

    Future observations related to past ones

    Stationarity of the phenomenon

Constraints on the phenomenon

Notion of simplicity

    O. Bousquet Statistical Learning Theory Lecture 1 9


    Probabilistic Model

    Relationship between past and future observations

    Sampled independently from the same distribution

    Independence: each new observation yields maximum information

Identical distribution: the observations give information about the underlying phenomenon (here a probability distribution)

    O. Bousquet Statistical Learning Theory Lecture 1 11


    Probabilistic Model

We consider an input space X and an output space Y. Here: classification case, Y = {−1, 1}.

Assumption: the pairs (X, Y) ∈ X × Y are distributed according to P (unknown).

Data: we observe a sequence of n i.i.d. pairs (X_i, Y_i) sampled according to P.

Goal: construct a function g : X → Y which predicts Y from X.

    O. Bousquet Statistical Learning Theory Lecture 1 12


    Probabilistic Model

Criterion to choose our function: low probability of error P(g(X) ≠ Y).

Risk:

R(g) = P(g(X) ≠ Y) = E[ 1[g(X) ≠ Y] ]

P is unknown, so we cannot directly measure the risk; we can only measure the agreement on the data.

Empirical Risk:

R_n(g) = (1/n) Σ_{i=1}^n 1[g(X_i) ≠ Y_i]
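As a quick illustration (not part of the original slides), here is a minimal Python sketch of the empirical risk R_n(g) of a fixed classifier on a labelled sample; the classifier and the synthetic data are made up for the example.

```python
import numpy as np

def empirical_risk(g, X, y):
    """Fraction of sample points on which the classifier g disagrees with the label."""
    predictions = np.array([g(x) for x in X])
    return np.mean(predictions != y)

# Toy data: X uniform on [-1, 1], labels in {-1, +1} given by a noisy threshold.
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=200)
y = np.where(X + 0.1 * rng.normal(size=200) > 0, 1, -1)
g = lambda x: 1 if x > 0 else -1        # a fixed classifier
print("empirical risk R_n(g) =", empirical_risk(g, X, y))
```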

    O. Bousquet Statistical Learning Theory Lecture 1 13


    Target function

P can be decomposed as P_X × P(Y|X)

η(x) = E[Y | X = x] = 2 P[Y = 1 | X = x] − 1 is the regression function

t(x) = sgn η(x) is the target function

In the deterministic case Y = t(X) (i.e. P[Y = 1|X] ∈ {0, 1})

In general, n(x) = min(P[Y = 1|X = x], 1 − P[Y = 1|X = x]) = (1 − |η(x)|)/2 is the noise level

    O. Bousquet Statistical Learning Theory Lecture 1 14


    Assumptions about P

    Need assumptions about P.

    Indeed, if t(x) is totally chaotic, there is no possible generalization

    from finite data.

    Assumptions can be

    Preference (e.g. a priori probability distribution on possible functions)

Restriction (set of possible functions)

Treating lack of knowledge:

Bayesian approach: uniform distribution

Learning Theory approach: worst-case analysis

    O. Bousquet Statistical Learning Theory Lecture 1 15


    Overfitting/Underfitting

The data can mislead you.

Underfitting: model too small to fit the data

Overfitting: artificially good agreement with the data

    No way to detect them from the data ! Need extra validation data.

    O. Bousquet Statistical Learning Theory Lecture 1 17


    Empirical Risk Minimization

    Choose a model G (set of possible functions)

    Minimize the empirical risk in the model

min_{g∈G} R_n(g)

    What if the Bayes classifier is not in the model ?

    O. Bousquet Statistical Learning Theory Lecture 1 18


    Approximation/Estimation

Bayes risk:

R* = inf_g R(g).

Best risk a deterministic function can have (risk of the target function, or Bayes classifier).

Decomposition: let R(g*) = inf_{g∈G} R(g). Then

R(g_n) − R* = [R(g*) − R*] + [R(g_n) − R(g*)]  =  Approximation + Estimation

    Only the estimation error is random (i.e. depends on the data).

    O. Bousquet Statistical Learning Theory Lecture 1 19


    Structural Risk Minimization

Choose a collection of models {G_d : d = 1, 2, . . .}

Minimize the empirical risk in each model

Minimize the penalized empirical risk

min_d min_{g∈G_d} R_n(g) + pen(d, n)

    pen(d, n) gives preference to models where estimation error is small

    pen(d, n) measures the size or capacity of the model
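The following is a minimal sketch (not from the slides) of this penalized selection, assuming made-up ingredients: a family of polynomial-threshold models of increasing degree d, and a penalty pen(d, n) = sqrt(d/n) chosen purely for illustration.

```python
import numpy as np

def select_by_penalized_risk(models, X, y, pen):
    """models: list of fit functions, each returning (classifier, empirical_risk).
    Returns the classifier minimizing empirical risk + pen(d, n)."""
    n = len(y)
    best, best_score = None, np.inf
    for d, fit in enumerate(models, start=1):
        g, emp_risk = fit(X, y)
        score = emp_risk + pen(d, n)          # penalized empirical risk
        if score < best_score:
            best, best_score = g, score
    return best, best_score

# Illustrative "models" of increasing complexity: sign of a degree-d polynomial.
def make_fit(degree):
    def fit(X, y):
        coeffs = np.polyfit(X, y, degree)               # least-squares surrogate fit
        g = lambda x: np.sign(np.polyval(coeffs, x))
        return g, np.mean(g(X) != y)
    return fit

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, 100)
y = np.sign(X**3 - 0.2 * X + 0.05 * rng.normal(size=100))
models = [make_fit(d) for d in range(1, 8)]
g_best, score = select_by_penalized_risk(models, X, y, pen=lambda d, n: np.sqrt(d / n))
print("selected penalized score:", score)
```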

    O. Bousquet Statistical Learning Theory Lecture 1 20


    Regularization

Choose a large model G (possibly dense)

Choose a regularizer ‖g‖

Minimize the regularized empirical risk

min_{g∈G} R_n(g) + λ‖g‖²

    Choose an optimal trade-off (regularization parameter).

    Most methods can be thought of as regularization methods.

    O. Bousquet Statistical Learning Theory Lecture 1 21


    Bounds (1)

    A learning algorithm

Takes as input the data (X_1, Y_1), . . . , (X_n, Y_n)

Produces a function g_n

    Can we estimate the risk of gn ?

    random quantity (depends on the data).

    need probabilistic bounds

    O. Bousquet Statistical Learning Theory Lecture 1 22


    Bounds (2)

Error bounds

R(g_n) ≤ R_n(g_n) + B

Estimation from an empirical quantity

Relative error bounds

Best in a class: R(g_n) ≤ R(g*) + B

Bayes risk: R(g_n) ≤ R* + B

    Theoretical guarantees

    O. Bousquet Statistical Learning Theory Lecture 1 23


    Lecture 2

    Basic Bounds

    Probability tools

    Relationship with empirical processes

    Law of large numbers

    Union bound

    Relative error bounds

    O. Bousquet Statistical Learning Theory Lecture 2 24


    Probability Tools (1)

    Basic facts

Union: P[A or B] ≤ P[A] + P[B]

Inclusion: if A ⊂ B, then P[A] ≤ P[B].

Inversion: if P[X ≥ t] ≤ F(t), then with probability at least 1 − δ, X ≤ F^{-1}(δ).

Expectation: if X ≥ 0, E[X] = ∫_0^∞ P[X ≥ t] dt.

    O. Bousquet Statistical Learning Theory Lecture 2 25


    Probability Tools (2)

    Basic inequalities

Jensen: for f convex, f(E[X]) ≤ E[f(X)]

Markov: if X ≥ 0, then for all t > 0, P[X ≥ t] ≤ E[X]/t

Chebyshev: for t > 0, P[|X − E[X]| ≥ t] ≤ Var[X]/t²

Chernoff: for all t ∈ R, P[X ≥ t] ≤ inf_{λ≥0} E[e^{λ(X−t)}]

    O. Bousquet Statistical Learning Theory Lecture 2 26


    Error bounds

Recall that we want to bound R(g_n) = E[ 1[g_n(X) ≠ Y] ], where g_n has been constructed from (X_1, Y_1), . . . , (X_n, Y_n).

Cannot be observed (P is unknown)

Random (depends on the data)

⇒ we want to bound P[ R(g_n) − R_n(g_n) > ε ]

    O. Bousquet Statistical Learning Theory Lecture 2 27


    Loss class

For convenience, let Z_i = (X_i, Y_i) and Z = (X, Y). Given G, define the loss class

F = { f : (x, y) ↦ 1[g(x) ≠ y] : g ∈ G }

Denote Pf = E[f(X, Y)] and P_n f = (1/n) Σ_{i=1}^n f(X_i, Y_i)

Quantity of interest: Pf − P_n f

We will go back and forth between F and G (bijection)

    O. Bousquet Statistical Learning Theory Lecture 2 28


    Empirical process

    Empirical process:

{ Pf − P_n f }_{f∈F}

Process = collection of random variables (here indexed by functions in F)

Empirical = distribution of each random variable

Many techniques exist to control the supremum

sup_{f∈F} Pf − P_n f

    O. Bousquet Statistical Learning Theory Lecture 2 29


    The Law of Large Numbers

R(g) − R_n(g) = E[f(Z)] − (1/n) Σ_{i=1}^n f(Z_i)

difference between the expectation and the empirical average of the r.v. f(Z)

Law of large numbers:

P[ lim_{n→∞} (1/n) Σ_{i=1}^n f(Z_i) − E[f(Z)] = 0 ] = 1.

    can we quantify it ?

    O. Bousquet Statistical Learning Theory Lecture 2 30


Hoeffding's Inequality

    Quantitative version of law of large numbers.

    Assumes bounded random variables

Theorem 1. Let Z_1, . . . , Z_n be n i.i.d. random variables. If f(Z) ∈ [a, b], then for all ε > 0, we have

P[ | (1/n) Σ_{i=1}^n f(Z_i) − E[f(Z)] | > ε ] ≤ 2 exp( −2nε² / (b − a)² ).

Let's rewrite it to better understand

    O. Bousquet Statistical Learning Theory Lecture 2 31


Hoeffding's Inequality

Write

δ = 2 exp( −2nε² / (b − a)² )

Then

P[ |P_n f − Pf| > (b − a) sqrt( log(2/δ) / (2n) ) ] ≤ δ

or [Inversion] with probability at least 1 − δ,

|P_n f − Pf| ≤ (b − a) sqrt( log(2/δ) / (2n) )

    O. Bousquet Statistical Learning Theory Lecture 2 32


Hoeffding's inequality

Let's apply it to f(Z) = 1[g(X) ≠ Y].

For any g, and any δ > 0, with probability at least 1 − δ,

R(g) ≤ R_n(g) + sqrt( log(2/δ) / (2n) ).   (1)

    Notice that one has to consider a fixed function f and the probability is

    with respect to the sampling of the data.

    If the function depends on the data this does not apply !
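As an aside (not in the original slides), a small numerical check of inequality (1) for one fixed classifier: it compares the empirical risk, the empirical risk plus the Hoeffding term sqrt(log(2/δ)/(2n)), and the true risk on a synthetic distribution where the true risk is 0.1 by construction.

```python
import numpy as np

def hoeffding_radius(n, delta):
    """Half-width of the Hoeffding confidence interval for a [0, 1] loss."""
    return np.sqrt(np.log(2.0 / delta) / (2.0 * n))

# Fixed classifier g(x) = sign(x); X uniform on [-1, 1], label flipped with prob. 0.1,
# so the true risk of g is exactly 0.1.
rng = np.random.default_rng(1)
n, delta, true_risk = 1000, 0.05, 0.1
X = rng.uniform(-1, 1, n)
y = np.sign(X) * np.where(rng.random(n) < true_risk, -1, 1)
empirical = np.mean(np.sign(X) != y)
print("R_n(g) =", empirical,
      " R_n(g) + eps =", empirical + hoeffding_radius(n, delta),
      " true R(g) =", true_risk)
```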

    O. Bousquet Statistical Learning Theory Lecture 2 33


    Limitations

For each fixed function f ∈ F, there is a set S of samples for which

Pf − P_n f ≤ sqrt( log(2/δ) / (2n) )    (with P[S] ≥ 1 − δ)

They may be different for different functions

The function chosen by the algorithm depends on the sample

For the observed sample, only some of the functions in F will satisfy this inequality !

    O. Bousquet Statistical Learning Theory Lecture 2 34


    Limitations

    What we need to bound is

Pf_n − P_n f_n

where f_n is the function chosen by the algorithm based on the data.

For any fixed sample, there exists a function f such that

Pf − P_n f = 1

Take the function which is f(X_i) = Y_i on the data and f(X) = −Y everywhere else.

    This does not contradict Hoeffding but shows it is not enough

    O. Bousquet Statistical Learning Theory Lecture 2 35


    Limitations

[Figure: the risk R[f] and the empirical risk R_emp[f] plotted over the function class, with their respective minimizers]

Hoeffding's inequality quantifies differences for a fixed function

    O. Bousquet Statistical Learning Theory Lecture 2 36


    Uniform Deviations

    Before seeing the data, we do not know which function the algorithm

    will choose.

    The trick is to consider uniform deviations

R(f_n) − R_n(f_n) ≤ sup_{f∈F} (R(f) − R_n(f))

    We need a bound which holds simultaneously for all functions in a class

    O. Bousquet Statistical Learning Theory Lecture 2 37


    Union Bound

    Consider two functions f1, f2 and define

C_i = { (x_1, y_1), . . . , (x_n, y_n) : Pf_i − P_n f_i > ε }

From Hoeffding's inequality, for each i,

P[C_i] ≤ δ

We want to bound the probability of being bad for i = 1 or i = 2:

P[C_1 ∪ C_2] ≤ P[C_1] + P[C_2]

    O. Bousquet Statistical Learning Theory Lecture 2 38


Finite Case

More generally,

P[C_1 ∪ . . . ∪ C_N] ≤ Σ_{i=1}^N P[C_i]

We have

P[ ∃f ∈ {f_1, . . . , f_N} : Pf − P_n f > ε ] ≤ Σ_{i=1}^N P[ Pf_i − P_n f_i > ε ] ≤ N exp(−2nε²)

O. Bousquet Statistical Learning Theory Lecture 2 39


Finite Case

We obtain: for G = {g_1, . . . , g_N} and for all δ > 0, with probability at least 1 − δ,

∀g ∈ G, R(g) ≤ R_n(g) + sqrt( (log N + log(1/δ)) / (2n) )

This is a generalization bound !

Coding interpretation

log N is the number of bits to specify a function in F
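A small numerical look (not in the slides) at how the complexity term of this finite-class bound grows with the size N of the class, for an arbitrary choice of n = 1000 and δ = 0.05.

```python
import numpy as np

def finite_class_bound(emp_risk, N, n, delta=0.05):
    """Uniform bound for a finite class of N functions:
    with prob. >= 1 - delta, R(g) <= R_n(g) + sqrt((log N + log(1/delta)) / (2n)) for all g."""
    return emp_risk + np.sqrt((np.log(N) + np.log(1.0 / delta)) / (2.0 * n))

n = 1000
for N in (2, 100, 10_000, 10**6):
    print(f"N = {N:>8}:  bound = empirical risk + {finite_class_bound(0.0, N, n):.3f}")
```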

    O. Bousquet Statistical Learning Theory Lecture 2 40


Approximation/Estimation

Let g* = arg min_{g∈G} R(g). If g_n minimizes the empirical risk in G,

R_n(g*) − R_n(g_n) ≥ 0

Thus

R(g_n) = R(g_n) − R(g*) + R(g*)
       ≤ R_n(g*) − R_n(g_n) + R(g_n) − R(g*) + R(g*)
       ≤ 2 sup_{g∈G} |R(g) − R_n(g)| + R(g*)

    O. Bousquet Statistical Learning Theory Lecture 2 41


Approximation/Estimation

We obtain: with probability at least 1 − δ,

R(g_n) ≤ R(g*) + 2 sqrt( (log N + log(2/δ)) / (2n) )

The first term decreases if N increases

The second term increases

The size of G controls the trade-off

    O. Bousquet Statistical Learning Theory Lecture 2 42


    Summary (1)

    Inference requires assumptions

    - Data sampled i.i.d. from P

    - Restrict the possible functions to G

    - Choose a sequence of models Gm to have more flexibility/control

    O. Bousquet Statistical Learning Theory Lecture 2 43


    Summary (2)

    Bounds are valid w.r.t. repeated sampling

- For a fixed function g, for most of the samples,

R(g) − R_n(g) is of order 1/√n

- For most of the samples, if |G| = N,

sup_{g∈G} R(g) − R_n(g) is of order sqrt( log N / n )

    Extra variability because the chosen gn changes with the data

    O. Bousquet Statistical Learning Theory Lecture 2 44


    Improvements

    We obtained

sup_{g∈G} R(g) − R_n(g) ≤ sqrt( (log N + log(2/δ)) / (2n) )

To be improved:

Hoeffding only uses boundedness, not the variance

Union bound as bad as if independent

Supremum is not what the algorithm chooses.

    Next we improve the union bound and extend it to the infinite case

    O. Bousquet Statistical Learning Theory Lecture 2 45


    Refined union bound (1)

For each f ∈ F,

P[ Pf − P_n f > sqrt( log(1/δ(f)) / (2n) ) ] ≤ δ(f)

P[ ∃f ∈ F : Pf − P_n f > sqrt( log(1/δ(f)) / (2n) ) ] ≤ Σ_{f∈F} δ(f)

Choose δ(f) = δ p(f) with Σ_{f∈F} p(f) = 1

    O. Bousquet Statistical Learning Theory Lecture 2 46


    Refined union bound (2)

With probability at least 1 − δ,

∀f ∈ F, Pf ≤ P_n f + sqrt( (log(1/p(f)) + log(1/δ)) / (2n) )

Applies to countably infinite F

Can put knowledge about the algorithm into p(f)

But p must be chosen before seeing the data

    O. Bousquet Statistical Learning Theory Lecture 2 47


    Refined union bound (3)

Good p means good bound. The bound can be improved if you know ahead of time the chosen function (knowledge improves the bound)

In the infinite case, how to choose the p (since it implies an ordering) ?

The trick is to look at F through the data

    O. Bousquet Statistical Learning Theory Lecture 2 48


    Lecture 3

    Infinite Case: Vapnik-Chervonenkis Theory

    Growth function

    Vapnik-Chervonenkis dimension

    Proof of the VC bound

    VC entropy

    SRM

    O. Bousquet Statistical Learning Theory Lecture 3 49


Infinite Case

Measure of the size of an infinite class ?

Consider

F_{z_1,...,z_n} = { (f(z_1), . . . , f(z_n)) : f ∈ F }

The size of this set is the number of possible ways in which the data (z_1, . . . , z_n) can be classified.

Growth function:

S_F(n) = sup_{(z_1,...,z_n)} |F_{z_1,...,z_n}|

Note that S_F(n) = S_G(n)

    O. Bousquet Statistical Learning Theory Lecture 3 50


    Infinite Case

Result (Vapnik-Chervonenkis): with probability at least 1 − δ,

∀g ∈ G, R(g) ≤ R_n(g) + sqrt( (log S_G(2n) + log(4/δ)) / (8n) )

Always better than N in the finite case

How to compute S_G(n) in general ?

⇒ use the VC dimension

    O. Bousquet Statistical Learning Theory Lecture 3 51


    VC Dimension

Notice that since g ∈ {−1, 1}, S_G(n) ≤ 2^n

If S_G(n) = 2^n, the class of functions can generate any classification of the n points (shattering)

Definition 2. The VC-dimension of G is the largest n such that

S_G(n) = 2^n

    O. Bousquet Statistical Learning Theory Lecture 3 52


    VC Dimension

    Hyperplanes

In R^d, VC(hyperplanes) = d + 1

    O. Bousquet Statistical Learning Theory Lecture 3 53


    Number of Parameters

    Is VC dimension equal to number of parameters ?

One parameter: {sgn(sin(tx)) : t ∈ R}

    Infinite VC dimension !

    O. Bousquet Statistical Learning Theory Lecture 3 54

    VC Dimension


We want to know S_G(n), but we only know

S_G(n) = 2^n for n ≤ h

What happens for n ≥ h ?

[Figure: log S(m) as a function of m, with a change of regime at m = h]

    O. Bousquet Statistical Learning Theory Lecture 3 55

    Vapnik-Chervonenkis-Sauer-Shelah Lemma


    p

Lemma 3. Let G be a class of functions with finite VC-dimension h. Then for all n ∈ N,

S_G(n) ≤ Σ_{i=0}^h (n choose i)

and for all n ≥ h,

S_G(n) ≤ (en/h)^h

    phase transition
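A tiny numerical illustration (not in the slides) of this phase transition: below the VC dimension h the Sauer bound equals 2^n, while above h it grows only polynomially, staying below (en/h)^h.

```python
import math

def sauer_bound(n, h):
    """Sum_{i=0}^{h} C(n, i): upper bound on the growth function S_G(n) for VC dimension h."""
    return sum(math.comb(n, i) for i in range(min(h, n) + 1))

h = 5
for n in (5, 10, 100, 1000):
    print(f"n={n:5d}  Sauer bound={sauer_bound(n, h):.3e}  (en/h)^h={(math.e * n / h) ** h:.3e}  2^n={2.0 ** n:.3e}")
```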

    O. Bousquet Statistical Learning Theory Lecture 3 56

    VC Bound


Let G be a class with VC dimension h. With probability at least 1 − δ,

∀g ∈ G, R(g) ≤ R_n(g) + sqrt( (h log(2en/h) + log(4/δ)) / (8n) )

So the error is of order

sqrt( h log n / n )

    O. Bousquet Statistical Learning Theory Lecture 3 57

    Interpretation


    VC dimension: measure of effective dimension

    Depends on geometry of the class

    Gives a natural definition of simplicity (by quantifying the potentialoverfitting)

    Not related to the number of parameters

    Finiteness guarantees learnability under any distribution

    O. Bousquet Statistical Learning Theory Lecture 3 58

    Symmetrization (lemma)


    y ( )

    Key ingredient in VC bounds: Symmetrization

    Let Z1, . . . , Zn an independent (ghost) sample and P

    n the

    corresponding empirical measure.

    Lemma 4. For any t > 0, such that nt

    2

    2,

    P

    supfF

    (P Pn)f t

    2P

    supfF

    (Pn Pn)f t/2

    O. Bousquet Statistical Learning Theory Lecture 3 59

    Symmetrization (proof 1)


    ( )

    fn the function achieving the supremum (depends on Z1, . . . , Zn)

    1[(PPn)fn>t]1[(PPn)fnt (PPn)fnt/2]

    Taking expectations with respect to the second sample gives

    1[(PPn)fn>t]P (P Pn)fn < t/2 P (Pn Pn)fn > t/2

    O. Bousquet Statistical Learning Theory Lecture 3 60

    Symmetrization (proof 2)


    By Chebyshev inequality,

    P (P Pn)fn t/2 4Var [fn]nt2 1nt2

    Hence

    1[(PPn)fn>t](1 1

    nt2) P (Pn Pn)fn > t/2

    Take expectation with respect to first sample.

    O. Bousquet Statistical Learning Theory Lecture 3 61

    Proof of VC bound (1)


    Symmetrization allows to replace expectation by average on ghost

    sample Function class projected on the double sample

    FZ1,...,Zn,Z1,...,Zn

    Union bound on FZ1,...,Zn,Z1,...,Zn Variant of Hoeffdings inequality

    P Pnf P

    nf > t 2ent2/2

    O. Bousquet Statistical Learning Theory Lecture 3 62

    Proof of VC bound (2)


    P

    supfF(P Pn)f t

    2P supfF(Pn Pn)f t/2= 2P

    supfFZ1,...,Zn,Z1,...,Zn (Pn Pn)f t/2 2SF(2n)P

    (Pn Pn)f t/2

    4SF(2n)ent

    2/8

    O. Bousquet Statistical Learning Theory Lecture 3 63

    VC Entropy (1)


    VC dimension is distribution independent

    The same bound holds for any distribution

    It is loose for most distributions

    A similar proof can give a distribution-dependent result

    O. Bousquet Statistical Learning Theory Lecture 3 64

    VC Entropy (2)


    Denote the size of the projection N (

    F, z1 . . . , zn) := #

    Fz

    1,...,z

    n The VC entropy is defined as

    HF(n) = log E [N (F, Z1, . . . , Zn)] ,

    VC entropy bound: with probability at least 1

    g G, R(g) Rn(g) +

    HG(2n) + log 28n

    O. Bousquet Statistical Learning Theory Lecture 3 65

    VC Entropy (proof)


    Introduce i {1, 1} (probability 1/2), Rademacher variables

    2P

    supfFZ,Z(P

    n Pn)f t/2

    2EP

    supfFZ,Z

    1

    n

    n

    i=1i(f(Z

    i) f(Zi)) t/2

    2E NF, Z , Z P

    1

    n

    ni=1

    i t/2

    2E NF, Z , Z

    ent2/8

    O. Bousquet Statistical Learning Theory Lecture 3 66

    From Bounds to Algorithms


    For any distribution, H

    G(n)/n

    0 ensures consistency of empirical

    risk minimizer (i.e. convergence to best in the class)

    Does it means we can learn anything ?

    No because of the approximation of the class

    Need to trade-off approximation and estimation error (assessed bythe bound)

    Use the bound to control the trade-off

    O. Bousquet Statistical Learning Theory Lecture 3 67

    Structural risk minimization


    Structural risk minimization (SRM) (Vapnik, 1979): minimize the

    right hand side of

    R(g) Rn(g) + B(h, n).

    To this end, introduce a structure on G.

    Learning machine a set of functions and an induction principle

    O. Bousquet Statistical Learning Theory Lecture 3 68

    SRM: The Picture


[Figure: SRM picture — a nested structure S_{n−1} ⊂ S_n ⊂ S_{n+1}; as the capacity h grows, the training error decreases while the capacity term increases, and the bound on the test error R(f*) is minimized at an intermediate element of the structure]

    O. Bousquet Statistical Learning Theory Lecture 3 69

    Lecture 4


    Capacity Measures

    Covering numbers

    Rademacher averages

    Relationships

    O. Bousquet Statistical Learning Theory Lecture 4 70

    Covering numbers


Define a (random) distance d between functions, e.g.

d(f, f') = (1/n) #{ f(Z_i) ≠ f'(Z_i) : i = 1, . . . , n }

the normalized Hamming distance of the projections on the sample.

A set f_1, . . . , f_N covers F at radius ε if

F ⊂ ∪_{i=1}^N B(f_i, ε)

The covering number N(F, ε, n) is the minimum size of a cover of radius ε

Note that N(F, ε, n) = N(G, ε, n).
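A small sketch (not from the slides) of how one could estimate such a covering number empirically for a simple class: project threshold classifiers onto a sample and build a greedy ε-cover under the normalized Hamming distance above. The class, the sample and the greedy procedure are illustrative choices.

```python
import numpy as np

def hamming(u, v):
    """Normalized Hamming distance between two label vectors (projections on the sample)."""
    return np.mean(u != v)

def greedy_cover(projections, eps):
    """Size of a greedy eps-cover: an upper bound on the covering number of the projected class."""
    centers = []
    for p in projections:
        if all(hamming(p, c) > eps for c in centers):
            centers.append(p)
    return len(centers)

# Class G: threshold classifiers x -> sign(x - t), projected on a random sample.
rng = np.random.default_rng(0)
X = np.sort(rng.uniform(-1, 1, 50))
projections = [np.sign(X - t) for t in np.linspace(-1.2, 1.2, 200)]
for eps in (0.5, 0.2, 0.1, 0.05):
    print(f"eps = {eps}:  greedy cover size <= {greedy_cover(projections, eps)}")
```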

    O. Bousquet Statistical Learning Theory Lecture 4 71

    Bound with covering numbers


When the covering numbers are finite, one can approximate the class G by a finite set of functions

Result:

P[ ∃g ∈ G : R(g) − R_n(g) ≥ t ] ≤ 8 E[ N(G, t, n) ] e^{−nt²/128}

    O. Bousquet Statistical Learning Theory Lecture 4 72

    Covering numbers and VC dimension


    Notice that for all t, N(

    G, t , n)

    #

    GZ = N(

    G, Z)

    Hence N(G, t , n) h log enh

    Haussler

    N(G, t , n) Ch(4e)h 1th

    Independent of n

    O. Bousquet Statistical Learning Theory Lecture 4 73

    Refinement


    VC entropy corresponds to log covering numbers at minimal scale

    Covering number bound is a generalization where the scale is adaptedto the error

    Is this the right scale ?

    It turns out that results can be improved by considering all scales (chaining)

    O. Bousquet Statistical Learning Theory Lecture 4 74

    Rademacher averages


Rademacher variables: σ_1, . . . , σ_n independent random variables with

P[σ_i = 1] = P[σ_i = −1] = 1/2

Notation (randomized empirical measure): R_n f = (1/n) Σ_{i=1}^n σ_i f(Z_i)

Rademacher average: R(F) = E[ sup_{f∈F} R_n f ]

Conditional Rademacher average: R_n(F) = E_σ[ sup_{f∈F} R_n f ]

O. Bousquet Statistical Learning Theory Lecture 4 75

    Result


Distribution dependent:

∀f ∈ F, Pf ≤ P_n f + 2R(F) + sqrt( log(1/δ) / (2n) ),

Data dependent:

∀f ∈ F, Pf ≤ P_n f + 2R_n(F) + sqrt( 2 log(2/δ) / n ),

    O. Bousquet Statistical Learning Theory Lecture 4 76

    Concentration


    Hoeffdings inequality is a concentration inequality

    When n increases, the average is concentrated around the expectation

    Generalization to functions that depend on i.i.d. random variablesexist

    O. Bousquet Statistical Learning Theory Lecture 4 77

McDiarmid's Inequality


Assume that for all i = 1, . . . , n,

sup_{z_1,...,z_n, z'_i} |F(z_1, . . . , z_i, . . . , z_n) − F(z_1, . . . , z'_i, . . . , z_n)| ≤ c

then for all ε > 0,

P[ |F − E[F]| > ε ] ≤ 2 exp( −2ε² / (nc²) )

    O. Bousquet Statistical Learning Theory Lecture 4 78

    Proof of Rademacher average bounds


    Use concentration to relate supf

    FP f

    Pnf to its expectation

    Use symmetrization to relate expectation to Rademacher average

    Use concentration again to relate Rademacher average to conditionalone

    O. Bousquet Statistical Learning Theory Lecture 4 79

    Application (1)


    supfFA(f) + B(f) supfFA(f) + supfFB(f)Hence

    | supf

    F

    C(f) supf

    F

    A(f)| supf

    F

    (C(f) A(f))

    this gives

    | supfF

    (P f Pnf) supfF

    (P f Pnf)| supfF

    (P

    nf Pnf)

    O. Bousquet Statistical Learning Theory Lecture 4 80

    Application (2)


    f {0, 1} hence,

    Pnf Pnf =1

    n(f(Zi) f(Zi))

    1

    n

    thus

    | supfF(P f Pnf) supfF(P f Pnf)| 1

    n

    McDiarmids inequality can be applied with c = 1/n

    O. Bousquet Statistical Learning Theory Lecture 4 81

    Symmetrization (1)


    Upper bound

    E

    supfF

    P f Pnf

    2E

    supfF

    Rnf

    Lower bound

    E

    supfF

    |P f Pnf|

    12E

    supfF

    Rnf

    12

    n

    O. Bousquet Statistical Learning Theory Lecture 4 82

    Symmetrization (2)


    E[supfF

    P f Pnf]

    = E[supfF

    E

    P

    nf Pnf]

    EZ,Z[supfFP

    nf Pnf]

    = E,Z,Z

    supfF

    1

    n

    n

    i=1i(f(Z

    i) f(Zi))

    2E[sup

    fFRnf]

    O. Bousquet Statistical Learning Theory Lecture 4 83

    Loss class and initial class


    R(F) = E

    supgG

    1

    n

    ni=1

    i1[g(Xi)=Yi]

    = E supgG1

    n

    n

    i=1

    i1

    2(1

    Yig(Xi))

    =1

    2E

    supgG

    1

    n

    ni=1

    iYig(Xi)

    =

    1

    2R(G)

    O. Bousquet Statistical Learning Theory Lecture 4 84

    Computing Rademacher averages (1)


    12Esup

    gG1n

    ni=1

    ig(Xi)=

    1

    2+ E

    supgG

    1

    n

    n

    i=11 ig(Xi)

    2

    =1

    2 E

    infgG

    1

    n

    ni=1

    1 ig(Xi)2

    =1

    2 E infgG Rn(g, )

    O. Bousquet Statistical Learning Theory Lecture 4 85

    Computing Rademacher averages (2)


Not harder than computing the empirical risk minimizer:

pick σ_i at random and minimize the error with respect to the labels σ_i

Intuition: measures how much the class can fit random noise

Large class ⇒ R(G) = 1/2
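A minimal Monte Carlo sketch (not in the slides) of this recipe: draw random labels σ_i, measure how well the class can fit that noise, and average over draws. The finite class of threshold classifiers is only an illustration.

```python
import numpy as np

def empirical_rademacher(X, predictors, n_draws=200, seed=0):
    """Monte Carlo estimate of the conditional Rademacher average
    R_n(G) = E_sigma sup_{g in G} (1/n) sum_i sigma_i g(X_i), for a finite set of predictors."""
    rng = np.random.default_rng(seed)
    n = len(X)
    outputs = np.array([[g(x) for x in X] for g in predictors])   # |G| x n matrix
    total = 0.0
    for _ in range(n_draws):
        sigma = rng.choice([-1.0, 1.0], size=n)                   # random labels
        total += np.max(outputs @ sigma) / n                      # best "fit to the noise"
    return total / n_draws

# Small class of threshold classifiers: its Rademacher average shrinks with n.
predictors = [lambda x, t=t: 1.0 if x > t else -1.0 for t in np.linspace(-1, 1, 21)]
for n in (20, 100, 500):
    X = np.random.default_rng(1).uniform(-1, 1, n)
    print(f"n={n:4d}  estimated R_n(G) ~ {empirical_rademacher(X, predictors):.3f}")
```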

    O. Bousquet Statistical Learning Theory Lecture 4 86

    Concentration again


    Let

    F = E

    supfF

    Rnf

    Expectation with respect to i only, with (Xi, Yi) fixed.

    F satisfies McDiarmids assumptions with c = 1n

    E [F] = R(F) can be estimated by F = Rn(F)

    O. Bousquet Statistical Learning Theory Lecture 4 87

    Relationship with VC dimension


For a finite set F = {f_1, . . . , f_N},

R(F) ≤ 2 sqrt( log N / n )

Consequence for a VC class F with dimension h:

R(F) ≤ 2 sqrt( h log(en/h) / n ).

Recovers the VC bound with a concentration proof

    O. Bousquet Statistical Learning Theory Lecture 4 88

    Chaining


    Using covering numbers at all scales, the geometry of the class is

    better captured

Dudley: R_n(F) ≤ (C/√n) ∫_0^∞ sqrt( log N(F, t, n) ) dt

Consequence: R(F) ≤ C sqrt( h/n )

    Removes the unnecessary log n factor !

    O. Bousquet Statistical Learning Theory Lecture 4 89

    Lecture 5

    Advanced Topics


    Relative error bounds

    Noise conditions

    Localized Rademacher averages

    PAC-Bayesian bounds

    O. Bousquet Statistical Learning Theory Lecture 5 90

    Binomial tails


P_n f ∼ B(p, n), a binomial distribution with p = Pf

P[ Pf − P_n f ≥ t ] = Σ_{k=0}^{⌊n(p−t)⌋} (n choose k) p^k (1 − p)^{n−k}

Can be upper bounded by:

Exponential: ( (1−p)/(1−p−t) )^{n(1−p−t)} ( p/(p+t) )^{n(p+t)}

Bennett: exp( −(np/(1−p)) ((1 − t/p) log(1 − t/p) + t/p) )

Bernstein: exp( −nt² / (2p(1−p) + 2t/3) )

Hoeffding: exp( −2nt² )
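A quick numerical comparison (not in the slides) of these bounds against the exact binomial tail, assuming scipy is available. It shows how Bernstein exploits the small variance p(1 − p) while Hoeffding does not.

```python
import numpy as np
from scipy.stats import binom

def exact_tail(n, p, t):
    """P[Pf - P_n f >= t] when n * P_n f ~ Binomial(n, p)."""
    return binom.cdf(np.floor(n * (p - t)), n, p)

def hoeffding(n, p, t):
    return np.exp(-2 * n * t**2)

def bernstein(n, p, t):
    return np.exp(-n * t**2 / (2 * p * (1 - p) + 2 * t / 3))

n, p = 1000, 0.1
for t in (0.01, 0.03, 0.05):
    print(f"t={t}: exact={exact_tail(n, p, t):.3e}  Bernstein={bernstein(n, p, t):.3e}  Hoeffding={hoeffding(n, p, t):.3e}")
```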

    O. Bousquet Statistical Learning Theory Lecture 5 91

    Tail behavior


For small deviations, Gaussian behavior: exp( −nt² / (2p(1 − p)) )

⇒ a Gaussian with variance p(1 − p)

For large deviations, Poisson behavior: exp( −3nt/2 )

⇒ tails heavier than Gaussian

Can upper bound with a Gaussian with the largest (maximum) variance: exp( −2nt² )

    O. Bousquet Statistical Learning Theory Lecture 5 92

    Illustration (1)

    Maximum variance (p = 0.5)


[Figure: deviation bound as a function of x at maximum variance (p = 0.5) — the exact binomial tail compared with the Bernstein, Bennett and best Gaussian bounds]

    O. Bousquet Statistical Learning Theory Lecture 5 93

    Illustration (2)


    Small variance (p = 0.1)

[Figure: deviation bound as a function of x at small variance (p = 0.1) — the exact binomial tail compared with the sub-Gaussian, Bernstein, Bennett and best Gaussian bounds; the zoomed panel shows that the variance-aware bounds are much tighter]

    O. Bousquet Statistical Learning Theory Lecture 5 94

    Taking the variance into account (1)


Each function f ∈ F has a different variance Pf(1 − Pf) ≤ Pf.

For each f ∈ F, by Bernstein's inequality, with probability at least 1 − δ,

Pf ≤ P_n f + sqrt( 2 Pf log(1/δ) / n ) + 2 log(1/δ) / (3n)

The Gaussian part dominates (for Pf not too small, or n large enough); it depends on Pf

    O. Bousquet Statistical Learning Theory Lecture 5 95

    Taking the variance into account (2)


    Central Limit Theorem

    nP f PnfP f(1 P f)

    N(0, 1)

    Idea is to consider the ratioP f Pnf

    P f

    O. Bousquet Statistical Learning Theory Lecture 5 96

    Normalization

    Here (f {0, 1}), Var [f] P f2 = P f


    (f { , }), [f ] f f

    Large variance large risk. After normalization, fluctuations are more uniform

    supfF

    P f PnfP f

    not necessarily attained at functions with large variance.

    Focus of learning: functions with small error P f (hence smallvariance).

    The normalized supremum takes this into account.

    O. Bousquet Statistical Learning Theory Lecture 5 97

    Relative deviations

    Vapnik-Chervonenkis 1974


For δ > 0, with probability at least 1 − δ,

∀f ∈ F, (Pf − P_n f) / sqrt(Pf) ≤ 2 sqrt( (log S_F(2n) + log(4/δ)) / n )

and

∀f ∈ F, (P_n f − Pf) / sqrt(P_n f) ≤ 2 sqrt( (log S_F(2n) + log(4/δ)) / n )

    O. Bousquet Statistical Learning Theory Lecture 5 98

    Proof sketch

    1. Symmetrization


    P

    supfF

    P f PnfP f

    t

    2P

    supfF

    Pnf Pnf(Pnf + Pnf)/2

    t

    2. Randomization

    = 2EP

    supfF

    1n

    ni=1 i(f(Z

    i) f(Zi))

    (Pnf + Pnf)/2 t

    3. Tail bound

    O. Bousquet Statistical Learning Theory Lecture 5 99

    Consequences

    From the fact


    A B + CA A B + C2 + BC

    we get

    f F, P f Pnf + 2Pnflog SF(2n) + log 4n

    +4log SF(2n) + log 4

    n

    O. Bousquet Statistical Learning Theory Lecture 5 100

    Zero noise

    Ideal situation


    gn empirical risk minimizer t G

    R = 0 (no noise, n(X) = 0 a.s.)In that case

    Rn(gn) = 0

    R(gn) = O(d log nn ).

    O. Bousquet Statistical Learning Theory Lecture 5 101

    Interpolating between rates ?

    Rates are not correctly estimated by this inequality


    Consequence of relative error bounds

    P fn P f + 2

    P flog SF(2n) + log 4

    n

    +4log SF(2n) + log 4n

    The quantity which is small is not P f but P fn P f

    But relative error bounds do not apply to differences

    O. Bousquet Statistical Learning Theory Lecture 5 102

    Definitions


P = P_X × P(Y|X)

Regression function: η(x) = E[Y | X = x]

Target function: t(x) = sgn η(x)

Noise level: n(x) = (1 − |η(x)|)/2

Bayes risk: R* = E[n(X)]

R(g) = E[(1 − η(X))/2] + E[ η(X) 1[g(X) ≤ 0] ]

R(g) − R* = E[ |η(X)| 1[g(X)η(X) ≤ 0] ]

    O. Bousquet Statistical Learning Theory Lecture 5 103

    Intermediate noise

    Instead of assuming that |(x)| = 1 (i.e. n(x) = 0), the


    deterministic case, one can assume that n is well-behaved.

    Two kinds of assumptions

    n not too close to 1/2

    n not often too close to 1/2

    O. Bousquet Statistical Learning Theory Lecture 5 104

    Massart Condition

    For some c > 0, assume


|η(X)| > 1/c   almost surely

    There is no region where the decision is completely random

    Noise bounded away from 1/2

    O. Bousquet Statistical Learning Theory Lecture 5 105

    Tsybakov Condition

Let α ∈ [0, 1]. Equivalent conditions:


(1) ∃c > 0, ∀g ∈ {−1, 1}^X: P[g(X)η(X) ≤ 0] ≤ c (R(g) − R*)^α

(2) ∃c > 0, ∀A ⊂ X: ∫_A dP(x) ≤ c ( ∫_A |η(x)| dP(x) )^α

(3) ∃B > 0, ∀t ≥ 0: P[|η(X)| ≤ t] ≤ B t^{α/(1−α)}

    O. Bousquet Statistical Learning Theory Lecture 5 106

    Equivalence

    (1) (2) Recall R(g) R = E |(X)|1[g0] . For each


    ( ) ( ) (g) |( )| [g0]function g, there exists a set A such that 1[A] = 1[g0] (2) (3) Let A = {x : |(x)| t}

    P [|| t] = A dP(x) c(A |(x)|dP(x))

    ct(

    A

    dP(x))

    P [

    |

    | t]

    c

    11 t

    1

    O. Bousquet Statistical Learning Theory Lecture 5 107

    (3) (1)

    R(g) R = E |(X)| 1[g0]tE 1 1


    tE 1[g0]1[||>t]

    = tP [|| > t] tE 1[g>0]1[||>t] t(1 Bt 1 ) tP [g > 0] = t(P [g 0] Bt 1 )

    Take t = (1)P[g0]B (1)/

    P [g 0] B1

    (1 )(1 ) (R(g) R

    )

    O. Bousquet Statistical Learning Theory Lecture 5 108

    Remarks

    is in [0, 1] because


    R(g) R = E |(X)|1[g0] E 1[g0]

    = 0 no condition

    = 1 gives Massarts condition

    O. Bousquet Statistical Learning Theory Lecture 5 109

    Consequences

    Under Massarts condition


    E

    (1[g(X)=Y] 1[t(X)=Y])2

    c(R(g) R)

    Under Tsybakovs conditionE

    (1[g(X)=Y] 1[t(X)=Y])2

    c(R(g) R)

    O. Bousquet Statistical Learning Theory Lecture 5 110

    Relative loss class

    F is the loss class associated to G


    The relative loss class is defined as

    F= {f f : f F}

    It satisfiesP f

    2 c(P f)

    O. Bousquet Statistical Learning Theory Lecture 5 111

    Finite case

Union bound on F with Bernstein's inequality would give


Pf_n − Pf* ≤ P_n f_n − P_n f* + sqrt( 8c (Pf_n − Pf*)^α log N / n ) + 4 log N / (3n)

Consequence when f* ∈ F (but R* > 0):

Pf_n − Pf* ≤ C ( log N / n )^{1/(2−α)}

always better than n^{−1/2} for α > 0

    O. Bousquet Statistical Learning Theory Lecture 5 112

    Local Rademacher average

    Definition


R(F, r) = E[ sup_{f∈F: Pf² ≤ r} R_n f ]

Allows to generalize the previous result

Computes the capacity of a small ball in F (functions with small variance)

Under noise conditions, small variance implies small error

    O. Bousquet Statistical Learning Theory Lecture 5 113

    Sub-root functions

    Definition


A function φ : R⁺ → R⁺ is sub-root if

φ is non-decreasing

φ is non-negative

φ(r)/√r is non-increasing

    O. Bousquet Statistical Learning Theory Lecture 5 114

    Sub-root functions

    Properties

    A sub-root function


is continuous

has a unique positive fixed point: φ(r*) = r*

[Figure: a sub-root function φ(x) crossing the line y = x at its fixed point]

    O. Bousquet Statistical Learning Theory Lecture 5 115

    Star hull

    Definition


    F= {f : f F, [0, 1]}

    Properties

    Rn(

    F, r) is sub-root

    Entropy of F is not much bigger than entropy ofF

    O. Bousquet Statistical Learning Theory Lecture 5 116

    Result

    r fixed point ofR(F, r)


    Bounded functionsP f Pnf C

    rVar [f] +

    log 1 + log log n

    n

    Consequence for variance related to expectation (Var [f] c(P f))

    P f C

    Pnf + (r)

    12 +

    log 1 + log log n

    n

    O. Bousquet Statistical Learning Theory Lecture 5 117

    Consequences

    For VC classes R(F, r) C rhn hence r Chn


    Rate of convergence of Pnf to P f in O(1/

    n)

    But rate of convergence of P fn to P f is O(1/n1/(2))Only condition is t G but can be removed by SRM/Model selection

    O. Bousquet Statistical Learning Theory Lecture 5 118

    Proof sketch (1)

    Talagrands inequality


    supfF

    P fPnf E

    supfF

    P f Pnf

    +c

    supfF

    Var [f] /n+c/n

    Peeling of the class

    Fk = {f : Var [f] [xk, xk+1)}

    O. Bousquet Statistical Learning Theory Lecture 5 119

    Proof sketch (2)

    Application


    sup

    fFkP fPnf E

    sup

    fFkP f Pnf

    +c

    xVar [f] /n+c/n

    Symmetrization

    f F, P fPnf 2R(F, xVar [f])+c

    xVar [f] /n+c/n

    O. Bousquet Statistical Learning Theory Lecture 5 120

    Proof sketch (3)

    We need to solve this inequality. Things are simple ifR behave like


    a square root, hence the sub-root property

    P f Pnf 2

    rVar [f] + c

    xVar [f] /n + c/n

    Variance-expectationVar [f] c(P f)

    Solve in P f

    O. Bousquet Statistical Learning Theory Lecture 5 121

    Data-dependent version

    As in the global case, one can use data-dependent local Rademcher


    averagesRn(F, r) = E

    sup

    fF:P f2rRnf

    Using concentration one can also get

    P f C

    Pnf + (rn)

    12 +

    log 1 + log log n

    n

    where rn is the fixed point of a sub-root upper bound of

    Rn(

    F, r)

    O. Bousquet Statistical Learning Theory Lecture 5 122

    Discussion

    Improved rates under low noise conditions


    Interpolation in the rates

    Capacity measure seems local,

    but depends on all the functions,

    after appropriate rescaling: each f Fis considered at scale r/Pf2

    O. Bousquet Statistical Learning Theory Lecture 5 123

    Randomized Classifiers

    Given G a class of functions


    Deterministic: picks a function gn and always use it to predict Randomized construct a distribution n over G for each instance to classify, pick g n

    Error is averaged over nR(n) = nP f

    Rn(n) = nPnf

    O. Bousquet Statistical Learning Theory Lecture 5 124

    Union Bound (1)

    Let be a (fixed) distribution over F.


    Recall the refined union bound

    f F, P f Pnf

    log 1(f) + log1

    2n

    Take expectation with respect to n

    nP f nPnf n

    log 1(f) + log

    1

    2n

    O. Bousquet Statistical Learning Theory Lecture 5 125

    Union Bound (2)

    (1

    /(


    nP f nPnf n log (f) + log /(2n)

    n log (f) + log 1 /(2n) K(n, ) + H(n) + log

    1 /(2n)

    K(n, ) =

    n(f) logn(f)(f) df Kullback-Leibler divergence

    H(n) = n(f) log n(f) df Entropy

    O. Bousquet Statistical Learning Theory Lecture 5 126

    PAC-Bayesian Refinement

    It is possible to improve the previous bound.


    With probability at least 1 ,

    nP f nPnf

    K(n, ) + log 4n + log1

    2n

    1

    Good if n is spread (i.e. large entropy) Not interesting if n = fn

    O. Bousquet Statistical Learning Theory Lecture 5 127

    Proof (1)

    Variational formulation of entropy: for any T

    T (f ) lT(f)

    K( )


    T(f) log e + K(, ) Apply it to (P f Pnf)2

    n(P f Pnf)2 log e(P fPnf)2

    + K(n, )

    Markovs inequality: with probability 1 ,

    n(P f Pnf)2 log E

    e(P fPnf)2

    + K(n, ) + log1

    O. Bousquet Statistical Learning Theory Lecture 5 128

    Proof (2)

    FubiniE e

    (P fPnf)2 = E e(P fPnf)2


    Modified Chernoff boundE

    e(2n1)(P fPnf)

    2

    4n

    Putting together ( = 2n 1)(2n 1)n(P f Pnf)2 K(n, ) + log 4n + log 1

    Jensen (2n

    1)(n(P f

    Pnf))

    2

    (2n

    1)n(P f

    Pnf)

    2

    O. Bousquet Statistical Learning Theory Lecture 5 129

    Lecture 6

    Loss Functions


    Properties

    Consistency

    Examples

    Losses and noise

    O. Bousquet Statistical Learning Theory Lecture 6 130

    Motivation (1)

ERM: minimize Σ_{i=1}^n 1[g(X_i) ≠ Y_i] in a set G


Computationally hard

Smoothing: replace binary by real-valued functions

Introduce a smooth loss function: Σ_{i=1}^n ℓ(g(X_i), Y_i)

    O. Bousquet Statistical Learning Theory Lecture 6 131

    Motivation (2)

Hyperplanes in infinite dimension have


    infinite VC-dimension but finite scale-sensitive dimension (to be defined later)

    It is good to have a scale

    This scale can be used to give a confidence (i.e. estimate the density)

    However, losses do not need to be related to densities Can get bounds in terms of margin error instead of empirical error

    (smoother easier to optimize for model selection)

    O. Bousquet Statistical Learning Theory Lecture 6 132

    Margin

It is convenient to work with (symmetry of +1 and −1)


ℓ(g(x), y) = φ(y g(x))

y g(x) is the margin of g at (x, y)

Loss: L(g) = E[φ(Y g(X))],   L_n(g) = (1/n) Σ_{i=1}^n φ(Y_i g(X_i))

Loss class: F = { f : (x, y) ↦ φ(y g(x)) : g ∈ G }

    O. Bousquet Statistical Learning Theory Lecture 6 133

    Minimizing the loss

    Decomposition of L(g)


    1

    2E [E [(1 + (X))(g(X)) + (1 (X))(g(X))|X]]

    Minimization for each x

    H() = infR

    ((1 + )()/2 + (1 )()/2)

    L := infg L(g) = E [H((X))]

    O. Bousquet Statistical Learning Theory Lecture 6 134

    Classification-calibrated

    A minimal requirement is that the minimizer in H() has the correct

    sign (that of the target t or that of )


    sign (that of the target t or that of ).

    Definition is classification-calibrated if, for any = 0

    inf:0

    (1+)()+(1

    )(

    ) > inf

    R(1+)()+(1

    )(

    )

    This means the infimum is achieved for an of the correct sign (andnot for an of the wrong sign, except possibly for = 0).

    O. Bousquet Statistical Learning Theory Lecture 6 135

    Consequences (1)

    Results due to (Jordan, Bartlett and McAuliffe 2003)


    is classification-calibrated iff for all sequences gi and every proba-bility distribution P,

    L(gi) L R(gi) R

    When is convex (convenient for optimization) is classification-calibrated iff it is differentiable at 0 and (0) < 0

    O. Bousquet Statistical Learning Theory Lecture 6 136

    Consequences (2)


    Let H () = inf:0 ((1 + )()/2 + (1 )()/2)

    Let () be the largest convex function below H() H()

    One has(R(g) R) L(g) L

    O. Bousquet Statistical Learning Theory Lecture 6 137

    Examples (1)

[Figure: the 0–1 loss and the hinge, squared hinge, square and exponential losses plotted as functions of the margin]



    O. Bousquet Statistical Learning Theory Lecture 6 138

    Examples (2)

Hinge loss: φ(x) = max(0, 1 − x), ψ(x) = x


Squared hinge loss: φ(x) = max(0, 1 − x)², ψ(x) = x²

Square loss: φ(x) = (1 − x)², ψ(x) = x²

Exponential: φ(x) = exp(−x), ψ(x) = 1 − sqrt(1 − x²)
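To make the definitions concrete, here is a short sketch (not from the slides) that evaluates these margin losses φ at a few values of the margin x = y g(x).

```python
import numpy as np

# Margin losses phi(x), evaluated at the margin x = y * g(x).
losses = {
    "0-1":           lambda x: (x <= 0).astype(float),
    "hinge":         lambda x: np.maximum(0.0, 1.0 - x),
    "squared hinge": lambda x: np.maximum(0.0, 1.0 - x) ** 2,
    "square":        lambda x: (1.0 - x) ** 2,
    "exponential":   lambda x: np.exp(-x),
}

margins = np.array([-1.0, 0.0, 0.5, 1.0, 2.0])
for name, phi in losses.items():
    print(f"{name:14s}", np.round(phi(margins), 3))
```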

    O. Bousquet Statistical Learning Theory Lecture 6 139

    Low noise conditions

    Relationship can be improved under low noise conditions Under Tsybakovs condition with exponent and constant c,

    c(R(g) R)((R(g) R)1/2c) L(g) L


    c(R(g) R ) ((R(g) R ) /2c) L(g) L

    Hinge loss (no improvement)

    R(g)

    R

    L(g)

    L

    Square loss or squared hinge loss

    R(g) R (4c(L(g) L)) 12

    O. Bousquet Statistical Learning Theory Lecture 6 140

    Estimation error

    Recall that Tsybakov condition implies P f2 c(P f) for therelative loss class (with 0

    1 loss)


    What happens for the relative loss class associated to ?

    Two possibilities

    Strictly convex loss (can modify the metric on R)

    Piecewise linear

    O. Bousquet Statistical Learning Theory Lecture 6 141

    Strictly convex losses

    Noise behavior controlled by modulus of convexity


    Result(

    P f2

    K) P f /2

    with K Lipschitz constant of and modulus of convexity of L(g)

    with respect to f gL2(P)

    Not related to noise exponent

    O. Bousquet Statistical Learning Theory Lecture 6 142

    Piecewise linear losses

    Noise behavior related to noise exponent


    Result for hinge lossP f

    2

    CP f

    if initial class G is uniformly bounded

    O. Bousquet Statistical Learning Theory Lecture 6 143


    Examples

    Under Tsybakovs condition

    Hinge loss 1 log 1 + log log n


    R(g) R L(g) L + C

    (r

    )1

    2 +log + log log n

    n

    Squared hinge loss or square loss (x) = cx2, P f2

    CP f

    R(g)R C

    L(g) L + C(r + log1 + log log n

    n)

    12

    O. Bousquet Statistical Learning Theory Lecture 6 145

    Classification vs Regression losses

    Consider a classification-calibrated function


    It is a classification loss if L(t) = L

    otherwise it is a regression loss

    O. Bousquet Statistical Learning Theory Lecture 6 146

    Classification vs Regression losses

    Square, squared hinge, exponential losses Noise enters relationship between risk and loss


    Modulus of convexity enters in estimation error

    Hinge loss Direct relationship between risk and loss

    Noise enters in estimation error

    Approximation term not affected by noise in second case Real value does not bring probability information in second case

    O. Bousquet Statistical Learning Theory Lecture 6 147

    Lecture 7

    Regularization

    Formulation


    Capacity measures

    Computing Rademacher averages

    Applications

    O. Bousquet Statistical Learning Theory Lecture 7 148

    Equivalent problems

    Up to the choice of the regularization parameters, the following

    problems are equivalent


min_{f∈F} L_n(f) + λ Ω(f)

min_{f∈F: L_n(f) ≤ e} Ω(f)

min_{f∈F: Ω(f) ≤ R} L_n(f)

The solution sets are the same

    O. Bousquet Statistical Learning Theory Lecture 7 149

    Comments

    Computationally, variant of SRM

    variant of model selection by penalization


    one has to choose a regularizer which makes sense

    Need a class that is large enough (for universal consistency)

    but has small balls

    O. Bousquet Statistical Learning Theory Lecture 7 150

    Rates

    To obtain bounds, consider ERM on balls

    Relevant capacity is that of balls


    Real-valued functions, need a generalization of VC dimension, entropyor covering numbers

    Involve scale sensitive capacity (takes into account the value and notonly the sign)

    O. Bousquet Statistical Learning Theory Lecture 7 151

    Scale-sensitive capacity

Generalization of VC entropy and VC dimension to real-valued functions


Definition: a set x_1, . . . , x_n is shattered by F (at scale ε) if there exists a function s such that for all choices of σ_i ∈ {−1, 1}, there exists f ∈ F with

σ_i (f(x_i) − s(x_i)) ≥ ε

The fat-shattering dimension of F at scale ε (denoted vc(F, ε)) is

    O. Bousquet Statistical Learning Theory Lecture 7 152

    Link with covering numbers

Like VC dimension, the fat-shattering dimension can be used to upper bound covering numbers


Result:

N(F, t, n) ≤ (C₁/t)^{C₂ · vc(F, C₃ t)}

Note that one can also define data-dependent versions (restriction on the sample)

    O. Bousquet Statistical Learning Theory Lecture 7 153

    Link with Rademacher averages (1)

    Consequence of covering number estimates

    Rn(F) C1

    vc(F, t) log C2t

    dt


    ( ) n

    0

    ( ) g

    t

    Another link via Gaussian averages (replace Rademacher by GaussianN(0,1) variables)

    Gn(F) = Eg

    supfF

    1

    n

    ni=1

    gif(Zi)

    O. Bousquet Statistical Learning Theory Lecture 7 154

    Link with Rademacher averages (2)

    Worst case average


    n(F) supx1,...,xn

    E

    supfF

    Gnf

    Associated dimension t(F, ) = sup{n N : n(F) } Result (Mendelson & Vershynin 2003)

    vc(F, c) t(F, ) K2

    vc(F, c)

    O. Bousquet Statistical Learning Theory Lecture 7 155

    Rademacher averages and Lipschitz losses

What matters is the capacity of F (the loss class)


If φ is Lipschitz with constant M, then

R_n(F) ≤ M R_n(G)

    Relates to Rademacher average of the initial class (easier to compute)

    O. Bousquet Statistical Learning Theory Lecture 7 156

    Dualization

    Consider the problem mingR Ln(g) Rademacher of ball

    E sup

    gRRng


    g

    Duality

    E supfR Rnf = RnE

    ni=1

    iXi

    dual norm, Xi evaluation at Xi (element of the dual underappropriate conditions)

    O. Bousquet Statistical Learning Theory Lecture 7 157

RKHS

    Given a positive definite kernel k

    Space of functions: the reproducing kernel Hilbert space associated to k

    Regularizer: the RKHS norm \|\cdot\|_k

    Properties: Representer theorem

    g_n = \sum_{i=1}^n \alpha_i\, k(X_i, \cdot)

    O. Bousquet Statistical Learning Theory Lecture 7 158
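    A minimal sketch of how the representer theorem is used in practice, assuming the squared loss (for which the coefficients \alpha have a closed form) and a Gaussian kernel; the data, kernel width, and regularization value are illustrative.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 40
X = rng.uniform(-3, 3, size=n)
y = np.sin(X) + 0.1 * rng.normal(size=n)

def k(a, b, sigma=1.0):
    """Gaussian (RBF) kernel, an illustrative positive definite kernel."""
    return np.exp(-(a - b) ** 2 / (2 * sigma ** 2))

lam = 0.1
K = k(X[:, None], X[None, :])                  # Gram matrix K_ij = k(X_i, X_j)

# Representer theorem: g_n = sum_i alpha_i k(X_i, .)
# For the squared loss, alpha solves (K + lam * n * I) alpha = y
alpha = np.linalg.solve(K + lam * n * np.eye(n), y)

def g_n(x_new):
    return k(x_new[:, None], X[None, :]) @ alpha

x_test = np.linspace(-3, 3, 5)
print(np.round(g_n(x_test), 3))                # should roughly follow sin(x)
print("RKHS norm^2 of g_n:", alpha @ K @ alpha)
```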

    Shattering dimension of hyperplanes

Set of functions

    G = \{ g(x) = \langle w, x \rangle : \|w\| = 1 \}

    Assume \|x\| \leq R

    Result:  vc(G, \gamma) \leq R^2 / \gamma^2

    O. Bousquet Statistical Learning Theory Lecture 7 159

    Proof Strategy (Gurvits, 1997)

Assume that x_1, \dots, x_r are \gamma-shattered by hyperplanes with \|w\| = 1, i.e., for all y_1, \dots, y_r \in \{-1, 1\}, there exists a w such that

    y_i \langle w, x_i \rangle \geq \gamma \quad \text{for all } i = 1, \dots, r. \qquad (2)

    Two steps:

    prove that the more points we want to shatter in (2), the larger \left\| \sum_{i=1}^r y_i x_i \right\| must be

    upper bound the size of \left\| \sum_{i=1}^r y_i x_i \right\| in terms of R

    Combining the two tells us how many points we can at most shatter

    O. Bousquet Statistical Learning Theory Lecture 7 160

    Part I

Summing (2) yields \left\langle w, \sum_{i=1}^r y_i x_i \right\rangle \geq r\gamma

    By the Cauchy-Schwarz inequality,

    \left\langle w, \sum_{i=1}^r y_i x_i \right\rangle \leq \|w\| \left\| \sum_{i=1}^r y_i x_i \right\| = \left\| \sum_{i=1}^r y_i x_i \right\|

    Combining both:

    r\gamma \leq \left\| \sum_{i=1}^r y_i x_i \right\| . \qquad (3)

    O. Bousquet Statistical Learning Theory Lecture 7 161

    Part II

Consider the labels y_i \in \{-1, 1\} as i.i.d. uniform (Rademacher) random variables.

    E\left\| \sum_{i=1}^r y_i x_i \right\|^2 = E \sum_{i,j=1}^r y_i y_j \langle x_i, x_j \rangle

    = \sum_{i=1}^r E[\langle x_i, x_i \rangle] + E \sum_{i \neq j} y_i y_j \langle x_i, x_j \rangle

    = \sum_{i=1}^r \|x_i\|^2

    (the cross terms vanish in expectation since E[y_i y_j] = 0 for i \neq j)

    O. Bousquet Statistical Learning Theory Lecture 7 162

    Part II, ctd.

Since \|x_i\| \leq R, we get

    E\left\| \sum_{i=1}^r y_i x_i \right\|^2 \leq rR^2 .

    This holds for the expectation over the random choices of the labels, hence there must be at least one set of labels for which it also holds true. Use this set. Hence

    \left\| \sum_{i=1}^r y_i x_i \right\|^2 \leq rR^2 .

    O. Bousquet Statistical Learning Theory Lecture 7 163

    Part I and II Combined

Part I:  (r\gamma)^2 \leq \left\| \sum_{i=1}^r y_i x_i \right\|^2

    Part II:  \left\| \sum_{i=1}^r y_i x_i \right\|^2 \leq rR^2

    Hence

    r^2 \gamma^2 \leq r R^2 ,

    i.e.,

    r \leq \frac{R^2}{\gamma^2}

    O. Bousquet Statistical Learning Theory Lecture 7 164
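    The probabilistic step of Part II, and hence the bound r \leq R^2/\gamma^2, can be checked numerically; the sketch below draws random points in a ball of radius R and estimates E\|\sum_i y_i x_i\|^2 by Monte Carlo (the dimension, r, and R are illustrative).

```python
import numpy as np

rng = np.random.default_rng(4)
r, R, n_mc = 20, 1.0, 20000

# Random points inside the ball of radius R (5-dimensional, illustrative)
x = rng.normal(size=(r, 5))
x *= (R * rng.uniform(size=(r, 1)) ** (1 / 5)) / np.linalg.norm(x, axis=1, keepdims=True)

# Part II: E || sum_i y_i x_i ||^2 = sum_i ||x_i||^2 <= r R^2 for Rademacher labels y
norms_sq = []
for _ in range(n_mc):
    y = rng.choice([-1.0, 1.0], size=r)
    norms_sq.append(np.linalg.norm(y @ x) ** 2)

print("Monte Carlo E||sum y_i x_i||^2 :", np.mean(norms_sq))
print("sum_i ||x_i||^2                :", np.sum(np.linalg.norm(x, axis=1) ** 2))
print("upper bound r * R^2            :", r * R ** 2)
```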

    Boosting

Given a class H of functions

    Space of functions: the linear span of H

    Regularizer: 1-norm of the weights

    \|g\|_1 = \inf\left\{ \sum_i |\alpha_i| : g = \sum_i \alpha_i h_i \right\}

    Properties: weight concentrated on the (weighted) margin maximizers

    g_n = \sum_h w_h h, with weight only on those h \in H achieving \sum_i d_i Y_i h(X_i) = \min_{h' \in H} \sum_i d_i Y_i h'(X_i)

    O. Bousquet Statistical Learning Theory Lecture 7 165

    Rademacher averages for boosting

Function class of interest

    G_R = \{ g \in \mathrm{span}\, H : \|g\|_1 \leq R \}

    Result

    R_n(G_R) = R\, R_n(H)

    Capacity (as measured by global Rademacher averages) is not affected by taking linear combinations !

    O. Bousquet Statistical Learning Theory Lecture 7 166
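    A quick numerical check of this identity for a small finite base class, assuming H is symmetric (closed under negation), which is implicit in the equality; the supremum over the \ell_1-ball is computed by enumerating its vertices \pm R e_h. The base class of threshold functions and the values of n and R are illustrative.

```python
import numpy as np

rng = np.random.default_rng(5)
n, R, n_mc = 40, 3.0, 4000
x = rng.uniform(-1, 1, size=n)

# Finite symmetric base class H (closed under negation -- an assumption for this check)
H = np.array([s * np.sign(x - c) for c in (-0.5, 0.0, 0.5) for s in (1.0, -1.0)])

rn_H, rn_ball = [], []
for _ in range(n_mc):
    sigma = rng.choice([-1.0, 1.0], size=n)
    corr = H @ sigma / n                      # (1/n) sum_i sigma_i h(x_i), one value per h
    rn_H.append(np.max(corr))
    # The sup of a linear functional over the l1-ball {||a||_1 <= R} is attained at a
    # vertex +/- R e_h, so the sup over G_R is a max over these 2|H| vertices
    rn_ball.append(max(s * R * c for c in corr for s in (1.0, -1.0)))

print("Rn(H)     ~", np.mean(rn_H))
print("Rn(G_R)   ~", np.mean(rn_ball))        # equals R * Rn(H) draw by draw (H symmetric)
print("R * Rn(H) ~", R * np.mean(rn_H))
```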

    Lecture 8

    SVM

    Computational aspects


    Capacity Control

    Universality

    Special case of RBF kernel

    O. Bousquet Statistical Learning Theory Lecture 8 167

    Formulation (1)

Soft margin

    \min_{w, b, \xi}\; \frac{1}{2}\|w\|^2 + C \sum_{i=1}^m \xi_i

    subject to  y_i(\langle w, x_i \rangle + b) \geq 1 - \xi_i, \quad \xi_i \geq 0

    Convex objective function and convex constraints

    Unique solution

    Efficient procedures to find it

    Is it the right criterion ?

    O. Bousquet Statistical Learning Theory Lecture 8 168

Formulation (2)

    Soft margin

    \min_{w, b, \xi}\; \frac{1}{2}\|w\|^2 + C \sum_{i=1}^m \xi_i

    subject to  y_i(\langle w, x_i \rangle + b) \geq 1 - \xi_i, \quad \xi_i \geq 0

    Optimal value of \xi_i:

    \xi_i = \max(0, 1 - y_i(\langle w, x_i \rangle + b))

    Substitute above to get

    \min_{w, b}\; \frac{1}{2}\|w\|^2 + C \sum_{i=1}^m \max(0, 1 - y_i(\langle w, x_i \rangle + b))

    O. Bousquet Statistical Learning Theory Lecture 8 169
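    The unconstrained form above can be minimized directly; here is a minimal subgradient-descent sketch (not the dual solvers used in practice), on illustrative toy data with illustrative step sizes.

```python
import numpy as np

rng = np.random.default_rng(6)
m, d, C = 100, 2, 1.0

# Toy, roughly separable data: label is the sign of the first coordinate
X = rng.normal(size=(m, d)) + np.array([2.0, 0.0]) * rng.choice([-1, 1], size=(m, 1))
y = np.sign(X[:, 0])

w, b = np.zeros(d), 0.0
for t in range(1, 2001):
    eta = 1.0 / t                                 # decreasing step size
    margins = y * (X @ w + b)
    active = margins < 1                          # points with non-zero hinge loss
    # Subgradient of (1/2)||w||^2 + C * sum_i max(0, 1 - y_i(<w, x_i> + b))
    grad_w = w - C * (y[active, None] * X[active]).sum(axis=0)
    grad_b = -C * y[active].sum()
    w -= eta * grad_w
    b -= eta * grad_b

obj = 0.5 * w @ w + C * np.maximum(0, 1 - y * (X @ w + b)).sum()
print("w =", np.round(w, 3), " b =", round(b, 3), " objective =", round(obj, 3))
print("training error:", np.mean(np.sign(X @ w + b) != y))
```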

    Regularization

General form of the regularization problem

    \min_{f \in F}\; \frac{1}{n} \sum_{i=1}^n c(y_i f(x_i)) + \lambda \|f\|^2

    Capacity control by regularization with a convex cost

    [Plot: a convex cost c upper bounding the 0-1 loss as a function of the margin y f(x)]

    O. Bousquet Statistical Learning Theory Lecture 8 170

    Loss Function

\phi(Y f(X)) = \max(0, 1 - Y f(X))

    Convex, non-increasing, upper bounds 1_{[Y f(X) \leq 0]}

    Classification-calibrated

    Classification type (L^* = L(t^*)):

    R(g) - R^* \leq L(g) - L^*

    O. Bousquet Statistical Learning Theory Lecture 8 171

    Regularization

Choosing a kernel corresponds to

    Choosing a sequence (a_k) and setting

    \|f\|^2 := \sum_{k \geq 0} a_k \int |f^{(k)}|^2\, dx

    penalization of high order derivatives (high frequencies)

    enforces smoothness of the solution

    O. Bousquet Statistical Learning Theory Lecture 8 172

Capacity: VC dimension

    The VC dimension of the set of hyperplanes in R^d is d + 1.

    Dimension of the feature space? \infty for the RBF kernel

    w is chosen in the span of the data (w = \sum_i \alpha_i y_i x_i)

    The span of the data has dimension m for the RBF kernel (the k(\cdot, x_i) are linearly independent)

    The VC bound does not give any information: h/m \approx 1

    Need to take the margin into account

    O. Bousquet Statistical Learning Theory Lecture 8 173

    Capacity: Shattering dimension

Hyperplanes with margin

    If \|x\| \leq R,

    vc(\text{hyperplanes with margin } \rho, 1) \leq R^2 / \rho^2


    O. Bousquet Statistical Learning Theory Lecture 8 174

    Margin

    The shattering dimension is related to the margin

    Maximizing the margin means minimizing the shattering dimension


Small shattering dimension implies good control of the risk

    This control is automatic (no need to choose the margin beforehand)

    but requires tuning of the regularization parameter

    O. Bousquet Statistical Learning Theory Lecture 8 175


    Rademacher Averages (2)

E \sup_{\|w\| \leq M} \frac{1}{n} \sum_{i=1}^n \sigma_i \langle w, x_i \rangle = E \sup_{\|w\| \leq M} \left\langle w, \frac{1}{n} \sum_{i=1}^n \sigma_i x_i \right\rangle

    \leq E \sup_{\|w\| \leq M} \|w\| \left\| \frac{1}{n} \sum_{i=1}^n \sigma_i x_i \right\|

    = \frac{M}{n}\, E \sqrt{ \left\langle \sum_{i=1}^n \sigma_i x_i, \sum_{i=1}^n \sigma_i x_i \right\rangle }

    O. Bousquet Statistical Learning Theory Lecture 8 177

    Rademacher Averages (3)

\frac{M}{n}\, E \sqrt{ \left\langle \sum_{i=1}^n \sigma_i x_i, \sum_{i=1}^n \sigma_i x_i \right\rangle } \leq \frac{M}{n} \sqrt{ E \left\langle \sum_{i=1}^n \sigma_i x_i, \sum_{i=1}^n \sigma_i x_i \right\rangle }

    = \frac{M}{n} \sqrt{ E \sum_{i,j} \sigma_i \sigma_j \langle x_i, x_j \rangle }

    = \frac{M}{n} \sqrt{ \sum_{i=1}^n k(x_i, x_i) }

    (the first inequality is Jensen's)

    O. Bousquet Statistical Learning Theory Lecture 8 178
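    For the linear kernel the supremum over the ball \|w\| \leq M is M times a norm, so the whole chain can be checked by Monte Carlo; the sketch below compares the estimated Rademacher average of the ball with the trace bound (M/n)\sqrt{\sum_i k(x_i, x_i)} on illustrative Gaussian data.

```python
import numpy as np

rng = np.random.default_rng(7)
n, d, M, n_mc = 50, 5, 2.0, 10000
X = rng.normal(size=(n, d))

# For the linear kernel, sup_{||w|| <= M} <w, v> = M ||v||, so the Rademacher
# average of the ball is (M/n) E || sum_i sigma_i x_i ||
vals = []
for _ in range(n_mc):
    sigma = rng.choice([-1.0, 1.0], size=n)
    vals.append(M / n * np.linalg.norm(sigma @ X))

# Trace bound: (M/n) sqrt(sum_i k(x_i, x_i)) with k(x_i, x_i) = ||x_i||^2
trace_bound = M / n * np.sqrt(np.sum(np.einsum('ij,ij->i', X, X)))

print("Monte Carlo Rademacher average        :", np.mean(vals))
print("trace bound (M/n) sqrt(sum k(x_i,x_i)):", trace_bound)
```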

Improved rates: Noise condition

    Under Massart's condition (|2\eta - 1| \geq \eta_0 > 0), with \|g\|_\infty \leq M,

    E\left[ \big( \phi(Y g(X)) - \phi(Y t(X)) \big)^2 \right] \leq (M + 1 + 2/\eta_0)\,(L(g) - L^*) .

    If the noise is nice, the variance is linearly related to the expectation

    Estimation error of order r_n (of the class G)

    O. Bousquet Statistical Learning Theory Lecture 8 179

Improved rates: Capacity (1)

    r_n is related to the decay of the eigenvalues \lambda_1 \geq \lambda_2 \geq \dots of the Gram matrix

    r_n \leq \frac{c}{n} \min_{d \in \mathbb{N}} \left( d + \sqrt{\sum_{j > d} \lambda_j} \right)

    Note that d = 0 gives the trace bound

    r_n is always better than the trace bound (equality when the \lambda_i are constant)

    O. Bousquet Statistical Learning Theory Lecture 8 180
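    Assuming the reconstruction of the bound above, the sketch below evaluates its right-hand side from the eigenvalues of an RBF Gram matrix and compares it with the d = 0 (trace) term; the kernel, the data, and c = 1 are illustrative.

```python
import numpy as np

rng = np.random.default_rng(8)
n, sigma, c = 200, 1.0, 1.0
X = rng.uniform(-1, 1, size=(n, 1))

# RBF Gram matrix and its eigenvalues, sorted in decreasing order
K = np.exp(-(X - X.T) ** 2 / (2 * sigma ** 2))
lam = np.sort(np.linalg.eigvalsh(K))[::-1]

# Reconstructed bound: r_n <= (c/n) * min_d ( d + sqrt(sum_{j > d} lambda_j) )
tail_sums = np.maximum(lam.sum() - np.concatenate(([0.0], np.cumsum(lam)[:-1])), 0.0)
values = (np.arange(n) + np.sqrt(tail_sums)) * (c / n)

print("eigenvalue bound   :", values.min(), "(best d =", int(values.argmin()), ")")
print("trace bound (d = 0):", (c / n) * np.sqrt(max(lam.sum(), 0.0)))
```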

Improved rates: Capacity (2)

    Example: exponential decay \lambda_i = e^{-i}

    Global Rademacher average of order \frac{1}{\sqrt{n}}

    r_n of order \frac{\log n}{n}

    O. Bousquet Statistical Learning Theory Lecture 8 181

    Exponent of the margin

Estimation error analysis shows that in G_M = \{g : \|g\| \leq M\}

    R(g_n) - R(g^*) \leq M \cdot (\dots)

    The wrong power of M (an M^2 penalty) is used in the algorithm

    Computationally easier

    But does not give a dimension-free status

    Using M could improve the cutoff detection

    O. Bousquet Statistical Learning Theory Lecture 8 182

    Kernel

    Why is it good to use kernels ?

Gaussian kernel (RBF)

    k(x, y) = e^{-\frac{\|x - y\|^2}{2\sigma^2}}

    \sigma is the width of the kernel

    What is the geometry of the feature space ?

    O. Bousquet Statistical Learning Theory Lecture 8 183


    RBF


    O. Bousquet Statistical Learning Theory Lecture 8 185

    RBF

    Differential Geometry

    Flat Riemannian metric

    distance along the sphere is equal to distance in input space


    Distances are contracted

    shortcuts by getting outside the sphere

    O. Bousquet Statistical Learning Theory Lecture 8 186

    RBF

    Geometry of the span

Ellipsoid

    [Figure: ellipse with semi-axes a and b, x^2/a^2 + y^2/b^2 = 1]

    K = (k(x_i, x_j)) is the Gram matrix

    Eigenvalues \lambda_1, \dots, \lambda_m

    Data points are mapped to an ellipsoid with semi-axis lengths \sqrt{\lambda_1}, \dots, \sqrt{\lambda_m}

    O. Bousquet Statistical Learning Theory Lecture 8 187

    RBF

    Universality

    Consider the set of functions

H = \mathrm{span}\{ k(x, \cdot) : x \in X \}

    H is dense in C(X)

    Any continuous function can be approximated (in the \|\cdot\|_\infty norm) by functions in H

    With enough data one can construct any function

    O. Bousquet Statistical Learning Theory Lecture 8 188

    RBF

    Eigenvalues

    Exponentially decreasing

    Fourier domain: exponential penalization of derivatives

Enforces smoothness with respect to the Lebesgue measure in input space

    O. Bousquet Statistical Learning Theory Lecture 8 189
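    The fast decay can be seen directly on a Gram matrix; the sketch below prints the leading eigenvalues of the Gaussian kernel on illustrative uniform one-dimensional data.

```python
import numpy as np

rng = np.random.default_rng(9)
n, sigma = 300, 0.5
x = rng.uniform(-1, 1, size=(n, 1))

K = np.exp(-(x - x.T) ** 2 / (2 * sigma ** 2))
lam = np.sort(np.linalg.eigvalsh(K))[::-1]

# The leading eigenvalues drop by orders of magnitude within a handful of indices
for i in range(0, 20, 2):
    print(f"lambda_{i + 1:<2} = {lam[i]:.3e}")
```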

    RBF

    Induced Distance and Flexibility

\sigma \to 0

    1-nearest neighbor in input space; each point is in a separate dimension, everything orthogonal

    \sigma \to \infty

    linear classifier in input space; all points very close on the sphere, initial geometry

    Tuning \sigma allows one to try all possible intermediate combinations

    O. Bousquet Statistical Learning Theory Lecture 8 190
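    A small sketch of the two limits, looking only at the Gram matrix: as \sigma \to 0 it approaches the identity (all points mutually orthogonal), and as \sigma \to \infty it approaches the all-ones matrix, with feature-space distances proportional to input-space distances; the five random points and the two \sigma values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(10)
x = rng.uniform(-1, 1, size=(5, 2))
sq = ((x[:, None, :] - x[None, :, :]) ** 2).sum(-1)      # pairwise squared input distances

def gram(sigma):
    return np.exp(-sq / (2 * sigma ** 2))

print("sigma -> 0:\n", np.round(gram(0.01), 3))          # ~ identity: points mutually orthogonal
print("sigma -> infinity:\n", np.round(gram(100.0), 6))  # ~ all-ones matrix: points nearly coincide

# For large sigma, feature-space squared distances 2 - 2 k(x, y) are approximately
# ||x - y||^2 / sigma^2, i.e. proportional to the input geometry
# (off-diagonal entries of the ratio below are close to 1)
sigma = 100.0
ratio = (2 - 2 * gram(sigma)) / (sq / sigma ** 2 + 1e-15)
print(np.round(ratio, 3))
```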

    RBF

    Ideas

    Works well if the Euclidean distance is good

    Works well if decision boundary is smooth

  • 8/4/2019 Stastical Learning Theory

    192/194

    Adapt smoothness via

    Universal

    O. Bousquet Statistical Learning Theory Lecture 8 191


    Learning Theory: some informal thoughts

    Need assumptions/restrictions to learn

Data cannot replace knowledge

    No universal learning (simplicity measure)

    SVMs work because of capacity control

    Choice of kernel = choice of prior / regularizer

    RBF works well if the Euclidean distance is meaningful

    Knowledge improves performance (e.g. invariances)

    O. Bousquet Statistical Learning Theory Lecture 8 193