
Page 1:

STAT 208: Statistical learning theory

Arash A. Amini

March 8, 2018

1 / 183

Page 2:

Probabilistic framework

• Observe a feature X ∈ 𝒳 and predict an outcome Y ∈ 𝒴 based on it.

• Of course, there should be some relation between X and Y .

• Examples¹

  X                                  Y
  document                           topic
  image of handwritten digit         digit
  image of an object                 type of object
  image of signature                 identity of writer
  email                              spam or not
  sentence                           parse tree
  medical tests                      disease
  gene expression in tissue          cancerous or not
  phylogenetic profile of a gene     gene function
  web search query                   ranked list of pages
  friendship networks                communities

¹ Most from P. Bartlett’s slides.

2 / 183

Page 3:

• Model the relation as a joint probability distribution P on 𝒳 × 𝒴:

(X, Y) ∼ P

• Often we want to learn the conditional distribution P∗x := P(Y ∈ · | X = x) or some aspect of it, e.g. f∗(x) := E[Y | X = x].

• More generally, learn a map Q : 𝒳 → M(𝒴), where M(𝒴) is the set of probability distributions on 𝒴. This is useful if we want to associate uncertainty with our predictions.

• Often we settle for learning a function f : 𝒳 → 𝒴.

• Learning functions is easier than learning distributions. (Function spaces behave like infinite-dimensional vector spaces; it is easy to apply linear-algebraic techniques.)

• The less we can assume about P the better.

• Parametric models assume P ∈ {Pθ : θ ∈ Θ} for some Θ ⊂ Rd.

• Often interested in nonparametric setups.

3 / 183

Page 4:

Risk minimization

• What function are we seeking?

• Pick a loss function ℓ : 𝒴 × 𝒴 → R.

• ℓ(f(X), Y) measures how well we are doing on the observation (X, Y).

• Our goal is to recover (or approximate) the function minimizing the risk:

f∗F ∈ argmin_{f∈F} E[ℓ(f(X), Y)]

• Here F could be a pre-specified class of functions encoding our desiderata about the prediction function f (e.g., being Hölder continuous of a certain degree, etc.)

• Computing f∗F requires the unknown distribution P.

• Instead, we assume that we observe a training sample (a key difference from unsupervised approaches):

(Xi, Yi) iid∼ P, i = 1, . . . , n

Let Dn := {(X1, Y1), . . . , (Xn, Yn)} collect our training data.

4 / 183

Page 5:

• Instead, we assume that we observe a training sample (a key difference from unsupervised approaches):

(Xi, Yi) iid∼ P, i = 1, . . . , n

Let Dn := {(X1, Y1), . . . , (Xn, Yn)} collect our training data.

• We also assume (X ,Y ) ∼ P is independent of Dn.

• (X, Y) represents a future sample point (testing sample) where we observe only the X coordinate and want to predict its Y coordinate.

5 / 183

Page 6:

• Our goal is to recover (or approximate) the function minimizing the risk:

f∗F ∈ argmin_{f∈F} E[ℓ(f(X), Y)]

• Given the training data (Xi, Yi) iid∼ P, i = 1, . . . , n, the empirical distribution

Pn := (1/n) ∑_{i=1}^n δ_{(Xi, Yi)}

approximates P. In other words, Pn is a (consistent) estimate of P.

• Replace expectation w.r.t. unknown P with that w.r.t. known Pn:

fn ∈ argmin_{f∈F} Pn[ℓ(f(X), Y)] = argmin_{f∈F} (1/n) ∑_{i=1}^n ℓ(f(Xi), Yi)

(Here we are assuming (X ,Y ) ∼ Pn when computing the expectation.)

• This is the empirical risk minimization (ERM) idea.
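To make the ERM recipe concrete, here is a minimal Python sketch (illustrative, not from the slides) that minimizes the empirical risk by brute force over a small finite candidate class; the data, the loss, and the threshold class are placeholders.

```python
import numpy as np

def erm(candidates, loss, X, Y):
    """Return the candidate f minimizing the empirical risk (1/n) sum_i loss(f(X_i), Y_i)."""
    risks = [np.mean([loss(f(x), y) for x, y in zip(X, Y)]) for f in candidates]
    return candidates[int(np.argmin(risks))]

# Illustrative finite class: threshold classifiers x -> 1{x >= t} over a grid of t.
rng = np.random.default_rng(0)
X = rng.normal(size=200)
Y = (X + 0.5 * rng.normal(size=200) >= 0).astype(int)
F = [(lambda x, t=t: int(x >= t)) for t in np.linspace(-2, 2, 41)]

zero_one = lambda a, y: float(a != y)
f_hat = erm(F, zero_one, X, Y)
```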

6 / 183

Page 7:

• Target versus the estimate:

f∗F ∈ argmin_{f∈F} E[ℓ(f(X), Y)]

fn ∈ argmin_{f∈F} (1/n) ∑_{i=1}^n ℓ(f(Xi), Yi)

• fn can in principle be computed.

• Computation might be intractable due to the complexity of the loss function ℓ or of F. ⟹ Try to relax the loss.

• F might be too large for a sample of size n (overfitting). ⟹ Try to regularize, e.g., optimize over simpler classes Fn, often with the property that Fn ↑ F as n → ∞.

7 / 183

Page 8:

Optimal (Bayes) rule

• Assuming that F = all measurable functions from 𝒳 to 𝒴,

• the optimal rule is obtained as the minimizer of the posterior risk:

f∗(x) = argmin_{a∈𝒴} E[ℓ(a, Y) | X = x]

• Show this as an exercise.

8 / 183

Page 9:

Example 1 (Regression)

Assume 𝒴 = R, and consider the quadratic loss ℓ(a, y) = ½(a − y)².

• The optimal (unconstrained) Bayes rule is f∗(x) = E[Y | X = x].

• The ERM is obtained as

fn ∈ argmin_{f∈Fn} (1/2n) ∑_{i=1}^n (Yi − f(Xi))²

• The empirical loss is convex in f.

• Without the Fn restriction, many f achieve zero empirical risk (overfitting).

• Assume 𝒳 ⊂ Rd, and consider

Fn = Flin := {fθ,θ0 : θ ∈ Rd, θ0 ∈ R}

where fθ,θ0(x) = θT x + θ0.

• ERM is the usual least-squares regression (unique if the design is full rank).
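As a concrete instance of the regression ERM above, a minimal least-squares sketch (illustrative; the data-generating model is a placeholder):

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 100, 3
X = rng.normal(size=(n, d))
Y = X @ np.array([1.0, -2.0, 0.5]) + 1.0 + 0.1 * rng.normal(size=n)

# ERM over F_lin with squared loss: ordinary least squares on the augmented
# design [X, 1], so the intercept is absorbed into the parameter vector.
X_aug = np.hstack([X, np.ones((n, 1))])
theta_hat, *_ = np.linalg.lstsq(X_aug, Y, rcond=None)

f_hat = lambda x: np.append(x, 1.0) @ theta_hat  # fitted prediction rule
```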

9 / 183

Page 10:

Example 2 (Classification)

Assume 𝒴 = [K] := {1, 2, . . . , K} and consider the zero-one loss ℓ(a, y) = 1{a ≠ y}.

• The optimal (unconstrained) Bayes rule is f∗ : 𝒳 → [K],

f∗(x) = argmax_{k∈[K]} P(Y = k | X = x),

that is, the maximum a posteriori (MAP) rule.

• The ERM is obtained as

fn ∈ argmin_{f∈Fn} (1/n) ∑_{i=1}^n 1{Yi ≠ f(Xi)}

• Here the problem is intractable due to the combinatorial nature of the loss, even if we drop the constraint Fn or have a simple Fn.

10 / 183

Page 11:

Choice of the function class

• Linear:

• The notation x ↦ θT x denotes the function f(x) = θT x.

Flin(Θ) := {x ↦ θT x : θ ∈ Θ}, Θ ⊂ Rd

• Often we assume that the intercept is included for simplicity, i.e.,

x̃ = (x, 1), θ̃ = (θ, θ0), θ̃T x̃ = θT x + θ0

• The class of affine functions can thus be considered a linear class in the augmented feature vector.

11 / 183

Page 12:

• (Feed-forward) neural networks:

• A single neuron: η(x; θ, φ) = φ(θT x).

• Often multiple layers are cascaded together:

a_j^[ℓ](x) = η(a^[ℓ−1](x); θ_j^[ℓ], φ_j^[ℓ]), j ∈ [N_ℓ], ℓ ∈ [L]

• a_j^[ℓ](x) is the output (or activation) of neuron j in layer ℓ.

• a^[ℓ](x) = (a_1^[ℓ](x), . . . , a_{N_ℓ}^[ℓ](x)) is the vector of activations at layer ℓ.

• Layer 0 is the input: a^[0](x) = x and N_0 = d.

• Nonlinear parametric function families: say N_L = 1 and f(x; θ) = a_1^[L](x).

• Often the φ_j are fixed nonlinearities: ReLU, sigmoid, tanh, etc.

• All parameters: θ = (θ_j^[ℓ], j ∈ [N_ℓ], ℓ ∈ [L]).

• In a fully connected network: θ_j^[ℓ] ∈ R^{N_{ℓ−1}+1}.

• Total # of parameters: ∑_{ℓ=1}^L N_ℓ(N_{ℓ−1} + 1).

• Often too many parameters; need to introduce regularization (e.g., sparsity).

• E.g., if every layer has the same # of neurons as the input layer: ∼ L d².

12 / 183

Page 13:

• Function spaces as infinite-dimensional vector spaces:

• We can do basis expansions to represent functions.
• Let ψ1, ψ2, . . . be a basis for F.
• Any f ∈ F can be written as f = ∑_{j=1}^∞ θj ψj where θj ∈ R.

• We can consider the “truncated space”

FN = {f : f = ∑_{j=1}^N θj ψj, θj ∈ R}

• Can let N = Nn grow with the sample size.
• If F is a Hilbert space we can take the basis to be orthonormal (many choices).
• E.g., F = L2([0, 1]; Unif) with the trigonometric basis (Fourier expansion).
• E.g., F = L2(R; Gauss) with Hermite polynomials.

13 / 183

Page 14:

• Reproducing kernel Hilbert spaces (RKHS) of functions:

• We will talk about these at length.
• Very natural for machine learning applications where we observe discrete samples of a function.
• Related to the “kernel trick”.

• Ensemble approaches:

• Writing a function as a weighted average of weak learners;
• Some stochastic element (subsampling data) can be introduced in fitting the weak learners (e.g., random forests).

• Sieves: Finite families of functions FN = {f1, . . . , fN}.

14 / 183

Page 15:

General theoretical goal

• Let R(f) = E ℓ(f(X), Y).

• The excess risk is a measure of prediction performance:

E(f) = R(f) − R(f∗) = R(f) − min R(·)

• Ideally, we would like to show E(fn) → 0 as n → ∞. (Prediction consistency.)

• Ideally, characterize how fast it goes to zero.

• Generally rates in terms of sample size n and a notion of size(F ).

15 / 183

Page 16:

Binary classification

• Take 𝒴 = {0, 1} and consider the zero-one loss.

• The Bayes rule can be written as

f∗(x) = 1 if η(x) ≥ 1/2, and 0 if η(x) < 1/2,

where η(x) = E[Y | X = x] = P(Y = 1 | X = x) is the regression function.

• At points where η(x) = 1/2, the choice could be either 0 or 1.

• Optimal Bayes risk can be written as

R∗ = R(f ∗) = E|f ∗(X )− η(X )| (1)

i.e., the L1(PX ) distance between η and its thresholded version.

• Another expression:

R∗ = R(f∗) = E min{η(X), 1 − η(X)}

i.e., R∗ = ∫ min{η, 1 − η} dPX. (Exercise)

16 / 183

Page 17:

• To see (1) note that

P(Y ≠ f∗(X) | X) = P(Y = 0 | X) if f∗(X) = 1, and P(Y = 1 | X) if f∗(X) = 0
                = f∗(X) − η(X) if f∗(X) = 1, and η(X) − f∗(X) if f∗(X) = 0
                = |f∗(X) − η(X)|.

• Let us compute the excess risk.

Theorem 1

The excess risk for binary classification is:

E(f) = E[|2η(X) − 1| |f∗(X) − f(X)|] = E[|2η(X) − 1| 1{f∗(X) ≠ f(X)}]

that is, E(f) = ∫ |f∗ − f| |2η − 1| dPX = ∫ |f∗ − f| dQ = ‖f∗ − f‖L1(Q).

• Here, Q is the measure that has density |2η − 1| w.r.t. PX.

17 / 183

Page 18:

• For two binary variables Y ,Z , we have the following identities

1{Y ≠ Z} = |Y − Z| = (Y − Z)² = Y(1 − Z) + Z(1 − Y) = Y + Z − 2YZ

• Thus, for any f : 𝒳 → 𝒴 = {0, 1}, we have

1{Y ≠ f(X)} = Y + f(X) − 2Y f(X)

• Subtracting the similar identity for 1{Y ≠ f∗(X)}, we obtain

1{Y ≠ f(X)} − 1{Y ≠ f∗(X)} = (2Y − 1)(f∗(X) − f(X))  (2)

• Taking the expectation of both sides:

R(f) − R(f∗) = E[(2Y − 1)(f∗(X) − f(X))]
            (a)= E[(2η(X) − 1)(f∗(X) − f(X))]
            (b)= E[|2η(X) − 1| |f∗(X) − f(X)|]

Exercise: Justify steps (a) and (b).

18 / 183

Page 19:

Yet another expression

• From (2), we have

|1{Y ≠ f(X)} − 1{Y ≠ f∗(X)}| = |f∗(X) − f(X)|  (3)

• Hence, with ℓf(Z) := 1{Y ≠ f(X)} where Z = (X, Y), we have

E(f) = E[|2η(X) − 1| |ℓf(Z) − ℓf∗(Z)|]  (4)

19 / 183

Page 20:

Plugin classifier

• The form f∗(x) = 1{η(x) ≥ 1/2} suggests the plugin estimator

fη̂(x) = 1{η̂(x) ≥ 1/2}

where η̂ is an estimator of η.

Theorem 2

For any η̂ : 𝒳 → R, we have E(fη̂) ≤ 2 E|η̂(X) − η(X)|,

i.e., E(fη̂) ≤ 2‖η̂ − η‖L1(PX).

Proof:

• Use Theorem 1.

• f∗ ≠ fη̂ iff η and η̂ are on different sides of 1/2.

• In that case, |η − 1/2| ≤ |η − η̂|.
• Thus, 1{f∗ ≠ fη̂} |η − 1/2| ≤ |η − η̂|. Integrate w.r.t. PX.

20 / 183

Page 21:

• Having a good regression estimator η̂ gives us a good plugin classifier.

• It is enough for η̂ to be consistent for η in the L1(PX) norm (or L2(PX)):

E(fη̂) ≤ 2‖η̂ − η‖L1(PX) ≤ 2‖η̂ − η‖L2(PX)

(In a probability space, the L2 norm dominates the L1 norm: E|Z| ≤ √(E Z²).)

• However, having a consistent estimate of η is not necessary.

• The excess risk can be much smaller than 2‖η̂ − η‖L1(PX).

Example 3

• Assume η ∈ {0, 1}.
• Then, there is always an η̂ satisfying
  • η and η̂ are on the same side of 1/2,
  • |η̂(X) − η(X)| = (1 − ε)/2 a.s.
• We have E(fη̂) = 0 while 2‖η̂ − η‖L1(PX) = 1 − ε.

21 / 183

Page 22:

Plugin (parametric approach)

• We can directly model η(x) = P(Y = 1 | X = x) using a parametric model.

• Modeling η(x) directly is sometimes called the discriminative (as opposed to generative) approach.

• For example, in logistic regression, we assume

P(Y = 1 | X = x) = σ(θT x), σ(t) = 1/(1 + e^{−t})

• Can estimate θ based on a training sample, using e.g. the MLE θ̂.

• Then, η̂(x) = σ(θ̂T x) and the plugin rule will be

fη̂(x) = 1{η̂(x) ≥ 1/2} = 1{θ̂T x ≥ 0},  (5)

or x ↦ sign(θ̂T x) if we work with 𝒴 = {−1, 1}.
• The decision boundary in (5) is linear.

22 / 183

Page 23:

• With the logistic loss ℓlg(y, t) = −yt + log(1 + e^t), the MLE is

θ̂ ∈ argmin_{θ∈Rd} (1/n) ∑_{i=1}^n ℓlg(Yi, θT xi)

• Equivalently, with Flin := {x ↦ θT x : θ ∈ Rd},

η̂ ∈ argmin_{f∈Flin} (1/n) ∑_{i=1}^n ℓlg(Yi, f(Xi))

• I.e., logistic regression is ERM with a surrogate loss function (replacing 0-1 with the smooth logistic loss).

• The logistic loss t ↦ ℓlg(y, t) is convex.

• Can obtain the solution by Newton’s method (leads to iteratively reweighted least-squares).

• Can perform SGD which looks like a variant of the perceptron algorithm.
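A minimal sketch of logistic regression as ERM with the surrogate logistic loss, fit here by plain gradient descent (illustrative; the data and step size are placeholders):

```python
import numpy as np

def logistic_loss_grad(theta, X, Y):
    """Gradient of (1/n) sum_i [-Y_i * theta^T x_i + log(1 + exp(theta^T x_i))], Y_i in {0, 1}."""
    p = 1.0 / (1.0 + np.exp(-(X @ theta)))      # sigma(theta^T x_i)
    return X.T @ (p - Y) / len(Y)

rng = np.random.default_rng(2)
n, d = 500, 3
X = np.hstack([rng.normal(size=(n, d)), np.ones((n, 1))])   # augmented features
theta_true = np.array([2.0, -1.0, 0.5, 0.2])
Y = (rng.random(n) < 1.0 / (1.0 + np.exp(-X @ theta_true))).astype(float)

theta = np.zeros(d + 1)
for _ in range(2000):
    theta -= 0.5 * logistic_loss_grad(theta, X, Y)

predict = lambda x: int(x @ theta >= 0)         # plugin rule 1{theta^T x >= 0}
```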

23 / 183

Page 24:

Plugin (generative approach)

• Start with class-conditional densities pk(x) = p(x | Y = k),

• and class priors πk = P(Y = k), so that

P(Y = k | X = x) = πk pk(x) / ∑_{ℓ=1}^K πℓ pℓ(x).

• Can estimate pk and πk from training data.

• E.g., fit a kernel density estimate to {Xi : Yi = k} to get p̂k.

• π̂k = (1/n) ∑_{i=1}^n 1{Yi = k}.

• The optimal Bayes classification rule can be written as

f∗(x) = argmax_{k∈[K]} πk pk(x).

• Any collection of functions δk(x), k = 1, . . . , K, for which f∗(x) = argmax_{k∈[K]} δk(x) is called a set of discriminant functions.

24 / 183

Page 25:

• We could also postulate a parametric form for pk , e.g.

X | Y = k ∼ N(µk ,Σk) (6)

and estimate µk and Σk from training data.

• The Bayes rule in model (6) gives quadratic discriminant analysis (QDA):

f∗(x) = argmax_k { −½ (x − µk)T Σk^{−1} (x − µk) + ck },  (7)

where ck = −½ log|Σk| + log πk. (Chapter 4 of ESL or HW1)

• When Σk = Σ for all k, (7) simplifies to linear discriminant analysis (LDA):

f∗(x) = argmax_k (wkT x + bk),

where wk = Σ^{−1} µk and bk = −½ µkT Σ^{−1} µk + log πk.

• Decision boundaries are linear in LDA.

• In binary classification, the f∗ of LDA is a thresholded linear function, f∗(x) = 1{wT x + b ≥ 0}, or f∗(x) = sign(wT x + b).
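A minimal sketch of the LDA discriminants above, estimated from labeled data with a pooled covariance (illustrative; the data are placeholders):

```python
import numpy as np

def fit_lda(X, Y, K):
    """Estimate the LDA discriminants delta_k(x) = w_k^T x + b_k with
    w_k = Sigma^{-1} mu_k and b_k = -0.5 mu_k^T Sigma^{-1} mu_k + log pi_k."""
    n, d = X.shape
    pis, mus, Sigma = [], [], np.zeros((d, d))
    for k in range(K):
        Xk = X[Y == k]
        pis.append(len(Xk) / n)
        mus.append(Xk.mean(axis=0))
        Sigma += (Xk - mus[-1]).T @ (Xk - mus[-1])
    Sigma_inv = np.linalg.inv(Sigma / n)                 # pooled covariance estimate
    W = np.array([Sigma_inv @ mu for mu in mus])
    b = np.array([-0.5 * mu @ Sigma_inv @ mu + np.log(pi) for mu, pi in zip(mus, pis)])
    return lambda x: int(np.argmax(W @ x + b))

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(-1, 1, size=(100, 2)), rng.normal(+1, 1, size=(100, 2))])
Y = np.array([0] * 100 + [1] * 100)
classify = fit_lda(X, Y, K=2)
```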

25 / 183

Page 26:

Fisher LDA

• A method of dimension reduction geared towards classification.

• Project X onto one dimension: Xw := wTX .

• Classify the resulting scalar variable Xw .

• Choose w to maximize the ratio of between over within cluster variation:

w∗ = argmax_w var(E[Xw | Y]) / E var(Xw | Y)

• This leads to a generalized eigenvalue problem (see HW1).

• In binary classification, there is a closed form: w∗ ∝ Σ−1(µ1 − µ2).

• Equivalent formulation:

w∗ = argmax_w var(E[Xw | Y]) / var(Xw)

• Contrast with PCA, which maximizes w ↦ var(Xw). (No label information Y is used.)
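A minimal sketch of the binary-case closed form w∗ ∝ Σ^{−1}(µ1 − µ2), using the pooled within-class covariance (illustrative):

```python
import numpy as np

def fisher_direction(X, Y):
    """Fisher LDA direction for binary labels Y in {0, 1}."""
    X0, X1 = X[Y == 0], X[Y == 1]
    scatter = (X0 - X0.mean(0)).T @ (X0 - X0.mean(0)) + (X1 - X1.mean(0)).T @ (X1 - X1.mean(0))
    Sigma_within = scatter / (len(Y) - 2)                  # pooled within-class covariance
    w = np.linalg.solve(Sigma_within, X1.mean(0) - X0.mean(0))
    return w / np.linalg.norm(w)

rng = np.random.default_rng(4)
X = np.vstack([rng.normal(-1, 1, size=(100, 2)), rng.normal(+1, 1, size=(100, 2))])
Y = np.array([0] * 100 + [1] * 100)
w_star = fisher_direction(X, Y)    # project onto w_star, then threshold X_w to classify
```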

26 / 183

Page 27:

Stochastic gradient descent (SGD)

• A general stochastic optimization technique for minimizing θ ↦ E L(θ; Z).

• Assume that we have iid samples Z1, Z2, · · · ∼ Z.

• SGD has the following update

θt+1 = θt − α ∇L(θt; Zt)

• The usual (batch) gradient descent on the empirical loss θ ↦ (1/n) ∑_{i=1}^n L(θ; Zi) has the following update

θt+1 = θt − α (1/n) ∑_{i=1}^n ∇L(θt; Zi)

• In both cases the step we take is an unbiased estimate of the gradient of E L(θ; Z); for SGD the variance is much higher, while the computation is much cheaper.
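A minimal sketch contrasting the two updates (illustrative; the least-squares loss and data stream are placeholders):

```python
import numpy as np

def sgd(grad_L, theta0, Z_stream, alpha=0.05):
    """SGD: theta_{t+1} = theta_t - alpha * grad L(theta_t; Z_t), one sample per step."""
    theta = theta0.copy()
    for Z_t in Z_stream:
        theta -= alpha * grad_L(theta, Z_t)
    return theta

def batch_gd_step(grad_L, theta, Z_all, alpha=0.05):
    """One batch step: average the per-sample gradients over all n samples."""
    return theta - alpha * np.mean([grad_L(theta, Z_i) for Z_i in Z_all], axis=0)

# Example: L(theta; (x, y)) = 0.5 * (theta^T x - y)^2, so grad L = (theta^T x - y) * x.
grad_L = lambda theta, Z: (theta @ Z[0] - Z[1]) * Z[0]
rng = np.random.default_rng(5)
data = [(rng.normal(size=2), rng.normal()) for _ in range(1000)]
theta_sgd = sgd(grad_L, np.zeros(2), data)
```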

27 / 183

Page 28:

SGD for logistic regression and perceptron

• The SGD update for logistic regression (Yt ∈ {0, 1}):

θt+1 = θt + α(Yt − σ(θt^T Xt)) Xt

• Replacing σ(·) with the hard threshold 1{· ≥ 0} gives the perceptron algorithm:

θt+1 = θt + α(Yt − 1{θt^T Xt ≥ 0}) Xt

• Alternative form with Yt ∈ {−1, 1} (and α = 1):

• θ0 = 0, t = 0, and repeat while there are misclassified pairs:

• Pick i (if any) such that Yi ≠ sign(θt^T Xi).

• Update: θt+1 ← θt + Yi Xi.

• Increment t.

28 / 183

Page 29:

Perceptron

• Perceptron Algorithm:

θ0 = 0, t = 0, and repeat while there are misclassified pairs:

• Pick i (if any) such that Yi ≠ sign(θt^T Xi).

• Update: θt+1 ← θt + Yi Xi.

• Increment t.

• Data (xi, yi) ∈ Rd × {−1, 1}, i = 1, . . . , n, is linearly separable if there is a separating hyperplane:

∃θ ∈ Rd : yi(θT xi) > 0 for all i,

in which case we can define the margin:

γ = min_i yi θT xi / ‖θ‖  (8)

• θ need not be unique (even its direction).
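A minimal sketch of the perceptron above on linearly separable toy data (illustrative); by Theorem 3 below, the number of updates is at most R²/γ²:

```python
import numpy as np

def perceptron(X, Y, max_iters=10_000):
    """Repeatedly pick a misclassified (x_i, y_i) and update theta <- theta + y_i * x_i."""
    theta = np.zeros(X.shape[1])
    for t in range(max_iters):
        miscls = np.where(Y * (X @ theta) <= 0)[0]   # y_i != sign(theta^T x_i)
        if len(miscls) == 0:
            return theta, t                          # data separated after t updates
        theta += Y[miscls[0]] * X[miscls[0]]
    return theta, max_iters

# Linearly separable toy data (labels defined by the hyperplane x1 + x2 = 0).
rng = np.random.default_rng(6)
X = rng.normal(size=(200, 2)) + 2.0 * rng.choice([-1.0, 1.0], size=(200, 1))
Y = np.sign(X[:, 0] + X[:, 1])
theta_hat, n_updates = perceptron(X, Y)
```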

29 / 183

Page 30:

30 / 183

Page 31:

31 / 183

Page 32:

Perceptron

• Assuming ‖θ‖ = 1, the margin is γ = min_i yi θT xi.

Theorem 3

For any linearly separable data {(xi, yi)} ⊂ Rd × {−1, 1} with margin γ as in (8) and radius R = max_i ‖xi‖, the perceptron algorithm terminates in

at most R²/γ² steps

and returns a separating hyperplane (for any choices made in the updates).

Assuming linear separation:

• The perceptron achieves zero empirical risk for the 0-1 loss.

• It is an exact algorithm for solving ERM with 0-1 loss over the thresholded linear function class:

min_{f∈Ft-lin} (1/n) ∑_{i=1}^n 1{Yi ≠ f(Xi)}, Ft-lin := {x ↦ sign(θT x) : θ ∈ Rd}

32 / 183

Page 33:

Proof

• WLOG, assume that the “true” θ is normalized: ‖θ‖ = 1.

• Idea of the proof: ⟨θ, θt⟩ ≳ t while ‖θt‖ ≲ √t.

• Recalling θt+1 = θt + yi xi and the definition of the margin:

⟨θt+1, θ⟩ − ⟨θt, θ⟩ = yi⟨xi, θ⟩ ≥ γ

• Hence, ⟨θt, θ⟩ ≥ γt since θ0 = 0.

• On the other hand,

‖θt+1‖² = ‖θt‖² + ‖xi‖² + 2yi⟨θt, xi⟩ ≤ ‖θt‖² + R²,

since the picked i is misclassified, so yi⟨θt, xi⟩ ≤ 0.

• It follows that ‖θt‖² ≤ tR².

• Cauchy-Schwarz gives:

γt ≤ ⟨θ, θt⟩ ≤ ‖θt‖ ≤ √t R,

hence t ≤ R²/γ².

• Exercise: Show that if we start with θ0 ≠ 0, we have the bound √t ≤ (R + √(R² + 4bγ))/(2γ) ≤ R/γ + √(b/γ), where b = ‖θ0‖ − ⟨θ0, θ⟩.

33 / 183

Page 34:

Perceptron, general inner product

• Can write θt as a (weighted) combination of the data points:

θt = ∑_{j=1}^n αj^(t) xj, ‖α^(t)‖1 = ∑_j |αj^(t)| = t

• Originally all the weights are zero; the perceptron gradually updates them.

• Perceptron Algorithm:

α^(0) = 0, t = 0, and repeat while there are misclassified pairs:

• Pick i (if any) such that yi ≠ yi^(t) where

yi^(t) := sign( ∑_{j=1}^n αj^(t) ⟨xj, xi⟩ )

• Update: αi^(t+1) := αi^(t) + yi. Increment t.

• The perceptron (and Theorem 3) works for a general inner product ⟨xi, xj⟩.
• Can even replace the inner product with a kernel K(xi, xj). [Linear separation in feature space.]
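A minimal sketch of this dual (kernelized) perceptron, with an illustrative RBF kernel standing in for K(xi, xj); it runs until the data are separated in feature space or max_iters is reached:

```python
import numpy as np

def kernel_perceptron(X, Y, kernel, max_iters=10_000):
    """Dual perceptron: predict sign(sum_j alpha_j K(x_j, x_i)); on a mistake at i, alpha_i += y_i."""
    n = len(Y)
    K = np.array([[kernel(X[i], X[j]) for j in range(n)] for i in range(n)])
    alpha = np.zeros(n)
    for _ in range(max_iters):
        miscls = np.where(Y * np.sign(K @ alpha) <= 0)[0]
        if len(miscls) == 0:
            break
        alpha[miscls[0]] += Y[miscls[0]]
    return alpha

rbf = lambda x, z: np.exp(-np.sum((x - z) ** 2))    # illustrative kernel choice

rng = np.random.default_rng(7)
X = rng.normal(size=(60, 2))
Y = np.where(X[:, 0] * X[:, 1] > 0, 1.0, -1.0)      # not linearly separable in the original space
alpha = kernel_perceptron(X, Y, rbf)
```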

34 / 183

Page 35:

A bit of notational overhead

• The (excess) risk also depends on P, the unknown distribution of (X, Y):

E(f; P) = R(f; P) − inf_f R(f; P) = R(f; P) − R∗(P)

which itself can be thought of as a loss between f and P.

35 / 183

Page 36:

A bit of notational overhead

• Let Z^n = (Zi, i = 1, . . . , n) ∈ 𝒵^n be the training data, where

Zi = (Xi, Yi) ∈ 𝒵 := 𝒳 × {−1, 1} ⊂ Rd × {−1, 1}.

• A classification rule or algorithm is specified by a collection of functions fn : 𝒳 × 𝒵^n → {−1, 1} for n = 1, 2, . . . .

• A trained classifier is a random function x ↦ fn(x; Z^n), with randomness due to the training data Z^n.

• We also write fn = fn(· ; Z^n) or fn(x) = fn(x; Z^n) for simplicity.

• Notational overload: fn : 𝒳 → {−1, 1} is a random function, or fn : 𝒳 × 𝒵^n → {−1, 1} is a deterministic function, depending on the context.

36 / 183

Page 37:

A bit of notational overhead

• The excess risk of fn is the random quantity E(fn;P).

• We can look at E[E(fn; P)], which is a form of risk itself (in the decision-theoretic sense). Here the expectation is w.r.t. the randomness in Z^n.

• Let us call E[E(fn; P)] the meta excess risk. (Non-standard terminology.)

• The meta risk measures the quality of the classification rule fn : 𝒳 × 𝒵^n → {−1, 1} relative to the probability distribution P on 𝒵.

37 / 183

Page 38:

A bit of notational overhead

• We can also write E_{Z^n∼P^n}[E(fn; P)] or just E_{P^n}[E(fn; P)] to emphasize this.

• Consider binary classification with 0-1 loss:

R(f; P) = E_P 1{f(X) ≠ Y}

where E_P or E_{Z∼P} is to emphasize that the expectation is w.r.t. Z ∼ P.

• The meta risk is

E[R(fn; P)] = E_{P^n} E_P[1{fn(X; Z^n) ≠ Y}] = E_{P^{n+1}}[1{fn(X; Z^n) ≠ Y}] = P_{P^{n+1}}(fn(X) ≠ Y)

where E_{P^{n+1}} or E_{(Z,Z^n)∼P^{n+1}} is to emphasize the underlying distribution.

• We can simply write E[R(fn; P)] = P(fn(X) ≠ Y).

• The meta risk is the probability of misclassification taking into account all the sources of randomness. (Can think of it as the limit of “ideal” leave-one-out cross-validation on N independent batches of size n + 1 as N → ∞.)

38 / 183

Page 39:

Perceptron excess risk

• Consider a randomized version (somewhat like a regularization):

• Let fm be the classifier returned by the perceptron when trained on Z^m = (Z1, . . . , Zm), for m ≤ n.

• Let f̄n be chosen uniformly at random from {f1, f2, . . . , fn}.

Theorem 4

Assume that P is such that ∃θ ∈ S^{d−1} and γ > 0 such that

‖X‖ ≤ R, and (θT X)Y ≥ γ a.s.

Then, the excess risk of f̄n (for 0-1 loss) is

E[E(f̄n, P)] = E[R(f̄n, P)] ≤ R²/(nγ²)

(The equality is since R∗(P) = 0 in this case.)

39 / 183

Page 40:

Proof

• Write R(f ) = R(f ,P) for simplicity.

E R(f̄n) = (1/n) ∑_{m=1}^n E R(fm)

        = (1/n) ∑_{m=1}^n P(fm(X) ≠ Y) = (1/n) ∑_{m=1}^n P(fm(Xm+1) ≠ Ym+1)

• On the other hand,

∑_{m=1}^n 1{fm(Xm+1) ≠ Ym+1} ≤ R²/γ²

since the LHS is the number of misclassifications in a single run of the perceptron algorithm on the data Z^{n+1}, with the convention that we choose the smallest index among misclassified examples at each update.

40 / 183

Page 41:

A minimax lower bound

• Assume the 0-1 loss and consider the following class of probability distributions on 𝒵:

𝒫 := 𝒫(𝒵) := {P : inf_{f∈Ft-lin} R(f, P) = 0}

• and the class of all classification rules based on a sample of size n:

Cn := Cn(𝒳) := {fn | fn : 𝒳 × 𝒵^n → {−1, 1}}

(Recall 𝒵 = 𝒳 × {−1, 1}.)

Theorem 5

Assume 𝒳 ⊂ Rd has non-empty interior. Then,

inf_{fn∈Cn} sup_{P∈𝒫} E[E(fn, P)] ≥ ((min(n, d) − 1)/(2n)) (1 − 1/n)^n

• For any classification rule fn : 𝒳 × 𝒵^n → {−1, 1}, there exists a probability distribution P ∈ 𝒫(𝒵) such that E[E(fn, P)] ≥ · · · .

41 / 183

Page 42:

Proof

• We construct a random probability distribution P on 𝒵 and show that

inf_{fn∈Cn} E[E(fn, P)] ≥ ((min(n, d) − 1)/(2n)) (1 − 1/n)^n

where E is also over the randomness of P.

• The result follows since the maximum risk is lower bounded by the average (Bayes) risk. (Also called the probabilistic approach.)

• In fact, we pick P uniformly at random from 𝒫0 ⊂ 𝒫, so we have the stronger result:

inf_{fn∈Cn} sup_{P∈𝒫0} E[E(fn, P)] ≥ inf_{fn∈Cn} E[E(fn, P)]

42 / 183

Page 43:

Proof (constructing P)

• Pick linearly independent {v1, . . . , vd} ⊂ 𝒳 and define:

Pb = ∑_{j=1}^d wj δ_{(vj, bj)},   wj = ε/(d − 1) for j = 1, . . . , d − 1, and wd = 1 − ε,

where b = (b1, . . . , bd) ∈ {−1, 1}^d.

• For any Pb the Bayes risk over the t-lin. class is zero: for any b there exists a linear function fb such that fb(vj) = bj for all j = 1, . . . , d.

• The key is that in a draw from Pb we mostly see (vd, bd).

• Let B = (B1, . . . , Bd) ∼ Unif({−1, 1}^d) and define P = PB.

• That is, we draw the data as

B ∼ Unif({−1, 1}^d)

(X, Y), (X1, Y1), . . . , (Xn, Yn) | B = b iid∼ Pb,

• In other words, Z, Z1, . . . , Zn | B = b ∼ Pb.

43 / 183

Page 44:

• Let V(X^n) = {v1, . . . , vd−1} \ {X1, . . . , Xn}, i.e., the unseen “light” examples.

• Let J = J(X^n) be the indices of the elements of V(X^n). The key is

BJ | |J| = k ∼ Unif({−1, 1}^k)

A bit more precisely:

BJ | Z^n = z^n, |J| = k ∼ Unif({−1, 1}^k)

B_{J^c} | Z^n = z^n ∼ deterministic

(Exercise: Argue this and fill in the details.)

44 / 183

Page 45:

• Recall that E E(fn, P) = E R(fn, P) = P(fn(X) ≠ Y), and

P(fn(X) ≠ Y | B, Z^n = z^n) = ∑_j wj 1{fn(vj, z^n) ≠ Bj}

                            ≥ ∑_{j∈J(x^n)} wj 1{fn(vj, z^n) ≠ Bj}

                            = (ε/(d − 1)) ∑_{j∈J(x^n)} 1{fn(vj, z^n) ≠ Bj}

• Taking the expectation w.r.t. the conditional distribution of B given Z^n = z^n:

P(fn(X) ≠ Y | Z^n = z^n) ≥ (ε/(d − 1)) ∑_{j∈J(x^n)} E[1{fn(vj, z^n) ≠ Bj} | Z^n = z^n]

                         = (ε/(d − 1)) ∑_{j∈J(x^n)} P(fn(vj, z^n) ≠ Bj | Z^n = z^n)

                         = (ε/(d − 1)) ∑_{j∈J(x^n)} 1/2

                         = (1/2) · (ε/(d − 1)) · |J(x^n)|.

45 / 183

Page 46:

• In other words, the performance of the decision rule on BJ (the labels of unseen examples from the light pairs) is equivalent to random guessing.

• Thus, we have shown

P(fn(X) ≠ Y | Z^n) ≥ (ε/(2(d − 1))) |J(X^n)|.  (9)

• Using smoothing (since |J| is a function of Z^n) it follows that

P(fn(X) ≠ Y | |J|) ≥ (ε/(2(d − 1))) |J|.

• Either from this or from (9) directly, by smoothing,

P(fn(X) ≠ Y) ≥ (ε/(2(d − 1))) E|J|

46 / 183

Page 47:

• But we have

E|J| = ∑_{j=1}^{d−1} P(vj ∉ {X1, . . . , Xn}) = (d − 1)(1 − ε/(d − 1))^n

• Putting the pieces together,

E E(fn, P) ≥ (ε/2)(1 − ε/(d − 1))^n

• Choose ε = (min(n, d) − 1)/n and note that

(min(n, d) − 1)/(n(d − 1)) ≤ 1/n,

so that (1 − ε/(d − 1))^n ≥ (1 − 1/n)^n, which gives the bound in Theorem 5.

47 / 183

Page 48:

Fast rates under (strong) margin condition

• Recall that η(x) = P(Y = 1 | X = x) and f∗ is the optimal Bayes classifier.

Theorem 6

Assume that |2η(X) − 1| ≥ γ almost surely, for some γ > 0. Let fn be the ERM with zero-one loss over a finite function class F, and f∗ ∈ F.

Then, with probability at least 1 − δ,

E(fn) ≤ Cγ log(|F|/δ)/n,

where Cγ > 0 is a constant depending only on γ.

• Proof uses concentration inequalities.

48 / 183

Page 49:

• An example of a concentration inequality

Theorem 7 (Bernstein inequality)

Let Xi, i = 1, . . . , n be independent, centered, bounded variables, |Xi| ≤ b (a.s.), and let Sn = ∑_{i=1}^n Xi and v = ∑_{i=1}^n E Xi². Then,

P(Sn ≥ t) ≤ exp(− t²/(2v + 2bt/3)), ∀t ≥ 0.

• Note that v = var(Sn).

• The same bound holds for P(Sn ≤ −t).

• Thus P(|Sn| ≥ t) ≤ twice the bound.

• If E[Xi] = µi while |Xi − µi| ≤ b a.s., we apply the result to Xi − µi.

• Letting µ = E[Sn] = ∑_i µi, we get

P(Sn − µ ≥ t) ≤ exp(− t²/(2v + 2bt/3)), ∀t ≥ 0,

where v = var(Sn) = ∑_i E(Xi − µi)².

49 / 183

Page 50:

• An example of a concentration inequality

Theorem 8 (Bernstein inequality)

Let Xi, i = 1, . . . , n be independent, centered, bounded variables, |Xi| ≤ b (a.s.), and let Sn = ∑_{i=1}^n Xi and v = ∑_{i=1}^n E Xi². Then,

P(Sn ≥ t) ≤ exp(− t²/(2v + 2bt/3)), ∀t ≥ 0.

• Note that v = var(Sn).

• Same bound holds for P(Sn ≤ −t).

• Thus P(|Sn| ≥ t) ≤ twice the bound.

• If E[Xi ] = µi while |Xi − µi | ≤ b a.s., we apply the result to Xi − µi .

• Letting µ = E[X̄n] = (1/n) ∑_i µi, we get

P(X̄n − µ ≥ t) ≤ exp(− nt²/(2v + 2bt/3)), ∀t ≥ 0,

where v = (1/n) ∑_i E(Xi − µi)² is the average of the variances (so that var(X̄n) = v/n).
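A quick Monte Carlo sanity check of this averaged form of the bound (illustrative; bounded centered variables uniform on [−1, 1], so b = 1 and the average variance is 1/3):

```python
import numpy as np

rng = np.random.default_rng(8)
n, reps, t, b, v = 200, 20_000, 0.1, 1.0, 1.0 / 3.0

X = rng.uniform(-1.0, 1.0, size=(reps, n))           # centered, |X_i| <= b
empirical_tail = np.mean(X.mean(axis=1) >= t)        # P(X_bar_n - mu >= t), mu = 0
bernstein_bound = np.exp(-n * t**2 / (2 * v + 2 * b * t / 3))

print(empirical_tail, bernstein_bound)               # the empirical tail sits below the bound
```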

50 / 183

Page 51:

Fast rates

• Taking t = βv + ε, for β, ε > 0. Then, v ≤ t/β, and

P(X̄n − µ ≥ t) ≤ exp(− nt/(2/β + 2b/3)) ≤ exp(−cnt) ≤ exp(−cnε)

• That is, with c = (2/β + 2b/3)^{−1},

P(X̄n ≥ µ + βv + ε) ≤ exp(−cnε), ∀ε > 0,

and similarly,

P(X̄n ≤ µ − βv − ε) ≤ exp(−cnε), ∀ε > 0.

• Thus,

X̄n ∈ (µ − βv, µ + βv) + [−ε, ε]

for ε = O(log(1/δ)/n) w.p. ≥ 1 − 2δ.

51 / 183

Page 52:

Fast rates (variance bounded by mean)

• Assume that v ≤ C µ for C > 0. (Note that this requires µ ≥ 0.)

• Let β = α/C , for some α > 0, so that βv ≤ αµ.

• It follows that, with c = (2C/α + 2b/3)^{−1},

P(X̄n ≥ (1 + α)µ + ε) ≤ exp(−cnε), ∀ε > 0,  (10)

P(X̄n ≤ (1 − α)µ − ε) ≤ exp(−cnε), ∀ε > 0.  (11)

• Thus,

X̄n ∈ ((1 − α)µ, (1 + α)µ) + [−ε, ε]

for ε = O(log(1/δ)/n) w.p. ≥ 1 − 2δ.

• Contrast with the general bound

X̄n ∈ µ + [−ε, ε]

for ε = O(√(v log(1/δ)/n)) w.p. ≥ 1 − 2δ, for large n.

52 / 183

Page 53:

Proof of Theorem 6

• Let ℓf(z) = 1{f(x) ≠ y} for z = (x, y).

• As f varies in some class F, it carves out the loss class {ℓf : f ∈ F}.
• We have E(f) = E[ℓf(Z) − ℓf∗(Z)], and the empirical excess risk is

Ê(f) := Rn(f) − Rn(f∗) = (1/n) ∑_{i=1}^n [ℓf(Zi) − ℓf∗(Zi)]

• Note that E Ê(f) = E(f).

• To simplify further, let ∆f(z) := ℓf(z) − ℓf∗(z). Then,

E(f) = E[∆f(Z)], Ê(f) = (1/n) ∑_{i=1}^n ∆f(Zi)

• ∆f(Zi) ∈ {−1, 0, 1} but E ∆f(Zi) = E(f) ≥ 0.

53 / 183

Page 54:

• Recall from (4) that E(f) = E[|∆f(Z)| |2η(X) − 1|].
• Under the margin condition |2η(X) − 1| ≥ γ a.s., we have

E(f) ≥ γ E|∆f(Z)| ≥ γ E[∆f(Z)²] ≥ γ var(∆f(Z)),

and note that E(f) = E[∆f(Z)].

• In our previous notation, v ≤ γ^{−1} µ. Apply (11):

P(Ê(f) ≤ (1 − α)E(f) − ε) ≤ exp(−cnε), ε > 0,

where c = (2/(αγ) + 2/3)^{−1}.

• Union bound:

P(∃f ∈ F : Ê(f) ≤ (1 − α)E(f) − ε) ≤ |F| exp(−cnε), ε > 0

• Taking ε = log(|F|/δ)/(cn), w.p. ≥ 1 − δ,

Ê(f) ≥ (1 − α)E(f) − log(|F|/δ)/(cn), ∀f ∈ F

54 / 183

Page 55:

• Rearranging, with c′ = (1 − α)c,

E(f) ≤ (1/(1 − α)) Ê(f) + log(|F|/δ)/(c′n), ∀f ∈ F

w.p. ≥ 1 − δ.

• Applying this to fn and noting that Ê(fn) ≤ 0 (why?),

E(fn) ≤ log(|F|/δ)/(c′n)

w.p. ≥ 1 − δ. This is the desired result.

55 / 183

Page 56:

General (slow) rates

• Let EF(·) be the excess risk over the function class F, i.e.,

EF(f) = R(f) − inf_{f∈F} R(f).

Theorem 9

Let fn be the ERM with zero-one loss over a finite function class F .

Then, with probability at least 1− 2δ,

EF(fn) ≤ C √(log(|F|/δ)/n),

where C is a universal constant.

• Proof uses concentration inequalities.

• Theorem 9 holds for any “bounded” loss function (with the same proof).

56 / 183

Page 57:

• An example of a concentration inequality

Theorem 10 (Hoeffding inequality)

Let Xi, i = 1, . . . , n be independent random variables with Xi ∈ [ai, bi] a.s., and let Sn = ∑_{i=1}^n Xi. Then,

P(Sn − E Sn ≥ t) ≤ exp(− 2t²/∑_i(bi − ai)²), ∀t ≥ 0.

• Same bound holds for the lower tail.

• If E[Xi ] = µi while |Xi − µi | ≤ b a.s., we apply the result to Xi − µi .

• Letting µ = E[X̄n] = (1/n) ∑_i µi, we get

P(X̄n − µ ≥ t) ≤ exp(− nt²/(2b²)), ∀t ≥ 0.

57 / 183

Page 58:

Proof of Theorem 9

• Let f = fn be the ERM over F .

• Let f∗F = argmin_{f∈F} R(f). We have

EF(f) := R(f) − R(f∗F)
       ≤ R(f) − Rn(f) + Rn(f∗F) − R(f∗F)
       ≤ 2 sup_{f∈F} |Rn(f) − R(f)| =: 2‖Rn − R‖F

(The first inequality uses Rn(f) ≤ Rn(f∗F), since f is the ERM.)

• Recall that

Rn(f) − R(f) = (1/n) ∑_{i=1}^n [ℓf(Zi) − E ℓf(Zi)]

• For the 0-1 loss: ℓf(Zi) = 1{f(Xi) ≠ Yi}.

58 / 183

Page 59:

• We have |ℓf(Zi) − E ℓf(Zi)| ≤ 1, hence

P(|Rn(f) − R(f)| ≥ t) ≤ 2 exp(−nt²/2).

• Apply the union bound:

P(sup_{f∈F} |Rn(f) − R(f)| ≥ t) ≤ 2|F| exp(−nt²/2).

• Take t = √(2 log(|F|/δ)/n); then w.p. ≥ 1 − 2δ,

‖Rn − R‖F ≤ √(2 log(|F|/δ)/n).

• Hence, with the same probability,

EF(f) ≤ 2√(2 log(|F|/δ)/n).

59 / 183

Page 60:

How to get around finiteness?

• Use discretization; leads to ε-net and metric entropy ideas.

• Use empirical process theory: reduce, by symmetrization and concentration, to bounding Rademacher averages. In this case, the size of the trace of ℓF on Z1, . . . , Zn matters, not the cardinality of F itself, i.e., the size of

ℓF(Z^n) := {(ℓf(Z1), . . . , ℓf(Zn)) : f ∈ F}.

Here ℓF = {ℓf : f ∈ F} is the loss function class.

60 / 183

Page 61:

ε-net (or ε-cover) and metric entropy

(a) is a δ-covering and (b) a δ-packing.²

² Picture from the HDS book.

61 / 183

Page 62:

General (slow) rate: via metric entropy

• Let N = N(ε) = N(ε; F, ‖ · ‖∞).

• Assume that 𝒩 = {f1, . . . , fN} is an ε-cover of F in sup-norm:

∀f ∈ F, ∃fj ∈ 𝒩 s.t. ‖f − fj‖∞ ≤ ε.

• log N(ε) is called the metric entropy.

Theorem 11

Let fn be the ERM with zero-one loss over a function class F, and let N(ε) = N(ε; F, ‖ · ‖∞) be the ε-covering number of F in sup-norm.

Then, with probability at least 1 − 2δ,

EF(fn) ≲ √(log(N(ε)/δ)/n) + ε,

• a ≲ b means: there is a universal constant C such that a ≤ Cb.

• Choose ε to balance the two terms, i.e., solve

log N(ε) ≍ nε²

62 / 183

Page 63:

Parametric example

• We have N(ε; B_2^d, ‖ · ‖2) ≤ (1 + 2/ε)^d.

• Or N(ε; B_2^d, ‖ · ‖2) ≤ (3/ε)^d if ε ≤ 1.

• Consider the parametric function class

F = {x ↦ σ(θT x) : θ ∈ B_2^d}

where σ : R → R is some L-Lipschitz function.

• Assuming that 𝒳 ⊂ r B_2^d, i.e., x ∈ 𝒳 ⟹ ‖x‖2 ≤ r,

N(Lrε; F, ‖ · ‖∞) ≤ N(ε; B_2^d, ‖ · ‖2).

• Assuming that ε ≤ Lr, log N(ε; F, ‖ · ‖∞) ≤ d log(3Lr/ε).

• It follows that w.p. ≥ 1 − 2δ,

EF(fn) ≲ √((d log(3Lr/ε) + log(1/δ))/n) + ε,

63 / 183

Page 64:

• It follows that w.p. ≥ 1 − 2δ,

EF(fn) ≲ √((d log(3Lr/ε) + log(1/δ))/n) + ε,

• Take ε = Lr(n/d)^{−1/2} and δ = n^{−d}, to get, w.p. ≥ 1 − 2n^{−d},

EF(fn) ≤ C_{L,r} √(d log(n/d)/n),

assuming d ≤ n.

• In general, the log factor can be removed with a more careful argument (e.g., chaining).

• Note that we could have taken ε ≍ (n/d)^{−α} for any α ≥ 1/2.

64 / 183

Page 65:

Proof of Theorem 11

• ℓf(z) = 1{y ≠ f(x)}.
• Recall from (3) that for any f, g ∈ F, |ℓf(z) − ℓg(z)| = |f(x) − g(x)|.
• Let R̄n(f) = Rn(f) − R(f) be the centered empirical risk. Then

|R̄n(f) − R̄n(g)| ≤ (1/n) ∑_{i=1}^n [ |ℓf(Zi) − ℓg(Zi)| + E|ℓf(Zi) − ℓg(Zi)| ]

               = (1/n) ∑_{i=1}^n [ |f(Xi) − g(Xi)| + E|f(Xi) − g(Xi)| ]

               ≤ 2ε,

assuming ‖f − g‖∞ ≤ ε. Hence,

sup_{f∈F} |R̄n(f)| ≤ sup_{g∈𝒩} |R̄n(g)| + 2ε,

or in short, ‖R̄n‖F ≤ ‖R̄n‖𝒩 + 2ε.

65 / 183

Page 66:

• Using our previous results: ‖R̄n‖𝒩 ≤ √(2 log(|𝒩|/δ)/n) w.p. ≥ 1 − 2δ.

• Recalling that |𝒩| = N(ε), we have, w.p. ≥ 1 − 2δ,

EF(fn) ≤ 2‖R̄n‖F ≤ 2√(2 log(N(ε)/δ)/n) + 4ε.

66 / 183

Page 67:

Empirical process approach

• Define the Rademacher complexity of the function class F as

ℛn(F) = E_{X,ε}[ sup_{f∈F} |(1/n) ∑_{i=1}^n εi f(Xi)| ].  (12)

• εi iid∼ Unif({−1, 1}), independent of X = (X1, . . . , Xn).

Theorem 12

For any b-uniformly bounded class of functions F, we have

‖Pn − P‖F ≤ 2ℛn(F) + δ

with probability at least 1 − exp(−nδ²/(2b²)).

Proof sketch:

• Use the bounded difference inequality to show that ‖Pn − P‖F is concentrated around E‖Pn − P‖F.

• Use a symmetrization argument to upper bound E‖Pn − P‖F by the Rademacher complexity.

67 / 183

Page 68:

Bounded difference inequality

Theorem 13

Suppose that f : 𝒳^n → R satisfies the bounded difference property with constants {ci}, and let X = (X1, . . . , Xn) have independent coordinates. Then,

P(|f(X) − E f(X)| ≥ t) ≤ 2 exp(− 2t²/∑_{i=1}^n ci²)

• Bounded difference property:

|f(x1, . . . , xi, . . . , xn) − f(x1, . . . , x′i, . . . , xn)| ≤ ci,

for all choices of x1, . . . , xn and x′i.

68 / 183

Page 69:

• We apply these types of results to the loss function class

ℓF = {ℓf : f ∈ F}

where ℓf(z) = ℓ(f(x), y).

• Recall that R̄n(f) := Rn(f) − R(f) is the centered empirical risk.

Corollary 1

For any b-uniformly bounded loss function class ℓF, we have

‖R̄n‖F ≤ 2ℛn(ℓF) + δ

with probability at least 1 − exp(−nδ²/(2b²)).

• It is thus enough to bound the Rademacher complexity of the loss class:

ℛn(ℓF) = E_{Z,ε}[ sup_{f∈F} |(1/n) ∑_{i=1}^n εi ℓf(Zi)| ]

69 / 183

Page 70:

• Let us write ℛn(F | X_1^n) for the empirical Rademacher complexity:

ℛn(F | X_1^n) := E_ε[ sup_{f∈F} |(1/n) ∑_{i=1}^n εi f(Xi)| ]

• Alternatively, let F(X_1^n) = {(f(X1), . . . , f(Xn)) : f ∈ F}; then

ℛn(F | X_1^n) = E_ε[ sup_{a∈F(X_1^n)} |(1/n) ∑_{i=1}^n εi ai| ]

• From now on, let F be a class of binary functions.

• Then, |F(X_1^n)| is at most 2^n. (Reduction to the finite case.)

• We say that F shatters X_1^n if |F(X_1^n)| = 2^n.

• In interesting cases, |F(X_1^n)| is much smaller than 2^n.
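A Monte Carlo sketch (illustrative) of the empirical Rademacher complexity over a finite trace F(X_1^n), here for the class of threshold classifiers:

```python
import numpy as np

def empirical_rademacher(A, n_draws=5000, rng=None):
    """Estimate E_eps sup_{a in A} |(1/n) sum_i eps_i a_i| for the rows a of an (m, n) array A."""
    rng = rng or np.random.default_rng(0)
    n = A.shape[1]
    eps = rng.choice([-1.0, 1.0], size=(n_draws, n))    # Rademacher signs
    return np.abs(eps @ A.T / n).max(axis=1).mean()     # average of the suprema over A

rng = np.random.default_rng(9)
X = np.sort(rng.normal(size=50))
# Trace of F = {x -> 1{x >= t}} on X: at most n + 1 distinct label vectors.
A = np.unique(np.array([(X >= t).astype(float) for t in np.r_[-np.inf, X]]), axis=0)
print(empirical_rademacher(A, rng=rng))
```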

70 / 183

Page 71:

Bounding ℛn(F | X)

• The problem reduces to bounding

ℛ(A) := E sup_{a∈A} |(1/n) ∑_{i=1}^n εi ai|

Lemma 1

For any A ⊂ Rn, with D²(A) = sup_{a∈A} (1/n)‖a‖2²,

E sup_{a∈A} (1/n) ∑_{i=1}^n εi ai ≤ √(2 D²(A) log|A| / n)

ℛ(A) := E sup_{a∈A} |(1/n) ∑_{i=1}^n εi ai| ≤ √(2 D²(A) log(2|A|) / n)

• Second one follows from the first by applying it to A ∪ −A.

• The first: maximum of sub-Gaussian variables.

71 / 183

Page 72:

Theorem 14

Assume {Xi}_{i=1}^n are zero-mean RVs, sub-Gaussian with parameter σ. Then,

E[ max_{i=1,...,n} Xi ] ≤ √(2σ² log n), ∀n ≥ 1

• Proof of Theorem 14: Jensen’s inequality applied to e^{λZ} where Z = max_i Xi.

• Proof of Lemma 1:

• Rademacher variables εi are sub-Gaussian with σi = 1.

• Hence, (1/n) ∑_i εi ai is sub-Gaussian with squared parameter

σa² = (1/n²) ∑_i ai² = ‖a‖2²/n² ≤ D²(A)/n.

72 / 183

Page 73:

• Let us define³

mF(n) := max{|F(x_1^n)| : x1, . . . , xn ∈ 𝒳},

that is, the maximum size of the trace of F on any n points from 𝒳.

• From the result just proved,

ℛn(F | X_1^n) ≤ √(2 D²_F log mF(n) / n),

where

D²_F = D²(F(X_1^n)) = sup_{f∈F} (1/n) ∑_{i=1}^n f²(Xi)  (13)

³ Another notation: |F(x_1^n)| is also written as ∆F(A) where A = {x1, . . . , xn}.

73 / 183

Page 74:

VC dimension

• Let us define

mF(n) := max{|F(x_1^n)| : x1, . . . , xn ∈ 𝒳}

• x_1^n = (x1, . . . , xn) is shattered by F if F(x_1^n) = {0, 1}^n.

• I.e., it is shattered if the restriction of F to x_1^n gives all possible binary functions.

• The VC dimension of F is defined as

ν(F) = sup{n : mF(n) = 2^n} = sup{n : some {x1, . . . , xn} ⊂ 𝒳 is shattered by F}

• The largest cardinality of a subset of 𝒳 shattered by F.

• Equivalence between binary functions and sets: F = {1C : C ∈ 𝒞}.
• Can talk about the VC dimension of collections of subsets of 𝒳.

• F (or 𝒞) is called a VC class if its VC dimension is finite.

• VC dimension is a form of combinatorial complexity.

74 / 183

Page 75:

Examples

• F = {1(−∞,a] : a ∈ R} has ν(F) = 1.

• F = {1[a,b] : a, b ∈ R, a ≤ b} has ν(F) = 2.

• F = {indicators of half-planes in R2} has ν(F) = 3.

• For four points there are two possible arrangements in general position, and neither can be shattered; see the picture.⁴ What if the points are not in general position?
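A brute-force sketch (illustrative) of checking shattering for the interval class F = {1[a,b]}, searching (a, b) over a grid:

```python
import itertools
import numpy as np

def is_shattered(points, classify, params):
    """True if every binary labeling of `points` is realized by classify(points, p) for some p."""
    realized = {tuple(classify(points, p)) for p in params}
    return len(realized) == 2 ** len(points)

interval = lambda pts, ab: [int(ab[0] <= x <= ab[1]) for x in pts]   # f_{a,b}(x) = 1{a <= x <= b}
grid = [ab for ab in itertools.product(np.linspace(-3, 3, 61), repeat=2) if ab[0] <= ab[1]]

print(is_shattered([0.0, 1.0], interval, grid))         # True: 2 points can be shattered
print(is_shattered([0.0, 1.0, 2.0], interval, grid))    # False: the labeling (1, 0, 1) is impossible
```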

⁴ From the HDP book.

75 / 183

Page 76:

Class                                              VC dimension
Pairs of intervals [a, b] ∪ [c, d] in R            4
Circles in R2                                      3
Rectangles [a, b] × [c, d] in R2                   4
Squares [a, b]² in R2                              3
Rectangles, not necessarily axis-aligned, in R2    7
Polygons with k vertices in R2                     2k + 1
Half-spaces in Rd                                  d + 1

76 / 183

Page 77:

• Equivalence between binary functions and sets: F = {1C : C ∈ 𝒞}.
• F (or 𝒞) is called a VC class if its VC dimension is finite.

Theorem 15

Let F be a nonempty VC class with ν(F) = d < ∞. Then

mF(n) ≤ ∑_{i=0}^d (n choose i)

• There are only two possibilities:

mF(n) = 2^n if n ≤ d, and mF(n) ≤ (en/d)^d if n > d.
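A quick numeric sketch (illustrative) comparing the two bounds on the growth function:

```python
from math import comb, e

def sauer_bound(n, d):
    """Sum_{i=0}^{d} C(n, i): the Sauer-Shelah bound on m_F(n)."""
    return sum(comb(n, i) for i in range(d + 1))

n, d = 100, 5
print(sauer_bound(n, d))        # 79,375,496
print((e * n / d) ** d)         # ~4.7e8: the looser (en/d)^d bound (valid since n > d)
print(2 ** n)                   # ~1.3e30: the trivial bound 2^n
```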

77 / 183

Page 78:

• The inequality ∑_{i=0}^d (n choose i) ≤ (en/d)^d for d ≤ n:

(d/n)^d ∑_{i=0}^d (n choose i) ≤ ∑_{i=0}^d (n choose i)(d/n)^i ≤ (1 + d/n)^n ≤ e^d,

using 1 + x ≤ e^x, hence (1 + x)^n ≤ e^{nx}.

• Other bounds: ∑_{i=0}^d (n choose i) ≤ (n + 1)^d.

78 / 183

Page 79:

• Equivalence between binary functions and sets: F = {1C : C ∈ 𝒞}.
• F (or 𝒞) is called a VC class if its VC dimension is finite.

Theorem 16

Let 𝒞 be a nonempty VC class of subsets of 𝒳, with ν(𝒞) = d < ∞. Then for any finite subset A ⊂ 𝒳,

m𝒞(A) ≤ |{B ⊂ A : |B| ≤ d}| = ∑_{i=0}^d (|A| choose i)

where m𝒞(A) is the cardinality of the trace of 𝒞 on A.

• Follows from Pajor’s lemma: m𝒞(A) ≤ |{B ⊂ A : B is shattered by 𝒞}|.
• Note that A is finite here.

• m𝒞(A) = |F(A)| where F = {1C : C ∈ 𝒞}.
• m𝒞(A) = |𝒞 ∩ A| = |{C ∩ A : C ∈ 𝒞}|.
• Consider the example 𝒞 = {(−∞, a] : a ∈ R} with ν(𝒞) = 1 and A = {x1, x2, x3}.

79 / 183

Page 80:

• Recall

ℛn(F | X_1^n) ≤ √(2 D²_F log mF(n) / n).

• If ν(F) = d < ∞, we have

ℛn(F | X_1^n) ≤ √(2 D²_F d log(en/d) / n)

• A similar bound holds for the unconditional ℛn(F):

ℛn(F) ≤ √(2 E[D²_F] d log(en/d) / n) ≲ √(d log(n/d) / n),

assuming E[D²_F] = O(1).

• (Note the application of Jensen’s inequality: E√Z ≤ √(E Z).)

80 / 183

Page 81:

Proposition 1

For any b-uniformly bounded loss function class ℓF, with VC dimension ν(ℓF) = d < ∞, we have

‖R̄n‖F ≤ 2√(2 E[D²_F] d log(en/d) / n) + δ

with probability at least 1 − exp(−nδ²/(2b²)).

• Recall that

E[D²_F] = E sup_{f∈F} (1/n) ∑_{i=1}^n f²(Xi).

• Again the log factor can be removed (by chaining/Dudley’s integral).

81 / 183

Page 82:

VC dimension (continued)

• Relation between VC dimension and usual linear algebraic dimension.

Proposition 2

• Let G be a vector space of functions with dimension dim(G) < ∞.

• Let S(G) be the subgraph class of G: S(G) = {lev≤0(g) : g ∈ G}. Then,

ν(S(G)) ≤ dim(G)

• lev≤0(g) = {x : g(x) ≤ 0}.
• If S(G) shatters {x1, . . . , xn}, then training an ERM classifier from the class {x ↦ 1{g(x) ≤ 0} : g ∈ G} on data (x1, y1), . . . , (xn, yn) gives a training error of zero.

82 / 183

Page 83:

Proof of Proposition 2

• Need to show that no subset of size n := dim(G) + 1 can be shattered.

• Pick {x1, . . . , xn} ⊂ 𝒳, and define the (linear) map

Φ : G → Rn, Φ(g) = (g(x1), . . . , g(xn))

• Let Φ(G) denote the range of Φ.

• We have dim(Φ(G)) ≤ dim(G) = n − 1 < n.

• Hence, ∃v ∈ Rn \ {0} s.t. ⟨v, Φ(g)⟩ = 0 for all g ∈ G.

• Assume that S(G) shatters {x1, . . . , xn}.
• Then, ∃g ∈ G such that (using the convention sign(0) = −1)

sign(g(xi)) = sign(vi), ∀i ∈ [n].

• WLOG, at least some vj > 0, so that sign(vj) = 1 and g(xj) > 0.

• For that g we have 0 = ⟨v, Φ(g)⟩ = ∑_i |vi| |g(xi)| ≥ |vj| |g(xj)| > 0, a contradiction.

83 / 183

Page 84:

Example (half spaces)

• Consider the class of all half-spaces in Rd :

𝒞 = {{x ∈ Rd : ⟨a, x⟩ + b ≤ 0} : (a, b) ∈ Rd × R}

• 𝒞 is the subgraph class for {x ↦ ⟨a, x⟩ + b : a ∈ Rd, b ∈ R}.
• ν(𝒞) ≤ d + 1.

• (The VC dimension is in fact = d + 1.)

84 / 183

Page 85:

Example (spheres)

• Consider spheres in Rd :

𝒞 = {{x ∈ Rd : ‖x − a‖ ≤ b} : (a, b) ∈ Rd × R+}

• 𝒞 is the subgraph class for the functions:

F1 := {x ↦ ‖x‖² − 2 ∑_{j=1}^d aj xj + ‖a‖² − b² : (a, b) ∈ Rd × R+}

• Define φ(x) := (1, x1, . . . , xd, ‖x‖²), which maps Rd → Rd+2.

• The class F1 is included in

F2 = {x ↦ ⟨θ, φ(x)⟩ : θ ∈ Rd+2}, dim(F2) = d + 2.

• Hence, ν(𝒞) ≤ d + 2.

• (One can show that the VC dimension is = d + 1.)

85 / 183

Page 86:

Neural Network (NN) examples

• Linear threshold functions (t-lin.):

F^LT_d = {x ↦ sign(⟨θ, x⟩ + θ0) : θ ∈ Rd, θ0 ∈ R}

has ν(F^LT_d) = d + 1.

• Also let us introduce

F^LT(Θ) = {x ↦ sign(⟨θ, x⟩ + θ0) : θ ∈ Θ, θ0 ∈ R}

• Also recall

ℛn(F) ≤ √(2 E[D²_F] log mF(n) / n).

86 / 183

Page 87:

Two-layer network

• A two-layer network with linear threshold functions in the 1st layer and ℓ1-bounded weights in the 2nd layer:

F^(1)_d := {∑_{i=1}^k θi fi : k ∈ N, fi ∈ F^LT_d, ‖θ‖1 ≤ 1} = co(F^LT_d ∪ −F^LT_d)

since 0 ∈ F^LT_d.

• Also called the absolute convex hull or symmetric convex hull:

absconv(A) = {∑_{i=1}^k λi xi : k ∈ N, xi ∈ A, ∑_{i=1}^k |λi| ≤ 1}

• F^(1)_d = absconv(F^LT_d), hence

ℛn(F^(1)_d) = ℛn(F^LT_d) ≲ √(d log(en/d) / n)

87 / 183

Page 88:

Why?

• Recalling the Rademacher complexity of the function class F:

ℛn(F) = E_{X,ε}[ sup_{f∈F} |(1/n) ∑_{i=1}^n εi f(Xi)| ].  (14)

• Rademacher complexity is invariant under taking the convex hull:

ℛn(co(F)) = ℛn(F)

(Hint: the supremum of a convex function over co(F) equals its supremum over F.)

• absconv(F) = co({0} ∪ F ∪ (−F)).

• Hence, ℛn(absconv(F)) = ℛn(F).

88 / 183

Page 89:

Two-layer network

• A two-layer network with bounded fan-in linear threshold functions in the first layer and ℓ1-bounded weights in the second layer:

Fd,s = {x ↦ sign(⟨θ, x⟩ + θ0) : θ ∈ Rd, θ0 ∈ R, ‖θ‖0 ≤ s}

F^(1)_{d,s} = absconv(Fd,s)

• We have Fd,s = F^LT(B0(s)) = ⋃_{|S|=s} FS, where

FS = F^LT({θ ∈ Rd : θ_{S^c} = 0}), mFS(n) ≤ (en/(s + 1))^{s+1}

• Since m_{F∪G}(n) ≤ mF(n) + mG(n),

mFd,s(n) ≤ (d choose s) (en/(s + 1))^{s+1} ⟹ ℛn(Fd,s) ≲ √(s log(nd/s) / n)

89 / 183

Page 90:

Network of LT functions

Theorem 17Let Fp,k be the class of functions computed by a feed-forward network of LTfunctions with k computation units and p (total) parameters. Then,

mFp,k(n) ≤

(enkp

)pand hence ν(Fp,k) ≤ 2p log2(2k/ log 2).

• That is, ν(Fp,k) = O(p log k).

• Dimension of the input space X ⊂ Rd does not come into play.

90 / 183

Proof

• Fix a set of n input vectors x1, . . . , xn.
• Consider a topological sort of the computation units. (The NN architecture is a DAG.)
• pℓ: number of parameters of unit ℓ.
• Dℓ: number of distinct states of unit ℓ (the number of distinct functions {x1, . . . , xn} → {±1}^ℓ from the input to the outputs of the units before ℓ, as we vary the parameters; see next slide).
• D1 ≤ (en/p1)^{p1}, and Dℓ ≤ Dℓ−1 (en/pℓ)^{pℓ}, hence

Dk ≤ ∏_{ℓ=1}^k (en/pℓ)^{pℓ}

• Since the bound is independent of x1^n, we have

log m_{Fp,k}(n) ≤ ∑_{ℓ=1}^k pℓ log(en/pℓ)

• The bound is maximized by spreading p uniformly over the k units: pℓ = p/k.

91 / 183

Details:
• To see what is going on with the states, consider

f( g1(x), g2(x), . . . , gℓ(x) ),   all binary functions.

• Consider how many functions on {x1, . . . , xn} we get as the gi and f vary:

x ↦ f( g1(x), g2(x), . . . , gℓ(x) ),   over all f, g1, . . . , gℓ.

• First consider the functions {x1, . . . , xn} → {±1}^ℓ generated by

( g1(x), g2(x), . . . , gℓ(x) ),   over all g1, . . . , gℓ.

• Each such function can be represented by a binary matrix (n = 4, ℓ = 2):

      1  2          1  2                 1  2
x1    +  −      x1  −  −             x1  +  +
x2    −  +      x2  +  −    · · ·    x2  +  −
x3    +  +      x3  +  −             x3  −  −
x4    −  −      x4  −  +             x4  −  +

92 / 183

• First consider the functions {x1, . . . , xn} → {±1}^ℓ generated by

( g1(x), g2(x), . . . , gℓ(x) ),   over all g1, . . . , gℓ.

• Each such function can be represented by a binary matrix:

      1  2           1  2                  1  2
x1    +  −      x′1  −  −             x′′1  +  +
x2    −  +      x′2  +  −    · · ·    x′′2  +  −
x3    +  +      x′3  +  −             x′′3  −  −
x4    −  −      x′4  −  +             x′′4  −  +

• For each matrix, we get a possibly different set of n input vectors to the function f. (Each row is a data point for f.)
• Each of these matrices produces at most (en/pℓ)^{pℓ} functions {x1, . . . , xn} → {±1} at the output of f, and
• the number of matrices is Dℓ.

93 / 183

• I.e., the same set {x1, x2, x3, x4} at the input layer appears as different configurations at the input to f, due to the variations in g1, . . . , gℓ.

[Figure: the configuration {x1, x2, x3, x4} (blue) and the configuration {x′′1, x′′2, x′′3, x′′4} (red) plotted in the plane.]

• {x1, x2} cannot be separated from {x3, x4} in the blue configuration, but can be in the red one.

94 / 183

Network of LT functions (lower bound)

Theorem 18
Let Fd,k be the class of functions f : Rd → {±1} computed by a two-layer feed-forward network of LT functions with k hidden computation units (i.e., p = k(d + 2) + 1 parameters). Then ν(Fd,k) ≥ kd = Ω(p).

• A more involved argument shows ν(Fd,k) = Ω(p log k).
• p = k(d + 1) + (k + 1).
• Conclusion: for LT networks with p parameters and k computation units, the VC dimension is Θ(p log k), independent of depth.

95 / 183


Proof sketch

• Arrange kd points in k well-separated clusters Ci , i = 1, . . . , k on the surfaceof a sphere in Rd .

• Ensure d points in each cluster are in general position.

• Choose the decision boundary of the ith hidden neuron to pass through the points in Ci, oriented so that its output is fixed at the center of the sphere.
• Choose the parameters of the output unit to compute the conjunction of its k inputs.
• By perturbing the hidden-unit parameters, all 2^{kd} classifications can be computed.

96 / 183


Other nonlinearities

• Linear threshold networks are piecewise constant.

• What about piecewise linear functions, e.g. networks of ReLUs x ↦ (x)+? Or piecewise polynomials?
• How about smooth nonlinearities: sigmoid, soft-max, smoothed ReLU (softplus)

σ(t) = 1/(1 + e^{−t}),   [σ(t)]i = e^{ti} / ∑j e^{tj},   σ(t) = log(1 + e^t)

It is not clear whether the VC dimension remains finite.

• A smooth function with few parameters can have infinite VC dimension.

97 / 183

Sine nonlinearity

• Consider Fsin = { x ↦ sign(sin(θx)) : θ ∈ R }.
• We have ν(Fsin) = ∞.
• Any sequence xi = 2^i, i = 1, 2, . . . , n can be shattered.
• Consider the subclass obtained with θ = cπ where c has binary representation 0.b1b2 . . . bn1 for bi ∈ {0, 1}. Then,

sin(xiθ) = sin(2^i π 0.b1b2 . . . bn1)
         = sin(π b1b2 . . . bi . bi+1 . . . bn1)
         = sin(π bi . bi+1 . . . bn1)

so that sign(sin(θxi)) = 1 − 2bi.
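• A minimal numerical sketch of this construction (the target labeling below is an arbitrary choice): setting the bits of c from the desired labels realizes them exactly.

    import numpy as np

    n = 8
    x = 2.0 ** np.arange(1, n + 1)           # x_i = 2^i, i = 1, ..., n
    rng = np.random.default_rng(0)
    labels = rng.choice([-1, 1], size=n)     # an arbitrary labeling to realize
    b = (1 - labels) // 2                    # target bits: label_i = 1 - 2 b_i
    # theta = c * pi with c = 0.b1 b2 ... bn 1 in binary
    c = np.sum(b / 2.0 ** np.arange(1, n + 1)) + 2.0 ** -(n + 1)
    realized = np.sign(np.sin(c * np.pi * x)).astype(int)
    print(np.array_equal(realized, labels))  # expected: True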

98 / 183

Geometric methods

• Want to bound the growth function mF(n) of a parametric family:

{ x ↦ f(x, θ) : θ ∈ Θ },   Θ ⊂ Rp

• Example: linear threshold functions f(x, θ) = sign(⟨θ, x⟩):
• The intercept is included in θ and x has been augmented with a 1:

x = (1, x1, . . . , xd),   θ = (θ0, θ1, . . . , θd)

• We know ν(F^LT_d) = d + 1, hence m_{F^LT_d}(n) ≤ ∑_{i=0}^{d+1} (n choose i).
• Directly compute m_{F^LT_d}(n) using a geometric argument.

• Useful in other situations.

99 / 183

Theorem 19
For the class of linear threshold functions on Rd,

mF(n) = 2 ∑_{i=0}^{d} (n−1 choose i)

Proof idea:
• Fix x1, . . . , xn ∈ Rd+1.
• Divide the parameter space into cells that give the same classification.
• Count the cells using a geometric argument (goes back to Schläfli, 1851).

100 / 183

• Assume the points are in Rd (for simplicity) and
• in general position, that is, every subset of size ≤ d is linearly independent.
• For each xi define the hyperplane

Hxi = { θ ∈ Rd : ⟨θ, xi⟩ = 0 }

• The number of classifications of x1^n is given by

|F(x1^n)| = | CC( Rd \ ⋃_{i=1}^n Hxi ) | =: C(n, d)

where CC denotes the connected components (the cells of the arrangement).
• Compute C(n, d) by induction on n (this also shows independence of x1^n).
• C(1, d) = 2.
• Consider the arrangement of n hyperplanes, and add a new point xn+1.
• Hxn+1 splits some of the cells and leaves the others intact.
• Number of cells split = |CC( Hxn+1 \ ⋃_{i=1}^n Hxi )| = C(n, d − 1), thus

C(n + 1, d) = C(n, d) + C(n, d − 1).

• The solution to this recursion is C(n, d) = 2 ∑_{k=0}^{d−1} (n−1 choose k).
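• A quick numerical check of the cell-count formula (a sketch for the homogeneous case C(n, d), with arbitrary random points; each sign pattern is tested for realizability via a feasibility LP, assuming scipy is available):

    import itertools
    import numpy as np
    from math import comb
    from scipy.optimize import linprog

    rng = np.random.default_rng(0)
    n, d = 6, 3
    X = rng.standard_normal((n, d))          # points in general position (a.s.)

    count = 0
    for s in itertools.product([-1.0, 1.0], repeat=n):
        s = np.array(s)
        # pattern s is realizable iff there is theta with s_i <x_i, theta> >= 1 for all i
        res = linprog(np.zeros(d), A_ub=-(s[:, None] * X), b_ub=-np.ones(n),
                      bounds=[(None, None)] * d, method="highs")
        count += (res.status == 0)

    formula = 2 * sum(comb(n - 1, k) for k in range(d))
    print(count, formula)                    # expected: 32 32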

101 / 183


102 / 183


103 / 183

Convex losses for classification

• Finding the ERM for the zero–one loss is computationally difficult.
• Let Y ∈ {−1, 1}, Y′ = R, i.e., f : X → R.
• Look at general loss functions of the form ℓ(y, t) = φ(yt).
• The risk for the zero–one loss is

R(f) = P[ Y ≠ sign(f(X)) ] = E 1{Y f(X) ≤ 0} =: E φ0(Y f(X))

where φ0(t) := 1{t ≤ 0}.
• We can replace φ0 with a convex loss function:

φsvm(t) = (1 − t)+,   φada(t) = e^{−t},   φlogi(t) = log(1 + e^{−t})

• Efficient convex optimization algorithms.

104 / 183

• What do we lose? Focus on the population-level question.
• Does minimizing Rφ(f) = Eφ(Y f(X)) give a good classifier,
• i.e., small zero–one risk R(f) = Eφ0(Y f(X))?
• If φ0(t) ≤ c φ(t) for all t, then R(f) ≤ c Rφ(f).
• Can we do better?
• Does approximately minimizing Rφ(f) also approximately minimize Rφ0(f)?

105 / 183

• Posterior (conditional) risk:

Cη(t) = Eη[L(Y, t)],   C∗η = inf_{t∈R} Cη(t)   (15)

where η is a distribution on Y, here identified with η = P(Y = 1).
• Later we work with η(x) = P(Y = 1 | X = x).
• Here L is the zero–one loss.
• C̄ and C̄∗ denote the corresponding quantities for a surrogate loss L̄.
• Recall Cη(x)(t) = E[L(Y, t) | X = x]. Hence,

E Cη(X)(f(X)) = E L(Y, f(X)) = R(f).

• For margin-based classifiers, L̄(y, t) = φ(yt) and

C̄η(t) = η φ(t) + (1 − η) φ(−t)

106 / 183

Calibration

• Often drop the dependence on η for simplicity.
• ε-approximate minimizers of C(t):

M(ε) = Mη(ε) = { t ∈ R : C(t) − C∗ < ε }

• Calibration function:

δ(ε) := inf_{t ∉ M(ε)} [ C̄(t) − C̄∗ ]

• Convention: δ(ε) = ∞ if C̄∗ = ∞. Note that

C(t) − C∗ ≥ ε   =⇒   C̄(t) − C̄∗ ≥ δ(ε),   ∀t ∈ R, ε > 0

• For any fixed t ∈ R, taking the supremum over ε ≤ C(t) − C∗ and using the fact that ε ↦ δ(ε) is increasing, we obtain

C̄(t) − C̄∗ ≥ sup_{ε ≤ C(t)−C∗} δ(ε) = δ( C(t) − C∗ )

assuming that δ(·) is continuous.

107 / 183

Calibration

• Putting η back in, we have shown

δη( Cη(t) − C∗η ) ≤ C̄η(t) − C̄∗η.

Definition 1 (Calibration)
• L̄ is calibrated for L (w.r.t. Q) if

δη(ε) > 0,   ∀ε > 0, η ∈ Q

• L̄ is uniformly calibrated for L (w.r.t. Q) if

δη(ε) ≥ δ(ε) > 0,   ∀ε > 0, η ∈ Q

• Usually define δ(ε) := δQ(ε) := inf_{η∈Q} δη(ε),
• but any uniform lower bound suffices.

108 / 183

• Let Bf := sup_x |Cη(x)(f(x))|.
• P on X × Y is of Q-type if P(Y ∈ · | X = x) belongs to Q.

Theorem 20
• L̄ is uniformly calibrated for L with calibration function δ(·).
• Let δ∗∗Bf : [0, Bf] → [0, ∞] be the Fenchel–Legendre biconjugate of δ restricted to [0, Bf].
• Assume R∗ and R̄∗ are finite.
Then,

δ∗∗Bf( R(f) − R∗ ) ≤ R̄(f) − R̄∗

• For any proper function f, its biconjugate satisfies f∗∗ ≤ f and
• f∗∗ is convex and lsc (lower semicontinuous).
• In fact, f∗∗ is the largest convex lsc minorant of f.
• For a proper function, f = f∗∗ iff f is convex and lsc.
• Alternatively, epi(f∗∗) = the closed convex hull of epi(f).

109 / 183

Example 4

• For classification, L(y, t) = 1{yt ≤ 0} = 1{y sign(t) ≤ 0}. Hence,

Cη(t) = η 1{t ≤ 0} + (1 − η) 1{t ≥ 0}

• We have C∗η = min{η, 1 − η}. Hence,

Cη(t) − C∗η = max{0, 2η − 1} 1{t ≤ 0} + max{1 − 2η, 0} 1{t ≥ 0}
            = |2η − 1| 1{(2η − 1)t ≤ 0}.

• Then,

Mη(ε) = R  if ε > |2η − 1|;   Mη(ε) = { t ∈ R : (2η − 1)t > 0 }  if ε ≤ |2η − 1|

• It follows that (with the convention inf ∅ = ∞)

δη(ε) = ∞  if ε > |2η − 1|;   δη(ε) = inf_t { C̄η(t) − C̄∗η : (2η − 1)t ≤ 0 }  if ε ≤ |2η − 1|

110 / 183

Proof

• Let Z := Cη(X)(f(X)) − C∗η(X),
• and Z̄ := C̄η(X)(f(X)) − C̄∗η(X), both random variables.
• By assumption |Z| ≤ Bf. We have

δ∗∗(Z) ≤ δ(Z) ≤ δη(X)(Z) ≤ Z̄.

• Since δ∗∗ is convex, we can apply Jensen's inequality:

δ∗∗(EZ) ≤ E δ∗∗(Z) ≤ E Z̄.

• EZ and EZ̄ are the excess risks of the target and surrogate losses.

111 / 183

Lemma 2
For a margin-based loss L̄ we define

H(η) = inf_{t∈R : (2η−1)t ≤ 0} [ C̄η(t) − C̄∗η ]   (16)

1. L̄ is classification calibrated iff H(η) > 0 for all η ≠ 1/2.
2. If L̄ is continuous, then δη(ε) = H(η) for all ε ≤ |2η − 1|.
3. H is continuous, H(η) = H(1 − η) and H(1/2) = 0.

112 / 183

Theorem 21
• For a margin-based loss L̄, the following are equivalent:

1. L̄ is classification calibrated.
2. L̄ is uniformly classification calibrated.

• Moreover, for H as in (16), let

ψ(ε) = H( (1 + ε)/2 ),   ε ∈ [0, 1].

• Then, ψ∗∗(ε) ≤ δ∗∗Q(ε) for all ε ∈ [0, 1], with equality if L̄ is continuous.
• If L̄ is classification calibrated, we have ψ∗∗(ε) > 0 for all ε > 0.
• Conclusion:

ψ∗∗( R(f) − R∗ ) ≤ R̄(f) − R̄∗

113 / 183

Proof

• We have

δQ(ε) = inf_{η∈[0,1]} δη(ε) ≥ inf_{|2η−1|≥ε} H(η) = inf_{η≥(1+ε)/2} H(η)

by symmetry of H around 1/2. (The second step could be a strict inequality due to the issue at t = 0. Assuming L̄ is continuous, so that C̄(t) is continuous at t = 0, this does not happen.)
• Assuming L̄ is calibrated, H is strictly positive on [(1 + ε)/2, 1],
• and continuous (by the previous lemma),
• hence δQ(ε) > 0, i.e., L̄ is uniformly calibrated.
• Now, we have δ∗∗Q(·) = ψ∗∗,
• since δQ(ε) = inf_{ε′ ≥ ε} ψ(ε′).

114 / 183

Theorem 22
Let L̄ be a margin-based convex loss represented by φ. The following are equivalent:

(i) L̄ is classification calibrated.
(ii) φ is differentiable at 0 and φ′(0) < 0.

If L̄ is classification calibrated, then δ∗∗Q(ε) = φ(0) − C̄∗(ε+1)/2.
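• A numerical sanity check of this formula (a sketch): for the exponential loss φ(t) = e^{−t}, C̄∗η = 2√(η(1 − η)), so the ψ-transform should equal 1 − √(1 − ε²). The grid minimization below computes C̄∗η without using that closed form; the grid range and step are arbitrary choices.

    import numpy as np

    phi = lambda t: np.exp(-t)               # AdaBoost (exponential) loss
    t_grid = np.linspace(-10, 10, 200001)

    def C_bar_star(eta):
        return np.min(eta * phi(t_grid) + (1 - eta) * phi(-t_grid))

    for eps in [0.1, 0.5, 0.9]:
        psi = phi(0.0) - C_bar_star((1 + eps) / 2)
        print(eps, psi, 1 - np.sqrt(1 - eps**2))   # the last two columns agree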

115 / 183

Proof

• (Bad notation for C̄∗η: this ∗ does not denote a conjugate.)
• (ii) =⇒ (i): φ differentiable at 0 implies C̄η(·) is differentiable at 0,
• with C̄′η(0) = (2η − 1)φ′(0) < 0 for η ∈ (1/2, 1].
• Convexity of C̄η(·) then implies it is decreasing on (−∞, 0], hence

H(η) = inf_{t∈R : (2η−1)t≤0} [ C̄η(t) − C̄∗η ]
     = inf_{t≤0} [ C̄η(t) − C̄∗η ]
     = C̄η(0) − C̄∗η = φ(0) − C̄∗η

for η ∈ (1/2, 1].
• C̄∗η is the infimum of C̄η(·) over R.
• Since C̄′η(0) < 0, C̄η cannot have a minimizer at 0,
• so C̄η(0) − C̄∗η > 0 =⇒ H(η) > 0, showing classification calibration.

116 / 183

Page 117: STAT 208: Statistical learning theoryarash.amini/teaching/stat208/notes/208_slides.… · STAT 208: Statistical learning theory Arash A. Amini March 8, 2018 1/183. ... Learning functions

• (i) =⇒ (ii): Assume φ is diff’able at 0 (not needed). Assume φ′(0) ≥ 0and derive contradiction:

C1(t) = φ(t) ≥ φ′(0)t + φ(0) ≥ C1(0), ∀t > 0. (17)

• L is classif. calib. =⇒ H(η) > 0 =⇒ C∗η = inft>0 Cη(t) for η > 1/2.

• Apply with η = 1, take inf over t > 0 in (17) =⇒ C∗1 ≥ C1(0),

• showing H(1) ≤ 0 contradicting classif. calibration.

• Last part: C ′1/2(0) = 12φ′(0)− 1

2φ′(0) = 0 hence C∗1/2 = C1/2(0).

• Shows H(η) = φ(0)− C∗η holds even at η = 1/2, i.e. for all η ∈ [1/2, 1].

• C∗η infimum of affine functions =⇒ concave,

• =⇒ H is convex on [1/2, 1].

• H is continuous (follows from continuity of L?) and convex,

• =⇒ H = H∗∗.

• ψ = ψ∗∗ as well. (Recall ψ(ε) = H( 1+ε2 ).)

117 / 183

Example

[Plots of C̄η(t) for η = 0.6 and η = 0.4 (hinge loss).]

• φ(t) = (1 − t)+ = max(0, 1 − t), the hinge loss.
• C̄η(t) = η(1 − t)+ + (1 − η)(1 + t)+.
• C̄∗η = C̄η(1) = 2(1 − η) if η > 1/2, and C̄∗η = C̄η(−1) = 2η if η < 1/2.

118 / 183

• H(η) = 1 − C̄∗η = 1 − 2 min(η, 1 − η).
• For ε ∈ [0, 1]:

ψ(ε) = H( (1 + ε)/2 ) = 1 − 2 min( (1 + ε)/2, (1 − ε)/2 ) = 1 − 2 · (1 − ε)/2 = ε.

• Clearly ψ∗∗ = ψ. Conclusion:

R(f) − R∗ ≤ Rhinge(f) − R∗hinge.

• The excess risk in the 0–1 loss is directly bounded by the excess risk in the hinge loss.
• Also called Zhang's inequality.
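• A small simulation of Zhang's inequality (a sketch: X uniform on m arbitrary points with made-up η values, f an arbitrary decision function; risks are computed exactly from the conditional risks):

    import numpy as np

    rng = np.random.default_rng(0)
    m = 50
    eta = rng.uniform(0, 1, size=m)          # eta(x_j) = P(Y = 1 | X = x_j), X uniform on m points
    f = rng.normal(0, 2, size=m)             # an arbitrary (suboptimal) decision function

    R       = np.mean(eta * (f <= 0) + (1 - eta) * (f > 0))
    R_star  = np.mean(np.minimum(eta, 1 - eta))
    Rh      = np.mean(eta * np.maximum(0, 1 - f) + (1 - eta) * np.maximum(0, 1 + f))
    Rh_star = np.mean(2 * np.minimum(eta, 1 - eta))

    print(R - R_star, Rh - Rh_star)
    print(R - R_star <= Rh - Rh_star)        # expected: True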

119 / 183

Hard-margin SVM

• Linearly separable data.
• The distance of xi to the hyperplane {x : wTx + b = 0} is |wTxi + b|/‖w‖.
• Maximum-margin separating hyperplane:

max_{w,b}  min_i  yi(wTxi + b) / ‖w‖.

• Equivalent to

max_{γ,w,b}  γ/‖w‖
s.t.  yi(wTxi + b) ≥ γ,  ∀i = 1, . . . , n.

• Invariant to joint scaling of w, b and γ. So set γ = 1:

min_{w,b}  ‖w‖
s.t.  yi(wTxi + b) ≥ 1,  ∀i = 1, . . . , n,

which is a convex problem.

120 / 183

Soft-margin SVM

• What if the data is not linearly separable?
• Introduce slack variables: yi(wTxi + b) ≥ 1 − ξi, with ξi ≥ 0.
• But we also need the ξi not to be too large; we can penalize them in the objective:

min_{w,ξ,b}  (1/2)‖w‖² + C ∑_{i=1}^n ξi
s.t.  yi(wTxi + b) ≥ 1 − ξi,  ξi ≥ 0,  i = 1, . . . , n   (18)

• SVM penalizes data lying in the margin in addition to those misclassified.

121 / 183

• Set b = 0 for simplicity.
• The conditions yi wTxi ≥ 1 − ξi and ξi ≥ 0 are equivalent to

ξi ≥ max{0, 1 − yi wTxi}

• Optimizing over ξi first, we obtain an equivalent form of (18):

min_w  (1/2)‖w‖² + C ∑_{i=1}^n max{0, 1 − yi wTxi}

• Defining the hinge loss Lhinge(y, t) = (1 − yt)+ = max{0, 1 − yt}:

min_w  (1/n) ∑_{i=1}^n Lhinge(yi, wTxi) + λ‖w‖²

where λ = 1/(2nC).
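• A minimal sketch of this hinge-plus-ridge form (with b = 0), solved by plain subgradient descent; the toy data, λ, and step-size schedule are arbitrary choices, not part of the slides:

    import numpy as np

    rng = np.random.default_rng(0)
    n, d = 200, 2
    X = rng.normal(size=(n, d))
    y = np.sign(X[:, 0] + 0.3 * rng.normal(size=n))   # noisy labels -> not separable

    lam, T, step = 0.1, 2000, 0.05
    w = np.zeros(d)
    for t in range(T):
        active = y * (X @ w) < 1                       # points with nonzero hinge subgradient
        grad = -(y[active, None] * X[active]).sum(axis=0) / n + 2 * lam * w
        w -= step / (1 + t) ** 0.5 * grad              # decaying step size
    obj = np.mean(np.maximum(0, 1 - y * (X @ w))) + lam * w @ w
    print(w, obj)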

122 / 183

SVM dual

min_{w,ξ,b}  (1/2)‖w‖² + C ∑_{i=1}^n ξi
s.t.  yi(wTxi + b) ≥ 1 − ξi,  ξi ≥ 0,  i = 1, . . . , n

• Can write it in minimax form:

inf_{w,ξ,b} sup_{α,µ≥0}  L(w, ξ, b; α, µ)

• where the Lagrangian is

L(w, ξ, b; α, µ) = (1/2)‖w‖² + C ∑i ξi + ∑i αi(1 − ξi − yi(wTxi + b)) − ∑i µiξi
                 = (1/2)‖w‖² − wT(∑i xiyiαi) − b ∑i yiαi + ∑i ξi(C − µi − αi) + ∑i αi

123 / 183

• Strong duality:

inf_{w,ξ,b} sup_{α,µ≥0} L(w, ξ, b; α, µ) = sup_{α,µ≥0} inf_{w,ξ,b} L(w, ξ, b; α, µ)

• Minimizing over w gives w∗ = ∑i xiyiαi.
• Minimizing over b gives −∞ unless ∑i yiαi = 0.
• Minimizing over ξ gives −∞ unless C − µi − αi = 0, ∀i.
• The dual problem is

sup_{α,µ≥0}  ∑i αi − (1/2)‖∑i xiyiαi‖²
s.t.  C − αi − µi = 0 ∀i,   ∑i αiyi = 0

• The role of µ is only to enforce C − αi = µi ≥ 0.
• Also, ‖∑i xiyiαi‖² = ∑_{i,j} αiαjyiyj⟨xi, xj⟩.

124 / 183

• The dual problem simplifies to

sup_α  ∑i αi − (1/2) ∑_{i,j} αiαjyiyj⟨xi, xj⟩
s.t.  0 ≤ αi ≤ C ∀i,   ∑i αiyi = 0

• To solve the dual, we only need the Gram matrix (〈xi , xj〉) ∈ Rn×n.
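• A small numerical sketch of the dual, solved here with a generic solver (scipy's SLSQP) rather than a dedicated QP/SMO routine; the toy data and C are arbitrary choices:

    import numpy as np
    from scipy.optimize import minimize

    rng = np.random.default_rng(0)
    n, d, C = 40, 2, 1.0
    X = rng.normal(size=(n, d))
    y = np.where(X[:, 0] + 0.3 * rng.normal(size=n) > 0, 1.0, -1.0)
    G = X @ X.T                                   # Gram matrix <x_i, x_j>
    Q = (y[:, None] * y[None, :]) * G

    neg_dual = lambda a: 0.5 * a @ Q @ a - a.sum()
    grad = lambda a: Q @ a - np.ones(n)
    res = minimize(neg_dual, np.zeros(n), jac=grad, method="SLSQP",
                   bounds=[(0.0, C)] * n,
                   constraints=[{"type": "eq", "fun": lambda a: a @ y}])
    alpha = res.x
    w = (alpha * y) @ X                           # w* = sum_i alpha_i y_i x_i
    print(w, (alpha > 1e-6).sum())                # weight vector and number of support vectors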

125 / 183

KKT conditions

• Write f∗(x) = ⟨w∗, x⟩ + b∗. The KKT conditions are

w∗ = ∑i α∗i xiyi
α∗i [1 − ξ∗i − yi f∗(xi)] = 0
µ∗i ξ∗i = 0  ⇐⇒  (C − α∗i) ξ∗i = 0
α∗ ∈ [0, C]^n,  ∑i yiα∗i = 0,  ξ∗i ≥ 0

The two middle lines are complementary slackness; the last line is primal and dual feasibility.
• Note that f∗(x) = ∑i α∗i yi ⟨xi, x⟩ + b∗.
• α∗i > 0 =⇒ 1 − yi f∗(xi) = ξ∗i.
• α∗i < C =⇒ ξ∗i = 0.
• α∗i ∈ (0, C) =⇒ yi f∗(xi) = 1.
• yi f∗(xi) > 1 =⇒ α∗i = 0, and α∗i = 0 =⇒ yi f∗(xi) ≥ 1.
• α∗i = C =⇒ yi f∗(xi) ≤ 1. (Misclassified points, i.e. those with yi f∗(xi) < 0, are among these.)

126 / 183

SVM: General feature map Φ

• Primal problem:

min_{w,ξ,b}  (1/2)‖w‖² + C ∑_{i=1}^n ξi
s.t.  yi(wTΦ(xi) + b) ≥ 1 − ξi,  ξi ≥ 0,  i = 1, . . . , n

• Dual problem:

sup_α  ∑i αi − (1/2) ∑_{i,j} αiαjyiyj⟨Φ(xi), Φ(xj)⟩
s.t.  0 ≤ αi ≤ C ∀i,   ∑i αiyi = 0

• f∗(x) = ⟨w∗, Φ(x)⟩ + b∗ = ∑_{i=1}^n α∗i yi ⟨Φ(x), Φ(xi)⟩ + b∗.

• K (x , y) = 〈Φ(x),Φ(y)〉 is a kernel.

• The kernel trick: only the kernel is needed not the feature map.

127 / 183


RKHS

• A Hilbert space is a complete inner product space.

• complete = Cauchy sequences converge.

Theorem 23
Let H be a Hilbert space and L : H → R a linear functional. The following are equivalent:

(a) L is continuous.

(b) L is continuous at 0.

(c) L is continuous at some point.

(d) L is bounded: ∃c > 0 s.t. |L(h)| ≤ c‖h‖ for all h ∈ H.

• The norm of a (linear) functional is defined as

‖L‖ := sup{ |L(h)| : ‖h‖ ≤ 1 }

L is bounded iff ‖L‖ < ∞.

128 / 183

dist(h, K) := inf{ ‖h − k‖ : k ∈ K }

Theorem 24
Let H be a Hilbert space and h ∈ H.
• K ⊂ H nonempty, closed, and convex. Then ∃! k0 ∈ K s.t. ‖h − k0‖ = dist(h, K).
• M ⊂ H a closed linear subspace. Then f0 ∈ M is the unique point with ‖h − f0‖ = dist(h, M) iff h − f0 ⊥ M.

• The unique point above is called the projection of h onto K or M.
• In the case of M (a closed linear subspace), it is an orthogonal projection.

Theorem 25
Let M ⊂ H be a closed linear subspace, h ∈ H, and let Ph be the projection of h onto M.

(a) P is a linear operator on H.

(b) ‖Ph‖ ≤ ‖h‖ for all h ∈ H.

(c) P2 = P (P is idempotent)

(d) kerP = M⊥ and ranP = M.

129 / 183

• Conway's notation: M ≤ H means M is a closed linear subspace of H.
• Corollary: orthogonal decomposition of the space:
• Assume M ≤ H. For any h ∈ H we have h = h1 + h⊥ where h1 ∈ M and h⊥ ∈ M⊥, and the decomposition is unique.

Corollary 2

If M ≤ H, then (M⊥)⊥ = M.

Corollary 3

A linear subspace A ⊂ H is dense in H iff A⊥ = 0.

130 / 183


Theorem 26 (Riesz Representation)

L : H → R bounded linear functional; there exists a unique h0 ∈ H such that

L(h) = 〈h, h0〉, ∀h ∈ H.

Moreover, ‖L‖ = ‖h0‖.

• ker L closed linear subspace (since L is continuous).

• Assume ker L ≠ H, hence (ker L)⊥ ≠ {0};
• there exists f0 ∈ (ker L)⊥ with L(f0) = 1.

• Pick h ∈ H and let α = L(h).

• L(h − αf0) = 0 =⇒ h − αf0 ∈ ker L, hence

0 = 〈h − αf0, f0〉 = 〈h, f0〉 − L(h)‖f0‖2

• h0 = ‖f0‖−2f0 is the desired representer.

• Argue uniqueness as an exercise.

131 / 183

Definition 2
Consider a Hilbert space H of (real-valued) functions f : X → R:
• H is a reproducing kernel Hilbert space (RKHS) if the evaluation functional δx : H → R defined by

δx(f) = f(x),   f ∈ H

is continuous for all x ∈ X.
• K : X × X → R is a reproducing kernel of H if K(·, x) ∈ H for all x ∈ X and the reproducing property holds:

⟨f, K(·, x)⟩ = f(x),   x ∈ X, f ∈ H

• In a RKHS, convergence in ‖·‖H implies pointwise convergence, due to the continuity of the evaluation functionals:

fn → f in H   =⇒   fn(x) → f(x), ∀x ∈ X

(The LHS implies δx(fn) → δx(f).)
• L2pre([0, 1]) cannot be made into a RKHS: fn(x) = x^n satisfies fn → 0 in L2 but fn(1) = 1.
(L2pre([0, 1]) = C([0, 1]) equipped with the L2 norm (∫0^1 f²(t) dt)^{1/2}.)

132 / 183


Lemma 3

H is Hilbert function space over X that has reproducing kernel K . Then,

• K is a kernel (i.e., PSD).

• H is a RKHS.

• K (·, x) acts as the representer of evaluation functional δx ,

• as well as a feature map for K .

Definition 3A symmetric bivariate function K : X × X → R is called a kernel if it is PSD.

• K is PSD if (K (xi , xj)) ∈ Rn×n is PSD for all x1, . . . , xn ∈ X and all n ≥ 1.

• Equivalent to: ∃Φ : X → H′ such that

K (x , y) = 〈Φ(x),Φ(y)〉H′

• Φ is called a feature map of K .

133 / 183


Proof of Lemma 3

• We have δx(f ) = f (x) = 〈f ,K (·, x)〉H, hence (Cauchy-Schwarz)

|δx(f )| ≤ ‖f ‖H‖K (·, x)‖H

showing that δx is bounded and ‖δx‖ ≤ ‖K (·, x)‖H; thus H is a RKHS.

• We have, again by reproducing property:

〈K (·, x),K (·, y)〉H = K (x , y)

thus taking Φ(x) = K (·, x) shows that K is a kernel with feature map Φ.

134 / 183


Converse

• Every RKHS has a unique reproducing kernel.

Proposition 3

Let H be an RKHS over X , and let kx be the Riesz representer of evaluationfunctional δx . Then H has a unique reproducing kernel which is given byK (x , y) = 〈kx , ky 〉H as well as

K (x , y) =∑i∈I

ei (x)ei (y)

for any orthonormal basis (ei )i∈I of H.

135 / 183


Proof of Proposition 3

• Let kx ∈ H be the representer of evaluation at x : f (x) = 〈f , kx〉H.

• Define K (x , y) := 〈kx , ky 〉H = ky (x) = kx(y).

• That is K (·, y) = ky ., hence

〈K (·, y), f 〉H = 〈ky , f 〉H = f (y)

i.e., K is a reproducing kernel for H.

• Let K be a reproducing kernel of H and (ei )i∈I an ONB. Then,

K (·, y) =∑i∈I

〈K (·, y), ei 〉ei =∑i∈I

ei (y) ei , ∀y ∈ X

where the series converges in ‖ · ‖H. Since H is a RKHS, the series alsoconverges pointwise,

• i.e, K (x , y) =∑

i∈I ei (y) ei (x) for all x , y ∈ X .

• Since K was arbitrary, it should be unique.

136 / 183

Some examples

• X = Rd.
• Linear kernel: K(x, y) = ⟨x, y⟩.
• Homogeneous polynomial kernel: K(x, y) = ⟨x, y⟩^m.
• E.g., for m = 2, Φ(x) = (xi², √2 xixj : 1 ≤ i < j ≤ d) is a feature map.
• Inhomogeneous polynomial kernel: K(x, y) = (1 + ⟨x, y⟩)^m.
• Gaussian kernel: K(x, y) = exp(−‖x − y‖²/(2σ²)). (Exercise: show it is PSD.)

137 / 183

Lemma 4
Let H be a RKHS on X. Then, the linear span of {kx(·) = K(·, x) : x ∈ X} is dense in H.

• Proof: {kx : x ∈ X}⊥ = {0}.
• Another way to state the lemma: the closure of span{kx : x ∈ X} is H.

• Note: the closure is w.r.t. the norm of H.

138 / 183

From RKHS to kernels

• The mapping from RKHSs to kernels is injective:

Proposition 4
Let Hi be a RKHS on X with kernel Ki(x, y), for i = 1, 2. Then,

K1(x, y) = K2(x, y), ∀x, y ∈ X   =⇒   H1 = H2 and ‖f‖H2 = ‖f‖H1.

• kx := K1(·, x) = K2(·, x).
• W = span{kx : x ∈ X} =⇒ W is dense in Hi, i = 1, 2.
• Easy to verify that ‖f‖H1 = ‖f‖H2 for f ∈ W.
• If f ∈ H1, then ∃{fn} ⊂ W with fn → f in H1.
• {fn} is Cauchy in (W, ‖·‖H1), hence Cauchy in (W, ‖·‖H2),
• hence ∃g ∈ H2 s.t. fn → g in H2.
• fn(x) → g(x) pointwise for all x, and also fn(x) → f(x) pointwise for all x,
• hence f(x) = g(x), i.e., f ∈ H2. (The reverse argument holds as well.)
• This proves H1 = H2. Since the norms are equal over a dense subset, they are equal everywhere.

139 / 183

From kernels to RKHS

• Every kernel has a (unique) RKHS.

Theorem 27 (Moore or Moore–Aronszajn)
Let K : X × X → R be a (PSD) kernel. There exists a RKHS (of functions) on X with reproducing kernel K.

• W = span{kx : x ∈ X}. Define an inner product on W by

⟨ ∑i αi kxi, ∑j βj kxj ⟩⋆ := ∑_{i,j} αi βj K(xi, xj)

• One can verify that this is well-defined: the same inner product is obtained for different representations of the same function as linear combinations of the kx, x ∈ X. (Show that f(x) = ∑i αi kyi(x) is identically zero iff ⟨f, w⟩⋆ = 0 for all w ∈ W.)
• By definition, for any f ∈ W, ⟨f, kx⟩⋆ = f(x).
• W is an inner product space, hence has a completion (as equivalence classes of Cauchy sequences), which is a Hilbert space.
• Every element of the completion is in fact a function (as opposed to, e.g., the completion of C([0, 1]) to get L2([0, 1])). See the next slide.

140 / 183

• Let h ∈ H, and {fn} ⊂ W a Cauchy sequence with fn → h in H.
• By the Cauchy–Schwarz inequality,

|fn(x) − fm(x)| = |⟨fn − fm, kx⟩⋆| ≤ √K(x, x) ‖fn − fm‖⋆   (19)

(valid by our definition of ⟨·,·⟩⋆ over W).
• Hence, {fn} is pointwise convergent, and we can define h by h(x) = limn fn(x). The usual argument shows that this is independent of the particular Cauchy sequence⁵.
• Let ⟨·,·⟩ be the inner product on H, and fn and h as above. Then,

⟨h, kx⟩ = limn ⟨fn, kx⟩ = limn ⟨fn, kx⟩⋆ = limn fn(x) = h(x)

i.e., kx is the representer of evaluation in H.
• Hence H is a RKHS, and K(x, y) = kx(y) is its unique reproducing kernel (see Lemma 3 and Proposition 3).

⁵Recall that two Cauchy sequences {fn} and {gn} are equivalent if limn ‖fn − gn‖⋆ = 0.

141 / 183

• Conclusion: there is a one-to-one mapping between kernels and RKHSs.

Example 5 (Sobolev order 1)
• Consider the following space:

H1([0, 1]) = { f : [0, 1] → R | f(0) = 0, f is abs. cont., f′ ∈ L2([0, 1]) }

• With inner product

⟨f, g⟩H1 = ∫0^1 f′(t) g′(t) dt

• Verify that it is a Hilbert space.
• Take kt(s) := K(s, t) = min{s, t} and note that k′t(s) = 1{s < t} for a.e. s.
• Note that kt ∈ H1 and satisfies the reproducing property:

∫0^1 f′(s) k′t(s) ds = f(t),   ∀t ∈ [0, 1], f ∈ H1

• It follows (Lemma 3) that H1([0, 1]) is a RKHS with kernel K.
• Conclude that K is PSD, i.e., (min{ti, tj}) ∈ Rn×n is PSD for any t1, . . . , tn ∈ [0, 1], which might not be easy to show directly.

142 / 183

Example 6 (Higher-order Sobolev, smoothing splines)
• Hα([0, 1]) = { f such that f^(α−1) is abs. cont. and f^(α) ∈ L2([0, 1]) }.
• Additional boundary conditions: f(0) = f^(1)(0) = · · · = f^(α−1)(0) = 0.
• Inner product ⟨f, g⟩Hα = ∫0^1 f^(α)(t) g^(α)(t) dt.
• Hα is a RKHS with the following kernel:

kx(y) = K(x, y) = ∫0^1 ψα(x − t) ψα(y − t) dt,   ψα(t) = t_+^{α−1} / (α − 1)!.

• kx ∈ Hα and kx^(α)(y) = ψα(x − y). [Note: ψα^(α−1)(t) = 1{t > 0}.]
• The reproducing property follows from the Taylor expansion

f(x) = ∑_{ℓ=1}^{α−1} f^(ℓ)(0) x^ℓ/ℓ! + ∫0^1 f^(α)(t) ψα(x − t) dt

143 / 183

Interpolation

• Consider the minimum-norm interpolation problem

min_{f∈H}  ‖f‖H   s.t.   f(xi) = yi, i = 1, . . . , n.   (20)

• Let K = (1/n)(K(xi, xj)) ∈ Rn×n be the kernel matrix, and y = (yi) ∈ Rn.

Proposition 5
Problem (20) is feasible iff y ∈ ran(K). In that case the (unique) optimal solution is

f = (1/√n) ∑_{i=1}^n αi K(·, xi),   where Kα = y/√n.
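• A minimal numerical sketch of Proposition 5, using a Gaussian kernel and made-up data (so that K is invertible and y ∈ ran(K) automatically):

    import numpy as np

    rng = np.random.default_rng(0)
    n = 10
    x = np.sort(rng.uniform(-3, 3, size=n))
    y = np.sin(x)
    kern = lambda a, b: np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2)

    K = kern(x, x) / n                        # kernel matrix K = (1/n)(K(x_i, x_j))
    alpha = np.linalg.solve(K, y / np.sqrt(n))
    f = lambda t: kern(t, x) @ alpha / np.sqrt(n)
    print(np.allclose(f(x), y))               # expected: True (exact interpolation)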

144 / 183


145 / 183

• Let Lx = span{K(·, xi) : i = 1, . . . , n} = {fα : α ∈ Rn} where

fα := (1/√n) ∑_{i=1}^n αi K(·, xi)

• Lx is a linear subspace of H with dim(Lx) ≤ n, hence closed.
• The minimum-norm solution should be in Lx:
• Take h ∈ H and let h = h1 + h⊥ be its decomposition w.r.t. Lx ⊕ Lx⊥.
• h⊥(xi) = 0 for all i, so h is feasible iff h1 is feasible.
• Assume h is feasible, hence h1 is feasible.
• ‖h‖² = ‖h1‖² + ‖h⊥‖². Hence, h⊥ ≠ 0 =⇒ ‖h‖ > ‖h1‖.
• =⇒ the minimum-norm solution must have h⊥ = 0, i.e., h ∈ Lx.
• The problem reduces to

min_{f∈Lx} ‖f‖H s.t. f(xi) = yi ∀i   ≡   min_{α∈Rn} ‖fα‖H s.t. fα(xi) = yi ∀i   ≡   min_{α∈Rn} αTKα s.t. Kα = y/√n   (21)

146 / 183

ERM with RKHS regularization

• Consider the following optimization problem:

min_{f∈H}  L(f(x1), . . . , f(xn)) + ω(‖f‖²H)   (22)

Theorem 28
Let H be an RKHS on X, ω : R → R be increasing, and L : Rn → R. Then, any minimizer f∗ of (22) is in the span of {kx1, . . . , kxn}.

• Let us write ℓx(f) = L(f(x1), . . . , f(xn)).
• Let Lx = span{K(·, xi) : i = 1, . . . , n}.
• Let f be optimal and write f = f1 + f⊥ where f1 ∈ Lx and f⊥ ⊥ Lx.
• f⊥(xi) = 0 for all i, hence ℓx(f) = ℓx(f1).
• ω(‖f‖²H) = ω(‖f1‖²H + ‖f⊥‖²H) ≥ ω(‖f1‖²H).
• Strict inequality unless ‖f⊥‖H = 0 =⇒ f⊥ = 0.

147 / 183

ERM with RKHS regularization

• Example: regularized ERM

min_{f∈H}  (1/n) ∑_{i=1}^n L(yi, f(xi)) + λ‖f‖²H   (23)

• We can restrict to solutions of the form

fα := (1/√n) ∑_{i=1}^n αi K(·, xi)

• We have ‖fα‖²H = αTKα, and

fα(xj) = ⟨K(·, xj), fα⟩H = (1/√n) ∑_{i=1}^n αi K(xj, xi) = √n (Kα)j

• The problem reduces to

min_{α∈Rn}  (1/n) ∑_{i=1}^n L(yi, √n(Kα)i) + λ αTKα   (24)

148 / 183

Special cases

• Kernel ridge regression (KRR):

min_{f∈H}  (1/n) ∑_{i=1}^n (yi − f(xi))² + λ‖f‖²H   (25)

• Can be thought of as solving the noisy interpolation problem yi = f∗(xi) + wi.
• The solution is fα with

α = (K + λIn)^{−1} y/√n

(a quick numerical check is sketched below).
• Kernel SVM:

min_{f∈H}  (1/n) ∑_{i=1}^n (1 − yi f(xi))+ + λ‖f‖²H   (26)

• Derive the dual as an exercise.
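• A quick numerical check of the KRR formula (a sketch, with a Gaussian kernel and made-up data): the stated α zeroes the gradient of the reduced objective (24) with squared loss.

    import numpy as np

    rng = np.random.default_rng(0)
    n, lam = 30, 0.1
    x = rng.uniform(-3, 3, size=n)
    y = np.sin(x) + 0.1 * rng.normal(size=n)
    K = np.exp(-0.5 * (x[:, None] - x[None, :]) ** 2) / n

    alpha = np.linalg.solve(K + lam * np.eye(n), y / np.sqrt(n))

    # gradient of (1/n)||y - sqrt(n) K a||^2 + lam a^T K a at a = alpha:
    grad = 2 * K @ (np.sqrt(n) * K @ alpha - y) / np.sqrt(n) + 2 * lam * K @ alpha
    print(np.allclose(grad, 0))                  # expected: True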

149 / 183


Mercer Theorem

Definition 4

Let X ⊂ Rd be compact. K : X × X → R is called a Mercer kernel if it iscontinuous and PSD, i.e., a continuous kernel function.

Lemma 5

Let K be a Mercer kernel, then H(K ) is separable.

• Since H(K ) is separable (has a countable dense subset) it has a countableorthonormal basis (ONB).

Lemma 6

Let K be a Mercer kernel and let {en : n ∈ N} be an ONB for H(K), with K(x, y) = ∑_{n=1}^∞ en(x) en(y). Then the series converges absolutely and uniformly.

150 / 183


151 / 183

• The integral operator associated with K, assuming X is equipped with a measure µ:

(TK f)(x) = ∫X K(x, y) f(y) dµ(y)

• If K ∈ L2(X × X, µ ⊗ µ), then TK is well-defined for all f ∈ L2(X, µ).
• Think of it as a continuous version of matrix multiplication.

Theorem 29
Let K : X × X → R be a Mercer kernel and µ a finite Borel measure on X. Then, the integral operator TK is a bounded map from L2(X, µ) into itself with

‖TK‖ ≤ ( ∫X ∫X K(x, y)² dµ(x) dµ(y) )^{1/2}

and every function in the range of TK is continuous.

• TK will be in fact a Hilbert-Schmidt operator.

152 / 183

Theorem 30 (Mercer)
• Let K be a Mercer kernel on X,
• µ a finite Borel measure with support X, and
• TK : L2(X, µ) → L2(X, µ) its associated integral operator.
• Then, there is a countable collection of continuous functions {en} on X, orthonormal in L2,
• that are eigenvectors of TK with corresponding eigenvalues {λn},
• such that for every f ∈ L2(X, µ),

TK f = ∑i λi ⟨f, ei⟩ ei.

• Furthermore, K(x, y) = ∑i λi ei(x) ei(y)
(with the convergence uniform and absolute).

• It follows that {ei : i ∈ N} is an orthogonal basis for H(K).
• The ei are not normalized in H(K).
• One can guess that {√λi ei} is an orthonormal basis for H(K).
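• A discrete analogue of Mercer's theorem, as a sketch (uniform grid on [0, 1] with the empirical measure; the min kernel is an arbitrary example): eigendecomposing the discretized TK reproduces the kernel matrix.

    import numpy as np

    m = 200
    x = np.linspace(0, 1, m)
    K = np.minimum(x[:, None], x[None, :])        # 1st-order Sobolev kernel min{s, t}
    lam, U = np.linalg.eigh(K / m)                # eigenpairs of the discretized T_K
    E = U * np.sqrt(m)                            # columns ~ eigenfunctions, orthonormal in L2(mu)
    print(np.allclose(K, E @ np.diag(lam) @ E.T)) # expected: True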

153 / 183

Mercer representation of the RKHS

Proposition 6
Under the assumptions of Mercer's theorem:

H(K) = { ∑i ai √λi ei : (ai) ∈ ℓ2 }.

If f = ∑i ai √λi ei and g = ∑i bi √λi ei, then ⟨f, g⟩H = ∑i ai bi.

154 / 183

Operations on kernels

Theorem 31 (Aronszajn's inclusion)
Let X be a set, and Ki : X × X → R, i = 1, 2, kernels on X. Then,

H(K1) ⊆ H(K2)   ⇐⇒   ∃c > 0 such that K1 ≤ c²K2.

Moreover, ‖f‖2 ≤ c‖f‖1 for all f ∈ H(K1).

• K1 ≤ c²K2 should be interpreted as: c²K2 − K1 is PSD.

155 / 183

Theorem 32 (Aronszajn's sums of kernels)
Let Hi be a RKHS with kernel Ki and norm ‖·‖i, for i = 1, 2. Let K = K1 + K2. Then,

H(K) = { f1 + f2 : fi ∈ Hi, i = 1, 2 }

with norm ‖·‖ given by

‖f‖² = min{ ‖f1‖1² + ‖f2‖2² : f = f1 + f2, fi ∈ Hi, i = 1, 2 }

• If H1 ∩ H2 = {0}, there is a unique representation f = f1 + f2, hence

‖f‖²H(K) = ‖f1‖1² + ‖f2‖2².

156 / 183

Example 7
• Recall the 1st-order Sobolev kernel K2(s, t) = min{s, t} with RKHS

H2 = { f : [0, 1] → R | f(0) = 0, f is abs. cont., f′ ∈ L2([0, 1]) }

• Let H1 = span(1), the space of constant functions.
• H1 is a RKHS: ⟨f, g⟩1 = f(0)g(0) and K1(s, t) = 1.
• Note H1 ∩ H2 = {0}.
• Hence, K(s, t) = 1 + min{s, t} generates the RKHS

H = { f : [0, 1] → R : f = c1 + f2, f2 ∈ H2, c ∈ R }
  = { f : [0, 1] → R : f is abs. cont., f′ ∈ L2([0, 1]) }

with norm

‖f‖²H = f²(0) + ∫0^1 [f′(t)]² dt

157 / 183

• The pull-back of a kernel K by a map φ is

(K ∘ φ)(s, t) = K(φ(s), φ(t)),

which is itself a kernel.

Theorem 33 (Pull-back)
• Let X and S be sets.
• Let φ : S → X be a function and K : X × X → R a kernel.
• Then, H(K ∘ φ) = { f ∘ φ : f ∈ H(K) } and

‖u‖H(K∘φ) = min{ ‖f‖H(K) : u = f ∘ φ }

158 / 183

Other results:
• Tensor product of kernels:

K((x, s), (y, t)) = K1(x, y) K2(s, t),

denoted K1 ⊗ K2.
• Product of kernels:

K(x, y) = K1(x, y) K2(x, y),

denoted K1 · K2.
• Letting ∆(x) = (x, x) be the diagonal map, we have

K1 · K2 = (K1 ⊗ K2) ∘ ∆.

159 / 183

Graphical models

• Graphical models: encoding conditional independence assumptions using graph separation.
• Different kinds of graphs (and graph separations) encode different kinds of conditional independence assumptions.
• Two common categories are directed and undirected graphical models.
• The directed case is often called a Bayesian network, and is used to model causal relations.

160 / 183

Properties of conditional independence

Symmetry: X ⊥ Y | Z =⇒ Y ⊥ X | Z.

Decomposition: X ⊥ (Y, W) | Z =⇒ X ⊥ Y | Z.

Weak union: X ⊥ (Y ,W ) | Z =⇒ X ⊥ Y | Z ,W .

The following only holds for strictly positive distributions:

Intersection: X ⊥ Y | (W ,Z ), and X ⊥W | (Y ,Z ) =⇒ X ⊥ (Y ,W ) | Z

161 / 183

Directed graphical models

[DAG over {D, I, G, S, L} with edges D → G, I → G, I → S, G → L.]

• D: Difficulty (Stat 100C)

• I: Intelligence

• G: Grade (Stat 100C)

• L: (Recom.) Letter

• S: SAT Score

162 / 183

Definition 5 (BN structure)
• A Bayesian network structure/graph is a directed acyclic graph (DAG) G.
• Nodes represent random variables X1, . . . , Xn.
• PaG_Xi: parents of Xi in G.
• NdG_Xi: non-descendants of Xi in G.
• G encodes the following set of conditional independence assumptions:

Xi ⊥ NdG_Xi | PaG_Xi,   ∀i,

called local independencies, denoted Iℓ(G).

• Also called directed local Markov property.

163 / 183

Definition 6
For a distribution P, let I(P) be the set of (conditional) independence assertions of the form (X ⊥ Y | Z) that hold in P.

Definition 7
We say that G is an I-map for P whenever Iℓ(G) ⊆ I(P).

• I.e., G is an I-map for P whenever P is locally Markov w.r.t. G.

164 / 183

Definition 8 (Factorization)
• Let G be a BN graph over variables X = (X1, . . . , Xn).
• We say that a distribution P factorizes according to G if P can be expressed as

P(X1, . . . , Xn) = ∏_{i=1}^n P(Xi | PaG_Xi)

• This is the chain rule of BNs.
• The P(Xi | PaG_Xi) are the local probabilistic models.

Definition 9
A Bayesian network is a pair (G, P) where P factorizes over G.

165 / 183

I-map to factorization

Theorem 34
G a BN structure over X, and P a joint distribution for X:

G is an I-map for P   =⇒   P factorizes according to G

• Assume X1, . . . , Xn are in a topological ordering.
• {X1, . . . , Xi−1} = PaXi ∪ Z where Z ⊆ NdXi.
• By assumption Xi ⊥ NdXi | PaXi, hence Xi ⊥ Z | PaXi.
• That is, P(Xi | X1, . . . , Xi−1) = P(Xi | PaXi), hence

P(X1, . . . , Xn) = ∏_{i=1}^n P(Xi | X1, . . . , Xi−1) = ∏_{i=1}^n P(Xi | PaXi).

166 / 183

Factorization to I-map

Theorem 35
G a BN structure over X, and P a joint distribution for X:

P factorizes according to G   =⇒   G is an I-map for P

Conclusion:
• P factorizes over G iff G is an I-map for P. In other words,

P factorizes according to G   ⇐⇒   Iℓ(G) ⊆ I(P)

167 / 183

d-separation

• Are there other independencies that can be "read off from G", besides the local ones?
• Suppose that P factorizes according to G; then Iℓ(G) ⊆ I(P).
• Is there a larger set I(G) such that

Iℓ(G) ⊆ I(P)   =⇒   Iℓ(G) ⊆ I(G) ⊆ I(P)?

• Yes.

• Need to characterize information flow over the network.

168 / 183

• When can we have X ⊥ Y | Z?
• Consider the following four cases:

Causal trail      X → Z → Y
Evidential trail  X ← Z ← Y
Common cause      X ← Z → Y
Common effect     X → Z ← Y   (a.k.a. v-structure or collider)

• A trail is active if information flows through it, otherwise it is blocked.
• The first three are blocked iff Z is observed.
• A v-structure is active iff either Z or one of its descendants is observed.

Definition 10
• G a BN structure,
• X1 – X2 – · · · – Xn a trail in G (edges in either direction),
• Z a subset of nodes.
The trail X1 – X2 – · · · – Xn is active given Z if
• for every v-structure Xi−1 → Xi ← Xi+1 along the trail, either Xi or one of its descendants is in Z, and
• no other node along the trail is in Z.
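• A small simulation of the collider case (a sketch; the noisy-XOR mechanism below is an arbitrary choice): X and Y are marginally independent but become dependent once Z is observed.

    import numpy as np

    rng = np.random.default_rng(0)
    N = 200_000
    X = rng.integers(0, 2, N)
    Y = rng.integers(0, 2, N)
    Z = (X ^ Y) & (rng.random(N) < 0.9)      # collider: Z depends on both parents

    corr = lambda a, b: np.corrcoef(a, b)[0, 1]
    print(round(corr(X, Y), 3))              # ~ 0 : X and Y marginally independent
    sel = Z == 1
    print(round(corr(X[sel], Y[sel]), 3))    # strongly negative: dependent given Z = 1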

169 / 183

[The same DAG over {D, I, G, S, L}: D → G ← I → S, G → L.]

• Consider the trail D → G ← I → S.
• For Z = ∅: not active, since the v-structure D → G ← I is not active.
• For Z = {L}: active, since the v-structure is active (L is a descendant of G), and no other node blocks.
• For Z = {L, I}: not active; observing I blocks G ← I → S.

170 / 183

Definition 11
• X, Y, Z three sets of variables.
• X, Y are d-separated given Z, denoted d-sepG(X; Y | Z), if,
• given Z, there is no active trail with endpoints in X and Y. Define

I(G) := { (X ⊥ Y | Z) : d-sepG(X; Y | Z) }

• I(G) can be thought of as the set of directed global Markov properties.

Theorem 36
If P factorizes according to G, then I(G) ⊆ I(P).

• Equivalent statement: Iℓ(G) ⊆ I(P) =⇒ I(G) ⊆ I(P).
• (The reverse direction is true and clear, since Iℓ(G) ⊆ I(G). Why?)

171 / 183

Definition 12
A distribution P is faithful to G whenever

(X ⊥ Y | Z) ∈ I(P)   =⇒   d-sepG(X; Y | Z),

that is, I(P) ⊆ I(G).

• Conclusion: if P factorizes over G and P is faithful to G, then I(P) = I(G).
• Not all distributions that factorize over G are faithful to G.
• Any product measure P on X (i.e., all variables unconditionally independent) factorizes over any graph G, but is faithful only to the empty graph.

Definition 13 (P-map)

G is a perfect map (P-map) for P whenever I(P) = I(G ).

172 / 183

• Is there anything even bigger than I(G) contained in every I(P) for P factorizing according to G?
• No.

Theorem 37
Let G be a BN structure. If X and Y are not d-separated in G given Z, then they are dependent given Z in some distribution that factorizes over G.

• Equivalently,

⋂_{P factorizes over G} I(P) = I(G)

173 / 183

I-Equivalence

• Can we have G1 ≠ G2 but I(G1) = I(G2)? Yes.
• G1 and G2 are called I-equivalent whenever I(G1) = I(G2).
• Also called Markov equivalent.
• A proper equivalence relation;
• it partitions the set of DAGs (into Markov equivalence classes).
• Among the four three-node trail DAGs, the first three are equivalent:

[The chain X → Z → Y, the chain X ← Z ← Y, the fork X ← Z → Y, and the collider X → Z ← Y.]

• X → Y and Y → X are equivalent. (Both encode no independence.)

174 / 183

Definition 14
The skeleton of a DAG is the undirected graph obtained by removing edge directions.

Theorem 38
If G1 and G2 have the same skeleton and the same v-structures, then G1 and G2 are equivalent.

• Same skeleton and v-structures is sufficient but not necessary.
• Consider complete DAGs (i.e., acyclic/transitive tournaments): they encode no independencies, yet their v-structures differ.

Definition 15
A v-structure X → Z ← Y is an immorality if there is no edge between X and Y.

Theorem 39
G1 and G2 are I-equivalent iff they have the same skeleton and the same immoralities.

• Complete DAGs have no immoralities (but possibly many v -structures.)

175 / 183

Definition 16
An edge X → Y in G is covered if PaG_Y = PaG_X ∪ {X}.

Theorem 40
G and G′ are I-equivalent iff there exists a sequence of networks G = G1, . . . , Gk = G′, all I-equivalent to G, such that Gi+1 differs from Gi only in the reversal of a single covered edge.

176 / 183

Definition 17
G is a minimal I-map for P if G is an I-map for P (i.e., I(G) ⊆ I(P)) and removing any edge from G renders it not an I-map.

• A complete DAG K is an I-map for any P (since I(K) = ∅) but often not a minimal I-map.

How to get a minimal I-map
• Fix an ordering.
• For every i, choose U ⊆ X1^{i−1} := {X1, . . . , Xi−1} to be a minimal subset satisfying

Xi ⊥ (X1^{i−1} \ U) | U;

minimal means that removing any element from U violates the property.
• U will be PaXi.
• There could be multiple choices. The choice is unique if P is positive.
• Every ordering could potentially lead to a different minimal I-map (even if P is positive).

177 / 183

[Two DAGs over {D, I, G, S, L}: (a) the original network Ga; (b) the minimal I-map Gb for the ordering L, S, G, I, D.]

• (a) corresponds to the ordering D, I, G, S, L. (There are multiple other topological orderings.)
• Suppose that P factorizes over Ga and is faithful to Ga.
• I.e., Ga is a P-map for P, or equivalently I(Ga) = I(P).
• Then (b) is the minimal I-map associated with the ordering L, S, G, I, D.
• I(Gb) ⊂ I(P), but I(Gb) ≠ I(P).
• Easiest to see by noting that Ga and Gb are not Markov equivalent:
• I(Ga) ≠ I(Gb) since they have different skeletons.

178 / 183

Definition 18
A Gibbs distribution over X1, . . . , Xn with factors Φ = (φ1(XS1), . . . , φk(XSk)), where Si ⊂ [n], is

PΦ(X1, . . . , Xn) = (1/Z) ∏i φi(XSi).

• Example: PΦ(X1, X2, X3) = (1/Z) φ1(X1, X2) φ2(X2, X3).

Definition 19 (Markov network)
PΦ with factors Φ = (φ1(XS1), . . . , φk(XSk)) factorizes over an undirected graph G if each of S1, . . . , Sk is a clique of G.

• Clique = complete (induced) subgraph.
• PΦ is Markov w.r.t. G, or G is a Markov network for PΦ.
• φi(XSi) is called a clique potential.
• Can reduce to maximal cliques. (Possible loss of information.)

179 / 183

Markov properties

Definition 20
Let G be a Markov network (undirected graph).
A path X1 − X2 − · · · − Xn in G is active given Z if none of the Xi is in Z.

Definition 21
Z separates X and Y in G, denoted sepG(X; Y | Z), if there is no active path given Z with endpoints in X and Y.
The global independencies associated with G are defined as

I(G) = { (X ⊥ Y | Z) : sepG(X; Y | Z) }

• This is the usual notion of separation: every path between X and Y is blocked by some node in Z.

• I(G ) is the set of global Markov properties/independencies.

180 / 183

Theorem 41
Let G be an undirected graph (Markov network). Then,

P factorizes over G   =⇒   G is an I-map for P

• The converse is not always true.

Theorem 42 (Hammersley–Clifford)
Let G be an undirected graph (Markov network). If P is a positive distribution, then

G is an I-map for P   =⇒   P factorizes over G

181 / 183

• Assume that Z separates Y1 and Y2.
• Also assume that Y1, Y2, Z partition the whole set of variables X,
• i.e., Y1 ∪ Y2 ∪ Z = X.
• Every clique of G is fully contained in either Y1 ∪ Z or Y2 ∪ Z, hence

P(X1, . . . , Xn) ∝ ∏_{i∈I1} φi(XSi) ∏_{i∈I2} φi(XSi) = f1(Y1, Z) f2(Y2, Z)

where Si ⊆ Yr ∪ Z for i ∈ Ir and r = 1, 2.

• This proves Y1 ⊥ Y2 | Z under P.

• In general let U = X \ (Y1 ∪ Y2 ∪ Z ).

• Then, U can be partitioned into U = U1 ∪ U2 such that

• Z separates Y1 ∪ U1 from Y2 ∪ U2.

• It follows that (Y1 ∪ U1) ⊥ (Y2 ∪ U2) | Z ,

• hence Y1 ⊥ Y2 | Z .

182 / 183

Local Markov properties

• Let X = {X1, . . . , Xn}.
• Pairwise Markov properties associated with G:

Ip(G) = { (X ⊥ Y | X \ {X, Y}) : X − Y ∉ G }

• Local Markov properties associated with G:

Iℓ(G) = { (X ⊥ X \ clG(X) | bdG(X)) : X ∈ X }

• bd(X) = boundary of X = Markov blanket of X = neighbors of X.
• cl(X) = {X} ∪ bd(X).
• For any distribution: global =⇒ local =⇒ pairwise.

• The three are equivalent for positive distributions.

183 / 183