
Machine Learning: A Statistics and Optimization Perspective

Nan Ye

Mathematical Sciences School, Queensland University of Technology

1 / 109


What is Machine Learning?

2 / 109


Machine Learning

• Machine learning turns data into insight, predictions and/or decisions.

• Numerous applications in diverse areas, including natural language processing, computer vision, recommender systems, and medical diagnosis.

3 / 109


A Much Sought-after Technology

4 / 109


Enabled Applications

Make reminders by talking to your phone.

5 / 109


Tell the car where you want to go, and the car takes you there.

6 / 109


Check your email, see some spam in your inbox, mark it as spam, and similar spam will no longer show up.

7 / 109


Video recommendations.

8 / 109


Play Go against the computer.

9 / 109


Tutorial Objective

Essentials for crafting basic machine learning systems.

Formulate applications as machine learning problems

Classification, regression, density estimation, clustering...

Understand and apply basic learning algorithms

Least squares regression, logistic regression, support vector machines, K-means, ...

Theoretical understanding

Position and compare the problems and algorithms in a unifying statistical framework

Have fun...

10 / 109


Outline

• A statistics and optimization perspective

• Statistical learning theory

• Regression

• Model selection

• Classification

• Clustering

11 / 109


Hands-on

• An exercise on using WEKA.

WEKA (Java): http://www.cs.waikato.ac.nz/ml/weka/
H2O (Java): http://www.h2o.ai/
scikit-learn (Python): http://scikit-learn.org/
CRAN (R): https://cran.r-project.org/web/views/MachineLearning.html

• Some technical details are left as exercises. These are tagged with (verify).

12 / 109


A Statistics and Optimization Perspective

Illustrations

• Learning a binomial distribution

• Learning a Gaussian distribution

13 / 109


Learning a Binomial Distribution

I pick a coin with the probability of heads being θ. I flip it 100 times for you, and you see a dataset D of 70 heads and 30 tails. Can you learn θ?

Maximum likelihood estimation

The likelihood of θ is

P(D | θ) = θ^70 (1 − θ)^30.

Learning θ is an optimization problem.

θ_ml = arg max_θ P(D | θ) = arg max_θ ln P(D | θ) = arg max_θ (70 ln θ + 30 ln(1 − θ)).

14 / 109



θ_ml = arg max_θ (70 ln θ + 30 ln(1 − θ)).

Setting the derivative of the log-likelihood to 0,

70/θ − 30/(1 − θ) = 0,

we have

θ_ml = 70/(70 + 30).

15 / 109
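The same estimate can be checked numerically. A minimal sketch (not from the slides; it assumes NumPy and SciPy are available):

```python
import numpy as np
from scipy.optimize import minimize_scalar

heads, tails = 70, 30

def neg_log_likelihood(theta):
    # -(70 ln(theta) + 30 ln(1 - theta))
    return -(heads * np.log(theta) + tails * np.log(1 - theta))

result = minimize_scalar(neg_log_likelihood, bounds=(1e-6, 1 - 1e-6), method="bounded")
print(result.x)  # ~0.7, matching 70 / (70 + 30)
```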


Learning a Gaussian distribution

I pick a Gaussian N(µ, σ²) and give you a dataset D = {x_1, . . . , x_n} independently drawn from it. Can you learn µ and σ?

[Figure: Gaussian density f(X) plotted against X]

P(x | µ, σ) = (1/(σ√(2π))) exp(−(x − µ)²/(2σ²)).

16 / 109



Maximum likelihood estimation

ln P(D | µ, σ) = ln [ (1/(σ√(2π)))^n exp(−∑_i (x_i − µ)²/(2σ²)) ]
             = −n ln(σ√(2π)) − ∑_i (x_i − µ)²/(2σ²).

Set the derivative w.r.t. µ to 0:

−∑_i (x_i − µ)/σ² = 0  ⇒  µ_ml = (1/n) ∑_i x_i.

Set the derivative w.r.t. σ to 0:

−n/σ + ∑_i (x_i − µ)²/σ³ = 0  ⇒  σ²_ml = (1/n) ∑_{i=1}^n (x_i − µ_ml)².

17 / 109
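A small NumPy sketch of these estimators (illustrative only; the "true" parameters below are made up):

```python
import numpy as np

rng = np.random.default_rng(0)
mu_true, sigma_true = 2.0, 1.5                   # assumed values, for illustration
x = rng.normal(mu_true, sigma_true, size=1000)   # the dataset D

mu_ml = x.mean()                        # (1/n) sum_i x_i
sigma2_ml = np.mean((x - mu_ml) ** 2)   # (1/n) sum_i (x_i - mu_ml)^2
print(mu_ml, np.sqrt(sigma2_ml))        # close to 2.0 and 1.5
```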


What You Need to Know...

Learning is...

• Collect some data, e.g. coin flips.

• Choose a hypothesis class, e.g. binomial distribution.

• Choose a loss function, e.g. negative log-likelihood.

• Choose an optimization procedure, e.g. set derivative to 0.

• Have fun...

Statistics and optimization provide powerful tools for formulating and solving machine learning problems.

18 / 109


Statistical Learning Theory

• The framework

• Applications in classification, regression, and density estimation

• Does empirical risk minimization work?

19 / 109


There is nothing more practical than a good theory. (Kurt Lewin)

...at least in the problems of statistical inference. (Vladimir Vapnik)

20 / 109


Learning...

• H. Simon: Any process by which a system improves its performance.

• M. Minsky: Learning is making useful changes in our minds.

• R. Michalski: Learning is constructing or modifying representations of what is being experienced.

• L. Valiant: Learning is the process of knowledge acquisition in the absence of explicit programming.

21 / 109


A Probabilistic Framework

Data

Training examples z_1, . . . , z_n are drawn i.i.d. from a fixed but unknown distribution P(Z) on Z.

e.g. outcomes of coin flips.

Hypothesis space H

e.g. head probability θ ∈ [0, 1].

Loss function

L(z, h) measures the penalty for hypothesis h on example z.

e.g. log-loss L(z, θ) = − ln P(z | θ) = − ln(θ) if z = H, and − ln(1 − θ) if z = T.

22 / 109


Expected risk

• The expected risk of h is E(L(Z, h)).

• We want to find the hypothesis with minimum expected risk,

arg min_{h∈H} E(L(Z, h)).

Empirical risk minimization (ERM)

Minimize the empirical risk R_n(h) := (1/n) ∑_i L(z_i, h) over h ∈ H.

e.g. choose θ to minimize −70 ln θ − 30 ln(1 − θ).

23 / 109


This provides a unified formulation for many machine learning problems, which differ in

• the data domain Z,

• the choice of the hypothesis space H, and

• the choice of loss function L.

Most algorithms we will see later can be viewed as special cases of ERM.

24 / 109


Classification

Predict a discrete class.

Digit recognition: image to {0, 1, . . . , 9}.

Spam filter: email to {spam, not spam}.

25 / 109


Given D = {(x_1, y_1), . . . , (x_n, y_n)} ⊆ X × Y, find a classifier f that maps an input x ∈ X to a class y ∈ Y.

We usually use the 0/1 loss

L((x, y), h) = I(h(x) ≠ y) = 1 if h(x) ≠ y, and 0 if h(x) = y.

ERM chooses the classifier with minimum classification error

min_{h∈H} (1/n) ∑_i I(h(x_i) ≠ y_i).

26 / 109


Regression

Predict a numerical value.

Stock market prediction: predict stock price using recent trading data

27 / 109


Given D = {(x_1, y_1), . . . , (x_n, y_n)} ⊆ X × R, find a function f that maps an input x ∈ X to a value y ∈ R.

We usually use the quadratic loss

L((x, y), h) = (y − h(x))².

ERM is often called the method of least squares in this case:

min_{h∈H} (1/n) ∑_i (y_i − h(x_i))².

28 / 109


Density Estimation

E.g. learning a binomial distribution, or a Gaussian distribution.


We often use the log-loss

L(x , h) = − ln p(x | h).

ERM is MLE in this case.

29 / 109


Does ERM Work?

Estimation error

• How does the empirically best hypothesis h_n = arg min_{h∈H} R_n(h) compare with the best in the hypothesis space? Specifically, how large is the estimation error R(h_n) − inf_{h∈H} R(h)?

• Consistency: Does R(h_n) converge to inf_{h∈H} R(h) as n → ∞?

If |H| is finite, ERM is likely to pick the function with minimal expected risk when n is large, because then R_n(h) is close to R(h) for all h ∈ H.

If |H| is infinite, we can still show that ERM is likely to choose a near-optimal hypothesis if H has finite complexity (such as VC-dimension).

30 / 109



Approximation error

How good is the best hypothesis in H? That is, how large is the approximation error inf_{h∈H} R(h) − inf_h R(h)?

Trade-off between estimation error and approximation error:

• A larger hypothesis space implies smaller approximation error, but larger estimation error.

• A smaller hypothesis space implies larger approximation error, but smaller estimation error.

31 / 109



Optimization error

Does the optimization algorithm compute the empirically best hypothesis exactly?

While ERM can be efficiently implemented in many cases, there are also computationally intractable cases, and efficient approximations are sought. The performance gap between the sub-optimal hypothesis and the empirically best hypothesis is the optimization error.

32 / 109



What You Need to Know...

Recognise machine learning problems as special cases of the general statistical learning problem.

Understand that the performance of ERM depends on the approximation error, estimation error and optimization error.

33 / 109


Regression

• Ordinary least squares

• Ridge regression

• Basis function method

• Regression function

• Nearest neighbor regression

• Kernel regression

• Classification as regression

34 / 109


Ordinary Least Squares

Find the best-fitting hyperplane for (x_1, y_1), . . . , (x_n, y_n) ∈ R^d × R.

35 / 109


OLS finds a hyperplane minimizing the sum of squared errors

β_n = arg min_{β∈R^d} ∑_{i=1}^n (x_i^T β − y_i)².

A special case of function learning using ERM:

• The input set is X = R^d, and the output set is Y = R.

• The hypothesis space is the set of hyperplanes H = {x^T β : β ∈ R^d}.

• Quadratic loss is used, as is typical in regression.

36 / 109


Empirically best hyperplane. The solution to OLS is

β_n = (X^T X)^{-1} X^T y,

where X is the n × d matrix with x_i as the i-th row, and y = (y_1, . . . , y_n)^T.

The formula holds when X^T X is non-singular. When X^T X is singular, there are infinitely many possible values for β_n. They can be obtained by solving the linear system (X^T X)β = X^T y.

37 / 109
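A minimal NumPy sketch of this formula on synthetic data (the data and names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 3
X = rng.normal(size=(n, d))                    # rows are the inputs x_i
beta_true = np.array([1.0, -2.0, 0.5])
y = X @ beta_true + 0.1 * rng.normal(size=n)

# Solve (X^T X) beta = X^T y instead of forming the inverse explicitly.
beta_n = np.linalg.solve(X.T @ X, X.T @ y)
print(beta_n)   # close to beta_true
```

np.linalg.lstsq would also handle the singular case mentioned above.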


Proof. The empirical risk is (ignoring a factor of 1/n)

R_n(β) = ∑_{i=1}^n (x_i^T β − y_i)² = ||Xβ − y||₂².

Setting the gradient of R_n to 0,

∇R_n = 2X^T (Xβ − y) = 0, (verify)

we have

β_n = (X^T X)^{-1} X^T y.

38 / 109


Optimal hyperplane. The hyperplane β* = E(XX^T)^{-1} E(XY) minimizes the expected quadratic loss among all hyperplanes.

Proof. The expected quadratic loss of a hyperplane β is

R(β) = E((β^T X − Y)²)
     = E(β^T XX^T β − 2β^T XY + Y²)
     = β^T E(XX^T)β − 2β^T E(XY) + E(Y²).

Setting the gradient of R to 0,

∇R(β) = 2E(XX^T)β − 2E(XY) = 0  ⇒  β* = E(XX^T)^{-1} E(XY).

39 / 109



Consistency. We can show that least squares linear regression is consistent, that is, R(β_n) → R(β*) in probability, by using the law of large numbers.

40 / 109


Least Squares as MLE

• Consider the class of conditional distributions {p_β(Y | X) : β ∈ R^d}, where

p_β(Y | X = x) = N(Y; x^T β, σ) := (1/(√(2π)σ)) exp(−(Y − x^T β)²/(2σ²)),

with σ being a constant.

• The (conditional) likelihood of β is

L_n(β) = p_β(y_1 | x_1) · · · p_β(y_n | x_n).

• Maximizing the likelihood L_n(β) gives the same β_n as given by the method of least squares. (verify)

41 / 109


Ridge Regression

When collinearity is present, the matrix X^T X may be singular or close to singular, making the solution unreliable.

Ridge regression

We add a quadratic/ℓ₂ regularizer λ||β||₂² to the OLS objective, where λ > 0 is a fixed constant:

β_n = arg min_{β∈R^d} ( ∑_{i=1}^n (x_i^T β − y_i)² + λ||β||₂² ).

Empirically optimal hyperplane

β_n = (λI + X^T X)^{-1} X^T y. (verify)

The matrix λI + X^T X is non-singular (verify), and thus there is always a unique solution.

42 / 109
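A sketch of the ridge solution (assuming NumPy; the collinear toy data is made up to show that λ > 0 keeps the system non-singular):

```python
import numpy as np

def ridge(X, y, lam):
    # beta_n = (lam*I + X^T X)^{-1} X^T y, computed via a linear solve
    d = X.shape[1]
    return np.linalg.solve(lam * np.eye(d) + X.T @ X, X.T @ y)

X = np.array([[1.0, 2.0], [2.0, 4.0], [3.0, 6.0]])  # perfectly collinear columns
y = np.array([1.0, 2.0, 3.0])
print(ridge(X, y, lam=0.1))   # unique solution despite X^T X being singular
```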


Regression with a Bias

So far we have only considered hyperplanes of the form y = x^T β, which pass through the origin (green line).

Considering hyperplanes with a bias term, that is, hyperplanes of the form y = x^T β + b, is more useful (red line).

43 / 109


OLS with a bias solves

(b_n, β_n) = arg min_{b∈R, β∈R^d} ∑_{i=1}^n (x_i^T β + b − y_i)²,

where b is the bias.

Solution. Reduce it to regression without a bias term by replacing x^T β + b with (1, x^T)(b, β)^T, that is, by prepending a constant 1 to each input and b to β.

44 / 109



Ridge regression with a bias solves

(b_n, β_n) = arg min_{b∈R, β∈R^d} ( ∑_{i=1}^n (x_i^T β + b − y_i)² + λ||β||₂² ).

Solution. Reduce it to ridge regression without a bias term as follows. Let x̃_i = x_i − x̄ and ỹ_i = y_i − ȳ, where x̄ = ∑_{i=1}^n x_i/n and ȳ = ∑_{i=1}^n y_i/n. Then

β_n = arg min_{β∈R^d} ( ∑_{i=1}^n (x̃_i^T β − ỹ_i)² + λ||β||₂² ),

b_n = ȳ − x̄^T β_n. (verify)

45 / 109



Basis Function Method

We can use linear regression to learn complex regression functions

• Choose some basis functions g1, . . . , gk : Rd → R.

• Transform each input x to (g1(x), . . . , gk(x)).

• Perform linear regression on the transformed data.

Examples

• Linear regression: use basis functions g_1, . . . , g_d with g_i(x) = x_i, and g_0(x) = 1.

• Quadratic functions: use basis functions of the above form, together with basis functions of the form g_ij(x) = x_i x_j for all 1 ≤ i ≤ j ≤ d.

46 / 109
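A rough sketch of the quadratic-basis example (illustrative only; the target function below is made up):

```python
import numpy as np

def quadratic_basis(x):
    # Map x in R^d to (1, x_1, ..., x_d, x_i*x_j for all 1 <= i <= j <= d).
    d = len(x)
    feats = [1.0] + list(x) + [x[i] * x[j] for i in range(d) for j in range(i, d)]
    return np.array(feats)

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = 1.0 + X[:, 0] - 2.0 * X[:, 1] + 3.0 * X[:, 0] * X[:, 1]   # a quadratic target

Phi = np.vstack([quadratic_basis(x) for x in X])   # transformed inputs
beta = np.linalg.solve(Phi.T @ Phi, Phi.T @ y)     # ordinary least squares on Phi
print(beta)
```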



Regression Function

The minimizer of the expected quadratic loss is the regression function

h∗(x) = E(Y | x).

Proof. The expected quadratic loss of a function h is

E((h(X )− Y )2) = EX

(h(X )2 − 2h(X )E(Y | X ) + E(Y 2 | X )

).

Hence we can set the value of h(x) independently for each x bychoosing it to minimize the expression under expectation. This leads toh∗(x) = E(Y | x).

47 / 109



k nearest neighbor (kNN)

kNN approximates the regression function using

h_n(x) = avg(y_i | x_i ∈ N_k(x)),

which is the average of the values of the set N_k(x) of the k nearest neighbors of x in the training data.

• Under mild conditions, as k → ∞ and n/k → ∞, h_n(x) → h*(x), for any distribution P(X, Y).

• (Curse of dimensionality) The number of samples required for accurate approximation is exponential in the dimension.

• kNN is a non-parametric method, while linear regression is a parametric method.

48 / 109
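A small sketch of kNN regression (assuming NumPy; the data and k are illustrative):

```python
import numpy as np

def knn_regress(x, X_train, y_train, k):
    # Average the y-values of the k nearest training points (Euclidean distance).
    dists = np.linalg.norm(X_train - x, axis=1)
    nearest = np.argsort(dists)[:k]
    return y_train[nearest].mean()

rng = np.random.default_rng(0)
X_train = rng.uniform(-3, 3, size=(500, 1))
y_train = np.sin(X_train[:, 0]) + 0.1 * rng.normal(size=500)
print(knn_regress(np.array([1.0]), X_train, y_train, k=20))   # roughly sin(1)
```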



Kernel Regression

h_n(x) = ∑_{i=1}^n K(x, x_i) y_i / ∑_{i=1}^n K(x, x_i),

where K(x, x′) is a function measuring the similarity between x and x′, and is often called a kernel function.

Example kernel functions

• Gaussian kernel K_λ(x, x′) = (1/λ) exp(−||x′ − x||₂²/(2λ)).

• kNN kernel K_k(x, x′) = I(||x′ − x|| ≤ max_{x″∈N_k(x)} ||x″ − x||). Note that this kernel is data-dependent and non-symmetric.

49 / 109
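A sketch of kernel regression with the Gaussian kernel above (the names and the bandwidth λ are illustrative):

```python
import numpy as np

def gaussian_kernel(x, X, lam=0.5):
    # K_lambda(x, x') = (1/lambda) exp(-||x' - x||^2 / (2*lambda)), for each row x' of X
    return np.exp(-np.sum((X - x) ** 2, axis=1) / (2.0 * lam)) / lam

def kernel_regress(x, X_train, y_train, lam=0.5):
    w = gaussian_kernel(x, X_train, lam)      # similarity of x to every training input
    return np.sum(w * y_train) / np.sum(w)    # similarity-weighted average of the y_i
```

With the kNN kernel, this weighted average reduces to the kNN regressor of the previous slide.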



Binary Classification as Regression

• Label one class as −1 and the other as +1.

• Fit a function f (x) using least squares regression.

• Given a test example x, predict −1 if f (x) < 0 and +1 otherwise.

50 / 109


What You Need to Know...

• Regression function.

• Parametric methods: ordinary least squares, ridge regression, basis function method.

• Non-parametric methods: kNN, kernel regression.

51 / 109


Model Selection

52 / 109


Bias-Variance Tradeoff

The predicted value Y′ at a fixed point x can be considered a random function of the training set. Let Y be the true value at x. The expected prediction error E((Y′ − Y)²) is a property of the model class.

Bias-variance decomposition

E((Y′ − Y)²) = E((Y′ − E(Y′))²) + (E(Y′) − E(Y))² + E((Y − E(Y))²),

i.e., expected prediction error = variance + bias (squared) + irreducible noise.

Proof. Expand the RHS and simplify.

Bias-variance tradeoff

In general, as model complexity increases (i.e., the hypothesis becomes more complex), variance tends to increase, and bias tends to decrease.

53 / 109



Bias-variance Tradeoff in kNN

Assumption

Suppose Y | X ∼ N(f(X), σ) for some function f and some fixed σ. In addition, suppose x_1, . . . , x_n are fixed.

Bias and variance

At x, Y′ = (1/k) ∑_{x_i∈N_k(x)} y_i is predicted and the true value is Y.

bias = E(Y′) − E(Y) = (1/k) ∑_{x_i∈N_k(x)} f(x_i) − f(x),

variance = E((Y′ − E(Y′))²) = σ²/k.

54 / 109



bias = (1/k) ∑_{x_i∈N_k(x)} f(x_i) − f(x),    variance = σ²/k.

1/k as a complexity measure

This is because with smaller 1/k, h_n(x) = avg(y_i | x_i ∈ N_k(x)) is closer to a constant, and thus model complexity is smaller.

Bias-variance trade-off

As 1/k increases (or as model complexity increases), bias is likely to decrease, and variance increases.

55 / 109


56 / 109


Model Selection

Assume model complexity is controlled by some parameter θ. How to pick the best value among candidates θ_0, . . . , θ_m?

Using a development set

• Split the available data into a training set T and a development set D.

• For each θ_i, train a model on T, and test it on D.

• Choose the parameter with the best performance.

This requires a lot of data, while the available amount may be limited.

57 / 109



K-fold cross validation

• Split the training data into K folds.

• For each θ_i, train K models, with each trained on K − 1 folds and tested on the remaining fold.

• Choose the parameter with best average performance.

Computationally more expensive than using a development set.

58 / 109
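A minimal sketch of K-fold cross-validation for choosing a parameter (assuming NumPy; fit and predict are placeholders for any learning algorithm, e.g. ridge regression with candidate λ's):

```python
import numpy as np

def kfold_cv_error(X, y, fit, predict, K=5, seed=0):
    # Average validation error over K folds.
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), K)
    errors = []
    for k in range(K):
        val = folds[k]
        train = np.concatenate([folds[j] for j in range(K) if j != k])
        model = fit(X[train], y[train])
        errors.append(np.mean((predict(model, X[val]) - y[val]) ** 2))
    return np.mean(errors)

# e.g. pick the ridge parameter with the smallest cross-validation error:
# best_lam = min(lams, key=lambda lam: kfold_cv_error(
#     X, y, fit=lambda A, b: ridge(A, b, lam), predict=lambda beta, A: A @ beta))
```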


What You Need to Know...

The expected prediction error is a function of the model complexity.

The bias-variance decomposition implies that minimizing the expected prediction error requires careful tuning of the model complexity.

Using a development set and cross-validation are two basic methods for model selection.

60 / 109


Recap

• A statistics and optimization perspective

Learning a binomial and a Gaussian

• Statistical learning theory

Data, hypotheses, loss function, expected risk, empirical risk...

• Regression

Regression function, parametric regression, non-parametric regression...

• Model selection

Bias-variance tradeoff, development set, cross-validation...

• Classification

• Clustering

61 / 109


Classification

• Bayes optimal classifier

• Nearest neighbor classifier

• Naive Bayes classifier

• Logistic regression

• The perceptron

• Support vector machines

62 / 109


Recall

Classification

Output a function f : X → Y where Y is a finite set.

0/1 loss

L((x, y), h) = I(y ≠ h(x)) = 1 if y ≠ h(x), and 0 if y = h(x).

Expected/True risk

R(h) = E(L((X, Y), h)).

63 / 109


Bayes Optimal Classifier

The expected 0/1 loss is minimized by the Bayes optimal classifier

h*(x) = arg max_{y∈Y} P(y | x).

Proof. The expected 0/1 loss of a classifier h(x) is

E(L((X, Y), h)) = E_X E_{Y|X}(I(Y ≠ h(X))) = E_X P(Y ≠ h(X) | X).

Hence we can set the value of h(x) independently for each x by choosing it to minimize the expression under the expectation. This leads to h*(x) = arg max_{y∈Y} P(y | x).

64 / 109



The Bayes optimal classifier is

h*(x) = arg max_{y∈Y} P(y | x).

However, P(Y | x) is unknown...

Idea. Estimate P(y | x) from data.

65 / 109


Nearest Neighbor Classifier

Approximate P(y | x) using the label distribution in {y_i | x_i ∈ N_k(x)}, where N_k(x) consists of the k nearest examples of x (with respect to some distance measure), and predict the majority label

h_n(x) = majority{y_i | x_i ∈ N_k(x)}.

• Under mild conditions, as k → ∞ and n/k → ∞, h_n(x) → h*(x), for any distribution P(X, Y).

• (Curse of dimensionality) The number of samples required for accurate approximation is exponential in the dimension.

66 / 109



[Figure: decision boundaries of the 1-NN classifier, the 15-NN classifier, and the Bayes optimal classifier]

67 / 109


Naive Bayes Classifier (NB)

Model

• X = X1 × . . .×Xd , where each Xi is a finite set.

• A model p(X ,Y ) satisfies the independence assumption

p(x1, . . . , xd | y) = p(x1 | y) . . . p(xd | y).

Classification

An example x = (x_1, . . . , x_d) is classified as

y = arg max_{y′∈Y} p(y′ | x).

This is equivalent to

y = arg max_{y′∈Y} p(y′, x) = arg max_{y′∈Y} p(y′) p(x_1 | y′) · · · p(x_d | y′),

by the independence assumption.

68 / 109



Learning (MLE)

The maximum likelihood Naive Bayes model is p(X, Y) given by

p(y) = n_y / n,

p(x_i | y) = n_{y,x_i} / n_y,

where n_y is the number of times class y appears in the training set, and n_{y,x_i} is the number of times attribute i takes value x_i when the class label is y. (verify)

Issues

• The independence assumption is unlikely to be satisfied.

• The counts n_y may be 0, making the estimates undefined.

• The counts may be very small, leading to unstable estimates.

69 / 109



Laplace correction

p(y) = (n_y + c_0) / ∑_{y′∈Y} (n_{y′} + c_0),

p(x_i | y) = (n_{y,x_i} + c_1) / ∑_{x′_i∈X_i} (n_{y,x′_i} + c_1),

where c_0 > 0 and c_1 > 0 are user-chosen constants. Laplace correction makes NB more stable, but it still relies on the strong independence assumption.

70 / 109
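A rough sketch of a Laplace-corrected naive Bayes classifier for discrete data (assuming NumPy; only attribute values seen in the training data are handled):

```python
import numpy as np

def train_nb(X, y, c0=1.0, c1=1.0):
    # X: n x d array of discrete attribute values, y: n class labels.
    classes, n, d = np.unique(y), len(y), X.shape[1]
    prior, cond = {}, {}
    for c in classes:
        Xc = X[y == c]
        prior[c] = (len(Xc) + c0) / (n + c0 * len(classes))   # p(y) with Laplace correction
        for i in range(d):
            values = np.unique(X[:, i])
            cond[(c, i)] = {v: (np.sum(Xc[:, i] == v) + c1) /
                               (len(Xc) + c1 * len(values)) for v in values}  # p(x_i | y)
    return prior, cond

def predict_nb(x, prior, cond):
    # arg max_y  log p(y) + sum_i log p(x_i | y)
    scores = {c: np.log(prior[c]) +
                 sum(np.log(cond[(c, i)][v]) for i, v in enumerate(x)) for c in prior}
    return max(scores, key=scores.get)
```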


Logistic Regression (LR)

Model

• X = Rd .

• Logistic regression estimates conditional distributions of the form

p(y | x, θ) = exp(x^T θ_y) / ∑_{y′∈Y} exp(x^T θ_{y′}),

where θ_y = (θ_{y1}, . . . , θ_{yd}) ∈ R^d, and θ is the concatenation of the θ_y's.

Classification

An example x is classified as

y = arg max_{y′∈Y} p(y′ | x, θ).

71 / 109



Learning

• Training is often done by maximizing the regularized log-likelihood

L(θ) = log ∏_{i=1}^n p(y_i | x_i, θ) − λ||θ||₂².

That is, the parameter estimate is

θ_n = arg max_θ L(θ).

• L(θ) is a concave function, and can be optimized using standard numerical methods (such as L-BFGS).

72 / 109
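A sketch for the binary case (labels 0/1), maximizing the L2-regularized log-likelihood with SciPy's L-BFGS; the multiclass softmax form on the previous slide is handled similarly:

```python
import numpy as np
from scipy.optimize import minimize

def fit_logistic(X, y, lam=1.0):
    # Binary logistic regression: p(y=1 | x) = 1 / (1 + exp(-x^T theta)).
    def neg_reg_loglik(theta):
        z = X @ theta
        loglik = np.sum(y * z - np.logaddexp(0.0, z))   # sum_i [y_i z_i - log(1 + e^{z_i})]
        return -(loglik - lam * theta @ theta)

    def grad(theta):
        p = 1.0 / (1.0 + np.exp(-(X @ theta)))          # predicted P(y=1 | x)
        return -(X.T @ (y - p) - 2.0 * lam * theta)

    res = minimize(neg_reg_loglik, np.zeros(X.shape[1]), jac=grad, method="L-BFGS-B")
    return res.x
```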


Comparing kNN, NB and LR

All approximate the Bayes optimal classifier:

• Learn an approximation P̂(X, Y) to P(X, Y) (or learn an approximation P̂(Y | X) to P(Y | X)).

• Choose the function h_n(x) = arg max_y P̂(y | x).

kNN and logistic regression estimate P(Y | X), while naive Bayes estimates P(X, Y).

kNN is a non-parametric method, while naive Bayes and logistic regression are both parametric methods.

73 / 109


Application in Digit Recognition

Assume each digit image is a binary image, represented as a vector with the pixel values as its elements.

Applying the algorithms

• kNN: use the Euclidean distance between the examples.

• NB can be applied because each example is a discrete vector.

• LR can be applied because each example is a continuous vector.

74 / 109


The Perceptron

[Figure: a perceptron. The inputs x_0 = 1, x_1, x_2, . . . , x_d are combined with weights w_0, w_1, w_2, . . . , w_d into the sum x^T w, which is passed through sgn to give y = sgn(x^T w).]

A perceptron maps an input x ∈ R^{d+1} to

h(x) = sgn(x^T w) = 1 if x^T w > 0, 0 if x^T w = 0, and −1 if x^T w < 0.

Here x = (1, x_1, . . . , x_d) includes a dummy variable 1.

75 / 109


[Figure: positively and negatively labeled points in the (x_0, x_1) plane, separated by a line.]

A perceptron corresponds to a linear decision boundary (i.e., the boundary between the regions for examples of the same class).

76 / 109


It is NP-hard to minimize the empirical 0/1 loss of a perceptron; that is, given a training set {(x_1, y_1), . . . , (x_n, y_n)}, it is NP-hard to solve

min_w (1/n) ∑_i I(sgn(x_i^T w) ≠ y_i).

Idea: Can we use (x_i^T w − y_i)² as a surrogate loss for the 0/1 loss?

77 / 109


Least Squares May Fail

Recall: Binary classification as regression

• Label one class as −1 and the other as +1.

• Fit a function f (x) using least squares regression.

• Given a test example x, predict −1 if f (x) < 0 and +1 otherwise.

Issue

Least squares fitting may not find a separating hyperplane (i.e., a hyperplane which puts the positive and negative examples on different sides of it) even when one exists.

78 / 109



[Figure: a set of positive examples and one negative example, with the least squares decision boundary (red line) and a separating hyperplane (blue line).]

The decision boundary learned using least squares fitting (red line) wrongly classifies the negative example, while there exist separating hyperplanes (like the blue line).

79 / 109


Perceptron Algorithm

Require: (x_1, y_1), . . . , (x_n, y_n) ∈ R^{d+1} × {−1, +1}, η ∈ (0, 1].
Ensure: Weight vector w.
  Randomly or smartly initialize w.
  while there is any misclassified example do
    Pick a misclassified example (x_i, y_i).
    w ← w + η y_i x_i.

80 / 109
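A runnable sketch of the algorithm (assuming NumPy; an iteration cap is added because the loop does not terminate on non-separable data, as noted on a later slide):

```python
import numpy as np

def perceptron(X, y, eta=1.0, max_iters=10000):
    # X: n x (d+1) with a leading column of 1s, y in {-1, +1}.
    rng = np.random.default_rng(0)
    w = rng.normal(size=X.shape[1])              # random initialization
    for _ in range(max_iters):
        mis = np.where(y * (X @ w) <= 0)[0]      # misclassified examples (y_i x_i^T w <= 0)
        if len(mis) == 0:
            break                                # all training examples correctly classified
        i = rng.choice(mis)                      # pick a misclassified example
        w = w + eta * y[i] * X[i]                # the perceptron update
    return w
```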


Why the update rule w ← w + η y_i x_i?

• w classifies (x_i, y_i) correctly if and only if y_i x_i^T w > 0.

• If w classifies (x_i, y_i) wrongly, then the update rule moves y_i x_i^T w towards positive, because

y_i x_i^T (w + η y_i x_i) = y_i x_i^T w + η ||x_i||² > y_i x_i^T w.

81 / 109


Perceptron convergence theorem

If the training data is linearly separable (i.e., there exists some w* such that y_i x_i^T w* > 0 for all i), then the perceptron algorithm terminates with all training examples correctly classified.

Proof. Suppose w* separates the data. We can scale w* such that |x_i^T w*| ≥ ||x_i||₂² for all i.

If w classifies (x_i, y_i) wrongly, then it will be updated to w′ = w + η y_i x_i.

We show that ||w* − w′||₂² ≤ ||w* − w||₂² − ηR², where R = min_i ||x_i||₂. This implies that only finitely many updates are possible.

The inequality can be shown as follows.

||w* − w′||₂² = ||w* − w − η y_i x_i||₂²
            = ||w* − w||₂² − 2η y_i x_i^T (w* − w) + η² y_i² ||x_i||₂²
            ≤ ||w* − w||₂² − 2η(|x_i^T w*| + |x_i^T w|) + η ||x_i||₂²
            ≤ ||w* − w||₂² − 2η |x_i^T w*| + η ||x_i||₂²
            ≤ ||w* − w||₂² − η |x_i^T w*|
            ≤ ||w* − w||₂² − ηR².

82 / 109



Issues

• When the data is separable, the hyperplane found by the perceptron algorithm depends on the initial weight and is thus arbitrary.

• Convergence can be very slow, especially when the gap between the positive and negative examples is small.

• When the data is not separable, the algorithm does not stop, butthis can be difficult to detect.

83 / 109


Support Vector Machines (SVMs)

Separable data

[Figure: positive and negative examples separated by the hyperplane w^T x + w_0 = 0; the distance from an example (x_i, y_i) to the hyperplane is y_i(w^T x_i + w_0)/||w||₂.]

Geometric intuition

Find a separating hyperplane with maximal margin (i.e., the minimum distance from the points to it).

Algebraic formulation

max_{M, w, w_0} M
subject to y_i(w^T x_i + w_0)/||w||₂ ≥ M, i = 1, . . . , n.

Equivalent formulation (add M||w||₂ = 1).

min_{w, w_0} (1/2)||w||₂²
subject to y_i(w^T x_i + w_0) ≥ 1, i = 1, . . . , n.

84 / 109



Soft-margin SVMs

Non-separable data

Algebraic formulation

min_{w, w_0, ξ_1,...,ξ_n} (1/2)||w||₂² + C ∑_i ξ_i
subject to y_i(w^T x_i + w_0) ≥ 1 − ξ_i, i = 1, . . . , n,
           ξ_i ≥ 0, i = 1, . . . , n.

• C > 0 is a user-chosen constant.

• Introducing ξ_i allows (x_i, y_i) to be misclassified with a penalty of Cξ_i added to the original objective function (1/2)||w||₂².

An SVM always has a unique solution that can be found efficiently.

85 / 109


SVM as minimizing regularized hinge loss

Soft-margin SVMs can be equivalently written as

min_{w, w0}  (1/(2C)) ||w||_2^2 + ∑_i max(0, 1 − yi (wT xi + w0)),

where max(0, 1 − y (wT x + w0)) is the hinge loss

Lhinge((x, y), h) = max(0, 1 − y h(x))

of the classifier h(x) = wT x + w0, and upper bounds the 0/1 loss

L0/1((x, y), h) = I(y ≠ sgn(h(x))).
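
This view also suggests training by subgradient descent on the regularized hinge loss directly. A rough numpy sketch, assuming labels yi ∈ {−1, +1}; the step size and iteration count are arbitrary illustrative choices:

import numpy as np

def svm_subgradient(X, y, C=1.0, lr=0.01, iters=1000):
    """Subgradient descent on ||w||_2^2 / (2C) + sum_i max(0, 1 - y_i (w^T x_i + w0))."""
    n, d = X.shape
    w, w0 = np.zeros(d), 0.0
    for _ in range(iters):
        margins = y * (X @ w + w0)
        active = margins < 1                               # examples with positive hinge loss
        grad_w = w / C - (y[active, None] * X[active]).sum(axis=0)
        grad_w0 = -y[active].sum()
        w -= lr * grad_w
        w0 -= lr * grad_w0
    return w, w0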

86 / 109


What You Need to Know...

• Bayes optimal classifier.

• Probabilistic classifiers: NN, NB and LR.

• Perceptrons and SVMs.

87 / 109


Clustering

88 / 109


Clustering

Clustering is unsupervised function learning which learns a function from X to [K] = {1, . . . , K}. The objective generally depends on some similarity/distance measure between the items, so that similar items are grouped together.

89 / 109


K-means Clustering

Given observations x1, . . . , xn, find a K-clustering f (a surjection from [n] to [K]) to minimize the cost

∑_{i=1}^n ||xi − c_{f(i)}||^2,

where ck is the centroid of cluster k, that is, the average of the xi's with f(i) = k.

90 / 109


K-means algorithm

Randomly initialize c1:K, and set each f(i) = 0.
repeat
  (Assignment) Set each f(i) to be the index of the cj closest to xi.
  (Update) Set each ci as the centroid of cluster i given by f.
until f does not change

Initialization

• Forgy method: randomly choose K observations as centroids, and initialize f as in the assignment step.

• Random Partition: assign a random cluster to each example.

Random partition is preferable.
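
A minimal numpy sketch of the loop above with random-partition initialization (the function name, empty-cluster fallback and iteration cap are our own choices):

import numpy as np

def kmeans(X, K, max_iters=100, seed=0):
    """K-means with random-partition initialization; returns (f, centroids, cost)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    f = rng.integers(K, size=n)                  # Random Partition: a random cluster per example
    for _ in range(max_iters):
        # (Update) set each c_k to the centroid of cluster k given by f
        c = np.array([X[f == k].mean(axis=0) if np.any(f == k) else X[rng.integers(n)]
                      for k in range(K)])
        # (Assignment) set each f(i) to the index of the closest centroid
        dists = ((X[:, None, :] - c[None, :, :]) ** 2).sum(axis=2)
        f_new = dists.argmin(axis=1)
        if np.array_equal(f_new, f):             # f does not change: stop
            break
        f = f_new
    cost = ((X - c[f]) ** 2).sum()               # sum_i ||x_i - c_{f(i)}||^2
    return f, c, cost

Since the result depends on the random initialization, a common remedy is to run it with several seeds and keep the clustering with the lowest cost.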

91 / 109


Illustration

[Figures omitted: snapshots of K-means on a 2-D data set — initial centroids; iteration 1 assignment; iteration 1 update; iteration 2 assignment; iteration 2 update (converged).]

Generated using http://www.naftaliharris.com/blog/visualizing-k-means-clustering/

92 / 109


K-means as Coordinate Descent

K-means is a coordinate descent algorithm for the cost function

C(f, c1:K) = ∑_{i=1}^n ||xi − c_{f(i)}||^2.

Specifically, we have (verify)

(Assignment) f ← arg min_{f′} C(f′, c1:K), where f′ ranges over K-clusterings.

(Update) c1:K ← arg min_{c′1:K} C(f, c′1:K).
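
A quick numerical check of the two steps (random toy data; this sketch assumes no cluster becomes empty during the sweep):

import numpy as np

def cost(X, f, c):
    return ((X - c[f]) ** 2).sum()               # C(f, c_{1:K})

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
K = 3
f = rng.integers(K, size=len(X))                 # a random K-clustering
c = np.array([X[f == k].mean(axis=0) for k in range(K)])

# (Assignment) minimizes the cost over f for fixed centroids ...
f_new = ((X[:, None, :] - c[None, :, :]) ** 2).sum(axis=2).argmin(axis=1)
assert cost(X, f_new, c) <= cost(X, f, c)
# ... and (Update) minimizes it over the centroids for fixed f.
c_new = np.array([X[f_new == k].mean(axis=0) for k in range(K)])
assert cost(X, f_new, c_new) <= cost(X, f_new, c)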

93 / 109


Convergence

• The cost decreases at each iteration before termination.

• The cost converges to a local minimum by the monotone convergence theorem.

• Convergence may be very slow, taking exponential time in some cases, but such cases do not seem to arise in practice.

Dealing with poor local minima

• Restart multiple times and pick the minimum-cost clustering found.

K-means gives hard clusterings. Can we give probabilistic assignments to clusters?

94 / 109



Soft Clustering with Gaussian Mixtures

Assumption

• Each cluster is represented as a Gaussian

  N(x; µk, Σk) = (1/√|2πΣk|) exp(−(1/2) (x − µk)T Σk^{-1} (x − µk)).

• Cluster k has weight wk ≥ 0, with ∑_{k=1}^K wk = 1.

95 / 109


Equivalently, we assume the probability of observing x and of it being in cluster z = k is

p(x, z = k | θ) = wk N(x; µk, Σk),

where θ = {w1:K, µ1:K, Σ1:K}.

The distribution p(x | θ) = ∑_k wk N(x; µk, Σk) is called a Gaussian mixture.
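
A small sketch of evaluating p(x, z = k | θ) and the mixture density p(x | θ) with scipy's multivariate normal; the two-component parameters below are made up for illustration:

import numpy as np
from scipy.stats import multivariate_normal

weights = np.array([0.3, 0.7])                       # w_1, w_2
means = [np.zeros(2), np.array([3.0, 3.0])]          # mu_1, mu_2
covs = [np.eye(2), 2.0 * np.eye(2)]                  # Sigma_1, Sigma_2

x = np.array([1.0, 1.0])
joint = np.array([weights[k] * multivariate_normal.pdf(x, mean=means[k], cov=covs[k])
                  for k in range(2)])                # p(x, z = k | theta)
print(joint.sum())                                   # p(x | theta), the Gaussian mixture density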

96 / 109



Computing a soft clustering

p(Z = k | x, θ) = p(x, Z = k | θ) / ∑_{k′=1}^K p(x, Z = k′ | θ).

Learning a Gaussian mixture

Given the observations D = {x1, . . . , xn}, we choose θ by maximizing the log-likelihood L(D | θ) = ∑_{i=1}^n ln p(xi | θ):

max_θ L(D | θ).

This is solved using the EM algorithm.
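
The soft clustering is just this posterior over z. A small helper (our own naming) that computes the responsibilities p(Z = k | x, θ) for given mixture parameters:

import numpy as np
from scipy.stats import multivariate_normal

def responsibilities(x, weights, means, covs):
    """p(Z = k | x, theta) for a Gaussian mixture theta = (weights, means, covs)."""
    joint = np.array([w * multivariate_normal.pdf(x, mean=m, cov=S)
                      for w, m, S in zip(weights, means, covs)])   # p(x, Z = k | theta)
    return joint / joint.sum()                                     # normalize over k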

97 / 109


The EM Algorithm

Let zi be the random variable representing the cluster from which xi is drawn. EM starts with some initial parameter θ(0), and repeats the following steps:

Expectation step: Q(θ | θ(t)) = E(∑_i ln p(xi, zi | θ) | D, θ(t)),

Maximization step: θ(t+1) = arg max_θ Q(θ | θ(t)).

In words...

• E-step: Expectation of the log-likelihood of the complete data, w.r.t. the conditional distribution p(z1:n | D, θ(t)).

• M-step: Maximization of this expectation.

98 / 109


The EM Algorithm

Let zi be the random variable representing the cluster from which xi is drawn. EM starts with some initial parameter θ(0), and repeats the following steps:

Expectation step: Q(θ | θ(t)) = E(∑_i ln p(xi, zi | θ) | D, θ(t)),

Maximization step: θ(t+1) = arg max_θ Q(θ | θ(t)).

Data completion interpretation

• E-step: Create complete data (xi, zi) with weight p(zi | xi, θ(t)), for each xi and each zi ∈ [K].

• M-step: Perform maximum likelihood estimation on the complete data set.

98 / 109


EM algorithm is iterative likelihood maximization

EM algorithm iteratively improves the likelihood function,

L(D | θ(t+1)) ≥ L(D | θ(t)).

Proof. With some algebraic manipulation, we have

Q(θ(t+1) | θ(t)) − Q(θ(t) | θ(t))
= L(D | θ(t+1)) − L(D | θ(t)) − ∑_i KL(p(Zi | xi, θ(t)) || p(Zi | xi, θ(t+1))),

where KL(q || q′) is the KL-divergence ∑_x q(x) ln(q(x)/q′(x)).

The result follows by noting that the LHS is non-negative by the choice of θ(t+1) in the M-step, and the non-negativity of the KL-divergence.

99 / 109



Q(θ(t+1) | θ(t)) − Q(θ(t) | θ(t))

= E(∑_i ln p(xi, zi | θ(t+1)) | D, θ(t)) − E(∑_i ln p(xi, zi | θ(t)) | D, θ(t))

= E(∑_i (ln p(xi | θ(t+1)) + ln p(zi | xi, θ(t+1)) − ln p(xi | θ(t)) − ln p(zi | xi, θ(t))) | D, θ(t))

= ∑_i ln p(xi | θ(t+1)) − ∑_i ln p(xi | θ(t)) − E(∑_i ln (p(zi | xi, θ(t)) / p(zi | xi, θ(t+1))) | D, θ(t))

= L(D | θ(t+1)) − L(D | θ(t)) − ∑_i KL(p(Zi | xi, θ(t)) || p(Zi | xi, θ(t+1))).

100 / 109


Update Equations for Gaussian Mixtures

Scalar covariance matrices

Assume each covariance matrix Σk is a scalar matrix σk^2 Id. Let w_k^(t,i) = p(zi = k | xi, θ(t)). Then given θ(t), θ(t+1) can be computed using

w_k^(t+1) = ∑_i w_k^(t,i) / n,

µ_k^(t+1) = ∑_i w_k^(t,i) xi / ∑_i w_k^(t,i),

(σ_k^(t+1))^2 = ∑_i w_k^(t,i) ∑_j (xij − µ_kj^(t+1))^2 / (d ∑_i w_k^(t,i)).

101 / 109


Data completion interpretation

• Split example xi into K complete examples (xi, 1), . . . , (xi, K), where example (xi, k) has weight w_k^(t,i).

• Apply maximum likelihood estimation to the complete data:

  • w_k^(t+1) is the total weight of examples in cluster k.

  • µ_k^(t+1) is the mean x value of the (weighted) examples in cluster k.

  • σ_k^(t+1) is the standard deviation over all the attributes of the (weighted) examples in cluster k.

102 / 109


Diagonal covariance matrices

Assume each covariance matrix Σk is a diagonal matrix with diagonal entries σ_k1^2, . . . , σ_kd^2. Let w_k^(t,i) = p(zi = k | xi, θ(t)). Then given θ(t), θ(t+1) can be computed using

w_k^(t+1) = ∑_i w_k^(t,i) / n,

µ_k^(t+1) = ∑_i w_k^(t,i) xi / ∑_i w_k^(t,i),

(σ_kj^(t+1))^2 = ∑_i w_k^(t,i) (xij − µ_kj^(t+1))^2 / ∑_i w_k^(t,i).

103 / 109


Arbitrary covariance matrices

Let w_k^(t,i) = p(zi = k | xi, θ(t)). Then given θ(t), θ(t+1) can be computed using

w_k^(t+1) = ∑_i w_k^(t,i) / n,

µ_k^(t+1) = ∑_i w_k^(t,i) xi / ∑_i w_k^(t,i),

Σ_k^(t+1) = ∑_i w_k^(t,i) (xi − µ_k^(t+1)) (xi − µ_k^(t+1))T / ∑_i w_k^(t,i).
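
Combining the E-step responsibilities with these M-step updates gives a compact EM loop for a Gaussian mixture with arbitrary covariances. A rough numpy/scipy sketch that mirrors the equations above; it assumes the parameters have already been initialized somehow (e.g., from a K-means run) and is not meant as a robust implementation:

import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, weights, means, covs, iters=50):
    """EM for a Gaussian mixture with arbitrary covariance matrices (a rough sketch)."""
    n, d = X.shape
    K = len(weights)
    for _ in range(iters):
        # E-step: responsibilities w_k^(t,i) = p(z_i = k | x_i, theta^(t))
        joint = np.stack([weights[k] * multivariate_normal.pdf(X, mean=means[k], cov=covs[k])
                          for k in range(K)], axis=1)              # n x K
        resp = joint / joint.sum(axis=1, keepdims=True)
        # M-step: the update equations above
        Nk = resp.sum(axis=0)                                      # sum_i w_k^(t,i)
        weights = Nk / n
        means = [(resp[:, k:k + 1] * X).sum(axis=0) / Nk[k] for k in range(K)]
        covs = [(resp[:, k:k + 1] * (X - means[k])).T @ (X - means[k]) / Nk[k]
                for k in range(K)]
    return weights, means, covs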

104 / 109


What You Need to Know...

• The clustering problem.

• Hard clustering with the K-means algorithm.

• Soft clustering with Gaussian mixtures.

105 / 109


Density Estimation

• Maximum likelihood estimation.

• Naive Bayes.

• Logistic regression.

106 / 109


This Tutorial...

Essentials for crafting basic machine learning systems.

Formulate applications as machine learning problems

Classification, regression, density estimation, clustering

Understand and apply basic learning algorithms

Least squares regression, logistic regression, support vector machines, K-means,...

Theoretical understanding

Position and compare the problems and algorithms in the unifying framework of statistical learning theory.

107 / 109


Beyond This Course

• Representation: dimensionality reduction, feature selection,...

• Algorithms: decision trees, artificial neural networks, Gaussian processes,...

• Meta-learning algorithms: boosting, stacking, bagging,...

• Learning theory: generalization performance of learning algorithms

• Many other exciting topics...

108 / 109