
Page 1

Linear Models for Supervised Learning

Weinan Zhang, Shanghai Jiao Tong University

http://wnzhang.net

2019 CS420 Machine Learning, Lecture 2

http://wnzhang.net/teaching/cs420/index.html

Page 2

Discriminative Model and Generative Model

• Discriminative model
  • modeling the dependence of unobserved variables on observed ones
  • also called conditional models
  • Deterministic: y = f_\theta(x)
  • Probabilistic: p_\theta(y \mid x)
• Generative model
  • modeling the joint probabilistic distribution of data, given some hidden parameters or variables
  • then do the conditional inference

Page 3

Discriminative Model and Generative Model

• Discriminative model
  • modeling the dependence of unobserved variables on observed ones
  • also called conditional models
  • Deterministic: y = f_\theta(x)
  • Probabilistic: p_\theta(y \mid x)
• Directly model the dependence for label prediction
• Easy to define dependence-specific features and models
• Practically yielding higher prediction performance
• Linear regression, logistic regression, k nearest neighbor, SVMs, (multi-layer) perceptrons, decision trees, random forest, etc.

Page 4

Discriminative Model and Generative Model

• Generative model
  • modeling the joint probabilistic distribution of data, given some hidden parameters or variables
  • then do the conditional inference
• Recover the data distribution [essence of data science]
• Benefit from hidden variables modeling
• Naive Bayes, Hidden Markov Model, Mixture Gaussian, Markov Random Fields, Latent Dirichlet Allocation, etc.

Page 5

Linear Regression

Page 6

Linear Discriminative Models

• Discriminative model
  • modeling the dependence of unobserved variables on observed ones
  • also called conditional models
  • Deterministic: y = f_\theta(x)
  • Probabilistic: p_\theta(y \mid x)
• Focus of this course
  • Linear regression model
  • Linear classification model

Page 7

Linear Discriminative Models

• Discriminative model
  • modeling the dependence of unobserved variables on observed ones
  • also called conditional models
  • Deterministic: y = f_\theta(x)
  • Probabilistic: p_\theta(y \mid x)
• Linear regression model

y = f_\theta(x) = \theta_0 + \sum_{j=1}^{d} \theta_j x_j = \theta^\top x, \qquad x = (1, x_1, x_2, \ldots, x_d)

Page 8

Linear Regression

• One-dimensional linear & quadratic regression

[Figure: left, linear regression; right, quadratic regression (a kind of generalized linear model)]

Page 9

Linear Regression

• Two-dimensional linear regression

Page 10

Learning Objective

• Make the prediction close to the corresponding label
• A loss function measures the error between the label and the prediction
• The definition of the loss function depends on the data and task
• Most popular loss function: squared loss

L(y, f_\theta(x)) = \frac{1}{2}\left(y - f_\theta(x)\right)^2

Page 11

Squared Loss

• Penalizes larger distances (errors) much more
• Accepts small distances (errors)
  • Observation noise, etc.
  • Generalization

Page 12

Least Square Linear Regression

• Objective function to minimize

J(\theta) = \frac{1}{2N}\sum_{i=1}^{N}\left(y_i - f_\theta(x_i)\right)^2

Page 13

Minimize the Objective Function

• Let N = 1 for a simple case, with (x, y) = (2, 1)

J(\theta) = \frac{1}{2}(y - \theta_0 - \theta_1 x)^2 = \frac{1}{2}(1 - \theta_0 - 2\theta_1)^2

Page 14

Gradient Learning Methods

\theta_{new} \leftarrow \theta_{old} - \eta \frac{\partial L(\theta)}{\partial \theta}

Page 15

Batch Gradient Descent

• Update for the whole batch

\theta_{new} \leftarrow \theta_{old} - \eta \frac{\partial J(\theta)}{\partial \theta}, \qquad J(\theta) = \frac{1}{2N}\sum_{i=1}^{N}\left(y_i - f_\theta(x_i)\right)^2
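Not from the slides: a minimal NumPy sketch of batch gradient descent on a toy dataset, just to make the update concrete (the data, learning rate, and iteration count are illustrative assumptions).

import numpy as np

# Toy data: N instances, d features (a constant 1 is prepended for the bias theta_0).
X = np.array([[1.0, 2.0], [1.0, 3.0], [1.0, 5.0]])   # shape (N, d)
y = np.array([1.0, 2.0, 4.0])                        # shape (N,)

theta = np.zeros(X.shape[1])   # parameter vector
eta = 0.01                     # learning rate

for _ in range(1000):
    # Gradient of J(theta) = 1/(2N) * sum_i (y_i - x_i^T theta)^2 over the whole batch
    residual = y - X @ theta
    grad = -(X.T @ residual) / len(y)
    theta = theta - eta * grad

print(theta)   # approaches the least-squares solution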

Page 16

Learning Linear Model - Curve

Page 17

Learning Linear Model - Weights

Page 18

Stochastic Gradient Descent

• Update for every single instance

J^{(i)}(\theta) = \frac{1}{2}\left(y_i - f_\theta(x_i)\right)^2, \qquad \min_\theta \frac{1}{N}\sum_i J^{(i)}(\theta)

\frac{\partial J^{(i)}(\theta)}{\partial \theta} = -(y_i - f_\theta(x_i))\frac{\partial f_\theta(x_i)}{\partial \theta} = -(y_i - f_\theta(x_i))\, x_i

\theta_{new} = \theta_{old} - \eta \frac{\partial J^{(i)}(\theta)}{\partial \theta} = \theta_{old} + \eta\,(y_i - f_\theta(x_i))\, x_i

• Compare with BGD
  • Faster learning
  • Uncertainty or fluctuation in learning
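As an illustration (not part of the original deck), one SGD epoch implementing the per-instance update derived above; the function name and hyperparameters are assumptions.

import numpy as np

def sgd_epoch(X, y, theta, eta=0.01):
    """One stochastic-gradient-descent epoch for least-squares linear regression."""
    indices = np.random.permutation(len(y))    # visit instances in random order
    for i in indices:
        error = y[i] - X[i] @ theta            # y_i - f_theta(x_i)
        theta = theta + eta * error * X[i]     # theta_new = theta_old + eta * error * x_i
    return theta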

Page 19

Linear Classification Model

Page 20

Mini-Batch Gradient Descent

• A combination of batch GD and stochastic GD
• Split the whole dataset into K mini-batches
• Update for each mini-batch

J^{(k)}(\theta) = \frac{1}{2N_k}\sum_{i=1}^{N_k}\left(y_i - f_\theta(x_i)\right)^2

\theta_{new} = \theta_{old} - \eta \frac{\partial J^{(k)}(\theta)}{\partial \theta}

• For each mini-batch k, perform one-step BGD towards minimizing J^{(k)}(\theta)
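A sketch of the mini-batch loop under the same conventions as the SGD example above (again illustrative, not the course's reference code).

import numpy as np

def minibatch_gd_epoch(X, y, theta, eta=0.01, batch_size=32):
    """One epoch of mini-batch gradient descent for least-squares regression."""
    indices = np.random.permutation(len(y))
    for start in range(0, len(y), batch_size):
        batch = indices[start:start + batch_size]
        residual = y[batch] - X[batch] @ theta
        # Gradient of J^(k)(theta) = 1/(2*N_k) * sum over the mini-batch
        grad = -(X[batch].T @ residual) / len(batch)
        theta = theta - eta * grad
    return theta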

Page 21

Mini-Batch Gradient Descent

• Good learning stability (from BGD)
• Good convergence rate (from SGD)
• Easy to be parallelized
  • Parallelization within a mini-batch

[Figure: a mini-batch is split across Worker 1, 2, 3; the per-worker gradients are computed in parallel (map) and then summed (reduce)]

Page 22

Basic Search Procedure

• Choose an initial value for \theta
• Update \theta iteratively with the data
• Until we reach a minimum

Page 23

Basic Search Procedure

• Choose a new initial value for \theta
• Update \theta iteratively with the data
• Until we reach a minimum

Page 24

Unique Minimum for Convex Objective

• Different initial parameters and different learning algorithms lead to the same optimum

Page 25

Convex Set

• A convex set S is a set of points such that, given any two points A, B in that set, the line segment AB joining them lies entirely within S

t x_1 + (1 - t) x_2 \in S \quad \text{for all } x_1, x_2 \in S, \; 0 \le t \le 1

[Figure: a convex set vs. a non-convex set]

[Boyd, Stephen, and Lieven Vandenberghe. Convex Optimization. Cambridge University Press, 2004.]

Page 26

Convex Function

f : \mathbb{R}^n \to \mathbb{R} is convex if \text{dom}\,f is a convex set and

f(t x_1 + (1 - t) x_2) \le t f(x_1) + (1 - t) f(x_2) \quad \text{for all } x_1, x_2 \in \text{dom}\,f, \; 0 \le t \le 1

Page 27

Choosing Learning Rate

• To see if gradient descent is working, print out J(\theta) each iteration or every several iterations. If J(\theta) does not drop properly, adjust \eta

\theta_{new} = \theta_{old} - \eta \frac{\partial J(\theta)}{\partial \theta}

• \eta too small: slow convergence
• \eta too large: increasing value of J(\theta)
  • May overshoot the minimum
  • May fail to converge
  • May even diverge
• The initial point may be too far away from the optimal solution, which takes much time to converge

Slide credit: Eric Eaton

Page 28

Algebra Perspective

• Prediction

X = \begin{bmatrix} x^{(1)} \\ x^{(2)} \\ \vdots \\ x^{(n)} \end{bmatrix}
  = \begin{bmatrix}
      x^{(1)}_1 & x^{(1)}_2 & x^{(1)}_3 & \cdots & x^{(1)}_d \\
      x^{(2)}_1 & x^{(2)}_2 & x^{(2)}_3 & \cdots & x^{(2)}_d \\
      \vdots    & \vdots    & \vdots    & \ddots & \vdots    \\
      x^{(n)}_1 & x^{(n)}_2 & x^{(n)}_3 & \cdots & x^{(n)}_d
    \end{bmatrix},
\qquad
\theta = \begin{bmatrix} \theta_1 \\ \theta_2 \\ \vdots \\ \theta_d \end{bmatrix},
\qquad
y = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix}

\hat{y} = X\theta = \begin{bmatrix} x^{(1)}\theta \\ x^{(2)}\theta \\ \vdots \\ x^{(n)}\theta \end{bmatrix}

• Objective

J(\theta) = \frac{1}{2}(y - \hat{y})^\top (y - \hat{y}) = \frac{1}{2}(y - X\theta)^\top (y - X\theta)

Page 29

Matrix Form

• Objective

J(\theta) = \frac{1}{2}(y - X\theta)^\top (y - X\theta), \qquad \min_\theta J(\theta)

• Gradient

\frac{\partial J(\theta)}{\partial \theta} = -X^\top (y - X\theta)

• Solution

\frac{\partial J(\theta)}{\partial \theta} = 0 \;\Rightarrow\; X^\top (y - X\theta) = 0 \;\Rightarrow\; X^\top y = X^\top X\theta \;\Rightarrow\; \theta = (X^\top X)^{-1} X^\top y

http://dsp.ucsd.edu/~kreutz/PEI-05%20Support%20Files/ECE275A_Viewgraphs_5.pdf
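The closed-form solution can be sketched in a couple of lines; np.linalg.solve is used instead of an explicit matrix inverse purely as a numerical-stability choice, and the sketch assumes X^T X is non-singular (see the following slides).

import numpy as np

def least_squares(X, y):
    """Solve theta = (X^T X)^{-1} X^T y as a linear system, without forming the inverse."""
    return np.linalg.solve(X.T @ X, X.T @ y)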

Page 30

Matrix Form

• Then the predicted values are

\hat{y} = X(X^\top X)^{-1} X^\top y = Hy, \qquad H \text{: hat matrix}

• Geometrical explanation
  • Viewing X column-wise, X = [x_1, x_2, \ldots, x_d], the column vectors form a subspace of \mathbb{R}^n
  • H is a least square projection of y onto that subspace, minimizing \|y - X\theta\|^2

More details: see Sec. 3.2 of Hastie et al., The Elements of Statistical Learning.

Page 31

X^\top X Might be Singular

• When some column vectors are not independent
• For example, if x_2 = 3x_1, then X^\top X is singular, thus \theta = (X^\top X)^{-1}X^\top y cannot be directly calculated
• Solution: regularization

J(\theta) = \frac{1}{2}(y - X\theta)^\top (y - X\theta) + \frac{\lambda}{2}\|\theta\|_2^2

Page 32

Matrix Form with Regularization

• Objective

J(\theta) = \frac{1}{2}(y - X\theta)^\top (y - X\theta) + \frac{\lambda}{2}\|\theta\|_2^2, \qquad \min_\theta J(\theta)

• Gradient

\frac{\partial J(\theta)}{\partial \theta} = -X^\top (y - X\theta) + \lambda\theta

• Solution

\frac{\partial J(\theta)}{\partial \theta} = 0 \;\Rightarrow\; -X^\top (y - X\theta) + \lambda\theta = 0 \;\Rightarrow\; X^\top y = (X^\top X + \lambda I)\theta \;\Rightarrow\; \theta = (X^\top X + \lambda I)^{-1} X^\top y
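The regularized solution as a small sketch mirroring the unregularized one (lam stands for λ; its default value is an arbitrary assumption).

import numpy as np

def ridge_regression(X, y, lam=0.1):
    """Solve theta = (X^T X + lambda * I)^{-1} X^T y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)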

Page 33

Linear Discriminative Models

• Discriminative model
  • modeling the dependence of unobserved variables on observed ones
  • also called conditional models
  • Deterministic: y = f_\theta(x)
  • Probabilistic: p_\theta(y \mid x)
• Linear regression with Gaussian noise model

y = \theta^\top x + \epsilon, \qquad \epsilon \sim \mathcal{N}(0, \sigma^2)

Page 34

Objective: Likelihood

• Data likelihood, with Gaussian noise \epsilon \sim \mathcal{N}(0, \sigma^2)

p(\epsilon) = \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{\epsilon^2}{2\sigma^2}}

p(y \mid x) = \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(y - \theta^\top x)^2}{2\sigma^2}}

Page 35

Learning

• Maximize the data likelihood

\max_\theta \prod_{i=1}^{N} \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(y_i - \theta^\top x_i)^2}{2\sigma^2}}

• Maximize the data log-likelihood

\log \prod_{i=1}^{N} \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(y_i - \theta^\top x_i)^2}{2\sigma^2}}
= \sum_{i=1}^{N} \log \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(y_i - \theta^\top x_i)^2}{2\sigma^2}}
= -\sum_{i=1}^{N} \frac{(y_i - \theta^\top x_i)^2}{2\sigma^2} + \text{const}

• Equivalent to least square error learning

\min_\theta \sum_{i=1}^{N} (y_i - \theta^\top x_i)^2

Page 36

Linear Classification

Page 37

Classification Problem

• Given:
  • A description of an instance, x ∈ X, where X is the instance space
  • A fixed set of categories: C = \{c_1, c_2, \ldots, c_m\}
• Determine:
  • The category of x: f(x) ∈ C, where f(x) is a categorization function whose domain is X and whose range is C
• If the category set is binary, i.e. C = \{0, 1\} ({false, true}, {negative, positive}), then it is called binary classification

Page 38

Binary Classification

[Figure: two example datasets, one linearly separable and one not linearly separable]

Page 39

Linear Discriminative Models

• Discriminative model
  • modeling the dependence of unobserved variables on observed ones
  • also called conditional models
  • Deterministic: y = f_\theta(x) (non-differentiable)
  • Probabilistic: p_\theta(y \mid x) (differentiable)
• For binary classification

p_\theta(y = 1 \mid x), \qquad p_\theta(y = 0 \mid x) = 1 - p_\theta(y = 1 \mid x)

Page 40

Loss Function

• Cross entropy loss
  • Discrete case: H(p, q) = -\sum_x p(x) \log q(x)
  • Continuous case: H(p, q) = -\int_x p(x) \log q(x)\, dx
• For classification problems

L(y, x, p_\theta) = -\sum_k \delta(y = c_k) \log p_\theta(y = c_k \mid x),
\qquad
\delta(z) = \begin{cases} 1, & z \text{ is true} \\ 0, & \text{otherwise} \end{cases}

• Example

  Ground truth:  0     1     0     0     0
  Prediction:    0.1   0.6   0.05  0.05  0.2
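A small sketch of the discrete cross entropy applied to the example above (illustrative code, not from the slides).

import numpy as np

def cross_entropy(p_true, q_pred):
    """H(p, q) = - sum_x p(x) * log q(x) for discrete distributions."""
    return -np.sum(p_true * np.log(q_pred))

p = np.array([0, 1, 0, 0, 0])                 # ground truth (one-hot)
q = np.array([0.1, 0.6, 0.05, 0.05, 0.2])     # predicted distribution
print(cross_entropy(p, q))                    # = -log(0.6), about 0.51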

Page 41

Cross Entropy for Binary Classification

• Loss function

L(y, x, p_\theta) = -\delta(y = 1) \log p_\theta(y = 1 \mid x) - \delta(y = 0) \log p_\theta(y = 0 \mid x)
                  = -y \log p_\theta(y = 1 \mid x) - (1 - y) \log(1 - p_\theta(y = 1 \mid x))

• Example (Class 1, Class 2)

  Ground truth:  0     1
  Prediction:    0.3   0.7

Page 42

Logistic Regression

• Logistic regression is a binary classification model

p_\theta(y = 1 \mid x) = \sigma(\theta^\top x) = \frac{1}{1 + e^{-\theta^\top x}},
\qquad
p_\theta(y = 0 \mid x) = \frac{e^{-\theta^\top x}}{1 + e^{-\theta^\top x}}

• Cross entropy loss function

L(y, x, p_\theta) = -y \log \sigma(\theta^\top x) - (1 - y) \log(1 - \sigma(\theta^\top x))

• Gradient, using \frac{\partial \sigma(z)}{\partial z} = \sigma(z)(1 - \sigma(z)) with z = \theta^\top x

\frac{\partial L(y, x, p_\theta)}{\partial \theta}
= -y \frac{1}{\sigma(\theta^\top x)} \sigma(z)(1 - \sigma(z))\, x - (1 - y) \frac{-1}{1 - \sigma(\theta^\top x)} \sigma(z)(1 - \sigma(z))\, x
= (\sigma(\theta^\top x) - y)\, x

• Update rule

\theta \leftarrow \theta + \eta\,(y - \sigma(\theta^\top x))\, x

[Figure: the sigmoid function \sigma(x)]
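Putting the sigmoid, the loss gradient, and the update rule together, a minimal SGD trainer for logistic regression might look as follows (an illustrative sketch; names and hyperparameters are assumptions, not the course's reference code).

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_logistic_regression(X, y, eta=0.1, epochs=100):
    """SGD with the update theta <- theta + eta * (y - sigma(theta^T x)) * x."""
    theta = np.zeros(X.shape[1])
    for _ in range(epochs):
        for i in np.random.permutation(len(y)):
            error = y[i] - sigmoid(X[i] @ theta)
            theta = theta + eta * error * X[i]
    return theta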

Page 43

Label Decision

• Logistic regression provides the probability

p_\theta(y = 1 \mid x) = \sigma(\theta^\top x) = \frac{1}{1 + e^{-\theta^\top x}},
\qquad
p_\theta(y = 0 \mid x) = \frac{e^{-\theta^\top x}}{1 + e^{-\theta^\top x}}

• The final label of an instance is decided by setting a threshold h

y = \begin{cases} 1, & p_\theta(y = 1 \mid x) > h \\ 0, & \text{otherwise} \end{cases}

Page 44

Evaluation Measures

• True / False
  • True: prediction = label
  • False: prediction ≠ label
• Positive / Negative
  • Positive: predict y = 1
  • Negative: predict y = 0

                 Prediction 1       Prediction 0
  Label = 1      True Positive      False Negative
  Label = 0      False Positive     True Negative

Page 45

Evaluation Measures

• Accuracy: the ratio of cases where prediction = label

Acc = \frac{TP + TN}{TP + TN + FP + FN}

Page 46

Evaluation Measures

• Precision: the ratio of true class 1 cases among those with prediction 1

Prec = \frac{TP}{TP + FP}

• Recall: the ratio of cases with prediction 1 among all true class 1 cases

Rec = \frac{TP}{TP + FN}

Page 47

Evaluation Measures

• Precision-recall tradeoff, from the threshold rule

y = \begin{cases} 1, & p_\theta(y = 1 \mid x) > h \\ 0, & \text{otherwise} \end{cases}

  • Higher threshold: higher precision, lower recall (extreme case: threshold = 0.99)
  • Lower threshold: lower precision, higher recall (extreme case: threshold = 0)
• F1 measure

F1 = \frac{2 \times Prec \times Rec}{Prec + Rec}
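A sketch of these four measures computed from 0/1 label and prediction arrays (illustrative; it assumes each denominator is non-zero).

import numpy as np

def binary_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall and F1 from 0/1 arrays."""
    tp = np.sum((y_pred == 1) & (y_true == 1))
    tn = np.sum((y_pred == 0) & (y_true == 0))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    acc = (tp + tn) / (tp + tn + fp + fn)
    prec = tp / (tp + fp)
    rec = tp / (tp + fn)
    f1 = 2 * prec * rec / (prec + rec)
    return acc, prec, rec, f1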

Page 48

Evaluation Measures

• Ranking-based measure: Area Under ROC Curve (AUC)

Page 49

Evaluation Measures

• Ranking-based measure: Area Under ROC Curve (AUC)
  • Perfect prediction: AUC = 1
  • Random prediction: AUC = 0.5

Page 50

Evaluation Measures

• A simple example of Area Under ROC Curve (AUC)

  Prediction   Label
  0.91         1
  0.85         0
  0.77         1
  0.72         1
  0.61         0
  0.48         1
  0.42         0
  0.33         0

[Figure: ROC curve with False Positive Ratio on the x-axis and True Positive Ratio on the y-axis, ticks at 0.25, 0.5, 0.75, 1.0]

AUC = 0.75
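The value 0.75 can be reproduced by counting correctly ranked (positive, negative) pairs, which is one standard way to compute AUC (an illustrative sketch; ties would count as 0.5 but do not occur here).

import numpy as np

def auc_by_ranking(scores, labels):
    """AUC = fraction of (positive, negative) pairs where the positive scores higher."""
    pos = scores[labels == 1]
    neg = scores[labels == 0]
    correct = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return correct / (len(pos) * len(neg))

scores = np.array([0.91, 0.85, 0.77, 0.72, 0.61, 0.48, 0.42, 0.33])
labels = np.array([1, 0, 1, 1, 0, 1, 0, 0])
print(auc_by_ranking(scores, labels))   # 0.75, matching the slide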

Page 51

Multi-Class Classification

• Still cross entropy loss

L(y, x, p_\theta) = -\sum_k \delta(y = c_k) \log p_\theta(y = c_k \mid x),
\qquad
\delta(z) = \begin{cases} 1, & z \text{ is true} \\ 0, & \text{otherwise} \end{cases}

• Example

  Ground truth:  0     1     0
  Prediction:    0.1   0.7   0.2

Page 52

Multi-Class Logistic Regression

• Class set C = \{c_1, c_2, \ldots, c_m\}
• Predicting the probability p_\theta(y = c_j \mid x)
• Softmax

p_\theta(y = c_j \mid x) = \frac{e^{\theta_j^\top x}}{\sum_{k=1}^{m} e^{\theta_k^\top x}} \quad \text{for } j = 1, \ldots, m

• Parameters \theta = \{\theta_1, \theta_2, \ldots, \theta_m\}
  • Can be normalized with m-1 groups of parameters
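A sketch of the softmax probabilities, storing the class parameter vectors θ_1, ..., θ_m as the rows of a matrix Theta (an illustrative layout choice, not specified in the slides).

import numpy as np

def softmax_probs(Theta, x):
    """p(y = c_j | x) = exp(theta_j^T x) / sum_k exp(theta_k^T x)."""
    logits = Theta @ x              # shape (m,)
    logits = logits - logits.max()  # subtract the max for numerical stability
    exp_logits = np.exp(logits)
    return exp_logits / exp_logits.sum()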

Page 53

Multi-Class Logistic Regression

• Learning on one instance (x, y = c_j)
• Maximize the log-likelihood

\max_\theta \; \log p_\theta(y = c_j \mid x)

• Gradient

\frac{\partial \log p_\theta(y = c_j \mid x)}{\partial \theta_j}
= \frac{\partial}{\partial \theta_j} \log \frac{e^{\theta_j^\top x}}{\sum_{k=1}^{m} e^{\theta_k^\top x}}
= x - \frac{\partial}{\partial \theta_j} \log \sum_{k=1}^{m} e^{\theta_k^\top x}
= x - \frac{e^{\theta_j^\top x}\, x}{\sum_{k=1}^{m} e^{\theta_k^\top x}}

Page 54

Application Case Study

Click-Through Rate (CTR) Estimation in Online Advertising

Page 55

Ad Click-Through Rate Estimation

• Click or not?

[Figure: a news webpage (http://news.ifeng.com) with an ad slot]

Page 56

User Response Estimation Problem

• Problem definition: given one instance of data, predict the corresponding label, i.e. click (1) or not (0), as a predicted CTR (e.g. 0.15)
• One instance of data:
  • Date: 20160320
  • Hour: 14
  • Weekday: 7
  • IP: 119.163.222.*
  • Region: England
  • City: London
  • Country: UK
  • Ad Exchange: Google
  • Domain: yahoo.co.uk
  • URL: http://www.yahoo.co.uk/abc/xyz.html
  • OS: Windows
  • Browser: Chrome
  • Ad size: 300*250
  • Ad ID: a1890
  • User occupation: Student
  • User tags: Sports, Electronics

Page 57

One-Hot Binary Encoding

• A standard feature engineering paradigm

  x = [Weekday=Friday, Gender=Male, City=Shanghai]
  x = [0,0,0,0,1,0,0  0,1  0,0,1,0 … 0]
  Sparse representation: x = [5:1 9:1 12:1]

• High dimensional sparse binary feature vector
  • Usually higher than 1M dimensions, even 1B dimensions
  • Extremely sparse
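A sketch of how such a sparse encoding might be produced from a field=value-to-index dictionary; the dictionary and the resulting indices are hypothetical, chosen only to match the example above.

# Hypothetical mapping from "field=value" strings to feature indices.
feature_index = {"Weekday=Friday": 5, "Gender=Male": 9, "City=Shanghai": 12}

def encode_one_hot(instance):
    """Return the sparse representation: a sorted list of active indices (all values are 1)."""
    return sorted(feature_index[f] for f in instance)

x = ["Weekday=Friday", "Gender=Male", "City=Shanghai"]
print(encode_one_hot(x))   # [5, 9, 12], i.e. "5:1 9:1 12:1"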

Page 58

Training/Validation/Test Data

• Examples (in LibSVM format)

  1 5:1 9:1 12:1 45:1 154:1 509:1 4089:1 45314:1 988576:1
  0 2:1 7:1 18:1 34:1 176:1 510:1 3879:1 71310:1 818034:1

• Training/validation/test data split
  • Sort data by time
  • Train : validation : test = 8 : 1 : 1
  • Shuffle training data

Page 59

Training Logistic Regression

• Logistic regression is a binary classification model

p_\theta(y = 1 \mid x) = \sigma(\theta^\top x) = \frac{1}{1 + e^{-\theta^\top x}}

• Cross entropy loss function with L2 regularization

L(y, x, p_\theta) = -y \log \sigma(\theta^\top x) - (1 - y) \log(1 - \sigma(\theta^\top x)) + \frac{\lambda}{2}\|\theta\|_2^2

• Parameter learning

\theta \leftarrow (1 - \lambda\eta)\theta + \eta\,(y - \sigma(\theta^\top x))\, x

• Only update non-zero entries
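A sketch of the sparse update: with a binary one-hot x, θ^T x reduces to a sum over the active indices, and the step only touches those entries (illustrative code; applying the weight decay only to the active entries is a common simplification for very sparse data, not something the slides spell out).

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sparse_lr_update(theta, active_idx, y, eta=0.01, lam=1e-6):
    """One SGD step of L2-regularized logistic regression on a sparse binary instance.

    active_idx lists the indices i with x_i = 1; all other features are 0.
    """
    z = theta[active_idx].sum()              # theta^T x for a binary sparse x
    error = y - sigmoid(z)
    # Decay and gradient step applied only to the active entries.
    theta[active_idx] = (1 - lam * eta) * theta[active_idx] + eta * error
    return theta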

Page 60

Experimental Results

• Datasets
  • Criteo Terabyte Dataset
    • 13 numerical fields, 26 categorical fields
    • 7 consecutive days (out of 24 days in total, about 300 GB) during 2014
    • 79.4M impressions, 1.6M clicks after negative down sampling
  • iPinYou Dataset
    • 65 categorical fields
    • 10 consecutive days during 2013
    • 19.5M impressions, 937.7K clicks without negative down sampling

Page 61

Performance

  Model                   Linearity    AUC (Criteo)  AUC (iPinYou)  Log Loss (Criteo)  Log Loss (iPinYou)
  Logistic Regression     Linear       71.48%        73.43%         0.1334             5.581e-3
  Factorization Machine   Bi-linear    72.20%        75.52%         0.1324             5.504e-3
  Deep Neural Networks    Non-linear   75.66%        76.19%         0.1283             5.443e-3

[Yanru Qu et al. Product-based Neural Networks for User Response Prediction. ICDM 2016.]

• Compared with non-linear models, linear models:
  • Pros: standardized, easily understood and implemented, efficient and scalable
  • Cons: limited modeling capacity (feature independence assumption), cannot explore feature interactions

Page 62

Generalized Linear Models

Page 63

Review: Linear Regression

• Prediction

X = \begin{bmatrix} x^{(1)} \\ x^{(2)} \\ \vdots \\ x^{(n)} \end{bmatrix}
  = \begin{bmatrix}
      x^{(1)}_1 & x^{(1)}_2 & x^{(1)}_3 & \cdots & x^{(1)}_d \\
      x^{(2)}_1 & x^{(2)}_2 & x^{(2)}_3 & \cdots & x^{(2)}_d \\
      \vdots    & \vdots    & \vdots    & \ddots & \vdots    \\
      x^{(n)}_1 & x^{(n)}_2 & x^{(n)}_3 & \cdots & x^{(n)}_d
    \end{bmatrix},
\qquad
\theta = \begin{bmatrix} \theta_1 \\ \theta_2 \\ \vdots \\ \theta_d \end{bmatrix},
\qquad
y = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix}

\hat{y} = X\theta = \begin{bmatrix} x^{(1)}\theta \\ x^{(2)}\theta \\ \vdots \\ x^{(n)}\theta \end{bmatrix}

• Objective

J(\theta) = \frac{1}{2}(y - \hat{y})^\top (y - \hat{y}) = \frac{1}{2}(y - X\theta)^\top (y - X\theta)

Page 64

Review: Matrix Form of Linear Reg.

• Objective

J(\theta) = \frac{1}{2}(y - X\theta)^\top (y - X\theta), \qquad \min_\theta J(\theta)

• Gradient

\frac{\partial J(\theta)}{\partial \theta} = -X^\top (y - X\theta)

• Solution

\frac{\partial J(\theta)}{\partial \theta} = 0 \;\Rightarrow\; X^\top (y - X\theta) = 0 \;\Rightarrow\; X^\top y = X^\top X\theta \;\Rightarrow\; \theta = (X^\top X)^{-1} X^\top y

Page 65

Generalized Linear Models

• Dependence

y = f(\theta^\top \phi(x))

• Feature mapping function \phi(x)
• Mapped feature matrix

\Phi = \begin{bmatrix} \phi(x^{(1)}) \\ \phi(x^{(2)}) \\ \vdots \\ \phi(x^{(i)}) \\ \vdots \\ \phi(x^{(n)}) \end{bmatrix}
     = \begin{bmatrix}
         \phi_1(x^{(1)}) & \phi_2(x^{(1)}) & \cdots & \phi_h(x^{(1)}) \\
         \phi_1(x^{(2)}) & \phi_2(x^{(2)}) & \cdots & \phi_h(x^{(2)}) \\
         \vdots          & \vdots          & \ddots & \vdots          \\
         \phi_1(x^{(i)}) & \phi_2(x^{(i)}) & \cdots & \phi_h(x^{(i)}) \\
         \vdots          & \vdots          & \ddots & \vdots          \\
         \phi_1(x^{(n)}) & \phi_2(x^{(n)}) & \cdots & \phi_h(x^{(n)})
       \end{bmatrix}

Page 66

Matrix Form of Kernel Linear Regression

• Objective

J(\theta) = \frac{1}{2}(y - \Phi\theta)^\top (y - \Phi\theta), \qquad \min_\theta J(\theta)

• Gradient

\frac{\partial J(\theta)}{\partial \theta} = -\Phi^\top (y - \Phi\theta)

• Solution

\frac{\partial J(\theta)}{\partial \theta} = 0 \;\Rightarrow\; \Phi^\top (y - \Phi\theta) = 0 \;\Rightarrow\; \Phi^\top y = \Phi^\top \Phi\theta \;\Rightarrow\; \theta = (\Phi^\top \Phi)^{-1}\Phi^\top y

Page 67

Matrix Form of Kernel Linear Regression

• The optimal parameters with L2 regularization

\theta = (\Phi^\top \Phi + \lambda I_h)^{-1}\Phi^\top y = \Phi^\top (\Phi\Phi^\top + \lambda I_n)^{-1} y

• With the algebra trick

(P^{-1} + B^\top R^{-1} B)^{-1} B^\top R^{-1} = P B^\top (B P B^\top + R)^{-1}

• For prediction, we never actually need to access \Phi explicitly:

\hat{y} = \Phi\theta = \Phi\Phi^\top (\Phi\Phi^\top + \lambda I_n)^{-1} y = K(K + \lambda I_n)^{-1} y

where the kernel matrix K = \Phi\Phi^\top = \{K(x^{(i)}, x^{(j)})\}

[http://www.ics.uci.edu/~welling/classnotes/papers_class/Kernel-Ridge.pdf]
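A sketch of prediction through the kernel matrix only, never forming Φ; the RBF kernel is used purely as an example kernel and is not specified in the slides.

import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    """K(a, b) = exp(-gamma * ||a - b||^2) for all pairs of rows of A and B."""
    sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq_dists)

def kernel_ridge_predict(X_train, y_train, X_test, lam=0.1):
    """y_hat = K_test (K + lambda I_n)^{-1} y, using only kernel evaluations."""
    K = rbf_kernel(X_train, X_train)
    alpha = np.linalg.solve(K + lam * np.eye(len(y_train)), y_train)
    return rbf_kernel(X_test, X_train) @ alpha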