Stochastic Gradient Descent · Spring 2020 · Security and Fairness of Deep Learning


Page 1

Stochastic Gradient Descent

Spring 2020

Security and Fairness of Deep Learning

Page 2

Image Classification

Page 3

Linear model

• Score function
  • Maps raw data to class scores

• Loss function
  • Measures how well predicted classes agree with ground truth labels
  • Multiclass Support Vector Machine loss (SVM loss)
  • Softmax classifier (cross-entropy loss)

• Learning
  • Find parameters of the score function that minimize the loss function

• Multiclass Support Vector Machine loss (SVM loss)

Page 4

Recall: Linear model with SVM loss

• Score function
  • Maps raw data to class scores

• Loss function
  • Measures how well predicted classes agree with ground truth labels

$$f(x_i, W) = W x_i$$

$$L = \frac{1}{N} \sum_i \sum_{j \neq y_i} \left[ \max\left(0,\; f(x_i; W)_j - f(x_i; W)_{y_i} + \Delta\right) \right] + \lambda \sum_k \sum_l W_{k,l}^2$$
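A minimal numpy sketch of this loss; the function name `svm_loss`, the argument shapes, and the default `delta`/`lam` values are illustrative, not the course's reference implementation:

```python
import numpy as np

def svm_loss(W, X, y, delta=1.0, lam=0.0):
    """Regularized multiclass SVM loss, as in the formula above.

    W: (C, D) weights, X: (N, D) examples, y: (N,) integer labels.
    delta (margin) and lam (regularization strength) are illustrative defaults.
    """
    N = X.shape[0]
    scores = X @ W.T                              # class scores f(x_i; W)
    correct = scores[np.arange(N), y]             # score of the true class
    margins = np.maximum(0.0, scores - correct[:, None] + delta)
    margins[np.arange(N), y] = 0.0                # the sum runs over j != y_i only
    return margins.sum() / N + lam * np.sum(W ** 2)

# Tiny 2-class example: the true class wins by more than the margin, so loss is 0
W = np.array([[1.0, 0.0], [0.0, 1.0]])
print(svm_loss(W, np.array([[2.0, 0.0]]), np.array([0])))   # 0.0
```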

Page 5

Today

• Learning model parameters that minimize the loss, with Stochastic Gradient Descent

• Later
  • Different score functions: deep networks
  • Same loss functions and learning algorithm

Page 6

Outline

• Visualizing the loss function
• Optimization
  • Random search
  • Random local search
  • Gradient descent
  • Mini-batch gradient descent

Page 7

Visualizing SVM loss function

• Difficult to visualize fully
  • For CIFAR-10, a linear classifier's weight matrix is of size [10 x 3073], for a total of 30,730 parameters

• Can gain intuition by visualizing along rays (1 dimension) or planes (2 dimensions)

Page 8

Visualizing in 1-D

• Generate a random weight matrix $W$
• Generate a random direction $W_1$
• Compute the loss along this direction: $L(W + a W_1)$

(Figure: loss for a single example, plotted against $a$. Where is the minimum?)
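The recipe above can be sketched as follows; the quadratic `loss` is a hypothetical stand-in for the real SVM loss, used only so the snippet is self-contained:

```python
import numpy as np

# Hypothetical stand-in for the SVM loss; any scalar function of W works here.
def loss(W):
    return float(np.sum(W ** 2))

rng = np.random.default_rng(0)
W = rng.standard_normal((10, 3073))    # random weight matrix (CIFAR-10 shape)
W1 = rng.standard_normal(W.shape)      # random direction

# Evaluate the loss along the ray W + a*W1 for a range of step sizes a
a_values = np.linspace(-1.0, 1.0, 21)
losses = [loss(W + a * W1) for a in a_values]
# Plotting losses against a_values gives the 1-D slice shown on the slide.
```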

Page 9

Visualizing in 2-D

• Compute the loss along a plane: $L(W + a W_1 + b W_2)$

(Figures: loss for a single example; average loss over 100 examples, which is a convex function.)

Page 10

How do we find weights that minimize loss?

• Random search
  • Try many random weight matrices and pick the best one
  • Performance: poor

• Random local search
  • Start with a random weight matrix
  • Try many local perturbations, pick the best one, and iterate
  • Performance: better but still quite poor

• Useful idea: iterative refinement of the weight matrix
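Both search strategies can be sketched in a few lines; the quadratic `loss` is again a hypothetical stand-in, and the sample counts and step scales are arbitrary:

```python
import numpy as np

def loss(W):
    # Hypothetical stand-in loss: a convex quadratic, so improvements are visible.
    return float(np.sum(W ** 2))

rng = np.random.default_rng(0)
shape = (10, 3073)

# Random search: sample many weight matrices independently, keep the best.
best_loss = float("inf")
for _ in range(100):
    candidate = rng.standard_normal(shape) * 0.001
    best_loss = min(best_loss, loss(candidate))

# Random local search: perturb the current matrix, keep changes that improve it.
W = rng.standard_normal(shape) * 0.001
start_loss = loss(W)
for _ in range(100):
    W_try = W + rng.standard_normal(shape) * 0.0001
    if loss(W_try) < loss(W):
        W = W_try   # iterative refinement: only accept improvements

print(loss(W) <= start_loss)   # True by construction
```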

Page 11

Optimization basics

Page 12

The problem of optimization

Find the value of x at which f(x) is minimized

Our setting: x represents weights, f(x) represents loss function


Page 13

In two stages

• Function of a single variable
• Function of multiple variables

Page 14

Derivative of a function of single variable

$$\frac{df(x)}{dx} = \lim_{h \to 0} \frac{f(x+h) - f(x)}{h}$$
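The limit definition suggests a finite-difference approximation: replace the limit with a small fixed h. The helper name and the choice h = 1e-5 below are illustrative:

```python
def numerical_derivative(f, x, h=1e-5):
    """Forward-difference approximation of the limit definition above."""
    return (f(x + h) - f(x)) / h

# d/dx (x^2) = 2x, so the derivative at x = 3 is 6
approx = numerical_derivative(lambda x: x ** 2, 3.0)
print(abs(approx - 6.0) < 1e-3)   # True
```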

Page 15

Derivatives

$$\frac{d}{dx}\left(x^2\right) = 2x$$

$$\frac{d}{dx}\left(e^x\right) = e^x$$

$$\frac{d}{dx}\left(\ln x\right) = \frac{1}{x} \quad \text{if } x > 0$$

Page 16

Finding minima

Increase x if the derivative is negative, decrease x if it is positive; i.e., take a step in the direction opposite to the sign of the derivative (the key idea of gradient descent).

(Figure: a curve $f(x)$ with $\frac{dy}{dx} < 0$ to the left of the minimum, $\frac{dy}{dx} = 0$ at the minimum, and $\frac{dy}{dx} > 0$ to the right.)

Page 17

Doesn’t always work

• Theoretical and empirical evidence that gradient descent works quite well for deep networks

(Figure: a curve $f(x)$ with its global minimum, a local minimum, an inflection point, and the global maximum marked.)

Page 18

In two stages

• Function of a single variable
• Function of multiple variables

Page 19

Partial derivatives

The partial derivative of an n-ary function $f(x_1, \ldots, x_n)$ in the direction $x_i$ at the point $(a_1, \ldots, a_n)$ is defined to be:

$$\frac{\partial f}{\partial x_i}(a_1, \ldots, a_n) = \lim_{h \to 0} \frac{f(a_1, \ldots, a_i + h, \ldots, a_n) - f(a_1, \ldots, a_n)}{h}$$

Page 20

Partial derivative example

By IkamusumeFan - Own work, CC BY-SA 4.0, https://commons.wikimedia.org/w/index.php?curid=42262627

At (1, 1), the slope is 3

Page 21

The gradient of a scalar function

• The gradient 𝛻𝑓(𝑋) of a scalar function 𝑓(𝑋) of a multi-variate input 𝑋 is a multiplicative factor that gives us the change in 𝑓(𝑋) for tiny variations in 𝑋

$$df(X) = \nabla f(X) \, dX \qquad\Leftrightarrow\qquad \nabla f(X) = \frac{df(X)}{dX}$$

Page 22

Gradients of scalar functions with multi-variate inputs

• Consider $f(X) = f(x_1, x_2, \ldots, x_n)$

• $\nabla f(X) = \left[ \dfrac{\partial f(X)}{\partial x_1} \;\; \dfrac{\partial f(X)}{\partial x_2} \;\; \cdots \;\; \dfrac{\partial f(X)}{\partial x_n} \right]$

Page 23

Computing gradients analytically

$$f(x, y) = x + y \;\rightarrow\; \frac{\partial f}{\partial x} = 1, \quad \frac{\partial f}{\partial y} = 1$$

Page 24

Computing gradients analytically

$$f(x, y) = xy \;\rightarrow\; \frac{\partial f}{\partial x} = y, \quad \frac{\partial f}{\partial y} = x$$

$$\nabla f = \left[ \frac{\partial f}{\partial x}, \frac{\partial f}{\partial y} \right] = [y, x]$$

Page 25

Derivatives measure sensitivity

If we were to increase $x$ by a tiny amount, the effect on the whole expression would be to decrease it (due to the negative sign), and by three times that amount.

$$x = 4, \; y = -3 \;\Rightarrow\; f(x, y) = -12, \quad \frac{\partial f}{\partial x} = -3$$

$$f(x, y) = xy \;\rightarrow\; \frac{\partial f}{\partial x} = y, \quad \frac{\partial f}{\partial y} = x$$
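This sensitivity claim can be checked numerically with a forward difference in x; the helper `partial_x` is a hypothetical name, not course code:

```python
def f(x, y):
    return x * y

def partial_x(f, x, y, h=1e-6):
    """Forward-difference estimate of df/dx with y held fixed."""
    return (f(x + h, y) - f(x, y)) / h

# At x = 4, y = -3: f(x, y) = -12 and df/dx = y = -3, so a tiny increase in x
# decreases the whole expression by three times that amount.
estimate = partial_x(f, 4.0, -3.0)
print(abs(estimate - (-3.0)) < 1e-3)   # True
```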

Page 26

Finding minima

Take a step in the direction opposite to the gradient.

• Moving in the direction of the gradient vector $\nabla f(X)$ increases $f(X)$ fastest
• Moving in the direction of $-\nabla f(X)$ decreases $f(X)$ fastest

Page 27

Gradient descent algorithm

• Initialize:
  • $x_0$
  • $k = 0$

• Do
  • $x_{k+1} = x_k - \eta_k \nabla f(x_k)$
  • $k = k + 1$

• Until $|f(x_k) - f(x_{k-1})| \leq \varepsilon$

Average gradient across all training examples:

$$f(x_k) = \frac{1}{N} \sum_{i=1}^{N} f_i(x_k)$$

$$\nabla f(x_k) = \frac{1}{N} \sum_{i=1}^{N} \nabla f_i(x_k)$$
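The algorithm above can be sketched directly; the function names, the constant step size, and the test problem below are all illustrative choices:

```python
def gradient_descent(grad_f, f, x0, eta=0.1, eps=1e-8, max_iters=10000):
    """Gradient descent with the stopping rule |f(x_k) - f(x_{k-1})| <= eps."""
    x = x0
    f_prev = f(x)
    for _ in range(max_iters):
        x = x - eta * grad_f(x)        # x_{k+1} = x_k - eta_k * grad f(x_k)
        f_curr = f(x)
        if abs(f_prev - f_curr) <= eps:
            break                      # loss barely changed: stop
        f_prev = f_curr
    return x

# Minimize f(x) = (x - 2)^2, whose gradient is 2(x - 2); the minimum is at x = 2
x_min = gradient_descent(lambda x: 2 * (x - 2), lambda x: (x - 2) ** 2, x0=0.0)
print(x_min)   # close to 2.0
```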

Page 28

Step size affects convergence of gradient descent

Murphy, Machine Learning, Fig 8.2

Page 29

Gradient descent algorithm

• Initialize:
  • $x_0$
  • $k = 0$

• Do
  • $x_{k+1} = x_k - \eta_k \nabla f(x_k)$
  • $k = k + 1$

• Until $|f(x_k) - f(x_{k-1})| \leq \varepsilon$

Challenge to discuss later: How to choose step size?

Average gradient across all training examples

Challenge: Not scalable for very large data sets

Page 30

Mini-batch gradient descent

• Initialize:
  • $x_0$
  • $k = 0$

• Do
  • $x_{k+1} = x_k - \eta_k \nabla f(x_k)$
  • $k = k + 1$

• Until $|f(x_k) - f(x_{k-1})| \leq \varepsilon$

Faster convergence

Average gradient over small batches of training examples (e.g., a sample of 256 examples)

Special case: stochastic or online gradient descent → use a single training example in each update step
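A minimal sketch of mini-batch SGD, assuming a per-example gradient function is available; `minibatch_sgd` and all hyperparameter values are illustrative:

```python
import numpy as np

def minibatch_sgd(grad_fi, x0, data, eta=0.05, batch_size=2, epochs=200, seed=0):
    """Mini-batch SGD: each update averages per-example gradients over a batch."""
    rng = np.random.default_rng(seed)
    x = x0
    n = len(data)
    for _ in range(epochs):
        idx = rng.permutation(n)                       # reshuffle each epoch
        for start in range(0, n, batch_size):
            batch = data[idx[start:start + batch_size]]
            grad = np.mean([grad_fi(x, d) for d in batch], axis=0)
            x = x - eta * grad
    return x

# Per-example loss f_i(x) = (x - d_i)^2, so the average loss is minimized at the
# data mean; batch_size=1 would be the stochastic/online special case.
data = np.array([1.0, 2.0, 3.0, 4.0])
x_min = minibatch_sgd(lambda x, d: 2 * (x - d), x0=0.0, data=data)
print(x_min)   # hovers near the data mean, 2.5
```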

Page 31

Stochastic gradient descent convergence

Murphy, Machine Learning, Fig 8.8

Page 32

SVM loss visualization

Challenge: the gradient does not exist everywhere (the max in the loss introduces kinks)

Page 33

Computing subgradients analytically

The set of subderivatives at $x_0$ for a convex function is a nonempty closed interval $[a, b]$, where $a$ and $b$ are the one-sided limits:

$$a = \lim_{x \to x_0^-} \frac{f(x) - f(x_0)}{x - x_0} \qquad b = \lim_{x \to x_0^+} \frac{f(x) - f(x_0)}{x - x_0}$$

Page 34

Computing subgradients analytically

$$f(x, y) = \max(x, y) \;\rightarrow\; \frac{\partial f}{\partial x} = \mathbb{1}(x \geq y), \quad \frac{\partial f}{\partial y} = \mathbb{1}(y \geq x)$$

The (sub)gradient is 1 on the input that is larger and 0 on the other input

Page 35

Subgradient of SVM loss

$$L_i = \sum_{j \neq y_i} \left[ \max\left(0,\; w_j^T x_i - w_{y_i}^T x_i + \Delta\right) \right]$$

$$\nabla_{w_{y_i}} L_i = -\left( \sum_{j \neq y_i} \mathbb{1}\left(w_j^T x_i - w_{y_i}^T x_i + \Delta > 0\right) \right) x_i$$

The sum counts the number of classes that didn't meet the desired margin.

$$\nabla_{w_j} L_i = \mathbb{1}\left(w_j^T x_i - w_{y_i}^T x_i + \Delta > 0\right) x_i$$

Nonzero when the j-th class didn't meet the desired margin.
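These (sub)gradient formulas can be implemented and checked against a finite-difference gradient; `svm_subgrad`, `svm_loss_i`, and the shapes used are illustrative, not the course's reference code:

```python
import numpy as np

def svm_loss_i(W, x, y, delta=1.0):
    """Per-example SVM loss L_i; W is (C, D) with one row of weights per class."""
    scores = W @ x
    m = np.maximum(0.0, scores - scores[y] + delta)
    m[y] = 0.0                                # the sum excludes j = y_i
    return m.sum()

def svm_subgrad(W, x, y, delta=1.0):
    """Analytic (sub)gradient of L_i w.r.t. W, per the formulas above."""
    scores = W @ x
    active = (scores - scores[y] + delta > 0)
    active[y] = False                         # indicator over j != y_i only
    grad = np.outer(active.astype(float), x)  # row j: 1(margin violated) * x_i
    grad[y] = -active.sum() * x               # row y_i: -(violation count) * x_i
    return grad

# Finite-difference check on a small random instance
rng = np.random.default_rng(0)
W = rng.standard_normal((3, 4)); x = rng.standard_normal(4); y = 1
g = svm_subgrad(W, x, y)
num, h = np.zeros_like(W), 1e-6
for k in range(W.shape[0]):
    for l in range(W.shape[1]):
        Wp = W.copy(); Wp[k, l] += h
        num[k, l] = (svm_loss_i(Wp, x, y) - svm_loss_i(W, x, y)) / h
# The two gradients agree away from the kinks of the max
print(np.max(np.abs(g - num)) < 1e-3)
```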

Page 36

Review derivatives

• Please review rules for computing derivatives and partial derivatives of functions, including the chain rule

• https://www.khanacademy.org/math/multivariable-calculus/multivariable-derivatives

• You will need to use them in HW1!

Page 37

Summary

Page 38

Acknowledgment

• Based on material from
  • Stanford CS231n http://cs231n.github.io/
  • CMU 11-785 Course
  • Spring 2019 Course