
Lesson 09
Convolutional Neural Network - Advanced Techniques

Ing. Marek Hrúz, Ph.D.

Katedra kybernetiky, Fakulta aplikovaných věd

Západočeská univerzita v Plzni


Nesterov Momentum

• This method is already well established and recommended to use

• The idea is: first go to the point where we should end up (following the momentum direction v_t)

• Then correct the estimate by computing the gradient at this "look-ahead" point

ω_{t+1} = ω_t + v_t − ε · ∇L(ω_t + v_t)    (1)

• where v_t = α · v_{t−1} − β · ε · ω_{t−1} − ε · ∇L(ω_{t−1}) is the momentum

• For comparison, the classical momentum update is:

ω_{t+1} = ω_t + v_{t+1}    (2)

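As a concrete sketch of the two updates (illustrative code, not from the lecture; the weight-decay term β · ε · ω of the slide's v_t is omitted for brevity, and the function names are mine):

```python
def nesterov_step(w, v, grad, lr=0.1, alpha=0.9):
    """One Nesterov update: evaluate the gradient at the look-ahead
    point w + v, then form the corrected velocity (cf. Eq. 1)."""
    g = grad(w + v)             # gradient at the look-ahead point
    v_new = alpha * v - lr * g  # corrected velocity
    return w + v_new, v_new

def momentum_step(w, v, grad, lr=0.1, alpha=0.9):
    """Classical momentum for comparison: gradient at the current point."""
    g = grad(w)
    v_new = alpha * v - lr * g
    return w + v_new, v_new
```

On a toy quadratic L(ω) = ω² (gradient 2ω) both methods converge, with the look-ahead gradient typically damping the oscillations sooner.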


Nesterov Momentum vs. Momentum


Per-parameter adaptive learning rate methods

• There are drawbacks to static learning rates

• The learning rate is also global, which is bad: every parameter is updated with the same step size

• There is a way to estimate the learning rate from the gradient estimate

• The gradient gives the direction and rate of the largest growth of a function

• But we apply the gradient in the domain of the function


Gradient visualization


Gradient application


Per-parameter adaptive learning rate methods

• A steep function has a large gradient, but steepness (intuitively) means that we are close to a local extremum, so the step should be small

• A shallow function has a small gradient, but shallowness (intuitively) means that we are far from an extremum, so the step should be large

• We can make use of this by applying an adaptive learning rate


Adagrad

• We introduce a parameter accumulating the squared gradient magnitudes:

σ_i = σ_i + ‖∇J(ω_i^t)‖₂²    (3)

ω_i^{t+1} = ω_i^t − (ε / √(σ_i + ξ)) · ∇J(ω_i^t)    (4)

• where ξ is a small constant and ω_i^t is the weight vector of the i-th neuron

• The method is aggressive and the gradient updates decay to zero (since σ_i only grows)

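A single-parameter sketch of the update (illustrative code, assuming the ε/√(σ + ξ) reading of Eq. 4; names are mine):

```python
import math

def adagrad_step(w, sigma, g, lr=0.5, xi=1e-8):
    """One Adagrad update, Eqs. (3)-(4): accumulate the squared gradient
    into sigma, then divide the step by its square root."""
    sigma = sigma + g * g                      # Eq. (3)
    w = w - lr / math.sqrt(sigma + xi) * g     # Eq. (4)
    return w, sigma
```

Because sigma only ever grows, consecutive step sizes shrink; this is exactly the aggressive decay noted above.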

Adadelta & RMSProp

• Adadelta is a reaction to the weak point of Adagrad (dying updates)

• The ever-growing magnitude of the gradient history is replaced by a running average:

σ_i = γ · σ_i + (1 − γ) · ‖∇J(ω_i^t)‖₂²    (5)

• The learning rate is also approximated (we want it to have the same hypothetical units as the gradient) as a running average of squared parameter updates:

ε_{t+1} = γ · ε_t + (1 − γ) · (Δω_i^t)²    (6)

• Then we can write the update as:

ω_i^{t+1} = ω_i^t − (√(ε_t + ξ) / √(σ_i + ξ)) · ∇J(ω_i^t)    (7)

• The independently discovered RMSProp omits the learning-rate approximation and keeps a fixed global ε instead

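A single-parameter sketch of Eqs. (5)-(7) (illustrative code; variable names are mine):

```python
import math

def adadelta_step(w, sigma, eps_acc, g, gamma=0.9, xi=1e-6):
    """One Adadelta update: running average of squared gradients (sigma,
    Eq. 5), running average of squared updates (eps_acc, Eq. 6), and the
    ratio of their square roots as the effective step size (Eq. 7)."""
    sigma = gamma * sigma + (1 - gamma) * g * g                    # Eq. (5)
    delta = -math.sqrt(eps_acc + xi) / math.sqrt(sigma + xi) * g   # Eq. (7)
    eps_acc = gamma * eps_acc + (1 - gamma) * delta * delta        # Eq. (6)
    return w + delta, sigma, eps_acc
```

Unlike Adagrad, sigma is a running average and can shrink again, so the updates do not die out.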

Adam

• Adaptive Moment Estimation (Adam) additionally approximates the running average of non-squared gradients (first moment):

m_i = β₁ · m_i + (1 − β₁) · ∇J(ω_i^t)    (8)

• The running average of squared gradient magnitudes is kept (second moment):

v_i = β₂ · v_i + (1 − β₂) · ‖∇J(ω_i^t)‖₂²    (9)

• The update becomes:

ω_i^{t+1} = ω_i^t − (ε / √(v_i + ξ)) · m_i    (10)

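A single-parameter sketch of Eqs. (8)-(10) (illustrative code; the bias-correction terms of the original Adam paper are omitted, matching the slide's simplified form, and names are mine):

```python
import math

def adam_step(w, m, v, g, lr=0.01, b1=0.9, b2=0.999, xi=1e-8):
    """One Adam update, Eqs. (8)-(10): first moment m, second moment v,
    step scaled by 1/sqrt of the second moment."""
    m = b1 * m + (1 - b1) * g              # first moment, Eq. (8)
    v = b2 * v + (1 - b2) * g * g          # second moment, Eq. (9)
    w = w - lr / math.sqrt(v + xi) * m     # Eq. (10)
    return w, m, v
```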

Object Detecting Networks - time-lapse


Region based CNN

• R-CNN

• Propose regions via selective search, i.e. merging of superpixels (about 2 s per image)

• Extract features from warped proposals

• Train a per-class SVM on the CNN features


Region Proposal Network

• Faster R-CNN
• Two modules that share parameters:
  • Region proposal module
  • Classification module
• Region proposal: a small CNN is applied to the last layer of a pre-trained CNN (VGG-16; 14 × 14 × 512)
• The small CNN is a convolutional layer (with n × n kernels) followed by a fully connected layer (512-dimensional)
• After that there are two FC layers: 2k neurons for objectness and 4k neurons for location


Region Proposal Network

• k is the number of anchor boxes

• The anchors have different sizes and aspect ratios

• In the original paper they use 3 sizes and 3 ratios

• The outputs of the 2k and 4k FC layers are relative to the position and size of the appropriate anchor box

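The 3 sizes × 3 ratios anchor grid can be sketched as follows (illustrative code; the pixel sizes 128/256/512 and the ratios follow the Faster R-CNN paper, the function name is mine):

```python
import itertools

def make_anchors(cx, cy, sizes=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Generate k = len(sizes) * len(ratios) anchors (cx, cy, w, h)
    centred at one feature-map position. Width and height are chosen so
    the area stays ~size^2 while the aspect ratio w/h equals the ratio."""
    anchors = []
    for s, r in itertools.product(sizes, ratios):
        w = s * r ** 0.5
        h = s / r ** 0.5
        anchors.append((cx, cy, w, h))
    return anchors
```

With the defaults this gives k = 9 anchors per position, matching the 2k = 18 objectness and 4k = 36 location outputs.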

Region Proposal Network - learning

• The targets are deduced from the ground truth data (annotated object regions)

• Each anchor with an Intersection over Union (IoU) greater than 0.7 is considered a positive sample

• The objectness of such an anchor (in the 2k FC layer) is set to positive

• The regression targets of the anchor (position and size) are computed from the ground truth region

• Negative anchors are those with an IoU lower than 0.3

• The loss is defined as:

L({p_i}, {t_i}) = (1/N_cls) · Σ_i L_cls(p_i, p_i*) + λ · (1/N_reg) · Σ_i p_i* · L_reg(t_i, t_i*)    (11)

• L_cls is a two-class softmax loss, L_reg is a smoothed L1-norm

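The anchor-labelling rule above can be sketched as (illustrative code; boxes are given as (x1, y1, x2, y2) corners, names are mine):

```python
def iou(a, b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def anchor_label(iou_value):
    """Labelling rule from the slide: positive above 0.7, negative below
    0.3, everything in between is ignored during training."""
    if iou_value > 0.7:
        return 1       # positive sample
    if iou_value < 0.3:
        return 0       # negative sample
    return -1          # ignored
```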

Region Recognition - Detection Network

• Uses Fast R-CNN

• The weights are shared: the same features are used for proposal and detection

• Each proposed region is recognized, but watch out! The regions have different sizes

• That is why you need to use ROI max-pooling

• This is quite hard to implement and has been superseded by the Single Shot Multi-Box Detector

• ... and thus you need not learn it (YEAH!)

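ROI max-pooling itself is conceptually simple even if the efficient GPU implementation is not; a naive single-channel sketch (illustrative code, names are mine):

```python
def roi_max_pool(feature_map, roi, out_h=2, out_w=2):
    """Divide a variable-sized region of the feature map into a fixed
    out_h x out_w grid and take the max in each cell, yielding a
    fixed-size output for any region size.
    feature_map: 2D list (one channel); roi: (x1, y1, x2, y2) in cells."""
    x1, y1, x2, y2 = roi
    h, w = y2 - y1, x2 - x1
    out = []
    for i in range(out_h):
        row = []
        ys, ye = y1 + i * h // out_h, y1 + (i + 1) * h // out_h
        for j in range(out_w):
            xs, xe = x1 + j * w // out_w, x1 + (j + 1) * w // out_w
            cell = [feature_map[y][x]
                    for y in range(ys, max(ye, ys + 1))
                    for x in range(xs, max(xe, xs + 1))]
            row.append(max(cell))
        out.append(row)
    return out
```

Any proposed region, whatever its size, is reduced to the same out_h × out_w grid before the classification head.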

Single Shot Multi-Box Detector (SSD)


Detection and Classification


Default Boxes


SSD - all together


Training of SSD

• Each ground truth box is matched to the default box (anchor) with the best Jaccard overlap (IoU)

• Each GT box thus has exactly one matching default box

• This is extended by also adding default boxes with at least 0.5 Jaccard overlap

• From these matches the deltas of x, y, width, and height are computed, and the classifier gets its labels

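The matching step can be sketched as (illustrative code; `iou_fn` is any IoU implementation, names are mine):

```python
def match_default_boxes(gt_boxes, default_boxes, iou_fn, threshold=0.5):
    """SSD matching sketch: each GT box gets its best-overlap default
    box, plus every default box whose IoU with the GT is >= threshold.
    Returns a dict {default_box_index: gt_index}."""
    matches = {}
    for g, gt in enumerate(gt_boxes):
        overlaps = [iou_fn(gt, d) for d in default_boxes]
        best = max(range(len(default_boxes)), key=lambda i: overlaps[i])
        matches[best] = g                  # the single best-overlap match
        for i, o in enumerate(overlaps):
            if o >= threshold:
                matches[i] = g             # extra matches above threshold
    return matches
```

Default boxes that end up unmatched become the negative (background) samples discussed below under hard negative mining.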

Training Objective

L(x, c, l, g) = (1/N) · (L_conf(x, c) + α · L_loc(x, l, g))    (12)

• x is the predicted class, c is the GT class, l are the regressed deltas of the default boxes, g are the GT deltas

• N is the number of matched default boxes

• L_loc is a smooth L1 loss, L_conf is a softmax loss

L_loc = Σ_i smooth_L1(l_i − g_i)    (13)

smooth_L1(x) = { 0.5 · x²  if |x| < 1;  |x| − 0.5  otherwise }    (14)

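Eqs. (13)-(14) translate directly to code (illustrative; function names are mine):

```python
def smooth_l1(x):
    """Smooth L1 from Eq. (14): quadratic near zero, linear elsewhere,
    so large regression errors do not dominate the gradient."""
    ax = abs(x)
    return 0.5 * x * x if ax < 1 else ax - 0.5

def loc_loss(pred_deltas, gt_deltas):
    """Localization loss of Eq. (13): sum of smooth L1 over the deltas."""
    return sum(smooth_l1(l - g) for l, g in zip(pred_deltas, gt_deltas))
```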

Hard negative mining

• After matching, a lot of default boxes will be negatives (the background class)

• This gives an imbalanced training set with lots of negative default boxes

• The negatives are ranked according to their confidence

• Only a portion of the negatives is chosen (a 3:1 negative-to-positive ratio) for the gradient update

• Data augmentation:
  • Use the entire original image
  • Sample patches from the original image so that they have an IoU with the object of at least 0.1, 0.3, 0.5, 0.7, or 0.9
  • Randomly sample the image into patches

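The 3:1 selection can be sketched as (illustrative code; negatives are ranked by their confidence loss so the "hardest" ones are kept, names are mine):

```python
def select_hard_negatives(neg_losses, num_positives, ratio=3):
    """Keep at most ratio * num_positives negatives, taking those with
    the highest confidence loss, per the 3:1 rule on the slide.
    Returns the indices of the selected negatives."""
    keep = min(len(neg_losses), ratio * num_positives)
    order = sorted(range(len(neg_losses)),
                   key=lambda i: neg_losses[i], reverse=True)
    return order[:keep]
```

Keeping only the highest-loss negatives both restores the class balance and focuses the gradient on the background boxes the model currently gets most wrong.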