/
CSE 559A: Computer Vision
Fall 2020: T-R: 11:30-12:50pm @ Zoom
Instructor: Ayan Chakrabarti ([email protected]). Course Staff: Adith Boloor, Patrick Williams
Dec 3, 2020
http://www.cse.wustl.edu/~ayan/courses/cse559a/
1
/
LAST TIME
Talked about core semantic vision tasks.
2
/
THE EFFECT OF DATA
[slides 3-12: figures only; images not included in this transcript]
/
THE EFFECT OF DATA
Older ML methods were designed for small training sets:
Used more complex optimization methods (than gradient descent): second-order methods, etc.
Methods had better guarantees if you chose simpler classifiers.
And in practice, gave you better results than neural networks.
But were quadratic in training set size.
With training sets of millions, quadratic-time optimization was not feasible.
So people first moved to gradient descent, but with the same simple classifiers.
Found that with additional computation power, if you train with a small step size for many iterations (still better than quadratic), gradient descent gives you a reasonable answer.
But then, since gradient descent was working, the question was: why not try more complex classifiers?
And Krizhevsky and others demonstrated: in this large-training-set / high-training-computation-budget regime, deep neural networks are much better!
13
/
ARCHITECTURES
Broad Design Principles
Think of a network that can express the operations that you think are needed to solve the problem:
What kind of receptive field should it have?
How non-linear does it need to be?
What should be the nature of the flow of information across the image?
Make sure it's a function you can actually learn:
Think of the flow of gradients.
Try to use other architectures that you know can be successfully trained as a starting point.
Dealing with overfitting, one approach:
First find the biggest, deepest network that will overfit the data.
(Given enough capacity, CNNs will often be able to just memorize the dataset.)
Then scale it down so that it generalizes.
14
/
ARCHITECTURES
Let's consider image classification.
We will fix our input image to be a specific size.
Typically choose square images of size S × S.
Given an image, resize proportionally so that the smaller side (height or width) is S.
Then take an S × S crop along the other direction.
(Sometimes take multiple crops and average.)
The final output will be a C-dimensional vector for C classes.
Train using soft-max cross entropy. Classify using arg-max.
Often, you'll hear about Top-K error:
How often is the true class in the top K highest values of the predicted vector?
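The Top-K error described above is easy to compute directly; a minimal numpy sketch (the function name `topk_error` is our choice, not from the course):

```python
import numpy as np

def topk_error(scores, labels, K):
    """Fraction of examples whose true class is NOT among the K largest scores."""
    topk = np.argsort(scores, axis=1)[:, -K:]        # indices of K highest scores per row
    hit = (topk == labels[:, None]).any(axis=1)      # true class among them?
    return 1.0 - hit.mean()

scores = np.array([[0.1, 0.5, 0.4],
                   [0.7, 0.2, 0.1],
                   [0.3, 0.3, 0.4]])
labels = np.array([2, 0, 1])
print(topk_error(scores, labels, 1))  # 2/3: only the second example's arg-max is correct
print(topk_error(scores, labels, 2))  # 0.0: every true class is in the top 2
```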
15
/
ARCHITECTURES
Let's talk about VGG-16, from the ImageNet 2014 competition (winner of the localization task, runner-up in classification).
References:
Ken Chatfield, Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman, Return of the Devil in the Details: Delving Deep into Convolutional Nets.
Karen Simonyan & Andrew Zisserman, Very Deep Convolutional Networks for Large-Scale Image Recognition.
Four kinds of layers:
Convolutional
Max-pooling
Fully Connected
Soft-max
16
/
ARCHITECTURES
Convolutional Layers
Take a spatial input, produce a spatial output:
B × H × W × C ⇒ B × H′ × W′ × C′
g[b, y, x, c2] = Σ_{ky, kx, c1} f[b, y·s + ky, x·s + kx, c1] · k[ky, kx, c1, c2]
Here, s is the stride. Can also combine with down-sampling.
PSET 5 asks you to implement valid convolution.
But it is often combined with padding (just like regular convolution).
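The sum above maps directly to loops over the output grid; a minimal numpy sketch (the function name `conv_valid` and the shapes are our choices, not the PSET 5 interface):

```python
import numpy as np

def conv_valid(f, k, s=1):
    """Valid convolution with stride s:
    g[b,y,x,c2] = sum over ky,kx,c1 of f[b, y*s+ky, x*s+kx, c1] * k[ky,kx,c1,c2]."""
    B, H, W, C1 = f.shape
    K1, K2, _, C2 = k.shape
    Ho = (H - K1) // s + 1
    Wo = (W - K2) // s + 1
    g = np.zeros((B, Ho, Wo, C2))
    for y in range(Ho):
        for x in range(Wo):
            patch = f[:, y*s:y*s+K1, x*s:x*s+K2, :]   # B x K1 x K2 x C1 window
            # Contract the window against the kernel over (ky, kx, c1).
            g[:, y, x, :] = np.tensordot(patch, k, axes=([1, 2, 3], [0, 1, 2]))
    return g

f = np.random.default_rng(0).normal(size=(2, 8, 8, 3))
k = np.random.default_rng(1).normal(size=(3, 3, 3, 4))
print(conv_valid(f, k).shape)       # (2, 6, 6, 4)
print(conv_valid(f, k, s=2).shape)  # (2, 3, 3, 4)
```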
17
/
ARCHITECTURES
Question: Input activation is B × H × W × C1, and I convolve it with a kernel of size K × K × C1 × C2. What is the size of my output? Assume valid convolution.
Answer: B × (H − K + 1) × (W − K + 1) × C2.
Question: What if I do this with a stride of 2?
Downsample the above by 2: B × (⌊(H − K)/2⌋ + 1) × (⌊(W − K)/2⌋ + 1) × C2. Think of what happens when the sizes are even or odd.
In general, you want to pad such that H − K and W − K are even, so that you keep the right and bottom edges of your images.
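The floor formula above can be checked numerically; a quick sketch (`out_size` is a hypothetical helper name):

```python
def out_size(H, K, s):
    """Valid-convolution output size along one dimension: floor((H - K)/s) + 1."""
    return (H - K) // s + 1

K = 3
for H in (8, 9):
    print(H, out_size(H, K, 1), out_size(H, K, 2))
# H=8: 6 at stride 1, 3 at stride 2; H=9: 7 and 4.
# When H - K is odd (H=8, K=3), the floor drops the last position:
# pad so H - K is even to keep the right/bottom edge.
```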
18
/
ARCHITECTURES
Max-Pooling Layer
For each channel, choose the maximum value in a spatial neighborhood:
B × H × W × C ⇒ B × H′ × W′ × C
g[b, y, x, c] = max_{ky, kx} f[b, y·s + ky, x·s + kx, c]
What will the gradients of this look like?
Motivated by intuition from traditional object recognition (deformable part models). Allows for some slack in exact spatial location.
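The max above can be sketched the same way as the convolution loop; a minimal numpy version (window size K and stride s are our parameter names):

```python
import numpy as np

def maxpool(f, K=2, s=2):
    """Per-channel max over each K x K window with stride s."""
    B, H, W, C = f.shape
    Ho, Wo = (H - K) // s + 1, (W - K) // s + 1
    g = np.zeros((B, Ho, Wo, C))
    for y in range(Ho):
        for x in range(Wo):
            g[:, y, x, :] = f[:, y*s:y*s+K, x*s:x*s+K, :].max(axis=(1, 2))
    return g

f = np.arange(16, dtype=float).reshape(1, 4, 4, 1)
print(maxpool(f)[0, :, :, 0])  # [[5, 7], [13, 15]]
# Gradients flow only to the arg-max element of each window; all others get 0.
```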
19
/
ARCHITECTURES
VGG-16
Input is a 224x224x3 Image
- Block 1 - 3x3 Conv (Pad 1): 3->64 + RELU (*pad 1 means on all sides; all conv layers have a bias)
20
/
ARCHITECTURES
VGG-16
Input is a 224x224x3 Image
- Block 1 - 3x3 Conv (Pad 1): 3->64 + RELU - 3x3 Conv (Pad 1): 64->64 + RELU - 2x2 Max-Pool (Pad 0, Stride 2): 64->64 Input to Block 2 is 112x112x64 (called pool1)
- Block 2 - 3x3 Conv (Pad 1): 64->128 + RELU - 3x3 Conv (Pad 1): 128->128 + RELU - 2x2 Max-Pool (Pad 0, Stride 2): 128->128 Input to Block 3 is 56x56x128 (called pool2)
- Block 3 - 3x3 Conv (Pad 1): 128->256 + RELU - 3x3 Conv (Pad 1): 256->256 + RELU - 3x3 Conv (Pad 1): 256->256 + RELU - 2x2 Max-Pool (Pad 0, Stride 2): 256->256 Input to Block 4 is 28x28x256 (called pool3)
- Block 4 - 3x3 Conv (Pad 1): 256->512 + RELU - 3x3 Conv (Pad 1): 512->512 + RELU - 3x3 Conv (Pad 1): 512->512 + RELU - 2x2 Max-Pool (Pad 0, Stride 2): 512->512 Input to Block 5 is 14x14x512 (called pool4)
- Block 5 - 3x3 Conv (Pad 1): 512->512 + RELU - 3x3 Conv (Pad 1): 512->512 + RELU - 3x3 Conv (Pad 1): 512->512 + RELU - 2x2 Max-Pool (Pad 0, Stride 2): 512->512 Output of Block 5 is 7x7x512 (called pool5)
21
/
ARCHITECTURES
VGG-16
This is the final output that is trained with a soft-max + cross entropy.
Lots of layers: 138 Million Parameters.
Compared to previous architectures, used really small conv filters. This has now become standard.
Two 3x3 layers are better than a single 5x5 layer:
More non-linear
Fewer independent weights
Train this with backprop! Back in the day, this would take a week or more.
- Block 5 - 3x3 Conv (Pad 1): 512->512 + RELU - 3x3 Conv (Pad 1): 512->512 + RELU - 3x3 Conv (Pad 1): 512->512 + RELU - 2x2 Max-Pool (Pad 0, Stride 2): 512->512 Output of Block 5 is 7x7x512 (called pool5)
- Reshape to a (49*512=25088) dimensional vector (or B x 25088)
- Fully connected (matmul + bias) 25088 -> 4096 + RELU - Fully connected (matmul + bias) 4096 -> 4096 + RELU - Fully connected (matmul + bias) 4096 -> 1000
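The "138 Million Parameters" figure can be recovered directly from the layer list above (weights plus biases for each 3x3 conv and fully connected layer); a short tally:

```python
# (in_channels, out_channels) for every 3x3 conv layer, block by block.
convs = [(3, 64), (64, 64),                   # Block 1
         (64, 128), (128, 128),               # Block 2
         (128, 256), (256, 256), (256, 256),  # Block 3
         (256, 512), (512, 512), (512, 512),  # Block 4
         (512, 512), (512, 512), (512, 512)]  # Block 5
fcs = [(25088, 4096), (4096, 4096), (4096, 1000)]

total = sum(3 * 3 * cin * cout + cout for cin, cout in convs)   # 3x3 kernels + biases
total += sum(din * dout + dout for din, dout in fcs)            # matmul weights + biases
print(total)  # 138357544 -> the "138 Million Parameters" above
```

Note that the three fully connected layers alone account for roughly 124M of the total; the convolutional stack is comparatively cheap in parameters.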
22
/
TRAINING IN PRACTICE
Remember: Gradient Descent is Fragile
The Effect of Parameterization
Two representations of the same hypothesis space:
f(x; θ) = θ·x  and  f′(x; θ′) = 2θ′·x,  with θ′ = θ/2.
Let's initialize. Say θ = 10, θ′ = 5.
Compute the loss with respect to the same example. Say ∇f L = ∇f′ L = 1, and x = 1.
What are ∇θ and ∇θ′?
∇θ = 1, ∇θ′ = 2.
Update with learning rate = 1:
Updated θ = 9, θ′ = 3.
f(x) = 9x, f′(x) = 6x.
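The arithmetic above can be traced in a few lines: identical functions and identical loss gradients, yet one step of gradient descent leaves the two parameterizations representing different functions.

```python
lr = 1.0
theta, theta_p = 10.0, 5.0   # f(x) = theta*x vs f'(x) = 2*theta_p*x: same function initially
grad_L = 1.0                  # dL/df = dL/df' = 1 at this example
x = 1.0

grad_theta = grad_L * x            # df/dtheta = x       -> 1
grad_theta_p = grad_L * 2.0 * x    # df'/dtheta_p = 2x   -> 2

theta -= lr * grad_theta           # theta  : 10 -> 9,  so f(x)  = 9x
theta_p -= lr * grad_theta_p       # theta' :  5 -> 3,  so f'(x) = 6x
print(theta, 2.0 * theta_p)        # 9.0 6.0: no longer the same function
```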
23
/
TRAINING IN PRACTICE
Initialization
Because we're using a first-order method, it is important to make sure that the activations of all layers are the same magnitude.
Because we have RELUs (y = max(0, x)), it is important to make sure roughly half the pre-activation values are positive.
Normalize your inputs to be 0-mean and unit variance:
Compute the dataset mean and standard deviation; subtract and divide from all inputs.
For images, you usually compute the mean over all pixels: a single normalization for all pixels in the input, but different for different color channels.
Sometimes, you will want per-location normalization.
Other times, you will normalize each input by its own pixel mean and standard deviation.
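The per-channel variant described above can be sketched in numpy (assumed convention: images is an N x H x W x 3 float array; the random data is just a stand-in for a real dataset):

```python
import numpy as np

rng = np.random.default_rng(0)
images = rng.uniform(0, 255, size=(8, 4, 4, 3))  # fake dataset of 8 RGB images

# One mean/std per color channel, computed over all images and all pixels.
mean = images.mean(axis=(0, 1, 2))   # shape (3,)
std = images.std(axis=(0, 1, 2))     # shape (3,)
normalized = (images - mean) / std

print(normalized.mean(axis=(0, 1, 2)))  # ~[0, 0, 0]: 0-mean, unit-variance per channel
```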
24
/
TRAINING IN PRACTICE
Initialization
Initialize all biases to 0. Why? Don't want to shift the mean.
Now initialize weights randomly so that variance of outputs = variance of inputs = 1.
Use the approximation var(wx) = var(w)·var(x) for scalar w and x.
Say you have a fully connected layer: y = Wᵀx.
x is N-dimensional. We assume it's 0-mean, unit-variance coming in.
W is N × M dimensional.
We will initialize W with 0 mean and variance σ².
What should be the value of σ²? Take 5 mins.
σ² = 1/N.
Now what about a convolution layer with kernel size K × K × C1 × C2?
Initialize the kernel with 0 mean and variance σ². What should σ² be?
σ² = 1/(K²·C1).
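The σ² = 1/N rule for the fully connected case can be verified empirically: with unit-variance inputs, each output is a sum of N terms of variance 1/N, so the output variance should land near 1.

```python
import numpy as np

rng = np.random.default_rng(0)
N, M, B = 256, 128, 10000
W = rng.normal(0.0, np.sqrt(1.0 / N), size=(N, M))  # var(W) = 1/N
x = rng.normal(0.0, 1.0, size=(B, N))               # 0-mean, unit-variance inputs
y = x @ W                                           # the layer's pre-activations
print(y.var())  # close to 1: variance preserved through the layer
```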
25
/
TRAINING IN PRACTICE
Initialization
Actually, using normal distributions is sometimes unstable:
The probability of values very far from the mean is low, but not 0.
When you sample millions of weights, you might end up with such a high value!
Solution: Use truncated distributions that are forced to have 0 probability outside a range:
Uniform distribution, truncated normal.
Figure out what the parameters of the distribution should be to have equivalent variance.
26
/
TRAINING IN PRACTICE
Initialization
But this only ensures zero-mean, unit-variance at initialization.
As your weights update, the activation statistics can drift.
Another option: add normalization in the network itself!
Sergey Ioffe and Christian Szegedy, Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift.
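A minimal sketch of the batch-norm idea from that paper, forward pass in training mode only (running statistics and the backward pass are omitted; the function name and B x D layout are our choices):

```python
import numpy as np

def batchnorm_forward(x, gamma, beta, eps=1e-5):
    """Normalize each feature over the batch, then apply a learned scale/shift."""
    mu = x.mean(axis=0)                   # per-feature batch mean
    var = x.var(axis=0)                   # per-feature batch variance
    xhat = (x - mu) / np.sqrt(var + eps)  # 0-mean, unit-variance per feature
    return gamma * xhat + beta            # learned affine restores expressiveness

rng = np.random.default_rng(0)
x = rng.normal(5.0, 3.0, size=(64, 8))    # badly scaled activations
y = batchnorm_forward(x, np.ones(8), np.zeros(8))
print(y.mean(axis=0), y.var(axis=0))      # ~0 and ~1 per feature
```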
27