/
CSE 559A: Computer Vision
Fall 2020: T-R: 11:30-12:50pm @ Zoom
Instructor: Ayan Chakrabarti ([email protected]). Course Staff: Adith Boloor, Patrick Williams
Dec 3, 2020
http://www.cse.wustl.edu/~ayan/courses/cse559a/
1
/
LAST TIME
Talked about core semantic vision tasks.
2
/
THE EFFECT OF DATA
[slides 3-12: figures only; images not included in this transcript]
/
THE EFFECT OF DATA
Older ML methods were designed for small training sets:
Used more complex optimization methods (than gradient descent): second-order methods, etc.
Methods had better guarantees if you chose simpler classifiers.
And in practice, gave you better results than neural networks.
But were quadratic in training set size.
With training sets of millions, quadratic-time optimization was not feasible.
So people first moved to gradient descent, but with the same simple classifiers.
Found that with additional computation power, if you train with a small step size for many iterations (still better than quadratic), gradient descent gives you a reasonable answer.
But then, since gradient descent was working, the question was: why not try more complex classifiers?
And Krizhevsky and others demonstrated: in this large-training-set / high-training-computation-budget regime, deep neural networks are much better!
13
/
ARCHITECTURES
Broad Design Principles
Think of a network that can express the operations that you think are needed to solve the problem:
What kind of receptive field should it have?
How non-linear does it need to be?
What should be the nature of the flow of information across the image?
Make sure it's a function you can actually learn:
Think of the flow of gradients.
Try to use other architectures that you know can be successfully trained as a starting point.
Dealing with overfitting, one approach:
First find the biggest, deepest network that will overfit the data.
(Given enough capacity, CNNs will often be able to just memorize the dataset.)
Then scale it down so that it generalizes.
14
/
ARCHITECTURES
Let's consider image classification.
We will fix our input image to be a specific size.
Typically choose square images of size S × S.
Given an image, resize proportionally so that the smaller side (height or width) is S.
Then take an S × S crop along the other direction.
(Sometimes take multiple crops and average.)
The final output will be a C-dimensional vector for C classes.
Train using soft-max cross entropy. Classify using arg-max.
Often, you'll hear about Top-K error:
How often is the true class in the top K highest values of the predicted vector?
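The Top-K error described above is easy to compute directly; a minimal numpy sketch (the function name `topk_error` is our choice, not from the course):

```python
import numpy as np

def topk_error(scores, labels, K):
    """Fraction of examples whose true class is NOT among the K largest scores."""
    topk = np.argsort(scores, axis=1)[:, -K:]        # indices of K highest scores per row
    hit = (topk == labels[:, None]).any(axis=1)      # true class among them?
    return 1.0 - hit.mean()

scores = np.array([[0.1, 0.5, 0.4],
                   [0.7, 0.2, 0.1],
                   [0.3, 0.3, 0.4]])
labels = np.array([2, 0, 1])
print(topk_error(scores, labels, 1))  # 2/3: only the second example's arg-max is correct
print(topk_error(scores, labels, 2))  # 0.0: every true class is in the top 2
```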
15
/
ARCHITECTURES
Let's talk about VGG-16, from the ImageNet 2014 competition (winner of the localization task, runner-up in classification).
References:
Ken Chatfield, Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman, Return of the Devil in the Details: Delving Deep into Convolutional Nets.
Karen Simonyan & Andrew Zisserman, Very Deep Convolutional Networks for Large-Scale Image Recognition.
Four kinds of layers:
Convolutional
Max-pooling
Fully Connected
Soft-max
16
/
ARCHITECTURES
Convolutional Layers
Take a spatial input, produce a spatial output:
B × H × W × C ⇒ B × H′ × W′ × C′
g[b, y, x, c2] = Σ_{ky, kx, c1} f[b, y·s + ky, x·s + kx, c1] · k[ky, kx, c1, c2]
Here, s is the stride. Can also combine with down-sampling.
PSET 5 asks you to implement valid convolution.
But it is often combined with padding (just like regular convolution).
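The sum above maps directly to loops over the output grid; a minimal numpy sketch (the function name `conv_valid` and the shapes are our choices, not the PSET 5 interface):

```python
import numpy as np

def conv_valid(f, k, s=1):
    """Valid convolution with stride s:
    g[b,y,x,c2] = sum over ky,kx,c1 of f[b, y*s+ky, x*s+kx, c1] * k[ky,kx,c1,c2]."""
    B, H, W, C1 = f.shape
    K1, K2, _, C2 = k.shape
    Ho = (H - K1) // s + 1
    Wo = (W - K2) // s + 1
    g = np.zeros((B, Ho, Wo, C2))
    for y in range(Ho):
        for x in range(Wo):
            patch = f[:, y*s:y*s+K1, x*s:x*s+K2, :]   # B x K1 x K2 x C1 window
            # Contract the window against the kernel over (ky, kx, c1).
            g[:, y, x, :] = np.tensordot(patch, k, axes=([1, 2, 3], [0, 1, 2]))
    return g

f = np.random.default_rng(0).normal(size=(2, 8, 8, 3))
k = np.random.default_rng(1).normal(size=(3, 3, 3, 4))
print(conv_valid(f, k).shape)       # (2, 6, 6, 4)
print(conv_valid(f, k, s=2).shape)  # (2, 3, 3, 4)
```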
17
/
ARCHITECTURES
Question: Input activation is B × H × W × C1, and I convolve it with a kernel of size K × K × C1 × C2. What is the size of my output? Assume valid convolution.
Answer: B × (H − K + 1) × (W − K + 1) × C2.
Question: What if I do this with a stride of 2?
Downsample the above by 2: B × (⌊(H − K)/2⌋ + 1) × (⌊(W − K)/2⌋ + 1) × C2. Think of what happens when the sizes are even or odd.
In general, you want to pad such that H − K and W − K are even, so that you keep the right and bottom edges of your images.
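The floor formula above can be checked numerically; a quick sketch (`out_size` is a hypothetical helper name):

```python
def out_size(H, K, s):
    """Valid-convolution output size along one dimension: floor((H - K)/s) + 1."""
    return (H - K) // s + 1

K = 3
for H in (8, 9):
    print(H, out_size(H, K, 1), out_size(H, K, 2))
# H=8: 6 at stride 1, 3 at stride 2; H=9: 7 and 4.
# When H - K is odd (H=8, K=3), the floor drops the last position:
# pad so H - K is even to keep the right/bottom edge.
```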
18
/
ARCHITECTURES
Max-Pooling Layer
For each channel, choose the maximum value in a spatial neighborhood:
B × H × W × C ⇒ B × H′ × W′ × C
g[b, y, x, c] = max_{ky, kx} f[b, y·s + ky, x·s + kx, c]
What will the gradients of this look like?
Motivated by intuition from traditional object recognition (deformable part models). Allows for some slack in exact spatial location.
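The max above can be sketched the same way as the convolution loop; a minimal numpy version (window size K and stride s are our parameter names):

```python
import numpy as np

def maxpool(f, K=2, s=2):
    """Per-channel max over each K x K window with stride s."""
    B, H, W, C = f.shape
    Ho, Wo = (H - K) // s + 1, (W - K) // s + 1
    g = np.zeros((B, Ho, Wo, C))
    for y in range(Ho):
        for x in range(Wo):
            g[:, y, x, :] = f[:, y*s:y*s+K, x*s:x*s+K, :].max(axis=(1, 2))
    return g

f = np.arange(16, dtype=float).reshape(1, 4, 4, 1)
print(maxpool(f)[0, :, :, 0])  # [[5, 7], [13, 15]]
# Gradients flow only to the arg-max element of each window; all others get 0.
```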
19
/
ARCHITECTURES
VGG-16
Input is a 224x224x3 Image
- Block 1 - 3x3 Conv (Pad 1): 3->64 + RELU (*pad 1 means on all sides; all conv layers have a bias)
20
/
ARCHITECTURES
VGG-16
Input is a 224x224x3 Image
- Block 1 - 3x3 Conv (Pad 1): 3->64 + RELU - 3x3 Conv (Pad 1): 64->64 + RELU - 2x2 Max-Pool (Pad 0, Stride 2): 64->64 Input to Block 2 is 112x112x64 (called pool1)
- Block 2 - 3x3 Conv (Pad 1): 64->128 + RELU - 3x3 Conv (Pad 1): 128->128 + RELU - 2x2 Max-Pool (Pad 0, Stride 2): 128->128 Input to Block 3 is 56x56x128 (called pool2)
- Block 3 - 3x3 Conv (Pad 1): 128->256 + RELU - 3x3 Conv (Pad 1): 256->256 + RELU - 3x3 Conv (Pad 1): 256->256 + RELU - 2x2 Max-Pool (Pad 0, Stride 2): 256->256 Input to Block 4 is 28x28x256 (called pool3)
- Block 4 - 3x3 Conv (Pad 1): 256->512 + RELU - 3x3 Conv (Pad 1): 512->512 + RELU - 3x3 Conv (Pad 1): 512->512 + RELU - 2x2 Max-Pool (Pad 0, Stride 2): 512->512 Input to Block 5 is 14x14x512 (called pool4)
- Block 5 - 3x3 Conv (Pad 1): 512->512 + RELU - 3x3 Conv (Pad 1): 512->512 + RELU - 3x3 Conv (Pad 1): 512->512 + RELU - 2x2 Max-Pool (Pad 0, Stride 2): 512->512 Output of Block 5 is 7x7x512 (called pool5)
21
/
ARCHITECTURES
VGG-16
This is the final output that is trained with a soft-max + cross entropy.
Lots of layers: 138 Million Parameters.
Compared to previous architectures, used really small conv filters. This has now become standard.
Two 3x3 layers are better than a single 5x5 layer:
More non-linear
Fewer independent weights
Train this with backprop! Back in the day, this would take a week or more.
- Block 5 - 3x3 Conv (Pad 1): 512->512 + RELU - 3x3 Conv (Pad 1): 512->512 + RELU - 3x3 Conv (Pad 1): 512->512 + RELU - 2x2 Max-Pool (Pad 0, Stride 2): 512->512 Output of Block 5 is 7x7x512 (called pool5)
- Reshape to a (49*512=25088) dimensional vector (or B x 25088)
- Fully connected (matmul + bias) 25088 -> 4096 + RELU - Fully connected (matmul + bias) 4096 -> 4096 + RELU - Fully connected (matmul + bias) 4096 -> 1000
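The "138 Million Parameters" figure can be recovered directly from the layer list above (weights plus biases for each 3x3 conv and fully connected layer); a short tally:

```python
# (in_channels, out_channels) for every 3x3 conv layer, block by block.
convs = [(3, 64), (64, 64),                   # Block 1
         (64, 128), (128, 128),               # Block 2
         (128, 256), (256, 256), (256, 256),  # Block 3
         (256, 512), (512, 512), (512, 512),  # Block 4
         (512, 512), (512, 512), (512, 512)]  # Block 5
fcs = [(25088, 4096), (4096, 4096), (4096, 1000)]

total = sum(3 * 3 * cin * cout + cout for cin, cout in convs)   # 3x3 kernels + biases
total += sum(din * dout + dout for din, dout in fcs)            # matmul weights + biases
print(total)  # 138357544 -> the "138 Million Parameters" above
```

Note that the three fully connected layers alone account for roughly 124M of the total; the convolutional stack is comparatively cheap in parameters.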
22
/
TRAINING IN PRACTICE
Remember: Gradient Descent is Fragile
The Effect of Parameterization
Two representations of the same hypothesis space:
f(x; θ) = θ·x  and  f′(x; θ′) = 2θ′·x,  with θ′ = θ/2.
Let's initialize. Say θ = 10, θ′ = 5.
Compute the loss with respect to the same example. Say ∇f L = ∇f′ L = 1, and x = 1.
What are ∇θ and ∇θ′?
∇θ = 1, ∇θ′ = 2.
Update with learning rate = 1:
Updated θ = 9, θ′ = 3.
f(x) = 9x, f′(x) = 6x.
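The arithmetic above can be traced in a few lines: identical functions and identical loss gradients, yet one step of gradient descent leaves the two parameterizations representing different functions.

```python
lr = 1.0
theta, theta_p = 10.0, 5.0   # f(x) = theta*x vs f'(x) = 2*theta_p*x: same function initially
grad_L = 1.0                  # dL/df = dL/df' = 1 at this example
x = 1.0

grad_theta = grad_L * x            # df/dtheta = x       -> 1
grad_theta_p = grad_L * 2.0 * x    # df'/dtheta_p = 2x   -> 2

theta -= lr * grad_theta           # theta  : 10 -> 9,  so f(x)  = 9x
theta_p -= lr * grad_theta_p       # theta' :  5 -> 3,  so f'(x) = 6x
print(theta, 2.0 * theta_p)        # 9.0 6.0: no longer the same function
```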
23
/
TRAINING IN PRACTICE
Initialization
Because we're using a first-order method, it is important to make sure that the activations of all layers are the same magnitude.
Because we have RELUs (y = max(0, x)), it is important to make sure roughly half the pre-activation values are positive.
Normalize your inputs to be 0-mean and unit variance:
Compute the dataset mean and standard deviation; subtract and divide from all inputs.
For images, you usually compute the mean over all pixels: a single normalization for all pixels in the input, but different for different color channels.
Sometimes, you will want per-location normalization.
Other times, you will normalize each input by its own pixel mean and standard deviation.
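The per-channel variant described above can be sketched in numpy (assumed convention: images is an N x H x W x 3 float array; the random data is just a stand-in for a real dataset):

```python
import numpy as np

rng = np.random.default_rng(0)
images = rng.uniform(0, 255, size=(8, 4, 4, 3))  # fake dataset of 8 RGB images

# One mean/std per color channel, computed over all images and all pixels.
mean = images.mean(axis=(0, 1, 2))   # shape (3,)
std = images.std(axis=(0, 1, 2))     # shape (3,)
normalized = (images - mean) / std

print(normalized.mean(axis=(0, 1, 2)))  # ~[0, 0, 0]: 0-mean, unit-variance per channel
```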
24
/
TRAINING IN PRACTICE
Initialization
Initialize all biases to 0. Why? Don't want to shift the mean.
Now initialize weights randomly so that variance of outputs = variance of inputs = 1.
Use the approximation var(wx) = var(w)·var(x) for scalar w and x.
Say you have a fully connected layer: y = Wᵀx.
x is N-dimensional. We assume it's 0-mean, unit-variance coming in.
W is N × M dimensional.
We will initialize W with 0 mean and variance σ².
What should be the value of σ²? Take 5 mins.
σ² = 1/N.
Now what about a convolution layer with kernel size K × K × C1 × C2?
Initialize the kernel with 0 mean and variance σ². What should σ² be?
σ² = 1/(K²·C1).
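The σ² = 1/N rule for the fully connected case can be verified empirically: with unit-variance inputs, each output is a sum of N terms of variance 1/N, so the output variance should land near 1.

```python
import numpy as np

rng = np.random.default_rng(0)
N, M, B = 256, 128, 10000
W = rng.normal(0.0, np.sqrt(1.0 / N), size=(N, M))  # var(W) = 1/N
x = rng.normal(0.0, 1.0, size=(B, N))               # 0-mean, unit-variance inputs
y = x @ W                                           # the layer's pre-activations
print(y.var())  # close to 1: variance preserved through the layer
```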
25
/
TRAINING IN PRACTICE
Initialization
Actually, using normal distributions is sometimes unstable:
The probability of values very far from the mean is low, but not 0.
When you sample millions of weights, you might end up with such a high value!
Solution: Use truncated distributions that are forced to have 0 probability outside a range:
Uniform distribution, truncated normal.
Figure out what the parameters of the distribution should be to have equivalent variance.
26
/
TRAINING IN PRACTICE
Initialization
But this only ensures zero-mean, unit-variance at initialization.
As your weights update, the activation statistics can drift.
Another option: add normalization in the network itself!
Sergey Ioffe and Christian Szegedy, Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift.
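A minimal sketch of the batch-norm idea from that paper, forward pass in training mode only (running statistics and the backward pass are omitted; the function name and B x D layout are our choices):

```python
import numpy as np

def batchnorm_forward(x, gamma, beta, eps=1e-5):
    """Normalize each feature over the batch, then apply a learned scale/shift."""
    mu = x.mean(axis=0)                   # per-feature batch mean
    var = x.var(axis=0)                   # per-feature batch variance
    xhat = (x - mu) / np.sqrt(var + eps)  # 0-mean, unit-variance per feature
    return gamma * xhat + beta            # learned affine restores expressiveness

rng = np.random.default_rng(0)
x = rng.normal(5.0, 3.0, size=(64, 8))    # badly scaled activations
y = batchnorm_forward(x, np.ones(8), np.zeros(8))
print(y.mean(axis=0), y.var(axis=0))      # ~0 and ~1 per feature
```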
27