
    CSE 559A: Computer Vision

    Fall 2020: T-R: 11:30-12:50pm @ Zoom

    Instructor: Ayan Chakrabarti ([email protected]). Course Staff: Adith Boloor, Patrick Williams

    Dec 3, 2020

    http://www.cse.wustl.edu/~ayan/courses/cse559a/

    1

    LAST TIME

    Talked about core semantic vision tasks

    2

    THE EFFECT OF DATA

    [Slides 3-12: figure-only slides]

    THE EFFECT OF DATA

    Older ML methods were designed for small training sets.

    They used more complex optimization methods (than gradient descent): second-order methods, etc. These methods had better guarantees if you chose simpler classifiers, and in practice gave you better results than neural networks. But they were quadratic in training set size.

    With training sets of millions, quadratic-time optimization was not feasible. So people first moved to gradient descent, but with the same simple classifiers. They found that with additional computation power, if you train with a small step size for many iterations (still better than quadratic time), gradient descent gives you a reasonable answer. But then, since gradient descent was working, the question was: why not try more complex classifiers? And Krizhevsky and others demonstrated that in this large-training-set, high-training-compute regime, deep neural networks are much better!

    13

    ARCHITECTURES

    Broad Design Principles

    Think of a network that can express the operations you think are needed to solve the problem:
    What kind of receptive field should it have?
    How non-linear does it need to be?
    What should be the nature of the flow of information across the image?

    Make sure it's a function you can actually learn:
    Think of the flow of gradients.
    Try to use other architectures that you know can be successfully trained as a starting point.

    Dealing with overfitting, one approach:
    First find the biggest, deepest network that will overfit the data (given enough capacity, CNNs will often be able to just memorize the dataset).
    Then scale it down so that it generalizes.

    14

    ARCHITECTURES

    Let's consider image classification.

    We will fix our input image to be a specific size.

    Typically choose square images of size S × S.

    Given an image, resize it proportionally so that the smaller side (height or width) is S.

    Then take an S × S crop along the other direction.

    (Sometimes take multiple crops and average.)

    The final output will be a C-dimensional vector for C classes.

    Train using soft-max cross entropy. Classify using arg-max.

    Often, you'll hear about Top-K error:

    How often is the true class in the top K highest values of the predicted vector?
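    To make the Top-K metric concrete, here is a minimal NumPy sketch (the function name and array layout are my own, not from the slides):

    import numpy as np

    def topk_error(scores, labels, k=5):
        """Fraction of examples whose true class is NOT among the k highest scores.

        scores: (B, C) predicted class scores (e.g., soft-max logits)
        labels: (B,) true class indices
        """
        # Indices of the k largest scores for each example (unordered within the top k).
        topk = np.argpartition(-scores, k - 1, axis=1)[:, :k]
        hit = (topk == labels[:, None]).any(axis=1)
        return 1.0 - hit.mean()

    With k = 1 this reduces to the usual arg-max classification error.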

    15

    ARCHITECTURES

    Let's talk about VGG-16 (ImageNet 2014: winner of the localization task, runner-up in classification).

    References:
    Ken Chatfield, Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman, Return of the Devil in the Details: Delving Deep into Convolutional Nets.
    Karen Simonyan & Andrew Zisserman, Very Deep Convolutional Networks for Large-Scale Image Recognition.

    Four kinds of layers:

    Convolutional
    Max-pooling
    Fully Connected
    Soft-max

    16

    ARCHITECTURES

    Convolutional Layers

    Take a spatial input, produce a spatial output:

    B × H × W × C1 ⇒ B × H′ × W′ × C2

    g[b, y, x, c2] = Σ_{ky, kx, c1} f[b, y·s + ky, x·s + kx, c1] · k[ky, kx, c1, c2]

    Can also combine with down-sampling. Here, s is the stride.

    PSET 5 asks you to implement valid convolution, but this is often combined with padding (just like in regular convolution).
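    A direct, loop-based NumPy sketch of this operation, assuming valid convolution with stride s (the function name is mine, and this is not meant as the efficient implementation PSET 5 asks for):

    import numpy as np

    def conv2d_valid(f, k, s=1):
        """g[b,y,x,c2] = sum_{ky,kx,c1} f[b, y*s+ky, x*s+kx, c1] * k[ky, kx, c1, c2]."""
        B, H, W, C1 = f.shape
        K, _, _, C2 = k.shape
        Ho, Wo = (H - K) // s + 1, (W - K) // s + 1
        g = np.zeros((B, Ho, Wo, C2))
        for y in range(Ho):
            for x in range(Wo):
                patch = f[:, y*s:y*s+K, x*s:x*s+K, :]          # (B, K, K, C1)
                # Contract over (ky, kx, c1) to get a (B, C2) slice of the output.
                g[:, y, x, :] = np.tensordot(patch, k, axes=([1, 2, 3], [0, 1, 2]))
        return g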

    17

    ARCHITECTURES

    Question: Input activation is B × H × W × C1, and I convolve it with a kernel of size K × K × C1 × C2. What is the size of my output? Assume valid convolution.

    B × (H − K + 1) × (W − K + 1) × C2

    Question: What if I do this with a stride of 2?

    Downsample the above by 2. Think of what happens when sizes are even or odd:

    B × (⌊(H − K)/2⌋ + 1) × (⌊(W − K)/2⌋ + 1) × C2

    In general, you want to pad such that H − K and W − K are even, so that you keep the right and bottom edges of your images.

    18

    ARCHITECTURES

    Max-Pooling Layer

    For each channel, choose the maximum value in a spatial neighborhood:

    B × H × W × C ⇒ B × H′ × W′ × C

    g[b, y, x, c] = max_{ky, kx} f[b, y·s + ky, x·s + kx, c]

    What will the gradients of this look like?

    Motivated by intuition from traditional object recognition (deformable part models). Allows for some slack in exact spatial location.
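    A minimal NumPy sketch of the same operation, using the layout of the conv sketch above (the default window and stride of 2 match the pooling layers in VGG below; the function name is mine):

    import numpy as np

    def maxpool2d(f, K=2, s=2):
        """g[b, y, x, c] = max over (ky, kx) of f[b, y*s+ky, x*s+kx, c]."""
        B, H, W, C = f.shape
        Ho, Wo = (H - K) // s + 1, (W - K) // s + 1
        g = np.zeros((B, Ho, Wo, C), dtype=f.dtype)
        for y in range(Ho):
            for x in range(Wo):
                # Only the arg-max location in each window receives gradient in the backward pass.
                g[:, y, x, :] = f[:, y*s:y*s+K, x*s:x*s+K, :].max(axis=(1, 2))
        return g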

    19

    ARCHITECTURES

    VGG-16

    Input is a 224x224x3 Image

    - Block 1
      - 3x3 Conv (Pad 1): 3->64 + RELU
    (*Pad 1 means padding on all sides; all conv layers have a bias.)

    20

    ARCHITECTURES

    VGG-16

    Input is a 224x224x3 Image

    - Block 1
      - 3x3 Conv (Pad 1): 3->64 + RELU
      - 3x3 Conv (Pad 1): 64->64 + RELU
      - 2x2 Max-Pool (Pad 0, Stride 2): 64->64
      Input to Block 2 is 112x112x64 (called pool1)

    - Block 2
      - 3x3 Conv (Pad 1): 64->128 + RELU
      - 3x3 Conv (Pad 1): 128->128 + RELU
      - 2x2 Max-Pool (Pad 0, Stride 2): 128->128
      Input to Block 3 is 56x56x128 (called pool2)

    - Block 3
      - 3x3 Conv (Pad 1): 128->256 + RELU
      - 3x3 Conv (Pad 1): 256->256 + RELU
      - 3x3 Conv (Pad 1): 256->256 + RELU
      - 2x2 Max-Pool (Pad 0, Stride 2): 256->256
      Input to Block 4 is 28x28x256 (called pool3)

    - Block 4
      - 3x3 Conv (Pad 1): 256->512 + RELU
      - 3x3 Conv (Pad 1): 512->512 + RELU
      - 3x3 Conv (Pad 1): 512->512 + RELU
      - 2x2 Max-Pool (Pad 0, Stride 2): 512->512
      Input to Block 5 is 14x14x512 (called pool4)

    - Block 5
      - 3x3 Conv (Pad 1): 512->512 + RELU
      - 3x3 Conv (Pad 1): 512->512 + RELU
      - 3x3 Conv (Pad 1): 512->512 + RELU
      - 2x2 Max-Pool (Pad 0, Stride 2): 512->512
      Output of Block 5 is 7x7x512 (called pool5)

    21

    ARCHITECTURES

    VGG-16

    This is the final output that is trained with a soft-max + cross entropy.

    Lots of layers: 138 Million Parameters. Compared to previous architectures, it used really small conv filters.

    This has now become standard. Two 3x3 layers are better than a single 5x5 layer:

    More non-linear
    Fewer independent weights

    Train this with backprop! Back in the day, this would take a week or more.

    - Block 5
      - 3x3 Conv (Pad 1): 512->512 + RELU
      - 3x3 Conv (Pad 1): 512->512 + RELU
      - 3x3 Conv (Pad 1): 512->512 + RELU
      - 2x2 Max-Pool (Pad 0, Stride 2): 512->512
      Output of Block 5 is 7x7x512 (called pool5)

    - Reshape to a (49*512=25088) dimensional vector (or B x 25088)

    - Fully connected (matmul + bias): 25088 -> 4096 + RELU
    - Fully connected (matmul + bias): 4096 -> 4096 + RELU
    - Fully connected (matmul + bias): 4096 -> 1000
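    As a quick sanity check of the 138M figure, the count can be reproduced from the listing above (plain Python; weights plus biases):

    convs = [(3, 64), (64, 64),                    # Block 1
             (64, 128), (128, 128),                # Block 2
             (128, 256), (256, 256), (256, 256),   # Block 3
             (256, 512), (512, 512), (512, 512),   # Block 4
             (512, 512), (512, 512), (512, 512)]   # Block 5
    fcs = [(25088, 4096), (4096, 4096), (4096, 1000)]

    n_conv = sum(3 * 3 * cin * cout + cout for cin, cout in convs)   # 3x3 kernels + biases
    n_fc = sum(nin * nout + nout for nin, nout in fcs)               # matmul weights + biases
    print(n_conv + n_fc)   # 138,357,544: about 138 million

    Most of the parameters sit in the first fully connected layer (25088 x 4096 is roughly 103 million on its own).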

    22

    TRAINING IN PRACTICE

    Remember: Gradient Descent is Fragile

    The Effect of Parameterization

    Two representations of the same hypothesis space:

    f(x; θ) = θ·x        f̂(x; θ̂) = 2·θ̂·x        with θ̂ = θ/2

    Let's initialize θ̂ = θ/2. Say θ = 10, θ̂ = 5 (the same function).

    Compute the loss with respect to the same example. Say ∇_f L = 1, x = 1.

    What are ∇_θ and ∇_θ̂?

    ∇_θ = 1, ∇_θ̂ = 2.

    Update with learning rate = 1.

    Updated θ = 9, θ̂ = 3.

    f(x) = 9x, f̂(x) = 6x. After one identical gradient step, the two parameterizations no longer represent the same function.
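    A tiny numeric check of this example (plain Python; as above, it assumes the gradient of the loss with respect to the network output is 1):

    theta, theta_hat = 10.0, 5.0       # f(x) = theta*x and fhat(x) = 2*theta_hat*x both equal 10x
    x, dL_df, lr = 1.0, 1.0, 1.0

    grad_theta = dL_df * x             # df/dtheta = x          -> 1
    grad_theta_hat = dL_df * 2 * x     # dfhat/dtheta_hat = 2x  -> 2

    theta -= lr * grad_theta           # 9.0
    theta_hat -= lr * grad_theta_hat   # 3.0

    print(theta * x, 2 * theta_hat * x)   # 9.0 vs 6.0: different functions after one step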

    23

    TRAINING IN PRACTICE

    Initialization

    Because we're using a first-order method, it is important to make sure that the activations of all layers have the same magnitude.

    Because we have RELUs (y = max(0, x)), it is important to make sure roughly half the activations are positive in expectation.

    Normalize your inputs to be 0-mean and unit variance:

    Compute the dataset mean and standard deviation, and subtract and divide these from all inputs. For images, you usually compute the mean over all pixels, so there is a single normalization for all pixels in the input (but different for different color channels).

    Sometimes, you will want per-location normalization. Other times, you will normalize each input by its own pixel mean and standard deviation.
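    For the common per-channel case, a minimal NumPy sketch (the names are mine; the statistics are computed once over the training set and then reused):

    import numpy as np

    def channel_stats(train_images):
        """train_images: (N, H, W, 3). One mean/std per color channel, shared across all pixels."""
        mean = train_images.mean(axis=(0, 1, 2))   # shape (3,)
        std = train_images.std(axis=(0, 1, 2))     # shape (3,)
        return mean, std

    def normalize(images, mean, std):
        """Subtract the per-channel mean and divide by the per-channel std."""
        return (images - mean) / std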

    24

    TRAINING IN PRACTICE

    Initialization

    Initialize all biases to 0. Why? We don't want to shift the mean.

    Now initialize the weights randomly so that variance of outputs = variance of inputs = 1.

    Use the approximation var(wx) = var(w)·var(x) for scalar w and x (exact when both are independent and zero-mean).

    Say you have a fully connected layer: y = Wᵀx.

    x is N-dimensional. We assume it is 0-mean, unit-variance coming in.

    W is N × M dimensional.

    We will initialize W with 0-mean and variance σ².

    What should be the value of σ²? Take 5 mins.

    σ² = 1/N (each output is a sum of N terms, each with variance σ²·1, so its variance is N·σ²; setting this to 1 gives σ² = 1/N).

    Now what about a convolution layer with kernel size K × K × C1 × C2?

    Initialize the kernel with 0-mean and variance σ². What should σ² be?

    σ² = 1/(K²·C1).
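    A minimal NumPy sketch of these two initializers (function names are mine):

    import numpy as np

    rng = np.random.default_rng(0)

    def init_fc(N, M):
        """Fully connected weights: 0-mean, variance 1/N (fan-in); biases 0."""
        W = rng.normal(0.0, np.sqrt(1.0 / N), size=(N, M))
        b = np.zeros(M)
        return W, b

    def init_conv(K, C1, C2):
        """Conv kernel: 0-mean, variance 1/(K*K*C1); biases 0."""
        k = rng.normal(0.0, np.sqrt(1.0 / (K * K * C1)), size=(K, K, C1, C2))
        b = np.zeros(C2)
        return k, b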

    25

    TRAINING IN PRACTICE

    Initialization

    Actually, using normal distributions is sometimes unstable: the probability of values very far from the mean is low, but not 0. When you sample millions of weights, you might end up with such a high value!

    Solution: Use truncated distributions that are forced to have 0 probability outside a range.

    Uniform distribution, truncated normal: figure out what the parameters of the distribution should be to have equivalent variance.
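    For example, a uniform distribution on [-a, a] has variance a²/3, so matching the 1/N variance above gives a = sqrt(3/N). A minimal sketch, assuming the fully connected case (name is mine):

    import numpy as np

    rng = np.random.default_rng(0)

    def init_fc_uniform(N, M):
        """Uniform initialization with the same variance (1/N) as the normal version above."""
        a = np.sqrt(3.0 / N)
        return rng.uniform(-a, a, size=(N, M))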

    26

    TRAINING IN PRACTICE

    Initialization

    But this only ensures zero-mean, unit-variance at initialization. As your weights update, they can begin to give you biased weights.

    Another option: add normalization in the network itself!

    Sergey Ioffe and Christian Szegedy, Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift.
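    A minimal sketch of the batch-normalization forward pass at training time (per-channel statistics over the batch and spatial dimensions, followed by a learned scale and shift; the names follow the usual convention rather than these slides):

    import numpy as np

    def batchnorm_forward(x, gamma, beta, eps=1e-5):
        """x: (B, H, W, C); gamma, beta: (C,) learned scale and shift."""
        mu = x.mean(axis=(0, 1, 2), keepdims=True)
        var = x.var(axis=(0, 1, 2), keepdims=True)
        xhat = (x - mu) / np.sqrt(var + eps)     # zero-mean, unit-variance per channel
        return gamma * xhat + beta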

    27