Lecture 11 Recap - GitHub Pages · Figure copyright and adapted from Ian Goodfellow, Tutorial on Generative Adversarial Networks, 2017. 44 Variational Autoencoders I2DL: Prof. Niessner,

Lecture 11 Recap

I2DL: Prof. Niessner, Dr. Dai 1

Transfer Learning

3I2DL: Prof. Niessner, Dr. Dai

P1 P2

Large dataset Small dataset

Distribution Distribution

Use what has been learned for another

setting

Transfer Learning


Trained on ImageNet New dataset with C classes

TRAIN

FROZEN

[Donahue et al., ICML’14] DeCAF, [Razavian et al., CVPRW’14] CNN Features off-the-shelf

Source : http://cs231n.stanford.edu/slides/2016/winter1516_lecture11.pdf

http://cs231n.stanford.edu/slides/2016/winter1516_lecture11.pdf

Basic Structure of RNN• We want to have notion of “time” or “sequence”


Hidden state

InputPrevious hidden state

𝑨𝑡 = 𝜽𝑐𝑨𝑡−1 + 𝜽𝑥𝒙𝑡

Source: https://colah.github.io/posts/2015-08-Understanding-LSTMs/

https://colah.github.io/posts/2015-08-Understanding-LSTMs/

Long-Term Dependencies


I moved to Germany … so I speak German fluently.Source: https://colah.github.io/posts/2015-08-Understanding-LSTMs/


Long-Short Term Memory Units(LSTM)




Long-Short Term Memory Units• Key ingredients • Cell = transports the information through the unit




LSTM• Highway for the gradient to flow




RNNs in Computer Vision• Caption generation


[Xu et al., PMLR’15] Neural Image Caption Generation

Autoencoders


Machine Learning

16

• Labels or target classes

• Goal: Learn a mapping from input to label

• Classification, regression

Supervised learning

I2DL: Prof. Niessner, Dr. Dai

DOG DOG

DOG

CAT

CAT

CAT

Machine Learning

17

Unsupervised learning Supervised learning



Machine Learning

• No label or target class• Find out properties of

the structure of the data

• Clustering (k-means, PCA)

18

DOG DOG

DOG

CAT

CAT

CAT


DOG DOG

DOG

CAT

CAT

CAT

Machine Learning

19



Machine Learning

20


DOG DOG

DOG

CAT

CAT

CAT


Autoencoders• Unsupervised approach for learning a lower-

dimensional feature representation from unlabeled training data


Source: https://hackernoon.com

https://hackernoon.com/

Autoencoders• From an input image

to a feature representation (bottleneck layer)

• Encoder: a CNN in our case


𝑥

Conv

𝑧

Input ImageSource: https://bit.ly/37dpsbQ

https://bit.ly/37dpsbQ

Autoencoders• Why do we need this dimensionality reduction?

• To capture the patterns, the most meaningful factors of variation in our data

• Other dimensionality reduction methods?


Autoencoder Training

24

Conv Transpose Conv

Input Image Output Image

ReconstructionLoss (like L1, L2)


Source: https://bit.ly/37dpsbQ


Autoencoder Training

25

Latent space 𝑧dim 𝑧 < dim(𝑥)

Inp

ut 𝑥

Rec

onst

ruct

ion 𝑥′

Input images

Reconstructed images


26

Autoencoder Training• No labels

required

• We can use unlabeled data to first get its structure



Inp

ut 𝑥

Rec

onst

ruct

ion 𝑥′

27

Autoencoder Use CasesEmbedding of

MNIST numbers


Source: https://lts2.epfl.ch/blog/perekres/2015/02/21/layer-by-layer-visualizations-of-mnist-dataset-feature-representations/

https://lts2.epfl.ch/blog/perekres/2015/02/21/layer-by-layer-visualizations-of-mnist-dataset-feature-representations/

28

Autoencoder for Pre-Training• Test case: Medical applications based on CT images

– Large set of unlabeled data.– Small set of labeled data.

• We cannot take a network pre-trained on ImageNet. Why?

• The image features are different for CT vs natural images


29

Autoencoder for Pre-Training• Test case: medical applications based on CT images

– Large set of unlabeled data.– Small set of labeled data.

• We can pre-train our network using an autoencoder to “learn” the type of features present in CT images


30

Autoencoder for Pre-Training• Step 1: Unsupervised training with autoencoders

Input Reconstruction




31

Autoencoder for Pre-Training• Step 2: Supervised training with the labeled data

Input Reconstruction

Throw away the decoder




Autoencoder for Pre-Training• Step 2: Supervised training with the labeled data

32

Input

Ground truth labels for supervised learning

Loss

Backprop as always


𝑥 𝑦

𝑦∗

𝑧

Why use Autoencoders?• Pre-training, as mentioned before

– Image same image reconstructed– Use the encoder as “feature extractor”

• Use them to get pixel-wise predictions– Image semantic segmentation– Low-resolution image High-resolution image– Image Depth map


Autoencoders for Pixel-wise Predictions


Semantic Segmentation (FCN)• Recall the Fully Convolutional Networks

35[Long et al., CVPR’15] : Fully Convolutional Networks for Semantic Segmentation

Can we do better?


SegNet

36I2DL: Prof. Niessner, Dr. Dai[Badrinarayanan et al., TPAMI‘16] SegNet: A Deep Convolutional Encoder-Decoder Architecture for Image Segmentation

SegNet

37

Input

Ground Truth

SegNet

I2DL: Prof. Niessner, Dr. Dai[Badrinarayanan et al., TPAMI‘16] SegNet: A Deep Convolutional Encoder-Decoder Architecture for Image Segmentation

SegNet• Encoder: normal convolutional filters + pooling

• Decoder: Upsampling + convolutional filters


SegNet• Encoder: Normal convolutional filters + pooling



SegNet• Encoder: normal convolutional filters + pooling


• The convolutional filters in the decoder are learned using backprop and their goal is to refine the upsampling


Generative Models• Given training data, how to generate new samples

from the same distribution

42I2DL: Prof. Niessner, Dr. Dai Source: https://openai.com/blog/generative-models/

Rea

l Im

ages

Gen

erat

ed Im

ages

https://openai.com/blog/generative-models/

Generative Models


Explicit Density Implicit Density

Tractable Density Approximate Density

Variational Markov Chain

Markov Chain Direct

Variational Autoencoder Boltzmann Machine

GSN GANFully Visible Belief Nets

Figure copyright and adapted from Ian Goodfellow, Tutorial on Generative Adversarial Networks, 2017

44

Variational Autoencoders


Autoencoders• Encode the input into a representation (bottleneck)

and reconstruct it with the decoder

45

Conv Transpose Conv

Encoder Decoder


𝑥 𝑥

𝑧

Autoencoders• Encode the input into a representation (bottleneck)

and reconstruct it with the decoder


Source: https://bit.ly/37ctFMS

Latent space learnedby autoencoder on MNIST

https://bit.ly/37ctFMS

𝑧

Variational Autoencoder

47

Conv Transpose Conv

Encoder Decoder


𝑥 𝑥𝜙 𝜃

𝑞𝜙 𝑧 𝑥 𝑝𝜃 𝑥 𝑧)

48


Conv Transpose Conv

Goal: Sample from the latent distribution to generate new outputs!


𝑧

𝑥 𝑥𝜙 𝜃


49

• Latent space is now a distribution• Specifically it is a Gaussian

Encoder Decoder

Sample


𝑥 𝜙 𝜃 𝑥

𝜇𝑧|𝑥

Σ𝑧|𝑥 𝑧

𝑧|𝑥 ∼ 𝒩(𝜇𝑧|𝑥, Σ𝑧|𝑥)


50

• Latent space is now a distribution• Specifically it is a Gaussian

EncoderMean

Diagonal covariance


𝑥 𝜙

𝜇𝑧|𝑥

Σ𝑧|𝑥



51

• Training


Encoder Decoder

Sample𝑥 𝜃 𝑥

𝜇𝑧|𝑥

Σ𝑧|𝑥 𝑧


𝜙

Variational Autoencoder• Sampling operation is not differentiable

-> We can‘t backpropagate through the latent space


Encoder Decoder

Sample𝑥 𝜙 𝜃 𝑥

𝜇𝑧|𝑥

Σ𝑧|𝑥 𝑧


🚫

Reparametrization Trick• Now we only need to backpropagate through an

addition and a multiplication


Encoder Decoder

Sample

𝑥 𝜙 𝜃 𝑥

𝜇𝑧|𝑥

Σ𝑧|𝑥 𝑧

𝒩(0,1)

* +


54

• Test: Sample from the latent space


Decoder

Sample 𝜃 𝑥

𝜇𝑧|𝑥

Σ𝑧|𝑥 𝑧


Autoencoder vs VAE

Autoencoder Variational Autoencoder Ground TruthSource: https://github.com/kvfrans/variational-autoencoder


https://camo.githubusercontent.com/159ce00b17959f1638b6c6fcfb78408492aba497/687474703a2f2f6b766672616e732e636f6d2f636f6e74656e742f696d616765732f323031362f30382f6d6e6973742e6a7067

https://github.com/kvfrans/variational-autoencoder

Autoencoder Overview• Autoencoders (AE)

– Reconstruct input– Unsupervised learning– Latent space features are useful

• Variational Autoencoders (VAE)– Probability distribution in latent space (e.g., Gaussian)– Interpretable latent space (head pose, smile)– Sample from model to generate output


Generative Adversarial Networks (GANs)



58

Source: https://github.com/hindupuravinash/the-gan-zoo


https://github.com/hindupuravinash/the-gan-zoo

Convolution and Deconvolution

Convolutionno padding, no stride

Source: https://github.com/vdumoulin/conv_arithmetic

Transposed convolutionno padding, no stride

Input

Output

Input

Output


https://github.com/vdumoulin/conv_arithmetic

Autoencoder

Conv DeconvI2DL: Prof. Niessner, Dr. Dai 60

Decoder as Generative Model


Test time:-> reconstruction from

‘random’ vector

Output Image

ReconstructionLoss (often L2)



Interpolation between two chair models


[Dosovitsky et al., ‘14] Learning to Generate Chairs


Morphing betweenchair models


[Dosovitsky et al., ‘14] Learning to Generate Chairs


Latent space zdim (z) < dim (x)

“Test time”:-> reconstruction from

‘random’ vector

Reconstruction Loss Often L2, i.e., sum of squared dist.-> L2 distributes error equally

-> mean is opt.-> res. Is blurry

Instead of L2, can we “learn” a loss function?



[Goodfellow et al., NIPS‘14] Generative Adversarial Networks (slide from McGuinness)

67

𝑧𝐺

𝐺(𝑧)

𝐷

𝐷(𝐺(𝑧))



68

𝑧𝐺

𝐺(𝑧)

𝐷

𝑥

𝐷(𝑥)

𝐷(𝐺(𝑧))


[Goodfellow et al., NIPS‘14] Generative Adversarial Networks (slide from McGuinness)


real data fake data

I2DL: Prof. Niessner, Dr. Dai [Goodfellow, NIPS‘16] Tutorial: Generative Adversarial Networks 69

• Minimax Game:– G minimizes probability that D is correct– Equilibrium is saddle point of discriminator loss

• Discriminator loss

• Generator loss binary cross entropy

GANs: Loss Functions

• D provides supervision (i.e., gradients) for G

I2DL: Prof. Niessner, Dr. Dai[Goodfellow et al., NIPS‘14] Generative Adversarial Networks

𝐽 𝐷 = −1

2𝔼𝐱∼𝑝𝑑𝑎𝑡𝑎 log𝐷 𝒙 −

1

2𝔼𝒛 log 1 − 𝐷 𝐺 𝒛

𝐽(𝐺) = −𝐽 𝐷

70

• Heuristic Method (often used in practice)– G maximizes the log-probability of D being mistaken– G can still learn even when D rejects all generator samples

• Discriminator loss

GANs: Loss Functions

• Generator loss


𝐽 𝐷 = −1


1


𝐽(𝐺) = −1

2𝔼𝒛 log𝐷 𝐺 𝒛

71[Goodfellow et al., NIPS‘14] Generative Adversarial Networks

Alternating Gradient Updates• Step 1: Fix G, and perform gradient step to

• Step 2: Fix D, and perform gradient step to


𝐽 𝐷 = −1


1


𝐽(𝐺) = −1

2𝔼𝒛 log𝐷 𝐺 𝒛

Training a GAN

74

Source: https://medium.com/ai-society/gans-from-scratch-1-a-deep-introduction-with-code-in-pytorch-and-tensorflow-cb03cdcdba0f


https://medium.com/ai-society/gans-from-scratch-1-a-deep-introduction-with-code-in-pytorch-and-tensorflow-cb03cdcdba0f

GANs: Loss FunctionsMinimax

Heuristic

I2DL: Prof. Niessner, Dr. Dai 75[Goodfellow et al., NIPS‘14] Generative Adversarial Networks

DCGAN: Generator

Generator of Deep Convolutional GANs

I2DL: Prof. Niessner, Dr. Dai 76[Radford et al., ICLR‘16] DCGAN : Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks

DCGAN: Results

Results on MNIST

77I2DL: Prof. Niessner, Dr. Dai[Radford et al., ICLR‘16] DCGAN : Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks

DCGAN: Results

Results on CelebA (200k relatively well aligned portrait photos)


[Radford et al., ICLR‘16] DCGAN : Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks

Conditional Generative Adversarial

Networks (cGANs)79I2DL: Prof. Niessner, Dr. Dai

Pix2Pix: Image-to-Image Translation


[Isola et al., CVPR‘17] Pix2Pix : Image-to-Image Translation with Conditional Adversarial Networks

80

real or fake?

Discriminator

z G(z)

D

Generator

G

min𝐺

max𝐷

𝔼𝑧,𝑥 log 𝐷(𝐺 𝑧 ) + log(1 − 𝐷 𝑥 )


min𝐺

max𝐷

𝔼𝑥,𝑦 log 𝐷(𝐺 𝑥 ) + log(1 − 𝐷 𝑦 )

81


G(x)x

real or fake?

Discriminator

D

Generator

G


min𝐺

max𝐷


82


Real!

DiscriminatorGenerator


G(x)x

DG

min𝐺

max𝐷


83


min𝐺

max𝐷


DiscriminatorGenerator

Real too!


G(x)x

DG

84


min𝐺

max𝐷

𝔼𝑥,𝑦 log𝐷(𝑥, 𝐺 𝑥 ) + log(1 − 𝐷 𝑥, 𝑦 )

real or fake pair?

match joint distribution p G x , y ∼ p(x, y)

fake pair real pair


G

G(x)x

D

85[Isola et al., CVPR‘17] Pix2Pix : Image-to-Image Translation with Conditional Adversarial Networks

Pix2Pix


Edges → ImagesInput Output Input Output Input Output


[Isola et al., CVPR‘17] Pix2Pix : Image-to-Image Translation with Conditional Adversarial NetworksEdges from [Xie et al., ICCV’15] Holistically-Nested Edge Detection]

87

Sketches → ImagesInput Output Input Output Input Output

Trained on Edges → Images

Data from [Eitz, Hays, Alexa, 2012]I2DL: Prof. Niessner, Dr. Dai 88

Input Output Ground Truth


Data from maps.google.com

89


BW → ColorInput Output Input Output Input Output


Data from ImageNet [Isola et al., CVPR‘17] Pix2Pix : Image-to-Image Translation with Conditional Adversarial Networks

GAN Applications


BigGAN: HD Image Generation


[Brock et al., ICLR‘18] BigGAN : Large Scale GAN Training for High Fidelity Natural Image Synthesis

StyleGAN: Face Image Generation


[Karras et al., ‘18] StyleGAN : A Style-Based Generator Architecture for Generative Adversarial Networks [Karras et al., ‘19] StyleGAN2 : Analyzing and Improving the Image Quality of StyleGAN

Cycle GAN: Unpaired Image-to-Image Translation

I2DL: Prof. Niessner, Dr. Dai[Zhu et al., ICCV‘17] Cycle GAN : Unpaired Image-to-Image Translation using Cycle-Consistent Adversarial Networks

94

SPADE: GAN-Based Image Editing

95I2DL: Prof. Niessner, Dr. Dai[Park et al., CVPR‘19] SPADE : Semantic Image Synthesis with Spatially-Adaptive Normalization

Few-Shot-Vid2Vid: Single-Shot Video Generation


• https://www.youtube.com/watch?v=APoB1u3kTOU• https://www.youtube.com/watch?v=kkA6CHRovKA

[Wang et al., NeurIPS‘19] Few-Shot Video-to-Video Synthesis

https://www.youtube.com/watch?v=APoB1u3kTOU

https://www.youtube.com/watch?v=kkA6CHRovKA

References for Further Reading• https://towardsdatascience.com/intuitively-

understanding-variational-autoencoders-1bfe67eb5daf

• https://phillipi.github.io/pix2pix/

• http://cs231n.stanford.edu/slides/2017/cs231n_2017_lecture13.pdf


https://towardsdatascience.com/intuitively-understanding-variational-autoencoders-1bfe67eb5daf

https://phillipi.github.io/pix2pix/

http://cs231n.stanford.edu/slides/2017/cs231n_2017_lecture13.pdf

Next Lecture


• Next lecture on 28th January: - Course outlook

• Reminder: Exercise 3 due tomorrow at 18:00

• Thursday exercise session on exercise 3 and exam tips

See you next time!


Documents

Lecture 11 Recap - GitHub Pages · Figure copyright and adapted from Ian Goodfellow, Tutorial on Generative Adversarial Networks, 2017. 44 Variational Autoencoders I2DL: Prof. Niessner,