Unsupervised Learning: Autoencoders
Yunsheng Bai
Roadmap
1. Introduction to Autoencoders
2. Sparse Autoencoders (SAE) (2008)
3. Denoising Autoencoders (DAE) (2008)
4. Contractive Autoencoders (CAE) (2011)
5. Stacked Convolutional Autoencoders (SCAE) (2011)
6. Recursive Autoencoders (RAE) (2011)
7. Variational Autoencoders (VAE) (2013)
8. Adversarial Autoencoders (AAE) (2015)
9. Wasserstein Autoencoders (WAE) (2017)
10. Autoencoders for Graphs
Introduction to Autoencoders
https://en.wikipedia.org/wiki/Principal_component_analysis#/media/File:GaussianScatterPCA.svg
Principal component analysis (PCA) is a statistical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components.
https://www.cs.toronto.edu/~urtasun/courses/CSC411/14_pca.pdf
(Figure: PCA projects the data via inner products with the basis vectors, i.e. a change of basis.)
PCA ≈ Autoencoder with Linear Activation Function
Not necessarily orthogonal
Hands-On Machine Learning with Scikit-Learn and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems
https://www.cs.toronto.edu/~urtasun/courses/CSC411/14_pca.pdf
Could have many layers, but as long as the activations are linear, they collapse into a single W and a single V
Hands-On Machine Learning with Scikit-Learn and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems
https://www.cs.toronto.edu/~urtasun/courses/CSC411/14_pca.pdf
PCA ≈ Autoencoder with Linear Activation Function
https://towardsdatascience.com/autoencoders-are-essential-in-deep-neural-nets-f0365b2d1d7c
Hands-On Machine Learning with Scikit-Learn and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems
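To make this concrete, here is a minimal NumPy sketch (mine, not from the cited sources; the data, code size, learning rate, and step count are all illustrative assumptions): a linear autoencoder trained with squared error ends up spanning the same subspace as the top principal components, even though its weights need not be orthogonal.

```python
# Minimal sketch: a linear autoencoder recovers the PCA subspace.
# All values here (data, code size k, learning rate) are illustrative.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 6)) @ rng.normal(size=(6, 6))  # correlated features
X = (X - X.mean(axis=0)) / X.std()                       # center, as PCA does

k = 2                                   # undercomplete code: k < 6
W = rng.normal(scale=0.1, size=(6, k))  # encoder weights
V = rng.normal(scale=0.1, size=(k, 6))  # decoder weights

lr = 0.05
for _ in range(5000):
    H = X @ W                 # h = f(x): linear encoding
    E = H @ V - X             # reconstruction error g(f(x)) - x
    gV = (H.T @ E) / len(X)   # gradient of 0.5*mean squared error w.r.t. V
    gW = (X.T @ (E @ V.T)) / len(X)  # ... and w.r.t. W
    V -= lr * gV
    W -= lr * gW

# Compare the learned subspace with the top-k principal directions.
Vt = np.linalg.svd(X, full_matrices=False)[2][:k]  # PCA basis (rows)
Q1, _ = np.linalg.qr(W)
Q2, _ = np.linalg.qr(Vt.T)
# Singular values = cosines of the principal angles between the subspaces;
# values near 1 mean the same subspace, though W itself is not orthogonal.
print(np.linalg.svd(Q1.T @ Q2, compute_uv=False))
```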
PCA vs Autoencoder
— Autoencoders are much more flexible than PCA.
— NN activation functions introduce "non-linearities" in the encoding, but PCA only does a linear transformation.
— We can stack autoencoders to form a deep autoencoder network.
Pro Deep Learning with TensorFlow: A Mathematical Approach to Advanced Artificial Intelligence in Python
(Figure: a stacked autoencoder with Layers 1–4.)
Goal: Learn Useful Features from Data
We've seen that autoencoders can do PCA, but fundamentally, why does an autoencoder work?
https://hackernoon.com/autoencoders-deep-learning-bits-1-11731e200694
Goal: Feature/Representation Learning
Why can't an autoencoder simply copy input to output through identity functions?
Pro Deep Learning with TensorFlow: A Mathematical Approach to Advanced Artificial Intelligence in Python
(Both the encoder and decoder weight matrices equal the 6×6 identity matrix I: a perfect but useless copy.)
Overcomplete
min ‖x − g(f(x))‖²   (f: encoder, g: decoder)
To Achieve Feature Learning, Conflicting Goals
Autoencoders are designed to be unable to learn to copy perfectly. Usually they are restricted in ways that allow them to copy only approximately. Because the model is forced to prioritize which aspects of the input should be copied, it often learns useful properties of the data.
Deep Learning (Adaptive Computation and Machine Learning series) (Ian Goodfellow, Yoshua Bengio, Aaron Courville)
Undercomplete Autoencoders
Pro Deep Learning with TensorFlow: A Mathematical Approach to Advanced Artificial Intelligence in Python
http://rgraphgallery.blogspot.com/2013/04/rg-3d-scatter-plots-with-vertical-lines.html
Encoders and decoders are too powerful :(
“If you could speak only a few words per month, you would probably try to make them worth listening to.”
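To ground the min ‖x − g(f(x))‖² objective with restricted capacity, here is a minimal undercomplete autoencoder sketch in tf.keras (mine; the layer sizes and MNIST setup are illustrative assumptions, not from the slides). The 32-dimensional code forces the network to prioritize what to keep.

```python
# A minimal undercomplete autoencoder in tf.keras (illustrative sketch).
# Objective: min ||x - g(f(x))||^2, with code dim << input dim so the
# network cannot simply learn the identity.
import tensorflow as tf

inputs = tf.keras.Input(shape=(784,))
h = tf.keras.layers.Dense(32, activation="relu", name="code")(inputs)  # f
outputs = tf.keras.layers.Dense(784, activation="sigmoid")(h)          # g
autoencoder = tf.keras.Model(inputs, outputs)
autoencoder.compile(optimizer="adam", loss="mse")

# Usage (e.g. on MNIST): note the training target equals the input.
(x_train, _), _ = tf.keras.datasets.mnist.load_data()
x_train = x_train.reshape(-1, 784).astype("float32") / 255.0
autoencoder.fit(x_train, x_train, epochs=5, batch_size=256)
```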
Regularized Autoencoders
Regularized autoencoders use a loss function that encourages the model to have other properties besides the ability to copy its input to its output. These other properties include sparsity of the representation, smallness of the derivative of the representation, and robustness to noise or to missing inputs. A regularized autoencoder can be nonlinear and overcomplete but still learn something useful about the data distribution, even if the model capacity is great enough to learn a trivial identity function.
→ introduce new terms into the loss
→ the autoencoder variants below are just different regularizers
2008: Sparse Autoencoders (SAE)
2008: Denoising Autoencoders (DAE)
2011: Contractive Autoencoders (CAE)
2011: Stacked Convolutional Autoencoders (SCAE)
2011: Recursive Autoencoders (RAE)
2013: Variational Autoencoders (VAE)
2015: Adversarial Autoencoders (AAE)
2017: Wasserstein Autoencoders (WAE)
Deep Learning (Adaptive Computation and Machine Learning series) (Ian Goodfellow, Yoshua Bengio, Aaron Courville)
Properties of Autoencoders (Ideally)
1. Learn useful features from data (effective representations)
   a. Capture the intrinsic properties of data → feed them into downstream applications
   b. Can be thought of as patterns in data → generate new data
2. Produce low-dimensional vectors (efficient/compact representations)
   a. Efficient for storage
   b. Efficient for downstream models
   c. May be free of noise in the input
   d. Easier to visualize than high-dimensional data
Hands-On Machine Learning with Scikit-Learn and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems
Properties of Autoencoders (Ideally)
3. Are flexible: can be modified/guided/regularized in various ways:
   a. Input data, e.g. add noise
   b. Output data, e.g. something different from the input
   c. Architecture, e.g. fully connected layer → convolutional layer
   d. Loss, e.g. add additional loss terms → capture other useful information from the input
   e. Latent space, e.g. Gaussian (more later in VAE)
      i. Enforce certain prior knowledge, usually through additional loss terms
      ii. Analyzing the latent space/representations is a trend (?), e.g. debiasing word embeddings
   f. … (Be creative! This is where research comes from)
History of Autoencoders
10 years ago, we thought that deep nets would also need an unsupervised cost, like the autoencoder cost, to regularize them.
Today, we know we are able to recognize images just by using backprop on the supervised cost as long as there is enough labeled data.
(Humans can learn from very few labeled examples. Why? One popular hypothesis: Brain can leverage unsupervised or semi-supervised learning.)
There are other tasks where we do still use autoencoders, but they’re not the fundamental solution to training deep nets that people once thought they were going to be.
(Ian Goodfellow, 2016)
https://www.quora.com/Why-are-autoencoders-considered-a-failure-What-are-their-alternatives
https://www.doc.ic.ac.uk/~js4416/163/website/nlp/#XGlorot2011
Deep Learning (Adaptive Computation and Machine Learning series) (Ian Goodfellow, Yoshua Bengio, Aaron Courville)
Applications of Autoencoders
1. Data Compression for Storage
   a. Difficult to train an autoencoder better than a basic algorithm like JPEG
   b. Autoencoders are data-specific: may be hard to generalize to unseen data
2. Dimensionality Reduction for Data Visualization
   a. t-SNE is good, but typically requires relatively low-dimensional data
      i. For high-dimensional data, first use an autoencoder, then use t-SNE
   b. Latent space visualization (more later)
https://blog.keras.io/building-autoencoders-in-keras.html
https://www.doc.ic.ac.uk/~js4416/163/website/nlp/#XVincent2008
https://hackernoon.com/latent-space-visualization-deep-learning-bits-2-bd09a46920df
Applications of Autoencoders
3. Unsupervised Pretraining
   a. Greedy layer-wise unsupervised pretraining: train each layer of a feedforward net greedily; continue stacking layers; the output of prior layers is the input for the next one; fine-tune
   b. Today, we have random weight initialization, rectified linear units (ReLUs) (2011), dropout (2012), batch normalization (2014), residual learning (2015) + large labeled datasets
   c. Still useful:
      i. Train a deep autoencoder
      ii. Train an autoencoder on an unlabeled dataset, and reuse the lower layers to create a new network trained on the labeled data (~supervised pretraining)
      iii. Train an autoencoder on an unlabeled dataset, and use the learned representations in downstream tasks (see more in 4)
https://blog.keras.io/building-autoencoders-in-keras.html
https://www.doc.ic.ac.uk/~js4416/163/website/nlp/#XVincent2008
Greedy Layer-Wise Unsupervised Pretraining for Training Deep Autoencoders
Hands-On Machine Learning with Scikit-Learn and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems
Unsupervised Pretraining for Supervised Tasks
Hands-On Machine Learning with Scikit-Learn and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems
Unsupervised Pretraining for Supervised Tasks
downside: two-stage training → more hyperparameter tuning :(
Hands-On Machine Learning with Scikit-Learn and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems
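A sketch of the two-stage recipe in tf.keras (the layer names enc1/enc2 and all sizes are invented for illustration): train an autoencoder on unlabeled data, then copy its encoder layers into a classifier that is fine-tuned on the labeled data.

```python
# Sketch: reuse a trained encoder's layers for a supervised task.
# Names and sizes are illustrative assumptions.
import tensorflow as tf

# Stage 1: train an autoencoder on unlabeled data.
inputs = tf.keras.Input(shape=(784,))
h1 = tf.keras.layers.Dense(128, activation="relu", name="enc1")(inputs)
code = tf.keras.layers.Dense(32, activation="relu", name="enc2")(h1)
recon = tf.keras.layers.Dense(784, activation="sigmoid")(code)
ae = tf.keras.Model(inputs, recon)
ae.compile(optimizer="adam", loss="mse")
# ae.fit(x_unlabeled, x_unlabeled, ...)

# Stage 2: copy the encoder layers (weights are shared) into a classifier
# and fine-tune on the labeled data.
clf = tf.keras.Sequential([
    tf.keras.Input(shape=(784,)),
    ae.get_layer("enc1"),
    ae.get_layer("enc2"),
    tf.keras.layers.Dense(10, activation="softmax"),
])
clf.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
            metrics=["accuracy"])
# Optionally freeze the pretrained layers first:
# ae.get_layer("enc1").trainable = False
# clf.fit(x_labeled, y_labeled, ...)
```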
Supervised Pretraining
(Figure: related paradigms: Multi-Task Learning, Transfer Learning, Domain Adaptation.)
https://www.youtube.com/watch?v=R3DNKE3zKFk
Applications of Autoencoders
4. Generate Representations for Downstream Tasks
   a. Special case of unsupervised pretraining (3.c.iii)
   b. Useful when the initial representation is poor and there is a lot of unlabeled data
      i. Word embeddings (better than one-hot representations)
      ii. Graph node embeddings
      iii. Image embeddings (images already lie in a rich vector space? Check out puppy image embeddings!)
      iv. Semantic hashing: turn database entries (text, image, etc.) into low-dimensional, binary codes → information retrieval
   c. Question: If there are labels, is there any reason to use a decoder with a reconstruction loss?
5. Generate New Data (Generative Model)
   a. Especially Variational Autoencoders (VAE) and Adversarial Autoencoders (AAE) (more later)
   b. Creative applications (more later)
Deep Learning (Adaptive Computation and Machine Learning series) (Ian Goodfellow, Yoshua Bengio, Aaron Courville)
(Figure: copy the output of Layer 2, the embedding, and use it as input to a downstream model: logistic regression, an SVM, a classifier, etc.)
Hands-On Machine Learning with Scikit-Learn and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems
graph node embedding
Copy output of Layer 2 (the code)
(Figure: semantic hashing: encode both the query and the entries in the database into hidden representations (codes), then compare the codes for retrieval.)
Hands-On Machine Learning with Scikit-Learn and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems
Applications of Autoencoders
6. Self-supervised Learning
   a. ∈ supervised learning where the targets are generated from the input data
   b. Merely learning to reconstruct the input might not be enough to learn abstract features of the kind that label-supervised learning induces (where targets are "dog", "car", ...)
      i. Data denoising
      ii. Jigsaw puzzle solving
      iii. ...
https://blog.keras.io/building-autoencoders-in-keras.html
https://www.doc.ic.ac.uk/~js4416/163/website/nlp/#XVincent2008
Skipgram vs Autoencoders
1. In NLP word embeddings, why is Skipgram more popular than autoencoders?
   a. Simpler
   b. More efficient
   c. Works well already
2. When does Skipgram no longer suffice? Additional goals, e.g.
   a. Denoising
   b. Complex characteristics of word use + polysemy → use a bidirectional LSTM with attention as the encoder!
   c. Generative setting (generate new data)
   d. Inductive setting (embed unseen words)
3. Can Skipgram be viewed as a special case of some autoencoder model?
   a. In fact, encoding and decoding are very general concepts and are used in many places
Mikolov, Tomas, et al. "Efficient estimation of word representations in vector space." arXiv preprint arXiv:1301.3781 (2013).
Peters, Matthew E., et al. "Deep contextualized word representations." arXiv preprint arXiv:1802.05365 (2018).
Sparse Autoencoders (SAE) (2008)
Motivation 1: Sparse Coding
An image should be represented by only a few bases.
A document should be about only a few topics.
https://www.cs.ubc.ca/~schmidtm/MLRG/sparseCoding.pdf
h = Dᵀx   (change of basis: inner products with the dictionary D)
x = Dh = DDᵀx → DDᵀ = I
+ Sparsity constraint on h
“If you could speak only a few words per month, you would probably try to make them worth listening to.”
Recall PCA: also a change of basis, computed via inner products with the basis vectors.
https://www.cs.toronto.edu/~urtasun/courses/CSC411/14_pca.pdf
Motivation 2: Prevent Identity Transform
Pro Deep Learning with TensorFlow: A Mathematical Approach to Advanced Artificial Intelligence in Python
(Both weight matrices equal the 6×6 identity matrix I: the autoencoder just copies the input.)
f(x) = h = Wᵀx   (encoder f)
g(h) = Wh = WWᵀx = x → WWᵀ = I (fine)   (decoder g)
In the overcomplete case (a 9-dimensional h for a 6-dimensional x), the 9×6 encoder can hold the 6×6 identity on top of three zero rows, and the 6×9 decoder the corresponding transpose: reconstruction is exact with W ≈ I, and nothing useful is learned.
Motivation 2: Prevent Identity Transform
https://www.cs.ubc.ca/~schmidtm/MLRG/sparseCoding.pdf
(Figure: a dense combination of bases, 0.2·w₁ + 0.3·w₂ + 0.1·w₃ + …, reproduces exactly the input, i.e. x.)
W = I (16×16 identity). In the case of images, we can think of W as a set of convolution filters (each with the same size as the input, e.g. 4×4).
Motivation 2: Prevent Identity Transform
https://www.cs.ubc.ca/~schmidtm/MLRG/sparseCoding.pdf
(Figure: with W = I, each "filter" is a one-hot vector such as 1000000000000000, 0100000000000000, 0010000000000000, …, and the input is reproduced as a trivial combination 1·w₁ + 0·w₂ + 1·w₃ + …; no meaningful features are learned.)
Sparse Autoencoders
(f: encoder, g: decoder)
Pro Deep Learning with TensorFlow: A Mathematical Approach to Advanced Artificial Intelligence in Python
Hands-On Machine Learning with Scikit-Learn and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems
http://web.stanford.edu/class/cs294a/sparseAutoencoder.pdf
J(W, b) = (1/m) Σᵢ ‖x⁽ⁱ⁾ − x̂⁽ⁱ⁾‖²  +  λ‖W‖²  +  β Σⱼ KL(ρ ‖ ρ̂ⱼ)
(m: # training samples; the three terms: reconstruction loss, regularization term, sparsity penalty)
This results in sparse activation of hidden units across training points, but does not guarantee that each input has a sparse representation. (Makhzani, Alireza, and Brendan Frey. "K-sparse autoencoders." arXiv preprint arXiv:1312.5663 (2013).)
ρ̂ⱼ: average activation of hidden unit j of layer 2 (assume two layers in the encoder)
Pro Deep Learning with TensorFlow: A Mathematical Approach to Advanced Artificial Intelligence in Python
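A minimal sketch of the sparsity penalty named above, following the Stanford notes cited on this slide (the batch, target rate ρ, and sizes are illustrative): the KL divergence between a target Bernoulli rate ρ and each hidden unit's average activation ρ̂ⱼ.

```python
# Sketch of the KL sparsity penalty: beta * sum_j KL(rho || rho_hat_j),
# where rho_hat_j is the mean activation of hidden unit j over the batch.
# All values below are illustrative assumptions.
import numpy as np

def kl_sparsity(H, rho=0.05, eps=1e-8):
    """H: (batch, hidden) sigmoid activations in (0, 1)."""
    rho_hat = H.mean(axis=0)  # average activation per hidden unit
    return np.sum(rho * np.log(rho / (rho_hat + eps))
                  + (1 - rho) * np.log((1 - rho) / (1 - rho_hat + eps)))

# Example: random sigmoid activations for a batch of 64, 32 hidden units.
H = 1 / (1 + np.exp(-np.random.default_rng(0).normal(size=(64, 32))))
penalty = kl_sparsity(H)  # added to reconstruction loss + weight decay
print(penalty)
```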
Results
Techniques to Interpret Autoencoders
1. Visualize the weight matrix W
   a. Each column of W corresponds to the weights of a particular neuron
   b. When there is a natural interpretation of the weights, we can visualize them
      i. Especially true in the case of images, as seen previously (~convolution filters)
      ii. Especially true for the top hidden layers, since they often capture relatively large features
2. Visualize the most exciting input per neuron (a sketch follows below)
   a. Treat each neuron as a feature detector. To find the feature a particular neuron is looking for:
      i. Feed a random input
      ii. Measure the activation of the neuron you are interested in
      iii. Perform backpropagation to tweak the input so that the neuron will activate even more (gradient ascent)
      iv. Iterate several times
Hands-On Machine Learning with Scikit-Learn and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems
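A hedged sketch of step 2 (gradient ascent on the input), assuming a tf.keras encoder that maps flattened 28×28 images to hidden activations; the function name and all hyperparameters are illustrative.

```python
# Sketch of "most exciting input per neuron" via gradient ascent.
import tensorflow as tf

def most_exciting_input(encoder, unit, steps=100, lr=0.1):
    """Gradient-ascent image that maximally activates one hidden unit."""
    x = tf.Variable(tf.random.uniform((1, 784)))   # i.   random input
    for _ in range(steps):                         # iv.  iterate
        with tf.GradientTape() as tape:
            act = encoder(x)[0, unit]              # ii.  neuron activation
        g = tape.gradient(act, x)                  # iii. backprop to the input
        x.assign_add(lr * g)                       #      gradient ascent step
        x.assign(tf.clip_by_value(x, 0.0, 1.0))    # keep a valid image
    return x.numpy().reshape(28, 28)

# Usage: pass any model mapping (1, 784) inputs to hidden activations, e.g.
# encoder = tf.keras.Model(ae.input, ae.get_layer("enc2").output)
# img = most_exciting_input(encoder, unit=0)
```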
Denoising Autoencoders (DAE) (2008)
Sparse Coding Could Also Handle Image Denoising
Key: the use of sparse and redundant representations over trained dictionaries.
https://www.cs.ubc.ca/~schmidtm/MLRG/sparseCoding.pdf
Elad, Michael, and Michal Aharon. "Image denoising via learned dictionaries and sparse representation." Computer Vision and Pattern Recognition, 2006 IEEE Computer Society Conference on. Vol. 1. IEEE, 2006.
Denoising Autoencoders: Implementation-level
Hands-On Machine Learning with Scikit-Learn and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems
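A minimal denoising-autoencoder sketch in tf.keras (mine; the noise level and sizes are assumptions): corrupt the input, but reconstruct the clean target.

```python
# Minimal denoising autoencoder: corrupt the input, reconstruct the
# CLEAN target. The noise level and layer sizes are illustrative.
import tensorflow as tf

inputs = tf.keras.Input(shape=(784,))
noisy = tf.keras.layers.GaussianNoise(stddev=0.3)(inputs)  # training only
h = tf.keras.layers.Dense(32, activation="relu")(noisy)
outputs = tf.keras.layers.Dense(784, activation="sigmoid")(h)
dae = tf.keras.Model(inputs, outputs)
dae.compile(optimizer="adam", loss="mse")
# dae.fit(x_train, x_train, ...)   # target is the clean input x

# Salt-and-pepper corruption can instead be applied to x before fit():
# mask = tf.random.uniform(tf.shape(x)) < 0.1
# x_sp = tf.where(mask, tf.round(tf.random.uniform(tf.shape(x))), x)
```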
Denoising Autoencoders: Results
Pro Deep Learning with TensorFlow: A Mathematical Approach to Advanced Artificial Intelligence in Python
Gaussian noise
Denoising Autoencoders: Results
Pro Deep Learning with TensorFlow: A Mathematical Approach to Advanced Artificial Intelligence in Python
Salt and pepper noise
Denoising Autoencoders: Research-level
Deep Learning (Adaptive Computation and Machine Learning series) (Ian Goodfellow, Yoshua Bengio, Aaron Courville)
Why is this equivalent to a reconstruction loss? (1) Intuitively; (2) recall that the least-squares estimate coincides with the maximum-likelihood estimate under a Gaussian model.
Contractive Autoencoders (CAE) (2011)
CAE: Resist Infinitesimal Perturbations of the Input
All autoencoder training procedures involve a compromise between two opposing forces: being data-specific and being data-insensitive.
Deep Learning (Adaptive Computation and Machine Learning series) (Ian Goodfellow, Yoshua Bengio, Aaron Courville)
CAE and DAE are equivalent under certain conditions.
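A small NumPy sketch of the contractive penalty λ‖∂f(x)/∂x‖²_F (the Frobenius norm of the encoder's Jacobian); for a sigmoid encoder h = σ(Wx + b) it has the closed form used below (Rifai et al., 2011). Shapes and values are illustrative.

```python
# Sketch of the contractive penalty for a sigmoid encoder.
# J_f(x) = diag(h * (1 - h)) W, so
# ||J_f(x)||_F^2 = sum_j (h_j (1 - h_j))^2 * sum_i W_ji^2.
import numpy as np

def contractive_penalty(X, W, b):
    """X: (batch, d_in), W: (d_hidden, d_in), b: (d_hidden,)."""
    H = 1 / (1 + np.exp(-(X @ W.T + b)))   # (batch, d_hidden) activations
    dh = (H * (1 - H)) ** 2                # squared sigmoid derivative
    w2 = np.sum(W ** 2, axis=1)            # squared row norms of W
    return np.mean(dh @ w2)                # batch-average ||J_f(x)||_F^2

rng = np.random.default_rng(0)
X, W, b = rng.normal(size=(64, 6)), rng.normal(size=(4, 6)), np.zeros(4)
print(contractive_penalty(X, W, b))        # added to the reconstruction loss
```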
Stacked Convolutional Autoencoders (SCAE) (2011)
SCAE
Use convolutional + pooling layers instead of fully connected layers.
Dong, Chao, et al. "Image super-resolution using deep convolutional networks." IEEE transactions on pattern analysis and machine intelligence 38.2 (2016): 295-307.
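A minimal stacked convolutional autoencoder sketch in tf.keras (architecture sizes are my assumptions, not from the paper): convolution + pooling in the encoder, convolution + upsampling in the decoder.

```python
# Minimal convolutional autoencoder sketch; sizes are illustrative.
import tensorflow as tf
from tensorflow.keras import layers

inputs = tf.keras.Input(shape=(28, 28, 1))
x = layers.Conv2D(16, 3, activation="relu", padding="same")(inputs)
x = layers.MaxPooling2D(2)(x)                     # 28 -> 14
x = layers.Conv2D(8, 3, activation="relu", padding="same")(x)
code = layers.MaxPooling2D(2)(x)                  # 14 -> 7 (the code)
x = layers.Conv2D(8, 3, activation="relu", padding="same")(code)
x = layers.UpSampling2D(2)(x)                     # 7 -> 14
x = layers.Conv2D(16, 3, activation="relu", padding="same")(x)
x = layers.UpSampling2D(2)(x)                     # 14 -> 28
outputs = layers.Conv2D(1, 3, activation="sigmoid", padding="same")(x)
scae = tf.keras.Model(inputs, outputs)
scae.compile(optimizer="adam", loss="binary_crossentropy")
# scae.fit(x_train, x_train, ...)   # reconstruct the (possibly corrupted) image
```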
Image Deblurring/Denoising/Super-Resolution and Image Colorization
Dong, Chao, et al. "Image super-resolution using deep convolutional networks." IEEE transactions on pattern analysis and machine intelligence 38.2 (2016): 295-307.
https://hackernoon.com/autoencoders-deep-learning-bits-1-11731e200694
Recursive Autoencoders (RAE) (2011)
Sentence Representation
Socher, Richard, et al. "Semi-supervised recursive autoencoders for predicting sentiment distributions." Proceedings of the conference on empirical methods in natural language processing. Association for Computational Linguistics, 2011.
Why not simple average? “white blood cells destroying an infection” ≠ “an infection destroying white blood cells”
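A NumPy sketch of one recursive-autoencoder merge step (following the idea in Socher et al., 2011; the dimensions and initialization are illustrative): two child vectors are encoded into a parent of the same size, and the reconstruction error of the children scores the merge.

```python
# Sketch of one RAE merge step. Sizes and values are illustrative.
import numpy as np

d = 8
rng = np.random.default_rng(0)
We, be = rng.normal(scale=0.1, size=(d, 2 * d)), np.zeros(d)      # encoder
Wd, bd = rng.normal(scale=0.1, size=(2 * d, d)), np.zeros(2 * d)  # decoder

def merge(c1, c2):
    children = np.concatenate([c1, c2])
    parent = np.tanh(We @ children + be)    # parent representation
    recon = Wd @ parent + bd                # reconstruct both children
    loss = np.sum((children - recon) ** 2)  # local reconstruction loss
    return parent, loss

# Greedily merging adjacent word vectors bottom-up yields a tree and a
# sentence vector at the root (alternatively, follow a given parse tree).
c1, c2 = rng.normal(size=d), rng.normal(size=d)
parent, loss = merge(c1, c2)
```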
Sentence Representation
https://www.doc.ic.ac.uk/~js4416/163/website/nlp/recursive.html
Socher, Richard, et al. "Dynamic pooling and unfolding recursive autoencoders for paraphrase detection." Advances in neural information processing systems. 2011.
Could use parse tree
Could introduce a supervised loss
Could penalize top-level nodes, which contain more children, more heavily
Could use many layers
Could normalize the hidden representations
Could predict all children underneath → unfolding RAE
Variational Autoencoders (VAE) (2013)
VAE: Intuition
https://www.jeremyjordan.me/variational-autoencoders/
Encoder Outputs Statistical Distributions; Feed Samples into the Decoder → Add Noise at All Times; Generate New Data After Training
VAE: Implementation-level
Hands-On Machine Learning with Scikit-Learn and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems
Probabilistic autoencoders (their outputs are partly determined by chance, even after training) + generative autoencoders.
Assume the prior distribution of z, i.e. p(z), to be Gaussian → encourage the learned posterior q(z|x) to be similar to p(z) through an additional loss term measuring their KL divergence.
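A hedged tf.keras sketch of these pieces (sizes are illustrative; the weighting between reconstruction and KL is simplified): the encoder outputs (μ, log σ²), a reparameterized sample z = μ + σ·ε is fed to the decoder, and the KL term pulls q(z|x) toward the Gaussian prior p(z).

```python
import tensorflow as tf

class Sampling(tf.keras.layers.Layer):
    """Reparameterization trick: z = mu + sigma * eps; also adds the KL term."""
    def call(self, inputs):
        mu, log_var = inputs
        kl = -0.5 * tf.reduce_mean(tf.reduce_sum(
            1 + log_var - tf.square(mu) - tf.exp(log_var), axis=-1))
        self.add_loss(kl)                     # pull q(z|x) toward p(z) = N(0, I)
        eps = tf.random.normal(tf.shape(mu))  # noise at all times
        return mu + tf.exp(0.5 * log_var) * eps

latent_dim = 2
x_in = tf.keras.Input(shape=(784,))
h = tf.keras.layers.Dense(256, activation="relu")(x_in)
mu = tf.keras.layers.Dense(latent_dim)(h)
log_var = tf.keras.layers.Dense(latent_dim)(h)
z = Sampling()([mu, log_var])
x_out = tf.keras.layers.Dense(784, activation="sigmoid")(z)

vae = tf.keras.Model(x_in, x_out)
# Reconstruction loss; in practice it is often scaled (e.g. by 784) vs. the KL.
vae.compile(optimizer="adam", loss="binary_crossentropy")
# vae.fit(x_train, x_train, ...); after training, feed z ~ N(0, I) into the
# decoder layers to generate new samples.
```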
VAE: Research-level
Kingma, Diederik P., and Max Welling. "Auto-encoding variational bayes." arXiv preprint arXiv:1312.6114 (2013).
https://www.jeremyjordan.me/variational-autoencoders/
Variational Bayesian inference:
— probabilistic encoder (recognition model) q(z|x): a variational approximation to the intractable true posterior
— probabilistic decoder (generative model) p(x|z)
— z: the latent representation or code
MusicVAE: Generative Model → Creative Artists
https://magenta.tensorflow.org/music-vae
Roberts, Adam, et al. "A Hierarchical Latent Vector Model for Learning Long-Term Structure in Music." arXiv preprint arXiv:1803.05428 (2018).
The desirable properties of a latent space can be summarized as follows:
1. Expression: Any real example can be mapped to some point in the latent space and reconstructed from it.
2. Realism: Any point in this space represents some realistic example, including ones not in the training set.
3. Smoothness: Examples from nearby points in latent space have similar qualities to one another.
https://experiments.withgoogle.com/ai/beat-blender/view/
Key: Design Latent Space Properties
https://www.jeremyjordan.me/variational-autoencoders/
Learn smooth latent-state representations of the input data. Good for interpolation, sampling, generation, downstream classification, etc.
(Without regularization, the latent space has "holes" :()
Interpolation → Smooth Transformation
https://www.jeremyjordan.me/variational-autoencoders/
SketchRNN: Seq2seq + Variational Autoencoder
https://research.googleblog.com/2017/04/teaching-machines-to-draw.html
sequence-to-sequence (seq2seq) autoencoder framework with variational inference
sketch → sequence of motor actions controlling a pen (how about text or a graph as a sequence?)
by adding noise to the latent vector, the model cannot reproduce the input sketch exactly
Arithmetic operations on sketch embeddings!
Smoothness of latent space
Roadmap1. Introduction to Autoencoders2. Sparse Autoencoders (SAE) (2008)3. Denoising Autoencoders (DAE) (2008)4. Contractive Autoencoders (CAE) (2011)5. Stacked Convolutional Autoencoders (SCAE) (2011)6. Recursive Autoencoders (RAE) (2011)7. Variational Autoencoders (VAE) (2013)8. Adversarial Autoencoders (AAE) (2015)9. Wasserstein Autoencoders (WAE) (2017)
10. Autoencoders for Graphs
Adversarial Autoencoders (AAE) (2015)
(Figure, VAE recap: prior p(z); the additional loss term is D_KL(q(z|x) ‖ p(z)).)
Makhzani, Alireza, et al. "Adversarial autoencoders." arXiv preprint arXiv:1511.05644 (2015).
AAE: Regularized by an Adversarial Network Which Guides the Posterior q(z|x) to Match Any Arbitrary Prior p(z)
(Figure: VAE vs. AAE.)
AAE: Design Arbitrary Prior
Makhzani, Alireza, et al. "Adversarial autoencoders." arXiv preprint arXiv:1511.05644 (2015).
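A sketch of the adversarial regularization step (mine, not the paper's code; the shapes, the Gaussian stand-in for p(z), and the optimizers are illustrative): a small discriminator tries to separate encoder codes from prior samples, and the encoder is additionally updated to fool it. This runs alongside the usual reconstruction phase.

```python
import tensorflow as tf

latent_dim = 2
encoder = tf.keras.Sequential([
    tf.keras.Input(shape=(784,)),
    tf.keras.layers.Dense(256, activation="relu"),
    tf.keras.layers.Dense(latent_dim),
])
disc = tf.keras.Sequential([
    tf.keras.Input(shape=(latent_dim,)),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(1),        # logit: prior sample vs. encoder code
])
bce = tf.keras.losses.BinaryCrossentropy(from_logits=True)
d_opt, g_opt = tf.keras.optimizers.Adam(1e-4), tf.keras.optimizers.Adam(1e-4)

def adversarial_step(x):
    z_fake = encoder(x)                          # codes from q(z|x)
    z_real = tf.random.normal(tf.shape(z_fake))  # p(z): any samplable prior
    with tf.GradientTape() as t1:                # 1) update the discriminator
        real_logits, fake_logits = disc(z_real), disc(z_fake)
        d_loss = (bce(tf.ones_like(real_logits), real_logits)
                  + bce(tf.zeros_like(fake_logits), fake_logits))
    d_opt.apply_gradients(zip(t1.gradient(d_loss, disc.trainable_variables),
                              disc.trainable_variables))
    with tf.GradientTape() as t2:                # 2) update encoder to fool D
        fooled = disc(encoder(x))
        g_loss = bce(tf.ones_like(fooled), fooled)
    g_opt.apply_gradients(zip(t2.gradient(g_loss, encoder.trainable_variables),
                              encoder.trainable_variables))
```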
AAE: Labels Can Further Guide (Semi-Supervised)
Makhzani, Alireza, et al. "Adversarial autoencoders." arXiv preprint arXiv:1511.05644 (2015).
Wasserstein Autoencoders (WAE) (2017)
WAE: Motivation
VAE
Pros:
1. Theoretically elegant
2. Stable training
3. Encoder-decoder architecture
4. Nice latent manifold structure
Cons:
1. Tends to generate blurry samples
GAN
Pros:
1. Good visual quality of images
Cons:
1. Harder to train
2. No encoder; only a decoder/generator and a discriminator
3. "Mode collapse" problem
4. ~JS divergence, "worse" than the Wasserstein distance (see details in the paper)
Tolstikhin, Ilya, et al. "Wasserstein Auto-Encoders." arXiv preprint arXiv:1711.01558 (2017).
Combine VAE + GAN in a Principled Way?
(Figure, VAE side: prior p(z) with the additional D_KL loss term.)
Tolstikhin, Ilya, et al. "Wasserstein Auto-Encoders." arXiv preprint arXiv:1711.01558 (2017).
(Figure labels: GAN: decoder/generator + discriminator; AAE: encoder + decoder.)
WAE
A generalization of AAE; minimizes the Wasserstein distance between the model and the target distribution.
Tolstikhin, Ilya, et al. "Wasserstein Auto-Encoders." arXiv preprint arXiv:1711.01558 (2017).
https://openreview.net/forum?id=HkL7n1-0b
Autoencoders for Graphs
Graphs Are Different
1. Are there smooth linear interpolations? Arithmetic operations?
2. A graph is composed of correlated substructures
   a. E.g. two triangles → a rectangle
   b. Hierarchy: pixels (atomic) → patterns → images; words (atomic) → phrases → sentences → paragraphs/documents; nodes (atomic) → substructures → graphs (transfer learning)
3. Graphs are of different sizes
4. Graph nodes lack order
5. How to detect substructures?
   a. For images, convolutional layers → SCAE
   b. For graphs, graph convolutional layers → node/substructure/graph?
   c. Some people treat a graph as sequences/random walks → "deconstruction" view
      i. ~Parse sentences into trees instead of feeding them into an LSTM
   d. How about decomposing graphs into equal-size subgraphs?
Simonovsky, Martin, and Nikos Komodakis. "GraphVAE: Towards Generation of Small Graphs Using Variational Autoencoders." arXiv preprint arXiv:1802.03480 (2018).
GraphVAE
Simonovsky, Martin, and Nikos Komodakis. "GraphVAE: Towards Generation of Small Graphs Using Variational Autoencoders." arXiv preprint arXiv:1802.03480 (2018).
Thank you!