Generative RNNs for sequence modeling
Christof Angermueller
University of Cambridge, European Bioinformatics Institute (EBI-EMBL), Cambridge, UK
https://cangermueller.com | @cangermueller
2016-01-21
Sequence modeling
} Wanted: probability over sequences x
x ~ P(x1, ..., xT)
Applications
} Text translation
} Speech recognition
} Bioinformatics
} Music modeling
Sequence modeling
Models
} n-gram → Markov assumption
} LDS → simple linear dynamics
} HMM → static hidden state
} RNN → non-linear transition function, continuous hidden state
Discriminative RNN
[Diagram: inputs x1, x2, x3 ("I like RNNs"), hidden states h1, h2, h3, outputs y1, y2, y3 ("Subject Verb Object")]
P(y1, ..., yT | x1, ..., xT)
→ Requires target labels Y
Generative RNN
[Diagram: observed sequence X = (x1, x2, x3), e.g. "I like RNNs"]
x ~ P(x1, ..., xT)
Generative RNN
Idea: try to predict the next word.
[Diagram: hidden states h1, h2, h3 over inputs x1, x2, x3 ("I like RNNs"); outputs y1, y2 parameterize P(x2 | y1) and P(x3 | y2)]
$h_t = f_h(W^{xh} x_t + W^{hh} h_{t-1} + b_h)$
$y_t = f_y(W^{hy} h_t + b_y)$
→ The output y_t parameterizes the distribution over the next word.
Generative RNN
Idea: try to predict the next word.
[Diagram: fully unrolled RNN; outputs y1, y2, y3 parameterize P(x2 | y1), P(x3 | y2), P(x4 | y3)]
Likelihood: $P(x_1, ..., x_T) = \prod_t P(x_{t+1} \mid y_t)$
Loss function: negative log-likelihood of the training sequences
} Training via BPTT
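As a minimal NumPy sketch (hypothetical parameter names W_xh, W_hh, W_hy, b_h, b_y; not the code behind the slides), the likelihood of a training sequence is the product of the per-step predictions, and its negative log is the loss that BPTT differentiates:

```python
import numpy as np

def rnn_lm_loss(X, params):
    """Negative log-likelihood of a token sequence under a simple generative RNN.

    X      : list of integer token ids [x_1, ..., x_T]
    params : dict with W_xh (V x H), W_hh (H x H), W_hy (H x V), b_h (H,), b_y (V,)
    """
    W_xh, W_hh, W_hy = params["W_xh"], params["W_hh"], params["W_hy"]
    b_h, b_y = params["b_h"], params["b_y"]

    h = np.zeros(W_hh.shape[0])            # h_0
    nll = 0.0
    for t in range(len(X) - 1):
        x_onehot = np.zeros(W_xh.shape[0])
        x_onehot[X[t]] = 1.0
        # h_t = f_h(W_xh x_t + W_hh h_{t-1} + b_h), here f_h = tanh
        h = np.tanh(x_onehot @ W_xh + h @ W_hh + b_h)
        # y_t = softmax(W_hy h_t + b_y) parameterizes P(x_{t+1} | x_1..x_t)
        logits = h @ W_hy + b_y
        y = np.exp(logits - logits.max())
        y /= y.sum()
        nll -= np.log(y[X[t + 1]])         # -log P(x_{t+1} | x_1..x_t)
    return nll
```

Training would back-propagate this loss through time (BPTT) to update the five parameter matrices.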
Generating sequences
[Diagram: start from h0 with a <START> input; at each step, sample the next word from the predicted distribution and feed it back as the next input]
} <START> → "I" ~ P(x2 | y1) → "like" ~ P(x3 | y2) → "RNNs" → <STOP>
} Generation stops once <STOP> is sampled
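A matching sketch of the sampling loop (same hypothetical parameter layout as above; start_id and stop_id stand for the <START> and <STOP> tokens): each sampled word is fed back as the next input until <STOP> is drawn.

```python
import numpy as np

def sample_sequence(params, start_id, stop_id, vocab_size, max_len=50, rng=np.random):
    """Generate a sequence by feeding each sampled token back as the next input."""
    W_xh, W_hh, W_hy = params["W_xh"], params["W_hh"], params["W_hy"]
    b_h, b_y = params["b_h"], params["b_y"]

    h = np.zeros(W_hh.shape[0])                          # h_0
    x = start_id                                         # <START>
    seq = []
    for _ in range(max_len):
        x_onehot = np.zeros(vocab_size)
        x_onehot[x] = 1.0
        h = np.tanh(x_onehot @ W_xh + h @ W_hh + b_h)    # h_t
        logits = h @ W_hy + b_y
        p = np.exp(logits - logits.max())
        p /= p.sum()                                     # P(x_{t+1} | history)
        x = rng.choice(vocab_size, p=p)                  # sample the next token
        if x == stop_id:                                 # <STOP> ends generation
            break
        seq.append(x)
    return seq
```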
Example: Wikipedia
[Training text vs. generated text]
} xt are characters instead of words!
} Fewer parameters
char-RNN
https://github.com/karpathy/char-rnn
http://karpathy.github.io/2015/05/21/rnn-effectiveness/
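To see why characters shrink the model, here is a tiny illustrative sketch of the character-level encoding: the vocabulary has tens of symbols rather than tens of thousands of words, so the input and output layers, whose size scales with the vocabulary, have far fewer parameters.

```python
# Character-level vocabulary: far fewer symbols than a word vocabulary,
# so the input/output layers (size = vocabulary size) are much smaller.
text = "I like RNNs"
chars = sorted(set(text))
char_to_id = {c: i for i, c in enumerate(chars)}
ids = [char_to_id[c] for c in text]   # sequence x_1, ..., x_T of character ids
print(len(chars), ids)                # vocabulary of only ~10 characters here
```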
Example: Linux source code
Conditional language model
P(x1, ..., xT | z1, ..., zL)
[Diagram: encoder RNN over z1, z2, z3 with hidden states h̃1, h̃2, h̃3; decoder RNN over x1, x2, x3 with hidden states h1, h2, h3 and outputs y1, y2, y3]
} Encoder → Decoder: initialize the decoder with the last encoder hidden state
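A rough sketch of this encoder-decoder setup (hypothetical parameter dicts enc and dec with the same layout as the earlier sketches, and assuming equal encoder and decoder hidden sizes): the encoder runs over z, and its final hidden state initializes the decoder that scores x.

```python
import numpy as np

def encode_decode_nll(Z, X, enc, dec):
    """Conditional model P(x_1..x_T | z_1..z_L): encoder RNN over Z, decoder RNN
    over X initialized with the last encoder hidden state (sizes assumed equal)."""
    # --- encoder ---
    h = np.zeros(enc["W_hh"].shape[0])
    for z in Z:
        z_onehot = np.zeros(enc["W_xh"].shape[0])
        z_onehot[z] = 1.0
        h = np.tanh(z_onehot @ enc["W_xh"] + h @ enc["W_hh"] + enc["b_h"])
    # --- decoder, initialized with the last encoder hidden state ---
    nll = 0.0
    for t in range(len(X) - 1):
        x_onehot = np.zeros(dec["W_xh"].shape[0])
        x_onehot[X[t]] = 1.0
        h = np.tanh(x_onehot @ dec["W_xh"] + h @ dec["W_hh"] + dec["b_h"])
        logits = h @ dec["W_hy"] + dec["b_y"]
        p = np.exp(logits - logits.max())
        p /= p.sum()
        nll -= np.log(p[X[t + 1]])        # -log P(x_{t+1} | x_1..x_t, z_1..z_L)
    return nll
```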
Example: Shakespeare
Why, Salisbury must find his flesh and thought That which I am not aps, not a man and in fire, To show the reining of the raven and the wars To grace my hand reproach within, and not a fair are hand, That Caesar and my goodly father's world; When I was heaven of presence and our fleets, We spare with hours, but cut thy council I am great, Murdered and by thy master's ready there My power to give thee but so much as hell: Some service in the noble bondman here, Would show him to her wine. O, if you were a feeble sight, the courtesy of your law, Your sight and several breath, will wear the gods With his heads, and my hands are wonder'd at the deeds, So drop upon your lordship's head, and your opinion Shall be against your honour.
Image caption generation
Hand-writing generation
[Sample: "He dismissed the idea"]
} Input at each time step: x_t = (x_{t,1}, x_{t,2}, s_t)
} x_{t,1}, x_{t,2}: spatial position
} s_t: pen state (0 = up, 1 = down)
Challenges
} Multi-dimensional output
} Multi-modal, correlated (x_{t,1}, x_{t,2})
} Cannot be represented by a simple output function $y_t = \sigma(W^{hy} h_t + b_y)$
Solution: Mixture Density Network
} The output vector parameterizes a conditional probability distribution
→ The RNN predicts the parameters of a Gaussian Mixture Model (GMM)
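A hedged sketch of such an output layer (hypothetical function names; simplified to diagonal Gaussians, whereas Graves, 2013 also predicts per-component correlations): the RNN output vector y, of length 5K + 1 for K mixture components, is split into mixture weights, means, and standard deviations for the 2-D pen offset, plus a Bernoulli probability for the pen state.

```python
import numpy as np

def mdn_params(y, K):
    """Split an RNN output vector y (length 5K + 1) into GMM parameters over the
    2-D pen offset plus a Bernoulli pen-state probability (diagonal-Gaussian sketch)."""
    pi_logits = y[:K]
    mu        = y[K:3 * K].reshape(K, 2)        # component means
    log_sigma = y[3 * K:5 * K].reshape(K, 2)    # component log std-devs
    pen_logit = y[5 * K]
    pi = np.exp(pi_logits - pi_logits.max())
    pi /= pi.sum()                              # mixture weights (softmax)
    sigma = np.exp(log_sigma)                   # positive std-devs
    p_pen = 1.0 / (1.0 + np.exp(-pen_logit))    # P(pen down)
    return pi, mu, sigma, p_pen

def sample_stroke(y, K, rng=np.random):
    """Draw one (dx, dy, pen) triple from the predicted mixture."""
    pi, mu, sigma, p_pen = mdn_params(y, K)
    k = rng.choice(K, p=pi)                     # pick a mixture component
    dx, dy = rng.normal(mu[k], sigma[k])        # sample the pen offset
    pen = rng.random() < p_pen                  # sample the pen state
    return dx, dy, int(pen)
```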
Samples
[Generated handwriting, e.g. "More of national temperament"]
Can we choose a writing style?
Conditional language model
[Diagram: encoder RNN over the seed sequence pair, decoder RNN over the target sequence pair]
} Seed sequence pair: "He dismissed the idea"
} Target sequence pair: "I love RNNs"
http://www.cs.toronto.edu/~graves/handwriting.html
Same idea: Chinese characters
http://blog.otoro.net/2015/12/28/recurrent-net-dreams-up-fake-chinese-characters-in-vector-format-with-tensorflow/
Polyphonic music modeling
[Piano roll: binary matrix x of notes over time]
Challenges
1. Correlation along time
2. Correlation between notes
} High-dimensional, multi-modal output
3. Time-dependent, non-local factors of variation
} Theme, tune, chord progression, ...
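For concreteness, a toy piano-roll representation (illustrative values only): a binary notes-by-time matrix in which each column is the high-dimensional vector the model must predict jointly.

```python
import numpy as np

# Toy piano roll: rows = pitches (e.g. 88 piano keys), columns = time steps.
# x[n, t] = 1 if note n sounds at time step t. Each column is the binary
# vector v(t) whose notes must be modeled jointly (chords!).
n_notes, n_steps = 88, 16
x = np.zeros((n_notes, n_steps), dtype=np.int8)
x[[60, 64, 67], 0:4] = 1     # a C major chord held for four steps
x[[62, 65, 69], 4:8] = 1     # then a D minor chord
print(x.sum(axis=0))         # number of active notes per time step
```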
RNN: correlation along time
[Piano roll with an RNN modeling dependencies across time steps]
RBM: correlation between notes
[Piano roll column: an RBM models the joint distribution over the notes at a single time step]
RNN-RBM (Boulanger-Lewandowski et al., 2012)
[Piano roll: RNN along time + RBM over the notes at each time step]
RBM 101
} Likelihood: $P(v) = \sum_h P(v, h) = \frac{1}{Z} \sum_h \exp(-E(v, h))$
} Conditional independence → inference P(h | v) is easy
} Intractable partition function Z → sampling v ~ P(v) is intractable → Contrastive Divergence approximation
} Learning requires sampling
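A minimal sketch of a binary RBM and one step of Contrastive Divergence (CD-1) under the usual energy function E(v, h) = -v·b_v - h·b_h - h·W·v; parameter names are hypothetical:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def cd1(v0, W, b_v, b_h, rng=np.random):
    """One step of CD-1 for a binary RBM with weights W (H x V) and biases b_v, b_h.
    Returns approximate gradients for W, b_v, b_h."""
    # P(h | v) factorizes over hidden units -> inference is easy
    ph0 = sigmoid(b_h + W @ v0)
    h0 = (rng.random(ph0.shape) < ph0).astype(float)
    # One Gibbs step: reconstruct v, then recompute h probabilities
    pv1 = sigmoid(b_v + W.T @ h0)
    v1 = (rng.random(pv1.shape) < pv1).astype(float)
    ph1 = sigmoid(b_h + W @ v1)
    # CD-1 gradient estimate: positive phase minus negative phase
    dW = np.outer(ph0, v0) - np.outer(ph1, v1)
    db_v = v0 - v1
    db_h = ph0 - ph1
    return dW, db_v, db_h
```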
Sampling
1. RNN hidden state
2. Prediction of the RBM bias terms
3. Sample v(t) from the RBM via CD
Learning
1. Propagate h'(t−1)
2. Predict bias terms bh(t) and bv(t)
3. Sample v(t) from the RBM via CD
4. Estimate gradients of $P(v) = \frac{1}{Z} \sum_h \exp(-E(v, h))$
5. Back-propagate via BPTT
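Putting the pieces together, a hedged sketch of one RNN-RBM time step (hypothetical parameter names W, W_uv, W_uh, W_vu, W_uu, b_u; gradient estimation and BPTT are omitted): the previous RNN hidden state predicts time-dependent RBM biases, the biased RBM models v(t), and the RNN state is then updated.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def rnn_rbm_step(v_t, h_prev, p, rng=np.random):
    """One RNN-RBM time step (sketch): h(t-1) predicts dynamic RBM biases
    b_v(t), b_h(t); v(t) is modeled by an RBM with shared weights W and those
    biases; the RNN hidden state is then updated deterministically."""
    # 2. Predict the dynamic bias terms from the previous RNN hidden state
    b_v_t = p["b_v"] + p["W_uv"] @ h_prev
    b_h_t = p["b_h"] + p["W_uh"] @ h_prev
    # 3. Sample a reconstruction of v(t) from the biased RBM (one Gibbs step, CD-1)
    ph0 = sigmoid(b_h_t + p["W"] @ v_t)
    h0 = (rng.random(ph0.shape) < ph0).astype(float)
    v_sample = (rng.random(b_v_t.shape) < sigmoid(b_v_t + p["W"].T @ h0)).astype(float)
    # 1. Update the RNN hidden state for the next time step
    h_t = np.tanh(p["W_vu"] @ v_t + p["W_uu"] @ h_prev + p["b_u"])
    return h_t, v_sample, (b_v_t, b_h_t)
```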
Results
} Samples and results for the RBM, RNN, RNN-RBM, and RNN-NADE models:
http://www-etud.iro.umontreal.ca/~boulanni/icml2012
Big picture
[Diagram: prediction RNN whose outputs parameterize the likelihood function]
} RNN-GMM (Graves, 2013)
} RNN-RBM (Boulanger-Lewandowski et al., 2012)
} RNN-NADE (Boulanger-Lewandowski et al., 2012)
} RNN-DBN (Gan et al., 2015)
More latent-variable RNNs
} Bayer and Osendorfer, 2014: Learning Stochastic Recurrent Networks
} Stochastic Gradient Variational Bayes (SGVB) to speed up training
} Krishnan, Shalit, and Sontag, 2015: Deep Kalman Filters
} Chung et al., 2015: A Recurrent Latent Variable Model for Sequential Data
Conclusions
} Generative RNNs allow sequence modeling
} Different output functions allow modeling data of different modalities
} Latent-variable RNNs allow modeling highly structured data, at the cost of runtime
References
} Graves, "Generating Sequences With Recurrent Neural Networks."
} Boulanger-Lewandowski, Bengio, and Vincent, "Modeling Temporal Dependencies in High-Dimensional Sequences."
} Boulanger-Lewandowski, Bengio, and Vincent, "High-Dimensional Sequence Transduction."
} Bayer and Osendorfer, "Learning Stochastic Recurrent Networks."
} Chung et al., "A Recurrent Latent Variable Model for Sequential Data."
} Gan et al., "Deep Temporal Sigmoid Belief Networks for Sequence Modeling."
} Bowman et al., "Generating Sentences from a Continuous Space."
} Krishnan, Shalit, and Sontag, "Deep Kalman Filters."
} Gregor et al., "DRAW."
} Kingma and Welling, "Auto-Encoding Variational Bayes."
} Larochelle and Murray, "The Neural Autoregressive Distribution Estimator."
} Brakel, Stroobandt, and Schrauwen, "Training Energy-Based Models for Time-Series Imputation."
} Goel and Vohra, "Learning Temporal Dependencies in Data Using a DBN-BLSTM."
} Fabius and van Amersfoort, "Variational Recurrent Auto-Encoders."
} Zaremba et al., "An Empirical Exploration of Recurrent Network Architectures."