
Christof Angermueller

https://cangermueller.com

cangermueller@gmail.com

@cangermueller

University of Cambridge, European Bioinformatics Institute (EBI-EMBL)

Cambridge, UK

Generative RNNs for sequence modeling

2016-01-21

Sequence modeling

•  Wanted: probability over sequences x

x ~ P(x_1, ..., x_T)

Applications (each illustrated with an example sequence X on the original slide):

•  Text translation
•  Speech recognition
•  Bioinformatics
•  Music modeling

Sequence modeling

•  Wanted: probability over sequences x

x ~ P(x_1, ..., x_T)

Models:

•  n-gram → Markov assumption
•  LDS → simple linear dynamics
•  HMM → discrete hidden state
•  RNN → non-linear transition function, continuous hidden state

The factorizations behind this comparison are sketched below.
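To make the comparison concrete, here is the underlying factorization in standard notation (a sketch, not taken verbatim from the slides): the chain rule is exact, an n-gram truncates the history (Markov assumption), and an RNN summarizes the full history in a continuous hidden state.

```latex
% Exact chain-rule factorization of the joint distribution
P(x_1, \dots, x_T) = \prod_{t=1}^{T} P(x_t \mid x_1, \dots, x_{t-1})

% n-gram: Markov assumption, only the previous n-1 symbols matter
P(x_t \mid x_1, \dots, x_{t-1}) \approx P(x_t \mid x_{t-n+1}, \dots, x_{t-1})

% RNN: a continuous hidden state h_{t-1} summarizes the entire history
P(x_t \mid x_1, \dots, x_{t-1}) \approx P(x_t \mid h_{t-1}), \qquad h_t = f_h(x_t, h_{t-1})
```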

Discriminative RNN

[Figure: inputs x_1, x_2, x_3 ("I like RNNs") feed hidden states h_1, h_2, h_3, which emit labels y_1, y_2, y_3 ("Subject Verb Object")]

P(y_1, ..., y_T | x_1, ..., x_T)

→ Requires target labels Y

Generative RNN

[Figure: inputs x_1, x_2, x_3 ("I like RNNs") feed hidden states h_1, h_2, h_3; each output y_t parameterizes the distribution over the next word: P(x_2 | y_1), P(x_3 | y_2), P(x_4 | y_3)]

Goal: x ~ P(x_1, ..., x_T)

Idea: predict the next word

h_t = f_h(W^{xh} x_t + W^{hh} h_{t-1} + b_h)
y_t = f_y(W^{hy} h_t + b_y)

The output y_t parameterizes P(x_{t+1} | y_t); the product of these conditionals is the likelihood, which serves as the loss function.

•  Training via BPTT
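As a minimal, runnable sketch of these update equations and the next-word loss (plain numpy, with tanh and softmax as hypothetical choices for f_h and f_y; the sizes are illustrative, not the model from the talk):

```python
import numpy as np

V, H = 50, 32                              # hypothetical vocabulary and hidden sizes
rng = np.random.default_rng(0)
Wxh = rng.normal(scale=0.1, size=(H, V))   # input-to-hidden weights W^{xh}
Whh = rng.normal(scale=0.1, size=(H, H))   # hidden-to-hidden weights W^{hh}
Why = rng.normal(scale=0.1, size=(V, H))   # hidden-to-output weights W^{hy}
bh, by = np.zeros(H), np.zeros(V)

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

def neg_log_likelihood(x):
    """x: list of word ids; returns -log P(x_2, ..., x_T | x_1), the training loss."""
    h = np.zeros(H)
    nll = 0.0
    for t in range(len(x) - 1):
        x_onehot = np.zeros(V); x_onehot[x[t]] = 1.0
        h = np.tanh(Wxh @ x_onehot + Whh @ h + bh)   # h_t = f_h(W^{xh} x_t + W^{hh} h_{t-1} + b_h)
        y = softmax(Why @ h + by)                    # y_t parameterizes P(x_{t+1} | y_t)
        nll -= np.log(y[x[t + 1]])                   # next-word cross-entropy term
    return nll

print(neg_log_likelihood([3, 7, 1, 0]))              # toy sequence of word ids
```

In training, the gradient of this loss would be computed with BPTT; the backward pass is omitted here.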

Generating (sampling) sequences

[Figure: start from x_0 = <START> with initial hidden state h_0; at each step, sample the next word from the predicted distribution and feed it back as the next input: I ~ P(x_2 | y_1), like ~ P(x_3 | y_2), RNNs ~ P(x_4 | y_3), <STOP> ~ P(x_5 | y_4)]

•  Feed each sampled word back in as the next input
•  Stop once <STOP> is sampled
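A minimal sketch of the sampling loop just described (it reuses V, H, the weights, softmax, and rng from the previous sketch; treating word ids 0 and 1 as <START> and <STOP> is an illustrative assumption):

```python
def sample_sequence(max_len=20, start_id=0, stop_id=1):
    """Ancestral sampling: feed each sampled word back in as the next input."""
    h = np.zeros(H)
    x_t = start_id                        # x_0 = <START>
    words = []
    for _ in range(max_len):
        x_onehot = np.zeros(V); x_onehot[x_t] = 1.0
        h = np.tanh(Wxh @ x_onehot + Whh @ h + bh)
        y = softmax(Why @ h + by)         # distribution over the next word
        x_t = rng.choice(V, p=y)          # x_{t+1} ~ P(x_{t+1} | y_t)
        if x_t == stop_id:                # stop once <STOP> is sampled
            break
        words.append(x_t)
    return words

print(sample_sequence())
```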

Example: Wikipedia

Train on Wikipedia text, then generate new text (training and generated samples shown on the original slide).

•  x_t are characters instead of words!
•  Fewer parameters

char-RNN
https://github.com/karpathy/char-rnn
http://karpathy.github.io/2015/05/21/rnn-effectiveness/

Example: Linux source code

Conditional language model

P(x_1, ..., x_T | z_1, ..., z_L)

[Figure: an encoder RNN reads z_1, z_2, z_3 into hidden states h̃_1, h̃_2, h̃_3; a decoder RNN generates x_1, x_2, x_3 with hidden states h_1, h_2, h_3 and outputs y_1, y_2, y_3]

Encoder / Decoder: initialize the decoder with the last encoder hidden state.
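A minimal sketch of this coupling (it reuses the decoder weights and helpers from the earlier sketches; the encoder weight names Wzh, Whh_enc and the source vocabulary size Vz are hypothetical):

```python
Vz = 40                                        # hypothetical source vocabulary size
Wzh = rng.normal(scale=0.1, size=(H, Vz))      # encoder input-to-hidden weights
Whh_enc = rng.normal(scale=0.1, size=(H, H))   # encoder hidden-to-hidden weights
bh_enc = np.zeros(H)

def encode(z):
    """Run the encoder RNN over z_1, ..., z_L and return its last hidden state."""
    h = np.zeros(H)
    for z_t in z:
        z_onehot = np.zeros(Vz); z_onehot[z_t] = 1.0
        h = np.tanh(Wzh @ z_onehot + Whh_enc @ h + bh_enc)
    return h

def conditional_sample(z, max_len=20, start_id=0, stop_id=1):
    """Sample x ~ P(x_1, ..., x_T | z_1, ..., z_L)."""
    h = encode(z)                    # initialize the decoder with the last encoder hidden state
    x_t, words = start_id, []
    for _ in range(max_len):
        x_onehot = np.zeros(V); x_onehot[x_t] = 1.0
        h = np.tanh(Wxh @ x_onehot + Whh @ h + bh)
        y = softmax(Why @ h + by)
        x_t = rng.choice(V, p=y)
        if x_t == stop_id:
            break
        words.append(x_t)
    return words

print(conditional_sample([5, 2, 9]))             # toy source sequence of ids
```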

Example: Shakespeare

Why, Salisbury must find his flesh and thought That which I am not aps, not a man and in fire, To show the reining of the raven and the wars To grace my hand reproach within, and not a fair are hand, That Caesar and my goodly father's world; When I was heaven of presence and our fleets, We spare with hours, but cut thy council I am great, Murdered and by thy master's ready there My power to give thee but so much as hell: Some service in the noble bondman here, Would show him to her wine. O, if you were a feeble sight, the courtesy of your law, Your sight and several breath, will wear the gods With his heads, and my hands are wonder'd at the deeds, So drop upon your lordship's head, and your opinion Shall be against your honour.

Image caption generation

Hand-writing generation

[Figure: handwriting of "He dismissed the idea"]

Hand-writing generation

•  x_{t,1}, x_{t,2}: spatial position
•  s_t: pen state (0 = up, 1 = down)

x_t = (x_{t,1}, x_{t,2}, s_t)

Challenges

•  Multi-dimensional output
•  Multi-modal, with correlation between x_{t,1} and x_{t,2}
•  Cannot be represented by a simple output function such as

y_t = σ(W^{hy} h_t + b_y)

Solution: Mixture Density Network (MDN)

The output vector parameterizes a conditional probability distribution:

→ the RNN predicts the parameters of a Gaussian Mixture Model (GMM); a sketch follows.
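A minimal sketch of such a mixture density output layer for the handwriting case (self-contained numpy; the sizes, weight names, and the diagonal-covariance simplification are assumptions of this sketch, not the exact parameterization of Graves, 2013):

```python
import numpy as np

rng = np.random.default_rng(0)
H, K = 32, 5                             # hidden units and mixture components (hypothetical)
# One linear layer maps h_t to all parameters:
# K mixture weights + 2K means + 2K std devs + 1 pen-state logit
Wmdn = rng.normal(scale=0.1, size=(5 * K + 1, H))
bmdn = np.zeros(5 * K + 1)

def mdn_params(h):
    """Map a hidden state h_t to the parameters of P(x_{t+1} | h_t)."""
    out = Wmdn @ h + bmdn
    pi = np.exp(out[:K]); pi /= pi.sum()             # mixture weights (softmax)
    mu = out[K:3 * K].reshape(K, 2)                  # component means for (x_{t,1}, x_{t,2})
    sigma = np.exp(out[3 * K:5 * K]).reshape(K, 2)   # std devs, kept positive via exp
    p_pen = 1.0 / (1.0 + np.exp(-out[-1]))           # Bernoulli pen-down probability
    return pi, mu, sigma, p_pen

def mdn_sample(h):
    """Draw one (x_{t,1}, x_{t,2}, s_t) triple from the predicted mixture."""
    pi, mu, sigma, p_pen = mdn_params(h)
    k = rng.choice(K, p=pi)                          # pick a mixture component
    x1, x2 = rng.normal(mu[k], sigma[k])             # sample the pen position
    pen = int(rng.random() < p_pen)                  # sample the pen state
    return x1, x2, pen

print(mdn_sample(rng.normal(size=H)))
```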

[Figure: generated handwriting samples, e.g. "More of national temperament"]

Can we choose a writing style?

Conditional language model

[Figure: encoder RNN (inputs z_1, z_2, z_3, hidden states h̃_1, h̃_2, h̃_3) conditioning a decoder RNN (inputs x_1, x_2, x_3, hidden states h_1, h_2, h_3, outputs y_1, y_2, y_3); a seed sequence pair and a target sequence pair are shown, illustrated with "He dismissed the idea" and "I love RNNs"]

http://www.cs.toronto.edu/~graves/handwriting.html

Same idea: Chinese characters

http://blog.otoro.net/2015/12/28/recurrent-net-dreams-up-fake-chinese-characters-in-vector-format-with-tensorflow/

Polyphonic music modeling

[Figure: piano-roll representation with time on one axis and a binary note vector x_t on the other]

Challenges

1.  Correlation along time
2.  Correlation between notes
    •  High-dimensional, multi-modal output (see the factorization sketched after this list)
3.  Time-dependent, non-local factors of variation
    •  Theme, tune, chord progression, ...
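To make challenges 1 and 2 concrete (a standard way of writing it, not taken verbatim from the slides): each time step is a binary vector over N notes, and a plain per-note sigmoid output would treat the notes as conditionally independent, ignoring which notes tend to sound together.

```latex
% Correlation along time: autoregressive factorization over time steps
P(x_1, \dots, x_T) = \prod_{t=1}^{T} P(x_t \mid x_{<t}),
\qquad x_t \in \{0, 1\}^{N}

% Correlation between notes: a factorized sigmoid output assumes
P(x_t \mid x_{<t}) \approx \prod_{n=1}^{N} \mathrm{Bernoulli}\left(x_{t,n} \mid y_{t,n}\right)

% which cannot capture the joint structure over the 2^N note patterns;
% hence an RBM (or NADE) is used for the per-time-step distribution.
```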

RNN: correlation along time

[Figure: an RNN unrolled along the time axis of the piano roll]

RBM: correlation between notes

[Figure: an RBM over the binary note vector at a single time step]

RNN-RBM (Boulanger-Lewandowski et al., 2012)

[Figure: an RBM over the notes at each time step, combined with an RNN along time: RNN + RBM]

RBM 101

Likelihood:

P(v) = Σ_h P(v, h) = (1/Z) Σ_h exp(−E(v, h))

•  Conditional independence → inference of P(h | v) is easy
•  Intractable partition function Z → sampling v ~ P(v) is intractable → Contrastive Divergence (CD) approximation
•  Learning requires sampling
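A minimal sketch of one CD-1 update for a binary RBM (self-contained numpy; the sizes are hypothetical): infer P(h | v) exactly, take one Gibbs step to get a reconstruction, and use the difference of the two statistics as the gradient estimate.

```python
import numpy as np

rng = np.random.default_rng(0)
Nv, Nh = 88, 50                          # visible units (e.g. notes) and hidden units
W = rng.normal(scale=0.01, size=(Nh, Nv))
bv, bh = np.zeros(Nv), np.zeros(Nh)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def cd1_step(v0):
    """One Contrastive Divergence (CD-1) gradient estimate for a binary RBM."""
    # Positive phase: P(h | v) factorizes over hidden units, so inference is easy
    ph0 = sigmoid(W @ v0 + bh)
    h0 = (rng.random(Nh) < ph0).astype(float)
    # Negative phase: one Gibbs step back to the visibles and up again
    pv1 = sigmoid(W.T @ h0 + bv)
    v1 = (rng.random(Nv) < pv1).astype(float)
    ph1 = sigmoid(W @ v1 + bh)
    # Gradient estimate: data statistics minus reconstruction statistics
    dW = np.outer(ph0, v0) - np.outer(ph1, v1)
    dbv = v0 - v1
    dbh = ph0 - ph1
    return dW, dbv, dbh, v1              # v1 is the CD sample of the visibles

v = (rng.random(Nv) < 0.1).astype(float) # toy binary note vector
dW, dbv, dbh, v_sample = cd1_step(v)
```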

Sampling

RNN + RBM, at each time step:

1.  Update the RNN hidden state
2.  Predict the RBM bias terms
3.  Sample v(t) from the RBM via CD

Learning

RNN + RBM, at each time step:

1.  Propagate h'(t−1)
2.  Predict the bias terms b_h(t) and b_v(t)
3.  Sample v(t) from the RBM via CD

P(v) = Σ_h P(v, h) = (1/Z) Σ_h exp(−E(v, h))

4.  Estimate the gradients
5.  Back-propagate via BPTT
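A minimal sketch of one such RNN-RBM time step, tying the pieces together (it reuses W, bv, bh, sigmoid, and rng from the CD sketch above; the coupling weight names Wuv, Wuh, Wvu, Wuu are hypothetical, and BPTT itself is omitted):

```python
Nu = 64                                           # RNN hidden units (hypothetical)
Wuv = rng.normal(scale=0.01, size=(Nv, Nu))       # RNN state -> visible bias
Wuh = rng.normal(scale=0.01, size=(Nh, Nu))       # RNN state -> hidden bias
Wvu = rng.normal(scale=0.01, size=(Nu, Nv))       # observation -> RNN state
Wuu = rng.normal(scale=0.01, size=(Nu, Nu))       # RNN state -> RNN state
bu = np.zeros(Nu)

def rnn_rbm_step(v_t, u_prev):
    """One time step of the training loop: biases from the RNN, CD on the RBM."""
    # Steps 1-2: the propagated RNN state u(t-1) predicts the bias terms b_v(t), b_h(t)
    bv_t = bv + Wuv @ u_prev
    bh_t = bh + Wuh @ u_prev
    # Step 3: one CD step for the RBM at this time step, using the predicted biases
    ph0 = sigmoid(W @ v_t + bh_t)
    h0 = (rng.random(Nh) < ph0).astype(float)
    pv1 = sigmoid(W.T @ h0 + bv_t)
    v1 = (rng.random(Nv) < pv1).astype(float)
    ph1 = sigmoid(W @ v1 + bh_t)
    # Step 4: gradient estimate (data minus reconstruction statistics)
    dW = np.outer(ph0, v_t) - np.outer(ph1, v1)
    # Update the RNN state from the observation; in step 5 gradients would
    # flow back through this recurrence via BPTT
    u_t = np.tanh(Wvu @ v_t + Wuu @ u_prev + bu)
    return u_t, dW

u = np.zeros(Nu)
for v_t in [(rng.random(Nv) < 0.1).astype(float) for _ in range(4)]:   # toy piano roll
    u, dW = rnn_rbm_step(v_t, u)
```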

Results

[Figure: comparison of RBM, RNN, RNN-RBM, and RNN-NADE]

http://www-etud.iro.umontreal.ca/~boulanni/icml2012

Big picture

[Figure: an RNN with inputs x_1, x_2, x_3, hidden states h_1, h_2, h_3, and outputs y_1, y_2, y_3; the RNN prediction parameterizes the likelihood function]

•  RNN-GMM (Graves, 2013)
•  RNN-RBM (Boulanger-Lewandowski, 2012)
•  RNN-NADE (Boulanger-Lewandowski, 2012)
•  RNN-DBN (Gan et al., 2015)

More latent-variable RNNs

•  Bayer and Osendorfer, 2014: Learning Stochastic Recurrent Networks
   •  Stochastic Gradient Variational Bayes (SGVB) to speed up training
•  Krishnan, Shalit, and Sontag, 2015: Deep Kalman Filters
•  Chung et al., 2015: A Recurrent Latent Variable Model for Sequential Data

Conclusions

•  Generative RNNs enable sequence modeling
•  Different output functions make it possible to model data of different modalities
•  Latent-variable RNNs make it possible to model highly structured data, at the cost of runtime

References

•  Graves, "Generating Sequences With Recurrent Neural Networks."
•  Boulanger-Lewandowski, Bengio, and Vincent, "Modeling Temporal Dependencies in High-Dimensional Sequences."
•  Boulanger-Lewandowski, Bengio, and Vincent, "High-Dimensional Sequence Transduction."
•  Bayer and Osendorfer, "Learning Stochastic Recurrent Networks."
•  Chung et al., "A Recurrent Latent Variable Model for Sequential Data."
•  Gan et al., "Deep Temporal Sigmoid Belief Networks for Sequence Modeling."
•  Bowman et al., "Generating Sentences from a Continuous Space."

References (recommended)

•  Krishnan, Shalit, and Sontag, "Deep Kalman Filters."
•  Gregor et al., "DRAW."
•  Kingma and Welling, "Auto-Encoding Variational Bayes."
•  Larochelle and Murray, "The Neural Autoregressive Distribution Estimator."
•  Brakel, Stroobandt, and Schrauwen, "Training Energy-Based Models for Time-Series Imputation."
•  Goel and Vohra, "Learning Temporal Dependencies in Data Using a DBN-BLSTM."
•  Fabius and van Amersfoort, "Variational Recurrent Auto-Encoders."
•  Jozefowicz, Zaremba, and Sutskever, "An Empirical Exploration of Recurrent Network Architectures."