Recurrent Neural Networks
NN Reading Group - Marco Pedersoli & Thomas Lucas


Page 1: Recurrent Neural Networks

NN Reading Group
Marco Pedersoli & Thomas Lucas

Page 2: Outline

● Motivation: why and when RNNs are useful
● Formal definition of an RNN
● Backpropagation Through Time
● Vanishing Gradients
● Improved RNNs
● Regularization for RNNs
● Teacher Forcing Training

Page 3: Introduction

Building block: y = h(Wx)

W = linear map (can also be convolutional, low-rank, etc.)

h = element-wise non-linear operation (tanh, ReLU, etc.)
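
As a toy illustration (a hedged sketch; the shapes and the tanh choice are just an example, not taken from the slides):

    import numpy as np

    def layer(W, x, h=np.tanh):
        # one building block: a linear map followed by an element-wise non-linearity
        return h(W @ x)

    W = np.random.randn(4, 3)   # maps a 3-dimensional input to a 4-dimensional output
    x = np.random.randn(3)
    y = layer(W, x)             # y = h(Wx), shape (4,)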

Page 4: Multilayer Network

Same input and output, but hidden representations: more powerful!

Composition of building blocks

Can still learn parameters with backpropagation

Page 5: Sequential Input/Output

So far x, y are fixed-size vectors --> a limiting factor!

We want inputs and outputs with a more complex structure, e.g. variable-length sequences!

Can we reuse the same building block?

[Figure example: "How are you?" with per-word tags (OB) (VB) (SB)]

Page 6: Repeated Network

+ Works!
+ Scales well with data, because it reuses the same parameters!

- No context: every decision is taken independently!

[Figure: the same network applied to each word: How → (OB), are → (VB), you → (SB)]

Page 7: Convolutional Network

+ Works!
+ Scales well with data, because the same parameters are reused!

+ Context: every decision also depends on the neighbours!

- The context has a fixed, predefined structure!

- Does not scale if we want long-range, sparse correlations!

[Figure: windows of neighbouring words ("How are", "How are you", "are you") mapped to (OB), (VB), (SB)]

Page 8: Recurrent Network

+ Works!
+ Scales well with data, because the same parameters are reused!

+ The state vector can summarize all the past, giving global context with long-term dependencies!

+ Same optimization as before, with backpropagation

[Figure: a recurrent network mapping "How are you" to (OB), (VB), (SB)]

Page 9: RNN topology: Many to Many (Coupled)

+ Works!
+ Scales well with data, because the same parameters are reused!

+ The state vector can summarize all the past, giving global context with long-term dependencies!

+ Same optimization as before, with backpropagation

[Figure: one output per input word: How → (OB), are → (VB), you → (SB)]

Page 10: RNN topology: Many to One

Classification problems with variable input size.

E.g.:
- Classify the category of a sentence
- Sentiment classification on an audio track

[Figure: the whole sequence "How are you" mapped to a single output, "Question"]

Page 11: RNN topology: One to Many

Generation of a sequence given an initial state.

E.g.:
- Generate a sentence describing an image
- Generate a sentence given the first word

[Figure: an initial state unrolled into the sequence "How are you today"]

Page 12: RNN topology: Many to Many (Encoder/Decoder)

Generation of a sequence given a sequence of a different length.

E.g.:
- Translate a sentence
- Describe a movie

[Figure: "How are you" encoded, then decoded into "Comment ça va"]

Page 13: RNN graphical representation

Page 14: Forward propagation in a single neuron

Computation inside a single hidden neuron, denoted h.

Page 15: A network with one neuron per layer

Page 16: A network with one neuron per layer

Page 17: Making the network recurrent

Page 18: Unfolding the graph in time

Page 19: Unfolding the graph in time

Page 20: Unfolding the graph in time

Page 21: A recurrent network with one hidden layer

Page 22: Unfolding the graph

Page 23: Unfolding the graph

Page 24: Unfolding the graph

Page 25: Equations of the recurrence
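
The slide itself shows the equations as a figure; as a reference, here is a hedged numpy sketch of the standard recurrence (the weight names W_xh, W_hh, W_hy and the tanh non-linearity are assumptions chosen to match the usual formulation, not necessarily the exact notation of the slide):

    import numpy as np

    def rnn_forward(xs, h0, W_xh, W_hh, W_hy, b_h, b_y):
        # h_t = tanh(W_xh x_t + W_hh h_{t-1} + b_h);  y_t = W_hy h_t + b_y
        h, hs, ys = h0, [], []
        for x in xs:
            h = np.tanh(W_xh @ x + W_hh @ h + b_h)   # new state from input and previous state
            hs.append(h)
            ys.append(W_hy @ h + b_y)                # output read out from the state
        return hs, ys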

Page 26: The unfolded graph is a feed-forward graph

Page 27: Recap

Page 28: Recap

Page 29: Unfolding the graph in time

Page 30: Back-propagation in a recurrent network

Page 31: Equations of the back-propagation

[Equations shown for the last hidden layer and for all other hidden layers]

Page 32: Equations of the back-propagation
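
The equations are again in the figure; here is a hedged sketch of backpropagation through time for the tanh recurrence above (assuming h_0 = 0 and reusing the same hypothetical weight names):

    import numpy as np

    def bptt(xs, hs, dys, W_xh, W_hh, W_hy):
        # xs: inputs, hs: hidden states from the forward pass, dys[t]: dL/dy_t
        dW_xh, dW_hh, dW_hy = (np.zeros_like(W) for W in (W_xh, W_hh, W_hy))
        dh_next = np.zeros_like(hs[0])               # gradient flowing back from the future
        for t in reversed(range(len(xs))):
            dW_hy += np.outer(dys[t], hs[t])
            dh = W_hy.T @ dys[t] + dh_next           # at the last step there is no future term
            dz = dh * (1.0 - hs[t] ** 2)             # back through tanh
            dW_xh += np.outer(dz, xs[t])
            h_prev = hs[t - 1] if t > 0 else np.zeros_like(hs[0])
            dW_hh += np.outer(dz, h_prev)
            dh_next = W_hh.T @ dz                    # recurrent term: one more W_hh^T per step back
        return dW_xh, dW_hh, dW_hy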

Page 33: The vanishing gradient problem

Figure credit: Alex Graves

Page 34: A simple case of vanishing gradient

Page 35: A sufficient condition for gradients to vanish

Page 36: A sufficient condition for gradients to vanish

Page 37: A sufficient condition for gradients to vanish

Page 38: A sufficient condition for gradients to vanish
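
The derivation is carried in the figures; the condition usually stated in the literature (e.g. Pascanu et al. 2013), summarized here as a hedged reference, bounds each Jacobian of the recurrence:

    \left\lVert \frac{\partial h_{t+1}}{\partial h_t} \right\rVert
      \;\le\; \lVert W_{hh}^{\top} \rVert \,
              \bigl\lVert \mathrm{diag}\bigl(h'(z_t)\bigr) \bigr\rVert
      \;\le\; \sigma_{\max}(W_{hh}) \, \gamma

If sigma_max(W_hh) < 1/gamma, where gamma bounds |h'| (gamma = 1 for tanh, 1/4 for the logistic sigmoid), then the product of Jacobians, and with it the gradient contribution from a timestep k steps in the past, shrinks geometrically with k: that is the vanishing-gradient regime.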

Page 39: Improved RNNs

● Gated units
  ○ Long Short-Term Memory (LSTM) RNN
  ○ Gated Recurrent Unit (GRU)

● Skip connections
  ○ Clockwork RNN
  ○ Hierarchical Multiscale RNN

Page 40: LSTM recurrent networks

[Based on Christopher Olah's blog]

Page 41: LSTM recurrent networks

The long-term dependencies problem

Page 42: Inside a standard RNN

Page 43: Inside an LSTM

Page 44: Gated units

[Figure: the forget, input, and output gates of an LSTM cell]

Page 45: Gated units

[Figure: standard RNN cell vs. LSTM cell]

Page 46: Gated units

[Figure: LSTM cell]

Page 47: Gated units

Gated Recurrent Unit

Page 48: Gated units

Gated Recurrent Unit

Page 49: Gated units

Gated Recurrent Unit

Page 50: LSTM equations

Figure credit: Alex Graves
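
The equations appear in a figure; here is a hedged numpy sketch of one common LSTM formulation, not necessarily the exact variant shown (for instance, it omits the peephole connections of some formulations). The dictionary-based parameter layout is just for the example:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def lstm_step(x, h_prev, c_prev, W, U, b):
        # W, U, b are dicts of per-gate input weights, recurrent weights and biases
        i = sigmoid(W['i'] @ x + U['i'] @ h_prev + b['i'])   # input gate
        f = sigmoid(W['f'] @ x + U['f'] @ h_prev + b['f'])   # forget gate
        o = sigmoid(W['o'] @ x + U['o'] @ h_prev + b['o'])   # output gate
        g = np.tanh(W['g'] @ x + U['g'] @ h_prev + b['g'])   # candidate cell update
        c = f * c_prev + i * g            # cell state: the long-term memory path
        h = o * np.tanh(c)                # hidden state exposed to the rest of the network
        return h, c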

Page 51: GRU equations
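
And the GRU in the same hedged style (the gate names z, r and the parameter layout are assumptions; note that conventions differ on whether z gates the old or the new state):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def gru_step(x, h_prev, W, U, b):
        # W, U, b hold per-gate parameters for the update (z), reset (r) and candidate (h) paths
        z = sigmoid(W['z'] @ x + U['z'] @ h_prev + b['z'])               # update gate
        r = sigmoid(W['r'] @ x + U['r'] @ h_prev + b['r'])               # reset gate
        h_tilde = np.tanh(W['h'] @ x + U['h'] @ (r * h_prev) + b['h'])   # candidate state
        return (1.0 - z) * h_prev + z * h_tilde   # interpolate between old and candidate state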

Page 52: Clockwork RNN

[Koutnik et al. 2014]

- Faster than a standard RNN
- Handles long-term dependencies better

[Figure: unrolled Clockwork RNN at t = 6]
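
A hedged sketch of the update schedule only (the real Clockwork RNN also uses a block-structured recurrent matrix so that fast modules read from slower ones; the module periods and sizes here are illustrative):

    import numpy as np

    def clockwork_active_mask(t, periods, units_per_module):
        # at timestep t, only modules whose period divides t are updated;
        # the other modules simply copy (keep) their previous state
        mask = []
        for T in periods:
            mask.extend([t % T == 0] * units_per_module)
        return np.array(mask)

    # e.g. exponential periods 1, 2, 4, 8 with 3 units each: at t = 6 only the
    # period-1 and period-2 modules are active, the slower ones keep their state
    print(clockwork_active_mask(6, periods=[1, 2, 4, 8], units_per_module=3))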

Page 53: Clockwork RNN

Page 54: Hierarchical Multiscale RNN

[Chung et al. 2016]

Use prior knowledge about the data to define a hierarchy of representations:
- h1 represents characters
- h2 represents words
- h3 represents sentences

UPDATE: standard RNN step
COPY: h is copied to the next timestep
FLUSH: state passed to the upper layer

Page 55: Hierarchical Multiscale RNN

Learn when to flush and update from data

Page 56: Hierarchical Multiscale RNN

Optimization problem: z is discrete!

- Sampling: REINFORCE
- Soft: a softmax over z
- Step function: straight-through estimator
  - forward pass: step function
  - backward pass: hard sigmoid
- Step function & annealing: slope annealed from 1 to 5
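
A hedged numpy illustration of the straight-through estimator idea for the binary boundary variable z (the function names and the slope parameter are just for the example; a real implementation would hook this into autograd):

    import numpy as np

    def hard_sigmoid(x, slope=1.0):
        # piecewise-linear surrogate; the slope can be annealed (e.g. from 1 to 5)
        return np.clip(0.5 * slope * x + 0.5, 0.0, 1.0)

    def straight_through(x, slope=1.0):
        # forward pass: a hard binary decision z
        soft = hard_sigmoid(x, slope)
        z = (soft > 0.5).astype(float)
        # backward pass: pretend z was the hard sigmoid, so the surrogate gradient
        # is slope/2 inside the linear region and 0 where the clip saturates
        dz_dx = np.where((soft > 0.0) & (soft < 1.0), 0.5 * slope, 0.0)
        return z, dz_dx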

Page 57: Hierarchical Multiscale RNN

IAM-OnDB dataset: handwriting data

Predict the (x, y) coordinates of strokes

Page 58: Regularization in RNNs

Standard Dropout in recurrent layers does not work well because it causes loss of long-term memory!

- Dropout in input-to-hidden and hidden-to-output [Zaremba et al. 2014]

- Apply dropout at the sequence level (same zeroed units for the entire sequence) [Gal 2016] (sketched after this list)

- Dropout only at the cell update (for LSTM and GRU units) [Semeniuta et al. 2016]

- Enforcing norm of the hidden state to be similar along time [Krueger & Memisevic 2016]

- Zoneout some hidden units (copy their state to the next timestep) [Krueger et al. 2016]
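
A hedged sketch of the sequence-level (variational) dropout from the [Gal 2016] bullet: the dropout mask is sampled once per sequence and reused at every timestep, instead of being resampled at each step. The function name and the usage below are illustrative only:

    import numpy as np

    def sequence_dropout_mask(hidden_size, p_drop, rng=np.random.default_rng()):
        # sampled once per sequence, then applied to the hidden state at every timestep
        keep = (rng.random(hidden_size) >= p_drop).astype(float)
        return keep / (1.0 - p_drop)     # inverted-dropout scaling

    # usage inside the recurrence (reusing the hypothetical names from the earlier sketch):
    #   mask = sequence_dropout_mask(len(h), 0.25)
    #   for x in xs:
    #       h = np.tanh(W_xh @ x + W_hh @ (h * mask) + b_h)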

Page 59: RNN Regularization

Similar to dropout: instead of dropping out hidden units, zoneout "zones out" hidden units (their previous state is copied to the next timestep)!
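
A hedged per-step sketch of zoneout [Krueger et al. 2016]: with probability p a hidden unit keeps its previous value instead of taking the new one; at test time the two are mixed with the same expectation. The value p = 0.15 is just an example:

    import numpy as np

    def zoneout(h_prev, h_new, p=0.15, training=True, rng=np.random.default_rng()):
        if training:
            keep_old = rng.random(h_new.shape) < p    # per-unit: copy the old state
            return np.where(keep_old, h_prev, h_new)
        return p * h_prev + (1.0 - p) * h_new         # expected update at test time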

Page 60: Teacher Forcing for sequence prediction

Training: teacher forcing. Test: sampling from Y.

- Inference and training do not match
- Training does not explore outside the training data
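
A hedged, framework-free sketch of the two regimes ('model' stands for any hypothetical one-step predictor returning the next token; it is not from the slides):

    def run_decoder(model, x0, steps, teacher_tokens=None):
        # teacher forcing (training): feed the ground-truth token back in;
        # free running (test): feed the model's own previous prediction back in,
        # which is exactly where the train/test mismatch comes from
        x, outputs = x0, []
        for t in range(steps):
            y = model(x)
            outputs.append(y)
            x = teacher_tokens[t] if teacher_tokens is not None else y
        return outputs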

Page 61: Scheduled Sampling

[Bengio et al. 2015]

Slowly move from teacher forcing to sampling.

[Figure: decay schedules for the probability of sampling from the ground truth]
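
One common form of the decay, the inverse sigmoid schedule from the Scheduled Sampling paper, shown as a hedged sketch (k is a tunable decay constant; k = 1000 is just an example value):

    import numpy as np

    def ground_truth_prob(i, k=1000.0):
        # epsilon_i = k / (k + exp(i / k)): close to 1 early in training
        # (mostly teacher forcing), decaying towards 0 (mostly sampling)
        return k / (k + np.exp(i / k))

    # with k = 1000 the probability crosses 0.5 around step k * ln(k), roughly 6900
    use_ground_truth = np.random.rand() < ground_truth_prob(i=5000)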

Page 62: Scheduled Sampling

- Baseline: based on Google NIC
- Baseline with dropout: see the part about regularization of RNNs
- Always Sampling: sampling is used from the beginning of training
- Scheduled Sampling: the proposed approach, with an inverse sigmoid decay
- Uniform Scheduled Sampling: scheduled sampling, but sampling Y uniformly

Results on the Microsoft COCO development set.

Page 63: Sequence Level Training

[Ranzato et al. 2016]

The training objective is different from the test-time one:
- Training: generate the next word given the previous ones
- Test: generate the entire sequence given an initial state

Directly optimize the evaluation metric (e.g. the BLEU score for sentence generation).

Cast the problem as Reinforcement Learning:
- The RNN is an agent
- The policy is defined by the learned parameters
- An action is the selection of the next word, based on the policy
- The reward is the evaluation metric
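
A hedged sketch of the resulting REINFORCE-style weighting (a simplification; MIXER additionally keeps a cross-entropy loss on the first part of the sequence and uses a learned baseline):

    import numpy as np

    def reinforce_surrogate(log_probs_of_sampled_words, reward, baseline=0.0):
        # scaling the log-likelihood of the sampled sequence by the (baselined)
        # sequence-level reward (e.g. BLEU) gives a surrogate objective whose
        # gradient is the REINFORCE estimate of the expected reward's gradient
        return (reward - baseline) * np.sum(log_probs_of_sampled_words)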

Page 64: Comparison of different training approaches

- XENT = cross-entropy loss: the standard approach, but it suffers from exposure bias
- DAD = Scheduled Sampling: the loss is always based on fixed target labels, which may not be aligned with the sentence generated so far
- E2E = end-to-end backprop: select the top-k outputs and renormalize; uses a schedule between target labels and the top-k outputs
- MIXER = sequence-level training: optimize at the sequence level; start from a model trained with cross entropy and use REINFORCE for the last K steps of the sequence

Page 65: Further reading

● Other topologies
  ○ Bidirectional RNN
  ○ Recursive RNN

● Applications
  ○ RNN for text generation
  ○ RNN for image captioning
  ○ RNN for language translation
  ○ Pixel RNN for image generation

● RNN + Attention models
  ○ DRAW for image generation
  ○ Encode + Review + Decode

Page 66-67: Planning notes

Outline:
- Motivation: why we need RNNs / intuition of how they work. (for the moment I will use old illustrations, then I will try to use yours...)
- Equations and notations (+ derivation of backprop?)
- RNN topologies
- Intuitions of why the gradients vanish.
- Sufficient condition for vanishing gradient.
- Solutions to the vanishing gradient problem.
  - Gated units (LSTM/GRU). (some equations)
  - Clockwork RNN.
  - Hierarchical Multiscale RNN (http://arxiv.org/pdf/1609.01704v2.pdf).
- Teacher forcing
  - Scheduled sampling
  - Optimization on BLEU scores
- RNNs with bells and whistles
  - Regularisation of RNNs (zone-out)
  - Regularisation on the norm
- Applications in “recent” research.
  - Karpathy text generation (reuse master thesis material?)
  - Language translation.
    - NEURAL MACHINE TRANSLATION BY JOINTLY LEARNING TO ALIGN AND TRANSLATE
    - On the Properties of Neural Machine Translation: Encoder–Decoder Approaches
  - Vinyals NIC.
- Cool stuff that we don’t present
  - Attention mechanism (DRAW networks)
  - Pixel RNN
  - ...