An Empirical Evaluation of GenericConvolutional and Recurrent Networks for
Sequence Modeling
Shaojie Bai, J. Zico Kolter, and Vladlen Koltun
Presented by Rachel Draelos
December 7, 2018
Bai et al.
Outline: Introduction, Model, Experiments
Introduction
Question: are CNNs or RNNs better for sequence modeling?

Claims:
- A generic temporal convolutional network (TCN) architecture outperforms canonical RNN models across a variety of sequence modeling tasks
- TCNs have longer memory than RNNs
Background
Inspired by prior work showing that certain convolutional architectures achieve good performance on sequence tasks:
- Audio synthesis: WaveNet (van den Oord et al. 2016)
- Word-level language modeling: gated conv nets (Dauphin et al. 2017)
- Machine translation: ByteNet (Kalchbrenner et al. 2016), ConvS2S (Gehring et al. 2017)
Generic Sequence Modeling Task
- Given an input sequence x0, ..., xT, the goal is to predict outputs y0, ..., yT at each time step.
- A sequence modeling network f : X^(T+1) → Y^(T+1) produces the mapping ŷ0, ..., ŷT = f(x0, ..., xT).
- f must satisfy the causality constraint: yt depends only on x0, ..., xt and not on any future inputs xt+1, ..., xT.
- Sequence modeling goal: find a network f that minimizes the expected loss between actual outputs and predictions, L(y0, ..., yT, f(x0, ..., xT)).
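The setup above can be sketched in a few lines of Python (an illustrative toy, not from the paper: squared error stands in for the loss, and the causality constraint is enforced by passing only the prefix x_0..x_t to the model at each step):

```python
# Sketch of the generic sequence modeling setup: a causal model maps
# x_0..x_T to predictions yhat_0..yhat_T, scored by a per-step loss.
def predict(x, f_step):
    # yhat_t may use only x_0..x_t: the model sees just the prefix.
    return [f_step(x[: t + 1]) for t in range(len(x))]

def sequence_loss(y, yhat):
    # Mean squared error over the whole output sequence (illustrative choice).
    return sum((a - b) ** 2 for a, b in zip(y, yhat)) / len(y)

# Example causal model: running mean of the inputs seen so far.
f_step = lambda prefix: sum(prefix) / len(prefix)
x = [1.0, 2.0, 3.0]
yhat = predict(x, f_step)   # [1.0, 1.5, 2.0]
```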
Temporal Convolutional Network (TCN)
“Not a truly new architecture” - rather, a description of a familyof architectures.
Characteristics: a 1-D fully convolutional network with:
(1) Causal convolutions
(2) Input length = output length
(3) Dilated convolutions
(4) Residual connections
(1) Causal convolutions & (2) Input length = output length
Causal convolutions: no information leakage from future to past. An output at time t is the result of a convolution that uses only elements from time t and earlier in the previous layer.

Input length = output length: each hidden layer has the same length as the input layer, via zero padding of length (kernel size − 1).
(3) Dilated convolutions
The filter is applied with gaps between its elements, which enables a much larger receptive field.
For a 1-D sequence input x ∈ Rn and a filterf : {0, ..., k − 1} → R, the dilated convolution operation F onelement s of the sequence is:
F(s) = (x ∗d f)(s) = ∑_{i=0}^{k−1} f(i) · x_{s−d·i}    (1)
where d is the dilation factor, k is the filter size, and s− d · iaccounts for the direction of the past.
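A direct pure-Python transcription of Eq. (1), treating out-of-range indices as zeros (i.e. implicit causal padding); the variable names mirror the formula:

```python
def dilated_causal_conv(x, f, d):
    # F(s) = sum_{i=0}^{k-1} f(i) * x[s - d*i]; indices s - d*i < 0 read
    # implicit zero padding, so the output is causal and length-preserving.
    k = len(f)
    out = []
    for s in range(len(x)):
        acc = 0.0
        for i in range(k):
            j = s - d * i            # s - d*i looks d*i steps into the past
            if j >= 0:
                acc += f[i] * x[j]
        out.append(acc)
    return out

x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
f = [0.5, 0.25]                         # filter size k = 2
print(dilated_causal_conv(x, f, d=1))   # [0.5, 1.25, 2.0, 2.75, 3.5, 4.25]
print(dilated_causal_conv(x, f, d=2))   # [0.5, 1.0, 1.75, 2.5, 3.25, 4.0]
```

Note that both outputs have the same length as the input, and output s only reads inputs at s, s − d, ..., never the future.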
(3) Dilated convolutions
- Two ways to expand the receptive field: a larger filter size k and a larger dilation factor d.
- The effective history of a single layer is (k − 1)d.
- TCN model: increase d exponentially with the depth of the network, d = O(2^i) at level i.
Figure: (1a). Dilated causal convolution with dilation factors d = 1,2,4and filter size k = 3.
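As a quick check on these numbers, a small helper (my own, not from the paper) that counts the receptive field of a stack with dilation 2^i and one convolution at level i (the paper's residual blocks contain two convolutions each, which roughly doubles the per-level history term):

```python
# Receptive field of stacked dilated causal convolutions with filter size k
# and dilation d_i = 2**i at level i; each level contributes (k - 1) * d_i
# extra history on top of the current time step.
def receptive_field(k, num_levels):
    return 1 + sum((k - 1) * 2 ** i for i in range(num_levels))

for levels in (1, 2, 3, 8):
    print(levels, receptive_field(k=3, num_levels=levels))
# Growth is exponential in depth: 3, 7, 15, ..., 511 for k = 3.
```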
(4) Residual connections
Figure: (1b)
A TCN residual block includes two layers of dilated causal convolutions, ReLU nonlinearities, weight normalization, and spatial dropout (at each training step, a whole channel is zeroed out).
The TCN uses an additional 1×1 convolution to ensure that the element-wise addition of (residual block input ⊕ residual block output) receives tensors of the same shape.
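The role of that 1×1 convolution can be shown with a minimal sketch (pure Python; weight norm, dropout, and the dilated causal convolutions themselves are omitted, with `block` standing in for the conv stack, and all weights are illustrative placeholders):

```python
def conv1x1(x, w):
    # 1x1 convolution = the same matrix applied independently at every time
    # step, mapping c_in input channels to c_out = len(w) output channels.
    return [[sum(wi[c] * xt[c] for c in range(len(xt))) for wi in w] for xt in x]

def relu(x):
    return [[max(0.0, v) for v in xt] for xt in x]

def residual_block(x, block, w_match):
    # When block(x) changes the channel count, the 1x1 conv w_match projects
    # x to the same shape so the element-wise addition is well defined.
    out = relu(block(x))
    skip = conv1x1(x, w_match)
    return [[o + s for o, s in zip(ot, st)] for ot, st in zip(out, skip)]

# Toy example: 1 input channel -> 2 output channels.
x = [[1.0], [2.0]]                             # T = 2 time steps
block = lambda x: conv1x1(x, [[2.0], [-1.0]])  # stand-in for the conv stack
out = residual_block(x, block, w_match=[[1.0], [1.0]])
print(out)                                     # [[3.0, 1.0], [6.0, 2.0]]
```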
Comparison
- TCN model size ≈ RNN model size.
- TCNs are relatively insensitive to hyperparameter changes as long as the effective history size (i.e. receptive field) is large enough.
- For the RNNs, the authors used grid search to choose hyperparameters.
Tasks: Synthetic Data and Music
The adding problem
  Input: a sequence of length T and depth 2. Dim 1: random values in [0, 1]. Dim 2: all zeros except for two elements marked by 1.
  Objective: sum the two random values whose second dimension is marked by 1.

Sequential MNIST and P-MNIST
  Input: each 28×28 MNIST image is presented to the model as a 784×1 sequence. In P-MNIST, the order of the sequence is permuted by a fixed random permutation.
  Objective: digit classification.

Copy memory
  Input: a sequence of length T + 20. The first 10 values are chosen randomly among the digits 1-8; the middle values are 0; the last 11 values are 9 (the first 9 is a delimiter).
  Objective: generate an output of the same length that is zero everywhere except the last 10 values, where the model must repeat the 10 values it encountered at the start of the input.

Polyphonic music: JSB Chorales & Nottingham
  Input: a sequence where each element is an 88-bit binary code (one bit per piano key), with a 1 indicating a key pressed at that time.
  Objective: predict the next note.
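As an illustration of the first benchmark, a small generator for adding-problem examples (a sketch; the function name and the (value, marker) pair layout are my own):

```python
import random

def adding_problem_example(T, rng=random):
    # Depth-2 sequence: dim 1 holds uniform [0, 1] values, dim 2 is zero
    # except at two randomly chosen marker positions.
    values  = [rng.random() for _ in range(T)]
    markers = [0.0] * T
    i, j = rng.sample(range(T), 2)      # two distinct marked positions
    markers[i] = markers[j] = 1.0
    target = values[i] + values[j]      # label: sum of the two marked values
    return list(zip(values, markers)), target

seq, target = adding_problem_example(8, random.Random(0))
```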
Tasks: Language
Penn Treebank
  Data: words: 888K train, 70K valid, 79K test; vocabulary size 10K. Characters: 5,059K train, 396K valid, 446K test; alphabet size 50.
  Objective: predict the next word (word level); predict the next character (character level).

Wikitext-103
  Data: words: 103M train, 218K valid, 246K test; vocabulary size 268K.
  Objective: predict the next word.

LAMBADA
  Data: train: the full text of 2,662 novels. Test: 10K passages for which humans are good at predicting the last word only when given context. Input: ~4.6 context sentences plus 1 target sentence with its last word missing.
  Objective: predict the last word of the target sentence.

text8
  Data: characters: 90M train, 5M valid, 5M test; alphabet size 27.
  Objective: predict the next character.
Results
(Results tables: one metric where higher is better, one where lower is better.)
Results
Results: Copy Memory Task
Results: TCN vs. SoTA
Best results are highlighted in yellow (higher is better) or blue (lower is better). Note that the SoTA model may be larger than the TCN model (see the “Size” columns).
TCN Summary: Advantages/Disadvantages
+ Parallelism: a long input sequence can be processed as a whole rather than sequentially.
+ Control over receptive field / memory size: e.g. via more layers, larger dilation factors, larger filter sizes.
+ Stable gradients: avoids exploding/vanishing gradients.
+ Lower memory requirement for training: TCNs use up to a multiplicative factor less memory than gated RNNs.
− Higher memory requirement for eval/testing: TCNs need the raw sequence up to the effective history length, not just the current input xt.
− Parameter changes for domain transfer: a TCN model may perform poorly if transferred from a domain where only little memory is needed (small k and d) to a domain where longer memory is needed (large k and d).