An Empirical Evaluation of GenericConvolutional and Recurrent Networks for
Sequence Modeling
Shaojie Bai, J. Zico Kolter, and Vladlen Koltun
Presented by Rachel Draelos
December 7, 2018
Bai et al.
Outline: Introduction, Model, Experiments
Introduction
Question: are CNNs or RNNs better for sequence modeling?

Claims:
- A generic temporal convolutional network (TCN) architecture outperforms canonical RNN models across a variety of sequence modeling tasks
- TCNs have longer memory than RNNs
Background
Inspired by prior work showing that certain convolutional architectures achieve good performance on sequence tasks:
- Audio synthesis: WaveNet (van den Oord et al. 2016)
- Word-level language modeling: gated conv nets (Dauphin et al. 2017)
- Machine translation: ByteNet (Kalchbrenner et al. 2016), ConvS2S (Gehring et al. 2017)
Generic Sequence Modeling Task
- Given an input sequence x0, ..., xT, the goal is to predict outputs y0, ..., yT at each time step.
- A sequence modeling network f : X^(T+1) → Y^(T+1) produces the mapping ŷ0, ..., ŷT = f(x0, ..., xT).
- f must satisfy the causality constraint: yt depends only on x0, ..., xt and not on any future inputs xt+1, ..., xT.
- Sequence modeling goal: find a network f that minimizes the expected loss between actual outputs and predictions, L(y0, ..., yT, f(x0, ..., xT)).
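The setup above can be sketched in a few lines of Python (an illustrative toy, not from the paper: squared error stands in for the loss, and the causality constraint is enforced by passing only the prefix x_0..x_t to the model at each step):

```python
# Sketch of the generic sequence modeling setup: a causal model maps
# x_0..x_T to predictions yhat_0..yhat_T, scored by a per-step loss.
def predict(x, f_step):
    # yhat_t may use only x_0..x_t: the model sees just the prefix.
    return [f_step(x[: t + 1]) for t in range(len(x))]

def sequence_loss(y, yhat):
    # Mean squared error over the whole output sequence (illustrative choice).
    return sum((a - b) ** 2 for a, b in zip(y, yhat)) / len(y)

# Example causal model: running mean of the inputs seen so far.
f_step = lambda prefix: sum(prefix) / len(prefix)
x = [1.0, 2.0, 3.0]
yhat = predict(x, f_step)   # [1.0, 1.5, 2.0]
```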
Temporal Convolutional Network (TCN)
“Not a truly new architecture” - rather, a description of a familyof architectures.
Characteristics: a 1-D fully convolutional network with:
(1) Causal convolutions
(2) Input length = output length
(3) Dilated convolutions
(4) Residual connections
(1) Causal convolutions & (2) Input length = output length
Causal convolutions: no information leakage from future to past. An output at time t is the result of a convolution that uses only elements from time t and earlier in the previous layer.

Input length = output length: each hidden layer has the same length as the input layer, via zero padding of length (kernel size − 1).
(3) Dilated convolutions
The filter is applied with gaps between its elements, which enables a much larger receptive field.
For a 1-D sequence input x ∈ Rn and a filterf : {0, ..., k − 1} → R, the dilated convolution operation F onelement s of the sequence is:
F(s) = (x ∗d f)(s) = ∑_{i=0}^{k−1} f(i) · x_{s−d·i}    (1)
where d is the dilation factor, k is the filter size, and s− d · iaccounts for the direction of the past.
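A direct pure-Python transcription of Eq. (1), treating out-of-range indices as zeros (i.e. implicit causal padding); the variable names mirror the formula:

```python
def dilated_causal_conv(x, f, d):
    # F(s) = sum_{i=0}^{k-1} f(i) * x[s - d*i]; indices s - d*i < 0 read
    # implicit zero padding, so the output is causal and length-preserving.
    k = len(f)
    out = []
    for s in range(len(x)):
        acc = 0.0
        for i in range(k):
            j = s - d * i            # s - d*i looks d*i steps into the past
            if j >= 0:
                acc += f[i] * x[j]
        out.append(acc)
    return out

x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
f = [0.5, 0.25]                         # filter size k = 2
print(dilated_causal_conv(x, f, d=1))   # [0.5, 1.25, 2.0, 2.75, 3.5, 4.25]
print(dilated_causal_conv(x, f, d=2))   # [0.5, 1.0, 1.75, 2.5, 3.25, 4.0]
```

Note that both outputs have the same length as the input, and output s only reads inputs at s, s − d, ..., never the future.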
(3) Dilated convolutions
- Two ways to expand the receptive field: a larger filter size k and a larger dilation factor d.
- The effective history of a single layer is (k − 1)d.
- TCN model: increase d exponentially with the depth of the network, d = O(2^i) at level i.
Figure: (1a). Dilated causal convolution with dilation factors d = 1,2,4and filter size k = 3.
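As a quick check on these numbers, a small helper (my own, not from the paper) that counts the receptive field of a stack with dilation 2^i and one convolution at level i (the paper's residual blocks contain two convolutions each, which roughly doubles the per-level history term):

```python
# Receptive field of stacked dilated causal convolutions with filter size k
# and dilation d_i = 2**i at level i; each level contributes (k - 1) * d_i
# extra history on top of the current time step.
def receptive_field(k, num_levels):
    return 1 + sum((k - 1) * 2 ** i for i in range(num_levels))

for levels in (1, 2, 3, 8):
    print(levels, receptive_field(k=3, num_levels=levels))
# Growth is exponential in depth: 3, 7, 15, ..., 511 for k = 3.
```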
(4) Residual connections
Figure: (1b)
A TCN residual block includes two layers of dilated causal convolutions, ReLU nonlinearities, weight normalization, and spatial dropout (at each training step, a whole channel is zeroed out).
The TCN uses an additional 1×1 convolution to ensure that the element-wise addition of (residual block input ⊕ residual block output) receives tensors of the same shape.
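The role of that 1×1 convolution can be shown with a minimal sketch (pure Python; weight norm, dropout, and the dilated causal convolutions themselves are omitted, with `block` standing in for the conv stack, and all weights are illustrative placeholders):

```python
def conv1x1(x, w):
    # 1x1 convolution = the same matrix applied independently at every time
    # step, mapping c_in input channels to c_out = len(w) output channels.
    return [[sum(wi[c] * xt[c] for c in range(len(xt))) for wi in w] for xt in x]

def relu(x):
    return [[max(0.0, v) for v in xt] for xt in x]

def residual_block(x, block, w_match):
    # When block(x) changes the channel count, the 1x1 conv w_match projects
    # x to the same shape so the element-wise addition is well defined.
    out = relu(block(x))
    skip = conv1x1(x, w_match)
    return [[o + s for o, s in zip(ot, st)] for ot, st in zip(out, skip)]

# Toy example: 1 input channel -> 2 output channels.
x = [[1.0], [2.0]]                             # T = 2 time steps
block = lambda x: conv1x1(x, [[2.0], [-1.0]])  # stand-in for the conv stack
out = residual_block(x, block, w_match=[[1.0], [1.0]])
print(out)                                     # [[3.0, 1.0], [6.0, 2.0]]
```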
Comparison
- TCN model size ≈ RNN model size.
- TCNs are relatively insensitive to hyperparameter changes as long as the effective history size (i.e. receptive field) is large enough.
- For the RNNs, the authors used grid search to choose hyperparameters.
Tasks: Synthetic Data and Music
The adding problem
  Input: a sequence of length T and depth 2. Dim 1: random values in [0, 1]. Dim 2: all zeros except for two elements marked by 1.
  Objective: sum the two random values whose second dimension is marked by 1.

Sequential MNIST and P-MNIST
  Input: each 28×28 MNIST image is presented to the model as a 784×1 sequence. In P-MNIST, the order of the sequence is permuted by a fixed random permutation.
  Objective: digit classification.

Copy memory
  Input: a sequence of length T + 20. The first 10 values are chosen randomly among the digits 1-8; the middle values are 0; the last 11 values are 9 (the first 9 is a delimiter).
  Objective: generate an output of the same length that is zero everywhere except the last 10 values, where the model must repeat the 10 values it encountered at the start of the input.

Polyphonic music: JSB Chorales & Nottingham
  Input: a sequence where each element is an 88-bit binary code (one bit per piano key), with a 1 indicating a key pressed at that time.
  Objective: predict the next note.
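As an illustration of the first benchmark, a small generator for adding-problem examples (a sketch; the function name and the (value, marker) pair layout are my own):

```python
import random

def adding_problem_example(T, rng=random):
    # Depth-2 sequence: dim 1 holds uniform [0, 1] values, dim 2 is zero
    # except at two randomly chosen marker positions.
    values  = [rng.random() for _ in range(T)]
    markers = [0.0] * T
    i, j = rng.sample(range(T), 2)      # two distinct marked positions
    markers[i] = markers[j] = 1.0
    target = values[i] + values[j]      # label: sum of the two marked values
    return list(zip(values, markers)), target

seq, target = adding_problem_example(8, random.Random(0))
```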
Tasks: Language
Penn Treebank
  Data: words: 888K train, 70K valid, 79K test; vocabulary size 10K. Characters: 5,059K train, 396K valid, 446K test; alphabet size 50.
  Objective: predict the next word (word level); predict the next character (character level).

Wikitext-103
  Data: words: 103M train, 218K valid, 246K test; vocabulary size 268K.
  Objective: predict the next word.

LAMBADA
  Data: train: the full text of 2,662 novels. Test: 10K passages for which humans are good at predicting the last word only when given context. Input: ~4.6 context sentences plus 1 target sentence with its last word missing.
  Objective: predict the last word of the target sentence.

text8
  Data: characters: 90M train, 5M valid, 5M test; alphabet size 27.
  Objective: predict the next character.
Results
(Results tables: one metric where higher is better, one where lower is better.)
Results
Results: Copy Memory Task
Results: TCN vs. SoTA
Best results are highlighted in yellow (higher is better) or blue (lower is better). Note that the SoTA model may be larger than the TCN model (see the “Size” columns).
TCN Summary: Advantages/Disadvantages
+ Parallelism: a long input sequence can be processed as a whole rather than sequentially.
+ Control over receptive field / memory size: e.g. via more layers, larger dilation factors, larger filter sizes.
+ Stable gradients: avoids exploding/vanishing gradients.
+ Lower memory requirement for training: TCNs use up to a multiplicative factor less memory than gated RNNs.
− Higher memory requirement for eval/testing: TCNs need the raw sequence up to the effective history length, not just the current input xt.
− Parameter changes for domain transfer: a TCN model may perform poorly if transferred from a domain where only little memory is needed (small k and d) to a domain where longer memory is needed (large k and d).