Review Series of Recent Deep Learning Papers:
Parameter Prediction Paper: Decoupled Neural Interfaces Using Synthetic Gradients
Max Jaderberg, Wojciech Marian Czarnecki, Simon Osindero, Oriol Vinyals, Alex Graves, David Silver, Koray Kavukcuoglu
Reviewed by: Arshdeep Sekhon
Department of Computer Science, University of Virginia
https://qdata.github.io/deep2Read/
August 25, 2018
Reviewed by: Arshdeep Sekhon (University of Virginia), August 25, 2018
Locking in Neural Networks
To update Layer 1:
1 Forward Propagation through Layer 2 and Layer 3
2 Backward Propagation through Layer 2 and Layer 3
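The locking above can be made concrete with a toy numpy sketch (the layer sizes and the squared-norm loss are illustrative assumptions, not from the paper): Layer 1's weight gradient cannot be formed until the forward pass has run through Layers 2 and 3 and the error has been backpropagated through them.

```python
import numpy as np

rng = np.random.default_rng(0)
# Three linear layers; hypothetical 4x4 sizes for illustration.
W1, W2, W3 = (rng.standard_normal((4, 4)) for _ in range(3))
x = rng.standard_normal(4)

# Forward pass: Layer 1's update is "locked" until h2, h3, and the
# loss have been computed (forward locking) ...
h1 = W1 @ x
h2 = W2 @ h1
h3 = W3 @ h2
loss_grad = 2 * h3              # gradient of ||h3||^2 w.r.t. h3

# ... and until the error has been backpropagated through Layers 3
# and 2 (backward locking): dL/dh1 needs both W3 and W2.
d_h2 = W3.T @ loss_grad
d_h1 = W2.T @ d_h2
d_W1 = np.outer(d_h1, x)        # only now can Layer 1 be updated
```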
Why is locking a problem?
1 Updates happen in a sequential, synchronous manner
2 In a distributed system, updates depend on the slowest part
3 Parallelizing the training of neural network modules can speed up training
Decoupled Neural Interface
1 Layer 1 can be updated before Layer 2 and Layer 3 have even been executed.
2 It is no longer locked to the rest of the network.
A Decoupled Neural Interface
1 Decoupled Neural Interfaces predict error gradients (synthetic gradients) from the previous layer's outputs or activations
2 They do not rely on backpropagation to obtain error gradients
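A minimal sketch of such an interface: a small two-layer ReLU MLP that maps a layer's activations h to a predicted error gradient δ̂ of the same shape. The sizes, zero initialisation of the output weights, and the class name are illustrative assumptions, not the paper's exact architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

class SyntheticGradientModule:
    """Predicts an error gradient from activations alone (no backprop)."""
    def __init__(self, dim, hidden=16):
        self.W1 = rng.standard_normal((hidden, dim)) * 0.1
        self.W2 = np.zeros((dim, hidden))  # zero init => delta_hat = 0 at start

    def __call__(self, h):
        # Two-layer ReLU MLP: h -> hidden -> predicted gradient
        return self.W2 @ np.maximum(self.W1 @ h, 0.0)

M = SyntheticGradientModule(dim=8)
h = rng.standard_normal(8)        # some layer's activations
delta_hat = M(h)                  # gradient prediction, no backprop needed
```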
Gradients for FeedForward Networks
1 A network with N layers f_i, i ∈ {1, · · · , N}
2 The ith layer takes input h_{i−1} and produces output h_i = f_i(h_{i−1})
3 The complete graph is represented by F_N
Gradients for FeedForward Networks
θ_i ← θ_i − α δ_i (∂h_i/∂θ_i) ;  δ_i = ∂L/∂h_i    (1)
Synthetic Gradients for FeedForward Networks
θ_n ← θ_n − α δ̂_i (∂h_i/∂θ_n) ;  δ̂_i = M_{i+1}(h_i)    (2)
Synthetic Gradients for FeedForward Networks
1 θ_n ← θ_n − α δ̂_i (∂h_i/∂θ_n) ;  δ̂_i = M_{i+1}(h_i)    (3)
2 Applied for all layers n ∈ {1, · · · , i}
3 To train M_{i+1}(h_i):
4 wait for the true error gradient δ_i to be computed
5 after a full forward and backward pass of F^N_{i+1}
6 minimize ||δ̂_i − δ_i||₂²
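The interplay of the two updates can be sketched as follows, assuming a single linear layer h = W x and a linear gradient module M(h) = A h standing in for the paper's M_{i+1}; all sizes, step sizes, and the placeholder true gradient are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
W = rng.standard_normal((4, 4)) * 0.1   # the layer's parameters
A = rng.standard_normal((4, 4)) * 0.1   # M's parameters
alpha, beta = 0.01, 0.01                # layer / M learning rates

x = rng.standard_normal(4)
h = W @ x

# 1) Update the layer immediately with the synthetic gradient,
#    without waiting for the rest of the network (no locking):
delta_hat = A @ h
W -= alpha * np.outer(delta_hat, x)     # theta <- theta - alpha * delta_hat * dh/dtheta

# 2) Later, once the true gradient delta = dL/dh arrives from a full
#    forward/backward pass, train M by minimising 1/2 ||delta_hat - delta||^2:
delta_true = rng.standard_normal(4)     # placeholder for the real dL/dh
err = delta_hat - delta_true
A -= beta * np.outer(err, h)            # grad_A 1/2||A h - delta||^2 = (A h - delta) h^T
```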
Every Layer DNI for FeedForward Networks
To train each M_i, use the gradient backpropagated from the next layer's synthetic gradient δ̂_{i+1} instead of the true gradients
Synthetic Gradients for RNNs
The task: stream prediction over a possibly infinite sequence
Unrolling the recurrent network:
Forward graph: F_1^∞, made up of f_i for i from 1 to ∞
At a particular point in time t, minimise the loss over all future steps:

∑_{τ=t}^{∞} L_τ    (4)

θ ← θ − α ∑_{τ=t}^{∞} ∂L_τ/∂θ    (5)
Synthetic Gradients for RNNs
Truncated Backpropagation
At a particular point in time t, minimise the loss over the next steps:

θ ← θ − α ( ∑_{τ=t}^{T} ∂L_τ/∂θ + ( ∑_{τ=T+1}^{∞} ∂L_τ/∂h_T ) ∂h_T/∂θ )    (6)

θ ← θ − α ( ∑_{τ=t}^{T} ∂L_τ/∂θ + δ_T ∂h_T/∂θ )    (7)

Truncated BPTT sets δ_T = 0, which limits the temporal dependencies the RNN can learn
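Truncated BPTT with δ_T = 0 can be sketched on a toy linear RNN; the model h_t = W h_{t−1} + x_t, the per-step loss L_t = ½||h_t||², and the window length are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
W = rng.standard_normal((3, 3)) * 0.1   # recurrent weights
T = 5                                   # truncation window length
xs = rng.standard_normal((T, 3))        # inputs for the window

# Forward: unroll for T steps from a zero initial state.
hs, h = [], np.zeros(3)
for t in range(T):
    h = W @ h + xs[t]
    hs.append(h)

# Backward over the window only; truncated BPTT sets the gradient
# flowing in from beyond the boundary to zero.
dW = np.zeros_like(W)
delta = np.zeros(3)                     # delta_T = 0: no credit beyond step T
for t in reversed(range(T)):
    delta = delta + hs[t]               # dL_t/dh_t = h_t, plus incoming delta
    h_prev = hs[t - 1] if t > 0 else np.zeros(3)
    dW += np.outer(delta, h_prev)       # accumulate dL/dW for this step
    delta = W.T @ delta                 # propagate to the previous step
```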
Synthetic Gradients for RNNs
Truncated Backpropagation
θ ← θ − α ( ∑_{τ=t}^{T} ∂L_τ/∂θ + δ̂_T ∂h_T/∂θ )    (8)

δ̂_T = M_T(h_T): a learned approximation of the future loss gradients
Divide the unrolled RNN into subnetworks of length T
Insert a DNI between F_t^{t+T−1} and F_{t+T}^{t+2T−1}
Synthetic Gradients for RNNs
Truncated Backpropagation
θ ← θ − α ( ∑_{τ=t}^{T} ∂L_τ/∂θ + δ̂_T ∂h_T/∂θ )    (9)

Train M_T by minimizing d(δ_T, δ̂_T)
The true δ_T is not available, so bootstrap:

δ_T = ∑_{τ=T+1}^{2T} ∂L_τ/∂h_T + δ̂_{2T} ∂h_{2T}/∂h_T
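The bootstrapped target can be sketched directly, with placeholder arrays standing in for the true per-step gradients, the Jacobian ∂h_{2T}/∂h_T, and M's own prediction at the end of the next window (all values here are illustrative, as in TD-style bootstrapping).

```python
import numpy as np

rng = np.random.default_rng(3)
T = 4
# True gradients dL_tau/dh_T for tau = T+1 .. 2T, available after the
# next window's forward/backward pass (placeholder values).
dL_dhT = rng.standard_normal((T, 3))
# M's synthetic gradient at the end of the next window, M(h_{2T}).
delta_hat_2T = rng.standard_normal(3)
# Jacobian dh_{2T}/dh_T of the next window's forward graph.
dh2T_dhT = rng.standard_normal((3, 3))

# Bootstrapped target: true gradients inside the window plus M's
# prediction chained through the Jacobian.
target = dL_dhT.sum(axis=0) + dh2T_dhT.T @ delta_hat_2T
# M_T is then trained to minimise ||M_T(h_T) - target||^2.
```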
Other applications
1 Add an auxiliary task
2 Combine with true backpropagation gradients
3 Arbitrary Network Graphs
Results: Penn Tree Bank Language Modeling
Results: Copy Repeat Copy Task
1 Copy: copy a sequence of length N
2 Repeat Copy: copy a sequence of length N, repeated R times
3 The maximum sequence length successfully modeled increases with DNI, for the same truncation length T in BPTT