Neural Turing Machines: Perils and Promise

Daniel Shank, Data Scientist, Talla at MLconf SF 2016


Page 1

Neural Turing Machines: Perils and Promise

Daniel Shank

Page 2

Overview

1. Neural Turing Machines

2. Applications and Performance

3. Challenges and Recommendations

4. Differentiable Neural Computers

Page 3

Neural Turing Machines

Page 4

What’s a Turing Machine?

Model of a computer

Memory tape

Read and write heads

Page 5

What’s a Neural Turing Machine?

Neural Network “Controller”

Memory

Learns from sequence

Graves et al. 2014, arXiv:1410.5401v2

Page 6

Neural Turing Machines are Differentiable Turing Machines

‘Sharp’ functions made smooth

Can train with backpropagation
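To make “sharp functions made smooth” concrete, here is a minimal numpy sketch of the core idea: instead of reading one memory slot, the controller emits a key, a softmax over similarities produces attention weights, and the read is a weighted average, so every step is differentiable. The cosine-similarity/softmax form follows Graves et al. 2014; the sizes and values below are illustrative.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def content_address(memory, key, beta):
    """Soft addressing: cosine similarity between the key and every memory
    row, sharpened by beta and normalized with a softmax (Graves et al. 2014)."""
    sim = memory @ key / (np.linalg.norm(memory, axis=1) * np.linalg.norm(key) + 1e-8)
    return softmax(beta * sim)

def soft_read(memory, weights):
    """A 'blurry' read: a weighted average of all memory rows instead of a
    single hard lookup, so gradients flow to every slot."""
    return weights @ memory

# Illustrative sizes: 8 memory slots of width 4.
memory = np.random.randn(8, 4)
key = np.random.randn(4)
w = content_address(memory, key, beta=5.0)
print(w.round(3), soft_read(memory, w))
```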

Page 7

Applications and Performance

Page 8

Neural Turing Machines can…

Learn simple algorithms (Copy, repeat, recognize simple formal languages...)

Generalize

Do well at language modeling

Do well at bAbI

Page 9

Generalization on Copy/Repeat task

Graves et al. 2014
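The copy task behind this result is easy to reproduce: the network sees a random binary sequence followed by a delimiter and must then emit the same sequence. A small data-generation sketch; the sequence length, bit width, and batch size are illustrative, not the exact settings from the paper.

```python
import numpy as np

def copy_task_batch(batch_size=32, max_len=20, width=8, rng=np.random):
    """Generate (input, target) pairs for the copy task: the input is a random
    binary sequence plus a delimiter channel; the target is the same sequence,
    expected after the delimiter."""
    seq_len = rng.randint(1, max_len + 1)
    seq = rng.randint(0, 2, size=(batch_size, seq_len, width)).astype(np.float32)
    # One extra channel marks the end-of-input delimiter.
    inputs = np.zeros((batch_size, 2 * seq_len + 1, width + 1), dtype=np.float32)
    inputs[:, :seq_len, :width] = seq
    inputs[:, seq_len, width] = 1.0  # delimiter flag
    targets = np.zeros((batch_size, 2 * seq_len + 1, width), dtype=np.float32)
    targets[:, seq_len + 1:, :] = seq  # copy expected after the delimiter
    return inputs, targets

x, y = copy_task_batch()
print(x.shape, y.shape)
```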

Page 10

Neural Turing Machines Outperform LSTMs

Graves et al. 2014

Page 11

Balanced Parentheses

Tristan Deleu https://medium.com/snips-ai/
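The linked post trains a memory-augmented network on bracket sequences; the exact task setup is described there. Purely to illustrate the kind of data involved, here is a generic generator for a balanced-parentheses recognition task (the binary-label formulation and random strings are assumptions, not necessarily Deleu's exact setup).

```python
import random

def is_balanced(s):
    """Standard counter check: depth never goes negative and ends at zero."""
    depth = 0
    for ch in s:
        depth += 1 if ch == "(" else -1
        if depth < 0:
            return False
    return depth == 0

def sample(max_len=20):
    """Random bracket string plus a 0/1 label for 'balanced'."""
    n = random.randint(1, max_len)
    s = "".join(random.choice("()") for _ in range(n))
    return s, int(is_balanced(s))

print([sample() for _ in range(3)])
```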

Page 12

bAbI dataset

1 Mary moved to the bathroom.

2 John went to the hallway.

3 Where is Mary? bathroom 1

4 Daniel went back to the hallway.

5 Sandra moved to the garden.

6 Where is Daniel? hallway 4

7 John moved to the office.

8 Sandra journeyed to the bathroom.

9 Where is Daniel? hallway 4

10 Mary moved to the hallway.

11 Daniel travelled to the office.

12 Where is Daniel? office 11

13 John went back to the garden.

14 John moved to the bedroom.

15 Where is Sandra? bathroom 8

1 Sandra travelled to the office.

2 Sandra went to the bathroom.

3 Where is Sandra? bathroom 2

Small vocabulary

Stories

Context

https://research.facebook.com/research/babi/
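The format above is plain text: lines are numbered within a story, the numbering resets when a new story starts, and in the released files question lines carry tab-separated answer and supporting-fact IDs. A minimal parser sketch under that assumption:

```python
def parse_babi(lines):
    """Turn raw bAbI lines into (story_so_far, question, answer, supporting_ids)
    tuples. A line ID of 1 starts a new story; question lines contain tabs."""
    examples, story = [], []
    for line in lines:
        line_id, text = line.split(" ", 1)
        if int(line_id) == 1:
            story = []  # numbering reset => new story
        if "\t" in text:
            question, answer, supporting = text.split("\t")
            examples.append((list(story), question.strip(), answer,
                             [int(i) for i in supporting.split()]))
        else:
            story.append(text.strip())
    return examples

demo = [
    "1 Mary moved to the bathroom.",
    "2 John went to the hallway.",
    "3 Where is Mary?\tbathroom\t1",
]
print(parse_babi(demo))
```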

Page 13

bAbI results

Empirical Study on Deep Learning Models for Question Answering, Yu et al. 2015

Page 14

Challenges and Recommendations

Page 15

Problems

Architecture dependent

Large number of parameters

Doesn’t benefit much from GPU acceleration

Hard to train

Page 16

Hard to train

Numerical instability

Using memory is hard

Needs smart optimization

Difficult to use in practice

Page 17

Combating Numerical Instability: Gradient clipping

Limits how quickly any parameter can change on a single update

Particularly helpful for learning long range dependencies
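Concretely, clipping by global norm rescales the entire gradient when its norm exceeds a threshold, so no single batch can move the parameters arbitrarily far. Most frameworks provide this directly (e.g. torch.nn.utils.clip_grad_norm_); the threshold below is illustrative.

```python
import numpy as np

def clip_by_global_norm(grads, max_norm=10.0):
    """Rescale all gradients together if their combined L2 norm exceeds max_norm."""
    total = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    scale = min(1.0, max_norm / (total + 1e-8))
    return [g * scale for g in grads]

grads = [np.random.randn(4, 4) * 100, np.random.randn(4) * 100]
clipped = clip_by_global_norm(grads)
print(np.sqrt(sum(np.sum(g ** 2) for g in clipped)))  # now at most ~10
```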

Page 18

Loss clipping

Cap total response to a given training batch

Helpful in addition to gradient clipping
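The slide does not spell out the mechanism; in Graves' RNN work the companion trick to gradient clipping is to clip the derivative of the loss with respect to the network outputs before backpropagation, which caps the total learning signal any one batch can produce. A sketch of that reading; the clip value is illustrative.

```python
import numpy as np

def clip_output_grad(dloss_doutput, clip=100.0):
    """Bound the loss gradient at the output layer so that one bad batch
    cannot produce an enormous backward signal (applied before backprop,
    on top of ordinary gradient clipping)."""
    return np.clip(dloss_doutput, -clip, clip)

print(clip_output_grad(np.array([3.0, -500.0, 2500.0])))
```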

Page 19

Graves' RMSprop

A gradient-descent update rule used to train the network. Used in many of Graves' RNN papers.

Similar to normalizing gradient updates by their variance, which is important given the NTM's highly variable loss.
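Written out, the rule from Graves' sequence-generation paper (arXiv:1308.0850) keeps running estimates of each gradient's mean and mean square and divides by the resulting standard-deviation estimate, which is the "normalize by variance" behaviour described above. A direct numpy transcription, with the hyperparameters used in that paper:

```python
import numpy as np

class GravesRMSprop:
    """n: running mean of g^2, g_bar: running mean of g, d: momentum-like delta.
    The denominator sqrt(n - g_bar^2 + eps) estimates the gradient's standard
    deviation, so updates are normalized by gradient variability."""
    def __init__(self, shape, lr=1e-4, momentum=0.9, decay=0.95, eps=1e-4):
        self.n = np.zeros(shape)
        self.g_bar = np.zeros(shape)
        self.d = np.zeros(shape)
        self.lr, self.momentum, self.decay, self.eps = lr, momentum, decay, eps

    def step(self, param, grad):
        self.n = self.decay * self.n + (1 - self.decay) * grad ** 2
        self.g_bar = self.decay * self.g_bar + (1 - self.decay) * grad
        self.d = (self.momentum * self.d
                  - self.lr * grad / np.sqrt(self.n - self.g_bar ** 2 + self.eps))
        return param + self.d

opt = GravesRMSprop(shape=(3,))
w = np.zeros(3)
w = opt.step(w, grad=np.array([0.5, -1.0, 2.0]))
print(w)
```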

Page 20

Adam Optimizer

Works well for many tasks

Comes pre-loaded in most ML frameworks

Like Graves’ RMSprop, smooths gradients
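Since the point is that Adam ships with every major framework, usage is a one-liner; the learning rate is illustrative and the LSTM here is only a stand-in for an NTM controller.

```python
import torch

# Stand-in for an NTM controller; any iterable of parameters works the same way.
controller = torch.nn.LSTM(input_size=8, hidden_size=100)
optimizer = torch.optim.Adam(controller.parameters(), lr=1e-4)
```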

Page 21

Attention to initialization

Memory initialization extremely important

Poor initialization can prevent convergence

Pay particularly close attention to the starting value of the memory
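What a good starting memory looks like varies between implementations; a convention seen in several of the open-source NTMs listed in the references is to initialize memory to a small constant (or a learned bias) and the read/write weightings to near-uniform, rather than to large random values. A sketch of that convention, with illustrative values:

```python
import numpy as np

def init_memory(num_slots=128, slot_width=20, value=1e-6):
    """Start every memory cell at a tiny constant instead of random noise;
    badly scaled initial memory can keep the controller from ever learning
    to use it."""
    return np.full((num_slots, slot_width), value)

def init_weights(num_slots=128):
    """Start read/write weightings near-uniform (or as a softmax of a learned bias)."""
    return np.full(num_slots, 1.0 / num_slots)

print(init_memory().shape, init_weights().sum())
```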

Page 22

Short sequences first (“Curriculum Learning”)

1) Feed in short training data

2) When loss hits a target, increase the size of the input

3) Repeat
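A sketch of that loop; train_step and evaluate are placeholders for whatever model and task are being trained, and the thresholds and length schedule are illustrative.

```python
def curriculum_train(train_step, evaluate, start_len=2, max_len=64,
                     loss_target=0.01, check_every=100):
    """Curriculum loop: train on the current sequence length until the loss
    target is met, then double the length. train_step(seq_len) performs one
    update; evaluate(seq_len) returns the current loss on that length."""
    seq_len = start_len
    while seq_len <= max_len:
        while evaluate(seq_len) > loss_target:
            for _ in range(check_every):
                train_step(seq_len)
        seq_len *= 2  # graduate to longer, harder sequences
    return seq_len
```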

Page 23

Differentiable Neural Computers

Page 24

Neural Turing Machines “V2”

Similar to NTMs, except…

No index shift based addressing

Can ‘allocate’ and ‘deallocate’ memory

Remembers recent memory use
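The "allocate"/"deallocate" behaviour comes from a usage-based allocation weighting: free gates lower a slot's usage, and new writes are directed to the least-used slots first. A numpy transcription of the allocation weighting from Graves et al. 2016; the usage values below are made up.

```python
import numpy as np

def allocation_weighting(usage):
    """DNC allocation: sort slots by usage (least used first); each slot's
    allocation weight is (1 - usage) times the product of the usages of all
    less-used slots, so writes concentrate on 'free' memory."""
    order = np.argsort(usage)          # free list: least-used slots first
    alloc = np.zeros_like(usage)
    running_product = 1.0
    for idx in order:
        alloc[idx] = (1.0 - usage[idx]) * running_product
        running_product *= usage[idx]
    return alloc

usage = np.array([0.9, 0.1, 0.5, 0.3])   # illustrative usage vector
print(allocation_weighting(usage).round(4))
```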

Page 25

Architecture updates (1)

Graves et al. 2016

Page 26

Architecture updates (2)

Graves et al. 2016

Page 27

Differentiable Neural Computer Performance on Inference Tasks

Graves et al. 2016

Page 28

Differentiable Neural Computer bAbI Results

Graves et al. 2016

Page 29

References

Implementations:

TensorFlow: https://github.com/carpedm20/NTM-tensorflow

Go: https://github.com/fumin/ntm

Torch: https://github.com/kaishengtai/torch-ntm

Node.js: https://github.com/gcgibson/NTM

Lasagne: https://github.com/snipsco/ntm-lasagne

Theano: https://github.com/shawntan/neural-turing-machines

Papers:

Graves et al. 2016 – Hybrid computing using a neural network with dynamic external memory

Graves et al. 2014 – Neural Turing Machines

Yu et al. 2015 – Empirical Study on Deep Learning Models for Question Answering

Rae et al. 2016 – Scaling Memory-Augmented Neural Networks with Sparse Reads and Writes

Page 30

NTM operations

The Convolutional Shift parameter has proven to be one of the most problematic parts of the architecture, if not the most.
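For context, the convolutional shift rotates the attention weighting by a small softmax-weighted set of offsets (typically -1, 0, +1) via circular convolution, which lets a head move relative to its previous position but also blurs the focus over neighbouring slots. A numpy sketch of the operation as defined in Graves et al. 2014:

```python
import numpy as np

def convolutional_shift(weights, shift_dist, shifts=(-1, 0, 1)):
    """Circularly convolve the address weighting with a distribution over
    integer shifts: w_tilde[i] = sum_j shift_dist[j] * weights[(i - shifts[j]) % N]."""
    shifted = np.zeros(len(weights))
    for s, p in zip(shifts, shift_dist):
        shifted += p * np.roll(weights, s)
    return shifted

w = np.array([0.0, 1.0, 0.0, 0.0])              # sharply focused on slot 1
print(convolutional_shift(w, [0.1, 0.8, 0.1]))  # mostly stays put, slightly blurred
```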

Page 31