Neural Turing Machines: Perils and Promise

Daniel Shank, Data Scientist, Talla at MLconf SF 2016


Page 1

Neural Turing Machines: Perils and Promise

Daniel Shank

Page 2

Overview

1. Neural Turing Machines

2. Applications and Performance

3. Challenges and Recommendations

4. Differentiable Neural Computers

Page 3

Neural Turing Machines

Page 4

What’s a Turing Machine?

Model of a computer

Memory tape

Read and write heads

Page 5

What’s a Neural Turing Machine?

Neural Network “Controller”

Memory

Learns from sequence

Graves et al. 2014, arXiv:1410.5401v2

Page 6

Neural Turing Machines are Differentiable Turing Machines

‘Sharp’ functions made smooth

Can train with backpropagation
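To make “sharp functions made smooth” concrete, here is a minimal numpy sketch of the core idea: instead of reading one memory slot, the controller emits a key, a softmax over similarities produces attention weights, and the read is a weighted average, so every step is differentiable. The cosine-similarity/softmax form follows Graves et al. 2014; the sizes and values below are illustrative.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def content_address(memory, key, beta):
    """Soft addressing: cosine similarity between the key and every memory
    row, sharpened by beta and normalized with a softmax (Graves et al. 2014)."""
    sim = memory @ key / (np.linalg.norm(memory, axis=1) * np.linalg.norm(key) + 1e-8)
    return softmax(beta * sim)

def soft_read(memory, weights):
    """A 'blurry' read: a weighted average of all memory rows instead of a
    single hard lookup, so gradients flow to every slot."""
    return weights @ memory

# Illustrative sizes: 8 memory slots of width 4.
memory = np.random.randn(8, 4)
key = np.random.randn(4)
w = content_address(memory, key, beta=5.0)
print(w.round(3), soft_read(memory, w))
```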

Page 7

Applications and Performance

Page 8

Neural Turing Machines can…

Learn simple algorithms (Copy, repeat, recognize simple formal languages...)

Generalize

Do well at language modeling

Do well at bAbI

Page 9

Generalization on Copy/Repeat task

Graves et al. 2014
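The copy task behind this result is easy to reproduce: the network sees a random binary sequence followed by a delimiter and must then emit the same sequence. A small data-generation sketch; the sequence length, bit width, and batch size are illustrative, not the exact settings from the paper.

```python
import numpy as np

def copy_task_batch(batch_size=32, max_len=20, width=8, rng=np.random):
    """Generate (input, target) pairs for the copy task: the input is a random
    binary sequence plus a delimiter channel; the target is the same sequence,
    expected after the delimiter."""
    seq_len = rng.randint(1, max_len + 1)
    seq = rng.randint(0, 2, size=(batch_size, seq_len, width)).astype(np.float32)
    # One extra channel marks the end-of-input delimiter.
    inputs = np.zeros((batch_size, 2 * seq_len + 1, width + 1), dtype=np.float32)
    inputs[:, :seq_len, :width] = seq
    inputs[:, seq_len, width] = 1.0  # delimiter flag
    targets = np.zeros((batch_size, 2 * seq_len + 1, width), dtype=np.float32)
    targets[:, seq_len + 1:, :] = seq  # copy expected after the delimiter
    return inputs, targets

x, y = copy_task_batch()
print(x.shape, y.shape)
```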

Page 10

Neural Turing Machines Outperform LSTMs

Graves et al. 2014

Page 11

Balanced Parentheses

Tristan Deleu https://medium.com/snips-ai/
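The linked post trains a memory-augmented network on bracket sequences; the exact task setup is described there. Purely to illustrate the kind of data involved, here is a generic generator for a balanced-parentheses recognition task (the binary-label formulation and random strings are assumptions, not necessarily Deleu's exact setup).

```python
import random

def is_balanced(s):
    """Standard counter check: depth never goes negative and ends at zero."""
    depth = 0
    for ch in s:
        depth += 1 if ch == "(" else -1
        if depth < 0:
            return False
    return depth == 0

def sample(max_len=20):
    """Random bracket string plus a 0/1 label for 'balanced'."""
    n = random.randint(1, max_len)
    s = "".join(random.choice("()") for _ in range(n))
    return s, int(is_balanced(s))

print([sample() for _ in range(3)])
```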

Page 12

bAbI dataset

1 Mary moved to the bathroom.

2 John went to the hallway.

3 Where is Mary? bathroom 1

4 Daniel went back to the hallway.

5 Sandra moved to the garden.

6 Where is Daniel? hallway 4

7 John moved to the office.

8 Sandra journeyed to the bathroom.

9 Where is Daniel? hallway 4

10 Mary moved to the hallway.

11 Daniel travelled to the office.

12 Where is Daniel? office 11

13 John went back to the garden.

14 John moved to the bedroom.

15 Where is Sandra? bathroom 8

1 Sandra travelled to the office.

2 Sandra went to the bathroom.

3 Where is Sandra? bathroom 2

Small vocabulary

Stories

Context

https://research.facebook.com/research/babi/
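The format above is plain text: lines are numbered within a story, the numbering resets when a new story starts, and in the released files question lines carry tab-separated answer and supporting-fact IDs. A minimal parser sketch under that assumption:

```python
def parse_babi(lines):
    """Turn raw bAbI lines into (story_so_far, question, answer, supporting_ids)
    tuples. A line ID of 1 starts a new story; question lines contain tabs."""
    examples, story = [], []
    for line in lines:
        line_id, text = line.split(" ", 1)
        if int(line_id) == 1:
            story = []  # numbering reset => new story
        if "\t" in text:
            question, answer, supporting = text.split("\t")
            examples.append((list(story), question.strip(), answer,
                             [int(i) for i in supporting.split()]))
        else:
            story.append(text.strip())
    return examples

demo = [
    "1 Mary moved to the bathroom.",
    "2 John went to the hallway.",
    "3 Where is Mary?\tbathroom\t1",
]
print(parse_babi(demo))
```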

Page 13

bAbI results

Empirical Study on Deep Learning Models for Question Answering, Yu et al. 2015

Page 14

Challenges and Recommendations

Page 15

Problems

Architecture dependent

Large number of parameters

Doesn’t benefit much from GPU acceleration

Hard to train

Page 16

Hard to train

Numerical instability

Using memory is hard

Needs smart optimization

Difficult to use in practice

Page 17

Combating Numerical Instability: Gradient clipping

Limits how quickly any parameter can change on a single update

Particularly helpful for learning long range dependencies
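Concretely, clipping by global norm rescales the entire gradient when its norm exceeds a threshold, so no single batch can move the parameters arbitrarily far. Most frameworks provide this directly (e.g. torch.nn.utils.clip_grad_norm_); the threshold below is illustrative.

```python
import numpy as np

def clip_by_global_norm(grads, max_norm=10.0):
    """Rescale all gradients together if their combined L2 norm exceeds max_norm."""
    total = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    scale = min(1.0, max_norm / (total + 1e-8))
    return [g * scale for g in grads]

grads = [np.random.randn(4, 4) * 100, np.random.randn(4) * 100]
clipped = clip_by_global_norm(grads)
print(np.sqrt(sum(np.sum(g ** 2) for g in clipped)))  # now at most ~10
```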

Page 18

Loss clipping

Cap total response to a given training batch

Helpful in addition to gradient clipping
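The slide does not spell out the mechanism; in Graves' RNN work the companion trick to gradient clipping is to clip the derivative of the loss with respect to the network outputs before backpropagation, which caps the total learning signal any one batch can produce. A sketch of that reading; the clip value is illustrative.

```python
import numpy as np

def clip_output_grad(dloss_doutput, clip=100.0):
    """Bound the loss gradient at the output layer so that one bad batch
    cannot produce an enormous backward signal (applied before backprop,
    on top of ordinary gradient clipping)."""
    return np.clip(dloss_doutput, -clip, clip)

print(clip_output_grad(np.array([3.0, -500.0, 2500.0])))
```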

Page 19

Graves' RMSprop

A gradient-descent update rule used to train the network. Used in many of Graves' RNN papers.

Similar to normalizing gradient updates by their variance, which is important given the NTM's highly variable loss.
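Written out, the rule from Graves' sequence-generation paper (arXiv:1308.0850) keeps running estimates of each gradient's mean and mean square and divides by the resulting standard-deviation estimate, which is the "normalize by variance" behaviour described above. A direct numpy transcription, with the hyperparameters used in that paper:

```python
import numpy as np

class GravesRMSprop:
    """n: running mean of g^2, g_bar: running mean of g, d: momentum-like delta.
    The denominator sqrt(n - g_bar^2 + eps) estimates the gradient's standard
    deviation, so updates are normalized by gradient variability."""
    def __init__(self, shape, lr=1e-4, momentum=0.9, decay=0.95, eps=1e-4):
        self.n = np.zeros(shape)
        self.g_bar = np.zeros(shape)
        self.d = np.zeros(shape)
        self.lr, self.momentum, self.decay, self.eps = lr, momentum, decay, eps

    def step(self, param, grad):
        self.n = self.decay * self.n + (1 - self.decay) * grad ** 2
        self.g_bar = self.decay * self.g_bar + (1 - self.decay) * grad
        self.d = (self.momentum * self.d
                  - self.lr * grad / np.sqrt(self.n - self.g_bar ** 2 + self.eps))
        return param + self.d

opt = GravesRMSprop(shape=(3,))
w = np.zeros(3)
w = opt.step(w, grad=np.array([0.5, -1.0, 2.0]))
print(w)
```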

Page 20

Adam Optimizer

Works well for many tasks

Comes pre-loaded in most ML frameworks

Like Graves’ RMSprop, smooths gradients
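Since the point is that Adam ships with every major framework, usage is a one-liner; the learning rate is illustrative and the LSTM here is only a stand-in for an NTM controller.

```python
import torch

# Stand-in for an NTM controller; any iterable of parameters works the same way.
controller = torch.nn.LSTM(input_size=8, hidden_size=100)
optimizer = torch.optim.Adam(controller.parameters(), lr=1e-4)
```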

Page 21

Attention to initialization

Memory initialization extremely important

Poor initialization can prevent convergence

Pay particularly close attention to the starting value of the memory
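What a good starting memory looks like varies between implementations; a convention seen in several of the open-source NTMs listed in the references is to initialize memory to a small constant (or a learned bias) and the read/write weightings to near-uniform, rather than to large random values. A sketch of that convention, with illustrative values:

```python
import numpy as np

def init_memory(num_slots=128, slot_width=20, value=1e-6):
    """Start every memory cell at a tiny constant instead of random noise;
    badly scaled initial memory can keep the controller from ever learning
    to use it."""
    return np.full((num_slots, slot_width), value)

def init_weights(num_slots=128):
    """Start read/write weightings near-uniform (or as a softmax of a learned bias)."""
    return np.full(num_slots, 1.0 / num_slots)

print(init_memory().shape, init_weights().sum())
```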

Page 22

Short sequences first (“Curriculum Learning”)

1) Feed in short training data

2) When loss hits a target, increase the size of the input

3) Repeat
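A sketch of that loop; train_step and evaluate are placeholders for whatever model and task are being trained, and the thresholds and length schedule are illustrative.

```python
def curriculum_train(train_step, evaluate, start_len=2, max_len=64,
                     loss_target=0.01, check_every=100):
    """Curriculum loop: train on the current sequence length until the loss
    target is met, then double the length. train_step(seq_len) performs one
    update; evaluate(seq_len) returns the current loss on that length."""
    seq_len = start_len
    while seq_len <= max_len:
        while evaluate(seq_len) > loss_target:
            for _ in range(check_every):
                train_step(seq_len)
        seq_len *= 2  # graduate to longer, harder sequences
    return seq_len
```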

Page 23

Differentiable Neural Computers

Page 24

Neural Turing Machines “V2”

Similar to NTMs, except…

No index shift based addressing

Can ‘allocate’ and ‘deallocate’ memory

Remembers recent memory use
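The "allocate"/"deallocate" behaviour comes from a usage-based allocation weighting: free gates lower a slot's usage, and new writes are directed to the least-used slots first. A numpy transcription of the allocation weighting from Graves et al. 2016; the usage values below are made up.

```python
import numpy as np

def allocation_weighting(usage):
    """DNC allocation: sort slots by usage (least used first); each slot's
    allocation weight is (1 - usage) times the product of the usages of all
    less-used slots, so writes concentrate on 'free' memory."""
    order = np.argsort(usage)          # free list: least-used slots first
    alloc = np.zeros_like(usage)
    running_product = 1.0
    for idx in order:
        alloc[idx] = (1.0 - usage[idx]) * running_product
        running_product *= usage[idx]
    return alloc

usage = np.array([0.9, 0.1, 0.5, 0.3])   # illustrative usage vector
print(allocation_weighting(usage).round(4))
```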

Page 25

Architecture updates (1)

Graves et al. 2016

Page 26

Architecture updates (2)

Graves et al. 2016

Page 27

Differentiable Neural Computer Performance on Inference Tasks

Graves et al. 2016

Page 28

Differentiable Neural Computer bAbI Results

Graves et al. 2016

Page 29

References

Implementations:

TensorFlow: https://github.com/carpedm20/NTM-tensorflow

Go: https://github.com/fumin/ntm

Torch: https://github.com/kaishengtai/torch-ntm

Node.js: https://github.com/gcgibson/NTM

Lasagne: https://github.com/snipsco/ntm-lasagne

Theano: https://github.com/shawntan/neural-turing-machines

Papers:

Graves et al. 2016 – Hybrid computing using a neural network with dynamic external memory

Graves et al. 2014 – Neural Turing Machines

Yu et al. 2015 – Empirical Study on Deep Learning Models for Question Answering

Rae et al. 2016 – Scaling Memory-Augmented Neural Networks with Sparse Reads and Writes

Page 30

NTM operations

The Convolutional Shift parameter has proven to be one of the most problematic parts of the architecture, if not the most.
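For context, the convolutional shift rotates the attention weighting by a small softmax-weighted set of offsets (typically -1, 0, +1) via circular convolution, which lets a head move relative to its previous position but also blurs the focus over neighbouring slots. A numpy sketch of the operation as defined in Graves et al. 2014:

```python
import numpy as np

def convolutional_shift(weights, shift_dist, shifts=(-1, 0, 1)):
    """Circularly convolve the address weighting with a distribution over
    integer shifts: w_tilde[i] = sum_j shift_dist[j] * weights[(i - shifts[j]) % N]."""
    shifted = np.zeros(len(weights))
    for s, p in zip(shifts, shift_dist):
        shifted += p * np.roll(weights, s)
    return shifted

w = np.array([0.0, 1.0, 0.0, 0.0])              # sharply focused on slot 1
print(convolutional_shift(w, [0.1, 0.8, 0.1]))  # mostly stays put, slightly blurred
```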

Page 31