Differentiable Programming
Atılım Güneş Baydin
National University of Ireland Maynooth
(Based on joint work with Barak Pearlmutter)

Microsoft Research Cambridge, February 1, 2016

Deep learning layouts
Neural network models are assembled from building blocks and trained with backpropagation

Traditional: feedforward, convolutional, recurrent

1/40

Deep learning layouts
Newer additions: make algorithmic elements continuous and differentiable → enables use in deep learning

NTM on copy task (Graves et al., 2014)

Neural Turing Machine (Graves et al., 2014) → can infer algorithms: copy, sort, recall
Stack-augmented RNN (Joulin & Mikolov, 2015)
End-to-end memory network (Sukhbaatar et al., 2015)
Stack, queue, deque (Grefenstette et al., 2015)
Discrete interfaces (Zaremba & Sutskever, 2015)

2/40

Deep learning layouts
Stacking of many layers, trained through backpropagation
AlexNet, 8 layers (ILSVRC 2012); VGG, 19 layers (ILSVRC 2014); ResNet, 152 layers (deep residual learning) (ILSVRC 2015)

[Figure: layer-by-layer architecture diagrams of AlexNet, VGG, and ResNet; per-layer labels omitted]

(He, Zhang, Ren, Sun. “Deep Residual Learning for Image Recognition.” 2015. arXiv:1512.03385)

3/40

The bigger picture
One way of viewing deep learning systems is “differentiable functional programming”

Two main characteristics:
Differentiability → optimization
Chained function composition → successive transformations → successive levels of distributed representations (Bengio 2013) → the chain rule of calculus propagates derivatives

4/40

The bigger picture
In a functional interpretation:
Weight-tying or multiple applications of the same neuron (e.g., ConvNets and RNNs) resemble function abstraction
Structural patterns of composition resemble higher-order functions (e.g., map, fold, unfold, zip)

5/40

The bigger picture
Even when you have complex compositions, differentiability ensures that they can be trained end-to-end with backpropagation

(Vinyals, Toshev, Bengio, Erhan. “Show and tell: a neural image caption generator.” 2014. arXiv:1411.4555)

6/40

The bigger picture
These insights were clearly put into words in Christopher Olah’s blog post (September 3, 2015)
http://colah.github.io/posts/2015-09-NN-Types-FP/
“The field does not (yet) have a unifying insight or narrative”

and reiterated in David Dalrymple’s essay (January 2016)
http://edge.org/response-detail/26794
“The most natural playground ... would be a new language that can run back-propagation directly on functional programs.”

7/40

In this talk
Vision: functional languages with deeply embedded, general-purpose differentiation capability, i.e., differentiable programming

Automatic (algorithmic) differentiation (AD) in a functional framework is a manifestation of this vision.

8/40

In this talk
I will talk about:
Mainstream frameworks
What AD research can contribute
My ongoing work

9/40

Mainstream Frameworks

Frameworks
“Theano-like”
Fine-grained
Define computational graphs in a symbolic way
Graph analysis and optimizations

Examples: Theano, Computation Graph Toolkit (CGT), TensorFlow, Computational Network Toolkit (CNTK)

(Kenneth Tran. “Evaluation of Deep Learning Toolkits”. https://github.com/zer0n/deepframeworks)

10/40

Frameworks
“Torch-like”
Coarse-grained
Build models by combining pre-specified modules
Each module is manually implemented, hand-tuned

Examples: Torch7, Caffe

11/40

Frameworks
Common in both:
Define models using the framework’s (constrained) symbolic language
The framework handles backpropagation → you don’t have to code derivatives (unless adding new modules)
Because derivatives are “automatic”, some call it “autodiff” or “automatic differentiation”

This is NOT the traditional meaning of automatic differentiation (AD) (Griewank & Walther, 2008)
Because “automatic” is a generic (and bad) term, algorithmic differentiation is a better name

12/40

“But, how is AD different from Theano?”
In Theano:
express all math relations using symbolic placeholders
use a mini-language with very limited control flow (e.g. scan)
end up designing a symbolic graph for your algorithm; Theano optimizes it

13/40

“But, how is AD different from Theano?”
Theano gives you automatic derivatives:
Transforms your graph into a derivative graph
Applies optimizations: identical subgraph elimination, simplifications, stability improvements
(http://deeplearning.net/software/theano/optimizations.html)
Compiles to a highly optimized form

14/40
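To make the contrast concrete, here is a minimal sketch (not from the slides) of the Theano workflow just described: symbolic placeholders, a symbolic graph, a derivative graph from theano.grad, and compilation. Variable names are illustrative.

import theano
import theano.tensor as T

x = T.dvector('x')                  # symbolic placeholder
y = T.sum(x ** 2)                   # builds a symbolic graph, not a numeric value
gy = T.grad(y, x)                   # Theano derives a gradient graph symbolically
f = theano.function([x], [y, gy])   # graph optimization and compilation happen here
print(f([1.0, 2.0, 3.0]))           # [array(14.0), array([2., 4., 6.])]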

“But, how is AD different from Theano?”
You are limited to symbolic graph building, with the mini-language
For example, instead of this in pure Python (for A^k):
[code listing not captured in this transcript]

You build this symbolic graph:
[code listing not captured in this transcript]

15/40

“But, how is AD different from Theano?”
AD allows you to just fully use your host language and gives you exact and efficient derivatives
So, you just do this:
[code listing not captured in this transcript]

For Python, autograd: https://github.com/HIPS/autograd
Harvard Intelligent Probabilistic Systems Group
(Dougal Maclaurin, David Duvenaud, Ryan P Adams. “Autograd: effortless gradients in Numpy.” 2015)

16/40
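The slide's code listing is missing from the transcript; the following is a minimal autograd sketch in the same spirit (the function and names are illustrative, not the slide's original example). Ordinary Python control flow is used directly, and the gradient comes from a single call to grad.

import autograd.numpy as np    # thin NumPy wrapper that records operations
from autograd import grad

def f(x, k=4):
    # plain Python loop and NumPy ops; no symbolic graph is written by hand
    y = x
    for _ in range(k):
        y = y * x
    return np.sum(y)

df = grad(f)                   # reverse-mode AD w.r.t. the first argument
x = np.array([1.0, 2.0, 3.0])
print(f(x), df(x))             # df(x) = 5 * x**4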

Here is the difference
AD does not use symbolic graphs
Gives numeric code that computes the function AND its derivatives at a given point

f(a, b):
  c = a * b
  d = sin c
  return d

f'(a, a', b, b'):
  (c, c') = (a*b, a'*b + a*b')
  (d, d') = (sin c, c' * cos c)
  return (d, d')

Derivatives propagated at the elementary operation level, as a side effect, at the same time as the function itself is computed
→ Prevents the “expression swell” of symbolic derivatives
Full expressive capability of the host language
→ Including conditionals, looping, branching

17/40

Function evaluation traces
All numeric evaluations are sequences of elementary operations: a “trace,” also called a “Wengert list” (Wengert, 1964)

f(a, b):
  c = a * b
  if c > 0
    d = log c
  else
    d = sin c
  return d

f(2, 3)

(primal)
a = 2
b = 3
c = a * b = 6
d = log c = 1.791
return 1.791

(tangent)
a = 2    a' = 1
b = 3    b' = 0
c = a * b = 6    c' = a'*b + a*b' = 3
d = log c = 1.791    d' = c' * (1/c) = 0.5
return 1.791, 0.5

i.e., a Jacobian-vector product: $J_f(1,0)\big|_{(2,3)} = \frac{\partial}{\partial a} f(a,b)\big|_{(2,3)} = 0.5$

This is called the forward (tangent) mode of AD

18/40
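As a concrete illustration (not part of the slides), a minimal forward-mode AD sketch in Python using dual numbers reproduces exactly this trace; only *, log, and sin are handled, and all names are illustrative.

import math

class Dual:
    """A primal value paired with a tangent (directional derivative)."""
    def __init__(self, primal, tangent=0.0):
        self.primal, self.tangent = primal, tangent
    def __mul__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.primal * other.primal,
                    self.tangent * other.primal + self.primal * other.tangent)
    __rmul__ = __mul__

def log(x):
    return Dual(math.log(x.primal), x.tangent / x.primal)

def sin(x):
    return Dual(math.sin(x.primal), x.tangent * math.cos(x.primal))

def f(a, b):
    c = a * b
    return log(c) if c.primal > 0 else sin(c)

# derivative of f w.r.t. a at (2, 3): seed a's tangent with 1, b's with 0
out = f(Dual(2.0, 1.0), Dual(3.0, 0.0))
print(out.primal, out.tangent)   # 1.791..., 0.5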

Function evaluation traces
f(a, b):
  c = a * b
  if c > 0
    d = log c
  else
    d = sin c
  return d

f(2, 3)

(primal)
a = 2
b = 3
c = a * b = 6
d = log c = 1.791
return 1.791

(adjoint)
a = 2
b = 3
c = a * b = 6
d = log c = 1.791
d' = 1
c' = d' * (1/c) = 0.166
b' = c' * a = 0.333
a' = c' * b = 0.5
return 1.791, 0.5, 0.333

i.e., a transposed Jacobian-vector product: $J_f^{\mathsf{T}}(1)\big|_{(2,3)} = \nabla f\big|_{(2,3)} = (0.5, 0.333)$

This is called the reverse (adjoint) mode of AD
Backpropagation is just a special case of the reverse mode: code your neural network objective computation, apply reverse AD

19/40
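A matching reverse-mode sketch (again illustrative, not the slides' code): the forward pass records a tape of local partial derivatives, and a backward sweep over the tape accumulates the adjoints shown above.

import math

def f_reverse(a, b):
    # forward (primal) pass, recording local partials on a tape
    c = a * b
    tape = [('c', {'a': b, 'b': a})]            # dc/da = b, dc/db = a
    if c > 0:
        d = math.log(c)
        tape.append(('d', {'c': 1.0 / c}))      # dd/dc = 1/c
    else:
        d = math.sin(c)
        tape.append(('d', {'c': math.cos(c)}))  # dd/dc = cos c
    # reverse (adjoint) pass: seed the output adjoint with 1
    adjoint = {'d': 1.0, 'c': 0.0, 'a': 0.0, 'b': 0.0}
    for var, partials in reversed(tape):
        for parent, dvar_dparent in partials.items():
            adjoint[parent] += adjoint[var] * dvar_dparent
    return d, adjoint['a'], adjoint['b']

print(f_reverse(2.0, 3.0))   # (1.791..., 0.5, 0.333...)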

Torch-autograd
There are signs that this type of generalized AD will become mainstream in machine learning

A very recent development (November 2015): Torch-autograd by Twitter Cortex (inspired by Python autograd)
https://blog.twitter.com/2015/autograd-for-torch
“autograd has dramatically sped up our model building ... extremely easy to try and test out new ideas”

20/40

A cool functional DSL for Torch and Caffe
A side note about the functional interpretation of deep learning: dnngraph by Andrew Tulloch
http://ajtulloch.github.io/dnngraph/
Specify neural network layouts in Haskell, and it gives you Torch and Caffe scripts

21/40

What Can AD Research Contribute?

The ambition
Deeply embedded AD
Derivatives (forward and/or reverse) as part of the language infrastructure
Rich API of differentiation operations as higher-order functions
High-performance matrix operations for deep learning (GPU support, model and data parallelism)

The embodiment of the “differentiable programming” paradigm
I have been working on these issues with Barak Pearlmutter and created DiffSharp (later in the talk)

22/40

AD in a functional framework
AD has been around since the 1960s (Wengert, 1964; Speelpenning, 1980; Griewank, 1989)
The foundations for AD in a functional framework (Siskind and Pearlmutter, 2008; Pearlmutter and Siskind, 2008)
With research implementations:
R6RS-AD: https://github.com/qobi/R6RS-AD
Stalingrad: http://www.bcl.hamilton.ie/~qobi/stalingrad/
Alexey Radul’s DVL: https://github.com/axch/dysvunctional-language
Recently, my DiffSharp library: http://diffsharp.github.io/DiffSharp/

23/40

AD in a functional framework
“Generalized AD as a first-class function in an augmented λ-calculus” (Pearlmutter and Siskind, 2008)
Forward, reverse, and any nested combination thereof, instantiated according to usage scenario
Nested lambda expressions with free-variable references

min (λx . (f x) + min (λy . g x y))
(min: gradient descent)

Must handle “perturbation confusion” (Manzyuk et al., 2012)
D (λx . x × (D (λy . x + y) 1)) 1

$\left.\frac{d}{dx}\left( x \left( \left.\frac{d}{dy}(x+y)\right|_{y=1} \right) \right)\right|_{x=1} \stackrel{?}{=} 1$

24/40
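As an aside (not from the slides), the same nested expression can be checked in Python with autograd, which supports nested derivatives; an implementation that confuses perturbations returns 2 instead of the correct 1.

from autograd import grad

# D (λx . x × (D (λy . x + y) 1)) 1
inner = lambda x: x * grad(lambda y: x + y)(1.0)   # inner derivative w.r.t. y is 1
print(grad(inner)(1.0))                            # 1.0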

Tricks of the trade
Many methods from AD research:
Hessian-vector products (Pearlmutter, 1994)
Tape reduction and elimination (Naumann, 2004)
Context-aware source-to-source transformation (Utke, 2004)
Utilizing sparsity by matrix coloring (Gebremedhin et al., 2013)
Reverse AD checkpointing (Dauvergne & Hascoët, 2006)

25/40
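For instance, the Pearlmutter (1994) Hessian-vector product falls out of two nested applications of AD, without ever forming the Hessian; a hedged Python/autograd sketch with illustrative names:

import autograd.numpy as np
from autograd import grad

def hvp(f):
    """Return a function computing H_f(x) . v by nested AD."""
    df = grad(f)
    # gradient of the scalar <grad f(x), v> w.r.t. x is the Hessian-vector product
    return grad(lambda x, v: np.dot(df(x), v))

f = lambda x: np.sum(x ** 3)                      # Hessian is diag(6x)
x, v = np.array([1.0, 2.0]), np.array([1.0, 0.0])
print(hvp(f)(x, v))                               # [6., 0.]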

My Ongoing Work

DiffSharp
http://diffsharp.github.io/DiffSharp/

AD with linear algebra primitives
arbitrary nesting of forward/reverse AD
a comprehensive higher-order API
gradients, Hessians, Jacobians, directional derivatives, matrix-free Hessian- and Jacobian-vector products

Implemented in F#
→ the best tool for this job
→ cross-platform (Linux, Mac OS, Windows)
→ easy deployment with nuget
→ the immense .NET user base of C# and F# users
→ implicit quotations in F# 4.0 is a “killer feature” for deeply embedding transformation-based AD

26/40

DiffSharp
Higher-order differentiation API

Op. | Value | Type signature | AD | Num. | Sym.

f : R → R
diff | f' | (R → R) → R → R | X, F | A | X
diff' | (f, f') | (R → R) → R → (R × R) | X, F | A | X
diff2 | f'' | (R → R) → R → R | X, F | A | X
diff2' | (f, f'') | (R → R) → R → (R × R) | X, F | A | X
diff2'' | (f, f', f'') | (R → R) → R → (R × R × R) | X, F | A | X
diffn | f^(n) | N → (R → R) → R → R | X, F | | X
diffn' | (f, f^(n)) | N → (R → R) → R → (R × R) | X, F | | X

f : R^n → R
grad | ∇f | (R^n → R) → R^n → R^n | X, R | A | X
grad' | (f, ∇f) | (R^n → R) → R^n → (R × R^n) | X, R | A | X
gradv | ∇f · v | (R^n → R) → R^n → R^n → R | X, F | A |
gradv' | (f, ∇f · v) | (R^n → R) → R^n → R^n → (R × R) | X, F | A |
hessian | Hf | (R^n → R) → R^n → R^(n×n) | X, R-F | A | X
hessian' | (f, Hf) | (R^n → R) → R^n → (R × R^(n×n)) | X, R-F | A | X
hessianv | Hf v | (R^n → R) → R^n → R^n → R^n | X, F-R | A |
hessianv' | (f, Hf v) | (R^n → R) → R^n → R^n → (R × R^n) | X, F-R | A |
gradhessian | (∇f, Hf) | (R^n → R) → R^n → (R^n × R^(n×n)) | X, R-F | A | X
gradhessian' | (f, ∇f, Hf) | (R^n → R) → R^n → (R × R^n × R^(n×n)) | X, R-F | A | X
gradhessianv | (∇f · v, Hf v) | (R^n → R) → R^n → R^n → (R × R^n) | X, F-R | A |
gradhessianv' | (f, ∇f · v, Hf v) | (R^n → R) → R^n → R^n → (R × R × R^n) | X, F-R | A |
laplacian | tr(Hf) | (R^n → R) → R^n → R | X, R-F | A | X
laplacian' | (f, tr(Hf)) | (R^n → R) → R^n → (R × R) | X, R-F | A | X

f : R^n → R^m
jacobian | Jf | (R^n → R^m) → R^n → R^(m×n) | X, F/R | A | X
jacobian' | (f, Jf) | (R^n → R^m) → R^n → (R^m × R^(m×n)) | X, F/R | A | X
jacobianv | Jf v | (R^n → R^m) → R^n → R^n → R^m | X, F | A |
jacobianv' | (f, Jf v) | (R^n → R^m) → R^n → R^n → (R^m × R^m) | X, F | A |
jacobianT | Jf^T | (R^n → R^m) → R^n → R^(n×m) | X, F/R | A | X
jacobianT' | (f, Jf^T) | (R^n → R^m) → R^n → (R^m × R^(n×m)) | X, F/R | A | X
jacobianTv | Jf^T v | (R^n → R^m) → R^n → R^m → R^n | X, R | |
jacobianTv' | (f, Jf^T v) | (R^n → R^m) → R^n → R^m → (R^m × R^n) | X, R | |
jacobianTv'' | (f, Jf^T (·)) | (R^n → R^m) → R^n → (R^m × (R^m → R^n)) | X, R | |
curl | ∇ × f | (R^3 → R^3) → R^3 → R^3 | X, F | A | X
curl' | (f, ∇ × f) | (R^3 → R^3) → R^3 → (R^3 × R^3) | X, F | A | X
div | ∇ · f | (R^n → R^n) → R^n → R | X, F | A | X
div' | (f, ∇ · f) | (R^n → R^n) → R^n → (R^n × R) | X, F | A | X
curldiv | (∇ × f, ∇ · f) | (R^3 → R^3) → R^3 → (R^3 × R) | X, F | A | X
curldiv' | (f, ∇ × f, ∇ · f) | (R^3 → R^3) → R^3 → (R^3 × R^3 × R) | X, F | A | X

27/40

DiffSharp
Matrix operations: http://diffsharp.github.io/DiffSharp/api-overview.html
High-performance OpenBLAS backend by default; work on a CUDA-based GPU backend underway
Support for 64- and 32-bit floats (faster on many systems)
Benchmarking tool: http://diffsharp.github.io/DiffSharp/benchmarks.html

A growing collection of tutorials: gradient-based optimization algorithms, clustering, Hamiltonian Monte Carlo, neural networks, inverse kinematics

28/40

Hype
http://hypelib.github.io/Hype/

An experimental library for “compositional machine learning and hyperparameter optimization”, built on DiffSharp
A robust optimization core
highly configurable functional modules
SGD, conjugate gradient, Nesterov, AdaGrad, RMSProp, Newton’s method
Use nested AD for gradient-based hyperparameter optimization (Maclaurin et al., 2015)

Researching the differentiable functional programming paradigm for machine learning

29/40

Hype
Extracts from Hype neural network code: use higher-order functions, don’t think about gradients or backpropagation
https://github.com/hypelib/Hype/blob/master/src/Hype/Neural.fs

30/40

Hype
Extracts from Hype optimization code
https://github.com/hypelib/Hype/blob/master/src/Hype/Optimize.fs

Optimization and training as higher-order functions
→ works with any function that you want to use to describe your data
→ can be composed, curried, nested

31/40
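To illustrate the idea in a language-neutral way (this is a sketch, not Hype's F# API), an optimizer can be a higher-order function that accepts any differentiable objective and obtains the gradient by AD:

import autograd.numpy as np
from autograd import grad

def train(objective, params, lr=0.1, steps=100):
    # works for any scalar objective(params); no hand-coded derivatives
    dobjective = grad(objective)
    for _ in range(steps):
        params = params - lr * dobjective(params)
    return params

objective = lambda p: np.sum((p - 3.0) ** 2)   # toy objective
print(train(objective, np.zeros(2)))           # approaches [3., 3.]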

Hype
User doesn’t need to think about derivatives
They are instantiated within the optimization code

32/40

Hype
But they can use derivatives within their models, if needed
→ input sensitivities
→ complex objective functions
→ adaptive PID controllers
→ integrating differential equations

Thanks to nested generalized AD
you can optimize components that are internally using differentiation
resulting higher-order derivatives propagate via forward/reverse AD as needed

33/40

Hype
We also provide a Torch-like API for neural networks
A cool thing: thanks to AD, we can freely code any F# function as a layer, it just works

34/40

Hype
http://hypelib.github.io/Hype/feedforwardnets.html

We also have some nice additions for F# interactive

35/40

Roadmap

Transformation-based, context-aware AD
F# quotations (Syme, 2006) give us a direct path for deeply embedding AD
Currently experimenting with GPU backends (CUDA, ArrayFire, Magma)
Generalizing to tensors (for elegant implementations of, e.g., ConvNets)

36/40

Roadmap

I would like to see this work integrated with tools in other languages (C++, Python) and frameworks (Torch, CNTK)

37/40

Conclusion

Conclusion

An exciting research area at the intersection of:
programming languages
functional programming
machine learning

38/40

Beyond deep learning
Applications in probabilistic programming
(Wingate, Goodman, Stuhlmüller, Siskind. “Nonstandard interpretations of probabilistic programs for efficient inference.” 2011)

Hamiltonian Monte Carlo
http://diffsharp.github.io/DiffSharp/examples-hamiltonianmontecarlo.html
No-U-Turn sampler
Gradient-based maximum a posteriori estimates

For example, Stan is built on AD
http://mc-stan.org/ (Carpenter et al., 2015)

39/40
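To sketch why AD matters here (illustrative, not DiffSharp's example code), an HMC leapfrog trajectory only needs the gradient of the log-density, which AD supplies for whatever model code defines it:

import autograd.numpy as np
from autograd import grad

def leapfrog(logp, x, p, step_size=0.1, n_steps=10):
    # one HMC trajectory; grad(logp) is the only derivative needed
    dlogp = grad(logp)
    p = p + 0.5 * step_size * dlogp(x)        # half step for momentum
    for _ in range(n_steps - 1):
        x = x + step_size * p                 # full step for position
        p = p + step_size * dlogp(x)          # full step for momentum
    x = x + step_size * p
    p = p + 0.5 * step_size * dlogp(x)        # closing half step
    return x, p

logp = lambda x: -0.5 * np.sum(x ** 2)        # standard normal, up to a constant
x_new, p_new = leapfrog(logp, np.zeros(2), np.array([1.0, -1.0]))
print(x_new, p_new)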

Other areas
Any work in AD remains applicable to the traditional application domains of AD in industry and academia (Corliss et al., 2002)
Computational fluid dynamics
Atmospheric chemistry
Engineering design optimization
Computational finance

40/40

Thank You!

References
• Baydin AG, Pearlmutter BA, Radul AA, Siskind JM (Submitted) Automatic differentiation in machine learning: a survey [arXiv:1502.05767]
• Baydin AG, Pearlmutter BA, Siskind JM (Submitted) DiffSharp: automatic differentiation library [arXiv:1511.07727]
• Bengio Y (2013) Deep learning of representations: looking forward. Statistical Language and Speech Processing. LNCS 7978:1–37 [arXiv:1404.7456]
• Graves A, Wayne G, Danihelka I (2014) Neural Turing machines. [arXiv:1410.5401]
• Grefenstette E, Hermann KM, Suleyman M, Blunsom P (2015) Learning to transduce with unbounded memory. [arXiv:1506.02516]
• Griewank A, Walther A (2008) Evaluating Derivatives: Principles and Techniques of Algorithmic Differentiation. Society for Industrial and Applied Mathematics, Philadelphia [DOI 10.1137/1.9780898717761]
• He K, Zhang X, Ren S, Sun J (2015) Deep residual learning for image recognition. [arXiv:1512.03385]
• Joulin A, Mikolov T (2015) Inferring algorithmic patterns with stack-augmented recurrent nets. [arXiv:1503.01007]
• Maclaurin D, Duvenaud D, Adams RP (2015) Gradient-based hyperparameter optimization through reversible learning. [arXiv:1502.03492]
• Manzyuk O, Pearlmutter BA, Radul AA, Rush DR, Siskind JM (2012) Confusion of tagged perturbations in forward automatic differentiation of higher-order functions. [arXiv:1211.4892]
• Pearlmutter BA, Siskind JM (2008) Reverse-mode AD in a functional framework: Lambda the ultimate backpropagator. ACM TOPLAS 30(2):7 [DOI 10.1145/1330017.1330018]
• Siskind JM, Pearlmutter BA (2008) Nesting forward-mode AD in a functional framework. Higher Order and Symbolic Computation 21(4):361–76 [DOI 10.1007/s10990-008-9037-1]
• Sukhbaatar S, Szlam A, Weston J, Fergus R (2015) Weakly supervised memory networks. [arXiv:1503.08895]
• Syme D (2006) Leveraging .NET meta-programming components from F#: integrated queries and interoperable heterogeneous execution. 2006 Workshop on ML. ACM.
• Vinyals O, Toshev A, Bengio S, Erhan D (2014) Show and tell: a neural image caption generator. [arXiv:1411.4555]
• Wengert R (1964) A simple automatic derivative evaluation program. Communications of the ACM 7:463–4
• Zaremba W, Sutskever I (2015) Reinforcement learning neural Turing machines. [arXiv:1505.00521]