

Differentiable programming and its applications to dynamical systems

Adrián Hernández and José M. Amigó*

Centro de Investigación Operativa, Universidad Miguel Hernández, Avenida de la Universidad s/n, 03202 Elche, Spain

Abstract

Differentiable programming is the combination of classical neural network modules with algorithmic ones in an end-to-end differentiable model. These new models, which use automatic differentiation to calculate gradients, have new learning capabilities (reasoning, attention and memory). In this tutorial, aimed at researchers in nonlinear systems with prior knowledge of deep learning, we present this new programming paradigm, describe some of its new features such as attention mechanisms, and highlight the benefits they bring. Then, we analyse the uses and limitations of traditional deep learning models in the modeling and prediction of dynamical systems. Here, a dynamical system is meant to be a set of state variables that evolve in time under general internal and external interactions. Finally, we review the advantages and applications of differentiable programming to dynamical systems.

Keywords: Deep learning, differentiable programming, dynamical systems, attention, recurrent neural networks

1. Introduction

The increase in computing capabilities together with new deep learning models has led to great advances in several machine learning tasks [1, 2, 3].

Deep learning architectures such as Recurrent Neural Networks (RNNs) and Convolutional Neural Networks (CNNs), as well as the use of distributed representations in natural language processing, have made it possible to take into account the symmetries and the structure of the problem to be solved.

However, a major criticism of deep learning remains, namely, that it only performs perception, mapping inputs to outputs [4].

A new direction toward more general and flexible models is differentiable programming, that is, the combination of geometric modules (traditional neural networks) with more algorithmic modules in an end-to-end differentiable model. As a result, a differentiable program is a dynamic computational graph composed of differentiable functions that provides not only perception but also reasoning, attention and memory. To efficiently calculate derivatives, this approach uses automatic differentiation, an algorithmic technique similar to backpropagation and implemented in modern software packages such as PyTorch, Julia, etc.

∗Corresponding author

To keep our exposition concise, this tutorial is aimed at researchers in nonlinear systems with prior knowledge of deep learning; see [5] for an excellent introduction to the concepts and methods of deep learning. Therefore, this tutorial focuses right away on the limitations of traditional deep learning and the advantages of differentiable programming, with special attention to its application to dynamical systems. By a dynamical system we mean here and hereafter a set of state variables that evolve in time under the influence of internal and possibly external inputs.

Examples of differentiable programming techniques that have been successfully developed in recent years include

(i) attention mechanisms [6], which allow the model to automatically search and learn which parts of a source sequence are relevant to predict the target element,

(ii) self-attention,
(iii) end-to-end Memory Networks [7], and
(iv) Differentiable Neural Computers (DNCs) [8], which are neural networks (controllers) with an external read-write memory.

As expected, in recent years there has been a growing interest in applying deep learning techniques to dynamical systems. In this regard, RNNs and Long Short-Term Memories (LSTMs), specially designed for sequence modelling and temporal dependence, have been successful in various applications to dynamical systems such as model identification and time series prediction [9, 10, 11].

The performance of these models (e.g., encoder-decoder networks), however, degrades rapidly as the length of the input sequence increases, and they are not able to capture the dynamic (i.e., time-changing) interdependence between time steps. The combination of neural networks with new differentiable modules could overcome some of those problems and offer new opportunities and applications.

Among the potential applications of differentiable programming to dynamical systems let us mention

(i) attention mechanisms to select the relevant time steps and inputs,

(ii) memory networks to store historical data from dynamical systems and selectively use it for modelling and prediction, and

(iii) the use of differentiable components in scientific computing.

Despite some achievements, more work is still needed to verify the benefits of these models over traditional networks.

Thanks to software libraries that facilitate automatic differentiation, differentiable programming extends deep learning models with new capabilities (reasoning, memory, attention, etc.), and the models can be efficiently coded and implemented.

In the following sections of this tutorial we introduce differentiable programming and explain in detail why it is an extension of deep learning (Section 2). We describe some models based on this new approach, such as attention mechanisms (Section 3.1), memory networks and differentiable neural computers (Section 3.2), and continuous learning (Section 3.3). Then we review the use of deep learning in dynamical systems and its limitations (Section 4.1). Finally, we present the new opportunities that differentiable programming can bring to the modelling, simulation and prediction of dynamical systems (Section 4.2). The conclusions and outlook are summarized in Section 5.

2. From deep learning to differentiable programming

In recent years, we have seen major advances in the field of machine learning. The combination of deep neural networks with the computational capabilities of Graphics Processing Units (GPUs) [12] has improved the performance of several tasks (image recognition, machine translation, language modelling, time series prediction, game playing and more) [1, 2, 3]. Interestingly, deep learning models and architectures have evolved to take into account the structure of the problem to be solved.

Deep learning is a part of machine learning that is based on neural networks and uses multiple layers, where each layer extracts higher-level features from the input. RNNs are a special class of neural networks where outputs from previous steps are fed as inputs to the current step [13, 14]. This recurrence makes them appropriate for modelling dynamic processes and systems.

CNNs are neural networks that alternate convolutional and pooling layers to implement translational invariance [15]. They learn spatial hierarchies of features through backpropagation by using these building layers. CNNs are being applied successfully to computer vision and image processing [16].

Especially important is the use of distributed representations as inputs to natural language processing pipelines. With this technique, the words of the vocabulary are mapped to elements of a vector space with a much lower dimensionality [17, 18]. This word embedding is able to keep, in the learned vector space, some of the syntactic and semantic relationships present in the original data.

Let us recall that, in a feedforward neural network (FNN) composed of multiple layers, the output (without the bias term) at layer l (see Figure 1) is defined as

x^{l+1} = σ(W^l x^l),   (1)

W^l being the weight matrix at layer l, σ the activation function, and x^{l+1} the output vector at layer l and the input vector at layer l + 1. The weight matrices for the different layers are the parameters of the model.

Learning is the mechanism by which the parameters of a neural network are adapted to the environment in the training process. This is an optimization problem which has been addressed by using gradient-based methods in which, given a cost function f: R^n → R, the algorithm finds local minima w* = arg min_w f(w) by updating each layer parameter w_{ij} with the rule w_{ij} := w_{ij} − η ∇_{w_{ij}} f(w), where η > 0 is the learning rate.
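As a minimal, self-contained illustration of this gradient-descent update (a sketch only; the quadratic cost, the data and the learning rate below are arbitrary choices, not taken from the text):

```python
import numpy as np

# Toy cost f(w) = ||A w - b||^2 with a hand-coded gradient (illustrative only).
rng = np.random.default_rng(0)
A = rng.normal(size=(20, 5))
b = rng.normal(size=20)

def f(w):
    return np.sum((A @ w - b) ** 2)

def grad_f(w):
    return 2 * A.T @ (A @ w - b)

w = np.zeros(5)            # parameters
eta = 0.01                 # learning rate, eta > 0
for _ in range(500):       # rule: w := w - eta * grad f(w)
    w = w - eta * grad_f(w)

print(f(w))                # cost after training
```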

Apart from regarding neural networks as universal approximators, there is no sound theoretical explanation for the good performance of deep learning. Several theoretical frameworks have been proposed:

(i) As pointed out in [19], the class of functions of practical interest can be approximated with exponentially fewer parameters than the generic ones. Symmetry, locality and compositionality properties make it possible to have simpler neural networks.

(ii) From the point of view of information theory [20], an explanation has been put forward based on how much information each layer of the neural network retains and how this information varies with the training and testing process.

Figure 1: Multilayer neural network.

Although deep learning can implicitly implement logical reasoning [21], it has limitations that make it difficult to achieve more general intelligence [4]. Among these limitations, we can highlight the following:

(i) It only performs perception, representing a mapping between inputs and outputs.

(ii) It follows a hybrid model where synaptic weights perform both processing and memory tasks, but it does not have an explicit external memory.

(iii) It does not carry out conscious and sequential reasoning, a process that is based on perception and memory through attention.

A path to a more general intelligence, as we will see below, is the combination of geometric modules with more algorithmic modules in an end-to-end differentiable model. This approach, called differentiable programming, adds new parametrizable and differentiable components to traditional neural networks.

Differentiable programming, a broad term, is defined in [22] as a programming model (a model of how a computer program is executed), trainable with gradient descent, where neural networks are truly functional blocks with data-dependent branches and recursion.

Here, and for the purposes of this tutorial, we define differentiable programming as a programming model with the following characteristics:

(i) Programs are directed acyclic graphs.
(ii) Graph nodes are mathematical functions or variables, and the edges correspond to the flow of intermediate values between the nodes.
(iii) n is the number of nodes and l the number of input variables of the graph, with 1 ≤ l < n. v_i for i ∈ {1, ..., n} is the variable associated with node i.
(iv) E is the set of edges in the graph. For each (i, j) ∈ E we have i < j, therefore the graph is topologically ordered.
(v) f_i for i ∈ {l+1, ..., n} is the differentiable function computed by node i in the graph. α_i for i ∈ {l+1, ..., n} contains all input values for node i.
(vi) The forward algorithm or pass, given the input variables v_1, ..., v_l, calculates v_i = f_i(α_i) for i ∈ {l+1, ..., n} (a minimal sketch of this pass is given after the list).
(vii) The graph is dynamically constructed and composed of parametrizable functions that are differentiable and whose parameters are learned from data.
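The following sketch illustrates the forward pass of item (vi) on a toy graph (the particular node functions and values are hypothetical, chosen only for illustration):

```python
import math

# Toy computational graph: nodes 1..2 are inputs, nodes 3..5 compute functions.
# parents[i] lists the nodes feeding node i (the alpha_i of the definition).
parents = {3: [1, 2], 4: [3], 5: [3, 4]}
funcs = {
    3: lambda a, b: a * b,       # v3 = v1 * v2
    4: lambda a: math.tanh(a),   # v4 = tanh(v3)
    5: lambda a, b: a + b,       # v5 = v3 + v4  (output)
}

def forward(v1, v2):
    v = {1: v1, 2: v2}                                     # input variables v_1, ..., v_l
    for i in sorted(funcs):                                # topological order (i < j for edges)
        v[i] = funcs[i](*(v[p] for p in parents[i]))       # v_i = f_i(alpha_i)
    return v

print(forward(2.0, 3.0)[5])
```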

Then, neural networks are just a class of these differentiable programs, composed of classical blocks (feedforward and recurrent neural networks, etc.) and new ones such as differentiable branching, attention, memories, etc.

Differentiable programming can be seen as a continuation of the deep learning end-to-end architectures that have replaced, for example, the traditional linguistic components in natural language processing [23, 24]. To efficiently calculate the derivatives in a gradient descent, this approach uses automatic differentiation, an algorithmic technique similar to but more general than backpropagation.

Automatic differentiation, in its reverse mode and in contrast to manual, symbolic and numerical differentiation, computes the derivatives in a two-step process [25, 26]. As described in [25], and rearranging the indexes of the previous definition, a function f: R^n → R^m is constructed with intermediate variables v_i such that:

(i) variables v_{i−n} = x_i, i = 1, ..., n, are the input variables,
(ii) variables v_i, i = 1, ..., l, are the intermediate variables, and
(iii) variables y_{m−i} = v_{l−i}, i = m − 1, ..., 0, are the output variables.

In a first step, similar to the forward pass described before, the computational graph is built by populating the intermediate variables v_i and recording the dependencies. In a second step, called the backward pass, derivatives are calculated by propagating, for the output y_j being considered, the adjoints \bar{v}_i = ∂y_j / ∂v_i from the outputs to the inputs.

The reverse mode is more efficient to evaluate for functions with a large number of inputs (parameters) and a small number of outputs. When f: R^n → R, as is the case in machine learning with n very large and f the cost function, only one pass of the reverse mode is necessary to compute the gradient ∇f = (∂y/∂x_1, ..., ∂y/∂x_n).

In recent years, deep learning frameworks such as PyTorch have been developed that provide reverse-mode automatic differentiation [27]. The define-by-run philosophy of PyTorch, whose execution dynamically constructs the computational graph, facilitates the development of general differentiable programs.
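As a small illustration of define-by-run reverse-mode automatic differentiation (a sketch only; the function below, with its data-dependent loop, is an arbitrary example, not taken from the text):

```python
import torch

def f(x):
    # Data-dependent control flow: the graph is built as the code runs.
    y = x.sum()
    while y.abs() < 10.0:   # number of iterations depends on the input values
        y = y * 2 + x.prod()
    return y

x = torch.tensor([1.5, -0.5, 2.0], requires_grad=True)
y = f(x)          # forward pass records the operations actually executed
y.backward()      # reverse-mode pass accumulates dy/dx_i in x.grad
print(x.grad)     # gradient of the scalar output with respect to all inputs
```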

Differentiable programming is an evolution of classical (traditional) software programming where, as shown in Table 1:

(i) Instead of specifying explicit instructions to the computer, an objective is set and an optimizable architecture is defined, which makes it possible to search over a subset of possible programs.

(ii) The program is defined by the input-output data and not predefined by the user.

(iii) The algorithmic elements of the program have to be differentiable, say, by converting them into differentiable blocks.

Classical                    Differentiable
Sequence of instructions     Sequence of diff. primitives
Fixed architecture           Optimizable architecture
User defined                 Data defined
Imperative programming       Declarative programming
Intuitive                    Abstract

Table 1: Differentiable vs classical programming.

RNNs, for example, are an evolution of feedforward networks because they are classical neural networks inside a for-loop (a control flow statement for iteration), which allows the neural network to be executed repeatedly with recurrence. However, this for-loop is a predefined feature of the model. Differentiable programming makes it possible to construct the graph dynamically and vary the length of the loop. The ideal situation would then be to augment the neural network with programming primitives (for-loops, if branches, while statements, external memories, logical modules, etc.) that are not predefined by the user but are parametrizable by the training data.

The trouble is that many of these programming primitives are not differentiable and need to be converted into optimizable modules. For instance, if the condition a of an "if" primitive (e.g., if a is satisfied do y(x), otherwise do z(x)) is to be learned, it can be the output of a neural network (a linear transformation and a sigmoid function), and the conditional primitive transforms into a weighted combination of both branches, a·y(x) + (1 − a)·z(x). Similarly, in an attention module, different weights that are learned with the model are assigned to give a different influence to each part of the input. Figure 2 shows the computational graph of such a conditional branching.
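A minimal sketch of this differentiable branching (the two branch functions and the layer sizes below are hypothetical, chosen only to illustrate the weighted combination a·y(x) + (1 − a)·z(x)):

```python
import torch
import torch.nn as nn

class DifferentiableIf(nn.Module):
    """Soft 'if': a learned gate a(x) in (0, 1) blends the two branches."""
    def __init__(self, dim):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(dim, 1), nn.Sigmoid())
        self.branch_y = nn.Linear(dim, dim)   # y(x), illustrative branch
        self.branch_z = nn.Linear(dim, dim)   # z(x), illustrative branch

    def forward(self, x):
        a = self.gate(x)                                        # condition, learned from data
        return a * self.branch_y(x) + (1 - a) * self.branch_z(x)

block = DifferentiableIf(dim=4)
out = block(torch.randn(8, 4))   # gradients flow through both branches
print(out.shape)
```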

Figure 2: Computational graph of differentiable branching.

The process of extending deep learning with differentiable primitives would consist of the following steps:

(i) Select a new function that improves the classical input-output transformation of deep learning, e.g., attention, continuous learning, memories, etc.

(ii) Convert this function into a directed acyclic graph, i.e., a sequence of parametrizable and differentiable functions. For example, Figure 2 shows this sequence of operations for differentiable branching, which is also used in attention.

(iii) Integrate this new function into the base model.

In this way, using differentiable programming we can combine traditional perception modules (CNN, RNN, FNN) with additional algorithmic modules that provide reasoning, abstraction and memory [28]. In the following section we describe, by following this process, some examples of this approach that have been developed in recent years.


3. Differentiable learning and reasoning

3.1. Differentiable attention

One of the aforementioned limitations of deep learning models is that they do not perform conscious and sequential reasoning, a process that is based on perception and memory through attention.

Reasoning is the process of consciously establishing and verifying facts by combining attention with new or existing information. An attention mechanism allows the brain to focus on one part of the input or memory (image, text, etc.), giving less attention to others.

Attention mechanisms have provided, and will continue to provide, a paradigm shift in machine learning: from traditional large-scale vector transformations to a more conscious process that focuses only on a set of elements, e.g., decomposing a problem into a sequence of attention-based reasoning operations [29].

Figure 3: Attention diagram.

One way to make this attention process differentiable is to make it a convex combination of the input or memory, where all the steps are differentiable and the combination weights are parametrizable.

As in [30], this differentiable attention process is described as mapping a query and a set of key-value pairs to an output:

att(q, s) = \sum_{i=1}^{T} α_i(q, k_i) V_i,   (2)

where, as seen in Figure 3, k_i and V_i are the key and the value vectors from the source/memory s, and q is the query vector. α_i(q, k_i) is the similarity function between the query and the corresponding key and is calculated by applying the softmax function

Softmax(z_i) = exp(z_i) / \sum_{i'} exp(z_{i'})   (3)

to the score function score(q, k_i):

α_i = exp(score(q, k_i)) / \sum_{i'=1}^{T} exp(score(q, k_{i'})).   (4)

The score function can be computed using a feedforward neural network,

score(q, k_i) = Z_a tanh(W_a [q, k_i]),   (5)

as proposed in [6], where Z_a and W_a are matrices to be jointly learned with the rest of the model and [q, k_i] is a linear function or concatenation of q and k_i. Also, in [31] the authors use a cosine similarity measure for content-based attention, namely,

score(q, k_i) = cos((q, k_i)),   (6)

where ((q, k_i)) denotes the angle between q and k_i.

Then, differentiable attention can be seen as a sequential process of reasoning in which the task (query) is guided by a set of elements of the input source (or memory) using attention.
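The sketch below implements Equations (2)-(5) with the additive score of [6] (a minimal version; the dimensions, the single query and the random tensors are illustrative assumptions):

```python
import torch
import torch.nn as nn

class AdditiveAttention(nn.Module):
    """att(q, s) = sum_i softmax(score(q, k_i)) * V_i, with an MLP score."""
    def __init__(self, d_q, d_k, d_a):
        super().__init__()
        self.W_a = nn.Linear(d_q + d_k, d_a, bias=False)   # W_a [q, k_i]
        self.Z_a = nn.Linear(d_a, 1, bias=False)           # Z_a tanh(...)

    def forward(self, q, keys, values):
        # q: (d_q,), keys: (T, d_k), values: (T, d_v)
        qk = torch.cat([q.expand(keys.size(0), -1), keys], dim=-1)
        scores = self.Z_a(torch.tanh(self.W_a(qk))).squeeze(-1)   # Eq. (5)
        alpha = torch.softmax(scores, dim=0)                      # Eq. (4)
        return alpha @ values, alpha                              # Eq. (2)

attn = AdditiveAttention(d_q=8, d_k=8, d_a=16)
out, weights = attn(torch.randn(8), torch.randn(5, 8), torch.randn(5, 6))
print(out.shape, weights.sum())   # context vector and weights summing to 1
```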

The attention process can focus on:

(i) Temporal dimensions, e.g., different time steps of a sequence.
(ii) Spatial dimensions, e.g., different regions of an image.
(iii) Different elements of a memory.
(iv) Different features or dimensions of an input vector, etc.

Depending on where the process is initiated, we have:

(i) Top-down attention, initiated by the current task.
(ii) Bottom-up attention, initiated spontaneously by the source or memory.

3.1.1. Attention mechanisms in seq2seq models

RNNs (see Figure 4) are a basic component of modern deep learning architectures, especially of encoder-decoder networks. The following equations define the time evolution of an RNN:

h_t = f^h(W^{ih} x_t + W^{hh} h_{t−1}),   (7)

y_t = f^o(W^{ho} h_t),   (8)

W^{ih}, W^{hh} and W^{ho} being weight matrices. f^h and f^o are the hidden and output activation functions, while x_t, h_t and y_t are the network input, hidden state and output.

An evolution of RNNs are LSTMs [32], an RNN structure with gated units, i.e., regulators. LSTMs are composed of a cell, an input gate, an output gate and a forget gate, and allow gradients to flow unchanged.


Figure 4: Temporal structure of a recurrent neural network.

The memory cell remembers values over arbitrary time intervals and the three gates regulate the flow of information into and out of the cell.
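A minimal sketch of the recurrence in Equations (7)-(8) (the dimensions and the tanh/identity activation choices are illustrative assumptions):

```python
import torch

n, m, o = 3, 5, 2                 # input, hidden and output sizes (illustrative)
W_ih = torch.randn(m, n) * 0.1    # input-to-hidden weights
W_hh = torch.randn(m, m) * 0.1    # hidden-to-hidden weights
W_ho = torch.randn(o, m) * 0.1    # hidden-to-output weights

def rnn_step(x_t, h_prev):
    h_t = torch.tanh(W_ih @ x_t + W_hh @ h_prev)   # Eq. (7), f^h = tanh
    y_t = W_ho @ h_t                               # Eq. (8), f^o = identity
    return h_t, y_t

h = torch.zeros(m)
for x_t in torch.randn(10, n):    # unroll over a sequence of 10 time steps
    h, y = rnn_step(x_t, h)
print(y)
```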

An encoder-decoder network maps an input sequence to a target one, with both sequences of arbitrary length [2]. These networks have applications ranging from machine translation to time series prediction.

Figure 5: An encoder-decoder network.

More specifically, this mechanism uses an RNN (or any of its variants, an LSTM or a GRU, Gated Recurrent Unit) to map the input sequence to a fixed-length vector, and another RNN (or any of its variants) to decode the target sequence from that vector (see Figure 5). Such a seq2seq model normally features an architecture composed of:

(i) An encoder which, given an input sequence X = (x_1, x_2, ..., x_T) with x_t ∈ R^n, maps x_t to

h_t = f_1(h_{t−1}, x_t),   (9)

where h_t ∈ R^m is the hidden state of the encoder at time t, m is the size of the hidden state and f_1 is an RNN (or any of its variants).

(ii) A decoder, where s_t is the hidden state and whose initial state s_0 is initialized with the last hidden state of the encoder h_T. It generates the output sequence Y = (y_1, y_2, ..., y_{T'}), y_t ∈ R^o (the dimension o depending on the task), with

y_t = f_2(s_{t−1}, y_{t−1}),   (10)

where f_2 is an RNN (or any of its variants) with an additional softmax layer.

Because the encoder compresses all the information of the input sequence in a fixed-length vector (the final hidden state h_T), the decoder possibly does not take into account the first elements of the input sequence. The use of this fixed-length vector is a limitation on the performance of encoder-decoder networks. Moreover, the performance of encoder-decoder networks degrades rapidly as the length of the input sequence increases [33]. This occurs in applications such as machine translation and time series prediction, where it is necessary to model long time dependencies.

The key to solving this problem is to use an attention mechanism. In [6] an extension of the basic encoder-decoder architecture was proposed by allowing the model to automatically search and learn which parts of a source sequence are relevant to predict the target element. Instead of encoding the input sequence in a fixed-length vector, it generates a sequence of vectors, choosing the most appropriate subset of these vectors during the decoding process.

With the attention mechanism, the encoder is a bidirectional RNN [34] with a forward hidden state \overrightarrow{h}_i = f_1(\overrightarrow{h}_{i−1}, x_i) and a backward one \overleftarrow{h}_i = f_1(\overleftarrow{h}_{i+1}, x_i). The encoder state is represented as a simple concatenation of the two states,

h_i = [\overrightarrow{h}_i; \overleftarrow{h}_i],   (11)

with i = 1, ..., T. The encoder state includes both the preceding and following elements of the sequence, thus capturing information from neighbouring inputs.

The decoder has an output

y_t = f_2(s_{t−1}, y_{t−1}, c_t)   (12)

for t = 1, ..., T'. f_2 is an RNN with an additional softmax layer, and the input is a concatenation of y_{t−1} with the context vector c_t, which is a sum of hidden states of the input sequence weighted by alignment scores:

c_t = \sum_{i=1}^{T} α_{ti} h_i.   (13)

Similar to Equation (4), the weight α_{ti} of each state h_i is calculated by

α_{ti} = exp(score(s_{t−1}, h_i)) / \sum_{i'=1}^{T} exp(score(s_{t−1}, h_{i'})).   (14)

In this attention mechanism, the query is the state s_{t−1}, and the key and the value are the hidden states h_i. The score measures how well the input at position i and the output at position t match. The α_{ti} are the weights that implement the attention mechanism, defining how much of each input hidden state should be considered when deciding the next state s_t and generating the output y_t (see Figure 6).

Figure 6: An encoder-decoder network with attention.

As we have described previously, the score function can be parametrized using different alignment models such as feedforward networks and the cosine similarity.

An example of a matrix of alignment scores can be seen in Figure 7. This matrix provides interpretability to the model, since it shows which part (time step) of the input is more important to the output.

3.2. Other attention mechanisms and differentiable neural computers

A variant of the attention mechanism is self-attention, in which the attention component relates different positions of a single sequence in order to compute a representation of the sequence. In this way, the keys, values and queries come from the same source. The mechanism can connect distant elements of the sequence more directly than using RNNs [35].

Figure 7: A matrix of alignment scores.

Another variant of attention is end-to-end memory networks [7], which we describe in Section 4.2.2 and which are neural networks with a recurrent attention model over an external memory. The model, trained end-to-end, outputs an answer based on a query and a set of inputs x_1, x_2, ..., x_n stored in a memory.

Traditional computers are based on the von Neumann architecture, which has two basic components: the CPU (Central Processing Unit), which carries out the program instructions, and the memory, which is accessed by the CPU to perform write/read operations. In contrast, neural networks follow a hybrid model where synaptic weights perform both processing and memory tasks.

Neural networks and deep learning models are good at mapping inputs to outputs but are limited in their ability to use facts from previous events and store useful information. Differentiable Neural Computers (DNCs) [8] try to overcome these shortcomings by combining neural networks with an external read-write memory.

As described in [8], a DNC is a neural network, called the controller (playing the role of a differentiable CPU), with an external memory, an N × W matrix. The DNC uses differentiable attention mechanisms to define distributions (weightings) over the N rows and learn the importance each row has in a read or write operation.

To select the most appropriate memory components during read/write operations, a weighted sum w(i) is used over the memory locations i = 1, ..., N. The attention mechanism is used in three different ways:

(i) Access content (read or write) based on similarity.
(ii) Time-ordered access (temporal links) to recover the sequences in the order in which they were written.


(iii) Dynamic memory allocation, where the DNC assigns and releases memory based on usage percentage.

At each time step, the DNC gets an input vector and emits an output vector that is a function of the combination of the input vector and the memories selected.

DNCs, by combining the following characteristics, have very promising applications in complex tasks that require both perception and reasoning:

(i) The classical perception capability of neural networks.
(ii) Read and write capabilities based on content similarity and learned by the model.
(iii) The use of previous knowledge to plan and reason.
(iv) End-to-end differentiability of the model.
(v) Implementation using software packages with automatic differentiation libraries such as PyTorch, TensorFlow or similar.

3.3. Meta-plasticity and continuous learning

The combination of geometric modules (classical neural networks) with algorithmic ones adds new learning capabilities to deep learning models. In the previous sections we have seen that one way to improve the learning process is by focusing on certain elements of the input or a memory and making this attention differentiable.

Another natural way to improve the process of learning is to incorporate differentiable primitives that add flexibility and adaptability. A source of inspiration is neuromodulators, which furnish the traditional synaptic transmission with new computational and processing capabilities [36].

Unlike the continuous learning capabilities of animal brains, which allow animals to adapt quickly to experience, in neural networks, once the training is completed, the parameters are fixed and the network stops learning. To solve this issue, in [37] a differentiable plasticity component is attached to the network that helps previously-trained networks adapt to ongoing experience.

The process to introduce the differentiable plastic component in the network is as follows. The activation y_j of neuron j has a conventional fixed weight w_{ij} and a plastic component α_{ij} H_{ij}(t), where α_{ij} is a structural parameter tuned during the training period and H_{ij}(t) a plastic component automatically updated as a function of ongoing inputs and outputs. The equations for the activation of y_j with learning rate η, as described in [37], are:

y_j = tanh( \sum_{i ∈ inputs} (w_{ij} + α_{ij} H_{ij}(t)) y_i ),   (15)

H_{ij}(t + 1) = η y_i y_j + (1 − η) H_{ij}(t).   (16)

Then, during the initial training period, w_{ij} and α_{ij} are trained using gradient descent and, after this period, the model keeps learning from ongoing experience.
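A minimal sketch of Equations (15)-(16) for a single layer (the sizes, initialization and input stream are hypothetical; only the two update rules follow the text, and no gradient training step is shown):

```python
import torch

n_in, n_out, eta = 4, 3, 0.1                           # illustrative sizes and learning rate
w = torch.randn(n_in, n_out, requires_grad=True)       # fixed weights w_ij (trained)
alpha = torch.randn(n_in, n_out, requires_grad=True)   # plastic coefficients alpha_ij (trained)
H = torch.zeros(n_in, n_out)                           # Hebbian trace H_ij(t)

def plastic_step(y_in, H):
    y_out = torch.tanh(y_in @ (w + alpha * H))                 # Eq. (15)
    H_new = eta * torch.outer(y_in, y_out) + (1 - eta) * H     # Eq. (16)
    return y_out, H_new

y = torch.rand(n_in)
for _ in range(20):                 # ongoing experience keeps updating H
    y_out, H = plastic_step(y, H)
    y = torch.rand(n_in)            # hypothetical stream of inputs
print(y_out)
```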

4. Dynamical systems and differentiable programming

4.1. Modeling dynamical systems with neural networks

Dynamical systems deal with time-evolutionary processes and their corresponding systems of equations. At any given time, a dynamical system has a state that can be represented by a point in a state space (manifold). The evolutionary process of the dynamical system describes what future states follow from the current state. This process can be deterministic, if its entire future is uniquely determined by its current state, or non-deterministic otherwise [38] (e.g., a random dynamical system [39]). Furthermore, it can be a continuous-time process, represented by differential equations or, as in this paper, a discrete-time process, represented by difference equations or maps. Thus,

h_t = f(h_{t−1}; θ)   (17)

for autonomous discrete-time deterministic dynamical systems with parameters θ, and

h_t = f(h_{t−1}, x_t; θ)   (18)

for non-autonomous discrete-time deterministic dynamical systems driven by an external input x_t.
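As a concrete toy instance of Equation (18) (purely illustrative; the driven logistic map, its parameter value and the modulo wrap used to keep the state bounded are not taken from the text):

```python
import numpy as np

def f(h_prev, x_t, theta=3.8):
    """Driven logistic map: h_t = theta * h_{t-1} * (1 - h_{t-1}) + x_t (mod 1)."""
    return (theta * h_prev * (1.0 - h_prev) + x_t) % 1.0

rng = np.random.default_rng(1)
x = 0.05 * rng.standard_normal(100)   # external input sequence x_t
h = 0.3                               # initial state
trajectory = []
for x_t in x:
    h = f(h, x_t)                     # Eq. (18): non-autonomous evolution
    trajectory.append(h)
print(trajectory[-5:])
```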

Dynamical systems have important applications in physics, chemistry, economics, engineering, biology and medicine [40]. They are relevant even in day-to-day phenomena with great social impact such as tsunami warning, earth temperature analysis and financial market prediction.

Dynamical systems that contain a very large number of variables interacting with each other in non-trivial ways are sometimes called complex (dynamical) systems [41]. Their behaviour is intrinsically difficult to model due to the dependencies and interactions between their parts, and they have emergent properties arising from these interactions, such as adaptation, evolution, learning, etc.

Here we consider discrete-time, deterministic and non-autonomous (i.e., the time evolution depending also on exogenous variables) dynamical systems as well as the more general complex systems. Specifically, the dynamical systems of interest range from systems of difference equations with multiple time delays to systems with a dynamic (i.e., time-changing) interdependence between time steps. Notice that the former ones may be rewritten as higher-dimensional systems with time delay 1.

On the other hand, in recent years deep learning models have been very successful in performing various tasks such as image recognition, machine translation, game playing, etc. When the amount of training data is sufficient and the distribution that generates the real data is the same as the distribution of the training data, these models perform extremely well and approximate the input-output relation.

In view of the importance of dynamical systems for modeling physical, biological and social phenomena, there is a growing interest in applying deep learning techniques to dynamical systems. This can be done in different contexts, such as:

(i) Modeling dynamical systems with known structure and equations but non-analytical or complex solutions [42].

(ii) Modeling dynamical systems without knowledge of the underlying governing equations [43, 44]. In this regard, let us mention that commercial initiatives are emerging that combine large amounts of meteorological data with deep learning models to improve weather predictions.

(iii) Modeling dynamical systems with partial or noisy data [45].

A key aspect in modelling dynamical systems is temporal dependence. There are two ways to introduce it into a neural network [46]:

(i) A classical feedforward neural network with time-delayed states in the inputs, but perhaps with an unnecessary increase in the number of parameters.

(ii) A recurrent neural network (RNN) which, as shown in Equations (7) and (8), has a temporal recurrence that makes it appropriate for modelling discrete dynamical systems of the form given in Equations (17) and (18).

Thus, RNNs, specially designed for sequence modelling [47], seem the ideal candidates to model, analyze and predict dynamical systems in the broad sense used in this tutorial. The temporal recurrence of RNNs theoretically makes it possible to model and identify dynamical systems described by equations with any temporal dependence.

To learn chaotic dynamics, recurrent radial basis function (RBF) networks [48] and evolutionary algorithms that generate RNNs [49] have been proposed. "Nonlinear Autoregressive model with exogenous input" (NARX) networks [50] and boosted RNNs [51] have been applied to predict chaotic time series.

However, a difficulty with RNNs is the vanishing gradient problem [52]. RNNs are trained by unfolding them into deep feedforward networks, creating a new layer for each time step of the input sequence. When backpropagation computes the gradient by the chain rule, this gradient vanishes as the number of time steps increases. As a result, for long input-output sequences, as depicted in Figure 8, RNNs have trouble modelling long-term dependencies, that is, relationships between elements that are separated by large periods of time.

Figure 8: Vanishing gradient problem in RNNs. Information sensitivity decays over time, forgetting the first input.

To overcome this problem, LSTMs were proposed. LSTMs have an advantage over basic RNNs due to their relative insensitivity to temporal delays and, therefore, are appropriate for modeling and making predictions based on time series whenever there exist temporal dependencies of unknown duration. With the appropriate number of hidden units and activation functions [10], LSTMs can model and identify any non-linear dynamical system of the form:

h_t = f(x_t, ..., x_{t−T}, h_{t−1}, ..., h_{t−T}),   (19)

y_t = g(h_t),   (20)

where f and g are the state and output functions, while x_t, h_t and y_t are the system input, state and output.

LSTMs have succeeded in various applications to dynamical systems such as model identification and time series prediction [9, 10, 11].


Another remarkable application of LSTMs has been machine translation [2, 53], using the encoder-decoder architecture described in Section 3.1.1.

However, as we have seen, the decoder possibly does not take into account the first elements of the input sequence because the encoder compresses all the information of the input sequence in a fixed-length vector. Hence, the performance of encoder-decoder networks degrades rapidly as the length of the input sequence increases, and this can be a problem in time series analysis, where predictions are based upon a long segment of the series.

Furthermore, as depicted in Figure 9, a complex dynamic may feature interdependencies between time steps that vary with time. In this situation, the equation that defines the temporal evolution may change at each t ∈ {1, ..., T}. For these dynamical systems, adding an attention module like the one described in Equation (13) can help model such time-changing interdependencies.

Figure 9: Temporal interdependencies in a dynamical system.

4.2. Improving dynamical systems with differentiable programming

Deep learning models, together with graphics processors and large amounts of data, have improved the modeling of dynamical systems, but this has some limitations such as those mentioned in the previous section. The combination of neural networks with new differentiable algorithmic modules is expected to overcome some of those shortcomings and offer new opportunities and applications.

In the next three subsections we illustrate with examples the kind of applications of differentiable programming to dynamical systems we have in mind, namely: implementations of attention mechanisms, memory networks, scientific simulations and modeling in physics.

4.2.1. Attention mechanisms in dynamical systems

In the previous sections we have described the attention mechanism, which allows a task to be guided by a set of elements of the input or memory source. When applying this mechanism to dynamical systems modeling or prediction, it is necessary to decide the following aspects:

(i) In which phase or phases of the model should the attention mechanism be introduced?
(ii) What dimension is the mechanism going to focus on? Temporal, spatial, etc.
(iii) What parts of the system will correspond to the query, the key and the value?

One option, which is also quite illustrative, is to use a dual-stage attention, an encoder with input attention and a decoder with temporal attention, as pointed out in [54].

Here we describe this option, in which the first stage extracts the relevant input features and the second selects the relevant time steps of the model. In many dynamical systems there are long-term dependencies between time steps, and these dependencies can be dynamic, i.e., time-changing. In these cases, attention mechanisms learn to focus on the most relevant parts of the system input or state.

X = (x_1, x_2, ..., x_T) with x_t ∈ R^n represents the input sequence. T is the length of the time interval and n the number of input features or dimensions. At each time step t, x_t = (x_t^1, x_t^2, ..., x_t^n).

Encoder with input attention

The encoder, given an input sequence X, maps u_t to

h_t = f_1(h_{t−1}, u_t),   (21)

where h_t ∈ R^m is the hidden state of the encoder at time t, m is the size of the hidden state and f_1 is an RNN (or any of its variants). x_t is replaced by u_t, which adaptively selects the relevant input features with

u_t = (α_t^1 x_t^1, α_t^2 x_t^2, ..., α_t^n x_t^n).   (22)

α_t^k is the attention weight measuring the importance of the k-th input feature at time t and is computed by

α_t^k = exp(score(h_{t−1}, x^k)) / \sum_{i=1}^{n} exp(score(h_{t−1}, x^i)),   (23)

where x^k = (x_1^k, x_2^k, ..., x_T^k) is the k-th input feature series and the score function can be computed using a feedforward neural network, a cosine similarity measure or other similarity functions.

Then, this first attention stage extracts the relevant input features, as seen in Figure 10 with the corresponding query, keys and values.
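A minimal sketch of this input-attention stage, Equations (22)-(23) (the layer sizes and the feedforward score network are illustrative assumptions):

```python
import torch
import torch.nn as nn

class InputAttention(nn.Module):
    """Weights each of the n input features of x_t using the previous hidden state."""
    def __init__(self, n_features, seq_len, hidden_size):
        super().__init__()
        # score(h_{t-1}, x^k): a small feedforward net over [h_{t-1}, x^k]
        self.score = nn.Sequential(
            nn.Linear(hidden_size + seq_len, 32), nn.Tanh(), nn.Linear(32, 1))

    def forward(self, x_t, h_prev, X):
        # x_t: (n,), h_prev: (m,), X: (T, n) full input series (columns are x^k)
        feats = X.t()                                                 # (n, T), rows are x^k
        pairs = torch.cat([h_prev.expand(feats.size(0), -1), feats], dim=-1)
        alpha = torch.softmax(self.score(pairs).squeeze(-1), dim=0)   # Eq. (23)
        return alpha * x_t                                            # u_t of Eq. (22)

att = InputAttention(n_features=4, seq_len=20, hidden_size=16)
u_t = att(torch.randn(4), torch.zeros(16), torch.randn(20, 4))
print(u_t.shape)
```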


Figure 10: Diagram of the input attention mechanism.

Decoder with temporal attention

Similar to the attention decoder described in Section 3.1.1, the decoder has an output

y_t = f_2(s_{t−1}, y_{t−1}, c_t)   (24)

for t = 1, ..., T'. f_2 is an RNN (or any of its variants) with an additional linear or softmax layer, and the input is a concatenation of y_{t−1} with the context vector c_t, which is a sum of hidden states of the input sequence weighted by alignment scores:

c_t = \sum_{i=1}^{T} β_t^i h_i.   (25)

The weight β_t^i of each state h_i is computed using the similarity function score(s_{t−1}, h_i) and applying a softmax function, as described in Section 3.1.1.

This second attention stage selects the relevant time steps, as shown in Figure 11 with the corresponding query, keys and values.

Further remarks

In [54], the authors define this dual-stage attention RNN and show that the model outperforms a classical model in time series prediction.

In [55], a comparison is made between LSTMs and attention mechanisms for financial time series forecasting. It is shown there that an LSTM with attention performs better than stand-alone LSTMs.

A temporal attention layer is used in [56] to select relevant information and to provide model interpretability, an essential feature to understand deep learning models. Interpretability is further studied in detail in [57], concluding that attention weights partially reflect the impact of the input elements on model prediction.

Figure 11: Diagram of the temporal attention mechanism.

Despite the theoretical advantages and some achievements, further studies are needed to verify the benefits of the attention mechanism over traditional networks.

4.2.2. Memory networks

Memory networks allow long-term dependencies in sequential data to be learned thanks to an external memory component. Instead of taking into account only the most recent states, memory networks consider the entire list of entries or states.

Here we define one possible application of memory networks to dynamical systems, following an approach based on [7]. We are given a time series of historical data n_1, ..., n_{T'} with n_i ∈ R^n, and the input series x_1, ..., x_T with x_t ∈ R^n the current input, which is the query.

The set {n_i} is converted into memory vectors {m_i} and output vectors {c_i} of dimension d. The query x_t is also transformed to obtain an internal state u_t of dimension d. These transformations are linear: A n_i = m_i, B n_i = c_i, C x_t = u_t, with A, B, C parametrizable matrices.

A match between u_t and each memory vector m_i is computed by taking the inner product followed by a softmax function:

p_t^i = Softmax(u_t^T m_i).   (26)

The final vector from the memory, o_t, is a weighted sum over the transformed inputs {c_i}:

o_t = \sum_i p_t^i c_i.   (27)

To generate the final prediction y_t, a linear layer is applied to the sum of the output vector o_t and the transformed input u_t, and to the previous output y_{t−1}:

y_t = W_1(o_t + u_t) + W_2 y_{t−1}.   (28)

This model is differentiable end-to-end, learning the matrices (the final matrices W_1 and W_2 and the three transformation matrices A, B and C) to minimize the prediction error.
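A minimal sketch of the forward pass in Equations (26)-(28) (the dimensions and the random data are illustrative assumptions; the matrices A, B, C, W_1, W_2 would be learned end-to-end):

```python
import torch
import torch.nn as nn

class MemoryNetStep(nn.Module):
    """One prediction step of the memory network of Eqs. (26)-(28)."""
    def __init__(self, n, d, o):
        super().__init__()
        self.A = nn.Linear(n, d, bias=False)   # n_i -> m_i
        self.B = nn.Linear(n, d, bias=False)   # n_i -> c_i
        self.C = nn.Linear(n, d, bias=False)   # x_t -> u_t
        self.W1 = nn.Linear(d, o, bias=False)
        self.W2 = nn.Linear(o, o, bias=False)

    def forward(self, history, x_t, y_prev):
        m, c = self.A(history), self.B(history)          # memory and output vectors
        u = self.C(x_t)                                   # internal state u_t
        p = torch.softmax(m @ u, dim=0)                   # Eq. (26)
        o_t = p @ c                                       # Eq. (27)
        return self.W1(o_t + u) + self.W2(y_prev)         # Eq. (28)

step = MemoryNetStep(n=3, d=8, o=3)
y_t = step(torch.randn(50, 3), torch.randn(3), torch.zeros(3))
print(y_t)
```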

In [58] the authors propose a similar model based on memory networks, with a memory component, three encoders and an autoregressive component for multivariate time-series forecasting. Compared to non-memory RNN models, their model is better at modeling and capturing long-term dependencies and, moreover, it is interpretable.

Taking advantage of the highlighted capabilities of Differentiable Neural Computers (DNCs), an enhanced DNC for electroencephalogram (EEG) data analysis is proposed in [59]. By replacing the LSTM network controller with a recurrent convolutional network, the potential of DNCs in EEG signal processing is convincingly demonstrated.

4.2.3. Scientific simulation and physical modeling

Scientific modeling, as pointed out in [60], has traditionally employed three approaches:

(i) Direct modeling, if the exact function that relates input and output is known.
(ii) Using a machine learning model. As we have mentioned, neural networks are universal approximators.
(iii) Using a differential equation if some structure of the problem is known, for example, if the rate of change of the unknown function is a function of the physical variables.

Machine learning models have to learn the input-output transformation from scratch and need a lot of data. One way to make them more efficient is to combine them with a differentiable component suited to a specific problem. This component allows specific prior knowledge to be incorporated into deep learning models and can be a differentiable physical model or a differentiable ODE (ordinary differential equation) solver.

(i) Differentiable physical models.

Differentiable plasticity, as described in Section 3.3, can be applied to deep learning models of dynamical systems in order to help them adapt to ongoing data and experience. As done in [37], the plasticity component described in Equations (15) and (16) can be introduced in some layers of the deep learning architecture. In this way, the model can continuously learn because the plastic component is updated by neural activity.

DiffTaichi, a differentiable programming language for building differentiable physical simulations, is proposed in [62], integrating a neural network controller with a physical simulation module. A differentiable physics engine is presented in [63]. The system simulates rigid body dynamics and can be integrated in an end-to-end differentiable deep learning model for learning the physical parameters.

(ii) Differentiable ODE solvers.

As described in [60], an ODE can be embedded into a deep learning model. For example, the Euler method takes in the derivative function and the initial values and outputs the approximated solution. The derivative function could be a neural network. This solver is differentiable and can be integrated into a larger model that can be optimized using gradient descent (a minimal sketch of such an embedded solver is given after this list).

In [61] a differentiable model of a trebuchet is described. In a classical trebuchet model, the parameters (the mass of the counterweight and the angle of release) are fed into an ODE solver that calculates the distance, which is compared with the target distance. In the extended model, a neural network is introduced. The network takes two inputs, the target distance and the current wind speed, and outputs the trebuchet parameters, which are fed into the simulator to calculate the distance. This distance is compared with the target distance and the error is back-propagated through the entire model to optimize the parameters of the network. Then, the neural network is optimized so that the model can achieve any target distance. Using this extended model is faster than optimizing only the trebuchet.

This type of application shows how combining differentiable ODE solvers and deep learning models makes it possible to incorporate prior structure into the problem and makes the learning process more efficient. We may conclude that combining scientific computing and differentiable components will open new avenues in the coming years.

5. Conclusions and future directions

Differentiable programming is the use of new differentiable components beyond classical neural networks. This generalization of deep learning makes it possible to have architectures parametrized by the data instead of pre-fixed ones, and new learning capabilities such as reasoning, attention and memory.

The first models created under this new paradigm, such as attention mechanisms, differentiable neural computers and memory networks, are already having a great impact on natural language processing.

These new models and differentiable programming are also beginning to improve machine learning applications to dynamical systems. As we have seen, these models improve the capabilities of RNNs and LSTMs in the identification, modeling and prediction of dynamical systems. They even add a much-needed feature in machine learning: interpretability.

However, this is an emerging field and further research is needed in several directions. To mention a few:

(i) More comparative studies between attention mechanisms and LSTMs in predicting dynamical systems.
(ii) Use of self-attention and its possible applications to dynamical systems.
(iii) As with RNNs, a theoretical analysis (e.g., in the framework of dynamical systems) of attention and memory networks.
(iv) Clear guidelines so that scientists without advanced knowledge of machine learning can use new differentiable models in computational simulations.

Acknowledgments. This work was financially supported by the Spanish Ministry of Science, Innovation and Universities, grant MTM2016-74921-P (AEI/FEDER, EU).

References

[1] Y. LeCun, Y. Bengio, G. Hinton, Deep learning, Nature 521 (2015) 436–44. doi:10.1038/nature14539.
[2] I. Sutskever, O. Vinyals, Q. V. Le, Sequence to sequence learning with neural networks, in: NIPS, 2014.
[3] D. Silver, J. Schrittwieser, K. Simonyan, I. Antonoglou, A. Huang, A. Guez, T. Hubert, L. R. Baker, M. Lai, A. Bolton, Y. Chen, T. P. Lillicrap, F. F. C. Hui, L. Sifre, G. van den Driessche, T. Graepel, D. Hassabis, Mastering the game of Go without human knowledge, Nature 550 (2017) 354–359.
[4] G. Marcus, Deep learning: A critical appraisal, ArXiv abs/1801.00631.

[5] I. Goodfellow, Y. Bengio, A. Courville, Deep Learning, MIT Press, 2016, http://www.deeplearningbook.org.
[6] D. Bahdanau, K. Cho, Y. Bengio, Neural machine translation by jointly learning to align and translate, ArXiv 1409.
[7] S. Sukhbaatar, A. Szlam, J. Weston, R. Fergus, End-to-end memory networks, in: NIPS, 2015.
[8] A. Graves, G. Wayne, M. Reynolds, T. Harley, I. Danihelka, A. Grabska-Barwinska, S. G. Colmenarejo, E. Grefenstette, T. Ramalho, J. Agapiou, A. P. Badia, K. M. Hermann, Y. Zwols, G. Ostrovski, A. Cain, H. King, C. Summerfield, P. Blunsom, K. Kavukcuoglu, D. Hassabis, Hybrid computing using a neural network with dynamic external memory, Nature 538 (2016) 471–476.

[9] Z. Wang, D. Xiao, F. Fang, R. Govindan, C. Pain, Y. Guo, Model identification of reduced order fluid dynamics systems using deep learning, International Journal for Numerical Methods in Fluids 86. doi:10.1002/fld.4416.
[10] Y. Wang, A new concept using LSTM neural networks for dynamic system identification, 2017, pp. 5324–5329. doi:10.23919/ACC.2017.7963782.
[11] Y. Li, H. Cao, Prediction for tourism flow based on LSTM neural network, Procedia Computer Science 129 (2018) 277–283. doi:10.1016/j.procs.2018.03.076.
[12] O. Yadan, K. Adams, Y. Taigman, M. Ranzato, Multi-GPU training of convnets, CoRR abs/1312.5853.

[13] A. Graves, M. Liwicki, S. Fernandez, R. Bertolami, H. Bunke, J. Schmidhuber, A novel connectionist system for unconstrained handwriting recognition, IEEE Transactions on Pattern Analysis and Machine Intelligence 31 (2009) 855–868.
[14] A. Sherstinsky, Fundamentals of recurrent neural network (RNN) and long short-term memory (LSTM) network, ArXiv abs/1808.03314.
[15] Y. LeCun, L. Bottou, Y. Bengio, P. Haffner, Gradient-based learning applied to document recognition, Proceedings of the IEEE 86 (1998) 2278–2324. doi:10.1109/5.726791.
[16] R. Yamashita, M. Nishio, R. K. G. Do, K. Togashi, Convolutional neural networks: an overview and application in radiology, in: Insights into Imaging, 2018.

[17] Y. Bengio, R. Ducharme, P. Vincent, C. Janvin, A neural probabilistic language model, J. Mach. Learn. Res. 3 (2003) 1137–1155. URL http://dl.acm.org/citation.cfm?id=944919.944966
[18] T. Mikolov, I. Sutskever, K. Chen, G. Corrado, J. Dean, Distributed representations of words and phrases and their compositionality, in: Proceedings of the 26th International Conference on Neural Information Processing Systems - Volume 2, NIPS'13, Curran Associates Inc., USA, 2013, pp. 3111–3119. URL http://dl.acm.org/citation.cfm?id=2999792.2999959

[19] H. W. Lin, M. Tegmark, Why does deep and cheap learning work so well?, Journal of Statistical Physics. doi:10.1007/s10955-017-1836-5.
[20] R. Shwartz-Ziv, N. Tishby, Opening the black box of deep neural networks via information, ArXiv abs/1703.00810.
[21] P. Hohenecker, T. Lukasiewicz, Ontology reasoning with deep neural networks, ArXiv abs/1808.07980.
[22] F. Wang, Backpropagation with continuation callbacks: Foundations for efficient and expressive differentiable programming, NIPS'18, 2018.
[23] L. Deng, Y. Liu, A Joint Introduction to Natural Language Processing and to Deep Learning, Springer Singapore, Singapore, 2018, pp. 1–22.

[24] Y. Goldberg, Neural network methods for natural language processing, Synthesis Lectures on Human Language Technologies 10 (2017) 1–309. doi:10.2200/S00762ED1V01Y201703HLT037.
[25] A. G. Baydin, B. A. Pearlmutter, A. A. Radul, J. M. Siskind, Automatic differentiation in machine learning: a survey, Journal of Machine Learning Research 18 (153) (2018) 1–43. URL http://jmlr.org/papers/v18/17-468.html

[26] F. Wang, X. Wu, G. M. Essertel, J. M. Decker, T. Rompf, Demystifying differentiable programming: Shift/reset the penultimate backpropagator, ArXiv abs/1803.10228.
[27] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, A. Lerer, Automatic differentiation in PyTorch, in: NIPS-W, 2017.
[28] F. Yang, Z. Yang, W. W. Cohen, Differentiable learning of logical rules for knowledge base reasoning (2017) 2316–2325. URL http://dl.acm.org/citation.cfm?id=3294771.3294992
[29] D. A. Hudson, C. D. Manning, Compositional attention networks for machine reasoning, in: Proceedings of the International Conference on Learning Representations (ICLR), 2018.
[30] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, I. Polosukhin, Attention is all you need, in: NIPS, 2017.
[31] A. Graves, G. Wayne, I. Danihelka, Neural Turing machines, ArXiv abs/1410.5401.

[32] S. Hochreiter, J. Schmidhuber, Long short-term memory, Neural Computation 9 (1997) 1735–80. doi:10.1162/neco.1997.9.8.1735.
[33] K. Cho, B. van Merrienboer, D. Bahdanau, Y. Bengio, On the properties of neural machine translation: Encoder-decoder approaches, in: Proceedings of SSST-8, Eighth Workshop on Syntax, Semantics and Structure in Statistical Translation, Association for Computational Linguistics, Doha, Qatar, 2014, pp. 103–111. doi:10.3115/v1/W14-4012. URL https://www.aclweb.org/anthology/W14-4012
[34] A. Graves, N. Jaitly, A.-r. Mohamed, Hybrid speech recognition with deep bidirectional LSTM, 2013 IEEE Workshop on Automatic Speech Recognition and Understanding (2013) 273–278.
[35] G. Tang, M. Muller, A. Rios, R. Sennrich, Why self-attention? A targeted evaluation of neural machine translation architectures, in: EMNLP, 2018.

[36] A. Hernández, J. M. Amigó, Multilayer adaptive networks in neuronal processing, The European Physical Journal Special Topics 227 (2018) 1039–1049.
[37] T. Miconi, K. O. Stanley, J. Clune, Differentiable plasticity: training plastic neural networks with backpropagation, in: ICML, 2018.
[38] G. Layek, An Introduction to Dynamical Systems and Chaos, 2015. doi:10.1007/978-81-322-2556-0.
[39] L. Arnold, Random Dynamical Systems, 2003.
[40] T. Jackson, A. Radunskaya, Applications of Dynamical Systems in Biology and Medicine, Vol. 158, 2015. doi:10.1007/978-1-4939-2782-1.
[41] C. Gros, Complex and adaptive dynamical systems. A primer. 3rd ed., Vol. 1, 2008. doi:10.1063/1.3177233.
[42] S. Pan, K. Duraisamy, Long-time predictive modeling of nonlinear dynamical systems using neural networks, Complexity 2018 (2018) 4801012:1–4801012:26.

[43] P. Düben, P. Bauer, Challenges and design choices for global weather and climate models based on machine learning, Geoscientific Model Development 11 (2018) 3999–4009. doi:10.5194/gmd-11-3999-2018.
[44] K. Chakraborty, K. G. Mehrotra, C. K. Mohan, S. Ranka, Forecasting the behavior of multivariate time series using neural networks, Neural Networks 5 (1992) 961–970.
[45] K. Yeo, I. Melnyk, Deep learning algorithm for data-driven simulation of noisy dynamical system, Journal of Computational Physics 376 (2019) 1212–1231. doi:10.1016/j.jcp.2018.10.024.
[46] K. S. Narendra, K. Parthasarathy, Identification and control of dynamical systems using neural networks, IEEE Transactions on Neural Networks 1 (1) (1990) 4–27.

[47] B. Chang, M. Chen, E. Haber, E. H. Chi, AntisymmetricRNN: A dynamical system view on recurrent neural networks, in: International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=ryxepo0cFX
[48] T. Miyoshi, H. Ichihashi, S. Okamoto, T. Hayakawa, Learning chaotic dynamics in recurrent RBF network, 1995, pp. 588–593, vol. 1. doi:10.1109/ICNN.1995.488245.
[49] Y. Sato, S. Nagaya, Evolutionary algorithms that generate recurrent neural networks for learning chaos dynamics, in: Proceedings of IEEE International Conference on Evolutionary Computation, 1996, pp. 144–149. doi:10.1109/ICEC.1996.542350.
[50] E. Diaconescu, The use of NARX neural networks to predict chaotic time series, WSEAS Transactions on Computer Research 3.
[51] M. Assaad, R. Boné, H. Cardot, Predicting chaotic time series by boosted recurrent neural networks, Vol. 4233, 2006, pp. 831–840. doi:10.1007/11893257_92.
[52] Y. Bengio, P. Simard, P. Frasconi, Learning long-term dependencies with gradient descent is difficult, IEEE Transactions on Neural Networks 5 (1994) 157–66. doi:10.1109/72.279181.

[53] K. Cho, B. van Merrienboer, C. Gulcehre, F. Bougares, H. Schwenk, Y. Bengio, Learning phrase representations using RNN encoder-decoder for statistical machine translation. doi:10.3115/v1/D14-1179.
[54] Y. Qin, D. Song, H. Cheng, W. Cheng, G. Jiang, G. W. Cottrell, A dual-stage attention-based recurrent neural network for time series prediction, ArXiv abs/1704.02971.
[55] T. Hollis, A. Viscardi, S. E. Yi, A comparison of LSTMs and attention mechanisms for forecasting financial time series, ArXiv abs/1812.07699.
[56] P. Vinayavekhin, S. Chaudhury, A. Munawar, D. J. Agravante, G. D. Magistris, D. Kimura, R. Tachibana, Focusing on what is relevant: Time-series learning and understanding using attention, 2018 24th International Conference on Pattern Recognition (ICPR) (2018) 2624–2629.
[57] S. Serrano, N. A. Smith, Is attention interpretable?, in: ACL, 2019.
[58] Y.-Y. Chang, F.-Y. Sun, Y.-H. Wu, S. de Lin, A memory-network based solution for multivariate time-series forecasting, ArXiv abs/1809.02105.

[59] Y. Ming, D. Pelusi, C.-N. Fang, M. Prasad, Y.-K. Wang, D. Wu, C.-T. Lin, EEG data analysis with stacked differentiable neural computers, Neural Computing and Applications. doi:10.1007/s00521-018-3879-1.
[60] C. Rackauckas, M. Innes, Y. Ma, J. Bettencourt, L. White, V. Dixit, DiffEqFlux.jl - A Julia library for neural differential equations, ArXiv abs/1902.02376.
[61] M. Innes, A. Edelman, K. Fischer, C. Rackauckas, E. Saba, V. Shah, W. Tebbutt, Zygote: A differentiable programming system to bridge machine learning and scientific computing, ArXiv abs/1907.07587.
[62] Y. Hu, L. Anderson, T.-M. Li, Q. Sun, N. Carr, J. Ragan-Kelley, F. Durand, DiffTaichi: Differentiable programming for physical simulation, ArXiv abs/1910.00935.
[63] F. d. A. Belbute-Peres, K. A. Smith, K. R. Allen, J. B. Tenenbaum, J. Z. Kolter, End-to-end differentiable physics for learning and control, in: Proceedings of the 32nd International Conference on Neural Information Processing Systems, NIPS'18, Curran Associates Inc., USA, 2018, pp. 7178–7189. URL http://dl.acm.org/citation.cfm?id=3327757.3327820
