View
41
Download
0
Category
Preview:
Citation preview
Recurrent Neural Networks, Language, Semantics, and Disentangling Factors
Yoshua Bengio
August 29th, 2017 @ DS3
Data Science Summer School
Recurrent Neural Networks
• Selectively summarize an input sequence in a fixed-size state vector via a recursive update
2
stst− 1 st + 1
F✓ F✓ F✓
x tx t− 1 x t + 1x
sF✓
unfold
Generalizes naturally to new lengths not seen during training
shared over time
Recurrent Neural Networks
• Can produce an output at each time step: unfolding the graph tells us how to back-prop through time.
3
x tx t− 1 x t + 1x
unfold
V WW
W W W
V V V
U U U U
s
o
st− 1
ot− 1 ot
st st + 1
ot + 1
Generative RNNs
4
x tx t− 1 x t + 1
WW W W
V V V
U U U
st− 1
ot− 1 ot
st st + 1
ot + 1
L t + 1L t− 1 L t
x t + 2
• An RNN can represent a fully-connected directed generativemodel: every variable predicted from all previous ones.
Neural word embeddings - visualization
5
6
Conditional Distributions
• Sequence to vector
• Sequence to sequence of the same length, aligned
• Vector to sequence
• Sequence to sequence
stst− 1 st + 1
F✓ F✓ F✓
x tx t− 1 x t + 1x
sF✓
unfold
• During training, past yin input is from training data
• At generation time, past y in input isgenerated
• Mismatch cancause ”compounding error”
7
Test-time path
Training-time path
Maximum Likelihood =
Teacher Forcing
Ideas to reduce the train/generate mismatch in
teacher forcing
• Scheduled sampling (S. Bengio et al, NIPS 2015)
• Backprop through open-loop sampling recurrence & minimize long-term cost (but which one? GAN would be most natural Professor Forcing)
8
Related toSEARN (Daumé et al 2009)DAGGER (Ross et al 2010)
Gradually increase theprobability of usingthe model’s samplesvs the ground truthas input.
Increasing the Expressive Power of RNNs with
more Depth
• ICLR 2014, How to construct deep recurrent neural networks
9
y t -1
x t -1
h t -1
x t
h t
y t
x t + 1
h t + 1
y t + 1
x t
h t -1h t
y t
x t
h t -1h t
y tOrdinary RNNs
+ deep hid-to-out+ deep hid-to-hid+deep in-to-hid
+ skip connections for creating shorter paths
x t
h t -1
h t
y t
z t -1z t
+ stacking
Bidirectional RNNs, Recursive Nets,
Multidimensional RNNs, etc.
• The unfolded architecture needs not be a straight chain
10
(Multidimensional RNNs, Graves et al 2007)
Bidirectional RNNs (Schuster and Paliwal, 1997)
See Alex Graves’s work, e.g., 2012
y
L
x1 x2 x3x4
V V V V
o
Recursive (tree-structured)Neural Nets:
Frasconi et al 97Socher et al 2011
Multiplicative Interactions
• Multiplicative Integration RNNs:
• Replace
• By
• Or more general:
11
(Wu et al, 2016, arXiv:1606.06630)
Learning Long-Term
Dependencies with Gradient
Descent is Difficult
Y. Bengio, P. Simard & P. Frasconi, IEEE Trans. Neural Nets, 1994
Simple Experiments from 1991 while I was at
MIT
• 2 categories of sequences
• Can the single tanh unit learn to store for T time steps 1 bit of information given by the sign of initial input?
13
Prob(success | seq. length T)
How to store 1 bit? Dynamics with multiple
basins of attraction in some dimensions
• Some subspace of the state can store 1 or more bits of information if the dynamical system has multiple basins of attraction in some dimensions
14
Basins boundary
Bit=1
Bit=0
Note: gradients MUST be high near the boundary
Robustly storing 1 bit in the presence of
bounded noise
• With spectral radius > 1, noise can kick state out of attractor
• Not so with radius<1
15
UNSTABLE
CONTRACTIVE STABLE
Storing Reliably Vanishing gradients
• Reliably storing bits of information requires spectral radius<1
• The product of T matrices whose spectral radius is < 1 is a matrix whose spectral radius converges to 0 at exponential rate in T
• If spectral radius of Jacobian is < 1 propagated gradients vanish
16
Vanishing or Exploding Gradients
• Hochreiter’s 1991 MSc thesis (in German) had independently discovered that backpropagated gradients in RNNs tend to either vanish or explode as sequence length increases
17
Why it hurts gradient-based learning
• Long-term dependencies get a weight that is exponentially smaller (in T) compared to short-term dependencies
18
Becomes exponentially smallerfor longer time differences,when spectral radius < 1
Vanishing Gradients in Deep Nets are Different
from the Case in RNNs
• If it was just a case of vanishing gradients in deep nets, we could just rescale the per-layer learning rate, but that does not really fix the training difficulties.
• Can’t do that with RNNs because the weights are shared, & total true gradient = sum over different “depths”
19
12
34
To store information robustly the dynamics
must be contractive
• The RNN gradient is a product of Jacobian matrices, each associated with a step in the forward computation. To store information robustly in a finite-dimensional state, the dynamics must be contractive [Bengio et al 1994].
• Problems:
• e-values of Jacobians > 1 gradients explode
• or e-values < 1 gradients shrink & vanish
• or random variance grows exponentially
20
Storing bitsrobustly requirese-values<1
Gradient clipping
RNN Tricks (Pascanu, Mikolov, Bengio, ICML 2013; Bengio, Boulanger & Pascanu, ICASSP 2013)
• Clipping gradients (avoid exploding gradients)
• Leaky integration (propagate long-term dependencies)
• Momentum (cheap 2nd order)
• Initialization (start in right ballpark avoids exploding/vanishing)
• Sparse Gradients (symmetry breaking)
• Gradient propagation regularizer (avoid vanishing gradient)
• Gated self-loops (LSTM & GRU, reduces vanishing gradient)
21
Dealing with Gradient Explosion by Gradient
Norm Clipping
22
(Mikolov thesis 2012;Pascanu, Mikolov, Bengio, ICML 2013)
Conference version (1993) of the 1994 paper by
the same authors had a predecessor of GRU
and targetprop
• Flip-flop unit to store 1 bit, with gating signal to control when to write
• Pseudo-backprop through it by a form of targetprop
23
(The problem of learning long-term dependencies in recurrent networks, Bengio, Frasconi & Simard ICNN’1993)
x tx t− 1 x t + 1x
unfold
s
o
st− 1
ot− 1 ot
st st + 1
ot + 1
W1
W3
W1 W1 W1 W1
W3
st− 2
W3 W3 W3
Delays & Hierarchies to Reach Farther
• Delays and multiple time scales, Elhihi & Bengio NIPS 1995, Koutnik et al ICML 2014
• How to do this right?
• How to automatically
and adaptively do it?
24
Hierarchical RNNs (words / sentences):Sordoni et al CIKM 2015, Serban et al AAAI 2016
Fighting the vanishing gradient:
LSTM & GRU
• Create a path wheregradients can flow for longer with a self-loop
• Corresponds to an eigenvalue of Jacobianslightly less than 1
• LSTM is now heavily used(Hochreiter & Schmidhuber1997)
• GRU light-weight version (Cho et al 2014)
25
LSTM: (Hochreiter & Schmidhuber 1997)(Hochreiter 1991); first version of the LSTM, called Neural Long-Term Storage with self-loop
Fast Forward 20 years: Attention Mechanisms
for Memory Access
• Neural Turing Machines (Graves et al 2014)
• and Memory Networks (Weston et al 2014)
• Use a content-based attention mechanism(Bahdanau et al 2014) to control the readand write access into a memory
• The attention mechanism outputs a softmax over memory locations
26
write
read
Large Memory Networks: Sparse Access
Memory for Long-Term Dependencies
• Memory = part of the state
• Memory-based networks are special RNNs
• A mental state stored in an external memory can stay for arbitrarilylong durations, until it is overwritten (partially or not)
• Forgetting = vanishing gradient.
• Memory = higher-dimensional state, avoiding or reducing the need for forgetting/vanishing
27
passive copy
access
Attention Mechanism for Deep Learning
• Consider an input (or intermediate) sequence or image
• Consider an upper level representation, which can choose « where to look », by assigning a weight or probability to each input position, as produced by an MLP, applied at each position
28
Lower-level
Higher-level
Softmax over lowerlocations conditionedon context at lower andhigher locations
• Soft attention (backprop) vs• Stochastic hard attention (RL)
(Bahdanau, Cho & Bengio, ICLR 2015; Jean et al ACL 2015; Jean et al WMT 2015; Xu et al ICML 2015; Chorowski et al NIPS 2015; Firat, Cho & Bengio 2016)
End-to-End Machine Translation with Recurrent
Nets and Attention Mechanism
• Reached the state-of-the-art in one year, from scratch
29
(Bahdanau et al ICLR 2015, Jean et al ACL 2015, Gulcehre et al 2015, Firat et al 2016)
Google-Scale NMT Success
• Distributed training, very large model ensemble
• Not only does NMT work in terms of BLEU but it makes a killing in terms of human evaluation on Google Translate data
30
(Wu et al & Dean, Nature, 2016)
Humanevaluation
humantranslation
n-gramtranslation
currentneural nettranslation
Pointing the Unknown Words
The next word generated can either come from vocabulary or is copied from the input sequence.
31
Gulcehre, Ahn, Nallapati, Zhou & Bengio ACL 2016Based on ‘Pointer Networks’, Vinyals et al 2015
Text summarization
MachineTranslation
Designing the RNN Architecture
• Recurrent depth: max path length divided by sequence length
• Feedforward depth: max length from input to nearest output
• Skip coefficient: shortest path length divided sequence length
32
(Architectural Complexity Measures of Recurrent Neural NetworksZhang et al 2016, arXiv:1602.08210)
It makes a difference
• Impact of change in recurrent depth
• Impact of change in skip coefficient
33
Near-Orthogonality to Help Information
Propagation
• Initialization to orthogonal recurrent W
• Unitary matrices: all e-values of matrix are 1
• Zoneout: randomly choose to simply copy the state unchanged
34
(Krueger et al 2016, submitted)
(Arjowski, Amar & Bengio ICML 2016)
(Saxe et al 2013, ICLR2014)
Variational Generative RNNs
• (Chung et al, NIPS’2015)
• Regular RNNs have noise injected only in input space
• VRNNs also allow noise (latent variable) injected in top hiddenlayer; more « high-level » variability
35
Injecting higher-level variations / latent variables in RNNs
Variational Hierarchical RNNs for Dialogue
Generation (Serban et al 2016)
• Lower level = words of an utterance (turn of speech)
• Upper level = state of the dialogue
• Inject high-level choices
36
VHRNN
Results –
Twitter Dialogues
37
Other Fully-Observed Neural
Directed Graphical Models
38
Neural Auto-Regressive Models
• Decomposes the joint of a fully observed directed model in terms of conditionals
• Logistic auto-regressive: (Frey 1997)
• First neural version: (Bengio&BengioNIPS’99)
39
x1x2 x3
x4
P(x1) P(x2|x1)P(x3|x2 ,x1)
P(x4|x3 , x2 ,x1)
x1x2 x3
x4
h1
x1x2 x3
x4
P(x1) P(x2|x1)P(x3|x2 ,x1)P(x4|x3 , x2 ,x1)
h2 h3
NADE: Neural AutoRegressive Density
Estimator
• Introduces smart sharing between some weights so that the different hidden groups use the same weights to the same input but look at more and more of the inputs.
40
h1
x1x2 x3
x4
P(x1) P(x2|x1)P(x3|x2 ,x1)
P(x4|x3 , x2 ,x1)
h2 h3
W1
W1
W1
W2
W2
W3
(Larochelle & Murray AISTATS 2011)
Pixel RNNs
• Similar to NADE and RNNs but for 2-D images
• Surprisingly sharp and realistic generation
• Gets texture right but not necessarily global structure
41
Pixel Recurrent Neural Networks
x1
x i
xn
xn 2
Figure 2. Left: To generate pixel x i one conditions on all thepre-
viously generated pixels left and above of x i . Center: Illustration
of a Row LSTM with a kernel of size 3. The dependency field of
the Row LSTM does not reach pixels further away on the sides
of the image. Right: Illustration of the two directions of the Di-
agonal BiLSTM. The dependency field of the Diagonal BiLSTM
covers the entire available context in the image.
Figure 3. In the Diagonal BiLSTM, to allow for parallelization
along the diagonals, the input map is skewed by offseting each
row by one position with respect to the previous row. When the
spatial layer is computed left to right and column by column, the
output map is shifted back into the original size. The convolution
usesakernel of size 2⇥1.
(2015); Uria et al. (2014)). By contrast we model p(x) as
a discrete distribution, with every conditional distribution
in Equation 2 being a multinomial that is modeled with a
softmax layer. Each channel variable x i ,⇤simply takes one
of 256 distinct values. Thediscrete distribution isrepresen-
tationally simple and has the advantage of being arbitrarily
multimodal without prior on the shape. Experimentally we
also find the discrete distribution to be easy to learn and
to produce better performance compared to a continuous
distribution (Section 5).
3. Pixel Recurrent Neural Networks
In this section we describe the architectural components
that compose the PixelRNN. In Sections 3.1 and 3.2, we
describe the two types of LSTM layers that use convolu-
tions to compute at once the states along one of the spatial
dimensions. In Section 3.3 we describe how to incorporate
residual connections to improvethetraining of aPixelRNN
with many LSTM layers. In Section 3.4 we describe the
softmax layer that computes the discrete joint distribution
of the colors and the masking technique that ensures the
proper conditioning scheme. In Section 3.5 wedescribe the
PixelCNN architecture. Finally in Section 3.6 we describe
themulti-scale architecture.
3.1. Row LSTM
The Row LSTM is a unidirectional layer that processes
the image row by row from top to bottom computing fea-
tures for a whole row at once; the computation is per-
formed with a one-dimensional convolution. For a pixel
x i the layer captures aroughly triangular context abovethe
pixel as shown in Figure 2 (center). The kernel of the one-
dimensional convolution has size k ⇥ 1 where k ≥ 3; the
larger thevalueof k thebroader thecontext that iscaptured.
The weight sharing in the convolution ensures translation
invariance of the computed features along each row.
The computation proceeds as follows. An LSTM layer has
an input-to-state component and a recurrent state-to-state
component that together determine thefour gates inside the
LSTM core. To enhance parallelizati on in the Row LSTM
theinput-to-state component isfirst computed for theentire
two-dimensional input map; for this a k ⇥1 convolution is
used to follow therow-wise orientation of theLSTM itself.
Theconvolution ismasked to includeonly thevalid context
(seeSection 3.4) and produces atensor of size 4h⇥n⇥n,
representing the four gate vectors for each position in the
input map, where h is the number of features in the LSTM
layer.
To compute one step of the state-to-state component of
the LSTM layer, one is given the previous hidden and cell
states h i − 1 and ci − 1, each of size h ⇥ n ⇥ 1. The new
hidden and cell states h i , ci are obtained as follows:
[oi , f i , i i , gi ] = σ(K ss ~ h i − 1 + K i s ~ x i )
ci = f i ci − 1 + i i gi
h i = oi tanh(ci )
(3)
where x i of size h ⇥n ⇥1 is row i of the input map, and
~ represents the convolution operation and the element-
wise multiplication. The weights K ss and K i s are the
kernel weights for the state-to-state and the input-to-state
components, where the latter is precomputed as described
above. In the case of the output, forget and input gates
oi , f i and i i , the activation σ is the logistic sigmoid func-
tion, whereas for the content gate gi , σ is the tanh func-
tion. Each step computes at once the new state for an en-
tire row of the input map. Since the Row LSTM layer is
unidirectional, it is relatively fast, but it has a considerable
drawback. Due to its roughly triangular shape, the recep-
tivefield induced by the layer misses a large portion of the
previously generated context corresponding to the areas on
either side of the current pixel. For example, for a value
of k = 3 for the state-to-state convolution, which we find
gives the best performance in the experiments, the recep-
tive field for the pixels near the center of the image misses
roughly half of the generated context (Figure 2).
(van den Oord et al ICML 2016, best paper)
Pixel Recurrent Neural Networks
occluded completions original occluded completions original
Figure 8. Image completions sampled from amodel that was trained on 32x32 ImageNet images. Note that diversity of the completions
is high, which can be attributed to the log-likelihood loss function used in this generative model, as it encourages models with high
entropy. As these are sampled from the model, we can easily generate millions of different completions. It is also interesting to see that
textures such aswater, wood and shrubbery are also inputed relativewell (see Figure 1).
trained to model the raw RGB pixel values of images. We
treated the pixel values as discrete random variables by us-
ing asoftmax layer in theconditional distributions. Weem-
ployed masked convolutions to allow PixelRNNs to model
full dependencies between the color channels. We pro-
posed and evaluated architectural improvements in these
models resulting in PixelRNNs with up to 12 LSTM lay-
ers.
We have shown that the PixelRNNs significantly improve
the state of the art on the Binary MNIST and CIFAR-10
datasets. We also provide new benchmarks for generative
image modeling on the ImageNet dataset. Based on the
samples and completions drawn from the models we can
conclude that the PixelRNNs are able to model both spa-
tially local and long-range correlations and are able to pro-
duce images that are sharp and coherent. Given that these
models improve as we make them larger and that there is
practically unlimited data available to train on, more com-
putation and larger models are likely to further improvethe
results.
Acknowledgements
Theauthorswould liketo thank Shakir Mohamed and Guil-
laume Desjardins for helpful input on this paper and Lu-
cas Theis, Alex Graves, Karen Simonyan, Lasse Espeholt,
Danilo Rezende, Karol Gregor and Ivo Danihelka for in-
sightful discussions.
References
Dinh, Laurent, Krueger, David, and Bengio, Yoshua.
NICE: Non-linear independent components estimation.
arXiv preprint arXiv:1410.8516, 2014.
Germain, Mathieu, Gregor, Karol, Murray, Iain, and
Larochelle, Hugo. MADE: Masked autoencoder for dis-
tribution estimation. arXiv preprint arXiv:1502.03509,
2015.
Graves, Alex. Generating sequences with recurrent neural
networks. arXiv preprint arXiv:1308.0850, 2013.
Graves, Alex and Schmidhuber, Jurgen. Offline handwrit-
ing recognition with multidimensional recurrent neural
networks. In Advances in Neural Information Process-
ing Systems, 2009.
Gregor, Karol, Danihelka, Ivo, Mnih, Andriy, Blundell,
Charles, and Wierstra, Daan. Deep autoregressive net-
works. In Proceedings of the 31st International Confer-
ence on Machine Learning, 2014.
Gregor, Karol, Danihelka, Ivo, Graves, Alex, and Wierstra,
Daan. DRAW: A recurrent neural network for image
generation. Proceedings of the 32nd International Con-
ference on Machine Learning, 2015.
He, Kaiming, Zhang, Xiangyu, Ren, Shaoqing, and Sun,
Jian. Deep residual learning for imagerecognition. arXiv
preprint arXiv:1512.03385, 2015.
Hochreiter, Sepp and Schmidhuber, Jurgen. Long short-
term memory. Neural computation, 1997.
Kalchbrenner, Nal and Blunsom, Phil. Recurrent continu-
ous translation models. In Proceedings of the2013 Con-
ferenceon Empirical Methods in Natural LanguagePro-
cessing, 2013.
Kalchbrenner, Nal, Danihelka, Ivo, and Graves, Alex.
Grid long short-term memory. arXiv preprint
arXiv:1507.01526, 2015.
Kingma, Diederik P and Welling, Max. Auto-encoding
variational bayes. arXiv preprint arXiv:1312.6114,
2013.
Krizhevsky, Alex. Learning multiple layers of features
from tiny images. 2009.
Larochelle, Hugo and Murray, Iain. The neural autore-
gressive distribution estimator. The Journal of Machine
Learning Research, 2011.
Forward Computation of the Gradient
• BPTT does not seem biologically plausible and is memory-expensive
• RTRL (Real-Time Recurrent Learning, Williams & Zipser 1989, Neural Comp.)
• Practically useful: online learning, no need to store all the past states and revisit history backwards (which is biologically weird)
• Compute the gradients forward in time, rather than backwards• Think about multiplying many matrices left-to-right vs right-to-left
• BUT exact computation is O(nhidden x nweights) instead of O(nweights), to recursively compute dh(t)/dW all params
• Recently proposed, *approximate* the forward gradient using an efficient stochastic estimator (rank 1 estimator of dh/dWtensor)(Training recurrent networks online without backtracking, Ollivier et al
arXiv:1507.07680)
42
Rare & Dangerous
States
• Example: autonomousvehicles in near-accident situations
• Current supervised learningmay not handle well thesecases because they are toorare (not enough data)
• It would be even worse with current RL (statistical inefficiency)
• Long-term objective: develop better predictive models of the world able to generalize in completely unseen scenarios
• Example of similar human ability: figuring out intuitive physics, no need to die a thousand deaths
How to Discover Good Representations
• How to discover abstractions? What is a good representation?
• Need clues to disentangle the underlying factors
• Spatial & temporal scales
• Marginal independence
• Controllable factors
44
Jointly Learn to Detect and Control an Attribute
of the Scene
45
In order to quantify the change in f k when actions are taken according to ⇡k , wedefine the selectivity of a feature as:
sel (s, a, k) = Es0⇠P as s 0
|f k (s0) − f k (s)|P
k 0 |f k 0(s0) − f k 0(s)|(1)
where s,s0 are successive raw state representations (e.g. pixels), a is the action, Pass0 is the environment’s transition distribution from
s to s0 under action a. The normalization by the change in all features means that the selectivity of f k is maximal when only thatsingle featurechanges as a result of some action.
By having an objective that maximizes selectivity and minimizes the autoencoder objective, we can ensure that the features learnedcan both reconstruct thedata and recover independently controllable factors. Hence, wedefine the following objective, which can beminimized via stochastic gradient descent:
Es[ 12||s − g(f (s))||22]
| {z }reconstruction error
− λX
k
Es[X
a
⇡k (a|s) logsel (s, a, k)]
| {z }disentanglement objective
(2)
Here one can think of logsel(s, a, k) as the reward signal Rk (s, a) of a control problem, and the expected reward Ea⇠⇡ k[Rk ] is
maximized by finding theoptimal set of policies ⇡k .
Note that it isalso possible to havedirected selectivity: by not taking theabsolutevalueof thenumerator of (1) (and replacing logselwith log(1+ sel ) in (2)), thepolicies must learn to increase thelearned latent feature rather than simply change it. Thismay beusefulif the policy to gradually increase a feature is distinct from the policy that decreases it.
2.3 A first toy problem
Consider the simple environment described in Figure 1(a): the agent sees a 2⇥ 2 square of adjacent cells in the environment, andhas 4 actions that move it up, down, left or right. An autoencoder with directed selectivity (see Figure 1(c,d)) learns latent featuresthat map to the (x, y) position of the square in the input space, without ever having explicitly access to these values, and whilereconstructing the input properly. An autoencoder without selectivity also reconstructs the input properly but without learning thesetwo latent (x, y) features explicitly.
In thissetting f , g and ⇡ sharesomeof their parameters. Weusethefollowing architecture: f has two 16⇥3⇥3 ReLU convolutionallayers, followed by afully connected ReLU layer of 32 units, and atanh layer of n = 4 features; g is the transpose architecture of f ;⇡k is asoftmax policy over 4 actions, computed from the output of the ReLU fully connected layer.
(a) (b) (c) (d)
Figure 1: (a) & (b) A simple environment with 4 actions that push a square left, right, up or down. (a) is an example ground truth,(b) is the reconstruction of the model trained with selectivity. (c) The slope of a linear regression of the true features (the real xand y position of the agent) as a function of each latent feature. White is no correlation, blue and red indicate strong negative orpositive slopes respectively. We can see that features 0 and 1 recover y and features 2 and 3 recover x. (d) Representation of thelearned policies. Each row is a policy ⇡k , each column corresponds to an action (left/right/up/down). Each cell (k, i ) represents theprobability of action i in policy ⇡k ; Wecan see that features 0 and 1 correspond to going down and up (− y/+ y) and features 2 and 3correspond to going right and left (+ x/− x).
2.4 A slightly harder toy problem
In the next experiments, we aim to generalize the model above in a slightly more complex environment. Instead of having f and ⇡k
parametrized by thesameparameters, wenow introduce adifferent set of parameters for each policy and for theencoder, so that eachpolicy can be learned separately.
2
Independently Controllable Factors
(Bengio et al & Bengio, 2017)
• Jointly train for each aspect (factor)
• A policy (which tries to selectively change just thatfactor)
• A representation (which maps state to value of factor)
• Optimize both policy and representation to minimize
46
Discrete Set of Attributes:
Experiments
True underlying attributes are (x1, y2), (x2, y2) of digits.
Learner identifies them!
47
Ongoing Work
NIPS Submission: Independently Controllable Features
48
17/ 17
Independent ly Controllable Features
Questions?
E. Bengio, J. Pondard, M. Sarfat i, V. Thomas et al. Independent ly Controllable Features
Emmanuel Bengio, Valentin Thomas, Jules Pondard, Marc Sarfati, Philippe BeaudoinMarie-Jean Meurs, Doina Precup, Joelle Pineau, Yoshua Bengio
Continuous Set of Attributes: Attribute
Embeddings
49
Continuous Attributes Model: Computational
Graph
Total objective = autoencoder loss + selectivity loss
50
Continuous Attributes: Results
51
Predict the effect of actions in attribute space
Given initial state and set of actions, predict new attribute values and the corresponding reconstructed images
52
Given two states, recover the causal actions
leading from one to the other
53
Representing Agents, Objects, Policies
The controllable factors idea works on a small scale: ongoing workto expand this to more complex environments
• Expand to sequences of actions and use them to define options
• Notion of objects and attributes naturally falls out
• Extension to non-static set of objects: types & instances
• Objects are groups of controllable features: by whom? Agents
• Factors controlled by other agents; mirror neurons
• Because the set of objects may be unbounded, we need to learnto represent policies themselves, and the definition of an objectis bound to the policies associated with it (for using it and changing its attributes)
54
Research Program Towards Higher-Level Deep
Learning
• Build a series of gradually more complex environments
• Design training frameworks for learning agents to discover the notion of objects, attributes, types, other agents, relations, the combinatorial space of policies, etc…
• Associate all these concepts to linguistic constructions: learnsemantically grounded linguistic expressions.
• Words are clues of higher-level abstractions
• Language is an input/output modality
55
Montreal Institute for Learning Algorithms
Recommended