Deep Learning in NLP: The Past, The Present, and The Future
Tomas Mikolov, Facebook AI Research
Meaning in Context Workshop
Munich, 2015
Deep Learning in NLP: Overview
In this talk, focus on:
• The fundamental concepts
• Understanding the hype around deep learning
• Possible directions for future research
Tomas Mikolov, FAIR 2
Neural networks: motivation
• To come up with a more precise way to represent and model words, documents, and language than the basic approaches (n-grams, word classes, logistic regression / SVM)
• There is nothing that neural networks can do in NLP that the basic techniques completely fail at
• But: the victory in competitions goes to the best, thus a few percent gain in accuracy counts!
Neuron (perceptron)
[Figure: a single neuron; input signals i1, i2, i3 arrive through input synapses with weights w1, w2, w3 (I: input signal, W: input weights), pass through a non-linear activation function max(0, value), and produce the output (axon)]
Output = max(0, I · W)
Neuron with non-linear activation function
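The slide's formula can be written out in a few lines of Python (the input and weight values below are made up for illustration):

```python
# A single artificial neuron with rectified activation, as on the slide:
# Output = max(0, I . W)
def neuron(inputs, weights):
    value = sum(i * w for i, w in zip(inputs, weights))  # dot product I . W
    return max(0.0, value)                               # rectified activation

i = [1.0, 0.5, -2.0]   # input signal i1, i2, i3 (illustrative values)
w = [0.4, 0.2, 0.3]    # input weights w1, w2, w3 (illustrative values)
print(neuron(i, w))    # dot product is 0.4 + 0.1 - 0.6 = -0.1 -> output 0.0
```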
Neuron (perceptron)
• It should be noted that the perceptron model is quite different from biological neurons, which communicate by sending spike signals at varying frequencies
• The learning process also seems quite different
• It would be better to think of artificial neural networks as non-linear projections of data
Activation function
• In the previous example, we used max(0, value): this is usually referred to as the “rectified” activation function
• Many other functions can be used
• Other common ones: sigmoid, tanh
[Figure from Wikipedia: common activation functions]
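The activation functions mentioned above, as a minimal Python sketch:

```python
import math

# The three activation functions from the slide.
def relu(z):     # "rectified" activation: max(0, value)
    return max(0.0, z)

def sigmoid(z):  # squashes any input to the range (0, 1)
    return 1.0 / (1.0 + math.exp(-z))

def tanh(z):     # squashes any input to the range (-1, 1)
    return math.tanh(z)

print(relu(-3.0), sigmoid(0.0), tanh(0.0))  # 0.0 0.5 0.0
```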
Activation function
• The critical part, however, is the non-linearity
• Example: XOR problem
• There is no linear classifier that can solve this problem:
[Figure: the four XOR input points, which no single line can separate]
Non-linearity: example
Intuitive NLP example:
• Input: sequence of words
• Output: binary classification (for example, positive / negative sentiment)
• Input: “the idea was not bad”
• A non-linear classifier can learn that “not” and “bad” next to each other mean something different than “not” or “bad” on its own
• “not bad” != “not” + “bad”
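A toy sketch of this point (all sentiment scores below are invented for illustration): a purely linear model just sums per-word scores, so “not bad” inherits the negativity of both words, while a model with a learned bigram interaction feature can assign the pair its own meaning:

```python
# Made-up per-word sentiment scores: a linear model can only add these up.
word_score = {"the": 0.0, "idea": 0.0, "was": 0.0, "not": -1.0, "bad": -1.0}
# A learned non-linear combination: "not bad" gets its own (positive) score.
bigram_score = {("not", "bad"): 2.5}

def linear(words):
    return sum(word_score.get(w, 0.0) for w in words)

def with_interaction(words):
    score = linear(words)
    for a, b in zip(words, words[1:]):          # scan adjacent word pairs
        score += bigram_score.get((a, b), 0.0)
    return score

sent = "the idea was not bad".split()
print(linear(sent))            # -2.0: looks very negative
print(with_interaction(sent))  # 0.5: mildly positive
```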
Activation function
• The non-linearity is a crucial concept that gives neural networks more representational power than linear techniques (linear SVM, logistic regression)
• Without the non-linearity, it is not possible to model certain combinations of features (like Boolean XOR function), unless we do manual feature engineering
Hidden layer
• The hidden layer represents a learned non-linear combination of input features (unlike SVMs with non-linear kernels, where the combinations are fixed, not learned)
• With hidden layer, we can solve the XOR problem:
1. some neurons in the hidden layer will activate only for some combination of input features
2. the output layer can represent combination of the activations of the hidden neurons
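The two-step recipe above can be sketched with hand-picked weights (a minimal illustration with fixed weights, not a trained network):

```python
def relu(z):
    return max(0.0, z)

def xor_net(x1, x2):
    # Hidden neurons activate for specific combinations of input features:
    h1 = relu(x1 + x2)        # counts how many inputs are active
    h2 = relu(x1 + x2 - 1)    # fires only when BOTH inputs are active
    # The output combines the hidden activations:
    y = h1 - 2 * h2           # 0, 1, 1, 0 for the four input patterns
    return 1 if y > 0.5 else 0

for a in (0, 1):
    for b in (0, 1):
        print(a, b, "->", xor_net(a, b))
```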
Hidden layer
• A neural net with one hidden layer is a universal approximator: given enough hidden units, it can represent any reasonable function
• However, not all functions can be represented efficiently with a single hidden layer – we shall see that in the deep learning section
Objective function
• The objective function defines how well the neural network performs some task
• The goal of training is to adapt the weights so that the objective function is maximized / minimized
• Example: classification accuracy, reconstruction error
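Both example objectives can be computed in a few lines (toy data, for illustration only):

```python
# Classification accuracy: the fraction of predictions matching the targets.
def accuracy(predictions, targets):
    correct = sum(p == t for p, t in zip(predictions, targets))
    return correct / len(targets)

# Reconstruction error: mean squared difference between inputs and
# their reconstructions.
def reconstruction_error(inputs, reconstructions):
    return sum((x - r) ** 2 for x, r in zip(inputs, reconstructions)) / len(inputs)

print(accuracy([1, 0, 1, 1], [1, 0, 0, 1]))          # 0.75
print(reconstruction_error([1.0, 2.0], [1.0, 1.0]))  # 0.5
```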
Training of neural networks
• There are many ways to train neural networks
• The most widely used and successful in practice is to use stochastic gradient descent and backpropagation
• Many algorithms have been introduced as superior to SGD, but when compared properly, the claimed gains are not easy to reproduce
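A minimal sketch of the SGD idea on a one-weight model (toy data; real training backpropagates the gradient through all layers of the network):

```python
# Fit y = w * x to toy data with target relation y = 2x, one example at a
# time: compute the gradient of the squared error for that example and
# step the weight against it.
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]  # (input, target) pairs
w = 0.0    # initial weight
lr = 0.05  # learning rate

for epoch in range(100):
    for x, target in data:
        pred = w * x
        grad = 2 * (pred - target) * x   # d/dw of (pred - target)^2
        w -= lr * grad                   # stochastic gradient step

print(round(w, 3))  # converges close to 2.0
```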
Deep Learning
• Deep model architecture is about having more computational steps (hidden layers) in the model
• Deep learning aims to learn patterns that cannot be learned efficiently with shallow models
Deep Learning
• But: it was previously mentioned that a neural net with one hidden layer is a universal approximator, as it can represent any function
• Why would we need more hidden layers then?
Deep Learning
• The crucial part in understanding deep learning is efficiency
• The “universal approximator” argument says only that a neural net with non-linearities can work as a look-up table to represent any function: some neurons can activate only for some specific range of input values
Deep Learning
• A look-up table is not efficient: for certain functions, the number of hidden units needed grows exponentially with the size of the input layer
• Example: parity function (N bits at input, output is 1 if the number of active input bits is odd) (Perceptrons, Minsky & Papert 1969)
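The contrast can be made concrete: computing parity as a chain of XOR stages needs only one simple operation per input bit, whereas a single look-up layer would need a unit for each of the 2^N input patterns:

```python
# Parity as a composition of simple functions: one XOR "stage" per bit,
# so the computation is linear in N rather than exponential.
def parity(bits):
    result = 0
    for b in bits:
        result ^= b   # fold the next bit in with XOR
    return result

print(parity([1, 0, 1]))     # two active bits  -> 0 (even)
print(parity([1, 1, 1, 0]))  # three active bits -> 1 (odd)
```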
Deep Learning
• Having hidden layers exponentially larger than necessary is bad
• If we cannot represent patterns compactly, we have to memorize them -> we need exponentially more training examples
Deep Learning
• Whenever we try to learn a complex function that is a composition of simpler functions, it may be beneficial to use deep architecture
[Figure: a deep network with an input layer, three hidden layers, and an output layer]
Deep Learning
• Historically, it was assumed that deep models could not be trained using SGD + backpropagation
• Recently there has been a lot of success on tasks whose signals are highly compositional (speech, vision) and where large amounts of training data are available
Deep Learning
• Deep learning is still an open research problem
• Many proposed deep models learn nothing that a shallow (one hidden layer) model cannot learn: beware the hype!
• Not everything labeled “deep” is a successful example of deep learning
Deep Learning in NLP: RNN language model
s(t) = f(U w(t) + W s(t−1))
y(t) = g(V s(t))

f() is often the sigmoid activation function:
f(z) = 1 / (1 + e^(−z))

g() is often the softmax function:
g(z_m) = e^(z_m) / Σ_k e^(z_k)
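The equations above can be turned into a tiny, untrained Python implementation (the dimensions and random weights are illustrative only; w(t) is a one-hot word vector, so U w(t) reduces to selecting a column of U):

```python
import math
import random

random.seed(0)
V, H = 5, 3  # vocabulary size, hidden layer size (illustrative)

# Random, untrained weight matrices: U (H x V), W (H x H), Vout (V x H).
U = [[random.uniform(-0.1, 0.1) for _ in range(V)] for _ in range(H)]
W = [[random.uniform(-0.1, 0.1) for _ in range(H)] for _ in range(H)]
Vout = [[random.uniform(-0.1, 0.1) for _ in range(H)] for _ in range(V)]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def softmax(zs):
    exps = [math.exp(z) for z in zs]
    total = sum(exps)
    return [e / total for e in exps]

def rnn_step(word_index, s_prev):
    # s(t) = f(U w(t) + W s(t-1)); one-hot w(t) picks column word_index of U
    s = [sigmoid(U[j][word_index] + sum(W[j][k] * s_prev[k] for k in range(H)))
         for j in range(H)]
    # y(t) = g(V s(t)): a probability distribution over the next word
    y = softmax([sum(Vout[m][j] * s[j] for j in range(H)) for m in range(V)])
    return s, y

s = [0.0] * H                 # initial hidden state
for w_idx in [0, 3, 1]:       # a short sequence of word indices
    s, y = rnn_step(w_idx, s)
print(round(sum(y), 6))       # softmax probabilities sum to 1.0
```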
Neural Networks for NLP
• More info in the Coling 2014 tutorial: Using Neural Networks for Modelling and Representing Natural Languages
Deep Learning: Understanding the Hype
• For researchers, it is important to understand which concepts do work, and which are just over-hyped
• The incentives in the last few years were too high to keep all researchers honest: hyping one's work and (sometimes non-existent) contributions was rewarded by big companies
• Deep learning is sometimes like the moon landing / chess-playing computer: great for PR! (and, sometimes, quite useless too)
Deep Learning: Understanding the Hype
• Because of the hype, it is difficult to distinguish which ideas truly work
• It is important to discuss which ideas do not work: this can save many people a lot of time!
“Big, Deep Models Can Solve Everything”
• Claim: deep learning can do everything, just use big models with many hidden layers! And a lot of training data!
• Truth: current state-of-the-art deep models can be broken just as easily as models were broken back in the 60's
• Example: memorization of variable-length sequence of symbols breaks all standard models (RNNs, LSTMs)
Future of Deep Learning Research (for NLP)
• Using more hidden layers, more neurons and more data will not lead to new breakthroughs: this has already been done (incremental progress is however still possible)
• The existing techniques can still be applied to new datasets, but the novelty of such research is limited
• We should redefine our goals and be more ambitious: the new task should be AI-style language understanding (no fake chatbots…)
Future of Deep Learning Research (for NLP)
In the future, we will probably need:
• Novel datasets that allow complex patterns in language to be learned
• Novel machine learning techniques that can represent and discover complex patterns
Stack-Augmented RNN
• Can discover some complex patterns in sequential data
• Learns how to construct and operate structured long-term memory
• Potentially unlimited capacity

Inferring Algorithmic Patterns with Stack-Augmented Recurrent Nets, Joulin & Mikolov, NIPS 2015
Code available: https://github.com/facebook/Stack-RNN
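A deliberately simplified sketch of the underlying idea (not the paper's actual model, which uses continuous, differentiable stack actions learned by the controller): at each step an action manipulates an external stack, giving the network a structured long-term memory with last-in-first-out behavior:

```python
# Run a sequence of discrete stack actions; in the real model the
# controller RNN learns which action to take at each time step.
def run(actions, values):
    stack = []
    popped = []
    for action, value in zip(actions, values):
        if action == "push":
            stack.append(value)          # store a symbol in memory
        elif action == "pop" and stack:
            popped.append(stack.pop())   # retrieve the most recent symbol
    return stack, popped

# Push two symbols, then pop them back: they return in reverse order.
stack, popped = run(["push", "push", "pop", "pop"], ["a", "b", None, None])
print(stack, popped)  # [] ['b', 'a']
```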
Learning Binary Addition From Scratch
[Figure: the learned computation performed by the stacks]
Conclusion
• Basic concepts in artificial neural networks are simple
• There is a lot of hype around deep learning and many big claims, some of which are overstated
• There is still a lot of research to be done in the future, particularly in NLP: we need to learn more (complex) patterns