Neural Networks
Lecture 9: Applications
Course levels: Foundations, Basics, Advanced, Selected topics
Lesson Title
1 Introduction
2 Probability and Information Theory
3 Basics of Machine Learning
4 Deep Feed-forward networks
5 Back-propagation
6 Optimization and regularization
7 Convolutional Networks
8 Sequence Modeling: recurrent networks
9 Applications
10 Software and Practical methods
11 Representational Learning
12 Deep Generative Models
13 Deep Reinforcement Learning
14 Deep Learning and the Brain
15 Conclusions and Future Perspective
16 Project presentations
Learning objectives
• Understand the needs of large-scale deep learning
• Know the most popular applications of modern deep networks in different fields
Large-scale deep learning
[Figure 1.11: number of neurons (logarithmic scale, 10^-2 to 10^11) versus year (1950-2056), comparing the artificial networks listed below (1-20) with biological nervous systems from the sponge, roundworm, leech, ant, bee, frog and octopus up to the human. Since the introduction of hidden units, artificial neural networks have doubled in size roughly every 2.4 years. Biological neural network sizes from Wikipedia (2015).]
1. Perceptron (Rosenblatt, 1958, 1962)
2. Adaptive linear element (Widrow and Hoff, 1960)
3. Neocognitron (Fukushima, 1980)
4. Early back-propagation network (Rumelhart et al., 1986b)
5. Recurrent neural network for speech recognition (Robinson and Fallside, 1991)
6. Multilayer perceptron for speech recognition (Bengio et al., 1991)
7. Mean field sigmoid belief network (Saul et al., 1996)
8. LeNet-5 (LeCun et al., 1998b)
9. Echo state network (Jaeger and Haas, 2004)
10. Deep belief network (Hinton et al., 2006)
11. GPU-accelerated convolutional network (Chellapilla et al., 2006)
12. Deep Boltzmann machine (Salakhutdinov and Hinton, 2009a)
13. GPU-accelerated deep belief network (Raina et al., 2009)
14. Unsupervised convolutional network (Jarrett et al., 2009)
15. GPU-accelerated multilayer perceptron (Ciresan et al., 2010)
16. OMP-1 network (Coates and Ng, 2011)
17. Distributed autoencoder (Le et al., 2012)
18. Multi-GPU convolutional network (Krizhevsky et al., 2012)
19. COTS HPC unsupervised convolutional network (Coates et al., 2013)
20. GoogLeNet (Szegedy et al., 2014a)
Improvements in accuracy have gone hand in hand with increases in network size, which in turn rely on high-performance hardware and software infrastructure.
Fast implementations
CPU
• Exploit fixed-point arithmetic in CPU families where this offers a speedup
GPU
• Parallelism and high memory bandwidth
• Low-level: TensorFlow, PyTorch, Theano (no longer maintained)
• High-level: Keras, Caffe
• NN training requires little branching, which makes it well suited to GPUs
Distributed implementations
• A single GPU is insufficient for large-scale applications
• Multi-GPU
• Multi-machine
• Model parallelism: several machines work on a single data point
• Data parallelism: each example is processed by a separate machine
• Trivial at test time
• Asynchronous SGD at train time (see the sketch below)
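A toy, self-contained sketch of lock-free asynchronous SGD on shared parameters, the idea behind the last bullet. It uses Python threads and a toy linear-regression objective purely for illustration, so it only shows the update pattern, not real multi-machine training; all names are made up here.

```python
# Asynchronous SGD, Hogwild-style: several workers update a shared parameter
# vector without locking. (CPython threads share the GIL, so this is only a
# schematic of the idea, not a performant implementation.)
import threading
import numpy as np

true_w = np.array([2.0, -3.0, 0.5])
w = np.zeros(3)                                   # shared parameters, updated in place

def grad(w_now, X, y):
    return 2.0 * X.T @ (X @ w_now - y) / len(y)   # gradient of the mean squared error

def worker(seed, steps=200, lr=0.05):
    rng = np.random.default_rng(seed)
    for _ in range(steps):
        X = rng.normal(size=(32, 3))
        y = X @ true_w + 0.01 * rng.normal(size=32)
        w[:] = w - lr * grad(w, X, y)             # no lock: updates may interleave

threads = [threading.Thread(target=worker, args=(i,)) for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print("learned:", np.round(w, 2), "target:", true_w)
```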
Model compression
• Commercial applications need low time and memory cost at test time
• If no personalization is needed, train once and deploy the same model to many users
• The end user is often resource-constrained (e.g. train a speech recognizer on a computer cluster, then deploy it on a mobile phone)
• Model compression aims to train a small model to mimic a large one, by sampling inputs x and the large model's outputs f(x)
• This often works better than training the small model directly (see the sketch below)
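A rough sketch of the mimic-training recipe above, with placeholder teacher and student architectures; the teacher here is untrained and merely stands in for a large trained model.

```python
# Model compression sketch: sample inputs x, label them with the large
# (teacher) model's outputs f(x), and fit a small (student) model to the pairs.
import torch
import torch.nn as nn

teacher = nn.Sequential(nn.Linear(20, 256), nn.ReLU(),
                        nn.Linear(256, 256), nn.ReLU(),
                        nn.Linear(256, 10))        # stands in for a trained large model
student = nn.Sequential(nn.Linear(20, 16), nn.ReLU(),
                        nn.Linear(16, 10))         # much cheaper at test time

opt = torch.optim.Adam(student.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

for step in range(1000):
    x = torch.randn(64, 20)                        # sampled inputs (could be real data)
    with torch.no_grad():
        target = teacher(x)                        # f(x) from the large model
    loss = loss_fn(student(x), target)             # student mimics the teacher
    opt.zero_grad()
    loss.backward()
    opt.step()
```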
Dynamic structure
Mixture of experts: a gating mechanism selects which expert network is used to compute the output for the current input (see the sketch below)
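A minimal sketch of hard gating in this spirit; layer sizes and names are placeholders, and real systems usually train with softer or noisy gating so that the gate itself receives gradients.

```python
# Mixture of experts with hard selection: a small gating network scores the
# experts and the highest-scoring expert computes the output for each example.
import torch
import torch.nn as nn

n_experts, d_in, d_out = 4, 8, 3
experts = nn.ModuleList(
    nn.Sequential(nn.Linear(d_in, 32), nn.ReLU(), nn.Linear(32, d_out))
    for _ in range(n_experts))
gate = nn.Linear(d_in, n_experts)                  # one score per expert

def forward(x):
    choice = gate(x).argmax(dim=1)                 # hard, non-differentiable selection
    return torch.stack([experts[int(i)](xi) for xi, i in zip(x, choice)])

print(forward(torch.randn(5, d_in)).shape)         # torch.Size([5, 3])
```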
Specialized hardware
ASICs (application-specific integrated circuits)
FPGAs (field-programmable gate arrays)
GPUs (with lower-precision arithmetic)
Optical systems (laser-based systems)
Computer vision
Convolutional neural networks achieve state-of-the-art results on many tasks
Specialized pre-processing is no longer very important (but normalising the contrast of images still helps; see the sketch below)
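A minimal sketch of one such contrast normalisation (global contrast normalisation, applied per image); the scale and epsilon constants are illustrative choices.

```python
# Global contrast normalization: per image, subtract the mean intensity and
# rescale so the pixels have roughly unit standard deviation.
import numpy as np

def global_contrast_normalize(img, scale=1.0, eps=1e-8):
    img = img.astype(np.float64)
    img = img - img.mean()                     # remove the mean intensity
    norm = np.sqrt((img ** 2).mean())          # contrast (std. dev. around zero)
    return scale * img / max(norm, eps)        # guard against nearly flat images

example = np.random.randint(0, 256, size=(32, 32, 3))
print(global_contrast_normalize(example).std())   # ~1.0
```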
Computer vision
Classification and retrieval
Computer vision
Caption generation
Computer vision
Face detection
Computer vision
Recognition of other body parts
Computer vision
Detection (bounding boxes) and segmentation
Computer vision
Transcription
Computer vision
Segmentation (annotating each pixel as cell or not cell)
Computer vision
Detection (intrusion)
Computer vision
Detection (leader-follower system)
https://www.youtube.com/watch?v=JaQTekVIXe4
Speech recognition
Map an acoustic signal (a natural-language utterance) to the corresponding sequence of words
Most industrial speech products now use neural networks (e.g. voice recognition on mobile phones)
Speech recognition
Spectrogram: spectrum of frequencies as they vary with time
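A small sketch of computing such a spectrogram, here on a synthetic signal rather than real speech, using scipy's spectrogram routine; the window sizes are arbitrary choices.

```python
# Compute a spectrogram: short-time spectra stacked over time, the "image"
# typically fed to a convolutional or recurrent acoustic model.
import numpy as np
from scipy.signal import spectrogram

fs = 16000                                     # 16 kHz sampling rate, as for speech
t = np.arange(0, 1.0, 1 / fs)
x = np.sin(2 * np.pi * (200 + 300 * t) * t)    # toy signal whose pitch rises over time

f, times, Sxx = spectrogram(x, fs=fs, nperseg=400, noverlap=200)
print(Sxx.shape)                               # (frequency bins, time frames)
```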
Speech recognition
Recognition (Convnets, LSTMs)
Synthesis (WaveNet)
Models speech directly at the waveform level, generating about 16,000 samples per second
"If you train it with an American’s speech, it produces American speech. If you train it with German, it produces German. And if you train it with Chopin, it produces… well, not quite Chopin, but piano in a logical, one might even be tempted to say creative vein."
https://deepmind.com/blog/wavenet-generative-model-raw-audio/
Natural Language Processing (NLP)
Naturally occurring language is ambiguous and defies formal description
Example: machine translation requires reading a sentence in one human language and emitting an equivalent sentence in another
NLP applications are based on language models (probability distributions over sequences of words)
This decomposition is justified by the chain rule of probability. The probability distribution over the initial sequence P(x_1, …, x_{n-1}) may be modeled by a different model with a smaller value of n.
Training n-gram models is straightforward because the maximum likelihood estimate can be computed simply by counting how many times each possible n-gram occurs in the training set. Models based on n-grams have been the core building block of statistical language modeling for many decades (Jelinek and Mercer, 1980; Katz, 1987; Chen and Goodman, 1999).
For small values of n, models have particular names: unigram for n = 1, bigram for n = 2, and trigram for n = 3. These names derive from the Latin prefixes for the corresponding numbers and the Greek suffix "-gram" denoting something that is written.
Usually we train both an n-gram model and an (n-1)-gram model simultaneously. This makes it easy to compute

P(x_t | x_{t-n+1}, …, x_{t-1}) = P_n(x_{t-n+1}, …, x_t) / P_{n-1}(x_{t-n+1}, …, x_{t-1})    (12.6)

simply by looking up two stored probabilities. For this to exactly reproduce inference in P_n, we must omit the final character from each sequence when we train P_{n-1}.
As an example, we demonstrate how a trigram model computes the probability of the sentence "THE DOG RAN AWAY." The first words of the sentence cannot be handled by the default formula based on conditional probability because there is no context at the beginning of the sentence. Instead, we must use the marginal probability over words at the start of the sentence. We thus evaluate P_3(THE DOG RAN). Finally, the last word may be predicted using the typical case, of using the conditional distribution P(AWAY | DOG RAN). Putting this together with equation 12.6, we obtain:

P(THE DOG RAN AWAY) = P_3(THE DOG RAN) P_3(DOG RAN AWAY) / P_2(DOG RAN)    (12.7)

A fundamental limitation of maximum likelihood for n-gram models is that P_n as estimated from training set counts is very likely to be zero in many cases, even though the tuple (x_{t-n+1}, …, x_t) may appear in the test set. This can cause two different kinds of catastrophic outcomes. When P_{n-1} is zero, the ratio is undefined, so the model does not even produce a sensible output. When P_{n-1} is non-zero but P_n is zero, the test log-likelihood is -∞. To avoid such catastrophic outcomes, most n-gram models employ some form of smoothing.
N-gram models: models over sequences of n tokens (words)
Estimate the probability of each n-gram by counting how many times it occurs in the training set
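A minimal sketch of this counting estimate (equation 12.6) for a trigram model; the corpus and function names are made up for illustration.

```python
# Maximum-likelihood trigram estimate: count trigrams and bigrams in a toy
# corpus and take their ratio, as in equation 12.6.
from collections import Counter

corpus = "the dog ran away the dog ran home the cat ran away".split()

def ngram_counts(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

tri = ngram_counts(corpus, 3)
bi = ngram_counts(corpus, 2)

def p_next(w1, w2, w3):
    """P(w3 | w1, w2) = count(w1 w2 w3) / count(w1 w2); undefined if the bigram is unseen."""
    return tri[(w1, w2, w3)] / bi[(w1, w2)]

print(p_next("dog", "ran", "away"))   # 0.5: "dog ran" is followed by "away" and by "home"
```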
Natural Language Processing (NLP)
N-gram models suffer from:
• the curse of dimensionality: the number of possible n-grams is enormous, so even in a massive training set most n-grams never occur;
• one-hot word representations: any two different words are the same distance apart, so the only neighbours whose information can be leveraged are training examples that repeat literally the same context.
"Cat" = (0,0,0,0,0,0,1,0,0,0,0,0)
"Dog"= (0,0,1,0,0,0,0,0,0,0,0,0)
Natural Language Processing (NLP)
Neural language models use a distributed representation of words (also called word embeddings), e.g. Word2Vec
"Cat" = (0.12, 0.31, 1.23, 0.05, 0.91)
"Dog" = (0.17, 0.23, 1.05, 0.04, 0.6)
Vectors are positioned so that words sharing common contexts in the corpus end up close to each other in the embedding space
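A tiny numerical illustration of this point, reusing the made-up embedding and one-hot vectors from these slides.

```python
# Dense embeddings vs. one-hot vectors: similar words get similar embeddings,
# whereas any two distinct one-hot vectors are equally unrelated.
import numpy as np

cat = np.array([0.12, 0.31, 1.23, 0.05, 0.91])
dog = np.array([0.17, 0.23, 1.05, 0.04, 0.60])

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

print(round(cosine(cat, dog), 2))       # ~0.99: similar words, similar vectors

cat_onehot = np.eye(12)[6]              # the one-hot codes from the earlier slide
dog_onehot = np.eye(12)[2]
print(cosine(cat_onehot, dog_onehot))   # 0.0: all distinct words look equally far apart
```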
Natural Language Processing (NLP)
Neural machine translation: model the conditional probability of a sentence in the target language given a sentence in the source language
[Figure 12.5 diagram: source object (French sentence or image) → encoder → intermediate, semantic representation → decoder → output object (English sentence).]
Figure 12.5: The encoder-decoder architecture to map back and forth between a surface representation (such as a sequence of words or an image) and a semantic representation. By using the output of an encoder of data from one modality (such as the encoder mapping from French sentences to hidden representations capturing the meaning of sentences) as the input to a decoder for another modality (such as the decoder mapping from hidden representations capturing the meaning of sentences to English), we can train systems that translate from one modality to another. This idea has been applied successfully not just to machine translation but also to caption generation from images.
A drawback of the MLP-based approach is that it requires the sequences to be preprocessed to be of fixed length. To make the translation more flexible, we would like to use a model that can accommodate variable length inputs and variable length outputs. An RNN provides this ability. Section 10.2.4 describes several ways of constructing an RNN that represents a conditional distribution over a sequence given some input, and section 10.4 describes how to accomplish this conditioning when the input is a sequence. In all cases, one model first reads the input sequence and emits a data structure that summarizes the input sequence. We call this summary the "context" C. The context C may be a list of vectors, or it may be a vector or tensor. The model that reads the input to produce C may be an RNN (Cho et al., 2014a; Sutskever et al., 2014; Jean et al., 2014) or a convolutional network (Kalchbrenner and Blunsom, 2013). A second model, usually an RNN, then reads the context C and generates a sentence in the target language. This general idea of an encoder-decoder framework for machine translation is illustrated in figure 12.5.
In order to generate an entire sentence conditioned on the source sentence, the model must have a way to represent the entire source sentence. Earlier models were only able to represent individual words or phrases.
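The encoder-decoder idea of figure 12.5 can be sketched in a few lines. The sketch below is illustrative only, with toy vocabulary sizes and a single GRU layer on each side; it is not the architecture of any of the cited papers.

```python
# Encoder-decoder sketch: the encoder compresses the source sentence into a
# context C (its final hidden state); the decoder generates the target
# sentence conditioned on C (here with teacher forcing).
import torch
import torch.nn as nn

SRC_VOCAB, TGT_VOCAB, EMB, HID = 1000, 1200, 64, 128

class Encoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(SRC_VOCAB, EMB)
        self.rnn = nn.GRU(EMB, HID, batch_first=True)
    def forward(self, src):                      # src: (batch, src_len) token ids
        _, h = self.rnn(self.emb(src))
        return h                                 # context C: (1, batch, HID)

class Decoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(TGT_VOCAB, EMB)
        self.rnn = nn.GRU(EMB, HID, batch_first=True)
        self.out = nn.Linear(HID, TGT_VOCAB)
    def forward(self, tgt, context):             # tgt: (batch, tgt_len) token ids
        y, _ = self.rnn(self.emb(tgt), context)
        return self.out(y)                       # (batch, tgt_len, TGT_VOCAB) logits

enc, dec = Encoder(), Decoder()
src = torch.randint(0, SRC_VOCAB, (2, 7))        # a batch of 2 "French" sentences
tgt = torch.randint(0, TGT_VOCAB, (2, 9))        # the corresponding "English" sentences
print(dec(tgt, enc(src)).shape)                  # torch.Size([2, 9, 1200])
```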
Reinforcement Learning
[Diagram: the agent-environment loop of states, actions, and rewards.]
Knowledge and relations
Many applications require representing relations between entities and reasoning about them. A relation links two entities; an attribute is a property of a single entity.
Data: knowledge/relational databases (Freebase, OpenCyc, WordNet, Wikibase, …, Gene Ontology)
Knowledge graphs store facts about the world as relations between entities
Entities are no longer just strings but real-world objects with attributes, taxonomic information, and relations to other objects
Knowledge and relations
Neural language-model approach: learn an embedding vector for each entity and each relation (see the sketch below)
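One way to make this concrete is a TransE-style score (head + relation ≈ tail). TransE is not discussed in the lecture, and the embeddings below are random placeholders rather than trained vectors, so this is only an illustration of the idea.

```python
# Score a fact (head, relation, tail) using entity and relation embeddings:
# a plausible fact should have head + relation close to tail.
import numpy as np

rng = np.random.default_rng(0)
entities = {name: rng.normal(size=16) for name in ["Paris", "France", "Berlin", "Germany"]}
relations = {"capital_of": rng.normal(size=16)}

def score(head, rel, tail):
    """Higher (less negative) means the model considers the triple more plausible."""
    return -np.linalg.norm(entities[head] + relations[rel] - entities[tail])

print(score("Paris", "capital_of", "France"))
print(score("Paris", "capital_of", "Germany"))
```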
Visual Q&A
Art
Neural style transfer: decompose an image into style and content, and recombine the content of one image with the style of another (see the sketch below)
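As a hint of how the "style" part is usually captured: style is commonly summarised by Gram matrices (channel correlations) of convnet feature maps, while "content" is the feature maps themselves. The sketch below uses a random tensor in place of real pretrained-convnet features.

```python
# Gram matrix of one layer's feature maps: the channel-by-channel correlation
# matrix that style-transfer losses typically try to match.
import torch

def gram_matrix(features):
    # features: (channels, height, width) activations of one layer
    c, h, w = features.shape
    f = features.reshape(c, h * w)
    return f @ f.t() / (c * h * w)          # (channels, channels)

feats = torch.randn(64, 32, 32)             # placeholder for real conv features
print(gram_matrix(feats).shape)             # torch.Size([64, 64])
```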