Neural Networks
Lecture 9: Applications
Course levels: Foundations, Basics, Advanced, Selected topics
Lesson Title
1 Introduction
2 Probability and Information Theory
3 Basics of Machine Learning
4 Deep Feed-forward networks
5 Back-propagation
6 Optimization and regularization
7 Convolutional Networks
8 Sequence Modeling: recurrent networks
9 Applications
10 Software and Practical methods
11 Representational Learning
12 Deep Generative Models
13 Deep Reinforcement Learning
14 Deep Learning and the Brain
15 Conclusions and Future Perspective
16 Project presentations
Learning objectives
• Understand the needs of large-scale deep learning
• Know the most popular applications of modern deep networks in different fields
Large-scale deep learning
[Figure 1.11: number of neurons (logarithmic scale, 10^-2 to 10^11) versus year (1950-2056), comparing the artificial networks listed below (1-20) with biological nervous systems from the sponge, roundworm, leech, ant, bee, frog and octopus up to the human. Since the introduction of hidden units, artificial neural networks have doubled in size roughly every 2.4 years. Biological neural network sizes from Wikipedia (2015).]
1. Perceptron (Rosenblatt, 1958, 1962)
2. Adaptive linear element (Widrow and Hoff, 1960)
3. Neocognitron (Fukushima, 1980)
4. Early back-propagation network (Rumelhart et al., 1986b)
5. Recurrent neural network for speech recognition (Robinson and Fallside, 1991)
6. Multilayer perceptron for speech recognition (Bengio et al., 1991)
7. Mean field sigmoid belief network (Saul et al., 1996)
8. LeNet-5 (LeCun et al., 1998b)
9. Echo state network (Jaeger and Haas, 2004)
10. Deep belief network (Hinton et al., 2006)
11. GPU-accelerated convolutional network (Chellapilla et al., 2006)
12. Deep Boltzmann machine (Salakhutdinov and Hinton, 2009a)
13. GPU-accelerated deep belief network (Raina et al., 2009)
14. Unsupervised convolutional network (Jarrett et al., 2009)
15. GPU-accelerated multilayer perceptron (Ciresan et al., 2010)
16. OMP-1 network (Coates and Ng, 2011)
17. Distributed autoencoder (Le et al., 2012)
18. Multi-GPU convolutional network (Krizhevsky et al., 2012)
19. COTS HPC unsupervised convolutional network (Coates et al., 2013)
20. GoogLeNet (Szegedy et al., 2014a)
Improvements in accuracy have gone hand in hand with increases in network size, which in turn rely on high-performance hardware and software infrastructure.
Fast implementations
CPU
• Exploit fixed-point arithmetic in CPU families where this offers a speedup
GPU
• Parallelism and high memory bandwidth
• Low-level: TensorFlow, PyTorch, Theano (no longer maintained)
• High-level: Keras, Caffe
• NN training requires little branching, which makes it well suited to GPUs
Distributed implementations
• A single GPU is insufficient for large-scale applications
• Multi-GPU
• Multi-machine
• Model parallelism: several machines work on a single data point
• Data parallelism: each example is processed by a separate machine
• Trivial at test time
• Asynchronous SGD at train time (see the sketch below)
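A toy, self-contained sketch of lock-free asynchronous SGD on shared parameters, the idea behind the last bullet. It uses Python threads and a toy linear-regression objective purely for illustration, so it only shows the update pattern, not real multi-machine training; all names are made up here.

```python
# Asynchronous SGD, Hogwild-style: several workers update a shared parameter
# vector without locking. (CPython threads share the GIL, so this is only a
# schematic of the idea, not a performant implementation.)
import threading
import numpy as np

true_w = np.array([2.0, -3.0, 0.5])
w = np.zeros(3)                                   # shared parameters, updated in place

def grad(w_now, X, y):
    return 2.0 * X.T @ (X @ w_now - y) / len(y)   # gradient of the mean squared error

def worker(seed, steps=200, lr=0.05):
    rng = np.random.default_rng(seed)
    for _ in range(steps):
        X = rng.normal(size=(32, 3))
        y = X @ true_w + 0.01 * rng.normal(size=32)
        w[:] = w - lr * grad(w, X, y)             # no lock: updates may interleave

threads = [threading.Thread(target=worker, args=(i,)) for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print("learned:", np.round(w, 2), "target:", true_w)
```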
Model compression
• Commercial applications need low time and memory cost at test time
• If no personalization is needed, train once and deploy the same model to many users
• The end user is often resource-constrained (e.g. train a speech recognizer on a computer cluster, then deploy it on a mobile phone)
• Model compression aims to train a small model to mimic a large one, by sampling inputs x and the large model's outputs f(x)
• This often works better than training the small model directly (see the sketch below)
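A rough sketch of the mimic-training recipe above, with placeholder teacher and student architectures; the teacher here is untrained and merely stands in for a large trained model.

```python
# Model compression sketch: sample inputs x, label them with the large
# (teacher) model's outputs f(x), and fit a small (student) model to the pairs.
import torch
import torch.nn as nn

teacher = nn.Sequential(nn.Linear(20, 256), nn.ReLU(),
                        nn.Linear(256, 256), nn.ReLU(),
                        nn.Linear(256, 10))        # stands in for a trained large model
student = nn.Sequential(nn.Linear(20, 16), nn.ReLU(),
                        nn.Linear(16, 10))         # much cheaper at test time

opt = torch.optim.Adam(student.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

for step in range(1000):
    x = torch.randn(64, 20)                        # sampled inputs (could be real data)
    with torch.no_grad():
        target = teacher(x)                        # f(x) from the large model
    loss = loss_fn(student(x), target)             # student mimics the teacher
    opt.zero_grad()
    loss.backward()
    opt.step()
```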
Dynamic structure
Mixture of experts: a gating mechanism selects which expert network is used to compute the output for the current input (see the sketch below)
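A minimal sketch of hard gating in this spirit; layer sizes and names are placeholders, and real systems usually train with softer or noisy gating so that the gate itself receives gradients.

```python
# Mixture of experts with hard selection: a small gating network scores the
# experts and the highest-scoring expert computes the output for each example.
import torch
import torch.nn as nn

n_experts, d_in, d_out = 4, 8, 3
experts = nn.ModuleList(
    nn.Sequential(nn.Linear(d_in, 32), nn.ReLU(), nn.Linear(32, d_out))
    for _ in range(n_experts))
gate = nn.Linear(d_in, n_experts)                  # one score per expert

def forward(x):
    choice = gate(x).argmax(dim=1)                 # hard, non-differentiable selection
    return torch.stack([experts[int(i)](xi) for xi, i in zip(x, choice)])

print(forward(torch.randn(5, d_in)).shape)         # torch.Size([5, 3])
```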
Specialized hardware
ASICs (application-specific integrated circuits)
FPGAs (field-programmable gate arrays)
GPUs (with lower-precision arithmetic)
Optical systems (laser-based systems)
Computer vision
Convolutional neural networks achieve state-of-the-art results on many tasks
Specialized pre-processing is no longer very important (but normalising the contrast of images still helps; see the sketch below)
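A minimal sketch of one such contrast normalisation (global contrast normalisation, applied per image); the scale and epsilon constants are illustrative choices.

```python
# Global contrast normalization: per image, subtract the mean intensity and
# rescale so the pixels have roughly unit standard deviation.
import numpy as np

def global_contrast_normalize(img, scale=1.0, eps=1e-8):
    img = img.astype(np.float64)
    img = img - img.mean()                     # remove the mean intensity
    norm = np.sqrt((img ** 2).mean())          # contrast (std. dev. around zero)
    return scale * img / max(norm, eps)        # guard against nearly flat images

example = np.random.randint(0, 256, size=(32, 32, 3))
print(global_contrast_normalize(example).std())   # ~1.0
```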
Computer vision
Classification and retrieval
Computer vision
Caption generation
Computer vision
Face detection
Computer vision
Recognition of other body parts
Computer vision
Detection (bounding boxes) and segmentation
Computer vision
Transcription
Computer vision
Segmentation (annotating each pixel as cell or not cell)
Computer vision
Detection (intrusion)
Computer vision
Detection (leader-follower system)
https://www.youtube.com/watch?v=JaQTekVIXe4
Speech recognition
Map an acoustic signal (a natural-language utterance) to the corresponding sequence of words
Most industrial speech products now use neural networks (e.g. voice recognition on mobile phones)
Speech recognition
Spectrogram: spectrum of frequencies as they vary with time
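A small sketch of computing such a spectrogram, here on a synthetic signal rather than real speech, using scipy's spectrogram routine; the window sizes are arbitrary choices.

```python
# Compute a spectrogram: short-time spectra stacked over time, the "image"
# typically fed to a convolutional or recurrent acoustic model.
import numpy as np
from scipy.signal import spectrogram

fs = 16000                                     # 16 kHz sampling rate, as for speech
t = np.arange(0, 1.0, 1 / fs)
x = np.sin(2 * np.pi * (200 + 300 * t) * t)    # toy signal whose pitch rises over time

f, times, Sxx = spectrogram(x, fs=fs, nperseg=400, noverlap=200)
print(Sxx.shape)                               # (frequency bins, time frames)
```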
Speech recognition
Recognition (Convnets, LSTMs)
Synthesis (WaveNet)
Models speech directly at the waveform level, generating about 16,000 samples per second
"If you train it with an American’s speech, it produces American speech. If you train it with German, it produces German. And if you train it with Chopin, it produces… well, not quite Chopin, but piano in a logical, one might even be tempted to say creative vein."
https://deepmind.com/blog/wavenet-generative-model-raw-audio/
Natural Language Processing (NLP)
Naturally occurring language is ambiguous and defies formal description
Example: machine translation requires reading a sentence in one human language and emitting an equivalent sentence in another
NLP applications are based on language models (probability distributions over sequences of words)
This decomposition is justified by the chain rule of probability. The probability distribution over the initial sequence P(x_1, …, x_{n-1}) may be modeled by a different model with a smaller value of n.
Training n-gram models is straightforward because the maximum likelihood estimate can be computed simply by counting how many times each possible n-gram occurs in the training set. Models based on n-grams have been the core building block of statistical language modeling for many decades (Jelinek and Mercer, 1980; Katz, 1987; Chen and Goodman, 1999).
For small values of n, models have particular names: unigram for n = 1, bigram for n = 2, and trigram for n = 3. These names derive from the Latin prefixes for the corresponding numbers and the Greek suffix "-gram" denoting something that is written.
Usually we train both an n-gram model and an (n-1)-gram model simultaneously. This makes it easy to compute

P(x_t | x_{t-n+1}, …, x_{t-1}) = P_n(x_{t-n+1}, …, x_t) / P_{n-1}(x_{t-n+1}, …, x_{t-1})    (12.6)

simply by looking up two stored probabilities. For this to exactly reproduce inference in P_n, we must omit the final character from each sequence when we train P_{n-1}.
As an example, we demonstrate how a trigram model computes the probability of the sentence "THE DOG RAN AWAY." The first words of the sentence cannot be handled by the default formula based on conditional probability because there is no context at the beginning of the sentence. Instead, we must use the marginal probability over words at the start of the sentence. We thus evaluate P_3(THE DOG RAN). Finally, the last word may be predicted using the typical case, of using the conditional distribution P(AWAY | DOG RAN). Putting this together with equation 12.6, we obtain:

P(THE DOG RAN AWAY) = P_3(THE DOG RAN) P_3(DOG RAN AWAY) / P_2(DOG RAN)    (12.7)

A fundamental limitation of maximum likelihood for n-gram models is that P_n as estimated from training set counts is very likely to be zero in many cases, even though the tuple (x_{t-n+1}, …, x_t) may appear in the test set. This can cause two different kinds of catastrophic outcomes. When P_{n-1} is zero, the ratio is undefined, so the model does not even produce a sensible output. When P_{n-1} is non-zero but P_n is zero, the test log-likelihood is -∞. To avoid such catastrophic outcomes, most n-gram models employ some form of smoothing.
N-gram models: models over sequences of n tokens (words)
Estimate the probability of each n-gram by counting how many times it occurs in the training set
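A minimal sketch of this counting estimate (equation 12.6) for a trigram model; the corpus and function names are made up for illustration.

```python
# Maximum-likelihood trigram estimate: count trigrams and bigrams in a toy
# corpus and take their ratio, as in equation 12.6.
from collections import Counter

corpus = "the dog ran away the dog ran home the cat ran away".split()

def ngram_counts(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

tri = ngram_counts(corpus, 3)
bi = ngram_counts(corpus, 2)

def p_next(w1, w2, w3):
    """P(w3 | w1, w2) = count(w1 w2 w3) / count(w1 w2); undefined if the bigram is unseen."""
    return tri[(w1, w2, w3)] / bi[(w1, w2)]

print(p_next("dog", "ran", "away"))   # 0.5: "dog ran" is followed by "away" and by "home"
```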
Natural Language Processing (NLP)
N-gram models suffer from:
• the curse of dimensionality: the number of possible n-grams is enormous, so even in a massive training set most n-grams never occur;
• one-hot word representations: any two different words are the same distance apart, so the only neighbours whose information can be leveraged are training examples that repeat literally the same context.
"Cat" = (0,0,0,0,0,0,1,0,0,0,0,0)
"Dog"= (0,0,1,0,0,0,0,0,0,0,0,0)
Natural Language Processing (NLP)
Neural language models use a distributed representation of words (also called word embeddings), e.g. Word2Vec
"Cat" = (0.12, 0.31, 1.23, 0.05, 0.91)
"Dog" = (0.17, 0.23, 1.05, 0.04, 0.6)
Vectors are positioned so that words sharing common contexts in the corpus end up close to each other in the embedding space
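A tiny numerical illustration of this point, reusing the made-up embedding and one-hot vectors from these slides.

```python
# Dense embeddings vs. one-hot vectors: similar words get similar embeddings,
# whereas any two distinct one-hot vectors are equally unrelated.
import numpy as np

cat = np.array([0.12, 0.31, 1.23, 0.05, 0.91])
dog = np.array([0.17, 0.23, 1.05, 0.04, 0.60])

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

print(round(cosine(cat, dog), 2))       # ~0.99: similar words, similar vectors

cat_onehot = np.eye(12)[6]              # the one-hot codes from the earlier slide
dog_onehot = np.eye(12)[2]
print(cosine(cat_onehot, dog_onehot))   # 0.0: all distinct words look equally far apart
```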
Natural Language Processing (NLP)
Neural machine translation: model the conditional probability of a sentence in the target language given a sentence in the source language
[Figure 12.5 diagram: source object (French sentence or image) → encoder → intermediate, semantic representation → decoder → output object (English sentence).]
Figure 12.5: The encoder-decoder architecture to map back and forth between a surface representation (such as a sequence of words or an image) and a semantic representation. By using the output of an encoder of data from one modality (such as the encoder mapping from French sentences to hidden representations capturing the meaning of sentences) as the input to a decoder for another modality (such as the decoder mapping from hidden representations capturing the meaning of sentences to English), we can train systems that translate from one modality to another. This idea has been applied successfully not just to machine translation but also to caption generation from images.
A drawback of the MLP-based approach is that it requires the sequences to be preprocessed to be of fixed length. To make the translation more flexible, we would like to use a model that can accommodate variable length inputs and variable length outputs. An RNN provides this ability. Section 10.2.4 describes several ways of constructing an RNN that represents a conditional distribution over a sequence given some input, and section 10.4 describes how to accomplish this conditioning when the input is a sequence. In all cases, one model first reads the input sequence and emits a data structure that summarizes the input sequence. We call this summary the "context" C. The context C may be a list of vectors, or it may be a vector or tensor. The model that reads the input to produce C may be an RNN (Cho et al., 2014a; Sutskever et al., 2014; Jean et al., 2014) or a convolutional network (Kalchbrenner and Blunsom, 2013). A second model, usually an RNN, then reads the context C and generates a sentence in the target language. This general idea of an encoder-decoder framework for machine translation is illustrated in figure 12.5.
In order to generate an entire sentence conditioned on the source sentence, the model must have a way to represent the entire source sentence. Earlier models were only able to represent individual words or phrases.
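The encoder-decoder idea of figure 12.5 can be sketched in a few lines. The sketch below is illustrative only, with toy vocabulary sizes and a single GRU layer on each side; it is not the architecture of any of the cited papers.

```python
# Encoder-decoder sketch: the encoder compresses the source sentence into a
# context C (its final hidden state); the decoder generates the target
# sentence conditioned on C (here with teacher forcing).
import torch
import torch.nn as nn

SRC_VOCAB, TGT_VOCAB, EMB, HID = 1000, 1200, 64, 128

class Encoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(SRC_VOCAB, EMB)
        self.rnn = nn.GRU(EMB, HID, batch_first=True)
    def forward(self, src):                      # src: (batch, src_len) token ids
        _, h = self.rnn(self.emb(src))
        return h                                 # context C: (1, batch, HID)

class Decoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(TGT_VOCAB, EMB)
        self.rnn = nn.GRU(EMB, HID, batch_first=True)
        self.out = nn.Linear(HID, TGT_VOCAB)
    def forward(self, tgt, context):             # tgt: (batch, tgt_len) token ids
        y, _ = self.rnn(self.emb(tgt), context)
        return self.out(y)                       # (batch, tgt_len, TGT_VOCAB) logits

enc, dec = Encoder(), Decoder()
src = torch.randint(0, SRC_VOCAB, (2, 7))        # a batch of 2 "French" sentences
tgt = torch.randint(0, TGT_VOCAB, (2, 9))        # the corresponding "English" sentences
print(dec(tgt, enc(src)).shape)                  # torch.Size([2, 9, 1200])
```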
Reinforcement Learning
[Diagram: the agent-environment loop of states, actions, and rewards.]
Knowledge and relations
Many applications require representing relations between entities and reasoning about them. A relation links two entities; an attribute is a property of a single entity.
Data: knowledge/relational databases (Freebase, OpenCyc, WordNet, Wikibase, …, Gene Ontology)
Knowledge graphs store facts about the world as relations between entities
Entities are no longer just strings but real-world objects with attributes, taxonomic information, and relations to other objects
Knowledge and relations
Neural language-model approach: learn an embedding vector for each entity and each relation (see the sketch below)
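One way to make this concrete is a TransE-style score (head + relation ≈ tail). TransE is not discussed in the lecture, and the embeddings below are random placeholders rather than trained vectors, so this is only an illustration of the idea.

```python
# Score a fact (head, relation, tail) using entity and relation embeddings:
# a plausible fact should have head + relation close to tail.
import numpy as np

rng = np.random.default_rng(0)
entities = {name: rng.normal(size=16) for name in ["Paris", "France", "Berlin", "Germany"]}
relations = {"capital_of": rng.normal(size=16)}

def score(head, rel, tail):
    """Higher (less negative) means the model considers the triple more plausible."""
    return -np.linalg.norm(entities[head] + relations[rel] - entities[tail])

print(score("Paris", "capital_of", "France"))
print(score("Paris", "capital_of", "Germany"))
```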
Visual Q&A
Art
Neural style transfer: decompose an image into style and content, and recombine the content of one image with the style of another (see the sketch below)
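As a hint of how the "style" part is usually captured: style is commonly summarised by Gram matrices (channel correlations) of convnet feature maps, while "content" is the feature maps themselves. The sketch below uses a random tensor in place of real pretrained-convnet features.

```python
# Gram matrix of one layer's feature maps: the channel-by-channel correlation
# matrix that style-transfer losses typically try to match.
import torch

def gram_matrix(features):
    # features: (channels, height, width) activations of one layer
    c, h, w = features.shape
    f = features.reshape(c, h * w)
    return f @ f.t() / (c * h * w)          # (channels, channels)

feats = torch.randn(64, 32, 32)             # placeholder for real conv features
print(gram_matrix(feats).shape)             # torch.Size([64, 64])
```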