An Introduction to Neural Machine Translation ATA59 Carola F. Berger



Disclaimer 1 – This Presentation

Carola F. Berger, An Introduction to NMT, ATA59 2


Disclaimer 2


After this presentation, you will hopefully understand how neural MT works, but you will not understand what it does.


~100 million parameters


Outline

  Brief recap: previous MT approaches

  How do neural networks work?

  How do words get into and out of a neural network?


MT Approaches

  Rule-based:

Basically just grammar + dictionary


  Statistical MT: Chop sentences up into n-grams (sequences of n words) or phrases. Training of the engine: count how often each source n-gram co-occurs with each target n-gram; those frequencies serve as translation probabilities. Translation after training: chop source sentences into n-grams or phrases and apply the previously calculated probabilities. A 1-gram SMT is just a dictionary. Pros: output is deterministic, no words go missing. Cons: context!

  Neural MT – this presentation
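The SMT recipe above can be sketched in a few lines. This is a deliberately toy illustration: the corpus, the position-wise pairing of source and target 2-grams, and the function names are all made up for this sketch; real SMT training estimates alignments over millions of sentence pairs.

```python
from collections import Counter

def ngrams(tokens, n):
    """All n-grams (as tuples) of a token list."""
    return [tuple(tokens[i:i+n]) for i in range(len(tokens) - n + 1)]

# Toy parallel corpus (hypothetical).
corpus = [
    ("the cat sleeps", "die Katze schläft"),
    ("the cat eats", "die Katze frisst"),
]

# "Training": count co-occurring 2-gram pairs, position by position,
# and normalize the counts into probabilities.
pair_counts = Counter()
src_counts = Counter()
for src, tgt in corpus:
    for s, t in zip(ngrams(src.split(), 2), ngrams(tgt.split(), 2)):
        pair_counts[(s, t)] += 1
        src_counts[s] += 1

def translation_prob(s, t):
    """Relative frequency of target 2-gram t given source 2-gram s."""
    return pair_counts[(s, t)] / src_counts[s]

print(translation_prob(("the", "cat"), ("die", "Katze")))  # 1.0
```

Pairing n-grams by position, as above, assumes a strictly monotone alignment; that is exactly the simplification real SMT systems avoid with alignment models.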


Statistical MT


From G. M. de Buy Wenninger and K. Sima’an, PBML No. 101, April 2014, p. 43


How Do Neural Networks Work?

A (not so) brief recap of last year’s presentation at ATA58. See also the handouts as a PDF in the app or on my website (see references).


Biological Neuron

Bruce Blaus, https://commons.wikimedia.org/wiki/File:Blausen_0657_MultipolarNeuron.png


Artificial Neuron


Artificial Neuron – Perceptron

The perceptron sums its weighted inputs into a “blob”: if blob > 3 => output 1, else output 0. In the example, inputs (1, 1) produce output 0, while inputs (1, 2) produce output 1.


x1  x2  output
1   1   0
1   2   1
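The truth table follows from one particular choice of weights. A minimal sketch, assuming weights of 2 and 1 with the threshold of 3 from the figure (the weights are inferred here to reproduce the table, not stated on the slide):

```python
def perceptron(x1, x2, w1=2, w2=1, threshold=3):
    """Sum the weighted inputs into a 'blob'; fire only above the threshold."""
    blob = w1 * x1 + w2 * x2
    return 1 if blob > threshold else 0

print(perceptron(1, 1))  # blob = 3, not > 3 => 0
print(perceptron(1, 2))  # blob = 4 => 1
```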


Neural Network

  Human brain:

Hagmann P, Cammoun L, Gigandet X, Meuli R, Honey CJ, Wedeen VJ, Sporns O (2008), Mapping the structural core of human cerebral cortex. PLoS Biology Vol. 6, No. 7, e159.


Artificial Neural Network

Adapted from: Cburnett, https://commons.wikimedia.org/wiki/File:Artificial_neural_network.svg


What is AI?

Adapted from: S. Jurvetson, https://www.flickr.com/photos/jurvetson/31409423572/


Artificial Neural Network

  Training:

Adapted from: Cburnett, https://commons.wikimedia.org/wiki/File:Artificial_neural_network.svg

Feed in training data, then adapt the weights (the “arrows”) according to the difference between the desired output and the actual output, e.g. by backpropagation.
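That training loop can be sketched for a single sigmoid neuron. The update below is the delta rule, the one-neuron core that backpropagation applies layer by layer; the learning rate, starting weights, and training example are made up for illustration:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_step(w, x, target, lr=0.5):
    """One training step: forward pass, then nudge each weight against
    the output error (desired output minus actual output)."""
    y = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
    delta = (y - target) * y * (1 - y)   # gradient of squared error w.r.t. the sum
    new_w = [wi - lr * delta * xi for wi, xi in zip(w, x)]
    return new_w, y

# Repeatedly feed in one training example; the output drifts toward the target.
w = [0.0, 0.0]
for _ in range(2000):
    w, y = train_step(w, [1.0, 1.0], target=1.0)
print(round(y, 2))
```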


Neural Net Example – Digit Recognition

Sample input data


Sample input (20×20 pixels)


Weights


Weights to hidden units – “feature” extraction


Hidden units to output


Network layout: input → internal convolution layers → hidden units → output. The sample digit is recognized as a “2”.

Page 42: An Introduction to Neural Machine Translation...Translation after training: Chop source sentences into n-grams or phrases, apply previously calculated probabilities. 1-gram SMT = dictionary

Neural Net Example – Digit Recognition�What happens with unknowns? Klingon 6 [jav]

Input Internal convolution Hidden Output

Carola F. Berger, An Introduction to NMT, ATA59 25

Page 43: An Introduction to Neural Machine Translation...Translation after training: Chop source sentences into n-grams or phrases, apply previously calculated probabilities. 1-gram SMT = dictionary

Neural Net Example – Digit Recognition�Klingon 6 [jav]

Input Internal convolution Hidden Output

Internal conv.

2

Carola F. Berger, An Introduction to NMT, ATA59 25


Artificial Neural Network

Adapted from: Cburnett, https://commons.wikimedia.org/wiki/File:Artificial_neural_network.svg


Feed-forward neural net
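A feed-forward net like the one pictured just chains layers: each hidden unit takes a weighted sum of all inputs, and each output unit takes a weighted sum of the hidden units. A minimal sketch with made-up weights (3 inputs, 2 hidden units, 1 output):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def layer(inputs, weights):
    """One fully connected layer: every unit sees every input."""
    return [sigmoid(sum(w * x for w, x in zip(row, inputs))) for row in weights]

def feed_forward(x, hidden_w, output_w):
    """Signals only flow forward: input -> hidden layer -> output layer."""
    return layer(layer(x, hidden_w), output_w)

hidden_w = [[0.5, -0.2, 0.1], [0.3, 0.8, -0.5]]  # hypothetical trained weights
output_w = [[1.0, -1.0]]
print(feed_forward([1.0, 0.0, 1.0], hidden_w, output_w))
```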


How Do Words Get Into and Out of the Network?

• Challenges for NMT:

  Input and output length not fixed, different sentence ordering in source and target languages => Use recurrent neural networks with attention or convolutional networks

  Context => document (or at least paragraph) level, not sentence level

  Training metrics


Recurrent neural net

Adapted from: Cburnett, https://commons.wikimedia.org/wiki/File:Artificial_neural_network.svg


Attention mechanism

Source: K. Xu et al., https://arxiv.org/abs/1502.03044


Attention mechanism

Source: D. Bahdanau et al., ICLR conference proceedings, https://arxiv.org/pdf/1409.0473.pdf
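At each decoding step, attention scores every source position against the current decoder state, softmaxes the scores into weights, and averages the encoder states accordingly. A simplified sketch, in which a plain dot-product scorer stands in for the small feed-forward scorer of Bahdanau et al.:

```python
import math

def softmax(scores):
    """Turn arbitrary scores into weights that are positive and sum to 1."""
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(decoder_state, encoder_states):
    """Score each source position, then return the attention weights and the
    weighted average ('context vector') of the encoder states."""
    scores = [sum(d * e for d, e in zip(decoder_state, enc))
              for enc in encoder_states]
    weights = softmax(scores)
    context = [sum(w * enc[i] for w, enc in zip(weights, encoder_states))
               for i in range(len(encoder_states[0]))]
    return weights, context

weights, context = attend([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])
print(weights)  # first source position scores higher, so it gets more weight
```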


Source text → tokenization (optional) → Encoder (RNN with attention) → hidden state → Decoder (RNN with attention) → target text
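The pipeline above, reduced to a cartoon: the encoder folds the source tokens into one intermediate state, and the decoder unrolls that state into target tokens. Everything here is a stand-in (a token list instead of a learned state vector, a hypothetical toy lexicon instead of a trained decoder), so it illustrates only the shape of the pipeline, not real NMT behavior:

```python
def encode(source_tokens):
    """Fold the source into a single 'hidden state' (a real RNN would
    update a vector at each step; we just accumulate a list)."""
    state = []
    for tok in source_tokens:
        state = [tok] + state
    return state

def decode(state, lexicon):
    """Unroll the state into target tokens, one at a time (attention would
    pick source positions softly instead of popping them in order)."""
    out = []
    while state:
        tok = state.pop()
        out.append(lexicon.get(tok, tok))
    return out

lexicon = {"the": "die", "cat": "Katze"}  # hypothetical toy lexicon
print(decode(encode(["the", "cat"]), lexicon))  # ['die', 'Katze']
```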


Word embeddings: explore them interactively at https://projector.tensorflow.org
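This is what the projector visualizes: each word is a vector, and words used in similar ways end up close together. A sketch with made-up 3-dimensional vectors (real embeddings have hundreds of dimensions and are learned from data, so these numbers are purely illustrative):

```python
import math

# Hypothetical embeddings, invented so that "king" and "queen" lie close.
emb = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.7, 0.9],
    "apple": [0.1, 0.9, 0.2],
}

def cosine(a, b):
    """Cosine similarity: 1.0 for parallel vectors, 0.0 for orthogonal ones."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def nearest(word):
    """The other vocabulary word whose vector points most nearly the same way."""
    return max((w for w in emb if w != word), key=lambda w: cosine(emb[word], emb[w]))

print(nearest("king"))
```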

Recall: SMT

From G. M. de Buy Wenninger and K. Sima’an, PBML No. 101, April 2014, p. 43

Neural Nets – Recap

  Training = extraction of “features” (= patterns) from the training data

  The more hidden layers and hidden units, the more parameters (possible overfitting!)

  Beware: garbage in -> worse garbage out!

  ANNs work well for pattern recognition after training, including “context”

  Completely unpredictable when confronted with new, hitherto unknown data

Training Data

Source: P. Koehn, https://arxiv.org/abs/1709.07809

Unpredictability


References & Further Reading

  Slides at: https://www.CFBtranslations.com

  Handouts in app and also at https://www.CFBtranslations.com

  A. Ng, Machine Learning, Coursera, https://www.coursera.org/learn/machine-learning

  Google’s Tensorflow: https://www.tensorflow.org/

  Madly Ambiguous - game to illustrate how NMT deals with context: http://madlyambiguous.osu.edu
