Fast evaluation of connectionist language models
10th International Work-Conference on Artificial Neural Networks

F. Zamora-Martínez, M.J. Castro-Bleda, S. España-Boquera

Departamento de Ciencias Físicas, Matemáticas y de la Computación
Universidad CEU-Cardenal Herrera
46115 Alfara del Patriarca (Valencia), Spain

Departamento de Sistemas Informáticos y Computación
Universidad Politécnica de Valencia
Valencia, Spain

{fzamora,mcastro,sespana}@dsic.upv.es

June 11, 2009

F. Zamora et al (UCH CEU - UPV) Fast Evaluation of connectionist language models June 11 2009 1 / 26

Index

1 Introduction and motivation

2 Neural Network Language Models (NN LMs)

3 Fast evaluation of NNLMs

4 Estimation of the NNLMs

5 Evaluation of the proposed approach

6 Discussion and conclusions



Introduction and motivation

Language modelling is the attempt to characterize, capture and exploit regularities in natural language.

In pattern recognition problems, language models (LMs) are useful to guide the search for the optimal response and to increase the success rate of the system.

Example: LM statistical framework

S = A move to stop . . .

p(S) = ∏_{i=1}^{|S|} p(s_i | s_1^{i−1}) = p(A) · p(move | A) · p(to | A move) · p(stop | A move to) · · ·
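As a toy illustration of this factorization, a sentence probability can be computed from conditional probabilities truncated to a bigram history. The probabilities below are made up for illustration, not estimated from any corpus:

```python
# Hypothetical bigram probabilities, made up for illustration only.
bigram_prob = {
    ("<s>", "A"): 0.2,
    ("A", "move"): 0.1,
    ("move", "to"): 0.5,
    ("to", "stop"): 0.3,
}

def sentence_prob(words):
    """p(S) = prod_i p(s_i | s_{i-1}): the chain rule with a bigram history."""
    p, prev = 1.0, "<s>"
    for w in words:
        p *= bigram_prob.get((prev, w), 1e-6)  # tiny floor for unseen bigrams
        prev = w
    return p

print(sentence_prob(["A", "move", "to", "stop"]))  # 0.2 * 0.1 * 0.5 * 0.3
```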

Statistical framework: n-grams

+ n-grams are the most popular LMs, due to their simplicity and robustness.
+ The model parameters are learnt from text corpora using the occurrence frequencies of subsequences of n word units.

Example: possible n-grams with n = 2 (bigrams)

S = A move to stop Mr. Gaitskell . . . =
  = <s> A move to stop Mr. Gaitskell . . . </s>

(<s> A), (A move), (move to), (to stop), (stop Mr.), (Mr. Gaitskell), . . .

Drawbacks of n-grams

– Larger values of n can capture longer-term dependencies between words.
– But the number of different n-grams grows exponentially with n, and requires more and more training data.
– To alleviate this problem, techniques such as smoothing or clustering can be applied.
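The bigram extraction above can be sketched in a few lines; `ngrams` is a hypothetical helper name, and the sentence markers follow the slide:

```python
def ngrams(words, n=2):
    """Extract n-grams from a sentence padded with <s> / </s> markers."""
    padded = ["<s>"] + words + ["</s>"]
    return [tuple(padded[i:i + n]) for i in range(len(padded) - n + 1)]

for bg in ngrams("A move to stop Mr. Gaitskell".split()):
    print(bg)
```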

Connectionist language models

Recently, some authors have proposed the application of neural networks (NNs) to language modelling [Bengio][Castro][Schwenk]. These models can compute an automatic smoothing of unseen n-grams, and are more scalable with n.

⇓
Despite their theoretical advantages, these LMs are more expensive to compute.

⇓
A novel technique to speed up the computation of connectionist language models is presented in this work.

Motivation
To integrate the connectionist language model into the Viterbi decoder of a pattern recognition software.


Neural Network Language Models (NN LMs)

LM probability equation, n-grams:

p(s_1 . . . s_|S|) ≈ ∏_{i=1}^{|S|} p(s_i | s_{i−n+1} . . . s_{i−1}).

A NN LM is a statistical LM which follows the same equation as n-grams. The probabilities that appear in that expression are estimated with a NN. The model fits naturally under the probabilistic interpretation of the outputs of NNs: if a NN, in this case a Multilayer Perceptron (MLP), is trained as a classifier, the outputs associated to each class are estimations of the posterior probabilities of the defined classes.

NN LMs: Codification of the vocabulary I

The training set for a LM is a sequence s_1 s_2 . . . s_|S| of words from a vocabulary Ω. Each input word is locally encoded following a “1-of-|Ω|” scheme.

Problems:
– For tasks with large vocabularies, the resulting NN is very large.
– The input of the NN is very sparse.
– It leads to slow convergence during the training process.

A trigram example of NN LM

          1  2  3  4  . . .  |Ω|
s_{i−2} = 0  0  1  0  . . .  0
s_{i−1} = 1  0  0  0  . . .  0
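The local “1-of-|Ω|” scheme can be sketched as follows; `one_hot` and the tiny vocabulary are illustrative, not from the paper:

```python
def one_hot(word, vocab):
    """Local "1-of-|vocab|" encoding: a sparse vector with a single 1."""
    vec = [0] * len(vocab)
    vec[vocab.index(word)] = 1
    return vec

vocab = ["<s>", "a", "move", "to", "stop", "</s>"]
print(one_hot("move", vocab))
```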

NN LMs: Codification of the vocabulary II

We use ideas from Bengio and Schwenk to learn a distributed representation of each word during the MLP training.

Example: distributed encoding

s_{i−2} = 0.2 0.1 0.5 0.3 . . . 0.1
s_{i−1} = 0.4 0.4 0.3 0.6 . . . 0.2

|s_{i−2}| ≪ |Ω|, |s_{i−1}| ≪ |Ω|

NN LMs: Codification of the vocabulary III

The input is composed of the words s_{i−n+1}, . . . , s_{i−1} of the n-gram equation; each word is represented using a local encoding. The network computes p(s_i | s_{i−n+1} . . . s_{i−1}).

A new projection layer P, formed by subsets of projection units P_{i−n+1}, . . . , P_{i−1}, is added. Each P_j encodes the corresponding input word s_j (P_j ⇒ codified word).

The weights from each local encoding of an input word s_j to the corresponding subset of projection units P_j are the same for all input words j: shared weights in the projection layer.

After training, the projection layer is removed from the network by pre-computing a table of size |Ω| which serves as a distributed encoding:

a     0.4 0.2 . . . 0.3
move  0.2 0.1 . . . 0.8
to    0.6 0.7 . . . 0.6
stop  0.1 0.5 . . . 0.2
. . .
</s>  0.4 0.3 . . . 0.9

H is the hidden layer, with an empirically chosen number of units.

O is the output layer, with |Ω| units, whose i-th output estimates p(ω_i | s_{i−n+1} . . . s_{i−1}), ω_i ∈ Ω. The softmax activation function also ensures that the output values sum to one:

o_i = exp(a_i) / Σ_{j=1}^{|Ω|} exp(a_j),

where a_i is the activation value of the i-th output unit and o_i is its output value.

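The softmax of the output layer can be sketched directly from the formula; the max-subtraction is a standard numerical-stability trick, not something the slides mention:

```python
import math

def softmax(activations):
    """o_i = exp(a_i) / sum_j exp(a_j); the outputs sum to one."""
    m = max(activations)  # subtract the max for numerical stability
    exps = [math.exp(a - m) for a in activations]
    z = sum(exps)  # the normalization constant this talk pre-computes
    return [e / z for e in exps]

probs = softmax([1.0, 2.0, 3.0])
print(probs, sum(probs))
```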

NN LMs: Codification of the vocabulary IV

This NN predicts the posterior probability of each word of the vocabulary given the history. A single forward pass of the MLP gives p(ω | s_{i−n+1} . . . s_{i−1}) for every word ω ∈ Ω.

Advantages
+ Automatic estimation (as with statistical LMs).
+ The lowest (in general) number of parameters of the obtained models.
+ Automatic smoothing performed by the neural network estimators.

Problems
– The larger the lexicon is, the larger the number of parameters the neural network needs.
– Speech recognition, handwriting recognition and translation tasks require thousands of language model lookups.
– Huge NN LMs consume excessive time computing these values.


Fast evaluation of NNLMs I

The softmax normalization term requires the computation of every output value. This computation dominates the cost of the forward pass in a typical NN LM topology.

o_i = exp(a_i) / Σ_{j=1}^{|Ω|} exp(a_j),

where a_i is the activation value of the i-th output unit, o_i is its output value, and the i-th output estimates p(ω | s_{i−n+1} . . . s_{i−1}), ω ∈ Ω.

Proposed approach I

The idea is to pre-compute and store the softmax normalization constants most probably needed during the LM evaluation.

A space/time trade-off has to be considered: the more space is dedicated to storing pre-computed softmax normalization constants, the more time reduction can be obtained.

When a given normalization constant is not found, it can be:
– computed on-the-fly, or
– some kind of smoothing can be applied.

We follow the latter idea when a softmax normalization constant is not found.

Proposed approach II

1 Observe that a bigram NN LM only needs a |Ω|-sized pre-computed table of softmax normalization constants.

Proposed approach II

2 Choose a hierarchy of models, from higher-order n-grams down to bigrams.

Possible hierarchy of models

Proposed approach II

3 For the bigram NN LM: pre-compute the softmax normalization constant for every word of the lexicon. Store the constants in a |Ω|-sized table.

for each ω ∈ Ω ⇒ Σ_{j=1}^{|Ω|} exp(a_j)

⇓
Pre-computed table (|Ω| entries):

<s>    0.1
a     -0.1
move   1.0
. . .  . . .

Proposed approach II

4 For each n-gram NN LM of order higher than bigram: pre-compute the softmax normalization constants for the K most frequent (n−1)-grams in the training set. Store the constants in a table.

Σ_{j=1}^{|Ω|} exp(a_j)

⇓
Pre-computed table (K entries):

<s> <s> <s>   0.5
<s> <s> a     0.1
<s> a move    1.0
a move to    -0.1
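Step 4 can be sketched as below; `activations_for` stands in for the MLP forward pass up to the output activations, and all names and values are illustrative:

```python
import math
from collections import Counter

def precompute_constants(histories, activations_for, K):
    """Store the softmax normalization constant for the K most
    frequent (n-1)-gram histories seen in training."""
    table = {}
    for history, _count in Counter(histories).most_common(K):
        # sum_j exp(a_j) over the output activations for this history
        table[history] = sum(math.exp(a) for a in activations_for(history))
    return table

# Toy stand-in for the network: two histories, two output units each.
acts = {("<s>", "a"): [0.0, 0.0], ("a", "move"): [1.0, 1.0]}
histories = [("<s>", "a")] * 3 + [("a", "move")] * 2
print(precompute_constants(histories, acts.get, K=1))
```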

Proposed approach II

5 During the test evaluation, for each token: search for the pre-computed softmax constant associated to the (n−1)-length prefix of the token in the table. If the constant is in the table, calculate the probability; otherwise, switch to the immediately inferior NN LM.

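Putting steps 2–5 together, a hedged sketch of the lookup-with-fallback; all data structures here are hypothetical, and `models[n]` stands in for the n-gram MLP's output activation for a word:

```python
import math

def nnlm_prob(word, history, models, tables):
    """Evaluate p(word | history) with the highest-order NN LM whose
    normalization constant was pre-computed, backing off otherwise."""
    for n in sorted(models, reverse=True):       # e.g. 4-gram, trigram, bigram
        prefix = tuple(history[-(n - 1):])
        const = tables[n].get(prefix)
        if const is not None:                    # constant found: use this model
            return math.exp(models[n](prefix, word)) / const
    raise KeyError("the bigram table covers every word, so this is unreachable")

# Toy hierarchy: a trigram with a partial table and a full bigram table.
models = {3: lambda p, w: 1.0, 2: lambda p, w: 0.0}
tables = {3: {("a", "move"): math.e}, 2: {("move",): 2.0}}
print(nnlm_prob("to", ["a", "move"], models, tables))  # trigram constant found
print(nnlm_prob("to", ["x", "move"], models, tables))  # falls back to the bigram
```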


Estimation of the NN LMs: Corpus

Experiments with the LOB text corpus have been conducted. A subcorpus with a lexicon of 3 000 words has been built as follows:

– Apply a random ordering of the sentences.
– Select those sentences whose lexicon lies within the first 3 000 different words.

This subcorpus was partitioned into three sets:

Partition    # Sentences   # Running words
Training     4 303         37 606
Validation   600           5 348
Test         600           5 455

Validation set perplexity for the estimated NN LMs

Language Model   Bigram   Trigram   4-gram
NN LM 80–128     82.17    73.62     74.07
NN LM 128–192    82.52    73.30     72.50
NN LM 192–192    80.01    71.90     71.91
Mixed NN LM      78.92    71.34     70.63

Three different NN LM topologies for bigram, trigram and 4-gram, and a combination of the three for each n-gram.

Test set perplexity of the best NN LMs and SRI

Language Model   Bigram   Trigram   4-gram
Mixed NN LM      88.72    80.94     79.90
SRI              88.29    83.19     87.05

The best NN LM compared with the SRI statistical n-gram.


Evaluation of the proposed approach: perplexity

Influence of the number of pre-computed softmax normalization constants on the test set perplexity of the proposed approach, for the Mixed NN LMs (left), and for the Mixed NN LMs combined with a statistical bigram (right).

Evaluation of the proposed approach: speed up

Model        Seconds/token   Tokens/second
NN LM        6.43 × 10^−3    155
Fast NN LM   1.94 × 10^−4    5 154

A 33× speedup is achieved (for 3 000 words; higher speedups would be achieved for bigger lexicon sizes).

Conclusion
This speed-up allows the integration of these LMs in the search procedure of a recognition task.
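The reported timings are internally consistent, as a quick arithmetic check shows (the truncation below mirrors how the table rounds):

```python
nnlm_spt, fast_spt = 6.43e-3, 1.94e-4   # seconds per token, from the table
print(int(1 / nnlm_spt))                # tokens/second for the plain NN LM
print(int(1 / fast_spt))                # tokens/second for the fast NN LM
print(int(nnlm_spt / fast_spt))         # the resulting speedup factor
```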


Discussion and conclusions

A novel method to allow fast evaluation of connectionist language models has been presented.

The best perplexity is obtained by combining the Mixed NN LM with a statistical bigram. A 33× speedup is achieved with this lexicon size (3 000 words). Nevertheless, higher speedups would be achieved for bigger lexicon sizes.

Our next goal is to train more complex NN LMs and to integrate them into recognition or translation systems.
