Learning Representations of Text using Neural Networks
Tomas Mikolov
Joint work with Ilya Sutskever, Kai Chen, Greg Corrado, Jeff Dean, Quoc Le, Thomas Strohmann
Google Research
NIPS Deep Learning Workshop 2013
Overview
Distributed Representations of Text
Efficient learning
Linguistic regularities
Examples
Translation of words and phrases
Available resources
Representations of Text
Representation of text is very important for the performance of many real-world applications. The most common techniques are:
Local representations
N-grams
Bag-of-words
1-of-N coding
Continuous representations
Latent Semantic Analysis
Latent Dirichlet Allocation
Distributed Representations
(A small sketch contrasting 1-of-N coding with a distributed representation follows.)
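A minimal sketch of the contrast between a local (1-of-N) representation and a distributed one; the tiny vocabulary, the vector dimensionality, and the random values are made-up placeholders (in practice the dense vectors are learned by a neural network).

```python
import numpy as np

vocab = ["king", "queen", "man", "woman"]

# 1-of-N (one-hot): each word is a sparse vector with a single 1;
# every pair of words is equally far apart.
one_hot = {w: np.eye(len(vocab))[i] for i, w in enumerate(vocab)}

# Distributed: each word is a dense, low-dimensional real-valued vector,
# so related words can end up close together.
rng = np.random.default_rng(0)
distributed = {w: rng.normal(size=5) for w in vocab}

print(one_hot["king"])        # e.g. [1. 0. 0. 0.]
print(distributed["king"])    # a dense 5-dimensional vector
```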
Distributed Representations
Distributed representations of words can be obtained from various neural network based language models:
Feedforward neural net language model
Recurrent neural net language model
Feedforward Neural Net Language Model
[Figure: four-gram feedforward NNLM with input words w(t-3), w(t-2), w(t-1) mapped through the shared projection matrix U, a hidden layer (weights V), and an output layer (weights W) predicting w(t)]
Four-gram neural net language model architecture (Bengio 2001)
The training is done using stochastic gradient descent and backpropagation
The word vectors are in matrix U
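A minimal numpy sketch of the forward pass of such a four-gram NNLM; the layer sizes, random initialization, and full-softmax details are illustrative assumptions rather than the exact setup used in the slides.

```python
import numpy as np

V_size, dim, hidden = 10_000, 100, 500     # vocabulary, embedding, hidden sizes (assumed)
rng = np.random.default_rng(0)

U = rng.normal(0, 0.1, (V_size, dim))      # word vectors (projection matrix U)
V = rng.normal(0, 0.1, (3 * dim, hidden))  # projection -> hidden weights
W = rng.normal(0, 0.1, (hidden, V_size))   # hidden -> output weights

def nnlm_forward(context_ids):
    """Distribution over w(t) given the ids of w(t-3), w(t-2), w(t-1)."""
    x = np.concatenate([U[i] for i in context_ids])  # projection layer
    h = np.tanh(x @ V)                               # hidden layer
    scores = h @ W                                   # output layer
    scores = scores - scores.max()                   # numerical stability
    return np.exp(scores) / np.exp(scores).sum()     # full softmax over the vocabulary

p_next = nnlm_forward([12, 345, 678])
```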
Efficient Learning
The training complexity of the feedforward NNLM is high:
Propagation from projection layer to the hidden layer
Softmax in the output layer
Using this model just for obtaining the word vectors is very inefficient
Efficient Learning
The full softmax can be replaced by:
Hierarchical softmax (Morin and Bengio)
Hinge loss (Collobert and Weston)
Noise contrastive estimation (Mnih et al.)
Negative sampling (our work)
We can further remove the hidden layer: for large models, this can provide an additional speedup of 1000x (a negative-sampling sketch follows this list)
Continuous bag-of-words model
Continuous skip-gram model
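A minimal sketch of the negative-sampling idea listed above: instead of a full softmax over the vocabulary, each observed (word, context) pair is scored against only a few randomly drawn negative words. The vectors are random placeholders, and the uniform negative sampling is a simplification (word2vec draws negatives from a smoothed unigram distribution).

```python
import numpy as np

rng = np.random.default_rng(0)
V_size, dim = 10_000, 300
W_in = rng.normal(0, 0.1, (V_size, dim))    # input word vectors
W_out = rng.normal(0, 0.1, (V_size, dim))   # output (context) word vectors

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def negative_sampling_loss(center, context, k=5):
    """Loss for one (center, context) word-id pair with k sampled negatives."""
    negatives = rng.integers(0, V_size, size=k)                      # simplified: uniform sampling
    pos = np.log(sigmoid(W_out[context] @ W_in[center]))             # observed pair scored high
    neg = np.log(sigmoid(-W_out[negatives] @ W_in[center])).sum()    # sampled negatives scored low
    return -(pos + neg)

print(negative_sampling_loss(center=12, context=345))
```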
Skip-gram Architecture
[Figure: skip-gram architecture; the input word w(t) is projected and used to predict the surrounding words w(t-2), w(t-1), w(t+1), w(t+2)]
Predicts the surrounding words given the current word
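A minimal sketch of how skip-gram training examples are formed from raw text: each word predicts the words around it within a fixed window (window size 2 here, matching the w(t-2)..w(t+2) diagram); the tokenizer is just a whitespace split.

```python
def skipgram_pairs(tokens, window=2):
    """Return (input word, word to predict) pairs for the skip-gram model."""
    pairs = []
    for t, center in enumerate(tokens):
        for offset in range(-window, window + 1):
            c = t + offset
            if offset != 0 and 0 <= c < len(tokens):
                pairs.append((center, tokens[c]))
    return pairs

print(skipgram_pairs("the quick brown fox jumps".split()))
# [('the', 'quick'), ('the', 'brown'), ('quick', 'the'), ('quick', 'brown'), ...]
```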
Continuous Bag-of-words Architecture
[Figure: continuous bag-of-words architecture; the projections of the context words w(t-2), w(t-1), w(t+1), w(t+2) are summed and used to predict the current word w(t)]
Predicts the current word given the context
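A minimal numpy sketch of the CBOW prediction step: the context word vectors are summed (there is no hidden layer) and scored against the output vectors. The sizes and random vectors are illustrative placeholders, not trained parameters.

```python
import numpy as np

V_size, dim = 10_000, 300
rng = np.random.default_rng(0)
W_in = rng.normal(0, 0.1, (V_size, dim))    # input (projection) word vectors
W_out = rng.normal(0, 0.1, (V_size, dim))   # output word vectors

def cbow_scores(context_ids):
    """Score every vocabulary word as the center word for this context."""
    h = W_in[context_ids].sum(axis=0)        # SUM of the context projections
    return W_out @ h                         # one score per vocabulary word

scores = cbow_scores([10, 42, 77, 1024])     # ids of w(t-2), w(t-1), w(t+1), w(t+2)
predicted_center = int(scores.argmax())
```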
Efficient Learning - Summary
Efficient multi-threaded implementation of the new models greatly reduces the training complexity
The training speed is on the order of 100K - 5M words per second
Quality of word representations improves significantly with more training data
Linguistic Regularities in Word Vector Space
The word vector space implicitly encodes many regularities among words
Linguistic Regularities in Word Vector Space
The resulting distributed representations of words contain a surprising amount of syntactic and semantic information
There are multiple degrees of similarity among words:
KING is similar to QUEEN as MAN is similar to WOMAN
KING is similar to KINGS as MAN is similar to MEN
Simple vector operations with the word vectors provide very intuitive results (see the sketch below)
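A minimal sketch of such a vector operation using pretrained vectors; the gensim dependency and the "word2vec-google-news-300" download name are assumptions, not part of the original slides.

```python
import gensim.downloader as api

vectors = api.load("word2vec-google-news-300")   # pretrained word2vec vectors

# KING - MAN + WOMAN should land near QUEEN.
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=3))
```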
Linguistic Regularities - Results
Regularity of the learned word vector space is evaluated using a test set with about 20K questions
The test set contains both syntactic and semantic questions
We measure TOP1 accuracy (input words are removed during search), as sketched below
We compare our models to previously published word vectors
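A minimal sketch of this TOP1 evaluation, assuming questions of the form "a is to b as c is to d" and the gensim `vectors` object from the earlier sketch; the example question list is made up.

```python
def top1_accuracy(questions, vectors):
    """Fraction of analogy questions where the expected word is ranked first."""
    correct = 0
    for a, b, c, d in questions:
        # most_similar excludes the input words a, b, c from the search results,
        # matching the "input words are removed" protocol on the slide.
        best = vectors.most_similar(positive=[b, c], negative=[a], topn=1)[0][0]
        correct += (best == d)
    return correct / len(questions)

print(top1_accuracy([("man", "king", "woman", "queen")], vectors))
```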
Linguistic Regularities - Results
Model                 Vector dimensionality   Training words   Training time   Accuracy [%]
Collobert NNLM        50                      660M             2 months        11
Turian NNLM           200                     37M              few weeks       2
Mnih NNLM             100                     37M              7 days          9
Mikolov RNNLM         640                     320M             weeks           25
Huang NNLM            50                      990M             weeks           13
Our NNLM              100                     6B               2.5 days        51
Skip-gram (hier.s.)   1000                    6B               hours           66
CBOW (negative)       300                     1.5B             minutes         72
Linguistic Regularities in Word Vector Space
Expression                                Nearest token
Paris - France + Italy                    Rome
bigger - big + cold                       colder
sushi - Japan + Germany                   bratwurst
Cu - copper + gold                        Au
Windows - Microsoft + Google              Android
Montreal Canadiens - Montreal + Toronto   Toronto Maple Leafs
Performance on Rare Words
Word vectors from neural networks were previously criticized for their poor performance on rare words
Scaling up training data set size helps to improve performance on rare words
For evaluation of progress, we have used the data set from Luong et al.: Better word representations with recursive neural networks for morphology, CoNLL 2013
Performance on Rare Words - Results
Model                                  Correlation with human ratings (Spearman's rank correlation)
Collobert NNLM                         0.28
Collobert NNLM + Morphology features   0.34
CBOW (100B)                            0.50
Rare Words - Examples of Nearest Neighbours
Query:                Redmond              Havel                    graffiti       capitulate
Collobert NNLM        conyers              plauen                   cheesecake     abdicate
                      lubbock              dzerzhinsky              gossip         accede
                      keene                osterreich               dioramas       rearm
Turian NNLM           McCarthy             Jewell                   gunfire        -
                      Alston               Arzu                     emotion        -
                      Cousins              Ovitz                    impunity       -
Mnih NNLM             Podhurst             Pontiff                  anaesthetics   Mavericks
                      Harlang              Pinochet                 monkeys        planning
                      Agarwal              Rodionov                 Jews           hesitated
Skip-gram (phrases)   Redmond Wash.        Vaclav Havel             spray paint    capitulation
                      Redmond Washington   president Vaclav Havel   grafitti       capitulated
                      Microsoft            Velvet Revolution        taggers        capitulating
From Words to Phrases and Beyond
Often we want to represent more than just individual words: phrases, queries, sentences
The vector representation of a query can be obtained by:
Forming the phrases
Adding the vectors together
From Words to Phrases and Beyond
Example query: restaurants in mountain view that are not very good
Forming the phrases: restaurants in (mountain view) that are (not very good)
Adding the vectors: restaurants + in + (mountain view) + that + are + (not very good)
Very simple and efficient
Will not work well for long sentences or documents
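A minimal sketch of this addition, reusing the gensim `vectors` object from the earlier sketches; the underscore phrase key ("mountain_view") is an assumption about how phrases are stored in the model, and out-of-vocabulary tokens are simply skipped.

```python
import numpy as np

def query_vector(tokens, vectors):
    """Represent a query as the sum of its word/phrase vectors."""
    parts = [vectors[t] for t in tokens if t in vectors]
    return np.sum(parts, axis=0)

q = query_vector(["restaurants", "in", "mountain_view", "that", "are",
                  "not", "very", "good"], vectors)
print(vectors.similar_by_vector(q, topn=5))   # nearest tokens to the whole query
```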
Compositionality by Vector Addition
Expression          Nearest tokens
Czech + currency    koruna, Czech crown, Polish zloty, CTK
Vietnam + capital   Hanoi, Ho Chi Minh City, Viet Nam, Vietnamese
German + airlines   airline Lufthansa, carrier Lufthansa, flag carrier Lufthansa
Russian + river     Moscow, Volga River, upriver, Russia
French + actress    Juliette Binoche, Vanessa Paradis, Charlotte Gainsbourg
Visualization of Regularities in Word Vector Space
We can visualize the word vectors by projecting them to 2D space
PCA can be used for dimensionality reduction
Although a lot of information is lost, the regular structure is often visible
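A minimal sketch of such a visualization, projecting a handful of word vectors to 2D with PCA and plotting them; sklearn, matplotlib, and the chosen word list are assumptions, and `vectors` is the pretrained model from the earlier sketches.

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.decomposition import PCA

words = ["king", "queen", "man", "woman", "prince", "princess", "he", "she"]
X = np.array([vectors[w] for w in words])
xy = PCA(n_components=2).fit_transform(X)     # project high-dimensional vectors to 2D

plt.scatter(xy[:, 0], xy[:, 1])
for (x, y), w in zip(xy, words):
    plt.annotate(w, (x, y))                   # label each point with its word
plt.show()
```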
Visualization of Regularities in Word Vector Space
[Figure: 2D projection of word vectors showing male-female pairs along roughly parallel directions: king-queen, man-woman, he-she, prince-princess, hero-heroine, actor-actress, landlord-landlady, male-female, bull-cow, cock-hen]
Visualization of Regularities in Word Vector Space
[Figure: 2D projection of word vectors showing verb-tense regularities: give-gave-given, take-took-taken, draw-drew-drawn, fall-fell-fallen]
Visualization of Regularities in Word Vector Space
[Figure: 2D projection of word vectors showing country-capital pairs with roughly parallel offsets: China-Beijing, Japan-Tokyo, France-Paris, Russia-Moscow, Germany-Berlin, Italy-Rome, Spain-Madrid, Greece-Athens, Turkey-Ankara, Poland-Warsaw, Portugal-Lisbon]
Machine Translation
Word vectors should have similar structure when trained on comparable corpora
This should hold even for corpora in different languages
Machine Translation - English to Spanish
[Figure: 2D projections of the English words cat, dog, cow, horse, pig (left) and the Spanish words gato (cat), perro (dog), vaca (cow), caballo (horse), cerdo (pig) (right), showing similar geometric arrangements]
The figures were manually rotated and scaled
Machine Translation
For translation from one vector space to another, we need to learn a linear projection (will perform rotation and scaling)
A small starting dictionary can be used to train the linear projection
Then, we can translate any word that was seen in the monolingual data
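A minimal sketch of learning such a projection from a small seed dictionary of (source, target) word pairs; the closed-form least-squares fit below is one simple way to obtain it and is an assumption for illustration, as are the vector objects and the seed dictionary.

```python
import numpy as np

def fit_translation_matrix(pairs, src_vectors, tgt_vectors):
    """Least-squares solution of W in: src_vec @ W ≈ tgt_vec."""
    X = np.array([src_vectors[s] for s, _ in pairs])   # source-language vectors
    Y = np.array([tgt_vectors[t] for _, t in pairs])   # target-language vectors
    W, *_ = np.linalg.lstsq(X, Y, rcond=None)
    return W

def translate(word, W, src_vectors, tgt_vectors, topn=5):
    """Project a source word into the target space and return its nearest words."""
    return tgt_vectors.similar_by_vector(src_vectors[word] @ W, topn=topn)

# seed_pairs = [("cat", "gato"), ("dog", "perro"), ...]   # small starting dictionary
# W = fit_translation_matrix(seed_pairs, english_vectors, spanish_vectors)
# print(translate("horse", W, english_vectors, spanish_vectors))
```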
MT - Accuracy of English to Spanish translation
[Figure: accuracy of English to Spanish translation (Precision@1 and Precision@5) as a function of the number of training words, from 10^7 to 10^10]
Machine Translation
When applied to English to Spanish word translation, the accuracy is above 90% for the most confident translations
Can work for any language pair (we tried English to Vietnamese)
More details in the paper: Exploiting similarities among languages for machine translation
Available Resources
The project webpage is code.google.com/p/word2vec
open-source code
pretrained word vectors (model for common words and phrases will be uploaded soon)
links to the papers
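A minimal sketch of loading pretrained vectors from the project page once downloaded; the file name is a placeholder for whichever vector file is used, and the gensim dependency is an assumption (the original release ships C tools instead).

```python
from gensim.models import KeyedVectors

vectors = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin.gz", binary=True)
print(vectors.most_similar("koruna", topn=5))
```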