Universal Sentence Encoder

Daniel Cer¹, Yinfei Yang¹, Sheng-yi Kong¹, Nan Hua¹, Nicole Limtiaco², Rhomni St. John¹, Noah Constant¹, Mario Guajardo-Cespedes¹, Steve Yuan³, Chris Tar¹, Yun-Hsuan Sung¹, Brian Strope¹, Ray Kurzweil¹

¹Google Research, Mountain View, CA
²Google Research, New York, NY
³Google, Cambridge, MA

19 April 2019
Presented by: Serge Assaad
Motivation
Limited amounts of training data are available for many NLP
tasks.
Many models address the problem by implicitly performing limited transfer learning through the use of pre-trained word embeddings.
Recent work has demonstrated strong transfer-task performance using pre-trained sentence-level embeddings (Conneau et al., 2017).
In this paper, two models are presented to produce sentence embeddings that transfer well to other NLP tasks.
Background: Unpacking "Attention"
Suppose you have a sentence s with words w_1, ..., w_N and corresponding embeddings x_1, ..., x_N.
We will now try to make the embeddings "context-aware":
Pick an embedding x_i and find its similarities with all the other embeddings, i.e. find sim = [x_i · x_1, ..., x_i · x_N]
Now find α = softmax(sim)
Finally, compute x_i^aware = Σ_{k=1}^{N} α_k x_k
You now have a "context-aware" embedding!
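A minimal NumPy sketch of this per-token recipe; the toy embedding matrix and the helper name context_aware are made up for illustration:

    import numpy as np

    def context_aware(X, i):
        # X: (N, d) word embeddings of one sentence; returns a context-aware x_i
        sim = X @ X[i]                      # sim_k = x_i . x_k for every word k
        alpha = np.exp(sim - sim.max())     # softmax over the similarities
        alpha /= alpha.sum()
        return alpha @ X                    # weighted sum: sum_k alpha_k * x_k

    X = np.random.randn(5, 8)               # toy sentence: 5 words, 8-dim embeddings
    x2_aware = context_aware(X, i=2)        # context-aware embedding of the third word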
Background: Unpacking "Attention"
Converting this to matrix form, you get:
X^aware = softmax(X X^T) X   (1)
(This is called self-attention).
Now we'll generalize a bit and say we have matrices Q, K, V of queries, keys, and values. We can apply the same idea to get:

Attention(Q, K, V) = softmax(QK^T / √d_k) V   (2)

where d_k is the dimension of the rows of K (dividing by √d_k helps with vanishing gradients).
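Equation (2) in code form. This is a sketch: the helper names, tensor sizes, and random inputs are illustrative assumptions, not anything from the paper.

    import numpy as np

    def softmax(z):
        e = np.exp(z - z.max(axis=-1, keepdims=True))
        return e / e.sum(axis=-1, keepdims=True)

    def attention(Q, K, V):
        # Eq. (2): softmax(Q K^T / sqrt(d_k)) V
        d_k = K.shape[-1]
        scores = Q @ K.T / np.sqrt(d_k)      # (n_queries, n_keys) similarity matrix
        return softmax(scores) @ V           # each row is a weighted sum of the rows of V

    X = np.random.randn(6, 16)               # 6 tokens, 16-dim embeddings (made-up sizes)
    X_aware = attention(X, X, X)             # self-attention: Q = K = V = X, as in Eq. (1)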
Background: Multi-head attention
Let’s add another layer of complexity:
MultiHead(Q, K, V) = Concat(head_1, ..., head_H) W^O   (3)

where:

head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V)   (4)

W_i^Q, W_i^K, W_i^V, and W^O are learned parameters.
This multi-head idea allows our model to attend to information from different representation subspaces via the different heads.
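A sketch of Equations (3)-(4) in the same style; the head count, projection sizes, and random parameters are made-up placeholders:

    import numpy as np

    def attention(Q, K, V):                  # scaled dot-product attention, Eq. (2)
        s = Q @ K.T / np.sqrt(K.shape[-1])
        e = np.exp(s - s.max(axis=-1, keepdims=True))
        return (e / e.sum(axis=-1, keepdims=True)) @ V

    def multi_head(Q, K, V, WQ, WK, WV, WO):
        # Eqs. (3)-(4): project Q, K, V once per head, attend, concatenate, project with W^O
        heads = [attention(Q @ WQ[i], K @ WK[i], V @ WV[i]) for i in range(len(WQ))]
        return np.concatenate(heads, axis=-1) @ WO

    d_model, H, d_k = 16, 4, 4               # 4 heads, each projecting 16 -> 4 dims (made up)
    WQ, WK, WV = (np.random.randn(H, d_model, d_k) for _ in range(3))
    WO = np.random.randn(H * d_k, d_model)
    X = np.random.randn(6, d_model)
    out = multi_head(X, X, X, WQ, WK, WV, WO)   # shape (6, 16)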
Background: Transformer (Vaswani et al., 2017)
Figure: Transformer Architecture
Background: Transformer (Vaswani et al., 2017)
In the encoder, the attention used is just self-attention (i.e. Q = K = V, which are the word embeddings in the first layer).
The decoder employs self-attention on the embeddings of the output, followed by encoder-decoder attention (i.e. K and V are both the outputs of the encoder, and Q comes from upstream layers of the decoder).
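Schematically, the two uses of attention differ only in where Q, K, and V come from. A sketch with made-up embeddings (the attention() helper is repeated for self-containment; the decoder's causal masking and the feed-forward sublayers are omitted):

    import numpy as np

    def attention(Q, K, V):                  # same scaled dot-product attention as above
        s = Q @ K.T / np.sqrt(K.shape[-1])
        e = np.exp(s - s.max(axis=-1, keepdims=True))
        return (e / e.sum(axis=-1, keepdims=True)) @ V

    X = np.random.randn(7, 16)               # encoder-side token embeddings (made up)
    Y = np.random.randn(5, 16)               # decoder-side token embeddings (made up)

    enc_out  = attention(X, X, X)            # encoder: self-attention, Q = K = V
    dec_self = attention(Y, Y, Y)            # decoder: self-attention on the outputs
    dec_out  = attention(dec_self, enc_out, enc_out)   # encoder-decoder attention:
                                             # Q from the decoder, K and V from the encoder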
Model 1 - Transformer-based Universal Sentence Encoder (USE)
Suppose we have a sentence which comprises N words, and x_1, ..., x_N are the word embeddings.
The function TransformerEncoder takes in N word embeddings and outputs N "context-aware" embeddings.
For the Transformer-based USE, the sentence embedding is
calculated as follows:
S = ( Σ TransformerEncoder(x_1, ..., x_N) ) / √N   (5)

(i.e. the element-wise sum of the N context-aware embeddings, divided by the square root of the sentence length)
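A sketch of Equation (5); the 512-dimensional random array H below is a stand-in for the encoder's N context-aware output embeddings:

    import numpy as np

    def use_sentence_embedding(context_aware):
        # Eq. (5): element-wise sum of the N context-aware embeddings, divided by sqrt(N)
        N = context_aware.shape[0]
        return context_aware.sum(axis=0) / np.sqrt(N)

    H = np.random.randn(9, 512)              # stand-in for TransformerEncoder(x_1, ..., x_N)
    S = use_sentence_embedding(H)            # 512-dim sentence embedding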
Model 2 - Deep Averaging Network (DAN) (Iyyer et al., 2015)
Figure: DAN Architecture
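The slide leaves the details to the figure, so here is a minimal sketch of the DAN idea from Iyyer et al. (2015) as I read it: average the input embeddings, then pass the average through a few feed-forward layers. The layer count, sizes, and tanh nonlinearity below are illustrative assumptions.

    import numpy as np

    def dan_embedding(X, weights, biases):
        # average the input embeddings, then apply a stack of feed-forward layers
        h = X.mean(axis=0)
        for W, b in zip(weights, biases):
            h = np.tanh(W @ h + b)
        return h

    d = 512
    X = np.random.randn(9, d)                             # word (and bigram) embeddings
    weights = [np.random.randn(d, d) * 0.01 for _ in range(3)]
    biases = [np.zeros(d) for _ in range(3)]
    S = dan_embedding(X, weights, biases)                 # DAN sentence embedding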
Experiments: Classification Benchmarks
For both models (Transformer and DAN), classification layers are
added for transfer learning on the following tasks:
MR: Movie review snippet sentiment on a five-star scale (Pang and Lee, 2005).
CR: Sentiment of sentences mined from customer reviews (Hu and Liu, 2004).
SUBJ: Subjectivity of sentences from movie reviews and plot summaries (Pang and Lee, 2004).
MPQA: Phrase-level opinion polarity from news data (Wiebe et al., 2005).
Experiments: Classification Benchmarks
TREC: Fine-grained question classification sourced from TREC (Li and Roth, 2002).
SST: Binary phrase-level sentiment classification (Socher et al., 2013).
STS Benchmark: Semantic textual similarity (STS) between sentence pairs scored by Pearson correlation with human judgments (Cer et al., 2017).
For the STS Benchmark task, no transfer layers are used. The
sentence embeddings are used to compute similarity via:
sim(u, v) = 1 − arccos( u·v / (||u|| ||v||) ) / π   (6)
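Equation (6) translated directly into code (the input vectors are made up):

    import numpy as np

    def angular_similarity(u, v):
        # Eq. (6): 1 - arccos(cosine(u, v)) / pi
        cos = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
        cos = np.clip(cos, -1.0, 1.0)        # guard against rounding slightly outside [-1, 1]
        return 1.0 - np.arccos(cos) / np.pi

    u, v = np.random.randn(512), np.random.randn(512)     # two sentence embeddings
    score = angular_similarity(u, v)                      # in [0, 1]; higher = more similar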
Results
Table: Accuracy results for sentence-level tasks (last column is Pearson
correlation of sentence similarity with human judgments)
Discussion
Figure: TL;DR - Transformer runtime is O(n²), DAN runtime is O(n) (at the expense of performance)
Skip-Thought Vectors
Ryan Kiros¹, Yukun Zhu¹, Ruslan Salakhutdinov¹,², Richard S. Zemel¹,², Antonio Torralba³, Raquel Urtasun¹, Sanja Fidler¹

¹University of Toronto
²Canadian Institute for Advanced Research
³Massachusetts Institute of Technology
19 April 2019
Presented by: Serge Assaad
Motivation
Questions:
Can we "universally/generically" encode information in sentences in a distributed semantic representation the same way we do it for words?
Can we use these supposedly generic embeddings to do well on tasks related to these sentences without resorting to expensive task-specific language models?
Skip-Thought Model
Figure: The skip-thoughts model. Given a tuple (s_{i−1}, s_i, s_{i+1}) of contiguous sentences, with s_i the i-th sentence of a book, the sentence s_i is encoded and tries to reconstruct the previous sentence s_{i−1} and the next sentence s_{i+1}. In this example, the input is the sentence triplet (I got back home. I could see the cat on the steps. This was strange.) Unattached arrows are connected to the encoder output. Colors indicate which components share parameters. <eos> is the end-of-sentence token.
Notation
w_i^t is the t-th word in sentence s_i, and x_i^t is its word embedding.
For the sentence s_i with words w_i^1, ..., w_i^N, at each time step t the encoder of the recurrent model (presented later) produces a hidden state h_i^t which represents the sequence w_i^1, ..., w_i^t.
Thus, we can think of the "sentence embedding" as h_i^N, which represents the entire sentence.
Skip-Thought Model: Encoder (GRU)
A typical GRU encoder (dubbed E):
r^t = σ(W_r x^t + U_r h^{t−1})   (1)
z^t = σ(W_z x^t + U_z h^{t−1})   (2)
h̄^t = tanh(W x^t + U (r^t ⊙ h^{t−1}))   (3)
h^t = (1 − z^t) ⊙ h^{t−1} + z^t ⊙ h̄^t   (4)

h̄^t is the proposed state update at time t, z^t is the update gate, r^t is the reset gate, and ⊙ denotes the Hadamard (element-wise) product.
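One encoder step, Equations (1)-(4), as a NumPy sketch; the parameter shapes and the toy 8-word sentence are illustrative assumptions:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def gru_step(x_t, h_prev, Wr, Ur, Wz, Uz, W, U):
        r = sigmoid(Wr @ x_t + Ur @ h_prev)              # reset gate, Eq. (1)
        z = sigmoid(Wz @ x_t + Uz @ h_prev)              # update gate, Eq. (2)
        h_bar = np.tanh(W @ x_t + U @ (r * h_prev))      # proposed update, Eq. (3)
        return (1.0 - z) * h_prev + z * h_bar            # new hidden state, Eq. (4)

    d_x, d_h = 32, 64                                    # illustrative sizes
    Wr, Wz, W = (np.random.randn(d_h, d_x) for _ in range(3))
    Ur, Uz, U = (np.random.randn(d_h, d_h) for _ in range(3))
    h = np.zeros(d_h)
    for x_t in np.random.randn(8, d_x):                  # run the encoder over an 8-word sentence
        h = gru_step(x_t, h, Wr, Ur, Wz, Uz, W, U)
    # h is now h_i^N, the skip-thought "sentence embedding"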
Skip-Thought Model: Decoder (GRU)
The decoder (dubbed D_next) described here tries to predict the words in the next sentence s_{i+1}, given the hidden state h_i^N (denoted h_i for brevity) for the sentence s_i. We use the matrices C_z, C_r, and C to bias the decoder gates with the sentence vector h_i:

r^t = σ(W_r^d x^{t−1} + U_r^d h^{t−1} + C_r h_i)   (5)
z^t = σ(W_z^d x^{t−1} + U_z^d h^{t−1} + C_z h_i)   (6)
h̄^t = tanh(W^d x^{t−1} + U (r^t ⊙ h^{t−1}) + C h_i)   (7)
h_{i+1}^t = (1 − z^t) ⊙ h^{t−1} + z^t ⊙ h̄^t   (8)
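The decoder step, Equations (5)-(8), differs from the encoder step only in the extra bias terms C_r h_i, C_z h_i, and C h_i; a sketch with the same caveats (parameter names and shapes assumed):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def decoder_step(x_prev, h_prev, h_i, Wr, Ur, Cr, Wz, Uz, Cz, W, U, C):
        # one GRU update of D_next, conditioned on the encoder's sentence vector h_i
        r = sigmoid(Wr @ x_prev + Ur @ h_prev + Cr @ h_i)          # reset gate, Eq. (5)
        z = sigmoid(Wz @ x_prev + Uz @ h_prev + Cz @ h_i)          # update gate, Eq. (6)
        h_bar = np.tanh(W @ x_prev + U @ (r * h_prev) + C @ h_i)   # proposed update, Eq. (7)
        return (1.0 - z) * h_prev + z * h_bar                      # decoder state h_{i+1}^t, Eq. (8)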
Skip-Thought Model: Decoder
We then convert this hidden state h_{i+1}^t into a softmax probability:

P(w_{i+1}^t | w_{i+1}^{<t}, h_i) ∝ exp( v_{w_{i+1}^t} · h_{i+1}^t )   (9)

where v_{w_{i+1}^t} is the row of a learned vocabulary matrix V which corresponds to the word w_{i+1}^t.

Note: We train another identical decoder D_prev to predict the previous sentence s_{i−1}; it has separate parameters from D_next, with the exception of the vocabulary matrix V, which is shared between decoders.
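Equation (9) in code: score every vocabulary row against the decoder state and normalize (the vocabulary size and dimensions are made up):

    import numpy as np

    def next_word_probs(h_t, V):
        # Eq. (9): P(w | w_<t, h_i) proportional to exp(v_w . h_t), normalized over the vocabulary
        scores = V @ h_t
        e = np.exp(scores - scores.max())
        return e / e.sum()

    V = np.random.randn(20000, 64)           # vocabulary matrix: one (made-up) row per word
    h_t = np.random.randn(64)                # decoder hidden state h_{i+1}^t
    p = next_word_probs(h_t, V)              # sums to 1; p[w] is the probability of word w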
Objective
We optimize the parameters of the encoder E and the two decoders (D_next and D_prev) according to the following objective:

L(E, D_next, D_prev) = Σ_t log P(w_{i+1}^t | w_{i+1}^{<t}, h_i)   (10)
                     + Σ_t log P(w_{i−1}^t | w_{i−1}^{<t}, h_i)   (11)

Result: After training, we can extract the hidden state of each sentence h_i from the encoder, and that is the "sentence embedding"!
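Given the per-time-step probabilities the two decoders assign to the observed words, the objective in (10)-(11) is just the sum of their logs; a toy sketch with made-up probabilities:

    import numpy as np

    def skip_thought_objective(p_next, p_prev):
        # Eqs. (10)-(11): sum the log-probabilities the two decoders assign to the observed words
        return np.sum(np.log(p_next)) + np.sum(np.log(p_prev))

    p_next = np.array([0.10, 0.40, 0.05, 0.22])   # made-up P(w_{i+1}^t | ...) per time step
    p_prev = np.array([0.30, 0.08, 0.15])         # made-up P(w_{i-1}^t | ...) per time step
    L_i = skip_thought_objective(p_next, p_prev)  # maximized (summed over the corpus) in training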
Training details
A few different versions of the skip-thoughts model are used:
uni-skip: The model just presented (unidirectional skip-thoughts).
bi-skip: Similar, but with two encoders E_forward and E_reverse, whose outputs are concatenated then fed to the decoder. The latter encoder is fed the sentence in reverse order.
combine-skip: A combination of uni-skip and bi-skip, where the outputs of the encoders of the uni-skip and bi-skip models are concatenated then fed to the decoders.
combine-skip + COCO: Encoder embeddings are concatenated with an image-sentence embedding model trained on COCO (described later).
These models are trained using the BookCorpus dataset (Y. Zhu, 2015).
Experiments: Sentence-Level comparisons
SICK task: Given two sentences, the goal is to produce a score of how semantically related they are. The ground truth is human-annotated scores from 1 to 5 (Marelli et al., 2014).
Microsoft Paraphrase Corpus: Predicting whether or not two sentences are paraphrases of each other.
Results: Sentence-Level comparisons
Table: Left: Test-set results on the SICK semantic relatedness subtask. The metrics are Pearson's r, Spearman's ρ, and MSE. Right: Test-set results on the Microsoft Paraphrase Corpus. The metrics are accuracy and F1.
Experiments: Image-sentence ranking
Using image-sentence pairs from the COCO dataset, matrices U and V are learned to minimize the ranking loss:

L(U, V) = Σ_x Σ_{k=1}^K max{0, α − cos(Ux, Vy) + cos(Ux, Vy_k)}
        + Σ_y Σ_{k=1}^K max{0, α − cos(Vy, Ux) + cos(Vy, Ux_k)}

where:
x is an image vector obtained from a pre-trained VGG-19,
y is the skip-thought vector for the ground-truth sentence,
x_k and y_k are "incorrect" image and skip-thought vectors, i.e. ones that do not correspond to y and x respectively,
cos is cosine similarity,
α = 0.2, K = 50.
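A sketch of this contrastive ranking loss; the sampling of the K "incorrect" examples is simplified here to random rows of a small made-up batch:

    import numpy as np

    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

    def ranking_loss(UX, VY, alpha=0.2, K=50):
        # UX[j] and VY[j] are a matching image/sentence pair; other rows act as negatives
        n = len(UX)
        loss = 0.0
        for j in range(n):
            for k in np.random.randint(0, n, size=K):    # sampled "incorrect" indices
                if k == j:                               # skip the matching pair itself
                    continue
                loss += max(0.0, alpha - cos(UX[j], VY[j]) + cos(UX[j], VY[k]))  # contrast sentences
                loss += max(0.0, alpha - cos(VY[j], UX[j]) + cos(VY[j], UX[k]))  # contrast images
        return loss

    UX = np.random.randn(8, 256)     # Ux: projected image vectors (VGG-19 features), made-up sizes
    VY = np.random.randn(8, 256)     # Vy: projected skip-thought vectors
    L = ranking_loss(UX, VY, K=5)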
Results: Image-sentence ranking
Table: COCO test-set results for image-sentence retrieval experiments. R@K is Recall@K (high is good). Med r is the median rank (low is good).
Experiments: Classification Benchmarks
Some of the same tasks as the USE paper:
MR: Movie review snippet sentiment on a five-star scale (Pang and Lee, 2005).
CR: Sentiment of sentences mined from customer reviews (Hu and Liu, 2004).
SUBJ: Subjectivity of sentences from movie reviews and plot summaries (Pang and Lee, 2004).
MPQA: Phrase-level opinion polarity from news data (Wiebe et al., 2005).
TREC: Fine-grained question classification sourced from TREC (Li and Roth, 2002).
Results: Classification Benchmarks
Table: Classification accuracies on several standard benchmarks.