Learning Task-specific Bilexical Embeddings. Pranava Madhyastha (1), Xavier Carreras (1,2), Ariadna Quattoni (1,2). (1) Universitat Politècnica de Catalunya, (2) Xerox Research Centre Europe


  • Learning Task-specific Bilexical Embeddings

    Pranava Madhyastha(1), Xavier Carreras(1,2), Ariadna Quattoni(1,2)

    (1) Universitat Politècnica de Catalunya (2) Xerox Research Centre Europe

  • Bilexical Relations

    - Increasing interest in bilexical relations (relations between pairs of words)

    - Dependency parsing: lexical items (words) connected by binary relations

    [Dependency tree over "Small birds sing loud songs", with arcs ROOT, SUBJ, OBJ, NMOD, NMOD]

    - Bilexical predictions can be modelled as Pr(modifier | head)

  • In Focus: Unseen words

    Adjective-Noun relation, where an adjective modifies a noun

    Vinyl can be applied to electronic devices and cases

    [Two candidate NMOD? attachments indicated in the example above]

    - If one or more of the above nouns or adjectives has not been observed in the supervision, estimating Pr(adjective | noun) is hard

    - Word frequencies follow a Zipf distribution, so many words are rare or unseen

    - Generalisation is a challenge

  • Distributional Word Space Models

    - Distributional Hypothesis: linguistic items with similar distributions have similar meanings

    [Concordance: example corpus contexts for the word "moon"]

    - For every word w we can compute an n-dimensional vector-space representation φ(w) ∈ R^n from a large corpus
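One simple way to build such distributional vectors is to count co-occurrences in a fixed window over a tokenized corpus. The sketch below (Python/numpy, toy sentences) is only illustrative; the context definition and any reweighting actually used to build φ(w) may differ.

```python
import numpy as np
from collections import Counter

def cooccurrence_vectors(sentences, vocab, window=2):
    """Build simple window-based co-occurrence count vectors phi(w) in R^n."""
    index = {w: i for i, w in enumerate(vocab)}
    counts = {w: Counter() for w in vocab}
    for tokens in sentences:
        for i, w in enumerate(tokens):
            if w not in index:
                continue
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i and tokens[j] in index:
                    counts[w][tokens[j]] += 1
    # phi: one row per word, one column per context word
    phi = np.zeros((len(vocab), len(vocab)))
    for w, ctx in counts.items():
        for c, n in ctx.items():
            phi[index[w], index[c]] = n
    return phi, index

sentences = [["small", "birds", "sing", "loud", "songs"],
             ["loud", "birds", "sing"]]
vocab = ["small", "birds", "sing", "loud", "songs"]
phi, index = cooccurrence_vectors(sentences, vocab)
print(phi[index["birds"]])   # distributional vector for "birds"
```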

  • Contributions

    Formulation of statistical models to improve bilexical prediction tasks

    - Supervised framework to learn bilexical models over distributional representations
      ⇒ based on learning bilinear forms

    - Compressing representations by imposing low-rank constraints on bilinear forms

    - Lexical embeddings tailored for a specific bilexical task

  • Overview

    Bilexical Models

    Low Rank Constraints

    Learning

    Experiments



  • Unsupervised Bilexical Models

    - We can define a simple bilexical model as:

      $$\Pr(m \mid h) = \frac{\exp\{\langle \phi(m), \phi(h) \rangle\}}{\sum_{m'} \exp\{\langle \phi(m'), \phi(h) \rangle\}}$$

      where ⟨φ(x), φ(y)⟩ denotes the inner product.

    - Problem: designing appropriate contexts for the required relations

    - Solution: leverage a supervised training corpus

  • Supervised bilexical model

    - We define the bilexical model in a bilinear setting as:

      $$\phi(m)^\top W \phi(h)$$

      where φ(m) and φ(h) are n-dimensional representations of m and h, and W ∈ R^{n×n} is a matrix of parameters
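As an illustration of the bilinear form above, a minimal numpy sketch with random vectors standing in for φ(m) and φ(h); the dimensionality n and all values are placeholders.

```python
import numpy as np

n = 5                                 # dimensionality of phi(m), phi(h)
rng = np.random.default_rng(0)
phi_m = rng.normal(size=n)            # phi(m): representation of the modifier
phi_h = rng.normal(size=n)            # phi(h): representation of the head
W = rng.normal(size=(n, n))           # parameter matrix W in R^{n x n}

score = phi_m @ W @ phi_h             # bilinear form phi(m)^T W phi(h)
print(score)
```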

  • Interpreting the Bilinear Models

    - If we write the bilinear model as:

      $$\sum_{i=1}^{n} \sum_{j=1}^{n} f_{i,j}(m, h)\, W_{i,j} \qquad \text{where } f_{i,j}(m, h) = \phi(m)[i]\, \phi(h)[j]$$

      ⇒ Bilinear models are linear models, with an extended feature space!

    - ⇒ We can re-use all the algorithms designed for linear models (see the sketch below)
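A small numerical check of this equivalence, assuming nothing beyond the definitions above: the bilinear score equals a linear score over the flattened outer-product features f_{i,j}(m, h).

```python
import numpy as np

n = 4
rng = np.random.default_rng(1)
phi_m, phi_h = rng.normal(size=n), rng.normal(size=n)
W = rng.normal(size=(n, n))

bilinear = phi_m @ W @ phi_h               # phi(m)^T W phi(h)
features = np.outer(phi_m, phi_h).ravel()  # f_{i,j}(m, h), flattened to n^2 features
linear = features @ W.ravel()              # <f(m, h), vec(W)>: an ordinary linear model
assert np.isclose(bilinear, linear)
```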

  • Using Bilexical Models

    - We define the bilexical operator as:

      $$\Pr(m \mid h) = \frac{\exp\{\phi(m)^\top W \phi(h)\}}{\sum_{m' \in M} \exp\{\phi(m')^\top W \phi(h)\}}$$

      ⇒ Standard conditional log-linear model
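A sketch of this conditional distribution in numpy, where the candidate set M is represented by a toy matrix whose rows are the φ(m') vectors; all data here is random placeholder data.

```python
import numpy as np

n, num_modifiers = 5, 7
rng = np.random.default_rng(2)
Phi_M = rng.normal(size=(num_modifiers, n))   # rows: phi(m') for every candidate m' in M
phi_h = rng.normal(size=n)                    # phi(h) for the head
W = rng.normal(size=(n, n))

scores = Phi_M @ W @ phi_h                    # phi(m')^T W phi(h) for all candidates
scores -= scores.max()                        # numerical stability
probs = np.exp(scores) / np.exp(scores).sum() # Pr(m' | h)
print(probs)
```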

  • Overview

    Bilexical Models

    Low Rank Constraints

    Learning

    Experiments


  • Rank Constraints

    $$\phi(m)^\top W \phi(h) =
    \underbrace{\begin{bmatrix} m_1 & m_2 & \cdots & m_n \end{bmatrix}}_{\phi(m)^\top}
    \begin{bmatrix} w_{11} & w_{12} & \cdots & w_{1n} \\ w_{21} & w_{22} & \cdots & w_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ w_{n1} & w_{n2} & \cdots & w_{nn} \end{bmatrix}
    \underbrace{\begin{bmatrix} h_1 \\ h_2 \\ \vdots \\ h_n \end{bmatrix}}_{\phi(h)}$$

  • Rank Constraints

    - Factorizing W with the singular value decomposition, SVD(W) = U Σ V^⊤:

      $$\phi(m)^\top W \phi(h) =
      \underbrace{\begin{bmatrix} m_1 & m_2 & \cdots & m_n \end{bmatrix}}_{\phi(m)^\top}
      \underbrace{\begin{bmatrix} u_{11} & \cdots & u_{1k} \\ u_{21} & \cdots & u_{2k} \\ \vdots & & \vdots \\ u_{n1} & \cdots & u_{nk} \end{bmatrix}}_{U}
      \underbrace{\begin{bmatrix} \sigma_1 & \cdots & 0 \\ \vdots & \ddots & \vdots \\ 0 & \cdots & \sigma_k \end{bmatrix}}_{\Sigma}
      \underbrace{\begin{bmatrix} v_{11} & \cdots & v_{1n} \\ \vdots & & \vdots \\ v_{k1} & \cdots & v_{kn} \end{bmatrix}}_{V^\top}
      \underbrace{\begin{bmatrix} h_1 \\ h_2 \\ \vdots \\ h_n \end{bmatrix}}_{\phi(h)}$$

    - Please note: W has rank k
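A minimal sketch of obtaining such a rank-k factorization by truncating the SVD of a (here random) matrix W; the choice of k is arbitrary.

```python
import numpy as np

n, k = 6, 2
rng = np.random.default_rng(3)
W = rng.normal(size=(n, n))

U, s, Vt = np.linalg.svd(W)            # full SVD: W = U diag(s) V^T
U_k, s_k, Vt_k = U[:, :k], s[:k], Vt[:k, :]
W_k = U_k @ np.diag(s_k) @ Vt_k        # best rank-k approximation of W
print(np.linalg.matrix_rank(W_k))      # -> k
```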

  • Low Rank Embedding

    - Regrouping, we get:

      $$\underbrace{\phi(m)^\top U}_{\text{projection of } m}\; \Sigma\; \underbrace{V^\top \phi(h)}_{\text{projection of } h}$$

    - We can see φ(m)^⊤U as a projection of m and V^⊤φ(h) as a projection of h

    - ⇒ Rank(W) defines the dimensionality of the induced space, hence the embedding

  • Computational Properties

    - In many tasks, given a head, we must rank a huge number of modifiers

    - Strategy (see the sketch below):
      - Project each lexical item in the vocabulary into its low-dimensional embedding of size k
      - Compute the bilexical score as a k-dimensional inner product

    - Substantial computational gain as long as we obtain low-rank models
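A sketch of that strategy: project all modifiers and the head once into the k-dimensional space, then score with k-dimensional inner products. The assertion checks that the low-rank scores match the full bilinear scores; the sizes are arbitrary placeholders.

```python
import numpy as np

n, k, num_modifiers = 100, 5, 10000
rng = np.random.default_rng(4)
Phi_M = rng.normal(size=(num_modifiers, n))
phi_h = rng.normal(size=n)
U = rng.normal(size=(n, k))
S = np.diag(rng.uniform(1, 2, size=k))
V = rng.normal(size=(n, k))
W = U @ S @ V.T                               # a rank-k parameter matrix

# One-off projections into the k-dimensional embedding space
M_emb = Phi_M @ U @ S                         # (phi(m)^T U) Sigma, for all modifiers
h_emb = V.T @ phi_h                           # V^T phi(h)

scores_lowrank = M_emb @ h_emb                # k-dim inner product per modifier
scores_full = Phi_M @ W @ phi_h               # full n x n bilinear scores, for comparison
assert np.allclose(scores_lowrank, scores_full)
```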

  • Summary

    - Induce high-dimensional representations from a huge corpus

    - Learn embeddings suited for a given task

    - Our bilexical formulation is, in principle, a linear model, but with an extended feature space

    - Low-rank bilexical embeddings are computationally efficient

  • Overview

    Bilexical Models

    Low Rank Constraints

    Learning

    Experiments


  • Formulation

    - Given:
      - a set of training tuples D = (m_1, h_1), ..., (m_l, h_l)
      - where the m are modifiers and the h are heads
      - distributional representations φ(m) and φ(h) computed over some corpus

    - We set it up as a conditional log-linear distribution:

      $$\Pr(m \mid h) = \frac{\exp\{\phi(m)^\top W \phi(h)\}}{\sum_{m' \in M} \exp\{\phi(m')^\top W \phi(h)\}}$$

  • Learning and Regularization

    - Standard conditional maximum-likelihood optimization; maximize the log-likelihood function:

      $$\log \Pr(D) = \sum_{(m,h) \in D} \left[ \phi(m)^\top W \phi(h) - \log \sum_{m' \in M} \exp\{\phi(m')^\top W \phi(h)\} \right]$$

    - Adding a regularization penalty, our algorithm essentially maximizes:

      $$\sum_{(m,h) \in D} \log \Pr(m \mid h) - \lambda \|W\|_p$$

    - Regularization using the proximal gradient method (FOBOS):
      - ℓ1 regularization, ‖W‖_1 ⇒ sparse feature space
      - ℓ2 regularization, ‖W‖_2 ⇒ dense parameters
      - ℓ* (nuclear norm) regularization, ‖W‖_* ⇒ low-rank embedding
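As a sketch of what the gradient g(W_t) in the next slide's algorithm computes, here is the gradient of the negative log-likelihood for a single training pair (m, h) under the model above. The data is random and the regularizer is handled separately by the proximal step.

```python
import numpy as np

def neg_log_likelihood_grad(W, Phi_M, m_idx, phi_h):
    """Gradient of -log Pr(m | h) w.r.t. W for a single training pair (m, h)."""
    scores = Phi_M @ W @ phi_h                    # phi(m')^T W phi(h) for all candidates
    scores -= scores.max()                        # numerical stability
    probs = np.exp(scores) / np.exp(scores).sum() # Pr(m' | h)
    expected_phi_m = probs @ Phi_M                # E_{m' ~ Pr(.|h)}[phi(m')]
    return np.outer(expected_phi_m - Phi_M[m_idx], phi_h)

n, num_modifiers = 5, 8
rng = np.random.default_rng(5)
Phi_M = rng.normal(size=(num_modifiers, n))
phi_h = rng.normal(size=n)
W = np.zeros((n, n))
g = neg_log_likelihood_grad(W, Phi_M, m_idx=3, phi_h=phi_h)
print(g.shape)   # (n, n)
```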

  • Algorithm: Proximal Algorithm for Bilexical Operators

      while iteration < MaxIteration do
          W_{t+0.5} = W_t − η_t g(W_t)                                            // gradient of the negative log-likelihood
          // regularization penalty via the proximal operator:
          //   W_{t+1} = argmin_W ||W_{t+0.5} − W||_2^2 + η_t λ r(W)
          if ℓ1 regularizer then
              W_{t+1}(i,j) = sign(W_{t+0.5}(i,j)) · max(|W_{t+0.5}(i,j)| − η_t λ, 0)   // soft thresholding
          else if ℓ2 regularizer then
              W_{t+1} = W_{t+0.5} / (1 + η_t λ)                                    // scaling
          else if nuclear norm regularizer then
              W_{t+0.5} = U Σ V^⊤                                                  // SVD of W_{t+0.5}
              σ̄_i = max(σ_i − η_t λ, 0)                                            // σ_i = the i-th singular value in Σ
              W_{t+1} = U Σ̄ V^⊤                                                    // singular value thresholding
          end
      end
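A minimal numpy sketch of the three proximal updates above, plus a generic FOBOS-style loop. The gradient function, step size eta and penalty lambda are placeholders to be supplied (for instance the single-pair gradient sketched above).

```python
import numpy as np

def prox_l1(W, step):
    """Soft thresholding: proximal operator for the l1 penalty."""
    return np.sign(W) * np.maximum(np.abs(W) - step, 0.0)

def prox_l2(W, step):
    """Scaling: the l2 update from the algorithm above."""
    return W / (1.0 + step)

def prox_nuclear(W, step):
    """Singular value thresholding: proximal operator for the nuclear norm."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    s_bar = np.maximum(s - step, 0.0)
    return U @ np.diag(s_bar) @ Vt

def proximal_gradient(W0, grad, prox, eta=0.1, lam=0.01, max_iter=100):
    """FOBOS-style loop: a gradient step on the loss, then a proximal step."""
    W = W0
    for _ in range(max_iter):
        W_half = W - eta * grad(W)   # gradient of the negative log-likelihood
        W = prox(W_half, eta * lam)  # regularization via the proximal operator
    return W
```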

  • Overview

    Bilexical Models

    Low Rank Constraints

    Learning

    Experiments


  • Experiments

    - Tasks:
      - Noun-Adjective relations: Pr(adjective | noun) and Pr(noun | adjective)
      - Verb-Object relations: Pr(object | verb) and Pr(verb | object)

    - Data:
      - Supervised corpus: gold-standard dependencies of the Penn Treebank
      - We partition the heads of head-modifier relations into three parts:
        60% of heads for training, 10% of heads for validation and 30% of heads for test.
        No heads from the test set appear in the training set.
      - Corpora for distributional representations: BLLIP corpus

    - Training: for each head word, using the supervised data, we compile a list of compatible and incompatible modifiers

  • Results

    Nouns      Predicted Adjectives

    president  executive, senior, chief, frank, former, international, marketing, assistant, annual, financial
    wife       former, executive, new, financial, own, senior, old, other, deputy, major
    shares     annual, due, net, convertible, average, new, high-yield, initial, tax-exempt, subordinated
    mortgages  annualized, annual, three-month, one-year, average, six-month, conventional, short-term, higher, lower
    month      last, next, fiscal, first, past, latest, early, previous, new, current
    problem    new, good, major, tough, bad, big, first, financial, long, federal
    holiday    new, major, special, fourth-quarter, joint, quarterly, third-quarter, small, strong, own

    Table: 10 most likely adjectives for some nouns

  • Results

    [Plot: pairwise accuracy vs. number of operations for "Objects given Verb"; curves for unsupervised, NN (nuclear norm), L1 and L2 models]

    - Pairwise accuracy: a measure of how well compatible modifiers are ranked above incompatible ones

    - Capacity of the model: given the head, the number of double operations required to compute scores for all modifiers

    - In general, if the representation has dimension n and there are m modifiers, then (see the worked example below):
      - ℓ1 & ℓ2 ⇒ if W has d non-zero weights ⇒ dm operations
      - ℓ* ⇒ if the rank of W is k ⇒ kn + km operations
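A tiny worked example of these operation counts; all the numbers (n, m, d, k) are made up for illustration and are not taken from the experiments.

```python
n = 2000         # dimensionality of the distributional representation
m = 50000        # number of candidate modifiers to score for one head
d = 400000       # non-zero weights in W under l1/l2 (10% of n*n, say)
k = 30           # rank of W under the nuclear norm

ops_sparse = d * m           # l1 / l2: d operations per modifier
ops_lowrank = k * n + k * m  # l*: project the head once, then k-dim inner products
print(ops_sparse, ops_lowrank)   # 2e10 vs 1.56e6
```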

  • Results

    [Four panels: "Adjectives given Noun", "Nouns given Adjective", "Objects given Verb", "Verbs given Object"; curves for unsupervised, NN, L1 and L2 models]

    Figure: Pairwise accuracy vs number of double operations to compute the distribution over m for a given h

  • Prepositional Phrase attachment

    [Diagram: verb (v), object (o) and modifier (m) of a preposition, with the two competing attachments VERBAL(prep) and NOMINAL(prep)]

    - Given: for every preposition p, a set of training tuples D_p = {(v, o, p, m, y)_1, ..., (v, o, p, m, y)_l}

    - Distributional representations: φ(v), φ(o), φ(m)

      $$\Pr(y{=}V \mid \langle v, o, p, m \rangle) = \frac{\exp\{\phi(v)^\top W^V_p \phi(m)\}}{Z}
      \qquad
      \Pr(y{=}O \mid \langle v, o, p, m \rangle) = \frac{\exp\{\phi(o)^\top W^O_p \phi(m)\}}{Z}$$

    - Does the bilinear model complement the linear model? For a constant λ ∈ [0, 1] (see the sketch below):

      $$\Pr(y \mid x) = \lambda \Pr_L(y \mid x) + (1 - \lambda) \Pr_B(y \mid x)$$
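A sketch of the interpolation step, assuming the linear and bilinear models already provide distributions over the two attachment decisions {V, O}; the probabilities and λ below are placeholders.

```python
import numpy as np

def interpolate(p_linear, p_bilinear, lam=0.5):
    """Mixture Pr(y|x) = lam * Pr_L(y|x) + (1 - lam) * Pr_B(y|x)."""
    return lam * p_linear + (1.0 - lam) * p_bilinear

p_linear = np.array([0.7, 0.3])     # Pr_L(y = V | x), Pr_L(y = O | x)
p_bilinear = np.array([0.4, 0.6])   # Pr_B(y = V | x), Pr_B(y = O | x)
print(interpolate(p_linear, p_bilinear, lam=0.3))   # still a valid distribution
```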

  • Results

    [Bar chart: attachment accuracy for the prepositions "for", "from" and "with"; bars for bilinear L1, bilinear L2, bilinear NN, linear, interpolated L1, interpolated L2 and interpolated NN models]

    Figure: Attachment accuracies of linear, bilinear and interpolated models for three prepositions

  • Conclusion

    - We have presented a semi-supervised bilexical model that has the potential to generalize over unseen words

    - We have proposed a method to learn low-rank embeddings for scoring bilexical relations efficiently

    - We want to apply this idea to other bilexical tasks in NLP

    - We want to explore how we can combine other feature representations with low-rank bilexical operators

  • Thank You
