Learning Task-specific Bilexical Embeddings. Pranava Madhyastha (1), Xavier Carreras (1,2), Ariadna Quattoni (1,2). (1) Universitat Politècnica de Catalunya, (2) Xerox Research Centre Europe


  • Learning Task-specific Bilexical Embeddings

    Pranava Madhyastha(1), Xavier Carreras(1,2), Ariadna Quattoni(1,2)

    (1) Universitat Politècnica de Catalunya (2) Xerox Research Centre Europe

  • Bilexical Relations

    - Increasing interest in bilexical relations (relations between pairs of words)

    - Dependency parsing: lexical items (words) connected by binary relations

    [Dependency tree over "Small birds sing loud songs", with arcs ROOT, SUBJ, OBJ, NMOD, NMOD]

    - Bilexical predictions can be modelled as Pr(modifier | head)

  • In Focus: Unseen words

    Adjective-Noun relation, where an adjective modifies a noun

    Vinyl can be applied to electronic devices and cases

    [Two candidate NMOD? attachments indicated in the example above]

    - If one or more of the above nouns or adjectives has not been observed in the supervision, estimating Pr(adjective | noun) is hard

    - Word frequencies follow a Zipf distribution, so many words are rare or unseen

    - Generalisation is a challenge

  • Distributional Word Space Models

    - Distributional Hypothesis: linguistic items with similar distributions have similar meanings

    [Concordance: example corpus contexts for the word "moon"]

    - For every word w we can compute an n-dimensional vector-space representation φ(w) ∈ R^n from a large corpus
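One simple way to build such distributional vectors is to count co-occurrences in a fixed window over a tokenized corpus. The sketch below (Python/numpy, toy sentences) is only illustrative; the context definition and any reweighting actually used to build φ(w) may differ.

```python
import numpy as np
from collections import Counter

def cooccurrence_vectors(sentences, vocab, window=2):
    """Build simple window-based co-occurrence count vectors phi(w) in R^n."""
    index = {w: i for i, w in enumerate(vocab)}
    counts = {w: Counter() for w in vocab}
    for tokens in sentences:
        for i, w in enumerate(tokens):
            if w not in index:
                continue
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i and tokens[j] in index:
                    counts[w][tokens[j]] += 1
    # phi: one row per word, one column per context word
    phi = np.zeros((len(vocab), len(vocab)))
    for w, ctx in counts.items():
        for c, n in ctx.items():
            phi[index[w], index[c]] = n
    return phi, index

sentences = [["small", "birds", "sing", "loud", "songs"],
             ["loud", "birds", "sing"]]
vocab = ["small", "birds", "sing", "loud", "songs"]
phi, index = cooccurrence_vectors(sentences, vocab)
print(phi[index["birds"]])   # distributional vector for "birds"
```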

  • Contributions

    Formulation of statistical models to improve bilexical prediction tasks

    - Supervised framework to learn bilexical models over distributional representations
      ⇒ based on learning bilinear forms

    - Compressing representations by imposing low-rank constraints on bilinear forms

    - Lexical embeddings tailored for a specific bilexical task

  • Overview

    Bilexical Models

    Low Rank Constraints

    Learning

    Experiments



  • Unsupervised Bilexical Models

    - We can define a simple bilexical model as:

      $$\Pr(m \mid h) = \frac{\exp\{\langle \phi(m), \phi(h) \rangle\}}{\sum_{m'} \exp\{\langle \phi(m'), \phi(h) \rangle\}}$$

      where ⟨φ(x), φ(y)⟩ denotes the inner product.

    - Problem: designing appropriate contexts for the required relations

    - Solution: leverage a supervised training corpus

  • Supervised bilexical model

    - We define the bilexical model in a bilinear setting as:

      $$\phi(m)^\top W \phi(h)$$

      where φ(m) and φ(h) are n-dimensional representations of m and h, and W ∈ R^{n×n} is a matrix of parameters
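As an illustration of the bilinear form above, a minimal numpy sketch with random vectors standing in for φ(m) and φ(h); the dimensionality n and all values are placeholders.

```python
import numpy as np

n = 5                                 # dimensionality of phi(m), phi(h)
rng = np.random.default_rng(0)
phi_m = rng.normal(size=n)            # phi(m): representation of the modifier
phi_h = rng.normal(size=n)            # phi(h): representation of the head
W = rng.normal(size=(n, n))           # parameter matrix W in R^{n x n}

score = phi_m @ W @ phi_h             # bilinear form phi(m)^T W phi(h)
print(score)
```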

  • Interpreting the Bilinear Models

    - If we write the bilinear model as:

      $$\sum_{i=1}^{n} \sum_{j=1}^{n} f_{i,j}(m, h)\, W_{i,j} \qquad \text{where } f_{i,j}(m, h) = \phi(m)[i]\, \phi(h)[j]$$

      ⇒ Bilinear models are linear models, with an extended feature space!

    - ⇒ We can re-use all the algorithms designed for linear models (see the sketch below)
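A small numerical check of this equivalence, assuming nothing beyond the definitions above: the bilinear score equals a linear score over the flattened outer-product features f_{i,j}(m, h).

```python
import numpy as np

n = 4
rng = np.random.default_rng(1)
phi_m, phi_h = rng.normal(size=n), rng.normal(size=n)
W = rng.normal(size=(n, n))

bilinear = phi_m @ W @ phi_h               # phi(m)^T W phi(h)
features = np.outer(phi_m, phi_h).ravel()  # f_{i,j}(m, h), flattened to n^2 features
linear = features @ W.ravel()              # <f(m, h), vec(W)>: an ordinary linear model
assert np.isclose(bilinear, linear)
```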

  • Using Bilexical Models

    - We define the bilexical operator as:

      $$\Pr(m \mid h) = \frac{\exp\{\phi(m)^\top W \phi(h)\}}{\sum_{m' \in M} \exp\{\phi(m')^\top W \phi(h)\}}$$

      ⇒ Standard conditional log-linear model
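A sketch of this conditional distribution in numpy, where the candidate set M is represented by a toy matrix whose rows are the φ(m') vectors; all data here is random placeholder data.

```python
import numpy as np

n, num_modifiers = 5, 7
rng = np.random.default_rng(2)
Phi_M = rng.normal(size=(num_modifiers, n))   # rows: phi(m') for every candidate m' in M
phi_h = rng.normal(size=n)                    # phi(h) for the head
W = rng.normal(size=(n, n))

scores = Phi_M @ W @ phi_h                    # phi(m')^T W phi(h) for all candidates
scores -= scores.max()                        # numerical stability
probs = np.exp(scores) / np.exp(scores).sum() # Pr(m' | h)
print(probs)
```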

  • Overview

    Bilexical Models

    Low Rank Constraints

    Learning

    Experiments


  • Rank Constraints

    $$\phi(m)^\top W \phi(h) =
    \underbrace{\begin{bmatrix} m_1 & m_2 & \cdots & m_n \end{bmatrix}}_{\phi(m)^\top}
    \begin{bmatrix} w_{11} & w_{12} & \cdots & w_{1n} \\ w_{21} & w_{22} & \cdots & w_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ w_{n1} & w_{n2} & \cdots & w_{nn} \end{bmatrix}
    \underbrace{\begin{bmatrix} h_1 \\ h_2 \\ \vdots \\ h_n \end{bmatrix}}_{\phi(h)}$$

  • Rank Constraints

    - Factorizing W with the singular value decomposition, SVD(W) = U Σ V^⊤:

      $$\phi(m)^\top W \phi(h) =
      \underbrace{\begin{bmatrix} m_1 & m_2 & \cdots & m_n \end{bmatrix}}_{\phi(m)^\top}
      \underbrace{\begin{bmatrix} u_{11} & \cdots & u_{1k} \\ u_{21} & \cdots & u_{2k} \\ \vdots & & \vdots \\ u_{n1} & \cdots & u_{nk} \end{bmatrix}}_{U}
      \underbrace{\begin{bmatrix} \sigma_1 & \cdots & 0 \\ \vdots & \ddots & \vdots \\ 0 & \cdots & \sigma_k \end{bmatrix}}_{\Sigma}
      \underbrace{\begin{bmatrix} v_{11} & \cdots & v_{1n} \\ \vdots & & \vdots \\ v_{k1} & \cdots & v_{kn} \end{bmatrix}}_{V^\top}
      \underbrace{\begin{bmatrix} h_1 \\ h_2 \\ \vdots \\ h_n \end{bmatrix}}_{\phi(h)}$$

    - Please note: W has rank k
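A minimal sketch of obtaining such a rank-k factorization by truncating the SVD of a (here random) matrix W; the choice of k is arbitrary.

```python
import numpy as np

n, k = 6, 2
rng = np.random.default_rng(3)
W = rng.normal(size=(n, n))

U, s, Vt = np.linalg.svd(W)            # full SVD: W = U diag(s) V^T
U_k, s_k, Vt_k = U[:, :k], s[:k], Vt[:k, :]
W_k = U_k @ np.diag(s_k) @ Vt_k        # best rank-k approximation of W
print(np.linalg.matrix_rank(W_k))      # -> k
```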

  • Low Rank Embedding

    - Regrouping, we get:

      $$\underbrace{\phi(m)^\top U}_{\text{projection of } m}\; \Sigma\; \underbrace{V^\top \phi(h)}_{\text{projection of } h}$$

    - We can see φ(m)^⊤U as a projection of m and V^⊤φ(h) as a projection of h

    - ⇒ Rank(W) defines the dimensionality of the induced space, hence the embedding

  • Computational Properties

    - In many tasks, given a head, we must rank a huge number of modifiers

    - Strategy (see the sketch below):
      - Project each lexical item in the vocabulary into its low-dimensional embedding of size k
      - Compute the bilexical score as a k-dimensional inner product

    - Substantial computational gain as long as we obtain low-rank models
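A sketch of that strategy: project all modifiers and the head once into the k-dimensional space, then score with k-dimensional inner products. The assertion checks that the low-rank scores match the full bilinear scores; the sizes are arbitrary placeholders.

```python
import numpy as np

n, k, num_modifiers = 100, 5, 10000
rng = np.random.default_rng(4)
Phi_M = rng.normal(size=(num_modifiers, n))
phi_h = rng.normal(size=n)
U = rng.normal(size=(n, k))
S = np.diag(rng.uniform(1, 2, size=k))
V = rng.normal(size=(n, k))
W = U @ S @ V.T                               # a rank-k parameter matrix

# One-off projections into the k-dimensional embedding space
M_emb = Phi_M @ U @ S                         # (phi(m)^T U) Sigma, for all modifiers
h_emb = V.T @ phi_h                           # V^T phi(h)

scores_lowrank = M_emb @ h_emb                # k-dim inner product per modifier
scores_full = Phi_M @ W @ phi_h               # full n x n bilinear scores, for comparison
assert np.allclose(scores_lowrank, scores_full)
```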

  • Summary

    - Induce high-dimensional representations from a huge corpus

    - Learn embeddings suited for a given task

    - Our bilexical formulation is, in principle, a linear model, but with an extended feature space

    - Low-rank bilexical embeddings are computationally efficient

  • Overview

    Bilexical Models

    Low Rank Constraints

    Learning

    Experiments


  • Formulation

    - Given:
      - a set of training tuples D = (m_1, h_1), ..., (m_l, h_l)
      - where the m are modifiers and the h are heads
      - distributional representations φ(m) and φ(h) computed over some corpus

    - We set it up as a conditional log-linear distribution:

      $$\Pr(m \mid h) = \frac{\exp\{\phi(m)^\top W \phi(h)\}}{\sum_{m' \in M} \exp\{\phi(m')^\top W \phi(h)\}}$$

  • Learning and Regularization

    - Standard conditional maximum-likelihood optimization; maximize the log-likelihood function:

      $$\log \Pr(D) = \sum_{(m,h) \in D} \left[ \phi(m)^\top W \phi(h) - \log \sum_{m' \in M} \exp\{\phi(m')^\top W \phi(h)\} \right]$$

    - Adding a regularization penalty, our algorithm essentially maximizes:

      $$\sum_{(m,h) \in D} \log \Pr(m \mid h) - \lambda \|W\|_p$$

    - Regularization using the proximal gradient method (FOBOS):
      - ℓ1 regularization, ‖W‖_1 ⇒ sparse feature space
      - ℓ2 regularization, ‖W‖_2 ⇒ dense parameters
      - ℓ* (nuclear norm) regularization, ‖W‖_* ⇒ low-rank embedding
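As a sketch of what the gradient g(W_t) in the next slide's algorithm computes, here is the gradient of the negative log-likelihood for a single training pair (m, h) under the model above. The data is random and the regularizer is handled separately by the proximal step.

```python
import numpy as np

def neg_log_likelihood_grad(W, Phi_M, m_idx, phi_h):
    """Gradient of -log Pr(m | h) w.r.t. W for a single training pair (m, h)."""
    scores = Phi_M @ W @ phi_h                    # phi(m')^T W phi(h) for all candidates
    scores -= scores.max()                        # numerical stability
    probs = np.exp(scores) / np.exp(scores).sum() # Pr(m' | h)
    expected_phi_m = probs @ Phi_M                # E_{m' ~ Pr(.|h)}[phi(m')]
    return np.outer(expected_phi_m - Phi_M[m_idx], phi_h)

n, num_modifiers = 5, 8
rng = np.random.default_rng(5)
Phi_M = rng.normal(size=(num_modifiers, n))
phi_h = rng.normal(size=n)
W = np.zeros((n, n))
g = neg_log_likelihood_grad(W, Phi_M, m_idx=3, phi_h=phi_h)
print(g.shape)   # (n, n)
```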

  • Algorithm: Proximal Algorithm for Bilexical Operators

      while iteration < MaxIteration do
          W_{t+0.5} = W_t − η_t g(W_t)                                            // gradient of the negative log-likelihood
          // regularization penalty via the proximal operator:
          //   W_{t+1} = argmin_W ||W_{t+0.5} − W||_2^2 + η_t λ r(W)
          if ℓ1 regularizer then
              W_{t+1}(i,j) = sign(W_{t+0.5}(i,j)) · max(|W_{t+0.5}(i,j)| − η_t λ, 0)   // soft thresholding
          else if ℓ2 regularizer then
              W_{t+1} = W_{t+0.5} / (1 + η_t λ)                                    // scaling
          else if nuclear norm regularizer then
              W_{t+0.5} = U Σ V^⊤                                                  // SVD of W_{t+0.5}
              σ̄_i = max(σ_i − η_t λ, 0)                                            // σ_i = the i-th singular value in Σ
              W_{t+1} = U Σ̄ V^⊤                                                    // singular value thresholding
          end
      end
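A minimal numpy sketch of the three proximal updates above, plus a generic FOBOS-style loop. The gradient function, step size eta and penalty lambda are placeholders to be supplied (for instance the single-pair gradient sketched above).

```python
import numpy as np

def prox_l1(W, step):
    """Soft thresholding: proximal operator for the l1 penalty."""
    return np.sign(W) * np.maximum(np.abs(W) - step, 0.0)

def prox_l2(W, step):
    """Scaling: the l2 update from the algorithm above."""
    return W / (1.0 + step)

def prox_nuclear(W, step):
    """Singular value thresholding: proximal operator for the nuclear norm."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    s_bar = np.maximum(s - step, 0.0)
    return U @ np.diag(s_bar) @ Vt

def proximal_gradient(W0, grad, prox, eta=0.1, lam=0.01, max_iter=100):
    """FOBOS-style loop: a gradient step on the loss, then a proximal step."""
    W = W0
    for _ in range(max_iter):
        W_half = W - eta * grad(W)   # gradient of the negative log-likelihood
        W = prox(W_half, eta * lam)  # regularization via the proximal operator
    return W
```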

  • Overview

    Bilexical Models

    Low Rank Constraints

    Learning

    Experiments


  • Experiments

    - Tasks:
      - Noun-Adjective relations: Pr(adjective | noun) and Pr(noun | adjective)
      - Verb-Object relations: Pr(object | verb) and Pr(verb | object)

    - Data:
      - Supervised corpus: gold-standard dependencies of the Penn Treebank
      - We partition the heads of head-modifier relations into three parts:
        60% of heads for training, 10% of heads for validation and 30% of heads for test.
        No heads from the test set appear in the training set.
      - Corpora for distributional representations: BLLIP corpus

    - Training: for each head word, using the supervised data, we compile a list of compatible and incompatible modifiers

  • Results

    Nouns      Predicted Adjectives

    president  executive, senior, chief, frank, former, international, marketing, assistant, annual, financial
    wife       former, executive, new, financial, own, senior, old, other, deputy, major
    shares     annual, due, net, convertible, average, new, high-yield, initial, tax-exempt, subordinated
    mortgages  annualized, annual, three-month, one-year, average, six-month, conventional, short-term, higher, lower
    month      last, next, fiscal, first, past, latest, early, previous, new, current
    problem    new, good, major, tough, bad, big, first, financial, long, federal
    holiday    new, major, special, fourth-quarter, joint, quarterly, third-quarter, small, strong, own

    Table: 10 most likely adjectives for some nouns

  • Results

    [Plot: pairwise accuracy vs. number of operations for "Objects given Verb"; curves for unsupervised, NN (nuclear norm), L1 and L2 models]

    - Pairwise accuracy: a measure of how well compatible modifiers are ranked above incompatible ones

    - Capacity of the model: given the head, the number of double operations required to compute scores for all modifiers

    - In general, if the representation has dimension n and there are m modifiers, then (see the worked example below):
      - ℓ1 & ℓ2 ⇒ if W has d non-zero weights ⇒ dm operations
      - ℓ* ⇒ if the rank of W is k ⇒ kn + km operations
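A tiny worked example of these operation counts; all the numbers (n, m, d, k) are made up for illustration and are not taken from the experiments.

```python
n = 2000         # dimensionality of the distributional representation
m = 50000        # number of candidate modifiers to score for one head
d = 400000       # non-zero weights in W under l1/l2 (10% of n*n, say)
k = 30           # rank of W under the nuclear norm

ops_sparse = d * m           # l1 / l2: d operations per modifier
ops_lowrank = k * n + k * m  # l*: project the head once, then k-dim inner products
print(ops_sparse, ops_lowrank)   # 2e10 vs 1.56e6
```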

  • Results

    [Four panels: "Adjectives given Noun", "Nouns given Adjective", "Objects given Verb", "Verbs given Object"; curves for unsupervised, NN, L1 and L2 models]

    Figure: Pairwise accuracy vs number of double operations to compute the distribution over m for a given h

  • Prepositional Phrase attachment

    [Diagram: verb (v), object (o) and modifier (m) of a preposition, with the two competing attachments VERBAL(prep) and NOMINAL(prep)]

    - Given: for every preposition p, a set of training tuples D_p = {(v, o, p, m, y)_1, ..., (v, o, p, m, y)_l}

    - Distributional representations: φ(v), φ(o), φ(m)

      $$\Pr(y{=}V \mid \langle v, o, p, m \rangle) = \frac{\exp\{\phi(v)^\top W^V_p \phi(m)\}}{Z}
      \qquad
      \Pr(y{=}O \mid \langle v, o, p, m \rangle) = \frac{\exp\{\phi(o)^\top W^O_p \phi(m)\}}{Z}$$

    - Does the bilinear model complement the linear model? For a constant λ ∈ [0, 1] (see the sketch below):

      $$\Pr(y \mid x) = \lambda \Pr_L(y \mid x) + (1 - \lambda) \Pr_B(y \mid x)$$
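A sketch of the interpolation step, assuming the linear and bilinear models already provide distributions over the two attachment decisions {V, O}; the probabilities and λ below are placeholders.

```python
import numpy as np

def interpolate(p_linear, p_bilinear, lam=0.5):
    """Mixture Pr(y|x) = lam * Pr_L(y|x) + (1 - lam) * Pr_B(y|x)."""
    return lam * p_linear + (1.0 - lam) * p_bilinear

p_linear = np.array([0.7, 0.3])     # Pr_L(y = V | x), Pr_L(y = O | x)
p_bilinear = np.array([0.4, 0.6])   # Pr_B(y = V | x), Pr_B(y = O | x)
print(interpolate(p_linear, p_bilinear, lam=0.3))   # still a valid distribution
```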

  • Results

    [Bar chart: attachment accuracy for the prepositions "for", "from" and "with"; bars for bilinear L1, bilinear L2, bilinear NN, linear, interpolated L1, interpolated L2 and interpolated NN models]

    Figure: Attachment accuracies of linear, bilinear and interpolated models for three prepositions

  • Conclusion

    - We have presented a semi-supervised bilexical model that has the potential to generalize over unseen words

    - We have proposed a method to learn low-rank embeddings for scoring bilexical relations efficiently

    - We want to apply this idea to other bilexical tasks in NLP

    - We want to explore how we can combine other feature representations with low-rank bilexical operators

  • Thank You
