MultiWord Expressions in NLP Jan Odijk LOT Summerschool Utrecht, June 2004


Page 1: MultiWord Expressions in NLP

MultiWord Expressions in NLP

Jan Odijk

LOT Summerschool

Utrecht, June 2004

Page 2: MultiWord Expressions in NLP

Overview

• NLP

• MWEs

• MWEs in NLP

• MWE Types

• Treatment of MWEs in selected frameworks

• MWEs and the lexicon

Page 4: MultiWord Expressions in NLP

Natural Language Processing

• Automatic processing of natural language
– Generation: Semantic Representation → String
– Analysis: String → Semantic Representation

• Example applications
– Machine Translation (MT)
– Information Retrieval (IR)
– Cross-language Information Retrieval (CLIR)
– Question-Answering

Page 5: MultiWord Expressions in NLP

Natural Language Processing

• Based on Grammars– (Popular) frameworks

• Feature structure based– Head-driven Phrase Structure Grammar (HPSG)– Lexical-Functional Grammar (LFG)

• Tree-based– Tree-Adjoining Grammar (TAG)– M-Grammar

• Based on grammar components or dedicated modules– Decompounding– PoS-tagging– Chunking– Named Entity Recognition– Name/Address grammars– Date / Amount grammars

Page 6: MultiWord Expressions in NLP

Natural Language Processing

• Based on Statistics– No explicit grammar– Statistics

• Derived from (annotated) training corpus

• Tested with test corpus

• Applied to new corpora

• Combinations of grammar and statistics

Page 7: MultiWord Expressions in NLP

NLP Grammar

• Defines <form, meaning> pairs and structural descriptions at various levels

• Components– Semantics– Syntax– Morphology– Orthography (Phonology)

Page 8: MultiWord Expressions in NLP

NLP Grammar

• Semantics
– Defines the meaning of an utterance
– Usually synchronized with syntax (compositionality)
• HPSG: CONTENTS attribute
• M-Grammar: in-tandem build-up
• Synchronous TAG: in-tandem build-up with derivation trees
• LFG: in tandem with f-structure

Page 9: MultiWord Expressions in NLP

NLP Grammar

• Syntax– Defines the syntactic structure of an utterance– Object types: Trees, DAGs– Features: attribute-value pairs– Value: atomic or structured

Page 10: MultiWord Expressions in NLP

NLP Grammar

• Syntax– Often surface syntax and deep syntax

(not necessarily on a separate level)

• HPSG: surface tree v. DAG

• M-Grammar: surface trees v. derivation trees

• LFG: c-structure v. f-structure

• TAG: derived tree v. derivation tree

• Alpino: surface tree v. dependency tree

Page 11: MultiWord Expressions in NLP

NLP Grammar

• Morphology
– Relates (word structure, string)
– Word-internal structure build-up usually in the syntactic component
– Usually a rule system (intensional definition)
– Simple inflection: sometimes a list of triples <base form, morph prop, word form> (extensional definition)

Page 12: MultiWord Expressions in NLP

NLP Grammar

• Orthography
– Relates ([String], String)
• [he, said, :, “, come, in, !, ”]
• He said: “come in!”
– Usually trivial in generation
– Easy in analysis (tokenization) for many languages
– Sometimes split (erop, opgebeld)
– Very problematic for Chinese, Japanese, etc.
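The analysis direction of the orthography relation (String to [String]) can be sketched as a small tokenizer. This is a minimal illustration only: the regular expression is an assumption that suits space-separated scripts, and it does not handle the split cases (erop, opgebeld) or languages such as Chinese and Japanese mentioned above.

```python
import re

# Assumed token pattern (not from the slides): a run of word characters,
# or any single character that is neither a word character nor whitespace.
TOKEN = re.compile(r"\w+|[^\w\s]")


def tokenize(text: str) -> list[str]:
    """Analysis direction of the orthography relation: String -> [String]."""
    return TOKEN.findall(text)


if __name__ == "__main__":
    print(tokenize('He said: "come in!"'))
```

Unlike the slide's example, this sketch keeps the original capitalization; case normalization would be a separate step.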

Page 13: MultiWord Expressions in NLP

Overview

• NLP

• MWEs

• MWEs in NLP

• MWE Types

• Treatment of MWEs in selected frameworks

• MWEs and the lexicon

Page 14: MultiWord Expressions in NLP

What are MWEs?

Page 15: MultiWord Expressions in NLP

What are MWEs?

• sequence of words that has lexical, orthographic, phonological, morphological, syntactic, semantic, pragmatic or translational properties not predictable from the individual components or their normal mode of combination

Page 16: MultiWord Expressions in NLP

What are MWEs?

• sequence of
– Not necessarily contiguous in a concrete utterance
• ...omdat hij de plaat wilde poetsen
– Not necessarily always in the same order in each utterance
• Hij poetste gisteren de plaat

• words
– Ambiguity between type and token (intentional)
– Inflected word form v. lemma
– Ambiguity between
• Character sequences separated from other character sequences by spaces and other separators (narrow interpretation)
• Abstract lexical units of the grammar (broad interpretation)

Page 17: MultiWord Expressions in NLP

What are MWEs?

• that has properties not predictable from the individual components and their normal mode of combination

Page 18: MultiWord Expressions in NLP

What are MWEs?

• Lexical– De plaat poetsen– Een poging wagen / doen / *maken– Dat varkentje eens wassen– Zware / *sterke shag– Scherpe kritiek– Perdre la tête/ la boule / *la cervelle– Se creuser la tête / * la boule / la cervelle

Page 19: MultiWord Expressions in NLP

What are MWEs?

• Orthographic– viz.

– Bijv.

– www.uilots.nl

– i.v.m.

– Yahoo!

– Groen!

– Aujourd’hui (v. l’homme)

– ‘s (avonds/morgens/middags)

Page 20: MultiWord Expressions in NLP

What are MWEs?

• phonological, – Over de rooie/*rode (gaan/zijn/raken)

– om de dooie/*dode donder niet

– op zijn dooie akkertje/gemak

– op zijn dooie eentje

– De kwaaie/*kwade Piet toegespeeld krijgen

– Je niet in de kouwe/*koude kleren gaan zitten

– Een gouwe ouwe

– (but geen rode/rooie cent/duit (hebben))

Page 21: MultiWord Expressions in NLP

What are MWEs?

• morphological, – Ten gevolge van– Ter wereld– Van goeden huize– Zonder aanzien des persoons– Het lood*(je) leggen– Dat varken*(tje) wassen– De *raap is / rapen zijn gaar

Page 22: MultiWord Expressions in NLP

What are MWEs?

• Syntactic
– Ten gevolge van
– In opdracht van (no article)
– Iemand een oor aannaaien
– Rekening houden met (obligatorily indefinite)
– Het bijvoeglijk(*e) naamwoord (v. een groot/grote man)

Page 23: MultiWord Expressions in NLP

What are MWEs?

• Semantic– De plaat poetsen– Dat varkentje wassen– Een bok schieten– Een flater slaan

Page 24: MultiWord Expressions in NLP

What are MWEs?

• Pragmatic– Ladies and Gentlemen– Ik heb gezegd.– Eet smakelijk! (Bon appétit!, Enjoy!)– Sincerely yours

Page 25: MultiWord Expressions in NLP

What are MWEs?

• Translational properties
– Laten zien (F. montrer, E. show)
– Witte wijn (P. vinho verde)
– Nuclear power plant (D. atoomcentrale, G. Kernkraftwerk)
– Space probe (F. sonde spatiale)
– Iemand iets laten weten
• inform someone of something

Page 26: MultiWord Expressions in NLP

Overview

• NLP

• MWEs

• MWEs in NLP

• MWE Types

• Treatment of MWEs in selected frameworks

• MWEs and the lexicon

Page 27: MultiWord Expressions in NLP

MWEs in NLP

• MWEs occur very often in natural language– Esp. in languages with little compounding

• Especially in specialized domains– Multi-word terminology

Page 28: MultiWord Expressions in NLP

MWEs in NLP

• MT
– Improves parsing and translation of the MWEs
– Also improves parsing, and hence translation, of the sentence containing the MWEs (Nivre & Nilsson, LREC 2004)

• CLIR
– Nuclear power plant
• Word-for-word (wrong): Kern- macht plant
• Word-for-word (wrong): Kern- Macht Pflanz
• v. atoomcentrale / Kernkraftwerk

Page 29: MultiWord Expressions in NLP

MWEs in NLP

• Problems MWEs pose for NLP
– How are MWEs to be dealt with in the grammar of an NLP system?
– What lexical representation of MWEs is required for this?
– How can we obtain lexicons containing MWEs with such lexical representations?

Page 30: MultiWord Expressions in NLP

Overview

• NLP

• MWEs

• MWEs in NLP

• MWE Types

• Treatment of MWEs in selected frameworks

• MWEs and the lexicon

Page 31: MultiWord Expressions in NLP

Types of MWEs (I)

• Fixed

• Semi-flexible

• Flexible

Page 32: MultiWord Expressions in NLP

Fixed MWEs

• Fixed MWEs
– Words of the MWE in a fixed order
– No variation in lexical item choice
– Always contiguous (no other elements in between)
– No inflectional processes except at the edges

Page 33: MultiWord Expressions in NLP

Fixed MWEs

• Fixed MWEs– ad hoc, stante pede, ter plaatse– Hong Kong, Kuala Lumpur, New York, San Francisco– credit card, travel agency, real estate agency

• NOT– in plaats van (cf. in plaats daarvan) (‘instead of’)– carta telefonica (cf. carte telefoniche)– de plaat poetsen (‘polish the plate’, ‘bolt’)

Page 34: MultiWord Expressions in NLP

Semi-Flexible MWEs

• Semi-Flexible MWEs– MWEs with fixed order of elements– That are impenetrable for other words– Parts can be inflected

Page 35: MultiWord Expressions in NLP

Semi-Flexible MWEs

• Examples:– Chambre des représentants

• House of representatives

– Patatas fritas• French fries

– Mise au point automatique• Autofocus

– Calculateur analogique• Analogue computer

Page 36: MultiWord Expressions in NLP

Semi-Flexible MWEs

• Examples:– Cité plus haut

• Above-stated

– Résistant aux acides• Acid-proof

– Malade en altitude• Airsick

Page 37: MultiWord Expressions in NLP

Flexible MWEs

• Flexible MWEs• Allow or require inflection in multiple parts, and

• Allow permutations of subphrases, or

• Allow intrusion by other phrases, or

• Have controlled variation (bound pronouns)

Page 38: MultiWord Expressions in NLP

Flexible MWEs

– de plaat poetsen (‘bolt’)• Hij heeft gisteren de plaat gepoetst• …omdat hij de plaat wilde poetsen• Hij poetste gisteren de plaat

– to lose one’s temper• He lost his temper• She lost her temper

Page 39: MultiWord Expressions in NLP

Treatment

• Fixed MWEs
– No inflection: relate single string to sequence of strings (in Orthography)
• ([ad_hoc], [ad, hoc])
• Lexical entry for ad_hoc
– With inflection: relate single stem to sequence of stems in Morphology
• ([real, estate, agency, Plur] -> [real_estate_agency, Plur])
• Lexical entry for real_estate_agency
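The uninflected case above can be sketched as a token-merging pass that maps a sequence such as [ad, hoc] to the single lexical unit ad_hoc. The lexicon contents, the function name and the longest-match strategy are illustrative assumptions; inflection at the edges (real estate agencies) would need the morphological variant and is not covered here.

```python
# Hypothetical mini-lexicon of fixed MWEs, stored as token tuples.
FIXED_MWES = {
    ("ad", "hoc"),
    ("stante", "pede"),
    ("real", "estate", "agency"),
}


def merge_fixed_mwes(tokens, mwes=FIXED_MWES):
    """Replace each contiguous fixed MWE by one underscore-joined token.

    Longest match first, so a three-word entry wins over a two-word prefix.
    """
    max_len = max(len(m) for m in mwes)
    out = []
    i = 0
    while i < len(tokens):
        # Try the longest window that still fits, down to two tokens.
        for n in range(min(max_len, len(tokens) - i), 1, -1):
            window = tuple(t.lower() for t in tokens[i:i + n])
            if window in mwes:
                out.append("_".join(window))
                i += n
                break
        else:
            # No MWE starts here: copy the token unchanged.
            out.append(tokens[i])
            i += 1
    return out


if __name__ == "__main__":
    print(merge_fixed_mwes(["an", "ad", "hoc", "solution"]))
```

After this pass the grammar only needs an ordinary lexical entry for ad_hoc, which is the point made on the slide.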

Page 40: MultiWord Expressions in NLP

Treatment

• Semi-flexible MWEs– Require local syntax– Chunking may be enough

Page 41: MultiWord Expressions in NLP

Treatment

• Flexible MWEs– Require sophisticated syntax

Page 42: MultiWord Expressions in NLP

Types of MWEs (II)

• Verb –particle combinations (English, German, Dutch, Hungarian)– Ik sloeg hem over– I looked the passage up

Page 43: MultiWord Expressions in NLP

Types of MWEs (II)

• Verb + prepositional complement– I looked after her– Hij heeft altijd van haar gehouden

Page 44: MultiWord Expressions in NLP

Types of MWEs (II)

• Circumpositions (Dutch, German)– Op iemand af / ?toe / *heen– Auf jemanden *ab / zu– Over de brug heen / *af / *toe

Page 45: MultiWord Expressions in NLP

Types of MWEs (II)

• Lexical item (from open or closed class)

• + closed class lexical item– Finite (actually small) list

• Limited variety of predictable syntactic structures

• Dealt with by almost any grammar-based NLP system

Page 46: MultiWord Expressions in NLP

Types of MWEs (II)

• Multiword Names– Examples

• Fifth Avenue

• Koning Leopold III-laan

• Krimpen aan de IJssel

• Koninklijke Nederlandse Philips N.V.

Page 47: MultiWord Expressions in NLP

Types of MWEs (II)

• Multiword Names– Issues

• Keys – variation– (Koning) Leopold III-laan

– Fifth (Avenue)

– ((Calle) Roberto) González

– Many different ones, continuously new ones

– Very important for correct parsing and translation
• Minister Kohl → *Minister Cabbage

Page 48: MultiWord Expressions in NLP

Types of MWEs(II)

• Compounds (in English)– Examples

• Real estate agency

• Nuclear power plant

• Blue cheese

• Private eye

• High school

Page 49: MultiWord Expressions in NLP

Types of MWEs(II)

• Idioms
– No or unpredictable meaning of the components
– Fixed (or very limited) lexical item selection
– Opaque

• Kick the bucket

• De plaat poetsen

• Casser sa pipe

Page 50: MultiWord Expressions in NLP

Types of MWEs(II)

• Idioms
– Semi-transparent

• `een bok schieten’– Bok (male goat) = blunder

– Schieten (shoot) = make

• `dat varkentje wassen’– Varkentje (little pig) = problem

– Wassen (wash) = address, take care of

Page 51: MultiWord Expressions in NLP

Types of MWEs(II)

• Idioms
– Usually completely normal syntactic structure
– Both a literal and an idiomatic reading
– Participate normally in many grammatical processes
– BUT: often restrictions on participating in grammatical processes

Page 52: MultiWord Expressions in NLP

Types of MWEs(II)

• Idioms (opaque)
– Normal participation
• Hij poetste de plaat (V2)
• Poetste hij de plaat? (V1, question formation)
• ...omdat hij de plaat wilde poetsen (VR)
– But not
• #De plaat heeft hij niet gepoetst (topicalization)
• #Welke plaat heeft hij gepoetst? (wh-Q)
• #Hij heeft de mooie plaat gepoetst (internal modification)
• #Hij heeft de plaat waarschijnlijk niet gepoetst (mid-field NP-Adv permutation)
• #De plaat die hij gepoetst heeft (relativization)
• #De plaat werd door hem gepoetst (passive)
• #Wat een plaat (independent occurrence)

Page 53: MultiWord Expressions in NLP

Types of MWEs(II)

• Idioms (semi-transparent)
– Normal participation
• Hij schoot een bok (V2)
• Schoot hij een bok? (V1, question formation)
• ...omdat hij een bok zou schieten (VR)
– Also
• Een bok heeft hij niet geschoten (topicalization)
• Wat voor een bok heeft hij nu weer geschoten? (wh-Q)
• Hij heeft een enorme bok geschoten (internal modification)
• Hij heeft die bok waarschijnlijk geschoten omdat ... (mid-field NP-Adv permutation)
• De bok die hij geschoten heeft (relativization)
• Er werd door hem een enorme bok geschoten (passive)
– But not:
• Wat een bok! (independent occurrence)

Page 54: MultiWord Expressions in NLP

Types of MWEs(II)

• Idioms
– In some cases irregular syntactic structure
• Ten gevolge van (fossilized portmanteau words, noun e-form)
• Het bijvoeglijk naamwoord (no –e)
• Iemand de oren wassen (inalienable possession construction)
– Regular syntactic structure but not predictable from the components’ properties
• Iemand welkom heten
– (`heten’ on its own can only take a pred. complement)
– *Hij heet hem aardig / president / Jan
• to lose face (count noun in determinerless NP)

Page 55: MultiWord Expressions in NLP

Types of MWEs(II)

• Idioms– Cranberry words

• Ergens de brui aan geven

Page 56: MultiWord Expressions in NLP

Types of MWEs(II)

• Semi-idioms (collocations)
– One element occurs in its normal meaning
– The lexical selection of the other element is fixed or very limited
– The other element has a special meaning
– Examples
• Zware tabak (heavy tobacco) `strong tobacco’
• Scherpe kritiek (sharp criticism) `severe criticism’
• Heavy smoker

Page 57: MultiWord Expressions in NLP

Types of MWEs(II)

• Support verb constructions– Type I

• Een poging wagen

• Een lezing houden / geven

• To pay attention to (aandacht schenken aan)

• To take advantage of

Page 58: MultiWord Expressions in NLP

Types of MWEs (II)

• Arguments of the noun outside the NP– De kritiek die we hadden op hem– ?de kritiek op hem die we hadden– *De kritiek die we naar voren brachten op hem– De kritiek op hem die we naar voren brachten– De kritiek op hem verstomde

Page 59: MultiWord Expressions in NLP

Types of MWEs (II)

• Arguments of the noun outside the NP
– De aandacht die we schonken aan hem
– *de aandacht aan hem die we schonken
– *De aandacht die we becommentarieerden aan hem
– De aandacht aan hem die we becommentarieerden
– De aandacht *aan / voor hem verflauwde

Page 60: MultiWord Expressions in NLP

Types of MWEs (II)

• Arguments of the noun outside the NP– The attention that we paid to this subject– *The attention to this subject that we paid– *The attention that we criticised to this subject– The attention to this subject that we criticized– The attention to this subject

Page 61: MultiWord Expressions in NLP

Types of MWEs (II)

• Arguments of the noun outside the NP– No attention was paid to this subject– This subject was paid no attention to– Advantage was taken of this proposition– This proposition was taken advantage of

Page 62: MultiWord Expressions in NLP

Types of MWEs (II)

– Type II• Iemand een stomp geven

• Iemand een klap geven

• To give someone a kiss

– Noun signifying bodily touch– `give’ + indirect object as Patient

Page 63: MultiWord Expressions in NLP

Types of MWEs (II)

– Type III(?)

• In de war zijn / raken / * gaan / ?komen / brengen

• In zijn nopjes zijn / raken / * komen / *brengen

• De pijp uit zijn / ?raken / gaan/ *komen / *brengen

Page 64: MultiWord Expressions in NLP

Types of MWEs (II)

• Quasi-idioms
– Characteristics: regular meaning + something extra (specialization)
– Examples
• Huisdeur (`front door’)
• Bijvoeglijk naamwoord
• Fried eggs
– Terms
– Compounds

Page 65: MultiWord Expressions in NLP

Overview

• NLP

• MWEs

• MWEs in NLP

• MWE Types

• Treatment of MWEs in selected frameworks

• MWEs and the lexicon

Page 66: MultiWord Expressions in NLP

Requirements

• Account for the fact that MWEs usually have `normal’ syntactic structures

• Account for the normal participation of MWEs in most syntactic processes

• Account for the restrictions on the participation in some syntactic processes

• Recognize it as an MWE and assign it the associated semantics

Page 67: MultiWord Expressions in NLP

Tree-Adjoining Grammar (TAG)

• Originally developed by Aravind Joshi
• Extended by Shieber, Schabes
• Applied to French by Abeillé
• Basic object: trees
• Enriched with features (incl. unification)
• Known parsing algorithm and complexity properties
• Defines mildly context-sensitive languages

Page 68: MultiWord Expressions in NLP

TAG

• Lexicalized TAG (LTAG)

• Trees:– Elementary Trees

• Associated with a lexical item

• Initial– Leaves labeled by

» Terminals

» Substitution nodes

Page 69: MultiWord Expressions in NLP

TAG

• Initial Trees (examples)
– N[Jean]
– S[N0↓ V[dormir]]
– S[N0↓ V[aimer] N1↓]
– S[N0↓ V[ressembler] PP[P[à] N1↓]]

• Used for words (verbs) taking nominal, adjectival, prepositional arguments

Page 70: MultiWord Expressions in NLP

TAG

• Auxiliary Trees– One leaf node (foot node) same category as root node

– Used for modifiers, auxiliary verbs, raising verbs, verbs taking sentential arguments

– Examples• N[A[beau] N*]

• N[Det[le] N*]

• S[N0↓ V[penser] S1’[C[que] S1*]]

• V[V[sembler] V*]

Page 71: MultiWord Expressions in NLP

TAG

• Constraints on Elementary Trees– Lexicalization– Predicate-Argument co-occurrence– Semantic Consistency

Page 72: MultiWord Expressions in NLP

TAG

• Operations– Substitution– Adjunction

• NO– Deletion– Movement– Permutation

Page 73: MultiWord Expressions in NLP

TAG

• Operations– Substitution

• Substitutes a tree at a leaf node marked for substitution (↓)

• Example
– S[N0↓ V[dormir]] + N[Jean] → S[N[Jean] V[dormir]]

Page 74: MultiWord Expressions in NLP

TAG

• Operations– Adjunction

• Inserts an auxiliary tree (or a tree derived from an auxiliary tree) at any node (with the same label)

• If this node dominates a subtree, this subtree is copied under the foot node of the auxiliary tree

– Example
• S[N[Jean] V[dormir]] + V[semble V*] → S[N[Jean] V[semble V[dormir]]]
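The two TAG operations can be sketched on a toy tree type. Everything below is an illustrative assumption rather than an existing TAG implementation: the `Node` class, the function names, the simplification of labels such as N0 to N, and the choice to operate on the first matching node only. Features and adjunction constraints are ignored.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    label: str
    children: list = field(default_factory=list)  # Node objects or terminal strings
    subst: bool = False  # substitution node (marked ↓ on the slides)
    foot: bool = False   # foot node of an auxiliary tree (marked * on the slides)


def show(n) -> str:
    """Render a tree in the bracket notation used on the slides."""
    if isinstance(n, str):
        return n
    if not n.children:
        return n.label + ("↓" if n.subst else "*" if n.foot else "")
    return f"{n.label}[{' '.join(show(c) for c in n.children)}]"


def substitute(tree: Node, initial: Node) -> Node:
    """Replace the first substitution node labelled like the root of `initial`."""
    done = False

    def walk(n):
        nonlocal done
        if isinstance(n, str):
            return n
        if n.subst and not done and n.label == initial.label:
            done = True
            return initial
        return Node(n.label, [walk(c) for c in n.children], n.subst, n.foot)

    return walk(tree)


def adjoin(tree: Node, aux: Node) -> Node:
    """Insert `aux` at the first node labelled like its root; the subtree
    found there is re-attached at the foot node of `aux`."""
    done = False

    def plug(n, subtree):
        if isinstance(n, str):
            return n
        if n.foot:
            return subtree
        return Node(n.label, [plug(c, subtree) for c in n.children], n.subst, n.foot)

    def walk(n):
        nonlocal done
        if isinstance(n, str):
            return n
        if not done and n.label == aux.label and not n.subst and not n.foot:
            done = True
            return plug(aux, n)
        return Node(n.label, [walk(c) for c in n.children], n.subst, n.foot)

    return walk(tree)


dormir = Node("S", [Node("N", subst=True), Node("V", ["dormir"])])
jean = Node("N", ["Jean"])
semble = Node("V", ["semble", Node("V", foot=True)])

step1 = substitute(dormir, jean)
step2 = adjoin(step1, semble)

if __name__ == "__main__":
    print(show(step1))
    print(show(step2))
```

The two printed trees correspond to the substitution and adjunction examples given on the slides for Jean, dormir and semble.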

Page 75: MultiWord Expressions in NLP

TAG

• Derived tree
– Tree created by substitution or adjunction
– Encodes word order, inflection, morphosyntactic features

• Derivation Tree
– α-dormir[ 1/α’-Jean 2/β3-semble]
– Close to dependency trees
– Basis for semantic interpretation

Page 76: MultiWord Expressions in NLP

TAG

• Lexical rules
– Elementary tree → elementary tree
– For passive, wh-questions, cleft constructions, relative clauses, cliticization
– Define the lexical item’s tree family

• Examples of derived elementary trees
– Passive: S[N1↓ V[être] V[aimé] PP[P[par] N0↓]]
– Object-cleft: S[V[CI[ce] V[être] N1↓ S’[C[que] S[N0↓ V[aimer]]]]
– Object-rel.: N[N1* S’[C[que] S[N0↓ V[aimer]]]]

Page 77: MultiWord Expressions in NLP

TAG

• Synchronous TAG

• Each elementary tree is associated with a semantic tree

• Links between nodes from the elementary tree and the semantic tree

Page 78: MultiWord Expressions in NLP

TAG

• N-1[Jean] – T-1[jean’]

• S-2[N0↓-1 V[dormir]] – F-2[R[dormir’] T1↓-1]

• S-1[N0↓-2 V[aimer] N1↓-3] – F-1[R[aimer’] T0↓-2 T1↓-3]

• S-1[N0↓-2 V[ressembler] PP[P[à] N1↓-3]] – F-1[R[ressembler’] T0↓-2 T1↓-3]

Page 79: MultiWord Expressions in NLP

TAG

• N-1[A[beau] N*]– F-1[R[beau’] T0*]

• N[Det[le] N*]– F-1[R[le’] T0*]

• S-1[N0↓-2 V[penser] S1’[C[que] S1*]]– F-1[R[penser’] T0↓-2 T1*]

• V-1[V[sembler] V*]– F-1[R[sembler’] T0*]

Page 80: MultiWord Expressions in NLP

TAG

• Given a pair, select a link (non-deterministically) with roots A and B

• Select another pair with roots A and B
• Combine the syntactic trees (at node A) and the semantic trees (at node B), by adjunction or substitution, and remove the link between A and B

• Do this recursively

Page 81: MultiWord Expressions in NLP

TAG

• <S[N0↓-1 V-2[dormir]], F-2[R[dormir’] T1↓-1]>, select link 1

• + <N-1[Jean], T-1[jean’]> <S[N[Jean] V-2[dormir]],

F-2[ R[dormir’] T[jean’]]

• !+ < V-1[V[sembler] V*], F-1[R[sembler’] T0*]> <S[N[Jean] V[V[sembler] V-2[dormir]]],

F[R[sembler] F-2[ R[dormir’] T[jean’]]]

Page 82: MultiWord Expressions in NLP

TAG

• Idiomatic Expressions in LTAG
– Each idiomatic expression represented by an elementary tree

• Examples
– S[N0↓ V[briser] N1[D[la] N[glace]]]
– S[N0↓ V[prendre] N1↓ PP[P[en] N[compte]]]
– S[N0[D[des] N[ailes]] V[pousser] PP1[P[à] N1↓]]

Page 83: MultiWord Expressions in NLP

TAG

• Associated with a semantic tree

• F[R[briser-la-glace’] T0↓]

• F[R[prendre-en-compte’] T0↓ T1↓]

• F[R[des-ailes-pousser-à’] T0↓]

• With the appropriate links

Page 84: MultiWord Expressions in NLP

TAG

• Lexical rules
– Apply to idiom elementary trees normally
– But can be individually constrained

• Links between parts of an idiomatic elementary tree and semantic trees are allowed:
– Internal syntactic modification can correspond to external semantic modification

Page 85: MultiWord Expressions in NLP

TAG

• MWEs are normal elementary trees

• Lexical rules apply to idiom elementary trees as usual

• Lexical rules can be restricted to apply to certain elementary trees

• Recognition and semantics: same as with single word elementary trees

Page 86: MultiWord Expressions in NLP

TAG

• But
– No restrictions on elementary trees
– Idiomatic elementary trees can deviate
– Elementary trees are complex (esp. features): difficult to maintain
– Restrictions on grammatical processes basically stipulated

Page 87: MultiWord Expressions in NLP

Overview

• NLP

• MWEs

• MWEs in NLP

• MWE Types

• Treatment of MWEs in selected frameworks

• MWEs and the lexicon

Page 88: MultiWord Expressions in NLP

M-Grammar

• Developed by Jan Landsbergen• Inspired by Montague grammar• Compositional Grammars

– The meaning of an expression is a function of the meaning of its parts and the way they are combined

• Use traditional syntactic surface trees (but with relations)

Page 89: MultiWord Expressions in NLP

M-Grammar

• Used for Machine Translation

• Research Prototype MT System– Dutch, English, Spanish

• Developed at Philips Research Labs– Rosetta project– Rosetta3 System

• Compositional Translation Method

Page 90: MultiWord Expressions in NLP

M-Grammar

• Compositionality of Meaning
– The grammars are organised in such a way that the meaning of an expression is a function of the meaning of its parts and the way they are combined.

• Implemented by Compositional Grammars
– Basic Expressions (BE), with a meaning
– Rules, with a meaning (recursively applicable)

Page 91: MultiWord Expressions in NLP

M-Grammar

• Basic Expressions– With basic meaning

• M-Rules– With meaning operation

• Basic object: S-trees
• S-tree = N[r1/T1,...,rn/Tn]
– N: node = CAT{a-v pairs}
– Ti: S-trees
– ri: grammatical relation (subject, object, head,...)

• S-tree of a basic expression is basic S-tree
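The S-tree definition above, a node CAT{a-v pairs} dominating relation/subtree pairs, maps naturally onto a small recursive data structure. The class below is a hypothetical illustration, not the Rosetta implementation; the attribute names `form` and `number` are invented for the example.

```python
from dataclasses import dataclass, field


@dataclass
class STree:
    cat: str                                       # syntactic category (CAT)
    attrs: dict = field(default_factory=dict)      # attribute-value pairs
    children: list = field(default_factory=list)   # list of (relation, STree)

    def render(self) -> str:
        """Render as CAT{a-v pairs}[r1/T1, ..., rn/Tn]."""
        node = self.cat
        if self.attrs:
            pairs = ", ".join(f"{k}={v}" for k, v in self.attrs.items())
            node += "{" + pairs + "}"
        if not self.children:
            return node
        inner = ", ".join(f"{rel}/{sub.render()}" for rel, sub in self.children)
        return f"{node}[{inner}]"


# A noun phrase with a modifier and a head, each labelled by its relation.
np_tree = STree(
    "NP",
    children=[
        ("mod", STree("A", {"form": "e"})),
        ("head", STree("N", {"number": "pl"})),
    ],
)

if __name__ == "__main__":
    print(np_tree.render())
```

Keeping the grammatical relation on the edge rather than on the node mirrors the r/T notation of the definition.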

Page 92: MultiWord Expressions in NLP

M-Grammar

• S-tree of a full utterance is created by applying M-rules to S-trees, initially basic S-trees

• Derivation history is recorded in syntactic derivation tree (syntactic D-tree)

• M-rules– Powerful rules

– Structure creation, deletion, permutations, movements, insertions

Page 93: MultiWord Expressions in NLP

M-Grammar

• M-Rules
– Relate [T1,..,Tn] to T
– Reversible
• Analytic and generative versions can be derived automatically
– Measure condition
• Each Ti in [T1,...,Tn] must be `smaller’ (according to some measure) than T
– Reversibility and Measure Condition guarantee effectiveness of parsing

Page 94: MultiWord Expressions in NLP

M-Grammar

• Syntactic D-trees contain names of basic expressions and names of rules

• Can be mapped into (isomorphic) Semantic D-trees, containing names of basic meanings and names of meaning operations

Page 95: MultiWord Expressions in NLP

M-Grammar

• Principle of Compositionality of Translation
– Two expressions are each other's translation if they are built up from parts which are each other's translation, by means of rules which are each other’s translation.

• Implemented by tuning Compositional Grammars G1 and G2
– For each BE in G1 at least one BE in G2 that is translationally equivalent
– For each rule in G1 at least one rule in G2 that is translationally equivalent

Page 96: MultiWord Expressions in NLP

M-Grammar

• Interlingual System
• Interlingua obtained as a side-effect of tuning compositional grammars (no independent interlingua)
• Interlingua expresses translational equivalence
– B1 and B1’ are translations of each other
– Not necessarily the meaning of B1/B1’

Page 97: MultiWord Expressions in NLP

M-Grammar

• G1:
– BEs: N-boek (M-book: book’), A-interessant (M-interesting: interesting’)
– M-rule R1, meaning name: IndefPlMod
• Syntax: <[N, A], NP[mod/A{e}, head/N{pl}]>
• Semantics: [[A]](x) & [[N]](x)

• G2:
– BEs: N-livre (M-book: book’), A-intéressant (M-interesting: interesting’)
– M-rule R2, meaning name: IndefPlMod
• Syntax: <[N, A], NP[det/D-des, head/N’[head/N{pl, m}, mod/A{pl, m}]]>
• Semantics: [[A]](x) & [[N]](x)

Page 98: MultiWord Expressions in NLP

M-Grammar

• NP[mod/A{e}-interessant head/N{pl}-boek] → interessante boeken
• Syn D-Tree: R1 [boek interessant]
• NP[det/D{pl}-des head/N’[head/N{pl,m}-livre, mod/A{pl,m}-intéressant]] → des livres intéressants
• Syn D-Tree: R2 [livre intéressant]
• Sem D-Tree: IndefPlMod[M-book M-interesting]
• Semantics: book’(x) & interesting’(x)
– (ignoring plurality, indefiniteness)

Page 99: MultiWord Expressions in NLP

M-Grammar

• Full System

Analysis                      Generation
              Sem D-Trees (IL)
A-TRANSFER                    G-TRANSFER
Syn D-Trees                   Syn D-Trees
M-PARSER                      M-GENERATOR
S-Trees                       S-Trees
S-PARSER                      LEAVES
[Lexical S-Trees]             [Lexical S-Trees]
A-MORPH                       G-MORPH
String                        String

Page 100: MultiWord Expressions in NLP

M-Grammar:MWEs

• Method for dealing with idioms
• Each idiom
– Is a basic expression
– Is associated with a complex syntactic structure → complex basic expressions
• Syntactic structure of an idiom:
– Canonical (abstracting from syntactic operations: passive, topicalization, verb movements, wh-questioning, ...)
– Represented as a D-tree
• Much more stable part of the grammar
• Much simpler than S-trees (no features etc.)
• Makes it easier to deal with bound variation
• Better guarantee of correctness of structures
– In the lexicon: an identifier referring to a D-tree (idiom pattern)

Page 101: MultiWord Expressions in NLP

M-Grammar:MWEs

• Start rules:
– Combine a BE with its arguments (variables)
– If the BE is complex → structure created by applying the rules in the associated D-Tree
• Resulting structure runs through all the normal rules of grammar
• In analysis: incoming structure is analyzed using the rules in the D-tree

Page 102: MultiWord Expressions in NLP

M-Grammar:MWEs

• In analysis:
– Guide the analysis by the idiom’s D-tree
– If successful, and the arguments are also correctly analyzed: extend the sentence’s D-tree with the start rule dominating a BE (for the idiom) and the D-trees for the arguments

Page 103: MultiWord Expressions in NLP

M-Grammar:MWEs

• De pijp uit gaan• D-tree for vpid30 (simplified):

Rsubst,i[RVP [$aV_00_ga,

VAR_j RPPpost [$s_prep1286700, VAR_i ]

], RNPdef [$aV_00_pijp]]

Page 104: MultiWord Expressions in NLP

M-Grammar:MWEs

• Start Rule 1: BE(verb) + VAR
• S-Tree created in case of an idiom by applying the D-Tree for the idiom
• S[subj/VAR_j, head/VP[compl/PP [obj/NP[det/D-de head/N-pijp ] head/P-uit ] head/V-ga ] ]

Page 105: MultiWord Expressions in NLP

M-Grammar:MWEs

• Normal participation in grammatical processes: structures for idioms are normal

• Restrictions:
– Rules that affect meaningful elements cannot apply to meaningless parts of idioms
• relativization, topicalization, wh-questioning, NP-Adv order switch, exclamative formation, adjectival modification, independent occurrence, ...
– (Parts of) rules that are purely syntactic are not restricted
• V2, VR, V1 (if question formation and V1 are separated)

Page 106: MultiWord Expressions in NLP

M-Grammar:MWEs

• Deviant syntax– Portmanteau words (ten, ter)– E-form of nouns– Inalienable possession construction

• Minor Rules: rules that can only occur in an idiom D-Tree

Page 107: MultiWord Expressions in NLP

M-Grammar:MWEs

• Idioms have normal syntactic structures, generated by applying the normal M-rules following the associated idiom D-Tree
• Other M-Rules apply to these structures in the normal way
• Restrictions on applications are accounted for because M-Rules applying to meaningful elements cannot apply to meaningless parts of idioms
• Deviant syntactic structure can be accounted for by Minor Rules
• The idiom D-trees can be derived using the system itself (putting minor rules on in the grammar)
• Recognition of idioms: by guiding the analysis along the idiom D-tree
• Semantics: each idiom is a (complex) basic expression with an associated basic meaning

Page 108: MultiWord Expressions in NLP

M-Grammars and TAG

• Neither has an adequate treatment of semi-idioms (yet)

• Neither has an adequate treatment of transparent idioms (yet)

• Idiom representation as D-trees can also be done in TAG (Abeillé)

Page 109: MultiWord Expressions in NLP

Overview

• NLP

• MWEs

• MWEs in NLP

• MWE Types

• Treatment of MWEs in selected frameworks

• MWEs and the lexicon

Page 110: MultiWord Expressions in NLP

Lexical representation

• How do we obtain lists of MWEs?

• How do we represent them lexically

• How can we improve exchangeability of MWE lexical representations?

Page 111: MultiWord Expressions in NLP

MWE Acquisition

• Lists from existing dictionaries– Always incomplete– Especially for specific domains– Do not necessarily reflect actual usage

• Semi-automatic acquisition from text corpora is called for
– Especially for rapidly tuning an NLP system to a specific domain, company, organization

Page 112: MultiWord Expressions in NLP

MWE Acquisition

• Use statistical properties of MWEs to acquire them (semi-)automatically

• Examples:
– Mutual Information: MI = log( Pr(xy) / (Pr(x)Pr(y)) )
– Salience: Pr(xy) * MI
– Dice, chi-square, log-likelihood, em metric, ...

Page 113: MultiWord Expressions in NLP

MWE Acquisition

• Mutual Information:
– Pr(z) estimated by F(z)/T (F: frequency in the corpus, T: corpus size)
– Pr(xy)/(Pr(x)Pr(y)) = (F(xy)*T)/(F(x)F(y))
– Experiment: (just for adjacent 2-word MWEs)

• Salience
– Compensates for MI favoring low-frequency items
– Experiment: (just for adjacent 2-word MWEs)
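The two scores can be computed directly from corpus counts for adjacent two-word candidates. This is a minimal sketch under stated assumptions: the logarithm base (2) is not given on the slides, T is taken to be the number of tokens for both unigram and bigram estimates, and the function name and toy corpus are invented for illustration.

```python
import math
from collections import Counter


def bigram_scores(tokens):
    """Return {(x, y): (mutual_information, salience)} for adjacent word pairs.

    Pr(z) is estimated as F(z)/T, so
    MI = log2( F(xy) * T / (F(x) * F(y)) ) and salience = Pr(xy) * MI.
    """
    total = len(tokens)                        # T
    unigrams = Counter(tokens)                 # F(x)
    bigrams = Counter(zip(tokens, tokens[1:])) # F(xy)
    scores = {}
    for (x, y), f_xy in bigrams.items():
        mi = math.log2(f_xy * total / (unigrams[x] * unigrams[y]))
        scores[(x, y)] = (mi, (f_xy / total) * mi)
    return scores


corpus = "ad hoc a b ad hoc c d".split()
scores = bigram_scores(corpus)

if __name__ == "__main__":
    for pair, (mi, salience) in sorted(scores.items(), key=lambda kv: -kv[1][1]):
        print(pair, round(mi, 3), round(salience, 3))
```

In the toy corpus every pair that occurs once, such as (a, b), gets a higher MI than the recurring pair (ad, hoc), while salience ranks (ad, hoc) first; this is the low-frequency bias that salience is meant to compensate for.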

Page 114: MultiWord Expressions in NLP

MWE Acquisition

• Extend to 3-word MWEs, etc.

• Combine with NLP system/components
– PoS-tag to obtain syntactically meaningful combinations (yes: A N, no: Adv N)
– Parsing and compute statistics on D-trees

• Allows acquisition of discontinuous MWEs

• Very lively and active research area
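The PoS-based filtering step can be sketched as a pattern check over tagged bigrams before any statistics are computed. The tag names, the allowed-pattern set and the tagged input are assumptions for illustration; in practice the tags would come from whatever PoS tagger the system uses.

```python
# Hypothetical set of tag sequences considered syntactically meaningful
# (the slide gives A N as acceptable and Adv N as not).
ALLOWED_PATTERNS = {("A", "N")}


def candidate_bigrams(tagged, allowed=ALLOWED_PATTERNS):
    """Keep adjacent word pairs whose tag sequence is in `allowed`.

    `tagged` is a list of (word, tag) pairs.
    """
    return [
        (w1, w2)
        for (w1, t1), (w2, t2) in zip(tagged, tagged[1:])
        if (t1, t2) in allowed
    ]


tagged_sentence = [("zware", "A"), ("shag", "N"), ("erg", "Adv"), ("mensen", "N")]

if __name__ == "__main__":
    print(candidate_bigrams(tagged_sentence))
```

Only the surviving pairs would then be scored with an association measure, which keeps syntactically implausible combinations out of the candidate list.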

Page 115: MultiWord Expressions in NLP

Lexical Representation of Idioms

• Many grammatical treatments of idioms require
– Whatever is needed for single word lexical items
– Syntactic structure
– Unique references to lexical items

• Highly framework / theory / implementation-specific

Page 116: MultiWord Expressions in NLP

Lexical Representation

• M-Grammar system-specific matters:
  – D-trees, compatible with the specific implementation
  – Unique references to items in the Rosetta lexicon
  – Order of the items (`gaan uit pijp’)
  – Presence of items (articles absent)

Page 117: MultiWord Expressions in NLP

Lexical Representation

• LTAG system-specific matters:
  – Syntactic trees, compatible with the specific implementation
  – Unique references to items in the system’s lexicon
  – Order of the items (= canonical surface order)
  – Presence of items (all present)

Page 118: MultiWord Expressions in NLP

SEQCI

• Lexical Representation
  – Maximally theory-neutral

• Incorporation Method
  – Generic
  – Maximally reuses existing NLP system

• Core Idea:
  – Describes which idioms have the same structure
  – Structural Equivalence Classes for Idioms

Page 119: MultiWord Expressions in NLP

SEQCI

• Lexical Representation =
  – Idiom Descriptions
  – Idiom Pattern Descriptions

• Idiom description =
  – Idiom pattern (identifier)
    • Identifier for idiom structures
    • Used to define the equivalence classes
  – Idiom Component List (ICL), with base forms
    • All
    • Any order, but the same within one equivalence class
  – Example sentence containing the idiom
    • Same syntactic structure for each idiom of the same equivalence class

Page 120: MultiWord Expressions in NLP

SEQCI

• Idiom pattern description
  – Idiom pattern identifier
  – Comments (free text)

Page 121: MultiWord Expressions in NLP

SEQCI

• Example:
  – Idiom Descriptions
    • Idp30;De pijp uit gaan;Hij is de pijp uit gegaan
    • Idp30;De boot in gaan;Hij is de boot in gegaan
    • Idp30;Het schip in gaan;Hij is het schip in gegaan
  – Idiom pattern definition
    • Idp30
    • Idiom headed by a verb taking a postpositional PP containing a definite singular NP and one free argument as subject
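The idiom descriptions in this example are semicolon-separated records (pattern identifier; ICL; example sentence). A minimal sketch of reading them and grouping them into equivalence classes could look like this; the names `IdiomDescription`, `parse_description`, and `group_by_pattern`, and the lowercasing of the ICL, are assumptions for illustration and not part of SEQCI itself.

```python
from collections import defaultdict
from typing import NamedTuple


class IdiomDescription(NamedTuple):
    pattern: str        # idiom pattern identifier, e.g. "Idp30"
    components: tuple   # Idiom Component List (ICL), base forms
    example: str        # example sentence containing the idiom


def parse_description(line):
    """Split one 'pattern;ICL;example' record into its three fields."""
    pattern, icl, example = (field.strip() for field in line.split(";"))
    # Lowercasing the ICL is an assumption, so "De" and "de" compare equal.
    return IdiomDescription(pattern, tuple(icl.lower().split()), example)


def group_by_pattern(lines):
    """Collect idiom descriptions into classes keyed by pattern identifier."""
    classes = defaultdict(list)
    for line in lines:
        description = parse_description(line)
        classes[description.pattern].append(description)
    return dict(classes)


LINES = [
    "Idp30;De pijp uit gaan;Hij is de pijp uit gegaan",
    "Idp30;De boot in gaan;Hij is de boot in gegaan",
    "Idp30;Het schip in gaan;Hij is het schip in gegaan",
]
classes = group_by_pattern(LINES)
```

All three idioms end up in the single class `Idp30`, so the manual work for that pattern would be done once and reused for each member.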

Page 122: MultiWord Expressions in NLP

SEQCI

• Incorporation Method
  – Manual part, once for each idiom pattern
  – Automatic part, for each idiom description

Page 123: MultiWord Expressions in NLP

SEQCI

• Manual part (`hij is de pijp uit gegaan’)
  1. Parse the example sentence of an idiom description with idiom pattern P, yielding the Reference Parse
  2. Define a transformation to turn the reference parse into the idiom structure (Idiom Parse Transformation, IPT)
  3. Determine, from the reference parse, the list of the system's unique IDs for the lexical items in the idiom structure (Idiom Component ID List, ICIL)
  4. Define a transformation to relate ICL and ICIL (Idiom Component Transformation, ICT)
  5. Apply the ICT to the ICL, yielding the transformed ICL (TICL), and check that each item in it equals the base form of the corresponding element on the ICIL

Page 124: MultiWord Expressions in NLP

SEQCI

• Automatic part, for each idiom description I (`hij is de boot in gegaan’)
  1. Parse the example sentence (Syntactic Structure)
  2. Apply the IPT and check identity with the idiom structure modulo the lexical items
  3. Select the component IDs from the parse tree, in order to obtain the ICIL
  4. Apply the ICT to the ICL of I, yielding the TICL
  5. Check that <bf(c1), …, bf(cn)> = TICL, where ICIL = <c1, …, cn> (TICL check)
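Steps 2 and 3 of the automatic part can be sketched as below. The sketch assumes that trees are nested `(label, children)` tuples, that lexical items are the leaves whose name starts with `$`, and that the IPT has already been applied to both trees; these conventions are illustrative assumptions and not the actual M-Grammar implementation.

```python
def is_lexical(leaf):
    """Assumption: lexical item IDs are the leaves that start with '$'."""
    return leaf.startswith("$")


def skeleton(tree):
    """Copy the tree with every lexical leaf replaced by a placeholder."""
    if isinstance(tree, str):
        return "LEX" if is_lexical(tree) else tree
    label, children = tree
    return (label, tuple(skeleton(child) for child in children))


def component_ids(tree):
    """Collect lexical item IDs left to right, giving the ICIL."""
    if isinstance(tree, str):
        return [tree] if is_lexical(tree) else []
    _, children = tree
    ids = []
    for child in children:
        ids.extend(component_ids(child))
    return ids


# Idiom structure of the reference idiom (de pijp uit gaan).
PIJP = ("Rsubst,i", [
    ("RVP", ["$aV_00_ga", ("RPPpost", ["$s_prep1286700", "VAR_i"])]),
    ("RNPdef", ["$aN_00_pijp"]),
])
# Structure obtained for a new idiom of the same pattern (de boot in gaan).
BOOT = ("Rsubst,i", [
    ("RVP", ["$aV_00_ga", ("RPPpost", ["$s_prep1286800", "VAR_i"])]),
    ("RNPdef", ["$aN_00_boot"]),
])

same_structure = skeleton(PIJP) == skeleton(BOOT)   # step 2
icil = component_ids(BOOT)                          # step 3
```

Comparing skeletons rather than full trees is one way to read "identity modulo the lexical items": the two structures may differ only in which lexical items fill the leaves.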

Page 125: MultiWord Expressions in NLP

SEQCI

• Advantages
  – Technically simple
  – As theory/grammar/implementation-independent as possible
  – No need for prescribing syntactic structures
  – System-specific aspects are derived from the NLP system itself

Page 126: MultiWord Expressions in NLP

SEQCI

• Will it work?
  – If there are not too many different idiom patterns, and
  – If there is a sufficient number of instantiations per idiom pattern

Page 127: MultiWord Expressions in NLP

SEQCI

• Various improvements possible
  – Parameterization
    • Over local morphosyntactic differences (sg/pl; pos/dim; pos/comp/sup; ...)
  – Abstraction
    • Especially for large fixed parts
      – Weten waar Abraham de mosterd haalt
  – Use of underspecified syntactic structures
    • Optional, of any kind
  – Guidelines for selection of example sentences

Page 128: MultiWord Expressions in NLP

SEQCI

            SAID-Database           Dutch Minidatabase
Coverage    #idioms   #patterns     #idioms   #patterns
50%            7383          28         449          21
60%            8853          54         539          36
70%           10304         140         628          59
80%           11773         481         716          98
85%           12509         908         760         134
90%           13245        1644         804         178
95%           13981        2380         849         223
100%          14716        3116         893         267
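The slide does not say how these coverage figures were derived. One plausible reading, shown here purely as an assumption, is that patterns are sorted by the number of idioms they cover and accumulated until the target percentage is reached. The class sizes below are toy numbers, not data from either database.

```python
def coverage_curve(class_sizes, percentages):
    """For each target percentage, return (idioms covered, patterns needed)
    when the largest equivalence classes are taken first."""
    sizes = sorted(class_sizes, reverse=True)
    total = sum(sizes)
    curve = {}
    for pct in percentages:
        covered = 0
        needed = 0
        for size in sizes:
            # Integer comparison avoids floating-point rounding issues.
            if covered * 100 >= pct * total:
                break
            covered += size
            needed += 1
        curve[pct] = (covered, needed)
    return curve


# Toy data (assumption): four patterns covering 5, 3, 1 and 1 idioms.
curve = coverage_curve([5, 3, 1, 1], [50, 80, 100])
```

With these toy sizes, one pattern covers half the idioms, while full coverage needs all four patterns, because the last idioms sit in patterns with a single member. The same shape is visible in the long tail of the table.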

Page 129: MultiWord Expressions in NLP

Conclusions

• SEQCI:
  – Technically simple
  – Highly theory-, grammar-, and implementation-independent

• Can reduce/share lexicon development efforts significantly

• Candidate for a standard lexical representation for idioms

• Extension to other types of MWEs looks promising

• Initial experiments started

• More testing required, with more NLP systems
  – I can provide test data
  – Try it out in your own NLP system!

Page 130: MultiWord Expressions in NLP

SEQCI

• Possible Problems/Objections?
  – Manual Part
    • Pattern sentence does not yield a parse
    • Pattern sentence yields multiple parses
    • Pattern corresponds to 2 or more different structures in my system
  – Automatic Part
    • Sentence does not yield a parse
    • Sentence yields multiple parses
    • Multiple skeys result for the same base form

Page 131: MultiWord Expressions in NLP

SEQCI: Reference Parse

Rdecl [
  Rperf [
    Rsubst(j) [
      Rsent [
        Rsubst(i) [
          RVP [ $aV_00_ga, RPPpost [ $s_prep1286700, VAR_i ] ],
          RNPdef [ $aN_00_pijp ]
        ],
        VAR_j
      ],
      RNP [ $hij_PRON ]
    ]
  ]
]

Page 132: MultiWord Expressions in NLP

SEQCI: Idiom Structure

• IPT: Delete Rdecl, Rperf, Rsubst(j), RNP[$hij_PRON]

• D-tree for vpid30 (simplified):

Rsubst,i [
  RVP [ $aV_00_ga, RPPpost [ $s_prep1286700, VAR_i ] ],
  RNPdef [ $aN_00_pijp ]
]

Page 133: MultiWord Expressions in NLP

ICIL

< $aV_00_ga, $s_prep1286700, $aN_00_pijp >

Page 134: MultiWord Expressions in NLP

SEQCI: Illustration

Manual Part, applied to `de pijp uitgaan’
  1. Reference Parse: see the D-tree on the Reference Parse slide
  2. IPT: Delete Rdecl, Rperf, Rsubst(j), RNP[$hij_PRON]
  3. ICIL: < $aV_00_ga, $s_prep1286700, $aN_00_pijp >
  4. ICT: 1 2 3 4 => 4 3 2
  5. TICL = ICT(<de, pijp, uit, gaan>) = <gaan, uit, pijp> = < Bf($aV_00_ga), Bf($s_prep1286700), Bf($aN_00_pijp) >

Page 135: MultiWord Expressions in NLP

ICT

ICL: <de, pijp, uit, gaan>

Must be turned into: <gaan, uit, pijp>

ICT: 1 2 3 4 => 4 3 2
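The ICT on this slide reads as a selection and reordering of ICL positions. A minimal sketch, assuming the notation `1 2 3 4 => 4 3 2` uses 1-based positions and that positions missing on the right-hand side (here the article in position 1) are dropped; `parse_ict` and `apply_ict` are hypothetical helper names.

```python
def parse_ict(spec):
    """Turn '1 2 3 4 => 4 3 2' into the list of 1-based target positions."""
    source, target = spec.split("=>")
    size = len(source.split())
    positions = [int(p) for p in target.split()]
    for p in positions:
        if not 1 <= p <= size:
            raise ValueError(f"position {p} is outside 1..{size}")
    return positions


def apply_ict(positions, icl):
    """Select and reorder ICL items according to the ICT positions."""
    return [icl[p - 1] for p in positions]


ict = parse_ict("1 2 3 4 => 4 3 2")
ticl = apply_ict(ict, ["de", "pijp", "uit", "gaan"])
```

Because the ICT is defined once per idiom pattern, the same positions can be applied to the ICL of every other idiom in the class, for example <de, boot, in, gaan>.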

Page 136: MultiWord Expressions in NLP

TICL

TICL = ICT(ICL) = ICT(<de, pijp, uit, gaan>) = <gaan, uit, pijp> = < Bf($aV_00_ga), Bf($s_prep1286700), Bf($aN_00_pijp) >

Page 137: MultiWord Expressions in NLP

Syntactic Structure

Rdecl [
  Rperf [
    Rsubst(j) [
      Rsent [
        Rsubst(i) [
          RVP [ $aV_00_ga, RPPpost [ $s_prep1286800, VAR_i ] ],
          RNPdef [ $aN_00_boot ]
        ],
        VAR_j
      ],
      RNP [ $hij_PRON ]
    ]
  ]
]

Page 138: MultiWord Expressions in NLP

Apply IPT

Rsubst,i [
  RVP [ $aV_00_ga, RPPpost [ $s_prep1286800, VAR_i ] ],
  RNPdef [ $aN_00_boot ]
]

Page 139: MultiWord Expressions in NLP

ICIL

ICIL = < $aV_00_ga, $s_prep1286800, $aN_00_boot >

Page 140: MultiWord Expressions in NLP

TICL

ICT(ICL) =

ICT(<de, boot, in, gaan>)=

<gaan, in, boot>

Page 141: MultiWord Expressions in NLP

TICL check

<bf($aV_00_ga), bf($s_prep1286800), bf($aN_00_boot) > =

TICL = <gaan, in, boot>
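The final check can be sketched as a comparison between the base forms of the ICIL items and the TICL. The dictionary `BASE_FORMS` stands in for the system lexicon's base-form lookup bf(...) and is a toy assumption; its entries follow the values shown on this slide.

```python
# Toy stand-in for the lexicon's bf() lookup (assumption).
BASE_FORMS = {
    "$aV_00_ga": "gaan",
    "$s_prep1286800": "in",
    "$aN_00_boot": "boot",
}


def ticl_check(icil, ticl, base_forms=BASE_FORMS):
    """True if <bf(c1), ..., bf(cn)> equals the TICL, item by item."""
    return [base_forms[c] for c in icil] == list(ticl)


icil = ["$aV_00_ga", "$s_prep1286800", "$aN_00_boot"]
ok = ticl_check(icil, ["gaan", "in", "boot"])
```

A failed check would signal that the parse picked lexical items other than the ones listed in the idiom description, or that the ICT does not fit this idiom.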