9
Automatic Extraction of Idioms using Graph Analysis and Asymmetric Lexicosyntactic Patterns Appeared in ACL2005 Workshop on Deep Lexical Acquisition Ann Arbor, Michigan, June 30th, 2005 Dominic Widdows MAYA Design, Inc. Pittsburgh, Pennsylvania [email protected] Beate Dorow Institute for Natural Language Processing University of Stuttgart [email protected] Abstract This paper describes a technique for ex- tracting idioms from text. The tech- nique works by finding patterns such as “thrills and spills”, whose reversals (such as “spills and thrills”) are never encoun- tered. This method collects not only idioms, but also many phrases that exhibit a strong tendency to occur in one particular order, due apparently to underlying semantic is- sues. These include hierarchical relation- ships, gender differences, temporal order- ing, and prototype-variant effects. 1 Introduction Natural language is full of idiomatic and metaphor- ical uses. However, language resources such as dic- tionaries and lexical knowledge bases give at best poor coverage of such phenomena. In many cases, knowledge bases will mistakenly ‘recognize’ a word and this can lead to more harm than good: for exam- ple, a typical mistake of blunt logic would be to as- sume that “somebody let the cat out of the bag” im- plied that “somebody let some mammal out of some container.” Idiomatic generation of natural language is, if anything, an even greater challenge than idiomatic language understanding. As pointed out decades ago by Fillmore (1967), a complete knowledge of En- glish requires not only an understanding of the se- mantics of the word good, but also an awareness that this special adjective (alone) can occur with the word any to construct phrases like “Is this paper any good at all?”, and traditional lexical resources were not designed to provide this information. There are many more general examples occur: for exam- ple, “the big bad wolf” sounds right and the “the bad big wolf” sounds wrong, even though both versions are syntactically and semantically plausible. Such examples are perhaps ‘idiomatic’, though we would perhaps not call them ‘idioms’, since they are com- positional and can sometimes be predicted by gen- eral pattern of word-ordering. In general, the goal of manually creating a com- plete lexicon of idioms and idiomatic usage patterns in any language is unattainable, and automatic ex- traction and modelling techniques have been devel- oped to fill this ever-evolving need. Firstly, auto- matically identifying potential idioms and bringing them to the attention of a lexicographer can be used to improve coverage and reduce the time a lexicog- rapher must spend in searching for such examples. Secondly and more ambitiously, the goal of such work is to enable computers to recognize idioms in- dependently so that the inevitable lack of coverage in language resources does not impede their ability to respond intelligently to natural language input. In attempting a first-pass at this task, the exper- iments described in this paper proceed as follows. We focus on a particular class of idioms that can be extracted using lexicosyntactic patterns (Hearst, 1992), which are fixed patterns in text that suggest that the words occurring in them have some inter- esting relationship. The patterns we focus on are occurrences of the form “A and/or B”, where A and

Automatic Extraction of Idioms using Graph Analysis and ...maya.com/content/publications/0003-automatic... · significantly improve idiomatic language generation (Carl and Way, 2003)

  • Upload
    others

  • View
    6

  • Download
    0

Embed Size (px)

Citation preview

Automatic Extraction of Idioms using Graph Analysis and AsymmetricLexicosyntactic Patterns

Appeared in ACL2005 Workshop on Deep Lexical AcquisitionAnn Arbor, Michigan, June 30th, 2005

Dominic WiddowsMAYA Design, Inc.

Pittsburgh, [email protected]

Beate DorowInstitute for Natural Language Processing

University of [email protected]

Abstract

This paper describes a technique for ex-tracting idioms from text. The tech-nique works by finding patterns such as“thrills and spills”, whose reversals (suchas “spills and thrills”) are never encoun-tered.

This method collects not only idioms, butalso many phrases that exhibit a strongtendency to occur in one particular order,due apparently to underlying semantic is-sues. These include hierarchical relation-ships, gender differences, temporal order-ing, and prototype-variant effects.

1 Introduction

Natural language is full of idiomatic and metaphor-ical uses. However, language resources such as dic-tionaries and lexical knowledge bases give at bestpoor coverage of such phenomena. In many cases,knowledge bases will mistakenly ‘recognize’ a wordand this can lead to more harm than good: for exam-ple, a typical mistake of blunt logic would be to as-sume that “somebody let the cat out of the bag” im-plied that “somebody let some mammal out of somecontainer.”

Idiomatic generation of natural language is, ifanything, an even greater challenge than idiomaticlanguage understanding. As pointed out decades agoby Fillmore (1967), a complete knowledge of En-glish requires not only an understanding of the se-mantics of the wordgood, but also an awareness

that this special adjective (alone) can occur with theword any to construct phrases like“Is this paperany good at all?”, and traditional lexical resourceswere not designed to provide this information. Thereare many more general examples occur: for exam-ple, “the big bad wolf” sounds right and the “the badbig wolf” sounds wrong, even though both versionsare syntactically and semantically plausible. Suchexamples are perhaps ‘idiomatic’, though we wouldperhaps not call them ‘idioms’, since they are com-positional and can sometimes be predicted by gen-eral pattern of word-ordering.

In general, the goal of manually creating a com-plete lexicon of idioms and idiomatic usage patternsin any language is unattainable, and automatic ex-traction and modelling techniques have been devel-oped to fill this ever-evolving need. Firstly, auto-matically identifying potential idioms and bringingthem to the attention of a lexicographer can be usedto improve coverage and reduce the time a lexicog-rapher must spend in searching for such examples.Secondly and more ambitiously, the goal of suchwork is to enable computers to recognize idioms in-dependently so that the inevitable lack of coveragein language resources does not impede their abilityto respond intelligently to natural language input.

In attempting a first-pass at this task, the exper-iments described in this paper proceed as follows.We focus on a particular class of idioms that canbe extracted usinglexicosyntactic patterns(Hearst,1992), which are fixed patterns in text that suggestthat the words occurring in them have some inter-esting relationship. The patterns we focus on areoccurrences of the form “A and/orB”, whereA and

B are both nouns. Examples include “football andcricket” and “hue and cry.” From this list, we extractthose examples for which there is a strong prefer-ence on theorderingof the participants. For exam-ple, we do see the pattern “cricket and football,” butrarely if ever encounter the pattern “cry and hue.”Using this technique, 4173 potential idioms were ex-tracted. This included a number of both true idioms,and words that have regular semantic relationshipsbut do appear to have interesting orderings on theserelationships (such as earlier before later, strong be-fore weak, prototype before variant).

The rest of this paper is organized as follows. Sec-tion 2 elaborates on some of the previous worksthat motivate the techniques we have used. Sec-tion 3 describes the precise method used to extractidioms through their asymmetric appearance in alarge corpus. Section 4 presents and analyses severalclasses of results. Section 5 describes the methodsattempted to filter these results into pairs of wordsthat are more and less contextually related to one an-other. These include a statistical method that analy-ses the original corpus for evidence of semantic re-latedness, and a combinatoric method that relies onlink-analysis on the resulting graph structure.

2 Previous and Related Work

This section describes previous work in extractinginformation from text, and inferring semantic or id-iomatic properties of words from the information soderived.

The main technique used in this paper to ex-tract groups of words that are semantically or id-iomatically related is a form of lexicosyntactic pat-tern recognition. Lexicosyntactic patterns were pio-neered by Marti Hearst (Hearst, 1992; Hearst andSchutze, 1993) in the early 1990’s, to enable theaddition of new information to lexical resourcessuch as WordNet (Fellbaum, 1998). The main in-sight of this sort of work is that certain regular pat-terns in word-usage can reflect underlying seman-tic relationships. For example, the phrase “France,Germany, Italy, and other European countries” sug-gests thatFrance, Germany and Italy are part ofthe class ofEuropean countries. Such hierarchi-cal examples are quite sparse, and greater coveragewas later attained by Riloff and Shepherd (1997)

and Roark and Charniak (1998) in extracting rela-tions not of hierarchy but ofsimilarity, by find-ing conjunctions or co-ordinations such as “cloves,cinammon, and nutmeg” and “cars and trucks.” Thiswork was extended by Caraballo (1999), who builtclasses of related words in this fashion and then rea-soned that if a hierarchical relationship could be ex-tracted foranymember of this class, it could be ap-plied to all members of the class. This techniquecan often mistakenly reason across an ambiguousmiddle-term, a situation that was improved uponby Cederberg and Widdows (2003), by combiningpattern-based extraction with contextual filtering us-ing latent semantic analysis.

Prior work in discovering non-compositionalphrases has been carried out by Lin (1999)and Baldwin et al. (2003), who also used LSAto distinguish between compositional and non-compositional verb-particle constructions and noun-noun compounds.

At the same time, work in analyzing idioms andasymmetry within linguistics has become more so-phisticated, as discussed by Benor and Levy (2004),and many of the semantic factors underlying our re-sults can be understood from a sophisticated theoret-ical perspective.

Other motivating and related themes of work forthis paper include collocation extraction and ex-ample based machine translation. In the work ofSmadja (1993) on extracting collocations, prefer-ence was given to constructions whose constituentsappear in a fixed order, a similar (and more generallyimplemented) version of our assumption here thatasymmetric constructions are more idiomatic thansymmetric ones. Recent advances in example-basedmachine translation (EBMT) have emphasized thefact that examining patterns of language use cansignificantly improve idiomatic language generation(Carl and Way, 2003).

3 The Symmetric Graph Model as used forLexical Acquisition and IdiomExtraction

This section of the paper describes the techniquesused to extract potentially idiomatic patterns fromtext, as deduced from previously successful experi-ments in lexical acquisition.

The main extraction technique is to use lexicosyn-tactic patterns of the form “A, B and/orC” to findnouns that are linked in some way. For example,consider the following sentence from the British Na-tional Corpus (BNC).

Ships laden with nutmeg, cinnamon,cloves or coriander once battled theSevenSeasto bring home their preciouscargo.

Since the BNC is tagged for parts-of-speech, weknow that the words highlighted in bold are nouns.Since the phrase “nutmeg, cinnamon, cloves or co-riander” fits the pattern “A, B, C or D”, we createnodes for each of these nouns and create links be-tween them all. When applied to the whole of theBNC, these links can be aggregated to form a graphwith 99,454 nodes (nouns) and 587,475 links, as de-scribed by Widdows and Dorow (2002). This graphwas originally used for lexical acquisition, sinceclusters of words in the graph often map to recog-nized semantic classes with great accuracy (> 80%,(Widdows and Dorow, 2002)).

However, for the sake of smoothing over sparsedata, these results made the assumption that the linksbetween nodes weresymmetric, rather thandirected.In other words, when the pattern “A and/orB” wasencountered, a link fromA to B anda link from Bto A was introduced. The nature of symmetric andantisymmetric relationships is examined in detail byWiddows (2004). For the purposes of this paper, itsuffices to say that the assumption of symmetry (likethe assumption of transitivity) is a powerful tool forimproving recall in lexical acquisition, but also leadsto serious lapses in precision if the directed nature oflinks is overlooked, especially if symmetrized linksare used to infer semantic similarity.

This problem was brought strikingly to our atten-tion by the examples in Figure 1. In spite of appear-ing to be a circle of related concepts, many of thenouns in this group are not similar at all, and manyof the links in this graph are derived from very verydifferent contexts. In Figure 1,cat andmouse arelinked (they are re both animals and the phrase “catand mouse” is used quite often): but thenmouseandkeyboard are also linked because they are bothobjects used in computing. Akeyboard, as wellas being a typewriter or computer keyboard, is also

fiddlefiddlefiddle

catcatcat

barrowbarrowbarrowbowbowbow

cellocellocello

flutefluteflute

mousemousemouse

dogdogdog

gamegamegame

kittenkittenkitten

violinviolinviolin

pianopianopiano

bassbassbass

fortepianofortepianofortepiano

orchestraorchestraorchestra

keyboardkeyboardkeyboard

screenscreenscreen

monitormonitormonitor

memorymemorymemory

guitarguitarguitar

ratratrat

humanhumanhuman

Figure 1: A cluster involving several idiomatic links

used to mean (part of) a musical instrument such asan organ or piano, andkeyboard is linked tovio-lin. A violin and afiddle are the same instrument (asoften happens with synonyms, they don’t appear to-gether often but have many neighbours in common).The unlikely circle is completed (it turns out) be-cause of the phrase from the nursery rhyme

Hey diddle diddle,The cat and the fiddle,The cow jumped over the moon;

It became clear from examples such as these thatidiomatic links, like ambiguous words, were a seri-ous problem when using the graph model for lexicalacquisition. However, with ambiguous words, thisobstacle has been gradually turned into an opportu-nity, since we have also developed ways to used theapparent flaws in the model to detect which wordsare ambiguous in the first place (Widdows, 2004, Ch4). It is now proposed that we can take the same op-portunity for certain idioms: that is, to use the prop-erties of the graph model to work out which linksarise from idiomatic usage rather than semantic sim-ilarity.

3.1 Idiom Extraction by RecognizingAsymmetric Patterns

The link between thecat andfiddle nodes in Fig-ure 1 arises from the phrase “the cat and the fiddle.”

Table 1: Sample of asymmetric pairs extracted fromthe BNC.

First word Second wordhighway byway

cod haddockcomposer conductor

wood charcoalelement compoundassault batterynorth southrock rollgod goddess

porgy bessmiddle class

war aftermathgod hero

metal alloysalt pepper

mustard cressstocking suspender

bits bobsstimulus response

committee subcommitteecontinent ocean

However, no corpus examples were ever found of theconverse phrase, “the fiddle and the cat.” In caseslike these, it may be concluded that placing asym-metric link between these two nodes is a mistake.Instead, adirectedlink may be more appropriate.

We therefore formed the hypothesis that if thephrase “A and/orB” occurs frequently in a corpus,but the phrase “B and/orA” is absent, then the linkbetweenA andB should be attributed to idiomaticusage rather than semantic similarity.

The next step was to rebuild, finding those rela-tionships that have a strong preference for occurringin a fixed order. Sure enough, several British Englishidioms were extracted in this way. However, severalother kinds of relationships were extracted as well,as shown in the sample in Table 1.1

After extracting these pairs, groups of them weregathered together intodirected subgraphs.2 Some ofthese directed subgraphs are reporduced in the anal-ysis in the following section.

1The sample chosen here was selected by the authors to berepresentative of some of the main types of results. The com-plete list can be found athttp://infomap.stanford.edu/graphs/idioms.html .

2These can be viewed athttp://infomap.stanford.edu/graphs/directed_graphs.html

4 Analysis of Results

The experimental results include representatives ofseveral types of asymmetric relationships, includingthe following broad categories.

‘True’ Idioms

There are many results that display genuinely id-iomatic constructions. By this, we mean phrases thathave an explicitly lexicalized nature that a nativespeaker may be expected to recognize as having aspecial reference or significance. Examples includethe following:

thrills and spillsbread and circusesPunch and JudyPorgy and Besslies and statisticscat and fiddlebow and arrowskull and crossbones

This category is quite loosely defined. It includes

1. historic quotations such as “lies, damned liesand statistics”3 and “bread and circuses.”4

2. titles of well-known works.

3. colloquialisms.

4. groups of objects that have become fixed nom-inals in their own right.

All of these types share the common property thatany NLP system that encounters such groups, in or-der to behave correctly, should recognize, generate,or translate them as phrases rather than words.

Hierarchical Relationships

Many of the asymmetric relationships followsome pattern that may be described as roughly hi-erarchical. A cluster of examples from two domainsis shown in Figure 2. In chess, a rook outranks abishop, and the phrase “rook and bishop” is encoun-tered much more often than the phrase “bishop and

3Attributed to Benjamin Disraeli, certainly popularized byMark Twain.

4A translation of “panem et circenses,” from the Romansatirist Juvenal, 1st century AD.

Figure 2: Asymmetric relationships in the chess andchurch hierarchies

Figure 3: Different beverages, showing their di-rected relationships

rook.” In the church, a cardinal outranks a bishop,a bishop outranks most of the rest of the clergy, andthe clergy (in some senses) outrank the laity.

Sometimes these relationships coincide with fig-ure / ground and agent / patient distinctions. Ex-amples of this kind, as well as “clergy and laity”,include “landlord and tenant”, “employer and em-ployee”, “teacher and pupil”, and “driver and pas-sengers”. An interesting exception is “passengersand crew”, for which we have no semantic explana-tion.

Pedigree and potency appear to be two other di-mensions that can be used to establish the directed-ness of an idiomatic construction. For example, Fig-ure 3 shows that alcoholic drinks normally appearbefore their cocktail mixers, but that wine outrankssome stronger drinks.

Figure 4: Hierarchical relationships between aristo-crats, some of which appear to be gender based

Gender Asymmetry

The relationship between corresponding conceptsof different genders also appear to be heavily biasedtowards appearing in one direction. Many of theserelationships are shown in Figure 4. This showsthat, in cases where one class outranks another, thehigher class appears first, but if the classes are iden-tical, then the male version tends to appear beforethe female. This pattern is repeated in many pairsof words such as “host and hostess”, “god and god-dess”, etc. One exception appears to be in parent-ing relationships, where female precedes male, as in“mother and father”, “mum and dad”, “grandma andgrandpa”.

Temporal Ordering

If one word refers to an event that precedes an-other temporally or logically, it almost always ap-pears first. The examples in Table 2 were extractedby our experiment. It has been pointed out that forcyclical events, it is perfectly possible that the orderof these pairs may be reversed (e.g., “late night andearly morning”), though the data we extracted fromthe BNC showed strong tendencies in the directionsgiven.

A directed subgraph showing many events in hu-man lives in shown in Figure 5.

Prototype precedes Variant

In cases where one participant is regarded as a‘pure’ substance and the other is a variant or mix-ture, the pure substance tends to come first. Theseoccur particularly in scientific writing, examplesincluding “element and compound”, “atoms and

Table 2: Pairs of events that have a strong tendencyto occur in asymmetric patterns.

Before Afterspring autumn

morning afternoonmorning eveningevening nightmorning night

beginning endquestion answershampoo conditionermarriage divorcearrival departureeggs larvae

molecules”, “metals and alloys”. Also, we see “ap-ples and pears”, “apple and plums”, and “apples andoranges”, suggesting that an apple is a prototypicalfruit (in agreement with some of the results of pro-totype theory; see Rosch (1975)).

Another possible version of this tendency is thatcore precedes periphery, which may also account forasymmetric ordering of food items such as “fish andchips”, “bangers and mash”, “tea and coffee” (in theBritish National Corpus, at least!) In some casessuch as “meat and vegetables”, a hierarchical or fig-ure / ground distinction may also be argued.

Mistaken extractions

Our preliminary inspection has shown that the ex-traction technique finds comparatively few genuinemistakes, and the reader is encouraged to follow thelinks provided to check this claim. However, thereare some genuine errors, most of which could beavoided with more sophisticated preprocessing.

To improve recall in our initial lexical acquisitionexperiments, we chose to strip off modifiers and tostem plural forms to singular forms, so that “applesand green pears” would give a link betweenappleandpear.

However, in many cases this is a mistake, be-cause the bracketing should not be of the form “Aand (B C),” but of the form “(A andB) C.” Us-ing part-of-speech tags alone, we cannot recoverthis information. One example is the phrase “hard-ware and software vendors,” from which we ob-tain a link betweenhardware and vendors, in-stead of a link betweenhardware and software.A fuller degree of syntactic analysis would improvethis situation. For extracting semantic relationships,

Figure 5: Directed graph showing that life-eventsare usually ordered temporally when they occur to-gether

Cederberg and Widdows (2003) demonstrated thatnounphrase chunking does this work very satisfacto-rily, while being much more tractable than full pars-ing.

The mistaken pairmiddle and class shown inTable 1 is another of these mistakes, arising fromphrases such as “middle and upper class” and “mid-dle and working class.” These examples could beavoided simply by more accurate part-of-speech tag-ging (since the word “middle” should have beentagged as an adjective in these examples).

This concludes our preliminary analysis of re-sults.

5 Filtering using Latent Semantic Analysisand Combinatoric Analysis

From the results in the previous section, the follow-ing points are clear.

1. It is possible to extract many accurate exam-ples of asymmetric constructions, that would benecessary knowledge for generation of natural-sounding language.

2. Some of the pairs extracted are examples ofgeneral semantic patterns, others are examplesof genuinely idiomatic phrases.

Even for semantically predictable phrases, thefact that the words occur in fixed patterns can bevery useful for the purposes of disambiguation, asdemonstrated by (Yarowsky, 1995). However, it

would be useful to be able to tell which of the asym-metric patterns extracted by our experiments corre-spond to semantically regular phrases which hap-pen to have a conventional ordering preference, andwhich phrases correspond to genuine idioms. Thisfinal section demonstrates two techniques for per-forming this filtering task, which show promising re-sults for improving our classification, though shouldnot yet be considered as reliable.

5.1 Filtering using Latent Semantic Analysis

Latent semantic analysis or LSA (Landauer and Du-mais, 1997) is by now a tried and tested techniquefor determining semantic similarity between wordsby analyzing large corpus (Widdows, 2004, Ch 6).Because of this, LSA can be used to determinewhether a pair of words is likely to participate in aregular semantic relationship, even though LSA maynot contribute specific information regarding thena-ture of the relationship. However, once a relation-ship is expected, LSA can be used to predict whetherthis relationship is used in contexts that are typicaluses of the words in question, or whether these usesappear to be anomalies such as rare senses or idioms.This technique was used successfully by (Cederbergand Widdows, 2003) to improve the accuracy of hy-ponymy extraction. It follows that it should be use-ful to tell the difference between regularly relatedwords and idiomatically related words.

To test this hypothesis, we used an LSA modelbuilt from the BNC using the Infomap NLP soft-ware.5 This was used to measure the LSA similar-ity between the words in each of the pairs extractedby the techniques in Section 4. In cases where aword was too infrequent to appear in the LSA model,we used ‘folding in,’ which assigns a word-vector‘on the fly’ by adding together the vectors of anysurrounding words of a target word that are in themodel.

The results are shown in Table 3. The hypothesisis that words whose occurrence is purely idiomaticwould have a low LSA similarity score, becausethey are otherwise not closely related. However, thishypothesis does not seem to have been confirmed,partly due to the effects of overall frequency. Forexample, the wordPorgy only occurs in the phrase

5Freely available from http://infomap-nlp.sourceforge.net/

Table 3: Ordering of results from semantically sim-ilar to semantically dissimilar using LSA

Word pair LSA similaritynorth south 0.931middle class 0.834porgy bess 0.766war aftermath 0.676salt pepper 0.672bits bobs 0.671mustard cress 0.603composer conductor 0.588cod haddock 0.565metal alloy 0.509highway byway 0.480committee subcommittee 0.479god goddess 0.456rock roll 0.398continent ocean 0.300wood charcoal 0.273stimulus response 0.261stocking suspender 0.177god hero 0.115element compound 0.044assault battery -0.068

granitegranite

cheesecheese

breadbread

chalkchalk

limestonlimestonee flint flint

marblmarblee coa coall sandsand

sandstone sandstone butte butterr meameatt winewine

sugasugarr margarin margarinee milkmilk clay clay

Figure 6: Nodes in the original symmetric graph inthe vicinity ofchalk andcheese

“Porgy and Bess,” and the wordbobs almost alwaysoccurs in the phrase “bits and bobs.” A more effec-tive filtering technique would need to normalize toaccount for these effects. However, there are somegood results: for example, the low score betweenassault andbattery reflects the fact that this usage,though compositional, is a rare meaning of the wordbattery, and the same argument can be made forel-ement andcompound. Thus LSA might be a betterguide for recognizing rarity in meaning of individualwords than it is for idiomaticity of phrases.

5.2 Link analysis

Another technique for determining whether a link isidiomatic or not is to check whether it connects two

areas of meaning that are otherwise unconnected. Ahallmark example of this phenomenon is the “chalkand cheese” example shown in Figure 6.6 Note thatnone of the other members of the rock-types clus-ters is linked to any of the other foodstuffs. We maybe tempted to conclude that the single link betweenthese clusters is an idiomatic phenomenon. Thistechnique shows promise, but has yet to be exploredin detail.

6 Conclusions and Further Work

It is possible to extract asymmetric constructionsfrom text, some of which correspond to idiomswhich are indecomposable (in the sense that theirmeaning cannot be decomposed into a combinationof the meanings of their constituent words).

Many other phrases were extracted which exhibita typical directionality that follows from underlyingsemantic principles. While these are sometimes notdefined as ‘idioms’ (because they are still compos-able), knowledge of their asymmetric behaviour isnecessary for a system to generate natural languageutterances that would sound ‘idiomatic’ to nativespeakers.

While all of this information is useful for cor-rectly interpreting and generating natural language,further work is necessary to distinguish accuratelybetween these different categories. The first step inthis process will be to manually classify the results,and evaluate the performance of different classifica-tion techniques to see if they can reliably identifydifferent types of idiom, and also distinguish thesecases from false positives that were mistakenly ex-tracted. Once some of these techniques have beenevaluated, we will be in a better position to broadenour techniques by turning to larger corpora such asthe Web.

References

Timothy Baldwin, Colin Bannard, Takaaki Tanaka, andDominic Widdows. 2003. An empirical model ofmultiword expression decomposability. InProceed-ings of the ACL-2003 Workshop on Multiword Expres-

6“Chalk and cheese” is a widespread idiom in British En-glish, used to contrast two very different objects, e.g. “Theyare as different as chalk and cheese.” A roughly corresponding(though more predictable) phrase in American English might be“They are as different as night and day.”

sions: Analysis, Acquisition and Treatment, Sapporo,Japan.

Sarah Bunin Benor and Roger Levy. 2004. The chickenor the egg? a probabilistic analysis of english bi-nomials. http://www.stanford.edu/˜rog/papers/binomials.pdf .

Sharon Caraballo. 1999. Automatic construction of ahypernym-labeled noun hierarchy from text. In37thAnnual Meeting of the Association for ComputationalLinguistics: Proceedings of the Conference, pages120–126.

M Carl and A Way, editors. 2003.Recent Advances inExample-Based Machine Translation. Kluwer.

Scott Cederberg and Dominic Widdows. 2003. UsingLSA and noun coordination information to improvethe precision and recall of automatic hyponymy ex-traction. InConference on Natural Language Learn-ing (CoNNL), Edmonton, Canada.

Christiane Fellbaum, editor. 1998.WordNet: An Elec-tronic Lexical Database. MIT Press, Cambridge MA.

Charles J. Fillmore. 1967. The grammar of hitting andbreaking. In R. Jacobs, editor,In Readings in English:Transformational Grammar, pages 120–133.

Marti Hearst and Hinrich Schutze. 1993. Customizinga lexicon to better suit a computational task. InACLSIGLEX Workshop, Columbus, Ohio.

Marti A. Hearst. 1992. Automatic acquisition of hy-ponyms from large text corpora. InCOLING, Nantes,France.

Thomas Landauer and Susan Dumais. 1997. A solutionto plato’s problem: The latent semantic analysis the-ory of acquisition.Psychological Review, 104(2):211–240.

Dekang Lin. 1999. Automatic identification of non-compositional phrases. InACL:1999, pages 317–324.

Ellen Riloff and Jessica Shepherd. 1997. A corpus-basedapproach for building semantic lexicons. In ClaireCardie and Ralph Weischedel, editors,Proceedings ofthe Second Conference on Empirical Methods in Natu-ral Language Processing, pages 117–124. Associationfor Computational Linguistics, Somerset, New Jersey.

Brian Roark and Eugene Charniak. 1998. Noun-phraseco-occurence statistics for semi-automatic semanticlexicon construction. InCOLING-ACL, pages 1110–1116.

Eleanor Rosch. 1975. Cognitive representations of se-mantic categories.Journal of Experimental Psychol-ogy: General, 104:192–233.

Frank Smadja. 1993. Retrieving collocations from text:Xtract. Computational Linguistics, 19(1):143–177.

Dominic Widdows and Beate Dorow. 2002. A graphmodel for unsupervised lexical acquisition. In19th In-ternational Conference on Computational Linguistics,pages 1093–1099, Taipei, Taiwan, August.

Dominic Widdows. 2004. Geometry and Meaning.CSLI publications, Stanford, California.

David Yarowsky. 1995. Unsupervised word sense dis-ambiguation rivaling supervised methods. InProceed-ings of the 33rd Annual Meeting of the Association forComputational Linguistics, pages 189–196.