Transforming the Representation of Lexical Knowledge. Christopher Manning, University of Sydney. http://www.sultry.arts.usyd.edu.au/cmanning/


1

Transforming the Representation of Lexical Knowledge

Christopher Manning

University of Sydney

http://www.sultry.arts.usyd.edu.au/cmanning/

2

Project Objectives

Aims of the project:

• examining the richness of lexical structure, in particular the connotational and figurative use of words

• providing innovative ways of representing a dictionary, through creative use of the medium of computers

• augmenting dictionaries from corpora

• providing practical, educationally useful programs as a result (at low labor cost)

Main initial target: an interactive front end for exploring or using the Warlpiri dictionary.

3

Acknowledgements

• Ken Hale, Mary Laughren, Jane Simpson, Robert Hoogenraad, David Nash, Kay Ross

• Kevin Jansz, Nitin Indurkhya, Katrina Avila

• Susan Poetsch, Miriam Corris, John Henderson

• (and many others)

4

Talk Outline

• The research agendas

• Dictionary usability and usefulness

• Kirrkirr: A Warlpiri dictionary browser

• Underlying data

• User interface and visualization

• Corpus enrichment for terminology sets

• User study

5

Research Program: Lexicon

• A lexicon is not just words but a vast network of associations between words and within and across the concepts represented by words

• The aim of this work is to provide people with a better understanding of this conceptual map.

• E.g., patterns of figurative extension: in a song about a stockman driving a car ‘glass’ is used first for the windscreen, and then metaphorically for ‘sexual attraction’, using a systematic pattern of figuration between shining and sex

6

Lexicon (cont)

• Traditional paper dictionaries offer very limited ways for making such networks visible

• On a computer, one can imagine all sorts of ways of bringing out such relationships

7

Research: Computational Lexicography

• Dictionaries on computers are now commonplace

• But there has been little attempt to utilize the potential of the new medium

• Goal: fun dictionary tools that are effective for language learning, browsing, and research

• Special interest: dictionaries for minority languages. Here economic, motivational, and user support reasons all point to an important role for computers.

8

MRD Structure

• The internal structures of current Machine Readable Dictionaries usually merely mimic the structure of the printed form (Boguraev 1990)

• Some work, notably WordNet (Miller 1995) has involved a fundamental rethinking of dictionary content and organization (here, organization via “synsets” which are related via links of part, subkind, opposite)

• But this research hasn’t been taken to users.

9

Research Program: Education

• Dictionary structure and usability are often dictated by professional linguists, while the needs of others (speakers, semi-speakers, young users, second language learners) are not met

• Weiner (1994): The initial purpose of the OED:
  – “to create a record of vocabulary so that English literature could be understood by all. But English scholarship grew up and lexicography grew with it … inevitably parting company with the man in the street.”

• Challenge is to avoid this.

10

Dictionary usefulness and usability

Kegl (1995) “Machine-Readable Dictionaries and Education”

• “Originally, this paper was intended as a survey of educational applications using MRDs. As far as I have been able to determine, no such applications currently exist”

• Standard dictionaries are reference works, ill-suited for use as learning tools

• Studies of American ‘dictionary skills training’ show that many tasks achieve little in the way of education (but do teach word lookup!)

11

Educational value of dictionaries

However, derived lexical information is useful!

Think of a high school foreign language textbook:

• terminology sets

• pictures with parts named

• vocabulary lists

• word explications

Major issue:

• Not many people sit around reading dictionaries – need something fun

12

Data on usability: evaluating a paper dictionary

• Study of paper dictionary usability by Susan Poetsch, tested using Alawa dictionary (draft by Margaret Sharpe)

• In community, old people are very concerned to keep language strong, and help as volunteers in bilingual education. They are keen on dictionary

• However, they lack the literacy skills to use it

• Susan worked with people aged 25–50

• Since volunteers, probably better than average literacy skills for the community

13

Findings

• Not very literate: A big dictionary is overwhelming to someone with emerging literacy skills

• People knew words are ordered but could not use ordering effectively (restart or flick randomly)

• Often around 3 minutes a word lookup

• People lost place in page regularly

• An overcrowding of information is confusing

• One word correspondences are easiest for users, but often unrealistic linguistically

• Subentries were confusing; part of speech puzzling

14

Findings (2)

• Regular dictionary users (especially, compilers!) grossly underestimate the time they have spent becoming familiar with dictionary structure

• If a dictionary is going to be made for a speech community, then the people in that community need to feel confident in using it.

• Teachers felt that the draft dictionary is too long and detailed for school use

• Conclusion: These people need a different dictionary (My First Alawa)

• Would probably be used by adults as well as kids

15

Initial focus

Kirrkirr: a Warlpiri browser

• Warlpiri is an Australian Aboriginal language spoken in the Tanami desert (NW of Alice)

• A computer interface for browsing the Warlpiri dictionary.

• Rich lexical materials have been collected by linguists over decades (Hale’s fieldwork from 1959 on, MIT Lexicon Project in the 1980s)

• The results still haven’t been produced in a format usable by the community (only printouts)

• Previous computer projects have faltered

16

Past Problems

• “At least 15 years have passed during which the Warlpiri dictionary could have been tested, people trained in dictionary use, and the dictionary improved with user input, but all that has been produced is one badly formatted raw paper printout”

• Huge amounts of human labor have been expended

• Information systems 101: need to deliver, and provide the kind of process automation to make production and revisions easy

17

Our educational goals

• Aim at school kids

• “Information seeking is a complex process which is often not attended to in K-12 education” (Wallace et al. 1998)

• Provide learner supports for getting started with dictionaries

• Adaptable interface: can cater to different needs

• Support for active reading by allowing note taking

• An interface where you can see words, but are not required to know words

18

Target user community

19

Kirrkirr: A Warlpiri dictionary browser

(Jansz 1998; Jansz, Manning and Indurkhya 1999)

• An environment for the interactive exploration of dictionaries.

• The design is general, but our current work has just been with Warlpiri (Arrernte coming soon!)

• Attempts to more fully utilize graphical interfaces, hypertext, multimedia, and different ways of indexing and accessing information

• Written in Java, it can either be run over the web [high bandwidth] or run locally (here Java’s main advantage is cross-platform support).

20

Specific goals

• An interactive environment that encouraged exploration: easy and fun to use

• Reduction of the dependence on alphabetical order

• Catering to the needs of different user groups (kids, teachers, professionals)

• Flexible enough to display appropriate information in appropriate ways depending on user level

21

Overview

Kirrkirr provides various modules:

• Graph layout of word relationships

• Formatted dictionary entries

• Semantic domain browsing

• A notes facility for ‘jotting in the margin’

• Multimedia: audio, pictures

• Advanced searching interfaces

• others in planning: colors, figuration patterns

These attempt to cater to users with different competence levels

22

23

The lexical database

• Existing materials are stored in an ad hoc format of markup using backslash codes with some (rather odd) nesting of structural tags

• These were converted to XML using an error-correcting stack-based parser (written in Perl).
  – The inconsistency and flexibility of dictionary entries actually made this a surprisingly difficult task.
  – But the parser tries to impose data integrity

• Use of XML gives a clear structure to the data, and makes available many (free) tools
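A minimal sketch of what a stack-based, error-correcting conversion of this kind might look like, written in Python rather than the Perl of the original. The backslash codes (\entry, \hw, \sense, …) and the two repair rules are hypothetical: the slides do not show the actual Warlpiri markup or how the real parser recovers from errors.

```python
import xml.etree.ElementTree as ET

# Hypothetical codes for illustration; the real Warlpiri markup is not shown in the slides.
CONTAINERS = {"entry", "sense"}        # opened by \entry, closed by \eentry, etc.
FIELDS = {"hw", "pos", "gl", "eg"}     # one-line fields such as: \hw karli

def convert(lines):
    """Convert backslash-coded lines to an XML tree, repairing bad nesting."""
    root = ET.Element("dictionary")
    stack = [root]                     # currently open elements, innermost last
    for raw in lines:
        line = raw.strip()
        if not line.startswith("\\"):
            continue                   # ignore blank or unmarked lines
        code, _, text = line[1:].partition(" ")
        if code in CONTAINERS:
            if code == "entry":
                del stack[1:]          # repair: a new entry closes anything left open
            stack.append(ET.SubElement(stack[-1], code))
        elif code[:1] == "e" and code[1:] in CONTAINERS:
            name = code[1:]
            if any(el.tag == name for el in stack[1:]):
                while stack.pop().tag != name:
                    pass               # repair: close inner elements left open
            # a closer with no matching opener is ignored
        elif code in FIELDS:
            ET.SubElement(stack[-1], code).text = text.strip()
        # unknown codes are dropped here; a real converter would log them
    return root

# The first entry is never closed, and the final \esense has no opener.
sample = [r"\entry", r"\hw karli", r"\sense", r"\pos N", r"\gl boomerang",
          r"\entry", r"\hw kurdu", r"\gl child", r"\eentry", r"\esense"]
print(ET.tostring(convert(sample), encoding="unicode"))
```

The explicit stack is what makes repair cheap: at any point the converter knows which elements are open, so it can close them or discard a stray closer and still emit well-formed XML.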

24

XML

• XML: a descendant of SGML for structured markup of text

• XML separates the structure of the data from its presentation

• Much of the recent enthusiasm for XML has centered around representing simple and rigid structures such as database records

• The rich hierarchical and variable structure of dictionary entries is really more what something like XML excels at!

• Result remains a portable, tangible text file

25

Alternative: a database

• The obvious thing for storing a lot of data

• Has clear advantages: structure, indexing, query language, relationships, integrity.

• Many people have suggested using a database for lexical data and some have actually done it (IITLEX, Austin and Nathan)

• But in general lexicographers oppose the rigidity, and, in practice, standard relational databases are quite ill-suited to dictionaries

26

Problems

Dictionary entries vary enormously:

• word cross-reference

• word POS gloss example translation

• word dialects [sense-1 POS1 definition gloss example translation example translation] [sense-2 POS2 dialect definition example translation subentry-word gloss synonym] etymology

• Data is fragmented

• Same element can appear at many levels (dialect, crossreference, …)

• Dictionaries are only loosely structured

• Database model is inflexible to extending the dictionary structure

• Lessens portability

• [Answer: an object database]

27

XML indexing

• XML is a middle ground between the structure, indexing, etc. of a database, and the freedom of a word processor.

• To improve speed, an ad hoc index to the XML file is built. It is used for rapid headword and gloss lookup, and for identifying which parts of the XML file to process.
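One plausible shape for such an index, as a sketch only: a map from headword to the byte offset of its entry, so that a lookup seeks straight to the entry instead of parsing the whole file. The element names (ENTRY, HW, GL) and the assumption that each entry's opening and closing tags sit on their own lines are illustrative; the slides do not describe the real index format.

```python
import io
import re

HEADWORD = re.compile(rb"<HW>(.*?)</HW>")   # hypothetical headword element

def build_index(f):
    """Map each headword to the byte offset of its <ENTRY> line in binary file f."""
    index = {}
    offset = 0
    entry_at = None
    for line in f:
        if line.lstrip().startswith(b"<ENTRY>"):
            entry_at = offset                # remember where this entry begins
        m = HEADWORD.search(line)
        if m and entry_at is not None:
            index.setdefault(m.group(1).decode("utf-8"), entry_at)
        offset += len(line)
    return index

def fetch(f, index, headword):
    """Seek to one entry and return just its XML text."""
    f.seek(index[headword])
    chunk = []
    for line in f:
        chunk.append(line)
        if b"</ENTRY>" in line:
            break
    return b"".join(chunk).decode("utf-8")

sample = io.BytesIO(
    b"<DICTIONARY>\n"
    b"<ENTRY>\n<HW>karli</HW>\n<GL>boomerang</GL>\n</ENTRY>\n"
    b"<ENTRY>\n<HW>kurdu</HW>\n<GL>child</GL>\n</ENTRY>\n"
    b"</DICTIONARY>\n"
)
index = build_index(sample)
print(fetch(sample, index, "kurdu"))
```

A gloss index could be built the same way by matching the gloss element instead of the headword.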

28

Visualization of dictionary information

• For applications with simple textual content behind them, there is little that can be done but an on-line reflection of a printed page

• But we want more than just definitions of words: we want to know their relationships to other words, and the patterning in these relationships

• In a computational approach, the interface can mediate between the lexical data and the user

• The interface can select from and choose how to present information (according to the user’s preferences) – in many different ways

29

Previous work

• Current systems present the search-dominated interface of classic Information Retrieval systems: you type a word in a search box

• Results try to mimic, but are generally inferior to, the printed version of the dictionary

• Good feature: rapid searching

• These systems do little to utilize the captivating qualities of computers: interactivity, user control and adaptability (Brown 1985).

30

Previous work (2)

• Only effective when user has a clearly specified information need – even here, we are ignoring the distinction between information gained and knowledge sought (Sharpe 1995)

• Lack browsing, and chances for incidental or curiosity driven learning

• Lack tangibility and situatedness of paper: ineffective for getting an idea of a collection

• We wish to exploit the essence of hypertext, which is “click to explore” browsing

31

Previous work (3)

• Little research work (in corpus linguistics, visualization etc.) on dictionary visualization

• WordNet built a rich network of relationships, which fundamentally departed from the paper dictionary tradition, and has been used in many computational projects

• However very little has been done in the way of interfaces that make these relationships visible and intelligible to users.

• Graphical representations seem particularly important given our target users.

32

MRD Interfaces: WordNet

33

Graph-based visualization

• There is a little previous work on graphical representations of dictionaries

• For instance, the visual-thesaurus by plumbdesign derived from WordNet

• But it is also a good demonstration of how chaotic and confusing graphical interfaces can become.

34

Perils of visualization

35

Graph-based visualization (Jansz 1998; Jansz, Manning and Indurkhya 1999)

• Classic graph layout problem

• Adapts work by Eades et al. (1998) and Huang et al. (1998) on visualization and navigation of WWW document linkages

• Uses the spring algorithm. Big advantage is that it is an iterative updating algorithm, and so gives an easy interactivity:
  – it wiggles and people can play with it.

• Clarity and simplicity of graph: Software maintains a set of focus nodes to prevent overcrowding
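To illustrate why an iterative spring layout lends itself to interaction, here is a generic sketch of one update step: every pair of nodes repels, and every edge acts as a spring pulling its endpoints toward a rest length. This is a textbook-style formulation, not Kirrkirr's actual code; the constants and function name are assumptions.

```python
import math

def spring_step(pos, edges, rest=1.0, k_spring=0.1, k_repel=0.05):
    """Return new node positions after one round of spring and repulsion forces."""
    force = {n: [0.0, 0.0] for n in pos}
    nodes = list(pos)
    # Repulsion between every pair keeps unrelated nodes from piling up.
    for i, a in enumerate(nodes):
        for b in nodes[i + 1:]:
            dx = pos[a][0] - pos[b][0]
            dy = pos[a][1] - pos[b][1]
            d = math.hypot(dx, dy) or 1e-9
            f = k_repel / (d * d)
            force[a][0] += f * dx / d
            force[a][1] += f * dy / d
            force[b][0] -= f * dx / d
            force[b][1] -= f * dy / d
    # Each edge is a spring: stretched springs pull, compressed springs push.
    for a, b in edges:
        dx = pos[b][0] - pos[a][0]
        dy = pos[b][1] - pos[a][1]
        d = math.hypot(dx, dy) or 1e-9
        f = k_spring * (d - rest)
        force[a][0] += f * dx / d
        force[a][1] += f * dy / d
        force[b][0] -= f * dx / d
        force[b][1] -= f * dy / d
    return {n: (pos[n][0] + force[n][0], pos[n][1] + force[n][1]) for n in pos}

# Two linked words start far apart and settle near the rest length.
layout = {"karli": (0.0, 0.0), "kurrupurda": (5.0, 0.0)}
for _ in range(200):
    layout = spring_step(layout, [("karli", "kurrupurda")])
print(layout)
```

Because each call moves nodes only a little, a display can redraw after every step; dragging a node simply overwrites its position before the next step, and the rest of the graph follows.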

36

Educational advantages

• Alphabetical order is important, but

• A web of words offers other effective opportunities for learning

• A student can opportunistically explore words that are related in various ways

• Important semantic relationships can be understood

37

Kirrkirr network display

38

Kirrkirr network display

39

Formatted dictionary entries

• Are produced automatically from the XML by using XSL (a style language)

• XSL allows easy modeling of some user preferences.

• Most trivially, one can leave out information such as part of speech, or detailed definitions

• This is useful as many users find information overload quite confusing and demotivating

• Can produce bilingual or monolingual dictionary

• Opportunities for various output styles, and formats such as RTF or TeX for printing.
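The idea of choosing which fields to show per user level can be sketched without XSL. The following Python stand-in (not the XSL stylesheets the slides describe) renders the same XML entry at two levels of detail; the element names and level names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical user levels and the fields each one sees, in display order.
LEVELS = {
    "beginner": ["HW", "GL"],
    "advanced": ["HW", "POS", "DEF", "GL"],
}

def render(entry_xml, level):
    """Format one entry as text, keeping only the fields chosen for this level."""
    entry = ET.fromstring(entry_xml)
    parts = []
    for tag in LEVELS[level]:
        el = entry.find(tag)
        if el is not None and el.text:
            parts.append(el.text.strip())
    return " | ".join(parts)

entry = "<ENTRY><HW>karli</HW><POS>N</POS><GL>boomerang</GL></ENTRY>"
print(render(entry, "beginner"))
print(render(entry, "advanced"))
```

The data stays untouched; only the presentation rule changes, which is the separation that XML with a style language provides.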

40

Formatted dictionary entries

41

Rich typology of link types

• The semantically rich types of linkages present in a dictionary (synonym, antonym, hyponym, subheadword, variant, coverbs, …) solve one of the major problems of the web: we have many link types with a clear semantic interpretation

• Use consistent color-coded text and edges to show these link types

• Gives a richer browsing experience

• Can tell where you are going before clicking

42

Browsing

• Work (at PARC and elsewhere: Pirolli et al. 1996) has stressed role for browsing as well as searching in information access

• It provides a context for learning

• We provide browsing in several ways:
  – conventional hypertext
    • but with rich semantically-interpreted links
    • their color-coding matches network edges
  – network-based display of words
  – browsing through semantic domains

43

Semantic Domains

• Alphabetical order is one indexing strategy, but there are many others

• Most requested is ability to find things by semantic domains: e.g., food, manufactured items.

• Essentially the noun structure of WordNet, or the classical KR ISA hierarchy

• We can exploit the domain info in the dictionary

44

Semantic Domains (Katrina Avila)

45

Other components

• Multimedia (currently pictures and audio)
  – Can hear pronunciations / see objects
  – I’m keen to put in videos of Warlpiri sign language …

• Advanced search page
  – search various fields, regular expressions, etc.

• Notes: one can annotate dictionary entries (to correct or personalize)

46

Simple features

• Show the alphabet

• The list on the left gives concreteness, and tangibility
  – people can start with one of those words

• One can just type a few letters and then look at the list – traditional benefit of paper dictionary

• English lookup can be helpful when Warlpiri spelling fails

47

Fuzzy spelling

• We expected problems with spelling
  – Literacy skills based mainly in English, which doesn’t transfer well
  – Different sounds in Warlpiri

• Software employs “fuzzy spelling” which allows generous matching ignoring many distinctions
  – done on the fly with FSMs, rather than using the SOUNDEX strategy

• Still not enough: e.g., one kid wrote “wanapy” when wanting warnapari ‘dingo’, the end part of which knocked us out.
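As a rough illustration of matching that ignores many distinctions, the sketch below reduces both the typed word and each headword to a normalized key and compares keys. This is a simpler substitute for the on-the-fly FSM matching the slide mentions, and the rewrite rules are assumptions made for the example, not Kirrkirr's actual rules.

```python
import re
from collections import defaultdict

# Hypothetical normalization rules, for illustration only.
RULES = [
    (r"oo", "u"), (r"ee", "i"),   # English-style vowel spellings
    (r"rr", "r"),                 # ignore the r / rr distinction
    (r"r([dnlt])", r"\1"),        # ignore retroflex digraphs: rd, rn, rl, rt
    (r"([aiu])\1", r"\1"),        # ignore vowel length
]

def fuzzy_key(word):
    """Collapse spelling distinctions that learners often miss."""
    key = word.lower()
    for pattern, repl in RULES:
        key = re.sub(pattern, repl, key)
    return key

def build_lookup(headwords):
    """Group headwords by their fuzzy key."""
    table = defaultdict(list)
    for hw in headwords:
        table[fuzzy_key(hw)].append(hw)
    return table

lookup = build_lookup(["warnapari", "karli", "kurdu"])
print(lookup.get(fuzzy_key("wanapari"), []))   # typed without the retroflex "rn"
print(lookup.get(fuzzy_key("wanapy"), []))     # the slide's hard case
```

Under these rules “wanapari” reaches warnapari, but “wanapy” still does not, which matches the slide's point that collapsing distinctions alone is not enough when part of the word is missing.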

48

Adding more links: Terminology sets

• Related words often aren’t in same domain

• Rather, words associated with a topic

• E.g., a dance has an associated set of words: clearing the ground, decorating with ochres, leaves, and feathers, singing, dancing

• A concept useful to native speakers and learners

• Such cultural information is hard to learn, and not normally in dictionaries or thesauri

• Question: can terminology sets be derived automatically from appropriate corpora?

49

Terminology sets

• Approach: terminology will be determined as “medium range” collocations

• Corpus: collection of Warlpiri stories, letters, books, fieldwork notes, etc. I have slightly over 1/4 million words of online Warlpiri
  – This is a large proportion of Warlpiri available in textual form: the difference between fieldwork corpora and StatNLP corpora.
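“Medium range” collocation can be pictured as counting which words co-occur within a window of several tokens, rather than only adjacent pairs. A minimal sketch follows; the window size and the choice to count unordered pairs are assumptions, since the slides do not give the exact setup.

```python
from collections import Counter

def window_pairs(tokens, window=10):
    """Count unordered pairs of distinct words occurring within `window` tokens."""
    pairs = Counter()
    for i, w in enumerate(tokens):
        for v in tokens[i + 1 : i + 1 + window]:
            if v != w:
                pairs[tuple(sorted((w, v)))] += 1
    return pairs

# Toy token stream using stems from the karli 'boomerang' example.
tokens = ["karli", "jarnti", "kijirni", "karli"]
for pair, count in sorted(window_pairs(tokens, window=2).items()):
    print(pair, count)
```

Raw counts like these favor frequent words, so they need a significance measure before they can be ranked.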

50

Building and assessing

• I stemmed words (to maximize “fuel”)
  – Warlpiri also has clitics that attach to words
  – Using a Kay/Röscheisen-style “approximate” morphological analysis [ + vowel harmony]

• Collocational bonds were assessed using Dunning (1993)’s method of loglikelihood ratios
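For reference, the log-likelihood ratio statistic for a 2×2 contingency table can be computed as below. How the four counts were gathered from the Warlpiri corpus is not stated in the slides; the reading here (k11 = contexts with both words, k12 and k21 = contexts with only one of them, k22 = contexts with neither) is the usual one and is an assumption.

```python
import math

def llr(k11, k12, k21, k22):
    """Log-likelihood ratio (G^2) for a 2x2 table of co-occurrence counts."""
    def h(*counts):
        # Sum of k * log(k / total) over nonzero cells.
        total = sum(counts)
        return sum(k * math.log(k / total) for k in counts if k > 0)
    return 2 * (h(k11, k12, k21, k22)
                - h(k11 + k12, k21 + k22)    # row totals
                - h(k11 + k21, k12 + k22))   # column totals

print(llr(10, 10, 10, 10))   # counts consistent with independence
print(llr(10, 0, 0, 10))     # the two words always appear together
```

A score near zero means the counts look like chance co-occurrence; larger scores indicate a stronger bond. The statistic is commonly preferred for rare events, which suits a corpus of this size.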

51

Results

• For some topics (including dances, unfortunately), one couldn’t get much out (too little data). Cf. Church and Hanks on “strong tea”.

• But for others, works well. E.g., karli ‘boomerang’:
  – ngurrjumani ‘make, fix, repair’
  – jarnti ‘carve, trim’
  – kijirni ‘throw’
  – karaly(pa) ‘smooth’
  – kurduju ‘shield’
  – maparni ‘paint’

52

Results cont.

  – warrirni ‘seek, search, try to find’
  – kurdu ‘child, baby, young, youth’
  – kurrupurda ‘boomerang (a specific type)’
  – jarntu ‘pet dog’

• As often, the evaluation criteria are unclear, and susceptible to just-so stories. (Do people tend to sit around with their dog while carving their boomerang? I’m not sure.)

53

Another e.g.

pangurnu ‘digging scoop’:
  – pangurnu
  – pili ‘small coolamon/digging scoop’
  – rdaku ‘hole in the ground’
  – kaninjarra ‘downwards’
  – pangirni ‘dig, produce cavity’
  – mulju ‘soak in soft earth (dig for water)’
  – karlaja ‘foot end of sleeping area’
  – pirrkirni ‘scrape’
  – yirrarni ‘put down’

54

User study problems

• Since at present there is no dictionary available except the printed out ‘database’ [complete with markup codes], it was hard for many people to judge the use of the interface, since there was no point of comparison.

• First impressions only: It would have been good to let people try it out at their leisure, but unfortunately not possible (NT Ed all Macintosh, MRJ 2.1 shipping deadlines slipped past our study date…)

55

User study

Mim Corris (Yuendumu, Willowra)

• User testing with primary and (lower) secondary students

• Comments from teachers, other adults etc.

• Purely qualitative observational study of dictionary use

• (Doing anything much else would be difficult.)

56

Teachers

• Very enthusiastic

• Role in encouraging kids to learn Warlpiri

• See uses for it in classroom
  – Would teach dictionary skills and concepts

• Would also help teachers learn Warlpiri
  – they’d browse in it and learn things

• Liked spatial layout

• Could use as a basis for classroom activities (better with some further development: games and puzzles)

57

Elementary school kids

• One major benefit is that it was on computer.
  – It maintained interest

• “They were enthusiastic about the computer side of things and negotiated the interface’s various windows easily”
  – e.g., wanted “back” button

• Sometimes, working on sense relations and definitions was of less interest than moving things

• Word list was found helpful (can compensate for poor spelling)

58

Older children

• More thoughtful; had dictionary experience

• Still really liked whole word list

• Could use and liked synonyms and antonyms

• Promoting subentries to entries appeared very effective: People enjoyed exploring and explaining relation of derived terms to main word (even if sometimes folk etymologies?)

• The semantically uninterpreted cf link was still confusing

• High school girls wanted to spend time with it!

59

Adult literacy workers

• Less interested in graphical interface

• Mainly interested in looking at definitions

• Started discussing and disagreeing with them immediately
  – although they had and used paper printout, first real chance to see what was there?

• They wanted to, and were able to, annotate the definitions with notes

60

Room for improvement

• More colorful!

• Make more interactive: there’s not enough that kids can create

• Some cleaning up of the user interface – fewer steps for searches, etc.

• Adding in more views to the dictionary
  – e.g., search by color

61

Conclusions

• Kirrkirr is just a prototype of what one can do to visualize dictionaries

• We’d like to go beyond that and start visualizing patterns of metaphor and sense extensions in dictionaries

• But it does show how a lot can be done to provide much more in the way of a dictionary interface which mediates between well-structured data and users’ needs for searching/browsing and presentation

62

63

• High quality dictionaries without excessive manual labor.

• Terminology sets

• Richer hypertext and multimedia

64

• Traditional dictionaries tend not to capture collocational knowledge.

• For a somewhat largish window size, collocations seem a good way of getting at the notion of a terminology set.

65

Overview

• A project to provide an engaging, interactive computer front end for the Warlpiri dictionary.

• Research goals:
  – Effective innovative use of computer medium
  – Especially by dictionary visualization
  – Augmenting dictionaries from corpora
  – Assessing educational use of dictionaries

• Educational and practical goals:
  – Deliver it [information systems 101]
  – Incidental learning and regular lookup

66

• Cf. also, Atkins and Varantola (1997) IJLexicog.

67

XML vs. Databases

XML:

• Flexible and hierarchical structure is easy

• There are tools for parsing and querying XML, but much less developed

• Out of the box, one is basically using grep, perhaps with structure sensitivity

• Portable: text file

Databases:

• Both flexible and hierarchical structure are difficult and can involve use of many tables

• A standard query language makes information access straightforward

• Databases provide a lot of technology for indexing to allow fast retrieval

• Less portable/tangible