What we say and what we mean

Artificial Intelligence Review (1987) 1, 139-157

A. Ramsay, Cognitive Studies Program, University of Sussex, Falmer BN1 9QN, UK

Abstract. The general problem of natural language processing has not been solved, and may never be. Nonetheless, there are now a number of well-known techniques for certain aspects of the task; and there is a certain amount of agreement about what other problems need to be tackled, if not about how to tackle them. The current paper gives a survey of what we do know, and indicates the areas in which further progress remains to be made.

Linguistic and non-linguistic knowledge

Understanding and generating natural language seems to call upon two sorts of knowledge. It requires a substantial amount of knowledge about language itself - what strings of sounds and letters are in fact words of the given language, what sequences of words are well-formed sentences, how do different word-orders encode different messages, and so on. And it requires a similar amount of knowledge about the world in general - what sorts of things can a listener generally be expected to know in advance so they need not be mentioned explicitly, what can the partners in a conversation reasonably be expected to infer from what has been said already, what names do people use for referring to things and each other, and so on. These two sorts of knowledge might be seen as being concerned with what we say, and what we mean by it, respectively. The purely linguistic knowledge specifies how strings of sounds or characters encode messages; the general world knowledge is used for working out what actions are appropriate given the message encoded by what we have just heard or read.

They also correspond, to some extent, to a split between what we do and do not know how to make computers do. I would not want to suggest that the linguistic questions are solved, or anything like it, but there do certainly seem to be some theories about what needs to be done and how to do it. We have computer programs which, for non-trivial fragments of the lower levels of language processing, seem to work. As far as the use of world knowledge is concerned, we are in a much weaker position. We do know something about what needs to be done, about what sort of knowledge is required and when, but we have very few practical theories about how to deliver it. This will become apparent as we progress through our survey of the components of natural language processing systems (NLP systems).

Architectures

The first thing we need to consider in the design of an NLP system is its architecture. There is a long tradition in linguistics that the best way to describe the rules of language is in terms of layers of rule sets. The particular layers are sometimes disputed, and in any case different layers are required for speech and text processing, but there is widespread agreement among linguists that some sort of layered approach is appropriate. A typical diagram of the layers would look something like Table 1.

Table 1. Linguistic levels

Level                    Subject matter

Lexical analysis         Words and word endings
Morphological analysis   The significance of word endings
Syntax                   Rules about word order
Semantics                Relationships and identities
Discourse rules          What can you say when
World knowledge          What can you assume

Linguistics concerns itself with rules which operate within levels and ones which connect levels together. The first class would include rules which rule out strings as being possible words of English (‘there’s no English word in which h follows a vowel’, for instance), or possible inflections of English words (‘ing can only be added as a suffix to verbs’), or legal word orders (‘the must not be followed by a verb’), or meaningful sentences (‘eat must have an animate subject’), or legitimate things to say (‘You can’t refer to the blue block unless there is a unique blue block in the context’) or reasonable assumptions to make. The second class uses the structures which are implicitly referred to in the first in order to show how choices about which words to use, or what inflections to give them, or what order to say them in, can be used to encode messages.

These levels of description have an obvious implication for the design of NLP systems. If people who are interested in describing what they see real language users doing find it convenient to talk in terms of a series of levels of analysis, is it not likely that this is going to be the easiest way to build artificial language users? Following this line of argument, we might expect to see NLP systems built up out of lexical, morphological, syntactic and similar components. Certainly all the tasks implied by this separation have to be performed: different words, or word endings, or word orders all encode different messages, and all these encodings must be decoded if we want to understand an utterance or a text. The decision to have rules which deal with different levels does not, however, entail any particular way of organizing the communications between the various components. It is clear that any implementation which assumes that there is a simple uni-directional flow of information from the top to the bottom of Table 1 (for comprehension), or from the bottom to the top (for generation), is simply not going to work. At any level there will be alternative readings of the data which cannot be disambiguated (comprehension), or choices which cannot be settled (generation), for which the appropriate information is instantly available at some other level. There are abundant examples of low-level ambiguities which can easily be resolved by the use of high-level information, from the choices of meaning for bat in ‘Some bats are believed to suck blood’ and ‘I left my bat in the changing room’ to the well-known problem of choosing a referent for they in ‘The councillors refused the women a permit because they feared violence’ and ‘The councillors refused the women a permit because they advocated violence’. Many NLP systems acknowledge that there is a problem here, but then ignore it and simply require the lower level system to backtrack and offer alternative analyses when the higher levels request them.

There are two reasonable alternatives. One is to use a neutral working area for components to store the results of their analysis. Such an area, often termed a ‘blackboard’ after Erman and Lesser’s (1975) work on speech processing, can be used for offering partial results and hypotheses to higher level components, or even for asking them explicit questions about ambiguities. The blackboard is an appealing metaphor, but there are serious problems about implementing such a system, since decisions have to be made about the format of entries on the blackboard, its integrity as different components add and delete messages, and the control over the resources which should be available to each component. A major variant on the use of a neutral area is to embody all rules, of whatever level, as a single set of productions firing on states of a single database or working memory, all controlled by a single scheduling algorithm.
Systems of this sort, such as Riesbeck’s (1978) ‘conceptual analyser’, can indeed mix rules at different levels, but it seems very difficult to provide them with rules of the required degree of sophistication.

The alternative to using a neutral area is to try to carry ambiguities around in the form of constraints. This approach, which follows on from problem-solving work by Sussman and Steele (1980) and Stefik (1981), takes, for instance, the view that the definition of bat as a wooden implement for striking balls is a constraint on the object being referred to by the word. When this constraint is added to other constraints on the object referred to in the text, it should help identify a referent - some particular wooden implement. Under this interpretation, a lexical ambiguity, such as the one between wooden implements and flying mice, need not be regarded as a problem at all. The description “((X is a flying mouse) or (X is a wooden implement)) and (X is something which I might leave in a changing room)” is nearly always going to identify a single object, which will in all probability be a wooden implement. But we do not need to decide at the time when we recognize the word bat which of its alternative meanings is going to be the one that fits the relevant object - we just need to know how to use disjunctive descriptions when trying to find referents.
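The use of a disjunctive description to find a referent can be illustrated with a small sketch. It is not taken from any of the systems cited: the toy world, the property strings and the function names (referents, is_bat, leavable, both) are all assumptions made purely for illustration.

```python
# Hypothetical toy world: each candidate referent is a set of properties.
WORLD = {
    "bat_1": {"wooden implement", "can be left in a changing room"},
    "bat_2": {"flying mouse"},
    "towel_1": {"can be left in a changing room"},
}


def referents(description, world=WORLD):
    """Return the names of all objects that satisfy the description."""
    return sorted(name for name, props in world.items() if description(props))


def is_bat(props):
    """Lexical entry for 'bat': a disjunction, left unresolved at lookup time."""
    return "flying mouse" in props or "wooden implement" in props


def leavable(props):
    """Constraint contributed by the rest of the sentence."""
    return "can be left in a changing room" in props


def both(*constraints):
    """Conjoin constraints into a single description."""
    return lambda props: all(c(props) for c in constraints)


# The word alone leaves two candidates; the conjoined description leaves one.
candidates_for_word = referents(is_bat)
candidates_in_context = referents(both(is_bat, leavable))
```

Under these assumptions the lexical ambiguity never has to be resolved explicitly: the word alone leaves bat_1 and bat_2 as candidates, and adding the contextual constraint leaves only bat_1.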

None of these ways of organizing a system seems entirely satisfactory. Using constraints has considerable appeal if we can do it, since it leaves us with a simple architecture (just work straight through from top to bottom or bottom to top, as appropriate). We do, however, have to work out how to use lexical entries, parse trees, semantic representations, and so on as constraints. This is not the usual way they are used, and it may be difficult to construct programs that do it. The blackboard suffers from being much more difficult to program than might be expected at first sight, especially in terms of designing a resource allocation algorithm. To require lower levels to backtrack when high levels fail is unacceptably inefficient. Of the three options, this author inclines towards the first, but any system designer will have to make his own choice.

Components

Given some more or less satisfactory answer to the problem of organizing the system as a whole, we have to consider each of its parts in detail. There are various options about what components will be necessary, but a breakdown into components corresponding to the levels in Table 1 is as typical as any. It should be noted that this is offered as a reasonable architecture for a system for understanding and interpreting English. It is well known that different languages lay different emphasis on different levels: word order is less important in Finnish than in English, word structure is more important in German than in English, and so on. These changes may indicate that the levels in our diagram might need to be merged or split for different languages, though it seems unlikely that there will be very radical changes. In the remainder of the paper we will be concerned entirely with English, and hence will consider computational models of lexical, morphological, syntactic, semantic and pragmatic processing for English. We will not go deeply into the details of algorithms, but will rather consider the extent to which algorithms for comprehension and generation actually exist.

Lexical processing

Lexical processing here means the mechanisms which add inflections to lexical items to produce acceptable surface forms, and which recognize surface forms to be instances of inflected lexical items. To put it more concretely, the task is to recognize that if you add the ending -ing to the word recognize, you should end up with recognizing rather than recognizeing; and that the form recognizing is derived from the word recognize, not from some putative form recogniz. English has a fairly simple set of rules which cover a very large proportion of cases, plus a number of words which are extremely irregular, even idiosyncratic, in the surface forms they show for various endings. Many of these are clearly inheritances from other languages which the words were imported from, and it might be possible to discover underlying regularities if we knew, for every such word, where it had come from. For example, -ought is clearly a regular past form, of which sought, bought, fought and perhaps caught are instances. There is, however, no way of predicting which words will follow this pattern rather than the usual one of adding -ed, nor of discovering the form of the root (seek, buy, fight, catch) to which the ending was added. It is thus usual to provide rules which capture all the standard cases, and to store all the idiosyncrasies explicitly.


The basic rules for recognizing inflections are very simple. The simplest way to write them down is as a set of productions such as the following:

IF the ending is -ing THEN replace it by -e and look in the dictionary
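A production of this kind is easy to sketch in code. The following is a minimal illustration only: the dictionary, the four rules and the name find_root are assumptions, and the rule list is not the set published by Ramsay and Barrett.

```python
# Hypothetical mini-dictionary of root forms.
DICTIONARY = {"recognize", "walk", "bake"}

# Each rule is (ending to strip, replacement to try), tried in order.
RULES = [
    ("ing", "e"),  # recognizing -> recognize
    ("ing", ""),   # walking -> walk
    ("ed", "e"),   # baked -> bake
    ("ed", ""),    # walked -> walk
]


def find_root(surface):
    """Return (root, ending) for the first rule whose result is in the
    dictionary, or None if no rule yields a known root."""
    if surface in DICTIONARY:
        return surface, ""
    for ending, replacement in RULES:
        if surface.endswith(ending):
            candidate = surface[: -len(ending)] + replacement
            if candidate in DICTIONARY:
                return candidate, ending
    return None
```

With this toy dictionary, find_root("recognizing") gives the root recognize with the ending ing, while an irregular form such as caught matches no rule and would have to be stored explicitly.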

Ramsay and Barrett (1987) give a set of about 15 such rules which cover a very large percentage of English word ending changes. The task for adding inflections is slightly harder, but is still very largely a matter of applying simple, regular rules. The way inflections are added often involves inspecting several of the terminal letters of the word being inflected to see whether they belong to particular classes such as vowels or consonants, for instance:

IF the last letter of the word is y, and the previous letter is a consonant, and the first letter of the suffix is e THEN replace the y by i and add the suffix.
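The rule just given can be written out as a short function. This is a sketch under stated assumptions: the name add_suffix and the vowel set are illustrative, and the second rule (dropping a final e before a vowel-initial suffix) is added only to cover the recognize/recognizing example used earlier; it is not a complete account of English spelling and will, for instance, mishandle see + ing.

```python
VOWELS = set("aeiou")


def add_suffix(word, suffix):
    """Attach a suffix to a root using two illustrative spelling rules."""
    # Rule from the text: final y after a consonant becomes i
    # when the suffix begins with e (carry + ed -> carried).
    if (len(word) >= 2 and word[-1] == "y" and word[-2] not in VOWELS
            and suffix.startswith("e")):
        return word[:-1] + "i" + suffix
    # Assumed extra rule: drop a final e before a vowel-initial suffix
    # (recognize + ing -> recognizing).
    if word.endswith("e") and suffix[:1] in VOWELS:
        return word[:-1] + suffix
    # Default: simple concatenation (walk + ing -> walking).
    return word + suffix
```

Note that each rule inspects only the last one or two letters of the root and the first letter of the suffix, which is why such rules can be applied deterministically.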

Production rules seem rather clumsier as a way to specify how to add endings than as a way of recognizing them. It seems that we need rather more rules, though still not an outrageous number, and that many of the rules are in fact only minor variants of each other, each being included to cover some small set of cases. Kay (1983) and others have argued that a neater way to represent the knowledge required for addition of word endings is as a finite state transducer. This is a simple machine which can be in any of a finite set of states, and which changes its state depending on (i) the state it is currently in, (ii) the state of the word to which the ending is being added, and (iii) the state of the partially constructed surface form. Whatever the details of the implementation, it is clear that for the majority of cases a very simple set of rules will suffice; that for adding suffixes these rules can be used deterministically; and that for pathological cases such as caught and were no rules can possibly be developed.

Syntax

Grammars. Syntax is the study of structural patterns in word order and structure. It is the area of language which has received the greatest level of attention in linguistics in the last 30 years. It does indeed seem to be special. It is the first place at which human language diverges in principle from the sign systems of other animals, and it is the main means by which human language can denote arbitrarily complex relationships between entities, rather than just referring to objects and homogeneous states of affairs. It is also the area in which most work has occurred in computer systems for language processing. This is unsurprising. The lower levels, though non-trivial, can be dealt with adequately by the sort of rules described above, at least for text. The higher levels cannot even be started on until the syntactic processing is completed, since the relationships that they are concerned with are denoted by structural properties of the input, sometimes by very subtle structural properties.

In both linguistics and work on NLP systems, the result of all this work has been the development of a number of differing schools. The following rather bland outline would probably be agreeable to about 90% of workers in both fields (we will follow the very general outline by a summary of the variants, and of the arguments to the effect that it is completely misguided):

1 The notion of a rewrite rule is central. A rewrite rule is a rule to indicate how to group contiguous words or word groups together to form bigger groups (alternatively, how to break big word groups into sequences of words or smaller groups). The simplest form of rewrite rule, known as a context-free phrase structure rule (CF-PS rule), simply has the label of some phrase category on the left of an arrow, and one or more labels of word categories or phrase categories on the right, as in Fig. 1.

S → NP VP
VP → verb NP
NP → determiner noun

Fig. 1. Simple CF-PS grammar.

Here S, NP and VP are phrase categories; verb, determiner and noun are word categories. The rules may be read either as a way to say how you can rewrite S as NP followed by VP, and so on, or to say that when you have NP followed by VP you can rewrite them as S.

2 But this very simple sort of rewrite rule, which just refers to basic categories of word or word group, is inadequate for capturing many (perhaps most) of the patterns we are concerned with. Context-free grammars, made up of these CF-PS rules, have to be supplemented in some way. The most obvious instances of phenomena which are hard to describe using CF-PS rules are ‘agreement’, where the form of one component is constrained by the form of another (example: ‘he runs a shop’ and ‘they run a shop’ are both acceptable, whereas ‘he run a shop’ and ‘they runs a shop’ are not); ‘movement’, where there is an obvious correspondence between two forms (example: ‘he and his sister have run it for years’ and ‘it’s been run by him and his sister for years’); and ‘unbounded dependency’, where there is a connection between a word and a far distant structure which seems to require that word for its own completion (example: who and meet in ‘the woman who your friend wanted you to meet is here’ - in some sense, who seems to be the object of meet, even though it is not where you would expect the object to be).

3 It should be possible to analyse any piece of text, using these rules, in such a way that all the words are grouped together in a single large group. The way that particular rules are used to find out this grouping carries a large part of the message of the text.

4 It is possible, and useful, to write programs which will apply the rules either to find out what the structure of a given piece of text is, or to make a text with a given structure.
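To make points 1 and 4 concrete, the toy grammar of Fig. 1 can be applied in both directions by a very small program. The sketch below is an illustration under assumptions, not a description of any cited system: the lexicon is invented, the function names (parse, analyse, generate) are hypothetical, and the naive top-down strategy would loop on a left-recursive grammar.

```python
import itertools

# The three rules of Fig. 1: phrase category -> list of right-hand sides.
GRAMMAR = {
    "S": [["NP", "VP"]],
    "VP": [["verb", "NP"]],
    "NP": [["determiner", "noun"]],
}
# Hypothetical lexicon mapping words to word categories.
LEXICON = {
    "the": "determiner", "a": "determiner",
    "dog": "noun", "cat": "noun",
    "chased": "verb",
}


def parse(category, words, start=0):
    """Yield (tree, next_position) for each way `category` can cover
    the words beginning at position `start`."""
    if category not in GRAMMAR:  # a word category: consult the lexicon
        if start < len(words) and LEXICON.get(words[start]) == category:
            yield (category, words[start]), start + 1
        return
    for rhs in GRAMMAR[category]:
        partial = [([], start)]  # (children found so far, position reached)
        for symbol in rhs:
            partial = [(kids + [tree], end)
                       for kids, pos in partial
                       for tree, end in parse(symbol, words, pos)]
        for kids, end in partial:
            yield (category, *kids), end


def analyse(sentence):
    """Return every S tree that covers the whole sentence."""
    words = sentence.split()
    return [tree for tree, end in parse("S", words) if end == len(words)]


def generate(category):
    """Yield every word sequence the grammar derives from `category`."""
    if category not in GRAMMAR:
        for word, cat in LEXICON.items():
            if cat == category:
                yield [word]
        return
    for rhs in GRAMMAR[category]:
        for parts in itertools.product(*(list(generate(s)) for s in rhs)):
            yield [word for part in parts for word in part]
```

The same rule table drives both analyse and generate, which is the sense in which the rules can be read in either direction. With this lexicon the grammar derives sixteen sentences, and none of the agreement, movement or dependency phenomena of point 2 can be expressed.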

Point 1, at least, would be almost universally acceptable. We will consider the grounds for disagreeing with even this at the end of this section, but for the moment we will assume that it holds. The first part of point 2 would be accepted by most people, but there is wide, and radical, disagreement about the particular ways in which the basic rules should be supplemented. The following three positions all have strong proponents.

Transformational grammar (TG): The basic rewrite rules should be supplemented by a set of transformations, whose function is to generate different surface forms from similar underlying structures. In recent formulations of TG, the transformations are subject to a variety of constraints, which operate to restrict the possible forms of the transformations themselves, the possible resultant surface forms, the order in which transformations apply, and the messages which can be encoded at all. Although TG has been dominant in linguistics for most of the last 30
years, and is still probably the mainstream theory, it has been far less prominent in NLP. There seem to be two possible reasons for TG’s lack of prominence in NLP. Firstly, the connections between the syntactic component of TG (the base rules + the transformations) and its lexical and semantic components are not specified in a way which makes it easy to use the theory for either generation or comprehension. The rules do indicate how to form the correct surface form corresponding to a given syntactic structure, and how to find the semantic interpretation corresponding to a syntactic structure. They do not, however, give any indication of how to perform the inverse mapping, finding the syntactic form which encodes a semantic interpretation or the syntactic structure which gives rise to a particular surface form, and hence it is very difficult to see how to get from surface form to semantic interpretation, or from semantic interpretation to surface form, via the rules of TG. The second problem is an intensification of the first, and arises because there is no a priori fixed bound on the number of transformations that may be applied in the course of deriving a surface form from a form generated by the basic grammar. This makes the task of finding out what basic structure gave rise to a given form even harder than it seems at first, since there is no way of knowing how many inverse transformations may have to be applied.

Augmented transition networks (ATNs): It is possible to view the basic rewrite rules as transition networks, with labelled arcs corresponding to components of the right-hand side of the rewrite rule. In this alternative notation, the toy grammar in Fig. 1 would look something like that portrayed in Fig. 2.

S:   (start) --NP--> ( ) --VP--> (finish)

VP:  (start) --verb--> ( ) --NP--> (finish)

NP:  (start) --determiner--> ( ) --noun--> (finish)

Fig. 2. Simple transition network grammar.

The change to transition networks makes very little difference to the descriptive power of the grammar. Transition networks are often more compact than sets of rewrite rules, but the two formalisms are ‘weakly equivalent’ in that for any set of rules written in one, it is possible to write a set of rules in the other which covers exactly the same set of examples, possibly with different descriptions of how particular examples are made up. Transition networks, however, provide an obvious computational interpretation, in terms of route-finding algorithms through a network, which can lead to efficient processing algorithms. They can also easily be ‘augmented’ by including arbitrary tests and actions before or after transitions are attempted - the NP arc of the S network above, for instance, might include a check that the number feature for the NP had the same value as it had for the VP arc, thus providing a way of checking for number agreement between subject and verb.
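A rough sketch of how such an augmentation might look is given below. It is an illustration only: the networks are the linear ones of Fig. 2, the lexicon and its number features are invented, and the names (traverse, accepts, set_number, check_number, AUGMENTATIONS) are hypothetical. Because these toy networks offer no choice of arc, the sketch needs no backtracking, which a realistic ATN interpreter would.

```python
# The networks of Fig. 2, each a linear sequence of arc labels.
NETWORKS = {
    "S": ["NP", "VP"],
    "VP": ["verb", "NP"],
    "NP": ["determiner", "noun"],
}
# Hypothetical lexicon: word -> (category, number); None means unmarked.
LEXICON = {
    "the": ("determiner", None),
    "dog": ("noun", "sing"), "dogs": ("noun", "plur"),
    "cat": ("noun", "sing"),
    "chases": ("verb", "sing"), "chase": ("verb", "plur"),
}


def set_number(registers, value):
    """Action: remember the number feature carried by this arc."""
    registers["number"] = value
    return True


def check_number(registers, value):
    """Test: this arc's number must equal the remembered one."""
    return registers.get("number") == value


# Augmentations attached to particular arcs of particular networks.
AUGMENTATIONS = {
    ("S", "NP"): set_number,     # the subject fixes the number ...
    ("S", "VP"): check_number,   # ... and the verb phrase must agree
    ("VP", "verb"): set_number,  # a VP takes its number from its verb
    ("NP", "noun"): set_number,  # an NP takes its number from its noun
}


def traverse(net, words, pos):
    """Follow `net` from position `pos`; return (next_pos, number) or None."""
    registers = {}
    for label in NETWORKS[net]:
        if label in NETWORKS:  # arc labelled with another network
            result = traverse(label, words, pos)
            if result is None:
                return None
            pos, arc_number = result
        else:  # arc labelled with a word category
            if pos >= len(words):
                return None
            category, arc_number = LEXICON.get(words[pos], (None, None))
            if category != label:
                return None
            pos += 1
        test = AUGMENTATIONS.get((net, label))
        if test is not None and not test(registers, arc_number):
            return None
    return pos, registers.get("number")


def accepts(sentence):
    """True if the S network consumes the whole sentence."""
    words = sentence.split()
    result = traverse("S", words, 0)
    return result is not None and result[0] == len(words)
```

Under these assumptions ‘the dog chases the cat’ is accepted and ‘the dogs chases the cat’ is rejected, although the underlying networks are identical to the unaugmented ones. The tests are ordinary program text, which is exactly the freedom discussed in the next paragraph.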

The ATN is probably the most widely used formalism in current practical NLP systems. Its advantages are that the representation of the basic rules in terms of networks allows straightforward but efficient parsing algorithms to be used; and that the use of arbitrary program text to implement the augmentation allows programmers, who are after all the people who do the final development, to use their particular skills in order to find solutions to the awkward problems such as agreement constraints and movement phenomena. This is both the biggest advantage and the biggest disadvantage of the ATN. It enables people to develop working systems without too much trouble, since all the power of the programming language may be brought to bear. At the same time, it means that the resultant systems are hard to maintain, and that they may contain rules which are motivated solely by programming issues and not by facts about language. When we try to cover substantial fragments of the grammar of natural languages, it will help if the grammatical formalism is clear and easy to maintain. This is not in general true of ATN implementations.

Unification grammar: This is a term that covers a number of formalisms which try to extend the basic notion of a rewrite rule in a restrained and principled way. What they have in common is that they all permit the symbols that appear in rules to be complex tree-structured objects which are matched by ‘unification’ rather than simple equality. The rules of our basic grammar might be extended in one of the earliest unification grammars (Pereira & Warren’s (1980) definite clause grammar (DCG)) to capture number and person agreement, and to give different forms of VP for verbs with different transitivity (Fig. 3).

s → np(NUM, PERS) vp(NUM, PERS)
vp(NUM, PERS) → verb(NUM, PERS, intrans)
vp(NUM, PERS) → verb(NUM, PERS, trans) np
vp(NUM, PERS) → verb(NUM, PERS, bitrans) np np
np(NUM, PERS) → determiner(NUM, PERS) noun(NUM, PERS)

Fig. 3. Simple definite clause grammar.

In these rules, words in upper case denote ‘logical variables’, words in lower case denote constants. Entries in brackets denote sub-features of the main label. The notation extends the simple CF-PSG rules by allowing complex category labels. The use of unification for matching items against labels means that the behaviour of the rules is constrained and easy to understand. Unification requires a constant to be matched against a similar item, but allows a variable to match anything at all subject to the constraint that any other instances of the same variable in the same rule must match the same other item. Thus in the second vp rule the third sub-feature of the first constituent on the right-hand side is constrained to be trans. The other sub-features are allowed to be anything, but whatever values they do have will be inherited by the structure described by the left-hand side of the rule.
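The matching behaviour just described can be sketched in a few lines. The representation is an assumption made for illustration: a term is either a string (a variable if it begins with an upper case letter, otherwise a constant) or a tuple of terms, and the names is_variable, walk and unify are hypothetical. The sketch omits the occurs check that a full unifier would need.

```python
def is_variable(term):
    """Upper case initial marks a logical variable, as in Fig. 3."""
    return isinstance(term, str) and term[:1].isupper()


def walk(term, bindings):
    """Follow variable bindings until a non-variable or unbound variable."""
    while is_variable(term) and term in bindings:
        term = bindings[term]
    return term


def unify(a, b, bindings=None):
    """Return the bindings that make a and b equal, or None on failure."""
    bindings = dict(bindings or {})
    a, b = walk(a, bindings), walk(b, bindings)
    if a == b:
        return bindings
    if is_variable(a):
        bindings[a] = b
        return bindings
    if is_variable(b):
        bindings[b] = a
        return bindings
    if isinstance(a, tuple) and isinstance(b, tuple) and len(a) == len(b):
        for x, y in zip(a, b):
            bindings = unify(x, y, bindings)
            if bindings is None:
                return None
        return bindings
    return None  # two different constants, or mismatched structures


# verb(NUM, PERS, trans) matched against a lexical entry for a verb.
rule_item = ("verb", "NUM", "PERS", "trans")
found = unify(rule_item, ("verb", "sing", "third", "trans"))
```

Here the constant trans must match exactly, while NUM and PERS take whatever values the lexical entry supplies. Because the resulting bindings are passed on to the other items in the same rule, a later attempt to unify np(NUM, PERS) with a plural noun phrase fails, which is how agreement is enforced without any special-purpose test.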

It is no coincidence that we have switched to a notation which follows the standard syntax for the programming language PROLOG in using upper case words for variables and lower case words for constants. Unification lies at the heart of most approaches to theorem proving for logical languages such as first order predicate calculus. As a coherent and principled way of carrying around constraints between partially specified patterns and structures, it is also a good candidate for tasks other than theorem proving, and in particular for providing a constrained extension of the CF-PSG notation. Once it has been chosen for this purpose, we see that programs embodying such rules are often written in PROLOG, since it already exists and is known to be an effective way to deal with such partially specified structures.

The use of partially specified patterns and unification for matching lies behind all the grammars we are concerned with here. They do have substantive differences at other points. Unfortunately, we do not have space to discuss these differences here, but this should not lead any reader to suppose that they are not important. All we can do is give brutally concise indications of the main focus of some significant versions of unification grammar, and leave the reader to follow up the references.

Generalized phrase structure grammar (GPSG, Gazdar et al., 1985) attempts to provide a minimal set of rules. These may over-generate (that is, they may permit word sequences which are not in fact acceptable), but they are further constrained by global rules on feature values and word order. The goal of the theory resembles the goal of late versions of TG, but the range and format of rules is far more tightly constrained. GPSG is also intended to provide syntactic descriptions which correspond very closely to rules for a formal theory of semantics, so that syntactic analysis leads directly to understanding.

Lexical functional grammar (LFG, Bresnan, 1978; Bresnan & Kaplan, 1982) is an explicit attempt to solve some of the problems of TG discussed above. In particular, LFG constrains the transformations to apply only to frames attached to lexical items, rather than to structures derived at arbitrary points in the derivation of the surface string. This makes parsing a more feasible task, since there are now fewer possible forms and the point at which they are generated is fixed.

Functional unification grammar (FUG, Kay, 1985) differs from all other phrase structure grammars in allowing rules which only talk about part of the structure to be described. For instance, it is possible in FUG to have one rule, applying to all declarative sentences, which specifies the agreement constraint between subject and main verb, together with a collection of rules describing the form of complement a verb should take (i.e. should it take zero, one or two objects, or should it take a sentence and if so what sort). FUG is also targeted at a semantic theory, in this case a theory in which syntactic structure encodes a complex set of roles and relationships.

This discussion of unification grammar should not be seen as anything more than a sketch. Complex feature sets and unification are fast overtaking ATNs as a computationally tractable formalism for non-trivial grammars. For further discussion of the advantages and disadvantages of each the reader is referred to the sources given.

Parsing. No matter what form of grammar is chosen, programs for using it need to be developed. It is assumed that the structural organization of the text encodes some facet of the message, so that understanding the message requires, among other things, analysis of the structure. Similarly, in order to generate a message it is necessary to discover a structural organization which encodes it appropriately. A very large amount of work over the past 15 years has been done on efficient algorithms for parsing natural language. We do not have space here to give much detail, but there are some points to be made about state-of-the-art parsing algorithms.

Early algorithms tended to operate in one of two modes, either top-down or bottom-up. Top-down parsers try to work from a rule with S as its left-hand side down to rules which have the lexical categories matching the text as their right-hand sides. Bottom-up ones work from rules whose right-hand sides match the lexical categories in the text up to a rule with S as its left-hand side. Either strategy has disadvantages. Top-down parsers can waste time trying out rules which can be seen at a glance to have no chance at all of fitting the data, since they cannot rule out particular expansions until all possible initial segments have been checked. Bottom-up ones suffer less from this, but can easily be led by lexical ambiguities to generate completely irrelevant hypotheses. It is often argued that a strategy known as ‘chart parsing’ (Kaplan, 1973) combines the best of both worlds. Chart parsing uses a working area called ‘the chart’ to store all hypotheses about potential rule applications and potential constituents of the final analysis. The chart prevents the parser from redoing work it has done already by blocking attempts to reapply the same rule in the same place. The overall effect is to combine the best of both top-down and bottom-up approaches. It should be noted, however, that nearly all expositions of chart parsing in text books (Winograd, 1982; Charniak & McDermott, 1985) show how to construct a parser for a CF-PSG grammar (or equivalently for an ATN with no tests or actions). For a chart to be used effectively with a more complex grammar, everything must be indexed neatly, matches between constituents with specifications must be performed extremely fast, and constraints on feature values must be propagated coherently.
The chart remains the best way currently known for applying the rules of a grammar in a uniform way, but it is not the simple object that it is sometimes described as (see Ramsay & Barrett, 1987 for a detailed description of a chart parser for a unification grammar to see what needs to be done).
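For the simple CF-PSG case, the idea of a chart can be sketched as follows. This is a minimal bottom-up recognizer written for illustration, not the algorithm of any of the works cited: the lexicon, the edge representation (start, end, category, categories still needed) and the names chart_parse and recognizes are assumptions, each word has a single category, and no parse trees are built.

```python
# The grammar of Fig. 1 as (left-hand side, right-hand side) pairs.
GRAMMAR = [
    ("S", ("NP", "VP")),
    ("VP", ("verb", "NP")),
    ("NP", ("determiner", "noun")),
]
# Hypothetical lexicon with one category per word.
LEXICON = {
    "the": "determiner", "a": "determiner",
    "dog": "noun", "cat": "noun",
    "chased": "verb",
}


def chart_parse(words):
    """Return the chart: a set of edges (start, end, category, remaining).
    An edge with empty `remaining` is a complete constituent."""
    chart = set()
    agenda = [(i, i + 1, LEXICON.get(w), ()) for i, w in enumerate(words)]
    while agenda:
        edge = agenda.pop()
        if edge in chart:
            continue  # the chart blocks repetition of work already done
        chart.add(edge)
        start, end, category, remaining = edge
        if not remaining:
            # Propose every rule whose right-hand side begins here.
            for lhs, rhs in GRAMMAR:
                if rhs[0] == category:
                    agenda.append((start, end, lhs, rhs[1:]))
            # Extend incomplete edges that were waiting for this category.
            for s, e, c, rem in list(chart):
                if rem and e == start and rem[0] == category:
                    agenda.append((s, end, c, rem[1:]))
        else:
            # Extend this incomplete edge with constituents already found.
            for s, e, c, rem in list(chart):
                if not rem and s == end and c == remaining[0]:
                    agenda.append((start, e, category, remaining[1:]))
    return chart


def recognizes(sentence):
    """True if a complete S edge spans the whole sentence."""
    words = sentence.split()
    return (0, len(words), "S", ()) in chart_parse(words)
```

Every hypothesis, complete or incomplete, is entered in the chart exactly once, so no rule is applied twice in the same place. Extending the sketch to a unification grammar would require the indexing, fast matching and constraint propagation mentioned above, none of which is attempted here.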

Again the current discussion of parsing algorithms only scratches the surface. The details of all the variants on parsing algorithms would take several books. The main point to note is that the majority of current practical systems use some
variation on Kay’s original chart parser, though with the more complex grammatical formalisms the implementation of an efficient chart parser becomes rather tricky. There is far less work on systems for generating surface forms from specified structures, for instance structures which encode the required message. This is not because this is such a simple task that no work is required. It seems to be partly because the task is in fact so difficult that nobody has managed to do much about it yet; and partly because we do not even know what information is encoded by structural choices, and hence are not in a position to give very detailed descriptions of structures which are to be realized as surface strings. The mechanisms described by McDonald (1983) and Appelt (1985) seem reasonably effective, but the theory is very undeveloped and it does not seem appropriate to go into detail here.

Implicit grammars. We have assumed throughout this section that any NLP system will include a grammar, in the form of a set of structural rules, and a parser or encoding routine which applies the rules of the grammar. It seems certain that structural information does carry a large part of the message of a piece of text or speech. If this is so, then NLP systems must recognize structural patterns and must know what they denote. Any system which cannot tell the difference between ‘I want you to eat’ and ‘I want to eat you’ is not going to be much use. It may be, however, that this information need not be stored in the form of a set of (extended) rewrite rules which are uniformly applied by some parsing or encoding algorithm. A number of people have implemented systems in which this information was embodied in the form of production system rules which recognized immediately the significance of word order and inflection, without any need for construction of intermediate constituents and without having access to independent specifications of the rules. There is a general impression that these systems begin to break down when subtle distinctions are encoded, but there is no doubt that the systems developed by Riesbeck (1978) and Small (1981) are effective as far as they go. Marcus (1980) even attempts to derive some of Ross’ (1986) constraints on transformations from computational properties of a specific architecture. The question of whether human beings have explicit access to the rules of grammar remains open. It is important to note, however, that although there is doubt about the exact form in which the grammar is stored, no NLP system will ever work adequately if it does not have access to at least an implicit set of rules about how structure encodes meaning.

Semantics

As far as implementing an NLP system is concerned, syntax is only interesting because it encodes meaning. We need to specify how meaning is to be represented internally for the system, in order for there to be something to be encoded and decoded. In the most general terms, what is to be encoded or decoded is something about the internal state of the system and its relation to the state of the world. The semantic theory we choose for our NLP system, then, must depend on how its internal state is represented and how it represents the world. This may put the computer on a radically different footing from the human being. It is quite conceivable that humans have some way of connecting some rather basic words, e.g. colour terms and basic actions, to the world via sensory and motor organs. They might further understand how words like if, all and necessarily support inferences - after all, the meanings of words like these are in some sense the inferences they support. In this case, the human language processing system might not need a semantic component at all. Recognition of structural patterns of words would allow the human to infer whatever was required so long as they had access to the meanings of individual nouns, adjectives and verbs, and these could be given by their connection to sensory and motor systems. This would mean that humans did not need to translate from English (or whatever their native tongue is) into some other language which was used for internal representation: English itself could be the language for internal representation, within which inferences could be made and facts about the world could be represented.

This line of argument is very similar to that put by workers in formal model theoretic semantics. In Montague grammar (Dowty et al., 1981) for instance, a very complex formal system called ‘intensional logic’ (IL) is developed in order that the meanings of English words and forms can be made precise. Certain English forms correspond to expressions in IL, and the meanings of expressions in IL are explained in great detail. This makes it possible to give the meanings of English forms in great detail; they are the same as the meanings of the corresponding expressions of IL, which is easier to talk about than English. There is, however, no suggestion that in order to understand a piece of English text it is necessary to translate it into IL. The meaning of the English is encoded by its own form, and recognition of the form would enable an English speaker to retrieve the meaning and act on it. IL is simply another language which can be used to say the same things as can be said in English, but which logicians and formal philosophers find easier to talk about. Translating into IL no more provides the meaning than does translating into French.

This is true for all choices of representation language. Translating into some internal representation does not constitute an analysis of meaning. It is only interesting if the system knows how to perform inferences on the basis of expressions of the internal language, and knows how to connect them to the world. If it can do this then it makes sense to say that the system understands the internal language, since it can act on the basis of messages written in it. If this internal language were English, we would say that the system speaks English as a native. If it is something else, such as Montague’s intensional logic, or ordinary predicate calculus, or whatever, then the system is like someone who speaks English as their native tongue and who has learnt O-level French. Such a person typically translates any French they hear into English, in order to be able to act on it, since although they know how French corresponds to English they have not internalized its meaning as French.

This leaves us with two options when we design the semantic level of an NLP system. We can either try to follow the work of Montague as far as possible and discover the real meaning of English words and phrases directly and hence develop an inference system which can act immediately upon the form of the English as output by the syntactic component. Or we can make an independent choice concerning the best language for the system to do inference in, and then try to provide a translation into that. Gazdar and colleagues’ GPSG (1985) seems to be a step in the first direction, as does Barwise and Perry’s (1983) ‘situation semantics’. The meanings of English forms are shown to be equivalent to expressions in some formal language such as IL. The semantics of the formal language are then developed, so that we know how it relates to the world and what things can be proved within it. It is then argued that the same connections to the world and chains of inference are supported by the corresponding English expressions. Suppose, for instance, that the formal language we were using was ordinary predicate calculus, so that the translation in the formal language of ‘There is a unicorn in the garden’ was something like some(x) (unicorn(x) and in(x, garden)). We know in predicate logic that this expression entails some(x) (unicorn(x)). This is the translation of ‘There is a unicorn’, and hence we know that this English sentence is entailed by the first one. The translation into predicate logic is for the benefit of sceptics who are not sure that they speak English correctly themselves, and who feel more at home in logic. Any ordinary speaker would simply say that the second sentence follows from the first directly by virtue of their meanings.
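The unicorn entailment can be checked mechanically. What follows is only a minimal sketch in Python, not anything proposed in the work cited: the tuple encoding of formulas, the three-element domain and the function names are all assumptions made for illustration. It enumerates every interpretation of unicorn and in over that small domain and confirms that none of them makes the first formula true and the second false. This is a bounded check over one finite domain, not a proof in predicate calculus.

```python
from itertools import product

# Formulas as nested tuples (an illustrative encoding, not from the paper):
#   ("some", var, body), ("and", f, g), ("pred", name, arg, ...)
PREMISE = ("some", "x", ("and", ("pred", "unicorn", "x"),
                                ("pred", "in", "x", "garden")))
CONCLUSION = ("some", "x", ("pred", "unicorn", "x"))

# Hypothetical domain: two anonymous individuals plus the garden itself.
DOMAIN = ("a", "b", "garden")


def holds(formula, model, env):
    """Evaluate a formula in a finite model under a variable assignment."""
    kind = formula[0]
    if kind == "pred":
        name, args = formula[1], formula[2:]
        # Bound variables are looked up in env; any other symbol is a constant.
        values = tuple(env.get(arg, arg) for arg in args)
        return values in model["facts"][name]
    if kind == "and":
        return holds(formula[1], model, env) and holds(formula[2], model, env)
    if kind == "some":
        var, body = formula[1], formula[2]
        return any(holds(body, model, {**env, var: d}) for d in model["domain"])
    raise ValueError(f"unknown formula kind: {kind}")


def all_models(domain):
    """Yield every interpretation of unicorn(_) and in(_, garden) over domain."""
    unary = [(d,) for d in domain]
    binary = [(d, "garden") for d in domain]
    for u_bits in product([False, True], repeat=len(unary)):
        for b_bits in product([False, True], repeat=len(binary)):
            yield {
                "domain": domain,
                "facts": {
                    "unicorn": {t for t, on in zip(unary, u_bits) if on},
                    "in": {t for t, on in zip(binary, b_bits) if on},
                },
            }


def entails_bounded(premise, conclusion, domain=DOMAIN):
    """True if no model over this domain satisfies premise but not conclusion."""
    return all(holds(conclusion, m, {})
               for m in all_models(domain) if holds(premise, m, {}))
```

Calling entails_bounded(PREMISE, CONCLUSION) finds no counter-model, whereas swapping the arguments does find one (a unicorn that is not in the garden), which matches the direction of entailment described in the text.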

This line of argument is attractive but difficult to follow through. The significance of particular syntactic patterns is hard to characterize in a way which supports regular inference rules. The difficulty should not deter long-term researchers, for whom it is probably the best way to proceed. Natural language, after all, is difficult to analyse and explain - the meanings of many innocuous statements still give philosophers a great deal of trouble. For practical systems, it is probably better to take the approach of giving the system something simpler as its native language and trying to force some sort of translation into this other language. This is obvious enough where the NLP system is serving as a front-end to some existing package, e.g. a database. The ‘native tongue’ of the system is clearly the database query language. It understands this language perfectly, in that it can act appropriately on any command given to it. It will be possible to provide a translation between this language and some subset of English. If we do this, it will be possible to converse with the database using English words and forms. What we will have done is to show the correspondence between the meanings of a restricted set of English forms and the expressions of the database query language. In this situation the system and the user will each work internally in their own preferred representation language, but the system will know enough about English to translate English forms into its own language. This matches the situation where the system knows nothing about English, so that the user has to do the translation. Again, both participants do their own work in their own preferred language, but this time it is the user who has to know about the correspondence between the two.
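The database front-end case can be illustrated with a very small sketch. Everything in it is an assumption made for the example: the employees table, the two English patterns and the function names are hypothetical, SQL stands in for whatever query language a real package would have, and regular expressions stand in for a proper grammar and parser. The point is only that the system acts in its own language and that the English subset is defined by its correspondence to that language.

```python
import re
import sqlite3

# A restricted set of English forms paired with expressions of the system's
# 'native tongue' (here SQL). Both patterns are hypothetical examples.
PATTERNS = [
    (re.compile(r"^who works in (\w+)\??$", re.IGNORECASE),
     "SELECT name FROM employees WHERE dept = ? ORDER BY name"),
    (re.compile(r"^how many people work in (\w+)\??$", re.IGNORECASE),
     "SELECT COUNT(*) FROM employees WHERE dept = ?"),
]


def translate(sentence):
    """Map an English sentence to (sql, parameters), or None if uncovered."""
    text = sentence.strip()
    for pattern, sql in PATTERNS:
        match = pattern.match(text)
        if match:
            return sql, (match.group(1).lower(),)
    return None


def make_db():
    """Build a toy in-memory database (illustrative data only)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE employees (name TEXT, dept TEXT)")
    conn.executemany(
        "INSERT INTO employees VALUES (?, ?)",
        [("Ann", "sales"), ("Bob", "sales"), ("Cy", "research")],
    )
    return conn


def ask(conn, sentence):
    """Answer an English question by translating it and running the query."""
    query = translate(sentence)
    if query is None:
        return None  # outside the restricted subset of English
    sql, params = query
    return [row[0] for row in conn.execute(sql, params)]
```

With the toy data above, ask(make_db(), "Who works in sales?") returns the two names in the sales department, while a question outside the covered subset returns None, which is the point at which a real system would have to ask the user to rephrase.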

There is a range of different languages which might be chosen. If we cannot make the machine really understand English in the sense of using it as its internal representation language, our choice of representation language may be influenced by two things. We may have a system which has a well-defined function, for which it has a suitable language. In this case this is the only reasonable candidate. We may, alternatively, want to develop a system which can be used for a variety of tasks. For such a system we would want a general representation language, since we do not know in advance what the system is going to be used for. We may want to choose a language which is known to have been used successfully in a range of general AI applications, since this will guarantee that our system can be used for a variety of purposes. The main candidates here are predicate logic; various languages with roughly the same expressive power as logic but with facilities for indexing information (e.g. frames (Minsky, 1975), scripts (Schank & Abelson, 1977), KRL (Bobrow & Winograd, 1977)); and variants on these with specific predicates selected as primitive notions (Schank, 1975). The choice between these must depend on the particular application. For completely general NLP systems, nothing is yet known which is as concise and expressive as English itself.

Pragmatics

No matter what internal representation we choose, there is a gap between what a given text or utterance means, and what it means in the context in which it was produced. Understanding the connections between the words of a sentence and the world, and the connections implied by the form of the sentence, is not enough to tell you what you are supposed to do when you read or hear it. Sentences are typically produced as part of a connected discourse or dialogue by a speaker in a social and physical context as an attempt to satisfy some goal of the speaker. To understand them involves discovering how they relate to this goal. Of course you can understand a sentence without then deciding to try to help the speaker achieve the goal, but you have not fully understood it until you have worked out why it was said.

In view of the unsatisfactory state of the semantic component of the typical NLP system, we cannot hope for anything particularly sophisticated at this point. The pragmatic component (or components) is parasitic on the semantic component in the same way that the semantic component was parasitic on the syntactic component. Any gaps in the semantic component due to our inability to represent the more subtle concepts which can be expressed in English will remain gaps. It is, however, worth surveying briefly the tasks which a normal speaker of English performs in the course of a dialogue or discourse. These may not, strictly speaking, be linguistic activities, but they are at least tasks performed at the same time as linguistic processing and as such will also be required of any full NLP system.

Real world knowledge. The first point is that any speaker of a natural language is assumed to have spent a considerable amount of time in the culture inhabited by its native speakers, or at least to be familiar with that culture. This includes physical laws which hold true everywhere, and hence should be common to people of different cultures, and social laws which may vary rather more. Any natural discourse leaves vast numbers of things unsaid on the grounds that they are so obvious there is no need to say them. To take a very simple example, consider the following story: ‘His bike had a puncture, so he went to Rayment’s to get a new inner tube. No-one was around when he went in, so he just picked one out of the basket and left. As he left he heard the owner shouting, so he started to run.’ Comprehension of this story depends on an array of implicit facts, about punctures and tyres, about what shops are for, about the system of monetary exchange, about society’s treatment of people who break its rules. Since I know that my readers will recognize that Rayment’s must be a bike shop from their knowledge of how you obtain whatever is needed for fixing punctures, and indeed that they will know that a puncture is something which needs to be fixed, I do not need to spell them out. Any NLP system which is to comprehend the above story must be able to make all these inferences. To solve this part of the problem involves first finding a framework within which such inferences can be made in a reasonable amount of time. Such frameworks have been proposed by various people, for example the scripts and frames mentioned in the section on Semantics. The second stage in solving the current problem is a matter of actually providing all the required knowledge. Nothing has been done about that: no computer system anywhere yet possesses enough knowledge to be able to fill in the gaps in anything but the most trivial discourses.

Coherence and cohesion. In connected text or speech, the meaning of a connected set of sentences is somehow greater than the sum of their individual meanings. In our story above, we know far more about the final sentence ‘As he left he heard the owner shouting, so he started to run’ as a result of its connection to the rest of the story than we would if we just heard it in isolation. In order to be able to do this we need a systematic way to connect sentences together and to ascertain the relations between them. There are typically two sorts of rule which can be used for this purpose. Cohesion rules provide local ‘glue’ for thematic links between sentences. An example of such a rule might be that the main topic of the discourse will always be referred to by a pronoun if this is at all possible. This rule has two functions. It enables the reader/listener to track the focus of the discourse, and hence directs any inference that they need to make in the right direction. It is useful in the given story because we know that ‘he’ in ‘he started to run’ probably still refers to the main focus of the story - there have been no indications that the focus has switched - and hence we can understand why he started to run. If the story had ended with ‘As John left the shop the owner came out. He was extremely angry, and shouted for the police’, the fact that John had been referred to by name rather than by a pronoun would have indicated that the focus of the story was about to be switched, and that the next pronoun would indicate where the focus had switched to. Linking words such as as and so also provide local cohesion, by specifying the relations between contiguous sentences. Most of the work in this area has been done by linguists who work within the tradition of ‘systemic grammar’ (Halliday, 1985; Halliday & Hasan, 1976), which is particularly concerned with the way syntactic structure is used for encoding aspects of the message other than the simple denotation.
Winograd (1972) used a systemic grammar in his well-known program SHRDLU, but without making much use of this component of the theory. The problem has, in fact, been largely ignored until very recently, when workers who tried to produce NLP systems which generated connected discourse found that their systems tended to produce very stilted text. Appelt (1985), McKeown (1985), Grosz and Sidner (1986) have all produced systems which produce text with reference to rules about the topic and focus of the overall discourse.

In addition to these local clues about the structure of the text, we also need relations between sentences which can only be established on the basis of inference. In our story, the so between ‘he heard the owner shouting’ and ‘he started to run’ suggests that the two ought to be causally connected. Our knowledge of cultural facts about the behaviour of people in shops enables us to confirm that there is a reasonable causal connection between the two, which we consequently accept. If the story had ended ‘he heard the owner shouting, so King Henry married yet another wife’ the causal connection suggested by so would seem far less acceptable. The rules to check such connections draw upon the same sort of knowledge that was discussed in the section on Real world knowledge, and hence cannot be developed until the problems discussed in that section are solved. However, rules to check causal and other links also require judgements about plausibility and obviousness of inference chains. Almost any pair of statements can be connected by a series of three or four inferences - what is it that makes some more convincing than others? This part of the problem has been discussed by cognitive scientists (Hobbs, 1979) and psychologists (Sperber & Wilson, 1986), but nothing concrete enough to be useful within NLP systems has yet emerged.

Discourse and dialogue structure. The final problem that needs to be tackled for an NLP system to participate correctly in unrestricted dialogue is to discover the function of a sentence in context. Everything we have considered so far has concerned finding out what the sentence or discourse means in isolation, though the matters we discussed in the last two sections were indeed concerned with things that were implied rather than said. At the end of the day, however, we have to find out why it was said if we are to respond appropriately. This involves two parts: what part did the given sentence play in the discourse as a whole, and what action does the speaker want us to take in response?

The earliest work on the allocation of sentence roles within an overall text came from anthropological research on folk tales, where it was discovered that the vast majority of such tales fitted a few simple patterns. These patterns gave rise to the notion of story grammars, i.e. patterns which constrained what could reasonably be said when. The following grammar is typical, if exceptionally simple; see Fig. 4.

STORY → SETTING EPISODE

EPISODE → COMPLICATION RESOLUTION

Fig. 4. Story grammar.

The striking thing about this grammar is that the right-hand side components of rules are not decomposed into elements which could be recognized by their structural properties. There may be words or standard forms which are frequently found in sentences whose function corresponds to one of the categories in Fig. 4 - ‘Once upon a time’ is probably a good clue that you are in the SETTING, ‘But then one day’ probably introduces a COMPLICATION, and so on, but stories other than folk tales do not seem to depend on such explicit markers for their structure. This is a problem, since the function of the grammar is to help the reader place the sentences in context, and hence understand them better. If fitting them into the grammar requires them to be understood first, it is unclear what the point of doing it is.
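The difficulty can be made concrete with a small sketch. The two rules are those of Fig. 4 and the two cue phrases are the ones quoted above; the Python encoding, the function names and the idea of labelling a sentence by its opening words are assumptions made purely for illustration.

```python
# The two rewrite rules of Fig. 4, encoded as a dictionary (an assumed encoding).
GRAMMAR = {
    "STORY": ["SETTING", "EPISODE"],
    "EPISODE": ["COMPLICATION", "RESOLUTION"],
}

# Surface cues mentioned in the text; matching on sentence openings is a
# deliberately naive, hypothetical labelling strategy.
CUES = {
    "once upon a time": "SETTING",
    "but then one day": "COMPLICATION",
}


def expand(symbol):
    """Rewrite a symbol down to the categories that have no rule of their own."""
    if symbol not in GRAMMAR:
        return [symbol]
    leaves = []
    for part in GRAMMAR[symbol]:
        leaves.extend(expand(part))
    return leaves


def label(sentence):
    """Guess a sentence's story category from an explicit marker, if any."""
    lowered = sentence.lower()
    for cue, category in CUES.items():
        if lowered.startswith(cue):
            return category
    return None  # no structural clue: the sentence must be understood first
```

Expanding STORY yields the sequence SETTING, COMPLICATION, RESOLUTION, but none of these categories is decomposed any further. A sentence from the bicycle story such as ‘He just picked one out of the basket and left’ carries no marker at all, so label returns None for it, and the grammar gives no help in placing it.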

Nonetheless, the notion that stories and other types of discourse have a regular structure which is used to help a reader or listener understand the function of individual sentences does seem helpful. In particular, dialogues seem very definitely to have structures which are used to indicate to a listener what sort of response is required. The theory of exactly what sort of structures are required is not very settled, but a number of people (e.g. Levin & Moore, 1977) have suggested that something like the rules of a game are appropriate. The moves of a typical conversation involve actions such as bidding to establish a particular conversation, acceptance of the conversation, setting up the parameters, instantiating facts, and termination. Without such stereotyped rules it is not easy to work out what is appropriate to say in response to an utterance, and an NLP system without access to a set of rules will appear extremely gauche. The range of moves and rules is not yet widely agreed, but computer implementations of the theory are being developed (see for instance Petrie-Brown, 1986).

The second aspect of the function of a text or utterance has been studied in philosophy of language under the name ‘speech act theory’ (Searle, 1969; Austin, 1962). The problem here is that a simple sentence such as ‘Do you know the time?’ can be uttered in different contexts with utterly different intentions. To utter it while standing underneath a clock to someone who is an hour late for an appointment is entirely different from uttering it to a stranger in the street. In neither case is it the simple yes/no question it appears to be - in the first it is a reproach, in the second it is a request for information. The goal of speech act theory was first to enumerate the sorts of function an utterance might have - queries, promises, threats, reproaches, informatives, and so on; and then to establish conditions under which an utterance could be interpreted as having a particular function - when, for instance, can a yes/no question function as a reproach? This work was later treated computationally in terms of planning theory (Allen & Perrault, 1980; Cohen & Perrault, 1979). The effects of sentences were characterized in the same way that the effects of physical actions are characterized in standard planning systems, e.g. STRIPS (Fikes & Nilsson, 1971). Other actions were allowed to have states of knowledge as preconditions, in addition to the simple physical preconditions that they normally have, so that the act of opening a safe requires knowledge of its combination as well as physical proximity to it. The link between the information transfer effects of an utterance and its function in a context was then inferred by standard planning mechanisms.
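The safe example can be sketched in a few lines. This is an illustration under stated assumptions, not a reconstruction of any of the systems cited: the operator names, the fact strings and the two agents are hypothetical, and a plain breadth-first forward search is used in place of the means-ends analysis of STRIPS itself. What it is meant to show is that an utterance (here an act of informing) is just another operator, one whose preconditions and effects are states of knowledge.

```python
from collections import deque
from typing import NamedTuple


class Operator(NamedTuple):
    """A STRIPS-style operator: preconditions, add list and delete list."""
    name: str
    preconditions: frozenset
    add: frozenset
    delete: frozenset = frozenset()


# Hypothetical operators; facts are plain strings for simplicity.
OPERATORS = [
    # A physical action with no preconditions.
    Operator("walk_to_safe", frozenset(),
             frozenset({"at(agent, safe)"})),
    # A speech act: the owner can only inform the agent of what the owner knows.
    Operator("inform_combination",
             frozenset({"knows(owner, combination)"}),
             frozenset({"knows(agent, combination)"})),
    # Opening the safe needs both physical proximity and knowledge.
    Operator("open_safe",
             frozenset({"at(agent, safe)", "knows(agent, combination)"}),
             frozenset({"open(safe)"})),
]


def plan(initial, goal, operators=OPERATORS):
    """Breadth-first forward search; returns operator names or None."""
    start, goal = frozenset(initial), frozenset(goal)
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        state, steps = queue.popleft()
        if goal <= state:
            return steps
        for op in operators:
            if op.preconditions <= state:
                successor = (state - op.delete) | op.add
                if successor not in seen:
                    seen.add(successor)
                    queue.append((successor, steps + [op.name]))
    return None
```

Starting from a state in which only the owner knows the combination, plan({"knows(owner, combination)"}, {"open(safe)"}) has to include the informing act before open_safe becomes applicable. Starting from a state in which nobody knows the combination, no plan exists and the function returns None.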

This work was originally developed in order to see how ‘indirect’ speech acts, e.g. the question about the time, might be interpreted. As such it might seem that practical NLP systems could be developed without it, since users might easily be asked to speak directly to such systems and not try to be clever. However, Appelt’s (1985) work on language generation in a co-operative situation makes it clear that even apparently simple tasks, e.g. choosing a form of words with which to refer to a physically present object, require similar analysis.

Conclusions

We have sketched some of the tasks that need to be performed by an unrestricted NLP system. Systems which perform some of these to a useful, though limited, extent are currently available. The components which can be produced using state-of-the-art techniques tend to be the ones concerned with purely linguistic knowledge about lexical and syntactic rules, and simple semantic theories. It is the author’s belief that NLP systems which fail to perform the rest of the task may be more irritating to use than systems which do not have any NL component at all. Users may well be as annoyed by systems which appear to understand English, but which continually respond either incorrectly or by saying ‘I’m sorry I didn’t understand that - could you rephrase it please’, as by ones which give them menus to go through or forms to fill in. The development of an NLP system as a testbed for linguistic theory is invaluable. It is unclear how soon it will be equally useful in practice.

References

Allen, J. F. & Perrault, C. R. (1980) Analysing intention in utterances. Artificial Intelligence, 15, 143-178.

Appelt, D. E. (1985) Planning English Sentences. CUP, Cambridge.

Austin, J. L. (1962) How To Do Things With Words. Clarendon Press, Oxford.

Barwise, J. & Perry, J. (1983) Situations and Attitudes. Bradford Books, Cambridge, MA.

Bobrow, D. & Winograd, T. (1977) An overview of KRL-0, a knowledge representation language. Cognitive Science, 1, 3-46.

Bresnan, J. W. (1978) A realistic transformational grammar. In: Linguistic Theory and Psychological Reality. (eds M. Halle, J. W. Bresnan & G. A. Miller) Cambridge, MA.

Bresnan, J. W. & Kaplan, R. (1982) Lexical functional grammar: a formal system for grammatical representation. In: The Mental Representation of Grammatical Relations. (ed. J. W. Bresnan) Cambridge, MA.

Charniak, E. & McDermott, D. V. (1985) Introduction to Artificial Intelligence. Reading, MA.

Cohen, P. R. & Perrault, C. R. (1979) Elements of a plan based theory of speech acts. Cognitive Science, 3, 177-212.

Dowty, D. R., Wall, R. & Peters, S. (1981) Introduction to Montague Semantics. Reidel, Dordrecht.

Erman, L. D. & Lesser, V. R. (1975) The HEARSAY-II speech understanding system. IJCAI, 4, 483-490.

Fikes, R. E. & Nilsson, N. J. (1971) STRIPS: a new approach to the application of theorem proving to problem solving. Artificial Intelligence, 2, 189-208.

Gazdar, G., Klein, E., Pullum, G. K. & Sag, I. A. (1985) Generalized Phrase Structure Grammar. Basil Blackwell, Oxford.

Grosz, B. J. & Sidner, C. L. (1986) Attention, intentions and the structure of discourse. Computational Linguistics, 12, 175-205.

Halliday, M. A. K. (1985) An Introduction to Functional Grammar. Edward Arnold, London.

Halliday, M. A. K. & Hasan, R. (1976) Cohesion in English. Longman, London.

Hobbs, J. R. (1979) Coherence and coreference. Cognitive Science, 3, 67-90.

Kaplan, R. (1973) A general syntactic processor. In: Natural Language Processing. (ed. R. Rustin) Algorithms Press, New York.

Kay, M. (1983) When meta-rules are not meta-rules. In: Natural Language Parsing. (eds K. Sparck-Jones & Y. Wilks) Ellis Horwood, Chichester.

Kay, M. (1985) Parsing in functional unification grammar. In: Natural Language Parsing. (eds D. R. Dowty, L. Karttunen & A. M. Zwicky) CUP, Cambridge.

Levin, J. A. & Moore, J. A. (1977) Dialogue games: metacommunication structure for natural language interaction. Cognitive Science, 1, 395-420.

Marcus, M. P. (1980) A Theory of Syntactic Recognition for Natural Language. MIT Press, Cambridge, MA.

McDonald, D. D. (1983) Natural language generation as a computational problem: an introduction. In: Computational Models of Discourse. (eds J. M. Brady & R. C. Berwick) MIT Press, Cambridge, MA.

McKeown, K. (1985) Text Generation. CUP, Cambridge.

Minsky, M. (1975) A framework for representing knowledge. In: The Psychology of Computer Vision. (ed. P. H. Winston) McGraw-Hill, New York.

Pereira, F. C. N. & Warren, D. H. D. (1980) Definite clause grammars for language analysis - a survey of the formalism and a comparison with augmented transition networks. Artificial Intelligence, 13, 231-278.

Petrie-Brown, A. (1986) A Computational Model of Discourse. MA dissertation, University of Sussex.

Ramsay, A. M. & Barret, M. R. (1987) AI in Practice: Examples in POP-11. Ellis Horwood, Chichester.

Riesbeck, C. K. (1978) An expectation-driven production system for natural language understanding. In: Pattern Directed Inference Systems. (eds D. A. Waterman & R. Hayes-Roth) Academic Press, New York.

Ross, J. R. (1986) Infinite Syntax. Ablex Publishing Company, Cambridge, MA.

Schank, R. C. (1975) The primitive ACTS of conceptual dependency. In: Theoretical Advances in Natural Language Processing 1, pp. 38-41. Cambridge, MA.

Schank, R. C. & Abelson, R. P. (1977) Scripts, Plans, Goals and Understanding. Lawrence Erlbaum, New Jersey.

Schank, R. C. & Riesbeck, C. K. (1981) Inside Computer Understanding: Five Programs Plus Miniatures. Lawrence Erlbaum, New Jersey.

Searle, J. R. (1969) Speech Acts: an Essay in the Philosophy of Language. CUP, Cambridge.

Sidner, C. L. (1983) Focusing in the comprehension of definite anaphora. In: Computational Models of Discourse. (eds J. M. Brady & R. C. Berwick) MIT Press, Cambridge, MA.

Small, S. (1981) Viewing word expert parsing as linguistic theory. IJCAI, 7, 70-76.

Sperber, D. & Wilson, D. (1986) Relevance. Basil Blackwell, Oxford.

Stefik, M. (1981) MOLGEN I and II. Artificial Intelligence, 16, 111-170.

Sussman, G. J. & Steele, G. L. Jr. (1980) CONSTRAINTS - a language for expressing almost-hierarchical descriptions. Artificial Intelligence, 14, 1-39.

Webber, B. L. (1983) So what can we talk about now? In: Computational Models of Discourse. (eds J. M. Brady & R. C. Berwick) MIT Press, Cambridge, MA.

Winograd, T. (1972) Understanding Natural Language. Academic Press, New York.

Winograd, T. (1982) Language as a Cognitive Process. Addison Wesley, Reading, MA.
