Extraction of pragmatic and semantic salience from spontaneous spoken English

Tong Zhang *, Mark Hasegawa-Johnson, Stephen E. Levinson

Department of Electrical and Computer Engineering, Beckman Institute, University of Illinois at Urbana-Champaign, 405 N. Mathews Avenue, Urbana, IL 61801, USA

Received 1 January 2005; received in revised form 16 July 2005; accepted 19 July 2005

Speech Communication xxx (2005) xxx-xxx
www.elsevier.com/locate/specom
ARTICLE IN PRESS
0167-6393/$ - see front matter © 2005 Published by Elsevier B.V.
doi:10.1016/j.specom.2005.07.007

* Corresponding author. Tel.: +1 217 328 1542; fax: +1 217 244 8371. E-mail addresses: [email protected] (T. Zhang), [email protected] (M. Hasegawa-Johnson), [email protected] (S.E. Levinson).

Abstract

This paper computationalizes two linguistic concepts, contrast and focus, for the extraction of pragmatic and semantic salience from spontaneous speech. Contrast and focus have been widely investigated in modern linguistics as categories that link intonation and information/discourse structure. This paper demonstrates the automatic tagging of contrast and focus for the purpose of robust spontaneous speech understanding in a tutorial dialogue system. In particular, we propose two new transcription tasks, and demonstrate automatic replication of human labels in both tasks. First, we define the focus kernel to represent those words that contain novel information neither presupposed by the interlocutor nor contained in the preceding words of the utterance. We propose detecting the focus kernel based on a word dissimilarity measure, part-of-speech tagging, and prosodic measurements including duration, pitch, energy, and our proposed spectral balance cepstral coefficients. In order to measure word dissimilarity, we test a linear combination of ontological and statistical dissimilarity measures previously published in the computational linguistics literature. Second, we propose identifying symmetric contrast, which consists of a set of words that are parallel or symmetric in linguistic structure but distinct or contrastive in meaning. Symmetric contrast identification is performed in a way similar to focus kernel detection. The effectiveness of the proposed extraction of symmetric contrast and focus kernel has been tested on a Wizard-of-Oz corpus collected in the tutoring dialogue scenario. The corpus consists of 630 non-single-word/phrase utterances, containing approximately 5700 words and 48 min of speech. The tests used speech waveforms together with manual orthographic transcriptions, and yielded an accuracy of 83.8% for focus kernel detection and 92.8% for symmetric contrast detection. Our tests also demonstrated that the spectral balance cepstral coefficients, the semantic dissimilarity measure, and part-of-speech played important roles in symmetric contrast and focus kernel detection.

© 2005 Published by Elsevier B.V.


Keywords: Spoken language understanding; Spoken dialogue systems; Computational linguistics; Information extraction


1. Introduction

Words are tools; in real speech, every word is deployed for the purpose of achieving a human goal. The fields of computational semantics and pragmatics study quantifiable goal variables—variables that encode quantifiable aspects of the goals served by a word in context—and their semantic and contextual correlates. This paper describes the computation of two semantic and pragmatic goal variables, focus and contrast, from spontaneous speech.

The paper is organized as follows. The remainder of Section 1 explains why we are interested in annotating focus and contrast, defines the aspects of focus and contrast that are under study with examples from an intelligent tutoring system (ITS) corpus, and then puts forward the objectives of our study in this paper. Section 2 provides some background in support of our work and describes related work in modern linguistics and computational linguistics. Section 3 describes the ITS corpus in detail, with particular attention paid to annotations and corpus statistics of the proposed focus and contrast variables. Sections 4 and 5 describe the algorithms implemented for the purpose of detecting the proposed focus and contrast variables: Section 4 describes prosodic analysis, and Section 5 describes the measurement of word semantic similarities. Section 6 describes system integration and results of experimental evaluation using the ITS corpus. Section 7 discusses and concludes our work.

1.1. Motivation

The motivation of this study is to achieve robust spontaneous spoken language understanding (SSLU) in an intelligent tutoring dialogue system. The system is intended to provide a computer-based environment for education in math and physics, using the Lego construction set, with children of primary and early middle school ages (9-12 years old). Due to the characteristics of the dialogue scenario, as we describe in Section 3.1, the child users' spontaneous utterances are often dysfluent, ungrammatical, and even incoherent. Our robust speech understanding system design under these circumstances basically involves two steps: (1) Classification of each utterance into one of a list of 30 tutoring events. Similar to call types in an automatic call center or call router (Gorin et al., 2002; Chu-Carroll and Carpenter, 1999), the tutoring events are used to summarize the content meaning of utterances in the tutoring dialogue scenario in a broad way. For example, the tutoring event AskForPlayInstruction means that the user asks a question requesting instruction on how to play with the Legos; SpinSpeed means that the user is talking about the spinning speed of the Lego gears; and ExplainAction means that the user explains what is being done with the Legos. (2) Sometimes the tutoring event itself cannot provide sufficient information for the computer to produce a proper response. For example, when the tutoring event ArithmeticComputation is detected, sometimes the computer needs to know what the type of the arithmetic computation is; if it is division, then the computer needs to know what the dividend and divisor are in order to respond properly. Such detailed information needs syntactic/semantic structure parsing or named entity recognition (Zhang, 2004).
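The two-step design can be caricatured as follows. The event names come from the text, but the rule-based stubs below are purely hypothetical placeholders for the trained classifiers and parsers the system would actually use:

```python
def classify_event(utterance):
    """Step 1: map an utterance to a tutoring event (illustrative stub).
    A real system would use a trained classifier over ~30 event types."""
    if any(tok.isdigit() for tok in utterance.split()):
        return "ArithmeticComputation"
    if "how" in utterance.lower() and "play" in utterance.lower():
        return "AskForPlayInstruction"
    return "ExplainAction"

def extract_details(event, utterance):
    """Step 2: some events need finer-grained slots (e.g. dividend/divisor)."""
    if event == "ArithmeticComputation" and "divided by" in utterance:
        nums = [int(t) for t in utterance.split() if t.isdigit()]
        if len(nums) >= 2:
            return {"operation": "division",
                    "dividend": nums[0], "divisor": nums[1]}
    return {}

utt = "40 divided by 8 is 5"
ev = classify_event(utt)
print(ev, extract_details(ev, utt))
```

The point of the sketch is only the division of labor: a coarse event label first, then slot filling only for the events that require it.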

To analyze the content meaning of an utterance, we are interested in extracting a small set of words, from the utterance, that encode pragmatically and semantically salient information. We investigate the computerization of two linguistic concepts, focus and contrast, that are assumed to be useful for content summarization and structure parsing of spoken messages. Both of the concepts have reasonably clear published definitions. We wish to adapt the published definitions as necessary in order to define a corpus transcription experiment, and to train and test algorithms that automatically detect these two categories of salience based on cues measured in the speech waveform and in its orthographic transcription.


1.2. Focus

The information structure of a sentence can be partitioned into presupposition and focus: presupposition is what the interlocutor assumes to be true when the sentence is elicited in a conversation, and focus is the non-presupposed part of the sentence (Chomsky, 1971; Zubizarreta, 1998). For example (T represents tutor and U represents user; focus is marked by [ ]F),

(1) T: What are you exploring there?

U: [Seeing if the small gears move the big gears.]F

(2) T: How many times does the small gear spin until they line up again?

U: I think it goes around [one and a half]F times.


By definition, focus is indicative of pragmatically and semantically new information not presupposed by the interlocutor. If focus can be reliably detected, it should be possible to use the distinction between focus and presupposition to detect new information embedded in an utterance. Speakers will often signal the focus of a sentence by the use of pitch accent (we use pitch accent to mean prosodic prominence marked by F0 extrusion; the same word is usually also marked by the other acoustic correlates of prominence, including duration, energy, and spectral balance). Pitch accent marks constituents within an utterance as highlighted or unexpected; it has been argued that constituents outside focus are expected, and hence tend to be unaccented (Kadmon, 2001; Zubizarreta, 1998; Hedberg and Sosa, 2001). For example (pitch-accented words are marked with subscript a),

(3) T: Which gear are you counting?

U: I am counting the smalla gear.

(4) T: Which gear do you think is the strongest?

U: Probably the largea gear.


However, the phonological manifestation is not straightforward: pitch accent can only approximate the location of focus in a sentence. For example, in the sentence They turn in the oppositea direction, the accented word 'opposite' is the focus for the question What can you tell me about the directions they turn but not for the question What else do you notice? The latter question requires the sentence 'they turn in the opposite direction' to be the focus for interpretation. Such ambiguity in the pragmatic interpretation of a single accent has been known traditionally as the focus projection phenomenon, demonstrating that focus expressed by a single accent can project to a larger linguistic constituent than just the word with pitch accent.

Since focus is a syntactic constituent, the boundaries of focus need to be determined in order to identify it. A sentence may have multiple foci, and the size of a focus may vary from a single word to a phrase or even a sentence. It is difficult to automatically extract syntactic constituents containing novel information without making use of a complete parse tree for the sentence in question. Even with a parse tree available, automatically selecting the right constituents would be difficult; for example, in the following exchange,

(5) T: Which gear are you counting?

U: I am counting the [small]F gear [in my hand]F,

it would be difficult for an automatic algorithm to determine that the focus consists of a single word and a prepositional phrase; it would be nearly impossible without access to a correct parse of the sentence. It is even harder to extract focus from spontaneous speech, since spontaneous speech often has loose grammatical structure, dysfluency, and inconsistency between linguistic segments and acoustic segments. For example ('...' represents silence):

(6) T: What happens when you spin the left gears?

U: Ahmm ... When you [after it goes around once the other one goes around ... the same ... the same I mean it goes around ... you know you only have to spin it around once ... and that makes sense basically because they are the same size.]F

To robustly understand spontaneous speech, wepropose labeling individual words containing new


information neither presupposed by the interlocutor nor contained in the preceding part of the utterance. Such a word is usually a content word because of the information content requirement. We hypothesize that words matching this definition will typically be the semantically salient part of the focus. Therefore, we call each of these words a focus kernel. In the following examples, focus kernels are marked in bold:

(7) T: What happens to the different gears as you spin the one at the end?

U: They move with the single gear that I'm spinning.

(8) T: Oh, are you having fun?

U: Yeah, it's kind of interesting.

(9) T: How many times would it take for the reds to come back on top?

U: It would take three times to have the red be back on top.
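A crude sketch of the focus kernel idea, applied to example (7): keep content words that are neither in the tutor's question nor repeated earlier in the utterance. The function-word list and the exact-string novelty test are simplifications invented for this illustration; the actual system uses part-of-speech tagging, prosody, and graded dissimilarity measures rather than literal matching (note, e.g., that 'spinning' survives below even though 'spin' occurred in the question, because there is no stemming):

```python
# Hypothetical stop-list standing in for real part-of-speech tagging.
FUNCTION_WORDS = {"i", "i'm", "am", "is", "are", "the", "a", "an", "it",
                  "it's", "that", "this", "with", "to", "be", "on", "and",
                  "of", "in", "they", "you", "as", "at", "what", "yeah"}

def focus_kernel_candidates(utterance, presupposed):
    """Content words not presupposed by the tutor's question and not
    repeated earlier in the utterance (crude exact-match novelty check)."""
    seen = {w.lower() for w in presupposed}
    kernels = []
    for w in utterance.split():
        wl = w.lower().strip(".,?!")
        if wl in FUNCTION_WORDS or wl in seen:
            continue
        kernels.append(wl)
        seen.add(wl)
    return kernels

# Example (7) from the text:
q = "What happens to the different gears as you spin the one at the end".split()
print(focus_kernel_candidates("They move with the single gear that I'm spinning.", q))
```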


1.3. Contrast

Contrast is a concept having multiple senses: (1) In logic, two propositions are defined to be contrastive if it is impossible for them to be true simultaneously. For example, in the sentence Bach was an organ mechanic; Mozart knew little about organs, the two propositions are not contrastive, whereas they become contrastive when 'Mozart' is replaced by 'Bach' at the beginning of the second sentence (Bosch and van der Sandt, 1999). (2) The discourse relation called contrast is induced by 'but,' and constitutes a pair (or pairs) of contrasted alternatives, which can be predicates (e.g., John cleaned up the room, but he didn't wash the dishes), individual words (e.g., John cleaned up the room, but Bill didn't), or propositions (e.g., It is raining, but we go out for a walk) (Umbach, 2004). (3) Some linguists use contrast to denote the mutually exclusive disjunction between the words contributing to a fact and other alternatives made available by context (Vallduví and Vilkuna, 1998). It has been argued that focus in general establishes a contrast, since novel information usually conveys a contrast between a fact and the potential alternatives (Bolinger, 1961; Kruijff-Korbayova and Steedman, 2003). For example, in the sentence Last night they had a party, there is a contrast between the focus 'party' and any other alternative activities of the group. (4) Symmetric contrast consists of a set of words that are parallel or symmetric in linguistic structure but mutually exclusive in meaning; the stress on one word is motivated by its distinction from the others, e.g., 'American' and 'Canadian' in An American farmer was talking to a Canadian farmer (Rooth, 1992; Umbach, 2004).

In this study, we seek to make use of the knowledge about contrast from the pragmatics and prosody literature, for the purpose of detecting pairs of symmetrically contrasted words that are assumed to be useful for spontaneous speech understanding. Symmetric contrast can occur within a sentence, e.g. (contrasted words are marked in bold),

(10) U: The large gear has five times as many teeth as the small ones.

(11) U: How about small and big and medium?

Topics and/or foci of conjunct phrases or coordinated sentences (joined by 'and', 'but', etc.) can also constitute symmetric contrast, e.g.

(12) T: Where are the gears?

U: The red gear is on the bottom and the yellow gear is on the top.

(13) T: How are the gears spinning?

U: The two outside ones spin in the same direction and the middle one spins in the opposite direction.

The words participating in a symmetric contrast satisfy semantic parallelism, which has two implications: (a) the conjunct alternatives have to be semantically independent of each other, in the sense that neither of them subsumes the other; and (b) there has to be a "common integrator," i.e., a concept subsuming both conjunct alternatives (Umbach, 2004).

1.4. Study objective

As its primary technical goal, the study intends to test whether the proposed word tags, i.e., focus kernel and symmetric contrast, can be reliably


annotated in a spontaneous speech corpus using both manual and automatic annotation. As part of this evaluation, this study tests the relationship of focus kernel and symmetric contrast with the following prosodic and pragmatic variables: (1) prosodic prominence—experiments described in this paper test the reliability of prosodic prominence in the automatic identification of focus kernel and symmetric contrast; (2) novelty and semantic parallelism, the semantic attributes of focus kernel and symmetric contrast. Information-theoretic measures of novelty and semantic parallelism are implemented, based on algorithms proposed in the computational linguistics literature. The implemented algorithms are tested for the purpose of automatically identifying focus kernel and symmetric contrast; and (3) part-of-speech. In addition, this study discusses the usefulness of focus kernel and symmetric contrast for spontaneous speech understanding.


2. Background and related work

Focus and contrast in modern linguistics are used to "account for the correlation between certain prosodic patterns and certain pragmatic and semantic effects" (Kadmon, 2001). Sections 2.1 and 2.2 describe related work on contrast and focus published in the linguistics and computational linguistics literature. Section 2.3 describes previous work on word dissimilarity measures (given a pair of words, how much novel information one word contains with respect to the other) in natural language processing (NLP).

2.1. Focus

The information structure of sentences can be defined in various ways, e.g., presupposition-focus (Chomsky, 1971; Jackendoff, 1972; Zubizarreta, 1998), topic-comment (Dahl, 1969), theme-rheme (Firbas, 1964, 1966; Bolinger, 1965; Steedman, 2000), given-new (Halliday, 1967; Kay, 1975), background-focus (Dahl, 1969; Steedman, 2000), and background-kontrast (Vallduví and Vilkuna, 1998). Although the information structure definitions are diverse, overlapping and even conflicting, they can be categorized into two dimensions: (1) topic-comment or theme-rheme, in which one part describes what the discourse is talking about, and the other part advances the discourse; and (2) given-new or background-kontrast, in which one part conveys information that has been known, and the other part conveys novel information distinguishing the actual occurrence from potential alternatives triggered by context (Kruijff-Korbayova and Steedman, 2003; Lee, 2003). In defining the information structure of sentences in the given-new or background-kontrast dimension, we follow the partition of Chomsky (1971), Jackendoff (1972) and Zubizarreta (1998), i.e., presupposition and focus (see Section 1.2).

Contrastive focus is a special kind of focus that expresses exhaustive identification of an element (or subset) given a set of candidates. Exhaustive identification means that the selection of a candidate excludes all other candidates. Ordinary focus introduces new, non-presupposed information. Contrastive focus exhaustively selects one or more candidates from a set of candidates that are presupposed by the interlocutor (Umbach, 2004; Lee, 2003; Hedberg and Sosa, 2001). For example, It was [John]CF that baked bread for our breakfast. When the answer to an alternative question is to choose a disjunct from the disjunctive alternatives, the choice is thought to be exhaustive. For example, the choice of 'money' or 'pen' for Did the baby pick the money first, or did she pick the pen first? is contrastive focus (Lee, 2003). Another case of exhaustiveness is correction in dialogues: contrastive focus corrects an explicit or implicit assumption made by the interlocutor.

Much of the literature studying focus seems to be motivated by the pitch accent correlate of focus (e.g., Hedberg and Sosa, 2001; Pierrehumbert and Hirschberg, 1990): semantically or pragmatically non-presupposed new information is usually signaled by a language-dependent accentual F0 contour. Human detection of focus is very sensitive to the presence or absence of accenting of the focused item and the deaccenting of post-focus items (Welby, 2003; Xu et al., 2004; Gussenhoven, 2002). Nuclear pitch accent (the last pitch accent in an intermediate phrase) and prenuclear pitch accent have different effects on the listener's interpretation of focus


(Welby, 2003). However, as we have described, there is no decisive information source that clearly marks the focused syntactic constituent. For this reason, automatic detection of focus has received little study. In one relevant study, Heldner et al. (1999) tried to locate narrow focus (focus consisting of a single accented word) in Swedish speech: they used energy and high-frequency emphasis to automatically detect narrow focus within three-word phrases. Their automatic focus detector was designed based on the assumption that there was only one focused word per phrase and that the focused word was accented. About two thirds of the focused words were correctly detected.
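Energy and high-frequency emphasis of this kind can be approximated directly from the waveform. The following sketch (with an assumed 1 kHz band split and Hann windowing, not Heldner et al.'s exact parameterization) computes a per-frame spectral balance in dB:

```python
import numpy as np

def band_energies(frame, sr, split_hz=1000.0):
    """Energy below and above split_hz in one windowed frame (via FFT)."""
    spec = np.abs(np.fft.rfft(frame * np.hanning(len(frame)))) ** 2
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sr)
    low = spec[freqs < split_hz].sum()
    high = spec[freqs >= split_hz].sum()
    return low, high

def spectral_balance_db(frame, sr):
    """High-band emphasis in dB relative to the low band."""
    low, high = band_energies(frame, sr)
    return 10.0 * np.log10((high + 1e-12) / (low + 1e-12))

# Synthetic check: a 200 Hz tone is low-band dominated,
# while a 3 kHz tone is high-band dominated.
sr = 16000
t = np.arange(1024) / sr
print(spectral_balance_db(np.sin(2 * np.pi * 200 * t), sr))   # negative
print(spectral_balance_db(np.sin(2 * np.pi * 3000 * t), sr))  # positive
```

Accented syllables tend to shift energy toward higher bands, so a per-word statistic of this measure is one plausible prominence cue; the paper's spectral balance cepstral coefficients are a more elaborate parameterization of the same intuition.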

In general, pitch accent in spoken English marks subconstituents within an utterance as highlighted. A central reason for an item to be highlighted is novelty. Focus is marked by pitch accent because it expresses novel information. However, pitch accent may also signal something else; there are a variety of factors affecting the placement of pitch accent. As Kadmon (2001) pointed out, pitch accent can be assigned to a given item with special importance, while novelty may not be marked with pitch accent. Therefore, pitch accent is an important cue, but neither a necessary nor a sufficient condition for novelty.

2.2. Contrast

Contrast is widely investigated in modern linguistics, with special attention given to contrastive topic and contrastive focus (the latter has been addressed in Section 2.1). Contrastive topic can be defined as a syntactic constituent that is both topic-marked and contains a focused item. That means the constituent on one hand forms a topic, and on the other hand contains novel information in contrast with other alternatives triggered by the context (Lee, 1999). In the following example,

(14) T: What happens to the gears?

U: [The large gear]CT only spins once and it spins slower,


'gears' is the topic that the speaker is addressing, but the choice of 'large' is new and in contrast with 'the other gears.' In addition, the topics of a sequence of independent contrastive answers to a question also form contrastive topics (Krifka, 1999), e.g.

(15) T: What about the large gear and the medium gear?

U: The largeCT gear spins left, and the mediumCT gear spins right.

In (15), 'large' and 'medium' are identified as contrastive topic and symmetric contrast simultaneously.

In the literature, much investigation of contrast has focused on the conceptual issues of some problematic cases and the specification of the types of pitch accent correlate (e.g., Gundel and Fretheim, 2001; Pierrehumbert and Hirschberg, 1990; Kadmon, 2001; Lee, 1999). In general, contrast is well and clearly defined in linguistics. However, a small set of problematic cases receive a great deal of attention. For example, authors are divided on whether to identify a negated presupposition as contrastive focus or contrastive topic (e.g., the words 'anything extraordinary' in He did the only thing you could do. He hasn't done anything extraordinary). In addition, published studies agree that contrast is typically marked with pitch accent, but there is considerable disagreement on the particular patterns of pitch accent used to mark contrastive topic and/or contrastive focus (e.g., Lee, 1999; Hedberg and Sosa, 2001). These debates are usually framed in terms of the pitch accent categories specified by the ToBI (tones and break indices) prosodic annotation standard (Beckman and Ayers, 1994), e.g., whether contrastive topic will be marked by H*+!H or not. However, the disputes in the literature generally do not affect the integrity of our task, because: (1) the problematic cases requiring conceptual discrimination are rare in our corpus; and (2) our study does not require the discrimination of pitch accent patterns.

2.3. Word semantic similarity

The computational modeling of novelty and semantic parallelism involves semantic analysis of words, more exactly, the comparison between


words in terms of semantic meaning. Strictly speaking, every word is different from every other word. "Absolute synonymy, if it exists, is quite rare. Usually, words that are close in meaning are almost synonyms, but not quite; very similar, but not identical, in meaning; not fully intersubstitutable, but instead varying in their shade of denotation, connotation, or emphasis" (Edmonds and Hirst, 2002). So how can we formally specify similarity and dissimilarity? A word often has multiple senses. What counts as a central trait for comparing the meaning of a pair of words? Researchers in computational linguistics have developed various measures to compute the degree of semantic similarity between a pair of words. The measures are basically divided into two categories: (1) Ontology hierarchies (e.g., Lee et al., 1993; Sussna, 1993; Resnik, 1995; Jiang and Conrath, 1997). An ontology is a structural system of categories or semantic types, so that knowledge about a certain domain can be organized through the categorization of the entities of the domain in terms of the types in the ontology. The length of the path between a pair of words is a measure of their semantic dissimilarity. The well-known edge-based method reflects the fact that in a hierarchical semantic network, the simplest measure of the distance between two elemental concept nodes, A and B, is the shortest path that links A and B, i.e., the minimum number of edges that separate A and B (Rada et al., 1989). (2) Corpus statistics empirically model the context-dependence characteristics of word meaning in text (e.g., Lin, 1998; Pantel and Lin, 2002; Thelen and Riloff, 2002; Terra and Clarke, 2003). Three statistics are commonly employed to model the similarity of words (Higgins, 2004):

Topicality assumption: similar words tend tohave the same neighboring content words.Proximity assumption: similar words tend tooccur near each other. Word senses are ulti-mately grouped according to proximity ofmeaning.Parallelism assumption: similar words tend to befound in similar grammatical structures.


3. Corpus description, annotations and analyses

3.1. Tutoring dialogue scenario

The intelligent tutoring system helps students learn basic math and physics concepts by playing with Lego gears, with the objective of helping students develop a physical understanding of abstract concepts. For example, one question about the relationship between gear size and spinning speed is Line up a 24-tooth gear and a 40-tooth gear. If the 24-tooth gear spins 5 times, then how many times must the 40-tooth gear spin for them to line up again? Why? Children can answer this question by spinning the gears and counting the cycles. Similarly, a physics question about interactive force is Put one hand on the 40-tooth gear axle, and put another hand on the 8-tooth gear axle. What happens if you hold one of them steady, and try to turn the other one? Why? Children usually think that the big gear is stronger before they do the experiment. However, it turns out that the small gear is stronger.

The complete system has not been finished yet; the database used in this study was collected by Wizard-of-Oz simulations of the finished system. In the experiments, the user and the tutor (human wizard) were sitting in separate rooms. The user orally communicated with a computerized talking head shown on the computer screen ahead of him using a head-set microphone. The lip movement of the talking head was coincident with speech synthesized from text that was typed by the tutor. The user's speech was transmitted through the microphone to a digital camera placed opposite the user, and then was transmitted to the earphone of the tutor sitting in another room. Both the tutor and the user had Lego gearsets on the table in front of them. A video of the tutor's gearset was displayed on the user's computer, and vice versa. The WoZ experiments allowed data to be collected that was similar in most respects to the data the final spoken language system would need to understand: because children felt that they were communicating with a computer tutor instead of a human tutor, they behaved as they would in a real computer-interacting environment. The number of experimental sessions in which each child participated varied from one to three, depending on the interest and cooperative attitude of the child. The tutor adjusted the content of experiments according to the intelligence, cooperation, and learning progress of each child subject.

3.2. Characteristics of the ITS corpus

Unlike the users of a telephone weather or flight ticket system, who are often expert users interested in achieving known goals using known tools in the shortest time possible, the users of our intelligent tutoring system are perpetually naïve with respect to the future content of the dialogue. Each child participates in at most three experimental sessions and each session has different tutorial content, because we do not ask children to re-learn knowledge that they have mastered. Therefore, although children who participate in more than one experiment may gain some expertise in the use of the computer interface, the children are never able to memorize menu prompts, re-use conversation content, or otherwise become expert users of the dialogue system.

In addition, we encourage children to participate in the experiments, and we instigate their interest in scientific learning by asking them open-ended questions rather than questions with an absolute answer. For example, when the child is turning gears and the tutor wants to ask the child about the motion of the gears, he usually does not ask single-choice questions such as In which direction are the gears turning? Instead, he would ask questions whose answers are not absolute, e.g., What are you noticing? Compared with single-choice questions, open-ended questions open a wider space for children and arouse children's enthusiasm to use their imagination, knowledge, and observation to solve problems.

Since children are not familiar with the experiment contents and the answers to open-ended questions are usually longer and more complicated than those to single-choice questions, their utterances are even more incoherent and dysfluent than is typical in interpersonal conversations. The utterances usually include loose grammar structure, fragments, restarts, repairs, meaningless speech (e.g., That if the. . .), and repetitions. Moreover, sentence boundaries in spontaneous speech are ambiguous because of mismatch between acoustic segmentation and linguistic segmentation. The following example illustrates the characteristics of the ITS speech data:

(16) U: Big gears move in different ways and. . . uhm. . . with the first when you push one of the first gears, the other gear the last gear moves you know, and the gear after that moves.

3.3. Focus kernel annotation

Three annotators worked independently on identification of the focus kernel, if any, in each utterance of the ITS corpus. Annotation was based on perception, text transcriptions, and dialogue context. The vast majority of the ITS corpus consists of question-answer pairs between the tutor and user, in which the tutor initiates local dialogue topics by asking questions or providing suggestions (e.g., Can you make them look like this?). In this case, the presupposition of an utterance lies in the question or suggestion of the tutor (Kadmon, 2001). Students sometimes initiate local dialogue topics by issuing commands, asking questions or simply explaining what they are doing. In this case, presuppositions for the students' utterances do not exist. Annotators were given the following criteria to use in their labeling of focus kernels in an utterance:

1. Mark the content words that contain information not already available in the presupposition, if any, nor in the preceding words of the utterance. For example,

(17) T: What if we now add a gear?
U: The third one moves or spins the same direction that the first one does.

Mark contrastive focus: (a) If the tutor asks an alternative disjunctive question and the user responds by choosing a single disjunct, then the disjunct is characterized by contrastive focus, and thus, is marked as focus kernel. For example,

(18) T: Does the top or the bottom do the pushing?
U: I think it's the top.

(b) If the user's utterance is to correct the assumption of the tutor, then the corrective word is characterized by contrastive focus, and thus, is marked as focus kernel. For example,

(19) T: Take that big gear, please.
U: I thought you said this gear.


2. Focus as defined by Chomsky (1971) and Zubizarreta (1998) is confined to the rheme of a sentence (Steedman, 2000). However, we are also interested in marking the new item in a contrastive topic, e.g., 'large' in example dialogue (14), since it conveys novel information. We label such an item as focus kernel although it is not in the rheme, because this case is very rare in our corpus and focus kernels generally lie in the focus (rheme); including this rare case therefore does not affect our characterization of the words containing novel information as the kernel of focus.

3. In the case of dysfluency, discourse markers and repetitions should not be marked, while repairs should be marked. For example, in the dysfluent constituent [ ]dys of the utterance If I spin 3 times it end up, [all yellows at the bottom, and all yellows at the. . . I mean, all reds at the bottom,]dys and all yellows at the top, the discourse marker 'I mean' should not be marked. As a correction of 'yellows', 'reds' should be marked.

4. Function words are not marked unless they carry novel information, and are thus stressed by the speaker, e.g., 'this' in example (19).

Each word in an utterance is labeled either focus kernel or nonfocus kernel. We used the kappa statistic to evaluate consistency among the three annotators. The kappa statistic is a chance-corrected inter-transcriber agreement rate, $\kappa = (P_O - P_C)/(1 - P_C)$, where $P_O$ is the rate of inter-transcriber agreement, and $P_C$ is the average rate that would be achieved by chance (Flammia, 1998). Comparison of the three transcriptions yielded a kappa score of 0.79, indicating reasonably good agreement among the transcribers. We used majority voting to resolve annotation differences among the three annotators. That is, for each word, if two or more annotators had the same focus kernel/nonfocus kernel label, then we assigned that label to the word as the target label.
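As a small illustration of the chance-corrected agreement, the formula can be computed directly; the agreement rates below are hypothetical values chosen for the example (the paper reports only the resulting score of 0.79):

```python
def kappa(p_observed, p_chance):
    """Chance-corrected inter-transcriber agreement: (PO - PC) / (1 - PC)."""
    return (p_observed - p_chance) / (1.0 - p_chance)

# Hypothetical rates: 90% raw agreement, 52% agreement expected by chance.
score = kappa(0.90, 0.52)
print(round(score, 2))  # 0.79
```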

3.4. Symmetric contrast annotation

The labeling of symmetric contrast was based on pairs of words. For example, 'small' and 'big' are labeled to be a symmetric contrast. If more than two words were symmetrically contrastive with one another in an utterance, then we labeled them by pairs. For example, in I'm counting the small, medium, and big, we labeled 'small', 'medium' and 'big' by three symmetric contrasts: 'small' and 'medium', 'small' and 'big', and 'medium' and 'big'. Moreover, in the case of repair (dysfluency), e.g., The small one, no, the big one is turning left, the repairing word (e.g., 'big') and the repaired word (e.g., 'small') were also labelled as a symmetric contrast.
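The pairwise expansion of an n-way contrast is simply the set of 2-combinations of the contrastive words; a minimal sketch using Python's standard library:

```python
from itertools import combinations

# Three mutually contrastive words yield three labeled pairs.
words = ["small", "medium", "big"]
pairs = list(combinations(words, 2))
print(pairs)  # [('small', 'medium'), ('small', 'big'), ('medium', 'big')]
```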

Two transcribers worked together to identify instances of symmetric contrast in the ITS corpus. The annotation was performed based on speech perception, transcription, and dialogue context. Because the transcribers worked together, no inter-transcriber agreement statistics were computed.

3.5. Corpus analysis

To date 29 WoZ experiments with 17 subjects have been carried out and transcribed, and we have collected 11.7 h of audio-visual data. In each experiment, the child often spent most time silently playing with the Legos. Therefore, the collected audio data corpus was small. Recordings of the students were manually transcribed and segmented into conversation sides. Each of the user's conversation sides was considered an utterance. Some


Table 1
Relationship between utterance length and occurrence of symmetric contrast

                                                    Utterance length ≥ 8    Utterance length < 8
# of utterances containing symmetric contrast               113                      19
# of utterances not containing symmetric contrast           140                     357

[Fig. 2. Histogram of focus kernels per utterance (mean = 2.39). x-axis: # of focus kernels in an utterance (0-10); y-axis: # of utterances.]


speech data had to be discarded because of Lego block noise, heavy breathing, etc. This process resulted in a total of 714 transcribed user utterances, containing a total of approximately 50 minutes of relatively clean speech. On average each utterance had 4.2 s of speech and 8.1 words. The vast majority of the utterances contained 1–20 words, while the longest utterance had 57 words. Fig. 1 shows the histogram of utterance lengths in the corpus.

The 714 utterances of the ITS corpus can be partitioned into 58 single-word utterances, 26 single-phrase utterances (such as how many?), 22 utterances that merely repeat the words of the tutor, 19 utterances that are not semantically meaningful (e.g., That if the. . .), and 589 multi-word, multi-phrase utterances containing meaningful and novel information. We use the 630 multi-word, multi-phrase utterances (589 meaningful utterances, 22 repetitions of the tutor, and 19 semantically void utterances) for experiments.

Short utterances tend not to contain symmetric contrast. We choose the average number of words in a sentence, which is 8, as the threshold to distinguish long utterances from short utterances. Utterances containing at least 8 words occupy 40% of the 630-utterance corpus. Table 1 shows the relationship between utterance length and occurrences of symmetric contrast. The table shows that 45% of the long utterances contain symmetric contrast, while 5% of the short utterances contain symmetric contrast. Phrased in another way, more than 85% of symmetric contrast instances occur in utterances of 8 words or more. Fig. 2 shows the histogram of focus kernels per utterance in the


[Fig. 1. Histogram of utterance lengths in the ITS corpus (mean = 8.1 words). x-axis: # of words in an utterance (1-58); y-axis: # of utterances.]

630-utterance corpus. An utterance can have at most 10 focus kernels, while on average each utterance has 2 focus kernels.

Part-of-speech of all words in the ITS corpus is first tagged using an automatic part-of-speech tagger (Munoz et al., 1999), and then manually checked against the tagging standard in Treebank-3 (Santorini, 1990). The part-of-speech tags are used in the automatic detection of symmetric contrast and focus kernel, since part-of-speech can be used to distinguish content words from function words.


4. Prosodic analysis

The literature in both prosody and pragmatics reports the pitch accent correlate of contrast and focus. Therefore, pitch accent is a reasonable first step in the automatic classification of focus kernel and symmetric contrast. Because of the manpower involved in manual labeling of pitch accent in the ITS corpus, we try to use an automatic system to label pitch accent. To date pitch accent automatic


detection concentrates on Radio Speech, in which half of all words may be pitch accented (Kim et al., 2003), whereas in conversational telephone speech, typically about 20% of words are pitch accented (Yoon et al., 2004). Pitch accent can be automatically detected in the Boston Radio Speech Corpus with reasonable accuracy: a gender-dependent and speaker-independent system achieved a 10% pitch-accent detection equal error rate (Kim et al., 2003). This section describes the derivation of the prosodic measurements used for pitch accent labeling.

4.1. Duration, pitch and energy

English speakers tend to signal prosodic prominence using extruded pitch, increased energy, and longer phoneme durations. In order to match durations and energy with individual words, we first train a speaker-independent mixture Gaussian HMM speech recognizer, and then adapt the recognizer to children's speech (Zhang et al., 2004). The known word-level transcriptions of each utterance are expanded into phoneme transcriptions, and forcibly aligned to the speech waveform using the automatic speech recognizer, resulting in an automatic estimate of the duration and time alignment of each word. Pitch and probability of voicing are extracted using the FORMANT program in Entropic XWAVES. Pitch measurements in frames with low probability of voicing are discarded. Extremely high F0 values (doubling) and extremely low F0 values (halving) are also discarded. Valid pitch measurements of an utterance are then normalized by the highest F0 in the utterance to compensate for inter-speaker F0 differences. Kim et al. (2003) found that 95% of the pitch accents in their corpus were high-F0 pitch accents (in ToBI notation, these were variants of H* or !H*), and similar statistics have been reported for other corpora (e.g., Yoon et al., 2004). Therefore, the normalized pitches in the higher half pitch region of a word are averaged in order to obtain an ''average peak pitch'' measurement for the word. The averaging scheme is given by

$$D_m = \frac{1}{N_{\omega_m}} \sum_{f \in \omega_m} F_m[f], \quad (1)$$

where $D_m$ is the pitch value of word $W_m$, $F_m[f]$ is the pitch value of frame $f$ in $W_m$, $\omega_m$ is a subset of frames in $W_m$, $N_{\omega_m}$ is the total number of frames in $\omega_m$, and

$$\omega_m = \left\{ f \mid F_m[f] \ge \tfrac{1}{2} T_m \right\}, \quad (2)$$

$$T_m = \max_{f \in W_m} \{ F_m[f] \}. \quad (3)$$

Energy of a word is computed using the samestrategy as described by Eqs. (1)–(3).
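The averaging scheme of Eqs. (1)–(3) (keep only the frames whose value reaches at least half the word's peak, then average them) can be sketched as follows; the per-frame values are invented for illustration:

```python
def average_peak(frame_values):
    """Eqs. (1)-(3): average over the subset of frames >= half the peak value."""
    t = max(frame_values)                              # T_m, Eq. (3)
    upper = [v for v in frame_values if v >= 0.5 * t]  # w_m, Eq. (2)
    return sum(upper) / len(upper)                     # D_m, Eq. (1)

# Invented per-frame normalized pitch values for one word.
print(round(average_peak([0.45, 0.90, 1.00, 0.40, 0.80]), 3))  # 0.9
```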

4.2. Spectral balance cepstral coefficients

Spectral balance is intensity emphasis in the higher frequency region. Sluijter et al. (1997) showed that in lexically stressed syllables, speech spectral intensity at higher frequencies (above 500 Hz) increased more than the intensity at lower frequencies. Sluijter et al. also found that manipulating the high-frequency intensity of speech resulted in stronger stress cues than manipulating the entire band. These studies demonstrated that high frequency intensity is a stronger cue for stress than overall intensity. Since pitch accent in spoken sentences normally falls on the lexically stressed syllables, we speculate that intensity of the spectrum above 500 Hz may be a useful acoustic cue for pitch accent detection.

We propose spectral balance cepstral coefficients (SBCC) to encode the intensity and shape of the speech spectrum in the range between 500 Hz and 5000 Hz. The speech waveform is first pre-emphasized and windowed by 30 ms windows with an inter-frame window shift of 10 ms. Speech samples in each frame are decomposed into a series of bands between 0 and 5000 Hz through multiresolution analysis (see Fig. 3). The multiresolution analysis is achieved by iterative application of a filterbank, which consists of quadrature mirror filters (a low-pass smoothing filter and a high-pass differencing filter) with Daubechies-4 orthogonal coefficients (Daubechies, 1990); each time the filterbank decomposes speech signals into a higher band and a lower band. The Daubechies-4 time domain filterbank has better out-of-band rejection than a filter constructed by adding DFT coefficients (like


[Fig. 3. Multiresolution decomposition of 0–5000 Hz. An arrow splits the band from which it originates into two equal subbands; the split frequencies (2500 Hz; 1250 Hz, 3750 Hz; 625 Hz, 1875 Hz, 3125 Hz; 313 Hz, 938 Hz, 1563 Hz, 2188 Hz; 156 Hz, 470 Hz, 782 Hz, 1095 Hz, 1407 Hz, 1720 Hz; 78 Hz, 234 Hz, 390 Hz, 547 Hz) are labeled at the right side. The subbands finally obtained from the decomposition are marked dark. The subbands over 500 Hz are the 14 rightmost bands.]


the filters used in MFCC or PLP analysis), but retains the desirable properties of MFCC analysis: semi-logarithmic frequency scaling, and the flexibility necessary to select desired sub-bands, e.g., in our case, the sub-band above 500 Hz. After multiresolution analysis, each word is characterized by its average peak signal intensity in each band, using equations similar to those used for pitch:

$$e_m = \frac{1}{N_{\phi_m}} \sum_{n \in \phi_m} y_m^2[n], \quad 1 \le m \le M, \quad (4)$$

where $y_m[n]$ is sample $n$ in band $B_m$, $e_m$ is the intensity of band $B_m$, $M$ is the total number of bands spanning from 500 Hz to 5000 Hz, $N_{\phi_m}$ is the total number of speech samples in set $\phi_m$, $\phi_m$ is a subset of $B_m$, and

$$\phi_m = \left\{ n \mid |y_m[n]| \ge \tfrac{1}{2} T_m \right\}, \quad (5)$$

with

$$T_m = \max_{n \in B_m} \{ |y_m[n]| \}. \quad (6)$$

The log energies in each band (logarithms of Eq. (4)) are transformed using an inverse discrete cosine transform to compute the cepstral coefficients:

$$E_l = \sum_{m=1}^{M} \log(e_m) \cos\left( \frac{l(m - 0.5)\pi}{M} \right), \quad 1 \le l \le L, \quad (7)$$

where $L$ is the desired length of the cepstrum. Cepstral mean subtraction is then applied to compensate for disturbances caused by the transmission channel.
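Eq. (7) is a DCT over the per-band log energies; a minimal sketch, with invented band intensities standing in for the 14 subbands above 500 Hz:

```python
import math

def sbcc(band_energies, num_coeffs):
    """Eq. (7): E_l = sum_m log(e_m) cos(l (m - 0.5) pi / M), 1 <= l <= num_coeffs."""
    M = len(band_energies)
    return [
        sum(math.log(e) * math.cos(l * (m - 0.5) * math.pi / M)
            for m, e in enumerate(band_energies, start=1))
        for l in range(1, num_coeffs + 1)
    ]

# Invented intensities for M = 14 subbands.
coeffs = sbcc([2.0 ** k for k in range(1, 15)], num_coeffs=4)
print(len(coeffs))  # 4
```

A useful sanity check on the formula: a flat band spectrum (all $e_m$ equal) yields coefficients that are all zero, since the cosine terms sum to zero for each $l$.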

5. Word semantics analysis

We use a word dissimilarity measure to model the degree of novelty that a word has in comparison with other words. We use $N_i$ to denote the novelty of word $w_i$ given dialogue context. According to the definition of focus kernel, we compute $N_i$ as the minimum of the dissimilarity between $w_i$ and the words in set $S$, where $S$ consists of those words appearing in the interlocutor's presupposition and those preceding $w_i$ in the utterance, i.e.,

$$N_i = \min_{w_j \in S} \mathrm{dis}(w_i, w_j). \quad (8)$$
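Eq. (8) in code: the novelty of a word is its dissimilarity to its nearest neighbor in $S$. The dissimilarity function and word set below are toy placeholders, not the measure developed in the following subsections:

```python
def novelty(w_i, presupposed_and_preceding, dis):
    """Eq. (8): N_i = min over w_j in S of dis(w_i, w_j)."""
    return min(dis(w_i, w_j) for w_j in presupposed_and_preceding)

# Toy dissimilarity: 0 for an exact repeat, 1 otherwise.
toy_dis = lambda a, b: 0.0 if a == b else 1.0
print(novelty("gear", {"gear", "spin"}, toy_dis))    # 0.0 (repeated word: no novelty)
print(novelty("faster", {"gear", "spin"}, toy_dis))  # 1.0 (novel word)
```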


[Fig. 4. Partial hierarchy of the application-oriented ontology.]


Semantic parallelism is the semantic attribute characterizing symmetric contrast. Given a couple of words, the quantification of their semantic parallelism should emphasize: (1) word dissimilarity, i.e., a word cannot be parallel to itself or its synonym; (2) word similarity, e.g., 'blue' and 'white' are more likely than 'drink' and 'pier' to become a symmetric contrast; and (3) non-hypernym relation, i.e., a word cannot form symmetric contrast with its hypernym. For example, 'dog' and 'cat' can form a symmetric contrast, while 'dog' and 'animal' cannot.

Similarity and dissimilarity of a pair of words was computed using measures based on methods from the computational linguistics literature. Specifically, as suggested in Section 2.3, two methods were used: an ontology-based method and a method based on corpus statistics. Neither of these two methods was found to be an adequate dissimilarity measure by itself, but the linear interpolation of the two measures was found to be reasonably adequate.

5.1. Application-oriented ontology

Ontology design determines the set of semantic categories that properly reflects the particular conceptual organization of a target domain (Lenci, 2001). Designing a completely new ontology is comparably difficult to the design of a completely new dictionary. An attractive solution is to adapt a general linguistic resource to the application domain. In this study, we use WordNet as the universal background linguistic source. WordNet is a hierarchical semantic database of English words (Miller and Fellbaum, 2002). In WordNet, the main relations between words are: synonym, antonym, and hypernym (representing the 'is-a' relationship) for nouns, verbs and adjectives; and synonym and antonym for adverbs.

The ontology in the ITS domain is defined by $O = \{C, R, H\}$, where $C$ is the set of concepts, $R$ is the set of relations, and $H$ is the concept hierarchy. Each content word in the ITS lexicon is a concept. Concepts should interconnect with each other in the ontology. The relationship between concepts is represented by links. In this study, the ontology encodes the information of synonym, antonym, and hypernym for the content words. The hierarchy of word concepts is constructed following the procedures in Appendix A. The major semantic categories used in the hierarchical structures are listed in Appendix B. A subset of the ontology is shown schematically in Fig. 4 using a tree structure. In addition, function words are categorized by their part-of-speech under the main semantic class 'function' (in contrast with 'content', the main semantic class for content words).

We employ the edge-based method, in which an edge represents a direct association between a pair of semantic concepts, to compute the distance between a pair of words. In a more realistic scenario, the distances between any two adjacent nodes are not necessarily equal. Generally, the distance shrinks as one descends the hierarchy, since differentiation is based on finer and finer details (Jiang and Conrath, 1997); e.g., the distance between 'abstraction' and 'entity' (at the top level of the hierarchy) is much bigger than the distance between 'many' and 'some' (at a lower level). It is therefore necessary to consider that the edge connecting two nodes should be weighted by their depth in the hierarchy. We propose the following weighted edge-distance:

$$\mathrm{dis}(w_1, w_2) = \min_{\substack{c_1 \in \mathrm{sen}(w_1) \\ c_2 \in \mathrm{sen}(w_2)}} \left( \frac{\bar{d}_{c_1,c_2} + 1}{\bar{d}_{c_1,c_2}} \right)^{\alpha} \mathrm{len}(c_1, c_2), \quad (9)$$


where $\mathrm{sen}(w)$ denotes the set of possible senses for word $w$ in case $w$ has multiple senses; $\bar{d}_{c_1,c_2}$ is the mean depth of nodes $c_1$ and $c_2$ in the hierarchy; $\mathrm{len}(c_1, c_2)$ is the number of edges connecting $c_1$ and $c_2$; and $\alpha$ is a constant. Here we choose $\alpha = 3.0$. In addition, we set the distance score to 0 for synonyms, and to 15.0 for antonyms, based on the maximum distance between a pair of words in the ontology.
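The depth weighting in Eq. (9) can be illustrated for a single sense pair; note how the same path length costs much less deeper in the hierarchy (the depths and path lengths here are invented):

```python
def weighted_edge_distance(mean_depth, path_len, alpha=3.0):
    """Eq. (9) for one sense pair: ((d + 1) / d)^alpha * len(c1, c2)."""
    return ((mean_depth + 1.0) / mean_depth) ** alpha * path_len

# A 2-edge path near the root costs far more than the same path deep in the tree.
print(weighted_edge_distance(mean_depth=1.0, path_len=2))            # 16.0
print(round(weighted_edge_distance(mean_depth=9.0, path_len=2), 2))  # 2.74
```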

5.2. Corpus statistics

Statistical methods attempt to measure dependence between words by using statistics taken from a large corpus. Here we use the English GigaWord corpus, a billion-word archive of English newswire text distributed by the Linguistic Data Consortium. The entire GigaWord corpus consists of 314 files, nearly 11.7 GB. All texts are presented in SGML form, using a simple markup structure. We remove the SGML tags, leaving only text content. The raw database consists of four different types of documents: Story, Multi, Advis, and Other. Story has a uniform format, each document consisting of a few paragraphs describing a single topic. The other three types of documents do not have long paragraphs of text. Therefore, we only choose the story session for experiments; the story session (with paragraph breaks and some other control characters removed) contains 8.57 GB of text.

In this study, similarity between a pair of words $w_1$ and $w_2$ is measured based on the proximity and topicality assumptions (Section 2.3). First, the proximity assumption tells us that we can model word similarity by the probability of a pair of words occurring together in a sentence, a paragraph, a topic, or even a document. We measure the degree of association between a pair of words $w_1$ and $w_2$ by the mutual information

$$\mathrm{PMI}(w_1, w_2) = \log_2 \frac{p(w_1, w_2)}{p(w_1)\,p(w_2)}.$$

Let the co-occurrence frequency of $w_1$ and $w_2$ be denoted by $f_{w_1,w_2}$. Because GigaWord is a very large text corpus, the co-occurrence frequency is roughly estimated by the number of topics (stories) in which $w_1$ and $w_2$ co-occur. Let $N$ be the size of the corpus in terms of topics. The maximum likelihood estimate of the co-occurrence probability is given by $p(w_1, w_2) = f_{w_1,w_2}/N$. To compensate for the sparsity of the training data, we compute $p(w_1, w_2)$ by

$$p(w_1, w_2) = \frac{f_{w_1,w_2} + 1}{N + 1}. \quad (10)$$
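A sketch of the smoothed PMI computation, with Eq. (10)'s add-one smoothing applied to each probability estimate; the topic counts below are invented for illustration:

```python
import math

def smoothed_pmi(f_12, f_1, f_2, n_topics):
    """PMI with the add-one smoothing of Eq. (10) applied to each probability."""
    p_12 = (f_12 + 1) / (n_topics + 1)
    p_1 = (f_1 + 1) / (n_topics + 1)
    p_2 = (f_2 + 1) / (n_topics + 1)
    return math.log2(p_12 / (p_1 * p_2))

# Invented counts: two words each occurring in 100 of 10000 topics, co-occurring in 50.
print(round(smoothed_pmi(50, 100, 100, 10_000), 2))  # 5.64
```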

The computations of $p(w_1)$ and $p(w_2)$ adopt the same smoothing strategy. Second, the topicality assumption tells us that if we have decided that two words are similar, then we may infer that they have similar mutual information with some other word, $w$. Given a context $C = \{w'_1, w'_2, \ldots, w'_n\}$, $w_1$ and $w_2$ are considered semantically similar if they are both likely to co-occur with the words in $C$. We apply a method (Pantel and Lin, 2002) that computes the cosine distance between the two partial mutual information (PMI) vectors corresponding to $w_1$ and $w_2$:

$$\mathrm{dis}(w_1, w_2) = 1 - \frac{\sum_{w' \in C} \mathrm{PMI}(w', w_1)\,\mathrm{PMI}(w', w_2)}{\sqrt{\sum_{w' \in C} \mathrm{PMI}(w', w_1)^2}\,\sqrt{\sum_{w' \in C} \mathrm{PMI}(w', w_2)^2}}, \quad (11)$$

where $C = C(w_1) \cup C(w_2)$, and $C(w_1)$ and $C(w_2)$ are the contexts of $w_1$ and $w_2$, respectively. Given a word $w$, its context $C(w)$ can be the simple collection of all (content) words appearing within the window of $w$ in the ITS corpus. To more properly represent those words semantically associated with $w$, we modify the method proposed by Dagan et al. (1995) to determine $C(w)$:

1. Define a pair of words $(w, v)$ to be strong neighbors if $f(w, v) > t_f$, where $f(w, v)$ is the count of $(w, v)$ in a window of size $d$, and $t_f$ is a threshold:

$$t_f = \begin{cases} k_1 \cdot \min(N_w, N_v), & \min(N_w, N_v) > c_1 \\ s_1, & \min(N_w, N_v) \le c_1 \end{cases} \quad (12)$$

where $N_w$ and $N_v$ are the number of occurrences of $w$ and $v$, respectively, and $k_1$, $c_1$, and $s_1$ are constants.

2. Collect all the strong neighbors of $w$ as potential candidates, and denote them as $C_1(w)$:

$$C_1(w) = \{ v \mid f(w, v) > t_f \}. \quad (13)$$


3. Collect the strong neighbors of all words in C₁(w) as potential candidates, and denote them C₂(w):

   C₂(w) = {v | v ∈ C₁(C₁(w))}.   (14)

4. Collect those words in the lexicon which share at least t_N strong neighbors with w, and denote them C₃(w):

   C₃(w) = {v | |C₁(w) ∩ C₁(v)| > t_N},   (15)

   where

   t_N = { k₂ · min(|C₁(w)|, |C₁(v)|)   if min(|C₁(w)|, |C₁(v)|) > c₂
         { s₂                           if min(|C₁(w)|, |C₁(v)|) ≤ c₂,   (16)

   |C₁(w)| and |C₁(v)| are the numbers of strong neighbors of w and v, respectively, and k₂, c₂, and s₂ are constants.

5. C(w) = C₁(w) ∪ C₂(w) ∪ C₃(w).   (17)
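The five-step construction of C(w) and the corpus-based distance of Eq. (11) can be sketched in code. The sketch below is illustrative only: the function and variable names are ours, the toy co-occurrence counts stand in for the ITS corpus windows, and the adaptive threshold t_N of Eq. (16) is replaced by a fixed value for brevity.

```python
from math import log, sqrt

def pmi(pair_counts, word_counts, total, w, v):
    """One common PMI estimate: log(f(w,v)*N / (N_w*N_v)); 0 when the pair never co-occurs."""
    joint = pair_counts.get(frozenset((w, v)), 0)
    if joint == 0:
        return 0.0
    return log(joint * total / (word_counts[w] * word_counts[v]))

def strong_neighbors(w, pair_counts, word_counts, k1=0.6, c1=2, s1=2):
    """Step 1: words v whose co-occurrence count f(w, v) exceeds the threshold t_f of Eq. (12)."""
    neighbors = set()
    for v in word_counts:
        if v == w:
            continue
        m = min(word_counts[w], word_counts[v])
        tf = k1 * m if m > c1 else s1
        if pair_counts.get(frozenset((w, v)), 0) > tf:
            neighbors.add(v)
    return neighbors

def context(w, pair_counts, word_counts, tN=1):
    """Steps 2-5: C(w) = C1(w) | C2(w) | C3(w); tN is fixed here instead of Eq. (16)."""
    c1w = strong_neighbors(w, pair_counts, word_counts)
    c2w = set().union(*(strong_neighbors(u, pair_counts, word_counts) for u in c1w)) if c1w else set()
    c3w = {v for v in word_counts
           if v != w and len(c1w & strong_neighbors(v, pair_counts, word_counts)) > tN}
    return c1w | c2w | c3w

def dis2(w1, w2, pair_counts, word_counts, total):
    """Eq. (11): one minus the cosine of the two PMI context vectors."""
    c = context(w1, pair_counts, word_counts) | context(w2, pair_counts, word_counts)
    v1 = [pmi(pair_counts, word_counts, total, u, w1) for u in c]
    v2 = [pmi(pair_counts, word_counts, total, u, w2) for u in c]
    dot = sum(a * b for a, b in zip(v1, v2))
    n1, n2 = sqrt(sum(a * a for a in v1)), sqrt(sum(b * b for b in v2))
    if n1 == 0 or n2 == 0:
        return 1.0          # no usable context: maximally dissimilar by convention
    return 1.0 - dot / (n1 * n2)
```

The defaults k1 = 0.6, c1 = 2, s1 = 2 follow the constants chosen in Appendix C.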


Examples of corpus statistics are given in Appendix C.




5.3. Knowledge combination

The computation of word semantic dissimilarity combines a lexical ontology with corpus statistics. First, the manually built pseudo-knowledge base has advantages in efficient paraphrasing, inference, and reasoning. The ontology is especially useful when the dissimilarities of some words are dependent on topic and pragmatic context. For example, 'play' and 'build' are dissimilar to each other under a basketball game topic, but similar under the ITS topic. However, the top-down organization of lexical knowledge cannot incorporate the dynamic nature of word meanings, which change dramatically depending on the linguistic context. This context-dependent characteristic is one of the main empirical arguments for real text data (Lenci, 2001). Second, the statistical model provides computational evidence from distributional analysis of corpus data, yielding a quantification of the semantic space. However, corpus statistics has some limitations, as it requires exact correspondence between terms (word or character n-grams). In unrestricted language, many reasonable co-occurrences may fail to occur in the training corpus.

In order to combine information from both the ontology-based dissimilarity measure and the corpus-based dissimilarity measure, we use linear interpolation as

dis(w₁, w₂) = λ · dis₁(w₁, w₂) + (1 − λ) · dis₂(w₁, w₂),   (18)

where dis₁ and dis₂ are the peak-normalized word distances based on the ontological and the statistical methods, respectively, and the parameter λ is a constant. Here we arbitrarily choose λ = 0.5. The prerequisite for using corpus statistics is that there must be at least two contextual words for a given pair of words (w₁, w₂); otherwise, dis₂ in Eq. (11) will be 1 (zero contextual words) or 0 (one contextual word). If the prerequisite fails to be satisfied, then the dissimilarity is computed using only the ontology-based dissimilarity measure. Examples of feature combination are listed in Appendix D.
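Under these definitions, the interpolation and its fallback rule can be stated compactly. The function below is a sketch, not the authors' code; `dis_ontology` and `dis_corpus` are our stand-in names for the peak-normalized distances dis₁ and dis₂.

```python
def combined_dis(w1, w2, dis_ontology, dis_corpus, n_context, lam=0.5):
    """Eq. (18): linear interpolation of the two peak-normalized distances.

    Falls back to the ontology-based distance alone when fewer than two
    contextual words are available, since the corpus-based score is then
    degenerate (1 with no contextual words, 0 with exactly one).
    """
    if n_context < 2:
        return dis_ontology(w1, w2)
    return lam * dis_ontology(w1, w2) + (1.0 - lam) * dis_corpus(w1, w2)
```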

6. System evaluation

The corpus for focus kernel classification consisted of 630 multi-word, multi-phrase utterances, containing approximately 5700 words and 48 min of speech. In the experiments on extracting focus kernel and symmetric contrast, the training and test data included different utterances from the same set of talkers, so the experiments were multi-speaker and speaker-dependent.

6.1. Focus kernel detection

Prosodic observations (duration, average peak pitch, average peak energy, and spectral balance cepstral coefficients) and part-of-speech tags were integrated with each word's semantic novelty measurement (Eq. (8)) using a time-delay recurrent neural network (TDRNN) for the purpose of automatically detecting focus kernels. The TDRNN is a neural network that encodes and integrates dynamic signal context using a combination of delayed input nodes (for temporal location of important information in the input sequence) and delayed recurrent nodes (for temporal context information) (Kim, 1998). This study used a modified TDRNN: we added a recurrent layer between the output layer and the hidden layer to feed back the delayed values from the output layer, and the time indices were synchronized to successive words rather than centisecond speech frames.
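As a rough illustration of the data flow only (not the authors' implementation, whose layer sizes and training procedure are not reproduced here), a word-synchronous forward pass with delayed input taps and output feedback might look as follows; all dimensions, the softmax output, and the random initialization in the usage example are illustrative.

```python
import numpy as np

def tdrnn_forward(x_seq, params, n_delay=2):
    """Word-synchronous forward pass of a TDRNN-style network.

    At word t, the input layer sees the current feature vector plus n_delay
    delayed copies (temporal location of information in the input sequence);
    the hidden layer also receives the previous hidden state and the delayed
    output (recurrent context plus output feedback), mirroring the modified
    TDRNN described in the text.
    """
    W_in, W_h, W_fb, W_out = params
    n_hidden = W_h.shape[0]
    n_out = W_out.shape[0]
    h = np.zeros(n_hidden)
    y_prev = np.zeros(n_out)
    d = x_seq.shape[1]
    history = [np.zeros(d)] * n_delay        # delayed input taps, initially zero
    outputs = []
    for x in x_seq:
        tapped = np.concatenate([x] + history[-n_delay:])
        a = W_in @ tapped + W_h @ h + W_fb @ y_prev
        h = np.tanh(a)
        z = W_out @ h
        y_prev = np.exp(z - z.max())          # softmax over the output classes
        y_prev /= y_prev.sum()
        outputs.append(y_prev)
        history.append(x)
    return np.vstack(outputs)
```

With 5 input features per word, 2 delay taps, 8 hidden units, and 2 output classes (focus kernel vs. nonfocus kernel), each row of the returned array is a per-word class distribution.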

We have tried using automatic pitch accent/nonpitch accent labels as a feature for focus kernel detection: we employed a model trained on the Boston Radio News Corpus (Ren et al., 2004a) for our automatic pitch accent labeling task; the pitch accent/nonpitch accent labels were then combined with the semantic similarity measure as input to the TDRNN. However, the performance was not satisfactory,


apparently because of: (1) the application of a model trained on radio news speech to spontaneous speech; (2) the noisy recording environment; and (3) the fact that erroneous pitch accent detection would lead to erroneous extraction of the focus kernel. Therefore, we used the acoustic correlates of pitch accent, rather than pitch accent itself, for the purpose of detecting the focus kernel.

We used 90% of the corpus for training and the remaining 10% for testing. Our experiments yielded an accuracy of 83.8% on the test set, which consisted of 536 words in total. The confusion matrices summarizing the classification performance in terms of utterances and words are shown in Tables 2 and 3, respectively. In Table 2, novel and nonnovel denote whether or not the utterances contain a focus kernel. Precision, recall, and F-score (f = 1/(0.5/p + 0.5/r), i.e., the harmonic mean of p and r) of non/focus kernel automatic labeling are presented in Table 4.

Table 2
Confusion matrix for utterance classification: novel vs. non-novel; novel = utterance containing focus kernel, non-novel = utterance not containing focus kernel

            Novel   Nonnovel
Novel         53        7
Nonnovel       0        3

Table 3
Confusion matrix for word classification: focus kernel vs. nonfocus kernel

                  Focus kernel   Nonfocus kernel
Focus kernel           148             17
Nonfocus kernel         70            301

Table 4
Precision p, recall r, and F-score f of non/focus kernel labelling using various features

                            Focus kernel           Nonfocus kernel
                            p      r      f        p      r      f
Dur                       0.417  0.727  0.530    0.819  0.547  0.656
Egy                       0.313  0.885  0.462    0.725  0.135  0.227
Pit                       0.312  0.891  0.462    0.723  0.127  0.216
Pos                       0.582  0.236  0.336    0.731  0.925  0.817
SBCC                      0.373  0.364  0.368    0.720  0.728  0.724
SDM                       0.472  0.873  0.613    0.909  0.566  0.698
Dur+Egy+Pit+Pos+SBCC+SDM  0.809  0.770  0.789    0.900  0.919  0.909

Dur = duration, Egy = energy, Pit = pitch, Pos = part-of-speech, SBCC = spectral balance cepstral coefficients, SDM = semantic dissimilarity measure.

We further present the F-scores of non/focus kernel labeling based on different features in Fig. 5. The figure shows that pitch and energy yielded better performance for focus kernel labeling than for nonfocus kernel labeling, while for the other features the reverse was true, especially part-of-speech and the spectral balance cepstral coefficients.

We compared the efficiency of individual features using classification accuracy and F-score-based evaluation measures. Classification accuracy is the fraction of the test set that is correctly classified with respect to focus kernel and nonfocus kernel. F-score is the harmonic mean of precision and recall: F-score is high only when



Fig. 5. F-score values of focus kernel labeling and nonfocus kernel labeling based on various features. Dur = duration, Egy = energy, Pit = pitch, Pos = part-of-speech, SBCC = spectral balance cepstral coefficients, SDM = semantic dissimilarity measure.


both precision and recall are high. Therefore, F-score is a more reliable measure. However, the F-score values that we derived were class-dependent. We were interested in a global performance measure across the classes, so we designed two global F-score measures: (1) the average of the focus kernel F-score and the nonfocus kernel F-score; and (2) f_ave = 1/(0.5/p_ave + 0.5/r_ave), where p_ave is the average of the focus kernel precision and the nonfocus kernel precision, and r_ave is the average of the focus kernel recall and the nonfocus kernel recall. The comparison results using classification accuracy and the global F-score measures are shown in Fig. 6. The figure shows that the combination of features much outperformed the individual features. In addition, comparison based on the global F-score measures shows that the word dissimilarity measure was the most important feature, followed by duration, part-of-speech, and SBCC.
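The two global measures are easy to state in code. The sketch below uses our own function names; as a sanity check, the combined-feature row of Table 4 recovers its printed F-score as the harmonic mean of its precision and recall.

```python
def f_score(p, r):
    """Harmonic mean of precision and recall: f = 1 / (0.5/p + 0.5/r) = 2pr/(p+r)."""
    return 2 * p * r / (p + r)

def global_f_measures(p_pos, r_pos, p_neg, r_neg):
    """The two class-independent summaries described in the text:
    (1) the mean of the per-class F-scores, and
    (2) the F-score of the class-averaged precision and recall."""
    f_avg_of_f = 0.5 * (f_score(p_pos, r_pos) + f_score(p_neg, r_neg))
    f_of_avg = f_score(0.5 * (p_pos + p_neg), 0.5 * (r_pos + r_neg))
    return f_avg_of_f, f_of_avg
```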

As for the acoustic correlates of pitch accent, the test results depicted in Fig. 6 show that duration and SBCC played more important roles than energy and pitch in the detection of focus kernels. Both duration and high-frequency intensity have been shown to be reliable acoustic correlates of lexical stress (Sluijter et al., 1997). Pitch played the least significant role in focus kernel detection, probably because of the frequent occurrence of pitch tracking inaccuracies in the noisy speech data: it was hard to discriminate voiced regions from unvoiced regions in the noisy recording environment, the energy of unvoiced regions carried information irrelevant to the pitch estimate, and the automatically extracted pitch therefore might have contained too many pitch tracking errors to be an efficient feature. Spectral balance cepstral coefficients showed better performance than pitch and energy, possibly because: (1) the band-pass filters eliminated the low-frequency noise that adversely affected the pitch and energy estimates; and (2) spectral balance may be more representative of vocal effort than overall energy, because differences in the shape of the glottal waveform tend to cause differences in the higher frequency regions (Sluijter et al., 1997).

6.2. Symmetric contrast detection

Symmetric contrast was detected using the algorithm shown in Fig. 7: candidate word pairs were pre-filtered using a series of knowledge-based rules, prior to application of a decision tree program. For the decision tree program, we used See5, a data mining tool for discovering patterns or relationships in data, assembling them into classifiers that are expressed as decision trees


Fig. 6. Feature vector accuracy ranking according to the classification accuracy and two global F-score measures: the first measure is the averaged focus kernel F-score and nonfocus kernel F-score; the second measure is the F-score computed from the average precision and average recall of the focus kernel and nonfocus kernel labeling. Dur = duration, Egy = energy, Pit = pitch, Pos = part-of-speech, SBCC = spectral balance cepstral coefficients, SDM = semantic dissimilarity measure.

Fig. 7. Architecture of the symmetric contrast detector. (Flowchart: a candidate word pair is labeled nonsymmetric contrast if at most one of the words is a content word, if the two words do not share a POS tag, or if they stand in a hypernym relation; surviving pairs are passed to the decision tree.)


or sets of if-then rules, and using them to make valid predictions (Rulequest Research, 2004).

In experimental tests, we labeled all word pairs in an utterance as either symmetric contrast or nonsymmetric contrast. We had 265 symmetrically contrasted pairs. The corpus study showed that 100% of symmetric contrast cases were pairs of content words, and 99.2% of contrasted content-word pairs had the same part-of-speech tags. Therefore, we labeled a pair of words ⟨w₁, w₂⟩ as nonsymmetric contrast if at most one of them was a content word, or if they had different part-of-speech tags. We did not model the common integrator requirement for semantic parallelism, but


the semantic independence requirement could be easily represented by requiring that a word may never contrast with its own hypernym.
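The three knowledge-based rules can be sketched as a predicate over candidate pairs. The helper names below (`pos`, `is_content`, `hypernyms`) are our own stand-ins for the POS tagger and ontology lookups, not names from the authors' system.

```python
def prefilter(w1, w2, pos, is_content, hypernyms):
    """Knowledge-based pre-filter for symmetric contrast candidates.

    A pair survives (and is passed on to the decision tree) only if both
    words are content words, they share a part-of-speech tag, and neither
    word is a hypernym of the other (the semantic independence requirement).
    """
    if not (is_content(w1) and is_content(w2)):
        return False
    if pos(w1) != pos(w2):
        return False
    if w2 in hypernyms.get(w1, set()) or w1 in hypernyms.get(w2, set()):
        return False
    return True
```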

A pair of words ⟨w₁, w₂⟩ in the utterance was input to the See5 program, with the feature vector being the combination of prosodic measurements and the semantic similarity score, i.e., f = {duration, pitch, energy, 13 SBCCs, similarity score}. The similarity score of w₁ and w₂ was

sim(w₁, w₂) = 1 − dis(w₁, w₂).   (19)

The prosodic features were the arithmetic means over w₁ and w₂. The classification target was 1 if ⟨w₁, w₂⟩ was a symmetric contrast, and 0 otherwise. The See5 program was invoked with the options rulesets (collections of if-then rules), ignore costs file (ignore the penalty for misclassification), and global pruning (a large tree is first grown to fit the data closely and is then pruned by removing parts that are predicted to have a relatively high error rate).
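The feature vector handed to See5 for each surviving pair can be assembled as below. This is a sketch: the ordering of the 16-dimensional prosodic vector (duration, pitch, energy, then 13 SBCCs) is our assumption, and the final element is the similarity score of Eq. (19).

```python
def pair_features(f1, f2, dis):
    """Feature vector for a candidate pair <w1, w2>: the element-wise
    arithmetic mean of the two words' prosodic vectors, plus the
    similarity score sim = 1 - dis of Eq. (19)."""
    assert len(f1) == len(f2) == 16          # 3 prosodic values + 13 SBCCs (assumed order)
    prosody = [(a + b) / 2.0 for a, b in zip(f1, f2)]
    return prosody + [1.0 - dis]
```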

After pre-filtering the word pair candidates using the three knowledge-based rules, we obtained 803 word pairs (in 260 utterances out of the 630-utterance corpus) to input to the See5 program. For the reason described in Section 6.1, we used the acoustic correlates of pitch accent (pitch, energy, duration, and SBCC), rather than pitch accent/nonpitch accent labels derived from those correlates, for symmetric contrast detection. We used 50% of the filtered word pairs (in 121 utterances) for training the See5 program. Then we used the symmetric contrast detector depicted in Fig. 7 to test the other 509 (630 − 121 = 509) utterances. Table 5 reports the precision, recall, and F-score with which utterances were correctly labeled as either containing symmetric contrast (first row in the table) or not containing symmetric contrast (second row in the table).

Table 5
Precision p, recall r, and F-score f for the correct labeling of utterances containing symmetric contrast (row 1) and not containing symmetric contrast (row 2)

                         p       r       f
Symmetric contrast     1.000   0.952   0.975
Nonsymmetric contrast  0.925   1.000   0.961

Our test results showed that the individual prosodic features pitch, duration, and energy each classified all of the test word pairs (after pre-filtering) as nonsymmetric contrast. In addition, when they were combined with SBCC and the semantic dissimilarity measure, they played little role in the symmetric contrast detection: almost all the variables in the rule set used for non/symmetric contrast classification were SBCC variables and the semantic measure. Therefore, pitch, energy, and duration were not efficient for symmetric contrast identification. We further compared the performance of SBCC and the semantic similarity measure on the word pairs remaining after the knowledge rule-based pre-filtering, and list the results in Table 6. The table shows that SBCC outperformed the word similarity measure in the efficiency of non/symmetric contrast classification. We also computed the non/symmetric contrast classification accuracy over all of the word pairs in the 509-utterance test corpus, and the accuracy was 92.8%.

Table 6
Precision p, recall r, and F-score f of symmetric contrast labelling using various features on the word pairs after the three knowledge rule-based pre-filtering

                       Symmetric contrast     Nonsymmetric contrast
                       p      r      f        p      r      f
SBCC                 0.789  1.000  0.882    1.000  0.841  0.914
SSM                  0.701  1.000  0.824    1.000  0.746  0.855
Dur+Egy+Pit+SBCC+SSM 0.838  1.000  0.912    1.000  0.885  0.939

Dur = duration, Egy = energy, Pit = pitch, SBCC = spectral balance cepstral coefficients, SSM = semantic similarity measure.

6.3. Application of focus kernel to a practical SSLU system

Focus kernel has been successfully applied to spontaneous spoken language understanding in the tutoring dialogue scenario by participating in the tutoring event classification (Zhang, 2004). The study corpus was divided into a training set


and a test set. Focus kernels were extracted from training utterances. A nonparametric model of each tutoring event class was constructed by simply listing all focus kernels used in the training utterances of that tutoring event. The tutoring event classification of test utterances was based on: (1) a lexical similarity measure between focus kernels in the test utterance and focus kernels in the nonparametric models of all tutoring event candidates; and (2) the cognitive state (confidence, puzzlement, or hesitation) that reflects the students' mental activities during the process of knowledge acquisition. Because the lexical similarity measure was trained using a task-independent corpus (GigaWord) combined with a small task-oriented ontology, it was possible to create this SSLU system using a comparatively small amount of task-specific training data. There were 30 tutoring events in total, and the perplexity of the classification task was 22.5. Accuracy of the tutoring event classification reached 75.5% when focus kernels and cognitive states were manually annotated, and was reduced by 15.4% relative when focus kernels and cognitive states were automatically extracted.
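A sketch of the nonparametric classification step follows. The paper does not specify how per-word similarities are aggregated into an utterance-level score, so the best-match averaging below is our assumption; `sim` plays the role of the lexical similarity of Eq. (19), and each event model is simply the list of focus kernels seen in its training utterances.

```python
def classify_event(test_kernels, event_models, sim):
    """Pick the tutoring event whose stored focus kernels are lexically
    closest to the focus kernels of the test utterance."""
    def score(model):
        if not test_kernels or not model:
            return 0.0
        # average, over the test kernels, of the best match in the model
        return sum(max(sim(t, m) for m in model) for t in test_kernels) / len(test_kernels)
    return max(event_models, key=lambda e: score(event_models[e]))
```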

Symmetric contrast labels have not yet been applied in our speech understanding task, but we expect that automatically labeled symmetric contrast will be useful as a cue for robust semantic parsing.


7. Discussion and conclusions

This paper has computationalized two linguistic concepts, contrast and focus, that are assumed to be useful for robust understanding of spontaneous spoken messages in a dialogue system. Standard and reasonable linguistic definitions of focus are difficult to implement computationally, because the scope of focus is dependent on the syntactic structure of the utterance and highly variable. In order to create a computationally feasible focus detector, this paper has defined the focus kernel to be the word in an utterance containing new information neither presupposed by the interlocutor nor contained in the precedent words of the utterance. This paper has also defined symmetric contrast to be a pair of words that are parallel or symmetric in linguistic structure but different or contrastive in meaning. Symmetric contrast marks information about the discourse structure of an utterance that may be useful for robust semantic parsing.

Novelty detection has been widely investigated in natural language processing (NLP) at the sentence or document level. For example, the TREC Novelty Track and Topic Detection and Tracking aim to extract sentences or documents that discuss new developments of a document or discourse topic. In this study, we have extended the study of informative novelty to an analysis of word-level novelty in spontaneous speech. In future work, it may be possible to use the focus kernel for the purpose of efficiently summarizing the content of an utterance. By definition, the set of words defined to be the focus kernel is a maximally informative summary of the utterance in context. In this sense, the use of focus kernels for utterance summarization would be similar to the task of sentence-selection-based topic summarization of a document in NLP: topic summarization of a document can be realized by seeking a pool of sentences, each of which expresses information not contained in its precedent sentences in the document (Zechner and Waibel, 2000).

The effectiveness of the proposed symmetric contrast detection and focus kernel detection systems has been evaluated using a transcribed children's spontaneous speech corpus, which was collected using Wizard-of-Oz simulations of an intelligent tutoring dialogue system. The detection of symmetric contrast and focus kernel was based on word dissimilarity measures, part-of-speech tagging, and measurements of the acoustic correlates of prosody including duration, pitch, and energy. We also proposed spectral balance cepstral coefficients (intensity and shape of the high-frequency spectrum) as one more acoustic correlate of pitch accent. The word dissimilarity measure combines corpus statistics and an application-oriented ontology. Classification achieved accuracies of 83.8% for focus kernel classification and 92.8% for symmetric contrast classification. Our tests showed that the spectral balance cepstral coefficients, the semantic dissimilarity measure, and


part-of-speech played important roles in the focus kernel and symmetric contrast detections.

Focus kernel extraction has been applied in a spontaneous spoken language understanding system. Symmetric contrast has not yet been applied in any spoken language understanding (SLU) system, but seems well suited for use as a cue in robust semantic parsing. Our SLU based on the detection of focus kernel and symmetric contrast differs from most existing SLU systems in that it is not keyword-based (it does not require the system designer to specify, in advance, a list of "case markers"), and it also does not require a predefined grammar (used for syntactic parsing or extraction of finite-state-machine-based semantic concepts). Focus kernel and symmetric contrast begin SSLU by extracting, in a relatively general way, the pragmatically and semantically salient information of the utterance, thus allowing the interpretation of utterance meaning from unconstrained human language.

8. Uncited reference

Ren et al. (2004b).


Acknowledgements

We would like to thank Richard W. Sproat for helpful discussions and for providing us with the GigaWord text corpus. We would also like to thank Carla Umbach and Chungmin Lee for their comments and suggestions. This work is supported by NSF grant number 0085980. Statements in this paper reflect the opinions and conclusions of the authors, and are not endorsed by the NSF.


Appendix A

The ontology construction procedure:

1. Nouns, verbs, adjectives, and adverbs are categorized into the semantic class Content, and function words are categorized into the semantic class Function.

2. The content words are clustered according to their semantic meanings, regardless of their part-of-speech:
(1) For each noun
a. According to the word meaning in the ITS dialogue context, search WordNet for the most appropriate hypernym of the word. Delete those concepts that contain redundant information for knowledge representation in the ITS scenario. For example, in WordNet we have month → Gregorian month → April ('→' denotes subordination); however, we delete 'Gregorian month' in constructing the ontology of the ITS corpus.
b. Check WordNet to see if any word in the corpus can be a synonym of the noun.
c. If the noun has no hypernym, then it becomes a direct subordinate of Content.

(2) After the hierarchy of noun words has been constructed, we
a. Verify that every member of a concept hierarchy is a subordinate of the root concept.
b. Verify that the direct subordinates of a concept are semantically parallel to each other.
c. Verify that the hierarchy rooted in a concept has a meaning distinct from the hierarchies rooted in other parallel concepts.

(3) For each verb, strip tense, and then check WordNet to see whether its hypernyms ('is one way to' relationship) are identical or similar to some noun concepts that have been defined in the ontology. If yes, then merge the associated hierarchy into the existing ontology. Otherwise, add a new verb concept to the ontology following step (1). Some verbs have multiple meanings relevant to the ITS scenario; e.g., make means


Fig. B1. Major ontological categories in the ITS corpus.


'make somebody do something' or 'produce something.' Decode the multiple meanings of a word, if any.

(4) Check the ontology structure:
a. Search the ontology for sets of concepts that can be merged together.
b. Search for more synonym words in the ITS dialogue scenario. The synonyms of a word can be words of different part-of-speech; e.g., a noun can be a synonym of a verb as long as their semantic meanings are similar, such as rotation and spin.
c. Go to step (2).


(5) For each adjective, check WordNet to see whether its hypernyms ('is a value of' relationship) are identical or similar to a noun concept. If yes, then merge the associated hierarchy into the existing ontology. Otherwise, use the verb or noun derivation (or root) of the adjective to find its hypernym, following step (1). For example, we use the information about confuse to find the hypernym of confused. Define its synonyms and antonyms with reference to WordNet. If the hypernym of the adjective cannot be found, then we use the hypernym of its noun or verb synonyms; for example, we use the hypernym of the verb agree to find the hypernym of alright. Otherwise, use the synonyms of the adjective as clues to find its hyponyms. Then go to step (4).


Fig. C1. Context examples.


(6) For each adverb, WordNet does not define a hypernym. Therefore, we use the stem adjectives as information clues to find the hypernyms. If an adverb has no stem adjective, then use the stem adjectives of its synonyms as information clues. Then go to step (4).

3. Function words typically have little or no semantic content apart from their syntactic use, so we categorize them according to syntactic usage: the words of the same part-of-speech are synonyms of each other, and their part-of-speech is their hypernym.

Appendix B. Fig. B1.


Appendix C

To illustrate the procedure of using corpus statistics for word dissimilarity, consider the utterance "I lost count let me try again." We choose k₁ = k₂ = 0.6, c₁ = 2, s₁ = 2, c₂ = 1, s₂ = 1, and present the derived contexts of the content words in Fig. C1. The PMI scores of some pairs of words are shown in Fig. C2. The listed scores illustrate a problem with mutual information: it is biased towards infrequent words, such as 'alright.' The dissimilarity scores are shown in Fig. C3. Therefore, as defined by Eq. (8), we have the novelty measures for the content words in the utterance: N_count = 0.021, N_let = 0.014, N_try = 0.004, N_again = 0.002.

Fig. C2. PMI examples.

Fig. C3. Examples of dissimilarity scores.


Appendix D.

Fig. D1 lists the dissimilarity scores of some word pairs obtained by the ontological, statistical, and combinational methods, respectively. The table shows examples of the four cases in knowledge combination: (1) dissimilar by both ontology and corpus, as in ⟨any, let⟩ and ⟨center, happen⟩; (2) dissimilar by ontology but similar by corpus, as in ⟨already, out⟩ and ⟨take, down⟩; (3) similar by ontology but dissimilar by corpus, as in ⟨count, figure⟩ and ⟨rate, speed⟩; and (4) similar by both ontology and corpus, as in ⟨how, when⟩ and ⟨ones, teeth⟩. The combination uses the interpolation parameter λ = 0.5. The examples show that the combination compromises between the errors caused by the specificity of the ontology and the generality of the corpus, respectively.

Fig. D1. Examples of word dissimilarities by combined ontology and statistics.


References

Beckman, M.E., Ayers, G.M., 1994. Guidelines for ToBILabeling. Available from: <http://www.ling.ohio-state.edu/phonetics/ToBI/main.html>.

Bolinger, D., 1961. Contrastive accent and contrastive stress.Language 37, 83–96.

Bolinger, D., 1965. Forms of English. Harvard UniversityPress, Cambridge, MA.

Bosch, P., van der Sandt, R., 1999. Focus: Linguistic, Cogni-tive, and Computational Perspective. Cambridge UniversityPress, Cambridge, UK.

Chu-Carroll, J., Carpenter, B., 1999. Vector-based naturallanguage call routing. Comput. Linguistics 25 (3), 361–388.

Chomsky, N., 1971. Deept structure, surface structure andsemantic interpretation. In: Steinberg, D., Jakobovits, L.(Eds.), Semantics: An Interdisciplinary Reader in Linguis-tics, Philosophy and Psychology. Cambridge UniversityPress, Cambridge, UK.

Dagan, I., Marcus, S., Markovitch, S., 1995. Contextual word similarity and estimation from sparse data. Computer Speech Language 9, 123–152.

Dahl, Ö., 1969. Topic and Focus: A Study in Russian and General Transformational Grammar. Elanders Boktryckeri, Göteborg.

Daubechies, I., 1990. The wavelet transform, time-frequency localization and signal analysis. IEEE Trans. Information Theory 36 (5), 961–1005.

Edmonds, P., Hirst, G., 2002. Near-synonymy and lexical choice. Comput. Linguistics 28 (2), 105–144.

Firbas, J., 1964. On defining the theme in functional sentence analysis. Travaux Linguistiques de Prague 1, 267–280.

Firbas, J., 1966. Non-thematic subjects in contemporary English. Travaux Linguistiques de Prague 2, 229–236.

Flammia, G., 1998. Discourse segmentation of spoken dialogue: an empirical approach. Ph.D. Thesis, MIT.

Gorin, A.L., Abella, A., Alonso, T., Riccardi, G., Wright, J.H., 2002. Natural spoken dialog. IEEE Computer Magazine 35 (4), 51–56.

Gundel, J.K., Fretheim, T., 2001. Topic and focus. In: Horn, L., Ward, G. (Eds.), The Handbook of Pragmatic Theory. Blackwell Publishers, Malden, MA.

Gussenhoven, C., 2002. Intonation and interpretation: phonetics and phonology. In: Bel, B., Marlien, I. (Eds.), Speech Prosody 2002, Aix-en-Provence.

Halliday, M., 1967. Notes on transitivity and theme in English. Part II. J. Linguistics 3, 199–244.


Hedberg, N., Sosa, J.M., 2001. The prosody of topic and focus in spontaneous English dialogue. LSA Topic and Focus Workshop.

Heldner, M., Strangert, E., Deschamps, T., 1999. A focus detector using overall intensity and high frequency emphasis. Internat. Congress of Phonetic Sciences.

Higgins, D., 2004. Which statistics reflect semantics? Rethinking synonymy and word similarity. Internat. Conf. on Linguistic Evidence.

Jackendoff, R., 1972. Semantic Interpretation in Generative Grammar. MIT Press, Cambridge, MA.

Jiang, J.J., Conrath, D.W., 1997. Semantic similarity based on corpus statistics and lexical taxonomy. Proc. Internat. Conf. Research on Comput. Linguistics.

Kadmon, N., 2001. Formal Pragmatics. Blackwell Publishers, Malden, MA.

Kay, M., 1975. Syntactic processing and functional sentence perspective. In: Schank, R., Nash-Webber, B. (Eds.), Theoretical Issues in Natural Language Processing. MIT Press, Cambridge, MA.

Kim, S.-S., 1998. Time-delay recurrent neural network for temporal correlations and prediction. Neurocomputing 20, 253–263.

Kim, S.-S., Hasegawa-Johnson, M., Chen, K., 2003. Automatic recognition of pitch movements using multi-layer perceptron and time-delay recurrent neural network. IEEE Signal Process. Lett. 11 (7), 645–648.

Krifka, M., 1999. Additive particles under stress. Proc. of SALT 8.

Kruijff-Korbayová, I., Steedman, M., 2003. Discourse and information structure. J. Logic Language Inf. 12, 249–259.

Lee, C., 1999. Contrastive topic: a locus of the interface. In: Turner et al. (Eds.), The Semantics/Pragmatics Interface from Different Points of View (CRiSPI 1). Elsevier Science, Amsterdam.

Lee, C., 2003. Contrastive topic and/or contrastive focus. In: McClure, B. (Ed.), Japanese/Korean Linguistics 12. CSLI, Stanford.

Lee, J.H., Kim, M.H., Lee, Y.J., 1993. Information retrieval based on conceptual distance in is-a hierarchies. J. Documentation 49 (2), 188–207.

Lenci, A., 2001. Building an ontology for the lexicon: semantic types and word meaning. In: Jensen and Skadhauge (Eds.), Ontology-based Interpretation of Noun Phrases: Proc. 1st Internat. OntoQuery Workshop.

Lin, D., 1998. Automatic retrieval and clustering of similar words. Proc. of COLING-ACL, Montreal, Canada.

Miller, G., Fellbaum, C., 2002. WordNet. Available from: <http://www.cogsci.princeton.edu/~wn/>.

Munoz, M., Punyakanok, V., Roth, D., Zimak, D., 1999. A learning approach to shallow parsing. In: EMNLP-WVLC '99.

Pantel, P., Lin, D., 2002. Discovering word senses from text. ACM SIGKDD Conf. on Knowledge Discovery and Data Mining.



Pierrehumbert, J., Hirschberg, J., 1990. The meaning of intonational contours in the interpretation of discourse. In: Cohen, P.R., Morgan, J., Pollack, M.E. (Eds.), Intentions in Communication, pp. 271–311.

Rada, R., Mili, H., Bicknell, E., Blettner, M., 1989. Development and application of a metric on semantic nets. IEEE Trans. Systems Man Cybernet. 19 (1), 17–30.

Ren, Y., Kim, S.-S., Hasegawa-Johnson, M., Cole, J., 2004a. Speaker-independent automatic detection of pitch accent. Internat. Conf. on Speech Prosody.

Ren, Y., Kim, S.-S., Hasegawa-Johnson, M., Cole, J., 2004b. Speaker-independent automatic detection of pitch accent. Proc. ISCA Internat. Conf. on Speech Prosody.

Resnik, P., 1995. Using information content to evaluate semantic similarity in a taxonomy. Proc. of the 14th Internat. Joint Conf. on Artificial Intelligence 1, 448–453.

Rooth, M., 1992. A theory of focus interpretation. Natural Language Semantics 1, 75–116.

Rulequest Research, 2004. Data Mining Tools. Available from: <http://www.rulequest.com/see5-info.html>.

Santorini, B., 1990. Part-of-speech tagging guidelines for the Penn Treebank project. Linguistic Data Consortium.

Sluijter, A., van Heuven, V.J., Pacilly, J., 1997. Spectral balance as a cue in the perception of linguistic stress. J. Acoust. Soc. Amer. 101 (1), 503–513.

Steedman, M., 2000. Information structure and the syntax–phonology interface. Linguistic Inquiry 31 (4), 649–689.

Sussna, M., 1993. Word sense disambiguation for free-text indexing using a massive semantic network. Proc. of the 2nd Internat. Conf. on Information and Knowledge Management.


Terra, E., Clarke, C.L.A., 2003. Frequency estimates for statistical word similarity measures. Proc. of the HLT-NAACL.

Thelen, M., Riloff, E., 2002. A bootstrapping method for learning semantic lexicons using extraction pattern contexts. Proc. of the EMNLP.

Umbach, C., 2004. On the notion of contrast in information structure and discourse structure. J. Semantics 21 (2), 155–175.

Vallduví, E., Vilkuna, M., 1998. On rheme and contrast. In: Culicover, P., McNally, L. (Eds.), Syntax and Semantics, Vol. 29: The Limits of Syntax. Academic Press, San Diego, CA.

Welby, P., 2003. Effects of pitch accent position, type, and status on focus projection. Language Speech 46 (1), 53–81.

Xu, Y., Xu, C.X., Sun, X., 2004. On the temporal domain of focus. Proc. ISCA Internat. Conf. on Speech Prosody.

Yoon, T.-J., Chavarria, S., Cole, J., Hasegawa-Johnson, M., 2004. Inter-transcriber reliability of prosodic labeling on telephone conversation using ToBI. Proc. ICSLP.

Zechner, K., Waibel, A., 2000. DIASUMM: flexible summarization of spontaneous dialogues in unrestricted domains. Proc. of COLING.

Zhang, T., 2004. Spoken language understanding in an intelligent tutoring scenario. Dissertation, University of Illinois at Urbana-Champaign.

Zhang, T., Hasegawa-Johnson, M., Levinson, S.E., 2004. Children's emotion recognition in an intelligent tutoring scenario. Proc. ICSLP.

Zubizarreta, M.L., 1998. Prosody, Focus, and Word Order. MIT Press, Cambridge, MA.