Investigations on Prosodic Emphasis and Semantic Content of Word Categories for Speech Technology Applications

Christina Alexandris

National University of Athens and Institute for Language and Speech Processing (ILSP), Athens, Greece


Prosodic Emphasis and Semantic Content

In the present study we attempt to differentiate between specific word categories whose semantic content is not determined by prosodic emphasis (I) and word categories whose semantic content may be determined by prosodic emphasis (II).

In the first case, the semantic interpretation of the entire phrase or sentence may be determined by the type of element receiving prosodic emphasis, but the semantic content of the emphasized element itself is not affected.

Prosodic modeling for speech output in the CitizenShield Dialog System

The behavior of the word categories is observed during the prosodic modeling of utterances recorded in a studio for the construction of the speech output produced by a Conversational Agent in a dialog system for consumer complaints, the CitizenShield (“POLIAS”) system.

(Project co-funded by the European Commission (European Union) within the Third Framework Programme, Measure 3.3, Action 3.3.1 of the National Operational Programme "Information Society", Greece, National Project: "Processing of Images, Sound and Language")

Brief Description of the CitizenShield Project

The CitizenShield Project involves a hybrid approach combining keyword recognition and free input for managing spoken user input entered in the CitizenShield (“POLIAS”) system for consumer complaints. The CitizenShield dialog system is intended to improve the service of the EKPIZO consumer organization by relieving the organization’s call center of routine tasks and, consequently, allowing the operators to devote their time and energy to handling complex cases.

The spoken input of the customer’s complaints is automatically entered into templates containing a number of fields related to the categories and types of information concerning the product involved. The automatic filling-in of complaint forms with spoken input via the organization’s call center is especially helpful to mobile users and users that have no internet access.

In the hybrid approach concerned, spoken input is either processed by a key-word recognition mechanism or directly entered as free input in the respective field in the complaint form. The type of processing used is related to the type of question produced.
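The routing described above can be sketched as follows. This is a minimal illustration, not the CitizenShield implementation: the names `InputMode`, `Question` and `process_answer`, the form-field names, and the simple token matching are all assumptions made for the example.

```python
import re
from dataclasses import dataclass
from enum import Enum
from typing import Optional, Tuple


class InputMode(Enum):
    KEYWORD = "keyword"    # answer is matched against a closed vocabulary
    FREE_INPUT = "free"    # answer is stored verbatim in the form field


@dataclass
class Question:
    form_field: str                  # hypothetical complaint-form field name
    mode: InputMode                  # processing type tied to the question
    keywords: Tuple[str, ...] = ()   # accepted keywords (KEYWORD mode only)


def process_answer(question: Question, transcript: str) -> Optional[str]:
    """Return the value to store in the form field, or None if nothing matched."""
    if question.mode is InputMode.FREE_INPUT:
        # Free input is entered directly in the respective field.
        return transcript.strip()
    # Keyword mode: look for one of the expected keywords among the tokens.
    tokens = re.findall(r"\w+", transcript.lower())
    for keyword in question.keywords:
        if keyword in tokens:
            return keyword
    return None


packaging = Question("packaging_problem", InputMode.KEYWORD, ("yes", "no"))
details = Question("additional_information", InputMode.FREE_INPUT)
print(process_answer(packaging, "Yes, there was"))
print(process_answer(details, " the lid was loose "))
```

The point of the sketch is only that the processing type is a property of the question, so the dialog manager does not have to guess how to treat each answer.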

Based on data collected from research on user requirements, the dialog system supports directed dialogs and manages the three most commonly occurring customer complaint categories, namely food products, defective or problematic retail goods, and banking and insurance services. The series of questions involved constitutes a directed dialog, created from a combination of requirements related to speech processing efficiency and the requirements and policy of the EKPIZO organization.

The information contained in each field of the complaint form can be processed automatically or manually, according to the type of task to be executed. These complaint forms constitute a processable database from which online information is generated, for example, graphs depicting the percentages of product types or companies involved in the majority of current customer complaints.
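As an illustration of how such percentages could be derived from the stored forms, the following minimal sketch counts the values of one form field. The field name `product_type` and the sample records are hypothetical; the actual database schema is not described in the source.

```python
from collections import Counter
from typing import Dict, Iterable, Mapping


def field_percentages(forms: Iterable[Mapping[str, str]],
                      field_name: str = "product_type") -> Dict[str, float]:
    """Share (in percent) of each value of `field_name` across complaint forms."""
    # Forms in which the field is missing or empty are ignored.
    counts = Counter(form[field_name] for form in forms if form.get(field_name))
    total = sum(counts.values())
    if total == 0:
        return {}
    return {value: 100.0 * n / total for value, n in counts.items()}


# Hypothetical complaint forms, one dictionary per filed complaint.
sample_forms = [
    {"product_type": "food"},
    {"product_type": "food"},
    {"product_type": "retail goods"},
    {"product_type": "banking and insurance"},
]
print(field_percentages(sample_forms))
```

The resulting dictionary is the kind of data a graphing component would consume.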

CitizenShield Dialog System - Overview

[Figure: overview of the CitizenShield dialog system. Labeled components: Phonetic Platform; PSTN; Telephone Lines; ASR/TTS Clients; ASR servers; ASR Engines OSR 3.0; TTS servers; Speech Synthesis; Path server; TCP/IP; Automated Speech Recognition (ASR); Text-to-Speech Synthesis (TTS); Call Center; CitizenShield Databases; CitizenShield System.]

Evaluation of Prosodic Modeling

Prosody acceptable to the target user group, the general public, is modeled using emphasis as a prosodic marker, placed on the appropriate elements of the utterance to achieve (A) clarity and directness and (B) user-group-oriented user-friendliness and a sense of control for the user (Alexandris, 2007; Nottas et al., 2007).

The test data set consists mainly of live recorded telephone conversations with the EKPIZO Consumer Organization (End-User), made with the help and consent of the 1500-2800 consumers (average age 40-60 years) who are also registered members of the EKPIZO Organization. These users (Test Users) constitute the Registered Member category (1), whose filed complaints are considered to be of greater statistical weight. The other two groups of users are (2) the Semi-anonymous Users (who have provided some contact information) and (3) the Anonymous Users (of the lowest statistical weight).

The quality of the prosodic modeling of the utterances produced by the System’s Conversational Agent constituted one of the System’s parameters (“Naturalness” and “Credibility of System”) that were evaluated by the Users in the Evaluation Phase of the Project (Workpackage 7, “Evaluation & Pilot Application”).

For the present task, fifty (50) Users (ages 28-65, average age 35-45, approximately 60% men and 40% women) were asked to evaluate the quality of the prosodic modeling of the utterances produced by the System’s Conversational Agent (“Evaluation Phase 50-Users Group”).

Evaluation of the implementation of the prosodic modeling of the Conversational Agent’s utterances (emphasis on selected words, subsequently divided into Categories I and II):

Interaction aspects related to (1) Utterance Level, (2) Functional Level and (3) Satisfaction Level (Moeller, 2005)

Evaluation performed according to Degree of Naturalness / Acceptability (“informativeness”, “intelligibility” (Moeller, 2005))

Results of Evaluation Phase 50-Users Group: 80% “very natural-sounding” (40%) or “quite natural-sounding” (40%), and 20% “satisfactory results”

Evaluation performed according to Degree of Credibility (“perceived task success”, “functional limits” (Moeller, 2005))

Results of Evaluation Phase 50-Users Group: 50% “seems to inspire very high credibility” (33.3%) or “seems to inspire quite high credibility” (16.6%), and 50% “satisfactory results”

Initial observations of the recorded data – Categories I and II

From the data collected both from (a) the recorded corpora from the Test Users for the System’s Automatic Speech Recognition Component and from (b) user tests and questionnaires related to the Conversational Agent’s output (Spoken Output) (Evaluation Phase 50-Users Group), it was initially observed that in Modern Greek, word categories may be divided into two groups, namely (Category I) word categories whose semantic content is not determined by prosodic emphasis and (Category II) word categories whose semantic content may be determined by prosodic emphasis.

Prosodic Emphasis as a linguistic element: Further investigations

Based on the empirical data obtained both from the recorded corpora (Speech Recognition Component) and from the evaluation of the spoken utterances produced by the Conversational Agent, we attempt to go a step further and provide

(1) an integration of the results of previous studies and

(2) a differentiation between specific word categories whose semantic content is not determined by prosodic emphasis (I) and word categories whose semantic content may be determined by prosodic emphasis (II).

In the first case (Category I), the semantic interpretation of the entire phrase or sentence may be determined by the type of element receiving prosodic emphasis (for deixis or emphasis), but the semantic content of the emphasized element itself is not affected.

Examples of Word Category I from the CitizenShield Dialog System (prosody and interpretation of sentence)

Examples of preferred utterances by Users (prosodic emphasis is underlined):

[5]: INTERACTION 5: SYSTEM: Please answer the following questions with a “yes” or a “no”. Was there a problem with the product’s packaging?

[USER: YES/NO]

[6]: INTERACTION 6: SYSTEM: At what price did you buy the product? [USER: PRICE (EURO)]

[10]: INTERACTION 10: SYSTEM: Please tell us any additional information you wish about the product or about your transaction.

[USER: FREE INPUT]

[5.3.1]: INTERACTION 5: SYSTEM: How did you realize this? Please speak freely. Please confirm to me that you are finished. As soon as you are ready, please press “1”.

[USER: FREE INPUT]

(Translation from Modern Greek with proximity to original syntactic structure)

Relation of Words of Category II to Prosodic Emphasis (prosody and semantics of individual words)

Spatial and Temporal Expressions:
Presence of Prosodic Emphasis = indexical interpretation (= “exactly”)
Absence of Prosodic Emphasis = vague interpretation

Quantifiers and numericals:
Presence of Prosodic Emphasis = indexical interpretation (= “exactly”)
Absence of Prosodic Emphasis = fixed expression

Discourse Particles used as Politeness Markers:
Presence of Prosodic Emphasis = discourse particles, not associated with politeness
Absence of Prosodic Emphasis = discourse particles as politeness markers
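The correspondences listed above amount to a small lookup table keyed by word class and by the presence of emphasis. The sketch below encodes exactly those six cases; the identifiers (`INTERPRETATION`, `interpret`, and the class and reading labels) are illustrative names chosen for this example, not terminology from the system.

```python
from typing import Dict, Tuple

# (word class, emphasis present?) -> reading of the individual word.
INTERPRETATION: Dict[Tuple[str, bool], str] = {
    ("spatial_temporal", True): "indexical",             # = "exactly"
    ("spatial_temporal", False): "vague",
    ("quantifier_numerical", True): "indexical",         # = "exactly"
    ("quantifier_numerical", False): "fixed_expression",
    ("discourse_particle", True): "plain_discourse_particle",
    ("discourse_particle", False): "politeness_marker",
}


def interpret(word_class: str, emphasized: bool) -> str:
    """Reading of a Category II word given the presence or absence of emphasis."""
    return INTERPRETATION[(word_class, emphasized)]


print(interpret("spatial_temporal", True))
print(interpret("discourse_particle", False))
```

A table of this kind could be read in either direction: to choose the prosody for a desired reading in speech output, or to choose the reading for a detected emphasis in speech input.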

Spatial and Temporal Expressions

Presence of Prosodic Emphasis = indexical interpretation (= “exactly”)

“'dipla” [+ indexical] = “along”: [“the crack was exactly along (parallel to) the band in the packaging”]

“'oso” [+ indexical] = “for as long as”: [“the array is created for as long as the specific loop is running”]

Absence of Prosodic Emphasis = vague interpretation

“'dipla” [- indexical] = “next-to”: [“the crack was next to the band in the packaging”]

“'oso” [- indexical] = “while”: [“the array is created while the specific loop is running”]

Prosodic emphasis (underlined) used for ambiguity resolution. Examples chosen from Alexandris et al., 2005.

Previous Studies: Prosodic Emphasis in Greek Spatial and Temporal Expressions

For Modern Greek, previous studies have demonstrated a relation between prosodic information and the degree of precision and lack of ambiguity (Alexandris et al., 2005). Specifically, it was observed that emphasis as an extra-linguistic marker enables the distinction between indexical and vague temporal (Schilder & Habel, 2001) and spatial expressions, without any reference to the context.

In previous studies, the prosody-dependent indexical versus vague interpretation of spatial expressions was accounted for in the form of additional [± indexical] features located at the end-nodes of the spatial ontology, where the level of the [± indexical] features may also be regarded as a boundary between the Semantic Level and the Prosodic Level (Alexandris, 2007a).

The semantics are so restricted at the end-nodes of the ontologies that they achieve a semantic prominence imitating the prosodic prominence in spoken texts.
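To make the idea of [± indexical] features at the end-nodes concrete, the following sketch builds a toy ontology from the “'dipla” and “'oso” examples and selects an end-node by the presence of emphasis. The tree layout, the `Node` class and the `resolve` function are assumptions for illustration only; the structure of the ontology used in the cited studies is not given here.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Node:
    label: str
    children: Dict[str, "Node"] = field(default_factory=dict)
    indexical: Optional[bool] = None   # set only on end-nodes
    gloss: Optional[str] = None        # English gloss, end-nodes only


# Toy ontology (hypothetical layout): domain -> expression -> end-nodes.
ONTOLOGY = Node("expression", {
    "spatial": Node("spatial", {
        "'dipla": Node("'dipla", {
            "+indexical": Node("+indexical", indexical=True, gloss="along"),
            "-indexical": Node("-indexical", indexical=False, gloss="next to"),
        }),
    }),
    "temporal": Node("temporal", {
        "'oso": Node("'oso", {
            "+indexical": Node("+indexical", indexical=True, gloss="for as long as"),
            "-indexical": Node("-indexical", indexical=False, gloss="while"),
        }),
    }),
})


def resolve(domain: str, word: str, emphasized: bool) -> str:
    """Pick the end-node whose [± indexical] feature matches the prosody."""
    entry = ONTOLOGY.children[domain].children[word]
    for end_node in entry.children.values():
        if end_node.indexical is emphasized:
            return end_node.gloss
    raise LookupError(f"no end-node for {word!r} with emphasized={emphasized}")


print(resolve("spatial", "'dipla", True))
print(resolve("temporal", "'oso", False))
```

In this layout the prosodic information is consulted only at the last step of the descent, which mirrors the description of the [± indexical] level as a boundary between the Semantic Level and the Prosodic Level.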

Prosodic prominence on Greek spatial and temporal expressions has been shown (1) to contribute to the recognition of their “indexical” versus “vague” interpretation (Schilder & Habel, 2001), according to previous studies (Alexandris et al., 2005), and (2) to act as a default in preventing their possible interpretation as part of a quantificational expression (many Greek spatial expressions also occur within a quantificational expression).

Specifically, it has been observed that prosodic prominence (1) is equally perceived by most users (Alexandris & Fotinea, 2006) and (2) contributes to ambiguity resolution of spatial and temporal expressions (Alexandris et al., 2005).

Quantifiers and Numericals

Presence of Prosodic Emphasis = indexical interpretation (= “exactly”)

Περιμένετε δύο λεπτά (wait for two minutes)
INFORMATION: TEMPLATE: WAIT TIME (2 MINUTES)

Absence of Prosodic Emphasis = fixed expression

Περιμένετε δύο λεπτά (wait for a few minutes) (= fixed expression)
INFORMATION: TEMPLATE: WAIT TIME (FEW MINUTES)

Prosodic emphasis (underlined) used for ambiguity resolution (examples chosen from Alexandris 2007a, Alexandris 2005).
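The effect of emphasis on the template entry in this example can be sketched in a few lines. The function name and the boolean flag are hypothetical; the sketch assumes that an upstream component has already decided whether the utterance “Περιμένετε δύο λεπτά” carries prosodic emphasis.

```python
def wait_time_template(emphasized: bool) -> str:
    """Template entry for the utterance "Περιμένετε δύο λεπτά"."""
    # Emphasis present -> indexical reading ("exactly two minutes").
    # Emphasis absent  -> fixed expression ("a few minutes").
    value = "2 MINUTES" if emphasized else "FEW MINUTES"
    return f"INFORMATION: TEMPLATE: WAIT TIME ({value})"


print(wait_time_template(True))
print(wait_time_template(False))
```

The same string of words thus yields two different template values, which is why the prosodic feature has to be carried along with the text.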

Discourse Particles used as Politeness Markers

Examples of preferred utterances by Users (prosodic emphasis is underlined) :

Absence of Prosodic Emphasis = discourse particles (in italics) as politeness markers:

CITIZENSHIELD CONVERSATIONAL AGENT:

Tell me, what-is the product (for example, milk, coffee, something else)
Πείτε μου τι είναι το προϊόν (παραδείγματος χάρη, γάλα, καφές, κάτι άλλο)

Presence of Prosodic Emphasis = discourse particles (in italics), not associated with politeness:

CITIZENSHIELD CONVERSATIONAL AGENT:

Tell me, what-is the product (for example, milk, coffee, something else)
Πείτε μου τι είναι το προϊόν (παραδείγματος χάρη, γάλα, καφές, κάτι άλλο)

Classification of Modern Greek Discourse Particles signaling Positive Politeness (“Politeness Markers”) and their Direct Translation in English (Alexandris and Fotinea 2004)

POLITENESS MARKER (Positive Politeness) | DIRECT TRANSLATION TO ENGLISH (Not literal meaning) | SPEECH ACT
Α, Ωραία | Ah, Fine | ACKNOWLEDGE
Ωραία | Fine | CHECK
Μάλιστα, Α, Ωραία, Έχουμε και λέμε, Για να δούμε | Yes, Ah, Fine, So we have, Let's see | CONFIRM
Ωραία, Για να δούμε | Fine, Let's see | FILLED PAUSE
Έχουμε και λέμε | So we have | INFORM
Μήπως, Πείτε μου | Would (you), Tell me | REQUEST
Α, Μήπως | Ah, Would (you) | Y/N QUESTION

Category III: The “in-between” group (prosodic emphasis as an “intensifier” of semantic content)

An additional in-between group of word categories is identified, where prosodic emphasis may emphasize or intensify the semantic content, but where the presence or absence of prosodic emphasis does not determine the type of semantic content. It is observed that this group involves (1) adjectives expressing quality and (2) adverbs expressing mode.

Specifically, this group involves (1) adjectives signaling quality perceptible to the senses, used in a literal, non-metaphorical way and (2) adverbs signaling mode perceptible to the senses, used in a literal, non-metaphorical way.

It is observed that prosodic emphasis on other types of adverbs and adjectives plays a role similar to that in the words of Category I, namely emphasis (point focus in the entire sentence) and deixis.

However, this differentiation between pure emphasis/deixis and “intensifying” semantic content is not 100% speaker-independent, as shown by the evaluation results. Native speakers do not always make this differentiation (approximately half of them do and half do not).

The “in-between” group: Examples

Adjectives signaling quality perceptible to the senses, used in a non-metaphorical way (prosodic emphasis is underlined):

“strogi'lo”: It was in a round box (= truly, “par excellence” round)

“'citrino”: The leaves of the lettuce were yellow (= truly, “par excellence” yellow)

Adverbs signaling mode perceptible to the senses, used in a non-metaphorical way (prosodic emphasis is underlined):

“ar'ga”: I opened it slowly (= very slowly indeed)

“a'napoda”: I turned it upside down (= completely upside down)

Present Word Category Differentiation: Categories A, B and C

The group of word categories whose semantic content may be determined by prosodic emphasis, namely (1) spatial and temporal expressions, (2) a subgroup of discourse particles identified as “politeness markers” (Alexandris and Fotinea, 2004) and (3) a subgroup of quantifiers and numericals, is classified as Category A.

The group of word categories where prosodic emphasis may emphasize or intensify, but may not determine, the semantic content is classified as Category B. This group involves (1) adjectives expressing quality and (2) adverbs expressing mode perceptible to the senses, used in a literal, non-metaphorical way.

The rest of the word categories, whose semantic content is not affected by prosodic emphasis, are classified as Category C. The presence or absence of prosodic emphasis on words of Category C only affects the semantic interpretation of the entire phrase or sentence to which they belong.

“Prosodically Determined”, “Prosodically Sensitive” and “Prosodically Independent” words in Speech Technology Applications

For reasons of clarity, we name the group of word categories in Category A “Prosodically Determined” words, the group of word categories in Category B “Prosodically Sensitive” words and the rest of the word categories, namely Category C “Prosodically Independent” words.

Prosodic emphasis on the word elements of Categories A and B is related to producing the correct semantics of the words for achieving user-friendliness in man-machine communication in the sense of accuracy and directness (Hausser, 2006). These Categories are sentence-independent and may be systematically used in various Speech Technology Applications (Text-to-Speech (TTS) and Automatic Speech Recognition (ASR)).
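Because the three groups are described as sentence-independent, they can be represented as a simple word-type lookup that a TTS or ASR component might consult. The sketch below is one possible encoding under that reading; the enum, the word-type labels and the function names are assumptions for illustration and do not come from the CitizenShield system.

```python
from enum import Enum


class ProsodicClass(Enum):
    DETERMINED = "A"    # emphasis may determine the word's semantic content
    SENSITIVE = "B"     # emphasis may intensify the word's semantic content
    INDEPENDENT = "C"   # emphasis affects only sentence-level interpretation


# Hypothetical word-type labels mapped to the groups named in the text.
CLASS_BY_WORD_TYPE = {
    "spatial_temporal_expression": ProsodicClass.DETERMINED,
    "politeness_marker": ProsodicClass.DETERMINED,
    "quantifier_numerical": ProsodicClass.DETERMINED,
    "sensory_quality_adjective": ProsodicClass.SENSITIVE,
    "sensory_mode_adverb": ProsodicClass.SENSITIVE,
}

EFFECT_OF_EMPHASIS = {
    ProsodicClass.DETERMINED: "selects the semantic content of the word",
    ProsodicClass.SENSITIVE: "intensifies the semantic content of the word",
    ProsodicClass.INDEPENDENT: "changes only the interpretation of the sentence",
}


def prosodic_class(word_type: str) -> ProsodicClass:
    """Any word type not listed for A or B falls into Category C."""
    return CLASS_BY_WORD_TYPE.get(word_type, ProsodicClass.INDEPENDENT)


def effect_of_emphasis(word_type: str) -> str:
    return EFFECT_OF_EMPHASIS[prosodic_class(word_type)]


print(prosodic_class("quantifier_numerical").value)
print(effect_of_emphasis("noun"))
```

Treating Category C as the default keeps the table short: only the word types whose word-level semantics interact with prosody need to be listed explicitly.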

“Prosodically Determined”, “Prosodically Sensitive” and “Prosodically Independent” words

WORD-LEVEL SEMANTICS: Prosodically Determined (Category A), Prosodically Sensitive (Category B)

SENTENCE-LEVEL SEMANTICS (& PRAGMATIC LEVEL): Prosodically Independent (Category C)

References

Alexandris, C. (2007a). “"Show and Tell": Using Semantically Processable Prosodic Markers for Spatial Expressions in an HCI System for Consumer Complaints”, in: Proceedings of HCI 2007, Beijing, China.

Alexandris, C. (2007b). “Integrating Prosodic Information and Selectional Restrictions in Technical Translation”, in: International Journal of Applied Systemic Studies (IJASS), Inaugural Issue, Geneva: Inderscience (in print).

Alexandris, C., Fotinea, S-E. (2006). “Prosodic Emphasis versus Word Order in Greek Instructive Texts”, in: Proceedings of ISCA Tutorial and Research Workshop on Experimental Linguistics, Botinis, A. (ed.), Athens, Greece, August 28-30, 2006, 65-68.

Alexandris, C., Fotinea, S-E., Efthimiou, E. (2005). “Emphasis as an Extra-Linguistic Marker for Resolving Spatial and Temporal Ambiguities in Machine Translation for a Speech-to-Speech System involving Greek”, in: Proceedings of the 3rd International Conference on Universal Access in Human-Computer Interaction (UAHCI 2005), July 22-27, Las Vegas, Nevada, USA.

Alexandris, C., Fotinea, S-E. (2004). “Discourse Particles: Indicators of Positive and Non-Positive Politeness in the Discourse Structure of Dialog Systems for Modern Greek”, in: International Journal for Language Data Processing “Sprache & Datenverarbeitung”, 1-2/2004, 19-29.

Hausser, R. (2006). A Computational Model of Natural Language Communication: Interpretation, Inference and Production in Database Semantics. Berlin, Springer.

Herskovits, A. (1997). “Language, Spatial Cognition and Vision”, in: Stock, O. (ed.): Spatial and Temporal Reasoning. Boston, Kluwer.

Moeller, S. (2005). Quality of Telephone-Based Spoken Dialogue Systems. New York, Springer.

Nottas, M., Alexandris, C., Tsopanoglou, A., Bakamidis, S. (2007). “A Hybrid Approach to Dialog Input in the CitizenShield Dialog System for Consumer Complaints”, in: Proceedings of HCI 2007, Beijing, China.

Schilder, F., Habel, C. (2001). “From Temporal Expressions to Temporal Information: Semantic Tagging of News Messages”, in: Proceedings of the ACL-2001 Workshop on Temporal and Spatial Information Processing, Pennsylvania, 1309-1316.
