
Speech-to-speech machine translation (Cerrato, February 2016)

Page 1

About me

Joined Trinity College Dublin and CNGL (Centre for Global Intelligent Content) in June 2014 as a postdoctoral research fellow.

Since September 2015, in a new role: research coordinator of the research programme within the ADAPT Centre (The Global Centre of Excellence for Digital Content Technology).

Background in linguistics and speech technology

• PhD in Speech Communication, Dept. of Speech, Music and Hearing, KTH, Stockholm (15 May 2007). Thesis: Investigating Communicative Feedback Phenomena across Languages and Modalities.

• University Honours Degree in Modern Foreign Literatures and Languages (specialization in English, French and Linguistics), University of Naples "Federico II" (1988-1992).

• Specialization in Phonetics (Rome, 1993)
• Post Graduate Studies in Hearing and Psycholinguistics (Visiting Scientist at NAL, Australian Hearing Services), 1995
• Junior researcher at the Fondazione Ugo Bordoni (Rome), 1996-1999: research in the field of telephonic communication (voice intelligibility, speaker verification, automatic speech recognition, collaboration with the Scientific Police)

Experience in the industrial sector, 2000-2001 and 2007-2012:
• Infovox and Acapela Group: development of speech synthesis for several languages, evaluation of Human Machine Interactions and Voice Technology (TTS, ASR, MT…)

Page 2

My research interests

Aspects of phonetics and speech sciences:
• Spontaneous speech phenomena: phonetic reductions, Italian and British dialects, sociolinguistic phenomena

• Perceptual aspects of phonetic transmission

• Analysis of human-human (H-H) interactions and human-machine (H-M) interaction, with focus on multimodal feedback and turn-taking phenomena

• Speech data collection and multimodal corpus annotation (creation of multimodal annotation schemes: MUMIN)

• Development of speech technologies (ASR, Speaker Verification, TTS, MT)

• Evaluation of Human-Machine and Human-Human Interactions mediated by machines.

A Speech-to-Speech, Machine Translation Mediated Map Task: An Exploratory Study
Loredana Cerrato, Hayakawa Akira, Nick Campbell, Saturnino Luz*
School of Computer Science and Statistics, Trinity College, Dublin, Ireland
*The University of Edinburgh

The ADAPT Centre is funded under the SFI Research Centres Programme (Grant 13/RC/2106) and is co-funded under the European Regional Development Fund.

Page 3

Abstract

• Investigate how the language technologies of Automatic Speech Recognition (ASR), Machine Translation (MT), and Text To Speech (TTS) synthesis affect users during an interlingual interaction.

• A brief description of the prototype system and the data collection.

• Details on the corpus
• Description and results of a usability test run to assess how the users of the interlingual system evaluate the interactions in a collaborative map task.

• Some results of analysis of user adaptation to the system.

Speech to Speech Translation Systems

Globalisation and internationalisation lead to interactions across linguistic barriers. S2S systems apply language technology to break down language barriers.

Investigate how the language technologies affect users during interlingual interactions:
• ASR
• MT
• TTS

Page 4

The ILMT-s2s system

• Python script

• Google ASR

• Microsoft MT

• Apple TTS
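The slides give no implementation detail beyond this component list, but the architecture is a straightforward chain. Below is a minimal sketch of such a pipeline in Python, assuming hypothetical wrapper functions for the three services; the real system's calls to Google ASR, Microsoft MT and Apple TTS are not shown on the slides.

    # Sketch of the ASR -> MT -> TTS chain (hypothetical wrappers; the real
    # ILMT-s2s script calls the Google, Microsoft and Apple services).

    def recognize(audio, lang):
        """Return the ASR transcript of the audio (e.g. via Google ASR)."""
        raise NotImplementedError

    def translate(text, src, tgt):
        """Return the translation of the text (e.g. via Microsoft MT)."""
        raise NotImplementedError

    def synthesize(text, lang):
        """Return synthesized speech audio for the text (e.g. via Apple TTS)."""
        raise NotImplementedError

    def s2s_turn(audio, src="en", tgt="pt"):
        """One dialogue turn: source-language speech in, target-language
        speech out. The intermediate transcript and translation are also
        returned, so they can be displayed on screen and logged."""
        transcript = recognize(audio, src)
        translation = translate(transcript, src, tgt)
        speech = synthesize(translation, tgt)
        return transcript, translation, speech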

[Example of a recording]

The Map Task data collection

• Interactions between speakers of different languages (English and Portuguese)

Remote interactions - over the network - to solve a specific task

• Map Task: 2 similar, but different, maps; 1 route for the "Giver" to guide the "Follower" along

• Follows the model of the HCRC Map Task, which elicits dialogue with a rich range of phenomena to analyse

[Figure: the Giver's map and the Follower's map]

Page 5

Collected data

The collected data includes a variety of synchronised and finely time-stamped data streams:

• High-quality audio of the participants' utterances, video, and ASR, MT and TTS events.

• Heart rate, skin conductance, blood volume pressure and EEG signals of one of the participants in each dialogue (recorded by means of a biosignal monitoring device).

• Gaze and eye movements (of the same participant wearing the biosignal monitoring device) recorded with a portable eye-tracking device.
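To give a concrete sense of what "synchronised and finely time-stamped" enables, here is a small sketch of aligning a discrete event stream with a continuously sampled biosignal using pandas. The column names and the 100 Hz sampling rate are assumptions for illustration, not details of the corpus.

    import pandas as pd

    # Hypothetical extracts from two corpus streams: discrete ASR/MT/TTS
    # events and a continuously sampled biosignal (column names invented).
    events = pd.DataFrame({
        "t": pd.to_timedelta([2.4, 7.9, 13.1], unit="s"),
        "event": ["ASR_result", "MT_result", "TTS_done"],
    })
    biosignal = pd.DataFrame({
        "t": pd.to_timedelta(range(0, 15000, 10), unit="ms"),  # 100 Hz
        "skin_conductance": 0.0,  # placeholder values
    })

    # Attach to each event the nearest preceding biosignal sample.
    aligned = pd.merge_asof(events, biosignal, on="t", direction="backward")
    print(aligned)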

The ILMT-s2s corpus

• Each recording session lasts between 20 and 74 minutes and contains between 33 and 199 utterances to the system.
• 2 different settings (video-on vs no-video)
• 15 native English speakers (5 female, 10 male); 15 native Portuguese speakers (11 female, 4 male)
• Different synchronized data streams (audio, video, eye gaze, biosignals)

Page 6

Two different settings for the data collection

No-video: Interlocutors cannot see each other. They communicate by voice through the ILMT-s2s system and only the text of the interaction is displayed on the computer's screen.

Video-on: Interlocutors can see each other. Same setup as no-video, with the addition of a constant live video stream of the other user displayed on each interlocutor's computer screen.

Evaluation of interactions

Investigate how the language technologies affect users during an interlingual interaction.

Simple evaluation:
• Ease of use
• Effectiveness
• Likeability
• Future use

Page 7

The user survey: statements

Statements 

1 Overall, I am satisfied with how easy it was to use this system.

2 It was simple to use this system.

3 I could effectively complete the tasks using this system.

4 I was able to complete the task quickly using this system.

5 I was able to efficiently complete the task using this system.

6 I felt comfortable using this system.

7 It was easy to learn to use this system.

8 I believe I could become productive quickly using this system.

9 Whenever I made a mistake using the system, I could recover easily and quickly.

10 The interface of the system was pleasant.

11 I liked using the interface of this system.

12 This system has all the functions and capabilities I expected it to have.

13 I was satisfied with the voice of this system.

14 I was satisfied with the output of this system.

15 Overall, I am satisfied with this system.

Agreement or disagreement with each statement was rated on a 7-point Likert scale, from 1 (strongly agree) to 7 (strongly disagree).

The user survey: open comments

1 Please indicate why you changed the style of communication.

2 Please indicate what made you give up clarifying the intention of the other participant.

3 Please indicate all the things that irritated you.

4 Please indicate all the things that pleased you.

5 Please indicate what you felt was most difficult.

Page 8

Results: Overall evaluation results

[Chart: overall users' evaluation, mean score per statement on the 7-point scale]

Overall evaluation: grouped results

[Chart: overall users' evaluation grouped by category: ease of use, effectiveness, pleasantness & likeability, expectation, future use, voice, overall satisfaction]

Page 9

Results: Common open comments

1. Please indicate why you changed the style of communication.

"to make possible for the system to correctly translate I had to use simpler words and phrases"

"I spoke in shorter sentences to avoid bad speech recognition"

"I expected to be able to speak at my standard pace, but I couldn't. I had to speak in a slower pace so the system could understand some words; I had to change the style to fit the system: talk in a slower pace, try to articulate the words really well."

Adapting the speaking style to the system (system errors influence speaking style; see Hayakawa et al., Interspeech 2015).

Results: Common open comments

3. Please indicate all the things that irritated you.

"Some words were not translated as I expected; Some translations bear no resemblance to what the speaker has input"

"I was not irritated at any points. I did laugh a few times though because I found that translating is something reasonably difficult and it was funny"

"to see how pronunciation can impact so much on understanding; some of the misunderstanding of words were a bit annoying, but I thought that was funny"

Users were frustrated, surprised and amused by the mistranslations (see Hayakawa et al., 2015b, on the detection of cognitive states such as frustration).

Page 10

Overall evaluation: results in the two settings

[Chart: users' evaluation in the video-on and no-video conditions]

Different settings, different results

Users in the no-video setting seem to have had a better experience with the system.

Why?

1) Connection problems and crashes: in 4 of the 8 interactions in the video-on setting there were between 1 and 4 system crashes, which might have slowed down the flow of interaction. In the 7 interactions in the no-video setting there were no system crashes at all.

2) Irritation due to the lag time between input and translation.

3) Dissatisfaction stemming from high expectation for interactivity with the video.

Page 11

Quantitative measures for effectiveness in the two settings

There is an increase in the number of utterances produced in the video-on setting.

Discussion of the results

The increase in the number of utterances in the video-on setting can be interpreted in two different ways:

1. As an indication of a more fluent interaction (the video might have enhanced the interactive behaviour).

2. As an indication of a more problematic interaction (due to a higher number of errors of the ASR or MT).

To decide between these two interpretations: comparison of the Word Error Rates (WER) across the two settings.
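WER here is the standard measure: the minimum number of word substitutions, deletions and insertions needed to turn the ASR hypothesis into the reference transcript, divided by the reference length. A minimal word-level implementation (plain Levenshtein dynamic programming, not the authors' own scoring script) looks like this:

    def wer(reference, hypothesis):
        """Word Error Rate: (substitutions + deletions + insertions) / N,
        via word-level Levenshtein distance."""
        ref, hyp = reference.split(), hypothesis.split()
        # d[i][j] = edit distance between ref[:i] and hyp[:j]
        d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
        for i in range(len(ref) + 1):
            d[i][0] = i
        for j in range(len(hyp) + 1):
            d[0][j] = j
        for i in range(1, len(ref) + 1):
            for j in range(1, len(hyp) + 1):
                sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
                d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
        return d[len(ref)][len(hyp)] / len(ref)

    # e.g. wer("go past the old mill", "go past the old hill") == 0.2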

Page 12

WER per setting

          no-video   video-on
Mean      0.3781     0.4327
SD        0.6583     0.4499

The WER is higher in the video‐on setting.

There is also an increase in the WER after system crashes, which supports the hypothesis that the higher number of utterances indicates a more problematic interaction, since system crashes do affect the flow of interaction.


WER increases after the system crashes

                      with crashes   no crashes
Before crash   Mean   0.4489         0.3954
               SD     0.4466         0.4467
After crash    Mean   0.4821         0.3954
               SD     0.4553         0.4467

Conclusion and Discussion

The users experienced the system as:
• easy to learn and simple to use (average score 4.9)
• pleasant and likeable (average score 5.8); they also liked the TTS output.

Quantitative measures show that there was a different behaviour in the interactions in the two settings: surprisingly, users rated the no-video setting consistently better.

The presence of the video channel had a disruptive rather than an enhancing effect on communication (latency due to high processing constraints).

Participants might also have had higher expectations of the system's performance in the video-on setting.

First encounter effect.

Page 13

A study on user adaptation

A study of how speakers adjust their speaking style in relation to errors from Automatic Speech Recognition (ASR), while performing an Interlingual map task.

Our belief is that these technologies, acting as filters between the interlocutors, affect the speakers' performance, resulting in adaptation of their communicative behaviour.

Data: 15 dialogues, annotation of graded interactivity and cooperation in the dialogues.

The results show that the participants do adjust their speaking style and speaking rate as a way of adapting to the errors made by the system:
(a) system errors influence speaking rate,
(b) the perceived level of cooperation by the interlocutors increases as system error increases.
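Speaking rate in analyses like this is typically derived from the time-stamped utterance log. A small sketch follows; the utterance records and the words-per-minute convention are illustrative assumptions, not corpus details:

    # Sketch: words per minute per utterance, bucketed by dialogue quarter.
    utterances = [
        # (start_s, end_s, transcript) -- invented values for illustration
        (2.0, 5.5, "go past the mill on your left"),
        (9.0, 11.0, "okay got it"),
    ]
    task_duration = 12.0  # seconds, illustrative

    def words_per_minute(start, end, text):
        return len(text.split()) / ((end - start) / 60.0)

    def quarter(start):
        """Which quarter of the dialogue (1-4) the utterance starts in."""
        return min(int(4 * start / task_duration) + 1, 4)

    for start, end, text in utterances:
        print(quarter(start), round(words_per_minute(start, end, text), 1))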

User adaptation

2) Hayakawa, A., Cerrato, L., Campbell, N., & Luz, S. (2015). A Study of Prosodic Alignment in Interlingual Map-Task Dialogues. In Proceedings of ICPhS 2015, August 10-14.

The speaking rate becomes slower as the task goes along, but the ASR does not improve, suggesting that the users got used to the system.

Weak correlation between the distribution of errors and the behaviour of the interlocutors in terms of graded interactivity and cooperation in the dialogues (Kendall correlation test): the higher the WER, the less interactive the transcriber/annotator judged the speakers to be. Not statistically significant.
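For reference, such a test can be run with scipy's Kendall tau. The values below are invented purely to show the shape of the computation, not corpus data:

    from scipy.stats import kendalltau

    # Hypothetical per-dialogue WER and annotated interactivity grades.
    wer_scores = [0.31, 0.38, 0.42, 0.45, 0.51]
    interactivity = [4, 4, 3, 3, 2]

    tau, p_value = kendalltau(wer_scores, interactivity)
    # A negative tau would mean: the higher the WER, the lower the
    # annotator's interactivity grade.
    print(f"tau = {tau:.2f}, p = {p_value:.3f}")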

Page 14

A study on the association between ASR performance and cognitive states

An investigation of the possible associations between speech recognition performance and 3 cognitive states that arise in dialogues mediated by a speech-to-speech machine translation system.

Two annotators were asked to annotate: amusement, frustration and surprise in the 15 dialogues (a total of 1767 items).

Analysed data include audio of the speakers and physiological signals (Blood Volume Pulse, Skin Conductance), synchronised and time-stamped.

Results

The results indicate that:
(a) these cognitive states often arise as a consequence of what happens in the S2S mediated interaction, with a statistically significant difference obtained between the WER of an utterance and the different cognitive states after the utterance;
(b) the association between the cognitive state and the biosignals does not seem to persist until the next sentence is uttered;
(c) features of the speech signal can be used to complement biosignals in detecting cognitive states in time windows that include the following utterance.
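The slides do not name the statistical test behind finding (a). One standard way to check for such a difference would be to group utterance WERs by the annotated state that follows and apply, for example, a Kruskal-Wallis test (the numbers below are invented for illustration):

    from scipy.stats import kruskal

    # Hypothetical WER values grouped by the cognitive state annotated
    # after each utterance (invented numbers, not the corpus data).
    wer_amusement   = [0.55, 0.62, 0.48, 0.70]
    wer_frustration = [0.60, 0.74, 0.66, 0.58]
    wer_no_state    = [0.20, 0.31, 0.28, 0.35]

    h, p = kruskal(wer_amusement, wer_frustration, wer_no_state)
    print(f"H = {h:.2f}, p = {p:.4f}")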

Page 15

Results

3) Hayakawa, A., Haider, F., Cerrato, L., Campbell, N., & Luz, S. (2015). Detection of Cognitive States and Their Correlation to Speech Recognition Performance in Speech-to-Speech Machine Translation Systems. In Proceedings of INTERSPEECH 2015, September 6-10.

A study on three different speech types

A comparison between three different speech types:
1. On-Talk: speaking to a computer
2. Off-Talk Self: speaking to oneself
3. Off-Talk Other: speaking to another person (outside the mediated interaction)

The characteristics of the three speech types show significant differences in terms of speech rate.

For this reason a detection method was implemented to see if the three types could also be detected with good accuracy based on their acoustic and physiological characteristics.

Acoustic and physiological measures provide good results in distinguishing between On-Talk and Off-Talk, but have difficulty distinguishing Off-Talk Self from Off-Talk Other.
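As a sketch of what such a detection method can look like: the classifier choice, features and data below are placeholder assumptions; the study's actual setup is not given on the slide.

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score

    # Placeholder per-utterance features (e.g. speech rate, mean F0,
    # intensity, skin conductance level) -- random stand-in data.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(300, 4))
    y = rng.choice(["on_talk", "off_talk_self", "off_talk_other"], size=300)

    clf = RandomForestClassifier(n_estimators=100, random_state=0)
    scores = cross_val_score(clf, X, y, cv=5)
    print(f"mean accuracy: {scores.mean():.2f}")  # ~chance on random data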

Page 16

Comparison between three different speech types

[Figure: talk-type duration difference comparison per speaker and dialogue quarter. 0% is equal to 180 words per minute. IF = Information Follower, IG = Information Giver. On-Talk: speaking to a computer; Off-Talk Self: speaking to oneself; Off-Talk Other: speaking to another person.]

Hayakawa, A., Haider, F., Luz, S., Cerrato, L., & Campbell, N. (2016). Talking to a system and oneself: A study from a Speech-to-Speech, Machine Translation mediated Map Task. In Proceedings of Speech Prosody 2016 (SP8). Boston, Massachusetts, USA: ISCA. (Accepted)

Future work

Further analysis of speech, video, eye movements, facial expressions and physiological signals in the recorded data.

Investigation of possible correlations between biosignals and different communicative events (e.g. reaction to errors, surprise, etc.)

We expect that the knowledge acquired by analysing the data in this interlingual corpus can be used to provide baseline material for component development and testing, and will also enable testing of methods for affect sensing.

The corpus is available for research purposes!

Acknowledgments. This research is supported by Science Foundation Ireland through the CNGL Programme (Grant 12/CE/I2267) in the ADAPT Centre (www.adaptcentre.ie) at Trinity College, Dublin.