On the use of Intonation in ASR: preliminary results

COST 275 Meeting, March 13th-14th, 2003

Halmstad

OUR GROUP

UPV EI Bilbao: Inma Hernáez (leader), Eva Navas, Jon Sanchez

UVA ETSII Valladolid: Valentín Cardeñoso, Isaac Moro, Carlos Vivaracho, David Escudero

Involved in a CICYT project on biometrics with Javier Ortega (Madrid) and Marcos Faundez (Barcelona).

Experience in ASR and in modelling intonation.

OUR AIM

To work in the CICYT Project with the rest of the groups.

To apply our knowledge of intonation to the field of ASR.

Here we present our preliminary results.

INTRODUCTION

Why make use of intonation in ASR?

It is a feature that characterizes the speaker:
Speakers of the same group have a similar prosody.
Each speaker can have his or her own prosody.

It is a very robust feature:
Across different sessions.
Across different microphones.

Other experiences in applying intonation to ASR:
SUPER SID: very simple model of intonation.

INTRODUCTION

Aim of this preliminary work:
To show the potential of intonation when facing different sessions and microphones.
To show that it can be important to make use of “sophisticated” models in order to obtain benefits in ASR.

Overview:
Presentation of the model of intonation.
The corpus.
The speaker verification experiment.
Considerations about the robustness of the results.
Considerations about the use of the intonation model.
Conclusions and future work.

Modelling Intonation



BASIC IDEA FOR ITS APPLICATION TO ASR: TO COMPARE THE MODELS OF DIFFERENT SPEAKERS


The Corpus

Recorded at EUPMT by Marcos Faundez.
One paragraph read by 16 speakers in 2 sessions with 3 microphones.
Each paragraph = 11 sentences, 106 stress groups. 3816 intonation units.
Speakers are male and belong to the same social group.
The pitch was obtained automatically and segmented into intonation units by hand.
Intonation was parameterised according to the intonation model.

The Experiment

Speaker verification.
We have 6 recordings for each speaker: 5 for modelling and 1 for testing.
Each speaker has his own impostor model. The impostor is modelled with the samples of the rest of the speakers.
We repeat the verification experiment six times for each speaker (once for each possible choice of test recording).
The classifier is based on C4.5 decision trees, using the freeware WEKA.
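The protocol above can be sketched as a leave-one-recording-out loop. This is only an illustration of how the six train/test splits for one target speaker might be built, not the setup actually used. The recording labels `M1` to `M6` and the function name `verification_folds` are hypothetical. The sketch also assumes that the impostor class is trained on the other speakers' samples from the same five modelling recordings, and that the held-out recording of every speaker forms the test set; the slides do not state either detail.

```python
from itertools import product

SPEAKERS = [f"L{i}" for i in range(16)]      # 16 speakers
RECORDINGS = [f"M{j}" for j in range(1, 7)]  # 2 sessions x 3 microphones (labels assumed)


def verification_folds(target):
    """Yield (held_out, train, test) for one target speaker.

    Each item in train/test is (speaker, recording, label), where label is
    "target" for the claimed speaker and "impostor" for everyone else.
    Assumption: impostor samples come from the same recordings as the
    target samples in each fold.
    """
    for held_out in RECORDINGS:
        train, test = [], []
        for spk, rec in product(SPEAKERS, RECORDINGS):
            label = "target" if spk == target else "impostor"
            if rec == held_out:
                test.append((spk, rec, label))
            else:
                train.append((spk, rec, label))
        yield held_out, train, test


folds = list(verification_folds("L6"))
for held_out, train, test in folds:
    n_target = sum(1 for _, _, lab in train if lab == "target")
    print(held_out, "train:", len(train), "target recordings:", n_target,
          "test:", len(test))
```

Each fold would then be passed to the classifier (a C4.5 tree in the experiment); that step is left out here because it depends on the intonation parameters, which are not reproduced in these slides.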

Results

Speaker  M1     M2     M3     M4     M5     M6     Mean
L0       65.19  63.52  60.13  68.59  63.75  59.01  63.4
L1       71.60  72.67  65.82  69.28  67.70  66.25  68.9
L2       57.95  41.18  61.14  52.87  37.87  60.67  51.9
L3       51.72  53.61  58.28  40.35  59.64  50.89  52.4
L4       58.52  58.18  57.32  56.90  53.53  54.76  56.5
L5       73.24  74.81  78.03  72.54  71.13  71.43  73.5
L6       80.12  80.86  85.44  78.36  85.37  81.87  82.0
L7       65.50  56.25  59.88  63.16  61.85  58.72  60.9
L8       48.75  67.81  59.33  56.60  68.28  60.26  60.2
L9       64.77  67.74  68.03  62.57  60.13  65.56  64.8
L10      65.03  72.19  60.00  63.41  73.05  69.19  67.1
L11      62.50  72.81  64.12  69.84  64.75  66.42  66.7
L12      60.87  54.07  56.34  64.85  58.72  58.55  58.9
L13      54.97  68.64  59.52  66.28  61.05  63.64  62.4
L14      63.69  62.96  72.73  60.12  61.96  62.94  64.1
L15      67.10  66.27  66.03  62.66  52.15  67.08  63.5

Low rates, except for some of the speakers

Results: robustness

Same table as in the previous slide (speakers L0 to L15, columns M1 to M6 and mean).

No significant changes when the test input is different.
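One way to look at this claim is to average each column of the results table over the 16 speakers and compare the six averages. The sketch below does that with the figures copied from the table. It assumes that columns M1 to M6 correspond to the six possible test recordings, which the slides suggest but do not state explicitly. The names `RATES` and `column_means` are illustrative.

```python
# Verification rates copied from the results table
# (rows: speakers L0..L15, columns: M1..M6).
RATES = {
    "L0":  [65.19, 63.52, 60.13, 68.59, 63.75, 59.01],
    "L1":  [71.60, 72.67, 65.82, 69.28, 67.70, 66.25],
    "L2":  [57.95, 41.18, 61.14, 52.87, 37.87, 60.67],
    "L3":  [51.72, 53.61, 58.28, 40.35, 59.64, 50.89],
    "L4":  [58.52, 58.18, 57.32, 56.90, 53.53, 54.76],
    "L5":  [73.24, 74.81, 78.03, 72.54, 71.13, 71.43],
    "L6":  [80.12, 80.86, 85.44, 78.36, 85.37, 81.87],
    "L7":  [65.50, 56.25, 59.88, 63.16, 61.85, 58.72],
    "L8":  [48.75, 67.81, 59.33, 56.60, 68.28, 60.26],
    "L9":  [64.77, 67.74, 68.03, 62.57, 60.13, 65.56],
    "L10": [65.03, 72.19, 60.00, 63.41, 73.05, 69.19],
    "L11": [62.50, 72.81, 64.12, 69.84, 64.75, 66.42],
    "L12": [60.87, 54.07, 56.34, 64.85, 58.72, 58.55],
    "L13": [54.97, 68.64, 59.52, 66.28, 61.05, 63.64],
    "L14": [63.69, 62.96, 72.73, 60.12, 61.96, 62.94],
    "L15": [67.10, 66.27, 66.03, 62.66, 52.15, 67.08],
}


def column_means(rates):
    """Average each test column (M1..M6) over all speakers."""
    n_speakers = len(rates)
    columns = zip(*rates.values())
    return [round(sum(col) / n_speakers, 2) for col in columns]


means = column_means(RATES)
spread = round(max(means) - min(means), 2)
for name, value in zip(["M1", "M2", "M3", "M4", "M5", "M6"], means):
    print(name, value)
print("spread between best and worst column:", spread)
```

The column averages differ from one another far less than the per-speaker averages do. Individual speakers can still vary noticeably between columns, so a per-speaker spread would be a natural complement to this check.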

Results: relevance of prosodic knowledge

          L1     L5     L6     L10    L11
Total     68.89  73.53  82.00  67.15  66.74
Initial   80.03  55.08  77.65  58.11  52.44
Central   65.02  72.95  80.86  65.22  69.01
Final     69.25  67.71  89.49  75.70  53.87

Some parts of the utterance are more relevant than others, depending on the speaker.
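The observation can be read directly off the table by picking, for each speaker, the part of the utterance with the highest rate. The sketch below uses the figures from the table; the names `SEGMENT_RATES` and `best_segment` are illustrative, and the reading of the rows as the initial, central and final parts of the utterance is an assumption based on the row labels.

```python
# Rates per part of the utterance, copied from the table above.
SEGMENT_RATES = {
    "L1":  {"Initial": 80.03, "Central": 65.02, "Final": 69.25},
    "L5":  {"Initial": 55.08, "Central": 72.95, "Final": 67.71},
    "L6":  {"Initial": 77.65, "Central": 80.86, "Final": 89.49},
    "L10": {"Initial": 58.11, "Central": 65.22, "Final": 75.70},
    "L11": {"Initial": 52.44, "Central": 69.01, "Final": 53.87},
}


def best_segment(rates):
    """Return the part of the utterance with the highest rate."""
    return max(rates, key=rates.get)


best = {speaker: best_segment(r) for speaker, r in SEGMENT_RATES.items()}
for speaker, segment in best.items():
    print(speaker, segment, SEGMENT_RATES[speaker][segment])
```

For these five speakers the best part is not always the same one, which is what the slide points out.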

Conclusions and future work

Promising results: some speakers are recognised with high rates.
Results are robust to changes in the session and in the microphone.

Future work:
To test the benefits of including these results in an ASR system.
To explore the use of our methodology for modelling intonation in a more general way:
Making use of more classes of intonation.
Finding out which classes of intonation are more relevant for characterizing the speaker.
New corpora are welcome.

Stop the war

Thank you
